unknown host when looking up pod by name, resolved with pod restart

9/7/2018

I have an installer that spins up two pods in my CI flow; let's call them web and activemq. When the web pod starts, it tries to communicate with the activemq pod using the Kubernetes-assigned pod name amq-deployment-0.activemq.

Randomly, the web pod will get an unknown host exception when trying to access amq-deployment-0.activemq. If I restart the web pod in this situation, it will have no problem communicating with the activemq pod.

I've logged into the web pod when this happens, and the /etc/resolv.conf and /etc/hosts files look fine. The host machine's /etc/resolv.conf and /etc/hosts are sparse, with nothing that looks questionable.
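A few checks that can narrow this down further (a sketch; getent is assumed to be present in the image, since nslookup often is not, and the cluster DNS is assumed to run as kube-dns, the default on 1.8):

# resolve the peer pod's name from inside the web pod
kubectl exec pa-web-deployment-0 -- getent hosts amq-deployment-0.activemq

# check that the cluster DNS pods are healthy, then inspect their logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system <kube-dns-pod> -c kubedns   # <kube-dns-pod> is a placeholder; use a name from the previous command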

Information: There is only 1 worker node.

kubectl --version Kubernetes v1.8.3+icp+ee

Any ideas on how to go about debugging this issue? I can't think of a good reason for it to happen randomly, nor for it to resolve itself on a pod restart.

If there is other useful information needed, I can get it. Thanks in advance.

For ActiveMQ we have this service file:

apiVersion: v1
kind: Service
metadata:
  name: activemq
  labels:
    app: myapp
    env: dev
spec:
  ports:
    - port: 8161
      protocol: TCP
      targetPort: 8161
      name: http
    - port: 61616
      protocol: TCP
      targetPort: 61616
      name: amq
  selector:
    component: analytics-amq
    app: myapp
    environment: dev
    type: fa-core
  clusterIP: None
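Because clusterIP: None makes this a headless Service, cluster DNS publishes one record per ready pod behind it, and per-pod names like amq-deployment-0.activemq only resolve while that pod's address is listed in the endpoints. One way to check (a sketch):

kubectl get endpoints activemq
# each ready pod backing the headless service should be listed here;
# if the pod's IP is missing, the per-pod DNS record will not resolve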

And this ActiveMQ StatefulSet (this is the template):

kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
  name: pa-amq-deployment
spec:
  replicas: {{ activemqs }}
  updateStrategy:
    type: RollingUpdate
  serviceName: "activemq"
  template:
    metadata:
      labels:
        component: analytics-amq
        app: myapp
        environment: dev
        type: fa-core
    spec:
      containers:
        - name: pa-amq
          image: default/myco/activemq:latest
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 150m
              memory: 1Gi
          livenessProbe:
            exec:
              command:
                - /etc/init.d/activemq
                - status
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 16
          ports:
            - containerPort: 8161
              protocol: TCP
              name: http
            - containerPort: 61616
              protocol: TCP
              name: amq
          envFrom:
            - configMapRef:
                name: pa-activemq-conf-all
            - secretRef:
                name: pa-activemq-secret
          volumeMounts:
            - name: timezone
              mountPath: /etc/localtime
      volumes:
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/UTC
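With serviceName: "activemq", the first replica of this StatefulSet gets the stable identity pa-amq-deployment-0 and a matching DNS record under the headless service. One way to confirm the fully qualified name from inside the pod (a sketch; the default namespace is assumed):

kubectl exec pa-amq-deployment-0 -- hostname -f
# expected output, assuming the default namespace:
# pa-amq-deployment-0.activemq.default.svc.cluster.local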

The web StatefulSet:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: pa-web-deployment
spec:
  replicas: 1
  updateStrategy:
    type: RollingUpdate
  serviceName: "pa-web"
  template:
    metadata:
      labels:
        component: analytics-web
        app: myapp
        environment: dev
        type: fa-core
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: component
                      operator: In
                      values:
                        - analytics-web
                topologyKey: kubernetes.io/hostname
      containers:
        - name: pa-web
          image: default/myco/web:latest
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 1
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /versions
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 76
          livenessProbe:
            httpGet:
              path: /versions
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 80
          securityContext:
            privileged: true
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          envFrom:
            - configMapRef:
                name: pa-web-conf-all
            - secretRef:
                name: pa-web-secret
          volumeMounts:
            - name: shared-volume
              mountPath: /MySharedPath
            - name: timezone
              mountPath: /etc/localtime
      volumes:
        - name: shared-volume
          nfs:
            server: 10.100.10.23
            path: /MySharedPath
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/UTC

This web pod also has a similar "unknown host" problem finding an external database we have configured; the issue is likewise resolved by restarting the pod. Here is the configuration of that external service. Maybe it is easier to tackle the problem from this angle? ActiveMQ has no problem using the database service name to find the DB and start up.

apiVersion: v1
kind: Service
metadata:
  name: dbhost
  labels:
    app: myapp
    env: dev
spec:
  type: ExternalName
  externalName: mydb.host.com
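Since an ExternalName Service is just a CNAME record in cluster DNS, the lookup can be checked from the affected pod directly (a sketch, again assuming getent is available in the image):

kubectl exec pa-web-deployment-0 -- getent hosts dbhost
# a healthy lookup follows the CNAME to mydb.host.com and prints its address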
-- Chris
kubernetes
unknown-host

2 Answers

10/3/2018

Unable to find a solution, I created a workaround. I set up the entrypoint.sh in my image to look up the domain I need to access and write the result to the log, exiting on error:

#!/bin/bash

# disable command echo and exit-on-error
set +ex

#####################################
# verify that the db service can be found, or exit the container
#####################################
# we do not want to install nslookup just to determine if db_host_name is a valid name
# we have ping available though
# 0-success, 1-error pinging but lookup worked (services cannot be pinged), 2-unreachable host
ping -W 2 -c 1 "${db_host_name}" &> /dev/null
if [ $? -le 1 ]
then
  echo "service ${db_host_name} is known"
else
  echo "${db_host_name} service is NOT recognized. Exiting container..."
  exit 1
fi
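If getent happens to be available in the image, the same check can be done with an actual name-service lookup instead of the ping heuristic (a sketch under that assumption; getent resolves through the same NSS path the application uses):

# alternative check: getent exits non-zero when the name does not resolve
if getent hosts "${db_host_name}" > /dev/null
then
  echo "service ${db_host_name} is known"
else
  echo "${db_host_name} service is NOT recognized. Exiting container..."
  exit 1
fi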

Next, since only a pod restart fixed the issue, my Ansible deploy does a rollout check, querying the log to see if I need to do a pod restart. For example:

rollout-check.yml

- name: "Rollout status for {{rollout_item.statefulset}}"
  shell: timeout 4m kubectl rollout status -n {{fa_namespace}} -f {{ rollout_item.statefulset }}
  ignore_errors: yes

# assuming that the first pod will be the one that would have an issue
- name: "Get {{rollout_item.pod_name}} log to check for issue with dns lookup"
  shell: kubectl logs {{rollout_item.pod_name}} --tail=1 -n {{fa_namespace}}
  register: log_line

# the entrypoint writes "<db_host_name> service is NOT recognized. Exiting container..."
# to the log if there is a problem reaching the dbhost
- name: "Try removing {{rollout_item.component}} pod if unable to deploy"
  shell: kubectl delete pods -l component={{rollout_item.component}} --force --grace-period=0 --ignore-not-found=true -n {{fa_namespace}}
  when: log_line.stdout.find('service is NOT recognized') != -1

I repeat this rollout check 6 times, as sometimes even after a pod restart the service cannot be found. The additional checks are instant once the pod is successfully up.

- name: "Web rollout"
  include_tasks: rollout-check.yml
  loop:
  - { c: 1, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 2, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 3, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 4, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 5, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 6, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  loop_control:
    loop_var: rollout_item
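The six identical entries could also be generated with a numeric loop, moving the item data into vars (a sketch, assuming a reasonably recent Ansible; the attempt variable name is illustrative):

- name: "Web rollout"
  include_tasks: rollout-check.yml
  loop: "{{ range(1, 7) | list }}"
  loop_control:
    loop_var: attempt
  vars:
    rollout_item:
      statefulset: "{{ dest_deploy }}/web.statefulset.yml"
      pod_name: "pa-web-deployment-0"
      component: "analytics-web"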
-- Chris
Source: StackOverflow

9/8/2018

Is it possible that it is a question of which pod (and the app in its container) is started first and which second?

In any case, connecting using a Service rather than the pod name would be recommended, as the pod name assigned by Kubernetes changes between pod restarts.

One way to test connectivity is to use telnet (or curl, for the protocols it supports), if found in the image:

telnet <host/pod/Service> <port>
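For example, against the headless service and ports defined in the question (assuming the tools are present in the web pod's image):

telnet activemq 61616
curl -v http://activemq:8161/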
-- apisim
Source: StackOverflow