Some Kubernetes pods consistently unable to resolve internal DNS on only one node

10/15/2019

I have just moved my first cluster from Minikube up to AWS EKS. Everything has gone pretty smoothly so far, except that I've run into what I think is a DNS issue, but only on one of the cluster nodes.

I have two nodes in the cluster running v1.14, with 4 pods of one type and 4 of another. 3 of each work, but 1 of each, both on the same node, start and then error (CrashLoopBackOff) because the script inside the container can't resolve the hostname of the database. Deleting the errored pod, or even all of the pods, results in one pod on the same node failing every time.

The database is in its own pod and has a Service assigned; none of the other pods of the same type have problems resolving the name or connecting. The database pod is on the same node as the pods that can't resolve the hostname. I'm not sure how to migrate a pod to a different node, but it might be worth trying to see if the problem follows it (see the sketch below). There are no errors in the CoreDNS pods. I'm not sure where to start looking to track down the issue from here, and any help or suggestions would be appreciated.
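
I believe cordoning the suspect node and deleting the failing pod would force its ReplicaSet to recreate it elsewhere; a minimal sketch, using the node and pod names from the output and descriptions below:

# Stop new pods from being scheduled onto the suspect node
kubectl cordon ip-192-168-87-230.us-east-2.compute.internal

# Delete the crashing pod; its Deployment/ReplicaSet recreates it on the other node
kubectl delete pod pod1-85f7968f7-k9xv2

# Allow scheduling on the node again once done testing
kubectl uncordon ip-192-168-87-230.us-east-2.compute.internal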

The configs are provided below. As mentioned, they all work on Minikube, and they also work on one of the two nodes.

kubectl get pods - note the age: all pod1 replicas were deleted at the same time and were recreated; 3 work fine, the 4th does not.

NAME                          READY   STATUS             RESTARTS   AGE
pod1-85f7968f7-2cjwt          1/1     Running            0          34h
pod1-85f7968f7-cbqn6          1/1     Running            0          34h
pod1-85f7968f7-k9xv2          0/1     CrashLoopBackOff   399        34h
pod1-85f7968f7-qwcrz          1/1     Running            0          34h
postgresql-865db94687-cpptb   1/1     Running            0          3d14h
rabbitmq-667cfc4cc-t92pl      1/1     Running            0          34h
pod2-94b9bc6b6-6bzf7          1/1     Running            0          34h
pod2-94b9bc6b6-6nvkr          1/1     Running            0          34h
pod2-94b9bc6b6-jcjtb          0/1     CrashLoopBackOff   140        11h
pod2-94b9bc6b6-t4gfq          1/1     Running            0          34h

postgresql service

apiVersion: v1
kind: Service
metadata: 
    name: postgresql
spec:
    ports:
        - port: 5432
    selector:
        app: postgresql
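
As a sanity check, something like the following should confirm that the Service's selector actually matches the database pod and that it has an endpoint behind it:

# The Service should list one endpoint: the postgresql pod's IP on port 5432
kubectl get endpoints postgresql

# Cluster IP and selector for the Service
kubectl describe service postgresql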

pod1 deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
    name: pod1
spec:
    replicas: 4
    selector:
        matchLabels:
            app: pod1
    template:
        metadata:
            labels:
                app: pod1
        spec:
            containers:
                - name: pod1
                  image: us.gcr.io/gcp-project-8888888/pod1:latest
                  env:
                      - name: rabbitmquser
                        valueFrom:
                            secretKeyRef:
                                name: rabbitmq-secrets
                                key: rmquser
                  volumeMounts:
                      - mountPath: /data/files
                        name: datafiles
            volumes:
                - name: datafiles
                  persistentVolumeClaim:
                      claimName: datafiles-pv-claim
            imagePullSecrets:
                - name: container-readonly

pod2 deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
    name: pod2
spec:
    replicas: 4
    selector:
        matchLabels:
            app: pod2
    template:
        metadata:
            labels:
                app: pod2
        spec:
            containers:
                - name: pod2
                  image: us.gcr.io/gcp-project-8888888/pod2:latest
                  env:
                      - name: rabbitmquser
                        valueFrom:
                            secretKeyRef:
                                name: rabbitmq-secrets
                                key: rmquser
                  volumeMounts:
                      - mountPath: /data/files
                        name: datafiles
            volumes:
                - name: datafiles
                  persistentVolumeClaim:
                      claimName: datafiles-pv-claim
            imagePullSecrets:
                - name: container-readonly

CoreDNS ConfigMap to forward DNS to an external service if a name doesn't resolve internally. This is the only place I can think of that could be causing the issue, but as mentioned it works on one node.

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . 8.8.8.8
        cache 30
        loop
        reload
        loadbalance
    }
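
For what it's worth, this is roughly how I checked the CoreDNS pods for errors; the k8s-app=kube-dns label is an assumption based on what EKS applies to its CoreDNS deployment:

# Show the CoreDNS pods and which node each one runs on
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Tail their logs for resolution errors
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100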

Errored pod output. It is the same for both pods, as the error occurs in library code common to both. As mentioned, it does not occur for all pods, so the issue likely doesn't lie with the code.

Error connecting to database (psycopg2.OperationalError) could not translate host name "postgresql" to address: Try again
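
The lookup failure can be reproduced outside of the application code with a throwaway pod pinned to the problem node. A sketch only: the node name is taken from the pod descriptions below, and busybox:1.28 is an assumption (a tag whose nslookup behaves sensibly):

# Run a one-off pod on the suspect node and try to resolve the Service name
# (the full name postgresql.default.svc.cluster.local can also be tried)
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"ip-192-168-87-230.us-east-2.compute.internal"}}' \
  -- nslookup postgresql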

Errored Pod1 description:

Name:           xyz-94b9bc6b6-jcjtb
Namespace:      default
Priority:       0
Node:           ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time:     Tue, 15 Oct 2019 19:43:11 +1030
Labels:         app=pod1
                pod-template-hash=94b9bc6b6
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             192.168.70.63
Controlled By:  ReplicaSet/xyz-94b9bc6b6
Containers:
  pod1:
    Container ID:   docker://f7dc735111bd94b7c7b698e69ad302ca19ece6c72b654057627626620b67d6de
    Image:          us.gcr.io/xyz/xyz:latest
    Image ID:       docker-pullable://us.gcr.io/xyz/xyz@sha256:20110cf126b35773ef3a8656512c023b1e8fe5c81dd88f19a64c5bfbde89f07e
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 16 Oct 2019 07:21:40 +1030
      Finished:     Wed, 16 Oct 2019 07:21:46 +1030
    Ready:          False
    Restart Count:  139
    Environment:
      xyz:    <set to the key 'xyz' in secret 'xyz-secrets'>           Optional: false
    Mounts:
      /data/xyz from xyz (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  xyz:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  xyz-pv-claim
    ReadOnly:   false
  default-token-m72kz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m72kz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                     From                                                   Message
  ----     ------   ----                    ----                                                   -------
  Warning  BackOff  2m22s (x3143 over 11h)  kubelet, ip-192-168-87-230.us-east-2.compute.internal  Back-off restarting failed container

Errored Pod2 description:

Name:           xyz-85f7968f7-k9xv2
Namespace:      default
Priority:       0
Node:           ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time:     Mon, 14 Oct 2019 21:19:42 +1030
Labels:         app=pod2
                pod-template-hash=85f7968f7
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             192.168.84.69
Controlled By:  ReplicaSet/pod2-85f7968f7
Containers:
  pod2:
    Container ID:   docker://f7c7379f92f57ea7d381ae189b964527e02218dc64337177d6d7cd6b70990143
    Image:          us.gcr.io/xyz-217300/xyz:latest
    Image ID:       docker-pullable://us.gcr.io/xyz-217300/xyz@sha256:b9cecdbc90c5c5f7ff6170ee1eccac83163ac670d9df5febd573c2d84a4d628d
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 16 Oct 2019 07:23:35 +1030
      Finished:     Wed, 16 Oct 2019 07:23:41 +1030
    Ready:          False
    Restart Count:  398
    Environment:
      xyz:    <set to the key 'xyz' in secret 'xyz-secrets'>     Optional: false
    Mounts:
      /data/xyz from xyz (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  xyz:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  xyz-pv-claim
    ReadOnly:   false
  default-token-m72kz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m72kz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                     From                                                   Message
  ----     ------   ----                    ----                                                   -------
  Warning  BackOff  3m28s (x9208 over 34h)  kubelet, ip-192-168-87-230.us-east-2.compute.internal  Back-off restarting failed container
-- vortex
dns
kubernetes

1 Answer

10/17/2019

At the suggestion of a k8s community member, I applied the following change to my CoreDNS configuration to bring it more in line with best practice:

The line proxy . 8.8.8.8 was changed to forward . /etc/resolv.conf 8.8.8.8 (the resulting Corefile block is shown below).
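
For reference, the Corefile block from the question with that change applied:

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       upstream
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf 8.8.8.8
    cache 30
    loop
    reload
    loadbalance
}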

I then deleted the pods, and after they were recreated by k8s, the issue did not appear again.

EDIT:

It turned out that was not the issue at all, as shortly afterwards the problem re-occurred and persisted. In the end, it was this: https://github.com/aws/amazon-vpc-cni-k8s/issues/641. I rolled the VPC CNI plugin back to 1.5.3 as recommended by Amazon, restarted the cluster, and the issue was resolved. A sketch of how to check and roll back the CNI version follows.
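
Roughly how the CNI version can be checked and pinned on the aws-node DaemonSet; the image registry path below is an assumption (substitute the one from the amazon-vpc-cni-k8s v1.5.3 config for your region):

# See which amazon-k8s-cni image version the nodes are running
kubectl -n kube-system describe daemonset aws-node | grep Image

# Pin the DaemonSet back to v1.5.3 (image path assumed; use your region's)
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni:v1.5.3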

-- vortex
Source: StackOverflow