I have just moved my first cluster from minikube up to AWS EKS. All went pretty smoothly so far, except I'm running into some DNS issues I think, but only on one of the cluster nodes.
I have two nodes in the cluster running v1.14, and 4 pods of one type, and 4 of another, 3 of each work, but 1 of each - both on the same node - start then error (CrashLoopBackOff) with the script inside the container erroring because it can't resolve the hostname for the database. Deleting the errored pod, or even all pods, results in one pod on the same node failing every time.
The database is in its own pod and has a service assigned, none of the other pods of the same type have problems resolving the name or connecting. The database pod is on the same node as the pods that can't resolve the hostname. I'm not sure how to migrate the pod to a different node, but that might be worth trying to see if the problem follows. No errors in the coredns pods. I'm not sure where to start looking to discover the issue from here, and any help or suggestions would be appreciated.
Providing the configs below. As mentioned, they all work on Minikube, and also they work on one node.
kubectl get pods - note age, all pod1's were deleted at the same time and they recreated themselves, 3 worked fine, 4th does not.
NAME READY STATUS RESTARTS AGE
pod1-85f7968f7-2cjwt 1/1 Running 0 34h
pod1-85f7968f7-cbqn6 1/1 Running 0 34h
pod1-85f7968f7-k9xv2 0/1 CrashLoopBackOff 399 34h
pod1-85f7968f7-qwcrz 1/1 Running 0 34h
postgresql-865db94687-cpptb 1/1 Running 0 3d14h
rabbitmq-667cfc4cc-t92pl 1/1 Running 0 34h
pod2-94b9bc6b6-6bzf7 1/1 Running 0 34h
pod2-94b9bc6b6-6nvkr 1/1 Running 0 34h
pod2-94b9bc6b6-jcjtb 0/1 CrashLoopBackOff 140 11h
pod2-94b9bc6b6-t4gfq 1/1 Running 0 34h
postgresql service
apiVersion: v1
kind: Service
metadata:
name: postgresql
spec:
ports:
- port: 5432
selector:
app: postgresql
pod1 deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: pod1
spec:
replicas: 4
selector:
matchLabels:
app: pod1
template:
metadata:
labels:
app: pod1
spec:
containers:
- name: pod1
image: us.gcr.io/gcp-project-8888888/pod1:latest
env:
- name: rabbitmquser
valueFrom:
secretKeyRef:
name: rabbitmq-secrets
key: rmquser
volumeMounts:
- mountPath: /data/files
name: datafiles
volumes:
- name: datafiles
persistentVolumeClaim:
claimName: datafiles-pv-claim
imagePullSecrets:
- name: container-readonly
pod2 depoloyment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: pod2
spec:
replicas: 4
selector:
matchLabels:
app: pod2
template:
metadata:
labels:
app: pod2
spec:
containers:
- name: pod2
image: us.gcr.io/gcp-project-8888888/pod2:latest
env:
- name: rabbitmquser
valueFrom:
secretKeyRef:
name: rabbitmq-secrets
key: rmquser
volumeMounts:
- mountPath: /data/files
name: datafiles
volumes:
- name: datafiles
persistentVolumeClaim:
claimName: datafiles-pv-claim
imagePullSecrets:
- name: container-readonly
CoreDNS config map to forward DNS to external service if it doesn't resolve internally. This is the only place I can think that would be causing the issue - but as said it works for one node.
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
data:
Corefile: |
.:53 {
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
upstream
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
proxy . 8.8.8.8
cache 30
loop
reload
loadbalance
}
Errored Pod output. Same for both pods, as it occurs in library code common to both. As mentioned, this does not occur for all pods so the issue likely doesn't lie with the code.
Error connecting to database (psycopg2.OperationalError) could not translate host name "postgresql" to address: Try again
Errored Pod1 description:
Name: xyz-94b9bc6b6-jcjtb
Namespace: default
Priority: 0
Node: ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time: Tue, 15 Oct 2019 19:43:11 +1030
Labels: app=pod1
pod-template-hash=94b9bc6b6
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.70.63
Controlled By: ReplicaSet/xyz-94b9bc6b6
Containers:
pod1:
Container ID: docker://f7dc735111bd94b7c7b698e69ad302ca19ece6c72b654057627626620b67d6de
Image: us.gcr.io/xyz/xyz:latest
Image ID: docker-pullable://us.gcr.io/xyz/xyz@sha256:20110cf126b35773ef3a8656512c023b1e8fe5c81dd88f19a64c5bfbde89f07e
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 16 Oct 2019 07:21:40 +1030
Finished: Wed, 16 Oct 2019 07:21:46 +1030
Ready: False
Restart Count: 139
Environment:
xyz: <set to the key 'xyz' in secret 'xyz-secrets'> Optional: false
Mounts:
/data/xyz from xyz (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
xyz:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: xyz-pv-claim
ReadOnly: false
default-token-m72kz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m72kz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2m22s (x3143 over 11h) kubelet, ip-192-168-87-230.us-east-2.compute.internal Back-off restarting failed container
Errored Pod 2 description:
Name: xyz-85f7968f7-k9xv2
Namespace: default
Priority: 0
Node: ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time: Mon, 14 Oct 2019 21:19:42 +1030
Labels: app=pod2
pod-template-hash=85f7968f7
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.84.69
Controlled By: ReplicaSet/pod2-85f7968f7
Containers:
pod2:
Container ID: docker://f7c7379f92f57ea7d381ae189b964527e02218dc64337177d6d7cd6b70990143
Image: us.gcr.io/xyz-217300/xyz:latest
Image ID: docker-pullable://us.gcr.io/xyz-217300/xyz@sha256:b9cecdbc90c5c5f7ff6170ee1eccac83163ac670d9df5febd573c2d84a4d628d
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 16 Oct 2019 07:23:35 +1030
Finished: Wed, 16 Oct 2019 07:23:41 +1030
Ready: False
Restart Count: 398
Environment:
xyz: <set to the key 'xyz' in secret 'xyz-secrets'> Optional: false
Mounts:
/data/xyz from xyz (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
xyz:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: xyz-pv-claim
ReadOnly: false
default-token-m72kz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m72kz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m28s (x9208 over 34h) kubelet, ip-192-168-87-230.us-east-2.compute.internal Back-off restarting failed container
At the suggestion of a k8s community member, I applied the following change to my coredns configuration to be more in line with the best practice:
Line: proxy . 8.8.8.8
changed to forward . /etc/resolv.conf 8.8.8.8
I then deleted the pods, and after they were recreated by k8s, the issue did not appear again.
EDIT:
Turns out, that was not the issue at all as shortly afterwards the issue re-occurred and persisted. In the end, it was this: https://github.com/aws/amazon-vpc-cni-k8s/issues/641 Rolled back to 1.5.3 as recommended by Amazon, restarted the cluster, and the issue was resolved.