Connectivity issues between EKS cluster's API and nodes

11/27/2019

My EKS cluster becomes unhealthy with errors of "ContainerCreating" from all pods which might be related to issues with CNI.

Once I launched new node workers they are not getting the "Ready" state and prompt errors of:

"couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp
172.20.0.1:443: i/o timeout) Failed to communicate with K8S Server. Please check instance security groups or http proxy setting"

I'm not using http proxy and the security groups allowed from private CIDR (Telnet to the API server from port 443 is working).

My CNI version is 1.5.5, according to some threads about this issue I've tried to downgrade the CNI to 1.5.3 - Nodes still didn't connect, and to 1.5.1 - Nodes were connected as the /etc/cni/net.d/10-aws.conflist file exists but pods didn't manage to connect to them.

In version 1.5.5 the location of the conflist file changed to /etc/cni/10-aws.conflist but still nodes are in "NotReady" state.

My EKS version is 1.14 and the platform version is eks.2.

Ipamd log:

2019-11-27T09:09:13.446Z [INFO] Starting L-IPAMD v1.5.5  ...
2019-11-27T09:09:43.447Z [INFO] Testing communication with server
2019-11-27T09:10:13.448Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-11-27T09:10:13.448Z [ERROR]        Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout

The errors from the containers are:

Warning  FailedCreatePodSandBox  17m                   kubelet, ip-10-1-1-144.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "b02f175d5e68011332655e0d6e6aa3ae226bbd7bf447c7461c0140a7e026d831" network for pod "coredns-759d6fc95f-zx292": NetworkPlugin cni failed to set up pod "coredns-759d6fc95f-zx292_kube-system" network: failed to find plugin "aws-cni" in path [/opt/cni/bin], failed to clean up sandbox container "b02f175d5e68011332655e0d6e6aa3ae226bbd7bf447c7461c0140a7e026d831" network for pod "coredns-759d6fc95f-zx292": NetworkPlugin cni failed to teardown pod "coredns-759d6fc95f-zx292_kube-system" network: failed to find plugin "aws-cni" in path [/opt/cni/bin]]
  Normal   SandboxChanged          2m47s (x70 over 17m)  kubelet, ip-10-1-1-144.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.

CNI Image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.5.5

/opt/cni/bin/aws-cni-support.sh script output: /opt/cni/bin/aws-cni-support.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61679: Connection refused
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61679: Connection refused
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61679: Connection refused
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61679: Connection refused
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61679: Connection refused
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61678: Connection refused
tar: Removing leading `/' from member names
/var/log/aws-routed-eni/
/var/log/aws-routed-eni/ipamd.log.2019-11-27-09
/var/log/aws-routed-eni/ipamd.log.2019-11-27-10
/var/log/aws-routed-eni/eni.out
/var/log/aws-routed-eni/pod.out
/var/log/aws-routed-eni/networkutils-env.out
/var/log/aws-routed-eni/ipamd-env.out
/var/log/aws-routed-eni/eni-configs.out
/var/log/aws-routed-eni/metrics.out
/var/log/aws-routed-eni/ifconfig.out
/var/log/aws-routed-eni/iprule.out
/var/log/aws-routed-eni/iptables-save.out
/var/log/aws-routed-eni/iptables.out
/var/log/aws-routed-eni/iptables-nat.out
/var/log/aws-routed-eni/iptables-mangle.out
/var/log/aws-routed-eni/cni/
/var/log/aws-routed-eni/cni/10-aws.conflist
/var/log/aws-routed-eni/messages
/var/log/aws-routed-eni/route.out
/var/log/aws-routed-eni/sysctls.out

Also, a lot of the following errors appear in /var/log/aws-routed-eni/messages: network: failed to find plugin \"aws-cni\" in path [/opt/cni/bin]"

There is no /opt/cni/bin/aws-cni file.

Does anyone have any clue on what the issue could be?

-- eazary
amazon-eks
amazon-web-services
aws-eks
eks
kubernetes

0 Answers