pod-to-pod communication issues on k8s cluster created with kubeadm

5/7/2019

I created a 3-node k8s cluster with kubeadm (1 master + 2 workers) on GCP, and everything seems to be fine, except pod-to-pod communication.

So, first things first: there are no visible issues in the cluster. All pods are running. No errors, no CrashLoopBackOffs, no pending pods.

I set up the following scenario for the tests:

NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE   IP            NODE           
default       bb-9bd94cf6f-b5cj5                    1/1     Running   1          19h   192.168.2.3   worker-node-1  
default       curler-7668c66bf5-6c6v8               1/1     Running   1          20h   192.168.2.2   worker-node-1  
default       curler-master-5b86858f9f-c6zhq        1/1     Running   0          18h   192.168.0.6   master-node    
default       nginx-5c7588df-x42vt                  1/1     Running   0          19h   192.168.2.4   worker-node-1  
default       nginy-6d77947646-4r4rl                1/1     Running   0          20h   192.168.1.4   worker-node-2  
kube-system   calico-node-9v98k                     2/2     Running   0          97m   10.240.0.7    master-node    
kube-system   calico-node-h2px8                     2/2     Running   0          97m   10.240.0.9    worker-node-2  
kube-system   calico-node-qjn5t                     2/2     Running   0          97m   10.240.0.8    worker-node-1  
kube-system   coredns-86c58d9df4-gckhl              1/1     Running   0          97m   192.168.1.9   worker-node-2  
kube-system   coredns-86c58d9df4-wvt2n              1/1     Running   0          97m   192.168.2.6   worker-node-1  
kube-system   etcd-master-node                      1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-apiserver-master-node            1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-controller-manager-master-node   1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-proxy-2g85h                      1/1     Running   0          97m   10.240.0.8    worker-node-1  
kube-system   kube-proxy-77pq4                      1/1     Running   0          97m   10.240.0.9    worker-node-2  
kube-system   kube-proxy-bbd2d                      1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-scheduler-master-node            1/1     Running   0          97m   10.240.0.7    master-node  

And these are the services:

$ kubectl get svc --all-namespaces
NAMESPACE     NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
default       kubernetes     ClusterIP   10.96.0.1        <none>        443/TCP         21h
default       nginx          ClusterIP   10.109.136.120   <none>        80/TCP          20h
default       nginy          NodePort    10.101.111.222   <none>        80:30066/TCP    20h
kube-system   calico-typha   ClusterIP   10.111.238.0     <none>        5473/TCP        21h
kube-system   kube-dns       ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP   21h

The nginx and nginy services point to the nginx-xxx and nginy-xxx pods, which are running nginx. The curler pods have curl and ping installed; one of them runs on the master node, the other on worker-node-1. If I exec into the curler pod running on worker-node-1 (curler-7668c66bf5-6c6v8) and curl the nginx pod on the same node, it works fine.

$ kubectl exec -it curler-7668c66bf5-6c6v8 sh
/ # curl 192.168.2.4 -I
HTTP/1.1 200 OK
Server: nginx/1.15.12
Date: Tue, 07 May 2019 10:59:06 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 16 Apr 2019 13:08:19 GMT
Connection: keep-alive
ETag: "5cb5d3c3-264"
Accept-Ranges: bytes

If I try the same thing through the service name, it only half works. There are two coredns replicas: one on worker-node-1 and the other on worker-node-2. I believe that when the DNS request lands on the coredns pod running on worker-node-1 it works, but when it goes to the one on worker-node-2, it doesn't.

/ # curl nginx -I
curl: (6) Could not resolve host: nginx

/ # curl nginx -I
HTTP/1.1 200 OK
Server: nginx/1.15.12
Date: Tue, 07 May 2019 11:06:13 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 16 Apr 2019 13:08:19 GMT
Connection: keep-alive
ETag: "5cb5d3c3-264"
Accept-Ranges: bytes
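One way to confirm this theory (a sketch, using the coredns pod IPs from the listing above) is to query each coredns pod directly from inside the curler pod and see which one times out:

```shell
# From inside curler-7668c66bf5-6c6v8 on worker-node-1.

# coredns pod on worker-node-1 (same node) -- should answer:
nslookup nginx.default.svc.cluster.local 192.168.2.6

# coredns pod on worker-node-2 (cross-node) -- should time out
# if cross-node pod traffic is being dropped:
nslookup nginx.default.svc.cluster.local 192.168.1.9

# Same check for plain pod-to-pod traffic, bypassing DNS entirely
# (nginy pod on worker-node-2):
curl -I --max-time 5 192.168.1.4
```

If the cross-node queries hang while the same-node ones succeed, that points at the inter-node pod network rather than at coredns itself.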

So, my cross-node pod-to-pod communication is definitely not working. I checked the logs of the calico daemonset pods, but found nothing suspicious. I do have some suspicious logs in the kube-proxy pods, though:

$ kubectl logs kube-proxy-77pq4 -n kube-system
W0507 09:16:51.305357       1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I0507 09:16:51.315528       1 server_others.go:148] Using iptables Proxier.
I0507 09:16:51.315775       1 server_others.go:178] Tearing down inactive rules.
E0507 09:16:51.356243       1 proxier.go:563] Error removing iptables rules in ipvs proxier: error deleting chain "KUBE-MARK-MASQ": exit status 1: iptables: Too many links.
I0507 09:16:51.648112       1 server.go:464] Version: v1.13.1
I0507 09:16:51.658690       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0507 09:16:51.659034       1 config.go:102] Starting endpoints config controller
I0507 09:16:51.659052       1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I0507 09:16:51.659076       1 config.go:202] Starting service config controller
I0507 09:16:51.659083       1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I0507 09:16:51.759278       1 controller_utils.go:1034] Caches are synced for endpoints config controller
I0507 09:16:51.759291       1 controller_utils.go:1034] Caches are synced for service config controller

Can anyone tell me if the issue could be due to a kube-proxy iptables misconfiguration? Or point out anything I am missing?
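For what it's worth, one way to rule kube-proxy in or out is to check that it actually programmed the service rules on a node (a sketch; the chain names are the standard ones kube-proxy creates in iptables mode, and the ClusterIP is the nginx service's from the listing above):

```shell
# Run on a worker node. Verify kube-proxy created its NAT chains:
sudo iptables-save -t nat | grep -E 'KUBE-SERVICES|KUBE-SVC|KUBE-MARK-MASQ' | head

# Look up the DNAT rules for the nginx ClusterIP (10.109.136.120):
sudo iptables-save -t nat | grep 10.109.136.120
```

If the DNAT rules are present and point at the right pod IPs, kube-proxy is doing its job and the problem is more likely in the pod network (CNI) layer.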

-- suren
calico
google-cloud-platform
kubeadm
kubernetes
project-calico

1 Answer

6/25/2019

The issue was resolved by the original poster with the following solution:

The issue was that I had to allow IP-in-IP communication in my GCP firewall rules. Now it works.
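For context: Calico's default encapsulation is IP-in-IP (IP protocol 4), which GCP firewall rules do not allow between nodes unless explicitly opened. A sketch of the fix with gcloud (the network name and source range are assumptions based on the node IPs shown in the question):

```shell
# Allow IP-in-IP (IP protocol 4) traffic between the cluster nodes.
# "default" network and 10.240.0.0/24 are assumed from the node IPs above.
gcloud compute firewall-rules create allow-ipip \
  --network default \
  --allow ipip \
  --source-ranges 10.240.0.0/24
```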

-- Nepomucen
Source: StackOverflow