Prometheus pod unable to call apiserver endpoints

4/12/2020

I am trying to set up a monitoring stack (Prometheus + Alertmanager + node_exporter, etc.) via helm install stable/prometheus onto a Raspberry Pi k8s cluster (1 master + 3 worker nodes) that I set up myself.
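
For reference, the install command was roughly the following (Helm 3 syntax; the release name is taken from the pod prefix below):

helm install pi-monitoring stable/prometheus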

I managed to get all the required pods running (kubectl get pods -o wide output below):

pi-monitoring-prometheus-alertmanager-767cd8bc65-89hxt   2/2     Running            0          131m    10.17.2.56      kube2   <none>           <none>
pi-monitoring-prometheus-node-exporter-h86gt             1/1     Running            0          131m    192.168.1.212   kube2   <none>           <none>
pi-monitoring-prometheus-node-exporter-kg957             1/1     Running            0          131m    192.168.1.211   kube1   <none>           <none>
pi-monitoring-prometheus-node-exporter-x9wgb             1/1     Running            0          131m    192.168.1.213   kube3   <none>           <none>
pi-monitoring-prometheus-pushgateway-799d4ff9d6-rdpkf    1/1     Running            0          131m    10.17.3.36      kube1   <none>           <none>
pi-monitoring-prometheus-server-5d989754b6-gp69j         2/2     Running            0          98m     10.17.1.60      kube3   <none>           <none>

However, after port-forwarding the Prometheus server's port 9090 and navigating to the Targets page, I realized none of the node_exporters were registered.
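
The port-forward was along these lines (the deployment name is taken from the pod listing above):

kubectl port-forward deploy/pi-monitoring-prometheus-server 9090:9090
# then open http://localhost:9090/targets in a browser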

Digging through the logs, I found this:

level=error ts=2020-04-12T05:15:05.083Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:333: Failed to list *v1.Node: Get https://10.18.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:299: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:261: Failed to list *v1.Endpoints: Get https://10.18.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.085Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:262: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"

Question: why is the Prometheus pod unable to call the apiserver endpoints? I am not really sure where the configuration went wrong.

I followed the debugging guide and realized that individual nodes are unable to resolve services on other nodes.
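
The check was roughly the following (busybox:1.28 as a test image and the target pod IP are just examples; the IP is one from the pod listing above):

# start a throwaway test pod with basic network tools
kubectl run tmp-shell --image=busybox:1.28 --restart=Never -it -- sh
# inside the pod:
nslookup kubernetes.default      # resolve a service via the cluster DNS
ping 10.17.2.56                  # a pod IP scheduled on another node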

I have been troubleshooting for the past day, reading various sources, but to be honest I am not even sure where to begin.

These are the pods running in the kube-system namespace. Hopefully this gives a better idea of how my system is set up.

pi@kube4:~ $ kubectl get pods -n kube-system -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP              NODE    NOMINATED NODE   READINESS GATES
coredns-66bff467f8-nzvq8        1/1     Running   0          13d   10.17.0.2       kube4   <none>           <none>
coredns-66bff467f8-z7wdb        1/1     Running   0          13d   10.17.0.3       kube4   <none>           <none>
etcd-kube4                      1/1     Running   0          13d   192.168.1.214   kube4   <none>           <none>
kube-apiserver-kube4            1/1     Running   2          13d   192.168.1.214   kube4   <none>           <none>
kube-controller-manager-kube4   1/1     Running   2          13d   192.168.1.214   kube4   <none>           <none>
kube-flannel-ds-arm-8g9fb       1/1     Running   1          13d   192.168.1.212   kube2   <none>           <none>
kube-flannel-ds-arm-c5qt9       1/1     Running   0          13d   192.168.1.214   kube4   <none>           <none>
kube-flannel-ds-arm-q5pln       1/1     Running   1          13d   192.168.1.211   kube1   <none>           <none>
kube-flannel-ds-arm-tkmn6       1/1     Running   1          13d   192.168.1.213   kube3   <none>           <none>
kube-proxy-4zjjh                1/1     Running   0          13d   192.168.1.213   kube3   <none>           <none>
kube-proxy-6mk2z                1/1     Running   0          13d   192.168.1.211   kube1   <none>           <none>
kube-proxy-bbr8v                1/1     Running   0          13d   192.168.1.212   kube2   <none>           <none>
kube-proxy-wfsbm                1/1     Running   0          13d   192.168.1.214   kube4   <none>           <none>
kube-scheduler-kube4            1/1     Running   3          13d   192.168.1.214   kube4   <none>           <none>
-- jaanhio
kubernetes
networking
prometheus

3 Answers

4/12/2020

I suspect there is a networking issue that prevents you from reaching the API server. "dial tcp 10.18.0.1:443: i/o timeout" generally means that you are not able to connect to or read from the server. You can use the steps below to figure out the problem (a runnable sketch of the commands follows the list):

1. Deploy a busybox pod using kubectl run busybox --image=busybox -n kube-system
2. Get into the pod using kubectl exec -n kube-system -it <podname> sh
3. From the tty, try telnet 10.18.0.1 443 to figure out the connection issue
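
A minimal runnable sketch of those steps (keeping the pod alive with sleep and the --restart=Never flag are assumptions; 10.18.0.1:443 is the address from your error logs):

kubectl run busybox --image=busybox -n kube-system --restart=Never -- sleep 3600
kubectl exec -n kube-system -it busybox -- sh
# inside the pod:
telnet 10.18.0.1 443    # an i/o timeout here confirms the pod cannot reach the API server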

Let me know the output.

-- Shivam Singh
Source: StackOverflow

4/17/2020

After much troubleshooting, I realized I was only able to ping pods within the same node, not pods on other nodes. The issue seems to be with the iptables config, as covered here: https://github.com/coreos/flannel/issues/699

tl;dr: running iptables --policy FORWARD ACCEPT solved my problem. This was the FORWARD chain prior to updating the iptables policy:

Chain FORWARD (policy DROP)
target     prot opt source               destination
KUBE-FORWARD  all  --  anywhere             anywhere             /* kubernetes forwarding rules */
KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
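
For reference, the fix itself plus a quick check (run as root on the affected nodes; persisting the rule across reboots is a separate step):

# set the default policy of the FORWARD chain back to ACCEPT
sudo iptables --policy FORWARD ACCEPT
# verify: the chain header should now read "Chain FORWARD (policy ACCEPT)"
sudo iptables -L FORWARD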

The issue is solved now. Thanks @kitt for the help earlier!

-- jaanhio
Source: StackOverflow

4/16/2020

Flannel documentation states:

NOTE: If kubeadm is used, then pass --pod-network-cidr=10.244.0.0/16 to kubeadm init to ensure that the podCIDR is set.

This is because the flannel ConfigMap is configured by default to work with "Network": "10.244.0.0/16".

You have configured your kubeadm with --pod-network-cidr=10.17.0.0/16, so this now needs to be configured in the flannel ConfigMap kube-flannel-cfg to look like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.17.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
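
A sketch of how to roll the change out (the app=flannel label matches the labels in the ConfigMap above; editing in place is just one option):

# update net-conf.json in the ConfigMap to use 10.17.0.0/16
kubectl -n kube-system edit configmap kube-flannel-cfg
# recreate the flannel pods so they pick up the new configuration
kubectl -n kube-system delete pod -l app=flannel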

Thanks to @kitt for his debugging help.

-- Crou
Source: StackOverflow