I am trying to set up monitoring stack (prometheus + alertmanager + node_exporter etc) via helm install stable/prometheus
onto a raspberry pi k8s cluster (1 master + 3 worker nodes) which i set up.
Managed to get all the required pods running.
pi-monitoring-prometheus-alertmanager-767cd8bc65-89hxt 2/2 Running 0 131m 10.17.2.56 kube2 <none> <none>
pi-monitoring-prometheus-node-exporter-h86gt 1/1 Running 0 131m 192.168.1.212 kube2 <none> <none>
pi-monitoring-prometheus-node-exporter-kg957 1/1 Running 0 131m 192.168.1.211 kube1 <none> <none>
pi-monitoring-prometheus-node-exporter-x9wgb 1/1 Running 0 131m 192.168.1.213 kube3 <none> <none>
pi-monitoring-prometheus-pushgateway-799d4ff9d6-rdpkf 1/1 Running 0 131m 10.17.3.36 kube1 <none> <none>
pi-monitoring-prometheus-server-5d989754b6-gp69j 2/2 Running 0 98m 10.17.1.60 kube3 <none> <none>
however after port-forwarding prometheus server port 9090 and navigating to Targets
page, i realized none of the node_exporters are registered.
Digging through the logs, i found this
evel=error ts=2020-04-12T05:15:05.083Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:333: Failed to list *v1.Node: Get https://10.18.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:299: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:261: Failed to list *v1.Endpoints: Get https://10.18.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.085Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:262: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
Question: why is the prometheus pod unable to call the apiserver endpoints? Not really sure where was the configuration done wrongly
Followed through debug guide and realized individual nodes are unable to resolve services on other nodes.
Been troubleshooting for the past 1 day reading various sources but to be honest, i am not even sure where to begin with.
These are the pods running in kube-system
namespace. Hope this will give a better idea of how my system is set up.
pi@kube4:~ $ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-66bff467f8-nzvq8 1/1 Running 0 13d 10.17.0.2 kube4 <none> <none>
coredns-66bff467f8-z7wdb 1/1 Running 0 13d 10.17.0.3 kube4 <none> <none>
etcd-kube4 1/1 Running 0 13d 192.168.1.214 kube4 <none> <none>
kube-apiserver-kube4 1/1 Running 2 13d 192.168.1.214 kube4 <none> <none>
kube-controller-manager-kube4 1/1 Running 2 13d 192.168.1.214 kube4 <none> <none>
kube-flannel-ds-arm-8g9fb 1/1 Running 1 13d 192.168.1.212 kube2 <none> <none>
kube-flannel-ds-arm-c5qt9 1/1 Running 0 13d 192.168.1.214 kube4 <none> <none>
kube-flannel-ds-arm-q5pln 1/1 Running 1 13d 192.168.1.211 kube1 <none> <none>
kube-flannel-ds-arm-tkmn6 1/1 Running 1 13d 192.168.1.213 kube3 <none> <none>
kube-proxy-4zjjh 1/1 Running 0 13d 192.168.1.213 kube3 <none> <none>
kube-proxy-6mk2z 1/1 Running 0 13d 192.168.1.211 kube1 <none> <none>
kube-proxy-bbr8v 1/1 Running 0 13d 192.168.1.212 kube2 <none> <none>
kube-proxy-wfsbm 1/1 Running 0 13d 192.168.1.214 kube4 <none> <none>
kube-scheduler-kube4 1/1 Running 3 13d 192.168.1.214 kube4 <none> <none>
I suspect there is a networking issue that prevents you from reaching the API server. "dial tcp 10.18.0.1:443: i/o timeout" generally reflects that you are not able to connect or read from the server. You can use below steps to figure out the problem: 1. Deploy one busybox pod using kubectl run busybox --image=busybox -n kube-system
2. Get into the pod using kubectl exec -n kube-system -it <podname> sh
3. Try to do telnet from the tty like telnet 10.18.0.1 443
to figure out the connection issues
Let me know the output.
After much troubleshooting, i realized i am not able to ping other pods from other nodes but only able to ping from those within the node. Issue seems to be with iptables config as covered here https://github.com/coreos/flannel/issues/699.
tl;dr: running iptables --policy FORWARD ACCEPT
solved my problem. prior to updating iptables policy
Chain FORWARD (policy DROP)
target prot opt source destination
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
DOCKER-USER all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
issue it solved now. thanks @kitt for the help earlier!
Flannel documentation states:
NOTE: If
kubeadm
is used, then pass--pod-network-cidr=10.244.0.0/16
tokubeadm init
to ensure that thepodCIDR
is set.
This is because flannel ConfigMap by default is configured to work on "Network": "10.244.0.0/16"
You have configured your kubeadm with --pod-network-cidr=10.17.0.0/16
now this needs to be configured in flannel ConfigMap kube-flannel-cfg
to look like this:
kind: ConfigMap
apiVersion: v1
metadata:
name: kube-flannel-cfg
namespace: kube-system
labels:
tier: node
app: flannel
data:
cni-conf.json: |
{
"name": "cbr0",
"cniVersion": "0.3.1",
"plugins": [
{
"type": "flannel",
"delegate": {
"hairpinMode": true,
"isDefaultGateway": true
}
},
{
"type": "portmap",
"capabilities": {
"portMappings": true
}
}
]
}
net-conf.json: |
{
"Network": "10.17.0.0/16",
"Backend": {
"Type": "vxlan"
}
}
Thanks to @kitt for his debugging help.