I have a procedure for installing a Kubernetes cluster via kubeadm, and it has worked multiple times.
For some reason, in a cluster I installed recently, the nodes are having trouble communicating.
The problem shows up in a couple of ways: sometimes the cluster is unable to resolve public DNS records such as mirrorlist.centos.org, and sometimes a pod on one node has no connectivity to a pod on a different node.
My Kubernetes version is 1.9.2, my hosts run CentOS 7.4, I use flannel as the CNI plugin in version 0.9.1, and my cluster is built on AWS.
My debugging so far:
kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
- to see the pod subnets assigned to the nodes: 10.244.0.0/24 10.244.1.0/24
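A quick way to reproduce both symptoms from inside the cluster is a throwaway busybox pod (the target pod IP below is a placeholder; this is just a sketch of the kind of check):

kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# inside the pod: test cluster DNS, public DNS, and a pod IP on another node
nslookup kubernetes.default
nslookup mirrorlist.centos.org
ping <pod-ip-on-another-node>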
I tried adding configurations to kube-dns (even though this isn't needed in any of my other clusters), following https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#configure-stub-domain-and-upstream-dns-servers
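For reference, the kind of kube-dns ConfigMap described on that page looks roughly like this (the upstream servers are just example values):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]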
I even tried creating an AMI from another, working environment and deploying it as a node in this cluster, and it still fails.
I tried checking whether some port was missing, so I even opened all ports between the nodes.
I also disabled iptables and firewalld on all nodes, just to make sure they weren't the cause.
Nothing helps.
Any tip would be appreciated.
Edit: I added my flannel configuration:
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: flannel
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes/status
    verbs:
      - patch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flannel
  namespace: kube-system
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "type": "flannel",
      "delegate": {
        "isDefaultGateway": true
      }
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-system
  labels:
    tier: node
    app: flannel
spec:
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      hostNetwork: true
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni
        image: quay.io/coreos/flannel:v0.9.1-amd64
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conf
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
        image: quay.io/coreos/flannel:v0.9.1-amd64
        command: [ "/opt/bin/flanneld", "--ip-masq", "--kube-subnet-mgr" ]
        securityContext:
          privileged: true
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: run
          mountPath: /run
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      volumes:
      - name: run
        hostPath:
          path: /run
      - name: cni
        hostPath:
          path: /etc/cni/net.d
      - name: flannel-cfg
        configMap:
          name: kube-flannel-cfg
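One way to confirm the DaemonSet pods are actually running on every node (using the app: flannel label from the manifest above):

kubectl -n kube-system get pods -l app=flannel -o wide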
The issue was that the AWS machines were not provisioned by me, and the team that provisioned them assured me that all internal traffic was open.
After a lot of debugging with nmap I found out that the UDP ports were not open, and since flannel requires UDP traffic between nodes, the communication was not working properly.
Once UDP was opened, the issues were resolved.
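For anyone hitting something similar: with the vxlan backend, flannel carries its overlay traffic on UDP port 8472 between nodes, so a check like the following (placeholder node IP and security group ID; the UDP scan needs root) shows whether the port is reachable and how it can be opened in AWS:

# from one node, probe the flannel VXLAN port on another node
nmap -sU -p 8472 <other-node-private-ip>

# allow UDP 8472 between instances sharing a security group (placeholder sg id)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol udp --port 8472 --source-group sg-0123456789abcdef0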