We've been seeing intermittent connectivity/DNS issues in our Kubernetes 1.10 cluster running on Ubuntu.
We've been through the bug reports and related threads, and as near as we can tell a process is holding onto /run/xtables.lock
and that's causing problems in a kube-proxy pod.
One of the kube-proxy pods bound to a worker has this error repeating in the logs:
E0920 13:39:42.758280 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 13:46:46.193919 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:05:45.185720 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:11:52.455183 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:38:36.213967 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:44:43.442933 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
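(For anyone checking the same thing, we weren't doing anything fancier than pulling logs from the kube-proxy pod on the affected worker; the pod name is a placeholder here:)

kubectl -n kube-system get pods -o wide | grep kube-proxy
kubectl -n kube-system logs <kube-proxy-pod-on-affected-worker> | grep xtables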
These errors started about three weeks ago and so far we haven't been able to fix them. Because the problems were intermittent, we didn't trace them back to this until now.
We think this is causing one of the kube-flannel-ds pods to be stuck in a perpetual CrashLoopBackOff state as well:
NAME                                 READY   STATUS             RESTARTS   AGE
coredns-78fcdf6894-6z6rs             1/1     Running            0          40d
coredns-78fcdf6894-dddqd             1/1     Running            0          40d
etcd-k8smaster1                      1/1     Running            0          40d
kube-apiserver-k8smaster1            1/1     Running            0          40d
kube-controller-manager-k8smaster1   1/1     Running            0          40d
kube-flannel-ds-amd64-sh5gc          1/1     Running            0          40d
kube-flannel-ds-amd64-szkxt          0/1     CrashLoopBackOff   7077       40d
kube-proxy-6pmhs                     1/1     Running            0          40d
kube-proxy-d7d8g                     1/1     Running            0          40d
kube-scheduler-k8smaster1            1/1     Running            0          40d
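(For context, that listing is just the standard pod listing for the kube-system namespace:)

kubectl -n kube-system get pods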
Most bug reports around /run/xtables.lock seem to indicate the issue was resolved in July 2017, but we're seeing this on a new setup. We appear to have the appropriate chains configured in iptables, and running fuser /run/xtables.lock on the node returns nothing.
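In case it helps, these are the sorts of node-level checks we mean (run directly on the affected worker; the KUBE-* chains are the standard ones kube-proxy manages):

# check whether any process is currently holding the lock file
sudo fuser -v /run/xtables.lock
sudo lsof /run/xtables.lock

# confirm the kube-proxy chains are present
sudo iptables -t filter -L KUBE-SERVICES -n | head
sudo iptables-save | grep -c 'KUBE-'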
Does anybody have insight on this? It's causing a lot of pain.
So after a bit more digging we were able to find a reason code with this command:
kubectl -n kube-system describe pods kube-flannel-ds-amd64-szkxt
The pod name will of course be different in another installation, but the termination reason was reported as:
Last State:    Terminated
  Reason:      OOMKilled
  Exit Code:   137
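If you just want that field without scanning the full describe output, a jsonpath query like this should print OOMKilled directly (pod name specific to our cluster; it assumes the flannel container is the only container in the pod):

kubectl -n kube-system get pod kube-flannel-ds-amd64-szkxt \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'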
We had missed this Reason earlier (we were mostly focused on the exit code of 137); OOMKilled means the container exceeded its memory limit and was killed (137 is just 128 + 9, i.e. SIGKILL).
By default, the kube-flannel-ds containers get a memory limit of 100Mi, which is apparently too low. There are existing issues about changing this default in the reference config, but our fix was to raise the limit to 256Mi.
Changing the configuration is the first step; just run:
kubectl -n kube-system edit ds kube-flannel-ds-amd64
and change the value under limits -> memory from 100Mi to something higher; we used 256Mi.
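For reference, the relevant part of the container spec afterwards looked roughly like this (the request and cpu values are just the ones from the upstream flannel manifest we were using; adjust to match yours):

resources:
  requests:
    cpu: "100m"
    memory: "50Mi"
  limits:
    cpu: "100m"
    memory: "256Mi"   # raised from the default 100Mi

If you'd rather not use the interactive editor, the same change can be made with kubectl patch, assuming the flannel container is the first entry in the containers list:

kubectl -n kube-system patch ds kube-flannel-ds-amd64 --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"}]'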
By default this DaemonSet's update strategy is OnDelete, so you then need to delete the pod that's stuck in CrashLoopBackOff; it will be re-created with the updated limits.
You could roll through and delete the pods on the other nodes as well, but we only deleted the one that kept failing.
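Roughly, that last step looked like this; the first command just confirms why the edit doesn't roll out on its own:

# expected to print OnDelete for this DaemonSet
kubectl -n kube-system get ds kube-flannel-ds-amd64 -o jsonpath='{.spec.updateStrategy.type}'

# delete the crashing pod; the DaemonSet controller re-creates it with the new limit
kubectl -n kube-system delete pod kube-flannel-ds-amd64-szkxt
kubectl -n kube-system get pods -w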
Here are references to some of the issues that helped us track this down:
https://github.com/coreos/flannel/issues/963
https://github.com/coreos/flannel/issues/1012