I have various single-node Kubernetes clusters which become unstable after having accumulated ~300 completed jobs.
In one cluster, for example, there are 303 completed jobs:
root@xxxx:/home/xxxx# kubectl get jobs | wc -l
303
What I observe is that the kubelet logs are filled with error messages like this:
kubelet[877]: E0219 09:06:14.637045 877 reflector.go:134] object-"default"/"job-162273560": Failed to list *v1.ConfigMap: Get https://172.13.13.13:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Djob-162273560&limit=500&resourceVersion=0: http2: no cached connection was available
kubelet[877]: E0219 09:32:57.379751 877 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: Get https://172.13.13.13:6443/api/v1/nodes?fieldSelector=metadata.name%3Dxxxxx&limit=500&resourceVersion=0: http2: no cached connection was available
The node ends up NotReady and no new pods are scheduled:
NAME    STATUS     ROLES    AGE    VERSION
xxxxx   NotReady   master   6d4h   v1.12.1
The master keeps entering and exiting disruption mode (from the kube-controller-manager logs):
I0219 09:29:46.875397 1 node_lifecycle_controller.go:1015] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
I0219 09:30:16.877715 1 node_lifecycle_controller.go:1042] Controller detected that some Nodes are Ready. Exiting master disruption mode.
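To confirm it is the same failure mode, these are the kinds of checks that can be run on the node (a rough sketch; it assumes the kubelet runs as a systemd unit called kubelet and that the controller-manager is a static pod named after the node, as in a kubeadm setup):
# Count how often the http2 error shows up in the kubelet journal
journalctl -u kubelet --since "1 hour ago" \
  | grep -c "http2: no cached connection was available"

# Check whether the controller-manager keeps toggling disruption mode
# (the pod name suffix is the node name and is cluster-specific)
kubectl -n kube-system logs kube-controller-manager-xxxxx --tail=20 \
  | grep "disruption mode"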
The real culprit appears to be the http2: no cached connection was available
error message. The only real references I've found are a couple of issues in the Go repository (like #16582), which appear to have been fixed a long time ago.
In most cases, deleting the completed jobs seems to restore system stability.
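One way to script that cleanup, as a sketch only (it assumes the jobs live in the default namespace, that each job completes exactly once, and that the installed kubectl supports JSONPath filter expressions):
# Delete every job whose status reports one succeeded pod
kubectl get jobs -n default \
  -o jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}' \
  | xargs -r kubectl delete -n default job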
I seem to be able to reproduce this problem by creating lots of jobs whose containers mount ConfigMaps:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
  # Just some sample data
  game.properties: |
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(20)"]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
      - name: config-volume
        configMap:
          name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4
Schedule lots of these jobs:
#!/bin/bash
for i in `seq 100 399`;
do
cat job.yaml | sed "s/%JOB_ID%/$i/g" | kubectl create -f -
sleep 0.1
done
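While that loop runs, the failure can be watched developing with something like this (a sketch; the 10-second interval is arbitrary and watch must be installed on the node):
# Watch the node flip to NotReady as completed jobs accumulate
watch -n 10 'kubectl get nodes; echo; echo "jobs: $(kubectl get jobs --no-headers | wc -l)"'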
I'm very curious though as to what causes this problem, as 300 completed jobs seems to be a fairly low number.
Is this a configuration problem in my cluster? A possible bug in Kubernetes/Go? Anything else that I can try?
Just to summarize this problem and why it happens: it was actually an issue in Kubernetes 1.12 and 1.13. As explained in the GitHub issue (probably created by the author), it appears to be a problem with the http2 connection pool implementation, or, as explained in one of the comments, a connection management problem in the kubelet. The ways of mitigating it are described here, and if you need more information, all the links are available in the linked GitHub issue.
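Until the cluster is on a version that contains the fix, the workaround from the question (deleting completed jobs) can simply be automated. A minimal sketch, assuming kubectl is configured for the root user on the node, cron is available, and a hypothetical script path such as /etc/cron.hourly/cleanup-completed-jobs:
#!/bin/bash
# Delete succeeded jobs across all namespaces so the kubelet stops
# watching their ConfigMaps (the connection management problem above).
kubectl get jobs --all-namespaces \
  -o jsonpath='{range .items[?(@.status.succeeded==1)]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
  | while read -r ns name; do
      kubectl delete job -n "$ns" "$name"
    done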