Kubernetes Cluster with finished jobs unstable; kubelet logs filled with "http2: no cached connection was available"

2/19/2019

Summary

I have various single-node Kubernetes clusters which become unstable after having accumulated ~300 completed jobs.

In one cluster, for example, there are 303 completed jobs:

root@xxxx:/home/xxxx# kubectl get jobs | wc -l
303
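
Note that wc -l also counts the header row; to count only the Job objects, something like the following can be used (a small sketch; --no-headers is assumed to be available on this kubectl version):

# Count the Job objects without the header line.
kubectl get jobs --no-headers | wc -l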

Observations

I observe the following (a quick way to check for these symptoms is sketched after this list):

  • The kubelet logs are filled with error messages like this one:

    kubelet[877]: E0219 09:06:14.637045 877 reflector.go:134] object-"default"/"job-162273560": Failed to list *v1.ConfigMap: Get https://172.13.13.13:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Djob-162273560&limit=500&resourceVersion=0: http2: no cached connection was available

  • The node status is not being updated, with a similar error message:

    kubelet[877]: E0219 09:32:57.379751 877 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: Get https://172.13.13.13:6443/api/v1/nodes?fieldSelector=metadata.name%3Dxxxxx&limit=500&resourceVersion=0: http2: no cached connection was available

  • Eventually, the node is marked as NotReady and no new pods are scheduled:

    NAME    STATUS     ROLES    AGE    VERSION
    xxxxx   NotReady   master   6d4h   v1.12.1

  • The cluster is entering and exiting the master disruption mode (from the kube-controller-manager logs):

    I0219 09:29:46.875397 1 node_lifecycle_controller.go:1015] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
    I0219 09:30:16.877715 1 node_lifecycle_controller.go:1042] Controller detected that some Nodes are Ready. Exiting master disruption mode.

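A rough way to confirm these symptoms on an affected node (a sketch, assuming the kubelet runs as a systemd unit, as it does on Ubuntu, and that xxxxx is the node name):

# Count recent occurrences of the http2 error in the kubelet logs.
journalctl -u kubelet --since "1 hour ago" | grep -c "no cached connection was available"

# Check the node status and its conditions.
kubectl get nodes
kubectl describe node xxxxx | grep -A 10 "Conditions:"
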
The real culprit appears to be the "http2: no cached connection was available" error message. The only real references I've found are a couple of issues in the Go repository (such as golang/go#16582), which appear to have been fixed a long time ago.

In most cases, deleting the completed jobs seems to restore the cluster's stability.
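
For reference, the cleanup is roughly the following (a sketch, assuming the Jobs live in the default namespace and that the installed kubectl supports --field-selector on delete; status.successful is a selectable field for Jobs):

# Delete only the Jobs that completed successfully.
kubectl delete jobs --field-selector status.successful=1

# Fallback if --field-selector is not supported on delete:
kubectl get jobs -o jsonpath='{range .items[?(@.status.succeeded==1)]}{.metadata.name}{"\n"}{end}' \
  | xargs -r kubectl delete job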

Minimal repro (tbc)

I seem to be able to reproduce this problem by creating a large number of Jobs whose containers mount ConfigMaps:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
  # Just some sample data
  game.properties: |
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(20)"]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4

Schedule lots of these jobs:

#!/bin/bash
# Create 300 jobs (IDs 100 through 399) from the template above.
for i in $(seq 100 399); do
    sed "s/%JOB_ID%/$i/g" job.yaml | kubectl create -f -
    sleep 0.1
done
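
While these jobs are running, the node can be watched to see when it flips to NotReady (a sketch; watch is available on Ubuntu):

# Re-check the job count and node status every 10 seconds.
watch -n 10 'kubectl get jobs --no-headers | wc -l; kubectl get nodes'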

Questions

I'm very curious, though, as to what causes this problem, as 300 completed jobs seems like a fairly low number.

Is this a configuration problem in my cluster? A possible bug in Kubernetes/Go? Anything else that I can try?

-- Frederik Carlier
go
kubelet
kubernetes
ubuntu

1 Answer

4/15/2019

To summarize this problem and why it happens: it is an issue affecting Kubernetes 1.12 and 1.13. As explained in the GitHub issue (probably created by the author of the question), it seems to be caused by the http2 connection pool implementation, or, as explained in one of the comments, by connection management in the kubelet. Ways of mitigating it are described there, and if you need more information, all of the links are available in the linked GitHub issue.
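
Until a patched release is rolled out, a simple workaround in line with what the question already describes is to prune successfully completed Jobs on a schedule, for example from cron (a sketch; the schedule, namespace, and kubeconfig path are assumptions to adapt to your setup):

# Hypothetical crontab entry: delete successfully completed Jobs every hour.
0 * * * * kubectl --kubeconfig /etc/kubernetes/admin.conf -n default delete jobs --field-selector status.successful=1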

-- aurelius
Source: StackOverflow