Kubernetes's GC doesn't remove Exited Docker containers

5/16/2017

I have a cluster of 3 nodes running Kubernetes 1.6.1, each has 2 CPU and 4G RAM.

I am constantly redeploying my application with the same Docker tag. To force a new pod template hash, I replace the value of an environment variable that is passed to the container:

sed "s/THIS_STRING_IS_REPLACED_DURING_BUILD/$(date)/g" nginx-deployment.yml | kubectl replace -f -

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        env:
        - name: FOR_GODS_SAKE_PLEASE_REDEPLOY
          value: 'THIS_STRING_IS_REPLACED_DURING_BUILD'

If I do this a few hundred times, I can't redeploy any more: new pods stay in the Pending state. kubectl get events produces the following:

Events:
  FirstSeen LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  1h        50s     379 default-scheduler           Warning     FailedScheduling    No nodes are available that match all of the following predicates:: Insufficient pods (3).

At the same time I can see about 200 Exited nginx containers on every Kube node.
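
One way to get that count on a node is to list exited containers with docker ps and count the names that contain nginx. This is a minimal sketch, assuming Docker is the container runtime and the docker CLI can reach the daemon on the node; count_matching and exited_on_this_node are hypothetical helper names.

```shell
# Hypothetical helper: count stdin lines that contain the given substring.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_matching() {
  grep -c -- "$1" || true
}

# Hypothetical helper: count exited containers on this node whose name
# contains the given substring. Requires access to the Docker daemon.
exited_on_this_node() {
  docker ps -a --filter status=exited --format '{{.Names}}' | count_matching "$1"
}
```

Running exited_on_this_node nginx on each node prints the number of exited containers whose name contains nginx.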

Looking at the kube-controller-manager logs, I can see that PodGC is trying to delete some pods, but they are not found.

I0516 12:53:41.137311       1 gc_controller.go:175] Found unscheduled terminating Pod nginx-deployment-2927112463-qczvv not assigned to any Node. Deleting.
I0516 12:53:41.137320       1 gc_controller.go:62] PodGC is force deleting Pod: default:nginx-deployment-2927112463-qczvv
E0516 12:53:41.190592       1 gc_controller.go:177] pods "nginx-deployment-2927112463-qczvv" not found
I0516 12:53:41.195020       1 gc_controller.go:175] Found unscheduled terminating Pod nginx-deployment-3265736979-jrpzb not assigned to any Node. Deleting.
I0516 12:53:41.195048       1 gc_controller.go:62] PodGC is force deleting Pod: default:nginx-deployment-3265736979-jrpzb
E0516 12:53:41.238307       1 gc_controller.go:177] pods "nginx-deployment-3265736979-jrpzb" not found

Is there anything I can do to prevent that from happening?

-- Vladimir Kozyrev
kubernetes

2 Answers

5/24/2017

I think you have run out of resources on your nodes. The scheduler cannot find any node to schedule the pod on, and because the pod is not scheduled to any node, PodGC can't remove it.

I think you should double-check why you have run out of resources.
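
A quick way to check is to compare each node's allocatable pod count with the number of pods currently bound to it, since the "Insufficient pods" predicate in the event appears to refer to pod slots rather than CPU or memory. This is a sketch, assuming kubectl is configured against the cluster; pods_per_node and show_pod_usage are hypothetical helper names.

```shell
# Hypothetical helper: tally stdin lines (one node name per line), skipping blanks.
pods_per_node() {
  awk 'NF { c[$1]++ } END { for (n in c) print n, c[n] }' | sort
}

# Hypothetical helper: show pod capacity and current pod count for every node.
show_pod_usage() {
  # Allocatable pod slots per node.
  kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods
  # Node name of every pod in every namespace; unscheduled pods yield a blank line.
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | pods_per_node
}
```

Calling show_pod_usage prints both lists. The tally includes pods in every phase, so treat it as a rough comparison only.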

-- Xianglin Gao
Source: StackOverflow

5/23/2017

Kubernetes lets you tune the kubelet's container garbage collection. This is done by changing the kubelet flags --maximum-dead-containers or --maximum-dead-containers-per-container. The Kubernetes documentation on kubelet garbage collection describes them in more detail.
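
For illustration, the two flags could be set like this. The values are arbitrary examples, not recommendations, and how extra arguments reach the kubelet (systemd unit, drop-in file, or something else) depends on how the cluster was installed; KUBELET_GC_ARGS is a hypothetical variable name.

```shell
# Arbitrary example values: keep at most 1 dead instance per container
# and at most 50 dead containers per node.
KUBELET_GC_ARGS="--maximum-dead-containers-per-container=1 --maximum-dead-containers=50"

# Print the resulting command line instead of starting a kubelet.
echo "kubelet $KUBELET_GC_ARGS"
```

The kubelet has to be restarted with the new arguments on every node for them to take effect.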

-- surajd
Source: StackOverflow