Kubernetes pods crash after a few hours; restarting the kubelet fixes it

9/10/2017

I'm running an insecure test Kubernetes v1.7.5 cluster on bare metal running CoreOS 1409.7.0. I've installed the apiserver, controller-manager, scheduler, proxy, and kubelet on the master node, and the kubelet and proxy on 3 other worker nodes, all with flanneld, using the systemd service files provided in the contrib/init k8s project.

Everything runs perfectly when the cluster starts up. I can deploy the dashboard and some deployments I've customized (Consul clients/server, nginx, etc.) and they all work great. However, if I leave the cluster running for a few hours, I come back to find every pod in CrashLoopBackOff, having been restarted many times. The only thing that solves the problem is restarting the kubelet on each machine; the problem immediately goes away and everything goes back to normal.
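The workaround described above can be sketched as a small loop over the nodes. The node list here is an assumption (only k8-app-2.example.com appears in the logs; substitute your own hostnames), and the loop prints the commands by default so it can be inspected before actually touching the cluster:

```shell
#!/bin/sh
# Hypothetical node list -- only k8-app-2.example.com is confirmed by the
# logs below; replace with your actual master and worker hostnames.
NODES="k8-master-1.example.com k8-app-1.example.com k8-app-2.example.com k8-app-3.example.com"

# Dry-run by default: prefix each command with `echo` so the loop only
# prints what it would do. Set DRY_RUN= (empty) to actually run the
# restarts over ssh.
DRY_RUN=${DRY_RUN-echo}

for node in $NODES; do
    # Restarting the kubelet clears the CrashLoopBackOff state on that node.
    $DRY_RUN ssh "$node" sudo systemctl restart kubelet
done
```

Running it as-is just prints one `ssh ... systemctl restart kubelet` line per node; clearing `DRY_RUN` executes them.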

Logs from the kubelet after it's gone into a bad state:

Sep 10 19:09:06 k8-app-2.example.com kubelet[1025]: , failed to "StartContainer" for "nginx-server" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-server pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:06 k8-app-2.example.com kubelet[1025]: ]
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.286367    1025 kuberuntime_manager.go:457] Container {Name:nginx-server Image:nginx Command:[] Args:[] WorkingDir: Ports:[{Name:http HostPort:0 ContainerPort:80 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:NODE_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/,Port:80,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.286795    1025 kuberuntime_manager.go:457] Container {Name:regup Image:registry.hub.docker.com/spunon/regup:latest Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:SERVICE_NAME Value:nginx ValueFrom:nil} {Name:SERVICE_PORT Value:80 ValueFrom:nil} {Name:NODE_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287071    1025 kuberuntime_manager.go:741] checking backoff for container "nginx-server" in pod "nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287376    1025 kuberuntime_manager.go:751] Back-off 5m0s restarting failed container=nginx-server pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287601    1025 kuberuntime_manager.go:741] checking backoff for container "regup" in pod "nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287863    1025 kuberuntime_manager.go:751] Back-off 5m0s restarting failed container=regup pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)
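For reference, the Container structs in the log above decode to roughly this Deployment manifest. The apiVersion, replica count, and labels are assumptions (they aren't in the log); the container names, images, ports, env fieldRefs, and liveness probe settings are taken directly from the logged fields:

```yaml
apiVersion: apps/v1beta1        # assumption; extensions/v1beta1 also works on v1.7
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1                   # assumption
  template:
    metadata:
      labels:
        app: nginx              # assumption
    spec:
      containers:
      - name: nginx-server
        image: nginx
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 80
        env:
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          timeoutSeconds: 1
          periodSeconds: 10
          failureThreshold: 3
      - name: regup
        image: registry.hub.docker.com/spunon/regup:latest
        imagePullPolicy: Always
        env:
        - name: SERVICE_NAME
          value: nginx
        - name: SERVICE_PORT
          value: "80"
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
```

Note the liveness probe: with timeoutSeconds of 1 and failureThreshold of 3, three slow or failed HTTP GETs in a row are enough for the kubelet to kill and restart the container, which would produce exactly the back-off cycle shown in the log.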

EDIT: Here are the logs from the kubelet when the issue seems to start

-- Douglas McAdams
coreos
kubelet
kubernetes

0 Answers