Environment:
kubernetes=1.14.2
docker-ce=18.03.1~ce-0~ubuntu
Error description:
Pod rec-train-jzqzf-0 is stuck in the Terminating state.
kubelet log:
DEC 28 23:58:53 Node0296 kubelet[2283]: E1228 23:58:53.205139 2283 pod_workers.go:190] Error syncing pod 0058cf61-298a-11ea-901c-98039b61d091 ("rec-train-jzqzf-0_research(0058cf61-298a-11ea-901c-98039b61d091)"), skipping: rpc error: code = DeadlineExceeded desc = context deadline exceeded
DEC 28 23:59:02 Node0296 kubelet[2283]: I1228 23:59:02.205026 2283 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m0.611016853s ago; threshold is 3m0s.
docker log:
DEC 28 23:53:09 Node0296 dockerd[3877]: time="2019-12-28T23:53:09.984864688+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
DEC 28 23:53:10 Node0296 dockerd[3877]: time="2019-12-28T23:53:10+08:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/418eebcb5db650dbeced34df2469ccfff633d15660724e2da6156e77445f12da/shim.sock" debug=false module="containerd/tasks" pid=130177
DEC 28 23:54:30 Node0296 dockerd[3877]: time="2019-12-28T23:54:30.512317111+08:00" level=info msg="Container 418eebcb5db650dbeced34df2469ccfff633d15660724e2da6156e77445f12da failed to exit within 30 seconds of signal 15 - using the force"
In addition, I found that "journalctl -u docker -f" stopped producing output at DEC 29 13:46:02, as shown below.
DEC 29 13:46:02 Node0296 dockerd[3877]: time="2019-12-29T13:46:02+08:00" level=info msg="shim reaped" id=24f4968b5352df5eba4462382067ded6c6402878f957b83db2917824b560f1a1 module="containerd/tasks"
DEC 29 13:46:02 Node0296 dockerd[3877]: time="2019-12-29T13:46:02.256200069+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
As the docker log shows, the root cause appears to be that container 418eebcb5db650db* failed to exit. Could anyone tell me why the force exit did not work here? (I have encountered this error several times; my temporary workaround is to reboot the node, but that is painful.)
Run the command below:
kubectl delete pod rec-train-jzqzf-0 --grace-period=0 --force
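Note that a force delete only removes the pod object from the API server; the wedged container can keep running on the node. A quick way to check, as a sketch (assuming SSH access to Node0296; the container ID comes from the docker log above):

```shell
# Remove the pod object from the API server without waiting for the kubelet.
kubectl delete pod rec-train-jzqzf-0 --grace-period=0 --force

# On the node, check whether the container process is still alive.
# An empty result means the runtime really did reap it.
docker ps --filter "id=418eebcb5db650db" --no-trunc
```

If the container still shows up here after the force delete, the problem is at the container-runtime level, not in the Kubernetes API.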
Usually volume or network deletion can consume time during pod deletion.
You can use the --now flag as in the example below, so that resources are signaled for immediate shutdown (same as using --grace-period=1):
kubectl delete pod <pod-name> --now
Using "--force --grace-period=0" may result in inconsistency or data loss during deletion of pods.
Refer to "kubectl delete pod --help":
--force=false: Only used when grace-period=0. If true, immediately remove resources from API and bypass graceful deletion. Note that immediate deletion of some resources may result in inconsistency or data loss and requires confirmation.
--grace-period=-1: Period of time in seconds given to the resource to terminate gracefully. Ignored if negative.
Set to 1 for immediate shutdown. Can only be set to 0 when --force is true (force deletion).
--now=false: If true, resources are signaled for immediate shutdown (same as --grace-period=1).
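Since the docker log shows SIGKILL ("using the force") having no effect, the containerd-shim for that container is likely wedged. As a node-level workaround instead of rebooting, the following sketch (assuming root/SSH access to the node; container ID taken from the logs above) may recover the runtime:

```shell
# 1. Try to force-remove the stuck container directly.
docker rm -f 418eebcb5db650dbeced34df2469ccfff633d15660724e2da6156e77445f12da

# 2. If that hangs too, look for a wedged shim process for this container.
ps aux | grep 418eebcb5db650db | grep -v grep
# kill -9 <shim-pid>   # substitute the PID printed above

# 3. Restart the runtime and kubelet so the PLEG health check recovers.
systemctl restart docker
systemctl restart kubelet
```

This is less disruptive than a reboot, but restarting docker will still bounce every container on the node, so drain the node first if other workloads are running on it.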