We have an issue in an AKS cluster running Kubernetes 1.13.5: containers are periodically killed with an exit code of 137, even though they were not OOM killed and no relevant events are recorded. The issues we are experiencing match those described in https://github.com/moby/moby/issues/38768. However, I can find no way to determine whether the version of Docker running on the AKS nodes is affected by this bug, because AKS appears to use a custom build of Docker with a version number like 3.0.4, and I can't find any mapping between these custom version numbers and the upstream Docker releases.
Does anyone know how to match the internal AKS Docker build numbers to upstream Docker releases, or, better yet, how to prevent pods from being randomly killed?
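For reference, the runtime version AKS reports can be read from the API without logging onto a node; a minimal check:
kubectl get nodes -o wide
The CONTAINER-RUNTIME column shows a value such as docker://3.0.4, but this is the custom AKS/Moby build number rather than an upstream Docker version, which is exactly the problem.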
Update
This is still an ongoing issue, and I thought I would document how we debugged it for future AKS users.
This is the typical description of a pod with a container that has been killed with an exit code of 137. The common factors are the Last State set to Terminated, the Reason set to Error, the Exit Code set to 137, and no events.
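The output below was captured with kubectl describe pod; a sketch using the pod and namespace from our case (substitute your own):
kubectl describe pod octopus-i002680-596954c5f5-sbrgs -n i002680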
Containers:
  octopus:
    Container ID:   docker://3a5707ab02f4c9cbd66db14d1a1b52395d74e2a979093aa35a16be856193c37a
    Image:          index.docker.io/octopusdeploy/linuxoctopus:2019.5.10-hosted.462
    Image ID:       docker-pullable://octopusdeploy/linuxoctopus@sha256:0ea2a0b2943921dc7d8a0e3d7d9402eb63b82de07d6a97cc928cc3f816a69574
    Ports:          10943/TCP, 80/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Mon, 08 Jul 2019 07:51:52 +1000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 04 Jul 2019 21:04:55 +1000
      Finished:     Mon, 08 Jul 2019 07:51:51 +1000
    Ready:          True
    Restart Count:  2
...
Events:             <none>
The lack of events is caused by the event TTL set in Kubernetes itself, which results in the events expiring. However, with Azure monitoring enabled, we can see that there were no events around the time of the restart other than the container starting again.
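If you catch the restart before the TTL expires, the events can also be pulled directly; a minimal sketch using our namespace (yours will differ):
kubectl get events -n i002680 --sort-by='.lastTimestamp'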
In our case, running kubectl logs octopus-i002680-596954c5f5-sbrgs --previous --tail 500 -n i002680 shows no application errors before the restart.
Running docker ps --all --filter 'exited=137' on the Kubernetes node hosting the pod shows the container 593f857910ff with an exit code of 137.
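The "Enable succeeded" / [stdout] wrappers below suggest these commands were captured via Azure's run-command facility rather than SSH; a sketch of how that might look, assuming a VMSS-based node pool and placeholder resource names (use az vm run-command invoke instead for availability-set-based clusters):
az vmss run-command invoke \
  --resource-group <node-resource-group> \
  --name <vmss-name> \
  --instance-id <instance-id> \
  --command-id RunShellScript \
  --scripts "docker ps --all --filter 'exited=137'"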
Enable succeeded:
[stdout]
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
20930700810f 7c23e4d2be70 "./install.sh " 14 hours ago Exited (137) 12 hours ago k8s_octopus_octopus-i002525-55f69565f8-s488l_i002525_b08125ab-9e2e-11e9-99be-422b98e8f214_2
593f857910ff 7c23e4d2be70 "./install.sh " 4 days ago Exited (137) 25 hours ago k8s_octopus_octopus-i002680-596954c5f5-sbrgs_i002680_01eb1b4d-9e03-11e9-99be-422b98e8f214_1
d792afb85c6f 7c23e4d2be70 "./install.sh " 4 days ago Exited (137) 4 days ago k8s_octopus_octopus-i002521-76bb77b5fd-twsdx_i002521_035093c5-9e2e-11e9-99be-422b98e8f214_0
0361bc71bf14 7c23e4d2be70 "./install.sh " 4 days ago Exited (137) 2 days ago k8s_octopus_octopus-i002684-769bd954-f89km_i002684_d832682d-9e03-11e9-99be-422b98e8f214_0
[stderr]
Running docker inspect 593f857910ff | jq .[0] | jq .State shows the container was not OOMKilled.
Enable succeeded:
[stdout]
{
  "Status": "exited",
  "Running": false,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 0,
  "ExitCode": 137,
  "Error": "",
  "StartedAt": "2019-07-04T11:04:55.037288884Z",
  "FinishedAt": "2019-07-07T21:51:51.080928603Z"
}
[stderr]
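To check every 137-exit container on a node in one pass, something like the following one-liner (a sketch using standard docker CLI Go-template formatting) can be used instead of inspecting each ID by hand:
docker ps --all --filter 'exited=137' --format '{{.ID}}' \
  | xargs -r docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}} FinishedAt={{.State.FinishedAt}}'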
This issue appears to have been resolved by updating to AKS 1.13.7, which includes an update to Moby 3.0.6. Since updating a few days ago, we have not seen containers killed in the manner described in the Docker bug at https://github.com/moby/moby/issues/38768.
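For anyone else on an affected version, the available upgrades and the upgrade itself can be driven from the CLI; a sketch with placeholder resource names:
az aks get-upgrades --resource-group <resource-group> --name <cluster-name> --output table
az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version 1.13.7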