Pods in Azure AKS randomly being restarted

6/21/2019

We have an issue in an AKS cluster running Kubernetes 1.13.5. The symptoms are:

  • Pods are randomly restarted
  • The "Last State" is "Terminated", the "Reason" is "Error" and the "Exit Code" is "137"
  • The pod events show no errors, either related to lack of resources or failed liveness checks
  • The docker container shows "OOMKilled" as "false" for the stopped container
  • The Linux kernel logs show no OOM-killed processes (see the sketch after this list)
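
For anyone wanting to verify the same thing, one way to check a node's kernel logs for OOM killer activity is:

# On the node, search the kernel ring buffer for OOM killer activity
dmesg -T | grep -iE "out of memory|oom-killer|killed process"

# Or search the kernel messages in the systemd journal
journalctl -k | grep -iE "out of memory|oom-killer|killed process"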

The issues we are experiencing match those described in https://github.com/moby/moby/issues/38768. However, I can find no way to determine whether the version of Docker running on the AKS nodes is affected by this bug, because AKS appears to use a custom build of Docker (Moby) whose version is something like 3.0.4, and I can't find any mapping between these custom version numbers and the upstream Docker releases.
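For reference, the container runtime version each node reports can be read with kubectl, although it only surfaces the same Moby-style number rather than an upstream Docker release; a minimal sketch:

# Show each node's reported container runtime version (e.g. docker://3.0.4)
kubectl get nodes -o wide

# Or print just the node name and runtime version
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'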

Does anyone know how to match internal AKS Docker build numbers to upstream Docker releases, or, better yet, how to prevent pods from being randomly killed?

Update

This is still an ongoing issue, and I thought I would document how we debugged it for future AKS users.

Below is a typical description of a pod whose container has been killed with an exit code of 137. The common factors are a Last State of Terminated, a Reason of Error, an Exit Code of 137, and no events.

Containers:
  octopus:
    Container ID:   docker://3a5707ab02f4c9cbd66db14d1a1b52395d74e2a979093aa35a16be856193c37a
    Image:          index.docker.io/octopusdeploy/linuxoctopus:2019.5.10-hosted.462
    Image ID:       docker-pullable://octopusdeploy/linuxoctopus@sha256:0ea2a0b2943921dc7d8a0e3d7d9402eb63b82de07d6a97cc928cc3f816a69574
    Ports:          10943/TCP, 80/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Mon, 08 Jul 2019 07:51:52 +1000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 04 Jul 2019 21:04:55 +1000
      Finished:     Mon, 08 Jul 2019 07:51:51 +1000
    Ready:          True
    Restart Count:  2
...
Events:          <none>
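
The description above is the output you get from kubectl describe pod; for completeness, a sketch using the pod name and namespace from the commands later in this update:

# Describe the pod to see its container states, last exit code and recent events
kubectl describe pod octopus-i002680-596954c5f5-sbrgs -n i002680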

The lack of events is caused by the event TTL configured in Kubernetes itself, which results in older events expiring. However, with Azure Monitor enabled we can see that there were no events around the time of the restart other than the container starting again.
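
Events are only retained for the API server's event TTL (one hour by default, and generally not configurable on AKS's managed control plane), so they need to be captured before they expire. A minimal sketch of pulling whatever events remain for the pod, using the pod name and namespace from the commands below:

# List any remaining events for the pod, oldest to newest
kubectl get events -n i002680 \
  --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=octopus-i002680-596954c5f5-sbrgs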

In our case, running kubectl logs octopus-i002680-596954c5f5-sbrgs --previous --tail 500 -n i002680 shows no application errors before the restart.

Running docker ps --all --filter 'exited=137' on the Kubernetes node hosting the pod shows the container 593f857910ff with an exit code of 137.

Enable succeeded:
[stdout]
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
20930700810f        7c23e4d2be70        "./install.sh "     14 hours ago        Exited (137) 12 hours ago                       k8s_octopus_octopus-i002525-55f69565f8-s488l_i002525_b08125ab-9e2e-11e9-99be-422b98e8f214_2
593f857910ff        7c23e4d2be70        "./install.sh "     4 days ago          Exited (137) 25 hours ago                       k8s_octopus_octopus-i002680-596954c5f5-sbrgs_i002680_01eb1b4d-9e03-11e9-99be-422b98e8f214_1
d792afb85c6f        7c23e4d2be70        "./install.sh "     4 days ago          Exited (137) 4 days ago                         k8s_octopus_octopus-i002521-76bb77b5fd-twsdx_i002521_035093c5-9e2e-11e9-99be-422b98e8f214_0
0361bc71bf14        7c23e4d2be70        "./install.sh "     4 days ago          Exited (137) 2 days ago                         k8s_octopus_octopus-i002684-769bd954-f89km_i002684_d832682d-9e03-11e9-99be-422b98e8f214_0


[stderr]
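
As an aside, the "Enable succeeded:" / [stdout] / [stderr] wrapper in the output above is what the Azure run-command extension prints, which is one way to run commands on an AKS node without SSH access. A sketch, with placeholder resource group and VM names (use az vmss run-command invoke instead if the node pool is backed by a scale set):

# Run a command on an AKS node via the Azure VM run-command extension
# (the resource group and VM name below are placeholders)
az vm run-command invoke \
  --resource-group MC_myResourceGroup_myAksCluster_westeurope \
  --name aks-nodepool1-12345678-0 \
  --command-id RunShellScript \
  --scripts "docker ps --all --filter 'exited=137'"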

Running docker inspect 593f857910ff | jq '.[0].State' shows the container was not OOMKilled.

Enable succeeded:
[stdout]
{
  "Status": "exited",
  "Running": false,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 0,
  "ExitCode": 137,
  "Error": "",
  "StartedAt": "2019-07-04T11:04:55.037288884Z",
  "FinishedAt": "2019-07-07T21:51:51.080928603Z"
}


[stderr]
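
The same check can be run across every container that exited with 137 in one go by using docker inspect's Go template output instead of jq; a minimal sketch:

# Print the name, OOMKilled flag and finish time of every container that exited with 137
docker ps -aq --filter 'exited=137' | xargs --no-run-if-empty \
  docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}} FinishedAt={{.State.FinishedAt}}'
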
-- Phyxx
azure-aks
docker
kubernetes

1 Answer

7/20/2019

This issue appears to have been resolved by upgrading to AKS 1.13.7, which includes an update to Moby 3.0.6. Since upgrading a few days ago, we have not seen containers killed in the manner described in the Docker bug at https://github.com/moby/moby/issues/38768.
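
For anyone else hitting this, the cluster (and with it the node image and Moby version) can be upgraded with the Azure CLI; a sketch with placeholder resource group and cluster names:

# See which Kubernetes versions the cluster can be upgraded to
az aks get-upgrades --resource-group myResourceGroup --name myAksCluster --output table

# Upgrade the control plane and nodes to 1.13.7
az aks upgrade --resource-group myResourceGroup --name myAksCluster --kubernetes-version 1.13.7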

-- Phyxx
Source: StackOverflow