I am running a Python app in production, and the pod restarts frequently there, while in the staging environment this does not happen.
So I thought it could be a CPU/memory limit issue, and I have already updated those limits as well.
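For reference, the requests/limits currently applied to the pod can be checked with something like this (pod name and namespace are placeholders):

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'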
Debugging further, I found that the container exits with code 137.
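For example, the last termination state and exit code show up in the pod description (pod name and namespace are placeholders):

kubectl describe pod <pod-name> -n <namespace>
# look for:
#   Last State:  Terminated
#     Exit Code: 137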
To dig deeper, I went onto the Kubernetes node and checked the container directly.
Command used: docker inspect <container id>
Here is the output:
{
    "Id": "a0f18cd48fb4bba66ef128581992e919c4ddba5e13d8b6a535a9cff6e1494fa6",
    "Created": "2019-11-04T12:47:14.929891668Z",
    "Path": "/bin/sh",
    "Args": [
        "-c",
        "python3 run.py"
    ],
    "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 137,
        "Error": "",
        "StartedAt": "2019-11-04T12:47:21.108670992Z",
        "FinishedAt": "2019-11-05T00:01:30.184225387Z"
    },
OOMKilled is false, so I don't think that is the issue.
Using GKE master version: 1.13.10-gke.0
Technically, all 137 means is that your process was terminated by a SIGKILL. Unfortunately, that alone isn't enough to know where the signal came from. Tools like auditd or Falco can help gather that data by recording those kinds of system calls, or at least get you closer.
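For example, an auditd rule along these lines (run on the node; exact flags may vary with the auditd version) logs kill(2) calls that deliver SIGKILL, and ausearch then shows which process sent them. Note it only catches signals sent through the kill() syscall from userspace, not kills issued directly by the kernel:

auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill_trace   # a1 is the signal argument; 9 = SIGKILL
ausearch -k sigkill_trace -i                                           # list recorded events, including the sending process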
Exit code 137 is a Docker exit code telling us that the container was killed by the OOM killer. This does not necessarily mean the container itself reached a memory limit or lacked sufficient memory to run. Since the OS-level OOM killer is the one killing the application, the pod and Docker won't register an OOM event for the container itself, because the container did not necessarily hit its own memory limit.
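When the OS-level OOM killer acts, it normally leaves a trace in the node's kernel log, so something like this (run on the node) should show whether your process was hit:

dmesg -T | grep -i -E 'killed process|out of memory'
# or, on systemd-based nodes:
journalctl -k | grep -i -E 'killed process|out of memory'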
The linked doc goes into some detail on how to debug error 137, but you can also check your node metrics for memory usage, or check the node logs to see whether an OOM kill was ever registered at the OS level.
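For example, assuming the metrics API is available on the cluster (it normally is on GKE), node and pod memory usage can be checked with:

kubectl top nodes                           # per-node CPU and memory usage
kubectl top pod <pod-name> -n <namespace>   # per-pod usage; name and namespace are placeholders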
If this is a regular problem, make sure your Python container has resource requests and limits set, and make sure the other containers in the cluster have appropriate requests and limits as well.
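A minimal sketch, assuming the app is managed by a Deployment; the names and values are placeholders and should be sized to the app's real footprint:

kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi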