Kubernetes Pod OOMKilled Solution

11/12/2018

I have a service running on Kubernetes that processes files passed from another resource. A single file's size can vary between 10 MB and 1 GB.

Recently I've been seeing the pod get killed with an OOMKilled error:

State: Running
Started: Sun, 11 Nov 2018 07:28:46 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Fri, 09 Nov 2018 18:49:46 +0000
Finished: Sun, 11 Nov 2018 07:28:45 +0000

I mitigated the issue by bumping the resource (memory) limit on the pod. But I am concerned that whenever there is a traffic or file-size spike, we will run into this OOMKilled issue again, and if I set the memory limit too high, I am concerned it will cause trouble on the host running this pod.
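For reference, the relevant part of the pod spec looks roughly like this (the names and values below are placeholders, not my actual numbers):

    apiVersion: v1
    kind: Pod
    metadata:
      name: file-processor                      # hypothetical pod name
    spec:
      containers:
      - name: processor                         # hypothetical container name
        image: example/file-processor:latest    # placeholder image
        resources:
          requests:
            memory: "1Gi"    # what the scheduler reserves for this container
          limits:
            memory: "2Gi"    # exceeding this is what triggers OOMKilled (exit code 137)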

I read through the best practices given by Kubernetes: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#best-practices. But I am not sure whether adding --eviction-hard and --system-reserved=memory would resolve the issue.
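My understanding is that these are kubelet-level settings, roughly equivalent to the following KubeletConfiguration fields; the thresholds here are placeholders, not values I have tested:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    evictionHard:
      memory.available: "500Mi"   # evict pods when the node's available memory drops below this
    systemReserved:
      memory: "1Gi"               # memory set aside for system daemons rather than pods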

Has anyone had experience with a similar issue before?

Any help would be appreciated.

-- Edward
kubernetes
memory

1 Answer

11/13/2018

More than a Kubernetes/container runtime issue, this is really about memory management in your application, and that will vary depending on the language runtime, or whether something like the JVM is running your application.

You generally want to set an upper limit on memory usage in the application, for example a maximum heap size in your JVM, and then leave a little headroom for garbage collection and overruns.
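For example, with a JVM application you might cap the heap below the container limit so there is headroom left over; a rough sketch (container name, image, and values are placeholders, not a recommendation):

    containers:
    - name: jvm-app                        # hypothetical container name
      image: example/jvm-app:latest        # placeholder image
      env:
      - name: JAVA_TOOL_OPTIONS            # picked up by the JVM at startup
        value: "-Xmx512m"                  # cap the heap below the container limit
      resources:
        limits:
          memory: "768Mi"                  # headroom for metaspace, threads, and GC overhead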

Another example is the Go runtime; it looks like they have talked about memory management, but there is no solution as of this writing. For these cases, it might be good to manually set a ulimit on virtual memory for the specific process of your application (if you have a leak you will see other types of errors), or to use timeout.
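One way to do that inside a pod, assuming a shell is available in the image, is to wrap the process with ulimit; a rough sketch with a hypothetical binary name and an arbitrary cap:

    containers:
    - name: go-app                          # hypothetical container name
      image: example/go-app:latest          # placeholder image
      # cap the process's virtual memory at ~1 GiB (ulimit -v takes KiB), then exec the real binary
      command: ["sh", "-c", "ulimit -v 1048576 && exec /go-app"]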

There's also manual cgroup management, but then again that's exactly what Docker and Kubernetes are supposed to do.

This is a good article with some insights about managing a JVM in containers.

-- Rico
Source: StackOverflow