Memory builds up over time on Kubernetes pod, preventing the JVM from starting

2/25/2021

We are running a Kubernetes environment and have a pod that is encountering memory issues. The pod runs a single container, and that container is responsible for running various utility jobs throughout the day.

The issue is that this pod's memory usage grows slowly over time. The pod has a 6 GB memory limit, and eventually its memory consumption grows very close to that limit.

Many of our utility jobs are written in Java, and the JVMs that run them are started with -Xms256m, so they need 256 MB available in order to start. Since the pod's memory use keeps growing, it eventually reaches the point where there isn't 256 MB free to start the JVM, and the Linux oom-killer kills the java process. Here is what I see in dmesg when this occurs:

[Thu Feb 18 17:43:13 2021] Memory cgroup stats for /kubepods/burstable/pod4f5d9d31-71c5-11eb-a98c-023a5ae8b224/921550be41cd797d9a32ed7673fb29ea8c48dc002a4df63638520fd7df7cf3f9: cache:8KB rss:119180KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:119132KB inactive_file:8KB active_file:0KB unevictable:4KB
[Thu Feb 18 17:43:13 2021] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Thu Feb 18 17:43:13 2021] [ 5579]     0  5579      253        1       4        0          -998 pause
[Thu Feb 18 17:43:13 2021] [ 5737]     0  5737     3815      439      12        0           907 entrypoint.sh
[Thu Feb 18 17:43:13 2021] [13411]     0 13411     1952      155       9        0           907 tail
[Thu Feb 18 17:43:13 2021] [28363]     0 28363     3814      431      13        0           907 dataextract.sh
[Thu Feb 18 17:43:14 2021] [28401]     0 28401   768177    32228     152        0           907 java
[Thu Feb 18 17:43:14 2021] Memory cgroup out of memory: Kill process 28471 (Finalizer threa) score 928 or sacrifice child
[Thu Feb 18 17:43:14 2021] Killed process 28401 (java), UID 0, total-vm:3072708kB, anon-rss:116856kB, file-rss:12056kB, shmem-rss:0kB

Based on the research I've been doing (here, for example), it seems normal on Linux for memory consumption to grow over time as various caches fill up. From what I understand, that cached memory should also be freed when new processes (such as my java process) begin to run.
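
For reference, how much of that usage is reclaimable page cache versus rss can be read from the same cgroup that shows up in the dmesg output above. Below is a rough sketch of such a check, assuming the cgroup v1 layout used there; it is only an illustration, not something taken from the actual job scripts:

#!/usr/bin/env python
# Rough sketch: show how much of the cgroup's memory usage is page cache
# (reclaimable) versus rss (anonymous memory that cannot simply be dropped).
# Assumes the cgroup v1 layout visible in the dmesg output above.

CG = "/sys/fs/cgroup/memory"

stats = {}
with open(CG + "/memory.stat") as f:
    for line in f:
        key, value = line.split()
        stats[key] = int(value)

with open(CG + "/memory.usage_in_bytes") as f:
    usage = int(f.read())

mib = 1024 * 1024
print("usage: %d MiB  cache: %d MiB  rss: %d MiB"
      % (usage // mib, stats["cache"] // mib, stats["rss"] // mib))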

My main question is: should this pod's memory be getting freed in order for these java processes to run? If so, are there any steps I can take to begin to debug why this may not be happening correctly?

Aside from this concern, I've also been trying to track down what is responsible for the growing memory in the first place. I was able to narrow it down to a certain job that runs every 15 minutes, and I noticed that each time it ran, the pod's used memory grew by roughly 0.1 GB (about 100 MB).

I was able to figure this out by running this command (inside the container) before and after each execution of the job:

cat /sys/fs/cgroup/memory/memory.usage_in_bytes | numfmt --to si
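
A small wrapper can automate that before/after comparison; this is only a sketch, and the job path in it is a placeholder rather than the real script name:

#!/usr/bin/env python
# Sketch: run a job and print how much the cgroup's reported memory usage
# changed across the run. The job path below is a placeholder.
import subprocess

USAGE_FILE = "/sys/fs/cgroup/memory/memory.usage_in_bytes"

def usage_bytes():
    with open(USAGE_FILE) as f:
        return int(f.read())

before = usage_bytes()
subprocess.call(["/bin/bash", "/path/to/job.sh"])  # placeholder for the 15-minute job
after = usage_bytes()

mib = 1024 * 1024
print("before: %d MiB  after: %d MiB  delta: %d MiB"
      % (before // mib, after // mib, (after - before) // mib))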

From there I narrowed it down to the piece of bash code that the memory growth seems to consistently come from. That code looks like this:

while [ "z${_STATUS}" != "z0" ]
do
	RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
	_STATUS=`echo $RES | jq -r '.status.status' || exit 1`
	PROGRES=`echo $RES | jq -r '.status.progress' || exit 1`
	[ "x$_STATUS" == "x1" ] && exit 1
	[ "x$_STATUS" == "x3" ] && exit 3
	[ $CNT -gt 10 ] && PrintLog "WC Job ($JOB_ID) Progress: $PROGRES Status: $_STATUS " && CNT=0

	sleep 10
	((CNT++))
done
[ "z${_STATUS}" == "z0" ] && STATUS=Success || STATUS=Failed

This piece of code seems innocuous to me at first glance, so I do not know where to go from here.

I would really appreciate any help; I've been trying to get to the bottom of this issue for days now.

-- Matt Ciaravino
bash
docker
jvm
kubernetes
memory

1 Answer

3/15/2021

I did eventually get to the bottom of this, so I figured I'd post my solution here. As I mentioned in my question, I had narrowed the issue down to the while loop posted above. Each time the job ran, that loop would iterate maybe 10 times, and after it completed, used memory had pretty consistently increased by about 100 MB.

On a hunch, I suspected the curl command inside the loop, and it did turn out that curl was eating up memory and not releasing it for whatever reason. So instead of looping over the following curl command:

RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`

I replaced it with a simple Python script that uses the requests module to check the job status instead.
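
The actual script isn't included here, but a minimal sketch of that kind of replacement might look like the following. It assumes the same endpoint and JSON fields (.status.status and .status.progress) that the bash loop above reads with jq; TS_URL and JOB_ID are read from the environment purely as placeholders for the shell variables:

#!/usr/bin/env python
# Minimal sketch of a requests-based replacement for the curl polling loop.
# The endpoint and JSON fields mirror what the bash loop reads with jq;
# TS_URL and JOB_ID are placeholders for the shell variables of the same name.
import os
import sys
import time

import requests

TS_URL = os.environ["TS_URL"]
JOB_ID = os.environ["JOB_ID"]
URL = TS_URL + "/wcs/resources/admin/index/dataImport/status"

session = requests.Session()  # one connection reused for every poll
count = 0

while True:
    resp = session.get(URL, params={"jobStatusId": JOB_ID})
    resp.raise_for_status()
    status = resp.json()["status"]
    state = str(status["status"])

    if state == "0":         # job finished successfully
        sys.exit(0)
    if state in ("1", "3"):  # job failed; keep the same exit codes as the shell script
        sys.exit(int(state))

    count += 1
    if count > 10:           # log progress roughly every 10 polls, as the shell loop did
        print("WC Job (%s) Progress: %s Status: %s"
              % (JOB_ID, status["progress"], state))
        count = 0

    time.sleep(10)

One difference worth noting is that a single requests.Session reuses one HTTP connection for the whole loop, whereas the original loop forked a fresh curl process on every iteration.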

I am still not sure why curl was the culprit here. Running curl --version shows that the underlying library is libcurl/7.29.0. Maybe there is a bug in that library version causing memory-management issues, but that is just a guess.

In any case, switching from curl to Python's requests module has resolved my issue.

-- Matt Ciaravino
Source: StackOverflow