I have a Kubernetes deployment that deploys a Java application based on the anapsix/alpine-java image. Nothing else runs in the container except for the Java application and the container overhead.
I want to maximise the amount of memory the Java process can use inside the Docker container and minimise the amount of RAM that is reserved but never used.
How can I safely maximise the number of pods running on the two nodes while never having Kubernetes terminate my pods because of memory limits?
For example, I have this deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: my-deployment
    spec:
      containers:
      - name: my-deployment
        image: myreg:5000/my-deployment:0.0.1-SNAPSHOT
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            memory: 1024Mi
          limits:
            memory: 1024Mi
Java 8 update 131+ has an experimental flag, -XX:+UseCGroupMemoryLimitForHeap, that makes the JVM respect the cgroup memory limit that comes from the Kubernetes deployment.
My Docker experiments show what is happening in Kubernetes.
If I run the following in Docker:
docker run -m 1024m anapsix/alpine-java:8_server-jre_unlimited java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XshowSettings:vm -version
I get:
VM settings:
Max. Heap Size (Estimated): 228.00M
This low value is because Java sets -XX:MaxRAMFraction to 4 by default, so I get about 1/4 of the RAM allocated...
If I run the same command with -XX:MaxRAMFraction=2 in Docker:
docker run -m 1024m anapsix/alpine-java:8_server-jre_unlimited java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XshowSettings:vm -XX:MaxRAMFraction=2 -version
I get:
VM settings:
Max. Heap Size (Estimated): 455.50M
Finally, setting MaxRAMFraction=1 quickly causes Kubernetes to kill my container.
docker run -m 1024m anapsix/alpine-java:8_server-jre_unlimited java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XshowSettings:vm -XX:MaxRAMFraction=1 -version
I get:
VM settings:
Max. Heap Size (Estimated): 910.50M
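As a rough sanity check, the estimated heap sizes above are approximately the container limit divided by MaxRAMFraction (the JVM's actual estimate comes out a little lower because it reserves some memory for itself):

```shell
# Back-of-envelope check: max heap is roughly container limit / MaxRAMFraction.
# Compare with the observed estimates of 228M, 455.5M and 910.5M above.
limit_mb=1024
for fraction in 4 2 1; do
  echo "MaxRAMFraction=$fraction -> ~$((limit_mb / fraction)) MB max heap"
done
```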
I think the issue here is that the Kubernetes memory limit applies to the whole container, while MaxRAMFraction only applies to the JVM heap. So if the JVM heap is as large as the Kubernetes limit, there won't be enough memory left for the rest of the container.
One thing you can try is increasing the limit:

limits:
  memory: 2048Mi

while keeping the request the same. The fundamental difference between requests and limits is that a request only reserves memory for scheduling (the container may use more if memory is available on the node), while a limit is a hard cap. This may not be the ideal solution, and you will have to figure out how much memory your pod consumes on top of the JVM heap, but as a quick fix increasing the limit should work.
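For example, the resources section of the deployment above would become (the 2048Mi value is just a starting point to tune from, not a measured number):

```yaml
resources:
  requests:
    memory: 1024Mi   # unchanged: what the scheduler reserves
  limits:
    memory: 2048Mi   # raised hard cap to give the container headroom above the heap
```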
The reason Kubernetes kills your pods is the resource limit. It is difficult to calculate because of container overhead and the usual mismatches between decimal and binary prefixes in memory specifications. My solution is to drop the limit entirely and only keep the request (which is what your pod will be guaranteed in any case if it is scheduled). Rely on the JVM to limit its heap via a static specification, and let Kubernetes manage how many pods are scheduled on a single node via the resource request.
First you will need to determine the actual memory usage of your container when running with your desired heap size. Run a pod with -Xmx1024m -Xms1024m and connect to the Docker daemon of the host it is scheduled on. Run docker ps to find your pod and docker stats <container> to see its current memory usage, which is the sum of the JVM heap, other static JVM usage like direct memory, and your container's overhead (Alpine with glibc). This value should only fluctuate within kibibytes because of some network usage that is handled outside the JVM. Add this value as the memory request to your pod template.
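If you want a starting point before measuring, you can sketch the request as heap plus estimated overheads. The overhead numbers below are assumptions to be replaced with what docker stats actually reports for your workload:

```shell
# Hypothetical sizing arithmetic; jvm_native_mb and container_mb are guesses,
# replace them with the real `docker stats` measurement for your container.
heap_mb=1024        # matches -Xmx1024m -Xms1024m
jvm_native_mb=120   # assumed: metaspace, thread stacks, direct memory, code cache
container_mb=16     # assumed: Alpine-with-glibc base image overhead
echo "memory request: $((heap_mb + jvm_native_mb + container_mb))Mi"
```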
Calculate or estimate how much memory the other components on your nodes need to function properly. There will at least be the Kubernetes kubelet, the Linux kernel and its userland, probably an SSH daemon, and in your case a Docker daemon running on them. You can choose a generous default like 1 GiB (excluding the kubelet) if you can spare the extra few bytes. Specify --system-reserved=memory=1Gi and --kube-reserved=memory=100Mi in your kubelet's flags and restart it. This will add those reserved resources to the Kubernetes scheduler's calculations when determining how many pods can run on a node. See the official Kubernetes documentation for more information.
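With those reservations in place, the scheduler's capacity math for an 8 GiB node looks roughly like this (the per-pod request of 1160Mi is an assumed measured value, not from the question):

```shell
# Rough allocatable-memory math for an 8 GiB node with the reservations above.
node_mib=8192
system_reserved_mib=1024   # --system-reserved=memory=1Gi
kube_reserved_mib=100      # --kube-reserved=memory=100Mi
pod_request_mib=1160       # assumed per-pod memory request from the measurement step
allocatable=$((node_mib - system_reserved_mib - kube_reserved_mib))
echo "allocatable: ${allocatable}Mi -> $((allocatable / pod_request_mib)) pods per node"
```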
This way there will probably be five to seven pods scheduled on a node with eight gigabytes of RAM, depending on the values chosen and measured above. They will be guaranteed the RAM specified in the memory request and will not be terminated. Verify the memory usage via kubectl describe node under Allocated resources. As for elegance/flexibility, you only need to adjust the memory request and the JVM heap size if you want to give your application more RAM.
This approach only works as long as the pods' memory usage does not explode; a rogue pod not limited by the JVM might cause eviction (see out of resource handling).
What we do in our case is launch with a high memory limit on Kubernetes, observe over time under load, and either tune memory usage to the level we want to reach with -Xmx, or adapt the memory limits (and requests) to the real memory consumption. Truth be told, we usually use a mix of both approaches. The key to this method is to have decent monitoring enabled on your cluster (Prometheus in our case); if you want a high level of fine-tuning, you might also want to add something like the Prometheus JMX exporter to get detailed insight into JVM metrics while tuning your setup.
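As a launch-line sketch of how the JMX exporter can be attached (the jar path, port, and config file name here are placeholders, not values from the setup above), the exporter runs as a Java agent alongside the application:

```shell
# Sketch: expose JVM metrics for Prometheus via the JMX exporter Java agent.
# Paths, port and config file are placeholders for your own setup.
java -javaagent:/opt/jmx_prometheus_javaagent.jar=9404:/opt/jmx-exporter-config.yaml \
     -Xmx1024m -Xms1024m \
     -jar /opt/app.jar
```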