To demonstrate the kubelet's eviction behaviour, I am trying to deploy a Kubernetes workload that will consume memory to the point that the kubelet evicts all BestEffort Pods due to memory pressure but does not kill my workload (or at least not before the BestEffort Pods).
My best attempt is below. It writes to two tmpfs volumes (since, by default, the size limit of a tmpfs volume is half of the Node's total memory). The 100 in the script comes from the fact that --eviction-hard=memory.available<100Mi is set on the kubelet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fallocate
  namespace: developer
spec:
  selector:
    matchLabels:
      app: fallocate
  template:
    metadata:
      labels:
        app: fallocate
    spec:
      containers:
      - name: alpine
        image: alpine
        command:
        - /bin/sh
        - -c
        - |
          count=1
          while true
          do
            # fill /cache-1 with whichever is smaller (space left in the tmpfs
            # volume or the Node's free memory), minus 100MB, so the Node ends
            # up right at the kubelet's hard eviction threshold
            AVAILABLE_DISK_KB=$(df /cache-1 | grep /cache-1 | awk '{print $4}')
            AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
            AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
            MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ? $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
            fallocate -l $(( $MINIMUM - 100 ))MB /cache-1/$count
            # repeat for the second tmpfs volume
            AVAILABLE_DISK_KB=$(df /cache-2 | grep /cache-2 | awk '{print $4}')
            AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
            AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
            MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ? $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
            fallocate -l $(( $MINIMUM - 100 ))MB /cache-2/$count
            count=$(( $count+1 ))
            sleep 1
          done
        resources:
          requests:
            memory: 2Gi
            cpu: 100m
          limits:
            cpu: 100m
        volumeMounts:
        - name: cache-1
          mountPath: /cache-1
        - name: cache-2
          mountPath: /cache-2
      volumes:
      - name: cache-1
        emptyDir:
          medium: Memory
      - name: cache-2
        emptyDir:
          medium: Memory
The intention of this script is to use up memory to the point that the Node's memory usage crosses the hard eviction threshold, causing the kubelet to start evicting. The kubelet does evict some BestEffort Pods, but in most cases my workload is killed before all BestEffort Pods are evicted. Is there a better way of doing this?
I am running on GKE with cluster version 1.9.3-gke.0.
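While the DaemonSet is running, the Node condition and the evictions can be watched roughly like this (a sketch; <node-name> is a placeholder):

# check whether the kubelet has reported MemoryPressure on the Node
kubectl describe node <node-name> | grep MemoryPressure

# watch eviction events across namespaces as they happen
kubectl get events --all-namespaces --field-selector reason=Evicted -w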
EDIT:
I also tried using simmemleak:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: simmemleak
  namespace: developer
spec:
  selector:
    matchLabels:
      app: simmemleak
  template:
    metadata:
      labels:
        app: simmemleak
    spec:
      containers:
      - name: simmemleak
        image: saadali/simmemleak
        resources:
          requests:
            memory: 1Gi
            cpu: 1m
          limits:
            cpu: 1m
But this workload keeps dying before any evictions. I think the issue is that it is being killed by the kernel before the kubelet has time to react.
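To check whether the container is being OOM-killed by the kernel rather than evicted by the kubelet, something like the following should show the difference (assuming the developer namespace from the manifests above):

# an OOM-killed container shows Reason: OOMKilled in its last state
kubectl -n developer describe pod -l app=simmemleak | grep -A3 'Last State'

# evicted Pods show up as Evicted events instead
kubectl -n developer get events --field-selector reason=Evicted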
To avoid the system OOM killer acting before the kubelet can evict, you could limit the memory available to the kubepods cgroup using the kubelet flags --system-reserved and --enforce-node-allocatable. Read more.
For example, if the Node has 32Gi of memory, you could configure the kubelet to limit kubepods memory to about 20Gi and set:
--eviction-hard=memory.available<500Mi
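A sketch of what that could look like as kubelet flags; the 12Gi reservation is an assumed value, chosen so that roughly 20Gi of the 32Gi Node remains allocatable to Pods:

--enforce-node-allocatable=pods
--system-reserved=memory=12Gi
--eviction-hard=memory.available<500Mi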
I found this in the Kubernetes docs; I hope it helps:
kubelet may not observe memory pressure right away
The kubelet currently polls cAdvisor to collect memory usage stats at a regular interval.
If memory usage increases within that window rapidly, the kubelet may not observe MemoryPressure fast enough, and the OOMKiller will still be invoked.
We intend to integrate with the memcg notification API in a future release to reduce this latency, and instead have the kernel tell us when a threshold has been crossed immediately.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for this issue is to set eviction thresholds at approximately 75% capacity.
This increases the ability of this feature to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
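As a rough worked example (assuming a 32Gi Node, which is not part of the quoted docs): targeting about 75% utilization means evicting once less than about 25% of memory, i.e. roughly 8Gi, remains available:

--eviction-hard=memory.available<8Gi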
==EDIT==: As there seems to be a race between the OOM killer and the kubelet, and your script allocates memory faster than the kubelet can notice that Pods need to be evicted, it might be wise to allocate memory more slowly inside your script.
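For instance, a variation of the loop that grows in small fixed steps rather than grabbing nearly all free memory at once (a sketch; the 100MB step and the 5-second pause are arbitrary values to tune):

count=1
while true
do
  # allocate a small fixed chunk per iteration so the kubelet's next
  # stats poll can see the pressure before the kernel OOM killer fires
  fallocate -l 100MB /cache-1/$count
  count=$(( count + 1 ))
  sleep 5
done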