How can I OOM all BestEffort Pods in Kubernetes?

3/9/2018

To demonstrate the kubelet's eviction behaviour, I am trying to deploy a Kubernetes workload that will consume memory to the point that the kubelet evicts all BestEffort Pods due to memory pressure but does not kill my workload (or at least not before the BestEffort Pods).

My best attempt is below. It writes to two tmpfs volumes (since, by default, the limit of a tmpfs volume is half of the Node's total memory). The 100 comes from the fact that --eviction-hard=memory.available<100Mi is set on the kubelet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fallocate
  namespace: developer
spec:
  selector:
    matchLabels:
      app: fallocate
  template:
    metadata:
      labels:
        app: fallocate
    spec:
      containers:
      - name: alpine
        image: alpine
        command:
        - /bin/sh
        - -c
        - |
          count=1
          while true
          do          

            AVAILABLE_DISK_KB=$(df /cache-1 | grep /cache-1 | awk '{print $4}')
            AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
            AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
            MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ?  $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
            fallocate -l $(( $MINIMUM - 100 ))MB /cache-1/$count

            AVAILABLE_DISK_KB=$(df /cache-2 | grep /cache-2 | awk '{print $4}')
            AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
            AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
            MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ?  $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
            fallocate -l $(( $MINIMUM - 100 ))MB /cache-2/$count            

            count=$(( $count+1 ))
            sleep 1

          done
        resources:
          requests:
            memory: 2Gi
            cpu: 100m
          limits:
            cpu: 100m
        volumeMounts:
        - name: cache-1
          mountPath: /cache-1
        - name: cache-2
          mountPath: /cache-2
      volumes:
      - name: cache-1
        emptyDir:
          medium: Memory
      - name: cache-2
        emptyDir:
          medium: Memory

The intention of this script is to use up memory to the point that Node memory usage crosses the hard eviction threshold, causing the kubelet to start evicting. It evicts some BestEffort Pods, but in most cases the workload is killed before all BestEffort Pods are evicted. Is there a better way of doing this?

I am running on GKE with cluster version 1.9.3-gke.0.

EDIT:

I also tried using simmemleak:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: simmemleak
  namespace: developer
spec:
  selector:
    matchLabels:
      app: simmemleak
  template:
    metadata:
      labels:
        app: simmemleak
    spec:
      containers:
      - name: simmemleak
        image: saadali/simmemleak
        resources:
          requests:
            memory: 1Gi
            cpu: 1m
          limits:
            cpu: 1m

But this workload keeps dying before any evictions. I think the issue is that it is being killed by the kernel before the kubelet has time to react.

-- dippynark
google-cloud-platform
google-kubernetes-engine
kubernetes

2 Answers

2/25/2019

To avoid the system OOM killer acting before the kubelet can evict, you could limit the memory available to the kubepods cgroup by configuring --system-reserved together with --enforce-node-allocatable.

For example, on a Node with 32Gi of memory, you could reserve enough memory outside of the kubepods cgroup that Pods are limited to roughly 20Gi, while keeping a hard eviction threshold such as:

--eviction-hard=memory.available<500Mi
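
A minimal sketch of what the accompanying kubelet flags might look like, assuming the roughly 12Gi reservation is split between --kube-reserved and --system-reserved (the exact values and split are assumptions for illustration, not taken from the answer):

--enforce-node-allocatable=pods
--kube-reserved=memory=2Gi
--system-reserved=memory=10Gi

With these set, the kubepods cgroup gets a memory limit of roughly 20Gi (capacity minus the reservations), so Pod memory usage is capped well below the Node's capacity and the kubelet has headroom to evict before the system-level OOM killer fires.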
-- Bo Wang
Source: StackOverflow

3/12/2018

I found this on the Kubernetes docs, I hope it helps:

kubelet may not observe memory pressure right away

The kubelet currently polls cAdvisor to collect memory usage stats at a regular interval.
If memory usage increases within that window rapidly, the kubelet may not observe MemoryPressure fast enough, and the OOMKiller will still be invoked.
We intend to integrate with the memcg notification API in a future release to reduce this latency, and instead have the kernel tell us when a threshold has been crossed immediately.

If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for this issue is to set eviction thresholds at approximately 75% capacity.
This increases the ability of this feature to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
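
As a concrete illustration of that 75% figure (the node size here is an assumption, not taken from the question): on a node with 16Gi of memory, triggering eviction at roughly 75% utilisation would mean something like:

--eviction-hard=memory.available<4Gi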

EDIT: As there seems to be a race between the OOM killer and the kubelet, and the memory allocated by your script grows faster than the kubelet can notice that Pods need to be evicted, it might be wise to allocate memory more slowly inside your script.
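
A minimal sketch of what a slower allocation loop might look like, reusing the question's fallocate approach (the 100MB chunk size and 5-second sleep are assumptions to tune):

count=1
while true
do
  # Allocate a small fixed chunk each iteration instead of nearly all
  # available memory at once, so the kubelet's periodic housekeeping
  # can observe MemoryPressure and evict BestEffort Pods first.
  fallocate -l 100MB /cache-1/$count
  count=$(( $count + 1 ))
  sleep 5
done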

-- Django
Source: StackOverflow