Kubernetes OOMKilled containers for Tensorflow

11/4/2019

I have a Keras model (TensorFlow backend) which runs fine on my laptop (16 GB RAM).

However, when I use Kubeflow to deploy it to GCP, the pod is terminated (OOMKilled) every time. Both requests and limits are specified for CPU and memory.

The Dockerfile that Kubeflow produces for me:

# Base image: Google's prebuilt TensorFlow 1.14 (CPU-only) image
FROM gcr.io/deeplearning-platform-release/tf-cpu.1-14
WORKDIR /python_env
# Install the project's extra dependencies
COPY requirements.txt .
RUN python3 -m pip install -r requirements.txt
# Copy in the rest of the training code
COPY . .

There's some log output from what looks like TensorFlow:

First memory-related message:
time="2019-11-03T22:17:14Z" level=info msg="Alloc=3248 TotalAlloc=11862 Sys=70846 NumGC=12 Goroutines=11"

Final memory-related message:
time="2019-11-03T22:52:14Z" level=info msg="Alloc=3254 TotalAlloc=11952 Sys=70846 NumGC=29 Goroutines=11"

Ultimately, though, RAM usage grows linearly until the pod is terminated after ~50 minutes.

The model is simple, and although the data is a ~1 GB CSV file, it's loaded up front; the crash happens around the 3rd epoch.
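For context, the training code has roughly the following shape (a simplified sketch; the real model, column names and hyperparameters are different, but the whole CSV is read into memory with pandas and then fit for several epochs):

# Simplified, hypothetical sketch of the training script -- not the real
# model or columns, just the overall shape: full CSV in memory, then fit.
import pandas as pd
import tensorflow as tf

df = pd.read_csv("data.csv")            # the ~1 GB file, loaded in one go
X = df.drop(columns=["label"]).values   # "label" is a placeholder column name
y = df["label"].values

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# The OOM kill happens around the 3rd epoch, not during the initial load
model.fit(X, y, epochs=10, batch_size=256)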

My suspicion is that TensorFlow is not respecting the container's memory limit.
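To check what the container actually sees, I can print the cgroup memory limit and usage from inside the process at startup (a quick sketch; it assumes cgroup v1, i.e. the /sys/fs/cgroup/memory hierarchy, and uses only the standard library):

# Quick in-container check: what limit does the cgroup impose, and how much
# memory is this process using? Assumes cgroup v1 (/sys/fs/cgroup/memory).
import resource

with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
    limit_bytes = int(f.read().strip())
with open("/sys/fs/cgroup/memory/memory.usage_in_bytes") as f:
    usage_bytes = int(f.read().strip())

# ru_maxrss is reported in kilobytes on Linux
max_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print("cgroup limit: %.2f GiB" % (limit_bytes / 2.0**30))
print("cgroup usage: %.2f GiB" % (usage_bytes / 2.0**30))
print("process peak RSS: %.2f GiB" % (max_rss_kb / 2.0**20))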

I've tried different requests/limits, and as I say, the model has previously trained fine on my laptop.

What can I try? Where does the fault lie?
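One thing I'm considering is logging the process RSS at the end of every epoch, to confirm whether the growth is happening inside the training process itself rather than elsewhere in the pod (a sketch; psutil would be an extra dependency that isn't in my requirements.txt):

# Sketch of a per-epoch memory logger to pass to model.fit via callbacks=[...]
# psutil is an assumed extra dependency, not currently in requirements.txt
import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        rss_gib = psutil.Process(os.getpid()).memory_info().rss / 2.0**30
        print("epoch %d: RSS = %.2f GiB" % (epoch, rss_gib))

# Usage: model.fit(X, y, epochs=10, callbacks=[MemoryLogger()])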

Relevant part of the container spec:

      resources:
        limits:
          cpu: '7'
          memory: 26Gi
        requests:
          cpu: '7'
          memory: 26Gi

The node was provisioned automatically by GKE's node auto-provisioning; it created an n1-standard-8, i.e. 8 vCPUs, 30 GB RAM.

-- Kieren Johnstone
gcp-ai-platform-training
keras
kubernetes
python-3.x
tensorflow

0 Answers