Google Kubernetes Container CPU usage doesn't increase when using joblib Parallel in Python

3/30/2020

I'm running a container in a Google Kubernetes Engine cluster, on a node with 64 vCPUs and 57 GB of memory. I've allocated the container 16 vCPUs and 24 GB of memory. When I run a Python function in the container that uses joblib Parallel with n_jobs=12, CPU usage never exceeds 1 core. I've also tried running a simple parallel processing script within the container, and CPU usage still stays at 1 core. I don't know what's going on. Any help would be appreciated!
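For reference, the simple script I tried looked roughly like this (a minimal sketch, not my exact code):

from joblib import Parallel, delayed

# Toy CPU-bound task; with n_jobs=12 I expected ~12 busy cores,
# but usage never exceeded 1 core.
def work(i):
    return sum(j * j for j in range(10_000))

if __name__ == "__main__":
    results = Parallel(n_jobs=12)(delayed(work)(i) for i in range(1_000))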

Here is the YAML of the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "9"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"annotations":{},"name":"inference","namespace":"default"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"inference"}},"template":{"metadata":{"labels":{"app":"inference"}},"spec":{"containers":[{"args":["workers/infer.py"],"command":["python"],"image":"gcr.io/staging-239917/autoqa:v3","name":"inference","resources":{"limits":{"cpu":"16000m","memory":"16000Mi"},"requests":{"cpu":"16000m","memory":"8000Mi"}}}]}}}}
  creationTimestamp: "2020-03-28T16:49:50Z"
  generation: 9
  labels:
    app: inference
  name: inference
  namespace: default
  resourceVersion: "4878070"
  selfLink: /apis/apps/v1/namespaces/default/deployments/inference
  uid: 23eb391e-7114-11ea-a540-42010aa20052
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 1
  revisionHistoryLimit: 2147483647
  selector:
    matchLabels:
      app: inference
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: inference
    spec:
      containers:
      - args:
        - workers/infer.py
        command:
        - python
        image: gcr.io/staging-239917/autoqa:1.0.9026a5a8-55ba-44b5-8f86-269cea2e201c
        imagePullPolicy: IfNotPresent
        name: inference
        resources:
          limits:
            cpu: 16100m
            memory: 16000Mi
          requests:
            cpu: "16"
            memory: 16000Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-03-28T16:49:50Z"
    lastUpdateTime: "2020-03-28T16:49:50Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 9
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
-- Angad Kalra
google-kubernetes-engine
python-multiprocessing

1 Answer

5/6/2020

The most important thing in parallel computing is the implementation.

Without seeing the algorithm, it is hard to pinpoint the bottleneck affecting performance.

I think the issue lies in the implementation, not in GKE itself.


Prerequisites

I tried to recreate the parallel computing scenario on a newly created GKE cluster, and here is what I found:

  • GKE version: 1.15.9
  • Node count: 1
  • Node specification:
    • 22 cores
    • 64 GB RAM

Here is the YAML definition of the pod that was used for testing:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  namespace: default
spec:
  containers:
  - image: ubuntu
    resources:
      limits:
        cpu: "16"
        memory: "32Gi"
      requests:
        cpu: "16"
        memory: "32Gi" 
    command:
      - sleep
      - "infinity"
    imagePullPolicy: IfNotPresent
    name: ubuntu
  restartPolicy: Always
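
For completeness, the pod was created and entered with the standard commands (the file name here is illustrative):

$ kubectl apply -f ubuntu-pod.yaml
$ kubectl exec -it ubuntu -- /bin/bash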

Please take a close look at the requests and limits section:

    resources:
      limits:
        cpu: "16"
        memory: "32Gi"
      requests:
        cpu: "16"
        memory: "32Gi" 

It requests 16 CPUs, which will be important later.
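
If you want to confirm that limit from inside the container, the CFS quota can be read from the cgroup filesystem. A small sketch, assuming cgroup v1 (paths differ under cgroup v2):

# Derive the effective CPU limit from the CFS quota and period
# (cgroup v1 paths); a quota of -1 means "unlimited".
with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
    quota = int(f.read())
with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
    period = int(f.read())
print("unlimited" if quota == -1 else f"{quota / period} CPUs")  # expect 16.0 here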

Testing

I tried to simulate a workload in the pod with:

  • an application called stress (apt install stress)
  • a Python program using joblib's Parallel

Application stress

I ran this application inside the pod 3 times, each with a different number of CPUs (4, 16, 32); see the sketch after the list for a Python stand-in:

  • $ stress -c 4
  • $ stress -c 16
  • $ stress -c 32
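
If stress is not available in the image, a minimal Python stand-in for stress -c N could look like this (a sketch; the script name busy.py is hypothetical):

# busy.py - spin N processes in a tight loop, similar to `stress -c N`
import multiprocessing as mp
import sys

def burn():
    while True:
        pass

if __name__ == "__main__":
    procs = [mp.Process(target=burn) for _ in range(int(sys.argv[1]))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()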

4 cores

[htop screenshot 1]

As you can see in the output above, 4 cores are fully occupied by this application.

The output of $ kubectl top node shows CPU usage of around 18%:

NAME                                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-gke-high-default-pool-f7a17f78-0696   4090m        18%    948Mi           1%  

1 core out of 22 is around 4.5%; multiplying by 4 gives around 18%, which matches the output above.
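
The same arithmetic, spelled out (it also covers the 16-core case further down):

# Expected node-level CPU% for N fully busy cores on a 22-core node
node_cores = 22
for busy in (4, 16):
    print(f"{busy} cores -> {busy / node_cores:.1%}")  # ~18.2% and ~72.7%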

Looking at GCP Monitoring, you can see utilization of about 0.25 (4 cores out of the 16 allocated):

[GCP Monitoring screenshot 1]

16 cores

[htop screenshot 2]

As you can see in the output above, 16 cores are fully occupied by this application.

The output of $ kubectl top node shows CPU usage of around 73%:

NAME                                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-gke-high-default-pool-f7a17f78-0696   16168m       73%    945Mi           1%  

1 core out of 22 is around 4.5%; multiplying by 16 gives around 72%, which is close to the output above.

Looking at GCP Monitoring, you can see utilization of about 1.0 (all 16 allocated cores):

[GCP Monitoring screenshot 2]

32 cores

There is no way to run a 32-process workload on a 16-CPU allocation without time-slicing between the processes. Take a look at how Kubernetes manages it:

[htop screenshot 3]

As you can see, all cores are in use, but only at about 73% each; across 22 cores that adds up to roughly 16 cores at 100% usage (22 × 0.73 ≈ 16).

$ kubectl top node and GCP Monitoring will show the same output as in the 16-core case.

Python program

I used a basic Python program with joblib's Parallel, like the one below:

from joblib import Parallel, delayed
import sys

my_list = range(20000)

# CPU-bound function to parallelize
def find_square(i):
    return i ** 131072

# Run with parallel processing; worker count comes from the first CLI argument
squares = Parallel(n_jobs=int(sys.argv[1]), verbose=1)(
    delayed(find_square)(i) for i in my_list)

I ran the above program with the number of workers ranging from 4 to 16, passed as the command-line argument. The results are below:

[Parallel(n_jobs=4)]: Done 20000 out of 20000 | elapsed:  7.0min finished
[Parallel(n_jobs=5)]: Done 20000 out of 20000 | elapsed:  5.6min finished
[Parallel(n_jobs=6)]: Done 20000 out of 20000 | elapsed:  4.7min finished
[Parallel(n_jobs=7)]: Done 20000 out of 20000 | elapsed:  4.0min finished
[Parallel(n_jobs=8)]: Done 20000 out of 20000 | elapsed:  3.5min finished
[Parallel(n_jobs=9)]: Done 20000 out of 20000 | elapsed:  3.1min finished
[Parallel(n_jobs=10)]: Done 20000 out of 20000 | elapsed:  2.8min finished
[Parallel(n_jobs=11)]: Done 20000 out of 20000 | elapsed:  2.6min finished
[Parallel(n_jobs=12)]: Done 20000 out of 20000 | elapsed:  2.6min finished
[Parallel(n_jobs=13)]: Done 20000 out of 20000 | elapsed:  2.6min finished
[Parallel(n_jobs=14)]: Done 20000 out of 20000 | elapsed:  2.5min finished
[Parallel(n_jobs=15)]: Done 20000 out of 20000 | elapsed:  2.5min finished
[Parallel(n_jobs=16)]: Done 20000 out of 20000 | elapsed:  2.5min finished

As you can see, there is a clear difference in computation time between 4 workers (7.0 min) and 16 (2.5 min), with the gains flattening out past about 10 workers.
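
To quantify that scaling, here are speedups computed from the timings above, relative to the 4-worker run:

# Speedup relative to n_jobs=4, using the elapsed times reported above
timings = {4: 7.0, 8: 3.5, 12: 2.6, 16: 2.5}  # n_jobs -> minutes
for n in sorted(timings):
    print(f"n_jobs={n:2d}: {timings[4] / timings[n]:.2f}x vs n_jobs=4")

An ideal 4x scaling from 4 to 16 workers would give 1.75 min; the measured 2.5 min reflects the overhead and serial parts of the workload.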

The graph below shows CPU usage for the pod that ran these tests. As you can see, there is a gradual increase in usage. Keep in mind that the algorithm used will have a huge impact on CPU usage.

[GCP Monitoring graph]

If the program had:

  • n_jobs=4, it would run 4 cores at 100%, which corresponds to about 25% CPU utilization for the pod.
  • n_jobs=16, it would run all 16 cores at about 100%, which corresponds to about 100% CPU utilization for the pod.
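
One more caveat (general Kubernetes behaviour, not something specific to this test): CPU limits are enforced through CFS quota rather than by hiding cores, so core-count probes inside the pod may still report every core of the node. If your code sizes its worker pool from cpu_count() instead of an explicit n_jobs, it is worth checking what it actually sees:

import os
import joblib

# Inside a pod limited to 16 CPUs on a large node, both calls may still
# report the node's full core count, since the quota throttles CPU time
# rather than masking cores (newer joblib/loky versions may adjust for it).
print(os.cpu_count())
print(joblib.cpu_count())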

Please let me know if you have any questions.

-- Dawid Kruk
Source: StackOverflow