I'm running a container in a Google Kubernetes Engine (GKE) cluster, on a node with 64 vCPUs and 57 GB of memory. I've allocated the container 16 vCPUs and 24 GB of memory. When I run a Python function in the container that uses joblib Parallel processing with n_jobs=12, the CPU usage never exceeds 1 core. I've also tried running a simple parallel processing script within the container, and the CPU usage still stays at 1 core. I don't know what's going on. Any help would be appreciated!
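For reference, the simple test script I mentioned was along the lines of the sketch below (a minimal sketch, not my real workload; the function name, counts and n_jobs value are just placeholders):

import os
from joblib import Parallel, delayed

def burn(n):
    # CPU-bound busy work; each worker should keep one core busy
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())  # CPUs visible to the process
    results = Parallel(n_jobs=12, verbose=1)(
        delayed(burn)(5_000_000) for _ in range(48))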
Here is the YAML of the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "9"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"annotations":{},"name":"inference","namespace":"default"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"inference"}},"template":{"metadata":{"labels":{"app":"inference"}},"spec":{"containers":[{"args":["workers/infer.py"],"command":["python"],"image":"gcr.io/staging-239917/autoqa:v3","name":"inference","resources":{"limits":{"cpu":"16000m","memory":"16000Mi"},"requests":{"cpu":"16000m","memory":"8000Mi"}}}]}}}}
  creationTimestamp: "2020-03-28T16:49:50Z"
  generation: 9
  labels:
    app: inference
  name: inference
  namespace: default
  resourceVersion: "4878070"
  selfLink: /apis/apps/v1/namespaces/default/deployments/inference
  uid: 23eb391e-7114-11ea-a540-42010aa20052
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 1
  revisionHistoryLimit: 2147483647
  selector:
    matchLabels:
      app: inference
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: inference
    spec:
      containers:
      - args:
        - workers/infer.py
        command:
        - python
        image: gcr.io/staging-239917/autoqa:1.0.9026a5a8-55ba-44b5-8f86-269cea2e201c
        imagePullPolicy: IfNotPresent
        name: inference
        resources:
          limits:
            cpu: 16100m
            memory: 16000Mi
          requests:
            cpu: "16"
            memory: 16000Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-03-28T16:49:50Z"
    lastUpdateTime: "2020-03-28T16:49:50Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 9
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
The most important factor in parallel computing is the implementation itself.
Without knowing the algorithm, it is hard to pinpoint the bottleneck that is limiting performance.
I think the issue lies within the implementation and not in GKE itself.
I tried to recreate the parallel computing scenario with a newly created GKE cluster, and here is what I found:
Here is the YAML definition of the pod that was used for testing:
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  namespace: default
spec:
  containers:
  - image: ubuntu
    resources:
      limits:
        cpu: "16"
        memory: "32Gi"
      requests:
        cpu: "16"
        memory: "32Gi"
    command:
    - sleep
    - "infinity"
    imagePullPolicy: IfNotPresent
    name: ubuntu
  restartPolicy: Always
Please take a specific look at the requests and limits part:
resources:
  limits:
    cpu: "16"
    memory: "32Gi"
  requests:
    cpu: "16"
    memory: "32Gi"
It will request 16 CPUs, which will be important later.
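As a side check, you can verify from inside the container which CPU limit it is actually subject to. Below is a minimal sketch, assuming a cgroup v1 node (on cgroup v2 the limit lives in /sys/fs/cgroup/cpu.max instead); for the pod above it should report a limit of 16.0:

import os

def cgroup_cpu_limit():
    # cgroup v1 paths; FileNotFoundError most likely means cgroup v2
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
    except FileNotFoundError:
        return None
    if quota == -1:
        return None  # no CPU limit set
    return quota / period

print("os.cpu_count():", os.cpu_count())        # reports the node's CPUs, not the limit
print("cgroup CPU limit:", cgroup_cpu_limit())  # the CFS quota applied to the container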
I tried to simulate the workload in the pod in two ways: with the stress tool (installed via apt install stress) and with a Python program using joblib and Parallel (shown further below).
I ran the stress application inside the pod 3 times with a different number of CPUs used (4, 16, 32):
$ stress -c 4
$ stress -c 16
$ stress -c 32
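If installing stress is not an option in a given image, a rough Python equivalent of stress -c N can be sketched like this (the file name busy.py is just an example):

# busy.py - rough equivalent of `stress -c N`
import sys
from multiprocessing import Process

def spin():
    # tight loop that keeps one core busy until the process is killed
    while True:
        pass

if __name__ == "__main__":
    n = int(sys.argv[1])  # number of busy workers, like `stress -c N`
    workers = [Process(target=spin) for _ in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()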
With stress -c 4, 4 cores are fully used by this application.
The output of the command $ kubectl top node shows that the CPU usage is around 18%:
❯ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-gke-high-default-pool-f7a17f78-0696 4090m 18% 948Mi 1%
One core out of the node's ~22 allocatable cores is around 4.5%; multiplying that by 4 gives about 18%, which matches the output above.
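The same arithmetic, spelled out (the node's allocatable CPU of roughly 22.7 cores is inferred from 4090m being reported as 18%, so the numbers are approximate):

node_cores = 4.090 / 0.18          # ~22.7 allocatable cores inferred from kubectl top
one_core_pct = 100 / node_cores    # ~4.4% of the node per fully busy core
print(one_core_pct * 4)            # ~17.6%, matching the ~18% reported above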
Looking into GCP Monitoring, you can see utilization at about 0.25 (which is 4 cores out of the 16 allocated).
With stress -c 16, all 16 cores are fully used by this application.
The output of the command $ kubectl top node shows that the CPU usage is around 73%:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-gke-high-default-pool-f7a17f78-0696 16168m 73% 945Mi 1%
One core out of ~22 is around 4.5%; multiplying that by 16 gives about 72%, which is nearly the same as the output above.
Looking into GCP Monitoring, you can see utilization at about 1.0 (which is 16 cores out of the 16 allocated).
There is no way to run a 32-core workload on 16 cores without dividing the time between the processes. Take a look at how Kubernetes manages it: with stress -c 32, all cores are used, but each only at about 73%, which adds up to roughly 16 cores at 100% usage.
The $ kubectl top node and GCP Monitoring outputs will look the same as in the 16-core case.
I also tried a basic Python program with joblib and Parallel, like the one below:
from joblib import Parallel, delayed
import sys

my_list = range(20000)
squares = []

# Function to parallelize
def find_square(i):
    return i ** 131072

# With parallel processing
squares = Parallel(n_jobs=int(sys.argv[1]), verbose=1)(
    delayed(find_square)(i) for i in my_list)
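The number of jobs is taken from the first command-line argument, so each run looked like the following (assuming the script is saved as squares.py; the name is just an example):
$ python squares.py 4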
I ran the above program with the number of cores ranging from 4 to 16. The results are below:
[Parallel(n_jobs=4)]: Done 20000 out of 20000 | elapsed: 7.0min finished
[Parallel(n_jobs=5)]: Done 20000 out of 20000 | elapsed: 5.6min finished
[Parallel(n_jobs=6)]: Done 20000 out of 20000 | elapsed: 4.7min finished
[Parallel(n_jobs=7)]: Done 20000 out of 20000 | elapsed: 4.0min finished
[Parallel(n_jobs=8)]: Done 20000 out of 20000 | elapsed: 3.5min finished
[Parallel(n_jobs=9)]: Done 20000 out of 20000 | elapsed: 3.1min finished
[Parallel(n_jobs=10)]: Done 20000 out of 20000 | elapsed: 2.8min finished
[Parallel(n_jobs=11)]: Done 20000 out of 20000 | elapsed: 2.6min finished
[Parallel(n_jobs=12)]: Done 20000 out of 20000 | elapsed: 2.6min finished
[Parallel(n_jobs=13)]: Done 20000 out of 20000 | elapsed: 2.6min finished
[Parallel(n_jobs=14)]: Done 20000 out of 20000 | elapsed: 2.5min finished
[Parallel(n_jobs=15)]: Done 20000 out of 20000 | elapsed: 2.5min finished
[Parallel(n_jobs=16)]: Done 20000 out of 20000 | elapsed: 2.5min finished
As you can see, there is a clear difference in computation time between 4 cores and 16.
The graph below shows the CPU usage for the pod that was running these tests. As you can see, there is a gradual increase in usage. Please keep in mind that the algorithm used will have a huge impact on CPU usage.
If the program had:
- n_jobs=4, it would run on 4 cores at 100%, which would translate to about 25% CPU utilization for the pod.
- n_jobs=16, it would run on all cores at about 100% usage, which would translate to about 100% CPU utilization for the pod.
Please let me know if you have any questions.