I am running a Job as defined here, i.e. "parallel processing using a work queue", on GKE.
Each Pod runs a single container, and that container runs an R script that takes around 5 minutes to finish, after which the Pod completes successfully.
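For context, each worker follows the standard work-queue pattern: it pulls one task message off the RabbitMQ queue, runs the R script for that task, acks the message and exits. The sketch below shows that flow in Python purely for illustration; the real container runs an R script, and the script path and message format here are assumptions.

import os
import subprocess
import sys

import pika

# BROKER_URL and QUEUE are injected through the env section of job.yml.
broker_url = os.environ["BROKER_URL"]
queue_name = os.environ["QUEUE"]

connection = pika.BlockingConnection(pika.URLParameters(broker_url))
channel = connection.channel()

# Pull exactly one task message off the work queue.
method, properties, body = channel.basic_get(queue=queue_name, auto_ack=False)
if method is None:
    # Queue is already empty: nothing left to do, exit so the Pod completes.
    connection.close()
    sys.exit(0)

task_id = body.decode()

# Run the R script for this task (~5 minutes); the script path is hypothetical.
result = subprocess.run(["Rscript", "/opt/work/process.R", task_id])

if result.returncode == 0:
    channel.basic_ack(delivery_tag=method.delivery_tag)   # task done
else:
    channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection.close()
sys.exit(result.returncode)

With restartPolicy: Never in job.yml, a non-zero exit here surfaces as a failed Pod rather than a container restart.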
When I run the Job with a small number of completions, like
completions: 606
parallelism: 450
backoffLimit: 1
everything runs fine: the cluster scales up and down properly, and the Job finishes.
But when I run the Job with a spec like
completions: 37572
parallelism: 1610
backoffLimit: 1
the Pod succeeded count increases for a while, but after that it stalls between 1000 and 1500 and never reaches the requested number of completions.
Pods are still completing successfully, though: I can see them on the Google Cloud Kubernetes dashboard, and the output files are also being generated. The queue is also showing the progress accurately.
This has happened every time I have run the Job with a high level of parallelism. I have tried different autoscaling node pool setups in my cluster with 64-CPU, 32-CPU and 16-CPU machine types.
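For what it's worth, this is roughly how the Job's reported succeeded count can be compared against the actual output; a minimal sketch assuming the Job name from job.yml, and assuming the fileserver share behind the PVC is also mounted locally at /mnt/fileserver/output (that mount point is an assumption):

import subprocess
from pathlib import Path

# Succeeded count as reported by the Job object (Job name from job.yml).
succeeded = subprocess.run(
    ["kubectl", "get", "job", "my-job", "-o", "jsonpath={.status.succeeded}"],
    capture_output=True, text=True, check=True,
).stdout.strip() or "0"

# Files written by the workers; assumes the fileserver share backing the
# jobs-pv-storage PVC is mounted locally at /mnt/fileserver/output.
files_in_output = sum(1 for p in Path("/mnt/fileserver/output").iterdir() if p.is_file())

print("job.status.succeeded =", succeeded)
print("files_in_output      =", files_in_output)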
Currently, the way I am handling this is:
=> when the queue's consumer count == parallelism OR files_in_output == completions
=> I delete the Job and delete the autoscaling node pool.
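A rough sketch of that teardown check follows; it is illustrative only, and the node pool name (work-pool), cluster name (my-cluster), local output mount point and the passive queue_declare used to read the consumer count are all assumptions, not my exact script:

import os
import subprocess
from pathlib import Path

import pika

PARALLELISM = 1610
COMPLETIONS = 37572
OUTPUT_DIR = Path("/mnt/fileserver/output")   # assumed local mount of the PVC share

# Read the queue's current consumer count without touching the queue
# (a passive declare only inspects it).
connection = pika.BlockingConnection(pika.URLParameters(os.environ["BROKER_URL"]))
frame = connection.channel().queue_declare(queue="job-queue", passive=True)
consumers = frame.method.consumer_count
connection.close()

files_in_output = sum(1 for p in OUTPUT_DIR.iterdir() if p.is_file())

if consumers == PARALLELISM or files_in_output == COMPLETIONS:
    # Tear down the Job and the autoscaling node pool once the work is done.
    subprocess.run(["kubectl", "delete", "job", "my-job"], check=True)
    subprocess.run(
        ["gcloud", "container", "node-pools", "delete", "work-pool",
         "--cluster", "my-cluster", "--quiet"],
        check=True,
    )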
Please find the cluster details below.
The cluster status is always green during the run.
QUESTION
Why does the Job's succeeded count stall and never reach the requested number of completions at this level of parallelism, even though the Pods themselves complete successfully, and how can I get the Job to finish on its own?
Update:
GKE ISSUE REPORTED => https://issuetracker.google.com/issues/114650730
job.yml
apiVersion: batch/v1
kind: Job
metadata:
  # Unique key of the Job instance
  name: my-job
spec:
  completions: 37572
  parallelism: 1610
  backoffLimit: 1
  template:
    metadata:
      name: my-job
      labels:
        jobgroup: my-jobs
    spec:
      volumes:
        - name: jobs-pv-storage
          persistentVolumeClaim:
            claimName: fileserver-claim
            readOnly: false
      containers:
        - name: rscript
          image: gcr.io/project/image:v1
          resources:
            limits:
              cpu: "1"
              memory: 1200Mi
            requests:
              cpu: "1"
              memory: 1000Mi
          env:
            - name: BROKER_URL
              value: amqp://user:pwd@rabbitmq-service:5672
            - name: QUEUE
              value: job-queue
          volumeMounts:
            - mountPath: /opt/work/output
              name: jobs-pv-storage
      # Do not restart containers after they exit
      restartPolicy: Never