Kubernetes does not track Pod success count for Job completion

8/31/2018

I am running a Job on GKE as defined here, i.e. "parallel processing using a work queue".

Each Pod runs a single container. That container runs an R script that takes around 5 minutes to finish, after which the Pod completes successfully.

When I run the Job with a small number of completions, like

completions: 606
parallelism: 450
backoffLimit: 1

everything runs fine: the cluster scales up and down properly and the Job finishes.

But when I run the Job with a specification like

completions: 37572
parallelism: 1610
backoffLimit: 1

the Pod succeeded count increases for a while, but then it stays between 1000 and 1500 and never reaches the completions count.


The Pods are completing successfully, though: I can see them on the Google Cloud Kubernetes dashboard, and output files are being generated successfully. The queue is also showing the progress accurately.

This has happened every time I have run the Job with a high parallelism count. I have tried different autoscaling node pool setups in my cluster, with 64-CPU, 32-CPU, and 16-CPU machine types.

Currently, the way I am handling it is:
=> when the queue's number of consumers == parallelism OR files_in_output == completions,
=> I delete the Job and delete the autoscaling node pool.
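For illustration, the manual teardown check above can be sketched as a small helper. This is a hypothetical sketch, not part of any Kubernetes API; the function and argument names are my own, and in practice the inputs would come from the RabbitMQ management API and a count of files in the output directory.

```python
# Hypothetical sketch of the manual teardown check described above.
# In the real workflow, queue_consumers comes from RabbitMQ and
# output_files from counting files under /opt/work/output.

def should_tear_down(queue_consumers: int, output_files: int,
                     parallelism: int, completions: int) -> bool:
    """Return True when the run is effectively done, i.e. when the
    Job and the autoscaling node pool can be deleted."""
    return queue_consumers == parallelism or output_files == completions
```

With the spec below, e.g. `should_tear_down(1610, 0, 1610, 37572)` and `should_tear_down(0, 37572, 1610, 37572)` both signal teardown.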

Please find the cluster details in the attached screenshot.

Cluster status is always green during the run.

QUESTION

  • Why does the Job completion count never increase past a certain point (in my case, a point below my parallelism count), even though Pods are finishing successfully?
  • Worse, the Job completion count sometimes decreases, which I can't understand at all. Why would Kubernetes behave this way?
  • Do I need to add additional fields to my spec template so that it tracks Job completions properly?

Update:

  • I have enough CPU quota.
  • Each container (Pod) is limited to 1 CPU and 1 GB RAM.
  • I have also upgraded the cluster and node pools to version 1.10.6-gke.2. No luck.

GKE ISSUE REPORTED => https://issuetracker.google.com/issues/114650730

job.yml

apiVersion: batch/v1
kind: Job
metadata:
  # Unique key of the Job instance
  name: my-job
spec:
  completions: 37572
  parallelism: 1610
  backoffLimit: 1
  template:
    metadata:
      name: my-job
      labels:
        jobgroup: my-jobs
    spec:
      volumes:
      - name: jobs-pv-storage
        persistentVolumeClaim:
          claimName: fileserver-claim
          readOnly: false
      containers:
      - name: rscript
        image: gcr.io/project/image:v1
        resources:
          limits:
            cpu: "1"
            memory: 1200Mi
          requests:
            cpu: "1"
            memory: 1000Mi
        env:
        - name: BROKER_URL
          value: amqp://user:pwd@rabbitmq-service:5672
        - name: QUEUE
          value: job-queue
        volumeMounts:
        - mountPath: /opt/work/output
          name: jobs-pv-storage
      # Do not restart containers after they exit
      restartPolicy: Never
-- Rahul Gautam
google-cloud-platform
google-kubernetes-engine
kubernetes
parallel-processing

0 Answers