Pods of job get killed because of propagation of object cache

10/11/2019

We are trying to get some analysis code running on a Kubernetes cluster. We want to run 10 pods of 1 job with the following YAML file:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-platooning-s1b
spec:
  parallelism: 10
  template:
    metadata:
      name: job-platooning-s1b
    spec:
      containers:
      - name: platooning-dp-1b
        image: registry.gitlab.com/company_name/repo_name/platooning:latest
        command: ["python3", "/app/scenario_1b_cluster.py"]
      restartPolicy: 'OnFailure'
      imagePullSecrets:
      - name: regcred-nextsys

Our 10 pods can survive for a few minutes before getting killed. The error that I get is: MountVolume.SetUp failed for volume "default-token-7td4s" : couldn't propagate object cache: timed out waiting for the condition.

My thinking is that the pods consume too much memory. We tried to specify memory requests and limits by adding the following parameters under containers in the YAML file:

resources:
  limits:
    memory: "15Gi"
  requests:
    memory: "500Mi"
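
To be explicit about where these parameters go, the container entry in the spec above ends up looking roughly like this (same image and command as above):

containers:
- name: platooning-dp-1b
  image: registry.gitlab.com/company_name/repo_name/platooning:latest
  command: ["python3", "/app/scenario_1b_cluster.py"]
  resources:
    limits:
      memory: "15Gi"
    requests:
      memory: "500Mi"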

But it does not help, as pods are still terminated. Running the job with 1 pod is fine, as it does not get killed. In the end, we want to have a scalable solution where multiple scenarios with multiple pods could be run overnight.

Do you have any idea about why the pods are getting killed in this scenario?

-- Pavlo Bazilnskyy
cluster-computing
kubernetes
python

3 Answers

10/14/2019

Sorry, I didn't notice that you don't use minikube. I have corrected my answer.

Check which Kubernetes version you use. Your logs indicate you are running 1.12.3.

This was resolved in 1.12.7 in #74755.

Here you can find more information: cache secret/configmap behavior.
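
For example, you can confirm the client, API server, and node versions with standard kubectl commands (generic commands, not from the linked issue):

# Client and API server versions
kubectl version

# Kubelet version running on each node (see the VERSION column)
kubectl get nodes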

Hope it helps.

-- MaggieO
Source: StackOverflow

10/11/2019

You are using a Kubernetes resource of type Job, not a Pod; those are very different things. And the job you are running has not even started, because it cannot mount the default token, which is another Kubernetes resource that should show up when you list your Secrets.

Most likely, when you create the job, its pods stay in a ContainerCreating state forever. Run kubectl get pods to see this, and run kubectl get secrets to find the default token.
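
A quick check along those lines (standard kubectl commands; add -n <namespace> if the job is not in the default namespace):

# Pods created by the job; a stuck pod shows STATUS ContainerCreating
kubectl get pods

# Secrets in the namespace; the service-account token appears as default-token-<suffix>
kubectl get secrets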

-- Rodrigo Loza
Source: StackOverflow

10/14/2019

I'm working together with Pavlo and encountered this issue. What I understand so far is that the pods are running properly when they run one by one (without parallelism). When we try to run a lot of them together, they run for a bit and then get killed (evicted?), sometimes producing this error:

MountVolume.SetUp failed for volume "default-token-7td4s" : couldn't propagate object cache: timed out waiting for the condition

The strange thing is that the secret this job uses is not default-token-7td4s but regcred-nextsys, as seen in the job YAML file. Is that expected behaviour? And if so, why does it actually fail? I'm suspecting a race condition, or just different pods trying to mount the same resource, but I'm not sure that makes sense. The other reason I suspect is a memory issue.

We are running Kubernetes as a managed service from DigitalOcean.

-- Fotis Paraschiakos
Source: StackOverflow