We are trying to run some analysis code on our Kubernetes cluster. We want to run one job with 10 parallel pods, using the following YAML file:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-platooning-s1b
spec:
  parallelism: 10
  template:
    metadata:
      name: job-platooning-s1b
    spec:
      containers:
      - name: platooning-dp-1b
        image: registry.gitlab.com/company_name/repo_name/platooning:latest
        command: ["python3", "/app/scenario_1b_cluster.py"]
      restartPolicy: 'OnFailure'
      imagePullSecrets:
      - name: regcred-nextsys
Our 10 pods can survive for a few minutes before getting killed. The error that I get is:
MountVolume.SetUp failed for volume "default-token-7td4s" : couldn't propagate object cache: timed out waiting for the condition
My thinking is that the pods consume too much memory. We tried to specify memory usage by adding the following parameters under containers in the YAML file:
resources:
  limits:
    memory: "15Gi"
  requests:
    memory: "500Mi"
But it does not help, as the pods are still terminated. Running the job with 1 pod works fine and it does not get killed. In the end, we want a scalable solution where multiple scenarios, each with multiple pods, can be run overnight.
Do you have any idea why the pods are getting killed in this scenario?
Sorry, I didn't notice that you don't use minikube. I have corrected my answer.
Check which Kubernetes version you use. Your logs indicate you are running 1.12.3.
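To confirm, you can check both the API server and the node (kubelet) versions directly. A minimal sketch, assuming kubectl is pointed at the affected cluster:

# Client and server (control plane) versions
kubectl version

# Kubelet version of each node, shown in the VERSION column
kubectl get nodes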
This was resolved in 1.12.7 in #74755.
Here you can find more information: cache secret/configmap behavior.
Hope it helps.
You are using a Kubernetes resource of type Job, not a Pod. Those are very different things. And the Job you are running has not even started, as it cannot mount the default token, which is another Kubernetes resource that should show up when you list your Secrets.
Most likely, when you create the Job, its pods stay in a ContainerCreating state forever. Run kubectl get pods to see this, and run kubectl get secrets to find the default token.
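As a sketch of those checks (the kubectl describe step is an extra suggestion, and <pod-name> is a placeholder for one of the pod names returned by the first command):

# List the Job's pods and their current state (look for ContainerCreating or evictions)
kubectl get pods

# List the secrets in the namespace; the default service-account token
# (default-token-xxxxx) should appear here alongside regcred-nextsys
kubectl get secrets

# Show the events for a stuck pod, including volume mount failures
kubectl describe pod <pod-name>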
I'm working together with Pavlo and encountered this issue. What I understand so far is that the pods run properly when they run one by one (without parallelism). When we try to run a lot of them together, they run for a bit and then get killed (evicted?), sometimes producing this error:
MountVolume.SetUp failed for volume "default-token-7td4s" : couldn't propagate object cache: timed out waiting for the condition
The strange thing is that the secret this job uses is not default-token-7td4s but regcred-nextsys, as seen in the job YAML file. Is that expected behaviour? And if so, why does it actually fail? I'm suspecting a race condition, or just different pods trying to mount the same resource, but I'm not sure that makes sense. The other cause I suspect is a memory issue.
We are running Kubernetes as a managed service from DigitalOcean.