I am running batch computations (Monte Carlo) from a Docker image, as multiple jobs on Google Cloud managed by Kubernetes. But something (the replication controller, I guess?) keeps restarting the same computation again and again because of the default restart policy.
Is there now a way to let pods die? Or is there some other workaround for garbage-collecting pods?
Now that v1.0 is out, better native support for batch computations is one of the team's top priorities, but it is already quite possible to run them.
If you run something as a pod rather than as a replication controller, you can set the restartPolicy field on it. The OnFailure policy is probably what you want: Kubernetes will restart a pod that exited with a non-zero exit code, but won't restart a pod that exited with zero.
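As a rough sketch of what such a bare pod could look like, here is a minimal manifest with restartPolicy set to OnFailure. The pod name, image, and command are hypothetical placeholders for your own Monte Carlo container, not anything from the question:

```yaml
# Sketch only: a single pod, not wrapped in a replication controller.
apiVersion: v1
kind: Pod
metadata:
  name: montecarlo-run-1                    # hypothetical name
spec:
  restartPolicy: OnFailure                  # restart on non-zero exit, leave alone on exit 0
  containers:
  - name: montecarlo
    image: example/montecarlo:latest        # hypothetical image
    command: ["./simulate", "--seed", "1"]  # hypothetical entry point
```

A file like this can be submitted with kubectl create -f, and the pod should stay in a terminated state once the container exits with status zero.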
If you're using kubectl run to start your pods, though, I'm unfortunately not aware of a way to have it create just a pod rather than a replication controller. If you'd like something like that, it'd be great if you opened an issue requesting it as an option.
As of November 2015, Kubernetes v1.1.1 provides a Jobs API: https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/jobs.md
The following is a simple job that executes the date command once per second for 60 seconds:
$ cat job.yaml
apiVersion: extensions/v1beta1
kind: Job
metadata:
  name: example
spec:
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      name: example
      labels:
        app: example
    spec:
      containers:
      - name: example
        image: debian
        command: ["timeout", "60", "bash", "-c", "while sleep 1; do date;done"]
      restartPolicy: Never
Run the job on your Kubernetes cluster:
$ cluster/kubectl.sh create -f job.yaml
job "example" created
Retrieve the pod id:
$ cluster/kubectl.sh get pods
NAME            READY     STATUS    RESTARTS   AGE
example-3nxin   1/1       Running   0          15s
Now check the logs for the pod:
$ cluster/kubectl.sh logs example-3nxin
Sat Dec 5 04:47:12 UTC 2015
Sat Dec 5 04:47:13 UTC 2015
Sat Dec 5 04:47:14 UTC 2015
Sat Dec 5 04:47:15 UTC 2015
Sat Dec 5 04:47:16 UTC 2015
Sat Dec 5 04:47:17 UTC 2015
Sat Dec 5 04:47:18 UTC 2015
Sat Dec 5 04:47:19 UTC 2015
Sat Dec 5 04:47:20 UTC 2015
Sat Dec 5 04:47:21 UTC 2015
Sat Dec 5 04:47:22 UTC 2015
Sat Dec 5 04:47:23 UTC 2015
Sat Dec 5 04:47:24 UTC 2015
Sat Dec 5 04:47:25 UTC 2015
Sat Dec 5 04:47:26 UTC 2015
Sat Dec 5 04:47:27 UTC 2015
Sat Dec 5 04:47:28 UTC 2015
Sat Dec 5 04:47:29 UTC 2015
Optionally, you can set the restartPolicy to OnFailure, so that if the job exits with a non-zero exit status, it is restarted.
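For illustration, this is roughly how the pod template in job.yaml would change. It is only a fragment, with the rest of the manifest assumed to stay as shown above:

```yaml
# Fragment of job.yaml: only the restartPolicy line differs from the original.
spec:
  template:
    spec:
      containers:
      - name: example
        image: debian
        command: ["timeout", "60", "bash", "-c", "while sleep 1; do date;done"]
      restartPolicy: OnFailure   # was: Never
```

One caveat, as far as I can tell: GNU timeout exits with status 124 when it has to kill the command, so this particular example would count as a failure and be restarted under OnFailure. A real batch job that exits zero on success should not have that problem.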