Make a long computation Kubernetes Job recover from node failures

4/11/2018

I am setting up a Kubernetes cluster that runs a large number of long computation Jobs, all of them single replica. The processes often die because 1) the container crashes, or 2) the node fails due to some hardware problem. I want to be able to recover from these failures, since the jobs often take weeks to finish.

I can easily recover from failures of type 1 by using an emptyDir volume and writing intermediate checkpoints to /emptydir/checkpoint.txt, which is local to each Pod. However, it's not clear to me how I can recover from node failures.
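For reference, this is roughly the Job spec I use for that (the Job name and image are placeholders). With restartPolicy: OnFailure the container is restarted inside the same Pod, and the emptyDir volume outlives container restarts, so the process can pick up /emptydir/checkpoint.txt again:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: long-computation          # placeholder name
    spec:
      backoffLimit: 100               # keep retrying after container crashes
      template:
        spec:
          restartPolicy: OnFailure    # restart the container in the same Pod
          containers:
            - name: worker
              image: my-computation:latest       # placeholder image
              volumeMounts:
                - name: scratch
                  mountPath: /emptydir           # checkpoints go to /emptydir/checkpoint.txt
          volumes:
            - name: scratch
              emptyDir: {}            # survives container restarts, but not Pod eviction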

I have a centralized NFS share accessible from all nodes; however, it's a lot of work to provide a unique NFS path to each job (I have lots of them). I was thinking that maybe each Pod should write its checkpoints to some random path on the NFS share, and somehow communicate this random path to the next Pod when it fails. Is there any way for a Pod to communicate anything to its successor at failure time? Is that the way to go?

Please keep it simple; I'm very new to Kubernetes.

Thanks!

-- Pro.Hessam
cluster-computing
disaster-recovery
kubernetes

1 Answer

4/11/2018

Unfortunately, Kubernetes does not provide any built-in way for a failed Pod to pass information to the Pod that replaces it.

I see two ways you could implement the path-saving procedure:

  1. Use a third-party consistent store such as Consul or etcd to record the randomly generated path (see the first sketch after this list).

  2. Generate a ConfigMap with the NFS path before starting the job. The ConfigMap will contain a static NFS path, which will be the same for the first container and for the recovered one (see the second sketch below).
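For option 1, here is a rough sketch (the etcd endpoint, key, images, and the worker's --resume-from flag are all assumptions, not anything Kubernetes mandates). An initContainer reads the stored checkpoint path from etcd and passes it to the main container through a small emptyDir volume; the job is expected to have registered the path earlier with etcdctl put:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-42
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          initContainers:
            - name: fetch-checkpoint-path
              image: bitnami/etcd:latest
              command:
                - sh
                - -c
                # read the path stored under /jobs/job-42/path and hand it over via a file
                - >
                  etcdctl --endpoints=http://etcd.example.com:2379
                  get --print-value-only /jobs/job-42/path
                  > /shared/checkpoint-path
              volumeMounts:
                - name: shared
                  mountPath: /shared
          containers:
            - name: worker
              image: my-computation:latest       # placeholder image
              command:
                - sh
                - -c
                # hypothetical flag: resume the computation from the saved path
                - ./compute --resume-from "$(cat /shared/checkpoint-path)"
              volumeMounts:
                - name: shared
                  mountPath: /shared
          volumes:
            - name: shared
              emptyDir: {}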
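And a sketch of option 2 (the NFS server address, paths, and image are assumptions). The ConfigMap pins a per-job checkpoint directory on the shared NFS, so a Pod rescheduled onto another node finds the same checkpoints; the worker reads the directory from the CHECKPOINT_DIR environment variable:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: job-42-paths                  # one ConfigMap per job
    data:
      checkpoint_dir: /checkpoints/job-42
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-42
    spec:
      backoffLimit: 100
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: my-computation:latest   # placeholder image
              env:
                - name: CHECKPOINT_DIR       # the process writes checkpoints here
                  valueFrom:
                    configMapKeyRef:
                      name: job-42-paths
                      key: checkpoint_dir
              volumeMounts:
                - name: nfs
                  mountPath: /checkpoints
          volumes:
            - name: nfs
              nfs:
                server: nfs.example.com      # assumed NFS server
                path: /exports/checkpoints   # exported directory holding all jobs

Since the path here is static per job, no failure-time communication is needed: the recovered Pod mounts the same directory automatically.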

-- Anton Kostenko
Source: StackOverflow