Consider a Linux cluster of N nodes that needs to run M tasks. Each task can run on any node. Assume the cluster is up and working normally.
Question: what's the simplest way to monitor that the M tasks are running, and, if a task exits abnormally (exit code != 0), start a replacement task on any node that is up? Ignore network partitions.
Two of the M tasks have a dependency: if task 'm' goes down, task 'm1' should be stopped. Then 'm' is started, and once it is up, 'm1' can be restarted. 'm1' depends on 'm'. I can provide an orchestration script for this.
I eventually want to work up to Kubernetes, which does self-healing, but I'm not there yet.
The right (tm) way to do this is to set up a task retry, potentially with a back-off strategy. There are many similar questions here on Stack Overflow about how to do this; this is one of them.
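As a rough sketch, assuming a Celery app named `app`, an AMQP broker on localhost, and a hypothetical task `process` calling a hypothetical `do_work` function, a retry with exponential back-off can look like this:

```python
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')  # assumed broker URL

@app.task(bind=True, max_retries=5)
def process(self, item):
    try:
        return do_work(item)  # hypothetical work function
    except Exception as exc:
        # Retry with exponential back-off: wait 1s, 2s, 4s, ... between attempts.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
```

After `max_retries` is exhausted, the original exception is re-raised and the task is marked as failed, so a back-off retry and an external monitor are complementary rather than mutually exclusive.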
If you still want to do the monitoring and explicit task restart yourself, you can implement a service based on Celery's task events that will do it for you. It is extremely simple, and proof of how brilliant Celery is. The service should handle the task-failed event. An example of how to do it is on the same page.
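A minimal sketch of such a monitor, modeled on the event-receiver example in the Celery monitoring docs (the broker URL and the resubmission step are assumptions; workers must be started with the `-E` flag so they emit events):

```python
from celery import Celery

def monitor(app):
    state = app.events.State()

    def on_task_failed(event):
        state.event(event)
        task = state.tasks.get(event['uuid'])
        print('task failed: %s[%s] %s' % (task.name, task.uuid, task.info()))
        # Here you would resubmit the work, e.g. app.send_task(task.name, task.args)

    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            'task-failed': on_task_failed,
            '*': state.event,  # keep the state object up to date for all events
        })
        recv.capture(limit=None, timeout=None, wakeup=True)

if __name__ == '__main__':
    app = Celery(broker='amqp://guest@localhost//')  # assumed broker URL
    monitor(app)
```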
If you just need an initialization task to run before each computation task, you can use the Job
concept along with an init container. A Job runs a pod once until completion, and Kubernetes will restart it if it crashes. Init containers run before the actual pod containers start and are used for initialization tasks: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
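A minimal sketch of such a Job manifest with an init container (the name, images, and commands are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: compute-task        # placeholder name
spec:
  backoffLimit: 3           # Kubernetes retries the pod up to 3 times on failure
  template:
    spec:
      restartPolicy: Never
      initContainers:
      - name: init          # runs to completion before the worker container starts
        image: busybox      # placeholder image
        command: ["sh", "-c", "echo preparing input data"]
      containers:
      - name: worker
        image: busybox      # placeholder image
        command: ["sh", "-c", "echo running the computation"]
```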