We have several pods (exposed as Services/Deployments) in our Kubernetes setup that depend on each other, such that if one goes into a CrashLoopBackOff
state, all of these services need to be redeployed.
Instead of having to do this manually, is there a programmatic way of handling it?
Of course we are trying to figure out why the pod in question is crashing.
The first thing to do is make sure that the pods are started in the correct sequence. This can be done using initContainers, like this:
spec:
  initContainers:
  - name: waitfor
    image: jwilder/dockerize
    args:
    - -wait
    - "http://config-srv/actuator/health"
    - -wait
    - "http://registry-srv/actuator/health"
    - -wait
    - "http://rabbitmq:15672"
    - -timeout
    - 600s
Here your pod will not start until all of the services in the list respond to HTTP probes.
Next, you may want to define a liveness probe that periodically runs curl against the same services:
spec:
  livenessProbe:
    exec:
      command:
      - /bin/sh
      - -c
      - curl -f http://config-srv/actuator/health &&
        curl -f http://registry-srv/actuator/health &&
        curl -f http://rabbitmq:15672
Now if any of those services fails, your pod will fail the liveness probe, be restarted, and wait (via the init container) for the services to come back online.
That's just an example of how it can be done; in your case the checks may of course be different.
If these are so tightly dependent on each other, I would consider these options:
a) Rearchitect your system to be more resilient towards failure and to tolerate a pod being temporarily unavailable.
b) Put all parts into one pod as separate containers, making the atomic design more explicit.
If neither of these fits your needs, you can use the Kubernetes API to create a program that automates the task of restarting all dependent parts. There are client libraries for multiple languages and integration is quite easy. The next step would be a custom resource definition (CRD) so you can manage your own system using an extension to the Kubernetes API.
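For example, here is a minimal sketch using the official Kubernetes Python client. The namespace, label selector, and Deployment names are hypothetical placeholders for your own workloads; the idea is to watch pod events and, whenever the watched pod enters CrashLoopBackOff, trigger a rolling restart of the dependent Deployments (the same pod-template annotation patch that kubectl rollout restart applies):

import datetime

from kubernetes import client, config, watch

NAMESPACE = "default"                      # assumption: everything runs in "default"
CRASHING_APP_SELECTOR = "app=config-srv"   # hypothetical label of the pod that tends to crash
DEPENDENT_DEPLOYMENTS = ["registry-srv", "api-gateway"]  # hypothetical dependent Deployments


def in_crash_loop(pod) -> bool:
    """True if any container of the pod is waiting in CrashLoopBackOff."""
    for status in (pod.status.container_statuses or []):
        waiting = status.state.waiting
        if waiting and waiting.reason == "CrashLoopBackOff":
            return True
    return False


def rollout_restart(apps: client.AppsV1Api, name: str) -> None:
    """Trigger a rolling restart by patching the pod template annotation,
    which is what `kubectl rollout restart deployment/<name>` does."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    patch = {"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt": now}}}}}
    apps.patch_namespaced_deployment(name=name, namespace=NAMESPACE, body=patch)


def main() -> None:
    config.load_kube_config()   # use config.load_incluster_config() when run inside the cluster
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # Stream pod events for the crashing service and react to CrashLoopBackOff.
    for event in watch.Watch().stream(core.list_namespaced_pod,
                                      namespace=NAMESPACE,
                                      label_selector=CRASHING_APP_SELECTOR):
        pod = event["object"]
        if in_crash_loop(pod):
            for deployment in DEPENDENT_DEPLOYMENTS:
                rollout_restart(apps, deployment)


if __name__ == "__main__":
    main()

Note that this sketch does not debounce repeated CrashLoopBackOff events, so in practice you would add some rate limiting or track which restarts you have already issued; a CRD with an operator/controller is essentially a more structured version of this loop.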