I have an application which will be deployed in an OpenShift/Kubernetes cluster as a pod. I know this is against the principles of Kubernetes, but this pod should only run once (so there shall be no parallel processing). There may be a second pod standing by in case the first one crashes, ready to take over immediately. My question now is: how would you implement a "lock" and ensure that this lock is released when the container crashes?
My first idea is to write a "locked" attribute to the database we use. As long as the attribute is set, the second pod won't do anything. Once the processing pod crashes, it should release this lock. But how can the lock be released when the application has already crashed?
Thanks for your ideas in advance!
This has to be done by the application, because Kubernetes deliberately does not provide these application-tier primitives out of the box. There are ways to orchestrate things such that Kubernetes will generally keep only one pod running, but the guarantees offered by that orchestration are limited.
Using a durable datastore to coordinate ownership of a responsibility at the application tier is a good idea, and using a "locked" attribute or similar is also fine. Liveness is typically ensured by having the responsible application periodically update a last-update timestamp in that datastore. The secondary application can then keep tabs on the interval since the last update.
In this kind of solution there also has to be a way of taking ownership of the responsibility, which can be done by having the "locked" attribute be an application instance ID.
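As a sketch of what that might look like, here is a single-row lock table holding the owner's instance ID and its last heartbeat. The table and column names (`leader_lock`, `locked_by`, `last_update`) are just illustrative, and SQLite is used only to keep the example self-contained; any durable datastore with atomic updates works the same way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE leader_lock (
        id          INTEGER PRIMARY KEY CHECK (id = 1),  -- exactly one row
        locked_by   TEXT,   -- application instance ID of the current owner
        last_update REAL    -- Unix timestamp of the owner's last heartbeat
    )
""")
# Seed the single row with no owner and an ancient timestamp,
# so the first instance to attempt a takeover will succeed.
conn.execute(
    "INSERT INTO leader_lock (id, locked_by, last_update) VALUES (1, NULL, 0)"
)
conn.commit()
```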
So, at periodic intervals, the responsible instance of the coordinated application does a SELECT ... FOR UPDATE (or an equivalent atomic conditional UPDATE) to refresh the timestamp for its application instance ID. The update only succeeds if that instance still owns the responsibility.
The backup instance periodically checks the last-updated timestamp. If the interval since the last update exceeds the timeout, the backup instance attempts an atomic update to change the locked attribute to its own application instance ID, again only if the last-update timestamp is still too old.
One has to be a little careful about race conditions, using transactions and the datastore's atomicity guarantees appropriately. Also, when work fails or gets interrupted, there has to be a way to retry or roll back appropriately.
But for many cases this kind of simple solution is fine. Hope that helps.