I have an HA Kubernetes setup with 3 replicated master nodes and a few worker nodes, split across 3 zones (these are AWS availability zones, but it could also be 3 virtualized hardware machines or similar). One of the services (or rather, that service's pods) forms a cluster across zones so that it stays available if one zone goes down. The pods are distributed using anti-affinity rules. I'll refer to a single application instance running inside the service's pods as an "application node" (as opposed to "node", which simply means a Kubernetes node).
The clustered application is capable of detecting a network partition and avoids split-brain scenarios by shutting down the application nodes in the pods that end up in the minority partition. Let's consider a layout where the application nodes are spread across zones A, B and C:
In case of a network partition between (A, B) and (C), the application node running in zone C would shut itself down.
Now the trouble is that the master in zone C is going to re-create the pods for that service, leading to the formation of an entirely new application cluster, which is exactly what we want to avoid in this case.
I'd like to tell Kubernetes not to recreate pods for this service in zone C until the network partition is resolved. As far as I can see, this would involve:
1) telling Kubernetes not to recreate the pod in zone C
2) telling Kubernetes to allow creating pods in zone C once the network partition is over
I think this could be achieved via node taints that would be created & removed accordingly.
For 1), ideally I'd like to be able to signal this via an exit code, although I don't think this is available. I can set up a node taint programmatically by calling the Kubernetes API from the application node in zone C before it shuts itself down, though it would probably be nicer if this behaviour could be declared in the deployment.
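For illustration, here is a rough sketch of what that API call could look like with the official Python client. The taint key `example.com/zone-partitioned` and the node name are made up for this example, and the pod's service account would need RBAC permission to read and patch nodes:

```python
from kubernetes import client, config

config.load_incluster_config()  # running inside the pod
v1 = client.CoreV1Api()

TAINT_KEY = "example.com/zone-partitioned"  # hypothetical taint key

def taint_node(node_name: str) -> None:
    """Add a NoSchedule taint so that no new pods get scheduled onto this node."""
    node = v1.read_node(node_name)
    taints = node.spec.taints or []
    if not any(t.key == TAINT_KEY for t in taints):
        taints.append(client.V1Taint(key=TAINT_KEY, value="true", effect="NoSchedule"))
        v1.patch_node(node_name, {"spec": {"taints": taints}})

# Called right before the application node in zone C shuts itself down.
taint_node("worker-zone-c-1")
```

The service's pods must not tolerate that taint, of course, otherwise they would still get scheduled onto the tainted node.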
For 2), I am not quite sure how to proceed. Kubernetes probably sees the master and worker nodes in zone C as unhealthy from zones A and B, but I don't know if there is any specific event signalling that they are healthy again which could be leveraged to un-taint the nodes in zone C. I don't think Kubernetes offers this out of the box, so I would probably have to set up this logic at the application layer, listen (?) for events related to node health, and then call the Kubernetes API to un-taint the nodes.
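If I went down that route, I imagine the un-tainting side could look roughly like the sketch below (same hypothetical taint key as above, run from the majority side or from a small controller): watch the Node objects and remove the taint once a node's Ready condition goes back to "True" (while the partition lasts, the node controller reports it as "Unknown"):

```python
from kubernetes import client, config, watch

config.load_kube_config()  # or load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

TAINT_KEY = "example.com/zone-partitioned"  # hypothetical taint key

def untaint_nodes_when_ready() -> None:
    """Watch nodes and drop the taint once a tainted node reports Ready again."""
    w = watch.Watch()
    for event in w.stream(v1.list_node):
        node = event["object"]
        conditions = node.status.conditions or []
        ready = next((c.status for c in conditions if c.type == "Ready"), None)
        taints = node.spec.taints or []
        if ready == "True" and any(t.key == TAINT_KEY for t in taints):
            remaining = [t for t in taints if t.key != TAINT_KEY]
            v1.patch_node(node.metadata.name, {"spec": {"taints": remaining}})

untaint_nodes_when_ready()
```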
My questions would hence be:
1) Is there an API endpoint I could query to get events related to node health, and what type of events would these be?
2) More generally, are there any design considerations / feature plans for the Kubernetes scheduler to address the topic of network partitions / failures? I did not find much information about this in the documentation or in the design document for HA masters. As I see it, there's a need for coordination between cluster-aware applications deployed on Kubernetes and Kubernetes itself.
Assuming the app we're talking about is some sort of externally exposed service, I would suggest that instead of exiting the app when a split is detected, you start returning an error code from your readiness probe. This way you do not shut the pods down (so no recreation), but you mark them as not ready to serve production traffic while the split is happening.
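A rough sketch of what I mean, with `/readyz` and `split_detected()` as placeholder names for whatever your app already has: keep the process running, but have the readiness endpoint return a non-2xx code while the split lasts.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def split_detected() -> bool:
    """Placeholder for the application's own split-brain detection."""
    return False

class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/readyz":
            self.send_response(404)
        elif split_detected():
            # 503 fails the readiness probe: the pod is removed from the
            # Service endpoints, but it is not restarted or recreated.
            self.send_response(503)
        else:
            self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), ReadinessHandler).serve_forever()
```

In the Deployment you would then point an httpGet readinessProbe at that port and path. Just make sure your liveness probe does not use the same failing check, or the pod will be restarted anyway.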