We have a StatefulSet for a service (Druid historicals) that caches a lot of data on local SSDs. (We run one pod per SSD node using taints and affinity.) When we need to replace the underlying machines, pods start up with empty local disks and then take a while to refill their caches. We ideally only want to do planned replacement of nodes (e.g., a GKE node pool upgrade) one node at a time, waiting until the pod on the new node has fully filled its cache before rolling to the next node.
OK, so this means we need to set a PodDisruptionBudget that allows only one pod to be disrupted at a time, and set up a readiness probe that keeps the new pod not ready until its cache has been filled.
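For concreteness, the disruption-budget half looks something like this (a sketch with illustrative names, not our real manifests):

```yaml
# Limit voluntary disruptions (drains, node pool upgrades) to one historical at a time.
# Label and name are hypothetical; match them to your StatefulSet's pod labels.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: druid-historical-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: druid-historical
```

The readiness probe is the part we're unsure about, which is the rest of this question.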
The problem is that the system doesn't really offer a good way for us to ask "has pod X downloaded everything it needs for the system as a whole to be fully replicated?".
What it does let us ask is "is the entire system fully replicated?".
So we are tempted to write a readiness probe that says "not ready unless the entire system is fully replicated". But this means that during node pool upgrades (or other brief periods when the system is not fully replicated), every pod in the StatefulSet would become unready.
My question is: I don't really understand the full implications of every part of Kubernetes that consults the Ready status. Would it be bad if every pod in the StatefulSet became unready while a single pod is "loading up"?
My understanding is that readiness is used for things like controlling the tempo of a Deployment or StatefulSet rollout (which is fine here), and that it's also used for having Services determine which pods to route to. In this case we don't actually use the Service associated with the StatefulSet for routing (clients connect directly to individual pods). So it seems like this might actually be fine. But is it? Or are there other applications of the Ready state which would make it bad for us to mark all pods as unready while global replication isn't at 100%?
I cannot answer your questions about the general implications of the Kubernetes readiness probe, but I happen to know your application (Druid) pretty well.
I believe your assumption is false. You say there is no way to ask an individual historical node for its status with regard to loading segments from deep storage, but in fact there is such an API:
/druid/historical/v1/readiness
as well as the related /druid/historical/v1/loadstatus
as documented here: https://druid.apache.org/docs/latest/operations/api-reference.html
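That endpoint gives you exactly the per-pod "is this historical done loading?" signal, so you can tie readiness to the pod's own state rather than to global replication. A minimal sketch of the probe (assuming the historical's default plaintext port 8083 and no TLS; adjust timings and port to your setup):

```yaml
# Readiness probe against the historical's own readiness API.
# The endpoint returns 200 once the pod has loaded its assigned segments.
readinessProbe:
  httpGet:
    path: /druid/historical/v1/readiness
    port: 8083
  initialDelaySeconds: 60   # give the process time to start before probing
  periodSeconds: 10
  failureThreshold: 3
```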