Kubernetes StatefulSet, AZ and volume claims: what happens when an AZ fails

2/9/2018

Consider a StatefulSet (Cassandra, using the official K8S example) across 3 availability zones:

  • cassandra-0 -> zone a
  • cassandra-1 -> zone b
  • cassandra-2 -> zone c

Each Cassandra pod uses an EBS volume, so there is automatically a zone affinity. For instance, cassandra-0 cannot move to "zone-b" because its volume is in "zone-a". All good.
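
For context, that affinity comes from the per-pod volume claims. A minimal sketch of the relevant part of the manifest, assuming a dynamic EBS-backed storage class named gp2 (the class name is an assumption; the image is the one from the official example):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: gcr.io/google-samples/cassandra:v13   # image from the official example
          volumeMounts:
            - name: cassandra-data
              mountPath: /cassandra_data
  # One PVC is created per pod (cassandra-data-cassandra-0, -1, -2).
  # Each backing EBS volume lives in a single AZ, which pins that pod to its AZ.
  volumeClaimTemplates:
    - metadata:
        name: cassandra-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp2        # assumed EBS-backed class
        resources:
          requests:
            storage: 10Gi
```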

If some Kubernetes nodes/workers fail, they will be replaced. The pods will start again on the new nodes and have their EBS volumes re-attached, looking like nothing happened.

Now if the entire AZ "zone-a" goes down and is unavailable for some time (meaning cassandra-0 cannot start anymore due to its affinity for the EBS volume in that zone), you are left with:

  • cassandra-1 -> zone b
  • cassandra-2 -> zone c

Kubernetes will never be able to start cassandra-0 for as long as "zone-a" is unavailable. That's all good because cassandra-1 and cassandra-2 can serve requests.

Now if, on top of that, another K8S node goes down or you have set up auto-scaling of your infrastructure, you could end up with cassandra-1 or cassandra-2 needing to move to another K8S node. That shouldn't be a problem.

However, from my testing, K8S will not do that because the pod cassandra-0 is offline. It will never self-heal cassandra-1 or cassandra-2 (or any cassandra-X) because it wants cassandra-0 back first. And cassandra-0 cannot start because its volume is in a zone which is down and not recovering.

So if you use a StatefulSet + VolumeClaims across zones AND you experience an entire AZ failure AND you experience an EC2 failure in another AZ (or your infrastructure auto-scales),

=> then you will lose all your Cassandra pods until zone-a is back online.

This seems like a dangerous situation. Is there a way for a StatefulSet to not care about the order and still self-heal, or to start more pods (cassandra-3, 4, 5, X)?

-- VinceMD
amazon-web-services
kubernetes
kubernetes-statefulset

2 Answers

2/9/2018

Starting with Kubernetes 1.7 you can tell Kubernetes to relax the StatefulSet ordering guarantees using the podManagementPolicy option (documentation). By setting that option to Parallel, Kubernetes will no longer guarantee any ordering when starting or stopping pods and will start pods in parallel. This can have an impact on your service discovery, but should resolve the issue you're talking about.
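
For example, a minimal sketch of where the field sits in the spec (the rest of the manifest is elided):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  # Parallel drops the ordered start/stop guarantee, so cassandra-1 and
  # cassandra-2 can be (re)scheduled even while cassandra-0 is stuck
  # waiting for its zone to come back.
  podManagementPolicy: Parallel
  # ... selector, template and volumeClaimTemplates as before
```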

-- Lorenz
Source: StackOverflow

2/20/2018

Two options:

Option 1: use podManagementPolicy and set it to Parallel. Pod-1 and pod-2 will crash a few times until the seed node (pod-0) is available. This happens when creating the StatefulSet the first time. Also note that the Cassandra documentation used to recommend NOT creating multiple nodes in parallel, but recent updates suggest this is no longer true: multiple nodes can be added to the cluster at the same time.

Issue found: if using 2 seed nodes, you will get a split-brain scenario. Each seed node will be created at the same time and create 2 separate logical Cassandra clusters.

Option 1b: use podManagementPolicy set to Parallel, plus an init container. Same as option 1, but use an initContainer https://kubernetes.io/docs/concepts/workloads/pods/init-containers/. The init container is a short-lived container whose role is to check that the seed node is available before starting the actual container. This is not required if we are happy for the pod to crash until the seed node is available again. The problem is that the init container will always run, which is not required: we only want to ensure the Cassandra cluster was well formed the first time it was created. After that it does not matter.
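
A sketch of such an init container, assuming the first seed is cassandra-0 and the headless service is named cassandra (as in the official example); the busybox image and the 5-second poll are illustrative:

```yaml
# Under the StatefulSet's spec.template.spec:
initContainers:
  - name: wait-for-seed
    image: busybox:1.28
    env:
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
    # Block pod startup until the seed's DNS record resolves, i.e. until
    # cassandra-0 is up and registered in the headless service.
    command:
      - sh
      - -c
      - |
        until nslookup cassandra-0.cassandra.${POD_NAMESPACE}.svc.cluster.local; do
          echo "waiting for seed cassandra-0"; sleep 5
        done
```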

Option 2: create 3 different StatefulSets.

1 StatefulSet per AZ/rack. Each StatefulSet has constraints so it can run only on nodes in the specific AZ. I've also got 3 storage classes (again constrained to a particular zone), to make sure the StatefulSet does not provision EBS in the wrong zone (the StatefulSet does not handle that dynamically yet). In each StatefulSet I've got a Cassandra seed node (defined via the environment variable CASSANDRA_SEEDS, which populates SEED_PROVIDER at run time). That makes 3 seeds, which is plenty. My setup can survive a complete zone outage thanks to replication-factor=3.
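
A sketch of the zone pinning for one of the three StatefulSets (cassandra-a in zone a); the zone name, class name and volume type are assumptions, and failure-domain.beta.kubernetes.io/zone was the node label for AZs at the time:

```yaml
# Storage class constrained to one zone, so dynamically provisioned
# EBS volumes for cassandra-a always land in that zone.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-zone-a
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: eu-west-1a          # assumed zone; one class per AZ
---
# The matching StatefulSet only schedules onto nodes in the same zone.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra-a
spec:
  serviceName: cassandra
  replicas: 1
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      nodeSelector:
        failure-domain.beta.kubernetes.io/zone: eu-west-1a
      containers:
        - name: cassandra
          image: gcr.io/google-samples/cassandra:v13
  volumeClaimTemplates:
    - metadata:
        name: cassandra-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-zone-a
        resources:
          requests:
            storage: 10Gi
```

The cassandra-b and cassandra-c StatefulSets are identical apart from the zone, storage class name and StatefulSet name.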

Tips:

  • the list of seed nodes contains all 3 nodes, separated by commas (see the sketch after these tips): "cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
  • Wait until the first seed (cassandra-a-0) is ready before creating the other 2 StatefulSets. Otherwise you get a split brain. This is only an issue when you create the cluster. After that, you can lose one or two seed nodes without impact, as the third one is aware of all the others.
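
In manifest form, that seed list is passed to the example image as an environment variable (CASSANDRA_SEEDS feeds SEED_PROVIDER; MYNAMESPACE stands in for your namespace):

```yaml
# In each of the three StatefulSets' Cassandra container spec:
env:
  - name: CASSANDRA_SEEDS
    value: "cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
```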
-- VinceMD
Source: StackOverflow