StatefulSet: pods stuck in unknown state

1/6/2017

I'm experimenting with Cassandra and Redis on Kubernetes, using the examples for v1.5.1.

  • With a Cassandra StatefulSet, if I shut down a node without draining or deleting it via kubectl, that node's Pod stays around forever (at least for over a week, anyway) without being moved to another node.
  • With Redis, even though the pod sticks around as with Cassandra, the Sentinel service starts a new pod, so the number of functional pods is always maintained.

Is there a way to automatically move the Cassandra pod to another node if a node goes down? Or do I have to drain or delete the node manually?

-- muru
cassandra
kubernetes

1 Answer

1/6/2017

Please refer to the documentation here.

Kubernetes (versions 1.5 or newer) will not delete Pods just because a Node is unreachable. The Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:

  • The Node object is deleted (either by you or by the Node Controller).
  • The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
  • Force deletion of the Pod by the user (see the example below).
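For the last case, force deletion can be issued with kubectl. A minimal sketch, assuming the stuck Pod is named cassandra-0 (a placeholder following the usual StatefulSet naming convention):

    # Force-delete a Pod stuck in Terminating/Unknown on an unreachable Node.
    # "cassandra-0" is a placeholder; substitute your actual Pod name.
    kubectl delete pod cassandra-0 --grace-period=0 --force

Only do this if you are certain the kubelet on the old Node will not bring the Pod back; otherwise you risk two Pods with the same identity running at once, for the reasons discussed below.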

This was a behavioral change introduced in Kubernetes 1.5, which allows StatefulSets to prioritize safety.

There is no way to differentiate between the following cases:

  1. The instance is shut down without the Node object being deleted.
  2. A network partition is introduced between the Node in question and the Kubernetes master.

Both of these cases are seen by the Kubernetes master as the kubelet on a Node being unresponsive. If, in the second case, we were to quickly create a replacement pod on a different Node, we might violate the at-most-one semantics guaranteed by StatefulSet and end up with multiple pods with the same identity running on different nodes. At worst, this could even lead to split brain and data loss when running stateful applications.
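You can check how the master currently sees a node with kubectl; a minimal sketch, where "my-node" is a placeholder for the actual node name:

    # List nodes and their Ready status as reported to the control plane.
    kubectl get nodes

    # Inspect the conditions for a single node; an unreachable kubelet shows up
    # as NotReady / status Unknown. "my-node" is a placeholder.
    kubectl describe node my-node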

On most cloud providers, when an instance is deleted, Kubernetes can figure out that the Node is also deleted, and hence let the StatefulSet pod be recreated elsewhere.

However, if you're running on-prem, this may not happen. It is recommended that you delete the Node object from Kubernetes as you power it down, or have a reconciliation loop keeping Kubernetes' view of Nodes in sync with the actual nodes available.
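A minimal sketch of the manual path, assuming a node named my-node (a placeholder) that you are about to power down:

    # Evict pods from the node before powering it off.
    kubectl drain my-node --ignore-daemonsets

    # Remove the Node object so that StatefulSet pods can be recreated elsewhere.
    kubectl delete node my-node

A reconciliation loop could do the same thing automatically by running kubectl delete node for any Node object that no longer corresponds to a real machine in your inventory.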

Some more context is in the GitHub issue.

-- Anirudh Ramanathan
Source: StackOverflow