Been experimenting with Kubernetes/Rancher and encountered some unexpected behavior. Today I'm deliberately putting on my chaos monkey hat and learning how things behave when stuff fails.
Here's what I've done:
1) Using the Rancher UI, I stand up a 3 node cluster on Digital Ocean. Success -- a few mins later I have a 3 node cluster, visible in Rancher.
2) Using the Rancher UI, I delete a node in a 'happy' scenario, pushing the appropriate node delete button.
Some minutes later, I have a 2 node cluster. Great.
3) Using the Digital Ocean admin UI, I delete a node in an 'oops' scenario as if a sysadmin accidentally deleted a node.
Back on the ranch (sorry), I click through to view the state of the cluster.
Unfortunately, after three minutes, I get a gateway timeout.
(Screenshot: detailed timeouts in Chrome network inspector)
Here's what kubectl says:
$ kubectl get nodes
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
So the question is: what happened here? I was under the impression Kubernetes was 'self healing', and that even if the node I deleted was the etcd leader, it would eventually recover. It's been around 2 hours -- do I just need to wait longer?
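
In case my mental model is the problem, here is the etcd quorum arithmetic as I understand it. This is only a sketch under assumptions I have not verified: that every node in this cluster was an etcd member, and that the Rancher delete in step 2 cleanly removed that node from etcd's member list while the Digital Ocean delete in step 3 did not. The names `quorum` and `can_serve` are just illustrative.

```python
def quorum(members: int) -> int:
    """Majority a Raft-based store like etcd needs to accept writes."""
    return members // 2 + 1


def can_serve(registered: int, alive: int) -> bool:
    """True if the live members still form a majority of the registered ones."""
    return alive >= quorum(registered)


# (label, registered etcd members, members actually alive)
# ASSUMPTION: each node is an etcd member; the Rancher delete shrinks the
# member list, the out-of-band Digital Ocean delete does not.
steps = [
    ("1) fresh 3 node cluster", 3, 3),
    ("2) node deleted via Rancher", 2, 2),
    ("3) node deleted via Digital Ocean", 2, 1),
]

for label, registered, alive in steps:
    state = "has quorum" if can_serve(registered, alive) else "NO quorum"
    print(f"{label}: {alive}/{registered} alive, need {quorum(registered)} -> {state}")
```

If those assumptions hold, step 3 left 1 live member out of 2 registered, below the required majority of 2, which would explain the timeouts and suggests waiting alone won't help. I'd appreciate confirmation of whether that is what actually happened.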