We are using Kubernetes 1.1.8 (with flannel), but with 1.2 about to drop, any input on this topic that is specific to 1.2 is fine.
We run Kubernetes in our own datacenters on bare metal, which means we need to do maintenance on worker nodes that takes them in and out of production.
We have a process for taking a node out of the cluster to do maintenance on it, and I'm wondering if that process can be improved to minimize the potential for user-facing downtime.
We are using F5 load balancers. Each service that we deploy is given a static nodePort; for example, appXYZ has nodePort 30173. In the F5 pool for service appXYZ, all minions in the cluster are added as pool members, with a TCP port-open check on port 30173.
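For context, here is roughly what such a service definition looks like (the name, selector, and ports below are illustrative, not our actual manifest; the only point is the pinned nodePort):

# Illustrative NodePort service pinning the port that the F5 health check targets
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: appxyz
spec:
  type: NodePort
  selector:
    app: appxyz
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30173
EOF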
During maintenance on a node we take the following steps (a rough kubectl sketch follows the list):
1. Set the node to unschedulable = true.
2. Get the list of pods running on the node and delete each pod. Sometimes this will be 40 pods per node.
3. Wait for up to two minutes for the pods in step #2 to shut down.
4. Reboot the physical node.
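For reference, the commands behind those steps look roughly like this (node01 is a made-up node name, and on 1.1 we set the unschedulable flag via kubectl patch since kubectl cordon only shows up in 1.2):

# 1. Mark the node unschedulable so no new pods land on it
$ kubectl patch node node01 -p '{"spec": {"unschedulable": true}}'
# 2. Find the pods on the node and delete each one
$ kubectl get pods --all-namespaces -o wide | grep node01
$ kubectl delete pod <pod-name> --namespace=<namespace>
# 3. Give everything up to two minutes to shut down
$ sleep 120
# 4. Reboot the physical node
$ ssh node01 sudo reboot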
I'm wondering if this is what other people are doing, or if we are missing one or more steps that would further minimize the amount of traffic that could potentially get sent to a dead or dying pod on the node undergoing maintenance.
When I read through http://kubernetes.io/docs/user-guide/pods/#termination-of-pods it makes me wonder whether adding a longer (over 30 seconds) --grace-period= to our delete command, and pausing longer before the reboot, would ensure all of the kube-proxies have been updated to remove the node from the list of endpoints.
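In other words, something along these lines (the 120-second values are just a guess on my part, not something we've validated):

# Delete with a longer grace period, then pause so every kube-proxy has time
# to drop the terminating pods from its endpoints before we reboot
$ kubectl delete pod <pod-name> --namespace=<namespace> --grace-period=120
$ sleep 120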
So, can anyone confirm that what we are doing is a decent practice, or suggest how to improve it? Especially any tips on what to do in Kubernetes 1.2.
TIA!
The approach I follow is
I have used this strategy to swap out an AWS ASG of nodes (a blue-green deploy of sorts) when making changes to the nodes.
This blog post has also been a good reference (though I don't use the delete-pod method): http://sttts.github.io/kubernetes/api/kubectl/2016/01/13/kubernetes-node-evacuation.html
Adding a longer grace period certainly wouldn't hurt.
If you want to be really cautious, you could also remove the labels from the pods running on those nodes before deleting them. This will keep them running such that they can finish whatever work they're doing, but remove them from all services so that they'll stop receiving new requests.
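For example, assuming the service selects on an "app" label (the label key and names here are hypothetical), stripping that label from a pod takes it out of the service's endpoints while leaving it running; the trailing "-" removes the label:

# Pod keeps running but drops out of the service's endpoints once the label is gone
$ kubectl label pod <pod-name> app- --namespace=<namespace>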
Check out the 'kubectl drain' command:
# Drain node "foo", even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it.
$ kubectl drain foo --force
# As above, but abort if there are pods not managed by a ReplicationController, Job, or DaemonSet, and use a grace period of 15 minutes.
$ kubectl drain foo --grace-period=900
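Once the node comes back from its reboot, remember to mark it schedulable again; with 1.2's kubectl that's:

$ kubectl uncordon foo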
See also Issue 3885 and related linked issues.