Kubernetes on GCE / Prevent pods undergoing an eviction with "The node was low on compute resources."

11/12/2016

A painful investigation into aspects that, so far, are not well covered by the documentation (at least from what I've been able to find by googling).

My cluster's kube-proxy pods were evicted (more experienced users can probably imagine the problems that followed). I searched a lot, but found no clues about how to bring them back up.

Until describing the affected pod finally gave a clear reason: "The node was low on compute resources."

Still not very experienced with balancing resources between pods/deployments and the "physical" compute behind them, how would one prioritize (or take a similar approach) to make sure specific pods never end up in such a state?

The cluster was created with fairly low resources in order to get our hands dirty while keeping costs low, and eventually to witness exactly this kind of problem (gcloud container clusters create deemx --machine-type g1-small --enable-autoscaling --min-nodes=1 --max-nodes=5 --disk-size=30). Is using g1-small prohibitive?

-- Ben
google-compute-engine
kubernetes

1 Answer

11/12/2016

If you are using the iptables-based kube-proxy (the current best practice), then kube-proxy being killed should not immediately break your network connectivity, but new services and updates to endpoints will stop working. Your apps should continue to work, but degrade slowly. If you are using the userspace kube-proxy, you might want to upgrade.

The error message sounds like it was due to memory pressure on the machine.

When there is memory pressure, the kubelet tries to terminate pods in order from lowest to highest QoS class.
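As a rough sketch of that ordering, here is how the QoS class is derived from a pod's resource spec (simplified to a single container with plain dicts of requests and limits; the function name and structure are my own, not a Kubernetes API):

```python
# Simplified sketch of Kubernetes QoS classification for a
# single-container pod. Not the real kubelet code.

def qos_class(requests, limits):
    """Return the QoS class Kubernetes would assign to the pod."""
    if not requests and not limits:
        return "BestEffort"          # evicted first under memory pressure
    resources = ("cpu", "memory")
    if all(r in requests and r in limits and requests[r] == limits[r]
           for r in resources):
        return "Guaranteed"          # evicted last
    return "Burstable"               # in between

print(qos_class({}, {}))                                    # BestEffort
print(qos_class({"cpu": "100m", "memory": "64Mi"}, {}))     # Burstable
print(qos_class({"cpu": "100m", "memory": "64Mi"},
                {"cpu": "100m", "memory": "64Mi"}))         # Guaranteed
```

Pods classified as BestEffort are reclaimed first, Guaranteed pods last, which is why the QoS class of kube-proxy matters here.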

If your kube-proxy pod is not using Guaranteed resources, then you might want to change that.
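For reference, a pod gets the Guaranteed class when every container sets cpu and memory limits equal to its requests. A minimal illustrative manifest (names and image are placeholders, not from your cluster):

```yaml
# Illustrative only: requests == limits for every resource in every
# container makes the pod Guaranteed.
apiVersion: v1
kind: Pod
metadata:
  name: example-guaranteed   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "100m"
        memory: "128Mi"
```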

Other things to look at:

  • If kube-proxy suddenly used a lot more memory, it could be terminated. Creating a huge number of pods, services, or endpoints could cause it to use more memory.
  • If you started processes on the machine that are not under Kubernetes control, the kubelet could make an incorrect decision about what to terminate. Avoid this.
  • On a machine as small as a g1-small, the amount of node resources held back may be insufficient, so too much guaranteed work gets placed on the machine -- see allocatable vs. capacity. This might need tweaking.
  • See the node out-of-memory (OOM) behavior documentation.
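To make the allocatable-vs-capacity point concrete, allocatable is roughly capacity minus the resources the node holds back. The numbers below are purely illustrative (a g1-small has about 1.7 GiB of memory; the reservation values are assumptions, not GKE's actual defaults):

```python
# Illustrative arithmetic only: how allocatable relates to capacity.
# All reservation numbers are assumed, not real GKE defaults.
capacity_mib = 1740           # ~1.7 GiB on a g1-small
kube_reserved_mib = 250       # assumed: kubelet + container runtime
system_reserved_mib = 100     # assumed: OS daemons
eviction_threshold_mib = 100  # assumed: hard eviction threshold

allocatable_mib = (capacity_mib - kube_reserved_mib
                   - system_reserved_mib - eviction_threshold_mib)
print(allocatable_mib)  # what the scheduler can actually place on the node
```

On a small node, the fixed overhead eats a proportionally large share of capacity, which is why too much guaranteed work can end up scheduled onto it.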
-- Eric Tune
Source: StackOverflow