Kubernetes Deployment update crashes ReplicaSet and creates too many Pods

4/8/2017

Using Kubernetes, I deploy an app to Google Cloud Container Engine on a cluster with 3 small instances.

On a first-time deploy, all goes well using:

kubectl create -f deployment.yaml

And:

kubectl create -f service.yaml

Then I change the image in my deployment.yaml and update it like so:

kubectl apply -f deployment.yaml
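
For context, the relevant part of my deployment.yaml looks roughly like this (the names and image below are placeholders rather than my actual values, and the apiVersion may differ depending on the cluster version):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        # only this image tag changes between the first create and the later apply
        image: gcr.io/my-project/my-app:v2
        ports:
        - containerPort: 8080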

After the update, a couple of things happen:

  • Kubernetes updates its Pods correctly, ending up with 3 updated instances.
  • Shortly after this, another ReplicaSet is created (?)
  • Also, double the number of Pods (2 * 3 = 6) are suddenly present, where half of them have a status of Running and the other half Unknown (see the commands right after this list for how I checked).
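
For completeness, this is roughly how I checked (standard kubectl commands, nothing specific to my manifests):

kubectl get replicasets   # lists the old and the new ReplicaSet side by side
kubectl get pods -o wide  # lists all 6 Pods with their status and node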

So I inspected my Pods and came across this error:

FailedSync      Error syncing pod, skipping: network is not ready: [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]

Also, I can no longer reach the dashboard through kubectl proxy. The page shows:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"kubernetes-dashboard\"",
  "reason": "ServiceUnavailable",
  "code": 503
}

So I decided to delete all Pods forcefully:

kubectl delete pod <pod-name> --grace-period=0 --force

Then, three Pods are triggered for creation, since a replica count of three is defined in my deployment.yaml. But upon inspecting my Pods using kubectl describe pods/<pod-name>, I see:

no nodes available to schedule pods
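
For reference, the Node health can be checked like this (standard kubectl commands, not specific to my setup):

kubectl get nodes                  # shows whether each Node is Ready, NotReady or Unknown
kubectl describe node <node-name>  # the Conditions section shows e.g. MemoryPressure and OutOfDisk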

I have no idea where this all went wrong. In essence, all I did was update the image of a Deployment.

Any ideas?

-- Nicky
google-cloud-platform
kubectl
kubernetes

2 Answers

4/9/2017

If your intention is just to update the image, try using kubectl set image instead. That at least works for me.
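
For example (the deployment name, container name and image are placeholders for your own):

kubectl set image deployment/my-app my-app=gcr.io/my-project/my-app:v2
kubectl rollout status deployment/my-app   # optionally watch the rollout finish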

Googling kubectl apply turns up a number of known issues. See this issue for example, or this one.

You did not post which version of Kubernetes you deployed, but if you can, try upgrading your cluster to the latest version to see if the issue persists.
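
On Container Engine the upgrade looks roughly like this (cluster name and zone are placeholders; check gcloud container clusters upgrade --help for the exact flags of your gcloud version):

gcloud container clusters upgrade my-cluster --zone us-central1-a --master   # upgrade the master first
gcloud container clusters upgrade my-cluster --zone us-central1-a            # then upgrade the nodes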

-- Oswin Noetzelmann
Source: StackOverflow

8/8/2017

I've run into similar issues on Kubernetes. According to your reply to my comment on your question (see above):

I noticed that this happens only when I deploy to a micro instance on Google Cloud, which simply has insufficient resources to handle the deployment. Scaling up the initial resources (CPU, Memory) resolved my issue

It seems to me that what's happening here is that the Linux kernel's OOM killer ends up killing the kubelet, which in turn makes the Node useless to the cluster (its status becomes "Unknown").

A real solution to this problem (to prevent an entire Node from dropping out of service) is to add resource limits. Make sure you're not just adding requests; add limits, because you want your own services -- rather than the K8s system services -- to be the ones that get killed, so that they can be rescheduled appropriately (if possible).
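
A minimal sketch of what that looks like in the container spec (the numbers below are made up; size them for your own workload):

    spec:
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:v2
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          # a container that exceeds its memory limit is killed and restarted,
          # instead of the kernel's OOM killer picking a victim such as the kubelet
          limits:
            cpu: 250m
            memory: 256Mi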

Also, inside the cluster settings (specifically in the Node Pool -- select from https://console.cloud.google.com/kubernetes/list), there is a box you can check for "Automatic Node Repair" that would at least partially mitigate this problem rather than leaving you with an undefined amount of downtime.
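
If you prefer the CLI, enabling it looks roughly like this (cluster, zone and node pool names are placeholders; older gcloud versions may require the beta component for this flag):

gcloud container node-pools update default-pool --cluster my-cluster --zone us-central1-a --enable-autorepair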

-- Hut8
Source: StackOverflow