Using Kubernetes, I deploy an app to Google Cloud Container Engine on a cluster with 3 small instances.
On a first-time deploy, all goes well using:
kubectl create -f deployment.yaml
And:
kubectl create -f service.yaml
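For reference, here is roughly what the two manifests look like -- a minimal sketch with hypothetical names (my-app, gcr.io/my-project/my-app), not my exact files:

deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                  # one Pod per small instance
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:v1   # the image I later change
        ports:
        - containerPort: 8080

service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080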
Then I change the image in my deployment.yaml and update it like so:
kubectl apply -f deployment.yaml
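The change itself is nothing more than the image tag in the container spec, roughly (keeping the hypothetical names from above):

      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:v2   # only the tag changed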
After the update, a couple of things happen:
A new ReplicaSet is created (?).
Half of my Pods have status Running, and the other half Unknown.
So I inspected my Pods and came across this error:
FailedSync Error syncing pod, skipping: network is not ready: [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Also, I can't use the dashboard anymore via kubectl proxy. The page shows:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "no endpoints available for service \"kubernetes-dashboard\"",
"reason": "ServiceUnavailable",
"code": 503
}
So I decided to delete all Pods forcefully:
kubectl delete pod <pod-name> --grace-period=0 --force
Then, three Pods are triggered for creation, since this is defined in my service.yaml. But upon inspecting my Pods using kubectl describe pods/<pod-name>, I see:
no nodes available to schedule pods
I have no idea where this all went wrong. In essence, all I did was update the image of a deployment.
Any ideas?
If your intention is just to update the image, try using kubectl set image instead. That at least works for me.
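For example, something like this -- a sketch using the hypothetical deployment and container name my-app and a new image tag, not your actual names:

kubectl set image deployment/my-app my-app=gcr.io/my-project/my-app:v2
kubectl rollout status deployment/my-app    # watch the rollout finish

That updates only the image field of the Deployment and triggers the same rolling update, without re-applying the whole manifest.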
By googling kubectl apply, a lot of known issues do seem to come up. See this issue for example, or this one.
You did not post which version of Kubernetes you are running, but if you can, try upgrading your cluster to the latest version to see if the issue persists.
I've run into similar issues on Kubernetes. According to your reply to my question in the comments on your post (see above):
I noticed that this happens only when I deploy to a micro instance on Google Cloud, which simply has insufficient resources to handle the deployment. Scaling up the initial resources (CPU, Memory) resolved my issue
It seems to me that what's happening here is that the Linux kernel's OOM killer ends up killing the kubelet, which in turn makes the Node useless to the cluster (its status becomes "Unknown").
A real solution to this problem (to prevent an entire node from dropping out of service) is to add resource limits. Make sure you're not just adding requests; add limits too, because you want your own services -- rather than the K8s system services -- to be killed so that they can be rescheduled appropriately (if possible).
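As a sketch, that means giving each container in your deployment.yaml something like this (the values are placeholders to tune for your workload; the names follow the hypothetical ones above):

      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:v2
        resources:
          requests:            # what the scheduler reserves on the node
            cpu: 100m
            memory: 128Mi
          limits:              # past this the container itself is throttled (CPU)
            cpu: 250m          # or OOM-killed (memory), instead of the node's
            memory: 256Mi      # kubelet being taken down

With memory limits in place, an over-consuming Pod gets killed and rescheduled rather than starving the kubelet.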
Also, inside the cluster settings (specifically in the Node Pool -- select it from https://console.cloud.google.com/kubernetes/list), there is a box you can check for "Automatic Node Repair" that would at least partially remediate this problem, rather than giving you an undefined amount of downtime.
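If you prefer the command line, the same setting can be toggled with gcloud -- the pool, cluster, and zone below are placeholders, and you should double-check the flag against gcloud container node-pools update --help for your SDK version:

gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --enable-autorepair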