This happens once or twice a week, without my applying any commands. I just receive an alert that many of my pods are down.
These are the cluster settings; the cluster was created at least 40 days ago:
gcloud container \
clusters create "yourclustername" \
--project "yourprojectname" \
--zone "yourregion-zone" \
--no-enable-basic-auth \
--release-channel "regular" \
--machine-type "e2-standard-2" \
--image-type "COS" \
--disk-type "pd-ssd" \
--disk-size "20" \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "2" \
--enable-stackdriver-kubernetes \
--enable-ip-alias \
--network "projects/yourprojectname/global/networks/yournetwork" \
--subnetwork "projects/yourprojectname/regions/yourregion/subnetworks/yournetwork" \
--default-max-pods-per-node "110" \
--enable-autoscaling \
--min-nodes "2" \
--max-nodes "4" \
--no-enable-master-authorized-networks \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,NodeLocalDNS,ApplicationManager \
--enable-autoupgrade \
--enable-autorepair \
--max-surge-upgrade 1 \
--max-unavailable-upgrade 0 \
--enable-shielded-nodes
Node condition:
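For reference, the node conditions can also be pulled with kubectl; a quick check along these lines (the node name is a placeholder) shows the same information as the console:
# list nodes and their overall status
kubectl get nodes
# print the Conditions block for a specific node
kubectl describe node yournodename | grep -A 15 "Conditions:"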
I delete the pods that show this error and GKE creates new ones, but of course this is not a solution; it still means at least 4 minutes of downtime each time (the manual cleanup is sketched below). How can I solve this? Do I need something like Calico or Flannel even on GKE?
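For completeness, the manual workaround is roughly this (pod and namespace names are placeholders):
# delete a failing pod; its Deployment/ReplicaSet recreates it
kubectl delete pod yourpodname -n yournamespace
# or clear all pods stuck in a Failed state in one namespace
kubectl delete pods -n yournamespace --field-selector=status.phase=Failed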
Solved: I enabled Calico, and its CNI plugin solved the problem. GKE has built-in support for it, and the steps to enable it can be found here: https://cloud.google.com/kubernetes-engine/docs/how-to/network-policy
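The gist of those steps, for an existing cluster, is roughly the following (cluster name and zone are placeholders matching the create command above); note that enabling enforcement recreates the node pools, so expect a rolling node upgrade:
# enable the network policy (Calico) addon on the control plane
gcloud container clusters update "yourclustername" \
--zone "yourregion-zone" \
--update-addons=NetworkPolicy=ENABLED
# enable network policy enforcement on the nodes (recreates node pools)
gcloud container clusters update "yourclustername" \
--zone "yourregion-zone" \
--enable-network-policy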