Why did we get auto-upgraded to 1.14.7-gke.14? Why is the latest 1.14.8-gke.2 not working at all?

11/6/2019

It seems our master was auto-upgraded this morning, without prior warning, from an earlier 1.14.x version to 1.14.7-gke.14. We had disabled auto-upgrade before, so this is completely odd, especially given that 1.14.7-gke.14 breaks Workload Identity, which quite a few of us in the GKE world rely on.
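For anyone wanting to double-check the same settings, something like this should show whether node auto-upgrade is still off and what maintenance window the cluster is configured with; my-cluster and default-pool below are placeholders rather than our real names, so treat it as a sketch:

    # Is node auto-upgrade still disabled on the pool?
    gcloud container node-pools describe default-pool \
      --cluster my-cluster --zone europe-west1-d \
      --format="value(management.autoUpgrade)"

    # What maintenance window is the cluster configured with?
    gcloud container clusters describe my-cluster --zone europe-west1-d \
      --format="yaml(maintenancePolicy)"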

Moreover, I'm seeing a new version, 1.14.8-gke.2, in the list, which at the time of writing was not mentioned in the GKE release notes at all. Our cluster was created before release channels became a thing, so AFAIK we're not enrolled in any of them.
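As a rough way to confirm (non-)enrollment, with my-cluster again a placeholder and the beta track possibly required for the releaseChannel field to show up at all:

    # Empty output should mean the cluster is not enrolled in any release channel.
    gcloud beta container clusters describe my-cluster --zone europe-west1-d \
      --format="yaml(releaseChannel)"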

Wondering if someone pulled the release trigger prematurely (again, as this is not the first time).

Since there was no option to go back to a stable version where Workload Identity was working fine, we went ahead and upgraded to 1.14.8-gke.2 in the hope it would fix the problem. That gamble did not pay off: it resulted in a colossal clusterfuck where some services did not come up at all due to all kinds of container and networking issues. Specifically:

  1. kube-dns's prometheus-to-sd container is throwing the following error (a rough diagnostic sketch follows this list): http://169.254.169.254/computeMetadata/v1/project/project-id: dial tcp 169.254.169.254:80: getsockopt: connection refused

  2. readiness and liveness probes are failing for some of our services due to what looks like an internal networking issue
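Here is a rough sketch of the diagnostics around those two symptoms; the metadata-test pod name and the curlimages/curl image are arbitrary choices of mine, nothing GKE-specific:

    # Inspect the failing kube-dns pods and the prometheus-to-sd logs:
    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
    kubectl -n kube-system logs -l k8s-app=kube-dns -c prometheus-to-sd --tail=50

    # Reproduce the metadata-server error from a throwaway pod
    # (same URL as in the error above):
    kubectl run metadata-test --rm -i --restart=Never \
      --image=curlimages/curl --command -- \
      curl -sS -H "Metadata-Flavor: Google" \
      http://169.254.169.254/computeMetadata/v1/project/project-id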

Now on to the main question: what would you suggest? Recreate the cluster, go back to v1.13.x, and enroll in the stable release channel? What other options do we have? I'd like to get some insight into what the heck is going on here.
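For clarity, the "recreate and enroll in the stable channel" option would look roughly like this; my-cluster-v2 is a placeholder, the flag seems to need the beta track at the moment, and the channel would then dictate the version rather than us pinning 1.13.x ourselves:

    gcloud beta container clusters create my-cluster-v2 \
      --zone europe-west1-d \
      --release-channel stable \
      --num-nodes 3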

EDIT: we're currently hosted in europe-west1-d; it's a zonal cluster comprising one standard pool with 3 auto-scaled nodes and one preemptible pool for dynamic workloads.
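For anyone trying to reproduce, the topology is roughly equivalent to the following; pool names and the autoscaling bounds are placeholders, not our exact values:

    gcloud container node-pools create standard-pool \
      --cluster my-cluster --zone europe-west1-d \
      --num-nodes 3 --enable-autoscaling --min-nodes 1 --max-nodes 5

    gcloud container node-pools create preemptible-pool \
      --cluster my-cluster --zone europe-west1-d \
      --preemptible --enable-autoscaling --min-nodes 0 --max-nodes 10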

EDIT2: I just noticed in the release notes from Oct 30 that 1.14.6-gke.2 was going to be wound down; I must have missed it before. Either way, the questions remain: why is there an undocumented version, why is it not working, and does anyone know when Workload Identity will work again outside the release channels? This is forcing us to downgrade to 1.13.x.
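For reference, this is one way to see what the zone actually advertises as valid versions, which is where an undocumented build like 1.14.8-gke.2 can surface:

    gcloud container get-server-config --zone europe-west1-d \
      --format="yaml(validMasterVersions,validNodeVersions)"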

EDIT3: it turns out kube-dns is using a 1.15.4 image version on a 1.14.x cluster, which almost looks like a broken release. When I downgrade the image it works for a brief moment until the cluster controller reverts the change, so we're going to recreate the cluster.
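A quick way to see the image mismatch and the subsequent revert, nothing here being cluster-specific:

    # Which image versions is the kube-dns deployment actually running?
    kubectl -n kube-system get deployment kube-dns \
      -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'

    # The automatic revert should show up as extra revisions here:
    kubectl -n kube-system rollout history deployment kube-dns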

EDIT4: after a few hours it turns out the cos_containerd node image was the main culprit; switching the image type seems to have resolved all issues. We'll keep monitoring.
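In case it helps someone else, switching the node image type was done with something along these lines; cluster and pool names are placeholders, and the command recreates the nodes in that pool:

    gcloud container clusters upgrade my-cluster \
      --zone europe-west1-d \
      --node-pool standard-pool \
      --image-type COS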

-- Marek Suscak
google-kubernetes-engine
