I set up a (very) simple deployment on a GKE Autopilot cluster running the latest version of Kubernetes (1.18.15-gke.1501) and attached an Ingress (external HTTP(S) load balancer) that links to a simple ClusterIP service.
Whenever I update the deployment with a new image, I experience about 5-15 minutes of downtime where the load balancer returns 502 errors. It seems like the control plane creates the new, updated pod and lets the service-level health checks pass (but not the load-balancer ones; it hasn't created the new NEG yet), then kills the older pod while setting up the new NEG at the same time. It then doesn't remove the old NEG until a variable amount of time later.
Logs on the pods show that health checks are passing, but the GKE dashboard shows inconsistent results for the Ingress' state: the Ingress will show as fine, while the service will 502.
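(Side note in case it helps anyone reproduce this: the NEG and readiness-gate state during a rollout can be watched with standard kubectl, using the resource names from the manifests below.)
kubectl get pods -l app=foundation-web -o wide      # READINESS GATES column tracks cloud.google.com/load-balancer-neg-ready
kubectl get svc foundation-web-service -o yaml      # the cloud.google.com/neg-status annotation lists the NEGs and their zones
kubectl describe ingress foundation-web-ingress     # the ingress.kubernetes.io/backends annotation shows backend health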
Things I have tried so far:
- Setting maxSurge to 1 and maxUnavailable to 0. This did not improve the downtime at all.
- Adding lifecycle.preStop.exec.command: ["sleep", "60"] to the container on the deployment. This was suggested in the GKE docs here.
- Adding a BackendConfig to the service that adds slower connection draining.
None of the above have helped or made any noticeable difference in how long things were down.
I'm really, really confused about why this isn't working. It feels like I'm missing something really obvious, but this is also such a simple config that you'd think it'd... just work?? Anyone have any idea what's going on?
Deployment config:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foundation-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: foundation-web
  template:
    metadata:
      labels:
        app: foundation-web
    spec:
      readinessGates:
        - conditionType: "cloud.google.com/load-balancer-neg-ready"
      serviceAccountName: foundation-database-account
      containers:
        # Run Cloud SQL proxy so we can safely connect to Postgres on localhost.
        - name: cloud-sql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.17
          resources:
            requests:
              cpu: "250m"
              memory: 100Mi
            limits:
              cpu: "500m"
              memory: 100Mi
          command:
            - "/cloud_sql_proxy"
            - "-instances=nine-foundation:us-central1:foundation-staging=tcp:5432"
          securityContext:
            runAsNonRoot: true
        # Main container config
        - name: foundation-web
          image: gcr.io/project-name/foundation_web:latest
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "60"]
          env:
            # Env variables
          resources:
            requests:
              memory: "500Mi"
              cpu: "500m"
            limits:
              memory: "1000Mi"
              cpu: "1"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          ports:
            - containerPort: 4000
Service config:
apiVersion: v1
kind: Service
metadata:
  name: foundation-web-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/backend-config: '{"ports": {"4000": "foundation-service-config"}}'
spec:
  type: ClusterIP
  selector:
    app: foundation-web
  ports:
    - port: 4000
      targetPort: 4000
BackendConfig:
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: foundation-service-config
spec:
  # sessionAffinity:
  #   affinityType: "GENERATED_COOKIE"
  #   affinityCookieTtlSec: 120
  connectionDraining:
    drainingTimeoutSec: 60
Ingress config:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: foundation-web-ingress
  labels:
    name: foundation-web-ingress
spec:
  backend:
    serviceName: foundation-web-service
    servicePort: 4000
I think this might be something to do with the Cloud SQL Auth proxy sidecar not terminating gracefully, which leaves the load balancer in a twist.
Note this from the GCP docs (the key parts for this case are the bits about handling SIGTERM):
Symptoms
502 errors or rejected connections.
Potential causes
New endpoints generally become reachable after attaching them to the load balancer, provided that they respond to health checks. You might encounter 502 errors or rejected connections if traffic cannot reach the endpoints.
502 errors and rejected connections can also be caused by a container that doesn't handle SIGTERM. If a container doesn't explicitly handle SIGTERM, it immediately terminates and stops handling requests. The load balancer continues to send incoming traffic to the terminated container, leading to errors.
The container native load balancer only has one backend endpoint. During a rolling update, the old endpoint gets deprogrammed before the new endpoint gets programmed.
Backend Pod(s) are deployed into a new zone for the first time after a container native load balancer is provisioned. Load balancer infrastructure is programmed in a zone when there is at least one endpoint in the zone. When a new endpoint is added to a zone, load balancer infrastructure is programmed and causes service disruptions.
Resolution
Configure containers to handle SIGTERM and continue responding to requests throughout the termination grace period (30 seconds by default). Configure Pods to begin failing health checks when they receive SIGTERM. This signals the load balancer to stop sending traffic to the Pod while endpoint deprogramming is in progress.
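The second half of that resolution (start failing health checks when SIGTERM arrives, but keep serving real traffic while the endpoint is deprogrammed) has to live in the application itself. I don't know what foundation-web runs, so this is just a sketch of the pattern in Go, assuming an HTTP server on port 4000 with the same /healthz path as your probes; the 60-second drain is an arbitrary value and has to fit inside the pod's termination grace period:
// Sketch: on SIGTERM, flip /healthz to 503 so the load balancer health check
// starts failing, keep serving normal requests during the drain window,
// then shut the server down cleanly.
package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"
)

func main() {
    var terminating atomic.Bool // set once SIGTERM is received (needs Go 1.19+)

    mux := http.NewServeMux()
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        if terminating.Load() {
            // Failing the health check tells the load balancer to stop sending new traffic.
            http.Error(w, "shutting down", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Normal traffic keeps being served while the endpoint is deprogrammed.
        _, _ = w.Write([]byte("ok"))
    })

    srv := &http.Server{Addr: ":4000", Handler: mux}
    go func() { _ = srv.ListenAndServe() }()

    // Wait for SIGTERM from the kubelet.
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGTERM)
    <-sig

    // Start failing /healthz, then give the load balancer time to deprogram the
    // endpoint before closing the listener. Must fit inside terminationGracePeriodSeconds.
    terminating.Store(true)
    time.Sleep(60 * time.Second)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    _ = srv.Shutdown(ctx)
}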
By default, the Cloud SQL proxy doesn't handle SIGTERM nicely and won't exit gracefully on a SIGTERM (see the related issue). However, it now has a flag for handling this, so you can use something like:
- name: cloud-sql-proxy
  image: gcr.io/cloudsql-docker/gce-proxy:1.23.1
  command:
    - "/cloud_sql_proxy"
    - "-instances={{ .Values.postgresConnectionName }}=tcp:5432"
    - "-term_timeout=60s"
  securityContext:
    runAsNonRoot: true
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
Adding the term_timeout flag mostly fixed it for me, but I was still seeing the occasional 502 during a deployment. Upping replicas to 3 (my cluster is regional, so I wanted to cover all the zones) seemed to help once I had term_timeout in place.
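One last thing worth double-checking (this is my assumption, not something from the docs quoted above): the default terminationGracePeriodSeconds is 30 seconds, which is shorter than both the 60s preStop sleep in your deployment and the -term_timeout=60s above, so the kubelet can SIGKILL the containers before the drain actually finishes. Setting it explicitly on the pod template keeps the grace period longer than the drain, e.g.:
spec:
  # Default is 30s; make it longer than the preStop sleep / -term_timeout drain,
  # plus whatever time the app itself needs to shut down.
  terminationGracePeriodSeconds: 90
  containers:
    # ... containers as above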