Pod deletion causes errors when using NEG

4/10/2019

For this example I am running the "echoheaders" Nginx image in a Deployment with 2 replicas. When I delete one pod, I sometimes get slow responses and errors for ~40 seconds.

We are running our API-gateway in Kubernetes, and need to be able to allow the Kubernetes scheduler to handle the pods as it sees fit.

We recently wanted to introduce session affinity, and for that we wanted to migrate to the new and shiny NEGs (Network Endpoint Groups): https://cloud.google.com/load-balancing/docs/negs/

When using NEG we experience problems during failover. Without NEG we're fine.

deployment.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: echoheaders
  labels:
    app: echoheaders
spec:
  replicas: 2
  selector:
    matchLabels:
      app: echoheaders
  template:
    metadata:
      labels:
        app: echoheaders
    spec:
      containers:
      - image: brndnmtthws/nginx-echo-headers
        imagePullPolicy: Always
        name: echoheaders
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        lifecycle:
          preStop:
            exec:
              # Hack: wait for kube-proxy to remove endpoint before exiting, and
              # gracefully shut down 
              command: ["bash", "-c", "sleep 10; nginx -s quit; sleep 40"]
      restartPolicy: Always
      terminationGracePeriodSeconds: 60

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: echoheaders
  labels:
    app: echoheaders
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: echoheaders
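(Not part of my original setup, but for reference: GKE also lets you tune how long the load balancer keeps draining connections from a removed NEG endpoint via a BackendConfig resource attached to the Service. A sketch, assuming the BackendConfig CRD is available in the cluster; the name `echoheaders-beconfig` and the 60-second timeout are illustrative:)

```yaml
apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: echoheaders-beconfig
spec:
  connectionDraining:
    # How long the backend service keeps serving in-flight connections
    # after an endpoint is removed from the NEG.
    drainingTimeoutSec: 60
```

The Service would reference it with an annotation such as `beta.cloud.google.com/backend-config: '{"ports": {"80": "echoheaders-beconfig"}}'`.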

ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.global-static-ip-name: echoheaders-staging
  name: echoheaders-staging
spec:
  backend:
    serviceName: echoheaders
    servicePort: 80

When deleting a pod I get errors, as shown in this image of the output of

$ httping -G -K 35.190.69.21

(https://i.imgur.com/u14MvHN.png)

This is new behaviour when using NEG. Disabling NEG gives the old behaviour with working failover.

Any way to use Google LB, ingress, NEG and Kubernetes without errors during pod deletion?

-- Nicholas Klem
google-cloud-networking
google-kubernetes-engine
kubernetes

1 Answer

4/12/2019

With GCP load balancers, a GET request is only answered with a 502 after two successive backends fail to meet the response timeout, or after a more impactful error occurs, which seems more plausible here.

What is likely happening is an interim period in which a Pod due for termination has already received its SIGTERM, but is still considered healthy by the load balancer and is still being sent requests. Because the Pod is shutting down, it cannot complete those requests and closes the connection.

A graceful service stop[1] ("lame duck" mode) means that after receiving a SIGTERM, your service continues to serve in-flight requests but refuses new connections. This may solve your issue, but keep in mind that there is no guarantee of zero downtime.


[1] https://landing.google.com/sre/sre-book/chapters/load-balancing-datacenter/#robust_approach_lame_duck

-- Lozano
Source: StackOverflow