Testing graceful shutdown on an HTTP server during a Kubernetes rollout

11/7/2019

I followed some tutorials on how to set up an HTTP server, and test it in a local Kubernetes cluster (using minikube).

I also implemented graceful shutdown from some examples I found, and expected that there would be no downtime from a Kubernetes rolling restart.

To verify this, I ran load tests (using ApacheBench: ab -n 100000 -c 20 <addr>) and triggered kubectl rollout restart while the benchmark was running, but ab aborts as soon as the rolling restart starts.

Here is my current project setup:

Dockerfile

FROM golang:1.13.4-alpine3.10

RUN mkdir /app
ADD . /app
WORKDIR /app

RUN go build -o main src/main.go
CMD ["/app/main"]

src/main.go

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"

    "github.com/gorilla/mux"
)

func main() {
    srv := &http.Server{
        Addr:    ":8080",
        Handler: NewHTTPServer(),
    }

    idleConnsClosed := make(chan struct{})
    go func() {
        sigint := make(chan os.Signal, 1)
        signal.Notify(sigint, syscall.SIGINT, syscall.SIGTERM)
        <-sigint

        // We received an interrupt signal, shut down.
        if err := srv.Shutdown(context.Background()); err != nil {
            // Error from closing listeners, or context timeout:
            log.Printf("HTTP server Shutdown: %v", err)
        }

        close(idleConnsClosed)
    }()

    log.Printf("Starting HTTP server")
    if err := srv.ListenAndServe(); err != http.ErrServerClosed {
        // Error starting or closing listener:
        log.Fatalf("HTTP server ListenAndServe: %v", err)
    }

    <-idleConnsClosed
}

func NewHTTPServer() http.Handler {
    r := mux.NewRouter()

    // Ping
    r.HandleFunc("/", handler)

    return r
}

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello World!")
}

kubernetes/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 5
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: dickster/graceful-shutdown-test:latest
        imagePullPolicy: Never
        ports:
        - containerPort: 8080

kubernetes/service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: myapp
  sessionAffinity: None
  type: NodePort

Is there something missing in this setup? According to the rollingUpdate strategy there should always be at least five running pods serving incoming requests, yet ab exits with an apr_socket_recv: Connection reset by peer (54) error. I also tried adding readiness/liveness probes, but had no luck; I suspect they're not needed here either.

-- dickster
docker
go
http
kubernetes

1 Answer

11/7/2019

For this to work without downtime, each pod needs to stop receiving new connections while it is still allowed to finish handling its in-flight connections gracefully. In other words, the pod needs to be running but not ready, so that no new requests are routed to it.

Your service will match all pods using the label selector you configured (I assume app: myapp) and will use any pod in the ready state as a possible backend. The pod is marked as ready as long as it is passing the readinessProbe. Since you have no probe configured, the pod status will default to ready as long as it is running.
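For example, a minimal readinessProbe on your container could reuse the existing / route (the timing values below are illustrative, not tuned):

      containers:
      - name: myapp
        ...
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          periodSeconds: 2
          failureThreshold: 2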

Just having a readinessProbe configured will help immensely, but it will not provide 100% uptime on its own. That requires some tweaks in your code so the readinessProbe starts failing (and new requests stop being routed to the pod) while the container gracefully finishes its current connections, as sketched below.
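A minimal sketch of that idea in Go, assuming a dedicated /healthz route (point the readinessProbe above at it instead of /) and using the standard library mux for brevity; the flag name and the sleep/timeout durations are assumptions you should tune:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"
)

func main() {
    var shuttingDown int32 // set to 1 on SIGTERM; readiness fails afterwards

    mux := http.NewServeMux()
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        if atomic.LoadInt32(&shuttingDown) == 1 {
            w.WriteHeader(http.StatusServiceUnavailable) // readinessProbe now fails
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    srv := &http.Server{Addr: ":8080", Handler: mux}

    done := make(chan struct{})
    go func() {
        sigterm := make(chan os.Signal, 1)
        signal.Notify(sigterm, syscall.SIGTERM)
        <-sigterm

        // Start failing the readinessProbe, then wait for the endpoints
        // controller to notice and remove the pod from the service.
        atomic.StoreInt32(&shuttingDown, 1)
        time.Sleep(10 * time.Second)

        // Now drain in-flight requests with a deadline.
        ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("HTTP server Shutdown: %v", err)
        }
        close(done)
    }()

    if err := srv.ListenAndServe(); err != http.ErrServerClosed {
        log.Fatalf("HTTP server ListenAndServe: %v", err)
    }
    <-done
}

The sleep before Shutdown matters: endpoint removal is asynchronous, so the pod keeps receiving traffic for a short window after it starts reporting not-ready.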

EDIT: As @Thomas Jungblut mentioned, a big part of eliminating errors with your webserver is how the application handles SIGTERM. While the pod is in the Terminating state, it no longer receives requests through the service. During this phase, your webserver needs to finish its current connections gracefully rather than stop abruptly and drop in-flight requests.

Note that this is configured in the application itself and is not a k8s setting. As long as the webserver drains connections gracefully and your pod spec sets a terminationGracePeriodSeconds long enough for the drain to complete, you should see essentially no errors. Even then, this won't guarantee 100% uptime, especially when bombarding the service with ab.
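For example, in the pod template of your deployment (60 seconds is an illustrative value; it should cover your longest in-flight request plus the drain delay):

    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: myapp
        ...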

-- Patrick W
Source: StackOverflow