I followed some tutorials on how to set up an HTTP server and test it in a local Kubernetes cluster (using minikube).
I also implemented graceful shutdown from some examples I found, and expected that there would be no downtime from a Kubernetes rolling restart.
To verify that, I started performing load tests (using Apache Benchmark, by running ab -n 100000 -c 20 <addr>) and running kubectl rollout restart during the benchmarking, but ab stops running as soon as the rolling restart is performed.
Here is my current project setup:
Dockerfile
FROM golang:1.13.4-alpine3.10
RUN mkdir /app
ADD . /app
WORKDIR /app
RUN go build -o main src/main.go
CMD ["/app/main"]
src/main.go
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"

    "github.com/gorilla/mux"
)

func main() {
    srv := &http.Server{
        Addr:    ":8080",
        Handler: NewHTTPServer(),
    }

    idleConnsClosed := make(chan struct{})
    go func() {
        sigint := make(chan os.Signal, 1)
        signal.Notify(sigint, syscall.SIGINT, syscall.SIGTERM)
        <-sigint

        // We received an interrupt signal, shut down.
        if err := srv.Shutdown(context.Background()); err != nil {
            // Error from closing listeners, or context timeout:
            log.Printf("HTTP server Shutdown: %v", err)
        }
        close(idleConnsClosed)
    }()

    log.Printf("Starting HTTP server")
    if err := srv.ListenAndServe(); err != http.ErrServerClosed {
        // Error starting or closing listener:
        log.Fatalf("HTTP server ListenAndServe: %v", err)
    }

    <-idleConnsClosed
}

func NewHTTPServer() http.Handler {
    r := mux.NewRouter()

    // Ping
    r.HandleFunc("/", handler)

    return r
}

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello World!")
}
kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 5
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: dickster/graceful-shutdown-test:latest
        imagePullPolicy: Never
        ports:
        - containerPort: 8080
kubernetes/service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: myapp
  sessionAffinity: None
  type: NodePort
Is there something missing in this setup? According to the rollingUpdate strategy, there should be at least five running pods that should serve the incoming requests, but ab exits with an apr_socket_recv: Connection reset by peer (54) error. I also tried adding readiness/liveness probes, but no luck. I suspect they're not needed here, either.
For this to work without downtime, a terminating pod needs to stop receiving new connections while it is still allowed to gracefully finish handling its current ones. This means the pod must be running but not ready, so that new requests are not routed to it.
Your Service will match all pods using the label selector you configured (I assume app: myapp) and will use any pod in the Ready state as a possible backend. A pod is marked Ready as long as it passes its readinessProbe. Since you have no probe configured, the pod status defaults to Ready as long as it is running.
Just having a readinessProbe configured will help immensely, but it will not provide 100% uptime. That requires a tweak in your code: make the readinessProbe fail (so new requests stop being sent) while the container gracefully finishes its current connections.
EDIT: As @Thomas Jungblut mentioned, a big part of eliminating errors with your webserver is how the application handles SIGTERM. While the pod is in the Terminating state, it will no longer receive requests through the Service. During this phase, your webserver needs to gracefully complete and drain its current connections rather than stopping abruptly and dropping in-flight requests.
Note that this draining is configured in the application itself and is not a k8s setting. As long as the webserver drains connections gracefully and your pod spec's terminationGracePeriodSeconds is long enough to allow the drain, you should see essentially no errors. Even this still won't guarantee 100% uptime, though, especially when bombarding the service using ab.
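For reference, both knobs live in the pod spec of kubernetes/deployment.yaml. A sketch (the grace period and sleep length are illustrative; the preStop sleep simply delays SIGTERM to give the endpoint removal time to propagate, and requires a sleep binary in the image, which the alpine base provides):

```yaml
# Sketch of spec.template.spec in kubernetes/deployment.yaml.
spec:
  terminationGracePeriodSeconds: 30   # total time allowed between SIGTERM and SIGKILL
  containers:
  - name: myapp
    image: dickster/graceful-shutdown-test:latest
    imagePullPolicy: Never
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]     # delay SIGTERM while endpoints update
    ports:
    - containerPort: 8080
```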