Prevent Spring Boot application closing until all current requests are finished

5/17/2019

We have a Spring Boot (2.0.4) application exposing a number of endpoints, one of which enables clients to retrieve sometimes very large files (~200 GB). The application is exposed in a Pod via a Kubernetes deployment configured with the rolling-update strategy.

When we update our deployment by setting the image to the latest version, the pods are destroyed and new ones are spun up. Service is seamless for new requests. However, in-flight requests can and do get severed, which is annoying for clients in the middle of downloading very large files.

We can configure container lifecycle pre-stop hooks in our deployment spec to inject a pause before the shutdown signal is sent to the app via its PID. This helps prevent new traffic from reaching pods that have been marked Terminating. Is there a way to then pause the application shutdown process until all current requests have completed (this may take tens of minutes)?

Here's what we have tried from within the Spring Boot application:

  • Implementing a shutdown listener which intercepts ContextClosedEvent; unfortunately we can't reliably retrieve a list of active requests. Any Actuator metrics that might have been useful are unavailable at this stage of the shutdown process.

  • Counting active sessions by implementing an HttpSessionListener and overriding the sessionCreated/sessionDestroyed methods to update a counter. This fails because the methods are not invoked on a separate thread, so the shutdown listener always sees the same value.
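
Counting HTTP sessions is not the same as counting requests in flight. One way to track the latter, sketched here with only the JDK, is a counter that is incremented when a request starts and decremented once its response has been written, which a shutdown listener can then wait on. InFlightTracker and its methods are hypothetical names, not part of Spring Boot or the Servlet API; wiring begin()/end() into a servlet filter is assumed and not shown.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: counts requests that are currently being served. */
final class InFlightTracker {
    private int active = 0;

    /** Call when a request starts, e.g. at the top of a servlet filter. */
    synchronized void begin() {
        active++;
    }

    /** Call in a finally block once the response has been written. */
    synchronized void end() {
        if (active > 0) {
            active--;
        }
        if (active == 0) {
            notifyAll(); // wake any thread blocked in awaitDrained
        }
    }

    synchronized int active() {
        return active;
    }

    /**
     * Blocks until no request is active or the timeout elapses.
     * Returns true if the tracker drained, false on timeout or interrupt.
     */
    synchronized boolean awaitDrained(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        try {
            while (active > 0) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    return false;
                }
                wait(remainingMs);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        InFlightTracker tracker = new InFlightTracker();
        tracker.begin(); // simulate one long download in progress
        new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            tracker.end();
        }).start();
        System.out.println("drained: " + tracker.awaitDrained(5, TimeUnit.SECONDS));
    }
}
```

In a filter this would look roughly like tracker.begin(); try { chain.doFilter(req, res); } finally { tracker.end(); }, which only covers handlers that write the response synchronously; a ContextClosedEvent listener would then call awaitDrained with a generous timeout.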

Any other strategy we should try? From within the app itself, or the container, or directly through Kubernetes resource descriptors? Advice/Help/Pointers would be much appreciated.

Edit: We manage the cluster, so we're only trying to mitigate service outages for currently connected clients during a managed update of our deployment via a modified pod spec.

-- slowko
docker
java
kubernetes
spring-boot
spring-boot-actuator

3 Answers

12/7/2019

Try gracefully shutting down your Spring Boot application.

This might help:

https://dzone.com/articles/graceful-shutdown-spring-boot-applications

This implementation makes sure that none of your active connections are killed: the application waits for them to finish before shutting down.
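
The core of a graceful shutdown is "stop accepting new work, then wait for running work to finish". The following is a minimal JDK-only sketch of that drain step, not the linked article's code: DrainOnShutdown and drain are hypothetical names, and in a real Spring Boot application the pool would be the embedded server's request executor rather than one created by hand. Spring Boot 2.3 and later offer this behaviour through the server.shutdown=graceful property, which is not available in 2.0.4.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper illustrating the drain step of a graceful shutdown. */
final class DrainOnShutdown {

    /**
     * Stops the pool from accepting new tasks, then waits for running ones.
     * Returns true if every task finished within the timeout.
     */
    static boolean drain(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // rejects new tasks; tasks already submitted keep running
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Stand-in for a slow download that is still being served.
        pool.submit(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
        });
        System.out.println("drained: " + drain(pool, 5, TimeUnit.SECONDS));
    }
}
```

For downloads that can run for tens of minutes, the timeout passed to drain has to be at least that long, and the pod's terminationGracePeriodSeconds has to be longer still, or Kubernetes will kill the container first.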

-- Amit kumar
Source: StackOverflow

5/18/2019

You could increase terminationGracePeriodSeconds; the default is 30 seconds. Unfortunately, nothing prevents a cluster admin from force-deleting your pod, and there are all sorts of reasons the whole node could go away.

-- Joshua Oliphant
Source: StackOverflow

5/21/2019

We did a combination of the above to resolve our problem.

  • increased terminationGracePeriodSeconds to the absolute maximum we expect to see in production
  • added liveness and readiness probes; the readinessProbe is what keeps Traefik from routing to our pod too soon
  • introduced a pre-stop hook that injects a pause and then invokes a monitoring script, which:
    1. polls netstat until there are no ESTABLISHED connections to our process (pid 1) with a Foreign Address of our cluster Traefik service
    2. sends TERM to pid 1

Note that because the monitoring script sends TERM to pid 1, the pod terminates at that point and terminationGracePeriodSeconds is never reached (it's there as a precaution).

Here's the script:

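#!/bin/sh
# Wait until pid 1 (java) has no ESTABLISHED connections on http-alt (port 8080)
# from the Traefik ingress service, then ask it to shut down.
# Relies on netstat resolving port and host names (no -n flag).

while [ "$(/bin/netstat -ap 2>/dev/null | /bin/grep 'http-alt.*ESTABLISHED.*1/java' | /bin/grep -c traefik-ingress-service)" -gt 0 ]
do
  sleep 1
done

kill -TERM 1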

Here's the new pod spec:

containers:
  - env:
    - name: spring_profiles_active
      value: dev
    image: container.registry.host/project/app:@@version@@
    imagePullPolicy: Always
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - sleep 5 && /monitoring.sh
    livenessProbe:
      httpGet:
        path: /actuator/health
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 20
      timeoutSeconds: 3
    name: app
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /actuator/health
        port: 8080
      initialDelaySeconds: 60
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 2
        memory: 2Gi
imagePullSecrets:
- name: app-secret
serviceAccountName: vault-auth
terminationGracePeriodSeconds: 86400

-- slowko
Source: StackOverflow