We have a Spring Boot (2.0.4) application exposing a number of endpoints, one of which enables clients to retrieve sometimes very large files (~200 GB). The application is exposed in a Pod via a Kubernetes deployment configured with the rolling-update strategy.
When we update our deployment by setting the image to the latest version, the pods get destroyed and new ones are spun up. Service is seamless for new requests. However, in-flight requests can and do get severed, which is annoying for clients in the middle of downloading very large files.
We can configure Container Lifecycle Pre-Stop hooks in our deployment spec to inject a pause before sending shutdown signals to the app via its PID. This helps prevent any new traffic going to pods which have been set to Terminate. Is there a way to then pause the application shutdown process until all current requests have been completed (this may take tens of minutes)?
Here's what we have tried from within the Spring Boot application:
Implementing a shutdown listener which intercepts ContextClosedEvent; unfortunately we can't reliably retrieve a list of active requests. Any Actuator metrics which may have been useful are unavailable at this stage of the shutdown process.
Counting active sessions by implementing an HttpSessionListener and overriding the sessionCreated/sessionDestroyed methods to update a counter. This fails because the methods are not invoked on a separate thread, so the shutdown listener always sees the same value.
Any other strategy we should try? From within the app itself, or the container, or directly through Kubernetes resource descriptors? Advice/Help/Pointers would be much appreciated.
Edit: We manage the cluster so we're only trying to mitigate service outages to currently connected clients during a managed update of our deployment via a modified pod spec
Try to gracefully shut down your Spring Boot application.
This might help:
https://dzone.com/articles/graceful-shutdown-spring-boot-applications
This implementation will make sure that none of your active connections are killed; the application will wait for them to finish before shutting down.
You could increase the terminationGracePeriodSeconds; the default is 30 seconds. But unfortunately, there's nothing to prevent a cluster admin from force deleting your pod, and there are all sorts of reasons the whole node could go away.
We did a combination of the above to resolve our problem.
Note that because we send TERM to PID 1 from the monitoring script, the pod terminates at that point and the terminationGracePeriodSeconds limit is never reached (it's there as a precaution).
Here's the script:
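#!/bin/sh
# Wait while the Java process (PID 1) still has established connections
# on port 8080 (http-alt) coming from the Traefik ingress.
while [ "$(/bin/netstat -ap 2>/dev/null | /bin/grep 'http-alt.*ESTABLISHED.*1/java' | grep -c traefik-ingress-service)" -gt 0 ]
do
    sleep 1
done
# All client connections have drained; ask the application to shut down.
kill -TERM 1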
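The polling loop above can also be factored into a small reusable function with an upper bound on the wait. What follows is only a sketch, not the script we run: `wait_for_drain` and `count_connections` are hypothetical names, and the netstat/grep pattern in `count_connections` assumes the app listens on port 8080 and that `netstat` is present in the image.

```shell
#!/bin/sh
# Hypothetical helper: poll a connection-counting command once per second
# until it prints 0, or give up after a maximum number of seconds.
#   $1        = maximum seconds to wait
#   remaining = command (plus arguments) that prints the current count
wait_for_drain() {
    max_wait=$1
    shift
    waited=0
    while [ "$("$@")" -gt 0 ]; do
        if [ "$waited" -ge "$max_wait" ]; then
            return 1    # deadline reached, connections still open
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0            # no connections left
}

# Assumed counter: established TCP connections involving port 8080.
# Adjust the pattern to your own port and ingress.
count_connections() {
    netstat -tn 2>/dev/null | grep -c ':8080 .*ESTABLISHED'
}
```

A preStop script could then call something like `wait_for_drain 3600 count_connections` before `kill -TERM 1`. Passing the counter in as a command keeps the loop easy to exercise with a stub outside the container.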
Here's the new pod spec:
containers:
- env:
  - name: spring_profiles_active
    value: dev
  image: container.registry.host/project/app:@@version@@
  imagePullPolicy: Always
  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 5 && /monitoring.sh
  livenessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
    periodSeconds: 20
    timeoutSeconds: 3
  name: app
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 2
      memory: 2Gi
imagePullSecrets:
- name: app-secret
serviceAccountName: vault-auth
terminationGracePeriodSeconds: 86400