I have a SpringBoot project with graceful shutdown configured. Deployed on k8s 1.12.7
Here are the logs,
2019-07-20 10:23:16.180 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Received shutdown event
2019-07-20 10:23:16.180 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Waiting for 30s to finish
2019-07-20 10:23:16.273 INFO [service,fd964ebaa631a860,75a07c123397e4ff,false] 1 --- [io-8080-exec-10] com.jay.resource.ProductResource : GET /products?id=59
2019-07-20 10:23:16.374 INFO [service,9a569ecd8c448e98,00bc11ef2776d7fb,false] 1 --- [nio-8080-exec-1] com.jay.resource.ProductResource : GET /products?id=68
...
2019-07-20 10:23:33.711 INFO [service,1532d6298acce718,08cfb8085553b02e,false] 1 --- [nio-8080-exec-9] com.jay.resource.ProductResource : GET /products?id=209
2019-07-20 10:23:46.181 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Resumed after hibernation
2019-07-20 10:23:46.216 INFO [service,,,] 1 --- [ Thread-7] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'
Application has received the SIGTERM
at 10:23:16.180
from Kubernetes. As per Termination of Pods point#5
says that the terminating pod is removed from the endpoints list of service, but it is contradicting that it forwarded the requests for 17 seconds (until 10:23:33.711
) after sending SIGTERM
signal. Is there any configuration missing?
Dockerfile
FROM openjdk:8-jre-slim
MAINTAINER Jay
RUN apt update && apt install -y curl libtcnative-1 gcc && apt clean
ADD build/libs/sample-service.jar /
CMD ["java", "-jar" , "sample-service.jar"]
GracefulShutdown
// https://github.com/spring-projects/spring-boot/issues/4657
class GracefulShutdown(val waitTime: Long, val timeout: Long) : TomcatConnectorCustomizer, ApplicationListener<ContextClosedEvent> {
@Volatile
private var connector: Connector? = null
override fun customize(connector: Connector) {
this.connector = connector
}
override fun onApplicationEvent(event: ContextClosedEvent) {
log.info("Received shutdown event")
val executor = this.connector?.protocolHandler?.executor
if (executor is ThreadPoolExecutor) {
try {
val threadPoolExecutor: ThreadPoolExecutor = executor
log.info("Waiting for ${waitTime}s to finish")
hibernate(waitTime * 1000)
log.info("Resumed after hibernation")
this.connector?.pause()
threadPoolExecutor.shutdown()
if (!threadPoolExecutor.awaitTermination(timeout, TimeUnit.SECONDS)) {
log.warn("Tomcat thread pool did not shut down gracefully within $timeout seconds. Proceeding with forceful shutdown")
threadPoolExecutor.shutdownNow()
if (!threadPoolExecutor.awaitTermination(timeout, TimeUnit.SECONDS)) {
log.error("Tomcat thread pool did not terminate")
}
}
} catch (ex: InterruptedException) {
log.info("Interrupted")
Thread.currentThread().interrupt()
}
}else
this.connector?.pause()
}
private fun hibernate(time: Long){
try {
Thread.sleep(time)
}catch (ex: Exception){}
}
companion object {
private val log = LoggerFactory.getLogger(GracefulShutdown::class.java)
}
}
@Configuration
class GracefulShutdownConfig(@Value("\${app.shutdown.graceful.wait-time:30}") val waitTime: Long,
@Value("\${app.shutdown.graceful.timeout:30}") val timeout: Long) {
companion object {
private val log = LoggerFactory.getLogger(GracefulShutdownConfig::class.java)
}
@Bean
fun gracefulShutdown(): GracefulShutdown {
return GracefulShutdown(waitTime, timeout)
}
@Bean
fun webServerFactory(gracefulShutdown: GracefulShutdown): ConfigurableServletWebServerFactory {
log.info("GracefulShutdown configured with wait: ${waitTime}s and timeout: ${timeout}s")
val factory = TomcatServletWebServerFactory()
factory.addConnectorCustomizers(gracefulShutdown)
return factory
}
}
deployment file
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
labels:
k8s-app: service
name: service
spec:
progressDeadlineSeconds: 420
replicas: 1
revisionHistoryLimit: 1
selector:
matchLabels:
k8s-app: service
strategy:
rollingUpdate:
maxSurge: 2
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
labels:
k8s-app: service
spec:
terminationGracePeriodSeconds: 60
containers:
- env:
- name: SPRING_PROFILES_ACTIVE
value: dev
image: service:2
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 20
httpGet:
path: /actuator/health
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 5
name: service
ports:
- containerPort: 8080
protocol: TCP
readinessProbe:
failureThreshold: 60
httpGet:
path: /actuator/health
port: 8080
initialDelaySeconds: 100
periodSeconds: 10
timeoutSeconds: 5
UPDATE:
Added custom health check endpoint
@RestControllerEndpoint(id = "live")
@Component
class LiveEndpoint {
companion object {
private val log = LoggerFactory.getLogger(LiveEndpoint::class.java)
}
@Autowired
private lateinit var gracefulShutdownStatus: GracefulShutdownStatus
@GetMapping
fun live(): ResponseEntity<Any> {
val status = if(gracefulShutdownStatus.isTerminating())
HttpStatus.INTERNAL_SERVER_ERROR.value()
else
HttpStatus.OK.value()
log.info("Status: $status")
return ResponseEntity.status(status).build()
}
}
Changed the livenessProbe
,
livenessProbe:
httpGet:
path: /actuator/live
port: 8080
initialDelaySeconds: 100
periodSeconds: 5
timeoutSeconds: 5
failureThreshold: 3
Here are the logs after the change,
2019-07-21 14:13:01.431 INFO [service,9b65b26907f2cf8f,9b65b26907f2cf8f,false] 1 --- [nio-8080-exec-2] com.jay.util.LiveEndpoint : Status: 200
2019-07-21 14:13:01.444 INFO [service,3da259976f9c286c,64b0d5973fddd577,false] 1 --- [nio-8080-exec-3] com.jay.resource.ProductResource : GET /products?id=52
2019-07-21 14:13:01.609 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Received shutdown event
2019-07-21 14:13:01.610 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Waiting for 30s to finish
...
2019-07-21 14:13:06.431 INFO [service,002c0da2133cf3b0,002c0da2133cf3b0,false] 1 --- [nio-8080-exec-3] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:06.433 INFO [service,072abbd7275103ce,d1ead06b4abf2a34,false] 1 --- [nio-8080-exec-4] com.jay.resource.ProductResource : GET /products?id=96
...
2019-07-21 14:13:11.431 INFO [service,35aa09a8aea64ae6,35aa09a8aea64ae6,false] 1 --- [io-8080-exec-10] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:11.508 INFO [service,a78c924f75538a50,0314f77f21076313,false] 1 --- [nio-8080-exec-2] com.jay.resource.ProductResource : GET /products?id=110
...
2019-07-21 14:13:16.431 INFO [service,38a940dfda03956b,38a940dfda03956b,false] 1 --- [nio-8080-exec-9] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:16.593 INFO [service,d76e81012934805f,b61cb062154bb7f0,false] 1 --- [io-8080-exec-10] com.jay.resource.ProductResource : GET /products?id=152
...
2019-07-21 14:13:29.634 INFO [service,38a32a20358a7cc4,2029de1ed90e9539,false] 1 --- [nio-8080-exec-6] com.jay.resource.ProductResource : GET /products?id=191
2019-07-21 14:13:31.610 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Resumed after hibernation
2019-07-21 14:13:31.692 INFO [service,,,] 1 --- [ Thread-7] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'
With the livenessProbe
of 3 failures, kubernetes served the traffic for 13 seconds after liveness failures i.e., from 14:13:16.431
to 14:13:29.634
.
UPDATE 2: The sequence of events (thanks to Eamonn McEvoy
)
seconds | healthy | events
0 | ✔ | * liveness probe healthy
1 | ✔ | - SIGTERM
2 | ✔ |
3 | ✔ |
4 | ✔ |
5 | ✔ | * liveness probe unhealthy (1/3)
6 | ✔ |
7 | ✔ |
8 | ✔ |
9 | ✔ |
10 | ✔ | * liveness probe unhealthy (2/3)
11 | ✔ |
12 | ✔ |
13 | ✔ |
14 | ✔ |
15 | ✘ | * liveness probe unhealthy (3/3)
.. | ✔ | * traffic is served
28 | ✔ | * traffic is served
29 | ✘ | * pod restarts
SIGTERM isn't putting the pod into a terminating state immediately. You can see in the logs your application begins graceful shutdown at 10:23:16.180 and takes >20 seconds to complete. At this point, the container stops and pod can enter the terminating state.
As far as kubernetes is concerned the pod looks ok during the graceful shutdown period. You need to add a liveness probe to your deployment; when it becomes unhealthy the traffic will stop.
livenessProbe:
httpGet:
path: /actuator/health
port: 8080
initialDelaySeconds: 100
periodSeconds: 10
timeoutSeconds: 5
Update:
This is because you have a failure threshold of 3, so you are allowing traffic for up to 15 seconds after the sigterm;
e.g.
seconds | healthy | events
0 | ✔ | * liveness probe healthy
1 | ✔ | - SIGTERM
2 | ✔ |
3 | ✔ |
4 | ✔ |
5 | ✔ | * liveness probe issued
6 | ✔ | .
7 | ✔ | .
8 | ✔ | .
9 | ✔ | .
10 | ✔ | * liveness probe timeout - unhealthy (1/3)
11 | ✔ |
12 | ✔ |
13 | ✔ |
14 | ✔ |
15 | ✔ | * liveness probe issued
16 | ✔ | .
17 | ✔ | .
18 | ✔ | .
19 | ✔ | .
20 | ✔ | * liveness probe timeout - unhealthy (2/3)
21 | ✔ |
22 | ✔ |
23 | ✔ |
24 | ✔ |
25 | ✔ | * liveness probe issued
26 | ✔ | .
27 | ✔ | .
28 | ✔ | .
29 | ✔ | .
30 | ✘ | * liveness probe timeout - unhealthy (3/3)
| | * pod restarts
This is assuming that the endpoint returns an unhealthy response during the graceful shutdown. Since you have timeoutSeconds: 5
, if the probe simply times out this will take much longer, with a 5 second delay between issuing a liveness probe request and receiving its response. It could be the case that the container actually dies before the liveness threshold is hit and you are still seeing the original behaviour