I am hosting my services on azure cloud, sometimes I get "BackendConnectionFailure" without any apparent reason, after investigation I found a correlation between this exception and autoscale (scaling down) almost at the same second in most of the cases.
According to documentation termination grace period by default is 30 seconds, which is the case. The pod will be marked terminating and the loadbalancer will not consider it anymore, so receiving no more requests. According to this if my service takes far less time than 30 seconds, I should not need prestop hook or any special implementation in my application (please correct me if I am wrong).
If the previous paragraph is correct, why does this exception occur relatively frequent? My thought is when the pod is marked terminating and the loadbalancer does not forward anymore requests to the pod while it should do.
Edit 1:
The Architecture is simply like this
Client -> Firewall(azure) -> API(azure APIM) -> Microservices(Spring boot) -> backend(third party) or azure RDB depending on the service
I think the Exception comes from APIM, I found two patterns for this exception:
Message The underlying connection was closed: The connection was closed unexpectedly. Exception type BackendConnectionFailure Failed method forward-request
Response time 10.0 s
Message The underlying connection was closed: A connection that was expected to be kept alive was closed by the server. Exception type BackendConnectionFailure Failed method forward-request
Response time 3.6 ms
Spring Boot doesn't do graceful termination by default.
The Spring Boot app and it's application container (not linux container) are in control of what happens to existing connections during the termination grace period. The protocols being used and how a client reacts to a "close" also have a part to play.
If you get to the end of the grace period, then everything gets a hard reset.
When a pod is deleted in k8s, the Pod Endpoint removal from Services is triggered at the same time as the SIGTERM
signal to the container(s).
At this point the cluster nodes will be reconfigured to remove any rules directing new traffic to the Pod. Any existing TCP connections to the Pod/containers will remain in connection tracking until they are closed (by the client, server or network stack).
For HTTP Keep Alive or HTTP/2 services, the client will continue hitting the same Pod Endpoint until it is told to close the connection (or it is forcibly reset)
The basic rules are, on SIGTERM the application should:
Some circumstances you might not be able to handle (depends on the client)
Connection: close
header. It will need a TCP level FIN close.Although keepalive clients should respect a TCP FIN close, every client reacts differently. Microsoft APIM might be sensitive and produce the error even though there was no real world impact. It's best to load test your setup while scaling to see if there is a real world impact.
For more spring boot info see:
https://github.com/spring-projects/spring-boot/issues/4657 https://github.com/corentin59/spring-boot-graceful-shutdown https://github.com/SchweizerischeBundesbahnen/springboot-graceful-shutdown
When your applications receives a SIGTERM
(from the Pod termination) it needs to first stop reporting it is ready (fail the readinessProbe
) but still serve requests as they come in from clients. After a certain time (depending on your readinessProbe
settings) you can shut down the application.
For Spring Boot there is a small library doing exactly that: springboot-graceful-shutdown
You can use a preStop sleep if needed. While the pod is removed from the service endpoints immediately, it still takes time (10-100ms) for the endpoint update to be sent to every node and for them to update iptables.