Spring Boot with Undertow becomes unresponsive when the worker queue grows too large

8/30/2019

We are running Spring Boot microservices on Kubernetes on Amazon EC2, using Undertow as our embedded web server.

Whenever - for whatever reason - our downstream services are overwhelmed by incoming requests and the downstream pods' worker queues grow too large (I've seen this happen at around 400 queued requests), Spring Boot stops processing queued requests completely and the app goes silent.

Monitoring the queue size via JMX, we can see that it continues to grow as more requests are queued by the IO worker - but by this point no queued requests are ever processed by any of the worker threads.

We can't see any log output or anything to indicate why this might be happening.

This issue cascades upstream: the paralyzed downstream pods cause the upstream pods to experience the same problem, and they too become unresponsive - even when we turn off all incoming traffic through the API gateway.

To resolve the issue we have to stop incoming traffic upstream, kill all of the affected pods, bring them back up in greater numbers, and then turn the traffic back on.

Does anyone have any ideas about this? Is it expected behaviour? If so, how can we make Undertow refuse connections before the queue grows large enough to kill the service? If not, what is causing this behaviour?

Many thanks. Aaron.

-- Aaron Shaw
kubernetes
spring-boot
undertow

1 Answer

8/30/2019

I am not entirely sure whether tweaking the Spring Boot version or the embedded web server will fix this, but below is how you can scale this up using Kubernetes and Istio.

  • livenessProbe

If a livenessProbe is configured correctly, Kubernetes restarts pods that are no longer alive. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request
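
For example, a minimal probe against a Spring Boot Actuator health endpoint might look like the sketch below (the path, port, and timings are assumptions - adjust them to your service):

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-service                   # hypothetical name
    spec:
      containers:
      - name: app
        image: example/demo-service:latest # hypothetical image
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /actuator/health         # assumes Spring Boot Actuator is enabled
            port: 8080
          initialDelaySeconds: 30          # give the JVM time to start
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3              # restart after 3 consecutive failures

In your case, if the probe endpoint is served by the same worker pool that is stuck, the probe should start timing out once the queue hangs, and Kubernetes will restart the pod rather than leaving it silent.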

  • Horizontal Pod Autoscaler

Increases/decreases the number of pod replicas based on CPU utilization or custom metrics. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
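
For example, a CPU-based HPA could look like this sketch (the deployment name, replica bounds, and threshold are assumptions):

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: demo-service-hpa               # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: demo-service                 # hypothetical deployment name
      minReplicas: 2
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70   # add replicas when average CPU goes above 70%

Scaling on a custom metric such as the Undertow queue size you are already exposing via JMX would need the autoscaling/v2beta2 API plus a custom metrics adapter (for example Prometheus + prometheus-adapter).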

  • Vertical Pod Autoscaler

Increases/decreases the CPU/RAM requests of the pod based on load. https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
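
A sketch of a VPA object, assuming the Vertical Pod Autoscaler components are installed in the cluster (the names are hypothetical, and the apiVersion may be autoscaling.k8s.io/v1beta2 on older VPA releases):

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: demo-service-vpa               # hypothetical name
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: demo-service                 # hypothetical deployment name
      updatePolicy:
        updateMode: "Auto"                 # let the VPA apply new CPU/RAM requests automatically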

  • Cluster Autoscaler

Increases/decreases the number of nodes in the cluster based on load. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
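
On AWS this is usually the cluster-autoscaler deployment from the linked repository, pointed at your node Auto Scaling group. An abridged sketch - RBAC and AWS credentials are omitted, and the image tag, ASG name, and bounds are assumptions:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cluster-autoscaler
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: cluster-autoscaler
      template:
        metadata:
          labels:
            app: cluster-autoscaler
        spec:
          serviceAccountName: cluster-autoscaler        # RBAC set up as in the linked repo
          containers:
          - name: cluster-autoscaler
            image: k8s.gcr.io/cluster-autoscaler:v1.14.7   # use the release matching your cluster version
            command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --nodes=2:10:my-worker-asg                # min:max:name of your EC2 Auto Scaling group
            - --skip-nodes-with-local-storage=false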

  • Istio Rate limiting & Retry mechanism

Limits the number of requests the service will accept and adds a retry mechanism for requests that could not be executed. https://istio.io/docs/tasks/traffic-management/request-timeouts/ https://istio.io/docs/concepts/traffic-management/#network-resilience-and-testing
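
A sketch combining a timeout and retries (VirtualService) with connection-pool limits that shed load instead of letting the queue grow (DestinationRule); the host names and numbers are assumptions:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: demo-service
    spec:
      hosts:
      - demo-service                       # hypothetical in-mesh service host
      http:
      - route:
        - destination:
            host: demo-service
        timeout: 10s                       # fail fast instead of queueing indefinitely
        retries:
          attempts: 3
          perTryTimeout: 2s
          retryOn: 5xx,connect-failure
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: demo-service
    spec:
      host: demo-service
      trafficPolicy:
        connectionPool:
          tcp:
            maxConnections: 100            # cap connections to each pod
          http:
            http1MaxPendingRequests: 50    # excess requests are rejected rather than queued
            maxRequestsPerConnection: 10

With the connection-pool limits in place, Envoy returns fast 503s once the limits are hit, so upstream callers see failures they can retry instead of hanging on a dead queue.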

-- Tummala Dhanvi
Source: StackOverflow