502 ALB errors when scaling pods on AWS EKS

9/6/2021

I have an HPA for my app deployed on Kubernetes, together with the cluster autoscaler. Scaling works properly for both pods and nodes, but during production load spikes I see a lot of 502 errors from the ALB (managed by aws-load-balancer-controller).

It seems like I have enabled everything needed to achieve zero-downtime deployments and scaling:

  • pod readiness probe is in place

        readinessProbe:
          httpGet:
            path: /_healthcheck/
            port: 80

  • pod readiness gate [is enabled](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/pod_readiness_gate/)

  • ingress annotation uses `ip` target type

        alb.ingress.kubernetes.io/target-type: ip

  • healthcheck parameters are specified on the ingress resource (a consolidated sketch follows this list)

        alb.ingress.kubernetes.io/healthcheck-path: "/healthcheck/"
        alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
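
Roughly, the relevant pieces fit together like this (the namespace, app name, host, and class/scheme annotations are simplified placeholders for what I actually use; the namespace label is how the v2 controller injects the readiness gate):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-app                                        # placeholder
      labels:
        elbv2.k8s.aws/pod-readiness-gate-inject: enabled  # enables readiness gate injection
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: my-app
      namespace: my-app
      annotations:
        kubernetes.io/ingress.class: alb                  # or spec.ingressClassName, depending on setup
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/healthcheck-path: "/healthcheck/"
        alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
    spec:
      rules:
        - host: app.example.com                           # placeholder
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: my-app                          # placeholder Service
                    port:
                      number: 80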

but that doesn't help.

How do I properly debug this kind of issue, and which other parameters should I tune to completely eliminate 5xx errors from my load balancer?
-- Most Wanted
amazon-eks
aws-application-load-balancer
aws-load-balancer
kubernetes
kubernetes-ingress

1 Answer

10/21/2021

Here's a list of some extra things I've added to my configuration, alongside those mentioned above:

  • container preStop hook

        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "30"]
  • termination grace period on the pod: terminationGracePeriodSeconds: 40 (the sleep time from above plus 10-15 seconds); both termination settings are shown in the pod sketch after this list

  • tune the deregistration delay value on the target group by setting this annotation on the ingress resource:

        alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30

    Usually the value should match the timeout on your backend web server (we don't want to keep a target around longer than the longest possible request needs to finish).
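
For reference, a minimal pod sketch combining the first two items (the name and image are placeholders; adjust the sleep and grace period to your own request timeouts):

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app                              # placeholder
    spec:
      terminationGracePeriodSeconds: 40         # preStop sleep (30s) + 10s buffer
      containers:
        - name: web
          image: my-registry/my-app:latest      # placeholder image
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sleep", "30"]   # keep serving while the ALB deregisters the target
          readinessProbe:
            httpGet:
              path: /_healthcheck/
              port: 80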

The main idea behind this tuning is to make sure that changes to a Pod's state have enough time to propagate to the underlying AWS resources, so that the ALB no longer routes traffic to a pod in the target group that has already been marked as terminated/unhealthy by Kubernetes.

P.S. Make sure you always have enough pods to handle incoming requests (this is especially important for synchronous workers when doing a rolling redeploy). Consider lower values for maxUnavailable and higher values for maxSurge if your cluster/worker nodes have the capacity to allocate the extra pods. So if each pod handles 100 reqs/min on average and your load is 400 reqs/min, make sure that (number of replicas - maxUnavailable) is at least 4 (total reqs per minute / reqs per pod per minute).
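
As an illustration of that math, a Deployment strategy for the 400 reqs/min example might look roughly like this (names and image are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app                      # placeholder
    spec:
      replicas: 5                       # 400 reqs/min / 100 reqs/min per pod = 4, plus 1 for headroom
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0             # keeps replicas - maxUnavailable >= 4 during a rollout
          maxSurge: 2                   # needs spare node capacity for 2 extra pods
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: web
              image: my-registry/my-app:latest   # placeholder image
              ports:
                - containerPort: 80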

-- Most Wanted
Source: StackOverflow