Deploying Akka: heartbeat intervals grow too large when deploying an application with resource limits

3/2/2022

We are working with Akka and have an application that uses Akka Cluster, which we deploy to an AWS EKS environment. When we run a load test against the application, we observe the heartbeat intervals between components growing large and pods getting restarted. The Akka documentation says not to use a CPU resource limit, and removing the resource limit does solve the issue.

But is there any other way of getting around this? We are not sure that removing resource limits is good practice when deploying an application.

Resource limits

To avoid CFS scheduler limits, it is best not to use resources.limits.cpu limits, but use resources.requests.cpu configuration instead.
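
For reference, a manifest following that recommendation would set only a CPU request and omit the CPU limit, roughly like this (a minimal sketch with placeholder values, not our actual configuration):

    resources:
      requests:
        cpu: "1"        # CPU the scheduler reserves for the pod
        memory: 2Gi
      limits:
        memory: 2Gi     # memory limit kept; resources.limits.cpu is omitted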

-- Karan Khanna
akka
akka-cluster
kubernetes

2 Answers

3/2/2022

So, I know the docs do make this recommendation not to use limits, so the following is just "unofficial" advice you are getting from StackOverflow. (And thus, if you want support from Lightbend, you will have to debate this with them.)

Firstly, I agree with you. For many reasons, you absolutely should have resource limits. For example, if you don't have CPU limits your process ends up being designated as "best effort" as far as the CFS scheduler is concerned and that can actually have bad consequences.

As I understand the history of this recommendation from Lightbend, it comes from a situation similar to yours where the CFS scheduler was preventing Akka from getting the resources it needed. There's also the broader problem that, especially when the garbage collector kicks in, it's definitely possible to consume your whole CFS quota very quickly and end up with long GC pause times. The Lightbend position has been that if you use CPU resource limits, the CFS scheduler will limit your CPU consumption, and that could cause problems.

But my position is that limiting your CPU consumption is the entire point, and is actually a good thing. Kubernetes is a shared environment, and limiting CPU consumption is how the system is designed to prevent "noisy neighbors", ensure fair sharing, and often support cost chargebacks. My position is that the CPU limits themselves aren't bad; the problem is only when your limits are too low.

While I hate to give generic advice, as there may be some situations where I might make different recommendations, I would generally recommend setting CPU limits, but having them be significantly higher than your CPU requests (2-3 times as a rule of thumb). This type of setting will classify your Pods as "Burstable", which is important so that your Akka node can burst up to handle high-load situations such as GC, handling failover, etc.
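
As a rough sketch of what that might look like (the numbers are placeholders to illustrate the ratio, not a sizing recommendation):

    resources:
      requests:
        cpu: "1"          # size normal load against this; it's also what the scheduler reserves
        memory: 2Gi
      limits:
        cpu: "3"          # 2-3x the request: headroom for GC, failover, load spikes
        memory: 2Gi
    # With the CPU request lower than the CPU limit, the Pod lands in the "Burstable" QoS class.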

Ideally you then use something like HPA (the Horizontal Pod Autoscaler) so that your Akka cluster autoscales: it handles its normal load within its "request" allocation of CPU and only uses the burst allocation during these occasional circumstances. (Because if you are always operating past the request allocation and close to the burst allocation, then you really aren't getting the benefit of bursting.)
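
A sketch of that autoscaling piece, assuming a Deployment named my-akka-app and the autoscaling/v2 API (the name, replica counts, and target are placeholders):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-akka-app
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-akka-app
      minReplicas: 3
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # measured against the *request*, so bursting stays occasional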

For example, in your specific case, you say you have problems with heartbeats when you set CPU limits, but that the problem goes away when you turn off the CPU limits. This tells me that your CPU limits were way too low. If Akka isn't able to get enough CPU to do heartbeats, even under normal circumstances, then there is no way it would be able to get enough CPU to handle peaks.

(Also, since we are assuming that the system will normally be running at the "request" CPU allocation, but potentially running at the "limit" CPU allocation during GC, I would typically tune my Akka threads as if the JVM had the "request" CPU and tune my GC threads as if it had the "limit" CPU allocation.)
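
One way to express that split, assuming the request/limit numbers from the earlier sketch and standard HotSpot flags (an illustration of the principle, not a tuned configuration):

    env:
      - name: JAVA_TOOL_OPTIONS
        value: >-
          -XX:ActiveProcessorCount=1
          -XX:ParallelGCThreads=3
          -XX:ConcGCThreads=1
    # ActiveProcessorCount feeds Runtime.availableProcessors, which Akka's default
    # dispatcher uses (within its parallelism-min/max bounds) to size its thread pool,
    # so it is set to the "request" CPU; the GC thread counts are sized to the "limit" CPU.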

-- David Ogren
Source: StackOverflow

3/2/2022

There are two basic flavors of CPU limiting that Kubernetes supports: CFS and shares.

CFS places a hard limit on the host CPU time that a given container (or, more properly, the cgroup associated with the container) can consume: typically X milliseconds in every 100 milliseconds. Once the container has had its X milliseconds, it cannot run for the remainder of the 100-millisecond slice.
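
To make the arithmetic concrete (cgroup v1 parameter names; the limit value is just an example):

    resources:
      limits:
        cpu: "1500m"
    # -> cpu.cfs_quota_us  = 150000 and cpu.cfs_period_us = 100000:
    #    150ms of CPU time per 100ms window, summed across all cores; once it's spent,
    #    every thread in the container waits until the next 100ms period starts.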

Shares instead allocate CPU proportionally. If the containers on a host have a total share of 4000, a container with a share of 1500 will get 15/40 = 37.5% of the total CPU capacity. Since Kubernetes typically determines the share value based on millicores (1.0 CPU => 1000 share), in our 4000-total-share example, if the host has 8 cores the container with a 1500 share (1.5 CPU) would actually get clock cycles equivalent to 3 CPU (or more if other containers aren't using their allocations). If the total shares were 8000, that container would get the equivalent of 1.5 CPU (Kubernetes in turn will ensure that the total share doesn't go above 8000).

So under full load, there's not really a difference between the CFS limit approach and the share approach. When the host isn't fully used, CFS will have idle CPU while shares will allow containers to have more CPU. If the workloads running on the host are bursty (especially at shorter timescales), the share approach will be very effective.

For steadier workloads (think IoT more than online retailing), the fact that you're seeing this at all is indicative of an issue:

  • You might be underprovisioning CPU.
  • It's possible that you're seeing thread starvation, especially in the dispatchers related to clustering/remoting. A thread starvation detector (e.g. the one available with a Lightbend subscription; disclaimer: I am employed by Lightbend, but can endorse its utility from when I was a client) can help diagnose this. In Akka 2.6, clustering actors run in a separate dispatcher from application actors, so this is less likely to be an issue; if running pre-2.6, you can have clustering use a separate dispatcher by adjusting akka.cluster.use-dispatcher (see the config sketch after this list). It is also probably worth looking for actors in your application which take a long time to process messages and moving those actors to other dispatchers.
  • JVM garbage collectors all have situations where they can become stop-the-world and typically the duration of the pause is correlated with the size of the heap: more instances with smaller heaps can help here.
  • Finally, if the constraints of the application and how you're deploying it are such that heartbeat arrivals are going to be volatile, you can adjust the failure detector to be less eager to declare a node failed. The primary knobs to twiddle here are akka.cluster.failure-detector.threshold and akka.cluster.failure-detector.acceptable-heartbeat-pause (see the sketch after this list). Note that the former defaults to 8.0, which is considered appropriate for a LAN environment with physical hardware. Every layer of virtualization and sharing of network/compute infrastructure with workloads you don't control warrants increasing the threshold: I've seen 12.0 recommended for EC2 deploys, and given that EKS and such gives you even less control, I'd consider something like 13.0. Likewise, my experience has been that acceptable-heartbeat-pause should be about 500ms per GB of Java heap. Note that doing anything to make the failure detector less eager implies being slower to respond to a node being down.
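
A sketch of the two configuration changes mentioned above, in application.conf (the dispatcher sizing and detector values are placeholders to adapt to your environment, not recommendations):

    akka {
      cluster {
        # Pre-2.6 only: run the clustering actors on their own dispatcher so slow
        # application actors can't starve heartbeats (2.6+ separates them by default).
        use-dispatcher = cluster-dispatcher

        failure-detector {
          threshold = 12.0                  # default 8.0 assumes physical hardware on a LAN
          acceptable-heartbeat-pause = 5s   # e.g. roughly 500ms per GB of heap, per the note above
        }
      }
    }

    cluster-dispatcher {
      type = Dispatcher
      executor = "fork-join-executor"
      fork-join-executor {
        parallelism-min = 2
        parallelism-max = 4
      }
    }
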
-- Levi Ramsey
Source: StackOverflow