I'm digging into Kubernetes resource restrictions and am having a hard time understanding what CPU limits are for. I know Kubernetes passes requests and limits down to the (in my case) Docker runtime.
Example: I have 1 Node with 1 CPU and 2 Pods with CPU requests: 500m and limits: 800m. In Docker, this results in (500m -> 0.5 * 1024 = 512) --cpu-shares=512 and (800m -> 800 * 100) --cpu-quota=80000. The pods get allocated by the Kube scheduler because the sum of the requests does not exceed 100% of the node's capacity; in terms of limits, the node is overcommitted.
The above allows each container to get 80ms of CPU time per 100ms period (the default). As soon as CPU usage reaches 100%, the CPU time is shared between the containers based on their weight, expressed in CPU shares. That would be 50% for each container, given the base value of 1024 and a share of 512 for each. At this point, in my understanding, the limits have no more relevance because neither container can get its 80ms anymore. They both would get 50ms. So no matter what limits I define, when usage reaches the critical 100%, CPU time is partitioned by requests anyway.
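To make the arithmetic above concrete, here is a minimal sketch of the conversion as I understand it. The function names and the `CFS_PERIOD_US` constant are my own, not part of Kubernetes or Docker, and the sketch ignores any minimums or rounding rules the kubelet may apply.

```python
CFS_PERIOD_US = 100_000  # default CFS period of 100 ms, in microseconds


def cpu_shares(request_millicores: int) -> int:
    """Map a CPU request in millicores to a --cpu-shares value (1000m -> 1024)."""
    return request_millicores * 1024 // 1000


def cpu_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Map a CPU limit in millicores to a --cpu-quota value in microseconds."""
    return limit_millicores * period_us // 1000


# The pod from the example: requests 500m, limits 800m.
print(cpu_shares(500), cpu_quota_us(800))
```

For the example pod this gives the same 512 shares and 80000 quota quoted above.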
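The scenario can be sketched as a toy model of one 100ms period on a single CPU. This is only an illustration of the reasoning above, not how the kernel actually schedules: it hands out time in proportion to shares, caps each container at its quota, and gives unused time to whoever still wants it. The function name and the `(shares, quota_ms, demand_ms)` tuple layout are hypothetical.

```python
def cpu_time_per_period(containers, capacity_ms=100.0):
    """containers: dict of name -> (shares, quota_ms, demand_ms).

    Returns a dict of name -> milliseconds of CPU received in one period.
    """
    granted = {name: 0.0 for name in containers}
    # A container never gets more than its quota or more than it asks for.
    cap = {n: min(q, d) for n, (_, q, d) in containers.items()}
    remaining = capacity_ms
    active = {n for n in containers if cap[n] > 0}
    while active and remaining > 0:
        total_shares = sum(containers[n][0] for n in active)
        # Offer each still-hungry container its share-weighted slice.
        offers = {n: remaining * containers[n][0] / total_shares for n in active}
        satisfied = {n for n in active if offers[n] >= cap[n]}
        if not satisfied:
            # Everyone wants more than their slice: pure share-based split.
            for n in active:
                granted[n] = offers[n]
            break
        # Capped containers take only what they may; the rest is redistributed.
        for n in satisfied:
            granted[n] = cap[n]
            remaining -= cap[n]
        active -= satisfied
    return granted


both_busy = cpu_time_per_period({"a": (512, 80, 100), "b": (512, 80, 100)})
one_idle = cpu_time_per_period({"a": (512, 80, 100), "b": (512, 80, 0)})
print(both_busy, one_idle)
```

Under this model, when both containers are busy each gets 50ms, so the quota never kicks in. When one is idle, the other is held to 80ms even though 100ms is available, which is the only case in which the limit changes the outcome.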
This makes me wonder: Why should I define CPU limits in the first place, and does overcommitment make any difference at all? Requests, on the other hand, in the sense of "how much share do I get when everything is in use", are completely understandable.
There is no upper bound with just CPU shares. If there are free cycles, you are free to use them. The limit is imposed so that one rogue process cannot hold on to the resource forever. There should be some fair scheduling. CFS enforces that using the CPU quota and CPU period, which are derived from the limit attribute configured here.
To conclude, this kind of property ensures that when I schedule your task, you get at least 50 milliseconds of CPU time in each 100-millisecond period. If you need more and no one is waiting in the queue, I will let you run a little longer, but never more than 80 milliseconds per period.
One reason to set CPU limits is that, if you set CPU request == limit and memory request == limit, your pod is assigned a Quality of Service class = Guaranteed, which makes it less likely to be OOMKilled if the node runs out of memory. Here I quote from the Kubernetes doc Configure Quality of Service for Pods:
For a Pod to be given a QoS class of Guaranteed:
- Every Container in the Pod must have a memory limit and a memory request, and they must be the same.
- Every Container in the Pod must have a CPU limit and a CPU request, and they must be the same.
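As a rough sketch of the two rules quoted above, the check could look like the following. The `qos_class` function and the dict layout are my own illustration, not a Kubernetes API. The Burstable and BestEffort branches come from the rest of that doc page rather than from the quote, and the sketch compares quantity strings literally, so "500m" and "0.5" would wrongly count as different.

```python
def qos_class(containers):
    """containers: list of dicts such as
    {"requests": {"cpu": "500m", "memory": "128Mi"},
     "limits":   {"cpu": "500m", "memory": "128Mi"}}
    """
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit, if one is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


equal = {"cpu": "500m", "memory": "128Mi"}
print(qos_class([{"requests": equal, "limits": equal}]))
```

With request == limit for both CPU and memory this returns "Guaranteed". The pod from the question (requests 500m, limits 800m) would come out as "Burstable".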
Another benefit of using the Guaranteed QoS class is that it allows you to lock exclusive CPUs for the pod, which is critical for certain kinds of low-latency programs. Quote from Control CPU Management Policies:
The static CPU management policy allows containers in Guaranteed pods with integer CPU requests access to exclusive CPUs on the node. ... Only containers that are both part of a Guaranteed pod and have integer CPU requests are assigned exclusive CPUs.
According to the Motivation for CPU Requests and Limits section of the Assign CPU Resources to Containers and Pods Kubernetes walkthrough:
By having a CPU limit that is greater than the CPU request, you accomplish two things:
- The Pod can have bursts of activity where it makes use of CPU resources that happen to be available.
- The amount of CPU resources a Pod can use during a burst is limited to some reasonable amount.
I guess that might leave us wondering why we care about limiting the burst to "some reasonable amount", since the very fact that it can burst seems to suggest there are no other processes contending for CPU at that time. But I find myself dissatisfied with that line of reasoning...
So first off I checked out the command line help for the docker flags you mentioned:
        --cpu-quota int                  Limit CPU CFS (Completely Fair Scheduler) quota
    -c, --cpu-shares int                 CPU shares (relative weight)

The reference to the Linux Completely Fair Scheduler means that in order to understand the value of a CPU limit/quota we need to understand how the underlying process scheduling algorithm works. Makes sense, right? My intuition is that it's not as simple as time-slicing CPU execution according to the CPU shares/requests and allocating whatever is left over at the end of some fixed timeslice on a first-come, first-served basis.
I found this old Linux Journal article snippet which seems to be a legit description of how CFS works:
The CFS tries to keep track of the fair share of the CPU that would have been available to each process in the system. So, CFS runs a fair clock at a fraction of real CPU clock speed. The fair clock's rate of increase is calculated by dividing the wall time (in nanoseconds) by the total number of processes waiting. The resulting value is the amount of CPU time to which each process is entitled.
As a process waits for the CPU, the scheduler tracks the amount of time it would have used on the ideal processor. This wait time, represented by the per-task wait_runtime variable, is used to rank processes for scheduling and to determine the amount of time the process is allowed to execute before being preempted. The process with the longest wait time (that is, with the gravest need of CPU) is picked by the scheduler and assigned to the CPU. When this process is running, its wait time decreases, while the time of other waiting tasks increases (as they were waiting). This essentially means that after some time, there will be another task with the largest wait time (in gravest need of the CPU), and the currently running task will be preempted. Using this principle, CFS tries to be fair to all tasks and always tries to have a system with zero wait time for each process—each process has an equal share of the CPU (something an “ideal, precise, multitasking CPU” would have done).
While I haven't gone as far as to dive into the Linux kernel source to see how this algorithm actually works, I do have some guesses I would like to put forth as to how shares/requests and quotas/limits play into this CFS algorithm.
First off, my intuition leads me to believe that different processes/tasks accumulate wait_runtime at different relative rates based on their assigned CPU shares/requests since Wikipedia claims that CFS is an implementation of weighted fair queuing and this seems like a reasonable way to achieve a shares/request based weighting in the context of an algorithm that attempts to minimize the wait_runtime for all processes/tasks. I know this doesn't directly speak to the question that was asked, but I want to be sure that my explanation as a whole has a place for both concepts of shares/requests and quotas/limits.
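To illustrate the weighting guess above, here is a toy weighted fair scheduler. It is not the kernel's implementation and it does not use wait_runtime as in the quoted article; instead it uses a simpler virtual-runtime formulation, where the task that has accumulated the least weighted runtime runs next. The `simulate` function and the tick granularity are my own assumptions.

```python
def simulate(weights, ticks):
    """weights: dict of name -> CPU shares.

    Returns a dict of name -> number of ticks each task got to run.
    """
    vruntime = {name: 0.0 for name in weights}
    ran = {name: 0 for name in weights}
    for _ in range(ticks):
        # Pick the task that is furthest behind (ties broken by name).
        name = min(vruntime, key=lambda n: (vruntime[n], n))
        ran[name] += 1
        # Heavier tasks accrue virtual time more slowly, so they run more often.
        vruntime[name] += 1024 / weights[name]
    return ran


print(simulate({"a": 512, "b": 512}, 100))
print(simulate({"a": 1024, "b": 512}, 300))
```

With equal shares the two tasks alternate and split the ticks evenly. With 1024 versus 512 shares the split is two to one, which is the share-proportional behaviour guessed at above.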
Second, with regard to quotas/limits, I intuit that these would be applicable in situations where a process/task has accumulated a disproportionately large wait_runtime while waiting on I/O. Remember that, in the quoted description above, CFS prioritizes the process/task with the largest wait_runtime? If there were no quota/limit on a given process/task, then it seems to me that a burst of CPU usage on that process/task would block all other processes/tasks from execution for as long as it takes for its wait_runtime to shrink enough that another task is allowed to preempt it.
So in other words, CPU quotas/limits in Docker/Kubernetes land are a mechanism that allows a given container/pod/process to burst in CPU activity and catch up with other processes after waiting on I/O (rather than CPU), without unfairly blocking other processes from doing work in the meantime.