Docker/Kubernetes + Gunicorn/Celery - Multiple Workers vs Replicas?


I was wondering what the correct approach to deploying a containerized Django app using gunicorn & celery was.

Specifically, each of these processes has a built-in way of scaling vertically: workers for gunicorn and concurrency for celery. Then there is the Kubernetes approach of scaling with replicas.

There is also this notion of setting workers equal to some function of the CPUs. Gunicorn recommends

2-4 workers per core

However, I am confused about what this translates to on K8s, where CPU is a divisible shared resource, unless I use resourceQuotas.
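To make the arithmetic concrete, here is a rough sketch of the rule of thumb as a function of the CPU available to a container. The `(2 x cores) + 1` starting point comes from the Gunicorn docs; the function name and the way a fractional CPU limit is rounded up to one core are my own assumptions, not something Gunicorn specifies.

```python
import math


def suggested_workers(cpu_limit: float, per_core: int = 2) -> int:
    """Rule-of-thumb worker count for a container limited to `cpu_limit` CPUs.

    Assumption: a fractional limit (e.g. 0.5 CPU) is rounded up to one core.
    """
    cores = max(1, math.ceil(cpu_limit))
    return per_core * cores + 1


if __name__ == "__main__":
    for limit in (0.5, 1, 2, 4):
        print(f"cpu limit {limit}: start with {suggested_workers(limit)} workers")
```

With a limit of half a CPU this still suggests 3 workers, which is exactly the case I am unsure about.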

I want to understand what the Best Practice is. There are three options I can think of:

  • Have single workers for gunicorn and a concurrency of 1 for celery, and scale them using the replicas? (horizontal scaling)
  • Have gunicorn & celery run in a single replica deployment with internal scaling (vertical scaling). This would mean setting fairly high values of workers & concurrency respectively.
  • A mixed approach between 1 and 2, where we run gunicorn and celery with a small value for workers & concurrency, (say 2), and then use K8s Deployment replicas to scale horizontally.

There are some questions on SO around this, but none offer an in-depth/thoughtful answer. Would appreciate if someone can share their experience.

Note: We use the default worker_class sync for Gunicorn



These technologies aren't as similar as they initially seem. They address different portions of the application stack and are actually complementary.

Gunicorn is for scaling web request concurrency, while celery should be thought of as a worker queue. We'll get to kubernetes soon.


Web request concurrency is primarily limited by network I/O, i.e. it is "I/O bound". These types of tasks can be scaled with threads, since one thread can run while another is waiting on I/O. If you find request concurrency is limiting your application, increasing gunicorn worker threads may well be the place to start.


Heavy-lifting tasks, e.g. compressing an image or running some ML algorithm, are "CPU bound". They benefit far less from threading than from more CPUs. These tasks should be offloaded to and parallelized by celery workers.
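A minimal sketch of the I/O-bound half of that distinction, using only the standard library. `fake_io_request` is a made-up stand-in for a request that mostly waits on the network; it is not how gunicorn itself works internally, just an illustration of why threads help when the work is waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_request(i: int) -> int:
    """Hypothetical request handler that mostly waits on the network."""
    time.sleep(0.1)  # sleeping releases the GIL, much like a blocking socket read
    return i


def run_with_threads(n_requests: int, n_threads: int) -> float:
    """Return wall-clock seconds needed to serve n_requests with n_threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_io_request, range(n_requests)))
    assert results == list(range(n_requests))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"1 thread : {run_with_threads(4, 1):.2f}s")
    print(f"4 threads: {run_with_threads(4, 4):.2f}s")
```

If you swap the sleep for a pure-Python number-crunching loop, the four-thread run should no longer be noticeably faster on standard CPython because of the GIL, which is why that kind of work belongs in separate worker processes.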


Where Kubernetes comes in handy is by providing out-of-the-box horizontal scalability and fault tolerance.

Architecturally, I'd use two separate k8s deployments to represent the different scalability concerns of your application: one deployment for the Django app and another for the celery workers. This allows you to independently scale request throughput vs. processing power.

I run celery workers with a single worker process per container (-c 1). This vastly simplifies debugging and adheres to Docker's "one process per container" mantra. It also gives you the added benefit of predictability, as you can scale the processing power roughly on a per-core basis by incrementing the replica count.
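As a sketch, the same thing can live in a Celery settings module instead of on the command line. The file name, the broker address, and the prefetch setting are my assumptions for illustration; only the concurrency of 1 comes from the setup described above.

```python
# celeryconfig.py: a hypothetical settings module, loaded with
# app.config_from_object("celeryconfig")

# Assumed broker address; point this at your own Redis/RabbitMQ service.
broker_url = "redis://redis:6379/0"

# One worker process per container (same effect as `celery worker -c 1`).
worker_concurrency = 1

# Optional assumption: reserve only one task at a time per worker process,
# so queued tasks stay available to other replicas.
worker_prefetch_multiplier = 1
```

With this in place, adding processing power means raising the replica count of the celery deployment, not touching the config.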

Scaling the Django app deployment is where you'll need to do your own research to find the best settings for your particular application. Again, stick to --workers 1 so there is a single worker process per container, but experiment with --threads to find the best setting. As before, leave horizontal scaling to Kubernetes by simply changing the replica count.
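For reference, a minimal sketch of a Gunicorn config file along these lines. The port, the `GUNICORN_THREADS` environment variable, and the default of 4 threads are assumptions to tune, not recommendations.

```python
# gunicorn.conf.py: a sketch; the env var name and defaults are assumptions
import os

# Assumed container port.
bind = "0.0.0.0:8000"

# One worker process per container; scale out with replicas instead.
workers = 1

# Threads per worker. With the default sync worker_class, setting this
# above 1 makes Gunicorn use the gthread worker instead.
threads = int(os.environ.get("GUNICORN_THREADS", "4"))
```

You would start it with something like `gunicorn -c gunicorn.conf.py myproject.wsgi:application`, where `myproject` is a placeholder for your own Django project, and vary the thread count per deployment through the environment variable while load testing.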

HTH. It's definitely something I had to wrap my head around when working on similar projects.





We run a Kubernetes cluster with Django and Celery, and implemented the first approach. Here are some of my thoughts on this trade-off and why we chose this approach.

In my opinion Kubernetes is all about horizontally scaling your replicas (managed through deployments). In that respect it makes most sense to keep your deployments as single-purpose as possible, and increase the deployments (and pods if you run out) as demand increases. The LoadBalancer thus manages traffic to the Gunicorn deployments, and the Redis queue manages the tasks to the Celery workers. This ensures that the underlying docker containers are simple and small, and we can individually (and automagically) scale them as we see fit.

As for your thought on how many workers/concurrency you need per deployment, that really depends on the underlying hardware you have your Kubernetes running on and requires experimentation to get right.

For example, we run our cluster on Amazon EC2 and experimented with different EC2 instance types and worker counts to balance performance and costs. The more CPU you have per instance, the fewer instances you need and the more workers you can deploy per instance. But we found that deploying more, smaller instances is cheaper in our case. We now deploy multiple m4.large instances with 3 workers per deployment.

Interesting side note: we had really bad performance with gunicorn in combination with the Amazon load balancers, so we switched to uWSGI and saw great performance increases. But the principles are the same.


