Gunicorn does not respond to more than 6 requests at a time

3/23/2018

To give you some context:

I have two server environments running the same app. The first, which I intend to abandon, is a Standard Google App Engine environment that has many limitations. The second one is a Google Kubernetes cluster running my Python app with Gunicorn.

Concurrency

On the first server, I can send multiple requests to the app and it will answer many of them simultaneously. I ran two batches of simultaneous requests against the app in both environments. On Google App Engine, the first and second batches were answered simultaneously, and the first didn't block the second.

On Kubernetes, the server only answers 6 requests simultaneously, and the first batch blocks the second. I've read some posts on how to achieve Gunicorn concurrency with gevent or multiple threads, and all of them say I need more CPU cores, but the problem is that no matter how much CPU I put into it, the limitation continues. I've tried Google nodes from 1 vCPU to 8 vCPUs and it doesn't change much.
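The six-at-a-time ceiling behaves exactly like a fixed-size pool of blocking workers. This is not Gunicorn itself, just a minimal stdlib simulation of that model: 12 requests through 6 slots finish in two waves, and the second wave waits for the first.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 6      # the apparent concurrency limit observed
REQUESTS = 12    # two batches' worth of requests
TASK_TIME = 0.2  # stand-in for each request's processing time

def handle(_):
    time.sleep(TASK_TIME)  # simulate a blocking request handler
    return "done"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(handle, range(REQUESTS)))
elapsed = time.monotonic() - start

# With 6 slots and 12 requests, work completes in two waves: ~2 x TASK_TIME
print(f"elapsed: {elapsed:.2f}s")
```

If the measured wall time is roughly two task durations rather than one, requests are queueing behind a fixed worker count, regardless of how many cores the node has.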

Can you give me any ideas on what I'm possibly missing? Maybe a Google cluster node limitation?

Kubernetes response waterfall

As you can see, the second batch only started receiving responses after the first one began to finish.


App Engine response waterfall


-- Mauricio
concurrency
gunicorn
kubernetes
python

1 Answer

5/5/2018

What you describe appears to indicate that you're running the Gunicorn server with the sync worker class to serve an I/O-bound application. Can you share your Gunicorn configuration?

Is it possible that Google's platform has some kind of autoscaling feature (I'm not really familiar with their service) that's being triggered, while your Kubernetes configuration has none?

Generally speaking, increasing the number of cores on a single instance will only help if you also increase the number of workers spawned to handle incoming requests. Please see Gunicorn's design documentation, with special emphasis on the worker types section (and why sync workers are suboptimal for I/O-bound applications) - it's a good read and provides a more detailed explanation of this problem.
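As a concrete illustration of "more workers, async worker class", here is a hypothetical `gunicorn.conf.py` starting point (the values are assumptions to tune, not a definitive configuration; the `(2 x cores) + 1` heuristic comes from the Gunicorn docs):

```python
# gunicorn.conf.py - a hypothetical starting point, not a definitive tuning
import multiprocessing

bind = "0.0.0.0:8000"

# Common heuristic from the Gunicorn docs: (2 x cores) + 1
workers = multiprocessing.cpu_count() * 2 + 1

# For I/O-bound apps, an async worker class lets each worker
# handle many connections instead of one blocking request at a time
worker_class = "gevent"
worker_connections = 1000
```

With sync workers, `workers` is a hard cap on concurrent requests; with gevent workers, concurrency is roughly `workers * worker_connections`.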

Just for fun, here's a small exercise to compare the two approaches:

import time

def app(env, start_response):
    time.sleep(1) # takes 1 second to process the request
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World']

Running Gunicorn with 4 sync workers: gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class sync --chdir app app:app

Let's trigger 8 requests at the same time: ab -n 8 -c 8 "http://localhost:9001/"

This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:        gunicorn/19.8.1
Server Hostname:        localhost
Server Port:            9001

Document Path:          /
Document Length:        11 bytes

Concurrency Level:      8
Time taken for tests:   2.007 seconds
Complete requests:      8
Failed requests:        0
Total transferred:      1096 bytes
HTML transferred:       88 bytes
Requests per second:    3.99 [#/sec] (mean)
Time per request:       2006.938 [ms] (mean)
Time per request:       250.867 [ms] (mean, across all concurrent requests)
Transfer rate:          0.53 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.2      1       1
Processing:  1003 1504 535.7   2005    2005
Waiting:     1002 1504 535.8   2005    2005
Total:       1003 1505 535.8   2006    2006

Percentage of the requests served within a certain time (ms)
  50%   2006
  66%   2006
  75%   2006
  80%   2006
  90%   2006
  95%   2006
  98%   2006
  99%   2006
 100%   2006 (longest request)

Around 2 seconds to complete the test. That's the behavior you saw in your tests - the first 4 requests kept your workers busy, and the second batch was queued until the first batch was processed.
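The ~2 seconds follows directly from queueing arithmetic: with blocking workers, total time is the number of waves times the per-request time.

```python
import math

requests, workers, seconds_per_request = 8, 4, 1.0

# Each "wave" occupies all workers for one request's duration
waves = math.ceil(requests / workers)
total = waves * seconds_per_request

print(waves, total)  # 2 waves, 2.0 seconds - matching the ~2.007s measured
```

The small difference from the measured 2.007s is just connection and scheduling overhead.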


Same test, but let's tell Gunicorn to use an async worker: gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class gevent --chdir app app:app

Same test as above: ab -n 8 -c 8 "http://localhost:9001/"

This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:        gunicorn/19.8.1
Server Hostname:        localhost
Server Port:            9001

Document Path:          /
Document Length:        11 bytes

Concurrency Level:      8
Time taken for tests:   1.005 seconds
Complete requests:      8
Failed requests:        0
Total transferred:      1096 bytes
HTML transferred:       88 bytes
Requests per second:    7.96 [#/sec] (mean)
Time per request:       1005.463 [ms] (mean)
Time per request:       125.683 [ms] (mean, across all concurrent requests)
Transfer rate:          1.06 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.4      1       2
Processing:  1002 1003   0.6   1003    1004
Waiting:     1001 1003   0.9   1003    1004
Total:       1002 1004   0.9   1004    1005

Percentage of the requests served within a certain time (ms)
  50%   1004
  66%   1005
  75%   1005
  80%   1005
  90%   1005
  95%   1005
  98%   1005
  99%   1005
 100%   1005 (longest request)

We actually doubled the application's throughput here - it only took ~1s to reply to all the requests.

To understand what happened, Gevent has a great tutorial about its architecture, and this article has a more in-depth explanation of co-routines.
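The key idea behind co-routines is that waiting on I/O yields control instead of blocking. gevent isn't shown here (it needs an extra install); the standard library's asyncio coroutines demonstrate the same cooperative model - eight 0.2-second waits overlap into roughly 0.2 seconds total, not 1.6:

```python
import asyncio
import time

async def handle(i):
    # Awaiting I/O suspends this coroutine, letting the others run;
    # asyncio.sleep stands in for a real network or database wait
    await asyncio.sleep(0.2)
    return i

async def main():
    start = time.monotonic()
    results = await asyncio.gather(*(handle(i) for i in range(8)))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
print(f"{len(results)} requests in {elapsed:.2f}s")
```

A gevent worker applies the same trick transparently by monkey-patching blocking calls, which is why one worker can service many concurrent requests.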


I apologize in advance if I was way off on the actual cause of your problem (I do believe some additional information is missing from your initial post for anyone to give a conclusive answer). If not to you, I hope this'll be helpful to someone else. :)

Also, do note that I've oversimplified things a lot (my example was a simple proof of concept); tweaking an HTTP server's configuration is mostly a trial-and-error exercise - it all depends on the type of workload the application has and the hardware it sits on.

-- Diogo
Source: StackOverflow