Best practice for deploying a Flask API to a production Google Kubernetes cluster

4/5/2020

A Flask API (served with Gunicorn) is used as the inference API for a deep learning model. This particular inference process is very CPU-intensive (not using a GPU yet).

What is the best practice for deploying it to a Kubernetes cluster, considering these aspects:

  1. Should I create multiple pods that each handle requests with a single Gunicorn worker, or fewer pods that each run multiple Gunicorn workers? (node memory footprint)

  2. Since Google lets you expose your deployment as a Service using an external load balancer, do I need an Nginx web server in front of my Flask/Gunicorn stack?

  3. Is creating multiple identical pods on the same node more memory-intensive than handling all these requests with multithreading in a single pod?

-- Nikos Epitropakis
flask
google-kubernetes-engine
gunicorn
kubernetes
nginx

1 Answer

4/6/2020
  1. More, smaller pods are generally better, provided you're staying under "thousands". It is easier for the cluster to place a pod that requires 1 CPU and 1 GB of RAM 16 times than it is to place a single pod that requires 16 CPU and 16 GB of RAM once. You usually want multiple replicas in any case, for redundancy, to tolerate node failure, and for zero-downtime upgrades (see the Deployment sketch after this list).

  2. If the Istio Ingress system works for you, you may not need a separate URL-routing layer (Nginx) inside your cluster. If you're okay with exposing your Gunicorn servers directly, with no routing or filtering in front of them, pointing a LoadBalancer Service straight at them is a valid choice (see the Service sketch after this list).

  3. Running 16 copies of 1 application will generally need more memory than 1 copy with 16 threads; how much more depends on the application.

    In particular, if you load your model into memory and the model itself is large, but your multi-threaded setup can share a single copy of it, 1 large pod could use significantly less memory than 16 small pods. If the model is COPYed directly into the Docker image and the application code mmap()s it, then you'd probably get to share memory at the kernel layer (see the Gunicorn config sketch after this list).

    If the model itself is small and most of the memory is used in the processing, it will still use "more" memory to have multiple pods, but it would just be the cost of your runtime system and HTTP service; it shouldn't substantially change the memory required per thread/task/pod if that isn't otherwise shared.
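
For point 1, here is a minimal sketch of the "many small pods with explicit resource requests" shape, written with the official Kubernetes Python client (the image name, labels, resource figures, and the `app:app` module path are placeholders, not anything from the question):

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

# Four small replicas, each capped at 1 CPU / 1 GiB and each running a single Gunicorn worker.
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="flask-inference"),
    spec=client.V1DeploymentSpec(
        replicas=4,
        selector=client.V1LabelSelector(match_labels={"app": "flask-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "flask-inference"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="api",
                    image="gcr.io/my-project/flask-inference:latest",  # placeholder image
                    command=["gunicorn", "--workers", "1", "--threads", "4",
                             "--bind", "0.0.0.0:8000", "app:app"],     # placeholder module:app
                    ports=[client.V1ContainerPort(container_port=8000)],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "1Gi"},
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                ),
            ]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

The scheduler can fit several of these 1-CPU pods onto whatever nodes have room far more easily than a single 16-CPU pod, and raising `replicas` is also how you get the redundancy and zero-downtime upgrades mentioned above.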
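
For point 2, exposing those pods directly needs nothing more than a LoadBalancer Service whose selector matches the Deployment's pod labels; the same hypothetical names and port are reused here:

```python
from kubernetes import client, config

config.load_kube_config()

# External load balancer pointing straight at the Gunicorn pods; no Nginx hop in between.
service = client.V1Service(
    api_version="v1",
    kind="Service",
    metadata=client.V1ObjectMeta(name="flask-inference"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",                  # GKE provisions an external load balancer for this
        selector={"app": "flask-inference"},  # must match the Deployment's pod labels
        ports=[client.V1ServicePort(port=80, target_port=8000)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=service)
```

If you later need TLS termination, path routing, or request filtering, you would typically swap this for an Ingress (or Istio) in front of a plain ClusterIP Service rather than adding Nginx inside each pod.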
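
For point 3, the thread-versus-process trade-off is largely set in the Gunicorn configuration rather than in Kubernetes. A minimal `gunicorn.conf.py` sketch, with illustrative numbers only:

```python
# gunicorn.conf.py -- illustrative values, tune per workload
bind = "0.0.0.0:8000"

# Option A: one process per pod, several threads sharing a single in-memory copy of the model;
# setting threads > 1 makes Gunicorn use its "gthread" worker class.
workers = 1
threads = 4

# Option B: several worker processes in one pod. Loading the app (and the model) in the
# master before forking lets workers share read-only pages via copy-on-write.
# workers = 4
# preload_app = True
```

Whether copy-on-write actually keeps the model shared depends on how the runtime touches those pages after forking, so it's worth measuring the per-pod footprint rather than assuming it.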

-- David Maze
Source: StackOverflow