I'm currently trying to deploy a backend API service (a TensorFlow model running in a Python/Flask environment, on GPUs) for my app. It needs to be scalable so that it can serve, say, around 1000 requests simultaneously.
The model takes about 15 seconds per request, which is relatively slow, and there is a Firebase timeout limit that each request has to stay under. I want to deploy this on Google Kubernetes Engine, but I don't know how to deploy my image so that each pod (running one instance of the image) runs on exactly one GPU node (and vice versa), and so that each request is directed to an available pod, i.e. no two concurrent requests are routed to the same pod.
I know there is something called a DaemonSet (https://cloud.google.com/kubernetes-engine/docs/concepts/daemonset), but I'm not sure whether it fits my need. Another question: is it possible to scale pods/GPU nodes by request volume (or by pod availability)? For example, if there is currently one node running one pod, the first incoming request can be served, but when a second request comes in, a second pod/GPU node needs to be created to serve it. What is the traffic-directing mechanism? Is it an Ingress Service? How do I detect pod availability in that mechanism? To sum up, here are my three questions:

1. How do I direct each request to a different pod?
2. How do I run only one pod per GPU node?
3. How do I scale (one unit of a DaemonSet maybe?), and scale fast enough that each request can be served within 30 seconds?
You can use container-native load balancing to target pods directly and distribute traffic evenly among them.
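As a minimal sketch (the service name, port, and `app: model-api` label are placeholders, not from your setup), container-native load balancing on GKE is enabled by annotating the Service with `cloud.google.com/neg` and exposing it through an Ingress:

```yaml
# Hypothetical Service/Ingress; names, ports, and labels are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: model-api
  annotations:
    # Creates a network endpoint group so the load balancer targets pods directly.
    cloud.google.com/neg: '{"ingress": true}'
spec:
  selector:
    app: model-api
  ports:
    - port: 80
      targetPort: 5000   # Flask container port; adjust to your image
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-api-ingress
spec:
  defaultBackend:
    service:
      name: model-api
      port:
        number: 80
```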
Have a look at pod anti-affinity: the idea is that a pod should not be scheduled onto node X if node X is already running one or more pods that match a given rule.
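For illustration, here is a Deployment sketch (the `app: model-api` label matches the Service above and is an assumption) that keeps two such pods off the same node by keying the anti-affinity rule on the node's hostname:

```yaml
# Hypothetical Deployment; the anti-affinity rule and GPU request are the relevant parts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never schedule two pods labeled app=model-api on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: model-api
              topologyKey: kubernetes.io/hostname
      containers:
        - name: model-api
          image: gcr.io/your-project/model-api:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU per pod
```

Note that requesting `nvidia.com/gpu: 1` on a node pool whose nodes each have a single GPU already limits scheduling to one such pod per node; the anti-affinity rule makes that constraint explicit.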
For autoscaling, I would go with the Horizontal Pod Autoscaler (HPA), so the number of pods scales based on whichever metric you monitor.
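A minimal HPA sketch (autoscaling/v2, CPU-based; the thresholds and replica bounds are placeholders), targeting the Deployment above. Scaling on request concurrency rather than CPU would require a custom or external metric instead:

```yaml
# Hypothetical HPA; thresholds and replica counts are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```

When the HPA adds a pod that cannot be scheduled because all GPU nodes are occupied, the GKE cluster autoscaler (enabled on the GPU node pool) is what provisions the additional node; keep in mind that spinning up a new GPU node takes minutes, not seconds, so a fixed pool of warm nodes may be needed to meet a 30-second deadline.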