I have a smallish model that I've tuned on Kubernetes to maximize performance for a single instance.
I ended up with the following setup, which gives me around 1000 requests per second with a 99th-percentile response time of around 10ms. My goal is to keep p99 response time below 10ms while getting as many requests per second as possible per CPU.
Because there is no shared state, I assumed I could scale horizontally and get 1000 QPS per instance. However, when I tried it, performance dropped drastically, and I have to throttle each instance to around 700 QPS to still maintain that 99th-percentile response time.
Running in Google Kubernetes Engine using n2-highcpu-32 machines.
Communicating over gRPC (using a gRPC client-side load balancer to evenly distribute connections; rough client sketch after the resource limits below)
batch_timeout_micros: 500
max_batch_size: 4
num_batch_threads: 4
max_enqueued_batches: 0
tensorflow_intra_op_parallelism: 4
tensorflow_inter_op_parallelism: 4
Requests/Limits:
CPUs: 4
Memory: 4Gi
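For context, the client side looks roughly like the sketch below (trimmed; the headless Service name, port, model name, and input tensor are placeholders), with round_robin spreading RPCs over the pod IPs that the Service's DNS record resolves to:

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Placeholder target: a headless Service so DNS resolution returns every pod IP.
TARGET = "dns:///tfserving-headless.default.svc.cluster.local:8500"

# round_robin opens a subchannel per resolved pod IP and rotates RPCs across
# them, instead of pinning every request to a single connection.
channel = grpc.insecure_channel(
    TARGET,
    options=[("grpc.lb_policy_name", "round_robin")],
)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                   # placeholder model name
request.model_spec.signature_name = "serving_default"
request.inputs["input"].CopyFrom(tf.make_tensor_proto([[1.0, 2.0, 3.0]]))

response = stub.Predict(request, timeout=0.05)         # 50 ms deadline
```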
If I try to maintain 1000 requests per second per pod across more than one pod, latency shoots up to >15ms.
Anyone have any ideas about what's going on here?
Edit: I realized that when I scaled to 4 pods, 3 of them ended up on the same node and one on another node by itself. When I broke performance down per pod, the one by itself was doing MUCH better than the 3 sharing a node.
That makes me think there is some sort of CPU or threading contention, even though I specified limits...
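One way to check that theory would be to watch the CFS throttling counters from inside each pod while it's under load; here's a rough sketch (the cpu.stat path depends on whether the node uses cgroup v1 or v2):

```python
import os

# Candidate locations for the container's cpu.stat, depending on cgroup version.
CANDIDATES = [
    "/sys/fs/cgroup/cpu.stat",              # cgroup v2
    "/sys/fs/cgroup/cpu/cpu.stat",          # cgroup v1
    "/sys/fs/cgroup/cpu,cpuacct/cpu.stat",  # cgroup v1, combined controller mount
]

def read_cpu_stat():
    """Return this container's CFS/throttling counters as a dict."""
    for path in CANDIDATES:
        if os.path.exists(path):
            with open(path) as f:
                return path, dict(line.split() for line in f if line.strip())
    raise RuntimeError("no cpu.stat found; run this inside the container")

if __name__ == "__main__":
    path, stats = read_cpu_stat()
    print(f"read {path}")
    # nr_throttled / throttled_usec on cgroup v2, throttled_time (ns) on v1
    for key in ("nr_periods", "nr_throttled", "throttled_usec", "throttled_time"):
        if key in stats:
            print(f"{key}: {stats[key]}")
```

If nr_throttled grows much faster on the three co-scheduled pods than on the isolated one at the same offered QPS, that would point at CFS quota throttling / noisy neighbours rather than the batching or threading settings.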