I have a TensorFlow object detection model, served with TensorFlow Serving and deployed to an Azure Kubernetes Service cluster. It runs on an NVIDIA K80 GPU, using the `tensorflow/serving:1.12.3-gpu` image.
The model is deployed and responds correctly, but the response time is huge: 3-4 seconds per 500×375 (135 KB) image.
Can anyone help me understand what can be improved?
If this is the first prediction request after the server starts, that latency is normal: the first request pays one-time costs (lazy graph initialization, CUDA context setup, memory allocation). You may need a warm-up request so that real traffic does not pay this cost.
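As a minimal sketch of such a warm-up, you can send one dummy prediction through TensorFlow Serving's REST API right after the container starts. The endpoint URL and model name (`detector`) below are placeholders — substitute your own deployment's values; the `{"b64": ...}` instance format assumes your model's serving signature accepts base64-encoded image strings.

```python
import base64
import json

# Hypothetical endpoint -- replace host, port, and model name ("detector")
# with your own deployment's values.
SERVER_URL = "http://localhost:8501/v1/models/detector:predict"

def build_warmup_payload(image_path):
    """Encode a small representative image as a TF Serving REST predict request.

    Assumes the model's serving signature accepts base64-encoded image
    bytes wrapped in the standard {"b64": ...} JSON convention.
    """
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return json.dumps({"instances": [{"b64": encoded}]})

# At startup (e.g. in a Kubernetes postStart hook or readiness probe),
# fire the request once so the first real client request is fast:
#
#   import requests
#   requests.post(SERVER_URL, data=build_warmup_payload("warmup.jpg"))
```

Alternatively, TensorFlow Serving can replay warm-up requests automatically at model load time if you package them in the SavedModel's `assets.extra/tf_serving_warmup_requests` file.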