I have an application made up of ~40 Docker containers, ranging from NoSQL and RDBMS databases to C, Go, and Python applications, orchestrated with Kubernetes. It's all running on GCP, with a Google Load Balancer (GLB) at the frontend.
Now if I create a lot of replicas and give these applications plenty of resources, everything runs properly. But if I give them just enough resources, the frontend sometimes loads very slowly; the web application becomes unresponsive for some time and then mysteriously comes back up again.
All this happens with no pod evictions or restarts.
When this happens, CPU and memory usage sit at around 50%, so resources are not exhausted.
How do I go about debugging the reason for this slowness? How do I calibrate how many resources each application requires?
You haven't mentioned any monitoring tools implemented in your Kubernetes cluster that you could use to check overall cluster performance or application resource usage.
Monitoring is based on metrics, and Kubernetes offers two pipelines for them: the Resource metrics pipeline, gathered by metrics-server, and the Full metrics pipeline for more advanced metrics; Prometheus is a good example of the latter approach.
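If metrics-server is already running in your cluster, `kubectl top` gives a quick first look at actual usage, which you can compare against the requests and limits you've configured. A minimal sketch; the namespace name below is a placeholder:

```
# Per-node CPU and memory usage (requires metrics-server)
kubectl top nodes

# Per-pod usage in a given namespace
kubectl top pods -n my-namespace

# Per-container breakdown, useful for multi-container pods
kubectl top pods -n my-namespace --containers
```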
For GCP environments, you can use Stackdriver, which offers extensive logging and monitoring features and an appropriate set of Kubernetes metrics.
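If this is a GKE cluster, Stackdriver Kubernetes monitoring can be enabled on an existing cluster roughly like this (a hedged sketch; `my-cluster` and the zone are placeholders, and the flag belongs to the legacy Stackdriver integration):

```
gcloud container clusters update my-cluster \
    --zone us-central1-a \
    --enable-stackdriver-kubernetes
```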
Therefore, I would start by checking monitoring metrics on the underlying Kubernetes resources, collect measurements over time, and use them to take the necessary actions to improve overall cluster performance.
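Once you know what each container actually consumes under load, you can calibrate its resources in the pod spec: set requests close to typical observed usage and limits close to observed peaks. A minimal sketch, with a hypothetical workload name and placeholder numbers:

```
apiVersion: v1
kind: Pod
metadata:
  name: frontend                            # hypothetical example workload
spec:
  containers:
  - name: web
    image: gcr.io/my-project/frontend:1.0   # placeholder image
    resources:
      requests:
        cpu: "250m"                         # near typical usage seen in monitoring
        memory: "256Mi"
      limits:
        cpu: "500m"                         # headroom for observed peaks
        memory: "512Mi"
```

If requests are set too low, the scheduler may pack pods onto nodes that can't absorb traffic spikes, and tight CPU limits can throttle containers; both can produce exactly the kind of intermittent slowness you describe, with no evictions or restarts, even while average CPU/memory looks healthy.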