Is there a configuration in Kubernetes that lets me specify a minimum number of queued requests before a new instance is spawned?
This is the context: we have powerful, high-CPU machines set up for our use case, and every request puts a heavy load on the server. Everything works perfectly until we reach a specific number, say 300 requests with a ramp-up time of 100 milliseconds. From that point we receive Connection refused errors for some time, until a new machine is spawned and the server starts handling requests again. What is the best way to handle these load spikes? I am looking for something like the "Pending latency" setting in App Engine. My application is deployed on Google Compute Engine and orchestrated by Kubernetes.
You can use a readinessProbe (see container probes) to indicate when a container is ready to serve requests, and a HorizontalPodAutoscaler to automatically scale your app up and down based on observed CPU utilization. Hope this helps.
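Here is a minimal sketch of both pieces wired together, assuming a Deployment named `web` that exposes an HTTP health endpoint at `/healthz` on port 8080; the image, port, path, and thresholds are placeholders you'd tune for your workload:

```yaml
# Deployment with a readinessProbe: the Pod only receives traffic
# from its Service once the probe succeeds, so requests are not
# routed to an instance that is still warming up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/web:latest    # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz           # assumed health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: "500m"              # HPA utilization is measured against this request
---
# HorizontalPodAutoscaler: adds replicas when average CPU
# utilization across the Pods exceeds the target percentage.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60       # scale up above 60% of requested CPU
```

Note the `autoscaling/v2` API shown here; older clusters may only support `autoscaling/v1`, which takes a single `targetCPUUtilizationPercentage` field instead of the `metrics` list. As far as I know Kubernetes has no direct equivalent to App Engine's "Pending latency" (there is no built-in queue-depth trigger in the core HPA), so the usual workaround for spiky traffic is to keep `minReplicas` high enough to absorb the spike and set the CPU target low enough to leave headroom while new capacity comes up.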