Metrics of terminated GCP instances

4/11/2018

I have set an autoscaling policy for my GKE cluster that triggers when CPU usage exceeds 70% for 5 minutes. But sometimes there is a sudden spike and the server crashes. That Google Compute Engine instance gets terminated and a new instance starts up.

In Stackdriver Monitoring, how can I view metrics of terminated GCP instances, or are there any alternatives?

-- Naina Gupta
google-cloud-platform
google-kubernetes-engine

1 Answer

4/12/2018

From my understanding, the GKE cluster autoscaler scales by checking whether any Pods cannot be scheduled and are waiting for nodes with available resources. If such Pods exist, and the autoscaler determines that resizing a node pool would allow the waiting Pods to be scheduled, then the autoscaler expands that node pool.

Cluster autoscaler also measures the usage of each node against the node pool's total demand for capacity. If a node has had no new Pods scheduled on it for a set period of time, and all Pods running on that node can be scheduled onto other nodes in the pool, the autoscaler moves the Pods and deletes the node.

By the sound of it, you've configured a managed instance group autoscaler.

The Google documentation advises against using managed instance group autoscaling on cluster nodes:

Caution: Do not enable Google Compute Engine's autoscaling for managed instance groups for your cluster's nodes. Kubernetes Engine's cluster autoscaler is separate from Compute Engine autoscaling.
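If you want to move to the cluster autoscaler that the caution refers to, a minimal sketch could look like the following. The function name, the `GCLOUD` override variable and the argument order are my own assumptions for illustration. The flags are the standard ones for `gcloud container clusters update`, but check them against your gcloud version before relying on this.

```shell
# Hypothetical helper: enable the GKE cluster autoscaler on one node pool.
# GCLOUD can be overridden (e.g. GCLOUD=echo) to preview the command
# without touching the cluster.
enable_cluster_autoscaler() {
  cluster="$1"   # cluster name
  zone="$2"      # zone of the cluster
  pool="$3"      # node pool to autoscale
  min="$4"       # minimum node count
  max="$5"       # maximum node count

  ${GCLOUD:-gcloud} container clusters update "$cluster" \
    --zone "$zone" --node-pool "$pool" \
    --enable-autoscaling --min-nodes "$min" --max-nodes "$max"
}
```

For example, `GCLOUD=echo enable_cluster_autoscaler CLUSTER_NAME ZONE default-pool 1 5` prints the arguments that would be passed to gcloud instead of running it.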

That said, as far as I'm aware, you can still retrieve metric data for a deleted instance for up to 30 days after it has been deleted. To do this, use the instance ID rather than the instance name.

You can then check Stackdriver monitoring for information about the instance by navigating to:

https://app.google.stackdriver.com/instances/INSTANCE-ID?project=PROJECT-ID

Instance IDs can be retrieved by viewing the relevant resource in Stackdriver's monitoring view, or by running the following command and searching for the id value:

gcloud compute instances describe INSTANCE_NAME --zone ZONE
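
As a rough sketch, the two steps above (get the ID, then build the URL) can be wrapped in small shell functions. The function names are hypothetical. `--format='value(id)'` prints only the id field, and the describe call only works while the instance still exists, so it is worth recording the ID before the instance is terminated.

```shell
# Hypothetical helper: print only the numeric ID of a running instance.
instance_id_of() {
  # $1 = instance name, $2 = zone
  gcloud compute instances describe "$1" --zone "$2" --format='value(id)'
}

# Hypothetical helper: build the Stackdriver URL shown above
# from an instance ID and a project ID.
stackdriver_instance_url() {
  # $1 = instance ID, $2 = project ID
  printf 'https://app.google.stackdriver.com/instances/%s?project=%s\n' "$1" "$2"
}
```

Usage might look like `stackdriver_instance_url "$(instance_id_of INSTANCE_NAME ZONE)" PROJECT-ID`.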
-- neilH
Source: StackOverflow