Create an Incident and Notifications in Stackdriver when a GKE Workload has Issues

11/29/2019

I have a GKE cluster with some workloads that can have boot issues. Is it possible to create a Stackdriver notification when a workload runs into an issue?

For example: create an incident when CrashLoopBackOff is triggered, pods are unschedulable, or the Workload Status is anything other than OK for 5 minutes.

-- Laures
google-kubernetes-engine
monitoring
stackdriver

1 Answer

12/4/2019

You can use log-based metrics to track all the CrashLoopBackOff states in your pods, using an advanced query (https://cloud.google.com/logging/docs/view/advanced-queries) such as the following:

resource.type="k8s_pod"
resource.labels.location="us-central1-a"
resource.labels.cluster_name="standard-cluster-1"
"myproject"
jsonPayload.message="Back-off restarting failed container"
resource.labels.pod_name:"myproject"
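If you need the same filter for several clusters or workloads, it can be generated instead of typed by hand. The following is only a sketch: the function name build_crashloop_filter and its parameters are hypothetical, and it simply reproduces the query above, relying on the fact that Cloud Logging joins comparisons on separate lines with an implicit AND.

```python
def build_crashloop_filter(location, cluster_name, search_term):
    """Return a Cloud Logging filter for back-off restart events.

    Hypothetical helper: it reproduces the query from the answer, with the
    location, cluster name and search term turned into parameters.
    """
    clauses = [
        'resource.type="k8s_pod"',
        f'resource.labels.location="{location}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        # A bare quoted string is a text search across the log entry.
        f'"{search_term}"',
        'jsonPayload.message="Back-off restarting failed container"',
        # ":" is the "has" operator, i.e. a substring match on the pod name.
        f'resource.labels.pod_name:"{search_term}"',
    ]
    # Comparisons on separate lines are combined with an implicit AND.
    return "\n".join(clauses)


if __name__ == "__main__":
    print(build_crashloop_filter("us-central1-a", "standard-cluster-1", "myproject"))
```

Running it with the values from the answer prints the six-line query shown above, which you can paste into the logs viewer or use as the filter of a log-based metric.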

Unschedulable pods might go into CrashLoopBackOff or might not be deployed at all, which is only traceable at the API server.

Keep in mind that when you build the log-based metric, you need to adapt the labels to your monitoring version (legacy or non-legacy). This example uses the non-legacy monitoring and metrics.

Create the metric via log-based metrics and you'll find it in Monitoring as logging/user/xxxx.

https://cloud.google.com/logging/docs/logs-based-metrics/
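If you prefer to create the metric through the API rather than the console, the request body is small. This sketch assumes the LogMetric resource of the Cloud Logging API v2 (fields name, description, filter). The helper names and the metric name crashloop_backoff_count are hypothetical, and the one-line filter is a shortened stand-in for the full query.

```python
import json


def log_metric_body(name, log_filter, description=""):
    """Build a LogMetric-style body for a counter metric (hypothetical helper)."""
    return {"name": name, "description": description, "filter": log_filter}


def monitoring_metric_type(name):
    """Full metric type under which a user-defined log-based metric shows up."""
    return f"logging.googleapis.com/user/{name}"


if __name__ == "__main__":
    # Shortened filter for illustration; use the full query in practice.
    body = log_metric_body(
        "crashloop_backoff_count",  # hypothetical metric name
        'resource.type="k8s_pod" AND '
        'jsonPayload.message="Back-off restarting failed container"',
        description="Counts back-off restart events in pods",
    )
    print(json.dumps(body, indent=2))
    print(monitoring_metric_type(body["name"]))
```

The second helper spells out the full metric type that the short form logging/user/xxxx stands for, which is what you reference when filtering on the metric in Monitoring.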

Once the metric is created, you can create an alert policy to notify you when the issue occurs.
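As a rough illustration of such a policy, the sketch below builds an AlertPolicy-style body for the Cloud Monitoring API v3. It is not a tested setup: the function name, the metric name crashloop_backoff_count, the display names and the notification channel path are all hypothetical placeholders. The threshold of zero and the 300-second duration are assumptions taken from the "for 5 minutes" requirement in the question.

```python
import json


def crashloop_alert_policy(metric_name, notification_channels, duration_seconds=300):
    """Build an AlertPolicy-style body that fires while the log-based metric
    stays above zero for the given duration (hypothetical helper)."""
    metric_type = f"logging.googleapis.com/user/{metric_name}"
    condition = {
        "displayName": f"{metric_name} above zero",
        "conditionThreshold": {
            # Monitoring filter selecting the user-defined log-based metric.
            "filter": f'metric.type="{metric_type}" AND resource.type="k8s_pod"',
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0,
            # How long the condition must hold before an incident opens.
            "duration": f"{duration_seconds}s",
            "aggregations": [
                {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE"}
            ],
        },
    }
    return {
        "displayName": f"Alert on {metric_name}",
        "combiner": "OR",
        "conditions": [condition],
        "notificationChannels": list(notification_channels),
    }


if __name__ == "__main__":
    policy = crashloop_alert_policy(
        "crashloop_backoff_count",  # hypothetical metric name
        ["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"],  # placeholder
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON is only a starting point: the same settings (metric, threshold, duration, notification channel) can be chosen interactively when creating the alert policy in the Monitoring console.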

-- W_B
Source: StackOverflow