How to create ImagePullBackOff alert and its recovery alert for Kubernetes on Datadog?

9/7/2021

I am trying to create an alert for ImagePullBackOff on a Kubernetes cluster using Datadog, as follows (for details, see this documentation):

metric description

Although the alert is created properly, I am having a problem with the recovery alert. The monitor never recovers after the error is corrected: once the error is fixed, the failing pod is destroyed, so the metric never goes to zero, as shown below. Is there any way to create the recovery alert?

pods

-- ord_bear
datadog
kubernetes

1 Answer

9/20/2021

• I would suggest you troubleshoot this issue with ‘kubectl describe’. This will show you the status and events of the pod, so you can see what’s causing the issue and why the recovery alert didn’t go off. You can refer to the events output from ‘kubectl describe pod <pod name>’.

• Also, check for typos, wrong tag names, and missing registry secrets, both in the metric you created and in the configuration of the pod that Datadog is monitoring.

• Since you have created a metric monitor for the Kubernetes pod, you can test the notifications of that monitor and check its events for any errors in triggering the recovery alert when the pod is deleted.

• Also, when an alert is triggered by a monitor, it is classified as ‘ALERT’, ‘WARNING’ or ‘NO DATA’, and if the monitor is in a downtime, its notifications are suppressed. So check whether any downtimes are scheduled for that monitor and pod. Also check the recovery threshold configured for the monitor, since you have enabled the option ‘Do not require a full window of data evaluation’.
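To go with the ‘kubectl describe’ check above, here is a minimal Python sketch that lists containers currently waiting on an image pull. It reads the JSON printed by ‘kubectl get pods -o json’ and looks at the standard pod status fields (‘containerStatuses’, ‘state.waiting.reason’). The function names ‘find_image_pull_failures’ and ‘load_pods_from_cluster’ are my own, chosen for illustration, and the script assumes ‘kubectl’ is installed and pointed at your cluster. It only inspects the cluster and does not touch the Datadog monitor.

```python
import json
import subprocess

# Waiting reasons the kubelet reports while an image cannot be pulled.
PULL_FAILURE_REASONS = ("ImagePullBackOff", "ErrImagePull")


def find_image_pull_failures(pod_list):
    """Return (namespace, pod, container, reason) tuples for containers
    that are waiting because their image could not be pulled.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    failures = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        # Check init containers as well as regular containers.
        statuses = (status.get("initContainerStatuses") or []) + (
            status.get("containerStatuses") or []
        )
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in PULL_FAILURE_REASONS:
                failures.append(
                    (meta.get("namespace"), meta.get("name"), cs.get("name"), reason)
                )
    return failures


def load_pods_from_cluster():
    """Fetch all pods via kubectl (assumes kubectl is on PATH and configured)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling ‘find_image_pull_failures(load_pods_from_cluster())’ before and after you fix the image shows whether the failing pod is still present. Once the pod has been deleted it no longer appears in the list at all, which matches what you describe: the failing pod disappears instead of reporting a healthy state.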

Please see the links below for more information:

https://docs.datadoghq.com/monitors/create/types/metric/?tab=threshold

https://docs.datadoghq.com/monitors/faq/why-did-i-get-a-recovery-event-from-a-monitor-that-was-in-a-downtime-when-it-alerted/

-- KartikBhiwapurkar-MT
Source: StackOverflow