With the Kubernetes Java client API, I can get the number of available and total deployed pod instances for a given app like this:
ApiClient defaultClient = Configuration.getDefaultApiClient();
AppsV1beta1Api apiInstance = new AppsV1beta1Api();
...
try {
    AppsV1beta1DeploymentList result = apiInstance.listDeploymentForAllNamespaces(_continue, fieldSelector, includeUninitialized, labelSelector, limit, pretty, resourceVersion, timeoutSeconds, watch);
    // Compare available and deployed replica counts for each deployment.
    for (AppsV1beta1Deployment deployment : result.getItems()) {
        Map<String, String> labels = deployment.getMetadata().getLabels();
        String appName = (labels == null) ? "" : labels.getOrDefault("app", "");
        AppsV1beta1DeploymentStatus status = deployment.getStatus();
        // The status fields are boxed Integers and may be null, so avoid unboxing them directly.
        int availablePods = status.getAvailableReplicas() == null ? 0 : status.getAvailableReplicas();
        int deployedPods = status.getReplicas() == null ? 0 : status.getReplicas();
        if (availablePods != deployedPods) {
            // Generate an alert
        }
    }
} catch (ApiException e) {
System.err.println("Exception when calling AppsV1beta1Api#listDeploymentForAllNamespaces");
e.printStackTrace();
}
In the above example, I'm comparing availablePods with deployedPods, and if they don't match, I generate an alert.
How can I replicate this logic in Prometheus using alerting rules and/or Alertmanager config, so that it checks the number of available pod instances for a given app or job and triggers an alert if that number doesn't match a specified number of instances?
The specified threshold can be the total deployedPods, or it can come from another config file or template.
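To make the "threshold from another config" case concrete, here is a minimal sketch of the comparison in plain Java. The class name ReplicaCheck, the method appsToAlert, and both maps are hypothetical and only illustrate the idea: in practice the expected counts would be loaded from a config file and the available counts from the deployment status.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReplicaCheck {
    // Returns the apps whose available pod count differs from the expected count.
    // An app missing from `available` is treated as having 0 available pods.
    // Both maps are hypothetical inputs (app name -> pod count).
    static List<String> appsToAlert(Map<String, Integer> expected, Map<String, Integer> available) {
        List<String> alerts = new ArrayList<>();
        // TreeMap gives a stable, alphabetical iteration order.
        for (Map.Entry<String, Integer> e : new TreeMap<>(expected).entrySet()) {
            int have = available.getOrDefault(e.getKey(), 0);
            if (have != e.getValue()) {
                alerts.add(e.getKey());
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        Map<String, Integer> expected = Map.of("web", 3, "worker", 2);
        Map<String, Integer> available = Map.of("web", 3, "worker", 1);
        System.out.println(appsToAlert(expected, available)); // prints [worker]
    }
}
```

This is the same check I would like Prometheus to perform, with the expected counts playing the role of the threshold.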
I don’t know how to do this for all namespaces, but for one namespace it will look like:
curl -k -s 'https://prometheus-k8s/api/v1/query?query=(sum(kube_deployment_spec_replicas%7Bnamespace%3D%22default%22%7D)%20without%20(deployment%2C%20instance%2C%20pod))%20-%20(sum(kube_deployment_status_replicas_available%7Bnamespace%3D%22default%22%7D)%20without%20(deployment%2C%20instance%2C%20pod))'
This curl request queries the default namespace.
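The percent-encoded URL above is hard to read, so here is a rough sketch of how the same request URL could be built from the plain PromQL expression in Java. The class and method names are made up for illustration, and the base URL is simply the one from the curl command. Note that URLEncoder also percent-encodes the parentheses, which the hand-written URL leaves as-is; both forms decode to the same query.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PromQueryUrl {
    // Builds the PromQL expression: desired replicas minus available replicas in one namespace.
    static String replicaGapQuery(String namespace) {
        String sel = "{namespace=\"" + namespace + "\"}";
        return "(sum(kube_deployment_spec_replicas" + sel + ") without (deployment, instance, pod))"
             + " - (sum(kube_deployment_status_replicas_available" + sel + ") without (deployment, instance, pod))";
    }

    // Appends the URL-encoded expression to the /api/v1/query endpoint.
    // URLEncoder writes spaces as '+', so they are converted to %20 to match the curl example.
    static String queryUrl(String baseUrl, String promql) {
        String encoded = URLEncoder.encode(promql, StandardCharsets.UTF_8).replace("+", "%20");
        return baseUrl + "/api/v1/query?query=" + encoded;
    }

    public static void main(String[] args) {
        // Base URL taken from the curl command; adjust it for your cluster.
        System.out.println(queryUrl("https://prometheus-k8s", replicaGapQuery("default")));
    }
}
```

Passing a different namespace name to replicaGapQuery gives the query for that namespace.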
The alert config will look like:
groups:
- name: example
  rules:
  # Alert when available replicas differ from desired replicas for more than 15 minutes.
  - alert: availablePods!=deployedPods
    expr: (sum(kube_deployment_spec_replicas{namespace="$Name_of_namespace"}) without (deployment, instance, pod)) - (sum(kube_deployment_status_replicas_available{namespace="$Name_of_namespace"}) without (deployment, instance, pod)) != 0
    for: 15m
    labels:
      severity: page
    annotations:
      summary: "availablePods are not equal to deployedPods"
      description: "In namespace $Name_of_namespace, availablePods have not been equal to deployedPods for more than 15 minutes."
Don’t forget to replace the placeholder $Name_of_namespace with the name of the namespace you want to check.