GKE: How to alert on memory request/allocatable ratio?

3/19/2020

I have a GKE cluster and I'd like to keep track of the ratio between the total memory requested and the total memory allocatable. I was able to create a chart in Google Cloud Monitoring using

metric.type="kubernetes.io/container/memory/request_bytes" resource.type="k8s_container"

and

metric.type="kubernetes.io/node/memory/allocatable_bytes" resource.type="k8s_node"

both with crossSeriesReducer set to REDUCE_SUM in order to get the aggregate total across the cluster.
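For reference, the first chart corresponds roughly to the following projects.timeSeries.list API call (a sketch: PROJECT_ID and the time interval are placeholders, and the second chart is the same call with the node filter swapped in):

# Sketch of the Monitoring API query behind the first chart.
# PROJECT_ID and the interval timestamps are placeholders.
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries" \
  --data-urlencode 'filter=metric.type="kubernetes.io/container/memory/request_bytes" resource.type="k8s_container"' \
  --data-urlencode 'aggregation.alignmentPeriod=60s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN' \
  --data-urlencode 'aggregation.crossSeriesReducer=REDUCE_SUM' \
  --data-urlencode 'interval.startTime=2020-03-19T00:00:00Z' \
  --data-urlencode 'interval.endTime=2020-03-19T01:00:00Z'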

Then, when I tried to set up an alerting policy (using the Cloud Monitoring API) with the ratio of the two (following this), I got this error:

ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

It doesn't like that the first metric is a k8s_container while the second is a k8s_node. Are there different metrics I can use, or some sort of workaround, to alert on the memory request/allocatable ratio in Google Cloud Monitoring?

EDIT:

Here is the full request and response

$ gcloud alpha monitoring policies create --policy-from-file=policy.json
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

$ cat policy.json
{
    "displayName": "Cluster Memory",
    "enabled": true,
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Ratio: Memory Requests / Memory Allocatable",
            "conditionThreshold": {
                "filter": "metric.type=\"kubernetes.io/container/memory/request_bytes\" resource.type=\"k8s_container\"",
                "aggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [],
                        "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "denominatorFilter": "metric.type=\"kubernetes.io/node/memory/allocatable_bytes\" resource.type=\"k8s_node\"",
                "denominatorAggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [],
                        "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,
                "duration": "60s",
                "trigger": {
                    "count": 1
                }
            }
        }
    ]
}
-- Jesse Shieh
google-cloud-monitoring
google-cloud-platform
google-cloud-stackdriver
google-kubernetes-engine
stackdriver

1 Answer

3/31/2020
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

From the official documentation for the groupByFields[] parameter:

The set of fields to preserve when crossSeriesReducer is specified. The groupByFields determine how the time series are partitioned into subsets prior to applying the aggregation operation. Each subset contains time series that have the same value for each of the grouping fields. Each individual time series is a member of exactly one subset. The crossSeriesReducer is applied to each subset of time series. It is not possible to reduce across different resource types, so this field implicitly contains resource.type. Fields not specified in groupByFields are aggregated away. If groupByFields is not specified and all the time series have the same resource type, then the time series are aggregated into a single output time series. If crossSeriesReducer is not defined, this field is ignored.

-- Cloud.google.com: Monitoring: projects.alertPolicies

Note this part in particular:

It is not possible to reduce across different resource types, so this field implicitly contains resource.type.

The error above appears when you try to create a policy whose numerator and denominator use different resource types.

The metrics in question have the following resource types:

  • kubernetes.io/container/memory/request_bytes - k8s_container
  • kubernetes.io/node/memory/allocatable_bytes - k8s_node

You can check the resource type by looking at the metric in GCP Monitoring's Metrics Explorer (screenshots of the container and node metric pages omitted here).
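Alternatively, a sketch of checking it through the API: fetching the metric descriptor should show, in its monitoredResourceTypes field, which resource types the metric is written against (PROJECT_ID is a placeholder):

# Fetch the metric descriptor and inspect its resource types.
# PROJECT_ID is a placeholder.
curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors/kubernetes.io/container/memory/request_bytes"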

As a workaround, you could create an alert policy that fires when the allocatable memory utilization is above 85%. This will indirectly tell you that the requested memory is high enough to warrant an alarm.

An example policy in YAML:

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - resource.label.cluster_name
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="GKE-CLUSTER-NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization for GKE-CLUSTER-NAME by label.cluster_name
    [SUM]
  name: projects/XX-YY-ZZ/alertPolicies/AAA/conditions/BBB
creationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
displayName: alerting-policy-when-allocatable-memory-is-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
name: projects/XX-YY-ZZ/alertPolicies/
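Note that the YAML above is the policy as read back from the API, so it includes output-only fields. To create a similar policy from the command line, save a copy without name, creationRecord, and mutationRecord to a file (the file name below is just an example) and reuse the command from the question:

# Create the policy from the saved file (policy.yaml is an example name).
gcloud alpha monitoring policies create --policy-from-file=policy.yaml

# List policies to verify it was created.
gcloud alpha monitoring policies list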

The same policy can also be configured through the GCP Monitoring web UI (screenshot omitted).

Please let me know if you have any questions.

EDIT:

To create alert policies that surface relevant data, you need to take many factors into consideration, such as:

  • the type of workload
  • the number of nodes and node pools
  • node affinity (for example, scheduling certain workloads on GPU nodes)
  • etc.

For a more advanced alert policy that takes the allocatable memory per node pool into consideration, you can do something like this:

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - metadata.user_labels."cloud.google.com/gke-nodepool"
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="CLUSTER_NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization (filtered) (grouped) [SUM]
creationRecord:
  mutateTime: '2020-03-31T18:03:20.325259198Z'
  mutatedBy: XXX@YYY.ZZZ
displayName: allocatable-memory-per-node-pool-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T18:18:57.169590414Z'
  mutatedBy: XXX@YYY.ZZZ
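The grouping above relies on the cloud.google.com/gke-nodepool label that GKE sets on every node; as a quick sanity check you can list that label with kubectl:

# Show which node pool each node belongs to via its GKE-assigned label.
kubectl get nodes -L cloud.google.com/gke-nodepool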

Please be aware that there is a bug (Groups.google.com: Google Stackdriver discussion), and the only way to create the above alert policy is through the command line.

-- Dawid Kruk
Source: StackOverflow