I have a GKE cluster with 5 f1-micro nodes. It's running a very simple Node.js-based app made up of 3 services, and it sees very little traffic.
I recently configured StackDriver and I noticed this weird graph:
Notice that all the metrics keep going up. I suspect this is a bug: the metrics seem to be reported as cumulative values when they should be a gauge.
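One way to sanity-check that suspicion is to pull the metric descriptor from the Monitoring API and look at its metric_kind (GAUGE vs. DELTA vs. CUMULATIVE). Here's a rough sketch using the Python client (google-cloud-monitoring); the project ID and metric type below are placeholders, so substitute whatever metric the chart is actually plotting:

```python
from google.cloud import monitoring_v3

# Placeholders -- replace with your project and the metric shown in the chart.
PROJECT_ID = "my-gke-project"
METRIC_TYPE = "compute.googleapis.com/instance/cpu/utilization"

client = monitoring_v3.MetricServiceClient()
name = f"projects/{PROJECT_ID}/metricDescriptors/{METRIC_TYPE}"

descriptor = client.get_metric_descriptor(name=name)

# metric_kind tells you whether the series is a GAUGE, DELTA, or CUMULATIVE;
# a CUMULATIVE kind would explain an ever-increasing graph.
print(descriptor.metric_kind, descriptor.value_type, descriptor.unit)
```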
kube-ui doesn't show this outrageous CPU usage, and when I SSHed into the boxes I couldn't find anything unusual with top.
Moreover this graph, which should show the same thing, is completely different:
A couple of questions:
Thank you
The CPU usage has stabilised, but it's still at ridiculously high levels. It appears to be the bug JMD described below. Here's how the graph looks now for the last month:
There was a known issue with false positives for high CPU usage; what you experienced is most likely related to it.
This appears to happen because short-lived instances report data while they are up, but no values are reported once they go away.
That leaves behind data which violates the threshold in the alert policy: once the policy's duration has passed, if all of the data within that duration is above the threshold, the policy fires.
The policy should close after the instance reports a value under the threshold or after 7 days with no data reported.
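To illustrate that behaviour, here is a toy model of the condition described above (not how Stackdriver actually implements it; the threshold and durations are made-up numbers):

```python
from datetime import datetime, timedelta

# Hypothetical numbers, purely for illustration.
THRESHOLD = 0.8                   # e.g. 80% CPU utilisation
DURATION = timedelta(minutes=5)   # alert policy duration window
NO_DATA_CLOSE = timedelta(days=7) # close after a week of silence

def evaluate(points, now):
    """Toy model of the alerting condition described above.

    `points` is a time-ordered list of (timestamp, value) pairs for one
    instance. Returns "FIRE", "CLOSE", or "OK".
    """
    if not points:
        return "OK"
    last_ts, last_value = points[-1]

    # The policy closes after 7 days with no data reported at all...
    if now - last_ts >= NO_DATA_CLOSE:
        return "CLOSE"
    # ...or as soon as the instance reports a value under the threshold.
    if last_value < THRESHOLD:
        return "CLOSE"

    # Otherwise, once the duration has passed, the policy fires if every
    # point in that window was above the threshold. A short-lived instance
    # that reports only high values and then disappears satisfies this,
    # and no below-threshold value ever arrives to close the incident.
    window = [v for ts, v in points if last_ts - ts <= DURATION]
    if last_ts - points[0][0] >= DURATION and all(v > THRESHOLD for v in window):
        return "FIRE"
    return "OK"

# A short-lived instance that spiked two days ago and then stopped reporting
# still trips the policy, because nothing ever closes it short of the timeout.
now = datetime(2016, 3, 1, 12, 0)
spike = [(now - timedelta(days=2, minutes=m), 0.95) for m in range(10, 0, -1)]
print(evaluate(spike, now))  # "FIRE"
```

The point being that once the instance disappears, nothing ever arrives to close the incident short of the 7-day no-data timeout.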