Why is GKE compute percentage so high?

12/18/2017

I have a GKE cluster with 5 f1-micro nodes. It's running a very simple, three-service Node.js app that sees very little traffic.

I recently configured Stackdriver and noticed this weird graph:

[screenshot: Stackdriver CPU usage graph]

Notice that all the metrics trend steadily upward. I suspect this is a bug: the metrics look cumulative, when they should be reported as a gauge.
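
If the metric really were cumulative, converting consecutive samples into per-interval rates would recover a gauge-like series. Here is a minimal, illustrative Python sketch of that conversion; the sample data and function name are made up for the example, not taken from Stackdriver.

```python
from typing import List, Tuple

def cumulative_to_rate(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Turn (timestamp, cumulative_value) samples into (timestamp, rate) points.

    rate_i = (value_i - value_{i-1}) / (t_i - t_{i-1})
    """
    rates = []
    for (t_prev, v_prev), (t_cur, v_cur) in zip(samples, samples[1:]):
        dt = t_cur - t_prev
        if dt > 0:
            rates.append((t_cur, (v_cur - v_prev) / dt))
    return rates

# Hypothetical cumulative CPU-seconds samples taken 60 s apart:
samples = [(0, 0.0), (60, 1.2), (120, 2.3), (180, 3.6)]
print(cumulative_to_rate(samples))
# A flat-ish rate here would mean the ever-rising graph is just a counter accumulating.
```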

kube-ui doesn't show this outrageous CPU usage, and when I SSHed into the nodes I couldn't find anything unusual with top.
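
One way to cross-check what top reports against the monitoring graphs is to compute utilization directly from /proc/stat on a node. This is a rough sketch added for illustration (not part of the original post); it assumes a Linux node and simply samples the aggregate cpu line twice.

```python
import time

def read_cpu_times():
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = [float(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait
    return idle, sum(fields)

idle1, total1 = read_cpu_times()
time.sleep(5)
idle2, total2 = read_cpu_times()

utilization = 1.0 - (idle2 - idle1) / (total2 - total1)
print(f"CPU utilization over 5s: {utilization:.1%}")
```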

Moreover, this graph, which should show the same thing, is completely different:

[screenshot: second CPU usage graph]

A few questions:

  1. First, has anyone else experienced this?
  2. Why is this happening? Is there any way I can debug it?
  3. How can I fix it?

Thank you

Edit

The CPU usage has stabilised, but it's still at ridiculously high levels. It appears to be the bug JMD described below. Here's how the graph looks now for the last month:

[screenshot: CPU usage graph for the last month]

-- alexandru.topliceanu
google-kubernetes-engine

1 Answer

1/18/2018

There was a known issue with false-positive high CPU usage alerts; what you experienced is likely related to it.

This appears to happen because short-lived instances report data while they are up, but stop reporting once they go away.
That can leave behind data points that violate the threshold in the alert policy. Once the policy's duration window has passed, the policy fires if every data point in that window is above the threshold.

The alert should close after the instance reports a value under the threshold, or after 7 days with no data being reported.
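
For illustration only, here is a small Python sketch of the firing/closing behaviour described above: the alert fires once every sample inside the duration window exceeds the threshold, and closes when a sample drops back under it (or, per the answer, after 7 days of silence). The window size and threshold are made-up values, not Stackdriver defaults.

```python
from collections import deque

THRESHOLD = 0.9        # e.g. 90% CPU utilization (illustrative)
DURATION_SAMPLES = 5   # samples that make up the policy's duration window (illustrative)

window = deque(maxlen=DURATION_SAMPLES)
firing = False

def observe(sample: float) -> bool:
    """Feed one utilization sample; return whether the alert is firing afterwards."""
    global firing
    window.append(sample)
    if not firing:
        # Fire only once the window is full and every point violates the threshold.
        if len(window) == DURATION_SAMPLES and all(v > THRESHOLD for v in window):
            firing = True
    elif sample <= THRESHOLD:
        # A value back under the threshold closes the alert.
        firing = False
    return firing

# A short-lived instance reports only high values, then stops reporting entirely:
for v in [0.95, 0.97, 0.96, 0.99, 0.95]:
    print(observe(v))
# The last call prints True: the alert fires and, with no further data,
# never sees the closing value it needs.
```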

-- JMD
Source: StackOverflow