I have the following alert configured in Prometheus:
    alert: ClockSkewDetected
    expr: abs(node_timex_offset_seconds{job="node-exporter"}) > 0.03
    for: 2m
    labels:
      severity: warning
    annotations:
      message: Clock skew detected on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}. Ensure NTP is configured correctly on this host.
This alert is part of the default kube-prometheus stack, which I am using.
I find this alert fires for around 10 mins every day or two.
I'd like to know how to deal with this problem (the alert firing!). It's suggested in this answer that I shouldn't need to run NTP myself (via a DaemonSet, I guess) on GKE.
I'm also keen to use the kube-prometheus defaults where possible, so I'm unsure about increasing the 0.03 value.
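Before touching the threshold, I'd want to know how large the skew actually gets. A subquery like this (a sketch; it needs Prometheus 2.7+, and the 2-day window is arbitrary) should show the worst offset per node, which would tell me whether 0.03 is genuinely too tight or the clock really is drifting:

    # Peak absolute clock offset per series over the last 2 days
    max_over_time(abs(node_timex_offset_seconds{job="node-exporter"})[2d:])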
As pointed out in the answer, GCP instances come preconfigured to sync against Google's internal NTP server, so there shouldn't be any need to run a DaemonSet to configure NTP on them manually.
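You can confirm the nodes are actually syncing without SSHing into them: node-exporter's timex collector exposes a sync flag. A query along these lines (a sketch, assuming the same job label as your alert) should return 1 for every node whose kernel reports the clock as NTP-synchronized:

    # 1 = kernel reports the clock as synchronized to an NTP server, 0 = not synced
    min by (instance) (node_timex_sync_status{job="node-exporter"})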
It might be the case that the clock skews during live migrations and catches up automatically, but not before triggering the alert. Note that this theory only applies to non-preemptible instances, since preemptible instances are terminated rather than live-migrated during maintenance.
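If you want to test that theory, Prometheus keeps a synthetic ALERTS series you can graph (a sketch, assuming the default alert name) and then compare the firing windows against the maintenance/live-migration events in your GCP activity logs:

    # Every interval during which the alert was actively firing
    ALERTS{alertname="ClockSkewDetected", alertstate="firing"}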
Some events on GCE instances are supposed to trigger the Clock Skew Daemon, which will eventually correct changes initiated by the user (or by a process acting on behalf of the user), so if this is happening on your nodes, that's another possibility.
Regardless of the aforementioned theories, and since nodes are managed resources in GKE, I think you have a pretty solid case for GKE support to investigate, as this might be an implementation detail.