What are the common options for sharding metrics for different Kubernetes components?

3/20/2017

1) In Kubernetes, many components (e.g. nodes) have metadata that you want to group metrics by. Examples:

  • monitor CPU usage
  • monitor CPU usage on all machines with GPUs
  • monitor memory usage
  • monitor memory usage on all machines (kubelets) that are labelled with a particular zone (e.g. 'ASIA-EAST-1')

And so on: for any metric that is measured on a node, you might want to view/query it by arbitrary labels or taints that exist on that node.

In all of these cases there is a problem, since metrics aren't emitted with labels for all of this metadata.

One solution: many Prometheus masters

So far I've thought of one solution: a separate Prometheus master for each logical group of nodes. This would allow an administrator to create masters that roll up metrics by an arbitrary label, e.g.:

  • query cluster for all nodes w/ label = SSD=16GB,
  • create a CSV from that list,
  • use it as the endpoints for a prometheus master,
  • use that as a specific data source.
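
The steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the node list has already been fetched (for example from `kubectl get nodes -o json`), that each node runs a node exporter on port 9100, and that the target list is written in Prometheus's file-based service discovery JSON format rather than CSV. The function name, sample nodes, and label values are hypothetical.

```python
import json


def file_sd_targets(nodes, selector, port=9100):
    """Build a Prometheus file_sd target group for nodes matching `selector`.

    `nodes` is a list of node objects shaped like the `items` of
    `kubectl get nodes -o json`; `selector` is a dict of required labels.
    """
    targets = []
    for node in nodes:
        meta = node["metadata"]
        labels = meta.get("labels", {})
        # Keep the node only if every selector label matches exactly.
        if all(labels.get(k) == v for k, v in selector.items()):
            targets.append(f"{meta['name']}:{port}")
    # file_sd expects a list of groups, each with "targets" and optional "labels".
    return [{"targets": sorted(targets), "labels": dict(selector)}]


# Hypothetical sample data standing in for a real cluster query.
nodes = [
    {"metadata": {"name": "node-a", "labels": {"SSD": "16GB"}}},
    {"metadata": {"name": "node-b", "labels": {"SSD": "none"}}},
    {"metadata": {"name": "node-c", "labels": {"SSD": "16GB", "gpu": "true"}}},
]

groups = file_sd_targets(nodes, {"SSD": "16GB"})
print(json.dumps(groups, indent=2))
```

The printed JSON could be saved to a file that a `file_sd_configs` entry points at, which gives one Prometheus master per label group, with all the maintenance overhead that implies.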

2) Are there any more elegant solutions to this problem?

The above solution is fraught with terror: you are doing a lot of work just to "hack" the Grafana "data source" concept as a way to shard your metrics.

3) A few more crazy ideas, just to help seed a broader conversation on how to shard metrics in Kubernetes by host:

  • Maybe Grafana is smart enough to add its own groups, somehow?
  • Or can Grafana be extended to do the Prometheus master polling/rollup itself?
-- jayunit100
grafana
kubernetes
metrics
prometheus

1 Answer

3/22/2017

Generally you'd have one Prometheus per datacenter, to keep things within the same failure domain. You may split that out in future if there are load issues, but for just node exporter stats that's unlikely.

https://www.robustperception.io/scaling-and-federating-prometheus/ describes the general scaling approach.

https://www.robustperception.io/how-to-have-labels-for-machine-roles/ addresses how to aggregate based on things like GPU presence.
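
As a rough illustration of that kind of approach: role information can be exposed as its own time series and joined onto the node metrics at query time, instead of being attached to every metric. The sketch below only builds the PromQL string and a URL for Prometheus's `/api/v1/query` HTTP endpoint. The `machine_role` metric, its `role` label, the choice of `node_load1`, and the server address are assumptions made for illustration.

```python
from urllib.parse import urlencode


def role_query(metric, role, info_metric="machine_role"):
    """Return PromQL that keeps `metric` only for instances with the given role.

    `info_metric` is a hypothetical series with value 1 and a `role` label,
    exported once per instance.
    """
    # group_left(role) copies the role label from the info metric onto each
    # matching series of `metric`, joined on the shared `instance` label.
    return f'{metric} * on(instance) group_left(role) {info_metric}{{role="{role}"}}'


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


promql = role_query("node_load1", "gpu")
print(promql)
print(query_url("http://prometheus.example:9090", promql))
```

Because the info metric has the value 1, multiplying by it leaves the original sample values unchanged and simply drops instances that lack the role.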

I would expect zone to end up as a target label, so no special consideration is required there.
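
One way this can happen is through relabelling: with Kubernetes node discovery, node labels arrive as `__meta_kubernetes_node_label_<name>` meta labels, and a `labelmap` relabel action can copy them onto the target. The Python below is only a simulation of that copying step for intuition, not Prometheus's own code; the node, the `zone` label name, and its value are hypothetical.

```python
import re


def labelmap(discovered, pattern=r"__meta_kubernetes_node_label_(.+)"):
    """Mimic a `labelmap` step: return the target labels copied from meta labels."""
    regex = re.compile(pattern)
    target_labels = {}
    for name, value in discovered.items():
        # Relabel regexes are fully anchored, hence fullmatch rather than search.
        match = regex.fullmatch(name)
        if match:
            # The first capture group becomes the new label name.
            target_labels[match.group(1)] = value
    return target_labels


# Hypothetical discovered labels for a single node.
discovered = {
    "__address__": "node-a:10250",
    "__meta_kubernetes_node_name": "node-a",
    "__meta_kubernetes_node_label_zone": "asia-east1-a",
}
print(labelmap(discovered))
```

Once `zone` is a target label it is attached to every series scraped from that node, so a query such as `avg by (zone) (node_load1)` needs no extra join.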

-- brian-brazil
Source: StackOverflow