How to get metrics of bunches of short-lived Kubernetes jobs

7/19/2019

My use case is that short-lived (from seconds to 1-2 minutes) k8s jobs are created on user request. I'm trying to retrieve job runtime metrics (like CPU and memory usage).

The methods I've thought of (and tried) include:

  1. Prometheus query, like container_cpu_usage_seconds_total, but a pull-based scrape means that many short-lived jobs will finish between scrapes and not be included.
  2. Pushgateway, but as the Prometheus documentation suggests, "...valid use case for the Pushgateway is for capturing the outcome of a service-level batch job", so I doubt this is a suitable case.
  3. Metrics-server, but metrics-server only returns 404 for short-lived job pods, leading to worse results than Prometheus.
  4. Query /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor directly. Though almost real-time, it returns all containers, so I have to manually parse the results and find what I need.
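
The manual parsing in option 4 could be sketched roughly as follows. This is only an illustration, assuming the endpoint returns the Prometheus text format and that samples carry `pod` and `container` labels (older Kubernetes versions use `pod_name` and `container_name` instead). The function name `filter_pod_metrics`, the sample data, and the pod prefix `job-abc` are hypothetical.

```python
import re

# One Prometheus text-format sample: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+\d+)?$'
)
# One label pair inside the braces: key="value" (value may contain escapes)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Metrics of interest; both are names exposed by cAdvisor.
WANTED = ("container_cpu_usage_seconds_total", "container_memory_working_set_bytes")


def filter_pod_metrics(text, pod_prefix, wanted=WANTED):
    """Return (metric, pod, container, value) tuples for pods whose name
    starts with pod_prefix, skipping comments and the pause container."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") not in wanted:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        # Assumption: label names differ between Kubernetes versions.
        pod = labels.get("pod") or labels.get("pod_name", "")
        container = labels.get("container") or labels.get("container_name", "")
        # Assumption: "POD" marks the pause container, "" the pod-level cgroup.
        if pod.startswith(pod_prefix) and container not in ("", "POD"):
            rows.append((m.group("name"), pod, container, float(m.group("value"))))
    return rows


# Hypothetical excerpt of what the cadvisor endpoint might return.
SAMPLE = """\
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="worker",namespace="default",pod="job-abc-x1"} 1.25 1563500000000
container_cpu_usage_seconds_total{container="POD",namespace="default",pod="job-abc-x1"} 0.02 1563500000000
container_memory_working_set_bytes{container="worker",namespace="default",pod="job-abc-x1"} 5.24288e+06 1563500000000
container_cpu_usage_seconds_total{container="nginx",namespace="default",pod="web-0"} 99.5 1563500000000
"""

if __name__ == "__main__":
    for row in filter_pod_metrics(SAMPLE, "job-abc"):
        print(row)
```

The input text could come from something like `kubectl get --raw /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor`. Since CPU usage is a cumulative counter, at least two samples per pod would be needed to turn it into a rate.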

I'm thinking of using a lightweight monitor container beside the job worker container to retrieve the worker's metrics. But I don't know whether this is a good idea, and even if so, how to retrieve the worker's metrics.

So my question is:

What method do you recommend to retrieve a large number of short-lived jobs' cpu and memory usage?

-- leowang
cadvisor
kubelet
kubernetes

1 Answer

7/22/2019

As you wrote, you have already tried Prometheus, Pushgateway, metrics-server, and querying /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor. If none of them satisfies you, a newer approach that I recommend for monitoring and saving cluster performance metrics is Litmus.

Prometheus is the most common tool, though a complex one, and most engineers are likely to use it. Litmus is a fairly new tool focused on workload testing; its metrics are saved, and you can store them as long as you want.

You can find more information here: litmus.

Useful article: litmus-openebs. It describes how to get metrics beyond just memory usage.

Then you can generate charts with, e.g., gnuplot.

-- MaggieO
Source: StackOverflow