I have a case where short-lived (from a few seconds to 1-2 minutes) k8s jobs are created on user request. I'm trying to retrieve the jobs' runtime metrics (like CPU and memory usage).
The methods I've thought of (and tried) include:
- Prometheus with container_cpu_usage_seconds_total, but the pull-based scrape means that many short-lived jobs will not be included.
- Querying /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor directly. Though almost real-time, it returns all containers, so I have to manually parse the results and find what I need.

I'm thinking of using a lightweight monitor container beside the job worker container to retrieve the worker's metrics. But I don't know whether this is a good idea, and even if it is, how to retrieve the worker's metrics.
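To illustrate the manual parsing that the cadvisor endpoint requires, here is a minimal sketch that filters the text output down to one pod. It assumes the endpoint returns the Prometheus text exposition format and that samples carry `pod` and `container` labels (older kubelet versions used different label names). The function name `filter_pod_metrics`, the sample text, and the pod names are hypothetical and only for illustration.

```python
import re

# One sample line looks like: metric_name{label="value",...} value [timestamp]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+\d+)?$'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def filter_pod_metrics(text, pod, metric_names):
    """Return (metric, container, value) tuples for one pod.

    Assumes `text` is Prometheus text exposition output and that
    each relevant sample has a `pod` label (an assumption about the
    kubelet version in use).
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") not in metric_names:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        if labels.get("pod") != pod:
            continue
        results.append(
            (match.group("name"), labels.get("container", ""), float(match.group("value")))
        )
    return results


# Hypothetical sample of what the endpoint might return.
SAMPLE = """\
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="worker",cpu="total",namespace="default",pod="job-a"} 1.5 1620000000000
container_cpu_usage_seconds_total{container="worker",cpu="total",namespace="default",pod="job-b"} 9.0 1620000000000
container_memory_working_set_bytes{container="worker",namespace="default",pod="job-a"} 1048576 1620000000000
"""

WANTED = {"container_cpu_usage_seconds_total", "container_memory_working_set_bytes"}
job_a_metrics = filter_pod_metrics(SAMPLE, "job-a", WANTED)
```

The raw text could be fetched with something like `kubectl get --raw /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor`. Because the CPU metric is a cumulative counter, two samples taken some seconds apart are needed to derive a usage rate.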
So my question is:
What method do you recommend to retrieve a large number of short-lived jobs' cpu and memory usage?
As you wrote, you have already tried Prometheus, Pushgateway, metrics-server, and querying /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor. If they don't satisfy you, another approach I recommend for monitoring and saving cluster performance metrics is Litmus.
Prometheus is the most common tool, and a fairly complex one, that most engineers would use. Litmus is a fairly new tool focused on workload testing; its metrics are saved, and you can store them for as long as you want.
You can find more information here: litmus.
Useful article: litmus-openebs. It describes how to get metrics beyond just memory usage.
Then you can generate charts with, e.g., gnuplot.