How To Reduce Prometheus (Federation) Scrape Duration

11/20/2018

I have a Prometheus federation with two Prometheus servers - one per Kubernetes cluster - and a central one to rule them all.

Over time, the scrape durations increase. At some point the scrape duration exceeds the timeout, metrics get lost, and alerts fire.

I’m trying to reduce the scrape duration by dropping metrics, but this is an uphill battle - more Sisyphus than Prometheus.
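
For illustration, the kind of drop rule involved looks roughly like this (the job and metric names are just placeholders):

scrape_configs:
  - job_name: 'federate'
    metric_relabel_configs:
      # Drop series by metric name after the scrape (placeholder regex).
      - source_labels: [__name__]
        regex: 'some_noisy_metric_.*'
        action: drop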

Does anyone know a way to reduce the scrape time without losing metrics and without having to drop more and more as time progresses?

Thanks in advance!

-- BarakH
kubernetes
monitoring
prometheus

1 Answer

11/20/2018

Per the Prometheus documentation, these settings determine the global scrape interval, scrape timeout, and rule evaluation frequency:

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

...and for each scrape job, the configuration allows overriding these with job-specific values:

# The job name assigned to scraped metrics by default.
job_name: <job_name>

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]

Without knowing more about the number of targets and the number of metrics per target, I can only suggest configuring an appropriate scrape_timeout per job and adjusting the global evaluation_interval accordingly.
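
For the /federate job, that could look roughly like the sketch below - the job name, target address, and interval/timeout values are only examples to adapt to your setup:

scrape_configs:
  - job_name: 'federate'
    # Give the large federation scrape more headroom than the 10s default,
    # and scrape it less often than lighter jobs.
    scrape_interval: 2m
    scrape_timeout: 90s
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets:
          - 'prometheus-cluster-a:9090'  # placeholder address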

Another option, in combination with the suggestion above or on its own, is to have Prometheus instances dedicated to scraping non-overlapping sets of targets. That makes it possible to scale Prometheus and to use a different evaluation_interval per set of targets - for example, a longer scrape_timeout and a less frequent evaluation_interval (a higher value) for the jobs that take longer, so that they don't affect the others.
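
As a sketch of that idea, each central instance could federate a disjoint subset of jobs via the match[] selector - the instance roles, job names, and values below are made up for illustration:

# prometheus-central-heavy.yml - federates only the heavy, slow jobs.
scrape_configs:
  - job_name: 'federate-heavy'
    scrape_interval: 5m
    scrape_timeout: 2m
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"kube-state-metrics|cadvisor"}'
    static_configs:
      - targets: ['prometheus-cluster-a:9090']

# prometheus-central-light.yml - federates everything else, more frequently.
scrape_configs:
  - job_name: 'federate-light'
    scrape_interval: 1m
    scrape_timeout: 30s
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job!~"kube-state-metrics|cadvisor"}'
    static_configs:
      - targets: ['prometheus-cluster-a:9090']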

Also, check whether an exporter is misbehaving by accumulating metrics over time instead of just providing current readings at scrape time - otherwise, the payload returned to Prometheus will keep growing.
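
The automatically generated scrape_duration_seconds and scrape_samples_scraped metrics make this easy to track; a rule file roughly like the one below (alert names and thresholds are arbitrary) flags targets whose payload keeps growing or whose scrapes approach the timeout:

groups:
  - name: scrape-health  # illustrative rule group
    rules:
      - alert: ScrapePayloadGrowing
        # Sample count per target grew by more than 50% over a week.
        expr: scrape_samples_scraped > 1.5 * (scrape_samples_scraped offset 1w)
        for: 1h
      - alert: ScrapeNearTimeout
        # Scrapes are getting close to a 10s scrape_timeout.
        expr: scrape_duration_seconds > 8
        for: 15m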

-- apisim
Source: StackOverflow