How to calculate the persistent volume size needed for prometheus server pod in K8S cluster?

10/11/2019

I installed the Prometheus Helm chart on a Kubernetes cluster for monitoring. By default:

  • persistent volume size for prometheus server is defined as 8Gi.
  • Prometheus server will store the metrics in this volume for 15 days (retention period)

A few days after deploying the chart, the Prometheus server pod entered a CrashLoopBackOff state. The reason found in the pod logs was:

level=error ts=2019-10-09T11:03:10.802847347Z caller=main.go:625 err="opening storage failed: zero-pad torn page: write /data/wal/00000429: no space left on device"

That means there is no space left on the disk (persistent volume) to save the data. I cleared the existing data on the volume, which fixed the issue temporarily.

What would be the proper solution for this?

The Prometheus documentation says:

To plan the capacity of a Prometheus server, you can use the rough formula:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

Can someone explain how to use this formula properly?

Why is the 8Gi size not enough for a 15-day retention period?

EDIT :

The default 8Gi space was 100% used after 6 days.

-- AnjanaDyna
kubernetes
persistent-storage
prometheus

2 Answers

10/11/2019

15 days is about 1.3 million seconds. Let’s overestimate at 8 bytes per sample. Assuming one sample per second, each time series then takes about 10 MB, so 8 GB would let you store about 800 series. You probably have more than that. Multiply the number of series you want to store by 10 and that’s the number of megabytes you need. Roughly, that will get you the right order of magnitude at least.
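The estimate above can be sketched in Python. This is only an illustration of the formula from the Prometheus documentation: the function name `needed_disk_bytes` and the example workload (1,000 series scraped every 15 seconds) are assumptions, and 8 bytes per sample is the deliberate overestimate used in this answer (the Prometheus docs mention an average of roughly 1-2 bytes per sample).

```python
def needed_disk_bytes(retention_days, active_series, scrape_interval_s, bytes_per_sample=8):
    """Rough capacity estimate: retention * samples per second * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    # Each series contributes one sample per scrape interval.
    ingested_samples_per_second = active_series / scrape_interval_s
    return retention_seconds * ingested_samples_per_second * bytes_per_sample


# One series sampled every second for 15 days at 8 bytes per sample.
per_series = needed_disk_bytes(15, active_series=1, scrape_interval_s=1)
print(f"{per_series / 1e6:.1f} MB per series")

# Hypothetical workload: 1,000 series scraped every 15 seconds.
total = needed_disk_bytes(15, active_series=1000, scrape_interval_s=15)
print(f"{total / 1e9:.2f} GB in total")
```

The scrape interval matters as much as the series count: the same series scraped every 15 seconds needs about one fifteenth of the space of the one-sample-per-second case.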

-- coderanger
Source: StackOverflow

10/11/2019

As of Prometheus 2.7, they've introduced a new flag to manage retention. From the docs:

--storage.tsdb.retention.size: [EXPERIMENTAL] This determines the maximum number of bytes that storage blocks can use (note that this does not include the WAL size, which can be substantial). The oldest data will be removed first. Defaults to 0 or disabled. This flag is experimental and can be changed in future releases. Units supported: KB, MB, GB, PB. Ex: "512MB"

You can set this argument in your Deployment configuration to limit retention by size instead of by time.

Since the flag is still experimental, according to this source it is safest to leave space for the WAL and one maximum-size block (whose time span is the smaller of 10% of the retention time and a month).
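One way to turn that advice into a concrete flag value is sketched below. The helper `retention_size_flag` is hypothetical, and the headroom figures (1 GiB for the WAL and 1 GiB for one block) are placeholders, not measurements; the real numbers should come from the actual sizes seen on your volume.

```python
def retention_size_flag(volume_bytes, wal_bytes, max_block_bytes):
    """Build a --storage.tsdb.retention.size argument that leaves headroom."""
    budget = volume_bytes - wal_bytes - max_block_bytes
    if budget <= 0:
        raise ValueError("volume is too small for the requested headroom")
    # Dividing by 2**20 keeps the cap within budget whether "MB" is read
    # as 10**6 or 2**20 bytes.
    return f"--storage.tsdb.retention.size={budget // 2**20}MB"


GIB = 2**30
# Assumed figures: 8Gi volume, 1 GiB reserved for the WAL, 1 GiB for one block.
print(retention_size_flag(8 * GIB, 1 * GIB, 1 * GIB))
```

The resulting string can be passed as an extra argument to the Prometheus server container.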

-- Ali Tou
Source: StackOverflow