I installed the Prometheus Helm chart on a Kubernetes cluster for monitoring. By default, the chart uses an 8Gi persistent volume and a 15-day retention period.
A few days after deploying the chart, the Prometheus server pod entered a CrashLoopBackOff state. The pod logs showed the reason:
level=error ts=2019-10-09T11:03:10.802847347Z caller=main.go:625 err="opening storage failed: zero-pad torn page: write /data/wal/00000429: no space left on device"
That means there is no space left on the disk (persistent volume) to store the data. I cleared the existing data on the volume, which fixed the issue temporarily.
What would be the proper solution for this?
The Prometheus documentation says:
To plan the capacity of a Prometheus server, you can use the rough formula:
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
Can someone explain how to use this formula in practice?
Why is the 8Gi volume not enough for a 15-day retention period?
EDIT :
The default 8Gi space was 100% used after 6 days.
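As a rough sketch of how the formula can be applied to the observation above, the Python below solves it for the ingestion rate and then plugs that rate back in for 15 days. The value of 2 bytes per sample is an assumption (the Prometheus docs mention an average of around 1-2 bytes per sample), the function names are made up for illustration, and the calculation treats the whole 8Gi as sample data, which ignores the WAL and other overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GIB = 2 ** 30


def needed_disk_space(retention_time_seconds, ingested_samples_per_second, bytes_per_sample):
    """The rough capacity formula quoted from the Prometheus documentation."""
    return retention_time_seconds * ingested_samples_per_second * bytes_per_sample


def implied_ingestion_rate(used_bytes, elapsed_seconds, bytes_per_sample):
    """The same formula solved for ingested_samples_per_second."""
    return used_bytes / (elapsed_seconds * bytes_per_sample)


# Assumption: 2 bytes per sample (not measured on this cluster).
bytes_per_sample = 2

# Observation from the question: the 8Gi volume was full after 6 days.
rate = implied_ingestion_rate(8 * GIB, 6 * SECONDS_PER_DAY, bytes_per_sample)

# Plug the implied rate back into the formula for the 15-day retention.
needed = needed_disk_space(15 * SECONDS_PER_DAY, rate, bytes_per_sample)

print(f"implied ingestion rate: {rate:,.0f} samples/s")
print(f"space for 15 days: {needed / GIB:.1f} GiB")
```

Whatever value is assumed for bytes_per_sample cancels out of the 15-day figure, so this is the same as scaling 8Gi by 15/6: roughly 20Gi before any headroom.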
15 days is about 1.3 million seconds. Let's overestimate at 8 bytes per sample. At one sample per second, each metric (time series) then takes about 10 MB, so 8 GB would let you store about 800 series. You probably have more than that. Multiply the number of series you want to store by 10, and that is the number of megabytes you need. That will at least get you the right order of magnitude.
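The arithmetic above can be sketched in a few lines of Python. The one-sample-per-second rate is what the 10 MB figure implies, and the 60-second scrape interval in the second call is a hypothetical value added only to show how strongly the interval affects the result. The function names are illustrative.

```python
RETENTION_SECONDS = 15 * 24 * 60 * 60  # 1,296,000 seconds, "about 1.3 million"
BYTES_PER_SAMPLE = 8                   # deliberate overestimate from the answer
DISK_BYTES = 8 * 10**9                 # 8 GB


def bytes_per_series(retention_seconds, scrape_interval_seconds, bytes_per_sample):
    """Storage for one series kept for the whole retention period."""
    samples = retention_seconds / scrape_interval_seconds
    return samples * bytes_per_sample


def series_capacity(disk_bytes, per_series_bytes):
    """How many whole series fit on the disk."""
    return int(disk_bytes // per_series_bytes)


# One sample per second, as the 10 MB-per-series figure implies.
per_series_1s = bytes_per_series(RETENTION_SECONDS, 1, BYTES_PER_SAMPLE)
print(per_series_1s / 10**6, "MB per series")
print(series_capacity(DISK_BYTES, per_series_1s), "series fit in 8 GB")

# Hypothetical 60-second scrape interval, for comparison.
per_series_60s = bytes_per_series(RETENTION_SECONDS, 60, BYTES_PER_SAMPLE)
print(series_capacity(DISK_BYTES, per_series_60s), "series fit at a 60 s interval")
```

Without rounding 10.4 MB down to 10 MB, the first case gives a little under 800 series, which is still the same order of magnitude.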
As of Prometheus 2.7, they've introduced a new flag to manage retention. From the docs:
--storage.tsdb.retention.size
: [EXPERIMENTAL] This determines the maximum number of bytes that storage blocks can use (note that this does not include the WAL size, which can be substantial). The oldest data will be removed first. Defaults to 0 or disabled. This flag is experimental and can be changed in future releases. Units supported: KB, MB, GB, PB. Ex: "512MB"
You can set this flag in your Deployment configuration to limit retention by size instead of by time.
Since the flag is still experimental, according to this source, it is safest to leave room for the WAL and one maximum-size block (whose time span is the smaller of 10% of the retention time and a month).
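A minimal sketch of that headroom rule in Python, under stated assumptions: "a month" is taken as 31 days, the daily write volume is extrapolated from the question's report of 8Gi filling in 6 days, and the 30 GiB volume and 2 GiB WAL allowance are hypothetical numbers chosen only for illustration. The function names are not part of any Prometheus API.

```python
from datetime import timedelta

GIB = 2 ** 30


def max_block_duration(retention: timedelta) -> timedelta:
    """Smaller of 10% of the retention time and a month (taken as 31 days here)."""
    return min(retention / 10, timedelta(days=31))


def safe_retention_size(volume_bytes: float, wal_bytes: float, max_block_bytes: float) -> float:
    """Bytes left for --storage.tsdb.retention.size after reserving WAL and one block."""
    return volume_bytes - wal_bytes - max_block_bytes


retention = timedelta(days=15)
block = max_block_duration(retention)  # 10% of 15 days

# Rough extrapolation from the question: about 8 GiB written in 6 days.
bytes_per_day = 8 * GIB / 6
max_block_bytes = bytes_per_day * (block / timedelta(days=1))

# Hypothetical values, not taken from the chart or the docs.
volume_bytes = 30 * GIB
wal_bytes = 2 * GIB

budget = safe_retention_size(volume_bytes, wal_bytes, max_block_bytes)
print(f"max block spans {block}, about {max_block_bytes / GIB:.1f} GiB")
print(f"retention.size budget: {budget / GIB:.1f} GiB")
```

The point of the subtraction is that the size limit only counts blocks, so the value passed to the flag should be smaller than the volume by at least the WAL and one block still being written out.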