Our Kubernetes cluster recently crashed because etcd reported "database size exceeded".
We managed to bring everything back up with a "simple" defrag of the etcd cluster endpoints (see here).
Unfortunately, not everything is back to normal yet. In particular, this is the /var/lib/etcd/member/snap directory on one of the etcd endpoints:
total 25G
-rw-r--r--. 1 etcd root 21K Jun 29 20:27 00000000000810cb-00000000066072bc.snap
-rw-r--r--. 1 etcd root 21K Jun 29 20:40 00000000000810cb-00000000066099cd.snap
-rw-r--r--. 1 etcd root 21K Jun 29 20:55 00000000000810df-000000000660c0de.snap
-rw-r--r--. 1 etcd root 21K Jun 29 21:19 000000000008113f-000000000660e7ef.snap
-rw-r--r--. 1 etcd root 21K Jun 29 21:37 0000000000081162-0000000006610f00.snap
-rw-------. 1 etcd root 916M Jun 29 15:40 000000000619e354.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:41 00000000061b9704.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:43 00000000061ca269.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:44 00000000061dbb43.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:47 00000000061e40df.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:48 00000000061e8192.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:49 00000000061f8799.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:49 0000000006200018.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:52 0000000006225cfd.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:53 00000000062323d6.snap.db
-rw-------. 1 etcd root 916M Jun 29 15:53 00000000062396fa.snap.db
-rw-------. 1 etcd root 970M Jun 29 15:54 000000000624dfe7.snap.db
-rw-------. 1 etcd root 1003M Jun 29 15:54 0000000006259f3f.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 15:58 0000000006296ff0.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 15:59 000000000629b9bc.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:01 00000000062b02c0.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:02 00000000062bef04.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:05 00000000062db8e2.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:09 00000000062ef4c4.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:11 00000000062fce7a.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:12 00000000063063c1.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:12 000000000630b648.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:12 000000000630bdbb.snap.db
-rw-------. 1 etcd root 1.2G Jun 29 16:13 000000000630e3f1.snap.db
-rw-------. 1 etcd root 27M Jun 29 21:41 db
This is the only endpoint in this state, and it is running low on disk space.
We could not find any documentation about these files, or about how to reduce the overall disk footprint in our case (defrag, compaction, etc. have already been tried).
What are these files? How can we reduce the disk footprint of this endpoint (that is, get rid of those huge snap.db files)?
They seem to be snapshots of a given state of your etcd cluster over time.
It sounds like they can be rotated, at least according to this:
etcdserver: purge old snap.db files #7967 https://github.com/coreos/etcd/pull/7967
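Before removing anything, it may help to see how much space the older snap.db files actually take. Here is a minimal Python sketch that only reports and never deletes. It assumes the naming seen in your listing (16 hex digits followed by .snap.db) and treats a larger hex value as newer. The function name stale_snap_dbs and the keep=5 default are my own illustrative choices, not anything taken from etcd, and whether the reported files are safe to delete on a running member is something you would still need to confirm.

```python
import re
from pathlib import Path

# Filename pattern inferred from the listing in the question:
# 16 lowercase hex digits followed by ".snap.db" (an assumption, not an etcd API).
SNAP_DB = re.compile(r"^([0-9a-f]{16})\.snap\.db$")


def stale_snap_dbs(snap_dir, keep=5):
    """Return (path, size_in_bytes) for every *.snap.db file except the
    `keep` files with the highest hex number, ordered oldest first.

    This only inspects the directory; it does not delete anything.
    """
    found = []
    for path in Path(snap_dir).iterdir():
        match = SNAP_DB.match(path.name)
        if match and path.is_file():
            # Sort key is the numeric value of the hex prefix.
            found.append((int(match.group(1), 16), path))
    found.sort()
    # Everything except the last `keep` entries is a candidate for cleanup.
    stale = found[:-keep] if keep > 0 else found
    return [(path, path.stat().st_size) for _, path in stale]
```

You could call it as stale_snap_dbs("/var/lib/etcd/member/snap") and sum the sizes to estimate how much disk would be reclaimed. The plain .snap files and the db file do not match the pattern, so they are never reported.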
Hope this answer is helpful somehow.