This particular service is Dockerized and its infrastructure runs on Kubernetes.
Five days ago, Postgres complained that there was no space left on a 2 GB volume. The error was about the logs not having enough space, so instead of trying to solve that problem, we simply grew the volume to 10 GB. In five days all of that space has been used up as well. Now we have the green light to investigate.
There are only 10 records in the whole database, so clearly this is not a common issue, nor is it related to the data.
The specific error is the following:
ERROR: could not access status of transaction 0
DETAIL: Could not write to file "pg_clog/0000" at offset 32768: No space left on device.
It doesn't refer specifically to the logs like the old error did, but it has to be related.
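For reference, checking each database's size from psql should show whether the space is actually going into a database rather than into the server logs; pg_size_pretty and pg_database_size are standard functions:

SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;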
Another important point: the same K8S definitions are used with several other projects and even other architectures, but this is the first issue like this I've encountered.
Any insight is appreciated. I can supply more information, files, or anything else that would help; I'm not sure what else I can add right now.
Thanks!
The Kubernetes definition files are the following:
apiVersion: v1
kind: Service
metadata:
  name: postgres-users-service
spec:
  type: ClusterIP
  selector:
    app: postgres-users-app
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-users-deployment
spec:
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: postgres-users-app
  template:
    metadata:
      labels:
        app: postgres-users-app
    spec:
      containers:
        - name: postgres-users-container
          image: postgres:9.6
          envFrom:
            - secretRef:
                name: users-secrets
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: pvc
              mountPath: /var/lib/postgresql/data
              subPath: postgres-users-app
      volumes:
        - name: pvc
          persistentVolumeClaim:
            claimName: postgres-users-pvc
      restartPolicy: Always
      imagePullSecrets:
        - name: gitlab-registry
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-users-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
UPDATE
Regarding wal_keep_segments, this is what shows up in postgresql.conf:
#wal_keep_segments = 0    # in logfile segments, 16MB each; 0 disables
No complex operations are being performed: records are created (the 10 users mentioned above), some of those records get simple updates, and that's it.
"abandoned replication slots": Have no idea what this is. At least, we haven't configured anything like this, ever.
UPDATE
pg_stat_activity shows the state column as idle for every row (except for the query asking for the records of this table). Also, the xact_start column has no value for any row.
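For reference, this is roughly the query behind that check; all the columns are standard in pg_stat_activity:

SELECT pid, datname, state, xact_start, query
FROM pg_stat_activity;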
UPDATE
I've found a weird record in the users table. Here's the GIF showing the strange behaviour.
UPDATE
Output of ls pg_wal | wc -l:
root@postgres-users-deployment-58bb54d59d-96dnw:/var/lib/postgresql/data# ls pg_wal | wc -l
ls: cannot access 'pg_wal': No such file or directory
0
Output of ls pg_xlog | wc -l:
root@postgres-users-deployment-58bb54d59d-96dnw:/var/lib/postgresql/data# ls pg_xlog | wc -l
61
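61 segments at 16 MB each is roughly 1 GB, which matches the pg_xlog size in the du output below, so WAL alone doesn't explain the missing space. The WAL retention settings can be double-checked from psql (all of these exist in 9.6):

SHOW max_wal_size;
SHOW min_wal_size;
SHOW wal_keep_segments;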
UPDATE
users=# SELECT pg_size_pretty(pg_relation_size('users'));
pg_size_pretty
----------------
48 kB
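Note that pg_relation_size only counts the table's main fork; pg_total_relation_size also includes indexes and TOAST data, which can be much larger:

SELECT pg_size_pretty(pg_total_relation_size('users'));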
root@postgres-users-deployment-58bb54d59d-96dnw:/var/lib/postgresql/data# du --max-depth=2 -h $PGDATA
12K /var/lib/postgresql/data/pg_multixact/members
12K /var/lib/postgresql/data/pg_multixact/offsets
28K /var/lib/postgresql/data/pg_multixact
8.3G /var/lib/postgresql/data/base/16384
7.0M /var/lib/postgresql/data/base/12407
6.9M /var/lib/postgresql/data/base/1
6.9M /var/lib/postgresql/data/base/12406
8.3G /var/lib/postgresql/data/base
12K /var/lib/postgresql/data/pg_notify
36K /var/lib/postgresql/data/pg_clog
32K /var/lib/postgresql/data/pg_stat_tmp
4.0K /var/lib/postgresql/data/pg_replslot
4.0K /var/lib/postgresql/data/pg_xlog/archive_status
961M /var/lib/postgresql/data/pg_xlog
4.0K /var/lib/postgresql/data/pg_commit_ts
4.0K /var/lib/postgresql/data/pg_snapshots
4.0K /var/lib/postgresql/data/pg_dynshmem
4.0K /var/lib/postgresql/data/pg_logical/snapshots
4.0K /var/lib/postgresql/data/pg_logical/mappings
12K /var/lib/postgresql/data/pg_logical
4.0K /var/lib/postgresql/data/pg_tblspc
4.0K /var/lib/postgresql/data/pg_stat
508K /var/lib/postgresql/data/global
4.0K /var/lib/postgresql/data/pg_serial
176K /var/lib/postgresql/data/pg_subtrans
4.0K /var/lib/postgresql/data/pg_twophase
9.2G /var/lib/postgresql/data
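So pg_relation_size reports the users table at only 48 kB, yet base/16384 holds 8.3 GB, meaning the space is in some other relation in that database (most likely TOAST data or indexes). A query like this, ordering pg_class by total relation size, should reveal the culprit:

SELECT c.relname, c.relkind,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relkind IN ('r', 'i', 't', 'm')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;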
The solution to this was pretty straightforward.
It turns out the database was being written to a lot: a scraper process was inserting data every 5 seconds (the master user record was being updated). The key was the GIF I uploaded: an array column was getting a new element appended every 5 seconds. It had grown to around 100k elements, and I presume that is why psql wasn't displaying the selected record.
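For anyone who runs into something similar: every UPDATE in Postgres writes a new row version, so repeatedly rewriting a row that carries a ~100k-element array bloats the table and its TOAST data very quickly. A rough way to confirm that, and to reclaim the space once the writer has been stopped, is something like this (VACUUM FULL rewrites the table and takes an exclusive lock):

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

VACUUM FULL users;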