Postgres: "No space left on device", even with a big enough volume and only 10 records

12/9/2019

This particular service is Dockerized and its infrastructure runs on Kubernetes.

Five days ago, Postgres complained about having no space left on a 2GB volume. The error was about the logs not having enough space, so instead of trying to solve that problem we simply upgraded the volume to 10GB. In five days all of that space has been used up. Now we have the green light to investigate.

There are only 10 records in the whole database, so this is clearly not an ordinary situation, nor is it related to the amount of data.

The specific error is the following:

DETAIL:  Could not write to file "pg_clog/0000" at offset 32768: No space left on device.
ERROR:  could not access status of transaction 0

It doesn't mention the logs explicitly like the old error did, but it must be related.
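A first sanity check (a minimal sketch, assuming psql access as a superuser) is to see which database is actually holding the space:

SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;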

Another important point: the same Kubernetes definitions are used in several other projects (and even other architectures), but this is the first time I've encountered an issue like this.

Any insight is appreciated. I can supply more information, files or anything else on request; I'm not sure what else to add right now.

Thanks!

The Kubernetes definition files are the following:

apiVersion: v1
kind: Service
metadata:
  name: postgres-users-service
spec:
  type: ClusterIP
  selector:
    app: postgres-users-app
  ports:
    - port: 5432

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-users-deployment
spec:
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: postgres-users-app
  template:
    metadata:
      labels:
        app: postgres-users-app
    spec:
      containers:
        - name: postgres-users-container
          image: postgres:9.6

          envFrom:
            - secretRef:
                name: users-secrets

          ports:
            - containerPort: 5432

          volumeMounts:
            - name: pvc
              mountPath: /var/lib/postgresql/data
              subPath: postgres-users-app
      volumes:
        - name: pvc
          persistentVolumeClaim:
            claimName: postgres-users-pvc
      restartPolicy: Always
      imagePullSecrets:
        - name: gitlab-registry

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-users-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---

UPDATE

  • About wal_keep_segments, this is what shows up in postgresql.conf:
#wal_keep_segments = 0      # in logfile segments, 16MB each; 0 disables
  • No complex operations are being performed: creation of records (the 10 mentioned above, i.e. 10 users), some simple updates of those records, and that's it.

  • "Abandoned replication slots": I have no idea what this is. At least, we have never configured anything like that. (See the checks sketched below.)

UPDATE

  • About long-running transactions: all rows in pg_stat_activity show the column state as idle (except for the query asking for the records of this table), and the column xact_start has no value in any row. (The query used is sketched below.)
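For reference, a minimal version of that check (assuming superuser access, since other roles only see their own sessions in full):

SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;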

UPDATE

I've found a weird record in the users table. There's a GIF showing the strange behaviour, linked as "weird record".

UPDATE

ls pg_wal | wc -l

root@postgres-users-deployment-58bb54d59d-96dnw:/var/lib/postgresql/data# ls pg_wal | wc -l
ls: cannot access 'pg_wal': No such file or directory
0

ls pg_xlog | wc -l

root@postgres-users-deployment-58bb54d59d-96dnw:/var/lib/postgresql/data# ls pg_xlog | wc -l
61

UPDATE

users=# SELECT pg_size_pretty(pg_relation_size('users'));

 pg_size_pretty 
----------------
 48 kB
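One caveat about this number: pg_relation_size only measures the table's main heap; TOAST data and indexes are not included. A variant of the check that does include them (a sketch, nothing confirmed yet at this point) would be:

SELECT pg_size_pretty(pg_total_relation_size('users'));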

root@postgres-users-deployment-58bb54d59d-96dnw:/var/lib/postgresql/data# du --max-depth=2 -h $PGDATA

12K /var/lib/postgresql/data/pg_multixact/members
12K /var/lib/postgresql/data/pg_multixact/offsets
28K /var/lib/postgresql/data/pg_multixact
8.3G    /var/lib/postgresql/data/base/16384
7.0M    /var/lib/postgresql/data/base/12407
6.9M    /var/lib/postgresql/data/base/1
6.9M    /var/lib/postgresql/data/base/12406
8.3G    /var/lib/postgresql/data/base
12K /var/lib/postgresql/data/pg_notify
36K /var/lib/postgresql/data/pg_clog
32K /var/lib/postgresql/data/pg_stat_tmp
4.0K    /var/lib/postgresql/data/pg_replslot
4.0K    /var/lib/postgresql/data/pg_xlog/archive_status
961M    /var/lib/postgresql/data/pg_xlog
4.0K    /var/lib/postgresql/data/pg_commit_ts
4.0K    /var/lib/postgresql/data/pg_snapshots
4.0K    /var/lib/postgresql/data/pg_dynshmem
4.0K    /var/lib/postgresql/data/pg_logical/snapshots
4.0K    /var/lib/postgresql/data/pg_logical/mappings
12K /var/lib/postgresql/data/pg_logical
4.0K    /var/lib/postgresql/data/pg_tblspc
4.0K    /var/lib/postgresql/data/pg_stat
508K    /var/lib/postgresql/data/global
4.0K    /var/lib/postgresql/data/pg_serial
176K    /var/lib/postgresql/data/pg_subtrans
4.0K    /var/lib/postgresql/data/pg_twophase
9.2G    /var/lib/postgresql/data
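Almost all of the space sits under base/16384. A sketch of how to pin that down from psql (assuming superuser access): first map the directory name to a database, then list the largest relations in it using pg_total_relation_size, so that TOAST and indexes are counted too:

SELECT oid, datname FROM pg_database;

-- run while connected to the database whose OID is 16384
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;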
-- Sebastialonso
Tags: kubernetes, phoenix-framework, postgresql

1 Answer

12/9/2019

The solution to this was pretty straightforward.

It turns out the database was being written to a lot. A scraper process was entering data every 5 seconds (updating the master user). The key was to follow the GIF I uploaded: an array column was being appended to with a new element every 5 seconds. By then the column held around 100k elements, which I presume is why psql wasn't displaying the selected row.
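A rough sketch of how to confirm this and then reclaim the space (id and events are placeholder names, the real primary key and array column will differ, and VACUUM FULL takes an exclusive lock while it rewrites the table):

-- how large has the array grown? ("events" is a hypothetical column name)
SELECT id, array_length(events, 1) AS elements
FROM users
ORDER BY elements DESC NULLS LAST;

-- size including TOAST and indexes, versus the 48 kB heap reported earlier
SELECT pg_size_pretty(pg_total_relation_size('users'));

-- after trimming or dropping the runaway column, reclaim the disk space
VACUUM FULL users;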

-- Sebastialonso
Source: StackOverflow