PostgreSQL WalReceiver process waits on connecting to master regardless of "connect_timeout"

10/6/2019

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. When the master fails over or is temporarily unavailable, the standby loses its streaming replication connection, and its reconnection attempt takes a long time to fail before it retries.

I use PostgreSQL 10 with streaming replication (cluster-main-cluster-master-service is a service that always routes to the master, and all the replicas connect to this service for replication). I have tried settings such as connect_timeout and the keepalive options in primary_conninfo in recovery.conf, and wal_receiver_timeout in the standby's postgresql.conf, but none of them made a difference.

First, when the master goes down, replication stops with the following error (state 1):

2019-10-06 14:14:54.042 +0330 [3039] LOG:  replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL:  End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG:  record with incorrect prev-link 0/2D000028 at 0/33000098

After investigating the Postgres activity, I found that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is much longer than what I configured: although I set connect_timeout to 10 seconds, it takes about 2 minutes. Then it fails with the following error (state 3):

2019-10-06 14:17:06.035 +0330 [3264] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
        Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
        TCP/IP connections on port 5432?

On the next try, it successfully connects to the primary (state 4):

2019-10-06 14:17:07.892 +0330 [5786] LOG:  started streaming WAL from primary at 0/33000000 on timeline 17

I also tried killing the process while it is stuck (state 2). When I do, the process is restarted, connects, and then streams normally (jumps to state 4).

After checking netstat, I also found that, in the failover case, the walreceiver process has a connection to the old master in the SYN_SENT state.

-- Emad Mohamadi
kubernetes
postgresql
tcp

1 Answer

10/6/2019

connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.

To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:

net.ipv4.tcp_syn_retries = 3

and run sysctl -p.

That should reduce the time significantly.
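
As a rough sketch of the arithmetic behind this, the snippet below estimates how long an unanswered connection attempt waits before giving up. It assumes the usual Linux behavior: the SYN retransmission timeout starts at 1 second and doubles after each retry (capped at 120 seconds), and the default tcp_syn_retries is 6. Check the values on your own kernel. The function name syn_connect_timeout is made up for illustration.

```python
def syn_connect_timeout(retries, initial_rto=1.0, max_rto=120.0):
    """Approximate seconds before connect() gives up when no SYN is answered.

    Assumes the timeout starts at initial_rto and doubles after every
    retransmission, capped at max_rto (assumed Linux defaults).
    """
    total = 0.0
    rto = initial_rto
    # One wait after the initial SYN, plus one wait after each retry.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, max_rto)
    return total


if __name__ == "__main__":
    # Compare the assumed default (6) with the suggested setting (3).
    for retries in (6, 3):
        seconds = syn_connect_timeout(retries)
        print(f"tcp_syn_retries = {retries}: about {seconds:.0f} s")
```

Under these assumptions, 6 retries add up to about 127 seconds, which roughly matches the 2 minutes between the log entries at 14:14:55 and 14:17:06, while 3 retries add up to about 15 seconds.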

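Do not reduce the value too far: with very few retries, a lost SYN packet on a congested network can make connection attempts fail, which might make your system less stable.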

-- Laurenz Albe
Source: StackOverflow