This is an internal documentation. There is a good chance you’re looking for something else. See Disclaimer.

Postgres Replication

What to do if Replication Streaming on Postgres fails

Important! Read this section first before doing anything!

In this section, very powerful commands are described. So before execute anything, be 100% sure that you are on the right server! Else you could cause a lot of damage.

To check if you are on the right server use the following query. On master the result is empty. On slave it is not (see example).

  1. Check if you are on slave

    SELECT pg_last_xlog_receive_location();
    
      pg_last_xlog_receive_location
     -------------------------------
             2493/54AAF118
    

Check if Replication is working

First of all you should check the delay. If the delay is critical nagios or Backoffice will inform you about it. This will look like this:

  1. Check the delay

    POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB postgres (host:db01slave) 147866242312 seconds
    

    If the number (delay in seconds) is extremely high then you know something must be wrong with the replication.

  2. Another way to check the delay directly is

    $ ssh db01a
    
    $ psql -U postgres -h db01slave
    
    SELECT now() - pg_last_xact_replay_timestamp() AS delay;
    
             delay
         --------------
         04:15:55.03134
    

    It prints the delay in human readable time format like (hh:mm:ss). The advantage of this query is, that it is very comprehensive. So you don’t have to calculate a lot, the output is appropriate. If the delay is very small the slave is probably reading from master at that moment. But always have an look at the issue.

  3. Check the last replay

    SELECT pg_last_xact_replay_timestamp();
    
      pg_last_xact_replay_timestamp
     -------------------------------
      2017-10-10 11:45:00.945915+02
    

    It prints the timestamp for the last replication. It is more comprehensive than just a number. If the delay is more than 3 hours, you should have a look at the postgres log file: /var/log/postgres/postgresql-9.5-main.log. If you find messages like “0000000200002440000000A8: file already removed” then you have to do the basebackup. Otherwise you could have a look at the postgres configurations. If you can’t see anything odd, a restart of postgres could solve the problem.

Hint

Postgres doesn’t replay any changes during dumps.

  1. Restart Postgres-Slave

    $ sudo systemctl restart postgresql@9.5-main.service
    

Basebackup

Important! All these commands must only be executed on de Slave (db01slave)! The initial user for this procedure should be tadm.

  1. save password from /postgres/postgres_data/recovery.conf

  2. change to user root

  3. copy the password

    sudo su
    
    cat /postgres/postgres_data/recovery.conf
    
  4. Base backup

    screen -S restore_slavei
    
    cp -a /etc/postgresql/9.5/main/ /postgres/backup/main
    
    pg_dropcluster --stop 9.5 main
    
    cd /etc/postgres
    
    mkdir 9.5
    
    chown -R postgres:postgres 9.5
    
    cp -a /postgres/backup/main /etc/postgresql/9.5/main
    
    ^D or exit
    
    sudo su postgres
    
    pg_basebackup -h db01master -U pg_replica -D /postgres/postgres_data/ -v -P --xlog-method=stream
    
    cd /postgres/postgres_data/ && mv recovery.done recovery.conf
    
    recovery.conf change (Slave-IP) -> (Master-IP)
    
    ^D or exit
    
    sudo systemctl start postgresql@9.5-main.service