Wednesday, November 9, 2011

Understanding WAL nomenclature

There are some PostgreSQL functions that report information such as 68A/16E1DA8. But what does it mean? The documentation states that it is a transaction log location. Let me try to explain it in detail.

What is WAL? WAL is an abbreviation of Write-Ahead Log. That is how PostgreSQL named the transaction log (also known as REDO log). Every modern database system has one. It is basically a history of the operations executed by a DBMS. It improves performance (you only need to flush one file per transaction) and guarantees the ACID properties (if the DBMS was left in an inconsistent state -- because of a software crash, for example -- it reapplies all transactions that were committed but whose changes are not in the database files). PostgreSQL introduced the transaction log in version 7.1 (2001). It opted to implement only a REDO log (there is no UNDO log in PostgreSQL).

The transaction log files are stored in the $PGDATA/pg_xlog directory. They have names like 000000020000070A0000008E, made up entirely of hexadecimal digits (0123456789ABCDEF). The first 8 digits identify the timeline, the following 8 digits identify the (logical) xlog file, and the last 8 represent the (physical) xlog file (aka segment). The physical files in the pg_xlog directory are not actually the xlog files; PostgreSQL calls them segments. By default, a logical xlog file has 255 segments (16 MB each, 4080 MB in total). So in our example, 0000070A and 0000008E are the logical xlog file and the segment file, respectively. When segment file 000000FE is filled up, the logical xlog file is incremented by 1 (i.e. 0000070B) and the segment file sequence starts over again (i.e. 00000000) -- 000000020000070B00000000.
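
Just to make the three fields explicit, here is a throwaway query (plain string slicing; the column names are mine) that splits the example file name apart:

SELECT substr(f, 1, 8) AS timeline,
       substr(f, 9, 8) AS xlog_file,
       substr(f, 17, 8) AS segment
FROM (SELECT '000000020000070A0000008E'::text) AS t(f);
-- timeline = 00000002, xlog_file = 0000070A, segment = 0000008E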

What does 68A/16E1DA8 mean? 68A is the (logical) xlog file and 16E1DA8 is the offset inside the logical xlog file. The offset goes from 00000000 to FF000000 (4278190080 bytes = 4080 MB). When it reaches FF000000, we are in the last segment (aka physical xlog file); then the xlog file number is incremented by 1 (i.e. 68B) and the offset is restarted (00000000).
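
To check the arithmetic, here is a quick sketch using the usual hex-to-decimal cast trick (the column names are mine). Note how it already tells us which segment the location falls in and the offset inside that segment:

SELECT ('x' || lpad('16E1DA8', 16, '0'))::bit(64)::bigint AS offset_dec,
       ('x' || lpad('16E1DA8', 16, '0'))::bit(64)::bigint / (16 * 1024 * 1024) AS segment,
       ('x' || lpad('16E1DA8', 16, '0'))::bit(64)::bigint % (16 * 1024 * 1024) AS offset_in_segment;
-- offset_dec = 23993768, segment = 1, offset_in_segment = 7216552

Those last two values will show up again in the function examples below.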

PostgreSQL provides a set of functions [1] [2] that deal with transaction log locations. For example, converting a transaction log location to a segment file name:

postgres=# select pg_xlogfile_name('68A/16E1DA8');
     pg_xlogfile_name  
--------------------------
 000000020000068A00000001
(1 row)

Converting a transaction log location to a segment file name and the offset inside that segment:

postgres=# select * from pg_xlogfile_name_offset('68A/16E1DA8');
        file_name         | file_offset
--------------------------+-------------
 000000020000068A00000001 |     7216552
(1 row)


If I want to know which segment file the PostgreSQL server is writing the transaction log to right now:

postgres=# select pg_xlogfile_name(pg_current_xlog_location());
     pg_xlogfile_name  
--------------------------
 000000020000074B000000E4
(1 row)


If you have a replication scenario, you can obtain the receive and replay lags. It is simple hexadecimal math: the master's transaction log location minus the replica's transaction log location.

In the replica:

postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 74B/E4D1C628
(1 row)

postgres=# select pg_last_xlog_replay_location();
 pg_last_xlog_replay_location
------------------------------
 74B/E4D1C628
(1 row)

In the master:

postgres=# select pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 74B/E4D3B070
(1 row)

If they report the same xlog file (74B), it is a matter of computing E4D3B070 - E4D1C628 = 1EA48 (125512 bytes). It means that the replica is roughly 123 kB behind the master. Receive and replay locations aren't the same all the time; if the replica can't keep up applying the WAL, the receive location will be ahead of the replay location.
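
If you don't want to do the hexadecimal subtraction by hand, the same cast trick from above works here too (a sketch; lag_bytes is my name):

SELECT ('x' || lpad('E4D3B070', 16, '0'))::bit(64)::bigint -
       ('x' || lpad('E4D1C628', 16, '0'))::bit(64)::bigint AS lag_bytes;
-- lag_bytes = 125512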

What about different xlog files? Some tools use the following formula:

(FFFFFFFF * xlog_master + offset_master) - (FFFFFFFF * xlog_replica + offset_replica)

It's wrong ¹. Why? Because the offset only goes up to FF000000 (and not FFFFFFFF). It means that the above formula over-reports the lag by 16 MB times the difference between the xlog numbers whenever the xlog numbers aren't the same. The correct formula is:

(FF000000 * xlog_master + offset_master) - (FF000000 * xlog_replica + offset_replica)

Let's suppose the master and replica transaction log locations are 67E/AFE198 and 67D/FECFA308, respectively. The correct lag is 14696080 bytes, not 31473295 bytes (the difference between the results is 16 MB!).
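
To see where the discrepancy comes from, here is that example worked out with both formulas (the decimal constants are the hex values converted by hand: 67E = 1662, 67D = 1661, AFE198 = 11526552, FECFA308 = 4275020552, FF000000 = 4278190080, FFFFFFFF = 4294967295):

SELECT (4278190080 * 1662 + 11526552) - (4278190080 * 1661 + 4275020552) AS correct_lag,
       (4294967295 * 1662 + 11526552) - (4294967295 * 1661 + 4275020552) AS wrong_lag;
-- correct_lag = 14696080, wrong_lag = 31473295; the wrong formula adds
-- FFFFFFFF - FF000000 = 16777215 bytes (~16 MB) per xlog file of difference.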


¹ I will send patches to fix check_postgres and Munin. If you know another tool that is using the wrong formula, let me know or, even better, tell the tool's author. FYI, both tools were fixed in a few hours. Thanks, Greg and Magnus!

14 comments:

  1. Thanks! A fix for check_postgres was git pushed as 62517b40d2bbf3e624e97eea027ef21cec3cf99e

  2. It's probably a bit off topic, but can pg_last_xact_replay_timestamp() be used somehow to estimate the time lag between master and replica?

  3. @gray_hemp Yes. Keep in mind that this function is only available in 9.1 or later.

  4. Suggestion: requiring magic hex numbers in a simple calculation is bad and makes perpetuating this mistake easy. Perhaps you can submit a pg_replication_lag_bytes() function?

  5. An example of a "pg_replication_lag_bytes" function could be:

    CREATE OR REPLACE FUNCTION hex2int(TEXT) RETURNS bigint AS
    $$
    DECLARE
        result BIGINT;
    BEGIN
        EXECUTE 'SELECT CAST(X' || quote_literal($1) || ' AS BIGINT)' INTO result;
        RETURN result;
    END;
    $$
    LANGUAGE plpgsql;


    --
    -- Parameters: 1 = xlog master
    --             2 = xlog replica
    --
    CREATE OR REPLACE FUNCTION pg_replication_lag_bytes(TEXT, TEXT) RETURNS bigint AS
    $$
        SELECT ( hex2int('FF000000') * hex2int( split_part($1, '/', 1) ) + hex2int( split_part($1, '/', 2) ) ) -
               ( hex2int('FF000000') * hex2int( split_part($2, '/', 1) ) + hex2int( split_part($2, '/', 2) ) );
    $$
    LANGUAGE sql;



    postgres@bdteste=# SELECT pg_replication_lag_bytes('67E/AFE198', '67D/FECFA308');
     pg_replication_lag_bytes
    --------------------------
                      14696080
    (1 row)

  6. "By default, a logical xlog file has 255 segments (16 MB each, 4080 MB in total)"

    Can you tell me where the 255 is defined in the postgres source code or configuration file?
    I only found XLOG_SEG_SIZE, which defines the segment file size, but no clue about how many segments there are per sequence.

    Replies
    1. Look at XLogSegmentsPerXLogId in src/include/access/xlog_internal.h.

      Keep in mind that that piece of code changed after the hole at the end of a logical xlog file was removed (9.3 doesn't skip the last 16 MB segment every 4 GB). It means that now we have 256 segments every 4 GB instead of 255.

    2. @Euler Taveira
      So the correct formula for PostgreSQL 9.3 is

      (FFFFFFFF * xlog_master + offset_master) - (FFFFFFFF * xlog_replica + offset_replica)

      or

      (FF000000 * xlog_master + offset_master) - (FF000000 * xlog_replica + offset_replica)

      ?

  7. @Euler Taveira
    I set up PostgreSQL 9.6 streaming replication between two Linux servers using asynchronous mode.
    Then I met a case related to WAL latency, i.e. the receive WAL location on the slave > the current WAL location on the master.
    Here, receive WAL = SELECT pg_last_xlog_receive_location() and current WAL = SELECT pg_current_xlog_location().
    I guess there were some reasons such as network lag, disk speed, or something else.
    I'm still looking for a suitable answer for that case.

    For example:
    master        ------ slave
    16F9/307C84A0 ------ 16F9/307C97A0
    16F9/368E5A88 ------ 16F9/368E6E10
    16F9/3691DB60 ------ 16F9/3691F500

    My script is like this one: https://github.com/sensu-plugins/sensu-plugins-postgres/blob/master/bin/metric-postgres-graphite.rb

    Replies
    1. Yeah, the WAL sender/receiver transfers data very fast. Therefore, sometimes, my code can't get an exact result.

      For example (Ruby, https://github.com/sensu-plugins/sensu-plugins-postgres/blob/master/bin/metric-postgres-graphite.rb):
      # on the master, get the current xlog location
      res1 = conn_master.exec('SELECT pg_current_xlog_location()').getvalue(0, 0)
      # on the slave, get the last received xlog location (WAL written to disk)
      res = conn_slave.exec('SELECT pg_last_xlog_receive_location()').getvalue(0, 0)
      # compute the lag between the two values above
      lag = lag_compute(res1, res, m_segbytes)

    2. You should round to zero if the lag is negative. That happens because the master and slave queries don't run at the same instant.

      BTW, this is an old method (9.0 only). If your tool doesn't need to support 9.0, consider using the view pg_stat_replication. It is simpler and easier: you only need to query the master to obtain what you want, and you don't have the imprecision problem.

      -- 9.2 <= v < 9.4
      select *, pg_xlog_location_diff(sent_location, replay_location) as lagreplay from pg_stat_replication;
      -- >= 9.4
      select *, sent_location - replay_location as lagreplay from pg_stat_replication;

    3. I'll update my code. Thanks.

  8. Thanks for this explanation.

  9. In 14.x there is no such directory as $PGDATA/pg_xlog; it is $PGDATA/pg_wal now. Likewise, the function pg_current_xlog_location() does not exist.
