Pythian Blog: Technical Track

Beware Starting Slaves in the Position in the master.info file

I’ve seen many a good DBA make the master of starting slaves from the position in the master.info file, most recently this week, that I want to bring it to everyone’s attention. Of course I mean the underlying issue and not the names of the DBA because that would be cruel.

In the typical scenario where this is an issue, the sequence of events is roughly the same with some small variation. A cold backup or a snapshot is restored onto a new server to build out a new slave. The binary log position from the master.info file, which is part of the backup, is used to start replication. Eventually after a short while, someone notices data discrepancies on the new slave compared to the master or replication stops due to an error.

The problem can be best looked by looking the slave status output in MySQL like below:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: msandbox
                  Master_Port: 26768
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 4723
               Relay_Log_File: mysql_sandbox26769-relay-bin.000002
                Relay_Log_Pos: 874
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 729
              Relay_Log_Space: 1042
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
1 row in set (0.00 sec)

mysql>

There are two sets of coordinates for replication in consideration here, the Execute Master Position and the Read Master Position. The Execute Master Position, denoted by the ‘Relay_Master_Log_File’ and Exec_Master_Log_Pos’ in the output above, is the position in the master’s binary log of the current statement being executed by the slave from the relay log. The Read Master Position, denoted by the ‘Master_Log_File’ and Exec_Master_Log_Pos’ in the output above, is the position in the master’s binary logs which the slave is reading from and writing to the relay log. So if the slave is lagged for some reason, which can happen often in the MySQL world, the Read Master Position will be different and in fact newer then the Execute Master Position. On the file system, the Read Master Position is stored in the master.info file and the Execute Master Position is stored instead in the relay-log.info file.

So if the slave was lagged when it was shut down for the cold backup or when the snapshot was taken, then the two positions would be different. Consider the following example from a server which was lagged behind.

The master.info file shows

--(Wed:20110112:1527)-(0:$)-- more master.info
15
mysql-bin.000001
4723
127.0.0.1
msandbox
msandbox
26768
60
0

0
--(wang@polygon)-(/home/wang/sandboxes/rsandbox_5_1_53/node1/data)--

The relay-log.info file shows

--(Wed:20110112:1527)-(0:$)-- more relay-log.info
./mysql_sandbox26769-relay-bin.000002
874
mysql-bin.000001
729
--(wang@polygon)-(/home/wang/sandboxes/rsandbox_5_1_53/node1/data)--

When the replication is started on the new server from the position in the master.info file, the statements in the binary logs between the Execute Master Position and the Read Master Position get skipped and cause data corruption.

In the end, the solution is to always start a new slave using the position in the relay-log.info file.

No Comments Yet

Let us know what you think

Subscribe by email