Pythian Blog: Technical Track

Replication Issues: Never purge logs before slave catches them!!

A few days ago one of our customers contact us to inform a problem with one of their replication servers. This server was reporting this issue: Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. After brief research we found the customer had deleted some binary logs from the master and relay logs from slave to release space since they were having space issues. The customer requested us to get slave working again without affecting the production environment. Our goal here was to figure out how to restore replication without going through an entire rebuild of slave, although this may have been the easiest solution, it had been necessary to lock the master to take a good backup, restore the binary logging, and then restore the slave of a proper replication point. This kind of things starts to be complicated when we talk about more than 1TB to rebuild. Unfortunately customer didn't backed up relay-logs, but fortunately customer was backed up binary logs so that would be our starting point. First of all, we had to get the slave status to know when exactly it stops replication: mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: xxx.xxxxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.003079 Read_Master_Log_Pos: 39705328 Relay_Log_File: xxxxxxxxxxxx-relay-bin.002256 Relay_Log_Pos: 4 Relay_Master_Log_File: mysql-bin.002671 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Skip_Counter: 0 Exec_Master_Log_Pos: 1073741674 Relay_Log_Space: 438126861751 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec) Ok, with that information in our minds we could start with modifications, and this should be our action plan:
  1. Restore backed up binary logs in Master.
  2. Modify mysql-bin.index file in Master to make binary logs “visible”
  3. Reset slave to restart replication information.
  4. Restart replication.
All of this sound easy but each steps had it's own complexity. Let's take a look on each one:
  1. Customer doesn't have enough space in default folder to restore all deleted binary logs (about 250GB) so they had to mount a NFS folder.
  2. There isn't a good documented way to refresh cached information about binary logs reported with SHOW BINARY LOGS statement.
  3. Easy one.
  4. Easy one.
Now a deepest explanation of each issue we've found during our work:
  1. Once we had the list of each recovered binary log (wich was from mysql-bin.002671 to mysql-bin.002877) we moved to next step.
  2. We modified mysql-bin.index file to include lost binaries. After finish it looks this way:
[pythian@oasdb data]$ cat mysql-bin.index /XXXX_logs/Logs/mysql-bin.002671 /XXXX_logs/Logs/mysql-bin.002672 .... ./mysql-bin.002878 ./mysql-bin.002879 ./mysql-bin.002880 ... and so on Notice we have to add absolute path to some location since they weren't in the expected folder We've checked if changes takes effect:
mysql> show binary logs;
 +------------------+------------+
 | Log_name | File_size |
 +------------------+------------+
 | mysql-bin.002671 | 1073742130 |
 | mysql-bin.002672 | 1073741934 |
 | mysql-bin.002673 | 1073742001 |
 | mysql-bin.002674 | 1073742857 |
 | mysql-bin.002675 | 1073741944 |
 ...and so on
At this point thing looks good so we continued with steps 3 and 4 and try to get replication back. We took notes about master information: [pythian@XXXslave data]$cat master.info Password: 14 mysql-bin.003079 160765462 XXXX.XXXXXX.XXX rep1 pwd 3306 60 0 And start with slave reset: mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Master_Host: xxx.xxxxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.003079 Read_Master_Log_Pos: 160765462 Relay_Log_File: xxxxxxxxxxxx-relay-bin.002256 Relay_Log_Pos: 4 Relay_Master_Log_File: mysql-bin.002671 Slave_IO_Running: No Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1073741674 Relay_Log_Space: 438247922177 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec) mysql> reset slave; Query OK, 0 rows affected, 244 warnings (6 min 47.97 sec) mysql> show warnings; +-------+------+--------------------------------------------------------------------+ | Level | Code | Message | +-------+------+--------------------------------------------------------------------+ | Error | 6 | Error on delete of './xxxxxxxxxxxx-relay-bin.002256' (Errcode: 2) | | Error | 6 | Error on delete of './xxxxxxxxxxxx-relay-bin.002257' (Errcode: 2) | | Error | 6 | Error on delete of './xxxxxxxxxxxx-relay-bin.002258' (Errcode: 2) | and so on, which were expected warnings since relay logs were deleted.
Continuing changes: 
 mysql> change master to master_host ='xxxx.xxxxxx.xxx', master_user='repl', master_password='pwd', master_log_file='mysql-bin.002671', master_log_pos=1073741674;
 
 Query OK, 0 rows affected (0.16 sec)
 mysql> START SLAVE;
 Query OK, 0 rows affected (0.00 sec)
 mysql> show slave status\G
 *************************** 1. row ***************************
 Slave_IO_State:
 Master_Host: oxxxx.xxxxxx.xxx
 Master_User: repl
 Master_Port: 3306
 Connect_Retry: 60
 Master_Log_File: mysql-bin.002671
 Read_Master_Log_Pos: 1073741674
 Relay_Log_File: xxxxxxxxxxxx-relay-bin.000001
 Relay_Log_Pos: 98
 Relay_Master_Log_File: mysql-bin.002671
 Slave_IO_Running: No
 Slave_SQL_Running: Yes
 Replicate_Do_DB:
 Replicate_Ignore_DB:
 Replicate_Do_Table:
 Replicate_Ignore_Table:
 Replicate_Wild_Do_Table:
 Replicate_Wild_Ignore_Table:
 Last_Errno: 0
 Last_Error:
 Skip_Counter: 0
 Exec_Master_Log_Pos: 1073741674
 Relay_Log_Space: 98
 Until_Condition: None
 Until_Log_File:
 Until_Log_Pos: 0
 Master_SSL_Allowed: No
 Master_SSL_CA_File:
 Master_SSL_CA_Path:
 Master_SSL_Cert:
 Master_SSL_Cipher:
 Master_SSL_Key:
 Seconds_Behind_Master: NULL
 1 row in set (0.00 sec)
1 row in set (0.00 sec)Something doesn't work as expected. [pythian@xxxxxxxxxxxx data]$ tail -20 xxxxxxxxxxxx.xxxxx.xxx.err ... 100929 11:21:47 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.002671' at position 1073741674, relay log './xxxxxxxxxxxx-relay-bin.000001' position: 4 100929 11:21:47 [Note] Slave I/O thread: connected to master 'repl@oxxxx.xxxxxx.xxx:3306', replication started in log 'mysql-bin.002671' at position 1073741674 100929 11:21:47 [ERROR] Error reading packet from server: Could not find first log file name in binary log index file ( server_errno=1236) 100929 11:21:47 [ERROR] Got fatal error 1236: 'Could not find first log file name in binary log index file' from master when reading data from binary log 100929 11:21:47 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.002671', position 1073741674 100929 11:22:08 [Note] Error reading relay log event: slave SQL thread was killed This errors tell us references to a binary log on master was missed. The weirdest part here is when we perform a SHOW BINARY LOGS statement in master it actually showed us the binaries we wanted to use, so at this point we were stuck in a non-sense point because some cached information in master about binary logs needs to be refreshed dynamically (we just couldn't restart master). Solution for this was dig into source code to know how MySQL deal with logs, we found this interesting function (funny part it red)
-- void MYSQL_LOG::new_file(bool need_lock)
 {
  char new_name[FN_REFLEN], *new_name_ptr, *old_name;
  enum_log_type save_log_type;
 
  DBUG_ENTER("MYSQL_LOG::new_file");
  if (!is_open())
  {
  DBUG_PRINT("info",("log is closed"));
  DBUG_VOID_RETURN;
  }
 
  if (need_lock)
  pthread_mutex_lock(&LOCK_log);
  pthread_mutex_lock(&LOCK_index);
 
  safe_mutex_assert_owner(&LOCK_log);
  safe_mutex_assert_owner(&LOCK_index);
 
  /*
  if binlog is used as tc log, be sure all xids are "unlogged",
  so that on recover we only need to scan one - latest - binlog file
  for prepared xids. As this is expected to be a rare event,
  simple wait strategy is enough. We're locking LOCK_log to be sure no
  new Xid_log_event's are added to the log (and prepared_xids is not
  increased), and waiting on COND_prep_xids for late threads to
  catch up.
  */
  if (prepared_xids)
  {
  tc_log_page_waits++;
  pthread_mutex_lock(&LOCK_prep_xids);
  while (prepared_xids)
  pthread_cond_wait(&COND_prep_xids, &LOCK_prep_xids);
  pthread_mutex_unlock(&LOCK_prep_xids);
  }
 
  /* Reuse old name if not binlog and not update log */
  new_name_ptr= name;
 
  /*
  If user hasn't specified an extension, generate a new log name
  We have to do this here and not in open as we want to store the
  new file name in the current binary log file.
  */
  if (generate_new_name(new_name, name))
  goto end;
  new_name_ptr=new_name;
 
  if (log_type == LOG_BIN)
  {
  if (!no_auto_events)
  {
  /*
  We log the whole file name for log file as the user may decide
  to change base names at some point.
  */
  THD *thd = current_thd; /* may be 0 if we are reacting to SIGHUP */
  Rotate_log_event r(thd,new_name+dirname_length(new_name),
  0, LOG_EVENT_OFFSET, 0);
  r.write(&log_file);
  bytes_written += r.data_written;
  }
  /*
  Update needs to be signalled even if there is no rotate event
  log rotation should give the waiting thread a signal to
  discover EOF and move on to the next log.
  */
  signal_update();
  }
  old_name=name;
  save_log_type=log_type;
  name=0; // Don't free name
 close(LOG_CLOSE_TO_BE_OPENED | LOG_CLOSE_INDEX);
 
  /*
  Note that at this point, log_type != LOG_CLOSED (important for is_open()).
  */
 
  /*
  new_file() is only used for rotation (in FLUSH LOGS or because size >
  max_binlog_size or max_relay_log_size).
  If this is a binary log, the Format_description_log_event at the beginning of
  the new file should have created=0 (to distinguish with the
  Format_description_log_event written at server startup, which should
  trigger temp tables deletion on slaves.
  */
 
  /* reopen index binlog file, BUG#34582 */
  if (!open_index_file(index_file_name, 0))
  open(old_name, save_log_type, new_name_ptr,
  io_cache_type, no_auto_events, max_size, 1);
  my_free(old_name,MYF(0));
 
 end:
  if (need_lock)
  pthread_mutex_unlock(&LOCK_log);
  pthread_mutex_unlock(&LOCK_index);
 
  DBUG_VOID_RETURN;
 }
This function is called by FLUSH LOGS command, which close and re opens all MySQL logs (including binaries) and refresh cached binary index. Additionally before perform this command we went through these steps:
  1. Moved the last log that slaves catches ( mysql-bin.002671) to it's original position.
  2. Modified again the mysql-bin.index to point this file to proper path.
  3. Perform FLUSH LOGS and FLUSH PRIVILEGES command to force server to refresh the information about binary logs and clients.
  4. Went to slave and restart replication (and cross fingers of course)
This was the result: mysql> start slave; Query OK, 0 rows affected (0.00 sec) mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: oxxxx.xxxxxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.002672 Read_Master_Log_Pos: 563985312 Relay_Log_File: xxxxxxxxxxxx-relay-bin.000005 Relay_Log_Pos: 131915515 Relay_Master_Log_File: mysql-bin.002672 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 131915378 Relay_Log_Space: 563985741 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 946273 1 row in set (0.00 sec) Which show us a better picture. Conclusion: this issue was caused by a customer error, I mean, is not a good idea delete logs that wasn't cached by slave. In case this is needed at least be sure to keep relay logs in slave to keep replication working. Logs should be deleted ONLY with PURGE BINARY LOGS command instead of delete them using the OS commands. Dig into source code could give you a good approach of some MySQL internals and usually it will helps more than on line documentation. And last, but not least, I would like to say thanks to Marco Tusa for his great support in code review.

No Comments Yet

Let us know what you think

Subscribe by email