Recovering After a MySQL Replication Failure

I just solved a weird MySQL replication problem and thought I would share with you all. First, the background. I have a master and slave running with one-way replication. The slave just sits by waiting for his time to shine, but otherwise doesn't do anything. Well, last week the master had a problem with the filesystem. I can't recall exactly what it was, out of space or something. It confused the heck out of the master and so it basically shut down. One of my co-workers fixed the problem and got the master running again, but the slave was in a pickle. Here is the error it was showing:

Relay_Master_Log_File: mysql-bin.031
     Slave_IO_Running: Yes
    Slave_SQL_Running: No
           Last_errno: 0
           Last_error: Query 'DELETE FROM foo WHERE bar = 1' caused different errors on master and slave.
                       Error on master: 'Got error %d from table handler' (1030), Error on slave:
                       'no error' (0). Default database: 'baz'
  Exec_master_log_pos: 118871

Because it thought the command failed on the master, it refused to continue. I can't say it's an altogether bad plan since data integrity is generally the main theme of a database (yes, cue the jokes about my using MySQL in the first place). The question became, how do I get the slave to start up again. "SLAVE STOP; SLAVE START" didn't have any effect.

The trick was suggested to me by a post at mysql.com which pointed out a tool new to me, mysqlbinlog. See I figured the simplest thing would be to restart replication at the step just after the "failed" transaction, since I knew that transaction had actually succeeded. But I have no idea how the binlog counters work, so I couldn't just make up numbers. It's some kind of binary offset. Well, mysqlbinlog will show it to you.

# mysqlbinlog mysql-bin.031 -j 118871 |less

Which of course showed me this:

# at 118955
#080605 18:59:09 server id 1  log_pos 118955    Query   thread_id=3218  exec_time=0     error_code=0

So on my slave I restarted replication at offset 118955 and like magic, the slave ripped through the binlogs and caught up in practically no time at all.

tags: 

Subscribe to Comments for "Recovering After a MySQL Replication Failure" Subscribe to zmonkey.org - All comments