环境介绍
数据库1主2从,mysql版本是v5.19
表结构
一、主库新增记录,从库提示主键冲突
模拟故障
1, master上关闭 sql_log_bin,删除id =103 后打开
2, 确认此时从库有id=103,主库没有
3, master insert id=103 主从异常
sql
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Could not execute Write_rows event on table test.hero3; Duplicate entry '103' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000045, end_log_pos 3056
修复故障
根据报错test.hero3; Duplicate entry '103' for key 'PRIMARY,去slave中找到对应记录,删除即可
sql
mysql> delete from test.hero3 where id=103;
Query OK, 1 row affected (0.00 sec)
mysql> stop slave;start slave;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.129
Master_User: repl
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.000046
Read_Master_Log_Pos: 194
Relay_Log_File: mha3-relay-bin.000012
Relay_Log_Pos: 407
Relay_Master_Log_File: mysql-bin.000046
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
二、主库UPDATE,从库找不到记录
模拟故障
1.1, slave上将id=1 的记录delete
1.2, master上执行update xxx where id = 1;
1.3, show slave status\G查看主从同步异常
sql
# slave:
mysql> delete from hero3 where id = 1;
Query OK, 1 row affected (0.01 sec)
# master
mysql> update hero3 set age = 1 where id = 1;
Query OK, 1 row affected (0.03 sec)
Rows matched: 1 Changed: 1 Warnings: 0
# slave;
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.129
Master_User: repl
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.000045
Read_Master_Log_Pos: 2819
Relay_Log_File: mha2-relay-bin.000009
Relay_Log_Pos: 866
Relay_Master_Log_File: mysql-bin.000045
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table test.hero3; Can't find record in 'hero3', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000045, end_log_pos 2788
修复故障
从库中找到对应的记录,在slave上将该记录插入;
1,根据同步异常错误mysql-bin.000045, end_log_pos 2788,去主库对应sql
2, mysqlbinlog --no-defaults -vv --base64-output=decode-rows mysql-bin.000045 | grep -A 20 'end_log_pos 2788' 命令查找执行的sql
sql
[root@mha1 mysql]# mysqlbinlog --no-defaults -vv --base64-output=decode-rows mysql-bin.000045 | grep -A 20 'end_log_pos 2788'
#241222 10:38:37 server id 10129 end_log_pos 2788 CRC32 0x1bfc92fc Update_rows: table id 224 flags: STMT_END_F
### UPDATE `test`.`hero3`
### WHERE
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2='mao1' /* VARSTRING(60) meta=60 nullable=0 is_null=0 */
### @3=110 /* INT meta=0 nullable=0 is_null=0 */
### SET
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2='mao1' /* VARSTRING(60) meta=60 nullable=0 is_null=0 */
### @3=1 /* INT meta=0 nullable=0 is_null=0 */
3,根据 UPDATE `test`.`hero3` WHERE @1=1 @2='mao1' @3=110 拼接insert并插入slave
sql
mysql> insert into `test`.`hero3` values (1,'mao1',110);
Query OK, 1 row affected (0.03 sec)
mysql> stop slave; start slave;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.129
Master_User: repl
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.000045
Read_Master_Log_Pos: 2819
Relay_Log_File: mha2-relay-bin.000010
Relay_Log_Pos: 454
Relay_Master_Log_File: mysql-bin.000045
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
三、主库DELETE,从库找不到记录
模拟故障
主库在关闭binlog的情况下insert一条数据,然后打开binlog并删除该数据
sql
mysql> set sql_log_bin = 0;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into hero3(name,age) values('delete_test',22);
Query OK, 1 row affected (0.01 sec)
mysql> set sql_log_bin = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from hero3 where id >= 102;
+-----+-------------+-----+
| id | name | age |
+-----+-------------+-----+
| 102 | mao2 | 111 |
| 103 | mao2 | 109 |
| 104 | delete_test | 22 |
+-----+-------------+-----+
3 rows in set (0.00 sec)
mysql> delete from hero3 where id = 104;
Query OK, 1 row affected (0.03 sec)
此时从库报错信息如下
sql
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table test.hero3; Can't find record in 'hero3', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000045, end_log_pos 2092
修复故障
由于在执行delete的时候,slave找不到该记录,那么我们就可以跳过该事务
Retrieved_Gtid_Set 表示slave已经收到的gtid集合,Executed_Gtid_Set表示slave已经执行的gtid集合,从中可以看到 4855c186-ba02-11ef-8bf4-000c297511da:11没有执行
由于删除的数据在slave中找不到,所以我们就可以跳过该事务;
sql
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> set gtid_next = '4855c186-ba02-11ef-8bf4-000c297511da:11'; #设定下一个要执行的GTID:11
Query OK, 0 rows affected (0.02 sec)
mysql> begin;commit; #产生一个空事务
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> set gtid_next='AUTOMATIC'; # 将next设为自动
Query OK, 0 rows affected (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected (0.02 sec)
mysql> show slave status\G # 可以看到主从状态已恢复 *************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.129
Master_User: repl
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.000045
Read_Master_Log_Pos: 2123
Relay_Log_File: mha3-relay-bin.000008
Relay_Log_Pos: 454
Relay_Master_Log_File: mysql-bin.000045
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
四、主库日志被purged,主从同步失败
模拟故障
4.1.1 stop slave断开主从同步
4.1.2 主库上执行一些事务,使GTID持续增长
sql
mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 28 rows affected (0.00 sec)
Records: 28 Duplicates: 0 Warnings: 0
mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 56 rows affected (0.00 sec)
Records: 56 Duplicates: 0 Warnings: 0
mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 112 rows affected (0.00 sec)
Records: 112 Duplicates: 0 Warnings: 0
mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 224 rows affected (0.00 sec)
Records: 224 Duplicates: 0 Warnings: 0
mysql> show master logs;
+------------------+-----------+
| Log_name | File_size |
+------------------+-----------+
| mysql-bin.000053 | 734 |
| mysql-bin.000054 | 972 |
| mysql-bin.000055 | 1448 |
| mysql-bin.000056 | 2400 |
| mysql-bin.000057 | 4257 |
+------------------+-----------+
4.1.3 清理binlog,造成master上事务缺失
sql
mysql> purge binary logs to 'mysql-bin.000058';
Query OK, 0 rows affected (0.00 sec)
mysql> show master logs;
+------------------+-----------+
| Log_name | File_size |
+------------------+-----------+
| mysql-bin.000058 | 194 |
+------------------+-----------+
1 row in set (0.00 sec)
4.1.4 slave上执行start slave,观察主从异常,提示"but the master has purged binary logs containing GTIDs that the slave requires"
sql
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 192.168.10.129
Master_User: repl
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.000050
Read_Master_Log_Pos: 194
Relay_Log_File: mha3-relay-bin.000021
Relay_Log_Pos: 407
Relay_Master_Log_File: mysql-bin.000050
Slave_IO_Running: No
Slave_SQL_Running: Yes
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
修复故障
由于master上这部分binlog已经被清理掉了,所以这部分数据已经无法同步到从库。 及时恢复了主从同步状态,这部分数据也是丢失的; 工作的时候遇到这种情况有两种处理方法
4.2.1 master上根据select @@gtid_purged;可以查看主库GTID执行了1-27
sql
mysql> select @@gtid_purged;
+-------------------------------------------+
| @@gtid_purged |
+-------------------------------------------+
| 4855c186-ba02-11ef-8bf4-000c297511da:1-27 |
+-------------------------------------------+
从库show slave status命令可以看到Retrieved_Gtid_Set 和Executed_Gtid_Set 中显示从库已经获取并执行了GTID:1-20
sql
Retrieved_Gtid_Set: 4855c186-ba02-11ef-8bf4-000c297511da:6-20
Executed_Gtid_Set: 090b3bdd-ba02-11ef-9ce1-000c292443d0:1-4,
4855c186-ba02-11ef-8bf4-000c297511da:1-20
证明中间有21-27,一共7个事务丢失;
4.2.2 尝试将gtid_purged设置为1-27
sql
mysql> set global gtid_purged = '4855c186-ba02-11ef-8bf4-000c297511da:1-27';
ERROR 1840 (HY000): @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.
mysql>
mysql> reset slve^C
mysql> set @@GLOBAL.GTID_EXECUTED = '';
ERROR 1238 (HY000): Variable 'gtid_executed' is a read only variable
发现无法设置,我们可以reset master,清空从库的binlog和gtid_executed
sql
mysql> reset master;
Query OK, 0 rows affected (0.00 sec)
mysql> set global gtid_purged='4855c186-ba02-11ef-8bf4-000c297511da:1-27';
Query OK, 0 rows affected (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.129
Master_User: repl
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.000058
Read_Master_Log_Pos: 194
Relay_Log_File: mha3-relay-bin.000022
Relay_Log_Pos: 407
Relay_Master_Log_File: mysql-bin.000058
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
4.2.3 虽然此时同步恢复正常,但是主从是不一致的,后面执行其他事务的时候是正常的,但是当执行到丢失的7个事务中的数据的时候还是会报错
sql
# master
mysql> select count(1) from hero3;
+----------+
| count(1) |
+----------+
| 448 |
+----------+
#slave
mysql> select count(1) from hero3;
+----------+
| count(1) |
+----------+
| 208 |
+----------+
1 row in set (0.00 sec)
后面等异常的时候可以按照上三种情况进行修复,发现一例修复一例;
4.2.4 我们不知道上面7个事务究竟影响了多少数据量,如果条件允许,可以从主库上dump一份全量数据,重新搭建一个新的从库,这样可以保证数据完全一致,不用每次都手动修复