GTID下复制问题和解决

环境介绍

数据库1主2从，mysql版本是v5.19

表结构

一、主库新增记录，从库提示主键冲突

模拟故障

1， master上关闭 sql_log_bin,删除id =103 后打开

2，确认此时从库有id=103,主库没有

3， master insert id=103 主从异常

sql 复制代码

             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table test.hero3; Duplicate entry '103' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000045, end_log_pos 3056

修复故障

根据报错test.hero3; Duplicate entry '103' for key 'PRIMARY，去slave中找到对应记录，删除即可

sql 复制代码

mysql> delete from test.hero3 where id=103;
Query OK, 1 row affected (0.00 sec)

mysql> stop slave;start slave;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.129
                  Master_User: repl
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000046
          Read_Master_Log_Pos: 194
               Relay_Log_File: mha3-relay-bin.000012
                Relay_Log_Pos: 407
        Relay_Master_Log_File: mysql-bin.000046
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

二、主库UPDATE，从库找不到记录

模拟故障

1.1， slave上将id=1 的记录delete

1.2， master上执行update xxx where id = 1;

1.3, show slave status\G查看主从同步异常

sql 复制代码

# slave:
mysql> delete from hero3 where id = 1;
Query OK, 1 row affected (0.01 sec)

# master
mysql> update hero3 set age = 1 where id = 1;
Query OK, 1 row affected (0.03 sec)
Rows matched: 1  Changed: 1  Warnings: 0

# slave;
mysql>  show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.129
                  Master_User: repl
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000045
          Read_Master_Log_Pos: 2819
               Relay_Log_File: mha2-relay-bin.000009
                Relay_Log_Pos: 866
        Relay_Master_Log_File: mysql-bin.000045
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table test.hero3; Can't find record in 'hero3', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000045, end_log_pos 2788

修复故障

从库中找到对应的记录，在slave上将该记录插入;

1，根据同步异常错误mysql-bin.000045, end_log_pos 2788，去主库对应sql

2， mysqlbinlog --no-defaults -vv --base64-output=decode-rows mysql-bin.000045 | grep -A 20 'end_log_pos 2788' 命令查找执行的sql

sql 复制代码

[root@mha1 mysql]# mysqlbinlog --no-defaults -vv --base64-output=decode-rows mysql-bin.000045 | grep -A 20 'end_log_pos 2788'
#241222 10:38:37 server id 10129  end_log_pos 2788 CRC32 0x1bfc92fc     Update_rows: table id 224 flags: STMT_END_F
### UPDATE `test`.`hero3`
### WHERE
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='mao1' /* VARSTRING(60) meta=60 nullable=0 is_null=0 */
###   @3=110 /* INT meta=0 nullable=0 is_null=0 */
### SET
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='mao1' /* VARSTRING(60) meta=60 nullable=0 is_null=0 */
###   @3=1 /* INT meta=0 nullable=0 is_null=0 */

3，根据 UPDATE `test`.`hero3` WHERE @1=1 @2='mao1' @3=110 拼接insert并插入slave

sql 复制代码

mysql>    insert into `test`.`hero3` values (1,'mao1',110);
Query OK, 1 row affected (0.03 sec)

mysql> stop slave; start slave;
Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.129
                  Master_User: repl
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000045
          Read_Master_Log_Pos: 2819
               Relay_Log_File: mha2-relay-bin.000010
                Relay_Log_Pos: 454
        Relay_Master_Log_File: mysql-bin.000045
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

三、主库DELETE，从库找不到记录

模拟故障

主库在关闭binlog的情况下insert一条数据，然后打开binlog并删除该数据

sql 复制代码

mysql> set sql_log_bin = 0; 
Query OK, 0 rows affected (0.00 sec)

mysql> insert into hero3(name,age) values('delete_test',22);
Query OK, 1 row affected (0.01 sec)

mysql> set sql_log_bin = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from hero3 where id >= 102;
+-----+-------------+-----+
| id  | name        | age |
+-----+-------------+-----+
| 102 | mao2        | 111 |
| 103 | mao2        | 109 |
| 104 | delete_test |  22 |
+-----+-------------+-----+
3 rows in set (0.00 sec)

mysql> delete from hero3 where id = 104;
Query OK, 1 row affected (0.03 sec)

此时从库报错信息如下

sql 复制代码

             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Could not execute Delete_rows event on table test.hero3; Can't find record in 'hero3', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000045, end_log_pos 2092

修复故障

由于在执行delete的时候，slave找不到该记录，那么我们就可以跳过该事务

Retrieved_Gtid_Set 表示slave已经收到的gtid集合，Executed_Gtid_Set表示slave已经执行的gtid集合，从中可以看到 4855c186-ba02-11ef-8bf4-000c297511da:11没有执行

由于删除的数据在slave中找不到，所以我们就可以跳过该事务；

sql 复制代码

mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)

mysql> set gtid_next = '4855c186-ba02-11ef-8bf4-000c297511da:11'; #设定下一个要执行的GTID:11
Query OK, 0 rows affected (0.02 sec)

mysql> begin;commit; #产生一个空事务
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> set gtid_next='AUTOMATIC'; # 将next设为自动
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.02 sec)

mysql> show slave status\G      # 可以看到主从状态已恢复                                                                                                                                             *************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.129
                  Master_User: repl
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000045
          Read_Master_Log_Pos: 2123
               Relay_Log_File: mha3-relay-bin.000008
                Relay_Log_Pos: 454
        Relay_Master_Log_File: mysql-bin.000045
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

四、主库日志被purged，主从同步失败

模拟故障

4.1.1 stop slave断开主从同步

4.1.2 主库上执行一些事务，使GTID持续增长

sql 复制代码

mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 28 rows affected (0.00 sec)
Records: 28  Duplicates: 0  Warnings: 0

mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 56 rows affected (0.00 sec)
Records: 56  Duplicates: 0  Warnings: 0

mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 112 rows affected (0.00 sec)
Records: 112  Duplicates: 0  Warnings: 0

mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into hero3(name,age) select name,age from hero3;
Query OK, 224 rows affected (0.00 sec)
Records: 224  Duplicates: 0  Warnings: 0

mysql> show master logs;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000053 |       734 |
| mysql-bin.000054 |       972 |
| mysql-bin.000055 |      1448 |
| mysql-bin.000056 |      2400 |
| mysql-bin.000057 |      4257 |
+------------------+-----------+

4.1.3 清理binlog，造成master上事务缺失

sql 复制代码

mysql> purge binary logs to 'mysql-bin.000058';
Query OK, 0 rows affected (0.00 sec)

mysql> show master logs;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000058 |       194 |
+------------------+-----------+
1 row in set (0.00 sec)

4.1.4 slave上执行start slave，观察主从异常，提示"but the master has purged binary logs containing GTIDs that the slave requires"

sql 复制代码

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 192.168.10.129
                  Master_User: repl
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000050
          Read_Master_Log_Pos: 194
               Relay_Log_File: mha3-relay-bin.000021
                Relay_Log_Pos: 407
        Relay_Master_Log_File: mysql-bin.000050
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'

修复故障

由于master上这部分binlog已经被清理掉了，所以这部分数据已经无法同步到从库。及时恢复了主从同步状态，这部分数据也是丢失的；工作的时候遇到这种情况有两种处理方法

4.2.1 master上根据select @@gtid_purged;可以查看主库GTID执行了1-27

sql 复制代码

mysql> select @@gtid_purged;
+-------------------------------------------+
| @@gtid_purged                             |
+-------------------------------------------+
| 4855c186-ba02-11ef-8bf4-000c297511da:1-27 |
+-------------------------------------------+

从库show slave status命令可以看到Retrieved_Gtid_Set 和Executed_Gtid_Set 中显示从库已经获取并执行了GTID：1-20

sql 复制代码

           Retrieved_Gtid_Set: 4855c186-ba02-11ef-8bf4-000c297511da:6-20
            Executed_Gtid_Set: 090b3bdd-ba02-11ef-9ce1-000c292443d0:1-4,
4855c186-ba02-11ef-8bf4-000c297511da:1-20

证明中间有21-27，一共7个事务丢失；

4.2.2 尝试将gtid_purged设置为1-27

sql 复制代码

mysql>  set global gtid_purged = '4855c186-ba02-11ef-8bf4-000c297511da:1-27';
ERROR 1840 (HY000): @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.
mysql> 
mysql> reset slve^C
mysql> set @@GLOBAL.GTID_EXECUTED = '';
ERROR 1238 (HY000): Variable 'gtid_executed' is a read only variable

发现无法设置，我们可以reset master，清空从库的binlog和gtid_executed

sql 复制代码

mysql> reset master;
Query OK, 0 rows affected (0.00 sec)

mysql> set global gtid_purged='4855c186-ba02-11ef-8bf4-000c297511da:1-27';
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.129
                  Master_User: repl
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000058
          Read_Master_Log_Pos: 194
               Relay_Log_File: mha3-relay-bin.000022
                Relay_Log_Pos: 407
        Relay_Master_Log_File: mysql-bin.000058
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

4.2.3 虽然此时同步恢复正常，但是主从是不一致的，后面执行其他事务的时候是正常的，但是当执行到丢失的7个事务中的数据的时候还是会报错

sql 复制代码

# master
mysql>  select count(1) from hero3;
+----------+
| count(1) |
+----------+
|      448 |
+----------+
#slave
mysql>   select count(1) from hero3;
+----------+
| count(1) |
+----------+
|      208 |
+----------+
1 row in set (0.00 sec)

后面等异常的时候可以按照上三种情况进行修复，发现一例修复一例；

4.2.4 我们不知道上面7个事务究竟影响了多少数据量，如果条件允许，可以从主库上dump一份全量数据，重新搭建一个新的从库，这样可以保证数据完全一致，不用每次都手动修复