mysql主库delete一个没主键的表导致从库延迟很久问题处理

一问题描述

发现线上环境一个从库出现延迟，延迟了2天了，还没追上主库。

查看当前运行的sql及事务，发现这个sql语句是在delete一个没主键的表。

二问题模拟

这里在测试环境复现下这个问题。

2.1 在主库造数据

use baidd;

CREATE TABLE test2(id INT primary key AUTO_INCREMENT);

INSERT INTO test2(id) VALUES(1),(2),(3),(4);

INSERT INTO test2(id) SELECT id+(SELECT COUNT(*) FROM test2) FROM test2;

......

多次运行如上sql。

2.2 在主库删除这个表的主键

alter table test2 modify id int;

alter table test2 drop primary key;

2.3 在主库清空这个表数据

#待这两百万条数据同步到从库后，在主库上删除这两百万条数据

delete from test2;

三问题排查

3.1 在从库观察主从复制

#发现五分钟了从库还没同步删除掉，查看主从状态，发现复制位点很久都不变

复制代码

root@localhost [(none)]>show slave status \G;

*************************** 1. row ***************************

               Slave_IO_State: Waiting for source to send event

                  Master_Host: 10.106.210.206

                  Master_User: repl

                  Master_Port: 14728

                Connect_Retry: 60

              Master_Log_File: mysql-bin.000025

          Read_Master_Log_Pos: 84258028

               Relay_Log_File: host01-relay-bin.000004

                Relay_Log_Pos: 63196660

        Relay_Master_Log_File: mysql-bin.000025

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 63196524

              Relay_Log_Space: 84258422

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 96

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 210206

                  Master_UUID: 408abbb7-dc1b-11ee-a91f-fefcfee1f844

             Master_Info_File: mysql.slave_master_info

                    SQL_Delay: 0

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates

           Master_Retry_Count: 86400

                  Master_Bind:

      Last_IO_Error_Timestamp:

     Last_SQL_Error_Timestamp:

               Master_SSL_Crl:

           Master_SSL_Crlpath:

           Retrieved_Gtid_Set:

            Executed_Gtid_Set: 408abbb7-dc1b-11ee-a91f-fefcfee1f844:1-1120,

b5896746-dc1a-11ee-b573-fefcfee443b5:1-1391

                Auto_Position: 0

         Replicate_Rewrite_DB:

                 Channel_Name:

           Master_TLS_Version:

       Master_public_key_path:

        Get_master_public_key: 0

            Network_Namespace:

1 row in set, 1 warning (0.00 sec)

ERROR:

No query specified

注意这里：

Relay_Log_File: host01-relay-bin.000004

Relay_Log_Pos: 63196660

3.2 查看当前运行的sql

#查看当前运行的事务

select * from information_schema.INNODB_TRX;

发现trx_query为空，以前线上从库出现延迟，能看到当前运行的sql的，不知为何这次看不到。

#查看该事务对应的进程

select * from information_schema.PROCESSLIST where id in(select trx_mysql_thread_id from information_schema.INNODB_TRX);

Info也是为空。

select * from `performance_schema`.threads where processlist_id =

(

select id from information_schema.PROCESSLIST where id in(select trx_mysql_thread_id from information_schema.INNODB_TRX)

);

3.3 查看binlog里对应的sql

用3.2命令查不到导致延迟的sql内容，只能去binlog里查看了。

查看从库中继日志里延迟时卡住的位置对应的sql。

上面3.1那里show slave status：

Relay_Log_File: host01-relay-bin.000004

Relay_Log_Pos: 63196660

mysqlbinlog -v host01-relay-bin.000004 > 000004.log

less 000004.log

搜索63196660

可以看到这个位置，在操作baidd.t2这个表：

#240427 17:47:17 server id 210206 end_log_pos 63196737 CRC32 0x71c38d61 Table_map: `baidd`.`test2` mapped to number 107

再一直往下翻，看到有很多对应的delete语句：

3.4 在从库查看捕获的慢sql有没有主键

show indexes from baidd.test2;

如果没有主键，就需要先在从库上为这个表建下主键，先解决这个延迟的问题。

四着手处理

思路：

先在从库杀掉这个慢的delete的sql
在从库加主键，然后在从库delete就很快了
在主库加主键，从库报主键冲突时，跳过这个错误，继续同步即可

4.1 在从库停掉主从复制

stop slave;

发现2分钟都stop不掉，因为当前有慢事务在运行。

4.2 在从库杀掉这个慢事务

#查看慢事务对应的进程id，注意别多杀了，只杀运行时间很久的，导致延迟的sql

select * from information_schema.PROCESSLIST where id in(select trx_mysql_thread_id from information_schema.INNODB_TRX);

这里是kill 55588

Kill完后，stop slave就能很快执行完了。

4.3 在从库上给这个表加主键

set sql_log_bin=0;

SET GLOBAL super_read_only=off;

show variables like '%super_read_only%';

alter table test2 add primary key(id);

SET GLOBAL super_read_only=on;

4.4 开启同步

start slave;

set sql_log_bin=1;

发现很快就同步完了。

4.5 在主库上建下主键，避免以后再出现这个问题

alter table test2 add primary key(id);

4.6 在从库跳过这个建主键的事务

因为上面已经在从库建过主键了，所以从库复制会停止，提示主键冲突，可以通过跳过这个事务来处理。

4.6.1 查看从库复制状态

root@localhost [(none)]>show slave status \G;

复制代码

*************************** 1. row ***************************

               Slave_IO_State: Waiting for source to send event

                  Master_Host: 10.106.210.206

                  Master_User: repl

                  Master_Port: 14728

                Connect_Retry: 60

              Master_Log_File: mysql-bin.000025

          Read_Master_Log_Pos: 84259101

               Relay_Log_File: host01-relay-bin.000005

                Relay_Log_Pos: 1195

        Relay_Master_Log_File: mysql-bin.000025

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1068

                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'ANONYMOUS' at source log mysql-bin.000025, end_log_pos 84259101. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 84258897

              Relay_Log_Space: 84259617

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1068

               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'ANONYMOUS' at source log mysql-bin.000025, end_log_pos 84259101. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 210206

                  Master_UUID: 408abbb7-dc1b-11ee-a91f-fefcfee1f844

             Master_Info_File: mysql.slave_master_info

                    SQL_Delay: 0

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State:

           Master_Retry_Count: 86400

                  Master_Bind:

      Last_IO_Error_Timestamp:

     Last_SQL_Error_Timestamp: 240427 18:04:40

               Master_SSL_Crl:

           Master_SSL_Crlpath:

           Retrieved_Gtid_Set:

            Executed_Gtid_Set: 408abbb7-dc1b-11ee-a91f-fefcfee1f844:1-1120,

b5896746-dc1a-11ee-b573-fefcfee443b5:1-1391

                Auto_Position: 0

         Replicate_Rewrite_DB:

                 Channel_Name:

           Master_TLS_Version:

       Master_public_key_path:

        Get_master_public_key: 0

            Network_Namespace:

1 row in set, 1 warning (0.00 sec)

注意看这几个：

#查看具体报错

select * from performance_schema.replication_applier_status_by_worker

LAST_ERROR_MESSAGE显示报错：

Worker 1 failed executing transaction 'ANONYMOUS' at source log mysql-bin.000025, end_log_pos 84259101; Error 'Multiple primary key defined' on query. Default database: 'baidd'. Query: 'alter table test2 add primary key(id)'

4.6.2 在从库跳过这个事务

4.6.2.1 针对未开启GTID的从库

show variables like '%gtid_mode%'; #off表示未开启gtid

只需要在从库执行以下命令就能跳过一个事务。

set sql_log_bin=0;

set global sql_slave_skip_counter=1;

4.6.2.2 针对开启了GTID的从库

思路：

将gtid_next设置为自己环境Executed_Gtid_Set的下一个位置，然后开始一个空事务。

示例：

set sql_log_bin=0;

set gtid_next='408abbb7-dc1b-11ee-a91f-fefcfee1f844:1121';

begin;

commit;

set gtid_next='automatic';

4.6.3 开启同步

start slave;

set sql_log_bin=1;

4.6.4 验证主从同步状态

show slave status \G;

确保复制没问题。

五教训

表没有主键的时候，delete很多数据，会导致从库出现很长时间延迟，因此需要严格把控，定期检查，确保主库上不存在没有主键的表。

mysql主库delete一个没主键的表导致从库延迟很久问题处理

一 问题描述

二 问题模拟

2.1 在主库造数据

2.2 在主库删除这个表的主键

2.3 在主库清空这个表数据

三 问题排查

3.1 在从库观察主从复制

3.2 查看当前运行的sql

3.3 查看binlog里对应的sql

3.4 在从库查看捕获的慢sql有没有主键

四 着手处理

4.1 在从库停掉主从复制

4.2 在从库杀掉这个慢事务

4.3 在 从库上给这个表加主键

4.4 开启同步

4.5 在主库上建下主键，避免以后再出现这个问题

4.6 在从库跳过这个建主键的事务

4.6.1 查看从库复制状态

4.6.2 在从库跳过这个事务

4.6.2.1 针对未开启GTID的从库

4.6.2.2 针对开启了GTID的从库

4.6.3 开启同步

4.6.4 验证主从同步状态

五 教训

一问题描述

二问题模拟

三问题排查

四着手处理

4.3 在从库上给这个表加主键

五教训