文章目录
背景
在巡检过程中发现prometheus当前存在一条mysql sql thread down告警,如下图所示。从告警名称看,当前mysql从库已经存在sql线程宕机问题,需要立刻处置。具体处置流程见下方所示
错误现象: io线程为Yes sql线程为No,同时从库会停止复制进程,不再进行主从同步,只有等到用户自行来处理完成后才会继续组从同步。
MySQL错误代码含义
yaml
一些error code代表的错误如下:
1007: 数据库已存在,创建数据库失败
1008: 数据库不存在,删除数据库失败
1050: 数据表已存在,创建数据表失败
1051: 数据表不存在,删除数据表失败
1054: 字段不存在,或程序文件跟数据库有冲突
1060: 字段重复,导致无法插入
1061: 重复键名
1068: 定义了多个主键
1094: 位置线程ID
1146: 数据表缺失,请恢复数据库
1053: 复制过程中主服务器宕机
1062: 主键冲突 Duplicate entry '%s' for key %d
问题排查过程
排查过程:
sql
1、登录数据库,执行 show slave status\G; 命令
2、执行后,相关信息如下所示
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 192.163.xx.xx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.002037
Read_Master_Log_Pos: 473170850
Relay_Log_File: mysqld-relay-bin.005393
Relay_Log_Pos: 345276015
Relay_Master_Log_File: mysql-bin.001811
Slave_IO_Running: YES
Slave_SQL_Running: No #正常应该为yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 345275842
Relay_Log_Space: 124217681926
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. #这个就是错误信息
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2233306
Master_UUID: 4edc6608-394c-11eb-a4d2-fa163efd1323
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 240904 07:27:25
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 4edc6608-394c-11eb-a4d2-fa163efd1323:27928-202465884
Executed_Gtid_Set: 4edc6608-394c-11eb-a4d2-fa163efd1323:1-193750566,
5b42d742-394c-11eb-991e-fa163e6fbaa6:1-77
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
ERROR:
No query specified
yaml
3、对该报错进行翻译并查询报错含义
Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any
整体意思如下:
协调器已停止,因为工作器中出现错误。
最近的失败是:工作器1在主日志mysql-bin.001811,end_log_pos345391371中执行事务
'4edc6608-394c-11eb-a4d2-fa163efd1323: 193750567'失败。
有关此失败或其他失败的更多详细信息,请参阅错误日志和/或performance_schemareplication_applier_status_by_worker表
sql
4、根据报错信息,我们查看performance_schemareplication_applier_status_by_worker表信息,执行 SELECT * FROM performance_schema.replication_applier_status_by_worker\G;返回如下所示
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193750567
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371; Could not execute Write_rows event on table yz_work.ucms_dict; Duplicate entry '135' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001811, end_log_pos 345391371
LAST_ERROR_TIMESTAMP: 2024-09-04 07:27:25
*************************** 2. row ***************************
CHANNEL_NAME:
WORKER_ID: 2
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193750565
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 3. row ***************************
CHANNEL_NAME:
WORKER_ID: 3
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193750536
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 4. row ***************************
CHANNEL_NAME:
WORKER_ID: 4
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4edc6608-394c-11eb-a4d2-fa163efd1323:193743275
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
4 rows in set (0.00 sec)
ERROR:
No query specified
yaml
5、根据select查出来的结果,我们发现了详细的报错信息
<Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567'
at master log mysql-bin.001811, end_log_pos 345391371; Could not execute Write_rows event
on table yz_work.ucms_dict; Duplicate entry '135' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001811, end_log_pos
345391371>
翻译为中文如下所示:
Worker 1在主日志mysql-bin.001811,end_log_pos345391371执行事务
'4edc6608-394c-11eb-a4d2-fa163efd1323: 193750567'失败;
无法在表yz_work上执行Write_rows事件ucms_dict;密钥'PRIMARY'重复条目'135',
Error_code:1062;处理程序错误HA_ERR_FOUND_DUPP_KEY;事件的主日志mysql-bin.001811,end_log_pos345391371
从这条报错信息,我们找到了问题根因,是因为插入重复主键导致从库不工作了,即主键冲突错误。
问题处理过程
第一种解决思路
直接跳过错误执行语句
sql
1、MySQL 传统的主从复制方式使用 master_log_files 和 master_log_pos 两个参数来确定复制位点。
当出现复制错误时,可以设置跳过出错的事务来恢复同步,MySQL 提供了 sql_slave_skip_counter 参数来实现此功能。使用方法如下:
root@(none) >stop slave;
Query OK, 0 rows affected (0.00 sec)
root@(none) >SET GLOBAL SQL_SLAVE_SKIP_COUNTER = N; #跳过N个事务
Query OK, 0 rows affected (0.00 sec)
root@(none) >start slave;
Query OK, 0 rows affected, 1 warning (0.03 sec)
sql
2、但是因为我们的数据库是基于GTID来做的主从复制,因此上述第一种方法不适用,执行会报错误,如下所示
mysql> stop slave;
Query OK, 0 rows affected (0.01 sec)
mysql> set GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
ERROR 1858 (HY000): sql_slave_skip_counter can not be set when the server is running with @@GLOBAL.GTID_MODE = ON. Instead, for each transaction that you want to skip, generate an empty transaction with the same GTID as the transaction
sql
3、基于GTID主从复制1062错误的真正的解决方法如下所示
解决方法: 在发生错误的 GTID 事务号上插入一个空的事务,来实现跳过此复制错误的效果。
解决过程:
mysql> stop slave;
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
...
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '4edc6608-394c-11eb-a4d2-fa163efd1323:193750567' at master log mysql-bin.001811, end_log_pos 345391371. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
mysql> SET GTID_NEXT='4edc6608-394c-11eb-a4d2-fa163efd1323:193750567'; #此处的gtid取自于上述Last_Error中的报错信息
Query OK, 0 rows affected (0.01 sec)
mysql> begin; #开启事务
Query OK, 0 rows affected (0.00 sec)
mysql> commit; #提交空事务
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT="AUTOMATIC"; #重新设置 GTID_NEXT 为自动获取
Query OK, 0 rows affected (0.00 sec)
mysql> start slave; #开启事务
Query OK, 0 rows affected (0.01 sec)
第二种解决思路
找到错误执行语句,修复从库数据
yaml
解决思路:
根据show slave status 中Last_Error中提示的master log和end_log_pos的位置查找该库上缺失的数据
解决方法
sql
0、根据报错信息,我们查看performance_schemareplication_applier_status_by_worker表信息,执行
SELECT * FROM performance_schema.replication_applier_status_by_worker\G;
它会返回详细的错误信息,包含库名、表名等信息
1、在主库中查看表结构,确定主键,并确认是否存在对应信息
mysql >
......
PRIMARY KEY (`id`) #主键信息
) ENGINE=InnoDB AUTO_INCREMENT=136 DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='字典数据表' |
mysql> select * from ucms_dict where id=135;
+-----+--------------+---------------+--------------+------------+-----------+-----------+------------+---------------------+------------+---------------------+-------------+--------+
| id | dict_name | dict_code | dict_enrty | dict_value | dict_sort | is_enable | created_by | created_time | updated_by | updated_time | delete_flag | remark |
+-----+--------------+---------------+--------------+------------+-----------+-----------+------------+---------------------+------------+---------------------+-------------+--------+
| 135 | 问题性质 | problemNature | 企业动态 | 6 | 6 | 0 | | 2024-07-16 18:49:11 | | 2024-07-16 18:49:11 | 0 | NULL |
+-----+--------------+---------------+--------------+------------+-----------+-----------+------------+---------------------+------------+---------------------+-------------+--------+
1 row in set (0.00 sec)
2、在从库的binlog日志中解析last error报错的master log mysql-bin.001811, end_log_pos
[root@db-2 binlog]# mysqlbinlog --no-defaults -v --base64-output=decode-rows --stop-position=345391371(取自于show slave status) /export/servers/data/my3306/binlog/mysql-bin.001811(取自于show slave status) | tail -500
# at 345375314
#240716 8:00:02 server id 2233306 end_log_pos 345383128 CRC32 0x225674ce Delete_rows: table id 2984
# at 345383128
#240716 8:00:02 server id 2233306 end_log_pos 345391282 CRC32 0x3746f8aa Delete_rows: table id 2984
# at 345391282
#240716 8:00:02 server id 2233306 end_log_pos 345399443 CRC32 0x7cb6e901 Delete_rows: table id 2984
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
3、确定从库二进制日志中记录的插入数据和主库、从库里的数据是不是相同,如果相同可以删除从库的数据来解决,否则可以考虑跳过这个事务来解决,即使用第一种思路解决
4、执行删除操作
mysql> stop slave;
mysql> delete from jaat.test where id=135; #此处的id取决于SELECT * FROM performance_schema.replication_applier_status_by_worker\G中的Duplicate entry '135'
mysql> start slave;
mysql> show slave status\G;
修复结果演示
sql
执行完修复操作后,再次查看从库状态
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.161.3.223
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.002037
Read_Master_Log_Pos: 495785235
Relay_Log_File: mysqld-relay-bin.005396
Relay_Log_Pos: 224060624
Relay_Master_Log_File: mysql-bin.001812
Slave_IO_Running: Yes
Slave_SQL_Running: Yes #正常