Table of Contents
- 1 Set up a one-master, multi-slave topology (semi-sync mode)
- 2 Deploy MHA
  - 2.1 Install MHA environment dependencies
  - 2.2 Install dependencies on the MySQL nodes
  - 2.3 Create the MySQL monitoring user mha
  - 2.4 Write the MHA configuration file
- 3 MHA failover
  - 3.1 Manual switchover with the master alive
    - Side quest (fixing delayed replication)
      - Fix
        - Stop delayed replication
        - Method 1: skip the error on the slaves (recommended)
        - Method 2: skip error 1396 in the config file (persistent)
        - Method 3: rebuild the replication relationship (most thorough)
        - Method 4: check user consistency
      - Retest
  - 3.2 Manual failover with the master dead
  - 3.3 Automatic failover when the master dies
- 4 VIP implementation
1 Set up a one-master, multi-slave topology (semi-sync mode)
🔭 Features MHA does not support or conflicts with:
delayed replication, MGR, and dual-/multi-master topologies are unsupported; MySQL Router partially conflicts
This lab builds on the earlier replication lab, so first turn off delayed replication:
bash
[root@mysql-node3 ~]# mysql -p123 -e "stop replica;change replication source to source_delay=0;start replica;show slave status\G" |grep -i sql_delay
SQL_Delay: 0
2 Deploy MHA
2.1 Install MHA environment dependencies
After installing, run masterha_manager --version; it will tell you which Perl modules are still missing
List the required dependencies with rpm -qpR mha4mysql-node-0.58-0.el7.centos.noarch.rpm and rpm -qpR mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
You can also resolve the dependencies over the network with cpan
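Before reaching for cpan, a quick `perl -M` probe loop shows which of those Perl modules are already present; this is a small self-contained sketch (the module list mirrors the cpan installs used later in this section, and should be extended as needed):

```shell
# Probe for the Perl modules MHA wants; prints OK or MISS per module.
# (Module list mirrors the cpan installs in this section; extend as needed.)
for m in Config::Tiny Log::Dispatch Mail::Sender Parallel::ForkManager; do
  if perl -M"$m" -e 1 2>/dev/null; then
    echo "OK   $m"
  else
    echo "MISS $m"
  fi
done
```

Any module flagged MISS is a candidate for `cpan` installation in the session below.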
bash
# MHA-M:172.25.254.40;mysql-node1(master):172.25.254.10;mysql-node2(slave):172.25.254.20;mysql-node3(slave):172.25.254.30;
# MHA-M
[root@MHA-M ~]# vim /etc/hosts
1 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
2 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
3 172.25.254.40 MHA-M
4 172.25.254.10 mysql-node1
5 172.25.254.20 mysql-node2
6 172.25.254.30 mysql-node3
[root@MHA-M ~]# for i in {10,20,30};do scp -q /etc/hosts root@172.25.254.$i:/etc/hosts;done
# Install the Perl environment (MHA manager node)
[root@MHA-M ~]# dnf install perl perl-DBD-MySQL perl-CPAN perl-English -yq
# Use cpan to install the remaining MHA dependencies
[root@MHA-M ~]# cpan
..................
Would you like to configure as much as possible automatically? [yes] yes
..................
Perl site library directory "/usr/local/share/perl5/5.32" does not exist.
Perl site library directory "/usr/local/share/perl5/5.32" created.
Perl site library directory "/usr/local/lib64/perl5/5.32" does not exist.
Perl site library directory "/usr/local/lib64/perl5/5.32" created.
..................
cpan[1]> install Config::Tiny
cpan[2]> install Log::Dispatch
cpan[3]> install Mail::Sender
# For the blank prompts just press Enter; configure manually later if needed
Specify defaults for Mail::Sender? (y/N) y
Default SMTP server (hostname or IP address)
: # 回车
*********************************************************************
Default FROM value (must be perl code / ENTER for none):
:
*********************************************************************
Default for REPLY-TO field (must be perl code / ENTER for none):
:
*********************************************************************
Default for CC field (must be perl code / ENTER for none):
:
Default for BCC field (must be perl code / ENTER for none):
:
Default name of the client MACHINE used when connecting
to the SMTP server (must be perl code / ENTER for none):
:
*********************************************************************
Default additional headers (must be perl code / ENTER for none):
:
*********************************************************************
Default encoding of message bodies (N)one, (Q)uoted-printable, (B)ase64:
: n
*********************************************************************
Default charset of message bodies (must be perl code / ENTER for none):
:
cpan[4]> install Parallel::ForkManager
cpan[5]> exit
[root@MHA-M ~]# unzip MHA-7.zip
# The node package must be installed as well; this is a design flaw of MHA 0.58
[root@MHA-M ~]# rpm -ivh MHA-7/mha4mysql-manager-0.58-0.el7.centos.noarch.rpm MHA-7/mha4mysql-node-0.58-0.el7.centos.noarch.rpm --nodeps
[root@MHA-M ~]# masterha_manager --version
masterha_manager version 0.58.
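After pushing /etc/hosts to all nodes earlier in this section, it is worth confirming that every hostname actually resolves before SSH trust and the MHA checks start depending on it; `getent hosts` consults /etc/hosts directly. A minimal sketch:

```shell
# Verify that all cluster hostnames resolve (getent consults /etc/hosts);
# prints the resolved entry, or flags the name as MISSING.
for h in MHA-M mysql-node1 mysql-node2 mysql-node3; do
  getent hosts "$h" || echo "MISSING $h"
done
```

On a correctly prepared node all four names print their 172.25.254.x entries; any MISSING line means the hosts file was not distributed.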
2.2 Install dependencies on the MySQL nodes
bash
# Install the MHA node prerequisites (all MySQL nodes)
[root@mysql-node2 ~]# yum install -y perl perl-English perl-Time-HiRes perl-DBI perl-DBD-MySQL perl-File-Copy perl-core
# Install the MHA node package on each MySQL node
[root@MHA-M ~]# for i in {10,20,30};do
> scp MHA-7/mha4mysql-node-0.58-0.el7.centos.noarch.rpm root@172.25.254.$i:/mnt
> ssh -l root 172.25.254.$i "rpm -ivh /mnt/mha4mysql-node-0.58-0.el7.centos.noarch.rpm --nodeps"
> done
# Fix the Perl sprintf bug in MHA 0.58 (on every node MHA manages, i.e. all MySQL nodes)
[root@mysql-node1 ~]# vim /usr/share/perl5/vendor_perl/MHA/NodeUtil.pm
199 #sub parse_mysql_major_version($) {
200 # my $str = shift;
201 # my $result = sprintf( '%03d%03d', $str =~ m/(\d+)/g );
202 # return $result;
203 #}
204
205 sub parse_mysql_major_version($) {
206 my $str = shift;
207 my @nums = $str =~ m/(\d+)/g;
208 my $result = sprintf( '%03d%03d', $nums[0]//0, $nums[1]//0);
209 return $result;
210 }
# scp the patched file to the other nodes
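To see what the patched parse_mysql_major_version produces, here is the same transformation re-done in plain shell against a sample version string (an illustration only, not part of MHA): the first two numeric fields are packed as zero-padded three-digit groups, which is the value the patched Perl code returns for version comparison.

```shell
# Shell re-implementation of the patched parse_mysql_major_version:
# take the first two numbers in the version string and pack them as
# zero-padded three-digit groups, e.g. "8.0.36" -> 008000.
ver="8.0.36"
major=$(printf '%s\n' "$ver" | grep -oE '[0-9]+' | sed -n 1p)
minor=$(printf '%s\n' "$ver" | grep -oE '[0-9]+' | sed -n 2p)
printf '%03d%03d\n' "${major:-0}" "${minor:-0}"
# prints 008000
```

The original buggy sub passed the whole match list to sprintf; the patch keeps only the first two numbers, which is what the shell version mimics.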
2.3 Create the MySQL monitoring user mha
bash
# Create a remote login account for MHA: the MySQL monitoring user mha
[root@MHA-M ~]# dnf install mysql-server -yq
[root@MHA-M ~]# grep mysql /etc/passwd
mysql:x:27:27:MySQL Server:/var/lib/mysql:/sbin/nologin
[root@MHA-M ~]# mkdir /data/mysql
[root@MHA-M ~]# chown mysql.mysql /data/mysql
[root@MHA-M ~]# vim /etc/my.cnf.d/mysql-server.cnf
[root@MHA-M ~]# tail -n6 /etc/my.cnf.d/mysql-server.cnf
[mysqld]
server-id=40
datadir=/data/mysql
socket=/data/mysql/mysql.sock
log-error=/var/log/mysql/mysqld.log
pid-file=/run/mysqld/mysqld.pid
[root@MHA-M ~]# vim /etc/my.cnf.d/client.cnf
7 [client]
8 socket=/data/mysql/mysql.sock
[root@MHA-M ~]# systemctl enable --now mysqld
[root@MHA-M ~]# grep 'temporary password' /var/log/mysql/mysqld.log
2026-03-05T01:56:48.550596Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: g)kUyiLTJ1=&
[root@MHA-M ~]# mysql_secure_installation
secure enough. Would you like to setup VALIDATE PASSWORD component?
Press y|Y for Yes, any other key for No: no
Please set the password for root here.
New password:
Re-enter new password:
Remove anonymous users? (Press y|Y for Yes, any other key for No) : y
Disallow root login remotely? (Press y|Y for Yes, any other key for No) : y
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : y
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : y
All done!
[root@MHA-M ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
# Check user completeness on every node; this feeds the MHA configuration below
# Expected: MHA-M: mha user; node1: repl and mha users; node2: repl and mha users; node3: repl and mha users
[root@MHA-M ~]# mysql -p123 -t -e "select user,host from mysql.user" | grep mha
| mha | % |
[root@mysql-node1 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| repl | % |
[root@mysql-node2 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
[root@mysql-node3 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
# node1
[root@mysql-node1 ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
[root@mysql-node1 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| mha | % |
| repl | % |
# node2
[root@mysql-node2 ~]# mysql -p123 -e "create user repl@'%' identified with mysql_native_password by '123'; grant all on *.* to repl@'%';"
[root@mysql-node2 ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
[root@mysql-node2 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| mha | % |
| repl | % |
# node3
[root@mysql-node3 ~]# mysql -p123 -e "create user repl@'%' identified with mysql_native_password by '123'; grant all on *.* to repl@'%';"
[root@mysql-node3 ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
[root@mysql-node3 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| mha | % |
| repl | % |
2.4 Write the MHA configuration file
bash
# Generate a configuration file template on MHA-M, then edit it
[root@MHA-M ~]# mkdir -p /etc/masterha/
[root@MHA-M ~]# tar -zxf MHA-7/mha4mysql-manager-0.58.tar.gz
[root@MHA-M ~]# cat mha4mysql-manager-0.58/samples/conf/masterha_default.cnf mha4mysql-manager-0.58/samples/conf/app1.cnf > /etc/masterha/app1.cnf
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
[root@MHA-M ~]# cat /etc/masterha/app1.cnf
[server default]
# MySQL monitoring user
user=mha
password=123
# SSH user (root recommended, avoids permission problems)
ssh_user=root
# Replication user
repl_user=repl
repl_password=123
# Directories
master_binlog_dir=/data/mysql
remote_workdir=/tmp
manager_workdir=/etc/masterha
manager_log=/etc/masterha/mha.log
# Secondary check: gateway plus a slave (with more budget, use two independent monitor hosts). A later experiment simulates node1/node2 failures, so node3 does the secondary check here; in my lab the gateway kept reporting unreachable, so I dropped it (the ......_check_ssh command itself was fine)
secondary_check_script= masterha_secondary_check -s 172.25.254.30
# Monitoring interval
ping_interval=3
[server1]
hostname=172.25.254.10
candidate_master=1
check_repl_delay=0
[server2]
hostname=172.25.254.20
candidate_master=1
check_repl_delay=0
[server3]
hostname=172.25.254.30
no_master=1
# Verify the environment
[root@MHA-M ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
..................
Thu Mar 5 11:12:24 2026 - [info] All SSH connection tests passed successfully.
Use of uninitialized value in exit at /usr/bin/masterha_check_ssh line 44.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
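As a sanity check on the [serverN] layout, the host list MHA will see can be pulled out of the config with awk; the snippet below runs against a miniature copy of the file so it is self-contained (path and contents are demo values, not the real app1.cnf):

```shell
# Write a miniature app1.cnf and list each [serverN] section with its hostname.
cat > /tmp/app1_demo.cnf <<'EOF'
[server1]
hostname=172.25.254.10
candidate_master=1
[server3]
hostname=172.25.254.30
no_master=1
EOF
awk -F= '/^\[server/{sec=$0} /^hostname/{print sec, $2}' /tmp/app1_demo.cnf
# prints:
# [server1] 172.25.254.10
# [server3] 172.25.254.30
```

Pointing the same awk line at /etc/masterha/app1.cnf gives a one-glance view of which hosts are configured before running the heavier masterha_check_* tools.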
3 MHA failover
If an experiment fails, the quickest lab cleanup is to wipe and rebuild: delete the data directory, re-initialize the database (mysqld --initialize --user=mysql), start the service, and run mysql_secure_installation.
3.1 Manual switchover with the master alive
bash
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
..................
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 172.25.254.10(172.25.254.10:3306)? (YES/no): yes
..................
Starting master switch from 172.25.254.10(172.25.254.10:3306) to 172.25.254.20(172.25.254.20:3306)? (yes/NO): yes
..................
master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
..................
Thu Mar 5 12:14:21 2026 - [info] Switching master to 172.25.254.20(172.25.254.20:3306) completed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.10(172.25.254.10:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.30(172.25.254.30:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
MySQL Replication Health is NOT OK!
# The manual master switchover itself succeeded (the log shows "completed successfully"), but the replication chain broke afterwards. Error 1396 is usually user/privilege related and needs a manual fix.
Side quest (fixing delayed replication)
Because delayed replication had not been fully turned off before running MHA, the errors below appeared
bash
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.10(172.25.254.10:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.30(172.25.254.30:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Key points:
'0d44a8bc-176a-11f1-ae1e-000c2939010c:1' is the transaction that broke replication. It has to be skipped on each of the two slaves (10 and 30).
| Node | Error | Analysis |
|---|---|---|
| 172.25.254.10 | SQL thread stopped | After the old master became a slave, its replication thread stopped |
| 172.25.254.30 | SQL thread stopped | The other slave stopped replicating too |
| Error code | 1396 | User/privilege-related error |
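The failing GTID can be pulled straight out of the error text instead of copying it by hand, and the skip statement used in Method 1 below can be assembled from it. A self-contained sketch (the error string is hard-coded here; in practice it comes from the Last_SQL_Error field):

```shell
# Extract the failing GTID (uuid:txn) from a Last_SQL_Error-style message
# and assemble the empty-transaction skip statement.
err="Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005"
gtid=$(printf '%s\n' "$err" | grep -oE "[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}:[0-9]+")
echo "SET GTID_NEXT='$gtid'; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC';"
```

The printed statement is what gets run on each broken slave between stopping and restarting the SQL thread.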
Fix
Stop delayed replication:
bash
[root@mysql-node3 ~]# mysql -p123 -e "stop replica;change replication source to source_delay=0;start replica;show slave status\G" |grep -i sql_delay
SQL_Delay: 0
Method 1: skip the error on the slaves (recommended)
Run on the failing slave nodes, 10 and 30
If the cluster is not in GTID mode
bash
# Stop replication
STOP SLAVE;
# Check the current replication state
SHOW SLAVE STATUS\G
# Skip one bad event (non-GTID mode only)
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
# Start replication
START SLAVE;
# Verify the replication state
SHOW SLAVE STATUS\G
If the cluster is in GTID mode
bash
# Stop replication
STOP SLAVE;
# In GTID mode the bad transaction is skipped by injecting an empty one
# First inspect the GTID state
SHOW SLAVE STATUS\G
# and note Retrieved_Gtid_Set and Executed_Gtid_Set
# Then inject an empty transaction in place of the bad one (template):
# SET GTID_NEXT='UUID:transaction_number';
# BEGIN; COMMIT;
# SET GTID_NEXT='AUTOMATIC';
# Start replication
START SLAVE;
# Verify the replication state
SHOW SLAVE STATUS\G
Working through the fix
bash
# slave-node1
# Confirm the failing GTID
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G" | egrep -i "slave_sql_running:|last_sql_error"
Slave_SQL_Running: No
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Last_SQL_Error_Timestamp: 260305 12:51:37
# Skip the bad transaction and inject an empty one in its place (stop the SQL thread -> set the bad GTID -> empty transaction -> restart the SQL thread)
[root@mysql-node1 ~]# mysql -p123 -e "stop slave sql_thread;set gtid_next='0d44a8bc-176a-11f1-ae1e-000c2939010c:1';begin;commit;set gtid_next='automatic';start slave sql_thread;"
# Verify replication has recovered
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G" | egrep -i "slave_sql_running:|last_sql_error"
Slave_SQL_Running: Yes
Last_SQL_Error:
Last_SQL_Error_Timestamp:
# slave-node3
[root@mysql-node3 ~]# mysql -p123 -e "stop slave sql_thread;set gtid_next='0d44a8bc-176a-11f1-ae1e-000c2939010c:1';begin;commit;set gtid_next='automatic';start slave sql_thread;"
[root@mysql-node3 ~]# mysql -p123 -e "show slave status\G" | egrep -i "slave_sql_running:|last_sql_error"
Slave_SQL_Running: Yes
Last_SQL_Error:
Last_SQL_Error_Timestamp:
Method 2: skip error 1396 in the config file (persistent)
If Method 1 does not fully solve it, make the slaves skip this class of error in their configuration files
bash
# On each slave (10 and 30), edit the config file
# and add under the [mysqld] section
slave_skip_errors=1396
# Restart MySQL
systemctl restart mysqld
# Then restart replication
mysql -p123 -e "START SLAVE;"
Method 3: rebuild the replication relationship (most thorough)
If nothing above works, rebuild replication from scratch
bash
# 1. On the new master (20), check the binlog status and note File and Position
mysql -p123 -e "SHOW MASTER STATUS;"
# 2. On slaves 10 and 30, reconfigure
# Log in to MySQL
mysql -p123
# Stop and reset replication
STOP SLAVE;
RESET SLAVE ALL;
# Point replication at the new master
CHANGE MASTER TO
MASTER_HOST='172.25.254.20',
MASTER_PORT=3306,
MASTER_USER='repl',
MASTER_PASSWORD='123',
MASTER_LOG_FILE='mysql-bin.xxxxxx', -- the File noted above
MASTER_LOG_POS=xxxxxx; -- the Position noted above
# Start replication
START SLAVE;
# Verify the state
SHOW SLAVE STATUS\G
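The File/Position copy-paste step is easy to script; the sketch below parses a saved SHOW MASTER STATUS dump into a ready-made CHANGE MASTER statement. The dump file and its values are made up for illustration, not taken from a real server:

```shell
# Turn a saved `SHOW MASTER STATUS` tab-separated dump into a CHANGE MASTER
# statement. The dump below is demo data standing in for real output.
printf 'File\tPosition\nmysql-bin.000007\t1543\n' > /tmp/master_status.tsv
file=$(awk 'NR==2{print $1}' /tmp/master_status.tsv)
pos=$(awk 'NR==2{print $2}' /tmp/master_status.tsv)
echo "CHANGE MASTER TO MASTER_HOST='172.25.254.20', MASTER_USER='repl', MASTER_LOG_FILE='$file', MASTER_LOG_POS=$pos;"
```

In the real procedure the dump would come from `mysql -p123 -e "SHOW MASTER STATUS;" > /tmp/master_status.tsv` run on the new master.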
Method 4: check user consistency
Error 1396 is user/privilege related, so check that the users are identical on every node
bash
# Run on every node and compare the user lists
mysql -p123 -e "SELECT user, host FROM mysql.user ORDER BY user, host;"
# Make sure the key users (mha, repl) exist and match on all nodes
# If any are missing, create them by hand
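Comparing user lists by eye is error-prone; comm can diff sorted user@host lists from two nodes directly. Below, files stand in for the live queries so the sketch is self-contained (in the cluster each file would come from something like `mysql -p123 -N -e "SELECT CONCAT(user,'@',host) FROM mysql.user ORDER BY 1" > nodeX.txt`):

```shell
# Diff sorted user@host lists from two nodes; comm -23 prints the entries
# present on node1 but missing on node2. The two files are demo stand-ins.
printf 'mha@%%\nrepl@%%\nroot@localhost\n' > /tmp/users_node1.txt
printf 'repl@%%\nroot@localhost\n'         > /tmp/users_node2.txt
comm -23 /tmp/users_node1.txt /tmp/users_node2.txt
# prints: mha@%
```

Any account printed here is exactly the one to recreate on the lagging node.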
Retest
bash
# --interactive=0 runs non-interactively
# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.10 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
..................
Thu Mar 5 14:51:30 2026 - [info] Switching master to 172.25.254.10(172.25.254.10:3306) completed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
..................
Thu Mar 5 14:52:39 2026 - [info] Switching master to 172.25.254.20(172.25.254.20:3306) completed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
3.2 Manual failover with the master dead
When switching masters with masterha_master_switch, always add the --remove_dead_master_conf option, whether or not the master has actually failed, to avoid ending up with a dual-master topology.
bash
[root@mysql-node2 ~]# systemctl stop mysqld.service
# Manual master switch (note: the --remove_dead_master_conf option is missing here)
[root@MHA-M ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=172.25.254.20 --dead_master_port=3306 --new_master_host=172.25.254.10 --new_master_port=3306 --ignore_last_failover
..................
Master 172.25.254.20(172.25.254.20:3306) is dead. Proceed? (yes/NO): yes
..................
Starting master switch from 172.25.254.20(172.25.254.20:3306) to 172.25.254.10(172.25.254.10:3306)? (yes/NO): yes
..................
Started manual(interactive) failover.
Selected 172.25.254.10(172.25.254.10:3306) as a new master.
172.25.254.10(172.25.254.10:3306): OK: Applying all logs succeeded.
172.25.254.30(172.25.254.30:3306): ERROR: Failed on waiting gtid exec set on master.
Master failover to 172.25.254.10(172.25.254.10:3306) done, but recovery on slave partially failed.
# Replication still accepts writes, though
[root@mysql-node1 ~]# mysql -p123 -e "create database slave_test2;create table slave_test2.userlist(user varchar(10),password varchar(50));insert into slave_test2.userlist values('user1','123'); select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
+-------+----------+
[root@mysql-node1 ~]#
[root@mysql-node3 ~]# mysql -p123 -e "select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
+-------+----------+
[root@mysql-node3 ~]#
Side quest (fixing the dual-master topology)
bash
# We ended up with a dual-master topology
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Mar 5 15:32:23 2026 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln781] Multi-master configuration is detected, but two or more masters are either writable (read-only is not set) or dead! Check configurations for details. Master configurations are as below:
# Restarting like this produces dual masters; the slave config on the new master has to be cleaned up by hand
Master 172.25.254.10(172.25.254.10:3306), replicating from 172.25.254.20(172.25.254.20:3306)
Master 172.25.254.20(172.25.254.20:3306), dead
Thu Mar 5 15:32:23 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm line 329.
Thu Mar 5 15:32:23 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
# The new master 10 is still configured to replicate from the dead master 20
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G" | egrep -i "master_host|slave_(io|sql)_running|last_errno|auto_position"
Master_Host: 172.25.254.20
Slave_IO_Running: No
Slave_SQL_Running: No
Last_Errno: 0
Slave_SQL_Running_State:
Auto_Position: 1
# node3 is fine as-is, so its master_host does not need re-pointing
[root@mysql-node3 ~]# mysql -p123 -e "show slave status\G" | egrep -i "master_host|slave_(io|sql)_running|last_errno|auto_position"
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Auto_Position: 1
Fix 1
bash
# Step 1: on the new master (node 10), drop the slave role:
# stop replication (it should not replicate from anyone) and reset the replication config (fully clears the slave identity)
[root@mysql-node1 ~]# mysql -p123 -e "stop slave;reset slave all;"
# Confirm it is no longer a slave, and that as the master it is writable
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G;show variables like 'read_only'"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only | OFF |
+---------------+-------+
# Step 2: update the MHA configuration (remove the dead node 20)
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
24 #[server2]
25 #hostname=172.25.254.20
26 #candidate_master=1
27 #check_repl_delay=0
28 [server3]
29 hostname=172.25.254.30
30 #no_master=1
31 candidate_master=1 # Only one master (node 10) is left, and the sole slave (node 30) had no_master=1; if node 10 died there would be no failover target. MHA treats that as an unsafe configuration and the replication check fails
# Step 4: verify the new topology
[root@MHA-M ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
..................
Thu Mar 5 16:37:31 2026 - [info] All SSH connection tests passed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
# Step 5: restart MHA monitoring
[root@MHA-M ~]# touch /etc/masterha/mha.log
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:1924) is running(0:PING_OK), master:172.25.254.10
# If an MHA process is already running, stop it first
[root@MHA-M ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1
# Start MHA monitoring again
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
[1] 2016
# Start MHA-Manager in the foreground for testing:
# masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover
# Check the state
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
# Step 7: start node 20 and point it at the new master (10); whether to wipe its old data first is up to you.
[root@mysql-node2 ~]# systemctl start mysqld.service
[root@mysql-node2 ~]# mysql -p123 -e "show slave status\G" |grep -i master_host
[root@mysql-node2 ~]# mysql -p123 -e "
> change master to \
> master_host='172.25.254.10',
> master_port=3306,
> master_user='repl',
> master_password='123',
> master_auto_position=1;
> start slave;
> show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
[root@mysql-node2 ~]# mysql -p123 -e "change master to master_port=3306,master_host='172.25.254.10',master_user='repl',master_password='123',master_auto_position=1;start slave;show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
# Step 8: restore the MHA configuration: add node 20 back and restart monitoring
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
[root@MHA-M ~]# cat /etc/masterha/app1.cnf
[server default]
# MySQL monitoring user
user=mha
password=123
# SSH user (root recommended, avoids permission problems)
ssh_user=root
# Replication user
repl_user=repl
repl_password=123
# Directories
master_binlog_dir=/data/mysql
remote_workdir=/tmp
manager_workdir=/etc/masterha
manager_log=/etc/masterha/mha.log
# Secondary check: gateway plus a slave (with more budget, use two independent monitor hosts)
secondary_check_script= masterha_secondary_check -s 172.25.254.30
# Monitoring interval
ping_interval=3
[server1]
hostname=172.25.254.10
candidate_master=1
check_repl_delay=0
[server2]
hostname=172.25.254.20
candidate_master=1
check_repl_delay=0
[server3]
hostname=172.25.254.30
no_master=1
# Restart MHA monitoring
[root@MHA-M ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
[1] 2537
# Back to a single master
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
172.25.254.10(172.25.254.10:3306) (current master)
+--172.25.254.20(172.25.254.20:3306)
+--172.25.254.30(172.25.254.30:3306)
..................
MySQL Replication Health is OK.
Retest
This time add the --remove_dead_master_conf option
| MHA version | dry-run support | Alternative |
|---|---|---|
| MHA 0.56+ | supports --dry-run | use it directly |
| MHA 0.55 and earlier | not supported | test with --interactive |
bash
# Simulate a failure
# Stop MHA-Manager first, then do the manual switch, to avoid conflicts.
[root@MHA-M ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1
[root@mysql-node1 ~]# systemctl stop mysqld.service
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=dead --dead_master_host=172.25.254.10 --dead_master_port=3306 --new_master_host=172.25.254.20 --new_master_port=3306 --remove_dead_master_conf --ignore_last_failover
# Still ends in a dual-master fault
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Master 172.25.254.10(172.25.254.10:3306), dead
Master 172.25.254.20(172.25.254.20:3306), replicating from 172.25.254.10(172.25.254.10:3306)
An AI claims this version's GTID mode has a bug.
That feels like hand-waving to me. Its claims, for the record:
| Question | Claimed answer |
|---|---|
| Why does it keep failing? | MHA 0.58 has a GTID-mode bug: the new master's replication config is not cleaned up automatically |
| Is it an operator error? | No, it is an MHA version bug |
| How to fix it? | Run RESET SLAVE ALL manually after every switch |
| Upgrade MHA? | Strongly recommends 0.60+ |
| Version | Status | Notes |
|---|---|---|
| 0.58 | ❌ buggy | current version; GTID mode does not clean up the slave config |
| 0.59 | ❌ not fixed | transitional release, same problem |
| 0.60 | ✅ fixed | claimed official fix for the GTID bug |
| 0.61 | ✅ recommended | claimed further improvements, more stable |
| Feature | 0.58 | 0.60+ |
|---|---|---|
| GTID auto-cleanup of slave config | ❌ no | ✅ yes |
| --dry-run option | ❌ no | ✅ yes |
| Perl 5.32+ compatibility | ❌ manual patch needed | ✅ native |
| RHEL 9 support | ⚠️ needs patches | ✅ full |
3.3 Automatic failover when the master dies
bash
# MySQL topology
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
172.25.254.20(172.25.254.20:3306) (current master)
+--172.25.254.10(172.25.254.10:3306)
+--172.25.254.30(172.25.254.30:3306)
# Delete the failover lock file
[root@MHA-M ~]# find / -name app1.failover.complete
/etc/masterha/app1.failover.complete
[root@MHA-M ~]# rm -rf /etc/masterha/app1.failover.complete
# Start MHA monitoring
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
[1] 3135
[root@MHA-M ~]# ps aux | grep masterha_manager
root 3135 0.0 1.1 55980 42624 pts/0 S 12:04 0:00 perl /usr/bin/masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover
root 3406 0.0 0.0 6412 2304 pts/0 S+ 12:12 0:00 grep --color=auto masterha_manager
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:3135) is running(0:PING_OK), master:172.25.254.20
# Failure test
[root@mysql-node2 ~]# systemctl stop mysqld.service
[root@mysql-node2 ~]# systemctl status mysqld.service | grep -i active
Active: inactive (dead) since Fri 2026-03-06 13:15:03 CST; 23s ago
# Check whether the failover succeeded
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Master 172.25.254.20(172.25.254.20:3306), dead
Master 172.25.254.10(172.25.254.10:3306), replicating from 172.25.254.20(172.25.254.20:3306)
Fri Mar 6 15:15:44 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm line 329.
Fri Mar 6 15:15:44 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Mar 6 15:15:44 2026 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
[root@MHA-M ~]# sh check_all_nodes.sh
==============================================
Checking replication status on all MySQL nodes
==============================================
Node: 172.25.254.10
----------------------------------------------
Host Server_ID GTID_Mode
Master_Host: 172.25.254.20
Slave_IO_Running: No
Slave_SQL_Running: No
Last_Errno: 0
Slave_SQL_Running_State:
Auto_Position: 1
Node: 172.25.254.20
----------------------------------------------
Node: 172.25.254.30
----------------------------------------------
Host Server_ID GTID_Mode
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Auto_Position: 1
# Replication still works, though
[root@mysql-node1 ~]# mysql -p123 -e "insert into slave_test2.userlist values('user2','456'); select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
| user2 | 456 |
+-------+----------+
[root@mysql-node3 ~]# mysql -p123 -e "select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
| user2 | 456 |
+-------+----------+
Cleaning up the leftovers
Eliminate the GTID hazard at the root
bash
# On the new master (node 10), wipe the GTID history completely
[root@mysql-node1 ~]# mysql -p123 -e "stop slave;reset slave all;reset master;"
# node 30
[root@mysql-node3 ~]# mysql -p123 -e "stop slave;reset slave all;reset master;change master to master_port=3306,master_host='172.25.254.10',master_user='repl',master_password='123',master_auto_position=1;start slave;show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
mysql: [Warning] Using a password on the command line interface can be insecure.
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
[root@mysql-node3 ~]#
Rejoin the old master 20 to the cluster as a new slave
bash
[root@mysql-node2 ~]# systemctl start mysqld.service
[root@mysql-node2 ~]# mysql -p123 -e "stop slave;reset slave all;reset master;change master to master_port=3306,master_host='172.25.254.10',master_user='repl',master_password='123',master_auto_position=1;start slave;show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
mysql: [Warning] Using a password on the command line interface can be insecure.
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Add node 20 back to the MHA configuration file
bash
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
[root@MHA-M ~]# cat /etc/masterha/app1.cnf
[server default]
# MySQL monitoring user
user=mha
password=123
# SSH user (root recommended, avoids permission problems)
ssh_user=root
# Replication user
repl_user=repl
repl_password=123
# Directories
master_binlog_dir=/data/mysql
remote_workdir=/tmp
manager_workdir=/etc/masterha
manager_log=/etc/masterha/mha.log
# Secondary check: gateway plus a slave (with more budget, use two independent monitor hosts)
secondary_check_script= masterha_secondary_check -s 172.25.254.30
# Monitoring interval
ping_interval=3
[server1]
hostname=172.25.254.10
candidate_master=1
check_repl_delay=0
[server2]
hostname=172.25.254.20
candidate_master=1
check_repl_delay=0
[server3]
hostname=172.25.254.30
no_master=1
Fixing the missing environment variable
Make the system able to find the mysqlbinlog command
bash
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Can't exec "mysqlbinlog": No such file or directory at /usr/share/perl5/vendor_perl/MHA/BinlogManager.pm line 106.
# The system cannot find the mysqlbinlog command.
[root@mysql-node1 ~]# find / -name mysqlbinlog 2>/dev/null
/root/mysql-8.3.0/build/runtime_output_directory/mysqlbinlog
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node1 ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
[root@mysql-node1 ~]# ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
[root@mysql-node1 ~]# ls -l /usr/local/bin/mysqlbinlog
lrwxrwxrwx 1 root root 32 Mar 6 17:06 /usr/local/bin/mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
[root@mysql-node1 ~]# mysqlbinlog --version
mysqlbinlog Ver 8.3.0 for Linux on x86_64 (Source distribution)
[root@mysql-node2 ~]# find / -name mysqlbinlog 2>/dev/null
/root/mysql-8.3.0/build/runtime_output_directory/mysqlbinlog
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node2 ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
[root@mysql-node2 ~]# ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
[root@mysql-node2 ~]# ls -l /usr/local/bin/mysqlbinlog
lrwxrwxrwx 1 root root 32 Mar 6 17:06 /usr/local/bin/mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
[root@mysql-node2 ~]# mysqlbinlog --version
mysqlbinlog Ver 8.3.0 for Linux on x86_64 (Source distribution)
[root@mysql-node3 ~]# find / -name mysqlbinlog 2>/dev/null
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node3 ~]# which mysqlbinlog
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node3 ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
[root@mysql-node3 ~]# ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
[root@mysql-node3 ~]# ls -l /usr/local/bin/mysqlbinlog
lrwxrwxrwx 1 root root 32 Mar 6 17:06 /usr/local/bin/mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
[root@mysql-node3 ~]# mysqlbinlog --version
mysqlbinlog Ver 8.3.0 for Linux on x86_64 (Source distribution)
Fixing the relay-log.info problem
Manually build/repair the relay-log.info file; MySQL 8.0+ no longer generates it by default, which leaves MHA unable to read a slave's relay-log position.
bash
[root@mysql-node2 ~]# mysql -u mha -p123 -h 172.25.254.20 -e "SHOW SLAVE STATUS\G" > /tmp/slave_status.txt
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@mysql-node2 ~]# RELAY_LOG_FILE=$(grep "Relay_Log_File" /tmp/slave_status.txt | awk '{print $2}') &&RELAY_LOG_POS=$(grep "Relay_Log_Pos" /tmp/slave_status.txt | awk '{print $2}')
[root@mysql-node2 ~]# echo "$RELAY_LOG_FILE" > /data/mysql/relay-log.info &&echo "$RELAY_LOG_POS" >> /data/mysql/relay-log.info
[root@mysql-node2 ~]# cat /data/mysql/relay-log.info
mysql-node2-relay-bin.000004
375
[root@mysql-node2 ~]# chown mysql:mysql /data/mysql/relay-log.info
[root@mysql-node3 ~]# mysql -u mha -p123 -h 172.25.254.30 -e "SHOW SLAVE STATUS\G" > /tmp/slave_status.txt
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@mysql-node3 ~]# RELAY_LOG_FILE=$(grep "Relay_Log_File" /tmp/slave_status.txt | awk '{print $2}') &&RELAY_LOG_POS=$(grep "Relay_Log_Pos" /tmp/slave_status.txt | awk '{print $2}')
[root@mysql-node3 ~]# echo "$RELAY_LOG_FILE" > /data/mysql/relay-log.info &&echo "$RELAY_LOG_POS" >> /data/mysql/relay-log.info
[root@mysql-node3 ~]# cat /data/mysql/relay-log.info
mysql-node3-relay-bin.000002
[root@mysql-node3 ~]# chown mysql:mysql /data/mysql/relay-log.info
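The two-variable dance above can be folded into a single awk pass; the function below rebuilds relay-log.info from any saved SHOW SLAVE STATUS dump. Demo input is used here so the sketch is self-contained (file names and values are stand-ins, not from a live node):

```shell
# Rebuild relay-log.info (relay log file name + position, one per line)
# from a saved `SHOW SLAVE STATUS\G` dump in one awk pass.
build_relay_info() {  # $1 = status dump, $2 = output path
  awk -F': *' '/Relay_Log_File:/{f=$2} /Relay_Log_Pos:/{p=$2} END{print f; print p}' "$1" > "$2"
}
# Demo input standing in for real SHOW SLAVE STATUS output:
printf '  Relay_Log_File: mysql-node2-relay-bin.000004\n  Relay_Log_Pos: 375\n' > /tmp/slave_status_demo.txt
build_relay_info /tmp/slave_status_demo.txt /tmp/relay-log.info.demo
cat /tmp/relay-log.info.demo
# prints:
# mysql-node2-relay-bin.000004
# 375
```

On a real slave the dump would come from `mysql ... -e "SHOW SLAVE STATUS\G" > /tmp/slave_status.txt`, with the output path set to /data/mysql/relay-log.info followed by the chown shown above.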
Starting monitoring next time
bash
# Clear the lock files
[root@MHA-M masterha]# rm -rf app1.failover.complete manager.log
# Re-run the ssh and repl checks
# Start background monitoring
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
4 VIP implementation
The VIP is implemented with scripts, wired in through the script parameters in MHA's /etc/masterha/app1.cnf configuration file.
bash
# The scripts ship in the teacher-provided archive
[root@mha ~]# ls MHA-7/master_ip_*
MHA-7/master_ip_failover MHA-7/master_ip_online_change
[root@MHA-M ~]# mkdir /etc/masterha/scripts
[root@MHA-M ~]# cp MHA-7/master_ip_* /etc/masterha/scripts
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
master_ip_failover_script= /etc/masterha/scripts/master_ip_failover
master_ip_online_change_script= /etc/masterha/scripts/master_ip_online_change
# Set the VIP
[root@MHA-M ~]# vim /etc/masterha/scripts/master_ip_failover
my $vip = '172.25.254.100/24';
[root@MHA-M ~]# vim /etc/masterha/scripts/master_ip_online_change
my $vip = '172.25.254.100/24';
# Add the virtual IP to the NIC; it takes effect immediately.
[root@mysql-node1 ~]# ip a a 172.25.254.100/24 dev eth0
# Verify monitoring
[root@mha ~]# masterha_manager --conf=/etc/masterha/app1.cnf &
[root@mha ~]# jobs
[1]+ Running                 masterha_manager --conf=/etc/masterha/app1.cnf &
# Test:
# Stop the MySQL master
[root@mysql-node1 ~]# /etc/init.d/mysqld stop
[root@mysql-node2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:0c:29:e8:4b:64 brd ff:ff:ff:ff:ff:ff
altname enp3s0
altname ens160
inet 172.25.254.20/24 brd 172.25.254.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet 172.25.254.100/24 scope global secondary eth0
valid_lft forever preferred_lft forever
inet6 fe80::f8be:d443:72d7:d336/64 scope link noprefixroute
valid_lft forever preferred_lft forever
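What the two scripts do with $vip typically boils down to removing it from the old master and adding it on the new one over SSH. A dry-run sketch that only prints those commands (nothing here touches a real NIC; the IPs are this lab's, the function name is made up for illustration):

```shell
# Dry-run sketch of the VIP handoff a failover script performs:
# prints the ssh commands instead of executing them.
VIP="172.25.254.100/24"
vip_cmd() {  # $1 = start|stop, $2 = host
  case "$1" in
    start) echo "ssh root@$2 ip addr add $VIP dev eth0" ;;
    stop)  echo "ssh root@$2 ip addr del $VIP dev eth0" ;;
  esac
}
vip_cmd stop  172.25.254.10   # take the VIP off the failed master
vip_cmd start 172.25.254.20   # bring it up on the new master
```

Dropping the echoes and running the ssh lines directly is essentially what the failover/online-change scripts do when MHA invokes them, which is why the VIP followed the master to node 20 in the test above.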