Table of Contents
- 1 Set up a one-master, multi-slave topology (semi-sync mode)
- 2 Deploy MHA
  - 2.1 Install MHA environment dependencies
  - 2.2 Install dependencies on the MySQL nodes
  - 2.3 Create the MySQL monitoring user mha
  - 2.4 Write the MHA configuration file
- 3 MHA failover
  - 3.1 Manual switchover with the master alive
    - Side quest (fixing delayed replication)
      - Fix
        - Stop delayed replication
        - Method 1: skip the error on the slaves (recommended)
        - Method 2: skip error 1396 in the config file (persistent)
        - Method 3: rebuild the replication relationship (most thorough)
        - Method 4: check user consistency
      - Retest
  - 3.2 Manual failover with the master dead
  - 3.3 Automatic failover when the master dies
- 4 VIP implementation
1 Set up a one-master, multi-slave topology (semi-sync mode)
🔭 Features MHA does not support or conflicts with:
delayed replication, MGR, and dual-/multi-master topologies are unsupported; MySQL Router partially conflicts
This lab builds on the earlier replication lab, so first turn off delayed replication:
bash
[root@mysql-node3 ~]# mysql -p123 -e "stop replica;change replication source to source_delay=0;start replica;show slave status\G" |grep -i sql_delay
SQL_Delay: 0
2 Deploy MHA
2.1 Install MHA environment dependencies
After installing, run masterha_manager --version; it will tell you which Perl modules are still missing
List the required dependencies with rpm -qpR mha4mysql-node-0.58-0.el7.centos.noarch.rpm and rpm -qpR mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
You can also resolve the dependencies over the network with cpan
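Before reaching for cpan, a quick `perl -M` probe loop shows which of those Perl modules are already present; this is a small self-contained sketch (the module list mirrors the cpan installs used later in this section, and should be extended as needed):

```shell
# Probe for the Perl modules MHA wants; prints OK or MISS per module.
# (Module list mirrors the cpan installs in this section; extend as needed.)
for m in Config::Tiny Log::Dispatch Mail::Sender Parallel::ForkManager; do
  if perl -M"$m" -e 1 2>/dev/null; then
    echo "OK   $m"
  else
    echo "MISS $m"
  fi
done
```

Any module flagged MISS is a candidate for `cpan` installation in the session below.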
bash
# MHA-M:172.25.254.40;mysql-node1(master):172.25.254.10;mysql-node2(slave):172.25.254.20;mysql-node3(slave):172.25.254.30;
# MHA-M
[root@MHA-M ~]# vim /etc/hosts
1 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
2 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
3 172.25.254.40 MHA-M
4 172.25.254.10 mysql-node1
5 172.25.254.20 mysql-node2
6 172.25.254.30 mysql-node3
[root@MHA-M ~]# for i in {10,20,30};do scp -q /etc/hosts root@172.25.254.$i:/etc/hosts;done
# Install the Perl environment (MHA manager node)
[root@MHA-M ~]# dnf install perl perl-DBD-MySQL perl-CPAN perl-English -yq
# Use cpan to install the remaining MHA dependencies
[root@MHA-M ~]# cpan
..................
Would you like to configure as much as possible automatically? [yes] yes
..................
Perl site library directory "/usr/local/share/perl5/5.32" does not exist.
Perl site library directory "/usr/local/share/perl5/5.32" created.
Perl site library directory "/usr/local/lib64/perl5/5.32" does not exist.
Perl site library directory "/usr/local/lib64/perl5/5.32" created.
..................
cpan[1]> install Config::Tiny
cpan[2]> install Log::Dispatch
cpan[3]> install Mail::Sender
# For the blank prompts just press Enter; configure manually later if needed
Specify defaults for Mail::Sender? (y/N) y
Default SMTP server (hostname or IP address)
: # 回车
*********************************************************************
Default FROM value (must be perl code / ENTER for none):
:
*********************************************************************
Default for REPLY-TO field (must be perl code / ENTER for none):
:
*********************************************************************
Default for CC field (must be perl code / ENTER for none):
:
Default for BCC field (must be perl code / ENTER for none):
:
Default name of the client MACHINE used when connecting
to the SMTP server (must be perl code / ENTER for none):
:
*********************************************************************
Default additional headers (must be perl code / ENTER for none):
:
*********************************************************************
Default encoding of message bodies (N)one, (Q)uoted-printable, (B)ase64:
: n
*********************************************************************
Default charset of message bodies (must be perl code / ENTER for none):
:
cpan[4]> install Parallel::ForkManager
cpan[5]> exit
[root@MHA-M ~]# unzip MHA-7.zip
# The node package must be installed as well; this is a design flaw of MHA 0.58
[root@MHA-M ~]# rpm -ivh MHA-7/mha4mysql-manager-0.58-0.el7.centos.noarch.rpm MHA-7/mha4mysql-node-0.58-0.el7.centos.noarch.rpm --nodeps
[root@MHA-M ~]# masterha_manager --version
masterha_manager version 0.58.
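After pushing /etc/hosts to all nodes earlier in this section, it is worth confirming that every hostname actually resolves before SSH trust and the MHA checks start depending on it; `getent hosts` consults /etc/hosts directly. A minimal sketch:

```shell
# Verify that all cluster hostnames resolve (getent consults /etc/hosts);
# prints the resolved entry, or flags the name as MISSING.
for h in MHA-M mysql-node1 mysql-node2 mysql-node3; do
  getent hosts "$h" || echo "MISSING $h"
done
```

On a correctly prepared node all four names print their 172.25.254.x entries; any MISSING line means the hosts file was not distributed.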
2.2 Install dependencies on the MySQL nodes
bash
# Install the MHA node prerequisites (all MySQL nodes)
[root@mysql-node2 ~]# yum install -y perl perl-English perl-Time-HiRes perl-DBI perl-DBD-MySQL perl-File-Copy perl-core
# Install the MHA node package on each MySQL node
[root@MHA-M ~]# for i in {10,20,30};do
> scp MHA-7/mha4mysql-node-0.58-0.el7.centos.noarch.rpm root@172.25.254.$i:/mnt
> ssh -l root 172.25.254.$i "rpm -ivh /mnt/mha4mysql-node-0.58-0.el7.centos.noarch.rpm --nodeps"
> done
# Fix the Perl sprintf bug in MHA 0.58 (on every node MHA manages, i.e. all MySQL nodes)
[root@mysql-node1 ~]# vim /usr/share/perl5/vendor_perl/MHA/NodeUtil.pm
199 #sub parse_mysql_major_version($) {
200 # my $str = shift;
201 # my $result = sprintf( '%03d%03d', $str =~ m/(\d+)/g );
202 # return $result;
203 #}
204
205 sub parse_mysql_major_version($) {
206 my $str = shift;
207 my @nums = $str =~ m/(\d+)/g;
208 my $result = sprintf( '%03d%03d', $nums[0]//0, $nums[1]//0);
209 return $result;
210 }
# scp the patched file to the other nodes
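To see what the patched parse_mysql_major_version produces, here is the same transformation re-done in plain shell against a sample version string (an illustration only, not part of MHA): the first two numeric fields are packed as zero-padded three-digit groups, which is the value the patched Perl code returns for version comparison.

```shell
# Shell re-implementation of the patched parse_mysql_major_version:
# take the first two numbers in the version string and pack them as
# zero-padded three-digit groups, e.g. "8.0.36" -> 008000.
ver="8.0.36"
major=$(printf '%s\n' "$ver" | grep -oE '[0-9]+' | sed -n 1p)
minor=$(printf '%s\n' "$ver" | grep -oE '[0-9]+' | sed -n 2p)
printf '%03d%03d\n' "${major:-0}" "${minor:-0}"
# prints 008000
```

The original buggy sub passed the whole match list to sprintf; the patch keeps only the first two numbers, which is what the shell version mimics.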
2.3 Create the MySQL monitoring user mha
bash
# Create a remote login account for MHA: the MySQL monitoring user mha
[root@MHA-M ~]# dnf install mysql-server -yq
[root@MHA-M ~]# grep mysql /etc/passwd
mysql:x:27:27:MySQL Server:/var/lib/mysql:/sbin/nologin
[root@MHA-M ~]# mkdir /data/mysql
[root@MHA-M ~]# chown mysql.mysql /data/mysql
[root@MHA-M ~]# vim /etc/my.cnf.d/mysql-server.cnf
[root@MHA-M ~]# tail -n6 /etc/my.cnf.d/mysql-server.cnf
[mysqld]
server-id=40
datadir=/data/mysql
socket=/data/mysql/mysql.sock
log-error=/var/log/mysql/mysqld.log
pid-file=/run/mysqld/mysqld.pid
[root@MHA-M ~]# vim /etc/my.cnf.d/client.cnf
7 [client]
8 socket=/data/mysql/mysql.sock
[root@MHA-M ~]# systemctl enable --now mysqld
[root@MHA-M ~]# grep 'temporary password' /var/log/mysql/mysqld.log
2026-03-05T01:56:48.550596Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: g)kUyiLTJ1=&
[root@MHA-M ~]# mysql_secure_installation
secure enough. Would you like to setup VALIDATE PASSWORD component?
Press y|Y for Yes, any other key for No: no
Please set the password for root here.
New password:
Re-enter new password:
Remove anonymous users? (Press y|Y for Yes, any other key for No) : y
Disallow root login remotely? (Press y|Y for Yes, any other key for No) : y
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : y
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : y
All done!
[root@MHA-M ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
# Check user completeness on every node; this feeds the MHA configuration below
# Expected: MHA-M: mha user; node1: repl and mha users; node2: repl and mha users; node3: repl and mha users
[root@MHA-M ~]# mysql -p123 -t -e "select user,host from mysql.user" | grep mha
| mha | % |
[root@mysql-node1 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| repl | % |
[root@mysql-node2 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
[root@mysql-node3 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
# node1
[root@mysql-node1 ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
[root@mysql-node1 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| mha | % |
| repl | % |
# node2
[root@mysql-node2 ~]# mysql -p123 -e "create user repl@'%' identified with mysql_native_password by '123'; grant all on *.* to repl@'%';"
[root@mysql-node2 ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
[root@mysql-node2 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| mha | % |
| repl | % |
# node3
[root@mysql-node3 ~]# mysql -p123 -e "create user repl@'%' identified with mysql_native_password by '123'; grant all on *.* to repl@'%';"
[root@mysql-node3 ~]# mysql -p123 -e "create user mha@'%' identified with mysql_native_password by '123'; grant all on *.* to mha@'%';"
[root@mysql-node3 ~]# mysql -p123 -t -e "select user,host from mysql.user" | egrep "mha|repl"
| mha | % |
| repl | % |
2.4 Write the MHA configuration file
bash
# Generate a configuration file template on MHA-M, then edit it
[root@MHA-M ~]# mkdir -p /etc/masterha/
[root@MHA-M ~]# tar -zxf MHA-7/mha4mysql-manager-0.58.tar.gz
[root@MHA-M ~]# cat mha4mysql-manager-0.58/samples/conf/masterha_default.cnf mha4mysql-manager-0.58/samples/conf/app1.cnf > /etc/masterha/app1.cnf
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
[root@MHA-M ~]# cat /etc/masterha/app1.cnf
[server default]
# MySQL monitoring user
user=mha
password=123
# SSH user (root recommended, avoids permission problems)
ssh_user=root
# Replication user
repl_user=repl
repl_password=123
# Directories
master_binlog_dir=/data/mysql
remote_workdir=/tmp
manager_workdir=/etc/masterha
manager_log=/etc/masterha/mha.log
# Secondary check: gateway plus a slave (with more budget, use two independent monitor hosts). A later experiment simulates node1/node2 failures, so node3 does the secondary check here; in my lab the gateway kept reporting unreachable, so I dropped it (the ......_check_ssh command itself was fine)
secondary_check_script= masterha_secondary_check -s 172.25.254.30
# Monitoring interval
ping_interval=3
[server1]
hostname=172.25.254.10
candidate_master=1
check_repl_delay=0
[server2]
hostname=172.25.254.20
candidate_master=1
check_repl_delay=0
[server3]
hostname=172.25.254.30
no_master=1
# Verify the environment
[root@MHA-M ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
..................
Thu Mar 5 11:12:24 2026 - [info] All SSH connection tests passed successfully.
Use of uninitialized value in exit at /usr/bin/masterha_check_ssh line 44.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
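As a sanity check on the [serverN] layout, the host list MHA will see can be pulled out of the config with awk; the snippet below runs against a miniature copy of the file so it is self-contained (path and contents are demo values, not the real app1.cnf):

```shell
# Write a miniature app1.cnf and list each [serverN] section with its hostname.
cat > /tmp/app1_demo.cnf <<'EOF'
[server1]
hostname=172.25.254.10
candidate_master=1
[server3]
hostname=172.25.254.30
no_master=1
EOF
awk -F= '/^\[server/{sec=$0} /^hostname/{print sec, $2}' /tmp/app1_demo.cnf
# prints:
# [server1] 172.25.254.10
# [server3] 172.25.254.30
```

Pointing the same awk line at /etc/masterha/app1.cnf gives a one-glance view of which hosts are configured before running the heavier masterha_check_* tools.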
3 MHA failover
If an experiment fails, the quickest lab cleanup is to wipe and rebuild: delete the data directory, re-initialize the database (mysqld --initialize --user=mysql), start the service, and run mysql_secure_installation.
3.1 Manual switchover with the master alive
bash
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
..................
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 172.25.254.10(172.25.254.10:3306)? (YES/no): yes
..................
Starting master switch from 172.25.254.10(172.25.254.10:3306) to 172.25.254.20(172.25.254.20:3306)? (yes/NO): yes
..................
master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
..................
Thu Mar 5 12:14:21 2026 - [info] Switching master to 172.25.254.20(172.25.254.20:3306) completed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.10(172.25.254.10:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.30(172.25.254.30:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
MySQL Replication Health is NOT OK!
# The manual master switchover itself succeeded (the log shows "completed successfully"), but the replication chain broke afterwards. Error 1396 is usually user/privilege related and needs a manual fix.
Side quest (fixing delayed replication)
Because delayed replication had not been fully turned off before running MHA, the errors below appeared
bash
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.10(172.25.254.10:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Thu Mar 5 13:02:27 2026 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln939] SQL Thread is stopped(error) on 172.25.254.30(172.25.254.30:3306)! Errno:1396, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Key points:
'0d44a8bc-176a-11f1-ae1e-000c2939010c:1' is the transaction that broke replication. It has to be skipped on each of the two slaves (10 and 30).
| Node | Error | Analysis |
|---|---|---|
| 172.25.254.10 | SQL thread stopped | After the old master became a slave, its replication thread stopped |
| 172.25.254.30 | SQL thread stopped | The other slave stopped replicating too |
| Error code | 1396 | User/privilege-related error |
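The failing GTID can be pulled straight out of the error text instead of copying it by hand, and the skip statement used in Method 1 below can be assembled from it. A self-contained sketch (the error string is hard-coded here; in practice it comes from the Last_SQL_Error field):

```shell
# Extract the failing GTID (uuid:txn) from a Last_SQL_Error-style message
# and assemble the empty-transaction skip statement.
err="Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005"
gtid=$(printf '%s\n' "$err" | grep -oE "[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}:[0-9]+")
echo "SET GTID_NEXT='$gtid'; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC';"
```

The printed statement is what gets run on each broken slave between stopping and restarting the SQL thread.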
Fix
Stop delayed replication:
bash
[root@mysql-node3 ~]# mysql -p123 -e "stop replica;change replication source to source_delay=0;start replica;show slave status\G" |grep -i sql_delay
SQL_Delay: 0
Method 1: skip the error on the slaves (recommended)
Run on the failing slave nodes, 10 and 30
If the cluster is not in GTID mode
bash
# Stop replication
STOP SLAVE;
# Check the current replication state
SHOW SLAVE STATUS\G
# Skip one bad event (non-GTID mode only)
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
# Start replication
START SLAVE;
# Verify the replication state
SHOW SLAVE STATUS\G
If the cluster is in GTID mode
bash
# Stop replication
STOP SLAVE;
# In GTID mode the bad transaction is skipped by injecting an empty one
# First inspect the GTID state
SHOW SLAVE STATUS\G
# and note Retrieved_Gtid_Set and Executed_Gtid_Set
# Then inject an empty transaction in place of the bad one (template):
# SET GTID_NEXT='UUID:transaction_number';
# BEGIN; COMMIT;
# SET GTID_NEXT='AUTOMATIC';
# Start replication
START SLAVE;
# Verify the replication state
SHOW SLAVE STATUS\G
Working through the fix
bash
# slave-node1
# Confirm the failing GTID
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G" | egrep -i "slave_sql_running:|last_sql_error"
Slave_SQL_Running: No
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '0d44a8bc-176a-11f1-ae1e-000c2939010c:1' at source log mysql-bin.000005, end_log_pos 481. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Last_SQL_Error_Timestamp: 260305 12:51:37
# Skip the bad transaction and inject an empty one in its place (stop the SQL thread -> set the bad GTID -> empty transaction -> restart the SQL thread)
[root@mysql-node1 ~]# mysql -p123 -e "stop slave sql_thread;set gtid_next='0d44a8bc-176a-11f1-ae1e-000c2939010c:1';begin;commit;set gtid_next='automatic';start slave sql_thread;"
# Verify replication has recovered
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G" | egrep -i "slave_sql_running:|last_sql_error"
Slave_SQL_Running: Yes
Last_SQL_Error:
Last_SQL_Error_Timestamp:
# slave-node3
[root@mysql-node3 ~]# mysql -p123 -e "stop slave sql_thread;set gtid_next='0d44a8bc-176a-11f1-ae1e-000c2939010c:1';begin;commit;set gtid_next='automatic';start slave sql_thread;"
[root@mysql-node3 ~]# mysql -p123 -e "show slave status\G" | egrep -i "slave_sql_running:|last_sql_error"
Slave_SQL_Running: Yes
Last_SQL_Error:
Last_SQL_Error_Timestamp:
Method 2: skip error 1396 in the config file (persistent)
If Method 1 does not fully solve it, make the slaves skip this class of error in their configuration files
bash
# On each slave (10 and 30), edit the config file
# and add under the [mysqld] section
slave_skip_errors=1396
# Restart MySQL
systemctl restart mysqld
# Then restart replication
mysql -p123 -e "START SLAVE;"
Method 3: rebuild the replication relationship (most thorough)
If nothing above works, rebuild replication from scratch
bash
# 1. On the new master (20), check the binlog status and note File and Position
mysql -p123 -e "SHOW MASTER STATUS;"
# 2. On slaves 10 and 30, reconfigure
# Log in to MySQL
mysql -p123
# Stop and reset replication
STOP SLAVE;
RESET SLAVE ALL;
# Point replication at the new master
CHANGE MASTER TO
MASTER_HOST='172.25.254.20',
MASTER_PORT=3306,
MASTER_USER='repl',
MASTER_PASSWORD='123',
MASTER_LOG_FILE='mysql-bin.xxxxxx', -- the File noted above
MASTER_LOG_POS=xxxxxx; -- the Position noted above
# Start replication
START SLAVE;
# Verify the state
SHOW SLAVE STATUS\G
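The File/Position copy-paste step is easy to script; the sketch below parses a saved SHOW MASTER STATUS dump into a ready-made CHANGE MASTER statement. The dump file and its values are made up for illustration, not taken from a real server:

```shell
# Turn a saved `SHOW MASTER STATUS` tab-separated dump into a CHANGE MASTER
# statement. The dump below is demo data standing in for real output.
printf 'File\tPosition\nmysql-bin.000007\t1543\n' > /tmp/master_status.tsv
file=$(awk 'NR==2{print $1}' /tmp/master_status.tsv)
pos=$(awk 'NR==2{print $2}' /tmp/master_status.tsv)
echo "CHANGE MASTER TO MASTER_HOST='172.25.254.20', MASTER_USER='repl', MASTER_LOG_FILE='$file', MASTER_LOG_POS=$pos;"
```

In the real procedure the dump would come from `mysql -p123 -e "SHOW MASTER STATUS;" > /tmp/master_status.tsv` run on the new master.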
Method 4: check user consistency
Error 1396 is user/privilege related, so check that the users are identical on every node
bash
# Run on every node and compare the user lists
mysql -p123 -e "SELECT user, host FROM mysql.user ORDER BY user, host;"
# Make sure the key users (mha, repl) exist and match on all nodes
# If any are missing, create them by hand
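Comparing user lists by eye is error-prone; comm can diff sorted user@host lists from two nodes directly. Below, files stand in for the live queries so the sketch is self-contained (in the cluster each file would come from something like `mysql -p123 -N -e "SELECT CONCAT(user,'@',host) FROM mysql.user ORDER BY 1" > nodeX.txt`):

```shell
# Diff sorted user@host lists from two nodes; comm -23 prints the entries
# present on node1 but missing on node2. The two files are demo stand-ins.
printf 'mha@%%\nrepl@%%\nroot@localhost\n' > /tmp/users_node1.txt
printf 'repl@%%\nroot@localhost\n'         > /tmp/users_node2.txt
comm -23 /tmp/users_node1.txt /tmp/users_node2.txt
# prints: mha@%
```

Any account printed here is exactly the one to recreate on the lagging node.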
Retest
bash
# --interactive=0 runs non-interactively
# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.10 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
..................
Thu Mar 5 14:51:30 2026 - [info] Switching master to 172.25.254.10(172.25.254.10:3306) completed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.25.254.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
..................
Thu Mar 5 14:52:39 2026 - [info] Switching master to 172.25.254.20(172.25.254.20:3306) completed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
3.2 Manual failover with the master dead
When switching masters with masterha_master_switch, always add the --remove_dead_master_conf option, whether or not the master has actually failed, to avoid ending up with a dual-master topology.
bash
[root@mysql-node2 ~]# systemctl stop mysqld.service
# Manual master switch (note: the --remove_dead_master_conf option is missing here)
[root@MHA-M ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=172.25.254.20 --dead_master_port=3306 --new_master_host=172.25.254.10 --new_master_port=3306 --ignore_last_failover
..................
Master 172.25.254.20(172.25.254.20:3306) is dead. Proceed? (yes/NO): yes
..................
Starting master switch from 172.25.254.20(172.25.254.20:3306) to 172.25.254.10(172.25.254.10:3306)? (yes/NO): yes
..................
Started manual(interactive) failover.
Selected 172.25.254.10(172.25.254.10:3306) as a new master.
172.25.254.10(172.25.254.10:3306): OK: Applying all logs succeeded.
172.25.254.30(172.25.254.30:3306): ERROR: Failed on waiting gtid exec set on master.
Master failover to 172.25.254.10(172.25.254.10:3306) done, but recovery on slave partially failed.
# Replication still accepts writes, though
[root@mysql-node1 ~]# mysql -p123 -e "create database slave_test2;create table slave_test2.userlist(user varchar(10),password varchar(50));insert into slave_test2.userlist values('user1','123'); select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
+-------+----------+
[root@mysql-node1 ~]#
[root@mysql-node3 ~]# mysql -p123 -e "select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
+-------+----------+
[root@mysql-node3 ~]#
Side quest (fixing the dual-master topology)
bash
# We ended up with a dual-master topology
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Mar 5 15:32:23 2026 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln781] Multi-master configuration is detected, but two or more masters are either writable (read-only is not set) or dead! Check configurations for details. Master configurations are as below:
# Restarting like this produces dual masters; the slave config on the new master has to be cleaned up by hand
Master 172.25.254.10(172.25.254.10:3306), replicating from 172.25.254.20(172.25.254.20:3306)
Master 172.25.254.20(172.25.254.20:3306), dead
Thu Mar 5 15:32:23 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm line 329.
Thu Mar 5 15:32:23 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
# The new master 10 is still configured to replicate from the dead master 20
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G" | egrep -i "master_host|slave_(io|sql)_running|last_errno|auto_position"
Master_Host: 172.25.254.20
Slave_IO_Running: No
Slave_SQL_Running: No
Last_Errno: 0
Slave_SQL_Running_State:
Auto_Position: 1
# node3 is fine as-is, so its master_host does not need re-pointing
[root@mysql-node3 ~]# mysql -p123 -e "show slave status\G" | egrep -i "master_host|slave_(io|sql)_running|last_errno|auto_position"
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Auto_Position: 1
Fix 1
bash
# Step 1: on the new master (node 10), drop the slave role:
# stop replication (it should not replicate from anyone) and reset the replication config (fully clears the slave identity)
[root@mysql-node1 ~]# mysql -p123 -e "stop slave;reset slave all;"
# Confirm it is no longer a slave, and that as the master it is writable
[root@mysql-node1 ~]# mysql -p123 -e "show slave status\G;show variables like 'read_only'"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only | OFF |
+---------------+-------+
# Step 2: update the MHA configuration (remove the dead node 20)
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
24 #[server2]
25 #hostname=172.25.254.20
26 #candidate_master=1
27 #check_repl_delay=0
28 [server3]
29 hostname=172.25.254.30
30 #no_master=1
31 candidate_master=1 # Only one master (node 10) is left, and the sole slave (node 30) had no_master=1; if node 10 died there would be no failover target. MHA treats that as an unsafe configuration and the replication check fails
# Step 4: verify the new topology
[root@MHA-M ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
..................
Thu Mar 5 16:37:31 2026 - [info] All SSH connection tests passed successfully.
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
# Step 5: restart MHA monitoring
[root@MHA-M ~]# touch /etc/masterha/mha.log
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:1924) is running(0:PING_OK), master:172.25.254.10
# If an MHA process is already running, stop it first
[root@MHA-M ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1
# Start MHA monitoring again
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
[1] 2016
# Start MHA-Manager in the foreground for testing:
# masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover
# Check the state
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..................
MySQL Replication Health is OK.
# Step 7: start node 20 and point it at the new master (10); whether to wipe its old data first is up to you.
[root@mysql-node2 ~]# systemctl start mysqld.service
[root@mysql-node2 ~]# mysql -p123 -e "show slave status\G" |grep -i master_host
[root@mysql-node2 ~]# mysql -p123 -e "
> change master to \
> master_host='172.25.254.10',
> master_port=3306,
> master_user='repl',
> master_password='123',
> master_auto_position=1;
> start slave;
> show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
[root@mysql-node2 ~]# mysql -p123 -e "change master to master_port=3306,master_host='172.25.254.10',master_user='repl',master_password='123',master_auto_position=1;start slave;show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
# Step 8: restore the MHA configuration: add node 20 back and restart monitoring
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
[root@MHA-M ~]# cat /etc/masterha/app1.cnf
[server default]
# MySQL monitoring user
user=mha
password=123
# SSH user (root recommended, avoids permission problems)
ssh_user=root
# Replication user
repl_user=repl
repl_password=123
# Directories
master_binlog_dir=/data/mysql
remote_workdir=/tmp
manager_workdir=/etc/masterha
manager_log=/etc/masterha/mha.log
# Secondary check: gateway plus a slave (with more budget, use two independent monitor hosts)
secondary_check_script= masterha_secondary_check -s 172.25.254.30
# Monitoring interval
ping_interval=3
[server1]
hostname=172.25.254.10
candidate_master=1
check_repl_delay=0
[server2]
hostname=172.25.254.20
candidate_master=1
check_repl_delay=0
[server3]
hostname=172.25.254.30
no_master=1
# Restart MHA monitoring
[root@MHA-M ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
[1] 2537
# Back to a single master
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
172.25.254.10(172.25.254.10:3306) (current master)
+--172.25.254.20(172.25.254.20:3306)
+--172.25.254.30(172.25.254.30:3306)
..................
MySQL Replication Health is OK.
Retest
This time add the --remove_dead_master_conf option
| MHA version | dry-run support | Alternative |
|---|---|---|
| MHA 0.56+ | supports --dry-run | use it directly |
| MHA 0.55 and earlier | not supported | test with --interactive |
bash
# Simulate a failure
# Stop MHA-Manager first, then do the manual switch, to avoid conflicts.
[root@MHA-M ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1
[root@mysql-node1 ~]# systemctl stop mysqld.service
[root@MHA-M ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=dead --dead_master_host=172.25.254.10 --dead_master_port=3306 --new_master_host=172.25.254.20 --new_master_port=3306 --remove_dead_master_conf --ignore_last_failover
# Still ends in a dual-master fault
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Master 172.25.254.10(172.25.254.10:3306), dead
Master 172.25.254.20(172.25.254.20:3306), replicating from 172.25.254.10(172.25.254.10:3306)
An AI claims this version's GTID mode has a bug.
That feels like hand-waving to me. Its claims, for the record:
| Question | Claimed answer |
|---|---|
| Why does it keep failing? | MHA 0.58 has a GTID-mode bug: the new master's replication config is not cleaned up automatically |
| Is it an operator error? | No, it is an MHA version bug |
| How to fix it? | Run RESET SLAVE ALL manually after every switch |
| Upgrade MHA? | Strongly recommends 0.60+ |
| Version | Status | Notes |
|---|---|---|
| 0.58 | ❌ buggy | current version; GTID mode does not clean up the slave config |
| 0.59 | ❌ not fixed | transitional release, same problem |
| 0.60 | ✅ fixed | claimed official fix for the GTID bug |
| 0.61 | ✅ recommended | claimed further improvements, more stable |
| Feature | 0.58 | 0.60+ |
|---|---|---|
| GTID auto-cleanup of slave config | ❌ no | ✅ yes |
| --dry-run option | ❌ no | ✅ yes |
| Perl 5.32+ compatibility | ❌ manual patch needed | ✅ native |
| RHEL 9 support | ⚠️ needs patches | ✅ full |
3.3 Automatic failover when the master dies
bash
# MySQL topology
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
172.25.254.20(172.25.254.20:3306) (current master)
+--172.25.254.10(172.25.254.10:3306)
+--172.25.254.30(172.25.254.30:3306)
# Delete the failover lock file
[root@MHA-M ~]# find / -name app1.failover.complete
/etc/masterha/app1.failover.complete
[root@MHA-M ~]# rm -rf /etc/masterha/app1.failover.complete
# Start MHA monitoring
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
[1] 3135
[root@MHA-M ~]# ps aux | grep masterha_manager
root 3135 0.0 1.1 55980 42624 pts/0 S 12:04 0:00 perl /usr/bin/masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover
root 3406 0.0 0.0 6412 2304 pts/0 S+ 12:12 0:00 grep --color=auto masterha_manager
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:3135) is running(0:PING_OK), master:172.25.254.20
# Failure test
[root@mysql-node2 ~]# systemctl stop mysqld.service
[root@mysql-node2 ~]# systemctl status mysqld.service | grep -i active
Active: inactive (dead) since Fri 2026-03-06 13:15:03 CST; 23s ago
# Check whether the failover succeeded
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Master 172.25.254.20(172.25.254.20:3306), dead
Master 172.25.254.10(172.25.254.10:3306), replicating from 172.25.254.20(172.25.254.20:3306)
Fri Mar 6 15:15:44 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm line 329.
Fri Mar 6 15:15:44 2026 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Mar 6 15:15:44 2026 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
[root@MHA-M ~]# sh check_all_nodes.sh
==============================================
Checking replication status on all MySQL nodes
==============================================
Node: 172.25.254.10
----------------------------------------------
Host Server_ID GTID_Mode
Master_Host: 172.25.254.20
Slave_IO_Running: No
Slave_SQL_Running: No
Last_Errno: 0
Slave_SQL_Running_State:
Auto_Position: 1
Node: 172.25.254.20
----------------------------------------------
Node: 172.25.254.30
----------------------------------------------
Host Server_ID GTID_Mode
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Auto_Position: 1
# Replication still works, though
[root@mysql-node1 ~]# mysql -p123 -e "insert into slave_test2.userlist values('user2','456'); select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
| user2 | 456 |
+-------+----------+
[root@mysql-node3 ~]# mysql -p123 -e "select * from slave_test2.userlist;"
+-------+----------+
| user | password |
+-------+----------+
| user1 | 123 |
| user2 | 456 |
+-------+----------+
Cleaning up the leftovers
Eliminate the GTID hazard at the root
bash
# On the new master (node 10), wipe the GTID history completely
[root@mysql-node1 ~]# mysql -p123 -e "stop slave;reset slave all;reset master;"
# node 30
[root@mysql-node3 ~]# mysql -p123 -e "stop slave;reset slave all;reset master;change master to master_port=3306,master_host='172.25.254.10',master_user='repl',master_password='123',master_auto_position=1;start slave;show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
mysql: [Warning] Using a password on the command line interface can be insecure.
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
[root@mysql-node3 ~]#
Rejoin the old master 20 to the cluster as a new slave
bash
[root@mysql-node2 ~]# systemctl start mysqld.service
[root@mysql-node2 ~]# mysql -p123 -e "stop slave;reset slave all;reset master;change master to master_port=3306,master_host='172.25.254.10',master_user='repl',master_password='123',master_auto_position=1;start slave;show slave status\G" | egrep -i "master_host|slave_(io|sql)_running"
mysql: [Warning] Using a password on the command line interface can be insecure.
Master_Host: 172.25.254.10
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Add node 20 back to the MHA configuration file
bash
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
[root@MHA-M ~]# cat /etc/masterha/app1.cnf
[server default]
# MySQL monitoring user
user=mha
password=123
# SSH user (root recommended, avoids permission problems)
ssh_user=root
# Replication user
repl_user=repl
repl_password=123
# Directories
master_binlog_dir=/data/mysql
remote_workdir=/tmp
manager_workdir=/etc/masterha
manager_log=/etc/masterha/mha.log
# Secondary check: gateway plus a slave (with more budget, use two independent monitor hosts)
secondary_check_script= masterha_secondary_check -s 172.25.254.30
# Monitoring interval
ping_interval=3
[server1]
hostname=172.25.254.10
candidate_master=1
check_repl_delay=0
[server2]
hostname=172.25.254.20
candidate_master=1
check_repl_delay=0
[server3]
hostname=172.25.254.30
no_master=1
Fixing the missing environment variable
Make the system able to find the mysqlbinlog command
bash
[root@MHA-M ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Can't exec "mysqlbinlog": No such file or directory at /usr/share/perl5/vendor_perl/MHA/BinlogManager.pm line 106.
# The system cannot find the mysqlbinlog command.
[root@mysql-node1 ~]# find / -name mysqlbinlog 2>/dev/null
/root/mysql-8.3.0/build/runtime_output_directory/mysqlbinlog
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node1 ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
[root@mysql-node1 ~]# ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
[root@mysql-node1 ~]# ls -l /usr/local/bin/mysqlbinlog
lrwxrwxrwx 1 root root 32 Mar 6 17:06 /usr/local/bin/mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
[root@mysql-node1 ~]# mysqlbinlog --version
mysqlbinlog Ver 8.3.0 for Linux on x86_64 (Source distribution)
[root@mysql-node2 ~]# find / -name mysqlbinlog 2>/dev/null
/root/mysql-8.3.0/build/runtime_output_directory/mysqlbinlog
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node2 ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
[root@mysql-node2 ~]# ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
[root@mysql-node2 ~]# ls -l /usr/local/bin/mysqlbinlog
lrwxrwxrwx 1 root root 32 Mar 6 17:06 /usr/local/bin/mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
[root@mysql-node2 ~]# mysqlbinlog --version
mysqlbinlog Ver 8.3.0 for Linux on x86_64 (Source distribution)
[root@mysql-node3 ~]# find / -name mysqlbinlog 2>/dev/null
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node3 ~]# which mysqlbinlog
/usr/local/mysql/bin/mysqlbinlog
[root@mysql-node3 ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
[root@mysql-node3 ~]# ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
[root@mysql-node3 ~]# ls -l /usr/local/bin/mysqlbinlog
lrwxrwxrwx 1 root root 32 Mar 6 17:06 /usr/local/bin/mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
[root@mysql-node3 ~]# mysqlbinlog --version
mysqlbinlog Ver 8.3.0 for Linux on x86_64 (Source distribution)
Fixing the relay-log.info problem
Manually build/repair the relay-log.info file; MySQL 8.0+ no longer generates it by default, which leaves MHA unable to read a slave's relay-log position.
bash
[root@mysql-node2 ~]# mysql -u mha -p123 -h 172.25.254.20 -e "SHOW SLAVE STATUS\G" > /tmp/slave_status.txt
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@mysql-node2 ~]# RELAY_LOG_FILE=$(grep "Relay_Log_File" /tmp/slave_status.txt | awk '{print $2}') &&RELAY_LOG_POS=$(grep "Relay_Log_Pos" /tmp/slave_status.txt | awk '{print $2}')
[root@mysql-node2 ~]# echo "$RELAY_LOG_FILE" > /data/mysql/relay-log.info &&echo "$RELAY_LOG_POS" >> /data/mysql/relay-log.info
[root@mysql-node2 ~]# cat /data/mysql/relay-log.info
mysql-node2-relay-bin.000004
375
[root@mysql-node2 ~]# chown mysql:mysql /data/mysql/relay-log.info
[root@mysql-node3 ~]# mysql -u mha -p123 -h 172.25.254.30 -e "SHOW SLAVE STATUS\G" > /tmp/slave_status.txt
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@mysql-node3 ~]# RELAY_LOG_FILE=$(grep "Relay_Log_File" /tmp/slave_status.txt | awk '{print $2}') &&RELAY_LOG_POS=$(grep "Relay_Log_Pos" /tmp/slave_status.txt | awk '{print $2}')
[root@mysql-node3 ~]# echo "$RELAY_LOG_FILE" > /data/mysql/relay-log.info &&echo "$RELAY_LOG_POS" >> /data/mysql/relay-log.info
[root@mysql-node3 ~]# cat /data/mysql/relay-log.info
mysql-node3-relay-bin.000002
[root@mysql-node3 ~]# chown mysql:mysql /data/mysql/relay-log.info
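The two-variable dance above can be folded into a single awk pass; the function below rebuilds relay-log.info from any saved SHOW SLAVE STATUS dump. Demo input is used here so the sketch is self-contained (file names and values are stand-ins, not from a live node):

```shell
# Rebuild relay-log.info (relay log file name + position, one per line)
# from a saved `SHOW SLAVE STATUS\G` dump in one awk pass.
build_relay_info() {  # $1 = status dump, $2 = output path
  awk -F': *' '/Relay_Log_File:/{f=$2} /Relay_Log_Pos:/{p=$2} END{print f; print p}' "$1" > "$2"
}
# Demo input standing in for real SHOW SLAVE STATUS output:
printf '  Relay_Log_File: mysql-node2-relay-bin.000004\n  Relay_Log_Pos: 375\n' > /tmp/slave_status_demo.txt
build_relay_info /tmp/slave_status_demo.txt /tmp/relay-log.info.demo
cat /tmp/relay-log.info.demo
# prints:
# mysql-node2-relay-bin.000004
# 375
```

On a real slave the dump would come from `mysql ... -e "SHOW SLAVE STATUS\G" > /tmp/slave_status.txt`, with the output path set to /data/mysql/relay-log.info followed by the chown shown above.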
Starting monitoring next time
bash
# Clear the lock files
[root@MHA-M masterha]# rm -rf app1.failover.complete manager.log
# Re-run the ssh and repl checks
# Start background monitoring
[root@MHA-M ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@MHA-M ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover > /etc/masterha/mha.log 2>&1 &
4 VIP implementation
The VIP is implemented with scripts, wired in through the script parameters in MHA's /etc/masterha/app1.cnf configuration file.
bash
# The scripts ship in the teacher-provided archive
[root@mha ~]# ls MHA-7/master_ip_*
MHA-7/master_ip_failover MHA-7/master_ip_online_change
[root@MHA-M ~]# mkdir /etc/masterha/scripts
[root@MHA-M ~]# cp MHA-7/master_ip_* /etc/masterha/scripts
[root@MHA-M ~]# vim /etc/masterha/app1.cnf
master_ip_failover_script= /etc/masterha/scripts/master_ip_failover
master_ip_online_change_script= /etc/masterha/scripts/master_ip_online_change
# Set the VIP
[root@MHA-M ~]# vim /etc/masterha/scripts/master_ip_failover
my $vip = '172.25.254.100/24';
[root@MHA-M ~]# vim /etc/masterha/scripts/master_ip_online_change
my $vip = '172.25.254.100/24';
# Add the virtual IP to the NIC; it takes effect immediately.
[root@mysql-node1 ~]# ip a a 172.25.254.100/24 dev eth0
# Verify monitoring
[root@mha ~]# masterha_manager --conf=/etc/masterha/app1.cnf &
[root@mha ~]# jobs
[1]+ Running                 masterha_manager --conf=/etc/masterha/app1.cnf &
# Test:
# Stop the MySQL master
[root@mysql-node1 ~]# /etc/init.d/mysqld stop
[root@mysql-node2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:0c:29:e8:4b:64 brd ff:ff:ff:ff:ff:ff
altname enp3s0
altname ens160
inet 172.25.254.20/24 brd 172.25.254.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet 172.25.254.100/24 scope global secondary eth0
valid_lft forever preferred_lft forever
inet6 fe80::f8be:d443:72d7:d336/64 scope link noprefixroute
valid_lft forever preferred_lft forever
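What the two scripts do with $vip typically boils down to removing it from the old master and adding it on the new one over SSH. A dry-run sketch that only prints those commands (nothing here touches a real NIC; the IPs are this lab's, the function name is made up for illustration):

```shell
# Dry-run sketch of the VIP handoff a failover script performs:
# prints the ssh commands instead of executing them.
VIP="172.25.254.100/24"
vip_cmd() {  # $1 = start|stop, $2 = host
  case "$1" in
    start) echo "ssh root@$2 ip addr add $VIP dev eth0" ;;
    stop)  echo "ssh root@$2 ip addr del $VIP dev eth0" ;;
  esac
}
vip_cmd stop  172.25.254.10   # take the VIP off the failed master
vip_cmd start 172.25.254.20   # bring it up on the new master
```

Dropping the echoes and running the ssh lines directly is essentially what the failover/online-change scripts do when MHA invokes them, which is why the VIP followed the master to node 20 in the test above.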