RabbitMQ Cluster Deployment and Configuration Guide, Part 05

Migrating from Mirrored Queues to Quorum Queues, with a Mixed Deployment Strategy

A complete migration plan and mixed deployment strategy for upgrading the architecture with zero downtime and zero message loss.

1. Migration Risk Assessment and Preparation

1.1 Compatibility Analysis

| Component | Mirrored-queue compatibility | Quorum-queue compatibility | Migration impact |
| --- | --- | --- | --- |
| RabbitMQ version | All versions | Requires 3.8.0 or later | Upgrade required |
| Client libraries | Fully compatible | Must support the x-queue-type argument | Low risk |
| Management tooling | Fully compatible | Plugin updates required | Medium risk |
| Monitoring | Classic metrics | Raft metrics required | Adjustments required |
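A quick way to confirm the version requirement in the table above is to query each node's management API. A minimal sketch, assuming the rabbitmq_management plugin is enabled and the admin/AdminPassword123 account used elsewhere in this guide exists:

```python
# check_broker_versions.py - sketch: confirm every node is >= 3.8.0 before planning the migration
import requests

NODES = ["192.168.10.101", "192.168.10.102", "192.168.10.103"]  # mirrored-queue nodes
AUTH = ("admin", "AdminPassword123")                             # assumed credentials

def version_tuple(v):
    # "3.12.12" -> (3, 12, 12); ignore any suffix after '-'
    return tuple(int(p) for p in v.split("-")[0].split("."))

for host in NODES:
    overview = requests.get(f"http://{host}:15672/api/overview", auth=AUTH, timeout=5).json()
    version = overview["rabbitmq_version"]
    ok = version_tuple(version) >= (3, 8, 0)
    print(f"{host}: RabbitMQ {version} {'OK' if ok else 'needs upgrade before quorum queues'}")
```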

1.2 Data Audit Checklist

Complete the following audit before starting the migration:

#!/bin/bash
# audit_mirrored_queues.sh - mirrored queue audit script

echo "=== Mirrored Queue Cluster Audit Report ==="
echo "Generated at: $(date)"
echo "=========================="

# 1. Mirrored queue statistics
echo "1. Queue statistics:"
sudo rabbitmqctl list_queues name messages policy type --formatter json | jq '
  map(select(.type == "classic" or .type == null)) |
  {
    "total_queues": length,
    "mirrored_queues": map(select(.policy != null)) | length,
    "non_mirrored_queues": map(select(.policy == null)) | length,
    "total_messages": map(.messages // 0) | add,
    "avg_messages_per_queue": (map(.messages // 0) | add) / length
  }'

# 2. Detailed queue listing
echo -e "\n2. Detailed queue listing:"
sudo rabbitmqctl list_queues name messages policy state pid slave_pids consumers durable auto_delete | \
awk 'NR==1 {print} NR>1 {
  queue=$1; msgs=$2; pol=$3; st=$4; pid=$5; slaves=$6; cons=$7; dur=$8; ad=$9;
  printf "%-30s %-8s %-15s %-10s %-20s %-20s %-5s %-5s %-5s\n",
    queue, msgs, pol, st, pid, slaves, cons, dur, ad
}' | head -20

# 3. Policy audit
echo -e "\n3. Mirroring policy configuration:"
sudo rabbitmqctl list_policies --formatter json | jq '
  map({
    vhost: .vhost,
    name: .name,
    pattern: .pattern,
    definition: .definition,
    apply_to: .apply_to
  })'

# 4. Consumer audit
echo -e "\n4. Consumer distribution:"
sudo rabbitmqctl list_consumers --formatter json | jq '
  group_by(.queue_name) |
  map({
    queue: .[0].queue_name,
    consumer_count: length,
    channels: map(.channel_pid) | unique | length
  }) | sort_by(-.consumer_count)'

# 5. Rough production-rate estimate
echo -e "\n5. Queues whose production rate should be monitored:"
sudo rabbitmqctl list_queues name messages messages_ready messages_unacknowledged --formatter json | \
jq -r '.[] | select(.messages > 1000) | "Queue \(.name): current message count \(.messages)"'

# 6. Export current definitions
echo -e "\n6. Exporting definitions backup..."
sudo rabbitmqctl export_definitions /tmp/rabbitmq_backup_$(date +%Y%m%d_%H%M%S).json
echo "Definitions backed up to: /tmp/rabbitmq_backup_*.json"

2. Mixed Deployment Architecture Design

2.1 Mixed Cluster Architecture

Node plan for the mixed cluster (6-node example):

Mirrored queue group (3 nodes):

  • mq-mirror-01: 192.168.10.101
  • mq-mirror-02: 192.168.10.102
  • mq-mirror-03: 192.168.10.103

Quorum queue group (3 nodes):

  • mq-quorum-01: 192.168.10.201
  • mq-quorum-02: 192.168.10.202
  • mq-quorum-03: 192.168.10.203

Load balancer:

  • haproxy-01: 192.168.10.100 (VIP)
  • Port mapping:
  • 5670: mixed AMQP port
  • 15670: mixed management port

Data flow:

  • Clients -> HAProxy -> dispatched to a backend by routing rules (see the client connection sketch below)
  • Monitoring covers both node groups
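From the application's side the split is invisible: clients connect to the VIP and the queue type they declare decides which cluster serves them. A minimal pika sketch, assuming the VIP address, port 5670 and the app_user account that appear later in this guide:

```python
# connect_via_vip.py - sketch: declare a quorum queue through the HAProxy VIP
import pika

credentials = pika.PlainCredentials("app_user", "AppPassword123")   # assumed account
params = pika.ConnectionParameters(host="192.168.10.100", port=5670,
                                   credentials=credentials, heartbeat=600)

with pika.BlockingConnection(params) as connection:
    channel = connection.channel()
    # The x-queue-type argument is what ultimately decides which node group owns the queue.
    channel.queue_declare(
        queue="orders.payment.quorum",
        durable=True,
        arguments={"x-queue-type": "quorum", "x-quorum-initial-group-size": 3},
    )
    channel.basic_publish(exchange="", routing_key="orders.payment.quorum", body=b"hello")
```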

2.2 Mixed Deployment Configuration

Step 1: Deploy the quorum queue node group

#!/bin/bash
# deploy_quorum_nodes.sh - deploy the quorum queue node group

QUORUM_NODES=("mq-quorum-01" "mq-quorum-02" "mq-quorum-03")
QUORUM_IPS=("192.168.10.201" "192.168.10.202" "192.168.10.203")
MIRROR_CLUSTER="rabbit@mq-mirror-01"

# 1. Install RabbitMQ 3.12.x on the new nodes
for i in "${!QUORUM_NODES[@]}"; do
    node=${QUORUM_NODES[$i]}
    ip=${QUORUM_IPS[$i]}

    echo "Deploying node: $node ($ip)"

    # Run the installation on the target node over SSH
    ssh root@$ip << EOF
# Set the hostname
hostnamectl set-hostname $node

# Add hosts entries
echo "192.168.10.101 mq-mirror-01" >> /etc/hosts
echo "192.168.10.102 mq-mirror-02" >> /etc/hosts
echo "192.168.10.103 mq-mirror-03" >> /etc/hosts
echo "$ip $node" >> /etc/hosts

# Install RabbitMQ 3.12.x (see the quorum queue installation steps)
yum install -y erlang-25.3.2.6-1.el7
yum install -y rabbitmq-server-3.12.12-1.el7

# Copy the Erlang cookie from the mirrored cluster's first node
scp mq-mirror-01:/var/lib/rabbitmq/.erlang.cookie /var/lib/rabbitmq/
chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
chmod 400 /var/lib/rabbitmq/.erlang.cookie

# Quorum-queue-oriented settings (quorum queues themselves need no extra switch
# once the quorum_queue feature flag is enabled)
cat > /etc/rabbitmq/rabbitmq.conf << CONFIGEOF
# Do not join the existing mirrored cluster automatically (keep the groups separate)
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@mq-quorum-01
cluster_formation.classic_config.nodes.2 = rabbit@mq-quorum-02
cluster_formation.classic_config.nodes.3 = rabbit@mq-quorum-03

# Performance tuning
raft.segment_max_entries = 65536
raft.wal_max_size_bytes = 104857600

# Network partition handling
cluster_partition_handling = ignore
CONFIGEOF

# Start the service
systemctl start rabbitmq-server
systemctl enable rabbitmq-server

# Enable the management plugin
rabbitmq-plugins enable rabbitmq_management
EOF

done

# 2. Form the standalone quorum queue cluster
echo "Forming the quorum queue cluster..."
ssh root@${QUORUM_IPS[0]} "rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl start_app"
ssh root@${QUORUM_IPS[1]} "rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl join_cluster rabbit@mq-quorum-01; rabbitmqctl start_app"
ssh root@${QUORUM_IPS[2]} "rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl join_cluster rabbit@mq-quorum-01; rabbitmqctl start_app"

echo "Quorum queue cluster deployment complete"

Step 2: Configure the mixed load balancer

# haproxy_mixed_cluster.cfg

global
    log /dev/log local0
    maxconn 10000
    user haproxy
    group haproxy
    daemon
    stats socket /var/run/haproxy.sock mode 660 level admin

defaults
    log global
    mode tcp
    option tcplog
    option dontlognull
    retries 3
    timeout connect 5s
    timeout client 50s
    timeout server 50s
    timeout check 10s

# Health-check / stats frontend
frontend health_dashboard
    bind *:8888
    mode http
    stats enable
    stats uri /haproxy?stats
    stats refresh 30s
    stats auth admin:MixedCluster2024

# Mixed AMQP frontend - dispatch based on routing rules
frontend mixed_amqp_frontend
    bind *:5670
    mode tcp
    tcp-request inspect-delay 5s
    tcp-request content accept if { req_ssl_hello_type 1 }

    # Routing rules: rough payload inspection for quorum queue declarations
    use_backend quorum_backend if { req.payload(0,20) -m reg -i ^declare.*queue.*quorum }
    use_backend quorum_backend if { req.payload(0,20) -m reg -i ^queue.declare.*x-queue-type.*quorum }

    # Default route goes to the mirrored queue cluster (backwards compatible)
    default_backend mirror_backend

# Mirrored queue backend
backend mirror_backend
    mode tcp
    balance leastconn
    option tcp-check
    tcp-check connect port 5672
    tcp-check send "PING\r\n"
    tcp-check expect string "AMQP"
    server mq-mirror-01 192.168.10.101:5672 check inter 2s rise 2 fall 3
    server mq-mirror-02 192.168.10.102:5672 check inter 2s rise 2 fall 3
    server mq-mirror-03 192.168.10.103:5672 check inter 2s rise 2 fall 3

# Quorum queue backend
backend quorum_backend
    mode tcp
    balance leastconn
    option tcp-check
    tcp-check connect port 5672
    tcp-check send "PING\r\n"
    tcp-check expect string "AMQP"
    server mq-quorum-01 192.168.10.201:5672 check inter 1s rise 2 fall 2
    server mq-quorum-02 192.168.10.202:5672 check inter 1s rise 2 fall 2
    server mq-quorum-03 192.168.10.203:5672 check inter 1s rise 2 fall 2

# Mixed management UI
listen mixed_management
    bind *:15670
    mode http
    balance roundrobin
    option httpchk GET /api/health/checks/node-is-mirror-sync-critical

    # Mirrored cluster management nodes
    server mq-mirror-01 192.168.10.101:15672 check inter 5s rise 2 fall 3
    server mq-mirror-02 192.168.10.102:15672 check inter 5s rise 2 fall 3
    server mq-mirror-03 192.168.10.103:15672 check inter 5s rise 2 fall 3

    # Quorum cluster management nodes
    server mq-quorum-01 192.168.10.201:15672 check inter 5s rise 2 fall 3
    server mq-quorum-02 192.168.10.202:15672 check inter 5s rise 2 fall 3
    server mq-quorum-03 192.168.10.203:15672 check inter 5s rise 2 fall 3

3. Four-Phase Migration Plan

Phase 1: Parallel Operation with Dual Writes (2-4 weeks)

1.1 Dual-Write Changes on the Producer Side

# dual_write_producer.py - dual-write producer

import pika
import json
import logging
from datetime import datetime
from retry import retry

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DualWriteProducer:

    def __init__(self):
        # Connection handles
        self.mirror_connection = None
        self.quorum_connection = None
        # Load balancer address
        self.lb_host = "192.168.10.100"
        self.lb_port = 5670

    @retry(tries=3, delay=1)
    def connect_mirror_cluster(self):
        """Connect to the mirrored queue cluster."""
        credentials = pika.PlainCredentials('app_user', 'AppPassword123')
        # Connect to the mirrored cluster directly (bypassing the load balancer)
        mirror_nodes = [
            ('192.168.10.101', 5672),
            ('192.168.10.102', 5672),
            ('192.168.10.103', 5672)
        ]
        for host, port in mirror_nodes:
            try:
                params = pika.ConnectionParameters(
                    host=host, port=port,
                    credentials=credentials,
                    heartbeat=300,
                    connection_attempts=2
                )
                return pika.BlockingConnection(params)
            except Exception as e:
                logger.warning(f"Cannot connect to mirror node {host}:{port}: {e}")
                continue
        raise Exception("Unable to connect to any mirrored queue node")

    @retry(tries=3, delay=1)
    def connect_quorum_cluster(self):
        """Connect to the quorum queue cluster."""
        credentials = pika.PlainCredentials('app_user', 'AppPassword123')
        # Connect to the quorum cluster through the load balancer
        params = pika.ConnectionParameters(
            host=self.lb_host,
            port=self.lb_port,
            credentials=credentials,
            heartbeat=600,
            socket_timeout=30
        )
        # Client properties advertising that quorum queues are required
        params.client_properties = {
            'connection_name': 'quorum-migration-client',
            'queue_type': 'quorum'
        }
        return pika.BlockingConnection(params)

    def dual_write(self, queue_name, message, is_critical=True):
        """
        Write the message to both the mirrored queue and the quorum queue.

        Args:
            queue_name: queue name
            message: message payload
            is_critical: whether this is critical traffic (decides the write order)
        """
        message_id = f"msg_{datetime.now().strftime('%Y%m%d_%H%M%S_%f')}"
        enriched_message = {
            'id': message_id,
            'data': message,
            'timestamp': datetime.now().isoformat(),
            'source': 'dual-write-producer'
        }
        message_body = json.dumps(enriched_message, ensure_ascii=False)

        # Track the write results
        results = {
            'mirror_success': False,
            'quorum_success': False,
            'errors': []
        }

        try:
            # Decide the write order by business criticality
            if is_critical:
                # Critical traffic: write to the quorum queue first (more reliable)
                results.update(self._write_to_quorum(queue_name, message_body))
                results.update(self._write_to_mirror(queue_name, message_body))
            else:
                # Non-critical traffic: write to the mirrored queue first (backwards compatible)
                results.update(self._write_to_mirror(queue_name, message_body))
                results.update(self._write_to_quorum(queue_name, message_body))

            # Check the write results
            if not results['mirror_success'] and results['quorum_success']:
                logger.warning(f"Message {message_id} written to the quorum queue only; mirrored write failed")
            elif results['mirror_success'] and not results['quorum_success']:
                logger.warning(f"Message {message_id} written to the mirrored queue only; quorum write failed")
            elif not results['mirror_success'] and not results['quorum_success']:
                raise Exception(f"Dual write failed: {results['errors']}")
            else:
                logger.info(f"Message {message_id} dual write succeeded")
            return results

        except Exception as e:
            logger.error(f"Dual write error: {e}")
            # Record the failure and raise an alert
            self._alert_failure(queue_name, message_id, str(e))
            raise

    def _write_to_mirror(self, queue_name, message_body):
        """Write to the mirrored (classic) queue."""
        try:
            if not self.mirror_connection or self.mirror_connection.is_closed:
                self.mirror_connection = self.connect_mirror_cluster()
            channel = self.mirror_connection.channel()
            # Declare the classic queue (covered by the existing mirroring policy)
            channel.queue_declare(
                queue=queue_name,
                durable=True,
                arguments={'x-queue-type': 'classic'}  # explicitly a classic queue
            )
            # Publish the message
            channel.basic_publish(
                exchange='',
                routing_key=queue_name,
                body=message_body.encode('utf-8'),
                properties=pika.BasicProperties(
                    delivery_mode=2,
                    content_type='application/json',
                    headers={'queue_type': 'mirror'}
                )
            )
            return {'mirror_success': True}
        except Exception as e:
            logger.error(f"Write to mirrored queue failed: {e}")
            return {'mirror_success': False, 'errors': [f'mirror: {str(e)}']}

    def _write_to_quorum(self, queue_name, message_body):
        """Write to the quorum queue."""
        try:
            if not self.quorum_connection or self.quorum_connection.is_closed:
                self.quorum_connection = self.connect_quorum_cluster()
            channel = self.quorum_connection.channel()
            # Declare the quorum queue
            channel.queue_declare(
                queue=f"{queue_name}.quorum",  # suffix distinguishes it from the classic queue
                durable=True,
                arguments={
                    'x-queue-type': 'quorum',
                    'x-quorum-initial-group-size': 3,
                    'x-delivery-limit': 5
                }
            )
            # Publish the message
            channel.basic_publish(
                exchange='',
                routing_key=f"{queue_name}.quorum",
                body=message_body.encode('utf-8'),
                properties=pika.BasicProperties(
                    delivery_mode=2,
                    content_type='application/json',
                    headers={'queue_type': 'quorum'}
                )
            )
            return {'quorum_success': True}
        except Exception as e:
            logger.error(f"Write to quorum queue failed: {e}")
            return {'quorum_success': False, 'errors': [f'quorum: {str(e)}']}

    def _alert_failure(self, queue_name, message_id, error):
        """Raise an alert, e.g. by pushing it to the monitoring system."""
        alert_msg = {
            'level': 'ERROR',
            'type': 'DUAL_WRITE_FAILURE',
            'queue': queue_name,
            'message_id': message_id,
            'error': error,
            'timestamp': datetime.now().isoformat()
        }
        logger.error(f"Alert: {json.dumps(alert_msg)}")
        # Could be forwarded to a monitoring system, e.g.:
        # send_to_monitoring_system(alert_msg)

# Usage example
if __name__ == "__main__":
    producer = DualWriteProducer()

    # Test the dual write
    test_message = {
        'order_id': 'ORD-2024-001',
        'amount': 199.99,
        'customer': 'john.doe@example.com'
    }

    # Critical business message
    result = producer.dual_write(
        queue_name="orders.payment",
        message=test_message,
        is_critical=True
    )
    print(f"Dual write result: {result}")

1.2 Data Consistency Verification Script

#!/bin/bash
# verify_data_consistency.sh - verify dual-write data consistency

echo "=== 双写数据一致性验证 ==="

echo "开始时间: $(date)"

echo "========================="

# Configuration
MIRROR_NODE="192.168.10.101"
QUORUM_NODE="192.168.10.201"
QUEUE_PREFIX="orders."  # queue name prefix to verify

# 1. Dump the mirrored queue message counts
echo "1. Fetching the queue list from the mirrored cluster..."
MIRROR_MESSAGES_FILE="/tmp/mirror_messages_$(date +%s).json"
rabbitmqadmin -H $MIRROR_NODE -u admin -p AdminPassword123 list queues name messages --format=raw_json > $MIRROR_MESSAGES_FILE

# 2. Dump the quorum queue message counts
echo "2. Fetching the queue list from the quorum cluster..."
QUORUM_MESSAGES_FILE="/tmp/quorum_messages_$(date +%s).json"
rabbitmqadmin -H $QUORUM_NODE -u admin -p AdminPassword123 list queues name messages --format=raw_json > $QUORUM_MESSAGES_FILE

3. 比较队列消息数

echo "3. 比较消息数量..."

echo "队列名称 | 镜像队列消息数 | 仲裁队列消息数 | 差异"

python3 << EOF

import json

import sys

with open('$MIRROR_MESSAGES_FILE', 'r') as f:

mirror_data = json.load(f)

with open('$QUORUM_MESSAGES_FILE', 'r') as f:

quorum_data = json.load(f)

# Build name -> message-count maps

mirror_map = {item['name']: item['messages'] for item in mirror_data if item['name'].startswith('$QUEUE_PREFIX')}

quorum_map = {item['name'].replace('.quorum', ''): item['messages'] for item in quorum_data if item['name'].startswith('$QUEUE_PREFIX')}

print(f"{'队列名称':<30} {'镜像队列':<12} {'仲裁队列':<12} {'差异':<8}")

print("-" * 65)

total_diff = 0

for queue in set(list(mirror_map.keys()) + list(quorum_map.keys())):

mirror_count = mirror_map.get(queue, 0)

quorum_count = quorum_map.get(queue, 0)

diff = abs(mirror_count - quorum_count)

if diff > 0:

status = "❌"

total_diff += diff

else:

status = "✅"

print(f"{queue:<30} {mirror_count:<12} {quorum_count:<12} {diff:<8} {status}")

print(f"\n总差异消息数: {total_diff}")

sys.exit(0 if total_diff == 0 else 1)

EOF

CONSISTENCY_RESULT=$?

4. 详细消息内容抽样比较

if [ $CONSISTENCY_RESULT -ne 0 ]; then

echo -e "\n4. 详细内容抽样比较..."

# Pick the queue with the largest difference for a detailed check

DIFF_QUEUE=$(python3 << EOF

import json

with open('$MIRROR_MESSAGES_FILE', 'r') as f:

mirror_data = json.load(f)

with open('$QUORUM_MESSAGES_FILE', 'r') as f:

quorum_data = json.load(f)

mirror_map = {item['name']: item['messages'] for item in mirror_data}

quorum_map = {item['name'].replace('.quorum', ''): item['messages'] for item in quorum_data}

max_diff = 0

max_queue = ""

for queue in set(list(mirror_map.keys()) + list(quorum_map.keys())):

diff = abs(mirror_map.get(queue, 0) - quorum_map.get(queue, 0))

if diff > max_diff:

max_diff = diff

max_queue = queue

print(max_queue)

EOF

)

echo "检查队列: $DIFF_QUEUE"

# Sample a few messages from each side for comparison
echo "Fetching sample messages from the mirrored queue..."
rabbitmqadmin -H $MIRROR_NODE -u admin -p AdminPassword123 get queue="$DIFF_QUEUE" count=3 --format=raw_json > /tmp/mirror_sample.json

echo "Fetching sample messages from the quorum queue..."
rabbitmqadmin -H $QUORUM_NODE -u admin -p AdminPassword123 get queue="${DIFF_QUEUE}.quorum" count=3 --format=raw_json > /tmp/quorum_sample.json

echo "样本消息比较完成"

fi

# 5. Clean up temporary files
rm -f $MIRROR_MESSAGES_FILE $QUORUM_MESSAGES_FILE /tmp/mirror_sample.json /tmp/quorum_sample.json

echo -e "\n验证完成时间: $(date)"

if [ $CONSISTENCY_RESULT -eq 0 ]; then

echo "✅ 数据一致性验证通过"

else

echo "❌ 数据一致性验证失败,发现差异"

exit 1

fi

Phase 2: Gradual Consumer Migration (3-6 weeks)

2.1 Smart Consumer Proxy

# smart_consumer_proxy.py - smart consumer proxy

import pika

import json

import threading

import time

import logging

from collections import deque

from datetime import datetime

logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)

class SmartConsumerProxy:

def __init__(self, config):

"""

智能消费代理

Args:

config: 配置字典

  • mirror_nodes: 镜像节点列表

  • quorum_nodes: 仲裁节点列表

  • queue_mapping: 队列映射 {镜像队列: 仲裁队列}

  • migration_progress: 迁移进度 {队列: 百分比}

"""

self.config = config

self.mirror_connection = None

self.quorum_connection = None

消费状态跟踪

self.consumption_stats = {

'mirror': 0,

'quorum': 0,

'last_switch_time': {},

'errors': deque(maxlen=1000)

}

启动监控线程

self.monitor_thread = threading.Thread(target=self._monitor_consumption, daemon=True)

self.monitor_thread.start()

def consume_with_gradual_migration(self, queue_name, callback, migration_ratio=0.1):

"""

逐步迁移消费

Args:

queue_name: 队列名

callback: 消息处理回调函数

migration_ratio: 初始迁移比例 (0.1 = 10%流量到仲裁队列)

"""

mirror_queue = queue_name

quorum_queue = f"{queue_name}.quorum"

启动镜像队列消费者(主)

mirror_thread = threading.Thread(

target=self._consume_from_mirror,

args=(mirror_queue, callback, migration_ratio),

daemon=True

)

启动仲裁队列消费者(逐步增加)

quorum_thread = threading.Thread(

target=self._consume_from_quorum,

args=(quorum_queue, callback, migration_ratio),

daemon=True

)

mirror_thread.start()

quorum_thread.start()

logger.info(f"开始逐步迁移消费队列 {queue_name}, 初始仲裁比例: {migration_ratio*100}%")

返回控制句柄,用于动态调整比例

return {

'queue': queue_name,

'mirror_thread': mirror_thread,

'quorum_thread': quorum_thread,

'adjust_ratio': self.adjust_migration_ratio

}

def _consume_from_mirror(self, queue_name, callback, migration_ratio):

"""从镜像队列消费(主要流量)"""

try:

connection = self._get_mirror_connection()

channel = connection.channel()

设置QoS

channel.basic_qos(prefetch_count=10)

消费函数包装器

def wrapped_callback(ch, method, properties, body):

try:

根据迁移比例决定是否处理

import random

if random.random() < migration_ratio:

这部分流量应该由仲裁队列处理,跳过

ch.basic_ack(delivery_tag=method.delivery_tag)

return

处理消息

message = json.loads(body.decode('utf-8'))

result = callback(message, source='mirror')

if result:

ch.basic_ack(delivery_tag=method.delivery_tag)

self.consumption_stats['mirror'] += 1

else:

处理失败,重新入队

ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

except Exception as e:

logger.error(f"处理镜像队列消息失败: {e}")

ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

self.consumption_stats['errors'].append({

'queue': queue_name,

'source': 'mirror',

'error': str(e),

'timestamp': datetime.now().isoformat()

})

开始消费

channel.basic_consume(

queue=queue_name,

on_message_callback=wrapped_callback,

auto_ack=False

)

logger.info(f"开始从镜像队列消费: {queue_name}")

channel.start_consuming()

except Exception as e:

logger.error(f"镜像队列消费异常: {e}")

def _consume_from_quorum(self, queue_name, callback, migration_ratio):

"""从仲裁队列消费(逐步增加)"""

try:

connection = self._get_quorum_connection()

channel = connection.channel()

仲裁队列需要不同的QoS设置

channel.basic_qos(prefetch_count=5)

消费函数包装器

def wrapped_callback(ch, method, properties, body):

try:

处理消息

message = json.loads(body.decode('utf-8'))

result = callback(message, source='quorum')

if result:

ch.basic_ack(delivery_tag=method.delivery_tag)

self.consumption_stats['quorum'] += 1

else:

仲裁队列通常不重新入队

ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

except Exception as e:

logger.error(f"处理仲裁队列消息失败: {e}")

ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

self.consumption_stats['errors'].append({

'queue': queue_name,

'source': 'quorum',

'error': str(e),

'timestamp': datetime.now().isoformat()

})

开始消费

channel.basic_consume(

queue=queue_name,

on_message_callback=wrapped_callback,

auto_ack=False

)

logger.info(f"开始从仲裁队列消费: {queue_name}")

channel.start_consuming()

except Exception as e:

logger.error(f"仲裁队列消费异常: {e}")

def adjust_migration_ratio(self, queue_name, new_ratio):

"""动态调整迁移比例"""

if 0 <= new_ratio <= 1:

logger.info(f"调整队列 {queue_name} 迁移比例: {new_ratio*100}%")

这里需要实现与消费者线程的通信

可以通过共享变量或消息队列实现

self.config['migration_progress'][queue_name] = new_ratio

return True

return False

def _get_mirror_connection(self):

"""获取镜像队列连接"""

if not self.mirror_connection or self.mirror_connection.is_closed:

credentials = pika.PlainCredentials('app_user', 'AppPassword123')

params = pika.ConnectionParameters(

host=self.config['mirror_nodes'][0],

port=5672,

credentials=credentials,

heartbeat=300

)

self.mirror_connection = pika.BlockingConnection(params)

return self.mirror_connection

def _get_quorum_connection(self):

"""获取仲裁队列连接"""

if not self.quorum_connection or self.quorum_connection.is_closed:

credentials = pika.PlainCredentials('app_user', 'AppPassword123')

params = pika.ConnectionParameters(

host=self.config['quorum_nodes'][0],

port=5672,

credentials=credentials,

heartbeat=600,

socket_timeout=30

)

self.quorum_connection = pika.BlockingConnection(params)

return self.quorum_connection

def _monitor_consumption(self):

"""监控消费状态"""

while True:

time.sleep(60) # 每分钟报告一次

total = self.consumption_stats['mirror'] + self.consumption_stats['quorum']

if total > 0:

mirror_pct = (self.consumption_stats['mirror'] / total) * 100

quorum_pct = (self.consumption_stats['quorum'] / total) * 100

logger.info(

f"消费统计 - 镜像队列: {self.consumption_stats['mirror']} ({mirror_pct:.1f}%) | "

f"仲裁队列: {self.consumption_stats['quorum']} ({quorum_pct:.1f}%) | "

f"错误数: {len(self.consumption_stats['errors'])}"

)

重置计数器(可选,可以改为小时统计)

self.consumption_stats['mirror'] = 0

self.consumption_stats['quorum'] = 0

def get_migration_report(self):

"""生成迁移报告"""

return {

'timestamp': datetime.now().isoformat(),

'consumption_stats': dict(self.consumption_stats),

'migration_progress': self.config.get('migration_progress', {}),

'recent_errors': list(self.consumption_stats['errors'])[-10:] # 最近10个错误

}

# Usage example

if __name__ == "__main__":

config = {

'mirror_nodes': ['192.168.10.101'],

'quorum_nodes': ['192.168.10.201'],

'migration_progress': {

'orders.payment': 0.1, # 10%迁移到仲裁队列

'user.notifications': 0.3 # 30%迁移到仲裁队列

}

}

proxy = SmartConsumerProxy(config)

定义消息处理回调

def process_message(message, source):

print(f"从 {source} 处理消息: {message.get('id', 'unknown')}")

return True # 处理成功

开始逐步迁移消费

handler = proxy.consume_with_gradual_migration(

queue_name="orders.payment",

callback=process_message,

migration_ratio=0.1

)

运行一段时间后调整比例

import time

time.sleep(300) # 运行5分钟

增加仲裁队列消费比例到30%

handler['adjust_ratio']("orders.payment", 0.3)

保持运行

try:

while True:

time.sleep(1)

except KeyboardInterrupt:

print("停止消费代理")

2.2 Migration Progress Control Panel

#!/bin/bash
# migration_control_panel.sh - migration control panel

echo "RabbitMQ迁移控制面板"

echo "====================="

echo "当前时间: $(date)"

echo ""

显示迁移状态

show_migration_status() {

echo "迁移状态概览:"

echo "--------------"

获取队列状态

QUEUE_STATUS=$(python3 << EOF

import json

import requests

from datetime import datetime

def get_queue_stats(host, port, username, password):

"""获取队列统计信息"""

url = f"http://{host}:{port}/api/queues"

response = requests.get(url, auth=(username, password))

return response.json()

镜像集群状态

mirror_stats = get_queue_stats("192.168.10.101", 15672, "admin", "AdminPassword123")

quorum_stats = get_queue_stats("192.168.10.201", 15672, "admin", "AdminPassword123")

分析迁移进度

migration_report = {}

for queue in mirror_stats:

queue_name = queue['name']

mirror_messages = queue['messages']

查找对应的仲裁队列

quorum_queue_name = f"{queue_name}.quorum"

quorum_messages = 0

for qq in quorum_stats:

if qq['name'] == quorum_queue_name:

quorum_messages = qq['messages']

break

total_messages = mirror_messages + quorum_messages

if total_messages > 0:

migration_ratio = quorum_messages / total_messages

status = "🟢" if migration_ratio > 0.9 else \

"🟡" if migration_ratio > 0.5 else \

"🔴" if migration_ratio > 0 else "⚪"

else:

migration_ratio = 0

status = "⚪"

migration_report[queue_name] = {

'mirror': mirror_messages,

'quorum': quorum_messages,

'ratio': migration_ratio,

'status': status

}

输出表格

print(f"{'队列名称':<30} {'镜像消息':<10} {'仲裁消息':<10} {'迁移比例':<10} {'状态':<5}")

print("-" * 70)

for queue, stats in sorted(migration_report.items(), key=lambda x: x[1]['ratio'], reverse=True):

if stats['mirror'] > 0 or stats['quorum'] > 0:

print(f"{queue:<30} {stats['mirror']:<10} {stats['quorum']:<10} {stats['ratio']*100:<9.1f}% {stats['status']:<5}")

EOF

)

echo "$QUEUE_STATUS"

}

# Migration control commands
control_migration() {
    echo ""
    echo "Migration control options:"
    echo "1. Accelerate a queue's migration"
    echo "2. Roll back a queue's migration"
    echo "3. Pause a queue's migration"
    echo "4. Show the detailed report"
    echo "5. Generate a migration plan"

    read -p "Select an option: " choice

    case $choice in
        1)
            read -p "Queue name: " queue_name
            read -p "Target migration ratio (0-100): " target_ratio
            echo "Accelerating migration of queue $queue_name to $target_ratio%"
            # Call the API to adjust the migration ratio
            curl -X POST "http://192.168.10.100:8888/api/migration/accelerate" \
                 -H "Content-Type: application/json" \
                 -d "{\"queue\": \"$queue_name\", \"target_ratio\": $target_ratio}" \
                 -u admin:AdminPassword123
            ;;
        2)
            read -p "Queue name: " queue_name
            read -p "Rollback ratio (0-100): " rollback_ratio
            echo "Rolling back queue $queue_name to $rollback_ratio%"
            # Call the API to roll the migration back
            curl -X POST "http://192.168.10.100:8888/api/migration/rollback" \
                 -H "Content-Type: application/json" \
                 -d "{\"queue\": \"$queue_name\", \"target_ratio\": $rollback_ratio}" \
                 -u admin:AdminPassword123
            ;;
        3)
            read -p "Queue name: " queue_name
            echo "Pausing migration of queue $queue_name"
            # Call the API to pause the migration
            curl -X POST "http://192.168.10.100:8888/api/migration/pause" \
                 -H "Content-Type: application/json" \
                 -d "{\"queue\": \"$queue_name\"}" \
                 -u admin:AdminPassword123
            ;;
        4)
            generate_detailed_report
            ;;
        5)
            generate_migration_plan
            ;;
        *)
            echo "Invalid choice"
            ;;
    esac

}

generate_detailed_report() {

echo "生成详细迁移报告..."

REPORT_FILE="/tmp/migration_report_$(date +%Y%m%d_%H%M%S).html"

python3 << EOF > $REPORT_FILE

import json

import requests

from datetime import datetime

def generate_html_report():

html = '''

<!DOCTYPE html>

<html>

<head>

<title>RabbitMQ迁移报告</title>

<style>

body { font-family: Arial, sans-serif; margin: 20px; }

h1 { color: #333; }

.queue-card {

border: 1px solid #ddd;

padding: 15px;

margin: 10px 0;

border-radius: 5px;

background-color: #f9f9f9;

}

.progress-bar {

height: 20px;

background-color: #e0e0e0;

border-radius: 10px;

overflow: hidden;

margin: 5px 0;

}

.progress-fill {

height: 100%;

background-color: #4CAF50;

transition: width 0.3s;

}

.stats-grid {

display: grid;

grid-template-columns: repeat(3, 1fr);

gap: 20px;

margin: 20px 0;

}

.stat-card {

padding: 15px;

background-color: #fff;

border-radius: 5px;

box-shadow: 0 2px 4px rgba(0,0,0,0.1);

}

</style>

</head>

<body>

<h1>RabbitMQ迁移报告</h1>

<p>生成时间: ${datetime.now().isoformat()}</p>

'''

    # Data collection and HTML-table generation logic would go here
    html += "</body></html>"
    return html

print(generate_html_report())
EOF

echo "Detailed report written to: $REPORT_FILE"
}

Phase 3: Cutover Validation and Cleanup (1-2 weeks)

3.1 Final Cutover Validation Script

#!/bin/bash
# final_switch_validation.sh - final cutover validation

echo "=== 最终切换验证 ==="

echo "开始时间: $(date)"

echo "=================="

# Validation steps
VALIDATION_PASSED=true

# 1. Verify quorum queue health
echo "1. Verifying quorum queue health..."

for node in 192.168.10.201 192.168.10.202 192.168.10.203; do

echo "检查节点: $node"

检查服务状态

if ! ssh $node "systemctl is-active rabbitmq-server" | grep -q "active"; then

echo " ❌ 节点 $node 服务异常"

VALIDATION_PASSED=false

fi

# Check Raft status
raft_status=$(ssh $node "rabbitmq-diagnostics quorum_status 2>/dev/null | grep -c 'leader\|follower'")

if [ "$raft_status" -lt 1 ]; then

echo " ❌ 节点 $node Raft状态异常"

VALIDATION_PASSED=false

fi

echo " ✅ 节点 $node 检查通过"

done

2. 验证所有队列已迁移

echo -e "\n2. 验证队列迁移完成情况..."

MIGRATION_REPORT=$(python3 << EOF

import requests

import json

def check_migration_completion():

获取镜像队列状态

mirror_url = "http://192.168.10.101:15672/api/queues"

mirror_resp = requests.get(mirror_url, auth=('admin', 'AdminPassword123'))

mirror_queues = mirror_resp.json()

获取仲裁队列状态

quorum_url = "http://192.168.10.201:15672/api/queues"

quorum_resp = requests.get(quorum_url, auth=('admin', 'AdminPassword123'))

quorum_queues = quorum_resp.json()

分析迁移状态

migration_status = {}

for queue in mirror_queues:

queue_name = queue['name']

mirror_messages = queue['messages']

查找对应的仲裁队列

quorum_queue_name = f"{queue_name}.quorum"

quorum_messages = 0

for qq in quorum_queues:

if qq['name'] == quorum_queue_name:

quorum_messages = qq['messages']

break

判断迁移是否完成

total = mirror_messages + quorum_messages

if total == 0:

status = "empty"

elif quorum_messages > 0 and mirror_messages == 0:

status = "migrated"

elif quorum_messages > mirror_messages * 10: # 仲裁队列消息远多于镜像队列

status = "mostly_migrated"

else:

status = "not_migrated"

migration_status[queue_name] = {

'mirror_messages': mirror_messages,

'quorum_messages': quorum_messages,

'status': status

}

return migration_status

status = check_migration_completion()

print("队列迁移状态:")

print("-" * 80)

print(f"{'队列名称':<30} {'镜像消息':<12} {'仲裁消息':<12} {'状态':<15}")

print("-" * 80)

not_migrated_count = 0

for queue, info in status.items():

status_icon = {

'migrated': '✅',

'mostly_migrated': '🟡',

'empty': '⚪',

'not_migrated': '❌'

}.get(info['status'], '❓')

print(f"{queue:<30} {info['mirror_messages']:<12} {info['quorum_messages']:<12} {status_icon + ' ' + info['status']:<15}")

if info['status'] == 'not_migrated':

not_migrated_count += 1

print(f"\n未完成迁移的队列数: {not_migrated_count}")

exit(not_migrated_count)

EOF

)

NOT_MIGRATED_COUNT=$?

if [ $NOT_MIGRATED_COUNT -gt 0 ]; then

echo " ❌ 还有 $NOT_MIGRATED_COUNT 个队列未完成迁移"

VALIDATION_PASSED=false

else

echo " ✅ 所有队列迁移完成"

fi

3. 验证消费者切换

echo -e "\n3. 验证消费者切换情况..."

CONSUMER_REPORT=$(python3 << EOF

import requests

import json

def check_consumer_migration():

获取消费者信息

mirror_url = "http://192.168.10.101:15672/api/consumers"

quorum_url = "http://192.168.10.201:15672/api/consumers"

mirror_consumers = requests.get(mirror_url, auth=('admin', 'AdminPassword123')).json()

quorum_consumers = requests.get(quorum_url, auth=('admin', 'AdminPassword123')).json()

分析消费者分布

consumer_stats = {}

按队列统计消费者

for consumer in mirror_consumers:

queue = consumer['queue']['name']

if queue not in consumer_stats:

consumer_stats[queue] = {'mirror': 0, 'quorum': 0}

consumer_stats[queue]['mirror'] += 1

for consumer in quorum_consumers:

queue = consumer['queue']['name'].replace('.quorum', '')

if queue not in consumer_stats:

consumer_stats[queue] = {'mirror': 0, 'quorum': 0}

consumer_stats[queue]['quorum'] += 1

return consumer_stats

stats = check_consumer_migration()

print("消费者分布:")

print("-" * 60)

print(f"{'队列名称':<30} {'镜像消费者':<12} {'仲裁消费者':<12} {'状态':<10}")

print("-" * 60)

problematic_queues = 0

for queue, counts in stats.items():

total = counts['mirror'] + counts['quorum']

if total == 0:

status = "no_consumers"

status_icon = "⚠️"

elif counts['quorum'] == 0:

status = "mirror_only"

status_icon = "❌"

problematic_queues += 1

elif counts['mirror'] == 0:

status = "quorum_only"

status_icon = "✅"

elif counts['quorum'] > counts['mirror']:

status = "mostly_quorum"

status_icon = "🟡"

else:

status = "mostly_mirror"

status_icon = "🟡"

problematic_queues += 1

print(f"{queue:<30} {counts['mirror']:<12} {counts['quorum']:<12} {status_icon + ' ' + status:<10}")

print(f"\n有问题的队列数: {problematic_queues}")

exit(problematic_queues)

EOF

)

PROBLEMATIC_QUEUES=$?

if [ $PROBLEMATIC_QUEUES -gt 0 ]; then

echo " ❌ 还有 $PROBLEMATIC_QUEUES 个队列消费者未完全迁移"

VALIDATION_PASSED=false

else

echo " ✅ 所有消费者已迁移完成"

fi

4. 性能基准测试

echo -e "\n4. 运行性能基准测试..."

PERF_TEST_RESULT=$(python3 << EOF

import time

import pika

import json

from datetime import datetime

def run_performance_test():

连接到仲裁队列

credentials = pika.PlainCredentials('app_user', 'AppPassword123')

connection = pika.BlockingConnection(

pika.ConnectionParameters(

host='192.168.10.201',

port=5672,

credentials=credentials

)

)

channel = connection.channel()

创建测试队列

test_queue = f"perf_test_{int(time.time())}"

channel.queue_declare(

queue=test_queue,

durable=True,

arguments={'x-queue-type': 'quorum'}

)

测试消息发布性能

message_count = 1000

message_size = 1024 # 1KB

print(f"性能测试: 发布 {message_count} 条 {message_size} 字节的消息")

start_time = time.time()

for i in range(message_count):

message = {

'id': i,

'timestamp': datetime.now().isoformat(),

'data': 'x' * message_size

}

channel.basic_publish(

exchange='',

routing_key=test_queue,

body=json.dumps(message).encode('utf-8'),

properties=pika.BasicProperties(delivery_mode=2)

)

publish_time = time.time() - start_time

publish_rate = message_count / publish_time

print(f"发布性能: {publish_rate:.2f} 消息/秒")

清理测试队列

channel.queue_delete(queue=test_queue)

connection.close()

性能标准

if publish_rate > 500: # 500 msg/s 为合格线

return True, publish_rate

else:

return False, publish_rate

success, rate = run_performance_test()

if success:

print(f"✅ 性能测试通过: {rate:.2f} 消息/秒")

exit(0)

else:

print(f"❌ 性能测试未通过: {rate:.2f} 消息/秒")

exit(1)

EOF

)

PERF_TEST_PASSED=$?

if [ $PERF_TEST_PASSED -ne 0 ]; then

echo " ❌ 性能测试未通过"

VALIDATION_PASSED=false

else

echo " ✅ 性能测试通过"

fi

最终验证结果

echo -e "\n=== 最终验证结果 ==="

if [ "$VALIDATION_PASSED" = true ]; then

echo "✅ 所有验证通过,可以执行最终切换"

echo "建议执行以下操作:"

echo "1. 停止所有镜像队列写入"

echo "2. 等待镜像队列消息清空"

echo "3. 下线镜像队列节点"

echo "4. 更新客户端配置为纯仲裁队列"

else

echo "❌ 验证未通过,请解决以上问题后再试"

exit 1

fi

echo "验证完成时间: $(date)"

Phase 4: Cleanup and Archiving (1 week)

4.1 Safe Cleanup Script

#!/bin/bash
# safe_cleanup_mirror_cluster.sh - safely decommission the mirrored cluster

echo "=== 镜像集群安全清理 ==="

echo "开始时间: $(date)"

echo "注意: 此操作将永久删除镜像集群数据!"

echo "========================="

确认操作

read -p "你确定要清理镜像集群吗? (yes/no): " confirmation

if [ "$confirmation" != "yes" ]; then

echo "操作取消"

exit 1

fi

二次确认

read -p "请输入'CONFIRM_DELETE'以继续: " double_check

if [ "$double_check" != "CONFIRM_DELETE" ]; then

echo "操作取消"

exit 1

fi

1. 最终数据备份

echo "1. 执行最终数据备份..."

BACKUP_DIR="/backup/rabbitmq_mirror_final_$(date +%Y%m%d_%H%M%S)"

mkdir -p $BACKUP_DIR

导出定义

curl -u admin:AdminPassword123 \

-X GET \

"http://192.168.10.101:15672/api/definitions" \

-o "$BACKUP_DIR/definitions.json"

# Back up per-node configuration and metadata
for node in 192.168.10.101 192.168.10.102 192.168.10.103; do
    node_name=$(ssh $node "hostname -s")

    # Back up the configuration files
    mkdir -p $BACKUP_DIR/config_$node_name
    scp -r $node:/etc/rabbitmq/* $BACKUP_DIR/config_$node_name/

    # Back up the data directory (metadata)
    ssh $node "tar -czf - /var/lib/rabbitmq/mnesia 2>/dev/null" \
        > "$BACKUP_DIR/mnesia_${node_name}.tar.gz"

echo " 节点 $node 备份完成"

done

echo "备份完成,保存到: $BACKUP_DIR"

2. 逐步停止服务

echo -e "\n2. 逐步停止镜像集群服务..."

先停止应用,保持集群信息

for node in 192.168.10.101 192.168.10.102 192.168.10.103; do

echo "停止节点 $node 的应用..."

ssh $node "rabbitmqctl stop_app"

sleep 2

done

等待确认所有节点已停止

echo "等待所有节点停止..."

sleep 10

3. 从混合负载均衡器移除

echo -e "\n3. 从负载均衡器移除镜像节点..."

更新HAProxy配置,移除镜像后端

ssh 192.168.10.100 << 'EOF'

备份当前配置

cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.backup.$(date +%s)

移除镜像后端配置

sed -i '/backend mirror_backend/,/^$/d' /etc/haproxy/haproxy.cfg

sed -i '/use_backend mirror_backend/d' /etc/haproxy/haproxy.cfg

重载配置

systemctl reload haproxy

echo "负载均衡器配置已更新"

EOF

4. 清理数据

echo -e "\n4. 清理镜像节点数据..."

cleanup_node() {

local node=$1

echo "清理节点: $node"

ssh $node << 'EOF'

停止服务

systemctl stop rabbitmq-server

备份Erlang Cookie(用于审计)

cp /var/lib/rabbitmq/.erlang.cookie /root/erlang.cookie.backup.$(date +%s)

清理数据目录

rm -rf /var/lib/rabbitmq/mnesia/*

清理日志

rm -rf /var/log/rabbitmq/*

保留配置(用于审计)

mv /etc/rabbitmq /etc/rabbitmq.backup.$(date +%s)

可选:卸载软件包

yum remove -y rabbitmq-server erlang

echo " 节点清理完成"

EOF

}

for node in 192.168.10.101 192.168.10.102 192.168.10.103; do

cleanup_node $node

done

5. 生成清理报告

echo -e "\n5. 生成清理报告..."

CLEANUP_REPORT="$BACKUP_DIR/cleanup_report.md"

cat > $CLEANUP_REPORT << EOF

镜像集群清理报告

基本信息

  • 清理时间: $(date)

  • 清理操作员: $(whoami)

  • 备份目录: $BACKUP_DIR

清理节点

$(for node in 192.168.10.101 192.168.10.102 192.168.10.103; do

echo "- $node"

done)

清理内容

  1. 停止所有RabbitMQ服务

  2. 备份配置和数据

  3. 从负载均衡器移除

  4. 清理数据目录

  5. 备份配置文件

验证步骤

请在清理后验证以下内容:

1. 服务状态验证

\`\`\`bash

$(for node in 192.168.10.101 192.168.10.102 192.168.10.103; do

echo "# 检查节点 $node"

echo "ssh $node 'systemctl status rabbitmq-server'"

echo ""

done)

\`\`\`

2. 仲裁集群验证

\`\`\`bash

检查仲裁集群状态

ssh 192.168.10.201 "rabbitmqctl cluster_status"

ssh 192.168.10.201 "rabbitmqctl list_queues --queue-type quorum"

\`\`\`

3. 负载均衡器验证

\`\`\`bash

检查HAProxy状态

ssh 192.168.10.100 "systemctl status haproxy"

ssh 192.168.10.100 "echo 'show stat' | socat /var/run/haproxy.sock stdio"

\`\`\`

后续操作建议

  1. 保留备份文件至少30天

  2. 监控仲裁集群性能

  3. 更新文档和监控配置

  4. 通知相关团队迁移完成

签字确认

  • 操作确认: [签名]

  • 日期: $(date +%Y-%m-%d)

EOF

echo "清理报告已生成: $CLEANUP_REPORT"

6. 最终验证

echo -e "\n6. 执行最终验证..."

FINAL_VALIDATION_PASSED=true

验证镜像节点已清理

for node in 192.168.10.101 192.168.10.102 192.168.10.103; do

if ssh $node '[ -d /var/lib/rabbitmq/mnesia ] && [ -n "$(ls -A /var/lib/rabbitmq/mnesia)" ]'; then

echo "❌ 节点 $node 数据目录未完全清理"

FINAL_VALIDATION_PASSED=false

else

echo "✅ 节点 $node 数据目录已清理"

fi

done

验证仲裁集群运行正常

if ssh 192.168.10.201 "rabbitmqctl cluster_status 2>&1 | grep -q 'running_nodes'"; then

echo "✅ 仲裁集群运行正常"

else

echo "❌ 仲裁集群状态异常"

FINAL_VALIDATION_PASSED=false

fi

最终结果

echo -e "\n=== 清理完成 ==="

if [ "$FINAL_VALIDATION_PASSED" = true ]; then

echo "✅ 镜像集群清理完成"

echo "迁移项目成功结束!"

else

echo "⚠️ 清理完成,但有一些验证问题"

echo "请检查上述问题并手动处理"

fi

echo "完成时间: $(date)"

4. Long-Term Strategy for the Mixed Deployment

4.1 Operating the Mixed Architecture

Queue type decision guide (declaration examples follow the list):

Choose quorum queues when:

  • the queue serves a new business workload
  • strong consistency guarantees are required
  • throughput is moderate (≤10K msg/s)
  • automatic failure recovery is required
  • long-term maintainability matters

Choose mirrored queues when:

  • the queue already exists (compatibility)
  • throughput is very high (>50K msg/s)
  • clients cannot yet declare quorum queues
  • they are only needed as a short-term transition

Choose streams when:

  • the architecture is event-sourced
  • messages must be replayable
  • very large volumes of log data must be retained
  • the use case is real-time analytics
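To make the decision guide concrete, the sketch below declares one queue of each type with pika. The queue names are illustrative, and the stream retention argument is an optional assumption:

```python
# declare_queue_types.py - sketch: one declaration per queue type from the decision guide
import pika

params = pika.ConnectionParameters(host="192.168.10.100", port=5670,
                                   credentials=pika.PlainCredentials("app_user", "AppPassword123"))

with pika.BlockingConnection(params) as connection:
    channel = connection.channel()

    # Classic queue (mirrored via policy) - existing services, highest raw throughput
    channel.queue_declare(queue="legacy.orders", durable=True,
                          arguments={"x-queue-type": "classic"})

    # Quorum queue - new services that need strong consistency and automatic recovery
    channel.queue_declare(queue="payments.settlement", durable=True,
                          arguments={"x-queue-type": "quorum",
                                     "x-quorum-initial-group-size": 3})

    # Stream - event sourcing / replay; consumers read it with an x-stream-offset argument
    channel.queue_declare(queue="audit.events", durable=True,
                          arguments={"x-queue-type": "stream",
                                     "x-max-age": "7D"})
```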

Monitoring configuration for the mixed clusters:

# prometheus_rabbitmq_mixed.yml
# Prometheus configuration for monitoring the mixed cluster

scrape_configs:
  # Mirrored cluster
  - job_name: 'rabbitmq_mirror'
    static_configs:
      - targets:
          - '192.168.10.101:15692'   # rabbitmq_prometheus plugin port
          - '192.168.10.102:15692'
          - '192.168.10.103:15692'
    metrics_path: '/metrics'
    params:
      family: ['mirror']

  # Quorum cluster
  - job_name: 'rabbitmq_quorum'
    static_configs:
      - targets:
          - '192.168.10.201:15692'
          - '192.168.10.202:15692'
          - '192.168.10.203:15692'
    metrics_path: '/metrics'
    params:
      family: ['quorum', 'raft']   # include the Raft metric family

  # HAProxy
  - job_name: 'haproxy_mixed'
    static_configs:
      - targets: ['192.168.10.100:8888']
    metrics_path: '/metrics'

# Alerting rules (normally kept in a separate rules file referenced via rule_files)
groups:
  - name: rabbitmq_mixed_alerts
    rules:
      # Mirrored queue alerts
      - alert: MirrorQueueSyncCritical
        expr: rabbitmq_queue_mirror_sync_critical > 0
        for: 5m
        labels:
          severity: critical
          cluster: mirror
        annotations:
          summary: "Mirrored queue synchronisation is critical"
          description: "Queue {{ $labels.queue }} has critically unsynchronised mirrors"

      # Quorum queue alerts
      - alert: QuorumQueueRaftLag
        expr: rabbitmq_quorum_queue_raft_log_lag > 1000
        for: 10m
        labels:
          severity: warning
          cluster: quorum
        annotations:
          summary: "Quorum queue Raft replication is lagging"
          description: "Queue {{ $labels.queue }} is {{ $value }} Raft log entries behind"

      # Mixed cluster alerts
      - alert: MixedClusterImbalance
        expr: |
          (sum(rabbitmq_queue_messages{cluster="mirror"}) /
           sum(rabbitmq_queue_messages{cluster="quorum"})) > 0.2
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Mixed cluster load is imbalanced"
          description: "Mirrored queues still hold more than 20% of messages; consider accelerating the migration"

4.2 Rollback and Emergency Plan

Quick rollback script:

#!/bin/bash
# emergency_rollback.sh - emergency rollback to mirrored queues

echo "=== 紧急回滚程序 ==="

echo "开始时间: $(date)"

echo "=================="

# Check whether a rollback is warranted
check_rollback_conditions() {
    echo "Checking rollback conditions..."

    # 1. Is the quorum cluster in serious trouble?
    QUORUM_CRITICAL=false

    # Check for leader election problems

if ssh 192.168.10.201 "rabbitmq-diagnostics quorum_status 2>/dev/null | grep -c 'election' | grep -q '[1-9]'"; then

echo "⚠️ 仲裁集群存在选举问题"

QUORUM_CRITICAL=true

fi

# Check whether the WAL is full

if ssh 192.168.10.201 "rabbitmq-diagnostics quorum_status 2>/dev/null | grep -i 'wal.*full'"; then

echo "⚠️ 仲裁集群WAL已满"

QUORUM_CRITICAL=true

fi

    # 2. Message loss detection
    MESSAGE_LOSS=false
    # Compare mirrored vs quorum queue message counts (when dual writes are on);
    # simplified here - in practice this should come from the monitoring system

    if [ "$QUORUM_CRITICAL" = true ] || [ "$MESSAGE_LOSS" = true ]; then

return 0 # 需要回滚

else

return 1 # 不需要回滚

fi

}

# Perform the rollback
execute_rollback() {
    echo "Starting rollback..."

    # 1. Restore the mirrored nodes
    echo "1. Restoring the mirrored nodes..."

for node in 192.168.10.101 192.168.10.102 192.168.10.103; do

echo " 恢复节点 $node"

ssh $node << 'EOF'

停止服务(如果还在运行)

systemctl stop rabbitmq-server 2>/dev/null || true

从备份恢复配置

if [ -d /etc/rabbitmq.backup.* ]; then

backup_dir=$(ls -d /etc/rabbitmq.backup.* | tail -1)

cp -r $backup_dir/* /etc/rabbitmq/

fi

恢复数据(如果有备份)

if [ -f /root/erlang.cookie.backup.* ]; then

backup_file=$(ls -t /root/erlang.cookie.backup.* | head -1)

cp $backup_file /var/lib/rabbitmq/.erlang.cookie

chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie

chmod 400 /var/lib/rabbitmq/.erlang.cookie

fi

启动服务

systemctl start rabbitmq-server

echo " 节点恢复启动完成"

EOF

done

2. 重建镜像集群

echo "2. 重建镜像集群..."

ssh 192.168.10.102 "rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl join_cluster rabbit@mq-mirror-01; rabbitmqctl start_app"

ssh 192.168.10.103 "rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl join_cluster rabbit@mq-mirror-01; rabbitmqctl start_app"

3. 恢复负载均衡器配置

echo "3. 恢复负载均衡器配置..."

ssh 192.168.10.100 << 'EOF'

恢复备份配置

backup_file=$(ls -t /etc/haproxy/haproxy.cfg.backup.* | head -1)

if [ -f "$backup_file" ]; then

cp $backup_file /etc/haproxy/haproxy.cfg

systemctl reload haproxy

echo " 负载均衡器配置已恢复"

else

echo " ⚠️ 未找到备份配置,需要手动恢复"

fi

EOF

4. 恢复客户端配置

echo "4. 更新客户端配置..."

echo " 通知所有客户端:"

echo " - 更新连接端点: 192.168.10.100:5670"

echo " - 移除仲裁队列相关参数"

echo " - 验证连接镜像队列"

5. 导入备份数据(如果有)

if [ -f "/backup/rabbitmq_mirror_final_*/definitions.json" ]; then

echo "5. 导入备份数据..."

backup_file=$(ls -t /backup/rabbitmq_mirror_final_*/definitions.json | head -1)

curl -u admin:AdminPassword123 \

-X POST \

-H "Content-Type: application/json" \

--data-binary "@$backup_file" \

"http://192.168.10.101:15672/api/definitions"

echo " 备份数据导入完成"

fi

echo "回滚操作完成"

}

# Main

if check_rollback_conditions; then

echo "检测到需要回滚的条件"

read -p "确认执行回滚? (yes/no): " confirm

if [ "$confirm" = "yes" ]; then

execute_rollback

echo -e "\n=== 回滚完成 ==="

echo "请验证以下内容:"

echo "1. 镜像集群状态: ssh 192.168.10.101 'rabbitmqctl cluster_status'"

echo "2. 服务可用性: 测试客户端连接"

echo "3. 消息完整性: 检查关键队列消息"

echo "4. 监控告警: 确认监控恢复正常"

else

echo "回滚取消"

fi

else

echo "未检测到需要回滚的条件"

echo "当前系统状态正常"

fi

echo "完成时间: $(date)"

5. Migration Success Metrics and Acceptance Criteria

5.1 Success Metric Definitions

Technical metrics:

  • Message loss: 0 (100% of messages preserved)
  • Service availability: ≥99.95%
  • Performance degradation: ≤10%
  • Failure recovery time: ≤5 minutes

Business metrics:

  • Core business queues migrated: 100%
  • Consumers switched without noticeable impact: 100%
  • Monitoring coverage: 100%
  • Documentation updated: 100%

Operational metrics:

  • Alert accuracy: ≥95%
  • Backup/restore success rate: 100%
  • Team training completed: 100%
  • Rollback plan rehearsed: done

A sketch of how the first two technical metrics can be probed automatically is shown below.
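A minimal sketch, assuming the management API accounts used earlier in this guide; it treats messages left on the mirrored cluster as a proxy for unfinished migration work and counts running quorum nodes as a rough availability signal:

```python
# acceptance_probe.py - sketch: turn the first two success metrics into an automated check
import requests

AUTH = ("admin", "AdminPassword123")          # assumed management account
MIRROR_API = "http://192.168.10.101:15672"    # mirrored cluster
QUORUM_API = "http://192.168.10.201:15672"    # quorum cluster

def total_messages(api_base):
    queues = requests.get(f"{api_base}/api/queues", auth=AUTH, timeout=10).json()
    return sum(q.get("messages", 0) for q in queues)

def running_nodes(api_base):
    nodes = requests.get(f"{api_base}/api/nodes", auth=AUTH, timeout=10).json()
    return sum(1 for n in nodes if n.get("running")), len(nodes)

mirror_total = total_messages(MIRROR_API)
quorum_total = total_messages(QUORUM_API)
up, total = running_nodes(QUORUM_API)

print(f"Backlog still on mirrored queues: {mirror_total} messages")
print(f"Backlog on quorum queues:         {quorum_total} messages")
print(f"Quorum cluster nodes running:     {up}/{total}")

# Acceptance: nothing left behind on the mirrored cluster and every quorum node is up
assert mirror_total == 0, "mirrored queues still hold messages - migration incomplete"
assert up == total, "not all quorum nodes are running"
```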

5.2 Acceptance Checklist

#!/bin/bash
# migration_acceptance_checklist.sh - migration acceptance checklist

echo "RabbitMQ迁移验收检查清单"

echo "========================="

echo "检查时间: $(date)"

echo ""

CHECKLIST=(

"✅ 1. 数据完整性验证"

"✅ 2. 性能基准测试通过"

"✅ 3. 高可用性验证"

"✅ 4. 监控告警就绪"

"✅ 5. 文档更新完成"

"✅ 6. 团队培训完成"

"✅ 7. 回滚预案验证"

"✅ 8. 安全审计通过"

"✅ 9. 合规性检查"

"✅ 10. 客户通知完成"

)

PASS_COUNT=0

TOTAL_COUNT=${#CHECKLIST[@]}

for item in "${CHECKLIST[@]}"; do

# Placeholder result - in practice derive it from the corresponding concrete check
status=$(echo "$item" | cut -d' ' -f1)
description=$(echo "$item" | cut -d' ' -f2-)

# Hook the real check logic in here, for example:
# if check_data_integrity; then
#     status="✅"
# else
#     status="❌"
# fi

echo "$status $description"

if [ "$status" = "✅" ]; then

((PASS_COUNT++))

fi

done

echo ""

echo "验收结果: PASS_COUNT/TOTAL_COUNT"

if [ PASS_COUNT -eq TOTAL_COUNT ]; then

echo "🎉 迁移项目验收通过!"

# Generate the acceptance report

cat > /tmp/migration_acceptance_report.md << EOF

RabbitMQ迁移项目验收报告

项目概述

  • 项目名称: RabbitMQ镜像队列到仲裁队列迁移

  • 迁移时间: $(date +%Y年%m月%d日)

  • 验收时间: $(date)

验收结果

  • 总体状态: ✅ 通过

  • 通过项目: $PASS_COUNT/$TOTAL_COUNT

详细检查结果

$(for item in "${CHECKLIST[@]}"; do

echo "- $item"

done)

技术指标达成情况

  • 消息零丢失率: ✅ 达成 (100%)

  • 服务可用性: ✅ 达成 (99.99%)

  • 性能表现: ✅ 达成 (提升15%)

  • 故障恢复: ✅ 达成 (<3分钟)

签字确认

  • 技术负责人: _________________

  • 运维负责人: _________________

  • 业务负责人: _________________

  • 项目经理: _________________

日期: $(date +%Y-%m-%d)

EOF

echo "验收报告已生成: /tmp/migration_acceptance_report.md"

else

echo "⚠️ 迁移项目验收未通过"

echo "请解决以下问题:"

for i in "${!CHECKLIST[@]}"; do

if [[ "{CHECKLIST\[i]}" == "❌"* ]]; then

echo " - {CHECKLIST\[i]}"

fi

done

fi

Summary and Recommendations

Key success factors

  1. Migrate in phases: do not move every queue at once
  2. Verify with dual writes: prove data consistency before switching
  3. Switch gradually: shift traffic in small increments
  4. Monitor thoroughly: always know the real-time state of the migration
  5. Keep a rollback plan: be ready to fall back at any moment

Recommended next steps

Depending on your situation:

  1. Start now: run the environment audit and compatibility tests
  2. Prioritise: pilot the migration with one or two non-critical business queues
  3. Establish a baseline: record performance metrics before migrating (see the sketch below)
  4. Take small steps: migrate one or two queues per week and adjust the strategy as you go
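For step 3, the following sketch snapshots per-queue depth and publish/deliver rates from the management API into a JSON file; message_stats can be absent for idle queues, so the rates default to 0. Addresses and credentials follow the assumptions used throughout this guide:

```python
# record_baseline.py - sketch: capture a pre-migration performance baseline
import json
import time
import requests

AUTH = ("admin", "AdminPassword123")
API = "http://192.168.10.101:15672"   # mirrored cluster management API

baseline = {"captured_at": time.strftime("%Y-%m-%dT%H:%M:%S"), "queues": []}

for q in requests.get(f"{API}/api/queues", auth=AUTH, timeout=10).json():
    stats = q.get("message_stats", {})
    baseline["queues"].append({
        "name": q["name"],
        "messages": q.get("messages", 0),
        "publish_rate": stats.get("publish_details", {}).get("rate", 0),
        "deliver_rate": stats.get("deliver_get_details", {}).get("rate", 0),
        "consumers": q.get("consumers", 0),
    })

with open(f"/tmp/rabbitmq_baseline_{int(time.time())}.json", "w") as f:
    json.dump(baseline, f, ensure_ascii=False, indent=2)

print(f"Recorded baseline for {len(baseline['queues'])} queues")
```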
