Environment
- OS: Rocky Linux 9.7
- etcd version: v3.6.7
- Topology: 3-node cluster (satisfies Raft's majority requirement; tolerates at most 1 failed node)
- Zabbix Server: 192.168.44.135 (port 8080)
- Prometheus: 192.168.44.135 (port 9090)
1. Node Planning
| Node | IP Address | Role |
|---|---|---|
| node1 | 192.168.44.132 | etcd member |
| node2 | 192.168.44.133 | etcd member |
| node3 | 192.168.44.134 | etcd member (initial Leader) |
2. Prerequisites
2.1 Time synchronization (all nodes)
etcd clusters are strict about clock consistency; keep the skew between nodes within 500 ms.
```bash
# Install and start chrony
yum install -y chrony
systemctl enable chronyd && systemctl start chronyd

# Enable NTP synchronization and check tracking status
timedatectl set-ntp true
chronyc tracking
```
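To confirm the 500 ms requirement is actually met, the `Last offset` field of `chronyc tracking` can be checked with a small helper. This is a sketch; `check_offset` is a hypothetical name, and the parsing assumes chrony's standard `Last offset     : +0.000264 seconds` output format:

```shell
# Sketch: warn when the last measured clock offset exceeds 0.5 s.
check_offset() {
  # $1: full output of `chronyc tracking`
  echo "$1" | awk -F': *' '/^Last offset/ {
    off = $2 + 0
    if (off < 0) off = -off
    if (off < 0.5) print "clock offset OK"
    else print "clock offset too large"
  }'
}

check_offset "$(chronyc tracking 2>/dev/null)"
```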
2.2 Install dependencies (all nodes)
```bash
# Install tar (the Rocky Linux 9 minimal install does not ship it)
yum install -y tar
```
2.3 Stage the installation package
Distribute the tarball from the local Windows machine (192.168.44.1) over a temporary HTTP server:
```bash
# On the Windows machine, serve the Downloads directory
python -m http.server 18888 --directory C:\Users\qiyongquan\Downloads

# On each Linux node, download the tarball
curl -o /tmp/etcd-v3.6.7-linux-amd64.tar.gz http://192.168.44.1:18888/etcd-v3.6.7-linux-amd64.tar.gz
```
3. Installing etcd
Run the following steps on all three nodes (the configurations differ only in node name and IP).
3.1 Create the system user and directories
```bash
# Create a dedicated etcd system user (no login shell)
useradd -r -s /sbin/nologin etcd

# Create the config, data, and log directories
mkdir -p /etc/etcd /var/lib/etcd /var/log/etcd
```
3.2 Install the binaries
```bash
# Unpack the tarball
cd /tmp
tar xzf etcd-v3.6.7-linux-amd64.tar.gz

# Copy the binaries into the system path
cp /tmp/etcd-v3.6.7-linux-amd64/etcd \
   /tmp/etcd-v3.6.7-linux-amd64/etcdctl \
   /tmp/etcd-v3.6.7-linux-amd64/etcdutl \
   /usr/local/bin/
chmod +x /usr/local/bin/etcd /usr/local/bin/etcdctl /usr/local/bin/etcdutl

# Verify the version
etcd --version
# etcd Version: 3.6.7
```
3.3 Create the configuration file
node1 (192.168.44.132)
```bash
cat > /etc/etcd/etcd.conf << 'EOF'
# ── Identity ─────────────────────────────────────────────
ETCD_NAME="node1"
ETCD_DATA_DIR="/var/lib/etcd"
# ── Listen addresses ─────────────────────────────────────
ETCD_LISTEN_PEER_URLS="http://192.168.44.132:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.44.132:2379,http://127.0.0.1:2379"
# ── Advertised addresses ─────────────────────────────────
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.44.132:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.44.132:2379"
# ── Cluster bootstrap ────────────────────────────────────
ETCD_INITIAL_CLUSTER="node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-prod"
ETCD_INITIAL_CLUSTER_STATE="new"
# ── Logging ──────────────────────────────────────────────
ETCD_LOGGER="zap"
ETCD_LOG_OUTPUTS="/var/log/etcd/etcd.log"
ETCD_LOG_LEVEL="info"
# ── Metrics (Prometheus) ─────────────────────────────────
ETCD_METRICS="extensive"
ETCD_LISTEN_METRICS_URLS="http://0.0.0.0:2381"
# ── Performance and stability tuning ─────────────────────
ETCD_AUTO_COMPACTION_RETENTION="1"     # auto-compaction, keep 1 hour of history
ETCD_SNAPSHOT_COUNT="5000"             # snapshot every 5000 writes
ETCD_QUOTA_BACKEND_BYTES="8589934592"  # 8 GB backend storage quota
ETCD_HEARTBEAT_INTERVAL="100"          # heartbeat interval, 100 ms
ETCD_ELECTION_TIMEOUT="1000"           # election timeout, 1000 ms
ETCD_MAX_SNAPSHOTS="5"
ETCD_MAX_WALS="5"
# Note: do not set ETCD_ENABLE_V2 — the v2 API (and its flag) was removed in v3.6
EOF
```
node2 (192.168.44.133)
Same as node1, changing only these fields:
```bash
ETCD_NAME="node2"
ETCD_LISTEN_PEER_URLS="http://192.168.44.133:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.44.133:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.44.133:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.44.133:2379"
```
node3 (192.168.44.134)
Same as node1, changing only these fields:
```bash
ETCD_NAME="node3"
ETCD_LISTEN_PEER_URLS="http://192.168.44.134:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.44.134:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.44.134:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.44.134:2379"
```
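Since the three files differ only in the node name and IP, they can also be rendered from one template with a small helper. This is a sketch; `gen_etcd_conf` is a hypothetical name, and the output path is a parameter so it can be tried outside /etc:

```shell
# Sketch: render a node-specific etcd.conf from the shared template above.
gen_etcd_conf() {
  # $1: node name, $2: node IP, $3: output path
  name="$1"; ip="$2"; out="$3"
  cat > "$out" << EOF
ETCD_NAME="${name}"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="http://${ip}:2380"
ETCD_LISTEN_CLIENT_URLS="http://${ip}:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://${ip}:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://${ip}:2379"
ETCD_INITIAL_CLUSTER="node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-prod"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_LOGGER="zap"
ETCD_LOG_OUTPUTS="/var/log/etcd/etcd.log"
ETCD_LOG_LEVEL="info"
ETCD_METRICS="extensive"
ETCD_LISTEN_METRICS_URLS="http://0.0.0.0:2381"
ETCD_AUTO_COMPACTION_RETENTION="1"
ETCD_SNAPSHOT_COUNT="5000"
ETCD_QUOTA_BACKEND_BYTES="8589934592"
ETCD_HEARTBEAT_INTERVAL="100"
ETCD_ELECTION_TIMEOUT="1000"
ETCD_MAX_SNAPSHOTS="5"
ETCD_MAX_WALS="5"
EOF
}

# e.g. on node2: gen_etcd_conf node2 192.168.44.133 /etc/etcd/etcd.conf
```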
3.4 Set directory ownership
```bash
chown -R etcd:etcd /etc/etcd /var/lib/etcd /var/log/etcd
```
3.5 Create the systemd service unit
```bash
cat > /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=etcd Key-Value Store
Documentation=https://etcd.io/docs/
After=network.target

[Service]
Type=notify
User=etcd
EnvironmentFile=/etc/etcd/etcd.conf
ExecStart=/usr/local/bin/etcd
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536
Nice=-10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
```
3.6 Open firewall ports
```bash
# Ports etcd needs:
#   2379: client traffic
#   2380: peer (Raft) traffic
#   2381: Prometheus metrics scraping
firewall-cmd --permanent --add-port=2379-2381/tcp
firewall-cmd --reload
```
3.7 Start all three nodes together
⚠️ Important: a fresh etcd cluster can only elect a Leader once more than half of its members are online. Start the service on all three nodes at (roughly) the same time:
```bash
# Run on all three nodes simultaneously (tmux synchronized panes help here)
systemctl start etcd

# Enable start on boot
systemctl enable etcd

# Check the service status
systemctl status etcd
```
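After starting, a small polling helper can wait for the local member to report healthy on the metrics port. This is a sketch; `health_ok` and `wait_healthy` are hypothetical names, and it assumes `/health` returns JSON like `{"health":"true"}`, the same endpoint section 8 queries with curl:

```shell
# Sketch: poll the local /health endpoint until the member reports healthy.
health_ok() {
  # $1: body returned by GET /health
  printf '%s' "$1" | grep -q '"health"[[:space:]]*:[[:space:]]*"true"'
}

wait_healthy() {
  # $1: number of attempts, $2: seconds between attempts
  retries="${1:-12}"; interval="${2:-5}"
  while [ "$retries" -gt 0 ]; do
    if health_ok "$(curl -fsS http://127.0.0.1:2381/health 2>/dev/null)"; then
      echo "etcd is healthy"
      return 0
    fi
    retries=$((retries - 1))
    sleep "$interval"
  done
  echo "etcd did not become healthy" >&2
  return 1
}

# e.g. wait_healthy 12 5   # poll for up to ~60 s
```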
4. Verifying the etcd Cluster
4.1 List the cluster members
```bash
etcdctl \
  --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 \
  member list -w table
```
Expected output:
```
+------------------+---------+-------+----------------------------+----------------------------+------------+
|        ID        | STATUS  | NAME  |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-------+----------------------------+----------------------------+------------+
| 445519fc8222d25a | started | node1 | http://192.168.44.132:2380 | http://192.168.44.132:2379 | false      |
| b65a5ac53d56bb92 | started | node2 | http://192.168.44.133:2380 | http://192.168.44.133:2379 | false      |
| 76efd20dda691bed | started | node3 | http://192.168.44.134:2380 | http://192.168.44.134:2379 | false      |
+------------------+---------+-------+----------------------------+----------------------------+------------+
```
4.2 Check endpoint status and the Leader
```bash
etcdctl \
  --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 \
  endpoint status -w table
```
4.3 Read/write test
```bash
# Write a test key
etcdctl --endpoints=http://192.168.44.132:2379 put /test/hello "world"

# Read it back
etcdctl --endpoints=http://192.168.44.132:2379 get /test/hello

# Delete the test data
etcdctl --endpoints=http://192.168.44.132:2379 del /test/hello
```
4.4 Check the listening ports
```bash
ss -tlnp | grep -E '2379|2380|2381'
# Expect 2379 (client), 2380 (peer), and 2381 (metrics) to be listening
```
5. Prometheus Monitoring Configuration (192.168.44.135)
5.1 Add an etcd scrape job
Edit /opt/monitoring/prometheus/prometheus.yml and append to the scrape_configs section:
```yaml
  - job_name: 'etcd'
    static_configs:
      - targets:
          - '192.168.44.132:2381'
          - '192.168.44.133:2381'
          - '192.168.44.134:2381'
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
```
5.2 Create the etcd alert rules file
Create /opt/monitoring/prometheus/etcd_alert_rules.yml (note: the quota alerts compare ratios so that humanizePercentage renders them correctly, and the gRPC alerts aggregate by instance so {{ $labels.instance }} resolves):
```yaml
groups:
  - name: etcd_alerts
    rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "etcd cluster has no Leader (instance {{ $labels.instance }})"
          description: "The etcd cluster currently has no Leader, so it cannot accept writes; this may indicate an election problem or a network partition.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: EtcdHighNumberOfLeaderChangesCritical
        expr: increase(etcd_server_leader_changes_seen_total[10m]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "etcd instance {{ $labels.instance }} is changing Leader too often"
          description: "More than 3 etcd Leader changes in the last 10 minutes.\n  VALUE = {{ $value }}"
      - alert: EtcdHighNumberOfLeaderChangesWarning
        expr: increase(etcd_server_leader_changes_seen_total[10m]) > 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "etcd instance {{ $labels.instance }} is changing Leader too often"
          description: "More than 1 etcd Leader change in the last 10 minutes.\n  VALUE = {{ $value }}"
      - alert: EtcdHighNumberOfFailedProposals
        expr: increase(etcd_server_proposals_failed_total[1h]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High number of failed etcd proposals (instance {{ $labels.instance }})"
          description: "More than 5 etcd proposals failed in the last hour.\n  VALUE = {{ $value }}"
      - alert: EtcdBackendStorageQuotaExceed90Percent
        expr: (etcd_mvcc_db_total_size_in_bytes{job=~"etcd.*"} / etcd_server_quota_backend_bytes{job=~"etcd.*"}) > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd backend storage quota nearly full (instance {{ $labels.instance }})"
          description: "The etcd backend database is using more than 90% of its storage quota.\n  Current usage: {{ $value | humanizePercentage }}"
      - alert: EtcdBackendStorageQuotaExceed95Percent
        expr: (etcd_mvcc_db_total_size_in_bytes{job=~"etcd.*"} / etcd_server_quota_backend_bytes{job=~"etcd.*"}) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "etcd backend storage quota nearly full (instance {{ $labels.instance }})"
          description: "The etcd backend database is using more than 95% of its storage quota; writes will soon be blocked!\n  Current usage: {{ $value | humanizePercentage }}"
      - alert: EtcdHighGrpcRequestFailureRate
        expr: sum(rate(grpc_server_handled_total{job=~"etcd.*", grpc_code!="OK"}[5m])) by (instance, grpc_service, grpc_method) / sum(rate(grpc_server_handled_total{job=~"etcd.*"}[5m])) by (instance, grpc_service, grpc_method) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High etcd gRPC request failure rate (instance {{ $labels.instance }})"
          description: "etcd gRPC request failure rate is above 1%.\n  VALUE = {{ $value }}"
      - alert: EtcdHighGrpcRequestFailureCount
        expr: sum(increase(grpc_server_handled_total{job=~"etcd.*", grpc_code!="OK"}[5m])) by (instance, grpc_service, grpc_method) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High etcd gRPC request failure count (instance {{ $labels.instance }})"
          description: "More than 3 etcd gRPC requests failed within 5 minutes; possible causes include network jitter, high node load, or a Leader change."
```
5.3 Reference the rules file in prometheus.yml
```yaml
rule_files:
  - "etcd_alert_rules.yml"
  - "alert_rules.yml"
```
5.4 Restart Prometheus to apply the changes
```bash
cd /opt/monitoring
docker compose restart prometheus

# Check the etcd scrape target status
curl -s 'http://localhost:9090/api/v1/targets' | python3 -c \
"import sys,json; d=json.load(sys.stdin); \
[print(t['labels']['job'], t['labels']['instance'], t['health']) \
for t in d['data']['activeTargets'] if 'etcd' in t['labels'].get('job','')]"

# Expected output:
# etcd 192.168.44.132:2381 up
# etcd 192.168.44.133:2381 up
# etcd 192.168.44.134:2381 up
```
6. Zabbix Monitoring Configuration (192.168.44.135)
6.1 Configure monitoring through the Zabbix API
Get an API token
```bash
TOKEN=$(curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"user.login","params":{"username":"Admin","password":"zabbix"},"id":1}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["result"])')
echo "Token: $TOKEN"
```
Create the etcd monitoring template
```bash
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "jsonrpc":"2.0",
    "method":"template.create",
    "params":{
      "host":"Template etcd",
      "name":"Template etcd Service Monitor",
      "groups":[{"groupid":"1"}]
    },
    "id":2
  }'
# Note the templateid in the response, e.g. 10772
```
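Rather than copying the id by hand, the response can be parsed directly. This is a sketch; `extract_templateid` is a hypothetical helper, relying on the API returning the new ids in `result.templateids`:

```shell
# Sketch: pull the first templateid out of a template.create response.
extract_templateid() {
  python3 -c 'import sys, json; print(json.load(sys.stdin)["result"]["templateids"][0])'
}

# Usage: TEMPLATE_ID=$(curl -s ... -d '{"method":"template.create", ...}' | extract_templateid)
```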
Add items
```bash
TEMPLATE_ID="10772"

# Item 1: client port 2379
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd client port 2379 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2379]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":3}"

# Item 2: peer port 2380
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd peer port 2380 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2380]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":4}"

# Item 3: metrics port 2381
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd metrics port 2381 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2381]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":5}"

# Item 4: etcd process count
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd process count\",
    \"key_\":\"proc.num[etcd]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"60s\"},\"id\":6}"

# Item 5: etcd health endpoint
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd health check\",
    \"key_\":\"web.page.regexp[http://{HOST.IP}:2381/health,,\\\"\\\\{.*health.*\\\\}\\\",0]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":1,\"delay\":\"30s\"},\"id\":7}"
```
Add triggers (Zabbix 7.x syntax)
```bash
# Trigger 1: etcd client port unreachable (Disaster severity)
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd client port 2379 unreachable on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/net.tcp.port[{HOST.IP},2379])=0\",
    \"priority\":5},\"id\":8}"

# Trigger 2: etcd peer port unreachable
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd peer port 2380 unreachable on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/net.tcp.port[{HOST.IP},2380])=0\",
    \"priority\":5},\"id\":9}"

# Trigger 3: etcd process not running
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd process not running on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/proc.num[etcd])<1\",
    \"priority\":5},\"id\":10}"
```
Link the template to the three hosts
```bash
# host.update replaces the template list, so keep the already-linked template
# (10343 here) alongside the new etcd template (10772)
for hostid in 10769 10770 10771; do
  curl -s http://localhost:8080/api_jsonrpc.php \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $TOKEN" \
    -d "{\"jsonrpc\":\"2.0\",\"method\":\"host.update\",\"params\":{
      \"hostid\":\"$hostid\",
      \"templates\":[{\"templateid\":\"10343\"},{\"templateid\":\"10772\"}]},\"id\":11}"
done
```
6.2 Verify in the Zabbix web UI
- Open http://192.168.44.135:8080 and log in as Admin / zabbix
- Go to Configuration → Templates and confirm "Template etcd Service Monitor" exists
- Go to Configuration → Hosts and confirm hosts 132/133/134 are linked to the etcd template
- Go to Monitoring → Latest data and filter by the etcd items to confirm data is coming in
7. etcd Backup and Restore
7.1 Manual backup
```bash
# Create the backup directory and take a snapshot (run on any node)
mkdir -p /backups
etcdctl \
  --endpoints=http://192.168.44.132:2379 \
  snapshot save /backups/etcd-snapshot-$(date +%Y%m%d_%H%M%S).db

# Verify the snapshot
etcdutl snapshot status /backups/etcd-snapshot-*.db -w table
```
7.2 Scheduled backup (crontab)
```bash
# Back up daily at 01:00 and keep 7 days of snapshots
crontab -e
# Add:
0 1 * * * /usr/local/bin/etcdctl --endpoints=http://127.0.0.1:2379 snapshot save /backups/etcd-$(date +\%Y\%m\%d).db && find /backups -name 'etcd-*.db' -mtime +7 -delete
```
7.3 Restore
```bash
# 1. Stop etcd on all nodes
systemctl stop etcd

# 2. Move the current data aside (in case the restore needs to be rolled back)
mv /var/lib/etcd /var/lib/etcd.bak

# 3. Restore from the same snapshot on every node
#    (--name and --initial-advertise-peer-urls must match the local node;
#     the values below are node1's)
etcdutl snapshot restore /backups/etcd-20260329.db \
  --name node1 \
  --initial-cluster "node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380" \
  --initial-cluster-token "etcd-cluster-prod" \
  --initial-advertise-peer-urls "http://192.168.44.132:2380" \
  --data-dir /var/lib/etcd

# 4. Fix ownership
chown -R etcd:etcd /var/lib/etcd

# 5. Start etcd
systemctl start etcd
```
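The per-node restore flags are easy to get wrong; a small lookup helper can derive them from the local hostname. This is a sketch; `node_params` is a hypothetical name, and it assumes each machine's short hostname matches the member names used above:

```shell
# Sketch: map a short hostname to "member-name member-ip" for the restore flags.
node_params() {
  case "$1" in
    node1) echo "node1 192.168.44.132" ;;
    node2) echo "node2 192.168.44.133" ;;
    node3) echo "node3 192.168.44.134" ;;
    *) return 1 ;;
  esac
}

# Usage on each node:
#   set -- $(node_params "$(hostname -s)")
#   etcdutl snapshot restore /backups/etcd-20260329.db \
#     --name "$1" \
#     --initial-advertise-peer-urls "http://$2:2380" \
#     ... (remaining flags as above)
```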
8. Common Operations Commands
```bash
# List cluster members
etcdctl --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 member list -w table

# Show the Leader and endpoint status
etcdctl --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 endpoint status -w table

# Check the Prometheus metrics
curl http://192.168.44.132:2381/metrics | grep etcd_server_has_leader

# Check health
curl http://192.168.44.132:2381/health

# View logs
journalctl -u etcd --no-pager -n 50
tail -f /var/log/etcd/etcd.log
```
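The per-node health checks above can be rolled into one sweep over all members. This is a sketch; `health_sweep` is a hypothetical helper that queries each node's metrics-port /health endpoint:

```shell
# Sketch: query every member's /health endpoint and report the result.
health_sweep() {
  for ip in "$@"; do
    if out=$(curl -fsS --connect-timeout 2 "http://${ip}:2381/health" 2>/dev/null); then
      echo "${ip}: ${out}"
    else
      echo "${ip}: unreachable"
    fi
  done
}

health_sweep 192.168.44.132 192.168.44.133 192.168.44.134
```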
9. Notes
- Cluster size: strongly prefer an odd number of members (3, 5, 7) to avoid split votes; a 3-node cluster tolerates at most 1 failed member
- Network connectivity: ports 2379 (client) and 2380 (peer/Raft) must be reachable in both directions between all nodes
- Time synchronization: inter-node clock skew above 500 ms can cause frequent Leader changes; NTP/chrony is mandatory
- etcd v3.6 API change: the ETCDCTL_API=3 environment variable is no longer needed; v3 is the default API (and the v2 API has been removed)
- Zabbix 7.x trigger syntax: triggers use the last(/TemplateName/item.key[])=value form; the old {TemplateName:item.key[].last()}=value format is no longer supported
- Prometheus metrics port: in this setup etcd serves metrics on port 2381, which must be set explicitly via ETCD_LISTEN_METRICS_URLS
- Grafana dashboards: for a three-node bare-metal etcd cluster scraped by a standalone Prometheus, dashboard 15308 is the most complete fit; teams less comfortable with an English-language dashboard can pair it with 23560