Environment
- OS: Rocky Linux 9.7
- etcd version: v3.6.7
- Topology: 3-node cluster (satisfies Raft's majority requirement; tolerates at most 1 failed node)
- Zabbix Server: 192.168.44.135 (port 8080)
- Prometheus: 192.168.44.135 (port 9090)
1. Node Planning
| Node | IP Address | Role |
|---|---|---|
| node1 | 192.168.44.132 | etcd member |
| node2 | 192.168.44.133 | etcd member |
| node3 | 192.168.44.134 | etcd member (initial Leader) |
2. Prerequisites
2.1 Time synchronization (all nodes)
etcd clusters are strict about clock consistency; keep the skew between nodes within 500 ms.
```bash
# Install and start chrony
yum install -y chrony
systemctl enable chronyd && systemctl start chronyd

# Enable NTP synchronization and check tracking status
timedatectl set-ntp true
chronyc tracking
```
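To confirm the 500 ms requirement is actually met, the `Last offset` field of `chronyc tracking` can be checked with a small helper. This is a sketch; `check_offset` is a hypothetical name, and the parsing assumes chrony's standard `Last offset     : +0.000264 seconds` output format:

```shell
# Sketch: warn when the last measured clock offset exceeds 0.5 s.
check_offset() {
  # $1: full output of `chronyc tracking`
  echo "$1" | awk -F': *' '/^Last offset/ {
    off = $2 + 0
    if (off < 0) off = -off
    if (off < 0.5) print "clock offset OK"
    else print "clock offset too large"
  }'
}

check_offset "$(chronyc tracking 2>/dev/null)"
```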
2.2 Install dependencies (all nodes)
```bash
# Install tar (the Rocky Linux 9 minimal install does not ship it)
yum install -y tar
```
2.3 Stage the installation package
Distribute the tarball from the local Windows machine (192.168.44.1) over a temporary HTTP server:
```bash
# On the Windows machine, serve the Downloads directory
python -m http.server 18888 --directory C:\Users\qiyongquan\Downloads

# On each Linux node, download the tarball
curl -o /tmp/etcd-v3.6.7-linux-amd64.tar.gz http://192.168.44.1:18888/etcd-v3.6.7-linux-amd64.tar.gz
```
3. Installing etcd
Run the following steps on all three nodes (the configurations differ only in node name and IP).
3.1 Create the system user and directories
```bash
# Create a dedicated etcd system user (no login shell)
useradd -r -s /sbin/nologin etcd

# Create the config, data, and log directories
mkdir -p /etc/etcd /var/lib/etcd /var/log/etcd
```
3.2 Install the binaries
```bash
# Unpack the tarball
cd /tmp
tar xzf etcd-v3.6.7-linux-amd64.tar.gz

# Copy the binaries into the system path
cp /tmp/etcd-v3.6.7-linux-amd64/etcd \
   /tmp/etcd-v3.6.7-linux-amd64/etcdctl \
   /tmp/etcd-v3.6.7-linux-amd64/etcdutl \
   /usr/local/bin/
chmod +x /usr/local/bin/etcd /usr/local/bin/etcdctl /usr/local/bin/etcdutl

# Verify the version
etcd --version
# etcd Version: 3.6.7
```
3.3 Create the configuration file
node1 (192.168.44.132)
```bash
cat > /etc/etcd/etcd.conf << 'EOF'
# ── Identity ─────────────────────────────────────────────
ETCD_NAME="node1"
ETCD_DATA_DIR="/var/lib/etcd"
# ── Listen addresses ─────────────────────────────────────
ETCD_LISTEN_PEER_URLS="http://192.168.44.132:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.44.132:2379,http://127.0.0.1:2379"
# ── Advertised addresses ─────────────────────────────────
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.44.132:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.44.132:2379"
# ── Cluster bootstrap ────────────────────────────────────
ETCD_INITIAL_CLUSTER="node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-prod"
ETCD_INITIAL_CLUSTER_STATE="new"
# ── Logging ──────────────────────────────────────────────
ETCD_LOGGER="zap"
ETCD_LOG_OUTPUTS="/var/log/etcd/etcd.log"
ETCD_LOG_LEVEL="info"
# ── Metrics (Prometheus) ─────────────────────────────────
ETCD_METRICS="extensive"
ETCD_LISTEN_METRICS_URLS="http://0.0.0.0:2381"
# ── Performance and stability tuning ─────────────────────
ETCD_AUTO_COMPACTION_RETENTION="1"     # auto-compaction, keep 1 hour of history
ETCD_SNAPSHOT_COUNT="5000"             # snapshot every 5000 writes
ETCD_QUOTA_BACKEND_BYTES="8589934592"  # 8 GB backend storage quota
ETCD_HEARTBEAT_INTERVAL="100"          # heartbeat interval, 100 ms
ETCD_ELECTION_TIMEOUT="1000"           # election timeout, 1000 ms
ETCD_MAX_SNAPSHOTS="5"
ETCD_MAX_WALS="5"
# Note: do not set ETCD_ENABLE_V2 — the v2 API (and its flag) was removed in v3.6
EOF
```
node2 (192.168.44.133)
Same as node1, changing only these fields:
```bash
ETCD_NAME="node2"
ETCD_LISTEN_PEER_URLS="http://192.168.44.133:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.44.133:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.44.133:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.44.133:2379"
```
node3 (192.168.44.134)
Same as node1, changing only these fields:
```bash
ETCD_NAME="node3"
ETCD_LISTEN_PEER_URLS="http://192.168.44.134:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.44.134:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.44.134:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.44.134:2379"
```
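Since the three files differ only in the node name and IP, they can also be rendered from one template with a small helper. This is a sketch; `gen_etcd_conf` is a hypothetical name, and the output path is a parameter so it can be tried outside /etc:

```shell
# Sketch: render a node-specific etcd.conf from the shared template above.
gen_etcd_conf() {
  # $1: node name, $2: node IP, $3: output path
  name="$1"; ip="$2"; out="$3"
  cat > "$out" << EOF
ETCD_NAME="${name}"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="http://${ip}:2380"
ETCD_LISTEN_CLIENT_URLS="http://${ip}:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://${ip}:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://${ip}:2379"
ETCD_INITIAL_CLUSTER="node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-prod"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_LOGGER="zap"
ETCD_LOG_OUTPUTS="/var/log/etcd/etcd.log"
ETCD_LOG_LEVEL="info"
ETCD_METRICS="extensive"
ETCD_LISTEN_METRICS_URLS="http://0.0.0.0:2381"
ETCD_AUTO_COMPACTION_RETENTION="1"
ETCD_SNAPSHOT_COUNT="5000"
ETCD_QUOTA_BACKEND_BYTES="8589934592"
ETCD_HEARTBEAT_INTERVAL="100"
ETCD_ELECTION_TIMEOUT="1000"
ETCD_MAX_SNAPSHOTS="5"
ETCD_MAX_WALS="5"
EOF
}

# e.g. on node2: gen_etcd_conf node2 192.168.44.133 /etc/etcd/etcd.conf
```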
3.4 Set directory ownership
```bash
chown -R etcd:etcd /etc/etcd /var/lib/etcd /var/log/etcd
```
3.5 Create the systemd service unit
```bash
cat > /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=etcd Key-Value Store
Documentation=https://etcd.io/docs/
After=network.target

[Service]
Type=notify
User=etcd
EnvironmentFile=/etc/etcd/etcd.conf
ExecStart=/usr/local/bin/etcd
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536
Nice=-10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
```
3.6 Open firewall ports
```bash
# Ports etcd needs:
#   2379: client traffic
#   2380: peer (Raft) traffic
#   2381: Prometheus metrics scraping
firewall-cmd --permanent --add-port=2379-2381/tcp
firewall-cmd --reload
```
3.7 Start all three nodes together
⚠️ Important: a fresh etcd cluster can only elect a Leader once more than half of its members are online. Start the service on all three nodes at (roughly) the same time:
```bash
# Run on all three nodes simultaneously (tmux synchronized panes help here)
systemctl start etcd

# Enable start on boot
systemctl enable etcd

# Check the service status
systemctl status etcd
```
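After starting, a small polling helper can wait for the local member to report healthy on the metrics port. This is a sketch; `health_ok` and `wait_healthy` are hypothetical names, and it assumes `/health` returns JSON like `{"health":"true"}`, the same endpoint section 8 queries with curl:

```shell
# Sketch: poll the local /health endpoint until the member reports healthy.
health_ok() {
  # $1: body returned by GET /health
  printf '%s' "$1" | grep -q '"health"[[:space:]]*:[[:space:]]*"true"'
}

wait_healthy() {
  # $1: number of attempts, $2: seconds between attempts
  retries="${1:-12}"; interval="${2:-5}"
  while [ "$retries" -gt 0 ]; do
    if health_ok "$(curl -fsS http://127.0.0.1:2381/health 2>/dev/null)"; then
      echo "etcd is healthy"
      return 0
    fi
    retries=$((retries - 1))
    sleep "$interval"
  done
  echo "etcd did not become healthy" >&2
  return 1
}

# e.g. wait_healthy 12 5   # poll for up to ~60 s
```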
4. Verifying the etcd Cluster
4.1 List the cluster members
```bash
etcdctl \
  --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 \
  member list -w table
```
Expected output:
```
+------------------+---------+-------+----------------------------+----------------------------+------------+
|        ID        | STATUS  | NAME  |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-------+----------------------------+----------------------------+------------+
| 445519fc8222d25a | started | node1 | http://192.168.44.132:2380 | http://192.168.44.132:2379 | false      |
| b65a5ac53d56bb92 | started | node2 | http://192.168.44.133:2380 | http://192.168.44.133:2379 | false      |
| 76efd20dda691bed | started | node3 | http://192.168.44.134:2380 | http://192.168.44.134:2379 | false      |
+------------------+---------+-------+----------------------------+----------------------------+------------+
```
4.2 Check endpoint status and the Leader
```bash
etcdctl \
  --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 \
  endpoint status -w table
```
4.3 Read/write test
```bash
# Write a test key
etcdctl --endpoints=http://192.168.44.132:2379 put /test/hello "world"

# Read it back
etcdctl --endpoints=http://192.168.44.132:2379 get /test/hello

# Delete the test data
etcdctl --endpoints=http://192.168.44.132:2379 del /test/hello
```
4.4 Check the listening ports
```bash
ss -tlnp | grep -E '2379|2380|2381'
# Expect 2379 (client), 2380 (peer), and 2381 (metrics) to be listening
```
5. Prometheus Monitoring Configuration (192.168.44.135)
5.1 Add an etcd scrape job
Edit /opt/monitoring/prometheus/prometheus.yml and append to the scrape_configs section:
```yaml
  - job_name: 'etcd'
    static_configs:
      - targets:
          - '192.168.44.132:2381'
          - '192.168.44.133:2381'
          - '192.168.44.134:2381'
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
```
5.2 Create the etcd alert rules file
Create /opt/monitoring/prometheus/etcd_alert_rules.yml (note: the quota alerts compare ratios so that humanizePercentage renders them correctly, and the gRPC alerts aggregate by instance so {{ $labels.instance }} resolves):
```yaml
groups:
  - name: etcd_alerts
    rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "etcd cluster has no Leader (instance {{ $labels.instance }})"
          description: "The etcd cluster currently has no Leader, so it cannot accept writes; this may indicate an election problem or a network partition.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: EtcdHighNumberOfLeaderChangesCritical
        expr: increase(etcd_server_leader_changes_seen_total[10m]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "etcd instance {{ $labels.instance }} is changing Leader too often"
          description: "More than 3 etcd Leader changes in the last 10 minutes.\n  VALUE = {{ $value }}"
      - alert: EtcdHighNumberOfLeaderChangesWarning
        expr: increase(etcd_server_leader_changes_seen_total[10m]) > 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "etcd instance {{ $labels.instance }} is changing Leader too often"
          description: "More than 1 etcd Leader change in the last 10 minutes.\n  VALUE = {{ $value }}"
      - alert: EtcdHighNumberOfFailedProposals
        expr: increase(etcd_server_proposals_failed_total[1h]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High number of failed etcd proposals (instance {{ $labels.instance }})"
          description: "More than 5 etcd proposals failed in the last hour.\n  VALUE = {{ $value }}"
      - alert: EtcdBackendStorageQuotaExceed90Percent
        expr: (etcd_mvcc_db_total_size_in_bytes{job=~"etcd.*"} / etcd_server_quota_backend_bytes{job=~"etcd.*"}) > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd backend storage quota nearly full (instance {{ $labels.instance }})"
          description: "The etcd backend database is using more than 90% of its storage quota.\n  Current usage: {{ $value | humanizePercentage }}"
      - alert: EtcdBackendStorageQuotaExceed95Percent
        expr: (etcd_mvcc_db_total_size_in_bytes{job=~"etcd.*"} / etcd_server_quota_backend_bytes{job=~"etcd.*"}) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "etcd backend storage quota nearly full (instance {{ $labels.instance }})"
          description: "The etcd backend database is using more than 95% of its storage quota; writes will soon be blocked!\n  Current usage: {{ $value | humanizePercentage }}"
      - alert: EtcdHighGrpcRequestFailureRate
        expr: sum(rate(grpc_server_handled_total{job=~"etcd.*", grpc_code!="OK"}[5m])) by (instance, grpc_service, grpc_method) / sum(rate(grpc_server_handled_total{job=~"etcd.*"}[5m])) by (instance, grpc_service, grpc_method) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High etcd gRPC request failure rate (instance {{ $labels.instance }})"
          description: "etcd gRPC request failure rate is above 1%.\n  VALUE = {{ $value }}"
      - alert: EtcdHighGrpcRequestFailureCount
        expr: sum(increase(grpc_server_handled_total{job=~"etcd.*", grpc_code!="OK"}[5m])) by (instance, grpc_service, grpc_method) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High etcd gRPC request failure count (instance {{ $labels.instance }})"
          description: "More than 3 etcd gRPC requests failed within 5 minutes; possible causes include network jitter, high node load, or a Leader change."
```
5.3 Reference the rules file in prometheus.yml
```yaml
rule_files:
  - "etcd_alert_rules.yml"
  - "alert_rules.yml"
```
5.4 Restart Prometheus to apply the changes
```bash
cd /opt/monitoring
docker compose restart prometheus

# Check the etcd scrape target status
curl -s 'http://localhost:9090/api/v1/targets' | python3 -c \
"import sys,json; d=json.load(sys.stdin); \
[print(t['labels']['job'], t['labels']['instance'], t['health']) \
for t in d['data']['activeTargets'] if 'etcd' in t['labels'].get('job','')]"

# Expected output:
# etcd 192.168.44.132:2381 up
# etcd 192.168.44.133:2381 up
# etcd 192.168.44.134:2381 up
```
6. Zabbix Monitoring Configuration (192.168.44.135)
6.1 Configure monitoring through the Zabbix API
Get an API token
```bash
TOKEN=$(curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"user.login","params":{"username":"Admin","password":"zabbix"},"id":1}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["result"])')
echo "Token: $TOKEN"
```
Create the etcd monitoring template
```bash
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "jsonrpc":"2.0",
    "method":"template.create",
    "params":{
      "host":"Template etcd",
      "name":"Template etcd Service Monitor",
      "groups":[{"groupid":"1"}]
    },
    "id":2
  }'
# Note the templateid in the response, e.g. 10772
```
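Rather than copying the id by hand, the response can be parsed directly. This is a sketch; `extract_templateid` is a hypothetical helper, relying on the API returning the new ids in `result.templateids`:

```shell
# Sketch: pull the first templateid out of a template.create response.
extract_templateid() {
  python3 -c 'import sys, json; print(json.load(sys.stdin)["result"]["templateids"][0])'
}

# Usage: TEMPLATE_ID=$(curl -s ... -d '{"method":"template.create", ...}' | extract_templateid)
```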
Add items
```bash
TEMPLATE_ID="10772"

# Item 1: client port 2379
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd client port 2379 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2379]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":3}"

# Item 2: peer port 2380
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd peer port 2380 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2380]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":4}"

# Item 3: metrics port 2381
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd metrics port 2381 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2381]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":5}"

# Item 4: etcd process count
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd process count\",
    \"key_\":\"proc.num[etcd]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"60s\"},\"id\":6}"

# Item 5: etcd health endpoint
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd health check\",
    \"key_\":\"web.page.regexp[http://{HOST.IP}:2381/health,,\\\"\\\\{.*health.*\\\\}\\\",0]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":1,\"delay\":\"30s\"},\"id\":7}"
```
Add triggers (Zabbix 7.x syntax)
```bash
# Trigger 1: etcd client port unreachable (Disaster severity)
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd client port 2379 unreachable on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/net.tcp.port[{HOST.IP},2379])=0\",
    \"priority\":5},\"id\":8}"

# Trigger 2: etcd peer port unreachable
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd peer port 2380 unreachable on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/net.tcp.port[{HOST.IP},2380])=0\",
    \"priority\":5},\"id\":9}"

# Trigger 3: etcd process not running
curl -s http://localhost:8080/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd process not running on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/proc.num[etcd])<1\",
    \"priority\":5},\"id\":10}"
```
Link the template to the three hosts
```bash
# host.update replaces the template list, so keep the already-linked template
# (10343 here) alongside the new etcd template (10772)
for hostid in 10769 10770 10771; do
  curl -s http://localhost:8080/api_jsonrpc.php \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $TOKEN" \
    -d "{\"jsonrpc\":\"2.0\",\"method\":\"host.update\",\"params\":{
      \"hostid\":\"$hostid\",
      \"templates\":[{\"templateid\":\"10343\"},{\"templateid\":\"10772\"}]},\"id\":11}"
done
```
6.2 Verify in the Zabbix web UI
- Open http://192.168.44.135:8080 and log in as Admin / zabbix
- Go to Configuration → Templates and confirm "Template etcd Service Monitor" exists
- Go to Configuration → Hosts and confirm hosts 132/133/134 are linked to the etcd template
- Go to Monitoring → Latest data and filter by the etcd items to confirm data is coming in
7. etcd Backup and Restore
7.1 Manual backup
```bash
# Create the backup directory and take a snapshot (run on any node)
mkdir -p /backups
etcdctl \
  --endpoints=http://192.168.44.132:2379 \
  snapshot save /backups/etcd-snapshot-$(date +%Y%m%d_%H%M%S).db

# Verify the snapshot
etcdutl snapshot status /backups/etcd-snapshot-*.db -w table
```
7.2 Scheduled backup (crontab)
```bash
# Back up daily at 01:00 and keep 7 days of snapshots
crontab -e
# Add:
0 1 * * * /usr/local/bin/etcdctl --endpoints=http://127.0.0.1:2379 snapshot save /backups/etcd-$(date +\%Y\%m\%d).db && find /backups -name 'etcd-*.db' -mtime +7 -delete
```
7.3 Restore
```bash
# 1. Stop etcd on all nodes
systemctl stop etcd

# 2. Move the current data aside (in case the restore needs to be rolled back)
mv /var/lib/etcd /var/lib/etcd.bak

# 3. Restore from the same snapshot on every node
#    (--name and --initial-advertise-peer-urls must match the local node;
#     the values below are node1's)
etcdutl snapshot restore /backups/etcd-20260329.db \
  --name node1 \
  --initial-cluster "node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380" \
  --initial-cluster-token "etcd-cluster-prod" \
  --initial-advertise-peer-urls "http://192.168.44.132:2380" \
  --data-dir /var/lib/etcd

# 4. Fix ownership
chown -R etcd:etcd /var/lib/etcd

# 5. Start etcd
systemctl start etcd
```
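The per-node restore flags are easy to get wrong; a small lookup helper can derive them from the local hostname. This is a sketch; `node_params` is a hypothetical name, and it assumes each machine's short hostname matches the member names used above:

```shell
# Sketch: map a short hostname to "member-name member-ip" for the restore flags.
node_params() {
  case "$1" in
    node1) echo "node1 192.168.44.132" ;;
    node2) echo "node2 192.168.44.133" ;;
    node3) echo "node3 192.168.44.134" ;;
    *) return 1 ;;
  esac
}

# Usage on each node:
#   set -- $(node_params "$(hostname -s)")
#   etcdutl snapshot restore /backups/etcd-20260329.db \
#     --name "$1" \
#     --initial-advertise-peer-urls "http://$2:2380" \
#     ... (remaining flags as above)
```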
8. Common Operations Commands
```bash
# List cluster members
etcdctl --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 member list -w table

# Show the Leader and endpoint status
etcdctl --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 endpoint status -w table

# Check the Prometheus metrics
curl http://192.168.44.132:2381/metrics | grep etcd_server_has_leader

# Check health
curl http://192.168.44.132:2381/health

# View logs
journalctl -u etcd --no-pager -n 50
tail -f /var/log/etcd/etcd.log
```
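The per-node health checks above can be rolled into one sweep over all members. This is a sketch; `health_sweep` is a hypothetical helper that queries each node's metrics-port /health endpoint:

```shell
# Sketch: query every member's /health endpoint and report the result.
health_sweep() {
  for ip in "$@"; do
    if out=$(curl -fsS --connect-timeout 2 "http://${ip}:2381/health" 2>/dev/null); then
      echo "${ip}: ${out}"
    else
      echo "${ip}: unreachable"
    fi
  done
}

health_sweep 192.168.44.132 192.168.44.133 192.168.44.134
```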
9. Notes
- Cluster size: strongly prefer an odd number of members (3, 5, 7) to avoid split votes; a 3-node cluster tolerates at most 1 failed member
- Network connectivity: ports 2379 (client) and 2380 (peer/Raft) must be reachable in both directions between all nodes
- Time synchronization: inter-node clock skew above 500 ms can cause frequent Leader changes; NTP/chrony is mandatory
- etcd v3.6 API change: the ETCDCTL_API=3 environment variable is no longer needed; v3 is the default API (and the v2 API has been removed)
- Zabbix 7.x trigger syntax: triggers use the last(/TemplateName/item.key[])=value form; the old {TemplateName:item.key[].last()}=value format is no longer supported
- Prometheus metrics port: in this setup etcd serves metrics on port 2381, which must be set explicitly via ETCD_LISTEN_METRICS_URLS
- Grafana dashboards: for a three-node bare-metal etcd cluster scraped by a standalone Prometheus, dashboard 15308 is the most complete fit; teams less comfortable with an English-language dashboard can pair it with 23560