I. Background
The company runs a kubeadm-deployed Kubernetes cluster with 3 master nodes, where etcd is co-located with the masters (stacked etcd). The cluster has previously gone down because storage I/O anomalies on the underlying disks crashed the etcd cluster, which in turn took the whole Kubernetes cluster down. To be better prepared for a recurrence, this article explores rebuilding the etcd cluster from binary packages while reusing the etcd certificates kubeadm generated for the existing cluster.
Core flow: back up data → stop the etcd static Pod on each node, one at a time → install the etcd binary package → configure the etcd service per node (the first node initializes the cluster, the remaining nodes join it) → start etcd reusing the existing certificates → verify cluster connectivity.
II. Environment and Prerequisites
1.Information for the 3 master nodes (example):
| Node | IP address |
|---|---|
| master01 | 192.168.1.101 |
| master02 | 192.168.1.102 |
| master03 | 192.168.1.103 |
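Before touching etcd, it can help to confirm that the nodes reach each other on the etcd client and peer ports (2379/2380). A quick sketch from master01, assuming nc (netcat) is installed:
# Check etcd client/peer port reachability from master01 (nc assumed installed)
for ip in 192.168.1.102 192.168.1.103; do
  nc -zv ${ip} 2379
  nc -zv ${ip} 2380
done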
2.Back up etcd data on all nodes (very important; this step should also be part of a routine backup schedule)
# Run on every master node (master01 shown as the example)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db
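To confirm the snapshot is usable rather than a truncated file, it can be inspected with etcdctl (the snapshot status subcommand has moved to etcdutl in newer releases but is still available in 3.5); the reported hash, revision, key count, and size serve as a quick sanity check:
# Inspect the snapshot just written (adjust the filename if yours differs)
ETCDCTL_API=3 etcdctl snapshot status /data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db -w table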
3.Download an etcd binary package matching the current version
# Example version (adjust to the version actually running; see below for one way to find it)
ETCD_VERSION=v3.5.10
ETCD_ARCH=amd64
# Download and unpack
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
tar -zxvf etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
# Copy the binaries into the system path
cp etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}/etcd* /usr/local/bin/
chmod +x /usr/local/bin/etcd*
# Verify the version
etcd --version
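To choose ETCD_VERSION, one option is to read the image tag of the existing etcd static Pod, since that is the version kubeadm deployed (the Pod name follows the etcd-<node name> pattern; the output shown is illustrative only):
# Read the etcd image tag from the running static Pod
kubectl get pod -n kube-system etcd-master01 -o jsonpath='{.spec.containers[0].image}{"\n"}'
# Illustrative output: registry.k8s.io/etcd:3.5.10-0 -> use ETCD_VERSION=v3.5.10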
III. Step-by-Step Replacement (3-Node Cluster)
Key configuration: all members of a multi-node etcd cluster must share the same initial-cluster value; the first node starts with initial-cluster-state=new while the later nodes use existing, and the certificate paths must point at the kubeadm-generated files under /etc/kubernetes/pki/etcd/. Note also that because each node keeps its existing data directory, etcd reads its membership from the data dir on startup, so the initial-cluster* flags would only be consulted if the data dir were empty.
1.Stop the etcd static Pod on all master nodes
# Run on each master node to stop the kubeadm-deployed etcd static Pod:
# Back up and remove the etcd static Pod manifest
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
# Verify the etcd Pod has stopped (give the kubelet about 10 seconds to react)
kubectl get pods -n kube-system | grep etcd
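Keep in mind that once the manifest has been removed on all three nodes, the kube-apiservers lose their backend and kubectl itself stops responding, so the kubectl check above only works while at least one etcd is still up. On each node you can instead confirm directly against the container runtime (a sketch assuming containerd with crictl):
# Confirm on the node itself that the etcd container is gone (containerd/crictl assumed)
crictl ps | grep etcd || echo "etcd container stopped"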
2.Deploy the binary etcd cluster (reusing the existing certificates)
Work through the three master nodes one at a time: the first node initializes the cluster, the second and third then join it. The per-node configurations differ only in a few fields, as the templating sketch below shows.
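A minimal templating sketch; etcd.service.template is a hypothetical copy of the unit file below with __NAME__, __IP__, and __STATE__ placeholders in place of the per-node values:
# Illustrative: render the per-node unit file from a template (run on each node)
NODE_NAME=master01        # master02 / master03 on the other nodes
NODE_IP=192.168.1.101     # 192.168.1.102 / 192.168.1.103
CLUSTER_STATE=new         # "existing" on master02 and master03
sed -e "s/__NAME__/${NODE_NAME}/g" \
    -e "s/__IP__/${NODE_IP}/g" \
    -e "s/__STATE__/${CLUSTER_STATE}/g" \
    etcd.service.template > /etc/systemd/system/etcd.service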
3.Configure the first node (master01): initialize the etcd cluster
Create /etc/systemd/system/etcd.service with the following content. The data directory /var/lib/etcd is the same one the static Pod used. Note that systemd does not support trailing comments after a line continuation, so all explanatory comments must stay out of the ExecStart= block.
[Unit]
Description=Etcd Server (master01)
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
--name=master01 \
--data-dir=/var/lib/etcd \
--listen-client-urls=https://127.0.0.1:2379,https://192.168.1.101:2379 \
--advertise-client-urls=https://192.168.1.101:2379 \
--listen-peer-urls=https://192.168.1.101:2380 \
--initial-advertise-peer-urls=https://192.168.1.101:2380 \
--initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
--initial-cluster-token=etcd-cluster-token \
--initial-cluster-state=new \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--client-cert-auth=true \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
--peer-client-cert-auth=true \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--log-level=info
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Start and verify etcd on master01:
systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
# Check the service status
systemctl status etcd
# Check the health of the first endpoint. Only master01 is up at this point, and a single member cannot form quorum in a 3-member cluster, so the health check may report errors until master02 joins; this is expected.
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
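If the health check errors out, the etcd logs in journald are the first place to look; until master02 joins, messages about unreachable peers are expected:
# Follow the etcd service logs
journalctl -u etcd -f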
4.Configure the second node (master02): join the etcd cluster
Create /etc/systemd/system/etcd.service (master02-specific configuration; compared with master01, only --name, the node IP, and --initial-cluster-state change):
[Unit]
Description=Etcd Server (master02)
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
--name=master02 \
--data-dir=/var/lib/etcd \
--listen-client-urls=https://127.0.0.1:2379,https://192.168.1.102:2379 \
--advertise-client-urls=https://192.168.1.102:2379 \
--listen-peer-urls=https://192.168.1.102:2380 \
--initial-advertise-peer-urls=https://192.168.1.102:2380 \
--initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
--initial-cluster-token=etcd-cluster-token \
--initial-cluster-state=existing \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--client-cert-auth=true \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
--peer-client-cert-auth=true \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--log-level=info
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Start and verify master02:
systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify the health of both endpoints
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
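With two of the three members up, the cluster regains quorum and elects a leader; endpoint status shows which member leads and whether the raft terms agree:
# Check leader election and raft state across the started members
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status -w table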
5.Configure the third node (master03): join the etcd cluster
Create /etc/systemd/system/etcd.service (master03-specific configuration; same as master02 except for the node name and IP):
[Unit]
Description=Etcd Server (master03)
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
--name=master03 \
--data-dir=/var/lib/etcd \
--listen-client-urls=https://127.0.0.1:2379,https://192.168.1.103:2379 \
--advertise-client-urls=https://192.168.1.103:2379 \
--listen-peer-urls=https://192.168.1.103:2380 \
--initial-advertise-peer-urls=https://192.168.1.103:2380 \
--initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
--initial-cluster-token=etcd-cluster-token \
--initial-cluster-state=existing \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--client-cert-auth=true \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
--peer-client-cert-auth=true \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--log-level=info
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Start and verify master03:
systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify the health of the full 3-node etcd cluster
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
# List the etcd cluster members
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member list
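As a final etcd-side check, it may be worth confirming that no alarms (for example NOSPACE) are active after the migration:
# Ensure no cluster alarms are set
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
alarm list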
6.Verify the connection between kube-apiserver and the etcd cluster
# Check the kube-apiserver logs on every master node for etcd connection errors
kubectl logs -n kube-system kube-apiserver-master01 | grep etcd
kubectl logs -n kube-system kube-apiserver-master02 | grep etcd
kubectl logs -n kube-system kube-apiserver-master03 | grep etcd
# Check cluster component status (componentstatuses has been deprecated since Kubernetes 1.19, but still gives a quick overview)
kubectl get componentstatuses
# Verify cluster functionality by creating a test Pod
kubectl run test-pod --image=nginx:alpine -n default
kubectl get pods test-pod
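Once the test Pod reaches Running, the migration is complete and the Pod can be removed:
# Clean up the test Pod
kubectl delete pod test-pod -n default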