Migrating etcd in a kubeadm-deployed Kubernetes cluster from the default static Pods to a binary etcd cluster

I. Background

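Our company runs a kubeadm-deployed Kubernetes cluster with 3 master nodes, and etcd is co-located on those masters. During operation, an I/O fault in the underlying storage once crashed the etcd cluster, which in turn took the whole Kubernetes cluster down. To be prepared if a similar problem recurs, this article looks at rebuilding etcd from the release binaries while reusing the etcd certificates that kubeadm generated for the existing cluster.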

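Core flow: back up the data → stop the etcd static Pod on each node → install the etcd binaries → configure the etcd service node by node (the first node initializes the cluster, the later nodes join it) → start etcd with the existing certificates → verify cluster connectivity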

II. Environment and Prerequisites

1. Information for the 3 master nodes (example):

Node name   IP address
master01    192.168.1.101
master02    192.168.1.102
master03    192.168.1.103

2. Back up the etcd data on every node (very important; this should also be a routine scheduled backup)

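# Run on every master node (master01 shown as the example).
# etcdctl must be available on the host and /data must exist; if etcdctl is missing, install the binaries from step 3 first.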
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db
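
A snapshot is only useful if it can be read back, so it is worth checking the file right after saving it. The following is a minimal sketch, assuming etcdutl (shipped in the etcd v3.5 release tarball) is on PATH. `verify_snapshot` and the `ETCDUTL_BIN` override are hypothetical names introduced here for illustration, not part of etcd.

```shell
# verify_snapshot is a hypothetical helper name, not part of etcd.
# It fails if the snapshot file is missing or empty, then asks etcdutl
# to print the snapshot's hash, revision, key count and size.
verify_snapshot() {
  snapshot_file="$1"
  if [ ! -s "$snapshot_file" ]; then
    echo "snapshot missing or empty: $snapshot_file" >&2
    return 1
  fi
  # ETCDUTL_BIN can be overridden; the default assumes etcdutl is on PATH.
  "${ETCDUTL_BIN:-etcdutl}" snapshot status "$snapshot_file" --write-out=table
}
```

It could be called as `verify_snapshot /data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db` on each node after the backup finishes.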

3. Download the etcd binaries that match the running version

# Example version (adjust to the version actually running in the cluster)
ETCD_VERSION=v3.5.10
ETCD_ARCH=amd64
# Download and unpack
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
tar -zxvf etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
# Copy the binaries (etcd, etcdctl, etcdutl) to a directory on PATH
cp etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}/etcd* /usr/local/bin/
chmod +x /usr/local/bin/etcd*
# Verify the version
etcd --version
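
To pick the matching version, the image tag in the static Pod manifest is a convenient source. The sketch below assumes the manifest is still at /etc/kubernetes/manifests/etcd.yaml and that the image tag has the usual kubeadm shape such as `registry.k8s.io/etcd:3.5.10-0`. Both helper names are hypothetical.

```shell
# etcd_manifest_image is a hypothetical helper: it prints the first image
# reference found in the etcd static Pod manifest.
etcd_manifest_image() {
  awk '$1 == "image:" {print $2; exit}' "${1:-/etc/kubernetes/manifests/etcd.yaml}"
}

# etcd_release_tag is a hypothetical helper: it turns an image reference
# such as registry.k8s.io/etcd:3.5.10-0 into the release tag v3.5.10.
etcd_release_tag() {
  image="$1"
  tag="${image##*:}"      # text after the last colon, e.g. 3.5.10-0
  version="${tag%%-*}"    # drop the image build suffix, e.g. -0
  echo "v${version#v}"    # ensure exactly one leading v
}
```

With these, `ETCD_VERSION=$(etcd_release_tag "$(etcd_manifest_image)")` would set the variable used in the download commands, provided it runs before the manifest is moved away in section III.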

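III. Step-by-Step Replacement (3-Node Cluster)

Key configuration: every member of a multi-node etcd cluster must use the same initial-cluster value, and the certificate paths must point to the kubeadm-generated files under /etc/kubernetes/pki/etcd/. The first node uses initial-cluster-state=new and the later nodes use existing. Note that the --initial-* flags only matter when a member bootstraps with an empty data directory; because each node here keeps its original /var/lib/etcd, etcd restarts as the existing member and ignores them.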
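
Because the initial-cluster value has to be identical on all three nodes, it can help to generate it once instead of typing it three times. This is a minimal sketch; `build_initial_cluster` is a hypothetical helper, and the peer port 2380 is taken from the unit files in this article.

```shell
# build_initial_cluster is a hypothetical helper. Each argument is NAME=IP;
# the output is the comma-separated NAME=https://IP:2380 list that
# --initial-cluster expects.
build_initial_cluster() {
  result=""
  for pair in "$@"; do
    name="${pair%%=*}"   # text before the first "="
    ip="${pair#*=}"      # text after the first "="
    result="${result:+$result,}${name}=https://${ip}:2380"
  done
  echo "$result"
}
```

For the example nodes, `build_initial_cluster master01=192.168.1.101 master02=192.168.1.102 master03=192.168.1.103` prints the same string that appears in the unit files.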

1. Stop the etcd static Pod on every master node

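# Run on every master node to stop the kubeadm-managed etcd static Pod.
# Move the manifest out of the manifests directory (this also keeps a backup); kubelet then removes the Pod.
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
# Wait about 10 seconds, then confirm the etcd container is gone.
# kubectl stops responding once etcd has lost quorum, so check on the node with crictl instead.
crictl ps | grep etcd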
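
Instead of waiting a fixed 10 seconds, the check can be polled until the container disappears. The sketch below assumes crictl is installed and configured on the node; `wait_until_fails` and `etcd_container_running` are hypothetical helper names.

```shell
# wait_until_fails is a hypothetical helper: it re-runs a check command once
# per second and returns 0 as soon as the check fails (the thing is gone),
# or 1 if the check still succeeds after the timeout in seconds.
wait_until_fails() {
  timeout_s="$1"; shift
  elapsed=0
  while "$@" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$timeout_s" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Succeeds while a running container whose name matches "etcd" exists here.
etcd_container_running() {
  [ -n "$(crictl ps --name etcd -q 2>/dev/null)" ]
}
```

A call such as `wait_until_fails 30 etcd_container_running || echo "etcd container still running"` would then gate the next step.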

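2. Deploy the binary etcd cluster (reusing the existing certificates)

Work through the three master nodes one at a time: the first initializes the cluster, and the second and third join it.

3. Configure the first node (master01): initialize the etcd cluster

Create the /etc/systemd/system/etcd.service file: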

[Unit]
Description=Etcd Server (master01)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
# Comments stay above ExecStart: text after a trailing backslash would break the line continuation.
# --data-dir matches the data directory of the original static Pod.
# --initial-cluster-state is new for the first initialization; the later nodes use existing.
# The certificate flags reuse the files generated by kubeadm.
ExecStart=/usr/local/bin/etcd \
  --name=master01 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.101:2379 \
  --advertise-client-urls=https://192.168.1.101:2379 \
  --listen-peer-urls=https://192.168.1.101:2380 \
  --initial-advertise-peer-urls=https://192.168.1.101:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=new \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Start and verify etcd on master01:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
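# Check the service status
systemctl status etcd
# Check the health of this member. With only master01 running, a 3-member cluster has no quorum,
# so the check is expected to report unhealthy until master02 is started. For the same reason,
# with Type=notify the `systemctl start etcd` call may block until a second member is up.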
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
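
The reason a lone member cannot report healthy is the Raft quorum rule: a cluster of n members needs floor(n/2) + 1 of them running. A small sketch of that arithmetic, with hypothetical helper names:

```shell
# quorum prints the number of members that must be up for a cluster of
# the given size: floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum CLUSTER_SIZE RUNNING succeeds when enough members are running.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}
```

For the 3-node cluster here, `quorum 3` is 2, so `has_quorum 3 1` fails and `has_quorum 3 2` succeeds, which is why the cluster becomes usable once master02 is started.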

4. Configure the second node (master02): join the etcd cluster

Create the /etc/systemd/system/etcd.service file (specific to master02; only the name, the IP addresses, and initial-cluster-state differ from master01):

[Unit]
Description=Etcd Server (master02)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
# Comments stay above ExecStart: text after a trailing backslash would break the line continuation.
# --name is changed to master02 and the IP addresses to 192.168.1.102.
# --initial-cluster-state is existing because this node joins an existing cluster.
# The certificate paths are the same as on master01 (the local kubeadm certificates are reused).
ExecStart=/usr/local/bin/etcd \
  --name=master02 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.102:2379 \
  --advertise-client-urls=https://192.168.1.102:2379 \
  --listen-peer-urls=https://192.168.1.102:2380 \
  --initial-advertise-peer-urls=https://192.168.1.102:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=existing \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
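
Since the three unit files differ only in the node name, the IP addresses, and initial-cluster-state, they can be rendered from one template to avoid copy-and-paste mistakes. This is a sketch under that assumption; `render_etcd_unit` is a hypothetical helper, and the flags are exactly the ones used in this article.

```shell
# render_etcd_unit is a hypothetical helper, not part of etcd or kubeadm.
# Arguments: node name, node IP, initial-cluster-state, initial-cluster list.
render_etcd_unit() {
  name="$1"; ip="$2"; state="$3"; cluster="$4"
  pki=/etc/kubernetes/pki/etcd   # certificates generated by kubeadm
  # In an unquoted here-document, "\\" emits one literal backslash.
  cat <<EOF
[Unit]
Description=Etcd Server (${name})
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \\
  --name=${name} \\
  --data-dir=/var/lib/etcd \\
  --listen-client-urls=https://127.0.0.1:2379,https://${ip}:2379 \\
  --advertise-client-urls=https://${ip}:2379 \\
  --listen-peer-urls=https://${ip}:2380 \\
  --initial-advertise-peer-urls=https://${ip}:2380 \\
  --initial-cluster=${cluster} \\
  --initial-cluster-token=etcd-cluster-token \\
  --initial-cluster-state=${state} \\
  --cert-file=${pki}/server.crt \\
  --key-file=${pki}/server.key \\
  --client-cert-auth=true \\
  --trusted-ca-file=${pki}/ca.crt \\
  --peer-cert-file=${pki}/peer.crt \\
  --peer-key-file=${pki}/peer.key \\
  --peer-client-cert-auth=true \\
  --peer-trusted-ca-file=${pki}/ca.crt \\
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}
```

On master02 this could be used as `render_etcd_unit master02 192.168.1.102 existing "$CLUSTER" > /etc/systemd/system/etcd.service`, where `CLUSTER` holds the shared initial-cluster string.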

Start and verify master02:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify that both running members are healthy
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

5. Configure the third node (master03): join the etcd cluster

Create the /etc/systemd/system/etcd.service file (specific to master03; same as master02 except for the node name and IP addresses):

[Unit]
Description=Etcd Server (master03)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
# Comments stay above ExecStart: text after a trailing backslash would break the line continuation.
# --name is changed to master03 and the IP addresses to 192.168.1.103.
# --initial-cluster-state is existing, as on master02.
# The certificate paths are unchanged.
ExecStart=/usr/local/bin/etcd \
  --name=master03 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.103:2379 \
  --advertise-client-urls=https://192.168.1.103:2379 \
  --listen-peer-urls=https://192.168.1.103:2380 \
  --initial-advertise-peer-urls=https://192.168.1.103:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=existing \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Start and verify master03:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify that the 3-node etcd cluster is healthy
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# List the etcd cluster members
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
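
The same three certificate flags are repeated in every etcdctl call in this article, so a small wrapper keeps the verification commands shorter. This is a sketch; `etcdctl_k8s` and the `ETCDCTL_BIN` override are hypothetical names, and the certificate paths are the kubeadm defaults used throughout.

```shell
# etcdctl_k8s is a hypothetical wrapper: it runs etcdctl with the kubeadm
# etcd CA and server certificate, passing all other arguments through.
etcdctl_k8s() {
  ETCDCTL_API=3 "${ETCDCTL_BIN:-etcdctl}" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}
```

For example, `etcdctl_k8s --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 endpoint status --write-out=table` prints one row per member, including which one is the leader and the database size.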

6. Verify the connection between kube-apiserver and the etcd cluster

# Check the kube-apiserver logs on every master node and confirm there are no etcd connection errors
kubectl logs -n kube-system kube-apiserver-master01 | grep etcd
kubectl logs -n kube-system kube-apiserver-master02 | grep etcd
kubectl logs -n kube-system kube-apiserver-master03 | grep etcd

# Check component status (componentstatuses has been deprecated since Kubernetes v1.19)
kubectl get componentstatuses

# Verify cluster functionality by creating a test Pod
kubectl run test-pod --image=nginx:alpine -n default
kubectl get pods test-pod
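
kube-apiserver only reconnects without changes if its --etcd-servers flag still points at an address the new etcd service listens on. On a kubeadm stacked-etcd master this is typically https://127.0.0.1:2379, which the unit files above keep in --listen-client-urls, but that is an assumption worth checking on each node. `apiserver_etcd_servers` below is a hypothetical helper.

```shell
# apiserver_etcd_servers is a hypothetical helper: it prints the value of
# the --etcd-servers flag from the kube-apiserver static Pod manifest.
apiserver_etcd_servers() {
  manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
  sed -n 's/^.*--etcd-servers=\([^[:space:]]*\).*$/\1/p' "$manifest"
}
```

If the printed endpoint is not one of the client URLs configured for the local etcd service, the manifest or the unit file needs to be adjusted before the API server can reach etcd.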