Migrating etcd in a kubeadm-deployed Kubernetes cluster from the default static Pods to a binary etcd cluster

I. Background

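Our company runs a kubeadm-deployed Kubernetes cluster with three master nodes, and etcd is co-located with the masters. While in operation, the cluster once suffered an etcd collapse caused by I/O anomalies in the underlying storage, which in turn took the whole cluster down. To be prepared if a similar problem recurs, this article looks at rebuilding the etcd cluster from binaries while reusing the etcd certificates that kubeadm generated for the existing cluster.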

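Core workflow: back up the data → stop the etcd static Pod on each node in turn → install the etcd binaries → configure the etcd service per node (the first node initializes the cluster, later nodes join it) → start etcd with the existing certificates → verify cluster connectivity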

II. Environment and prerequisites

1. Information for the three master nodes (example):

Node name    IP address
master01     192.168.1.101
master02     192.168.1.102
master03     192.168.1.103

2. Back up the etcd data on every node (very important; this is the same step used for routine periodic backups)

# Run on every master node (master01 shown as an example)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db
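
Before relying on a snapshot, it is worth confirming that the file was actually written. The sketch below is a minimal check, not part of the original procedure: `snapshot_file_ok` is a hypothetical helper name, and the commented usage assumes the snapshot path from the command above.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the snapshot file exists and is non-empty.
snapshot_file_ok() {
  [ -s "$1" ]
}

# Example usage on a master node (assumes the snapshot path used above):
#   SNAP=/data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db
#   snapshot_file_ok "$SNAP" && ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=table
```

`etcdctl snapshot status` prints the hash, revision, key count, and size of the snapshot. In etcd 3.5 it still works but is marked deprecated in favor of `etcdutl snapshot status`.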

3. Download the etcd binary release that matches the running version

# Example version (adjust to the version actually in use)
ETCD_VERSION=v3.5.10
ETCD_ARCH=amd64
# Download and extract
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
tar -zxvf etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
# Copy the binaries to a system path
cp etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}/etcd* /usr/local/bin/
chmod +x /usr/local/bin/etcd*
# Verify the version
etcd --version
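
To pick the matching release, one option is to derive it from the image tag of the running static Pod. The sketch below assumes the usual kubeadm image tag form (for example `3.5.10-0`, where `-0` is the image revision); `etcd_release_from_image` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: convert an etcd image reference such as
# "registry.k8s.io/etcd:3.5.10-0" into the release tag used in the download URL.
etcd_release_from_image() {
  tag=${1##*:}          # drop everything up to the last colon, leaving 3.5.10-0
  version=${tag%%-*}    # drop the image revision suffix, leaving 3.5.10
  version=${version#v}  # tolerate tags that already start with "v"
  printf 'v%s\n' "$version"
}

etcd_release_from_image "registry.k8s.io/etcd:3.5.10-0"   # prints v3.5.10

# On a master node the image reference can be read from the static Pod manifest:
#   grep 'image:' /etc/kubernetes/manifests/etcd.yaml
```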

III. Step-by-step replacement (3-node cluster)

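Key configuration: every member of a multi-node etcd cluster must use the same initial-cluster value. The first node uses initial-cluster-state=new and the later nodes use existing, and the certificate paths must point to the kubeadm-generated /etc/kubernetes/pki/etcd/ directory. Note that etcd applies the --initial-* flags only when bootstrapping a member with an empty data directory; when /var/lib/etcd already holds the member data written by the static Pod, these flags are ignored and the existing membership is reused.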
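
Because the initial-cluster value has to be identical on every node, it can help to generate it once rather than type it three times. A minimal sketch, assuming the example node names and IPs from this article and peer port 2380; `build_initial_cluster` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: build the --initial-cluster value from "name=ip" pairs.
build_initial_cluster() {
  result=""
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    ip=${pair#*=}      # text after the first "="
    entry="${name}=https://${ip}:2380"
    if [ -z "$result" ]; then
      result=$entry
    else
      result="${result},${entry}"
    fi
  done
  printf '%s\n' "$result"
}

build_initial_cluster master01=192.168.1.101 master02=192.168.1.102 master03=192.168.1.103
```

The printed string can be pasted unchanged into the `--initial-cluster` flag of all three unit files.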

1. Stop the etcd static Pod on all master nodes

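# Run on every master node to stop the kubeadm-deployed etcd static Pod:
# Back up the etcd static Pod manifest by moving it out of the manifests directory
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
# Verify that the etcd Pod has stopped (wait about 10 seconds)
kubectl get pods -n kube-system | grep etcd
# Once etcd has lost quorum kubectl no longer responds; check the container runtime instead
crictl ps | grep etcd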
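
Instead of waiting a fixed 10 seconds, the check can be polled until the old etcd has really released its port. This is a sketch under assumptions: `wait_until` and `etcd_port_free` are hypothetical helper names, and `ss` is assumed to be installed on the node.

```shell
#!/bin/sh
# Hypothetical helper: retry a command once per second until it succeeds
# or the limit (in seconds) is reached. Returns 0 on success, 1 on timeout.
wait_until() {
  limit=$1
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Succeeds when nothing is listening on the etcd client port any more.
# Assumes ss is available; if it is not, the port is reported as free.
etcd_port_free() {
  ! ss -ltn 2>/dev/null | grep -q ':2379 '
}

# Example usage on a master node:
#   wait_until 30 etcd_port_free || echo "etcd is still listening on 2379"
```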

2. Deploy the binary etcd cluster (reusing the existing certificates)

Work through the three master nodes one at a time: the first initializes the cluster, and the second and third join it.

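3. Configure the first node (master01): initialize the etcd cluster

Create the /etc/systemd/system/etcd.service file (configuration specific to master01):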

[Unit]
Description=Etcd Server (master01)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
# --data-dir is the same data directory the original static Pod used.
# --initial-cluster-state is new for the first node and existing for later nodes.
# The certificate flags reuse the certificates generated by kubeadm.
ExecStart=/usr/local/bin/etcd \
  --name=master01 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.101:2379 \
  --advertise-client-urls=https://192.168.1.101:2379 \
  --listen-peer-urls=https://192.168.1.101:2380 \
  --initial-advertise-peer-urls=https://192.168.1.101:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=new \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Start and verify etcd on master01:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
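# Check status
systemctl status etcd
# Check health. Only master01 is running at this point, so the cluster has no quorum yet;
# an unhealthy or timed-out result is expected until a second member starts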
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
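
Whether a partially started cluster can answer a health check follows from the Raft majority rule: a cluster of N voting members needs floor(N/2) + 1 of them running. The sketch below only does that arithmetic; `quorum_size` and `has_quorum` are hypothetical helper names.

```shell
#!/bin/sh
# Majority (quorum) needed for a cluster of N voting members: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Return 0 if RUNNING members (first argument) out of TOTAL (second argument)
# are enough to form a quorum.
has_quorum() {
  [ "$1" -ge "$(quorum_size "$2")" ]
}

quorum_size 3   # prints 2
has_quorum 1 3 || echo "1 of 3 members running: no quorum yet"
has_quorum 2 3 && echo "2 of 3 members running: quorum reached"
```

For the three-node example this means the cluster becomes able to serve requests once the second member is up, and it tolerates the loss of one member afterwards.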

4. Configure the second node (master02): join the etcd cluster

Create the /etc/systemd/system/etcd.service file (configuration specific to master02; only the name, the IPs, and initial-cluster-state change):

[Unit]
Description=Etcd Server (master02)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
# The node name becomes master02 and --initial-cluster-state becomes existing
# so that this member joins the existing cluster.
# The certificate paths are the same as on master01 (each node reuses its local certificates).
ExecStart=/usr/local/bin/etcd \
  --name=master02 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.102:2379 \
  --advertise-client-urls=https://192.168.1.102:2379 \
  --listen-peer-urls=https://192.168.1.102:2380 \
  --initial-advertise-peer-urls=https://192.168.1.102:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=existing \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
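
systemd joins a line with the next one only when the backslash is the last character on the line, so any text after a backslash (a trailing comment, or even trailing spaces) cuts the ExecStart command short. A minimal pre-flight check is sketched below; `find_broken_continuations` is a hypothetical helper name, and the sample file exists only to demonstrate the check.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) every line in which a
# backslash is followed by whitespace instead of ending the line.
# Returns 0 if at least one such line is found, 1 if the file is clean.
find_broken_continuations() {
  grep -nE '\\[[:space:]]' "$1"
}

# Demonstration on a throwaway file; line 2 carries a trailing comment.
unit=$(mktemp)
cat > "$unit" <<'EOF'
ExecStart=/usr/local/bin/etcd \
  --name=master02 \  # trailing comment breaks the continuation
  --data-dir=/var/lib/etcd
EOF
find_broken_continuations "$unit"
rm -f "$unit"
```

On a real node the function would be pointed at /etc/systemd/system/etcd.service before running `systemctl daemon-reload`; `systemd-analyze verify` on the same file may catch further problems.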

Start and verify master02:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify that both members are healthy
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
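
When the health check is scripted, the number of healthy endpoints can be counted from its output. The sketch below assumes the plain-text format that etcdctl v3 prints (`<endpoint> is healthy: ...` or `<endpoint> is unhealthy: ...`); `count_healthy_endpoints` is a hypothetical helper name and the sample lines are illustrative, not captured from a real cluster.

```shell
#!/bin/sh
# Hypothetical helper: count the endpoints reported as healthy in
# `etcdctl endpoint health` plain-text output read from stdin.
count_healthy_endpoints() {
  grep -c ' is healthy' || true   # grep exits non-zero when the count is 0
}

# Illustrative sample input (made-up timings and error text).
count_healthy_endpoints <<'EOF'
https://192.168.1.101:2379 is healthy: successfully committed proposal: took = 9.1ms
https://192.168.1.102:2379 is healthy: successfully committed proposal: took = 9.8ms
https://192.168.1.103:2379 is unhealthy: failed to commit proposal: context deadline exceeded
EOF

# Example usage on a master node (2>&1 because some versions report failures on stderr):
#   ETCDCTL_API=3 etcdctl --endpoints=... --cacert=... --cert=... --key=... \
#     endpoint health 2>&1 | count_healthy_endpoints
```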

5. Configure the third node (master03): join the etcd cluster

Create the /etc/systemd/system/etcd.service file (configuration specific to master03; based on master02, only the node name and IPs change):

[Unit]
Description=Etcd Server (master03)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
# The node name becomes master03 and --initial-cluster-state is again existing.
# The certificate paths are unchanged.
ExecStart=/usr/local/bin/etcd \
  --name=master03 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.103:2379 \
  --advertise-client-urls=https://192.168.1.103:2379 \
  --listen-peer-urls=https://192.168.1.103:2380 \
  --initial-advertise-peer-urls=https://192.168.1.103:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=existing \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
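
The three unit files differ only in the node name, the node IP, and initial-cluster-state, so they could also be generated from one template. The following is a sketch under that assumption, with the example cluster hard-coded; `render_etcd_unit` is a hypothetical helper name, not something kubeadm or etcd provides.

```shell
#!/bin/sh
# Hypothetical helper: print an etcd unit file for one member.
# Arguments: member name, member IP, initial cluster state (new or existing).
render_etcd_unit() {
  name=$1
  ip=$2
  state=$3
  pki=/etc/kubernetes/pki/etcd
  cluster="master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380"
  # In an unquoted here-document "\\" produces one literal backslash.
  cat <<EOF
[Unit]
Description=Etcd Server (${name})
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \\
  --name=${name} \\
  --data-dir=/var/lib/etcd \\
  --listen-client-urls=https://127.0.0.1:2379,https://${ip}:2379 \\
  --advertise-client-urls=https://${ip}:2379 \\
  --listen-peer-urls=https://${ip}:2380 \\
  --initial-advertise-peer-urls=https://${ip}:2380 \\
  --initial-cluster=${cluster} \\
  --initial-cluster-token=etcd-cluster-token \\
  --initial-cluster-state=${state} \\
  --cert-file=${pki}/server.crt \\
  --key-file=${pki}/server.key \\
  --client-cert-auth=true \\
  --trusted-ca-file=${pki}/ca.crt \\
  --peer-cert-file=${pki}/peer.crt \\
  --peer-key-file=${pki}/peer.key \\
  --peer-client-cert-auth=true \\
  --peer-trusted-ca-file=${pki}/ca.crt \\
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

# Show the two lines that differ for master02.
render_etcd_unit master02 192.168.1.102 existing | grep -e '--name=' -e '--initial-cluster-state='
```

On a node, the output would be redirected to the unit file, for example `render_etcd_unit master03 192.168.1.103 existing > /etc/systemd/system/etcd.service`.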

Start and verify master03:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify that the 3-node etcd cluster is healthy
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# List the etcd cluster members
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
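
The same three certificate flags are repeated in every etcdctl call above. A small wrapper can carry them, as in the sketch below; `etcdctl_k8s` and the `PKI` variable are hypothetical names, and the paths are the kubeadm defaults used throughout this article.

```shell
#!/bin/sh
# Directory holding the kubeadm-generated etcd certificates.
PKI=/etc/kubernetes/pki/etcd

# Hypothetical wrapper: run etcdctl with the kubeadm certificates,
# passing every other argument through unchanged.
etcdctl_k8s() {
  ETCDCTL_API=3 etcdctl \
    --cacert="$PKI/ca.crt" \
    --cert="$PKI/server.crt" \
    --key="$PKI/server.key" \
    "$@"
}

# Example usage on a master node:
#   etcdctl_k8s --endpoints=https://192.168.1.101:2379 member list
#   etcdctl_k8s --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 \
#     endpoint status --write-out=table
```

`endpoint status --write-out=table` additionally shows which member is currently the leader and the database size of each member.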

6. Verify the connection between kube-apiserver and the etcd cluster

# Check the kube-apiserver logs on every master node and confirm there are no etcd connection errors
kubectl logs -n kube-system kube-apiserver-master01 | grep etcd
kubectl logs -n kube-system kube-apiserver-master02 | grep etcd
kubectl logs -n kube-system kube-apiserver-master03 | grep etcd

# Check component status (componentstatuses has been deprecated since Kubernetes v1.19)
kubectl get componentstatuses

# Verify cluster functionality (create a test Pod)
kubectl run test-pod --image=nginx:alpine -n default
kubectl get pods test-pod
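
Another way to confirm that kube-apiserver can reach etcd is its verbose readiness endpoint, which lists each check on its own line. The sketch below assumes that output format (`[+]etcd ok` for a passing check); `readyz_etcd_ok` is a hypothetical helper name and the sample lines are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the verbose readiness output read from
# stdin contains a passing etcd check.
readyz_etcd_ok() {
  grep -qxF '[+]etcd ok'
}

# Illustrative sample of the verbose output (shortened).
if readyz_etcd_ok <<'EOF'
[+]ping ok
[+]log ok
[+]etcd ok
readyz check passed
EOF
then
  echo "apiserver reports etcd ok"
fi

# Example usage against the live cluster:
#   kubectl get --raw='/readyz?verbose' | readyz_etcd_ok && echo "etcd ok"
```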