Converting the etcd of a kubeadm-deployed Kubernetes cluster from the default static Pods to a binary etcd cluster

I. Background

Our kubeadm-deployed Kubernetes cluster has 3 master nodes, with the etcd database stacked on the masters. At one point an underlying storage I/O fault crashed the etcd cluster and took the whole Kubernetes cluster down with it. To be better prepared for a recurrence, we investigated rebuilding etcd as a binary-deployed cluster while reusing the etcd certificates generated when the cluster was originally set up with kubeadm.

Core workflow: back up the data → stop the etcd static Pod on each node one by one → install the etcd binary package → configure the etcd service per node (the first node initializes the cluster, the remaining nodes join it) → start etcd reusing the existing certificates → verify cluster connectivity.

II. Environment and Prerequisites

1. Information for the 3 master nodes (example):

Node name    IP address
master01     192.168.1.101
master02     192.168.1.102
master03     192.168.1.103

2. Back up the etcd data on all nodes (critical; this step should also be part of your routine backup schedule)

# Run on every master node (master01 shown as the example)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db
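
To confirm each snapshot is actually usable, you can inspect it with etcdctl snapshot status, a quick sanity check; on etcd 3.5+ the same subcommand also exists under etcdutl:

# Inspect the snapshot just taken (adjust the filename to match yours)
ETCDCTL_API=3 etcdctl snapshot status /data/etcd-snapshot-$(date +%Y%m%d)-$(hostname).db \
  --write-out=table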

3. Download the etcd binary package matching your current version

# Example version (adjust to your actual version)
ETCD_VERSION=v3.5.10
ETCD_ARCH=amd64
# Download and extract
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
tar -zxvf etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}.tar.gz
# Copy the binaries into the system path
cp etcd-${ETCD_VERSION}-linux-${ETCD_ARCH}/etcd* /usr/local/bin/
chmod +x /usr/local/bin/etcd*
# Verify the version
etcd --version
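
The binary version should match what the static Pod was running. One way to check is to read the image tag out of the kubeadm manifest before it is moved aside (the registry prefix may differ in your environment):

# The etcd image tag carries the version, e.g. registry.k8s.io/etcd:3.5.10-0 -> etcd v3.5.10
grep 'image:' /etc/kubernetes/manifests/etcd.yaml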

III. Step-by-step Replacement (3-node cluster)

Key configuration: in a multi-node etcd cluster, the initial-cluster parameter must be identical on every node; the first node uses initial-cluster-state=new while subsequent nodes use existing, and the certificate paths must point to the kubeadm-generated /etc/kubernetes/pki/etcd/. Note that etcd only honors the --initial-* flags when bootstrapping a brand-new member: a node that starts with its existing /var/lib/etcd data directory intact simply rejoins with its previous membership, which is what makes reusing the static Pod's data directory safe here.
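
Since the kubeadm certificates are reused verbatim, it is worth confirming up front that their SANs cover the IPs the binary etcd will listen and advertise on (a quick openssl check against the standard kubeadm paths):

# Each node's IP and 127.0.0.1 should appear in the server certificate's SANs
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -text \
  | grep -A1 'Subject Alternative Name'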

1. Stop the etcd static Pod on all master nodes

# Run on every master node to stop the kubeadm-deployed etcd static Pod:
# Back up and remove the etcd static Pod manifest
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
# Verify the etcd Pod has stopped (give the kubelet about 10 seconds)
kubectl get pods -n kube-system | grep etcd
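
Once etcd is down on every node the apiserver loses its backend, so kubectl itself may stop responding; a check at the container-runtime level is more reliable (a sketch assuming a containerd-based node with crictl available):

# Confirm no etcd container is running and nothing listens on the etcd ports
crictl ps | grep etcd
ss -tlnp | grep -E '2379|2380'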

2. Deploy the binary etcd cluster (reusing the existing certificates)

Work through the three master nodes one at a time: the first node initializes the cluster, and the second and third join it.

3. Configure the first node (master01): initialize the etcd cluster

Create the /etc/systemd/system/etcd.service file (--data-dir must match the static Pod's original data directory /var/lib/etcd; the first node initializes with --initial-cluster-state=new; all certificate paths point at the kubeadm-generated files):

[Unit]
Description=Etcd Server (master01)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
  --name=master01 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.101:2379 \
  --advertise-client-urls=https://192.168.1.101:2379 \
  --listen-peer-urls=https://192.168.1.101:2380 \
  --initial-advertise-peer-urls=https://192.168.1.101:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=new \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Start and verify etcd on master01:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
# Check the service status
systemctl status etcd
# Check single-node health. With only master01 up, the 3-member cluster has no
# quorum yet, so the endpoint may report unhealthy until the other members join;
# that is expected.
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
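
If the service fails to start, the systemd journal usually tells you why; certificate path mistakes and data-directory permissions are the usual culprits:

# Tail the etcd service logs
journalctl -u etcd -n 50 --no-pager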

4. Configure the second node (master02): join the etcd cluster

Create the /etc/systemd/system/etcd.service file (master02-specific configuration; compared with master01, only the node name, the IPs, and initial-cluster-state change):

[Unit]
Description=Etcd Server (master02)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
  --name=master02 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.102:2379 \
  --advertise-client-urls=https://192.168.1.102:2379 \
  --listen-peer-urls=https://192.168.1.102:2380 \
  --initial-advertise-peer-urls=https://192.168.1.102:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=existing \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Start and verify master02:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify health across both nodes
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

5. Configure the third node (master03): join the etcd cluster

Create the /etc/systemd/system/etcd.service file (master03-specific configuration; same as master02 except for the node name and IPs):

[Unit]
Description=Etcd Server (master03)
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
  --name=master03 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.103:2379 \
  --advertise-client-urls=https://192.168.1.103:2379 \
  --listen-peer-urls=https://192.168.1.103:2380 \
  --initial-advertise-peer-urls=https://192.168.1.103:2380 \
  --initial-cluster=master01=https://192.168.1.101:2380,master02=https://192.168.1.102:2380,master03=https://192.168.1.103:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-cluster-state=existing \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --log-level=info
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Start and verify master03:

systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
systemctl status etcd
# Verify the health of the 3-node etcd cluster
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# List the etcd cluster members
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
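
To see which member is the leader and confirm all three report a consistent revision, endpoint status with table output is handy:

# Show per-endpoint status (leader flag, raft term, DB size, revision)
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table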

6. Verify the connection between kube-apiserver and the etcd cluster

# Check each master node's kube-apiserver logs and confirm there are no etcd connection errors
kubectl logs -n kube-system kube-apiserver-master01 | grep etcd
kubectl logs -n kube-system kube-apiserver-master02 | grep etcd
kubectl logs -n kube-system kube-apiserver-master03 | grep etcd

# Check control-plane component status (componentstatuses is deprecated in
# recent Kubernetes releases but still informative where available)
kubectl get componentstatuses

# Verify cluster functionality (create a test Pod)
kubectl run test-pod --image=nginx:alpine -n default
kubectl get pods test-pod
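
Because the binary etcd still serves the same https://127.0.0.1:2379 endpoint the stacked kubeadm setup used, kube-apiserver normally needs no changes; it does not hurt to confirm which endpoints it actually points at (assuming the default kubeadm manifest path):

# Confirm the endpoints kube-apiserver uses to reach etcd
grep -- '--etcd-servers' /etc/kubernetes/manifests/kube-apiserver.yaml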