13. ETCD 存储系统、生产环境 Kubernetes 集群部署和 Kubernetes 集群升级

2026-05-07

复习和预习

昨天课堂内容

  1. 认证和授权
  2. Dashboard
  3. 动态卷供应
  4. StatefulSet

课前复习

  1. 认证和授权
  2. Dashboard
  3. 动态卷供应
  4. StatefulSet

今天课堂内容

  1. ETCD 存储系统
  2. 生产环境 Kubernetes 集群部署
  3. Kubernetes 集群升级

ETCD 存储系统

学习参考:etcd 官网、etcd 项目

ETCD 存储系统介绍

ETCD 存储系统概述

  • etcd,是一款开源、高度一致的分布式键值(key-value)存储系统。
  • etcd,使用Go语言编写的,节点之间通信通过Raft共识算法处理。
  • etcd,是 CoreOS 公司于 2013 年 6 月发起的开源项目,于 2018 年 12 月正式加入云原生计算基金会(CNCF-Cloud Native Computing Foundation),并由 CNCF 支持。

CNCF 是一个中立的基金会,也是云原生技术推广和普及的领导者。

ETCD 特点

  • 读写性能强:官网给出的Benchmark,在2CPU、1.8G内存、SSD磁盘这样的配置下,单节点的写性能可以达到16K QPS,而先写后读也能达到12K QPS。
  • 简单:安装配置简单,而且提供了 HTTP API 进行交互,使用也很简单。
  • 支持HTTPS:更安全可靠。
  • 跨平台支持:支持不同系统。
  • 二进制文件小,社区活跃。

ETCD 应用场景

归根结底,etcd 是一个键值存储组件,各种应用场景都是围绕其键值存储能力展开的。

从简单的 Web 应用程序到 Kubernetes 集群,任何复杂的应用程序都可以从 etcd 中读取数据或将数据写入 etcd。

  • 服务注册与发现:服务提供方向 ETCD 注册,服务请求方向 ETCD 查询服务方,随后与服务提供方建立连接。

  • 消息发布与订阅:在分布式系统中,最适用的一种组件间通信方式就是消息发布与订阅。即构建一个配置共享中心,数据提供者在这个配置中心发布消息,而消息使用者则订阅他们关心的主题,一旦主题有消息发布,就会实时通知订阅者。通过这种方式可以做到分布式系统配置的集中式管理与动态更新。

  • 分布式通知与协调:与消息发布和订阅有些相似。

  • 选主,在同一个分布式集群中,多个节点选举出主节点对外提供服务。

  • 分布式锁:在分布式系统中,当数据只有一份(或有数量限制)时,需要利用锁来控制某一时刻修改数据的进程数。分布式环境下的锁不仅要对各进程可见,还需要考虑进程与锁服务之间的网络问题。锁服务有两种使用方式:独占锁,即最终只有一个用户能获取到锁;控制时序锁,即获得锁的用户按时序顺序执行。(以上几类场景基于 etcdctl 的简单示意见下方代码)
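下面用 etcdctl 给出上述几类场景的简单示意(仅为演示思路,单节点环境即可执行,key 名称均为示例):

bash 复制代码
# 服务注册与发现:服务方用带租约(lease)的 key 注册自己,宕机后租约到期 key 自动消失
root@etcd:~# lease=$(etcdctl lease grant 30 | awk '{print $2}')
root@etcd:~# etcdctl put /services/web/10.1.8.21:80 up --lease=$lease
root@etcd:~# etcdctl lease keep-alive $lease &    # 服务方持续续约;请求方 get --prefix /services/web/ 即可发现实例

# 消息发布与订阅 / 分布式通知:消费者 watch 某个前缀,生产者 put 即可实时推送
root@etcd:~# etcdctl watch /config/ --prefix &
root@etcd:~# etcdctl put /config/app1 'debug=true'

# 分布式锁:同一把锁同一时刻只有一个持有者,持有期间执行指定命令
root@etcd:~# etcdctl lock /locks/job1 sleep 10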

单节点部署

提前设置:

  • 主机名:etcd.laoma.cloud
  • IP地址:10.1.8.10
bash 复制代码
# 所有节点配置
root@etcd:~# vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.1.8.10 etcd.laoma.cloud etcd

# 安装
root@etcd:~# apt install -y etcd-server etcd-client
# 安装完成后,etcd 会默认启用并启动

# 配置
root@etcd:~# cp /etc/default/etcd /etc/default/etcd.ori
root@etcd:~# vim /etc/default/etcd
bash 复制代码
# 定义节点名称,默认为default
ETCD_NAME="default"

# 定义节点数据存储目录,默认/var/lib/etcd/default
ETCD_DATA_DIR="/var/lib/etcd/default"

# 定义对外提供服务的监听地址,本机机器执行etcdctl命令连接的监听
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.10:2379"

# 定义对外提供服务的监听地址,其他机器执行etcdctl命令连接的监听
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.10:2379"

配置说明:

  1. 如果只需要本机通过 http://localhost:2379 访问,则配置:ETCD_LISTEN_CLIENT_URLS="http://localhost:2379"
  2. 如果只需要通过节点地址 http://10.1.8.10:2379 访问,则配置:ETCD_LISTEN_CLIENT_URLS="http://10.1.8.10:2379"
  3. 如果本机和其他客户端都需要访问,则同时监听两个地址(即上例配置):ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.10:2379",ETCD_ADVERTISE_CLIENT_URLS 按相同思路配置。
bash 复制代码
# 重启服务
root@etcd:~# systemctl restart etcd

# 验证状态
root@etcd:~# etcdctl member list 
8e9e05c52164694d, started, default, http://localhost:2380, http://10.1.8.10:2379,http://localhost:2379, false

# 写入并查询键值对
root@etcd:~# etcdctl put /xxx yyy
OK
root@etcd:~# etcdctl get /xxx
/xxx
yyy

清除节点所有数据

bash 复制代码
root@etcd:~# systemctl stop etcd.service
root@etcd:~# apt remove -y etcd-server etcd-client
root@etcd:~# rm -fr /var/lib/etcd/default

多节点集群部署

部署集群

**实践:**先部署两节点环境,再演示添加和删除节点。

所有节点配置
bash 复制代码
root@etcdN:~# vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

#########  etcd cluster #########
10.1.8.11 etcd1.laoma.cloud etcd1
10.1.8.12 etcd2.laoma.cloud etcd2
10.1.8.13 etcd3.laoma.cloud etcd3

root@etcdN:~# apt update && apt install -y etcd-server etcd-client
root@etcdN:~# systemctl stop etcd.service
root@etcdN:~# rm -fr /var/lib/etcd/default
etcd1 配置
bash 复制代码
root@etcd1:~# cp /etc/default/etcd /etc/default/etcd.ori
root@etcd1:~# vim /etc/default/etcd
bash 复制代码
ETCD_NAME="etcd1"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.11:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.11:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.11:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.11:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.1.8.11:2380,etcd2=http://10.1.8.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcdcluster"
ETCD_INITIAL_CLUSTER_STATE="new"

参数说明

bash 复制代码
# ETCD_DATA_DIR 定义节点的数据存储目录,默认/var/lib/etcd/default
# ETCD_NAME 定义ETCD节点名称,默认为default

# ETCD_LISTEN_CLIENT_URLS 对外提供服务的监听地址,用于服务端自己
# ETCD_ADVERTISE_CLIENT_URLS 该节点对外提供服务的监听地址,用于客户端

# ETCD_LISTEN_PEER_URLS 定义ETCD监听URL,用于与其他节点心跳通讯
# ETCD_INITIAL_ADVERTISE_PEER_URLS 该节点同伴监听地址,这个值会告诉集群中其他节点
# ETCD_INITIAL_CLUSTER 集群中所有节点的信息

# ETCD_INITIAL_CLUSTER_TOKEN 集群的token,用于集群节点心跳通信认证,集群中所有节点统一。
# ETCD_INITIAL_CLUSTER_STATE 新建集群的时候,这个值为 new ;假如已经存在的集群,这个值为 existing
etcd2 配置
bash 复制代码
root@etcd2:~# cp /etc/default/etcd /etc/default/etcd.ori
root@etcd2:~# vim /etc/default/etcd
bash 复制代码
ETCD_NAME="etcd2"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.12:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.12:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.12:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.12:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.1.8.11:2380,etcd2=http://10.1.8.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcdcluster"
ETCD_INITIAL_CLUSTER_STATE="new"
启动集群
bash 复制代码
# etcd1 和 etcd2 节点启动服务
root@etcdN:~# systemctl start etcd
# 如果etcd节点之前是一个独立的集群,则需要删除/var/lib/etcd/default目录下所有文件再启动。

# 验证状态
root@etcd1:~# etcdctl member list
1b344327dd751339, started, etcd2, http://10.1.8.12:2380, http://10.1.8.12:2379,http://localhost:2379, false
5b1db881c63d5ce4, started, etcd1, http://10.1.8.11:2380, http://10.1.8.11:2379,http://localhost:2379, false

验证部署

bash 复制代码
root@etcd1:~# etcdctl put /users/user1/name laoma
OK

root@etcd2:~# etcdctl get /users/user1/name 
/users/user1/name
laoma

集群添加节点

更新集群中现有节点的配置文件:

  • 更新ETCD_INITIAL_CLUSTER参数,添加etcd3=http://10.1.8.13:2380
  • 更新ETCD_INITIAL_CLUSTER_STATE为existing
更新 etcd1 配置文件
bash 复制代码
root@etcd1:~# vim /etc/default/etcd
bash 复制代码
ETCD_NAME="etcd1"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.11:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.11:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.11:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.11:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.1.8.11:2380,etcd2=http://10.1.8.12:2380,etcd3=http://10.1.8.13:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcdcluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
bash 复制代码
# 重启etcd服务
root@etcd1:~# systemctl restart etcd
更新 etcd2 配置文件
bash 复制代码
root@etcd2:~# vim /etc/default/etcd
bash 复制代码
ETCD_NAME="etcd2"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.12:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.12:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.12:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.12:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.1.8.11:2380,etcd2=http://10.1.8.12:2380,etcd3=http://10.1.8.13:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcdcluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
bash 复制代码
# 重启 etcd 服务
root@etcd2:~# systemctl restart etcd
注册新成员(member add)

集群中任意节点执行以下指令添加节点

bash 复制代码
root@etcd1:~# etcdctl member add etcd3 --peer-urls=http://10.1.8.13:2380

输出内容如下:

bash 复制代码
Member 2038744cfb6bf628 added to cluster f917ad0d43233ca7

ETCD_NAME="etcd3"
ETCD_INITIAL_CLUSTER="etcd2=http://10.1.8.12:2380,etcd3=http://10.1.8.13:2380,etcd1=http://10.1.8.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.13:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
etcd3 部署 etcd
bash 复制代码
root@etcd3:~# apt install -y etcd-server etcd-client
root@etcd3:~# systemctl stop etcd.service
root@etcd3:~# rm -fr /var/lib/etcd/default

root@etcd1:~# scp /etc/default/etcd root@etcd3:/etc/default/etcd
root@etcd3:~# vim /etc/default/etcd
bash 复制代码
ETCD_NAME="etcd3"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.13:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.13:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.13:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.13:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.1.8.11:2380,etcd2=http://10.1.8.12:2380,etcd3=http://10.1.8.13:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcdcluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
bash 复制代码
# 启动etcd 服务
root@etcd3:~# systemctl start etcd

# 查看成员列表
root@etcd1:~# etcdctl member list
1b344327dd751339, started, etcd2, http://10.1.8.12:2380, http://10.1.8.12:2379,http://localhost:2379, false
2038744cfb6bf628, started, etcd3, http://10.1.8.13:2380, http://10.1.8.13:2379,http://localhost:2379, false
5b1db881c63d5ce4, started, etcd1, http://10.1.8.11:2380, http://10.1.8.11:2379,http://localhost:2379, false

集群删除节点

bash 复制代码
# 删除etcd3
root@etcd1:~# etcdctl member remove 2038744cfb6bf628
Member 2038744cfb6bf628 removed from cluster f917ad0d43233ca7

root@etcd1:~# etcdctl member list
1b344327dd751339, started, etcd2, http://10.1.8.12:2380, http://10.1.8.12:2379,http://localhost:2379, false
ae40c1152eed3b3a, started, etcd1, http://10.1.8.11:2380, http://10.1.8.11:2379,http://localhost:2379, false

# 更新etcd1和etcd2节点配置文件为初始2节点集群配置,状态设置为existing
root@etcd1:~# vim /etc/default/etcd
ETCD_NAME="etcd1"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.11:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.11:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.11:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.11:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.1.8.11:2380,etcd2=http://10.1.8.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcdcluster"
ETCD_INITIAL_CLUSTER_STATE="existing"

root@etcd2:~# vim /etc/default/etcd
ETCD_NAME="etcd2"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.12:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.12:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.12:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.12:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.1.8.11:2380,etcd2=http://10.1.8.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcdcluster"
ETCD_INITIAL_CLUSTER_STATE="existing"

# 更新完成后建议etcd1和etcd2节点重启etcd服务
root@etcd1:~# systemctl restart etcd
root@etcd2:~# systemctl restart etcd

**重要说明:**节点加入集群前,确保其 etcd 数据目录中没有任何数据。

清理节点etcd3

bash 复制代码
root@etcd3:~# systemctl stop etcd.service
root@etcd3:~# apt remove -y etcd-server etcd-client
root@etcd3:~# rm -fr /var/lib/etcd/default

etcdctl 命令

etcd API 有 V2 和 V3 两个版本,etcdctl 使用哪个版本可通过环境变量 ETCDCTL_API 指定,当前默认使用 v3。

注意:所有节点的 ETCDCTL_API 必须保持一致,并注意所用版本对 v2/v3 存储数据备份的支持情况(见后文"数据的备份和恢复")。
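示例:临时或持久地指定 API 版本(/etc/profile.d/etcd.sh 文件名为示例约定):

bash 复制代码
# 临时指定,仅对本条命令生效
root@etcd:~# ETCDCTL_API=3 etcdctl version

# 持久指定,写入 profile,保证所有节点一致
root@etcd:~# echo 'export ETCDCTL_API=3' > /etc/profile.d/etcd.sh
root@etcd:~# source /etc/profile.d/etcd.sh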

配置补全

bash 复制代码
root@etcd1:~# mkdir -p /etc/bash_completion.d
root@etcd1:~# etcdctl completion bash > /etc/bash_completion.d/etcdctl
root@etcd1:~# source /etc/bash_completion.d/etcdctl

帮助信息

bash 复制代码
root@etcd1:~# etcdctl -h
NAME:
	etcdctl - A simple command line client for etcd3.

USAGE:
	etcdctl [flags]

VERSION:
	3.4.30

API VERSION:
	3.4


COMMANDS:
	alarm disarm		Disarms all alarms
	alarm list		Lists all alarms
	auth disable		Disables authentication
	auth enable		Enables authentication
	check datascale		Check the memory usage of holding data for different workloads on a given server endpoint.
	check perf		Check the performance of the etcd cluster
	compaction		Compacts the event history in etcd
	completion bash		Generate the autocompletion script for bash
	completion fish		Generate the autocompletion script for fish
	completion powershell	Generate the autocompletion script for powershell
	completion zsh		Generate the autocompletion script for zsh
	defrag			Defragments the storage of the etcd members with given endpoints
	del			Removes the specified key or range of keys [key, range_end)
	elect			Observes and participates in leader election
	endpoint hashkv		Prints the KV history hash for each endpoint in --endpoints
	endpoint health		Checks the healthiness of endpoints specified in `--endpoints` flag
	endpoint status		Prints out the status of endpoints specified in `--endpoints` flag
	get			Gets the key or a range of keys
	help			Help about any command
	lease grant		Creates leases
	lease keep-alive	Keeps leases alive (renew)
	lease list		List all active leases
	lease revoke		Revokes leases
	lease timetolive	Get lease information
	lock			Acquires a named lock
	make-mirror		Makes a mirror at the destination etcd cluster
	member add		Adds a member into the cluster
	member list		Lists all members in the cluster
	member promote		Promotes a non-voting member in the cluster
	member remove		Removes a member from the cluster
	member update		Updates a member in the cluster
	migrate			Migrates keys in a v2 store to a mvcc store
	move-leader		Transfers leadership to another etcd cluster member.
	put			Puts the given key into the store
	role add		Adds a new role
	role delete		Deletes a role
	role get		Gets detailed information of a role
	role grant-permission	Grants a key to a role
	role list		Lists all roles
	role revoke-permission	Revokes a key from a role
	snapshot restore	Restores an etcd member snapshot to an etcd directory
	snapshot save		Stores an etcd node backend snapshot to a given file
	snapshot status		Gets backend snapshot status of a given file
	txn			Txn processes all the requests in one transaction
	user add		Adds a new user
	user delete		Deletes a user
	user get		Gets detailed information of a user
	user grant-role		Grants a role to a user
	user list		Lists all users
	user passwd		Changes password of user
	user revoke-role	Revokes a role from a user
	version			Prints the version of etcdctl
	watch			Watches events stream on keys or prefixes

OPTIONS:
      --cacert=""				verify certificates of TLS-enabled secure servers using this CA bundle
      --cert=""					identify secure client using this TLS certificate file
      --command-timeout=5s			timeout for short running command (excluding dial timeout)
      --debug[=false]				enable client-side debug logging
      --dial-timeout=2s				dial timeout for client connections
  -d, --discovery-srv=""			domain name to query for SRV records describing cluster endpoints
      --discovery-srv-name=""			service name to query when using DNS discovery
      --endpoints=[127.0.0.1:2379]		gRPC endpoints
  -h, --help[=false]				help for etcdctl
      --hex[=false]				print byte strings as hex encoded strings
      --insecure-discovery[=true]		accept insecure SRV records describing cluster endpoints
      --insecure-skip-tls-verify[=false]	skip server certificate verification (CAUTION: this option should be enabled only for testing purposes)
      --insecure-transport[=true]		disable transport security for client connections
      --keepalive-time=2s			keepalive time for client connections
      --keepalive-timeout=6s			keepalive timeout for client connections
      --key=""					identify secure client using this TLS key file
      --password=""				password for authentication (if this option is used, --user option shouldn't include password)
      --user=""					username[:password] for authentication (prompt if password is not supplied)
  -w, --write-out="simple"			set the output format (fields, json, protobuf, simple, table)

key-value管理

创建 key-value
bash 复制代码
root@etcd:~# etcdctl put /users/user1/name laoma
OK
查询 key-value
bash 复制代码
# 查看key-value
root@etcd:~# etcdctl get /users/user1/name 
/users/user1/name
laoma

# 只查看value
root@etcd:~# etcdctl get /users/user1/name --print-value-only
laoma

# 只查看key
root@etcd:~# etcdctl get /users/user1/name --keys-only
/users/user1/name

# 再创建一些 key-value
root@etcd:~# etcdctl put /users/user1/age 18
root@etcd:~# etcdctl put /users/user2/name laowang
root@etcd:~# etcdctl put /usage/ls ls--help
root@etcd:~# etcdctl put /usage/cd cd--help
root@etcd:~# etcdctl put /registry/services/endpoints/laoma/web 10.224.225.98:80
# k8s将svc数据按如下格式命名存在etcd系统中:
# /registry/services/endpoints/namespace/svcname

# 查看以/开头的所有key
root@etcd:~# etcdctl get / --keys-only --prefix |grep -v '^$'
/registry/services/endpoints/laoma/web
/usage/cd
/usage/ls
/users/user1/age
/users/user1/name
/users/user2/name

# 查看以/users开头的所有key
root@etcd:~# etcdctl get /users --keys-only --prefix |grep -v '^$'
/users/user1/age
/users/user1/name
/users/user2/name

# 如何查看"特定目录"下有哪些文件或者子目录,例如/users
root@etcd:~# etcdctl get /users --keys-only --prefix |grep -v '^$'| awk -F '/' '{print "/"$2"/"$3}'| sort |uniq
/users/user1
/users/user2
脚本查询

以上命令太繁琐,编写脚本 /usr/local/bin/etcdls 实现:

bash 复制代码
#!/bin/bash
# 如果脚本未提供参数,查询的前缀设置为"/"
PREFIX=${1:-/}

# 统计以该前缀开头的"目录"数量
count=$(etcdctl get $PREFIX --keys-only --prefix |grep -v '^$'|egrep -o "$PREFIX[^/]*/" |sort |uniq|wc -l)
# 正则表达式说明 $PREFIX[^/]*/
## $PREFIX 变量
## [^/]* 变量后跟着任意数量的非/字符串
## 最后补1个/

# 如果数量等于1,则显示该目录下子条目
if [ $count -eq 1 ];then
  etcdctl get $PREFIX --keys-only --prefix |grep -v '^$'|egrep -o "^$PREFIX[^/]*/[^/]*" | sort | uniq
else
# 如果数量不等于1,则显示该前缀匹配的目录条目
  etcdctl get $PREFIX --keys-only --prefix |grep -v '^$'|egrep -o "^$PREFIX[^/]*" | sort | uniq
fi
bash 复制代码
root@etcd:~# chmod +x /usr/local/bin/etcdls

查询效果如下:

bash 复制代码
root@etcd:~# etcdls 
/registry
/usage
/users
root@etcd:~# etcdls /us
/usage
/users
root@etcd:~# etcdls /user
/users/user1
/users/user2
更新 value
bash 复制代码
root@etcd:~# etcdctl put /users/user1/name laohu
OK
root@etcd:~# etcdctl get /users/user1/name --print-value-only
laohu
删除 key
bash 复制代码
root@etcd:~# etcdctl del /users/user2/name 
1
root@etcd:~# etcdctl get /users/user2/name --print-value-only
root@etcd:~# etcdls /users/user
/users/user1/age
/users/user1/name

仔细观察:当删除 /users/user2/name 后,/users/user2 条目也随之消失。这是因为 etcd v3 的 key 是扁平结构,并不存在真正的目录;脚本显示的"目录"只是根据 key 前缀推断出来的,当不再有 key 以 /users/user2/ 开头时,该"目录"自然不再出现。

动态监控 key 值
bash 复制代码
[root@etcd bin]# etcdctl put /users/user1/name laoma1
OK
[root@etcd bin]# etcdctl put /users/user1/name laoma2
OK

# 需先在另一个终端执行 watch,再执行上面的 put 命令,watch 端输出如下:
root@etcd:~# etcdctl watch /users/user1/name 
PUT
/users/user1/name
laoma1
PUT
/users/user1/name
laoma2

user 管理

bash 复制代码
root@etcd:~# etcdctl user -h
NAME:
	user - User related commands

USAGE:
	etcdctl user <subcommand> [flags]

API VERSION:
	3.3


COMMANDS:
	add		Adds a new user
	delete		Deletes a user
	get		Gets detailed information of a user
	grant-role	Grants a role to a user
	list		Lists all users
	passwd		Changes password of user
	revoke-role	Revokes a role from a user
......

role 管理

bash 复制代码
root@etcd:~# etcdctl role -h
NAME:
	role - Role related commands

USAGE:
	etcdctl role <subcommand> [flags]

API VERSION:
	3.3


COMMANDS:
	add			Adds a new role
	delete			Deletes a role
	get			Gets detailed information of a role
	grant-permission	Grants a key to a role
	list			Lists all roles
	revoke-permission	Revokes a key from a role
......

member 管理

bash 复制代码
root@etcd:~# etcdctl member -h
NAME:
	member - Membership related commands

USAGE:
	etcdctl member <subcommand> [flags]

API VERSION:
	3.3


COMMANDS:
	add	Adds a member into the cluster
	list	Lists all members in the cluster
	remove	Removes a member from the cluster
	update	Updates a member in the cluster

查询 Kubernetes 数据

Kubernetes 从 1.13 版本开始,默认使用 etcd v3 API。数据存放在以 /registry 为前缀的 key 中。

确保master节点安装了etcdctl工具。

bash 复制代码
[root@master30 ~]# apt install etcd-client -y

我们以集成 etcd 静态 pod 的Kubernetes为例,讲解如何查询。

在master节点创建脚本/usr/local/bin/etcdls,内容如下:

bash 复制代码
#!/bin/bash
export ETCDCTL_API=3
key_file=/etc/kubernetes/pki/etcd/server.key
cert_file=/etc/kubernetes/pki/etcd/server.crt
cacert_file=/etc/kubernetes/pki/etcd/ca.crt
endpoints=https://127.0.0.1:2379

# 如果脚本未提供参数,查询的前缀设置为"/"
PREFIX=${1:-/}
tmpfile=$(mktemp)
etcdctl --key=${key_file} --cert=${cert_file} --cacert=${cacert_file} --endpoints=${endpoints} get $PREFIX --keys-only --prefix |grep -v '^$'|egrep "$PREFIX[^/]*/" > $tmpfile

# 统计以该前缀开头的"目录"数量
count=$(cat $tmpfile | egrep -o "$PREFIX[^/]*/" | sort | uniq | wc -l )
# 如果数量等于1,则显示该目录下子条目
if [ $count -eq 1 ];then
  cat $tmpfile | egrep -o "^$PREFIX[^/]*/[^/]*" | sort | uniq
else
# 如果数量不等于1,则显示该前缀匹配的目录条目
  cat $tmpfile | egrep -o "^$PREFIX[^/]*" | sort | uniq
fi

# 清理临时文件
rm -f $tmpfile

在 master 节点创建脚本/usr/local/bin/etcdget,内容如下:

bash 复制代码
#!/bin/bash
export ETCDCTL_API=3
key_file=/etc/kubernetes/pki/etcd/server.key
cert_file=/etc/kubernetes/pki/etcd/server.crt
cacert_file=/etc/kubernetes/pki/etcd/ca.crt
endpoints=https://127.0.0.1:2379

etcdctl --key=${key_file} --cert=${cert_file} --cacert=${cacert_file} --endpoints=${endpoints} get $*

以查询 service 数据为例

bash 复制代码
[root@master30 ~]# etcdls /registry/ser
/registry/serviceaccounts
/registry/services

[root@master30 ~]# etcdls /registry/services
/registry/services/endpoints
/registry/services/specs

[root@master30 ~]# etcdls /registry/services/endpoints
/registry/services/endpoints/default
/registry/services/endpoints/kube-system

[root@master30 ~]# etcdls /registry/services/endpoints/kube-system
/registry/services/endpoints/kube-system/kube-dns

[root@master30 ~]# etcdls /registry/services/endpoints/kube-system/kube-dns
[root@master30 ~]# etcdget /registry/services/endpoints/kube-system/kube-dns --print-value-only

# 除了/registry/apiregistration.k8s.io是直接存储JSON格式的,其他资源是通过 protobuf 格式存储,这么做的原因是为了性能。

# 使用 strings 命令可解码原文,该工具由 binutils 包提供
[root@master30 ~]# etcdget /registry/services/endpoints/kube-system/kube-dns --print-value-only | strings
# openshift 项目已经开发了一个强大的辅助工具 etcdhelper 可以读取 etcd 内容并解码proto。

数据的备份和恢复

  • etcd v3 可以同时备份和恢复 v2和v3创建的数据。
  • etcd v2 无法备份 v3 创建的数据,也无法恢复v3 创建的数据。

物理备份和恢复

单节点集群
备份
bash 复制代码
# 准备测试数据
root@etcd:~# etcdctl put /users/user1/name laoma
root@etcd:~# etcdctl put /users/user1/age 18
root@etcd:~# etcdctl put /users/user2/name laowang
root@etcd:~# etcdctl put /usage/ls ls--help
root@etcd:~# etcdctl put /usage/cd cd--help

# 停止 etcd 集群,备份数据目录
root@etcd:~# systemctl stop etcd.service
root@etcd:~# cp -a /var/lib/etcd/default /var/lib/etcd/default.bak
删除数据
bash 复制代码
# 启动etcd集群
root@etcd:~# systemctl start etcd.service

# 删除测试数据
root@etcd:~# etcdctl del /users/user2/name 
恢复
bash 复制代码
# 停止 etcd 集群,恢复数据目录
root@etcd:~# systemctl stop etcd.service
root@etcd:~# mv /var/lib/etcd/default /var/lib/etcd/default.bad
root@etcd:~# cp -a /var/lib/etcd/default.bak /var/lib/etcd/default
root@etcd:~# chown -R etcd:etcd /var/lib/etcd/default/
root@etcd:~# systemctl start etcd.service

# 验证恢复
root@etcd:~# etcdctl get /users/user2/name
/users/user2/name
laowang
多节点集群
备份
bash 复制代码
# 准备测试数据
root@etcd1:~# etcdctl put /users/user1/name laoma
root@etcd1:~# etcdctl put /users/user1/age 18
root@etcd1:~# etcdctl put /users/user2/name laowang
root@etcd1:~# etcdctl put /usage/ls ls--help
root@etcd1:~# etcdctl put /usage/cd cd--help

# 所有节点停止 etcd 服务,备份数据目录
root@etcdN:~# systemctl stop etcd.service
root@etcdN:~# cp -a /var/lib/etcd/default /var/lib/etcd/default.bak
删除数据

只需在任意一个节点上删除数据即可(集群会自动同步)。

bash 复制代码
# 启动etcd集群
root@etcd1:~# systemctl start etcd.service

# 删除测试数据
root@etcd1:~# etcdctl del /users/user2/name 
恢复
bash 复制代码
# 所有节点停止 etcd 服务,恢复数据目录
root@etcdN:~# systemctl stop etcd.service
root@etcdN:~# mv /var/lib/etcd/default /var/lib/etcd/default.bad
root@etcdN:~# cp -r /var/lib/etcd/default.bak /var/lib/etcd/default
root@etcdN:~# chown -R etcd:etcd /var/lib/etcd/default/
root@etcdN:~# systemctl start etcd.service

# 验证恢复
root@etcdN:~# etcdctl get /users/user2/name
/users/user2/name
laowang

逻辑备份与恢复

单节点集群
备份
bash 复制代码
# 确保使用 API 3
root@etcd:~# export ETCDCTL_API=3

# 备份数据到文件 snapshot.db
root@etcd:~# etcdctl snapshot save snapshot.db
Snapshot saved at snapshot.db
恢复
bash 复制代码
# 停止etcd服务并移走数据目录 /var/lib/etcd/default/
root@etcd:~# systemctl stop etcd
root@etcd:~# mv /var/lib/etcd/default /var/lib/etcd/default.old

# 恢复快照文件snapshot.db到目录/var/lib/etcd/default/
root@etcd:~# etcdctl snapshot restore snapshot.db --data-dir=/var/lib/etcd/default/

# 更改文件所有权,并启动etcd服务
root@etcd:~# chown -R etcd:etcd /var/lib/etcd/default
root@etcd:~# systemctl start etcd.service
多节点集群
备份

因为多节点集群之间数据是一致的,所以多节点集群备份,只需要在单个节点备份即可。

bash 复制代码
# 确保使用 API 3
root@etcd:~# export ETCDCTL_API=3

# 备份数据到文件 /root/snapshot1.db
root@etcd1:~# etcdctl snapshot save /root/snapshot1.db
Snapshot saved at /root/snapshot1.db

# 将备份文件同步到所有节点
恢复
bash 复制代码
# 停止所有节点etcd服务并删除数据目录 /var/lib/etcd/default/
root@etcdN:~# systemctl stop etcd
root@etcdN:~# rm -fr /var/lib/etcd/default/

# 所有节点恢复快照文件snapshot1.db到目录/var/lib/etcd/default/
root@etcd1:~# etcdctl snapshot restore /root/snapshot1.db \
--data-dir=/var/lib/etcd/default
root@etcd1:~# chown -R etcd:etcd /var/lib/etcd/default
root@etcd1:~# systemctl start etcd

root@etcd2:~# etcdctl snapshot restore /root/snapshot1.db \
--data-dir=/var/lib/etcd/default
root@etcd2:~# chown -R etcd:etcd /var/lib/etcd/default
root@etcd2:~# systemctl start etcd

root@etcd3:~# etcdctl snapshot restore /root/snapshot1.db \
--data-dir=/var/lib/etcd/default
root@etcd3:~# chown -R etcd:etcd /var/lib/etcd/default
root@etcd3:~# systemctl start etcd

Kubernetes 备份和恢复

bash 复制代码
# 准备工作
[root@master30 ~]# export ETCDCTL_API=3
[root@master30 ~]# key_file=/etc/kubernetes/pki/etcd/server.key
[root@master30 ~]# cert_file=/etc/kubernetes/pki/etcd/server.crt
[root@master30 ~]# cacert_file=/etc/kubernetes/pki/etcd/ca.crt
[root@master30 ~]# endpoints=https://127.0.0.1:2379

# 创建一个应用
[root@master30 ~]# kubectl create deployment web --image=httpd --replicas=2

# 备份
[root@master30 ~]# etcdctl --key=${key_file} --cert=${cert_file} --cacert=${cacert_file} --endpoints=${endpoints} snapshot save snapshot.db

# 删除应用
[root@master30 ~]# kubectl delete deployments.apps web

# 恢复验证
[root@master30 ~]# rm -fr /var/lib/etcd/
[root@master30 ~]# etcdctl --key=${key_file} --cert=${cert_file} --cacert=${cacert_file} --endpoints=${endpoints} snapshot restore snapshot.db --data-dir=/var/lib/etcd --name=master30.laoma.cloud --initial-cluster=master30.laoma.cloud=https://10.1.8.30:2380 --initial-advertise-peer-urls=https://10.1.8.30:2380
# 说明:
#    master30.laoma.cloud是master节点主机名
#    10.1.8.30 是 master30 节点主机 ip

[root@master30 ~]# kubectl get deployments.apps

生产环境 Kubernetes 集群部署

高可用概述

Kubernetes中的业务系统,通过控制器实现业务的负载均衡和高可用。

Kubernetes集群采用单节点Master架构存在的问题:

  • 一旦Master故障,将导致整个集群不可用。
  • 单节点Master工作负载能力有限。

为了实现没有单点故障的目标,需要为以下几个组件建立高可用方案:

  • etcd
  • kube-apiserver
  • kube-controller-manager与kube-scheduler
  • kube-dns

Etcd 集群高可用

Etcd集群高可用方案:

  • 独立的etcd集群,使用3台或者5台服务器运行etcd集群。采用这项策略的主要动机是etcd集群的节点增减都需要显式的通知集群,保证etcd集群节点稳定可以更方便的用程序完成集群滚动升级,减轻维护负担。如果此时在etcd集群前构建一套高可用+负载均衡服务器,则etcd集群的节点增减将不用通知Kubernetes集群。

  • Kubernetes中运行etcd集群,在Kubernetes Master上用static pod的形式来运行etcd,并将多台Kubernetes Master上的etcd组成集群。在这一模式下,各个服务器的etcd实例被注册进了Kubernetes当中,虽然无法直接使用kubectl来管理这部分实例,但是监控以及日志搜集组件均可正常工作。在这一模式运行下的etcd可管理性更强。

  • self-hosted etcd集群,使用CoreOS提出的self-hosted etcd方案,将本应在底层为Kubernetes提供服务的etcd运行在Kubernetes之上。实现Kubernetes对自身依赖组件的管理。在这一模式下的etcd集群可以直接使用etcd-operator来自动化运维,最符合Kubernetes的使用习惯。

这三种思路均可以实现etcd高可用的目标,但是在选择过程中却要根据实际情况做出一些判断:

  • 预算充足但保守的项目选方案一
  • 想一步到位并愿意承担一定风险的项目选方案三
  • 折中一点选方案二

kube-apiserver 高可用

apiserver本身是一个无状态服务,要实现其高可用,在于如何将运行在多台服务器上的apiserver用一个统一的外部入口暴露给所有Node节点和客户端。

对于这种无状态服务的高可用,有两种方案:

  • 使用外部负载均衡器,不管是使用公有云提供的负载均衡器服务或是在私有云中使用LVS或者HaProxy自建负载均衡器都可以归到这一类。如何保证负载均衡器的高可用,则是选择这一方案需要考虑的新问题。

  • 在网络层做负载均衡,比如在Master节点上用BGP做ECMP,或者在Node节点上用iptables做NAT都可以实现。采用这一方案不需要额外的外部服务,但是对网络配置有一定的要求。

这两种思路均可以实现 kube-apiserver 高可用的目标,但要根据实际情况做出选择:

  • 建议在公有云上的集群多考虑方案一
  • 建议私有云上的集群多考虑方案二,因为采用方案一时需要在私有云环境中额外自建并维护负载均衡器

kube-controller-manager与kube-scheduler

这两项服务是 Master 节点的一部分,它们的高可用相对容易,仅需要运行多份实例即可。

目前在多个Master节点上采用static pod模式部署这两项服务的方案比较常见。
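多份实例并不会同时工作:kube-controller-manager 和 kube-scheduler 通过 leader 选举保证同一时刻只有一个实例在处理任务。集群部署完成后(见后文),可以通过 Lease 对象查看当前的 leader,示例如下:

bash 复制代码
# 查看两组件的选举 Lease,HOLDER 字段即当前 leader 所在实例
[root@master21 ~]# kubectl -n kube-system get leases kube-controller-manager kube-scheduler

# 只取 kube-scheduler 当前 leader 的标识
[root@master21 ~]# kubectl -n kube-system get lease kube-scheduler -o jsonpath='{.spec.holderIdentity}{"\n"}'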

kube-dns

严格来说,kube-dns(当前默认实现为 CoreDNS)并不算 Master 组件的一部分,因为它可以运行在 Node 节点上,并通过 Service 向集群内部提供服务。

为了避免单点故障,请将 kube-dns 的 replicas 值设为 2 或更多,并用 anti-affinity 将它们部署在不同的 Node 节点上。
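下面是一个示意(假设集群已部署;新版 kubeadm 默认部署 2 副本 CoreDNS,且通常已带有类似的反亲和配置,可先查看再决定是否调整):

bash 复制代码
# 查看当前 coredns 副本数及已有的亲和性配置
[root@master21 ~]# kubectl -n kube-system get deployment coredns
[root@master21 ~]# kubectl -n kube-system get deployment coredns -o yaml | grep -A12 'affinity:'

# 如需调整:扩容副本并补充 podAntiAffinity,尽量把副本分散到不同节点
[root@master21 ~]# kubectl -n kube-system scale deployment coredns --replicas=2
[root@master21 ~]# kubectl -n kube-system patch deployment coredns --type merge -p '
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  k8s-app: kube-dns
'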

高可用架构

本实验的高可用架构:两台 HA 节点通过 Keepalived 提供 VIP 10.1.8.100,HAProxy 将 6443 端口的请求负载均衡到三台 master 的 kube-apiserver;etcd 采用独立的三节点集群(external 模式)。

高可用实践

实验拓扑

主机清单

| 集群 | 节点 | 地址 | 角色 |
| --- | --- | --- | --- |
| 高可用 | ha01.laoma.cloud | 10.1.8.101 | 高可用和负载均衡 |
| 高可用 | ha02.laoma.cloud | 10.1.8.102 | 高可用和负载均衡 |
| etcd | etcd11.laoma.cloud | 10.1.8.111 | etcd 节点 |
| etcd | etcd12.laoma.cloud | 10.1.8.112 | etcd 节点 |
| etcd | etcd13.laoma.cloud | 10.1.8.113 | etcd 节点 |
| Kubernetes | master21.laoma.cloud | 10.1.8.121 | master 节点 |
| Kubernetes | master22.laoma.cloud | 10.1.8.122 | master 节点 |
| Kubernetes | master23.laoma.cloud | 10.1.8.123 | master 节点 |
| Kubernetes | worker31.laoma.cloud | 10.1.8.131 | worker 节点 |
| Kubernetes | worker32.laoma.cloud | 10.1.8.132 | worker 节点 |
| Kubernetes | worker33.laoma.cloud | 10.1.8.133 | worker 节点 |

主机硬件配置清单

  • Kubernetes节点:2 CPU、4G 内存
  • 其他节点:1 CPU、1G 内存

准备虚拟机-base 模版

安装系统

Ubuntu 24.04 系统安装

最小化安装系统,不需要swap分区,按以下要求分区。

  • /boot 2G
  • / 90G
安装基础软件包
bash 复制代码
root@ubuntu2404:~# apt update && apt install -y vim bash-completion open-vm-tools apt-transport-https lrzsz unzip
配置仓库源
操作系统仓库

操作系统仓库换成华为云的仓库,速度更快。

bash 复制代码
root@ubuntu2404:~# cat /etc/apt/sources.list.d/ubuntu.sources
Types: deb
URIs: http://mirrors.huaweicloud.com/ubuntu/
Suites: noble noble-updates noble-backports
Components: main restricted universe multiverse
Signed-By: /usr/share/keyrings/ubuntu-archive-keyring.gpg

#Types: deb
#URIs: http://security.ubuntu.com/ubuntu/
#Suites: noble-security
#Components: main restricted universe multiverse
#Signed-By: /usr/share/keyrings/ubuntu-archive-keyring.gpg
containerd 仓库
bash 复制代码
# 导入 containerd 仓库 key
root@ubuntu2404:~# curl -fsSL https://mirrors.huaweicloud.com/docker-ce/linux/ubuntu/gpg | gpg --dearmour -o /etc/apt/trusted.gpg.d/containerd.gpg

# 添加 containerd 仓库
root@ubuntu2404:~# cat << 'EOF' > /etc/apt/sources.list.d/docker-ce.list
deb [arch=amd64] https://mirrors.huaweicloud.com/docker-ce/linux/ubuntu noble stable
EOF
kubernetes 仓库

由于 Kubernetes 官方变更了仓库的存储路径以及使用方式,使用 1.28 及以上版本,需按照新版配置方法进行配置。

该文档示例为配置 1.30 版本,如需其他版本请在对应位置字符串替换即可。比如需要安装 1.29 版本,则需要将如下配置中的 v1.30 替换成 v1.29。

目前该源支持 v1.24 - v1.30 版本,后续版本会持续更新。

bash 复制代码
# 添加 kubernetes 仓库 key
root@ubuntu2404:~# curl -fsSL https://mirrors.aliyun.com/kubernetes-new/core/stable/v1.30/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

# 添加 kubernetes 仓库
root@ubuntu2404:~# echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://mirrors.aliyun.com/kubernetes-new/core/stable/v1.30/deb/ /" > /etc/apt/sources.list.d/kubernetes.list

# 更新仓库源
root@ubuntu2404:~# apt update
设置 IP
bash 复制代码
root@ubuntu2404:~# mkdir /etc/netplan/origin
root@ubuntu2404:~# mv /etc/netplan/*yaml /etc/netplan/origin
root@ubuntu2404:~# cat > /etc/netplan/00-static.yaml <<EOF
network:
  ethernets:
    ens32:
      dhcp4: no
      addresses:
        - 10.1.8.10/24
      routes:
        - to: default
          via: 10.1.8.2
      nameservers:
        addresses:
          - 10.1.8.2
          - 223.5.5.5
  version: 2
EOF
root@ubuntu2404:~# chmod 600 /etc/netplan/00-static.yaml
root@ubuntu2404:~# netplan apply
设置 /etc/hosts
bash 复制代码
root@ubuntu2404:~# cat << 'EOF' >> /etc/hosts

####### kubernetes #####
10.1.8.100 k8s.laoma.cloud k8s
10.1.8.101 ha01.laoma.cloud ha01
10.1.8.102 ha02.laoma.cloud ha02
10.1.8.111 etcd11.laoma.cloud etcd11
10.1.8.112 etcd12.laoma.cloud etcd12
10.1.8.113 etcd13.laoma.cloud etcd13
10.1.8.121 master21.laoma.cloud master21
10.1.8.122 master22.laoma.cloud master22
10.1.8.123 master23.laoma.cloud master23
10.1.8.131 worker31.laoma.cloud worker31
10.1.8.132 worker32.laoma.cloud worker32
10.1.8.133 worker33.laoma.cloud worker33
EOF
配置对时
bash 复制代码
root@ubuntu2404:~# apt-get install -y chrony
# 以下步骤可以省略
root@ubuntu2404:~# systemctl enable chrony --now
设置 ssh
bash 复制代码
# 避免ssh服务器对客户端IP进行反向解析为域名,客户端可以快速与服务器建立连接
root@ubuntu2404:~# echo 'UseDNS no' >> /etc/ssh/sshd_config
# 避免ssh客户端校验服务器公钥,否则首次连接需要交互输入yes
root@ubuntu2404:~# echo 'StrictHostKeyChecking no' >> /etc/ssh/ssh_config
# 生成密钥
root@ubuntu2404:~# ssh-keygen -N '' -f ~/.ssh/id_rsa -t rsa
# 配置免密登录自己
root@ubuntu2404:~# ssh-copy-id root@localhost
配置 IPVS
bash 复制代码
# 1. 安装 ipvs 依赖包
[root@ubuntu2404 ~]# apt install -y iptables ipvsadm ipset conntrack

# 2-1. 加载基础网络模块:临时加载(立即生效)
[root@ubuntu2404 ~]# modprobe overlay 
[root@ubuntu2404 ~]# modprobe br_netfilter
# 模块功能说明:
# br_netfilter:允许桥接设备(Linux 网桥)通过 iptables 过滤,是 Kubernetes 网络通信必需。
# overlay:Overlay 文件系统模块,容器运行时(containerd/docker)分层镜像必备。

# 2-2. 加载 ipvs 内核模块:临时加载(立即生效)
[root@ubuntu2404 ~]# modprobe ip_vs
[root@ubuntu2404 ~]# modprobe ip_vs_rr
[root@ubuntu2404 ~]# modprobe ip_vs_wrr
[root@ubuntu2404 ~]# modprobe ip_vs_lc
[root@ubuntu2404 ~]# modprobe ip_vs_sh
[root@ubuntu2404 ~]# modprobe nf_conntrack
# 模块功能说明:
# ip_vs:IPVS 核心模块,实现四层负载均衡,是 kube-proxy ipvs 模式的基础。
# ip_vs_rr:IPVS 轮询调度算法,按顺序依次分发请求。
# ip_vs_wrr:加权轮询,按后端节点权重分配流量。
# ip_vs_lc:最少连接调度,优先发给连接数最少的节点。
# ip_vs_sh:源地址哈希,保证同一客户端 IP 始终访问同一后端。
# nf_conntrack:连接跟踪,记录网络连接状态,保证数据包正确转发。

# 3 永久加载模块(重启生效)
# systemd-modules-load.service 会自动加载该配置文件
[root@ubuntu2404 ~]# cat > /etc/modules-load.d/k8s-net.conf << EOF
# K8s 基础网络
br_netfilter
overlay

# IPVS 必需
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_lc
ip_vs_sh
nf_conntrack
EOF
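除了加载内核模块,kubeadm 预检和 Pod 网络转发通常还要求开启桥接过滤与 IP 转发等内核参数;下面是常见的补充配置与验证方式(原文未列出,仅作参考示例):

bash 复制代码
# 4. 配置内核参数(重启后依然生效)
[root@ubuntu2404 ~]# cat > /etc/sysctl.d/k8s.conf << EOF
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
[root@ubuntu2404 ~]# sysctl --system

# 5. 确认模块已加载
[root@ubuntu2404 ~]# lsmod | egrep 'ip_vs|br_netfilter|overlay'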
设置脚本
bash 复制代码
root@ubuntu2404:~# cat > /usr/local/bin/sethost <<'EOF'
#!/bin/bash

# 以 root 身份运行
[ $UID -ne 0 ] && echo 'Please run as root.' && exit 1 

# 指定网卡配置文件(与前文创建的 netplan 文件保持一致)
net_config=/etc/netplan/00-static.yaml

# 指定域名称
domain=laoma.cloud

# 脚本使用说明
usage (){
  echo "Usage: $0 101-102 | 111-113 | 121-123 | 131-133"
  exit 1
}

# 设置 IP 地址
function set_ip () {
  # 修改网卡配置文件IP
  IP=10.1.8.$1
  sed -ri "s#10.1.8.[0-9]{,3}/#$IP/#g" ${net_config}
  # 设置网卡文件权限
  chmod 600 ${net_config}
  # 激活IP
  netplan apply
}

# 设置主机名
function set_hostname () {
  HOSTNAME_END=$[ $1 - 100 ].$domain
  # 获取主机名
  case $1 in
    10[1-2])
      HOSTNAME=ha0${HOSTNAME_END}
      ;;
    11[1-3])
      HOSTNAME=etcd${HOSTNAME_END}
      ;;
    12[1-3])
      HOSTNAME=master${HOSTNAME_END}
      ;;
    13[1-3])
      HOSTNAME=worker${HOSTNAME_END}
      ;;
    *)
      usage
      ;;
  esac
  # 设置主机名
  hostnamectl hostname $HOSTNAME
}

# 定义 main 函数调用功能函数
function main() {
  # 设置主机名
  set_hostname $1
  
  # 设置 IP
  set_ip $1

  # 显示修改结果
  bash -c 'clear;hostname;echo;ip -br a;echo'

  # 关机打快照
  while true
  do
    echo -ne "Press the \033[1;31mEnter\033[0;39m key, and the system will shut down in 5 seconds.";read
    echo -e "Press \033[1;35mCTRL+C\033[0;39m to cancel the shutdown."
    for i in {5..1}
    do
      echo "The system will shut down in $i seconds."
      sleep 1
    done
    echo "Shutdown system Now." && init 0
  done 
}

# 执行 main 函数
main $*
EOF
root@ubuntu2404:~# chmod +x /usr/local/bin/sethost
打快照-base

关闭虚拟机,打快照,名称base

bash 复制代码
root@ubuntu2404:~# init 0

准备虚拟机-k8s 模版

提示:基于上个虚拟机原有快照base继续配置系统。

关闭 swap

如果有 swap 分区,需要关闭。kubernetes不需要swap分区。

bash 复制代码
root@ubuntu2404:~# swapoff -a && sed -i '/^.*swap/d' /etc/fstab
root@ubuntu2404:~# rm -f /swap.img
配置 containerd
bash 复制代码
[root@ubuntu2404 ~]# apt-get install -y containerd.io=1.7.20-1 cri-tools

# 设置crictl的runtime-endpoint
root@ubuntu2404:~# crictl config runtime-endpoint unix:///var/run/containerd/containerd.sock

[root@ubuntu2404 ~]# containerd config default > /etc/containerd/config.toml

# 修改 SystemdCgroup 和 sandbox_image
[root@ubuntu2404 ~]# sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
[root@ubuntu2404 ~]# sed -i 's|sandbox_image = ".*"|sandbox_image = "registry.k8s.io/pause:3.9"|' /etc/containerd/config.toml

# 配置镜像仓库加速
[root@ubuntu2404 ~]# vim /etc/containerd/config.toml
toml 复制代码
# 查找 mirrors行
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        # 添加如下四行记录,注意缩进
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://docker.m.daocloud.io","https://docker.1ms.run","https://docker.xuanyuan.me"]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
          endpoint = ["https://k8s.m.daocloud.io","https://registry.cn-hangzhou.aliyuncs.com/google_containers"]
bash 复制代码
# 重启服务
[root@ubuntu2404 ~]# systemctl restart containerd.service
# containerd 服务,默认已经设置开机启动,并启动

crictl 走的是 containerd CRI 接口,会读取 /etc/containerd/config.toml 里的 registry.mirrors 配置。

下载测试:

bash 复制代码
[root@ubuntu2404 ~]# crictl pull busybox
Image is up to date for sha256:925ff61909aebae4bcc9bc04bb96a8bd15cd2271f13159fe95ce4338824531dd
安装 nerdctl 和 cni plugin

nerdctl 项目地址:https://github.com/containerd/nerdctl/releases

cni 插件项目地址:https://github.com/containernetworking/plugins/releases

bash 复制代码
# 下载并安装
[root@ubuntu2404 ~]# wget http://192.168.46.100/01.softwares/03.stage-3/nerdctl-1.7.7-linux-amd64.tar.gz
[root@ubuntu2404 ~]# tar -xf nerdctl-1.7.7-linux-amd64.tar.gz -C /usr/bin/

# 下载 nerdctl 所需要的 cni 插件
[root@ubuntu2404 ~]# wget http://192.168.46.100/01.softwares/03.stage-3/cni-plugins-linux-amd64-v1.6.0.tgz
[root@ubuntu2404 ~]# mkdir -p /opt/cni/bin
[root@ubuntu2404 ~]# tar -xf cni-plugins-linux-amd64-v1.6.0.tgz -C /opt/cni/bin

nerdctl 走的是 containerd 原生 API,不会读取 CRI 专属的 registry 配置,而是使用自己独立的镜像源配置。

nerdctl 有自己的配置文件,需要单独配置 Docker Hub 加速。

bash 复制代码
# 配置 docker.io 镜像加速
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml << EOF
server = "https://registry-1.docker.io"
[host."https://09def58152000fc00ff0c00057bad7e0.mirror.swr.myhuaweicloud.com"]
  capabilities = ["pull", "resolve"]
EOF

# 配置 registry.k8s.io 镜像加速
mkdir -p /etc/containerd/certs.d/registry.k8s.io
cat > /etc/containerd/certs.d/registry.k8s.io/hosts.toml << EOF
server = "https://registry.k8s.io"

# 首选 DaoCloud
[host."https://k8s.m.daocloud.io"]
  capabilities = ["pull", "resolve"]

# 次选 Mirrorify
[host."https://k8s.mirrorify.net"]
  capabilities = ["pull", "resolve"]

# 兜底阿里云(如有账号)
[host."https://registry.cn-hangzhou.aliyuncs.com/google_containers"]
  capabilities = ["pull", "resolve"]
  override_path = true
EOF

下载测试:

bash 复制代码
[root@ubuntu2404 ~]# nerdctl pull nginx
docker.io/library/nginx:latest:                                                   resolved       |++++++++++++++++++++++++++++++++++++++| 
index-sha256:7f0adca1fc6c29c8dc49a2e90037a10ba20dc266baaed0988e9fb4d0d8b85ba0:    done           |++++++++++++++++++++++++++++++++++++++| 
manifest-sha256:f1e4ce3095f46ab65fd053991508a2433e2d7b45f6d8260c93d69c37d4307350: done           |++++++++++++++++++++++++++++++++++++++| 
config-sha256:a716c9c12c382ab51a71127f1dd9440af118939b92af2814d14f6232bb6105d4:   done           |++++++++++++++++++++++++++++++++++++++| 
elapsed: 0.3 s                                                       total:   0.0 B (0.0 B/s)
安装 kubernetes 软件包
bash 复制代码
# 查看版本
[root@ubuntu2404 ~]# apt list kubeadm -a|head

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Listing...
kubeadm/unknown 1.30.2-1.1 amd64
kubeadm/unknown 1.30.1-1.1 amd64
kubeadm/unknown 1.30.0-1.1 amd64

kubeadm/unknown 1.30.2-1.1 arm64
kubeadm/unknown 1.30.1-1.1 arm64
kubeadm/unknown 1.30.0-1.1 arm64

kubeadm/unknown 1.30.2-1.1 ppc64el

[root@ubuntu2404 ~]# apt install -y kubeadm=1.30.2-1.1 kubelet=1.30.2-1.1 kubectl=1.30.2-1.1

# 设置 kubelet 服务
[root@ubuntu2404 ~]# systemctl enable kubelet --now

此时kubelet服务处于activating,等 kubernetes 安装完成后状态变更为active。

配置相关命令补全
bash 复制代码
# 配置 crictl 命令自动补全
[root@ubuntu2404 ~]# mkdir -p /etc/bash_completion.d
[root@ubuntu2404 ~]# crictl completion bash > /etc/bash_completion.d/crictl
[root@ubuntu2404 ~]# source /etc/bash_completion.d/crictl

# 配置 nerdctl 命令自动补全
[root@ubuntu2404 ~]# nerdctl completion bash > /etc/bash_completion.d/nerdctl
[root@ubuntu2404 ~]# echo 'export CONTAINERD_NAMESPACE=k8s.io' >> /etc/bash_completion.d/nerdctl
[root@ubuntu2404 ~]# source /etc/bash_completion.d/nerdctl

注意 :此处必须设置变量 CONTAINERD_NAMESPACE,否则 nerdctl 默认将镜像导入到 default 命名空间,导致 k8s 无法使用镜像。k8s 默认使用 k8s.io 命名空间中镜像。
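可以用下面的命令对比两个命名空间中的镜像,确认镜像导入到了正确的位置(示例):

bash 复制代码
# default 命名空间(nerdctl 默认)与 k8s.io 命名空间(kubelet/crictl 使用)中的镜像互不可见
[root@ubuntu2404 ~]# nerdctl --namespace default images
[root@ubuntu2404 ~]# nerdctl --namespace k8s.io images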

bash 复制代码
# 配置 kubectl 命令补全
[root@ubuntu2404 ~]# kubectl completion bash > /etc/bash_completion.d/kubectl
[root@ubuntu2404 ~]# source /etc/bash_completion.d/kubectl

# 配置 kubeadm 命令补全
[root@ubuntu2404 ~]# kubeadm completion bash > /etc/bash_completion.d/kubeadm
[root@ubuntu2404 ~]# source /etc/bash_completion.d/kubeadm
打快照-k8s

关闭虚拟机,打快照,名称k8s

bash 复制代码
root@ubuntu2404:~# init 0

准备集群节点

通过 base 快照克隆ha节点和etcd节点机器。

通过 k8s 快照克隆 k8s 节点机器。

使用sethost脚本设置主机名和IP地址。

bash 复制代码
root@ubuntu2404:~# sethost 
Usage: /usr/local/bin/sethost 101-102 | 111-113 | 121-123 | 131-133

root@ubuntu2404:~# sethost 101

部署 Etcd 集群

bash 复制代码
# 所有etcd节点配置
root@etcdN:~# apt update && apt install -y etcd-server etcd-client
root@etcdN:~# systemctl stop etcd.service
root@etcdN:~# rm -fr /var/lib/etcd/default

# etcd11节点配置
root@etcd11:~# cp /etc/default/etcd /etc/default/etcd.ori
root@etcd11:~# cat > /etc/default/etcd <<'EOF'
ETCD_NAME="etcd11"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.111:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.111:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.111:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.111:2379"
ETCD_INITIAL_CLUSTER="etcd11=http://10.1.8.111:2380,etcd12=http://10.1.8.112:2380,etcd13=http://10.1.8.113:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
EOF

# etcd12节点配置
root@etcd12:~# cp /etc/default/etcd /etc/default/etcd.ori
root@etcd12:~# cat > /etc/default/etcd <<'EOF'
ETCD_NAME="etcd12"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.112:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.112:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.112:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.112:2379"
ETCD_INITIAL_CLUSTER="etcd11=http://10.1.8.111:2380,etcd12=http://10.1.8.112:2380,etcd13=http://10.1.8.113:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
EOF

# etcd3节点配置
root@etcd13:~# cp /etc/default/etcd /etc/default/etcd.ori
root@etcd13:~# cat > /etc/default/etcd <<'EOF'
ETCD_NAME="etcd13"
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_PEER_URLS="http://localhost:2380,http://10.1.8.113:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://10.1.8.113:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.1.8.113:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379,http://10.1.8.113:2379"
ETCD_INITIAL_CLUSTER="etcd11=http://10.1.8.111:2380,etcd12=http://10.1.8.112:2380,etcd13=http://10.1.8.113:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
EOF

# 所有节点启用并启动服务
root@etcdN:~# systemctl enable etcd
root@etcdN:~# systemctl start etcd

# 验证集群状态
root@etcd11:~# etcdctl member list
56ee9a8b1143f1f6, started, etcd12, http://10.1.8.112:2380, http://10.1.8.112:2379,http://localhost:2379, false
b09f448fe279511f, started, etcd11, http://10.1.8.111:2380, http://10.1.8.111:2379,http://localhost:2379, false
b83c09e0fc6318a6, started, etcd13, http://10.1.8.113:2380, http://10.1.8.113:2379,http://localhost:2379, false

配置补全

bash 复制代码
root@etcdN:~# mkdir -p /etc/bash_completion.d
root@etcdN:~# etcdctl completion bash > /etc/bash_completion.d/etcdctl
root@etcdN:~# source /etc/bash_completion.d/etcdctl
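除 member list 外,还可以检查各端点的健康状态和当前 leader(示例,在任一 etcd 节点执行):

bash 复制代码
root@etcd11:~# ENDPOINTS=http://10.1.8.111:2379,http://10.1.8.112:2379,http://10.1.8.113:2379
root@etcd11:~# etcdctl --endpoints=$ENDPOINTS endpoint health
root@etcd11:~# etcdctl --endpoints=$ENDPOINTS endpoint status -w table
# IS LEADER 列为 true 的节点即当前 leader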

部署 Keepalived 集群

bash 复制代码
# 所有 HA 节点安装keepalived
root@haN:~# apt install -y keepalived

# ha2修改配置文件
root@ha02:~# cp /etc/keepalived/keepalived.conf{.sample,}
root@ha02:~# cat > /etc/keepalived/keepalived.conf <<'EOF'
! Configuration File for keepalived

global_defs {
   router_id ha02
}

vrrp_instance k8s {
    state BACKUP
    interface ens32
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.1.8.100/24
    }
}
EOF


# 启用并启动安装keepalived服务
root@ha02:~# systemctl enable keepalived.service
root@ha02:~# systemctl restart keepalived.service
root@ha02:~# ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128 
ens32            UP             10.1.8.102/24 fe80::20c:29ff:fea5:36a1/64 

# ha1修改配置文件
root@ha01:~# cp /etc/keepalived/keepalived.conf{.sample,}
root@ha01:~# cat > /etc/keepalived/keepalived.conf <<'EOF'
! Configuration File for keepalived

global_defs {
   router_id ha01
}

vrrp_instance k8s {
    state MASTER
    interface ens32
    virtual_router_id 51
    priority 200
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.1.8.100/24
    }
}
EOF

# 启用并启动安装keepalived服务
root@ha01:~# systemctl enable keepalived.service
root@ha01:~# systemctl restart keepalived.service
root@ha01:~# ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128 
ens32            UP             10.1.8.101/24 10.1.8.100/24 fe80::20c:29ff:fe39:b65c/64 
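可以做一次简单的故障切换测试,验证 VIP 能在两台 HA 节点之间漂移(示例):

bash 复制代码
# 在 ha01 上停止 keepalived,VIP 10.1.8.100 应漂移到 ha02 的 ens32 上
root@ha01:~# systemctl stop keepalived.service
root@ha02:~# ip -br a | grep 10.1.8.100

# 恢复 ha01 后,由于其 priority 更高(200 > 100),VIP 会被抢回
root@ha01:~# systemctl start keepalived.service
root@ha01:~# ip -br a | grep 10.1.8.100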

部署 Haproxy 集群

提示:所有 HA 节点都要部署 Haproxy。

bash 复制代码
# 安装软件包
root@haN:~# apt install -y haproxy

# 更改haproxy.cfg
root@haN:~# cat >> /etc/haproxy/haproxy.cfg <<'EOF'
# 新增统计面板配置
listen stats
    bind *:1080  # 监听 1080 端口(可自定义,如 8080)
    mode http
    stats enable  # 启用统计面板
    stats uri /haproxy-stats  # 访问路径(如 http://HA节点IP:1080/haproxy-stats)
    stats refresh 3s  # 页面自动刷新间隔
    stats auth admin:123456  # 登录账号密码(生产环境务必修改)
    stats hide-version  # 隐藏 HAProxy 版本(安全)

# 新增 k8s 面板配置
listen k8s
  mode tcp
  bind *:6443
  balance roundrobin
  server master21 10.1.8.121:6443 check inter 3000ms fall 3 rise 2
  server master22 10.1.8.122:6443 check inter 3000ms fall 3 rise 2
  server master23 10.1.8.123:6443 check inter 3000ms fall 3 rise 2
EOF

root@haN:~# systemctl enable haproxy
root@haN:~# systemctl restart haproxy.service

验证配置
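可按如下方式做基本验证(示例;此时 master 节点尚未部署,后端显示 DOWN 属正常现象):

bash 复制代码
# 确认 1080/6443 端口已监听
root@ha01:~# ss -lntp | egrep '1080|6443'

# 通过 VIP 访问统计页面,账号密码为 haproxy.cfg 中配置的 admin:123456
root@ha01:~# curl -I -u admin:123456 http://10.1.8.100:1080/haproxy-stats
# 也可以用浏览器打开 http://10.1.8.100:1080/haproxy-stats 查看后端状态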

部署 Kubernetes 集群

初始化集群
bash 复制代码
# 生成初始化配置,并且根据需求进行更改,后续可以使用选项--config指定配置文件
[root@master21 ~]# kubeadm config print init-defaults | tee kubeadm-config.yaml

#导入k8s镜像
tar -xvf k8s-1.30.2-images-all.tar.gz;for image in k8s-1.30.2-images/*; do nerdctl load -i $image; done

scp k8s-1.30.2-images/kube-proxy-v1.30.2.tar k8s-1.30.2-images/pause-3.9.tar worker31:
scp k8s-1.30.2-images/kube-proxy-v1.30.2.tar k8s-1.30.2-images/pause-3.9.tar worker32:
scp k8s-1.30.2-images/kube-proxy-v1.30.2.tar k8s-1.30.2-images/pause-3.9.tar worker33:
ssh worker31 'for image in *tar;do nerdctl load -i $image;done'
ssh worker32 'for image in *tar;do nerdctl load -i $image;done'
ssh worker33 'for image in *tar;do nerdctl load -i $image;done'

# 按需修改
[root@master21 ~]# vim kubeadm-config.yaml
bash 复制代码
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 4m0s
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes

# 添加控制平面
controlPlaneEndpoint: "10.1.8.100:6443"

# 修改etcd
etcd:
  external:
    endpoints:
    - "http://10.1.8.111:2379"
    - "http://10.1.8.112:2379"
    - "http://10.1.8.113:2379"

dns: {}
controllerManager: {}
scheduler: {}
# 按需修改,实验环境使用提前导入的镜像 | 阿里云镜像:imageRepository: registry.aliyuncs.com/google_containers
imageRepository: registry.k8s.io

# 设置版本
kubernetesVersion: 1.30.2
networking:
  # 设置域名
  dnsDomain: kubernetes.laoma.cloud
  # 设置pod子网
  podSubnet: 10.224.0.0/16
  serviceSubnet: 10.96.0.0/12

---
# 添加kube-proxy配置
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: rr
clientConnection:
  kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
bash 复制代码
# 初始化集群
[root@master21 ~]# kubeadm init --config kubeadm-config.yaml 
bash 复制代码
[init] Using Kubernetes version: v1.30.2
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local master21.laoma.cloud] and IPs [10.96.0.1 10.1.8.121 10.1.8.100]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] External etcd mode: Skipping etcd/ca certificate authority generation
[certs] External etcd mode: Skipping etcd/server certificate generation
[certs] External etcd mode: Skipping etcd/peer certificate generation
[certs] External etcd mode: Skipping etcd/healthcheck-client certificate generation
[certs] External etcd mode: Skipping apiserver-etcd-client certificate generation
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "super-admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests"
[kubelet-check] Waiting for a healthy kubelet. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 502.28234ms
[api-check] Waiting for a healthy API server. This can take up to 4m0s
[api-check] The API server is healthy after 5.043608027s
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node master21.laoma.cloud as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node master21.laoma.cloud as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: b0pnwt.1o94h2j82okh7gzt
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of control-plane nodes by copying certificate authorities
and service account keys on each node and then running the following as root:

  kubeadm join 10.1.8.100:6443 --token b0pnwt.1o94h2j82okh7gzt \
	--discovery-token-ca-cert-hash sha256:0bd96f3415a5913f0fe28e6ad29c6bdfd487f4f7f46aa2dcfdf13f3bca00464f \
	--control-plane 

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 10.1.8.100:6443 --token b0pnwt.1o94h2j82okh7gzt \
	--discovery-token-ca-cert-hash sha256:0bd96f3415a5913f0fe28e6ad29c6bdfd487f4f7f46aa2dcfdf13f3bca00464f 
配置凭据
bash 复制代码
[root@master21 ~]# mkdir -p $HOME/.kube
[root@master21 ~]# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@master21 ~]# sudo chown $(id -u):$(id -g) $HOME/.kube/config
部署网络

这里采用 calico 网络。

官方地址:http://projectcalico.org 或者 https://www.tigera.io/project-calico/

产品文档:https://projectcalico.docs.tigera.io/about/about-calico

项目地址:https://github.com/projectcalico/calico(提供镜像)

下载 calico 配置
bash 复制代码
[root@master21 ~]# wget --no-check-certificate https://raw.githubusercontent.com/projectcalico/calico/v3.30.7/manifests/calico.yaml
导入 calico 镜像
bash 复制代码
# 1. 先给 master21 本地导入 calico 镜像
for img in calico_*.tar; do
  echo "✅ 本地导入镜像:$img"
  nerdctl load -i "$img"
done

# 2. 批量发送到其他节点 + 远程导入
for host in master{22..23} worker{31..33}
do
  echo -e "\n==================================="
  echo "正在处理节点:$host"
  echo "==================================="
  scp calico_*.tar $host:/root/
  ssh $host "cd /root && for img in calico_*.tar; do echo \"✅ 远程导入:\$img\"; nerdctl load -i \$img; done"
done
修改 pod 网络
bash 复制代码
# 查看集群 pod 网络范围
[root@master21 ~]# kubectl get cm -n kube-system kubeadm-config -o yaml|grep podSubnet
      podSubnet: 10.224.0.0/16

# 更改 calico.yml,确保 CALICO_IPV4POOL_CIDR 与集群初始化的pod网络一致。
[root@master21 ~]# sed -i "s|# - name: CALICO_IPV4POOL_CIDR|- name: CALICO_IPV4POOL_CIDR|g" calico.yaml
[root@master21 ~]# sed -i "s|#   value: \"192.*|  value: \"10.224.0.0/16\"|g" calico.yaml
下载镜像
bash 复制代码
[root@master21 ~]# grep image: calico.yaml | uniq
          image: docker.io/calico/cni:v3.30.7
          image: docker.io/calico/node:v3.30.7
          image: docker.io/calico/kube-controllers:v3.30.7
# 如果没有,执行以下命令导入
nerdctl load -i calico_cni-v3.30.7.tar
nerdctl load -i calico_node-v3.30.7.tar
nerdctl load -i calico_kube-controllers-v3.30.7.tar
nerdctl load -i pause-3.9.tar

# 所有节点下载以上镜像
[root@all-node ~]# nerdctl pull docker.io/calico/cni:v3.30.7
[root@all-node ~]# nerdctl pull docker.io/calico/node:v3.30.7
[root@all-node ~]# nerdctl pull docker.io/calico/kube-controllers:v3.30.7
部署 calico 网络
bash 复制代码
[root@master21 ~]# kubectl apply -f calico.yaml
master 节点加入集群

注意:master22 和 master23 节点 加入集群,过程相同。

bash 复制代码
# 将master21节点目录/etc/kubernetes/pki/中的以下文件拷贝到其他master节点
- ca.crt
- ca.key
- sa.key
- sa.pub
- front-proxy-ca.crt
- front-proxy-ca.key

[root@master22-23 ~]# mkdir -p /etc/kubernetes/pki
[root@master22-23 ~]# scp master21:/etc/kubernetes/pki/{ca.crt,ca.key,front-proxy-ca.crt,front-proxy-ca.key,sa.key,sa.pub} /etc/kubernetes/pki/

# 其他master节点加入集群
[root@master21 ~]# kubeadm token create --print-join-command
[root@master22-23 ~]# kubeadm join 10.1.8.100:6443 --token b0pnwt.1o94h2j82okh7gzt \
	--discovery-token-ca-cert-hash sha256:0bd96f3415a5913f0fe28e6ad29c6bdfd487f4f7f46aa2dcfdf13f3bca00464f \
	--control-plane   #--control-plane一定要加
bash 复制代码
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local master22.laoma.cloud] and IPs [10.96.0.1 10.1.8.122 10.1.8.100]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Skipping etcd check in external mode
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 504.174675ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap
[control-plane-join] Using external etcd - no local stacked instance added
The 'update-status' phase is deprecated and will be removed in a future release. Currently it performs no operation
[mark-control-plane] Marking the node master22.laoma.cloud as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node master22.laoma.cloud as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]

This node has joined the cluster and a new control plane instance was created:

* Certificate signing request was sent to apiserver and approval was received.
* The Kubelet was informed of the new secure connection details.
* Control plane label and taint were applied to the new node.
* The Kubernetes control plane instances scaled up.


To start administering your cluster from this node, you need to run the following as a regular user:

	mkdir -p $HOME/.kube
	sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
	sudo chown $(id -u):$(id -g) $HOME/.kube/config

Run 'kubectl get nodes' to see this node join the cluster.
bash 复制代码
# 准备凭据文件
[root@master22-23 ~]# mkdir -p $HOME/.kube
[root@master22-23 ~]# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@master22-23 ~]# sudo chown $(id -u):$(id -g) $HOME/.kube/config
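除手工拷贝证书外,也可以借助 kubeadm 的证书上传机制:先在 master21 上上传控制面证书并生成 certificate key,其他 master 节点加入时附加 --certificate-key 即可自动分发证书(示意;token、hash 与 key 以实际输出为准):

bash 复制代码
# 在 master21 上上传控制面证书,输出一个 certificate key(默认约 2 小时内有效)
[root@master21 ~]# kubeadm init phase upload-certs --upload-certs

# 其他 master 节点加入时附加 --certificate-key
[root@master22-23 ~]# kubeadm join 10.1.8.100:6443 --token <token> \
	--discovery-token-ca-cert-hash sha256:<hash> \
	--control-plane --certificate-key <certificate-key>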
worker 节点加入集群
bash 复制代码
# 所有 worker 节点加入集群
[root@workerN ~]# kubeadm join 10.1.8.100:6443 --token b0pnwt.1o94h2j82okh7gzt \
	--discovery-token-ca-cert-hash sha256:0bd96f3415a5913f0fe28e6ad29c6bdfd487f4f7f46aa2dcfdf13f3bca00464f 
验证安装结果
确认集群状态
bash 复制代码
# 确认当前集群节点状态
[root@master21 ~]# kubectl get nodes
NAME                   STATUS   ROLES           AGE     VERSION
master21.laoma.cloud   Ready    control-plane   14m     v1.30.2
master22.laoma.cloud   Ready    control-plane   9m47s   v1.30.2
master23.laoma.cloud   Ready    control-plane   9m15s   v1.30.2
worker31.laoma.cloud   Ready    <none>          8m18s   v1.30.2
worker32.laoma.cloud   Ready    <none>          8m17s   v1.30.2
worker33.laoma.cloud   Ready    <none>          8m15s   v1.30.2

# 确认pod状态
[root@master21 ~]# kubectl get pods -n kube-system 
NAME                                           READY   STATUS    RESTARTS   AGE
calico-kube-controllers-564985c589-v9njx       1/1     Running   0          10m
calico-node-bdzbb                              1/1     Running   0          8m48s
calico-node-h6rrw                              1/1     Running   0          10m
calico-node-mhsf2                              1/1     Running   0          10m
calico-node-nl94k                              1/1     Running   0          9m45s
calico-node-rb29r                              1/1     Running   0          8m47s
calico-node-t9swg                              1/1     Running   0          8m45s
coredns-7b5944fdcf-5sd4q                       1/1     Running   0          14m
coredns-7b5944fdcf-gkgfp                       1/1     Running   0          14m
kube-apiserver-master21.laoma.cloud            1/1     Running   0          14m
kube-apiserver-master22.laoma.cloud            1/1     Running   0          10m
kube-apiserver-master23.laoma.cloud            1/1     Running   0          9m44s
kube-controller-manager-master21.laoma.cloud   1/1     Running   0          14m
kube-controller-manager-master22.laoma.cloud   1/1     Running   0          10m
kube-controller-manager-master23.laoma.cloud   1/1     Running   0          9m43s
kube-proxy-7ww8c                               1/1     Running   0          9m45s
kube-proxy-df46r                               1/1     Running   0          8m48s
kube-proxy-dglmh                               1/1     Running   0          8m45s
kube-proxy-q7wrv                               1/1     Running   0          8m47s
kube-proxy-v68gp                               1/1     Running   0          14m
kube-proxy-vhr5n                               1/1     Running   0          10m
kube-scheduler-master21.laoma.cloud            1/1     Running   0          14m
kube-scheduler-master22.laoma.cloud            1/1     Running   0          10m
kube-scheduler-master23.laoma.cloud            1/1     Running   0          9m44s
部署测试资源
bash 复制代码
[root@master21 ~]# kubectl create deployment webapp --image=nginx --replicas 3
[root@master21 ~]# kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP               NODE                   NOMINATED NODE   READINESS GATES
webapp-7fb6dc88f9-48kcx   1/1     Running   0          57s   10.224.113.129   worker32.laoma.cloud   <none>           <none>
webapp-7fb6dc88f9-9z87c   1/1     Running   0          57s   10.224.19.1      worker31.laoma.cloud   <none>           <none>
webapp-7fb6dc88f9-wtn6t   1/1     Running   0          57s   10.224.162.129   worker33.laoma.cloud   <none>           <none>
# 3个 pod 分布在3个节点

# 准备 pod 页面
[root@master21 ~]# \
for pod in $(kubectl get pods -o name | awk -F/  '{print $2}')
do 
  kubectl exec -it $pod -- bash -c "echo $pod > /usr/share/nginx/html/index.html"
done

[root@master21 ~]# kubectl expose deployment webapp --port 80 --target-port 80
[root@master21 ~]# kubectl get svc webapp
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
webapp       ClusterIP   10.102.27.219   <none>        80/TCP    13s

# 验证效果
[root@master21 ~]# for i in {1..30};do curl -s 10.102.27.219:80;done|sort |uniq -c
      9 webapp-7fb6dc88f9-48kcx
     10 webapp-7fb6dc88f9-9z87c
     11 webapp-7fb6dc88f9-wtn6t
高可用测试
  • 关闭任意 master 节点,验证集群是否可用。
  • 关闭任意 worker 节点,验证集群是否可用。
  • 关闭任意 etcd 节点,验证集群是否可用。
  • 关闭任意 ha 节点,验证集群是否可用(验证思路示例见下)。
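下面给出其中一种场景的验证思路(示意;假设 ha 节点运行 haproxy + keepalived,VIP 为 10.1.8.100,主机名 ha1 为假设值):

bash 复制代码
# 在其中一台 ha 节点上停止负载均衡组件(示意)
root@ha1:~# systemctl stop haproxy keepalived

# 在 master 节点上确认通过 VIP 仍能访问 API Server、集群仍可操作
[root@master21 ~]# curl -sk https://10.1.8.100:6443/healthz; echo
[root@master21 ~]# kubectl get nodes

# 验证后恢复该 ha 节点
root@ha1:~# systemctl start haproxy keepalived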

Kubernetes 集群升级

学习参考:kubernetes 升级

升级必要性

  • 功能性增强。
  • 漏洞修复。
  • 等等

升级注意事项

  1. 如果 Kubernetes 集群只有一个 master 节点,则升级 master 节点期间集群控制面将暂时不可用(kubectl、API 访问会中断)。
  2. 如果 Kubernetes 集群只有两个worker节点,则升级第二个 worker 节点时,负载会在第一个worker节点重建,进而导致应用短暂不可用,不可用时间取决于pod重建时间。

升级流程

  • 从节点角度看:先逐个升级 master 节点,再逐个升级 worker 节点。

  • 从软件层面看:任何节点都需要先升级 kubeadm,再利用 kubeadm 升级各个组件,最后升级 kubelet 和 kubectl 等(命令顺序示意见下)。
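按照上述流程,单个节点的升级命令顺序大致如下(示意;具体版本号与节点名见后文实际步骤):

bash 复制代码
# 单节点升级顺序示意(worker 节点;首个 master 节点第 2 步改为 kubeadm upgrade apply <版本>)
apt install -y kubeadm=<目标版本>                      # 1. 升级 kubeadm
kubeadm upgrade node                                   # 2. 升级节点组件配置
kubectl drain <节点名> --ignore-daemonsets             # 3. 在 master 上驱逐该节点负载并设为不可调度
apt install -y kubelet=<目标版本> kubectl=<目标版本>   # 4. 升级 kubelet 与 kubectl
systemctl restart kubelet                              # 5. 重启 kubelet
kubectl uncordon <节点名>                              # 6. 在 master 上恢复调度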

升级步骤

本次实验环境:从kubernetes 1.30.2 升级到 1.30.4。

升级前准备

bash 复制代码
# 确认当前安装的版本
[root@master30 ~]# kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2

[root@master30 ~]# kubectl get nodes
NAME                  STATUS   ROLES           AGE   VERSION
master30.laoma.cloud   Ready    control-plane   53d   v1.30.2
worker31.laoma.cloud   Ready    <none>          53d   v1.30.2
worker32.laoma.cloud   Ready    <none>          53d   v1.30.2

# 确认仓库中提供的版本
[root@master30 ~]# apt list kubeadm -a|grep amd64

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

kubeadm/unknown 1.30.4-1.1 amd64 [upgradable from: 1.30.2-1.1]
kubeadm/unknown 1.30.3-1.1 amd64
kubeadm/unknown,now 1.30.2-1.1 amd64 [installed,upgradable to: 1.30.4-1.1]
kubeadm/unknown 1.30.1-1.1 amd64
kubeadm/unknown 1.30.0-1.1 amd64
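此外,升级控制面前建议先备份 etcd 数据,以便升级失败时恢复(示意;假设为 kubeadm 默认的本地堆叠 etcd,证书路径以实际环境为准,且节点上已安装 etcd-client):

bash 复制代码
# 备份 etcd 快照(示意)
[root@master30 ~]# ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup-$(date +%F).db \
	--endpoints=https://127.0.0.1:2379 \
	--cacert=/etc/kubernetes/pki/etcd/ca.crt \
	--cert=/etc/kubernetes/pki/etcd/server.crt \
	--key=/etc/kubernetes/pki/etcd/server.key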

升级 master

升级 kubeadm
bash 复制代码
# 升级 kubeadm
[root@master30 ~]# apt install -y kubeadm=1.30.4-1.1

如果 kubeadm 软件包被 hold(版本被锁定)导致无法升级,可使用以下命令解除:

bash 复制代码
# 解除 kubeadm 的 hold 状态
[root@master30 ~]# apt-mark unhold kubeadm
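相应地,升级完成后可以重新锁定版本,避免日常 apt upgrade 误升级这些组件(示意):

bash 复制代码
# 升级完成后重新锁定版本(示意)
[root@master30 ~]# apt-mark hold kubeadm kubelet kubectl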
升级 master 节点组件
bash 复制代码
# 升级帮助
[root@master30 ~]# kubeadm upgrade apply --help
# --dry-run 模拟升级
# --etcd-upgrade 升级 etcd,默认值true。设置为 false 则不升级etcd。
# -f, --force 强制升级,即使部分条件不满足。使用该选项还可以避免二次确认升级。

# 升级计划
[root@master30 ~]# kubeadm upgrade plan
bash 复制代码
[preflight] Running pre-flight checks.
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[upgrade] Running cluster health checks
[upgrade/health] FATAL: [preflight] Some fatal errors occurred:
	[ERROR CreateJob]: Job "upgrade-health-check-tk52j" in the namespace "kube-system" did not complete in 15s: no condition of type Complete
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
健康检查 Job 未能在 15s 内完成,多与镜像拉取或调度较慢有关,稍候重新执行即可;必要时也可按提示使用 --ignore-preflight-errors=... 将该项检查降级为非致命。

bash 复制代码
# 重新执行升级计划
[root@master30 ~]# kubeadm upgrade plan
bash 复制代码
[preflight] Running pre-flight checks.
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[upgrade] Running cluster health checks
	[WARNING CreateJob]: Job "upgrade-health-check-c45nc" in the namespace "kube-system" did not complete in 15s: no condition of type Complete
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: 1.30.2
[upgrade/versions] kubeadm version: v1.30.4
W0905 09:07:28.918665   63943 version.go:104] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable.txt": Get "https://cdn.dl.k8s.io/release/stable.txt": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
W0905 09:07:28.919004   63943 version.go:105] falling back to the local client version: v1.30.4
[upgrade/versions] Target version: v1.30.4
[upgrade/versions] Latest version in the v1.30 series: v1.30.4

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   NODE                  CURRENT   TARGET
kubelet     master30.laoma.cloud   v1.30.2   v1.30.4
kubelet     worker31.laoma.cloud   v1.30.2   v1.30.4
kubelet     worker32.laoma.cloud   v1.30.2   v1.30.4

Upgrade to the latest version in the v1.30 series:

COMPONENT                 NODE                  CURRENT    TARGET
kube-apiserver            master30.laoma.cloud   v1.30.2    v1.30.4
kube-controller-manager   master30.laoma.cloud   v1.30.2    v1.30.4
kube-scheduler            master30.laoma.cloud   v1.30.2    v1.30.4
kube-proxy                                      1.30.2     v1.30.4
CoreDNS                                         v1.11.1    v1.11.1
etcd                      master30.laoma.cloud   3.5.12-0   3.5.12-0

You can now apply the upgrade by executing the following command:

	kubeadm upgrade apply v1.30.4

_____________________________________________________________________


The table below shows the current state of component configs as understood by this version of kubeadm.
Configs that have a "yes" mark in the "MANUAL UPGRADE REQUIRED" column require manual config upgrade or
resetting to kubeadm defaults before a successful upgrade can be performed. The version to manually
upgrade to is denoted in the "PREFERRED VERSION" column.

API GROUP                 CURRENT VERSION   PREFERRED VERSION   MANUAL UPGRADE REQUIRED
kubeproxy.config.k8s.io   v1alpha1          v1alpha1            no
kubelet.config.k8s.io     v1beta1           v1beta1             no
_____________________________________________________________________

升级 master 组件

bash 复制代码
[root@master30 ~]# kubeadm upgrade apply v1.30.4 --force
bash 复制代码
[preflight] Running pre-flight checks.
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[upgrade] Running cluster health checks
	[WARNING CreateJob]: Job "upgrade-health-check-wpclk" in the namespace "kube-system" did not complete in 15s: no condition of type Complete
[upgrade/version] You have chosen to change the cluster version to "v1.30.4"
[upgrade/versions] Cluster version: v1.30.2
[upgrade/versions] kubeadm version: v1.30.4
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
W0905 09:12:29.406987   66867 checks.go:844] detected that the sandbox image "registry.aliyuncs.com/google_containers/pause:3.9" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.9" as the CRI sandbox image.
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.30.4" (timeout: 5m0s)...
[upgrade/etcd] Upgrading to TLS for etcd
[upgrade/staticpods] Preparing for "etcd" upgrade
[upgrade/staticpods] Current and new manifests of etcd are equal, skipping upgrade
[upgrade/etcd] Waiting for etcd to become available
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests3246055680"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Renewing apiserver-etcd-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2024-09-05-09-12-29/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 1 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2024-09-05-09-12-29/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 1 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2024-09-05-09-12-29/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 1 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upgrade] Backing up kubelet config file to /etc/kubernetes/tmp/kubeadm-kubelet-config1570193532/config.yaml
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.30.4". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.

kubeadm upgrade apply 做了以下工作:

  • 检查你的集群是否处于可升级状态:
    • API 服务器是可访问的
    • 所有节点处于 Ready 状态
    • 控制面是健康的
  • 强制执行版本偏差策略。
  • 确保控制面的镜像是可用的或可拉取到服务器上。
  • 如果组件配置要求版本升级,则生成替代配置与/或使用用户提供的覆盖版本配置。
  • 升级控制面组件或回滚(如果其中任何一个组件无法启动)。
  • 应用新的 CoreDNS 和 kube-proxy 清单,并强制创建所有必需的 RBAC 规则。
  • 如果 API 服务器的旧证书和密钥文件将在 180 天内过期,则生成新文件并备份旧文件。
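kubeadm upgrade apply 完成后,可以确认控制面静态 Pod 已切换到目标版本镜像(示意;节点名以实际为准):

bash 复制代码
# 查看 kube-apiserver 静态 Pod 使用的镜像版本
[root@master30 ~]# kubectl -n kube-system get pod kube-apiserver-master30.laoma.cloud \
	-o jsonpath='{.spec.containers[0].image}{"\n"}'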
升级其他 master 节点
bash 复制代码
# 在其他 master 节点上依次执行(本实验环境只有一个 master,此步骤仅作说明)
[root@masterN ~]# kubeadm upgrade node

kubeadm upgrade node 在其他控制平面节点上执行以下操作:

  • 从集群中获取 kubeadm ClusterConfiguration
  • (可选操作)备份 kube-apiserver 证书。
  • 升级控制平面组件的静态 Pod 清单。
  • 为本节点升级 kubelet 配置。
驱逐 master 节点
bash 复制代码
# 驱逐 master 节点上的 Pod,并将节点设置为不可调度
[root@master30 ~]# kubectl drain master30.laoma.cloud --ignore-daemonsets

[root@master30 ~]# kubectl get nodes
NAME                   STATUS                     ROLES           AGE   VERSION
master30.laoma.cloud   Ready,SchedulingDisabled   control-plane   42h   v1.30.2
worker31.laoma.cloud   Ready                      <none>          42h   v1.30.2
worker32.laoma.cloud   Ready                      <none>          42h   v1.30.2
升级 kubelet 和 kubectl
bash 复制代码
# 升级 kubectl和kubelet
[root@master30 ~]# apt install -y kubelet=1.30.4-1.1 kubectl=1.30.4-1.1

# 重启 kubelet 服务
[root@master30 ~]# systemctl restart kubelet.service
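补充:如果软件包升级同时更新了 kubelet 的 systemd 单元文件,建议在重启前先重新加载 systemd 配置(示意):

bash 复制代码
[root@master30 ~]# systemctl daemon-reload
[root@master30 ~]# systemctl restart kubelet.service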
取消节点不可调度
bash 复制代码
# 设置master节点可调度
[root@master30 ~]# kubectl uncordon master30.laoma.cloud

# 验证结果
[root@master30 ~]# kubectl get nodes
NAME                  STATUS   ROLES           AGE   VERSION
master30.laoma.cloud   Ready    control-plane   48m   v1.30.4
worker31.laoma.cloud   Ready    <none>          37m   v1.30.2
worker32.laoma.cloud   Ready    <none>          37m   v1.30.2

升级 worker

升级 worker31
bash 复制代码
# 升级kubeadm
root@worker31:~# apt install -y kubeadm=1.30.4-1.1

# 升级 worker节点组件
root@worker31:~# kubeadm upgrade node
# 以上命令在工作节点上完成以下工作:
# 1.从集群取回 kubeadm ClusterConfiguration。
# 2.为本节点升级 kubelet 配置。

# 驱逐节点上的 Pod,并将节点设置为不可调度
[root@master30 ~]# kubectl drain worker31.laoma.cloud --ignore-daemonsets
[root@master30 ~]# kubectl get nodes
NAME                  STATUS                     ROLES           AGE   VERSION
master30.laoma.cloud   Ready                      control-plane   50m   v1.30.4
worker31.laoma.cloud   Ready,SchedulingDisabled   <none>          39m   v1.30.2
worker32.laoma.cloud   Ready                      <none>          39m   v1.30.2

# 升级 kubectl 和 kubelet
root@worker31:~# apt install -y kubelet=1.30.4-1.1 kubectl=1.30.4-1.1

# 重启kubelet服务
root@worker31:~# systemctl restart kubelet.service

# 取消节点不可调度
[root@master30 ~]# kubectl uncordon worker31.laoma.cloud 
[root@master30 ~]# kubectl get nodes
NAME                  STATUS   ROLES           AGE   VERSION
master30.laoma.cloud   Ready    control-plane   51m   v1.30.4
worker31.laoma.cloud   Ready    <none>          40m   v1.30.4
worker32.laoma.cloud   Ready    <none>          40m   v1.30.2
升级 worker32
bash 复制代码
# 升级 kubeadm
root@worker32:~# apt install -y kubeadm=1.30.4-1.1

# 升级 worker 节点组件
root@worker32:~# kubeadm upgrade node

# 驱逐节点上的 Pod,并将节点设置为不可调度
[root@master30 ~]# kubectl drain worker32.laoma.cloud --ignore-daemonsets
[root@master30 ~]# kubectl get nodes
NAME                  STATUS                     ROLES           AGE   VERSION
master30.laoma.cloud   Ready                      control-plane   52m   v1.30.4
worker31.laoma.cloud   Ready                      <none>          40m   v1.30.4
worker32.laoma.cloud   Ready,SchedulingDisabled   <none>          40m   v1.30.2

# 升级 kubectl 和 kubelet
root@worker32:~# apt install -y kubelet=1.30.4-1.1 kubectl=1.30.4-1.1

# 重启 kubelet 服务
root@worker32:~# systemctl restart kubelet.service

# 取消节点不可调度
[root@master30 ~]# kubectl uncordon worker32.laoma.cloud 
[root@master30 ~]# kubectl get nodes
NAME                  STATUS   ROLES           AGE   VERSION
master30.laoma.cloud   Ready    control-plane   53d   v1.30.4
worker31.laoma.cloud   Ready    <none>          53d   v1.30.4
worker32.laoma.cloud   Ready    <none>          53d   v1.30.4

# 确认集群版本
[root@master30 ~]# kubectl version 
Client Version: v1.30.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4

从故障状态恢复

如果 kubeadm upgrade 失败并且没有回滚,例如由于执行期间节点意外关闭, 你可以再次运行 kubeadm upgrade。 此命令是幂等的,并最终确保实际状态是你声明的期望状态。

要从故障状态恢复,你还可以运行 kubeadm upgrade apply --force 而无需更改集群正在运行的版本。

在升级期间,kubeadm 向 /etc/kubernetes/tmp 目录下的如下备份文件夹写入数据:

  • kubeadm-backup-etcd-<date>-<time>
  • kubeadm-backup-manifests-<date>-<time>

kubeadm-backup-etcd 包含当前控制面节点本地 etcd 成员数据的备份。 如果 etcd 升级失败并且自动回滚也无法修复,则可以将此文件夹中的内容复制到 /var/lib/etcd 进行手工修复。如果使用的是外部的 etcd,则此备份文件夹为空。

kubeadm-backup-manifests 包含当前控制面节点的静态 Pod 清单文件的备份版本。 如果升级失败并且无法自动回滚,则此文件夹中的内容可以复制到 /etc/kubernetes/manifests 目录实现手工恢复。 如果由于某些原因,在升级前后某个组件的清单未发生变化,则 kubeadm 也不会为之生成备份版本。

