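1. Why deploy from binaries?
If you have ever stood up a cluster with kubeadm in one shot, you may have run into this: the cluster breaks for no obvious reason, the logs explain nothing, and when you try to tune a component's flags you find kubeadm getting in your way.
Binary deployment exists to solve exactly these problems.
Its core value: full control over every component's version, configuration, and startup flags; the ability to upgrade one component without touching the rest; and a thorough understanding of how the components work together.
Where it fits:
- Highly available production clusters (finance, government, and other industries with strict compliance requirements)
- Special environments that need custom kernel parameters or network plugins
- Deploying and maintaining clusters in offline environments
- Scenarios that need precise control over cluster version upgrades
And one scenario I know first-hand: in an interview, when you are asked "what does the K8s certificate chain look like", you can answer on the spot, because you signed every certificate yourself.
To be honest, binary deployment is a lot more work than kubeadm, but the extra half day will change how well you understand K8s.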
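2. Architecture planning
2.1 Overall architecture
We are building a highly available cluster with 3 masters and 2 workers. etcd runs as a 3-node cluster co-located on the three master nodes (see the node plan below). API Server load balancing and VIP failover are handled by HAProxy + Keepalived on two LB nodes.
2.2 Node plan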
| Role | Hostname | IP address | Components |
|-----------|--------------|---------------|--------------------------------|
| Master-01 | k8s-master01 | 192.168.26.31 | apiserver, controller-manager, scheduler, etcd |
| Master-02 | k8s-master02 | 192.168.26.32 | apiserver, controller-manager, scheduler, etcd |
| Master-03 | k8s-master03 | 192.168.26.33 | apiserver, controller-manager, scheduler, etcd |
| Worker-01 | k8s-node01 | 192.168.26.34 | kubelet, kube-proxy |
| Worker-02 | k8s-node02 | 192.168.26.35 | kubelet, kube-proxy |
| LB-01 | k8s-lb01 | 192.168.26.36 | HAProxy + Keepalived |
| LB-02 | k8s-lb02 | 192.168.26.37 | HAProxy + Keepalived |
| VIP | - | 192.168.26.30 | Virtual IP |
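These are the component versions I used (stable in my testing):
| Component | Version |
|-------------|---------|
| Kubernetes | v1.34.6 |
| etcd | v3.5.18 |
| containerd | 2.1.2 |
| runc | 1.3.0 |
| cni-plugins | 1.8.0 |
| Calico | v3.29.1 |
(I settled on etcd 3.5.18 after getting burned: in my experience some 3.6.x releases showed performance fluctuation under low load, while 3.5.x was much steadier.)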
2.3 Network plan
| Network | CIDR | Notes |
|-----------|-----------------|----------------|
| Node network | 192.168.26.0/24 | Physical node IP range |
| Pod network | 10.244.0.0/16 | Allocated by Calico |
| Service network | 10.96.0.0/16 | ClusterIP range |
| VIP | 192.168.26.30 | API Server virtual IP |
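Two addresses derived from the Service CIDR come up again later: the first IP (10.96.0.1) has to go into the API Server certificate, and kubelet's clusterDNS points at 10.96.0.10. Here is a minimal sketch that derives them instead of hard-coding them. `cidr_nth_ip` is a hypothetical helper name, not part of any Kubernetes tool, and it assumes the CIDR is written with its network address, as in the table above.

```shell
#!/bin/sh
# Print the Nth address of an IPv4 CIDR block (N=1 is the first usable IP).
# Assumption: the address part of the CIDR is already the network address.
cidr_nth_ip() {
  cidr="$1"; n="$2"
  ip="${cidr%/*}"          # strip the /prefix, e.g. 10.96.0.0
  old_ifs="$IFS"; IFS=.
  set -- $ip               # split the dotted quad into four octets
  IFS="$old_ifs"
  # Dotted quad -> 32-bit integer, add the offset, convert back.
  num=$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 + n ))
  echo "$(( (num >> 24) & 255 )).$(( (num >> 16) & 255 )).$(( (num >> 8) & 255 )).$(( num & 255 ))"
}

SERVICE_CIDR="10.96.0.0/16"
echo "apiserver ClusterIP: $(cidr_nth_ip "$SERVICE_CIDR" 1)"
echo "cluster DNS IP:      $(cidr_nth_ip "$SERVICE_CIDR" 10)"
```

If you ever change the Service CIDR, the certificate SAN and the clusterDNS setting both have to follow, which is why deriving them from one variable is less error-prone.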
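3. Prerequisites
3.1 Hardware requirements
My suggested minimum configuration (based on real production load testing):
| Node type | CPU | Memory | Disk |
|-------------|-----|-------|------------|
| Master / control plane | ≥4 cores | ≥16GB | ≥100GB SSD |
| Worker node | ≥2 cores | ≥8GB | ≥50GB |
| etcd node | ≥2 cores | ≥8GB | ≥200GB SSD |
⚠️ Pay special attention to etcd's disk: etcd is very sensitive to disk I/O latency, and in my tests cluster stability differed enormously between HDD and SSD. Put etcd on spinning disks and the cluster will very likely suffer all kinds of timeouts.
3.2 Operating system
I use Ubuntu 22.04 LTS; CentOS 7.9/8.5 also work. Note that CentOS 7's default 3.10 kernel can cause kubelet crashes, so I strongly recommend upgrading the kernel to 5.4.x or later.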
3.3 Base environment setup (all nodes)
# 1. Set the hostname (run on each node separately)
hostnamectl set-hostname k8s-master01 # change to match each node
# 2. Configure /etc/hosts
cat >> /etc/hosts << 'EOF'
192.168.26.31 k8s-master01
192.168.26.32 k8s-master02
192.168.26.33 k8s-master03
192.168.26.34 k8s-node01
192.168.26.35 k8s-node02
192.168.26.36 k8s-lb01
192.168.26.37 k8s-lb02
192.168.26.30 k8s-apiserver
EOF
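# 3. Disable swap. Many people forget this step, and with default settings kubelet refuses to start while swap is on
swapoff -a
sed -ri '/^[^#].*[[:space:]]swap[[:space:]]/s/^/#/' /etc/fstab # comment out swap entries rather than deleting every line that mentions "swap"
# 4. Set the timezone and time synchronization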
timedatectl set-timezone Asia/Shanghai
apt update && apt install -y chrony
systemctl enable chrony && systemctl start chrony
# 5. Basic tools
apt install -y wget jq vim net-tools curl \
apt-transport-https ca-certificates
# 6. Load kernel modules
cat > /etc/modules-load.d/k8s.conf << 'EOF'
overlay
br_netfilter
nf_conntrack
EOF
modprobe overlay
modprobe br_netfilter
modprobe nf_conntrack
# 7. Set kernel parameters
cat > /etc/sysctl.d/k8s.conf << 'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 0
vm.swappiness = 0
EOF
sysctl --system
# 8. Set up passwordless SSH (run on master01)
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for host in 192.168.26.31 192.168.26.32 192.168.26.33 192.168.26.34 192.168.26.35; do
ssh-copy-id root@$host
done
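Before moving on, it is worth confirming that the swap and kernel-module steps actually took effect on every node. The following is a minimal preflight sketch; `swap_is_off` and `missing_modules` are hypothetical helper names, and the check assumes the modules are loadable modules (anything compiled into the kernel does not appear in /proc/modules and would be reported as missing).

```shell
#!/bin/sh
# /proc/swaps always has one header line; any further line is an active swap area.
swap_is_off() {
  awk 'NR > 1 { active = 1 } END { exit active }' "${1:-/proc/swaps}"
}

# Print every required module that is absent from a /proc/modules-style file.
missing_modules() {
  modules_file="$1"; shift
  for m in "$@"; do
    grep -q "^$m " "$modules_file" || echo "$m"
  done
}

# Only run the live checks where the proc files exist (i.e. on Linux).
if [ -r /proc/swaps ]; then
  if swap_is_off; then echo "swap: off"; else echo "swap: still on"; fi
fi
if [ -r /proc/modules ]; then
  missing_modules /proc/modules overlay br_netfilter nf_conntrack
fi
```

No module names printed and "swap: off" means the node is ready for the next section.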
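3.4 About the firewall
While debugging a binary deployment, I suggest turning the firewall off first and adding allow rules as needed once the cluster is up and running.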
systemctl stop ufw && systemctl disable ufw # Ubuntu
# CentOS: systemctl stop firewalld && systemctl disable firewalld
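4. Certificate management
Certificates are the easiest place to go wrong in a binary deployment, bar none. The first time I did this, x509 errors cost me a whole day. The procedure below has been verified several times; follow it and you should be fine.
4.1 Download the cfssl tools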
wget -q --show-progress https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl_1.6.5_linux_amd64 -O /usr/local/bin/cfssl
wget -q --show-progress https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssljson_1.6.5_linux_amd64 -O /usr/local/bin/cfssljson
chmod +x /usr/local/bin/cfssl /usr/local/bin/cfssljson
mkdir -p /etc/kubernetes/pki
cd /etc/kubernetes/pki
4.2 Generate the CA certificate
# CA configuration file
cat > ca-config.json << 'EOF'
{
"signing": {
"default": {
"expiry": "87600h"
},
"profiles": {
"kubernetes": {
"usages": ["signing", "key encipherment", "server auth", "client auth"],
"expiry": "87600h"
}
}
}
}
EOF
# CA certificate signing request
cat > ca-csr.json << 'EOF'
{
"CN": "kubernetes-ca",
"key": { "algo": "rsa", "size": 2048 },
"names": [
{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s", "OU": "System" }
]
}
EOF
# Generate the CA
cfssl gencert -initca ca-csr.json | cfssljson -bare ca
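4.3 Generate the API Server certificate
This is the most critical step: the certificate's SAN list (the hosts field) must contain every master node IP, the VIP, and the first IP of the Service CIDR (the API Server's in-cluster ClusterIP). Miss one and you can expect x509 errors.
# The .1 address of the Service CIDR
SERVICE_CIDR_IP="10.96.0.1"
VIP="192.168.26.30"
# EOF is left unquoted so that ${SERVICE_CIDR_IP} and ${VIP} expand into the hosts list
cat > apiserver-csr.json << EOF
{
"CN": "kube-apiserver",
"hosts": [
"127.0.0.1", "${SERVICE_CIDR_IP}", "${VIP}",
"192.168.26.31", "192.168.26.32", "192.168.26.33",
"kubernetes", "kubernetes.default", "kubernetes.default.svc",
"kubernetes.default.svc.cluster.local", "k8s-apiserver"
],
"key": { "algo": "rsa", "size": 2048 },
"names": [
{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s", "OU": "System" }
]
}
EOF
# Generate the certificate (key point: the SAN list above must be complete)
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json \
-profile=kubernetes apiserver-csr.json | cfssljson -bare apiserver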
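Since a missing SAN is the classic cause of x509 errors, it pays to check the issued certificate before distributing it. Below is a rough sketch that reads the text dump produced by `openssl x509 -noout -text` and prints any required name that is absent. `missing_sans` is a hypothetical helper name, and the required list in the usage comment simply mirrors this guide's addresses.

```shell
#!/bin/sh
# Read `openssl x509 -noout -text` output on stdin and print each required
# SAN (DNS name or IP) that does not appear in the certificate.
missing_sans() {
  cert_text="$(cat)"
  for want in "$@"; do
    printf '%s\n' "$cert_text" \
      | tr ',' '\n' \
      | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
      | grep -qxF -e "DNS:$want" -e "IP Address:$want" \
      || echo "$want"
  done
}

# Usage on master01 (prints nothing when every entry is present):
#   openssl x509 -in /etc/kubernetes/pki/apiserver.pem -noout -text \
#     | missing_sans 10.96.0.1 192.168.26.30 192.168.26.31 192.168.26.32 192.168.26.33 kubernetes
```

Exact, fixed-string matching is used on purpose so that 192.168.26.3 is not mistaken for 192.168.26.31.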
4.4 Generate the other component certificates
# etcd cluster certificate
cat > etcd-csr.json << 'EOF'
{
"CN": "etcd",
"hosts": [
"127.0.0.1",
"192.168.26.31","192.168.26.32","192.168.26.33"
],
"key": { "algo": "rsa", "size": 2048 },
"names": [{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s" }]
}
EOF
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json \
-profile=kubernetes etcd-csr.json | cfssljson -bare etcd
# kube-proxy certificate
cat > kube-proxy-csr.json << 'EOF'
{
"CN": "system:kube-proxy",
"key": { "algo": "rsa", "size": 2048 },
"names": [{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s" }]
}
EOF
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json \
-profile=kubernetes kube-proxy-csr.json | cfssljson -bare kube-proxy
(By the way, kubelet certificates are issued automatically through TLS Bootstrap, so there is no need to generate them by hand.)
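The remaining client identities (kubectl admin, controller-manager, scheduler) can be produced with the same cfssl pattern as kube-proxy. What matters is the subject: Kubernetes takes the user name from CN and the group from O, and the built-in RBAC bindings key off those values. A minimal sketch follows; `client_csr_json` is a hypothetical helper name and the output file names in the comments are my own assumption.

```shell
#!/bin/sh
# Emit a cfssl CSR document for a client certificate.
# CN becomes the Kubernetes user name, O the group.
client_csr_json() {
  cn="$1"; org="$2"
  cat << EOF
{
  "CN": "${cn}",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "${org}" }]
}
EOF
}

# admin: the group system:masters is bound to cluster-admin by default
client_csr_json admin system:masters

# The other two would be generated the same way, for example:
#   client_csr_json system:kube-controller-manager system:kube-controller-manager > controller-manager-csr.json
#   client_csr_json system:kube-scheduler system:kube-scheduler > scheduler-csr.json
# and then signed with:
#   cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json \
#     -profile=kubernetes controller-manager-csr.json | cfssljson -bare controller-manager
```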
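5. Deploy the etcd cluster
All K8s state lives in etcd; if etcd goes down, the whole cluster is paralyzed. So etcd must be highly available. In this guide the 3-node etcd cluster shares the three master machines, as in the node plan; if you have spare hardware, dedicated etcd nodes isolate it further.
5.1 Download and install etcd
# Run on all three etcd nodes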
ETCD_VER="v3.5.18"
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzf etcd-${ETCD_VER}-linux-amd64.tar.gz
mv etcd-${ETCD_VER}-linux-amd64/etcd* /usr/local/bin/
mkdir -p /var/lib/etcd /etc/etcd
5.2 Distribute certificates and configuration
# Distribute the certificates from master01
for node in 192.168.26.31 192.168.26.32 192.168.26.33; do
ssh root@$node "mkdir -p /etc/etcd/ssl"
scp /etc/kubernetes/pki/{ca,etcd}.pem root@$node:/etc/etcd/ssl/
scp /etc/kubernetes/pki/etcd-key.pem root@$node:/etc/etcd/ssl/
done
5.3 Configure the etcd service
The configuration file for each node looks like this (replace name and the IPs with the node's own):
# /etc/etcd/etcd.conf.yml
name: 'etcd-1' # use etcd-2 / etcd-3 on the other nodes
data-dir: '/var/lib/etcd'
wal-dir: '/var/lib/etcd/wal'
snapshot-count: 10000
heartbeat-interval: 500
election-timeout: 5000
listen-peer-urls: 'https://192.168.26.31:2380' # change to the node's own IP
listen-client-urls: 'https://192.168.26.31:2379,https://127.0.0.1:2379'
advertise-client-urls: 'https://192.168.26.31:2379'
initial-advertise-peer-urls: 'https://192.168.26.31:2380'
initial-cluster: 'etcd-1=https://192.168.26.31:2380,etcd-2=https://192.168.26.32:2380,etcd-3=https://192.168.26.33:2380'
initial-cluster-token: 'etcd-cluster'
initial-cluster-state: 'new'
client-transport-security:
  cert-file: '/etc/etcd/ssl/etcd.pem'
  key-file: '/etc/etcd/ssl/etcd-key.pem'
  trusted-ca-file: '/etc/etcd/ssl/ca.pem'
peer-transport-security:
  cert-file: '/etc/etcd/ssl/etcd.pem'
  key-file: '/etc/etcd/ssl/etcd-key.pem'
  trusted-ca-file: '/etc/etcd/ssl/ca.pem'
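Editing three nearly identical files by hand is where typos creep in. Here is a minimal sketch that builds the initial-cluster string from a list of IPs and rewrites the master01 file for another member. `initial_cluster` and `render_member_config` are hypothetical helper names; the sketch assumes the etcd-N naming and the 192.168.26.31 template address used above.

```shell
#!/bin/sh
# Build the initial-cluster value from an ordered list of member IPs.
initial_cluster() {
  i=1; out=""
  for ip in "$@"; do
    out="${out:+$out,}etcd-${i}=https://${ip}:2380"
    i=$((i + 1))
  done
  echo "$out"
}

# Rewrite master01's config (stdin) for another member: replace the name and
# every address except the shared initial-cluster line.
render_member_config() {
  name="$1"; ip="$2"
  sed -e "s/^name: .*/name: '${name}'/" \
      -e "/^initial-cluster:/!s/192\.168\.26\.31/${ip}/g"
}

initial_cluster 192.168.26.31 192.168.26.32 192.168.26.33

# Usage, assuming etcd.conf.yml is the file written for etcd-1:
#   render_member_config etcd-2 192.168.26.32 < etcd.conf.yml > etcd-2.conf.yml
```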
5.4 Create the systemd service
# /etc/systemd/system/etcd.service
cat > /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.conf.yml
Restart=always
RestartSec=10
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable etcd --now
# Verify the cluster
etcdctl --cacert=/etc/etcd/ssl/ca.pem \
--cert=/etc/etcd/ssl/etcd.pem \
--key=/etc/etcd/ssl/etcd-key.pem \
--endpoints="https://192.168.26.31:2379,https://192.168.26.32:2379,https://192.168.26.33:2379" \
endpoint health
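Expected output: all three endpoints report is healthy.
6. Deploy the container runtime (containerd)
Docker is no longer the natural choice as a Kubernetes runtime (dockershim was removed in v1.24); production clusters today mostly run containerd.
6.1 Download and install
# Run on all nodes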
wget https://github.com/containerd/containerd/releases/download/v2.1.2/containerd-2.1.2-linux-amd64.tar.gz
tar xzf containerd-2.1.2-linux-amd64.tar.gz -C /usr/local/
wget https://github.com/opencontainers/runc/releases/download/v1.3.0/runc.amd64
install -m 755 runc.amd64 /usr/local/sbin/runc
wget https://github.com/containernetworking/plugins/releases/download/v1.8.0/cni-plugins-linux-amd64-v1.8.0.tgz
mkdir -p /opt/cni/bin
tar xzf cni-plugins-linux-amd64-v1.8.0.tgz -C /opt/cni/bin/
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
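6.2 Configure containerd
# Adjust the key settings
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# Point the pause (sandbox) image at a reachable mirror. containerd 1.x writes sandbox_image = "...", 2.x writes sandbox = '...'; this pattern covers both
sed -i -E "s#(sandbox(_image)? = )['\"]registry\.k8s\.io/pause:[^'\"]+['\"]#\1'registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.9'#" /etc/containerd/config.toml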
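# Registry mirror for users in mainland China (replace the Aliyun endpoint with your own).
# containerd 2.x reads mirrors from hosts.toml files under config_path instead of a registry.mirrors section.
# Check the result with: grep -n config_path /etc/containerd/config.toml
sed -i -E "s#^(\s*config_path = )(''|\"\")#\1'/etc/containerd/certs.d'#" /etc/containerd/config.toml
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml << 'EOF'
server = "https://registry-1.docker.io"
[host."https://xxxx.mirror.aliyuncs.com"]
capabilities = ["pull", "resolve"]
EOF
# Create the systemd service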
cat > /etc/systemd/system/containerd.service << 'EOF'
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
LimitNOFILE=1048576
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable containerd --now
7. Deploy the Kubernetes master components
7.1 Download the Kubernetes binaries
VERSION="v1.34.6"
wget https://dl.k8s.io/${VERSION}/kubernetes-server-linux-amd64.tar.gz
tar xzf kubernetes-server-linux-amd64.tar.gz
cd kubernetes/server/bin
cp kube-apiserver kube-controller-manager kube-scheduler kubectl /usr/local/bin/
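7.2 Deploy kube-apiserver
# Create the admin kubeconfig file
APISERVER_VIP="192.168.26.30"
kubectl config set-cluster kubernetes --certificate-authority=/etc/kubernetes/pki/ca.pem \
--embed-certs=true --server=https://${APISERVER_VIP}:6443 --kubeconfig=/etc/kubernetes/admin.kubeconfig
# The admin credential must be a client certificate with O=system:masters (for example CN=admin), signed the same way as the kube-proxy certificate in 4.4.
# The apiserver serving certificate (CN=kube-apiserver, O=k8s) carries no RBAC permissions, so kubectl would be rejected as Forbidden.
kubectl config set-credentials admin --client-certificate=/etc/kubernetes/pki/admin.pem \
--client-key=/etc/kubernetes/pki/admin-key.pem --embed-certs=true --kubeconfig=/etc/kubernetes/admin.kubeconfig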
kubectl config set-context kubernetes --cluster=kubernetes --user=admin --kubeconfig=/etc/kubernetes/admin.kubeconfig
kubectl config use-context kubernetes --kubeconfig=/etc/kubernetes/admin.kubeconfig
# Environment variable
export KUBECONFIG=/etc/kubernetes/admin.kubeconfig
echo "export KUBECONFIG=/etc/kubernetes/admin.kubeconfig" >> /etc/profile
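# Create the service file (set --advertise-address to each master's own IP).
# The sa.pub / sa.key pair and the front-proxy-* files referenced below must already exist in /etc/kubernetes/pki, otherwise the service will not start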
cat > /etc/systemd/system/kube-apiserver.service << 'EOF'
[Unit]
Description=Kubernetes API Server
Documentation=https://kubernetes.io/docs/
After=network.target
Wants=etcd.service
[Service]
ExecStart=/usr/local/bin/kube-apiserver \
--advertise-address=192.168.26.31 \
--allow-privileged=true \
--authorization-mode=Node,RBAC \
--client-ca-file=/etc/kubernetes/pki/ca.pem \
--enable-admission-plugins=NodeRestriction \
--enable-bootstrap-token-auth=true \
--etcd-cafile=/etc/kubernetes/pki/ca.pem \
--etcd-certfile=/etc/kubernetes/pki/apiserver.pem \
--etcd-keyfile=/etc/kubernetes/pki/apiserver-key.pem \
--etcd-servers=https://192.168.26.31:2379,https://192.168.26.32:2379,https://192.168.26.33:2379 \
--kubelet-client-certificate=/etc/kubernetes/pki/apiserver.pem \
--kubelet-client-key=/etc/kubernetes/pki/apiserver-key.pem \
--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname \
--proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.pem \
--proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client-key.pem \
--requestheader-allowed-names=front-proxy-client \
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.pem \
--requestheader-extra-headers-prefix=X-Remote-Extra- \
--requestheader-group-headers=X-Remote-Group \
--requestheader-username-headers=X-Remote-User \
--secure-port=6443 \
--service-account-issuer=https://kubernetes.default.svc.cluster.local \
--service-account-key-file=/etc/kubernetes/pki/sa.pub \
--service-account-signing-key-file=/etc/kubernetes/pki/sa.key \
--service-cluster-ip-range=10.96.0.0/16 \
--tls-cert-file=/etc/kubernetes/pki/apiserver.pem \
--tls-private-key-file=/etc/kubernetes/pki/apiserver-key.pem
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable kube-apiserver --now
7.3 Deploy kube-controller-manager
cat > /etc/systemd/system/kube-controller-manager.service << 'EOF'
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://kubernetes.io/docs/
After=network.target
Wants=kube-apiserver.service
[Service]
ExecStart=/usr/local/bin/kube-controller-manager \
--allocate-node-cidrs=true \
--authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
--authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
--bind-address=127.0.0.1 \
--cluster-cidr=10.244.0.0/16 \
--cluster-name=kubernetes \
--cluster-signing-cert-file=/etc/kubernetes/pki/ca.pem \
--cluster-signing-key-file=/etc/kubernetes/pki/ca-key.pem \
--controllers=*,bootstrapsigner,tokencleaner \
--kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
--leader-elect=true \
--node-cidr-mask-size=24 \
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.pem \
--root-ca-file=/etc/kubernetes/pki/ca.pem \
--service-account-private-key-file=/etc/kubernetes/pki/sa.key \
--service-cluster-ip-range=10.96.0.0/16 \
--use-service-account-credentials=true
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable kube-controller-manager --now
7.4 Deploy kube-scheduler
cat > /etc/systemd/system/kube-scheduler.service << 'EOF'
[Unit]
Description=Kubernetes Scheduler
Documentation=https://kubernetes.io/docs/
After=network.target
Wants=kube-apiserver.service
[Service]
ExecStart=/usr/local/bin/kube-scheduler \
--authentication-kubeconfig=/etc/kubernetes/scheduler.kubeconfig \
--authorization-kubeconfig=/etc/kubernetes/scheduler.kubeconfig \
--bind-address=127.0.0.1 \
--kubeconfig=/etc/kubernetes/scheduler.kubeconfig \
--leader-elect=true
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable kube-scheduler --now
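(The controller-manager and scheduler kubeconfig files are built the same way as the admin one above, and they must exist before these two services are started. Each needs its own client certificate, with CN=system:kube-controller-manager and CN=system:kube-scheduler respectively, so that the built-in RBAC bindings apply.)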
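To make that concrete, here is a small dry-run sketch that prints the four kubectl config commands for one component so you can review them before running anything. `kubeconfig_cmds` is a hypothetical helper name, and the certificate file names (controller-manager.pem, scheduler.pem) are my assumption about how you named the output of cfssl; the kubeconfig paths match the unit files above.

```shell
#!/bin/sh
# Print (not run) the kubectl commands that build a component kubeconfig.
#   $1 = user name recorded in the kubeconfig (should match the certificate CN)
#   $2 = base name shared by the certificate files and the kubeconfig
kubeconfig_cmds() {
  user="$1"; base="$2"
  pki="/etc/kubernetes/pki"
  kc="/etc/kubernetes/${base}.kubeconfig"
  server="https://192.168.26.30:6443"
  cat << EOF
kubectl config set-cluster kubernetes --certificate-authority=${pki}/ca.pem --embed-certs=true --server=${server} --kubeconfig=${kc}
kubectl config set-credentials ${user} --client-certificate=${pki}/${base}.pem --client-key=${pki}/${base}-key.pem --embed-certs=true --kubeconfig=${kc}
kubectl config set-context kubernetes --cluster=kubernetes --user=${user} --kubeconfig=${kc}
kubectl config use-context kubernetes --kubeconfig=${kc}
EOF
}

kubeconfig_cmds system:kube-controller-manager controller-manager
kubeconfig_cmds system:kube-scheduler scheduler
```

Once the output looks right, pipe it to sh on master01 and copy the resulting kubeconfig files to the other masters.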
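8. Deploy the worker node components
8.1 Install kubelet and kube-proxy
# Run on every worker node (and on the masters too, if you want them to run workloads).
# kubelet and kube-proxy come from kubernetes/server/bin in the tarball unpacked in 7.1; copy them to the node first
cp kubelet kube-proxy /usr/local/bin/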
mkdir -p /var/lib/kubelet /var/lib/kube-proxy /etc/kubernetes/pki
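8.2 Configure the bootstrap token
# Generate the token on master01. Format: 6-character ID + "." + 16-character secret, all [a-z0-9]
TOKEN_ID=$(head -c 3 /dev/urandom | xxd -p) # 3 bytes -> 6 hex characters
TOKEN_SECRET=$(head -c 8 /dev/urandom | xxd -p) # 8 bytes -> 16 hex characters
TOKEN="${TOKEN_ID}.${TOKEN_SECRET}"
# token.csv is only read by static token auth (--token-auth-file), which the apiserver unit above does not enable; bootstrap-token auth uses the Secret below
echo "$TOKEN,kubelet-bootstrap,10001,\"system:kubelet-bootstrap\"" > /etc/kubernetes/token.csv
# Create the bootstrap Secret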
cat > bootstrap-secret.yaml << EOF
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-${TOKEN_ID}
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  token-id: "${TOKEN_ID}"
  token-secret: "${TOKEN_SECRET}"
  usage-bootstrap-authentication: "true"
  usage-bootstrap-signing: "true"
  auth-extra-groups: "system:bootstrappers:worker,system:bootstrappers:default"
EOF
kubectl apply -f bootstrap-secret.yaml
# Create the ClusterRoleBinding
kubectl create clusterrolebinding kubelet-bootstrap \
--clusterrole=system:node-bootstrapper \
--group=system:bootstrappers
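Bootstrap tokens have a strict shape: a 6-character ID, a dot, and a 16-character secret, all from [a-z0-9]. A token of any other length is simply not accepted, and the resulting authentication failure is easy to misread as a certificate problem. A quick format check, as a sketch (`valid_bootstrap_token` and `rand_hex` are hypothetical helper names):

```shell
#!/bin/sh
# Succeeds only if the token looks like "<6 chars>.<16 chars>", both parts [a-z0-9].
valid_bootstrap_token() {
  printf '%s\n' "$1" | grep -qE '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# N random bytes rendered as 2N lowercase hex characters.
rand_hex() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

TOKEN="$(rand_hex 3).$(rand_hex 8)"   # 3 bytes -> 6 chars, 8 bytes -> 16 chars
if valid_bootstrap_token "$TOKEN"; then
  echo "token format OK"
else
  echo "token format INVALID" >&2
fi
```

Note the byte counts: hex encoding doubles the length, so 6 and 16 random bytes would give 12 and 32 characters, which fails the check.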
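8.3 Configure kubelet
# Generate bootstrap.kubeconfig (on master01, where ${TOKEN} is still set), then copy it together with ca.pem to each worker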
kubectl config set-cluster kubernetes --certificate-authority=/etc/kubernetes/pki/ca.pem \
--embed-certs=true --server=https://192.168.26.30:6443 --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig
kubectl config set-credentials kubelet-bootstrap --token=${TOKEN} --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig
kubectl config set-context kubernetes --cluster=kubernetes --user=kubelet-bootstrap --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig
kubectl config use-context kubernetes --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig
# Create the kubelet configuration
cat > /var/lib/kubelet/kubelet-config.yaml << EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: 0.0.0.0
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    # lets the kubelet verify the apiserver's client certificate; ca.pem must be present on the node
    clientCAFile: /etc/kubernetes/pki/ca.pem
authorization:
  mode: Webhook
cgroupDriver: systemd
clusterDNS:
  - 10.96.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: unix:///run/containerd/containerd.sock
hairpinMode: hairpin-veth
readOnlyPort: 0
serializeImagePulls: false
tlsCertFile: /var/lib/kubelet/pki/kubelet.crt
tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
EOF
# systemd service (set --node-ip to each node's own IP)
cat > /etc/systemd/system/kubelet.service << 'EOF'
[Unit]
Description=Kubernetes Kubelet
Documentation=https://kubernetes.io/docs/
After=containerd.service
Requires=containerd.service
[Service]
ExecStart=/usr/local/bin/kubelet \
--bootstrap-kubeconfig=/etc/kubernetes/bootstrap.kubeconfig \
--kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
--config=/var/lib/kubelet/kubelet-config.yaml \
--container-runtime-endpoint=unix:///run/containerd/containerd.sock \
--node-ip=192.168.26.34 \
--pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.9 \
--v=2
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable kubelet --now
8.4 Configure kube-proxy
# Generate the kube-proxy kubeconfig
kubectl config set-cluster kubernetes --certificate-authority=/etc/kubernetes/pki/ca.pem \
--embed-certs=true --server=https://192.168.26.30:6443 --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig
kubectl config set-credentials kube-proxy --client-certificate=/etc/kubernetes/pki/kube-proxy.pem \
--client-key=/etc/kubernetes/pki/kube-proxy-key.pem --embed-certs=true --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig
kubectl config set-context kubernetes --cluster=kubernetes --user=kube-proxy --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig
kubectl config use-context kubernetes --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig
# kube-proxy service (set --hostname-override per node; ipvs mode also needs the ip_vs kernel modules loaded)
cat > /etc/systemd/system/kube-proxy.service << 'EOF'
[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://kubernetes.io/docs/
After=network.target
[Service]
ExecStart=/usr/local/bin/kube-proxy \
--cluster-cidr=10.244.0.0/16 \
--hostname-override=k8s-node01 \
--kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig \
--proxy-mode=ipvs \
--v=2
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable kube-proxy --now
8.5 Approve the node CSRs
On first start, kubelet automatically submits a CSR, which has to be approved on a master:
# List the pending CSRs
kubectl get csr
# Approve in bulk
kubectl get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs kubectl certificate approve
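One pitfall here: if a node never shows up in kubectl get nodes, nine times out of ten its CSR is still Pending. The first time I did this I nearly concluded kubelet was broken, when all I had done was forget to run this command.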
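9. Deploy the CNI network plugin (Calico)
Without a CNI plugin, Pods cannot talk to each other and the nodes stay NotReady.
# Download the Calico manifest
wget https://raw.githubusercontent.com/projectcalico/calico/v3.29.1/manifests/calico.yaml
# Set the Pod CIDR. CALICO_IPV4POOL_CIDR is commented out in the manifest, so uncomment it as well as changing the value;
# the default 192.168.0.0/16 would overlap this guide's node network 192.168.26.0/24
sed -i -e 's|# - name: CALICO_IPV4POOL_CIDR|- name: CALICO_IPV4POOL_CIDR|' \
-e 's|#   value: "192.168.0.0/16"|  value: "10.244.0.0/16"|' calico.yaml
kubectl apply -f calico.yaml
# Wait until all Pods are Running
kubectl get pods -n kube-system -w
Once Calico is deployed, all nodes should turn Ready.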
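Why insist on changing the Pod CIDR? Calico's manifest default, 192.168.0.0/16, contains this guide's node network 192.168.26.0/24, and overlapping ranges lead to routing problems that are painful to diagnose. The check is simple: two CIDR blocks overlap exactly when they agree on the first min(prefix A, prefix B) bits. A minimal sketch (`ip_to_int` and `cidrs_overlap` are hypothetical helper names; IPv4 only, prefixes from 1 to 32):

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs="$IFS"; IFS=.
  set -- $1
  IFS="$old_ifs"
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if the two CIDR blocks share at least one address.
cidrs_overlap() {
  a_ip=$(ip_to_int "${1%/*}"); a_len="${1#*/}"
  b_ip=$(ip_to_int "${2%/*}"); b_len="${2#*/}"
  len="$a_len"
  if [ "$b_len" -lt "$len" ]; then len="$b_len"; fi
  shift_bits=$(( 32 - len ))
  # Compare only the bits covered by the shorter prefix.
  [ $(( a_ip >> shift_bits )) -eq $(( b_ip >> shift_bits )) ]
}

NODE_NET="192.168.26.0/24"
for pod_cidr in 192.168.0.0/16 10.244.0.0/16; do
  if cidrs_overlap "$pod_cidr" "$NODE_NET"; then
    echo "$pod_cidr overlaps $NODE_NET"
  else
    echo "$pod_cidr is safe next to $NODE_NET"
  fi
done
```

The same check is worth running between the Pod and Service ranges whenever you deviate from the network plan in 2.3.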
10. Configure API Server high availability
10.1 Install HAProxy
Run on the LB nodes:
apt install -y haproxy
cat > /etc/haproxy/haproxy.cfg << 'EOF'
global
log /dev/log local0
maxconn 4096
user haproxy
group haproxy
defaults
log global
mode tcp
option tcplog
retries 3
timeout connect 10s
timeout client 60s
timeout server 60s
frontend kubernetes-apiserver
bind *:6443
mode tcp
default_backend kubernetes-apiserver-backend
backend kubernetes-apiserver-backend
mode tcp
balance roundrobin
server master01 192.168.26.31:6443 check fall 3 rise 2
server master02 192.168.26.32:6443 check fall 3 rise 2
server master03 192.168.26.33:6443 check fall 3 rise 2
EOF
systemctl enable haproxy --now
10.2 Install Keepalived
apt install -y keepalived
cat > /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
router_id LVS_DEVEL
}
vrrp_script check_haproxy {
script "/usr/bin/killall -0 haproxy"
interval 3
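weight -20 # must exceed the priority gap between the two LB nodes (100 vs 90), otherwise the VIP never moves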
fall 10
rise 2
}
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass 12345678
}
virtual_ipaddress {
192.168.26.30/24
}
track_script {
check_haproxy
}
}
EOF
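# On the second LB node set state to BACKUP and priority to 90; on both nodes change interface eth0 to the real NIC name (check with ip addr)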
systemctl enable keepalived --now
# Verify the VIP
ip addr show eth0 | grep 192.168.26.30
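One detail in the Keepalived config deserves a sanity check: when check_haproxy fails, the node's effective priority becomes priority + weight (the weight is negative), and the VIP only moves if that drops below the other node's priority. With priorities 100 and 90, a weight of -2 leaves the failed node at 98, so it keeps the VIP even though HAProxy is dead. A tiny sketch of the arithmetic (`failover_happens` is a hypothetical helper name):

```shell
#!/bin/sh
# Succeed if the backup would take over once the master's health check fails.
#   $1 = master priority, $2 = backup priority, $3 = (negative) script weight
failover_happens() {
  master_prio="$1"; backup_prio="$2"; weight="$3"
  [ $(( master_prio + weight )) -lt "$backup_prio" ]
}

for w in -2 -20; do
  if failover_happens 100 90 "$w"; then
    echo "weight $w: VIP moves to the backup"
  else
    echo "weight $w: failed master keeps the VIP"
  fi
done
```

After changing the config, the real test is still to stop haproxy on the active node and watch the VIP with ip addr on both machines.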
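11. Cluster verification
# Check node status
kubectl get nodes -o wide
# Check all system Pods
kubectl get pods -n kube-system
# Check component health (kubectl get cs still responds, but ComponentStatus has been deprecated since v1.19)
kubectl get --raw='/readyz?verbose'
Expected output: every node is Ready and all system Pods are Running (apart from any that are still Pending, which you can ignore for now).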
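If you would rather script this check than eyeball it, the node list is easy to filter. A minimal sketch follows; `not_ready_nodes` is a hypothetical helper name, and it treats any STATUS other than exactly "Ready" (including "Ready,SchedulingDisabled") as worth a look.

```shell
#!/bin/sh
# Read `kubectl get nodes --no-headers` output on stdin and print the name of
# every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage on a master:
#   kubectl get nodes --no-headers | not_ready_nodes
# Empty output means every node is Ready.
```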
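12. Common problems and fixes
Problem 1: kubelet fails to start with an x509 certificate error
Check that the certificate's SAN field contains every IP and domain name it needs. The VIP in particular is one that many people forget.
Problem 2: node status is NotReady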
kubectl describe node <node-name>
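If the cause is a missing network plugin, go back and deploy Calico.
Problem 3: Pods cannot communicate with each other
Check whether Calico is healthy: kubectl get pods -n kube-system | grep calico. If the Pods are not Running, re-apply the Calico manifest.
Problem 4: kubelet fails on startup because the --node-labels argument is malformed
Replace --node-labels=node.kubernetes.io/node='' with --node-labels=node.kubernetes.io/node= (drop the empty quotes).
Problem 5: the etcd cluster won't come up and reports peer communication failures
Check that port 2380 is open in the firewall and that the certificate configuration is correct.
Problem 6: containerd fails to start
Confirm that the overlay kernel module is loaded: lsmod | grep overlay.
13. Closing remarks
At this point, a production-grade, highly available Kubernetes cluster is deployed. The process spans certificate management, the etcd cluster, the master components, the worker nodes, the network plugin, and load balancing; a problem in any one of them can keep the cluster from working.
I put together a quick troubleshooting table that you may want to bookmark: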
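| Symptom | Quick check |
|-------------------------|-------------------------------------------|
| apiserver fails to start | journalctl -u kube-apiserver -xe |
| etcd cluster unhealthy | etcdctl endpoint health |
| kubelet unresponsive | systemctl status kubelet + journalctl |
| Node NotReady | Check whether the CNI is deployed |
| Pod stuck in ContainerCreating | Check containerd + CNI |
| x509 errors | Check that the certificate SANs are complete |
If you followed this guide to the last step, congratulations: your understanding of K8s internals already surpasses that of quite a few "kubeadm engineers".
Have you hit any bizarre pitfalls of your own? Share them in the comments; they may well help whoever comes next.
See you next time, fellow ops folks.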