The Complete Guide to Deploying Kubernetes from Binaries: Building a Production-Grade HA Cluster from Scratch

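1. Why deploy from binaries?

Ever hit this with a one-shot kubeadm install: the cluster breaks for no obvious reason, the logs tell you nothing useful, and when you try to tune a component's flags, kubeadm gets in your way?

Binary deployment exists to solve exactly these problems.

Its core value: you control the version, configuration and startup flags of every component; you can upgrade a single component without touching the rest; and you come away actually understanding how the components work together.

When it fits

  • Highly available production clusters (finance, government and other industries with strict compliance requirements)
  • Environments that need custom kernel parameters or network plugins
  • Deploying and maintaining clusters in air-gapped environments
  • Cases where cluster version upgrades must be tightly controlled

And one scenario I know first-hand: the interviewer asks "what does the Kubernetes certificate chain look like?" and you answer without hesitation, because you signed every certificate yourself.

Honestly, binary deployment is a lot more work than kubeadm, but the extra half day completely changes how well you understand Kubernetes.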


2. Architecture planning

2.1 Overall architecture

We are building a highly available cluster with 3 masters and 2 workers, plus a 3-member etcd cluster that runs as its own systemd service on the three master hosts. The API servers sit behind HAProxy + Keepalived, which provide load balancing and a floating VIP.

2.2 Node plan

| Role | Hostname | IP address | Components |
|-----------|--------------|---------------|--------------------------------|
| Master-01 | k8s-master01 | 192.168.26.31 | apiserver, controller-manager, scheduler, etcd |
| Master-02 | k8s-master02 | 192.168.26.32 | apiserver, controller-manager, scheduler, etcd |
| Master-03 | k8s-master03 | 192.168.26.33 | apiserver, controller-manager, scheduler, etcd |
| Worker-01 | k8s-node01 | 192.168.26.34 | kubelet, kube-proxy |
| Worker-02 | k8s-node02 | 192.168.26.35 | kubelet, kube-proxy |
| LB-01 | k8s-lb01 | 192.168.26.36 | HAProxy + Keepalived |
| LB-02 | k8s-lb02 | 192.168.26.37 | HAProxy + Keepalived |
| VIP | - | 192.168.26.30 | Virtual IP |

These are the component versions I use (stable in my testing):

| Component | Version |
|-------------|---------|
| Kubernetes | v1.34.6 |
| etcd | v3.5.18 |
| containerd | 2.1.2 |
| runc | 1.3.0 |
| cni-plugins | 1.8.0 |
| Calico | v3.29.1 |

(I settled on etcd 3.5.18 after getting burned: in my experience some 3.6.x releases showed performance jitter under light load, while 3.5.x has been much steadier.)

2.3 Network plan

| Network | CIDR | Notes |
|-----------|-----------------|----------------|
| Node network | 192.168.26.0/24 | Physical node IP range |
| Pod network | 10.244.0.0/16 | Allocated by Calico |
| Service network | 10.96.0.0/16 | ClusterIP range |
| VIP | 192.168.26.30 | API Server virtual IP |
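
A quick way to sanity-check a plan like this is to confirm that the three ranges do not overlap and to derive the first Service IP, which shows up again in the certificate section. The sketch below is plain POSIX shell arithmetic; the function names are my own and not part of any Kubernetes tooling.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip2int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Convert a 32-bit integer back to dotted-quad form.
int2ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Print the first address (as an integer) of a CIDR such as 10.96.0.0/16.
cidr_start() (
  size=$(( 1 << (32 - ${1#*/}) ))
  echo $(( $(ip2int "${1%/*}") & ~(size - 1) ))
)

# Print the last address (as an integer) of a CIDR.
cidr_end() (
  echo $(( $(cidr_start "$1") + (1 << (32 - ${1#*/})) - 1 ))
)

# Succeed (exit 0) if the two CIDRs share at least one address.
cidrs_overlap() {
  [ "$(cidr_start "$1")" -le "$(cidr_end "$2")" ] &&
  [ "$(cidr_start "$2")" -le "$(cidr_end "$1")" ]
}

# First usable address of a CIDR, e.g. the ClusterIP of the "kubernetes" Service.
first_host() {
  int2ip $(( $(cidr_start "$1") + 1 ))
}

first_host 10.96.0.0/16
cidrs_overlap 10.244.0.0/16 10.96.0.0/16 || echo "Pod and Service CIDRs do not overlap"
cidrs_overlap 192.168.26.0/24 192.168.0.0/16 && echo "node network overlaps 192.168.0.0/16"
```

The last line is worth noting: 192.168.0.0/16 is the default pool in the Calico manifest used later, which is why the Pod CIDR has to be changed for this node network.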


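3. Prerequisites

3.1 Hardware requirements

My recommended minimum specs (based on real production load testing):

| Node type | CPU | Memory | Disk |
|-------------|-----|-------|------------|
| Master / control plane | ≥4 cores | ≥16GB | ≥100GB SSD |
| Worker node | ≥2 cores | ≥8GB | ≥50GB |
| etcd node | ≥2 cores | ≥8GB | ≥200GB SSD |

⚠️ Pay special attention to etcd's disks: etcd is very sensitive to disk I/O latency, and in my tests the stability gap between HDD and SSD was huge. Run etcd on spinning disks and the cluster will very likely suffer all sorts of timeouts.

3.2 Operating system

I use Ubuntu 22.04 LTS; CentOS 7.9/8.5 also work. Note, though, that CentOS 7's default 3.10 kernel can cause kubelet crashes, so I strongly recommend upgrading to 5.4.x or later.

3.3 Base environment setup (all nodes)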

# 1. Set the hostname (run on each node with its own name)
hostnamectl set-hostname k8s-master01   # change per node

# 2. Populate /etc/hosts
cat >> /etc/hosts << 'EOF'
192.168.26.31 k8s-master01
192.168.26.32 k8s-master02
192.168.26.33 k8s-master03
192.168.26.34 k8s-node01
192.168.26.35 k8s-node02
192.168.26.36 k8s-lb01
192.168.26.37 k8s-lb02
192.168.26.30 k8s-apiserver
EOF

# 3. Turn off swap. Many people forget this step, and kubelet then refuses to start
swapoff -a
sed -i '/swap/d' /etc/fstab

# 4. Set the time zone and enable time synchronization
timedatectl set-timezone Asia/Shanghai
apt update && apt install -y chrony
systemctl enable chrony && systemctl start chrony

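# 5. Base tools (ipset and ipvsadm support kube-proxy's IPVS mode, used later)
apt install -y wget jq vim net-tools curl \
  apt-transport-https ca-certificates ipset ipvsadm

# 6. Load kernel modules (the ip_vs* modules back kube-proxy's IPVS mode)
cat > /etc/modules-load.d/k8s.conf << 'EOF'
overlay
br_netfilter
nf_conntrack
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
EOF
# modules-load.d only applies at boot, so load them now as well
while read -r mod; do modprobe "$mod"; done < /etc/modules-load.d/k8s.conf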

# 7. Set kernel parameters
cat > /etc/sysctl.d/k8s.conf << 'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 0
vm.swappiness = 0
EOF
sysctl --system

# 8. Set up passwordless SSH (run on master01)
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for host in 192.168.26.31 192.168.26.32 192.168.26.33 192.168.26.34 192.168.26.35; do
  ssh-copy-id root@$host
done
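
Before moving on, it can help to confirm that the swap and kernel-module steps actually took effect on each node. The two helpers below are a minimal sketch of such a check; the function names are my own, and the optional file arguments exist only so the logic can be exercised without touching /proc.

```shell
# Succeed if no swap device is active. /proc/swaps always has one header line;
# every further line is an active swap area. $1 overrides the file for testing.
swap_is_off() {
  [ "$(wc -l < "${1:-/proc/swaps}")" -le 1 ]
}

# Print every module from the space-separated list in $1 that does not appear
# in a /proc/modules-style file ($2, default /proc/modules).
missing_modules() (
  for mod in $1; do
    grep -q "^${mod} " "${2:-/proc/modules}" || echo "$mod"
  done
)
```

On a node you would call `swap_is_off || echo "swap is still on"` and `missing_modules "overlay br_netfilter nf_conntrack"`. One caveat: a module compiled into the kernel does not appear in /proc/modules, so a name printed here is a hint to investigate rather than proof of a problem.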

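3.4 About the firewall

While you are debugging a binary deployment, I suggest turning the firewall off first; once the cluster works, add allow rules as needed.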

systemctl stop ufw && systemctl disable ufw   # Ubuntu
# CentOS: systemctl stop firewalld && systemctl disable firewalld

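4. Certificate management

Certificates are the easiest place to go wrong in a binary deployment, bar none. The first time I did this, x509 errors cost me an entire day. The flow below has been verified several times; follow it and you should be fine.

4.1 Download the cfssl tools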

wget -q --show-progress https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl_1.6.5_linux_amd64 -O /usr/local/bin/cfssl
wget -q --show-progress https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssljson_1.6.5_linux_amd64 -O /usr/local/bin/cfssljson
chmod +x /usr/local/bin/cfssl /usr/local/bin/cfssljson

mkdir -p /etc/kubernetes/pki
cd /etc/kubernetes/pki

4.2 Generate the CA certificate

# CA configuration file
cat > ca-config.json << 'EOF'
{
  "signing": {
    "default": {
      "expiry": "87600h"
    },
    "profiles": {
      "kubernetes": {
        "usages": ["signing", "key encipherment", "server auth", "client auth"],
        "expiry": "87600h"
      }
    }
  }
}
EOF

# CA certificate signing request
cat > ca-csr.json << 'EOF'
{
  "CN": "kubernetes-ca",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [
    { "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s", "OU": "System" }
  ]
}
EOF

# Generate the CA
cfssl gencert -initca ca-csr.json | cfssljson -bare ca

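4.3 Generate the API Server certificate

This is the most critical step: the certificate's SAN list (the hosts field of the cfssl CSR) must contain every master node IP, the VIP, and the first IP of the Service CIDR (the API Server's in-cluster ClusterIP). Miss one and you are in for x509 errors.

# The .1 address of the Service CIDR
SERVICE_CIDR_IP="10.96.0.1"
VIP="192.168.26.30"

# Unquoted EOF, so ${SERVICE_CIDR_IP} and ${VIP} are expanded into the hosts list
cat > apiserver-csr.json << EOF
{
  "CN": "kube-apiserver",
  "hosts": [
    "127.0.0.1",
    "${SERVICE_CIDR_IP}",
    "${VIP}",
    "192.168.26.31", "192.168.26.32", "192.168.26.33",
    "k8s-apiserver",
    "kubernetes",
    "kubernetes.default",
    "kubernetes.default.svc",
    "kubernetes.default.svc.cluster.local"
  ],
  "key": { "algo": "rsa", "size": 2048 },
  "names": [
    { "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s", "OU": "System" }
  ]
}
EOF

# Generate the certificate (key point: the SAN list must be complete)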
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json \
  -profile=kubernetes apiserver-csr.json | cfssljson -bare apiserver
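
Since a missing SAN is the classic failure here, it is worth checking the issued certificate right away. The helper below is a sketch (the name `missing_sans` is mine): it reads the text dump produced by `openssl x509 -noout -text` on stdin and prints whichever required names are absent.

```shell
# Read `openssl x509 -noout -text` output on stdin and print each required
# name (given as arguments) that is missing from the Subject Alternative Name list.
missing_sans() (
  # The SAN entries sit on the line right after the extension header.
  sans=$(awk '/Subject Alternative Name/ { getline; print; exit }' |
         tr ',' '\n' |
         sed -e 's/^ *//' -e 's/^DNS://' -e 's/^IP Address://')
  for want in "$@"; do
    printf '%s\n' "$sans" | grep -qxF -- "$want" || echo "$want"
  done
)
```

Typical use would be `openssl x509 -in apiserver.pem -noout -text | missing_sans 10.96.0.1 192.168.26.30 192.168.26.31 192.168.26.32 192.168.26.33 kubernetes`; no output means every listed name is present.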

4.4 Generate the other component certificates

# etcd cluster certificate
cat > etcd-csr.json << 'EOF'
{
  "CN": "etcd",
  "hosts": [
    "127.0.0.1",
    "192.168.26.31","192.168.26.32","192.168.26.33"
  ],
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s" }]
}
EOF
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json \
  -profile=kubernetes etcd-csr.json | cfssljson -bare etcd

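# kube-proxy certificate
cat > kube-proxy-csr.json << 'EOF'
{
  "CN": "system:kube-proxy",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s" }]
}
EOF
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json \
  -profile=kubernetes kube-proxy-csr.json | cfssljson -bare kube-proxy

# The kube-apiserver and kube-controller-manager units in section 7 also reference a
# front-proxy CA/client pair and a ServiceAccount signing key pair, so create them here.
cat > front-proxy-ca-csr.json << 'EOF'
{ "CN": "front-proxy-ca", "key": { "algo": "rsa", "size": 2048 } }
EOF
cfssl gencert -initca front-proxy-ca-csr.json | cfssljson -bare front-proxy-ca
cat > front-proxy-client-csr.json << 'EOF'
{ "CN": "front-proxy-client", "key": { "algo": "rsa", "size": 2048 } }
EOF
cfssl gencert -ca=front-proxy-ca.pem -ca-key=front-proxy-ca-key.pem -config=ca-config.json \
  -profile=kubernetes front-proxy-client-csr.json | cfssljson -bare front-proxy-client
openssl genrsa -out sa.key 2048
openssl rsa -in sa.key -pubout -out sa.pub

(By the way, kubelet certificates are issued automatically through TLS bootstrapping, so there is no need to generate them by hand.)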


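5. Deploy the etcd cluster

Kubernetes keeps all of its state in etcd; if etcd goes down, the whole cluster is paralyzed. So etcd must be highly available, and it should run as its own cluster rather than as something the control plane manages. I use 3 members, which in this layout live on the three master hosts.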
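
Why three members? etcd commits a write only when a majority (quorum) of members agree, so the number of failures a cluster survives follows directly from its size. A small worked calculation, using function names of my own choosing:

```shell
# Members that must agree for a write to commit: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while the cluster stays writable.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Three members tolerate one failure, and a fourth adds nothing (still one), which is why etcd clusters are sized at odd numbers such as 3 or 5.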

5.1 Download and install etcd

# Run on all three etcd nodes
ETCD_VER="v3.5.18"
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzf etcd-${ETCD_VER}-linux-amd64.tar.gz
mv etcd-${ETCD_VER}-linux-amd64/etcd* /usr/local/bin/

mkdir -p /var/lib/etcd /etc/etcd

5.2 Distribute certificates and configuration

# Distribute the certificates from master01
for node in 192.168.26.31 192.168.26.32 192.168.26.33; do
  ssh root@$node "mkdir -p /etc/etcd/ssl"
  scp /etc/kubernetes/pki/{ca,etcd}.pem root@$node:/etc/etcd/ssl/
  scp /etc/kubernetes/pki/etcd-key.pem root@$node:/etc/etcd/ssl/
done

5.3 Configure etcd

Each node uses the configuration below (replace name and the IPs with that node's own values):

# /etc/etcd/etcd.conf.yml
name: 'etcd-1'                                   # etcd-2 / etcd-3 on the other nodes
data-dir: '/var/lib/etcd'
wal-dir: '/var/lib/etcd/wal'
snapshot-count: 10000
heartbeat-interval: 500
election-timeout: 5000

listen-peer-urls: 'https://192.168.26.31:2380'  # use this node's own IP
listen-client-urls: 'https://192.168.26.31:2379,https://127.0.0.1:2379'
advertise-client-urls: 'https://192.168.26.31:2379'
initial-advertise-peer-urls: 'https://192.168.26.31:2380'

initial-cluster: 'etcd-1=https://192.168.26.31:2380,etcd-2=https://192.168.26.32:2380,etcd-3=https://192.168.26.33:2380'
initial-cluster-token: 'etcd-cluster'
initial-cluster-state: 'new'

client-transport-security:
  cert-file: '/etc/etcd/ssl/etcd.pem'
  key-file: '/etc/etcd/ssl/etcd-key.pem'
  trusted-ca-file: '/etc/etcd/ssl/ca.pem'
peer-transport-security:
  cert-file: '/etc/etcd/ssl/etcd.pem'
  key-file: '/etc/etcd/ssl/etcd-key.pem'
  trusted-ca-file: '/etc/etcd/ssl/ca.pem'


5.4 Create the systemd service

# /etc/systemd/system/etcd.service
cat > /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.conf.yml
Restart=always
RestartSec=10
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable etcd --now

# Verify the cluster
etcdctl --cacert=/etc/etcd/ssl/ca.pem \
  --cert=/etc/etcd/ssl/etcd.pem \
  --key=/etc/etcd/ssl/etcd-key.pem \
  --endpoints="https://192.168.26.31:2379,https://192.168.26.32:2379,https://192.168.26.33:2379" \
  endpoint health

Expected output: all three endpoints report "is healthy".


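6. Deploy the container runtime (containerd)

Docker is no longer part of the picture: Kubernetes removed dockershim in v1.24, and production clusters today mostly run containerd.

6.1 Download and install

# Run on all nodes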
wget https://github.com/containerd/containerd/releases/download/v2.1.2/containerd-2.1.2-linux-amd64.tar.gz
tar xzf containerd-2.1.2-linux-amd64.tar.gz -C /usr/local/

wget https://github.com/opencontainers/runc/releases/download/v1.3.0/runc.amd64
install -m 755 runc.amd64 /usr/local/sbin/runc

wget https://github.com/containernetworking/plugins/releases/download/v1.8.0/cni-plugins-linux-amd64-v1.8.0.tgz
mkdir -p /opt/cni/bin
tar xzf cni-plugins-linux-amd64-v1.8.0.tgz -C /opt/cni/bin/

mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml

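6.2 Configure containerd

# Adjust the key settings
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# Point the pause (sandbox) image at a reachable registry. containerd 1.x names the key
# sandbox_image, 2.x names it sandbox (under pinned_images); the pattern covers both.
sed -i -E "s#^(\s*sandbox(_image)? = ).*#\1'registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.9'#" /etc/containerd/config.toml

# Registry mirror for users in mainland China (replace the Aliyun endpoint with your own).
# The inline registry.mirrors table is deprecated; set config_path and use a hosts.toml instead.
sed -i -E "s#^(\s*config_path = ).*#\1'/etc/containerd/certs.d'#" /etc/containerd/config.toml
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml << 'EOF'
server = "https://docker.io"

[host."https://xxxx.mirror.aliyuncs.com"]
  capabilities = ["pull", "resolve"]
EOF

# Create the systemd service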
cat > /etc/systemd/system/containerd.service << 'EOF'
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable containerd --now
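
The SystemdCgroup switch matters because containerd and kubelet must agree on the cgroup driver; the kubelet configuration later in this guide sets cgroupDriver: systemd. The sketch below compares the two files. Both function names are my own, and the matching is deliberately simple text matching rather than real TOML or YAML parsing.

```shell
# Report which cgroup driver a containerd config.toml selects for runc:
# "systemd" if an uncommented "SystemdCgroup = true" line exists, else "cgroupfs".
containerd_cgroup_driver() {
  if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"; then
    echo systemd
  else
    echo cgroupfs
  fi
}

# Read the top-level cgroupDriver key from a KubeletConfiguration YAML file.
# Prints "unset" when the key is absent.
kubelet_cgroup_driver() (
  drv=$(sed -n 's/^cgroupDriver:[[:space:]]*//p' "$1")
  echo "${drv:-unset}"
)
```

On a node, `containerd_cgroup_driver /etc/containerd/config.toml` and `kubelet_cgroup_driver /var/lib/kubelet/kubelet-config.yaml` should both print systemd; a mismatch typically shows up as Pods that fail to start.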

7. Deploy the Kubernetes master components

7.1 Download the Kubernetes binaries

VERSION="v1.34.6"
wget https://dl.k8s.io/${VERSION}/kubernetes-server-linux-amd64.tar.gz
tar xzf kubernetes-server-linux-amd64.tar.gz
cd kubernetes/server/bin
cp kube-apiserver kube-controller-manager kube-scheduler kubectl /usr/local/bin/

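7.2 Deploy kube-apiserver

# Issue a client certificate for the admin kubeconfig. O=system:masters is the group the
# built-in cluster-admin binding trusts; the apiserver's own serving cert carries no such rights.
cat > /etc/kubernetes/pki/admin-csr.json << 'EOF'
{
  "CN": "admin",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "system:masters" }]
}
EOF
cfssl gencert -ca=/etc/kubernetes/pki/ca.pem -ca-key=/etc/kubernetes/pki/ca-key.pem \
  -config=/etc/kubernetes/pki/ca-config.json -profile=kubernetes \
  /etc/kubernetes/pki/admin-csr.json | cfssljson -bare /etc/kubernetes/pki/admin

# Create the kubeconfig file. The VIP only answers once HAProxy + Keepalived (section 10)
# are running; until then, point --server at a master IP such as 192.168.26.31.
APISERVER_VIP="192.168.26.30"

kubectl config set-cluster kubernetes --certificate-authority=/etc/kubernetes/pki/ca.pem \
  --embed-certs=true --server=https://${APISERVER_VIP}:6443 --kubeconfig=/etc/kubernetes/admin.kubeconfig

kubectl config set-credentials admin --client-certificate=/etc/kubernetes/pki/admin.pem \
  --client-key=/etc/kubernetes/pki/admin-key.pem --embed-certs=true --kubeconfig=/etc/kubernetes/admin.kubeconfig

kubectl config set-context kubernetes --cluster=kubernetes --user=admin --kubeconfig=/etc/kubernetes/admin.kubeconfig
kubectl config use-context kubernetes --kubeconfig=/etc/kubernetes/admin.kubeconfig

# Environment variable
export KUBECONFIG=/etc/kubernetes/admin.kubeconfig
echo "export KUBECONFIG=/etc/kubernetes/admin.kubeconfig" >> /etc/profile

# Create the service file (set --advertise-address to each master's own IP)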
cat > /etc/systemd/system/kube-apiserver.service << 'EOF'
[Unit]
Description=Kubernetes API Server
Documentation=https://kubernetes.io/docs/
After=network.target
Wants=etcd.service

[Service]
ExecStart=/usr/local/bin/kube-apiserver \
  --advertise-address=192.168.26.31 \
  --allow-privileged=true \
  --authorization-mode=Node,RBAC \
  --client-ca-file=/etc/kubernetes/pki/ca.pem \
  --enable-admission-plugins=NodeRestriction \
  --enable-bootstrap-token-auth=true \
  --etcd-cafile=/etc/kubernetes/pki/ca.pem \
  --etcd-certfile=/etc/kubernetes/pki/apiserver.pem \
  --etcd-keyfile=/etc/kubernetes/pki/apiserver-key.pem \
  --etcd-servers=https://192.168.26.31:2379,https://192.168.26.32:2379,https://192.168.26.33:2379 \
  --kubelet-client-certificate=/etc/kubernetes/pki/apiserver.pem \
  --kubelet-client-key=/etc/kubernetes/pki/apiserver-key.pem \
  --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname \
  --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.pem \
  --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client-key.pem \
  --requestheader-allowed-names=front-proxy-client \
  --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.pem \
  --requestheader-extra-headers-prefix=X-Remote-Extra- \
  --requestheader-group-headers=X-Remote-Group \
  --requestheader-username-headers=X-Remote-User \
  --secure-port=6443 \
  --service-account-issuer=https://kubernetes.default.svc.cluster.local \
  --service-account-key-file=/etc/kubernetes/pki/sa.pub \
  --service-account-signing-key-file=/etc/kubernetes/pki/sa.key \
  --service-cluster-ip-range=10.96.0.0/16 \
  --tls-cert-file=/etc/kubernetes/pki/apiserver.pem \
  --tls-private-key-file=/etc/kubernetes/pki/apiserver-key.pem
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable kube-apiserver --now

7.3 Deploy kube-controller-manager

cat > /etc/systemd/system/kube-controller-manager.service << 'EOF'
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://kubernetes.io/docs/
After=network.target
Wants=kube-apiserver.service

[Service]
ExecStart=/usr/local/bin/kube-controller-manager \
  --allocate-node-cidrs=true \
  --authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
  --authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
  --bind-address=127.0.0.1 \
  --cluster-cidr=10.244.0.0/16 \
  --cluster-name=kubernetes \
  --cluster-signing-cert-file=/etc/kubernetes/pki/ca.pem \
  --cluster-signing-key-file=/etc/kubernetes/pki/ca-key.pem \
  --controllers=*,bootstrapsigner,tokencleaner \
  --kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
  --leader-elect=true \
  --node-cidr-mask-size=24 \
  --requestheader-client-ca-file=/etc/kubernetes/pki/ca.pem \
  --root-ca-file=/etc/kubernetes/pki/ca.pem \
  --service-account-private-key-file=/etc/kubernetes/pki/sa.key \
  --service-cluster-ip-range=10.96.0.0/16 \
  --use-service-account-credentials=true
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable kube-controller-manager --now

7.4 Deploy kube-scheduler

cat > /etc/systemd/system/kube-scheduler.service << 'EOF'
[Unit]
Description=Kubernetes Scheduler
Documentation=https://kubernetes.io/docs/
After=network.target
Wants=kube-apiserver.service

[Service]
ExecStart=/usr/local/bin/kube-scheduler \
  --authentication-kubeconfig=/etc/kubernetes/scheduler.kubeconfig \
  --authorization-kubeconfig=/etc/kubernetes/scheduler.kubeconfig \
  --bind-address=127.0.0.1 \
  --kubeconfig=/etc/kubernetes/scheduler.kubeconfig \
  --leader-elect=true
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable kube-scheduler --now

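(The kubeconfig files for controller-manager and scheduler are generated the same way as the admin one in 7.2. Each needs its own client certificate, with CN system:kube-controller-manager and system:kube-scheduler respectively, so that the built-in RBAC bindings apply.)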
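
To make that concrete, here is a small sketch that prints the four kubectl commands for a component so you can review them before running anything. The helper name is mine, and the certificate file names kube-controller-manager.pem and kube-scheduler.pem are assumptions: the guide does not generate them, so you would first issue them with cfssl the same way as the kube-proxy certificate, using the matching system: CN.

```shell
# Print the four kubectl commands that build a component kubeconfig.
# Arguments: <file-stem> <user> <cert-stem>; paths follow this guide's layout.
# APISERVER may be set in the environment to override the VIP endpoint.
kubeconfig_cmds() (
  kc="/etc/kubernetes/$1.kubeconfig"
  pki="/etc/kubernetes/pki"
  server="${APISERVER:-https://192.168.26.30:6443}"
  echo "kubectl config set-cluster kubernetes --certificate-authority=$pki/ca.pem --embed-certs=true --server=$server --kubeconfig=$kc"
  echo "kubectl config set-credentials $2 --client-certificate=$pki/$3.pem --client-key=$pki/$3-key.pem --embed-certs=true --kubeconfig=$kc"
  echo "kubectl config set-context kubernetes --cluster=kubernetes --user=$2 --kubeconfig=$kc"
  echo "kubectl config use-context kubernetes --kubeconfig=$kc"
)

kubeconfig_cmds controller-manager system:kube-controller-manager kube-controller-manager
kubeconfig_cmds scheduler system:kube-scheduler kube-scheduler
```

Once the output looks right, pipe it into `sh` on a master to create /etc/kubernetes/controller-manager.kubeconfig and /etc/kubernetes/scheduler.kubeconfig, the paths the two service files expect.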


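8. Deploy the worker node components

8.1 Install kubelet and kube-proxy

# Run on every worker node (and on the masters too, if you want them to run workloads).
# The binaries come from kubernetes/server/bin in the tarball unpacked in 7.1; copy them to each node first.
cp kubelet kube-proxy /usr/local/bin/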

mkdir -p /var/lib/kubelet /var/lib/kube-proxy /etc/kubernetes/pki

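8.2 Configure the bootstrap token

# Generate the token on master01. The format is fixed: a 6-character ID and a
# 16-character secret, both [a-z0-9], so read 3 and 8 random bytes as hex.
TOKEN_ID=$(head -c 3 /dev/urandom | xxd -p)
TOKEN_SECRET=$(head -c 8 /dev/urandom | xxd -p)
TOKEN="${TOKEN_ID}.${TOKEN_SECRET}"

# Static token file; only read if kube-apiserver runs with --token-auth-file (not set in this guide)
echo "$TOKEN,kubelet-bootstrap,10001,\"system:kubelet-bootstrap\"" > /etc/kubernetes/token.csv

# Create the bootstrap Secret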
cat > bootstrap-secret.yaml << EOF
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-${TOKEN_ID}
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  token-id: ${TOKEN_ID}
  token-secret: ${TOKEN_SECRET}
  usage-bootstrap-authentication: "true"
  usage-bootstrap-signing: "true"
  auth-extra-groups: "system:bootstrappers:worker,system:bootstrappers:default"
EOF

kubectl apply -f bootstrap-secret.yaml

# Create the ClusterRoleBinding
kubectl create clusterrolebinding kubelet-bootstrap \
  --clusterrole=system:node-bootstrapper \
  --group=system:bootstrappers
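
Bootstrap tokens have a strict shape: a 6-character ID, a dot, and a 16-character secret, all lowercase letters or digits. A token of any other length is rejected, so it is worth validating before creating the Secret. The sketch below uses od from coreutils instead of xxd; both function names are my own.

```shell
# Return N random bytes as lowercase hex (2N characters).
rand_hex() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Succeed if $1 has the bootstrap token shape: 6 chars, a dot, 16 chars, all [a-z0-9].
is_bootstrap_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

token="$(rand_hex 3).$(rand_hex 8)"
is_bootstrap_token "$token" && echo "token shape ok"
```

Note the byte counts: 3 random bytes give the 6 hex characters of the ID and 8 bytes give the 16 characters of the secret.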

8.3 Configure kubelet

# Generate bootstrap.kubeconfig (uses ${TOKEN} from 8.2, so run it in the same shell)
kubectl config set-cluster kubernetes --certificate-authority=/etc/kubernetes/pki/ca.pem \
  --embed-certs=true --server=https://192.168.26.30:6443 --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig

kubectl config set-credentials kubelet-bootstrap --token=${TOKEN} --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig

kubectl config set-context kubernetes --cluster=kubernetes --user=kubelet-bootstrap --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig
kubectl config use-context kubernetes --kubeconfig=/etc/kubernetes/bootstrap.kubeconfig

# Create the kubelet configuration
cat > /var/lib/kubelet/kubelet-config.yaml << EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: 0.0.0.0
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
authorization:
  mode: Webhook
cgroupDriver: systemd
clusterDNS:
  - 10.96.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: unix:///run/containerd/containerd.sock
hairpinMode: hairpin-veth
readOnlyPort: 0
serializeImagePulls: false
tlsCertFile: /var/lib/kubelet/pki/kubelet.crt
tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
EOF

# systemd service (set --node-ip to each node's own IP)
cat > /etc/systemd/system/kubelet.service << 'EOF'
[Unit]
Description=Kubernetes Kubelet
Documentation=https://kubernetes.io/docs/
After=containerd.service
Requires=containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet \
  --bootstrap-kubeconfig=/etc/kubernetes/bootstrap.kubeconfig \
  --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
  --config=/var/lib/kubelet/kubelet-config.yaml \
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock \
  --node-ip=192.168.26.34 \
  --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.9 \
  --v=2
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable kubelet --now

8.4 Configure kube-proxy

# Generate the kubeconfig for kube-proxy
kubectl config set-cluster kubernetes --certificate-authority=/etc/kubernetes/pki/ca.pem \
  --embed-certs=true --server=https://192.168.26.30:6443 --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig

kubectl config set-credentials kube-proxy --client-certificate=/etc/kubernetes/pki/kube-proxy.pem \
  --client-key=/etc/kubernetes/pki/kube-proxy-key.pem --embed-certs=true --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig

kubectl config set-context kubernetes --cluster=kubernetes --user=kube-proxy --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig
kubectl config use-context kubernetes --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig

# kube-proxy service (set --hostname-override to each node's own hostname)
cat > /etc/systemd/system/kube-proxy.service << 'EOF'
[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://kubernetes.io/docs/
After=network.target

[Service]
ExecStart=/usr/local/bin/kube-proxy \
  --cluster-cidr=10.244.0.0/16 \
  --hostname-override=k8s-node01 \
  --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig \
  --proxy-mode=ipvs \
  --v=2
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable kube-proxy --now

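8.5 Approve the node CSRs

On first start, kubelet automatically submits a CSR, which has to be approved on a master:

# List CSRs awaiting approval
kubectl get csr

# Approve in bulk (xargs -r skips the call when nothing is pending)
kubectl get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs -r kubectl certificate approve

One pitfall here: if a node never shows up in kubectl get nodes, nine times out of ten its CSR is still Pending. The first time I did this I nearly concluded kubelet was broken, when all I had done was forget to run this command.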
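
If the go-template feels opaque, the same selection can be done on the plain table output, where the last column (CONDITION) reads Pending for requests that still need approval. A minimal sketch, with a helper name of my own:

```shell
# Read `kubectl get csr --no-headers` output on stdin and print the names
# of requests whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}
```

Usage would be `kubectl get csr --no-headers | pending_csrs | xargs -r kubectl certificate approve`. Requests already marked Approved,Issued or Denied are left alone.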


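9. Deploy the CNI network plugin (Calico)

Without a CNI plugin, Pods cannot talk to each other and the nodes stay NotReady.

# Download the Calico manifest
wget https://raw.githubusercontent.com/projectcalico/calico/v3.29.1/manifests/calico.yaml

# Set the Pod CIDR. The manifest ships CALICO_IPV4POOL_CIDR commented out with
# 192.168.0.0/16, which would overlap the node network here, so uncomment it and set the value.
sed -i -e 's|# - name: CALICO_IPV4POOL_CIDR|- name: CALICO_IPV4POOL_CIDR|' \
       -e 's|#   value: "192.168.0.0/16"|  value: "10.244.0.0/16"|' calico.yaml
grep -A1 'CALICO_IPV4POOL_CIDR' calico.yaml   # confirm both lines are uncommented and aligned

kubectl apply -f calico.yaml

# Wait until all Pods are Running
kubectl get pods -n kube-system -w

Once Calico is deployed, all nodes should turn Ready.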


10. Configure API Server high availability

10.1 Install HAProxy

Run on the LB nodes:

apt install -y haproxy

cat > /etc/haproxy/haproxy.cfg << 'EOF'
global
    log /dev/log local0
    maxconn 4096
    user haproxy
    group haproxy

defaults
    log global
    mode tcp
    option tcplog
    retries 3
    timeout connect 10s
    timeout client 60s
    timeout server 60s

frontend kubernetes-apiserver
    bind *:6443
    mode tcp
    default_backend kubernetes-apiserver-backend

backend kubernetes-apiserver-backend
    mode tcp
    balance roundrobin
    server master01 192.168.26.31:6443 check fall 3 rise 2
    server master02 192.168.26.32:6443 check fall 3 rise 2
    server master03 192.168.26.33:6443 check fall 3 rise 2
EOF

systemctl enable haproxy --now

10.2 Install Keepalived

apt install -y keepalived

cat > /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
    router_id LVS_DEVEL
}

vrrp_script check_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 3
    # must exceed the priority gap between the two LB nodes (100 vs 90) to trigger failover
    weight -20
    fall 10
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 12345678
    }
    virtual_ipaddress {
        192.168.26.30/24
    }
    track_script {
        check_haproxy
    }
}
EOF

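# On the second LB node, set state to BACKUP and priority to 90.
# eth0 is this lab's NIC name; check yours with `ip -br link`.
systemctl enable keepalived --now

# Verify the VIP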
ip addr show eth0 | grep 192.168.26.30
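
Whether the VIP actually moves when HAProxy dies depends on the tracked script's weight relative to the priority gap between the two LB nodes (100 and 90 here). With a negative weight, Keepalived subtracts it from the node's priority while the check is failing, and the backup takes over only if the result drops below its own priority. The arithmetic, as a small sketch with function names of my own:

```shell
# Effective VRRP priority of a node: a negative script weight is added to the
# base priority only while the tracked script is failing.
# $1 = base priority, $2 = (negative) weight, $3 = "ok" or "fail"
effective_priority() {
  if [ "$3" = fail ]; then echo $(( $1 + $2 )); else echo "$1"; fi
}

# Succeed if the backup would take over the VIP when the master's check fails.
# $1 = master priority, $2 = backup priority, $3 = weight
fails_over() {
  [ "$(effective_priority "$1" "$3" fail)" -lt "$2" ]
}

for w in -2 -20; do
  if fails_over 100 90 "$w"; then
    echo "weight $w: backup takes over"
  else
    echo "weight $w: master keeps the VIP"
  fi
done
```

A weight of -2 leaves the failing master at 98, still above the backup's 90, so nothing moves; -20 drops it to 80 and the VIP fails over.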

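11. Verify the cluster

# Check node status
kubectl get nodes -o wide

# Check all system Pods
kubectl get pods -n kube-system

# Check component health (componentstatuses has been deprecated since v1.19; readyz is the current endpoint)
kubectl get cs
kubectl get --raw='/readyz?verbose'

Expected output: every node is Ready and every system Pod is Running (a few Pending Pods can be ignored for now).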


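12. Common problems and fixes

Problem 1: kubelet fails to start with an x509 certificate error

Check that the certificate's SAN field contains every IP and DNS name it needs. The VIP in particular is the one people most often forget.

Problem 2: Node stays NotReady

kubectl describe node <node-name>

If the cause is a missing network plugin, go back and deploy Calico.

Problem 3: Pods cannot reach each other

Check that Calico is healthy: kubectl get pods -n kube-system | grep calico. If the Pods are not Running, re-apply the Calico manifest.

Problem 4: kubelet fails at startup complaining about the format of --node-labels

Replace --node-labels=node.kubernetes.io/node='' with --node-labels=node.kubernetes.io/node= (drop the empty quotes).

Problem 5: The etcd cluster will not come up and reports peer communication failures

Check that port 2380 is open in the firewall and that the certificates are configured correctly.

Problem 6: containerd fails to start

Confirm that the overlay kernel module is loaded: lsmod | grep overlay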


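13. A few last words

That's it: a production-grade, highly available Kubernetes cluster is now deployed. The process touches certificate management, the etcd cluster, the master components, the worker nodes, the network plugin and load balancing, and a problem in any one of them can keep the cluster from working.

I put together a quick troubleshooting cheat sheet that you may want to bookmark:

| Symptom | Quick check |
|-------------------------|-------------------------------------------|
| kube-apiserver fails to start | journalctl -u kube-apiserver -xe |
| etcd cluster unhealthy | etcdctl endpoint health |
| kubelet unresponsive | systemctl status kubelet + journalctl |
| Node NotReady | Check whether the CNI is deployed |
| Pod stuck in ContainerCreating | Check containerd + CNI |
| x509 errors | Check that the certificate SANs are complete |

If you followed the guide all the way to this step, congratulations: your understanding of Kubernetes internals is already ahead of a good many "kubeadm engineers".

Have you hit any bizarre pitfalls of your own? Share them in the comments; they may well help whoever comes next.

See you next time, fellow ops folks.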

nginx·kubernetes·prometheus