This article walks through building a production-grade Kubernetes cluster step by step, covering CI/CD pipeline setup, monitoring deployment, and troubleshooting, with runnable commands and configuration files.
Intended audience: operations engineers, DevOps engineers, and anyone who wants to build a K8s cluster hands-on
Prerequisite reading: the companion "Architecture Design" article is recommended for the overall architecture and technology choices
1. Environment Preparation
1.0 Network Planning
Network planning is the first step of the deployment; a sound network layout ensures security isolation and efficient communication.
1.0.1 VPC and Subnet Planning
A three-tier subnet layout provides network isolation:
VPC network: 10.0.0.0/16
├── entry subnet: 10.0.10.0/24 (public entry layer)
├── middleware subnet: 10.0.20.0/24 (middleware layer)
└── k8s subnet: 10.0.30.0/24 (application layer)
| Subnet | CIDR | Purpose | Servers |
|---|---|---|---|
| entry subnet | 10.0.10.0/24 | Public entry, ops management | entry-01, jumpserver |
| middleware subnet | 10.0.20.0/24 | Middleware services | middleware-01 |
| k8s subnet | 10.0.30.0/24 | K8s cluster | master-01~03, node-01~02 |
K8s internal network planning:
| Network | CIDR | Purpose |
|---|---|---|
| Pod CIDR | 172.16.0.0/16 | Pod IP allocation (managed by Calico) |
| Service CIDR | 10.96.0.0/12 | Service ClusterIP |
1.0.2 Security Group Configuration
Entry subnet security group (public entry):
| Direction | Port | Source/Destination | Purpose |
|---|---|---|---|
| Inbound | 80, 443 | 0.0.0.0/0 | Web access |
| Inbound | 10022 | Ops IP allowlist | SSH management (non-standard port) |
| Outbound | ALL | 0.0.0.0/0 | Allow all outbound |
K8s subnet security group:
| Direction | Port | Source/Destination | Purpose |
|---|---|---|---|
| Inbound | 6443 | entry subnet | K8s API Server |
| Inbound | 30080, 30443 | entry subnet | Ingress NodePort |
| Inbound | ALL | within k8s subnet | Intra-cluster communication |
| Inbound | ALL | 172.16.0.0/16 | Pod network communication |
| Outbound | ALL | 0.0.0.0/0 | Allow all outbound |
Middleware subnet security group:
| Direction | Port | Source/Destination | Purpose |
|---|---|---|---|
| Inbound | 3306, 6379, 8848, etc. | k8s subnet | Middleware service ports |
| Inbound | 10022 | entry subnet | SSH management |
| Outbound | ALL | 0.0.0.0/0 | Allow all outbound |
1.0.3 Inter-Server Access Rules
mermaid
graph LR
  subgraph entry_subnet[entry subnet]
    Entry[Entry node]
    Jump[JumpServer]
  end
  subgraph middleware_subnet[middleware subnet]
    MW[Middleware]
  end
  subgraph k8s_subnet[k8s subnet]
    Master[K8s Master]
    Worker[K8s Worker]
  end
  Internet((Internet)) --> |80/443| Entry
  Entry --> |6443| Master
  Entry --> |30080| Worker
  Jump --> |10022| Master
  Jump --> |10022| MW
  Worker --> |3306/6379/8848| MW
  Master --> |3306/6379/8848| MW
1.1 Server Inventory
| Role | Hostname | Example IP | Spec | Notes |
|---|---|---|---|---|
| Entry | entry-01 | 10.0.10.10 | 2C/4G | Nginx + Squid proxy |
| Middleware | middleware-01 | 10.0.20.10 | 8C/32G | MySQL, Redis, etc. |
| K8s Master | k8s-master-01 | 10.0.30.10 | 4C/8G | Control plane |
| K8s Master | k8s-master-02 | 10.0.30.11 | 4C/8G | Control plane |
| K8s Master | k8s-master-03 | 10.0.30.12 | 4C/8G | Control plane |
| K8s Worker | k8s-node-01 | 10.0.30.20 | 8C/32G | Worker node |
| K8s Worker | k8s-node-02 | 10.0.30.21 | 8C/32G | Worker node |
| JumpServer | jumpserver | 10.0.10.20 | 4C/8G | Bastion host |
1.2 Infrastructure Initialization
Before deploying K8s, complete the basic infrastructure configuration.
1.2.1 Base Server Configuration
Run on all servers:
bash
#!/bin/bash
# Base server configuration script
# 1. Set the hostname (adjust per server role)
HOSTNAME="k8s-master-01"
hostnamectl set-hostname $HOSTNAME
echo "127.0.0.1 $HOSTNAME" >> /etc/hosts
# 2. Time zone
timedatectl set-timezone Asia/Shanghai
timedatectl set-ntp yes
# 3. Kernel parameter tuning
cat > /etc/sysctl.d/local.conf << EOF
# File descriptors
fs.file-max = 512000
# TCP tuning
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_max_syn_backlog = 4096
# Enable BBR congestion control
net.ipv4.tcp_congestion_control = bbr
# Disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
EOF
sysctl -p /etc/sysctl.d/local.conf
# 4. System resource limits
cat > /etc/security/limits.conf << EOF
* hard nofile 512000
* soft nofile 512000
root hard nofile 512000
root soft nofile 512000
EOF
# 5. SSH hardening (port 10022 matches the security-group rules and fail2ban config)
cat > /etc/ssh/sshd_config << EOF
Include /etc/ssh/sshd_config.d/*.conf
Port 10022
PermitRootLogin prohibit-password
PubkeyAuthentication yes
PasswordAuthentication no
ClientAliveInterval 60
ClientAliveCountMax 5
EOF
systemctl restart sshd
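Because this overwrites sshd_config, disables password logins, and moves SSH off port 22, it is worth validating the file and confirming a key is installed before restarting; a small sanity check (keep the current session open until a fresh login on port 10022 succeeds):
bash
# Validate the new sshd configuration before restarting the daemon
sshd -t && echo "sshd config OK"
# Confirm a public key is installed for root (password logins are disabled above)
test -s /root/.ssh/authorized_keys && echo "authorized_keys present"
# After the restart, confirm the new port is listening
ss -lntp | grep sshd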
1.2.2 Entry Server Deployment
Run on the Entry node:
bash
#!/bin/bash
# Entry server Nginx setup
# 1. Install Nginx
apt-get update
apt-get install -y nginx
# 2. Configure Nginx (the stream module is used for TCP load balancing)
cat > /etc/nginx/nginx.conf << EOF
user www-data;
worker_processes auto;
pid /run/nginx.pid;
events {
worker_connections 20480;
multi_accept on;
}
# TCP load balancing (for the K8s API Server)
stream {
include /data/nginx/stream-sites-enabled/*;
}
http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
client_max_body_size 0;
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '[\$time_local] \$remote_addr -> '
'"\$request" \$status \$body_bytes_sent '
'"\$http_user_agent" \$request_time';
access_log /data/nginx/logs/access.log main;
error_log /data/nginx/logs/error.log;
gzip on;
gzip_types text/plain text/css application/json application/javascript;
include /data/nginx/sites-enabled/*;
}
EOF
# 3. Create the directory layout
mkdir -p /data/nginx/{stream-sites-enabled,logs,sites-enabled,conf.d}
chown -R www-data:www-data /data/nginx
K8s API Server load-balancing configuration:
bash
# K8s API Server TCP load balancing (port 6443)
cat > /data/nginx/stream-sites-enabled/k8s-apiserver.conf << EOF
upstream k8s-apiserver {
server 10.0.30.10:6443 max_fails=3 fail_timeout=30s;
server 10.0.30.11:6443 max_fails=3 fail_timeout=30s;
server 10.0.30.12:6443 max_fails=3 fail_timeout=30s;
}
server {
listen 6443;
proxy_pass k8s-apiserver;
proxy_timeout 10m; # long-lived API connections (kubectl watch/exec) need far more than a few seconds
proxy_connect_timeout 1s;
}
EOF
K8s Ingress NodePort load-balancing configuration:
bash
# HTTP load balancing across the Ingress NodePorts
cat > /data/nginx/conf.d/k8s-ingress.conf << EOF
upstream ingress_nodes {
server 10.0.30.20:30080;
server 10.0.30.21:30080;
}
EOF
# Example application site configuration
cat > /data/nginx/sites-enabled/app.conf << EOF
server {
listen 80;
server_name app.example.com;
location / {
proxy_pass http://ingress_nodes;
proxy_set_header Host \$host;
proxy_set_header X-Real-IP \$remote_addr;
proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
}
}
EOF
systemctl reload nginx
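Before and after reloading, it helps to confirm that the configuration parses and that the expected listeners are up; a quick check using the ports configured above:
bash
# Syntax-check the nginx configuration
nginx -t
# Confirm the HTTP (80) and K8s API (6443) listeners are active
ss -lntp | grep -E ':80 |:6443 '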
1.2.3 Bastion Host Deployment
Run on the JumpServer node:
bash
#!/bin/bash
# One-step JumpServer deployment
# Deploy quickly with the official script
curl -sSL https://resource.fit2cloud.com/jumpserver/jumpserver/releases/download/v3.10.17/quick_start.sh | bash
# Adjust the configuration (optional)
# vim /opt/jumpserver/config/config.txt
# Common options:
# - HTTP_PORT=80
# - HTTPS_PORT=443
# - DOMAINS="jumpserver.example.com"
# Restart the services
cd /opt/jumpserver-installer-v3.10.17
./jmsctl.sh restart
JumpServer access:
- Default address: http://<JumpServer-IP>:80
- Default username: admin
- Default password: admin (must be changed on first login)
1.2.4 Installing the Docker Engine
Run on the Middleware node (used to run the middleware containers):
bash
#!/bin/bash
# Docker engine installation and configuration
# 1. Install dependencies
apt-get update
apt-get install -y ca-certificates curl gnupg lsb-release
# 2. Add Docker's official GPG key (via the Aliyun mirror)
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
# 3. Add the Docker repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] \
https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
tee /etc/apt/sources.list.d/docker.list
# 4. Install Docker
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
# 5. Configure Docker
# Note: daemon.json is plain JSON and must not contain comments.
# "registry-mirrors" configures an optional pull-through mirror; "data-root" moves the Docker data directory.
mkdir -p /data/docker
cat > /etc/docker/daemon.json << EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  },
  "registry-mirrors": [
    "https://docker.m.daocloud.io"
  ],
  "data-root": "/data/docker"
}
EOF
# 6. Start Docker
systemctl enable docker
systemctl restart docker
# 7. Verify the installation
docker info
docker compose version
1.2.5 Security Hardening
Run on all servers:
bash
#!/bin/bash
# Server security hardening
# 1. Install fail2ban to block brute-force attempts
apt-get install -y fail2ban
# 2. Configure fail2ban (the SSH port matches the sshd configuration above)
cat > /etc/fail2ban/jail.local << EOF
[DEFAULT]
ignoreip = 127.0.0.1/8 ::1
bantime = 3600
maxretry = 3
findtime = 600
banaction = iptables-multiport
[sshd]
enabled = true
port = 10022
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
EOF
# 3. Start fail2ban
systemctl enable fail2ban
systemctl restart fail2ban
# 4. Check status
fail2ban-client status sshd
1.3 Operating System Optimization
All nodes run Ubuntu Server 22.04.
1.3.1 Disable Swap
K8s requires swap to be disabled, otherwise the kubelet will not work properly:
bash
# Disable immediately
swapoff -a
# Disable permanently: remove the swap line from fstab
sed -i '/swap/d' /etc/fstab
1.3.2 Load Kernel Modules
bash
cat > /etc/modules-load.d/k8s.conf << EOF
# overlay: OverlayFS; br_netfilter: bridge traffic filtering
overlay
br_netfilter
EOF
modprobe overlay
modprobe br_netfilter
# Verify
lsmod | grep -E "overlay|br_netfilter"
1.3.3 Configure Kernel Parameters
bash
cat > /etc/sysctl.d/k8s.conf << EOF
# Required by K8s
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
# Connection-tracking tuning
net.netfilter.nf_conntrack_max = 524288
# TCP tuning
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.core.somaxconn = 32768
# File descriptors
fs.file-max = 2097152
EOF
sysctl --system
1.3.4 Configure System Resource Limits
bash
cat >> /etc/security/limits.conf << EOF
# Kubernetes resource limits
* soft nofile 655360
* hard nofile 655360
* soft nproc 655360
* hard nproc 655360
EOF
1.4 Install containerd
1.4.1 Installation
bash
# Add the Docker apt repository (the containerd.io package ships from it)
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
tee /etc/apt/sources.list.d/docker.list
apt-get update
apt-get install -y containerd.io
1.4.2 Configure containerd
bash
mkdir -p /etc/containerd
cat > /etc/containerd/config.toml << 'EOF'
version = 2
[plugins."io.containerd.grpc.v1.cri"]
# Use a domestic mirror for the sandbox (pause) image
sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true # use systemd as the cgroup driver
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker.m.daocloud.io"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
endpoint = ["https://k8s.m.daocloud.io"]
EOF
systemctl daemon-reload
systemctl restart containerd
systemctl enable containerd
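Once crictl is available (it is pulled in via cri-tools together with the kubeadm packages in section 1.5), it can confirm that the CRI endpoint and cgroup settings took effect; a minimal sketch:
bash
# Point crictl at containerd's CRI socket
cat > /etc/crictl.yaml << EOF
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
EOF
# Check the effective CRI config and test a pull through the configured mirror
crictl info | grep -i systemdcgroup
crictl pull registry.aliyuncs.com/google_containers/pause:3.9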
1.5 Install the Kubernetes Components
Run on all K8s nodes:
bash
# Add the Aliyun Kubernetes apt repository
curl -fsSL https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | \
gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] \
https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main' | \
tee /etc/apt/sources.list.d/kubernetes.list
# Install a pinned version (check the versions the repository actually offers with: apt-cache madison kubeadm)
apt-get update
apt-get install -y kubelet=1.28.6-1.1 kubeadm=1.28.6-1.1 kubectl=1.28.6-1.1
# Hold the versions to prevent accidental upgrades
apt-mark hold kubelet kubeadm kubectl
# Enable kubelet
systemctl enable kubelet
1.6 Configure Time Synchronization
bash
apt-get install -y chrony
cat > /etc/chrony/chrony.conf << 'EOF'
server ntp.aliyun.com iburst
server ntp.tencent.com iburst
driftfile /var/lib/chrony/chrony.drift
makestep 1.0 3
rtcsync
EOF
systemctl restart chrony
systemctl enable chrony
1.7 Configure the hosts File
Add on all nodes:
bash
cat >> /etc/hosts << EOF
10.0.30.10 k8s-master-01
10.0.30.11 k8s-master-02
10.0.30.12 k8s-master-03
10.0.30.20 k8s-node-01
10.0.30.21 k8s-node-02
10.0.10.10 k8s-api-lb
EOF
2. Middleware Deployment
2.1 Storage Planning
The Middleware node should mount separate data disks:
bash
# Hot data disk (SSD): MySQL, Redis
mkdir -p /data/hot
mount /dev/vdb1 /data/hot
# Cold data disk (HDD): Elasticsearch, MinIO
mkdir -p /data/cold
mount /dev/vdc1 /data/cold
# Add to fstab for automatic mounting
echo '/dev/vdb1 /data/hot ext4 defaults 0 0' >> /etc/fstab
echo '/dev/vdc1 /data/cold ext4 defaults 0 0' >> /etc/fstab
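Device names such as /dev/vdb1 can change after reboots or disk re-attachment, so mounting by UUID is more robust; a sketch of the same fstab entries keyed by UUID (substitute the values printed by blkid on your machine):
bash
# Look up the filesystem UUIDs
blkid /dev/vdb1 /dev/vdc1
# Equivalent fstab entries using UUIDs; "nofail" keeps the server bootable if a data disk is missing
cat >> /etc/fstab << EOF
UUID=<vdb1-uuid> /data/hot  ext4 defaults,nofail 0 0
UUID=<vdc1-uuid> /data/cold ext4 defaults,nofail 0 0
EOF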
2.2 Docker Compose Configuration
yaml
# docker-compose.yml
version: '3'
services:
mysql:
image: mysql:8.0
restart: always
ports:
- 3306:3306
volumes:
- /data/hot/mysql:/var/lib/mysql
- ./config/my.cnf:/etc/mysql/conf.d/my.cnf
environment:
MYSQL_ROOT_PASSWORD: ${MYSQL_PASSWORD}
TZ: Asia/Shanghai
networks:
- middleware
redis:
image: redis:7.2
restart: always
ports:
- 6379:6379
volumes:
- /data/hot/redis:/data
command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes
networks:
- middleware
nacos:
image: nacos/nacos-server:v2.3.2
restart: always
depends_on:
- mysql
environment:
MODE: standalone
NACOS_AUTH_ENABLE: "true"
SPRING_DATASOURCE_PLATFORM: mysql
MYSQL_SERVICE_HOST: mysql
MYSQL_SERVICE_DB_NAME: nacos
MYSQL_SERVICE_USER: root
MYSQL_SERVICE_PASSWORD: ${MYSQL_PASSWORD}
ports:
- 8848:8848
- 9848:9848
networks:
- middleware
rabbitmq:
image: rabbitmq:3.12-management
restart: always
ports:
- 5672:5672
- 15672:15672
environment:
RABBITMQ_DEFAULT_USER: admin
RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
volumes:
- /data/hot/rabbitmq:/var/lib/rabbitmq
networks:
- middleware
elasticsearch:
image: elasticsearch:7.17.19
restart: always
volumes:
- /data/cold/elasticsearch:/usr/share/elasticsearch/data
environment:
discovery.type: single-node
ES_JAVA_OPTS: -Xms2g -Xmx2g
ports:
- 9200:9200
networks:
- middleware
networks:
middleware:
driver: bridge
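One host-level prerequisite worth calling out: Elasticsearch's official Docker guidance asks for vm.max_map_count of at least 262144, which is higher than the Ubuntu default; a small sketch:
bash
# Raise the mmap count limit required by Elasticsearch
cat > /etc/sysctl.d/elasticsearch.conf << EOF
vm.max_map_count = 262144
EOF
sysctl -p /etc/sysctl.d/elasticsearch.conf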
2.3 MySQL Tuning Configuration
ini
# config/my.cnf
[mysqld]
# Connections
max_connections = 1000
# Buffer pool size (recommended: 50-70% of physical memory)
innodb_buffer_pool_size = 16G
# Log settings
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2
# Character set
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
# Time zone
default-time-zone = '+08:00'
2.4 Start the Middleware
bash
# Create the environment file
cat > .env << EOF
MYSQL_PASSWORD=YourStrongPassword123
REDIS_PASSWORD=YourStrongPassword456
RABBITMQ_PASSWORD=YourStrongPassword789
EOF
# Start the stack
docker compose up -d
# Check status
docker compose ps
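Beyond docker compose ps, a few quick end-to-end checks confirm that each service actually answers (the passwords are the ones set in .env):
bash
# MySQL responds to ping
docker compose exec mysql mysqladmin ping -uroot -p'YourStrongPassword123'
# Redis answers PONG
docker compose exec redis redis-cli -a 'YourStrongPassword456' ping
# Elasticsearch cluster health
curl -s http://127.0.0.1:9200/_cluster/health
# Nacos console responds
curl -sI http://127.0.0.1:8848/nacos/ | head -1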
3. Kubernetes Cluster Setup
3.1 Initialize the First Master Node
3.1.1 Generate the kubeadm Configuration File
yaml
# kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.6
controlPlaneEndpoint: "10.0.10.10:6443" # load-balancer address on the Entry node
networking:
podSubnet: "172.16.0.0/16" # Pod CIDR
serviceSubnet: "10.96.0.0/12" # Service CIDR
imageRepository: registry.aliyuncs.com/google_containers
apiServer:
certSANs:
- "10.0.10.10"
- "10.0.30.10"
- "10.0.30.11"
- "10.0.30.12"
- "k8s-api-lb"
- "127.0.0.1"
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
strictARP: true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
3.1.2 Run the Initialization
bash
# Pull the images
kubeadm config images pull --config=kubeadm-config.yaml
# Initialize the cluster
kubeadm init --config=kubeadm-config.yaml --upload-certs | tee kubeadm-init.log
# Configure kubectl
mkdir -p $HOME/.kube
cp /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
# Verify
kubectl cluster-info
kubectl get nodes
Important: save the join commands from the output; they contain the token and the certificate-key.
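If the join command is lost or the token (valid for 24 hours by default) expires, both can be regenerated at any time on an existing master; for example:
bash
# Print a fresh worker join command (creates a new token)
kubeadm token create --print-join-command
# Re-upload the control-plane certificates and print a new certificate-key
# (append it as --certificate-key when joining additional masters)
kubeadm init phase upload-certs --upload-certs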
3.2 Join the Remaining Master Nodes
Run on k8s-master-02 and k8s-master-03:
bash
kubeadm join 10.0.10.10:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane \
--certificate-key <certificate-key>
# Configure kubectl
mkdir -p $HOME/.kube
cp /etc/kubernetes/admin.conf $HOME/.kube/config
3.3 Join the Worker Nodes
Run on k8s-node-01 and k8s-node-02:
bash
kubeadm join 10.0.10.10:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash>
3.4 Install the Calico Network Plugin
bash
# Deploy the Tigera Operator
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/tigera-operator.yaml
# Wait for the operator to become ready
kubectl wait --namespace tigera-operator \
--for=condition=ready pod \
--selector=name=tigera-operator \
--timeout=90s
# Create the custom resource configuration
cat > calico-custom-resources.yaml << EOF
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
name: default
spec:
calicoNetwork:
ipPools:
- blockSize: 26
cidr: 172.16.0.0/16
encapsulation: VXLANCrossSubnet # routed (BGP) within a subnet, VXLAN across subnets
natOutgoing: Enabled
nodeSelector: all()
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
name: default
spec: {}
EOF
kubectl create -f calico-custom-resources.yaml
# Wait for all nodes to become Ready
kubectl get nodes -w
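With the Tigera operator, Calico's rollout can also be verified through its status objects and the calico-system namespace; for example:
bash
# Calico components report Available=True once the rollout is complete
kubectl get tigerastatus
# All calico-system pods should be Running
kubectl get pods -n calico-system -o wide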
3.5 Deploy Traefik Ingress
3.5.1 Install Helm
bash
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
3.5.2 Deploy Traefik
bash
# Add the Helm repository
helm repo add traefik https://traefik.github.io/charts
helm repo update
# Create the values file
cat > traefik-values.yaml << 'EOF'
deployment:
kind: DaemonSet
image:
tag: "v3.2"
ingressClass:
enabled: true
isDefaultClass: true
ports:
web:
port: 8000
exposedPort: 80
nodePort: 30080
websecure:
port: 8443
exposedPort: 443
nodePort: 30443
service:
type: NodePort
logs:
general:
level: INFO
access:
enabled: true
ingressRoute:
dashboard:
enabled: true
matchRule: Host(`traefik.example.com`)
entryPoints: ["web"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "1000m"
memory: "512Mi"
EOF
# Install
helm install traefik traefik/traefik \
--namespace traefik \
--create-namespace \
--values traefik-values.yaml
# Verify
kubectl get pods -n traefik
kubectl get svc -n traefik
3.6 Validate the Cluster
bash
# Check node status
kubectl get nodes -o wide
# Check system Pods
kubectl get pods -A
# Create a test application
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=NodePort
kubectl get svc nginx
# Test access
curl http://<node-ip>:<node-port>
# Clean up the test resources
kubectl delete deployment nginx
kubectl delete svc nginx
4. CI/CD Pipeline Setup
4.1 Overall CI/CD Architecture
mermaid
graph LR
  A[Developer pushes code] --> B[GitLab triggers CI]
  B --> C[Build stage: compile]
  C --> D[Package stage: Docker image]
  D --> E[Push to Harbor]
  E --> F[Deploy to the test environment]
  F -->|manual trigger| G[Deploy to production]
  G --> H[K8s rolling update]
4.2 Harbor Private Image Registry
4.2.1 Install Harbor
bash
# Download the offline installer
wget https://github.com/goharbor/harbor/releases/download/v2.12.0/harbor-offline-installer-v2.12.0.tgz
tar xvf harbor-offline-installer-v2.12.0.tgz
cd harbor
# Adjust the configuration
cp harbor.yml.tmpl harbor.yml
yaml
# harbor.yml key settings
hostname: harbor.example.com
https:
port: 443
certificate: /data/cert/server.crt
private_key: /data/cert/server.key
harbor_admin_password: Harbor12345
data_volume: /data/cold/harbor
bash
# Install Harbor
./install.sh
# Enable start on boot
cat > /etc/systemd/system/harbor.service << EOF
[Unit]
Description=Harbor
After=docker.service
Requires=docker.service
[Service]
Type=simple
Restart=on-failure
WorkingDirectory=/root/harbor
ExecStart=/usr/bin/docker compose up
ExecStop=/usr/bin/docker compose down
[Install]
WantedBy=multi-user.target
EOF
systemctl enable harbor
4.2.2 Configure the K8s Pull Credential
The secret is namespaced, so create it in every namespace that pulls images from Harbor (the deployments below live in production):
bash
kubectl create secret docker-registry harbor-secret \
--docker-server=harbor.example.com \
--docker-username=admin \
--docker-password=Harbor12345 \
--namespace=production
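Instead of listing imagePullSecrets in every Deployment, the secret can also be attached to the namespace's default ServiceAccount so that all Pods in that namespace inherit it; a sketch:
bash
kubectl patch serviceaccount default -n production \
  -p '{"imagePullSecrets": [{"name": "harbor-secret"}]}'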
4.3 GitLab CI Template Design
4.3.1 Template Project Layout
devops/ci-templates/
├── build/
│ ├── java.build.gitlab-ci.yml
│ └── node.build.gitlab-ci.yml
├── deploy/
│ ├── java.deploy.gitlab-ci.yml
│ └── node.deploy.gitlab-ci.yml
└── rules/
└── changes.gitlab-ci.yml
4.3.2 Java Build Template
yaml
# build/java.build.gitlab-ci.yml
variables:
REGISTRY_ADDRESS: harbor.example.com
REGISTRY_SECRET: harbor-secret
stages:
- build
- package
- deploy_qa
- deploy_prod
.build:
stage: build
image: maven:3.9-eclipse-temurin-17
script:
- mvn clean package -DskipTests
artifacts:
paths:
- "**/target/*.jar"
expire_in: 1 hrs
tags:
- java-build
.package:
stage: package
image: docker:20.10-dind
services:
- docker:20.10-dind
before_script:
- docker login -u $HARBOR_USER -p $HARBOR_PASSWORD $REGISTRY_ADDRESS
script:
- docker build -t ${REGISTRY_ADDRESS}/${CI_PROJECT_PATH}/${MODULE_NAME}:${CI_COMMIT_SHA:0:8} .
- docker push ${REGISTRY_ADDRESS}/${CI_PROJECT_PATH}/${MODULE_NAME}:${CI_COMMIT_SHA:0:8}
tags:
- docker
.deploy:
stage: deploy_prod
image: bitnami/kubectl:1.28
when: manual
script:
- |
kubectl set image deployment/${MODULE_NAME} \
${MODULE_NAME}=${REGISTRY_ADDRESS}/${CI_PROJECT_PATH}/${MODULE_NAME}:${CI_COMMIT_SHA:0:8} \
-n ${NAMESPACE}
- kubectl rollout status deployment/${MODULE_NAME} -n ${NAMESPACE} --timeout=300s
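The .package job assumes a Dockerfile at the build root, which is not shown above; a minimal sketch for a Spring Boot fat jar is below (the base image and jar path are illustrative and should match your module layout):
bash
# Illustrative Dockerfile for a Spring Boot service (adjust paths to your module)
cat > Dockerfile << 'EOF'
FROM eclipse-temurin:17-jre
WORKDIR /app
# The build stage produces the jar under target/ and hands it over via artifacts
COPY target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar app.jar"]
EOF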
4.3.3 Per-Microservice CI Configuration
yaml
# .gitlab-ci.yml
include:
- project: 'devops/ci-templates'
file: '/build/java.build.gitlab-ci.yml'
- project: 'devops/ci-templates'
file: '/deploy/java.deploy.gitlab-ci.yml'
variables:
MODULE_NAME: order-service
MODULE_PORT: 8080
NAMESPACE: production
build:
extends: .build
package:
extends: .package
needs:
- build
deploy_prod:
extends: .deploy
variables:
REPLICAS: "2"
needs:
- package
4.4 Change-Based Triggering
A service is only built when its own code changes:
yaml
# rules/changes.gitlab-ci.yml
.order_service_changes:
rules:
- changes:
- service/order-service/**/*
- pom.xml
- .gitlab-ci.yml
.user_service_changes:
rules:
- changes:
- service/user-service/**/*
- pom.xml
- .gitlab-ci.yml
Used in the microservice CI configuration:
yaml
build_order:
extends:
- .build
- .order_service_changes
variables:
MODULE_NAME: order-service
5. Application Deployment in Practice
5.1 Java Application Deployment Manifest
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
namespace: production
spec:
replicas: 2
selector:
matchLabels:
app: order-service
template:
metadata:
labels:
app: order-service
spec:
imagePullSecrets:
- name: harbor-secret
containers:
- name: order-service
image: harbor.example.com/project/order-service:latest
ports:
- containerPort: 8080
env:
- name: SPRING_PROFILES_ACTIVE
value: "prod"
- name: JAVA_OPTS
value: >-
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=70.0
-XX:+UseG1GC
resources:
requests:
cpu: "256m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "2048Mi"
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
name: order-service
namespace: production
spec:
selector:
app: order-service
ports:
- port: 8080
targetPort: 8080
5.2 Ingress Routing Configuration
yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-ingress
namespace: production
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web,websecure
spec:
ingressClassName: traefik
rules:
- host: api.example.com
http:
paths:
- path: /order
pathType: Prefix
backend:
service:
name: order-service
port:
number: 8080
- path: /user
pathType: Prefix
backend:
service:
name: user-service
port:
number: 8080
5.3 Using ConfigMaps and Secrets
yaml
# ConfigMap - non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
namespace: production
data:
NACOS_SERVER: "10.0.20.10:8848"
REDIS_HOST: "10.0.20.10"
LOG_LEVEL: "INFO"
---
# Secret - sensitive configuration
apiVersion: v1
kind: Secret
metadata:
name: app-secret
namespace: production
type: Opaque
stringData:
MYSQL_PASSWORD: "YourPassword123"
REDIS_PASSWORD: "YourPassword456"
Referenced in the Deployment:
yaml
containers:
- name: app
envFrom:
- configMapRef:
name: app-config
- secretRef:
name: app-secret
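To keep real credentials out of Git, the same Secret can be created from the command line (or rendered with --dry-run) instead of committing stringData; for example:
bash
# Create the secret directly in the cluster
kubectl create secret generic app-secret \
  --from-literal=MYSQL_PASSWORD='YourPassword123' \
  --from-literal=REDIS_PASSWORD='YourPassword456' \
  -n production
# Or render a manifest without applying it
kubectl create secret generic app-secret \
  --from-literal=MYSQL_PASSWORD='YourPassword123' \
  -n production --dry-run=client -o yaml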
6. Monitoring and Logging
6.1 Prometheus + Grafana Deployment
6.1.1 Deploy Node Exporter
yaml
# node-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
spec:
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.7.0
args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --path.rootfs=/host/root
ports:
- containerPort: 9100
hostPort: 9100
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host/root
readOnly: true
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
6.1.2 Prometheus Configuration
yaml
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
# K8s node metrics
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:9100'
target_label: __address__
# Spring Boot application metrics
- job_name: 'spring-boot-apps'
kubernetes_sd_configs:
- role: pod
namespaces:
names: ['production']
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
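The relabel rules above only keep Pods that opt in through annotations, so the Spring Boot Pod templates need annotations along these lines (the /actuator/prometheus path assumes Micrometer's Prometheus registry is on the classpath; a prometheus.io/port annotation is commonly added as well, with a matching relabel rule not shown above):
yaml
# Pod template annotations that match the scrape/relabel rules above
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/actuator/prometheus"
    prometheus.io/port: "8080"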
6.2 Example Alerting Rules
yaml
groups:
- name: node_alerts
rules:
# CPU usage alert
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "CPU usage is high"
description: "CPU usage on {{ $labels.instance }} has reached {{ $value }}%"
# Memory usage alert
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Memory usage is high"
description: "Memory usage on {{ $labels.instance }} has reached {{ $value }}%"
# Disk usage alert
- alert: HighDiskUsage
expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80
for: 5m
labels:
severity: critical
annotations:
summary: "Disk usage is high"
- name: application_alerts
rules:
# Application health check failing
- alert: ApplicationDown
expr: up{job="spring-boot-apps"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Application service is unavailable"
6.3 Log Collection
Use Filebeat to ship container logs to Elasticsearch:
yaml
# filebeat-k8s.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: filebeat
namespace: kube-system
spec:
selector:
matchLabels:
app: filebeat
template:
spec:
containers:
- name: filebeat
image: elastic/filebeat:7.17.19
args:
- "-c"
- "/etc/filebeat/filebeat.yml"
- "-e"
volumeMounts:
- name: config
mountPath: /etc/filebeat
- name: varlog
mountPath: /var/log
- name: containers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: config
configMap:
name: filebeat-config
- name: varlog
hostPath:
path: /var/log
- name: containers
hostPath:
path: /var/lib/docker/containers
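The DaemonSet above mounts a filebeat-config ConfigMap that is not shown. A minimal sketch of what it could contain, reading the kubelet's container logs (available under /var/log/containers via the /var/log mount) and shipping them to the Elasticsearch instance on the middleware node; host and index names are assumptions to adjust, and Kubernetes metadata enrichment is left out since it needs extra RBAC:
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
data:
  filebeat.yml: |
    filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
    output.elasticsearch:
      hosts: ["http://10.0.20.10:9200"]
      index: "k8s-logs-%{+yyyy.MM.dd}"
    setup.template.name: "k8s-logs"
    setup.template.pattern: "k8s-logs-*"
    setup.ilm.enabled: false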
7. Troubleshooting and Problem Solving
7.1 Common Issues Quick Reference
| Symptom | Likely cause | Diagnostic command | Resolution |
|---|---|---|---|
| Pod stuck in Pending | Insufficient resources | `kubectl describe pod <name>` | Add nodes or adjust resource requests |
| Pod CrashLoopBackOff | Application fails to start | `kubectl logs <pod>` | Check application config and dependencies |
| ImagePullBackOff | Image pull failure | `kubectl describe pod <name>` | Check image address and credentials |
| Service unreachable | Empty Endpoints | `kubectl get endpoints <svc>` | Check Pod labels and the selector |
| Ingress 502 | Backend Pods not ready | `kubectl get pods` | Check the readinessProbe |
| OOMKilled | Out of memory | `kubectl describe pod <name>` | Raise the memory limit |
| Node NotReady | Network or kubelet issue | `kubectl describe node <name>` | Check kubelet and the network plugin |
| DNS resolution failure | CoreDNS issue | `kubectl logs -n kube-system -l k8s-app=kube-dns` | Restart CoreDNS |
7.2 Diagnostic Command Cheat Sheet
bash
# Show detailed Pod information
kubectl describe pod <pod-name> -n <namespace>
# Show Pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # previous container instance
# Open a shell in a Pod for debugging
kubectl exec -it <pod-name> -n <namespace> -- sh
# Show Services and Endpoints
kubectl get svc,endpoints -n <namespace>
# List all Pods that are not Running
kubectl get pods -A --field-selector=status.phase!=Running
# Show node and Pod resource usage
kubectl top nodes
kubectl top pods -A
# Show events
kubectl get events -A --sort-by='.lastTimestamp'
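For images that ship without a shell (so kubectl exec -- sh fails), an ephemeral debug container is often the quickest way in (requires K8s 1.25+; the container name and image here are illustrative):
bash
# Attach a temporary busybox container targeting the application container's process namespace
kubectl debug -it <pod-name> -n <namespace> --image=busybox:1.36 --target=<container-name>
# Or spin up a copy of the Pod with an extra debug container added
kubectl debug <pod-name> -n <namespace> -it --image=busybox:1.36 --copy-to=<pod-name>-debug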
7.3 Case Studies
Case 1: Pod repeatedly OOMKilled
Symptom: the Pod restarts every few hours and its status shows OOMKilled
Investigation:
bash
kubectl describe pod <pod-name> -n production
# check the Limits and the last termination reason
kubectl top pod <pod-name> -n production
# check current memory usage
Root cause: the JVM heap settings did not match the K8s memory limit
Fix:
yaml
env:
- name: JAVA_OPTS
value: "-XX:MaxRAMPercentage=70.0" # size the heap as a percentage rather than a fixed value
resources:
limits:
memory: "2048Mi" # leave ~30% headroom for off-heap memory
Case 2: 504 errors on cross-node Pod communication
Symptom: Pod-to-Pod traffic on the same node works, but cross-node requests return 504 timeouts with roughly a 50% failure rate
Quick investigation:
bash
# 1. Verify cross-node Pod connectivity
kubectl exec pod-on-node1 -- ping -c 3 <pod-on-node2-ip>
# 2. Compare: access the Pod directly from a node (if this works, suspect the security group)
ssh node1 "curl http://<pod-on-node2-ip>"
# 3. Check whether the security group covers the Pod network
# The Pod CIDR (e.g. 172.16.0.0/16) must be allowed
Root cause: the cloud security group only allowed the node network, not the Pod CIDR
Fix: add security-group rules:
- Inbound: ANY - Pod CIDR (e.g. 172.16.0.0/16)
- Outbound: ANY - Pod CIDR (e.g. 172.16.0.0/16)
Detailed analysis: see the "Troubleshooting in Practice" article
8. Summary and Checklists
8.1 Pre-Deployment Checklist
| Check | Command/Action | Expected result |
|---|---|---|
| System time synchronized | `timedatectl` | System clock synchronized: yes |
| Swap disabled | `free -h` | Swap row shows all zeros |
| Kernel modules loaded | `lsmod \| grep br_netfilter` | br_netfilter listed |
| containerd running | `systemctl status containerd` | active (running) |
| kubelet enabled | `systemctl is-enabled kubelet` | enabled |
| Network connectivity | ping between nodes | all nodes reachable |
| Image registry reachable | `crictl pull nginx` | image pulled successfully |
8.2 Post-Deployment Checklist
| Check | Command | Expected result |
|---|---|---|
| All nodes Ready | `kubectl get nodes` | STATUS is Ready for every node |
| System Pods healthy | `kubectl get pods -n kube-system` | all Running |
| Network plugin healthy | `kubectl get pods -n calico-system` | all Running |
| DNS resolution works | `kubectl run test --rm -it --image=busybox -- nslookup kubernetes` | name resolves |
| Cross-node communication | create two Pods and ping between them | traffic flows |
| Ingress works | create a test Ingress and access it | responds normally |
8.3 Frequently Used Commands
bash
# Cluster management
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -A
# Application management
kubectl apply -f deployment.yaml
kubectl rollout status deployment/<name>
kubectl rollout undo deployment/<name>
kubectl scale deployment/<name> --replicas=5
# Logs and debugging
kubectl logs -f <pod>
kubectl exec -it <pod> -- sh
kubectl describe pod <pod>
kubectl top nodes
kubectl top pods
# Cleanup
kubectl delete pod <name> --force --grace-period=0
kubectl delete namespace <name>
| 配置项 | 路径 |
|---|---|
| kubeadm配置 | /etc/kubernetes/admin.conf |
| kubelet配置 | /var/lib/kubelet/config.yaml |
| containerd配置 | /etc/containerd/config.toml |
| Calico配置 | kubectl get installation default -o yaml |
| kubectl配置 | ~/.kube/config |
Keywords: Kubernetes, deployment practice, CI/CD, monitoring, troubleshooting