Once a Kubernetes cluster goes live, failures are inevitable. This article collects 10 real production cases covering the high-frequency problem areas: Pod scheduling, networking, storage, RBAC, and more, so you can skip a few years of trial and error.
Background: Why So Many K8s Failures?
Kubernetes is complex because it spans multiple layers:
```
┌──────────────────────────────────────────┐
│ Application layer                        │
│   Pod / Deployment / StatefulSet         │
├──────────────────────────────────────────┤
│ Scheduling layer                         │
│   Scheduler / Node / Taints/Tolerations  │
├──────────────────────────────────────────┤
│ Network layer                            │
│   CNI / Service / Ingress / DNS          │
├──────────────────────────────────────────┤
│ Storage layer                            │
│   PV / PVC / StorageClass                │
├──────────────────────────────────────────┤
│ Control plane                            │
│   API Server / etcd / Controllers        │
└──────────────────────────────────────────┘
```
A problem at any one of these layers can take the application down. The cases below are pitfalls we actually hit.
Case 1: Pod Stuck in Pending
Symptom
```bash
kubectl get pods -n production
# Output:
# NAME                      READY   STATUS    RESTARTS   AGE
# web-app-7d9f8b6c5-x2n4p   0/1     Pending   0          5m
```
The Pod stays in Pending and never gets scheduled.
Troubleshooting
```bash
# 1. Inspect the Pod
kubectl describe pod web-app-7d9f8b6c5-x2n4p -n production

# Common causes:
# - Insufficient resources (not enough CPU/memory on any node)
# - Affinity/anti-affinity constraints
# - Taints without matching tolerations
# - Unbound PVC

# 2. Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# 3. Check taints
kubectl get nodes -o custom-columns=NODE:.metadata.name,TAINTS:.spec.taints
```
Fix
Scenario A: insufficient resources
```yaml
# Set sane defaults and caps so Pods don't over-request resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - max:
        cpu: "4"
        memory: 8Gi
      default:
        cpu: 500m
        memory: 1Gi
      defaultRequest:
        cpu: 200m
        memory: 512Mi
      type: Container
```
Scenario B: taints
```bash
# Inspect the taints
kubectl describe node node-1 | grep Taints

# Temporarily remove a taint (testing only)
kubectl taint node node-1 dedicated-
```
Or add a toleration to the workload:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Exists"
          effect: "NoSchedule"
```
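Scenario C: affinity constraints. The third common cause from the checklist above. A minimal sketch of a hard `nodeAffinity` rule that leaves a Pod Pending when no node carries the label (the `disktype=ssd` label is made up for illustration):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype        # hypothetical label; if no node has it, the Pod stays Pending
                operator: In
                values: ["ssd"]
  containers:
    - name: web
      image: nginx:1.25
```
If `kubectl describe pod` reports `node(s) didn't match Pod's node affinity/selector`, either label a node (`kubectl label node node-1 disktype=ssd`) or relax the rule to `preferredDuringSchedulingIgnoredDuringExecution`.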
Case 2: Pod in CrashLoopBackOff
Symptom
```bash
kubectl get pods -n production
# Output:
# NAME                       READY   STATUS             RESTARTS   AGE
# api-server-5f4d7c8b9-abc   0/1     CrashLoopBackOff   3          2m
```
The Pod keeps restarting.
Troubleshooting
```bash
# Check the logs of the previous (crashed) container instance
kubectl logs api-server-5f4d7c8b9-abc -n production --previous

# Common causes:
# - Application fails to start (bad configuration)
# - Failing health checks
# - OOMKilled (memory limit exceeded)
# - Unreachable dependencies
```
Real case: misconfigured health check
```yaml
# Broken config: the liveness probe fires before the app has started,
# so the kubelet kills and restarts the container in a loop
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
    - name: api
      image: my-api:v1
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 0   # Way too short!
        periodSeconds: 5
```
```yaml
# Fixed config
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
    - name: api
      image: my-api:v1
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30   # Give the app enough time to start
        periodSeconds: 10
        failureThreshold: 3
      resources:
        limits:
          memory: "512Mi"
        requests:
          memory: "256Mi"
```
Note: a failing readinessProbe only marks the Pod NotReady; it is the livenessProbe that triggers container restarts, which is why the probes above are liveness probes.
Case 3: Service Not Reachable
Symptom
```bash
# Test from inside a Pod
kubectl exec -it nginx-pod -- sh
/ # curl http://api-service:8080/health
# curl: couldn't connect to host
```
Troubleshooting
```bash
# 1. Does the Service exist?
kubectl get svc -n production | grep api

# 2. Does it have Endpoints?
kubectl get endpoints api-service -n production
# If empty, the selector matches no Pods

# 3. Compare Pod labels with the Service selector
kubectl get pods -n production --show-labels | grep app
kubectl get svc api-service -n production -o yaml | grep -A 5 selector
```
Real case: label mismatch
```yaml
# Deployment labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
      version: v2
  template:
    metadata:
      labels:
        app: api-server
        version: v2   # New release uses v2
```
```yaml
# The Service selector still says v1
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-server
    version: v1   # ❌ Right here!
  ports:
    - port: 8080
      targetPort: 8080
```
Fix: update the Service selector (or the Deployment labels) so they match.
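For completeness, the corrected Service with its selector bumped to match the v2 Pods might look like this:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-server
    version: v2   # now matches the Deployment's Pod labels
  ports:
    - port: 8080
      targetPort: 8080
```
A practical alternative is dropping the version label from the Service selector entirely, so the Service keeps matching Pods across version rollouts.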
Case 4: DNS Resolution Failures
Symptom
```bash
kubectl exec -it test-pod -- nslookup kubernetes
# Server:  10.96.0.10
# ** server can't find kubernetes.default: NXDOMAIN
```
Troubleshooting
```bash
# 1. Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Check the CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# 3. Does external DNS still work?
kubectl exec -it test-pod -- nslookup www.baidu.com
```
Fix
```yaml
# Option 1: scale CoreDNS up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 3   # run at least 2 replicas in production
```
```yaml
# Option 2: tune the Pod's DNS policy
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  dnsPolicy: ClusterFirst   # the default
  # ...or customize DNS
  dnsConfig:
    nameservers:
      - 8.8.8.8
    searches:
      - default.svc.cluster.local
    options:
      - name: ndots
        value: "2"
```
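If external names resolve but are slow or flaky, the CoreDNS Corefile (the `coredns` ConfigMap in `kube-system`) is the next place to look. A sketch of forwarding external queries to explicit upstreams instead of the node's `/etc/resolv.conf` (the upstream IPs are illustrative):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . 8.8.8.8 1.1.1.1   # illustrative upstream resolvers
        cache 30
        loop
        reload
    }
```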
Case 5: PVC Won't Bind
Symptom
```bash
kubectl get pvc -n production
# NAME       STATUS    VOLUME   CAPACITY
# data-pvc   Pending
# Stuck in Pending (a Pending PVC has no volume bound yet)
```
Troubleshooting
```bash
# 1. Inspect the PVC
kubectl describe pvc data-pvc -n production

# Common causes:
# - StorageClass does not exist
# - Storage quota exceeded
# - Volume type not supported by the cloud provider

# 2. Check StorageClasses
kubectl get storageclass

# 3. Check provider-side limits
# e.g. Tencent Cloud CBS: at most 20 cloud disks per node
```
Fix
```yaml
# Option 1: use a StorageClass that actually exists
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "cbs-balanced"   # Tencent Cloud CBS
  resources:
    requests:
      storage: 50Gi
```
```bash
# Option 2: clean up stale volumes (Released is a PV phase, not a PVC phase)
kubectl get pv | grep Released
kubectl delete pvc <pvc-name> -n <namespace>
```
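One caveat before declaring a fault: with `volumeBindingMode: WaitForFirstConsumer`, a PVC sitting in `Pending` until some Pod actually uses it is expected behavior, not a bug. A sketch of such a StorageClass (the provisioner name is an assumption; substitute your cloud's CSI driver):
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cbs-balanced
provisioner: com.tencent.cloud.csi.cbs   # illustrative CSI driver name
volumeBindingMode: WaitForFirstConsumer  # bind only when a Pod is scheduled
reclaimPolicy: Delete
allowVolumeExpansion: true
```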
Case 6: Ingress Returns 502 Bad Gateway
Symptom
The browser gets a 502, yet the Pods themselves are running fine.
Troubleshooting
```bash
# 1. Is the Ingress controller healthy?
kubectl get pods -n ingress-nginx

# 2. Inspect the backend wiring
kubectl describe ingress my-ingress -n production

# 3. Can the controller reach the Pod directly?
kubectl exec -it nginx-ingress-xxx -n ingress-nginx -- curl -v http://<pod-ip>:8080/health
```
Real case: wrong health-check path
```yaml
# Original Ingress: no health-check endpoint exposed
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080
```
```yaml
# Add a health-check location via server-snippet
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/server-snippet: |
      location /health {
        return 200 'OK';
      }
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080
```
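502s also show up when the upstream is simply slow and nginx gives up waiting. If the `describe` output looks clean, checking the proxy timeouts may be worth a try; a sketch of the relevant ingress-nginx annotations (the values are illustrative, not recommendations):
```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
```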
Case 7: Insufficient RBAC Permissions
Symptom
```
# Application error
Forbidden: User "system:serviceaccount:default:my-app"
cannot list pods in namespace "production"
```
Troubleshooting
```bash
# 1. Check the ServiceAccount
kubectl get sa my-app -n default

# 2. Check Roles
kubectl get role -n production

# 3. Check RoleBindings
kubectl get rolebinding -n production

# 4. Verify the exact permission directly
kubectl auth can-i list pods -n production --as=system:serviceaccount:default:my-app
```
Fix
```yaml
# Create a Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
```
```yaml
# Bind it to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```
Case 8: OOMKilled (Memory Limit Exceeded)
Symptom
```bash
kubectl get pods -n production
# NAME      READY   STATUS      RESTARTS   AGE
# api-xxx   0/1     OOMKilled   2          10m
```
Troubleshooting and fix
```bash
# Check the configured limits
kubectl describe pod api-xxx -n production | grep -A 5 "Limits"
```
Then raise the memory limit:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          resources:
            limits:
              memory: "2Gi"   # raised from 512Mi
              cpu: "1"
            requests:
              memory: "1Gi"
              cpu: "500m"
```
Case 9: Node NotReady
Symptom
```bash
kubectl get nodes
# NAME     STATUS     ROLES    AGE   VERSION
# node-1   NotReady   worker   30d   v1.28.0
```
Troubleshooting
```bash
# 1. SSH into the affected node
ssh root@node-1

# 2. Check the kubelet
systemctl status kubelet

# 3. Read its logs
journalctl -u kubelet -n 100 --no-pager

# Common causes:
# - Disk pressure (out of disk space)
# - Memory pressure
# - Expired kubelet certificates
```
Fix
```bash
# Free up disk space
docker system prune -a --volumes
# WARNING: only wipe /var/lib/docker with Docker stopped; it deletes ALL
# local images and containers on the node
# systemctl stop docker && rm -rf /var/lib/docker/* && systemctl start docker

# Restart the kubelet
systemctl restart kubelet

# If certificates are the problem
kubeadm certs check-expiration
kubeadm certs renew all   # run on the control-plane node
```
Case 10: HPA Won't Scale Out
Symptom
```bash
kubectl get hpa -n production
# NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS
# api-hpa   Deployment/api   85%/80%   2         10        2
# CPU is at 85% > 80%, yet replicas stay at 2!
```
Troubleshooting
```bash
# 1. Inspect the HPA
kubectl describe hpa api-hpa -n production

# Common causes:
# - Pods declare no resource requests (CPU/memory)
# - Metrics Server is not running
# - Replica count is already at the maximum

# 2. Check Metrics Server
kubectl get pods -n kube-system | grep metrics

# 3. Verify metrics collection
kubectl top pods -n production
```
Fix
```yaml
# Pods MUST declare resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              cpu: "500m"      # required: HPA utilization is computed against this
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```
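With `autoscaling/v2` you can also tame replica-count flapping through the optional `behavior` field, which shapes how fast the HPA scales in each direction. A sketch, added under the HPA's `spec` (the values are illustrative):
```yaml
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100                  # at most double the replicas per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min of low load before scaling down
```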
Production Checklist
Before every deployment, verify:
```bash
# 1. Resource limits and quotas in place
kubectl get limitrange,resourcequota -n production

# 2. Health checks configured
kubectl get deploy -n production -o yaml | grep -c livenessProbe

# 3. In-cluster network connectivity
kubectl run nettest --rm -it --image=busybox -- wget -qO- http://api-service.production:8080/health

# 4. Storage availability
kubectl get storageclass

# 5. RBAC permissions
kubectl auth can-i list pods -n production --as=system:serviceaccount:default:my-app
```
Summary: Key Takeaways
| Category | Common problem | Prevention |
|---|---|---|
| Scheduling | Pod Pending | Plan capacity ahead, keep a resource buffer |
| Lifecycle | CrashLoopBackOff | Sensible health checks + resource limits |
| Network | Service unreachable | Verify label selectors, test DNS |
| Storage | PVC Pending | Confirm the StorageClass exists |
| Permissions | RBAC errors | Least privilege, verify with `kubectl auth can-i` |
| Autoscaling | HPA inactive | Always set resource requests |
Golden rule: validate thoroughly in a test environment before touching production!
👤 About the Author
A public-cloud seller based in the Central Plains (Henan), working mainly with Tencent Cloud, Alibaba Cloud, and Huawei Cloud. Has stepped on countless pitfalls and now focuses on putting large AI models into production. Follow the WeChat official account 「公有云cloud」 for AI news.
Blog: yunduancloud.icu