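K8s Troubleshooting 01: What to Do When a Pod Is Stuck in Pending

Scenario: a Kubernetes Pod stays in the Pending state and cannot be scheduled onto a node
Difficulty: ⭐⭐ Basic (essential for anyone starting out in operations)
Applies to: all K8s clusters (ACK / TKE / self-managed / k3s)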

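I. First, Understand What Pending Means

1. A quick review of the Pod scheduling flow

On its way from creation to running, every Pod has to be assigned to a node by the scheduler:

Create Pod → API Server → Scheduler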
                              │
                    ┌─────────┼─────────┐
                    ▼         ▼         ▼
                  Node-A    Node-B   Node-C
                    │
                    ▼
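                  Bind → kubelet pulls image → container runs

When the scheduler cannot find a suitable node, the Pod stays in Pending and will not reach Running until the blocking condition is resolved.

💡 Pending vs. CrashLoopBackOff:

Pending = it has not had its turn to run yet; it is stuck at the scheduling stage

CrashLoopBackOff = it is already running on a node, but keeps crashing and restarting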
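
To tell the two states apart at a glance, a small filter over `kubectl get pods` output can help. The sketch below is a hypothetical helper (the name `list_pending` is ours, not part of kubectl), and it assumes the default column order of `kubectl get pods -A --no-headers`, where the fourth column is STATUS.

```shell
#!/usr/bin/env bash
# list_pending: read `kubectl get pods -A --no-headers` output on stdin and
# print namespace/name for every Pod whose STATUS column is exactly "Pending".
# Assumed column order: NAMESPACE NAME READY STATUS RESTARTS AGE
list_pending() {
  awk '$4 == "Pending" { print $1 "/" $2 }'
}

# Typical use against a live cluster:
#   kubectl get pods -A --no-headers | list_pending
```

Filtering on the STATUS column rather than on the Pod phase is a deliberate choice here: a Pod that is already scheduled but still pulling its image shows up as ContainerCreating and is left out.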

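2. Common causes at a glance

Cause category	Typical situation	Frequency
🖥️ Insufficient node resources	CPU / memory / PIDs are exhausted; there is nowhere to run	★★★★★ most common
🔒 Scheduling constraints	nodeSelector / affinity / taint does not match	★★★★☆
📦 Unbound PVC	The volume is not ready and the Pod is waiting to mount it	★★★☆☆
⚠️ ResourceQuota exceeded	The Namespace quota is used up	★★☆☆☆
🔐 Permission problems	ServiceAccount / RBAC lacks permissions	★★☆☆☆
🐛 Cluster-level failure	Scheduler failure, ControllerManager errors	☆☆☆☆☆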

II. Standard Troubleshooting Flow (6 Steps)

Step 1: Check Events, the scene of the incident

# The reason for Pending is usually spelled out right here
kubectl describe pod <pod-name> -n <namespace>

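Events are the Pod's "scene of the incident". This command tells you directly:

FailedScheduling: scheduling failed; the concrete reason follows
Insufficient cpu: the nodes do not have enough CPU
Insufficient memory: the nodes do not have enough memory
node(s) didn't match node selector: the node selector does not match (newer versions print "didn't match Pod's node affinity/selector")
node(s) had taint {...} that the pod didn't tolerate: a taint is not tolerated
0/N nodes are available: N pod has unbound immediate PersistentVolumeClaims: a PVC is not bound
exceeded quota: ResourceQuota exceeded (reported on the owning ReplicaSet or Job, because the Pod itself is never created)

⚠️ Core principle: when troubleshooting Pending, always run kubectl describe pod first; it spares you most of the detours.

Key area of the output: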

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  2m (x12 over 15m)  default-scheduler  0/3 nodes are available:
                                                            3 Insufficient cpu.

The last line is the answer: none of the 3 nodes has enough unrequested CPU.
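
As a rough illustration of reading these messages mechanically, the sketch below maps a FailedScheduling message to a cause category. The function name `classify_pending_reason` and the category labels are made up for this example; the patterns are substrings of the messages quoted in this article, and the exact wording differs between Kubernetes versions.

```shell
#!/usr/bin/env bash
# classify_pending_reason: map a scheduler/controller message to a rough cause.
# Patterns are checked top to bottom; the first match wins.
classify_pending_reason() {
  case "$1" in
    *"Insufficient "*)                                echo "resources" ;;
    *"node selector"*|*"node affinity"*)              echo "selector-or-affinity" ;;
    *"taint"*)                                        echo "taint" ;;
    *"PersistentVolumeClaim"*|*"persistent volumes"*) echo "storage" ;;
    *"exceeded quota"*)                               echo "quota" ;;
    *)                                                echo "unknown" ;;
  esac
}

classify_pending_reason "0/3 nodes are available: 3 Insufficient cpu."
```

The category then points to the step to jump to: resources to Step 2, selector or taint to Step 3, storage to Step 4, quota to Step 5.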

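Step 2: Check whether the nodes have enough resources

# Live CPU/memory usage of every node (requires metrics-server)
kubectl top nodes

# Without metrics-server, list each node's Allocatable resources
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory

# Detailed resource allocation per node
kubectl describe nodes | grep -A 5 "Allocatable"
kubectl describe nodes | grep -A 5 "Allocated resources"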
How to read typical output (simplified):

Name:               node-1
Allocatable:
  cpu:                4          # 4 cores in total on this node
  memory:            16384Mi     # 16 GiB of memory in total
  pods:               110        # at most 110 Pods
Allocated resources:
  (Total limits may be over 100%, i.e., overcommitted.)
  cpu:                3950m      # 3.95 cores already requested (99%)
  memory:            15872Mi     # 15.5 GiB already requested (97%)

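If the requests on a node are close to its Allocatable, new Pods cannot be scheduled there. Keep in mind that the scheduler compares Pod requests against Allocatable, not the live usage shown by kubectl top. Options:

Add nodes (scale out)
Lower the requests of existing Pods (right-sizing)
Free capacity by scaling down low-priority workloads or letting PriorityClass preemption evict them (kubectl drain empties a whole node, so it does not add capacity)
Enable node autoscaling such as Cluster Autoscaler (long-term; an HPA only changes replica counts and adds no node capacity)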
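
The fit check the scheduler performs for CPU comes down to simple arithmetic on requests. The sketch below replays it with the node-1 numbers from above; `to_millicores` and `fits_on_node` are hypothetical helper names, and only whole cores and the `m` suffix are handled.

```shell
#!/usr/bin/env bash
# to_millicores: convert a CPU quantity ("4" or "3950m") to millicores.
# Fractional values such as "0.5" are out of scope for this sketch.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# fits_on_node ALLOCATABLE ALREADY_REQUESTED POD_REQUEST
# Exit status 0 when the Pod's CPU request still fits on the node.
fits_on_node() {
  local alloc req pod
  alloc=$(to_millicores "$1")
  req=$(to_millicores "$2")
  pod=$(to_millicores "$3")
  [ $(( req + pod )) -le "$alloc" ]
}

# node-1: 4 cores allocatable, 3950m already requested, new Pod asks for 100m.
if fits_on_node 4 3950m 100m; then echo "fits"; else echo "does not fit"; fi
```

With 3950m + 100m = 4050m against 4000m allocatable, the Pod does not fit even though the node may look idle in kubectl top; a 50m request would still be accepted.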

Step 3: Check scheduling constraints

3a) Node taints

# List the taints of every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Or inspect a single node in detail
kubectl describe node <node-name> | grep -i taint

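Common taints and what they mean:

Taint	Meaning	Who tolerates it
`node-role.kubernetes.io/control-plane:NoSchedule`	Control-plane nodes do not take ordinary Pods	Critical kube-system components
`node.kubernetes.io/not-ready:NoExecute`	Pods are evicted when the node is NotReady	Pods with a matching Toleration
`nvidia.com/gpu=true:NoSchedule`	GPU nodes are reserved for Pods that need a GPU	Pods with a matching Toleration (typically those requesting GPU resources)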

Problem scenario: every node carries a taint and your Pod has no matching Toleration → it goes straight to Pending.
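
To check quickly whether any node is free of taints, the output can be reduced to one line per node. The sketch below uses a hypothetical helper, `untainted_nodes`, that expects `NAME<TAB>TAINTS` lines on stdin; the jsonpath expression in the comment is one way to produce that input.

```shell
#!/usr/bin/env bash
# untainted_nodes: read "NAME<TAB>TAINTS" lines on stdin and print the names
# of nodes whose taint column is empty, i.e. nodes that no taint blocks.
untainted_nodes() {
  awk -F'\t' '$2 == "" { print $1 }'
}

# Producing the input from a live cluster (one key=value:effect per taint):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}={.value}:{.effect} {end}{"\n"}{end}' \
#     | untainted_nodes
```

An empty result means every node is tainted, so a Pod without Tolerations has nowhere to go.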

3b) Node selectors and affinity

# Show the Pod's node selector and affinity settings
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 "nodeSelector"
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 15 "affinity"

Typical configurations:

# ❌ Problem: requires the label disk=ssd, but no node carries it
nodeSelector:
  disk: ssd

# ❌ Problem: pins the Pod to one node, but that node is missing or unhealthy
nodeName: worker-node-03

# ✅ Better: node affinity as a soft constraint, plus anti-affinity to spread replicas
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: disk
          operator: In
          values: ["ssd"]
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname

Quick diagnosis commands:

# Taints and labels on the nodes, then the scheduling constraints on the Pod
kubectl describe nodes | grep -E "Taints|Labels"
kubectl get pod <pod-name> -n <namespace> -o yaml \
  | grep -E "nodeSelector|nodeName|affinity|tolerations" -A 5

Step 4: Check whether the PVCs are bound

# Status of every PVC in the Namespace
kubectl get pvc -n <namespace>

# Find out why a specific PVC is not bound
kubectl describe pvc <pvc-name> -n <namespace>

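PVC status reference:

PVC status	Meaning	What to do
`Bound`	✅ Healthy, attached to a PV	Nothing
`Pending`	⚠️ No PV available to bind, or a StorageClass problem	Create a PV or fix the StorageClass
`Lost`	❌ The underlying PV is gone	Delete and recreate the PVC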

Common causes and fixes:

# Cause 1: no StorageClass, or the requested class does not exist
kubectl get storageclass

# Cause 2: no PV satisfies the claim (static provisioning)
kubectl get pv

# Cause 3: dynamic provisioning failed (the StorageClass provisioner has a problem)
kubectl describe pvc my-pvc -n myns | tail -20
# The Events usually contain something like:
#   persistentvolume-controller: waiting for a volume to be created,
#   either by external provisioner "csi-rbdplugin" or manually...

💡 Tip: if a Pod mounts several PVCs, a single unbound one keeps the whole Pod Pending. kubectl get pvc shows them all in one pass.
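
Scanning for the odd one out can be scripted. The helper below, `unbound_pvcs`, is a hypothetical name; it assumes the output of `kubectl get pvc --no-headers`, where the first two columns are NAME and STATUS.

```shell
#!/usr/bin/env bash
# unbound_pvcs: read `kubectl get pvc -n <namespace> --no-headers` on stdin
# and print every claim whose STATUS is anything other than "Bound".
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 " (" $2 ")" }'
}

# Typical use against a live cluster:
#   kubectl get pvc -n <namespace> --no-headers | unbound_pvcs
```

Only the first two columns are read on purpose: a Pending claim has empty VOLUME and CAPACITY columns, so later columns shift position.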

Step 5: Check whether a ResourceQuota is exceeded

# List the ResourceQuotas of the Namespace
kubectl get resourcequota -n <namespace>

# Show the usage of every quota item
kubectl describe resourcequota -n <namespace>

A ResourceQuota is the Namespace-level "invisible ceiling":

Name:                   compute-quota
Resource                Used  Hard
--------                ---   ---
requests.cpu            2     4
requests.memory         1Gi   8Gi
limits.cpu              4     8
limits.memory           4Gi   16Gi
persistentvolumeclaims  2     5
pods                    10    20
services                2     5
secrets                 5     10

If Used reaches Hard in any row, new Pods are rejected at creation time.

# The owning controller (for example the ReplicaSet) then reports something like:
# Error creating: exceeded quota: compute-quota,
# requested: requests.cpu=500m, used: requests.cpu=3750m, limited: requests.cpu=4
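
Exhausted rows can be picked out of the `kubectl describe resourcequota` table automatically. The sketch below is a simplification with a hypothetical helper name, `exhausted_quota_rows`; it compares Used and Hard as plain strings, so equivalent values written in different units (for example `4000m` and `4`) are not recognised as equal.

```shell
#!/usr/bin/env bash
# exhausted_quota_rows: read the Resource/Used/Hard table on stdin and print
# the resources whose Used value is textually identical to the Hard value.
# The header row and the dashed separator row are skipped.
exhausted_quota_rows() {
  awk 'NF == 3 && $1 != "Resource" && $1 !~ /^-+$/ && $2 == $3 { print $1 }'
}

# Typical use against a live cluster:
#   kubectl describe resourcequota -n <namespace> | exhausted_quota_rows
```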

Fixes:

# Option A: raise the quota
kubectl patch resourcequota compute-quota -n <ns> \
  --type merge -p '{"spec":{"hard":{"pods":"50"}}}'

# Option B: delete Pods you no longer need to free quota
kubectl delete pod <unused-pod> -n <ns>

# Option C: remove the quota altogether (acceptable in dev/test environments only)
kubectl delete resourcequota compute-quota -n <ns>

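Step 6: Scan events cluster-wide to uncover hidden, related problems

# All events sorted by time, to see what happened most recently
kubectl get events -A --sort-by='.lastTimestamp'

# Warning events only (event types are Normal and Warning)
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'

# Events of a single Namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Only the last hour (GNU date shown; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
since=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
kubectl get events -A --sort-by='.lastTimestamp' --no-headers \
  -o custom-columns=LAST:.lastTimestamp,NS:.metadata.namespace,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message \
  | awk -v since="$since" '$1 >= since'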

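Why scan cluster-wide?

Sometimes the problem does not show up in the Events of the Pod itself:

Node failure: a NotReady node causes scheduling to fail
Scheduler failure: the kube-scheduler Pod itself has crashed
Controller Manager errors: the Deployment/StatefulSet controllers are malfunctioning
Network plugin problems: a CNI plugin that is not working keeps all new Pods from starting

# Quick health check of the core components
kubectl get pods -n kube-system | grep -v Running

# If scheduler/controller-manager is unhealthy, this is a cluster-level failure
kubectl logs -n kube-system -l component=kube-scheduler --tail=50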

III. Typical Cases and Fixes

Case 1: Insufficient resources (most common)

Symptom:

Events:
  Warning  FailedScheduling  2m (x23)  default-scheduler
  0/3 nodes are available: 3 Insufficient cpu.

Investigation:

kubectl top nodes
# NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# node-1   3900m        98%    14Gi            90%
# node-2   3950m        99%    15Gi            95%
# node-3   3800m        95%    14.5Gi          89%

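All three nodes are full. Possible causes:

HPA scaling out of control (the replica count shot up)
A Pod with a memory leak that is OOM-killed and restarted over and over, tying up resources
A newly launched service with requests set too high

Fix:

# 1. Find the Pods consuming the most CPU
kubectl top pods -A --sort-by=cpu | head -20

# 2. Scale down non-critical workloads
kubectl scale deployment non-critical-app --replicas=1 -n <ns>

# 3. If a memory leak is suspected, look for evicted or OOM-killed Pods first
kubectl get pods -A -o wide | grep -iE 'evicted|oomkilled'

# 4. Long-term: give the HPA a sensible range
kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10
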
Case 2: Node taints do not match

Symptom:

Events:
  Warning  FailedScheduling  30s  default-scheduler
  0/3 nodes are available: 1 node(s) had taint {node-type: production}, that the pod didn't tolerate,
  2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.

Analysis: of the 3 nodes, 2 are control-plane nodes and 1 carries the production taint. Your Pod has no matching Toleration for either, so it has nowhere to go.

Fix:

# Option A: add a Toleration to the Pod (recommended)
kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"node-type","operator":"Equal","value":"production","effect":"NoSchedule"}]}}}}'

# Option B: remove the taint from the node (wide impact, use with care)
kubectl taint nodes <node-name> node-type=production:NoSchedule-

# Option C: pair the Toleration with a nodeSelector so the Pod lands only on the
# intended nodes (a nodeSelector alone does not bypass taints)

Case 3: Pending because of an unbound PVC

Symptom:

Events:
  Warning  FailedScheduling  45s  default-scheduler
  0/3 nodes are available: 3 node(s) didn't find available persistent volumes to bind.

Investigation:

kubectl get pvc -n myapp
# NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
# data-pvc   Pending                                       standard

kubectl describe pvc data-pvc -n myapp | tail -10
# Events:
#   Type     Reason                Age               From                         Message
#   ----     ------                ----              ----                         -------
#   Normal   WaitForFirstConsumer  2m (x6)           persistentvolume-controller  waiting for first consumer to be created before binding

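Analysis: the StorageClass uses the WaitForFirstConsumer delayed-binding mode, so the PVC is bound only once a consuming Pod can be scheduled, and so far no consumer has been scheduled successfully (for example because no available PV satisfies the claim).

Fix:

# If the StorageClass uses delayed binding, switch to one with Immediate binding
# (volumeBindingMode cannot be changed in place, so this means a new StorageClass),
# or create a PV by hand to satisfy the claim
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv-for-myapp
spec:
  storageClassName: standard   # must match the StorageClass of the PVC
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/myapp
  claimRef:
    namespace: myapp
    name: data-pvc
EOF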

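Case 4: ResourceQuota exceeded

Symptom:

Events:
  Warning  FailedCreate  10s  replicaset-controller
  Error creating: exceeded quota: pod-quota, requested: pods=1, used: pods=20, limited: pods=20

The message says it all: the Namespace allows at most 20 Pods and 20 are already in use. Because the quota is enforced when the Pod is created, this event is recorded on the owning ReplicaSet (kubectl describe rs); the new Pod never appears, rather than sitting in Pending.

kubectl get pods -n <ns> --no-headers | wc -l  # confirm there really are 20

# Pick one of three fixes:
# A. Clean up obsolete Pods
kubectl delete pod <old-pod> -n <ns>
# B. Raise the quota
kubectl patch resourcequota pod-quota -n <ns> \
  --type merge -p '{"spec":{"hard":{"pods":"50"}}}'
# C. Remove the quota (test environments only)
kubectl delete resourcequota pod-quota -n <ns>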

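Case 5: A real production incident

All Pods Pending at peak time for a financial customer

Background: a bank's core trading system with 3 worker nodes, each with 16 cores and 32 GB of memory.

Symptom: during the trading peak, a large number of trading Pods suddenly went Pending and the on-call phone was flooded with alerts.

Investigation:

Step 1: kubectl describe pod trade-service-xxxxx
→ Events: 0/3 nodes are available: 3 Insufficient memory

Step 2: kubectl top nodes
→ memory on all 3 nodes above 99%, practically saturated

Step 3: kubectl top pods -A --sort-by=memory | head -10
→ one batch-job Pod was using 28 GB of memory (close to a whole node)
→ and the HPA had scaled it from 2 replicas to 500+!

Step 4: root cause
→ batch-job had a memory-leak bug
→ the HPA scaled on a CPU metric; CPU was not high while memory kept climbing
→ the HPA kept scaling out until the replica count passed 500
→ each replica claimed its own share of memory and filled every node
→ in the end all other business Pods went Pending

Response:

Immediately run kubectl scale deployment batch-job --replicas=0
Wait 2 minutes for the terminated Pods to release their memory
Fix the batch-job memory leak, then redeploy
Add a memory metric to the HPA as a scaling signal
Set an upper bound on the replica count

Lessons learned:

An HPA should not look at CPU only; memory metrics matter just as much
Always set maxReplicas so that scaling cannot run away
Critical Pods should have a PriorityClass so that low-priority jobs cannot crowd them out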
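
Why maxReplicas matters becomes concrete once the scaling rule is written out. The sketch below implements the documented HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), in integer arithmetic and clamps the result; the function name `desired_replicas` is illustrative, and the real controller additionally applies a tolerance band and stabilization windows that are ignored here.

```shell
#!/usr/bin/env bash
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Integer form of ceil(current * currentUtil / targetUtil), clamped to [MIN, MAX].
desired_replicas() {
  local current=$1 util=$2 target=$3 min=$4 max=$5
  local want=$(( (current * util + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

desired_replicas 2 140 70 2 20    # 2 * 140 / 70 = 4
desired_replicas 10 350 70 2 20   # raw result is 50, clamped to maxReplicas = 20
```

Without the clamp the second call would ask for 50 replicas, and every further round compounds on the previous one, which is the kind of growth described in the incident above.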

IV. Quick Reference for Troubleshooting

1. One-shot diagnosis script

#!/bin/bash
# ════════════════════════════════════════
# 🚨 One-shot diagnosis script for a Pending Pod
# Pass the Pod name and namespace as arguments, or edit the defaults below.
# ════════════════════════════════════════

POD_NAME="${1:-your-pod-name}"
NS="${2:-your-namespace}"

echo "=========================================="
echo " Pod Pending diagnosis report: $POD_NAME"
echo " Time: $(date)"
echo "=========================================="

echo ""
echo "[1] Current Pod status"
kubectl get pod "$POD_NAME" -n "$NS" -o wide

echo ""
echo "[2] Events (the most important part!)"
kubectl describe pod "$POD_NAME" -n "$NS" | sed -n '/^Events:/,$ p' | head -30

echo ""
echo "[3] Node resource overview"
kubectl top nodes 2>/dev/null || echo "(metrics-server not installed; showing Allocatable only)"
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory'

echo ""
echo "[4] Node taints"
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints' 2>/dev/null

echo ""
echo "[5] Pod scheduling constraints"
kubectl get pod "$POD_NAME" -n "$NS" -o yaml 2>/dev/null \
  | grep -E "nodeSelector|nodeName|tolerations|affinity" -A 6

echo ""
echo "[6] PVC status"
kubectl get pvc -n "$NS" 2>/dev/null

echo ""
echo "[7] ResourceQuota"
kubectl get resourcequota -n "$NS" 2>/dev/null

echo ""
echo "[8] Latest cluster-wide Warning events (last 20)"
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' 2>/dev/null \
  | tail -20

echo ""
echo "[9] Core component health"
kubectl get pods -n kube-system \
  | grep -E "scheduler|controller|coredns|calico|cilium|kube-proxy|flannel" \
  | grep -v Running

echo ""
echo "=========================================="
echo " Diagnosis finished. Use the Events section to pin down the root cause."
echo "=========================================="

2. Decision tree for Pending causes

Pod stuck in Pending?
    │
    ▼
┌─────────────────────────────┐
│ kubectl describe pod <name> │
│ read the last Events entry  │
└──────────────┬──────────────┘
               │
    ┌──────────┴───┬──────────────┬──────────────┬──────────────┬──────────────┐
    ▼              ▼              ▼              ▼              ▼              ▼
Insufficient   didn't match   taint not      unbound        exceeded       (no clear
cpu/memory     nodeSelector   tolerated      PVC            quota          Events)
    │              │              │              │              │              │
    ▼              ▼              ▼              ▼              ▼              ▼
kubectl top    check labels   add Toleration kubectl        raise quota    check cluster
nodes          and affinity   or drop taint  get pvc                       components
    │              │              │              │              │              │
    ▼              ▼              ▼              ▼              ▼              ▼
add nodes /    fix selector / edit Pod spec  create PV or   patch quota    restart
lower requests affinity, or                  fix the SC     or delete it   scheduler or
               add nodes                                                   controller-manager

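3. Common Events messages

Events message	Meaning	What to do
`Insufficient cpu`	Not enough CPU	Add CPU capacity / lower requests / evict low-priority Pods
`Insufficient memory`	Not enough memory	Add memory / lower requests / evict
`didn't match node selector`	Selector does not match	Check nodeSelector / label the nodes
`had taint ... that the pod didn't tolerate`	Taint not tolerated	Add a Toleration or remove the taint
`unbound PersistentVolumeClaims`	PVC not bound	Create a PV / check the StorageClass
`exceeded quota`	Namespace quota exceeded	Raise the quota or free resources
`(no Events at all)`	The scheduler may be down	Check the kube-scheduler status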

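V. Prevention

1. Pod design best practices

Practice	Notes
✅ Set sensible requests	Not too small (nodes get overcommitted, leading to OOM kills) and not too large (the Pod cannot be scheduled); base them on measured usage
✅ Set limits	Always set limits so a single Pod cannot saturate a node
✅ Configure PriorityClass	Critical workloads are scheduled first; low-priority jobs can be preempted
✅ Use PodDisruptionBudget	Protects the minimum number of available replicas of critical applications
✅ Spread across availability zones	Use topologySpreadConstraints for topology spreading
✅ Combine PDB with priorities	Keeps SLAs intact during upgrades and maintenance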

2. Monitoring and alerting rules

groups:
- name: pod_pending_alerts
  rules:
  # Pod Pending for too long
  - alert: PodPendingTooLong
    expr: kube_pod_status_phase{phase="Pending"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending for more than 10 minutes"
      description: "The Pod cannot be scheduled; run kubectl describe pod to investigate"

  # Node resource utilization
  - alert: NodeHighCPUUtilization
    expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU utilization of node {{ $labels.instance }} is above 85%"
      description: "Currently {{ $value }}%; new Pods may fail to schedule"

  - alert: NodeHighMemoryUtilization
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory utilization of node {{ $labels.instance }} is above 85%"
      description: "Currently {{ $value }}%; new Pods may fail to schedule"

  # Early warning: ResourceQuota almost exhausted
  - alert: ResourceQuotaNearlyExhausted
    expr: kube_resourcequota{resource="pods",type="used"} / ignoring(type) kube_resourcequota{resource="pods",type="hard"} > 0.9
    for: 5m
    labels:
      severity: info
    annotations:
      summary: "Namespace {{ $labels.namespace }} has used more than 90% of its Pod quota"

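3. Architecture-level safeguards

# Priority definitions: make sure core workloads go first
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Core trading system first"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "Low priority for batch jobs"
---
# Reference it from the core Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-trade-service
spec:
  selector:                      # selector and labels added so the manifest is valid;
    matchLabels:                 # the label value is illustrative
      app: core-trade-service
  template:
    metadata:
      labels:
        app: core-trade-service
    spec:
      priorityClassName: high-priority  # ← the key line
      containers:
      - name: app
        image: core-trade:v2.1
---
# A sensibly configured HPA (with an upper bound)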
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: safe-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20          # ← always set an upper bound to prevent runaway scaling
  behavior:                # ← cool-down windows to prevent flapping
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory        # ← watch memory as well!
      target:
        type: Utilization
        averageUtilization: 80

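VI. Summary

A mnemonic for Pending troubleshooting:

describe for Events, top nodes for resources,
taints and selectors for constraints, and don't forget PVC binding,
resourcequota for limits, events -A for the whole cluster.
Walk the six steps and the root cause of Pending is pinned down.

How to remember the difference from CrashLoopBackOff:

Pending = has not boarded yet (a scheduling-stage problem) → check resources and constraints
CrashLoopBackOff = boarded, then passed out (a runtime problem) → check logs and exit codes

Used together, the two troubleshooting guides cover most of the common problems in the Pod lifecycle 👍

Part 01: Pending troubleshooting (this article) \ Part 02: CrashLoopBackOff troubleshooting
For more hands-on troubleshooting, follow 「小刘」 for updates 🐾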