Part 1: Scheduling and Affinity
1. What does the Pod scheduling process look like?
Answer:
The Kubernetes scheduler places a Pod in two phases: filtering (predicates) and scoring (priorities).
Scheduling flow:
Pod created → scheduler watches the new Pod → filtering phase → scoring phase → bind to a node
Filtering phase (Predicates):
Nodes that cannot run the Pod are filtered out:
PodFitsResources
yaml
# Check whether the node has enough allocatable resources for the Pod
resources:
  requests:
    cpu: 2
    memory: 4Gi
  limits:
    cpu: 4
    memory: 8Gi
# The node's allocatable resources must be >= requests
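To judge whether PodFitsResources will pass on a given node, compare the node's allocatable capacity with what is already requested. A minimal sketch (the node name node1 is a placeholder):
bash
# Allocatable capacity of the node
kubectl describe node node1 | grep -A 6 "Allocatable:"
# Sum of resources already requested by Pods on the node
kubectl describe node node1 | grep -A 8 "Allocated resources:"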
PodFitsHostPorts
yaml
# Check whether the requested host port is already in use on the node
spec:
  containers:
  - name: web
    ports:
    - containerPort: 8080
      hostPort: 8080  # port 8080 must be free on the node
NodeSelector
yaml
# Check that the node's labels match the Pod's nodeSelector
spec:
  nodeSelector:
    disktype: ssd
    region: us-west
# The node must carry these labels
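The labels the nodeSelector refers to must already exist on a node, otherwise the Pod stays Pending. A minimal sketch for attaching them (node1 is a placeholder):
bash
kubectl label nodes node1 disktype=ssd region=us-west
# Verify which nodes now match
kubectl get nodes -l disktype=ssd,region=us-west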
NodeAffinity
yaml
# Check the node affinity rules
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
PodAffinity/PodAntiAffinity
yaml
# Check inter-Pod affinity and anti-affinity
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname
# Do not schedule onto nodes that already run a "web" Pod
Taints/Tolerations
yaml
# Check taint tolerations
spec:
  tolerations:
  - key: "node.kubernetes.io/memory-pressure"
    operator: "Exists"
    effect: "NoSchedule"
# The Pod must tolerate the node's taints
Scoring phase (Priorities):
Every node that passed filtering is scored (0-100 points):
LeastRequestedPriority
- Prefers nodes with low resource utilization
- Formula:
(capacity - requests) / capacity * 100
BalancedResourceAllocation
- Prefers nodes whose CPU and memory utilization are balanced
- Avoids exhausting one resource while the other sits idle
NodeAffinityPriority
- Scores nodes by the weights of the preferred node-affinity rules they match
ImageLocalityPriority
- Prefers nodes that already have the container image cached
- Reduces image pull time
SelectorSpreadPriority
- Spreads Pods of the same Service/ReplicaSet across different nodes
Final selection:
The highest-scoring node is chosen → the Pod is bound to it → the kubelet on that node starts the containers
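The outcome of the whole process is visible on the Pod itself. A minimal sketch for inspecting a scheduling decision (the Pod name myapp is a placeholder; event wording varies slightly between versions):
bash
# Which node was chosen
kubectl get pod myapp -o wide
# A "Scheduled" event on success, "FailedScheduling" with a reason otherwise
kubectl describe pod myapp | grep -A 10 "Events:"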
2. How do you use node affinity (Node Affinity)?
Answer:
Node affinity is an enhanced version of nodeSelector that offers much more flexible node selection.
Two affinity types:
requiredDuringSchedulingIgnoredDuringExecution (hard affinity)
Conditions that must be satisfied:
yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-required
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # The node must be in one of these zones
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
          # ...and must have SSD disks
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: myapp:latest
Operators supported in matchExpressions:
- In: the value is in the given list
- NotIn: the value is not in the given list
- Exists: the key exists (the value is not checked)
- DoesNotExist: the key does not exist
- Gt: greater than (numeric comparison)
- Lt: less than (numeric comparison)
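Gt and Lt compare the label value as an integer, which the other examples in this section do not show. A minimal sketch, assuming a custom node label cpu-cores that you maintain yourself:
bash
kubectl label nodes node1 cpu-cores=16

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: needs-big-node
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # only nodes whose cpu-cores label is greater than 8
          - key: cpu-cores
            operator: Gt
            values:
            - "8"
  containers:
  - name: app
    image: nginx
EOF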
preferredDuringSchedulingIgnoredDuringExecution (soft affinity)
Preference rules that are not mandatory:
yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-preferred
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      # Weight 100: prefer high-performance nodes
      - weight: 100
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - high-performance
      # Weight 50: otherwise prefer the local zone
      - weight: 50
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: myapp:latest
Weight calculation:
A node's affinity score = Σ(weights of the preferred rules it satisfies)
This score is added to the scores from the other priority functions, and the node with the highest total is chosen.
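For example, in a hypothetical two-node cluster: node A carries both node-type=high-performance and topology.kubernetes.io/zone=us-east-1a and scores 100 + 50 = 150 for this Pod, while node B carries only the zone label and scores 50, so node A is preferred.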
Combined example:
yaml
apiVersion: v1
kind: Pod
metadata:
  name: combined-affinity
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: must be a Linux node
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
      # Soft preference: prefer GPU nodes, then SSD nodes
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-tesla-v100
      - weight: 20
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: ml-training
    image: tensorflow:latest
3. What are the use cases for Pod affinity and anti-affinity?
Answer:
Pod affinity/anti-affinity makes scheduling decisions based on the labels of Pods that are already running.
Pod Affinity - co-location
Scenario: deploy the application on the same node as its cache
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis-cache
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx
topologyKey scopes:
- kubernetes.io/hostname: same node
- topology.kubernetes.io/zone: same availability zone
- topology.kubernetes.io/region: same region
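topologyKey only works when the nodes actually carry the corresponding labels (cloud providers normally set the topology.* labels automatically). A quick check:
bash
# Show the topology labels on each node
kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/region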
Pod AntiAffinity - spreading
Scenario 1: high-availability deployment that avoids a single point of failure
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-ha
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Hard requirement: every replica on a different node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx
Scenario 2: avoid resource contention
yaml
apiVersion: v1
kind: Pod
metadata:
  name: compute-intensive-app
spec:
  affinity:
    podAntiAffinity:
      # Soft preference: avoid sharing a node with other compute-intensive workloads
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: workload-type
              operator: In
              values:
              - compute-intensive
          topologyKey: kubernetes.io/hostname
  containers:
  - name: app
    image: compute-app:latest
    resources:
      requests:
        cpu: 8
        memory: 16Gi
A more complex topology-spreading example:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-tier-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: multi-tier
  template:
    metadata:
      labels:
        app: multi-tier
        tier: frontend
    spec:
      affinity:
        # Pod affinity: keep the frontend in the same zone as the backend
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: multi-tier
                  tier: backend
              topologyKey: topology.kubernetes.io/zone
        # Pod anti-affinity: spread frontend replicas across different nodes
        # (with 6 replicas and a required rule, the cluster needs at least 6 schedulable nodes)
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: multi-tier
                tier: frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - name: frontend
        image: frontend:latest
4. How do taints and tolerations work together?
Answer:
Taints mark a node so that Pods which cannot tolerate the taint are rejected; tolerations allow a Pod to be scheduled onto nodes that carry specific taints.
How it works:
node taint + Pod toleration = scheduling decision
Taint effects (Effect):
NoSchedule
- Pods without a matching toleration are not scheduled
- Pods already running on the node are unaffected
PreferNoSchedule
- The scheduler tries to avoid the node
- If there is no alternative, the Pod may still be scheduled
NoExecute
- New Pods are not scheduled
- Running Pods without a matching toleration are evicted
Setting node taints:
bash
# Add a taint
kubectl taint nodes node1 key=value:NoSchedule
# Example: mark a GPU node
kubectl taint nodes gpu-node1 nvidia.com/gpu=true:NoSchedule
# Example: mark a node for maintenance
kubectl taint nodes node2 maintenance=true:NoExecute
# Remove a taint
kubectl taint nodes node1 key:NoSchedule-
# Show a node's taints
kubectl describe node node1 | grep Taints
Pod toleration configuration:
yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  # Exact match on key, value, and effect
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  # Match only on key and effect
  - key: "example.com/special"
    operator: "Exists"
    effect: "NoSchedule"
  # Tolerate every taint (dangerous!)
  - operator: "Exists"
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0
Practical scenarios:
Scenario 1: dedicated nodes (GPU nodes)
bash
# Mark the GPU node
kubectl taint nodes gpu-node nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node accelerator=nvidia-tesla-v100
yaml
# Only GPU workloads can be scheduled onto the GPU node
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    accelerator: nvidia-tesla-v100
  containers:
  - name: training
    image: tensorflow-gpu:latest
Scenario 2: node maintenance
bash
# Prepare the node for maintenance and evict its Pods
kubectl taint nodes node1 maintenance=true:NoExecute
# Important Pods add a toleration so they are not evicted immediately
yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  tolerations:
  - key: "maintenance"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600  # still evicted after 1 hour
  containers:
  - name: app
    image: critical-app:latest
Scenario 3: isolating control-plane (master) nodes
bash
# Control-plane nodes carry a taint by default
kubectl describe node master | grep Taints
# Taints: node-role.kubernetes.io/control-plane:NoSchedule
# (clusters older than v1.24 use node-role.kubernetes.io/master:NoSchedule)
yaml
# A DaemonSet that tolerates the control-plane taint
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-agent
spec:
  selector:
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: prom/node-exporter
Scenario 4: reserving premium-tier nodes
bash
# Reserve the premium nodes
kubectl taint nodes premium-node tier=premium:NoSchedule
yaml
# A Pod allowed onto premium nodes
apiVersion: v1
kind: Pod
metadata:
  name: premium-app
spec:
  tolerations:
  - key: "tier"
    operator: "Equal"
    value: "premium"
    effect: "NoSchedule"
  containers:
  - name: app
    image: premium-app:latest
# Note: a toleration only permits scheduling onto the tainted node, it does not force it;
# combine it with a nodeSelector or node affinity (as in Scenario 1) to pin the Pod there.
Part 2: Resource Management and QoS
5. What are the Kubernetes QoS (Quality of Service) classes?
Answer:
Kubernetes assigns each Pod a QoS class automatically, based on its resource requests and limits; the class determines eviction priority when resources run short.
The three QoS classes:
1. Guaranteed - highest priority
Conditions:
- Every container sets both CPU and memory requests and limits
- requests == limits
yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m       # must equal requests
        memory: 512Mi   # must equal requests
Characteristics:
- Resource reservations are guaranteed
- CPU is never throttled below the reserved amount (requests == limits)
- Least likely to be evicted
Typical workloads:
- Core production applications
- Databases
- Performance-sensitive services
2. Burstable - medium priority
Conditions:
- At least one container sets requests or limits
- The Guaranteed conditions are not met
yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m       # limits > requests
        memory: 512Mi
Characteristics:
- The minimum (requests) is guaranteed
- Can use extra resources, up to limits
- May be throttled or reclaimed when the node is under pressure
- Higher priority than BestEffort
Typical workloads:
- Web applications
- API services
- Applications with fluctuating traffic
3. BestEffort - lowest priority
Conditions:
- No container sets any requests or limits
yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx
    # no resources section at all
Characteristics:
- No resource guarantees
- Can consume whatever is left over on the node
- First to be evicted
Typical workloads:
- Batch jobs
- Low-priority background tasks
- Test environments
Eviction order (when the node runs out of resources):
BestEffort → Burstable (the portion exceeding requests) → Guaranteed
Checking a Pod's QoS class:
bash
kubectl get pod burstable-pod -o jsonpath='{.status.qosClass}'
# Output: Burstable
kubectl describe pod guaranteed-pod | grep "QoS Class"
# QoS Class: Guaranteed
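To audit a whole namespace at once, a custom-columns query works as a quick sketch (adjust the namespace as needed):
bash
kubectl get pods -n dev -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass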
6. How do you implement resource quotas (ResourceQuota)?
Answer:
A ResourceQuota caps the total resource consumption of a namespace and prevents resource abuse.
ResourceQuota configuration:
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    # Compute resources
    requests.cpu: "10"        # total CPU requests may not exceed 10 cores
    requests.memory: 20Gi     # total memory requests may not exceed 20GB
    limits.cpu: "20"          # total CPU limits may not exceed 20 cores
    limits.memory: 40Gi       # total memory limits may not exceed 40GB
    # Object counts
    pods: "50"                      # at most 50 Pods
    services: "10"                  # at most 10 Services
    persistentvolumeclaims: "20"    # at most 20 PVCs
    # Storage
    requests.storage: 100Gi   # total storage requests may not exceed 100GB
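When a new object would push the namespace over the quota, the API server rejects it at admission time. A minimal sketch, assuming requests.cpu in the dev namespace is already at 8 of the 10-core cap (as in the sample output further below); the exact error wording varies by version:
bash
kubectl apply -n dev -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: quota-buster
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "4"
        memory: 1Gi
      limits:
        cpu: "4"
        memory: 1Gi
EOF
# Error from server (Forbidden): ... exceeded quota: compute-quota,
#   requested: requests.cpu=4, used: requests.cpu=8, limited: requests.cpu=10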
Quota scoped to a specific StorageClass:
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: production
spec:
  hard:
    # SSD storage quota
    ssd.storageclass.storage.k8s.io/requests.storage: 50Gi
    ssd.storageclass.storage.k8s.io/persistentvolumeclaims: "10"
    # Standard storage quota
    standard.storageclass.storage.k8s.io/requests.storage: 200Gi
Quota scoped to a PriorityClass:
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: priority-quota
  namespace: production
spec:
  hard:
    pods: "100"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high"]
Checking quota usage:
bash
# List quotas
kubectl get resourcequota -n dev
# Detailed view
kubectl describe resourcequota compute-quota -n dev
Sample output:
Name:            compute-quota
Namespace:       dev
Resource         Used  Hard
--------         ----  ----
limits.cpu       15    20
limits.memory    30Gi  40Gi
pods             35    50
requests.cpu     8     10
requests.memory  16Gi  20Gi
7. What does a LimitRange do?
Answer:
A LimitRange sets default resource values and min/max constraints for Pods and containers in a namespace.
LimitRange configuration:
yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: dev
spec:
  limits:
  # Container-level constraints
  - type: Container
    max:
      cpu: "4"        # maximum CPU per container
      memory: 8Gi     # maximum memory per container
    min:
      cpu: 100m       # minimum CPU per container
      memory: 128Mi   # minimum memory per container
    default:
      cpu: 500m       # default limits
      memory: 512Mi
    defaultRequest:
      cpu: 200m       # default requests
      memory: 256Mi
    maxLimitRequestRatio:
      cpu: 4          # limits/requests ratio must not exceed 4
      memory: 2
  # Pod-level constraints
  - type: Pod
    max:
      cpu: "8"
      memory: 16Gi
  # PVC size constraints
  - type: PersistentVolumeClaim
    max:
      storage: 10Gi
    min:
      storage: 1Gi
Defaults are applied automatically:
yaml
# A Pod that does not specify resources
apiVersion: v1
kind: Pod
metadata:
  name: default-resources-pod
spec:
  containers:
  - name: app
    image: nginx
    # The LimitRange defaults are injected automatically:
    # requests.cpu: 200m
    # requests.memory: 256Mi
    # limits.cpu: 500m
    # limits.memory: 512Mi
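You can confirm that the admission controller injected the defaults by reading the stored spec back; a quick sketch (output formatting differs slightly between kubectl versions):
bash
kubectl get pod default-resources-pod -n dev -o jsonpath='{.spec.containers[0].resources}'
# {"limits":{"cpu":"500m","memory":"512Mi"},"requests":{"cpu":"200m","memory":"256Mi"}}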
Validating the constraints:
bash
# Creating a Pod that violates the LimitRange is rejected
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: exceed-limits
  namespace: dev
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 1000m    # exceeds maxLimitRequestRatio (1000m / 100m = 10 > 4)
        memory: 128Mi
EOF
# Error: the CPU limit-to-request ratio exceeds 4
Part 3: Health Checks and Self-Healing
8. What are the differences and use cases of the three health-check probes?
Answer:
Kubernetes provides three probes for checking container health: liveness, readiness, and startup.
Liveness Probe
Purpose: checks whether the container is still alive; on failure the kubelet restarts the container.
Use cases:
- The application is deadlocked but the process keeps running
- A memory leak has made the application unresponsive
- A failed dependency has left the application hung
HTTP probe:
yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  containers:
  - name: app
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      initialDelaySeconds: 30   # wait 30 seconds after the container starts
      periodSeconds: 10         # probe every 10 seconds
      timeoutSeconds: 5         # 5-second timeout per probe
      successThreshold: 1       # one success marks the container healthy
      failureThreshold: 3       # three consecutive failures trigger a restart
TCP probe:
yaml
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 15
  periodSeconds: 10
Command (exec) probe:
yaml
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
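A quick way to watch a liveness restart happen is the classic busybox pattern: create the file the probe checks, then delete it after a while. A sketch:
bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: demo
    image: busybox
    # healthy for 30s, then the probe file disappears and the probe starts failing
    args: ["/bin/sh", "-c", "touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600"]
    livenessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      initialDelaySeconds: 5
      periodSeconds: 5
EOF
# Watch the RESTARTS column increase once the probe has failed 3 times (the default failureThreshold)
kubectl get pod liveness-demo -w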
Readiness Probe
Purpose: checks whether the container is ready to receive traffic; on failure the Pod is removed from the Service endpoints.
Use cases:
- The application starts slowly and must load data first
- A dependent service is not ready yet
- The application temporarily cannot serve requests (e.g. while warming a cache)
yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-http
spec:
  containers:
  - name: app
    image: myapp:latest
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
Effect:
Pod starts → readiness check fails → not added to the Service endpoints
↓
Readiness check succeeds → added to the Service endpoints → receives traffic
↓
Readiness check fails → removed from the Service → stops receiving traffic
↓
Readiness check succeeds → re-added to the Service
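You can watch this from the Service side. A sketch, assuming a Service named web-svc whose selector matches these Pods (the name is a placeholder):
bash
# Only Pods whose readiness probe currently passes are listed as endpoints
kubectl get endpoints web-svc
# The READY column (0/1 vs 1/1) reflects each Pod's readiness probe
kubectl get pods -o wide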
Startup Probe
Purpose: checks whether the application has finished starting; liveness and readiness are held back until it succeeds, and if it never succeeds the container is restarted.
Use cases:
- Applications with long startup times
- Preventing the liveness probe from killing the container before startup completes
yaml
apiVersion: v1
kind: Pod
metadata:
  name: startup-probe
spec:
  containers:
  - name: slow-start-app
    image: slow-app:latest
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      failureThreshold: 30    # allow up to 300 seconds for startup
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
Probe execution order:
Startup probe running → liveness/readiness probes are disabled
↓
Startup probe succeeds → liveness/readiness probes are enabled
↓
Startup probe fails failureThreshold times → the container is restarted
Best-practice example:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: production
  template:
    metadata:
      labels:
        app: production
    spec:
      containers:
      - name: app
        image: myapp:v2.0
        ports:
        - containerPort: 8080
        # Startup probe: the app starts slowly, give it enough time
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 60    # up to 5 minutes of startup time
        # Liveness probe: detect deadlocks
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 0    # starts right after the startup probe succeeds
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        # Readiness probe: check whether requests can be served
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 2
          successThreshold: 1
          failureThreshold: 2
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
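Once this is deployed, probe failures surface as events on the Pods; a quick sketch for spotting them (event wording varies by version):
bash
# "Unhealthy" events record liveness/readiness/startup probe failures
kubectl get events --field-selector reason=Unhealthy --sort-by=.lastTimestamp
kubectl describe pod -l app=production | grep -A 10 "Events:"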
Example health-check endpoint implementation (Go):
go
package main

import (
    "net/http"
    "sync/atomic"
    "time"
)

var (
    startTime = time.Now()
    isReady   atomic.Bool // atomic (Go 1.19+) to avoid a data race between the goroutine and the handlers
)

func main() {
    // Simulate a slow startup: the app becomes ready after 30 seconds
    go func() {
        time.Sleep(30 * time.Second)
        isReady.Store(true)
    }()
    http.HandleFunc("/startup", startupHandler)
    http.HandleFunc("/healthz", healthzHandler)
    http.HandleFunc("/ready", readyHandler)
    http.ListenAndServe(":8080", nil)
}

func startupHandler(w http.ResponseWriter, r *http.Request) {
    if time.Since(startTime) < 30*time.Second {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func healthzHandler(w http.ResponseWriter, r *http.Request) {
    // Check critical dependencies (database, cache, ...); simplified here to always report healthy
    w.WriteHeader(http.StatusOK)
}

func readyHandler(w http.ResponseWriter, r *http.Request) {
    if !isReady.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}