Scheduling
In Kubernetes we rarely create Pods directly; in most cases Pod replicas are created, scheduled, and lifecycle-managed automatically by controllers such as RCs (ReplicationControllers) and Deployments.
Fully Automatic Scheduling
One of the main functions of a Deployment is to automatically deploy multiple replicas of a containerized application, continuously monitor the replica count, and always maintain the user-specified number of replicas in the cluster.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
```
- View the Deployment

```bash
[root@master1 pod]# kubectl get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           4m32s
```
This status shows that the Deployment has created 3 replicas and that all of them are available.
- View the ReplicaSet and Pods

```bash
[root@master1 pod]# kubectl get rs
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-5d59d67564   3         3         3       6m35s
[root@master1 pod]# kubectl get pod
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-5d59d67564-5thcv   1/1     Running   0          6m47s
nginx-deployment-5d59d67564-9pgcf   1/1     Running   0          6m47s
nginx-deployment-5d59d67564-dvfp5   1/1     Running   0          6m47s
```
The scheduling of these 3 Pods is fully automatic: the placement is computed entirely by the Scheduler's algorithms.
NodeSelector
In some cases a Pod must be scheduled onto a specific Node; this can be achieved through Node labels.
```bash
# Label node2
[root@master1 pod]# kubectl label nodes node2 zone=north
node/node2 labeled
```
```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: redis-master
  labels:
    name: redis-master
spec:
  replicas: 1
  selector:
    name: redis-master
  template:
    metadata:
      labels:
        name: redis-master
    spec:
      containers:
      - name: master
        image: kubeguide/redis-master
        ports:
        - containerPort: 6379
      nodeSelector:
        zone: north
```
Add a nodeSelector to the Pod spec, then apply the manifest and check the result:
```bash
[root@master1 pod]# kubectl apply -f 14.yaml
replicationcontroller/redis-master created
[root@master1 pod]# kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
redis-master-qnbkm   1/1     Running   0          35s   10.244.104.3   node2   <none>           <none>
```
Note that if a Pod specifies a nodeSelector and no node in the cluster carries the corresponding label, the Pod cannot be scheduled and stays Pending.
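One quick way to observe this (a hypothetical walkthrough that undoes the zone=north label set above; the Pod suffix is a placeholder):

```bash
# Remove the zone label from node2 (a trailing "-" deletes a label)
kubectl label nodes node2 zone-
# Recreate the RC; with no node labeled zone=north the Pod stays Pending
kubectl delete -f 14.yaml && kubectl apply -f 14.yaml
kubectl get pods
# The events show a FailedScheduling reason
kubectl describe pod redis-master-<pod-suffix>
```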
Node Affinity

There are two ways to express node affinity:
- requiredDuringSchedulingIgnoredDuringExecution: the specified rules must be satisfied before the Pod can be scheduled onto a Node; a hard requirement.
- preferredDuringSchedulingIgnoredDuringExecution: the scheduler prefers Nodes that satisfy the specified rules and tries to place the Pod there, but does not insist; a soft requirement.
IgnoredDuringExecution means that if a Node's labels change while the Pod is running and no longer satisfy the Pod's node affinity requirements, the system ignores the change and the Pod keeps running on that Node.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: beta.kubernetes.io/arch
            operator: In
            values:
            - amd64
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
  containers:
  - name: with-node-affinity
    image: registry.aliyuncs.com/google_containers/pause:3.1
```
The configuration above uses the In operator. The nodeAffinity syntax supports the operators In, NotIn, Exists, DoesNotExist, Gt, and Lt. There is no dedicated node anti-affinity feature, but NotIn and DoesNotExist achieve the same effect.
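For example, a minimal sketch (reusing the zone=north label from the NodeSelector section) that keeps a Pod off those nodes with NotIn:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: not-in-north
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: NotIn   # schedule anywhere EXCEPT nodes labeled zone=north
            values:
            - north
  containers:
  - name: nginx
    image: nginx
```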
Notes on nodeAffinity rules (the OR/AND semantics are illustrated by the sketch after this list):

- If both nodeSelector and nodeAffinity are defined, both conditions must be satisfied before the Pod can run on the targeted Node.
- If nodeAffinity specifies multiple nodeSelectorTerms, it is enough for one of them to match.
- If a nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of them for the Pod to run there.
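A minimal spec fragment illustrating these semantics (the zone and disk-type labels are assumptions reused from earlier in this section):

```yaml
# A node matches if it satisfies EITHER term (terms are ORed);
# within one term, ALL matchExpressions must hold (ANDed).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:          # term 1: zone=north AND disk-type=ssd
        - key: zone
          operator: In
          values: ["north"]
        - key: disk-type
          operator: In
          values: ["ssd"]
      - matchExpressions:          # term 2: zone=south alone is enough
        - key: zone
          operator: In
          values: ["south"]
```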
Pod Affinity and Anti-Affinity

This feature limits the nodes a Pod can run on from a different angle: the decision is based on the labels of the Pods already running on a node rather than on the node's own labels, so the match involves both node and Pod conditions.
- Reference Pod

Note: this Pod carries the labels security=S1 and app=nginx.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-flag
  labels:
    security: "S1"
    app: "nginx"
spec:
  containers:
  - name: nginx
    image: nginx
```

Run it:

```bash
[root@master1 pod]# kubectl get pod -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
pod-flag   1/1     Running   0          8m57s   10.244.166.164   node1   <none>           <none>
```
- Affinity scheduling

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: nginx
```

View the result:

```bash
[root@master1 pod]# kubectl get pod -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
pod-affinity   1/1     Running   0          4s    10.244.166.166   node1   <none>           <none>
pod-flag       1/1     Running   0          45m   10.244.166.164   node1   <none>           <none>
```
Both Pods end up on the same node, node1.
Anti-Affinity

Now create a Pod that should not run on the same Node as the target Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: anti-affinity
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: kubernetes.io/hostname
  containers:
  - name: anti-affinity
    image: nginx
```
Apply it and check the result:
```bash
[root@master1 pod]# kubectl get pod -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
anti-affinity   1/1     Running   0          20s   10.244.104.6     node2   <none>           <none>
pod-flag        1/1     Running   1          23h   10.244.166.167   node1   <none>           <none>
```
The new Pod is assigned to a different node from the pod-flag Pod.
Taints and Tolerations

A taint lets a Node repel Pods. Taints work together with tolerations: Pods avoid the Nodes whose taints they do not tolerate. Taint information is set on a Node with the kubectl taint command.

Set a taint on node2:
```bash
[root@master1 pod]# kubectl taint nodes node2 key=value:NoSchedule
node/node2 tainted
# Check whether node2 now carries the taint
[root@master1 pod]# kubectl describe node node2
Name:               node2
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=node2
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=worker
                    zone=north
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.40.183/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.104.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 07 Mar 2024 08:46:02 -0500
Taints:             key=value:NoSchedule
```
Now try to schedule a Pod onto node2:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-taints
spec:
  nodeSelector:
    kubernetes.io/hostname: node2
  containers:
  - name: nginx
    image: nginx
```
After applying it:
```bash
[root@master1 pod]# kubectl describe pod pod-taints
Name:         pod-taints
Namespace:    pod-ns
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  nginx:
    Image:        nginx
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9p2fw (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-9p2fw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9p2fw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/hostname=node2
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  32s   default-scheduler  0/4 nodes are available: 1 node(s) didn't match Pod's node affinity, 1 node(s) had taint {key: value}, that the pod didn't tolerate, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
```
Scheduling fails because the Pod has no toleration for the taint key=value:NoSchedule.
Update the YAML file to add a toleration:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-taints
spec:
  nodeSelector:
    kubernetes.io/hostname: node2
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  containers:
  - name: nginx
    image: nginx
```
Apply it and check the result:
```bash
[root@master1 pod]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
pod-taints   1/1     Running   0          20s
```
A toleration matches a taint when:

- the operator is Exists (no value needs to be specified), or
- the operator is Equal and the values are equal.

If no operator is specified, it defaults to Equal.
The effect field determines what the taint does (see the sketch after this list):

- effect=NoSchedule: the scheduler will not place Pods on this node.
- effect=PreferNoSchedule: the scheduler tries not to place Pods on this node.
- effect=NoExecute: Pods already running on the node are evicted, and no further Pods are scheduled there (tolerationSeconds can be added to delay the eviction).
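For instance, a sketch (not part of the walkthrough above) of an Exists toleration combined with tolerationSeconds under a NoExecute taint:

```yaml
# Tolerate any taint with key "key" regardless of value (Exists needs no value);
# for a NoExecute taint, remain on the node at most 60s after the taint appears.
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
```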
Priority-Based Scheduling

Some workloads are more important than others; when cluster resources run short, the important ones must keep running normally. To declare that one workload is "more important" than another, the following dimensions can be used:
- Priority: the priority value.
- Eviction: Pod priorities are weighed together; among Pods of equal priority, the one whose resource usage exceeds its resource request by the largest factor is evicted first.
- Preemption: the Scheduler may evict some lower-priority Pods to satisfy the scheduling goal.
- QoS: the quality-of-service class.
- Other system-defined metrics.
Create a PriorityClass; PriorityClass objects are cluster-scoped and do not belong to any namespace:
```yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: hight-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
```
Apply and view it:

```bash
[root@master1 pod]# kubectl apply -f 20.yaml
priorityclass.scheduling.k8s.io/hight-priority created
# View it
[root@master1 pod]# kubectl get priorityClass
NAME             VALUE     GLOBAL-DEFAULT   AGE
hight-priority   1000000   false            7s
```
The priority value here is 1000000; the larger the number, the higher the priority.
Use the priority class in a Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx-priority
    image: nginx
  priorityClassName: hight-priority
```
Apply it and check the result:
```bash
[root@master1 pod]# kubectl apply -f 21.yaml
pod/nginx created
# Check the result
[root@master1 pod]# kubectl describe pod nginx
Name:                 nginx
Namespace:            pod-ns
Priority:             1000000
Priority Class Name:  hight-priority
Node:                 node1/192.168.40.182
Start Time:           Thu, 14 Mar 2024 09:39:25 -0400
Labels:               env=test
Annotations:          cni.projectcalico.org/podIP: 10.244.166.176/32
                      cni.projectcalico.org/podIPs: 10.244.166.176/32
Status:               Running
IP:                   10.244.166.176
IPs:
  IP:  10.244.166.176
Containers:
  nginx-priority:
    Container ID:   docker://fc7465fd346a5f4b50958b152f48b0c368dc8608967db92b22fa1a1c6f31c320
    Image:          nginx
    Image ID:       docker-pullable://nginx@sha256:0d17b565c37bcbd895e9d92315a05c1c3c9a29f762b011a10c54a66cd53c9b31
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 14 Mar 2024 09:39:27 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9p2fw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-9p2fw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9p2fw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m15s  default-scheduler  Successfully assigned pod-ns/nginx to node1
  Normal  Pulling    2m14s  kubelet            Pulling image "nginx"
  Normal  Pulled     2m13s  kubelet            Successfully pulled image "nginx" in 752.154707ms
  Normal  Created    2m13s  kubelet            Created container nginx-priority
  Normal  Started    2m13s  kubelet            Started container nginx-priority
```
If preemption happens, a high-priority Pod can preempt node N and evict lower-priority Pods from it; the high-priority Pod's status then records the name of target node N in the nominatedNodeName field.
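To inspect this field directly (a sketch; the Pod name is a placeholder):

```bash
# Print the node the scheduler has nominated for a pending high-priority Pod
kubectl get pod <pod-name> -o jsonpath='{.status.nominatedNodeName}'
```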
If resources are insufficient, consider scaling the cluster out first, and rely on priorities only after that.
DaemonSet
A DaemonSet manages Pods so that exactly one replica runs on each Node in the cluster.

Typical needs:
- log collection
- performance monitoring
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-daemonset
spec:
  selector:
    matchLabels:
      app: nginx-daemon
  template:
    metadata:
      labels:
        app: nginx-daemon
    spec:
      containers:
      - name: nginx
        image: nginx:1.21.6
        ports:
        - containerPort: 80
          hostPort: 80
        volumeMounts:
        - mountPath: /var/log/nginx
          name: nginx-logs
      volumes:
      - name: nginx-logs
        emptyDir: {}
```
Apply it and check the result:
```bash
[root@master1 pod]# kubectl get daemonset
NAME              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nginx-daemonset   2         2         0       2            0           <none>          14s
# View the pods
[root@master1 pod]# kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
nginx-daemonset-47gkk   1/1     Running   0          3m28s
nginx-daemonset-4bqcs   1/1     Running   0          3m28s
```
In Kubernetes, a DaemonSet's update strategy controls how the Pods in the DaemonSet are updated. A DaemonSet ensures that each node in the cluster runs one replica of the specified Pod, and when you modify the DaemonSet's configuration (such as the container image or Pod parameters), the update strategy decides how old Pods are replaced by new ones.

A DaemonSet supports the following two update strategies:
- OnDelete
With OnDelete, Kubernetes does not update a Pod until the user manually deletes the old DaemonSet Pod; as long as an old Pod exists and has not been deleted, no new Pod is started in its place.
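A minimal sketch of selecting this strategy (mirroring the RollingUpdate manifest shown below):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemonset
spec:
  updateStrategy:
    type: OnDelete   # new Pods start only after old ones are deleted by hand
```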
- RollingUpdate (the default strategy)
This is the default for DaemonSet updates. With RollingUpdate, once the DaemonSet's configuration template is updated, the system terminates old DaemonSet Pods in a controlled fashion and automatically creates new ones.
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemonset
spec:
  updateStrategy:
    type: RollingUpdate
```
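The pace of a rolling update can be tuned as well; for example (a sketch, the value is an assumption), rollingUpdate.maxUnavailable caps how many nodes may have their Pod down at once:

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one node's DaemonSet Pod is down during the rollout
```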
Batch Scheduling

A batch job usually starts multiple compute processes, in parallel or serially, to work through a batch of work items; once they are all processed, the batch job ends. There are several batch-processing patterns:
- Job Template Expansion: one Job per work item. Usually suited to scenarios with few work items where each work item processes a large amount of data.

Template:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM
  labels:
    jobgroup: jobexample
spec:
  template:
    metadata:
      name: jobexample
      labels:
        jobgroup: jobexample
    spec:
      containers:
      - name: c
        image: busybox
        command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
      restartPolicy: Never
```
Generate the Job files from the template:
```bash
# Create a jobs directory
mkdir jobs
# Generate one file per item from the template
for i in apple banana cherry; do cat 3.9_6.yaml | sed "s/\$ITEM/$i/" > ./jobs/job-$i.yaml; done
# List the files
[root@master1 pod]# ls jobs/
job-apple.yaml  job-banana.yaml  job-cherry.yaml
# Create the Jobs
[root@master1 pod]# kubectl apply -f jobs/
job.batch/process-item-apple created
job.batch/process-item-banana created
job.batch/process-item-cherry created
# Check them
[root@master1 pod]# kubectl get jobs -l jobgroup=jobexample
NAME                  COMPLETIONS   DURATION   AGE
process-item-apple    1/1           23s        5m32s
process-item-banana   1/1           38s        5m32s
process-item-cherry   1/1           55s        5m32s
```
- Queue with Pod Per Work Item: a queue stores the work items and consumer Pods drain it, with one Pod handling exactly one work item.
- Queue with Variable Pod Count: each worker checks whether the queue still holds pending work items; if so, it takes one and processes it, otherwise it concludes that all work is done and exits. (A Job sketch covering both queue patterns follows this list.)
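In Job terms, these two queue patterns map onto the completions and parallelism fields. A minimal sketch, assuming a hypothetical worker image that pulls items from a shared queue:

```yaml
# Sketch of a queue-consuming Job (name and image are assumptions):
# - Pod per work item: set completions to the number of items.
# - Variable pod count: omit completions; each worker exits when the queue is empty.
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-workers
spec:
  parallelism: 3        # run up to 3 worker Pods at once
  # completions: 10     # uncomment for the one-Pod-per-work-item variant
  template:
    spec:
      containers:
      - name: worker
        image: my-queue-worker   # assumed image that drains the shared queue
      restartPolicy: OnFailure
```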
CronJob
Kubernetes provides CronJob, a scheduled task similar to Linux cron.
```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
```
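The schedule field uses the standard five-field cron format (minute, hour, day of month, month, day of week), so "*/1 * * * *" fires every minute. If runs may overlap, the CronJob spec also accepts concurrencyPolicy; a minimal sketch:

```yaml
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid   # skip a new run while the previous one is still active
                              # (other accepted values: Allow, the default, and Replace)
```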
Apply it and watch the result:
```bash
# Watch the CronJob
[root@master1 pod]# kubectl get cronjob hello -w
NAME    SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hello   */1 * * * *   False     0        46s             95s
hello   */1 * * * *   False     1        3s              112s
hello   */1 * * * *   False     0        13s             2m2s
hello   */1 * * * *   False     1        3s              2m52s
hello   */1 * * * *   False     0        13s             3m2s
hello   */1 * * * *   False     1        3s              3m52s
```