In the last few days I have run into a problem where the k8s scheduler's node-affinity check on a PV appears to have no effect. I cannot find the cause myself, so I am posting here to ask for help.
Background and symptoms
In an offline LAN in our lab, I used kubeadm to build a 1 master + 3 worker k8s cluster on arm64 servers running a domestic Linux distribution, on the standard v1.29.10 release. In the cluster I installed the rancher.io/local-path local StorageClass, as shown below:
bash
[root@server01 ~]# k get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready control-plane 167d v1.29.10
node01 Ready <none> 167d v1.29.10
node02 Ready <none> 167d v1.29.10
node03 Ready <none> 167d v1.29.10
[root@server01 ~]# k get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path rancher.io/local-path Delete WaitForFirstConsumer false 165d
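For completeness, the provisioner itself runs as a Deployment. Assuming the upstream default manifest was used, its health can be checked roughly like this (the namespace and Deployment name below are the upstream defaults, i.e. assumptions, not taken from my cluster dump):
bash
# Check the local-path provisioner itself (namespace/name are the upstream defaults; adjust if customized)
kubectl -n local-path-storage get deploy,pods
kubectl -n local-path-storage logs deploy/local-path-provisioner --tail=50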
I deployed a commercial database application we purchased into this cluster. One of its StatefulSet workloads has Pods that request PVCs of the local-path class. The first time the workload came up, the PV provisioned and bound for the Pod was located on node02, backed by a subdirectory under /opt/local-path-provisioner/ on that node. The PV's node affinity (nodeAffinity) is also perfectly clear: it requires a node whose name is node02, and the Pod was indeed scheduled onto node02 at first.
But at some point later, after this workload's Pod was terminated for some reason and rescheduled, I was surprised to find it had been scheduled onto node03, and an identically named subdirectory appeared under /opt/local-path-provisioner/ on node03 as well. As a result, the persistent data stored in the PV on node02 was simply gone in the new Pod.
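A quick way to confirm the duplicated directory is to compare the provisioner's data directory on both nodes (this assumes SSH access from the management host to the workers):
bash
# Compare the local-path data directories on the two nodes (SSH access is assumed)
ssh node02 'ls -l /opt/local-path-provisioner/'
ssh node03 'ls -l /opt/local-path-provisioner/'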
What surprised me even more: after raising the kube-scheduler log verbosity and deleting the Pod again to trigger rescheduling, the output of kubectl -n kube-system logs showed that when kube-scheduler checked whether this PV's nodeAffinity matched the three nodes, it decided that this PV, which is hard-pinned to node02, matched node01, node02 and node03 alike!
Following the kube-scheduler log I also read the Kubernetes v1.29.10 source code, pkg/scheduler/framework/plugins/volumebinding/binder.go, and I really cannot see where the problem is. So I can only turn to you all for advice, thanks.
Environment and software versions
- Basic cluster information
  - Physical nodes: ARM64 CPUs + a domestic Linux distribution
  - 4 nodes in total: 1 master + 3 workers
  - Kubernetes version: standard v1.29.10
bash
[root@server01 ~]# k get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready control-plane 167d v1.29.10
node01 Ready <none> 167d v1.29.10
node02 Ready <none> 167d v1.29.10
node03 Ready <none> 167d v1.29.10
The local-path StorageClass is shown below; the corresponding image is rancher/local-path-provisioner:v0.0.29-arm64. Its configured per-node persistence directory is /opt/local-path-provisioner/ and the volume binding mode is WaitForFirstConsumer.
bash
[root@server01 ~]# k get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path rancher.io/local-path Delete WaitForFirstConsumer false 165d
[root@server01 ~]# kubectl get sc local-path -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"local-path"},"provisioner":"rancher.io/local-path","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}
  creationTimestamp: "2024-11-12T10:45:21Z"
  name: local-path
  resourceVersion: "113298"
  uid: 6bdd36c8-3526-4a03-b54d-cc6e311eaee5
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
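The per-node data path shown above comes from the provisioner's ConfigMap. Assuming the upstream defaults (ConfigMap local-path-config in namespace local-path-storage; both names are assumptions, adjust to the actual install), it can be inspected like this:
bash
# Dump the provisioner config that defines the per-node data path (names/namespace are upstream defaults)
kubectl -n local-path-storage get configmap local-path-config -o jsonpath='{.data.config\.json}'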
The affected Pod is named ddb-napp-k8s-ha-cluster-dn-0-0 and its PVC is data-ddb-napp-k8s-ha-cluster-dn-0-0, as shown below (some irrelevant output is elided with ......; mainly the volumes information is kept). Note that the Pod's own node affinity is non-mandatory (preferredDuringSchedulingIgnoredDuringExecution) and keys off whether a node carries the label mydb.io/pod: mydb; in fact none of the three nodes has that label, so the Pod's own node affinity should have no influence on which node it gets scheduled to.
bash
# Pod information
[root@server01 common]# k -n mydb get pod ddb-napp-k8s-ha-cluster-dn-0-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ......
  creationTimestamp: "2025-04-27T03:12:13Z"
  generateName: ddb-napp-k8s-ha-cluster-dn-0-
  labels:
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: ddb-napp-k8s-ha-cluster-dn-0-f89d59448
    ......
    statefulset.kubernetes.io/pod-name: ddb-napp-k8s-ha-cluster-dn-0-0
  name: ddb-napp-k8s-ha-cluster-dn-0-0
  namespace: mydb
  ......
spec:
  # The Pod's own node affinity is a soft (preferred) one, keyed on whether the node carries the
  # label `mydb.io/pod: mydb`; in fact none of the three nodes satisfies this condition
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: mydb.io/pod
            operator: In
            values:
            - mydb
        weight: 1
  containers:
  - args:
    ......
  hostname: ddb-napp-k8s-ha-cluster-dn-0-0
  nodeName: node03
  ......
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-ddb-napp-k8s-ha-cluster-dn-0-0
  ......
status:
  ......
# PVC information
[root@server01 common]# k -n mydb get pvc data-ddb-napp-k8s-ha-cluster-dn-0-0 -o wide
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
data-ddb-napp-k8s-ha-cluster-dn-0-0 Bound pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b 100Gi RWO local-path <unset> 55d Filesystem
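To back up the claim that none of the nodes carries that label, a label-selector query comes back empty while a full label listing shows what the nodes actually have:
bash
# Expect "No resources found": no node carries the label used by the Pod's preferred affinity
kubectl get nodes -l mydb.io/pod=mydb
# Show the labels the nodes actually carry
kubectl get nodes --show-labels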
The bound PV is named pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b and its full spec is shown below. It explicitly states that the PV's node affinity (nodeAffinity) is a hard requirement (required): the node's name must be node02.
bash
[root@server01 common]# k get pv pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    local.path.provisioner/selected-node: node02
    pv.kubernetes.io/provisioned-by: rancher.io/local-path
  creationTimestamp: "2025-03-02T11:28:37Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b
  resourceVersion: "18745336"
  uid: 2cac4c19-fc76-49f3-83b0-b6aaef9f4d16
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-ddb-napp-k8s-ha-cluster-dn-0-0
    namespace: mydb
    resourceVersion: "18745071"
    uid: ae8db06e-379c-4911-a9f0-b8962c596b5b
  hostPath:
    path: /opt/local-path-provisioner/pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b_mydb_data-ddb-napp-k8s-ha-cluster-dn-0-0
    type: DirectoryOrCreate
  # The PV's node affinity is a hard requirement that the node's metadata.name be node02
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - node02
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-path
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2025-03-02T11:28:37Z"
  phase: Bound
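Since this affinity uses matchFields on metadata.name (a node field, not a label), a simple sanity check is to compare the value recorded in the PV with the actual node names:
bash
# Extract the PV's required node-affinity term
kubectl get pv pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b -o jsonpath='{.spec.nodeAffinity.required}'
# List the node names it has to be matched against
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'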
My own analysis and attempts (with help from the web and DeepSeek)
- I raised the log verbosity of kube-scheduler in the kube-system namespace so that it prints more detailed information about Pod scheduling decisions (a sketch of how this can be done follows the listing below).
bash
# This is the kube-scheduler Pod
[root@server01 common]# k -n kube-system get pods | grep scheduler
kube-scheduler-master01 1/1 Running 0 3h3m
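On a kubeadm cluster the scheduler is a static Pod, so one way to raise its verbosity (a sketch; the manifest path is the kubeadm default) is to edit its manifest on the control-plane node and let the kubelet recreate the Pod:
bash
# On the control-plane node: raise the klog verbosity of kube-scheduler.
# Add "- --v=5" under the command: list; the kubelet restarts the Pod automatically once the file is saved.
vi /etc/kubernetes/manifests/kube-scheduler.yaml
kubectl -n kube-system get pods | grep scheduler   # wait for the new kube-scheduler Pod to come up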
Then I deleted the existing Pod (ddb-napp-k8s-ha-cluster-dn-0-0) to trigger rescheduling. In the resulting kube-scheduler output, node01, node02 and node03 all passed the PV node-affinity check, which effectively skips the PV node-affinity filter step: all three nodes were kept as candidates and moved on to the resource-scoring phase, and node03 was ultimately chosen instead of the expected node02.
The key log lines are shown below.
bash
# In binder.go all three nodes pass the node-affinity check against the target PV, so all three remain candidates
I0427 03:12:13.488773 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488773 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489225 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node02"
# Scheduling then enters the resource-scoring phase, where node03 scores highest
I0427 03:12:13.490916 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,49900879872] requestedResource=[19210,32480690176] resourceScore=46
I0427 03:12:13.491043 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,49900879872] requestedResource=[16510,26608664576] resourceScore=90
I0427 03:12:13.491213 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56343199744] requestedResource=[33500,58493763584] resourceScore=15
I0427 03:12:13.491291 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56344510464] requestedResource=[23400,41028681728] resourceScore=39
I0427 03:12:13.491360 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56343199744] requestedResource=[25800,42555408384] resourceScore=89
I0427 03:12:13.491437 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56344510464] requestedResource=[16800,26977763328] resourceScore=93
# Finally node03 is chosen to start the new ddb-napp-k8s-ha-cluster-dn-0-0 Pod, and the log again confirms that of the 4 evaluated nodes, 3 were considered feasible (feasibleNodes=3)
I0427 03:12:13.503889 1 schedule_one.go:302] "Successfully bound pod to node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" evaluatedNodes=4 feasibleNodes=3
And the same holds for all three PVCs used by this Pod: the PV node-affinity check is effectively a no-op, all three nodes pass it, and "PersistentVolume and node matches for pod" is printed every time, as the log below shows.
bash
[root@server01 common]# k -n kube-system logs -f kube-scheduler-master01
......
I0427 03:12:13.486272 1 eventhandlers.go:126] "Add event for unscheduled pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.486424 1 scheduling_queue.go:576] "Pod moved to an internal scheduling queue" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" event="PodAdd" queue="Active"
I0427 03:12:13.486671 1 schedule_one.go:85] "About to try and schedule pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.486775 1 schedule_one.go:98] "Attempting to schedule pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.487325 1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/data-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b"
I0427 03:12:13.487399 1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/log-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f"
I0427 03:12:13.487441 1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/core-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e"
I0427 03:12:13.488274 1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01"
I0427 03:12:13.488705 1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.488773 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488876 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488944 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488773 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488993 1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.489083 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488490 1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02"
I0427 03:12:13.489212 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489342 1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01"
I0427 03:12:13.489225 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489494 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489567 1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489623 1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02"
I0427 03:12:13.490916 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,49900879872] requestedResource=[19210,32480690176] resourceScore=46
I0427 03:12:13.491043 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,49900879872] requestedResource=[16510,26608664576] resourceScore=90
I0427 03:12:13.491213 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56343199744] requestedResource=[33500,58493763584] resourceScore=15
I0427 03:12:13.491291 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56344510464] requestedResource=[23400,41028681728] resourceScore=39
I0427 03:12:13.491360 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56343199744] requestedResource=[25800,42555408384] resourceScore=89
I0427 03:12:13.491437 1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56344510464] requestedResource=[16800,26977763328] resourceScore=93
I0427 03:12:13.491885 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node03" score=300
I0427 03:12:13.491950 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node03" score=0
I0427 03:12:13.491988 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node03" score=46
I0427 03:12:13.492026 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node03" score=0
I0427 03:12:13.492061 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node03" score=200
I0427 03:12:13.492111 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node03" score=90
I0427 03:12:13.492162 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node03" score=0
I0427 03:12:13.492208 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node01" score=300
I0427 03:12:13.492241 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node01" score=0
I0427 03:12:13.492283 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node01" score=15
I0427 03:12:13.492311 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node01" score=0
I0427 03:12:13.492362 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node01" score=200
I0427 03:12:13.492415 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node01" score=89
I0427 03:12:13.492459 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node01" score=0
I0427 03:12:13.492509 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node02" score=300
I0427 03:12:13.492551 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node02" score=0
I0427 03:12:13.492604 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node02" score=39
I0427 03:12:13.492646 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node02" score=0
I0427 03:12:13.492688 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node02" score=200
I0427 03:12:13.492760 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node02" score=93
I0427 03:12:13.492804 1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node02" score=0
I0427 03:12:13.492855 1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" score=636
I0427 03:12:13.492902 1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" score=604
I0427 03:12:13.492954 1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" score=632
I0427 03:12:13.493205 1 binder.go:439] "AssumePodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.493276 1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/data-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b"
I0427 03:12:13.493320 1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/log-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f"
I0427 03:12:13.493365 1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/core-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e"
I0427 03:12:13.493427 1 binder.go:447] "AssumePodVolumes: all PVCs bound and nothing to do" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.493713 1 default_binder.go:53] "Attempting to bind pod to node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
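As a cross-check, the final scores are simply the sums of the per-plugin scores printed above: node03 gets 300 + 0 + 46 + 0 + 200 + 90 + 0 = 636, node02 gets 300 + 0 + 39 + 0 + 200 + 93 + 0 = 632, and node01 gets 300 + 0 + 15 + 0 + 200 + 89 + 0 = 604, so once all three nodes survive the filter phase, node03 wins purely on scoring.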
- I went through the Kubernetes v1.29.10 source code. In pkg/scheduler/framework/plugins/volumebinding/binder.go, the FindPodVolumes() function contains exactly this check of node affinity between already-bound PVs and the candidate node:
go
// Check PV node affinity on bound volumes
if len(podVolumeClaims.boundClaims) > 0 {
    boundVolumesSatisfied, boundPVsFound, err = b.checkBoundClaims(logger, podVolumeClaims.boundClaims, node, pod)
    if err != nil {
        return
    }
}
The checkBoundClaims() function it calls is defined as follows:
go
func (b *volumeBinder) checkBoundClaims(logger klog.Logger, claims []*v1.PersistentVolumeClaim, node *v1.Node, pod *v1.Pod) (bool, bool, error) {
    csiNode, err := b.csiNodeLister.Get(node.Name)
    if err != nil {
        // TODO: return the error once CSINode is created by default
        logger.V(4).Info("Could not get a CSINode object for the node", "node", klog.KObj(node), "err", err)
    }
    for _, pvc := range claims {
        pvName := pvc.Spec.VolumeName
        pv, err := b.pvCache.GetPV(pvName)
        if err != nil {
            if _, ok := err.(*errNotFound); ok {
                err = nil
            }
            return true, false, err
        }
        pv, err = b.tryTranslatePVToCSI(pv, csiNode)
        if err != nil {
            return false, true, err
        }
        err = volume.CheckNodeAffinity(pv, node.Labels)
        if err != nil {
            logger.V(4).Info("PersistentVolume and node mismatch for pod", "PV", klog.KRef("", pvName), "node", klog.KObj(node), "pod", klog.KObj(pod), "err", err)
            return false, true, nil
        }
        logger.V(5).Info("PersistentVolume and node matches for pod", "PV", klog.KRef("", pvName), "node", klog.KObj(node), "pod", klog.KObj(pod))
    }
    logger.V(4).Info("All bound volumes for pod match with node", "pod", klog.KObj(pod), "node", klog.KObj(node))
    return true, true, nil
}
In my actual case the PV's node affinity explicitly requires a node whose metadata.name is node02, so for node01 and node03 the expected output from this code is the "PersistentVolume and node mismatch for pod" message. In reality, however, "PersistentVolume and node matches for pod" is printed for all three nodes. (Note that the mismatch message is logged at V(4) and the match message at V(5), so at the verbosity I was running, both would have been visible.)
- I also checked the official Kubernetes release notes for v1.29.10 and could not find any bug report about PV node affinity behaving like this.
- I also wrote a minimal Pod manifest of my own that declares a local-path PVC and repeated the whole exercise (a sketch of such a manifest is shown below). This time the Pod did get scheduled onto node02, but according to the kube-scheduler logs the PV node-affinity check still reported a match for all three nodes; node02 was chosen only because it scored highest in the resource-scoring phase.
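For reference, a minimal reproducer of that shape looks roughly like this (the object names, image and storage size are placeholders, not the exact manifest I used):
bash
# Minimal reproducer sketch: one local-path PVC plus a Pod that mounts it
# (names, image and size below are placeholders)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-local-path-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-local-path-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-local-path-pvc
EOF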
What I am hoping for
What I expect, of course, is that a Pod whose PVC is bound to this PV is restarted on node02, the node the PV is hard-pinned to; and if node02 really were unavailable (powered off, for example), the Pod should stay Pending rather than be started on node03.
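One way to verify that expectation without actually powering a node off (a sketch; it assumes temporarily marking nodes unschedulable is acceptable in this cluster) would be:
bash
# Temporarily make the two "wrong" nodes unschedulable, then force a reschedule
kubectl cordon node01
kubectl cordon node03
kubectl -n mydb delete pod ddb-napp-k8s-ha-cluster-dn-0-0
# The Pod should now only be able to land on node02 (or stay Pending if node02 is unavailable)
kubectl -n mydb get pod ddb-napp-k8s-ha-cluster-dn-0-0 -o wide
kubectl uncordon node01
kubectl uncordon node03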
Frankly, I find it hard to believe that an industrial-grade codebase like Kubernetes would still have such a bug in the very core kube-scheduler in a release as recent as v1.29.10, but I also have no way to step through and debug the Kubernetes source code to verify this myself.
Also, I have already gone through this whole diagnosis in detail with the DeepSeek R1 model, and many of the debugging methods I used were suggested by it, so I am hoping a real k8s expert can step in and give some hands-on guidance. Thanks 🤝