Help: why is the kube-scheduler's node-affinity check on a PV not taking effect?

Over the last few days I have run into a problem where the kube-scheduler's node-affinity check on a PV seems to have no effect. I cannot find the cause on my own, so I am posting here to ask for help.

Problem background and symptoms

In an offline LAN in our lab, I used kubeadm to build a 1 master + 3 worker Kubernetes cluster on arm64 servers running a domestic Linux distribution. It is a stock v1.29.10 cluster, and I installed the rancher.io/local-path local StorageClass (node and StorageClass details are listed in the environment section below).


I deployed a commercial database product that we purchased onto the cluster. One of its StatefulSets has pods that request PVCs of the local-path StorageClass. The first time this workload came up, the PV that was provisioned and bound for the pod lived on node02, backed by a subdirectory under /opt/local-path-provisioner/ on that node. The PV's nodeAffinity is also perfectly clear: it requires a node whose name is node02, and the pod was indeed scheduled onto node02 at first.

But at some point, after the pod was terminated and rescheduled, I was surprised to see it land on node03, and a subdirectory with exactly the same name appeared under /opt/local-path-provisioner/ on node03. The result is that the persisted data stored in the PV on node02 is simply gone as far as the new pod is concerned.

Even more surprising: after I raised the kube-scheduler log verbosity, deleted the pod again to trigger rescheduling, and then looked at the scheduler logs with kubectl -n kube-system logs, I found that when kube-scheduler checked this PV's nodeAffinity against the three worker nodes, it decided that the PV, which is hard-pinned to node02, matched node01, node02 and node03!

Following the kube-scheduler log lines, I read the Kubernetes v1.29.10 source in pkg/scheduler/framework/plugins/volumebinding/binder.go, and I really cannot see where the problem is. So I am turning to you all for help, thanks.

Environment and software versions

  • Basic cluster information
    • Physical nodes: ARM64 CPUs + a domestic Linux distribution
    • 4 nodes in total: 1 master + 3 workers
    • Kubernetes version: stock v1.29.10
bash
[root@server01 ~]# k get nodes
NAME       STATUS   ROLES           AGE    VERSION
master01   Ready    control-plane   167d   v1.29.10
node01     Ready    <none>          167d   v1.29.10
node02     Ready    <none>          167d   v1.29.10
node03     Ready    <none>          167d   v1.29.10

The local-path StorageClass is shown below; the corresponding image is rancher/local-path-provisioner:v0.0.29-arm64. It is configured to persist data under /opt/local-path-provisioner/ on the node, and its volume binding mode is WaitForFirstConsumer.

bash
[root@server01 ~]# k get storageclass
NAME         PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  165d

[root@server01 ~]# kubectl get sc local-path -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"local-path"},"provisioner":"rancher.io/local-path","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}
  creationTimestamp: "2024-11-12T10:45:21Z"
  name: local-path
  resourceVersion: "113298"
  uid: 6bdd36c8-3526-4a03-b54d-cc6e311eaee5
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

The affected pod is named ddb-napp-k8s-ha-cluster-dn-0-0 and its PVC is named data-ddb-napp-k8s-ha-cluster-dn-0-0, as shown below (some irrelevant output is elided with ......; mostly the volumes information is kept). Note that the pod's own node affinity is non-mandatory (preferredDuringSchedulingIgnoredDuringExecution) and keys on whether a node carries the label mydb.io/pod: mydb; in fact none of the three worker nodes has that label (see the quick check after the PVC output below), so the pod's own node affinity should have no influence on which node it is scheduled to.

bash
# Pod info
[root@server01 common]# k -n mydb get pod ddb-napp-k8s-ha-cluster-dn-0-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ......
  creationTimestamp: "2025-04-27T03:12:13Z"
  generateName: ddb-napp-k8s-ha-cluster-dn-0-
  labels:
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: ddb-napp-k8s-ha-cluster-dn-0-f89d59448
    ......
    statefulset.kubernetes.io/pod-name: ddb-napp-k8s-ha-cluster-dn-0-0
  name: ddb-napp-k8s-ha-cluster-dn-0-0
  namespace: mydb
  ......
spec:
  # The pod's own node affinity is a soft (preferred) one, keyed on the node label `mydb.io/pod: mydb`; in fact none of the three nodes satisfies it
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: mydb.io/pod
            operator: In
            values:
            - mydb
        weight: 1
  containers:
  - args:
    ......
  hostname: ddb-napp-k8s-ha-cluster-dn-0-0
  nodeName: node03
  ......
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-ddb-napp-k8s-ha-cluster-dn-0-0
    ......
status:
  ......


# PVC info
[root@server01 common]# k -n mydb get pvc data-ddb-napp-k8s-ha-cluster-dn-0-0 -o wide
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
data-ddb-napp-k8s-ha-cluster-dn-0-0   Bound    pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b   100Gi      RWO            local-path     <unset>                 55d   Filesystem
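
As a side note, the claim that none of the three worker nodes carries the mydb.io/pod label is easy to verify; this is just a sanity check added for completeness, not a fix:

bash
# -L adds a column showing the value of the given label on each node;
# if no node has mydb.io/pod, that column is simply empty
kubectl get nodes -L mydb.io/pod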

The bound PV is named pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b. Its full definition is below, and it explicitly states that the PV's nodeAffinity is a hard requirement (required) that the node's name be node02.

bash
[root@server01 common]# k get pv pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    local.path.provisioner/selected-node: node02
    pv.kubernetes.io/provisioned-by: rancher.io/local-path
  creationTimestamp: "2025-03-02T11:28:37Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b
  resourceVersion: "18745336"
  uid: 2cac4c19-fc76-49f3-83b0-b6aaef9f4d16
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-ddb-napp-k8s-ha-cluster-dn-0-0
    namespace: mydb
    resourceVersion: "18745071"
    uid: ae8db06e-379c-4911-a9f0-b8962c596b5b
  hostPath:
    path: /opt/local-path-provisioner/pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b_mydb_data-ddb-napp-k8s-ha-cluster-dn-0-0
    type: DirectoryOrCreate
  # The PV's node affinity is a hard requirement that the node's metadata.name be node02
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - node02
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-path
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2025-03-02T11:28:37Z"
  phase: Bound

My own analysis and attempts, with help from the web and DeepSeek

  • I adjusted the log verbosity of the kube-scheduler in the kube-system namespace so that it prints more detailed pod scheduling information (a sketch of how this can be done on a kubeadm cluster follows the snippet below).
bash
# This is the kube-scheduler pod
[root@server01 common]# k -n kube-system get pods | grep scheduler
kube-scheduler-master01                   1/1     Running   0                 3h3m
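
For anyone who wants to reproduce the verbose output: on a kubeadm cluster the kube-scheduler runs as a static pod on the control-plane node, so the verbosity is typically raised by editing its manifest there. This is only a sketch assuming the default kubeadm layout; the "PersistentVolume and node matches for pod" line is logged at V(5) in the code quoted further below, so --v=5 is enough.

bash
# On master01: add "- --v=5" under spec.containers[0].command in the manifest;
# the kubelet notices the change and recreates the kube-scheduler pod on its own
vi /etc/kubernetes/manifests/kube-scheduler.yaml

# Confirm the scheduler came back with the new verbosity flag
kubectl -n kube-system get pod kube-scheduler-master01 -o yaml | grep -- '--v='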

Then I deleted the original pod (ddb-napp-k8s-ha-cluster-dn-0-0) to trigger rescheduling. To my astonishment, the kube-scheduler logs show that node01, node02 and node03 all passed the PV node-affinity check. In effect the filter did nothing: all three nodes moved on to the resource-scoring phase as candidates, and node03 was finally selected instead of the expected node02.

The key log lines are shown below.

bash
# In binder.go all three nodes passed the node-affinity check against the target PV, so all three became scheduling candidates
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489225       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node02"

# The scheduler then moved on to the resource-allocation scoring phase, where node03 scored highest
I0427 03:12:13.490916       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,49900879872] requestedResource=[19210,32480690176] resourceScore=46

I0427 03:12:13.491043       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,49900879872] requestedResource=[16510,26608664576] resourceScore=90

I0427 03:12:13.491213       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56343199744] requestedResource=[33500,58493763584] resourceScore=15

I0427 03:12:13.491291       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56344510464] requestedResource=[23400,41028681728] resourceScore=39

I0427 03:12:13.491360       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56343199744] requestedResource=[25800,42555408384] resourceScore=89

I0427 03:12:13.491437       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56344510464] requestedResource=[16800,26977763328] resourceScore=93

# Finally the target pod ddb-napp-k8s-ha-cluster-dn-0-0 was bound to node03; the log also confirms that of the 4 evaluated nodes, 3 were considered feasible (feasibleNodes=3)
I0427 03:12:13.503889       1 schedule_one.go:302] "Successfully bound pod to node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" evaluatedNodes=4 feasibleNodes=3

And it is the same for all 3 PVCs used by this pod: the PV node-affinity check is effectively a no-op, all three nodes pass it, and "PersistentVolume and node matches for pod" is printed for each of them, as in the log below.

bash
[root@server01 common]# k -n kube-system logs -f kube-scheduler-master01
......
I0427 03:12:13.486272       1 eventhandlers.go:126] "Add event for unscheduled pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.486424       1 scheduling_queue.go:576] "Pod moved to an internal scheduling queue" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" event="PodAdd" queue="Active"
I0427 03:12:13.486671       1 schedule_one.go:85] "About to try and schedule pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.486775       1 schedule_one.go:98] "Attempting to schedule pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.487325       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/data-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b"
I0427 03:12:13.487399       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/log-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f"
I0427 03:12:13.487441       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/core-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e"
I0427 03:12:13.488274       1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01"
I0427 03:12:13.488705       1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488876       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488944       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488993       1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.489083       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488490       1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02"
I0427 03:12:13.489212       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489342       1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01"
I0427 03:12:13.489225       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489494       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489567       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489623       1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02"


I0427 03:12:13.490916       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,49900879872] requestedResource=[19210,32480690176] resourceScore=46

I0427 03:12:13.491043       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,49900879872] requestedResource=[16510,26608664576] resourceScore=90

I0427 03:12:13.491213       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56343199744] requestedResource=[33500,58493763584] resourceScore=15

I0427 03:12:13.491291       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56344510464] requestedResource=[23400,41028681728] resourceScore=39

I0427 03:12:13.491360       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56343199744] requestedResource=[25800,42555408384] resourceScore=89

I0427 03:12:13.491437       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56344510464] requestedResource=[16800,26977763328] resourceScore=93

I0427 03:12:13.491885       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node03" score=300
I0427 03:12:13.491950       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node03" score=0
I0427 03:12:13.491988       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node03" score=46
I0427 03:12:13.492026       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node03" score=0
I0427 03:12:13.492061       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node03" score=200
I0427 03:12:13.492111       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node03" score=90
I0427 03:12:13.492162       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node03" score=0
I0427 03:12:13.492208       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node01" score=300
I0427 03:12:13.492241       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node01" score=0
I0427 03:12:13.492283       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node01" score=15
I0427 03:12:13.492311       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node01" score=0
I0427 03:12:13.492362       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node01" score=200
I0427 03:12:13.492415       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node01" score=89
I0427 03:12:13.492459       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node01" score=0
I0427 03:12:13.492509       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node02" score=300
I0427 03:12:13.492551       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node02" score=0
I0427 03:12:13.492604       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node02" score=39
I0427 03:12:13.492646       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node02" score=0
I0427 03:12:13.492688       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node02" score=200
I0427 03:12:13.492760       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node02" score=93
I0427 03:12:13.492804       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node02" score=0
I0427 03:12:13.492855       1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" score=636
I0427 03:12:13.492902       1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" score=604
I0427 03:12:13.492954       1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" score=632
I0427 03:12:13.493205       1 binder.go:439] "AssumePodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.493276       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/data-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b"
I0427 03:12:13.493320       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/log-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f"
I0427 03:12:13.493365       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/core-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e"
I0427 03:12:13.493427       1 binder.go:447] "AssumePodVolumes: all PVCs bound and nothing to do" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.493713       1 default_binder.go:53] "Attempting to bind pod to node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
  • I went and read the Kubernetes v1.29.10 source. In pkg/scheduler/framework/plugins/volumebinding/binder.go, the FindPodVolumes() function contains exactly this check of the bound PVs' node affinity against the candidate node:
go
// Check PV node affinity on bound volumes
	if len(podVolumeClaims.boundClaims) > 0 {
		boundVolumesSatisfied, boundPVsFound, err = b.checkBoundClaims(logger, podVolumeClaims.boundClaims, node, pod)
		if err != nil {
			return
		}
	}

The checkBoundClaims() function it calls is defined as follows:

go
func (b *volumeBinder) checkBoundClaims(logger klog.Logger, claims []*v1.PersistentVolumeClaim, node *v1.Node, pod *v1.Pod) (bool, bool, error) {
	csiNode, err := b.csiNodeLister.Get(node.Name)
	if err != nil {
		// TODO: return the error once CSINode is created by default
		logger.V(4).Info("Could not get a CSINode object for the node", "node", klog.KObj(node), "err", err)
	}

	for _, pvc := range claims {
		pvName := pvc.Spec.VolumeName
		pv, err := b.pvCache.GetPV(pvName)
		if err != nil {
			if _, ok := err.(*errNotFound); ok {
				err = nil
			}
			return true, false, err
		}

		pv, err = b.tryTranslatePVToCSI(pv, csiNode)
		if err != nil {
			return false, true, err
		}

		err = volume.CheckNodeAffinity(pv, node.Labels)
		if err != nil {
			logger.V(4).Info("PersistentVolume and node mismatch for pod", "PV", klog.KRef("", pvName), "node", klog.KObj(node), "pod", klog.KObj(pod), "err", err)
			return false, true, nil
		}
		logger.V(5).Info("PersistentVolume and node matches for pod", "PV", klog.KRef("", pvName), "node", klog.KObj(node), "pod", klog.KObj(pod))
	}

	logger.V(4).Info("All bound volumes for pod match with node", "pod", klog.KObj(pod), "node", klog.KObj(node))
	return true, true, nil
}

In my case the PV's node affinity clearly requires the node's metadata.name to be node02, so for node01 and node03 the expected output from this code is "PersistentVolume and node mismatch for pod". In reality all three nodes get "PersistentVolume and node matches for pod".
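
For completeness, the objects the scheduler is supposed to be comparing can also be double-checked directly against the API server. This only confirms what the stored objects say right now; it cannot tell us anything about the scheduler's own informer/assume cache:

bash
# The matchFields term carried by the PV (expected: key metadata.name, operator In, values [node02])
kubectl get pv pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchFields[0]}{"\n"}'

# The exact node names as the API server knows them (watch for typos or stray whitespace)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'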

  • I also went through the official Kubernetes release notes for v1.29.10 and found no bug report about PV node affinity behaving like this.

  • I also wrote a minimal pod manifest of my own that declares a local-path PVC and repeated the whole exercise (a rough sketch of it is below). This time the pod did get scheduled onto node02, but the kube-scheduler logs show the same picture: the PV node-affinity check reports a match for all three nodes, and node02 was chosen only because it came out on top in the resource scoring.
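
The minimal test was roughly equivalent to the sketch below (the object names and the image are placeholders rather than the exact manifest I used; pick an arm64 image that exists in your offline registry):

bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-local-path-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-local-path-pod
spec:
  containers:
  - name: app
    image: busybox            # placeholder image
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-local-path-pvc
EOF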

What I am hoping for

What I expect, of course, is that a pod whose PVC is bound to this PV is rescheduled onto node02, the node the PV is hard-pinned to. Even if node02 were genuinely down (powered off, for example), the pod should stay Pending rather than be started on node03.

Honestly, I find it hard to believe that an industrial-grade codebase like Kubernetes still carries such a bug in its core kube-scheduler in a release as recent as 1.29.10, but I have no way to step through the Kubernetes source in a debugger to verify this myself.

I have already been through this whole diagnosis in detail with the DeepSeek R1 model, and many of the debugging steps above came from it, so I am hoping a real Kubernetes expert can step in and point me in the right direction. Thanks 🤝
