Help: why is the k8s scheduler's node-affinity check on a PV not taking effect?

Over the past few days I have been hitting a problem where the k8s scheduler's node-affinity check on a PV does not seem to take effect. I cannot find the cause on my own, so I am posting here to ask for help.

Problem background and symptoms

In an offline LAN in our lab, I used kubeadm to build a 1 master + 3 worker k8s cluster on arm64 servers running a domestic Chinese Linux distribution, using the standard v1.29.10 release. The rancher.io/local-path local StorageClass is installed in the cluster, as shown below

bash
[root@server01 ~]# k get nodes
NAME       STATUS   ROLES           AGE    VERSION
master01   Ready    control-plane   167d   v1.29.10
node01     Ready    <none>          167d   v1.29.10
node02     Ready    <none>          167d   v1.29.10
node03     Ready    <none>          167d   v1.29.10

[root@server01 ~]# k get storageclass
NAME         PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  165d

I deployed a commercial database product we purchased into the cluster. One of its StatefulSet workloads has Pods that request PVCs of the local-path class. The first time the workload came up, the PV provisioned and bound for the Pod was on node02, backed by a subdirectory under /opt/local-path-provisioner/ on that node. The PV's node affinity (nodeAffinity) is also perfectly clear: it requires a node whose name is node02, and the Pod was indeed scheduled onto node02 at first.

At some point later, however, after the Pod was terminated for some reason and rescheduled, I was surprised to find it had been placed on node03. A subdirectory with exactly the same name also appeared under /opt/local-path-provisioner/ on node03, with the result that the persisted data in the PV on node02 was simply missing from the new Pod.

Even more surprisingly, after I raised the kube-scheduler log verbosity, deleted the pod again to trigger rescheduling, and then ran kubectl -n kube-system logs to inspect the kube-scheduler logs, I found that when kube-scheduler checks this PV's nodeAffinity against the three nodes, it concludes that the PV, which is hard-pinned to node02, matches node01, node02 and node03 alike!

Following the kube-scheduler log, I read the Kubernetes v1.29.10 source in pkg/scheduler/framework/plugins/volumebinding/binder.go, but I genuinely cannot see where the problem is. So I can only turn to you all for advice. Thanks.

Environment and software versions

  • Basic cluster information
    • Physical nodes: ARM64 CPUs + a domestic Chinese Linux distribution
    • 4 nodes in total: 1 master + 3 workers
    • Kubernetes version: standard v1.29.10
bash
[root@server01 ~]# k get nodes
NAME       STATUS   ROLES           AGE    VERSION
master01   Ready    control-plane   167d   v1.29.10
node01     Ready    <none>          167d   v1.29.10
node02     Ready    <none>          167d   v1.29.10
node03     Ready    <none>          167d   v1.29.10

The local-path StorageClass is shown below; the corresponding image is rancher/local-path-provisioner:v0.0.29-arm64. Its persistence directory on the physical nodes is configured as /opt/local-path-provisioner/, and the volume binding mode is WaitForFirstConsumer

bash
[root@server01 ~]# k get storageclass
NAME         PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  165d

[root@server01 ~]# kubectl get sc local-path -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"local-path"},"provisioner":"rancher.io/local-path","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}
  creationTimestamp: "2024-11-12T10:45:21Z"
  name: local-path
  resourceVersion: "113298"
  uid: 6bdd36c8-3526-4a03-b54d-cc6e311eaee5
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

The affected pod is named ddb-napp-k8s-ha-cluster-dn-0-0 and its PVC is named data-ddb-napp-k8s-ha-cluster-dn-0-0, as shown below (some irrelevant output is omitted and replaced with ......; mainly the volumes information is kept). Note that the Pod's own node affinity is non-mandatory (preferredDuringSchedulingIgnoredDuringExecution) and keys off whether the node carries the label mydb.io/pod: mydb; in fact none of the three nodes has that label, so the Pod's own node affinity should have no influence on which node it is scheduled to (see the short sketch after the output below).

bash
# Pod information
[root@server01 common]# k -n mydb get pod ddb-napp-k8s-ha-cluster-dn-0-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ......
  creationTimestamp: "2025-04-27T03:12:13Z"
  generateName: ddb-napp-k8s-ha-cluster-dn-0-
  labels:
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: ddb-napp-k8s-ha-cluster-dn-0-f89d59448
    ......
    statefulset.kubernetes.io/pod-name: ddb-napp-k8s-ha-cluster-dn-0-0
  name: ddb-napp-k8s-ha-cluster-dn-0-0
  namespace: mydb
  ......
spec:
  # The Pod's own node affinity is a soft (preferred) affinity, keyed on whether the node carries the label `mydb.io/pod: mydb`; in fact none of the three nodes satisfies this
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: mydb.io/pod
            operator: In
            values:
            - mydb
        weight: 1
  containers:
  - args:
    ......
  hostname: ddb-napp-k8s-ha-cluster-dn-0-0
  nodeName: node03
  ......
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-ddb-napp-k8s-ha-cluster-dn-0-0
    ......
status:
  ......


# PVC information
[root@server01 common]# k -n mydb get pvc data-ddb-napp-k8s-ha-cluster-dn-0-0 -o wide
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
data-ddb-napp-k8s-ha-cluster-dn-0-0   Bound    pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b   100Gi      RWO            local-path     <unset>                 55d   Filesystem
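
As a quick sanity check on that claim, the preferred term can be evaluated by hand: a preferred (soft) term only contributes to scoring and never filters a node out, and if its matchExpressions select none of the nodes it simply adds 0. Below is a minimal standalone sketch of that evaluation; the per-node label sets are made-up placeholders, since none of my nodes actually carries mydb.io/pod.

go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

func main() {
	// The pod's single preferred term: mydb.io/pod In (mydb), weight 1.
	req, err := labels.NewRequirement("mydb.io/pod", selection.In, []string{"mydb"})
	if err != nil {
		panic(err)
	}
	preference := labels.NewSelector().Add(*req)
	weight := int64(1)

	// Hypothetical label sets for the three workers; none carries mydb.io/pod.
	nodes := map[string]labels.Set{
		"node01": {"kubernetes.io/hostname": "node01"},
		"node02": {"kubernetes.io/hostname": "node02"},
		"node03": {"kubernetes.io/hostname": "node03"},
	}

	for name, nodeLabels := range nodes {
		score := int64(0)
		if preference.Matches(nodeLabels) {
			// A preferred term only adds its weight to the score; it never excludes a node.
			score += weight
		}
		fmt.Printf("node=%s raw NodeAffinity preference score=%d\n", name, score)
	}
}

This lines up with the NodeAffinity plugin scoring 0 for every node in the scheduler logs further below.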

The bound PV is named pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b; its full definition is shown below. It explicitly states that the PV's node affinity is a hard requirement (required) that the node name be node02 (a small standalone sketch evaluating this selector follows the YAML).

bash
[root@server01 common]# k get pv pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    local.path.provisioner/selected-node: node02
    pv.kubernetes.io/provisioned-by: rancher.io/local-path
  creationTimestamp: "2025-03-02T11:28:37Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b
  resourceVersion: "18745336"
  uid: 2cac4c19-fc76-49f3-83b0-b6aaef9f4d16
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-ddb-napp-k8s-ha-cluster-dn-0-0
    namespace: mydb
    resourceVersion: "18745071"
    uid: ae8db06e-379c-4911-a9f0-b8962c596b5b
  hostPath:
    path: /opt/local-path-provisioner/pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b_mydb_data-ddb-napp-k8s-ha-cluster-dn-0-0
    type: DirectoryOrCreate
  # The PV's node affinity is a hard requirement that the node's metadata.name be node02
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - node02
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-path
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2025-03-02T11:28:37Z"
  phase: Bound
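
To rule out any ambiguity in this nodeAffinity spec itself, here is a minimal standalone sketch that evaluates the exact same required term against the three worker nodes, using the MatchNodeSelectorTerms helper from k8s.io/component-helpers (the same package family the scheduler relies on; treating it as a module dependency of a throwaway program is my assumption). With a full Node object whose metadata.name is populated, I would expect only node02 to match.

go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1helpers "k8s.io/component-helpers/scheduling/corev1"
)

func main() {
	// The PV's required term, copied from the YAML above:
	// matchFields with key metadata.name, operator In, value node02.
	required := &v1.NodeSelector{
		NodeSelectorTerms: []v1.NodeSelectorTerm{{
			MatchFields: []v1.NodeSelectorRequirement{{
				Key:      "metadata.name",
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{"node02"},
			}},
		}},
	}

	for _, name := range []string{"node01", "node02", "node03"} {
		// A full Node object carries metadata.name, so the matchFields
		// requirement has something to compare against.
		node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: name}}
		matches, err := corev1helpers.MatchNodeSelectorTerms(node, required)
		fmt.Printf("node=%s matches=%v err=%v\n", name, matches, err)
	}
}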

My own analysis and attempts (with help from the web and DeepSeek)

  • I adjusted the log verbosity of kube-scheduler in the kube-system namespace so that it prints more detailed pod scheduling information
bash
# This is the kube-scheduler pod
[root@server01 common]# k -n kube-system get pods | grep scheduler
kube-scheduler-master01                   1/1     Running   0                 3h3m

Then I deleted the original Pod (ddb-napp-k8s-ha-cluster-dn-0-0) to trigger rescheduling. In the resulting kube-scheduler logs, node01, node02 and node03 all pass the PV affinity check, which effectively means the PV node-affinity filter is skipped: all three nodes are carried forward as candidates into the resource-scoring phase, and node03 is finally selected instead of the expected node02.

The key log lines are shown below

bash
# In binder.go, all three nodes pass the node-affinity check against the target PV, so all three remain scheduling candidates
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489225       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node02"

# The scheduler then moves on to the resource-request scoring phase, where node03 gets the highest score
I0427 03:12:13.490916       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,49900879872] requestedResource=[19210,32480690176] resourceScore=46

I0427 03:12:13.491043       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,49900879872] requestedResource=[16510,26608664576] resourceScore=90

I0427 03:12:13.491213       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56343199744] requestedResource=[33500,58493763584] resourceScore=15

I0427 03:12:13.491291       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56344510464] requestedResource=[23400,41028681728] resourceScore=39

I0427 03:12:13.491360       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56343199744] requestedResource=[25800,42555408384] resourceScore=89

I0427 03:12:13.491437       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56344510464] requestedResource=[16800,26977763328] resourceScore=93

# The target pod ddb-napp-k8s-ha-cluster-dn-0-0 is finally scheduled onto node03, and the log confirms again that of the 4 evaluated nodes, three were considered feasible (feasibleNodes=3)
I0427 03:12:13.503889       1 schedule_one.go:302] "Successfully bound pod to node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" evaluatedNodes=4 feasibleNodes=3

The same holds for all 3 PVCs used by this Pod: the PV node-affinity check is effectively a no-op, all three nodes pass it, and "PersistentVolume and node matches for pod" is printed every time, as in the log below

bash
[root@server01 common]# k -n kube-system logs -f kube-scheduler-master01
......
I0427 03:12:13.486272       1 eventhandlers.go:126] "Add event for unscheduled pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.486424       1 scheduling_queue.go:576] "Pod moved to an internal scheduling queue" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" event="PodAdd" queue="Active"
I0427 03:12:13.486671       1 schedule_one.go:85] "About to try and schedule pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.486775       1 schedule_one.go:98] "Attempting to schedule pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.487325       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/data-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b"
I0427 03:12:13.487399       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/log-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f"
I0427 03:12:13.487441       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/core-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e"
I0427 03:12:13.488274       1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01"
I0427 03:12:13.488705       1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488876       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488944       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node03" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488773       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488993       1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.489083       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.488490       1 binder.go:282] "FindPodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02"
I0427 03:12:13.489212       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node01" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489342       1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01"
I0427 03:12:13.489225       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489494       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489567       1 binder.go:892] "PersistentVolume and node matches for pod" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e" node="node02" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0"
I0427 03:12:13.489623       1 binder.go:895] "All bound volumes for pod match with node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02"


I0427 03:12:13.490916       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,49900879872] requestedResource=[19210,32480690176] resourceScore=46

I0427 03:12:13.491043       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,49900879872] requestedResource=[16510,26608664576] resourceScore=90

I0427 03:12:13.491213       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56343199744] requestedResource=[33500,58493763584] resourceScore=15

I0427 03:12:13.491291       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="LeastAllocated" allocatableResource=[48000,56344510464] requestedResource=[23400,41028681728] resourceScore=39

I0427 03:12:13.491360       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56343199744] requestedResource=[25800,42555408384] resourceScore=89

I0427 03:12:13.491437       1 resource_allocation.go:76] "Listed internal info for allocatable resources, requested resources and score" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" resourceAllocationScorer="NodeResourcesBalancedAllocation" allocatableResource=[48000,56344510464] requestedResource=[16800,26977763328] resourceScore=93

I0427 03:12:13.491885       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node03" score=300
I0427 03:12:13.491950       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node03" score=0
I0427 03:12:13.491988       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node03" score=46
I0427 03:12:13.492026       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node03" score=0
I0427 03:12:13.492061       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node03" score=200
I0427 03:12:13.492111       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node03" score=90
I0427 03:12:13.492162       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node03" score=0
I0427 03:12:13.492208       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node01" score=300
I0427 03:12:13.492241       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node01" score=0
I0427 03:12:13.492283       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node01" score=15
I0427 03:12:13.492311       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node01" score=0
I0427 03:12:13.492362       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node01" score=200
I0427 03:12:13.492415       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node01" score=89
I0427 03:12:13.492459       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node01" score=0
I0427 03:12:13.492509       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="TaintToleration" node="node02" score=300
I0427 03:12:13.492551       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeAffinity" node="node02" score=0
I0427 03:12:13.492604       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesFit" node="node02" score=39
I0427 03:12:13.492646       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="VolumeBinding" node="node02" score=0
I0427 03:12:13.492688       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="PodTopologySpread" node="node02" score=200
I0427 03:12:13.492760       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="NodeResourcesBalancedAllocation" node="node02" score=93
I0427 03:12:13.492804       1 schedule_one.go:745] "Plugin scored node for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" plugin="ImageLocality" node="node02" score=0
I0427 03:12:13.492855       1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03" score=636
I0427 03:12:13.492902       1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node01" score=604
I0427 03:12:13.492954       1 schedule_one.go:812] "Calculated node's final score for pod" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node02" score=632
I0427 03:12:13.493205       1 binder.go:439] "AssumePodVolumes" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.493276       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/data-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-ae8db06e-379c-4911-a9f0-b8962c596b5b"
I0427 03:12:13.493320       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/log-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-bf6eb403-ee16-4313-8011-d0c4fb97d95f"
I0427 03:12:13.493365       1 binder.go:794] "PVC is fully bound to PV" PVC="mydb/core-ddb-napp-k8s-ha-cluster-dn-0-0" PV="pvc-3d8142e4-177e-4b12-820f-30008dc11b0e"
I0427 03:12:13.493427       1 binder.go:447] "AssumePodVolumes: all PVCs bound and nothing to do" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
I0427 03:12:13.493713       1 default_binder.go:53] "Attempting to bind pod to node" pod="mydb/ddb-napp-k8s-ha-cluster-dn-0-0" node="node03"
  • I went and read the Kubernetes v1.29.10 source code. In pkg/scheduler/framework/plugins/volumebinding/binder.go, the FindPodVolumes() function contains this check of node affinity between already-bound PVs and the candidate node
go
// Check PV node affinity on bound volumes
	if len(podVolumeClaims.boundClaims) > 0 {
		boundVolumesSatisfied, boundPVsFound, err = b.checkBoundClaims(logger, podVolumeClaims.boundClaims, node, pod)
		if err != nil {
			return
		}
	}

The checkBoundClaims() function it calls is defined as follows

go
func (b *volumeBinder) checkBoundClaims(logger klog.Logger, claims []*v1.PersistentVolumeClaim, node *v1.Node, pod *v1.Pod) (bool, bool, error) {
	csiNode, err := b.csiNodeLister.Get(node.Name)
	if err != nil {
		// TODO: return the error once CSINode is created by default
		logger.V(4).Info("Could not get a CSINode object for the node", "node", klog.KObj(node), "err", err)
	}

	for _, pvc := range claims {
		pvName := pvc.Spec.VolumeName
		pv, err := b.pvCache.GetPV(pvName)
		if err != nil {
			if _, ok := err.(*errNotFound); ok {
				err = nil
			}
			return true, false, err
		}

		pv, err = b.tryTranslatePVToCSI(pv, csiNode)
		if err != nil {
			return false, true, err
		}

		err = volume.CheckNodeAffinity(pv, node.Labels)
		if err != nil {
			logger.V(4).Info("PersistentVolume and node mismatch for pod", "PV", klog.KRef("", pvName), "node", klog.KObj(node), "pod", klog.KObj(pod), "err", err)
			return false, true, nil
		}
		logger.V(5).Info("PersistentVolume and node matches for pod", "PV", klog.KRef("", pvName), "node", klog.KObj(node), "pod", klog.KObj(pod))
	}

	logger.V(4).Info("All bound volumes for pod match with node", "pod", klog.KObj(pod), "node", klog.KObj(node))
	return true, true, nil
}

In my case the PV's node affinity plainly requires a node whose metadata.name is node02, so for node01 and node03 the expected result would be the "PersistentVolume and node mismatch for pod" message from this code; in reality, however, "PersistentVolume and node matches for pod" is printed for all three nodes
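
One thing I noticed while reading this code, but have not been able to verify inside the real scheduler: checkBoundClaims() passes only node.Labels into volume.CheckNodeAffinity(), so whatever that helper constructs internally can at most see the node's labels, never its metadata.name. Below is a standalone sketch I would use to probe how the same required term behaves against a Node object that carries labels but no name; the labels-only Node and the hostname label value are my own assumptions for illustration, not the helper's actual internals.

go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1helpers "k8s.io/component-helpers/scheduling/corev1"
)

func main() {
	// The PV's required term, verbatim: matchFields on metadata.name In (node02).
	required := &v1.NodeSelector{
		NodeSelectorTerms: []v1.NodeSelectorTerm{{
			MatchFields: []v1.NodeSelectorRequirement{{
				Key:      "metadata.name",
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{"node02"},
			}},
		}},
	}

	// A Node reconstructed from labels only, with metadata.name left empty --
	// the most information a check that receives just node.Labels could see.
	labelsOnlyNode := &v1.Node{ObjectMeta: metav1.ObjectMeta{
		Labels: map[string]string{"kubernetes.io/hostname": "node01"},
	}}

	// Does the matchFields requirement still constrain anything when there is
	// no metadata.name to compare against?
	matches, err := corev1helpers.MatchNodeSelectorTerms(labelsOnlyNode, required)
	fmt.Printf("labels-only node: matches=%v err=%v\n", matches, err)
}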

  • I also checked the official Kubernetes release information for v1.29.10 and found no bug report about PV node affinity behaving like this

  • I also wrote a minimal pod YAML of my own, again declaring a local-path PVC, and repeated the whole process. This time the pod did get scheduled onto node02, but the kube-scheduler logs still show the PV node-affinity check passing for all three nodes; node02 was chosen only because it scored highest in the resource-request scoring

What I am hoping for

What I expect, of course, is that a Pod whose PVC is bound to this PV is restarted on node02, the node the PV is hard-pinned to. Even if node02 really were unavailable (powered off, say), the Pod should stay Pending rather than be started on node03.

Honestly, I find it hard to believe that industrial-grade code like Kubernetes still carries such a bug in the core kube-scheduler in a release as recent as 1.29.10, but I also have no way to step-debug the Kubernetes source to verify this myself.

Also, I have already gone back and forth with the DeepSeek R1 model in detail throughout this whole diagnosis, and many of the debugging steps I used were suggested by it, so I am hoping a real k8s expert can step in and give some human guidance. Thanks 🤝
