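Cluster Autoscaler Falling Short? Azure Brings Karpenter + Spot to AKS, Cutting Costs by Up to 80%

Introduction

At KubeCon North America 2023, Microsoft announced that it was bringing Karpenter to Azure Kubernetes Service (AKS) as an alternative to Cluster Autoscaler (CA), under the name Node Autoprovisioning (NAP).

Cluster Autoscaler has long been the default node scaling tool for AKS and Kubernetes, but it has a number of limitations, which is what prompted Microsoft to adopt Karpenter.

This article walks through those limitations and shows how Karpenter addresses them.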

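Limitations of Cluster Autoscaler

[Figure: how Cluster Autoscaler scales nodes automatically]

Tied to Virtual Machine Scale Sets (VMSS):

On AKS, Cluster Autoscaler works only with Virtual Machine Scale Sets.

Each VMSS consists of VM instances of a single type, with a fixed VM SKU, hardware specification, and CPU-to-memory ratio (for example, Standard_D4s_v5 with 4 vCPUs and 16 GiB of memory).

Node pool constraints:

When a new Pod needs to be scheduled and the current nodes are full, Cluster Autoscaler tries to add nodes, but it can only scale out using the SKU of an existing VMSS. If that instance type is unavailable, the Pod stays Pending.

Limited scaling flexibility:

Cluster Autoscaler can only scale the VMSS it was given. Even when other VM SKUs have spare capacity, it cannot use them.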
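
To make this constraint concrete, here is a minimal Python sketch that contrasts scaling a single fixed-SKU pool with choosing among several SKUs. It is not the actual code of Cluster Autoscaler or Karpenter: the availability numbers and both function names are hypothetical and only illustrate the behavior described above.

```python
from typing import Optional

# Hypothetical availability data: SKU name -> VMs the region can still allocate.
AVAILABLE_CAPACITY = {
    "Standard_D4s_v5": 0,  # the node pool's own SKU is exhausted
    "Standard_D4s_v3": 5,
    "Standard_E2s_v5": 3,
}


def scale_fixed_pool(pool_sku: str, capacity: dict[str, int]) -> Optional[str]:
    """Node-pool style scale-up: only the pool's own SKU may be used."""
    return pool_sku if capacity.get(pool_sku, 0) > 0 else None


def scale_flexible(allowed_skus: list[str], capacity: dict[str, int]) -> Optional[str]:
    """Group-less style scale-up: take the first allowed SKU with free capacity."""
    for sku in allowed_skus:
        if capacity.get(sku, 0) > 0:
            return sku
    return None


fixed = scale_fixed_pool("Standard_D4s_v5", AVAILABLE_CAPACITY)
flexible = scale_flexible(["Standard_D4s_v5", "Standard_D4s_v3"], AVAILABLE_CAPACITY)
print(fixed)     # None: no node is added, so the Pod stays Pending
print(flexible)  # Standard_D4s_v3
```

With a fixed pool, an exhausted SKU leaves the Pod Pending, while a provisioner that may pick from a list falls through to the next SKU that has capacity.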

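Advantages of Karpenter

Karpenter is an open-source Kubernetes cluster autoscaler built to optimize performance and cost, with an emphasis on flexibility, speed, and simplicity. It scales faster than Cluster Autoscaler and creates standalone nodes directly, without the constraints of traditional node groups.

Core features of Karpenter:

✅ Efficient scaling: adds and removes Kubernetes nodes quickly.

✅ Flexible scheduling: launches new nodes without depending on a VMSS.

✅ Cost optimization: supports automatic patching and Kubernetes version upgrades, lowering overall cost.

✅ YAML-based NodePool configuration: lets you customize node types and scheduling policy.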

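Disruption Management in Karpenter

The Disruption Controller terminates or replaces nodes in a Kubernetes cluster. It uses the following three mechanisms to decide how to handle a node:

1. Node expiration

Karpenter sets a time to live (TTL) on each node and replaces the node once it expires. For example:

spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 300s  # replace each node 300 seconds after it is created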
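
The expiration rule boils down to comparing a node's age with expireAfter. The Python sketch below illustrates that comparison only; it is not Karpenter's implementation. The function names are hypothetical, and the parser assumes a single-unit duration such as 300s or 720h, which is narrower than what Karpenter itself accepts.

```python
from datetime import datetime, timedelta, timezone

# Assumption: only single-unit durations (seconds, minutes, hours) are handled.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a single-unit duration such as '300s' or '720h'."""
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{UNITS[unit]: value})


def is_expired(created_at: datetime, expire_after: str, now: datetime) -> bool:
    """A node is due for replacement once its age reaches expireAfter."""
    return now - created_at >= parse_duration(expire_after)


now = datetime(2024, 5, 21, 12, 0, 0, tzinfo=timezone.utc)
young = now - timedelta(seconds=120)  # node created 2 minutes ago
old = now - timedelta(seconds=301)    # node created just over 5 minutes ago
print(is_expired(young, "300s", now))  # False
print(is_expired(old, "300s", now))    # True
```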

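2. Delayed consolidation

The consolidateAfter setting controls how long Karpenter waits before it triggers a disruption.

3. Consolidation

Karpenter actively reduces cluster cost by analyzing how node resources are being used.

📌 Supported policies:

WhenEmpty: only removes nodes that have no Pods running on them.
WhenUnderutilized: when a node's resources are underused, tries to remove the node or replace it.

Example: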

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ondemand
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
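
The difference between the two policies can be sketched in a few lines of Python. This is a deliberately simplified model and not how Karpenter decides: the Node class, the consolidation_candidates function, and especially the fixed 0.5 utilization threshold are assumptions made for illustration. The real controller evaluates whether Pods can be rescheduled and whether a cheaper replacement exists.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    pod_count: int          # application Pods running on the node
    cpu_utilization: float  # requested CPU / allocatable CPU, 0.0 to 1.0


def consolidation_candidates(nodes, policy, threshold=0.5):
    """Return the node names a consolidation pass might consider for a policy.

    The threshold is a hypothetical stand-in for "underutilized".
    """
    if policy == "WhenEmpty":
        return [n.name for n in nodes if n.pod_count == 0]
    if policy == "WhenUnderutilized":
        return [n.name for n in nodes
                if n.pod_count == 0 or n.cpu_utilization < threshold]
    raise ValueError(f"unknown policy: {policy}")


nodes = [
    Node("node-a", pod_count=0, cpu_utilization=0.0),
    Node("node-b", pod_count=3, cpu_utilization=0.2),
    Node("node-c", pod_count=9, cpu_utilization=0.9),
]
print(consolidation_candidates(nodes, "WhenEmpty"))          # ['node-a']
print(consolidation_candidates(nodes, "WhenUnderutilized"))  # ['node-a', 'node-b']
```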

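Enabling Karpenter on AKS

Before enabling Karpenter, make sure the following prerequisites are met:

✅ Install the Azure CLI and update the aks-preview extension to version 0.5.170 or later

✅ Register the NAP preview feature

✅ Use Cilium + Overlay as the AKS network configuration

If you run into trouble installing Karpenter, you can try CloudPilot AI (www.cloudpilot.ai), a managed Karpenter service whose installation takes only 5 minutes.

Enabling Karpenter on an Existing AKS Cluster

Make sure the AKS cluster uses the Azure CNI network plugin in overlay mode with Cilium as the network dataplane.

The key parameter is --node-provisioning-mode Auto, which makes Karpenter the cluster's node scaling tool.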

az aks update --name <aks-cluster-name> --resource-group <rg-name> --node-provisioning-mode Auto

Creating a New AKS Cluster with Karpenter Enabled

az aks create --name <aks-cluster-name> --resource-group <rg-name> \
  --node-provisioning-mode Auto --network-plugin azure --network-plugin-mode overlay --network-dataplane cilium

Verifying That Karpenter Is Enabled

kubectl api-resources | grep -e aksnodeclasses -e nodeclaims -e nodepools

Sample output:

aksnodeclasses      aksnc,aksncs      karpenter.azure.com/v1alpha2      false      AKSNodeClass
nodeclaims                               karpenter.sh/v1beta1             false      NodeClaim
nodepools                                karpenter.sh/v1beta1             false      NodePool

Disabling Cluster Autoscaler

To switch from Cluster Autoscaler to Karpenter, first disable Cluster Autoscaler on the AKS cluster:

az aks update --name <aks-cluster-name> --resource-group <rg-name> --disable-cluster-autoscaler

Deploying a Sample Application

To watch node autoscaling in action, deploy a sample application. The cluster's nodes before scaling:

osama [ ~ ]$ kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-default-h2jxh                   Ready    agent   35m     v1.27.9
aks-nodepool1-41633911-vmss000000   Ready    agent   3d19h   v1.27.9

Scale the Vote application's replicas to trigger a scale-out event

osama [ ~ ]$ kubectl scale deployment azure-vote-front --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-front scaled
osama [ ~ ]$ kubectl scale deployment azure-vote-back --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-back scaled

Use the following kubectl command to check how Karpenter scales out

kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
NAMESPACE           LAST SEEN   TYPE     REASON                  OBJECT                                  MESSAGE
default             50m         Normal   Unconsolidatable        nodeclaim/default-95f54                 SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
default             50m         Normal   Unconsolidatable        node/aks-default-95f54                  SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
default             38m         Normal   DisruptionBlocked       nodepool/default                        No allowed disruptions due to blocking budget
default             5m33s       Normal   Unconsolidatable        nodeclaim/default-h2jxh                 Can't remove without creating 2 candidates
default             5m33s       Normal   Unconsolidatable        node/aks-default-h2jxh                  Can't remove without creating 2 candidates
default             2m12s       Normal   DisruptionBlocked       nodepool/system-surge                   No allowed disruptions due to blocking budget
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-bnq7p   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-gbwk6   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-l2bgj   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-nvc56   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-22glj   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-sxdl6   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-t69w4   Pod should schedule on: nodeclaim/default-mrh7w

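Customizing the Karpenter Configuration

Karpenter introduces a new Kubernetes resource type, NodePool, for managing and optimizing how nodes are provisioned.

Custom NodePools: specify particular VM series or VM families, or a custom CPU-to-memory ratio.
Feature-based node selection: choose nodes with capabilities such as GPU acceleration or accelerated networking.
CPU architecture: choose ARM64 or AMD64 depending on the workload.
Highly available node layout: configure availability-zone topology to improve resilience.
Resource limits: cap the total CPU and memory that a NodePool's nodes may consume.

Below is the default NodePool configuration in YAML. It includes:

Node SKU types and capacity type
CPU and memory limits
The scheduling weight used when multiple NodePools exist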

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 10s
  template:
    spec:
      nodeClassRef:
        name: default

      # Requirements that constrain the parameters of provisioned nodes.
      # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - E
        - D
      - key: karpenter.azure.com/sku-name
        operator: In
        values:
        - Standard_E2s_v5
        - Standard_D4s_v3
  limits:
    cpu: "1000"
    memory: 1000Gi
  weight: 100
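
The way requirements and limits narrow down the VM sizes a NodePool may launch can be sketched in Python. This is an illustration under stated assumptions, not Karpenter's code: the CATALOG list, the matches and fits_limit helpers, and the 998 vCPUs already in use are all hypothetical. Only the In and NotIn operators are modeled, and the limit check is simplified to "would adding this VM stay within spec.limits.cpu".

```python
# Hypothetical catalog of VM sizes, labeled with the keys used in the NodePool above.
CATALOG = [
    {"karpenter.azure.com/sku-name": "Standard_E2s_v5",
     "karpenter.azure.com/sku-family": "E", "kubernetes.io/arch": "amd64", "cpu": 2},
    {"karpenter.azure.com/sku-name": "Standard_D4s_v3",
     "karpenter.azure.com/sku-family": "D", "kubernetes.io/arch": "amd64", "cpu": 4},
    {"karpenter.azure.com/sku-name": "Standard_D4ps_v5",
     "karpenter.azure.com/sku-family": "D", "kubernetes.io/arch": "arm64", "cpu": 4},
]

REQUIREMENTS = [
    {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
    {"key": "karpenter.azure.com/sku-family", "operator": "In", "values": ["E", "D"]},
]


def matches(vm: dict, requirements: list[dict]) -> bool:
    """True when the VM satisfies every requirement (only In/NotIn are modeled)."""
    for req in requirements:
        present = vm.get(req["key"]) in req["values"]
        if req["operator"] == "In" and not present:
            return False
        if req["operator"] == "NotIn" and present:
            return False
    return True


def fits_limit(vm: dict, cpu_in_use: int, cpu_limit: int) -> bool:
    """True when adding this VM keeps the pool within its CPU limit."""
    return cpu_in_use + vm["cpu"] <= cpu_limit


# Assumption: the pool already runs 998 vCPUs against the limit of 1000.
eligible = [vm["karpenter.azure.com/sku-name"] for vm in CATALOG
            if matches(vm, REQUIREMENTS)
            and fits_limit(vm, cpu_in_use=998, cpu_limit=1000)]
print(eligible)  # ['Standard_E2s_v5']
```

The arm64 size is rejected by the architecture requirement, and the 4-vCPU amd64 size is rejected because it would push the pool past its CPU limit.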

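Using Spot Nodes with Karpenter

Add a toleration to the AKS-Vote sample application for the kubernetes.azure.com/scalesetpriority=spot:NoSchedule taint, which AKS applies by default to the Spot nodes it provisions.
See my GitHub repository for the application YAML and a sample NodePool configuration (github.com/Osshaikh/Ka...)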

spec:
      nodeSelector:
        "kubernetes.io/os": linux
      tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      containers:
      - name: azure-vote-front
        image: mcr.microsoft.com/azuredocs/azure-vote-front:v1
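
Why this toleration matters can be shown with a simplified version of the Kubernetes taint and toleration matching rule. The Python sketch below is an approximation for illustration: the tolerates and can_schedule functions are hypothetical names, and details such as NoExecute handling and tolerationSeconds are left out.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified Kubernetes rule: does one toleration match one taint?"""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key combined with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(tolerations: list[dict], taints: list[dict]) -> bool:
    """A Pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(any(tolerates(tol, taint) for tol in tolerations)
               for taint in taints if taint["effect"] == "NoSchedule")


spot_taint = {"key": "kubernetes.azure.com/scalesetpriority",
              "value": "spot", "effect": "NoSchedule"}
vote_toleration = {"key": "kubernetes.azure.com/scalesetpriority",
                   "operator": "Equal", "value": "spot", "effect": "NoSchedule"}

print(can_schedule([vote_toleration], [spot_taint]))  # True
print(can_schedule([], [spot_taint]))                 # False
```

Without the toleration, the Pod cannot land on a node carrying the Spot taint, so the scheduler would need an untainted node instead.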

Scale the application down so that Karpenter gradually drains the existing on-demand nodes, which will then be replaced with Spot nodes:

osama [ ~/karpenter ]$ kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-41633911-vmss000000   Ready    agent   3d21h   v1.27.9
aks-nodepool1-41633911-vmss00000b   Ready    agent   24m     v1.27.9

osama [ ~/karpenter ]$ kubectl get pods -n karpenter-demo-ns -o wide
No resources found in karpenter-demo-ns namespace.

osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-back --replicas=10 -n karpenter-demo-ns
deployment.apps/azure-vote-back scaled
osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-front --replicas=10 -n karpenter-demo-ns
deployment.apps/azure-vote-front scaled
osama [ ~/karpenter ]$

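Deploy and scale up the Vote application so that Karpenter launches Spot nodes according to the NodePool configuration and, once the toleration checks out, schedules the Pods onto them.
Karpenter launches a new Spot node and nominates it for the Vote sample application's Pods.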

osama [ ~/karpenter ]$ kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
NAMESPACE           LAST SEEN   TYPE      REASON                       OBJECT                                    MESSAGE
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-pz8sp      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-ckdcq      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-v9nqj      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-vswvs      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-lnxmp      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-jc2jz      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-hwnbh      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-r7msb      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-96lm9      Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-5qcvk      Pod should schedule on: nodeclaim/default-52gbg
default             1s          Normal    DisruptionLaunching          nodeclaim/default-bkz6c                   Launching NodeClaim: Expiration/Replace
default             1s          Normal    DisruptionWaitingReadiness   nodeclaim/default-bkz6c                   Waiting on readiness to continue disruption
default             1s          Normal    DisruptionBlocked            nodepool/system-surge                     No allowed disruptions due to blocking budget
default             1s          Normal    DisruptionWaitingReadiness   nodeclaim/default-5vp7x                   Waiting on readiness to continue disruption
default             1s          Normal    DisruptionLaunching          nodeclaim/default-5vp7x                   Launching NodeClaim: Expiration/Replace

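Configuring Multiple NodePools

Configure separate NodePools for Spot and on-demand capacity:

Spot nodes using E-series VMs (such as "Standard_E2s_v5") and on-demand nodes using D-series VMs (such as "Standard_D4s_v5").

Setting NodePool Priority

When several NodePools exist, give each one a weight. Karpenter tries the NodePool with the highest weight first.

📌 In this setup, the Spot NodePool has a weight of 100 and the on-demand NodePool has a weight of 60: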

osama [ ~ ]$ kubectl get nodepool default -o yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - B
      - key: karpenter.azure.com/sku-name
        operator: In
        values:
        - Standard_B2s_v2
  weight: 100
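
The weight rule can be summarized as "try NodePools from the highest weight down". The Python sketch below illustrates that ordering with the two weights mentioned above. It is not Karpenter's scheduler: the pick_node_pool function and the has_capacity callback are hypothetical, and the on-demand pool name is assumed.

```python
# Weights taken from the setup above: Spot = 100, on-demand = 60.
node_pools = [
    {"name": "ondemand", "capacity_type": "on-demand", "weight": 60},
    {"name": "default", "capacity_type": "spot", "weight": 100},
]


def pick_node_pool(pools, has_capacity):
    """Try pools from highest to lowest weight; return the first that can launch."""
    for pool in sorted(pools, key=lambda p: p["weight"], reverse=True):
        if has_capacity(pool):
            return pool["name"]
    return None


# Spot capacity is available, so the higher-weight Spot pool wins.
print(pick_node_pool(node_pools, lambda p: True))                          # default
# Spot capacity is unavailable, so provisioning falls back to on-demand.
print(pick_node_pool(node_pools, lambda p: p["capacity_type"] != "spot"))  # ondemand
```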

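If no specific SKU name is given, Karpenter considers the entire VM family.
To verify that the sample Vote app is running on a Spot node, find the node the Pods are on and inspect its karpenter.sh labels.
The output should show the node's capacity type as "spot":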

osama [ ~ ]$ kubectl get pods -n karpenter-demo-ns -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                NOMINATED NODE   READINESS GATES
azure-vote-back-687ddb67bd-w7ghm    1/1     Running   0          63m   10.244.3.11    aks-default-5cr5f   <none>           <none>
azure-vote-front-6855444955-64558   1/1     Running   0          63m   10.244.3.168   aks-default-5cr5f   <none>           <none>
osama [ ~ ]$ kubectl describe node aks-default-5cr5f | grep karpenter.sh
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/nodepool=default
                    karpenter.sh/registered=true
                    karpenter.sh/nodepool-hash: 12393960163388511505
                    karpenter.sh/nodepool-hash-version: v2

Simulating a Spot Node Eviction

To test how a Spot eviction is handled, use the Azure CLI to simulate the Spot node being reclaimed:

osama [ ~ ]$ az vm simulate-eviction --resource-group MC_aks-lab_aks-karpenter_eastus --name aks-default-5cr5f
osama [ ~ ]$ date
Tue May 21 06:20:02 PM IST 2024

Use the following curl loop to monitor the availability of the Vote app:

while true; do echo "$(date) $(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2>&1 | grep 'HTTP')"; sleep 2; done

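After the simulated eviction, the existing node is marked for termination and Karpenter automatically creates a new Spot node for the Vote app's Pods. In the run below, a single probe fails at 18:20:12 and the app is returning HTTP 200 again by 18:21:14, so it recovers in about a minute.

  root@MININT-8C81HDE:/home/osamaex# while true; do echo "$(date) $(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2>&1 | grep 'HTTP')"; sleep 2; done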
    Tue May 21 18:20:04 IST 2024 > GET / HTTP/1.1
    < HTTP/1.1 200 OK
    HTTP 200
    Tue May 21 18:20:07 IST 2024 > GET / HTTP/1.1
    < HTTP/1.1 200 OK
    HTTP 200
    Tue May 21 18:20:09 IST 2024 > GET / HTTP/1.1
    < HTTP/1.1 200 OK
    HTTP 200
    Tue May 21 18:20:12 IST 2024 HTTP 000  $Failure-Alert
    Tue May 21 18:21:14 IST 2024 > GET / HTTP/1.1
    < HTTP/1.1 200 OK                      $Successful-Response
    HTTP 200
    Tue May 21 18:22:58 IST 2024 > GET / HTTP/1.1
    < HTTP/1.1 200 OK
    HTTP 200
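
The recovery time can be read straight off the probe log. As a small illustration, the Python sketch below computes the gap between the first failed probe and the next successful one. The timestamps are transcribed from the output above, while the recovery_seconds helper and the use of status 0 for curl's "HTTP 000" are choices made for this example.

```python
from datetime import datetime

# (timestamp, HTTP status) pairs transcribed from the probe output above.
# Status 0 stands for the failed request that curl reported as "HTTP 000".
probes = [
    ("18:20:04", 200),
    ("18:20:07", 200),
    ("18:20:09", 200),
    ("18:20:12", 0),
    ("18:21:14", 200),
    ("18:22:58", 200),
]


def recovery_seconds(samples):
    """Seconds from the first failed probe to the next successful one."""
    first_failure = None
    for stamp, status in samples:
        moment = datetime.strptime(stamp, "%H:%M:%S")
        if status != 200 and first_failure is None:
            first_failure = moment
        elif status == 200 and first_failure is not None:
            return (moment - first_failure).total_seconds()
    return None


print(recovery_seconds(probes))  # 62.0
```

Because the loop only records a line when curl returns, this figure is an upper bound on the outage as seen by this one client.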

Check the events recorded by Karpenter. They show the Pods being evicted from the reclaimed Spot node and nominated onto a replacement node:

osama [ ~ ]$ kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
NAMESPACE           LAST SEEN   TYPE      REASON           OBJECT                                  MESSAGE
default             23s         Warning   FailedDraining   node/aks-default-5cr5f                  Failed to drain node, 10 pods are waiting to be evicted
karpenter-demo-ns   22s         Normal    Evicted          pod/azure-vote-back-687ddb67bd-w7ghm    Evicted pod
karpenter-demo-ns   22s         Normal    Evicted          pod/azure-vote-front-6855444955-64558   Evicted pod
karpenter-demo-ns   21s         Normal    Nominated        pod/azure-vote-back-687ddb67bd-tb2pv    Pod should schedule on: nodeclaim/default-6zkkl
karpenter-demo-ns   21s         Normal    Nominated        pod/azure-vote-front-6855444955-7wzss   Pod should schedule on: nodeclaim/default-6zkkl

Verify that the Pods are now running on the new Spot node:

kubectl get pods -n karpenter-demo-ns -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                NOMINATED NODE   READINESS GATES
azure-vote-back-687ddb67bd-tb2pv    1/1     Running   0          18m   10.244.2.103   aks-default-6zkkl   <none>           <none>
azure-vote-front-6855444955-7wzss   1/1     Running   0          18m   10.244.2.47    aks-default-6zkkl   <none>           <none>

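Saving Costs with Reserved Instances

A NodePool can list several VM families as well as multiple VM SKUs.

You can create a dedicated NodePool, give it the highest weight, and use karpenter.azure.com/sku-name or karpenter.azure.com/sku-family to list the VM SKU families or specific SKU names covered by your reserved instances.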

spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - D
      - key: karpenter.azure.com/sku-name
        operator: In
        values: [Standard_D2s_v3, Standard_D4s_v3, Standard_D8s_v3, Standard_D16s_v3, Standard_D32s_v3, Standard_D64s_v3, Standard_D96s_v3]
  weight: 90

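Conclusion

Bringing Karpenter to AKS marks a significant step forward in node scaling efficiency, flexibility, and cost optimization.

Karpenter overcomes the limitations of Cluster Autoscaler with dynamic, fast node provisioning, giving Kubernetes cluster management a more capable and efficient option.

Its flexibility in supporting different VM types, scaling more freely, and optimizing cost makes it a valuable tool for running Kubernetes clusters.

With Karpenter, organizations can run more agile and cost-effective Kubernetes deployments while improving cluster responsiveness and resource utilization.

If you are looking for more hands-on Karpenter tutorials, follow CloudPilot AI. We will keep sharing practical content 💡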