Deploying Kubeflow ML Pipelines on Kubernetes: A Complete Guide to Orchestrating Machine Learning Workflows

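Machine learning projects often stall at workflow management when they move into production. From data preprocessing, feature engineering, model training, and hyperparameter tuning through to final deployment, every step needs precise dependency management and resource scheduling. Kubeflow was built to address this pain point: as a machine learning platform designed for Kubernetes, it packages each stage of an ML workflow as a reusable, observable, versionable Pipeline component, and orchestrates model training, tuning (Katib), and deployment (KServe) in one place.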

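Kubeflow's core value is turning the "handcraft" of ML experiments into a reproducible, engineered "production line". With Kubeflow Pipelines, data scientists define a DAG (directed acyclic graph) workflow in Python code. Each node runs in its own container, so steps with no dependency between them can run in parallel, a retried run can reuse cached results of steps that already finished, and resources are isolated per step. Combined with Katib's automated hyperparameter search and KServe's model inference serving, the whole MLOps lifecycle forms a closed loop on a single Kubernetes platform.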
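
To make the DAG idea concrete, here is a minimal sketch using only the Python standard library. It is not how KFP schedules work internally; it only illustrates how a dependency graph determines execution order and which steps could run side by side. The step names are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Illustrative step names; each key maps to the set of steps it depends on.
steps = {
    "download_data": set(),
    "train_model": {"download_data"},
    "evaluate_model": {"train_model"},
}


def execution_order(dag):
    """Return one valid execution order for a step -> dependencies mapping."""
    return list(TopologicalSorter(dag).static_order())


def ready_batches(dag):
    """Group steps into batches whose members have no mutual dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

In a linear graph like `steps`, every batch holds a single step. In a diamond-shaped graph, the two middle steps land in the same batch, which is the situation where an orchestrator can run containers in parallel.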

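This tutorial walks through deploying Kubeflow on a Kubernetes cluster from the official manifests with kustomize, and then writing a complete MNIST image classification Pipeline that covers data preparation, training, and evaluation.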

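Server Configuration

Machine learning workloads are demanding on compute resources. We recommend deploying the Kubeflow cluster on Rainyun servers (rainyun-com); enter the promo code 2026off at sign-up for a 50% off coupon. An 8-core, 16 GB instance is a sensible choice for the control node and main worker node, and GPU nodes can be attached as needed.

Recommended hardware:

Kubeflow control plane: 8 cores, 16 GB RAM (the spec used in this tutorial)
Training worker nodes: depends on model size; GPU nodes are best
Storage: at least 100 GB SSD for model storage and Pipeline cache
Operating system: Ubuntu 22.04 LTS
Kubernetes: v1.27+ (check the manifests README for the versions tested with the release you check out)

Software prerequisites:

kubectl (helm v3.12+ is optional; the steps below use kustomize)
kustomize v5.0+
Python 3.9+ (for writing Pipelines)
kfp SDK 2.x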

Preparation

Install kustomize

curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
# Writing to /usr/local/bin usually requires root
sudo mv kustomize /usr/local/bin/
kustomize version

Install the Kubeflow Pipelines SDK

pip install kfp==2.7.0
pip install kfp-kubernetes==1.2.0

Check Kubernetes prerequisites

# Confirm a StorageClass is configured (Kubeflow needs dynamic PVC provisioning)
kubectl get storageclass

# If there is no default StorageClass, install local-path-provisioner (test environments only)
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
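
If you prefer to script this check, the sketch below looks for the default-class annotation in the JSON that `kubectl get storageclass -o json` prints. The helper names are assumptions made for illustration; only the annotation key comes from the command above.

```python
import json
import subprocess

DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(sc_list):
    """Return the names of StorageClasses annotated as default."""
    names = []
    for item in sc_list.get("items", []):
        annotations = item.get("metadata", {}).get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(item["metadata"]["name"])
    return names


def fetch_storage_classes():
    """Run kubectl and parse its JSON output (needs access to the cluster)."""
    result = subprocess.run(
        ["kubectl", "get", "storageclass", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `default_storage_classes(fetch_storage_classes())` should return exactly one name; an empty list means PVCs that omit `storageClassName` will stay unbound.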

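Detailed Configuration: Deploying Kubeflow

Option 1: official manifests (recommended)

# Clone the official repository and check out a stable release
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.9.0

# Full install (all components). The loop retries because some resources
# can only be applied once the CRDs they depend on have been registered.
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources..."
  sleep 20
done

The full install includes Kubeflow Pipelines, Katib, KServe, Central Dashboard, Jupyter Notebooks, Profile Controller, and other core components.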

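Option 2: Kubeflow Pipelines only (lightweight deployment)

# Deploy only the KFP components; suited to resource-constrained environments.
# This multi-user overlay expects the cert-manager and Istio CRDs to exist in the cluster already.
kustomize build apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user | kubectl apply -f -

# Wait for all Pods to become ready
kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout=300s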

Accessing the Kubeflow Dashboard

# Port-forward to reach the Central Dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Or switch the Service to NodePort (for production, configure an Ingress instead)
kubectl patch svc istio-ingressgateway -n istio-system \
  -p '{"spec": {"type": "NodePort"}}'

Default login: user@example.com / 12341234

Verifying the deployment

kubectl get pods -n kubeflow
kubectl get pods -n istio-system
kubectl get pods -n cert-manager

# Check the Pipeline services
kubectl get svc -n kubeflow | grep ml-pipeline
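
Scanning long Pod listings by eye is error-prone, so here is a small sketch that filters the JSON from `kubectl get pods -n kubeflow -o json` down to Pods that still need attention. The function name is an assumption made for illustration. Pods in the Succeeded phase (finished Jobs) are treated as healthy.

```python
def unready_pods(pod_list):
    """Return (name, phase) for Pods that are neither ready nor completed."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # completed Job Pods never report Ready, and that is fine
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if phase != "Running" or not ready:
            problems.append((name, phase))
    return problems
```

Feed it the parsed output of `kubectl get pods -o json`; an empty list means every Pod is either ready or has finished.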

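Core Functionality: Writing an ML Pipeline

Defining a complete training pipeline

The following example shows a complete MNIST image classification Pipeline with three steps: data download, training, and evaluation.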

# pipeline_mnist.py
from kfp import dsl, compiler
from kfp.dsl import Dataset, Model, Output, Input, Metrics

# Step 1: download and prepare the dataset
@dsl.component(
    base_image="python:3.9-slim",
    # tensorflow-datasets needs TensorFlow at runtime to load MNIST
    packages_to_install=["tensorflow==2.13.0", "tensorflow-datasets", "numpy"]
)
def download_data(
    dataset_output: Output[Dataset],
    num_samples: int = 10000
):
    import tensorflow_datasets as tfds
    import numpy as np
    import os

    # batch_size=-1 loads each split as a single (images, labels) pair
    (x_train, y_train), (x_test, y_test) = tfds.as_numpy(tfds.load(
        'mnist',
        split=['train[:{}]'.format(num_samples), 'test[:1000]'],
        as_supervised=True,
        batch_size=-1
    ))

    os.makedirs(dataset_output.path, exist_ok=True)
    # Save images as (N, 28, 28) arrays and labels as (N,) arrays
    np.savez(os.path.join(dataset_output.path, 'mnist.npz'),
             x_train=x_train.squeeze(-1), y_train=y_train,
             x_test=x_test.squeeze(-1), y_test=y_test)

# Step 2: train the model
@dsl.component(
    # The TensorFlow image already ships NumPy, so nothing extra is installed
    base_image="tensorflow/tensorflow:2.13.0",
)
def train_model(
    dataset: Input[Dataset],
    model_output: Output[Model],
    epochs: int = 5,
    learning_rate: float = 0.001
):
    import tensorflow as tf
    import numpy as np
    import os

    # Load the training split and scale pixels to [0, 1]
    data = np.load(os.path.join(dataset.path, 'mnist.npz'))
    X = data['x_train'].astype('float32') / 255.0
    y = data['y_train']

    # Build the model
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    model.fit(X, y, epochs=epochs, validation_split=0.1, verbose=1)

    # Save the model
    os.makedirs(model_output.path, exist_ok=True)
    model.save(model_output.path)
    print(f"Model saved to: {model_output.path}")

# Step 3: evaluate the model on the held-out test split
@dsl.component(
    base_image="tensorflow/tensorflow:2.13.0",
)
def evaluate_model(
    dataset: Input[Dataset],
    model: Input[Model],
    metrics_output: Output[Metrics]
):
    import tensorflow as tf
    import numpy as np
    import os

    data = np.load(os.path.join(dataset.path, 'mnist.npz'))
    X_test = data['x_test'].astype('float32') / 255.0
    y_test = data['y_test']

    trained = tf.keras.models.load_model(model.path)
    loss, accuracy = trained.evaluate(X_test, y_test, verbose=0)
    metrics_output.log_metric("accuracy", float(accuracy))
    metrics_output.log_metric("loss", float(loss))

# Assemble the Pipeline
@dsl.pipeline(
    # Lowercase letters, digits, and hyphens keep the name valid everywhere
    name="mnist-training-pipeline",
    description="End-to-end MNIST image classification: download, train, evaluate"
)
def mnist_pipeline(
    num_samples: int = 10000,
    epochs: int = 5,
    learning_rate: float = 0.001
):
    download_task = download_data(num_samples=num_samples)
    download_task.set_cpu_request("500m").set_memory_request("1Gi")

    # Passing an output as an input already defines the execution order,
    # so no explicit .after() calls are needed.
    train_task = train_model(
        dataset=download_task.outputs["dataset_output"],
        epochs=epochs,
        learning_rate=learning_rate
    )
    train_task.set_cpu_request("2").set_memory_request("4Gi")

    eval_task = evaluate_model(
        dataset=download_task.outputs["dataset_output"],
        model=train_task.outputs["model_output"]
    )

# Compile to YAML
if __name__ == "__main__":
    compiler.Compiler().compile(
        pipeline_func=mnist_pipeline,
        package_path="mnist_pipeline.yaml"
    )
    print("Pipeline compiled: mnist_pipeline.yaml")

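Submitting the Pipeline to Kubeflow

# submit_pipeline.py
import kfp

# Connect to the KFP API server.
# This assumes the endpoint on localhost:8080 accepts unauthenticated requests
# (for example a standalone KFP install reached through a port-forward).
# A full multi-user Kubeflow install behind Dex additionally needs a session
# cookie and a namespace passed to kfp.Client.
client = kfp.Client(host="http://localhost:8080")

# Upload and run
run = client.create_run_from_pipeline_package(
    pipeline_file="mnist_pipeline.yaml",
    arguments={
        "num_samples": 50000,
        "epochs": 10,
        "learning_rate": 0.001
    },
    run_name="mnist-training-v1",
    experiment_name="MNIST Experiments"
)

print(f"Run ID: {run.run_id}")
print(f"View at: http://localhost:8080/#/runs/details/{run.run_id}")

Run the script:

python submit_pipeline.py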

Configuring Katib hyperparameter tuning

# katib-experiment.yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: mnist-hp-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.01"
    - name: epochs
      parameterType: int
      feasibleSpace:
        min: "5"
        max: "20"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate
        reference: learning_rate
      - name: epochs
        description: Number of training epochs
        reference: epochs
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              # Keep the Istio sidecar out of trial Pods
              sidecar.istio.io/inject: "false"
          spec:
            containers:
              - name: training-container
                # Placeholder image: replace with your own trainer image
                image: docker.io/myrepo/mnist-trainer:latest
                command:
                  - python
                  - train.py
                  - --lr=${trialParameters.learningRate}
                  - --epochs=${trialParameters.epochs}
            # Job Pods must use restartPolicy Never or OnFailure
            restartPolicy: Never

Apply the experiment:

kubectl apply -f katib-experiment.yaml
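
The experiment refers to a `train.py` inside the trainer image that this tutorial does not show. As a rough sketch of what such an entry point might look like, the code below parses the hyperparameters Katib passes on the command line and prints metrics as `name=value` lines on stdout. The function names, the `--epochs` flag, and the `train_fn` callback are all assumptions made for illustration, and the real training code is up to you.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that Katib substitutes into the command."""
    parser = argparse.ArgumentParser(description="Hypothetical MNIST trainer entry point")
    parser.add_argument("--lr", type=float, default=0.001, help="learning rate")
    parser.add_argument("--epochs", type=int, default=5, help="number of epochs")
    return parser.parse_args(argv)


def format_metric(name, value):
    """Render one metric as a name=value line for the stdout collector."""
    return f"{name}={value:.4f}"


def run(train_fn, argv=None):
    """Train with the parsed hyperparameters and report the resulting metrics.

    train_fn is any callable taking (lr, epochs) and returning a dict such as
    {"accuracy": ..., "loss": ...}.
    """
    args = parse_args(argv)
    metrics = train_fn(args.lr, args.epochs)
    for name, value in metrics.items():
        print(format_metric(name, value), flush=True)
    return metrics
```

The metric name printed here has to match `objectiveMetricName` in the experiment (`accuracy`), otherwise Katib has nothing to optimize against.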

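Tips

1. Pipeline component caching

KFP supports caching at the component level: a step whose inputs are unchanged is not executed again.

from kfp import dsl

@dsl.pipeline
def my_pipeline():
    task = my_component()  # my_component: any component defined with @dsl.component
    task.set_caching_options(enable_caching=True)  # enabled by default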
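
To build intuition for why "same inputs" means "no re-execution", here is a conceptual sketch of a cache key. This is not KFP's actual fingerprinting algorithm; it only shows the general idea of hashing a step's definition together with its inputs. All names are illustrative assumptions.

```python
import hashlib
import json


def cache_key(component_spec, inputs):
    """Derive a deterministic key from a step's definition and its inputs."""
    payload = json.dumps(
        {"component": component_spec, "inputs": inputs},
        sort_keys=True,  # make the key independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lookup_or_run(cache, component_spec, inputs, run_fn):
    """Return a cached result when the key matches, otherwise run and store."""
    key = cache_key(component_spec, inputs)
    if key not in cache:
        cache[key] = run_fn(**inputs)
    return cache[key]
```

One practical consequence of this design: changing anything that feeds the key, such as the base image or a parameter value, produces a new key and forces the step to run again.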

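2. Sharing large datasets through a PVC

from kfp import dsl
from kfp.kubernetes import mount_pvc

@dsl.pipeline
def pipeline_with_pvc():
    task = heavy_preprocessing()  # placeholder for your own component
    # Mount the existing PVC "dataset-pvc" at /data inside the task's container
    mount_pvc(task, pvc_name="dataset-pvc", mount_path="/data")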

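3. Setting resource limits and GPUs

from kfp import kubernetes

train_task = train_model(dataset=...)
# kfp 2.x requests GPUs through the accelerator API
train_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
train_task.set_memory_limit("8Gi")
# Schedule onto nodes carrying this label (GKE example)
kubernetes.add_node_selector(
    train_task,
    label_key="cloud.google.com/gke-accelerator",
    label_value="nvidia-tesla-t4",
)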

4. Pipeline version management

Manage Pipeline versions through the KFP UI or the SDK, which makes it easy to compare training results across versions:

client.upload_pipeline_version(
    pipeline_package_path="mnist_pipeline_v2.yaml",
    pipeline_version_name="v2.0-improved-augmentation",
    pipeline_id=existing_pipeline_id
)

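Troubleshooting

Q: A Pod crashes with OOMKilled

ML training steps consume a lot of memory, so set a suitable limit on the task with set_memory_limit. Check actual usage first (kubectl top needs metrics-server, and in a multi-user install the run Pods live in the user's namespace rather than kubeflow):

kubectl top pods -n kubeflow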
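
When comparing the usage reported by `kubectl top` (for example `3500Mi`) with a configured limit (for example `4Gi`), converting both to bytes avoids unit mistakes. The sketch below handles the common Kubernetes memory suffixes; the function names are assumptions for illustration, and it does not cover every quantity format Kubernetes accepts.

```python
# Binary (Ki, Mi, ...) and decimal (k, M, ...) suffixes used for memory quantities.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity):
    """Convert a quantity such as '4Gi' or '3500Mi' to a number of bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def memory_headroom(usage, limit):
    """Fraction of the limit still unused (negative when usage exceeds it)."""
    return 1 - parse_memory(usage) / parse_memory(limit)
```

A step whose headroom stays near zero during training is a likely OOMKilled candidate and probably deserves a higher `set_memory_limit` value.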

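Q: A PVC cannot be bound and the Pipeline is stuck in Pending

Check that a StorageClass is configured correctly and set as the default, then inspect the PVC and its events: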

kubectl get pvc -n kubeflow-user-example-com
kubectl describe pvc <pvc-name> -n kubeflow-user-example-com

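Q: Istio sidecar injection makes Pipeline components slow to start

Sidecar injection is controlled per namespace by the istio-injection label and per Pod by the sidecar.istio.io/inject annotation. Check which namespaces have injection enabled, and avoid enabling it where workloads do not need the mesh:

kubectl get namespace -L istio-injection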

Q: The KFP UI is unreachable and shows "502 Bad Gateway"

Check the status of the Istio Ingress Gateway and the KFP backend service:

kubectl get pods -n istio-system
kubectl logs -n kubeflow deployment/ml-pipeline -f
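
Once the Pods look healthy, you can also probe the API server directly. The sketch below calls the KFP v2beta1 health endpoint with the standard library. It assumes the host is reachable without authentication, for example through a port-forward straight to the KFP service; behind Dex the request is typically redirected to a login page instead. The function names are illustrative.

```python
import json
import urllib.request


def healthz_url(host, api_version="v2beta1"):
    """Build the health-check URL for a KFP API host."""
    return f"{host.rstrip('/')}/apis/{api_version}/healthz"


def check_health(host, timeout=5.0):
    """Fetch and decode the health response; raises on connection errors."""
    with urllib.request.urlopen(healthz_url(host), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If `check_health("http://localhost:8080")` raises a connection error while the UI returns 502, the problem is more likely in the backend than in the gateway.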

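Q: A Katib experiment runs for a long time without finishing

Check the Trial Pod logs and confirm that the training script prints its metrics correctly (Katib collects metrics from stdout in a name=value format). Trial Pods run in the experiment's namespace:

kubectl logs -n kubeflow-user-example-com <trial-pod-name> -c training-container | grep "accuracy="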
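
To check a saved log file offline, the following sketch extracts `name=value` metrics with a regular expression. It approximates the stdout format described above and is not Katib's own parser; the function name and sample log lines are assumptions for illustration.

```python
import re

# Matches tokens such as accuracy=0.95 or loss=1e-3.
METRIC_RE = re.compile(r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?)")


def extract_metrics(log_text, names):
    """Return the last reported value for each requested metric name."""
    found = {}
    for key, value in METRIC_RE.findall(log_text):
        if key in names:
            found[key] = float(value)  # later lines overwrite earlier ones
    return found
```

If this returns an empty dict for the objective metric, the training script is probably logging in a different format (for example `accuracy: 0.95`), which would explain trials that never report a result.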

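Kubeflow turns MLOps best practices into platform capabilities, so that ML engineering no longer depends on accumulated individual experience. Stable infrastructure is the precondition for a reliable ML platform: we recommend the 8-core, 16 GB instance from Rainyun (rainyun-com) for deploying Kubeflow, since the larger memory leaves room for several concurrent training jobs. Enter the promo code 2026off at sign-up for a 50% off coupon.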