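SageMaker now has MLflow built in: no more standing up your own server for experiment tracking

In one sentence

Amazon Web Services' May update: managed MLflow is now natively integrated into SageMaker Unified Studio. There is no tracking server to set up yourself; it works out of the box.

How painful it used to be

Self-hosting MLflow means maintaining a whole stack: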

# Run the server on EC2
mlflow server --host 0.0.0.0 \
  --backend-store-uri postgresql://user:pass@rds/mlflow \
  --default-artifact-root s3://mlflow-artifacts/

# Plus: RDS, S3, ALB, VPC, IAM, backups...

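That is seven things to look after, and half a day is gone before any training starts. And if the server goes down without backups, your experiment records may be lost.

How simple it is now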

import mlflow

# The Studio environment is configured automatically, so no server address is needed
mlflow.set_experiment("my-recommendation")

with mlflow.start_run(run_name="lr-0.001"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 32)

    # train() and config are placeholders for your own training code
    model = train(config)

    mlflow.log_metric("val_accuracy", 0.891)
    mlflow.pytorch.log_model(model, "model")

That's all. Parameters, metrics, and model files are stored for you.

Batch hyperparameter tuning

lrs = [0.001, 0.0005, 0.0001]
batches = [16, 32, 64]

for lr in lrs:
    for bs in batches:
        with mlflow.start_run(run_name=f"lr-{lr}-bs-{bs}"):
            mlflow.log_params({"lr": lr, "batch_size": bs})
            metrics = train_and_eval(lr, bs)
            mlflow.log_metrics(metrics)

Once all 9 runs have finished, you can compare them in charts directly in the Studio UI.
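The nested loops above can also be written as a flat grid, which makes it easier to add a third hyperparameter later. The sketch below is only an illustration: `build_grid` and `best_config` are hypothetical helper names, not part of MLflow, and the metrics would come from your own `train_and_eval`.

```python
from itertools import product


def build_grid(lrs, batches):
    """Return one (run_name, params) pair per lr/batch_size combination."""
    grid = []
    for lr, bs in product(lrs, batches):
        params = {"lr": lr, "batch_size": bs}
        grid.append((f"lr-{lr}-bs-{bs}", params))
    return grid


def best_config(results, metric="val_accuracy"):
    """Pick the (run_name, metrics) pair with the highest value of `metric`."""
    return max(results, key=lambda item: item[1][metric])


grid = build_grid([0.001, 0.0005, 0.0001], [16, 32, 64])

# Inside the sweep, each entry would be logged the same way as above:
#   with mlflow.start_run(run_name=name):
#       mlflow.log_params(params)
#       mlflow.log_metrics(train_and_eval(params["lr"], params["batch_size"]))
```

Collecting `(run_name, metrics)` pairs as the sweep runs lets `best_config` pick a winner in code, alongside the visual comparison in the UI.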

Working with Training Jobs

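import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role=sagemaker.get_execution_role(),
    # Example versions (an assumption); use a combination your account supports
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    # "auto" is the value given here for Studio; outside Studio, SageMaker's
    # managed MLflow is typically addressed by the tracking server ARN
    environment={"MLFLOW_TRACKING_URI": "auto"}
)
estimator.fit({"training": "s3://data/train/"})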

Logs from the distributed training job are written to managed MLflow automatically.
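With `instance_count=2`, several processes run `train.py` at the same time. A common pattern, sketched here under assumptions, is to let only the primary process log, so that one training job maps to one MLflow run. The `RANK` variable is what `torchrun` sets; whether your launcher sets it is something to verify, and `is_primary_process` and `log_if_primary` are hypothetical helpers.

```python
import os


def is_primary_process(env=None):
    """True for the process with global rank 0 (or when RANK is unset)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def log_if_primary(log_fn, metrics, step, env=None):
    """Call log_fn(metrics, step=step) on the primary process only."""
    if is_primary_process(env):
        log_fn(metrics, step=step)
        return True
    return False


# In train.py this would be used roughly as:
#   import mlflow
#   log_if_primary(mlflow.log_metrics, {"loss": loss}, step=epoch)
```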

Model registration

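import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by a run ("abc123" stands in for a real run ID)
result = mlflow.register_model("runs:/abc123/model", "my-model")

# Promote to Production
# Note: stages are deprecated in recent MLflow releases in favor of aliases
client = MlflowClient()
client.transition_model_version_stage(
    name="my-model",
    version=result.version,
    stage="Production"
)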

The model has a complete version lineage from experiment to production.
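Once a version is registered, downstream code usually refers to it by a `models:/` URI rather than by run ID. The helper below is a hypothetical convenience, shown only as a sketch; the `models:/<name>/<version-or-stage>` form is the URI scheme the MLflow registry uses.

```python
def registry_uri(name, version_or_stage):
    """Build a registry URI such as models:/my-model/3 or models:/my-model/Production."""
    return f"models:/{name}/{version_or_stage}"


uri = registry_uri("my-model", "Production")

# Loading would then look like this (requires mlflow and a reachable registry):
#   import mlflow.pyfunc
#   model = mlflow.pyfunc.load_model(uri)
#   predictions = model.predict(batch)
```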

Visualizing the training process

for epoch in range(50):
    loss = train_epoch(model)
    val_acc = evaluate(model)
    mlflow.log_metrics({"loss": loss, "val_acc": val_acc}, step=epoch)

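The loss curve is drawn automatically, and multiple runs can be overlaid for comparison.

Comparison

Dimension	Self-hosted	SageMaker managed
Setup time	Half a day to a day	None
Operations	Ongoing	None
SageMaker integration	Manual	Native
Permissions	Build your own auth	AWS IAM
Customizability	High	Medium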

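Limitations

Only available inside Studio
Heavily customized MLflow plugins need a compatibility check
Fewer options for customizing the MLflow UI

Recommendations

Use it from the start on new projects
For existing projects that only use the basics (log_param/metric/model), migration is straightforward
Pair it with SageMaker Pipelines for ML CI/CD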

Multi-framework support

Framework	Method
PyTorch	mlflow.pytorch.log_model()
TensorFlow	mlflow.tensorflow.log_model()
scikit-learn	mlflow.sklearn.log_model()
XGBoost	mlflow.xgboost.log_model()
HuggingFace	mlflow.transformers.log_model()

Whatever the framework, tracking works the same way.
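Because every flavor exposes a `log_model` function, the choice of flavor can be driven by configuration. A minimal sketch, where `FLAVOR_MODULES` and `resolve_log_model` are hypothetical names and the flavor module is imported lazily so that only the framework you actually use needs to be installed:

```python
from importlib import import_module

# Framework label -> MLflow flavor module, mirroring the table above.
FLAVOR_MODULES = {
    "PyTorch": "mlflow.pytorch",
    "TensorFlow": "mlflow.tensorflow",
    "scikit-learn": "mlflow.sklearn",
    "XGBoost": "mlflow.xgboost",
    "HuggingFace": "mlflow.transformers",
}


def resolve_log_model(framework):
    """Return the log_model function for the given framework label."""
    try:
        module_name = FLAVOR_MODULES[framework]
    except KeyError:
        raise ValueError(f"unsupported framework: {framework}") from None
    return import_module(module_name).log_model


# Usage (needs mlflow plus the framework installed):
#   resolve_log_model("scikit-learn")(model, "model")
```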

Addendum: migrating existing experiment data

import mlflow
from mlflow.tracking import MlflowClient

# Read from the old server
old = MlflowClient("http://old-server:5000")
runs = old.search_runs(experiment_ids=["1"])

# Write into Studio MLflow ("auto" as given for the Studio environment)
mlflow.set_tracking_uri("auto")
for run in runs:
    with mlflow.start_run():
        mlflow.log_params(run.data.params)
        mlflow.log_metrics(run.data.metrics)

Basic metrics and parameters can be migrated this way. Model artifacts have to be copied over manually from the old S3 bucket.
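One caveat: `run.data.metrics` holds only the latest value of each metric, so the loop above drops the per-step curves. If those matter, the history can be replayed with `MlflowClient.get_metric_history`. A sketch, where `copy_metric_history` is a hypothetical helper and the client and logging function are passed in so it can be tried without a server:

```python
def copy_metric_history(old_client, run_id, metric_keys, log_metric):
    """Replay every recorded point of each metric and return how many were copied."""
    copied = 0
    for key in metric_keys:
        # get_metric_history returns Metric objects with .value and .step
        for point in old_client.get_metric_history(run_id, key):
            log_metric(key, point.value, step=point.step)
            copied += 1
    return copied


# Inside the migration loop this would be called as:
#   with mlflow.start_run():
#       copy_metric_history(old, run.info.run_id,
#                           run.data.metrics.keys(), mlflow.log_metric)
```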

Addendum: pairing with Pipelines for ML CI/CD

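import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# estimator, train_input, preprocess_step and eval_step are defined elsewhere
training_step = TrainingStep(
    name="train-model",
    estimator=estimator,
    inputs={"train": train_input}
)

pipeline = Pipeline(
    name="weekly-retrain",
    steps=[preprocess_step, training_step, eval_step]
)

# Create or update the pipeline definition before the first execution
pipeline.upsert(role_arn=sagemaker.get_execution_role())

# Each pipeline execution automatically creates an MLflow run
pipeline.start()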

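Pipeline execution history and MLflow experiment history line up one to one and are linked automatically.

Addendum: team collaboration

Sharing among multiple people inside Studio:

See everyone's runs under the same experiment
Compare results across team members
Share one model registry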

No more passing CSV files back and forth to compare results.

Addendum: a real-world scenario, LLM fine-tuning

mlflow.set_experiment("llm-finetune-sentiment")

configs = [
    {"base_model": "distilbert", "lr": 2e-5, "epochs": 3},
    {"base_model": "roberta", "lr": 1e-5, "epochs": 5},
    {"base_model": "deberta-v3", "lr": 5e-6, "epochs": 3},
]

for cfg in configs:
    with mlflow.start_run(run_name=cfg["base_model"]):
        mlflow.log_params(cfg)

        # finetune(), evaluate() and test_data are placeholders for your own code
        model = finetune(cfg)
        metrics = evaluate(model, test_data)

        mlflow.log_metrics({
            "accuracy": metrics["accuracy"],
            "f1": metrics["f1"],
            "inference_time_ms": metrics["latency"]
        })
        # log_model expects a transformers pipeline or a dict of model components
        mlflow.transformers.log_model(model, "model")

Fine-tune several base models and see at a glance which one performs best.
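Because accuracy, F1, and latency are all logged, the choice between base models can be made explicit rather than eyeballed. A sketch under assumptions: `pick_model` is a hypothetical helper, and the numbers in the usage below are made-up placeholders, not measured results.

```python
def pick_model(results, max_latency_ms):
    """Return the name of the highest-F1 model within the latency budget, or None."""
    eligible = {
        name: m for name, m in results.items()
        if m["inference_time_ms"] <= max_latency_ms
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["f1"])


# Placeholder numbers for illustration only.
example_results = {
    "model-a": {"f1": 0.90, "inference_time_ms": 12.0},
    "model-b": {"f1": 0.93, "inference_time_ms": 45.0},
}
choice = pick_model(example_results, max_latency_ms=20.0)
```

With a 20 ms budget the slower model is excluded even though its F1 is higher; loosening the budget changes the pick.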

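Addendum: when you don't need it

One-off script or notebook experiments that don't need tracking
A team of one running very few experiments
An existing self-hosted setup that already works well

MLflow solves the problem of managing experiments at scale. At small scale you don't need it.

Experiment management is the first step in ML engineering, and SageMaker plus MLflow makes that step very light. Stop recording hyperparameters in Excel.

ML experiment tooling keeps evolving. Keep up, and spend the time saved on operations running more experiments. Efficiency is a competitive edge, so embrace managed services.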