Contents
- 1. Comparing Model Serialization Options: Pickle, ONNX, and TorchScript
  - Core Advantages of ONNX
- 2. scikit-learn → ONNX Conversion in Practice
  - 2.1 Installing Dependencies
  - 2.2 Converting a Single Model
  - 2.3 Converting a Whole Pipeline
  - 2.4 Common Causes of Conversion Failure
- 3. Designing the FastAPI Inference API
  - 3.1 Planning the Endpoints
  - 3.2 Input Validation with Pydantic
  - 3.3 Versioning Strategy
- 4. Optimizing ONNX Runtime Inference Performance
  - 4.1 Choosing Execution Providers
  - 4.2 Batching
  - 4.3 Model Warm-up
- 5. Dockerizing the Inference Service
  - 5.1 Multi-stage Build
  - 5.2 Deploying with docker-compose
  - 5.3 Load Balancing with Nginx
- 6. End-to-End Example: Serving Titanic Survival Prediction
  - 6.1 Training and Conversion
  - 6.2 FastAPI Service
  - 6.3 Load Testing with locust
- 7. Summary: The Full Model Deployment Chain
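In a notebook, the job ends once model.predict(X) returns. Production asks for more: millisecond-level responses, hot model reloading, A/B testing, and gradual (canary) rollouts. Moving a model from the lab to live traffic means working through a whole engineering chain: choosing a serialization format, accelerating inference, wrapping the model in a service, containerizing it, and load balancing. Using scikit-learn models as the running example, this article walks through a production-oriented deployment path from .pkl to ONNX to FastAPI + Docker.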
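1. Comparing Model Serialization Options: Pickle, ONNX, and TorchScript
Once training is done, the first step is to serialize the model into a format that can be persisted and shipped. The options differ noticeably along three dimensions: portability, security, and inference performance.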
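| Dimension | Pickle | ONNX | TorchScript |
|---|---|---|---|
| Cross-framework | Python only | Cross-language (C++/Java/JS/Go) | PyTorch ecosystem only |
| Cross-version | Sensitive to the Python version (a file pickled under 3.9 may fail to load under 3.8) and to library versions | Independent of the Python version; runtimes keep supporting older opsets, so existing files generally continue to load | Sensitive to the PyTorch version |
| Security | Deserialization can execute arbitrary code | Contains only a computation graph, with no embedded code to execute | Relatively safe |
| Inference speed | Native Python, no acceleration | Roughly 1.5-3× faster with ONNX Runtime in typical cases, depending on model and batch size | Close to native PyTorch |
| Model size | Baseline | Often 20-40% smaller than the pickle, varying by model | Close to PyTorch |
| Typical use | Rapid prototyping, internal experiments | Production deployment, cross-language services | Deploying PyTorch models |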
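Pickle's biggest liability is security. pickle.load() can execute arbitrary Python code embedded in the file, so an attacker can craft a malicious .pkl that runs shell commands on the server. In 2024, security disclosures around Hugging Face reported multiple incidents caused by loading untrusted pickle files. ONNX, as an open standard format, only describes a computation graph and tensor operations and carries no executable code, which makes it a much better fit for exchanging models across trust boundaries.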
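To make the risk concrete, here is a minimal and deliberately harmless sketch of how a pickle file gets code to run at load time. The `Payload` class is hypothetical and only evaluates an arithmetic expression; a real attack would substitute a dangerous callable in the same place.

```python
import pickle


class Payload:
    """Stand-in for a malicious object hidden inside a .pkl file."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" and performs that call at load time.
        # An attacker would name a dangerous callable here instead.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# No Payload instance comes back: the callable chosen by whoever wrote the
# file runs during loading, and its return value is what the caller receives.
result = pickle.loads(blob)
print(result)  # 42
```

The loader never asked for `eval` to be called; the file decided that. This is why loading pickles from untrusted sources is unsafe no matter how the loading code is written.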
[Figure: serialization decision flow. Notebook training finished → choose a serialization format. Internal experiments: Pickle (.pkl) → loaded in a Python process. Production deployment: ONNX (.onnx) → ONNX Runtime inference → cross-language serving. PyTorch ecosystem: TorchScript (.pt) → LibTorch / PyTorch serving.]
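Core Advantages of ONNX
ONNX (Open Neural Network Exchange) is not any one framework's private format; it is an open intermediate representation (IR). It was designed so that models trained in different frameworks can interoperate: TensorFlow models can be converted with tf2onnx, PyTorch models exported with torch.onnx.export(), and scikit-learn models converted with skl2onnx. The resulting .onnx file can be loaded by high-performance inference engines such as ONNX Runtime, TensorRT, and OpenVINO, which is the "convert once, run anywhere" idea.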
2. scikit-learn → ONNX Conversion in Practice
2.1 Installing Dependencies
bash
pip install skl2onnx onnxruntime
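skl2onnx is the scikit-learn to ONNX converter maintained under the ONNX project; it supports most scikit-learn models and preprocessing components. onnxruntime is Microsoft's high-performance inference engine, with execution backends for CPU, CUDA, TensorRT, and others.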
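2.2 Converting a Single Model
Taking logistic regression as the example, conversion has three steps: train the model → define the input type → call convert_sklearn():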
python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort
import numpy as np
# Train the model
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Define the input type: None as the batch dimension means a dynamic batch size
initial_type = [("float_input", FloatTensorType([None, 4]))]
# Convert to ONNX
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open("iris_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
# Verify inference with ONNX Runtime
sess = ort.InferenceSession("iris_model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name  # the first output is the predicted label
sample = X_test[:3].astype(np.float32)
result = sess.run([output_name], {input_name: sample})
print("Predictions:", result[0])  # should match model.predict(sample)
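The initial_types argument is the key piece. It tells the converter the shape and data type of the input tensor. The None in FloatTensorType([None, 4]) leaves the batch dimension dynamic, which production deployments need because the batch size of live requests is not fixed.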
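On the serving side, a dynamic batch dimension means every request, whether one sample or many, has to be brought into the declared `(batch, 4)` float32 layout before it is passed to the session. The helper below is one possible sketch; `to_model_input` and `N_FEATURES` are illustrative names, not part of skl2onnx or ONNX Runtime.

```python
import numpy as np

N_FEATURES = 4  # mirrors FloatTensorType([None, 4]) from the conversion step


def to_model_input(rows, n_features=N_FEATURES):
    """Coerce one sample or a list of samples to float32 with shape (batch, n_features)."""
    arr = np.asarray(rows, dtype=np.float32)
    if arr.ndim == 1:
        # A single sample becomes a batch of one.
        arr = arr.reshape(1, -1)
    if arr.ndim != 2 or arr.shape[1] != n_features:
        raise ValueError(f"expected shape (batch, {n_features}), got {arr.shape}")
    return arr


single = to_model_input([5.1, 3.5, 1.4, 0.2])
batch = to_model_input([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3], [7.7, 3.0, 6.1, 2.3]])
print(single.shape, batch.shape, batch.dtype)
```

Only the first dimension is free; the feature count and the float32 dtype still have to match what was declared at conversion time.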
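2.3 Converting a Whole Pipeline
In real projects the model is usually bundled with preprocessing steps (scaling, encoding, and so on) into a Pipeline. skl2onnx can convert the entire Pipeline directly: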
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Reuses X_train, y_train, convert_sklearn and FloatTensorType from section 2.2
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
# Convert the whole Pipeline in one call
initial_type = [("features", FloatTensorType([None, X_train.shape[1]]))]
onnx_pipeline = convert_sklearn(pipeline, initial_types=initial_type)
with open("pipeline_model.onnx", "wb") as f:
    f.write(onnx_pipeline.SerializeToString())
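The converted ONNX file contains the full computation graph from raw input to final prediction, so inference no longer needs scikit-learn or its dependencies. The production image can therefore drop the heavy scikit-learn stack and keep only the lightweight ONNX Runtime.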
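2.4 Common Causes of Conversion Failure
| Symptom | Root cause | Fix |
|---|---|---|
| Unsupported model | skl2onnx does not yet support this model type | Check the official list of supported models, or split the Pipeline and convert the parts separately |
| Shape inference failed | The declared input types do not match the actual data | Check that the dimensions and data types in initial_types match the training data |
| String features not supported | Some converters have limited support for string inputs | Encode categories as numbers beforehand, or preprocess with ColumnTransformer |
| Pipeline component unknown | The Pipeline contains a custom component | Replace the custom component with a standard one, or write a custom ONNX converter |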
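3. Designing the FastAPI Inference API
3.1 Planning the Endpoints
A production inference service should offer at least two kinds of endpoints: single prediction (latency-sensitive) and batch prediction (throughput-sensitive). Version management and health checks are also operational necessities.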
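| Endpoint | Method | Purpose | Use case |
|---|---|---|---|
| /predict | POST | Predict a single sample | Real-time recommendation, real-time risk control |
| /batch_predict | POST | Predict a batch of samples | Offline scoring, bulk labeling |
| /health | GET | Service health status | K8s probes, load balancers |
| /models | GET | List available model versions | Version management, A/B testing |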
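3.2 Input Validation with Pydantic
Pydantic models, which FastAPI builds on, are a natural fit for defining the request schema of an inference API: they generate the API documentation automatically and validate types before a request reaches the business logic: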
python
import time
from typing import List, Union

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, ConfigDict, Field

app = FastAPI(title="ML Inference Service", version="1.0.0")


class ApiModel(BaseModel):
    # Pydantic v2 reserves the "model_" prefix; this lifts the warning for model_version
    model_config = ConfigDict(protected_namespaces=())


class SinglePredictionRequest(ApiModel):
    # Pydantic v2 spells the list-length bounds min_length / max_length
    features: List[float] = Field(..., min_length=4, max_length=4, description="4-dimensional feature vector")
    model_version: str = Field(default="v1", description="Model version")


class BatchPredictionRequest(ApiModel):
    features: List[List[float]] = Field(..., description="Batch feature matrix")
    model_version: str = Field(default="v1")


class PredictionResponse(ApiModel):
    predictions: Union[int, List[int]]
    probabilities: Union[List[float], List[List[float]]]
    model_version: str
    inference_time_ms: float


# Model registry: version -> InferenceSession
MODEL_REGISTRY = {}


def load_model(version: str, path: str):
    providers = ["CPUExecutionProvider"]
    # Prefer CUDA when the environment has a GPU
    if "CUDAExecutionProvider" in ort.get_available_providers():
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    sess = ort.InferenceSession(path, providers=providers)
    MODEL_REGISTRY[version] = sess
    return sess


def probs_to_lists(raw):
    # By default skl2onnx classifiers return probabilities as a list of
    # {label: probability} dicts (ZipMap); with zipmap disabled it is an ndarray
    if isinstance(raw, np.ndarray):
        return raw.tolist()
    return [[float(p) for p in row.values()] for row in raw]


@app.on_event("startup")
def startup():
    load_model("v1", "models/iris_v1.onnx")
    load_model("v2", "models/iris_v2.onnx")


@app.post("/predict", response_model=PredictionResponse)
def predict(req: SinglePredictionRequest):
    if req.model_version not in MODEL_REGISTRY:
        raise HTTPException(status_code=404, detail=f"Model version {req.model_version} not found")
    sess = MODEL_REGISTRY[req.model_version]
    input_name = sess.get_inputs()[0].name
    output_names = [o.name for o in sess.get_outputs()]
    arr = np.array([req.features], dtype=np.float32)
    start = time.perf_counter()
    outputs = sess.run(output_names, {input_name: arr})
    elapsed = (time.perf_counter() - start) * 1000
    return PredictionResponse(
        predictions=int(outputs[0][0]),
        probabilities=probs_to_lists(outputs[1])[0] if len(outputs) > 1 else [],
        model_version=req.model_version,
        inference_time_ms=round(elapsed, 2)
    )


@app.post("/batch_predict", response_model=PredictionResponse)
def batch_predict(req: BatchPredictionRequest):
    if req.model_version not in MODEL_REGISTRY:
        raise HTTPException(status_code=404, detail=f"Model version {req.model_version} not found")
    sess = MODEL_REGISTRY[req.model_version]
    input_name = sess.get_inputs()[0].name
    output_names = [o.name for o in sess.get_outputs()]
    arr = np.array(req.features, dtype=np.float32)
    start = time.perf_counter()
    outputs = sess.run(output_names, {input_name: arr})
    elapsed = (time.perf_counter() - start) * 1000
    return PredictionResponse(
        predictions=outputs[0].tolist(),
        probabilities=probs_to_lists(outputs[1]) if len(outputs) > 1 else [],
        model_version=req.model_version,
        inference_time_ms=round(elapsed, 2)
    )


@app.get("/health")
def health():
    loaded = list(MODEL_REGISTRY.keys())
    status = "healthy" if loaded else "unhealthy"
    return {"status": status, "loaded_models": loaded}


@app.get("/models")
def list_models():
    return {
        "available_versions": list(MODEL_REGISTRY.keys()),
        "default_version": "v1"
    }
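3.3 Versioning Strategy
The MODEL_REGISTRY in the code above is a simple in-memory model registry. In production it is usually extended along these lines:
- URL path version: /v1/predict vs /v2/predict, suited to major version switches
- Request body version: the model_version field, suited to gradual rollout behind a single endpoint
- Response header version: X-Model-Version: v1.2.3, which makes tracking easier for clients and monitoring
For A/B testing, traffic ratios can be configured at the Nginx or API gateway layer, for example routing 90% of requests to v1 and 10% to v2, then comparing latency and accuracy between the two versions.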
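The same 90/10 split can also be made inside the application when no gateway rule is available. The sketch below is one way to do it, under the assumption that each request carries a stable key such as a user id; `pick_version` and `canary_percent` are illustrative names. Hashing the key keeps a given user on the same version across requests, which random sampling would not.

```python
import hashlib


def pick_version(request_key: str, canary_percent: int = 10) -> str:
    """Map a stable key (user id, session id) to "v1" or "v2"."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "v2" if bucket < canary_percent else "v1"


keys = [f"user-{i}" for i in range(10_000)]
assignments = [pick_version(k) for k in keys]
canary_share = assignments.count("v2") / len(assignments)
print(f"share routed to v2: {canary_share:.1%}")
```

The returned string could then be used as the `model_version` for the registry lookup, so widening the rollout only means changing `canary_percent`.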
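4. Optimizing ONNX Runtime Inference Performance
4.1 Choosing Execution Providers
ONNX Runtime supports multiple execution providers. Given a priority-ordered list, it assigns each operator in the graph to the first provider that supports it and falls back to the next one otherwise: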
python
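import onnxruntime as ort

# List the providers compiled into the installed onnxruntime build
available = ort.get_available_providers()
# Typical value on a GPU build: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
# State the priority explicitly, keeping only providers that are actually available
preferred = [
    "TensorrtExecutionProvider",  # fastest on NVIDIA GPUs; compiles an engine on first load
    "CUDAExecutionProvider",      # general-purpose NVIDIA GPU backend
    "CPUExecutionProvider",       # fallback
]
providers = [p for p in preferred if p in available]
sess = ort.InferenceSession("model.onnx", providers=providers)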
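| Provider | Hardware | Notes |
|---|---|---|
| CPUExecutionProvider | x86/ARM CPU | No extra dependencies, works out of the box |
| CUDAExecutionProvider | NVIDIA GPU | Accelerated with cuDNN; requires the CUDA driver |
| TensorrtExecutionProvider | NVIDIA GPU | Highest performance; engine compilation before the first inference takes time |
| OpenVINOExecutionProvider | Intel CPU/GPU | Optimized for Intel hardware |
| DirectMLExecutionProvider | Windows GPU | Works across AMD/Intel/NVIDIA GPUs |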
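The TensorRT provider usually gives the lowest inference latency, but on first load it has to compile the ONNX graph into a TensorRT engine, which can take anywhere from seconds to minutes depending on model complexity. In production the compilation is normally done at startup and the engine file cached, so that live requests never trigger it.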
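Engine caching is switched on through provider options rather than happening by default. The sketch below builds a provider list with the TensorRT options `trt_engine_cache_enable` and `trt_engine_cache_path`; the helper name `build_providers` and the `trt_cache` directory are assumptions for illustration. It takes the available providers as an argument so the logic can be checked without a GPU.

```python
def build_providers(available, cache_dir="trt_cache"):
    """Return providers in priority order, with the TensorRT engine cache enabled."""
    providers = []
    if "TensorrtExecutionProvider" in available:
        providers.append((
            "TensorrtExecutionProvider",
            {"trt_engine_cache_enable": True, "trt_engine_cache_path": cache_dir},
        ))
    if "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")  # always keep a CPU fallback
    return providers


gpu_host = build_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
cpu_host = build_providers(["CPUExecutionProvider"])
print(gpu_host)
print(cpu_host)
```

In a real service the argument would come from `ort.get_available_providers()` and the result would be passed as `providers=` when creating the `InferenceSession`.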
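4.2 Batching
Batching is the main lever for throughput, but it raises the latency each sample sees. The goal is to find a batch size where the trade-off between throughput and latency lands within what the business can accept: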
python
import time
import numpy as np

def benchmark_batch(sess, input_name, output_name, batch_sizes, runs=100):
    results = []
    for bs in batch_sizes:
        dummy = np.random.randn(bs, 4).astype(np.float32)
        # Warm up
        for _ in range(10):
            sess.run([output_name], {input_name: dummy})
        # Timed runs: record every call so a real percentile can be computed
        latencies_ms = []
        for _ in range(runs):
            start = time.perf_counter()
            sess.run([output_name], {input_name: dummy})
            latencies_ms.append((time.perf_counter() - start) * 1000)
        qps = (runs * bs) / (sum(latencies_ms) / 1000)
        latency_p99 = float(np.percentile(latencies_ms, 99))
        results.append((bs, qps, latency_p99))
    return results
# Typical results (4 features, logistic regression, CPU); actual numbers depend on hardware:
# batch_size=1   -> QPS=8500,  P99=0.12ms
# batch_size=8   -> QPS=32000, P99=0.25ms
# batch_size=32  -> QPS=68000, P99=0.47ms
# batch_size=64  -> QPS=72000, P99=0.89ms
# batch_size=128 -> QPS=68000, P99=1.88ms  <-- diminishing returns
[Figure: batching throughput, QPS vs. batch size (1, 8, 16, 32, 64, 128)]
[Figure: batching latency, P99 (ms) vs. batch size (1, 8, 16, 32, 64, 128)]
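The curves show that going from batch_size 1 to 32 raises QPS about 8× while latency only grows from 0.12 ms to 0.47 ms; going from 64 to 128, however, QPS actually drops (likely a memory-bandwidth bottleneck) while latency doubles. For latency-sensitive services such as real-time recommendation, use a batch_size of 1-8; for throughput-oriented services such as offline scoring, use 32-64.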
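Choosing the batch size can be reduced to a small rule: among the sizes whose P99 fits the latency budget, take the one with the highest QPS. The sketch below applies that rule to the sample figures quoted in the benchmark comments; `pick_batch_size` is an illustrative helper, and its input uses the same `(batch_size, qps, p99_ms)` tuples that `benchmark_batch` returns.

```python
def pick_batch_size(results, max_p99_ms):
    """Return the (batch_size, qps, p99_ms) row with the highest QPS within the budget."""
    within_budget = [row for row in results if row[2] <= max_p99_ms]
    if not within_budget:
        return None  # no batch size meets the latency budget
    return max(within_budget, key=lambda row: row[1])


# Sample figures from the benchmark comments, as (batch_size, qps, p99_ms)
results = [
    (1, 8500, 0.12),
    (8, 32000, 0.25),
    (32, 68000, 0.47),
    (64, 72000, 0.89),
    (128, 68000, 1.88),
]

print(pick_batch_size(results, max_p99_ms=0.5))  # (32, 68000, 0.47)
print(pick_batch_size(results, max_p99_ms=1.0))  # (64, 72000, 0.89)
```

Note that raising the budget past 1 ms changes nothing here, because batch_size=128 is slower and has lower throughput than 64.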
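4.3 Model Warm-up
On the first inference ONNX Runtime still has to handle memory allocation, operator initialization, and similar one-time setup. Without a warm-up after startup, the first live request pays a noticeable cold-start latency (the "warmup penalty"). The standard approach is to run a number of dummy inferences once the service has started: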
python
# Extends the startup hook from section 3.2 (uses app, load_model, MODEL_REGISTRY and np)
@app.on_event("startup")
def startup():
    load_model("v1", "models/iris_v1.onnx")
    # Warm-up: trigger lazy initialization and memory allocation
    sess = MODEL_REGISTRY["v1"]
    input_name = sess.get_inputs()[0].name
    output_name = sess.get_outputs()[0].name
    dummy = np.random.randn(32, 4).astype(np.float32)
    for _ in range(100):
        sess.run([output_name], {input_name: dummy})
    print("Model warm-up complete")
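Typically 10-100 warm-up runs are enough, depending on model complexity. With the TensorRT provider, warm-up can also build the engine and, when engine caching is enabled, write it to disk for reuse on the next start.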
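To decide how many warm-up runs a particular model needs, it helps to look at per-call latency during the warm-up itself. The following is a small sketch with a hypothetical `warm_up` helper that accepts any zero-argument callable; here a trivial stand-in replaces the real `sess.run(...)` call so the snippet runs without a model file.

```python
import time


def warm_up(run_once, iterations=10):
    """Call run_once repeatedly and return each call's latency in milliseconds."""
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


# Stand-in for: lambda: sess.run([output_name], {input_name: dummy})
calls = []
latencies = warm_up(lambda: calls.append(1), iterations=20)
print(f"first call: {latencies[0]:.4f} ms, last call: {latencies[-1]:.4f} ms")
```

With a real session, once the later entries stop improving relative to the first few, further warm-up iterations add startup time without benefit.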
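5. Dockerizing the Inference Service
5.1 Multi-stage Build
The Docker image for an inference service should be as lean as possible: it only needs ONNX Runtime and FastAPI, not training frameworks such as PyTorch, TensorFlow, or scikit-learn.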
dockerfile
# builder stage: install the Python dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# runtime stage: copy only what is needed
FROM python:3.11-slim
WORKDIR /app
# Copy the installed packages from the builder
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
# Copy the application code and models
COPY app.py .
COPY models/ ./models/
# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
requirements.txt only needs the runtime dependencies:
text
fastapi==0.111.0
uvicorn[standard]==0.30.0
onnxruntime==1.18.0
numpy==1.26.4
pydantic==2.7.0
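Compared with a single-stage build, the image shrinks from about 1.2 GB (with gcc and scikit-learn) to about 180 MB (onnxruntime and fastapi only); most of the saving comes from leaving the training stack and build tools out of the final image.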
5.2 Deploying with docker-compose
yaml
version: "3.8"
services:
  inference:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OMP_NUM_THREADS=2  # cap OpenMP threads at half the 4-CPU limit to reduce CPU contention
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 2G
    healthcheck:
      # python:3.11-slim ships without curl, so probe with the standard library instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - inference
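OMP_NUM_THREADS caps the OpenMP thread pool. By default ONNX Runtime uses all available CPU cores, which in a container leads to CPU contention with the uvicorn worker processes. Limiting inference threads to about half of the container's CPU quota (for example 2 threads for a 4-core limit) leaves room for both. Be aware that this variable only takes effect in onnxruntime builds that use OpenMP; the standard pip wheels of recent releases manage their own thread pool, where the equivalent control is SessionOptions.intra_op_num_threads.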
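The "half of the quota" rule is easy to encode so that the thread count follows the container limit instead of being hard-coded. The sketch below uses a hypothetical helper `intra_op_threads`; the commented lines show where the value would go when configuring ONNX Runtime through `SessionOptions`, and are left unexecuted because they need onnxruntime and a model file.

```python
import math


def intra_op_threads(cpu_quota: float, share: float = 0.5) -> int:
    """Threads to give inference: a share of the CPU quota, rounded down, at least 1."""
    return max(1, math.floor(cpu_quota * share))


threads = intra_op_threads(4.0)
print(threads)  # 2

# Applying it (not executed here):
#   import onnxruntime as ort
#   opts = ort.SessionOptions()
#   opts.intra_op_num_threads = threads
#   sess = ort.InferenceSession("model.onnx", sess_options=opts,
#                               providers=["CPUExecutionProvider"])
```

Setting the thread count on the session has the advantage of working regardless of whether the installed build honors `OMP_NUM_THREADS`.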
5.3 Load Balancing with Nginx
nginx
events {
    worker_connections 1024;
}
http {
    upstream inference_backend {
        least_conn;  # least-connections scheduling
        # Assumes three backend containers reachable as inference_1..3;
        # adjust the host names to match your own service names
        server inference_1:8000 weight=5;
        server inference_2:8000 weight=5;
        server inference_3:8000 weight=5;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://inference_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_connect_timeout 5s;
            proxy_read_timeout 30s;
        }
        location /health {
            access_log off;
            proxy_pass http://inference_backend/health;
        }
    }
}
[Figure: request flow. Client request → Nginx → (load balancing) FastAPI Worker 1 / Worker 2 / Worker 3 → ONNX Runtime → model file (.onnx) → response returned.]
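For large models (>1 GB), having every worker process load its own copy wastes a great deal of memory. A better option is to keep the model in shared memory or in a separate inference service that FastAPI workers call over IPC (inter-process communication) or gRPC. On K8s, model files can be mounted from NFS or object storage, with an Init Container pulling the latest version when the Pod starts.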
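6. End-to-End Example: Serving Titanic Survival Prediction
This section converts the Titanic Pipeline built earlier in this series into an ONNX inference service, covering the whole path from training to deployment.
6.1 Training and Conversion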
python
import os
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType, Int64TensorType, StringTensorType
# Load the data
df = pd.read_csv("titanic.csv")
X = df[["Pclass", "Sex", "Age", "Fare", "Embarked"]]
y = df["Survived"]
# Preprocessing Pipeline
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_features = ["Pclass", "Sex", "Embarked"]
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
# Full Pipeline
clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])
clf.fit(X, y)
# Convert to ONNX. A ColumnTransformer with string inputs needs one typed input per column.
# Caveat: Pclass (int64) shares the categorical branch with string columns. If your
# skl2onnx version rejects the mixed types, cast Pclass to str before fitting and
# declare it as StringTensorType instead.
initial_type = [
    ("Pclass", Int64TensorType([None, 1])),
    ("Sex", StringTensorType([None, 1])),
    ("Age", FloatTensorType([None, 1])),
    ("Fare", FloatTensorType([None, 1])),
    ("Embarked", StringTensorType([None, 1]))
]
onnx_model = convert_sklearn(clf, initial_types=initial_type, target_opset=15)
with open("titanic_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
print("Conversion done, size:", round(os.path.getsize("titanic_model.onnx") / 1024, 2), "KB")
6.2 FastAPI Service
python
from fastapi import FastAPI
from pydantic import BaseModel, ConfigDict, Field
import onnxruntime as ort
import numpy as np

app = FastAPI()


class TitanicRequest(BaseModel):
    Pclass: int = Field(..., ge=1, le=3, description="Passenger class: 1/2/3")
    Sex: str = Field(..., pattern="^(male|female)$", description="Sex")
    Age: float = Field(..., ge=0, le=120, description="Age")
    Fare: float = Field(..., ge=0, description="Fare")
    Embarked: str = Field(..., pattern="^(S|C|Q)$", description="Port of embarkation")


class TitanicResponse(BaseModel):
    # Pydantic v2 reserves the "model_" prefix; this lifts the warning for model_version
    model_config = ConfigDict(protected_namespaces=())
    survived: int
    survival_probability: float
    model_version: str


sess = ort.InferenceSession("titanic_model.onnx", providers=["CPUExecutionProvider"])
input_names = {inp.name: inp for inp in sess.get_inputs()}
output_names = [o.name for o in sess.get_outputs()]


@app.post("/predict", response_model=TitanicResponse)
def predict(req: TitanicRequest):
    # Build the ONNX inputs, matched by column name; each is a (1, 1) array
    inputs = {
        "Pclass": np.array([[req.Pclass]], dtype=np.int64),
        "Sex": np.array([[req.Sex]], dtype=object),
        "Age": np.array([[req.Age]], dtype=np.float32),
        "Fare": np.array([[req.Fare]], dtype=np.float32),
        "Embarked": np.array([[req.Embarked]], dtype=object)
    }
    outputs = sess.run(output_names, inputs)
    # outputs[1] is a list of {label: probability} dicts (the skl2onnx ZipMap default);
    # key 1 is the positive class
    prob = outputs[1][0][1]
    return TitanicResponse(
        survived=int(outputs[0][0]),
        survival_probability=round(float(prob), 4),
        model_version="titanic-v1"
    )


@app.get("/health")
def health():
    return {"status": "healthy", "model": "titanic-v1"}
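The endpoint above handles one passenger per request. Since every input was declared with a dynamic first dimension, the same column-wise layout extends to batches. The sketch below shows one way to build those inputs from a list of records; `COLUMN_DTYPES` and `rows_to_onnx_inputs` are hypothetical names, and the dtypes are assumed to mirror the initial_type list used at conversion time.

```python
import numpy as np

# Column name -> dtype, mirroring the input types declared at conversion time
COLUMN_DTYPES = {
    "Pclass": np.int64,
    "Sex": object,
    "Age": np.float32,
    "Fare": np.float32,
    "Embarked": object,
}


def rows_to_onnx_inputs(rows):
    """Turn a list of passenger dicts into one (n, 1) array per input column."""
    return {
        name: np.array([[row[name]] for row in rows], dtype=dtype)
        for name, dtype in COLUMN_DTYPES.items()
    }


passengers = [
    {"Pclass": 1, "Sex": "female", "Age": 29.0, "Fare": 211.3, "Embarked": "S"},
    {"Pclass": 3, "Sex": "male", "Age": 22.0, "Fare": 7.25, "Embarked": "Q"},
]
inputs = rows_to_onnx_inputs(passengers)
for name, arr in inputs.items():
    print(name, arr.shape, arr.dtype)
```

The resulting dict has the shape that `sess.run(output_names, inputs)` expects, with one row per passenger in every column array.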
6.3 Load Testing with locust
python
from locust import HttpUser, task, between


class InferenceUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(3)
    def predict(self):
        self.client.post("/predict", json={
            "Pclass": 1,
            "Sex": "female",
            "Age": 29.0,
            "Fare": 211.3,
            "Embarked": "S"
        })

    @task(1)
    def health(self):
        self.client.get("/health")
Start the load test (--headless runs without the web UI, so that --run-time applies):
bash
locust -f locustfile.py --headless --host=http://localhost:8000 -u 100 -r 10 --run-time=60s
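Expected results (4-core CPU, single worker). These are rough reference figures that vary with hardware; note also that 100 users waiting 0.1-0.5 s between requests generate at most about 330 requests per second, so reaching the RPS range below requires more users or a shorter wait_time:
- RPS: ~1200-1500
- P50 latency: ~3-5 ms
- P99 latency: ~15-25 ms
- Error rate: 0%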
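For reading figures like these, it helps to be precise about what P50 and P99 mean. The sketch below implements the nearest-rank definition of a percentile on synthetic latencies; locust computes its own percentiles internally, so this is only an illustration of the concept and not its implementation.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # synthetic latencies: 1 ms .. 100 ms
print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 99))  # 99
```

A P99 of 99 ms here means 99 of the 100 requests finished in 99 ms or less; a mean would hide exactly the slow tail this number is meant to expose.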
7. Summary: The Full Model Deployment Chain
[Figure: end-to-end deployment flow. Notebook training → serialization format: Pickle → .pkl file → Python service; ONNX → .onnx file → ONNX Runtime. Then: FastAPI wrapper → version management (/v1, /v2) → Docker multi-stage build → image (~180 MB) → Nginx load balancing → horizontal scaling (3×) → Prometheus monitoring → production.]
From notebook to production, the key decisions in model deployment can be summarized in the following checklist:
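| Stage | Check | Recommended approach |
|---|---|---|
| Serialization | Cross-language or cross-version use needed? | ONNX |
| Inference | GPU acceleration needed? | CUDA / TensorRT |
| Serving | Single or batch requests? | Two-endpoint design |
| Versioning | How to roll out gradually? | URL path + request-body version |
| Container | Is the image size under control? | Multi-stage build without training dependencies |
| Scaling | Large model memory footprint? | Shared storage + dedicated inference Pods |
| Operations | How to monitor model drift? | Log input distributions + output confidence |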
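For the last row, recording the input distribution does not require storing every request. One lightweight option is to keep running statistics per feature and compare them with the training baseline. The sketch below uses Welford's online algorithm, a technique chosen here for illustration rather than one taken from the article; the `RunningStats` class and the sample ages are made up.

```python
import math


class RunningStats:
    """Welford's online mean and sample standard deviation for one numeric feature."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


age_stats = RunningStats()
for age in [22.0, 38.0, 26.0, 35.0]:  # ages seen in live requests
    age_stats.update(age)

print(age_stats.mean, age_stats.std)  # 30.25 7.5
```

If the live mean or spread of a feature moves far from the values computed on the training set, that is a cue to investigate drift before accuracy visibly degrades.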
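If this article helped clarify how to approach model deployment in practice, likes and follows for the column are welcome. Earlier posts on engineering scikit-learn Pipelines, deploying services on K8s, and high-concurrency load testing can be read alongside it to complete the loop from training to deployment.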