Contents
- 1. Training-Serving Skew
  - 1.1 What Is Training-Serving Skew?
  - 1.2 Detecting Training-Serving Skew
  - 1.3 Eliminating Skew at the Root
- 2. Feature Store
  - 2.1 Feature Store Architecture
  - 2.2 Feature Store Implementation
  - 2.3 Feature Store Version Management
- 3. Comparing Online Inference Approaches
  - 3.1 Three Inference Modes
  - 3.2 Implementing a Real-Time Inference Service
  - 3.3 Model Version Management and Canary Release
- 4. A/B Test Design
  - 4.1 Designing A/B Tests Rigorously
  - 4.2 Common A/B Testing Pitfalls
- 5. Model Monitoring
  - 5.1 Four-Layer Monitoring Architecture
  - 5.2 Implementing Model Monitoring
- 6. Model Retraining Strategies
  - 6.1 Three Retraining Triggers
- 7. E-commerce Recommender ML Architecture in Practice
  - 7.1 End-to-End ML Architecture Design
- 8. Checklist for Eliminating Training-Serving Skew
- Summary
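Training a model takes a week; keeping it running reliably in production takes a year. The "last mile" of an ML system (feature consistency, online inference latency, A/B testing, and model monitoring) is where engineering value is really tested.
Once a model is trained, how do you keep it running reliably in production? How does a feature store keep training and inference consistent? How do you run an A/B test that is statistically sound? This article answers these questions from a system design perspective.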
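1. Training-Serving Skew
1.1 What Is Training-Serving Skew?
Training uses offline features while inference uses online features → the values disagree → model performance degrades. The model is not wrong; the features were computed differently.
Four common sources of skew:
| Source of skew | What it looks like | Impact |
|---|---|---|
| Different code | Features computed in Python for training, in Java/Go for inference | Floating-point differences, subtle logic differences |
| Different time windows | Training uses "all data up to yesterday"; inference uses "the last 30 days of real-time data" | Inconsistent statistical definitions |
| Different data sources | Training reads batch tables from the data warehouse; inference reads a real-time stream from a message queue | Differences in data completeness |
| Different precision | Training keeps 6 decimal places; inference keeps only 4 | Small changes in feature values shift predictions |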
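The precision row can be made concrete with a small sketch. The weights and raw feature values below are hypothetical, chosen only to show how rounding the same inputs to 6 versus 4 decimal places shifts the score of a simple logistic model.

```python
import math

def score(features, weights, bias=0.0):
    """Logistic score of a linear model (illustrative only)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and raw feature values
weights = [1200.0, -800.0]
raw = [0.0012345678, 0.0009876543]

offline = [round(x, 6) for x in raw]  # training pipeline keeps 6 decimals
online = [round(x, 4) for x in raw]   # serving pipeline keeps 4 decimals

offline_score = score(offline, weights)
online_score = score(online, weights)
print(f"offline={offline_score:.4f} online={online_score:.4f} "
      f"diff={abs(offline_score - online_score):.4f}")
```

How much this matters depends on the feature scale and the model's sensitivity; features with small magnitudes and large weights are the most exposed.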
[Figure: sources of training-serving skew. Training path: features computed in Python, full data from the data warehouse, statistics window ending yesterday, 6-decimal precision, feeding model training. Inference path: features computed in Java/Go, Kafka real-time stream, rolling 30-day real-time window, 4-decimal precision, feeding model inference. Skew sources: different code, different time windows, different data sources, different precision.]
1.2 Detecting Training-Serving Skew
```python
import numpy as np
from scipy import stats


class TrainingServingSkewDetector:
    """Detects training-serving skew between offline and online features."""

    def __init__(self):
        self.feature_names = []
        self.skew_report = {}

    def detect_numerical_skew(self, offline_features, online_features, feature_names,
                              threshold=0.01):
        """
        Detect skew in numerical features.

        Both inputs are 2-D arrays with one column per feature. Rows must be
        paired (same entities, same order) because the Pearson correlation
        compares values row by row.
        threshold: maximum allowed relative difference of the means (1% = 0.01)
        """
        skew_results = {}
        for i, name in enumerate(feature_names):
            offline_vals = offline_features[:, i]
            online_vals = online_features[:, i]
            # 1. Relative difference of the means (reported as 0 if the offline mean is 0)
            offline_mean = np.mean(offline_vals)
            mean_diff = abs(offline_mean - np.mean(online_vals))
            mean_relative = mean_diff / abs(offline_mean) if offline_mean != 0 else 0
            # 2. KS test: do both samples follow the same distribution?
            ks_stat, ks_pvalue = stats.ks_2samp(offline_vals, online_vals)
            # 3. Pearson correlation: is the skew systematic?
            #    (high correlation = linear offset, low correlation = random error)
            correlation, _ = stats.pearsonr(offline_vals, online_vals)
            is_skewed = mean_relative > threshold or ks_pvalue < 0.05
            skew_results[name] = {
                'mean_relative_diff': mean_relative,
                'ks_stat': ks_stat,
                'ks_pvalue': ks_pvalue,
                'correlation': correlation,
                'is_skewed': is_skewed,
                'skew_type': 'systematic' if correlation > 0.9 else 'random',
            }
        # Build the report
        skewed_features = [name for name, result in skew_results.items()
                           if result['is_skewed']]
        self.skew_report = {
            'total_features': len(feature_names),
            'skewed_features': len(skewed_features),
            'skewed_feature_names': skewed_features,
            'skew_rate': len(skewed_features) / len(feature_names),
            'details': skew_results,
        }
        if self.skew_report['skew_rate'] > 0.2:
            print(f"⚠ Severe skew: {len(skewed_features)}/{len(feature_names)} features are skewed")
            print(f"  Skewed features: {skewed_features[:5]}")
        elif self.skew_report['skew_rate'] > 0.05:
            print(f"⚠ Mild skew: {len(skewed_features)} features are skewed")
        else:
            print(f"✓ Skew check passed: {len(skewed_features)}/{len(feature_names)} "
                  f"features flagged (threshold {threshold})")
        return self.skew_report

    def detect_categorical_skew(self, offline_cats, online_cats, feature_names):
        """Detect skew in categorical features: do both sides share the same value set?"""
        skew_results = {}
        for name in feature_names:
            offline_unique = set(offline_cats[name])
            online_unique = set(online_cats[name])
            # New values: seen at inference time but never during training
            new_values = online_unique - offline_unique
            # Missing values: seen during training but absent at inference time
            missing_values = offline_unique - online_unique
            skew_results[name] = {
                'offline_values': len(offline_unique),
                'online_values': len(online_unique),
                'new_values': list(new_values),
                'missing_values': list(missing_values),
                'is_skewed': len(new_values) > 0 or len(missing_values) > 0,
            }
        return skew_results
```
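1.3 Eliminating Skew at the Root
The root cause of skew is that training and inference compute the same feature with different code. The fix is a single feature definition (One Source of Truth): the offline and online paths share one piece of computation code, managed centrally by a Feature Store.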
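As a minimal sketch of the single-definition idea, the example below routes both the training path and the serving path through one feature function. The function name `purchase_count_30d`, the event schema, and the two builder functions are assumptions made for illustration; they are not part of any particular Feature Store product.

```python
from datetime import datetime, timedelta

def purchase_count_30d(events, as_of):
    """Single definition: purchases in the 30 days before `as_of` (hypothetical feature)."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for e in events
               if e["type"] == "purchase" and window_start <= e["ts"] < as_of)

def build_training_row(events, label_time):
    # Offline path: evaluate the feature as of the label timestamp
    return {"purchase_count_30d": purchase_count_30d(events, label_time)}

def build_serving_row(events, now):
    # Online path: same function, evaluated as of the request time
    return {"purchase_count_30d": purchase_count_30d(events, now)}

t = datetime(2024, 3, 1)
events = [
    {"type": "purchase", "ts": t - timedelta(days=1)},
    {"type": "purchase", "ts": t - timedelta(days=40)},  # outside the window
    {"type": "view", "ts": t - timedelta(days=2)},       # not a purchase
]
print(build_training_row(events, t), build_serving_row(events, t))
```

Because the window boundaries and the filter live in one place, a change to the definition reaches both paths at once, which is the property the Feature Store is meant to enforce at scale.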
2. Feature Store
2.1 Feature Store Architecture
[Figure: dual-store Feature Store architecture (One Source of Truth). A Feature Registry holds metadata, computation logic, and versions. The offline store (Parquet/Delta Lake) serves historical features for training; the online store (Redis/Bigtable) serves real-time features for inference in < 10 ms. Training flow: write the feature definition, read offline features, train the model. Inference flow: the same feature definition, read online features, run model inference in < 100 ms.]
The core value of a Feature Store is not "storing features". It is guaranteeing that the offline and online paths use the same feature definitions and computation logic, which removes training-serving skew at its source.
2.2 Feature Store Implementation
```python
import time
from collections import defaultdict
from datetime import datetime


class SimpleFeatureStore:
    """A lightweight Feature Store implementation."""

    def __init__(self):
        # Feature registry: feature_name -> feature definition
        self.registry = {}
        # Offline store (read for training): entity_id -> {feature_name: record}
        self.offline_store = defaultdict(dict)
        # Online store (read for inference): entity_id -> {feature_name: record}
        self.online_store = defaultdict(dict)
        # Version history per feature
        self.version_history = defaultdict(list)

    def register_feature(self, name, description, dtype, computation_logic,
                         entity_key, online_retention_hours=24,
                         offline_retention_days=365):
        """
        Register a feature definition. This is the heart of One Source of Truth:
        every feature must be registered before it can be used.
        """
        definition = {
            'name': name,
            'description': description,
            'dtype': dtype,                          # int/float/string/array
            'computation_logic': computation_logic,  # computation logic (code/SQL)
            'entity_key': entity_key,                # entity key (user_id, item_id, ...)
            'online_retention_hours': online_retention_hours,
            'offline_retention_days': offline_retention_days,
            'registered_at': datetime.now(),
            'version': 1,
        }
        self.registry[name] = definition
        self.version_history[name].append(definition)
        return definition

    def write_offline_features(self, entity_id, feature_values, timestamp=None):
        """
        Write to the offline store, which is read at training time.
        feature_values: {feature_name: value}
        """
        ts = timestamp or datetime.now()
        for feature_name, value in feature_values.items():
            if feature_name not in self.registry:
                raise ValueError(
                    f"Feature {feature_name} is not registered; call register_feature() first")
            self.offline_store[entity_id][feature_name] = {
                'value': value,
                'timestamp': ts,
                'version': self.registry[feature_name]['version'],
            }

    def write_online_features(self, entity_id, feature_values):
        """
        Write to the online store, which is read at inference time.
        Must use the same feature definitions as the offline store.
        """
        for feature_name, value in feature_values.items():
            if feature_name not in self.registry:
                raise ValueError(f"Feature {feature_name} is not registered")
            self.online_store[entity_id][feature_name] = {
                'value': value,
                'timestamp': datetime.now(),
                'version': self.registry[feature_name]['version'],
            }

    def get_offline_features(self, entity_ids, feature_names, time_range=None):
        """
        Batch-read offline features to build training data.
        time_range: optional (start, end); features outside the window are skipped.
        """
        result = {}
        for entity_id in entity_ids:
            # .get() avoids creating empty entries in the defaultdict on reads
            stored = self.offline_store.get(entity_id, {})
            entity_features = {}
            for name in feature_names:
                if name in stored:
                    feat_data = stored[name]
                    if time_range:
                        start, end = time_range
                        if feat_data['timestamp'] < start or feat_data['timestamp'] > end:
                            continue
                    entity_features[name] = feat_data['value']
            result[entity_id] = entity_features
        return result

    def get_online_features(self, entity_id, feature_names):
        """
        Read online features at inference time.
        Performance target: a single read takes < 10 ms.
        """
        start_time = time.perf_counter()
        stored = self.online_store.get(entity_id, {})
        entity_features = {}
        for name in feature_names:
            if name in stored:
                entity_features[name] = stored[name]['value']
            else:
                # Fall back to a default when the feature is missing
                entity_features[name] = self._handle_missing_feature(name)
        elapsed = time.perf_counter() - start_time
        if elapsed > 0.01:  # 10 ms alert
            print(f"⚠ Online feature read took {elapsed*1000:.1f}ms > 10ms budget")
        return entity_features

    def _handle_missing_feature(self, feature_name):
        """Fallback values for missing features."""
        definition = self.registry.get(feature_name)
        if definition and definition['dtype'] == 'float':
            return 0.0        # numerical features default to 0
        elif definition and definition['dtype'] == 'string':
            return 'unknown'  # categorical features default to 'unknown'
        return None

    def verify_consistency(self, entity_id, feature_names, tolerance=0.01):
        """
        Check that offline and online feature values agree.
        tolerance: maximum allowed relative difference
        """
        offline = self.get_offline_features([entity_id], feature_names)
        online = self.get_online_features(entity_id, feature_names)
        inconsistencies = []
        for name in feature_names:
            off_val = offline.get(entity_id, {}).get(name)
            on_val = online.get(name)
            if off_val is not None and on_val is not None:
                if isinstance(off_val, (int, float)) and isinstance(on_val, (int, float)):
                    if abs(off_val) > 0:
                        relative_diff = abs(off_val - on_val) / abs(off_val)
                        if relative_diff > tolerance:
                            inconsistencies.append({
                                'feature': name,
                                'offline': off_val,
                                'online': on_val,
                                'relative_diff': relative_diff,
                            })
        if inconsistencies:
            print(f"⚠ Inconsistent features: {len(inconsistencies)} features differ by > {tolerance}")
            for inc in inconsistencies:
                print(f"  {inc['feature']}: offline={inc['offline']}, "
                      f"online={inc['online']}, diff={inc['relative_diff']:.2%}")
        return inconsistencies
```
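2.3 Feature Store Version Management
When a feature's logic changes, the flow is: register the new version → keep the old version → bind each model to a specific version, so that rolling out new feature logic does not affect existing models.
```python
from datetime import datetime


def feature_version_update(feature_store, feature_name, new_logic, new_description):
    """Update a feature to a new version."""
    # 1. Look up the current (soon to be old) version
    old_definition = feature_store.registry[feature_name]
    old_version = old_definition['version']
    # 2. Create the new version from a copy, leaving the old definition untouched
    new_definition = old_definition.copy()
    new_definition['version'] = old_version + 1
    new_definition['computation_logic'] = new_logic
    new_definition['description'] = new_description
    new_definition['registered_at'] = datetime.now()
    # 3. Update the registry (the old version stays in version_history)
    feature_store.registry[feature_name] = new_definition
    feature_store.version_history[feature_name].append(new_definition)
    print(f"Feature {feature_name} updated: v{old_version} → v{new_definition['version']}")
    print(f"  Old version v{old_version} stays in version_history for models bound to it")
    return new_definition
```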
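3. Comparing Online Inference Approaches
3.1 Three Inference Modes
| Inference mode | Typical use cases | Latency | Throughput | Implementation |
|---|---|---|---|---|
| Real-time inference | Recommendation / search / risk control | < 100 ms | Low to medium | REST API / gRPC |
| Batch inference | Scoring / reporting / user profiling | Minutes | High | Scheduled jobs / Spark |
| Streaming inference | Real-time monitoring / fraud detection | < 50 ms | High | Kafka + Flink |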
3.2 Implementing a Real-Time Inference Service
```python
import time

import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel


# Request/response schemas for the inference service
class PredictionRequest(BaseModel):
    entity_id: str
    feature_names: list[str] = []
    context: dict = {}


class PredictionResponse(BaseModel):
    entity_id: str
    prediction: float
    feature_count: int
    inference_time_ms: float
    total_time_ms: float


class RealtimeInferenceService:
    """Real-time inference service with a latency budget of < 100 ms."""

    def __init__(self, model, feature_store):
        self.model = model
        self.feature_store = feature_store
        self.app = FastAPI(title="ML Inference Service")
        self._setup_routes()
        # Latency statistics
        self.latency_stats = {'p50': 0, 'p90': 0, 'p99': 0}
        self.request_count = 0

    def _setup_routes(self):
        @self.app.post("/predict", response_model=PredictionResponse)
        async def predict(request: PredictionRequest):
            total_start = time.perf_counter()
            # Step 1: read online features (budget: 30 ms)
            feature_start = time.perf_counter()
            features = self.feature_store.get_online_features(
                request.entity_id, request.feature_names
            )
            feature_time = (time.perf_counter() - feature_start) * 1000
            # Empty only when no feature names were requested; missing features
            # are otherwise filled with defaults by the feature store
            if not features:
                raise HTTPException(status_code=404,
                                    detail=f"Features not found: {request.entity_id}")
            # Step 2: model inference (budget: 50 ms)
            inference_start = time.perf_counter()
            feature_vector = np.array(list(features.values())).reshape(1, -1)
            # Convert the NumPy scalar to a plain float for the response model
            prediction = float(self.model.predict(feature_vector)[0])
            inference_time = (time.perf_counter() - inference_start) * 1000
            total_time = (time.perf_counter() - total_start) * 1000
            # Latency alert
            if total_time > 100:
                print(f"⚠ Inference too slow: {total_time:.1f}ms > 100ms budget "
                      f"(features: {feature_time:.1f}ms, inference: {inference_time:.1f}ms)")
            self.request_count += 1
            return PredictionResponse(
                entity_id=request.entity_id,
                prediction=prediction,
                feature_count=len(features),
                inference_time_ms=inference_time,
                total_time_ms=total_time,
            )

        @self.app.get("/health")
        async def health_check():
            return {
                'status': 'healthy',
                'model_loaded': self.model is not None,
                'feature_store_connected': self.feature_store is not None,
                'request_count': self.request_count,
            }

    def get_latency_stats(self, recent_latencies):
        """Latency report (percentiles in ms) over recent requests."""
        if not recent_latencies:
            return self.latency_stats
        sorted_latencies = sorted(recent_latencies)
        n = len(sorted_latencies)
        self.latency_stats = {
            'p50': sorted_latencies[n // 2],
            'p90': sorted_latencies[min(int(n * 0.9), n - 1)],
            'p99': sorted_latencies[min(int(n * 0.99), n - 1)],
            'max': sorted_latencies[-1],
        }
        # SLA alert
        if self.latency_stats['p99'] > 100:
            print(f"⚠ P99 latency {self.latency_stats['p99']:.1f}ms > 100ms SLA")
        return self.latency_stats
```
3.3 Model Version Management and Canary Release
```python
import hashlib
from datetime import datetime


class ModelVersionManager:
    """Model version management with canary release."""

    def __init__(self):
        self.models = {}              # version -> model info
        self.active_version = None
        self.traffic_allocation = {}  # version -> traffic ratio

    def register_model(self, version, model, metrics_report):
        """Register a new model version."""
        self.models[version] = {
            'model': model,
            'registered_at': datetime.now(),
            'metrics': metrics_report,
            'status': 'staged',  # staged → canary → production
        }

    def canary_release(self, new_version, canary_ratio=0.1):
        """
        Canary release: route a small share of traffic (10% by default)
        to the new model and watch its metrics.
        """
        if self.active_version is None:
            # No active version yet → go straight to production
            self.active_version = new_version
            self.traffic_allocation[new_version] = 1.0
            self.models[new_version]['status'] = 'production'
            return
        # Canary release of the new model
        old_version = self.active_version
        self.traffic_allocation = {
            old_version: 1.0 - canary_ratio,
            new_version: canary_ratio,
        }
        self.models[new_version]['status'] = 'canary'
        print(f"Canary release: {old_version} ({1-canary_ratio:.0%}) → "
              f"{new_version} ({canary_ratio:.0%})")

    def promote_canary(self, canary_metrics=None):
        """
        Switch all traffic to the canary once it has passed observation.
        Condition: the canary's CTR/GMV is not worse than the old model's.
        """
        if canary_metrics:
            # Metric validation
            if canary_metrics.get('ctr_drop', 0) > 0.02:
                print(f"⚠ Canary metrics failed: CTR dropped {canary_metrics['ctr_drop']:.2%}")
                return False
        # The canary is the version receiving traffic that is not the active one.
        # (Filtering on 0 < ratio < 1 alone would also match the old version.)
        candidates = [v for v in self.traffic_allocation if v != self.active_version]
        if not candidates:
            print("⚠ No canary release in progress")
            return False
        new_version = candidates[0]
        old_version = self.active_version
        self.active_version = new_version
        self.traffic_allocation = {new_version: 1.0}
        self.models[new_version]['status'] = 'production'
        self.models[old_version]['status'] = 'retired'
        print(f"✓ Full switch: {old_version} → {new_version}")
        return True

    def rollback(self, reason="metrics below target"):
        """Roll back to the previous stable version."""
        old_versions = [v for v, info in self.models.items()
                        if info['status'] == 'retired']
        if not old_versions:
            print("⚠ No version available for rollback")
            return False
        rollback_version = old_versions[-1]  # most recently registered retired version
        failed_version = self.active_version
        self.active_version = rollback_version
        self.traffic_allocation = {rollback_version: 1.0}
        self.models[rollback_version]['status'] = 'production'
        if failed_version in self.models and failed_version != rollback_version:
            self.models[failed_version]['status'] = 'rolled_back'
        print(f"⚠ Rollback: → {rollback_version}, reason: {reason}")
        return True

    def route_request(self, request_id):
        """Route a request according to the traffic allocation."""
        hash_val = int(hashlib.md5(str(request_id).encode()).hexdigest(), 16)
        cumulative = 0
        for version, ratio in self.traffic_allocation.items():
            cumulative += ratio
            if hash_val % 100 < cumulative * 100:
                return version
        return self.active_version
```
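The hash-based routing above can be checked in isolation. The standalone sketch below is an illustration, not the service's production router: the function `route`, the 10,000-bucket resolution, and the version names are assumptions. It shows that hashing the request ID gives a stable assignment whose traffic shares approximate the configured ratios.

```python
import hashlib
from collections import Counter

def route(request_id, allocation, buckets=10_000):
    """Map a request ID to a version; `allocation` is {version: ratio} summing to 1."""
    bucket = int(hashlib.md5(str(request_id).encode()).hexdigest(), 16) % buckets
    cumulative = 0.0
    for version, ratio in allocation.items():
        cumulative += ratio
        if bucket < cumulative * buckets:
            return version
    # Guard against floating-point rounding when the ratios sum to just under 1
    return list(allocation)[-1]

allocation = {"v1": 0.9, "v2": 0.1}  # hypothetical 90/10 canary split
counts = Counter(route(i, allocation) for i in range(100_000))
share_v2 = counts["v2"] / 100_000
print(f"v2 share: {share_v2:.3f}")
```

Because the assignment depends only on the ID, the same request ID always reaches the same version, which keeps canary metrics attributable to one model.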
4. A/B Test Design
4.1 Designing A/B Tests Rigorously
```python
import hashlib
import time
from collections import defaultdict

import numpy as np
from scipy import stats


class ABTestFramework:
    """A/B testing: sample size, traffic split, significance tests, Simpson's paradox check."""

    def __init__(self, experiment_name, control_model, treatment_model):
        self.experiment_name = experiment_name
        self.control_model = control_model
        self.treatment_model = treatment_model
        self.results = defaultdict(lambda: defaultdict(list))

    def calculate_sample_size(self, baseline_rate, mde, alpha=0.05, power=0.8):
        """
        Sample size calculation. Always run this before the experiment starts.
        baseline_rate: key metric of the control group (e.g. CTR)
        mde: minimum detectable effect as a relative lift (e.g. 2% = 0.02)
        alpha: significance level (default 5%)
        power: statistical power (default 80%)
        """
        p1 = baseline_rate
        p2 = baseline_rate * (1 + mde)
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_beta = stats.norm.ppf(power)
        p_avg = (p1 + p2) / 2
        n_per_group = ((z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
                        z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / \
                      (p2 - p1) ** 2
        print("Sample size calculation:")
        print(f"  Baseline CTR: {baseline_rate:.2%}")
        print(f"  Minimum detectable lift: {mde:.2%}")
        print(f"  Required samples per group: {int(np.ceil(n_per_group))}")
        print(f"  Total samples: {int(np.ceil(n_per_group * 2))}")
        print(f"  Estimated duration (at 100,000 requests/day): "
              f"{int(np.ceil(n_per_group * 2 / 100000))} days")
        return int(np.ceil(n_per_group))

    def assign_group(self, user_id, strategy='user_level'):
        """
        Traffic split strategies:
        - user_level: hash on the user (a user always lands in the same group)
        - request_level: hash per request (a user may land in different groups)
        Recommendation scenarios must use user_level: showing one user different
        recommendation versions confounds the behavioral signal.
        """
        if strategy == 'user_level':
            hash_key = f"{self.experiment_name}_{user_id}"
        else:
            hash_key = f"{self.experiment_name}_{user_id}_{time.time()}"
        hash_val = int(hashlib.md5(hash_key.encode()).hexdigest(), 16)
        return 'control' if hash_val % 2 == 0 else 'treatment'

    def record_metric(self, group, user_id, metric_name, value):
        """Record an experiment metric."""
        self.results[group][metric_name].append(value)

    def analyze(self, metric_name='ctr', alpha=0.05):
        """Significance tests plus effect size estimation."""
        control_vals = self.results['control'][metric_name]
        treatment_vals = self.results['treatment'][metric_name]
        if not control_vals or not treatment_vals:
            return {'error': 'insufficient data'}
        # Compare means
        control_mean = np.mean(control_vals)
        treatment_mean = np.mean(treatment_vals)
        absolute_lift = treatment_mean - control_mean
        relative_lift = absolute_lift / control_mean if control_mean > 0 else 0
        # Welch's t-test (does not assume equal variances, matching the SE below)
        t_stat, p_value = stats.ttest_ind(control_vals, treatment_vals, equal_var=False)
        # Mann-Whitney U test (non-parametric, weaker distributional assumptions)
        u_stat, u_pvalue = stats.mannwhitneyu(control_vals, treatment_vals,
                                              alternative='two-sided')
        # Effect size: Cohen's d
        pooled_std = np.sqrt((np.var(control_vals) + np.var(treatment_vals)) / 2)
        cohens_d = absolute_lift / pooled_std if pooled_std > 0 else 0
        # 95% confidence interval of the absolute lift (normal approximation)
        se = np.sqrt(np.var(control_vals) / len(control_vals) +
                     np.var(treatment_vals) / len(treatment_vals))
        ci_lower = absolute_lift - 1.96 * se
        ci_upper = absolute_lift + 1.96 * se
        result = {
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'absolute_lift': absolute_lift,
            'relative_lift': relative_lift,
            'p_value': p_value,
            'mann_whitney_p_value': u_pvalue,
            'significant': p_value < alpha,
            'cohens_d': cohens_d,
            'ci_95': (ci_lower, ci_upper),
            'sample_size': {'control': len(control_vals),
                            'treatment': len(treatment_vals)},
        }
        print(f"A/B test analysis ({metric_name}):")
        print(f"  Control mean: {control_mean:.4f}")
        print(f"  Treatment mean: {treatment_mean:.4f}")
        print(f"  Absolute lift: {absolute_lift:.4f} ({relative_lift:.2%})")
        print(f"  p-value: {p_value:.4f} "
              f"{'✓ significant' if p_value < alpha else '✗ not significant'}")
        print(f"  Cohen's d: {cohens_d:.3f}")
        print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
        return result

    def check_simpson_paradox(self, stratified_data):
        """
        Simpson's paradox check.
        stratified_data: {subgroup: {'control': [...], 'treatment': [...]}}
        Flags the case where control is better in most subgroups, yet treatment
        is better once all subgroups are pooled. This happens when the subgroup
        mix differs between the two arms, so large subgroups dominate the total.
        """
        overall_control = np.mean([x for data in stratified_data.values()
                                   for x in data.get('control', [])])
        overall_treatment = np.mean([x for data in stratified_data.values()
                                     for x in data.get('treatment', [])])
        subgroup_results = {}
        subgroup_control_wins = 0
        for subgroup, data in stratified_data.items():
            ctrl_mean = np.mean(data.get('control', [0]))
            treat_mean = np.mean(data.get('treatment', [0]))
            subgroup_results[subgroup] = {
                'control_mean': ctrl_mean,
                'treatment_mean': treat_mean,
                'treatment_wins': treat_mean > ctrl_mean,
            }
            if treat_mean <= ctrl_mean:
                subgroup_control_wins += 1
        paradox = (overall_treatment > overall_control) and \
                  (subgroup_control_wins > len(stratified_data) / 2)
        if paradox:
            print("⚠ Simpson's paradox detected!")
            print(f"  Overall: treatment beats control "
                  f"({overall_treatment:.4f} vs {overall_control:.4f})")
            print(f"  But control is better in {subgroup_control_wins}/"
                  f"{len(stratified_data)} subgroups")
            for sg, res in subgroup_results.items():
                print(f"  {sg}: control {res['control_mean']:.4f} vs "
                      f"treatment {res['treatment_mean']:.4f}")
        return {
            'paradox_detected': paradox,
            'overall': {'control': overall_control, 'treatment': overall_treatment},
            'subgroup_results': subgroup_results,
            'recommendation': ('analyze each subgroup separately' if paradox
                               else 'overall conclusion is reliable'),
        }
```
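For reference, the sample size computed in `calculate_sample_size` corresponds to the usual normal-approximation formula for comparing two proportions. Written out with the code's own quantities, where $p_1$ is the baseline rate, $p_2 = p_1(1 + \text{mde})$, and $\bar{p} = (p_1 + p_2)/2$:

```latex
n_{\text{per group}}
  = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
        \;+\; z_{\text{power}}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
         {(p_2 - p_1)^{2}}
```

As an illustrative calculation with assumed inputs (not figures from this article): a 5% baseline CTR and a 10% relative lift give $p_2 = 0.055$; with $\alpha = 0.05$ and 80% power ($z_{1-\alpha/2} \approx 1.96$, $z_{\text{power}} \approx 0.84$) the formula yields roughly 31,000 users per group. Halving the detectable lift roughly quadruples the requirement, because the denominator scales with the square of $p_2 - p_1$.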
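4.2 Common A/B Testing Pitfalls
| Pitfall | Description | Prevention |
|---|---|---|
| Peeking | Checking results before the planned duration is reached and stopping early | Stick to the planned duration; make no early decisions |
| **Simpson's paradox** | Subgroup conclusions contradict the overall conclusion | Analyze by subgroup; do not pool unconditionally |
| Request-level split | The same user lands in different groups and sees different recommendations | Recommendation scenarios must split by user-level hash |
| Novelty effect | Short-term curiosity about a new UI or new recommendations temporarily lifts CTR | Run the experiment for at least 2 to 4 weeks and observe the long-term effect |
| Interference between experiments | Several A/B tests run at once and affect each other's results | Use mutually exclusive groups or orthogonal traffic layers |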
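Simpson's paradox is easier to recognize with numbers. The sketch below uses illustrative (successes, trials) counts, not data from a real experiment, in which control wins inside every segment while treatment wins after pooling. The segment names and helper functions are assumptions for this example.

```python
# Illustrative (successes, trials) counts per segment; not real experiment data
data = {
    "segment_a": {"control": (81, 87),   "treatment": (234, 270)},
    "segment_b": {"control": (192, 263), "treatment": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

def pooled_rate(group):
    """Conversion rate after pooling all segments for one arm."""
    successes = sum(seg[group][0] for seg in data.values())
    trials = sum(seg[group][1] for seg in data.values())
    return successes / trials

control_wins_every_segment = all(
    rate(*seg["control"]) > rate(*seg["treatment"]) for seg in data.values()
)
treatment_wins_overall = pooled_rate("treatment") > pooled_rate("control")

for name, seg in data.items():
    print(f"{name}: control {rate(*seg['control']):.3f} "
          f"vs treatment {rate(*seg['treatment']):.3f}")
print(f"pooled: control {pooled_rate('control'):.3f} "
      f"vs treatment {pooled_rate('treatment'):.3f}")
```

The reversal comes from the segment mix: the treatment arm is concentrated in the high-converting segment. In a properly randomized experiment the mix should be nearly identical across arms, so a pattern like this is also a reason to check the traffic split itself.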
五、模型监控体系
5.1 四层监控架构
[Figure: four-layer model monitoring. Monitoring layers: (1) input data drift, checked with PSI/KS tests; (2) prediction distribution drift, tracked through changes in the prediction mean and quantiles; (3) label delay, where ground-truth labels arrive late; (4) business-metric linkage, e.g. recommendation GMV or risk-control loss rate. Feature-level, prediction-level, and performance-level alerts feed the response strategies: lower confidence, trigger retraining, roll back the model.]
5.2 Implementing Model Monitoring
python
from datetime import datetime

import numpy as np
from scipy import stats


class ModelMonitoringSystem:
    """Model monitoring: drift detection, performance alerts, and business-metric linkage."""

    def __init__(self, reference_data=None, reference_predictions=None):
        # Reference data: the feature distribution seen at training time
        self.reference_features = reference_data
        self.reference_predictions = reference_predictions
        self.alerts = []
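
    def psi_check(self, reference_dist, current_dist, bins=10):
        """
        Detect feature distribution drift with PSI (Population Stability Index).

        Rule-of-thumb thresholds:
          PSI < 0.1:   stable
          PSI 0.1-0.2: minor shift, keep watching
          PSI > 0.2:   significant drift, retraining needed
        """
        # Bin on the reference distribution and reuse the same edges for current data.
        # Raw counts are used (not density=True) so the shares below are true proportions.
        ref_hist, bin_edges = np.histogram(reference_dist, bins=bins)
        cur_hist, _ = np.histogram(current_dist, bins=bin_edges)
        # Note: current values outside the reference range fall outside every bin.
        # Convert to proportions and floor at 0.001 so empty bins do not break the log.
        ref_pct = np.maximum(ref_hist / max(ref_hist.sum(), 1), 0.001)
        cur_pct = np.maximum(cur_hist / max(cur_hist.sum(), 1), 0.001)
        psi = float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
        severity = 'stable' if psi < 0.1 else 'warning' if psi < 0.2 else 'critical'
        action = {'stable': 'none', 'warning': 'monitor', 'critical': 'retrain'}[severity]
        return {'psi': psi, 'severity': severity, 'action': action}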
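
    def ks_check(self, reference_dist, current_dist):
        """Two-sample KS test: has the distribution shifted significantly?"""
        ks_stat, p_value = stats.ks_2samp(reference_dist, current_dist)
        significant = p_value < 0.05
        # With large samples even tiny shifts become significant, so the KS statistic
        # (the effect size) decides between 'monitor' and 'retrain'.
        action = 'none' if not significant else 'monitor' if ks_stat < 0.1 else 'retrain'
        return {
            'ks_stat': ks_stat,
            'p_value': p_value,
            'significant_shift': significant,
            'action': action,
        }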

    def prediction_drift_check(self, recent_predictions, window_size=1000):
        """
        Detect prediction distribution drift by comparing the mean and median
        of recent predictions against the reference predictions.
        """
        if self.reference_predictions is None:
            return {'error': 'no reference prediction distribution'}
        # Only the most recent window_size predictions are compared
        window = np.asarray(recent_predictions)[-window_size:]
        ref_mean = np.mean(self.reference_predictions)
        ref_median = np.median(self.reference_predictions)
        cur_mean = np.mean(window)
        cur_median = np.median(window)
        # Relative shift of the mean (0 when the reference is 0, to avoid dividing by zero)
        mean_shift = abs(cur_mean - ref_mean) / abs(ref_mean) if ref_mean != 0 else 0
        # Relative shift of the median
        median_shift = abs(cur_median - ref_median) / abs(ref_median) if ref_median != 0 else 0
        drift_detected = mean_shift > 0.1 or median_shift > 0.15
        return {
            'mean_shift': mean_shift,
            'median_shift': median_shift,
            'drift_detected': drift_detected,
            'reference_mean': ref_mean,
            'current_mean': cur_mean,
        }
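
    def label_delay_handler(self, prediction_timestamps, label_timestamps,
                            max_delay_hours=72):
        """
        Handle label delay: ground-truth labels arrive after the prediction.
        Example: a credit-default label is only confirmed once the loan matures.

        Strategy:
          1. Compute performance metrics only on confirmed labels.
          2. Leave samples with unconfirmed labels out of the metrics for now.
          3. Recompute the metrics periodically as labels are backfilled.
        """
        confirmed = []
        for pred_ts, label_ts in zip(prediction_timestamps, label_timestamps):
            # A missing label (None) has not arrived yet, so it counts as pending
            if label_ts is None:
                confirmed.append(False)
                continue
            delay_hours = (label_ts - pred_ts).total_seconds() / 3600
            confirmed.append(delay_hours <= max_delay_hours)
        confirmed_rate = sum(confirmed) / len(confirmed) if confirmed else 0
        return {
            'confirmed_rate': confirmed_rate,
            'pending_count': len(confirmed) - sum(confirmed),
            'warning': confirmed_rate < 0.5,  # under 50% confirmed: metrics are unreliable
            'recommendation': ('extend the observation window' if confirmed_rate < 0.5
                               else 'compute metrics as usual'),
        }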

    def business_metric_correlation(self, model_metrics, business_metrics):
        """
        Business-metric linkage: do model metrics and business metrics move together?

        Key signal: the model metric rises but the business metric does not,
        so the model is "spinning idle".
        Example: AUC goes up while the recommendation share of GMV stays flat,
        so the model improvement produced no business value.
        """
        correlation = np.corrcoef(model_metrics, business_metrics)[0, 1]
        # Slope of a linear fit gives the trend direction of each series
        model_trend = np.polyfit(range(len(model_metrics)), model_metrics, 1)[0]
        business_trend = np.polyfit(range(len(business_metrics)), business_metrics, 1)[0]
        direction_aligned = (model_trend > 0 and business_trend > 0) or \
                            (model_trend < 0 and business_trend < 0)
        return {
            'correlation': correlation,
            'model_trend': model_trend,
            'business_trend': business_trend,
            'direction_aligned': direction_aligned,
            'warning': not direction_aligned and abs(model_trend) > 0.01,
            'recommendation': ('model is spinning idle, redefine the objective'
                               if not direction_aligned else 'normal'),
        }

    def generate_daily_report(self, current_features, current_predictions,
                              feature_names, business_metrics=None, model_metrics=None):
        """Build the daily monitoring report."""
        report = {'date': datetime.now().strftime('%Y-%m-%d'), 'alerts': []}
        # Feature drift
        if self.reference_features is not None:
            for i, name in enumerate(feature_names):
                psi_result = self.psi_check(
                    self.reference_features[:, i], current_features[:, i]
                )
                if psi_result['severity'] != 'stable':
                    report['alerts'].append(
                        f"Feature {name}: PSI={psi_result['psi']:.3f} ({psi_result['severity']})"
                    )
        # Prediction distribution drift
        pred_result = self.prediction_drift_check(current_predictions)
        if pred_result.get('drift_detected'):
            report['alerts'].append(
                f"Prediction distribution shift: mean shift={pred_result['mean_shift']:.2%}"
            )
        # Business-metric linkage. `is not None` avoids the ambiguous truth value of
        # NumPy arrays. Pass a metric series such as daily AUC as model_metrics;
        # without it, the last 30 predictions stand in as a rough proxy.
        if business_metrics is not None and len(business_metrics) > 0:
            series = model_metrics if model_metrics is not None else current_predictions
            biz_result = self.business_metric_correlation(
                series[-30:], business_metrics[-30:]
            )
            if biz_result.get('warning'):
                report['alerts'].append(
                    f"Business metric not moving with the model: "
                    f"correlation={biz_result['correlation']:.3f}"
                )
        report['alert_count'] = len(report['alerts'])
        report['status'] = 'healthy' if not report['alerts'] else 'attention_needed'
        for alert in report['alerts']:
            print(f"  {alert}")
        return report
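To make the PSI thresholds concrete, here is a minimal standalone sketch on synthetic data. PSI is computed as the sum over bins of (c_i - r_i) * ln(c_i / r_i), where r_i and c_i are the reference and current shares in bin i. The function name `psi`, the 0.001 floor, the sample sizes, and the normal distributions are illustrative assumptions, not part of the monitoring class.

```python
import numpy as np
from scipy import stats


def psi(reference, current, bins=10, floor=0.001):
    """PSI between two 1-D samples, binned on the reference sample's range."""
    ref_counts, edges = np.histogram(reference, bins=bins)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Shares per bin, floored so that empty bins do not produce log(0)
    ref_pct = np.maximum(ref_counts / ref_counts.sum(), floor)
    cur_pct = np.maximum(cur_counts / max(cur_counts.sum(), 1), floor)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(seed=0)        # fixed seed so the run is repeatable
reference = rng.normal(0.0, 1.0, 10_000)   # stand-in for a training-time feature
same = rng.normal(0.0, 1.0, 10_000)        # same distribution, fresh sample
shifted = rng.normal(1.0, 1.0, 10_000)     # mean moved by one standard deviation

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
ks_shifted = stats.ks_2samp(reference, shifted)

print(f"PSI (same distribution):    {psi_same:.4f}")
print(f"PSI (shifted distribution): {psi_shifted:.4f}")
print(f"KS statistic (shifted):     {ks_shifted.statistic:.4f}")
```

A fresh sample from the same distribution should land well inside the "stable" band, while a one-standard-deviation mean shift should land well past the 0.2 retraining threshold. The KS test is a useful cross-check, but with samples this large its p-value flags even very small shifts, so the statistic itself is the better severity signal.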
6. Model Retraining Strategies
6.1 Three Ways to Trigger Retraining
[Figure: retraining triggers and the post-retraining flow. Triggers: scheduled retraining (weekly/monthly); drift-triggered retraining (PSI > 0.2 or a performance drop); Champion-Challenger (a new model continuously challenges the old one). After retraining: evaluate on the validation set; if the performance gate passes, start a 10% canary release, otherwise keep the old model; if the canary metrics meet the target, switch all traffic, otherwise roll back to the old model.]
python
from datetime import datetime


class RetrainingStrategy:
    """Model retraining strategies."""

    def __init__(self, model_trainer, version_manager, monitoring):
        self.trainer = model_trainer
        self.version_mgr = version_manager
        self.monitoring = monitoring

    def scheduled_retrain(self, new_data, schedule='weekly'):
        """
        Scheduled retraining: simple and reliable.
        Best for scenarios where the data distribution changes slowly.
        """
        print(f"[scheduled retrain] using the latest {len(new_data)} records")
        # 1. Train a new model
        new_model, metrics = self.trainer.train(new_data)
        # 2. Performance gate: the new AUC may be at most 2% (relative) below the current one
        current_metrics = self.version_mgr.models[self.version_mgr.active_version]['metrics']
        auc_floor = current_metrics['auc'] * 0.98
        if metrics['auc'] >= auc_floor:
            # 3. Register the new version
            new_version = f"v_{datetime.now().strftime('%Y%m%d')}"
            self.version_mgr.register_model(new_version, new_model, metrics)
            # 4. Canary release
            self.version_mgr.canary_release(new_version, canary_ratio=0.1)
            return {'action': 'canary_release', 'version': new_version}
        print(f"⚠ New model below the gate: AUC {metrics['auc']:.3f} < {auc_floor:.3f}")
        return {'action': 'skip', 'reason': 'below performance gate'}
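
    def drift_triggered_retrain(self, drift_report, new_data):
        """
        Drift-triggered retraining: reacts faster.
        Best for scenarios where the data distribution changes quickly.
        """
        if drift_report['severity'] != 'critical':
            return {'action': 'monitor', 'reason': 'drift is not severe'}
        print(f"[drift-triggered retrain] PSI={drift_report['psi']:.3f}")
        # Use a more aggressive data window under drift: only the most recent 7 days
        # rather than the full history, to adapt quickly to the new distribution.
        # The slice assumes one row per minute (7 * 24 * 60 = 10,080 rows).
        recent_data = new_data[-7 * 24 * 60:]
        new_model, metrics = self.trainer.train(recent_data)
        # The gate is looser under drift: the new model only needs to beat the old one
        # on recent data (the old model may still win on old data, but current data is
        # the real distribution). This sketch does not implement that comparison; it
        # registers and releases directly, so add the check before production use.
        new_version = f"v_drift_{datetime.now().strftime('%Y%m%d%H%M')}"
        self.version_mgr.register_model(new_version, new_model, metrics)
        self.version_mgr.canary_release(new_version, canary_ratio=0.2)  # larger canary share
        return {'action': 'canary_release', 'version': new_version,
                'drift_psi': drift_report['psi']}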

    def champion_challenger(self, challenger_data, champion_version=None):
        """
        Champion-Challenger setup.

        The Challenger keeps training in the background and is compared with the
        Champion (the current production model).
        Pros: production traffic is never interrupted; a failed Challenger has no impact.
        Cons: needs extra compute.
        """
        champion = champion_version or self.version_mgr.active_version
        champion_metrics = self.version_mgr.models[champion]['metrics']
        # Train the Challenger
        challenger_model, challenger_metrics = self.trainer.train(challenger_data)
        # Per-metric difference (Challenger minus Champion) on shared metrics
        improvement = {
            metric: challenger_metrics[metric] - champion_metrics[metric]
            for metric in champion_metrics
            if metric in challenger_metrics
        }
        # Worth shipping only if some metric improves by more than 0.02.
        # The signed delta is used: abs() would also count a large regression.
        significant_improvement = any(delta > 0.02 for delta in improvement.values())
        # Guardrail: AUC and precision may drop by at most 2% (relative);
        # both metric dicts are assumed to contain these two keys.
        if significant_improvement and all(
            challenger_metrics[m] >= champion_metrics[m] * 0.98
            for m in ['auc', 'precision']
        ):
            print(f"✓ Challenger beats Champion: {improvement}")
            new_version = f"v_challenger_{datetime.now().strftime('%Y%m%d')}"
            self.version_mgr.register_model(new_version, challenger_model, challenger_metrics)
            self.version_mgr.canary_release(new_version)
            return {'action': 'promote_challenger', 'improvement': improvement}
        print(f"✗ Challenger does not beat Champion: {improvement}")
        return {'action': 'discard_challenger', 'improvement': improvement}
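All three strategies end in a canary release, so the step that follows is the promote-or-rollback decision. The sketch below shows one way to make that call. `CanaryWindow`, `decide_canary`, the conversion-rate metric, the 1,000-request minimum, and the 2% tolerance are hypothetical choices for illustration, not part of the version manager used above.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    """Traffic observed for one model during the canary period (hypothetical shape)."""
    requests: int
    conversions: int

    @property
    def rate(self) -> float:
        # Conversion rate; 0.0 when no traffic has been observed yet
        return self.conversions / self.requests if self.requests else 0.0


def decide_canary(champion: CanaryWindow, canary: CanaryWindow,
                  min_requests: int = 1000, tolerance: float = 0.02) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary release."""
    if canary.requests < min_requests:
        return 'wait'       # too little traffic to judge either way
    if canary.rate >= champion.rate * (1 - tolerance):
        return 'promote'    # metric holds up: switch all traffic
    return 'rollback'       # metric misses the target: restore the old model


champion = CanaryWindow(requests=90_000, conversions=4_500)   # 5.0% conversion
print(decide_canary(champion, CanaryWindow(10_000, 520)))     # 5.2% conversion
print(decide_canary(champion, CanaryWindow(10_000, 400)))     # 4.0% conversion
print(decide_canary(champion, CanaryWindow(200, 15)))         # too little traffic
```

A plain ratio comparison like this ignores sampling noise. A production gate would more likely run a significance test on the two rates before promoting.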
7. ML Architecture in Practice: an E-Commerce Recommender
7.1 The Complete ML Architecture
python
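from datetime import datetime

# Assumes SimpleFeatureStore, ModelVersionManager, and ModelMonitoringSystem
# from the earlier sections are in scope.


class ECommerceMLArchitecture:
    """E-commerce recommender ML architecture: Feature Store + inference + A/B + monitoring."""

    def __init__(self):
        # Core components
        self.feature_store = SimpleFeatureStore()
        self.inference_service = None
        self.ab_test_framework = None
        self.monitoring = ModelMonitoringSystem()
        self.version_manager = ModelVersionManager()
        self.retraining = None
        # Register the core features
        self._register_core_features()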
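
    def _register_core_features(self):
        """Register the recommender's core feature definitions."""
        # Each entry: (name, description, dtype, computation logic, entity key)
        features = [
            ('user_category_preference', 'User category preference over the last 30 days', 'array',
             'Category distribution of behavior in the last 30 days', 'user_id'),
            ('user_avg_price', 'User average order value', 'float',
             'Average purchase price in the last 30 days', 'user_id'),
            ('user_active_days', 'User active days', 'int',
             'Number of days with any behavior in the last 30 days', 'user_id'),
            ('item_quality_score', 'Item quality score', 'float',
             'Weighted average rating (last 90 days)', 'item_id'),
            ('item_popularity_7d', 'Item 7-day popularity', 'int',
             'Views + clicks + add-to-carts in the last 7 days', 'item_id'),
            ('item_conversion_rate', 'Item conversion rate', 'float',
             'Purchases / views in the last 30 days', 'item_id'),
            ('cross_purchase_rate', "User purchase rate in the item's category", 'float',
             "Share of the user's purchases in this category over the last 30 days",
             'user_id,item_id'),
        ]
        for name, desc, dtype, logic, entity in features:
            self.feature_store.register_feature(name, desc, dtype, logic, entity)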
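
    def design_architecture(self):
        """Return the complete ML architecture design."""
        architecture = {
            'components': {
                'Feature Store': {
                    'offline': 'Parquet + Delta Lake (training, historical features)',
                    'online': 'Redis (inference, real-time features, < 10ms reads)',
                    'registry': 'Feature registry (One Source of Truth)',
                },
                'Online inference': {
                    'realtime': 'FastAPI REST API (recommendation, < 100ms)',
                    'batch': 'Daily scheduled batch scoring (user profiling)',
                },
                'A/B testing': {
                    'traffic_split': 'User-level MD5 hash',
                    'statistical_tests': 't-test + Mann-Whitney U',
                    'simpson_check': 'Per-segment analysis',
                },
                'Model monitoring': {
                    'data_drift': 'PSI + KS test (daily)',
                    'prediction_drift': 'Mean/quantile shift detection',
                    'business_linkage': 'Recommendation GMV share vs. AUC correlation',
                },
                'Model management': {
                    'versioning': 'Date-based version IDs + canary release',
                    'retraining': 'Scheduled + drift-triggered + Champion-Challenger',
                    'rollback': 'Automatic rollback (when metrics miss the target)',
                },
            },
            # Per-stage budget; the three stages add up to the 100ms total
            'latency_budget': {
                'feature_read': '30ms',
                'model_inference': '50ms',
                'postprocessing': '20ms',
                'total': '100ms',
            },
            'sla': {
                'p50_latency': '50ms',
                'p99_latency': '100ms',
                'availability': '99.9%',
                'feature_freshness': '5 minutes',
            },
        }
        return architecture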

    def run_health_check(self):
        """System health check."""
        checks = {
            'feature_store': {
                'registered_features': len(self.feature_store.registry),
                # Placeholders: this sketch does not actually probe the stores
                'offline_available': True,
                'online_available': True,
            },
            'inference': {
                'model_loaded': self.inference_service is not None,
                # ModelMonitoringSystem as shown defines no latency_stats, so fall back to {}
                'latency_p99': getattr(self.monitoring, 'latency_stats', {}).get('p99', 0),
            },
            'monitoring': {
                'last_report': datetime.now().strftime('%Y-%m-%d'),
                'active_alerts': len(self.monitoring.alerts),
            },
        }
        # Healthy when at least one feature is registered and no alert is active
        healthy = (checks['feature_store']['registered_features'] > 0
                   and checks['monitoring']['active_alerts'] == 0)
        print(f"System health check: {'✓ healthy' if healthy else '⚠ needs attention'}")
        for component, status in checks.items():
            print(f"  {component}: {status}")
        return checks
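The latency budget in the design is easy to get wrong when a stage is added later, so a small consistency check can help. The sketch below is one possible check. `parse_ms` and `check_latency_budget` are hypothetical helpers, and the stage names simply mirror the 30ms / 50ms / 20ms split and the 100ms total from the design.

```python
import re


def parse_ms(value: str) -> int:
    """Parse strings such as '30ms' into an integer number of milliseconds."""
    match = re.fullmatch(r'(\d+)\s*ms', value.strip())
    if match is None:
        raise ValueError(f"not a millisecond value: {value!r}")
    return int(match.group(1))


def check_latency_budget(stages: dict, total: str) -> dict:
    """Compare the sum of per-stage budgets with the end-to-end budget."""
    spent = sum(parse_ms(v) for v in stages.values())
    limit = parse_ms(total)
    return {
        'spent_ms': spent,
        'limit_ms': limit,
        'headroom_ms': limit - spent,
        'within_budget': spent <= limit,
    }


stages = {'feature_read': '30ms', 'model_inference': '50ms', 'postprocessing': '20ms'}
result = check_latency_budget(stages, '100ms')
print(result)
```

With these numbers the stages use the whole 100ms, leaving no headroom for network hops or serialization. Note also that this is a planning check only: per-stage p99 latencies do not simply add up to the end-to-end p99, so the SLA still has to be measured on the full request path.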
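8. Checklist for Eliminating Training-Serving Skew
| Skew source | Detection method | Root-cause fix |
|---|---|---|
| Different code | Field-by-field comparison of offline and online feature values | Feature Store with a single computation logic, One Source of Truth |
| Different time windows | Compare summary statistics (mean/quantiles) | Record the time-window definition in the feature registry and align strictly |
| Different data sources | KS test on the two distributions | Use the same data source, and write the real-time stream back to offline storage |
| Different precision | Value-by-value relative deviation | One precision standard, with rounding rules enforced at the storage layer |
| Missing features | Monitor the feature null rate at inference time | Register a fallback strategy (0.0 / 'unknown' / most recent value) |
| Unseen values | Detect values at inference time that never appeared in training | Add an 'unknown' bucket for categorical features; use the training mean for numeric features |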
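The value-by-value comparison in the checklist can be sketched in a few lines. This is a minimal illustration: `skew_rate` and the `1e-4` relative tolerance are assumptions, and the two arrays are made-up values for the same entities read from the offline and online stores.

```python
import numpy as np


def skew_rate(offline, online, rel_tol=1e-4):
    """Share of values whose relative deviation exceeds rel_tol.

    A value missing (NaN) on one side only counts as a mismatch;
    a value missing on both sides counts as a match.
    """
    offline = np.asarray(offline, dtype=float)
    online = np.asarray(online, dtype=float)
    if offline.shape != online.shape:
        raise ValueError("offline and online arrays must have the same shape")
    both_nan = np.isnan(offline) & np.isnan(online)
    # Guard against division by zero when the offline value is exactly 0
    denom = np.maximum(np.abs(offline), 1e-12)
    rel_dev = np.abs(offline - online) / denom
    # NaN deviations compare as False, so one-sided NaNs end up as mismatches
    mismatch = ~both_nan & ~(rel_dev <= rel_tol)
    return float(mismatch.mean())


# Offline keeps 6 decimals, online keeps 4; the last feature is missing offline
offline_values = [0.123456, 10.0, 3.5, np.nan]
online_values = [0.1235, 10.0, 3.5, 0.0]
print(skew_rate(offline_values, online_values))
```

In this example two of the four values disagree: the first because of the rounding difference, the last because the online store returned a fallback where the offline store had no value. Both cases appear as rows in the checklist.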
[Figure: the root-cause fix and the detection flow. Fix: the Feature Store as One Source of Truth, made up of an offline store (training), an online store (inference), and a feature registry (computation logic + version + time window). Detection: a daily skew check (PSI + KS + value-by-value comparison) produces a skew rate. Below 5% it passes; between 5% and 20% it is minor skew, so check the feature definitions; above 20% it is severe skew, so trigger retraining and fix the features.]
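The three skew-rate bands from the detection flow (below 5%, 5% to 20%, above 20%) map naturally onto a small routing function. The sketch below is illustrative: `route_skew` is a hypothetical name, and treating both the 5% and the 20% boundary as part of the middle band is an assumption, since the flow does not say which side the boundaries belong to.

```python
def route_skew(skew_rate: float) -> dict:
    """Map a daily skew rate (a fraction between 0 and 1) to a level and an action."""
    if not 0.0 <= skew_rate <= 1.0:
        raise ValueError(f"skew_rate must be within [0, 1], got {skew_rate}")
    if skew_rate < 0.05:
        return {'level': 'pass', 'action': 'none'}
    if skew_rate <= 0.20:
        return {'level': 'minor', 'action': 'check feature definitions'}
    return {'level': 'severe', 'action': 'trigger retraining and fix features'}


for rate in (0.01, 0.12, 0.35):
    print(f"{rate:.0%} -> {route_skew(rate)}")
```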
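Summary
ML system design is not "deploying a model"; it is "operating an ML system". Training-serving skew is among the most frequent sources of bugs in production, and the Feature Store's offline/online dual-store architecture is the core fix for it. A scientifically designed A/B test (sample-size calculation, traffic-splitting strategy, Simpson's paradox detection) is the standard way to validate a model's impact. The four-layer monitoring architecture (data drift → prediction drift → label delay → business linkage) provides the ongoing safeguard after launch.
The earlier article "End-to-End ML Project in Practice, Part 1" established full-pipeline thinking from requirements to monitoring; this article moves monitoring from after-the-fact detection to system design. The financial risk-control scenario in "End-to-End ML Project in Practice, Part 2" showed feature management under compliance constraints; the Feature Store here is the engineering upgrade of that feature management. The earlier articles on recommender system fundamentals and on the e-commerce recommender in practice covered algorithms and systems; the A/B testing and monitoring here are what keep a recommender running over time.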