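Recommender System Fundamentals: Collaborative Filtering, Matrix Factorization, and Content-Based Recommendation in Engineering Practice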

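Contents

- 1. The business framework of a recommender system: a funnel, not an algorithm
- 2. Collaborative filtering: two views of "similarity"
  - 2.1 User-CF vs Item-CF
  - 2.2 Explicit vs implicit feedback
- 3. Matrix factorization: from "similarity" to latent vectors
  - 3.1 The intuition behind SVD-style factorization
  - 3.2 ALS vs SGD: two optimization routes
  - 3.3 BPR: the right way to handle implicit feedback
- 4. Content-based recommendation: solving new-item cold start
- 5. Cold start: the question every recommender system must answer
- 6. Offline evaluation: choosing the metric matters more than its value
  - 6.1 Common offline metrics
- 7. Multi-channel retrieval: the standard architecture of industrial recommenders
- 8. MovieLens in practice: comparing three algorithms
- 9. Three key engineering decisions
  - Decision 1: offline vs online similarity computation
  - Decision 2: choosing the latent dimension
  - Decision 3: negative sampling strategy
- Summary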

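Two often-cited figures: about 80% of what people watch on Netflix comes from recommendations, and about 35% of Amazon's revenue is attributed to its recommendation engine. But the word "recommendation" hides a complete algorithmic pipeline. Between the raw behavior logs and the final ranked list sit retrieval, ranking, and re-ranking, and each stage has very different goals and constraints.

Most tutorials start from "User-CF: recommend what similar users liked" and equate recommender systems with "finding similar things". That is only one strategy within the retrieval stage, and it is not the one most widely used in industry.

1. The business framework of a recommender system: a funnel, not an algorithm

At its core, a recommender system is an information-filtering funnel. Take an e-commerce scenario: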

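```
Users: 10 million
Items: 5 million

Possible user-item pairs: 5 × 10^13 (50 trillion)

↓ Retrieval: multi-channel candidate generation
Candidates: 500 (one ten-thousandth of the catalog)

↓ Pre-ranking: fast filtering
Candidates: 200

↓ Ranking: fine-grained scoring
Candidates: 50

↓ Re-ranking: diversity / freshness / business intervention
Finally shown: 10-20
```

This funnel dictates how algorithms are chosen: retrieval optimizes for coverage and speed, ranking for precision, and re-ranking for user experience. Because the objectives differ, so do the algorithms.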
[Figure: pipeline flowchart. User request (1 user) → retrieval layer (multi-channel candidate generation, 5 million → 500 items, latency < 50 ms) → pre-ranking layer (lightweight model for fast filtering, 500 → 200, latency < 30 ms) → ranking layer (LTR fine-grained scoring, 200 → 50, latency < 100 ms) → re-ranking layer (diversity / freshness / business strategy, 50 → 20, latency < 20 ms) → 20 displayed recommendations. Retrieval channels: collaborative filtering, content similarity, popular items, real-time behavior.]

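Once this framework is clear, collaborative filtering stops being "the recommender system" and becomes what it really is: one strategy inside the retrieval layer.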
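The funnel can be sketched as a chain of top-k truncations. This is a toy illustration under stated assumptions: `run_funnel`, `top_k`, and the random scoring function are hypothetical names, the stage sizes mirror the ones quoted above, and a real retrieval layer would use indexes rather than scoring the whole catalog.

```python
import random

def top_k(candidates, score_fn, k):
    """Keep the k highest-scoring candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

def run_funnel(catalog, stage_scores, sizes=(500, 200, 50, 20)):
    """Apply retrieval, pre-ranking, ranking, re-ranking as successive cuts."""
    candidates = list(catalog)
    trace = [len(candidates)]          # candidate count after each stage
    for score_fn, k in zip(stage_scores, sizes):
        candidates = top_k(candidates, score_fn, k)
        trace.append(len(candidates))
    return candidates, trace

rng = random.Random(0)
catalog = range(10_000)                # toy catalog, far smaller than 5 million
noise = {item: rng.random() for item in catalog}
score = lambda item: noise[item]       # stand-in for each stage's model

final, trace = run_funnel(catalog, [score] * 4)
print(trace)  # [10000, 500, 200, 50, 20]
```

In a real system each stage would use a different, progressively more expensive model; the point here is only that every stage sees far fewer candidates than the one before it.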

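2. Collaborative filtering: two views of "similarity"

The core assumption of collaborative filtering (CF): users who behaved similarly in the past will have similar preferences for items they have not yet seen.

2.1 User-CF vs Item-CF

User-CF (user-based collaborative filtering):

```
Core logic: find "neighbor" users whose behavior resembles the target user's,
            then recommend items the neighbors liked that the target user has not seen

Similarity measures: cosine similarity, Pearson correlation

Problems:
- When users far outnumber items, the user-user similarity matrix is extremely sparse
- User behavior is unstable, so similarity relations change frequently
- A 10 million × 10 million user similarity matrix (10^14 entries) is impractical to store or compute
```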
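The User-CF logic above can be sketched in a few lines of NumPy. This is a minimal illustration on a dense toy matrix, not a scalable implementation; `user_cf_scores` and the sample matrix are made up for the example.

```python
import numpy as np

def user_cf_scores(R, user, n_neighbors=2):
    """Score items for `user` using the n most similar users (cosine similarity)."""
    R = np.asarray(R, dtype=float)
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                    # avoid division by zero for empty rows
    unit = R / norms
    sims = unit @ unit[user]                   # cosine similarity to every user
    sims[user] = -np.inf                       # a user is never their own neighbor
    neighbors = np.argsort(sims)[::-1][:n_neighbors]
    scores = sims[neighbors] @ R[neighbors]    # similarity-weighted votes per item
    scores[R[user] > 0] = 0.0                  # drop items the user already interacted with
    return scores

R = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1: overlaps with user 0 and also has item 2
    [0, 0, 0, 1],   # user 2: no overlap with user 0
])
scores = user_cf_scores(R, user=0, n_neighbors=1)
best = int(np.argmax(scores))   # item 2, borrowed from neighbor user 1
```

Note that the similarity vector has one entry per user, which is exactly what becomes infeasible at the 10-million-user scale mentioned above.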

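Item-CF (item-based collaborative filtering):

```
Core logic: if items A and B are bought or clicked by similar groups of users, A and B are similar;
            when a user buys A, recommend B

Advantages:
- Items are usually fewer than users (e-commerce: millions of items vs tens of millions of users)
- Item similarity is relatively stable and can be precomputed offline
- Good explainability: "recommended Y because you bought X"
```

Amazon described its item-to-item collaborative filtering publicly as early as 2003, and the reasons it gave were the same stability and scalability advantages listed above.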

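2.2 Explicit vs implicit feedback

This is a key distinction that most introductory material skips:

| Type | Meaning | Examples | Problem |
| --- | --- | --- | --- |
| Explicit feedback | The user actively gives a rating | 1-5 star ratings, likes / dislikes | Sparse: most users do not bother to rate |
| Implicit feedback | Preference inferred from behavior | Clicks, purchases, dwell time, favorites | Indirect: positives are known, negatives are not |

The core difficulty of implicit feedback: no click does not mean dislike. The user may never have seen the item, the algorithm may never have shown it, or the price may have been wrong. This is why the traditional "rating prediction" framing breaks down.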
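To make the distinction concrete, here is a small sketch of turning a behavior log into an implicit-feedback matrix. The event weights and the helper name `build_implicit_matrix` are assumptions made for illustration; real weights are a business decision.

```python
import numpy as np

# Hypothetical event weights (assumed for this example only).
EVENT_WEIGHTS = {"click": 1.0, "favorite": 2.0, "purchase": 3.0}

def build_implicit_matrix(events, n_users, n_items, weights=EVENT_WEIGHTS):
    """Accumulate weighted events into an (n_users, n_items) interaction-strength matrix."""
    R = np.zeros((n_users, n_items))
    for user, item, event in events:
        R[user, item] += weights[event]
    return R

events = [(0, 0, "click"), (0, 0, "purchase"), (1, 2, "click")]
R = build_implicit_matrix(events, n_users=2, n_items=3)
P = (R > 0).astype(float)   # observed / not observed; a zero means "unknown", not "disliked"
```

The zeros in `R` are the problem the paragraph above describes: they mix "not interested" with "never exposed", so they cannot be used as ordinary negative labels.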

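Item-CF over an implicit-feedback matrix can be implemented as follows:

```python
import numpy as np
from sklearn.preprocessing import normalize

def item_cf_recommend(user_item_matrix, user_id, top_k=10, n_similar=20):
    """
    Item-based collaborative filtering.

    user_item_matrix: dense array of shape (n_users, n_items) holding
    implicit feedback (0/1 or a confidence value).
    """
    # Item-item cosine similarity. In production this is precomputed offline
    # instead of being rebuilt on every call.
    item_matrix = user_item_matrix.T  # (n_items, n_users)
    # L2-normalize each item row so that popular items do not dominate
    item_matrix_norm = normalize(item_matrix, norm='l2')
    item_similarity = item_matrix_norm @ item_matrix_norm.T  # (n_items, n_items)
    # An item must never be returned as its own neighbor
    np.fill_diagonal(item_similarity, 0.0)

    # Items the user has interacted with
    user_items = np.where(user_item_matrix[user_id] > 0)[0]

    # Accumulate similarity scores
    scores = np.zeros(user_item_matrix.shape[1])
    for item_id in user_items:
        # Top n_similar neighbors of this item
        similar_items = np.argsort(item_similarity[item_id])[::-1][:n_similar]
        for similar_item in similar_items:
            if user_item_matrix[user_id, similar_item] == 0:  # not yet interacted
                scores[similar_item] += item_similarity[item_id, similar_item]

    # Return the top_k recommendations with a positive score
    top_items = np.argsort(scores)[::-1][:top_k]
    return [(item_id, scores[item_id]) for item_id in top_items if scores[item_id] > 0]
```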

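3. Matrix factorization: from "similarity" to latent vectors

The fundamental limitation of neighborhood-based collaborative filtering is that it can only exploit items that are directly similar through co-occurrence; it cannot uncover latent semantic relationships. Matrix factorization introduces latent vectors to address this.

3.1 The intuition behind SVD-style factorization

The user-item rating matrix is usually extremely sparse (often more than 99% of entries are empty). The idea of matrix factorization:

```
Rating matrix R (m×n) ≈ user matrix U (m×k) × item matrix V^T (k×n)

k: latent dimension (typically 32 to 256), far smaller than m and n

Interpretation:
- Each user is a k-dimensional vector: preference strength along k "taste dimensions"
- Each item is a k-dimensional vector: attribute strength along k "feature dimensions"
- Predicted rating = user vector · item vector (inner product)
```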
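A tiny worked example of the inner-product prediction, with k = 2. The factor values are made up purely for illustration.

```python
import numpy as np

# Toy latent factors (assumed values, k = 2).
U = np.array([[1.0, 0.0],    # user 0 only cares about dimension 0
              [0.5, 0.5]])   # user 1 weighs both dimensions equally
V = np.array([[4.0, 0.0],    # item 0 is strong on dimension 0
              [0.0, 4.0],    # item 1 is strong on dimension 1
              [2.0, 2.0]])   # item 2 is a mix of both

R_hat = U @ V.T              # predicted score for every (user, item) pair
print(R_hat)
# [[4. 0. 2.]
#  [2. 2. 2.]]
```

User 0 scores item 0 highest because their vectors align; user 1 is indifferent between the three items. Every cell of `R_hat` is filled, including pairs that were never observed.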

[Figure: low-rank factorization. The sparse rating matrix R (users × items, unknown entries shown as "?") is factorized into a user matrix U (m×k, each row is a user latent vector) and an item matrix V (n×k, each row is an item latent vector). Their inner products give R̂ ≈ U × V^T, which predicts the missing values and fills every empty cell.]

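3.2 ALS vs SGD: two optimization routes

SGD (stochastic gradient descent):

- Each step picks one known rating at random, computes the error, and updates the corresponding rows of U and V
- Well suited to explicit feedback, where there is a clear target rating
- Hard to parallelize, but converges quickly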
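The SGD route can be sketched as follows. This is a minimal version under assumptions: no bias terms, hypothetical hyperparameters, and a handful of made-up ratings; `sgd_mf` and `rmse` are names introduced for the example.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Factorize observed (user, item, rating) triples with plain SGD."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (n_users, k))
    V = rng.normal(0, 0.1, (n_items, k))
    ratings = list(ratings)
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):   # visit known ratings in random order
            u, i, r = ratings[idx]
            err = r - U[u] @ V[i]                   # prediction error on this rating
            u_old = U[u].copy()                     # use the pre-update user vector for V
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

def rmse(ratings, U, V):
    return float(np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings])))

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U0, V0 = sgd_mf(ratings, 3, 3, epochs=0)   # untrained baseline
U, V = sgd_mf(ratings, 3, 3)
print(rmse(ratings, U0, V0), rmse(ratings, U, V))  # training error before vs after
```

Only the observed entries drive the updates, which is why this route fits explicit ratings naturally and needs extra care (negative sampling or confidence weighting) for implicit data.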

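ALS (alternating least squares):

- Fix V and solve for the optimal U in closed form; then fix U and solve for V
- Alternate until convergence
- A natural fit for implicit feedback and for parallel execution; it is the algorithm behind collaborative filtering in Spark MLlib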

```python
import numpy as np

class ALSRecommender:
    """
    Alternating least squares matrix factorization (implicit-feedback version).

    Implicit feedback handling: r_ui > 0 counts as positive feedback,
    with confidence c_ui = 1 + alpha * r_ui.
    """
    def __init__(self, n_factors=50, n_iterations=20, regularization=0.01, alpha=40):
        self.n_factors = n_factors
        self.n_iterations = n_iterations
        self.reg = regularization
        self.alpha = alpha  # scaling factor for implicit-feedback confidence

    def _solve(self, fixed, C, P):
        """Closed-form update of one side's factors while the other side is fixed."""
        # Shared part of every row's system: X^T C X = X^T X + X^T (C - I) X
        gram = fixed.T @ fixed + self.reg * np.eye(self.n_factors)
        solved = np.empty((C.shape[0], self.n_factors))
        for row in range(C.shape[0]):
            c = C[row]  # confidence vector for this user (or item)
            # (c - 1) is zero wherever there was no interaction
            A = gram + fixed.T @ ((c - 1)[:, None] * fixed)
            b = fixed.T @ (c * P[row])
            solved[row] = np.linalg.solve(A, b)
        return solved

    def fit(self, R):
        """
        R: dense implicit-feedback array (n_users, n_items) whose values are
        interaction counts or strengths.
        """
        R = np.asarray(R, dtype=float)
        n_users, n_items = R.shape

        # Confidence matrix: interacted cells get high confidence,
        # non-interacted cells get confidence 1 (not 0!)
        C = 1 + self.alpha * R
        # Preference matrix: p_ui = 1 if r_ui > 0 else 0
        P = (R > 0).astype(float)
        self.interacted = R > 0  # kept for filtering at recommendation time

        # Random initialization
        self.user_factors = np.random.normal(0, 0.01, (n_users, self.n_factors))
        self.item_factors = np.random.normal(0, 0.01, (n_items, self.n_factors))

        for _ in range(self.n_iterations):
            # Update user factors with item factors fixed, then the reverse
            self.user_factors = self._solve(self.item_factors, C, P)
            self.item_factors = self._solve(self.user_factors, C.T, P.T)
        return self

    def recommend(self, user_id, n_items=10, filter_already_interacted=True):
        """Recommend the top-n items for a user."""
        scores = self.item_factors @ self.user_factors[user_id]

        if filter_already_interacted:
            # Usually items already interacted with are not recommended again;
            # whether to filter is a business decision
            scores = np.where(self.interacted[user_id], -np.inf, scores)

        top_items = np.argsort(scores)[::-1][:n_items]
        return list(zip(top_items, scores[top_items]))
```

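3.3 BPR: the right way to handle implicit feedback

With implicit feedback, using ALS directly still has a problem: "no interaction" is treated as a (low-confidence) negative, even when the item was simply never shown.

BPR (Bayesian Personalized Ranking) takes a different approach:

```
Predict a ranking, not a rating

For user u:
  an interacted item i (positive) should rank above a non-interacted item j (sampled negative)

Objective: maximize P(u prefers i over j)
           = sigmoid(ŷ_ui - ŷ_uj)

Loss: L_BPR = Σ -ln σ(ŷ_ui - ŷ_uj) + regularization   (the negated BPR-OPT criterion)
```

The core insight of BPR: a ranking objective matches the business goal of recommendation more naturally than rating prediction. Users care about which item comes first, not whether a predicted rating is 4.2 or 4.5.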
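A single BPR update can be written out by hand in NumPy to show where the gradients come from. This is a sketch under assumptions: scores are plain inner products, the learning rate and regularization values are arbitrary, and `bpr_sgd_step` is a name introduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(p_u, q_i, q_j):
    """-ln sigmoid(y_ui - y_uj) for one (user, positive item, negative item) triplet."""
    return float(-np.log(sigmoid(p_u @ (q_i - q_j))))

def bpr_sgd_step(p_u, q_i, q_j, lr=0.1, reg=0.01):
    """One gradient-descent step on the regularized BPR loss; returns updated vectors."""
    x = p_u @ (q_i - q_j)            # score margin between positive and negative
    g = 1.0 - sigmoid(x)             # magnitude of the pull; shrinks as the margin grows
    new_p_u = p_u + lr * (g * (q_i - q_j) - reg * p_u)
    new_q_i = q_i + lr * (g * p_u - reg * q_i)     # push the positive item toward the user
    new_q_j = q_j + lr * (-g * p_u - reg * q_j)    # push the negative item away
    return new_p_u, new_q_i, new_q_j

rng = np.random.default_rng(0)
p_u, q_i, q_j = (rng.normal(0, 0.1, 8) for _ in range(3))
loss_before = bpr_loss(p_u, q_i, q_j)
for _ in range(50):
    p_u, q_i, q_j = bpr_sgd_step(p_u, q_i, q_j)
loss_after = bpr_loss(p_u, q_i, q_j)
```

The factor `g` explains why already well-ordered pairs contribute little: once the positive item clearly outranks the negative one, the update for that triplet fades out.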

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class BPRModel(nn.Module):
    """BPR matrix factorization."""
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.user_embedding = nn.Embedding(n_users, n_factors)
        self.item_embedding = nn.Embedding(n_items, n_factors)

        # Small random initialization
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)

    def forward(self, user_ids, pos_item_ids, neg_item_ids):
        user_vecs = self.user_embedding(user_ids)
        pos_item_vecs = self.item_embedding(pos_item_ids)
        neg_item_vecs = self.item_embedding(neg_item_ids)

        # Score of the positive item
        pos_scores = (user_vecs * pos_item_vecs).sum(dim=1)
        # Score of the negative item
        neg_scores = (user_vecs * neg_item_vecs).sum(dim=1)

        # BPR loss; logsigmoid is numerically safer than log(sigmoid(x))
        loss = -F.logsigmoid(pos_scores - neg_scores).mean()
        return loss

def sample_triplets(user_item_matrix, n_samples):
    """
    Negative sampling: for each sampled positive pair (u, i), draw one random
    item j that the user has not interacted with.
    """
    users, pos_items, neg_items = [], [], []
    n_items = user_item_matrix.shape[1]

    interacted_users, interacted_items = user_item_matrix.nonzero()

    for idx in np.random.choice(len(interacted_users), n_samples):
        u = interacted_users[idx]
        i = interacted_items[idx]

        # Rejection sampling; assumes no user has interacted with every item
        j = np.random.randint(n_items)
        while user_item_matrix[u, j] > 0:
            j = np.random.randint(n_items)

        users.append(u)
        pos_items.append(i)
        neg_items.append(j)

    return (torch.from_numpy(np.asarray(users, dtype=np.int64)),
            torch.from_numpy(np.asarray(pos_items, dtype=np.int64)),
            torch.from_numpy(np.asarray(neg_items, dtype=np.int64)))
```

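4. Content-based recommendation: solving new-item cold start

Collaborative filtering and matrix factorization share a fatal weakness: a new item has no interaction data, so no latent vector can be learned for it.

Content-based filtering starts from the item's own attributes:

```
Item features: category tags, text description, price band, brand, release time, ...
User profile:  aggregated from the features of items the user interacted with

Recommendation logic: compute the feature similarity between candidate items and the user profile
```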

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class ContentBasedRecommender:
    """
    Content-based recommendation using item description text as features.

    Suitable for:
    1. New-item cold start (fallback when there is no interaction data)
    2. Highly personalized scenarios (users with clear category preferences)
    3. Catalogs with rich, descriptive item features (e-commerce, content platforms)
    """
    def __init__(self, max_features=5000):
        self.vectorizer = TfidfVectorizer(max_features=max_features,
                                          analyzer='char_wb',  # character n-grams, usable for Chinese without word segmentation
                                          ngram_range=(2, 4))

    def fit(self, items_df, text_col='description', id_col='item_id'):
        """items_df: DataFrame containing item ids and description text."""
        self.items_df = items_df.reset_index(drop=True)
        self.item_ids = self.items_df[id_col].values
        self.id_to_index = {item_id: idx for idx, item_id in enumerate(self.item_ids)}

        # TF-IDF vectorization
        self.item_vectors = self.vectorizer.fit_transform(self.items_df[text_col])
        return self

    def _history_indices(self, user_history_items, weights=None):
        """Map history item ids to row indices, keeping weights aligned with the items found."""
        indices, kept_weights = [], []
        for pos, item_id in enumerate(user_history_items):
            if item_id in self.id_to_index:
                indices.append(self.id_to_index[item_id])
                kept_weights.append(1.0 if weights is None else weights[pos])
        return indices, kept_weights

    def build_user_profile(self, user_history_items, weights=None):
        """
        Build the user profile vector from the user's interaction history.

        weights: positive importance weight per history item
        (for example, purchases weighted above views).
        """
        indices, kept_weights = self._history_indices(user_history_items, weights)
        if not indices:
            return None

        w = np.array(kept_weights, dtype=float).reshape(-1, 1)
        weighted_sum = self.item_vectors[indices].multiply(w).sum(axis=0)
        # Weighted average of the history item vectors, as a (1, n_features) array
        return np.asarray(weighted_sum).reshape(1, -1) / w.sum()

    def recommend(self, user_history_items, weights=None, top_k=10,
                  exclude_history=True):
        """Recommend items similar to the user's history."""
        user_profile = self.build_user_profile(user_history_items, weights)
        if user_profile is None:
            # No usable history: return nothing and let the caller fall back
            # to popular items (cold-start fallback)
            return []

        # Cosine similarity between the user profile and every item
        similarities = cosine_similarity(user_profile, self.item_vectors)[0]

        # Exclude items the user already interacted with
        if exclude_history:
            indices, _ = self._history_indices(user_history_items)
            similarities[indices] = -1

        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.item_ids[idx], similarities[idx]) for idx in top_indices]
```

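5. Cold start: the question every recommender system must answer

Cold start is one of the hardest problems in recommender engineering. It falls into three scenarios: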
[Figure: cold-start decision flowchart. Decision nodes: new user? has registration info (gender / age / city)? new item? has content features? A "no" on the new-item branch is labeled "double cold". Strategies at the leaves:
Strategy 1: demographic grouping, recommend by the preferences of similar groups.
Strategy 2: popular items, a global hot list plus diversity sampling.
Strategy 3: guided selection, show category / style cards and collect explicit preferences.
Strategy 4: combine the two, start with popular items and personalize gradually.
Strategy 5: content-based recommendation on item description similarity, until enough interactions accumulate.
Strategy 6: business-rule fallback, category hot lists / editorial picks.
Once interaction data has accumulated, switch to collaborative filtering / matrix factorization.]

A practical engineering scheme for new-user cold start:

```python
from collections import defaultdict

class HybridColdStartHandler:
    """
    Tiered cold-start handler.

    Level 0 (brand-new user): popular items + diversity sampling
    Level 1 (basic info available): demographic group recommendation
    Level 2 (after guided selection): content recommendation from explicit choices
    Level 3 (after behavior accumulates): switch to collaborative filtering / matrix factorization
    """
    def __init__(self, warm_threshold=10):
        """warm_threshold: users at or above this interaction count are no longer cold."""
        self.warm_threshold = warm_threshold

    def get_recommendation_strategy(self, user_profile):
        """Choose a strategy according to how much is known about the user."""
        interaction_count = user_profile.get('interaction_count', 0)
        has_demographics = bool(user_profile.get('age') or user_profile.get('gender'))
        has_explicit_preferences = bool(user_profile.get('explicit_categories'))

        if interaction_count >= self.warm_threshold:
            return 'collaborative_filtering'  # CF / matrix factorization
        elif interaction_count >= 3:
            return 'content_based'  # content recommendation from existing interactions
        elif has_explicit_preferences:
            return 'explicit_preference'  # based on explicit choices
        elif has_demographics:
            return 'demographic_based'  # demographic grouping
        else:
            return 'popularity_based'  # global popular items + diversity sampling

    def diversified_popular_items(self, item_pool, top_k=20, n_categories=5):
        """
        Diversified popular list: take the most popular items from several
        categories so the list is not dominated by one category.

        This works far better in practice than a plain hot list:
        - Plain hot list: all hit shows, no diversity, users lose interest quickly
        - Diversified hot list: the top of each category, so users can explore different tastes
        """
        category_items = defaultdict(list)
        for item in item_pool:
            category_items[item['category']].append(item)

        result = []
        # Take an equal share from each of the n_categories largest categories
        per_category = max(1, top_k // n_categories)
        largest = sorted(category_items.items(),
                         key=lambda x: len(x[1]), reverse=True)[:n_categories]
        for cat, items in largest:
            top_items = sorted(items, key=lambda x: x['popularity'], reverse=True)
            result.extend(top_items[:per_category])

        return result[:top_k]
```

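6. Offline evaluation: choosing the metric matters more than its value

Recommender evaluation has a counterintuitive lesson: good offline metrics do not guarantee good online results.

6.1 Common offline metrics

Accuracy metrics:

```python
import numpy as np

def hit_rate_at_k(recommended_items, relevant_items, k):
    """
    HR@K: whether the top-K list hits at least one relevant item.

    Reading: HR@10 = 0.5 means 50% of users found something they wanted
    within the first 10 recommendations.
    """
    hits = len(set(recommended_items[:k]) & set(relevant_items))
    return 1.0 if hits > 0 else 0.0

def ndcg_at_k(recommended_items, relevant_items, k):
    """
    NDCG@K: Normalized Discounted Cumulative Gain.

    Hits nearer the top contribute more (logarithmic discount).
    Normalization: divide by the best achievable DCG.
    """
    # DCG
    dcg = 0.0
    for rank, item in enumerate(recommended_items[:k], 1):
        if item in relevant_items:
            dcg += 1.0 / np.log2(rank + 1)

    # Ideal DCG (IDCG)
    n_relevant_in_k = min(len(relevant_items), k)
    idcg = sum(1.0 / np.log2(rank + 1) for rank in range(1, n_relevant_in_k + 1))

    return dcg / idcg if idcg > 0 else 0.0

def coverage_and_diversity(all_recommendations, item_catalog_size):
    """
    Coverage: number of distinct items recommended / catalog size.
    Diversity: how varied the item categories within a list are
    (only coverage is computed here).

    In industrial practice coverage and diversity often matter more than accuracy:
    - Low coverage: Matthew effect, head items keep getting stronger and long-tail items never get exposure
    - Low diversity: user fatigue, click-through rate drops sharply late in a session
    """
    recommended_items = set()
    for rec_list in all_recommendations:
        recommended_items.update(rec_list)

    coverage = len(recommended_items) / item_catalog_size
    return coverage
```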
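Diversity within a single list can be measured in several ways. One simple option, sketched here as an assumption rather than the article's own definition, is the share of item pairs whose categories differ; `intra_list_diversity` and the category mapping are hypothetical.

```python
from itertools import combinations

def intra_list_diversity(rec_list, item_category):
    """Share of item pairs in one list whose categories differ (0 = one category, 1 = all distinct)."""
    pairs = list(combinations(rec_list, 2))
    if not pairs:
        return 0.0
    differing = sum(item_category[a] != item_category[b] for a, b in pairs)
    return differing / len(pairs)

item_category = {"a": "drama", "b": "drama", "c": "comedy", "d": "documentary"}
same = intra_list_diversity(["a", "b"], item_category)             # 0.0: both are dramas
mixed = intra_list_diversity(["a", "b", "c", "d"], item_category)  # 5 of 6 pairs differ
```

Averaging this value over all users' lists gives a single diversity number that can be tracked next to HR@K, NDCG@K, and coverage.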

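The gap between offline metrics and online results:

```
Typical contradictions:

1. NDCG improves by 10%, but online CTR drops by 5%
   Possible cause: NDCG optimizes relevance, while real clicks depend more on novelty

2. Offline HR@10 is high, but online retention drops
   Possible cause: the system keeps recommending similar items, so users never get to explore new content

3. An algorithm scores poorly offline but performs well online
   Possible cause: it occasionally recommends unexpected items and sparks exploration

Key lessons:
- Offline evaluation can only screen out clearly bad options; it cannot separate good from better
- Major strategy changes must be validated with A/B tests
- Monitor non-accuracy metrics too: diversity, coverage, freshness
```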

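7. Multi-channel retrieval: the standard architecture of industrial recommenders

Single-channel retrieval (only collaborative filtering, or only content-based) stopped being enough in industrial settings long ago. Modern recommender systems generally use multi-channel retrieval:

```python
import logging

logger = logging.getLogger(__name__)

class MultiChannelRetrieval:
    """
    Multi-channel retrieval aggregator.

    Each channel answers a different question:
    - CF channel: "similar users also liked"
    - Content channel: "similar to what you viewed"
    - Popularity channel: "what everyone is watching lately"
    - Real-time channel: "you just viewed X, so look at Y"
    - Follow channel: "liked by people you follow"
    """
    def __init__(self, retrievers, weights=None):
        """
        retrievers: dict, key = channel name, value = retriever object
        weights: priority weight per channel (can feed into pre-ranking)
        """
        self.retrievers = retrievers
        self.weights = weights or {name: 1.0 for name in retrievers}

    def retrieve(self, user_id, user_context, top_k_per_channel=100):
        """
        Run every channel, then merge and deduplicate the candidates.
        Channels run one after another here; a production system would
        typically run them in parallel.
        """
        all_candidates = {}  # item_id -> {'score': float, 'channels': list}

        for channel_name, retriever in self.retrievers.items():
            try:
                candidates = retriever.retrieve(user_id, user_context, top_k_per_channel)
                channel_weight = self.weights[channel_name]

                for item_id, score in candidates:
                    if item_id not in all_candidates:
                        all_candidates[item_id] = {
                            'score': score * channel_weight,
                            'channels': [channel_name]
                        }
                    else:
                        # Hit by several channels: add the scores (or take the max, a business decision)
                        all_candidates[item_id]['score'] += score * channel_weight
                        all_candidates[item_id]['channels'].append(channel_name)
            except Exception as e:
                # One failing channel must not take down the others
                logger.warning("Channel %s failed: %s", channel_name, e)

        # Sort by combined score
        sorted_candidates = sorted(
            all_candidates.items(),
            key=lambda x: x[1]['score'],
            reverse=True
        )

        return sorted_candidates

    def get_channel_analysis(self, candidates):
        """Analyze each channel's contribution, for monitoring and tuning."""
        channel_counts = {}
        multi_channel_items = 0

        for item_id, info in candidates:
            for channel in info['channels']:
                channel_counts[channel] = channel_counts.get(channel, 0) + 1
            if len(info['channels']) > 1:
                multi_channel_items += 1

        return {
            'channel_contribution': channel_counts,
            # Guard against an empty candidate list
            'multi_channel_overlap': multi_channel_items / len(candidates) if candidates else 0.0
        }
```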
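Adding scores across channels only makes sense when the channels score on comparable scales. One common remedy, shown here as an assumption rather than something the aggregator above does, is to min-max normalize each channel's scores before merging. `min_max_normalize` and the sample scores are hypothetical.

```python
def min_max_normalize(candidates):
    """Rescale one channel's (item_id, score) list to the range [0, 1]."""
    if not candidates:
        return []
    scores = [s for _, s in candidates]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every candidate as equally strong
        return [(item, 1.0) for item, _ in candidates]
    return [(item, (s - lo) / (hi - lo)) for item, s in candidates]

cf = [("a", 0.9), ("b", 0.5), ("c", 0.1)]    # cosine-like similarity scores
hot = [("d", 12000.0), ("a", 3000.0)]        # raw view counts
cf_n, hot_n = min_max_normalize(cf), min_max_normalize(hot)
```

Without this step the raw view counts of the popularity channel would swamp the similarity scores of the CF channel, regardless of the channel weights. Rank-based merging is another option when score distributions differ too much.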

8. MovieLens in practice: comparing three algorithms

Using the MovieLens 100K dataset, we compare User-CF, Item-CF, and matrix factorization on the same data:

```python
import pandas as pd
import numpy as np

def load_movielens_100k(data_path='u.data'):
    """Load the MovieLens 100K dataset."""
    df = pd.read_csv(data_path, sep='\t',
                     names=['user_id', 'item_id', 'rating', 'timestamp'])
    # Convert to implicit feedback (rating >= 4 counts as positive)
    df['interaction'] = (df['rating'] >= 4).astype(int)
    return df

def evaluate_recommender(recommender, test_data, k=10, n_catalog_items=None):
    """
    Evaluate a recommender. Returns HR@K, NDCG@K, and coverage.

    Uses hit_rate_at_k and ndcg_at_k from Section 6.1.
    n_catalog_items: catalog size for coverage; falls back to the number of
    distinct items in test_data, which overstates coverage under leave-one-out.
    """
    hr_list, ndcg_list = [], []
    all_recommended = set()

    for user_id in test_data['user_id'].unique():
        user_test = test_data[test_data['user_id'] == user_id]
        relevant_items = set(user_test[user_test['interaction'] == 1]['item_id'].tolist())

        if not relevant_items:
            continue

        recommended = [item_id for item_id, _ in recommender.recommend(user_id, k)]
        all_recommended.update(recommended)

        hr_list.append(hit_rate_at_k(recommended, relevant_items, k))
        ndcg_list.append(ndcg_at_k(recommended, relevant_items, k))

    catalog_size = n_catalog_items or test_data['item_id'].nunique()
    return {
        f'HR@{k}': np.mean(hr_list),
        f'NDCG@{k}': np.mean(ndcg_list),
        'Coverage': len(all_recommended) / catalog_size
    }

# Experiment design: leave-one-out (each user's last interaction is the test set)
def leave_one_out_split(df):
    """Time-aware leave-one-out: hold out the last interaction for testing."""
    df_sorted = df.sort_values(['user_id', 'timestamp'])
    test_idx = df_sorted.groupby('user_id').tail(1).index
    train = df_sorted.drop(test_idx)
    test = df_sorted.loc[test_idx]
    return train, test
```

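Results (typical values; exact numbers depend on hyperparameters):

| Algorithm | HR@10 | NDCG@10 | Coverage | Cold-start ability |
| --- | --- | --- | --- | --- |
| User-CF | ~0.52 | ~0.31 | ~45% | ❌ Very poor |
| Item-CF | ~0.60 | ~0.37 | ~60% | ❌ Poor |
| ALS (implicit feedback) | ~0.68 | ~0.43 | ~72% | ❌ Poor |
| ALS + content retrieval | ~0.71 | ~0.46 | ~85% | ✅ New items are covered |
| BPR | ~0.65 | ~0.40 | ~70% | ❌ Poor |

Key observations:

- Item-CF outperforms User-CF, consistent with Amazon's engineering choice many years ago
- ALS (implicit-feedback version) is clearly better than the CF family, at the cost of higher computational complexity
- On accuracy alone, adding content retrieval to ALS brings only a modest gain (HR@10 from ~0.68 to ~0.71); the larger win is coverage (~72% to ~85%), which is where the hybrid clearly pulls ahead
- Coverage is a key metric that is often overlooked: a recommender should not only give users what they want but also help them discover new things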

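9. Three key engineering decisions

Decision 1: offline vs online similarity computation

```
Offline precomputation (suits large-scale scenarios):
- Batch-compute item-item similarity daily or hourly
- Store the result in Redis; the online service only does lookups
- Limitation: cannot capture real-time behavior changes

Online computation (suits small-scale or real-time scenarios):
- Compute at request time
- High latency; unsuitable for catalogs with tens of millions of items
```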
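The offline route boils down to two pieces: a batch job that keeps only each item's top-N neighbors, and an online lookup that merges those lists. The sketch below uses a plain dict as a stand-in for the key-value store; the function names and the toy similarity matrix are assumptions.

```python
import numpy as np

def build_neighbor_table(item_similarity, top_n=2):
    """Offline job: keep each item's top-N neighbors as {item: [(neighbor, sim), ...]}."""
    sim = np.array(item_similarity, dtype=float)   # copy, so the input is left untouched
    np.fill_diagonal(sim, -np.inf)                 # exclude self-matches
    table = {}
    for item in range(sim.shape[0]):
        order = np.argsort(sim[item])[::-1][:top_n]
        table[item] = [(int(j), float(sim[item, j])) for j in order]
    return table

def lookup_candidates(table, recent_items, k=3):
    """Online path: merge the precomputed neighbors of recently viewed items."""
    scores = {}
    for item in recent_items:
        for neighbor, s in table.get(item, []):
            if neighbor not in recent_items:
                scores[neighbor] = scores.get(neighbor, 0.0) + s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

sim = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
table = build_neighbor_table(sim, top_n=1)          # dict standing in for the online store
candidates = lookup_candidates(table, recent_items=[0])
```

Truncating to top-N is what makes the table small enough to serve: storage grows with n_items × N rather than n_items squared.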

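Decision 2: choosing the latent dimension

```
Too few dimensions:
- Underfitting: cannot express complex user-item relationships
- Common pitfall: 16 to 32 dimensions is not enough for catalogs with many categories

Too many dimensions:
- Overfitting: performance drops when training data is sparse
- High memory use: 1 million users × 256 dims × 4 bytes ≈ 1 GB
- Higher online inference latency

Practical advice:
- Start at 64 dimensions and tune against validation NDCG
- 64 to 256 dimensions is usually enough
- Returns diminish as the dimension grows
```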
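The memory figure above is simple arithmetic, and it is worth running for both the user and item tables before fixing a dimension. A small helper (hypothetical name, float32 storage assumed):

```python
def embedding_table_bytes(n_rows, dim, bytes_per_value=4):
    """Memory for a dense embedding table (float32 by default)."""
    return n_rows * dim * bytes_per_value

for dim in (64, 128, 256):
    size = embedding_table_bytes(1_000_000, dim)
    print(f"{dim:>3} dims: {size / 1e9:.3f} GB")
# 64 dims: 0.256 GB
# 128 dims: 0.512 GB
# 256 dims: 1.024 GB
```

Memory scales linearly with the dimension, so going from 64 to 256 dimensions quadruples the footprint while the accuracy gain, per the advice above, usually flattens out.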

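Decision 3: negative sampling strategy

```python
import numpy as np

def popularity_weighted_negative_sampling(item_popularity, n_samples, power=0.75):
    """
    Sample negatives in proportion to popularity raised to the 0.75 power
    (the strategy used in Word2Vec).

    Why not uniform negative sampling?
    - Uniform sampling: a popular item is as likely to be drawn as a rare one
    - In reality, popular items are more likely "exposed but not clicked", which makes them more informative negatives
    - The 0.75 power balances uniform sampling against pure popularity weighting

    Note: items that were never exposed should not be used as negatives,
    because there is no evidence the user dislikes them. This function does
    not enforce that; filter item_popularity to exposed items beforehand.
    """
    items = list(item_popularity.keys())
    probs = np.array([item_popularity[item] ** power for item in items])
    probs /= probs.sum()

    return np.random.choice(items, size=n_samples, p=probs)
```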
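The effect of the exponent is easiest to see on two items. The counts below are made up; `sampling_probs` is a helper introduced for this illustration.

```python
import numpy as np

def sampling_probs(popularity, power):
    """Normalize popularity ** power into sampling probabilities."""
    weights = np.asarray(popularity, dtype=float) ** power
    return weights / weights.sum()

popularity = [1000, 10]   # one head item and one tail item (assumed counts)
for power in (0.0, 0.75, 1.0):
    p = sampling_probs(popularity, power)
    # How many times more often the head item is drawn than the tail item
    print(power, round(float(p[0] / p[1]), 1))
```

With power 0 the two items are drawn equally often, with power 1 the head item is drawn 100 times as often, and with 0.75 the ratio is 100^0.75, roughly 31.6. The exponent therefore still favors popular items as negatives, but leaves the long tail a realistic chance of being sampled.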

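Summary

A recommender system is not "an algorithm" but "a system". From multi-channel candidate generation in the retrieval layer, through fine-grained scoring in the ranking layer, to diversity safeguards in the re-ranking layer, every stage has a clear engineering goal.

Key points:

- Retrieval → ranking → re-ranking: a funnel architecture in which each layer optimizes a different objective
- User-CF vs Item-CF: item similarity is more stable and more feasible to engineer
- Explicit vs implicit feedback: e-commerce data is almost entirely implicit, and ALS and BPR are the two corresponding technical routes
- Cold start: new users start from popular items, new items fall back on content-based recommendation, and both transition gradually to collaborative filtering
- Coverage and diversity: do not chase accuracy alone, and keep the system from sliding into a Matthew effect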
