Diversifying Search Results with Maximum Marginal Relevance

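Author: Peter Straßer, Elastic

Implement the Maximum Marginal Relevance (MMR) algorithm with Elasticsearch and Python. This post includes code examples for reranking vector search results.

Elasticsearch is packed with new features to help you build the best search solution for your use case. Browse our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine.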

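When you search an e-commerce catalog for "pants", do you really want to see ten identical pairs of black capris? Probably not. You would more likely want a varied selection that shows different styles, colors, and types of pants. This is where Maximum Marginal Relevance (MMR) comes in: a powerful technique for balancing relevance and diversity in search results.

In this post, we explore how to implement MMR with Elasticsearch to produce more diverse and more useful search results, using a fashion product catalog as the example.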

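The problem: when relevance is not enough

Traditional search systems optimize for a single objective: relevance. They find the content that best matches the query and rank it by similarity score. This works well in many scenarios, but it can lead to redundant results.

For example, when searching a fashion catalog for "pants", a purely relevance-based search might return: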

Black capris (score: 0.682)
Black capris from another brand (score: 0.681)
More black capris (score: 0.680)
Even more black capris (score: 0.680)
...you get the idea

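These results are indeed highly relevant to "pants", but they do little for a user who wants to browse different options. What we need is to introduce diversity while preserving relevance.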

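Introducing Maximum Marginal Relevance (MMR)

MMR is an algorithm that solves this problem elegantly by balancing two competing objectives:

Relevance: how well an item matches the query
Diversity: how different the items are from one another

The algorithm works iteratively: at each step it selects the result that is both highly relevant to the query and clearly different from the items already selected.

This ensures that every added result contributes new information instead of repeating what is already there.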

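How MMR works

The MMR algorithm follows a simple but effective process:

Start with the highest-scoring item
For each remaining item, compute an MMR score that combines its relevance to the query with its dissimilarity to the items already selected
Select the item with the highest MMR score
Repeat until you have enough results

The key is the MMR scoring formula: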

```
MMR Score = λ × relevance - (1 - λ) × max_similarity_to_selected
```

The λ parameter controls this trade-off: λ = 1.0 means pure relevance (no diversity), while λ = 0.0 means pure diversity (relevance is ignored).
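To make the trade-off concrete, here is a minimal sketch that evaluates the formula for two made-up candidates at different values of λ. The function name `mmr_score` and the candidate numbers are assumptions chosen purely for illustration; they do not come from the dataset used in this post.

```python
def mmr_score(relevance: float, max_similarity_to_selected: float, lam: float) -> float:
    """Combine relevance and redundancy into a single MMR score."""
    return lam * relevance - (1 - lam) * max_similarity_to_selected


# Hypothetical candidates: (relevance to query, highest similarity to any selected item)
near_duplicate = (0.68, 0.95)   # very relevant, but almost identical to a selected item
different_style = (0.60, 0.30)  # slightly less relevant, but clearly different

for lam in (1.0, 0.5, 0.0):
    dup = mmr_score(*near_duplicate, lam)
    div = mmr_score(*different_style, lam)
    winner = "near_duplicate" if dup > div else "different_style"
    print(f"lambda={lam}: near_duplicate={dup:.3f}, different_style={div:.3f} -> {winner}")
```

With λ = 1.0 the near-duplicate wins on relevance alone; as soon as redundancy carries weight (λ = 0.5 or lower in this toy setup), the different item is preferred.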

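MMR does not prescribe how you compute relevance; it only needs a score. That score can come from BM25, a learned ranker, or any custom metric you like. Because BM25 depends on per-term statistics stored in the postings lists, which are not available on the client side, this post uses vector similarity as the relevance function. That lets us compute relevance cleanly with a dot product.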
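As a small illustration of using a dot product as the relevance function, the sketch below normalizes the vectors first, so that the dot product equals cosine similarity. The helper name `normalize` and the three-dimensional toy vectors are assumptions for illustration; real embeddings have far more dimensions, and the sketch assumes no vector is all zeros.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length (assumes no all-zero vectors)."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / norms


# Toy query and candidate embeddings
query = normalize(np.array([1.0, 2.0, 0.0]))
candidates = normalize(np.array([
    [2.0, 4.0, 0.0],   # same direction as the query
    [0.0, 0.0, 3.0],   # orthogonal to the query
    [1.0, 0.0, 0.0],   # partially aligned with the query
]))

# One dot product per candidate gives its relevance to the query
relevance = candidates @ query
print(relevance)
```

The first candidate points in the same direction as the query and scores highest, while the orthogonal one scores zero.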

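Implementing MMR

Let's look at how to implement MMR in an image search system using Elasticsearch and multimodal embeddings. To show the effect of MMR, we will use the paramaggarwal/fashion-product-images-dataset dataset.

This post focuses only on the retrieval and reranking parts, but you can find a complete end-to-end example in our search-labs GitHub repository.

First, we need to search for similar items using vector similarity: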

```python
def search_similar_images(es, index_name, query_vector, k=10):
    """Search for similar images using vector similarity."""
    query = {
        "knn": {
            "field": "image_vector",
            "query_vector": query_vector,
            "k": k,
        },
        "_source": ["id", "image_url"],
        "size": k,
    }

    response = es.search(index=index_name, body=query)
    # extract_results is a helper that is not shown in this post
    return extract_results(response)
```
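The `extract_results` helper is not shown in this post. The sketch below is one hypothetical way it could look, assuming only the standard Elasticsearch response layout (`hits.hits[]._source` and `_score`) and the two `_source` fields requested above. The `score` key in the output and the sample response are assumptions for illustration, and the helper in the end-to-end example may differ.

```python
from typing import Any, Dict, List, Mapping


def extract_results(response: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: flatten search hits into simple dictionaries."""
    results = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        results.append({
            "id": source.get("id"),
            "image_url": source.get("image_url"),
            "score": hit.get("_score"),  # similarity score assigned by Elasticsearch
        })
    return results


# Made-up response fragment with the same shape as a search response
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.682, "_source": {"id": "a1", "image_url": "images/a1.jpg"}},
            {"_score": 0.681, "_source": {"id": "b2", "image_url": "images/b2.jpg"}},
        ]
    }
}
print(extract_results(sample_response))
```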

This gives us the initial relevance-ranked results. Now let's apply MMR to rerank them for better diversity:

```python
from typing import List

import numpy as np


def maximal_marginal_relevance(
    query_embedding: List[float],
    embedding_list: List[List[float]],
    lambda_mult: float = 0.5,  # value between 0.0 and 1.0
    k: int = 4,
) -> List[int]:
    query_embedding_arr = np.array(query_embedding)

    if min(k, len(embedding_list)) <= 0:
        return []
    if query_embedding_arr.ndim == 1:
        query_embedding_arr = np.expand_dims(query_embedding_arr, axis=0)

    # Calculate the similarity to the query for all reranking candidates.
    # _cosine_similarity is a helper that is not shown in this post.
    similarity_to_query = _cosine_similarity(query_embedding_arr, embedding_list)[0]
    # Start with the item most similar to the query
    most_similar = int(np.argmax(similarity_to_query))
    idxs = [most_similar]
    selected = np.array([embedding_list[most_similar]])

    # Iteratively select documents that maximize the MMR score
    while len(idxs) < min(k, len(embedding_list)):
        best_score = -np.inf
        idx_to_add = -1

        # Calculate the similarity between all candidate items and all selected items
        similarity_to_selected = _cosine_similarity(embedding_list, selected)

        # Look at all candidates
        for i, query_score in enumerate(similarity_to_query):
            if i in idxs:
                continue

            # Find the highest similarity of this item to already selected items
            redundant_score = max(similarity_to_selected[i])
            # Calculate the MMR score
            equation_score = lambda_mult * query_score - (1 - lambda_mult) * redundant_score

            # Select this item if it has the highest MMR score for this run
            if equation_score > best_score:
                best_score = equation_score
                idx_to_add = i

        # Append the item with the highest MMR score for this run to our list
        idxs.append(idx_to_add)
        selected = np.append(selected, [embedding_list[idx_to_add]], axis=0)
    return idxs
```