作者:来自 Elastic Mayya Sharipova, Benjamin Trent
当前状况:kNN 搜索作为顶层部分
Elasticsearch 中的 kNN 搜索被组织为搜索请求的顶层(top level)部分。 我们这样设计是为了:
- 无论分片数量多少,它总是可以返回全局 k 个最近邻居
- 这些全局 k 个结果与其他查询的结果相结合以形成混合搜索
- 全局 k 结果被传递到聚合以形成统计(facets)。
这是 kNN 搜索在内部执行的简化图(省略了一些阶段):
图 1:顶层 kNN 搜索的步骤是:
- 用户提交搜索请求
- 协调器节点在 DFS 阶段向数据节点发送请求的 kNN 搜索部分
- 每个数据节点运行 kNN 搜索并将本地 top-k 结果发送回协调器
- 协调器合并所有本地结果以形成全局前 k 个最近邻居。
- 协调器将全局 k 个最近邻居发送回数据节点,并提供任何其他查询
- 每个数据节点运行额外的查询并将本地 size 结果发送回协调器
- 协调器合并所有本地结果并向用户发送响应
我们首先在 DFS 阶段运行 kNN 搜索以获得全局前 k 个结果。 然后,这些全局 k 结果被传递到搜索请求的其他部分,例如其他查询或聚合。 即使执行看起来很复杂,但从用户的角度来看,运行 kNN 搜索的模型很简单,因为用户始终可以确保 kNN 搜索返回全局 k 结果。
它的请求格式如下:
sql
1. GET collection-with-embeddings/_search
2. {
3. "knn": {
4. "field": "text_embedding.predicted_value",
5. "query_vector_builder": {
6. "text_embedding": {
7. "model_id": "sentence-transformers__msmarco-distilbert-base-tas-b",
8. "model_text": "How is the weather in Jamaica?"
9. }
10. },
11. "k": 10,
12. "num_candidates": 100
13. },
14. "_source": [
15. "id",
16. "text"
17. ]
18. }
引入 kNN 查询
随着时间的推移,我们意识到还需要将 kNN 搜索表示为查询。 查询是 Elasticsearch 中搜索请求的核心组件,将 kNN 搜索表示为查询可以灵活地将其与其他查询结合起来,以解决更复杂的请求。
kNN 查询与顶层 kNN 搜索不同,没有 k 参数。 与其他查询一样,返回的结果(最近邻居)的数量由 size 参数定义。 与 kNN 搜索类似,num_candidates 参数定义在执行 kNN 搜索时在每个分片上考虑多少个候选者。
bash
1. GET products/_search
2. {
3. "size" : 3,
4. "query": {
5. "knn": {
6. "field": "embedding",
7. "query_vector": [2,2,2,0],
8. "num_candidates": 10
9. }
10. }
11. }
kNN 查询的执行方式与顶层 kNN 搜索不同。 下面是一个简化图,描述了 kNN 查询如何在内部执行(省略了一些阶段):
图 2:基于查询的 kNN 搜索步骤如下:
- 用户提交搜索请求
- 协调器向数据节点发送一个 kNN 搜索查询,并提供附加查询
- 每个数据节点运行查询并将本地大小结果发送回协调器节点
- 协调器节点合并所有本地结果并向用户发送响应
我们在一个分片上运行 kNN 搜索以获得 num_candidates 结果; 这些结果将传递给分片上的其他查询和聚合,以从分片获取大小结果。 由于我们不首先收集全局 k 个最近邻居,因此在此模型中,收集的且对其他查询和聚合可见的最近邻居的数量取决于分片的数量。
kNN 查询 API 示例
让我们看一下 API 示例,这些示例演示了顶层 kNN 搜索和 kNN 查询之间的差异。
我们创建产品索引并索引一些文档:
json
1. PUT products
2. {
3. "mappings": {
4. "dynamic": "strict",
5. "properties": {
6. "department": {
7. "type": "keyword"
8. },
9. "brand": {
10. "type": "keyword"
11. },
12. "description": {
13. "type": "text"
14. },
15. "embedding": {
16. "type": "dense_vector",
17. "index": true,
18. "similarity": "l2_norm"
19. },
20. "price": {
21. "type": "float"
22. }
23. }
24. }
25. }
bash
1. POST products/_bulk?refresh=true
2. {"index":{"_id":1}}
3. {"department":"women","brand": "Levi's", "description":"high-rise red jeans","embedding":[1,1,1,1],"price":100}
4. {"index":{"_id":2}}
5. {"department":"women","brand": "Calvin Klein","description":"high-rise beautiful jeans","embedding":[1,1,1,1],"price":250}
6. {"index":{"_id":3}}
7. {"department":"women","brand": "Gap","description":"every day jeans","embedding":[1,1,1,1],"price":50}
8. {"index":{"_id":4}}
9. {"department":"women","brand": "Levi's","description":"jeans","embedding":[2,2,2,0],"price":75}
10. {"index":{"_id":5}}
11. {"department":"women","brand": "Levi's","description":"luxury jeans","embedding":[2,2,2,0],"price":150}
12. {"index":{"_id":6}}
13. {"department":"men","brand": "Levi's", "description":"jeans","embedding":[2,2,2,0],"price":50}
14. {"index":{"_id":7}}
15. {"department":"women","brand": "Levi's", "description":"jeans 2023","embedding":[2,2,2,0],"price":150}
kNN 查询类似于顶层 kNN 搜索,具有 num_candidates 和充当预过滤器的内部 filter 参数。
bash
1. GET products/_search?filter_path=**.hits
2. {
3. "size" : 3,
4. "query": {
5. "knn": {
6. "field": "embedding",
7. "query_vector": [2,2,2,0],
8. "num_candidates": 10,
9. "filter" : {
10. "term" : {
11. "department" : "women"
12. }
13. }
14. }
15. }
16. }
markdown
1. {
2. "hits": {
3. "hits": [
4. {
5. "_index": "products",
6. "_id": "4",
7. "_score": 1,
8. "_source": {
9. "department": "women",
10. "brand": "Levi's",
11. "description": "jeans",
12. "embedding": [
13. 2,
14. 2,
15. 2,
16. 0
17. ],
18. "price": 75
19. }
20. },
21. {
22. "_index": "products",
23. "_id": "5",
24. "_score": 1,
25. "_source": {
26. "department": "women",
27. "brand": "Levi's",
28. "description": "luxury jeans",
29. "embedding": [
30. 2,
31. 2,
32. 2,
33. 0
34. ],
35. "price": 150
36. }
37. },
38. {
39. "_index": "products",
40. "_id": "7",
41. "_score": 1,
42. "_source": {
43. "department": "women",
44. "brand": "Levi's",
45. "description": "jeans 2023",
46. "embedding": [
47. 2,
48. 2,
49. 2,
50. 0
51. ],
52. "price": 150
53. }
54. }
55. ]
56. }
57. }
kNN 查询比 kNN collapsing 和聚合搜索可以获得更多样化的结果。 对于下面的 kNN 查询,我们在每个分片上执行 kNN 搜索以获得 10 个最近邻居,然后将其传递到 collapsing 以获取 3 个顶部结果。 因此,我们将在响应中得到 3 个不同的点击。
bash
1. GET products/_search?filter_path=**.hits
2. {
3. "size" : 3,
4. "query": {
5. "knn": {
6. "field": "embedding",
7. "query_vector": [2,2,2,0],
8. "num_candidates": 10,
9. "filter" : {
10. "term" : {
11. "department" : "women"
12. }
13. }
14. }
15. },
16. "collapse": {
17. "field": "brand"
18. }
19. }
markdown
1. {
2. "hits": {
3. "hits": [
4. {
5. "_index": "products",
6. "_id": "4",
7. "_score": 1,
8. "_source": {
9. "department": "women",
10. "brand": "Levi's",
11. "description": "jeans",
12. "embedding": [
13. 2,
14. 2,
15. 2,
16. 0
17. ],
18. "price": 75
19. },
20. "fields": {
21. "brand": [
22. "Levi's"
23. ]
24. }
25. },
26. {
27. "_index": "products",
28. "_id": "2",
29. "_score": 0.2,
30. "_source": {
31. "department": "women",
32. "brand": "Calvin Klein",
33. "description": "high-rise beautiful jeans",
34. "embedding": [
35. 1,
36. 1,
37. 1,
38. 1
39. ],
40. "price": 250
41. },
42. "fields": {
43. "brand": [
44. "Calvin Klein"
45. ]
46. }
47. },
48. {
49. "_index": "products",
50. "_id": "3",
51. "_score": 0.2,
52. "_source": {
53. "department": "women",
54. "brand": "Gap",
55. "description": "every day jeans",
56. "embedding": [
57. 1,
58. 1,
59. 1,
60. 1
61. ],
62. "price": 50
63. },
64. "fields": {
65. "brand": [
66. "Gap"
67. ]
68. }
69. }
70. ]
71. }
72. }
顶层 kNN 搜索首先在 DFS 阶段获取全局前 3 个结果,然后在查询阶段将它们传递到 collapse。 我们在响应中只会得到 1 个命中,因为全球 3 个最近的邻居恰好都来自同一品牌。
与聚合类似,kNN query 允许我们获得 3 个不同的存储桶,而 kNN search 仅允许 1 个。
markdown
1. GET products/_search?filter_path=aggregations
2. {
3. "size": 0,
4. "query": {
5. "knn": {
6. "field": "embedding",
7. "query_vector": [2,2,2,0],
8. "num_candidates": 10,
9. "filter" : {
10. "term" : {
11. "department" : "women"
12. }
13. }
14. }
15. },
16. "aggs": {
17. "brands": {
18. "terms": {
19. "field": "brand"
20. }
21. }
22. }
23. }
json
1. {
2. "aggregations": {
3. "brands": {
4. "doc_count_error_upper_bound": 0,
5. "sum_other_doc_count": 0,
6. "buckets": [
7. {
8. "key": "Levi's",
9. "doc_count": 4
10. },
11. {
12. "key": "Calvin Klein",
13. "doc_count": 1
14. },
15. {
16. "key": "Gap",
17. "doc_count": 1
18. }
19. ]
20. }
21. }
22. }
而顶层的 search 是这样的:
markdown
1. GET products/_search?filter_path=aggregations
2. {
3. "size": 0,
4. "knn": {
5. "field": "embedding",
6. "query_vector": [
7. 2,
8. 2,
9. 2,
10. 0
11. ],
12. "k": 3,
13. "num_candidates": 10,
14. "filter": {
15. "term": {
16. "department": "women"
17. }
18. }
19. },
20. "aggs": {
21. "brands": {
22. "terms": {
23. "field": "brand"
24. }
25. }
26. }
27. }
markdown
1. {
2. "aggregations": {
3. "brands": {
4. "doc_count_error_upper_bound": 0,
5. "sum_other_doc_count": 0,
6. "buckets": [
7. {
8. "key": "Levi's",
9. "doc_count": 3
10. }
11. ]
12. }
13. }
14. }
现在,让我们看一下其他示例,展示 kNN 查询的灵活性。 具体来说,它如何能够灵活地与其他查询结合起来。
kNN 可以是 boolean 查询的一部分(需要注意的是,所有外部查询过滤器都用作 kNN 搜索的后过滤器)。 我们可以使用 kNN 查询的 _name 参数来通过额外信息来增强结果,这些信息告诉 kNN 查询是否匹配及其分数贡献。
bash
1. GET products/_search?include_named_queries_score
2. {
3. "size": 3,
4. "query": {
5. "bool": {
6. "should": [
7. {
8. "knn": {
9. "field": "embedding",
10. "query_vector": [2,2,2,0],
11. "num_candidates": 10,
12. "_name": "knn_query"
13. }
14. },
15. {
16. "match": {
17. "description": {
18. "query": "luxury",
19. "_name": "bm25query"
20. }
21. }
22. }
23. ]
24. }
25. }
26. }
json
1. {
2. "took": 2,
3. "timed_out": false,
4. "_shards": {
5. "total": 1,
6. "successful": 1,
7. "skipped": 0,
8. "failed": 0
9. },
10. "hits": {
11. "total": {
12. "value": 7,
13. "relation": "eq"
14. },
15. "max_score": 2.8042283,
16. "hits": [
17. {
18. "_index": "products",
19. "_id": "5",
20. "_score": 2.8042283,
21. "_source": {
22. "department": "women",
23. "brand": "Levi's",
24. "description": "luxury jeans",
25. "embedding": [
26. 2,
27. 2,
28. 2,
29. 0
30. ],
31. "price": 150
32. },
33. "matched_queries": {
34. "knn_query": 1,
35. "bm25query": 1.8042282
36. }
37. },
38. {
39. "_index": "products",
40. "_id": "4",
41. "_score": 1,
42. "_source": {
43. "department": "women",
44. "brand": "Levi's",
45. "description": "jeans",
46. "embedding": [
47. 2,
48. 2,
49. 2,
50. 0
51. ],
52. "price": 75
53. },
54. "matched_queries": {
55. "knn_query": 1
56. }
57. },
58. {
59. "_index": "products",
60. "_id": "6",
61. "_score": 1,
62. "_source": {
63. "department": "men",
64. "brand": "Levi's",
65. "description": "jeans",
66. "embedding": [
67. 2,
68. 2,
69. 2,
70. 0
71. ],
72. "price": 50
73. },
74. "matched_queries": {
75. "knn_query": 1
76. }
77. }
78. ]
79. }
80. }
kNN 也可以是复杂查询的一部分,例如 pinned 查询。 当我们想要显示最接近的结果,但又想要提升选定数量的其他结果时,这非常有用。
bash
1. GET products/_search
2. {
3. "size": 3,
4. "query": {
5. "pinned": {
6. "ids": [ "1", "2" ],
7. "organic": {
8. "knn": {
9. "field": "embedding",
10. "query_vector": [2,2,2,0],
11. "num_candidates": 10,
12. "_name": "knn_query"
13. }
14. }
15. }
16. }
17. }
json
1. {
2. "took": 9,
3. "timed_out": false,
4. "_shards": {
5. "total": 1,
6. "successful": 1,
7. "skipped": 0,
8. "failed": 0
9. },
10. "hits": {
11. "total": {
12. "value": 7,
13. "relation": "eq"
14. },
15. "max_score": 1.7014124e+38,
16. "hits": [
17. {
18. "_index": "products",
19. "_id": "1",
20. "_score": 1.7014124e+38,
21. "_source": {
22. "department": "women",
23. "brand": "Levi's",
24. "description": "high-rise red jeans",
25. "embedding": [
26. 1,
27. 1,
28. 1,
29. 1
30. ],
31. "price": 100
32. },
33. "matched_queries": [
34. "knn_query"
35. ]
36. },
37. {
38. "_index": "products",
39. "_id": "2",
40. "_score": 1.7014122e+38,
41. "_source": {
42. "department": "women",
43. "brand": "Calvin Klein",
44. "description": "high-rise beautiful jeans",
45. "embedding": [
46. 1,
47. 1,
48. 1,
49. 1
50. ],
51. "price": 250
52. },
53. "matched_queries": [
54. "knn_query"
55. ]
56. },
57. {
58. "_index": "products",
59. "_id": "4",
60. "_score": 1,
61. "_source": {
62. "department": "women",
63. "brand": "Levi's",
64. "description": "jeans",
65. "embedding": [
66. 2,
67. 2,
68. 2,
69. 0
70. ],
71. "price": 75
72. },
73. "matched_queries": [
74. "knn_query"
75. ]
76. }
77. ]
78. }
79. }
我们甚至可以将 kNN 查询作为 function_score 查询的一部分。 当我们需要为 kNN 查询返回的结果定义自定义分数时,这非常有用:
css
1. GET products/_search
2. {
3. "size": 3,
4. "query": {
5. "function_score": {
6. "query": {
7. "knn": {
8. "field": "embedding",
9. "query_vector": [2,2,2,0],
10. "num_candidates": 10,
11. "_name": "knn_query"
12. }
13. },
14. "functions": [
15. {
16. "filter": { "match": { "department": "men" } },
17. "weight": 100
18. },
19. {
20. "filter": { "match": { "department": "women" } },
21. "weight": 50
22. }
23. ]
24. }
25. }
26. }
json
1. {
2. "took": 3,
3. "timed_out": false,
4. "_shards": {
5. "total": 1,
6. "successful": 1,
7. "skipped": 0,
8. "failed": 0
9. },
10. "hits": {
11. "total": {
12. "value": 7,
13. "relation": "eq"
14. },
15. "max_score": 100,
16. "hits": [
17. {
18. "_index": "products",
19. "_id": "6",
20. "_score": 100,
21. "_source": {
22. "department": "men",
23. "brand": "Levi's",
24. "description": "jeans",
25. "embedding": [
26. 2,
27. 2,
28. 2,
29. 0
30. ],
31. "price": 50
32. },
33. "matched_queries": [
34. "knn_query"
35. ]
36. },
37. {
38. "_index": "products",
39. "_id": "4",
40. "_score": 50,
41. "_source": {
42. "department": "women",
43. "brand": "Levi's",
44. "description": "jeans",
45. "embedding": [
46. 2,
47. 2,
48. 2,
49. 0
50. ],
51. "price": 75
52. },
53. "matched_queries": [
54. "knn_query"
55. ]
56. },
57. {
58. "_index": "products",
59. "_id": "5",
60. "_score": 50,
61. "_source": {
62. "department": "women",
63. "brand": "Levi's",
64. "description": "luxury jeans",
65. "embedding": [
66. 2,
67. 2,
68. 2,
69. 0
70. ],
71. "price": 150
72. },
73. "matched_queries": [
74. "knn_query"
75. ]
76. }
77. ]
78. }
79. }
当我们想要组合 kNN 搜索和其他查询的结果时,kNN 查询作为 dis_max 查询的一部分非常有用,以便文档的分数来自排名最高的子句,并为任何其他子句提供打破平局的增量。
bash
1. GET products/_search
2. {
3. "size": 5,
4. "query": {
5. "dis_max": {
6. "queries": [
7. {
8. "knn": {
9. "field": "embedding",
10. "query_vector": [2,2, 2,0],
11. "num_candidates": 3,
12. "_name": "knn_query"
13. }
14. },
15. {
16. "match": {
17. "description": "high-rise jeans"
18. }
19. }
20. ],
21. "tie_breaker": 0.8
22. }
23. }
24. }
json
1. {
2. "took": 1,
3. "timed_out": false,
4. "_shards": {
5. "total": 1,
6. "successful": 1,
7. "skipped": 0,
8. "failed": 0
9. },
10. "hits": {
11. "total": {
12. "value": 7,
13. "relation": "eq"
14. },
15. "max_score": 1.890432,
16. "hits": [
17. {
18. "_index": "products",
19. "_id": "1",
20. "_score": 1.890432,
21. "_source": {
22. "department": "women",
23. "brand": "Levi's",
24. "description": "high-rise red jeans",
25. "embedding": [
26. 1,
27. 1,
28. 1,
29. 1
30. ],
31. "price": 100
32. }
33. },
34. {
35. "_index": "products",
36. "_id": "2",
37. "_score": 1.890432,
38. "_source": {
39. "department": "women",
40. "brand": "Calvin Klein",
41. "description": "high-rise beautiful jeans",
42. "embedding": [
43. 1,
44. 1,
45. 1,
46. 1
47. ],
48. "price": 250
49. }
50. },
51. {
52. "_index": "products",
53. "_id": "4",
54. "_score": 1.0679927,
55. "_source": {
56. "department": "women",
57. "brand": "Levi's",
58. "description": "jeans",
59. "embedding": [
60. 2,
61. 2,
62. 2,
63. 0
64. ],
65. "price": 75
66. },
67. "matched_queries": [
68. "knn_query"
69. ]
70. },
71. {
72. "_index": "products",
73. "_id": "6",
74. "_score": 1.0679927,
75. "_source": {
76. "department": "men",
77. "brand": "Levi's",
78. "description": "jeans",
79. "embedding": [
80. 2,
81. 2,
82. 2,
83. 0
84. ],
85. "price": 50
86. },
87. "matched_queries": [
88. "knn_query"
89. ]
90. },
91. {
92. "_index": "products",
93. "_id": "5",
94. "_score": 1.0556482,
95. "_source": {
96. "department": "women",
97. "brand": "Levi's",
98. "description": "luxury jeans",
99. "embedding": [
100. 2,
101. 2,
102. 2,
103. 0
104. ],
105. "price": 150
106. },
107. "matched_queries": [
108. "knn_query"
109. ]
110. }
111. ]
112. }
113. }
kNN 搜索作为查询已在 8.12 版本中引入。 请尝试一下,如果有任何反馈,我们将不胜感激。
原文:Introducing kNN query, an expert way to do kNN search --- Elastic Search Labs