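Elasticsearch: How to implement hybrid search in ES|QL with the FORK and FUSE commands - 9.1+

In Query DSL, it is easy to run a hybrid search over multiple retrievers using RRF or a linear combination. For example, the following request performs an RRF hybrid search: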

```
GET /imdb_movies/_search?pretty
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "term": {
                "overview": "clueless slackers"
              }
            }
          }
        },
        {
          "knn": {
            "field": "overview_dense",
            "query_vector_builder": {
              "text_embedding": {
                "model_id": ".multilingual-e5-small_linux-x86_64",
                "model_text": "clueless slackers"
              }
            },
            "k": 5,
            "num_candidates": 5
          }
        },
        {
          "standard": {
            "query": {
              "text_expansion": {
                "overview_sparse": {
                  "model_id": ".elser_model_2_linux-x86_64",
                  "model_text": "clueless slackers"
                }
              }
            }
          }
        }
      ],
      "rank_window_size": 5,
      "rank_constant": 1
    }
  },
  "size": 3,
  "fields": [
    "names",
    "overview"
  ],
  "_source": false
}
```

For details, please read the earlier article "Elasticsearch:介绍 retrievers - 搜索一切事物". The request above uses RRF. For an introduction to RRF, see the article "Elasticsearch:倒数排序融合 - Reciprocal rank fusion (RRF)".
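To make the RRF scoring concrete, here is a minimal Python sketch. It assumes the formula given in the Elasticsearch RRF documentation, where each document receives the sum of 1 / (rank_constant + rank) over every result list it appears in, with ranks starting at 1. The function name `rrf` and the document IDs are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def rrf(result_lists, rank_constant=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in result_lists:
        # Ranks start at 1, so the top hit contributes 1 / (rank_constant + 1).
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical rankings from three retrievers (lexical, dense, sparse).
lexical = ["a", "b", "c"]
dense = ["b", "a", "d"]
sparse = ["b", "c", "a"]

fused = rrf([lexical, dense, sparse], rank_constant=1)
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Only the rank of a document in each list matters, not its raw score, which is why RRF can combine BM25 and vector results without any normalization.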

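We can also assign a weight to each retriever to balance the hybrid search, as in the weighted RRF example below. For details, please read the article "平衡尺度:利用权重使倒数排序融合 (RRF) 更加智能".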

```
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "match": {
                  "cuisine_type": "Italian"
                }
              }
            }
          },
          "weight": 0.4
        },
        {
          "retriever": {
            "standard": {
              "query": {
                "match": {
                  "menu_items": "cacio e pepe"
                }
              }
            }
          },
          "weight": 0.6
        }
      ]
    }
  }
}
```
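The effect of per-retriever weights can be sketched in Python. This sketch assumes that weighted RRF multiplies each reciprocal-rank contribution by the weight of the retriever that produced it. The function name `weighted_rrf` and the restaurant IDs are hypothetical, and this is not Elasticsearch's implementation.

```python
def weighted_rrf(ranked_lists, weights, rank_constant=60):
    """Return document IDs ordered by weighted reciprocal rank fusion."""
    scores = {}
    for results, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(results, start=1):
            # Assumption: each contribution is scaled by its retriever's weight.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank_constant + rank)
    # Highest score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for the two match queries.
cuisine_hits = ["r1", "r2", "r3"]  # cuisine_type: Italian
menu_hits = ["r3", "r2", "r1"]     # menu_items: cacio e pepe

equal = weighted_rrf([cuisine_hits, menu_hits], [0.5, 0.5], rank_constant=1)
skewed = weighted_rrf([cuisine_hits, menu_hits], [0.4, 0.6], rank_constant=1)
print(equal)
print(skewed)
```

With equal weights, r1 and r3 tie. Shifting weight toward the menu query moves r3, its top hit, to first place.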

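With the arrival of ES|QL, more and more queries are being written in ES|QL. So how do we implement hybrid search there?

Starting in 9.1, Elastic introduced the FORK and FUSE commands to help us implement this kind of hybrid search.

Preparing the data

First, we create the following data set: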

```
PUT /people
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "copy_to": "des_semantic"
      },
      "des_semantic": {
        "type": "semantic_text",
        "inference_id": ".multilingual-e5-small-elasticsearch"
      },
      "sex": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      },
      "address": {
        "type": "text"
      },
      "location": {
        "type": "geo_point"
      }
    }
  }
}
```
```
POST /_bulk
{ "index" : { "_index" : "people", "_id" : "1" } }
{ "id": 1, "name" : "John Doe", "description" : "A software developer", "sex" : "Male", "age" : 30, "address" : "123 Elm Street, Springfield", "location": {"lat": 37.7749, "lon": -122.4194} }
{ "index" : { "_index" : "people", "_id" : "2" } }
{ "id": 2, "name" : "Jane Smith", "description" : "A project manager", "sex" : "Female", "age" : 28, "address" : "456 Maple Avenue, Anytown", "location": {"lat": 40.7128, "lon": -74.0060} }
{ "index" : { "_index" : "people", "_id" : "3" } }
{ "id": 3, "name" : "Alice Johnson", "description" : "A graphic designer", "sex" : "Female", "age" : 26, "address" : "789 Oak Lane, Metropolis", "location": {"lat": 34.0522, "lon": -118.2437} }
{ "index" : { "_index" : "people", "_id" : "4" } }
{ "id": 4, "name" : "Bob Brown", "description" : "A marketing specialist", "sex" : "Male", "age" : 32, "address" : "321 Pine Street, Gotham", "location": {"lat": 41.8781, "lon": -87.6298} }
{ "index" : { "_index" : "people", "_id" : "5" } }
{ "id": 5, "name" : "Charlie Davis", "description" : "An IT analyst", "sex" : "Male", "age" : 29, "address" : "654 Cedar Blvd, Star City", "location": {"lat": 29.7604, "lon": -95.3698} }
{ "index" : { "_index" : "people", "_id" : "6" } }
{ "id": 6, "name" : "Diana Prince", "description" : "A diplomat", "sex" : "Female", "age" : 35, "address" : "987 Birch Road, Themyscira", "location": {"lat": 39.9526, "lon": -75.1652} }
{ "index" : { "_index" : "people", "_id" : "7" } }
{ "id": 7, "name" : "Evan Wright", "description" : "A journalist", "sex" : "Male", "age" : 27, "address" : "213 Willow Lane, Central City", "location": {"lat": 33.4484, "lon": -112.0740} }
{ "index" : { "_index" : "people", "_id" : "8" } }
{ "id": 8, "name" : "Fiona Gallagher", "description" : "A nurse", "sex" : "Female", "age" : 31, "address" : "546 Spruce Street, South Side", "location": {"lat": 32.7157, "lon": -117.1611} }
{ "index" : { "_index" : "people", "_id" : "9" } }
{ "id": 9, "name" : "George King", "description" : "A teacher", "sex" : "Male", "age" : 34, "address" : "879 Elm St, Smallville", "location": {"lat": 39.7392, "lon": -104.9903} }
{ "index" : { "_index" : "people", "_id" : "10" } }
{ "id": 10, "name" : "Helen Parr", "description" : "A full-time superhero", "sex" : "Female", "age": 37, "address" : "123 Metro Avenue, Metroville", "location": {"lat": 47.6062, "lon": -122.3321} }
```

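As shown above, the people index has a description field of type text. The des_semantic field is of type semantic_text and effectively holds a dense vector: its embeddings are generated by the multilingual E5 model, from the description text copied in through copy_to.

We can run the following vector query against the des_semantic field: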

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _score
     | WHERE MATCH(des_semantic, "coder")
     | SORT _score DESC | LIMIT 2
     | KEEP description
  """
}
```

Running the query above returns the two descriptions that are semantically closest to "coder".

FORK

The FORK processing command creates multiple execution branches that operate on the same input data and combines the results into a single output table.

```
FORK ( <processing_commands> ) ( <processing_commands> ) ... ( <processing_commands> )
```

Description

The output table includes an extra discriminator column, _fork, that identifies which branch each row came from (fork1, fork2, and so on).
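The branching behaviour can be modelled with a short Python sketch. It is a simplified illustration only, not how Elasticsearch executes FORK: the `fork` function and the two filter helpers are hypothetical stand-ins for the processing commands inside each branch.

```python
def fork(rows, *branches):
    """Run every branch over the same input rows and tag each output row."""
    output = []
    for i, branch in enumerate(branches, start=1):
        for row in branch(rows):
            # The discriminator column records the originating branch.
            output.append({**row, "_fork": f"fork{i}"})
    return output


rows = [
    {"id": 1, "description": "A software developer"},
    {"id": 7, "description": "A journalist"},
    {"id": 9, "description": "A teacher"},
]


# Hypothetical stand-ins for the WHERE clauses of two branches.
def has_developer(rs):
    return [r for r in rs if "developer" in r["description"]]


def has_journalist(rs):
    return [r for r in rs if "journalist" in r["description"]]


combined = fork(rows, has_developer, has_journalist)
for row in combined:
    print(row["_fork"], row["description"])
```

Every branch sees the full input, so a row that satisfies more than one branch appears once per branch, each copy carrying a different _fork value.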

Combined with the FUSE command, FORK enables hybrid search by merging and scoring the results of multiple queries. To learn more about searching with ES|QL, see ES|QL for search.

Example

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _score
     | FORK (WHERE MATCH(des_semantic, "coder"))
            (WHERE MATCH(description, "journalist"))
     | SORT _score DESC
     | KEEP description
     | LIMIT 2
  """
}
```

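Clearly, the results contain matches for both "journalist" and "coder". Returning the results of several queries at once is useful, but WHERE MATCH(description, "journalist") is scored with BM25, while WHERE MATCH(des_semantic, "coder") is scored by vector similarity. The two scoring schemes are not comparable: similarity scores fall between 0 and 1, whereas BM25 scores have no fixed upper bound.

Let's modify the query once more: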

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _score
     | FORK (WHERE MATCH(des_semantic, "coder"))
            (WHERE MATCH(description, "journalist"))
     | KEEP description, _score
     | LIMIT 2
  """
}
```

In the query above, we keep the _score value so that the raw score from each branch is visible.

The FUSE command

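As the results above show, FORK lets us retrieve from multiple branches, but each branch scores on its own scale. We need a way to score the results of all branches uniformly and produce one final ranking. FUSE was built for exactly this kind of hybrid search. The FUSE processing command merges rows from multiple result sets and assigns new relevance scores.

Combined with the FORK command, FUSE enables hybrid search by merging and scoring the results of multiple queries.

FUSE works by:

  • Merging rows that have matching <key_columns> values

  • Assigning new relevance scores using the specified <fuse_method> algorithm, based on the values of <group_column> and <score_column>

Tip: FUSE is intended for search use cases: it merges ranked result sets and computes relevance. To learn more, see the Elastic documentation on how search works in ES|QL.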
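The two steps FUSE performs can be sketched in Python for the RRF case. This is a simplified model under stated assumptions, not Elasticsearch's implementation: it assumes the key columns are _id and _index, the group column is _fork, the score column is _score, and the rank constant is 60. The function name `fuse_rrf` and the sample rows are hypothetical.

```python
from collections import defaultdict


def fuse_rrf(rows, key_columns=("_id", "_index"), group_column="_fork",
             score_column="_score", rank_constant=60):
    """Merge rows by key and assign RRF scores based on per-group ranks."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_column]].append(row)

    fused = {}
    for group_rows in groups.values():
        # Rank rows inside each group by their original score, best first.
        ranked = sorted(group_rows, key=lambda r: r[score_column], reverse=True)
        for rank, row in enumerate(ranked, start=1):
            key = tuple(row[c] for c in key_columns)
            # The first row seen for a key supplies the other column values.
            merged = fused.setdefault(key, {**row, score_column: 0.0})
            merged[score_column] += 1.0 / (rank_constant + rank)
    return sorted(fused.values(), key=lambda r: r[score_column], reverse=True)


# Hypothetical FORK output: document 1 is returned by both branches.
rows = [
    {"_id": "1", "_index": "people", "_score": 0.9, "_fork": "fork1"},
    {"_id": "5", "_index": "people", "_score": 0.8, "_fork": "fork1"},
    {"_id": "1", "_index": "people", "_score": 2.1, "_fork": "fork2"},
]
result = fuse_rrf(rows)
for row in result:
    print(row["_id"], round(row["_score"], 5))
```

Document 1 ranks first in both groups, so its two rows collapse into one with a score of 2/61, ahead of document 5 at 1/62.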

Examples

Using RRF

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _id, _index, _score
     | FORK (WHERE des_semantic: "coder" | SORT _score DESC)
            (WHERE description: "journalist" | SORT _score DESC)
     | FUSE
     | KEEP description, _score
     | LIMIT 2
  """
}
```

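When no method is given, FUSE uses RRF. Note: for this query, _id, _index, and _score must all be listed in METADATA. Otherwise the query fails with an error.

The search above looks for two different professions at once, which is unusual in practice. Let's search for "coder" in both branches instead: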

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _id, _index, _score
     | FORK (WHERE des_semantic: "coder" | SORT _score DESC)
            (WHERE description: "coder" | SORT _score DESC)
     | FUSE
     | KEEP description, _score
     | LIMIT 2
  """
}
```

To make things clearer, we add a search_type column that labels each branch:

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _id, _index, _score
     | FORK (WHERE des_semantic: "coder" | SORT _score DESC | EVAL search_type = "semantic" )
            (WHERE description: "coder" | SORT _score DESC | EVAL search_type = "bm25")
     | FUSE
     | KEEP description, _score, search_type
     | LIMIT 2
  """
}
```

We can also query for "software developer" directly:

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _id, _index, _score
     | FORK (WHERE des_semantic: "software developer" | SORT _score DESC | EVAL search_type = "semantic" )
            (WHERE description: "software developer" | SORT _score DESC | EVAL search_type = "bm25")
     | FUSE
     | KEEP description, _score, search_type
     | LIMIT 2
  """
}
```

Using linear combination

FUSE can also combine scores linearly:

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _id, _index, _score
     | FORK (WHERE des_semantic: "software developer" | SORT _score DESC | EVAL search_type = "semantic" )
            (WHERE description: "software developer" | SORT _score DESC | EVAL search_type = "bm25")
     | FUSE LINEAR
     | KEEP description, _score, search_type
     | LIMIT 2
  """
}
```

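We can also use custom weights. FUSE lets you assign different weights to scores based on the value of the _fork column, so you can control the relative importance of each query branch in the final result.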

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _id, _index, _score
     | FORK (WHERE des_semantic: "software developer" | SORT _score DESC | EVAL search_type = "semantic" )
            (WHERE description: "software developer" | SORT _score DESC | EVAL search_type = "bm25")
     | FUSE LINEAR WITH { "weights": { "fork1": 0.7, "fork2": 0.3 }, "normalizer": "minmax" }
     | KEEP description, _score, search_type
     | LIMIT 2
  """
}
```
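The following Python sketch shows how per-branch weights can change the final order in a linear combination. It assumes the fused score of a document is the weighted sum of its scores across branches, and that a branch with no explicit weight counts as 1.0. Both are assumptions of this sketch, and the function name `fuse_linear` and the sample rows (with scores taken as already normalized) are hypothetical.

```python
def fuse_linear(rows, weights=None):
    """Sum weighted scores per (_id, _index) key and sort best first."""
    weights = weights or {}
    fused = {}
    for row in rows:
        key = (row["_id"], row["_index"])
        # Assumption: branches without an explicit weight use 1.0.
        weight = weights.get(row["_fork"], 1.0)
        fused[key] = fused.get(key, 0.0) + weight * row["_score"]
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical, already normalized scores from two branches.
rows = [
    {"_id": "1", "_index": "people", "_fork": "fork1", "_score": 1.0},
    {"_id": "5", "_index": "people", "_fork": "fork1", "_score": 0.5},
    {"_id": "5", "_index": "people", "_fork": "fork2", "_score": 1.0},
    {"_id": "1", "_index": "people", "_fork": "fork2", "_score": 0.0},
]

unweighted = fuse_linear(rows)
weighted = fuse_linear(rows, {"fork1": 0.7, "fork2": 0.3})
print(unweighted[0][0], weighted[0][0])
```

Without weights, document 5 wins with 1.5 against 1.0. With weights of 0.7 and 0.3, the first branch dominates and document 1 moves to the top with 0.7 against 0.65.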

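Score normalization

When using linear combination to merge the results of semantic and lexical queries, we recommend normalizing the scores of each result set first.

The following example uses minmax score normalization. This means the scores are normalized to values between 0 and 1 before the rows are merged: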

```
POST _query?format=txt
{
  "query": """
     FROM people METADATA _id, _index, _score
     | FORK (WHERE des_semantic: "software developer" | SORT _score DESC | EVAL search_type = "semantic" )
            (WHERE description: "software developer" | SORT _score DESC | EVAL search_type = "bm25")
     | FUSE LINEAR WITH { "normalizer": "minmax" }
     | KEEP description, _score, search_type
     | LIMIT 2
  """
}
```
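Min-max normalization itself is simple to sketch in Python. The version below rescales one result set so that its lowest score becomes 0 and its highest becomes 1. The function name `minmax`, the sample scores, and the handling of a result set whose scores are all equal are assumptions of this sketch, not documented Elasticsearch behaviour.

```python
def minmax(scores):
    """Rescale a list of scores linearly into the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption for this sketch only: identical scores all map to 1.0.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical raw scores: BM25 is unbounded, similarity is tightly clustered.
bm25_scores = [4.2, 2.1, 0.7]
semantic_scores = [0.91, 0.88, 0.85]

bm25_norm = minmax(bm25_scores)
semantic_norm = minmax(semantic_scores)
print([round(s, 3) for s in bm25_norm])
print([round(s, 3) for s in semantic_norm])
```

After normalization both result sets span the same 0 to 1 range, so neither branch dominates the linear combination simply because its raw scores are larger.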