Implementing Autocomplete in Elasticsearch, Part 3: The Completion Suggester

In this post, we discuss the completion suggester: a suggester optimized for autocomplete that is generally considered faster than the approaches we have covered so far.

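The completion suggester uses a data structure called a finite state transducer (FST), which resembles a trie and is optimized for fast lookups. These structures are held in memory on the nodes to make searches faster. Like edge n-grams and search_as_you_type, this approach does most of its work at index time, by updating the in-memory FST with the inputs we supply.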
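
To make the prefix-lookup idea concrete, here is a minimal Python sketch of a plain trie. It is only a stand-in for illustration: a real FST also shares suffixes and is built by Lucene, and the class and method names below (`PrefixTrie`, `insert`, `suggest`) are made up for this example.

```python
class PrefixTrie:
    """Toy trie used as a simplified stand-in for the FST (not Lucene's structure)."""

    def __init__(self):
        self.children = {}
        # (weight, original text) pairs for every input that passes through this node
        self.suggestions = []

    def insert(self, text, weight=1):
        # All the work happens at "index time": walk the characters once
        # and record the input on every node along the path.
        node = self
        for ch in text.lower():
            node = node.children.setdefault(ch, PrefixTrie())
            node.suggestions.append((weight, text))

    def suggest(self, prefix, size=5):
        # Lookup only follows the typed characters, then reads the stored inputs.
        node = self
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        ranked = sorted(node.suggestions, key=lambda pair: -pair[0])
        return [text for _, text in ranked[:size]]


trie = PrefixTrie()
trie.insert("Harry Potter and the Goblet of Fire", weight=5)
trie.insert("Goblet of Fire", weight=10)
print(trie.suggest("goblet of f"))
print(trie.suggest("potter"))
```

The second lookup finds nothing because "potter" is not a prefix of any stored input, which is the same limitation the completion suggester has.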

Elasticsearch provides a dedicated field type, completion, for this:

PUT /movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

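The mapping also supports the analyzer, search_analyzer, and max_input_length parameters for completion fields. The analyzer defaults to the simple analyzer, which lowercases the input and splits it on any non-letter character such as digits, spaces, and hyphens. Analyzers behave differently on completion fields than on other text fields: after analysis the tokens are not used individually, but are put back together in the order they appear in the input text and inserted into the FST. Also, we cannot use the _analyze endpoint to test our mapping with this approach. If we try, Elasticsearch throws an error stating "Can't process field [title], Analysis requests are only supported on tokenized fields".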
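
Since _analyze is not available here, a rough local approximation can help build intuition for what the simple analyzer does to an input. The sketch below is an assumption-laden imitation in Python (the function name `simple_analyze` is hypothetical), not the Lucene implementation.

```python
import re


def simple_analyze(text):
    # Approximation of the simple analyzer: lowercase the text, then keep
    # maximal runs of letters. Digits, spaces, hyphens and other
    # punctuation act as separators and are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


tokens = simple_analyze("Harry Potter-2: Goblet of Fire")
print(tokens)
```

The tokens come back in input order, and it is that ordered sequence, not the individual tokens, that ends up in the FST.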

When indexing a document, we specify the input and an optional weight parameter:

POST /movies/_doc/1001
{
  "title": [
    {
      "input": "Harry Potter and the Goblet of Fire",
      "weight": 5
    },
    {
      "input": "Goblet of Fire",
      "weight": 10
    }
  ]
}

POST /movies/_doc/1002
{
  "title": {
    "input": [
      "Harry Potter and the Goblet of Fire",
      "Goblet of Fire"
    ],
    "weight": 2
  }
}

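The input parameter lets us specify multiple matches for a single document. The weight parameter controls how the document ranks in the results. It can be set per input, as in the first document (1001) above, or shared by all inputs, as in the second document (1002).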
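
If documents are assembled in application code, the two shapes can be produced by small helpers. This is only a sketch: the helper names `per_input_weights` and `shared_weight` are hypothetical, and the code just builds the JSON bodies shown above without sending them anywhere.

```python
import json


def per_input_weights(weighted_inputs):
    # One object per input, each carrying its own weight (document 1001 style).
    return [{"input": text, "weight": weight} for text, weight in weighted_inputs]


def shared_weight(inputs, weight=None):
    # A single object with a list of inputs and one weight (document 1002 style).
    field = {"input": list(inputs)}
    if weight is not None:
        field["weight"] = weight
    return field


doc_1001 = {
    "title": per_input_weights(
        [("Harry Potter and the Goblet of Fire", 5), ("Goblet of Fire", 10)]
    )
}
doc_1002 = {
    "title": shared_weight(
        ["Harry Potter and the Goblet of Fire", "Goblet of Fire"], weight=2
    )
}
print(json.dumps(doc_1001, indent=2))
print(json.dumps(doc_1002, indent=2))
```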

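Suggestions are requested through the suggest clause in the body of a _search request. Older versions of Elasticsearch had a separate _suggest endpoint for suggesters, and many examples on the internet still use it. That endpoint was deprecated in version 5.0 in favor of _search and has since been removed.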

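By default, Elasticsearch returns the entire matching document. If we only care about the suggestion text, we can use the _source option to restrict the response to the suggestion field (title.input in the request below). This keeps disk fetches and transfer overhead to a minimum: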

GET /movies/_search?filter_path=**.harry_suggest
{
  "_source": "title.input",
  "suggest": {
    "harry_suggest": {
      "prefix": "goblet of f",
      "completion": {
        "field": "title"
      }
    }
  }
}

The command above returns:

{
  "suggest": {
    "harry_suggest": [
      {
        "text": "goblet of f",
        "offset": 0,
        "length": 11,
        "options": [
          {
            "text": "Goblet of Fire",
            "_index": "movies",
            "_id": "1001",
            "_score": 10,
            "_source": {
              "title": [
                {
                  "input": "Harry Potter and the Goblet of Fire"
                },
                {
                  "input": "Goblet of Fire"
                }
              ]
            }
          },
          {
            "text": "Goblet of Fire",
            "_index": "movies",
            "_id": "1002",
            "_score": 2,
            "_source": {
              "title": {
                "input": [
                  "Harry Potter and the Goblet of Fire",
                  "Goblet of Fire"
                ]
              }
            }
          }
        ]
      }
    ]
  }
}

"Goblet of Fire" appears twice in the suggestions because we supplied this text as an input in two documents. This can be avoided with the skip_duplicates option.

GET /movies/_search?filter_path=**.harry_suggest
{
  "_source": "title.input",
  "suggest": {
    "harry_suggest": {
      "prefix": "goblet of f",
      "completion": {
        "field": "title",
        "skip_duplicates": true
      }
    }
  }
}

{
  "suggest": {
    "harry_suggest": [
      {
        "text": "goblet of f",
        "offset": 0,
        "length": 11,
        "options": [
          {
            "text": "Goblet of Fire",
            "_index": "movies",
            "_id": "1001",
            "_score": 10,
            "_source": {
              "title": [
                {
                  "input": "Harry Potter and the Goblet of Fire"
                },
                {
                  "input": "Goblet of Fire"
                }
              ]
            }
          }
        ]
      }
    ]
  }
}

Warning: when set to true, this option can slow down searches, because more suggestions have to be visited to find the top N.
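
On the client side, the suggest responses shown above are nested a few levels deep, so a small helper to flatten them is handy. The sketch below (the function name `suggestion_options` and its `dedupe` flag are hypothetical) also shows dropping repeated texts in application code. Note that this is not equivalent to skip_duplicates: deduplicating after the fact can leave fewer than N suggestions.

```python
def suggestion_options(response, name, dedupe=False):
    """Flatten one named suggestion into (text, _id, _score) tuples."""
    seen = set()
    flattened = []
    for entry in response.get("suggest", {}).get(name, []):
        for option in entry.get("options", []):
            if dedupe and option["text"] in seen:
                continue  # keep only the first (highest-ranked) occurrence
            seen.add(option["text"])
            flattened.append((option["text"], option["_id"], option["_score"]))
    return flattened


# Trimmed copy of the first response shown earlier (the _source parts are omitted).
response = {
    "suggest": {
        "harry_suggest": [
            {
                "text": "goblet of f",
                "offset": 0,
                "length": 11,
                "options": [
                    {"text": "Goblet of Fire", "_index": "movies", "_id": "1001", "_score": 10},
                    {"text": "Goblet of Fire", "_index": "movies", "_id": "1002", "_score": 2},
                ],
            }
        ]
    }
}
print(suggestion_options(response, "harry_suggest"))
print(suggestion_options(response, "harry_suggest", dedupe=True))
```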

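With the completion suggester, Elasticsearch matches documents one character at a time starting from the first character, moving forward one position with each new character typed. As mentioned above, it preserves the order of the input in the FST. It therefore cannot match in the middle of an input the way the n-gram based approaches can. That is, if you have a movie called "Harry Potter and the Goblet of Fire" and you type "goblet of fire", that input will not be returned as a match. You can, however, use the input option to supply multiple matches: tokenize the input string yourself and pass the pieces to Elasticsearch as additional inputs, as we did in the example above by supplying "Goblet of Fire" as an extra input.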
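
One way to do that manual tokenization is to index every word-aligned suffix of the title as an input. The following is a sketch of that idea, not something the completion suggester does for you; `suffix_inputs` is a hypothetical helper name.

```python
def suffix_inputs(title):
    # Split on whitespace and emit the title starting at each word, so a
    # user can begin typing from any word and still hit a prefix match.
    words = title.split()
    return [" ".join(words[i:]) for i in range(len(words))]


inputs = suffix_inputs("Harry Potter and the Goblet of Fire")
for text in inputs:
    print(text)
```

Each extra input grows the in-memory FST, so in practice you may want to skip suffixes that start with words like "and" or "the", or cap the number of suffixes per title.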

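The completion suggester supports fuzzy queries, which lets us account for typos when searching. You can also specify the prefix text as a regular expression. Both queries in the example below return "Goblet of Fire" as a suggestion: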

GET /movies/_search?filter_path=**.harry_suggest
{
  "_source": "title.input",
  "suggest": {
    "harry_suggest": {
      "prefix": "gobet of f",
      "completion": {
        "field": "title",
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}

GET /movies/_search?filter_path=**.harry_suggest
{
  "_source": "title.input",
  "suggest": {
    "harry_suggest": {
      "regex": "g[aieou]b",
      "completion": {
        "field": "title"
      }
    }
  }
}
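
To see why both queries match, the two checks can be reproduced locally. This is a rough sketch only: it uses a plain Levenshtein distance and Python's `re` module, whereas Elasticsearch applies its own fuzzy logic and Lucene regular expression syntax, so treat the numbers as illustrative.

```python
import re


def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


typed = "gobet of f"
indexed = "goblet of fire"
# Compare the typo against the corresponding prefix of the indexed input.
distance = levenshtein(typed, indexed[: len(typed) + 1])
print(distance)  # one missing "l", well within "fuzziness": 2

# The regex only has to match the beginning of the analyzed input.
regex_hit = re.match(r"g[aieou]b", indexed) is not None
print(regex_hit)
```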

Adding context to the search

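Unlike other queries, the completion suggester does not support adding filters to the query; that is, you cannot filter suggestions by the values of other fields in the document. Suppose we have an index of movies and are building autocomplete on the title field. Say title is mapped as a completion field and there are other fields such as genres, ratings, and production companies. One document has the title "Goblet of Fire" and the genre "action". If we now try to filter the autocomplete suggestions by genre = "romance", we would expect "Goblet of Fire" not to be returned: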

POST /movies/_doc/1001
{
  "genre": "action",
  "title": [
    {
      "input": "Harry Potter and the Goblet of Fire",
      "weight": 5
    },
    {
      "input": "Goblet of Fire",
      "weight": 10
    }
  ]
}

POST /movies/_doc/1002
{
  "genre": "fiction",
  "title": {
    "input": [
      "Harry Potter and the Goblet of Fire",
      "Goblet of Fire"
    ],
    "weight": 2
  }
}

GET /movies/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "genre": "romance"
          }
        }
      ]
    }
  },
  "suggest": {
    "harry_suggest": {
      "prefix": "goblet",
      "completion": {
        "field": "title"
      }
    }
  }
}

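The search above returns the same suggestions as before, as if the filter were not there at all. This is not what we expected: "Goblet of Fire" is suggested even though it belongs to the "action" genre. The main reason for this limitation is the design. As discussed, suggestions live in a separate data structure, the in-memory FST, while the other fields are stored on disk. That design is what makes FST lookups fast, and a query like the one above runs against it.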

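However, Elasticsearch does provide the context suggester, which works around this problem to some extent. To use it, we have to declare the contexts when creating the mapping for the index: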

DELETE movies

PUT /movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion",
        "contexts": [
          {
            "name": "genre",
            "type": "category"
          }
        ]
      }
    }
  }
}

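A completion field can define multiple contexts, each with a unique name. Two context types are supported:

Category => the category of the thing you are indexing, for example the genre of a movie or song.
Geo => the geo point of the document you are indexing, which allows suggestions to be filtered by latitude and longitude.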

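Each of these context types supports some advanced parameters, such as precision and neighbours for geo contexts, and a query-time boost so that documents in a particular category score higher. Note that for a context-enabled completion field, the context parameters are required both when indexing and when querying documents.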

Let's index a few documents into the index we just created:

POST /movies/_doc/2001
{
  "title": {
    "input": "Harry Potter and the Chamber of Secrets",
    "contexts": {
      "genre": "mystery"
    }
  }
}

POST /movies/_doc/2002
{
  "title": {
    "input": "Harry Potter and the Prisoner of Azkaban",
    "contexts": {
      "genre": "crime"
    }
  }
}

Above, we indexed "Harry Potter and the Prisoner of Azkaban" as a "crime" movie and "Harry Potter and the Chamber of Secrets" as a "mystery" movie. Let's try to get suggestions for the prefix "harry":

GET /movies/_search?filter_path=**.harry_suggest
{
  "_source": "title.input",
  "suggest": {
    "harry_suggest": {
      "prefix": "harry",
      "completion": {
        "field": "title",
        "contexts": {
          "genre": "crime"
        }
      }
    }
  }
}

The query above returns:

{
  "suggest": {
    "harry_suggest": [
      {
        "text": "harry",
        "offset": 0,
        "length": 5,
        "options": [
          {
            "text": "Harry Potter and the Prisoner of Azkaban",
            "_index": "movies",
            "_id": "2002",
            "_score": 1,
            "_source": {
              "title": {
                "input": "Harry Potter and the Prisoner of Azkaban"
              }
            },
            "contexts": {
              "genre": [
                "crime"
              ]
            }
          }
        ]
      }
    ]
  }
}

As the response shows, even though the prefix matches both of the documents indexed above, only "Harry Potter and the Prisoner of Azkaban", the "crime" movie, is returned.
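
Conceptually, a category context behaves like an extra key that must match alongside the prefix. The sketch below imitates that behaviour over a plain Python list; it is a mental model only, with hypothetical names (`suggest_with_context`, `entries`), and says nothing about how Elasticsearch encodes contexts internally.

```python
def suggest_with_context(entries, prefix, genre):
    # An entry is suggested only if its input starts with the typed prefix
    # AND the requested genre is among the contexts it was indexed with.
    prefix = prefix.lower()
    return [
        entry["input"]
        for entry in entries
        if entry["input"].lower().startswith(prefix) and genre in entry["genres"]
    ]


entries = [
    {"input": "Harry Potter and the Chamber of Secrets", "genres": ["mystery"]},
    {"input": "Harry Potter and the Prisoner of Azkaban", "genres": ["crime"]},
]
print(suggest_with_context(entries, "harry", "crime"))
print(suggest_with_context(entries, "harry", "romance"))
```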

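That is the third way to implement autocomplete in Elasticsearch. So how does the completion suggester compare with the other approaches we have seen? It is the fastest of the three, because the data being searched is available in memory, but there are a few things to keep in mind if we decide to use it:

Index size needs watching, because the suggestions are stored in memory.
Infix matching, for example matching on a middle name, is not supported.
Advanced filtering of suggestions based on other fields in the document is not supported.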

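To sum up, the following factors should be considered when choosing how to implement autocomplete in Elasticsearch:

Is the data already indexed? In what format? Can we reindex it to make it better suited to autocomplete? If the data is already indexed as a text field and we cannot reindex it, we have to fall back on a query-time approach, namely prefix queries.
In what ways can the field be queried? Does it make sense to store it in more than one way?
Do we need infix matching? Is the order of words in the text fixed, and are users familiar with that order? Completion suggesters do not support infix matching and are best suited to fields whose word order is well known.
What is the maximum size of the text that will be supplied as the value of this field? Will keeping it in memory be a problem? Completion suggesters keep their data in memory, while the n-gram based approaches create additional tokens after basic tokenization to make matching faster.
Do we need a separate index for this field? If none of the three approaches covered here meets your requirements, you will need to create another index. In that index, only the fields needed for autocomplete are stored, as unique documents, rather than alongside the other data in the same index. This minimizes the chance of bloating the nodes and can also serve suggestions faster. It is still a separate index, though: you have to keep the data in the main index and the new index in sync, and managing another index carries its own overhead.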