ES的预置分词器

Elasticsearch（简称 ES）提供了多种预置的分词器（Analyzer），用于对文本进行分词处理。分词器通常由字符过滤器（Character Filters）、分词器（Tokenizer）和词元过滤器（Token Filters）组成。以下是一些常用的预置分词器及其示例：

1. Standard Analyzer（标准分词器）

默认分词器，适用于大多数语言。
处理步骤：
1. 使用标准分词器（Standard Tokenizer）按空格和标点符号分词。
2. 应用小写过滤器（Lowercase Token Filter）将词元转换为小写。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]

2. Simple Analyzer（简单分词器）

按非字母字符（如数字、标点符号）分词，并将词元转换为小写。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]

3. Whitespace Analyzer（空格分词器）

仅按空格分词，不转换大小写，不处理标点符号。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["The", "2", "QUICK", "Brown-Foxes", "jumped", "over", "the", "lazy", "dog's", "bone."]

4. Keyword Analyzer（关键词分词器）

将整个文本作为一个单独的词元，不做任何分词处理。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]

5. Stop Analyzer（停用词分词器）

类似于简单分词器，但会过滤掉常见的停用词（如 "the", "and", "a" 等）。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]

6. Pattern Analyzer（正则分词器）

使用正则表达式定义分词规则。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

默认按非字母字符分词，并转换为小写：

json 复制代码

["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]

7. Language Analyzer（语言分词器）

针对特定语言优化，支持多种语言（如英语、中文、法语等）。

示例（英语） ：

json 复制代码

POST _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["2", "quick", "brown", "fox", "jump", "over", "lazi", "dog", "bone"]

8. ICU Analyzer（国际化分词器）

基于 ICU（International Components for Unicode）库，支持多语言分词。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]

9. Fingerprint Analyzer（指纹分词器）

对文本进行分词、去重、排序，并生成唯一的"指纹"。

示例：

json 复制代码

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

输出：

json 复制代码

["2", "bone", "brown", "dog", "foxes", "jumped", "lazy", "over", "quick", "the"]

总结

Elasticsearch 的预置分词器适用于不同的场景，开发者可以根据需求选择合适的分析器，或者自定义分词器以满足特定需求。