Elasticsearch Analyzers

I. What Is an Analyzer
- 1.1 Basic Concepts
- 1.2 Why Are Analyzers Needed?
II. What Analyzers Do
- 2.1 Main Functions
- 2.2 Usage Scenarios
III. Anatomy of an Analyzer
- 3.1 The Three-Stage Pipeline
  - 1. Character Filters
  - 2. Tokenizer
  - 3. Token Filters
- 3.2 End-to-End Example
IV. Built-in Elasticsearch Analyzers
- 4.1 Common Built-in Analyzers
- 4.2 Comparing Built-in Analyzers
V. Installing and Configuring a Chinese Analyzer
- 5.1 Why a Dedicated Chinese Analyzer?
- 5.2 IK Analyzer (the Most Popular Chinese Analyzer)
VI. Creating Custom Analyzers
- 6.1 Defining a Custom Analyzer in Index Settings
- 6.2 Testing the Custom Analyzer
VII. Analyzers in Practice
- 7.1 Index-Time vs. Search-Time Analyzers
- 7.2 Multi-fields
- 7.3 Specifying an Analyzer at Query Time
VIII. Advanced Analyzer Configuration
- 8.1 N-gram and Edge N-gram Tokenizers
- 8.2 Synonym Filter
- 8.3 Pinyin Analyzer (Plugin Required)
IX. Analyzer Performance Tuning
- 9.1 Analyzer Caching
- 9.2 Avoiding Over-Analysis
- 9.3 Monitoring Analyzer Performance
X. Case Study: Analyzer Configuration for E-commerce Search
- 10.1 Requirements
- 10.2 Complete Configuration Example
- 10.3 Testing the E-commerce Analyzers
XI. Common Problems and Solutions
XII. Choosing an Analyzer

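I. What Is an Analyzer

1.1 Basic Concepts

An analyzer is the core Elasticsearch component for processing text: it converts raw text into a form suitable for search. It works like a text processor that splits the original text into individual terms and normalizes them along the way.

A simple analogy: imagine looking for a book in a library.

- The books are the raw text.
- The librarian is the analyzer, who breaks each book's content into keywords and index labels.
- The index cards are the terms produced by analysis.
- The reader is the search request.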

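1.2 Why Are Analyzers Needed?

// Without analysis
Original text: "中华人民共和国成立于1949年"

// Searching for "共和国" cannot match this document,
// because the index holds only the whole string "中华人民共和国成立于1949年" as a single term

// With analysis
Terms: ["中华", "中华人民", "中华人民共和国", "人民", "共和国", "成立", "1949"]

// Searching for "共和国" (or "人民", "成立") now matches, because the query term appears in the term list.
// A search for "中国" still would not match unless a synonym rule links it to "中华人民共和国".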
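
The difference can be sketched with a toy inverted index. This is only an illustration in plain Python, not how Elasticsearch stores data internally; the helper names `build_inverted_index` and `search` are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search(index, term):
    """Exact term lookup: return the sorted ids of documents holding the term."""
    return sorted(index.get(term, set()))

# Document 1 stored as one unanalyzed string (the "no analysis" case).
unanalyzed = build_inverted_index({1: ["中华人民共和国成立于1949年"]})

# The same document stored with the term list shown above.
analyzed = build_inverted_index(
    {1: ["中华", "中华人民", "中华人民共和国", "人民", "共和国", "成立", "1949"]}
)

print(search(unanalyzed, "共和国"))  # []
print(search(analyzed, "共和国"))    # [1]
print(search(analyzed, "中国"))      # []
```

The last lookup shows that analysis alone does not make "中国" match; that would need a synonym rule or a different term list.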

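II. What Analyzers Do

2.1 Main Functions

| Function | Description | Example |
| --- | --- | --- |
| Text splitting | Splits continuous text into individual terms | "我爱编程" → ["我", "爱", "编程"] |
| Normalization | Unifies text form so that variants match | "Hello" → "hello", "Café" → "cafe" |
| Stop word removal | Removes words that carry little meaning | "a", "an", "the", "的", "了", "在" are removed |
| Synonym expansion | Broadens what a search can match | "计算机" → ["电脑", "Computer"] |
| Stemming | Reduces words to a base form | "running" → "run", "foxes" → "fox" |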

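2.2 Usage Scenarios

Scenario 1: Exact Match vs. Fuzzy Match

// Without analysis
Search: "苹果手机"
Matches only text that is exactly "苹果手机"

// With analysis
Search: "苹果手机"
Can match "苹果iPhone", "苹果智能手机", "苹果最新款手机", because they share the terms "苹果" or "手机"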

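Scenario 2: Multi-Language Support

// English: word boundaries are explicit
"I love programming" → ["i", "love", "programming"]

// Chinese: needs dictionary-based segmentation
"我爱编程" → ["我", "爱", "编程"]

// Japanese: mixes several scripts
"今日は良い天気です" → ["今日", "良い", "天気"]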

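Scenario 3: Search Engine Optimization

Original query: "how to install elasticsearch on centos 7"
Terms: ["how", "to", "install", "elasticsearch", "on", "centos", "7"]

// Variants can still match when the analyzer normalizes them
User query: "installing elasticsearch centos7"
"elasticsearch" matches directly; with a stemmer, "installing" is reduced to "install" and matches as well.
"centos7" stays a single term under the standard tokenizer, so it does not match "centos" or "7" on its own.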

III. Anatomy of an Analyzer

3.1 The Three-Stage Pipeline

Raw text → Character filters → Tokenizer → Token filters → Terms

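1. Character Filters

Purpose: preprocess the raw text before tokenization.
Types:
- HTML stripping: <p>hello</p> → hello
- Character mapping: & → and
- Pattern replacement: replace or delete specific characters with a regular expression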

2. Tokenizer

Purpose: split the text into tokens.
Key decision: where to cut the text.
Examples:
- Whitespace tokenization: "hello world" → ["hello", "world"]
- Chinese tokenization: "我爱北京" → ["我", "爱", "北京"]

3. Token Filters

Purpose: further process the tokens.
Common operations:
- Lowercasing: Hello → hello
- Stop word removal: drop a, an, the, 的, 了
- Synonym expansion: car → [car, automobile]
- Stemming: running → run

3.2 End-to-End Example

Raw text: "<p>I'm running in the park</p>"

Step 1: Character filter (HTML stripping)
→ "I'm running in the park"

Step 2: Tokenizer (standard tokenizer)
→ ["I'm", "running", "in", "the", "park"]

Step 3: Token filters
  3.1 Lowercase: ["i'm", "running", "in", "the", "park"]
  3.2 Stop word removal: ["i'm", "running", "park"]
  3.3 Stemming: ["i'm", "run", "park"]
  3.4 Apostrophe filter (drops the apostrophe and everything after it): ["i", "run", "park"]

Final terms: ["i", "run", "park"]
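
The three stages can be mimicked in a few lines of plain Python. This is a rough sketch under simplifying assumptions, not Elasticsearch's implementation: whitespace splitting stands in for the standard tokenizer, the stop word list is a tiny made-up one, and `toy_stem` handles only a trailing "ing".

```python
import re

STOP_WORDS = {"a", "an", "the", "in", "on", "of"}  # tiny illustrative list

def strip_html(text):
    """Character filter: remove HTML tags."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Tokenizer: split on whitespace (a stand-in for the standard tokenizer)."""
    return text.split()

def toy_stem(token):
    """Toy stemmer: strip a trailing 'ing' and undouble the final consonant."""
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return token

def strip_apostrophe(token):
    """Token filter: drop the apostrophe and everything after it."""
    return token.split("'", 1)[0]

def analyze(text):
    """Run the character filter, the tokenizer, then the token filters in order."""
    tokens = tokenize(strip_html(text))
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [toy_stem(t) for t in tokens]
    tokens = [strip_apostrophe(t) for t in tokens]
    return tokens

print(analyze("<p>I'm running in the park</p>"))  # ['i', 'run', 'park']
```

The order of the token filters matters: lowercasing runs first so that the stop word check and the stemmer see normalized tokens.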

IV. Built-in Elasticsearch Analyzers

4.1 Common Built-in Analyzers

// 1. Standard Analyzer (default)
// Splits on word boundaries, removes punctuation, and lowercases
Example: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
→ ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]

// 2. Simple Analyzer
// Splits at any non-letter character and lowercases
Example: "The 2 QUICK Brown-Foxes."
→ ["the", "quick", "brown", "foxes"]

// 3. Whitespace Analyzer
// Splits on whitespace only and does not lowercase
Example: "The 2 QUICK Brown-Foxes."
→ ["The", "2", "QUICK", "Brown-Foxes."]

// 4. Stop Analyzer
// Like the Simple Analyzer, but also removes stop words
Example: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
→ ["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]

// 5. Keyword Analyzer
// No tokenization; the whole text becomes a single term
Example: "The 2 QUICK Brown-Foxes."
→ ["The 2 QUICK Brown-Foxes."]

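// 6. Pattern Analyzer
// Splits on a regular expression and lowercases
Example: with the default pattern \W+ (non-word characters)
"hello-world_test"
→ ["hello", "world_test"] (the underscore is a word character, so it does not split)

// 7. Language Analyzers
// Tuned for a specific language
Examples: english, french, cjk, and others
"running foxes"
→ ["run", "fox"] (English stemming with the english analyzer)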
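
The splitting rules of three of these analyzers are simple enough to approximate with regular expressions. The functions below are illustrative stand-ins written for this article, not the Lucene implementations, and they ignore edge cases such as very long tokens.

```python
import re

def whitespace_analyze(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()

def simple_analyze(text):
    """Keep runs of letters only, lowercased."""
    return re.findall(r"[^\W\d_]+", text.lower())

def pattern_analyze(text, pattern=r"\W+"):
    """Split on a regex (default: runs of non-word characters), lowercased."""
    return [t for t in re.split(pattern, text.lower()) if t]

sample = "The 2 QUICK Brown-Foxes."
print(whitespace_analyze(sample))           # ['The', '2', 'QUICK', 'Brown-Foxes.']
print(simple_analyze(sample))               # ['the', 'quick', 'brown', 'foxes']
print(pattern_analyze("hello-world_test"))  # ['hello', 'world_test']
```

The last line shows why the underscore survives: `\W` excludes letters, digits and the underscore, so only the hyphen acts as a separator.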

4.2 Comparing Built-in Analyzers

# Compare the output of several analyzers on the same text
curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The quick brown fox jumps over 2 lazy dogs!"
}'

curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The quick brown fox jumps over 2 lazy dogs!"
}'

curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "stop",
  "text": "The quick brown fox jumps over 2 lazy dogs!"
}'

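V. Installing and Configuring a Chinese Analyzer

5.1 Why a Dedicated Chinese Analyzer?

What makes Chinese different:

English: "I love programming"  # words are separated by spaces
Chinese: "我爱编程"  # no explicit delimiter between words

The problem with the built-in standard analyzer:

Standard analyzer: "我爱编程" → ["我", "爱", "编", "程"]  # one token per character, inaccurate
Ideal result: "我爱编程" → ["我", "爱", "编程"]  # whole words recognized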
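
Dictionary-based segmentation is what closes this gap. The sketch below uses forward maximum matching, a classic textbook technique, purely to show the idea; it is not the algorithm IK actually uses, and the two-word dictionary is hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at the current position, else fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

toy_dictionary = {"编程", "北京"}  # hypothetical two-word dictionary

print(forward_max_match("我爱编程", toy_dictionary))  # ['我', '爱', '编程']
print(forward_max_match("我爱编程", set()))           # ['我', '爱', '编', '程']
```

With an empty dictionary the function degrades to one token per character, which mirrors what the standard analyzer does with Chinese text.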

5.2 IK Analyzer (the Most Popular Chinese Analyzer)

Installing the IK Analyzer

# Method 1: install with the plugin manager (recommended)
# Change to the Elasticsearch installation directory
cd /usr/share/elasticsearch

# The IK version must match the Elasticsearch version
# Check the Elasticsearch version
curl -s localhost:9200 | grep number

# Install the IK analyzer (7.17.3 as an example)
sudo bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.17.3/elasticsearch-analysis-ik-7.17.3.zip

# Method 2: manual installation
# Download the IK analyzer
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.17.3/elasticsearch-analysis-ik-7.17.3.zip

# Unzip into the plugins directory
sudo mkdir -p plugins/ik
sudo unzip elasticsearch-analysis-ik-7.17.3.zip -d plugins/ik/

# Restart Elasticsearch
sudo systemctl restart elasticsearch

The Two IK Modes

# Test both IK modes

# 1. ik_max_word (finest-grained segmentation)
curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国成立了"
}'

# Output may include:
# ["中华", "中华人民", "中华人民共和国", "华人", "人民", "人民共和国", "共和国", "共和", "国", "成立", "了"]

# 2. ik_smart (coarsest-grained segmentation)
curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国成立了"
}'

# Output:
# ["中华人民共和国", "成立", "了"]

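Custom Dictionary Configuration

The IK configuration files are located under a path such as the one below. The exact location depends on how the plugin was installed: a plugin-manager install typically places them under the Elasticsearch config directory (for example /etc/elasticsearch/analysis-ik/), while a manual unzip leaves them in plugins/ik/config/ under the Elasticsearch home. Adjust the paths in the following steps accordingly.

/etc/elasticsearch/plugins/ik/config/

Main configuration files:

IKAnalyzer.cfg.xml          # main configuration file
main.dic                    # main dictionary
quantifier.dic              # quantifier dictionary
suffix.dic                  # suffix dictionary
surname.dic                 # surname dictionary
stopword.dic                # stop word dictionary
preposition.dic             # preposition dictionary

Step 1: Configure the IK Analyzer

# Edit the main configuration file
sudo vi /etc/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml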

Configuration file contents:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>

    <!-- Extension dictionaries, separated by semicolons -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>

    <!-- Extension stop word dictionaries -->
    <entry key="ext_stopwords">custom/ext_stopwords.dic</entry>

    <!-- Remote dictionaries (support hot reloading) -->
    <!--
    <entry key="remote_ext_dict">http://192.168.3.128:8080/dict/mydict.dic</entry>
    <entry key="remote_ext_stopwords">http://192.168.3.128:8080/dict/stopwords.dic</entry>
    -->
</properties>

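Step 2: Create a Custom Dictionary

# Create the custom dictionary directory
sudo mkdir -p /etc/elasticsearch/plugins/ik/config/custom

# Create the main custom dictionary file
sudo vi /etc/elasticsearch/plugins/ik/config/custom/mydict.dic

Dictionary file format (one word per line, UTF-8 encoded):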

王者荣耀
英雄联盟
吃鸡
绝地求生
COVID-19
新冠肺炎
区块链
人工智能
机器学习
深度学习
美团外卖
饿了么

Step 3: Create a Custom Stop Word Dictionary

sudo vi /etc/elasticsearch/plugins/ik/config/custom/ext_stopwords.dic

Stop word list:

的
了
在
是
我
有
和
就
不
人
都
一
一个
上
也
很
到
说
要
去
你
会
着
没有
看
自己
这

Step 4: Restart and Test

# Restart Elasticsearch
sudo systemctl restart elasticsearch

# Wait for startup to finish
sleep 10

# Test the custom dictionary
curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "我喜欢玩王者荣耀和吃鸡游戏"
}'

# The output should contain the full game names ("王者荣耀", "吃鸡") as single terms rather than split pieces

VI. Creating Custom Analyzers

6.1 Defining a Custom Analyzer in Index Settings

# Create an index with custom analyzers
curl -X PUT "localhost:9200/my_custom_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or",
            "❤ => love",
            "😂 => laugh"
          ]
        },
        "html_strip_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        },
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+",
          "flags": "CASE_INSENSITIVE"
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": "_english_",
          "ignore_case": true
        },
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "car, auto, automobile",
            "bike, bicycle, cycle",
            "tv, television, telly",
            "手机, 电话, 移动电话",
            "电脑, 计算机, PC"
          ]
        },
        "my_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "my_length_filter": {
          "type": "length",
          "min": 2,
          "max": 20
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip_filter", "my_char_filter"],
          "tokenizer": "ik_max_word",
          "filter": [
            "lowercase",
            "my_stopwords",
            "my_synonyms",
            "my_stemmer",
            "my_length_filter"
          ]
        },
        "my_simple_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["lowercase"]
        },
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer",
        "search_analyzer": "my_simple_analyzer"
      },
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      },
      "keywords": {
        "type": "text",
        "analyzer": "my_ngram_analyzer"
      }
    }
  }
}'

6.2 Testing the Custom Analyzer

# Test the custom analyzer
# (html_strip_filter lists b and i in escaped_tags, so the <b> tags in this text are kept, not stripped)
curl -X GET "localhost:9200/my_custom_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_custom_analyzer",
  "text": "I ❤ programming & <b>人工智能</b>是未来的趋势"
}'

# Test the n-gram analyzer
curl -X GET "localhost:9200/my_custom_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_ngram_analyzer",
  "text": "hello"
}'
# Output: ["he", "hel", "el", "ell", "ll", "llo", "lo"]

VII. Analyzers in Practice

7.1 Index-Time vs. Search-Time Analyzers

# Create an index that uses different analyzers at index time and search time:
# index_analyzer uses the fine-grained ik_max_word tokenizer,
# search_analyzer uses the coarse-grained ik_smart tokenizer
curl -X PUT "localhost:9200/products" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["lowercase", "stop"]
        },
        "search_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      },
      "description": {
        "type": "text",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}'

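Why use different analyzers?

Index time (fine-grained): produce as many terms as possible to improve recall.
Search time (coarse-grained): match the user's intent more precisely to improve precision.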
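
The precision effect can be seen in a toy model. The sketch below assumes OR semantics between query terms and uses the ik_max_word term list from section 5.2 for document 1; document 2 and the `search_any` helper are made up for illustration.

```python
docs = {
    # Fine-grained index terms of "中华人民共和国"
    1: {"中华", "中华人民", "中华人民共和国", "华人", "人民",
        "人民共和国", "共和国", "共和", "国"},
    # Hypothetical unrelated document that happens to contain "华人"
    2: {"海外", "华人"},
}

def search_any(query_terms):
    """OR semantics: a document matches if it shares any term with the query."""
    return sorted(doc_id for doc_id, terms in docs.items()
                  if terms & set(query_terms))

# The query "中华人民共和国" analyzed fine-grained vs. coarse-grained
fine_query = ["中华", "中华人民", "中华人民共和国", "华人", "人民",
              "人民共和国", "共和国", "共和", "国"]
coarse_query = ["中华人民共和国"]

print(search_any(fine_query))    # [1, 2]
print(search_any(coarse_query))  # [1]
```

The fine-grained query drags in document 2 through the incidental term "华人", while the coarse-grained query still finds document 1 because the fine-grained index contains the full term.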

7.2 Multi-fields

# Sub-fields of title:
#   title.keyword  - exact matching
#   title.english  - English analysis
#   title.standard - standard analysis
#   title.pinyin   - pinyin search (requires the pinyin plugin)
curl -X PUT "localhost:9200/articles" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "english": {
            "type": "text",
            "analyzer": "english"
          },
          "standard": {
            "type": "text",
            "analyzer": "standard"
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"
          }
        }
      }
    }
  }
}'

7.3 Specifying an Analyzer at Query Time

# Override the search analyzer for a single query (here: ik_smart)
curl -X GET "localhost:9200/articles/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "title": {
        "query": "中华人民共和国",
        "analyzer": "ik_smart"
      }
    }
  }
}'

# Query several sub-fields at once:
#   title          - analyzed with ik_max_word
#   title.keyword  - exact matching
#   title.pinyin   - pinyin search
curl -X GET "localhost:9200/articles/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "multi_match": {
      "query": "中国",
      "fields": [
        "title",
        "title.keyword",
        "title.pinyin"
      ],
      "type": "most_fields"
    }
  }
}'

VIII. Advanced Analyzer Configuration

8.1 N-gram and Edge N-gram Tokenizers

# Create an index with n-gram and edge n-gram analyzers
curl -X PUT "localhost:9200/ngram_demo" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        },
        "my_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}'

# Test the n-gram analyzer
curl -X GET "localhost:9200/ngram_demo/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_ngram_analyzer",
  "text": "quick"
}'
# Output: ["qu", "qui", "ui", "uic", "ic", "ick", "ck"]

# Test the edge n-gram analyzer (prefix matching)
curl -X GET "localhost:9200/ngram_demo/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_edge_ngram_analyzer",
  "text": "quick"
}'
# Output: ["qu", "qui", "quic", "quick"]
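
How the two tokenizers arrive at these lists can be reproduced in plain Python. The functions below are a minimal sketch of the windowing logic only; they ignore `token_chars` and treat the whole input as one run of characters.

```python
def ngrams(text, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, ordered by start position."""
    out = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                out.append(text[start:start + size])
    return out

def edge_ngrams(text, min_gram, max_gram):
    """Prefixes of length min_gram..max_gram, anchored at the start of the text."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]

print(ngrams("quick", 2, 3))       # ['qu', 'qui', 'ui', 'uic', 'ic', 'ick', 'ck']
print(edge_ngrams("quick", 2, 5))  # ['qu', 'qui', 'quic', 'quick']
```

Edge n-grams grow linearly with the token length, while full n-grams grow with both the length and the width of the gram range, which is why edge n-grams are the usual choice for autocomplete.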

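8.2 Synonym Filter

# Create an analyzer with a synonym filter
# (lenient: true makes Elasticsearch skip synonym rules it cannot parse instead of failing)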
curl -X PUT "localhost:9200/synonym_demo" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "汽车, 轿车, 小汽车, 私家车",
            "手机, 电话, 移动电话, 智能手机",
            "电脑, 计算机, 微机, 个人电脑, PC",
            "好吃, 美味, 可口, 香甜",
            "便宜, 实惠, 廉价, 价廉",
            "京东 => jd",
            "淘宝 => tb",
            "微信 => wechat"
          ],
          "lenient": true
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}'

# Test the synonyms
curl -X GET "localhost:9200/synonym_demo/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_synonym_analyzer",
  "text": "我想买一台便宜的电脑"
}'
# Output may include: ["我", "想", "买", "一台", "便宜", "实惠", "廉价", "价廉", "电脑", "计算机", "微机", "个人电脑", "PC"]
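
The two rule styles in the configuration behave differently: a comma-separated group makes every word equivalent to the others, while `=>` replaces the left side with the right side. A simplified sketch of that behavior follows; it works on single tokens only and the helper names are made up for this example.

```python
EQUIVALENT_GROUPS = [
    ["便宜", "实惠", "廉价", "价廉"],
    ["电脑", "计算机", "微机", "个人电脑", "pc"],
]
EXPLICIT_RULES = {"京东": ["jd"], "淘宝": ["tb"], "微信": ["wechat"]}

def build_synonym_map(groups, explicit):
    """Equivalent groups expand to every member; explicit rules replace."""
    table = {}
    for group in groups:
        for word in group:
            table[word] = list(group)
    table.update(explicit)
    return table

def expand(tokens, table):
    """Replace each token by its synonym list, or keep it unchanged."""
    out = []
    for tok in tokens:
        out.extend(table.get(tok, [tok]))
    return out

table = build_synonym_map(EQUIVALENT_GROUPS, EXPLICIT_RULES)
print(expand(["买", "便宜", "电脑"], table))
# ['买', '便宜', '实惠', '廉价', '价廉', '电脑', '计算机', '微机', '个人电脑', 'pc']
print(expand(["京东"], table))  # ['jd']
```

Note that with the explicit rule the original token "京东" disappears from the output, so a document is findable by that word only if the same rule is applied at search time too.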

8.3 Pinyin Analyzer (Plugin Required)

# Install the pinyin plugin
cd /usr/share/elasticsearch
sudo bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.17.3/elasticsearch-analysis-pinyin-7.17.3.zip

# Restart Elasticsearch
sudo systemctl restart elasticsearch

# Create an index with a pinyin analyzer
curl -X PUT "localhost:9200/pinyin_demo" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}'

# Test the pinyin analyzer
curl -X GET "localhost:9200/pinyin_demo/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华"
}'
# Output may include (order can differ): ["liu", "de", "hua", "刘德华", "ldh"]
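
The kinds of terms in that output (full syllables, the original text, the first-letter abbreviation) can be illustrated with a toy function. The three-entry lookup table and the `pinyin_terms` helper are hypothetical; the real plugin ships a full dictionary and has more options.

```python
# Hypothetical three-entry lookup table, just enough for the example.
TOY_PINYIN = {"刘": "liu", "德": "de", "华": "hua"}

def pinyin_terms(text, keep_original=True, keep_first_letter=True):
    """Emit one syllable per character, then optionally the original text
    and the abbreviation built from each syllable's first letter."""
    syllables = [TOY_PINYIN[ch] for ch in text]
    terms = list(syllables)
    if keep_original:
        terms.append(text)
    if keep_first_letter:
        terms.append("".join(s[0] for s in syllables))
    return terms

print(pinyin_terms("刘德华"))  # ['liu', 'de', 'hua', '刘德华', 'ldh']
```

Indexing the abbreviation is what lets a user type "ldh" and still find "刘德华".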

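IX. Analyzer Performance Tuning

9.1 Analyzer Caching

Elasticsearch has no dedicated, configurable cache for analysis results. A document's text is analyzed when it is indexed, and a query string is analyzed each time the query runs. Repeated searches benefit instead from the general-purpose caches, namely the node query cache (indices.queries.cache.size) and the shard request cache (indices.requests.cache.size), so analysis cost is best controlled by keeping the analyzer chain itself lean.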

9.2 Avoiding Over-Analysis

# Bad practice: stacking every available filter in one analyzer
# (synonyms, stemming, shingles, n-grams, edge n-grams and length filtering all at once)
{
  "analyzer": {
    "over_analyzed": {
      "type": "custom",
      "char_filter": ["html_strip", "mapping"],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop",
        "synonym",
        "stemmer",
        "shingle",
        "ngram",
        "edge_ngram",
        "length"
      ]
    }
  }
}

# Good practice: keep only the filters the use case needs
# (here "length" stands for a custom length filter that keeps tokens of 2 to 20 characters)
{
  "analyzer": {
    "optimized": {
      "type": "custom",
      "tokenizer": "ik_max_word",
      "filter": [
        "lowercase",
        "stop",
        "length"
      ]
    }
  }
}
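
To get a feel for why stacking n-gram filters is costly, here is a rough sketch that counts how many terms a 2-to-3 n-gram filter would emit per token. The sample tokens are made up, and the sketch counts terms only; it does not measure actual index size.

```python
def count_terms_plain(tokens):
    """One term per token when no n-gram filter is applied."""
    return len(tokens)

def count_terms_with_ngrams(tokens, min_gram=2, max_gram=3):
    """Number of n-grams produced for all tokens with the given gram range."""
    total = 0
    for tok in tokens:
        for size in range(min_gram, max_gram + 1):
            total += max(len(tok) - size + 1, 0)
    return total

tokens = ["elasticsearch", "analyzer", "performance"]
print(count_terms_plain(tokens))        # 3
print(count_terms_with_ngrams(tokens))  # 55
```

Three tokens turn into 55 terms here, and every further expanding filter (synonyms, shingles) multiplies on top of that.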

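9.3 Monitoring Analyzer Performance

# Index stats expose indexing and search timings (there is no analysis-specific metric)
curl -X GET "localhost:9200/_stats/indexing,search?pretty"

# The same stats for a specific index
curl -X GET "localhost:9200/my_index/_stats/indexing,search?pretty"

# Use the Profile API to inspect query performance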
curl -X GET "localhost:9200/my_index/_search" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query": {
    "match": {
      "content": "分词器性能测试"
    }
  }
}'

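X. Case Study: Analyzer Configuration for E-commerce Search

10.1 Requirements

Requirements:

- Mixed Chinese and English search
- Pinyin search
- Synonyms (brands, models)
- Typo tolerance
- Category navigation

10.2 Complete Configuration Example

# Create the e-commerce search index (requires the IK and pinyin plugins)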
curl -X PUT "localhost:9200/ecommerce" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "char_filter": {
        "brand_mapping": {
          "type": "mapping",
          "mappings": [
            "iphone => iPhone",
            "ipad => iPad",
            "macbook => MacBook",
            "xiaomi => 小米",
            "huawei => 华为"
          ]
        }
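      },
      "tokenizer": {
        "smart_tokenizer": {
          "type": "pattern",
          "pattern": "[,\\s]+"
        },
        "pinyin_tokenizer": {
          "type": "pinyin",
          "keep_full_pinyin": true,
          "keep_original": true,
          "lowercase": true
        }
      },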
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "chinese_stop": {
          "type": "stop",
          "stopwords": ["的", "了", "在", "是", "和", "有"]
        },
        "brand_synonyms": {
          "type": "synonym",
          "synonyms": [
            "苹果, Apple, iphone, 爱疯",
            "小米, xiaomi, MI, 红米",
            "华为, huawei, 荣耀, honor",
            "三星, samsung, galaxy",
            "联想, lenovo, thinkpad",
            "戴尔, dell",
            "索尼, sony"
          ]
        },
        "product_synonyms": {
          "type": "synonym",
          "synonyms": [
            "手机, 电话, 移动电话, 智能手机",
            "笔记本, 笔记本电脑, 手提电脑, 便携电脑",
            "平板, 平板电脑, 平板机",
            "电视, 电视机, TV, 液晶电视",
            "冰箱, 电冰箱, 冷藏箱",
            "空调, 空调机, 冷气机"
          ]
        },
        "length_filter": {
          "type": "length",
          "min": 2,
          "max": 20
        }
      },
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "char_filter": ["brand_mapping"],
          "tokenizer": "ik_max_word",
          "filter": [
            "lowercase",
            "chinese_stop",
            "english_stop",
            "brand_synonyms",
            "product_synonyms",
            "length_filter"
          ]
        },
        "pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "pinyin_tokenizer",
          "filter": ["lowercase"]
        },
        "keyword_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_name": {
        "type": "text",
        "analyzer": "product_analyzer",
        "search_analyzer": "product_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin_analyzer"
          }
        }
      },
      "category": {
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text",
            "analyzer": "product_analyzer"
          }
        }
      },
      "brand": {
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text",
            "analyzer": "keyword_analyzer"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "product_analyzer"
      },
      "tags": {
        "type": "text",
        "analyzer": "product_analyzer"
      }
    }
  }
}'

10.3 Testing the E-commerce Analyzers

# Test several search scenarios
curl -X GET "localhost:9200/ecommerce/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "product_analyzer",
  "text": "我想买一部苹果iPhone 13 Pro Max手机"
}'

curl -X GET "localhost:9200/ecommerce/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "product_analyzer",
  "text": "huawei mate 40 pro 5g 智能手机"
}'

curl -X GET "localhost:9200/ecommerce/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pinyin_analyzer",
  "text": "华为手机"
}'
# Output may include: ["hua", "wei", "shou", "ji", "华为手机", "hwsj"]

XI. Common Problems and Solutions

Q1: What if the analyzer does not take effect?

# 1. Check that the analyzer is defined correctly
curl -X GET "localhost:9200/my_index/_settings?include_defaults=true"

# 2. Test the analyzer that the field actually uses
curl -X GET "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "content",
  "text": "测试文本"
}'

# 3. Check that the mapping assigns the analyzer to the field
curl -X GET "localhost:9200/my_index/_mapping"

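Q2: How do I update the analyzer configuration?

# Method 1: close the index -> update settings -> reopen the index
# (adds or changes analyzer definitions; documents already indexed are not re-analyzed)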
curl -X POST "localhost:9200/my_index/_close"
curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d'
{
  "analysis": {
    "analyzer": {
      "new_analyzer": {
        "type": "custom",
        "tokenizer": "ik_max_word"
      }
    }
  }
}'
curl -X POST "localhost:9200/my_index/_open"

# Method 2: create a new index -> reindex the data -> switch an alias
curl -X PUT "localhost:9200/my_index_v2" -H 'Content-Type: application/json' -d'
{
  "settings": { ... },
  "mappings": { ... }
}'

curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_index_v2"
  }
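}'

# Switch over: drop the old index and point an alias with its name at the new one
# (remove_index deletes my_index, so verify the reindex result first)
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove_index": { "index": "my_index" } },
    { "add": { "index": "my_index_v2", "alias": "my_index" } }
  ]
}'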

Q3: How do I debug analysis problems?

# Use the Validate API with explain to see how the query is rewritten into terms
curl -X GET "localhost:9200/my_index/_validate/query?explain" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "content": "搜索词"
    }
  }
}'

# View index statistics
curl -X GET "localhost:9200/my_index/_stats?pretty"

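XII. Choosing an Analyzer

| Scenario | Recommended analyzer | Notes |
| --- | --- | --- |
| Chinese search | IK Analyzer | Supports fine-grained and coarse-grained segmentation |
| English search | Standard / English | The built-in analyzers work well for English |
| Mixed Chinese and English | IK + lowercase filter | Use the two together |
| Pinyin search | Pinyin Analyzer | Requires a plugin |
| Prefix search | Edge N-gram | Enables autocomplete |
| Synonym search | Synonym Filter | Broadens what a search can match |
| Category filtering | Keyword Analyzer | Exact matching |

For most Chinese-language applications, the following setup is sufficient:

# 1. Install the IK analyzer
# 2. Use the default IK configuration
# 3. Use ik_max_word at index time and ik_smart at search time
# 4. Add a small custom dictionary as the business requires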