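Analyzers in Elasticsearch Explained

Overview

The analyzer is the core component of full-text search in Elasticsearch. It splits text into a sequence of independent terms and, along the way, handles preprocessing such as lowercasing, special-character filtering, synonym substitution, and stop-word removal. It directly determines both the accuracy and the performance of search.

A complete analyzer consists of three parts, applied in this order:

Character Filter: preprocesses the raw text, e.g. stripping HTML tags or replacing special characters
Tokenizer: splits the text into terms
Token Filter: post-processes the resulting terms, e.g. lowercasing, removing stop words, adding synonyms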
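To make the three stages concrete, here is a minimal Python sketch of the same pipeline. It is only an illustration of the order of operations, not how Elasticsearch implements it; the function names and the three-word stop list are assumptions made up for this example.

```python
import re

def html_strip(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: lowercase every term."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"the", "a", "an"})):
    """Token filter: remove stop words (tiny illustrative list)."""
    return [t for t in tokens if t not in stop_words]

def analyze(text):
    """Run the stages in order: character filter -> tokenizer -> token filters."""
    tokens = whitespace_tokenizer(html_strip(text))
    return stop_filter(lowercase_filter(tokens))

print(analyze("<p>The Quick Brown Fox</p>"))
```

Note that the stop filter runs after lowercasing, which is why "The" is removed even though the stop list only contains lowercase entries.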

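Built-in analyzers

Elasticsearch ships with quite a few built-in analyzers. Here is a quick tour.

standard

This is the default analyzer in Elasticsearch, typically used for general-purpose text such as English. It splits on word boundaries, lowercases the terms, and supports stop-word removal (disabled by default). It is not suitable for Chinese, which it splits into individual characters.

Example:

Chinese

As the response shows, standard emits every Chinese character as its own term.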

# Request
POST /_analyze
{
  "analyzer": "standard",
  "text": "我是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

English

Here standard splits on word boundaries (the spaces, in this input) and emits every word in lowercase.

# Request
POST /_analyze
{
  "analyzer": "standard",
  "text": "I Love You Very Much"
}

# Response
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "love",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "you",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "very",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "much",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

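simple

Suited to simple English text. It splits on any non-letter character and lowercases automatically; digits and special characters are dropped entirely.

Example:

Chinese only

# Request
POST /_analyze
{
  "analyzer": "simple",
  "text": "我是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我是中国人",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}

Chinese + digits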

As the response shows, the text is split at each digit and the digits themselves are discarded.

# Request
POST /_analyze
{
  "analyzer": "simple",
  "text": "我1是2中国3人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "中国",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "人",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 3
    }
  ]
}

English + digits

# Request
POST /_analyze
{
  "analyzer": "simple",
  "text": "Lo1ve Y2ou"
}

# Response
{
  "tokens" : [
    {
      "token" : "lo",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ve",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "y",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "ou",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "word",
      "position" : 3
    }
  ]
}
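The behavior in the three simple examples can be approximated in a few lines of Python. This is a rough sketch, assuming that Python's `str.isalpha()` is a close enough stand-in for the letter test the real tokenizer uses; `simple_analyze` is a name invented for this illustration.

```python
def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lowercase them,
    and treat every non-letter (digits, spaces, punctuation) as a separator."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(simple_analyze("我是中国人"))
print(simple_analyze("我1是2中国3人"))
print(simple_analyze("Lo1ve Y2ou"))
```

Chinese characters count as letters, which is why an unbroken Chinese sentence comes out as one single term.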

keyword

Suited to exact-match fields (phone numbers, ID numbers, enum values). keyword does no splitting at all and emits the whole text as a single term, so a query cannot match on part of the value.

Example:

English

# Request
POST /_analyze
{
  "analyzer": "keyword",
  "text": "I Love You"
}

# Response
{
  "tokens" : [
    {
      "token" : "I Love You",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    }
  ]
}

Chinese

# Request
POST /_analyze
{
  "analyzer": "keyword",
  "text": "我是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我是中国人",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}

whitespace

whitespace is suited to structured text delimited by whitespace. It splits on whitespace only and does no other processing (no lowercasing), which makes it a good fit for text that has already been pre-segmented.

Example:

Chinese

Note the space between 我是 and 中国人 in the input text.

# Request
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "我是 中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我是",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "中国人",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    }
  ]
}

English

# Request
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "I Love You"
}

# Response
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Love",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "You",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 2
    }
  ]
}
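For comparison, keyword and whitespace are simple enough that each reduces to a one-liner. The sketch below is only an approximation in Python; the function names are invented for this illustration.

```python
def keyword_analyze(text):
    """keyword: the whole input becomes a single term, unchanged."""
    return [text]

def whitespace_analyze(text):
    """whitespace: split on runs of whitespace, keep case and punctuation."""
    return text.split()

print(keyword_analyze("I Love You"))
print(whitespace_analyze("I Love You"))
print(whitespace_analyze("我是 中国人"))
```

Neither function lowercases, so a query for "love" would not match a term indexed as "Love" with these analyzers.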

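stop

stop is suited to plain English text. It is based on the Simple Analyzer and additionally removes English stop words (the/a/an, etc.). The stop-word list can be customized, but because the underlying tokenizer does not segment Chinese, it is of little use for Chinese stop words.

Example:

# Omitted: the stop analyzer is rarely used for Chinese-language workloads, so there is not much to show here; try it yourself if interested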
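Since the example is omitted, here is a small Python sketch of what the stop analyzer does conceptually: simple-style tokenization followed by stop-word removal. The stop list is a short illustrative subset chosen for this example (the real default list is longer), and the tokenization only handles ASCII letters for brevity.

```python
import re

# Illustrative subset of English stop words; an assumption for this sketch.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})

def stop_analyze(text, stop_words=STOP_WORDS):
    """Lowercase, keep runs of ASCII letters, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(stop_analyze("The Quick Fox is a Friend of the Dog"))
```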

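pattern

pattern is suited to text with a fixed format. It splits text using a regular expression. Regular expressions are comparatively slow, so avoid it on large text fields.

Example:

# Omitted: the pattern analyzer is rarely used in production, so there is not much to show here; try it yourself if interested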
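As a stand-in for the omitted example, the following Python sketch mimics the idea: split on a regular expression (non-word characters by default) and lowercase the pieces. Python's `re` module and Java's regex engine differ in details, so treat this as an approximation; the sample log line is made up.

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text wherever the pattern matches and drop empty pieces."""
    tokens = [t for t in re.split(pattern, text, flags=re.ASCII) if t]
    return [t.lower() for t in tokens] if lowercase else tokens

# Default pattern: every run of non-word characters is a separator.
print(pattern_analyze("2026-04-20,ERROR,Disk_Full"))
# Custom pattern: split on commas only, as for a CSV-like field.
print(pattern_analyze("2026-04-20,ERROR,Disk_Full", pattern=","))
```

The underscore counts as a word character, so `Disk_Full` stays in one piece under the default pattern.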

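fingerprint

fingerprint is suited to deduplication and clustering. It normalizes the text (lowercasing, deduplicating, and sorting the terms) and concatenates the result into a single token that acts as a fingerprint, which makes it useful for detecting duplicate news articles or documents.

Example:

# Request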
POST /_analyze
{
  "analyzer": "fingerprint",
  "text": "我是中国人，你也是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "中 也 人 你 国 我 是",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
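The response above can be reproduced with a short Python sketch: tokenize, lowercase, deduplicate, sort, and join with a space. The tokenization here is deliberately crude (one token per CJK character, one per run of ASCII letters or digits) and ASCII folding is left out, so this is an approximation rather than the analyzer's actual logic.

```python
import re

def fingerprint(text, separator=" "):
    """Build a fingerprint token: unique lowercase terms, sorted and joined."""
    tokens = re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", text)
    unique_sorted = sorted({t.lower() for t in tokens})
    return separator.join(unique_sorted)

print(fingerprint("我是中国人，你也是中国人"))
print(fingerprint("the quick THE fox"))
```

Two texts that contain the same set of terms in a different order or with repetitions produce the same fingerprint, which is what makes the token usable as a deduplication key.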

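Third-party analyzers commonly used in production

As the examples above show, most built-in analyzers really only handle English and offer very little support for Chinese.

Elasticsearch was built abroad, after all, so out-of-the-box Chinese support is about what you would expect.

IK analyzer (the most widely used Chinese analyzer, first choice for production)

The IK analyzer is dictionary-based. It uses algorithms such as Forward Maximum Matching (FMM) and Backward Maximum Matching (BMM), scanning and matching the text in multiple passes to segment Chinese words accurately. This approach handles word-boundary problems in Chinese text fairly well.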
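To illustrate the forward maximum matching idea mentioned above, here is a toy Python sketch. It is not IK's actual implementation, only a minimal demonstration of the technique; the three-word dictionary and the function name are assumptions for this example.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Scan left to right; at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

toy_dictionary = {"中国人", "中国", "国人"}
print(forward_max_match("我是中国人", toy_dictionary))
```

Because the longest match wins, adding a longer entry such as 我是中国人 to the dictionary changes the result to a single token, which is the effect custom dictionaries rely on.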

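Key features

Two segmentation modes:
- ik_max_word: finest-grained segmentation, emits as many dictionary matches as possible; suited to indexing
- ik_smart: coarsest-grained segmentation, avoids overlapping terms; suited to querying
Custom extension dictionaries and stop-word dictionaries
Hot dictionary updates (no Elasticsearch restart required)

Installing the IK analyzer

Project page: https://github.com/infinilabs/analysis-ik

Online installation

Run on every node in the cluster:

# Note: the plugin version must match the Elasticsearch version of the cluster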
root@master:~# elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/7.17.26
-> Installing https://get.infini.cloud/elasticsearch/analysis-ik/7.17.26
-> Downloading https://get.infini.cloud/elasticsearch/analysis-ik/7.17.26
[=================================================] 100%   
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.net.SocketPermission * connect,resolve
See https://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y # type y here
-> Installed analysis-ik
-> Please restart Elasticsearch to activate any plugins installed
# Fix ownership
root@master:~# chown elasticsearch:elasticsearch -R /data00/software/elasticsearch-7.17.26
# Check the result
root@master:~# ll /data00/software/elasticsearch-7.17.26/plugins/
total 8
# The newly installed IK plugin
drwxr-xr-x 2 elasticsearch elasticsearch 4096 Apr 20 11:26 analysis-ik
drwxr-xr-x 2 elasticsearch elasticsearch 4096 Apr 16 15:45 repository-s3

# Finally, restart Elasticsearch one node at a time (rolling restart) so the service is not affected
root@master:~# systemctl restart elasticsearch.service

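Offline installation

# Download IK and upload the package to every node in the cluster
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.17.26/elasticsearch-analysis-ik-7.17.26.zip

Create the directory

root@master:~# mkdir /data00/software/elasticsearch-7.17.26/plugins/ik

Unpack the archive

root@master:~# unzip elasticsearch-analysis-ik-7.17.26.zip -d /data00/software/elasticsearch-7.17.26/plugins/ik/

Fix ownership

root@master:~# chown -R elasticsearch:elasticsearch /data00/software/elasticsearch-7.17.26/plugins/ik/

Finally, restart all ES nodes one at a time (rolling restart) so the service is not affected

root@master:~# systemctl restart elasticsearch.service

Testing the IK analyzer

ik_smart

Coarsest-grained segmentation that avoids overlapping terms; suited to querying. Compared with ik_max_word, ik_smart usually produces the more natural segmentation and higher precision.

# Request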
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

ik_max_word

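Finest-grained segmentation that emits as many dictionary matches as possible; suited to indexing. Because it exhausts every possible word, recall is high.

Example:

# Request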
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

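Extension: configuring a local IK dictionary

Motivation: when the terms the analyzer produces do not match what we need, we can add our own words.

Steps (run on every node)

# Edit the IK configuration file
root@master:~# vim /data00/software/elasticsearch-7.17.26/config/analysis-ik/IKAnalyzer.cfg.xml
# File contents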
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer 扩展配置</comment>
        <!--用户可以在这里配置自己的扩展字典 -->
        <entry key="ext_dict">ik_diy.dic</entry> <!-- the key change is here -->
         <!--用户可以在这里配置自己的扩展停止词字典-->
        <entry key="ext_stopwords"></entry>
        <!--用户可以在这里配置远程扩展字典 -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!--用户可以在这里配置远程扩展停止词字典-->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

# Edit the dictionary file (one word per line)
root@master:~# vim /data00/software/elasticsearch-7.17.26/config/analysis-ik/ik_diy.dic
我是中国人
你也是中国人
中国
中华人民共和国

# Fix ownership
root@master:~# chown elasticsearch:elasticsearch -R /data00/software/elasticsearch-7.17.26

# Rolling restart of Elasticsearch
root@master:~# systemctl restart elasticsearch.service

Verifying that it works

ik_smart

# Request
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我是中国人",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

ik_max_word

# Request
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}

# Response
{
  "tokens" : [
    {
      "token" : "我是中国人",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

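Extension: configuring a remote IK dictionary (recommended for production)

A remote dictionary is the preferred way to manage dictionaries in production. It can be hot-reloaded without restarting any ES node, which suits workloads where business terms, trending words, or stop words change frequently (e-commerce, content platforms, social products, and so on).

Both extension-word and stop-word remote dictionaries are supported
Automatic change detection: by default the dictionary is checked for changes once a minute
Multiple remote dictionaries can be loaded at the same time
All nodes pick up updates automatically, which keeps segmentation results consistent across the cluster

Server-side requirements for an IK remote dictionary

Any static-resource service that speaks HTTP/HTTPS can act as the remote dictionary server (Nginx, Apache, object storage, or an in-house endpoint), provided that:

Response body: plain text, UTF-8 without BOM, one word or stop word per line, with \n line endings (not Windows \r\n)
Response headers must include both Last-Modified and ETag (the IK plugin uses these two fields to decide whether the dictionary needs to be reloaded)
Response Content-Type: text/plain; charset=utf-8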
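The change check implied by the Last-Modified and ETag requirement can be sketched as a pure function. This is an illustration of the comparison only, assuming a reload is triggered when either header differs from the value seen on the previous poll; `needs_reload` and the dict-based header representation are made up for this example and are not the plugin's code.

```python
def needs_reload(previous, headers):
    """Return True when Last-Modified or ETag differs from the previous poll.

    previous: headers remembered from the last successful poll (dict)
    headers:  headers returned by the current poll (dict)
    """
    for name in ("Last-Modified", "ETag"):
        current = headers.get(name)
        if current is not None and current != previous.get(name):
            return True
    return False

seen = {"Last-Modified": "Mon, 20 Apr 2026 07:47:04 GMT", "ETag": '"69e5d9f8-27"'}
print(needs_reload(seen, dict(seen)))                       # unchanged file
print(needs_reload(seen, {**seen, "ETag": '"69e5e000-30"'}))  # file was edited
```

A server that omits both headers would never trigger a reload under this logic, which is why the requirement above matters.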

Configuration steps

Using Nginx as an example:

Configure nginx

Install nginx

root@master:~# apt install nginx

Start nginx

root@master:~# systemctl start nginx

Create the dictionary directory and file

root@master:~# mkdir -p /data00/data/nginx/es-dict
root@master:~# cat /data00/data/nginx/es-dict/ext_dict.txt
chatGPT
GPT4
文心一言
通义千问
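Because the dictionary must be UTF-8 without a BOM, one word per line, with \n line endings, it can help to normalize the file before publishing it. The following Python sketch shows one way to do that; `normalize_dictionary` is a hypothetical helper for this illustration and not part of IK or Nginx.

```python
import codecs

def normalize_dictionary(raw):
    """Strip a UTF-8 BOM, convert line endings to \\n, and drop blank or
    duplicate lines while keeping the original order."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    seen, words = set(), []
    for line in raw.decode("utf-8").splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return ("\n".join(words) + "\n").encode("utf-8")

sample = codecs.BOM_UTF8 + "chatGPT\r\nGPT4\r\n\r\nGPT4\r\n文心一言\r\n".encode("utf-8")
print(normalize_dictionary(sample).decode("utf-8"))
```

Running the published file through a step like this also guards against a dictionary edited on Windows silently failing to match.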

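Create the nginx configuration file

root@master:~# cat /etc/nginx/conf.d/es-dict.conf
server {
    listen 81;
    server_name es-dict.example.com;
    root /data00/data/nginx/es-dict;
    location / {
        # Note: add_header appends a second Content-Type header (visible in the curl -I output below);
        # the "charset utf-8;" directive is the cleaner way to get text/plain; charset=utf-8
        add_header Content-Type "text/plain; charset=utf-8";
        # Only allow the ES node subnet; access control is recommended in production
        allow 10.37.0.0/16;
        deny all;
    }
}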

Check that the configuration is valid

root@master:~# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Restart nginx

root@master:~# systemctl restart nginx.service

Configure the ES cluster nodes (on every node)

Edit the hosts file

root@master:~# echo '10.37.97.56 es-dict.example.com' >> /etc/hosts

Test that the dictionary is reachable

root@master:~# curl http://es-dict.example.com:81/ext_dict.txt
chatGPT
GPT4
文心一言
通义千问

root@node01:~# curl -I http://es-dict.example.com:81/ext_dict.txt
HTTP/1.1 200 OK
Server: nginx/1.14.2
Date: Mon, 20 Apr 2026 08:05:37 GMT
Content-Type: text/plain
Content-Length: 39
Last-Modified: Mon, 20 Apr 2026 07:47:04 GMT
Connection: keep-alive
ETag: "69e5d9f8-27"
Content-Type: text/plain; charset=utf-8
Accept-Ranges: bytes

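Edit IKAnalyzer.cfg.xml in the ES configuration directory:

root@master:~# cat /data00/software/elasticsearch-7.17.26/config/analysis-ik/IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer 扩展配置</comment>
        <entry key="ext_dict">ik_diy.dic</entry>
        <!-- the key change is here -->
        <entry key="remote_ext_dict">http://es-dict.example.com:81/ext_dict.txt</entry>
</properties>

Rolling restart of the ES nodes

root@master:~# systemctl restart elasticsearch.service

Test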

ik_smart

# Request
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "通义千问"
}

# Response
{
  "tokens" : [
    {
      "token" : "通义千问",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

ik_max_word

# Request
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "通义千问"
}

# Response
{
  "tokens" : [
    {
      "token" : "通义千问",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "通义",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "千",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "TYPE_CNUM",
      "position" : 2
    },
    {
      "token" : "问",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    }
  ]
}

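Pinyin analyzer

The pinyin analyzer converts Chinese characters into pinyin and is the natural companion to the IK analyzer. It is typically used for product search, name search, fuzzy search, and pinyin / first-letter lookup. Its main capabilities:

Converts Chinese characters to pinyin
Supports full pinyin, abbreviated pinyin (first letters), initials, and finals
Supports polyphonic characters
Can be combined with IK (segment first, then convert to pinyin)

Typical uses:

E-commerce search: typing shouji finds 手机
Name search: typing zs finds 张三
Mixed pinyin / Chinese-character search
Suggestions and typo-tolerant search

Installation

Project page: https://github.com/infinilabs/analysis-pinyin

Online installation

Run on every node

# Install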
root@master:~# elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/7.17.26
-> Installing https://get.infini.cloud/elasticsearch/analysis-pinyin/7.17.26
-> Downloading https://get.infini.cloud/elasticsearch/analysis-pinyin/7.17.26
[=================================================] 100%   
-> Installed analysis-pinyin
-> Please restart Elasticsearch to activate any plugins installed
# Fix ownership
root@master:~# chown elasticsearch:elasticsearch -R /data00/software/elasticsearch-7.17.26
root@master:~# ll /data00/software/elasticsearch-7.17.26/plugins/
total 12
drwxr-xr-x 2 elasticsearch elasticsearch 4096 Apr 20 11:26 analysis-ik
drwxr-xr-x 2 elasticsearch elasticsearch 4096 Apr 20 16:15 analysis-pinyin
drwxr-xr-x 2 elasticsearch elasticsearch 4096 Apr 16 15:45 repository-s3

# Rolling restart of all ES nodes
root@master:~# systemctl restart elasticsearch.service

Offline installation

# Download the pinyin plugin and upload the package to every node in the cluster
https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.17.26/elasticsearch-analysis-pinyin-7.17.26.zip

Create the directory

root@master:~# mkdir /data00/software/elasticsearch-7.17.26/plugins/pinyin

Unpack the archive

root@master:~# unzip elasticsearch-analysis-pinyin-7.17.26.zip -d /data00/software/elasticsearch-7.17.26/plugins/pinyin/

Fix ownership

root@master:~# chown -R elasticsearch:elasticsearch /data00/software/elasticsearch-7.17.26/plugins/pinyin/

Finally, restart all ES nodes one at a time (rolling restart) so the service is not affected

root@master:~# systemctl restart elasticsearch.service

Test

# Request
POST /_analyze
{
  "analyzer": "pinyin",
  "text": "我爱中国"
}

# Response
{
  "tokens" : [
    {
      "token" : "wo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wazg",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ai",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "zhong",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "guo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    }
  ]
}
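The shape of this response (one token per syllable plus one first-letter token) can be mimicked with a toy Python sketch. The four-entry lookup table is a hypothetical stand-in for the plugin's real pinyin dictionary, and the function name is invented for this illustration.

```python
# Hypothetical lookup table covering only the characters in the example.
PINYIN = {"我": "wo", "爱": "ai", "中": "zhong", "国": "guo"}

def pinyin_tokens(text):
    """Return (full-pinyin syllables, joined first letters) for known characters."""
    syllables = [PINYIN[ch] for ch in text if ch in PINYIN]
    first_letters = "".join(s[0] for s in syllables)
    return syllables, first_letters

print(pinyin_tokens("我爱中国"))
```

The first-letter token is what lets a query such as wazg match the original text.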

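Extension: dictionaries for the pinyin analyzer

The pinyin analyzer also supports local and remote dictionaries. The configuration steps are the same as for IK; see the sections above.

Extension: specifying the IK analyzer when creating an index

Example:

Suitable for production use
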
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
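        "ik_max_word_custom": {
          "type": "ik_max_word",          // required: ik_max_word / ik_smart
          "use_smart": false,             // whether to use smart segmentation
          "enable_lowercase": true,       // lowercase English text
          "enable_remote_dict": true,     // enable the remote dictionary
          "remote_dict_interval": 60,     // remote dictionary refresh interval (seconds)
          "use_single_word": false,       // emit single characters when no word matches
          "convert_chinese_num": false,   // convert Chinese numerals to Arabic digits
          "use_stop_word": true           // enable stop-word filtering
        },
        "ik_smart_custom": {
          "type": "ik_smart",             // required: ik_max_word / ik_smart
          "use_smart": true,              // whether to use smart segmentation
          "enable_lowercase": true,       // lowercase English text
          "enable_remote_dict": true,     // enable the remote dictionary
          "remote_dict_interval": 60,     // remote dictionary refresh interval (seconds)
          "use_single_word": false,       // emit single characters when no word matches
          "convert_chinese_num": false,   // convert Chinese numerals to Arabic digits
          "use_stop_word": true           // enable stop-word filtering
        }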
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word_custom",
        "search_analyzer": "ik_smart_custom"
      }
    }
  }
}
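The // comments in the request above are for explanation only and are rejected by strict JSON parsers. When the request is sent from a script, it is safer to build the body as a data structure and serialize it. The Python sketch below does that and also checks that every analyzer referenced in the mapping is defined in the settings; it keeps only three of the IK settings for brevity, and the helper names are assumptions made up for this illustration.

```python
import json

def build_ik_index_body(field="content"):
    """Index body: ik_max_word at index time, ik_smart at search time."""
    return {
        "settings": {"analysis": {"analyzer": {
            "ik_max_word_custom": {"type": "ik_max_word", "use_smart": False,
                                   "enable_lowercase": True},
            "ik_smart_custom": {"type": "ik_smart", "use_smart": True,
                                "enable_lowercase": True},
        }}},
        "mappings": {"properties": {field: {
            "type": "text",
            "analyzer": "ik_max_word_custom",
            "search_analyzer": "ik_smart_custom",
        }}},
    }

def undefined_analyzers(body):
    """List (field, key, name) for analyzers used in mappings but not defined."""
    defined = set(body["settings"]["analysis"]["analyzer"])
    missing = []
    for name, spec in body["mappings"]["properties"].items():
        for key in ("analyzer", "search_analyzer"):
            ref = spec.get(key)
            if ref is not None and ref not in defined:
                missing.append((name, key, ref))
    return missing

body = build_ik_index_body()
print(undefined_analyzers(body))
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The serialized string is what would go in the body of `PUT /my_index`.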

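Parameter reference

type: required, one of:
- ik_max_word
- ik_smart
use_smart
- whether to use smart segmentation mode
- true = ik_smart behavior
- false = ik_max_word behavior
- normally just keep it consistent with type
enable_lowercase (the most commonly used in production)
- whether English text is lowercased automatically
- true: Hello → hello
- false: left as-is
- production recommendation: true
use_stop_word
- whether stop-word filtering is enabled
- true: filters words such as "的、了、是、吗"
- production recommendation: true
enable_remote_dict
- whether remote dictionary hot updates are enabled
- true: enabled
remote_dict_interval
- automatic refresh interval for the remote dictionary (seconds)
- default: 60 seconds
convert_chinese_num
- whether Chinese numerals are converted to Arabic digits
- 一百 → 100
- not needed in most scenarios

Of the parameters above, use_smart, enable_lowercase, and enable_remote_dict are settings the IK plugin reads; check the others against the README of the plugin version you install before relying on them.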

Extension: using the pinyin analyzer together with IK

In production the pinyin analyzer is rarely used on its own; it is almost always combined with IK.

Here is an example:

PUT /pinyin_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_pinyin_filter": {
          "type": "pinyin",
          "keep_full_pinyin": true,          // keep per-character full pinyin: 中国 → zhong, guo
          "keep_first_letter": true,        // keep first letters: zg
          "keep_original": true,            // keep the original term: 中国
          "keep_separate_first_letter": false, // emit first letters separately: z, g
          "limit_first_letter_length": 16,   // maximum length of the first-letter token
          "lowercase": true,                 // lowercase everything
          "ignore_pinyin_modifier": true,    // ignore pinyin tone marks
          "remove_duplicated_term": true,    // remove duplicate terms
          "keep_joined_full_pinyin": false,   // joined full pinyin: 中国 → zhongguo
          "keep_none_chinese": true,         // keep non-Chinese characters
          "none_chinese_pinyin_tokenize": true // also split non-Chinese letters into pinyin syllables
        }
      },
      "analyzer": {
        "my_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",        // segment with IK first
          "filter": [
            "my_pinyin_filter"               // then convert to pinyin
          ]
        },
        "my_pinyin_smart_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [
            "my_pinyin_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_pinyin_analyzer",
        "search_analyzer": "my_pinyin_smart_analyzer"
      }
    }
  }
}
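To see how the filter options interact, here is a toy Python sketch of what a pinyin token filter could emit for one term that IK has already produced. It only models four of the options, the lookup table is hypothetical, and the order of the emitted terms is an arbitrary choice of this sketch rather than the plugin's documented behavior.

```python
# Hypothetical lookup table covering only the characters in the example.
PINYIN = {"中": "zhong", "国": "guo"}

def pinyin_filter(token, keep_full_pinyin=True, keep_first_letter=True,
                  keep_joined_full_pinyin=False, keep_original=True):
    """Expand one term into pinyin variants according to the enabled options."""
    syllables = [PINYIN[ch] for ch in token if ch in PINYIN]
    candidates = []
    if keep_full_pinyin:
        candidates.extend(syllables)                      # zhong, guo
    if keep_joined_full_pinyin and syllables:
        candidates.append("".join(syllables))             # zhongguo
    if keep_original:
        candidates.append(token)                          # 中国
    if keep_first_letter and syllables:
        candidates.append("".join(s[0] for s in syllables))  # zg
    # Mimic remove_duplicated_term: drop repeats, keep first occurrence.
    unique = []
    for term in candidates:
        if term not in unique:
            unique.append(term)
    return unique

print(pinyin_filter("中国"))
print(pinyin_filter("中国", keep_joined_full_pinyin=True))
```

With the settings from the example index, a document containing 中国 becomes findable by the original characters, by zhong or guo, and by the abbreviation zg.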
