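Installing the IK analyzer; configuring IK extension dictionaries: the extension dictionary (custom words) and the extension stop-word dictionary (stop words)
This article uses ElasticSearch 7.17.9, chosen to match spring-boot-starter-parent 2.7.9.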
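Installing the IK Analyzer
Official resource: the IK Analyzer GitHub page
IK analyzer download: https://release.infinilabs.com/analysis-ik/stable/. Download the IK analyzer version that matches your ES version.
Then extract the downloaded zip file. Do not extract it directly into the current folder: the archive has no top-level folder, so all of its files would land there loose. Choose "extract to XXX folder" instead.
Place the extracted folder under the plugins folder of the ElasticSearch directory and restart ES. The analyzer is then ready to use (the steps are the same on Windows and Linux).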
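The extraction step above can also be scripted. The following is a minimal sketch, not an official installer: the function name `install_plugin_zip`, the subfolder name `ik`, and the paths in the usage note are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def install_plugin_zip(zip_path: Path, es_home: Path, plugin_name: str = "ik") -> Path:
    """Extract a plugin zip into <es_home>/plugins/<plugin_name>/ and return that folder."""
    target = es_home / "plugins" / plugin_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # The archive has no top-level folder, so extract into a named
        # subfolder instead of directly into plugins/.
        archive.extractall(target)
    return target
```

A call such as `install_plugin_zip(Path("elasticsearch-analysis-ik-7.17.9.zip"), Path("/opt/elasticsearch-7.17.9"))` (both paths hypothetical) would leave the plugin files under `plugins/ik`. ES still has to be restarted afterwards.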
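Configuring IK Extension Dictionaries
The IK analyzer is not bundled with ElasticSearch; you have to install it yourself, normally into ElasticSearch's plugins folder. To extend the IK analyzer's vocabulary, you only need to edit the IKAnalyzer.cfg.xml file in the config directory under the IK analyzer's directory.
The default configuration file is as follows: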
xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
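<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionary here -->
<entry key="ext_dict"></entry>
<!-- Users can configure their own extension stop-word dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- Users can configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Users can configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->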
</properties>
To extend the vocabulary, specify the names of your own dictionary files, for example:
xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
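<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionary here -->
<entry key="ext_dict">dict.dic</entry>
<!-- Users can configure their own extension stop-word dictionary here -->
<entry key="ext_stopwords">stopwords.dic</entry>
<!-- Users can configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Users can configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->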
</properties>
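After changing the configuration, create the two files it names, in the same directory as IKAnalyzer.cfg.xml.
Once everything is configured, restart ElasticSearch for the changes to take effect.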
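Before restarting, it can help to check that every dictionary file named in the configuration actually exists next to it. The sketch below is one way to do that with the Python standard library. The function names are hypothetical, and splitting an entry on `;` to allow several file names is an assumption about how IK reads the value.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

LOCAL_KEYS = ("ext_dict", "ext_stopwords")


def referenced_dictionaries(cfg_path: Path) -> dict[str, list[Path]]:
    """Map each local dictionary key to the files it names, resolved next to the config."""
    root = ET.parse(cfg_path).getroot()
    result: dict[str, list[Path]] = {}
    for entry in root.findall("entry"):
        key = entry.get("key")
        if key not in LOCAL_KEYS:
            continue  # remote_* entries point to URLs, not local files
        # Assumption: several file names may be separated by ';'.
        names = [n.strip() for n in (entry.text or "").split(";") if n.strip()]
        result[key] = [cfg_path.parent / n for n in names]
    return result


def missing_dictionaries(cfg_path: Path) -> list[Path]:
    """Return the referenced dictionary files that do not exist yet."""
    found = referenced_dictionaries(cfg_path)
    return [p for paths in found.values() for p in paths if not p.is_file()]
```

An empty list from `missing_dictionaries` means every file named under `ext_dict` and `ext_stopwords` is in place. Commented-out entries are ignored because the XML parser skips comments.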
Configuring the Extension Dictionary: Custom Words
Create the dict.dic file with the following content, one word per line:
bash
搬砖
画大饼
已读不回
舔狗
摆烂
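IK dictionary files are plain text with one word per line and should be saved as UTF-8. When the word list comes from another source, a small helper can write the file in that shape. This is a sketch only: `write_dictionary` is a hypothetical name, and dropping duplicates and blank entries is a convenience chosen here, not something IK requires.

```python
from pathlib import Path


def write_dictionary(path: Path, words: list[str]) -> list[str]:
    """Write one word per line as UTF-8 without a BOM, dropping blanks and duplicates."""
    # dict.fromkeys keeps the first occurrence of each word and preserves order.
    cleaned = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

For example, `write_dictionary(Path("dict.dic"), ["搬砖", "画大饼", "已读不回", "舔狗", "摆烂"])` produces the file shown above.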
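Configuring the Stop-Word Dictionary: Stop Words
Add your stop words to the existing stopword.dic file. Once configured, these words are dropped from the analysis output. The test below uses the sensitive word 海洛因 as a stop word.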
Testing
Before Configuring the Dictionaries
Use Dev Tools, the debugging console in Kibana (ElasticSearch's visualization UI), to call the analyze API:
bash
# `IK Analyzer` extension dictionary.
POST /_analyze
{
"analyzer": "ik_smart",
"text": "小明是一个Java程序员,白天摸鱼学习ElasticSearch,晚上加班到九点,搬砖完之后,去找他的女神画大饼,女神却已读不回,小明认清自己是个舔狗,直接摆烂,吸食海洛因"
}
Analysis result:
bash
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{
"tokens" : [
{
"token" : "小明",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "一个",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "java",
"start_offset" : 5,
"end_offset" : 9,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "程序员",
"start_offset" : 9,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "白天",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "摸鱼",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "学习",
"start_offset" : 17,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "elasticsearch",
"start_offset" : 19,
"end_offset" : 32,
"type" : "ENGLISH",
"position" : 8
},
{
"token" : "晚上",
"start_offset" : 33,
"end_offset" : 35,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "加班",
"start_offset" : 35,
"end_offset" : 37,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "到",
"start_offset" : 37,
"end_offset" : 38,
"type" : "CN_CHAR",
"position" : 11
},
{
"token" : "九点",
"start_offset" : 38,
"end_offset" : 40,
"type" : "TYPE_CQUAN",
"position" : 12
},
{
"token" : "搬",
"start_offset" : 41,
"end_offset" : 42,
"type" : "CN_CHAR",
"position" : 13
},
{
"token" : "砖",
"start_offset" : 42,
"end_offset" : 43,
"type" : "CN_CHAR",
"position" : 14
},
{
"token" : "完",
"start_offset" : 43,
"end_offset" : 44,
"type" : "CN_CHAR",
"position" : 15
},
{
"token" : "之后",
"start_offset" : 44,
"end_offset" : 46,
"type" : "CN_WORD",
"position" : 16
},
{
"token" : "去",
"start_offset" : 47,
"end_offset" : 48,
"type" : "CN_CHAR",
"position" : 17
},
{
"token" : "找他",
"start_offset" : 48,
"end_offset" : 50,
"type" : "CN_WORD",
"position" : 18
},
{
"token" : "的",
"start_offset" : 50,
"end_offset" : 51,
"type" : "CN_CHAR",
"position" : 19
},
{
"token" : "女神",
"start_offset" : 51,
"end_offset" : 53,
"type" : "CN_WORD",
"position" : 20
},
{
"token" : "画",
"start_offset" : 53,
"end_offset" : 54,
"type" : "CN_CHAR",
"position" : 21
},
{
"token" : "大饼",
"start_offset" : 54,
"end_offset" : 56,
"type" : "CN_WORD",
"position" : 22
},
{
"token" : "女神",
"start_offset" : 57,
"end_offset" : 59,
"type" : "CN_WORD",
"position" : 23
},
{
"token" : "却已",
"start_offset" : 59,
"end_offset" : 61,
"type" : "CN_WORD",
"position" : 24
},
{
"token" : "读",
"start_offset" : 61,
"end_offset" : 62,
"type" : "CN_CHAR",
"position" : 25
},
{
"token" : "不回",
"start_offset" : 62,
"end_offset" : 64,
"type" : "CN_WORD",
"position" : 26
},
{
"token" : "小明",
"start_offset" : 65,
"end_offset" : 67,
"type" : "CN_WORD",
"position" : 27
},
{
"token" : "认清",
"start_offset" : 67,
"end_offset" : 69,
"type" : "CN_WORD",
"position" : 28
},
{
"token" : "自己",
"start_offset" : 69,
"end_offset" : 71,
"type" : "CN_WORD",
"position" : 29
},
{
"token" : "是",
"start_offset" : 71,
"end_offset" : 72,
"type" : "CN_CHAR",
"position" : 30
},
{
"token" : "个",
"start_offset" : 72,
"end_offset" : 73,
"type" : "CN_CHAR",
"position" : 31
},
{
"token" : "舔",
"start_offset" : 73,
"end_offset" : 74,
"type" : "CN_CHAR",
"position" : 32
},
{
"token" : "狗",
"start_offset" : 74,
"end_offset" : 75,
"type" : "CN_CHAR",
"position" : 33
},
{
"token" : "直接",
"start_offset" : 76,
"end_offset" : 78,
"type" : "CN_WORD",
"position" : 34
},
{
"token" : "摆",
"start_offset" : 78,
"end_offset" : 79,
"type" : "CN_CHAR",
"position" : 35
},
{
"token" : "烂",
"start_offset" : 79,
"end_offset" : 80,
"type" : "CN_CHAR",
"position" : 36
},
{
"token" : "吸食",
"start_offset" : 81,
"end_offset" : 83,
"type" : "CN_WORD",
"position" : 37
},
{
"token" : "海洛因",
"start_offset" : 83,
"end_offset" : 86,
"type" : "CN_WORD",
"position" : 38
}
]
}
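Outside Kibana, the same `POST /_analyze` request can be sent over plain HTTP. The sketch below uses only the Python standard library. It assumes ES is reachable at `http://localhost:9200` with security disabled (as the warning in the output above suggests); the function names are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(text: str, analyzer: str = "ik_smart",
                          base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the POST /_analyze request without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, **kwargs) -> list[str]:
    """Send the request and return only the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]
```

Calling `analyze("搬砖完之后")` against a running node returns the token strings in order, which is easier to scan than the full JSON response.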
After Configuring the Dictionaries
Use Dev Tools, the debugging console in Kibana (ElasticSearch's visualization UI), to call the analyze API again:
bash
# `IK Analyzer` extension dictionary.
POST /_analyze
{
"analyzer": "ik_smart",
"text": "小明是一个Java程序员,白天摸鱼学习ElasticSearch,晚上加班到九点,搬砖完之后,去找他的女神画大饼,女神却已读不回,小明认清自己是个舔狗,直接摆烂,吸食海洛因"
}
Analysis result:
bash
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{
"tokens" : [
{
"token" : "小明",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "一个",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "java",
"start_offset" : 5,
"end_offset" : 9,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "程序员",
"start_offset" : 9,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "白天",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "摸鱼",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "学习",
"start_offset" : 17,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "elasticsearch",
"start_offset" : 19,
"end_offset" : 32,
"type" : "ENGLISH",
"position" : 8
},
{
"token" : "晚上",
"start_offset" : 33,
"end_offset" : 35,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "加班",
"start_offset" : 35,
"end_offset" : 37,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "到",
"start_offset" : 37,
"end_offset" : 38,
"type" : "CN_CHAR",
"position" : 11
},
{
"token" : "九点",
"start_offset" : 38,
"end_offset" : 40,
"type" : "TYPE_CQUAN",
"position" : 12
},
{
"token" : "搬砖",
"start_offset" : 41,
"end_offset" : 43,
"type" : "CN_WORD",
"position" : 13
},
{
"token" : "完",
"start_offset" : 43,
"end_offset" : 44,
"type" : "CN_CHAR",
"position" : 14
},
{
"token" : "之后",
"start_offset" : 44,
"end_offset" : 46,
"type" : "CN_WORD",
"position" : 15
},
{
"token" : "去",
"start_offset" : 47,
"end_offset" : 48,
"type" : "CN_CHAR",
"position" : 16
},
{
"token" : "找他",
"start_offset" : 48,
"end_offset" : 50,
"type" : "CN_WORD",
"position" : 17
},
{
"token" : "女神",
"start_offset" : 51,
"end_offset" : 53,
"type" : "CN_WORD",
"position" : 18
},
{
"token" : "画大饼",
"start_offset" : 53,
"end_offset" : 56,
"type" : "CN_WORD",
"position" : 19
},
{
"token" : "女神",
"start_offset" : 57,
"end_offset" : 59,
"type" : "CN_WORD",
"position" : 20
},
{
"token" : "却",
"start_offset" : 59,
"end_offset" : 60,
"type" : "CN_CHAR",
"position" : 21
},
{
"token" : "已读不回",
"start_offset" : 60,
"end_offset" : 64,
"type" : "CN_WORD",
"position" : 22
},
{
"token" : "小明",
"start_offset" : 65,
"end_offset" : 67,
"type" : "CN_WORD",
"position" : 23
},
{
"token" : "认清",
"start_offset" : 67,
"end_offset" : 69,
"type" : "CN_WORD",
"position" : 24
},
{
"token" : "自己",
"start_offset" : 69,
"end_offset" : 71,
"type" : "CN_WORD",
"position" : 25
},
{
"token" : "是",
"start_offset" : 71,
"end_offset" : 72,
"type" : "CN_CHAR",
"position" : 26
},
{
"token" : "个",
"start_offset" : 72,
"end_offset" : 73,
"type" : "CN_CHAR",
"position" : 27
},
{
"token" : "舔狗",
"start_offset" : 73,
"end_offset" : 75,
"type" : "CN_WORD",
"position" : 28
},
{
"token" : "直接",
"start_offset" : 76,
"end_offset" : 78,
"type" : "CN_WORD",
"position" : 29
},
{
"token" : "摆烂",
"start_offset" : 78,
"end_offset" : 80,
"type" : "CN_WORD",
"position" : 30
},
{
"token" : "吸食",
"start_offset" : 81,
"end_offset" : 83,
"type" : "CN_WORD",
"position" : 31
}
]
}
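As the output shows, the internet slang terms "搬砖", "画大饼", "已读不回", "舔狗", and "摆烂" are now each recognized as a single token, and the sensitive word "海洛因" no longer appears in the result.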
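Comparing the two responses by eye is tedious at this length. A small helper can list which tokens appeared and which disappeared between the two runs. This is an illustrative sketch with hypothetical function names. The sample data is a shortened excerpt of the tokens shown above.

```python
def tokens_of(response: dict) -> list[str]:
    """Extract the token strings from an _analyze response."""
    return [t["token"] for t in response["tokens"]]


def compare(before: dict, after: dict) -> tuple[list[str], list[str]]:
    """Return (tokens only in 'after', tokens only in 'before'), in order of appearance."""
    before_tokens, after_tokens = tokens_of(before), tokens_of(after)
    gained = [t for t in dict.fromkeys(after_tokens) if t not in before_tokens]
    lost = [t for t in dict.fromkeys(before_tokens) if t not in after_tokens]
    return gained, lost


# Shortened excerpt of the two responses above.
before = {"tokens": [{"token": t} for t in ["搬", "砖", "完", "舔", "狗", "吸食", "海洛因"]]}
after = {"tokens": [{"token": t} for t in ["搬砖", "完", "舔狗", "吸食"]]}

gained, lost = compare(before, after)
print(gained)  # ['搬砖', '舔狗']
print(lost)    # ['搬', '砖', '舔', '狗', '海洛因']
```

Tokens in `gained` come from the extension dictionary. Tokens in `lost` are either single characters that were merged into a longer word or stop words that were filtered out.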