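Installing the IK analyzer; configuring the IK extension dictionaries: the extension dictionary (extension words) and the extension stop-word dictionary (stop words)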

This article uses Elasticsearch 7.17.9, chosen to match spring-boot-starter-parent 2.7.9.

Installing the IK analyzer

Official resource: the IK Analyzer GitHub page

IK analyzer download page: https://release.infinilabs.com/analysis-ik/stable/ (download the IK analyzer version that matches your ES version)

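Then unzip the downloaded zip file. Do not extract it directly into the current folder: the archive has no top-level directory, so all of its files would land there loose. Choose "extract to XXX folder" instead.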

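Put the extracted folder under the plugins folder of the Elasticsearch directory and restart ES; the analyzer is then ready to use (the procedure is the same on Windows and Linux).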
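The unzip-and-copy step can also be scripted. The following is a minimal sketch, assuming the downloaded archive has no top-level folder (as described above). The function name `install_ik_plugin` and the target folder name `ik` are illustrative choices made for this sketch, not something taken from the plugin's documentation.

```python
import zipfile
from pathlib import Path


def install_ik_plugin(zip_path, es_home, folder_name="ik"):
    """Extract the plugin archive into <es_home>/plugins/<folder_name>.

    folder_name is a hypothetical name chosen for this sketch; the point is
    that the files go into their own sub-folder rather than loose in plugins/.
    """
    target = Path(es_home) / "plugins" / folder_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

Elasticsearch still has to be restarted after the files are in place.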

Configuring the IK extension dictionaries

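The IK analyzer is not bundled with Elasticsearch; you have to install it yourself. It is normally installed in the plugins folder of Elasticsearch. To extend the IK dictionaries, you only need to edit the IKAnalyzer.cfg.xml file in the config directory under the IK analyzer directory: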

The default configuration file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionary here -->
	<entry key="ext_dict"></entry>
	<!-- Configure your own extension stop-word dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

To extend the dictionaries, specify your own dictionary file names, for example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionary here -->
	<entry key="ext_dict">dict.dic</entry>
	<!-- Configure your own extension stop-word dictionary here -->
	<entry key="ext_stopwords">stopwords.dic</entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

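After changing the configuration, create the two files it references; they belong in the same directory as IKAnalyzer.cfg.xml.

Once everything is configured, restart Elasticsearch for the changes to take effect.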
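As a rough illustration of the two steps above (set the entries, then create the files next to the configuration), here is a minimal sketch. The function name `write_ik_config` is hypothetical, and the sketch overwrites IKAnalyzer.cfg.xml with a reduced file that omits the commented-out remote entries, so treat it as a starting point rather than a drop-in tool.

```python
from pathlib import Path
from xml.sax.saxutils import escape

# Reduced version of the configuration file shown above (assumption: the
# commented-out remote_* entries are not needed and can be left out).
TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">{ext_dict}</entry>
    <entry key="ext_stopwords">{ext_stopwords}</entry>
</properties>
"""


def write_ik_config(config_dir, ext_dict="dict.dic", ext_stopwords="stopwords.dic"):
    """Write IKAnalyzer.cfg.xml and create the referenced files if missing."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    xml_text = TEMPLATE.format(
        ext_dict=escape(ext_dict), ext_stopwords=escape(ext_stopwords)
    )
    (config_dir / "IKAnalyzer.cfg.xml").write_text(xml_text, encoding="utf-8")
    for name in (ext_dict, ext_stopwords):
        # The dictionary files sit next to IKAnalyzer.cfg.xml.
        (config_dir / name).touch(exist_ok=True)
    return xml_text
```

Calling `write_ik_config("/path/to/plugins/ik/config")` (a placeholder path) would produce the configuration plus two empty dictionary files ready to be filled in.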

Configuring the extension dictionary: extension words

Create a dict.dic file with the following contents, one word per line:

搬砖
画大饼
已读不回
舔狗
摆烂
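If the word list is maintained elsewhere, the file can be generated. This is a minimal sketch, assuming the plugin reads plain UTF-8 text with one term per line (which matches the file shown above); `write_dictionary` is a hypothetical helper name.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write one term per line as UTF-8, skipping blanks and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


if __name__ == "__main__":
    # Same five terms as the dict.dic example above.
    write_dictionary("dict.dic", ["搬砖", "画大饼", "已读不回", "舔狗", "摆烂"])
```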

Configuring the stop-word dictionary: stop words

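Add the stop words to the existing stopword.dic file (make sure the file name matches the one set under ext_stopwords in IKAnalyzer.cfg.xml). Once configured, these words no longer appear in the analysis output.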

Testing

Before configuring the dictionaries

Call the analyze API from Dev Tools, the debugging console in Kibana (the Elasticsearch web UI):

# IK Analyzer extension dictionary test.
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "小明是一个Java程序员,白天摸鱼学习ElasticSearch,晚上加班到九点,搬砖完之后,去找他的女神画大饼,女神却已读不回,小明认清自己是个舔狗,直接摆烂,吸食海洛因"
}
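The same request can be sent without Kibana. Below is a minimal sketch using only the Python standard library; it assumes Elasticsearch is reachable at http://localhost:9200 with security disabled, and the helper names `build_analyze_request` and `analyze` are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ik_smart", base_url="http://localhost:9200"):
    """Build (but do not send) a POST /_analyze request.

    base_url is an assumption; adjust it to your own cluster address.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, **kwargs):
    """Send the request and return the list of token strings."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        payload = json.load(resp)
    return [item["token"] for item in payload["tokens"]]
```

With a running node, `analyze("搬砖完之后")` returns the token strings from the `tokens` array of the response.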

Analysis result:

#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{
  "tokens" : [
    {
      "token" : "小明",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "一个",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "java",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "程序员",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "白天",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "摸鱼",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "学习",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 19,
      "end_offset" : 32,
      "type" : "ENGLISH",
      "position" : 8
    },
    {
      "token" : "晚上",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "加班",
      "start_offset" : 35,
      "end_offset" : 37,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "到",
      "start_offset" : 37,
      "end_offset" : 38,
      "type" : "CN_CHAR",
      "position" : 11
    },
    {
      "token" : "九点",
      "start_offset" : 38,
      "end_offset" : 40,
      "type" : "TYPE_CQUAN",
      "position" : 12
    },
    {
      "token" : "搬",
      "start_offset" : 41,
      "end_offset" : 42,
      "type" : "CN_CHAR",
      "position" : 13
    },
    {
      "token" : "砖",
      "start_offset" : 42,
      "end_offset" : 43,
      "type" : "CN_CHAR",
      "position" : 14
    },
    {
      "token" : "完",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "CN_CHAR",
      "position" : 15
    },
    {
      "token" : "之后",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "CN_WORD",
      "position" : 16
    },
    {
      "token" : "去",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "CN_CHAR",
      "position" : 17
    },
    {
      "token" : "找他",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "CN_WORD",
      "position" : 18
    },
    {
      "token" : "的",
      "start_offset" : 50,
      "end_offset" : 51,
      "type" : "CN_CHAR",
      "position" : 19
    },
    {
      "token" : "女神",
      "start_offset" : 51,
      "end_offset" : 53,
      "type" : "CN_WORD",
      "position" : 20
    },
    {
      "token" : "画",
      "start_offset" : 53,
      "end_offset" : 54,
      "type" : "CN_CHAR",
      "position" : 21
    },
    {
      "token" : "大饼",
      "start_offset" : 54,
      "end_offset" : 56,
      "type" : "CN_WORD",
      "position" : 22
    },
    {
      "token" : "女神",
      "start_offset" : 57,
      "end_offset" : 59,
      "type" : "CN_WORD",
      "position" : 23
    },
    {
      "token" : "却已",
      "start_offset" : 59,
      "end_offset" : 61,
      "type" : "CN_WORD",
      "position" : 24
    },
    {
      "token" : "读",
      "start_offset" : 61,
      "end_offset" : 62,
      "type" : "CN_CHAR",
      "position" : 25
    },
    {
      "token" : "不回",
      "start_offset" : 62,
      "end_offset" : 64,
      "type" : "CN_WORD",
      "position" : 26
    },
    {
      "token" : "小明",
      "start_offset" : 65,
      "end_offset" : 67,
      "type" : "CN_WORD",
      "position" : 27
    },
    {
      "token" : "认清",
      "start_offset" : 67,
      "end_offset" : 69,
      "type" : "CN_WORD",
      "position" : 28
    },
    {
      "token" : "自己",
      "start_offset" : 69,
      "end_offset" : 71,
      "type" : "CN_WORD",
      "position" : 29
    },
    {
      "token" : "是",
      "start_offset" : 71,
      "end_offset" : 72,
      "type" : "CN_CHAR",
      "position" : 30
    },
    {
      "token" : "个",
      "start_offset" : 72,
      "end_offset" : 73,
      "type" : "CN_CHAR",
      "position" : 31
    },
    {
      "token" : "舔",
      "start_offset" : 73,
      "end_offset" : 74,
      "type" : "CN_CHAR",
      "position" : 32
    },
    {
      "token" : "狗",
      "start_offset" : 74,
      "end_offset" : 75,
      "type" : "CN_CHAR",
      "position" : 33
    },
    {
      "token" : "直接",
      "start_offset" : 76,
      "end_offset" : 78,
      "type" : "CN_WORD",
      "position" : 34
    },
    {
      "token" : "摆",
      "start_offset" : 78,
      "end_offset" : 79,
      "type" : "CN_CHAR",
      "position" : 35
    },
    {
      "token" : "烂",
      "start_offset" : 79,
      "end_offset" : 80,
      "type" : "CN_CHAR",
      "position" : 36
    },
    {
      "token" : "吸食",
      "start_offset" : 81,
      "end_offset" : 83,
      "type" : "CN_WORD",
      "position" : 37
    },
    {
      "token" : "海洛因",
      "start_offset" : 83,
      "end_offset" : 86,
      "type" : "CN_WORD",
      "position" : 38
    }
  ]
}

After configuring the dictionaries

Call the analyze API again from Kibana Dev Tools with the same request:

# IK Analyzer extension dictionary test.
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "小明是一个Java程序员,白天摸鱼学习ElasticSearch,晚上加班到九点,搬砖完之后,去找他的女神画大饼,女神却已读不回,小明认清自己是个舔狗,直接摆烂,吸食海洛因"
}

Analysis result:

#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{
  "tokens" : [
    {
      "token" : "小明",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "一个",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "java",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "程序员",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "白天",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "摸鱼",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "学习",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 19,
      "end_offset" : 32,
      "type" : "ENGLISH",
      "position" : 8
    },
    {
      "token" : "晚上",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "加班",
      "start_offset" : 35,
      "end_offset" : 37,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "到",
      "start_offset" : 37,
      "end_offset" : 38,
      "type" : "CN_CHAR",
      "position" : 11
    },
    {
      "token" : "九点",
      "start_offset" : 38,
      "end_offset" : 40,
      "type" : "TYPE_CQUAN",
      "position" : 12
    },
    {
      "token" : "搬砖",
      "start_offset" : 41,
      "end_offset" : 43,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "完",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "CN_CHAR",
      "position" : 14
    },
    {
      "token" : "之后",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "去",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "CN_CHAR",
      "position" : 16
    },
    {
      "token" : "找他",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "CN_WORD",
      "position" : 17
    },
    {
      "token" : "女神",
      "start_offset" : 51,
      "end_offset" : 53,
      "type" : "CN_WORD",
      "position" : 18
    },
    {
      "token" : "画大饼",
      "start_offset" : 53,
      "end_offset" : 56,
      "type" : "CN_WORD",
      "position" : 19
    },
    {
      "token" : "女神",
      "start_offset" : 57,
      "end_offset" : 59,
      "type" : "CN_WORD",
      "position" : 20
    },
    {
      "token" : "却",
      "start_offset" : 59,
      "end_offset" : 60,
      "type" : "CN_CHAR",
      "position" : 21
    },
    {
      "token" : "已读不回",
      "start_offset" : 60,
      "end_offset" : 64,
      "type" : "CN_WORD",
      "position" : 22
    },
    {
      "token" : "小明",
      "start_offset" : 65,
      "end_offset" : 67,
      "type" : "CN_WORD",
      "position" : 23
    },
    {
      "token" : "认清",
      "start_offset" : 67,
      "end_offset" : 69,
      "type" : "CN_WORD",
      "position" : 24
    },
    {
      "token" : "自己",
      "start_offset" : 69,
      "end_offset" : 71,
      "type" : "CN_WORD",
      "position" : 25
    },
    {
      "token" : "是",
      "start_offset" : 71,
      "end_offset" : 72,
      "type" : "CN_CHAR",
      "position" : 26
    },
    {
      "token" : "个",
      "start_offset" : 72,
      "end_offset" : 73,
      "type" : "CN_CHAR",
      "position" : 27
    },
    {
      "token" : "舔狗",
      "start_offset" : 73,
      "end_offset" : 75,
      "type" : "CN_WORD",
      "position" : 28
    },
    {
      "token" : "直接",
      "start_offset" : 76,
      "end_offset" : 78,
      "type" : "CN_WORD",
      "position" : 29
    },
    {
      "token" : "摆烂",
      "start_offset" : 78,
      "end_offset" : 80,
      "type" : "CN_WORD",
      "position" : 30
    },
    {
      "token" : "吸食",
      "start_offset" : 81,
      "end_offset" : 83,
      "type" : "CN_WORD",
      "position" : 31
    }
  ]
}

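As the result shows, the internet slang terms "搬砖", "画大饼", "已读不回", "舔狗", and "摆烂" are now each recognized as a single token, and the sensitive word "海洛因" is no longer emitted.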
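Comparing two long token lists by eye is error-prone, so a small diff helps. The sketch below is illustrative only: `diff_tokens` is a hypothetical helper, and the two lists are the tail of each result above (from the second "女神" onward), copied by hand.

```python
def diff_tokens(before, after):
    """Return (added, removed): tokens only in `after`, tokens only in `before`."""
    before_set, after_set = set(before), set(after)
    added = [t for t in after if t not in before_set]
    removed = [t for t in before if t not in after_set]
    return added, removed


# Tail of the result before configuring the dictionaries.
before = ["女神", "却已", "读", "不回", "小明", "认清", "自己", "是", "个",
          "舔", "狗", "直接", "摆", "烂", "吸食", "海洛因"]
# Tail of the result after configuring the dictionaries.
after = ["女神", "却", "已读不回", "小明", "认清", "自己", "是", "个",
         "舔狗", "直接", "摆烂", "吸食"]

added, removed = diff_tokens(before, after)
print("added:", added)
print("removed:", removed)
```

Here `added` contains the newly merged terms such as "已读不回", "舔狗", and "摆烂", while `removed` contains the single-character fragments they replaced plus the stop word "海洛因".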
