Elasticsearch Analyzers
How to tokenize a piece of text
json
get _analyze
{
"text":"你好,我是小明,很高兴认识你",
"analyzer":"standard"
}
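Normalization
Normalization applies the same formatting to both the query text and the content stored in ES, for example unifying letter case, reducing plurals to singular forms, and removing words that carry no meaning (stop words), so that a search can actually find the matching documents.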
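The idea can be illustrated outside ES with a minimal Java sketch. The stop-word list and the crude "strip a trailing s" rule are assumptions made only for this illustration; they are not the rules ES applies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class NormalizationSketch {
    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of");

    /** Lowercases, drops stop words, and strips a trailing "s" as a crude singular form. */
    static List<String> normalize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(word -> !word.isEmpty() && !STOP_WORDS.contains(word))
                .map(word -> word.length() > 3 && word.endsWith("s")
                        ? word.substring(0, word.length() - 1)
                        : word)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The stored text and the query differ in case, number, and stop words,
        // yet both normalize to the same terms, so they can match.
        System.out.println(normalize("The Black Dogs")); // [black, dog]
        System.out.println(normalize("black dog"));      // [black, dog]
    }
}
```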
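Character Filter
Character filters process the raw text before tokenization and filter out characters that should not take part in matching at search time. There are three types: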
HTML Strip
Strips HTML tags. Tags listed in escaped_tags are left in place.
json
put my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_filter": {
"type": "html_strip",
"escaped_tags": [
"a"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": "my_filter"
}
}
}
}
}
get my_index/_analyze
{
"analyzer":"my_analyzer",
"text":"<p>你好<h1><a>小明</a></h2>"
}
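As a rough approximation of what the filter above does, the following Java sketch removes every tag except `<a>` and `</a>` with a regular expression. It is only an illustration: the real html_strip filter also decodes HTML entities and replaces block-level tags with line breaks, which this sketch does not attempt. The class name and sample text are made up for the example.

```java
import java.util.regex.Pattern;

final class HtmlStripSketch {
    // Matches any tag except <a ...> and </a>, mirroring "escaped_tags": ["a"].
    private static final Pattern TAGS_EXCEPT_A = Pattern.compile("<(?!/?a\\b)[^>]+>");

    /** Removes all tags except anchor tags; the text content is kept as-is. */
    static String strip(String html) {
        return TAGS_EXCEPT_A.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>hello<h1><a>Xiao Ming</a></h1></p>"));
        // hello<a>Xiao Ming</a>
    }
}
```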
Mapping
A custom character filter that replaces each configured key with its mapped value.
json
put my_index1
{
"settings": {
"analysis": {
"char_filter": {
"my_filter": {
"type": "mapping",
"mappings":[
"你 => *",
"滚 => *"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": "my_filter"
}
}
}
}
}
get my_index1/_analyze
{
"analyzer":"my_analyzer",
"text":"你好,滚"
}
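The effect of the two mappings above can be sketched in plain Java. This is a simplified stand-in, assuming simple sequential replacement; the real mapping filter matches greedily on the longest key, which gives the same result for the single-character keys used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MappingFilterSketch {
    /** Replaces every occurrence of each key with its value, in iteration order. */
    static String apply(String text, Map<String, String> mappings) {
        String result = text;
        for (Map.Entry<String, String> entry : mappings.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("你", "*");
        mappings.put("滚", "*");
        System.out.println(apply("你好,滚", mappings)); // *好,*
    }
}
```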
Pattern Replace
Regex replacement: replaces the parts of the text matched by a regular expression.
json
put my_index2
{
"settings": {
"analysis": {
"char_filter": {
"my_filter": {
"type": "pattern_replace",
"pattern":"(\\d{3})\\d{4}(\\d{4})",
"replacement":"$1****$2"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": "my_filter"
}
}
}
}
}
get my_index2/_analyze
{
"analyzer":"my_analyzer",
"text":"15012345678"
}
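Java's regex replacement uses the same `$1`/`$2` group syntax, so the masking rule above can be tried locally before putting it into an index definition. A minimal sketch (the class and method names are made up for the example):

```java
final class PatternReplaceSketch {
    /** Keeps the first three and last four digits and masks the four in between. */
    static String mask(String phone) {
        return phone.replaceAll("(\\d{3})\\d{4}(\\d{4})", "$1****$2");
    }

    public static void main(String[] args) {
        System.out.println(mask("15012345678")); // 150****5678
    }
}
```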
Token Filter
There are many kinds of token filters, such as case conversion and synonyms. See the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html
json
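// Synonyms (my_index2 was already created in the Pattern Replace example; delete it first or use a different index name)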
put my_index2
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "synonym",
"synonyms":["你好=>hello","小明 =>xiaoming"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"filter": ["my_filter"]
}
}
}
}
}
get my_index2/_analyze
{
"analyzer":"my_analyzer",
"text":"你好"
}
// Lowercase
get my_index2/_analyze
{
"tokenizer":"standard",
"filter":["lowercase"],
"text":["ABCDE"]
}
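Conceptually, a token filter takes the token stream produced by the tokenizer and returns a modified stream. The sketch below models the two filters above as plain Java functions over a list of tokens. It is an illustration of the idea under that simplifying assumption, not how ES implements them.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

final class TokenFilterSketch {
    /** Lowercase filter: maps each token to its lowercase form. */
    static List<String> lowercase(List<String> tokens) {
        return tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /** Synonym filter with explicit "a => b" rules: replaces a token when a rule exists. */
    static List<String> synonyms(List<String> tokens, Map<String, String> rules) {
        return tokens.stream()
                .map(token -> rules.getOrDefault(token, token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> rules = Map.of("你好", "hello", "小明", "xiaoming");
        System.out.println(synonyms(List.of("你好"), rules)); // [hello]
        System.out.println(lowercase(List.of("ABCDE")));      // [abcde]
    }
}
```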
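Custom Analyzer
- Every custom character filter, token filter, and tokenizer must be given a name.
- Multiple character filters and token filters can be defined.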
json
put my_index
{
"settings":{
"analysis":{
// Custom character filter
"char_filter":{
"my_char_filter":{
"type":"mapping",
"mappings":["& => and","| => or"]
}
},
// Custom token filter
"filter":{
"my_filter":{
"type":"stop",
"stopwords":[
"is",
"in",
"the",
"a",
"at"
]
}
},
// Custom tokenizer
"tokenizer":{
"my_tokenizer":{
"type":"pattern",
"pattern":"[ ,.!?]"
}
},
// Custom analyzer: wires together the custom character filter, token filter, and tokenizer
"analyzer":{
"my_analyzer":{
"type":"custom",
"char_filter":["my_char_filter"], // multiple character filters are allowed
"filter":["my_filter"], // multiple token filters are allowed
"tokenizer":"my_tokenizer"
}
}
}
}
}
}
}
get my_index/_analyze
{
"analyzer":"my_analyzer",
"text":"hello && hello | | word is me the word ! how are you ? and me ,,,"
}
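The order in which the three stages run (character filter, then tokenizer, then token filter) can be sketched in plain Java. This is a simplified model of the analyzer defined above, useful for reasoning about what tokens to expect; it is not ES code, and the class name is made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class CustomAnalyzerSketch {
    private static final Set<String> STOP_WORDS = Set.of("is", "in", "the", "a", "at");

    static List<String> analyze(String text) {
        // 1. Character filter: "& => and", "| => or"
        String filtered = text.replace("&", "and").replace("|", "or");
        // 2. Tokenizer: split on the pattern [ ,.!?] and drop empty pieces
        // 3. Token filter: remove stop words
        return Arrays.stream(filtered.split("[ ,.!?]"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze(
                "hello && hello | | word is me the word ! how are you ? and me ,,,"));
        // [hello, andand, hello, or, or, word, me, word, how, are, you, and, me]
    }
}
```

Note that "&&" becomes "andand" because the character filter runs on the raw text before any tokenization happens.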
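Chinese Analyzer (IK)
Installation
Create an ik folder under the ES plugins directory, extract the IK analyzer archive into it, and restart ES.
Two tokenization modes
- ik_max_word: fine-grained segmentation (more, smaller tokens)
- ik_smart: coarse-grained segmentation (fewer, larger tokens)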
json
get my_index/_analyze
{
"analyzer":"ik_max_word",
"text":"我爱你中华人民共和国"
}
get my_index/_analyze
{
"analyzer":"ik_smart",
"text":"我爱你中华人民共和国"
}
Results (ik_max_word first, then ik_smart):
json
{
"tokens": [
{
"token": "我爱你",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "爱你",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华人民共和国",
"start_offset": 3,
"end_offset": 10,
"type": "CN_WORD",
"position": 2
},
{
"token": "中华人民",
"start_offset": 3,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
},
{
"token": "中华",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
},
{
"token": "华人",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 5
},
{
"token": "人民共和国",
"start_offset": 5,
"end_offset": 10,
"type": "CN_WORD",
"position": 6
},
{
"token": "人民",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 7
},
{
"token": "共和国",
"start_offset": 7,
"end_offset": 10,
"type": "CN_WORD",
"position": 8
},
{
"token": "共和",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 9
},
{
"token": "国",
"start_offset": 9,
"end_offset": 10,
"type": "CN_CHAR",
"position": 10
}
]
}
json
{
"tokens": [
{
"token": "我爱你",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民共和国",
"start_offset": 3,
"end_offset": 10,
"type": "CN_WORD",
"position": 1
}
]
}
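The difference between the two granularities can be illustrated with a toy dictionary-based segmenter in Java. This is not IK's algorithm (IK uses larger dictionaries plus its own disambiguation logic); the dictionary, the class name, and both strategies are assumptions chosen so that the contrast is visible: one method emits every dictionary word it finds, the other keeps only the longest word at each position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class DictionarySegmentationSketch {
    // Toy dictionary for this illustration only.
    static final Set<String> DICT = Set.of(
            "我爱你", "爱你", "中华人民共和国", "中华人民", "中华",
            "华人", "人民共和国", "人民", "共和国", "共和");
    static final int MAX_LEN = 7; // length of the longest dictionary word

    /** Fine-grained: emits every dictionary word found at every offset. */
    static List<String> allMatches(String text) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = Math.min(text.length(), start + MAX_LEN); end > start; end--) {
                String candidate = text.substring(start, end);
                if (DICT.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    /** Coarse-grained: forward maximum matching, longest word first, no overlaps. */
    static List<String> longestMatches(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + MAX_LEN);
            while (end > start + 1 && !DICT.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end)); // falls back to a single character
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "我爱你中华人民共和国";
        System.out.println(allMatches(text));
        System.out.println(longestMatches(text)); // [我爱你, 中华人民共和国]
    }
}
```

With this toy dictionary, `allMatches` yields the same words as the ik_max_word result above except for the trailing single character 国, which IK emits separately as a CN_CHAR token.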
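IK File Descriptions
- IKAnalyzer.cfg.xml: IK configuration file
- main.dic: main dictionary
- stopword.dic: English stop words
- Special dictionaries
  - quantifier.dic: units of measurement
  - suffix.dic: administrative-unit suffixes
  - surname.dic: Chinese surnames
  - preposition.dic: modal particles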
Custom Dictionaries
Configure them in the configuration file; separate multiple dictionaries with semicolons.
xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Configure your own extension dictionaries here -->
<entry key="ext_dict">aaa.dic;bbb.dic</entry>
<!-- Configure your own extension stop-word dictionaries here -->
<entry key="ext_stopwords"></entry>
<!-- Configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
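Hot Update
Two approaches: remote dictionary based and database based.
Remote Dictionary
- Edit the configuration file and set the value of the remote_ext_dict or remote_ext_stopwords entry to the URL of a REST endpoint.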
xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
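<comment>IK Analyzer extension configuration</comment>
<!-- Configure your own extension dictionaries here -->
<entry key="ext_dict">aaa.dic;bbb.dic</entry>
<!-- Configure your own extension stop-word dictionaries here -->
<entry key="ext_stopwords"></entry>
<!-- Remote extension dictionary: replace words_location with the URL of the endpoint -->
<entry key="remote_ext_dict">words_location</entry>
<!-- Configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->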
</properties>
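- Endpoint requirements
- The HTTP response must include two headers, Last-Modified and ETag. Both are strings; whenever either one changes, IK fetches the new dictionary.
- The response body contains one word per line, separated by \n.
- The hot words can be kept in a UTF-8 encoded .txt file served by nginx or another HTTP server. When the .txt file is modified, the HTTP server automatically returns updated Last-Modified and ETag headers the next time a client requests it. A separate business system can extract the relevant words and update the .txt file.
- The file can also be a .dic file instead of a .txt file.
- By default, IK polls the endpoint once per minute.
- Endpoint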
java
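import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.servlet.http.HttpServletResponse; // jakarta.servlet.http on Spring Boot 3+
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class Test {
    @GetMapping("hotWord")
    public void hotWord(HttpServletResponse response) {
        File file = new File("/my_word");
        try {
            // Read the whole dictionary file (one word per line, UTF-8)
            byte[] bytes = Files.readAllBytes(file.toPath());
            // IK fetches the dictionary again when either header changes,
            // so derive both from the file's modification time and size
            response.setDateHeader("Last-Modified", file.lastModified());
            response.setHeader("ETag", file.lastModified() + "-" + bytes.length);
            response.setContentType("text/plain;charset=utf-8");
            response.getOutputStream().write(bytes);
            response.getOutputStream().flush();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}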
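The change-detection rule described in the endpoint requirements (fetch again when either Last-Modified or ETag differs from the last seen value) can be sketched from the polling side as follows. This is a hypothetical helper written for illustration, not IK's actual source; the class and method names are made up.

```java
import java.util.Objects;

final class RemoteDictionaryWatcher {
    private String lastModified;
    private String eTag;

    /**
     * Returns true when either header differs from the previously seen value,
     * which is the signal to download the dictionary again.
     */
    boolean shouldReload(String newLastModified, String newETag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(eTag, newETag);
        if (changed) {
            lastModified = newLastModified;
            eTag = newETag;
        }
        return changed;
    }

    public static void main(String[] args) {
        RemoteDictionaryWatcher watcher = new RemoteDictionaryWatcher();
        String date = "Mon, 01 Jan 2024 00:00:00 GMT";
        System.out.println(watcher.shouldReload(date, "v1")); // true  (first poll)
        System.out.println(watcher.shouldReload(date, "v1")); // false (nothing changed)
        System.out.println(watcher.shouldReload(date, "v2")); // true  (ETag changed)
    }
}
```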
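MySQL-Based
- Modify the IK source code: in the initial method, add custom logic that loads words from the database.
- Build the package and replace the original IK plugin in ES.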