Preface
A recent feature I worked on involved Elasticsearch, so I'm taking the opportunity to tidy up the query statements I use most often.
Index templates
An index template defines settings + mappings and is commonly used for log-style data. An example:
```json
PUT _index_template/http-log-template
{
"index_patterns": [
"http-log-*"
],
"template": {
"mappings": {
"properties": {
"@timestamp" : {
"type" : "date"
},
"request.path" : {
"type" : "keyword"
},
"request.method" : {
"type" : "keyword"
},
"request.body" : {
"type" : "keyword"
},
"response.status" : {
"type" : "integer"
},
"response.body" : {
"type" : "text"
}
}
},
"settings": {
"index": {
"number_of_shards": 1,
"indexing": {
"slowlog": {
"level": "info",
"threshold": {
"index": {
"warn": "200s",
"trace": "50ms",
"debug": "80s",
"info": "100s"
}
}
}
},
"search": {
"slowlog": {
"threshold": {
"fetch": {
"warn": "200s",
"trace": "50ms",
"debug": "80ms",
"info": "100ms"
},
"query": {
"warn": "10s",
"trace": "500ms",
"debug": "2s",
"info": "5s"
}
}
}
}
}
}
}
}
```
Let's explain the parameters used in the example:
Parameter | Description |
---|---|
index_patterns | Indices the template applies to; the example above matches indices whose names start with http-log- |
index.number_of_shards | Number of primary shards per index. Setting it to 1 keeps the cluster's total shard count in check and avoids the pressure caused by too many shards |
index.indexing.slowlog.level | Log level of the indexing slowlog |
index.indexing.slowlog.threshold.index.warn | Threshold for warn-level slow-indexing entries |
index.indexing.slowlog.threshold.index.trace | Threshold for trace-level slow-indexing entries |
index.indexing.slowlog.threshold.index.debug | Threshold for debug-level slow-indexing entries |
index.indexing.slowlog.threshold.index.info | Threshold for info-level slow-indexing entries |
index.search.slowlog.threshold.fetch.warn | Threshold for warn-level slow-fetch entries |
index.search.slowlog.threshold.fetch.trace | Threshold for trace-level slow-fetch entries |
index.search.slowlog.threshold.fetch.debug | Threshold for debug-level slow-fetch entries |
index.search.slowlog.threshold.fetch.info | Threshold for info-level slow-fetch entries |
index.search.slowlog.threshold.query.warn | Threshold for warn-level slow-query entries |
index.search.slowlog.threshold.query.trace | Threshold for trace-level slow-query entries |
index.search.slowlog.threshold.query.debug | Threshold for debug-level slow-query entries |
index.search.slowlog.threshold.query.info | Threshold for info-level slow-query entries |
index.refresh_interval | Elasticsearch is near-real-time: newly written segments only become searchable after a refresh. A slower refresh rate means segments are created (and later merged) less often, and merging is resource-intensive. The default is 1s |
index, fetch, and query correspond to Elasticsearch's indexing, fetch, and query phases. More parameters are covered in the Alibaba Cloud docs: help.aliyun.com/zh/es/user-...
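The table mentions index.refresh_interval, but the template above doesn't set it. A minimal sketch of tuning it dynamically on an existing index (the 30s value is purely illustrative; we create this index just below):
```json
PUT /http-log-2023-10-28/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```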
```json
// Create an index; its name matches the template above
PUT /http-log-2023-10-28
// This request returns the index's mappings, which match the template
GET /http-log-2023-10-28/_mapping
```
Aliases
- An alias can point to multiple indices
- A query against an alias searches every index the alias points to
- When an alias points to a single index, writes go to that index; when it points to multiple indices, the write index must be designated with the is_write_index parameter, as below. Only one index can be the write index at a time.
```json
POST _aliases
{
"actions": [
{
"add": {
"index": "http-log-2023-10-28",
// the alias name
"alias": "http-log-test",
"is_write_index": false
}
},
{
"add": {
"index": "http-log-2023-10-29",
// the alias name
"alias": "http-log-test",
"is_write_index": true
}
}
]
}
```
The bigger payoff of aliases is a smooth transition when rebuilding an index. When the old index no longer meets business needs, we create a new index and point the alias at both, so queries over historical data are unaffected; we then switch the write index to the new one, migrate the data, and finally delete the old index.
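A sketch of the cut-over step (http-log-new stands in for the hypothetical rebuilt index; a single _aliases call applies its actions atomically, so readers never catch the alias in a half-updated state):
```json
POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "http-log-2023-10-28",
        "alias": "http-log-test"
      }
    },
    {
      "add": {
        "index": "http-log-new",
        "alias": "http-log-test",
        "is_write_index": true
      }
    }
  ]
}
```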
Queries
A search response carries metadata about how the query executed:
```json
{
// how long the query took, in milliseconds
"took" : 23,
// whether the search timed out
"timed_out" : false,
// the shards that were searched
"_shards" : {
// how many shards were searched in total
"total" : 1,
// how many succeeded
"successful" : 1,
// how many were skipped
"skipped" : 0,
// how many failed
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
```
To test the scenarios below, let's prepare an index and some data first. The first JSON block shows the index's mappings (as returned by GET test-product-20231028/_mapping), followed by three test documents:
```json
{
"test-product-20231028" : {
"mappings" : {
"properties" : {
"desc" : {
"type" : "text"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"price" : {
"type" : "long"
}
}
}
}
}
PUT test-product-20231028/_doc/1
{
"name": "elasticsearch",
"desc": "quick restful",
"price": 10
}
PUT test-product-20231028/_doc/2
{
"name": "java",
"desc": "quick language jvm",
"price": 20
}
PUT test-product-20231028/_doc/3
{
"name": "mysql",
"desc": "quick data",
"price": 30
}
```
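If you want to recreate that index from scratch rather than read its mappings back, a minimal sketch (field definitions copied from the mapping above):
```json
PUT test-product-20231028
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "desc": { "type": "text" },
      "price": { "type": "long" }
    }
  }
}
```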
match
- match: the query string is analyzed first, and a document matches if it contains any of the resulting terms. Below, the query string is tokenized into restful and language
```json
GET test-product-20231028/_search
{
"query": {
"match": {
"desc": "restful language"
}
}
}
{
"took" : 57,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0417082,
"hits" : [
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0417082,
"_source" : {
"name" : "elasticsearch",
"desc" : "quick restful",
"price" : 10
}
},
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.8781843,
"_source" : {
"name" : "java",
"desc" : "quick language jvm",
"price" : 20
}
}
]
}
}
```
- match_all: a clause that matches every document
- multi_match: matches several fields in a single query; below we search for elasticsearch language in both name and desc
```json
GET test-product-20231028/_search
{
"query": {
"multi_match": {
"query": "elasticsearch language",
"fields": ["name", "desc"]
}
}
}
{
"took" : 14,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.9808291,
"hits" : [
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9808291,
"_source" : {
"name" : "elasticsearch",
"desc" : "quick restful",
"price" : 10
}
},
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.8781843,
"_source" : {
"name" : "java",
"desc" : "quick language jvm",
"price" : 20
}
}
]
}
}
```
- match_phrase: phrase query. Every term of the analyzed phrase must appear in the queried field, in the same order as in the match_phrase query and, by default, consecutively
```json
GET test-product-20231028/_search
{
"query": {
"match_phrase": {
"desc": "quick language"
}
}
}
{
"took" : 13,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.99774146,
"hits" : [
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.99774146,
"_source" : {
"name" : "java",
"desc" : "quick language jvm",
"price" : 20
}
}
]
}
}
```
term
- term: matches documents whose value is exactly equal to the search term
- term vs. match_phrase:
  - match_phrase analyzes the search keyword; every resulting term must appear among the field's tokens, in the same order and, by default, consecutively
  - term does not analyze the search term at all
- term vs. keyword:
  - term is about the search side: the search term is not analyzed
  - keyword is a field type: the field's value in the source data is not analyzed
The sketch below shows this contrast in practice.
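A quick contrast on the test data: name was analyzed to the lowercase token java, so capitalization matters to term but not to match:
```json
// term compares raw bytes: "Java" != the indexed token "java", so no hits
GET test-product-20231028/_search
{
  "query": {
    "term": {
      "name": "Java"
    }
  }
}
// match analyzes "Java" into "java" first, so document 2 matches
GET test-product-20231028/_search
{
  "query": {
    "match": {
      "name": "Java"
    }
  }
}
```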
```json
GET test-product-20231028/_search
{
"query": {
"term": {
"name.keyword": "java"
}
}
}
{
"took" : 14,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.9808291,
"hits" : [
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9808291,
"_source" : {
"name" : "java",
"desc" : "quick language jvm",
"price" : 20
}
}
]
}
}
```
- terms: matches documents whose field equals any item in a list of terms
```json
GET test-product-20231028/_search
{
"query": {
"terms": {
"name.keyword": ["java", "mysql"]
}
}
}
{
"took" : 15,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "java",
"desc" : "quick language jvm",
"price" : 20
}
},
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "mysql",
"desc" : "quick data",
"price" : 30
}
}
]
}
}
```
filter
A filter only asks whether a document matches the condition. In query context, a relevance score is computed for every result; a filter computes no score, and filters have a caching mechanism, so they can make queries more efficient.
```json
GET test-product-20231028/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"name.keyword": "java"
}
}
}
}
}
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "java",
"desc" : "quick language jvm",
"price" : 20
}
}
]
}
}
```
```json
GET test-product-20231028/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"name.keyword": "java"
}
},
// a constant score can also be assigned to every match
"boost": 1.2
}
}
}
// the filter can also sit inside a bool query
GET test-product-20231028/_search
{
"query": {
"bool": {
"filter": {
"term": {
"name.keyword": "java"
}
}
}
}
}
```
bool
bool combines multiple query clauses. It follows a more_matches_is_better approach: the scores of all matching must and should clauses are added together into the final score.
- must: clauses that must match; they contribute to the score
- filter: clauses that must appear in matching documents, but unlike must their score is ignored; filter clauses run in filter context, which means scoring is skipped and the clauses are candidates for caching
- should: clauses that may match; matching them is optional
- must_not: clauses that must not match; no relevance score is computed for them
minimum_should_match specifies the number or percentage of should clauses a returned document must match. If the bool query contains at least one should clause and no must or filter clauses, the default is 1; otherwise it is 0.
boost raises the weight of a query clause. When documents match clauses with different weights, those matching the higher-weighted clause get a higher relevance score and are returned first. By default every clause has a weight of 1.
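None of the examples below uses boost, so here is a minimal sketch: matches on desc count double compared to matches on name (the values 2 and 1 are arbitrary):
```json
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "desc": { "query": "quick", "boost": 2 } } },
        { "match": { "name": { "query": "quick", "boost": 1 } } }
      ]
    }
  }
}
```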
Here are a few examples:
```json
GET test-product-20231028/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"desc": "quick"
}
},
{
"term": {
"name.keyword": "java"
}
}
],
"must_not": [
{
"term": {
"name.keyword":
"java"
}
}
]
}
}
}
GET test-product-20231028/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"desc": "quick"
}
}
],
"must_not": [
{
"term": {
"name.keyword": "java"
}
}
}
],
"should": [
{
"term": {
"name.keyword": {
"value": "elasticsearch"
}
}
}
],
"minimum_should_match": 1
}
}
}
GET test-product-20231028/_search
{
"query": {
"bool": {
"filter": [
{
"match_all": {}
}
],
"must_not": [
{
"term": {
"name.keyword": "java"
}
}
],
"should": [
{
"term": {
"name.keyword": {
"value": "elasticsearch"
}
}
}
],
"must": [
{
"match": {
"desc": "quick"
}
}
],
"minimum_should_match": 1
}
}
}
```
bool queries can themselves be nested inside bool queries, which makes them very flexible; a sketch follows.
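A minimal nesting sketch: documents must contain quick in desc and have a name of either java or mysql:
```json
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "desc": "quick" } },
        {
          "bool": {
            "should": [
              { "term": { "name.keyword": "java" } },
              { "term": { "name.keyword": "mysql" } }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}
```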
wildcard
A wildcard operator is a placeholder that matches one or more characters; for example, the * operator matches zero or more characters, and ? matches exactly one.
```json
GET test-product-20231028/_search
{
"query": {
"wildcard": {
"name.keyword": {
"value": "*ava"
}
}
}
}
```
fuzzy
To find similar terms, a fuzzy search builds the set of all possible variations (expansions) of the search term within the specified edit distance, then returns exact matches for each expansion. Four kinds of edits are covered:
- Substituting a character (box -> fox)
- Removing a character (black -> lack)
- Inserting a character (sic -> sick)
- Transposing two adjacent characters (act -> cat)
```json
GET test-product-20231028/_search
{
"query": {
"fuzzy": {
"name.keyword": {
"value": "av",
"fuzziness": 2
}
}
}
}
GET test-product-20231028/_search
{
"query": {
"match": {
"name.keyword": {
"query": "msql",
"fuzziness": 2
}
}
}
}
```
Parameters:
- fuzziness: the edit distance (0, 1, or 2). Larger is not better: recall goes up but precision drops.
  - The Damerau-Levenshtein distance between two strings is the number of insertions, deletions, substitutions, and transpositions required to turn one into the other
  - Distance formulas: Lucene's original metric is Levenshtein; Damerau-Levenshtein is the improved version that also counts transpositions
  - axe -> aex: Levenshtein = 2, Damerau-Levenshtein = 1
- transpositions: (optional, boolean) whether edits may include transposing two adjacent characters (ab -> ba). Defaults to true.
range
```json
// query by range
GET test-product-20231028/_search
{
"query": {
"range": {
"price": {
"from": 10,
"to": 20
}
}
}
}
// include_lower: whether the lower bound is inclusive, default true
// include_upper: whether the upper bound is inclusive, default true
GET test-product-20231028/_search
{
"query": {
"range": {
"price": {
"from": 10,
"to": 20,
"include_lower":true,
"include_upper":false
}
}
}
}
// gt: >
// lt: <
// gte: >=
// lte: <=
GET test-product-20231028/_search
{
"query": {
"range": {
"price": {
"gt": 10,
"lte": 20
}
}
}
}
```
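range also works on date fields with date math; a sketch against the earlier http-log indices (assuming documents carry the template's @timestamp field):
```json
GET http-log-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        // from the start of yesterday (inclusive) ...
        "gte": "now-1d/d",
        // ... up to the start of today (exclusive)
        "lt": "now/d"
      }
    }
  }
}
```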
sort
```json
GET test-product-20231028/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"price": {
"order": "desc"
}
}
]
}
```
from / size
Pagination: from is the offset of the first result to return, size the number of results.
```json
GET test-product-20231028/_search
{
"from": 0,
"size": 1,
"query": {
"match_all": {}
}
}
```
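from/size becomes expensive for deep pages (by default from + size may not exceed 10,000). search_after is the usual alternative; a minimal sketch that sorts by price and resumes from the previous page's last sort value:
```json
GET test-product-20231028/_search
{
  "size": 1,
  "query": {
    "match_all": {}
  },
  "sort": [
    { "price": "asc" }
  ],
  // the sort value of the last hit on the previous page
  "search_after": [10]
}
```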
Testing tokenization
Sometimes we want to see how a piece of text gets tokenized; the _analyze API does exactly that (ik_max_word comes from the IK Chinese-analysis plugin):
```json
POST _analyze
{
"analyzer": "ik_max_word",
"text": "风华正茂"
}
{
"tokens" : [
{
"token" : "风华正茂",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "风华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "正",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "茂",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 3
}
]
}
```
completion suggester
Autocomplete guides users toward relevant results as they type, which improves search precision; it is widely used in search scenarios. It does have two problems:
- High memory cost: suggestions must be produced while the user is typing, and the data is held in memory in an FST-based structure
- Prefix-only matching: if what the user types is not a prefix, recall can be very low
To use this kind of search, the field must be mapped with the completion type.
```json
PUT suggest_test
{
"mappings": {
"properties": {
"title" : {
"type" : "text",
"fields" : {
"suggest" : {
"type" : "completion",
"analyzer" : "ik_max_word"
}
},
"analyzer" : "ik_max_word"
},
"content" : {
"type" : "text",
"analyzer": "ik_max_word"
}
}
}
}
PUT suggest_test/_doc/1
{
"title": "三番四次",
"content": "成语1"
}
PUT suggest_test/_doc/2
{
"title": "三山五岳",
"content": "成语2"
}
PUT suggest_test/_doc/3
{
"title": "十全十美",
"content": "成语3"
}
PUT suggest_test/_doc/4
{
"title": "十全十美",
"content": "成语4"
}
```
```json
GET suggest_test/_search
{
"suggest": {
"my_suggest": {
"prefix": "三",
"completion": {
"field": "title.suggest"
}
}
}
}
{
"took" : 91,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"my_suggest" : [
{
"text" : "三",
"offset" : 0,
"length" : 1,
"options" : [
{
"text" : "三山五岳",
"_index" : "suggest_test",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"title" : "三山五岳",
"content" : "成语2"
}
},
{
"text" : "三番四次",
"_index" : "suggest_test",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "三番四次",
"content" : "成语1"
}
}
]
}
]
}
}
```
Queried this way, duplicate suggestions are easy to get (documents 3 and 4 share the title 十全十美), so we enable skip_duplicates:
```json
GET suggest_test/_search
{
"suggest": {
"my_suggest": {
"prefix": "十",
"completion": {
"field": "title.suggest",
"skip_duplicates": true
}
}
}
}
```
We can also add fuzzy matching:
```json
GET suggest_test/_search
{
"suggest": {
"my_suggest": {
"prefix": "十",
"completion": {
"field": "title.suggest",
"skip_duplicates": true,
"fuzzy": {
"fuzziness": 2
}
}
}
}
}
```
Returning only certain fields
Elasticsearch indices are often wide tables; to see only specific fields, filter _source:
```json
GET suggest_test/_search
{
"suggest": {
"my_suggest": {
"prefix": "十",
"completion": {
"field": "title.suggest",
"skip_duplicates": true,
"fuzzy": {
"fuzziness": 2
}
}
}
},
"_source": {
"includes": ["title"]
}
}
```
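_source filtering isn't tied to suggesters and also supports excludes; a sketch on the product index:
```json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  },
  "_source": {
    "excludes": ["desc"]
  }
}
```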
Aggregations
Average
```json
GET test-product-20231028/_search
{
"query": {
"term": {
"name.keyword": {
"value": "java"
}
}
},
"aggs": {
"agv_of_price": {
"avg": {
"field": "price"
}
}
}
}
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.9808291,
"hits" : [
{
"_index" : "test-product-20231028",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9808291,
"_source" : {
"name" : "java",
"desc" : "quick language jvm",
"price" : 20
}
}
]
},
"aggregations" : {
"agv_of_price" : {
"value" : 20.0
}
}
}
```
Maximum
```json
GET test-product-20231028/_search
{
"query": {
"match_all": {}
},
"aggs": {
"max_of_price": {
"max": {
"field": "price"
}
}
}
}
```
Minimum
```json
GET test-product-20231028/_search
{
"query": {
"match_all": {}
},
"aggs": {
"min_of_price": {
"min": {
"field": "price"
}
}
}
}
```
Sum
```json
GET test-product-20231028/_search
{
"query": {
"match_all": {}
},
"aggs": {
"sum_of_price": {
"sum": {
"field": "price"
}
}
}
}
```
Deduplication
```json
// add one more test document
PUT test-product-20231028/_doc/4
{
"name": "mysql",
"desc": "8.0",
"price": 40
}
GET test-product-20231028/_search
{
"collapse": {
"field": "name.keyword"
}
}
```
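collapse deduplicates the hits themselves; if you only need the number of distinct values, the usual tool is a cardinality aggregation (an approximate distinct count):
```json
GET test-product-20231028/_search
{
  "size": 0,
  "aggs": {
    "distinct_names": {
      "cardinality": {
        "field": "name.keyword"
      }
    }
  }
}
```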
Field analysis
Analysis runs in three stages:
- Character filtering (character filters): transform the raw characters (e.g., lowercase the input, strip markup)
- Tokenization (tokenizer): split the text into one or more tokens (e.g., splitting English text on whitespace)
- Token filtering (token filters): transform each token (e.g., dropping words like a/an/of, or reducing plurals to singulars)
Character filters normalize the document by removing useless content, a stray & for instance; the tokenizer is the most straightforward stage, splitting by rule; token filters then post-process the tokens: stop words, tense normalization, case folding, synonyms, filler words, and so on. A custom analyzer wiring all three stages together is sketched below.
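A minimal sketch of an analyzer combining all three stages (the index name my_full_analyzer is illustrative; html_strip, standard, lowercase, and stop are all built in):
```json
PUT my_full_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          // stage 1: strip markup
          "char_filter": ["html_strip"],
          // stage 2: split into tokens
          "tokenizer": "standard",
          // stage 3: transform the tokens
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
GET my_full_analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>The QUICK fox</p>"
}
// tokens: quick, fox
```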
Character filters
We can define our own character filter:
```json
// define an index
PUT my_char_filter
{
"settings": {
"analysis": {
"char_filter": {
// our custom character filter
"my_char_filter" : {
// strip HTML tags
"type": "html_strip",
// tags that should survive stripping can be listed here
"escaped_tags": []
}
},
"analyzer": {
"my_analyzer" : {
// the tokenizer
"tokenizer" : "keyword",
// attach the character filter
"char_filter": "my_char_filter"
}
}
}
}
}
```
Running `<p>I&apos;m </p>` through this character filter gives the following:
```json
GET my_char_filter/_analyze
{
"analyzer": "my_analyzer",
"text": "<p>I'm </p>"
}
{
"tokens" : [
{
"token" : """
I'm
""",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 0
}
]
}
```
```json
// recreate the index (delete the old one first), this time keeping p tags
DELETE my_char_filter
PUT my_char_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter" : {
"type": "html_strip",
// p tags are no longer stripped
"escaped_tags": ["p"]
}
},
}
},
"analyzer": {
"my_analyzer" : {
"tokenizer" : "keyword",
"char_filter": "my_char_filter"
}
}
}
}
}
GET my_char_filter/_analyze
{
"analyzer": "my_analyzer",
"text": "<p>I'm </p>"
}
// the result looks like this
{
"tokens" : [
{
"token" : "<p>I'm </p>",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 0
}
]
}
```
Tokenizers
```json
GET my_token_filter/_analyze
{
"tokenizer": "standard",
"text": "my english is very bad"
}
{
"tokens" : [
{
"token" : "my",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "english",
"start_offset" : 3,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "is",
"start_offset" : 11,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "very",
"start_offset" : 14,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "bad",
"start_offset" : 19,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
```
Token filters
```json
PUT /my_token_filter
{
"settings": {
"analysis": {
"filter": {
"my_synonym" : {
"type": "synonym_graph",
// path to the synonyms file (relative to the Elasticsearch config directory)
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"my_analyzer" : {
"tokenizer" : "ik_max_word",
"filter": ["my_synonym"]
}
}
}
}
}
```
The synonyms file can be defined like this:
```txt
大G==>奔驰
坦克==>长城
```
```json
// analyze some text with the token filter defined above
GET my_token_filter/_analyze
{
"analyzer": "my_analyzer",
"text": ["坦克, 大G"]
}
// the result looks like this
{
"tokens" : [
{
"token" : "长城",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "奔驰",
"start_offset" : 4,
"end_offset" : 6,
"type" : "SYNONYM",
"position" : 1
}
]
}
```
Elasticsearch ships with many built-in token filters; see the official documentation. A quick sketch of two of them follows.
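Built-in token filters can be tested directly with _analyze, no index required (lowercase and stop are both built in):
```json
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox"
}
// tokens: quick, brown, fox
```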
Inspecting a field's tokens
If a document is already indexed and you want to see what its data was tokenized into, use the term vectors API:
```json
GET /suggest_test/_doc/4/_termvectors?fields=title
{
"_index" : "suggest_test",
"_type" : "_doc",
"_id" : "4",
"_version" : 1,
"found" : true,
"took" : 57,
"term_vectors" : {
"title" : {
"field_statistics" : {
"sum_doc_freq" : 28,
"doc_count" : 5,
"sum_ttf" : 31
},
"terms" : {
"全" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 3,
"start_offset" : 1,
"end_offset" : 2
}
]
},
"十" : {
"term_freq" : 2,
"tokens" : [
{
"position" : 2,
"start_offset" : 0,
"end_offset" : 1
},
{
"position" : 4,
"start_offset" : 2,
"end_offset" : 3
}
]
},
"十全" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 0,
"end_offset" : 2
}
]
},
"十全十美" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 4
}
]
},
"美" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 5,
"start_offset" : 3,
"end_offset" : 4
}
]
}
}
}
}
}
```
Relevance scoring
Relevance scores order the search results: the higher the score, the more relevant a result is considered, i.e., the better it matches what was searched for. Before 7.x the default scoring algorithm was TF/IDF; since 7.x it is BM25. So how does the algorithm judge whether a document matches expectations?
Inverse document frequency

$$IDF(q_i) = \ln \frac{N - df_i + 0.5}{df_i + 0.5}$$

Here $N$ is the total number of documents in the index and $df_i$ is the number of documents containing $q_i$. By this formula, the more documents contain a given $q_i$, the less important $q_i$ is, i.e., the lower its discriminating power.
For example, take three documents:
A = Zhang San, Li Si, Wang Wu
B = Zhang San, Li Si
C = Zhang San
If the search term is Zhang San, substituting into the formula gives $\ln \frac{3-3+0.5}{3+0.5} = \ln \frac{0.5}{3.5} \approx -1.95$: the denominator is large, the numerator small, and the result relatively small. If the search term is Wang Wu, we get $\ln \frac{3-1+0.5}{1+0.5} = \ln \frac{2.5}{1.5} \approx 0.51$: the denominator is small, the numerator large, and the result relatively large. This matches intuition: a term that appears in many documents (like "the"; just an example, Elasticsearch removes such stop words) has low discriminating power, while a term that appears in few documents is highly discriminating.
Term frequency

$$S(q_i, d) = \frac{(k_1 + 1)\, tf_{td}}{K + tf_{td}}$$

$$K = k_1 \left(1 - b + b \cdot \frac{L_d}{L_{ave}}\right)$$

Here $tf_{td}$ is the frequency of term $t$ in document $d$, $L_d$ is the length of document $d$, $L_{ave}$ is the average document length, and $k_1$ is a positive parameter that normalizes the range of term frequencies. Consider the first formula: if $tf_{td}$ grows very large, the $K$ in the denominator becomes negligible, so the score saturates at its maximum of $k_1 + 1$; if $tf_{td}$ is very small, the score approaches $\frac{(k_1+1)\, tf_{td}}{K}$, i.e., it grows roughly linearly with term frequency. In the second formula, $k_1$ and $b$ are constants; the larger $L_d$ is, the larger $K$ becomes and the smaller the overall score.
For example, with the same three documents:
A = Zhang San, Li Si, Wang Wu
B = Zhang San, Li Si
C = Zhang San
If the search term is Zhang San, documents A and B are both retrieved, but B ranks ahead of A: A contains more words, so its length factor $K$ is larger and its score lower. A worked computation follows.
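Filling in the numbers, assuming the common BM25 defaults $k_1 = 1.2$ and $b = 0.75$ (these defaults are an assumption; the formulas above don't fix them). With lengths $L_A = 3$, $L_B = 2$, $L_C = 1$ we get $L_{ave} = 2$, and Zhang San appears once in each document, so $tf = 1$:

$$K_A = 1.2 \left(1 - 0.75 + 0.75 \cdot \tfrac{3}{2}\right) = 1.65, \qquad S_A = \frac{2.2 \cdot 1}{1.65 + 1} \approx 0.83$$

$$K_B = 1.2 \left(1 - 0.75 + 0.75 \cdot \tfrac{2}{2}\right) = 1.2, \qquad S_B = \frac{2.2 \cdot 1}{1.2 + 1} = 1.0$$

$S_B > S_A$, so B outranks A purely because it is shorter.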