Common Elasticsearch Statements

Preface

A recent project involved Elasticsearch, so I took the chance to organize the query statements I use most often.

Index Templates

An index template defines settings + mappings and is commonly used for log data. For example:

json
PUT _index_template/http-log-template
{
  "index_patterns": [
    "http-log-*"
  ],
  "template": {
    "mappings": {
      "properties": {
        "@timestamp" : {
          "type" : "date"
        },
        "request.path" : {
          "type" : "keyword"
        },
        "request.method" : {
          "type" : "keyword"
        },
        "request.body" : {
          "type" : "keyword"
        },
        "response.status" : {
          "type" : "integer"
        },
        "response.body" : {
          "type" : "text"
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": 1,
        "indexing": {
          "slowlog": {
            "level": "info",
            "threshold": {
              "index": {
                "warn": "200s",
                "trace": "50ms",
                "debug": "80s",
                "info": "100s"
              }
            }
          }
        },
        "search": {
          "slowlog": {
            "threshold": {
              "fetch": {
                "warn": "200s",
                "trace": "50ms",
                "debug": "80ms",
                "info": "100ms"
              },
              "query": {
                "warn": "10s",
                "trace": "500ms",
                "debug": "2s",
                "info": "5s"
              }
            }
          }
        }
      }
    }
  }
}

The parameters in the example, explained:

Parameter | Description
index_patterns | the index patterns the template matches; the example above matches indices whose names start with http-log-
index.number_of_shards | number of primary shards per index; setting this to 1 keeps the cluster's total shard count under control and prevents the pressure caused by too many shards
index.indexing.slowlog.level | log level of the indexing slowlog
index.indexing.slowlog.threshold.index.warn | threshold for warn-level slow-indexing log entries
index.indexing.slowlog.threshold.index.trace | threshold for trace-level slow-indexing log entries
index.indexing.slowlog.threshold.index.debug | threshold for debug-level slow-indexing log entries
index.indexing.slowlog.threshold.index.info | threshold for info-level slow-indexing log entries
index.search.slowlog.threshold.fetch.warn | threshold for warn-level slow-fetch log entries
index.search.slowlog.threshold.fetch.trace | threshold for trace-level slow-fetch log entries
index.search.slowlog.threshold.fetch.debug | threshold for debug-level slow-fetch log entries
index.search.slowlog.threshold.fetch.info | threshold for info-level slow-fetch log entries
index.search.slowlog.threshold.query.warn | threshold for warn-level slow-query log entries
index.search.slowlog.threshold.query.trace | threshold for trace-level slow-query log entries
index.search.slowlog.threshold.query.debug | threshold for debug-level slow-query log entries
index.search.slowlog.threshold.query.info | threshold for info-level slow-query log entries
index.refresh_interval | Elasticsearch is near-real-time: newly written segments only become searchable after a refresh. A slower refresh interval reduces the frequency of segment merges, which are very resource-intensive. The default refresh interval is 1s

index, fetch, and query correspond to Elasticsearch's indexing, fetch, and query operations on data. More parameters are covered in Alibaba Cloud's documentation: help.aliyun.com/zh/es/user-...

json
// create an index; it matches the template above
PUT /http-log-2023-10-28
// this DSL shows that the index's field mappings match the template
GET /http-log-2023-10-28/_mapping

Aliases

  • An alias can point to multiple indices
  • A query against an alias searches all indices the alias points to
  • On write, if the alias points to a single index, data is written to that index; when it points to multiple indices, the write index must be designated with the is_write_index parameter, as below. Multiple writable indices cannot be specified at the same time.
json
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "http-log-2023-10-28",
        // the alias name
        "alias": "http-log-test",
        "is_write_index": false
      }
    },
    {
      "add": {
        "index": "http-log-2023-10-29",
        // the alias name
        "alias": "http-log-test",
        "is_write_index": true
      }
    }
  ]
}

The bigger value of aliases is a smooth transition when rebuilding an index. When an old index no longer meets business needs, we can create a new index and point an alias at both old and new, so that queries over historical data are unaffected; we then switch the write index, migrate the data to the new index, and finally delete the old one.
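The switch-over above can be sketched without touching a cluster. The helper below (plain Python, a hypothetical name invented for illustration, not part of any Elasticsearch client) builds the _aliases payload and enforces the rule that each alias has exactly one write index:

```python
# Build the actions list for POST _aliases during an index switch-over.
# Hypothetical helper for illustration only.

def build_alias_actions(alias, indices, write_index):
    actions = [
        {"add": {"index": index, "alias": alias,
                 "is_write_index": index == write_index}}
        for index in indices
    ]
    # Invariant from the text above: only one index per alias may be writable.
    writers = [a["add"]["index"] for a in actions if a["add"]["is_write_index"]]
    assert len(writers) == 1, "an alias needs exactly one write index"
    return {"actions": actions}

payload = build_alias_actions(
    "http-log-test",
    ["http-log-2023-10-28", "http-log-2023-10-29"],
    write_index="http-log-2023-10-29",
)
```

The resulting payload matches the POST _aliases body shown above, and the assertion catches the easy mistake of flagging two indices as writable.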

Queries

A query response also returns metadata about the query execution:

json
{
  // time the query took, in milliseconds
  "took" : 23,
  // whether the query timed out
  "timed_out" : false,
  // the shards searched
  "_shards" : {
    // how many shards were searched in total
    "total" : 1,
    // how many succeeded
    "successful" : 1,
    // how many were skipped
    "skipped" : 0,
    // how many failed
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

To test the scenarios below, let's first prepare an index and some data:

json
{
  "test-product-20231028" : {
    "mappings" : {
      "properties" : {
        "desc" : {
          "type" : "text"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "price" : {
          "type" : "long"
        }
      }
    }
  }
}
​
​
PUT test-product-20231028/_doc/1
{
  "name": "elasticsearch",
  "desc": "quick restful", 
  "price": 10
}
​
PUT test-product-20231028/_doc/2
{
  "name": "java",
  "desc": "quick language jvm",
  "price": 20
}
​
PUT test-product-20231028/_doc/3
{
  "name": "mysql",
  "desc": "quick data",
  "price": 30
}
​
match
  • match: first analyzes the query string, then matches documents containing any of the resulting terms. The query below is first analyzed into the terms restful and language:
json
GET test-product-20231028/_search
{
  "query": {
    "match": {
      "desc": "restful language"
    }
  }
}
​
​
{
  "took" : 57,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0417082,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0417082,
        "_source" : {
          "name" : "elasticsearch",
          "desc" : "quick restful",
          "price" : 10
        }
      },
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8781843,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
  • match_all: a clause that matches every document
  • multi_match: matches against multiple fields in one query; the example below searches for elasticsearch language in both the name and desc fields:
json
GET test-product-20231028/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch language", 
      "fields": ["name", "desc"]
    }
  }
}
​
​
{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "elasticsearch",
          "desc" : "quick restful",
          "price" : 10
        }
      },
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8781843,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
  • match_phrase: phrase query; every term must appear in the queried field, and the terms in the field must appear in the same order as in the match_phrase query:
json
GET test-product-20231028/_search
{
  "query": {
    "match_phrase": {
      "desc": "quick language"
    }
  }
}
​
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.99774146,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99774146,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
term
  • term: matches documents whose field value is exactly equal to the search term

    • term vs match_phrase

      • match_phrase analyzes the search phrase; every resulting term must be contained in the analyzed field, in the same order, and by default consecutively
      • term does not analyze the search term
    • term vs keyword

      • term means the search term itself is not analyzed
      • keyword is a field type: the field value in the source data is not analyzed
json
GET test-product-20231028/_search
{
  "query": {
    "term": {
      "name.keyword": "java"
    }
  }
}
​
​
{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
  • terms: matches documents that match any item in a list of search terms:
json
GET test-product-20231028/_search
{
  "query": {
    "terms": {
      "name.keyword": ["java", "mysql"]
    }
  }
}
​
{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      },
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "mysql",
          "desc" : "quick data",
          "price" : 30
        }
      }
    ]
  }
}
filter

A filter only decides whether the current document matches the condition. During a query, every result in query context has a relevance score computed, while a filter does not score; filters also have a caching mechanism, which improves query efficiency.

json
GET test-product-20231028/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "java"
        }
      }
    }
  }
}
​
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
json
GET test-product-20231028/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "java"
        }
      },
      // a constant base score can also be supplied
      "boost": 1.2
    }
  }
}
​
// filter can also be used inside a bool query
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "name.keyword": "java"
        }
      }
    }
  }
}
bool

A bool query combines multiple query conditions. It too uses a more-matches-is-better approach: the scores of matching must and should clauses are combined into the document's final score.

  • must: conditions that must be satisfied
  • filter: clauses that must appear in matching documents, but unlike must, their score is ignored; filter clauses run in filter context, meaning scoring is skipped and the clause is considered for caching
  • should: conditions that may or may not be satisfied
  • must_not: conditions that must not be satisfied; no relevance score is computed

minimum_should_match: specifies the number or percentage of should clauses a returned document must match. If the bool query contains at least one should clause and no must or filter clauses, the default is 1; otherwise the default is 0.
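The default rule just described can be pinned down in a few lines (a simplified toy model; it ignores percentage values and other edge cases):

```python
# Toy model of the minimum_should_match default: 1 when there is at least
# one should clause and no must/filter clause, otherwise 0.

def default_minimum_should_match(bool_query):
    has_should = bool(bool_query.get("should"))
    has_must_or_filter = bool(bool_query.get("must")) or bool(bool_query.get("filter"))
    return 1 if has_should and not has_must_or_filter else 0

assert default_minimum_should_match({"should": [{"term": {}}]}) == 1
assert default_minimum_should_match({"should": [{"term": {}}],
                                     "must": [{"match": {}}]}) == 0
```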

boost: increases the weight of a search condition. When one document matches the boosted condition and another matches a different condition, the document matching the higher-weighted condition gets a higher relevance score and is returned first. By default, all conditions have the same weight, 1.

A few examples:

json
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "desc": "quick"
          }
        },
        {
          "term": {
            "name.keyword": "java"
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "name.keyword": 
              "java"
            
          }
        }
      ]
    }
  }
}
​
​
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "desc": "quick"
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "name.keyword": 
              "java"
            
          }
        }
      ],
      "should": [
        {
          "term": {
            "name.keyword": {
              "value": "elasticsearch"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
​
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match_all": {}
        }
      ],
      "must_not": [
        {
          "term": {
            "name.keyword": "java"
          }
        }
      ],
      "should": [
        {
          "term": {
            "name.keyword": {
              "value": "elasticsearch"
            }
          }
        }
      ],
      "must": [
        {
          "match": {
            "desc": "quick"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

bool queries can also be nested inside bool queries, which makes them very flexible.

wildcard

A wildcard operator is a placeholder that matches one or more characters; for example, the * wildcard matches zero or more characters.

json
GET test-product-20231028/_search
{
  "query": {
    "wildcard": {
      "name.keyword": {
        "value": "*ava"
      }
    } 
  }
}
fuzzy

To find similar words, a fuzzy search builds the set of all possible variations (expansions) of the search term within the specified edit distance, then returns exact matches for each expansion. Four kinds of edits are covered:

  • Changed character (box -> fox)
  • Missing character (black -> lack)
  • Extra character (sic -> sick)
  • Transposed characters (act -> cat)

json
GET test-product-20231028/_search
{
  "query": {
    "fuzzy": {
      "name.keyword": {
        "value": "av", 
        "fuzziness": 2
      }
    } 
  }
}
​
​
GET test-product-20231028/_search
{
  "query": {
    "match": {
      "name.keyword": {
        "query": "msql", 
        "fuzziness": 2
      }
    } 
  }
}

Parameters:

  • fuzziness: the edit distance (0, 1, or 2). Bigger is not better: recall rises, but the results get less precise.

    • The Damerau-Levenshtein distance between two pieces of text is the number of insertions, deletions, substitutions, and transpositions needed to turn one string into the other
    • Distance formulas: Lucene uses Levenshtein; Damerau-Levenshtein is the improved version that adds transpositions
    • axe -> aex: Levenshtein = 2, Damerau-Levenshtein = 1
  • transpositions: (optional, boolean) whether edits include transposing two adjacent characters (ab -> ba); defaults to true
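To make the axe -> aex example concrete, here is a from-scratch sketch of both distances. It implements the restricted (optimal string alignment) variant of Damerau-Levenshtein; Lucene's automaton-based implementation is different internally but agrees on these small examples:

```python
# Edit distance with and without adjacent-character transpositions.

def edit_distance(a, b, transpositions=True):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

assert edit_distance("axe", "aex", transpositions=False) == 2  # Levenshtein
assert edit_distance("axe", "aex") == 1                        # Damerau-Levenshtein
assert edit_distance("msql", "mysql") == 1                     # one insertion
```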

range
json
// range lookup
GET test-product-20231028/_search
{
  "query": {
    "range": {
      "price": {
        "from": 10,
        "to": 20
      }
    }
  }
}
​
// include_lower: whether the left boundary of the range is included; defaults to true
// include_upper: whether the right boundary of the range is included; defaults to true
GET test-product-20231028/_search
{
  "query": {
    "range": {
      "price": {
        "from": 10,
        "to": 20,
        "include_lower":true,
        "include_upper":false
      }
    }
  }
}
​
// gt: >
// lt: <
// gte: >=
// lte: <= 
GET test-product-20231028/_search
{
  "query": {
    "range": {
      "price": {
        "gt": 10,
        "lte": 20
      }
    }
  }
}
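The from/to + include_lower/include_upper form expresses the same ranges as gt/gte/lt/lte; a small helper (hypothetical, purely for illustration) makes the mapping explicit:

```python
# Translate a from/to range with inclusion flags into the gt/gte/lt/lte form.

def to_gtlt(from_=None, to=None, include_lower=True, include_upper=True):
    clause = {}
    if from_ is not None:
        clause["gte" if include_lower else "gt"] = from_
    if to is not None:
        clause["lte" if include_upper else "lt"] = to
    return clause

assert to_gtlt(10, 20) == {"gte": 10, "lte": 20}
assert to_gtlt(10, 20, include_upper=False) == {"gte": 10, "lt": 20}
```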
sort
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}
from

from/size pagination: from is the offset to start from and size the number of hits to return

json
GET test-product-20231028/_search
{
  "from": 0,
  "size": 1, 
  "query": {
    "match_all": {}
  }
}
Testing analysis

Sometimes we want to see what a phrase looks like after analysis; this API lets us test it:

json
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "风华正茂"
}
​
​
{
  "tokens" : [
    {
      "token" : "风华正茂",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "风华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "正",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "茂",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    }
  ]
}
completion suggester

Autocomplete guides users toward relevant results while they type, improving search precision; it is now widely used in all kinds of search scenarios. It does have two problems:

  1. High memory cost: suggestions must be produced as the user keeps typing, so the data is stored in memory based on FSTs
  2. Prefix-only search: if what the user types is not a prefix, recall can be very low
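Point 2 can be illustrated with a minimal in-memory prefix index (a plain dict-based trie; the suggester's real FST is far more compact, but the prefix-walk idea is the same):

```python
# A toy prefix index: lookups walk the prefix character by character,
# so anything that does not start with the prefix is invisible.

class PrefixIndex:
    def __init__(self):
        self.root = {}

    def add(self, phrase):
        node = self.root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node.setdefault("$", set()).add(phrase)  # "$" marks complete phrases

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []                  # no completion for this prefix
            node = node[ch]
        out, stack = [], [node]
        while stack:                       # collect all phrases below this node
            cur = stack.pop()
            out.extend(cur.get("$", ()))
            stack.extend(v for k, v in cur.items() if k != "$")
        return sorted(out)

idx = PrefixIndex()
for s in ["三番四次", "三山五岳", "十全十美"]:
    idx.add(s)
```

A lookup for 三 finds both idioms starting with it, while 五 finds nothing even though 三山五岳 contains it, which is exactly the recall problem described above.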

To use this kind of search, the field type must include completion:

json
PUT suggest_test
{
  "mappings": {
    "properties": {
      "title" : {
          "type" : "text",
          "fields" : {
            "suggest" : {
              "type" : "completion",
              "analyzer" : "ik_max_word"
            }
          },
          "analyzer" : "ik_max_word"
        },
        "content" : {
          "type" : "text",
          "analyzer": "ik_max_word"
        }
    }
  }
}
​
​
PUT suggest_test/_doc/1
{
  "title": "三番四次",
  "content": "成语1"
}
​
PUT suggest_test/_doc/2
{
  "title": "三山五岳",
  "content": "成语2"
}
​
PUT suggest_test/_doc/3
{
  "title": "十全十美",
  "content": "成语3"
}
​
PUT suggest_test/_doc/4
{
  "title": "十全十美",
  "content": "成语4"
}
json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "三",
      "completion": {
        "field": "title.suggest"
      }
    }
  }
}
​
​
{
  "took" : 91,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "my_suggest" : [
      {
        "text" : "三",
        "offset" : 0,
        "length" : 1,
        "options" : [
          {
            "text" : "三山五岳",
            "_index" : "suggest_test",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 1.0,
            "_source" : {
              "title" : "三山五岳",
              "content" : "成语2"
            }
          },
          {
            "text" : "三番四次",
            "_index" : "suggest_test",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "title" : "三番四次",
              "content" : "成语1"
            }
          }
        ]
      }
    ]
  }
}

A query like this easily returns duplicate suggestions:

json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "十",
      "completion": {
        "field": "title.suggest",
        "skip_duplicates": true
      }
    }
  }
}

In addition, we can add fuzzy matching:

json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "十",
      "completion": {
        "field": "title.suggest",
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}
Returning only certain fields

Elasticsearch often stores wide documents; if we only want certain fields back:

json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "十",
      "completion": {
        "field": "title.suggest",
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  },
  "_source": {
    "includes": ["title"]
  }
}

Aggregations

Average
json
GET test-product-20231028/_search
{
  "query": {
    "term": {
      "name.keyword": {
        "value": "java"
      }
    }
  }, 
  "aggs": {
    "agv_of_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
​
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  },
  "aggregations" : {
    "agv_of_price" : {
      "value" : 20.0
    }
  }
}
​
Maximum
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "max_of_price": {
      "max": {
        "field": "price"
      }
    }
  }
}
Minimum
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "min_of_price": {
      "min": {
        "field": "price"
      }
    }
  }
}
Sum
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "sum_of_price": {
      "sum": {
        "field": "price"
      }
    }
  }
}
Deduplication
json
// add one more test document
PUT test-product-20231028/_doc/4
{
  "name": "mysql",
  "desc": "8.0",
  "price": 40
}
​
​
GET test-product-20231028/_search
{
  "collapse": {
    "field": "name.keyword"
  }
}

Field analysis

Field analysis has three stages:

  • Character filtering (character filters): transform characters (for example, uppercase to lowercase)
  • Tokenization (tokenizer): split the text into one or more tokens (for example, splitting English text into words on whitespace)
  • Token filtering (token filters): transform each token (for example, dropping words like a, an, of, or reducing plurals to singular)

These three stages were illustrated by the figure above. Character filters remove useless characters and normalize the document (a stray & is one such useless character); the tokenizer simply splits text according to its rules; token filters then process the resulting tokens, handling stopwords, tense normalization, lowercasing, synonyms, and filler words.
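The three stages can be strung together as a toy pipeline (pure Python with made-up filter rules; real analyzers are configured inside Elasticsearch itself):

```python
import re

# Stage 1: character filter, transforms raw characters.
def char_filter(text):
    return text.replace("&", " and ").lower()

# Stage 2: tokenizer, splits text into tokens.
def tokenizer(text):
    return [t for t in re.split(r"\W+", text) if t]

# Stage 3: token filter, transforms or drops individual tokens.
def token_filter(tokens, stopwords=frozenset({"a", "an", "of", "the"})):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    return token_filter(tokenizer(char_filter(text)))

assert analyze("The Quick & Dirty guide") == ["quick", "and", "dirty", "guide"]
```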

Character filtering

We can define a custom character filter:

json
// define an index
PUT my_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        // the custom character filter
        "my_char_filter" : {
          // strip HTML tags
          "type": "html_strip",
          // tags to be skipped can be configured here
          "escaped_tags": ""
        }
      },
      "analyzer": {
        "my_analyzer" : {
          // the tokenizer
          "tokenizer" : "keyword",
          // attach the character filter
          "char_filter": "my_char_filter"
        }
      }
    }
  }
}

Running <p>I'm </p> through the character filter gives this result:

json
GET my_char_filter/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm </p>"
}
​
{
  "tokens" : [
    {
      "token" : """
I'm 
""",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    }
  ]
}
json
PUT my_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter" : {
          "type": "html_strip",
          // the p tag is configured to be skipped
          "escaped_tags": "p"
        }
      },
      "analyzer": {
        "my_analyzer" : {
          "tokenizer" : "keyword",
          "char_filter": "my_char_filter"
        }
      }
    }
  }
}
​
GET my_char_filter/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm </p>"
}
​
// the result now looks like this
{
  "tokens" : [
    {
      "token" : "<p>I'm </p>",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    }
  ]
}
Tokenization
json
GET my_token_filter/_analyze
{
  "tokenizer": "standard",
  "text": "my english is very bad"
}
​
{
  "tokens" : [
    {
      "token" : "my",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "english",
      "start_offset" : 3,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "very",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "bad",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
Token filtering
json
PUT /my_token_filter
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym" : {
          "type": "synonym_graph",
          // path to the synonym file
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "my_analyzer" : {
          "tokenizer" : "ik_max_word",
          "filter": ["my_synonym"]
        }
      }
    }
  }
}

The synonym file can be defined like this:

text
大G==>奔驰
坦克==>长城
json
// use the token filter defined above
GET my_token_filter/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["坦克, 大G"]
}
​
// the result comes out like this
{
  "tokens" : [
    {
      "token" : "长城",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "奔驰",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

Elasticsearch has many built-in token filters; see the official documentation:

www.elastic.co/guide/en/el...

Inspecting how a field was tokenized

If a document has already been indexed and you want to see what its data was tokenized into, use this API:

json
GET /suggest_test/_doc/4/_termvectors?fields=title
​
​
{
  "_index" : "suggest_test",
  "_type" : "_doc",
  "_id" : "4",
  "_version" : 1,
  "found" : true,
  "took" : 57,
  "term_vectors" : {
    "title" : {
      "field_statistics" : {
        "sum_doc_freq" : 28,
        "doc_count" : 5,
        "sum_ttf" : 31
      },
      "terms" : {
        "全" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 1,
              "end_offset" : 2
            }
          ]
        },
        "十" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 0,
              "end_offset" : 1
            },
            {
              "position" : 4,
              "start_offset" : 2,
              "end_offset" : 3
            }
          ]
        },
        "十全" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        },
        "十全十美" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 4
            }
          ]
        },
        "美" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 3,
              "end_offset" : 4
            }
          ]
        }
      }
    }
  }
}

Relevance scoring

Relevance scores are used to order search results: the higher the score, the more relevant the result is considered to the search, i.e. the better it matches the search's expectation. Before 7.x the default relevance score came from the TF/IDF algorithm; since 7.x the default is BM25. So how does this algorithm judge whether a document matches?

Inverse document frequency

$IDF(q_i) = \ln \frac{N - df_i + 0.5}{df_i + 0.5}$

where $N$ is the total number of documents in the index and $df_i$ is the number of documents containing $q_i$. By the formula, the more documents contain a given $q_i$, the less important, or less discriminative, $q_i$ is.

Here is an example. Take three documents:

A = 张三, 李四, 王五

B = 张三, 李四

C = 张三

If the search term is 张三, substituting into the formula gives $\ln \frac{3-3+0.5}{3+0.5}$: the denominator is large and the numerator small, so the value is relatively small.

If the search term is 王五, substituting gives $\ln \frac{3-1+0.5}{1+0.5}$: the denominator is small and the numerator large, so the value is relatively large.

This matches intuition: a word that appears in many documents has low discriminative power, like "的" (just an example; Elasticsearch removes such stopwords), while a word that appears in few documents is highly discriminative.
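Plugging the three documents above into the IDF formula confirms the intuition (a direct transcription of the formula, not Lucene's exact implementation):

```python
import math

docs = [
    {"张三", "李四", "王五"},   # A
    {"张三", "李四"},           # B
    {"张三"},                   # C
]

def idf(term):
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((n - df + 0.5) / (df + 0.5))

# 张三 appears everywhere -> low (here negative) IDF; 王五 is rare -> higher IDF
assert idf("王五") > idf("李四") > idf("张三")
```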

Term frequency

$S(q_i, d) = \frac{(k_1+1)\,tf_{td}}{K + tf_{td}}$

$K = k_1 (1 - b + b \cdot \frac{L_d}{L_{ave}})$

where $tf_{td}$ is the frequency of term $t$ in document $d$, $L_d$ is the length of document $d$, $L_{ave}$ is the average length of all documents, and $k_1$ is a positive parameter used to normalize the range of term frequencies. Looking at the first formula: if $tf_{td}$ grows without bound, $K$ in the denominator becomes negligible next to $tf_{td}$, so the formula's maximum is $k_1+1$; as $tf_{td}$ shrinks toward zero, the whole score tends to 0, so the value lies between 0 and $k_1+1$. In the second formula, $k_1$ and $b$ are constants and can be treated as fixed; the larger $L_d$ is, the larger $K$ becomes, and the smaller the overall score.

Again an example with three documents:

A = 张三, 李四, 王五

B = 张三, 李四

C = 张三

If the search term is 张三, documents A and B are both retrieved, and B ranks slightly ahead of A, because document A contains more words.
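The length effect can be checked numerically with the term-frequency component above ($k_1$ and $b$ are set to the common BM25 defaults 1.2 and 0.75; the numbers are illustrative, not Lucene's output):

```python
# Term-frequency component of BM25, transcribed from the formulas above.

def tf_score(tf, doc_len, avg_len, k1=1.2, b=0.75):
    K = k1 * (1 - b + b * doc_len / avg_len)
    return (k1 + 1) * tf / (K + tf)

avg = (3 + 2 + 1) / 3  # average length of documents A, B, C
# 张三 appears once in both A and B, but B is shorter,
# so B's term-frequency component is higher:
assert tf_score(1, doc_len=2, avg_len=avg) > tf_score(1, doc_len=3, avg_len=avg)
# And the score saturates: it can never exceed k1 + 1.
assert tf_score(10_000, doc_len=2, avg_len=avg) < 1.2 + 1
```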

References

www.elastic.co/cn/blog/pra...
