Common Elasticsearch Statements

Preface

A recent project involved Elasticsearch, so I took the chance to organize the query statements I use most often.

Index Templates

An index template defines settings + mappings and is commonly used for log data. For example:

json
PUT _index_template/http-log-template
{
  "index_patterns": [
    "http-log-*"
  ],
  "template": {
    "mappings": {
      "properties": {
        "@timestamp" : {
          "type" : "date"
        },
        "request.path" : {
          "type" : "keyword"
        },
        "request.method" : {
          "type" : "keyword"
        },
        "request.body" : {
          "type" : "keyword"
        },
        "response.status" : {
          "type" : "integer"
        },
        "response.body" : {
          "type" : "text"
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": 1,
        "indexing": {
          "slowlog": {
            "level": "info",
            "threshold": {
              "index": {
                "warn": "200s",
                "trace": "50ms",
                "debug": "80s",
                "info": "100s"
              }
            }
          }
        },
        "search": {
          "slowlog": {
            "threshold": {
              "fetch": {
                "warn": "200s",
                "trace": "50ms",
                "debug": "80ms",
                "info": "100ms"
              },
              "query": {
                "warn": "10s",
                "trace": "500ms",
                "debug": "2s",
                "info": "5s"
              }
            }
          }
        }
      }
    }
  }
}

The parameters in the example, explained:

Parameter | Description
index_patterns | the index patterns the template matches; the example above matches indices whose names start with http-log-
index.number_of_shards | number of primary shards per index; setting this to 1 keeps the cluster's total shard count under control and prevents the pressure caused by too many shards
index.indexing.slowlog.level | log level of the indexing slowlog
index.indexing.slowlog.threshold.index.warn | threshold for warn-level slow-indexing log entries
index.indexing.slowlog.threshold.index.trace | threshold for trace-level slow-indexing log entries
index.indexing.slowlog.threshold.index.debug | threshold for debug-level slow-indexing log entries
index.indexing.slowlog.threshold.index.info | threshold for info-level slow-indexing log entries
index.search.slowlog.threshold.fetch.warn | threshold for warn-level slow-fetch log entries
index.search.slowlog.threshold.fetch.trace | threshold for trace-level slow-fetch log entries
index.search.slowlog.threshold.fetch.debug | threshold for debug-level slow-fetch log entries
index.search.slowlog.threshold.fetch.info | threshold for info-level slow-fetch log entries
index.search.slowlog.threshold.query.warn | threshold for warn-level slow-query log entries
index.search.slowlog.threshold.query.trace | threshold for trace-level slow-query log entries
index.search.slowlog.threshold.query.debug | threshold for debug-level slow-query log entries
index.search.slowlog.threshold.query.info | threshold for info-level slow-query log entries
index.refresh_interval | Elasticsearch is near-real-time: newly written segments only become searchable after a refresh. A slower refresh interval reduces the frequency of segment merges, which are very resource-intensive. The default refresh interval is 1s

index, fetch, and query correspond to Elasticsearch's indexing, fetch, and query operations on data. More parameters are covered in Alibaba Cloud's documentation: help.aliyun.com/zh/es/user-...

json
// create an index; it matches the template above
PUT /http-log-2023-10-28
// this DSL shows that the index's field mappings match the template
GET /http-log-2023-10-28/_mapping

Aliases

  • An alias can point to multiple indices
  • A query against an alias searches all indices the alias points to
  • On write, if the alias points to a single index, data is written to that index; when it points to multiple indices, the write index must be designated with the is_write_index parameter, as below. Multiple writable indices cannot be specified at the same time.
json
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "http-log-2023-10-28",
        // the alias name
        "alias": "http-log-test",
        "is_write_index": false
      }
    },
    {
      "add": {
        "index": "http-log-2023-10-29",
        // the alias name
        "alias": "http-log-test",
        "is_write_index": true
      }
    }
  ]
}

The bigger value of aliases is a smooth transition when rebuilding an index. When an old index no longer meets business needs, we can create a new index and point an alias at both old and new, so that queries over historical data are unaffected; we then switch the write index, migrate the data to the new index, and finally delete the old one.
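The switch-over above can be sketched without touching a cluster. The helper below (plain Python, a hypothetical name invented for illustration, not part of any Elasticsearch client) builds the _aliases payload and enforces the rule that each alias has exactly one write index:

```python
# Build the actions list for POST _aliases during an index switch-over.
# Hypothetical helper for illustration only.

def build_alias_actions(alias, indices, write_index):
    actions = [
        {"add": {"index": index, "alias": alias,
                 "is_write_index": index == write_index}}
        for index in indices
    ]
    # Invariant from the text above: only one index per alias may be writable.
    writers = [a["add"]["index"] for a in actions if a["add"]["is_write_index"]]
    assert len(writers) == 1, "an alias needs exactly one write index"
    return {"actions": actions}

payload = build_alias_actions(
    "http-log-test",
    ["http-log-2023-10-28", "http-log-2023-10-29"],
    write_index="http-log-2023-10-29",
)
```

The resulting payload matches the POST _aliases body shown above, and the assertion catches the easy mistake of flagging two indices as writable.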

Queries

A query response also returns metadata about the query execution:

json
{
  // time the query took, in milliseconds
  "took" : 23,
  // whether the query timed out
  "timed_out" : false,
  // the shards searched
  "_shards" : {
    // how many shards were searched in total
    "total" : 1,
    // how many succeeded
    "successful" : 1,
    // how many were skipped
    "skipped" : 0,
    // how many failed
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

To test the scenarios below, let's first prepare an index and some data:

json
{
  "test-product-20231028" : {
    "mappings" : {
      "properties" : {
        "desc" : {
          "type" : "text"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "price" : {
          "type" : "long"
        }
      }
    }
  }
}
​
​
PUT test-product-20231028/_doc/1
{
  "name": "elasticsearch",
  "desc": "quick restful", 
  "price": 10
}
​
PUT test-product-20231028/_doc/2
{
  "name": "java",
  "desc": "quick language jvm",
  "price": 20
}
​
PUT test-product-20231028/_doc/3
{
  "name": "mysql",
  "desc": "quick data",
  "price": 30
}
​
match
  • match: first analyzes the query string, then matches documents containing any of the resulting terms. The query below is first analyzed into the terms restful and language:
json
GET test-product-20231028/_search
{
  "query": {
    "match": {
      "desc": "restful language"
    }
  }
}
​
​
{
  "took" : 57,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0417082,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0417082,
        "_source" : {
          "name" : "elasticsearch",
          "desc" : "quick restful",
          "price" : 10
        }
      },
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8781843,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
  • match_all: a clause that matches every document
  • multi_match: matches against multiple fields in one query; the example below searches for elasticsearch language in both the name and desc fields:
json
GET test-product-20231028/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch language", 
      "fields": ["name", "desc"]
    }
  }
}
​
​
{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "elasticsearch",
          "desc" : "quick restful",
          "price" : 10
        }
      },
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8781843,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
  • match_phrase: phrase query; every term must appear in the queried field, and the terms in the field must appear in the same order as in the match_phrase query:
json
GET test-product-20231028/_search
{
  "query": {
    "match_phrase": {
      "desc": "quick language"
    }
  }
}
​
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.99774146,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99774146,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
term
  • term: matches documents whose field value is exactly equal to the search term

    • term vs match_phrase

      • match_phrase analyzes the search phrase; every resulting term must be contained in the analyzed field, in the same order, and by default consecutively
      • term does not analyze the search term
    • term vs keyword

      • term means the search term itself is not analyzed
      • keyword is a field type: the field value in the source data is not analyzed
json
GET test-product-20231028/_search
{
  "query": {
    "term": {
      "name.keyword": "java"
    }
  }
}
​
​
{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
  • terms: matches documents that match any item in a list of search terms:
json
GET test-product-20231028/_search
{
  "query": {
    "terms": {
      "name.keyword": ["java", "mysql"]
    }
  }
}
​
{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      },
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "mysql",
          "desc" : "quick data",
          "price" : 30
        }
      }
    ]
  }
}
filter

A filter only decides whether the current document matches the condition. During a query, every result in query context has a relevance score computed, while a filter does not score; filters also have a caching mechanism, which improves query efficiency.

json
GET test-product-20231028/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "java"
        }
      }
    }
  }
}
​
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  }
}
json
GET test-product-20231028/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "java"
        }
      },
      // a constant base score can also be supplied
      "boost": 1.2
    }
  }
}
​
// filter can also be used inside a bool query
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "name.keyword": "java"
        }
      }
    }
  }
}
bool

A bool query combines multiple query conditions. It too uses a more-matches-is-better approach: the scores of matching must and should clauses are combined into the document's final score.

  • must: conditions that must be satisfied
  • filter: clauses that must appear in matching documents, but unlike must, their score is ignored; filter clauses run in filter context, meaning scoring is skipped and the clause is considered for caching
  • should: conditions that may or may not be satisfied
  • must_not: conditions that must not be satisfied; no relevance score is computed

minimum_should_match: specifies the number or percentage of should clauses a returned document must match. If the bool query contains at least one should clause and no must or filter clauses, the default is 1; otherwise the default is 0.
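The default rule just described can be pinned down in a few lines (a simplified toy model; it ignores percentage values and other edge cases):

```python
# Toy model of the minimum_should_match default: 1 when there is at least
# one should clause and no must/filter clause, otherwise 0.

def default_minimum_should_match(bool_query):
    has_should = bool(bool_query.get("should"))
    has_must_or_filter = bool(bool_query.get("must")) or bool(bool_query.get("filter"))
    return 1 if has_should and not has_must_or_filter else 0

assert default_minimum_should_match({"should": [{"term": {}}]}) == 1
assert default_minimum_should_match({"should": [{"term": {}}],
                                     "must": [{"match": {}}]}) == 0
```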

boost: increases the weight of a search condition. When one document matches the boosted condition and another matches a different condition, the document matching the higher-weighted condition gets a higher relevance score and is returned first. By default, all conditions have the same weight, 1.

A few examples:

json
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "desc": "quick"
          }
        },
        {
          "term": {
            "name.keyword": "java"
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "name.keyword": 
              "java"
            
          }
        }
      ]
    }
  }
}
​
​
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "desc": "quick"
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "name.keyword": 
              "java"
            
          }
        }
      ],
      "should": [
        {
          "term": {
            "name.keyword": {
              "value": "elasticsearch"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
​
GET test-product-20231028/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match_all": {}
        }
      ],
      "must_not": [
        {
          "term": {
            "name.keyword": "java"
          }
        }
      ],
      "should": [
        {
          "term": {
            "name.keyword": {
              "value": "elasticsearch"
            }
          }
        }
      ],
      "must": [
        {
          "match": {
            "desc": "quick"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

bool queries can also be nested inside bool queries, which makes them very flexible.

wildcard

A wildcard operator is a placeholder that matches one or more characters; for example, the * wildcard matches zero or more characters.

json
GET test-product-20231028/_search
{
  "query": {
    "wildcard": {
      "name.keyword": {
        "value": "*ava"
      }
    } 
  }
}
fuzzy

To find similar words, a fuzzy search builds the set of all possible variations (expansions) of the search term within the specified edit distance, then returns exact matches for each expansion. Four kinds of edits are covered:

  • Changed character (box -> fox)
  • Missing character (black -> lack)
  • Extra character (sic -> sick)
  • Transposed characters (act -> cat)

json
GET test-product-20231028/_search
{
  "query": {
    "fuzzy": {
      "name.keyword": {
        "value": "av", 
        "fuzziness": 2
      }
    } 
  }
}
​
​
GET test-product-20231028/_search
{
  "query": {
    "match": {
      "name.keyword": {
        "query": "msql", 
        "fuzziness": 2
      }
    } 
  }
}

Parameters:

  • fuzziness: the edit distance (0, 1, or 2). Bigger is not better: recall rises, but the results get less precise.

    • The Damerau-Levenshtein distance between two pieces of text is the number of insertions, deletions, substitutions, and transpositions needed to turn one string into the other
    • Distance formulas: Lucene uses Levenshtein; Damerau-Levenshtein is the improved version that adds transpositions
    • axe -> aex: Levenshtein = 2, Damerau-Levenshtein = 1
  • transpositions: (optional, boolean) whether edits include transposing two adjacent characters (ab -> ba); defaults to true
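To make the axe -> aex example concrete, here is a from-scratch sketch of both distances. It implements the restricted (optimal string alignment) variant of Damerau-Levenshtein; Lucene's automaton-based implementation is different internally but agrees on these small examples:

```python
# Edit distance with and without adjacent-character transpositions.

def edit_distance(a, b, transpositions=True):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

assert edit_distance("axe", "aex", transpositions=False) == 2  # Levenshtein
assert edit_distance("axe", "aex") == 1                        # Damerau-Levenshtein
assert edit_distance("msql", "mysql") == 1                     # one insertion
```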

range
json
// range lookup
GET test-product-20231028/_search
{
  "query": {
    "range": {
      "price": {
        "from": 10,
        "to": 20
      }
    }
  }
}
​
// include_lower: whether the left boundary of the range is included; defaults to true
// include_upper: whether the right boundary of the range is included; defaults to true
GET test-product-20231028/_search
{
  "query": {
    "range": {
      "price": {
        "from": 10,
        "to": 20,
        "include_lower":true,
        "include_upper":false
      }
    }
  }
}
​
// gt: >
// lt: <
// gte: >=
// lte: <= 
GET test-product-20231028/_search
{
  "query": {
    "range": {
      "price": {
        "gt": 10,
        "lte": 20
      }
    }
  }
}
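The from/to + include_lower/include_upper form expresses the same ranges as gt/gte/lt/lte; a small helper (hypothetical, purely for illustration) makes the mapping explicit:

```python
# Translate a from/to range with inclusion flags into the gt/gte/lt/lte form.

def to_gtlt(from_=None, to=None, include_lower=True, include_upper=True):
    clause = {}
    if from_ is not None:
        clause["gte" if include_lower else "gt"] = from_
    if to is not None:
        clause["lte" if include_upper else "lt"] = to
    return clause

assert to_gtlt(10, 20) == {"gte": 10, "lte": 20}
assert to_gtlt(10, 20, include_upper=False) == {"gte": 10, "lt": 20}
```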
sort
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}
from

from/size pagination: from is the offset to start from and size the number of hits to return

json
GET test-product-20231028/_search
{
  "from": 0,
  "size": 1, 
  "query": {
    "match_all": {}
  }
}
Testing analysis

Sometimes we want to see what a phrase looks like after analysis; this API lets us test it:

json
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "风华正茂"
}
​
​
{
  "tokens" : [
    {
      "token" : "风华正茂",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "风华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "正",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "茂",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    }
  ]
}
completion suggester

Autocomplete guides users toward relevant results while they type, improving search precision; it is now widely used in all kinds of search scenarios. It does have two problems:

  1. High memory cost: suggestions must be produced as the user keeps typing, so the data is stored in memory based on FSTs
  2. Prefix-only search: if what the user types is not a prefix, recall can be very low
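Point 2 can be illustrated with a minimal in-memory prefix index (a plain dict-based trie; the suggester's real FST is far more compact, but the prefix-walk idea is the same):

```python
# A toy prefix index: lookups walk the prefix character by character,
# so anything that does not start with the prefix is invisible.

class PrefixIndex:
    def __init__(self):
        self.root = {}

    def add(self, phrase):
        node = self.root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node.setdefault("$", set()).add(phrase)  # "$" marks complete phrases

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []                  # no completion for this prefix
            node = node[ch]
        out, stack = [], [node]
        while stack:                       # collect all phrases below this node
            cur = stack.pop()
            out.extend(cur.get("$", ()))
            stack.extend(v for k, v in cur.items() if k != "$")
        return sorted(out)

idx = PrefixIndex()
for s in ["三番四次", "三山五岳", "十全十美"]:
    idx.add(s)
```

A lookup for 三 finds both idioms starting with it, while 五 finds nothing even though 三山五岳 contains it, which is exactly the recall problem described above.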

To use this kind of search, the field type must include completion:

json
PUT suggest_test
{
  "mappings": {
    "properties": {
      "title" : {
          "type" : "text",
          "fields" : {
            "suggest" : {
              "type" : "completion",
              "analyzer" : "ik_max_word"
            }
          },
          "analyzer" : "ik_max_word"
        },
        "content" : {
          "type" : "text",
          "analyzer": "ik_max_word"
        }
    }
  }
}
​
​
PUT suggest_test/_doc/1
{
  "title": "三番四次",
  "content": "成语1"
}
​
PUT suggest_test/_doc/2
{
  "title": "三山五岳",
  "content": "成语2"
}
​
PUT suggest_test/_doc/3
{
  "title": "十全十美",
  "content": "成语3"
}
​
PUT suggest_test/_doc/4
{
  "title": "十全十美",
  "content": "成语4"
}
json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "三",
      "completion": {
        "field": "title.suggest"
      }
    }
  }
}
​
​
{
  "took" : 91,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "my_suggest" : [
      {
        "text" : "三",
        "offset" : 0,
        "length" : 1,
        "options" : [
          {
            "text" : "三山五岳",
            "_index" : "suggest_test",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 1.0,
            "_source" : {
              "title" : "三山五岳",
              "content" : "成语2"
            }
          },
          {
            "text" : "三番四次",
            "_index" : "suggest_test",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "title" : "三番四次",
              "content" : "成语1"
            }
          }
        ]
      }
    ]
  }
}

A query like this easily returns duplicate suggestions:

json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "十",
      "completion": {
        "field": "title.suggest",
        "skip_duplicates": true
      }
    }
  }
}

In addition, we can add fuzzy matching:

json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "十",
      "completion": {
        "field": "title.suggest",
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}
Returning only certain fields

Elasticsearch often stores wide documents; if we only want certain fields back:

json
GET suggest_test/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "十",
      "completion": {
        "field": "title.suggest",
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  },
  "_source": {
    "includes": ["title"]
  }
}

Aggregations

Average
json
GET test-product-20231028/_search
{
  "query": {
    "term": {
      "name.keyword": {
        "value": "java"
      }
    }
  }, 
  "aggs": {
    "agv_of_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
​
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "test-product-20231028",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "java",
          "desc" : "quick language jvm",
          "price" : 20
        }
      }
    ]
  },
  "aggregations" : {
    "agv_of_price" : {
      "value" : 20.0
    }
  }
}
​
Maximum
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "max_of_price": {
      "max": {
        "field": "price"
      }
    }
  }
}
Minimum
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "min_of_price": {
      "min": {
        "field": "price"
      }
    }
  }
}
Sum
json
GET test-product-20231028/_search
{
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "sum_of_price": {
      "sum": {
        "field": "price"
      }
    }
  }
}
Deduplication
json
// add one more test document
PUT test-product-20231028/_doc/4
{
  "name": "mysql",
  "desc": "8.0",
  "price": 40
}
​
​
GET test-product-20231028/_search
{
  "collapse": {
    "field": "name.keyword"
  }
}

Field analysis

Field analysis has three stages:

  • Character filtering (character filters): transform characters (for example, uppercase to lowercase)
  • Tokenization (tokenizer): split the text into one or more tokens (for example, splitting English text into words on whitespace)
  • Token filtering (token filters): transform each token (for example, dropping words like a, an, of, or reducing plurals to singular)

These three stages were illustrated by the figure above. Character filters remove useless characters and normalize the document (a stray & is one such useless character); the tokenizer simply splits text according to its rules; token filters then process the resulting tokens, handling stopwords, tense normalization, lowercasing, synonyms, and filler words.
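The three stages can be strung together as a toy pipeline (pure Python with made-up filter rules; real analyzers are configured inside Elasticsearch itself):

```python
import re

# Stage 1: character filter, transforms raw characters.
def char_filter(text):
    return text.replace("&", " and ").lower()

# Stage 2: tokenizer, splits text into tokens.
def tokenizer(text):
    return [t for t in re.split(r"\W+", text) if t]

# Stage 3: token filter, transforms or drops individual tokens.
def token_filter(tokens, stopwords=frozenset({"a", "an", "of", "the"})):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    return token_filter(tokenizer(char_filter(text)))

assert analyze("The Quick & Dirty guide") == ["quick", "and", "dirty", "guide"]
```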

Character filtering

We can define a custom character filter:

json
// define an index
PUT my_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        // the custom character filter
        "my_char_filter" : {
          // strip HTML tags
          "type": "html_strip",
          // tags to be skipped can be configured here
          "escaped_tags": ""
        }
      },
      "analyzer": {
        "my_analyzer" : {
          // the tokenizer
          "tokenizer" : "keyword",
          // attach the character filter
          "char_filter": "my_char_filter"
        }
      }
    }
  }
}

Running <p>I'm </p> through the character filter gives this result:

json
GET my_char_filter/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm </p>"
}
​
{
  "tokens" : [
    {
      "token" : """
I'm 
""",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    }
  ]
}
json
PUT my_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter" : {
          "type": "html_strip",
          // the p tag is configured to be skipped
          "escaped_tags": "p"
        }
      },
      "analyzer": {
        "my_analyzer" : {
          "tokenizer" : "keyword",
          "char_filter": "my_char_filter"
        }
      }
    }
  }
}
​
GET my_char_filter/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm </p>"
}
​
// the result now looks like this
{
  "tokens" : [
    {
      "token" : "<p>I'm </p>",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    }
  ]
}
Tokenization
json
GET my_token_filter/_analyze
{
  "tokenizer": "standard",
  "text": "my english is very bad"
}
​
{
  "tokens" : [
    {
      "token" : "my",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "english",
      "start_offset" : 3,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "very",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "bad",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
Token filtering
json
PUT /my_token_filter
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym" : {
          "type": "synonym_graph",
          // path to the synonym file
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "my_analyzer" : {
          "tokenizer" : "ik_max_word",
          "filter": ["my_synonym"]
        }
      }
    }
  }
}

The synonym file can be defined like this:

text
大G==>奔驰
坦克==>长城
json
// use the token filter defined above
GET my_token_filter/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["坦克, 大G"]
}
​
// the result comes out like this
{
  "tokens" : [
    {
      "token" : "长城",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "奔驰",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

Elasticsearch has many built-in token filters; see the official documentation:

www.elastic.co/guide/en/el...

Inspecting how a field was tokenized

If a document has already been indexed and you want to see what its data was tokenized into, use this API:

json
GET /suggest_test/_doc/4/_termvectors?fields=title
​
​
{
  "_index" : "suggest_test",
  "_type" : "_doc",
  "_id" : "4",
  "_version" : 1,
  "found" : true,
  "took" : 57,
  "term_vectors" : {
    "title" : {
      "field_statistics" : {
        "sum_doc_freq" : 28,
        "doc_count" : 5,
        "sum_ttf" : 31
      },
      "terms" : {
        "全" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 1,
              "end_offset" : 2
            }
          ]
        },
        "十" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 0,
              "end_offset" : 1
            },
            {
              "position" : 4,
              "start_offset" : 2,
              "end_offset" : 3
            }
          ]
        },
        "十全" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        },
        "十全十美" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 4
            }
          ]
        },
        "美" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 3,
              "end_offset" : 4
            }
          ]
        }
      }
    }
  }
}

Relevance scoring

Relevance scores are used to order search results: the higher the score, the more relevant the result is considered to the search, i.e. the better it matches the search's expectation. Before 7.x the default relevance score came from the TF/IDF algorithm; since 7.x the default is BM25. So how does this algorithm judge whether a document matches?

Inverse document frequency

$IDF(q_i) = \ln \frac{N - df_i + 0.5}{df_i + 0.5}$

where $N$ is the total number of documents in the index and $df_i$ is the number of documents containing $q_i$. By the formula, the more documents contain a given $q_i$, the less important, or less discriminative, $q_i$ is.

Here is an example. Take three documents:

A = 张三, 李四, 王五

B = 张三, 李四

C = 张三

If the search term is 张三, substituting into the formula gives $\ln \frac{3-3+0.5}{3+0.5}$: the denominator is large and the numerator small, so the value is relatively small.

If the search term is 王五, substituting gives $\ln \frac{3-1+0.5}{1+0.5}$: the denominator is small and the numerator large, so the value is relatively large.

This matches intuition: a word that appears in many documents has low discriminative power, like "的" (just an example; Elasticsearch removes such stopwords), while a word that appears in few documents is highly discriminative.
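Plugging the three documents above into the IDF formula confirms the intuition (a direct transcription of the formula, not Lucene's exact implementation):

```python
import math

docs = [
    {"张三", "李四", "王五"},   # A
    {"张三", "李四"},           # B
    {"张三"},                   # C
]

def idf(term):
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((n - df + 0.5) / (df + 0.5))

# 张三 appears everywhere -> low (here negative) IDF; 王五 is rare -> higher IDF
assert idf("王五") > idf("李四") > idf("张三")
```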

Term frequency

$S(q_i, d) = \frac{(k_1+1)\,tf_{td}}{K + tf_{td}}$

$K = k_1 (1 - b + b \cdot \frac{L_d}{L_{ave}})$

where $tf_{td}$ is the frequency of term $t$ in document $d$, $L_d$ is the length of document $d$, $L_{ave}$ is the average length of all documents, and $k_1$ is a positive parameter used to normalize the range of term frequencies. Looking at the first formula: if $tf_{td}$ grows without bound, $K$ in the denominator becomes negligible next to $tf_{td}$, so the formula's maximum is $k_1+1$; as $tf_{td}$ shrinks toward zero, the whole score tends to 0, so the value lies between 0 and $k_1+1$. In the second formula, $k_1$ and $b$ are constants and can be treated as fixed; the larger $L_d$ is, the larger $K$ becomes, and the smaller the overall score.

Again an example with three documents:

A = 张三, 李四, 王五

B = 张三, 李四

C = 张三

If the search term is 张三, documents A and B are both retrieved, and B ranks slightly ahead of A, because document A contains more words.
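The length effect can be checked numerically with the term-frequency component above ($k_1$ and $b$ are set to the common BM25 defaults 1.2 and 0.75; the numbers are illustrative, not Lucene's output):

```python
# Term-frequency component of BM25, transcribed from the formulas above.

def tf_score(tf, doc_len, avg_len, k1=1.2, b=0.75):
    K = k1 * (1 - b + b * doc_len / avg_len)
    return (k1 + 1) * tf / (K + tf)

avg = (3 + 2 + 1) / 3  # average length of documents A, B, C
# 张三 appears once in both A and B, but B is shorter,
# so B's term-frequency component is higher:
assert tf_score(1, doc_len=2, avg_len=avg) > tf_score(1, doc_len=3, avg_len=avg)
# And the score saturates: it can never exceed k1 + 1.
assert tf_score(10_000, doc_len=2, avg_len=avg) < 1.2 + 1
```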

References

www.elastic.co/cn/blog/pra...
