Squeezing More Out of ES: From numeric to keyword, Fast as The Flash

A while back, a colleague wanted to speed up an Elasticsearch query without changing any application code. The (simplified) query looks like this:

{
 "query": {
   "term": {
     "status": 1
   }
 }
}

The numbers below come from a test environment. Document count: 135999. Distribution of status values: [ { "key" : 1, "doc_count" : 131371 }, { "key" : 3, "doc_count" : 206 }, { "key" : 2, "doc_count" : 33 } ]. All data sits in a freshly created index.

Profiling the query gives the following:

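The profile output shows a PointRangeQuery. Since this field is only ever used for equality matching (similar to where status = ?), the field type was changed from long to keyword (by creating the mapping manually and then reindexing, which of course also changes the field's semantics), with the following result: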

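The improvement is dramatic: 3.943ms -> 0.137ms (the larger the data set, the bigger the advantage). The query type has changed from PointRangeQuery to TermQuery, and both the BUILD_SCORER and NEXT_DOC phases are far faster. On how to choose a data type, the official ES documentation has a fitting explanation: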

Mapping numeric identifiers

Not all numeric data should be mapped as a numeric field data type. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries.

Identifiers, such as an ISBN or a product ID, are rarely used in range queries. However, they are often retrieved using term-level queries.

Consider mapping a numeric identifier as a keyword if:

  - You don't plan to search for the identifier data using range queries.
  - Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.

If you're unsure which to use, you can use a multi-field to map the data as both a keyword and a numeric data type.
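
As a rough illustration of the type change described earlier (create the mapping by hand, then reindex), the sketch below only builds the two REST requests and does not send them. The index names orders_v1 / orders_v2, the keyword sub-field name, and the localhost URL are hypothetical; the typeless mapping body assumes Elasticsearch 7 or later.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but never sends) the requests for a long -> keyword migration.
class KeywordMigrationSketch {
    static final String BASE = "http://localhost:9200"; // assumed local node
    static final String OLD_INDEX = "orders_v1";        // hypothetical: status mapped as long
    static final String NEW_INDEX = "orders_v2";        // hypothetical: status mapped as keyword

    // Body for PUT /orders_v2: declare status as keyword before any document is indexed.
    static String keywordMappingBody() {
        return "{\"mappings\":{\"properties\":{\"status\":{\"type\":\"keyword\"}}}}";
    }

    // Alternative body following the multi-field advice: keep long, add a keyword sub-field.
    // Queries would then have to target status.keyword, which is a code change.
    static String multiFieldMappingBody() {
        return "{\"mappings\":{\"properties\":{\"status\":{\"type\":\"long\","
                + "\"fields\":{\"keyword\":{\"type\":\"keyword\"}}}}}}";
    }

    // Body for POST /_reindex: copy every document from the old index into the new one.
    static String reindexBody() {
        return "{\"source\":{\"index\":\"" + OLD_INDEX + "\"},"
                + "\"dest\":{\"index\":\"" + NEW_INDEX + "\"}}";
    }

    static HttpRequest createIndexRequest() {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + NEW_INDEX))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(keywordMappingBody()))
                .build();
    }

    static HttpRequest reindexRequest() {
        return HttpRequest.newBuilder(URI.create(BASE + "/_reindex"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(reindexBody()))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest create = createIndexRequest();
        HttpRequest reindex = reindexRequest();
        System.out.println(create.method() + " " + create.uri() + " " + keywordMappingBody());
        System.out.println(reindex.method() + " " + reindex.uri() + " " + reindexBody());
    }
}
```

To keep the application untouched, the new index would still have to be reachable under the name the application already queries, for example through an alias; that step is left out of this sketch.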

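Out of curiosity, let's look at why the gap is so large. Searching the source code for the keywords from the profile output (PointRangeQuery / TermQuery, BUILD_SCORER, NEXT_DOC) turns up the following: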

First, PointRangeQuery:

@Override
public Scorer get(long leadCost) throws IOException {
  if (values.getDocCount() == reader.maxDoc()
      && values.getDocCount() == values.size()
      && cost() > reader.maxDoc() / 2) {
    // If all docs have exactly one value and the cost is greater
    // than half the leaf size then maybe we can make things faster
    // by computing the set of documents that do NOT match the range
    final FixedBitSet result = new FixedBitSet(reader.maxDoc());
    result.set(0, reader.maxDoc());
    int[] cost = new int[] { reader.maxDoc() };
    values.intersect(getInverseIntersectVisitor(result, cost));
    final DocIdSetIterator iterator = new BitSetIterator(result, cost[0]);
    return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
  }

  values.intersect(visitor);
  DocIdSetIterator iterator = result.build().iterator();
  return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
}

The method above is the scorer construction logic. Two pieces of it are worth a closer look:

  1. PointValues: Points represent numeric values and are indexed differently than ordinary text. Instead of an inverted index, points are indexed with datastructures such as KD-trees. These structures are optimized for operations such as range, distance, nearest-neighbor, and point-in-polygon queries.
/** Finds all documents and points matching the provided visitor.
 *  This method does not enforce live documents, so it's up to the caller
 *  to test whether each document is deleted, if necessary. */
public abstract void intersect(IntersectVisitor visitor) throws IOException;

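For numeric values, Lucene organizes the data in a KD-tree. At query time it walks the index, reads the leaf blocks, and filters their contents into the result set (the values.intersect call in the snippet above). The time spent here, which covers both building the DocIdSetIterator and processing the PointValues, is what shows up as BUILD_SCORER.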
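
To make the leaf-block filtering more concrete, here is a deliberately simplified, one-dimensional sketch. It is not Lucene's implementation: the tree is flattened into a single row of sorted leaf blocks, and the class name, method names, and leaf size are all illustrative assumptions. An exact match on a numeric field is treated as a range whose lower and upper bounds are equal.

```java
import java.util.Arrays;
import java.util.BitSet;

// Toy 1-D stand-in for a point index; names and sizes are illustrative, not Lucene's.
class PointIntersectSketch {
    static final int LEAF_SIZE = 4; // assumed tiny leaf size so the demo stays readable

    final long[] values; // point values, sorted ascending
    final int[] docs;    // docs[i] is the doc ID that owns values[i]

    PointIntersectSketch(long[] valueByDoc) {
        Integer[] order = new Integer[valueByDoc.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort doc IDs by their value, as an index sorted on the point dimension would.
        Arrays.sort(order, (a, b) -> Long.compare(valueByDoc[a], valueByDoc[b]));
        values = new long[order.length];
        docs = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            docs[i] = order[i];
            values[i] = valueByDoc[order[i]];
        }
    }

    // Collects every doc whose value lies in [lower, upper], one leaf block at a time.
    BitSet intersect(long lower, long upper) {
        BitSet result = new BitSet(docs.length);
        for (int start = 0; start < values.length; start += LEAF_SIZE) {
            int end = Math.min(start + LEAF_SIZE, values.length);
            long min = values[start];
            long max = values[end - 1];
            if (max < lower || min > upper) {
                continue; // block lies entirely outside the query: skip it
            }
            boolean inside = min >= lower && max <= upper;
            for (int i = start; i < end; i++) {
                // Block fully inside the query: accept every doc; otherwise test each value.
                if (inside || (values[i] >= lower && values[i] <= upper)) {
                    result.set(docs[i]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] status = {1, 1, 3, 1, 2, 1, 1, 1, 3, 1}; // made-up status value per doc ID
        BitSet hits = new PointIntersectSketch(status).intersect(1, 1);
        System.out.println(hits); // prints {0, 1, 3, 5, 6, 7, 9}
    }
}
```

Even in this toy version, every matching document has to be visited and written into the bit set before the first hit can be returned, which is the kind of up-front work the profile attributes to BUILD_SCORER.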

  2. BitSetIterator: A DocIdSetIterator which iterates over set bits in a bit set.
private final BitSet bits;

@Override
public int nextDoc() {
  return advance(doc + 1);
}

@Override
public int advance(int target) {
  if (target >= length) {
    return doc = NO_MORE_DOCS;
  }
  return doc = bits.nextSetBit(target);
}
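
The following sketch mimics that iteration pattern with java.util.BitSet instead of Lucene's own bit set classes, so it is an approximation rather than the real code. It also imitates the first branch of get() shown earlier: start with every bit set, then clear the documents that do not match. The document count and the cleared doc IDs are made up.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of bit-set based doc iteration; uses java.util.BitSet, not Lucene's classes.
class BitSetIterationSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel, same value Lucene uses

    // Same shape as advance(): jump to the first set bit at or after target.
    static int advance(BitSet bits, int length, int target) {
        if (target >= length) {
            return NO_MORE_DOCS;
        }
        int next = bits.nextSetBit(target);
        return next < 0 ? NO_MORE_DOCS : next; // java.util.BitSet signals exhaustion with -1
    }

    // Drains the iterator the way repeated nextDoc() calls would.
    static List<Integer> collect(BitSet bits, int length) {
        List<Integer> docs = new ArrayList<>();
        int doc = advance(bits, length, 0);
        while (doc != NO_MORE_DOCS) {
            docs.add(doc);
            doc = advance(bits, length, doc + 1); // nextDoc() is advance(doc + 1)
        }
        return docs;
    }

    // Builds the result the "inverse" way: set everything, then clear non-matching docs.
    static BitSet allExcept(int maxDoc, int... nonMatching) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);
        for (int doc : nonMatching) {
            result.clear(doc);
        }
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 10;                          // made-up segment size
        BitSet result = allExcept(maxDoc, 2, 4, 8); // made-up non-matching doc IDs
        System.out.println(collect(result, maxDoc)); // prints [0, 1, 3, 5, 6, 7, 9]
    }
}
```

Each step has to scan the underlying words of the bit set for the next set bit, so the cost of NEXT_DOC depends on the size of the bit set and not only on the number of matches.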

Next, TermQuery:

@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context)) : "The top-reader used to create Weight is not the same as the current reader's top-reader (" + ReaderUtil.getTopLevelContext(context);;
  final TermsEnum termsEnum = getTermsEnum(context);
  if (termsEnum == null) {
    return null;
  }
  LeafSimScorer scorer = new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores());
  if (scoreMode == ScoreMode.TOP_SCORES) {
    return new TermScorer(this, termsEnum.impacts(PostingsEnum.FREQS), scorer);
  } else {
    return new TermScorer(this, termsEnum.postings(null, scoreMode.needsScores() ? PostingsEnum.FREQS : PostingsEnum.NONE), scorer);
  }
}

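Compared with PointRangeQuery, the scorer here is built (BUILD_SCORER) as soon as the TermsEnum is obtained, and that amounts to simply reading the posting list from the file. As for the NEXT_DOC operation (BlockDocsEnum), it comes out ahead in two ways: there are no bit operations, and it runs in the optimal O(n) time, where n is the number of matching documents.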

private final long[] docBuffer = new long[BLOCK_SIZE+1];

@Override
public int nextDoc() throws IOException {
  if (docBufferUpto == BLOCK_SIZE) {
    refillDocs(); // we don't need to load freqBuffer for now (will be loaded later if necessary)
  }

  doc = (int) docBuffer[docBufferUpto];
  docBufferUpto++;
  return doc;
}
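
The block-buffered pattern above can be imitated with a small self-contained sketch. It is a simplification under stated assumptions: the posting list is held in memory as gaps between doc IDs, the block size is 4 rather than Lucene's, and all names are illustrative. The point is that each nextDoc() call is a plain array read, with one decode pass per block.

```java
import java.util.ArrayList;
import java.util.List;

// Toy posting-list reader: delta-encoded doc IDs decoded one block at a time.
class PostingBlockSketch {
    static final int BLOCK_SIZE = 4;                   // assumed small block for readability
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel for "list exhausted"

    private final int[] deltas;                        // gaps between consecutive doc IDs
    private final int[] docBuffer = new int[BLOCK_SIZE];
    private int bufferLen = 0;                         // valid entries in docBuffer
    private int docBufferUpto = 0;                     // next position to read in docBuffer
    private int consumed = 0;                          // deltas decoded so far
    private int lastDoc = 0;                           // running sum used for decoding

    PostingBlockSketch(int[] sortedDocs) {
        deltas = new int[sortedDocs.length];
        int prev = 0;
        for (int i = 0; i < sortedDocs.length; i++) {
            deltas[i] = sortedDocs[i] - prev;
            prev = sortedDocs[i];
        }
    }

    // Decodes the next block of gaps back into absolute doc IDs.
    private void refillDocs() {
        bufferLen = Math.min(BLOCK_SIZE, deltas.length - consumed);
        for (int i = 0; i < bufferLen; i++) {
            lastDoc += deltas[consumed++];
            docBuffer[i] = lastDoc;
        }
        docBufferUpto = 0;
    }

    int nextDoc() {
        if (docBufferUpto == bufferLen) {              // current block used up
            if (consumed == deltas.length) {
                return NO_MORE_DOCS;
            }
            refillDocs();
        }
        return docBuffer[docBufferUpto++];
    }

    // Convenience helper for the demo: drains a posting list into a list of doc IDs.
    static List<Integer> collect(int[] sortedDocs) {
        PostingBlockSketch postings = new PostingBlockSketch(sortedDocs);
        List<Integer> docs = new ArrayList<>();
        for (int doc = postings.nextDoc(); doc != NO_MORE_DOCS; doc = postings.nextDoc()) {
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        System.out.println(collect(new int[] {0, 1, 3, 5, 6, 7, 9})); // prints [0, 1, 3, 5, 6, 7, 9]
    }
}
```

The total work here grows with the number of postings for the term only, which is the O(n) behavior noted above.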

Results in production:

Before optimization:
After optimization: