Squeezing More out of ES: Switch from numeric to keyword and Queries Become Lightning Fast

A while ago a colleague wanted to speed up an Elasticsearch query without changing any application code. The (simplified) query looks like this:

{
 "query": {
   "term": {
     "status": 1
   }
 }
}

The data below comes from a test environment. Document count: 135999. Distribution of status values: [ { "key" : 1, "doc_count" : 131371 }, { "key" : 3, "doc_count" : 206 }, { "key" : 2, "doc_count" : 33 } ]. All data lives in a freshly created index.

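Profiling the query gives the following:

The profile output shows that the query is executed as a PointRangeQuery. Since this field is only ever used for equality matching (similar to where status = ?), we changed the field type from long to keyword (create the mapping manually, then reindex; this of course also changes the semantics of the field). The result after the change: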

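The improvement is dramatic: 3.943 ms -> 0.137 ms (the larger the data set, the more pronounced the advantage). The query type changes from PointRangeQuery to TermQuery, and the biggest gains are in the BUILD_SCORER and NEXT_DOC phases. On how to choose a data type, the official ES documentation has a fitting explanation: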

Mapping numeric identifiers

Not all numeric data should be mapped as a numeric field data type. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries.

Identifiers, such as an ISBN or a product ID, are rarely used in range queries. However, they are often retrieved using term-level queries.

Consider mapping a numeric identifier as a keyword if:

  - You don't plan to search for the identifier data using range queries.
  - Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.

If you're unsure which to use, you can use a multi-field to map the data as both a keyword and a numeric data type.

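Out of curiosity, let's look at why the gap is so large. Searching for the keywords from the profile output (PointRangeQuery / TermQuery, BUILD_SCORER, NEXT_DOC) leads to the following source code.

First, PointRangeQuery: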

@Override
public Scorer get(long leadCost) throws IOException {
  if (values.getDocCount() == reader.maxDoc()
      && values.getDocCount() == values.size()
      && cost() > reader.maxDoc() / 2) {
    // If all docs have exactly one value and the cost is greater
    // than half the leaf size then maybe we can make things faster
    // by computing the set of documents that do NOT match the range
    final FixedBitSet result = new FixedBitSet(reader.maxDoc());
    result.set(0, reader.maxDoc());
    int[] cost = new int[] { reader.maxDoc() };
    values.intersect(getInverseIntersectVisitor(result, cost));
    final DocIdSetIterator iterator = new BitSetIterator(result, cost[0]);
    return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
  }

  values.intersect(visitor);
  DocIdSetIterator iterator = result.build().iterator();
  return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
}
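
The first branch of get above can be pictured with a minimal, self-contained sketch. A term query on a numeric field is effectively a range whose lower and upper bounds are equal, which is why it shows up as a PointRangeQuery. The sketch below is not Lucene code: it uses java.util.BitSet instead of FixedBitSet, a plain array scan stands in for the tree traversal done by values.intersect, and the names InverseBitSetSketch and matchByClearing are hypothetical. It assumes every document has exactly one value, as the condition in get requires.

```java
import java.util.BitSet;

/** Hypothetical illustration of the "start full, clear non-matches" idea; not Lucene code. */
final class InverseBitSetSketch {

    /**
     * values[docId] holds the single value of each document (an assumption of this sketch).
     * Returns a bit set with one bit per document whose value lies in [lower, upper].
     */
    static BitSet matchByClearing(long[] values, long lower, long upper) {
        int maxDoc = values.length;
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc); // start from "every document matches"
        for (int doc = 0; doc < maxDoc; doc++) {
            long v = values[doc];
            if (v < lower || v > upper) {
                result.clear(doc); // clear the documents that fall outside the range
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] status = {1, 1, 3, 1, 2, 1};
        // status = 1 expressed as the range [1, 1]
        System.out.println(matchByClearing(status, 1, 1));
    }
}
```

With the test data above, 131371 of 135999 documents have status 1, so the cost is well above half of the documents; if every document carries exactly one value, this is presumably the branch that runs.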

The method above is the scorer construction logic. Two points are worth noting:

  1. PointValues: Points represent numeric values and are indexed differently than ordinary text. Instead of an inverted index, points are indexed with data structures such as KD-trees. These structures are optimized for operations such as range, distance, nearest-neighbor, and point-in-polygon queries.
/** Finds all documents and points matching the provided visitor.
 *  This method does not enforce live documents, so it's up to the caller
 *  to test whether each document is deleted, if necessary. */
public abstract void intersect(IntersectVisitor visitor) throws IOException;

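For numeric values, Lucene organizes the data in a KD-tree. At query time it uses the index to read leaf blocks and filters them, setting bits as it goes (the values.intersect calls in the snippet above). The time spent here covers both building the DocIdSetIterator and processing the PointValues (BUILD_SCORER).

  2. BitSetIterator: A DocIdSetIterator which iterates over set bits in a bit set.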
private final BitSet bits;

@Override
public int nextDoc() {
  return advance(doc + 1);
}

@Override
public int advance(int target) {
  if (target >= length) {
    return doc = NO_MORE_DOCS;
  }
  return doc = bits.nextSetBit(target);
}
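
To make the NEXT_DOC behaviour of the bit-set path concrete, here is a small runnable sketch of the same iteration pattern. It is an approximation rather than Lucene's class: it is built on java.util.BitSet (whose nextSetBit returns -1 when nothing is left, so the sketch maps that to NO_MORE_DOCS), and the class name BitSetDocIteratorSketch is hypothetical.

```java
import java.util.BitSet;

/** Hypothetical sketch of iterating doc IDs by scanning a bit set; not Lucene code. */
final class BitSetDocIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final BitSet bits;
    private final int length; // number of documents covered by the bit set
    private int doc = -1;

    BitSetDocIteratorSketch(BitSet bits, int length) {
        this.bits = bits;
        this.length = length;
    }

    int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            return doc; // guard added in this sketch: stay exhausted
        }
        return advance(doc + 1);
    }

    int advance(int target) {
        if (target >= length) {
            return doc = NO_MORE_DOCS;
        }
        // Each step is a bit scan: find the next set bit at or after target.
        int next = bits.nextSetBit(target);
        return doc = (next < 0 || next >= length) ? NO_MORE_DOCS : next;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(200);
        bits.set(1);
        bits.set(64);
        bits.set(130);
        BitSetDocIteratorSketch it = new BitSetDocIteratorSketch(bits, 200);
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            System.out.println(d);
        }
    }
}
```

Every call to nextDoc here involves bit-level scanning of the underlying words, which is the "bit operation" cost referred to in the comparison with TermQuery.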

Now let's look at TermQuery:

@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context)) : "The top-reader used to create Weight is not the same as the current reader's top-reader (" + ReaderUtil.getTopLevelContext(context);;
  final TermsEnum termsEnum = getTermsEnum(context);
  if (termsEnum == null) {
    return null;
  }
  LeafSimScorer scorer = new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores());
  if (scoreMode == ScoreMode.TOP_SCORES) {
    return new TermScorer(this, termsEnum.impacts(PostingsEnum.FREQS), scorer);
  } else {
    return new TermScorer(this, termsEnum.postings(null, scoreMode.needsScores() ? PostingsEnum.FREQS : PostingsEnum.NONE), scorer);
  }
}

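Compared with PointRangeQuery, scorer construction (BUILD_SCORER) is finished as soon as the TermsEnum has been obtained, which in practice amounts to simply reading the posting list from file. As for the NEXT_DOC operation (BlockDocsEnum), two things stand out by comparison: first, there are no bit operations; second, it runs in the optimal O(n) time, where n is the number of matching documents.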

private final long[] docBuffer = new long[BLOCK_SIZE+1];

@Override
public int nextDoc() throws IOException {
  if (docBufferUpto == BLOCK_SIZE) {
    refillDocs(); // we don't need to load freqBuffer for now (will be loaded later if necessary)
  }

  doc = (int) docBuffer[docBufferUpto];
  docBufferUpto++;
  return doc;
}
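
The buffered nextDoc above can be mimicked with a short standalone sketch. It is a simplification under stated assumptions: the posting list is a plain sorted int array instead of an encoded file, BLOCK_SIZE is set to 4 purely for readability (not Lucene's value), and the class name BlockPostingsSketch and its refillDocs body are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of block-buffered posting list iteration; not Lucene code. */
final class BlockPostingsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    static final int BLOCK_SIZE = 4; // illustrative value chosen for this sketch

    private final int[] postings; // sorted doc IDs of the term (stand-in for the on-disk list)
    private int postingsUpto = 0;
    private final int[] docBuffer = new int[BLOCK_SIZE];
    private int docBufferUpto = BLOCK_SIZE; // forces a refill on the first call

    BlockPostingsSketch(int[] sortedDocIds) {
        this.postings = sortedDocIds;
    }

    /** Copies the next block of doc IDs into the buffer, padding with NO_MORE_DOCS. */
    private void refillDocs() {
        int n = Math.min(BLOCK_SIZE, postings.length - postingsUpto);
        System.arraycopy(postings, postingsUpto, docBuffer, 0, n);
        Arrays.fill(docBuffer, n, BLOCK_SIZE, NO_MORE_DOCS);
        postingsUpto += n;
        docBufferUpto = 0;
    }

    /** One array read and one increment per matching document, plus an occasional refill. */
    int nextDoc() {
        if (docBufferUpto == BLOCK_SIZE) {
            refillDocs();
        }
        return docBuffer[docBufferUpto++];
    }

    public static void main(String[] args) {
        BlockPostingsSketch it = new BlockPostingsSketch(new int[] {2, 5, 9, 12, 40, 41});
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            System.out.println(d);
        }
    }
}
```

The work done is proportional to the number of doc IDs in the posting list, which is the O(n) behaviour described above.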

Results in production:

Before optimization:
After optimization: