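Squeezing ES: from numeric to keyword, instantly as fast as the Flash

A while ago a colleague wanted to speed up an Elasticsearch query without changing any code. The (simplified) query looked like this: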

```json
{
  "query": {
    "term": {
      "status": 1
    }
  }
}
```

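The numbers below come from a test environment. Document count: 135999. Distribution of status values: [ { "key" : 1, "doc_count" : 131371 }, { "key" : 3, "doc_count" : 206 }, { "key" : 2, "doc_count" : 33 } ]. All data sits in a freshly built index.

Profiling the query gives the following:

The profile output shows a PointRangeQuery. Since this field is only ever used for equality matching (similar to where status = ?), the field type was changed from long to keyword (create the mapping by hand, then reindex; this of course also changes the semantics of the field), with the following result:

The improvement is dramatic: 3.943ms -> 0.137ms (the larger the data set, the bigger the advantage). The query type changes from PointRangeQuery to TermQuery, and the large gains are in the BUILD_SCORER and NEXT_DOC phases. On how to choose the data type, the official ES documentation has fitting guidance: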

Mapping numeric identifiers

Not all numeric data should be mapped as a numeric field data type. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries.

Identifiers, such as an ISBN or a product ID, are rarely used in range queries. However, they are often retrieved using term-level queries.

Consider mapping a numeric identifier as a keyword if:

- You don't plan to search for the identifier data using range queries.
- Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.

If you're unsure which to use, you can use a multi-field to map the data as both a keyword and a numeric data type.

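Out of curiosity, let's look at why the gap is so large. Searching for the keywords from the profile output, PointRangeQuery / TermQuery and BUILD_SCORER / NEXT_DOC, leads to the following source code.

First, `PointRangeQuery`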

```java
@Override
public Scorer get(long leadCost) throws IOException {
  if (values.getDocCount() == reader.maxDoc()
      && values.getDocCount() == values.size()
      && cost() > reader.maxDoc() / 2) {
    // If all docs have exactly one value and the cost is greater
    // than half the leaf size then maybe we can make things faster
    // by computing the set of documents that do NOT match the range
    final FixedBitSet result = new FixedBitSet(reader.maxDoc());
    result.set(0, reader.maxDoc());
    int[] cost = new int[] { reader.maxDoc() };
    values.intersect(getInverseIntersectVisitor(result, cost));
    final DocIdSetIterator iterator = new BitSetIterator(result, cost[0]);
    return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
  }

  // `visitor` and `result` here are declared outside this excerpt
  values.intersect(visitor);
  DocIdSetIterator iterator = result.build().iterator();
  return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
}
```

The method above is the scorer construction logic. Two pieces of it matter here:

PointValues: Points represent numeric values and are indexed differently than ordinary text. Instead of an inverted index, points are indexed with datastructures such as KD-trees. These structures are optimized for operations such as range, distance, nearest-neighbor, and point-in-polygon queries.

```java
/** Finds all documents and points matching the provided visitor.
 *  This method does not enforce live documents, so it's up to the caller
 *  to test whether each document is deleted, if necessary. */
public abstract void intersect(IntersectVisitor visitor) throws IOException;
```

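For numeric values, Lucene organizes the data in a KD-tree. At query time it uses the index to read leaf blocks and filters them, marking the matching documents (the values.intersect call in the snippet). The time spent here covers both building the DocIdSetIterator and processing the PointValues (BUILD_SCORER).

BitSetIterator: A DocIdSetIterator which iterates over set bits in a bit set.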
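The two branches of the scorer code can be illustrated with a toy model. This is a minimal sketch, not Lucene code: the class `InverseIntersectSketch`, its methods, and the `docsByValue` map (a stand-in for the value-sorted point index) are all hypothetical names introduced here. It assumes every document has exactly one value, which is the precondition the real code checks before taking the inverse path.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the direct and inverse matching strategies; not Lucene code. */
final class InverseIntersectSketch {

    /** Direct strategy: start empty and set one bit per matching document. */
    static BitSet direct(Map<Long, int[]> docsByValue, int maxDoc, long target) {
        BitSet result = new BitSet(maxDoc);
        for (int doc : docsByValue.getOrDefault(target, new int[0])) {
            result.set(doc);
        }
        return result;
    }

    /**
     * Inverse strategy: start with every bit set and clear the documents
     * that do NOT match. Only valid when each document has exactly one value.
     */
    static BitSet inverse(Map<Long, int[]> docsByValue, int maxDoc, long target) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);
        for (Map.Entry<Long, int[]> entry : docsByValue.entrySet()) {
            if (entry.getKey().longValue() != target) {
                for (int doc : entry.getValue()) {
                    result.clear(doc);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: 8 documents, most of them with status 1.
        Map<Long, int[]> docsByValue = new TreeMap<>();
        docsByValue.put(1L, new int[] {0, 1, 2, 4, 5, 7});
        docsByValue.put(2L, new int[] {3});
        docsByValue.put(3L, new int[] {6});

        System.out.println(direct(docsByValue, 8, 1L));
        System.out.println(inverse(docsByValue, 8, 1L));
    }
}
```

Both methods produce the same bit set. When the target value covers most documents, the inverse variant touches only the few non-matching documents individually, but it still has to fill a bit set as large as the whole segment.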

```java
private final BitSet bits;

@Override
public int nextDoc() {
  return advance(doc + 1);
}

@Override
public int advance(int target) {
  if (target >= length) {
    return doc = NO_MORE_DOCS;
  }
  return doc = bits.nextSetBit(target);
}
```
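To see what this iteration pattern looks like in isolation, here is a small stand-alone sketch built on `java.util.BitSet` instead of Lucene's own bit set classes. The class name `BitSetDocIteratorSketch` is hypothetical, and the extra checks (the `-1` return value of `java.util.BitSet.nextSetBit` and the guard after exhaustion) are adaptations for this sketch rather than part of the Lucene code.

```java
import java.util.BitSet;

/** Stand-alone sketch of a doc ID iterator over a bit set; not Lucene code. */
final class BitSetDocIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final BitSet bits;
    private final int length; // number of documents in the segment
    private int doc = -1;

    BitSetDocIteratorSketch(BitSet bits, int length) {
        this.bits = bits;
        this.length = length;
    }

    /** Moves to the next set bit after the current document. */
    int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            return doc; // already exhausted; avoids overflow of doc + 1
        }
        return advance(doc + 1);
    }

    /** Moves to the first set bit at or after target. */
    int advance(int target) {
        if (target >= length) {
            return doc = NO_MORE_DOCS;
        }
        // java.util.BitSet returns -1 when no further bit is set.
        int next = bits.nextSetBit(target);
        return doc = (next < 0 ? NO_MORE_DOCS : next);
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(200);
        bits.set(1);
        bits.set(64);
        bits.set(130);

        BitSetDocIteratorSketch it = new BitSetDocIteratorSketch(bits, 200);
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            System.out.println(d);
        }
    }
}
```

Each call has to scan the underlying words for the next set bit, so every step involves bit operations rather than a plain array read.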

Next, `TermQuery`

```java
@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context)) : "The top-reader used to create Weight is not the same as the current reader's top-reader (" + ReaderUtil.getTopLevelContext(context);
  final TermsEnum termsEnum = getTermsEnum(context);
  if (termsEnum == null) {
    return null;
  }
  LeafSimScorer scorer = new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores());
  if (scoreMode == ScoreMode.TOP_SCORES) {
    return new TermScorer(this, termsEnum.impacts(PostingsEnum.FREQS), scorer);
  } else {
    return new TermScorer(this, termsEnum.postings(null, scoreMode.needsScores() ? PostingsEnum.FREQS : PostingsEnum.NONE), scorer);
  }
}
```

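Compared with PointRangeQuery, the scorer is built (BUILD_SCORER) as soon as the TermsEnum has been obtained, which in practice just means reading the posting list from file. For the NEXT_DOC operation (BlockDocsEnum), the comparison comes down to two points: there are no bit operations, and the time complexity is the optimal O(n), where n is the number of matching documents.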

```java
private final long[] docBuffer = new long[BLOCK_SIZE+1];

@Override
public int nextDoc() throws IOException {
  if (docBufferUpto == BLOCK_SIZE) {
    refillDocs(); // we don't need to load freqBuffer for now (will be loaded later if necessary)
  }

  doc = (int) docBuffer[docBufferUpto];
  docBufferUpto++;
  return doc;
}
```
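The block-buffered iteration above can be mimicked with a small self-contained sketch. Everything in it is an assumption made for illustration: the class `BlockPostingsSketch`, the in-memory `deltas` array standing in for the encoded posting list on disk, and a tiny `BLOCK_SIZE` of 4 (Lucene's real block size is larger). It only aims to show why each `nextDoc()` call is usually a single array read.

```java
/** Toy block-buffered posting list iterator; not Lucene code. */
final class BlockPostingsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    static final int BLOCK_SIZE = 4; // deliberately tiny for the example

    private final int[] deltas;                      // gaps between consecutive doc IDs
    private final int[] docBuffer = new int[BLOCK_SIZE];
    private int bufferLen = 0;                       // valid entries in docBuffer
    private int bufferUpto = 0;                      // next entry to return
    private int nextDelta = 0;                       // read position in deltas
    private int lastDoc = 0;                         // last decoded doc ID

    BlockPostingsSketch(int[] deltas) {
        this.deltas = deltas;
    }

    /** Encodes sorted doc IDs as gaps; the first gap is measured from 0. */
    static int[] encode(int[] sortedDocs) {
        int[] out = new int[sortedDocs.length];
        int prev = 0;
        for (int i = 0; i < sortedDocs.length; i++) {
            out[i] = sortedDocs[i] - prev;
            prev = sortedDocs[i];
        }
        return out;
    }

    /** Returns the next doc ID; refills the buffer once per block. */
    int nextDoc() {
        if (bufferUpto == bufferLen) {
            if (nextDelta == deltas.length) {
                return NO_MORE_DOCS;
            }
            refill();
        }
        return docBuffer[bufferUpto++];
    }

    /** Decodes up to BLOCK_SIZE gaps into absolute doc IDs. */
    private void refill() {
        bufferLen = Math.min(BLOCK_SIZE, deltas.length - nextDelta);
        for (int i = 0; i < bufferLen; i++) {
            lastDoc += deltas[nextDelta++];
            docBuffer[i] = lastDoc;
        }
        bufferUpto = 0;
    }

    public static void main(String[] args) {
        int[] docs = {0, 3, 4, 9, 10, 42};
        BlockPostingsSketch it = new BlockPostingsSketch(encode(docs));
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            System.out.println(d);
        }
    }
}
```

The iterator only ever visits documents that contain the term, so the total work is proportional to the number of matches rather than to the size of the segment.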

Results in production:

Before the optimization:

After the optimization:
