Squeezing ES: From numeric to keyword, Instantly as Fast as the Flash

A while back, a colleague wanted to optimize an Elasticsearch query without changing any code. The (simplified) query is:

{
 "query": {
   "term": {
     "status": 1
   }
 }
}

The data below comes from a test environment. Document count: 135999. Distribution of status values: [ { "key" : 1, "doc_count" : 131371 }, { "key" : 3, "doc_count" : 206 }, { "key" : 2, "doc_count" : 33 } ]. All data sits in a freshly built index.

Profiling the query gives the following:

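The profile above shows a PointRangeQuery. Having learned that this field is only ever matched by equality (like where status = ?), we changed the field type from long to keyword (creating the mapping by hand and then reindexing, which of course also changes the field's semantics). The result: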

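Performance improves dramatically, 3.943ms -> 0.137ms (the larger the data set, the bigger the advantage). The query type changes from PointRangeQuery to TermQuery, and the two phases BUILD_SCORER and NEXT_DOC both get much faster. On how to choose a data type, the official ES documentation has a fitting note: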

Mapping numeric identifiers

Not all numeric data should be mapped as a numeric field data type. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries.

Identifiers, such as an ISBN or a product ID, are rarely used in range queries. However, they are often retrieved using term-level queries.

Consider mapping a numeric identifier as a keyword if:

  - You don't plan to search for the identifier data using range queries.
  - Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.

If you're unsure which to use, you can use a multi-field to map the data as both a keyword and a numeric data type.
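The type change described earlier (a hand-written mapping followed by a reindex) could be issued roughly as follows. This is only a sketch under stated assumptions: the cluster address http://localhost:9200, the index names passed in, and the class KeywordReindexSketch are all hypothetical, and the typeless mapping body assumes Elasticsearch 7 or later. The code only builds the two requests and does not send them.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

/** Builds (but does not send) the two REST calls: create the new index, then reindex into it. */
final class KeywordReindexSketch {
    // Assumed cluster address; replace with the real one.
    static final String ES = "http://localhost:9200";

    // Mapping that declares status as keyword instead of long.
    static final String MAPPING_BODY =
        "{\"mappings\":{\"properties\":{\"status\":{\"type\":\"keyword\"}}}}";

    /** Body for POST /_reindex, copying every document from source to dest. */
    static String reindexBody(String source, String dest) {
        return "{\"source\":{\"index\":\"" + source + "\"},"
             + "\"dest\":{\"index\":\"" + dest + "\"}}";
    }

    /** PUT /<index> with the keyword mapping. */
    static HttpRequest createIndex(String index) {
        return HttpRequest.newBuilder(URI.create(ES + "/" + index))
            .header("Content-Type", "application/json")
            .PUT(BodyPublishers.ofString(MAPPING_BODY))
            .build();
    }

    /** POST /_reindex from the old index into the new one. */
    static HttpRequest reindex(String source, String dest) {
        return HttpRequest.newBuilder(URI.create(ES + "/_reindex"))
            .header("Content-Type", "application/json")
            .POST(BodyPublishers.ofString(reindexBody(source, dest)))
            .build();
    }
}
```

Each request can then be sent with java.net.http.HttpClient, for example HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).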

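Out of curiosity, let's see why the gap is so large. Searching the source code for the keywords from the profile (PointRangeQuery / TermQuery and BUILD_SCORER / NEXT_DOC) turns up the following: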

First, PointRangeQuery:

@Override
public Scorer get(long leadCost) throws IOException {
  if (values.getDocCount() == reader.maxDoc()
      && values.getDocCount() == values.size()
      && cost() > reader.maxDoc() / 2) {
    // If all docs have exactly one value and the cost is greater
    // than half the leaf size then maybe we can make things faster
    // by computing the set of documents that do NOT match the range
    final FixedBitSet result = new FixedBitSet(reader.maxDoc());
    result.set(0, reader.maxDoc());
    int[] cost = new int[] { reader.maxDoc() };
    values.intersect(getInverseIntersectVisitor(result, cost));
    final DocIdSetIterator iterator = new BitSetIterator(result, cost[0]);
    return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
  }

  values.intersect(visitor);
  DocIdSetIterator iterator = result.build().iterator();
  return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
}

The method above is the scorer construction logic. Two points matter here:

  1. PointValues: Points represent numeric values and are indexed differently than ordinary text. Instead of an inverted index, points are indexed with datastructures such as KD-trees. These structures are optimized for operations such as range, distance, nearest-neighbor, and point-in-polygon queries.
/** Finds all documents and points matching the provided visitor.
 *  This method does not enforce live documents, so it's up to the caller
 *  to test whether each document is deleted, if necessary. */
public abstract void intersect(IntersectVisitor visitor) throws IOException;

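Lucene organizes numeric values in a KD-tree. At query time it uses the index to read leaf blocks and filters them, marking the matching documents (the values.intersect call in the snippet). The time spent here covers both building the DocIdSetIterator and processing the PointValues (BUILD_SCORER).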

  2. BitSetIterator: A DocIdSetIterator which iterates over set bits in a bit set.
private final BitSet bits;

@Override
public int nextDoc() {
  return advance(doc + 1);
}

@Override
public int advance(int target) {
  if (target >= length) {
    return doc = NO_MORE_DOCS;
  }
  return doc = bits.nextSetBit(target);
}
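
The two points above can be modeled with nothing but java.util.BitSet. This is a minimal sketch, not Lucene code: the class InverseMatchSketch and its in-memory values array are hypothetical stand-ins for PointValues, and a real KD-tree walk would skip whole cells rather than check every document as this toy loop does. It mirrors the branch that starts with all bits set and clears the non-matching documents, then walks the result with nextSetBit the way BitSetIterator does.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of the "inverse" branch plus bit-set iteration. */
final class InverseMatchSketch {

    /**
     * values[docId] is the single numeric value of that document
     * (hypothetical stand-in for PointValues).
     */
    static BitSet matchByClearing(long[] values, long target) {
        BitSet result = new BitSet(values.length);
        result.set(0, values.length);      // start by assuming every doc matches
        for (int doc = 0; doc < values.length; doc++) {
            if (values[doc] != target) {
                result.clear(doc);         // clear each doc that does NOT match
            }
        }
        return result;
    }

    /** Collects matching doc IDs by repeatedly asking for the next set bit. */
    static List<Integer> iterate(BitSet bits) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docs.add(doc);
        }
        return docs;
    }
}
```

Calling InverseMatchSketch.iterate(InverseMatchSketch.matchByClearing(values, 1L)) yields the IDs of the documents whose value is 1. Note that all the matching work happens before the first document can be returned, which is the part the profile reports under BUILD_SCORER.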

Now TermQuery:

@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context)) : "The top-reader used to create Weight is not the same as the current reader's top-reader (" + ReaderUtil.getTopLevelContext(context);;
  final TermsEnum termsEnum = getTermsEnum(context);
  if (termsEnum == null) {
    return null;
  }
  LeafSimScorer scorer = new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores());
  if (scoreMode == ScoreMode.TOP_SCORES) {
    return new TermScorer(this, termsEnum.impacts(PostingsEnum.FREQS), scorer);
  } else {
    return new TermScorer(this, termsEnum.postings(null, scoreMode.needsScores() ? PostingsEnum.FREQS : PostingsEnum.NONE), scorer);
  }
}

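Compared with PointRangeQuery, the scorer is built (BUILD_SCORER) as soon as the TermsEnum is obtained, which amounts to simply reading the posting list from file. The NEXT_DOC operation (BlockDocsEnum) is also cheaper: there are no bit operations, and the time complexity is the optimal O(n), where n is the number of matching documents.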

private final long[] docBuffer = new long[BLOCK_SIZE+1];

@Override
public int nextDoc() throws IOException {
  if (docBufferUpto == BLOCK_SIZE) {
    refillDocs(); // we don't need to load freqBuffer for now (will be loaded later if necessary)
  }

  doc = (int) docBuffer[docBufferUpto];
  docBufferUpto++;
  return doc;
}
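
For contrast with the bit-set walk, here is a minimal sketch of NEXT_DOC over a posting list. It is an illustration under assumptions, not Lucene's implementation: the class PostingsSketch, the in-memory int[] standing in for the on-disk posting list, and the tiny BLOCK_SIZE of 4 are all hypothetical, and the real BlockDocsEnum decodes compressed blocks rather than copying an array.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy postings iterator: sorted doc IDs served from a small refillable buffer. */
final class PostingsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    static final int BLOCK_SIZE = 4;   // tiny on purpose; an assumption for this sketch

    private final int[] postings;      // full sorted posting list (stand-in for on-disk data)
    private final int[] docBuffer = new int[BLOCK_SIZE];
    private int read = 0;              // postings copied into the buffer so far
    private int bufferLen = 0;         // number of valid entries in docBuffer
    private int bufferUpto = 0;        // next buffer position to return

    PostingsSketch(int[] sortedDocIds) {
        this.postings = sortedDocIds;
    }

    /** Copies the next block of doc IDs into the buffer. */
    private void refill() {
        bufferLen = Math.min(BLOCK_SIZE, postings.length - read);
        System.arraycopy(postings, read, docBuffer, 0, bufferLen);
        read += bufferLen;
        bufferUpto = 0;
    }

    /** Returns the next matching doc ID, or NO_MORE_DOCS when exhausted. */
    int nextDoc() {
        if (bufferUpto == bufferLen) {
            if (read == postings.length) {
                return NO_MORE_DOCS;
            }
            refill();
        }
        return docBuffer[bufferUpto++];
    }

    /** Drains the iterator into a list. */
    static List<Integer> drain(PostingsSketch it) {
        List<Integer> out = new ArrayList<>();
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            out.add(d);
        }
        return out;
    }
}
```

Each nextDoc call is an array read plus an occasional refill, so returning n matching documents costs O(n) regardless of how many documents the segment holds.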

Production results:

Before optimization:
After optimization: