Squeezing more out of ES: switching from numeric to keyword for lightning-fast queries

A while back, a colleague wanted to optimize an Elasticsearch query without changing any code. The (simplified) query looks like this:

{
 "query": {
   "term": {
     "status": 1
   }
 }
}

The data below comes from a test environment. Document count: 135999. Distribution of status values: [ { "key" : 1, "doc_count" : 131371 }, { "key" : 3, "doc_count" : 206 }, { "key" : 2, "doc_count" : 33 } ]. All data sits in a newly created index.

Profiling the query gives the following:

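The profile output shows that the query runs as a PointRangeQuery. After confirming that this field is only ever used for equality matching (similar to where status = ?), we changed the field type from long to keyword (by manually creating the mapping and then reindexing, which of course also changes the field's semantics). The result: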

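The improvement is dramatic: 3.943ms -> 0.137ms (the larger the data set, the more pronounced the advantage). The query type changes from PointRangeQuery to TermQuery, and both the BUILD_SCORER and NEXT_DOC phases get much faster. The official ES documentation has fitting guidance on choosing the data type: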

Mapping numeric identifiers

Not all numeric data should be mapped as a numeric field data type. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries. Identifiers, such as an ISBN or a product ID, are rarely used in range queries. However, they are often retrieved using term-level queries.

Consider mapping a numeric identifier as a keyword if:

  - You don't plan to search for the identifier data using range queries.
  - Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.

If you're unsure which to use, you can use a multi-field to map the data as both a keyword and a numeric data type.

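Out of curiosity, let's look at why the gap is so large. Searching for the keywords from the profile output (PointRangeQuery / TermQuery, BUILD_SCORER, NEXT_DOC) leads to the following source code: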

First, PointRangeQuery:

@Override
public Scorer get(long leadCost) throws IOException {
  if (values.getDocCount() == reader.maxDoc()
      && values.getDocCount() == values.size()
      && cost() > reader.maxDoc() / 2) {
    // If all docs have exactly one value and the cost is greater
    // than half the leaf size then maybe we can make things faster
    // by computing the set of documents that do NOT match the range
    final FixedBitSet result = new FixedBitSet(reader.maxDoc());
    result.set(0, reader.maxDoc());
    int[] cost = new int[] { reader.maxDoc() };
    values.intersect(getInverseIntersectVisitor(result, cost));
    final DocIdSetIterator iterator = new BitSetIterator(result, cost[0]);
    return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
  }

  values.intersect(visitor);
  DocIdSetIterator iterator = result.build().iterator();
  return new ConstantScoreScorer(weight, score(), scoreMode, iterator);
}

The method above is the scorer construction logic. Two points are worth noting:

  1. PointValues: Points represent numeric values and are indexed differently than ordinary text. Instead of an inverted index, points are indexed with data structures such as KD-trees. These structures are optimized for operations such as range, distance, nearest-neighbor, and point-in-polygon queries.
/** Finds all documents and points matching the provided visitor.
 *  This method does not enforce live documents, so it's up to the caller
 *  to test whether each document is deleted, if necessary. */
public abstract void intersect(IntersectVisitor visitor) throws IOException;

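For numeric values, Lucene organizes the data in a KD-tree. At query time it uses the index to read leaf blocks and filters their contents (the values.intersect call in the snippet). The time spent here covers both building the DocIdSetIterator and processing the PointValues (BUILD_SCORER).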

  2. BitSetIterator: A DocIdSetIterator which iterates over set bits in a bit set.
private final BitSet bits;

@Override
public int nextDoc() {
  return advance(doc + 1);
}

@Override
public int advance(int target) {
  if (target >= length) {
    return doc = NO_MORE_DOCS;
  }
  return doc = bits.nextSetBit(target);
}
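
To make these two points concrete, here is a minimal Java sketch of the general (second) branch of the method above. It rests on simplifying assumptions: a one-dimensional index whose (value, docId) pairs are sorted by value and cut into fixed-size leaf blocks, with a plain java.util.BitSet standing in for the doc-id set. ToyPointIndex, intersect and drain are hypothetical names made up for illustration; this is not Lucene's BKD implementation.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

// Toy 1-D stand-in for a points index (hypothetical, not Lucene code).
class ToyPointIndex {
    private final long[][] sorted;   // each entry is {value, docId}, sorted by value
    private final int blockSize;     // number of entries per leaf block

    ToyPointIndex(long[] valueByDoc, int blockSize) {
        this.blockSize = blockSize;
        this.sorted = new long[valueByDoc.length][];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            sorted[doc] = new long[] { valueByDoc[doc], doc };
        }
        Arrays.sort(sorted, Comparator.comparingLong((long[] pair) -> pair[0]));
    }

    // Rough analogue of BUILD_SCORER: visit leaf blocks and mark every
    // doc whose value lies in [min, max] in a bit set.
    BitSet intersect(long min, long max) {
        BitSet result = new BitSet(sorted.length);
        for (int start = 0; start < sorted.length; start += blockSize) {
            int end = Math.min(start + blockSize, sorted.length);
            long blockMin = sorted[start][0];
            long blockMax = sorted[end - 1][0];
            if (blockMax < min || blockMin > max) {
                continue; // block lies entirely outside the range: skip it
            }
            boolean fullyInside = blockMin >= min && blockMax <= max;
            for (int i = start; i < end; i++) {
                // blocks fully inside the range need no per-value check
                if (fullyInside || (sorted[i][0] >= min && sorted[i][0] <= max)) {
                    result.set((int) sorted[i][1]);
                }
            }
        }
        return result;
    }

    // Rough analogue of NEXT_DOC: walk the set bits in docId order.
    static int[] drain(BitSet bits) {
        int[] docs = new int[bits.cardinality()];
        int n = 0;
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docs[n++] = doc;
        }
        return docs;
    }
}
```

With values {1, 1, 3, 1, 2, 1, 1, 3} and a block size of 2, intersect(1, 1) marks docs 0, 1, 3, 5 and 6. In this sketch an equality lookup is simply the range [1, 1], and the matches arrive in value order rather than docId order, which is why they have to be collected into a set before iteration can begin.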

Now TermQuery:

@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context)) : "The top-reader used to create Weight is not the same as the current reader's top-reader (" + ReaderUtil.getTopLevelContext(context);;
  final TermsEnum termsEnum = getTermsEnum(context);
  if (termsEnum == null) {
    return null;
  }
  LeafSimScorer scorer = new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores());
  if (scoreMode == ScoreMode.TOP_SCORES) {
    return new TermScorer(this, termsEnum.impacts(PostingsEnum.FREQS), scorer);
  } else {
    return new TermScorer(this, termsEnum.postings(null, scoreMode.needsScores() ? PostingsEnum.FREQS : PostingsEnum.NONE), scorer);
  }
}

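Compared with PointRangeQuery, the scorer is built (BUILD_SCORER) as soon as the TermsEnum has been obtained, which essentially amounts to reading the posting list from the file. The NEXT_DOC operation (BlockDocsEnum) is also cheaper by comparison: it involves no bit operations, and it runs in the optimal O(n) time, where n is the number of matching documents.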

private final long[] docBuffer = new long[BLOCK_SIZE+1];

@Override
public int nextDoc() throws IOException {
  if (docBufferUpto == BLOCK_SIZE) {
    refillDocs(); // we don't need to load freqBuffer for now (will be loaded later if necessary)
  }

  doc = (int) docBuffer[docBufferUpto];
  docBufferUpto++;
  return doc;
}
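
The block-buffered iteration above can be reproduced in a few lines of self-contained Java. This is only a sketch under stated assumptions: the posting list is a sorted int array already in memory, and the block size is shrunk to 4 so the refill path is easy to follow. ToyPostingsIterator and its fields are hypothetical names, not Lucene classes.

```java
// Toy posting-list iterator (hypothetical, not Lucene code).
class ToyPostingsIterator {
    static final int BLOCK_SIZE = 4;                     // tiny on purpose, for illustration
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;   // sentinel for "exhausted"

    private final int[] postings;                        // sorted docIds for one term
    private final int[] docBuffer = new int[BLOCK_SIZE];
    private int nextToLoad = 0;                          // offset of the next block in postings
    private int bufferLen = 0;                           // valid entries in docBuffer
    private int bufferUpto = 0;                          // read position in docBuffer

    ToyPostingsIterator(int[] postings) {
        this.postings = postings;
    }

    // Each call is an array read plus an occasional block refill,
    // so draining the list costs O(n) for n matching docs.
    int nextDoc() {
        if (bufferUpto == bufferLen) {
            if (nextToLoad >= postings.length) {
                return NO_MORE_DOCS;
            }
            refill();
        }
        return docBuffer[bufferUpto++];
    }

    // Copy the next block of docIds into the buffer.
    private void refill() {
        bufferLen = Math.min(BLOCK_SIZE, postings.length - nextToLoad);
        System.arraycopy(postings, nextToLoad, docBuffer, 0, bufferLen);
        nextToLoad += bufferLen;
        bufferUpto = 0;
    }
}
```

For the postings {0, 1, 3, 5, 6}, successive nextDoc() calls return 0, 1, 3, 5, 6 and then NO_MORE_DOCS. Nothing has to be collected first: the list is already stored in docId order, so the iterator only reads it forward.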

Production results:

Before optimization:
After optimization: