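Background

Some of our iterator use cases may need backward (Prev) iteration, whereas until now we have mostly used forward (Next) iteration. This test therefore examines whether iteration performance degrades when iterating in reverse, and also measures the effect of reversing the key order stored in the SST files.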

Test Plan

Environment

OS	ubuntu
CPU	40 / 40
Disk	nvme 2.9T
Memory	100G / 100G
Rocksdb	6.20.3
data	learn search-0 worker
blockcache	0
cache_index_and_filter_blocks	0
target_file_size_base	512MB
level0_file_num_compaction_trigger	8
level0_slowdown_writes_trigger	10000
level0_stop_writes_trigger	10000
max_background_compactions	4
rate_limiter	1500MB/s
num_level	7	Number of LSM tree levels
compression_per_level	default(snappy*7)	Compression algorithm for each level
key number	4000000
comparator	default/ReverseBytewiseComparator

Test Step

round 1

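Shuffle the keys randomly within each group of n entries so that consecutive keys are not written next to each other;
Write each group of n entries as one WriteBatch;
Iterate over all data with the backward iterator (Prev);
Iterate over all data with the forward iterator (Next);
Scan all data under a given key prefix with the backward iterator, averaged over ten runs;
Scan all data under a given key prefix with the forward iterator, averaged over ten runs.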
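The per-group shuffle in the first step can be sketched with the standard library alone. This is an illustrative alternative to a hand-rolled random picker, not the code used for the measurements; `ShuffledIndices` is a hypothetical helper name, and the fixed-size group of n slots is taken from the steps above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Returns a random permutation of [0, n). Slot i of a batch can then be
// filled with the key whose logical position is perm[i], so that keys of
// one group are not written in sorted order.
std::vector<int> ShuffledIndices(int n, std::mt19937& rng) {
  assert(n >= 0);
  std::vector<int> perm(n);
  std::iota(perm.begin(), perm.end(), 0);       // 0, 1, ..., n-1
  std::shuffle(perm.begin(), perm.end(), rng);  // uniform random order
  return perm;
}
```

Passing the generator in by reference keeps a single random stream across batches, so successive groups get different permutations.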

round 2

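Build SST files in reversed key order using ReverseBytewiseComparator;
Shuffle the keys randomly within each group of n entries so that consecutive keys are not written next to each other;
Write each group of n entries as one WriteBatch;
Iterate over all data with the backward iterator (Prev);
Iterate over all data with the forward iterator (Next);
Scan all data under a given key prefix with the backward iterator, averaged over ten runs;
Scan all data under a given key prefix with the forward iterator, averaged over ten runs.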

Test Group

Different key counts, each tested with the default comparator and with ReverseBytewiseComparator:

1000000	3000000	5000000	7000000	9000000	11000000

Metric

metric	explanation
avg iter use time	Average time of a single iteration
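One way to obtain an averaged timing like this is a small helper that repeats a scan and divides the total by the number of runs. The sketch below is an assumption about how such averaging could be done, not the harness used for the reported numbers; `AverageMicros` is a hypothetical name, and it uses `std::chrono::steady_clock` rather than `gettimeofday`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `fn` `runs` times and returns the mean wall-clock time per run,
// in microseconds.
template <typename Fn>
double AverageMicros(Fn&& fn, int runs) {
  assert(runs > 0);
  using Clock = std::chrono::steady_clock;
  std::int64_t total_micros = 0;
  for (int i = 0; i < runs; ++i) {
    const auto start = Clock::now();
    fn();
    const auto elapsed = Clock::now() - start;
    total_micros +=
        std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  }
  return static_cast<double>(total_micros) / runs;
}
```

A prefix scan would be wrapped in a lambda and passed with `runs` set to 10 to match the "averaged over ten runs" steps.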

Code

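#include "rocksdb/db.h"
#include "rocksdb/table.h"
#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "rocksdb/write_batch.h"
#include <cassert>
#include <ctime>
#include <iostream>
#include <sys/time.h>
#include <stdlib.h>
#include <string>
#include <string_view>
#include <map>
#include <unordered_map>
#include <vector>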
 
#include "rocksdb/convenience.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"
#include "rocksdb/sst_file_manager.h"
#include "rocksdb/statistics.h"
#include "rocksdb/table.h"
#include "rocksdb/write_buffer_manager.h"
 
using namespace std;
using namespace rocksdb;
 
rocksdb::DB* db = nullptr;  
rocksdb::Options option;
 
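// Returns n distinct random integers in [min, max).
std::vector<int> randn(int max, int min, int n) {
    std::vector<int> randvec;
    std::map<int,bool> dict_map;
    // Seed only once: reseeding with time(NULL) on every call would repeat
    // the same sequence for all calls made within the same second.
    static bool seeded = false;
    if (!seeded) {
        srand(time(NULL));
        seeded = true;
    }
    for (int i = 0; i < n; i++) {
      while(true) {
        // RAND_MAX + 1.0 is computed in double, which avoids the signed
        // overflow of (RAND_MAX + 1) where RAND_MAX equals INT_MAX, so u
        // lands in [min, max) without any sign correction.
        int u = static_cast<int>((double)rand() / (RAND_MAX + 1.0) * (max - min)) + min;
        auto iter = dict_map.find(u);
        if (iter != dict_map.end()) {
            continue;
        }
        randvec.push_back(u);
        dict_map[u]=true;
        break;
      }
    }
    return randvec;
}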
 
void OpenDB() {
    rocksdb::BlockBasedTableOptions table_options;
    std::unordered_map<std::string, std::string> options_map;
    options_map["block_size"] = std::to_string(32 * 1024);  // 32KB
    rocksdb::GetBlockBasedTableOptionsFromMap(table_options, options_map, &table_options);
    table_options.no_block_cache = true;
    table_options.cache_index_and_filter_blocks = false;
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
    option.create_if_missing = true;
    option.compression = rocksdb::CompressionType::kLZ4Compression;
    // Uncomment for round 2
    // option.comparator = rocksdb::ReverseBytewiseComparator();
    option.compression_per_level = std::vector<rocksdb::CompressionType>{kNoCompression,kNoCompression,kNoCompression,kNoCompression,kNoCompression,kZSTD,kZSTD};
    option.table_factory.reset(NewBlockBasedTableFactory(table_options));
    auto s = rocksdb::DB::Open(option, "./db", &db);
    if (!s.ok()) {
        cout << "open faled :  " << s.ToString() << endl;
        exit(-1);
    }
    cout << "Finish open !"<< endl;
}
 
uint64_t NowMicros() {
  struct timeval tv;
  gettimeofday(&tv, nullptr);
  return static_cast<uint64_t>(tv.tv_sec* 1000000 + tv.tv_usec);
}
 
void NextTraverse() {
  uint64_t start_ts = NowMicros();
  auto begin_it = db->NewIterator(rocksdb::ReadOptions());
  for (begin_it->SeekToFirst(); begin_it->Valid(); begin_it->Next()) {
    assert(begin_it->Valid());
  }
  delete begin_it;
  cout << "NextTraverse use time: " << NowMicros() - start_ts << endl;
}
 
void PrevTraverse() {
  uint64_t start_ts = NowMicros();
  auto begin_it = db->NewIterator(rocksdb::ReadOptions());
  for (begin_it->SeekToLast(); begin_it->Valid(); begin_it->Prev()) {
    assert(begin_it->Valid());
  }
  delete begin_it;
  cout << "PrevTraverse use time: " << NowMicros() - start_ts << endl;
}
 
void NextTraverseWithKey(std::string_view key, std::string_view prefix) {
  uint64_t start_ts = NowMicros();
  auto begin_it = db->NewIterator(rocksdb::ReadOptions());
  begin_it->Seek(key);
  while (begin_it->Valid() && begin_it->key().starts_with(prefix)) {
    assert(begin_it->Valid());
    begin_it->Next();
  }
  delete begin_it;
  cout << "NextTraverseWithKey use time: " << NowMicros() - start_ts << endl;
}
 
void PrevTraverseWithKey(std::string_view key, std::string_view prefix) {
  uint64_t start_ts = NowMicros();
  auto begin_it = db->NewIterator(rocksdb::ReadOptions());
  begin_it->SeekForPrev(key);
  while (begin_it->Valid() && begin_it->key().starts_with(rocksdb::Slice(prefix.data(), prefix.length()))) {
    assert(begin_it->Valid());
    begin_it->Prev();
  }
  delete begin_it;
  cout << "PrevTraverseWithKey use time: " << NowMicros() - start_ts << endl;
}
 
/*
    build discrete data
    Ensure that the keys of each prefix are discrete among the 50 keys
 
    [0,1,2,3,4,,,,,,,,,49]
         ^              ^
         |              |
         test1_key      test1_key10
 
    [0,1,2,3,4,,,,,,,,,49]
       ^              ^
       |              |
    test1_key40      test1_key41
*/
void BuildBaseData() {
  int j = 0;
  while (j < 4000000) {
    auto rand_vec = randn(50,0,50);
    rocksdb::WriteBatch writeBatch;
    std::vector<std::string> keys;
    keys.reserve(50);
    keys.resize(50);
 
    for (size_t idx = 0; idx < rand_vec.size(); idx++) {
      keys[rand_vec[idx]] = std::to_string(idx) + "_test_key_" + std::to_string(j);
    }
 
    for (size_t idx = 0; idx < keys.size(); idx++) {
      auto status = writeBatch.Put(keys[idx], "test");
      if (!status.ok()) {
        cout << "writeBatch error: " << status.ToString() << endl;
        continue;
      }
    }
    auto status = db->Write(rocksdb::WriteOptions(), &writeBatch);
    if (!status.ok()) {
      cout << "Write error: " << status.ToString() << endl;
      continue;
    }
 
    j+=50;
  }
  uint64_t keynum = 0;
  db->GetIntProperty("rocksdb.estimate-num-keys", &keynum);
  cout << "Finish write ! keynum:" << keynum << endl;
 
}
 
int main(int argc, char *argv[]) {
  OpenDB();
  BuildBaseData();
  cout << "BuildBaseData done\n" << endl;
  PrevTraverse();
  NextTraverse();
 
  std::string_view prefix= "0_test_key";
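  // round 1
  // Note: keys compare bytewise, so "0_test_key_999950" sorts after
  // "0_test_key_4000000"; this backward scan starts below the largest key
  // of the prefix and visits fewer keys than the forward scan.
  PrevTraverseWithKey("0_test_key_4000000", prefix);
  NextTraverseWithKey("0_test_key_0", prefix);

  // round 2
  // The key order is reversed, so the start keys are swapped
  // PrevTraverseWithKey("0_test_key_0", prefix);
  // NextTraverseWithKey("0_test_key_4000000", prefix);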
  return 0;
}

Compilation Command

clang++ -std=c++17 main.cc -o demo ./rocksdb/librocksdb.a -I ./rocksdb/include -lpthread -ldl -lrt -lsnappy -lgflags -lz -lbz2 -llz4 -lzstd

Test Results

With the default comparator (times in microseconds)

keynum:8000000 
PrevTraverse use time: 3180573 
NextTraverse use time: 1825431 
(prefix)PrevTraverseWithKey use time: 60031 
(prefix)NextTraverseWithKey use time: 40023

With ReverseBytewiseComparator (times in microseconds)

keynum:8000000 
PrevTraverse use time: 3183655 
NextTraverse use time: 1856932 
(prefix)PrevTraverseWithKey use time: 69882 
(prefix)NextTraverseWithKey use time: 37530

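Conclusion

Results

Under the same comparator:

Full iteration: Next is 39.5% faster than Prev

With prefix filtering: Next is 21.6% faster than Prev

Conclusions

Next is always faster than Prev;
With ReverseBytewiseComparator, logical backward iteration (round 2 Next compared with round 1 Prev) improves by about 37%, which is essentially the same efficiency as forward (Next) iteration under the default comparator.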
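How such a percentage is computed depends on which time is taken as the baseline. The sketch below assumes the Prev time is the baseline, that is, (prev - next) / prev; the helper name `TimeSavedPercent` is hypothetical. Applied to the single default-comparator run listed under Test Results, it gives roughly 42.6% for the full scan and 33.3% for the prefix scan; the summarized figures above may be aggregated over runs that are not listed here.

```cpp
#include <cassert>

// Share of the Prev time that Next saves, in percent:
// (prev - next) / prev * 100.
double TimeSavedPercent(double prev_micros, double next_micros) {
  assert(prev_micros > 0);
  return (prev_micros - next_micros) / prev_micros * 100.0;
}
```

For example, `TimeSavedPercent(60031, 37530)` compares the round 1 prefix Prev scan with the round 2 prefix Next scan and yields about 37.5%.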

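Principle Analysis

ReverseBytewiseComparator

Performance tests during early MyRocks development already exposed the gap between RocksDB's forward and reverse iterators, and the team proposed a fairly friendly solution: a reverse comparator. The order in which RocksDB keys are written and read depends on the comparator; in other words, the ordering inside an SST file is determined by the comparator.

It is like sorting a vector: a custom comparator decides how the elements are ordered. In RocksDB the comparator is applied mainly when building the skiplist in the memtable and when compaction writes keys into a new SST file, so it determines the order of keys in the SST files. To address the reverse-scan pain point, MyRocks implemented a reverse comparator, ReverseBytewiseComparator, which stores keys in the opposite order. It is enabled with options.comparator = rocksdb::ReverseBytewiseComparator();. The obvious benefit is that what used to be a reverse scan becomes a forward scan.

ReverseBytewiseComparator effectively reduces the latency of logical backward iteration. Under the same comparator, Next and Prev differ by roughly 20% to 40%, with prefix-filtered scans faring better than full scans.

Note: PrevTraverse under the default comparator and NextTraverse under ReverseBytewiseComparator visit keys in the same logical order; likewise, PrevTraverseWithKey in round 1 corresponds to NextTraverseWithKey in round 2.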
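The vector analogy can be made concrete. The following is a minimal sketch using only the C++ standard library, not RocksDB; `LargestKeys` is a hypothetical helper name. Sorting with a reversed comparator lets the "last n" keys be read with a plain front-to-back pass, which mirrors the idea of storing keys in reverse order and scanning them with Next.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the n largest keys in descending order. The reversed comparator
// puts the largest key first, so taking the "last n" keys becomes a
// forward read of the first n elements.
std::vector<std::string> LargestKeys(std::vector<std::string> keys,
                                     std::size_t n) {
  std::sort(keys.begin(), keys.end(), std::greater<std::string>());
  assert(std::is_sorted(keys.begin(), keys.end(),
                        std::greater<std::string>()));
  if (n < keys.size()) keys.resize(n);
  return keys;
}
```

With keys "a", "c", "b", "d" and n = 2, the result is "d" followed by "c".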

Why is Prev slower than Next?

1. MemTable

The skiplist in the MemTable is singly linked, so every Prev has to search again from the head. As a result, Next on the MemTable is O(1), while Prev is O(log N).

template<typename Key, class Comparator>
inline void SkipList<Key, Comparator>::Iterator::Next() {
  assert(Valid());
  node_ = node_->Next(0);
}
 
template<typename Key, class Comparator>
inline void SkipList<Key, Comparator>::Iterator::Prev() {
  // Instead of using explicit "prev" links, we just search for the
  // last node that falls before key.
  assert(Valid());
  node_ = list_->FindLessThan(node_->key);
  if (node_ == list_->head_) {
    node_ = nullptr;
  }
}
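The asymmetry in the excerpt above can be reproduced with a toy structure. The sketch below is a deliberate simplification: it uses a plain sorted singly linked list with no upper levels, so its backward search is linear rather than logarithmic, and the names `Node`, `Next` and `Prev` are illustrative rather than RocksDB's types.

```cpp
#include <cassert>

// Node of a sorted singly linked list: only a forward link exists.
struct Node {
  int key;
  Node* next;
};

// Forward step: follow the single link, O(1).
const Node* Next(const Node* node) {
  assert(node != nullptr);
  return node->next;
}

// Backward step: there is no "prev" link, so search from the head for the
// last node whose key is less than node->key. Returns nullptr if `node`
// is the first element.
const Node* Prev(const Node* head, const Node* node) {
  assert(node != nullptr);
  const Node* last_less = nullptr;
  for (const Node* cur = head; cur != nullptr && cur->key < node->key;
       cur = cur->next) {
    last_less = cur;
  }
  return last_less;
}
```

A real skiplist shortens the search with its higher levels, but each backward step still costs a fresh search rather than one pointer hop.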

2. SST file

void IndexBlockIter::NextImpl() { ParseNextIndexKey(); }
 
void IndexBlockIter::PrevImpl() {
  assert(Valid());
  // Scan backwards to a restart point before current_
  const uint32_t original = current_;
  while (GetRestartPoint(restart_index_) >= original) {
    if (restart_index_ == 0) {
      // No more entries
      current_ = restarts_;
      restart_index_ = num_restarts_;
      return;
    }
    restart_index_--;
  }
  SeekToRestartPoint(restart_index_);
  // Loop until end of current entry hits the start of original entry
  while (ParseNextIndexKey() && NextEntryOffset() < original) {
  }
}

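An SST file consists of blocks: one index block and multiple data blocks. Lookups and scans over an SST are implemented by a TwoLevelIterator composed of an index iterator and a data iterator. The index block is essentially a data block as well, except that it stores index entries, so Next/Prev on a TwoLevelIterator comes down to Next/Prev on a block. As with the memtable, entries inside a block can only be decoded in one direction: Next is O(1), whereas every Prev has to reposition (back up to a restart point and scan forward again), which makes it considerably slower than Next.

Recommendation

For queries that need the last n entries, or for recent-range queries over timestamp-style data, consider using ReverseBytewiseComparator to reverse the key order in the SST files. The forward and reverse iterators then swap roles, and iteration becomes more efficient.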
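The restart-point mechanics can be sketched with a toy block. This is an illustration under simplifying assumptions, not RocksDB's block format: entries are stored as a one-byte length followed by the key bytes (real blocks use prefix-compressed keys), and `ToyBlock`, `BuildBlock`, `NextOffset`, `PrevOffset` and `KeyAt` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy block: entries are stored back to back as [1-byte length][key bytes];
// `restarts` holds the byte offset of every restart_interval-th entry.
struct ToyBlock {
  std::string data;
  std::vector<std::size_t> restarts;
};

ToyBlock BuildBlock(const std::vector<std::string>& keys,
                    std::size_t restart_interval) {
  assert(restart_interval > 0);
  ToyBlock block;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    assert(keys[i].size() < 256);  // length must fit in one byte
    if (i % restart_interval == 0) block.restarts.push_back(block.data.size());
    block.data.push_back(static_cast<char>(keys[i].size()));
    block.data += keys[i];
  }
  return block;
}

// Offset of the entry that follows the one at `offset`: O(1).
std::size_t NextOffset(const ToyBlock& block, std::size_t offset) {
  assert(offset < block.data.size());
  return offset + 1 + static_cast<unsigned char>(block.data[offset]);
}

// Offset of the entry that precedes `offset`. Entries carry no backward
// pointer, so jump to the last restart point before `offset` and walk
// forward. Precondition: `offset` is an entry boundary other than 0.
std::size_t PrevOffset(const ToyBlock& block, std::size_t offset) {
  assert(offset > 0 && !block.restarts.empty());
  std::size_t r = block.restarts.size() - 1;
  while (block.restarts[r] >= offset) {
    assert(r > 0);
    --r;
  }
  std::size_t cur = block.restarts[r];
  while (NextOffset(block, cur) < offset) cur = NextOffset(block, cur);
  return cur;
}

// Key stored in the entry at `offset`.
std::string KeyAt(const ToyBlock& block, std::size_t offset) {
  assert(offset < block.data.size());
  return block.data.substr(offset + 1,
                           static_cast<unsigned char>(block.data[offset]));
}
```

Each backward step costs up to one restart interval of forward decoding, while a forward step only reads the current entry's length.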

Warning!

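Once data has been written with ReverseBytewiseComparator, it can no longer be opened with any other comparator; the database has to be rebuilt. Opening it with the default comparator fails with:
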
open faled :  Invalid argument: leveldb.BytewiseComparator: does not match existing comparator rocksdb.ReverseBytewiseComparator

RocksDB forward and reverse iterator efficiency comparison and optimization