Important Data Structures in RocksDB

ColumnFamilyData

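RocksDB has the concept of a column family, which is roughly analogous to a table in MySQL: think of a scores table and a roster table. Splitting data into separate "tables" implies that the data is logically isolated, so it would make little sense for them to share memtables or SST files. Each column family therefore has its own LSM tree: its own memtable, its own immutable memtables (imm tables), and its own SST files. The official RocksDB documentation states this explicitly: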

Column Families · facebook/rocksdb Wiki:
The main idea behind Column Families is that they share the write-ahead log and don't share memtables and table files. By sharing write-ahead logs we get awesome benefit of atomic writes. By separating memtables and table files, we are able to configure column families independently and delete them quickly.

How we keep track of live SST files · facebook/rocksdb Wiki:
Since each column family is a separate LSM, it also has its own list of versions with one that is "current".
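
To make the "shared write-ahead log, separate memtables" idea concrete, here is a minimal toy sketch. It is not RocksDB code: ToyDB, ToyWrite and every method name are hypothetical, the "log" is just an in-memory vector, and nothing is persisted. It only illustrates why one shared log record lets a batch touch several column families atomically while lookups stay isolated per column family.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy model, not RocksDB code.
struct ToyWrite {
  std::uint32_t cf_id;
  std::string key;
  std::string value;
};

using ToyMemTable = std::map<std::string, std::string>;

class ToyDB {
 public:
  // Creates an empty memtable for a new column family and returns its id.
  std::uint32_t CreateColumnFamily() {
    std::uint32_t id = next_cf_id_++;
    memtables_.emplace(id, ToyMemTable());
    return id;
  }

  // Applies a batch that may span several column families. The whole batch is
  // appended to the shared log as ONE record, so a replay after a crash would
  // see either all of it or none of it.
  bool Write(const std::vector<ToyWrite>& batch) {
    for (const ToyWrite& w : batch) {
      if (memtables_.count(w.cf_id) == 0) return false;  // unknown family
    }
    wal_.push_back(batch);
    for (const ToyWrite& w : batch) {
      memtables_[w.cf_id][w.key] = w.value;  // each family has its own table
    }
    return true;
  }

  // Looks a key up in one column family only.
  bool Get(std::uint32_t cf_id, const std::string& key,
           std::string* value) const {
    auto cf = memtables_.find(cf_id);
    if (cf == memtables_.end()) return false;
    auto it = cf->second.find(key);
    if (it == cf->second.end()) return false;
    *value = it->second;
    return true;
  }

  std::size_t WalRecordCount() const { return wal_.size(); }

 private:
  std::uint32_t next_cf_id_ = 0;
  std::vector<std::vector<ToyWrite>> wal_;           // shared by all families
  std::map<std::uint32_t, ToyMemTable> memtables_;   // one per family
};
```

Writing "alice" to both a scores family and a roster family in one batch produces a single log record, yet the same key resolves to a different value in each family.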

In RocksDB, a column family is represented by class ColumnFamilyData:

// db/column_family.h

// This class keeps all the data that a column family needs.
// Most methods require DB mutex held, unless otherwise noted
class ColumnFamilyData {
  // ...

  uint32_t id_; // ID of the column family
  const std::string name_; // name of the column family
  Version* dummy_versions_;  // Head of circular doubly-linked list of versions.
  Version* current_;         // == dummy_versions->prev_

  MemTable* mem_; // the active (mutable) memtable; exactly one per column family
  MemTableList imm_; // the immutable memtables; there may be several
  SuperVersion* super_version_; // points to the current SuperVersion
  // ...
};

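As the excerpt shows, ColumnFamilyData manages its own memtable (exactly one active memtable), its immutable memtables (possibly several), and its versions. So where are the SST files? They are reached through Version* current_. The next section introduces the Version data structure.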
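
The "one active memtable, several immutable memtables" arrangement can be sketched as follows. This is a toy under stated assumptions, not RocksDB code: ToyColumnFamily, the entry-count limit and the fake flush counter are all hypothetical, and the real switch is driven by memtable size in bytes and other conditions rather than by an entry count.

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <utility>

// Hypothetical toy model, not RocksDB code.
using ToyMemTable = std::map<std::string, std::string>;

class ToyColumnFamily {
 public:
  explicit ToyColumnFamily(std::size_t memtable_limit)
      : limit_(memtable_limit) {}

  // Writes always go to the single active memtable. Once it holds `limit_`
  // entries it is frozen and a fresh active memtable takes its place.
  void Put(const std::string& key, const std::string& value) {
    mem_[key] = value;
    if (mem_.size() >= limit_) SwitchMemTable();
  }

  // Pretends to flush the oldest immutable memtable into an SST file.
  bool FlushOldest() {
    if (imm_.empty()) return false;
    imm_.pop_back();
    ++num_sst_files_;
    return true;
  }

  std::size_t ActiveEntries() const { return mem_.size(); }
  std::size_t ImmutableCount() const { return imm_.size(); }
  std::size_t SstCount() const { return num_sst_files_; }

 private:
  void SwitchMemTable() {
    imm_.push_front(std::move(mem_));  // newest immutable memtable first
    mem_ = ToyMemTable();              // start a new, empty active memtable
  }

  std::size_t limit_;
  ToyMemTable mem_;               // exactly one active memtable
  std::deque<ToyMemTable> imm_;   // zero or more immutable memtables
  std::size_t num_sst_files_ = 0;
};
```

If writes arrive faster than flushes complete, the immutable list grows, which is why it is a list while the active memtable is a single object.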

Version

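Before looking at this class, a word about MVCC. MVCC (multi-version concurrency control) is a general concurrency-control technique, not something specific to implementing transactions. The concurrency in question here is "one writer, many readers": MVCC lets the writer and the readers proceed without blocking each other. Version applies exactly this idea. Why does RocksDB use MVCC to manage SST files? As flush and compaction jobs run in the background thread pool, the set of files at each level on disk keeps changing. How can a user's Get be served in the meantime? This is a classic read-write conflict. With no synchronization at all, a lookup could miss some SST files. With a lock held for the duration of each operation, a Get would block flush and compaction and vice versa, which is inefficient. MVCC avoids both problems and greatly improves concurrency.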

According to the official RocksDB documentation, How we keep track of live SST files · facebook/rocksdb Wiki:

In the end of a compaction or a mem table flush, a new version is created for the updated LSM tree. At one time, there is only one "current" version that represents the files in the up-to-date LSM tree

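Each time a flush or compaction finishes, a new Version is created to describe which SST files exist at each level. It can be thought of as a snapshot of the LSM tree's file layout, and the newest one is the current version. Until the next current version is installed, new Get requests can only work against this one. Once the next one is installed, earlier Gets may still be running against the old one, so the old version cannot be deleted right away. Old and new Versions are linked in a circular doubly-linked list. Every Version carries a reference count and may be deleted only when that count drops to 0. SST files are reference-counted in the same way. For details, see the RocksDB wiki page How we keep track of live SST files · facebook/rocksdb Wiki, which walks through a detailed example.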
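
The reference-counting rule can be illustrated with a small toy. This is a sketch under stated assumptions, not the RocksDB implementation: ToyVersion and ToyVersionList are hypothetical names, the code is single-threaded (RocksDB performs the equivalent steps with the DB mutex held), and SST files are represented by their names only. It keeps the same shape as the excerpt above: a dummy head node, a circular doubly-linked list, and a current version that is the node just before the dummy.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model, not RocksDB code. Single-threaded on purpose.
struct ToyVersion {
  std::vector<std::string> files;  // SST files visible in this version
  int refs = 0;
  ToyVersion* prev = this;  // a lone node is a circular list of one
  ToyVersion* next = this;
};

class ToyVersionList {
 public:
  ToyVersionList() = default;
  ToyVersionList(const ToyVersionList&) = delete;
  ToyVersionList& operator=(const ToyVersionList&) = delete;
  ~ToyVersionList() {
    while (dummy_.next != &dummy_) Remove(dummy_.next);
  }

  // Links a new version in front of the dummy head and makes it current.
  // The list itself holds one reference to whichever version is current.
  ToyVersion* Install(std::vector<std::string> files) {
    ToyVersion* v = new ToyVersion;
    v->files = std::move(files);
    v->prev = dummy_.prev;
    v->next = &dummy_;
    v->prev->next = v;
    dummy_.prev = v;
    ToyVersion* old = current_;
    current_ = v;
    Ref(v);
    if (old != nullptr) Unref(old);  // old survives only if a reader pins it
    return v;
  }

  ToyVersion* current() const { return current_; }

  void Ref(ToyVersion* v) { ++v->refs; }

  // Drops one reference; the version is unlinked and freed at zero.
  void Unref(ToyVersion* v) {
    assert(v->refs > 0);
    if (--v->refs == 0) Remove(v);
  }

  int LiveVersions() const {
    int n = 0;
    for (const ToyVersion* v = dummy_.next; v != &dummy_; v = v->next) ++n;
    return n;
  }

 private:
  void Remove(ToyVersion* v) {
    v->prev->next = v->next;
    v->next->prev = v->prev;
    delete v;
  }

  ToyVersion dummy_;               // head of the circular list
  ToyVersion* current_ = nullptr;  // == dummy_.prev once installed
};
```

A reader that calls Ref on the current version before a compaction installs a new one keeps the old node, and therefore its file list, alive until the matching Unref.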

// db/version_set.h

// A column family's version consists of the table and blob files owned by
// the column family at a certain point in time.
class Version {
  // ...
  ColumnFamilyData* cfd_;  // ColumnFamilyData to which this Version belongs
  Logger* info_log_;
  Statistics* db_statistics_;
  TableCache* table_cache_;
  BlobSource* blob_source_;
  const MergeOperator* merge_operator_;

  VersionStorageInfo storage_info_;
  VersionSet* vset_;  // VersionSet to which this Version belongs
  Version* next_;     // Next version in linked list
  Version* prev_;     // Previous version in linked list
  int refs_;          // Number of live refs to this version
  //...
};

As shown, ColumnFamilyData and Version point to each other: ColumnFamilyData holds current_, and each Version holds cfd_.

ColumnFamilyData also holds a SuperVersion. What is that data structure for? The next section analyzes it.

SuperVersion

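Recall how flush works: a flush does not create SST files out of nothing; it consumes immutable memtables to produce them. A user's Get first checks the memtable, then the immutable memtables, and finally the SST files. Since the set of SST files is managed with MVCC, the set of immutable memtables needs the same treatment, because a flush changes both.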

// db/column_family.h

// holds references to memtable, all immutable memtables and version
struct SuperVersion {
  // Accessing members of this class is not thread-safe and requires external
  // synchronization (ie db mutex held or on write thread).
  ColumnFamilyData* cfd;
  MemTable* mem;
  MemTableListVersion* imm;
  Version* current;
  MutableCFOptions mutable_cf_options;
  // Version number of the current SuperVersion
  uint64_t version_number;
  WriteStallCondition write_stall_condition;
  // ...

  // should be called outside the mutex
  SuperVersion() = default;
  ~SuperVersion();
  SuperVersion* Ref();
  // If Unref() returns true, Cleanup() should be called with mutex held
  // before deleting this SuperVersion.
  bool Unref();

  // call these two methods with db mutex held
  // Cleanup unrefs mem, imm and current. Also, it stores all memtables
  // that needs to be deleted in to_delete vector. Unrefing those
  // objects needs to be done in the mutex
  void Cleanup();
  void Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
            MemTableListVersion* new_imm, Version* new_current);
  // ...

 private:
  std::atomic<uint32_t> refs;
  // We need to_delete because during Cleanup(), imm->Unref() returns
  // all memtables that we need to free through this vector. We then
  // delete all those memtables outside of mutex, during destruction
  autovector<MemTable*> to_delete;
};

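As shown, SuperVersion is a superset of Version in the sense that, besides the current Version, it also pins the memtable and a snapshot of the immutable memtable list (MemTableListVersion), acting as a kind of "general manager". For example, during initialization it increments the reference count of every resource it manages: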

// db/column_family.cc
void SuperVersion::Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
                        MemTableListVersion* new_imm, Version* new_current) {
  cfd = new_cfd;
  mem = new_mem;
  imm = new_imm;
  current = new_current;
  full_history_ts_low = cfd->GetFullHistoryTsLow();
  cfd->Ref();
  mem->Ref();
  imm->Ref();
  current->Ref();
  refs.store(1, std::memory_order_relaxed);
}
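
The comments in the struct above also describe the release side: Unref() reports whether the caller dropped the last reference, and only then is Cleanup() run with the mutex held before the object is deleted. A minimal sketch of that protocol follows. It is not RocksDB code: ToyResource, ToySuperVersion and ReleaseToySuperVersion are hypothetical names, and the three resources are reduced to bare counters.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for MemTable, MemTableListVersion and Version:
// only the reference counter is modelled.
struct ToyResource {
  int refs = 0;  // plain int, so it must only be touched under the mutex
};

class ToySuperVersion {
 public:
  // Call with the mutex held: takes one reference on each resource.
  void Init(ToyResource* mem, ToyResource* imm, ToyResource* current) {
    mem_ = mem;
    imm_ = imm;
    current_ = current;
    ++mem_->refs;
    ++imm_->refs;
    ++current_->refs;
    refs_.store(1, std::memory_order_relaxed);
  }

  // May be called without the mutex: the counter is atomic.
  ToySuperVersion* Ref() {
    refs_.fetch_add(1, std::memory_order_relaxed);
    return this;
  }

  // Returns true if the caller just dropped the last reference.
  bool Unref() {
    std::uint32_t previous = refs_.fetch_sub(1);  // returns the old value
    assert(previous > 0);
    return previous == 1;
  }

  // Call with the mutex held: gives the resource references back.
  void Cleanup() {
    --mem_->refs;
    --imm_->refs;
    --current_->refs;
  }

 private:
  ToyResource* mem_ = nullptr;
  ToyResource* imm_ = nullptr;
  ToyResource* current_ = nullptr;
  std::atomic<std::uint32_t> refs_{0};
};

// Drops one reference. Returns true if the object was cleaned up and deleted.
bool ReleaseToySuperVersion(ToySuperVersion* sv, std::mutex* db_mutex) {
  if (!sv->Unref()) return false;  // someone else still holds a reference
  {
    std::lock_guard<std::mutex> lock(*db_mutex);
    sv->Cleanup();
  }
  delete sv;  // freed outside the mutex
  return true;
}
```

The split keeps the common case cheap: most releases are a single atomic decrement, and the mutex is taken only by whoever happens to drop the last reference.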

In fact, a user's Get does not take the current Version directly from ColumnFamilyData. It goes through the SuperVersion instead, as the analysis of the read path will show.
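
The sketch below shows why reading through one pinned object helps. It is a toy, not the RocksDB read path: ToyReadView, ToyViewHolder, ToyGet and FlushOldestImm are hypothetical names, std::shared_ptr is swapped in for RocksDB's hand-written reference counting, and an "SST file" is just an in-memory map. A reader pins one view and searches memtable, then immutable memtables, then SST files; a flush that installs a new view afterwards does not disturb the reader's pinned one.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model, not RocksDB code.
using ToyTable = std::map<std::string, std::string>;
using ToyTablePtr = std::shared_ptr<const ToyTable>;

// One consistent view of everything a Get has to search.
struct ToyReadView {
  ToyTablePtr mem;               // active memtable
  std::vector<ToyTablePtr> imm;  // immutable memtables, newest first
  std::vector<ToyTablePtr> sst;  // "SST files", newest first
};

class ToyViewHolder {
 public:
  // Readers pin the current view; the shared_ptr copy is the "Ref".
  std::shared_ptr<const ToyReadView> Acquire() const {
    std::lock_guard<std::mutex> lock(mu_);
    return view_;
  }

  // Flush or compaction publishes a new view. Old views stay alive for as
  // long as some reader still holds them.
  void Install(std::shared_ptr<const ToyReadView> view) {
    std::lock_guard<std::mutex> lock(mu_);
    view_ = std::move(view);
  }

 private:
  mutable std::mutex mu_;
  std::shared_ptr<const ToyReadView> view_ = std::make_shared<ToyReadView>();
};

static bool Lookup(const ToyTable& table, const std::string& key,
                   std::string* value) {
  auto it = table.find(key);
  if (it == table.end()) return false;
  *value = it->second;
  return true;
}

// Searches newest data first: memtable, immutable memtables, SST files.
bool ToyGet(const ToyReadView& view, const std::string& key,
            std::string* value) {
  if (view.mem && Lookup(*view.mem, key, value)) return true;
  for (const ToyTablePtr& t : view.imm) {
    if (Lookup(*t, key, value)) return true;
  }
  for (const ToyTablePtr& t : view.sst) {
    if (Lookup(*t, key, value)) return true;
  }
  return false;
}

// Builds the view that results from flushing the oldest immutable memtable:
// it leaves the imm list and becomes the newest "SST file".
std::shared_ptr<const ToyReadView> FlushOldestImm(const ToyReadView& old) {
  auto next = std::make_shared<ToyReadView>(old);  // copies pointers only
  if (!next->imm.empty()) {
    next->sst.insert(next->sst.begin(), next->imm.back());
    next->imm.pop_back();
  }
  return next;
}
```

Because the immutable memtable list and the SST list change together inside one newly published view, a reader never observes the state in between, where a flushed memtable has already left one list but has not yet appeared in the other.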

VersionSet

This data structure holds all the Versions of all the column families in a DB.

Summary

This summary borrows a figure from RocksDB Version管理概述 - 追梦的蜗牛 - 博客园.