Important Data Structures in RocksDB

ColumnFamilyData

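RocksDB has the concept of a column family, which can be thought of as roughly analogous to a table in MySQL, for example a grades table and a roster table. Since the data has been split into separate "tables", the data sets are meant to be isolated from one another, and sharing a memtable or SST files between them would make little sense. Each column family therefore has its own independent LSM tree, with its own memtable, immutable memtables (imm tables), and SST files. The official RocksDB documentation states this explicitly: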

  • Column Families · facebook/rocksdb Wiki: The main idea behind Column Families is that they share the write-ahead log and don't share memtables and table files. By sharing write-ahead logs we get awesome benefit of atomic writes. By separating memtables and table files, we are able to configure column families independently and delete them quickly.
  • How we keep track of live SST files · facebook/rocksdb Wiki: Since each column family is a separate LSM, it also has its own list of versions with one that is "current".
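The sharing rule quoted above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not RocksDB code: `ToyDB`, `ToyColumnFamily`, and `WalRecord` are hypothetical names, the "log" is an in-memory vector, and durability and recovery are left out entirely. It shows one batch that spans two column families being appended to a single shared log and then applied to two separate memtables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model, not RocksDB code; every name below is hypothetical.
struct WalRecord {
  uint32_t cf_id;
  std::string key;
  std::string value;
};

struct ToyColumnFamily {
  std::string name;
  std::map<std::string, std::string> memtable;  // private to this family
};

class ToyDB {
 public:
  uint32_t CreateColumnFamily(const std::string& name) {
    uint32_t id = next_id_++;
    families_[id].name = name;
    return id;
  }

  // Apply a batch that may span several column families. The whole batch is
  // appended to the shared log first, then applied to each family's memtable.
  void Write(const std::vector<WalRecord>& batch) {
    for (const WalRecord& r : batch) assert(families_.count(r.cf_id) == 1);
    wal_.insert(wal_.end(), batch.begin(), batch.end());
    for (const WalRecord& r : batch) {
      families_.at(r.cf_id).memtable[r.key] = r.value;
    }
  }

  // Returns true and fills *value if the key exists in the given family.
  bool Get(uint32_t cf_id, const std::string& key, std::string* value) const {
    auto family = families_.find(cf_id);
    if (family == families_.end()) return false;
    auto it = family->second.memtable.find(key);
    if (it == family->second.memtable.end()) return false;
    *value = it->second;
    return true;
  }

  // Dropping a family discards only its own memtable; others are untouched.
  void DropColumnFamily(uint32_t cf_id) { families_.erase(cf_id); }

  std::size_t WalSize() const { return wal_.size(); }

 private:
  uint32_t next_id_ = 0;
  std::vector<WalRecord> wal_;                    // shared by all families
  std::map<uint32_t, ToyColumnFamily> families_;  // one entry per family
};
```

In this model, the same key can live in both families with different values, and dropping one family never touches the other's data, which mirrors the "configure independently and delete quickly" point in the quote.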

In RocksDB, a column family is represented by class ColumnFamilyData.

```cpp
// db/column_family.h

// This class keeps all the data that a column family needs.
// Most methods require DB mutex held, unless otherwise noted
class ColumnFamilyData {
  // ...

  uint32_t id_;              // ID of the column family
  const std::string name_;   // name of the column family
  Version* dummy_versions_;  // Head of circular doubly-linked list of versions.
  Version* current_;         // == dummy_versions->prev_

  MemTable* mem_;                // the active memtable; exactly one
  MemTableList imm_;             // immutable memtables; there may be several
  SuperVersion* super_version_;  // points to the SuperVersion
  // ...
};
```

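As shown, ColumnFamilyData manages its own memtable (exactly one), its imm tables (possibly several), and its versions. Where did the SST files go? The answer is Version* current_. The Version data structure is introduced next.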

Version

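Before introducing this class, a word about MVCC. MVCC (multi-version concurrency control) is a general concurrency-control technique, not something specific to implementing transactions. Concurrency here means "one writer, many readers": MVCC lets one writer and many readers run concurrently without taking locks against each other. Version applies exactly this idea. Why does RocksDB rely on MVCC to manage SST files? As flush and compaction jobs run in the background thread pool, the files at each level on disk keep changing, so how can a user's get request be served in the meantime? This is a classic write-read conflict: with no synchronization at all, a lookup could miss some SST files. If a lock were used for synchronization instead, flush and compaction could not run during a get, and vice versa, which would be inefficient. MVCC allows far more concurrency.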

According to the RocksDB wiki page How we keep track of live SST files · facebook/rocksdb Wiki:

In the end of a compaction or a mem table flush, a new version is created for the updated LSM tree. At one time, there is only one "current" version that represents the files in the up-to-date LSM tree

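Every time a flush or compaction finishes, a new Version is created to describe the layout and state of the SST files at each level. It can be thought of as a snapshot of the present state, and this is the current version. Until the next current is created, user get requests can only operate on the present one. Once the next current exists, gets that started earlier may not have finished yet, so the old current cannot be deleted right away. Old and new Version objects are linked in a doubly linked list. Every Version in RocksDB carries a reference count and can be deleted only when that count reaches 0. SST files are reference-counted in the same way. For details, see the wiki page How we keep track of live SST files · facebook/rocksdb Wiki, which walks through a detailed example.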
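The lifetime rule described above (a new current is appended, and an old version survives only while something still references it) can be sketched with a toy list. This is a simplified model under stated assumptions, not the RocksDB implementation: `ToyVersion` and `ToyVersionList` are hypothetical names, file lists are plain strings, and all locking is omitted. The circular list with a dummy head follows the layout hinted at by the `dummy_versions_` and `current_` comments in the excerpt above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model, not RocksDB code; every name below is hypothetical.
struct ToyVersion {
  std::vector<std::string> files;  // SST files visible in this version
  int refs = 0;                    // number of live references
  ToyVersion* prev = this;         // a lone node links to itself
  ToyVersion* next = this;
};

class ToyVersionList {
 public:
  ToyVersionList() = default;
  ToyVersionList(const ToyVersionList&) = delete;
  ToyVersionList& operator=(const ToyVersionList&) = delete;
  ~ToyVersionList() {
    while (dummy_.next != &dummy_) Remove(dummy_.next);
  }

  // The newest version, or nullptr before the first Append().
  ToyVersion* current() const { return current_; }

  // Called at the end of a (simulated) flush or compaction.
  ToyVersion* Append(std::vector<std::string> files) {
    ToyVersion* v = new ToyVersion;
    v->files = std::move(files);
    v->refs = 1;  // the reference the list itself holds on "current"
    v->prev = dummy_.prev;
    v->next = &dummy_;
    v->prev->next = v;
    v->next->prev = v;
    ToyVersion* old = current_;
    current_ = v;
    assert(current_ == dummy_.prev);
    if (old != nullptr) Unref(old);  // old survives only if a reader pinned it
    return v;
  }

  void Ref(ToyVersion* v) { ++v->refs; }

  // Drops one reference; the version is unlinked and freed at zero.
  void Unref(ToyVersion* v) {
    assert(v->refs > 0);
    if (--v->refs == 0) Remove(v);
  }

  std::size_t LiveVersions() const {
    std::size_t n = 0;
    for (const ToyVersion* v = dummy_.next; v != &dummy_; v = v->next) ++n;
    return n;
  }

 private:
  void Remove(ToyVersion* v) {
    v->prev->next = v->next;
    v->next->prev = v->prev;
    if (v == current_) current_ = nullptr;
    delete v;
  }

  ToyVersion dummy_;  // head of the circular doubly linked list
  ToyVersion* current_ = nullptr;
};
```

A reader that calls `Ref` on the current version before a simulated compaction keeps that version in the list until its matching `Unref`, so its view of the file set stays complete even after a newer version has been installed.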

```cpp
// db/version_set.h

// A column family's version consists of the table and blob files owned by
// the column family at a certain point in time.
class Version {
  // ...
  ColumnFamilyData* cfd_;  // ColumnFamilyData to which this Version belongs
  Logger* info_log_;
  Statistics* db_statistics_;
  TableCache* table_cache_;
  BlobSource* blob_source_;
  const MergeOperator* merge_operator_;

  VersionStorageInfo storage_info_;
  VersionSet* vset_;  // VersionSet to which this Version belongs
  Version* next_;     // Next version in linked list
  Version* prev_;     // Previous version in linked list
  int refs_;          // Number of live refs to this version
  // ...
};
```

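As shown, ColumnFamilyData and Version point to each other: ColumnFamilyData::current_ points to a Version, and Version::cfd_ points back to its ColumnFamilyData.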

ColumnFamilyData also has a SuperVersion. What is this data structure for? It is analyzed next.

SuperVersion

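Recall the flush process: a flush does not produce an SST file out of thin air; it consumes an imm table to generate one. A user get request checks the memtable first, then the imm tables, and finally the SST files. Since MVCC is used for concurrency control over the SST files, the imm tables need the same treatment.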

```cpp
// db/column_family.h

// holds references to memtable, all immutable memtables and version
struct SuperVersion {
  // Accessing members of this class is not thread-safe and requires external
  // synchronization (ie db mutex held or on write thread).
  ColumnFamilyData* cfd;
  MemTable* mem;
  MemTableListVersion* imm;
  Version* current;
  MutableCFOptions mutable_cf_options;
  // Version number of the current SuperVersion
  uint64_t version_number;
  WriteStallCondition write_stall_condition;
  // ...

  // should be called outside the mutex
  SuperVersion() = default;
  ~SuperVersion();
  SuperVersion* Ref();
  // If Unref() returns true, Cleanup() should be called with mutex held
  // before deleting this SuperVersion.
  bool Unref();

  // call these two methods with db mutex held
  // Cleanup unrefs mem, imm and current. Also, it stores all memtables
  // that needs to be deleted in to_delete vector. Unrefing those
  // objects needs to be done in the mutex
  void Cleanup();
  void Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
            MemTableListVersion* new_imm, Version* new_current);
  // ...

 private:
  std::atomic<uint32_t> refs;
  // We need to_delete because during Cleanup(), imm->Unref() returns
  // all memtables that we need to free through this vector. We then
  // delete all those memtables outside of mutex, during destruction
  autovector<MemTable*> to_delete;
};
```

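As shown, SuperVersion is a superset of Version: besides the current version, it also manages a snapshot of the imm tables, acting somewhat like a "general manager". For example, during initialization it increments the reference count of every resource it manages: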

```cpp
// db/column_family.cc
void SuperVersion::Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
                        MemTableListVersion* new_imm, Version* new_current) {
  cfd = new_cfd;
  mem = new_mem;
  imm = new_imm;
  current = new_current;
  full_history_ts_low = cfd->GetFullHistoryTsLow();
  cfd->Ref();
  mem->Ref();
  imm->Ref();
  current->Ref();
  refs.store(1, std::memory_order_relaxed);
}
```
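The interplay of Init, Ref, Unref, and Cleanup can be sketched with a stripped-down model. This is an illustration under stated assumptions rather than the RocksDB code: `ToyRefCounted` and `ToySuperVersion` are hypothetical stand-ins, the three managed resources are reduced to bare counters, and the mutex rules and the `to_delete` vector from the comments in the struct above are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy model, not RocksDB code; every name below is hypothetical.
// Stands in for MemTable, MemTableListVersion, and Version alike.
struct ToyRefCounted {
  int refs = 0;
  void Ref() { ++refs; }
  // Returns true when the last reference is gone.
  bool Unref() {
    assert(refs > 0);
    return --refs == 0;
  }
};

struct ToySuperVersion {
  ToyRefCounted* mem = nullptr;
  ToyRefCounted* imm = nullptr;
  ToyRefCounted* current = nullptr;

  // Take one reference on everything this bundle hands out to readers.
  void Init(ToyRefCounted* new_mem, ToyRefCounted* new_imm,
            ToyRefCounted* new_current) {
    mem = new_mem;
    imm = new_imm;
    current = new_current;
    mem->Ref();
    imm->Ref();
    current->Ref();
    refs.store(1, std::memory_order_relaxed);
  }

  ToySuperVersion* Ref() {
    refs.fetch_add(1, std::memory_order_relaxed);
    return this;
  }

  // Returns true when the caller dropped the last reference and is therefore
  // responsible for calling Cleanup().
  bool Unref() {
    uint32_t previous = refs.fetch_sub(1);
    assert(previous > 0);
    return previous == 1;
  }

  // Give back the references taken in Init().
  void Cleanup() {
    assert(refs.load(std::memory_order_relaxed) == 0);
    mem->Unref();
    imm->Unref();
    current->Unref();
  }

 private:
  std::atomic<uint32_t> refs{0};
};
```

In this model, as long as any holder still has a reference on the bundle, the memtable, imm list, and version it points to all keep their extra reference, so none of them can be freed underneath a reader.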

In fact, a user get request does not take the current version directly from ColumnFamilyData; it goes through the super version, as the analysis of the read path will show.
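To make the read path concrete, here is a minimal sketch of a get that first pins a bundled view and then searches it in the order described earlier: memtable, imm tables, SST files. It is a toy under stated assumptions, not how RocksDB implements Get: `ToyReadView`, `ToyColumnFamily`, and `ToyGet` are hypothetical names, every table is an in-memory `std::map`, `std::shared_ptr` plays the role of the reference count, and synchronization between installing and fetching a view is omitted.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy model, not RocksDB code; every name below is hypothetical.
using ToyTable = std::map<std::string, std::string>;
using ToyTablePtr = std::shared_ptr<const ToyTable>;

// Stand-in for SuperVersion: everything one read needs, bundled together.
struct ToyReadView {
  ToyTablePtr mem;               // active memtable
  std::vector<ToyTablePtr> imm;  // immutable memtables, newest first
  std::vector<ToyTablePtr> sst;  // SST files, newest first
};

class ToyColumnFamily {
 public:
  explicit ToyColumnFamily(std::shared_ptr<const ToyReadView> view)
      : view_(std::move(view)) {}

  // Readers copy the pointer; the copy keeps the whole view alive.
  std::shared_ptr<const ToyReadView> GetView() const { return view_; }

  // Called after a (simulated) flush or compaction.
  void Install(std::shared_ptr<const ToyReadView> view) {
    view_ = std::move(view);
  }

 private:
  std::shared_ptr<const ToyReadView> view_;
};

inline bool LookupIn(const ToyTable& table, const std::string& key,
                     std::string* value) {
  auto it = table.find(key);
  if (it == table.end()) return false;
  *value = it->second;
  return true;
}

// Search order from the text: memtable, then imm tables, then SST files.
inline bool ToyGet(const ToyReadView& view, const std::string& key,
                   std::string* value) {
  assert(view.mem != nullptr);
  if (LookupIn(*view.mem, key, value)) return true;
  for (const ToyTablePtr& t : view.imm) {
    if (LookupIn(*t, key, value)) return true;
  }
  for (const ToyTablePtr& t : view.sst) {
    if (LookupIn(*t, key, value)) return true;
  }
  return false;
}
```

A reader that obtained its view before a simulated flush keeps searching the tables it pinned, while readers that arrive afterwards see the newly installed view; neither has to wait for the other.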

VersionSet

This data structure holds all versions of all column families in a DB.
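One practical consequence of keeping every version of every column family in one place is that the set of SST files still in use can be computed as a union over all of them. The sketch below illustrates that idea only; it is an assumption-laden toy, not the VersionSet interface: `ToyVersionSet`, `AddVersion`, `DropOldest`, and `LiveFiles` are hypothetical names, and versions are kept in a plain vector per column family instead of a reference-counted linked list.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Toy model, not RocksDB code; every name below is hypothetical.
struct ToyVersion {
  std::vector<uint64_t> file_numbers;  // SST files this version refers to
};

class ToyVersionSet {
 public:
  // Record a new version for a column family; the last one added is current.
  void AddVersion(uint32_t cf_id, ToyVersion v) {
    versions_[cf_id].push_back(std::move(v));
  }

  const ToyVersion& Current(uint32_t cf_id) const {
    auto it = versions_.find(cf_id);
    assert(it != versions_.end() && !it->second.empty());
    return it->second.back();
  }

  // Drop the oldest version of a family once no reader needs it any more.
  void DropOldest(uint32_t cf_id) {
    auto it = versions_.find(cf_id);
    assert(it != versions_.end() && it->second.size() > 1);
    it->second.erase(it->second.begin());
  }

  // Union of the files referenced by any version of any column family.
  std::set<uint64_t> LiveFiles() const {
    std::set<uint64_t> live;
    for (const auto& entry : versions_) {
      for (const ToyVersion& v : entry.second) {
        live.insert(v.file_numbers.begin(), v.file_numbers.end());
      }
    }
    return live;
  }

 private:
  std::map<uint32_t, std::vector<ToyVersion>> versions_;  // keyed by family
};
```

In this model a file stops being live only after the last version that mentions it has been dropped, which is the same condition the reference counts on Version and on SST files express.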

Summary

The relationships among these structures are summarized in a diagram from the 博客园 (cnblogs) post "RocksDB Version管理概述" by 追梦的蜗牛.
