ColumnFamilyData
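RocksDB has the concept of a column family, which can be thought of as roughly analogous to a table in MySQL: for example, a scores table and a roster table. Splitting data into separate "tables" implies isolation between them, so sharing a memtable or SST files across them would make little sense. Each column family therefore has its own independent LSM tree, that is, its own memtable, immutable memtables (imm tables), and SST files. The official RocksDB documentation states this explicitly: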
- Column Families · facebook/rocksdb Wiki: "The main idea behind Column Families is that they share the write-ahead log and don't share memtables and table files. By sharing write-ahead logs we get awesome benefit of atomic writes. By separating memtables and table files, we are able to configure column families independently and delete them quickly."
- How we keep track of live SST files · facebook/rocksdb Wiki: "Since each column family is a separate LSM, it also has its own list of versions with one that is "current"."
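The split described in the quotes above (one shared write-ahead log, separate memtables per column family) can be illustrated with a small toy model. This is only a sketch: `ToyDB`, `ToyColumnFamily`, and `ToyWalRecord` are hypothetical names invented for illustration and are not RocksDB classes, and the "log" is just an in-memory vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are not RocksDB classes.
struct ToyWalRecord {
  uint32_t cf_id;     // which column family the record belongs to
  std::string key;
  std::string value;
};

struct ToyColumnFamily {
  std::map<std::string, std::string> memtable;  // private to this column family
};

class ToyDB {
 public:
  // Creates a column family and returns its id.
  uint32_t CreateColumnFamily() {
    cfs_.emplace_back();
    return static_cast<uint32_t>(cfs_.size() - 1);
  }

  // Applies a batch that may span several column families. The whole batch
  // becomes one entry in the shared log, then each record is applied to the
  // memtable of its own column family.
  void Write(const std::vector<ToyWalRecord>& batch) {
    wal_.push_back(batch);
    for (const ToyWalRecord& r : batch) {
      cfs_.at(r.cf_id).memtable[r.key] = r.value;
    }
  }

  const ToyColumnFamily& cf(uint32_t id) const { return cfs_.at(id); }
  std::size_t wal_entries() const { return wal_.size(); }

 private:
  std::vector<std::vector<ToyWalRecord>> wal_;  // shared by all column families
  std::vector<ToyColumnFamily> cfs_;            // one memtable per column family
};
```

Writing one batch that touches both a "scores" and a "roster" column family produces a single log entry but updates two independent memtables, which is the property the wiki credits for atomic cross-column-family writes.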
In RocksDB, a column family is represented by class ColumnFamilyData:
cpp
// db/column_family.h
// This class keeps all the data that a column family needs.
// Most methods require DB mutex held, unless otherwise noted
class ColumnFamilyData {
// ...
uint32_t id_; // id of the column family
const std::string name_; // name of the column family
Version* dummy_versions_; // Head of circular doubly-linked list of versions.
Version* current_; // == dummy_versions->prev_
MemTable* mem_; // the column family's memtable; exactly one
MemTableList imm_; // the column family's immutable memtables; there may be several
SuperVersion* super_version_; // points to the SuperVersion
// ...
};
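As shown, ColumnFamilyData manages its own memtable (exactly one), its immutable memtables (possibly several), and its versions. Where are the SST files? They are reached through Version* current_. The Version data structure is described next.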
Version
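Before introducing this class, a brief word on MVCC. MVCC (multi-version concurrency control) is a general concurrency-control technique and is not specific to implementing transactions. The concurrency in question here is "one writer, many readers": MVCC allows one writer and many readers to run concurrently without locking. Version applies exactly this idea. Why does RocksDB use MVCC to manage SST files? As flushes and compactions run in the background thread pool, the files at each level on disk keep changing, so how can a user's Get be served in the meantime? This is a classic write-read conflict. With no synchronization at all, a lookup might miss some SST files. With a lock, a Get would block flushes and compactions, and vice versa. That is inefficient, whereas MVCC allows much higher concurrency.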
According to the official RocksDB documentation, How we keep track of live SST files · facebook/rocksdb Wiki:
"In the end of a compaction or a mem table flush, a new version is created for the updated LSM tree. At one time, there is only one "current" version that represents the files in the up-to-date LSM tree."
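Each time a flush or compaction finishes, a new Version is created to describe how SST files are distributed across the levels and what state they are in. It can be thought of as a snapshot of the current file set, and this is the current version. Until the next current version is produced, user Get requests can only operate on the existing one. Once the next one is produced, earlier Get requests may still be in flight, so the old current version cannot be deleted right away. Old and new Version objects are linked in a doubly-linked list. Every Version in RocksDB carries a reference count and can be deleted only when that count reaches 0. SST files are reference-counted in the same way. For details, see the official RocksDB documentation, How we keep track of live SST files · facebook/rocksdb Wiki, which walks through a detailed example.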
cpp
// db/version_set.h
// A column family's version consists of the table and blob files owned by
// the column family at a certain point in time.
class Version {
// ...
ColumnFamilyData* cfd_; // ColumnFamilyData to which this Version belongs
Logger* info_log_;
Statistics* db_statistics_;
TableCache* table_cache_;
BlobSource* blob_source_;
const MergeOperator* merge_operator_;
VersionStorageInfo storage_info_;
VersionSet* vset_; // VersionSet to which this Version belongs
Version* next_; // Next version in linked list
Version* prev_; // Previous version in linked list
int refs_; // Number of live refs to this version
// ...
};
It follows that ColumnFamilyData and Version point to each other.
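The lifecycle described in this section (install a new current version, keep the old one alive while readers still reference it, delete it when its count reaches 0) can be sketched with a toy circular doubly-linked list. This is a single-threaded illustration under simplifying assumptions: `ToyVersion` and `ToyVersionList` are hypothetical names, and the locking that a real multi-threaded implementation needs is left out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy model only: ToyVersion and ToyVersionList are hypothetical names.
struct ToyVersion {
  std::vector<std::string> files;  // SST file names visible in this version
  int refs = 0;                    // number of live references
  ToyVersion* prev = nullptr;
  ToyVersion* next = nullptr;
};

class ToyVersionList {
 public:
  ToyVersionList() { head_.prev = head_.next = &head_; }
  ~ToyVersionList() {
    while (head_.next != &head_) Remove(head_.next);
  }
  ToyVersionList(const ToyVersionList&) = delete;
  ToyVersionList& operator=(const ToyVersionList&) = delete;

  // Installs a new current version, as would happen after a flush or compaction.
  void Install(std::vector<std::string> files) {
    ToyVersion* v = new ToyVersion;
    v->files = std::move(files);
    v->refs = 1;  // the reference the list itself holds on "current"
    // Link v just before the dummy head, so that current == head_.prev.
    v->prev = head_.prev;
    v->next = &head_;
    head_.prev->next = v;
    head_.prev = v;
    // The previous current version loses the list's reference.
    if (v->prev != &head_) Unref(v->prev);
  }

  ToyVersion* current() { return head_.prev == &head_ ? nullptr : head_.prev; }

  static void Ref(ToyVersion* v) { ++v->refs; }

  // Drops one reference; unlinks and frees the version when none remain.
  void Unref(ToyVersion* v) {
    assert(v->refs > 0);
    if (--v->refs == 0) Remove(v);
  }

  // Number of versions still linked in the list.
  int size() const {
    int n = 0;
    for (const ToyVersion* v = head_.next; v != &head_; v = v->next) ++n;
    return n;
  }

 private:
  void Remove(ToyVersion* v) {
    v->prev->next = v->next;
    v->next->prev = v->prev;
    delete v;
  }

  ToyVersion head_;  // dummy node of the circular list
};
```

A reader that calls `Ref` on the current version before a second `Install` keeps that older version linked and readable. It is unlinked and freed only when the reader calls `Unref`.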
ColumnFamilyData also holds a SuperVersion. What is that data structure for? It is analyzed next.
SuperVersion
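Recall the flush process: a flush does not create an SST file out of thin air; it consumes an immutable memtable to produce one. A user's Get first checks the memtable, then the immutable memtables, and finally the SST files. Since SST files are placed under MVCC-style concurrency control, the immutable memtables need the same treatment.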
cpp
// db/column_family.h
// holds references to memtable, all immutable memtables and version
struct SuperVersion {
// Accessing members of this class is not thread-safe and requires external
// synchronization (ie db mutex held or on write thread).
ColumnFamilyData* cfd;
MemTable* mem;
MemTableListVersion* imm;
Version* current;
MutableCFOptions mutable_cf_options;
// Version number of the current SuperVersion
uint64_t version_number;
WriteStallCondition write_stall_condition;
// ...
// should be called outside the mutex
SuperVersion() = default;
~SuperVersion();
SuperVersion* Ref();
// If Unref() returns true, Cleanup() should be called with mutex held
// before deleting this SuperVersion.
bool Unref();
// call these two methods with db mutex held
// Cleanup unrefs mem, imm and current. Also, it stores all memtables
// that needs to be deleted in to_delete vector. Unrefing those
// objects needs to be done in the mutex
void Cleanup();
void Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
MemTableListVersion* new_imm, Version* new_current);
// ...
private:
std::atomic<uint32_t> refs;
// We need to_delete because during Cleanup(), imm->Unref() returns
// all memtables that we need to free through this vector. We then
// delete all those memtables outside of mutex, during destruction
autovector<MemTable*> to_delete;
};
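As shown, SuperVersion is a superset of Version: besides the current version, it also manages a snapshot of the immutable memtables, acting somewhat like a "general manager". For example, during initialization it increments the reference count of each resource it manages: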
cpp
// db/column_family.cc
void SuperVersion::Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
MemTableListVersion* new_imm, Version* new_current) {
cfd = new_cfd;
mem = new_mem;
imm = new_imm;
current = new_current;
full_history_ts_low = cfd->GetFullHistoryTsLow();
cfd->Ref();
mem->Ref();
imm->Ref();
current->Ref();
refs.store(1, std::memory_order_relaxed);
}
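The counterpart of `Init` is the `Unref()` / `Cleanup()` pair declared in the struct above: per its comments, when `Unref()` returns true the caller is expected to call `Cleanup()`, which unrefs mem, imm and current. The sketch below models that contract with stand-in types. `ToyResource` and `ToySuperVersion` are hypothetical, and the method bodies are an assumption about one way to satisfy the documented contract, not a copy of the RocksDB source.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy model only: a hypothetical stand-in for MemTable, MemTableListVersion
// and Version, reduced to a bare reference count.
struct ToyResource {
  int refs = 0;
  void Ref() { ++refs; }
  void Unref() {
    assert(refs > 0);
    --refs;
  }
};

struct ToySuperVersion {
  ToyResource* mem = nullptr;
  ToyResource* imm = nullptr;
  ToyResource* current = nullptr;

  // Pins every managed resource and starts with one reference to itself.
  void Init(ToyResource* new_mem, ToyResource* new_imm, ToyResource* new_current) {
    mem = new_mem;
    imm = new_imm;
    current = new_current;
    mem->Ref();
    imm->Ref();
    current->Ref();
    refs.store(1, std::memory_order_relaxed);
  }

  ToySuperVersion* Ref() {
    refs.fetch_add(1, std::memory_order_relaxed);
    return this;
  }

  // Returns true when the caller dropped the last reference; the caller is
  // then expected to call Cleanup().
  bool Unref() {
    uint32_t previous = refs.fetch_sub(1);
    assert(previous > 0);
    return previous == 1;
  }

  // Releases the resources pinned in Init().
  void Cleanup() {
    assert(refs.load(std::memory_order_relaxed) == 0);
    mem->Unref();
    imm->Unref();
    current->Unref();
  }

 private:
  std::atomic<uint32_t> refs{0};
};
```

As long as any holder still references the super version, the memtable, immutable memtables, and version it pins keep a nonzero count, so they stay alive even if the column family has already moved on to newer ones.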
In fact, a user's Get does not take the current version directly from ColumnFamilyData; it goes through the super version instead, as the read-path analysis will show.
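As a rough preview of that read path, the sketch below looks a key up in the order given earlier (memtable, then immutable memtables, then SST files) against a pinned view. Everything here is hypothetical: `ToyPinnedView` stands in for a super version, `std::shared_ptr` stands in for the manual reference counts, and each "table" is just a `std::map`.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy model only: every "table" is an in-memory map.
using ToyTable = std::map<std::string, std::string>;
using ToyTablePtr = std::shared_ptr<const ToyTable>;

// Hypothetical stand-in for a super version: a consistent view of the three
// places a key can live. shared_ptr keeps each table alive while it is viewed.
struct ToyPinnedView {
  ToyTablePtr mem;
  std::vector<ToyTablePtr> imm;   // newest first
  std::vector<ToyTablePtr> ssts;  // newest first; stands in for the Version
};

// Copies the value for key into *value and returns true if the table has it.
static bool Lookup(const ToyTable& table, const std::string& key, std::string* value) {
  ToyTable::const_iterator it = table.find(key);
  if (it == table.end()) return false;
  *value = it->second;
  return true;
}

// Searches the memtable first, then immutable memtables, then SST files,
// and stops at the first hit, so newer data shadows older data.
bool ToyGet(const ToyPinnedView& view, const std::string& key, std::string* value) {
  if (Lookup(*view.mem, key, value)) return true;
  for (const ToyTablePtr& t : view.imm) {
    if (Lookup(*t, key, value)) return true;
  }
  for (const ToyTablePtr& t : view.ssts) {
    if (Lookup(*t, key, value)) return true;
  }
  return false;
}
```

A reader that copied the shared pointer to a view before a flush installs a new one keeps seeing the old, internally consistent set of tables, which is the effect the reference counts on the super version are meant to provide.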
VersionSet
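This data structure contains all versions of all column families in a DB.
Summary
The relationships among these structures are summarized in a diagram from RocksDB Version管理概述 - 追梦的蜗牛 - 博客园.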
