Important Data Structures in RocksDB

ColumnFamilyData

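RocksDB has the concept of a column family, which is roughly analogous to a table in MySQL: for example, a grades table and a roster table. Splitting data into separate "tables" implies that the data is isolated, so sharing memtables or SST files across them would make little sense. Each column family therefore has its own independent LSM tree, that is, its own memtable, immutable memtables (imm), and SST files. The official RocksDB documentation states this explicitly: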

  • Column Families · facebook/rocksdb Wiki: The main idea behind Column Families is that they share the write-ahead log and don't share memtables and table files. By sharing write-ahead logs we get awesome benefit of atomic writes. By separating memtables and table files, we are able to configure column families independently and delete them quickly.
  • How we keep track of live SST files · facebook/rocksdb Wiki: Since each column family is a separate LSM, it also has its own list of versions with one that is "current".
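The split described in these quotes (one shared write-ahead log, one memtable per column family) can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyDB, ToyColumnFamily, and WalRecord are hypothetical names invented for illustration, they do not use the RocksDB API, and the "WAL" is just an in-memory vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model only. ToyDB, ToyColumnFamily and WalRecord are hypothetical
// names for illustration; they are not RocksDB classes.
struct WalRecord {
  uint32_t cf_id;
  std::string key;
  std::string value;
};

struct ToyColumnFamily {
  std::string name;
  std::map<std::string, std::string> memtable;  // private to this family
};

class ToyDB {
 public:
  uint32_t CreateColumnFamily(const std::string& name) {
    uint32_t id = next_id_++;
    cfs_[id].name = name;
    return id;
  }

  // A batch may touch several column families. Every record goes to the
  // single shared log first, then to the memtable of its own family.
  // Returns false (and writes nothing) if any family id is unknown.
  bool Write(const std::vector<WalRecord>& batch) {
    for (const WalRecord& rec : batch) {
      if (cfs_.find(rec.cf_id) == cfs_.end()) return false;
    }
    wal_.insert(wal_.end(), batch.begin(), batch.end());
    for (const WalRecord& rec : batch) {
      cfs_[rec.cf_id].memtable[rec.key] = rec.value;
    }
    return true;
  }

  bool Get(uint32_t cf_id, const std::string& key, std::string* value) const {
    auto cf = cfs_.find(cf_id);
    if (cf == cfs_.end()) return false;
    auto it = cf->second.memtable.find(key);
    if (it == cf->second.memtable.end()) return false;
    *value = it->second;
    return true;
  }

  // Dropping a family discards only its private state; the log is untouched.
  void DropColumnFamily(uint32_t cf_id) { cfs_.erase(cf_id); }

  std::size_t WalSize() const { return wal_.size(); }

 private:
  uint32_t next_id_ = 0;
  std::vector<WalRecord> wal_;               // shared by all column families
  std::map<uint32_t, ToyColumnFamily> cfs_;  // one memtable per family
};

// Usage example: the same key lives independently in two families.
void ToyDbExample() {
  ToyDB db;
  uint32_t grades = db.CreateColumnFamily("grades");
  uint32_t roster = db.CreateColumnFamily("roster");
  db.Write({{grades, "alice", "90"}, {roster, "alice", "class-1"}});
  std::string v;
  assert(db.Get(grades, "alice", &v) && v == "90");
  assert(db.Get(roster, "alice", &v) && v == "class-1");
  db.DropColumnFamily(grades);
  assert(!db.Get(grades, "alice", &v));
  assert(db.Get(roster, "alice", &v));
}
```

Because each family keeps its own memtable, dropping one family is a local operation, while the single log is what lets one batch span several families.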

In RocksDB, class ColumnFamilyData represents a column family.

// db/column_family.h

// This class keeps all the data that a column family needs.
// Most methods require DB mutex held, unless otherwise noted
class ColumnFamilyData {
  // ...

  uint32_t id_; // id of the column family
  const std::string name_; // name of the column family
  Version* dummy_versions_;  // Head of circular doubly-linked list of versions.
  Version* current_;         // == dummy_versions->prev_

  MemTable* mem_; // the column family's active memtable; exactly one
  MemTableList imm_; // the column family's immutable memtables; there may be several
  SuperVersion* super_version_; // points to the SuperVersion
  // ...
};

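As shown, ColumnFamilyData manages its own memtable (exactly one), its immutable memtables (possibly several), and its versions. Where did the SST files go? The answer is Version* current_. The next section introduces the Version data structure.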

Version

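Before introducing this class, a quick word on MVCC. MVCC (multi-version concurrency control) is a general concurrency-control technique; it is not specific to how transactions are implemented. The concurrency in question here is "one writer, many readers": MVCC lets one writer and many readers run concurrently without blocking each other on a lock. Version applies exactly this idea. Why does RocksDB use MVCC to manage SST files? Consider what happens as flush and compaction jobs run in the background thread pool: the set of files at each level on disk keeps changing. How should a user's Get be served in the meantime? This is a classic read-write conflict. Without any synchronization, a lookup might miss some SST files. With lock-based synchronization, a Get would block flush and compaction, and vice versa, which is inefficient. MVCC avoids this and greatly improves concurrency.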

According to the official RocksDB documentation, How we keep track of live SST files · facebook/rocksdb Wiki:

In the end of a compaction or a mem table flush, a new version is created for the updated LSM tree. At one time, there is only one "current" version that represents the files in the up-to-date LSM tree

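After every flush or compaction, a new Version is created to describe the layout and state of the SST files at each level. It can be thought of as a snapshot of the current state; this is the current version. Until the next current is produced, every new Get operates on the existing current. Once the next current is produced, earlier Gets may still be running, so the old current cannot be deleted right away. Old and new Versions are linked in a doubly-linked list. Every Version carries a reference count and can be deleted only when the count drops to 0. SST files are reference-counted in the same way. For details, see the official RocksDB documentation, How we keep track of live SST files · facebook/rocksdb Wiki, which walks through a detailed example.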
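The lifetime rule above (a version stays alive until its reference count reaches 0, even after a newer version becomes current) can be sketched as a small circular doubly-linked list. This is a simplified illustration, not RocksDB code: ToyVersion and ToyVersionList are hypothetical names, the sketch is single-threaded, and file names stand in for real SST metadata.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy model only: ToyVersion and ToyVersionList are hypothetical names.
// Single-threaded; real code would guard this state with a mutex.
struct ToyVersion {
  std::vector<std::string> files;  // SST files visible in this version
  int refs = 0;                    // number of live references
  ToyVersion* prev = nullptr;
  ToyVersion* next = nullptr;
};

class ToyVersionList {
 public:
  ToyVersionList() { dummy_.prev = dummy_.next = &dummy_; }
  ToyVersionList(const ToyVersionList&) = delete;
  ToyVersionList& operator=(const ToyVersionList&) = delete;
  ~ToyVersionList() {
    while (dummy_.next != &dummy_) Remove(dummy_.next);
  }

  // Called after a flush or compaction: append a new version and make it
  // current. The list itself holds one reference to the current version.
  void Install(std::vector<std::string> files) {
    ToyVersion* v = new ToyVersion;
    v->files = std::move(files);
    v->prev = dummy_.prev;
    v->next = &dummy_;
    v->prev->next = v;
    dummy_.prev = v;
    ToyVersion* old = current_;
    current_ = v;
    ++v->refs;
    if (old != nullptr) Unref(old);  // drop the list's reference to the old one
  }

  // A reader pins the current version for the duration of its lookup.
  ToyVersion* RefCurrent() {
    assert(current_ != nullptr);
    ++current_->refs;
    return current_;
  }

  // Unlink and delete the version once nobody references it any more.
  void Unref(ToyVersion* v) {
    if (--v->refs == 0) Remove(v);
  }

  const ToyVersion* Current() const { return current_; }

  int LiveVersions() const {
    int n = 0;
    for (const ToyVersion* v = dummy_.next; v != &dummy_; v = v->next) ++n;
    return n;
  }

 private:
  void Remove(ToyVersion* v) {
    v->prev->next = v->next;
    v->next->prev = v->prev;
    delete v;
  }

  ToyVersion dummy_;               // list head; current_ == dummy_.prev
  ToyVersion* current_ = nullptr;
};

// Usage example: a Get that started before a compaction keeps its snapshot.
void ToyVersionListExample() {
  ToyVersionList list;
  list.Install({"1.sst", "2.sst"});
  ToyVersion* pinned = list.RefCurrent();  // a Get starts
  list.Install({"3.sst"});                 // a compaction finishes meanwhile
  assert(pinned->files.size() == 2);       // the Get still sees its own files
  assert(list.LiveVersions() == 2);
  list.Unref(pinned);                      // the Get ends
  assert(list.LiveVersions() == 1);
}
```

The old version is deleted by whichever side drops the last reference, so neither the reader nor the installer has to wait for the other.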

// db/version_set.h

// A column family's version consists of the table and blob files owned by
// the column family at a certain point in time.
class Version {
  // ...
  ColumnFamilyData* cfd_;  // ColumnFamilyData to which this Version belongs
  Logger* info_log_;
  Statistics* db_statistics_;
  TableCache* table_cache_;
  BlobSource* blob_source_;
  const MergeOperator* merge_operator_;

  VersionStorageInfo storage_info_;
  VersionSet* vset_;  // VersionSet to which this Version belongs
  Version* next_;     // Next version in linked list
  Version* prev_;     // Previous version in linked list
  int refs_;          // Number of live refs to this version
  //...
};

As shown, ColumnFamilyData and Version hold pointers to each other.

ColumnFamilyData also has a SuperVersion. What is this data structure for? The next section looks at it.

SuperVersion

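Recall the flush process: a flush does not create an SST file out of thin air; it consumes an immutable memtable to produce one. A user's Get first checks the memtable, then the immutable memtables, and finally the SST files. Since MVCC is used for concurrency control over SST files, the immutable memtables need the same treatment.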
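The lookup order described above (memtable, then immutable memtables, then SST files) can be sketched with plain maps. This is a deliberately simplified illustration: ToyLsm is a hypothetical name, "SST files" are in-memory maps in a single flat list rather than levels, and deletions are not modeled.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Table;

// Toy model only: ToyLsm is a hypothetical name, not a RocksDB class.
struct ToyLsm {
  Table mem;               // active memtable
  std::deque<Table> imm;   // immutable memtables, newest at the front
  std::vector<Table> sst;  // flushed tables, newest at the back

  void Put(const std::string& key, const std::string& value) {
    mem[key] = value;
  }

  // Freeze the active memtable and start a fresh, empty one.
  void SwitchMemtable() {
    imm.push_front(Table());
    imm.front().swap(mem);
  }

  // A flush consumes the oldest immutable memtable and produces one "SST".
  bool FlushOldest() {
    if (imm.empty()) return false;
    sst.push_back(Table());
    sst.back().swap(imm.back());
    imm.pop_back();
    return true;
  }

  // Newest data wins: memtable, then immutable memtables, then SSTs.
  bool Get(const std::string& key, std::string* value) const {
    if (Lookup(mem, key, value)) return true;
    for (const Table& t : imm) {
      if (Lookup(t, key, value)) return true;
    }
    for (auto it = sst.rbegin(); it != sst.rend(); ++it) {
      if (Lookup(*it, key, value)) return true;
    }
    return false;
  }

  static bool Lookup(const Table& t, const std::string& key,
                     std::string* value) {
    auto it = t.find(key);
    if (it == t.end()) return false;
    *value = it->second;
    return true;
  }
};

// Usage example: the newest value stays visible across switch and flush.
void ToyLsmExample() {
  ToyLsm lsm;
  lsm.Put("k", "v1");
  lsm.SwitchMemtable();
  lsm.Put("k", "v2");
  std::string v;
  assert(lsm.Get("k", &v) && v == "v2");
  lsm.SwitchMemtable();
  lsm.FlushOldest();  // moves the table holding "v1" into sst
  assert(lsm.Get("k", &v) && v == "v2");
}
```

Note that a flush changes both the immutable memtable list and the SST list in one step, which is why a reader needs a consistent view of both at once.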

// db/column_family.h

// holds references to memtable, all immutable memtables and version
struct SuperVersion {
  // Accessing members of this class is not thread-safe and requires external
  // synchronization (ie db mutex held or on write thread).
  ColumnFamilyData* cfd;
  MemTable* mem;
  MemTableListVersion* imm;
  Version* current;
  MutableCFOptions mutable_cf_options;
  // Version number of the current SuperVersion
  uint64_t version_number;
  WriteStallCondition write_stall_condition;
  // ...

  // should be called outside the mutex
  SuperVersion() = default;
  ~SuperVersion();
  SuperVersion* Ref();
  // If Unref() returns true, Cleanup() should be called with mutex held
  // before deleting this SuperVersion.
  bool Unref();

  // call these two methods with db mutex held
  // Cleanup unrefs mem, imm and current. Also, it stores all memtables
  // that needs to be deleted in to_delete vector. Unrefing those
  // objects needs to be done in the mutex
  void Cleanup();
  void Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
            MemTableListVersion* new_imm, Version* new_current);
  // ...

 private:
  std::atomic<uint32_t> refs;
  // We need to_delete because during Cleanup(), imm->Unref() returns
  // all memtables that we need to free through this vector. We then
  // delete all those memtables outside of mutex, during destruction
  autovector<MemTable*> to_delete;
};

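As shown, SuperVersion is a superset of Version: besides the current version, it also holds the active memtable and a snapshot of the immutable memtables, acting somewhat like a "general manager". For example, during initialization it increments the reference count of every resource it manages: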

// db/column_family.cc
void SuperVersion::Init(ColumnFamilyData* new_cfd, MemTable* new_mem,
                        MemTableListVersion* new_imm, Version* new_current) {
  cfd = new_cfd;
  mem = new_mem;
  imm = new_imm;
  current = new_current;
  full_history_ts_low = cfd->GetFullHistoryTsLow();
  cfd->Ref();
  mem->Ref();
  imm->Ref();
  current->Ref();
  refs.store(1, std::memory_order_relaxed);
}

In fact, a user's Get does not take the current version directly from the column family; it goes through the SuperVersion, as the analysis of the read path will show.
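The pattern of reading through a reference-counted bundle can be sketched as follows. This is a toy under stated assumptions, not the RocksDB implementation: ToySuperVersion and ToyColumnFamily and their Install/Acquire/Release methods are hypothetical names, a single integer stands in for the memtable, immutable memtables, and current version, and all calls are assumed to be serialized by an external mutex. Only the Ref()/Unref() convention (Unref() returns true when the last reference is dropped) mirrors the struct shown above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy model only: hypothetical stand-ins, not RocksDB's types.
struct ToySuperVersion {
  int version_id;           // stands in for mem + imm + current
  uint64_t version_number;  // increases with every install
  std::atomic<uint32_t> refs;

  ToySuperVersion(int id, uint64_t number)
      : version_id(id), version_number(number), refs(1) {}

  ToySuperVersion* Ref() {
    refs.fetch_add(1, std::memory_order_relaxed);
    return this;
  }
  // Returns true when the caller has dropped the last reference.
  bool Unref() { return refs.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};

// Assumption: Install/Acquire/Release are serialized by an external mutex.
class ToyColumnFamily {
 public:
  ToyColumnFamily() = default;
  ToyColumnFamily(const ToyColumnFamily&) = delete;
  ToyColumnFamily& operator=(const ToyColumnFamily&) = delete;
  ~ToyColumnFamily() {
    if (sv_ != nullptr) Release(sv_);
  }

  // Publish a new bundle; the old one survives while readers still hold it.
  void Install(int version_id) {
    ToySuperVersion* old = sv_;
    sv_ = new ToySuperVersion(version_id, ++next_number_);
    ++live_;
    if (old != nullptr) Release(old);
  }

  // A Get pins the whole bundle, so its view cannot change midway.
  ToySuperVersion* Acquire() {
    assert(sv_ != nullptr);
    return sv_->Ref();
  }

  void Release(ToySuperVersion* sv) {
    if (sv->Unref()) {
      delete sv;
      --live_;
    }
  }

  int Live() const { return live_; }

 private:
  ToySuperVersion* sv_ = nullptr;
  uint64_t next_number_ = 0;
  int live_ = 0;  // bundles not yet deleted
};

// Usage example: a reader keeps its bundle across a concurrent install.
void ToySuperVersionExample() {
  ToyColumnFamily cf;
  cf.Install(1);
  ToySuperVersion* sv = cf.Acquire();  // a Get starts
  cf.Install(2);                       // a flush publishes a new bundle
  assert(sv->version_id == 1);         // the Get still sees the old bundle
  assert(cf.Live() == 2);
  cf.Release(sv);                      // the Get ends; old bundle is freed
  assert(cf.Live() == 1);
}
```

Pinning one bundle instead of three separate objects means a reader never combines a new immutable memtable list with an old SST view, or the reverse.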

VersionSet

This data structure holds all versions of all column families in a DB.

Summary

The relationships among these structures are illustrated by a diagram from the post "RocksDB Version管理概述" by 追梦的蜗牛 on 博客园 (cnblogs).
