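Table of contents
- Source code analysis
- Passing the rate limiter along
This article has two parts: first the working principle and source code, then how the rate limiter is passed down to the place where it is used. The first part is still incomplete.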
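A rate limiter is neither specific to LSM trees nor an idea original to RocksDB. It is a common tool for throttling the rate at which a system resource is accessed, and readers who have worked on Java backends have probably met one. Why does RocksDB use a rate limiter? According to the official wiki: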
For example, flash writes cause terrible spikes in read latency if they exceed a certain threshold
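RocksDB therefore uses a rate limiter to cap the combined write rate of flush and compaction, in other words their total write bandwidth. We already saw the rate limiter in use when analyzing the flush flow in an earlier article: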
cpp
// file/writable_file_writer.cc
// This method writes to disk the specified data and makes use of the rate
// limiter if available
IOStatus WritableFileWriter::WriteBuffered(const IOOptions& opts,
const char* data, size_t size) {
// ...
while (left > 0) {
size_t allowed = left;
// Throttled here if a rate limiter is enabled
if (rate_limiter_ != nullptr &&
rate_limiter_priority_used != Env::IO_TOTAL) {
allowed = rate_limiter_->RequestToken(left, 0 /* alignment */,
rate_limiter_priority_used, stats_,
RateLimiter::OpType::kWrite);
}
// ...
}
// ...
}
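By default, however, there is no rate limiter. The user has to supply one through the database options. To use the database, the user first creates the options object, of type struct Options, which is declared as follows: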
cpp
// include/rocksdb/options.h
struct Options: public DBOptions, public ColumnFamilyOptions {}
One of its base classes, DBOptions, is declared as:
cpp
// include/rocksdb/options.h
struct DBOptions {
// ...
std::shared_ptr<RateLimiter> rate_limiter = nullptr;
// ...
}
The user can assign it like this:
cpp
options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(200 * 1000 * 1000));
Here is NewGenericRateLimiter:
cpp
// util/rate_limiter.cc
RateLimiter* NewGenericRateLimiter(
int64_t rate_bytes_per_sec, int64_t refill_period_us /* = 100 * 1000 */,
int32_t fairness /* = 10 */,
RateLimiter::Mode mode /* = RateLimiter::Mode::kWritesOnly */,
bool auto_tuned /* = false */) {
// Validate the arguments
assert(rate_bytes_per_sec > 0);
assert(refill_period_us > 0);
assert(fairness > 0);
// What actually gets created is a GenericRateLimiter
std::unique_ptr<RateLimiter> limiter(
new GenericRateLimiter(rate_bytes_per_sec, refill_period_us, fairness,
mode, SystemClock::Default(), auto_tuned));
return limiter.release();
}
So RocksDB conveniently ships a ready-made rate limiter, GenericRateLimiter, which we analyze next.
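Source code analysis
Before looking at the code, a brief word on how GenericRateLimiter throttles. It is based on the token bucket algorithm: tokens are added to the bucket periodically, and the number added per period directly determines the maximum throughput. A flush or compaction write requests tokens from the limiter, one token per byte written. If the request is no larger than what the bucket holds, it passes immediately; otherwise the thread blocks and sleeps. Although RocksDB uses the rate limiter to cap the combined write bandwidth of flush and compaction, the two are given different priorities, with flush the higher.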
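The token bucket idea can be sketched in a few lines. This is a minimal single-threaded illustration, not GenericRateLimiter itself: the class name TokenBucket and its method TryRequest are made up for this example, time is passed in by the caller, and a request that cannot be served is simply refused rather than put to sleep.

```cpp
#include <cstdint>

// Minimal single-threaded token bucket (illustrative names, not RocksDB's).
// Time is passed in explicitly so the behaviour is deterministic.
class TokenBucket {
 public:
  TokenBucket(int64_t refill_bytes_per_period, int64_t refill_period_us)
      : refill_bytes_per_period_(refill_bytes_per_period),
        refill_period_us_(refill_period_us) {}

  // Tops the bucket up if a refill is due at `now_us`, then tries to take
  // `bytes` tokens. Returns true if the request may proceed.
  bool TryRequest(int64_t bytes, int64_t now_us) {
    if (now_us >= next_refill_us_) {
      // Simplification: leftover tokens are discarded, not carried over.
      available_bytes_ = refill_bytes_per_period_;
      next_refill_us_ = now_us + refill_period_us_;
    }
    if (bytes > available_bytes_) return false;
    available_bytes_ -= bytes;
    return true;
  }

  int64_t available_bytes() const { return available_bytes_; }

 private:
  const int64_t refill_bytes_per_period_;
  const int64_t refill_period_us_;
  int64_t available_bytes_ = 0;
  int64_t next_refill_us_ = 0;
};
```

With a bucket of 100 tokens refilled every 1000 µs, a 60-byte request at time 0 passes, a second 60-byte request at time 500 is refused because only 40 tokens remain, and the same request passes once the refill at time 1000 has happened.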
Function call order
GenericRateLimiter
cpp
// util/rate_limiter.cc
GenericRateLimiter::GenericRateLimiter(
int64_t rate_bytes_per_sec, int64_t refill_period_us, int32_t fairness,
RateLimiter::Mode mode, const std::shared_ptr<SystemClock>& clock,
bool auto_tuned)
: RateLimiter(mode),
refill_period_us_(refill_period_us),
rate_bytes_per_sec_(auto_tuned ? rate_bytes_per_sec / 2
: rate_bytes_per_sec),
refill_bytes_per_period_(
CalculateRefillBytesPerPeriodLocked(rate_bytes_per_sec_)),
clock_(clock),
stop_(false),
exit_cv_(&request_mutex_),
requests_to_wait_(0),
available_bytes_(0),
next_refill_us_(NowMicrosMonotonicLocked()),
fairness_(fairness > 100 ? 100 : fairness),
rnd_((uint32_t)time(nullptr)),
wait_until_refill_pending_(false),
auto_tuned_(auto_tuned),
num_drains_(0),
max_bytes_per_sec_(rate_bytes_per_sec),
tuned_time_(NowMicrosMonotonicLocked()) {
for (int i = Env::IO_LOW; i < Env::IO_TOTAL; ++i) {
total_requests_[i] = 0;
total_bytes_through_[i] = 0;
}
}
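Constructor parameters:
- rate_bytes_per_sec: the write bandwidth. It sets the combined write bandwidth of flush and compaction, in bytes per second.
- refill_period_us: how often tokens are added to the bucket, in microseconds.
- fairness: requests carry priorities and higher priorities are served first, so to keep low-priority requests from starving, they are given a turn at a certain ratio.
- mode: whether to throttle only writes, only reads, or both:
cpp
// include/rocksdb/rate_limiter.h
class RateLimiter {
 public:
  // ...
  enum class Mode {
    kReadsOnly,
    kWritesOnly,
    kAllIo,
  };
  // ...
};
- clock: the clock used to read the current time.
- auto_tuned: whether the rate is tuned automatically. As the initializer list shows, when it is true the limiter starts at half of rate_bytes_per_sec and keeps the full value as max_bytes_per_sec_.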
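One way to picture the fairness parameter is as a "one in fairness" chance that the lower priority goes first. The sketch below is a simplified assumption reduced to two priorities; the enum Pri and the function IterationOrder are hypothetical, and the real ordering is produced by GeneratePriorityIterationOrderLocked over all four priorities.

```cpp
#include <array>
#include <cstdint>

// Hypothetical two-level stand-in for Env::IOPriority.
enum class Pri { kLow, kHigh };

// Returns the order in which the two queues are drained for one refill.
// `draw` stands in for a random number: roughly one draw in `fairness`
// puts the low-priority queue first so that it cannot starve.
// Assumes fairness > 0.
std::array<Pri, 2> IterationOrder(uint32_t draw, int32_t fairness) {
  const bool low_first = (draw % static_cast<uint32_t>(fairness)) == 0;
  if (low_first) return {Pri::kLow, Pri::kHigh};
  return {Pri::kHigh, Pri::kLow};
}
```

With the default fairness of 10, about one refill in ten would serve the low-priority queue first under this model; a larger value favors high priority more strongly.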
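The member refill_bytes_per_period_ is the number of tokens added to the bucket in each period. rate_bytes_per_sec_ is the amount per second; call it $B$, and let $T$ be the refill period refill_period_us_. The amount per period is then $B \cdot T / 1000000$. Because $T$ is measured in $\mu s$, the product has to be divided by 1,000,000 to convert units. The initializer list calls GenericRateLimiter::CalculateRefillBytesPerPeriodLocked to compute this value: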
cpp
// util/rate_limiter.cc
int64_t GenericRateLimiter::CalculateRefillBytesPerPeriodLocked(
int64_t rate_bytes_per_sec) {
int64_t refill_period_us = refill_period_us_.load(std::memory_order_relaxed);
if (std::numeric_limits<int64_t>::max() / rate_bytes_per_sec <
refill_period_us) {
// Avoid unexpected result in the overflow case. The result now is still
// inaccurate but is a number that is large enough.
return std::numeric_limits<int64_t>::max() / kMicrosecondsPerSecond;
} else {
// Exactly B*T / 1,000,000
return rate_bytes_per_sec * refill_period_us / kMicrosecondsPerSecond;
}
}
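This code neatly checks whether $B \cdot T$ would overflow before multiplying. If it would, the function falls back to std::numeric_limits<int64_t>::max() / kMicrosecondsPerSecond as a large approximation.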
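To make the arithmetic concrete, here is a standalone sketch of the same calculation. The free function RefillBytesPerPeriod and the constant kMicrosPerSecond are names chosen for this example; the logic mirrors the quoted function but takes the period as an argument instead of reading a member.

```cpp
#include <cstdint>
#include <limits>

constexpr int64_t kMicrosPerSecond = 1000 * 1000;

// Tokens added per refill period: B * T / 1,000,000, with B in bytes per
// second and T in microseconds. Both arguments are assumed positive.
int64_t RefillBytesPerPeriod(int64_t rate_bytes_per_sec,
                             int64_t refill_period_us) {
  constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
  // B * T overflows exactly when kMax / B < T, so test before multiplying.
  if (kMax / rate_bytes_per_sec < refill_period_us) {
    return kMax / kMicrosPerSecond;
  }
  return rate_bytes_per_sec * refill_period_us / kMicrosPerSecond;
}
```

For the 200 * 1000 * 1000 bytes/s limiter created earlier and the default 100 * 1000 µs period, this gives 20,000,000 bytes per refill.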
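The constructor body is trivial, just one for loop. Requests come in four priorities, so the statistics for each priority are initialized:
- total_requests_[i]: the cumulative number of requests at priority i
- total_bytes_through_[i]: the cumulative number of bytes granted at priority i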
Now let us return to where the rate limiter is used and follow the call chain:
cpp
// file/writable_file_writer.cc
// This method writes to disk the specified data and makes use of the rate
// limiter if available
IOStatus WritableFileWriter::WriteBuffered(const IOOptions& opts,
const char* data, size_t size) {
// ...
while (left > 0) {
size_t allowed = left;
// Throttled here if a rate limiter is enabled
if (rate_limiter_ != nullptr &&
rate_limiter_priority_used != Env::IO_TOTAL) {
allowed = rate_limiter_->RequestToken(left, 0 /* alignment */,
rate_limiter_priority_used, stats_,
RateLimiter::OpType::kWrite);
}
// ...
}
// ...
}
The first call is to RateLimiter::RequestToken.
RateLimiter::RequestToken
cpp
// util/rate_limiter.cc
size_t RateLimiter::RequestToken(size_t bytes, size_t alignment,
Env::IOPriority io_priority, Statistics* stats,
RateLimiter::OpType op_type) {
if (io_priority < Env::IO_TOTAL && IsRateLimited(op_type)) {
bytes = std::min(bytes, static_cast<size_t>(GetSingleBurstBytes()));
if (alignment > 0) {
// Here we may actually require more than burst and block
// as we can not write/read less than one page at a time on direct I/O
// thus we do not want to be strictly constrained by burst
bytes = std::max(alignment, TruncateToPageBoundary(alignment, bytes));
}
Request(bytes, io_priority, stats, op_type);
}
return bytes;
}
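The size adjustment at the top of RequestToken can be tried out in isolation. In the sketch below, ClampRequest and TruncateToPage are illustrative names; TruncateToPage is assumed to round down to a multiple of a power-of-two page size, which is what the name TruncateToPageBoundary suggests.

```cpp
#include <algorithm>
#include <cstddef>

// Rounds `bytes` down to a multiple of `page_size` (assumed a power of two).
std::size_t TruncateToPage(std::size_t page_size, std::size_t bytes) {
  return bytes - (bytes & (page_size - 1));
}

// How many bytes one call may ask for: at most one burst, but never less
// than one aligned page when alignment is required (direct I/O).
std::size_t ClampRequest(std::size_t bytes, std::size_t single_burst_bytes,
                         std::size_t alignment) {
  bytes = std::min(bytes, single_burst_bytes);
  if (alignment > 0) {
    bytes = std::max(alignment, TruncateToPage(alignment, bytes));
  }
  return bytes;
}
```

With a burst of 6000 bytes, a 10000-byte request is cut to 6000 when no alignment is needed, and to 4096 with a 4096-byte alignment. A 100-byte aligned request is raised to 4096, which is why an aligned request may exceed what a strict burst limit would allow.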
RequestToken also handles alignment. Setting alignment aside for now, we arrive at:
cpp
// include/rocksdb/rate_limiter.h
// Requests token to read or write bytes and potentially updates statistics.
// If this request can not be satisfied, the call is blocked. Caller is
// responsible to make sure bytes <= GetSingleBurstBytes() and bytes >= 0.
virtual void Request(const int64_t bytes, const Env::IOPriority pri, Statistics* stats, OpType op_type) {
// Only I/O whose type matches the limiter's mode is throttled
if (IsRateLimited(op_type)) {
Request(bytes, pri, stats);
}
}
Next:
cpp
// include/rocksdb/rate_limiter.h
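virtual void Request(const int64_t bytes, const Env::IOPriority pri, Statistics* /* stats */) {
// The default implementation forwards to the older overload without statistics
Request(bytes, pri);
}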
virtual void Request(const int64_t /*bytes*/, const Env::IOPriority /*pri*/) {
assert(false);
}
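The last Request asserts by default, so a subclass has to override one of these overloads. The call actually lands in GenericRateLimiter::Request, which overrides the three-argument version.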
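The dispatch pattern can be reproduced with a toy hierarchy. Everything in this sketch is hypothetical (BaseLimiter, MyLimiter, plain int for priority and statistics); it only shows how a subclass that overrides the three-argument overload intercepts the call before the default forwarding to the two-argument one.

```cpp
#include <cstdint>

// Hypothetical miniature of the overload chain; not RocksDB's real classes.
class BaseLimiter {
 public:
  virtual ~BaseLimiter() = default;

  // Newer entry point: by default forwards to the older two-argument form.
  virtual void Request(int64_t bytes, int pri, int* stats) {
    (void)stats;
    Request(bytes, pri);
  }

  // Older entry point: counts calls that no subclass has handled.
  virtual void Request(int64_t /*bytes*/, int /*pri*/) {
    ++unimplemented_calls;
  }

  int unimplemented_calls = 0;
};

class MyLimiter : public BaseLimiter {
 public:
  using BaseLimiter::Request;  // keep the other overload visible

  void Request(int64_t bytes, int /*pri*/, int* /*stats*/) override {
    granted += bytes;
  }

  int64_t granted = 0;
};
```

Calling Request(128, 0, nullptr) through a BaseLimiter reference reaches MyLimiter::Request, and the fallback overload is never touched.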
GenericRateLimiter::Request
cpp
void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri,
Statistics* stats) {
assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed));
bytes = std::max(static_cast<int64_t>(0), bytes);
TEST_SYNC_POINT("GenericRateLimiter::Request");
TEST_SYNC_POINT_CALLBACK("GenericRateLimiter::Request:1",
&rate_bytes_per_sec_);
MutexLock g(&request_mutex_);
if (auto_tuned_) {
// Auto-tune mode is ignored here
}
if (stop_) {
// It is now in the clean-up of ~GenericRateLimiter().
// Therefore any new incoming request will exit from here
// and not get satisfied.
return;
}
// Update statistics: count one more request at this priority
++total_requests_[pri];
if (available_bytes_ > 0) {
int64_t bytes_through = std::min(available_bytes_, bytes);
total_bytes_through_[pri] += bytes_through;
// Key step: even if the bucket cannot cover the request, take all remaining tokens
available_bytes_ -= bytes_through;
bytes -= bytes_through;
}
// The request has been served in full, so return directly
if (bytes == 0) {
return;
}
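// Reaching this point means available_bytes_ is 0.
// Either this is the first request to run out of tokens, or earlier requests are still blocked:
// in the first case the queues are empty, in the second they are not.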
// Request cannot be satisfied at this moment, enqueue
Req r(bytes, &request_mutex_);
queue_[pri].push_back(&r);
TEST_SYNC_POINT_CALLBACK("GenericRateLimiter::Request:PostEnqueueRequest",
&request_mutex_);
// A thread representing a queued request coordinates with other such threads.
// There are two main duties.
//
// (1) Waiting for the next refill time.
// (2) Refilling the bytes and granting requests.
do {
int64_t time_until_refill_us = next_refill_us_ - NowMicrosMonotonicLocked();
if (time_until_refill_us > 0) {
if (wait_until_refill_pending_) {
// Somebody is performing (1). Trust we'll be woken up when our request
// is granted or we are needed for future duties.
// An earlier request set wait_until_refill_pending_ and is doing the timed wait
r.cv.Wait();
} else {
// Whichever thread reaches here first performs duty (1) as described
// above.
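// The timed wait below keeps a lone request from blocking forever:
// even with a single request, it wakes after one period and performs the refill on the next loop iteration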
int64_t wait_until = clock_->NowMicros() + time_until_refill_us;
RecordTick(stats, NUMBER_RATE_LIMITER_DRAINS);
++num_drains_;
wait_until_refill_pending_ = true;
clock_->TimedWait(&r.cv, std::chrono::microseconds(wait_until));
TEST_SYNC_POINT_CALLBACK("GenericRateLimiter::Request:PostTimedWait",
&time_until_refill_us);
wait_until_refill_pending_ = false;
}
} else {
// Whichever thread reaches here first performs duty (2) as described
// above.
RefillBytesAndGrantRequestsLocked();
}
if (r.request_bytes == 0) {
// If there is any remaining requests, make sure there exists at least
// one candidate is awake for future duties by signaling a front request
// of a queue.
for (int i = Env::IO_TOTAL - 1; i >= Env::IO_LOW; --i) {
auto& queue = queue_[i];
if (!queue.empty()) {
queue.front()->cv.Signal();
break;
}
}
}
// Invariant: non-granted request is always in one queue, and granted
// request is always in zero queues.
// Debug-only checks omitted...
} while (!stop_ && r.request_bytes > 0);
if (stop_) {
// It is now in the clean-up of ~GenericRateLimiter().
// Therefore any woken-up request will have come out of the loop and then
// exit here. It might or might not have been satisfied.
--requests_to_wait_;
exit_cv_.Signal();
}
}
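The logic of this code is:
- Use the token count in the bucket, available_bytes_, to decide whether the request can pass.
  - If it covers the requested amount, the request passes and the function returns.
  - If not, the request takes whatever tokens remain, and the shortfall is wrapped in a Req and pushed onto the queue for its priority.
- Run the do-while loop until the requested amount has been granted in full or the limiter is shutting down. In each iteration the thread either waits for the next refill time or, if that time has arrived, performs the refill itself by calling RefillBytesAndGrantRequestsLocked.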
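The fast path in the first bullet, "take what is there and queue the rest", can be isolated as follows. This is a sketch under simplifying assumptions: a single queue, no locking, and no blocking. Bucket and RequestOrEnqueue are names invented for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

struct Bucket {
  int64_t available_bytes = 0;
  std::deque<int64_t> waiting;  // bytes each blocked request still needs
};

// Takes whatever tokens are left and queues only the part that could not
// be served. Returns true if the request was satisfied in full.
bool RequestOrEnqueue(Bucket& bucket, int64_t bytes) {
  const int64_t taken = std::min(bucket.available_bytes, bytes);
  bucket.available_bytes -= taken;
  bytes -= taken;
  if (bytes == 0) return true;
  bucket.waiting.push_back(bytes);  // the real code blocks the thread here
  return false;
}
```

Starting from 100 tokens, a 30-byte request passes and leaves 70. A following 100-byte request drains those 70 and queues a remainder of 30, so the bucket is always empty by the time anything is queued.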
GenericRateLimiter::RefillBytesAndGrantRequestsLocked
cpp
void GenericRateLimiter::RefillBytesAndGrantRequestsLocked() {
TEST_SYNC_POINT_CALLBACK(
"GenericRateLimiter::RefillBytesAndGrantRequestsLocked", &request_mutex_);
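// This function runs with request_mutex_ held.
// It is only called from GenericRateLimiter::Request, and available_bytes_ is 0 at that point.
// Tokens are about to be added, so first compute the next refill time.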
next_refill_us_ = NowMicrosMonotonicLocked() +
refill_period_us_.load(std::memory_order_relaxed);
// Carry over the left over quota from the last period
auto refill_bytes_per_period =
refill_bytes_per_period_.load(std::memory_order_relaxed);
assert(available_bytes_ == 0);
available_bytes_ = refill_bytes_per_period;
std::vector<Env::IOPriority> pri_iteration_order =
GeneratePriorityIterationOrderLocked();
for (int i = Env::IO_LOW; i < Env::IO_TOTAL; ++i) {
assert(!pri_iteration_order.empty());
Env::IOPriority current_pri = pri_iteration_order[i];
auto* queue = &queue_[current_pri];
while (!queue->empty()) {
auto* next_req = queue->front();
if (available_bytes_ < next_req->request_bytes) {
// Grant partial request_bytes to avoid starvation of requests
// that become asking for more bytes than available_bytes_
// due to dynamically reduced rate limiter's bytes_per_second that
// leads to reduced refill_bytes_per_period hence available_bytes_
next_req->request_bytes -= available_bytes_;
available_bytes_ = 0;
break;
}
// This period's tokens are enough for this request
available_bytes_ -= next_req->request_bytes;
next_req->request_bytes = 0;
total_bytes_through_[current_pri] += next_req->bytes;
queue->pop_front();
// Quota granted, signal the thread to exit
// Its request_bytes is now 0, so once woken it fails the do-while condition and leaves the loop
next_req->cv.Signal();
// Keep looping so that this period's tokens are used up
}
}
}
As long as some request is blocked in queue_, this function is executed once every refill period.
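The refill step can be pictured with a simplified model. The sketch below is a reduction of RefillBytesAndGrantRequestsLocked under several assumptions: two queues instead of four, a fixed iteration order, plain integers instead of Req objects, and no condition variables. The names Queues and RefillAndGrant are invented for the example.

```cpp
#include <array>
#include <cstdint>
#include <deque>

struct Queues {
  int64_t available_bytes = 0;
  // Index 0 is drained first. Each entry is the bytes a request still needs.
  std::array<std::deque<int64_t>, 2> by_priority;
};

// One refill: reset the bucket, then grant queued requests in order until
// the tokens run out. Returns the number of requests granted in full.
int RefillAndGrant(Queues& q, int64_t refill_bytes_per_period) {
  q.available_bytes = refill_bytes_per_period;
  int granted = 0;
  for (auto& queue : q.by_priority) {
    while (!queue.empty()) {
      int64_t& need = queue.front();
      if (q.available_bytes < need) {
        need -= q.available_bytes;  // partial grant; the request keeps waiting
        q.available_bytes = 0;
        break;
      }
      q.available_bytes -= need;
      queue.pop_front();  // fully granted; the real code signals the waiter
      ++granted;
    }
  }
  return granted;
}
```

With 100 tokens per period, queued requests of 40 and 50 bytes in the first queue are both granted, and the remaining 10 tokens go to a 30-byte request in the second queue, which stays queued with 20 bytes outstanding.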
Summary
Passing the rate limiter along
To use the database, the user first creates the options object, of type struct Options, which is declared as follows:
cpp
struct Options: public DBOptions, public ColumnFamilyOptions {}
As shown, it is a derived class. One of its base classes, DBOptions, is declared as:
cpp
// include/rocksdb/options.h
struct DBOptions {
// ...
std::shared_ptr<RateLimiter> rate_limiter = nullptr;
// ...
}
The user needs to assign it, as shown here:
cpp
options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(200 * 1000 * 1000));
This is where the rate limiter starts its journey.
Next, the user passes the options to the database's Open function, where Options is first converted to DBOptions:
cpp
// db/db_impl/db_impl_open.cc
Status DB::Open(const Options& options, const std::string& dbname, DB** dbptr) {
// Convert to DBOptions
DBOptions db_options(options);
// ...
}
// include/rocksdb/options.h
struct DBOptions {}
// options/options.cc
// Derived-to-base conversion (an upcast): static_cast to the DBOptions base, then copy it
DBOptions::DBOptions(const Options& options): DBOptions(*static_cast<const DBOptions*>(&options)) {}
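The effect of this upcast-and-copy on the limiter can be checked with stand-in types. The structs below (Limiter, DbOpts, CfOpts, Opts) and the helper ToDbOpts are hypothetical and only reproduce the inheritance shape; the point is that copying the base sub-object copies the shared_ptr, so both option objects refer to the same limiter.

```cpp
#include <memory>

// Hypothetical stand-ins for the RocksDB types; only the shape matters.
struct Limiter {};
struct DbOpts {
  std::shared_ptr<Limiter> rate_limiter;
};
struct CfOpts {};
struct Opts : public DbOpts, public CfOpts {};

// Copies just the DbOpts base sub-object out of an Opts:
// an upcast followed by a copy construction.
DbOpts ToDbOpts(const Opts& options) {
  return DbOpts(*static_cast<const DbOpts*>(&options));
}
```

After the conversion the limiter has two owners, the original Opts and the new DbOpts, and no second limiter has been created.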
Then, through two more Open overloads, DBOptions is handed to the DBImpl constructor, which is what actually instantiates the database:
cpp
// db/db_impl/db_impl_open.cc
Status DBImpl::Open(const DBOptions& db_options, const std::string& dbname,
const std::vector<ColumnFamilyDescriptor>& column_families,
std::vector<ColumnFamilyHandle*>* handles, DB** dbptr,
const bool seq_per_batch, const bool batch_per_txn) {
// ...
// Create the DBImpl object, passing in the DBOptions
DBImpl* impl = new DBImpl(db_options, dbname, seq_per_batch, batch_per_txn);
// ...
if (s.ok()) {
// Persist RocksDB Options before scheduling the compaction.
// The WriteOptionsFile() will release and lock the mutex internally.
persist_options_status =
impl->WriteOptionsFile(write_options, true /*db_mutex_already_held*/);
// Assign the caller's DB pointer so that Get, Put, etc. can be issued through it.
// This is why the parameter is a pointer to a pointer.
*dbptr = impl;
impl->opened_successfully_ = true;
impl->DeleteObsoleteFiles();
TEST_SYNC_POINT("DBImpl::Open:AfterDeleteFiles");
impl->MaybeScheduleFlushOrCompaction();
}
// ...
}
Part of its constructor is shown here:
cpp
// db/db_impl/db_impl.cc
DBImpl::DBImpl(const DBOptions& options, const std::string& dbname,
const bool seq_per_batch, const bool batch_per_txn,
bool read_only)
: ...,
initial_db_options_(SanitizeOptions(dbname, options, read_only, ...)),
immutable_db_options_(initial_db_options_),
mutable_db_options_(initial_db_options_),
file_options_(BuildDBOptions(immutable_db_options_, mutable_db_options_)),
file_options_for_compaction_(fs->OptimizeForCompactionTableWrite(file_options_, immutable_db_options_))
{...}
It takes the parameter const DBOptions& options and copies the rate limiter into several members:
- SanitizeOptions copies the limiter into the member initial_db_options_, whose type is still DBOptions.
- The ImmutableDBOptions constructor then copies the limiter into the member immutable_db_options_, converting DBOptions to ImmutableDBOptions.
- Next, BuildDBOptions is used to copy the limiter from immutable_db_options_ into file_options_. Overall, ImmutableDBOptions is converted to DBOptions and then to FileOptions. Note: this is where DBOptions becomes FileOptions and the limiter enters FileOptions. Looked at step by step, the conversion is fairly involved and copies the options several times:
  - BuildDBOptions converts ImmutableDBOptions to DBOptions.
  - The FileOptions constructor converts DBOptions to FileOptions:
cpp
struct FileOptions : EnvOptions {
  FileOptions(const DBOptions& opts)
      : EnvOptions(opts), handoff_checksum_type(ChecksumType::kCRC32c) {}
};
    Because FileOptions inherits from EnvOptions, the limiter is copied from DBOptions into the EnvOptions part of FileOptions. The corresponding EnvOptions constructor is:
cpp
struct EnvOptions {
  explicit EnvOptions(const DBOptions& options) {
    AssignEnvOptions(this, options);
  }
};
// The rate limiter is copied into EnvOptions here
void AssignEnvOptions(EnvOptions* env_options, const DBOptions& options) {
  // ...
  env_options->rate_limiter = options.rate_limiter.get();
  // ...
}
- Finally, the limiter is copied along with file_options_ into file_options_for_compaction_. The type is unchanged, still FileOptions. Note: this member is later passed to the flush and compaction functions, which is how flush and compaction get hold of the limiter. This step is crucial.
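One detail worth noticing in AssignEnvOptions is that the limiter crosses over as a raw pointer obtained with get(), not as another shared_ptr. The stand-in types below (Limiter, DbOpts, EnvOpts, FileOpts) are hypothetical and only model that hand-off.

```cpp
#include <memory>

struct Limiter {};  // hypothetical stand-in for RateLimiter

struct DbOpts {
  std::shared_ptr<Limiter> rate_limiter;
};

struct EnvOpts {
  EnvOpts() = default;
  // Stores a non-owning copy of the pointer, as `.get()` does in the source.
  explicit EnvOpts(const DbOpts& options)
      : rate_limiter(options.rate_limiter.get()) {}
  Limiter* rate_limiter = nullptr;
};

struct FileOpts : EnvOpts {
  explicit FileOpts(const DbOpts& options) : EnvOpts(options) {}
};
```

In this model, copying a FileOpts (as happens for the compaction variant) copies the same raw pointer, and the reference count of the shared_ptr does not change. The object that owns the shared_ptr therefore has to outlive every file-level copy.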
Take flush as an example. A flush task is represented by class FlushJob, whose constructor is:
cpp
// db/flush_job.cc
FlushJob::FlushJob(..., const FileOptions& file_options, ...):
//...
file_options_(file_options)
//...
{}
It takes a const FileOptions& file_options. Here is where this constructor is called:
cpp
Status DBImpl::FlushMemTableToOutputFile(...){
//...
FlushJob flush_job(..., file_options_for_compaction_, ...);
//...
}
So DBImpl::file_options_for_compaction_ is passed to FlushJob::file_options_, and the type is still FileOptions. FlushJob therefore also holds the rate limiter.
The flush itself is carried out by FlushJob::Run, which calls FlushJob::WriteLevel0Table, which in turn calls BuildTable and passes it file_options_.
cpp
Status BuildTable(...,const FileOptions& file_options,...){
TableBuilder* builder;
std::unique_ptr<WritableFileWriter> file_writer;
//...
// The file_options parameter is passed on to WritableFileWriter
file_writer.reset(new WritableFileWriter(
std::move(file), fname, file_options, ioptions.clock, io_tracer,
ioptions.stats, Histograms::SST_WRITE_MICROS, ioptions.listeners,
ioptions.file_checksum_gen_factory.get(),
tmp_set.Contains(FileType::kTableFile), false));
builder = NewTableBuilder(tboptions, file_writer.get());
//...
}
Here FlushJob's file_options_ is passed on to the WritableFileWriter constructor:
cpp
WritableFileWriter(..., const FileOptions& options, ...):
// ...
rate_limiter_(options.rate_limiter) // store the rate limiter
//...
{}
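So WritableFileWriter holds the rate limiter directly, in its member rate_limiter_. At this point the rate limiter has reached WritableFileWriter, which is the crucial step.
BuildTable then calls NewTableBuilder, passing in the WritableFileWriter it has just constructed.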
cpp
TableBuilder* NewTableBuilder(const TableBuilderOptions& tboptions,
WritableFileWriter* file) {
assert((tboptions.column_family_id ==
TablePropertiesCollectorFactory::Context::kUnknownColumnFamily) ==
tboptions.column_family_name.empty());
return tboptions.ioptions.table_factory->NewTableBuilder(tboptions, file);
}
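The final call here is to a pure virtual function. For the block-based table format, the implementation actually invoked is BlockBasedTableFactory::NewTableBuilder, which in turn calls the BlockBasedTableBuilder constructor.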
cpp
// table/block_based/block_based_table_builder.cc
BlockBasedTableBuilder::BlockBasedTableBuilder(
const BlockBasedTableOptions& table_options, const TableBuilderOptions& tbo,
WritableFileWriter* file) {
//...
rep_ = new Rep(sanitized_table_options, tbo, file);
//...
}
The key point: the WritableFileWriter pointer above is stored in BlockBasedTableBuilder::rep_, so the builder can drive the WritableFileWriter through rep_.
Back in BuildTable, the call chain continues with BlockBasedTableBuilder::Add -> BlockBasedTableBuilder::Flush -> BlockBasedTableBuilder::WriteBlock -> BlockBasedTableBuilder::WriteMaybeCompressedBlock. The last of these is:
cpp
void BlockBasedTableBuilder::WriteMaybeCompressedBlock(
const Slice& block_contents, CompressionType comp_type, BlockHandle* handle,
BlockType block_type, const Slice* uncompressed_block_data) {
Rep* r = rep_;
// ...
{
io_s = r->file->Append(io_options, block_contents); // drive the previously built WritableFileWriter through rep_
if (!io_s.ok()) {
r->SetIOStatus(io_s);
return;
}
}
// ...
}
It uses its rep_ member to drive the WritableFileWriter built earlier. The call chain is then WritableFileWriter::Append -> WritableFileWriter::Flush -> WritableFileWriter::WriteBuffered:
cpp
IOStatus WritableFileWriter::WriteBuffered(const IOOptions& opts,
const char* data, size_t size) {
//..
while (left > 0) {
size_t allowed = left;
if (rate_limiter_ != nullptr &&
rate_limiter_priority_used != Env::IO_TOTAL) {
allowed = rate_limiter_->RequestToken(left, 0 /* alignment */,
rate_limiter_priority_used, stats_,
RateLimiter::OpType::kWrite);
}
}
//...
}
As noted above, WritableFileWriter holds the rate limiter itself, and this is where it is used.
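The shape of this loop, where each iteration writes only as much as the limiter allows, can be simulated without any I/O. In the sketch below, FixedBurstLimiter is a hypothetical limiter that never blocks and simply caps each grant at a fixed burst (assumed greater than zero), and this WriteBuffered is a stand-in that records chunk sizes instead of writing to a file.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical limiter that never grants more than `burst` bytes per call.
struct FixedBurstLimiter {
  std::size_t burst;
  std::size_t RequestToken(std::size_t bytes) const {
    return std::min(bytes, burst);
  }
};

// Splits a write of `size` bytes into limiter-approved chunks and returns
// the chunk sizes. A null limiter means no throttling: one chunk.
std::vector<std::size_t> WriteBuffered(const FixedBurstLimiter* limiter,
                                       std::size_t size) {
  std::vector<std::size_t> chunks;
  std::size_t left = size;
  while (left > 0) {
    std::size_t allowed = left;
    if (limiter != nullptr) {
      allowed = limiter->RequestToken(left);
    }
    chunks.push_back(allowed);  // the real writer appends `allowed` bytes here
    left -= allowed;
  }
  return chunks;
}
```

With a 4096-byte burst, a 10000-byte buffer goes out as chunks of 4096, 4096 and 1808 bytes. In the real writer each RequestToken call may block until the next refill, which is what spreads the write over time.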
Summary
The path the rate limiter takes:
Options -> DBOptions (derived class to base class) -> FileOptions -> WritableFileWriter