This article was first published on 2025-03-25 on the WeChat channel "狗哥琐话" <- follow so you don't lose track of it.
1. Preface
It's time for everyone's favorite production-grade Rust source-code walkthrough. Today we pick up the thread we left hanging last time.
2. Reading the Code
To help you keep your bearings, here is a simple map of the code.
lua
-- meta/barrier/schedule.rs
  -- run_command
    -- run_multiple_commands
      -- push  # Analyzed in 2.1. Pushes the command into the queue; this is the queue's producer.

# The consumer side of the queue. To give the full picture, the whole call chain is listed from top to bottom.
-- meta/node.rs
  -- start  # The entry point, usually launched as a tokio task. Upstream callers include standalone.rs, cluster.rs, and test_runner.rs.
-- meta/node/server.rs
  -- rpc_serve
    -- rpc_serve_with_store
      -- start_service_as_election_leader  # The logic above is not analyzed here, since it has little to do with running tasks; interested readers can browse it themselves, or raise a request in the comments or danmaku.
-- meta/barrier/manager.rs
  -- GlobalBarrierManager.start  # Analyzed in 2.6
-- meta/barrier/worker.rs
  -- GlobalBarrierWorker.start  # Lightweight entry point; not analyzed
    -- GlobalBarrierWorker.run  # Analyzed in 2.5
      -- GlobalBarrierWorker.run_inner  # Analyzed in 2.4
-- meta/barrier/schedule.rs
  -- PeriodicBarriers.next_barrier  # Analyzed in 2.3
-- meta/barrier/context/mod.rs
  -- GlobalBarrierWorkerContext.next_scheduled
-- meta/barrier/schedule.rs
  -- ScheduledBarriers.next_scheduled  # Analyzed in 2.2
2.1 push
This is where we grab the queue and push the run commands into it, with some default validation along the way. Note that `validate_item` appears twice, but on different receivers: the second call happens after `queue` has been shadowed by the per-database `DatabaseScheduledQueue` entry.
rust
/// Push a scheduled barrier into the queue.
fn push(
    &self,
    database_id: DatabaseId,
    scheduleds: impl IntoIterator<Item = ScheduledQueueItem>,
) -> MetaResult<()> {
    let mut queue = self.inner.queue.lock();
    let scheduleds = scheduleds.into_iter().collect_vec();
    scheduleds
        .iter()
        .try_for_each(|scheduled| queue.validate_item(scheduled))?;
    let queue = queue
        .queue
        .entry(database_id)
        .or_insert_with(DatabaseScheduledQueue::new);
    scheduleds
        .iter()
        .try_for_each(|scheduled| queue.validate_item(scheduled))?;
    for scheduled in scheduleds {
        queue.queue.push_back(scheduled);
        if queue.queue.len() == 1 {
            self.inner.changed_tx.send(()).ok();
        }
    }
    Ok(())
}
One line here is quite conspicuous: we manually acquire the lock, but never call unlock. What's going on?
Look at what `lock()` returns: a `MutexGuard`.
rust
/// An RAII implementation of a "scoped lock" of a mutex. When this structure is
/// dropped (falls out of scope), the lock will be unlocked.
///
/// The data protected by the mutex can be accessed through this guard via its
/// `Deref` and `DerefMut` implementations.
#[clippy::has_significant_drop]
#[must_use = "if unused the Mutex will immediately unlock"]
pub struct MutexGuard<'a, R: RawMutex, T: ?Sized> {
    mutex: &'a Mutex<R, T>,
    marker: PhantomData<(&'a mut T, R::GuardMarker)>,
}
The doc comment spells it out: the lock is released automatically when the guard goes out of scope. In this code, `queue` is declared on the stack, so when the function finishes, the guard is dropped and the mutex is unlocked; no explicit unlock is needed.
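To make the RAII behavior concrete, here is a minimal standalone sketch using std's `Mutex` (the code above uses parking_lot's, but the drop-to-unlock semantics are the same; all names are illustrative):

```rust
use std::sync::Mutex;

// Push two values, letting the first guard drop before re-locking.
fn push_twice(m: &Mutex<Vec<i32>>) {
    {
        let mut guard = m.lock().unwrap(); // acquire; returns a MutexGuard
        guard.push(1);
    } // guard dropped here: the mutex unlocks without any explicit unlock()

    // Locking again succeeds because the first guard has been released.
    m.lock().unwrap().push(2);
}

fn main() {
    let m = Mutex::new(Vec::new());
    push_twice(&m);
    assert_eq!(*m.lock().unwrap(), vec![1, 2]);
    println!("ok");
}
```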
2.2 ScheduledBarriers.next_scheduled
rust
impl ScheduledBarriers {
    pub(super) async fn next_scheduled(&self) -> Scheduled {
        'outer: loop {
            let mut rx = self.inner.changed_tx.subscribe();
            {
                let mut queue = self.inner.queue.lock();
                if queue.status.is_blocked() {
                    continue;
                }
                for (database_id, queue) in &mut queue.queue {
                    if queue.status.is_blocked() {
                        continue;
                    }
                    if let Some(item) = queue.queue.pop_front() {
                        item.send_latency_timer.observe_duration();
                        break 'outer Scheduled {
                            database_id: *database_id,
                            command: item.command,
                            notifiers: item.notifiers,
                            span: item.span,
                        };
                    }
                }
            }
            rx.changed().await.unwrap();
        }
    }
}
This code is interesting. Inside a loop, it subscribes to `changed_tx`, the sender used to announce that a queue has elements. If there is no element to pop, the function parks at `rx.changed().await`; the logic after it does not run until a notification arrives.
You may also notice the standalone block around the lock. Why wrap it in its own block? To exit that scope as early as possible, releasing the lock before awaiting instead of holding it while waiting.
Some readers may ask what `break 'outer` means: it breaks out of the outer loop. The outer loop has been given a name (an informal way to put it, a label) called `'outer`. When several loops are nested, you'll find this notation very practical.
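A tiny self-contained illustration (the names are made up for the example):

```rust
// Find the coordinates of the first cell equal to `target`;
// `break 'outer` exits both nested loops at once.
fn find_first(grid: &[[i32; 2]], target: i32) -> Option<(usize, usize)> {
    let mut found = None;
    'outer: for (i, row) in grid.iter().enumerate() {
        for (j, &v) in row.iter().enumerate() {
            if v == target {
                found = Some((i, j));
                break 'outer; // jumps out of the labeled (outer) loop, not just the inner one
            }
        }
    }
    found
}

fn main() {
    let grid = [[1, 2], [3, 4]];
    assert_eq!(find_first(&grid, 3), Some((1, 0)));
    assert_eq!(find_first(&grid, 9), None);
    println!("ok");
}
```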
So the function's overall logic is: first subscribe to change notifications; then take the queue lock and check whether scheduling is blocked, continuing the loop if so; scan the per-database queues and return the first runnable item found; if nothing is available, wait for a notification and retry. Subscribing before checking the queue ensures that a notification sent between the check and the wait cannot be missed.
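The essential shape of this consumer, check under the lock, otherwise wait for the producer's signal, can be sketched with std primitives as a synchronous stand-in for the tokio watch channel used here (all names are illustrative):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// A blocking analogue of `next_scheduled`: check the queue under the lock,
// and if it's empty, wait. `Condvar::wait` atomically releases the lock
// while sleeping, so a notification sent by the producer cannot be lost
// between the check and the wait.
fn next_item(state: &(Mutex<VecDeque<i32>>, Condvar)) -> i32 {
    let (lock, cvar) = state;
    let mut queue = lock.lock().unwrap();
    loop {
        if let Some(item) = queue.pop_front() {
            return item;
        }
        queue = cvar.wait(queue).unwrap();
    }
}

fn main() {
    let state = Arc::new((Mutex::new(VecDeque::new()), Condvar::new()));
    let producer = Arc::clone(&state);
    thread::spawn(move || {
        let (lock, cvar) = &*producer;
        lock.lock().unwrap().push_back(42);
        cvar.notify_one(); // like `changed_tx.send(())` in `push`
    });
    let item = next_item(&state);
    assert_eq!(item, 42);
    println!("got {}", item);
}
```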
2.3 PeriodicBarriers.next_barrier
rust
pub(super) async fn next_barrier(
    &mut self,
    context: &impl GlobalBarrierWorkerContext,
) -> NewBarrier {
    let checkpoint = self.try_get_checkpoint();
    let scheduled = select! {
        biased;
        scheduled = context.next_scheduled() => {
            self.min_interval.reset();
            let checkpoint = scheduled.command.need_checkpoint() || checkpoint;
            NewBarrier {
                command: Some((scheduled.database_id, scheduled.command, scheduled.notifiers)),
                span: scheduled.span,
                checkpoint,
            }
        },
        _ = self.min_interval.tick() => {
            NewBarrier {
                command: None,
                span: tracing_span(),
                checkpoint,
            }
        }
    };
    self.update_num_uncheckpointed_barrier(scheduled.checkpoint);
    scheduled
}
The logic is fairly simple: generate a new barrier, triggered either by a scheduled command or by the minimum interval elapsing; then update the number of uncheckpointed barriers and return the generated barrier.
But the branch `scheduled = context.next_scheduled() => {` can be a bit puzzling at first glance, because `next_scheduled` is a method of the trait `GlobalBarrierWorkerContext`. Jumping to the implementation clears it up.
rust
impl GlobalBarrierWorkerContext for GlobalBarrierWorkerContextImpl {
    // ignore some code...
    async fn next_scheduled(&self) -> Scheduled {
        self.scheduled_barriers.next_scheduled().await
    }
    // ignore some code...
}
`next_scheduled` is exactly the blocking function we analyzed earlier, so this call may block; the `select!` arm only fires once it returns, and that is when we process the result.
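The delegation pattern can be sketched as follows (types simplified and synchronous; only the shape mirrors the real code):

```rust
// The worker only knows the trait, just like
// `next_barrier(context: &impl GlobalBarrierWorkerContext)`.
trait WorkerContext {
    fn next_scheduled(&self) -> String;
}

struct ScheduledBarriers;

impl ScheduledBarriers {
    fn next_scheduled(&self) -> String {
        "barrier-1".to_string()
    }
}

struct ContextImpl {
    scheduled_barriers: ScheduledBarriers,
}

impl WorkerContext for ContextImpl {
    // The trait method just forwards to the owned component,
    // mirroring GlobalBarrierWorkerContextImpl::next_scheduled.
    fn next_scheduled(&self) -> String {
        self.scheduled_barriers.next_scheduled()
    }
}

fn next_barrier(ctx: &impl WorkerContext) -> String {
    ctx.next_scheduled()
}

fn main() {
    let ctx = ContextImpl { scheduled_barriers: ScheduledBarriers };
    assert_eq!(next_barrier(&ctx), "barrier-1");
    println!("ok");
}
```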
2.4 GlobalBarrierWorker.run_inner
rust
impl<C: GlobalBarrierWorkerContext> GlobalBarrierWorker<C> {
    async fn run_inner(mut self, mut shutdown_rx: Receiver<()>) {
        let (local_notification_tx, mut local_notification_rx) =
            tokio::sync::mpsc::unbounded_channel();
        self.env
            .notification_manager()
            .insert_local_sender(local_notification_tx)
            .await;

        // Start the event loop.
        loop {
            tokio::select! {
                biased;

                // Shutdown
                _ = &mut shutdown_rx => {
                    tracing::info!("Barrier manager is stopped");
                    break;
                }

                request = self.request_rx.recv() => {
                    if let Some(request) = request {
                        match request {
                            BarrierManagerRequest::GetDdlProgress(result_tx) => {
                                let progress = self.checkpoint_control.gen_ddl_progress();
                                if result_tx.send(progress).is_err() {
                                    error!("failed to send get ddl progress");
                                }
                            }
                            // Handle adhoc recovery triggered by user.
                            BarrierManagerRequest::AdhocRecovery(sender) => {
                                self.adhoc_recovery().await;
                                if sender.send(()).is_err() {
                                    warn!("failed to notify finish of adhoc recovery");
                                }
                            }
                        }
                    } else {
                        tracing::info!("end of request stream. meta node may be shutting down. Stop global barrier manager");
                        return;
                    }
                }

                changed_worker = self.active_streaming_nodes.changed() => {
                    #[cfg(debug_assertions)]
                    {
                        self.active_streaming_nodes.validate_change().await;
                    }

                    info!(?changed_worker, "worker changed");

                    if let ActiveStreamingWorkerChange::Add(node) | ActiveStreamingWorkerChange::Update(node) = changed_worker {
                        self.control_stream_manager.add_worker(node, self.checkpoint_control.inflight_infos(), &*self.context).await;
                    }
                }

                notification = local_notification_rx.recv() => {
                    let notification = notification.unwrap();
                    if let LocalNotification::SystemParamsChange(p) = notification {
                        self.periodic_barriers.set_min_interval(Duration::from_millis(p.barrier_interval_ms() as u64));
                        self.periodic_barriers
                            .set_checkpoint_frequency(p.checkpoint_frequency() as usize)
                    }
                }

                complete_result = self
                    .completing_task
                    .next_completed_barrier(
                        &mut self.periodic_barriers,
                        &mut self.checkpoint_control,
                        &mut self.control_stream_manager,
                        &self.context,
                        &self.env,
                    ) => {
                    match complete_result {
                        Ok(output) => {
                            self.checkpoint_control.ack_completed(output);
                        }
                        Err(e) => {
                            self.failure_recovery(e).await;
                        }
                    }
                },

                event = self.checkpoint_control.next_event() => {
                    let result: MetaResult<()> = try {
                        match event {
                            CheckpointControlEvent::EnteringInitializing(entering_initializing) => {
                                let database_id = entering_initializing.database_id();
                                let error = merge_node_rpc_errors(&format!("database {} reset", database_id), entering_initializing.action.0.iter().filter_map(|(worker_id, resp)| {
                                    resp.root_err.as_ref().map(|root_err| {
                                        (*worker_id, ScoredError {
                                            error: Status::internal(&root_err.err_msg),
                                            score: Score(root_err.score)
                                        })
                                    })
                                }));
                                Self::report_collect_failure(&self.env, &error);
                                if let Some(runtime_info) = self.context.reload_database_runtime_info(database_id).await? {
                                    runtime_info.validate(database_id, &self.active_streaming_nodes).inspect_err(|e| {
                                        warn!(database_id = database_id.database_id, err = ?e.as_report(), ?runtime_info, "reloaded database runtime info failed to validate");
                                    })?;
                                    entering_initializing.enter(runtime_info, &mut self.control_stream_manager);
                                } else {
                                    info!(database_id = database_id.database_id, "database removed after reloading empty runtime info");
                                    entering_initializing.remove();
                                }
                            }
                            CheckpointControlEvent::EnteringRunning(entering_running) => {
                                self.context.mark_ready(Some(entering_running.database_id()));
                                entering_running.enter();
                            }
                        }
                    };
                    if let Err(e) = result {
                        self.failure_recovery(e).await;
                    }
                }

                (worker_id, resp_result) = self.control_stream_manager.next_response() => {
                    let resp = match resp_result {
                        Err(e) => {
                            if self.checkpoint_control.is_failed_at_worker_err(worker_id) {
                                let errors = self.control_stream_manager.collect_errors(worker_id, e).await;
                                let err = merge_node_rpc_errors("get error from control stream", errors);
                                Self::report_collect_failure(&self.env, &err);
                                self.failure_recovery(err).await;
                            } else {
                                warn!(worker_id, "no barrier to collect from worker, ignore err");
                            }
                            continue;
                        }
                        Ok(resp) => resp,
                    };
                    let result: MetaResult<()> = try {
                        match resp {
                            Response::CompleteBarrier(resp) => {
                                self.checkpoint_control.barrier_collected(resp, &mut self.control_stream_manager)?;
                            },
                            Response::ReportDatabaseFailure(resp) => {
                                if !self.enable_recovery {
                                    panic!("database failure reported but recovery not enabled: {:?}", resp)
                                }
                                if let Some(entering_recovery) = self.checkpoint_control.on_report_failure(resp, &mut self.control_stream_manager) {
                                    let database_id = entering_recovery.database_id();
                                    warn!(database_id = database_id.database_id, "database entering recovery");
                                    self.context.abort_and_mark_blocked(Some(database_id), RecoveryReason::Failover(anyhow!("reset database: {}", database_id).into()));
                                    // TODO: add log on blocking time
                                    let output = self.completing_task.wait_completing_task().await?;
                                    entering_recovery.enter(output, &mut self.control_stream_manager);
                                }
                            }
                            Response::ResetDatabase(resp) => {
                                self.checkpoint_control.on_reset_database_resp(worker_id, resp);
                            }
                            other @ Response::Init(_) | other @ Response::Shutdown(_) => {
                                Err(anyhow!("got unexpected response: {:?}", other))?;
                            }
                        }
                    };
                    if let Err(e) = result {
                        self.failure_recovery(e).await;
                    }
                }

                new_barrier = self.periodic_barriers.next_barrier(&*self.context),
                    if self
                        .checkpoint_control
                        .can_inject_barrier(self.in_flight_barrier_nums) => {
                    if let Err(e) = self.checkpoint_control.handle_new_barrier(new_barrier, &mut self.control_stream_manager, &self.active_streaming_nodes) {
                        self.failure_recovery(e).await;
                    }
                }
            }
            self.checkpoint_control.update_barrier_nums_metrics();
        }
    }
}
That is a lot of code. Clearly, as the global barrier worker it has many things to watch beyond triggering checkpoints: shutdown, newly added compute nodes, and various control commands are all handled here.
But the core is this arm:
rust
new_barrier = self.periodic_barriers.next_barrier(&*self.context),
    if self
        .checkpoint_control
        .can_inject_barrier(self.in_flight_barrier_nums) => {
    if let Err(e) = self.checkpoint_control.handle_new_barrier(new_barrier, &mut self.control_stream_manager, &self.active_streaming_nodes) {
        self.failure_recovery(e).await;
    }
}
Now, if we move our gaze a little higher, we find a `try` block.
rust
let result: MetaResult<()> = try {
    match resp {
        Response::CompleteBarrier(resp) => {
            self.checkpoint_control.barrier_collected(resp, &mut self.control_stream_manager)?;
        },
        Response::ReportDatabaseFailure(resp) => {
            if !self.enable_recovery {
                panic!("database failure reported but recovery not enabled: {:?}", resp)
            }
            if let Some(entering_recovery) = self.checkpoint_control.on_report_failure(resp, &mut self.control_stream_manager) {
                let database_id = entering_recovery.database_id();
                warn!(database_id = database_id.database_id, "database entering recovery");
                self.context.abort_and_mark_blocked(Some(database_id), RecoveryReason::Failover(anyhow!("reset database: {}", database_id).into()));
                // TODO: add log on blocking time
                let output = self.completing_task.wait_completing_task().await?;
                entering_recovery.enter(output, &mut self.control_stream_manager);
            }
        }
        Response::ResetDatabase(resp) => {
            self.checkpoint_control.on_reset_database_resp(worker_id, resp);
        }
        other @ Response::Init(_) | other @ Response::Shutdown(_) => {
            Err(anyhow!("got unexpected response: {:?}", other))?;
        }
    }
};
The `try` block serves error handling and control flow. Each operation inside it may return a `Result`, and the `?` operator short-circuits: if any operation returns `Err`, execution leaves the block immediately and that error becomes the block's value (a `MetaResult<()>`), which the surrounding code then handles, here by calling `failure_recovery`.
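Note that `try` blocks are still a nightly feature (`#![feature(try_blocks)]`); on stable Rust the same "collect every `?` into one Result" shape is often written as an immediately-invoked closure, as in this illustrative sketch (not RisingWave code):

```rust
use std::num::ParseIntError;

// Parse and add two numbers; a `?` failure exits only the closure,
// leaving the error for the caller to handle, just like the
// `if let Err(e) = result { ... }` that follows the try block above.
fn sum_two(a: &str, b: &str) -> Result<i32, ParseIntError> {
    let result: Result<i32, ParseIntError> = (|| {
        let x = a.parse::<i32>()?; // an Err here exits only the closure
        let y = b.parse::<i32>()?;
        Ok(x + y)
    })();
    result
}

fn main() {
    assert_eq!(sum_two("1", "2"), Ok(3));
    assert!(sum_two("1", "oops").is_err());
    println!("ok");
}
```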
Then, following this code downward:
rust
other @ Response::Init(_) | other @ Response::Shutdown(_) => {
    Err(anyhow!("got unexpected response: {:?}", other))?;
}
What does that `@` sign mean? In a pattern, `@` performs pattern binding: it lets you test a value against a pattern and, when it matches, bind the whole matched value to the name `other`.
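A standalone illustration (the enum is made up to echo the code above):

```rust
#[derive(Debug, PartialEq)]
enum Response {
    Ok(u32),
    Init(u32),
    Shutdown(u32),
}

fn describe(resp: Response) -> String {
    match resp {
        Response::Ok(code) => format!("ok: {}", code),
        // `other` is bound to the whole matched value, so we can both
        // test its shape and still use the value afterwards.
        other @ Response::Init(_) | other @ Response::Shutdown(_) => {
            format!("unexpected: {:?}", other)
        }
    }
}

fn main() {
    assert_eq!(describe(Response::Ok(1)), "ok: 1");
    assert_eq!(describe(Response::Shutdown(0)), "unexpected: Shutdown(0)");
    println!("ok");
}
```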
2.5 GlobalBarrierWorker.run
rust
/// Start an infinite loop to take scheduled barriers and send them.
async fn run(mut self, shutdown_rx: Receiver<()>) {
    tracing::info!(
        "Starting barrier manager with: enable_recovery={}, in_flight_barrier_nums={}",
        self.enable_recovery,
        self.in_flight_barrier_nums,
    );

    if !self.enable_recovery {
        let job_exist = self
            .context
            .metadata_manager
            .catalog_controller
            .has_any_streaming_jobs()
            .await
            .unwrap();
        if job_exist {
            panic!(
                "Some streaming jobs already exist in meta, please start with recovery enabled \
                or clean up the metadata using `./risedev clean-data`"
            );
        }
    }

    {
        // Bootstrap recovery. Here we simply trigger a recovery process to achieve the
        // consistency.
        // Even if there's no actor to recover, we still go through the recovery process to
        // inject the first `Initial` barrier.
        let span = tracing::info_span!("bootstrap_recovery");
        crate::telemetry::report_event(
            risingwave_pb::telemetry::TelemetryEventStage::Recovery,
            "normal_recovery",
            0,
            None,
            None,
            None,
        );

        let paused = self.take_pause_on_bootstrap().await.unwrap_or(false);
        let paused_reason = paused.then_some(PausedReason::Manual);

        self.recovery(paused_reason, None, RecoveryReason::Bootstrap)
            .instrument(span)
            .await;
    }

    self.run_inner(shutdown_rx).await
}
Here we can also see how GlobalBarrierManager and GlobalBarrierWorker are wired together: through the pair of channels created by `let (request_tx, request_rx) = unbounded_channel();`. The manager holds the sender and issues orders from above, while the worker holds the receiver and grinds through the work below.
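This wiring can be sketched with std's mpsc as a synchronous stand-in for tokio's `unbounded_channel` (the names echo the real ones, but the code is illustrative):

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

// The worker side of the pair: loop on the receiver until every sender
// is dropped, like run_inner looping on `self.request_rx.recv()`.
fn worker_loop(request_rx: Receiver<&'static str>) -> Vec<&'static str> {
    let mut handled = Vec::new();
    while let Ok(req) = request_rx.recv() {
        handled.push(req);
    }
    handled
}

fn main() {
    // The manager keeps request_tx; the worker owns request_rx.
    let (request_tx, request_rx) = mpsc::channel();
    let worker = thread::spawn(move || worker_loop(request_rx));

    request_tx.send("AdhocRecovery").unwrap();
    request_tx.send("GetDdlProgress").unwrap();
    drop(request_tx); // closing the channel ends the worker loop

    assert_eq!(worker.join().unwrap(), vec!["AdhocRecovery", "GetDdlProgress"]);
    println!("ok");
}
```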