The MongoDB Oplog (operations log) is the core component through which replica sets and sharded clusters achieve data consistency. It is stored as a capped collection in the local database, with a fixed size and circular-overwrite behavior. Starting from the Oplog recording path of a delete operation, this article walks through the core source code to break down how Oplog entries are generated, validated, and persisted, helping developers understand the underlying mechanics.
1. Oplog Basics: Position and Core Characteristics
Before diving into the code, let's pin down the Oplog's core properties as groundwork for the analysis that follows:
- Core role: records all MongoDB write operations (inserts, updates, deletes, commands); replica-set secondaries copy and apply the Oplog to stay in sync with the primary.
- Storage form: stored by default in the `local.oplog.rs` collection, a capped collection whose initial size is 5% of free disk space (minimum 1 GB, maximum 50 GB); once full, it automatically overwrites the oldest records.
- Key interfaces: `createOplog()` initializes the Oplog collection; `logInsertOps()` records batched inserts; `logOp()` is the general entry point for logging write operations and the function this article focuses on.
To restate the key property: the Oplog is a capped collection in the `local` database, meaning it has a fixed size and is reused in a circular fashion. The sketch below illustrates this behavior.
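As a mental model, the following minimal sketch (hypothetical code, not from the MongoDB tree) mimics capped-collection semantics: a fixed-capacity log that silently evicts its oldest entry once full. Note the real oplog caps by total bytes rather than entry count.

```cpp
#include <cstddef>
#include <deque>
#include <iostream>
#include <string>

// Minimal sketch of capped-collection semantics: a fixed-capacity log
// that overwrites (evicts) its oldest entries once the cap is reached.
class CappedLog {
public:
    explicit CappedLog(std::size_t capacity) : _capacity(capacity) {}

    void insert(std::string entry) {
        if (_entries.size() == _capacity) {
            _entries.pop_front();  // reclaim space from the oldest record
        }
        _entries.push_back(std::move(entry));
    }

    void dump() const {
        for (const auto& e : _entries)
            std::cout << e << '\n';
    }

private:
    std::size_t _capacity;
    std::deque<std::string> _entries;
};

int main() {
    CappedLog oplog(3);
    for (int i = 1; i <= 5; ++i)
        oplog.insert("op-" + std::to_string(i));
    oplog.dump();  // prints op-3, op-4, op-5: op-1 and op-2 were overwritten
}
```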
2. The Oplog Recording Path for Delete Operations
When a document is deleted, MongoDB triggers Oplog recording through the observer pattern. The full call chain (reconstructed from the call relationships in the source) is:
```text
mongo/db/catalog/collection_impl.cpp: deleteDocument
mongo/db/catalog/collection_impl.h: docFor
mongo/db/storage/record_store.h: dataFor
mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp: findRecord (returns RecordData)
mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp: deleteRecord
mongo/db/op_observer_registry.h: onDelete
mongo/db/auth_op_observer.h: onDelete
mongo/db/op_observer_impl.cpp: onDelete
mongo/db/op_observer_impl.cpp: replLogDelete
mongo/db/op_observer_impl.cpp: logOperation
mongo/db/repl/oplog.cpp: logOp
mongo/db/repl/oplog.cpp: _logOpsInner
```
- Delete entry point: `deleteDocument()` in mongo/db/catalog/collection_impl.cpp first removes the index entries (`_indexCatalog->unindexRecord`) and the physical record (`_recordStore->deleteRecord`), then triggers log recording through the observer interface.
- Observer dispatch: `getGlobalServiceContext()->getOpObserver()->onDelete()` hands the delete event to the Oplog module (outside a transaction it is recorded directly; inside a transaction it must be associated with the session record).
- Log forwarding: the event passes through intermediate layers such as op_observer_registry.h and auth_op_observer.h before being routed to `onDelete()` in mongo/db/op_observer_impl.cpp, which calls `replLogDelete()` and `logOperation()`.
- Core logging function: control finally reaches `logOp()` in mongo/db/repl/oplog.cpp, which generates and persists the Oplog entry. (A minimal sketch of the observer fan-out follows this list.)
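The OpObserverRegistry layer in this chain is a textbook observer fan-out: each write event is forwarded to every registered observer (auth auditing, oplog writing, and so on). The sketch below shows the pattern with simplified, hypothetical interfaces; the real ones live in op_observer_registry.h and op_observer_impl.cpp.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Simplified sketch of the observer fan-out used for oplog recording.
// Names and signatures are illustrative, not MongoDB's actual interfaces.
class OpObserver {
public:
    virtual ~OpObserver() = default;
    virtual void onDelete(const std::string& ns, long long recordId) = 0;
};

// Mirrors the role of OpObserverRegistry: forwards each event to every
// registered observer (auth auditing, oplog writing, ...).
class OpObserverRegistry : public OpObserver {
public:
    void addObserver(std::unique_ptr<OpObserver> obs) {
        _observers.push_back(std::move(obs));
    }
    void onDelete(const std::string& ns, long long recordId) override {
        for (auto& obs : _observers)
            obs->onDelete(ns, recordId);
    }

private:
    std::vector<std::unique_ptr<OpObserver>> _observers;
};

// Stand-in for OpObserverImpl, which would end up calling logOp().
class OplogWriter : public OpObserver {
public:
    void onDelete(const std::string& ns, long long recordId) override {
        std::cout << "record delete of " << recordId << " in " << ns
                  << " to the oplog\n";
    }
};

int main() {
    OpObserverRegistry registry;
    registry.addObserver(std::make_unique<OplogWriter>());
    // CollectionImpl::deleteDocument would trigger something like:
    registry.onDelete("test.users", 42);
}
```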
The deleteDocument code in mongo/db/catalog/collection_impl.cpp:
```cpp
void CollectionImpl::deleteDocument(OperationContext* opCtx,
                                    StmtId stmtId,
                                    RecordId loc,
                                    OpDebug* opDebug,
                                    bool fromMigrate,
                                    bool noWarn,
                                    Collection::StoreDeletedDoc storeDeletedDoc) {
    ...
    int64_t keysDeleted;
    _indexCatalog->unindexRecord(opCtx, doc.value(), loc, noWarn, &keysDeleted);
    _recordStore->deleteRecord(opCtx, loc);

    getGlobalServiceContext()->getOpObserver()->onDelete(
        opCtx, ns(), uuid(), stmtId, fromMigrate, deletedDoc);

    if (opDebug) {
        opDebug->additiveMetrics.incrementKeysDeleted(keysDeleted);
    }
}
```
In the `getGlobalServiceContext()->getOpObserver()->onDelete(opCtx, ns(), uuid(), stmtId, fromMigrate, deletedDoc)` call above, a non-transactional operation records the delete to the oplog directly and updates the session's transaction record.
An overview of mongo/db/repl/oplog.h:
```cpp
void appendRetryableWriteInfo(OperationContext* opCtx,
                              MutableOplogEntry* oplogEntry,
                              OplogLink* oplogLink,
                              StmtId stmtId);

void createOplog(OperationContext* opCtx,
                 const NamespaceString& oplogCollectionName,
                 bool isReplSet);

void createOplog(OperationContext* opCtx);

std::vector<OpTime> logInsertOps(OperationContext* opCtx,
                                 MutableOplogEntry* oplogEntryTemplate,
                                 std::vector<InsertStatement>::const_iterator begin,
                                 std::vector<InsertStatement>::const_iterator end);

OpTime logOp(OperationContext* opCtx, MutableOplogEntry* oplogEntry);

void clearLocalOplogPtr();

void acquireOplogCollectionForLogging(OperationContext* opCtx);
```
mongo/db/repl/oplog.h covers the management and replication of the operation log (oplog), the core component through which MongoDB replica sets (Replica Set) and sharded clusters (Sharded Cluster) achieve data consistency.
3. Deep Dive into the Core Function logOp()

logOp() is MongoDB's core entry point for recording Oplog entries. It validates the operation, allocates an OpTime, handles concurrency control, and triggers persistence; its source logic breaks down into four key steps.
mongo/db/repl/oplog.cpp implements mongo/db/repl/oplog.h. The OpObserver's onDelete eventually calls logOp() in mongo/db/repl/oplog.cpp, whose responsibilities include verifying the operation's legality, allocating a unique timestamp (OpTime), concurrency control, and persistent storage, ensuring data consistency across the distributed system. The code is as follows:
```cpp
OpTime logOp(OperationContext* opCtx, MutableOplogEntry* oplogEntry) {
    // All collections should have UUIDs now, so all insert, update, and delete oplog entries
    // should also have uuids. Some no-op (n) and command (c) entries may still elide the uuid
    // field.
    invariant(oplogEntry->getUuid() || oplogEntry->getOpType() == OpTypeEnum::kNoop ||
                  oplogEntry->getOpType() == OpTypeEnum::kCommand,
              str::stream() << "Expected uuid for logOp with oplog entry: "
                            << redact(oplogEntry->toBSON()));

    auto replCoord = ReplicationCoordinator::get(opCtx);
    // For commands, the test below is on the command ns and therefore does not check for
    // specific namespaces such as system.profile. This is the caller's responsibility.
    if (replCoord->isOplogDisabledFor(opCtx, oplogEntry->getNss())) {
        uassert(ErrorCodes::IllegalOperation,
                str::stream() << "retryable writes is not supported for unreplicated ns: "
                              << oplogEntry->getNss().ns(),
                !oplogEntry->getStatementId());
        return {};
    }

    auto oplogInfo = LocalOplogInfo::get(opCtx);

    // Obtain Collection exclusive intent write lock for non-document-locking storage engines.
    boost::optional<Lock::DBLock> dbWriteLock;
    boost::optional<Lock::CollectionLock> collWriteLock;
    if (!opCtx->getServiceContext()->getStorageEngine()->supportsDocLocking()) {
        dbWriteLock.emplace(opCtx, NamespaceString::kLocalDb, MODE_IX);
        collWriteLock.emplace(opCtx, oplogInfo->getOplogCollectionName(), MODE_IX);
    }

    // If an OpTime is not specified (i.e. isNull), a new OpTime will be assigned to the oplog
    // entry within the WUOW. If a new OpTime is assigned, it needs to be reset back to a null
    // OpTime before exiting this function so that the same oplog entry instance can be reused
    // for logOp() again. For example, if the WUOW gets aborted within a writeConflictRetry
    // loop, we need to reset the OpTime to null so a new OpTime will be assigned on retry.
    OplogSlot slot = oplogEntry->getOpTime();
    auto resetOpTimeGuard = makeGuard([&, resetOpTimeOnExit = bool(slot.isNull())] {
        if (resetOpTimeOnExit)
            oplogEntry->setOpTime(OplogSlot());
    });

    WriteUnitOfWork wuow(opCtx);
    if (slot.isNull()) {
        slot = oplogInfo->getNextOpTimes(opCtx, 1U)[0];
        // It would be better to make the oplogEntry a const reference. But because in some
        // cases, a new OpTime needs to be assigned within the WUOW as explained earlier, we
        // instead pass oplogEntry by pointer and reset the OpTime to null using a ScopeGuard.
        oplogEntry->setOpTime(slot);
    }

    // Debug trace added for this article; not present in the upstream source.
    std::cout << "conca logOp oplogInfo->getOplogCollectionName(): "
              << oplogInfo->getOplogCollectionName() << std::endl;

    auto oplog = oplogInfo->getCollection();
    auto wallClockTime = oplogEntry->getWallClockTime();

    auto bsonOplogEntry = oplogEntry->toBSON();
    // The storage engine will assign the RecordId based on the "ts" field of the oplog entry,
    // see oploghack::extractKey.
    std::vector<Record> records{
        {RecordId(), RecordData(bsonOplogEntry.objdata(), bsonOplogEntry.objsize())}};
    std::vector<Timestamp> timestamps{slot.getTimestamp()};
    _logOpsInner(opCtx, oplogEntry->getNss(), &records, timestamps, oplog, slot, wallClockTime);
    wuow.commit();
    return slot;
}
```
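One idiom worth pausing on before the step-by-step walkthrough is resetOpTimeGuard: makeGuard runs a lambda when the guard leaves scope, so an OpTime assigned inside an aborted WriteUnitOfWork is reset and the same entry can be retried with a fresh OpTime. Below is a minimal, self-contained sketch of that scope-guard idiom (illustrative only, not MongoDB's actual helper):

```cpp
#include <iostream>
#include <utility>

// Minimal sketch of the scope-guard idiom behind makeGuard(): run a
// callback when the guard leaves scope, unless it was dismissed.
// Requires C++17 (guaranteed copy elision for the factory return).
template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F f) : _f(std::move(f)) {}
    ~ScopeGuard() {
        if (_active)
            _f();
    }
    void dismiss() { _active = false; }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    F _f;
    bool _active = true;
};

template <typename F>
ScopeGuard<F> makeGuard(F f) {
    return ScopeGuard<F>(std::move(f));
}

int main() {
    long long opTime = 0;  // stand-in for the oplog entry's OpTime
    {
        auto resetOpTimeGuard = makeGuard([&] { opTime = 0; });
        opTime = 12345;  // OpTime assigned inside the "unit of work"
        // If the unit of work aborts (scope exits without a commit),
        // the guard resets the OpTime so the entry can be reused.
    }
    std::cout << opTime << '\n';  // prints 0: the guard fired
}
```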
Input validation and legality checks

Except for noop (n) and command (c) operations, every oplog entry must carry a UUID. MongoDB has required all collections to have UUIDs since version 3.6, which strengthens data consistency and manageability.
```cpp
// All collections should have UUIDs now, so all insert, update, and delete oplog entries
// should also have uuids. Some no-op (n) and command (c) entries may still elide the uuid
// field.
invariant(oplogEntry->getUuid() || oplogEntry->getOpType() == OpTypeEnum::kNoop ||
              oplogEntry->getOpType() == OpTypeEnum::kCommand,
          str::stream() << "Expected uuid for logOp with oplog entry: "
                        << redact(oplogEntry->toBSON()));
```
Oplog disabled check

Some namespaces may be configured not to participate in replication, such as collections under the local database and system.profile collections.
```cpp
if (replCoord->isOplogDisabledFor(opCtx, oplogEntry->getNss())) {
    uassert(ErrorCodes::IllegalOperation,
            str::stream() << "retryable writes is not supported for unreplicated ns: "
                          << oplogEntry->getNss().ns(),
            !oplogEntry->getStatementId());
    return {};
}
```
The isOplogDisabledFor code in mongo/db/repl/replication_coordinator.cpp:
```cpp
bool ReplicationCoordinator::isOplogDisabledFor(OperationContext* opCtx,
                                                const NamespaceString& nss) const {
    if (getReplicationMode() == ReplicationCoordinator::modeNone) {
        return true;
    }

    if (!opCtx->writesAreReplicated()) {
        return true;
    }

    if (ReplicationCoordinator::isOplogDisabledForNS(nss)) {
        return true;
    }

    fassert(28626, opCtx->recoveryUnit());
    return false;
}

bool ReplicationCoordinator::isOplogDisabledForNS(const NamespaceString& nss) {
    if (nss.isLocal()) {
        return true;
    }

    if (nss.isSystemDotProfile()) {
        return true;
    }

    if (nss.isDropPendingNamespace()) {
        return true;
    }

    return false;
}
```
isLocal() in mongo/db/namespace_string.h checks whether the namespace belongs to the local database:
```cpp
// Namespace for the local database
static constexpr StringData kLocalDb = "local"_sd;

bool isLocal() const {
    return db() == kLocalDb;
}

bool isSystemDotProfile() const {
    return coll() == "system.profile";
}
```
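These checks reduce to string comparisons on the `<db>.<collection>` form of a namespace. The self-contained sketch below uses a hypothetical Namespace type to show the idea (the real NamespaceString offers far more):

```cpp
#include <iostream>
#include <string>

// Sketch of the namespace checks: a namespace string has the form
// "<db>.<collection>"; isLocal()/isSystemDotProfile() are just
// comparisons against the split parts.
struct Namespace {
    std::string ns;  // e.g. "local.oplog.rs"

    std::string db() const {
        auto dot = ns.find('.');
        return dot == std::string::npos ? ns : ns.substr(0, dot);
    }
    std::string coll() const {
        auto dot = ns.find('.');
        return dot == std::string::npos ? "" : ns.substr(dot + 1);
    }
    bool isLocal() const { return db() == "local"; }
    bool isSystemDotProfile() const { return coll() == "system.profile"; }
};

int main() {
    Namespace oplogNs{"local.oplog.rs"};
    Namespace profileNs{"test.system.profile"};
    std::cout << std::boolalpha
              << oplogNs.isLocal() << '\n'               // true: never replicated
              << profileNs.isSystemDotProfile() << '\n'; // true: never replicated
}
```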
Oplog write and persistence

logOp() builds Record objects the storage engine can consume, containing the operation data, then calls _logOpsInner to perform the low-level write, tying together the timestamps and the collection. Here oplogEntry->getNss() is local.oplog.rs: MongoDB's oplog is stored by default in the oplog.rs collection of the local database.
```cpp
WriteUnitOfWork wuow(opCtx);
auto bsonOplogEntry = oplogEntry->toBSON();
std::vector<Record> records{
    {RecordId(), RecordData(bsonOplogEntry.objdata(), bsonOplogEntry.objsize())}};
std::vector<Timestamp> timestamps{slot.getTimestamp()};
_logOpsInner(opCtx, oplogEntry->getNss(), &records, timestamps, oplog, slot, wallClockTime);
wuow.commit();
```
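The WriteUnitOfWork bracket is what makes "allocate OpTime + write entry" atomic: if commit() is never reached, everything staged inside rolls back. Here is a minimal sketch of that RAII commit/abort pattern, with hypothetical types standing in for the storage-engine API:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Sketch of the WriteUnitOfWork RAII pattern: writes are staged and only
// become visible on commit(); leaving scope without commit() aborts them.
class WriteUnitOfWork {
public:
    explicit WriteUnitOfWork(std::vector<std::string>& log) : _log(log) {}

    void stage(std::string entry) { _staged.push_back(std::move(entry)); }

    void commit() {
        _log.insert(_log.end(), _staged.begin(), _staged.end());
        _staged.clear();
        _committed = true;
    }

    ~WriteUnitOfWork() {
        if (!_committed)
            std::cout << "aborted: " << _staged.size() << " staged write(s) dropped\n";
    }

private:
    std::vector<std::string>& _log;
    std::vector<std::string> _staged;
    bool _committed = false;
};

int main() {
    std::vector<std::string> oplog;
    {
        WriteUnitOfWork wuow(oplog);
        wuow.stage("{op: 'd', ns: 'test.users'}");  // analogous to _logOpsInner
        wuow.commit();  // without this line, the staged entry would be discarded
    }
    std::cout << "oplog entries: " << oplog.size() << '\n';  // 1
}
```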
4. _logOpsInner: Physical Write and State Update

logOp() handles only the upper-level logic; the actual physical write is done by _logOpsInner(), whose core responsibilities are write-permission validation, invoking the storage engine, and updating replication state.
The _logOpsInner code in mongo/db/repl/oplog.cpp:
```cpp
/*
 * records - a vector of oplog records to be written.
 * timestamps - a vector of respective Timestamp objects for each oplog record.
 * oplogCollection - collection to be written to.
 * finalOpTime - the OpTime of the last oplog record.
 * wallTime - the wall clock time of the last oplog record.
 */
void _logOpsInner(OperationContext* opCtx,
                  const NamespaceString& nss,
                  std::vector<Record>* records,
                  const std::vector<Timestamp>& timestamps,
                  Collection* oplogCollection,
                  OpTime finalOpTime,
                  Date_t wallTime) {
    auto replCoord = ReplicationCoordinator::get(opCtx);
    if (nss.size() && replCoord->getReplicationMode() == ReplicationCoordinator::modeReplSet &&
        !replCoord->canAcceptWritesFor(opCtx, nss)) {
        str::stream ss;
        ss << "logOp() but can't accept write to collection " << nss;
        ss << ": entries: " << records->size() << ": [ ";
        for (const auto& record : *records) {
            ss << "(" << record.id << ", " << redact(record.data.toBson()) << ") ";
        }
        ss << "]";
        uasserted(ErrorCodes::NotMaster, ss);
    }

    Status result = oplogCollection->insertDocumentsForOplog(opCtx, records, timestamps);
    if (!result.isOK()) {
        severe() << "write to oplog failed: " << result.toString();
        fassertFailed(17322);
    }

    // Set replCoord last optime only after we're sure the WUOW didn't abort and roll back.
    opCtx->recoveryUnit()->onCommit(
        [opCtx, replCoord, finalOpTime, wallTime](boost::optional<Timestamp> commitTime) {
            if (commitTime) {
                // The `finalOpTime` may be less than the `commitTime` if multiple oplog
                // entries are logging within one WriteUnitOfWork.
                invariant(finalOpTime.getTimestamp() <= *commitTime,
                          str::stream() << "Final OpTime: " << finalOpTime.toString()
                                        << ". Commit Time: " << commitTime->toString());
            }

            // Optionally hang before advancing lastApplied.
            if (MONGO_unlikely(hangBeforeLogOpAdvancesLastApplied.shouldFail())) {
                log() << "hangBeforeLogOpAdvancesLastApplied fail point enabled.";
                hangBeforeLogOpAdvancesLastApplied.pauseWhileSet(opCtx);
            }

            // Optimes on the primary should always represent consistent database states.
            replCoord->setMyLastAppliedOpTimeAndWallTimeForward(
                {finalOpTime, wallTime}, ReplicationCoordinator::DataConsistency::Consistent);

            // We set the last op on the client to 'finalOpTime', because that contains the
            // timestamp of the operation that the client actually performed.
            ReplClientInfo::forClient(opCtx->getClient()).setLastOp(opCtx, finalOpTime);
        });
}
```
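Note the ordering guarantee in the onCommit hook: lastApplied advances only after the storage transaction actually commits, never on behalf of a write that might still roll back. The sketch below shows this deferred-callback pattern with a simplified, hypothetical RecoveryUnit:

```cpp
#include <functional>
#include <iostream>
#include <vector>

// Sketch of the RecoveryUnit::onCommit() idea: callbacks registered during
// a transaction run only if (and when) the transaction commits.
class RecoveryUnit {
public:
    void onCommit(std::function<void()> cb) {
        _callbacks.push_back(std::move(cb));
    }
    void commit() {
        for (auto& cb : _callbacks)
            cb();
        _callbacks.clear();
    }
    void abort() { _callbacks.clear(); }  // callbacks never fire

private:
    std::vector<std::function<void()>> _callbacks;
};

int main() {
    RecoveryUnit ru;
    long long lastApplied = 0;
    const long long finalOpTime = 42;

    // Analogous to _logOpsInner: advance replication state only on commit.
    ru.onCommit([&] { lastApplied = finalOpTime; });

    ru.commit();  // with ru.abort() instead, lastApplied would stay 0
    std::cout << "lastApplied = " << lastApplied << '\n';  // 42
}
```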
Write-permission validation

Ensures the current node is the primary and is allowed to accept writes for the target namespace (nss).
```cpp
if (nss.size() && replCoord->getReplicationMode() == ReplicationCoordinator::modeReplSet &&
    !replCoord->canAcceptWritesFor(opCtx, nss)) {
    str::stream ss;
    ss << "logOp() but can't accept write to collection " << nss;
    ss << ": entries: " << records->size() << ": [ ";
    for (const auto& record : *records) {
        ss << "(" << record.id << ", " << redact(record.data.toBson()) << ") ";
    }
    ss << "]";
    uasserted(ErrorCodes::NotMaster, ss);
}
```
Physical write to the Oplog collection

Calls the storage-engine interface to write the records and their timestamps into the Oplog collection.
```cpp
Status result = oplogCollection->insertDocumentsForOplog(opCtx, records, timestamps);
if (!result.isOK()) {
    severe() << "write to oplog failed: " << result.toString();
    fassertFailed(17322);
}
```
The insertDocumentsForOplog code in mongo/db/catalog/collection_impl.cpp:
```cpp
Status CollectionImpl::insertDocumentsForOplog(OperationContext* opCtx,
                                               std::vector<Record>* records,
                                               const std::vector<Timestamp>& timestamps) {
    dassert(opCtx->lockState()->isWriteLocked());

    // Since this is only for the OpLog, we can assume these for simplicity.
    invariant(!_validator);
    invariant(!_indexCatalog->haveAnyIndexes());

    Status status = _recordStore->insertRecords(opCtx, records, timestamps);
    if (!status.isOK())
        return status;

    opCtx->recoveryUnit()->onCommit(
        [this](boost::optional<Timestamp>) { notifyCappedWaitersIfNeeded(); });

    return status;
}
```
This is where the log record is inserted into local.oplog.rs. MongoDB's default storage engine is WiredTiger, so _recordStore is implemented by mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp, which performs the actual insert.
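As the comment in logOp() notes, the storage engine derives each oplog record's RecordId from the entry's ts field (see oploghack::extractKey). A BSON timestamp packs seconds-since-epoch into the high 32 bits of a 64-bit value and an increment counter into the low 32, so keying records by it preserves operation order. A hedged sketch of that packing:

```cpp
#include <cstdint>
#include <iostream>

// Sketch of deriving an ordered record key from an oplog timestamp: a BSON
// Timestamp packs seconds-since-epoch in the high 32 bits and an increment
// counter in the low 32 bits, so the packed 64-bit value sorts records in
// operation order (the idea behind oploghack::extractKey).
std::uint64_t recordKeyFromTimestamp(std::uint32_t secs, std::uint32_t inc) {
    return (static_cast<std::uint64_t>(secs) << 32) | inc;
}

int main() {
    auto k1 = recordKeyFromTimestamp(1700000000, 1);
    auto k2 = recordKeyFromTimestamp(1700000000, 2);  // same second, later op
    auto k3 = recordKeyFromTimestamp(1700000001, 1);
    std::cout << std::boolalpha << (k1 < k2) << ' ' << (k2 < k3) << '\n';  // true true
}
```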
5. Summary: The Oplog's Core Value and Key Takeaways
As the lifeline of MongoDB's replication mechanism, the Oplog's design embodies three core ideas:
- Atomicity: a WriteUnitOfWork wraps OpTime allocation and log write into a single atomic operation, preventing inconsistent logs in distributed scenarios;
- Performance first: the Oplog collection has no indexes and no validation, keeping the write path minimal so logging adds no noticeable latency to writes;
- Consistency checks: mechanisms such as UUID validation and primary-node permission checks stop illegal operations from affecting the replica set at the source.
For developers, understanding the Oplog's internals not only makes replica-set sync issues easier to diagnose, but also informs sizing the Oplog and monitoring its overwrite rate when designing highly available MongoDB architectures, keeping the cluster stable.