Original lab: MIT 6.824 Lab 2 Raft
Chinese translation: MIT 6.824 Lab 2 translation
Raft paper: In Search of an Understandable Consensus Algorithm
Raft paper translation (Chinese): (zhuanlan.zhihu.com/p/524885008)
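Introduction
Lab 2B implements the core of Raft: log replication. When a request reaches the cluster, Raft reaches consensus by replicating log entries. When one machine goes down, any remaining machine with a sufficiently up-to-date log can be elected Leader, and committed data is not lost.
The implementation again follows Figure 2 of the Raft paper. If you run into a particularly strange bug, reread the paper first and check whether your understanding diverges from its description.
Core: AppendEntries RPCs
This RPC can be called a log replication request or a heartbeat request.
Definition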
go
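type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can identify the leader
	PrevLogIndex int        // index of the log entry immediately preceding the new ones
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // empty for heartbeat
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // currentTerm, for the leader to update itself
	Success bool // true if the follower's log matched PrevLogIndex and PrevLogTerm
	// fast backup
	XTerm  int // term in the conflicting entry (if any)
	XIndex int // index of first entry with that term (if any)
	XLen   int // log length
}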
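Heartbeat sending function
Common pitfall: when should log entries be sent?
If last log index >= nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
Every time AE RPCs are sent, check whether log entries need to go with them.
Note that you send not just one entry but everything from nextIndex[i] to the end of the log. If there are no entries to send, the request still has to be sent, as a heartbeat.
Sending the AE RPC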
go
// Periodic heartbeat loop.
func (rf *Raft) cycleAppendEntries() {
	rf.nextBeatTime = time.Now()
	for !rf.killed() {
		rf.mu.Lock()
		// Stop sending heartbeats once this server is no longer the leader.
		if rf.state != Leader {
			rf.mu.Unlock()
			return
		}
		if !time.Now().After(rf.nextBeatTime) {
			rf.mu.Unlock()
			// Not due yet. Sleep briefly instead of spinning on the lock;
			// the 5 ms granularity is an arbitrary choice.
			time.Sleep(5 * time.Millisecond)
			continue
		}
		for i := 0; i < len(rf.peers); i++ {
			if i == rf.me {
				continue
			}
			reply := AppendEntriesReply{}
			args := AppendEntriesArgs{
				LeaderId:     rf.me,
				Term:         rf.currentTerm,
				LeaderCommit: rf.commitIndex,
				PrevLogIndex: rf.nextIndex[i] - 1,
			}
			needSnapshot := false
			if args.PrevLogIndex < rf.lastIncludedIndex {
				// The entries to send have already been compacted into the snapshot.
				needSnapshot = true
			} else if rf.ToVirtualIndex(len(rf.logs)-1) > args.PrevLogIndex {
				// If last log index >= nextIndex for a follower: send
				// AppendEntries RPC with log entries starting at nextIndex.
				args.Entries = rf.logs[rf.ToRealIndex(args.PrevLogIndex+1):]
				DPrintf("Server %v Send AE Args %v To %v", rf.me, args, i)
			} else {
				// Nothing new to send: this is a pure heartbeat.
				args.Entries = make([]LogEntry, 0)
			}
			if needSnapshot {
				go rf.SendInstallSnapshot(i)
			} else {
				args.PrevLogTerm = rf.logs[rf.ToRealIndex(args.PrevLogIndex)].Term
				go rf.SendAppendEntries(i, &args, &reply)
			}
		}
		rf.nextBeatTime = time.Now().Add(time.Duration(HeartBeatInterval) * time.Millisecond)
		rf.mu.Unlock()
	}
}
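The loop above calls `rf.ToVirtualIndex` and `rf.ToRealIndex`, which are not shown in this post. Here is a minimal sketch of what they might look like. It assumes that `rf.logs[0]` is a placeholder entry sitting at global index `lastIncludedIndex`, which is what the use of `rf.logs[rf.ToRealIndex(args.PrevLogIndex)]` with `PrevLogIndex == lastIncludedIndex` suggests. The `raftLog` type is hypothetical and only stands in for the relevant fields of `Raft`.

```go
package main

import "fmt"

// LogEntry mirrors the entry type used in the post.
type LogEntry struct {
	Term    int
	Command interface{}
}

// raftLog is a hypothetical stand-in for the log-related fields of Raft.
// Assumption: logs[0] is a placeholder whose global index is lastIncludedIndex.
type raftLog struct {
	logs              []LogEntry
	lastIncludedIndex int
}

// ToVirtualIndex converts a slice position into a global log index.
func (l *raftLog) ToVirtualIndex(real int) int {
	return real + l.lastIncludedIndex
}

// ToRealIndex converts a global log index into a slice position.
func (l *raftLog) ToRealIndex(virtual int) int {
	return virtual - l.lastIncludedIndex
}

func main() {
	// A snapshot covers indices up to 10; entries 11 and 12 are still in memory.
	l := &raftLog{
		logs:              []LogEntry{{Term: 3}, {Term: 3}, {Term: 4}},
		lastIncludedIndex: 10,
	}
	fmt.Println("last log index:", l.ToVirtualIndex(len(l.logs)-1))
	fmt.Println("term at index 11:", l.logs[l.ToRealIndex(11)].Term)
}
```

Before any snapshot is taken `lastIncludedIndex` is 0, so both conversions are the identity and the 2B code behaves as if the helpers were not there.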
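Handling the AE RPC
This part follows Figure 2 very closely. Below is my understanding of each sentence; every one of them has to show up in the code. Note that "Reply false" in the figure means return immediately.
- Reply false if term < currentTerm
- Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
  This is the consistency check. It works like an induction step: from the very first entry, the previous entry is verified before anything is appended, so for any later entry it is enough to check that the entry at the preceding index exists with the matching term.
- If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it
  Note that this step is reached only if PrevLogIndex and PrevLogTerm matched.
- Append any new entries not already in the log
  Pay special attention to the word "new". Even though the previous two steps succeeded, this may still be a stale RPC, so you have to decide whether the Follower's log or the entries in args are more recent.
- If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)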
go
func (rf *Raft) HandlerAppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.mu.Lock()
	// 1. Reply false if term < currentTerm
	if args.Term < rf.currentTerm {
		DPrintf("Old Leader %v ", args.LeaderId)
		reply.Success = false
		reply.Term = rf.currentTerm
		rf.mu.Unlock()
		return
	}
	// A valid vote or a valid AE must refresh the election timestamp.
	rf.stamp = time.Now()
	// All Servers: If RPC request or response contains term T > currentTerm:
	// set currentTerm = T, convert to follower
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.state = Follower
		rf.voteNum = 0
		rf.votedFor = None
		rf.persist()
	}
	DPrintf("Server %v Term:%v\n", rf.me, rf.currentTerm)
	// 2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
	isConflict := false
	if args.PrevLogIndex >= rf.ToVirtualIndex(len(rf.logs)) || args.PrevLogIndex < rf.lastIncludedIndex {
		// No entry at PrevLogIndex: it is past the end of the log, or it precedes the snapshot boundary.
		reply.XTerm = -1
		reply.XLen = rf.ToVirtualIndex(len(rf.logs)) // the Follower's log is too short
		isConflict = true
	} else if args.PrevLogTerm != rf.logs[rf.ToRealIndex(args.PrevLogIndex)].Term {
		// The entry exists, but its term conflicts.
		reply.XTerm = rf.logs[rf.ToRealIndex(args.PrevLogIndex)].Term
		i := args.PrevLogIndex
		for i > rf.lastIncludedIndex && rf.logs[rf.ToRealIndex(i)].Term == reply.XTerm {
			i -= 1
		}
		reply.XIndex = i + 1
		reply.XLen = rf.ToVirtualIndex(len(rf.logs)) // log length, including the snapshotted part
		isConflict = true
		DPrintf("server %v: term mismatch at PrevLogIndex %v, args.PrevLogTerm=%v, actual term=%v\n", rf.me, args.PrevLogIndex, args.PrevLogTerm, reply.XTerm)
	}
	if isConflict {
		reply.Success = false
		reply.Term = rf.currentTerm
		rf.mu.Unlock()
		return
	}
	// Reaching this point means the consistency check has passed.
	// 3. If an existing entry conflicts with a new one (same index but different terms),
	// delete the existing entry and all that follow it
	if len(args.Entries) != 0 {
		lastLogIndex := rf.ToVirtualIndex(len(rf.logs) - 1)
		lastLogTerm := rf.logs[rf.ToRealIndex(lastLogIndex)].Term
		// Decide whether the incoming entries may overwrite the local suffix.
		if lastLogTerm != args.Term || args.PrevLogIndex+len(args.Entries) >= lastLogIndex {
			// 4. Append any new entries not already in the log.
			// Note: the entries are spliced in starting right after PrevLogIndex.
			rf.logs = rf.logs[:rf.ToRealIndex(args.PrevLogIndex+1)]
			rf.logs = append(rf.logs, args.Entries...)
		}
	}
	rf.persist()
	if len(args.Entries) != 0 {
		DPrintf("server %v append %v success\n", rf.me, args.Entries)
	}
	// 5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)
	if args.LeaderCommit > rf.commitIndex {
		rf.commitIndex = int(math.Min(float64(args.LeaderCommit), float64(rf.ToVirtualIndex(len(rf.logs)-1))))
		DPrintf("server %v advanced commitIndex to %v", rf.me, rf.commitIndex)
	}
	// Only a Follower may accept this RPC.
	if rf.state == Follower {
		reply.Success = true
		reply.Term = rf.currentTerm
		rf.mu.Unlock()
		return
	}
	rf.mu.Unlock()
}
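The handler above decides whether to overwrite the local suffix with a single whole-batch condition. For comparison, here is a sketch of the per-entry reading of rules 3 and 4: walk the incoming entries, truncate only at the first real conflict, and append only what is missing. This is an alternative illustration, not the code used in this post. It ignores snapshots (indices are plain slice positions), and `mergeEntries` is a hypothetical name.

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command interface{}
}

// mergeEntries applies rules 3 and 4 of the AppendEntries receiver.
// Assumptions: log[0] is a placeholder, indices are plain slice positions
// (no snapshot offset), and the consistency check at prevLogIndex has
// already passed.
func mergeEntries(log []LogEntry, prevLogIndex int, entries []LogEntry) []LogEntry {
	for i, e := range entries {
		idx := prevLogIndex + 1 + i
		if idx >= len(log) {
			// Everything from here on is new: append the rest and stop.
			return append(log, entries[i:]...)
		}
		if log[idx].Term != e.Term {
			// Conflict: drop this entry and all that follow, then append.
			return append(log[:idx], entries[i:]...)
		}
		// Same index and same term: the entry is already present, keep it.
	}
	// Every entry was already in the log; a stale RPC must not truncate it.
	return log
}

func main() {
	log := []LogEntry{{Term: 0}, {Term: 1}, {Term: 1}, {Term: 2}}
	// A delayed RPC carrying only the entry at index 1 leaves the log intact.
	log = mergeEntries(log, 0, []LogEntry{{Term: 1}})
	fmt.Println("length after stale RPC:", len(log))
	// A conflicting entry at index 3 replaces the old suffix.
	log = mergeEntries(log, 2, []LogEntry{{Term: 3}, {Term: 3}})
	fmt.Println("length after conflict:", len(log), "term at 3:", log[3].Term)
}
```

The point of the per-entry walk is the word "new" in rule 4: a reordered, shorter RPC whose entries all match never shrinks the Follower's log.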
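Handling the AE RPC reply
Difficulty: this is concurrent code. How do we make sure that the reply handlers never commit the same entry twice?
- Each reply only advances commitIndex to the newest commit point it can establish itself, so nothing is committed twice
- A Leader is not allowed to directly commit log entries from previous terms
- If there exists an N such that N > commitIndex, a majority of matchIndex[i] >= N, and log[N].term == currentTerm: set commitIndex = N
Starting from the end of the log, check entries one by one to see whether each has been replicated on a majority.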
go
func (rf *Raft) SendAppendEntries(server int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	ok := rf.sendAppendEntries(server, args, reply)
	if !ok {
		return
	}
	rf.mu.Lock()
	// Discard replies to RPCs that were sent in an earlier term.
	if rf.currentTerm != args.Term {
		DPrintf("Old RPC")
		rf.mu.Unlock()
		return
	}
	// All Servers: If RPC request or response contains term T > currentTerm:
	// set currentTerm = T, convert to follower
	if reply.Term > rf.currentTerm {
		rf.currentTerm = reply.Term
		rf.state = Follower
		rf.votedFor = None
		rf.voteNum = 0
		rf.persist()
		rf.mu.Unlock()
		return
	}
	// If successful: update nextIndex and matchIndex for follower
	if reply.Success {
		rf.matchIndex[server] = args.PrevLogIndex + len(args.Entries)
		rf.nextIndex[server] = args.PrevLogIndex + len(args.Entries) + 1
		// How do we make sure a majority agrees on an entry without committing it twice?
		// Each reply only advances commitIndex to the newest commit point it can establish.
		// A Leader is not allowed to directly commit log entries from previous terms.
		// If there exists an N such that N > commitIndex, a majority of matchIndex[i] >= N,
		// and log[N].term == currentTerm: set commitIndex = N
		N := rf.ToVirtualIndex(len(rf.logs) - 1)
		for N > rf.commitIndex {
			cnt := 1 // the Leader itself
			for i := 0; i < len(rf.peers); i++ {
				if i == rf.me {
					continue
				}
				if rf.matchIndex[i] >= N && rf.logs[rf.ToRealIndex(N)].Term == rf.currentTerm {
					cnt++
				}
			}
			if cnt > len(rf.peers)/2 {
				rf.commitIndex = N
				DPrintf("update commitIndex to %v", N)
				break
			}
			N -= 1
		}
		rf.mu.Unlock()
		return
	}
	if reply.Term == rf.currentTerm && rf.state == Leader {
		// If AppendEntries fails because of log inconsistency: decrement nextIndex and retry.
		// Upon receiving a conflict response, the leader should first search its log for conflictTerm.
		// If it finds an entry in its log with that term, it should set nextIndex to be
		// the one beyond the index of the last entry in that term in its log.
		DPrintf("Server %v reply: conflictTerm %v conflictIndex %v", server, reply.XTerm, reply.XIndex)
		// The Follower has no entry at PrevLogIndex at all.
		if reply.XTerm == -1 {
			if rf.lastIncludedIndex >= reply.XLen {
				go rf.SendInstallSnapshot(server)
			} else {
				rf.nextIndex[server] = reply.XLen
			}
			DPrintf("Conflict: follower log too short. Leader logs %v", rf.logs)
			rf.mu.Unlock()
			return
		}
	}
	// Case: the Follower has an entry at PrevLogIndex, but with a conflicting term.
	// Search the Leader's log backwards for XTerm.
	i := rf.nextIndex[server] - 1
	if i < rf.lastIncludedIndex {
		i = rf.lastIncludedIndex
	}
	for i > rf.lastIncludedIndex && rf.logs[rf.ToRealIndex(i)].Term > reply.XTerm {
		i -= 1
	}
	if i == rf.lastIncludedIndex && rf.logs[rf.ToRealIndex(i)].Term > reply.XTerm {
		// The in-memory log contains no entry with XTerm.
		go rf.SendInstallSnapshot(server)
	} else if rf.logs[rf.ToRealIndex(i)].Term == reply.XTerm {
		// The Leader has XTerm: resume just past its last entry of that term.
		rf.nextIndex[server] = i + 1
	} else {
		// The Leader has no entry with the Follower's term at the conflicting PrevLogIndex.
		if reply.XIndex <= rf.lastIncludedIndex {
			// XIndex has also been truncated into the snapshot, so fall back to InstallSnapshot.
			go rf.SendInstallSnapshot(server)
		} else {
			rf.nextIndex[server] = reply.XIndex
		}
	}
	rf.mu.Unlock()
	return
}
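The commit rule used in the success branch above can be isolated into a pure function, which makes it easier to test on its own. The following is a minimal sketch under simplifying assumptions: the log is reduced to a slice of terms, snapshots are ignored, and `advanceCommitIndex` is a hypothetical helper name rather than part of the lab skeleton.

```go
package main

import "fmt"

// advanceCommitIndex returns the largest N > commitIndex such that a majority
// of servers store entry N and entry N was created in currentTerm. It returns
// commitIndex unchanged if no such N exists.
// Assumptions: terms[i] is the term of the entry at index i (terms[0] is a
// placeholder), matchIndex has one slot per peer, and the slot at position me
// is ignored because the leader always holds its own entries.
func advanceCommitIndex(terms, matchIndex []int, me, commitIndex, currentTerm int) int {
	for n := len(terms) - 1; n > commitIndex; n-- {
		if terms[n] != currentTerm {
			// Terms never decrease along the log, so every earlier entry is
			// from an older term too and cannot be committed by counting.
			break
		}
		count := 1 // the leader itself
		for i, m := range matchIndex {
			if i != me && m >= n {
				count++
			}
		}
		if count > len(matchIndex)/2 {
			return n
		}
	}
	return commitIndex
}

func main() {
	terms := []int{0, 1, 1, 2, 2}
	matchIndex := []int{0, 4, 3, 3, 1} // server 0 is the leader
	fmt.Println(advanceCommitIndex(terms, matchIndex, 0, 1, 2))
	// In term 3 the same log holds no current-term entry, so nothing advances.
	fmt.Println(advanceCommitIndex(terms, matchIndex, 0, 1, 3))
}
```

Because the result depends only on the current `matchIndex` values, running it from several reply handlers in any order cannot commit an index twice: a handler either finds a larger N or leaves commitIndex alone.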
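About fast backup
The code above already implements fast backup. As described in the paper, when the consistency check fails the Leader moves nextIndex back by one entry and retries. Fast backup instead works as follows.
Three fields are added to the AE reply: XLen, XTerm, and XIndex.
The possible conflicts are:
- PrevLogIndex does not exist in the Follower's log
- PrevLogIndex exists, but its term conflicts
- The consistency check passes, but the term at a new entry's position does not match
They are handled as follows:
- In the first case the Follower's log must be too short, so we reply with XLen, the length of the Follower's log, and that becomes the position of the next consistency check.
- Once the terms conflict, we can assume the Follower's term at that position does not match the Leader's, so the Follower finds the index of its first entry with that term and returns it as XIndex. The Leader then looks for that term in its own log: if it has the term, it retries from just past its last entry of that term; otherwise it retries from XIndex.
- The third case needs no extra hint. It is resolved by rule 3 of the AE handler, which deletes the conflicting entry and everything after it.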
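To make the two sides of fast backup concrete, here is a small self-contained sketch. It strips out snapshots and locking and reduces the log to a slice of terms. `conflictHint`, `followerHint`, and `leaderNextIndex` are hypothetical names; the logic mirrors the XTerm/XIndex/XLen handling described above, under the assumption that index 0 holds a placeholder entry.

```go
package main

import "fmt"

// conflictHint carries the extra fields the post adds to AppendEntriesReply.
type conflictHint struct {
	XTerm  int // term of the conflicting entry, or -1 if the log is too short
	XIndex int // index of the first entry with term XTerm
	XLen   int // length of the follower's log
}

// followerHint runs the consistency check on the follower's log.
// Assumptions: terms[i] is the term at index i, terms[0] is a placeholder,
// and snapshots are ignored.
func followerHint(terms []int, prevLogIndex, prevLogTerm int) (bool, conflictHint) {
	if prevLogIndex >= len(terms) {
		return false, conflictHint{XTerm: -1, XLen: len(terms)}
	}
	if terms[prevLogIndex] == prevLogTerm {
		return true, conflictHint{}
	}
	xTerm := terms[prevLogIndex]
	i := prevLogIndex
	for i > 1 && terms[i-1] == xTerm {
		i-- // walk back to the first entry of the conflicting term
	}
	return false, conflictHint{XTerm: xTerm, XIndex: i, XLen: len(terms)}
}

// leaderNextIndex picks the next index to try after a rejected AppendEntries.
func leaderNextIndex(terms []int, h conflictHint) int {
	if h.XTerm == -1 {
		return h.XLen // the follower's log is too short
	}
	for i := len(terms) - 1; i > 0; i-- {
		if terms[i] == h.XTerm {
			return i + 1 // leader has XTerm: go just past its last entry of that term
		}
	}
	return h.XIndex // leader lacks XTerm: skip the follower's whole run of that term
}

func main() {
	leader := []int{0, 4, 6, 6, 6}
	followers := [][]int{
		{0, 4, 5, 5}, // leader has no term-5 entries
		{0, 4, 4, 4}, // leader has term 4, but fewer entries of it
		{0, 4},       // follower is simply too short
	}
	for _, f := range followers {
		// The leader first tries prevLogIndex = 3, prevLogTerm = 6.
		ok, h := followerHint(f, 3, leader[3])
		fmt.Println(ok, h, "next:", leaderNextIndex(leader, h))
	}
}
```

In all three scenarios the Leader lands on the right nextIndex after a single rejected RPC, instead of one RPC per mismatching entry.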
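About testing
The tests call the Start function. In Lab 2 it is acceptable for a call to Start to wait until the next heartbeat before the new entry is sent. In Lab 3A that delay makes tests fail, so it is best to write a ticker that can be triggered on demand.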
go
func (rf *Raft) Start(command interface{}) (int, int, bool) {
	// Your code here (2B).
	// Hold the lock: Start may be called concurrently.
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if rf.state != Leader {
		return -1, -1, false
	}
	rf.logs = append(rf.logs, LogEntry{
		Command: command,
		Term:    rf.currentTerm,
	})
	rf.persist()
	// Make the heartbeat loop fire right away instead of waiting for the next interval.
	rf.nextBeatTime = time.Now()
	DPrintf("new command %v\n", command)
	return rf.ToVirtualIndex(len(rf.logs) - 1), rf.currentTerm, true
}
Results
Test (2B): basic agreement ...
... Passed -- 0.6 3 16 4324 3
Test (2B): RPC byte count ...
... Passed -- 1.6 3 48 113616 11
Test (2B): test progressive failure of followers ...
... Passed -- 4.7 3 133 26474 3
Test (2B): test failure of leaders ...
... Passed -- 5.0 3 196 40612 3
Test (2B): agreement after follower reconnects ...
... Passed -- 3.8 3 94 23743 7
Test (2B): no agreement if too many followers disconnect ...
... Passed -- 3.5 5 216 42926 4
Test (2B): concurrent Start()s ...
... Passed -- 0.6 3 18 5082 6
Test (2B): rejoin of partitioned leader ...
... Passed -- 6.0 3 188 44507 4
Test (2B): leader backs up quickly over incorrect follower logs ...
... Passed -- 16.7 5 1844 1264268 102
Test (2B): RPC counts aren't too high ...
... Passed -- 2.2 3 42 11834 12
PASS
ok 6.5840/raft 44.638s