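MIT 6.824 Lab 2B: A Raft Implementation Walkthrough

Original lab page: MIT 6.824 Lab 2 Raft

Chinese translation of the lab: MIT 6.824 Lab 2 translation

Raft paper: In Search of an Understandable Consensus Algorithm

Chinese translation of the Raft paper: (zhuanlan.zhihu.com/p/524885008)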

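Introduction

Lab 2B implements the core of Raft: log replication. When a request reaches the cluster, Raft reaches consensus by replicating log entries. When a machine crashes, one of the remaining machines can be elected Leader, and committed entries are not lost.

The implementation again follows Figure 2 of the Raft paper. If you hit a particularly strange bug, reread the paper first and check whether your understanding differs from its description.

The core: AppendEntries RPCs

This RPC serves both as the log replication request and as the heartbeat.

Definitions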

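```go
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so the follower knows who the leader is
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // empty for heartbeat
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // follower's currentTerm, for the leader to update itself
	Success bool // true if the follower's log matched at PrevLogIndex/PrevLogTerm
	// Fields for fast backup:
	XTerm  int // term of the conflicting entry (if any)
	XIndex int // index of the first entry with that term (if any)
	XLen   int // length of the follower's log
}
```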

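The heartbeat sender

Common pitfall: when should log entries be sent?

If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
Each time an AE RPC is sent, check whether log entries need to go with it.
Note that you send not just one entry but every entry from nextIndex[i] through the end of the log. If there is nothing to send, the request must still go out, as a heartbeat.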
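The rule above can be isolated into a small helper. This is only a sketch: `entriesToSend` is a hypothetical name, snapshots are ignored, and the log is assumed to keep a dummy entry at position 0 so that slice positions equal Raft indices.

```go
package main

import "fmt"

// LogEntry mirrors the entry type used in this post.
type LogEntry struct {
	Term    int
	Command interface{}
}

// entriesToSend returns a copy of every entry from nextIndex to the end of
// the log. An empty result means the RPC is a pure heartbeat.
// Assumption: log[0] is a dummy entry, so slice positions equal Raft indices.
func entriesToSend(log []LogEntry, nextIndex int) []LogEntry {
	lastLogIndex := len(log) - 1
	if lastLogIndex < nextIndex {
		return []LogEntry{}
	}
	out := make([]LogEntry, lastLogIndex-nextIndex+1)
	copy(out, log[nextIndex:])
	return out
}

func main() {
	log := []LogEntry{{Term: 0}, {Term: 1}, {Term: 1}, {Term: 2}}
	fmt.Println(len(entriesToSend(log, 2))) // 2: entries at indices 2 and 3
	fmt.Println(len(entriesToSend(log, 4))) // 0: follower is caught up, heartbeat only
}
```

Copying the entries rather than sharing the slice keeps the RPC arguments independent of later changes to the leader's log.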

Sending the AE RPC

```go
// cycleAppendEntries is the periodic heartbeat loop run by the leader.
func (rf *Raft) cycleAppendEntries() {
	rf.mu.Lock()
	rf.nextBeatTime = time.Now()
	rf.mu.Unlock()
	for !rf.killed() {
		rf.mu.Lock()
		// Stop sending heartbeats once this server is no longer the leader.
		if rf.state != Leader {
			rf.mu.Unlock()
			return
		}
		// Not yet time for the next round: release the lock and check again shortly.
		if !time.Now().After(rf.nextBeatTime) {
			rf.mu.Unlock()
			time.Sleep(time.Millisecond) // avoid spinning on the lock
			continue
		}
		for i := 0; i < len(rf.peers); i++ {
			if i == rf.me {
				continue
			}
			args := AppendEntriesArgs{
				LeaderId:     rf.me,
				Term:         rf.currentTerm,
				LeaderCommit: rf.commitIndex,
				PrevLogIndex: rf.nextIndex[i] - 1,
			}
			// The entries this follower needs have already been compacted into the snapshot.
			if args.PrevLogIndex < rf.lastIncludedIndex {
				go rf.SendInstallSnapshot(i)
				continue
			}
			// If last log index >= nextIndex for a follower:
			// send AppendEntries RPC with log entries starting at nextIndex.
			if rf.ToVirtualIndex(len(rf.logs)-1) > args.PrevLogIndex {
				args.Entries = rf.logs[rf.ToRealIndex(args.PrevLogIndex+1):]
				DPrintf("Server %v Send AE Args %v To %v", rf.me, args, i)
			} else {
				// Nothing new to send: this RPC is a pure heartbeat.
				args.Entries = make([]LogEntry, 0)
			}
			args.PrevLogTerm = rf.logs[rf.ToRealIndex(args.PrevLogIndex)].Term
			reply := AppendEntriesReply{}
			go rf.SendAppendEntries(i, &args, &reply)
		}
		rf.nextBeatTime = time.Now().Add(time.Duration(HeartBeatInterval) * time.Millisecond)
		rf.mu.Unlock()
	}
}
```
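The heartbeat code calls `rf.ToVirtualIndex` and `rf.ToRealIndex`, which this post does not show. Judging from how they are used (the slice position for `lastIncludedIndex` is read directly), a plausible definition is a plain offset by `lastIncludedIndex`. The sketch below is an assumption, not the post's actual helpers; `raftLog` and its fields are hypothetical stand-ins for the corresponding `Raft` fields.

```go
package main

import "fmt"

// raftLog is a hypothetical stand-in for the relevant fields of the Raft struct.
type raftLog struct {
	lastIncludedIndex int   // Raft index of the last entry covered by the snapshot
	terms             []int // terms[0] is assumed to be the placeholder for lastIncludedIndex
}

// toRealIndex maps a Raft (virtual) index to a position in the in-memory slice.
func (l *raftLog) toRealIndex(virtual int) int { return virtual - l.lastIncludedIndex }

// toVirtualIndex maps a slice position back to a Raft (virtual) index.
func (l *raftLog) toVirtualIndex(real int) int { return real + l.lastIncludedIndex }

func main() {
	l := &raftLog{lastIncludedIndex: 10, terms: []int{3, 3, 4}}
	fmt.Println(l.toVirtualIndex(len(l.terms) - 1)) // 12: Raft index of the last entry
	fmt.Println(l.toRealIndex(11))                  // 1: slice position of entry 11
}
```

Before any snapshot is taken, `lastIncludedIndex` is 0 and both functions are the identity, which is why the same code works for Lab 2B.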

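Handling the AE RPC

This part follows Figure 2 closely. Below is my reading of each sentence; every one of them must show up in the code. Note that "Reply false" in the figure means return immediately.

  1. Reply false if term < currentTerm
  2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
    This is the consistency check. It works like an induction step: starting from the very first entry, the previous entry is verified before anything is appended, so for any later entry it is enough to check that an entry exists at the preceding index with the matching term.
  3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it
    Note that this step is reached only if prevLogIndex and prevLogTerm matched.
  4. Append any new entries not already in the log
    Pay special attention to the word "new". Even though the previous two steps succeeded, this may be a stale RPC, so you must determine whether the follower's log or the entries in args are more recent.
  5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)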
```go
func (rf *Raft) HandlerAppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()

	// 1. Reply false if term < currentTerm.
	if args.Term < rf.currentTerm {
		DPrintf("Old Leader %v ", args.LeaderId)
		reply.Success = false
		reply.Term = rf.currentTerm
		return
	}
	// A valid vote or a valid AE must refresh the election timestamp.
	rf.stamp = time.Now()
	// All Servers: If RPC request or response contains term T > currentTerm:
	// set currentTerm = T, convert to follower.
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.voteNum = 0
		rf.votedFor = None
		rf.persist()
	}
	// A candidate in the same term must also step down when it hears from the leader.
	rf.state = Follower
	reply.Term = rf.currentTerm
	DPrintf("Server %v Term:%v\n", rf.me, rf.currentTerm)

	// 2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm.
	if args.PrevLogIndex >= rf.ToVirtualIndex(len(rf.logs)) || args.PrevLogIndex < rf.lastIncludedIndex {
		// PrevLogIndex is past the end of the log, or already compacted into the snapshot.
		reply.XTerm = -1
		reply.XLen = rf.ToVirtualIndex(len(rf.logs)) // the follower's log is too short
		reply.Success = false
		return
	}
	if args.PrevLogTerm != rf.logs[rf.ToRealIndex(args.PrevLogIndex)].Term {
		// The entry exists but its term conflicts.
		reply.XTerm = rf.logs[rf.ToRealIndex(args.PrevLogIndex)].Term
		i := args.PrevLogIndex
		for i > rf.lastIncludedIndex && rf.logs[rf.ToRealIndex(i)].Term == reply.XTerm {
			i -= 1
		}
		reply.XIndex = i + 1
		// Log length including the snapshotted part, so it must be a virtual index.
		reply.XLen = rf.ToVirtualIndex(len(rf.logs))
		reply.Success = false
		DPrintf("server %v: term mismatch at PrevLogIndex %v, args.PrevLogTerm=%v, actual term=%v\n", rf.me, args.PrevLogIndex, args.PrevLogTerm, reply.XTerm)
		return
	}

	// The consistency check has passed.
	// 3. If an existing entry conflicts with a new one (same index but different terms),
	//    delete the existing entry and all that follow it.
	// 4. Append any new entries not already in the log.
	if len(args.Entries) != 0 {
		lastLogIndex := rf.ToVirtualIndex(len(rf.logs) - 1)
		lastLogTerm := rf.logs[rf.ToRealIndex(lastLogIndex)].Term
		// Overwrite from PrevLogIndex+1 unless this is a stale RPC: if our last entry
		// already carries the leader's term and reaches further than args.Entries,
		// our log already contains these entries and must not be truncated.
		if lastLogTerm != args.Term || args.PrevLogIndex+len(args.Entries) >= lastLogIndex {
			rf.logs = rf.logs[:rf.ToRealIndex(args.PrevLogIndex+1)]
			rf.logs = append(rf.logs, args.Entries...)
		}
		rf.persist()
		DPrintf("server %v append %v success\n", rf.me, args.Entries)
	}

	// 5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry).
	if args.LeaderCommit > rf.commitIndex {
		rf.commitIndex = args.LeaderCommit
		if last := rf.ToVirtualIndex(len(rf.logs) - 1); last < rf.commitIndex {
			rf.commitIndex = last
		}
		DPrintf("server %v advances commitIndex to %v", rf.me, rf.commitIndex)
	}
	reply.Success = true
}
```
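Steps 3 and 4 can also be implemented literally, comparing entry by entry, which makes the handling of stale RPCs explicit. The sketch below is an alternative to the term-and-length test in the handler, not the handler's own logic; `mergeEntries` is a hypothetical name, snapshots are ignored, and the log is assumed to keep a dummy entry at position 0.

```go
package main

import "fmt"

// LogEntry mirrors the entry type used in this post.
type LogEntry struct {
	Term    int
	Command interface{}
}

// mergeEntries applies steps 3 and 4 of Figure 2, assuming the consistency
// check at prevLogIndex has already passed.
// Assumption: log[0] is a dummy entry, so slice positions equal Raft indices.
func mergeEntries(log []LogEntry, prevLogIndex int, entries []LogEntry) []LogEntry {
	for i, e := range entries {
		idx := prevLogIndex + 1 + i
		if idx >= len(log) {
			// Past the end of the log: everything from here on is new.
			return append(log, entries[i:]...)
		}
		if log[idx].Term != e.Term {
			// Conflict: drop this entry and all that follow, then append.
			// The three-index slice forces a fresh backing array.
			return append(log[:idx:idx], entries[i:]...)
		}
		// Same index and same term: the entry is already present, keep it.
	}
	// Every incoming entry was already in the log (stale RPC): nothing changes.
	return log
}

func main() {
	log := []LogEntry{{Term: 0}, {Term: 1}, {Term: 1}, {Term: 1}}
	// A stale RPC carrying only entry 1 truncates nothing.
	fmt.Println(len(mergeEntries(log, 0, []LogEntry{{Term: 1}}))) // 4
	// A conflicting entry at index 2 replaces entries 2 and 3.
	fmt.Println(len(mergeEntries(log, 1, []LogEntry{{Term: 2}}))) // 3
}
```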

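Handling the AE RPC reply

The hard part: this code runs concurrently, so how do we make sure that replies never commit the same entry twice?

  • Each reply only advances commitIndex to the newest commit point it can justify, so an entry is never committed twice.
  • The Leader is not allowed to commit entries from previous terms by counting replicas.
  • If there exists an N such that N > commitIndex, a majority of matchIndex[i] >= N, and log[N].term == currentTerm: set commitIndex = N

Starting from the end of the log, check entry by entry whether it has been replicated on a majority of servers.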
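The commit rule in the last bullet can also be evaluated without scanning backwards: sort a copy of matchIndex and take the value that a majority has reached. This is a different technique from the backward scan described here, shown as a sketch; `nextCommitIndex` and `termAt` are hypothetical names, and the leader's own slot in matchIndex is assumed to hold its last log index.

```go
package main

import (
	"fmt"
	"sort"
)

// nextCommitIndex returns the commit index a leader may advance to.
// matchIndex has one slot per peer; the leader's own slot is assumed to hold
// its last log index. termAt returns the term of the entry at a given index.
func nextCommitIndex(matchIndex []int, commitIndex, currentTerm int, termAt func(int) int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// In ascending order, the value at position (n-1)/2 is reached by a majority.
	n := sorted[(len(sorted)-1)/2]
	// Terms never decrease along the log, so if the entry at n is from an older
	// term, no smaller index can hold an entry from currentTerm either.
	if n > commitIndex && termAt(n) == currentTerm {
		return n
	}
	return commitIndex
}

func main() {
	terms := []int{0, 1, 1, 2, 2, 2} // term of each entry; index 0 is a dummy
	termAt := func(i int) int { return terms[i] }
	matchIndex := []int{5, 3, 4} // slot 0 is the leader itself

	fmt.Println(nextCommitIndex(matchIndex, 2, 2, termAt)) // 4
	// Same replication state, but the leader is now in term 3:
	// entries from term 2 cannot be committed by counting replicas.
	fmt.Println(nextCommitIndex(matchIndex, 2, 3, termAt)) // 2
}
```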

```go
func (rf *Raft) SendAppendEntries(server int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	ok := rf.sendAppendEntries(server, args, reply)
	if !ok {
		return
	}
	rf.mu.Lock()
	defer rf.mu.Unlock()
	// Drop replies to RPCs that were sent in an earlier term.
	if rf.currentTerm != args.Term {
		DPrintf("Old RPC")
		return
	}

	// All Servers: If RPC request or response contains term T > currentTerm:
	// set currentTerm = T, convert to follower.
	if reply.Term > rf.currentTerm {
		rf.currentTerm = reply.Term
		rf.state = Follower
		rf.votedFor = None
		rf.voteNum = 0
		rf.persist()
		return
	}

	// If successful: update nextIndex and matchIndex for follower.
	if reply.Success {
		// Replies can arrive out of order, so matchIndex must never move backwards.
		if match := args.PrevLogIndex + len(args.Entries); match > rf.matchIndex[server] {
			rf.matchIndex[server] = match
		}
		rf.nextIndex[server] = rf.matchIndex[server] + 1
		// How do we make sure a majority agrees on an entry without committing it twice?
		// Each reply only advances commitIndex to the newest point it can justify,
		// and the Leader may not commit entries from previous terms.
		// If there exists an N such that N > commitIndex, a majority of matchIndex[i] >= N,
		// and log[N].term == currentTerm: set commitIndex = N.
		N := rf.ToVirtualIndex(len(rf.logs) - 1)
		for N > rf.commitIndex {
			cnt := 1 // the leader itself
			for i := 0; i < len(rf.peers); i++ {
				if i == rf.me {
					continue
				}
				if rf.matchIndex[i] >= N && rf.logs[rf.ToRealIndex(N)].Term == rf.currentTerm {
					cnt++
				}
			}
			if cnt > len(rf.peers)/2 {
				rf.commitIndex = N
				DPrintf("update commitIndex to %v", N)
				break
			}
			N -= 1
		}
		return
	}

	if reply.Term == rf.currentTerm && rf.state == Leader {
		// If AppendEntries fails because of log inconsistency: decrement nextIndex and retry.
		// Upon receiving a conflict response, the leader should first search its log for conflictTerm.
		// If it finds an entry in its log with that term, it should set nextIndex to be
		// the one beyond the index of the last entry in that term in its log.
		DPrintf("Server %v reply: conflictTerm %v conflictIndex %v", server, reply.XTerm, reply.XIndex)
		// The follower has no entry at PrevLogIndex at all.
		if reply.XTerm == -1 {
			if rf.lastIncludedIndex >= reply.XLen {
				go rf.SendInstallSnapshot(server)
			} else {
				rf.nextIndex[server] = reply.XLen
			}
			DPrintf(" ConflictIndex By Too Short Follower Len . Leader logs %v", rf.logs)
			return
		}
		// Walk back from nextIndex-1 to the leader's last entry whose term is <= XTerm.
		i := rf.nextIndex[server] - 1
		if i < rf.lastIncludedIndex {
			i = rf.lastIncludedIndex
		}
		for i > rf.lastIncludedIndex && rf.logs[rf.ToRealIndex(i)].Term > reply.XTerm {
			i -= 1
		}
		if i == rf.lastIncludedIndex && rf.logs[rf.ToRealIndex(i)].Term > reply.XTerm {
			// XTerm is not in the in-memory log; it can only be in the snapshot.
			go rf.SendInstallSnapshot(server)
		} else if rf.logs[rf.ToRealIndex(i)].Term == reply.XTerm {
			// The leader has XTerm: i is its last entry with that term.
			rf.nextIndex[server] = i + 1
		} else {
			// The leader has no entry with the follower's term at the conflicting position.
			if reply.XIndex <= rf.lastIncludedIndex {
				// XIndex has been compacted away too, so fall back to InstallSnapshot.
				go rf.SendInstallSnapshot(server)
			} else {
				rf.nextIndex[server] = reply.XIndex
			}
		}
	}
}
```

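On fast backup

My code above already implements fast backup. As the paper describes it, when the consistency check fails the leader moves nextIndex back by one entry and retries. Fast backup works as follows instead.

Three fields are added to the AE reply: XLen, XTerm, and XIndex.

The possible conflicts are:

  1. PrevLogIndex does not exist in the follower's log.
  2. PrevLogIndex exists, but its term conflicts.
  3. The consistency check passes, but the term at a new entry's position does not match.

  • In the first case the follower's log must be too short, so it replies with XLen; the follower's log length becomes the position of the next consistency check.
  • In the second case we can assume the follower's entries for that term diverge from the leader's, so the follower reports the index of its first entry with that term as XIndex. The leader then searches its own log for XTerm: if it has that term, it resends from just past its last entry of that term; otherwise it resends from XIndex.
  • The third case needs no backup: the follower deletes the conflicting entry and everything after it, then appends the new entries.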
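The leader's side of fast backup fits in one small function. This is a simplified sketch of the rule described here, with snapshots left out; `nextIndexAfterConflict` and `leaderTerms` are hypothetical names, and index 0 is assumed to be a dummy entry.

```go
package main

import "fmt"

// nextIndexAfterConflict computes the leader's new nextIndex for a follower
// from a rejected AppendEntries reply. leaderTerms[i] is the term of the
// leader's entry at index i. Assumptions: index 0 is a dummy entry and no
// part of the log has been compacted into a snapshot.
func nextIndexAfterConflict(leaderTerms []int, xTerm, xIndex, xLen int) int {
	if xTerm == -1 {
		// The follower's log is too short: retry at the end of its log.
		return xLen
	}
	// Look for the leader's last entry with term xTerm.
	for i := len(leaderTerms) - 1; i > 0; i-- {
		if leaderTerms[i] == xTerm {
			return i + 1 // one past the leader's last entry of that term
		}
	}
	// The leader has no entry with xTerm: skip the follower's whole run of it.
	return xIndex
}

func main() {
	leaderTerms := []int{0, 4, 4, 6, 6, 6}
	fmt.Println(nextIndexAfterConflict(leaderTerms, -1, 0, 1)) // 1: follower log too short
	fmt.Println(nextIndexAfterConflict(leaderTerms, 4, 1, 6))  // 3: leader has term 4 up to index 2
	fmt.Println(nextIndexAfterConflict(leaderTerms, 5, 2, 6))  // 2: leader has no term 5
}
```

Each rejected reply skips at least a whole term's worth of entries, instead of one entry per round trip.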

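On testing

The tests call Start. In Lab 2 it is acceptable for an entry appended by Start to wait for the next heartbeat before being sent, but in Lab 3A that delay makes the tests fail. So it is best to write a controllable ticker: in the code below, Start sets rf.nextBeatTime to the current time, so the heartbeat loop sends the new entry on its next iteration.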

```go
func (rf *Raft) Start(command interface{}) (int, int, bool) {
	// Your code here (2B).
	// Start may be called concurrently, so hold the lock for the whole call.
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if rf.state != Leader {
		return -1, -1, false
	}

	rf.logs = append(rf.logs, LogEntry{
		Command: command,
		Term:    rf.currentTerm,
	})
	rf.persist()
	// Make the heartbeat loop send on its next iteration instead of waiting a full interval.
	rf.nextBeatTime = time.Now()
	DPrintf("new command %v\n", command)
	return rf.ToVirtualIndex(len(rf.logs) - 1), rf.currentTerm, true
}
```

Results

```
Test (2B): basic agreement ...
  ... Passed --   0.6  3   16    4324    3
Test (2B): RPC byte count ...
  ... Passed --   1.6  3   48  113616   11
Test (2B): test progressive failure of followers ...
  ... Passed --   4.7  3  133   26474    3
Test (2B): test failure of leaders ...
  ... Passed --   5.0  3  196   40612    3
Test (2B): agreement after follower reconnects ...
  ... Passed --   3.8  3   94   23743    7
Test (2B): no agreement if too many followers disconnect ...
  ... Passed --   3.5  5  216   42926    4
Test (2B): concurrent Start()s ...
  ... Passed --   0.6  3   18    5082    6
Test (2B): rejoin of partitioned leader ...
  ... Passed --   6.0  3  188   44507    4
Test (2B): leader backs up quickly over incorrect follower logs ...
  ... Passed --  16.7  5 1844 1264268  102
Test (2B): RPC counts aren't too high ...
  ... Passed --   2.2  3   42   11834   12
PASS
ok  	6.5840/raft	44.638s
```