


replica group: a set of servers that together serve some of the shards.
shard: a subset of the keys stored by the K/V servers.

Join RPC

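An administrator uses Join to add new replica groups.

  • The argument is Servers map[int][]string, a mapping from replica group IDs (GIDs) to lists of servers.
  • The shardctrler creates a new configuration that includes the new replica groups. The new configuration should divide the shards as evenly as possible among all replica groups.
  • GIDs may be reused: if the replica group with GID = 2 leaves via Leave, a later Join may use GID = 2 again.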

Leave RPC


  • The argument is a list of GIDs.
  • The shardctrler creates a new configuration that does not include the listed replica groups and reassigns their shards to the remaining groups.

Move RPC


  • Arguments: shard int and GID int.
  • If a Move is followed by a Join or Leave, the shards are redistributed again, so the effect of the Move may not last.

Query RPC


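  • Argument: a configuration number.
  • If the number is -1 or larger than the largest known configuration number, the reply carries the latest configuration.
  • The initial configuration has number 0, contains no groups, and assigns every shard to GID = 0 (an invalid GID).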
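
To make the initial configuration concrete, here is a minimal sketch of a configuration type and its number-0 value. The field names Num, Shards and Groups follow the way Config is used in the code later in this post; the value NShards = 10 and the body of DefaultConfig are assumptions for illustration, since the post does not show them.

```go
package main

import "fmt"

// NShards is the number of shards. 10 is an assumed value for this sketch.
const NShards = 10

// Config describes one configuration: which group serves each shard.
type Config struct {
	Num    int              // configuration number
	Shards [NShards]int     // shard -> GID
	Groups map[int][]string // GID -> servers
}

// DefaultConfig returns configuration 0: no groups, and every shard
// assigned to GID 0, which is not a valid group.
// (Assumed implementation; the zero value of the array already maps
// every shard to GID 0.)
func DefaultConfig() Config {
	return Config{Groups: make(map[int][]string)}
}

func main() {
	c := DefaultConfig()
	fmt.Println(c.Num, len(c.Groups), c.Shards)
}
```

Because Shards is an array rather than a slice, assigning one Config to another copies the shard table by value, while Groups still has to be copied explicitly.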


CommandArgs: wraps all four kinds of commands in a single struct, which is sent to the corresponding RPC handler.

```go
type CommandArgs struct {
	ClientId  int64  // ID of the client that sends the command
	CommandId int64  // sequence number of the command
	Op        OpType // kind of operation

	// For Join
	Servers map[int][]string // new GID -> servers mappings

	// For Leave
	GIDs []int // GIDs to leave

	// For Move
	Shard int
	GID   int // Shard -> GID

	// For Query
	Num int // desired config number
}
```

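CommandReply: the information the Client receives back from the RPC handler.

AppliedCmdTerm is used to check whether the term in which a command comes back from the Raft layer to be applied locally equals the term in which the command was handed to the Raft layer (i.e. AppliedCmdTerm == StartTerm).

The Reference Code makes this check after it obtains the reply for a command, comparing the current term currentTerm with the term in which the command was submitted to Raft, message.CommandTerm. My Raft implementation does not include that submission term (message.CommandTerm) in the messages it sends over the channel, so I added a field AppliedCmdTerm to CommandReply to hold the term observed when the command arrives from Raft. The Command function compares the terms when it receives the reply for the locally applied command.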

```go
type CommandReply struct {
	Err            Err
	Config         Config // For QueryReply
	AppliedCmdTerm int    // the term of the applied cmd
}
```


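OperationContext stores, for each Client, the ID of the latest command that has been applied locally, together with that command's result, LastReply.

When a Client resends a command that has already been applied, the earlier reply is taken from this history instead of applying the command again.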

```go
type OperationContext struct {
	MaxAppliedCommandId int64
	LastReply           *CommandReply
}
```
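
The post calls a duplicate check on this history but does not show it. The following is a minimal sketch of how such a check could work, assuming each Client numbers its commands in increasing order and has only one command outstanding at a time. The function name isDuplicate and the trimmed-down structs are hypothetical stand-ins, not the author's code.

```go
package main

import "fmt"

// CommandReply and OperationContext are trimmed-down copies of the
// structs in the post, just enough for this sketch.
type CommandReply struct{ Err string }

type OperationContext struct {
	MaxAppliedCommandId int64
	LastReply           *CommandReply
}

// isDuplicate reports whether commandId from clientId was already applied.
// The ok check matters: without it, a client's very first command (ID 0)
// would compare equal to the zero value of MaxAppliedCommandId.
func isDuplicate(last map[int64]OperationContext, clientId, commandId int64) bool {
	ctx, ok := last[clientId]
	return ok && commandId <= ctx.MaxAppliedCommandId
}

func main() {
	last := map[int64]OperationContext{}
	fmt.Println(isDuplicate(last, 1, 0)) // first time seen

	// Record that command 0 of client 1 has been applied.
	last[1] = OperationContext{MaxAppliedCommandId: 0, LastReply: &CommandReply{Err: "OK"}}
	fmt.Println(isDuplicate(last, 1, 0)) // retry of command 0
	fmt.Println(isDuplicate(last, 1, 1)) // the next command
}
```

Keeping only the highest applied ID per Client is enough under the stated assumption, because a Client never sends command n+1 before command n has succeeded.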

leaderId: the server the Client currently treats as the Leader of the Raft layer
commandId: the sequence number of the next command the Client will send
clientId: the ID of this Client

```go
type Clerk struct {
	servers []*labrpc.ClientEnd
	// Your data here.
	leaderId  int64
	commandId int64
	clientId  int64
}
```

Every command the Client sends is wrapped in a CommandArgs.

```go
func (ck *Clerk) Query(num int) Config {
	args := &CommandArgs{
		Op:  Query,
		Num: num,
	}
	return ck.Command(args)
}

func (ck *Clerk) Join(servers map[int][]string) {
	args := &CommandArgs{Op: Join, Servers: servers}
	ck.Command(args)
}

func (ck *Clerk) Leave(gids []int) {
	args := &CommandArgs{Op: Leave, GIDs: gids}
	ck.Command(args)
}

func (ck *Clerk) Move(shard int, gid int) {
	args := &CommandArgs{Op: Move, Shard: shard, GID: gid}
	ck.Command(args)
}
```

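Command sends the Client's command to a server. Only the server whose Raft peer is the current Leader can accept it; the Leader then sends it on to the Followers.

The Client invokes the server's Command function over RPC, passing the CommandArgs that carries the command, and receives the server's reply.

After a command has been sent successfully, ck.commandId is incremented so that it holds the sequence number of this Client's next command.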

```go
func (ck *Clerk) Command(args *CommandArgs) Config {
	args.ClientId, args.CommandId = ck.clientId, ck.commandId
	for {
		reply := &CommandReply{}
		if !ck.servers[ck.leaderId].Call("ShardCtrler.Command", args, reply) ||
			reply.Err == ErrWrongLeader || reply.Err == ErrTimeOut {
			// The call failed or did not reach the leader: try the next server.
			// If the leader was skipped only because of a network failure,
			// the other servers answer ErrWrongLeader and the loop
			// wraps around to the leader again.
			ck.leaderId = (ck.leaderId + 1) % int64(len(ck.servers))
		} else {
			// success: advance the sequence number for the next command
			ck.commandId += 1
			return reply.Config
		}
	}
}
```
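
The retry behaviour can be tried out without any RPC machinery. The sketch below replaces each server with a plain function, which is an assumption made purely for illustration: the type server, the function send and the three fake servers are hypothetical and are not part of labrpc or of the post's code.

```go
package main

import "fmt"

// Err values mirroring the ones used in the post.
const (
	OK             = "OK"
	ErrWrongLeader = "ErrWrongLeader"
)

// server is a hypothetical stand-in for an RPC endpoint. It returns
// ok=false when the call is lost, and otherwise an Err string.
type server func() (err string, ok bool)

// send tries servers in round-robin order, starting at *leader, until one
// accepts the command. *leader is left pointing at the server that
// accepted, so the next call starts there. It returns the attempt count.
func send(servers []server, leader *int) int {
	attempts := 0
	for {
		attempts++
		err, ok := servers[*leader]()
		if ok && err == OK {
			return attempts
		}
		*leader = (*leader + 1) % len(servers)
	}
}

func main() {
	follower := func() (string, bool) { return ErrWrongLeader, true }
	dead := func() (string, bool) { return "", false }
	leaderSrv := func() (string, bool) { return OK, true }

	servers := []server{follower, dead, leaderSrv}
	leader := 0
	fmt.Println(send(servers, &leader), leader) // walks past two servers
	fmt.Println(send(servers, &leader), leader) // starts at the cached leader
}
```

Caching the index of the last server that accepted a command keeps the common case to a single call; the rotation only happens after a failure or a leadership change.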


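What the server does after it receives a CommandArgs from a Client:

  1. Use the ClientId and CommandId carried by the request to decide whether it is a duplicate RPC. If it is, return the reply stored in the history.
  2. Hand the CommandArgs to the underlying Raft layer and record startTerm, startIndex and isLeader.
    2.1 startTerm: compared below with the term observed when the command is applied locally;
    2.2 startIndex: every command handed to the Raft layer gets a different startIndex, which is used to create the channel that carries the reply for that command once it has been applied locally;
    2.3 isLeader: if the underlying Raft peer is not the Leader, return ErrWrongLeader to the Client right away.
  3. Read the reply for the command from the channel created in 2.2 and answer the Client.
  4. Delete the channel created in 2.2 to reduce the memory footprint.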

```go
// Command handles commands from clients.
// If the command is not a Query and is a duplicate, it returns the previous reply.
// Otherwise it hands the command to the Raft layer.
// If this Raft server is not the leader, it returns ErrWrongLeader.
// Once the command has started, it waits for the result from the channel,
// or times out after ExecuteTimeout.
func (sc *ShardCtrler) Command(args *CommandArgs, reply *CommandReply) {
	if args.Op != Query && sc.isDuplicateRequest(args.ClientId, args.CommandId) {
		// duplicate command: return the previous reply
		lastReply := sc.lastOperations[args.ClientId].LastReply
		reply.Config, reply.Err = lastReply.Config, lastReply.Err
		return
	}

	// start the command in the raft layer; give up if this server is not the leader
	startIndex, startTerm, isLeader := sc.rf.Start(Command{args})
	if !isLeader {
		reply.Err = ErrWrongLeader
		return
	}

	// get the channel keyed by the command's startIndex in the raft log
	notifyChan := sc.GetNotifyChan(startIndex)

	// wait for the result from the channel
	select {
	case result := <-notifyChan:
		if result.AppliedCmdTerm == startTerm {
			reply.Config = result.Config
			reply.Err = result.Err
		} else {
			// the term changed while waiting, so leadership was lost:
			// report an error instead of an empty success and let the client retry
			reply.Err = ErrWrongLeader
		}
	case <-time.After(ExecuteTimeout):
		DPrintf("TimeOut: Client %v Seq %v Cmd-%v", args.ClientId, args.CommandId, args.Op)
		reply.Err = ErrTimeOut
	}

	go func() {
		// delete the outdated notify channel to reduce the memory footprint
		// (sc.notifyChan is shared with GetNotifyChan, so access to it must be synchronized)
		delete(sc.notifyChan, startIndex)
	}()
}
```
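
GetNotifyChan is called by the handler but is not shown in the post. A minimal sketch of one way to manage per-index channels follows. The type notifier, its methods and the choice of a buffered channel are all assumptions made for illustration, not the author's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

type CommandReply struct{ Err string }

// notifier is a hypothetical holder for per-index notification channels.
type notifier struct {
	mu    sync.Mutex
	chans map[int]chan *CommandReply
}

// get returns the channel for a log index, creating it on first use.
// The channel has capacity 1, so a single send does not block even if
// the RPC handler has already timed out and stopped listening.
func (n *notifier) get(index int) chan *CommandReply {
	n.mu.Lock()
	defer n.mu.Unlock()
	ch, ok := n.chans[index]
	if !ok {
		ch = make(chan *CommandReply, 1)
		n.chans[index] = ch
	}
	return ch
}

// remove drops the channel once the handler is done with it.
func (n *notifier) remove(index int) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.chans, index)
}

func main() {
	n := &notifier{chans: make(map[int]chan *CommandReply)}
	n.get(7) <- &CommandReply{Err: "OK"} // applier side: publish the result
	fmt.Println((<-n.get(7)).Err)        // handler side: same channel, same index
	n.remove(7)
	fmt.Println(len(n.chans))
}
```

Creating the channel lazily in get means it does not matter whether the handler or the applier asks for a given index first; both end up with the same channel.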

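Applier: receives committed commands from the Raft layer and applies them locally. The resulting reply is passed through the channel to the Command function above, which answers the Client. The steps are:

  1. Take one command that the Raft layer has delivered for applying.
  2. Check for duplicates. If the command has been seen before, use the reply stored in the history.
  3. If it is not a duplicate, apply it locally with ApplyCmdToStateMachine to obtain the reply.
  4. Update the record of the latest command applied for this Client.
  5. If the Raft peer is the Leader, it must also report the result upward by putting the reply into the channel created earlier.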

```go
func (sc *ShardCtrler) Applier() {
	for !sc.killed() {
		select {
		// get the command from the raft layer
		case message := <-sc.applyCh:
			if message.CommandValid {
				reply := new(CommandReply)
				command := message.Command.(Command)

				if command.Op != Query && sc.isDuplicateRequest(command.ClientId, command.CommandId) {
					reply = sc.lastOperations[command.ClientId].LastReply
				} else {
					reply = sc.ApplyCmdToStateMachine(command)
					if command.Op != Query {
						sc.lastOperations[command.ClientId] = OperationContext{
							MaxAppliedCommandId: command.CommandId,
							LastReply:           reply,
						}
					}
				}
				// the leader reports the result to the waiting RPC handler
				if currentTerm, isLeader := sc.rf.GetState(); isLeader {
					notifyChan := sc.GetNotifyChan(message.CommandIndex)
					reply.AppliedCmdTerm = currentTerm
					notifyChan <- reply
				}
			}
		}
	}
}
```

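ApplyCmdToStateMachine: applies a command received from the Raft layer to the local state machine.

The DPrintf calls in the code are only for debugging. The Reference Code defines an interface, ConfigStateMachine, and exposes the four functions that handle the four kinds of commands as its methods.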

```go
// Applier Handler
func (sc *ShardCtrler) ApplyCmdToStateMachine(command Command) *CommandReply {
	reply := new(CommandReply)
	switch command.Op {
	case Join:
		DPrintf("Client %v Seq %v: Join-Start", command.ClientId, command.CommandId)
		reply.Err = sc.stateMachine.Join(command.Servers)
		DPrintf("Client %v Seq %v: Join-End", command.ClientId, command.CommandId)
	case Leave:
		DPrintf("Client %v Seq %v: Leave-Start", command.ClientId, command.CommandId)
		reply.Err = sc.stateMachine.Leave(command.GIDs)
		DPrintf("Client %v Seq %v: Leave-End", command.ClientId, command.CommandId)
	case Move:
		DPrintf("Client %v Seq %v: Move-Start", command.ClientId, command.CommandId)
		reply.Err = sc.stateMachine.Move(command.Shard, command.GID)
		DPrintf("Client %v Seq %v: Move-End", command.ClientId, command.CommandId)
	case Query:
		DPrintf("Client %v Seq %v: Query-Start", command.ClientId, command.CommandId)
		reply.Config, reply.Err = sc.stateMachine.Query(command.Num)
		DPrintf("Client %v Seq %v: Query-End", command.ClientId, command.CommandId)
	}
	return reply
}
```



```go
type ConfigStateMachine interface {
	Join(groups map[int][]string) Err
	Leave(gids []int) Err
	Move(shard, gid int) Err
	Query(num int) (Config, Err)
}
```


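MemoryConfigStateMachine: a struct defined to implement the methods declared by the interface.
NewMemoryConfigStateMachine: sets up the initial state, a group with GID = 0 that contains no servers.
deepCopy: makes a deep copy of a map. A map is a reference type, so mp1 = mp2 copies only the reference, not the contents; a deep copy is needed to duplicate a map.


Group2Shards: collects, for each GID, the shards held by that group.
GetGIDWithMaxAndMinShards: returns both the GID with the most shards (GID_Max) and the GID with the fewest shards (GID_Min). The Reference Code uses two separate functions to find GID_Max and GID_Min.


  • When iterating over the map (GID -> shards), visit the keys in ascending order, otherwise the last test fails. Go does not fix the iteration order of a map, so without sorting, different replicas can break ties differently and end up with different configurations.
  • If the group with GID = 0 holds any shards, GID = 0 is used directly as the source, and its shards are moved to the non-zero group with the fewest shards.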
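
The difference between assigning a map and deep-copying it shows up in a small standalone sketch. The deepCopy here is written to behave like the helper described above, and the sample data is made up for illustration.

```go
package main

import "fmt"

// deepCopy copies the map and every server slice inside it, so the
// result shares no memory with the input.
func deepCopy(groups map[int][]string) map[int][]string {
	out := make(map[int][]string, len(groups))
	for gid, servers := range groups {
		out[gid] = append([]string(nil), servers...)
	}
	return out
}

func main() {
	orig := map[int][]string{1: {"a", "b"}}

	// Plain assignment: both names refer to the same map.
	alias := orig
	alias[2] = []string{"c"}
	fmt.Println(len(orig)) // the original has grown too

	// Deep copy: changes to the copy do not reach the original.
	cp := deepCopy(orig)
	cp[3] = []string{"d"}
	cp[1][0] = "changed"
	fmt.Println(len(orig), orig[1][0])
}
```

Copying the inner slices matters as well: copying only the map would still leave both maps pointing at the same server lists.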
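
The ordering point above can be sketched in isolation: collect the keys, sort them, and only then look for the minimum, so that ties are always broken in favour of the smallest GID. The helper name sortedGIDs and the sample data are made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedGIDs returns the keys of the map in ascending order.
func sortedGIDs(g2s map[int][]int) []int {
	gids := make([]int, 0, len(g2s))
	for gid := range g2s {
		gids = append(gids, gid)
	}
	sort.Ints(gids)
	return gids
}

func main() {
	// Groups 3 and 5 tie for the fewest shards. Visiting GIDs in
	// ascending order means every replica settles on group 3.
	g2s := map[int][]int{7: {0, 1}, 5: {2}, 3: {4}}
	minGID := -1
	for _, gid := range sortedGIDs(g2s) {
		if minGID == -1 || len(g2s[gid]) < len(g2s[minGID]) {
			minGID = gid
		}
	}
	fmt.Println(sortedGIDs(g2s), minGID)
}
```

Ranging over g2s directly could yield 3 on one replica and 5 on another, and from that point their configurations would diverge.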


```go
type MemoryConfigStateMachine struct {
	Configs []Config
}

func NewMemoryConfigStateMachine() *MemoryConfigStateMachine {
	cf := &MemoryConfigStateMachine{make([]Config, 1)}
	cf.Configs[0] = DefaultConfig()
	return cf
}
```

```go
func deepCopy(groups map[int][]string) map[int][]string {
	newGroups := make(map[int][]string)
	for gid, servers := range groups {
		newServers := make([]string, len(servers))
		copy(newServers, servers) // element-by-element value copy, i.e. a deep copy
		newGroups[gid] = newServers
	}
	return newGroups
}
```

```go
// Group2Shards returns, for each GID, the shards that group holds.
func Group2Shards(config Config) map[int][]int {
	group2shards := make(map[int][]int)
	for gid := range config.Groups {
		group2shards[gid] = make([]int, 0)
	}
	for shard, gid := range config.Shards {
		group2shards[gid] = append(group2shards[gid], shard)
	}
	return group2shards
}
```

```go
// GetGIDWithMaxAndMinShards returns the GID with the most shards and the
// GID with the fewest shards. It needs the "sort" package.
func GetGIDWithMaxAndMinShards(group2shards map[int][]int) (maxGID, minGID int) {
	cntMax, cntMin := 0, 1000
	maxGID, minGID = -1, -1
	f := true

	// GID 0 is the initial placeholder group and does not really exist,
	// so if it holds any shards they must be moved out
	if shards, ok := group2shards[0]; ok && len(shards) != 0 {
		maxGID, f = 0, false
	}

	// Visit the GIDs in ascending order when looking for the max and min.
	// Ranging over the map directly (for gid, shards := range group2shards)
	// has no fixed order and makes the last test fail.
	gids := []int{}
	for gid := range group2shards {
		gids = append(gids, gid)
	}
	sort.Ints(gids)

	for _, gid := range gids {
		if f && len(group2shards[gid]) > cntMax {
			maxGID, cntMax = gid, len(group2shards[gid])
		}
		if gid != 0 && len(group2shards[gid]) < cntMin {
			// can't move shards to group 0
			minGID, cntMin = gid, len(group2shards[gid])
		}
	}
	return maxGID, minGID
}
```

```go
// Join adds new groups to the configuration.
func (cf *MemoryConfigStateMachine) Join(groups map[int][]string) Err {
	lastConfig := cf.Configs[len(cf.Configs)-1] // the latest configuration
	// create a new configuration
	newConfig := Config{
		Num:    len(cf.Configs),
		Shards: lastConfig.Shards,
		Groups: deepCopy(lastConfig.Groups), // a map is a reference, so deep-copy it
	}

	// add each joining group that does not exist in the last configuration
	for gid, servers := range groups {
		if _, exist := newConfig.Groups[gid]; !exist {
			newServers := make([]string, len(servers))
			copy(newServers, servers)
			newConfig.Groups[gid] = newServers
		}
	}
	// find which shards each group holds
	group2shards := Group2Shards(newConfig)

	// balance the shards among the groups: the group with the most shards
	// gives one shard to the group with the fewest,
	// until (max - min) <= 1
	for {
		source, target := GetGIDWithMaxAndMinShards(group2shards)
		if source != 0 && len(group2shards[source])-len(group2shards[target]) <= 1 {
			break
		}
		group2shards[target] = append(group2shards[target], group2shards[source][0])
		group2shards[source] = group2shards[source][1:]
	}
	// update newConfig.Shards
	var newShards [NShards]int // an array, not a slice
	for gid, shards := range group2shards {
		for _, shard := range shards {
			newShards[shard] = gid
		}
	}
	newConfig.Shards = newShards

	// append the newConfig to cf.Configs
	cf.Configs = append(cf.Configs, newConfig)
	return OK
}
```

```go
// Leave removes the groups listed in gids.
func (cf *MemoryConfigStateMachine) Leave(gids []int) Err {
	// get the lastConfig
	lastConfig := cf.Configs[len(cf.Configs)-1]

	// create the newConfig
	newConfig := Config{
		Num:    len(cf.Configs),
		Shards: lastConfig.Shards,
		Groups: deepCopy(lastConfig.Groups),
	}

	// collect the orphan shards owned by the leaving groups
	group2Shards := Group2Shards(newConfig)
	orphanShards := []int{}
	for _, gid := range gids {
		// delete the group from newConfig if it exists there
		if _, ok := newConfig.Groups[gid]; ok {
			delete(newConfig.Groups, gid)
		}

		// delete the group from group2Shards if it exists there
		if shards, ok := group2Shards[gid]; ok {
			orphanShards = append(orphanShards, shards...)
			delete(group2Shards, gid)
		}
	}

	// update the newConfig.Shards
	var newShards [NShards]int
	if len(newConfig.Groups) > 0 {
		// re-assign the orphan shards to the remaining groups
		for _, shard := range orphanShards {
			_, minGID := GetGIDWithMaxAndMinShards(group2Shards)
			// todo: differs from the blog
			group2Shards[minGID] = append(group2Shards[minGID], shard)
		}

		for gid, shards := range group2Shards {
			for _, shard := range shards {
				newShards[shard] = gid
			}
		}
	}

	newConfig.Shards = newShards
	cf.Configs = append(cf.Configs, newConfig)
	return OK
}
```

```go
// Move assigns shard to the group named gid.
func (cf *MemoryConfigStateMachine) Move(shard, gid int) Err {
	lastConfig := cf.Configs[len(cf.Configs)-1]
	newConfig := Config{
		Num:    len(cf.Configs),
		Shards: lastConfig.Shards,
		Groups: deepCopy(lastConfig.Groups),
	}

	newConfig.Shards[shard] = gid
	cf.Configs = append(cf.Configs, newConfig)
	return OK
}
```

```go
// Query returns a configuration. If num < 0 or num >= len(Configs), it returns the latest one.
func (cf *MemoryConfigStateMachine) Query(num int) (config Config, e Err) {
	if num < 0 || num >= len(cf.Configs) {
		return cf.Configs[len(cf.Configs)-1], OK
	}
	return cf.Configs[num], OK
}
```
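
The balancing loop used by Join can be exercised on its own with a small worked example. The sketch below is a simplified, standalone variant rather than the code above: the names maxMin and rebalance are hypothetical, GID 0 is skipped outright when it holds no shards, and an extra guard stops the loop when no valid group exists.

```go
package main

import (
	"fmt"
	"sort"
)

// maxMin returns the GIDs holding the most and the fewest shards.
// GID 0 is the invalid placeholder group: if it holds any shard it is
// always the source, and it is never chosen as the target.
// GIDs are visited in ascending order so ties break the same way everywhere.
func maxMin(g2s map[int][]int) (maxGID, minGID int) {
	gids := make([]int, 0, len(g2s))
	for gid := range g2s {
		gids = append(gids, gid)
	}
	sort.Ints(gids)

	maxGID, minGID = -1, -1
	zeroHasShards := len(g2s[0]) > 0
	if zeroHasShards {
		maxGID = 0
	}
	for _, gid := range gids {
		if gid == 0 {
			continue
		}
		n := len(g2s[gid])
		if !zeroHasShards && (maxGID == -1 || n > len(g2s[maxGID])) {
			maxGID = gid
		}
		if minGID == -1 || n < len(g2s[minGID]) {
			minGID = gid
		}
	}
	return maxGID, minGID
}

// rebalance moves one shard at a time from the fullest group to the
// emptiest one until their sizes differ by at most one.
func rebalance(g2s map[int][]int) {
	for {
		src, dst := maxMin(g2s)
		if dst == -1 {
			return // no valid group can receive shards
		}
		if src != 0 && len(g2s[src])-len(g2s[dst]) <= 1 {
			return
		}
		g2s[dst] = append(g2s[dst], g2s[src][0])
		g2s[src] = g2s[src][1:]
	}
}

func main() {
	// Ten shards start on the invalid GID 0; groups 1, 2 and 3 have joined.
	g2s := map[int][]int{0: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, 1: {}, 2: {}, 3: {}}
	rebalance(g2s)
	fmt.Println(len(g2s[0]), len(g2s[1]), len(g2s[2]), len(g2s[3])) // 0 4 3 3
}
```

With ten shards and three groups the sizes end up as 4, 3 and 3, and the extra shard lands on group 1 because ties go to the smallest GID.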


