6.584-Lab5A

Reference blog

Reference Code
The ShardCtrler code skeleton for this lab closely resembles the kvraft implementation from the previous lab.

The four main functions to implement

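Terminology:
replica group: a group of servers that together serve a subset of the shards
shard: a subset of the keys stored by the K/V servers
shardctrler: the shard controller, which manages the replica groups
configuration: records which replica groups currently exist and which shards are assigned to which replica group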

Join RPC

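Used by the administrator to add new replica groups.

The argument is Servers map[int][]string, a mapping from replica group IDs (GIDs) to lists of servers.

The shardctrler must create a new configuration that includes the new replica groups. The new configuration should divide the shards as evenly as possible among all replica groups.

A GID can be reused. For example, once the replica group with GID = 2 has left via Leave, a later Join may use GID = 2 again.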
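To make "as evenly as possible" concrete, here is a minimal sketch of the shard counts a balanced configuration ends up with. The helper name balancedCounts is made up for illustration and is not part of the lab skeleton; NShards = 10 is the constant from the lab code.

```go
package main

import "fmt"

// NShards matches the constant in the lab skeleton.
const NShards = 10

// balancedCounts is a hypothetical helper: it returns how many shards each of
// n groups owns when any two groups differ by at most one shard.
func balancedCounts(n int) []int {
	if n <= 0 {
		return nil
	}
	counts := make([]int, n)
	for i := range counts {
		counts[i] = NShards / n
		if i < NShards%n {
			counts[i]++ // the first NShards%n groups take one extra shard
		}
	}
	return counts
}

func main() {
	fmt.Println(balancedCounts(3)) // [4 3 3]
	fmt.Println(balancedCounts(4)) // [3 3 2 2]
}
```

Which group gets the extra shard does not matter for balance, but every replica must make the same choice.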

Leave RPC

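Previously joined groups leave.

The argument is a list of GIDs.
The shardctrler must create a new configuration that does not include the replica groups named in the argument, and reassign their shards to the remaining replica groups.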

Move RPC

Assigns one specific shard to the group with the given GID.

Arguments: shard int and GID int

If a Join or Leave follows a Move, the shards are rebalanced again, so the effect of the Move may be undone.

Query RPC

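Queries one specific configuration.

Argument: the configuration number

If the number is -1 or larger than the largest known configuration number, the reply contains the latest configuration.

The initial configuration has number 0 and contains no groups; all shards are assigned to GID = 0 (an invalid GID).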
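The rules above can be sketched as follows. Config follows the struct from the lab skeleton. DefaultConfig is used later in this post without being shown, so the version here is an assumption about what it does, and resolve is a hypothetical helper that only illustrates how the Query argument is interpreted.

```go
package main

import "fmt"

const NShards = 10 // number of shards, as in the lab skeleton

// Config follows the configuration struct from the lab skeleton.
type Config struct {
	Num    int              // config number
	Shards [NShards]int     // shard -> gid
	Groups map[int][]string // gid -> servers
}

// DefaultConfig is an assumed implementation: config number 0, no groups,
// and every shard assigned to the invalid GID 0 (the zero value of Shards).
func DefaultConfig() Config {
	return Config{Groups: make(map[int][]string)}
}

// resolve is a hypothetical helper: a negative or too-large number selects
// the latest configuration.
func resolve(configs []Config, num int) Config {
	if num < 0 || num >= len(configs) {
		return configs[len(configs)-1]
	}
	return configs[num]
}

func main() {
	configs := []Config{
		DefaultConfig(),
		{Num: 1, Groups: map[int][]string{1: {"s1"}}},
	}
	fmt.Println(resolve(configs, -1).Num) // 1
	fmt.Println(resolve(configs, 0).Num)  // 0
	fmt.Println(resolve(configs, 99).Num) // 1
}
```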

Code walkthrough

CommandArgs: wraps all four kinds of command in one struct, which is sent to the corresponding RPC handler.

type CommandArgs struct {
	ClientId  int64  // the Client's ID which sends the cmd
	CommandId int64  // the number of the cmd to be sent
	Op        OpType // kind of operation

	// For Join
	Servers map[int][]string // new GID -> servers mappings

	// For Leave
	GIDs []int // GIDs To Leave

	// For Move
	Shard int
	GID   int // Shard -> GID

	// Query
	Num int // desired config number
}

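CommandReply: what the Client receives back from the RPC handler.

The Config field is only filled in for a Query.
AppliedCmdTerm is used to check whether the term observed when a Cmd comes back from the Raft layer to be applied locally equals the term in which the Cmd was handed to the Raft layer (i.e. AppliedCmdTerm == StartTerm).

The Reference Code makes this check once the Cmd's Reply is available: it compares the current term, currentTerm, with the term in which the Cmd was submitted to Raft, message.CommandTerm. The Raft layer I implemented earlier does not include the submission term message.CommandTerm in the Cmd it delivers over the channel, so I added the field AppliedCmdTerm to CommandReply to hold the term observed when the Cmd arrives from Raft. The function Command compares the terms when it receives the Reply for the applied Cmd.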

type CommandReply struct {
	Err            Err
	Config         Config // For QueryReply
	AppliedCmdTerm int    // the term of the applied cmd
}

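OperationContext:

Stores, for each Client, the ID of the most recent Cmd that has been applied locally, together with that Cmd's result, LastReply.

When a Client resends a Cmd that was already applied, the earlier Reply is returned straight from this history.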

type OperationContext struct {
	MaxAppliedCommandId int64
	LastReply           *CommandReply
}
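The server code further down calls sc.isDuplicateRequest, which this post does not show. A minimal sketch of such a check, assuming each Client numbers its commands in increasing order, could look like this; dedupTable and isDuplicate are illustrative names, and CommandReply is reduced to one field to keep the example self-contained.

```go
package main

import "fmt"

// CommandReply is simplified here; the real struct also carries a Config.
type CommandReply struct{ Err string }

type OperationContext struct {
	MaxAppliedCommandId int64
	LastReply           *CommandReply
}

// dedupTable is a hypothetical stand-in for sc.lastOperations (clientId -> context).
type dedupTable map[int64]OperationContext

// isDuplicate reports whether the command was already applied. This assumes a
// client only issues a new command after the previous one has completed.
func (t dedupTable) isDuplicate(clientId, commandId int64) bool {
	ctx, ok := t[clientId]
	return ok && commandId <= ctx.MaxAppliedCommandId
}

func main() {
	t := dedupTable{}
	fmt.Println(t.isDuplicate(1, 0)) // false: nothing applied yet

	t[1] = OperationContext{MaxAppliedCommandId: 0, LastReply: &CommandReply{Err: "OK"}}
	fmt.Println(t.isDuplicate(1, 0)) // true: command 0 was already applied
	fmt.Println(t.isDuplicate(1, 1)) // false: command 1 is new
}
```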

Client
leaderId: the Server that the Client currently believes hosts the Raft Leader
commandId: the number of the next Cmd the Client will send
clientId: the ID of this Client

type Clerk struct {
	servers []*labrpc.ClientEnd
	// Your data here.
	leaderId  int64 
	commandId int64
	clientId  int64
}

Every Cmd the Client sends is wrapped in a CommandArgs.

func (ck *Clerk) Query(num int) Config {
	args := &CommandArgs{
		Op:  Query,
		Num: num,
	}
	return ck.Command(args)
}

func (ck *Clerk) Join(servers map[int][]string) {
	args := &CommandArgs{Op: Join, Servers: servers}
	ck.Command(args)
}

func (ck *Clerk) Leave(gids []int) {
	args := &CommandArgs{Op: Leave, GIDs: gids}
	ck.Command(args)
}

func (ck *Clerk) Move(shard int, gid int) {
	args := &CommandArgs{Op: Move, Shard: shard, GID: gid}
	ck.Command(args)
}

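The Client sends its Cmd to a Server. Only the Server whose Raft peer is the Leader can accept it; the Leader then replicates the Cmd to the Followers.

The Client calls the Server's Command function over RPC, passing the CommandArgs that contains the Cmd, and receives the Server's Reply.

Once a Cmd has succeeded, ck.commandId is incremented so that it holds the number of the Client's next Cmd.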

func (ck *Clerk) Command(args *CommandArgs) Config {
	args.ClientId, args.CommandId = ck.clientId, ck.commandId
	for {
		reply := &CommandReply{}
		if !ck.servers[ck.leaderId].Call("ShardCtrler.Command", args, reply) ||
			reply.Err == ErrWrongLeader || reply.Err == ErrTimeOut {
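			// The call failed or did not reach the leader: try the next server.
			// Even if the real leader was skipped because of a network failure,
			// the other servers reply ErrWrongLeader, so the loop wraps around to it again.
			ck.leaderId = (ck.leaderId + 1) % int64(len(ck.servers))
		} else {
			// Success: advance the command number for the next Cmd.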
			ck.commandId += 1
			return reply.Config
		}
	}
}

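Server:

What the Server does after receiving CommandArgs from a Client:

1. Use the ClientId and CommandId carried in the request to check whether it is a duplicate RPC. If it is, return the stored Reply directly.
2. Pass the CommandArgs to the underlying Raft layer and record startTerm, startIndex and isLeader:
2.1 startTerm: compared later with the term observed when the Cmd is applied locally;
2.2 startIndex: every Cmd passed to the Raft layer gets a different startIndex, which is used as the key of the channel that carries the Reply for this Cmd once it has been applied;
2.3 isLeader: if the underlying Raft peer is not the Leader, return ErrWrongLeader to the Client immediately.
3. Read the Reply for the applied Cmd from the channel created in 2.2 and reply to the Client.
4. Delete the channel created in 2.2 to reduce the memory footprint.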

// Command handles the commands from clients.
// If the command is not a Query and is a duplicate, it returns the previous reply.
// Otherwise it hands the command to the Raft layer.
// If this server's Raft peer is not the leader, it returns ErrWrongLeader.
// If the command starts successfully, it waits for the result from the notify
// channel, or times out after ExecuteTimeout.
func (sc *ShardCtrler) Command(args *CommandArgs, reply *CommandReply) {
	sc.mu.Lock()
	if args.Op != Query && sc.isDuplicateRequest(args.ClientId, args.CommandId) {
		// the command is a duplicate command return previous reply
		LastReply := sc.lastOperations[args.ClientId].LastReply
		reply.Config, reply.Err = LastReply.Config, LastReply.Err
		sc.mu.Unlock()
		return
	}
	sc.mu.Unlock()

	// start the command in the raft layer; bail out if this raft server isn't the Leader
	startIndex, startTerm, isLeader := sc.rf.Start(Command{args})
	if !isLeader {
		reply.Err = ErrWrongLeader
		return
	}

	// create the channel keyed by the command's startIndex in the raft layer's log
	sc.mu.Lock()
	notifyChan := sc.GetNotifyChan(startIndex)
	sc.mu.Unlock()

	// wait the result from channel
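	select {
	case result := <-notifyChan:
		if result.AppliedCmdTerm == startTerm {
			reply.Config = result.Config
			reply.Err = result.Err
		} else {
			// the term changed between Start and apply, so this index may hold
			// a different command; let the client retry instead of reporting success
			reply.Err = ErrWrongLeader
		}
	case <-time.After(ExecuteTimeout):
		DPrintf("TimeOut: Client %v Seq %v Cmd-%v", args.ClientId, args.CommandId, args.Op)
		reply.Err = ErrTimeOut
	}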

	go func() {
		// delete the outdated notify channel to reduce the memory footprint
		sc.mu.Lock()
		delete(sc.notifyChan, startIndex)
		sc.mu.Unlock()
	}()
}
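Command and the Applier both call sc.GetNotifyChan, which is not shown in this post. The sketch below is one plausible shape for it, assuming the caller holds the mutex and that the channel has a buffer of one so the Applier never blocks while holding the lock. The notifier type and the lower-case method name are illustrative only.

```go
package main

import "fmt"

// CommandReply is simplified here to keep the example self-contained.
type CommandReply struct{ Err string }

// notifier is a hypothetical stand-in for the sc.notifyChan map.
type notifier struct {
	chans map[int]chan *CommandReply // raft log index -> reply channel
}

// getNotifyChan returns the channel for a log index, creating it on first use.
// The caller is assumed to hold the server mutex.
func (n *notifier) getNotifyChan(index int) chan *CommandReply {
	if _, ok := n.chans[index]; !ok {
		// Buffer of 1: the applier can deposit a reply even if the RPC
		// handler has not started waiting yet, or has already timed out.
		n.chans[index] = make(chan *CommandReply, 1)
	}
	return n.chans[index]
}

func main() {
	n := &notifier{chans: make(map[int]chan *CommandReply)}

	// The applier may reach index 7 before the handler asks for the channel.
	n.getNotifyChan(7) <- &CommandReply{Err: "OK"}

	// The handler later obtains the same channel and reads the reply.
	fmt.Println((<-n.getNotifyChan(7)).Err) // OK
}
```

Creating the channel lazily on either side matters because Command only asks for the channel after sc.rf.Start has returned, by which time the entry may already have been applied.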

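Applier: receives committed Cmds from the Raft layer and applies them locally. The resulting Reply is passed through the channel back to the Command function above, which answers the Client. The steps are:

1. Take one Cmd that the Raft layer has delivered for application.
2. Check for duplicates; if the Cmd was applied before, reuse the stored Reply.
3. If it is not a duplicate, apply it locally with ApplyCmdToStateMachine to obtain the Reply.
4. Update the record of the Client's most recently applied Cmd.
5. If the Raft peer is the Leader, it must also report the result upward, so put the Reply into the channel created earlier.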

func (sc *ShardCtrler) Applier() {
	for !sc.killed() {
		select {
		// get the Cmd from the raft layer
		case message := <-sc.applyCh:
			if message.CommandValid {
				reply := new(CommandReply)
				command := message.Command.(Command)
				sc.mu.Lock()

				if command.Op != Query && sc.isDuplicateRequest(command.ClientId, command.CommandId) {
					reply = sc.lastOperations[command.ClientId].LastReply
				} else {
					reply = sc.ApplyCmdToStateMachine(command)
					if command.Op != Query {
						sc.lastOperations[command.ClientId] = OperationContext{
							MaxAppliedCommandId: command.CommandId,
							LastReply:           reply,
						}
					}
				}
				// only the Leader's server needs to report the result back to the client
				if currentTerm, isLeader := sc.rf.GetState(); isLeader {
					notifyChan := sc.GetNotifyChan(message.CommandIndex)
					reply.AppliedCmdTerm = currentTerm
					notifyChan <- reply
				}
				sc.mu.Unlock()
			}
		}
	}
}

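ApplyCmdToStateMachine: applies a Cmd received from the Raft layer to the local state machine.

The DPrintf calls in the code are only for debugging. The Reference Code defines an interface, ConfigStateMachine, and exposes the four functions that handle the four kinds of Cmd as its methods.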

// Applier Handler
func (sc *ShardCtrler) ApplyCmdToStateMachine(command Command) *CommandReply {
	reply := new(CommandReply)
	switch command.Op {
	case Join:
		DPrintf("Client %v Seq %v Join-Start", command.ClientId, command.CommandId)
		reply.Err = sc.stateMachine.Join(command.Servers)
		DPrintf("Client %v Seq %v Join-End", command.ClientId, command.CommandId)
	case Leave:
		DPrintf("Client %v Seq %v Leave-Start", command.ClientId, command.CommandId)
		reply.Err = sc.stateMachine.Leave(command.GIDs)
		DPrintf("Client %v Seq %v Leave-End", command.ClientId, command.CommandId)
	case Move:
		DPrintf("Client %v Seq %v Move-Start", command.ClientId, command.CommandId)
		reply.Err = sc.stateMachine.Move(command.Shard, command.GID)
		DPrintf("Client %v Seq %v Move-End", command.ClientId, command.CommandId)
	case Query:
		DPrintf("Client %v Seq %v Query-Start", command.ClientId, command.CommandId)
		reply.Config, reply.Err = sc.stateMachine.Query(command.Num)
		DPrintf("Client %v Seq %v Query-End", command.ClientId, command.CommandId)
	}
	return reply
}

The ConfigStateMachine interface:

Declaration of the interface:

type ConfigStateMachine interface {
	Join(groups map[int][]string) Err
	Leave(gids []int) Err
	Move(shard, gid int) Err
	Query(num int) (Config, Err)
}

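Implementation of the interface:

The struct MemoryConfigStateMachine implements the methods declared by the interface.
NewMemoryConfigStateMachine: initializes the state machine with the default configuration, in which the placeholder group GID = 0 contains no servers.
deepCopy: deep-copies the Groups map. A Go map is a reference type, so mp1 = mp2 does not copy the contents; both variables then refer to the same map. A deep copy is therefore needed to duplicate a map.

Channels (chan) are reference types too: even when a channel is passed as a function argument, the callee works on the same channel.

Group2Shards: collects, for each GID, the shards that group currently owns.
GetGIDWithMaxAndMinShards: returns both the GID with the most shards (GID_Max) and the GID with the fewest shards (GID_Min). The Reference Code uses two separate functions for this.

Two points to watch in this function:

The GIDs of the map (GID -> shards) must be visited in ascending order, otherwise the last test case fails. Go does not guarantee any map iteration order, so without sorting, different replicas can break ties differently and end up with different configurations.

When the group GID = 0 still holds shards, GID = 0 is chosen directly as the source, and its shards are moved to the non-zero group with the fewest shards.

Join, Leave, Move, Query: the methods declared by the interface. Their logic is fairly straightforward; see the code and its comments.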
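The difference between assigning a map and deep-copying it can be seen in a small standalone experiment. The deepCopy here repeats the function from this post so that the example runs on its own; the variable names are only for illustration.

```go
package main

import "fmt"

// deepCopy duplicates the map and every server slice inside it.
func deepCopy(groups map[int][]string) map[int][]string {
	newGroups := make(map[int][]string, len(groups))
	for gid, servers := range groups {
		newServers := make([]string, len(servers))
		copy(newServers, servers)
		newGroups[gid] = newServers
	}
	return newGroups
}

func main() {
	old := map[int][]string{1: {"a", "b"}}

	// Plain assignment: both names refer to the same underlying map.
	alias := old
	alias[2] = []string{"c"}
	fmt.Println(len(old)) // 2: the write through alias is visible in old

	// Deep copy: later changes do not leak back into old.
	cp := deepCopy(old)
	cp[3] = []string{"d"}
	cp[1][0] = "changed"
	fmt.Println(len(old), old[1][0]) // 2 a
}
```

Without the copy, building a new configuration would silently modify the Groups map of every older configuration stored in Configs.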

type MemoryConfigStateMachine struct {
	Configs []Config
}

func NewMemoryConfigStateMachine() *MemoryConfigStateMachine {
	cf := &MemoryConfigStateMachine{make([]Config, 1)}
	cf.Configs[0] = DefaultConfig()
	return cf
}

func deepCopy(groups map[int][]string) map[int][]string {
	newGroups := make(map[int][]string)
	for gid, servers := range groups {
		newServers := make([]string, len(servers))
		copy(newServers, servers) // element-wise value copy, i.e. a deep copy
		newGroups[gid] = newServers
	}
	return newGroups
}

// return, for each gid, the shards that group currently owns
func Group2Shards(config Config) map[int][]int {
	group2shards := make(map[int][]int)
	for gid := range config.Groups {
		group2shards[gid] = make([]int, 0)
	}
	for shard, gid := range config.Shards {
		group2shards[gid] = append(group2shards[gid], shard)
	}
	return group2shards
}

// return the GIDs of the groups with the most and the fewest shards
func GetGIDWithMaxAndMinShards(group2shards map[int][]int) (maxGID, minGID int) {
	cntMax, cntMin := 0, 1000
	maxGID, minGID = -1, -1
	f := true

	// GID 0 is the initial placeholder group and does not really exist,
	// so any shards it still holds must be moved out
	if shards, ok := group2shards[0]; ok && len(shards) != 0 {
		maxGID, f = 0, false
	}

	// Q: why sort first and look for the max and min in ascending GID order?
	// A: map iteration order is not deterministic; without the sort the last test fails
	gids := []int{}
	for gid := range group2shards {
		gids = append(gids, gid)
	}
	sort.Ints(gids)

	for _, gid := range gids {
		if f && len(group2shards[gid]) > cntMax {
			maxGID, cntMax = gid, len(group2shards[gid])
		}
		if gid != 0 && len(group2shards[gid]) < cntMin {
			// can't move shard to group_0
			minGID, cntMin = gid, len(group2shards[gid])
		}
	}
	//for gid, Shards := range group2shards {
	//	if f && len(Shards) > cntMax {
	//		maxGID, cntMax = gid, len(Shards)
	//	}
	//	if gid != 0 && len(Shards) < cntMin {
	//		// can't move shard to group_0
	//		minGID, cntMin = gid, len(Shards)
	//	}
	//}
	return maxGID, minGID
}

// Join adds new groups to configuration
func (cf *MemoryConfigStateMachine) Join(groups map[int][]string) Err {
	lastConfig := cf.Configs[len(cf.Configs)-1] // the latest configuration
	// create a new configuration
	newConfig := Config{
		Num:    len(cf.Configs),
		Shards: lastConfig.Shards,
		Groups: deepCopy(lastConfig.Groups), // map is reference, so here is a deepCopy
	}

	// add every group from the request that is not already in the last configuration
	for gid, servers := range groups {
		if _, exist := newConfig.Groups[gid]; !exist {
			// the group with this gid is not in the last configuration
			newServers := make([]string, len(servers))
			copy(newServers, servers)
			newConfig.Groups[gid] = newServers
		}
	}
	// find out which shards each group owns
	group2shards := Group2Shards(newConfig)

	// load balance the shards among the groups
	// by maximum groups giving a shard to minimum groups
	// until (maximum groups - minimum groups) <= 1
	for {
		source, target := GetGIDWithMaxAndMinShards(group2shards)
		if source != 0 && len(group2shards[source])-len(group2shards[target]) <= 1 {
			break
		}
		group2shards[target] = append(group2shards[target], group2shards[source][0])
		group2shards[source] = group2shards[source][1:]
	}
	// update newConfig.Shards
	var newShards [NShards]int // an array, not a slice
	for gid, shards := range group2shards {
		for _, shard := range shards {
			newShards[shard] = gid
		}
	}
	newConfig.Shards = newShards

	// append the newConfig to cf.Configs
	cf.Configs = append(cf.Configs, newConfig)
	return OK
}

// Leave allows some groups named gids to leave
func (cf *MemoryConfigStateMachine) Leave(gids []int) Err {
	// get the lastConfig
	lastConfig := cf.Configs[len(cf.Configs)-1]

	// create the newConfig
	newConfig := Config{
		Num:    len(cf.Configs),
		Shards: lastConfig.Shards,
		Groups: deepCopy(lastConfig.Groups),
	}

	// collect the orphan shards owned by Leaving group
	group2Shards := Group2Shards(newConfig)
	orphanShards := []int{}
	for _, gid := range gids {
		// delete group_gid that exists in newConfig
		if _, ok := newConfig.Groups[gid]; ok {
			delete(newConfig.Groups, gid)
		}

		// delete group_gid that exists in group2Shards
		if shards, ok := group2Shards[gid]; ok {
			orphanShards = append(orphanShards, shards...)
			delete(group2Shards, gid)
		}
	}

	// update the newConfig.Shards
	var newShards [NShards]int
	if len(newConfig.Groups) > 0 {
		// re-assign the orphan shards to remain groups
		for _, shard := range orphanShards {
			_, MinGID := GetGIDWithMaxAndMinShards(group2Shards)
			// todo: different with blog
			group2Shards[MinGID] = append(group2Shards[MinGID], shard)
		}

		for gid, shards := range group2Shards {
			for _, shard := range shards {
				newShards[shard] = gid
			}
		}
	}

	newConfig.Shards = newShards
	cf.Configs = append(cf.Configs, newConfig)
	return OK
}

// Move allows shard to be assigned to group named gid
func (cf *MemoryConfigStateMachine) Move(shard, gid int) Err {
	lastConfig := cf.Configs[len(cf.Configs)-1]
	newConfig := Config{
		Num:    len(cf.Configs),
		Shards: lastConfig.Shards,
		Groups: deepCopy(lastConfig.Groups),
	}

	newConfig.Shards[shard] = gid
	cf.Configs = append(cf.Configs, newConfig)
	return OK
}

// Query returns a configuration. If num < 0 or num >= len(Configs), it returns the latest one
func (cf *MemoryConfigStateMachine) Query(num int) (config Config, e Err) {
	if num < 0 || num >= len(cf.Configs) {
		return cf.Configs[len(cf.Configs)-1], OK
	}
	return cf.Configs[num], OK
}
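To watch the balancing loop work without the rest of the server, the following condensed sketch reproduces the idea in one self-contained function. It is not the code above: rebalance, pick and the treatment of shards owned by unknown GIDs are simplifications made up for this example, but it uses the same approach of scanning the GIDs in ascending order and moving one shard at a time from the most loaded group to the least loaded one.

```go
package main

import (
	"fmt"
	"sort"
)

const NShards = 10

// pick scans gids in ascending order and returns the most loaded group
// (wantMax) or the least loaded one; ties go to the smallest GID.
func pick(owned map[int][]int, gids []int, wantMax bool) int {
	best := gids[0]
	for _, gid := range gids[1:] {
		if wantMax && len(owned[gid]) > len(owned[best]) ||
			!wantMax && len(owned[gid]) < len(owned[best]) {
			best = gid
		}
	}
	return best
}

// rebalance is a hypothetical helper: given the current shard -> gid array
// and the live GIDs, it returns a balanced assignment.
func rebalance(shards [NShards]int, live []int) [NShards]int {
	if len(live) == 0 {
		return [NShards]int{} // no groups left: everything goes back to GID 0
	}
	gids := append([]int(nil), live...)
	sort.Ints(gids) // fixed order, so every replica makes the same choices

	owned := make(map[int][]int)
	for _, gid := range gids {
		owned[gid] = nil
	}
	for shard, gid := range shards {
		if _, ok := owned[gid]; !ok {
			gid = pick(owned, gids, false) // orphan (GID 0 or a departed group)
		}
		owned[gid] = append(owned[gid], shard)
	}
	for {
		src, dst := pick(owned, gids, true), pick(owned, gids, false)
		if len(owned[src])-len(owned[dst]) <= 1 {
			break
		}
		owned[dst] = append(owned[dst], owned[src][0])
		owned[src] = owned[src][1:]
	}
	var out [NShards]int
	for gid, ss := range owned {
		for _, s := range ss {
			out[s] = gid
		}
	}
	return out
}

func main() {
	var shards [NShards]int
	shards = rebalance(shards, []int{1})    // group 1 joins
	shards = rebalance(shards, []int{1, 2}) // group 2 joins
	fmt.Println(shards)                     // [2 2 2 2 2 1 1 1 1 1]
}
```

Because the scan order is fixed by the sort, running the same sequence of Join and Leave operations on every replica yields the same Shards array.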
