DeepSeek group-limited expert routing和负载均衡

Ref

https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py

GitHub - deepseek-ai/EPLB: Expert Parallelism Load Balancer

DeepSeek-V3 Technical Report

DeepSeek的路由方法

python 复制代码
class Gate(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.n_groups = args.n_expert_groups
        self.topk_groups = args.n_limited_groups
        self.score_func = args.score_func
        self.route_scale = args.route_scale
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        self.bias = nn.Parameter(torch.empty(args.n_routed_experts)) if self.dim == 7168 else None

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        scores = linear(x, self.weight)
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1, dtype=torch.float32)
        else:
            scores = scores.sigmoid()
        original_scores = scores
        if self.bias is not None:
            scores = scores + self.bias
        if self.n_groups > 1: # n_groups = n_expert_groups = 8
            scores = scores.view(x.size(0), self.n_groups, -1)
            if self.bias is None:
                group_scores = scores.amax(dim=-1)
            else:
                group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1)
            indices = group_scores.topk(self.topk_groups, dim=-1)[1] # topk_groups = n_limited_groups = 4
            mask = torch.zeros_like(scores[..., 0]).scatter_(1, indices, True)
            scores = (scores * mask.unsqueeze(-1)).flatten(1)
        indices = torch.topk(scores, self.topk, dim=-1)[1] # topk = n_activated_experts = 8
        weights = original_scores.gather(1, indices)
        if self.score_func == "sigmoid":
            weights /= weights.sum(dim=-1, keepdim=True)
        weights *= self.route_scale
        return weights.type_as(x), indices

每个token有1个共享专家和256个路由专家,对这256个路由专家选择出8个专家进行实际的计算。

常规的MOE是根据score选择最高的topk个专家,具有较大的随机性。

但是DeepSeek把这256个专家分成了连续的n_expert_groups=8个组,然后根据每个组的最高score选择出n_limited_groups=4个组,其他组的score值清零。使得最终选择的topk=8随机分布在4个组内。但并没有限制每个组一定有多少个专家选中。这一定程度上限制了topk出现的随机性。

Expert-Parallel Load Balancer

  • 核心问题:对于给定 MoE 模型,存在一些天然的高负载专家(expert),导致不同 GPU 的专家计算负载不均衡
  • 优化目标:每个 GPU 上的专家计算量均衡(即最小化所有 GPU 的 dispatch 接收量的最大值)。To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens.

Moreover, thanks to the group-limited expert routing used in DeepSeek-V3, we also attempt to place the experts of the same group to the same node to reduce inter-node data traffic, whenever possible.

Prefill阶段 Hierarchical Load Balancing

Prefill:路由专家 EP32、MLA 和共享专家 DP32,一个部署单元是 4 节点,32 个冗余路由专家,每张卡 9 个路由专家和 1 个共享专家

When the number of server nodes divides the number of expert groups, we use the hierarchical load balancing policy to harness the group-limited expert routing. We first pack the expert groups to nodes evenly, ensuring the loads of different nodes are balanced. Then, we replicate the experts within each node. Finally, we pack the replicated experts to individual GPUs to ensure different GPUs are load-balanced. The hierarchical load balancing policy can be used in prefilling stage with a smaller expert-parallel size.

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads , striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.

Decoding阶段 Global Load Balancing

Decode:路由专家 EP144、MLA 和共享专家 DP144,一个部署单元是 18 节点,32 个冗余路由专家,每张卡 2 个路由专家和 1 个共享专家

In other cases, we use the global load balancing policy that replicates the experts globally regardless of expert groups, and pack the replicated experts to individual GPUs . This policy can beadopted in decoding stage with a larger expert-parallel size.

During decoding, we treat the shared expert as a routed one . From this perspective, each token will select 9 experts during routing , where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320 . For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

相关推荐
大模型真好玩12 小时前
大模型训练全流程实战指南工具篇(九)——LLamaFactory大模型训练工具使用指南
人工智能·agent·deepseek
Lab_AI2 天前
创腾科技推出DeepSeek智能一体机:AI4S驱动研发效率提升300%,打造科学家“第二大脑”
人工智能·ai4s·deepseek·科学智能
AC赳赳老秦2 天前
OpenClaw核心命令详解(常用指令+实战示例,高效开启自动化工作)
大数据·运维·人工智能·自动化·ai-native·deepseek·openclaw
KIO no way3 天前
自定义Node.js安装路径及环境变量配置
node.js·deepseek
码路飞3 天前
OpenClaw 模型配置终极指南:5 种方案实测,帮你选出最适合的那个
claude·deepseek
gujunge4 天前
Spring with AI (3): 定制对话——Prompt模板引入
ai·大模型·llm·openai·qwen·rag·spring ai·deepseek
视觉&物联智能4 天前
【杂谈】-人工智能蓬勃演进背后的隐性支撑体系
人工智能·ai·aigc·算力·agi·deepseek
DS随心转插件4 天前
ChatGPT或Gemini如何生成word文档
人工智能·ai·chatgpt·word·deepseek·ds随心转
gujunge4 天前
Spring with AI (2): 评估答案——UnitTest引入
ai·大模型·llm·openai·qwen·rag·spring ai·deepseek
倾心琴心5 天前
【agent辅助pcb routing coding学习】实践3 kicad routing tools 从PCB文件获取了哪些信息
算法·agent·pcb·eda·routing