DeepSeek group-limited expert routing和负载均衡

Ref

https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py

GitHub - deepseek-ai/EPLB: Expert Parallelism Load Balancer

DeepSeek-V3 Technical Report

DeepSeek的路由方法

python 复制代码
class Gate(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.n_groups = args.n_expert_groups
        self.topk_groups = args.n_limited_groups
        self.score_func = args.score_func
        self.route_scale = args.route_scale
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        self.bias = nn.Parameter(torch.empty(args.n_routed_experts)) if self.dim == 7168 else None

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        scores = linear(x, self.weight)
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1, dtype=torch.float32)
        else:
            scores = scores.sigmoid()
        original_scores = scores
        if self.bias is not None:
            scores = scores + self.bias
        if self.n_groups > 1: # n_groups = n_expert_groups = 8
            scores = scores.view(x.size(0), self.n_groups, -1)
            if self.bias is None:
                group_scores = scores.amax(dim=-1)
            else:
                group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1)
            indices = group_scores.topk(self.topk_groups, dim=-1)[1] # topk_groups = n_limited_groups = 4
            mask = torch.zeros_like(scores[..., 0]).scatter_(1, indices, True)
            scores = (scores * mask.unsqueeze(-1)).flatten(1)
        indices = torch.topk(scores, self.topk, dim=-1)[1] # topk = n_activated_experts = 8
        weights = original_scores.gather(1, indices)
        if self.score_func == "sigmoid":
            weights /= weights.sum(dim=-1, keepdim=True)
        weights *= self.route_scale
        return weights.type_as(x), indices

每个token有1个共享专家和256个路由专家,对这256个路由专家选择出8个专家进行实际的计算。

常规的MOE是根据score选择最高的topk个专家,具有较大的随机性。

但是DeepSeek把这256个专家分成了连续的n_expert_groups=8个组,然后根据每个组的最高score选择出n_limited_groups=4个组,其他组的score值清零。使得最终选择的topk=8随机分布在4个组内。但并没有限制每个组一定有多少个专家选中。这一定程度上限制了topk出现的随机性。

Expert-Parallel Load Balancer

  • 核心问题:对于给定 MoE 模型,存在一些天然的高负载专家(expert),导致不同 GPU 的专家计算负载不均衡
  • 优化目标:每个 GPU 上的专家计算量均衡(即最小化所有 GPU 的 dispatch 接收量的最大值)。To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens.

Moreover, thanks to the group-limited expert routing used in DeepSeek-V3, we also attempt to place the experts of the same group to the same node to reduce inter-node data traffic, whenever possible.

Prefill阶段 Hierarchical Load Balancing

Prefill:路由专家 EP32、MLA 和共享专家 DP32,一个部署单元是 4 节点,32 个冗余路由专家,每张卡 9 个路由专家和 1 个共享专家

When the number of server nodes divides the number of expert groups, we use the hierarchical load balancing policy to harness the group-limited expert routing. We first pack the expert groups to nodes evenly, ensuring the loads of different nodes are balanced. Then, we replicate the experts within each node. Finally, we pack the replicated experts to individual GPUs to ensure different GPUs are load-balanced. The hierarchical load balancing policy can be used in prefilling stage with a smaller expert-parallel size.

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads , striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.

Decoding阶段 Global Load Balancing

Decode:路由专家 EP144、MLA 和共享专家 DP144,一个部署单元是 18 节点,32 个冗余路由专家,每张卡 2 个路由专家和 1 个共享专家

In other cases, we use the global load balancing policy that replicates the experts globally regardless of expert groups, and pack the replicated experts to individual GPUs . This policy can beadopted in decoding stage with a larger expert-parallel size.

During decoding, we treat the shared expert as a routed one . From this perspective, each token will select 9 experts during routing , where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320 . For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

相关推荐
镰刀韭菜18 小时前
【LLM】一文理解推理大模型
大语言模型·强化学习·知识蒸馏·指令微调·deepseek·推理模型·旅程式学习
linmoo198618 小时前
Langchain4j 系列之二十七 - Ollama集成Deepseek
人工智能·langchain·ollama·deepseek·langchain4j
AC赳赳老秦19 小时前
低代码开发中的高效调试:基于 DeepSeek 的报错日志解析与自动修复方案生成
前端·javascript·低代码·postgresql·数据库架构·easyui·deepseek
DS随心转小程序20 小时前
【技术前瞻】Edge 浏览器深度集成 DS随心转:AI 搜索与笔记流转的一站式生产力革命
人工智能·笔记·edge·deepseek·ds随心转
DO_Community2 天前
技术解码:Character.ai 如何实现大模型实时推理性能 2 倍提升
人工智能·算法·llm·aigc·moe·aiter
大数据002 天前
基于Ollama大模型学习
python·flask·大模型·alibaba·ollama·springai·deepseek
一个处女座的程序猿2 天前
LLMs之MoE之Thinking:LongCat-Flash-Thinking-2601的简介、安装和使用方法、案例应用之详细攻略
llm·moe·thinking
亿丢丢4 天前
DeepSeek本地部署:Ollama+Open WebUI
人工智能·windows·deepseek
l1t5 天前
利用豆包辅助编写数独隐式唯一数填充c程序
c语言·开发语言·人工智能·算法·豆包·deepseek
avi91115 天前
[Unity] 仙剑源码-仙剑奇侠传移动版分析 - 开篇;[Lua] [A1相关],DeepSeek学习脚手架源码
chatgpt·aigc·lua·deepseek·仙剑移动版·仙剑