DeepSeek Group-Limited Expert Routing and Load Balancing

References

  • https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py
  • https://github.com/deepseek-ai/EPLB (Expert Parallelism Load Balancer)
  • DeepSeek-V3 Technical Report

DeepSeek's Routing Method

```python
# Excerpt from DeepSeek-V3 inference/model.py; the imports, ModelArgs, and the
# custom linear() helper are defined elsewhere in that file.
class Gate(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.n_groups = args.n_expert_groups
        self.topk_groups = args.n_limited_groups
        self.score_func = args.score_func
        self.route_scale = args.route_scale
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        self.bias = nn.Parameter(torch.empty(args.n_routed_experts)) if self.dim == 7168 else None

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        scores = linear(x, self.weight)
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1, dtype=torch.float32)
        else:
            scores = scores.sigmoid()
        original_scores = scores
        if self.bias is not None:
            scores = scores + self.bias
        if self.n_groups > 1: # n_groups = n_expert_groups = 8
            scores = scores.view(x.size(0), self.n_groups, -1)
            if self.bias is None:
                group_scores = scores.amax(dim=-1)
            else:
                group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1)
            indices = group_scores.topk(self.topk_groups, dim=-1)[1] # topk_groups = n_limited_groups = 4
            mask = torch.zeros_like(scores[..., 0]).scatter_(1, indices, True)
            scores = (scores * mask.unsqueeze(-1)).flatten(1)
        indices = torch.topk(scores, self.topk, dim=-1)[1] # topk = n_activated_experts = 8
        weights = original_scores.gather(1, indices)
        if self.score_func == "sigmoid":
            weights /= weights.sum(dim=-1, keepdim=True)
        weights *= self.route_scale
        return weights.type_as(x), indices
```

Each token uses 1 shared expert and 256 routed experts; from these 256 routed experts, 8 are selected for the actual computation.

A conventional MoE simply picks the top-k experts with the highest scores, so the selected experts can scatter widely across the whole expert set.

DeepSeek instead splits the 256 routed experts into n_expert_groups=8 contiguous groups, selects n_limited_groups=4 groups according to each group's best score (the per-group maximum, or the sum of the top-2 biased scores when the bias term is present), and zeroes out the scores of the remaining groups. The final topk=8 experts are therefore drawn only from those 4 groups, although there is no constraint on how many experts each of the 4 groups contributes. This bounds the dispersion of the top-k selection to some extent.
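
To make the masking step concrete, here is a minimal, self-contained sketch of group-limited top-k selection using a toy configuration (16 experts, 4 groups, keep 2 groups, select top-4) rather than V3's 256/8/4/8 setup:

```python
# Toy demonstration of group-limited top-k routing (toy sizes, not V3's config).
import torch

torch.manual_seed(0)
n_experts, n_groups, topk_groups, topk = 16, 4, 2, 4
scores = torch.rand(1, n_experts)                   # [tokens, experts], stand-in for sigmoid scores

grouped = scores.view(1, n_groups, -1)              # [tokens, groups, experts_per_group]
group_scores = grouped.amax(dim=-1)                 # best score within each group
keep = group_scores.topk(topk_groups, dim=-1)[1]    # indices of the selected groups
mask = torch.zeros_like(group_scores).scatter_(1, keep, 1.0)
masked = (grouped * mask.unsqueeze(-1)).flatten(1)  # zero out non-selected groups

experts = masked.topk(topk, dim=-1)[1]
print("kept groups:     ", keep[0].tolist())
print("selected experts:", sorted(experts[0].tolist()))
print("their groups:    ", sorted({e // (n_experts // n_groups) for e in experts[0].tolist()}))
```

Every selected expert necessarily lands in one of the surviving groups, but how many come from each group is left to the scores.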

Expert-Parallel Load Balancer

  • Core problem: for a given MoE model, some experts are naturally heavy-load, so the expert computation load differs across GPUs.
  • Optimization goal: balance the expert computation on every GPU, i.e., minimize the maximum number of dispatched tokens any single GPU receives. To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens.

Moreover, thanks to the group-limited expert routing used in DeepSeek-V3, we also attempt to place the experts of the same group to the same node to reduce inter-node data traffic, whenever possible.

Prefill Stage: Hierarchical Load Balancing

Prefill: routed experts use EP32, while MLA and the shared expert use DP32. One deployment unit is 4 nodes with 32 redundant routed experts; each GPU hosts 9 routed experts and 1 shared expert.
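
A quick arithmetic check of these prefill numbers (assuming 8 GPUs per node, as in DeepSeek's H800 nodes):

```python
# Prefill deployment arithmetic: 4 nodes x 8 GPUs = 32 GPUs;
# (256 routed experts + 32 redundant copies) / 32 GPUs = 9 routed experts per GPU.
gpus = 4 * 8
assert (256 + 32) % gpus == 0 and (256 + 32) // gpus == 9
```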

When the number of server nodes divides the number of expert groups, we use the hierarchical load balancing policy to harness the group-limited expert routing. We first pack the expert groups to nodes evenly, ensuring the loads of different nodes are balanced. Then, we replicate the experts within each node. Finally, we pack the replicated experts to individual GPUs to ensure different GPUs are load-balanced. The hierarchical load balancing policy can be used in prefilling stage with a smaller expert-parallel size.
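
The open-sourced EPLB repo implements this hierarchical policy (and the global policy used for decoding below). A usage sketch along the lines of its README follows; treat the exact signature, argument values, and return values as something to verify against the current repo:

```python
# Sketch following the EPLB README: balance 12 logical experts (in 4 groups)
# across 2 nodes / 8 GPUs with 16 physical slots, using measured expert loads.
import torch
import eplb  # https://github.com/deepseek-ai/EPLB

# per-layer, per-expert load statistics (e.g. token counts), shape [layers, experts]
weight = torch.tensor([[ 90, 132,  40,  61, 104, 165,  39,   4,  73,  56, 183,  86],
                       [ 20, 107, 104,  64,  19, 197, 187, 157, 172,  86,  16,  27]])

num_replicas = 16   # total physical expert slots (12 logical + 4 redundant copies)
num_groups   = 4    # expert groups (group-limited routing)
num_nodes    = 2    # server nodes
num_gpus     = 8    # GPUs in the deployment unit

# phy2log: which logical expert each physical slot holds
# log2phy: physical slot(s) backing each logical expert
# logcnt:  number of replicas per logical expert
phy2log, log2phy, logcnt = eplb.rebalance_experts(
    weight, num_replicas, num_groups, num_nodes, num_gpus)
print(phy2log)
```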

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.
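
As a rough illustration of the redundant-expert idea only (a naive greedy placement, not the algorithm EPLB actually uses), one can replicate the heaviest experts and then pack expert slots onto GPUs, always giving the next-heaviest slot to the currently lightest GPU:

```python
# Naive greedy illustration of redundant-expert placement (hypothetical sketch).
import heapq

def place_experts(load, num_redundant, num_gpus):
    """load: per-expert token counts; returns {gpu_id: [expert_id, ...]}."""
    n = len(load)
    # replicate the heaviest experts; assume each replica takes half the load
    redundant = set(sorted(range(n), key=lambda e: load[e], reverse=True)[:num_redundant])
    slots = [(load[e] / 2 if e in redundant else float(load[e]), e) for e in range(n)]
    slots += [(load[e] / 2, e) for e in redundant]
    # greedy packing: heaviest remaining slot goes to the currently lightest GPU
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for slot_load, expert in sorted(slots, reverse=True):
        gpu_load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (gpu_load + slot_load, gpu))
    return placement

# e.g. 256 experts + 32 redundant copies over 32 GPUs -> 9 slots per GPU on average
print(place_experts(load=[10, 50, 30, 70, 20, 40, 60, 80], num_redundant=2, num_gpus=4))
```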

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.
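
A purely schematic sketch of this kind of overlap (illustrative placeholders, not DeepSeek's kernels): run one micro-batch's attention+MoE compute on one CUDA stream while the other micro-batch's all-to-all runs on a second stream, then swap roles:

```python
# Schematic two-micro-batch overlap (illustrative only; requires a CUDA device).
# attn_moe_fn and alltoall_fn stand in for the real attention+MoE compute and
# the dispatch/combine communication; both take and return a tensor.
import torch

compute_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

def overlapped_step(mb_compute, mb_comm, attn_moe_fn, alltoall_fn):
    with torch.cuda.stream(comm_stream):
        mb_comm = alltoall_fn(mb_comm)        # communication for one micro-batch
    with torch.cuda.stream(compute_stream):
        mb_compute = attn_moe_fn(mb_compute)  # compute for the other micro-batch
    torch.cuda.synchronize()                  # join both streams
    return mb_comm, mb_compute                # swap roles for the next stage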

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.

Decoding Stage: Global Load Balancing

Decode: routed experts use EP144, while MLA and the shared expert use DP144. One deployment unit is 18 nodes with 32 redundant routed experts; each GPU hosts 2 routed experts and 1 shared expert. (These figures come from DeepSeek's later open-sourced inference-system description; the technical-report excerpt below describes a 40-node, EP320 decoding configuration.)
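
The same arithmetic check for the decoding figures (again assuming 8 GPUs per node):

```python
# Decode deployment arithmetic: 18 nodes x 8 GPUs = 144 GPUs;
# (256 routed experts + 32 redundant copies) / 144 GPUs = 2 routed experts per GPU.
gpus = 18 * 8
assert (256 + 32) % gpus == 0 and (256 + 32) // gpus == 2
```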

In other cases, we use the global load balancing policy that replicates the experts globally regardless of expert groups, and pack the replicated experts to individual GPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.
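
For illustration only (a hypothetical helper, not code from the DeepSeek repo), treating the shared expert as a ninth always-selected routed expert amounts to appending one more (weight, index) pair to what Gate.forward returns:

```python
# Hypothetical illustration: every token always selects the shared expert,
# turning the Gate's 8 (weight, index) pairs per token into 9.
import torch

def add_shared_expert(weights, indices, shared_expert_id, shared_weight=1.0):
    """weights, indices: [num_tokens, 8] as returned by Gate.forward."""
    n_tokens = weights.size(0)
    shared_idx = torch.full((n_tokens, 1), shared_expert_id, dtype=indices.dtype)
    shared_w = torch.full((n_tokens, 1), shared_weight, dtype=weights.dtype)
    return (torch.cat([weights, shared_w], dim=-1),
            torch.cat([indices, shared_idx], dim=-1))
```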

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.
