Nemotron Architecture (Mamba3 + Transformer + MoE)

Nemotron-Mamba3 Architecture Documentation

╔══════════════════════════════════════════════════════════╗
║         Nemotron-Mamba3 Hybrid Model Demo Program        ║
║         (Mamba3 SSM + Transformer + MoE)                 ║
╚══════════════════════════════════════════════════════════╝

═══════════════════════════════════════════════════════════
[Model Configuration]
═══════════════════════════════════════════════════════════
  Vocab size:          32,000
  Hidden size:         512
  Mamba3 layers:       2
  Transformer layers:  4
  Attention heads:     8 (Q) / 2 (KV)
  State size:          64
  FFN intermediate:    2048
  MoE experts:         8 (2 active + 1 shared)
  MoE enabled:         yes

═══════════════════════════════════════════════════════════
[Model Initialization]
═══════════════════════════════════════════════════════════
  Creating model components...
  ✓ Token embedding initialized [vocab=32000, hidden=512]
  ✓ Mamba3 layer stack initialized [2 layers]
  ✓ Transformer layer stack initialized [4 layers]
  ✓ Final normalization layer initialized
  ✓ LM output head initialized (tied weights)

═══════════════════════════════════════════════════════════
[Tokenizer Initialization]
═══════════════════════════════════════════════════════════
  ✓ Tokenizer initialized, vocab size: 256

═══════════════════════════════════════════════════════════
[Forward Pass Test]
═══════════════════════════════════════════════════════════
  Input shape: [batch=1, seq_len=5]
  Input token IDs: [1, 2, 3, 4, 5]

  Starting forward pass...
  ───────────────────────────────────────────────────────────
  ├─ [Layer 0] Token embedding
  │   Input: [batch=1, seq_len=5] token IDs
  │   Output: [batch=1, seq_len=5, hidden=512]
  ├─ [Layer 1] Mamba3 layer #1
  │   Input shape: [1, 5, 512]
  │   ├─ RMSNorm
  │   │   Output: [1, 5, 512]
  │   ├─ Selective SSM (state space model)
  │   │   State size: 64, expand factor: 2
  │   │   Output: [1, 5, 512]
  │   └─ MLP path: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = input + SSM + MLP
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 2] Transformer layer #1
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [1, 5, 512] (8 heads x 64 dim)
  │   │   K/V: [1, 5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 refinement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (per head)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 3] Mamba3 layer #2
  │   Input shape: [1, 5, 512]
  │   ├─ RMSNorm
  │   │   Output: [1, 5, 512]
  │   ├─ Selective SSM (state space model)
  │   │   State size: 64, expand factor: 2
  │   │   Output: [1, 5, 512]
  │   └─ MLP path: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = input + SSM + MLP
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 4] Transformer layer #2
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [1, 5, 512] (8 heads x 64 dim)
  │   │   K/V: [1, 5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 refinement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (per head)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 5] Transformer layer #3
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [1, 5, 512] (8 heads x 64 dim)
  │   │   K/V: [1, 5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 refinement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (per head)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 6] Transformer layer #4
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [1, 5, 512] (8 heads x 64 dim)
  │   │   K/V: [1, 5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 refinement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (per head)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  ├─ [Layer 7] Final RMSNorm
  │     Input shape: [1, 5, 512]
  │     Output shape: [1, 5, 512]
  │
  └─ [Layer 8] LM output head (tied weights)
        Input shape: [1, 5, 512]
        Output shape: [1, 5, 32000]
  ───────────────────────────────────────────────────────────
  ✓ Forward pass completed successfully!
    Elapsed: 1846 ms
    Output shape: [batch=1, seq_len=5, vocab=32000]

  [Output Logits Preview (last position)]
    Top-5 token probabilities: [token 0 (NaN), token 1 (NaN), token 2 (NaN), token 3 (NaN), token 4 (NaN)]

═══════════════════════════════════════════════════════════
  Demo complete!
═══════════════════════════════════════════════════════════

1. Project Overview

Nemotron-Mamba3 is a hybrid architecture large language model combining Mamba3 State Space Model (SSM) with Transformer architecture and Mixture of Experts (MoE).

Core Technical Features

| Feature | Description |
|---|---|
| Mamba3 SSM | Selective state space model with O(L) linear sequence complexity |
| Transformer | Grouped Query Attention (GQA) for efficient inference |
| Hybrid Architecture | Alternating Mamba3 and Transformer layers |
| MoE | Mixture of Experts: 128 experts, Top-6 active per token |
| RoPE | Rotary Position Embedding for long context |
| SwiGLU | Gated activation function |
| SquaredReLU | Activation for MoE experts |

2. System Architecture Overview

Token IDs [batch, seq_len]
  -> TokenEmbedding (vocab -> hidden)
  -> Mamba3Layer x N
  -> TransformerLayer x N
  -> RMSNorm
  -> LM Head (hidden -> vocab)
  -> Logits [batch, seq_len, vocab]


3. Complete Data Flow

Token IDs [B, L]
  -> TokenEmbedding.Forward() -> Hidden States [B, L, H]
  -> for layer in the (M+T) x N stack:
       Mamba3 layer:      RMSNorm -> Conv1d (local context) -> Selective SSM (parallel scan) -> MoE / SwiGLU MLP -> Residual
       Transformer layer: RMSNorm -> Q/K/V projection -> QK normalization -> RoPE -> GQA attention -> MoE / SwiGLU FFN -> Residual
  -> Final RMSNorm
  -> LM Head (tied with embedding)
  -> Logits [B, L, V]

4. Mamba3 Core Architecture

Mamba3 Selective SSM Core

Input x [B, L, D]
  -> Input projection (d_model -> d_inner)
  -> Conv1d (kernel=4, local context)
  -> Selective mechanism:
       X_proj: x -> A, B, C, Delta
       DT_proj: x -> Delta
  -> Discretize: A_bar = exp(Delta * A), B_bar = Delta * B
  -> Parallel prefix scan, O(log L)
  -> Output projection (d_inner -> d_model)
  -> Skip connection
  -> Output [B, L, D]

Mamba3 Core Equations

Delta_t = tau(LinearDelta(x_t))          # Time step parameter (input-dependent)
A_t = tau(A + LinearA(x_t))              # Extended state matrix (input-dependent)
B_t = LinearB(x_t)                       # Input projection
C_t = LinearC(x_t)                       # Output projection
h_t = A_bar_t * h_{t-1} + B_bar_t * x_t  # State update
y_t = C_t * h_t                          # Output
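The recurrence above can be sanity-checked with a minimal NumPy sketch. Everything here is illustrative rather than the project's C# code: `W_delta`, `W_B`, `W_C` stand in for `LinearDelta`/`LinearB`/`LinearC`, `tau` is taken to be softplus, and `A` is kept input-independent for brevity (the `LinearA` term is omitted).

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential reference implementation of the selective SSM recurrence.

    Shapes (toy, single sequence): x: [L, D]; A: [D, N] with N = state size;
    W_delta: [D, D]; W_B, W_C: [D, N].
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                           # one N-dim state per channel
    ys = []
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # tau = softplus, [D]
        B_t = x[t] @ W_B                           # input projection, [N]
        C_t = x[t] @ W_C                           # output projection, [N]
        A_bar = np.exp(delta[:, None] * A)         # discretized decay, [D, N]
        B_bar = delta[:, None] * B_t[None, :]      # discretized input map, [D, N]
        h = A_bar * h + B_bar * x[t][:, None]      # state update
        ys.append(h @ C_t)                         # y_t = C_t · h_t, [D]
    return np.stack(ys)                            # [L, D]
```

Because `delta`, `B_t`, and `C_t` are all functions of `x_t`, the state update is input-dependent — this is the "selective" part; keeping `A` negative makes `A_bar` a decay factor in (0, 1).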

5. Transformer Architecture (GQA + MoE)

Transformer Layer with GQA + MoE

Input x [B, L, H]
  -> RMSNorm
  -> Q/K/V projection:
       W_Q: H -> num_heads * d
       W_K: H -> num_kv * d
       W_V: H -> num_kv * d
  -> Reshape and transpose: [B, L, H] -> [B, heads, L, dim]
  -> QK normalization
  -> RoPE position encoding (RotaryEmbedding with cos/sin cache)
  -> Repeat KV: num_kv -> num_heads
  -> Q @ K^T / sqrt(d) -> Softmax
  -> Output projection W_O
  -> Residual 1
  -> RMSNorm
  -> MoE / SwiGLU FFN:
       Expert router (sigmoid + top-k)
       128 experts (SquaredReLU)
       2 shared experts (GELU)
       MoE output + shared output
  -> Residual 2
  -> Output [B, L, H]
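The GQA path above (project, reshape, repeat KV, scaled dot-product) can be sketched as follows. This is a single-sequence NumPy toy, not the project's C# implementation; RoPE and QK normalization are left out to isolate the grouping logic.

```python
import numpy as np

def gqa_attention(x, Wq, Wk, Wv, Wo, n_heads=8, n_kv=2):
    """Causal grouped-query attention: n_kv KV heads serve n_heads query heads.

    x: [L, H]; Wq: [H, n_heads*d]; Wk, Wv: [H, n_kv*d]; Wo: [n_heads*d, H].
    """
    L, H = x.shape
    d = Wq.shape[1] // n_heads
    q = (x @ Wq).reshape(L, n_heads, d).transpose(1, 0, 2)   # [heads, L, d]
    k = (x @ Wk).reshape(L, n_kv, d).transpose(1, 0, 2)      # [kv, L, d]
    v = (x @ Wv).reshape(L, n_kv, d).transpose(1, 0, 2)
    rep = n_heads // n_kv                                    # 4x for the Mini config
    k = np.repeat(k, rep, axis=0)                            # expand KV to all heads
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)           # [heads, L, L]
    scores += np.triu(np.full((L, L), -1e9), k=1)            # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                       # softmax
    out = (w @ v).transpose(1, 0, 2).reshape(L, n_heads * d)
    return out @ Wo                                          # [L, H]
```

Only the K/V projections shrink (to `n_kv * d` columns); queries stay full-width, which is why GQA trades KV-cache size for almost no loss in attention expressiveness.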


6. Mixture of Experts (MoE) Architecture

MoE Layer

Input x [B, L, H]
  -> Expert Router: MLP (H -> E) -> Sigmoid gating -> Top-K selection
  -> Gated experts (128): Expert 1 .. Expert N, each output scaled by its gate (gate_1 *, ..., gate_N *)
  -> Shared experts (2): Shared Exp 1, Shared Exp 2 (always active)
  -> MoE output + shared output
  -> Output [B, L, H]

MoE Configuration

| Parameter | 4B Model | 8B Model | Mini (Test) |
|---|---|---|---|
| Total Experts | 128 | 128 | 8 |
| Active Experts | 6 | 6 | 2 |
| Shared Experts | 2 | 2 | 1 |
| Expert Activation | SquaredReLU | SquaredReLU | SquaredReLU |
| Shared Activation | GELU | GELU | GELU |
| Router | MLP + Sigmoid | MLP + Sigmoid | MLP + Sigmoid |
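A toy version of this routing scheme, assuming the router is a single linear map followed by sigmoid gating and top-k selection; all names and shapes here are illustrative, not the C# API.

```python
import numpy as np

def moe_forward(x, router_W, experts, shared, top_k=2):
    """Sigmoid-gated top-k MoE plus always-active shared experts.

    x: [T, H] tokens; router_W: [H, E];
    experts / shared: lists of (W_up [H, F], W_dn [F, H]) weight pairs.
    """
    T, H = x.shape
    gates = 1.0 / (1.0 + np.exp(-(x @ router_W)))        # sigmoid gating, [T, E]
    out = np.zeros_like(x)
    for t in range(T):
        for e in np.argsort(gates[t])[-top_k:]:          # top-k expert indices
            W_up, W_dn = experts[e]
            h = np.maximum(x[t] @ W_up, 0.0) ** 2        # SquaredReLU activation
            out[t] += gates[t, e] * (h @ W_dn)           # gate-weighted expert output
    for W_up, W_dn in shared:                            # shared experts: every token
        h = x @ W_up
        g = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU
        out += g @ W_dn
    return out
```

Note the sigmoid gates are used directly as weights (they need not sum to 1), unlike softmax routers; the shared experts run densely and are simply summed in.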

7. Configuration and Model Variants

NemotronConfig
  Fields: int VocabSize, int HiddenSize, int NumMambaLayers, int NumTransformerLayers,
          int NumAttentionHeads, int NumKVHeads, int IntermediateSize, int StateSize,
          int ConvKernelSize, int MaxSeqLen, float RopeBase, int ExpandFactor,
          int MIMORank, bool UseComplexSSM, bool UseExpTrapezoid,
          int NumExperts, int NumActiveExperts, int NumSharedExperts, bool UseMoE
  Factory methods: Nemotron4B(), Nemotron8B(), Mini(),
          Nemotron4BWithMoE(), Nemotron8BWithMoE()

NemotronMamba3Model
  Fields: NemotronConfig _config, TokenEmbedding _embedding,
          Mamba3Layer[] _mambaLayers, TransformerLayer[] _transformerLayers,
          Tensor _finalNormWeight, Tensor _lmHeadWeight
  Methods: Forward(inputIds) : Tensor, Generate(...) : List<int>

Model Configuration Parameters

| Parameter | 4B Model | 4B MoE | 8B MoE | Mini (Test) |
|---|---|---|---|---|
| Hidden Size | 3072 | 3072 | 4096 | 512 |
| Mamba Layers | 8 | 8 | 16 | 2 |
| Transformer Layers | 32 | 32 | 56 | 4 |
| Attention Heads | 24 | 24 | 32 | 8 |
| KV Heads | 8 | 8 | 8 | 2 |
| FFN Intermediate | 8192 | 8192 | 10976 | 2048 |
| State Size | 128 | 128 | 192 | 64 |
| Max Seq Len | 8192 | 8192 | 8192 | 1024 |
| Vocab Size | 128256 | 128256 | 128256 | 32000 |
| MoE Experts | N/A | 128 | 128 | 8 |
| Active Experts | N/A | 6 | 6 | 2 |
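For reference, the Mini (Test) column can be mirrored as a plain Python config object; field names loosely follow the `NemotronConfig` class sketch above and values come from the table and the demo banner.

```python
from dataclasses import dataclass

@dataclass
class MiniNemotronConfig:
    """Python mirror of the Mini (Test) variant of NemotronConfig."""
    vocab_size: int = 32000
    hidden_size: int = 512
    num_mamba_layers: int = 2
    num_transformer_layers: int = 4
    num_attention_heads: int = 8
    num_kv_heads: int = 2
    intermediate_size: int = 2048
    state_size: int = 64
    max_seq_len: int = 1024
    num_experts: int = 8
    num_active_experts: int = 2
    num_shared_experts: int = 1
    use_moe: bool = True
```

Head dimension is implied rather than stored: `hidden_size // num_attention_heads` = 512 / 8 = 64, matching the per-head dim in the forward-pass trace.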

8. Forward Pass Flow

Forward(token_ids):
  Token IDs [B, L]
    -> TokenEmbedding -> hidden states [B, L, H]
    -> loop, N times (Mamba and Transformer layers alternating):
         Mamba3 layer:      RMSNorm -> Conv1d -> SSM -> MoE/SwiGLU -> hidden states
         Transformer layer: RMSNorm -> QKV -> QKNorm -> RoPE -> GQA -> MoE/SwiGLU -> next layer
    -> Final RMSNorm
    -> LM Head (tied weights: reverse of the embedding)
    -> Logits [B, L, V]
  Optional: Generate() calls Forward() autoregressively.
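Stripped of layer internals, the pass reduces to a short skeleton. Here `layers` and `norm` are placeholder callables standing in for the C# layer classes, and the tied LM head is just a matmul with the transposed embedding matrix.

```python
import numpy as np

def forward(token_ids, embed_W, layers, norm):
    """Top-level forward pass: embed -> layer stack -> norm -> tied LM head.

    token_ids: int array [B, L]; embed_W: [V, H] embedding matrix;
    layers: interleaved Mamba3/Transformer blocks, each [B, L, H] -> [B, L, H].
    """
    h = embed_W[token_ids]        # embedding lookup: [B, L] -> [B, L, H]
    for layer in layers:          # Mamba3 or Transformer block
        h = layer(h)
    h = norm(h)                   # final RMSNorm
    return h @ embed_W.T          # weight-tied LM head: [B, L, V]
```

Weight tying means no separate output matrix is stored: the same `[V, H]` table maps tokens in and hidden states out.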


9. Inference Generation Flow

Input prompt token_ids
  -> model.Forward(input)
  -> take the last position's logits
  -> temperature scaling
  -> Top-K filtering
  -> Top-P (nucleus) filtering
  -> softmax normalization
  -> sampling
  -> append token
  -> EOS token? No: loop back to Forward; Yes: return the generated token sequence


10. Component File Mapping

| Component | File Path | Description |
|---|---|---|
| Core Model | Models/NemotronMamba3.cs | Config class, main model, forward pass |
| Mamba3 Layer | Layers/Mamba3Layer.cs | Complete Mamba3 layer with MoE |
| Mamba3 Core | Layers/Mamba3Core.cs | Selective SSM, parallel prefix scan |
| Transformer | Layers/TransformerLayer.cs | GQA attention, RoPE, optional MoE |
| MoE Layer | Layers/MoE/MoELayer.cs | Complete MoE implementation |
| Expert Router | Layers/MoE/ExpertRouter.cs | Router with sigmoid + top-k |
| Expert | Layers/MoE/Expert.cs | Individual expert with SquaredReLU |
| Shared Experts | Layers/MoE/SharedExperts.cs | Always-active shared experts |
| Embedding | Layers/Embedding.cs | TokenEmbedding, RotaryEmbedding |
| Normalization | Layers/LayerNorm.cs | RMSNorm, LayerNorm, QKNorm |
| Activation | Layers/Activation.cs | SwiGLU, GELU, SiLU, SquaredReLU |
| Tensor Core | Core/Tensor.cs | Tensor operations library |
| Tokenizer | Inference/Tokenizer.cs | Text encode/decode interface |

11. Technical Highlights

1. Mamba3 Selective Mechanism

  • Data-dependent: Delta, A, B, C generated from input
  • Selective scanning: Decides what information to retain/ignore

2. Parallel Prefix Scan

  • Complexity: O(log L) parallel depth vs O(L) for a sequential scan
  • Algorithm: Blelloch parallel scan algorithm
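The key fact is that the recurrence h_t = a_t * h_{t-1} + b_t is associative under the combine (a, b) ∘ (a', b') = (a*a', a'*b + b'), so it admits a log-depth scan. The sketch below uses the simpler Hillis-Steele scheme to keep the code short; Blelloch's work-efficient variant uses the same operator.

```python
import numpy as np

def linear_scan_parallel(a, b):
    """Log-depth inclusive scan for h_t = a_t * h_{t-1} + b_t (h_{-1} = 0).

    Each step combines every element with the partial result `shift`
    positions back, doubling the covered span, so only O(log L) steps run.
    """
    a, b = a.astype(float).copy(), b.astype(float).copy()
    L, shift = len(a), 1
    while shift < L:
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])    # identity pad
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a_prev * a, a * b_prev + b                        # (prev) ∘ (cur)
        shift *= 2
    return b    # b now holds h_0 .. h_{L-1}
```

Each "step" here is a vectorized NumPy pass, standing in for one parallel round on a GPU; a sequential for-loop over the same recurrence gives identical results, which the test below checks.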

3. GQA Efficiency Optimization

  • KV head sharing: 8 KV heads serve 24/32 Q heads
  • Memory savings: Significantly reduced KV cache
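The saving is easy to quantify: the KV cache scales linearly with the number of KV heads, so cutting 8 heads down to 2 (the Mini config) shrinks it 4x. A back-of-envelope helper, assuming fp16 values; this formula is generic, not taken from the C# code:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    """Approximate KV-cache size: 2 tensors (K and V) per attention layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val
```

For the Mini config (4 Transformer layers, head dim 64, 1024-token context), GQA with 2 KV heads needs a quarter of the cache that 8 full KV heads would.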

4. RoPE Position Encoding

  • Relative position: Encodes relative position rather than absolute
  • Extrapolation: Supports longer sequences
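A compact sketch of the rotation itself: consecutive dimension pairs are rotated by position-dependent angles, which is why Q·K dot products end up depending only on relative offsets. The pairing convention (even/odd interleave vs split halves) varies between implementations; the interleaved form is assumed here.

```python
import numpy as np

def apply_rope(q, base=10000.0):
    """Rotate each (even, odd) dimension pair of q by a position-scaled angle.

    q: [L, d] with d even; returns the rotated array, same shape.
    """
    L, d = q.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # per-pair frequency, [d/2]
    ang = np.outer(np.arange(L), inv_freq)         # angle = position * freq
    cos, sin = np.cos(ang), np.sin(ang)
    q1, q2 = q[:, 0::2], q[:, 1::2]                # the 2-D rotation pairs
    out = np.empty_like(q)
    out[:, 0::2] = q1 * cos - q2 * sin
    out[:, 1::2] = q1 * sin + q2 * cos
    return out
```

Because each pair undergoes a pure rotation, vector norms are preserved and position 0 is left unchanged — two easy invariants to verify.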

5. Mixture of Experts (MoE)

  • Sparse activation: Only Top-6 of 128 experts active per token
  • Shared experts: 2 experts always active for every token
  • SquaredReLU: Activation function for expert outputs
  • Routing: MLP router with sigmoid gating

6. QK Normalization

  • Training stability: Normalize Q and K per head
  • Mamba3 improvement: Enhanced attention mechanism
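Per-head QK normalization is RMSNorm applied along the head dimension before the dot product; a minimal sketch with illustrative shapes (the learned scale `gamma` is an assumption — some variants omit it):

```python
import numpy as np

def qk_rmsnorm(q, gamma, eps=1e-6):
    """RMS-normalize each head vector of Q (and likewise K) before attention.

    q: [..., d] (e.g. [heads, L, d]); gamma: [d] learned per-dim scale.
    """
    rms = np.sqrt(np.mean(q * q, axis=-1, keepdims=True) + eps)
    return q / rms * gamma
```

Normalizing both Q and K bounds the attention logits' scale regardless of how large the projections grow, which is the stability benefit cited above.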

12. MoE Forward Pass Details

Input x reshaped to [B*L, H]
  -> Router: MLP (H -> E) -> Sigmoid -> Top-K
  -> For each selected expert k: up projection -> SquaredReLU -> down projection -> weight_k * output
  -> For each shared expert j: up projection -> GELU -> down projection
  -> Sum(expert_outputs) + shared_output
  -> Reshape back to [B, L, H]

Document generated: 2026-03-24
Project: NemotronMamba3 (C# Implementation with MoE)
