Nemotron-Mamba3 Architecture Documentation
╔══════════════════════════════════════════════════════════╗
║        Nemotron-Mamba3 Hybrid Model Demo                 ║
║        (Mamba3 SSM + Transformer + MoE)                  ║
╚══════════════════════════════════════════════════════════╝
═══════════════════════════════════════════════════════════
[Model Configuration]
═══════════════════════════════════════════════════════════
Vocabulary size: 32,000
Hidden dimension: 512
Mamba3 layers: 2
Transformer layers: 4
Attention heads: 8 (Q) / 2 (KV)
State size: 64
FFN intermediate size: 2048
MoE experts: 8 (2 active + 1 shared)
MoE enabled: yes
═══════════════════════════════════════════════════════════
[Model Initialization]
═══════════════════════════════════════════════════════════
Creating model components...
✓ Token embedding layer initialized [vocab=32000, hidden=512]
✓ Mamba3 layer stack initialized [2 layers]
✓ Transformer layer stack initialized [4 layers]
✓ Final normalization layer initialized
✓ LM output head initialized (tied weights)
═══════════════════════════════════════════════════════════
[Tokenizer Initialization]
═══════════════════════════════════════════════════════════
✓ Tokenizer initialized, vocabulary size: 256
═══════════════════════════════════════════════════════════
[Forward Pass Test]
═══════════════════════════════════════════════════════════
Input shape: [batch=1, seq_len=5]
Input token IDs: [1, 2, 3, 4, 5]
Starting forward pass...
───────────────────────────────────────────────────────────
├─ [Layer 0] Token embedding
│   Input: [batch=1, seq_len=5] token IDs
│   Output: [batch=1, seq_len=5, hidden=512]
├─ [Layer 1] Mamba3 layer #1
│   Input shape: [1, 5, 512]
│   ├─ RMSNorm
│   │   Output: [1, 5, 512]
│   ├─ Selective SSM (state space model)
│   │   State size: 64, expand factor: 2
│   │   Output: [1, 5, 512]
│   └─ MLP path: MoE (mixture of experts)
│       Experts: 8, active: 2, shared: 1
│       Output: [1, 5, 512]
│   └─ Residual merge: output = input + SSM + MLP
│   Output shape: [1, 5, 512]
│
├─ [Layer 2] Transformer layer #1
│   Input shape: [1, 5, 512]
│   ├─ Attention norm (RMSNorm)
│   │   Output: [1, 5, 512]
│   ├─ QKV projection
│   │   Q: [1, 5, 512] (8 heads x 64 dim)
│   │   K/V: [1, 5, 128] (2 heads x 64 dim)
│   ├─ QK normalization (Mamba3 refinement)
│   │   Normalization done
│   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
│   │   Rotation done
│   ├─ Grouped query attention (GQA)
│   │   Q heads: 8, KV heads: 2, repeat: 4x
│   │   Output: [1, 8, 5, 64]
│   └─ Attention output projection
│       Output: [1, 5, 512]
│   └─ Residual merge: h = x + attention
│
│   ├─ FFN norm (RMSNorm)
│   │   Output: [1, 5, 512]
│   └─ FFN/MLP: MoE (mixture of experts)
│       Experts: 8, active: 2, shared: 1
│       Output: [1, 5, 512]
│   └─ Residual merge: output = h + FFN
│   Output shape: [1, 5, 512]
│
├─ [Layer 3] Mamba3 layer #2
│   Input shape: [1, 5, 512]
│   ├─ RMSNorm
│   │   Output: [1, 5, 512]
│   ├─ Selective SSM (state space model)
│   │   State size: 64, expand factor: 2
│   │   Output: [1, 5, 512]
│   └─ MLP path: MoE (mixture of experts)
│       Experts: 8, active: 2, shared: 1
│       Output: [1, 5, 512]
│   └─ Residual merge: output = input + SSM + MLP
│   Output shape: [1, 5, 512]
│
├─ [Layer 4] Transformer layer #2
│   Input shape: [1, 5, 512]
│   (same per-step trace as Transformer layer #1)
│   Output shape: [1, 5, 512]
│
├─ [Layer 5] Transformer layer #3
│   Input shape: [1, 5, 512]
│   (same per-step trace as Transformer layer #1)
│   Output shape: [1, 5, 512]
│
├─ [Layer 6] Transformer layer #4
│   Input shape: [1, 5, 512]
│   (same per-step trace as Transformer layer #1)
│   Output shape: [1, 5, 512]
├─ [Layer 7] Final RMSNorm
│   Input shape: [1, 5, 512]
│   Output shape: [1, 5, 512]
└─ [Layer 8] LM output head (tied weights)
    Input shape: [1, 5, 512]
    Output shape: [1, 5, 32000]
───────────────────────────────────────────────────────────
✓ Forward pass completed successfully!
Elapsed: 1846ms
Output shape: [batch=1, seq_len=5, vocab=32000]
[Output Logits Preview (last position)]
Top-5 token probabilities: [token 0(NaN), token 1(NaN), token 2(NaN), token 3(NaN), token 4(NaN)]
═══════════════════════════════════════════════════════════
Demo complete!
═══════════════════════════════════════════════════════════
1. Project Overview
Nemotron-Mamba3 is a hybrid large language model that combines the Mamba3 State Space Model (SSM) with Transformer layers and a Mixture of Experts (MoE).
Core Technical Features
| Feature | Description |
|---|---|
| Mamba3 SSM | Selective State Space Model, O(L) linear sequence complexity |
| Transformer | Grouped Query Attention (GQA) for efficient inference |
| Hybrid Architecture | Alternating Mamba3 and Transformer layers |
| MoE | Mixture of Experts - 128 experts, Top-6 active per token |
| RoPE | Rotary Position Embedding for long context |
| SwiGLU | Gated activation function |
| SquaredReLU | Activation for MoE experts |
2. System Architecture Overview
Token IDs [batch, seq_len]
  → TokenEmbedding (vocab → hidden)
  → Mamba3Layer × N
  → TransformerLayer × N
  → RMSNorm
  → LM Head (hidden → vocab)
  → Logits [batch, seq_len, vocab]
3. Complete Data Flow
Token IDs [B, L]
  → TokenEmbedding.Forward() → Hidden states [B, L, H]
  → for each layer in the (Mamba3 + Transformer) × N stack:
      Mamba3 layer:      RMSNorm → Conv1d (local context) → Selective SSM (parallel scan) → MoE / SwiGLU MLP → + residual
      Transformer layer: RMSNorm → Q/K/V projection → QK normalization → RoPE → GQA attention → MoE / SwiGLU FFN → + residual
  → Final RMSNorm
  → LM Head (tied with embedding)
  → Logits [B, L, V]
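The end-to-end flow above can be sketched as a minimal NumPy forward pass. This is illustrative only; the project itself is C#, and the stand-in residual blocks below are hypothetical placeholders for the real Mamba3/Transformer layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_norm(x, eps=1e-6):
    # RMSNorm: divide by the root-mean-square over the hidden dimension
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def forward(token_ids, embedding, layers):
    h = embedding[token_ids]          # [B, L] -> [B, L, H]
    for layer in layers:              # alternating Mamba3 / Transformer blocks
        h = h + layer(h)              # each block contributes a residual branch
    h = rms_norm(h)                   # final RMSNorm
    return h @ embedding.T            # tied LM head: [B, L, H] -> [B, L, V]

V, H = 100, 16
emb = rng.standard_normal((V, H)) * 0.02
blocks = [lambda x: 0.1 * x] * 3      # stand-ins for the real layers
logits = forward(np.array([[1, 2, 3]]), emb, blocks)
print(logits.shape)  # (1, 3, 100)
```

Note how weight tying reuses the embedding matrix for the output projection, exactly as the demo's "LM output head (tied weights)" step indicates.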
4. Mamba3 Core Architecture
Mamba3 Selective SSM core (data flow):

Input x [B, L, D]
  → Conv1d (kernel=4, local convolution)
  → Selective mechanism: X_proj (x → A, B, C, Δ), DT_proj (x → Δ)
  → SSM computation:
      Discretize: A_bar = exp(Δ·A), B_bar = Δ·B
      Parallel prefix scan, O(log L) depth
  → Output projection (d_inner → d_model)
  → + Skip connection
  → Output [B, L, D]
Mamba3 Core Equations
Delta_t = tau(LinearDelta(x_t)) # Time step parameter (input-dependent)
A_t = tau(A + LinearA(x_t)) # Extended state matrix (input-dependent)
B_t = LinearB(x_t) # Input projection
C_t = LinearC(x_t) # Output projection
h_t = A_bar_t * h_{t-1} + B_bar_t * x_t # State update
y_t = C_t * h_t # Output
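A sequential NumPy sketch of these equations follows. The actual layer uses a parallel prefix scan; the projection weights here are made-up placeholders, and Delta is kept per-channel for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, L = 8, 4, 6                        # channels, state size, sequence length
w_delta = rng.standard_normal(D) * 0.1   # hypothetical projection weights
W_B = rng.standard_normal((D, N)) * 0.1
W_C = rng.standard_normal((D, N)) * 0.1
A = -np.exp(rng.standard_normal(N))      # negative so exp(Delta*A) <= 1 (stable)

def selective_ssm(x):
    # x: [L, D] -> y: [L, D]; recurrence h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = C_t.h_t
    h = np.zeros((D, N))
    ys = []
    for t in range(L):
        delta = np.log1p(np.exp(x[t] * w_delta))     # softplus -> Delta_t > 0 (input-dependent)
        B_t = x[t] @ W_B                             # input projection  [N]
        C_t = x[t] @ W_C                             # output projection [N]
        A_bar = np.exp(delta[:, None] * A[None, :])  # discretize: A_bar = exp(Delta*A)
        h = A_bar * h + (delta[:, None] * B_t[None, :]) * x[t][:, None]  # B_bar = Delta*B
        ys.append(h @ C_t)                           # y_t = C_t . h_t  [D]
    return np.stack(ys)

y = selective_ssm(rng.standard_normal((L, D)))
print(y.shape)  # (6, 8)
```

Because Delta_t, B_t, and C_t are all computed from x_t, the recurrence can selectively retain or discard information per token, which is what "selective" refers to above.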
5. Transformer Architecture (GQA + MoE)
Transformer layer with GQA + MoE (data flow):

Input x [B, L, H]
  → RMSNorm
  → Q/K/V projection: W_Q (H → num_heads·d), W_K (H → num_kv·d), W_V (H → num_kv·d)
  → Reshape and transpose: [B, L, H] → [B, heads, L, dim]
  → QK normalization
  → RoPE position encoding (RotaryEmbedding, cos/sin cache)
  → Repeat KV: num_kv → num_heads
  → Attention: softmax(Q @ K^T / sqrt(d)) @ V
  → Output projection W_O
  → + Residual 1
  → RMSNorm
  → MoE / SwiGLU FFN: expert router (sigmoid + top-k) → 128 experts (SquaredReLU) + 2 shared experts (GELU) → MoE + shared sum
  → + Residual 2
  → Output [B, L, H]
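The GQA step above, with KV heads repeated to match the query heads, might look like this in NumPy (an illustrative sketch using the Mini config's head counts, not the project's C# implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, n_q, n_kv, d = 1, 5, 8, 2, 64   # matches the Mini config in this document

def gqa(q, k, v):
    # q: [B, n_q, L, d]; k, v: [B, n_kv, L, d]
    rep = n_q // n_kv                          # each KV head serves `rep` Q heads
    k = np.repeat(k, rep, axis=1)              # -> [B, n_q, L, d]
    v = np.repeat(v, rep, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)       # [B, n_q, L, L]
    scores = scores + np.triu(np.full((L, L), -1e9), k=1)   # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)           # softmax over key positions
    return w @ v                               # [B, n_q, L, d]

out = gqa(rng.standard_normal((B, n_q, L, d)),
          rng.standard_normal((B, n_kv, L, d)),
          rng.standard_normal((B, n_kv, L, d)))
print(out.shape)  # (1, 8, 5, 64)
```

Only 2 KV heads are stored while 8 query heads attend, which is where GQA's KV-cache savings come from.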
6. Mixture of Experts (MoE) Architecture
MoE layer (data flow):

Input x [B, L, H]
  → Expert router: MLP (H → E) → sigmoid gating → top-K selection
  → Gated experts (128): each selected expert k computes its output, scaled by gate_k
  → Shared experts (2): always active for every token
  → Output = Σ gated expert outputs + shared expert outputs
  → Output [B, L, H]
MoE Configuration
| Parameter | 4B Model | 8B Model | Mini (Test) |
|---|---|---|---|
| Total Experts | 128 | 128 | 8 |
| Active Experts | 6 | 6 | 2 |
| Shared Experts | 2 | 2 | 1 |
| Expert Activation | SquaredReLU | SquaredReLU | SquaredReLU |
| Shared Activation | GELU | GELU | GELU |
| Router | MLP + Sigmoid | MLP + Sigmoid | MLP + Sigmoid |
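The router row of the table (MLP + sigmoid + top-k) can be sketched as follows, using the Mini config's sizes and hypothetical weights (NumPy for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H, E, K = 16, 8, 2                             # hidden size, experts, top-k (Mini config)
W_router = rng.standard_normal((H, E)) * 0.1   # hypothetical router weights

def route(x):
    # x: [T, H] -> (expert indices [T, K], normalized gates [T, K])
    scores = 1.0 / (1.0 + np.exp(-(x @ W_router)))   # sigmoid gating (not softmax)
    idx = np.argsort(-scores, axis=-1)[:, :K]        # top-K expert ids per token
    gates = np.take_along_axis(scores, idx, axis=-1)
    gates = gates / gates.sum(-1, keepdims=True)     # renormalize the selected gates
    return idx, gates

idx, gates = route(rng.standard_normal((4, H)))
print(idx.shape, gates.shape)  # (4, 2) (4, 2)
```

Sigmoid gating scores each expert independently, so selecting the top-K and renormalizing afterwards is a common pattern; softmax routing would instead couple the scores before selection.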
7. Configuration and Model Variants
NemotronConfig fields:
  int VocabSize, int HiddenSize, int NumMambaLayers, int NumTransformerLayers,
  int NumAttentionHeads, int NumKVHeads, int IntermediateSize, int StateSize,
  int ConvKernelSize, int MaxSeqLen, float RopeBase, int ExpandFactor,
  int MIMORank, bool UseComplexSSM, bool UseExpTrapezoid,
  int NumExperts, int NumActiveExperts, int NumSharedExperts, bool UseMoE

NemotronConfig factory methods:
  Nemotron4B(), Nemotron8B(), Mini(), Nemotron4BWithMoE(), Nemotron8BWithMoE()

NemotronMamba3Model members:
  NemotronConfig _config
  TokenEmbedding _embedding
  Mamba3Layer[] _mambaLayers
  TransformerLayer[] _transformerLayers
  Tensor _finalNormWeight
  Tensor _lmHeadWeight
  Forward(inputIds) : Tensor
  Generate(...) : List<int>
Model Configuration Parameters
| Parameter | 4B Model | 4B MoE | 8B MoE | Mini (Test) |
|---|---|---|---|---|
| Hidden Size | 3072 | 3072 | 4096 | 512 |
| Mamba Layers | 8 | 8 | 16 | 2 |
| Transformer Layers | 32 | 32 | 56 | 4 |
| Attention Heads | 24 | 24 | 32 | 8 |
| KV Heads | 8 | 8 | 8 | 2 |
| FFN Intermediate | 8192 | 8192 | 10976 | 2048 |
| State Size | 128 | 128 | 192 | 64 |
| Max Seq Len | 8192 | 8192 | 8192 | 1024 |
| Vocab Size | 128256 | 128256 | 128256 | 32000 |
| MoE Experts | N/A | 128 | 128 | 8 |
| Active Experts | N/A | 6 | 6 | 2 |
8. Forward Pass Flow
Token IDs [B, L]
  → TokenEmbedding: Forward(token_ids) → hidden states [B, L, H]
  → repeated N times (alternating Mamba3 / Transformer):
      Mamba3 layer: RMSNorm → Conv1d → SSM → MoE/SwiGLU
      Transformer layer: RMSNorm → QKV → QKNorm → RoPE → GQA → MoE/SwiGLU
  → Final RMSNorm
  → LM Head (reverse of the embedding, tied weights)
  → Logits [B, L, V]

Optional: Generate() runs this forward pass autoregressively.
9. Inference Generation Flow
1. Input prompt token_ids
2. model.Forward(input)
3. Take the logits at the last position
4. Temperature scaling
5. Top-K filtering
6. Top-P (nucleus) filtering
7. Softmax normalization
8. Sample the next token
9. Append the token to the sequence
10. EOS token? Yes → return the generated token sequence; No → go back to step 2
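Steps 4-8 above, combined into a single NumPy sketch (the parameter values are arbitrary examples, not the project's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    # Temperature -> top-k -> top-p (nucleus) -> softmax -> sample
    logits = logits / temperature
    if top_k < len(logits):                          # top-k: drop all but the k largest
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    order = np.argsort(-probs)                       # top-p: smallest set reaching mass p
    cut = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    p = np.zeros_like(probs)
    p[order[:cut]] = probs[order[:cut]]
    p = p / p.sum()
    return rng.choice(len(p), p=p)                   # sample from the filtered distribution

tok = sample_next(rng.standard_normal(100))
print(0 <= tok < 100)  # True
```

The generation loop then appends `tok` to the sequence and calls `Forward` again until EOS, as in the flow above.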
10. Component File Mapping
| Component | File Path | Description |
|---|---|---|
| Core Model | Models/NemotronMamba3.cs | Config class, main model, forward pass |
| Mamba3 Layer | Layers/Mamba3Layer.cs | Complete Mamba3 layer with MoE |
| Mamba3 Core | Layers/Mamba3Core.cs | Selective SSM, parallel prefix scan |
| Transformer | Layers/TransformerLayer.cs | GQA attention, RoPE, optional MoE |
| MoE Layer | Layers/MoE/MoELayer.cs | Complete MoE implementation |
| Expert Router | Layers/MoE/ExpertRouter.cs | Router with sigmoid + top-k |
| Expert | Layers/MoE/Expert.cs | Individual expert with SquaredReLU |
| Shared Experts | Layers/MoE/SharedExperts.cs | Always-active shared experts |
| Embedding | Layers/Embedding.cs | TokenEmbedding, RotaryEmbedding |
| Normalization | Layers/LayerNorm.cs | RMSNorm, LayerNorm, QKNorm |
| Activation | Layers/Activation.cs | SwiGLU, GELU, SiLU, SquaredReLU |
| Tensor Core | Core/Tensor.cs | Tensor operations library |
| Tokenizer | Inference/Tokenizer.cs | Text encode/decode interface |
11. Technical Highlights
1. Mamba3 Selective Mechanism
- Data-dependent: Delta, A, B, C generated from input
- Selective scanning: Decides what information to retain/ignore
2. Parallel Prefix Scan
- Complexity: O(log L) vs traditional O(L)
- Algorithm: Blelloch parallel scan algorithm
3. GQA Efficiency Optimization
- KV head sharing: 8 KV heads serve 24 (4B) or 32 (8B) query heads
- Memory savings: KV cache shrinks by the Q-to-KV head ratio (3x-4x here)
4. RoPE Position Encoding
- Relative position: Encodes relative position rather than absolute
- Extrapolation: Supports longer sequences
5. Mixture of Experts (MoE)
- Sparse activation: Only Top-6 of 128 experts active per token
- Shared experts: 2 experts always active for every token
- SquaredReLU: Activation function for expert outputs
- Routing: MLP router with sigmoid gating
6. QK Normalization
- Training stability: Normalize Q and K per head
- Mamba3 improvement: Enhanced attention mechanism
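As one illustration of highlight 4, RoPE's relative-position property (the dot product of a rotated query and key depends only on the position offset, not the absolute positions) can be checked numerically. This is a self-contained NumPy sketch, not the project's RotaryEmbedding code:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: [L, d] with even d; rotate consecutive pairs by position-dependent angles
    L, d = x.shape
    inv = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    ang = np.outer(np.arange(L), inv)         # [L, d/2] angles: position * frequency
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin        # 2D rotation applied to each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Same q and k vector placed at every position; scores at equal offsets must match
rng = np.random.default_rng(0)
q = np.tile(rng.standard_normal(8), (6, 1))
k = np.tile(rng.standard_normal(8), (6, 1))
Q, K = rope(q), rope(k)
d1 = Q[3] @ K[1]   # offset 2
d2 = Q[5] @ K[3]   # offset 2
print(np.isclose(d1, d2))  # True
```

This offset-only dependence is why RoPE extrapolates to longer sequences better than absolute position embeddings.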
12. MoE Forward Pass Details
Input x reshaped to [B*L, H]
  → Router: MLP (H → E) → sigmoid → top-K
  → For each selected expert k: up projection → SquaredReLU → down projection → weight_k · output
  → For each shared expert j: up projection → GELU → down projection
  → Σ(gated expert outputs) + shared outputs
  → Reshape back to [B, L, H]
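Combining the router, gated experts, and shared experts above into one sketch (NumPy with made-up sizes and weights; the real C# experts use the same up/down projection structure):

```python
import numpy as np

rng = np.random.default_rng(0)
H, F, E, K, S = 16, 32, 8, 2, 1          # hidden, expert FFN, experts, top-k, shared

W_r  = rng.standard_normal((H, E)) * 0.1      # router weights
W_up = rng.standard_normal((E, H, F)) * 0.1   # per-expert up projections
W_dn = rng.standard_normal((E, F, H)) * 0.1   # per-expert down projections
S_up = rng.standard_normal((S, H, F)) * 0.1   # shared-expert up projections
S_dn = rng.standard_normal((S, F, H)) * 0.1

def squared_relu(z):
    return np.maximum(z, 0.0) ** 2

def gelu(z):  # tanh approximation
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def moe_forward(x):
    # x: [T, H] -> [T, H]: top-K gated experts plus always-active shared experts
    gates = 1.0 / (1.0 + np.exp(-(x @ W_r)))     # sigmoid router scores [T, E]
    idx = np.argsort(-gates, axis=-1)[:, :K]     # top-K expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in idx[t]:
            h = squared_relu(x[t] @ W_up[e]) @ W_dn[e]
            out[t] += gates[t, e] * h            # weight each expert by its gate
    for j in range(S):                           # shared path runs for every token
        out += gelu(x @ S_up[j]) @ S_dn[j]
    return out

y = moe_forward(rng.standard_normal((4, H)))
print(y.shape)  # (4, 16)
```

Only K of the E gated experts run per token, so compute scales with K while capacity scales with E; the shared experts provide a dense path that every token passes through.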
Document generated: 2026-03-24
Project: NemotronMamba3 (C# Implementation with MoE)