Nemotron-Mamba3 Architecture Documentation
╔══════════════════════════════════════════════════════════╗
║        Nemotron-Mamba3 Hybrid Model Demo                 ║
║        (Mamba3 SSM + Transformer + MoE)                  ║
╚══════════════════════════════════════════════════════════╝
═══════════════════════════════════════════════════════════
[Model Configuration]
═══════════════════════════════════════════════════════════
Vocabulary size: 32,000
Hidden dimension: 512
Mamba3 layers: 2
Transformer layers: 4
Attention heads: 8 (Q) / 2 (KV)
State size: 64
FFN intermediate size: 2048
MoE experts: 8 (2 active + 1 shared)
MoE enabled: yes
═══════════════════════════════════════════════════════════
[Model Initialization]
═══════════════════════════════════════════════════════════
Creating model components...
✓ Token embedding layer initialized [vocab=32000, hidden=512]
✓ Mamba3 layer stack initialized [2 layers]
✓ Transformer layer stack initialized [4 layers]
✓ Final normalization layer initialized
✓ LM output head initialized (tied weights)
═══════════════════════════════════════════════════════════
[Tokenizer Initialization]
═══════════════════════════════════════════════════════════
✓ Tokenizer initialized, vocabulary size: 256
═══════════════════════════════════════════════════════════
[Forward Pass Test]
═══════════════════════════════════════════════════════════
Input shape: [batch=1, seq_len=5]
Input token IDs: [1, 2, 3, 4, 5]
Starting forward pass...
───────────────────────────────────────────────────────────
├─ [Layer 0] Token embedding
│   Input: [batch=1, seq_len=5] token IDs
│   Output: [batch=1, seq_len=5, hidden=512]
├─ [Layer 1] Mamba3 layer #1
│   Input shape: [1, 5, 512]
│   ├─ RMSNorm
│   │   Output: [1, 5, 512]
│   ├─ Selective SSM (state space model)
│   │   State size: 64, expand factor: 2
│   │   Output: [1, 5, 512]
│   └─ MLP path: MoE (mixture of experts)
│       Experts: 8, active: 2, shared: 1
│       Output: [1, 5, 512]
│   └─ Residual merge: output = input + SSM + MLP
│   Output shape: [1, 5, 512]
│
├─ [Layer 2] Transformer layer #1
│   Input shape: [1, 5, 512]
│   ├─ Attention norm (RMSNorm)
│   │   Output: [1, 5, 512]
│   ├─ QKV projection
│   │   Q: [1, 5, 512] (8 heads x 64 dim)
│   │   K/V: [1, 5, 128] (2 heads x 64 dim)
│   ├─ QK normalization (Mamba3 refinement)
│   │   Normalization done
│   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
│   │   Rotation done
│   ├─ Grouped query attention (GQA)
│   │   Q heads: 8, KV heads: 2, repeat: 4x
│   │   Output: [1, 8, 5, 64]
│   └─ Attention output projection
│       Output: [1, 5, 512]
│   └─ Residual merge: h = x + attention
│
│   ├─ FFN norm (RMSNorm)
│   │   Output: [1, 5, 512]
│   └─ FFN/MLP: MoE (mixture of experts)
│       Experts: 8, active: 2, shared: 1
│       Output: [1, 5, 512]
│   └─ Residual merge: output = h + FFN
│   Output shape: [1, 5, 512]
│
├─ [Layer 3] Mamba3 layer #2
│   Input shape: [1, 5, 512]
│   ├─ RMSNorm
│   │   Output: [1, 5, 512]
│   ├─ Selective SSM (state space model)
│   │   State size: 64, expand factor: 2
│   │   Output: [1, 5, 512]
│   └─ MLP path: MoE (mixture of experts)
│       Experts: 8, active: 2, shared: 1
│       Output: [1, 5, 512]
│   └─ Residual merge: output = input + SSM + MLP
│   Output shape: [1, 5, 512]
│
├─ [Layer 4] Transformer layer #2
│   Input shape: [1, 5, 512]
│   (same per-step trace as Transformer layer #1)
│   Output shape: [1, 5, 512]
│
├─ [Layer 5] Transformer layer #3
│   Input shape: [1, 5, 512]
│   (same per-step trace as Transformer layer #1)
│   Output shape: [1, 5, 512]
│
├─ [Layer 6] Transformer layer #4
│   Input shape: [1, 5, 512]
│   (same per-step trace as Transformer layer #1)
│   Output shape: [1, 5, 512]
├─ [Layer 7] Final RMSNorm
│   Input shape: [1, 5, 512]
│   Output shape: [1, 5, 512]
└─ [Layer 8] LM output head (tied weights)
    Input shape: [1, 5, 512]
    Output shape: [1, 5, 32000]
───────────────────────────────────────────────────────────
✓ Forward pass completed successfully!
Elapsed: 1846ms
Output shape: [batch=1, seq_len=5, vocab=32000]
[Output Logits Preview (last position)]
Top-5 token probabilities: [token 0(NaN), token 1(NaN), token 2(NaN), token 3(NaN), token 4(NaN)]
═══════════════════════════════════════════════════════════
Demo complete!
═══════════════════════════════════════════════════════════
1. Project Overview
Nemotron-Mamba3 is a hybrid large language model that combines the Mamba3 State Space Model (SSM) with Transformer layers and a Mixture of Experts (MoE).
Core Technical Features
| Feature | Description |
|---|---|
| Mamba3 SSM | Selective State Space Model, O(L) linear sequence complexity |
| Transformer | Grouped Query Attention (GQA) for efficient inference |
| Hybrid Architecture | Alternating Mamba3 and Transformer layers |
| MoE | Mixture of Experts - 128 experts, Top-6 active per token |
| RoPE | Rotary Position Embedding for long context |
| SwiGLU | Gated activation function |
| SquaredReLU | Activation for MoE experts |
2. System Architecture Overview
Token IDs [batch, seq_len]
  → TokenEmbedding (vocab → hidden)
  → Mamba3Layer × N
  → TransformerLayer × N
  → RMSNorm
  → LM Head (hidden → vocab)
  → Logits [batch, seq_len, vocab]
3. Complete Data Flow
Token IDs [B, L]
  → TokenEmbedding.Forward() → Hidden states [B, L, H]
  → for each layer in the (Mamba3 + Transformer) × N stack:
      Mamba3 layer:      RMSNorm → Conv1d (local context) → Selective SSM (parallel scan) → MoE / SwiGLU MLP → + residual
      Transformer layer: RMSNorm → Q/K/V projection → QK normalization → RoPE → GQA attention → MoE / SwiGLU FFN → + residual
  → Final RMSNorm
  → LM Head (tied with embedding)
  → Logits [B, L, V]
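The end-to-end flow above can be sketched as a minimal NumPy forward pass. This is illustrative only; the project itself is C#, and the stand-in residual blocks below are hypothetical placeholders for the real Mamba3/Transformer layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_norm(x, eps=1e-6):
    # RMSNorm: divide by the root-mean-square over the hidden dimension
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def forward(token_ids, embedding, layers):
    h = embedding[token_ids]          # [B, L] -> [B, L, H]
    for layer in layers:              # alternating Mamba3 / Transformer blocks
        h = h + layer(h)              # each block contributes a residual branch
    h = rms_norm(h)                   # final RMSNorm
    return h @ embedding.T            # tied LM head: [B, L, H] -> [B, L, V]

V, H = 100, 16
emb = rng.standard_normal((V, H)) * 0.02
blocks = [lambda x: 0.1 * x] * 3      # stand-ins for the real layers
logits = forward(np.array([[1, 2, 3]]), emb, blocks)
print(logits.shape)  # (1, 3, 100)
```

Note how weight tying reuses the embedding matrix for the output projection, exactly as the demo's "LM output head (tied weights)" step indicates.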
4. Mamba3 Core Architecture
Mamba3 Selective SSM core (data flow):

Input x [B, L, D]
  → Conv1d (kernel=4, local convolution)
  → Selective mechanism: X_proj (x → A, B, C, Δ), DT_proj (x → Δ)
  → SSM computation:
      Discretize: A_bar = exp(Δ·A), B_bar = Δ·B
      Parallel prefix scan, O(log L) depth
  → Output projection (d_inner → d_model)
  → + Skip connection
  → Output [B, L, D]
Mamba3 Core Equations
Delta_t = tau(LinearDelta(x_t)) # Time step parameter (input-dependent)
A_t = tau(A + LinearA(x_t)) # Extended state matrix (input-dependent)
B_t = LinearB(x_t) # Input projection
C_t = LinearC(x_t) # Output projection
h_t = A_bar_t * h_{t-1} + B_bar_t * x_t # State update
y_t = C_t * h_t # Output
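A sequential NumPy sketch of these equations follows. The actual layer uses a parallel prefix scan; the projection weights here are made-up placeholders, and Delta is kept per-channel for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, L = 8, 4, 6                        # channels, state size, sequence length
w_delta = rng.standard_normal(D) * 0.1   # hypothetical projection weights
W_B = rng.standard_normal((D, N)) * 0.1
W_C = rng.standard_normal((D, N)) * 0.1
A = -np.exp(rng.standard_normal(N))      # negative so exp(Delta*A) <= 1 (stable)

def selective_ssm(x):
    # x: [L, D] -> y: [L, D]; recurrence h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = C_t.h_t
    h = np.zeros((D, N))
    ys = []
    for t in range(L):
        delta = np.log1p(np.exp(x[t] * w_delta))     # softplus -> Delta_t > 0 (input-dependent)
        B_t = x[t] @ W_B                             # input projection  [N]
        C_t = x[t] @ W_C                             # output projection [N]
        A_bar = np.exp(delta[:, None] * A[None, :])  # discretize: A_bar = exp(Delta*A)
        h = A_bar * h + (delta[:, None] * B_t[None, :]) * x[t][:, None]  # B_bar = Delta*B
        ys.append(h @ C_t)                           # y_t = C_t . h_t  [D]
    return np.stack(ys)

y = selective_ssm(rng.standard_normal((L, D)))
print(y.shape)  # (6, 8)
```

Because Delta_t, B_t, and C_t are all computed from x_t, the recurrence can selectively retain or discard information per token, which is what "selective" refers to above.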
5. Transformer Architecture (GQA + MoE)
Transformer layer with GQA + MoE (data flow):

Input x [B, L, H]
  → RMSNorm
  → Q/K/V projection: W_Q (H → num_heads·d), W_K (H → num_kv·d), W_V (H → num_kv·d)
  → Reshape and transpose: [B, L, H] → [B, heads, L, dim]
  → QK normalization
  → RoPE position encoding (RotaryEmbedding, cos/sin cache)
  → Repeat KV: num_kv → num_heads
  → Attention: softmax(Q @ K^T / sqrt(d)) @ V
  → Output projection W_O
  → + Residual 1
  → RMSNorm
  → MoE / SwiGLU FFN: expert router (sigmoid + top-k) → 128 experts (SquaredReLU) + 2 shared experts (GELU) → MoE + shared sum
  → + Residual 2
  → Output [B, L, H]
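The GQA step above, with KV heads repeated to match the query heads, might look like this in NumPy (an illustrative sketch using the Mini config's head counts, not the project's C# implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, n_q, n_kv, d = 1, 5, 8, 2, 64   # matches the Mini config in this document

def gqa(q, k, v):
    # q: [B, n_q, L, d]; k, v: [B, n_kv, L, d]
    rep = n_q // n_kv                          # each KV head serves `rep` Q heads
    k = np.repeat(k, rep, axis=1)              # -> [B, n_q, L, d]
    v = np.repeat(v, rep, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)       # [B, n_q, L, L]
    scores = scores + np.triu(np.full((L, L), -1e9), k=1)   # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)           # softmax over key positions
    return w @ v                               # [B, n_q, L, d]

out = gqa(rng.standard_normal((B, n_q, L, d)),
          rng.standard_normal((B, n_kv, L, d)),
          rng.standard_normal((B, n_kv, L, d)))
print(out.shape)  # (1, 8, 5, 64)
```

Only 2 KV heads are stored while 8 query heads attend, which is where GQA's KV-cache savings come from.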
6. Mixture of Experts (MoE) Architecture
MoE layer (data flow):

Input x [B, L, H]
  → Expert router: MLP (H → E) → sigmoid gating → top-K selection
  → Gated experts (128): each selected expert k computes its output, scaled by gate_k
  → Shared experts (2): always active for every token
  → Output = Σ gated expert outputs + shared expert outputs
  → Output [B, L, H]
MoE Configuration
| Parameter | 4B Model | 8B Model | Mini (Test) |
|---|---|---|---|
| Total Experts | 128 | 128 | 8 |
| Active Experts | 6 | 6 | 2 |
| Shared Experts | 2 | 2 | 1 |
| Expert Activation | SquaredReLU | SquaredReLU | SquaredReLU |
| Shared Activation | GELU | GELU | GELU |
| Router | MLP + Sigmoid | MLP + Sigmoid | MLP + Sigmoid |
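The router row of the table (MLP + sigmoid + top-k) can be sketched as follows, using the Mini config's sizes and hypothetical weights (NumPy for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H, E, K = 16, 8, 2                             # hidden size, experts, top-k (Mini config)
W_router = rng.standard_normal((H, E)) * 0.1   # hypothetical router weights

def route(x):
    # x: [T, H] -> (expert indices [T, K], normalized gates [T, K])
    scores = 1.0 / (1.0 + np.exp(-(x @ W_router)))   # sigmoid gating (not softmax)
    idx = np.argsort(-scores, axis=-1)[:, :K]        # top-K expert ids per token
    gates = np.take_along_axis(scores, idx, axis=-1)
    gates = gates / gates.sum(-1, keepdims=True)     # renormalize the selected gates
    return idx, gates

idx, gates = route(rng.standard_normal((4, H)))
print(idx.shape, gates.shape)  # (4, 2) (4, 2)
```

Sigmoid gating scores each expert independently, so selecting the top-K and renormalizing afterwards is a common pattern; softmax routing would instead couple the scores before selection.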
7. Configuration and Model Variants
NemotronConfig fields:
  int VocabSize, int HiddenSize, int NumMambaLayers, int NumTransformerLayers,
  int NumAttentionHeads, int NumKVHeads, int IntermediateSize, int StateSize,
  int ConvKernelSize, int MaxSeqLen, float RopeBase, int ExpandFactor,
  int MIMORank, bool UseComplexSSM, bool UseExpTrapezoid,
  int NumExperts, int NumActiveExperts, int NumSharedExperts, bool UseMoE

NemotronConfig factory methods:
  Nemotron4B(), Nemotron8B(), Mini(), Nemotron4BWithMoE(), Nemotron8BWithMoE()

NemotronMamba3Model members:
  NemotronConfig _config
  TokenEmbedding _embedding
  Mamba3Layer[] _mambaLayers
  TransformerLayer[] _transformerLayers
  Tensor _finalNormWeight
  Tensor _lmHeadWeight
  Forward(inputIds) : Tensor
  Generate(...) : List<int>
Model Configuration Parameters
| Parameter | 4B Model | 4B MoE | 8B MoE | Mini (Test) |
|---|---|---|---|---|
| Hidden Size | 3072 | 3072 | 4096 | 512 |
| Mamba Layers | 8 | 8 | 16 | 2 |
| Transformer Layers | 32 | 32 | 56 | 4 |
| Attention Heads | 24 | 24 | 32 | 8 |
| KV Heads | 8 | 8 | 8 | 2 |
| FFN Intermediate | 8192 | 8192 | 10976 | 2048 |
| State Size | 128 | 128 | 192 | 64 |
| Max Seq Len | 8192 | 8192 | 8192 | 1024 |
| Vocab Size | 128256 | 128256 | 128256 | 32000 |
| MoE Experts | N/A | 128 | 128 | 8 |
| Active Experts | N/A | 6 | 6 | 2 |
8. Forward Pass Flow
Token IDs [B, L]
  → TokenEmbedding: Forward(token_ids) → hidden states [B, L, H]
  → repeated N times (alternating Mamba3 / Transformer):
      Mamba3 layer: RMSNorm → Conv1d → SSM → MoE/SwiGLU
      Transformer layer: RMSNorm → QKV → QKNorm → RoPE → GQA → MoE/SwiGLU
  → Final RMSNorm
  → LM Head (reverse of the embedding, tied weights)
  → Logits [B, L, V]

Optional: Generate() runs this forward pass autoregressively.
9. Inference Generation Flow
1. Input prompt token_ids
2. model.Forward(input)
3. Take the logits at the last position
4. Temperature scaling
5. Top-K filtering
6. Top-P (nucleus) filtering
7. Softmax normalization
8. Sample the next token
9. Append the token to the sequence
10. EOS token? Yes → return the generated token sequence; No → go back to step 2
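Steps 4-8 above, combined into a single NumPy sketch (the parameter values are arbitrary examples, not the project's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    # Temperature -> top-k -> top-p (nucleus) -> softmax -> sample
    logits = logits / temperature
    if top_k < len(logits):                          # top-k: drop all but the k largest
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    order = np.argsort(-probs)                       # top-p: smallest set reaching mass p
    cut = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    p = np.zeros_like(probs)
    p[order[:cut]] = probs[order[:cut]]
    p = p / p.sum()
    return rng.choice(len(p), p=p)                   # sample from the filtered distribution

tok = sample_next(rng.standard_normal(100))
print(0 <= tok < 100)  # True
```

The generation loop then appends `tok` to the sequence and calls `Forward` again until EOS, as in the flow above.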
10. Component File Mapping
| Component | File Path | Description |
|---|---|---|
| Core Model | Models/NemotronMamba3.cs | Config class, main model, forward pass |
| Mamba3 Layer | Layers/Mamba3Layer.cs | Complete Mamba3 layer with MoE |
| Mamba3 Core | Layers/Mamba3Core.cs | Selective SSM, parallel prefix scan |
| Transformer | Layers/TransformerLayer.cs | GQA attention, RoPE, optional MoE |
| MoE Layer | Layers/MoE/MoELayer.cs | Complete MoE implementation |
| Expert Router | Layers/MoE/ExpertRouter.cs | Router with sigmoid + top-k |
| Expert | Layers/MoE/Expert.cs | Individual expert with SquaredReLU |
| Shared Experts | Layers/MoE/SharedExperts.cs | Always-active shared experts |
| Embedding | Layers/Embedding.cs | TokenEmbedding, RotaryEmbedding |
| Normalization | Layers/LayerNorm.cs | RMSNorm, LayerNorm, QKNorm |
| Activation | Layers/Activation.cs | SwiGLU, GELU, SiLU, SquaredReLU |
| Tensor Core | Core/Tensor.cs | Tensor operations library |
| Tokenizer | Inference/Tokenizer.cs | Text encode/decode interface |
11. Technical Highlights
1. Mamba3 Selective Mechanism
- Data-dependent: Delta, A, B, C generated from input
- Selective scanning: Decides what information to retain/ignore
2. Parallel Prefix Scan
- Complexity: O(log L) vs traditional O(L)
- Algorithm: Blelloch parallel scan algorithm
3. GQA Efficiency Optimization
- KV head sharing: 8 KV heads serve 24 (4B) or 32 (8B) query heads
- Memory savings: KV cache shrinks by the Q-to-KV head ratio (3x-4x here)
4. RoPE Position Encoding
- Relative position: Encodes relative position rather than absolute
- Extrapolation: Supports longer sequences
5. Mixture of Experts (MoE)
- Sparse activation: Only Top-6 of 128 experts active per token
- Shared experts: 2 experts always active for every token
- SquaredReLU: Activation function for expert outputs
- Routing: MLP router with sigmoid gating
6. QK Normalization
- Training stability: Normalize Q and K per head
- Mamba3 improvement: Enhanced attention mechanism
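As one illustration of highlight 4, RoPE's relative-position property (the dot product of a rotated query and key depends only on the position offset, not the absolute positions) can be checked numerically. This is a self-contained NumPy sketch, not the project's RotaryEmbedding code:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: [L, d] with even d; rotate consecutive pairs by position-dependent angles
    L, d = x.shape
    inv = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    ang = np.outer(np.arange(L), inv)         # [L, d/2] angles: position * frequency
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin        # 2D rotation applied to each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Same q and k vector placed at every position; scores at equal offsets must match
rng = np.random.default_rng(0)
q = np.tile(rng.standard_normal(8), (6, 1))
k = np.tile(rng.standard_normal(8), (6, 1))
Q, K = rope(q), rope(k)
d1 = Q[3] @ K[1]   # offset 2
d2 = Q[5] @ K[3]   # offset 2
print(np.isclose(d1, d2))  # True
```

This offset-only dependence is why RoPE extrapolates to longer sequences better than absolute position embeddings.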
12. MoE Forward Pass Details
Input x reshaped to [B*L, H]
  → Router: MLP (H → E) → sigmoid → top-K
  → For each selected expert k: up projection → SquaredReLU → down projection → weight_k · output
  → For each shared expert j: up projection → GELU → down projection
  → Σ(gated expert outputs) + shared outputs
  → Reshape back to [B, L, H]
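Combining the router, gated experts, and shared experts above into one sketch (NumPy with made-up sizes and weights; the real C# experts use the same up/down projection structure):

```python
import numpy as np

rng = np.random.default_rng(0)
H, F, E, K, S = 16, 32, 8, 2, 1          # hidden, expert FFN, experts, top-k, shared

W_r  = rng.standard_normal((H, E)) * 0.1      # router weights
W_up = rng.standard_normal((E, H, F)) * 0.1   # per-expert up projections
W_dn = rng.standard_normal((E, F, H)) * 0.1   # per-expert down projections
S_up = rng.standard_normal((S, H, F)) * 0.1   # shared-expert up projections
S_dn = rng.standard_normal((S, F, H)) * 0.1

def squared_relu(z):
    return np.maximum(z, 0.0) ** 2

def gelu(z):  # tanh approximation
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def moe_forward(x):
    # x: [T, H] -> [T, H]: top-K gated experts plus always-active shared experts
    gates = 1.0 / (1.0 + np.exp(-(x @ W_r)))     # sigmoid router scores [T, E]
    idx = np.argsort(-gates, axis=-1)[:, :K]     # top-K expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in idx[t]:
            h = squared_relu(x[t] @ W_up[e]) @ W_dn[e]
            out[t] += gates[t, e] * h            # weight each expert by its gate
    for j in range(S):                           # shared path runs for every token
        out += gelu(x @ S_up[j]) @ S_dn[j]
    return out

y = moe_forward(rng.standard_normal((4, H)))
print(y.shape)  # (4, 16)
```

Only K of the E gated experts run per token, so compute scales with K while capacity scales with E; the shared experts provide a dense path that every token passes through.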
Document generated: 2026-03-24
Project: NemotronMamba3 (C# Implementation with MoE)