Nemotron Architecture (Mamba3 + Transformer + MoE)

Nemotron-Mamba3 Architecture Documentation

╔══════════════════════════════════════════════════════════╗
║         Nemotron-Mamba3 Hybrid Model Demo                ║
║         (Mamba3 SSM + Transformer + MoE)                 ║
╚══════════════════════════════════════════════════════════╝

═══════════════════════════════════════════════════════════
[Model Configuration]
═══════════════════════════════════════════════════════════
  Vocab size:           32,000
  Hidden dim:           512
  Mamba3 layers:        2
  Transformer layers:   4
  Attention heads:      8 (Q) / 2 (KV)
  State size:           64
  FFN intermediate:     2048
  MoE experts:          8 (2 active + 1 shared)
  MoE enabled:          yes

═══════════════════════════════════════════════════════════
[Model Initialization]
═══════════════════════════════════════════════════════════
  Creating model components...
  ✓ Token embedding initialized [vocab=32000, hidden=512]
  ✓ Mamba3 layer stack initialized [2 layers]
  ✓ Transformer layer stack initialized [4 layers]
  ✓ Final normalization layer initialized
  ✓ LM output head initialized (tied weights)

═══════════════════════════════════════════════════════════
[Tokenizer Initialization]
═══════════════════════════════════════════════════════════
  ✓ Tokenizer initialized, vocab size: 256

═══════════════════════════════════════════════════════════
[Forward Pass Test]
═══════════════════════════════════════════════════════════
  Input shape: [batch=1, seq_len=5]
  Input token IDs: [1, 2, 3, 4, 5]

  Starting forward pass...
  ───────────────────────────────────────────────────────────
  ├─ [Layer 0] Token embedding
  │   Input: [batch=1, seq_len=5] token IDs
  │   Output: [batch=1, seq_len=5, hidden=512]
  ├─ [Layer 1] Mamba3 layer #1
  │   Input shape: [1, 5, 512]
  │   ├─ RMSNorm
  │   │   Output: [1, 5, 512]
  │   ├─ Selective SSM (state space model)
  │   │   State dim: 64, expand factor: 2
  │   │   Output: [1, 5, 512]
  │   └─ MLP path: MoE (Mixture of Experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual: output = input + SSM + MLP
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 2] Transformer layer #1
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   done
  │   ├─ Grouped Query Attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (8 heads x 64 dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (Mixture of Experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 3] Mamba3 layer #2
  │   Input shape: [1, 5, 512]
  │   ├─ RMSNorm
  │   │   Output: [1, 5, 512]
  │   ├─ Selective SSM (state space model)
  │   │   State dim: 64, expand factor: 2
  │   │   Output: [1, 5, 512]
  │   └─ MLP path: MoE (Mixture of Experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual: output = input + SSM + MLP
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 4] Transformer layer #2
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   done
  │   ├─ Grouped Query Attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (8 heads x 64 dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (Mixture of Experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 5] Transformer layer #3
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   done
  │   ├─ Grouped Query Attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (8 heads x 64 dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (Mixture of Experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  └─ [Layer 6] Transformer layer #4
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   done
  │   ├─ Grouped Query Attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (8 heads x 64 dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (Mixture of Experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual: output = h + FFN
  │   Output shape: [1, 5, 512]
  └─ [Layer 7] Final RMSNorm
      Input shape: [1, 5, 512]
      Output shape: [1, 5, 512]

  └─ [Layer 8] LM output head (tied weights)
      Input shape: [1, 5, 512]
      Output shape: [1, 5, 32000]
  ───────────────────────────────────────────────────────────
  ✓ Forward pass completed successfully!
    Elapsed: 1846ms
    Output shape: [batch=1, seq_len=5, vocab=32000]

  [Output logits preview (last position)]
    Top-5 token probabilities: [token 0(NaN), token 1(NaN), token 2(NaN), token 3(NaN), token 4(NaN)]

═══════════════════════════════════════════════════════════
  Demo complete!
═══════════════════════════════════════════════════════════

1. Project Overview

Nemotron-Mamba3 is a hybrid architecture large language model combining Mamba3 State Space Model (SSM) with Transformer architecture and Mixture of Experts (MoE).

Core Technical Features

  Feature               Description
  Mamba3 SSM            Selective State Space Model, O(L) linear sequence complexity
  Transformer           Grouped Query Attention (GQA) for efficient inference
  Hybrid Architecture   Alternating Mamba3 and Transformer layers
  MoE                   Mixture of Experts: 128 experts, Top-6 active per token
  RoPE                  Rotary Position Embedding for long context
  SwiGLU                Gated activation function
  SquaredReLU           Activation for MoE experts

2. System Architecture Overview

  Token IDs [batch, seq_len]
    -> TokenEmbedding (vocab -> hidden)
    -> core stack: Mamba3Layer x N and TransformerLayer x N (alternating)
    -> RMSNorm
    -> LM Head (hidden -> vocab)
    -> Logits [batch, seq_len, vocab]


3. Complete Data Flow

  Token IDs [B, L]
    -> TokenEmbedding.Forward()
    -> Hidden states [B, L, H]
    -> for each layer in the (Mamba3 + Transformer) x N stack:

       Mamba3 layer:
         RMSNorm -> Conv1d (local context) -> Selective SSM (parallel scan)
         -> MoE / SwiGLU MLP -> residual add

       Transformer layer:
         RMSNorm -> Q/K/V projection -> QK normalization -> RoPE
         -> GQA attention -> residual add
         -> RMSNorm -> MoE / SwiGLU FFN -> residual add

    -> Final RMSNorm
    -> LM Head (tied with embedding)
    -> Logits [B, L, V]

4. Mamba3 Core Architecture

Mamba3 Selective SSM Core

  Input x [B, L, D]
    -> Conv1d (kernel=4, local context)
    -> Selective mechanism:
         X_proj:  x -> A, B, C, Delta
         DT_proj: x -> Delta
    -> Discretize: A_bar = exp(Delta*A), B_bar = Delta*B
    -> SSM computation via parallel prefix scan, O(log L)
    -> Output projection (d_inner -> d_model)
    -> + skip connection
    -> Output [B, L, D]

Mamba3 Core Equations

    Delta_t = tau(LinearDelta(x_t))               # time-step parameter (input-dependent)
    A_t     = tau(A + LinearA(x_t))               # extended state matrix (input-dependent)
    B_t     = LinearB(x_t)                        # input projection
    C_t     = LinearC(x_t)                        # output projection
    A_bar_t = exp(Delta_t * A_t)                  # discretization
    B_bar_t = Delta_t * B_t
    h_t     = A_bar_t * h_{t-1} + B_bar_t * x_t   # state update
    y_t     = C_t * h_t                           # output
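The recurrence above can be sketched in NumPy as a sequential reference (the actual implementation uses a parallel prefix scan). The weight names (`W_delta`, `W_B`, `W_C`), the softplus choice for `tau`, and the simplified non-extended `A` are illustrative assumptions, not the project's API.

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential selective-SSM reference.
    x: [L, D] inputs; A: [D, N] state matrix; W_delta: [D, D]; W_B, W_C: [D, N].
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                           # hidden state h_t
    ys = []
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # softplus -> Delta_t > 0, per channel
        B_t = x[t] @ W_B                           # input projection, [N]
        C_t = x[t] @ W_C                           # output projection, [N]
        A_bar = np.exp(delta[:, None] * A)         # discretize: A_bar = exp(Delta*A)
        B_bar = delta[:, None] * B_t[None, :]      # B_bar = Delta*B
        h = A_bar * h + B_bar * x[t][:, None]      # state update
        ys.append(h @ C_t)                         # y_t = C_t . h_t, [D]
    return np.stack(ys)                            # [L, D]
```

With `A` kept negative (as in real SSM parameterizations), `exp(Delta*A)` stays below 1 and the state remains stable.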

5. Transformer Architecture (GQA + MoE)

Transformer Layer with GQA + MoE

  Input x [B, L, H]
    -> RMSNorm
    -> Q/K/V projection:
         W_Q: H -> num_heads * d
         W_K: H -> num_kv * d
         W_V: H -> num_kv * d
    -> Reshape and transpose: [B, L, H] -> [B, heads, L, dim]
    -> QK normalization
    -> RoPE position encoding (RotaryEmbedding, cos/sin cache)
    -> Repeat KV: num_kv -> num_heads
    -> Attention: softmax(Q @ K^T / sqrt(d)) @ V
    -> Output projection W_O
    -> + residual 1
    -> RMSNorm
    -> MoE / SwiGLU FFN:
         Expert router (sigmoid + top-k)
         128 experts (SquaredReLU) + 2 shared experts (GELU)
         MoE output + shared output
    -> + residual 2
    -> Output [B, L, H]
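The GQA step (2 KV heads repeated to serve 8 query heads, then scaled dot-product attention with a causal mask) can be sketched as follows. Head counts match the Mini config; the function itself is an illustrative NumPy sketch, not the project's C# code.

```python
import numpy as np

def gqa(q, k, v, num_q_heads=8, num_kv_heads=2):
    """Grouped Query Attention for one batch element.
    q: [Hq, L, d]; k, v: [Hkv, L, d] -> output [Hq, L, d]."""
    rep = num_q_heads // num_kv_heads
    k = np.repeat(k, rep, axis=0)                     # share each KV head across 4 Q heads
    v = np.repeat(v, rep, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)    # [Hq, L, L]
    L = q.shape[1]
    causal = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores = np.where(causal, -1e9, scores)           # position t attends only to <= t
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                  # softmax
    return w @ v
```

Because only 2 KV heads are stored, the KV cache is 4x smaller than full multi-head attention at this configuration.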


6. Mixture of Experts (MoE) Architecture

MoE Layer

  Input x [B, L, H]
    -> Expert router: MLP (H -> E) -> sigmoid gating -> Top-K selection
    -> Gated experts (128): Expert 1 .. Expert N, each output scaled by gate_k
    -> Shared experts (2): Shared Exp 1, Shared Exp 2 (always active)
    -> MoE output + shared output
    -> Output [B, L, H]

MoE Configuration

  Parameter           4B Model        8B Model        Mini (Test)
  Total Experts       128             128             8
  Active Experts      6               6               2
  Shared Experts      2               2               1
  Expert Activation   SquaredReLU     SquaredReLU     SquaredReLU
  Shared Activation   GELU            GELU            GELU
  Router              MLP + Sigmoid   MLP + Sigmoid   MLP + Sigmoid
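The router described above (MLP scoring, sigmoid gating, Top-K selection) can be sketched in NumPy. The single-matrix `W_router` stands in for the router MLP and is an assumption; expert counts follow the Mini preset.

```python
import numpy as np

def route(x, W_router, top_k=2):
    """Sigmoid + Top-K expert routing.
    x: [T, H] flattened tokens; W_router: [H, E].
    Returns (indices [T, top_k], gate weights [T, top_k])."""
    logits = x @ W_router                           # per-expert scores, [T, E]
    gates = 1.0 / (1.0 + np.exp(-logits))           # sigmoid gating, each in (0, 1)
    idx = np.argsort(-gates, axis=-1)[:, :top_k]    # Top-K expert ids (descending gate)
    sel = np.take_along_axis(gates, idx, axis=-1)   # their gate weights
    return idx, sel
```

Unlike softmax routing, sigmoid gates are scored independently per expert, so the selected weights do not sum to 1 per token.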

7. Configuration and Model Variants

NemotronConfig
  Fields:
    int   VocabSize, HiddenSize, NumMambaLayers, NumTransformerLayers,
          NumAttentionHeads, NumKVHeads, IntermediateSize, StateSize,
          ConvKernelSize, MaxSeqLen, ExpandFactor, MIMORank,
          NumExperts, NumActiveExperts, NumSharedExperts
    float RopeBase
    bool  UseComplexSSM, UseExpTrapezoid, UseMoE
  Factory methods:
    Nemotron4B(), Nemotron8B(), Mini(),
    Nemotron4BWithMoE(), Nemotron8BWithMoE()

NemotronMamba3Model
  Fields:
    NemotronConfig     _config
    TokenEmbedding     _embedding
    Mamba3Layer[]      _mambaLayers
    TransformerLayer[] _transformerLayers
    Tensor             _finalNormWeight
    Tensor             _lmHeadWeight
  Methods:
    Forward(inputIds) : Tensor
    Generate(...) : List<int>

Model Configuration Parameters

  Parameter            4B Model   4B MoE    8B MoE    Mini (Test)
  Hidden Size          3072       3072      4096      512
  Mamba Layers         8          8         16        2
  Transformer Layers   32         32        56        4
  Attention Heads      24         24        32        8
  KV Heads             8          8         8         2
  FFN Intermediate     8192       8192      10976     2048
  State Size           128        128       192       64
  Max Seq Len          8192       8192      8192      1024
  Vocab Size           128256     128256    128256    32000
  MoE Experts          N/A        128       128       8
  Active Experts       N/A        6         6         2
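For readers who want to experiment outside C#, the Mini (test) column can be mirrored as a small Python config. The dataclass and its snake_case field names are hypothetical; only the values come from the table.

```python
from dataclasses import dataclass

@dataclass
class MiniConfig:
    """Hypothetical mirror of NemotronConfig.Mini() from the table above."""
    vocab_size: int = 32000
    hidden_size: int = 512
    num_mamba_layers: int = 2
    num_transformer_layers: int = 4
    num_attention_heads: int = 8
    num_kv_heads: int = 2
    intermediate_size: int = 2048
    state_size: int = 64
    max_seq_len: int = 1024
    num_experts: int = 8
    num_active_experts: int = 2
    num_shared_experts: int = 1
    use_moe: bool = True

    @property
    def head_dim(self) -> int:
        # 512 / 8 = 64, matching the per-head dim seen in the demo trace
        return self.hidden_size // self.num_attention_heads
```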

8. Forward Pass Flow

  Token IDs [B, L]
    -> TokenEmbedding: Forward(token_ids) -> hidden states [B, L, H]
    -> repeat N times (Mamba3 and Transformer layers alternating):
         Mamba3 layer:      RMSNorm -> Conv1d -> SSM -> MoE/SwiGLU
         Transformer layer: RMSNorm -> QKV -> QKNorm -> RoPE -> GQA -> MoE/SwiGLU
    -> Final RMSNorm
    -> LM Head (reverse of the tied embedding weights)
    -> Logits [B, L, V]

  Optional: Generate() runs this forward pass autoregressively.


9. Inference Generation Flow

  Input prompt token_ids
    -> model.Forward(input)
    -> take logits at the last position
    -> temperature scaling
    -> Top-K filtering
    -> Top-P (nucleus) filtering
    -> softmax normalization
    -> sample next token, append to sequence
    -> EOS token?
         No  -> loop back to model.Forward(input)
         Yes -> return generated token sequence
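The filtering chain above can be sketched for a single step in NumPy. One liberty taken: Top-P operates on probabilities, so in this sketch the softmax happens before the nucleus cut and the surviving mass is renormalized; the function is illustrative, not the project's sampler.

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """One sampling step: temperature -> Top-K -> softmax -> Top-P -> draw."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature                       # temperature scaling
    # Top-K: keep only the k largest logits
    kth = np.sort(logits)[-top_k] if top_k < len(logits) else -np.inf
    logits = np.where(logits < kth, -np.inf, logits)
    # softmax over the survivors
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    # Top-P (nucleus): smallest set whose cumulative probability >= top_p
    order = np.argsort(-p)
    cum = np.cumsum(p[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = order[:cutoff]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    q = q / q.sum()                                     # renormalize kept mass
    return int(rng.choice(len(q), p=q))                 # sampled token id
```

Low temperature sharpens the distribution toward the argmax token; high temperature flattens it.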


10. Component File Mapping

  Component       File Path                      Description
  Core Model      Models/NemotronMamba3.cs       Config class, main model, forward pass
  Mamba3 Layer    Layers/Mamba3Layer.cs          Complete Mamba3 layer with MoE
  Mamba3 Core     Layers/Mamba3Core.cs           Selective SSM, parallel prefix scan
  Transformer     Layers/TransformerLayer.cs     GQA attention, RoPE, optional MoE
  MoE Layer       Layers/MoE/MoELayer.cs         Complete MoE implementation
  Expert Router   Layers/MoE/ExpertRouter.cs     Router with sigmoid + top-k
  Expert          Layers/MoE/Expert.cs           Individual expert with SquaredReLU
  Shared Experts  Layers/MoE/SharedExperts.cs    Always-active shared experts
  Embedding       Layers/Embedding.cs            TokenEmbedding, RotaryEmbedding
  Normalization   Layers/LayerNorm.cs            RMSNorm, LayerNorm, QKNorm
  Activation      Layers/Activation.cs           SwiGLU, GELU, SiLU, SquaredReLU
  Tensor Core     Core/Tensor.cs                 Tensor operations library
  Tokenizer       Inference/Tokenizer.cs         Text encode/decode interface

11. Technical Highlights

1. Mamba3 Selective Mechanism

  • Data-dependent: Delta, A, B, C generated from input
  • Selective scanning: Decides what information to retain/ignore

2. Parallel Prefix Scan

  • Complexity: O(log L) vs traditional O(L)
  • Algorithm: Blelloch parallel scan algorithm
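The log-depth scan works because the linear recurrence h_t = a_t*h_{t-1} + b_t is associative over (a, b) pairs: composing two steps gives (a2*a1, a2*b1 + b2), so steps can be combined in any grouping across a balanced tree. A minimal sketch (sequential reference scan; real kernels use Blelloch's work-efficient variant):

```python
import numpy as np

def combine(p, q):
    """Compose two recurrence steps: apply p, then q."""
    a1, b1 = p
    a2, b2 = q
    return (a2 * a1, a2 * b1 + b2)

def prefix_scan(pairs):
    """Inclusive scan of `combine` (sequential reference).
    out[t][1] equals h_t of the recurrence with h_{-1} = 0."""
    out = [pairs[0]]
    for p in pairs[1:]:
        out.append(combine(out[-1], p))
    return out
```

Because `combine` is associative, a parallel implementation can evaluate it over a tree of depth O(log L) instead of this O(L) loop.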

3. GQA Efficiency Optimization

  • KV head sharing: 8 KV heads serve 24 (4B) or 32 (8B) Q heads
  • Memory savings: Significantly reduced KV cache

4. RoPE Position Encoding

  • Relative position: Encodes relative position rather than absolute
  • Extrapolation: Supports longer sequences
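RoPE rotates each pair of channel dimensions by an angle proportional to the token's position, with per-pair frequency base^(-2i/d) (base=10000 here); because a dot product between two rotated vectors depends only on the angle difference, the encoding is relative. A minimal sketch, with illustrative names:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding. x: [L, d] with d even."""
    L, d = x.shape
    pos = np.arange(L)[:, None]                   # positions, [L, 1]
    inv = base ** (-np.arange(0, d, 2) / d)       # per-pair frequencies, [d/2]
    ang = pos * inv[None, :]                      # rotation angles, [L, d/2]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]               # interleaved (even, odd) pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

In practice the cos/sin tables are precomputed and cached up to max_len, as the RotaryEmbedding component in this project does.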

5. Mixture of Experts (MoE)

  • Sparse activation: Only Top-6 of 128 experts active per token
  • Shared experts: 2 experts always active for every token
  • SquaredReLU: Activation function for expert outputs
  • Routing: MLP router with sigmoid gating

6. QK Normalization

  • Training stability: Normalize Q and K per head
  • Mamba3 improvement: Enhanced attention mechanism

12. MoE Forward Pass Details

  Input x [B*L, H]
    -> Router: MLP (H -> E) -> sigmoid -> Top-K
    -> for each selected expert k:
         up projection -> SquaredReLU -> down projection -> weight_k * output
    -> for each shared expert j:
         up projection -> GELU -> down projection
    -> sum(expert outputs) + shared output
    -> reshape to [B, L, H]
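The whole pass can be sketched end to end in NumPy: route each token, run its Top-K gated experts with SquaredReLU, add the always-active shared experts with GELU, and sum. Mini-config sizes; all weight names and the tanh GELU approximation are assumptions.

```python
import numpy as np

def squared_relu(x):
    return np.maximum(x, 0.0) ** 2

def gelu(x):  # tanh approximation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def moe_forward(x, W_router, experts, shared, top_k=2):
    """x: [T, H] flattened tokens; experts/shared: lists of (W_up, W_down)."""
    gates = 1.0 / (1.0 + np.exp(-(x @ W_router)))   # sigmoid router scores, [T, E]
    idx = np.argsort(-gates, axis=-1)[:, :top_k]    # Top-K expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in idx[t]:                            # gated experts (sparse)
            W_up, W_down = experts[e]
            out[t] += gates[t, e] * (squared_relu(x[t] @ W_up) @ W_down)
        for W_up, W_down in shared:                 # shared experts (always active)
            out[t] += gelu(x[t] @ W_up) @ W_down
    return out
```

A production kernel would batch tokens per expert rather than loop per token, but the dataflow is the same.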

Document generated: 2026-03-24
Project: NemotronMamba3 (C# Implementation with MoE)
