Nemotron Architecture (Mamba3 + Transformer + MoE)

Nemotron-Mamba3 Architecture Documentation

╔══════════════════════════════════════════════════════════╗
║         Nemotron-Mamba3 Hybrid Model Demo                ║
║         (Mamba3 SSM + Transformer + MoE)                 ║
╚══════════════════════════════════════════════════════════╝

═══════════════════════════════════════════════════════════
[Model Configuration]
═══════════════════════════════════════════════════════════
  Vocab size:          32,000
  Hidden size:         512
  Mamba3 layers:       2
  Transformer layers:  4
  Attention heads:     8 (Q) / 2 (KV)
  State size:          64
  FFN intermediate:    2048
  MoE experts:         8 (2 active + 1 shared)
  MoE enabled:         yes

═══════════════════════════════════════════════════════════
[Model Initialization]
═══════════════════════════════════════════════════════════
  Creating model components...
  ✓ Token embedding initialized [vocab=32000, hidden=512]
  ✓ Mamba3 layer stack initialized [2 layers]
  ✓ Transformer layer stack initialized [4 layers]
  ✓ Final normalization layer initialized
  ✓ LM output head initialized (tied weights)

═══════════════════════════════════════════════════════════
[Tokenizer Initialization]
═══════════════════════════════════════════════════════════
  ✓ Tokenizer initialized, vocab size: 256

═══════════════════════════════════════════════════════════
[Forward Pass Test]
═══════════════════════════════════════════════════════════
  Input shape: [batch=1, seq_len=5]
  Input token IDs: [1, 2, 3, 4, 5]

  Starting forward pass...
  ───────────────────────────────────────────────────────────
  ├─ [Layer 0] Token embedding
  │   Input: [batch=1, seq_len=5] token IDs
  │   Output: [batch=1, seq_len=5, hidden=512]
  ├─ [Layer 1] Mamba3 layer #1
  │   Input shape: [1, 5, 512]
  │   ├─ RMSNorm
  │   │   Output: [1, 5, 512]
  │   ├─ Selective SSM (state space model)
  │   │   State size: 64, expand factor: 2
  │   │   Output: [1, 5, 512]
  │   └─ MLP path: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = input + SSM + MLP
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 2] Transformer layer #1
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (batch, heads, seq, head_dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 3] Mamba3 layer #2
  │   Input shape: [1, 5, 512]
  │   ├─ RMSNorm
  │   │   Output: [1, 5, 512]
  │   ├─ Selective SSM (state space model)
  │   │   State size: 64, expand factor: 2
  │   │   Output: [1, 5, 512]
  │   └─ MLP path: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = input + SSM + MLP
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 4] Transformer layer #2
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (batch, heads, seq, head_dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  ├─ [Layer 5] Transformer layer #3
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (batch, heads, seq, head_dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  │
  └─ [Layer 6] Transformer layer #4
  │   Input shape: [1, 5, 512]
  │   ├─ Attention norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   ├─ QKV projection
  │   │   Q: [5, 512] (8 heads x 64 dim)
  │   │   K/V: [5, 128] (2 heads x 64 dim)
  │   ├─ QK normalization (Mamba3 improvement)
  │   │   Normalization done
  │   ├─ RoPE rotary position encoding (base=10000, max_len=1024)
  │   │   Rotation done
  │   ├─ Grouped query attention (GQA)
  │   │   Q heads: 8, KV heads: 2, repeat: 4x
  │   │   Output: [1, 8, 5, 64] (batch, heads, seq, head_dim)
  │   └─ Attention output projection
  │       Output: [1, 5, 512]
  │       └─ Residual merge: h = x + attention
  │
  │   ├─ FFN norm (RMSNorm)
  │   │   Output: [1, 5, 512]
  │   └─ FFN/MLP: MoE (mixture of experts)
  │       Experts: 8, active: 2, shared: 1
  │       Output: [1, 5, 512]
  │       └─ Residual merge: output = h + FFN
  │   Output shape: [1, 5, 512]
  └─ [Layer 7] Final RMSNorm
      Input shape: [1, 5, 512]
      Output shape: [1, 5, 512]

  └─ [Layer 8] LM output head (tied weights)
      Input shape: [1, 5, 512]
      Output shape: [1, 5, 32000]
  ───────────────────────────────────────────────────────────
  ✓ Forward pass completed successfully!
    Elapsed: 1846ms
    Output shape: [batch=1, seq_len=5, vocab=32000]

  [Output Logits Preview (last position)]
    Top-5 token probabilities: [token 0(NaN), token 1(NaN), token 2(NaN), token 3(NaN), token 4(NaN)]

═══════════════════════════════════════════════════════════
  Demo complete!
═══════════════════════════════════════════════════════════

1. Project Overview

Nemotron-Mamba3 is a hybrid large language model that combines the Mamba3 state space model (SSM) with Transformer attention layers and a Mixture of Experts (MoE) feed-forward path.

Core Technical Features

Feature        Description
─────────────  ─────────────────────────────────────────────────
Mamba3 SSM     Selective state space model, O(L) sequence complexity
Transformer    Grouped Query Attention (GQA) for efficient inference
Hybrid stack   Alternating Mamba3 and Transformer layers
MoE            Mixture of Experts: 128 experts, Top-6 active per token
RoPE           Rotary position embedding for long context
SwiGLU         Gated activation function
SquaredReLU    Activation for MoE experts

2. System Architecture Overview

Token IDs [batch, seq_len]
  -> TokenEmbedding (vocab -> hidden)
  -> Mamba3Layer x N
  -> TransformerLayer x N
  -> RMSNorm
  -> LM Head (hidden -> vocab)
  -> Logits [batch, seq_len, vocab]
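To make the pipeline concrete, here is a minimal, runnable Python/NumPy sketch of the top-level flow. The project itself is C#; the function name, shapes, and the identity-style blocks below are illustrative assumptions, not the repository's API.

```python
import numpy as np

def forward(token_ids, embedding, layers):
    """Top-level pipeline sketch: embed -> blocks -> RMSNorm -> tied LM head.

    token_ids: [L] int array; embedding: [V, H] table;
    layers: list of [L, H] -> [L, H] callables (stand-ins for
    Mamba3Layer / TransformerLayer blocks).
    """
    h = embedding[token_ids]                       # TokenEmbedding: [L, H]
    for layer in layers:
        h = h + layer(h)                           # residual around each block
    rms = np.sqrt((h ** 2).mean(-1, keepdims=True) + 1e-6)
    h = h / rms                                    # final RMSNorm (unit gain)
    return h @ embedding.T                         # tied LM head: [L, V]
```

With blocks that return zeros, this reduces to logits = rmsnorm(E[ids]) @ E.T, which is exactly the tied-weight path shown in the diagram.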


3. Complete Data Flow

Token IDs [B, L]
  -> TokenEmbedding.Forward() -> Hidden states [B, L, H]
  -> for layer in (M + T) x N:
       Mamba3 layer:      RMSNorm -> Conv1d (local context) -> Selective SSM (parallel scan) -> MoE / SwiGLU MLP -> residual
       Transformer layer: RMSNorm -> Q/K/V projection -> QK normalization -> RoPE -> GQA attention -> MoE / SwiGLU FFN -> residual
  -> Final RMSNorm
  -> LM Head (tied with embedding)
  -> Logits [B, L, V]

4. Mamba3 Core Architecture

Mamba3 Selective SSM Core:

Input x [B, L, D]
  -> Conv1d (kernel=4, local context)
  -> Selective mechanism:
       X_proj:  x -> A, B, C, d
       DT_proj: x -> d
  -> Discretize: A_bar = exp(dA), B_bar = dB
  -> SSM computation: parallel prefix scan, O(log L)
  -> Output projection (d_inner -> d_model)
  -> + skip connection -> Output [B, L, D]

Mamba3 Core Equations

Delta_t = tau(LinearDelta(x_t))           # Time step parameter (input-dependent)
A_t = tau(A + LinearA(x_t))                # Extended state matrix (input-dependent)
B_t = LinearB(x_t)                        # Input projection
C_t = LinearC(x_t)                        # Output projection
h_t = A_bar_t * h_{t-1} + B_bar_t * x_t  # State update
y_t = C_t * h_t                           # Output
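The equations above can be checked against a short sequential reference. This is a hedged Python/NumPy sketch, not the project's C# code; tau is taken to be softplus and A is assumed to have negative entries for stability, both common Mamba conventions that the source does not spell out.

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential reference of the selective SSM recurrence above.

    x: [L, D] inputs; A: [D, N] state matrix (negative entries);
    W_delta: [D, D], W_B / W_C: [D, N] input-dependent projections.
    """
    L, D = x.shape
    h = np.zeros((D, A.shape[1]))                  # state h_t: [D, N]
    ys = []
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # Delta_t = softplus(...), [D]
        B = x[t] @ W_B                             # B_t: [N]
        C = x[t] @ W_C                             # C_t: [N]
        A_bar = np.exp(delta[:, None] * A)         # discretize: A_bar = exp(dA)
        B_bar = delta[:, None] * B[None, :]        # discretize: B_bar = dB
        h = A_bar * h + B_bar * x[t][:, None]      # h_t = A_bar*h_{t-1} + B_bar*x_t
        ys.append(h @ C)                           # y_t = C_t * h_t, [D]
    return np.stack(ys)                            # [L, D]
```

The real implementation replaces this O(L) loop with the parallel prefix scan described in section 11.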

5. Transformer Architecture (GQA + MoE)

Transformer Layer with GQA + MoE:

Input x [B, L, H]
  -> RMSNorm
  -> Q/K/V projection:
       W_Q: H -> num_heads * d
       W_K: H -> num_kv * d
       W_V: H -> num_kv * d
  -> Reshape and transpose: [B, L, H] -> [B, heads, L, dim]
  -> QK normalization
  -> RoPE position encoding (RotaryEmbedding with cos/sin cache)
  -> Repeat KV: num_kv -> num_heads
  -> Attention: softmax(Q @ K_T / sqrt(d))
  -> Output projection W_O
  -> Residual 1
  -> RMSNorm
  -> MoE / SwiGLU FFN:
       Expert router (sigmoid + top-k)
       128 experts (SquaredReLU) + 2 shared experts (GELU)
       MoE output + shared output
  -> Residual 2
  -> Output [B, L, H]
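The KV-repeat and attention steps above can be sketched in a few lines of NumPy. This is an illustrative single-batch sketch, not the C# implementation; the causal mask is an assumption (the diagram does not show masking), and the head counts in the docstring follow the Mini config.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch (single batch, causal mask).

    q: [Hq, L, d]; k, v: [Hkv, L, d] with Hq % Hkv == 0,
    e.g. Hq=8, Hkv=2 as in the Mini config (repeat = 4x).
    """
    Hq, L, d = q.shape
    rep = Hq // k.shape[0]
    k = np.repeat(k, rep, axis=0)                   # each KV head serves rep Q heads
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # [Hq, L, L]
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)        # causal: no attention to future
    scores -= scores.max(-1, keepdims=True)         # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ v                                    # [Hq, L, d]
```

Sharing each KV head across rep query heads is what shrinks the KV cache by a factor of Hq/Hkv (4x here).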


6. Mixture of Experts (MoE) Architecture

Input x [B, L, H]
  -> Expert router: MLP (H -> E) -> sigmoid gating -> Top-K selection
  -> Gated experts (128): Expert 1..N, each output scaled by its gate (gate_k * expert_k)
  -> Shared experts (2): Shared Exp 1, Shared Exp 2, always active
  -> MoE output + shared output
  -> Output [B, L, H]

MoE Configuration

Parameter           4B Model       8B Model       Mini (Test)
Total experts       128            128            8
Active experts      6              6              2
Shared experts      2              2              1
Expert activation   SquaredReLU    SquaredReLU    SquaredReLU
Shared activation   GELU           GELU           GELU
Router              MLP + Sigmoid  MLP + Sigmoid  MLP + Sigmoid
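The Mini column above (8 experts, top-2, 1 shared) can be exercised with a small NumPy sketch. Everything below — the weight shapes, the per-token loop, the tanh-approximation GELU — is an illustrative assumption, not the C# project's API.

```python
import numpy as np

def squared_relu(x):
    return np.maximum(x, 0.0) ** 2                 # expert activation

def gelu(x):                                       # tanh-approx GELU (shared experts)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def moe_forward(x, W_router, experts, shared, top_k=2):
    """x: [T, H] flattened tokens; experts / shared: lists of (W_up, W_down).

    Sigmoid gating + per-token top-k routing; shared experts run on every token.
    """
    gates = 1.0 / (1.0 + np.exp(-(x @ W_router)))  # [T, E] sigmoid gates
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in np.argsort(gates[t])[-top_k:]:    # indices of the top-k experts
            W_up, W_down = experts[e]
            out[t] += gates[t, e] * (squared_relu(x[t] @ W_up) @ W_down)
    for W_up, W_down in shared:                    # always-active shared path
        out += gelu(x @ W_up) @ W_down
    return out                                     # [T, H]
```

Only top_k of the E gated experts do any work per token, which is the source of MoE's compute savings at large expert counts.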

7. Configuration and Model Variants

NemotronConfig
+int VocabSize
+int HiddenSize
+int NumMambaLayers
+int NumTransformerLayers
+int NumAttentionHeads
+int NumKVHeads
+int IntermediateSize
+int StateSize
+int ConvKernelSize
+int MaxSeqLen
+float RopeBase
+int ExpandFactor
+int MIMORank
+bool UseComplexSSM
+bool UseExpTrapezoid
+int NumExperts
+int NumActiveExperts
+int NumSharedExperts
+bool UseMoE

+NemotronConfig Nemotron4B()
+NemotronConfig Nemotron8B()
+NemotronConfig Mini()
+NemotronConfig Nemotron4BWithMoE()
+NemotronConfig Nemotron8BWithMoE()

NemotronMamba3Model
-NemotronConfig _config
-TokenEmbedding _embedding
-Mamba3Layer[] _mambaLayers
-TransformerLayer[] _transformerLayers
-Tensor _finalNormWeight
-Tensor _lmHeadWeight

+Forward(inputIds) : Tensor
+Generate(...) : List<int>

Model Configuration Parameters

Parameter           4B Model   4B MoE    8B MoE    Mini (Test)
Hidden size         3072       3072      4096      512
Mamba layers        8          8         16        2
Transformer layers  32         32        56        4
Attention heads     24         24        32        8
KV heads            8          8         8         2
FFN intermediate    8192       8192      10976     2048
State size          128        128       192       64
Max seq len         8192       8192      8192      1024
Vocab size          128256     128256    128256    32000
MoE experts         N/A        128       128       8
Active experts      N/A        6         6         2

8. Forward Pass Flow

Forward(token_ids):
  1. TokenEmbedding: Token IDs [B, L] -> hidden states [B, L, H]
  2. Layer stack, N times (Mamba3 and Transformer alternating):
       Mamba3 layer:      RMSNorm -> Conv1d -> SSM -> MoE/SwiGLU -> hidden states
       Transformer layer: RMSNorm -> QKV -> QKNorm -> RoPE -> GQA -> MoE/SwiGLU -> next layer
  3. Final RMSNorm
  4. LM Head: reverse of the embedding (tied weights) -> Logits [B, L, V]
  Optional: Generate() calls Forward() autoregressively.


9. Inference Generation Flow

Input prompt token_ids
  -> model.Forward(input)
  -> get last-position logits
  -> temperature scaling
  -> Top-K filtering
  -> Top-P (nucleus) filtering
  -> softmax normalization
  -> sampling
  -> append token
  -> EOS token? No: loop back to model.Forward; Yes: return the generated token sequence
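The single-token step of the loop above — temperature, top-k, top-p, softmax, sample — is standard; here is a hedged NumPy sketch of that step (the function name and parameter defaults are illustrative, not the project's).

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """One sampling step: temperature -> top-k -> softmax -> top-p -> sample.

    logits: [V] array for the last position; returns a token id.
    """
    rng = rng or np.random.default_rng()
    logits = logits / temperature                  # temperature scaling
    kth = np.sort(logits)[-top_k]                  # top-k: keep k largest logits
    logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())          # softmax over survivors
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # top-p: smallest prefix whose
    cum = np.cumsum(probs[order])                  # cumulative prob reaches top_p
    keep = order[:np.searchsorted(cum, top_p) + 1]
    p = probs[keep] / probs[keep].sum()            # renormalize kept mass
    return int(rng.choice(keep, p=p))
```

Generate() would call this once per step, append the token, and stop on EOS.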


10. Component File Mapping

Component        File Path                     Description
Core Model       Models/NemotronMamba3.cs      Config class, main model, forward pass
Mamba3 Layer     Layers/Mamba3Layer.cs         Complete Mamba3 layer with MoE
Mamba3 Core      Layers/Mamba3Core.cs          Selective SSM, parallel prefix scan
Transformer      Layers/TransformerLayer.cs    GQA attention, RoPE, optional MoE
MoE Layer        Layers/MoE/MoELayer.cs        Complete MoE implementation
Expert Router    Layers/MoE/ExpertRouter.cs    Router with sigmoid + top-k
Expert           Layers/MoE/Expert.cs          Individual expert with SquaredReLU
Shared Experts   Layers/MoE/SharedExperts.cs   Always-active shared experts
Embedding        Layers/Embedding.cs           TokenEmbedding, RotaryEmbedding
Normalization    Layers/LayerNorm.cs           RMSNorm, LayerNorm, QKNorm
Activation       Layers/Activation.cs          SwiGLU, GELU, SiLU, SquaredReLU
Tensor Core      Core/Tensor.cs                Tensor operations library
Tokenizer        Inference/Tokenizer.cs        Text encode/decode interface

11. Technical Highlights

1. Mamba3 Selective Mechanism

  • Data-dependent: Delta, A, B, C generated from input
  • Selective scanning: Decides what information to retain/ignore

2. Parallel Prefix Scan

  • Complexity: O(log L) parallel depth vs O(L) for a sequential scan (total work remains O(L))
  • Algorithm: Blelloch parallel scan algorithm
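The scan works because the SSM recurrence h_t = a_t*h_{t-1} + b_t is associative under the combine (a1, b1) ∘ (a2, b2) = (a1*a2, b1*a2 + b2). Below is a Hillis-Steele-style sketch in Python — a simpler relative of the Blelloch scan named above, with the same O(log L) depth; on real hardware each round's inner loop is one vectorized step.

```python
import numpy as np

def scan_linear_recurrence(a, b):
    """Prefix scan for h_t = a_t*h_{t-1} + b_t, h_{-1} = 0.

    a, b: [L] arrays; returns [L] array of h_t.
    Runs O(log L) rounds; each round is parallelizable across i.
    """
    a, b = np.asarray(a, float).copy(), np.asarray(b, float).copy()
    L, step = len(a), 1
    while step < L:
        a2, b2 = a.copy(), b.copy()
        for i in range(step, L):          # one vectorized update per round on GPU
            b2[i] = b[i] + a[i] * b[i - step]   # compose with element step back
            a2[i] = a[i] * a[i - step]
        a, b = a2, b2
        step *= 2
    return b                              # b[t] now equals h_t
```

After log2(L) rounds, each b[t] holds the full composed operator applied to the zero initial state, i.e. exactly the sequential result.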

3. GQA Efficiency Optimization

  • KV head sharing: 8 KV heads serve 24/32 Q heads
  • Memory savings: Significantly reduced KV cache

4. RoPE Position Encoding

  • Relative position: Encodes relative position rather than absolute
  • Extrapolation: Supports longer sequences
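The relative-position property is easy to see in a small sketch: RoPE rotates pairs of dimensions by position-dependent angles, so a dot product between rotated vectors depends only on the offset m - n. The NumPy illustration below uses the usual base-10000 pairwise layout, which is a conventional assumption rather than something taken from the C# source.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding. x: [L, d] with d even."""
    L, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # [d/2] per-pair frequencies
    ang = np.outer(np.arange(L), inv_freq)         # [L, d/2] position * frequency
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # even/odd dims form 2-D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # rotate each pair by its angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotations are orthogonal and commute, rope(q)[m] · rope(k)[n] equals q · rope(k)[n-m]: the attention score sees only the relative offset, which is what allows extrapolation beyond trained positions.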

5. Mixture of Experts (MoE)

  • Sparse activation: Only Top-6 of 128 experts active per token
  • Shared experts: 2 experts always active for every token
  • SquaredReLU: Activation function for expert outputs
  • Routing: MLP router with sigmoid gating

6. QK Normalization

  • Training stability: Normalize Q and K per head
  • Mamba3 improvement: Enhanced attention mechanism

12. MoE Forward Pass Details

Input x [B*L, H]
  -> Router: MLP (H -> E) -> sigmoid -> Top-K selection
  -> For each selected expert k: up projection -> SquaredReLU -> down projection -> weight_k * output
  -> For each shared expert j: up projection -> GELU -> down projection
  -> Sum(expert_outputs) + shared_output
  -> Reshape to [B, L, H]

Document generated: 2026-03-24
Project: NemotronMamba3 (C# Implementation with MoE)
