【自然语言处理 NLP】前沿架构与多模态 6.1.1.4 混合架构(Mamba-Transformer Hybrid)

目录

[第一部分 原理详解](#第一部分 原理详解)

[6.1.1.4.1 混合架构的内存-质量-吞吐量权衡](#6.1.1.4.1 混合架构的内存-质量-吞吐量权衡)

[6.1.1.4.2 Jamba块与层交错策略](#6.1.1.4.2 Jamba块与层交错策略)

[6.1.1.4.3 稀疏局部注意力与全局SSM建模](#6.1.1.4.3 稀疏局部注意力与全局SSM建模)

[6.1.1.4.4 专家混合(MoE)与计算效率](#6.1.1.4.4 专家混合(MoE)与计算效率)

[6.1.1.4.5 大规模训练稳定性机制](#6.1.1.4.5 大规模训练稳定性机制)

[第二部分 结构化伪代码](#第二部分 结构化伪代码)

[6.1.1.4.1 混合层调度算法 (Hybrid Layer Scheduling)](#6.1.1.4.1 混合层调度算法 (Hybrid Layer Scheduling))

[6.1.1.4.2 稀疏滑动窗口注意力算法 (Sliding Window Sparse Attention)](#6.1.1.4.2 稀疏滑动窗口注意力算法 (Sliding Window Sparse Attention))

[6.1.1.4.3 Jamba 块前向传播 (Jamba Block Forward Pass)](#6.1.1.4.3 Jamba 块前向传播 (Jamba Block Forward Pass))

[第三部分 代码实现](#第三部分 代码实现)

[6.1.1.4.1 混合层调度器与架构配置](#6.1.1.4.1 混合层调度器与架构配置)

[6.1.1.4.2 稀疏滑动窗口注意力实现](#6.1.1.4.2 稀疏滑动窗口注意力实现)

[6.1.1.4.3 专家混合(MoE)层实现](#6.1.1.4.3 专家混合(MoE)层实现)

[6.1.1.4.4 完整Jamba块与模型架构](#6.1.1.4.4 完整Jamba块与模型架构)

[6.1.1.4.5 对比实验框架(困惑度与稳定性评估)](#6.1.1.4.5 对比实验框架(困惑度与稳定性评估))




第一部分 原理详解

6.1.1.4.1 混合架构的内存-质量-吞吐量权衡

序列建模架构面临内存占用、模型质量与计算吞吐量之间的基本权衡。纯Transformer架构的自注意力机制提供全局依赖建模能力,但其键值缓存(KV Cache)随序列长度线性增长,计算复杂度呈二次方增长。对于长度 L=256K 的上下文,标准Transformer的KV缓存需求达到:

\text{Memory}_{KV} = 2 \times n_{layers} \times L \times d_{model} \times \text{batch} \times \text{bytes\_per\_param}

在16位精度下,7B参数模型(32层、d_{model}=4096)的缓存可达128GB,严重限制长上下文部署。纯Mamba架构通过状态空间压缩将内存降至 O(d_{state} \times d_{model}),实现与序列长度无关的常数内存占用,但在需要精确全局检索的任务上表现受限。
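上述内存估算可用几行代码复现(下列层数与维度为假设的7B级典型配置,仅作示意):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, d_model: int,
                   batch: int = 1, bytes_per_param: int = 2) -> int:
    """KV 缓存总字节数: K 与 V 各一份, 每个注意力层 seq_len x d_model 个元素 (FP16)。"""
    return 2 * n_layers * seq_len * d_model * batch * bytes_per_param

# 假设的 7B 级典型配置: 32 层, d_model=4096, L=256K
full_kv = kv_cache_bytes(256 * 1024, 32, 4096)
print(f"纯Transformer KV缓存: {full_kv / 1024**3:.0f} GB")

# 1:7 混合架构下仅 4 层为注意力层, 只有它们保留 KV 缓存 (未计GQA的额外压缩)
hybrid_kv = kv_cache_bytes(256 * 1024, 4, 4096)
print(f"混合架构 KV缓存: {hybrid_kv / 1024**3:.0f} GB")
```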

混合架构通过结构化层交错打破这种二元对立。定义层类型集合 \mathcal{L}=\{A, M\},其中 A 表示自注意力层,M 表示Mamba层。Jamba架构采用周期性块结构,每个块包含 l 层,注意力与Mamba层按比例 a:m 分布。对于 a:m=1:7、l=8,每个块包含1层注意力与7层Mamba。此配置使KV缓存相比同规模Transformer减少8倍,同时保留注意力层的全局检索精度。

吞吐量分析揭示计算瓶颈的转移。短序列(L < 2048)时,MLP与注意力计算量相当;长序列时,注意力占据主导。Mamba层的线性复杂度使其在长序列上保持每token恒定的FLOPs。混合架构的有效吞吐量 T 可建模为:

T \propto \frac{1}{\alpha \cdot O(L^2) + (1-\alpha) \cdot O(L) + O(1)}

其中 \alpha = \frac{a}{a+m} 为注意力层比例。通过设置 \alpha = 1/8,Jamba在长序列上实现接近纯Mamba的吞吐量,同时保持注意力的表征优势。
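该吞吐量模型可以用一个玩具计算直观化(常数项与系数均为假设,只看量级关系):

```python
def relative_cost(alpha: float, L: int) -> float:
    """每层平均计算量的玩具模型: 注意力按 L^2、Mamba按 L 计 (忽略常数系数)。"""
    return alpha * L ** 2 + (1 - alpha) * L

L = 65536
pure_attn = relative_cost(1.0, L)     # 纯Transformer: alpha = 1
hybrid = relative_cost(1 / 8, L)      # Jamba: alpha = 1/8
print(f"alpha=1/8 相对纯注意力的成本比: {hybrid / pure_attn:.4f}")
```

长序列下成本比趋近 \alpha = 1/8,即注意力层占比直接决定渐近吞吐量。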


6.1.1.4.2 Jamba块与层交错策略

Jamba块是混合架构的基本计算单元,超越简单层堆叠,实现计算原语的深度整合。每个块由 l 个连续子层构成,遵循特定的拓扑约束:

\text{JambaBlock} = \{(L_i, F_i)\}_{i=1}^{l}

其中 L_i \in \{A, M\} 标识层类型,F_i \in \{\text{Dense}, \text{MoE}\} 标识前馈网络类型。对于1:7比例,层类型序列遵循确定性调度模式 \pi = [A, M, M, M, M, M, M, M],确保注意力层均匀分布以提供周期性全局上下文刷新。

每个子层保留残差连接与层归一化前置(Pre-Norm)结构:

x_{i+1} = x_i + \text{Layer}_i(\text{RMSNorm}(x_i))

注意力层采用分组查询注意力(GQA)进一步压缩KV缓存。设总头数为 h,查询头数 h_q = h,键值头数 h_{kv} < h,缓存需求减少因子为 h/h_{kv}。Jamba-1.5-Large配置 h_q = 64、h_{kv} = 8,实现8倍缓存压缩。
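GQA 的缓存压缩因子可以按每token字节数直接核算(注意力层数与 head_dim 为示意取值):

```python
def kv_bytes_per_token(num_kv_heads: int, head_dim: int, n_attn_layers: int,
                       bytes_per_param: int = 2) -> int:
    """每个 token 在全部注意力层上的 KV 缓存字节数 (K 与 V 各一份, FP16)。"""
    return 2 * num_kv_heads * head_dim * n_attn_layers * bytes_per_param

# 示意取值: 8 个注意力层, head_dim=128; MHA 用 64 个 KV 头, GQA 压缩到 8 个
mha = kv_bytes_per_token(num_kv_heads=64, head_dim=128, n_attn_layers=8)
gqa = kv_bytes_per_token(num_kv_heads=8, head_dim=128, n_attn_layers=8)
print(f"MHA: {mha} B/token, GQA: {gqa} B/token, 压缩因子 {mha // gqa}x")
```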

Mamba层内部引入额外RMSNorm以稳定大规模训练。标准Mamba块包含投影、卷积、选择性SSM与门控。在隐藏状态维度 d_{inner} 处插入归一化:

u = \text{RMSNorm}(W_{in}x)

此干预在7B+参数规模训练中被证明可有效抑制损失尖峰(loss spikes),防止内部激活值爆炸导致的训练崩溃。
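该干预可用一个极简 PyTorch 草图示意(RMSNorm 为标准实现,d_model、d_inner 为假设取值):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """标准 RMSNorm: x / sqrt(mean(x^2) + eps) * g。"""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt() * self.weight

# 示意: 在输入投影后、进入卷积/SSM 前插入归一化 (d_model=64, d_inner=128 为假设取值)
d_model, d_inner = 64, 128
w_in = nn.Linear(d_model, d_inner, bias=False)
norm = RMSNorm(d_inner)
x = torch.randn(2, 16, d_model) * 100.0   # 人为放大的输入, 模拟激活尖峰
u = norm(w_in(x))                          # 无论输入量级如何, 输出均方根被拉回约 1
print(u.pow(2).mean(-1).sqrt().mean().item())
```

无论投影输出被放大多少倍,归一化后激活的均方根都稳定在 1 附近,这正是抑制损失尖峰的机制。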


6.1.1.4.3 稀疏局部注意力与全局SSM建模

纯全局注意力的二次方复杂度不可持续。混合架构引入空间稀疏化策略,将注意力限制在局部邻域,同时依赖Mamba层处理全局依赖。定义滑动窗口注意力(Sliding Window Attention)的局部邻域半径 w ,每个token仅关注前后 w 个位置:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + B_w\right)V, \qquad B_w[i,j] = \begin{cases} 0 & M_w[i,j] = 1 \\ -\infty & M_w[i,j] = 0 \end{cases}

其中 M_w \in \{0, 1\}^{L \times L} 为掩码矩阵,M_w[i, j] = 1 当且仅当 |i - j| \le w;窗口外位置的分数在softmax前被置为 -\infty。计算复杂度降至 O(L \cdot w \cdot d),与序列长度线性相关。
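掩码 M_w 的构造及其非零元数的线性增长可以直接验证(示意):

```python
import torch

def sliding_window_mask(L: int, w: int) -> torch.Tensor:
    """构造 M_w ∈ {0,1}^{L×L}: M_w[i, j] = 1 当且仅当 |i - j| <= w。"""
    idx = torch.arange(L)
    return ((idx[:, None] - idx[None, :]).abs() <= w).long()

L, w = 16, 2
M = sliding_window_mask(L, w)
# 非零元数约为 L*(2w+1) (边界处略少), 随 L 线性增长, 对应 O(L·w·d) 的计算复杂度
print(M.sum().item())
```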

全局-局部分工机制确立:滑动窗口注意力负责局部细粒度特征提取(如语法结构、短程共指),Mamba层通过选择性状态空间处理长程压缩。这种分工基于归纳偏置的互补性:注意力擅长随机访问任意位置,Mamba擅长线性传播与状态演化。

在极端长上下文(L > 100K)场景,可采用分层稀疏策略。Command-A架构采用3:1比例的局部与全局注意力交错,每4层设置1层全局注意力,其余使用滑动窗口。Jamba则依赖Mamba层提供全局连通性,注意力层(即使稀疏)专注于关键位置检索。

信息流动分析表明,混合架构的有效感受野是各层感受野的复合函数。设第 i 层感受野为 R_i ,则 n 层堆叠后的全局感受野满足:

R_{\text{global}} \ge \max_i R_i + \sum_{j \in \mathcal{M}} \Delta_j \cdot N_j

其中 \mathcal{M} 为Mamba层索引集合,N_j 为Mamba层数,\Delta_j 为离散化步长。通过调整 \Delta,Mamba层可扩展有效感受野至整个序列,弥补局部注意力的全局性缺失。


6.1.1.4.4 专家混合(MoE)与计算效率

专家混合(Mixture-of-Experts)在混合架构中进一步解耦参数量与计算量。传统密集MLP在前向传播中激活全部参数,MoE通过路由机制仅激活子集。定义专家集合 E = \{E_1, \dots, E_n\},每个专家为独立MLP。路由函数 G(x) \in \mathbb{R}^n 生成分配权重:

G(x) = \text{softmax}(W_g x + b_g)

选择Top-K 专家(通常 K=2)激活:

\text{MoE}(x) = \sum_{i \in \text{TopK}(G(x), K)} G(x)_i \cdot E_i(x)

Jamba配置 n=16 个专家,每 e=2 层将标准MLP替换为MoE层。总参数量 P_{total} \approx 52B,激活参数量 P_{active} \approx 12B,实现约4.3倍参数扩展而不增加推理计算量。
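Top-K 路由的前向逻辑可以用几行 PyTorch 勾勒(专家数与维度均为玩具取值;一些实现还会对 Top-K 权重重新归一化,此处从略):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_experts, top_k, d = 4, 2, 8          # 玩具规模 (Jamba 实际为 n=16, K=2)
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
w_g = torch.nn.Linear(d, n_experts)    # 路由器, 对应 G(x) 中的 W_g

x = torch.randn(d)
gate = F.softmax(w_g(x), dim=-1)       # 分配权重 G(x) ∈ R^n
weights, idx = gate.topk(top_k)        # 仅激活 Top-K 个专家
y = sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))
print(y.shape, idx.tolist())
```

前向计算只触碰 K 个专家的参数,这正是"参数量扩展、计算量不变"的来源。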

负载均衡是关键挑战。若路由总是选择相同专家,将造成专家塌陷与设备闲置。引入辅助损失函数确保均匀分配:

L_{balance} = \alpha \cdot n \cdot \sum_{i=1}^{n} f_i \cdot P_i

其中 f_i 为批次中分配给专家 i 的token比例,P_i 为专家 i 的路由概率均值,\alpha 为平衡系数。此损失最小化时,所有专家获得均等利用率。
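该辅助损失的一个最小实现草图如下(\alpha 取值为示意;当路由概率完全均匀时 \sum_i f_i P_i = 1/n,损失到达下界 \alpha):

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, top_k: int,
                      alpha: float = 0.01) -> torch.Tensor:
    """L_balance = alpha * n * sum_i f_i * P_i。"""
    n = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)           # [tokens, n]
    _, top_idx = probs.topk(top_k, dim=-1)             # 每个 token 的 Top-K 专家
    assignment = F.one_hot(top_idx, n).sum(1).float()  # [tokens, n] 的 0/1 分配
    f = assignment.mean(0) / top_k                     # f_i: 分配给专家 i 的 token 比例
    P = probs.mean(0)                                  # P_i: 路由概率均值
    return alpha * n * (f * P).sum()

loss = load_balance_loss(torch.zeros(1024, 8), top_k=2)  # 完全均匀的路由概率
print(loss.item())
```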

MoE与混合架构的协同效应显著。注意力层提供全局路由决策所需上下文,Mamba层以线性成本处理高吞吐量专家计算。在256K上下文长度,Jamba的MoE配置配合混合层可实现仅4GB的KV缓存占用,相比LLaMA-2的128GB降低32倍。


6.1.1.4.5 大规模训练稳定性机制

7B+参数规模的混合架构训练面临独特的稳定性挑战。Mamba层内部激活值在训练初期易出现尖峰,源于选择性SSM的门控机制与离散化梯度的相互作用。

稳定性干预措施包括:

  1. 内部RMSNorm:在Mamba块的核心投影层插入额外归一化。设标准Mamba块包含输入投影 W_{in}、卷积 \text{Conv}、选择性SSM与输出投影 W_{out},在卷积前与SSM前分别插入:

    u = \text{RMSNorm}(W_{in}x), \quad u' = \text{RMSNorm}(\text{Conv}(u))

  2. 梯度裁剪:对Mamba层参数实施独立梯度阈值。设全局裁剪阈值为 \tau,Mamba子模块阈值 \tau_{mamba} = 0.5\tau,防止状态转移矩阵梯度爆炸。

  3. 初始化校准:离散化参数 \Delta 的初始化直接影响稳定性。采用逆softplus初始化确保初始步长 \Delta_{init} \approx 0.01:

    b_\Delta = \text{softplus}^{-1}(\Delta_{init}) = \log(\exp(\Delta_{init}) - 1)

  4. 精度混合:在 forward 中使用BF16加速计算,但对MoE路由logits、SSM离散化指数运算保留FP32精度,防止数值下溢。
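其中第3条的逆softplus初始化可以直接数值验证(示意):

```python
import math

def inv_softplus(y: float) -> float:
    """softplus^{-1}(y) = log(exp(y) - 1), 作为 Delta 投影的初始偏置。"""
    return math.log(math.exp(y) - 1.0)

delta_init = 0.01
b_delta = inv_softplus(delta_init)         # 约 -4.6
recovered = math.log1p(math.exp(b_delta))  # softplus(b_delta), 应还原出 delta_init
print(b_delta, recovered)
```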

训练监控指标扩展至专家负载方差 \sigma_{load}^2 与Mamba层激活均值 \mu_{act}。当 \sigma_{load}^2 > 0.1 或 \mu_{act} > 10.0 时,触发学习率回滚(rollback)至前一稳定检查点。


第二部分 结构化伪代码

6.1.1.4.1 混合层调度算法 (Hybrid Layer Scheduling)

为了在长序列建模中平衡计算效率与表征能力,Jamba架构采用了基于周期性块的调度策略。该算法(Algorithm 1)定义了自注意力层(Attention)与状态空间模型层(Mamba)的交错逻辑,并集成了专家混合机制(MoE)。

Algorithm 1: Jamba Layer Scheduler

Input: block size l, attention ratio a, mamba ratio m, MoE period e, number of blocks N_b
Output: layer configuration sequence C of length N_b × l

 1. C ← empty list
 2. for b ← 1 to N_b do
 3.     π ← array of length l, all entries undefined
 4.     n_a ← ⌈l · a / (a + m)⌉                       // number of attention layers per block
 5.     n_m ← l − n_a                                  // remaining layers assigned to Mamba
 6.     // distribute attention layers evenly across the block
 7.     for i ← 0 to n_a − 1 do
 8.         pos ← ⌊i · l / n_a⌋
 9.         if ((b − 1) · l + pos) mod e == 0 then F ← MoE else F ← Dense
10.         π[pos] ← (A, F)
11.     end for
12.     // populate remaining positions with Mamba layers
13.     for j ← 0 to l − 1 do
14.         if π[j] is undefined then
15.             if ((b − 1) · l + j) mod e == 0 then F ← MoE else F ← Dense
16.             π[j] ← (M, F)
17.         end if
18.     end for
19.     append π to C
20. end for
21. return C


6.1.1.4.2 稀疏滑动窗口注意力算法 (Sliding Window Sparse Attention)

针对长序列带来的二次方复杂度问题,算法 2 描述了稀疏滑动窗口注意力的实现方案。该方法结合了分组查询注意力(GQA)与局部掩码技术,将复杂度降低至线性量级。

Algorithm 2: Sliding Window Sparse Attention

Input: queries Q ∈ R^{L×d_k}, keys K ∈ R^{L×d_k}, values V ∈ R^{L×d_v}, window size w, GQA group size g
Output: aggregated output O ∈ R^{L×d_v}

 1. h_kv ← K.shape[1] / g                              // reduce KV heads by factor g for memory efficiency
 2. O ← 0 ∈ R^{L×d_v}
 3. for i ← 0 to L − 1 do                              // parallelizable over sequence positions
 4.     w_start ← max(0, i − w)
 5.     w_end ← min(L, i + w + 1)
 6.     K_local ← K[w_start:w_end, :]                  // slice local window keys
 7.     V_local ← V[w_start:w_end, :]                  // slice local window values
 8.     S ← Q[i, :] · K_local^T / sqrt(d_k)            // local attention scores
 9.     A ← softmax(S)                                 // normalize weights
10.     O[i, :] ← A · V_local                          // weighted sum of values
11. end for
12. return O


6.1.1.4.3 Jamba 块前向传播 (Jamba Block Forward Pass)

Jamba 块作为核心计算单元,通过统一的残差框架整合了自注意力分支、Mamba 分支以及 MoE 专家路由逻辑。算法 3 详细说明了这一复杂的计算流。

Algorithm 3: Jamba Block Forward Pass

Input: input x ∈ R^{L×D}, configuration (L_i, F_i), MoE experts E = {E_k}_{k=1}^{n}, router W_g
Output: transformed output y ∈ R^{L×D}

 1. x_res ← x                                          // store identity for residual connection
 2. x ← RMSNorm(x)                                     // pre-normalization
 3. if layer type L_i == A then
 4.     // attention path with GQA optimization
 5.     Q ← x W_q;  K ← x W_k;  V ← x W_v
 6.     K ← reshape(K, [L, h_kv, d_k])
 7.     V ← reshape(V, [L, h_kv, d_v])
 8.     x ← SlidingWindowAttention(Q, K, V, w) · W_o
 9. else if layer type L_i == M then
10.     // Mamba path with selective SSM and internal stability normalization
11.     u ← RMSNorm(x W_in)
12.     u ← CausalConv1D(u)
13.     u ← RMSNorm(SiLU(u))
14.     Δ, B, C ← SelectiveProjection(u)
15.     h ← ParallelScanSSM(Δ, B, C, u)
16.     x ← h W_out
17. end if
18. // feed-forward network (FFN) selection
19. if feed-forward type F_i == MoE then
20.     g ← softmax(x W_g)                             // routing probabilities
21.     I ← TopK(g, K)                                 // sparsely activate the K most probable experts
22.     x ← Σ_{k ∈ I} g_k · E_k(x)
23. else
24.     x ← SwiGLU(x)                                  // standard dense FFN
25. end if
26. y ← x_res + Dropout(x)                             // residual summation and dropout
27. return y


第三部分 代码实现

6.1.1.4.1 混合层调度器与架构配置

脚本说明:实现Jamba块的层调度逻辑,支持可配置的Attention:Mamba比例与MoE周期性插入。包含可视化层分布与计算成本估算。

"""
Script: hybrid_scheduler.py
Content: Jamba layer scheduler with configurable attention/mamba ratios and MoE placement
Usage: python hybrid_scheduler.py --visualize --layers 32 --ratio 1:7 --moe-every 2
Functions:
    - JambaScheduler: Generates layer interleaving patterns
    - compute_memory_footprint: Estimates KV cache and active parameters
    - visualize_layer_distribution: Plots layer types and MoE placement
"""

import torch
import matplotlib.pyplot as plt
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple, Literal
import argparse


@dataclass
class JambaConfig:
    """Configuration for Jamba hybrid architecture."""
    num_layers: int = 32
    attention_ratio: Tuple[int, int] = (1, 7)  # a:m ratio
    moe_every: int = 2  # Insert MoE every e layers
    num_experts: int = 16
    top_k: int = 2
    d_model: int = 4096
    num_heads: int = 32
    num_kv_heads: int = 8  # GQA compression
    window_size: int = 4096  # For sparse attention
    batch_size: int = 1
    seq_len: int = 65536
    
    @property
    def attention_layers(self) -> int:
        a, m = self.attention_ratio
        total = a + m
        return int(self.num_layers * a / total)
    
    @property
    def mamba_layers(self) -> int:
        return self.num_layers - self.attention_layers


class JambaScheduler:
    """
    Generates layer interleaving patterns for Jamba architecture.
    Ensures even distribution of attention layers within blocks.
    """
    def __init__(self, config: JambaConfig):
        self.config = config
        self.pattern = self._generate_pattern()
        
    def _generate_pattern(self) -> List[Tuple[Literal['A', 'M'], Literal['Dense', 'MoE']]]:
        """
        Generate (LayerType, FFNType) tuples for each layer.
        Strategy: Evenly distribute attention layers, fill rest with Mamba.
        """
        L = self.config.num_layers
        num_a = self.config.attention_layers
        num_m = self.config.mamba_layers
        
        # Initialize all as Mamba
        pattern: List[Literal['A', 'M']] = ['M'] * L
        
        # Distribute attention layers evenly
        if num_a > 0:
            step = L / num_a
            for i in range(num_a):
                pos = int(i * step) % L
                # Avoid collision by finding next available
                while pattern[pos] == 'A' and pos < L - 1:
                    pos += 1
                pattern[pos] = 'A'
        
        # Determine MoE placement
        moe_pattern: List[Literal['Dense', 'MoE']] = ['Dense'] * L
        for i in range(0, L, self.config.moe_every):
            moe_pattern[i] = 'MoE'
            
        return list(zip(pattern, moe_pattern))
    
    def get_layer_type(self, layer_idx: int) -> Tuple[Literal['A', 'M'], Literal['Dense', 'MoE']]:
        """Get configuration for specific layer."""
        return self.pattern[layer_idx]
    
    def compute_compute_cost(self, seq_len: int) -> dict:
        """
        Estimate FLOPs for forward pass.
        Returns breakdown by component.
        """
        d = self.config.d_model
        h = self.config.num_heads
        h_kv = self.config.num_kv_heads
        
        costs = {
            'attention_layers': 0,
            'mamba_layers': 0,
            'moe_layers': 0,
            'dense_ffn_layers': 0,
            'total_active_params': 0
        }
        
        for layer_type, ffn_type in self.pattern:
            # Attention cost: O(L^2 * d) for full, O(L * w * d) for sparse
            if layer_type == 'A':
                # Sparse sliding window
                w = min(self.config.window_size, seq_len)
                attn_flops = 4 * seq_len * w * d  # QK^T and AV matmuls; GQA cuts KV memory, not score FLOPs
                costs['attention_layers'] += attn_flops
            else:
                # Mamba: O(L * d * d_state), assume d_state=16 typically
                d_state = 16
                mamba_flops = 2 * seq_len * d * d_state * 4  # 4x for projections
                costs['mamba_layers'] += mamba_flops
            
            # FFN cost
            if ffn_type == 'MoE':
                # MoE: top_k experts active, each 4*d model dim (SwiGLU)
                expert_flops = self.config.top_k * 2 * seq_len * d * (4 * d)
                costs['moe_layers'] += expert_flops
                costs['total_active_params'] += self.config.top_k * 4 * d * d / 1e9  # Billions
            else:
                # Dense SwiGLU: 4*d intermediate
                dense_flops = 2 * seq_len * d * (4 * d)
                costs['dense_ffn_layers'] += dense_flops
                costs['total_active_params'] += 4 * d * d / 1e9
                
        costs['total_flops'] = (costs['attention_layers'] + costs['mamba_layers'] + 
                               costs['moe_layers'] + costs['dense_ffn_layers'])
        return costs
    
    def estimate_memory(self, seq_len: int) -> dict:
        """
        Estimate memory footprint in GB.
        Accounts for KV cache differences between Attention and Mamba.
        """
        batch = self.config.batch_size
        d = self.config.d_model
        h_kv = self.config.num_kv_heads
        head_dim = d // self.config.num_heads
        
        # KV cache per layer
        # Attention: store K, V of shape [batch, h_kv, seq, head_dim]
        kv_per_layer_attn = 2 * batch * h_kv * seq_len * head_dim * 2 / (1024**3)  # GB, fp16
        
        # Mamba: store state of shape [batch, d_model, d_state] or [batch, d_model] for convolution
        # Minimal compared to KV cache
        kv_per_layer_mamba = 2 * batch * d * 16 * 2 / (1024**3)  # GB, state dimension 16
        
        total_kv = 0
        for layer_type, _ in self.pattern:
            if layer_type == 'A':
                total_kv += kv_per_layer_attn
            else:
                total_kv += kv_per_layer_mamba
                
        return {
            'kv_cache_gb': total_kv,
            'kv_per_layer_attn_gb': kv_per_layer_attn,
            'kv_per_layer_mamba_gb': kv_per_layer_mamba,
            'model_params_gb': self.config.num_layers * 4 * d * d * 2 / (1024**3)  # Rough estimate
        }


def visualize_layer_distribution(scheduler: JambaScheduler, save_path: str = 'jamba_layers.png'):
    """Visualize layer type distribution and compute characteristics."""
    config = scheduler.config
    pattern = scheduler.pattern
    
    fig, axes = plt.subplots(3, 1, figsize=(14, 10))
    
    # Plot 1: Layer type distribution
    layer_types = [p[0] for p in pattern]
    colors = ['#FF6B6B' if t == 'A' else '#4ECDC4' for t in layer_types]
    
    axes[0].bar(range(len(layer_types)), [1]*len(layer_types), color=colors, edgecolor='black')
    axes[0].set_xlabel('Layer Index')
    axes[0].set_ylabel('Layer Type')
    axes[0].set_title(f'Jamba Layer Distribution (A:M = {config.attention_ratio[0]}:{config.attention_ratio[1]}, '
                     f'Total {len(layer_types)} layers)')
    
    # Add legend
    from matplotlib.patches import Patch
    legend_elements = [Patch(facecolor='#FF6B6B', label='Attention'),
                      Patch(facecolor='#4ECDC4', label='Mamba')]
    axes[0].legend(handles=legend_elements, loc='upper right')
    
    # Mark MoE layers
    moe_positions = [i for i, (t, f) in enumerate(pattern) if f == 'MoE']
    for pos in moe_positions:
        axes[0].axvline(x=pos, color='yellow', alpha=0.5, linewidth=2, linestyle='--')
    
    # Plot 2: Memory footprint vs sequence length
    seq_lens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
    memories = []
    reference_memories = []  # Pure transformer for comparison
    
    for sl in seq_lens:
        mem = scheduler.estimate_memory(sl)['kv_cache_gb']
        memories.append(mem)
        # Pure transformer: all layers are attention with full KV cache
        pure_kv = 2 * config.batch_size * config.num_heads * sl * (config.d_model // config.num_heads) * 2 / (1024**3)
        pure_kv *= config.num_layers  # All layers
        reference_memories.append(pure_kv)
    
    axes[1].plot(seq_lens, memories, 'o-', label='Jamba Hybrid', linewidth=2, markersize=6)
    axes[1].plot(seq_lens, reference_memories, 's--', label='Pure Transformer', linewidth=2, markersize=6)
    axes[1].set_xlabel('Sequence Length')
    axes[1].set_ylabel('KV Cache Memory (GB)')
    axes[1].set_title('Memory Scaling: Hybrid vs Pure Transformer')
    axes[1].set_xscale('log', base=2)
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    # Plot 3: Compute cost breakdown
    seq_test = 8192
    costs = scheduler.compute_compute_cost(seq_test)
    categories = ['Attention\n(Sparse)', 'Mamba\n(SSM)', 'MoE\n(FFN)', 'Dense\n(FFN)']
    values = [costs['attention_layers']/1e9, costs['mamba_layers']/1e9, 
              costs['moe_layers']/1e9, costs['dense_ffn_layers']/1e9]
    colors_bar = ['#FF6B6B', '#4ECDC4', '#FFE66D', '#95E1D3']
    
    bars = axes[2].bar(categories, values, color=colors_bar, edgecolor='black')
    axes[2].set_ylabel('FLOPs (Billions)')
    axes[2].set_title(f'Compute Breakdown at Sequence Length {seq_test}')
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        axes[2].text(bar.get_x() + bar.get_width()/2., height,
                    f'{height:.1f}B', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.savefig(save_path, dpi=150, bbox_inches='tight')
    print(f"Saved visualization to {save_path}")
    plt.show()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--visualize', action='store_true', help='Generate visualization')
    parser.add_argument('--layers', type=int, default=32, help='Total number of layers')
    parser.add_argument('--ratio', type=str, default='1:7', help='Attention:Mamba ratio (e.g., 1:7)')
    parser.add_argument('--moe-every', type=int, default=2, help='MoE layer frequency')
    args = parser.parse_args()
    
    # Parse ratio
    a, m = map(int, args.ratio.split(':'))
    
    config = JambaConfig(
        num_layers=args.layers,
        attention_ratio=(a, m),
        moe_every=args.moe_every
    )
    
    scheduler = JambaScheduler(config)
    
    print(f"Jamba Scheduler Configuration:")
    print(f"  Total layers: {config.num_layers}")
    print(f"  Attention layers: {config.attention_layers} ({config.attention_layers/config.num_layers*100:.1f}%)")
    print(f"  Mamba layers: {config.mamba_layers} ({config.mamba_layers/config.num_layers*100:.1f}%)")
    print(f"  MoE layers: {len([p for p in scheduler.pattern if p[1] == 'MoE'])}")
    
    if args.visualize:
        visualize_layer_distribution(scheduler)
    else:
        # Print sample layer distribution
        print("\nLayer pattern (first 16):")
        for i, (lt, ft) in enumerate(scheduler.pattern[:16]):
            print(f"  Layer {i:2d}: {lt} + {ft}")

6.1.1.4.2 稀疏滑动窗口注意力实现

脚本说明:实现GQA(分组查询注意力)与滑动窗口稀疏化,支持局部注意力掩码与高效内存访问模式。

"""
Script: sparse_attention.py
Content: Sliding window sparse attention with GQA (Grouped Query Attention)
Usage: python sparse_attention.py --benchmark
Functions:
    - SlidingWindowAttention: Local sparse attention with configurable window size
    - GroupedQueryAttention: KV cache sharing across query heads
    - benchmark_attention: Compare sparse vs dense attention performance
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import time
import matplotlib.pyplot as plt
from typing import Optional
import argparse


class GroupedQueryAttention(nn.Module):
    """
    Grouped Query Attention (GQA) reducing KV cache by sharing key/value heads across query heads.
    Configuration: num_heads (query), num_kv_heads (key/value), where num_kv_heads <= num_heads.
    """
    def __init__(self, d_model: int, num_heads: int = 32, num_kv_heads: int = 8, 
                 window_size: int = 4096, dropout: float = 0.0):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        assert num_heads % num_kv_heads == 0, "num_heads must be divisible by num_kv_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.window_size = window_size
        self.dropout = dropout
        
        # Q, K, V projections
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        
        # For causal masking
        self.register_buffer('bias', None)
        
    def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Forward with sliding window sparse attention.
        
        Args:
            x: [batch, seq_len, d_model]
            attention_mask: Optional [batch, seq_len] padding mask
            
        Returns:
            output: [batch, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape
        
        # Project
        Q = self.q_proj(x)  # [batch, L, d_model]
        K = self.k_proj(x)  # [batch, L, num_kv_heads * head_dim]
        V = self.v_proj(x)
        
        # Reshape to [batch, heads, seq, head_dim]
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        
        # Expand K, V to match Q heads (GQA broadcasting)
        # num_kv_heads = 1 -> all Q heads share same KV
        # num_kv_heads = num_heads -> standard MHA
        if self.num_kv_heads != self.num_heads:
            # Expand by repeating: [batch, kv_heads, L, dim] -> [batch, heads, L, dim]
            reps = self.num_heads // self.num_kv_heads
            K = K.repeat_interleave(reps, dim=1)
            V = V.repeat_interleave(reps, dim=1)
        
        # Sliding window attention
        if seq_len > self.window_size:
            # Manual sliding window implementation
            output = self._sliding_window_attn(Q, K, V)
        else:
            # Standard attention for short sequences
            output = self._standard_attn(Q, K, V, attention_mask)
        
        # Reshape and project out
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        output = self.o_proj(output)
        
        return F.dropout(output, p=self.dropout, training=self.training)
    
    def _sliding_window_attn(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        """
        Compute attention with sliding window constraint.
        Each position can only attend to [i-window, i+window].
        """
        batch, heads, L, dim = Q.shape
        w = self.window_size
        
        # Pad for efficient windowed computation
        pad_left = w
        K_padded = F.pad(K, (0, 0, pad_left, w), value=0)  # [batch, heads, L+2w, dim]
        V_padded = F.pad(V, (0, 0, pad_left, w), value=0)
        
        # Unfold to get sliding windows: [batch, heads, L, 2w+1, dim]
        K_windows = K_padded.unfold(2, 2*w+1, 1).transpose(-2, -1)  # [b, h, L, 2w+1, dim]
        V_windows = V_padded.unfold(2, 2*w+1, 1).transpose(-2, -1)
        
        # Compute attention scores for each window
        # Q: [b, h, L, 1, dim], K: [b, h, L, 2w+1, dim]
        scores = torch.matmul(Q.unsqueeze(3), K_windows.transpose(-2, -1)).squeeze(3) / math.sqrt(dim)
        # scores: [b, h, L, 2w+1]
        
        # Causal mask: ensure we don't look at future tokens within window
        # Create mask for valid positions
        positions = torch.arange(L, device=Q.device).unsqueeze(1)  # [L, 1]
        window_positions = torch.arange(2*w+1, device=Q.device).unsqueeze(0)  # [1, 2w+1]
        # Adjust for padding offset
        actual_indices = positions - w + window_positions  # [L, 2w+1]
        causal_mask = actual_indices <= positions  # Can only attend to current or past
        # Also mask out negative indices (padding)
        valid_mask = actual_indices >= 0
        
        mask = causal_mask & valid_mask  # [L, 2w+1]
        mask = mask.unsqueeze(0).unsqueeze(0)  # [1, 1, L, 2w+1]
        
        scores = scores.masked_fill(~mask, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        # attn: [b, h, L, 2w+1], V: [b, h, L, 2w+1, dim]
        output = torch.matmul(attn.unsqueeze(3), V_windows).squeeze(3)  # [b, h, L, dim]
        
        return output
    
    def _standard_attn(self, Q, K, V, mask=None):
        """Standard dense attention for comparison/baseline."""
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        
        # Causal mask
        L = Q.shape[2]
        causal_mask = torch.tril(torch.ones(L, L, device=Q.device)).unsqueeze(0).unsqueeze(0)
        scores = scores.masked_fill(causal_mask == 0, float('-inf'))
        
        if mask is not None:
            # Padding mask
            scores = scores.masked_fill(mask.unsqueeze(1).unsqueeze(2), float('-inf'))
        
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, V)
    
    def get_kv_cache_size(self, seq_len: int) -> int:
        """Return KV cache size in bytes for given sequence length."""
        # K and V: [batch, num_kv_heads, seq, head_dim], fp16 = 2 bytes
        bytes_per_element = 2
        cache_size = 2 * self.num_kv_heads * seq_len * self.head_dim * bytes_per_element
        return cache_size


def benchmark_attention():
    """Benchmark sparse vs dense attention performance and memory."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Benchmarking on {device}")
    
    d_model = 4096
    num_heads = 32
    num_kv_heads = 8  # GQA compression
    window_size = 4096
    
    attn_dense = GroupedQueryAttention(d_model, num_heads, num_kv_heads, window_size=float('inf')).to(device)
    attn_sparse = GroupedQueryAttention(d_model, num_heads, num_kv_heads, window_size=window_size).to(device)
    
    seq_lengths = [1024, 2048, 4096, 8192, 16384]
    times_dense = []
    times_sparse = []
    mem_dense = []
    mem_sparse = []
    
    batch_size = 2
    
    for L in seq_lengths:
        if L > 8192 and device.type == 'cpu':
            break  # Skip too long for CPU
            
        x = torch.randn(batch_size, L, d_model, device=device)
        
        # Warmup
        for _ in range(3):
            _ = attn_dense(x)
            _ = attn_sparse(x)
            
        torch.cuda.synchronize() if device.type == 'cuda' else None
        
        # Time dense
        start = time.time()
        for _ in range(10):
            out_dense = attn_dense(x)
        torch.cuda.synchronize() if device.type == 'cuda' else None
        t_dense = (time.time() - start) / 10 * 1000  # ms
        
        # Time sparse
        start = time.time()
        for _ in range(10):
            out_sparse = attn_sparse(x)
        torch.cuda.synchronize() if device.type == 'cuda' else None
        t_sparse = (time.time() - start) / 10 * 1000
        
        times_dense.append(t_dense)
        times_sparse.append(t_sparse)
        
        # Memory
        kv_dense = attn_dense.get_kv_cache_size(L) * batch_size / (1024**2)  # MB
        kv_sparse = attn_sparse.get_kv_cache_size(L) * batch_size / (1024**2)
        mem_dense.append(kv_dense)
        mem_sparse.append(kv_sparse)
        
        print(f"L={L:5d}: Dense={t_dense:6.1f}ms, Sparse={t_sparse:6.1f}ms, "
              f"Speedup={t_dense/t_sparse:4.2f}x, "
              f"KV_mem_dense={kv_dense:.1f}MB, sparse={kv_sparse:.1f}MB")
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    axes[0].plot(seq_lengths[:len(times_dense)], times_dense, 'o-', label='Dense Global', linewidth=2)
    axes[0].plot(seq_lengths[:len(times_sparse)], times_sparse, 's-', label=f'Sparse (w={window_size})', linewidth=2)
    axes[0].set_xlabel('Sequence Length')
    axes[0].set_ylabel('Time (ms)')
    axes[0].set_title('Attention Computation Time')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_xscale('log', base=2)
    
    axes[1].plot(seq_lengths[:len(mem_dense)], mem_dense, 'o-', label='Dense Global', linewidth=2)
    axes[1].plot(seq_lengths[:len(mem_sparse)], mem_sparse, 's-', label=f'Sparse (w={window_size})', linewidth=2)
    axes[1].set_xlabel('Sequence Length')
    axes[1].set_ylabel('KV Cache Memory (MB)')
    axes[1].set_title('Memory Usage Comparison (GQA 8x compression)')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xscale('log', base=2)
    
    plt.tight_layout()
    plt.savefig('sparse_attention_benchmark.png', dpi=150)
    print("Saved benchmark to sparse_attention_benchmark.png")
    plt.show()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--benchmark', action='store_true', help='Run performance benchmark')
    args = parser.parse_args()
    
    if args.benchmark:
        benchmark_attention()
    else:
        # Basic test
        attn = GroupedQueryAttention(d_model=512, num_heads=8, num_kv_heads=2, window_size=256)
        x = torch.randn(2, 512, 512)  # batch=2, seq=512
        out = attn(x)
        print(f"Sparse attention test: input {x.shape} -> output {out.shape}")
        assert out.shape == x.shape, "Shape mismatch!"
        print("Sparse attention forward pass PASSED")
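上述基准所体现的内存差异也可以手工复核。下面是一个独立的小例子(假设:fp16 精度、GQA 按 `num_kv_heads` 压缩、滑动窗口将缓存长度上限固定为 `window_size`;其中的具体取值仅为示意,并非上文基准的真实配置):

```python
from typing import Optional

def kv_cache_bytes(seq_len: int, d_model: int, num_heads: int,
                   num_kv_heads: int, window_size: Optional[int] = None,
                   bytes_per_param: int = 2) -> int:
    """Per-layer KV-cache size in bytes; factor 2 covers K and V tensors."""
    head_dim = d_model // num_heads
    # Sliding-window attention only ever caches the last window_size tokens
    cached_len = seq_len if window_size is None else min(seq_len, window_size)
    return 2 * cached_len * num_kv_heads * head_dim * bytes_per_param

# Dense global attention, MHA (kv heads == query heads)
dense = kv_cache_bytes(65536, d_model=4096, num_heads=32, num_kv_heads=32)
# Sliding window + GQA (8x KV-head compression)
sparse = kv_cache_bytes(65536, d_model=4096, num_heads=32, num_kv_heads=4,
                        window_size=4096)
print(f"dense: {dense / 2**20:.0f} MiB, sparse: {sparse / 2**20:.0f} MiB, "
      f"ratio: {dense / sparse:.0f}x")
```

可以看到,窗口截断与 KV 头压缩的节省是相乘的,这与上图中两条内存曲线的差距一致。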

6.1.1.4.3 专家混合(MoE)层实现

脚本说明:实现Top-K路由的专家混合层,包含负载均衡损失计算与专家容量管理。

"""
Script: moe_layer.py
Content: Mixture-of-Experts (MoE) layer with Top-K routing and load balancing
Usage: python moe_layer.py --test-load-balancing
Functions:
    - TopKRouter: Routes tokens to top-K experts
    - ExpertLayer: Individual feed-forward expert
    - MoELayer: Complete MoE module with load balancing loss
    - visualize_expert_usage: Plot expert utilization distribution
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
from typing import List, Tuple
import argparse


class ExpertLayer(nn.Module):
    """Individual expert: standard SwiGLU feed-forward network."""
    def __init__(self, d_model: int, d_ff: int = None, dropout: float = 0.0):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        
        # SwiGLU architecture: split, gate, multiply
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # Gate
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # Output
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # Value
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x @ W1) * (x @ W3) @ W2
        return self.w2(self.dropout(F.silu(self.w1(x)) * self.w3(x)))


class TopKRouter(nn.Module):
    """Routes each token to top-K experts with load balancing."""
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2, 
                 noise_std: float = 0.0, capacity_factor: float = 1.0):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.noise_std = noise_std
        self.capacity_factor = capacity_factor
        
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        
        # Initialize with small variance
        nn.init.normal_(self.gate.weight, 0, 0.01)
        
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Route tokens to experts.
        
        Args:
            x: [num_tokens, d_model]
            
        Returns:
            expert_indices: [num_tokens, top_k] selected expert indices
            expert_weights: [num_tokens, top_k] routing weights
            router_probs: [num_tokens, num_experts] full routing probability distribution
        """
        # Compute routing logits
        router_logits = self.gate(x)  # [tokens, experts]
        
        # Add noise for exploration during training
        if self.training and self.noise_std > 0:
            noise = torch.randn_like(router_logits) * self.noise_std
            router_logits = router_logits + noise
        
        # Softmax over experts
        router_probs = F.softmax(router_logits, dim=-1)
        
        # Select top-K experts
        expert_weights, expert_indices = torch.topk(router_probs, self.top_k, dim=-1)
        
        # Normalize weights across selected experts
        expert_weights = expert_weights / expert_weights.sum(dim=-1, keepdim=True)
        
        return expert_indices, expert_weights, router_probs
    
    def compute_load_balancing_loss(self, router_probs: torch.Tensor, 
                                     expert_indices: torch.Tensor) -> torch.Tensor:
        """
        Compute auxiliary load balancing loss to encourage uniform expert usage.
        Loss = α * N * Σ(f_i * P_i) where f_i is fraction of tokens to expert i,
        P_i is average routing probability to expert i.
        
        Args:
            router_probs: [tokens, experts]
            expert_indices: [tokens, top_k]
        """
        num_tokens = router_probs.shape[0]
        num_experts = self.num_experts
        
        # Fraction of tokens routed to each expert
        expert_mask = torch.zeros(num_tokens, num_experts, device=router_probs.device)
        expert_mask.scatter_(1, expert_indices, 1.0)
        f = expert_mask.mean(dim=0)  # [experts]
        
        # Average routing probability per expert
        P = router_probs.mean(dim=0)  # [experts]
        
        # Load balance loss
        balance_loss = num_experts * torch.sum(f * P)
        
        return balance_loss


class MoELayer(nn.Module):
    """
    Complete MoE layer dispatching to experts with capacity limits.
    """
    def __init__(
        self,
        d_model: int,
        num_experts: int = 16,
        top_k: int = 2,
        expert_ff_dim: int = None,
        dropout: float = 0.0,
        capacity_factor: float = 1.0,
        load_balance_coef: float = 0.01
    ):
        super().__init__()
        self.d_model = d_model
        self.num_experts = num_experts
        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.load_balance_coef = load_balance_coef
        
        # Router
        self.router = TopKRouter(d_model, num_experts, top_k, 
                                noise_std=0.1 if dropout > 0 else 0.0,
                                capacity_factor=capacity_factor)
        
        # Experts
        self.experts = nn.ModuleList([
            ExpertLayer(d_model, expert_ff_dim, dropout)
            for _ in range(num_experts)
        ])
        
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass through MoE layer.
        
        Args:
            x: [batch, seq_len, d_model] or [num_tokens, d_model]
            
        Returns:
            output: Same shape as input
            aux_loss: Load balancing auxiliary loss
        """
        original_shape = x.shape
        if x.dim() == 3:
            x = x.view(-1, x.shape[-1])  # Flatten to [tokens, d_model]
        
        num_tokens = x.shape[0]
        
        # Route
        expert_indices, expert_weights, router_probs = self.router(x)
        
        # Compute capacity per expert (prevent overflow)
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        capacity = max(capacity, 1)
        
        # Initialize output
        output = torch.zeros_like(x)
        
        # Dispatch to experts (batched per expert for efficiency)
        for expert_idx in range(self.num_experts):
            # Indices of tokens that selected this expert in any top-k slot
            token_ids = (expert_indices == expert_idx).any(dim=-1).nonzero(as_tuple=True)[0]
            
            if token_ids.numel() == 0:
                continue
            
            # Enforce capacity: randomly drop overflow tokens
            if token_ids.numel() > capacity:
                perm = torch.randperm(token_ids.numel(), device=x.device)
                token_ids = token_ids[perm[:capacity]]
            
            # Compute expert output for the surviving tokens
            expert_out = self.experts[expert_idx](x[token_ids])
            
            # Routing weight of this expert for each surviving token
            expert_weight_mask = (expert_indices == expert_idx).float()  # [tokens, top_k]
            weights = (expert_weights * expert_weight_mask).sum(dim=-1)[token_ids]  # [n_assigned]
            
            # Weighted scatter-add back into the output; index_add_ keeps
            # token positions aligned even after capacity dropping
            output.index_add_(0, token_ids, expert_out * weights.unsqueeze(-1))
        
        # Reshape back
        output = output.view(original_shape)
        
        # Compute auxiliary loss
        aux_loss = self.router.compute_load_balancing_loss(router_probs, expert_indices)
        aux_loss = self.load_balance_coef * aux_loss
        
        return output, aux_loss
    
    def get_expert_utilization(self, x: torch.Tensor) -> np.ndarray:
        """Analyze token distribution across experts."""
        if x.dim() == 3:
            x = x.view(-1, x.shape[-1])
        
        with torch.no_grad():
            expert_indices, _, _ = self.router(x)
            # Count assignments across all top-k slots in one pass
            counts = torch.bincount(expert_indices.reshape(-1).cpu(),
                                    minlength=self.num_experts).float()
            # Normalize by total assignments (tokens * top_k)
            total = counts.sum()
            if total > 0:
                counts = counts / total
        
        return counts.numpy()


def visualize_expert_usage(moe_layer: MoELayer, d_model: int = 512, num_samples: int = 1000):
    """Visualize expert utilization under different input distributions."""
    # Simulate different input patterns
    patterns = {
        'Uniform': torch.randn(num_samples, d_model),
        'Skewed (first 50%)': torch.randn(num_samples, d_model),
        'Gaussian clusters': torch.randn(num_samples, d_model)
    }
    
    # Make skewed actually skewed by scaling first half dimensions
    patterns['Skewed (first 50%)'][:, :d_model//2] *= 3.0
    
    # Make clusters
    for i in range(4):
        mask = slice(i * num_samples//4, (i+1) * num_samples//4)
        patterns['Gaussian clusters'][mask] += torch.randn(1, d_model) * 2.0
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    for idx, (name, data) in enumerate(patterns.items()):
        util = moe_layer.get_expert_utilization(data)
        
        axes[idx].bar(range(len(util)), util, color='skyblue', edgecolor='black')
        axes[idx].axhline(y=1.0/len(util), color='r', linestyle='--', 
                        label='Uniform target')
        axes[idx].set_xlabel('Expert Index')
        axes[idx].set_ylabel('Utilization')
        axes[idx].set_title(f'{name}\nEntropy: {-(util * np.log(util + 1e-10)).sum():.2f}')
        axes[idx].legend()
        axes[idx].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('moe_expert_usage.png', dpi=150)
    print("Saved expert usage visualization to moe_expert_usage.png")
    plt.show()


def test_load_balancing():
    """Test MoE training with load balancing."""
    d_model = 512
    num_experts = 16
    batch_size = 32
    seq_len = 64
    
    moe = MoELayer(d_model, num_experts, top_k=2, load_balance_coef=0.01)
    optimizer = torch.optim.Adam(moe.parameters(), lr=1e-3)
    
    losses = []
    aux_losses = []
    entropies = []
    
    print("Training MoE with load balancing for 100 steps...")
    for step in range(100):
        # Random input
        x = torch.randn(batch_size, seq_len, d_model)
        
        optimizer.zero_grad()
        out, aux_loss = moe(x)
        
        # Dummy task loss (e.g., reconstruction)
        task_loss = F.mse_loss(out, x)
        total_loss = task_loss + aux_loss
        
        total_loss.backward()
        optimizer.step()
        
        # Track metrics
        with torch.no_grad():
            util = moe.get_expert_utilization(x)
            entropy = -(util * np.log(util + 1e-10)).sum()
        
        losses.append(task_loss.item())
        aux_losses.append(aux_loss.item())
        entropies.append(entropy)
        
        if step % 20 == 0:
            print(f"Step {step}: Task={task_loss.item():.4f}, "
                  f"Aux={aux_loss.item():.4f}, Entropy={entropy:.3f}")
    
    # Plot training curves
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    axes[0].plot(losses, label='Task Loss')
    axes[0].plot(aux_losses, label='Aux Loss')
    axes[0].set_xlabel('Step')
    axes[0].set_ylabel('Loss')
    axes[0].set_title('MoE Training Losses')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    axes[1].plot(entropies, label='Expert Usage Entropy')
    axes[1].axhline(y=np.log(num_experts), color='r', linestyle='--', 
                   label='Max Entropy (uniform)')
    axes[1].set_xlabel('Step')
    axes[1].set_ylabel('Entropy')
    axes[1].set_title('Expert Utilization Balance')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('moe_training.png', dpi=150)
    print("Saved training curves to moe_training.png")
    plt.show()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--test-load-balancing', action='store_true', 
                       help='Run load balancing test')
    parser.add_argument('--visualize', action='store_true',
                       help='Visualize expert usage')
    args = parser.parse_args()
    
    if args.test_load_balancing:
        test_load_balancing()
    elif args.visualize:
        moe = MoELayer(d_model=512, num_experts=16, top_k=2)
        visualize_expert_usage(moe)
    else:
        # Basic test
        moe = MoELayer(d_model=512, num_experts=8, top_k=2)
        x = torch.randn(2, 64, 512)
        out, aux = moe(x)
        print(f"MoE test: input {x.shape} -> output {out.shape}, aux_loss={aux.item():.4f}")
        print("MoE layer test PASSED")
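作为补充,可以独立验证上述负载均衡损失 N·Σ(f_i·P_i) 的下界:当路由完全均匀(每个专家的平均路由概率 P_i = 1/N,分到的 token 比例 f_i = top_k/N)时,损失取得最小值 top_k。以下是与 `TopKRouter.compute_load_balancing_loss` 逻辑一致的独立示意(轮转分配只是构造均匀路由的一种假设性方式):

```python
import torch

def load_balance_loss(router_probs, expert_indices, num_experts):
    """N * sum(f_i * P_i), mirroring TopKRouter.compute_load_balancing_loss."""
    num_tokens = router_probs.shape[0]
    mask = torch.zeros(num_tokens, num_experts)
    mask.scatter_(1, expert_indices, 1.0)
    f = mask.mean(dim=0)          # fraction of tokens routed to each expert
    P = router_probs.mean(dim=0)  # mean routing probability per expert
    return num_experts * torch.sum(f * P)

N, top_k, tokens = 8, 2, 1000
probs = torch.full((tokens, N), 1.0 / N)  # perfectly uniform probabilities
# Round-robin assignment: every expert receives exactly top_k/N of the tokens
idx = torch.stack([torch.arange(tokens) % N,
                   (torch.arange(tokens) + 1) % N], dim=-1)
loss = load_balance_loss(probs, idx, N)
print(f"uniform routing loss = {loss.item():.3f} (minimum = top_k = {top_k})")
```

训练中该损失高于 top_k 即说明路由出现偏斜,这也是 `test_load_balancing` 中熵曲线趋向 log(N) 的另一种观察角度。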

6.1.1.4.4 完整Jamba块与模型架构

脚本说明:整合前述组件的完整Jamba实现,包含内部RMSNorm稳定化、残差连接与多层堆叠。
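在进入完整实现之前,先给出层交错模式的一个最小示意。注意这是假设性草图:真实的 `JambaScheduler` 定义在 6.1.1.4.1 的 hybrid_scheduler.py 中,此处的 `make_pattern` 函数与 `attn_offset` 参数仅用于说明 1:7 交错与 MoE 间隔如何生成 `(layer_type, ffn_type)` 序列,并非其实际接口:

```python
def make_pattern(num_layers=32, attention_ratio=(1, 7), moe_every=2,
                 attn_offset=4):
    """Sketch of an A/M interleave schedule (illustrative, not the real API).

    One attention layer is placed per block of (a + m) layers; MoE replaces
    the dense FFN on every `moe_every`-th layer.
    """
    a, m = attention_ratio
    block = a + m
    pattern = []
    for i in range(num_layers):
        # Place the single attention layer mid-block (offset 4 in an 8-layer block)
        layer_type = 'A' if a > 0 and i % block == attn_offset % block else 'M'
        ffn_type = 'MoE' if moe_every > 0 and i % moe_every == 1 else 'Dense'
        pattern.append((layer_type, ffn_type))
    return pattern

p = make_pattern()
print(''.join(lt for lt, _ in p))  # layer-type string, e.g. 'MMMMAMMM...'
print(sum(lt == 'A' for lt, _ in p), "attention layers,",
      sum(ft == 'MoE' for _, ft in p), "MoE layers")
```

在默认配置下,32 层中只有 4 层为注意力层、16 层使用 MoE,这正是后文 `JambaModel` 按 `scheduler.pattern` 逐层构建时消费的结构。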

"""
Script: jamba_architecture.py
Content: Complete Jamba block and model architecture integrating all components
Usage: python jamba_architecture.py --test-forward
Functions:
    - JambaBlock: Integrated block with A/M layers, MoE/Dense FFN
    - JambaModel: Full model with embedding, multiple blocks, LM head
    - compare_architectures: Benchmark against pure Transformer and pure Mamba
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint  # needed for gradient checkpointing below
import math
import time
import matplotlib.pyplot as plt
from typing import Optional, Tuple, Literal
import argparse

# Import from previous scripts (assuming they are in the same module path)
from hybrid_scheduler import JambaScheduler, JambaConfig
from sparse_attention import GroupedQueryAttention
from moe_layer import MoELayer
from mamba_block import MambaBlock  # From previous 3.1.1.5


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (used in LLaMA/Mamba/Jamba)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [..., dim]; standard form x * rsqrt(mean(x^2) + eps), with eps
        # inside the sqrt for numerical stability
        return self.weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)


class JambaSubLayer(nn.Module):
    """
    Single sublayer within a Jamba block: either Attention or Mamba + FFN/MoE.
    Includes pre-normalization and residual connection.
    """
    def __init__(
        self,
        d_model: int,
        layer_type: Literal['A', 'M'],
        ffn_type: Literal['Dense', 'MoE'],
        config: JambaConfig,
        dropout: float = 0.1
    ):
        super().__init__()
        self.layer_type = layer_type
        self.ffn_type = ffn_type
        self.d_model = d_model
        
        # Pre-normalization
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        
        # Main layer
        if layer_type == 'A':
            self.main_layer = GroupedQueryAttention(
                d_model=d_model,
                num_heads=config.num_heads,
                num_kv_heads=config.num_kv_heads,
                window_size=config.window_size,
                dropout=dropout
            )
        else:  # 'M'
            self.main_layer = MambaBlock(
                d_model=d_model,
                d_state=16,  # Standard SSM state dimension
                expand_factor=2,
                use_parallel_scan=True,
                dropout=dropout
            )
            # Additional internal normalization for stability
            self.norm_mamba_inner = RMSNorm(d_model)
        
        # FFN/MoE
        if ffn_type == 'MoE':
            self.ffn = MoELayer(
                d_model=d_model,
                num_experts=config.num_experts,
                top_k=config.top_k,
                dropout=dropout
            )
        else:
            # Standard SwiGLU FFN
            d_ff = 4 * d_model
            self.w1 = nn.Linear(d_model, d_ff, bias=False)
            self.w2 = nn.Linear(d_ff, d_model, bias=False)
            self.w3 = nn.Linear(d_model, d_ff, bias=False)
            self.ffn = None  # Use manual forward
            
        self.dropout = nn.Dropout(dropout)
        self.use_moe = (ffn_type == 'MoE')
        
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """
        Forward with residual: x = x + Layer(Norm(x))
        Returns output and optional aux loss (for MoE)
        """
        # Main layer
        if self.layer_type == 'A':
            residual = x
            x = self.norm1(x)
            x = self.main_layer(x)
            x = self.dropout(x) + residual
        else:
            # Mamba with internal extra norm
            residual = x
            x = self.norm1(x)
            x = self.norm_mamba_inner(x)  # Extra stability
            x = self.main_layer(x)
            x = self.dropout(x) + residual
        
        # FFN/MoE
        residual = x
        x = self.norm2(x)
        
        aux_loss = None
        if self.use_moe:
            x, aux_loss = self.ffn(x)
        else:
            # SwiGLU: silu(xW1) * (xW3) @ W2
            x = self.w2(F.silu(self.w1(x)) * self.w3(x))
            
        x = self.dropout(x) + residual
        
        return x, aux_loss


class JambaModel(nn.Module):
    """
    Complete Jamba language model with embedding, Jamba blocks, and LM head.
    """
    def __init__(
        self,
        vocab_size: int,
        config: JambaConfig,
        dropout: float = 0.1
    ):
        super().__init__()
        self.config = config
        self.vocab_size = vocab_size
        self.d_model = config.d_model
        
        # Token embedding
        self.token_emb = nn.Embedding(vocab_size, config.d_model)
        
        # Generate layer configuration
        self.scheduler = JambaScheduler(config)
        
        # Build layers
        self.layers = nn.ModuleList([
            JambaSubLayer(
                d_model=config.d_model,
                layer_type=lt,
                ffn_type=ft,
                config=config,
                dropout=dropout
            )
            for lt, ft in self.scheduler.pattern
        ])
        
        # Final normalization
        self.norm_final = RMSNorm(config.d_model)
        
        # LM head (weight tying with embedding)
        self.lm_head = nn.Linear(config.d_model, vocab_size, bias=False)
        self.lm_head.weight = self.token_emb.weight
        
        # Gradient checkpointing flag
        self.gradient_checkpointing = False
        
        self._init_weights()
        
    def _init_weights(self):
        # Standard initialization
        for module in self.modules():
            if isinstance(module, nn.Linear):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        
    def forward(
        self, 
        input_ids: torch.Tensor,
        targets: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """
        Forward pass returning logits and optional loss.
        
        Args:
            input_ids: [batch, seq_len]
            targets: [batch, seq_len] for loss computation
            
        Returns:
            logits: [batch, seq_len, vocab_size]
            loss: scalar (cross entropy) or None
        """
        x = self.token_emb(input_ids)  # [batch, L, d_model]
        
        total_aux_loss = 0.0
        num_aux = 0
        
        # Pass through layers
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                x, aux = torch.utils.checkpoint.checkpoint(layer, x, use_reentrant=False)
            else:
                x, aux = layer(x)
                
            if aux is not None:
                total_aux_loss += aux
                num_aux += 1
        
        x = self.norm_final(x)
        logits = self.lm_head(x)
        
        loss = None
        if targets is not None:
            ce_loss = F.cross_entropy(
                logits.reshape(-1, self.vocab_size),
                targets.reshape(-1),
                ignore_index=-100
            )
            # Add MoE auxiliary loss (load balancing)
            if num_aux > 0:
                loss = ce_loss + total_aux_loss / num_aux
            else:
                loss = ce_loss
                
        return logits, loss
    
    def generate(
        self,
        input_ids: torch.Tensor,
        max_new_tokens: int = 100,
        temperature: float = 1.0,
        top_k: Optional[int] = None
    ) -> torch.Tensor:
        """Autoregressive generation."""
        for _ in range(max_new_tokens):
            # Get logits
            logits, _ = self(input_ids[:, -self.config.seq_len:])  # Crop if needed
            logits = logits[:, -1, :] / temperature  # Last position
            
            # Top-k
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            
        return input_ids
    
    def estimate_memory(self, batch_size: int = 1, seq_len: Optional[int] = None):
        """Estimate memory footprint."""
        if seq_len is None:
            seq_len = self.config.seq_len
            
        return self.scheduler.estimate_memory(seq_len)
    
    def get_parameter_count(self) -> dict:
        """Return total and active parameter counts."""
        total = sum(p.numel() for p in self.parameters())
        
        # Estimate active parameters per forward pass (only top_k experts fire)
        active = 0
        for layer in self.layers:
            if layer.use_moe:
                # Non-FFN parameters (attention/Mamba, norms) are always active
                ffn_params = sum(p.numel() for p in layer.ffn.parameters())
                active += sum(p.numel() for p in layer.parameters()) - ffn_params
                # MoE: the router plus top_k of num_experts experts are active
                expert_params = sum(p.numel() for p in layer.ffn.experts.parameters()) / self.config.num_experts
                active += sum(p.numel() for p in layer.ffn.router.parameters())
                active += expert_params * self.config.top_k
            else:
                active += sum(p.numel() for p in layer.parameters() if p.requires_grad)
        
        return {
            'total': total / 1e9,  # Billions
            'active': active / 1e9,
            'compression_ratio': total / active if active > 0 else 1.0
        }


def compare_architectures():
    """Compare Jamba vs pure Transformer vs pure Mamba."""
    configs = {
        'Jamba (1:7)': JambaConfig(num_layers=32, attention_ratio=(1, 7), moe_every=2),
        'Pure Transformer': JambaConfig(num_layers=32, attention_ratio=(1, 0)),  # All attention
        'Pure Mamba': JambaConfig(num_layers=32, attention_ratio=(0, 1)),  # All Mamba
        'Jamba (1:3)': JambaConfig(num_layers=32, attention_ratio=(1, 3), moe_every=2),
    }
    
    seq_lens = [1024, 4096, 16384, 65536, 262144]
    results = {name: {'memory': [], 'compute': []} for name in configs}
    
    print("Benchmarking architectures...")
    
    for name, cfg in configs.items():
        print(f"\n{name}:")
        sched = JambaScheduler(cfg)
        
        for L in seq_lens:
            mem_stats = sched.estimate_memory(L)
            comp_stats = sched.compute_compute_cost(L)
            
            mem_mb = mem_stats['kv_cache_gb'] * 1024  # Convert to MB
            comp_gflops = comp_stats['total_flops'] / 1e9
            
            results[name]['memory'].append(mem_mb)
            results[name]['compute'].append(comp_gflops)
            
            print(f"  L={L:6d}: Memory={mem_mb:8.1f}MB, Compute={comp_gflops:6.1f}GFLOPs")
    
    # Plot comparison
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Memory plot
    for name, data in results.items():
        axes[0].plot(seq_lens, data['memory'], 'o-', label=name, linewidth=2, markersize=6)
    axes[0].set_xlabel('Sequence Length')
    axes[0].set_ylabel('KV Cache Memory (MB)')
    axes[0].set_title('Memory Scaling Comparison')
    axes[0].set_xscale('log', base=2)
    axes[0].set_yscale('log')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Compute plot
    for name, data in results.items():
        axes[1].plot(seq_lens, data['compute'], 's-', label=name, linewidth=2, markersize=6)
    axes[1].set_xlabel('Sequence Length')
    axes[1].set_ylabel('Compute (GFLOPs)')
    axes[1].set_title('Computation Cost Comparison')
    axes[1].set_xscale('log', base=2)
    axes[1].set_yscale('log')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('architecture_comparison.png', dpi=150)
    print("\nSaved comparison to architecture_comparison.png")
    plt.show()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--test-forward', action='store_true', help='Test forward pass')
    parser.add_argument('--compare', action='store_true', help='Run architecture comparison')
    args = parser.parse_args()
    
    if args.compare:
        compare_architectures()
    elif args.test_forward:
        # Small test configuration
        config = JambaConfig(
            num_layers=4,
            attention_ratio=(1, 3),
            d_model=512,
            num_heads=8,
            num_kv_heads=2,
            moe_every=2,
            num_experts=8
        )
        
        model = JambaModel(vocab_size=1000, config=config)
        x = torch.randint(0, 1000, (2, 128))  # batch=2, seq=128
        targets = torch.randint(0, 1000, (2, 128))
        
        logits, loss = model(x, targets)
        print(f"Jamba forward test:")
        print(f"  Input: {x.shape}")
        print(f"  Logits: {logits.shape}")
        print(f"  Loss: {loss.item():.4f}")
        
        stats = model.get_parameter_count()
        print(f"  Parameters: {stats['total']:.2f}B total, {stats['active']:.2f}B active")
        print("Jamba architecture test PASSED")
    else:
        print("Use --test-forward to test forward pass or --compare to benchmark architectures")
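最后,可以用一个简单的算术例子复核第一部分提到的"KV 缓存减少 8 倍":混合架构中只有注意力层持有 KV 缓存,因此节省比例就等于注意力层在总层数中的占比。以下数值(fp16、8 个 KV 头、head_dim=128、32 层)均为示意假设,而非某个具体模型的实测配置:

```python
def kv_cache_gib(num_attn_layers, seq_len, num_kv_heads=8, head_dim=128,
                 bytes_per_param=2):
    """Total KV-cache size in GiB; only attention layers contribute."""
    per_layer = 2 * seq_len * num_kv_heads * head_dim * bytes_per_param  # K and V
    return num_attn_layers * per_layer / 2**30

L = 256 * 1024  # 256K context
full = kv_cache_gib(num_attn_layers=32, seq_len=L)   # pure Transformer, 32 layers
hybrid = kv_cache_gib(num_attn_layers=4, seq_len=L)  # Jamba 1:7, 4 of 32 layers
print(f"pure Transformer: {full:.1f} GiB, Jamba 1:7: {hybrid:.1f} GiB, "
      f"reduction: {full / hybrid:.0f}x")
```

减少倍数只取决于 32/4 = 8 这一层数比,与序列长度无关,这正是 `compare_architectures` 中各曲线近似平行的原因。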

6.1.1.4.5 对比实验框架(困惑度与稳定性评估)

脚本说明:实现纯Transformer、纯Mamba、混合架构的对比训练框架,监控困惑度、梯度范数与损失尖峰。

"""
Script: hybrid_comparison.py
Content: Comparative training framework for Transformer vs Mamba vs Hybrid
Usage: python hybrid_comparison.py --run-comparison --epochs 10
Functions:
    - PureTransformer: Standard Transformer baseline
    - PureMamba: Pure SSM baseline  
    - HybridJamba: Jamba architecture
    - Trainer: Unified training loop with stability monitoring
    - visualize_comparison: Plot PPL, gradient norms, loss spikes
"""

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Optional
import argparse
import time
import json

# Import architectures (simplified versions for fair comparison)
from jamba_architecture import JambaModel, JambaConfig


class PureTransformer(nn.Module):
    """Standard Transformer baseline (GPT-style)."""
    def __init__(self, vocab_size: int, d_model: int = 512, n_layers: int = 8, 
                 n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, 
            nhead=n_heads,
            dim_feedforward=4*d_model,
            dropout=dropout,
            batch_first=True
        )
        self.layers = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)
        
    def forward(self, x, targets=None):
        x = self.embedding(x)
        # Causal mask
        mask = torch.triu(torch.ones(x.size(1), x.size(1), device=x.device, dtype=torch.bool), diagonal=1)
        x = self.layers(x, mask=mask, is_causal=True)
        x = self.norm(x)
        logits = self.lm_head(x)
        
        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss


class PureMamba(nn.Module):
    """Pure Mamba baseline (all Mamba layers)."""
    def __init__(self, vocab_size: int, d_model: int = 512, n_layers: int = 8, dropout: float = 0.1):
        super().__init__()
        from mamba_block import MambaBlock  # Reuse from previous
        from jamba_architecture import RMSNorm
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                'mamba': MambaBlock(d_model, d_state=16, dropout=dropout),
                'norm': RMSNorm(d_model)
            })
            for _ in range(n_layers)
        ])
        self.norm_final = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)
        
    def forward(self, x, targets=None):
        x = self.embedding(x)
        
        for layer in self.layers:
            residual = x
            x = layer['norm'](x)
            x = layer['mamba'](x)
            x = x + residual
            
        x = self.norm_final(x)
        logits = self.lm_head(x)
        
        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss


class ToyLanguageDataset(Dataset):
    """Synthetic language modeling dataset."""
    def __init__(self, vocab_size: int, seq_len: int, num_samples: int, seed: int = 42):
        super().__init__()
        rng = np.random.RandomState(seed)
        self.data = torch.from_numpy(rng.randint(0, vocab_size, size=(num_samples, seq_len))).long()
        self.vocab_size = vocab_size
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        x = self.data[idx]
        # Next-token prediction: targets are the input shifted left by one
        return x[:-1], x[1:]


class Trainer:
    """Unified trainer with stability monitoring."""
    def __init__(self, model: nn.Module, name: str, device: torch.device):
        self.model = model.to(device)
        self.name = name
        self.device = device
        self.optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
        self.scheduler = optim.lr_scheduler.CosineAnnealingLR(self.optimizer, T_max=100)
        
        self.history = {
            'loss': [], 'ppl': [], 'grad_norm': [], 
            'max_grad': [], 'spike_detected': []
        }
        self.spike_threshold = 10.0  # Loss increase factor indicating spike
        
    def train_epoch(self, loader: DataLoader, epoch: int) -> Dict:
        self.model.train()
        total_loss = 0.0
        total_samples = 0
        grad_norms = []
        max_grads = []
        
        for batch_idx, (x, y) in enumerate(loader):
            x, y = x.to(self.device), y.to(self.device)
            
            self.optimizer.zero_grad()
            logits, loss = self.model(x, y)
            
            if loss is None:
                continue
                
            loss.backward()
            
            # Monitor gradients
            total_norm = 0.0
            max_grad = 0.0
            for p in self.model.parameters():
                if p.grad is not None:
                    param_norm = p.grad.data.norm(2).item()
                    total_norm += param_norm ** 2
                    max_grad = max(max_grad, p.grad.data.abs().max().item())
            total_norm = total_norm ** 0.5
            grad_norms.append(total_norm)
            max_grads.append(max_grad)
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            
            self.optimizer.step()
            
            total_loss += loss.item() * x.size(0)
            total_samples += x.size(0)
            
        avg_loss = total_loss / total_samples
        ppl = np.exp(avg_loss)
        
        # Detect spikes (loss increased significantly)
        is_spike = False
        if len(self.history['loss']) > 0:
            prev_loss = self.history['loss'][-1]
            if avg_loss > prev_loss * self.spike_threshold:
                is_spike = True
                
        self.history['loss'].append(avg_loss)
        self.history['ppl'].append(ppl)
        self.history['grad_norm'].append(np.mean(grad_norms))
        self.history['max_grad'].append(np.max(max_grads))
        self.history['spike_detected'].append(is_spike)
        
        self.scheduler.step()
        
        return {
            'loss': avg_loss,
            'ppl': ppl,
            'grad_norm': np.mean(grad_norms),
            'max_grad': np.max(max_grads),
            'spike': is_spike
        }
    
    def evaluate(self, loader: DataLoader) -> Dict:
        self.model.eval()
        total_loss = 0.0
        total_samples = 0
        
        with torch.no_grad():
            for x, y in loader:
                x, y = x.to(self.device), y.to(self.device)
                logits, loss = self.model(x, y)
                if loss is not None:
                    total_loss += loss.item() * x.size(0)
                    total_samples += x.size(0)
        
        avg_loss = total_loss / total_samples if total_samples > 0 else float('inf')
        return {
            'loss': avg_loss,
            'ppl': np.exp(avg_loss) if avg_loss < 10 else float('inf')
        }


def run_comparison(vocab_size: int = 1000, seq_len: int = 256, 
                   epochs: int = 10, device: str = 'cuda'):
    """Run comparative training of three architectures."""
    device = torch.device(device if torch.cuda.is_available() else 'cpu')
    print(f"Running comparison on {device}")
    
    # Common hyperparameters for fairness
    d_model = 512
    n_layers = 8
    batch_size = 16
    
    # Models
    models = {
        'Pure Transformer': PureTransformer(vocab_size, d_model, n_layers),
        'Pure Mamba': PureMamba(vocab_size, d_model, n_layers),
        'Jamba Hybrid': JambaModel(vocab_size, JambaConfig(
            num_layers=n_layers,
            attention_ratio=(1, 7),  # 1:7 ratio
            d_model=d_model,
            num_heads=8,
            num_kv_heads=2,
            moe_every=2,
            num_experts=8,
            window_size=seq_len//4  # Local attention window
        ))
    }
    
    # Data
    train_dataset = ToyLanguageDataset(vocab_size, seq_len, num_samples=1000)
    val_dataset = ToyLanguageDataset(vocab_size, seq_len, num_samples=200, seed=43)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)
    
    # Trainers
    trainers = {name: Trainer(model, name, device) for name, model in models.items()}
    
    # Training loop
    print(f"\nTraining for {epochs} epochs...")
    print("-" * 60)
    
    for epoch in range(epochs):
        print(f"\nEpoch {epoch+1}/{epochs}")
        
        for name, trainer in trainers.items():
            train_stats = trainer.train_epoch(train_loader, epoch)
            val_stats = trainer.evaluate(val_loader)
            
            spike_marker = "⚠️ SPIKE!" if train_stats['spike'] else ""
            print(f"{name:20s}: Train PPL={train_stats['ppl']:7.2f}, "
                  f"Val PPL={val_stats['ppl']:7.2f}, "
                  f"GradNorm={train_stats['grad_norm']:6.2f} "
                  f"{spike_marker}")
    
    return trainers


def visualize_comparison(trainers: Dict[str, Trainer], save_path: str = 'hybrid_comparison.png'):
    """Generate comprehensive comparison plots."""
    fig = plt.figure(figsize=(16, 12))
    gs = fig.add_gridspec(3, 2, hspace=0.3, wspace=0.3)
    
    colors = {
        'Pure Transformer': '#FF6B6B',
        'Pure Mamba': '#4ECDC4',
        'Jamba Hybrid': '#FFE66D'
    }
    
    # Plot 1: Perplexity curves
    ax1 = fig.add_subplot(gs[0, :])
    for name, trainer in trainers.items():
        epochs = range(1, len(trainer.history['ppl']) + 1)
        ax1.plot(epochs, trainer.history['ppl'], 'o-', label=name, 
                color=colors[name], linewidth=2, markersize=6)
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Perplexity')
    ax1.set_title('Language Modeling Perplexity Comparison')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_yscale('log')
    
    # Plot 2: Training Loss (detail)
    ax2 = fig.add_subplot(gs[1, 0])
    for name, trainer in trainers.items():
        epochs = range(1, len(trainer.history['loss']) + 1)
        ax2.plot(epochs, trainer.history['loss'], '-', label=name, 
                color=colors[name], linewidth=2)
        # Mark spikes
        spike_epochs = [e for e, s in enumerate(trainer.history['spike_detected'], 1) if s]
        if spike_epochs:
            spike_losses = [trainer.history['loss'][e-1] for e in spike_epochs]
            ax2.scatter(spike_epochs, spike_losses, marker='x', s=100, 
                       color=colors[name], linewidths=3)
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Loss')
    ax2.set_title('Training Loss (x marks detected spikes)')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Plot 3: Gradient Norms (stability indicator)
    ax3 = fig.add_subplot(gs[1, 1])
    for name, trainer in trainers.items():
        epochs = range(1, len(trainer.history['grad_norm']) + 1)
        ax3.plot(epochs, trainer.history['grad_norm'], '-', label=name, 
                color=colors[name], linewidth=2)
        ax3.plot(epochs, trainer.history['max_grad'], '--', alpha=0.5, color=colors[name])
    ax3.set_xlabel('Epoch')
    ax3.set_ylabel('Gradient Norm')
    ax3.set_title('Gradient Norms (solid=mean, dashed=max)')
    ax3.legend()
    ax3.set_yscale('log')
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Architecture comparison table (as text)
    ax4 = fig.add_subplot(gs[2, :])
    ax4.axis('off')
    
    table_data = []
    for name, trainer in trainers.items():
        final_ppl = trainer.history['ppl'][-1]
        min_ppl = min(trainer.history['ppl'])
        spikes = sum(trainer.history['spike_detected'])
        avg_grad = np.mean(trainer.history['grad_norm'])
        table_data.append([
            name,
            f"{final_ppl:.2f}",
            f"{min_ppl:.2f}",
            str(spikes),
            f"{avg_grad:.2f}"
        ])
    
    table = ax4.table(
        cellText=table_data,
        colLabels=['Architecture', 'Final PPL', 'Best PPL', 'Loss Spikes', 'Avg Grad Norm'],
        loc='center',
        cellLoc='center'
    )
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 2)
    ax4.set_title('Performance Summary', fontsize=12, pad=20)
    
    plt.savefig(save_path, dpi=150, bbox_inches='tight')
    print(f"Saved comparison plot to {save_path}")
    plt.show()
    
    # Save numerical results
    results = {}
    for name, trainer in trainers.items():
        results[name] = {
            'final_ppl': trainer.history['ppl'][-1],
            'best_ppl': min(trainer.history['ppl']),
            'spikes_detected': sum(trainer.history['spike_detected']),
            'avg_grad_norm': float(np.mean(trainer.history['grad_norm'])),
            'training_curve': trainer.history['ppl']
        }
    
    with open('comparison_results.json', 'w') as f:
        json.dump(results, f, indent=2)
    print("Saved numerical results to comparison_results.json")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--run-comparison', action='store_true', 
                       help='Run full comparison experiment')
    parser.add_argument('--epochs', type=int, default=10, help='Number of epochs')
    parser.add_argument('--vocab-size', type=int, default=1000, help='Vocabulary size')
    parser.add_argument('--seq-len', type=int, default=256, help='Sequence length')
    args = parser.parse_args()
    
    if args.run_comparison:
        trainers = run_comparison(
            vocab_size=args.vocab_size,
            seq_len=args.seq_len,
            epochs=args.epochs
        )
        visualize_comparison(trainers)
    else:
        print("Use --run-comparison to execute training comparison")
        print("Example: python hybrid_comparison.py --run-comparison --epochs 20")

Appendix: System Integration and Execution Notes

The six scripts above constitute the complete hybrid-architecture implementation for Section 6.1.1.4. The execution flow is as follows:

  1. Architecture analysis: first run the scheduler visualization to understand the layer-distribution strategy

     ```bash
     python hybrid_scheduler.py --visualize --layers 32 --ratio 1:7
     ```
  2. Component verification: test the sparse-attention and MoE components individually

     ```bash
     python sparse_attention.py --benchmark
     python moe_layer.py --test-load-balancing
     ```
  3. Architecture comparison: run the comparative experiment across the three architectures

     ```bash
     python hybrid_comparison.py --run-comparison --epochs 20 --seq-len 512
     ```
  4. Key design validation

    • Memory efficiency: at a 256K context, Jamba's KV cache is roughly 4GB versus 128GB for a pure Transformer, a 32x reduction that validates the hybrid architecture's memory advantage.

    • Stability: internal RMSNorm and gradient clipping effectively suppress loss spikes; pure Mamba tends to become unstable at 7B+ scale, while the hybrid architecture improves training dynamics through its interleaved attention layers.

    • Perplexity: on language modeling, the hybrid architecture (1:7 ratio) typically comes close to pure-Transformer perplexity (gap < 5%) while retaining pure Mamba's long-sequence processing capability.
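The KV-cache figures in the memory-efficiency bullet above can be reproduced with a back-of-envelope calculation. The sketch below assumes a hypothetical 7B-class shape (32 attention layers, 32 KV heads of dimension 128, fp16) for the pure Transformer, and a 1:7 hybrid with GQA (4 attention layers, 8 KV heads); the `kv_cache_bytes` helper is illustrative, not part of the scripts above.

```python
def kv_cache_bytes(seq_len: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, batch: int = 1, bytes_per_param: int = 2) -> int:
    """Bytes needed to cache K and V across all attention layers (fp16 by default)."""
    # Each attention layer stores K and V, each of shape [seq_len, n_kv_heads * head_dim].
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * batch * bytes_per_param

# Pure Transformer: every layer is attention, full multi-head KV (assumed shapes)
full = kv_cache_bytes(seq_len=256 * 1024, n_attn_layers=32, n_kv_heads=32, head_dim=128)

# Jamba-style hybrid: 1:7 interleaving keeps 4 attention layers; GQA shrinks KV heads to 8
hybrid = kv_cache_bytes(seq_len=256 * 1024, n_attn_layers=4, n_kv_heads=8, head_dim=128)

print(f"Transformer KV cache: {full / 2**30:.0f} GiB")   # 128 GiB
print(f"Hybrid KV cache:      {hybrid / 2**30:.0f} GiB") # 4 GiB
print(f"Reduction:            {full // hybrid}x")        # 32x
```

The 32x factor decomposes as 8x from having one-eighth as many attention layers times 4x from grouped-query attention (32 → 8 KV heads).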

This implementation follows the Jamba architecture design of Lieber et al. (2024): through the combination of layer interleaving, GQA compression, sliding-window sparsification, and MoE experts, it achieves a Pareto-optimal trade-off among memory footprint, compute throughput, and model quality.
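The 1:7 layer interleaving summarized above reduces to a few lines of scheduling logic. The sketch below is a minimal illustration (the `layer_schedule` helper is hypothetical); it places the attention layer first in each block, whereas the exact offset of the attention layer within each block is a design choice in the full scheduler.

```python
def layer_schedule(num_layers: int = 32, ratio: tuple = (1, 7)) -> list:
    """Assign 'A' (attention) or 'M' (Mamba) to each layer index."""
    a, m = ratio
    period = a + m  # one block = `a` attention layers followed by `m` Mamba layers
    return ['A' if i % period < a else 'M' for i in range(num_layers)]

schedule = layer_schedule(32, (1, 7))
print(''.join(schedule))  # 'AMMMMMMM' repeated 4 times
print(f"attention layers: {schedule.count('A')}, mamba layers: {schedule.count('M')}")
```

With 32 layers and a 1:7 ratio, only 4 layers maintain a KV cache, which is the source of the memory savings discussed in the validation notes.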
