FRSM 架构演进与速度对比报告

一、架构演进路线

复制代码

FRSM_V6_Fast (原始RNN)
    ↓ +MoE
FRSM_V6_MoE (Sparse, gather)
    ↓ -gather +Dense
FRSM_V6_DenseMoE (全专家, chunk)
    ↓ +RWKV并行
FRSM_V6_RWKV (cumsum并行扫描)
    ↓ +快慢分离
HybridFRSM (快尺度并行 + 慢尺度门控)
    ↓ +OpenASH骨干
FRSMASH (cummax并行 + 慢记忆)

二、训练速度对比 (RTX 4090, T=384, vocab=23005)

同等 ~100M 参数规模

模型	参数	B	tok/s	显存	架构特点
Transformer (d=512 L=6)	43M	32	220,000	7.5GB	自注意力, 全并行
FRSMASH (H=512 L=4)	33M	88	61,450	19.5GB	cummax并行 + 慢记忆
HybridFRSM (d=1024 1F+1S)	~100M	88	37,000	14.3GB	线性并行扫描 + 门控
Dense MoE (d=400 C=20)	100M	128	27,000	15.0GB	全专家堆叠einsum
Dense MoE (d=400 C=1)	100M	88	5,000	16.0GB	逐token串行
Sparse MoE (d=400 原版)	102M	88	5,000	20.0GB	gather + 检查点

小规模对比 (~13-33M)

模型	参数	B=128 tok/s	显存	备注
FRSMASH (H=256 L=8)	16M	63,787	19.1GB	最高tok/s
HybridFRSM (d=256 3F+1S)	13M	67,905	15.2GB	并行扫描
FRSMASH (H=512 L=4)	33M	61,450	19.5GB	推荐配置

关键速度提升

复制代码

原版 Sparse MoE:     5,000 tok/s  (基准)
↓ 去掉 gather:       27,000 tok/s  (5.4x)
↓ 换 RWKV 并行:      37,000 tok/s  (7.4x)
↓ 换 HybridFRSM:     43,000 tok/s  (8.6x)
↓ 换 FRSMASH:        61,450 tok/s  (12.3x)

三、推理速度对比

模型	参数	tok/s	ms/token	随上下文增长
FRSMASH	33M	255	3.9	否(O(1))
HybridFRSM (4F+2S)	85M	837	1.2	否(O(1))
HybridFRSM (3F+3S)	88M	152	6.6	否(O(1))
Transformer	43M	110	9.1	是(O(N))

FRSMASH 推理比 Transformer 快 2.3x
HybridFRSM (4F+2S) 推理比 Transformer 快 7.6x
所有 FRSM 变体推理都是 O(1) 恒定 , Transformer 是 O(N) 越来越慢

四、训练时间预估 (127万行全量数据, 3 epoch)

模型	参数	tok/s	1 epoch	3 epoch
Transformer	43M	220K	0.6h	1.8h
FRSMASH	33M	61K	2.2h	6.6h
HybridFRSM	100M	37K	3.6h	10.8h
Dense MoE C=20	100M	27K	5.0h	15.0h
原 Sparse MoE	102M	5K	27h	81h

从原版的 81 小时 到 FRSMASH 的 6.6 小时 , 加速 12.3 倍。

五、架构特性对比

特性	Transformer	FRSMASH	HybridFRSM	Dense MoE	Sparse MoE
训练并行度	全并行	cummax并行	cumsum并行	chunk内并行	串行
推理复杂度	O(N)	O(1)	O(1)	O(1)	O(1)
长序列显存	O(T²)爆	O(T)	O(T)	O(T)	O(T)
长程记忆	注意力	cummax+慢门控	慢尺度门控	全专家	top-k路由
门控信息	全交互	cummax状态	input-only	input-only	input-only
训练速度	最快	第二快	第三	第四	最慢
推理速度	最慢	第二快	最快	---	---

六、总结

FRSMASH 是 FRSM 架构演进的最优解:

训练 61K tok/s --- 所有 RNN 变体中最快
推理 255 tok/s --- O(1) 恒定, 比 Transformer 快 2.3x
OpenASH cummax 骨干提供强 LM 表达力
慢尺度内容门控提供选择性长程记忆
门控融合让模型自适应依赖骨干还是记忆

与 Transformer 的差距:

训练: 慢 3.6x (61K vs 220K), 但在可接受范围
推理: 快 2.3x 且恒定不变
长序列: 显存 O(T) vs O(T²), 越长优势越大
总成本: 训练多花 3.6x, 推理省 2.3x+, 长期部署总成本更低

附录: FRSMASH 完整代码

文件: frsmash.py

python 复制代码

"""
FRSMASH --- OpenASH 骨干 + 1 慢尺度记忆

设计思路:
  OpenASH 的 cummax + gen_model 是优秀的 LM 先验
  但 cummax 单调递增, 无法选择性遗忘

  HybridFRSM 的慢尺度内容门控可以完美选择性记忆
  但快尺度是简单线性递推, LM 表达力弱

  FRSMASH = OpenASH 骨干 (强 LM) + 1 慢尺度 (强记忆)
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


# ============================================================
# 1. OpenASH 组件
# ============================================================
class MaxStateSuper(nn.Module):
    """OpenASH 核心: 多头 cummax + gen_model"""
    def __init__(self, dim_size, heads, model_flag="train"):
        super().__init__()
        self.heads = heads
        self.d_head = dim_size // heads
        self.model_flag = model_flag
        self.combined = nn.Linear(dim_size, 4 * dim_size, bias=False)
        self.alpha1 = nn.Parameter(torch.tensor(0.5))
        self.alpha2 = nn.Parameter(torch.tensor(0.5))
        self.alpha3 = nn.Parameter(torch.tensor(0.5))
        self.head_linear = nn.Linear(heads * 5, heads, bias=False)

    def forward(self, x, state=None):
        b, s, d = x.shape
        combined = self.combined(x).view(b, s, 4, self.heads, -1)
        out, out1, out2, out3 = combined.unbind(2)
        out = out.permute(0, 3, 1, 2)
        out1 = out1.permute(0, 3, 1, 2)
        out2 = out2.permute(0, 3, 1, 2)
        out3 = out3.permute(0, 3, 1, 2)

        if state is None:
            out4, _ = torch.cummax(out2, dim=2)
            state = out4[:, :, -1:]
        else:
            out4, _ = torch.cummax(torch.cat([state, out2], dim=2), dim=2)
            if self.model_flag == "train":
                out4 = out4[:, :, 1:]
            else:
                out4 = out4[:, :, -1:]
            state = out4[:, :, -1:]

        cat = torch.cat([out, out1, out2, out3, out4], dim=-1)
        combined_g = self.head_linear(cat) * out4
        term1 = out * out1
        term2 = self.alpha1 * out1 + self.alpha2 * out3
        term3 = out * (self.alpha3 * out4 + out3)
        term4 = out1 * (out2 + out4)
        result = term1 + term2 + term3 + term4 + out2 * out4 + combined_g

        out_l = result.transpose(1, 2).contiguous().view(b, s, d)
        return out_l, state


class FeedForward(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.ffn1 = nn.Linear(hidden_size, hidden_size)
        self.ffn2 = nn.Linear(hidden_size, hidden_size)
        self.gate = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.ffn2(self.ffn1(x) * self.relu(self.gate(x)))


class ASHDecoderLayer(nn.Module):
    def __init__(self, hidden_size, num_heads, model_flag="train"):
        super().__init__()
        self.attn = MaxStateSuper(hidden_size, num_heads, model_flag)
        self.ffn = FeedForward(hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x, state=None):
        x1, state = self.attn(x, state)
        x = self.norm(self.alpha * self.ffn(x1) + (1 - self.alpha) * x)
        return x, state


# ============================================================
# 2. 慢尺度记忆
# ============================================================
class SlowMemoryCell(nn.Module):
    """内容门控慢记忆 --- 选择性写入"""
    def __init__(self, d_model):
        super().__init__()
        d = d_model
        self.W_forget = nn.Linear(d * 2, d)
        self.W_input  = nn.Linear(d * 2, d)
        self.W_cand   = nn.Linear(d * 2, d)
        nn.init.constant_(self.W_forget.bias, 1.0)
        nn.init.constant_(self.W_input.bias, -2.0)
        dh = max(d // 4, 1)
        self.gate = nn.Sequential(
            nn.Linear(d * 2, dh), nn.GELU(),
            nn.Linear(dh, 1), nn.Sigmoid()
        )

    def forward(self, x_t, h_prev):
        c = torch.cat([h_prev, x_t], dim=-1)
        f = torch.sigmoid(self.W_forget(c))
        i = torch.sigmoid(self.W_input(c))
        cand = f * h_prev + i * torch.tanh(self.W_cand(c))
        alpha = self.gate(c).squeeze(-1).unsqueeze(-1)
        return alpha * cand + (1 - alpha) * h_prev


# ============================================================
# 3. FRSMASH --- 融合模型
# ============================================================
class FRSMASH(nn.Module):
    """FRSMASH = OpenASH backbone + 1 SlowMemory"""
    def __init__(self, voc_size, hidden_size, num_heads, num_layers, K=8):
        super().__init__()
        self.D = hidden_size
        self.K = K
        self.em = nn.Embedding(voc_size, hidden_size, padding_idx=0)
        self.ash_layers = nn.ModuleList([
            ASHDecoderLayer(hidden_size, num_heads, "train")
            for _ in range(num_layers)
        ])
        self.ash_norm = nn.LayerNorm(hidden_size)
        self.mem_input_proj = nn.Linear(hidden_size, hidden_size)
        self.slow_cell = SlowMemoryCell(hidden_size)
        self.mem_proj = nn.Linear(hidden_size, hidden_size)
        self.fusion_gate = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
            nn.Sigmoid()
        )
        self.fusion_norm = nn.LayerNorm(hidden_size)
        self.head = nn.Linear(hidden_size, voc_size, bias=False)

    def forward(self, x):
        B, T = x.shape; D = self.D
        x_emb = self.em(x)
        h = x_emb
        for layer in self.ash_layers:
            h1, _ = layer(h)
            h = h1 + h
        x_ash = self.ash_norm(h)
        inp_seq = self.mem_input_proj(x_emb)
        h_slow = torch.zeros(B, D, device=x.device)
        H_slow = torch.zeros(B, T, D, device=x.device)
        prev = 0
        for t in range(0, T, self.K):
            h_slow = self.slow_cell(inp_seq[:, t], h_slow)
            H_slow[:, prev:t+1] = h_slow.unsqueeze(1)
            prev = t + 1
        if prev < T:
            H_slow[:, prev:] = h_slow.unsqueeze(1)
        x_mem = self.mem_proj(H_slow)
        cat = torch.cat([x_ash, x_mem], dim=-1)
        gate = self.fusion_gate(cat)
        fused = self.fusion_norm(gate * x_ash + (1 - gate) * x_mem + x_emb)
        return self.head(fused)

    @torch.no_grad()
    def generate_step(self, token_id, ash_states, h_slow):
        B = token_id.size(0)
        x = self.em(token_id)
        h = x
        new_states = []
        for i, layer in enumerate(self.ash_layers):
            layer.attn.model_flag = "infer"
            h1, s = layer.attn(h, ash_states[i])
            h1 = layer.norm(layer.alpha * layer.ffn(h1) + (1 - layer.alpha) * h)
            h = h1 + h
            new_states.append(s)
        x_ash = self.ash_norm(h[:, 0])
        inp = self.mem_input_proj(x[:, 0])
        h_slow_new = self.slow_cell(inp, h_slow)
        x_mem = self.mem_proj(h_slow_new)
        cat = torch.cat([x_ash, x_mem], dim=-1)
        gate = self.fusion_gate(cat)
        fused = self.fusion_norm(gate * x_ash + (1 - gate) * x_mem + x[:, 0])
        logits = self.head(fused)
        return logits, new_states, h_slow_new