What methods are there for extrapolating beyond the training length during large-model inference?


Answer: the main approaches are ALiBi, position interpolation (PI), the NTK-aware methods, and YaRN.

In more detail:

Length extrapolation, also called length generalization, studies how a model pretrained on relatively short sequences can generalize to much longer sequences at inference time. It remains an open problem for Transformers. Good extrapolation means that when the model is run on much longer sequences, the relevant metrics do not drop sharply and performance stays robust.

Some of the classic ideas for length generalization from roughly the past two years are as follows:

  1. Direct extrapolation with ALiBi. ALiBi adds a non-learnable bias to the attention scores, i.e. softmax(q_i K^T + m · [-(i-1), ..., -2, -1, 0]),

    where m is a predefined head-specific slope; for example, with 8 heads m takes the values 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256. Because the bias is linear in distance, ALiBi cannot capture long-range information within a single attention layer. The reason it can still extrapolate is that it behaves like local sliding-window attention: information beyond a single layer's reach is picked up by stacking multiple attention layers, so the effective range grows with depth, but it is theoretically bounded. A small sketch of the slopes and bias matrix follows.
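Here is a minimal sketch (my own illustration, not ALiBi's official code; get_alibi_slopes and alibi_bias are hypothetical helper names) of how the per-head slopes and the additive bias matrix can be computed for a power-of-two number of heads:

import torch

def get_alibi_slopes(num_heads):
    # For a power-of-two head count the slopes form a geometric sequence; for 8 heads: 1/2, 1/4, ..., 1/256
    start = 2 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]

def alibi_bias(num_heads, seq_len):
    # Distance matrix (j - i): 0 on the diagonal, negative for keys to the left of the query
    slopes = torch.tensor(get_alibi_slopes(num_heads)).view(num_heads, 1, 1)
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).float()   # (seq_len, seq_len)
    # Added to the attention scores before softmax; the upper triangle is removed by the causal mask anyway
    return slopes * distance[None, :, :]               # (num_heads, seq_len, seq_len)

print(get_alibi_slopes(8))  # [0.5, 0.25, ..., 0.00390625]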

  2. PI (position interpolation): multiply the position indices of the long test sequence by the factor Ltrain / Ltest, scaling them back into the training range. The procedure:

Training: (1, 2, 3, 4, ..., n)    Testing: (1, 2, 3, 4, ..., n, ..., 2n) -> (0.5, 1, ..., n) [achieved by interpolation]

Although position interpolation avoids out-of-range positions for distant tokens, it also compresses the distance between neighboring tokens, which badly disrupts the model's local resolution and can make the PPL blow up. Studies show, however, that after fine-tuning on long text PI works well. Overall, this approach can be understood as scaling the m in the positional encoding term sin(m · base^{-2i/d}) (sin is just taken as an example here; base is 10000): a position that used to be 1 becomes 0.5 after compression, for instance. The code is as follows:

def _compute_linear_scaling_rope_parameters(
    config: Optional[PretrainedConfig] = None,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
    **rope_kwargs,
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with linear scaling. Credits to the Reddit user /u/kaiokendev
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    if config is not None and len(rope_kwargs) > 0:
        raise ValueError(
            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
            f"`_compute_linear_scaling_rope_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
        )
    if len(rope_kwargs) > 0:
        factor = rope_kwargs["factor"]
    elif config is not None:
        factor = config.rope_scaling["factor"]

    # Gets the default RoPE parameters
    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len, **rope_kwargs)

    # Then applies linear scaling to the frequencies.
    # NOTE: originally, scaling was applied to the position_ids. However, we get `embs = inv_freq @ position_ids`, so
    # applying scaling to the inverse frequencies is equivalent.
    inv_freq /= factor
    return inv_freq, attention_factor
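For reference, a hedged usage sketch: recent transformers versions expose this linear scaling through the rope_scaling field of the model config (older versions use the key "type" instead of "rope_type", and the checkpoint name below is only a placeholder):

from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder checkpoint; assume it was pretrained with a 4k context and we stretch positions 2x
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {"rope_type": "linear", "factor": 2.0}
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)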
  3. The NTK-aware family: NTK-aware Scaled RoPE. Since most large models today use RoPE, many extrapolation methods build on this positional encoding. NTK-aware Scaled RoPE was proposed by a Reddit user, and the idea is simple: change the base in the sin(m · base^{-2i/d}) above so that it becomes sin(m · (base·α)^{-2i/d}). According to the analysis by Su Jianlin and others, this is analogous to a change of numeral base, converting base b to base λb with λ = k^{2/d}. Theoretically, it achieves extrapolation on the high-frequency dimensions and interpolation on the low-frequency ones (see reference 2 for what high/low frequency means here). Su Jianlin further argues that spreading the scaling evenly, as NTK-aware Scaled RoPE does, is not optimal: the slower, low-frequency dimensions (e.g., dims 35-64) should absorb more of it and the faster dimensions (dims 10-35) less, which leads to a mixed-base scheme (not detailed here). The code is as follows:

    import transformers

    old_init = transformers.models.llama.modeling_llama.LlamaRotaryEmbedding.__init__

    def ntk_scaled_init(self, dim, max_position_embeddings=2048, base=10000, device=None):
        # The method is just these three lines
        max_position_embeddings = 16384
        a = 8  # Alpha value
        base = base * a ** (dim / (dim - 2))  # Base change formula
        old_init(self, dim, max_position_embeddings, base, device)

    transformers.models.llama.modeling_llama.LlamaRotaryEmbedding.__init__ = ntk_scaled_init
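As a quick sanity check on the "base change" intuition (toy numbers of my own, not from the original post): replacing base with base·α^{d/(d-2)} barely touches the highest-frequency dimension (i.e., it extrapolates there) while compressing the lowest-frequency dimension by roughly α (i.e., it interpolates there):

import torch

dim, base, alpha = 128, 10000.0, 8.0
new_base = base * alpha ** (dim / (dim - 2))

i = torch.arange(0, dim, 2).float()
old_freq = base ** (-i / dim)
new_freq = new_base ** (-i / dim)
ratio = old_freq / new_freq  # effective per-dimension interpolation factor, equals alpha ** (i / (dim - 2))

print(ratio[0].item(), ratio[-1].item())  # ~1.0 for the fastest dimension, ~8.0 (= alpha) for the slowest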

Dynamically Scaled RoPE (dynamic NTK): here is the implementation from transformers, located at https://github.com/huggingface/transformers/blob/10feacd88aef9569e240b7e3833ab32b297e4460/src/transformers/modeling_rope_utils.py#L112. For dynamic NTK a factor α is set, and the base is rescaled by (α * current sequence length / original context length) - (α - 1), so the scaling hyperparameter grows dynamically as the sequence gets longer. The base-rescaling step can be seen in the code below:

def _compute_dynamic_ntk_parameters(
    config: Optional[PretrainedConfig] = None,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
    **rope_kwargs,
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length, used to update the dynamic RoPE at inference time.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    # TODO (joao): use the new `original_max_position_embeddings` from rope_scaling
    if config is not None and len(rope_kwargs) > 0:
        raise ValueError(
            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
            f"`_compute_dynamic_ntk_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
        )
    if len(rope_kwargs) > 0:
        base = rope_kwargs["base"]
        dim = rope_kwargs["dim"]
        max_position_embeddings = rope_kwargs["max_position_embeddings"]
        factor = rope_kwargs["factor"]
    elif config is not None:
        base = config.rope_theta
        partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
        head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
        dim = int(head_dim * partial_rotary_factor)
        max_position_embeddings = config.max_position_embeddings
        factor = config.rope_scaling["factor"]

    attention_factor = 1.0  # Unused in this type of RoPE

    # seq_len: default to max_position_embeddings, e.g. at init time
    seq_len = seq_len if seq_len is not None and seq_len > max_position_embeddings else max_position_embeddings

    # Compute the inverse frequencies (this is the core line)
    base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float().to(device) / dim))
    return inv_freq, attention_factor
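To see what this does in practice, here is a tiny standalone illustration (toy numbers of my own, not a transformers call) of how the rescaled base grows once the sequence length exceeds the original context:

base, dim, max_pos, factor = 10000.0, 128, 4096, 2.0
for seq_len in (4096, 8192, 16384):
    scaled_base = base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))
    print(seq_len, round(scaled_base))  # stays at 10000 up to 4096 tokens, then grows with seq_len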
  4. Several methods proposed by Su Jianlin, including ReRoPE and Leaky ReRoPE. The main idea is to combine extrapolation and interpolation: pick a window size w; inside the window use a position step of 1, outside the window use a step of 1/k (a sketch of this mapping follows below). In the limiting case (ReRoPE, where k goes to infinity and the step outside the window becomes 0), the range of position encodings never exceeds w no matter how long the input is, so arbitrary-length context is supported. The drawback is extra compute: attention inside and outside the window has to be computed separately and then merged, which increases the cost (but that is another story). In addition, Su Jianlin also proposed a log n scaling factor: at test time each q is multiplied by log n to stabilize the distribution of attention scores; although it is a trick, it has theoretical backing and works well. [See the blog posts linked from reference 2.]
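A minimal sketch of the (Leaky) ReRoPE relative-position mapping described above (my own illustration; leaky_rerope_positions is a hypothetical helper, and a real implementation still needs the two attention passes mentioned above):

import torch

def leaky_rerope_positions(seq_len, window, k):
    # Step 1 inside the window, step 1/k outside; as k -> infinity this becomes ReRoPE,
    # where every relative position beyond the window is clipped to `window`.
    rel = torch.arange(seq_len, dtype=torch.float32)
    return torch.where(rel < window, rel, window + (rel - window) / k)

print(leaky_rerope_positions(8, window=4, k=4.0))  # 0, 1, 2, 3, 4, 4.25, 4.5, 4.75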

  5. YaRN, which consists of two pieces of work. The first is a flexible per-dimension "base" design: in plain words, different dimensions use different bases. For the low dimensions, which represent the high-order digits of the "number" (the number being the token's position), extrapolate, i.e., keep the rotation angle unchanged; for the middle dimensions (of the rotary embedding), which represent the middle digits, interpolate gradually, i.e., shrink the angle step by step, which amounts to gradually enlarging the base; for the high dimensions, which represent the low-order digits, interpolate fully, i.e., the base becomes 40 times the original (the concrete factor comes from the example in reference 7). [See reference 7.] The resulting per-dimension frequency (reconstructed here from the implementation below) is θ'_d = (1 - γ_d) · θ_d / s + γ_d · θ_d, where s is the scale factor and γ_d is a linear ramp that equals 1 on the high-frequency dimensions (pure extrapolation) and 0 on the low-frequency ones (pure interpolation), with the transition range set by beta_fast and beta_slow.

The second piece of work builds on the observation that introducing a temperature t before the softmax over the logits affects perplexity uniformly, regardless of the data sample and of the token position within the extended context window. More precisely, the attention weight computation is modified to softmax(q_m^T k_n / (t·√d)). YaRN recommends √(1/t) = 0.1·ln(s) + 1 (s being the scale factor), which is exactly the attention_factor = 0.1 * math.log(factor) + 1.0 applied to the cos/sin in the code below.

YaRN beats all previous methods in both the fine-tuned and the non-fine-tuned setting. Because of its small footprint, it is directly compatible with libraries that modify the attention mechanism, such as Flash Attention 2 [13], and after fine-tuning on less than 0.1% of the original pretraining data YaRN reaches state-of-the-art performance for context-window extension. Moreover, combining YaRN with dynamic scaling at inference time (Dynamic-YaRN) allows more than 2x context-window extension without any fine-tuning. The code comes from the same link as above; it is listed here as well:

def _compute_yarn_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Please refer to the
    [original paper](https://arxiv.org/abs/2309.00071)
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin.
    """
    # No need to keep BC with yarn, unreleased when this new pattern was created.
    if len(rope_kwargs) > 0:
        raise ValueError(
            f"Unexpected arguments: `**rope_kwargs` should be unset in `_compute_yarn_parameters`, got {rope_kwargs}"
        )

    base = config.rope_theta
    partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)
    max_position_embeddings = config.max_position_embeddings
    factor = config.rope_scaling["factor"]

    # Sets the attention factor as suggested in the paper
    attention_factor = config.rope_scaling.get("attention_factor")
    if attention_factor is None:
        attention_factor = 0.1 * math.log(factor) + 1.0

    # Optional config options
    # beta_fast/beta_slow: as suggested in the paper, default to 32/1 (correspondingly)
    beta_fast = config.rope_scaling.get("beta_fast") or 32
    beta_slow = config.rope_scaling.get("beta_slow") or 1

    # Compute the inverse frequencies
    def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
        """Inverse dimension formula to find the dimension based on the number of rotations"""
        return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

    def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings):
        """Find dimension range bounds based on rotations"""
        low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
        high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
        return max(low, 0), min(high, dim - 1)

    def linear_ramp_factor(min, max, dim):
        if min == max:
            max += 0.001  # Prevent singularity

        linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
        ramp_func = torch.clamp(linear_func, 0, 1)
        return ramp_func

    # Note on variable naming: "interpolation" comes from the original technique, where we interpolate the position IDs
    # to expand the possible context length. In other words, interpolation = apply scaling factor.
    pos_freqs = base ** (torch.arange(0, dim, 2).float().to(device) / dim)
    inv_freq_extrapolation = 1.0 / pos_freqs
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)

    low, high = find_correction_range(beta_fast, beta_slow, dim, base, max_position_embeddings)

    # Get n-dimensional rotational scaling corrected for extrapolation
    inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).float().to(device)
    inv_freq = (
        inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
        + inv_freq_extrapolation * inv_freq_extrapolation_factor
    )

    return inv_freq, attention_factor
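For completeness, a hedged usage sketch: recent transformers versions expose YaRN through the same rope_scaling config field with rope_type set to "yarn" (the exact keys depend on the library version, and the checkpoint name is only a placeholder):

from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder checkpoint; factor=4.0 targets roughly a 4x context extension
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {"rope_type": "yarn", "factor": 4.0}
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)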

References:


1. https://zhuanlan.zhihu.com/p/656684326
2. https://kexue.fm/archives/9675
3. https://kexue.fm/content.html
4. https://blog.csdn.net/v_JULY_v/article/details/135072211
5. https://blog.csdn.net/zpp13hao1/article/details/142999954 (code walkthrough)
6. https://blog.csdn.net/luxurie/article/details/135119538 (on RoPE's long-range decay)
7. https://zhuanlan.zhihu.com/p/719510760
