What methods are there for extrapolating beyond the training length during large-model inference?


Answer: the main approaches are ALiBi, position interpolation (PI), the NTK-aware methods, and YaRN.

In more detail:

Length extrapolation, also called length generalization, studies how a model pretrained on relatively short sequences can generalize to much longer sequences at inference time. It remains an open problem for Transformers. Good extrapolation means that when the model is run on much longer sequences, the relevant metrics do not drop sharply and performance stays robust.

Some of the classic ideas for length generalization from roughly the past two years are as follows:

  1. Direct extrapolation with ALiBi. ALiBi adds a non-learnable bias to the attention scores, i.e. softmax(q_i K^T + m · [-(i-1), ..., -2, -1, 0]),

    where m is a predefined head-specific slope; for example, with 8 heads m takes the values 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256. Because the bias is linear in distance, ALiBi cannot capture long-range information within a single attention layer. The reason it can still extrapolate is that it behaves like local sliding-window attention: information beyond a single layer's reach is picked up by stacking multiple attention layers, so the effective range grows with depth, but it is theoretically bounded. A small sketch of the slopes and bias matrix follows.
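Here is a minimal sketch (my own illustration, not ALiBi's official code; get_alibi_slopes and alibi_bias are hypothetical helper names) of how the per-head slopes and the additive bias matrix can be computed for a power-of-two number of heads:

import torch

def get_alibi_slopes(num_heads):
    # For a power-of-two head count the slopes form a geometric sequence; for 8 heads: 1/2, 1/4, ..., 1/256
    start = 2 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]

def alibi_bias(num_heads, seq_len):
    # Distance matrix (j - i): 0 on the diagonal, negative for keys to the left of the query
    slopes = torch.tensor(get_alibi_slopes(num_heads)).view(num_heads, 1, 1)
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).float()   # (seq_len, seq_len)
    # Added to the attention scores before softmax; the upper triangle is removed by the causal mask anyway
    return slopes * distance[None, :, :]               # (num_heads, seq_len, seq_len)

print(get_alibi_slopes(8))  # [0.5, 0.25, ..., 0.00390625]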

  2. PI (position interpolation): multiply the position indices of the long test sequence by the factor Ltrain / Ltest, scaling them back into the training range. The procedure:

Training: (1, 2, 3, 4, ..., n)    Testing: (1, 2, 3, 4, ..., n, ..., 2n) -> (0.5, 1, ..., n) [achieved by interpolation]

Although position interpolation avoids out-of-range positions for distant tokens, it also compresses the distance between neighboring tokens, which badly disrupts the model's local resolution and can make the PPL blow up. Studies show, however, that after fine-tuning on long text PI works well. Overall, this approach can be understood as scaling the m in the positional encoding term sin(m · base^{-2i/d}) (sin is just taken as an example here; base is 10000): a position that used to be 1 becomes 0.5 after compression, for instance. The code is as follows:

def _compute_linear_scaling_rope_parameters(
    config: Optional[PretrainedConfig] = None,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
    **rope_kwargs,
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with linear scaling. Credits to the Reddit user /u/kaiokendev
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    if config is not None and len(rope_kwargs) > 0:
        raise ValueError(
            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
            f"`_compute_linear_scaling_rope_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
        )
    if len(rope_kwargs) > 0:
        factor = rope_kwargs["factor"]
    elif config is not None:
        factor = config.rope_scaling["factor"]

    # Gets the default RoPE parameters
    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len, **rope_kwargs)

    # Then applies linear scaling to the frequencies.
    # NOTE: originally, scaling was applied to the position_ids. However, we get `embs = inv_freq @ position_ids`, so
    # applying scaling to the inverse frequencies is equivalent.
    inv_freq /= factor
    return inv_freq, attention_factor
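For reference, a hedged usage sketch: recent transformers versions expose this linear scaling through the rope_scaling field of the model config (older versions use the key "type" instead of "rope_type", and the checkpoint name below is only a placeholder):

from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder checkpoint; assume it was pretrained with a 4k context and we stretch positions 2x
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {"rope_type": "linear", "factor": 2.0}
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)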
  3. The NTK-aware family: NTK-aware Scaled RoPE. Since most large models today use RoPE, many extrapolation methods build on this positional encoding. NTK-aware Scaled RoPE was proposed by a Reddit user, and the idea is simple: change the base in the sin(m · base^{-2i/d}) above so that it becomes sin(m · (base·α)^{-2i/d}). According to the analysis by Su Jianlin and others, this is analogous to a change of numeral base, converting base b to base λb with λ = k^{2/d}. Theoretically, it achieves extrapolation on the high-frequency dimensions and interpolation on the low-frequency ones (see reference 2 for what high/low frequency means here). Su Jianlin further argues that spreading the scaling evenly, as NTK-aware Scaled RoPE does, is not optimal: the slower, low-frequency dimensions (e.g., dims 35-64) should absorb more of it and the faster dimensions (dims 10-35) less, which leads to a mixed-base scheme (not detailed here). The code is as follows:

    import transformers

    old_init = transformers.models.llama.modeling_llama.LlamaRotaryEmbedding.__init__

    def ntk_scaled_init(self, dim, max_position_embeddings=2048, base=10000, device=None):
        # The method is just these three lines
        max_position_embeddings = 16384
        a = 8  # Alpha value
        base = base * a ** (dim / (dim - 2))  # Base change formula
        old_init(self, dim, max_position_embeddings, base, device)

    transformers.models.llama.modeling_llama.LlamaRotaryEmbedding.__init__ = ntk_scaled_init
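As a quick sanity check on the "base change" intuition (toy numbers of my own, not from the original post): replacing base with base·α^{d/(d-2)} barely touches the highest-frequency dimension (i.e., it extrapolates there) while compressing the lowest-frequency dimension by roughly α (i.e., it interpolates there):

import torch

dim, base, alpha = 128, 10000.0, 8.0
new_base = base * alpha ** (dim / (dim - 2))

i = torch.arange(0, dim, 2).float()
old_freq = base ** (-i / dim)
new_freq = new_base ** (-i / dim)
ratio = old_freq / new_freq  # effective per-dimension interpolation factor, equals alpha ** (i / (dim - 2))

print(ratio[0].item(), ratio[-1].item())  # ~1.0 for the fastest dimension, ~8.0 (= alpha) for the slowest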

Dynamically Scaled RoPE (dynamic NTK): here is the implementation from transformers, located at https://github.com/huggingface/transformers/blob/10feacd88aef9569e240b7e3833ab32b297e4460/src/transformers/modeling_rope_utils.py#L112. For dynamic NTK a factor α is set, and the base is rescaled by (α * current sequence length / original context length) - (α - 1), so the scaling hyperparameter grows dynamically as the sequence gets longer. The base-rescaling step can be seen in the code below:

def _compute_dynamic_ntk_parameters(
    config: Optional[PretrainedConfig] = None,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
    **rope_kwargs,
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length, used to update the dynamic RoPE at inference time.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    # TODO (joao): use the new `original_max_position_embeddings` from rope_scaling
    if config is not None and len(rope_kwargs) > 0:
        raise ValueError(
            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
            f"`_compute_dynamic_ntk_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
        )
    if len(rope_kwargs) > 0:
        base = rope_kwargs["base"]
        dim = rope_kwargs["dim"]
        max_position_embeddings = rope_kwargs["max_position_embeddings"]
        factor = rope_kwargs["factor"]
    elif config is not None:
        base = config.rope_theta
        partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
        head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
        dim = int(head_dim * partial_rotary_factor)
        max_position_embeddings = config.max_position_embeddings
        factor = config.rope_scaling["factor"]

    attention_factor = 1.0  # Unused in this type of RoPE

    # seq_len: default to max_position_embeddings, e.g. at init time
    seq_len = seq_len if seq_len is not None and seq_len > max_position_embeddings else max_position_embeddings

    # Compute the inverse frequencies (this is the core line)
    base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float().to(device) / dim))
    return inv_freq, attention_factor
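To see what this does in practice, here is a tiny standalone illustration (toy numbers of my own, not a transformers call) of how the rescaled base grows once the sequence length exceeds the original context:

base, dim, max_pos, factor = 10000.0, 128, 4096, 2.0
for seq_len in (4096, 8192, 16384):
    scaled_base = base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))
    print(seq_len, round(scaled_base))  # stays at 10000 up to 4096 tokens, then grows with seq_len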
  4. Several methods proposed by Su Jianlin, including ReRoPE and Leaky ReRoPE. The main idea is to combine extrapolation and interpolation: pick a window size w; inside the window use a position step of 1, outside the window use a step of 1/k (a sketch of this mapping follows below). In the limiting case (ReRoPE, where k goes to infinity and the step outside the window becomes 0), the range of position encodings never exceeds w no matter how long the input is, so arbitrary-length context is supported. The drawback is extra compute: attention inside and outside the window has to be computed separately and then merged, which increases the cost (but that is another story). In addition, Su Jianlin also proposed a log n scaling factor: at test time each q is multiplied by log n to stabilize the distribution of attention scores; although it is a trick, it has theoretical backing and works well. [See the blog posts linked from reference 2.]
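A minimal sketch of the (Leaky) ReRoPE relative-position mapping described above (my own illustration; leaky_rerope_positions is a hypothetical helper, and a real implementation still needs the two attention passes mentioned above):

import torch

def leaky_rerope_positions(seq_len, window, k):
    # Step 1 inside the window, step 1/k outside; as k -> infinity this becomes ReRoPE,
    # where every relative position beyond the window is clipped to `window`.
    rel = torch.arange(seq_len, dtype=torch.float32)
    return torch.where(rel < window, rel, window + (rel - window) / k)

print(leaky_rerope_positions(8, window=4, k=4.0))  # 0, 1, 2, 3, 4, 4.25, 4.5, 4.75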

  5. YaRN, which consists of two pieces of work. The first is a flexible per-dimension "base" design: in plain words, different dimensions use different bases. For the low dimensions, which represent the high-order digits of the "number" (the number being the token's position), extrapolate, i.e., keep the rotation angle unchanged; for the middle dimensions (of the rotary embedding), which represent the middle digits, interpolate gradually, i.e., shrink the angle step by step, which amounts to gradually enlarging the base; for the high dimensions, which represent the low-order digits, interpolate fully, i.e., the base becomes 40 times the original (the concrete factor comes from the example in reference 7). [See reference 7.] The resulting per-dimension frequency (reconstructed here from the implementation below) is θ'_d = (1 - γ_d) · θ_d / s + γ_d · θ_d, where s is the scale factor and γ_d is a linear ramp that equals 1 on the high-frequency dimensions (pure extrapolation) and 0 on the low-frequency ones (pure interpolation), with the transition range set by beta_fast and beta_slow.

The second piece of work builds on the observation that introducing a temperature t before the softmax over the logits affects perplexity uniformly, regardless of the data sample and of the token position within the extended context window. More precisely, the attention weight computation is modified to softmax(q_m^T k_n / (t·√d)). YaRN recommends √(1/t) = 0.1·ln(s) + 1 (s being the scale factor), which is exactly the attention_factor = 0.1 * math.log(factor) + 1.0 applied to the cos/sin in the code below.

YaRN beats all previous methods in both the fine-tuned and the non-fine-tuned setting. Because of its small footprint, it is directly compatible with libraries that modify the attention mechanism, such as Flash Attention 2 [13], and after fine-tuning on less than 0.1% of the original pretraining data YaRN reaches state-of-the-art performance for context-window extension. Moreover, combining YaRN with dynamic scaling at inference time (Dynamic-YaRN) allows more than 2x context-window extension without any fine-tuning. The code comes from the same link as above; it is listed here as well:

def _compute_yarn_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Please refer to the
    [original paper](https://arxiv.org/abs/2309.00071)
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin.
    """
    # No need to keep BC with yarn, unreleased when this new pattern was created.
    if len(rope_kwargs) > 0:
        raise ValueError(
            f"Unexpected arguments: `**rope_kwargs` should be unset in `_compute_yarn_parameters`, got {rope_kwargs}"
        )

    base = config.rope_theta
    partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)
    max_position_embeddings = config.max_position_embeddings
    factor = config.rope_scaling["factor"]

    # Sets the attention factor as suggested in the paper
    attention_factor = config.rope_scaling.get("attention_factor")
    if attention_factor is None:
        attention_factor = 0.1 * math.log(factor) + 1.0

    # Optional config options
    # beta_fast/beta_slow: as suggested in the paper, default to 32/1 (correspondingly)
    beta_fast = config.rope_scaling.get("beta_fast") or 32
    beta_slow = config.rope_scaling.get("beta_slow") or 1

    # Compute the inverse frequencies
    def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
        """Inverse dimension formula to find the dimension based on the number of rotations"""
        return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

    def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings):
        """Find dimension range bounds based on rotations"""
        low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
        high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
        return max(low, 0), min(high, dim - 1)

    def linear_ramp_factor(min, max, dim):
        if min == max:
            max += 0.001  # Prevent singularity

        linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
        ramp_func = torch.clamp(linear_func, 0, 1)
        return ramp_func

    # Note on variable naming: "interpolation" comes from the original technique, where we interpolate the position IDs
    # to expand the possible context length. In other words, interpolation = apply scaling factor.
    pos_freqs = base ** (torch.arange(0, dim, 2).float().to(device) / dim)
    inv_freq_extrapolation = 1.0 / pos_freqs
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)

    low, high = find_correction_range(beta_fast, beta_slow, dim, base, max_position_embeddings)

    # Get n-dimensional rotational scaling corrected for extrapolation
    inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).float().to(device)
    inv_freq = (
        inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
        + inv_freq_extrapolation * inv_freq_extrapolation_factor
    )

    return inv_freq, attention_factor
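For completeness, a hedged usage sketch: recent transformers versions expose YaRN through the same rope_scaling config field with rope_type set to "yarn" (the exact keys depend on the library version, and the checkpoint name is only a placeholder):

from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder checkpoint; factor=4.0 targets roughly a 4x context extension
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {"rope_type": "yarn", "factor": 4.0}
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)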

References:


1. https://zhuanlan.zhihu.com/p/656684326
2. https://kexue.fm/archives/9675
3. https://kexue.fm/content.html
4. https://blog.csdn.net/v_JULY_v/article/details/135072211
5. https://blog.csdn.net/zpp13hao1/article/details/142999954 (code walkthrough)
6. https://blog.csdn.net/luxurie/article/details/135119538 (on RoPE's long-range decay)
7. https://zhuanlan.zhihu.com/p/719510760
