# Stanford CS336 Assignment 1 | Transformer Language Model Architecture

  • 0. Environment Setup
  • 1. Basic Building Blocks: Linear and Embedding Modules
    • Parameter Initialization
    • Linear Module
    • Embedding Module
  • 2. Pre-Norm Transformer Block
    • Root Mean Square Layer Normalization
    • Position-Wise Feed-Forward Network
    • Relative Positional Embeddings
    • Scaled Dot-Product Attention
    • Causal Multi-Head Self-Attention
  • 3. The Full Transformer LM
    • Transformer Block
    • Transformer LM
  • 4. Complete Code

## 0. Environment Setup

Activating the environment by hand every time is tedious, so it is easier to put the activation command straight into .bashrc.

Find the .bashrc file in your home directory.

Then look at the project root, which contains a .venv directory.

Inside .venv there is a bin directory, and bin contains an activate script. This script activates the uv environment, so we only need to run it once every time a shell starts.

In other words, just add one source command to .bashrc:

```shell
source /root/css336/.venv/bin/activate  # adjust this path to wherever your uv environment lives
```

## 1. Basic Building Blocks: Linear and Embedding Modules

### Parameter Initialization

From deep learning basics we know that parameter initialization matters a great deal for training: a poor initialization can cause vanishing or exploding gradients.

Here we initialize from a truncated Gaussian to avoid parameter values that are too large or too small.

Linear weights: $\mathcal{N}\!\left(\mu=0,\ \sigma^2=\frac{2}{d_{in} + d_{out}}\right)$, truncated to $[-3\sigma, +3\sigma]$

Embedding weights: $\mathcal{N}(\mu=0,\ \sigma^2=1)$, truncated to $[-3, +3]$

Concretely, this uses PyTorch's torch.nn.init.trunc_normal_, which samples from a normal distribution and redraws any value that falls outside the truncation range until every value lies inside it.

For example, the embedding-layer initialization above:

```python
torch.nn.init.trunc_normal_(
    tensor,
    mean=0.0,
    std=1.0,
    a=-3.0,
    b=3.0
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tensor | Tensor | --- | The tensor to initialize (modified in place) |
| mean | float | 0.0 | Mean of the normal distribution |
| std | float | 1.0 | Standard deviation of the normal distribution |
| a | float | -2.0 | Lower truncation cutoff (an absolute value, not a multiple of std) |
| b | float | 2.0 | Upper truncation cutoff (an absolute value, not a multiple of std) |
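
As a quick sanity check, here is a minimal sketch (example sizes assumed, not taken from the assignment) that applies the Linear-weight recipe above and confirms every sampled value stays inside the $[-3\sigma, +3\sigma]$ window:

```python
import torch

d_in, d_out = 512, 256                     # assumed example sizes
w = torch.empty(d_out, d_in)
std = (2 / (d_in + d_out)) ** 0.5          # sigma^2 = 2 / (d_in + d_out)
torch.nn.init.trunc_normal_(w, mean=0.0, std=std, a=-3 * std, b=3 * std)
print(w.abs().max().item() <= 3 * std)     # True: every value lies within the cutoff
```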

### Linear Module

```python
# layer.py -- shared imports for all modules implemented below
import math

import torch
from torch import nn
from torch.nn.init import trunc_normal_
from einops import einsum, rearrange
from jaxtyping import Float, Int, Bool

class Linear(nn.Module):
    def __init__(
            self,
            in_features: int,
            out_features: int,
            device = None,
            dtype = None):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        factory_kwargs = {'device': device, 'dtype': dtype}
        self.W = nn.Parameter(torch.empty(out_features, in_features, **factory_kwargs))
        std = (2 / (in_features + out_features)) ** 0.5
        trunc_normal_(self.W, mean=0.0, std=std, a=-3.0*std, b=3.0*std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return einsum(x, self.W, '... in, out in -> ... out')
```

Write adapter.py to wire the module into the tests:

```python
import layer
def run_linear(
    d_in: int,
    d_out: int,
    weights: Float[Tensor, " d_out d_in"],
    in_features: Float[Tensor, " ... d_in"],
) -> Float[Tensor, " ... d_out"]:
    """
    Given the weights of a Linear layer, compute the transformation of a batched input.

    Args:
        in_dim (int): The size of the input dimension
        out_dim (int): The size of the output dimension
        weights (Float[Tensor, "d_out d_in"]): The linear weights to use
        in_features (Float[Tensor, "... d_in"]): The output tensor to apply the function to

    Returns:
        Float[Tensor, "... d_out"]: The transformed output of your linear module.
    """
    device, dtype = in_features.device, in_features.dtype
    model = layer.Linear(d_in, d_out, device, dtype)
    model.load_state_dict({'W': weights})
    return model.forward(in_features)
```

Go into the tests directory and run:

```shell
pytest test_model.py::test_linear
```

Once the test passes, pytest reports it as passed.

This exercises a bit of einops and nn.Parameter usage.
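
To make the einsum pattern concrete, here is a small sketch (toy shapes assumed, using the imports above) showing that `'... in, out in -> ... out'` is just `x @ W.T`:

```python
x = torch.randn(2, 5, 16)                        # (batch, seq, in_features)
W = torch.randn(32, 16)                          # (out_features, in_features)
out = einsum(x, W, '... in, out in -> ... out')
print(out.shape)                                 # torch.Size([2, 5, 32])
print(torch.allclose(out, x @ W.T))              # True
```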

### Embedding Module

The embedding is also initialized with torch.nn.init.trunc_normal_:

```python
class Embedding(nn.Module):
    def __init__(
            self,
            num_embeddings: int,
            embedding_dim: int,
            device: torch.device | None = None,
            dtype: torch.dtype | None = None
    ):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        factory_kwargs = {'device': device, 'dtype': dtype}
        self.W = nn.Parameter(torch.empty(num_embeddings, embedding_dim, **factory_kwargs))
        std = 1
        trunc_normal_(self.W, mean=0.0, std=std, a=-3.0*std, b=3.0*std)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.W[token_ids]
```
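
The forward pass is plain advanced indexing: `token_ids` can have any shape, and each id picks out a row of `W`. A small sketch with assumed toy sizes:

```python
emb = Embedding(num_embeddings=1000, embedding_dim=64)
ids = torch.randint(0, 1000, (2, 8))   # (batch, seq_len)
print(emb(ids).shape)                  # torch.Size([2, 8, 64])
```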

Write the test adapter:

```python
def run_embedding(
    vocab_size: int,
    d_model: int,
    weights: Float[Tensor, " vocab_size d_model"],
    token_ids: Int[Tensor, " ..."],
) -> Float[Tensor, " ... d_model"]:
    """
    Given the weights of an Embedding layer, get the embeddings for a batch of token ids.

    Args:
        vocab_size (int): The number of embeddings in the vocabulary
        d_model (int): The size of the embedding dimension
        weights (Float[Tensor, "vocab_size d_model"]): The embedding vectors to fetch from
        token_ids (Int[Tensor, "..."]): The set of token ids to fetch from the Embedding layer

    Returns:
        Float[Tensor, "... d_model"]: Batch of embeddings returned by your Embedding layer.
    """
    device, dtype = weights.device, weights.dtype
    model = layer.Embedding(vocab_size, d_model, device, dtype)
    model.load_state_dict({'W': weights})
    return model.forward(token_ids)
```

Test:

```shell
pytest test_model.py::test_embedding
```

Once the test passes, pytest reports it as passed.

## 2. Pre-Norm Transformer Block

### Root Mean Square Layer Normalization

The Transformer we implement here differs from the architecture in the original paper. Take normalization: we use pre-norm, which is believed to improve gradient flow, because the raw input passes through the residual connection untouched and is added directly to the sub-layer output. See Stanford CS336 Lecture 3 | Architectures, hyperparameters for details.
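
As a schematic sketch (not the assignment's API; `sublayer` and `norm` simply stand in for attention/FFN and RMSNorm), the difference is just where the normalization sits relative to the residual:

```python
def post_norm(x, sublayer, norm):
    # original Transformer: normalize after the residual addition
    return norm(x + sublayer(x))

def pre_norm(x, sublayer, norm):
    # what we build here: the residual path carries x through untouched
    return x + sublayer(norm(x))
```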

The idea behind RMSNorm: given an activation $a \in \mathbb{R}^{d_{model}}$ and a learnable gain $g \in \mathbb{R}^{d_{model}}$,

$$\mathrm{RMSNorm}(a)_i = \frac{a_i}{\mathrm{RMS}(a)}\, g_i, \qquad \mathrm{RMS}(a) = \sqrt{\frac{1}{d_{model}}\sum_{j=1}^{d_{model}} a_j^2 + \varepsilon}$$

```python
class RMSNorm(nn.Module):
    def __init__(
            self,
            d_model: int,
            eps: float = 1e-5,
            device: torch.device | None = None,
            dtype: torch.dtype | None = None):
        super().__init__()
        self.d_model = d_model
        self.eps = eps
        factory_kwargs = {'device': device, 'dtype': dtype}
        self.W = nn.Parameter(torch.ones(d_model, **factory_kwargs))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        x = x.to(torch.float32)
        RMS = (x.pow(2).mean(dim=-1, keepdim=True) + self.eps).sqrt()
        x /= RMS
        x *= self.W
        return x.to(in_dtype)
```
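
A quick numeric sanity check (toy shapes assumed): with the default all-ones gain, the normalized output should have RMS close to 1 along the last dimension:

```python
norm = RMSNorm(d_model=8)
t = torch.randn(2, 3, 8) * 5.0
print(norm(t).pow(2).mean(dim=-1).sqrt())   # every entry is approximately 1.0
```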

The adapter:

```python
def run_rmsnorm(
    d_model: int,
    eps: float,
    weights: Float[Tensor, " d_model"],
    in_features: Float[Tensor, " ... d_model"],
) -> Float[Tensor, " ... d_model"]:
    """Given the weights of a RMSNorm affine transform,
    return the output of running RMSNorm on the input features.

    Args:
        d_model (int): The dimensionality of the RMSNorm input.
        eps: (float): A value added to the denominator for numerical stability.
        weights (Float[Tensor, "d_model"]): RMSNorm weights.
        in_features (Float[Tensor, "... d_model"]): Input features to run RMSNorm on. Can have arbitrary leading
            dimensions.

    Returns:
        Float[Tensor,"... d_model"]: Tensor of with the same shape as `in_features` with the output of running
        RMSNorm of the `in_features`.
    """
    device, dtype = weights.device, weights.dtype
    model = layer.RMSNorm(d_model, eps, device, dtype)
    model.load_state_dict({'W': weights})
    return model.forward(in_features)
```

Test:

```shell
pytest test_model.py::test_rmsnorm
```

### Position-Wise Feed-Forward Network

Here we implement an improved FFN with gating, using the SiLU (a.k.a. Swish) activation.

The usual write-up uses column vectors; since PyTorch works with row-major tensors (tokens as row vectors), the formulas become:

$$\mathrm{GLU}(x, W_1, W_2) = \sigma(xW_1^\top) \odot (xW_2^\top)$$

$$\mathrm{FFN}(x) = \mathrm{SwiGLU}(x, W_1, W_2, W_3) = \bigl(\mathrm{SiLU}(xW_1^\top) \odot (xW_3^\top)\bigr) W_2^\top$$

We also set $d_{ff} = \frac{8}{3} d_{model}$; the main reason is to keep the parameter count on par with the original FFN. See Stanford CS336 Lecture 3 | Architectures, hyperparameters (the lecture video and slides both explain this).
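
As a small illustration (the d_model value and the rounding to a multiple of 64 are assumed conventions for the sketch, not something this post fixes):

```python
# Sketch: choose d_ff close to (8/3) * d_model, rounded to a multiple of 64
d_model = 768
d_ff = round(8 * d_model / 3 / 64) * 64
print(d_ff)   # 2048
```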

The implementation is below. The assignment PDF says torch.sigmoid may be used, so I used it directly instead of hand-rolling sigmoid.

```python
def SiLU(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(x)


class SwiGLU(nn.Module):
    def __init__(
            self,
            d_model: int,
            d_ff: int,
            device: torch.device | None = None,
            dtype: torch.dtype | None = None
    ):
        super().__init__()
        self.d_model = d_model
        self.d_ff = d_ff
        factory_kwargs = {'device': device, 'dtype': dtype}
        # pass device/dtype through, otherwise they would be silently ignored
        self.linear1 = Linear(d_model, d_ff, **factory_kwargs)
        self.linear2 = Linear(d_ff, d_model, **factory_kwargs)
        self.linear3 = Linear(d_model, d_ff, **factory_kwargs)

    def forward(self, x: Float[torch.Tensor, "... d_model"]) -> Float[torch.Tensor, "... d_model"]:
        xW1 = self.linear1(x)
        xW3 = self.linear3(x)
        return self.linear2(SiLU(xW1) * xW3)
```

In the adapter:

```python
def run_swiglu(
    d_model: int,
    d_ff: int,
    w1_weight: Float[Tensor, " d_ff d_model"],
    w2_weight: Float[Tensor, " d_model d_ff"],
    w3_weight: Float[Tensor, " d_ff d_model"],
    in_features: Float[Tensor, " ... d_model"],
) -> Float[Tensor, " ... d_model"]:
    """Given the weights of a SwiGLU network, return
    the output of your implementation with these weights.

    Args:
        d_model (int): Dimensionality of the feedforward input and output.
        d_ff (int): Dimensionality of the up-project happening internally to your swiglu.
        w1_weight (Float[Tensor, "d_ff d_model"]): Stored weights for W1
        w2_weight (Float[Tensor, "d_model d_ff"]): Stored weights for W2
        w3_weight (Float[Tensor, "d_ff d_model"]): Stored weights for W3
        in_features (Float[Tensor, "... d_model"]): Input embeddings to the feed-forward layer.

    Returns:
        Float[Tensor, "... d_model"]: Output embeddings of the same shape as the input embeddings.
    """
    # Example:
    # If your state dict keys match, you can use `load_state_dict()`
    # swiglu.load_state_dict(weights)
    # You can also manually assign the weights
    # swiglu.w1.weight.data = w1_weight
    # swiglu.w2.weight.data = w2_weight
    # swiglu.w3.weight.data = w3_weight
    device, dtype = w1_weight.device, w1_weight.dtype
    model = layer.SwiGLU(d_model, d_ff, device, dtype)
    model.load_state_dict({
        "linear1.W": w1_weight,
        "linear2.W": w2_weight,
        "linear3.W": w3_weight,
    })
    return model.forward(in_features)
```

Test command:

```shell
pytest test_model.py::test_swiglu
```

We also need to test SiLU:

```shell
pytest test_model.py::test_silu_matches_pytorch
```
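
For a quick local check before running pytest, here is a tiny sketch (assuming `torch.nn.functional` is available as `F`) comparing the SiLU defined above with PyTorch's built-in:

```python
import torch.nn.functional as F

t = torch.randn(4, 8)
print(torch.allclose(SiLU(t), F.silu(t)))   # True
```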

### Relative Positional Embeddings

Mathematically, RoPE is implemented with rotation matrices. For a query token $q^{(i)} = W_q x^{(i)} \in \mathbb{R}^d$ at position $i$, positional information is injected by applying a pairwise rotation matrix $R^i$ (the embedding dimension must be even): $q'^{(i)} = R^i q^{(i)} = R^i W_q x^{(i)}$.

Here $R^i$ treats each pair of elements $q^{(i)}_{2k:2k+1}$ of the embedding as a 2-D vector and rotates it by an angle $\theta_{i,k} = \frac{i}{\Theta^{2k/d}}$, where $k \in \{0, 1, \dots, d/2 - 1\}$ and $\Theta$ is a constant, typically 10000. The full rotation matrix can be viewed as a block-diagonal matrix whose diagonal holds $\frac{d}{2}$ rotation blocks $R^i_k$ of size $2 \times 2$:

$$R^i_k=\begin{bmatrix} \cos\theta_{i,k} & -\sin\theta_{i,k}\\ \sin\theta_{i,k} & \cos\theta_{i,k} \end{bmatrix}$$

$$R^i = \begin{bmatrix} R^i_1 & 0 & 0 & \cdots & 0 \\ 0 & R^i_2 & 0 & \cdots & 0 \\ 0 & 0 & R^i_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & R^i_{d/2} \end{bmatrix}$$

That is the math behind RoPE. In practice, though, we never build the rotation matrix explicitly, because that would be inefficient. Instead we precompute cos and sin caches and apply them to the even- and odd-indexed dimensions separately, as follows.

First, following $\theta_{i,k} = \frac{i}{\Theta^{2k/d}}$ with $k \in \{0, 1, \dots, d/2 - 1\}$, build the matrix of $\theta_{i,k}$ values. torch.arange can generate the vector of $\frac{1}{\Theta^{2k/d}}$ terms:

```python
freq = 1.0 / (theta ** (torch.arange(0, d - 1, 2) / d))
```

Then use torch.outer to take the outer product of every position with freq, giving the full matrix:

```python
positions = torch.arange(0, max_seq_len, device=device).float()
freqs = torch.outer(positions, freq)
```

We can ignore the batch dimensions for now, since leading dimensions do not affect the computation.

Next, build the sin and cos caches:

```python
# freqs is already a tensor, so no torch.tensor(...) wrapper is needed
self.register_buffer('cos_cache', torch.cos(freqs))
self.register_buffer('sin_cache', torch.sin(freqs))
```
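
To see the shapes concretely, here is a standalone sketch with assumed example values for theta, d and max_seq_len:

```python
theta, d, max_seq_len = 10000.0, 8, 16
freq = 1.0 / (theta ** (torch.arange(0, d - 1, 2).float() / d))   # (d/2,)           = (4,)
positions = torch.arange(0, max_seq_len).float()                  # (max_seq_len,)   = (16,)
freqs = torch.outer(positions, freq)                              # (max_seq_len, d/2)
print(freqs.shape)   # torch.Size([16, 4])
```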

Now look at a single token $i$, remembering that each pair of dimensions is rotated independently. For the first pair $(x_0, x_1)$ and the second pair $(x_2, x_3)$:

$$\begin{bmatrix} \cos\theta_{i,0} & -\sin\theta_{i,0}\\ \sin\theta_{i,0} & \cos\theta_{i,0} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} x_0\cos\theta_{i,0} - x_1\sin\theta_{i,0} \\ x_0\sin\theta_{i,0} + x_1\cos\theta_{i,0} \end{bmatrix}$$

$$\begin{bmatrix} \cos\theta_{i,1} & -\sin\theta_{i,1}\\ \sin\theta_{i,1} & \cos\theta_{i,1} \end{bmatrix} \begin{bmatrix} x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} x_2\cos\theta_{i,1} - x_3\sin\theta_{i,1} \\ x_2\sin\theta_{i,1} + x_3\cos\theta_{i,1} \end{bmatrix}$$

So if we pull out the even-indexed and odd-indexed components, rotate them, and then interleave them back, token $i$ is fully processed.

In other words, the even-indexed outputs are
$$[x_0, x_2, x_4, \dots, x_{d-2}] \odot [\cos\theta_{i,0}, \cos\theta_{i,1}, \dots, \cos\theta_{i,d/2-1}] - [x_1, x_3, x_5, \dots, x_{d-1}] \odot [\sin\theta_{i,0}, \sin\theta_{i,1}, \dots, \sin\theta_{i,d/2-1}]$$

and the odd-indexed outputs are
$$[x_0, x_2, x_4, \dots, x_{d-2}] \odot [\sin\theta_{i,0}, \sin\theta_{i,1}, \dots, \sin\theta_{i,d/2-1}] + [x_1, x_3, x_5, \dots, x_{d-1}] \odot [\cos\theta_{i,0}, \cos\theta_{i,1}, \dots, \cos\theta_{i,d/2-1}]$$

The implementation follows exactly this scheme.

```python
class RotaryPositionalEmbedding(nn.Module):
    def __init__(
            self,
            theta: float,
            d_k: int,
            max_seq_len: int,
            device = None
    ):
        super().__init__()
        self.theta = theta
        assert d_k % 2 == 0, 'd_k is not even'
        self.d_k = d_k
        self.max_seq_len = max_seq_len
        factory_kwargs = {'device': device}
        freq = 1.0 / (theta ** (torch.arange(0, d_k, 2, device=device).float() / d_k))
        positions = torch.arange(0, max_seq_len, device=device).float()
        freqs = torch.outer(positions, freq)
        self.register_buffer('cos_cache', torch.cos(freqs))
        self.register_buffer('sin_cache', torch.sin(freqs))

    def forward(
            self,
            x: Float[torch.Tensor, "... seq_len d_k"],
            token_positions: Int[torch.Tensor, "... seq_len"]
    ) -> Float[torch.Tensor, "... seq_len d_k"]:
        sin = self.sin_cache[token_positions]
        cos = self.cos_cache[token_positions]
        x_even = x[..., ::2]
        x_odd = x[..., 1::2]
        out_even = x_even * cos - x_odd * sin
        out_odd = x_even * sin + x_odd * cos
        out = torch.empty_like(x)
        out[..., ::2] = out_even
        out[..., 1::2] = out_odd
        return out
```

> A few things here are worth reading up on separately: how register_buffer works, PyTorch broadcasting, the fact that a 2-D tensor can be used as an index, and torch.empty_like.

Writing the adapter:

```python
def run_rope(
    d_k: int,
    theta: float,
    max_seq_len: int,
    in_query_or_key: Float[Tensor, " ... sequence_length d_k"],
    token_positions: Int[Tensor, " ... sequence_length"],
) -> Float[Tensor, " ... sequence_length d_k"]:
    """
    Run RoPE for a given input tensor.

    Args:
        d_k (int): Embedding dimension size for the query or key tensor.
        theta (float): RoPE parameter.
        max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
        in_query_or_key (Float[Tensor, "... sequence_length d_k"]): Input tensor to run RoPE on.
        token_positions (Int[Tensor, "... sequence_length"]): Tensor of shape (batch_size, sequence_length) with the token positions

    Returns:
        Float[Tensor, " ... sequence_length d_k"]: Tensor with RoPEd input.
    """
    device = in_query_or_key.device
    model = layer.RotaryPositionalEmbedding(theta, d_k, max_seq_len, device)
    return model.forward(in_query_or_key, token_positions)
```

Run the test:

```shell
pytest test_model.py::test_rope
```

The test passes.

### Scaled Dot-Product Attention

First we need a softmax function, and it includes a trick to prevent numerical overflow. $e^x$ blows up once $x$ gets even moderately large, producing inf, and inf / inf yields NaN, which breaks everything downstream. So we subtract the maximum from every element first; all exponents become $\le 0$ and the problem goes away.

```python
def softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    x_max = x.max(dim=dim, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=dim, keepdim=True)
```

Update the adapter:

```python
def run_softmax(in_features: Float[Tensor, " ..."], dim: int) -> Float[Tensor, " ..."]:
    """
    Given a tensor of inputs, return the output of softmaxing the given `dim`
    of the input.
    Args:
        in_features (Float[Tensor, "..."]): Input features to softmax. Shape is arbitrary.
        dim (int): Dimension of the `in_features` to apply softmax to.

    Returns:
        Float[Tensor, "..."]: Tensor with the same shape as `in_features` with the output of softmax
        normalizing the specified `dim`.
    """
    return layer.softmax(in_features, dim)
```

Test:

```shell
pytest -k test_softmax_matches_pytorch
```

Next is scaled dot-product attention itself. Just implement the formula from the handout, i.e. the one from "Attention Is All You Need":

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Note that we implement it with row vectors (each row of $Q$, $K$, $V$ is one token), whereas the write-up treats vectors as columns.

```python
class ScaledDotProductAttention(nn.Module):
    def __init__(
            self,
            d_k: int
    ):
        super().__init__()
        self.scale = 1.0 / math.sqrt(d_k)

    def forward(
            self,
            q: Float[torch.Tensor, "... seq_len_q d_k"],
            k: Float[torch.Tensor, "... seq_len_k d_k"],
            v: Float[torch.Tensor, "... seq_len_k d_v"],
            mask: Bool[torch.Tensor, "seq_len_q seq_len_k"] | None = None
    ) -> Float[torch.Tensor, "... seq_len_q d_v"]:
        attention_score = einsum(q, k, "... seq_len_q d_k, ... seq_len_k d_k -> ... seq_len_q seq_len_k") * self.scale
        if mask is not None:
            attention_score = attention_score.masked_fill(~mask, float("-inf"))
        attention_score = softmax(attention_score, dim=-1)
        return einsum(attention_score, v, "... seq_len_q seq_len_k, ... seq_len_k d_v -> ... seq_len_q d_v")
```

> A bug here cost me ten minutes of debugging: `tensor.masked_fill` is not in-place (the in-place version is `tensor.masked_fill_`), so with the non-in-place version you must assign the result back.

Call it from the adapter:

```python
def run_scaled_dot_product_attention(
    Q: Float[Tensor, " ... queries d_k"],
    K: Float[Tensor, " ... keys d_k"],
    V: Float[Tensor, " ... values d_v"],
    mask: Float[Tensor, " ... queries keys"] | None = None,
) -> Float[Tensor, " ... queries d_v"]:
    """
    Given key (K), query (Q), and value (V) tensors, return
    the output of your scaled dot product attention implementation.

    Args:
        Q (Float[Tensor, " ... queries d_k"]): Query tensor
        K (Float[Tensor, " ... keys d_k"]): Key tensor
        V (Float[Tensor, " ... values d_v"]): Values tensor
        mask (Float[Tensor, " ... queries keys"] | None): Mask tensor
    Returns:
        Float[Tensor, " ... queries d_v"]: Output of SDPA
    """
    model = layer.ScaledDotProductAttention(d_k=Q.shape[-1])
    return model(Q, K, V, mask=mask)
```

Test:

```shell
pytest -k test_scaled_dot_product_attention
pytest -k test_4d_scaled_dot_product_attention
```

### Causal Multi-Head Self-Attention

Multi-head self-attention builds on the scaled dot-product attention above; the new ingredient is four linear layers that project the model. It is not hard to implement, but there are plenty of small details.

```python
class CausalMultiheadSelfAttention(nn.Module):
    def __init__(
            self,
            d_model: int,
            num_heads: int,
            max_seq_len: int | None = None,
            theta: float = 10000.0,
            use_rope: bool = True,
            device: torch.device | None = None,
            dtype: torch.dtype | None = None
    ):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        factory_kwargs = {'device': device, 'dtype': dtype}
        mask = torch.tril(torch.ones(max_seq_len, max_seq_len, dtype=torch.bool, device=device))
        self.mask = mask.unsqueeze(0).unsqueeze(0)
        self.d_model = d_model
        self.num_heads = num_heads
        self.max_seq_len = max_seq_len
        self.theta = theta
        self.use_rope = use_rope
        self.d_k = d_model // num_heads
        self.d_v = self.d_k
        self.attn = ScaledDotProductAttention(self.d_k)
        self.q_proj, self.k_proj, self.v_proj, self.o_proj = [
            Linear(d_model, d_model, **factory_kwargs) for _ in range(4)
        ]
        if use_rope is True:
            self.rope = RotaryPositionalEmbedding(
                theta=theta,
                d_k=self.d_k,
                max_seq_len=max_seq_len,
                device=device
            )

    def forward(
            self,
            x: Float[torch.Tensor, "batch_size seq_len d_model"],
            token_positions: Int[torch.Tensor, " batch_size sequence_length"] | None = None,
    ) -> Float[torch.Tensor, "batch_size seq_len d_model"]:
        B, S, _ = x.shape
        q, k, v = [
            rearrange(proj(x), 'b s (h d) -> b h s d', h=self.num_heads)
            for proj in [self.q_proj, self.k_proj, self.v_proj]
        ]
        if self.use_rope is True:
            q, k = self.rope(q, token_positions), self.rope(k, token_positions)
        out = self.attn(q, k, v, self.mask[..., :S, :S])
        return self.o_proj(rearrange(out, 'b h s d -> b s (h d)'))
```

The adapters are below; if only the values come out wrong, check whether you forgot to load the four weights Wq, Wk, Wv, Wo.

```python
def run_multihead_self_attention(
    d_model: int,
    num_heads: int,
    q_proj_weight: Float[Tensor, " d_k d_in"],
    k_proj_weight: Float[Tensor, " d_k d_in"],
    v_proj_weight: Float[Tensor, " d_v d_in"],
    o_proj_weight: Float[Tensor, " d_model d_v"],
    in_features: Float[Tensor, " ... sequence_length d_in"],
) -> Float[Tensor, " ... sequence_length d_out"]:
    """
    Given the key, query, and value projection weights of a naive unbatched
    implementation of multi-head attention, return the output of an optimized batched
    implementation. This implementation should handle the key, query, and value projections
    for all heads in a single matrix multiply.
    This function should not use RoPE.
    See section 3.2.2 of Vaswani et al., 2017.

    Args:
        d_model (int): Dimensionality of the feedforward input and output.
        num_heads (int): Number of heads to use in multi-headed attention.
        max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
        q_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the Q projection
        k_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the K projection
        v_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the V projection
        o_proj_weight (Float[Tensor, "d_model d_v"]): Weights for the output projection
        in_features (Float[Tensor, "... sequence_length d_in"]): Tensor to run your implementation on.
    Returns:
        Float[Tensor, " ... sequence_length d_out"]: Tensor with the output of running your optimized, batched multi-headed attention
        implementation with the given QKV projection weights and input features.
    """
    device, dtype = in_features.device, in_features.dtype
    model = layer.CausalMultiheadSelfAttention(
        d_model=d_model,
        num_heads=num_heads,
        max_seq_len=in_features.size(-2),
        use_rope=False,
        device=device,
        dtype=dtype
    )
    model.load_state_dict({
        "q_proj.W": q_proj_weight,
        "k_proj.W": k_proj_weight,
        "v_proj.W": v_proj_weight,
        "o_proj.W": o_proj_weight,
    }, strict=False)
    return model(in_features)


def run_multihead_self_attention_with_rope(
    d_model: int,
    num_heads: int,
    max_seq_len: int,
    theta: float,
    q_proj_weight: Float[Tensor, " d_k d_in"],
    k_proj_weight: Float[Tensor, " d_k d_in"],
    v_proj_weight: Float[Tensor, " d_v d_in"],
    o_proj_weight: Float[Tensor, " d_model d_v"],
    in_features: Float[Tensor, " ... sequence_length d_in"],
    token_positions: Int[Tensor, " ... sequence_length"] | None = None,
) -> Float[Tensor, " ... sequence_length d_out"]:
    """
    Given the key, query, and value projection weights of a naive unbatched
    implementation of multi-head attention, return the output of an optimized batched
    implementation. This implementation should handle the key, query, and value projections
    for all heads in a single matrix multiply.
    This version of MHA should include RoPE.
    In this case, the RoPE embedding dimension must be the head embedding dimension (d_model // num_heads).
    See section 3.2.2 of Vaswani et al., 2017.

    Args:
        d_model (int): Dimensionality of the feedforward input and output.
        num_heads (int): Number of heads to use in multi-headed attention.
        max_seq_len (int): Maximum sequence length to pre-cache if your implementation does that.
        theta (float): RoPE parameter.
        q_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the Q projection
        k_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the K projection
        v_proj_weight (Float[Tensor, "d_k d_in"]): Weights for the V projection
        o_proj_weight (Float[Tensor, "d_model d_v"]): Weights for the output projection
        in_features (Float[Tensor, "... sequence_length d_in"]): Tensor to run your implementation on.
        token_positions (Int[Tensor, " ... sequence_length"] | None): Optional tensor with the positions of the tokens

    Returns:
        Float[Tensor, " ... sequence_length d_out"]: Tensor with the output of running your optimized, batched multi-headed attention
        implementation with the given QKV projection weights and input features.
""" device, dtype = in_features.device, in_features.dtype model = layer.CausalMultiheadSelfAttention( d_model=d_model, num_heads=num_heads, theta=theta, use_rope=(token_positions is not None), max_seq_len=max_seq_len, device=device, dtype=dtype ) model.load_state_dict({ "q_proj.W": q_proj_weight, "k_proj.W": k_proj_weight, "v_proj.W": v_proj_weight, "o_proj.W": o_proj_weight, }, strict=False) return model(x=in_features, token_positions=token_positions) ``` 测试 ```shell pytest test_model.py::test_multihead_self_attention pytest test_model.py::test_multihead_self_attention_with_rope ``` ![在这里插入图片描述](https://i-blog.csdnimg.cn/direct/19afb7e09c4d4844b8df02944f97ee87.png) ## 三、 The Full Transformer LM ### Transformer block ![在这里插入图片描述](https://i-blog.csdnimg.cn/direct/e200d4dbfbaa4d09a7eba9ab4d2200cf.png) transformer block如上图所示,embedding x分成两条路线,一条通过RMSNorm再经过带有rope的causal MHA,另一条就是残差网络,直接和经过注意力路线的结果相加,然后得到的结果会再经过一个残差网络,残差网络还是先通过RMSNorm,再经过SiwshGLU,这构成了transformer block。 具体实现上就是将前面实现的模块组合到一起即可 ```python class TransformerBlock(nn.Module): def __init__( self, d_model: int, num_heads: int, d_ff: int, max_seq_len: int, theta: float = 10000.0, use_rope: bool = True, device: torch.device | None = None, dtype: torch.dtype | None = None ): super().__init__() kwargs = {"device": device, "dtype": dtype} self.d_model = d_model self.num_heads = num_heads self.norm1 = RMSNorm( d_model=d_model, device=device, dtype=dtype ) self.attn = CausalMultiheadSelfAttention( d_model=d_model, num_heads=num_heads, max_seq_len=max_seq_len, theta=theta, use_rope=use_rope, device=device, dtype=dtype ) self.norm2 = RMSNorm( d_model=d_model, device=device, dtype=dtype ) self.ffn = SwiGLU( d_model=d_model, d_ff=d_ff, device=device, dtype=dtype ) def forward( self, x: Float[torch.Tensor, " batch seq_len d_model"], token_positions: Int[torch.Tensor, "batch seq_len"] ) -> Float[torch.Tensor, " batch seq_len d_model"]: b, s, _ = x.shape attn_out = self.attn(self.norm1(x), token_positions=token_positions) x = x + attn_out ffn_out = self.ffn(self.norm2(x)) x = x + ffn_out return x ``` 这里有一个bug我调试了几个小时才发现,前面写的RMSNorm是有问题的,虽然单独的RMSNorm测试能通过但是组成Transformer block就会出问题。 ```python class RMSNorm(nn.Module): def __init__( self, d_model: int, eps: float = 1e-5, device: torch.device | None = None, dtype: torch.dtype | None = None): super().__init__() self.d_model = d_model self.eps = eps factory_kwargs = {'device': device, 'dtype': dtype} self.W = nn.Parameter(torch.ones(d_model, **factory_kwargs)) def forward(self, x: torch.Tensor) -> torch.Tensor: in_dtype = x.dtype x = x.to(torch.float32) RMS = (x.pow(2).mean(dim=-1, keepdim=True) + self.eps).sqrt() x /= RMS x *= self.W return x.to(in_dtype) ``` 上面是之前的实现在forward的过程中我直接进行了下面操作,这是会有问题的 x /= RMS 因为残差网络的原因,x + attn(Norm(x)),RMSNorm对x进行归一化会导致第一个x也被进行了归一化,因为上面是原地操作的,这就和要求的数据流动不一样,我们希望第一个x是没有经过任何处理的,这样残差连接才能起到作用,保证梯度流畅。 于是进行了下面的修改 ```python class RMSNorm(nn.Module): def __init__( self, d_model: int, eps: float = 1e-5, device: torch.device | None = None, dtype: torch.dtype | None = None): super().__init__() self.d_model = d_model self.eps = eps factory_kwargs = {'device': device, 'dtype': dtype} self.W = nn.Parameter(torch.ones(d_model, **factory_kwargs)) def forward(self, x: torch.Tensor) -> torch.Tensor: in_dtype = x.dtype x = x.to(torch.float32) RMS = (x.pow(2).mean(dim=-1, keepdim=True) + self.eps).sqrt() normalized_x = x / RMS results = normalized_x * self.W return results.to(in_dtype) ``` adapters ```python def run_transformer_block( d_model: int, num_heads: int, d_ff: int, max_seq_len: int, 
    theta: float,
    weights: dict[str, Tensor],
    in_features: Float[Tensor, " batch sequence_length d_model"],
) -> Float[Tensor, " batch sequence_length d_model"]:
    device, dtype = in_features.device, in_features.dtype
    model = layer.TransformerBlock(
        d_model=d_model,
        num_heads=num_heads,
        d_ff=d_ff,
        use_rope=True,
        max_seq_len=max_seq_len,
        theta=theta,
        device=device,
        dtype=dtype
    )
    model.load_state_dict({
        "attn.q_proj.W": weights["attn.q_proj.weight"],
        "attn.k_proj.W": weights["attn.k_proj.weight"],
        "attn.v_proj.W": weights["attn.v_proj.weight"],
        "attn.o_proj.W": weights["attn.output_proj.weight"],
        "norm1.W": weights["ln1.weight"],
        "norm2.W": weights["ln2.weight"],
        "ffn.linear1.W": weights["ffn.w1.weight"],
        "ffn.linear2.W": weights["ffn.w2.weight"],
        "ffn.linear3.W": weights["ffn.w3.weight"],
    }, strict=False)
    B, S, _ = in_features.shape
    positions = torch.arange(S, device=device).expand(B, -1)
    assert S <= max_seq_len, f"Sequence length {S} exceeds max_seq_len {max_seq_len}"
    return model(in_features, token_positions=positions)
```

Test:

```shell
pytest test_model.py::test_transformer_block
```

> Debugging this bug took me three or four hours and nearly broke me. Trust jyy's three axioms:
>
> 1. The machine is always right.
> 2. Untested code is always wrong.
> 3. When a task feels tedious, there must be a more efficient way to do it.

### Transformer LM

Now we assemble the whole Transformer language model: processing starts from the token_ids the tokenizer produces and runs all the way to the logits that predict the next token.

One thing to watch out for: do not apply a softmax after the final Linear layer, or the test cases will fail.

```python
def _copy_param(target: torch.Tensor, source: torch.Tensor) -> None:
    """
    Copy `source` into `target` in-place, transposing `source`
    if that is what makes the shapes line up.
    """
    if source.shape == target.shape:
        target.data.copy_(source)
    elif source.T.shape == target.shape:
        target.data.copy_(source.T)
    else:
        raise ValueError(f"Shape mismatch: cannot load parameter of shape {source.shape} "
                         f"into tensor of shape {target.shape}")


class TransformerLM(nn.Module):
    def __init__(
            self,
            vocab_size: int,
            context_length: int,
            d_model: int,
            num_layers: int,
            num_heads: int,
            d_ff: int,
            theta: float,
            device: torch.device | None = None,
            dtype: torch.dtype | None = None):
        super().__init__()
        self.vocab_size = vocab_size
        self.context_length = context_length
        self.d_model = d_model
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.d_ff = d_ff
        self.theta = theta
        factory_kwargs = {"device": device, "dtype": dtype}
        self.emb = Embedding(
            num_embeddings=vocab_size,
            embedding_dim=d_model,
            **factory_kwargs
        )
        self.blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                num_heads=num_heads,
                d_ff=d_ff,
                max_seq_len=context_length,
                theta=theta,
                use_rope=True,
                **factory_kwargs,
            )
            for _ in range(num_layers)
        ])
        self.final_norm = RMSNorm(d_model=d_model, device=device, dtype=dtype)
        self.final_linear = Linear(in_features=d_model, out_features=vocab_size, **factory_kwargs)

    def forward(self, token_ids: Int[torch.Tensor, "batch seq_len"]) -> Float[torch.Tensor, "batch seq_len vocab_size"]:
        B, S = token_ids.shape
        assert S <= self.context_length, "text is too long"
        x = self.emb(token_ids)
        pos = torch.arange(S, device=token_ids.device).unsqueeze(0).expand(B, S)
        for block in self.blocks:
            x = block(x, pos)
        x = self.final_norm(x)
        x = self.final_linear(x)
        # logits = softmax(x, dim=-1)  # deliberately NOT applied: the tests expect raw logits
        return x
```

Writing the adapter:

```python
def run_transformer_lm(
    vocab_size: int,
    context_length: int,
    d_model: int,
    num_layers: int,
    num_heads: int,
    d_ff: int,
    rope_theta: float,
    weights: dict[str, Tensor],
    in_indices: Int[Tensor, " batch_size sequence_length"],
) -> Float[Tensor, " batch_size sequence_length vocab_size"]:
    """Given the weights of a Transformer language model and input indices,
    return the output of running a forward pass on the input indices.

    This function should use RoPE.

    Args:
        vocab_size (int): The number of unique items in the output vocabulary to be predicted.
        context_length (int): The maximum number of tokens to process at once.
        d_model (int): The dimensionality of the model embeddings and sublayer outputs.
        num_layers (int): The number of Transformer layers to use.
        num_heads (int): Number of heads to use in multi-headed attention. `d_model` must be
            evenly divisible by `num_heads`.
        d_ff (int): Dimensionality of the feed-forward inner layer (section 3.3).
        rope_theta (float): The RoPE $\Theta$ parameter.
        weights (dict[str, Tensor]):
            State dict of our reference implementation. {num_layers} refers to an
            integer between `0` and `num_layers - 1` (the layer index).
            The keys of this dictionary are:
            - `token_embeddings.weight`
                Token embedding matrix. Shape is (vocab_size, d_model).
            - `layers.{num_layers}.attn.q_proj.weight`
                The query projections for all `num_heads` attention heads.
                Shape is (num_heads * (d_model / num_heads), d_model).
                The rows are ordered by matrices of shape (num_heads, d_k),
                so `attn.q_proj.weight == torch.cat([q_heads.0.weight, ..., q_heads.N.weight], dim=0)`.
            - `layers.{num_layers}.attn.k_proj.weight`
                The key projections for all `num_heads` attention heads.
                Shape is (num_heads * (d_model / num_heads), d_model).
                The rows are ordered by matrices of shape (num_heads, d_k),
                so `attn.k_proj.weight == torch.cat([k_heads.0.weight, ..., k_heads.N.weight], dim=0)`.
            - `layers.{num_layers}.attn.v_proj.weight`
                The value projections for all `num_heads` attention heads.
                Shape is (num_heads * (d_model / num_heads), d_model).
                The rows are ordered by matrices of shape (num_heads, d_v),
                so `attn.v_proj.weight == torch.cat([v_heads.0.weight, ..., v_heads.N.weight], dim=0)`.
            - `layers.{num_layers}.attn.output_proj.weight`
                Weight of the multi-head self-attention output projection
                Shape is ((d_model / num_heads) * num_heads, d_model).
            - `layers.{num_layers}.ln1.weight`
                Weights of affine transform for the first RMSNorm
                applied in the transformer block.
                Shape is (d_model,).
            - `layers.{num_layers}.ffn.w1.weight`
                Weight of the first linear transformation in the FFN.
                Shape is (d_model, d_ff).
            - `layers.{num_layers}.ffn.w2.weight`
                Weight of the second linear transformation in the FFN.
                Shape is (d_ff, d_model).
            - `layers.{num_layers}.ffn.w3.weight`
                Weight of the third linear transformation in the FFN.
                Shape is (d_model, d_ff).
            - `layers.{num_layers}.ln2.weight`
                Weights of affine transform for the second RMSNorm
                applied in the transformer block.
                Shape is (d_model,).
            - `ln_final.weight`
                Weights of affine transform for RMSNorm applied to the output of the final transformer block.
                Shape is (d_model, ).
            - `lm_head.weight`
                Weights of the language model output embedding.
                Shape is (vocab_size, d_model).
        in_indices (Int[Tensor, "batch_size sequence_length"]) Tensor with input indices to run the language model on. Shape is (batch_size, sequence_length), where
            `sequence_length` is at most `context_length`.

    Returns:
        Float[Tensor, "batch_size sequence_length vocab_size"]: Tensor with the predicted unnormalized
        next-word distribution for each token.
""" first_val = next(iter(weights.values())) device, dtype = in_indices.device, first_val.dtype model = layer.TransformerLM( vocab_size=vocab_size, context_length=context_length, d_model=d_model, num_layers=num_layers, num_heads=num_heads, d_ff=d_ff, theta=rope_theta, device=device, dtype=dtype ).eval() from layer import _copy_param with torch.no_grad(): # (a) token embedding (also implicitly ties lm_head) _copy_param(model.emb.W, weights["token_embeddings.weight"]) # (b) per-layer parameters for layer_idx in range(num_layers): pfx = f"layers.{layer_idx}." block = model.blocks[layer_idx] # ── attention projections _copy_param(block.attn.q_proj.W, weights[pfx + "attn.q_proj.weight"]) _copy_param(block.attn.k_proj.W, weights[pfx + "attn.k_proj.weight"]) _copy_param(block.attn.v_proj.W, weights[pfx + "attn.v_proj.weight"]) _copy_param(block.attn.o_proj.W, weights[pfx + "attn.output_proj.weight"]) # ── RMSNorm weights _copy_param(block.norm1.W, weights[pfx + "ln1.weight"]) _copy_param(block.norm2.W, weights[pfx + "ln2.weight"]) # ── feed-forward (SwiGLU) weights _copy_param(block.ffn.linear1.W, weights[pfx + "ffn.w1.weight"]) _copy_param(block.ffn.linear2.W, weights[pfx + "ffn.w2.weight"]) _copy_param(block.ffn.linear3.W, weights[pfx + "ffn.w3.weight"]) # (c) final layer-norm _copy_param(model.final_norm.W, weights["ln_final.weight"]) # (d) (optional) make sure tied output embedding matches lm_head if provided _copy_param(model.final_linear.W, weights["lm_head.weight"]) # 3) run the forward pass and return logits with torch.no_grad(): return model(in_indices) # (batch, seq_len, vocab_size) ``` 测试命令 ```shell pytest test_model.py::test_transformer_lm ``` 测试结果: ![在这里插入图片描述](https://i-blog.csdnimg.cn/direct/4a2c86a5cb744f19adbe1ada2d79baa3.png) ## 四、 完整代码 感觉transformer LM 架构更像是在搭乐高一样,不过实现的时候还是有很多细节,以及einops、pytorch语法上的使用、以及jaxtype。 测试本章所有模块 ```shell pytest test_model.py ``` ![在这里插入图片描述](https://i-blog.csdnimg.cn/direct/debe480ccefd4c58955b4604241ae9f1.png) 最后完结撒花\~\~\~\~ 完整代码放在github仓库,如果有需要可以start一下!!! [github仓库](https://github.com/caixingyi/css336)
