Dissecting nanoGPT

Structure of nanoGPT

- Summary
- Attention mechanism
- nanoGPT inference file
- Model file
  - generate function
  - GPT forward
    - Flow
    - Dimensions
  - block forward
    - Flow
    - Dimensions
  - LayerNorm layer
  - CausalSelfAttention layer
    - Flow
    - Dimensions
  - MLP
    - Flow
    - Dimensions

Summary

This post simply records my notes from studying the GPT-2 model, nothing more, and most of the content was generated by AI. The code used here comes from https://github.com/karpathy/nanoGPT.git

Attention mechanism

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

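The attention mechanism really comes down to three values, $Q$, $K$, and $V$; once you understand how these three are computed, the rest follows. Below is an implementation of scaled dot-product attention. Start with the tensor shapes: the computation revolves around the last two dimensions. $Q$ has shape (ql, dim) and $K$ has shape (kl, dim), so $QK^T$ has shape (ql, kl). The softmax does not change the shape; it only normalizes along the last dimension. Multiplying by $V$, which has shape (kl, dim), gives a final result of shape (ql, dim), which matches the output shape listed in the docstring.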


import math
from typing import Optional

import torch
import torch.nn as nn


class BaseAttention(nn.Module):
    """
    Tensor          Type            Shape
    ===========================================================================
    q               float           (..., query_len, dims)
    k               float           (..., kv_len, dims)
    v               float           (..., kv_len, dims)
    mask            bool            (..., query_len, kv_len)
    ---------------------------------------------------------------------------
    output          float           (..., query_len, dims)
    ===========================================================================
    """
    def __init__(self, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self,
                q: torch.Tensor,
                k: torch.Tensor,
                v: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> torch.Tensor:
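        # scaled dot-product scores: (..., query_len, kv_len)
        x = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))

        if mask is not None:
            # positions where mask is True get a large negative score, so softmax gives them ~0 weight
            x += mask.type_as(x) * x.new_tensor(-1e4)
        x = self.dropout(x.softmax(-1))

        # weighted sum of the values: (..., query_len, dims)
        return torch.matmul(x, v)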
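
To make the shape bookkeeping concrete, here is a minimal sketch of the same scaled dot-product computation on plain Python lists, with no batching, mask, or dropout. The helper names `matmul`, `transpose`, `softmax_rows`, and `attention` are made up for this illustration; they are not part of nanoGPT or PyTorch.

```python
import math

def matmul(a, b):
    """Multiply an (n, m) matrix by an (m, p) matrix, both given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Swap the two dimensions of a nested-list matrix."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Softmax over the last dimension (each row), shifted by the row max for stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                                # (ql, kl)
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]  # scale by sqrt(d_k)
    weights = softmax_rows(scores)                                  # (ql, kl), each row sums to 1
    return matmul(weights, v), weights                              # output is (ql, dim)

# ql = 2 queries, kl = 3 keys/values, dim = 4
q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
v = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

out, weights = attention(q, k, v)
print(len(weights), len(weights[0]))  # shape of softmax(QK^T / sqrt(d_k))
print(len(out), len(out[0]))          # shape of the output
```

The intermediate weight matrix has one row per query and one column per key, while the output has one row per query and as many columns as V.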

nanoGPT inference file

The nanoGPT inference script is sample.py. The demo generates num_samples samples; each sample is a passage in which generate appends max_new_tokens new tokens to the prompt. An excerpt:


with torch.no_grad(): # inference: do not track gradients
    with ctx:
        # generate num_samples samples
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')

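Model file

The nanoGPT model file is model.py. Its core is the construction of the model; for now I only look at the forward functions.

generate function

This function runs max_new_tokens autoregressive prediction steps. The comments in the code explain each step.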


    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            # (keep every row, and only the last block_size columns)
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
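
The per-step logic of generate (temperature scaling, optional top-k filtering, softmax, sampling) can be sketched for a single position in plain Python. This is only an illustration under simplifying assumptions: `sample_next` is a hypothetical helper, it works on one list of logits rather than a batched tensor, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick the next token id from raw scores, mirroring the steps in generate."""
    # scale by temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # the k-th largest score is the cutoff; everything strictly below it is dropped
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= kth else -math.inf for s in scaled]
    # softmax; exp(-inf) evaluates to 0.0, so filtered tokens get zero probability
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # draw one token id according to the probabilities
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs

logits = [2.0, 1.0, 0.5, -1.0]
token, probs = sample_next(logits, temperature=0.8, top_k=2, rng=random.Random(0))
print(token, [round(p, 3) for p in probs])
```

With top_k=2 only the two highest-scoring tokens keep a non-zero probability, and with top_k=1 the sampling becomes greedy decoding.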

GPT forward

Flow

- Start: GPT.forward(idx, targets)
- Get basic info: device = idx.device; b, t = idx.size()
- Check t <= block_size; if not, raise an exception
- Build position indices: pos = arange(0, t), shape (t)
- Token embedding: tok_emb = wte(idx), shape (b, t, n_embd)
- Position embedding: pos_emb = wpe(pos), shape (t, n_embd)
- Add + Dropout: x = drop(tok_emb + pos_emb), shape (b, t, n_embd)
- Transformer Block 1 to Block N: x = block(x), with N = n_layer (12 by default)
- Final LayerNorm: x = ln_f(x), shape (b, t, n_embd)
- targets is not None?
  - Yes (training mode): compute logits for the whole sequence, logits = lm_head(x), shape (b, t, vocab_size); then compute the loss, loss = cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
  - No (inference mode): compute only the last position, logits = lm_head(x[:, [-1], :]), shape (b, 1, vocab_size); loss = None
- Return logits, loss
- End

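Dimensions

Example with b = 2, t = 10, n_embd = 768, vocab_size = 50304:

- Input idx: (2, 10)
- Position indices: (10)
- Token embedding: (2, 10, 768)
- Position embedding: (10, 768), broadcast over the batch when added
- Add + Dropout: (2, 10, 768)
- Block 1 through Block 12: (2, 10, 768) after each block
- Final LayerNorm: (2, 10, 768)
- Training? Yes:
  - lm_head on all positions: (2, 10, 50304)
  - Reshape: (20, 50304)
  - CrossEntropy: scalar loss
  - Return logits, loss
- Training? No:
  - Take the last position: (2, 1, 768)
  - lm_head: (2, 1, 50304)
  - Return logits, None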
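
The shape walk-through above can be written down as a small helper that returns the shape at each stage for a given batch size and sequence length. This is a sketch only: `gpt_forward_shapes` is a hypothetical function, and the default values (n_embd=768, vocab_size=50304, block_size=1024) are assumed to match the default configuration discussed here.

```python
def gpt_forward_shapes(b, t, n_embd=768, vocab_size=50304, block_size=1024, training=True):
    """Return the tensor shape produced at each stage of GPT.forward (shapes only, no math)."""
    assert t <= block_size, f"Cannot forward sequence of length {t}, block size is only {block_size}"
    shapes = {
        "idx": (b, t),
        "pos": (t,),
        "tok_emb": (b, t, n_embd),
        "pos_emb": (t, n_embd),   # broadcast over the batch when added to tok_emb
        "x": (b, t, n_embd),      # unchanged by dropout, every Block, and ln_f
    }
    if training:
        shapes["logits"] = (b, t, vocab_size)
        shapes["logits_flat"] = (b * t, vocab_size)  # logits.view(-1, vocab_size)
        shapes["targets_flat"] = (b * t,)            # targets.view(-1)
    else:
        shapes["logits"] = (b, 1, vocab_size)        # lm_head applied to the last position only
    return shapes

train_shapes = gpt_forward_shapes(2, 10, training=True)
infer_shapes = gpt_forward_shapes(2, 10, training=False)
print(train_shapes["logits"], train_shapes["logits_flat"])
print(infer_shapes["logits"])
```

Flattening to (b * t, vocab_size) is what lets cross_entropy treat every position of every sequence as an independent classification example.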


# Model forward pass
def forward(self, idx, targets=None):
    device = idx.device # device the input lives on
    b, t = idx.size()   # batch size, sequence length
    assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
    # position indices 0 .. t-1
    pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)

    # forward the GPT model itself
    tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
    pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
    # pos_emb is broadcast over the batch dimension, so x has shape (b, t, n_embd)
    x = self.transformer.drop(tok_emb + pos_emb)
    for block in self.transformer.h:
        x = block(x)
    x = self.transformer.ln_f(x)

    if targets is not None:
        # if we are given some desired targets also calculate the loss (training mode)
        logits = self.lm_head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
    else:
        # inference-time mini-optimization: only forward the lm_head on the very last position (no loss is computed)
        logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
        loss = None

    return logits, loss

block forward

Flow

- Block input: x, shape (b, t, n_embd)
- LayerNorm 1: x_norm = ln_1(x)
- Causal self-attention: attn_out = attn(x_norm)
- Residual connection: x = x + attn_out
- LayerNorm 2: x_norm = ln_2(x)
- MLP: mlp_out = mlp(x_norm)
- Residual connection: x = x + mlp_out
- Block output: x, shape (b, t, n_embd)

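Dimensions

- Input: (2, 10, 768)
- LN1: (2, 10, 768)
- c_attn: (2, 10, 2304)
- Split into Q, K, V: (2, 10, 768) each
- Multi-head reshape: (2, 12, 10, 64)
- Attention scores: (2, 12, 10, 10)
- Mask + Softmax: (2, 12, 10, 10)
- Multiply by V: (2, 12, 10, 64)
- Merge heads: (2, 10, 768)
- c_proj: (2, 10, 768)
- Residual 1: (2, 10, 768)
- LN2: (2, 10, 768)
- c_fc expansion: (2, 10, 3072)
- GELU: (2, 10, 3072)
- c_proj compression: (2, 10, 768)
- Residual 2: (2, 10, 768)
- Output: (2, 10, 768)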


class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
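        # first layer normalization (applied before attention)
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        # causal self-attention layer
        self.attn = CausalSelfAttention(config)
        # second layer normalization (applied before the MLP)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        # feed-forward MLP
        self.mlp = MLP(config)
    # forward pass: two pre-norm sublayers, each wrapped in a residual connection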
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
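
The residual pattern in Block.forward can be illustrated with stand-in sublayers operating on a plain list. This is a toy sketch, not the real module: `block_forward` and the lambda sublayers (`identity`, `zeros`, `double`) are hypothetical placeholders for LayerNorm, attention, and the MLP.

```python
def block_forward(x, ln_1, attn, ln_2, mlp):
    """Pre-LayerNorm residual block: each sublayer sees a normalized copy of x,
    and its output is added back onto the unnormalized x."""
    x = [xi + ai for xi, ai in zip(x, attn(ln_1(x)))]
    x = [xi + mi for xi, mi in zip(x, mlp(ln_2(x)))]
    return x

# stand-in sublayers, for illustration only
identity = lambda v: v
zeros = lambda v: [0.0] * len(v)
double = lambda v: [2.0 * e for e in v]

x = [1.0, -2.0, 3.0]
passthrough = block_forward(x, identity, zeros, identity, zeros)   # sublayers contribute nothing
with_attn = block_forward(x, identity, double, identity, zeros)    # x + 2x from the "attention" branch
print(passthrough)
print(with_attn)
```

When both sublayers output zeros the input passes through unchanged, which is the property that makes deep stacks of such blocks easier to train.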

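LayerNorm layer

LayerNorm is a normalization layer, not a linear layer: it normalizes over the last dimension (n_embd), so its output has the same shape as its input.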


class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))                    # scale, initialized to ones
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None   # shift, initialized to zeros; None when bias=False

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
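
What layer normalization computes for a single vector can be sketched in plain Python. The function below is a hypothetical stand-in that assumes the usual definition (subtract the mean, divide by the square root of the biased variance plus eps, then apply scale and optional shift); it is meant to show the arithmetic, not to replace F.layer_norm.

```python
import math

def layer_norm(x, weight, bias=None, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance (divide by n)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    out = [w * n for w, n in zip(weight, normed)]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, weight=[1.0] * 4)   # ones as weight and no bias, matching the initial parameters
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(len(y), round(y_mean, 6), round(y_var, 4))
```

The output has the same length as the input, with mean close to 0 and variance close to 1 (slightly below 1 because of eps).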

CausalSelfAttention layer

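Flow

- Input x: (B, T, C)
- Generate Q, K, V: split the c_attn output into 3 parts
- Reshape into multi-head form (B, T, nh, hs), then transpose to (B, nh, T, hs)
- Use Flash Attention?
  - Yes: scaled_dot_product_attention with is_causal=True
  - No: compute attention manually
    - Attention scores: Q @ K^T divided by sqrt(hs), shape (B, nh, T, T)
    - Apply the causal mask: masked_fill sets future positions to negative infinity
    - Softmax normalization over the last dimension
    - Attention Dropout
    - Weighted sum: attention matrix times V, shape (B, nh, T, hs)
- Merge heads: transpose and reshape to (B, T, C)
- Output projection: c_proj linear layer
- Residual Dropout
- Output y: (B, T, C)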

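Dimensions

- Input: (2, 10, 768)
- c_attn: (2, 10, 2304)
- Split:
  - Q: (2, 10, 768), reshaped to (2, 12, 10, 64)
  - K: (2, 10, 768), reshaped to (2, 12, 10, 64), transposed to (2, 12, 64, 10)
  - V: (2, 10, 768), reshaped to (2, 12, 10, 64)
- Q @ K^T: (2, 12, 10, 10)
- Divide by sqrt(64): (2, 12, 10, 10)
- Mask: (2, 12, 10, 10)
- Softmax: (2, 12, 10, 10)
- @ V: (2, 12, 10, 64)
- Transpose: (2, 10, 12, 64)
- Reshape: (2, 10, 768)
- Output: (2, 10, 768)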
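
The multi-head reshape in the list above amounts to cutting each token's 768-dimensional vector into 12 contiguous chunks of 64. A minimal sketch for one token, using plain lists; `split_heads` and `merge_heads` are hypothetical helpers that mimic what the view and transpose calls do along the last dimension.

```python
def split_heads(vec, n_head):
    """Split one token's C-dimensional vector into n_head contiguous chunks of size C // n_head."""
    assert len(vec) % n_head == 0
    hs = len(vec) // n_head
    return [vec[h * hs:(h + 1) * hs] for h in range(n_head)]

def merge_heads(heads):
    """Concatenate the per-head chunks back into one C-dimensional vector."""
    return [v for head in heads for v in head]

token = list(range(768))         # stand-in for one row of q, k, or v
heads = split_heads(token, 12)   # 12 heads, 64 dimensions each
print(len(heads), len(heads[0]))
print(merge_heads(heads) == token)
```

Each head therefore attends using its own 64-dimensional slice, and merging the heads restores the original 768-dimensional layout.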


# Causal self-attention
class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
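        # a single linear layer produces q, k and v together (an efficiency optimization)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection: c_proj maps the concatenated head outputs back to n_embd dimensions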
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            # the causal mask is a lower-triangular matrix: ones on and below the diagonal, zeros above it
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2) # split the fused projection into q, k, v
        # reshape into heads: (B, T, C) -> (B, T, n_head, head_dim), then transpose -> (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs) (batch, heads, seq_len, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            # apply the causal mask (hide the positions a token is not allowed to see)
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            # softmax turns the scores into attention weights
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            # weighted sum of the values
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        # merge the heads
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
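
The effect of the causal mask in the manual attention path can be seen on a tiny example. The sketch below is a plain-Python illustration with a hypothetical helper `causal_softmax`; it takes a single (T, T) score matrix as nested lists instead of a batched, multi-head tensor.

```python
import math

def causal_softmax(scores):
    """Mask out future positions of a (T, T) score matrix, then softmax each row."""
    T = len(scores)
    out = []
    for i in range(T):
        # position i may only look at positions j <= i; the rest become -inf
        masked = [scores[i][j] if j <= i else -math.inf for j in range(T)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# with all scores equal, each row spreads its weight evenly over the visible positions
weights = causal_softmax([[0.0] * 4 for _ in range(4)])
for row in weights:
    print([round(w, 2) for w in row])
```

The first row puts all of its weight on position 0, and each later row distributes weight only across itself and earlier positions, so no token can attend to the future.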

MLP

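Flow

- Input x: (B, T, n_embd)
- Linear layer 1, c_fc: 768 to 3072, a 4x expansion
- GELU activation: non-linear transform
- Linear layer 2, c_proj: 3072 to 768, compressing back to the original dimension
- Dropout: regularization
- Output x: (B, T, n_embd)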

Dimensions

- Input: (2, 10, 768)
- c_fc linear, 768→3072 expansion: (2, 10, 3072)
- GELU activation: (2, 10, 3072)
- c_proj linear, 3072→768 compression: (2, 10, 768)
- Output: (2, 10, 768)


# MLP layer
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        # linear layer from n_embd to 4 * n_embd; with bias=True it has 4 * n_embd * n_embd + 4 * n_embd parameters
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)
    # forward pass
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x
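
The parameter count mentioned in the comment, and the GELU activation, can be checked with a few lines of plain Python. This is a sketch with hypothetical helper names (`gelu`, `mlp_param_count`); the GELU shown is the exact erf-based form, assumed here to correspond to nn.GELU() with default arguments.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp_param_count(n_embd, bias=True):
    """Count the weights (and optional biases) of c_fc and c_proj."""
    hidden = 4 * n_embd
    c_fc = n_embd * hidden + (hidden if bias else 0)     # n_embd -> 4 * n_embd
    c_proj = hidden * n_embd + (n_embd if bias else 0)   # 4 * n_embd -> n_embd
    return c_fc + c_proj

print(mlp_param_count(768))              # 4722432
print(mlp_param_count(768, bias=False))  # 4718592
print(round(gelu(1.0), 4), gelu(0.0))
```

For n_embd = 768 the two linear layers hold about 4.7 million parameters per block, which is why the MLP accounts for a large share of each Transformer block.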