Llama Architecture and Code Walkthrough

The overall Llama architecture is shown in the figure below:

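The official source code contains a lot of distributed-training logic, which makes it hard to read. We therefore analyze and re-implement Llama from the top down, splitting it into three levels: top, middle, and bottom.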

Overall composition of Llama

Top level

As the figure shows, Llama as a whole consists of one embedding layer, n Transformer layers, and one RMSNorm layer, so the top-level code is as follows:

import torch


class Llama(torch.nn.Module):
    def __init__(self, config: ModelArgs):
        super().__init__()
        self.config = config
        # Token embedding layer
        self.tok_embeddings = torch.nn.Embedding(self.config.vocab_size, self.config.dim)
        # Final RMSNorm
        self.norm = RMSNorm(config.dim, eps=config.norm_eps)
        # Stack of n Transformer layers
        self.layers = torch.nn.ModuleList()
        for i in range(self.config.n_layers):
            self.layers.append(TransformerBlock(config))

    def forward(self, tokens):
        # Embed the token ids: (bsz, seqlen) -> (bsz, seqlen, dim)
        h = self.tok_embeddings(tokens)
        # A decoder-only model needs a causal mask: -inf above the diagonal, 0 elsewhere
        seqlen = h.shape[1]
        mask = torch.full((seqlen, seqlen), float('-inf'), device=tokens.device)
        mask = torch.triu(mask, diagonal=1)
        # Run the n Transformer layers in order
        for layer in self.layers:
            h = layer(h, mask)
        # Apply the final RMSNorm; this simplified model returns hidden states, not vocabulary logits
        token_embeddings = self.norm(h)
        return token_embeddings
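
To make the causal mask concrete, here is a minimal pure-Python sketch of the same pattern that torch.full plus torch.triu(diagonal=1) produces. The helper name causal_mask is ours for illustration and is not part of the Llama code.

```python
import math


def causal_mask(seq_len):
    """Additive mask: 0.0 where attention is allowed, -inf strictly above the diagonal."""
    return [
        [0.0 if col <= row else -math.inf for col in range(seq_len)]
        for row in range(seq_len)
    ]


# Row i is the query position; it may only look at key positions 0..i.
for row in causal_mask(4):
    print(row)
```

Because the mask is added to the attention scores before the softmax, every -inf entry ends up with zero attention weight.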

Middle level

We start by re-implementing RMSNorm:

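class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps):
        super().__init__()
        # Small constant that keeps the reciprocal square root finite
        self.eps = eps
        # Learnable per-channel scale, initialized to 1
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def _norm(self, tensor):
        # Divide by the root mean square taken over the last (feature) dimension
        return tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, tensor):
        output = self._norm(tensor)
        return output * self.weight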
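
As a small worked example of the RMSNorm formula, the sketch below applies the same computation to a plain Python list. The function name rms_norm and the sample vector are illustrative assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then by a per-channel weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])
# With unit weights, the mean of the squared outputs is close to 1.
print(y, sum(v * v for v in y) / len(y))
```

Unlike LayerNorm, no mean is subtracted; only the magnitude of the vector is rescaled.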

Next we re-implement the Transformer layer. Each Transformer layer contains two RMSNorm layers, one multi-head attention layer, and one feed-forward layer.

class TransformerBlock(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Multi-head attention layer
        self.attention = Attention(config)
        # Norm layers, one in front of each sublayer (pre-norm)
        self.attention_normal = RMSNorm(config.dim, config.norm_eps)
        self.ffn_norm = RMSNorm(config.dim, config.norm_eps)
        # Feed-forward layer (simplified: hidden size fixed at 4 * dim)
        self.ffn = FeedForwad(self.config.dim, self.config.dim * 4)

    def forward(self, embeddings, mask):
        # norm -> attention -> add: the residual adds the un-normalized input
        h = embeddings + self.attention(self.attention_normal(embeddings), mask)
        # norm -> ffn -> add: again the residual bypasses the norm
        return h + self.ffn(self.ffn_norm(h))
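
In a pre-norm layout, each sublayer reads a normalized copy of its input, while the residual connection adds the sublayer output back to the un-normalized input. The toy sketch below shows only that wiring; pre_norm_block and the stand-in functions are hypothetical and operate on plain lists.

```python
def pre_norm_block(x, norm1, attention, norm2, ffn):
    """Pre-norm residual wiring: x + attention(norm1(x)), then h + ffn(norm2(h))."""
    a = attention(norm1(x))
    h = [xi + ai for xi, ai in zip(x, a)]
    f = ffn(norm2(h))
    return [hi + fi for hi, fi in zip(h, f)]


# Toy stand-ins for the real sublayers (assumptions for illustration only).
def halve(v):
    return [0.5 * t for t in v]


def zeros(v):
    return [0.0] * len(v)


x = [1.0, -2.0, 3.0]
# If both sublayers output zeros, the block is the identity: the norms never touch the residual path.
print(pre_norm_block(x, halve, zeros, halve, zeros))
```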

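Bottom level

Multi-head attention first projects the token embeddings into query, key, and value spaces, then splits them into heads, applies the rotary position embedding, and computes the attention scores.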

import math


class Attention(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.n_head = config.n_heads
        # Per-head dimension
        self.dim = config.dim // self.n_head

        # Key, query, and value projections
        # (simplified: n_kv_heads is ignored and there is no output projection after the heads are merged)
        self.k = torch.nn.Linear(config.dim, config.dim)
        self.q = torch.nn.Linear(config.dim, config.dim)
        self.v = torch.nn.Linear(config.dim, config.dim)

    def forward(self, embeddings, mask):
        bsz, seq_len, dim = embeddings.shape

        # Project into key, query, and value spaces
        k_embeddings = self.k(embeddings)
        q_embeddings = self.q(embeddings)
        v_embeddings = self.v(embeddings)
        # Split into heads: (bsz, seq_len, dim) -> (bsz, n_head, seq_len, head_dim)
        n_q_embeddings = q_embeddings.reshape(bsz, -1, self.n_head, self.dim).permute(0, 2, 1, 3)
        n_k_embeddings = k_embeddings.reshape(bsz, -1, self.n_head, self.dim).permute(0, 2, 1, 3)
        n_v_embeddings = v_embeddings.reshape(bsz, -1, self.n_head, self.dim).permute(0, 2, 1, 3)

        # Apply the rotary position embedding to queries and keys only
        rotated_n_q_embeddings = compute_rotated_embedding(n_q_embeddings, self.dim, seq_len, self.config.rope_theta)
        rotated_n_k_embeddings = compute_rotated_embedding(n_k_embeddings, self.dim, seq_len, self.config.rope_theta)

        # Scaled dot-product scores plus the causal mask, then softmax over the key positions
        scores = torch.nn.functional.softmax(mask + rotated_n_q_embeddings @ rotated_n_k_embeddings.transpose(-1, -2)
                               / math.sqrt(self.dim), dim=-1)

        # Weighted sum of values, then merge the heads back: (bsz, seq_len, dim)
        n_embeddings = scores @ n_v_embeddings
        embeddings = n_embeddings.permute(0, 2, 1, 3).reshape(bsz, -1, self.config.dim)

        return embeddings
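
The score computation for a single head can be sketched without tensors. The version below is a minimal illustration on lists of vectors; it leaves out the projections and the rotary embedding, and the names softmax and causal_attention are ours.

```python
import math


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0 for masked entries."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention; q, k, v are seq_len lists of vectors."""
    head_dim = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            if j > i:
                scores.append(-math.inf)  # future position: masked out
            else:
                scores.append(sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(head_dim))
        weights = softmax(scores)
        out.append([sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# The first position can only attend to itself, so its output equals v[0].
print(causal_attention(q, k, v)[0])
```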
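
The feed-forward layer is a gated (SwiGLU-style) MLP:

class FeedForwad(torch.nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        # linear1: gate projection, linear2: up projection, linear3: down projection
        self.linear1 = torch.nn.Linear(dim, hidden_dim)
        self.linear2 = torch.nn.Linear(dim, hidden_dim)
        self.linear3 = torch.nn.Linear(hidden_dim, dim)

    def forward(self, embeddings):
        # SiLU-activated gate
        gate = torch.nn.functional.silu(self.linear1(embeddings))
        # Element-wise product of the up projection and the gate
        up_proj = self.linear2(embeddings) * gate
        # Project back down to the model dimension
        return self.linear3(up_proj)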
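
The re-implementation above fixes the hidden size at 4 * dim and does not use the multiple_of or ffn_dim_multiplier settings. As a sketch, assuming the rounding rule from Meta's reference implementation as we understand it (scale 4 * dim by 2/3, optionally apply the multiplier, then round up to a multiple of multiple_of), the hidden size could be computed as follows. The function name swiglu_hidden_dim is hypothetical.

```python
def swiglu_hidden_dim(dim, multiple_of=256, ffn_dim_multiplier=None):
    """Hidden size of the gated MLP under the assumed reference rounding rule."""
    hidden_dim = int(2 * (4 * dim) / 3)  # 2/3 keeps the parameter count close to a plain 4x MLP
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the nearest multiple of `multiple_of`
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


print(swiglu_hidden_dim(4096))  # 11008
```

The 2/3 factor compensates for the third weight matrix that the gate introduces.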

Finally, we re-implement the rotary position embedding (RoPE). With that, every part of the Llama structure has been covered.

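def compute_rotated_embedding(embedding, dim, m, base):
    # Rotation angles for every position and frequency, shape (m, dim // 2)
    all_theta = compute_all_theta(dim, m, base, device=embedding.device)
    # Rotated embedding = embedding on the complex plane * unit complex number for the angle
    # 1. View consecutive pairs of the last dimension as complex numbers
    embedding_real_pair = embedding.float().reshape(*embedding.shape[:-1], -1, 2)
    embedding_complex_pair = torch.view_as_complex(embedding_real_pair)
    # 2. Turn the angles into unit complex numbers e^(i * theta)
    all_theta = all_theta[: embedding.shape[-2]]
    theta_complex_pair = torch.polar(torch.ones_like(all_theta), all_theta)
    # 3. Rotate by complex multiplication (broadcasts over batch and heads)
    rotated_complex_embedding = embedding_complex_pair * theta_complex_pair
    # 4. Convert back to real pairs and flatten to the original shape
    rotated_real_embedding = torch.view_as_real(rotated_complex_embedding)
    rotated_real_embedding = rotated_real_embedding.reshape(*embedding.shape[:-1], -1)
    return rotated_real_embedding.type_as(embedding)


def compute_all_theta(dim, m, base, device=None):
    # Per-pair frequencies: theta_i = base^(-i / (dim / 2)) for i = 0 .. dim/2 - 1
    theta = 1 / (base ** (torch.arange(0, dim // 2, device=device).float() / (dim / 2)))
    # Position indices 0 .. m-1 as floats, on the same device as the embeddings
    positions = torch.arange(0, m, device=device).float()
    # all_theta[p, i] = p * theta_i
    all_theta = torch.outer(positions, theta)
    return all_theta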
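
The reason for rotating queries and keys is that their dot product then depends only on the distance between the two positions. The sketch below checks this property with the standard library's cmath on a single vector; the helper names rope and dot and the sample vectors are illustrative assumptions, and the frequency formula mirrors compute_all_theta.

```python
import cmath


def rope(vec, position, base=500000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * base**(-2i/dim)."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        theta = base ** (-(2 * i) / dim)
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.rect(1.0, position * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near = dot(rope(q, 5), rope(k, 3))         # positions 5 and 3, offset 2
shifted = dot(rope(q, 105), rope(k, 103))  # both shifted by 100, same offset
print(abs(near - shifted) < 1e-9)
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, which is why absolute position never needs to be added to the embeddings.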

Appendix: Llama config parameters

from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    rope_theta: float = 500000

    max_batch_size: int = 32
    max_seq_len: int = 2048
    use_scaled_rope: bool = True