LLaMA: Open and Efficient Foundation Language Models

This paper is inspired by the Chinchilla scaling law, which found that, given a fixed compute budget, the best performance is achieved not by the largest models but by smaller models trained on more data. Accordingly, the paper proposes a collection of models ranging from 7B to 65B parameters, and these smaller models outperform much larger ones.

1. Architecture

It is based on the original transformer architecture and leverages several improvements proposed in subsequent large language models. The main changes are:

  • Pre-normalization: normalize the input of each transformer sub-layer (using RMSNorm) instead of the output.
  • SwiGLU activation function, replacing ReLU.
  • Rotary positional embeddings (RoPE), replacing absolute positional embeddings.
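The three changes above can be sketched in NumPy as follows. This is a minimal illustration rather than the reference implementation: the weight shapes, the `eps` value, and the single-sequence (2-D) layout are my own assumptions.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Pre-normalization: applied to the *input* of each sub-layer.
    # RMSNorm rescales by the root mean square; unlike LayerNorm,
    # it subtracts no mean and has no bias term.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, W, V, W2):
    # SwiGLU feed-forward: Swish(x W) gated by (x V), then projected by W2.
    gate = x @ W
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish / SiLU activation
    return (swish * (x @ V)) @ W2

def rotary_embedding(x, base=10000.0):
    # Rotary embeddings: rotate each channel pair (i, i + dim/2) by a
    # position-dependent angle, so dot products encode relative position.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Note that the rotation is length-preserving, which is why RoPE can be applied to queries and keys without destabilizing attention scores.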

2. Efficient Implementation

  • Efficient causal multi-head attention, which does not store the attention weights and does not compute scores for the masked (future) positions. (I still need to explore the logic behind this further.)
  • Reduce the amount of activations that are recomputed during the backward pass by saving the expensive ones (e.g., the outputs of the linear layers).
  • This is done by manually implementing the backward function for the transformer layers instead of relying on PyTorch autograd.
  • Use model and sequence parallelism to reduce memory usage.
  • Overlap the computation of activations and the communication between GPUs as much as possible.
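The causal masking in the first point can be illustrated with a naive single-head sketch in NumPy (my own simplification: a real efficient kernel, e.g. the one in xformers, avoids materializing the score matrix and skips the masked positions entirely):

```python
import numpy as np

def causal_attention(query, key, value):
    # Naive single-head causal attention, for illustration only.
    # Efficient implementations never build the full `scores` matrix
    # and never touch the upper-triangular (future) entries.
    seq_len, d = query.shape
    scores = query @ key.T / np.sqrt(d)            # (seq_len, seq_len)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                       # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over past positions
    return weights @ value
```

Because row `i` of the mask zeroes out every column `j > i`, token `i` can only attend to itself and earlier tokens; in particular the first token's output is exactly its own value vector.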