LLaMA: Open and Efficient Foundation Language Models

This paper is inspired by the Chinchilla scaling law, which found that, for a fixed compute budget, the best performance comes not from the largest models but from smaller models trained on more data. The paper therefore proposes a collection of models ranging from 7B to 65B parameters, and these smaller models outperform much larger ones: LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.

1. Architecture

It is based on the original transformer architecture and incorporates several improvements proposed in subsequent large language models. The main changes (sketched in code after the list) are:

  • Pre-normalization: the input of each sub-layer is normalized (with RMSNorm) instead of the output.
  • SwiGLU activation function, instead of ReLU.
  • Rotary positional embeddings (RoPE).
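
Below is a minimal, self-contained PyTorch sketch of these three changes. It is not the paper's actual code: the class names, dimension handling, and the use of `F.scaled_dot_product_attention` for the attention step are my own assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization: no mean-centering and no bias, used as pre-normalization."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def apply_rotary(x, base=10000.0):
    # x: (batch, heads, seq, head_dim). Rotate channel pairs by a position-dependent
    # angle so that query/key dot products depend on relative positions.
    *_, seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    angles = torch.arange(seq, device=x.device).float()[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: a SiLU-gated linear unit instead of a ReLU MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Block(nn.Module):
    """One pre-norm transformer block: normalize the *input* of each sub-layer,
    then add the residual, instead of normalizing the sub-layer output."""
    def __init__(self, dim, n_heads, hidden_dim):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        self.attn_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden_dim)

    def attention(self, x):
        b, s, d = x.shape
        shape = (b, s, self.n_heads, self.head_dim)
        q = self.wq(x).view(shape).transpose(1, 2)  # (batch, heads, seq, head_dim)
        k = self.wk(x).view(shape).transpose(1, 2)
        v = self.wv(x).view(shape).transpose(1, 2)
        q, k = apply_rotary(q), apply_rotary(k)     # rotary embeddings on queries and keys only
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, d))

    def forward(self, x):
        x = x + self.attention(self.attn_norm(x))   # pre-norm, then residual
        x = x + self.ffn(self.ffn_norm(x))          # pre-norm, then residual
        return x
```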

2. Efficient Implementation

  • Causal multi-head attention that avoids storing the attention weights and does not compute the key/query scores that are masked out. (I still need to explore the underlying logic further.)
  • Reducing the amount of activations that are recomputed during the backward pass.
  • Saving the expensive activations by manually implementing the backward function of the transformer layers, instead of relying on PyTorch autograd. (See the sketch after this list.)
  • Using model and sequence parallelism to reduce memory usage.
  • Overlapping the computation of activations and the communication between GPUs as much as possible.
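
As a rough illustration of the activation-saving points, standard PyTorch offers generic activation checkpointing, which trades memory for recomputation during the backward pass. The sketch below only shows that generic mechanism, not the paper's approach: LLaMA instead hand-writes the backward function of the transformer layers so that the cheap activations are recomputed while the expensive ones are kept. The helper name and usage here are my own assumptions.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical usage: `block` is a transformer block like the one sketched in
# the architecture section, and `x` is its (batch, seq, dim) input.
def run_block_memory_efficiently(block, x):
    # Activation checkpointing: intermediate activations inside `block` are not
    # stored during the forward pass; they are recomputed during the backward
    # pass. LLaMA goes further by choosing *which* activations to keep (the
    # costly ones) via a manually written backward, rather than recomputing
    # everything as this generic helper does.
    return checkpoint(block, x, use_reentrant=False)
```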