# Llama 2 Powered By ONNX
- [1. Llama 2](#1-llama-2)
  - [1.1. The structure of Llama 2](#11-the-structure-of-llama-2)
- [References](#references)
https://github.com/microsoft/Llama-2-Onnx
## 1. Llama 2
Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging in scale from 7 billion to 70 billion parameters.
### 1.1. The structure of Llama 2
The Llama 2 model consists of a stack of identical decoder layers. Each decoder layer (or transformer block) combines one self-attention sublayer with one feed-forward multi-layer perceptron (MLP), each preceded by RMSNorm pre-normalization and wrapped in a residual connection.
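The layer structure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the ONNX graph: it uses a single attention head and omits rotary position embeddings and the learned RMSNorm gains; all names are illustrative.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm as used in Llama: scale by root mean square (learned gain omitted).
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv, Wo):
    # Single-head causal self-attention; rotary position embeddings omitted.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)  # causal mask
    scores -= scores.max(axis=-1, keepdims=True)            # softmax stabilization
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ Wo

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down.
    gate = x @ W_gate
    return ((gate / (1.0 + np.exp(-gate))) * (x @ W_up)) @ W_down

def decoder_layer(x, attn_weights, mlp_weights):
    # Pre-norm residual blocks: attention sublayer, then feed-forward sublayer.
    x = x + self_attention(rms_norm(x), *attn_weights)
    x = x + swiglu_mlp(rms_norm(x), *mlp_weights)
    return x
```

Because the mask is causal, a token's output depends only on itself and earlier tokens, which is what makes autoregressive generation (and KV caching) possible.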
In the feed-forward layer, Llama models use a different projection size than classic transformers: both Llama 1 and Llama 2 project to roughly 2.7x the hidden size rather than the standard 4x.
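The 2.7x figure follows from the SwiGLU activation: it needs two input projections (gate and up) instead of one, so the classic 4x width is scaled by 2/3 to keep the parameter count comparable, then rounded up to a hardware-friendly multiple. A sketch of this sizing rule, assuming the rounding used in the open 7B/13B configs (the 70B config additionally applies a custom multiplier, not shown):

```python
def llama_ffn_dim(hidden_size, multiple_of=256):
    # SwiGLU has two input projections, so the classic 4x width is scaled
    # by 2/3 to keep parameter count comparable: 4 * 2/3 ≈ 2.7x.
    dim = int(4 * hidden_size * 2 / 3)
    # Round up to the next multiple of `multiple_of`.
    return multiple_of * ((dim + multiple_of - 1) // multiple_of)
```

For the 7B config (hidden size 4096) this gives 11008, and for 13B (hidden size 5120) it gives 13824, matching the released checkpoints.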
A key architectural difference between Llama 1 and Llama 2 lies in the attention layer: the larger Llama 2 variants adopt Grouped-Query Attention (GQA), which shares each key/value head across a group of query heads, shrinking the KV cache and improving inference efficiency.
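The KV-head sharing at the heart of GQA can be sketched as follows. This is a minimal NumPy illustration with causal masking and rotary embeddings omitted; shapes and names are illustrative (Llama 2 70B uses 64 query heads sharing 8 KV heads).

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (n_q_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim),
    # where n_kv_heads divides n_q_heads.
    group = q.shape[0] // k.shape[0]           # query heads per shared KV head
    k = np.repeat(k, group, axis=0)            # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # softmax stabilization
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With `n_kv_heads == n_q_heads` this reduces to standard multi-head attention, and with a single KV head it becomes multi-query attention; GQA sits between the two, trading a small quality cost for a much smaller KV cache at inference time.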
(Figures not reproduced here: Llama 2 model; Llama 2 model top view; decoder layer.)
## References
[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/