Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a Single AMD GPU

ROCm Blogs

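Building on the previous blog, Fine-tuning Llama 2 with LoRA, we take a closer look at a parameter-efficient fine-tuning (PEFT) approach called Quantized Low-Rank Adaptation (QLoRA). The focus is on using QLoRA to fine-tune the Llama-2 7B model on a single AMD GPU with ROCm. QLoRA addresses the memory and compute constraints that otherwise make this difficult. The goal of this exploration is to show how QLoRA can make open-source large language models more accessible.

QLoRA fine-tuning

QLoRA is a fine-tuning technique that combines high-precision computation with low-precision storage. This keeps the model small while preserving its performance and accuracy.

How does QLoRA work?

In short, QLoRA reduces the memory needed for LLM fine-tuning without sacrificing performance compared with standard 16-bit fine-tuning. Specifically, QLoRA compresses the pretrained language model with 4-bit quantization. It then freezes the language model's parameters and introduces a small number of trainable parameters in the form of Low-Rank Adapters. During fine-tuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained model into the Low-Rank Adapters. Notably, only the LoRA layers are updated during training. For more on LoRA, see the original LoRA paper.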
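
To make the mechanics concrete, here is a minimal sketch of a LoRA-style forward pass in plain Python. It is an illustration under simplifying assumptions, not the code that PEFT or bitsandbytes actually runs: the helper names (`matvec`, `lora_forward`) and the tiny dimensions are made up for the example, and the frozen weight `W` is kept as ordinary floats rather than 4-bit values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x).

    W (d_out x d_in) is the frozen base weight.
    A (r x d_in) and B (d_out x r) are the only trainable matrices.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical toy sizes; real Llama 2 projections are far larger.
d_in, d_out, r, alpha = 8, 8, 2, 2

# Frozen base weight (in QLoRA it would be stored in 4 bits and dequantized on use).
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]
# Usual LoRA initialization: A is non-zero, B is zero,
# so the adapter initially leaves the base output unchanged.
A = [[0.01 * (i + 1)] * d_in for i in range(r)]
B = [[0.0] * r for _ in range(d_out)]

x = [float(k + 1) for k in range(d_in)]
y_base = matvec(W, x)
y_adapted = lora_forward(W, A, B, x, alpha, r)

trainable = r * (d_in + d_out)  # entries in A plus entries in B
frozen = d_in * d_out           # entries in W
print(y_base == y_adapted, trainable, frozen)
```

Because `B` starts at zero, the adapted layer reproduces the base output until training changes `A` and `B`; `W` itself is never updated.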

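QLoRA vs. LoRA

QLoRA and LoRA are both parameter-efficient fine-tuning techniques. LoRA works as a standalone fine-tuning method, whereas QLoRA incorporates LoRA as an auxiliary mechanism to correct the errors introduced during quantization and to further reduce resource requirements during fine-tuning.

Step-by-step fine-tuning of Llama 2 with QLoRA

This section walks you through fine-tuning the 7-billion-parameter Llama 2 model with QLoRA, step by step, on a single AMD GPU. QLoRA makes this possible by sharply reducing the memory required.

We use the following setup:

Hardware & OS: See this link for the list of hardware and operating systems compatible with ROCm.
Software:
ROCm 6.1.0+
PyTorch for ROCm 2.0+
Libraries: `transformers`, `accelerate`, `peft`, `trl`, `bitsandbytes`, `scipy`

For this blog, we ran our experiments on a single MI250 GPU with the Docker image rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2.

You can find the complete code used in this blog in the GitHub repository.

1. Getting started

Our first step is to confirm that a GPU is available.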

!rocm-smi --showproductname

    ========================= ROCm System Management Interface ========================= 
    =================================== Product Info ===================================
    GPU[0]      : Card series:  AMD INSTINCT MI250 (MCM) OAM AC MBA
    GPU[0]      : Card model:   0x0b0c
    GPU[0]      : Card vendor:  Advanced Micro Devices, Inc. [AMD/ATI]
    GPU[0]      : Card SKU:     D65209
    GPU[1]      : Card series:  AMD INSTINCT MI250 (MCM) OAM AC MBA
    GPU[1]      : Card model:   0x0b0c
    GPU[1]      : Card vendor:  Advanced Micro Devices, Inc. [AMD/ATI]
    GPU[1]      : Card SKU:     D65207
    ====================================================================================
    =============================== End of ROCm SMI Log ===============================

If your AMD machine has more than one GCD or GPU, let's use just one Graphics Compute Die (GCD) or GPU.

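import os
# Expose only the first GCD/GPU; set this before PyTorch initializes the device.
os.environ["HIP_VISIBLE_DEVICES"] = "0"

import torch
# PyTorch for ROCm reuses the torch.cuda API, so these calls report the AMD GPU.
use_cuda = torch.cuda.is_available()
if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    device_count = torch.cuda.device_count()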

    __CUDNN VERSION: 2020000
    __Number CUDA Devices: 1

Next, install the required libraries.

!pip install -q pandas peft==0.9.0 transformers==4.31.0 trl==0.4.7 accelerate scipy

Installing bitsandbytes

ROCm needs a special version of bitsandbytes (bitsandbytes-rocm).

Install bitsandbytes with the following commands.

git clone --recurse https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm_enabled
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=hip -S . #Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install .

Check the bitsandbytes version.

At the time of writing this blog, the version is 0.43.0.

%%bash
pip list | grep bitsandbytes
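
If you prefer to check versions from Python rather than the shell, a small sketch like the following works with the standard library alone. The helper name `installed_version` is hypothetical, and the package list simply mirrors the libraries named in the setup section; the versions you see depend on your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Libraries listed in this post's setup section.
required = ["transformers", "accelerate", "peft", "trl", "bitsandbytes", "scipy"]
for name in required:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```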

Import the required packages.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer

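2. Configuring the model and data

Model configuration

You can access Meta's official Llama-2 model on Hugging Face after submitting a request, which can take a few days to be approved. As an alternative, we use NousResearch's Llama-2-7b-chat-hf as our base model (it is the same as the original model but easier to access).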

# Model and tokenizer names
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model_name = "llama-2-7b-enhanced"  # You can give the fine-tuned model your own name

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

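QLoRA 4-bit quantization configuration

As described in the paper, QLoRA stores weights in 4 bits while allowing computation in 16-bit or 32-bit precision. This means that whenever a QLoRA weight tensor is used, it is dequantized to 16-bit or 32-bit precision and then used in the matrix multiplication. Various combinations are available, such as float16, bfloat16, and float32. You can also experiment with different 4-bit quantization variants, including 4-bit NormalFloat (NF4) or plain 4-bit floating point (FP4). Based on the theoretical considerations and empirical results in the paper, NF4 is the recommended choice because it tends to perform better.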
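
The store-in-low-precision, compute-in-high-precision idea can be sketched as follows. This is a simplified illustration that uses a uniform 16-level grid with per-block absmax scaling. It is not the NF4 codebook and not the bitsandbytes implementation, and the function names and the block size of 4 are hypothetical choices made for readability.

```python
def quantize_block(values, levels=16):
    """Quantize one block to integer codes in [0, levels - 1] with absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    step = 2.0 / (levels - 1)                    # grid spacing on [-1, 1]
    codes = [round((v / absmax + 1.0) / step) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Map integer codes back to floats on the block's original scale."""
    step = 2.0 / (levels - 1)
    return [(code * step - 1.0) * absmax for code in codes]


weights = [0.12, -0.5, 0.33, 0.9, -0.04, 0.01, 0.02, -0.03]
block_size = 4  # toy value so that two blocks fit in eight weights

all_codes, restored = [], []
for start in range(0, len(weights), block_size):
    block = weights[start:start + block_size]
    codes, absmax = quantize_block(block)             # what gets stored: 4-bit codes plus one scale
    all_codes.extend(codes)
    restored.extend(dequantize_block(codes, absmax))  # what the matrix multiplication consumes

max_error = max(abs(w - q) for w, q in zip(weights, restored))
print(all_codes, max_error)
```

Giving each block its own scale is what keeps the second block, whose values are all small, from being crushed onto one or two grid points by the larger values in the first block.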

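In our case, we chose the following configuration:

4-bit quantization with the NF4 type
16-bit (float16) for computation
Double quantization, which applies a second quantization after the first and saves roughly an additional 0.3 to 0.4 bits per parameter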
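
The saving from double quantization can be worked out with a few lines of arithmetic. The block sizes and bit widths below are assumptions taken from the QLoRA paper rather than from this post: 64 weights per first-level constant, 32-bit constants without double quantization, and 8-bit constants grouped in blocks of 256 with it. The function name is hypothetical.

```python
def constant_overhead_bits(block_size, constant_bits):
    """Bits of quantization-constant storage charged to each weight."""
    return constant_bits / block_size


# Assumed values from the QLoRA paper, not from this post.
WEIGHT_BLOCK = 64      # weights sharing one first-level constant
CONSTANT_BLOCK = 256   # first-level constants sharing one second-level constant

# Without double quantization: one 32-bit constant per 64 weights.
single = constant_overhead_bits(WEIGHT_BLOCK, 32)

# With double quantization: 8-bit first-level constants, plus one 32-bit
# second-level constant per 256 first-level constants.
double = constant_overhead_bits(WEIGHT_BLOCK, 8) + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saving = single - double
print(f"{single:.3f} -> {double:.3f} bits per parameter, saving {saving:.3f}")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter.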

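The quantization parameters are controlled through BitsAndBytesConfig (see the Hugging Face documentation: https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig) as follows:

load_in_4bit enables 4-bit loading
bnb_4bit_quant_type sets the data type used for quantization. Two quantization data types are supported: fp4 (4-bit floating point) and nf4 (4-bit NormalFloat). The latter is theoretically optimal for normally distributed weights, so we recommend nf4.
bnb_4bit_compute_dtype sets the data type used for computation in the linear layers
bnb_4bit_use_double_quant enables nested quantization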

# Quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

Load the model with the quantization configuration.

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map="auto"
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
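
As a rough sanity check on why 4-bit loading matters, the sketch below estimates the memory needed for the weights alone at different precisions. It is back-of-the-envelope arithmetic, not a measurement. The parameter count is the full-parameter figure from the comparison table later in this post, and the estimate ignores quantization constants, any layers left unquantized, activations, and optimizer state. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count of the base model, taken from the comparison table in this post.
NUM_PARAMS = 6_738_415_616

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```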

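Dataset configuration

We fine-tune our base model for a question-answering task using a small dataset called mlabonne/guanaco-llama2-1k. It is a subset (1,000 samples) of the timdettmers/openassistant-guanaco dataset. That dataset is a human-generated, human-annotated assistant-style conversation corpus containing 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.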

# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")
# check the data
print(training_data.shape)
# #11 is a QA sample in English
print(training_data[11])


    (1000, 1)
    {'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture. It has become a recognizable catchphrase that people use to add humor to everyday conversations. The meme has also been used to satirize politics and other serious issues. For example, in 2016, a group of activists in the UK used the phrase "Deez Nuts for President" as part of a campaign to encourage young people to vote in the EU referendum. </s><s>[INST] Rewrite the essay in a more casual way. Instead of sounding proffesional, sound like a college student who is forced to write the essay but refuses to do so in the propper way. Use casual words and slang when possible. [/INST] Yo, so you want me to write a 1000-word essay about Deez Nuts? Alright, fine. So, this whole thing started on Vine back in 2015. Some dude named Rodney Bullard made a video where he would ask people if they knew a rapper, and when they said no, he would hit them with the classic line: "Deez Nuts!" People loved it, and it became a viral meme.\n\nNowadays, Deez Nuts is used for all kinds of stuff. 
You can throw it out there to interrupt someone or just to be funny. It\'s all over the internet, in music, and even in politics. In fact, during the 2016 US presidential election, a kid named Brady Olson registered as an independent candidate under the name Deez Nuts. He actually got some attention from the media and made appearances on TV and everything.\n\nThe impact of Deez Nuts on our culture is pretty huge. It\'s become a thing that everyone knows and uses to add some humor to their everyday conversations. Plus, people have used it to make fun of politics and serious issues too. Like, in the UK, some groups of activists used the phrase "Deez Nuts for President" to encourage young people to vote in the EU referendum.\n\nThere you have it, a thousand words about Deez Nuts in a more casual tone. Can I go back to playing video games now? </s>'}
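
The sample above shows the layout each record follows: every turn is wrapped as `<s>[INST] question [/INST] answer </s>`, and multi-turn conversations are concatenated into one string in the `text` column. If you want to prepare your own data in the same shape, a helper along these lines could be used. It is a sketch inferred from the printed sample, and the function name and example turns are hypothetical.

```python
def format_llama2_conversation(turns):
    """Join (user, assistant) pairs into the single-string layout seen in the sample."""
    return "".join(
        f"<s>[INST] {user.strip()} [/INST] {assistant.strip()} </s>"
        for user, assistant in turns
    )


example = format_llama2_conversation([
    ("What is QLoRA?", "A memory-efficient fine-tuning method."),
    ("Does it train all weights?", "No, only the LoRA adapters."),
])
record = {"text": example}  # same single-column layout as the dataset
print(record["text"])
```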


## There is a dependency during training
!pip install tensorboardX

3. Start fine-tuning

Set the training parameters with the following code:

# Training parameters
train_params = TrainingArguments(
    output_dir="./results_modified",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
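
It helps to know how many optimizer steps these settings imply before launching the run. The sketch below does the arithmetic for the 1,000-sample dataset loaded earlier, assuming a single device and that the final partial batch is kept; the function name is hypothetical and the calculation only approximates what the trainer does internally.

```python
import math


def optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps,
                    num_devices=1, epochs=1):
    """Number of optimizer updates for a run, assuming the last partial batch is kept."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


total_steps = optimizer_steps(num_samples=1000, per_device_batch_size=4, grad_accum_steps=1)
logging_points = list(range(50, total_steps + 1, 50))  # logging_steps=50
print(total_steps, logging_points)
```

With a batch size of 4 and no gradient accumulation, one epoch is 250 steps, which is the step count that appears in the training log later in this post.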

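Training with the QLoRA configuration

Now you can integrate LoRA into the base model and check its additional parameters. LoRA essentially adds a pair of rank-decomposition weight matrices (called update matrices) to the existing weights and trains only these newly added weights.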

from peft import get_peft_model
# LoRA configuration
peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()


    trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
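
The trainable-parameter count printed above can be reproduced by hand. The sketch below rests on assumptions that are not stated in this post: that Llama 2 7B has 32 decoder layers with a hidden size of 4096, and that, with no target modules specified, PEFT applies LoRA to the `q_proj` and `v_proj` projections of each layer. The function name is hypothetical.

```python
def lora_param_count(num_layers, hidden_size, rank, targets_per_layer):
    """Trainable parameters added by LoRA to square (hidden x hidden) projections."""
    per_module = rank * (hidden_size + hidden_size)  # A is r x d, B is d x r
    return num_layers * targets_per_layer * per_module


# Assumed architecture and default targets (see the paragraph above).
added = lora_param_count(num_layers=32, hidden_size=4096, rank=8, targets_per_layer=2)
total = 6_742_609_920  # "all params" from the printed summary

print(f"{added:,} trainable, {100 * added / total:.4f}% of all parameters")
```

Subtracting the adapter parameters from the total gives 6,738,415,616, the size of the base model on its own.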

Note that LoRA adds only 0.062% of the parameters, a small fraction of the original model. This is the share of parameters we update through fine-tuning, as follows.

# Trainer with the QLoRA configuration
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

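# Training
fine_tuning.train()

# Save the trained LoRA adapter so it can be reloaded later under new_model_name
fine_tuning.model.save_pretrained(new_model_name)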

The output looks like this:

[250/250 05:31, Epoch 1/1]
Step     Training Loss
50       1.557800
100      1.348100
150      1.277000
200      1.324300
250      1.347700

TrainOutput(global_step=250, training_loss=1.3709784088134767, metrics={'train_runtime': 335.085, 'train_samples_per_second': 2.984, 'train_steps_per_second': 0.746, 'total_flos': 8679674339426304.0, 'train_loss': 1.3709784088134767, 'epoch': 1.0})

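Checking memory usage during training with QLoRA

During training, you can check memory usage by running the "rocm-smi" command in a terminal. The command produces the following output, which reports memory and GPU utilization.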

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    50.0c           352.0W  1700Mhz  1600Mhz  0%   auto  560.0W   17%   100%  
====================================================================================
=============================== End of ROCm SMI Log ================================
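
The VRAM% column can be converted into gigabytes with a one-line calculation. The sketch assumes that each MI250 GCD exposes 64 GB of HBM (128 GB per card across its two GCDs); adjust the total for other hardware. The function name is hypothetical.

```python
def vram_used_gb(vram_percent, total_vram_gb):
    """Convert the VRAM% reading from rocm-smi into gigabytes."""
    return vram_percent / 100 * total_vram_gb


# Assumption: 64 GB of HBM per MI250 GCD.
used = vram_used_gb(vram_percent=17, total_vram_gb=64)
print(f"{used:.2f} GB")
```

Under that assumption, 17% corresponds to the 10.88 GB reported for QLoRA in the comparison table.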

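To better understand QLoRA's effect on training, we carry out a quantitative comparison of QLoRA, LoRA, and full-parameter fine-tuning. The analysis covers memory usage, training speed, training loss, and other relevant metrics.

4. Comparing QLoRA, LoRA, and full-parameter fine-tuning

We build on the previous blog post, which fine-tuned the Llama 2 model with both LoRA and full-parameter methods, and add the QLoRA results. This gives a combined overview of the insights gained from the three fine-tuning methods.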

Metric                  Full-parameter   LoRA         QLoRA
Trainable parameters    6,738,415,616    4,194,304    4,194,304
Memory usage (GB)       128              83.2         10.88
Number of GCDs          2                2            1
Training time           3 hours          9 minutes    6 minutes
Training loss           1.368            1.377        1.347

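• Memory usage:

◦ Full-parameter fine-tuning has 6,738,415,616 trainable parameters, which leads to significant memory consumption during the backpropagation phase of training.

◦ In contrast, LoRA and QLoRA introduce only 4,194,304 trainable parameters, just 0.062% of the trainable parameters in full-parameter fine-tuning.

◦ Monitoring memory during training shows that fine-tuning with LoRA uses only 65% of the memory consumed by full-parameter fine-tuning. QLoRA does even better, cutting memory consumption to only about 8%.

◦ This opens up room to increase the batch size and maximum sequence length, and to train on larger datasets, within limited hardware resources.

• Training speed:

◦ The results show that full-parameter fine-tuning takes hours to complete, whereas fine-tuning with LoRA or QLoRA takes only minutes.

◦ Several factors contribute to the faster training:

▪ Fewer trainable parameters in LoRA mean fewer derivative calculations and less memory needed to store and update weights.

▪ Full-parameter fine-tuning is more likely to be memory-bound, where data movement becomes the training bottleneck. This shows up as lower GPU utilization. Adjusting the training settings can alleviate this, but it may require more resources (additional GPUs) and a smaller batch size.

• Accuracy:

◦ Training loss dropped significantly in every training run, and all three fine-tuning methods reached nearly the same training loss.

◦ In the original QLoRA study, the authors note that the performance lost to imprecise quantization can be fully recovered through adapter fine-tuning after quantization. Our experiments are consistent with this observation, underscoring how effective adapter fine-tuning is at restoring performance after quantization.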
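
The percentages quoted in the memory and speed bullets follow directly from the table. The short sketch below recomputes them, treating "3 hours" as exactly 180 minutes, which is an approximation. Keep in mind that the full-parameter and LoRA runs used two GCDs while the QLoRA run used one, so the speed ratios are indicative rather than a like-for-like benchmark.

```python
# Values copied from the comparison table; "3 hours" is taken as 180 minutes.
results = {
    "Full-parameter": {"memory_gb": 128.0, "minutes": 180},
    "LoRA": {"memory_gb": 83.2, "minutes": 9},
    "QLoRA": {"memory_gb": 10.88, "minutes": 6},
}

baseline = results["Full-parameter"]
summary = {}
for method, row in results.items():
    memory_share = 100 * row["memory_gb"] / baseline["memory_gb"]
    speedup = baseline["minutes"] / row["minutes"]
    summary[method] = (round(memory_share, 1), round(speedup, 1))
    print(f"{method:>14}: {memory_share:5.1f}% of full memory, {speedup:4.1f}x faster")
```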

5. Testing the model fine-tuned with QLoRA

# Reload the model in FP16 and merge it with the fine-tuned weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
from peft import LoraConfig, PeftModel
model = PeftModel.from_pretrained(base_model, new_model_name)
model = model.merge_and_unload()

# Reload the tokenizer so it can be saved
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

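Now, let's upload the model to Hugging Face so that we can run further tests later or share it with others. You need a valid Hugging Face account for this step.

We can now test with both the base (original) model and the fine-tuned model.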

Testing the base model

# Generate Text using base model
query = "What do you think is the most important part of building an AI chatbot?"
text_gen = pipeline(task="text-generation", model=base_model_name, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

Testing the fine-tuned model

# Generate Text using fine-tuned model
query = "What do you think is the most important part of building an AI chatbot?"
# Use the merged model object created above (base weights plus the LoRA adapter)
text_gen = pipeline(task="text-generation", model=model, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])


    <s>[INST] What do you think is the most important part of building an AI chatbot? [/INST] The most important part of building an AI chatbot is to ensure that it is able to understand and respond to user input in a way that is both accurate and natural-sounding.
    
    To achieve this, you will need to use a combination of natural language processing (NLP) techniques and machine learning algorithms to enable the chatbot to understand and interpret user input, and to generate appropriate responses.
    
    Some of the key considerations when building an AI chatbot include:
    
    1. Defining the scope and purpose of the chatbot: What kind of tasks or questions will the chatbot be able to handle? What kind of user input will it be able to understand?
    2. Choosing the right NLP and machine learning algorithms: There are many different NLP and machine learning algorithms available, and the right ones will depend on the

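You can now compare the outputs of the two models for the given query. As expected, the two outputs differ slightly because fine-tuning changed the model's weights.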