大模型训练中CUDA out of memory的几种解决方法

常用：

减小batch size
如果需要较大的batch size来达到预期训练指标效果，则使用gradient accumulation来累积梯度，达到等价效果的同时，占用显存小；
使用低精度做训练。（mixed precision training). FP16+FP32
减小训练数据样本的长度
Model并行或者Pipeline并行，把模型拆分至多卡

不常用：

释放闲置内存；torch.cuda.empty_cache()
缓解PyTorch的显存碎片；PYTORCH_CUDA_ALLOC_CONF和max_split_size_mb

7.使用更小的模型

增大GPU显存
使用nvidia-smi看到该GPU上的其他进程；可杀则杀；

Perplexity 查询结果：

Here are several solutions to resolve the "CUDA out of memory" error when training large language models (LLMs) using SFT (Supervised Fine-Tuning):

Reduce batch size: Lowering the batch size during training can significantly reduce the memory requirements. For example, you could set the batch size to 1 and use gradient accumulation to accumulate gradients over multiple batches before updating the model weights.

Use gradient accumulation: Gradient accumulation allows you to accumulate gradients over multiple batches before updating the model weights. This effectively increases the batch size without increasing the memory usage. Be sure to divide the loss by the number of gradient accumulation steps to maintain the correct gradient scale.

Enable mixed precision training: Training the model in mixed precision (FP16) can reduce the memory usage compared to full precision (FP32). PyTorch provides the torch.cuda.amp package for automatic mixed precision training.

Reduce input sequence length: If your input texts are long, consider truncating or splitting them into shorter sequences to reduce the memory usage during tokenization and model inference.

Free GPU memory: Call torch.cuda.empty_cache() to free up any unused GPU memory allocated by PyTorch. This might not help much, but it's worth a try.

Set PYTORCH_CUDA_ALLOC_CONF environment variable: You can try setting the PYTORCH_CUDA_ALLOC_CONF environment variable with max_split_size_mb to avoid memory fragmentation. Experiment with different values based on your GPU memory.

Use a smaller model: If possible, consider using a smaller version of the LLM that has fewer parameters. Smaller models generally require less memory during training.

Increase GPU memory: If you have access to a GPU with more memory, consider using it to accommodate the memory requirements of the LLM training.

Distribute training across multiple GPUs: Splitting the training across multiple GPUs can reduce the memory usage per GPU. However, this approach requires additional setup and coordination.

Monitor GPU memory usage: Use nvidia-smi to monitor the GPU memory usage and identify any other processes consuming GPU memory. Terminate unnecessary processes to free up memory