Ways to resolve CUDA out of memory errors when training large models

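Common:

  1. Reduce the batch size.

  2. If a larger batch size is needed to reach the target training metrics, use gradient accumulation: it gives an equivalent effective batch size while using less GPU memory.

  3. Train in lower precision (mixed precision training), which combines FP16 and FP32.

  4. Shorten the training samples (reduce the sequence length).

  5. Use model parallelism or pipeline parallelism to split the model across multiple GPUs.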
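
Gradient accumulation (item 2 above) can be sketched as follows. This is a minimal illustration, not a production training loop: the tiny `nn.Linear` model, the random data, the SGD optimizer, and the helper name `accumulated_step` are all assumptions chosen so the example runs on CPU. It checks that the gradient accumulated over 4 micro-batches matches the gradient of one full batch.

```python
import copy

import torch
from torch import nn


def accumulated_step(model, optimizer, loss_fn, inputs, targets, accum_steps):
    """Run one optimizer update assembled from several micro-batches."""
    optimizer.zero_grad()
    micro_batches = list(zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)))
    for x, y in micro_batches:
        # Divide by the number of micro-batches so the summed gradients
        # equal the gradient of the mean loss over the whole batch.
        loss = loss_fn(model(x), y) / len(micro_batches)
        loss.backward()  # backward() adds into .grad; nothing is reset here
    optimizer.step()  # a single weight update for the whole effective batch


torch.manual_seed(0)
model = nn.Linear(4, 1)  # hypothetical stand-in for a real model
reference = copy.deepcopy(model)  # same initial weights, for the full-batch gradient
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = nn.MSELoss()

# Reference: gradient from one full batch of 8 samples.
loss_fn(reference(inputs), targets).backward()

# Accumulated: 4 micro-batches of 2 samples each, so only 2 samples are
# in flight during any single forward/backward pass.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulated_step(model, optimizer, loss_fn, inputs, targets, accum_steps=4)

max_diff = max(
    (p.grad - q.grad).abs().max().item()
    for p, q in zip(model.parameters(), reference.parameters())
)
print(f"largest gradient difference: {max_diff:.2e}")
```

The match is exact (up to floating-point error) here because every micro-batch has the same size and the loss is a plain mean. With per-token losses and variable-length sequences, the equivalence is only approximate unless the normalization is handled explicitly.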

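Less common:

  1. Release idle cached memory with torch.cuda.empty_cache().

  2. Mitigate PyTorch memory fragmentation by setting max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF.

  3. Use a smaller model.

  4. Use a GPU with more memory.

  5. Use nvidia-smi to check for other processes on the same GPU, and kill them if it is safe to do so.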
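
The allocator setting from the list above could be applied from Python roughly like this. It is a sketch under assumptions: the helper name `configure_allocator` is hypothetical, and the value 128 is only an illustrative starting point, not a recommendation from the original post. The variable has to be in place before PyTorch makes its first CUDA allocation, so setting it before importing torch (or in the shell that launches training) is the safest option.

```python
import os


def configure_allocator(max_split_size_mb: int) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF and return the value that was written.

    Call this before the training process touches CUDA for the first time.
    """
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be a positive number of MiB")
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# 128 is an assumed example value; tune it for your GPU and workload.
conf = configure_allocator(128)
print(conf)

# Only import torch after the variable is set, for example:
#   import torch
#   ...
#   torch.cuda.empty_cache()  # hands unused cached blocks back to the driver
```

The same setting can be given on the command line instead, by prefixing the launch command with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`.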

Perplexity query result:

Here are several solutions to resolve the "CUDA out of memory" error when training large language models (LLMs) using SFT (Supervised Fine-Tuning):

  1. Reduce batch size: Lowering the batch size during training can significantly reduce the memory requirements. For example, you could set the batch size to 1 and use gradient accumulation to accumulate gradients over multiple batches before updating the model weights.
  2. Use gradient accumulation: Gradient accumulation allows you to accumulate gradients over multiple batches before updating the model weights. This effectively increases the batch size without increasing the memory usage. Be sure to divide the loss by the number of gradient accumulation steps to maintain the correct gradient scale.
  3. Enable mixed precision training: Training the model in mixed precision (FP16) can reduce the memory usage compared to full precision (FP32). PyTorch provides the torch.cuda.amp package for automatic mixed precision training.
  4. Reduce input sequence length: If your input texts are long, consider truncating or splitting them into shorter sequences. Activation memory grows with sequence length, so shorter inputs reduce memory usage during the forward and backward passes.
  5. Free GPU memory: Call torch.cuda.empty_cache() to release unused memory that PyTorch's caching allocator is holding. It does not free memory occupied by live tensors, so this might not help much, but it's worth a try.
  6. Set PYTORCH_CUDA_ALLOC_CONF environment variable: You can try setting the PYTORCH_CUDA_ALLOC_CONF environment variable with max_split_size_mb to avoid memory fragmentation. Experiment with different values based on your GPU memory.
  7. Use a smaller model: If possible, consider using a smaller version of the LLM that has fewer parameters. Smaller models generally require less memory during training.
  8. Increase GPU memory: If you have access to a GPU with more memory, consider using it to accommodate the memory requirements of the LLM training.
  9. Distribute training across multiple GPUs: Splitting the training across multiple GPUs can reduce the memory usage per GPU. However, this approach requires additional setup and coordination.
  10. Monitor GPU memory usage: Use nvidia-smi to monitor the GPU memory usage and identify any other processes consuming GPU memory. Terminate unnecessary processes to free up memory.
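
For item 10, the nvidia-smi check can be scripted. The sketch below is an illustration only: the helper names `parse_compute_apps` and `list_gpu_processes` are hypothetical, and it assumes nvidia-smi's CSV query mode with the `pid`, `process_name`, and `used_memory` fields. It only lists processes and deliberately leaves the decision to terminate anything to you.

```python
import shutil
import subprocess

# CSV query without header or units, so each line is "pid, name, MiB".
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV output into (pid, name, used_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split from both ends so a comma inside the process name is preserved.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        # Memory can be reported as "[N/A]" on some setups; map that to None.
        used_mib = int(used) if used.isdigit() else None
        rows.append((int(pid.strip()), name.strip(), used_mib))
    return rows


def list_gpu_processes():
    """Return compute processes on the GPUs, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


for pid, name, used_mib in list_gpu_processes():
    print(f"pid={pid} name={name} used_mib={used_mib}")
```

Running this before launching training shows whether another job is already holding memory on the GPU you intend to use.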