大模型训练中CUDA out of memory的几种解决方法

常用:

  1. 减小batch size

  2. 如果需要较大的batch size来达到预期训练指标效果,则使用gradient accumulation来累积梯度,达到等价效果的同时,占用显存小;

  3. 使用低精度做训练。(mixed precision training). FP16+FP32

  4. 减小训练数据样本的长度

  5. Model并行或者Pipeline并行,把模型拆分至多卡

不常用:

  1. 释放闲置内存;torch.cuda.empty_cache()

  2. 缓解PyTorch的显存碎片;PYTORCH_CUDA_ALLOC_CONF和max_split_size_mb

7.使用更小的模型

  1. 增大GPU显存

  2. 使用nvidia-smi看到该GPU上的其他进程;可杀则杀;

Perplexity 查询结果:

Here are several solutions to resolve the "CUDA out of memory" error when training large language models (LLMs) using SFT (Supervised Fine-Tuning):

  1. Reduce batch size: Lowering the batch size during training can significantly reduce the memory requirements. For example, you could set the batch size to 1 and use gradient accumulation to accumulate gradients over multiple batches before updating the model weights.
  2. Use gradient accumulation: Gradient accumulation allows you to accumulate gradients over multiple batches before updating the model weights. This effectively increases the batch size without increasing the memory usage. Be sure to divide the loss by the number of gradient accumulation steps to maintain the correct gradient scale.
  3. Enable mixed precision training: Training the model in mixed precision (FP16) can reduce the memory usage compared to full precision (FP32). PyTorch provides the torch.cuda.amp package for automatic mixed precision training.
  4. Reduce input sequence length: If your input texts are long, consider truncating or splitting them into shorter sequences to reduce the memory usage during tokenization and model inference.
  5. Free GPU memory: Call torch.cuda.empty_cache() to free up any unused GPU memory allocated by PyTorch. This might not help much, but it's worth a try.
  6. Set PYTORCH_CUDA_ALLOC_CONF environment variable: You can try setting the PYTORCH_CUDA_ALLOC_CONF environment variable with max_split_size_mb to avoid memory fragmentation. Experiment with different values based on your GPU memory.
  7. Use a smaller model: If possible, consider using a smaller version of the LLM that has fewer parameters. Smaller models generally require less memory during training.
  8. Increase GPU memory: If you have access to a GPU with more memory, consider using it to accommodate the memory requirements of the LLM training.
  9. Distribute training across multiple GPUs: Splitting the training across multiple GPUs can reduce the memory usage per GPU. However, this approach requires additional setup and coordination.
  10. Monitor GPU memory usage: Use nvidia-smi to monitor the GPU memory usage and identify any other processes consuming GPU memory. Terminate unnecessary processes to free up memory
相关推荐
喝拿铁写前端3 小时前
别再让 AI 直接写页面了:一种更稳的中后台开发方式
前端·人工智能
tongxianchao4 小时前
UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer
人工智能·cnn·transformer
塔能物联运维4 小时前
设备边缘计算任务调度卡顿 后来动态分配CPU/GPU资源
人工智能·边缘计算
过期的秋刀鱼!4 小时前
人工智能-深度学习-线性回归
人工智能·深度学习
木头左4 小时前
高级LSTM架构在量化交易中的特殊入参要求与实现
人工智能·rnn·lstm
IE065 小时前
深度学习系列84:使用kokoros生成tts语音
人工智能·深度学习
欧阳天羲5 小时前
#前端开发未来3年(2026-2028)核心趋势与AI应用实践
人工智能·前端框架
IE065 小时前
深度学习系列83:使用outetts
人工智能·深度学习
水中加点糖5 小时前
源码运行RagFlow并实现AI搜索(文搜文档、文搜图、视频理解)与自定义智能体(一)
人工智能·二次开发·ai搜索·文档解析·ai知识库·ragflow·mineru
imbackneverdie5 小时前
如何用AI工具,把文献综述从“耗时费力”变成“高效产出”?
人工智能·经验分享·考研·自然语言处理·aigc·ai写作