Ways to fix CUDA out of memory errors when training large models

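Common:

  1. Reduce the batch size.

  2. If you need a larger batch size to reach the target training metrics, use gradient accumulation: it gives an equivalent effective batch size while using less GPU memory.

  3. Train in lower precision (mixed precision training), e.g. FP16 + FP32.

  4. Shorten the training samples (reduce the sequence length).

  5. Use model parallelism or pipeline parallelism to split the model across multiple GPUs.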
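
To make item 2 concrete, here is a minimal pure-Python sketch of why gradient accumulation matches a large batch. It assumes a toy one-parameter model y = w * x with a mean-squared-error loss; the helper names `grad_mse` and `accumulated_grad` are made up for illustration. In PyTorch the same idea is usually written as calling `(loss / accum_steps).backward()` on each micro-batch and calling `optimizer.step()` only once every `accum_steps` micro-batches.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is divided by the number of accumulation
    steps, so the sum equals the gradient of the full batch.
    """
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch must split into equal micro-batches")
    accum_steps = len(xs) // micro_batch_size
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        mb_x = xs[start:start + micro_batch_size]
        mb_y = ys[start:start + micro_batch_size]
        # Only one micro-batch needs to be "in memory" at a time.
        total += grad_mse(w, mb_x, mb_y) / accum_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

full_batch_grad = grad_mse(w, xs, ys)                       # batch size 8
accum_grad = accumulated_grad(w, xs, ys, micro_batch_size=2)  # 4 steps of 2
print(full_batch_grad, accum_grad)
```

The equality holds exactly only when the micro-batches are the same size and the loss is a mean over samples; layers that depend on batch statistics (e.g. batch normalization) will still see the smaller micro-batch.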

Less common:

  1. Release idle cached memory: torch.cuda.empty_cache()

  2. Mitigate PyTorch memory fragmentation: set max_split_size_mb through the PYTORCH_CUDA_ALLOC_CONF environment variable.

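  3. Use a smaller model.

  4. Use a GPU with more memory.

  5. Use nvidia-smi to check for other processes on the same GPU, and kill them if it is safe to do so.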
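
For the allocator setting, a minimal sketch of doing it from Python is shown below. The value 128 is only an illustrative assumption, not a recommendation; the variable has to be set before PyTorch makes its first CUDA allocation, so the safest place is before `import torch` (or in the shell that launches the training script).

```python
import os

# Illustrative value only (an assumption); tune it for your GPU and workload.
MAX_SPLIT_SIZE_MB = 128

# Must be set before the CUDA caching allocator is initialized,
# so do it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = f"max_split_size_mb:{MAX_SPLIT_SIZE_MB}"

# import torch  # import only after the variable is set
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```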

Perplexity query result:

Here are several solutions to resolve the "CUDA out of memory" error when training large language models (LLMs) using SFT (Supervised Fine-Tuning):

  1. Reduce batch size: Lowering the batch size during training can significantly reduce the memory requirements. For example, you could set the batch size to 1 and use gradient accumulation to accumulate gradients over multiple batches before updating the model weights.
  2. Use gradient accumulation: Gradient accumulation allows you to accumulate gradients over multiple batches before updating the model weights. This effectively increases the batch size without increasing the memory usage. Be sure to divide the loss by the number of gradient accumulation steps to maintain the correct gradient scale.
  3. Enable mixed precision training: Training the model in mixed precision (FP16) can reduce the memory usage compared to full precision (FP32). PyTorch provides the torch.cuda.amp package for automatic mixed precision training.
  4. Reduce input sequence length: If your input texts are long, consider truncating or splitting them into shorter sequences. Activation memory grows with sequence length, so shorter sequences reduce memory usage during the forward and backward passes.
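  5. Free GPU memory: Call torch.cuda.empty_cache() to release cached memory that PyTorch holds but is not currently using. It does not free memory occupied by live tensors and does not increase the memory available to PyTorch itself, so it rarely fixes an out-of-memory error on its own, but it can reduce fragmentation in some cases and returns memory to other GPU applications.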
  6. Set PYTORCH_CUDA_ALLOC_CONF environment variable: You can try setting the PYTORCH_CUDA_ALLOC_CONF environment variable with max_split_size_mb to avoid memory fragmentation. Experiment with different values based on your GPU memory.
  7. Use a smaller model: If possible, consider using a smaller version of the LLM that has fewer parameters. Smaller models generally require less memory during training.
  8. Increase GPU memory: If you have access to a GPU with more memory, consider using it to accommodate the memory requirements of the LLM training.
  9. Distribute training across multiple GPUs: Splitting the training across multiple GPUs can reduce the memory usage per GPU. However, this approach requires additional setup and coordination.
  10. Monitor GPU memory usage: Use nvidia-smi to monitor the GPU memory usage and identify any other processes consuming GPU memory. Terminate unnecessary processes to free up memory.
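
As a sketch of the "splitting into shorter sequences" idea from item 4, the helper below cuts a list of token ids into windows of at most `max_len` tokens, optionally overlapping by `stride` tokens. The function name `split_into_chunks` and its parameters are hypothetical, and the token ids are made-up integers; a real pipeline would take them from its tokenizer.

```python
def split_into_chunks(token_ids, max_len, stride=0):
    """Split token ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that context at the
    window boundary is not lost entirely.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        # Stop once this window already reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
    return chunks


token_ids = list(range(10))  # stand-in for real tokenizer output
print(split_into_chunks(token_ids, max_len=4))
print(split_into_chunks(token_ids, max_len=4, stride=1))
```

Plain truncation is the special case of keeping only the first chunk; splitting keeps all the data at the cost of more training samples.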