大模型训练中CUDA out of memory的几种解决方法

常用:

  1. 减小batch size

  2. 如果需要较大的batch size来达到预期训练指标效果,则使用gradient accumulation来累积梯度,达到等价效果的同时,占用显存小;

  3. 使用低精度做训练。(mixed precision training). FP16+FP32

  4. 减小训练数据样本的长度

  5. Model并行或者Pipeline并行,把模型拆分至多卡

不常用:

  1. 释放闲置内存;torch.cuda.empty_cache()

  2. 缓解PyTorch的显存碎片;PYTORCH_CUDA_ALLOC_CONF和max_split_size_mb

7.使用更小的模型

  1. 增大GPU显存

  2. 使用nvidia-smi看到该GPU上的其他进程;可杀则杀;

Perplexity 查询结果:

Here are several solutions to resolve the "CUDA out of memory" error when training large language models (LLMs) using SFT (Supervised Fine-Tuning):

  1. Reduce batch size: Lowering the batch size during training can significantly reduce the memory requirements. For example, you could set the batch size to 1 and use gradient accumulation to accumulate gradients over multiple batches before updating the model weights.
  2. Use gradient accumulation: Gradient accumulation allows you to accumulate gradients over multiple batches before updating the model weights. This effectively increases the batch size without increasing the memory usage. Be sure to divide the loss by the number of gradient accumulation steps to maintain the correct gradient scale.
  3. Enable mixed precision training: Training the model in mixed precision (FP16) can reduce the memory usage compared to full precision (FP32). PyTorch provides the torch.cuda.amp package for automatic mixed precision training.
  4. Reduce input sequence length: If your input texts are long, consider truncating or splitting them into shorter sequences to reduce the memory usage during tokenization and model inference.
  5. Free GPU memory: Call torch.cuda.empty_cache() to free up any unused GPU memory allocated by PyTorch. This might not help much, but it's worth a try.
  6. Set PYTORCH_CUDA_ALLOC_CONF environment variable: You can try setting the PYTORCH_CUDA_ALLOC_CONF environment variable with max_split_size_mb to avoid memory fragmentation. Experiment with different values based on your GPU memory.
  7. Use a smaller model: If possible, consider using a smaller version of the LLM that has fewer parameters. Smaller models generally require less memory during training.
  8. Increase GPU memory: If you have access to a GPU with more memory, consider using it to accommodate the memory requirements of the LLM training.
  9. Distribute training across multiple GPUs: Splitting the training across multiple GPUs can reduce the memory usage per GPU. However, this approach requires additional setup and coordination.
  10. Monitor GPU memory usage: Use nvidia-smi to monitor the GPU memory usage and identify any other processes consuming GPU memory. Terminate unnecessary processes to free up memory
相关推荐
星越华夏1 小时前
计算机视觉:YOLOv12安装环境
人工智能·yolo·计算机视觉
Yolanda943 小时前
【人工智能】《从零搭建AI问答助手项目(九):Prompt优化》
人工智能·prompt
wj3055853783 小时前
课程 9:模型测试记录与 Prompt 策略
linux·人工智能·python·comfyui
小和尚同志3 小时前
深入使用 skill-creator:结合真实生产级实践
人工智能·aigc
DevSecOps选型指南3 小时前
安全419专访悬镜安全 | 穿越周期在 AI 浪潮中定义数字供应链安全新范式
人工智能
沪漂阿龙3 小时前
面试题详解:GraphRAG 全面解析——知识图谱增强 RAG、Local Search、Global Search、社区摘要、工程落地与评估指标一次讲透
人工智能·知识图谱
WangN23 小时前
Unitree RL Lab 学习笔记【通识】
人工智能·机器学习
haina20193 小时前
海纳AI亮相《科创中国》,解码招聘“智”变之路
人工智能·ai面试·ai招聘
阿星AI工作室3 小时前
刘润年中大课笔记:一句话说清AI落地之战的本质
大数据·人工智能·创业创新·商业
qingfeng154153 小时前
企业微信机器人开发:如何实现自动化与智能运营?
人工智能·python·机器人·自动化·企业微信