大模型训练中CUDA out of memory的几种解决方法

常用:

  1. 减小batch size

  2. 如果需要较大的batch size来达到预期训练指标效果,则使用gradient accumulation来累积梯度,达到等价效果的同时,占用显存小;

  3. 使用低精度做训练。(mixed precision training). FP16+FP32

  4. 减小训练数据样本的长度

  5. Model并行或者Pipeline并行,把模型拆分至多卡

不常用:

  1. 释放闲置内存;torch.cuda.empty_cache()

  2. 缓解PyTorch的显存碎片;PYTORCH_CUDA_ALLOC_CONF和max_split_size_mb

7.使用更小的模型

  1. 增大GPU显存

  2. 使用nvidia-smi看到该GPU上的其他进程;可杀则杀;

Perplexity 查询结果:

Here are several solutions to resolve the "CUDA out of memory" error when training large language models (LLMs) using SFT (Supervised Fine-Tuning):

  1. Reduce batch size: Lowering the batch size during training can significantly reduce the memory requirements. For example, you could set the batch size to 1 and use gradient accumulation to accumulate gradients over multiple batches before updating the model weights.
  2. Use gradient accumulation: Gradient accumulation allows you to accumulate gradients over multiple batches before updating the model weights. This effectively increases the batch size without increasing the memory usage. Be sure to divide the loss by the number of gradient accumulation steps to maintain the correct gradient scale.
  3. Enable mixed precision training: Training the model in mixed precision (FP16) can reduce the memory usage compared to full precision (FP32). PyTorch provides the torch.cuda.amp package for automatic mixed precision training.
  4. Reduce input sequence length: If your input texts are long, consider truncating or splitting them into shorter sequences to reduce the memory usage during tokenization and model inference.
  5. Free GPU memory: Call torch.cuda.empty_cache() to free up any unused GPU memory allocated by PyTorch. This might not help much, but it's worth a try.
  6. Set PYTORCH_CUDA_ALLOC_CONF environment variable: You can try setting the PYTORCH_CUDA_ALLOC_CONF environment variable with max_split_size_mb to avoid memory fragmentation. Experiment with different values based on your GPU memory.
  7. Use a smaller model: If possible, consider using a smaller version of the LLM that has fewer parameters. Smaller models generally require less memory during training.
  8. Increase GPU memory: If you have access to a GPU with more memory, consider using it to accommodate the memory requirements of the LLM training.
  9. Distribute training across multiple GPUs: Splitting the training across multiple GPUs can reduce the memory usage per GPU. However, this approach requires additional setup and coordination.
  10. Monitor GPU memory usage: Use nvidia-smi to monitor the GPU memory usage and identify any other processes consuming GPU memory. Terminate unnecessary processes to free up memory
相关推荐
weixin_30777913几秒前
从工具到协作者:AI在后端研发中的流程重构与组织赋能
人工智能·后端·python·算法·自动化
云草桑3 分钟前
.NET10+AI 架构师全套实战学习文档(含源码、案例、面试题、项目源码)
人工智能·学习·ai·.net
装不满的克莱因瓶5 分钟前
循环神经网络及LSTM——从序列建模到长期依赖记忆机制
人工智能·pytorch·python·rnn·深度学习·神经网络·lstm
ai产品老杨7 分钟前
突破安防碎片化僵局:基于 Docker 与边缘计算的 AI 视频管理平台异构架构设计(附 GB28181/RTSP 统一接入与源码交付)
人工智能·docker·边缘计算
沉下去,苦磨练!10 分钟前
深度学习神经网络的搭建
人工智能·算法
夏天的味道٥11 分钟前
Spring-AI 多模型接入实战:本地 deepseek + 阿里云百炼 + 硅基流动
人工智能·spring·阿里云
2601_9619633814 分钟前
从OCR到NLP:AI技术如何赋能电子合同智能审核与风险预警?
网络·人工智能·安全·金融·智能合约
暗夜猎手-大魔王15 分钟前
hermes源码学习5-Provider 运行时解析
大数据·人工智能·学习
apcipot_rain18 分钟前
计科八股20260611——推荐系统协同过滤、信息安全、团队协作、知识图谱
人工智能·知识图谱
谷哥的小弟18 分钟前
大模型核心基础知识(18)—Transformer模型的提出背景
人工智能·深度学习·神经网络·大模型·transformer·大语言模型