Finetune LLaVA on Custom Datasets

Dataset Format

Convert your data to a JSON file containing a list of all samples. Each sample should contain id (a unique identifier), image (the path to the image), and conversations (the conversation turns between the human and the AI).
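The field requirements above can be checked before training with a small script. This is a minimal sketch, not part of LLaVA itself: the function name validate_samples and the exact checks (required keys, unique ids) are assumptions drawn only from the description above.

```python
import json

# Keys that every sample is expected to carry, per the format description.
REQUIRED_KEYS = ("id", "image", "conversations")


def validate_samples(samples):
    """Return a list of human-readable problems; an empty list means no problems found."""
    if not isinstance(samples, list):
        return ["top-level JSON value must be a list of samples"]

    problems = []
    seen_ids = set()
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"sample {index}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in sample:
                problems.append(f"sample {index}: missing '{key}'")
        if "id" in sample:
            # Compare ids as strings so unhashable or mixed-type ids do not crash the check.
            id_key = str(sample["id"])
            if id_key in seen_ids:
                problems.append(f"sample {index}: duplicate id {id_key!r}")
            seen_ids.add(id_key)
    return problems


def validate_file(path):
    """Load a dataset JSON file (hypothetical path) and validate its contents."""
    with open(path, encoding="utf-8") as f:
        return validate_samples(json.load(f))
```

Calling validate_file on your dataset path and printing the returned list is usually enough to catch missing fields or repeated ids before a long training run starts.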

A sample JSON for finetuning LLaVA to generate tag-style captions for Stable Diffusion:

[
  {
    "id": "997bb945-628d-4724-b370-b84de974a19f",
    "image": "part-000001/997bb945-628d-4724-b370-b84de974a19f.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
      },
      {
        "from": "gpt",
        "value": "a beautiful painting of chernobyl by nekro, pascal blanche, john harris, greg rutkowski, sin jong hun, moebius, simon stalenhag. in style of cg art. ray tracing. cel shading. hyper detailed. realistic. ue 5. maya. octane render. "
      }
    ]
  },
  ...
]
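A file in this shape can be generated from image/caption pairs. The following is a minimal sketch under stated assumptions: the helper names build_sample, build_dataset, and to_json are hypothetical, ids are generated with uuid4 when none is supplied, and the example pair is invented for illustration. Only the field names, the "human"/"gpt" roles, and the "<image>\n" prefix are taken from the sample above.

```python
import json
import uuid

# Instruction text copied from the sample above; change it for other tasks.
PROMPT = "Write a prompt for Stable Diffusion to generate this image."


def build_sample(image_path, caption, prompt=PROMPT, sample_id=None):
    """Build one sample dict with a single human/gpt exchange."""
    return {
        "id": sample_id or str(uuid.uuid4()),  # assumption: any unique string works as an id
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{prompt}"},
            {"from": "gpt", "value": caption.strip()},
        ],
    }


def build_dataset(pairs):
    """Turn an iterable of (image_path, caption) pairs into a list of samples."""
    return [build_sample(image_path, caption) for image_path, caption in pairs]


def to_json(samples):
    """Serialize samples as a JSON list; the json module never emits trailing commas."""
    return json.dumps(samples, indent=2, ensure_ascii=False)


# Hypothetical input pair, for illustration only.
example_pairs = [("part-000001/example.jpg", "a painting of a lighthouse, hyper detailed ")]
example_json = to_json(build_dataset(example_pairs))
```

Writing example_json to a file gives a dataset in the format shown above. Generating the file with a JSON serializer rather than by hand also avoids syntax slips such as trailing commas.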

Command

If you have limited task-specific data, we recommend finetuning from LLaVA checkpoints with LoRA following this script.

If you have enough task-specific data, you can also finetune from LLaVA checkpoints with full-model finetuning following this script.

You may need to adjust the hyperparameters to fit each specific dataset and your hardware constraints.
