LLAVA数据集下载
1. Data
Data file name | Size |
---|---|
llava_instruct_150k.json | 229 MB |
llava_instruct_80k.json | 229 MB |
conversation_58k.json | 126 MB |
detail_23k.json | 20.5 MB |
complex_reasoning_77k.json | 79.6 MB |
1.1 Pretraining Dataset
The pretraining dataset used in this release is a subset of CC-3M dataset, filtered with a more balanced concept coverage distribution. Please see here for a detailed description of the dataset structure and how to download the images.
If you already have CC-3M dataset on your disk, the image names follow this format: GCC_train_000000000.jpg
. You may edit the image
field correspondingly if necessary.
Data | Chat File | Meta Data | Size |
---|---|---|---|
CC-3M Concept-balanced 595K | chat.json | metadata.json | 211 MB |
LAION/CC/SBU BLIP-Caption Concept-balanced 558K | blip_laion_cc_sbu_558k.json | [metadata.json](#Data Chat File Meta Data Size CC-3M Concept-balanced 595K chat.json metadata.json 211 MB LAION/CC/SBU BLIP-Caption Concept-balanced 558K blip_laion_cc_sbu_558k.json metadata.json 181 MB) | 181 MB |
Important notice : Upon the request from the community, as ~15% images of the original CC-3M dataset are no longer accessible, we upload images.zip
for better reproducing our work in research community. It must not be used for any other purposes. The use of these images must comply with the CC-3M license. This may be taken down at any time when requested by the original CC-3M dataset owner or owners of the referenced images.
1.2 GPT-4 Prompts
We provide our prompts and few-shot samples for GPT-4 queries, to better facilitate research in this domain. Please check out the prompts
folder for three kinds of questions: conversation, detail description, and complex reasoning.
They are organized in a format of system_message.txt
for system message, pairs of abc_caps.txt
for few-shot sample user input, and abc_conv.txt
for few-shot sample reference output.
Note that you may find them in different format. For example, conversation
is in jsonl
, and detail description is answer-only. The selected format in our preliminary experiments works slightly better than a limited set of alternatives that we tried: jsonl
, more natural format, answer-only. If interested, you may try other variants or conduct more careful study in this. Contributions are welcomed!
2. Visual Instruction Tuning
---------2.1 指令调整数据(instruction tuning data)---------:
LLaVA-Instruct-150K
---------2.2 图像(images)---------
COCO
官方 :train2017
GQA
官方 :images
OCR-VAQ
官方 :download script
多线程下载(速度更快) :Github解决方案 以及 CSDN解决方案
处理好的数据集下载(方便快捷) :Huggingface
TextVQA
官方 :train_val_images
VisualGenome
playground
├──data
│ ├── coco
│ │ └── train2017
│ ├── gqa
│ │ └── images
│ ├── ocr_vqa
│ │ └── images
│ ├── textvqa
│ │ └── train_images
│ └── vg
│ ├── VG_100K
│ └── VG_100K_2
└── ...
3. Pretrained Model
---------3.1 语言大模型---------
vicuna-13b-v1.5
vicuna-7b-v1.5
---------3.2 视觉大模型---------
clip-vit-large-patch14-336
---------3.3 LLAVA-1.5预训练模型---------
LLAVA-1.5-13b
LLAVA-1.5-7b
---------3.4 LLAVA-lora微调训练的模型---------
LLAVA-1.5--13b-lora
LLAVA-1.5--7b-lora