Preprocessing the dataset from the "Synthetic Visual Genome" paper
- Preface
- Synthetic Visual Genome: dataset and preprocessing in practice
- Parsing the dataset
- Structure of our generated preprocessing files
- Why SVG object descriptions read more like sentences
- Why is there only one metadata per image?
- What is segmentation.counts?
- Does the current preprocessing make full use of the data?
- How to load images for viewing and visualization later
- Closing remarks
Preface:
Unlike the early Visual Genome, which provides only short labels, "Synthetic Visual Genome" (SVG) uses natural-language object descriptions and richer relation predicates, bringing scene graphs closer to real-world language. For example, an object in SVG is not just "person" but "person walking on the beach with a backpack".
In addition, SVG keeps a three-level data organization:
- metadata: image-level global information;
- regions: local spatial regions and their segmentation masks;
- scene graph: objects and relations organized into a scene graph at the whole-image level.
Don't worry if this is unclear for now; it will make sense once you look at the paper and the dataset dumps below.
The Synthetic Visual Genome paper
The Synthetic Visual Genome dataset
Synthetic Visual Genome: dataset and preprocessing in practice
First, here are our preprocessed dataset files; the preprocessing workflow follows below:
Files shared via cloud drive: svg_parsed_all.ndjson and one other file
Link: https://pan.baidu.com/s/1kjGgFQG1Yhn8jh0xj2GhZA  Extraction code: tmg4
Downloading the dataset:
For network reasons, I use the mirror site https://hf-mirror.com to download the dataset from Hugging Face.
python
import os

# --- Key step ---
# Set the environment variable BEFORE importing the datasets library
print("Setting the Hugging Face mirror endpoint...")
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Now import datasets
from datasets import load_dataset

# Dataset name and local cache directory
dataset_name = "jamepark3922/svg"
cache_path = "./Synthetic_Visual_Genome_datasets"

print(f"Downloading dataset from the mirror: {dataset_name}...")
# Run the usual load command
dataset = load_dataset(dataset_name, cache_dir=cache_path)
print("\nDataset downloaded and loaded successfully!")
print(dataset)
A first rough look at the dataset structure
This snippet is just a quick look right after downloading; feel free to skip ahead to the final version.
python
import os
import json
import random

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import networkx as nx
import pandas as pd
import seaborn as sns

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

from datasets import load_dataset

# --- Load the dataset ---
dataset_name = "jamepark3922/svg"
cache_path = "./Synthetic_Visual_Genome_datasets"
dataset = load_dataset(dataset_name, cache_dir=cache_path)

# 1. Print the dataset structure
print("Dataset structure:")
for split in dataset.keys():
    print(f"  Split: {split}, samples: {len(dataset[split])}")

# 2. Draw a random sample
split_name = list(dataset.keys())[0]
sample = dataset[split_name][random.randint(0, len(dataset[split_name]) - 1)]
print("\nRandom sample:")
for k, v in sample.items():
    if isinstance(v, str):
        print(f"{k}: {v[:200]}...")  # avoid overly long output
    else:
        print(f"{k}: {type(v)}")

# 3. If there is an image field, show the image
if "image" in sample:
    plt.imshow(sample["image"])
    plt.axis("off")
    plt.title("Random sample image")
    plt.show()

# 4. If there are text fields, print them
for text_key in ["caption", "sentence", "description"]:
    if text_key in sample:
        print(f"\n{text_key}: {sample[text_key]}")

# 5. If there are label fields, plot their distribution
for label_key in ["label", "category", "object", "relation"]:
    if label_key in dataset[split_name].features:
        values = [ex[label_key] for ex in dataset[split_name].select(range(1000))]
        plt.figure(figsize=(8, 4))
        sns.countplot(x=values, order=pd.Series(values).value_counts().index)
        plt.title(f"{label_key} distribution (first 1000 samples)")
        plt.xticks(rotation=45)
        plt.show()

# 6. Print and parse scene_graph and regions for a random sample
sample = dataset["train"][random.randint(0, len(dataset["train"]) - 1)]
print("Sample ID:", sample["id"])
print("Image ID:", sample["image_id"])

# Parse scene_graph
if "scene_graph" in sample and sample["scene_graph"] is not None:
    sg = json.loads(sample["scene_graph"])
    print("\nScene Graph objects:")
    for obj in sg.get("objects", []):
        print("  -", obj)
    print("\nScene Graph relations:")
    for rel in sg.get("relations", []):
        print("  -", rel)

# Print regions
if "regions" in sample and sample["regions"] is not None:
    print("\nRegions examples:")
    for r in sample["regions"][:5]:  # only the first 5
        print("  -", r)

# Build a graph from the scene graph
sg = json.loads(sample["scene_graph"])
objects = sg.get("objects", [])
relations = sg.get("relations", [])
G = nx.DiGraph()
for i, obj in enumerate(objects):
    G.add_node(i, label=obj)
for rel in relations:
    if isinstance(rel, list) and len(rel) == 3:
        subj, obj, pred = rel
        if subj < len(objects) and obj < len(objects):
            G.add_edge(subj, obj, label=pred)

# Draw the graph
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True,
        labels={i: f"{i}:{objects[i][:20]}" for i in range(len(objects))},
        node_color="lightblue", node_size=2000, font_size=8, arrows=True)
edge_labels = nx.get_edge_attributes(G, "label")
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=7)
plt.title("Scene graph visualization (with object indices)")
plt.show()

# Draw region bounding boxes (without the image as background)
plt.figure(figsize=(8, 6))
ax = plt.gca()
for r in sample["regions"][:10]:  # only the first 10
    x, y, w, h = r["bbox"]
    rect = patches.Rectangle((x, y), w, h, linewidth=2,
                             edgecolor="red", facecolor="none")
    ax.add_patch(rect)
    ax.text(x, y, r["object"], fontsize=8, color="blue")
plt.title("Region bounding boxes (no image background)")
plt.gca().invert_yaxis()  # match image coordinates
plt.show()
Running the code prints a rough dump of the dataset structure to the terminal:
bash
Repo card metadata block was not found. Setting CardData to empty.
Dataset structure:
  Split: train, samples: 93305
Random sample:
id: 2395765...
image_id: 2395765.jpg...
metadata/height: <class 'NoneType'>
metadata/objects: <class 'NoneType'>
metadata/scene: <class 'NoneType'>
metadata/width: <class 'NoneType'>
regions: <class 'list'>
scene_graph: {"objects": ["boy in red jersey and helmet, swinging a bat, standing on the baseball field", "girl in white hat and pink pants, sitting and watching the game", "mask worn by the catcher.", "cooler in ...
Sample ID: 2403699
Image ID: 2403699.jpg
Scene Graph objects:
- irons and other items on the stove
- ceiling fan with lights
- kitchen floor tiles in peach, beige, and white
- water dispenser on the refrigerator
- wallpaper in blue, white, and green
- dishes on a rack
- wooden cupboard with brown and green cabinets
- vaulted kitchen ceiling
- white chair
- wall clock
- brown wooden table
- silver tea kettle on stove
- white knobs on kitchen cabinets
- silver metal refrigerator
- faucet in white sink
- refrigerator door handles
- wall-mounted bulletin board
- cabinet door
- kitchen with wooden cabinets and appliances
- white porcelain sink
- lights in ceiling fan
- burgundy, white, and red mat
- multicolored striped container
- wall with attached items
- old-fashioned stove
Scene Graph relations:
- [0, 24, 'placed on']
- [0, 24, 'on']
- [0, 24, 'part of']
- [1, 7, 'attached to']
- [1, 18, 'above']
- [1, 20, 'has']
- [1, 7, 'on']
- [1, 18, 'illuminates']
- [2, 13, 'under']
- [3, 13, 'on']
- [4, 18, 'decorates']
- [5, 6, 'located above']
- [6, 18, 'used for storage in']
- [6, 17, 'has']
- [7, 1, 'supports']
- [7, 18, 'above']
- [8, 10, 'positioned near']
- [8, 18, 'in']
- [8, 10, 'around']
- [9, 18, 'indicating time for']
- [9, 23, 'on']
- [10, 18, 'in']
- [11, 0, 'next to']
- [11, 24, 'on top of']
- [11, 24, 'on']
- [12, 17, 'right of']
- [12, 6, 'on']
- [13, 2, 'above']
- [13, 3, 'has']
- [13, 15, 'has']
- [13, 4, 'in front of']
- [13, 18, 'used for storing food by']
- [13, 18, 'in']
- [14, 12, 'above']
- [15, 13, 'attached to']
- [16, 18, 'in front of']
- [17, 13, 'right of']
- [18, 1, 'below']
- [18, 7, 'below']
- [18, 13, 'contains']
- [18, 16, 'contains']
- [18, 19, 'has']
- [19, 6, 'on top of']
- [19, 12, 'next to']
- [19, 14, 'with']
- [20, 7, 'attached to']
- [20, 18, 'illuminates']
- [21, 2, 'on top of']
- [21, 2, 'on']
- [21, 19, 'under']
- [22, 18, 'used for storage in']
- [23, 9, 'supports']
- [23, 16, 'supports']
- [24, 11, 'next to']
- [24, 18, 'used for cooking by']
- [24, 22, 'left of']
- [24, 18, 'in']
The Regions examples were also printed in the terminal (truncated here):
Regions examples:
- {'area': 989, 'bbox': [412, 237, 50, 49], 'object': 'irons', 'segmentation': {'counts': 'T`g45[;7O11IOLfV2`0SiM7J2O1N1O1NM[EYOc:e0aEZO_:b0eE^O[:c0dE]O[:g0aEZO_:f0aEYO`:g0`EYOi:0ZE3LNV;1jDOV;1jDNW;2iDN`;O2OF4hDMj:l0I4M2N2O001M2O1O1N2OONeNlEZ1U:fNlEX1[:@7XObE7TT?', 'size': [375, 500]}}
- {'area': 3056, 'bbox': [130, 0, 185, 39], 'object': 'fan', 'segmentation': {'counts': 'Y`i17^;6K2M3L4[E]OQ:f0oE[On9g0QFZOn9g0SFXOm9i0VFTOi9k0XFUOh9l0WFTOi9l0a00hETOe9l0[FTOe9m0ZFSOf9l0ZFUOf9k0ZFVOe9j0[FVOe9j0[FVOe9j0\\FUOd9k0\\FUOd9l0[FTOe9k0\\FUOd9k0\\FUOd9k0\\FUOd9k0\\FUOe92mE`0>^Of90WF56JQ;0000000d[42ZdK0ENQE110a:1VEO2164]:M[EN204;\\:G^EO1O0`0`:B`EO0e0^:]ObENOf0_:\\ObEh0_:YO`Eg0a:XO_Eh0`:80001O00001O0000000000000000L4N20000O100001O3M1OVO`E=`:C`E=`:CaE<`:CaE;_:FbE9^:FdE8]:HdE7]:HcE7^:IbE6_:K`E4a:L^E4c:M\\E2e:N[E1g:NXE2i:OVE0k:1TENm:2SENm:2SEMn:3REMn:3REMo:2QEKOLP;:QEHS;8PEEP;<50001O01O1O010OO100010O000001O000000001O00000000010O000000010O0000010O000000010O000000001O000000001N1O102Lo^T2', 'size': [375, 500]}}
- {'area': 17013, 'bbox': [113, 253, 218, 121], 'object': 'tile in peach, beige', 'segmentation': {'counts': 'Sk^19];>C3M2N1O00000000000000000000000000000000000000000O1000000O1O1O1N2O1N2M3N2M3O1N2N2N2N2O1N2M3M3M3M3cNPNkHX2S7W10000O100000000000000000000000000000000000000000000000000O1001O000000001]LoG^3W8N1O00001O0000000000000000000000000000000000000000000000000000000000000000000000000000000000O10000000000O100HfLQH\\3o76000000000000O10000000000000000000000000000000000000_OUHVMk7i2XHUMh7j2]HRMc7m2bHnL`7P3d0O2N2N3M3M2N3M4L3M3M3oMfF`1]9]NeFb1j9oNQF5R:DcFK`91T1O2N1O1O2N1O2On`P2', 'size': [375, 500]}}
- {'area': 362, 'bbox': [158, 160, 11, 48], 'object': 'dispenser', 'segmentation': {'counts': 'RQj1?2H[:5fEd0U:`0O0000000=C1TOiE6X:DW^i3', 'size': [375, 500]}}
- {'area': 4028, 'bbox': [3, 1, 45, 216], 'object': 'wallpaper in blue, white', 'segmentation': {'counts': '=e1R:;F?@7J;D:G:E>C]N_HbM1b1T7?jHSNg0\\1U6`0^JDV5:RKHe49[KN^41bK9U4IhK=R4FkK;U4KdK5i3eM_K^2>MR4lMdK]2JGc4kMmKZ2ZOJj4lM`Lg1eN=l4kMmLZ1WNk0l4kMUMR1oMR1n4kMTMR1nMS1o4jMTMR1mMT1o4jMUMQ1lMT1Q5jMUMP1jMV1R5iM`Me0^Mb1R5iMfM`0WMf1U5iMjM;nLo1Y5eMVNO`L]2[5cMaNV2`1iMmNi1o5B`0_O`0@[gY5', 'size': [375, 500]}}
Preprocessing and using the dataset.
Combined with the author's jamepark3922/svg dataset card, we can see how to get from a relation such as [13, 18, 'used for storing food by'] to readable text: the integer Object IDs index the scene graph's objects list, and each object can in turn be matched to a region, which carries the (usually shorter) object name and its spatial annotation.
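A minimal sketch of this ID-to-name mapping, using made-up toy values rather than the real sample above:

```python
# Toy data shaped like SVG's scene_graph and regions fields (values are made up).
objects = ["old-fashioned stove", "silver metal refrigerator", "kitchen with wooden cabinets"]
relations = [[1, 2, "used for storing food by"]]
regions = [{"object": "refrigerator", "bbox": [120, 40, 80, 200]}]

def triple_to_text(rel, objects):
    # A relation is [subject_id, object_id, predicate]; the ids index `objects`.
    subj_id, obj_id, predicate = rel
    return f"{objects[subj_id]} --[{predicate}]--> {objects[obj_id]}"

def region_for(object_name, regions):
    # Region names are usually shorter, so match by substring.
    return next((r for r in regions if r["object"] in object_name), None)

print(triple_to_text(relations[0], objects))
print(region_for(objects[1], regions)["bbox"])
```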
We now write the final preprocessing script, which saves the dataset in a structured form (one file with a 100-sample extract and one with all samples).
python
import os
import json
from typing import Any, Dict, List, Optional, Tuple
from datasets import load_dataset
from pathlib import Path
from tqdm import tqdm
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

# -------------------------------
# Configuration
# -------------------------------
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'  # mirror endpoint, as above

DATASET_NAME = "jamepark3922/svg"
CACHE_DIR = "./Synthetic_Visual_Genome_datasets"

# Output files
OUTPUT_100_JSON = "svg_parsed_100.json"          # small preview sample
OUTPUT_ALL_NDJSON = "svg_parsed_all.ndjson"      # full dataset, one record per line
OUTPUT_ALL_SINGLE_JSON = "svg_parsed_all.json"   # full dataset as one JSON array (not saved here); handy for tools that load a whole array

# Whether to additionally produce a single huge JSON list file (careful: high memory usage)
save_all_as_single_json = False

# Root directory of images for visualization (place your images here, named by image_id)
# e.g. images_root / "2403699.jpg"
images_root = Path("./images")                   # change to wherever you store the images
images_root.mkdir(exist_ok=True, parents=True)   # create as a placeholder if missing

# -------------------------------
# Utility functions
# -------------------------------
def safe_json_loads(s: Any) -> Optional[Dict[str, Any]]:
    """Safely parse a JSON string; return None on failure."""
    if s is None:
        return None
    if isinstance(s, (dict, list)):
        # It may already be a dict or a list
        return s if isinstance(s, dict) else {"_list": s}
    try:
        return json.loads(s)
    except Exception:
        return None
def extract_sample_record(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a uniformly structured record from a raw sample."""
    image_id = sample.get("image_id")

    # Parse scene_graph
    sg = safe_json_loads(sample.get("scene_graph"))
    objects: List[str] = sg.get("objects", []) if isinstance(sg, dict) else []
    relations: List[List[Any]] = sg.get("relations", []) if isinstance(sg, dict) else []

    regions: List[Dict[str, Any]] = sample.get("regions", []) or []

    # metadata (some fields may live under flattened keys such as metadata/width)
    # We saw metadata/width/height come back as None earlier, so be tolerant of different keys.
    # If the dataset exposes a single metadata dict, sample.get("metadata") would be enough.
    metadata: Dict[str, Any] = {}
    # Try to pull common metadata keys when available
    for key in ["metadata", "metadata/height", "metadata/width", "metadata/scene", "metadata/objects", "metadata/relations"]:
        if key in sample:
            metadata[key.replace("metadata/", "")] = sample[key]

    # If height/width is missing and a region's segmentation carries size info,
    # use one region's size as a reference
    if ("height" not in metadata or metadata.get("height") is None) or ("width" not in metadata or metadata.get("width") is None):
        for r in regions:
            seg = r.get("segmentation")
            if isinstance(seg, dict) and "size" in seg and isinstance(seg["size"], list) and len(seg["size"]) == 2:
                h, w = seg["size"]
                metadata.setdefault("height", h)
                metadata.setdefault("width", w)
                break

    # Statistics
    metadata.setdefault("objects_count", len(objects))
    metadata.setdefault("relations_count", len(relations))

    # Build triples (with ids and names)
    triples: List[Dict[str, Any]] = []
    for rel in relations:
        # Relations have the form [subj_id, obj_id, predicate]
        if isinstance(rel, list) and len(rel) == 3:
            subj_id, obj_id, predicate = rel
            # Bounds check
            if isinstance(subj_id, int) and isinstance(obj_id, int) and subj_id < len(objects) and obj_id < len(objects):
                triples.append({
                    "subject_id": subj_id,
                    "subject_name": objects[subj_id],
                    "predicate": predicate,
                    "object_id": obj_id,
                    "object_name": objects[obj_id]
                })
            else:
                # Invalid index: keep the raw info so problems can be traced later
                triples.append({
                    "subject_id": subj_id,
                    "subject_name": None if not (isinstance(subj_id, int) and subj_id < len(objects)) else objects[subj_id],
                    "predicate": predicate,
                    "object_id": obj_id,
                    "object_name": None if not (isinstance(obj_id, int) and obj_id < len(objects)) else objects[obj_id],
                    "_warning": "index_out_of_range_or_non_int"
                })

    # Keep the key region fields (bbox, area, segmentation, object)
    clean_regions: List[Dict[str, Any]] = []
    for r in regions:
        clean_regions.append({
            "object": r.get("object"),
            "bbox": r.get("bbox"),
            "area": r.get("area"),
            "segmentation": r.get("segmentation")
        })

    # Uniform output structure
    record = {
        "image_id": image_id,
        "metadata": metadata,
        "objects": objects,
        "triples": triples,
        "regions": clean_regions
    }
    return record
def show_image_with_regions(image_id: str, images_root: Path, record: Dict[str, Any], max_regions: int = 20, show_triples: int = 10) -> None:
    """
    Visualization helper:
    - Look up the image at images_root / image_id and display it
    - Overlay the regions' bounding boxes
    - Show a few triples as text in the title area
    """
    img_path = images_root / image_id
    if not img_path.exists():
        print(f"Image not found: {img_path}. Please place the image file in {images_root}.")
        return

    # Open the image
    try:
        img = Image.open(img_path).convert("RGB")
    except Exception as e:
        print(f"Failed to open image: {e}")
        return

    # Show the image
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis("off")
    ax = plt.gca()

    # Overlay bounding boxes (if any)
    regions = record.get("regions", [])
    for i, r in enumerate(regions[:max_regions]):
        bbox = r.get("bbox")
        obj_name = r.get("object")
        if bbox and isinstance(bbox, list) and len(bbox) == 4:
            x, y, w, h = bbox
            rect = patches.Rectangle((x, y), w, h, linewidth=2, edgecolor="red", facecolor="none")
            ax.add_patch(rect)
            if obj_name:
                ax.text(x, y - 2, f"{obj_name}", fontsize=9, color="yellow", bbox=dict(facecolor="black", alpha=0.4, pad=1))

    # Compose a few triples into the title
    triples = record.get("triples", [])
    text_lines = []
    for t in triples[:show_triples]:
        s = t.get("subject_name") or str(t.get("subject_id"))
        p = t.get("predicate")
        o = t.get("object_name") or str(t.get("object_id"))
        text_lines.append(f"{s} ---[{p}]→ {o}")
    if text_lines:
        plt.title(" | ".join(text_lines), fontsize=10)
    plt.show()
# -------------------------------
# Main flow: load the dataset and generate the preprocessing files
# -------------------------------
def main():
    print("Loading the dataset...")
    ds = load_dataset(DATASET_NAME, cache_dir=CACHE_DIR)

    # Use only the train split (this dataset only exposes train)
    train = ds["train"]
    total_len = len(train)
    print(f"train total samples: {total_len}")

    # 1) Generate the 100-sample preview file (a JSON list)
    preview_count = min(100, total_len)
    preview_records: List[Dict[str, Any]] = []
    print(f"Generating a small sample ({preview_count} records)...")
    for sample in tqdm(train.select(range(preview_count))):
        record = extract_sample_record(sample)
        preview_records.append(record)
    with open(OUTPUT_100_JSON, "w", encoding="utf-8") as f:
        json.dump(preview_records, f, ensure_ascii=False, indent=2)
    print(f"Saved preview file: {OUTPUT_100_JSON}")

    # 2) Generate the full NDJSON file (one JSON object per line)
    print("Generating the full NDJSON (written line by line, suitable for large data)...")
    with open(OUTPUT_ALL_NDJSON, "w", encoding="utf-8") as f:
        for sample in tqdm(train, total=total_len):
            record = extract_sample_record(sample)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    print(f"Saved full file: {OUTPUT_ALL_NDJSON}")

    # 3) (Optional) generate the full single JSON list file; mind memory and size
    if save_all_as_single_json:
        print("Generating the full single JSON list file (may use a lot of memory/disk)...")
        all_records: List[Dict[str, Any]] = []
        for sample in tqdm(train, total=total_len):
            record = extract_sample_record(sample)
            all_records.append(record)
        with open(OUTPUT_ALL_SINGLE_JSON, "w", encoding="utf-8") as f:
            json.dump(all_records, f, ensure_ascii=False, indent=2)
        print(f"Saved full JSON list file: {OUTPUT_ALL_SINGLE_JSON}")

    print("All processing done.")

if __name__ == "__main__":
    main()
# Usage example (visualization):
# 1) Read the first record from the preview file and display it (requires the image locally)
try:
    with open(OUTPUT_100_JSON, "r", encoding="utf-8") as f:
        preview = json.load(f)
    if preview:
        first_record = preview[0]
        print(f"Example display, image_id: {first_record.get('image_id')}")
        show_image_with_regions(first_record.get("image_id"), images_root, first_record, max_regions=15, show_triples=8)
except Exception as e:
    print(f"Example display failed: {e}")
Note that this code runs even if you have not downloaded the image files; at worst, the image-loading step fails gracefully.
The first record in the generated svg_parsed_100.json looks like this:
json
[
{
"image_id": "ADE_frame_00000001.jpg",
"metadata": {
"height": 335,
"width": 500,
"scene": [
"outdoor",
"urban",
"airfield"
],
"objects": [
{
"attributes": "",
"depth_ordering_rank": 1,
"id": 0,
"name": "sea water",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
},
{
"attributes": "",
"depth_ordering_rank": 2,
"id": 1,
"name": "track",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
},
{
"attributes": "",
"depth_ordering_rank": 3,
"id": 2,
"name": "sand beach",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
},
{
"attributes": "walking",
"depth_ordering_rank": 4,
"id": 3,
"name": "person",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
},
{
"attributes": "",
"depth_ordering_rank": 5,
"id": 4,
"name": "backpack",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
},
{
"attributes": "walking",
"depth_ordering_rank": 6,
"id": 5,
"name": "person",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
},
{
"attributes": "",
"depth_ordering_rank": 7,
"id": 6,
"name": "airplane",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
},
{
"attributes": "",
"depth_ordering_rank": 8,
"id": 7,
"name": "grass",
"occluded": false,
"parts": {
"has_parts": [],
"is_part_of": null,
"part_level": 0
}
}
],
"objects_count": 8,
"relations_count": 23
},
"objects": [
"ocean water in the background",
"runway with white markings",
"narrow sandy beach",
"person walking on the beach with a backpack",
"backpack carried by the person",
"person walking on the beach",
"small twin-engine airplane with extended landing gear",
"grass patch near the runway"
],
"triples": [
{
"subject_id": 0,
"subject_name": "ocean water in the background",
"predicate": "behind",
"object_id": 1,
"object_name": "runway with white markings"
},
{
"subject_id": 0,
"subject_name": "ocean water in the background",
"predicate": "behind",
"object_id": 2,
"object_name": "narrow sandy beach"
},
{
"subject_id": 1,
"subject_name": "runway with white markings",
"predicate": "in front of",
"object_id": 0,
"object_name": "ocean water in the background"
},
{
"subject_id": 1,
"subject_name": "runway with white markings",
"predicate": "in front of",
"object_id": 2,
"object_name": "narrow sandy beach"
},
{
"subject_id": 2,
"subject_name": "narrow sandy beach",
"predicate": "in front of",
"object_id": 0,
"object_name": "ocean water in the background"
},
{
"subject_id": 2,
"subject_name": "narrow sandy beach",
"predicate": "walked on by",
"object_id": 3,
"object_name": "person walking on the beach with a backpack"
},
{
"subject_id": 2,
"subject_name": "narrow sandy beach",
"predicate": "walked on by",
"object_id": 5,
"object_name": "person walking on the beach"
},
{
"subject_id": 3,
"subject_name": "person walking on the beach with a backpack",
"predicate": "on top of",
"object_id": 2,
"object_name": "narrow sandy beach"
},
{
"subject_id": 3,
"subject_name": "person walking on the beach with a backpack",
"predicate": "below",
"object_id": 6,
"object_name": "small twin-engine airplane with extended landing gear"
},
{
"subject_id": 3,
"subject_name": "person walking on the beach with a backpack",
"predicate": "wearing",
"object_id": 4,
"object_name": "backpack carried by the person"
},
{
"subject_id": 3,
"subject_name": "person walking on the beach with a backpack",
"predicate": "in front of",
"object_id": 0,
"object_name": "ocean water in the background"
},
{
"subject_id": 3,
"subject_name": "person walking on the beach with a backpack",
"predicate": "looking at",
"object_id": 6,
"object_name": "small twin-engine airplane with extended landing gear"
},
{
"subject_id": 4,
"subject_name": "backpack carried by the person",
"predicate": "carried by",
"object_id": 3,
"object_name": "person walking on the beach with a backpack"
},
{
"subject_id": 5,
"subject_name": "person walking on the beach",
"predicate": "in front of",
"object_id": 0,
"object_name": "ocean water in the background"
},
{
"subject_id": 5,
"subject_name": "person walking on the beach",
"predicate": "on",
"object_id": 2,
"object_name": "narrow sandy beach"
},
{
"subject_id": 5,
"subject_name": "person walking on the beach",
"predicate": "below",
"object_id": 6,
"object_name": "small twin-engine airplane with extended landing gear"
},
{
"subject_id": 5,
"subject_name": "person walking on the beach",
"predicate": "looking at",
"object_id": 6,
"object_name": "small twin-engine airplane with extended landing gear"
},
{
"subject_id": 6,
"subject_name": "small twin-engine airplane with extended landing gear",
"predicate": "above",
"object_id": 1,
"object_name": "runway with white markings"
},
{
"subject_id": 6,
"subject_name": "small twin-engine airplane with extended landing gear",
"predicate": "above",
"object_id": 2,
"object_name": "narrow sandy beach"
},
{
"subject_id": 6,
"subject_name": "small twin-engine airplane with extended landing gear",
"predicate": "in front of",
"object_id": 0,
"object_name": "ocean water in the background"
},
{
"subject_id": 6,
"subject_name": "small twin-engine airplane with extended landing gear",
"predicate": "approaching",
"object_id": 1,
"object_name": "runway with white markings"
},
{
"subject_id": 6,
"subject_name": "small twin-engine airplane with extended landing gear",
"predicate": "flying over",
"object_id": 2,
"object_name": "narrow sandy beach"
},
{
"subject_id": 7,
"subject_name": "grass patch near the runway",
"predicate": "located near",
"object_id": 1,
"object_name": "runway with white markings"
}
],
"regions": [
{
"object": "sea water",
"bbox": [
0,
0,
499,
222
],
"area": 104845,
"segmentation": {
"counts": "1n6a3000000O100000000000000000000000000000000000000000000000000000000000000000000000O10000000000000000000000000O1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000O1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000O10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000O100000000O1000000O100000000O1000000O100000000O100000000O1000000O1000000000000O100000000000000000000000000000000000000O100000000000000000000000000000000000000O100000000000000000000000000000000O100000000000000000000000000000000O1000000000000000000000000000000O100000000000000000000000000000000O100000000000000000000000000000000000000O10000000000000000000000000000000000O10000000000000000O10000000000000000O10000000000000000O10000000000000000O10000000000000000O1000000O100000000O1000000O1000000O100000000O1000000O10000000000000000O100000000000000O10000000000000000O100000000000000000000O100000000000000000000",
"size": [
335,
500
]
}
},
{
"object": "track",
"bbox": [
1,
221,
498,
112
],
"area": 51058,
"segmentation": {
"counts": "]a0k1d8000000000000000000000000001O00000O1000000000000000000000001O000000000000000000000000000000001O0000000000000000000000000000001O0000000000000000000000000000001O001O1O2N1O2N1O2N1O1O2N1O2N1O2N1O2N1O2N1O1O2N1O2N1O2N1O2N1O2N1O1O2N1O1O0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001O0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000T3lLnE",
"size": [
335,
500
]
}
},
{
"object": "sand beach",
"bbox": [
188,
186,
311,
36
],
"area": 6002,
"segmentation": {
"counts": "ofm11^:00000O100000000O1000000O100000000O100000000O100000000O1000000O1000001O0O100000000O1000000000000O1000000000000000000000000000000000000000000O100000001O000000000000000000000O1000000000000000000000000000000O1000000000000000000000000000000O10000000000000000000000000000000000O10000000000000000000000000000000000O100000000000000000000000000000000O100000000000000000000000000000000000000O1000000000000000000000000000000000000O100000000000000000000O100000000000000000000O1000000O10000O1000000O1000000O10000O1000000O1000000O100000000O10000000000O10000000000O10000000000000000000000O10000000000000000000000O1000000000000000001O0R1nNTI",
"size": [
335,
500
]
}
},
{
"object": "person is walking",
"bbox": [
341,
175,
12,
38
],
"area": 258,
"segmentation": {
"counts": "Uh_33Z:4L3[OJmF:n8IQG9n8<jFZOT9c0QGZOo8n0kFPOV9R17K2Ne0[O6Jka_1",
"size": [
335,
500
]
}
},
{
"object": "backpack",
"bbox": [
349,
182,
7,
13
],
"area": 63,
"segmentation": {
"counts": "Y[b31^:1O<D01O1M3N7ISc^1",
"size": [
335,
500
]
}
},
{
"object": "person is walking",
"bbox": [
236,
188,
25,
29
],
"area": 354,
"segmentation": {
"counts": "`\\]25Z:1O1]FJk86TGKl85oF0Q92iF0Y92aF1`9>1N2O2N2O1O101N100O1O3N3L3N1O001O0O10kd]2",
"size": [
335,
500
]
}
},
{
"object": "airplane",
"bbox": [
68,
57,
356,
111
],
"area": 6212,
"segmentation": {
"counts": "T\\f01^:1O1O1O001N10001O00001O0O101OO100000O10000000O10O1000000000O10O10000000000O01000000000O10O10000000O1000O10000000O100000O1000O10000000O10O100000000000O010000000000O5L000000000O100000O1000O10000000O1000O1000000000O010000000000O0100000000000O10O10000000O1000001O1O0O2O1O001O001lF[O[8n0\\GXO_8Q1XGPOg8\\1000QO_G0a8OfGH]86kGCU8<nGCR8<oGEP89RHHm76UHKj74YHJg76[HCi7>T1000000000O10O1000000000O1000005K7TNZOaIh0]6@ZIc0h6@QIb0R7@gHb0\\7A\\H?i7CPH=U8FcG:c8f05K5K5K5K0001O:F2N1N01A\\GSOd8m0`GnNa8R1bGkN^8U1>O010000000ZOkF0V91mFJU96SG@o8`0`0O10O1000O1000O1000O100000O100BCjF>U9FgF:Y9KbF6]9>01O2N28HO8gFoNb8R1]GPOa8Q1^GPOa8V1YGkNf8_100OM^G\\Na8e14POnGCR8;PHFo79RHGn77THJj76YHGh78X100000000O01000000000O010000000O10O100000O1000O1000O100000O1000O100000O10O1000000KTFGk9:50000O1000O1000O1000000O010000000O10O1000O100000O10O1000000O0100000O1000O1000O100000O0100000000O0100000O100000O10O1000000O01000O10000O01000O100000O010000N2Nj^h0",
"size": [
335,
500
]
}
},
{
"object": "grass",
"bbox": [
2,
281,
113,
52
],
"area": 4853,
"segmentation": {
"counts": "gm0e1j80000000000000000000001O0000000000000000000000001O000000000000000000000000001O0000000000000000000000001O000000000000000000000000001O0000000000000000000000001O001O1O2N1O1O2N1O2N1O2N1O1O2N1O2N1O2N1O1O2N1O2N1O2N1O1O2N1O2N1O2N1O1OcYm3",
"size": [
335,
500
]
}
}
]
},
Parsing the dataset
Synthetic Visual Genome (SVG) presents image information as a combination of "semantic relations + spatial regions". The key fields of each sample are:
- image_id: the image identifier (usually a file name, e.g. 2403699.jpg).
- scene_graph: the scene graph of the whole image, containing:
  - objects: the object list (textual object descriptions, with richer granularity and natural-language attributes).
  - relations: a list of relation triples of the form [subject_id, object_id, predicate].
- regions: a list of local spatial annotations; each region contains:
  - object: the local entity name (may differ from the objects texts; usually shorter or more local).
  - bbox: the bounding box [x, y, w, h].
  - area: the segmentation area.
  - segmentation: a COCO RLE mask (counts plus size=[height, width]).
- metadata: image-level metadata (height, width, scene description, etc.; if missing, it can be inferred from a region's segmentation.size).
Note: SVG provides "one whole-image scene graph plus spatial annotations for multiple regions", not "one scene graph per region merged afterwards".
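As a small aside on the bbox convention: the boxes are [x, y, w, h] rather than corner coordinates, so a tiny helper (illustrative, not part of the dataset's tooling) converts them to [x1, y1, x2, y2] for libraries that expect corners:

```python
def xywh_to_xyxy(bbox):
    # SVG regions store [x, y, w, h]; many vision libraries want [x1, y1, x2, y2].
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

print(xywh_to_xyxy([412, 237, 50, 49]))  # → [412, 237, 462, 286]
```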
Structure of our generated preprocessing files:
After running the code above, the saved files have the following structure:
- Unified record structure:
  - image_id
  - metadata (height, width, scene, objects_count, relations_count, etc.)
  - objects (the raw scene_graph.objects list)
  - triples (parsed from scene_graph.relations, with ids and the corresponding text names)
  - regions (keeping bbox, area, segmentation, object)
- Output formats:
  - 100-sample extract: svg_parsed_100.json (a standard JSON list)
  - Full dataset, about 93k samples: svg_parsed_all.ndjson (one JSON object per line, better suited to large-scale data)
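Because the NDJSON output stores one JSON object per line, it can be consumed as a stream without loading all ~93k records into memory at once. A minimal reader sketch (the demo file below is made up; in real use, pass the path to svg_parsed_all.ndjson):

```python
import json
import os
import tempfile

def iter_ndjson(path):
    # Yield one parsed record per line; memory stays flat regardless of file size.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny self-contained demo with two fake records
demo = tempfile.NamedTemporaryFile("w", suffix=".ndjson", delete=False, encoding="utf-8")
demo.write('{"image_id": "a.jpg", "triples": []}\n{"image_id": "b.jpg", "triples": [[0, 1, "on"]]}\n')
demo.close()
records = list(iter_ndjson(demo.name))
os.unlink(demo.name)
print(len(records), records[1]["image_id"])
```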
Some questions answered:
Why SVG object descriptions read more like sentences
Unlike the short labels of classic VG/VG150 (e.g. "person", "car", "on"), SVG objects are closer to natural-language phrases or sentence fragments (e.g. "person walking on the beach with a backpack"). This is a deliberate design choice:
- Richer semantics: an object is not just a category; it carries attributes, state, location, and function, which suits generative tasks and multimodal understanding.
- Alignment with language space: more natural descriptions align more easily with language-model representations, improving cross-modal performance.
- Richer relations: predicates cover functional and event-like relations (e.g. "used for storing food by", "walked on by"), so object texts need enough context to express the semantics.
As a result, the subject/object names in the triples are longer and more semantically informative; this is a property of the dataset.
Why is there only one metadata per image?
- metadata is image-level information describing the whole image. Finer subdivisions are listed as regions, each with its own spatial annotation; the scene graph likewise captures whole-image semantics. So "one metadata per image plus multiple regions" is the intended design.
What is segmentation.counts?
- It is a COCO RLE (Run-Length Encoding) compressed mask. counts is the compressed mask string, and size=[height, width] records the resolution. Decoding yields a binary mask, which can be used to show region boundaries more precisely.
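For the compressed string form used in this dataset, pycocotools.mask.decode is the right tool. To illustrate the underlying idea, here is a sketch that decodes the *uncompressed* RLE variant (a plain list of run lengths, alternating background/foreground, filled column-major over a size=[height, width] grid); the toy counts values are made up:

```python
import numpy as np

def rle_to_mask(counts, size):
    # Uncompressed COCO RLE: alternating run lengths of 0s and 1s,
    # filled in column-major (Fortran) order over an H x W grid.
    h, w = size
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0
    for run in counts:
        flat[pos:pos + run] = val
        pos += run
        val ^= 1  # runs alternate between background (0) and foreground (1)
    return flat.reshape((h, w), order="F")

mask = rle_to_mask([2, 2, 2], size=[2, 3])
print(mask)  # the middle column is foreground
```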
Does the current preprocessing make full use of the data?
- Yes: image_id, metadata, objects, triples, and regions are all included, which covers relation modeling, retrieval, and visualization. Optional enhancements include keeping the raw scene_graph JSON, or decoding and storing masks during preprocessing (at the cost of larger files).
How to load images for viewing and visualization later
- Put the image files under the images_root directory, named to match image_id.
- Call the helper with a record from the preprocessing output: show_image_with_regions(image_id, images_root, record).
- Suggested extensions:
  - Batch-generate visualization PNGs: iterate over the JSON/NDJSON, filter by some condition, and render.
  - Overlay a few triple texts on the image for a quick read of the key relations.
  - For mask visualization, use pycocotools.mask.decode to decode segmentation.counts into a 2D array and overlay it semi-transparently.
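Once a binary mask is decoded (e.g. via pycocotools.mask.decode), the semi-transparent overlay itself is just alpha blending. A small numpy sketch, demonstrated on a made-up 2x2 image rather than real dataset output:

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    # image: H x W x 3 uint8, mask: H x W of 0/1. Returns a blended copy.
    out = image.astype(np.float32)
    sel = mask.astype(bool)
    out[sel] = (1 - alpha) * out[sel] + alpha * np.asarray(color, dtype=np.float32)
    return out.astype(np.uint8)

# Demo on a tiny black image with one masked pixel
img = np.zeros((2, 2, 3), dtype=np.uint8)
msk = np.array([[1, 0], [0, 0]], dtype=np.uint8)
blended = overlay_mask(img, msk)
print(blended[0, 0])  # the masked pixel picks up half of the overlay color
```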
Closing remarks
The design of Synthetic Visual Genome emphasizes "natural-language-enriched object descriptions + rich semantic relations + fine-grained spatial annotations". The preprocessing and visualization approach in this post organizes that information into one unified structure, making the data quick to understand and convenient for downstream generative and multimodal tasks. Further features (batch export of visualizations, relation filtering and statistics, mask overlay rendering) can be built on top of it.