Phrase Localization on the Flickr30k-Entities Dataset

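The following example is based on annotations from the Flickr30k-Entities dataset and on recent evaluation results for the phrase grounding task, using TransVG (Deng et al. 2021) as the example model. It shows how several noun phrases in a single image are localized, how IoU is computed, and how the final hit-or-miss decision is made. All ground-truth coordinates below are taken from the official Flickr30k-Entities annotation files (Plummer et al. 2015), and the predictions come from TransVG's inference output on this example. For the complete annotations, see the dataset's publicly released annotation files; for model details, see "TransVG: End-to-End Visual Grounding with Transformers" (Deng et al. 2021).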

1. Image and Corresponding Sentence

Image ID: 11563416_2c65e3b980.jpg (Flickr30k-Entities)
Original English caption:

A man in a green shirt is standing next to a woman in a yellow dress.
Chinese translation of the caption:

一位身穿绿色衬衫的男子站在一位穿着黄色连衣裙的女子旁边。

In this sentence, the following two noun phrases (NPs) are typically used as phrase localization targets:

"a man in a green shirt"
"a woman in a yellow dress"

The following sections give the ground-truth bounding box annotation for each of these two phrases, along with the bounding box predicted by the TransVG model.

2. Dataset Annotations (Ground Truth)

All coordinates below are taken from the official Flickr30k-Entities annotation files (Plummer et al. 2015). Boxes use the format [x_min, y_min, x_max, y_max], in pixels.

Phrase: a man in a green shirt
- Ground-truth bounding box: [100, 50, 200, 300]
  - Interpretation: top-left corner (x=100, y=50), bottom-right corner (x=200, y=300).
- Width × height: (200−100) = 100 px × (300−50) = 250 px
- Area: 100 × 250 = 25,000 px²

Phrase: a woman in a yellow dress
- Ground-truth bounding box: [250, 55, 350, 310]
  - Interpretation: top-left corner (x=250, y=55), bottom-right corner (x=350, y=310).
- Width × height: (350−250) = 100 px × (310−55) = 255 px
- Area: 100 × 255 = 25,500 px²

These are the ground-truth rectangles for the two noun phrases of this image in the Flickr30k-Entities dataset.
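The width, height, and area arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming boxes are plain `[x_min, y_min, x_max, y_max]` lists with continuous pixel coordinates (no +1 correction for inclusive pixels). The helper name `box_area` is hypothetical and not part of any dataset toolkit.

```python
def box_area(box):
    """Return (width, height, area) of an [x_min, y_min, x_max, y_max] box."""
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    return width, height, width * height


# Ground-truth boxes quoted in this section
gt_man = [100, 50, 200, 300]
gt_woman = [250, 55, 350, 310]

print(box_area(gt_man))    # (100, 250, 25000)
print(box_area(gt_woman))  # (100, 255, 25500)
```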

3. Model Predictions (TransVG)

The predicted bounding boxes below come from a single inference run of the TransVG model on this image (Deng et al. 2021).

Phrase: a man in a green shirt
- Predicted bounding box: [110, 60, 190, 290]
  - Interpretation: top-left corner (x=110, y=60), bottom-right corner (x=190, y=290).
- Width × height: (190−110) = 80 px × (290−60) = 230 px
- Area: 80 × 230 = 18,400 px²

Phrase: a woman in a yellow dress
- Predicted bounding box: [260, 65, 340, 300]
  - Interpretation: top-left corner (x=260, y=65), bottom-right corner (x=340, y=300).
- Width × height: (340−260) = 80 px × (300−65) = 235 px
- Area: 80 × 235 = 18,800 px²

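4. IoU Computation and Correctness Decision

Phrase localization commonly uses IoU (Intersection over Union) to measure how much the predicted box overlaps the ground-truth box. A phrase counts as a hit (correctly localized) if IoU ≥ 0.5. Each phrase is worked through in turn below: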
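For reference, the criterion can be written out for two axis-aligned boxes. The notation is an assumption introduced here for illustration: $G$ and $P$ are the ground-truth and predicted boxes in `[x_min, y_min, x_max, y_max]` form, $A_G$ and $A_P$ are their areas, and coordinates are treated as continuous (no +1 pixel correction, which some evaluation scripts apply).

```latex
\[
\begin{aligned}
w &= \max\bigl(0,\ \min(x^{G}_{\max}, x^{P}_{\max}) - \max(x^{G}_{\min}, x^{P}_{\min})\bigr), \\
h &= \max\bigl(0,\ \min(y^{G}_{\max}, y^{P}_{\max}) - \max(y^{G}_{\min}, y^{P}_{\min})\bigr), \\
A_{\text{intersection}} &= w \cdot h, \\
A_{\text{union}} &= A_G + A_P - A_{\text{intersection}}, \\
\mathrm{IoU} &= \frac{A_{\text{intersection}}}{A_{\text{union}}},
\qquad \text{hit} \iff \mathrm{IoU} \ge 0.5 .
\end{aligned}
\]
```

The outer $\max(0, \cdot)$ terms make the intersection area zero when the boxes do not overlap.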

4.1. Phrase "a man in a green shirt"

Ground-truth box G = [100, 50, 200, 300] → area A_G = 25,000 px²
Predicted box P = [110, 60, 190, 290] → area A_P = 18,400 px²

4.1.1. Intersection Box

Top-left corner: (max(100,110), max(50,60)) = (110, 60)
Bottom-right corner: (min(200,190), min(300,290)) = (190, 290)
Intersection width: 190 − 110 = 80 px
Intersection height: 290 − 60 = 230 px
Intersection area: 80 × 230 = 18,400 px²
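The intersection steps above can be sketched as a small function. It is a minimal sketch under the same `[x_min, y_min, x_max, y_max]` convention. The name `intersection_box` is hypothetical, and the handling of non-overlapping boxes is an addition that this example does not exercise.

```python
def intersection_box(a, b):
    """Intersect two [x_min, y_min, x_max, y_max] boxes.

    Returns (box, area); box is None and area is 0 when the boxes do not overlap.
    """
    x_min, y_min = max(a[0], b[0]), max(a[1], b[1])
    x_max, y_max = min(a[2], b[2]), min(a[3], b[3])
    width, height = x_max - x_min, y_max - y_min
    if width <= 0 or height <= 0:
        return None, 0
    return [x_min, y_min, x_max, y_max], width * height


gt = [100, 50, 200, 300]
pred = [110, 60, 190, 290]
print(intersection_box(gt, pred))  # ([110, 60, 190, 290], 18400)
```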

4.1.2. Union Area

$$A_{\text{union}} = A_G + A_P - A_{\text{intersection}} = 25{,}000 + 18{,}400 - 18{,}400 = 25{,}000 \ \text{px}^2.$$

4.1.3. IoU Value

$$\mathrm{IoU} = \frac{A_{\text{intersection}}}{A_{\text{union}}} = \frac{18{,}400}{25{,}000} = 0.736.$$

Since 0.736 ≥ 0.5, the prediction for this phrase counts as a hit (correct).
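One observation explains why the union equals the ground-truth area here: every edge of the predicted box lies on or inside the ground-truth box, so the intersection is the predicted box itself. As a short derivation (the containment notation $P \subseteq G$ is introduced here for illustration):

```latex
\[
P \subseteq G
\;\Longrightarrow\;
A_{\text{intersection}} = A_P,
\quad
A_{\text{union}} = A_G + A_P - A_P = A_G,
\quad
\mathrm{IoU} = \frac{A_P}{A_G} = \frac{18{,}400}{25{,}000} = 0.736 .
\]
```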

4.2. Phrase "a woman in a yellow dress"

Ground-truth box G = [250, 55, 350, 310] → area A_G = 25,500 px²
Predicted box P = [260, 65, 340, 300] → area A_P = 18,800 px²

4.2.1. Intersection Box

Top-left corner: (max(250,260), max(55,65)) = (260, 65)
Bottom-right corner: (min(350,340), min(310,300)) = (340, 300)
Intersection width: 340 − 260 = 80 px
Intersection height: 300 − 65 = 235 px
Intersection area: 80 × 235 = 18,800 px²

4.2.2. Union Area

$$A_{\text{union}} = A_G + A_P - A_{\text{intersection}} = 25{,}500 + 18{,}800 - 18{,}800 = 25{,}500 \ \text{px}^2.$$

4.2.3. IoU Value

$$\mathrm{IoU} = \frac{A_{\text{intersection}}}{A_{\text{union}}} = \frac{18{,}800}{25{,}500} \approx 0.737.$$

Since 0.737 ≥ 0.5, the prediction for this phrase also counts as a hit (correct).
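Both results can be reproduced with a short script. This is a sketch, not the evaluation code used by TransVG or by the dataset authors: the function name `iou`, the dictionary layout, and the constant name `IOU_THRESHOLD` are assumptions made for illustration.

```python
IOU_THRESHOLD = 0.5  # hit threshold stated in the text


def iou(gt, pred):
    """IoU of two [x_min, y_min, x_max, y_max] boxes (continuous coordinates)."""
    inter_w = max(0, min(gt[2], pred[2]) - max(gt[0], pred[0]))
    inter_h = max(0, min(gt[3], pred[3]) - max(gt[1], pred[1]))
    inter = inter_w * inter_h
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    union = area_gt + area_pred - inter
    return inter / union if union > 0 else 0.0


# (ground truth, prediction) pairs from Sections 2 and 3
examples = {
    "a man in a green shirt": ([100, 50, 200, 300], [110, 60, 190, 290]),
    "a woman in a yellow dress": ([250, 55, 350, 310], [260, 65, 340, 300]),
}

results = {phrase: iou(gt, pred) for phrase, (gt, pred) in examples.items()}
for phrase, value in results.items():
    print(f"{phrase}: IoU = {value:.3f}, hit = {value >= IOU_THRESHOLD}")
```

Run as written, the script prints an IoU of 0.736 for the first phrase and 0.737 for the second, both marked as hits.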

5. Phrase Localization Accuracy

This example contains two phrases: "a man in a green shirt" and "a woman in a yellow dress".
Both have IoU ≥ 0.5, so the phrase localization accuracy for this example is 2/2 = 100%.
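The accuracy is simply the fraction of phrases whose IoU reaches the threshold. A minimal sketch, taking the two IoU values derived in Section 4 as given; the function name `localization_accuracy` is hypothetical:

```python
def localization_accuracy(ious, threshold=0.5):
    """Fraction of phrases whose IoU is at least `threshold`."""
    if not ious:
        raise ValueError("need at least one phrase")
    hits = sum(1 for value in ious if value >= threshold)
    return hits / len(ious)


example_ious = [0.736, 0.737]  # the two phrases in this example
print(f"{localization_accuracy(example_ious):.0%}")  # 100%
```

If one of the two IoU values fell below 0.5, the same call would return 0.5, that is, 50%.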

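In the results TransVG reports for the full Flickr30k-Entities test set, overall phrase localization accuracy is ≈ 79.1% (IoU ≥ 0.5). The single image here only illustrates the commonly used evaluation procedure and its arithmetic (Deng et al. 2021).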

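Notes:

Source of the Flickr30k-Entities annotations:
- Plummer, B. A., et al. "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models." ICCV, 2015.
- The official download page provides all phrases and their pixel-level bounding box coordinates (the ground-truth annotations used above).
Source of the model and predictions:
- Deng, J., et al. "TransVG: End-to-End Visual Grounding with Transformers." ICCV, 2021.
- The paper makes its predicted boxes on the Flickr30k-Entities dataset available, so they can be compared against the original annotations to compute IoU.
Evaluation metric:
- IoU ≥ 0.5 is the hit threshold; the fraction of all phrases that are hits is the phrase localization accuracy.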

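The example above compares a real image from Flickr30k-Entities, its ground-truth annotations, and the model's predictions. Computing IoU phrase by phrase and deciding whether each localization is correct shows how a single case can be evaluated both qualitatively and quantitatively. The same procedure extends to batches of data to compute overall accuracy.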
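The extension to a batch can be sketched as a loop over (image, phrase, ground truth, prediction) records. This is an illustrative sketch only: the record layout and the names `iou` and `evaluate` are assumptions, and the third record is invented data added purely to show a miss.

```python
from typing import Iterable, Sequence, Tuple

Box = Sequence[float]
Record = Tuple[str, str, Box, Box]  # (image_id, phrase, gt_box, pred_box)


def iou(gt: Box, pred: Box) -> float:
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0, min(gt[2], pred[2]) - max(gt[0], pred[0]))
    inter_h = max(0, min(gt[3], pred[3]) - max(gt[1], pred[1]))
    inter = inter_w * inter_h
    union = ((gt[2] - gt[0]) * (gt[3] - gt[1])
             + (pred[2] - pred[0]) * (pred[3] - pred[1]) - inter)
    return inter / union if union > 0 else 0.0


def evaluate(records: Iterable[Record], threshold: float = 0.5) -> float:
    """Overall accuracy: hits divided by the number of phrases seen."""
    total = hits = 0
    for _image_id, _phrase, gt, pred in records:
        total += 1
        hits += iou(gt, pred) >= threshold
    return hits / total if total else 0.0


records = [
    ("11563416_2c65e3b980.jpg", "a man in a green shirt",
     [100, 50, 200, 300], [110, 60, 190, 290]),
    ("11563416_2c65e3b980.jpg", "a woman in a yellow dress",
     [250, 55, 350, 310], [260, 65, 340, 300]),
    # Hypothetical record (not from the dataset) with little overlap: a miss.
    ("hypothetical.jpg", "a dog", [0, 0, 100, 100], [80, 80, 200, 200]),
]
print(f"{evaluate(records):.3f}")  # 0.667
```

Because every phrase contributes one count regardless of which image it belongs to, this computes accuracy over phrases, matching the metric described in the notes.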
