1. Paper title and link

Paper title: Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs
Paper link: this.

2. Background

2.1 Motivation

We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations.
However, these CL-based methods suffer from a train-test discrepancy, since it only considers image-text alignment during training, whereas segmentation requires region-text alignment during testing. [CL = contrastive learning]
However, there still remains another challenge in how to achieve precise localization of arbitrary concepts without dense annotations.
There are several approaches that simply address this issue using dense annotation (segmentation masks) in addition to image-text pairs.
The dense annotation helps to improve segmentation performance in a fixed benchmark dataset, but the requirements of expensive dense annotation still limit the applicable domains and scalability of the method.
While the existing methods have shown impressive results even through the training with image-text alignment, they still suffer from the alignment-level discrepancy between training and testing phases as depicted in Fig. 2.
To address this train-test discrepancy, we propose the Text-grounded Contrastive Learning (TCL) framework, which allows a model to learn region-text alignment directly from the image-text pairs without any dense annotations.

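Put simply, the paper focuses on open-world semantic segmentation, and the authors name two challenges it currently faces. The first is how to learn arbitrary concepts beyond a predefined set of categories. Prior work has already addressed this reasonably well by exploiting large amounts of web-crawled image-text pairs, and the method proposed in this paper follows that line of research.

The second challenge is how to localize arbitrary concepts precisely without dense annotations. Even after a new class has been learned, the model still has to delineate the object's outline accurately. Existing solutions combine dense annotations with image-text pairs, but dense annotations are expensive, which limits how far this approach can go. The authors therefore try to solve both challenges from image-text alignment alone.

The paper notes that many models compare whole images with text during training, but at test time they are asked to compare image regions with text. This train-test gap leads to suboptimal results at test time. The authors ask whether there is a way to close this gap.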

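Next, I would like to explain a few terms I had to look up.

The first is semantic segmentation, i.e. classifying every pixel in an image. Its biggest difference from instance segmentation is that it does not distinguish different individuals of the same class. The biggest difference between open-world and ordinary semantic segmentation is that open-world semantic segmentation must also handle non-target classes (that is, novel classes). The task consists mainly of:

Open-world semantic segmentation: assign an "unknown" label to novel classes and the correct label to known classes;
Incremental learning: after annotators provide labels for the novel classes, gradually merge the novel classes into the knowledge base;
(from a paper on open-world semantic segmentation for LiDAR point clouds)

The second is dense annotation. This page explains it in plain language.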

2.2 Approach

In this paper, we proposed a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment.
Our method generates a segmentation mask for a given text, extracts text-grounded image embedding from the masked region, and aligns it with text embedding via TCL.
By learning region-text alignment directly, our framework encourages a model to directly improve the quality of generated segmentation masks.
Our key idea is to incorporate a text grounding procedure within contrastive learning as illustrated in Fig. 2, where TCL generates a segmentation mask indicating text-grounded regions, computes grounded region embeddings using the mask, and applies contrastive learning between text and grounded region.

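To summarize: within open-world semantic segmentation from image-text pairs there are two lines of work, prior work and the authors'. Prior work is semi-supervised: besides image-text pairs it also uses dense annotations. Semi-supervised methods learn the segmentation ability from dense annotations and expand the target vocabulary with image-text supervision. Using dense annotations lets the model learn region-level rather than image-level alignment, which yields high-quality segmentation masks. But this approach depends on expensive dense annotations, and that is what the authors set out to improve.

My understanding of the authors' method: $Image$ and $Text$ are fed to the $Grounder$, which outputs a mask that has learned region features of the image (so region-level comparison already happens during training). The masked region must agree with the region described by the text (contrastive learning is applied between the two). After training, a new image-text pair is given directly as input, and the predicted mask must again agree with the region the text describes.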

2.3 Contributions

We introduce a novel framework for open-world segmentation, named Text-grounded Contrastive Learning (TCL), which enables learning region-text alignment directly without train-test discrepancy, thus learning to generate more precise segmentation masks through only image-text pairs.
We present a unified evaluation protocol and reevaluate recent open-world segmentation models for a fair and direct comparison.
We achieve the new state-of-the-art zero-shot segmentation performance on 8 segmentation datasets with large margins compared to existing methods.

2.4 Comparison with concurrent work

The target of this paper is an unsupervised setting, which aims to learn segmentation from only image-text pairs without any dense annotation.
Since the massive image-text pairs are easily obtained by web crawling without human annotators, applicable domains of unsupervised methods become almost unlimited.
Existing open-world semantic segmentation studies have taken a strategy to bypass this issue.
Instead of learning region-level alignment directly, they transfer image-level alignment to region-level by heuristic modification or clustering.
- the learning objective is still image-level alignment due to lack of the region annotation
- the number of clusters is pre-defined independent of the given image
- clustering sub-region image embeddings is independent of the query text.
In summary, existing methods indirectly address region-level alignment problems by learning image-level alignment. To tackle this problem, we propose a novel region-level alignment objective, named Text-grounded Contrastive Learning (TCL).

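The difference between zero-shot and open-world should be spelled out here; see this blog post for details. Open-world recognition is about the ability to handle unknown classes. It is similar to zero-shot learning but puts more emphasis on coping with the uncertainty of unknown classes.

Open-world semantic segmentation from image-text pairs is implemented in two ways: semi-supervised learning and unsupervised learning.

OpenSeg and OVSeg first train a mask generator using dense annotation and expand target vocabulary using image-text datasets. In other words, both models are semi-supervised: the mask generator is built with the help of dense annotations.
MaskCLIP proposes to obtain a dense image embedding from CLIP image encoder through heuristic modification of the last attention layer. This is the other route, the unsupervised learning the authors use. Without dense annotations it is hard to build a mask generator and thus to learn region-text alignment, so existing methods work around the problem. MaskCLIP, for example, obtains a dense segmentation map by heuristically modifying the last self-attention layer.

But as noted above, because region-level annotations are missing, the learning objective is still image-level; region-level alignment is obtained only indirectly, by learning image-level alignment.

To solve this, the authors propose a new region-level contrastive objective, TCL, that learns region-text alignment directly.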

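3. Method

3.1 Overview

Existing image-text models are trained by maximizing the mutual information between image and text. Parameters are updated by backpropagation on this image-level objective, but at inference the same parameters are used to compute region-text alignment and produce segmentation masks, which creates a gap between training and testing.

The authors propose the TCL model. They build a grounder that takes text and an image as input and generates a text-grounded mask. The mask is multiplied element-wise with the image to obtain a region image, and the mutual information between the region image and the text is then maximized.

m is a random variable for the text-grounded mask; it indicates the region described by the text.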

3.2 Model architecture

3.2.1 Grounder

Let us first look at what the Grounder is made of.

image encoder $E_v$ is in charge of providing a single (L2-normalized) global feature as well as dense patch-level features. The image encoder outputs a global vector $V^g$ and patch-level vectors $V^d$. Later in the paper the authors write that they follow MaskCLIP and modify the last self-attention layer.
text encoder $E_t$ provides a (L2-normalized) text embedding feature.
grounding decoder $D_g$ converts dense features from image encoder into finer pixel-level embeddings for alignment with text. That is, it turns $V^d$ into pixel-level embeddings so they can be aligned with the text.

One thing to note: both $CLIP\ Encoder$s are pre-trained and frozen during training (although after 30,000 iterations only the last block of the image encoder is unfrozen, to improve model capacity). The overall Grounder flow is as follows:

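In Appendix B the authors explain the decoder structure in detail. The core design principle of the decoder is to preserve CLIP's pre-trained knowledge. It consists of $4$ gated convolution blocks. A gated convolution works as follows:

$g$ is a learnable gate parameter, and a residual connection is attached at the end. The gray parts are upsampling, which raises the image resolution and thereby improves segmentation (the first two use nearest-neighbor interpolation, the last one bilinear interpolation). The figure shows two branches. One is the KP branch, which mainly preserves CLIP's pre-trained knowledge: $V^d$ is spatially reshaped, this part has no learnable parameters, and it is used only at test time. Its output is then mixed with $M$ to produce the final mask.

Extension. Gated convolution is a learnable version of partial convolution. Its output is the element-wise product of the outputs of two standard convolution layers, one followed by any activation function and the other by a sigmoid. The sigmoid is there to control which parts of the image's information are passed on to the next layer.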
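To make the Extension paragraph concrete, here is a minimal 1-D sketch of a generic gated convolution in plain Python. It only illustrates the "feature branch times sigmoid gate" idea described above; it is not the decoder block from Appendix B (which uses a learnable gate $g$ and a residual connection), and the helper names `conv1d` and `gated_conv1d` as well as the choice of `tanh` as the feature activation are my own assumptions.

```python
import math


def conv1d(x, kernel, bias=0.0):
    """Valid 1-D cross-correlation of a list of numbers with a kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_conv1d(x, feat_kernel, gate_kernel, activation=math.tanh):
    """Element-wise product of an activated feature branch and a sigmoid gate."""
    features = [activation(v) for v in conv1d(x, feat_kernel)]
    gates = [sigmoid(v) for v in conv1d(x, gate_kernel)]  # each gate lies in (0, 1)
    return [f * g for f, g in zip(features, gates)]


# A gate kernel of zeros gives sigmoid(0) = 0.5 everywhere, so half of each
# feature value passes through.
out = gated_conv1d([1.0, 2.0, 3.0, 4.0], feat_kernel=[1.0, 0.0], gate_kernel=[0.0, 0.0])
```

Because the gate is computed from the input itself, the layer can learn per-position how much of the feature branch to let through, which is the "soft mask" reading of the sigmoid given above.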

3.2.3 Text-grounded Contrastive Learning

Recall that the main idea of TCL is to use text-grounded images instead of whole images, unlike conventional CL. For this purpose, we define TCL losses in three different levels---image-level, feature-level, and area-level---using the generated masks M for all pairs of images and texts in a batch.

As noted earlier, the main idea of this part is to multiply the mask output by the previous module with the image, compute the mutual information with the text vector, and maximize it.

Image-level TCL loss. To compute a text-grounded image embedding, the straightforward way is to encode only the image region that carries the semantics. To keep this step differentiable for backpropagation, Gumbel-Max is applied to the mask $M$ output by the previous module to turn it into a binarized mask $M^b$, and $M^b$ is multiplied element-wise with the image $X$ to obtain a differentiable masked image.

The masked image is then fed into $CLIP_i$ to obtain the text-grounded image embedding $v^g$. Multiplying $v^g$ with the text vectors gives the similarity matrix $S$, and feeding $S$ into $InfoNCE$ yields the image-level TCL loss.
$S$ in this loss is the similarity matrix: the more similar a matched pair is, the larger its entry, and the smaller the resulting $loss$.
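As a sanity check on the statement that a more similar matched pair gives a smaller loss, here is a small InfoNCE sketch over a similarity matrix in plain Python. I assume the common symmetric form (image-to-text plus text-to-image cross-entropy, with matched pairs on the diagonal) and a `temperature` parameter; these notes do not reproduce the paper's exact formula, so treat both as assumptions.

```python
import math


def info_nce(sim, temperature=1.0):
    """Symmetric InfoNCE for a square similarity matrix given as a list of lists.

    sim[i][j] is the similarity between image i and text j; matched pairs
    are assumed to sit on the diagonal.
    """
    n = len(sim)

    def diagonal_cross_entropy(mat):
        total = 0.0
        for i in range(n):
            logits = [v / temperature for v in mat[i]]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    transposed = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(transposed))


aligned = [[5.0, 0.0], [0.0, 5.0]]     # matched pairs are the most similar
misaligned = [[0.0, 5.0], [5.0, 0.0]]  # matched pairs are the least similar
loss_aligned = info_nce(aligned)
loss_misaligned = info_nce(misaligned)
```

With these toy matrices the aligned batch gets the lower loss, which matches the intuition stated above.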

Feature-level TCL loss. The authors write that this $loss$ exists to keep the model from forming masks over regions that no text describes.

Computing text-grounded images for negative masks is too expensive; in other words, an image-level TCL loss over negative masks is not feasible, so another strategy is needed.
To account for the loss contributed by negative masks, a feature-level TCL loss is introduced.

$v^S$ is the output of the $grounder$ and $M$ is the negative mask. We then multiply $v^f$ with the text vector $t$ (for all image-text pairs in the batch) to obtain the similarity matrix $S$.

I have not yet worked out why $v^f$ is computed this way, but the paper calls it the feature-level text-grounded image embedding.
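One way to read $v^f$ is as a mask-weighted average of the pixel-level embeddings. The sketch below assumes exactly that form (sum of mask times embedding, divided by the mask sum); I believe this is the idea, but the exact formula is not reproduced in these notes, so the function name `masked_average` and the `eps` term are assumptions.

```python
def masked_average(features, mask, eps=1e-8):
    """Mask-weighted mean of per-pixel embeddings.

    features: H x W x C nested lists (pixel-level embeddings)
    mask:     H x W nested lists with weights in [0, 1]
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    weight = 0.0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            weight += m
            for c in range(channels):
                total[c] += m * vec[c]
    return [t / (weight + eps) for t in total]


features = [[[1.0, 0.0], [3.0, 0.0]],
            [[0.0, 2.0], [0.0, 4.0]]]
top_row_mask = [[1.0, 1.0], [0.0, 0.0]]
v_f = masked_average(features, top_row_mask)  # averages only the top-row pixels
```

Under this reading, the pixel-level embeddings are computed once per image and only the cheap pooling step is repeated for each text in the batch, which would explain why negative masks are affordable at the feature level but not at the image level.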

Area TCL loss. This $loss$ is introduced to keep the model from collapsing into generating a mask for the whole image instead of the desired region. That is the paper's own wording. My own reading is that the mask could also shrink to a single point, in which case the feature-level TCL loss would be 0, which is clearly not the result we want.

However, the model can collapse into a trivial solution with only these losses---generating a mask for the entire image instead of the desired region
This loss encodes prior information about the mask area to ensure that only the text-described region is captured.
The prior is set by hand. Here $p^+$ is set to $0.4$, the average text-described area measured with MaskCLIP on the CC3M dataset.
The area TCL loss is defined by L1-distance between the area priors and the expected area of each mask

$M^+$ denotes the positive mask and $\overline{M}^+$ the area of the positive mask.
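The area loss described above (L1 distance between an area prior and the expected mask area) can be sketched as follows. The positive prior of 0.4 comes from the text; the negative prior `p_neg=0.0` and the way the expectation is taken (a plain mean over the masks in the batch) are my assumptions, as are the function names.

```python
def mask_area(mask):
    """Mean value of an H x W soft mask, i.e. the fraction of the image it covers."""
    values = [v for row in mask for v in row]
    return sum(values) / len(values)


def area_loss(pos_masks, neg_masks, p_pos=0.4, p_neg=0.0):
    """L1 distance between the area priors and the mean mask areas."""
    mean_pos = sum(mask_area(m) for m in pos_masks) / len(pos_masks)
    mean_neg = sum(mask_area(m) for m in neg_masks) / len(neg_masks)
    return abs(p_pos - mean_pos) + abs(p_neg - mean_neg)


full = [[1.0, 1.0], [1.0, 1.0]]   # collapsed solution: mask covers the whole image
half = [[1.0, 0.0], [1.0, 0.0]]   # mask covers half of the image
empty = [[0.0, 0.0], [0.0, 0.0]]  # negative pair: nothing should be selected

loss_collapsed = area_loss([full], [empty])
loss_half = area_loss([half], [empty])
```

The collapsed full-image mask is penalized more than the half-image mask, which is how this term discourages the trivial solution quoted above.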

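Smooth regularization. This regularizer is mainly there to preserve edges; it comes from an algorithm used mainly for denoising and image restoration.

Total variation is the difference between neighboring pixels in an image. The larger the differences between pixels, the higher the image's total variation. Total Variation regularization minimizes this quantity, which smooths the image and removes noise.
The basic idea of total-variation denoising: if an image contains a lot of high-frequency detail (spikes, noise, and so on), the sum of gradient magnitudes over the whole image (its total variation) is large. Lowering that sum achieves denoising.
Anisotropy loosely means that a property differs from direction to direction. For an image, it means that the gradient differs across the four directions around each pixel. See the linked article for details.
This is in fact anisotropic filtering. Filtering suppresses noise in the target image while keeping as much image detail as possible. Isotropic filtering treats every direction equally: edges and flat regions are smoothed alike, so denoising easily loses edges and other meaningful high-frequency content. Gaussian and mean filters are examples. Anisotropic filtering does not treat directions equally: edges and other high-frequency structures are handled differently from flat regions. It involves mathematical concepts such as gradient, divergence, and the Laplacian; the linked article explains their role in denoising.

Here $M$ is the mask and $V^S$ is the pixel-level embedding.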

Final loss. The total $loss$ is computed as follows. I simply took a screenshot of the detailed explanation, since paraphrasing it would only make it less clear.
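Since the screenshot is not reproduced here, the following is my own reconstruction of the general shape of the total loss, assuming one weight per term as the loss-weight hyperparameters $\lambda$ in Tables 2c to 2e suggest. The symbol names are assumptions, with $\mathcal{L}_{TCL}$ standing for the image-level and feature-level TCL losses together:

```latex
$$
\mathcal{L} = \lambda_{TCL}\,\mathcal{L}_{TCL} + \lambda_{area}\,\mathcal{L}_{area} + \lambda_{tv}\,\mathcal{L}_{tv}
$$
```

Setting any single $\lambda$ to $0$ removes that term, which is the $\lambda = 0.0$ case examined in the hyperparameter ablation.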

3.2.4 Inference Pipeline

Note that the architecture in the figure above applies to training. At test time the Text-grounded Contrastive Learning module is discarded and only the grounder is used to generate masks.

$M$ is the text-grounded mask computed by the grounder and $N$ is the number of target classes. $M_{h,w}$ is the final segmentation map. According to the paper, this inference procedure is very similar to CLIP's; see the paper for details.
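My reading of this step: the grounder produces one mask per class name, and each pixel is assigned the class whose mask scores highest there, much like CLIP picks the best-matching text for an image. A minimal sketch under that assumption (the function name `segment` and the nested-list layout are hypothetical):

```python
def segment(class_masks):
    """Assign each pixel the index of the class with the highest mask value.

    class_masks: N masks, each an H x W nested list, one mask per class name.
    Returns an H x W nested list of class indices.
    """
    num_classes = len(class_masks)
    h, w = len(class_masks[0]), len(class_masks[0][0])
    return [[max(range(num_classes), key=lambda n: class_masks[n][i][j])
             for j in range(w)]
            for i in range(h)]


masks = [
    [[0.9, 0.2], [0.8, 0.1]],  # mask for class 0
    [[0.1, 0.7], [0.3, 0.6]],  # mask for class 1
]
seg = segment(masks)
```

Any post-processing the paper applies on top of this (for example refinement) is left out of the sketch.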

4. Experiments

4.1 Experiment Settings

4.1.1 Unified evaluation protocol

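This part says that before this paper, open-world semantic segmentation had no clear standard for comparison. Earlier studies evaluated with their own protocols, for example using different data-processing strategies for different datasets. The authors therefore propose a protocol of their own.

surprisingly, even for the same dataset, the target classes are sometimes different across studies. (That is, for the same image, model A may define the target class as "person" while model B defines it as "person riding a bicycle".) For a fair comparison, the authors propose a unified evaluation protocol that follows the open-world scenario: no prior access to the target data is allowed before evaluation.
Under this scenario, the proposed protocol prohibits dataset-specific hyperparameters or tricks. The tricks in question describe the target class in more detail, for example going from "person" to "person riding a bicycle"; such richer descriptions make a model perform better.
The evaluation in this paper uses the unified class names from the default version of MMSegmentation, without class-name-based tricks. The other evaluation settings follow GroupViT, and mIoU is the performance metric. MMSegmentation is an open-source semantic segmentation toolbox; evaluating every model through it keeps hyperparameters and tricks the same across models, so the comparison is as fair as possible.
The proposed protocol does not unify the refinement method. TCL uses PAMR. The authors ran PAMR-based comparisons for all models, and refinement helps some models but hurts others. Because not every model was designed with PAMR in mind, the effect of refinement depends on the model, which is why no single refinement method is imposed.
Some of the compared models have no pre-trained CLIP module. For such a model (GroupViT) the authors train on a larger dataset, but it still falls short of TCL.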
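For readers unfamiliar with the metric used throughout the experiments, mean intersection-over-union averages the per-class IoU between the predicted and ground-truth label maps. The sketch below computes it for a single image in plain Python; real benchmark tooling accumulates intersections and unions over the whole dataset and handles ignore labels, which this simplified version does not.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                if p == c and t == c:
                    inter += 1
                if p == c or t == c:
                    union += 1
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


pred = [[0, 0],
        [1, 1]]
target = [[0, 1],
          [1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.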

4.1.2 Benchmark datasets and comparison methods

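This subsection says the authors evaluate on 8 widely used benchmark datasets, split into a with-background group and a without-background group. Open-world semantic segmentation relies on class-name descriptions, so the background class needs extra consideration, and the datasets with a background class make it possible to evaluate this.

The appendix adds details of the comparison experiments. For models without released code, the authors compare under the same conditions following each model's own evaluation protocol, and TCL still achieves state-of-the-art results. For models with several variants, they pick the variant that performed best in evaluation.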

4.2 Zero-shot Transfer to Semantic Segmentation

This subsection compares the different models under the unified protocol. TCL clearly performs best. All numbers in the table are mIoU.

4.3 Qualitative Results

4.3.1 Visualization of the generated text-grounded masks

The authors run several experiments here. The first is a visualization of the text-grounded masks, which in effect shows what the grounding decoder contributes. $w/o\ D_g$ is the mask output without the decoder, and $w/\ D_g$ is the mask output with the decoder. Blue marks correct text prompts and red marks wrong ones. Adding the decoder clearly improves the results a great deal.

4.3.2 Qualitative comparison

This subsection compares existing methods qualitatively. Panel (a) of the figure below shows the errors different methods make in segmentation; for example, ReCo gives no consideration to the background class and struggles to segment background regions. Panel (b) shows open-world semantic segmentation ability.

$GroupViT$ attends only to the main objects in the image and treats the other objects as background, whereas TCL consistently generates finer masks.

We collect test samples containing visual concepts not included in conventional segmentation datasets (e.g., moon, sunset) or free-form texts (e.g., "two women and one man with a smiling snowman").

More qualitative comparisons are in the appendix. The point there is that VOC images are object-centric and consist mainly of 1 or 2 segments, so the appendix experiments focus on open-world semantic segmentation of complex scenes. As the figure below shows, GroupViT produces little noise but makes larger errors, ReCo suffers from noise, and TCL, although somewhat noisy, is comparatively clean.

The authors also run comparisons on images from the web.

In this experiment, we investigate the discrimination capability of the model in various aspects: proper nouns (Frodo, Gandalf, Pyramid, Sphinx, Samwise, Gollum, Taj Mahal, Batman, Superman), colors with the same object (red, green, yellow bananas), letters (MMU, Turkish, Fighter), and subclasses (Corgi, Shepherd). The model performs well.

4.3.3 Additional analysis on failure cases and model behavior

This subsection focuses on the model's shortcomings.

First, TCL has difficulty capturing segmentation boundaries accurately. This is a fundamental challenge of unsupervised open-world segmentation: without dense annotations, capturing boundaries precisely is hard. Although the proposed method improves segmentation performance markedly, there is still room for improvement.

Furthermore, despite the help of the smooth prior loss, the predictions still tend to be noisy, e.g., "sea" or "hair drier". In case 2 of the figure below, sea is segmented out.

Second, the benchmarks themselves are ambiguous.

On the other hand, we also find crucial issues in the current benchmark datasets: ambiguities in the class label set and scene semantics.
Mostly the labels have different semantics in detail, but the distinction between the labels can be ambiguous depending on how the image captures the scene.
For example, it is hard to distinguish "clouds" and "fog" in Fig. 9a and "hill" and "mountain" in Fig. 9b.
Also, there are labels with superset-subset relations. For instance, the COCOStuff dataset has "broccoli", "vegetable", and "food-other" classes.
In this study, we propose a unified evaluation protocol to compare the existing methods fairly, but it only unifies the evaluation protocol and simply employs existing benchmark datasets. This analysis suggests the need for further advanced benchmarks dedicated to open-world scenarios in the future.

In other words, ambiguity arises when a segment carries several semantics, especially when one label contains another. A more fitting description of the "cloud" and "grass" segments in Fig. 9a would be "foggy or cloudy mountain" and "bushes on the grass". The ground-truth (GT) labels represent only part of the full semantics. As with superset-subset relations, benchmarks for open-world scenarios need extra consideration to resolve such ambiguity.

Section G of the appendix, the model behavior analysis, mainly says that the more specific the text prompt, the better the model performs.

4.4 Ablation study

This section examines how each component affects model performance.

4.4.1 Baseline to TCL

This part compares the base model with the decoder and then the TCL loss added. See Table $2a$.

Our initial model before training based on MaskCLIP [33] is referred to as the baseline (A), which modifies the last attention layer of the CLIP image encoder. When we add only the grounding decoder to the baseline without TCL loss (B), there is no improvement in performance. This suggests that training the decoder with the same CL loss as the pre-training (CLIP) does not enhance the localization capabilities. As shown in (C), the proposed framework becomes complete with TCL loss.

4.4.2 Impact of individual TCL losses

This part compares how each component of the TCL loss affects performance. See Table $2b$.

Smooth regularization is used in every experiment.
The CL loss (D) is computed by applying attention pooling to the dense image embedding $V^S$.
Comparing (D) with (C), the proposed TCL loss improves segmentation performance substantially (61.1 → 77.4). The image-level or feature-level TCL loss alone (E, F) already improves performance substantially, and using both improves it further. Using CL in addition to TCL (G) does not improve performance, and the area TCL loss is necessary within the TCL framework to prevent model collapse (H), as discussed in Section 3.3. The difference between (B) and (D) is that (D) uses smooth regularization.

4.4.3 Hyperparameters

This part covers the effect of hyperparameters on the model. See Tables $2c$ to $2e$.

Tables 2c to 2e show the performance changes according to the variation of the loss weight hyperparameters (HPs). The first rows show the importance of each loss (λ = 0.0 cases). The absence of area TCL loss causes a significant performance drop (Table 2d), as mentioned above. Smooth regularization also significantly contributes to the final performance (Table 2e), supporting our assumption that the text-described region is smooth rather than noisy.

5. Conclusion

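We propose a novel framework for open-world semantic segmentation that uses only image-text pairs and addresses the alignment discrepancy between training (image-text) and testing (region-text). In the proposed framework, the grounding procedure is incorporated into contrastive learning, which allows the alignment between a text and its text-grounded region (i.e., the segmentation mask) to be learned explicitly. We also present a unified evaluation protocol for a fair comparison of existing methods, under which TCL achieves state-of-the-art zero-shot segmentation performance on all 8 benchmarks, surpassing previous methods by a large margin. We hope this study encourages a new research direction: explicitly learning region-text alignment for open-world semantic segmentation.

TCL reading notes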