Segment anything model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which utilizes a certain point to predict the mask for a single object of interest, and segment everything (SegEvery), which predicts the masks for all objects in the image. What makes SegAny slow for SAM is its heavyweight image encoder, which has been addressed by MobileSAM via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder because it needs to first generate numerous masks with redundant grid-search prompts and then perform filtering to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks with only valid prompts, which can be obtained through object discovery. Our proposed approach not only reduces the total time spent on the mask decoder by at least 16 times but also achieves superior performance. Specifically, our approach yields an average performance boost of 3.6% (42.5% vs. 38.9%) for zero-shot object proposal on the LVIS dataset with the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmenting things. This project, targeting faster SegEvery than the original SAM, is termed MobileSAMv2 to differentiate it from MobileSAM, which targets faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project: https://github.com/ChaoningZhang/MobileSAM.
The NLP field has been revolutionized by ChatGPT [36], which constitutes a milestone in the development of generative AI (AIGC, a.k.a. artificial intelligence generated content) [37]. GPT-series models [3, 23, 24] trained on web-scale text datasets play a major role in its development. Following the success of foundation models [2] in NLP, vision foundation models like CLIP [25] have been developed by co-learning a text encoder via contrastive learning [8, 33]. More recently, a vision foundation model termed SAM [14], short for segment anything model, was released to solve two practical image segmentation tasks: segment anything (SegAny) and segment everything (SegEvery). Both tasks perform class-agnostic mask segmentation, differing only in what to segment. SegAny utilizes a certain prompt (like a point or box) to segment a single thing of interest in the image. By contrast, SegEvery aims to segment all things in the image. SAM has been adopted in a wide range of applications [38] due to its impressive performance on these two tasks.
SAM works in sequence with two modules: a ViT-based image encoder and a prompt-guided mask decoder (see Figure 1). They are simply referred to as the image encoder and the mask decoder in the remainder of this work when this causes no confusion. The lightweight mask decoder adopts two-way attention to enable efficient interaction between the image embedding and the prompt tokens for generating fine-grained masks [14]. What makes SegAny slow is the image encoder, which is more than 100 times more heavyweight than the mask decoder. This issue has been addressed by MobileSAM by distilling a lightweight image encoder in a decoupled manner. To segment all things, SegEvery requires first repeatedly running the mask decoder to generate numerous proposal masks and then selecting the high-quality and non-overlapping ones. This shifts the computation bottleneck from image encoding to mask generation and filtering. In essence, SegEvery is not a promptable segmentation task, and thus the masks might be generated directly without using prompts [34]. Such a prompt-free approach has been attempted in [41], but it generates masks with less satisfactory boundaries (see the analysis in Sec. 6.1). The mask decoder with two-way attention solves this problem but at the cost of making SegEvery much slower [14]. To this end, we follow the practice of SegEvery in [14] to prompt the mask decoder so as to guarantee the quality of the generated masks, but we address its low-speed issue by reducing the number of prompts.
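To make the two-module pipeline concrete, below is a minimal sketch of SegAny with the official segment-anything package; the checkpoint path, image path, and point coordinates are placeholders rather than settings from this paper.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# SAM = heavyweight ViT-H image encoder + lightweight prompt-guided mask decoder.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the image encoder once: the SegAny bottleneck

# SegAny: one prompt segments one object of interest; only the mask decoder runs per prompt.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # a single foreground point (placeholder coordinates)
    point_labels=np.array([1]),
    multimask_output=True,                # three candidate masks to resolve point ambiguity
)
```

SegEvery in [14] simply repeats the decoding step over a dense grid of such points, which is where the cost discussed below comes from.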
Figure 1. SAM architecture and efficiency. The computation bottleneck for SegAny lies in its image encoder, while that for SegEvery mainly lies in its mask decoder when a high grid-search density is required (zero-shot object proposal in [14] adopts 64×64 points).
SegEvery in [14] prompts the mask decoder with a grid search of foreground points. When the grid search is sparse, many small things or meaningful object parts might be missed. Therefore, SegEvery in [14] adopts a high grid density, like 64×64 points for zero-shot object proposal, which tends to produce redundant prompts for large objects. In essence, it adopts a strategy of first generating many masks, most of which are redundant, and then filtering out the redundant ones. Intuitively, this process can be simplified by only generating valid masks, which saves time for mask generation and removes the need for mask filtering. Motivated by this intuition, we propose an efficient prompt sampling that seeks object-aware prompts. Fortunately, this is a well-solved issue in modern object detection. In this work, we adopt YOLOv8, a state-of-the-art architecture for efficient detection with bounding boxes. To avoid over-fitting to any specific dataset, the model should be trained on an open-world dataset, for which a subset of the SA-1B dataset is chosen. With the generated box, we can either use its center as an object-aware point prompt or directly adopt the box itself as the prompt. An issue with the point prompt is that it requires predicting three output masks per prompt to address the ambiguity issue. The bounding box is more informative with less ambiguity and thus is more suitable for efficient SegEvery. Overall, this project is designed to make SegEvery in [14] faster while achieving competitive performance. We term it MobileSAMv2 to differentiate it from MobileSAM [34], which makes SegAny faster. The contributions of this work are summarized as follows.
• We identify what makes SegEvery in SAM slow and propose object-aware box prompts to replace the default grid-search point prompts, which significantly increases its speed while achieving overall superior performance.
• We demonstrate that our proposed object-aware prompt sampling strategy is compatible with the distilled image encoders in MobileSAM, which further contributes to a unified framework for efficient SegAny and SegEvery.
Progress on SAM. Since its advent in April 2023, SAM has been extensively studied in numerous GitHub projects and research articles. Its SegAny performance has been studied in various challenging setups, including medical images [18, 40], camouflaged objects [28], and transparent objects [7]. Overall, SAM shows strong generalization performance that still leaves room for improvement as the setup gets more challenging. Its generalization in the adversarial setup has been studied in Attack-SAM [35], which shows that the output masks of SAM can be easily manipulated by maliciously generated perturbations. Follow-up works further study the performance of adversarial perturbations generated on SAM in terms of cross-model transferability [7] and cross-sample transferability [42]. A comprehensive robustness evaluation of SAM has been conducted in follow-up work [22], which shows that SAM is robust against style transfer, common corruptions, and local occlusion, but not against adversarial perturbation. The versatility of SAM has been demonstrated in another line of work. Even though SAM is shown to be compatible with text prompts in the original paper [14] as a proof of concept, this functionality is not included in its official code. The Grounded SAM [9] project combines Grounding DINO [17] with SAM for text-guided promptable segmentation. Specifically, Grounding DINO utilizes the text to generate a bounding box, which can then be used as a prompt for SAM to predict a mask. The Semantic Segment Anything project [4] introduces CLIP [25] to assign labels to the predicted masks of SAM. SAM has also been shown to be versatile for image editing [26], inpainting tasks [32], and object tracking in videos [31, 43]. Beyond 2D, SAM can also be used for 3D object reconstruction [11, 27], i.e., assisting 3D model generation from a single image. PersonalizeSAM [39] personalizes SAM with one shot for customized use. High-quality tokens have been introduced in [12] to improve the quality of the predicted masks. The readers are referred to [38] for a survey of SAM's recent progress.
Class-agnostic segmentation. Detection is a fundamental computer vision task that localizes the objects of interest in an image [16]. Detection roughly localizes an object with a box, while segmentation performs a more fine-grained localization by assigning a pixel-wise mask [20]. It is straightforward to deduce a box from a given mask, but not vice versa, which indicates that the segmentation task is more complex than detection. Beyond assigning masks, image segmentation (like semantic segmentation) often involves predicting the corresponding semantic labels from a predefined class set [5]. However, this is far from practical because there can be unlimited classes in the real world. To this end, a line of work has attempted to extend these tasks to the open world by not considering semantic labels. Class-agnostic object detection was first formally proposed in [10], with average recall established as the metric to evaluate its performance, and has since been used as a pretraining technique [1]. A multimodal transformer has been shown in [19] to achieve satisfactory performance on this task. Open-world instance segmentation has been extensively studied in [13, 29, 30] for realizing class-agnostic detection and segmentation. In contrast to these works, which treat the object as a whole, a follow-up work [21] has investigated open-world object part segmentation. More recently, SAM [14] has solved the SegEvery task that segments all things, including whole objects and their meaningful parts. It has been shown in multiple GitHub projects (CLIP-SAM, Segment-Anything-CLIP, segment-anything-with-clip) that class-agnostic segmentation masks obtained from SegEvery with SAM [14] can be combined with CLIP [25] to produce semantic-aware segmentation in the open world.
Task Definition. Conventional image segmentation predicts pixel-wise masks together with their corresponding class labels. However, the classes can be ambiguous across different datasets. For example, the CIFAR10 dataset has a single dog class, while ImageNet-1K has well over a hundred classes for various breeds of dogs. Another setup might divide dogs into puppies and adults instead of by breed. This makes open-world image segmentation intractable when semantics are considered. When decoupled from label prediction, open-world image segmentation becomes relatively easier but remains a challenging issue. Without semantic information, whether a region in the image is considered an object or a thing denoted by a mask can be subjective. This ill-posed nature is, at least partly, connected to the ambiguity of granularity [15]. For example, when the granularity is too large, it might only detect a large object while ignoring its meaningful parts. When the granularity is too small, every pixel can be independently segmented, which is trivial and meaningless. In other words, open-world image segmentation requires segmenting all things, including whole objects and their meaningful parts, i.e., everything. In essence, it is a class-agnostic segmentation task that performs zero-shot object proposal generation in the open world. This task is termed segment everything (SegEvery) in [14], and we follow [14] in adopting the same name to avoid confusion.
Prompt-aware Solution. SAM is a pioneering work that solves the task of promptable segmentation [14]. Specifically, it segments any object of interest with a certain prompt, which is named segment anything (SegAny) in [14]. Based on this, SAM provides a straightforward solution to the SegEvery task by prompting the SAM decoder with a search grid of foreground points. An underlying issue of this approach is that the performance is highly dependent on the grid density. Intuitively, a higher grid density tends to yield higher performance but at the cost of significantly increasing the computation overhead. Orthogonal to MobileSAM [34], which distills the heavyweight image encoder for faster SegAny, this project, named MobileSAMv2 for differentiation, aims to make SegEvery faster by proposing a new sampling strategy that reduces the number of sampled prompts. Our solution significantly improves efficiency while achieving overall superior performance. In the following section, we illustrate the motivation behind our solution and its detailed implementation.
The prompt-aware solution proposed in [14] has demonstrated impressive performance on the challenging SegEvery task. It adopts a strategy of first generating redundant masks and then filtering them to obtain the final valid masks. Intuitively, this process is unnecessarily cumbersome and can be simplified by prompting the mask decoder with only valid prompts, which saves time for mask generation and removes the need for filtering. The core of our method lies in replacing the default grid-search prompt sampling with object-aware prompt sampling. This strategy boils down to determining whether there is an object in a certain region of the image. The modern object detection task already solves this by localizing objects with bounding boxes. Most of the generated bounding boxes overlap with each other and thus require pre-filtering before being used as valid prompts. Without additional prior knowledge, we take the center of each remaining bounding box as a foreground point, under the moderate assumption that the box center lies on the object. Moreover, the mask decoder of SAM also accepts a box as the prompt, so we also experiment with directly using the remaining boxes as prompts. Overall, our proposed SegEvery framework consists of two stages: object-aware prompt sampling and prompt-guided mask decoding. The first stage samples the prompts by relying on a modern object detection network, and the second stage follows SAM [14] to perform prompt-guided mask decoding.
Object discovery has been widely used in some applications (like vision-language tasks) as a preprocessing technique to avoid exhaustive sliding-window search. Inspired by this practice, we propose to exploit object discovery for sampling prompts. In essence, object discovery localizes the objects with bounding boxes, which can be realized by modern object detection models with the classification head removed. The past decade has witnessed huge advances in object detection, and the YOLO family has become the de facto standard choice for its real-time performance. To prevent over-fitting to any specific domain, the chosen YOLOv8 model needs to be trained on an open-world dataset, for which a small subset of the SA-1B dataset [14, 34] is chosen. The model is trained with the supervision of both bounding boxes and masks and then finetuned with only the bounding box loss. Such a training approach also facilitates comparison with the prompt-free approach (see Sec. 6.1). Detection generates numerous overlapping boxes, which need to be filtered before being used as prompts. Following standard practice, we adopt NMS to filter the overlapping boxes. With the filtered bounding boxes, we can either use their centers as object-aware point prompts or directly adopt the boxes themselves as prompts. In practice, we choose the latter strategy for multiple reasons. Even though the center point is object-aware, it relies on the assumption that the object inside the bounding box covers its center point. This holds in most cases but not in all cases. Another issue with the point prompt is that it needs to predict three output masks to address the ambiguity issue, which requires additional mask filtering. By contrast, the box prompt is more informative and generates high-quality masks with less ambiguity, which removes the need to predict three masks and is thus more beneficial for efficient SegEvery.
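As an illustration of this first stage, the sketch below runs a YOLOv8 detector followed by NMS to obtain object-aware box prompts. The checkpoint name yolov8_sa1b.pt is a hypothetical filename standing in for a detector finetuned on an SA-1B subset, and the IoU threshold is an assumed value; only the cap of 320 prompts comes from this paper.

```python
import torch
from torchvision.ops import nms
from ultralytics import YOLO

# Stage 1: object discovery with a YOLOv8 detector (the classification output is ignored).
detector = YOLO("yolov8_sa1b.pt")   # placeholder checkpoint name
result = detector("example.jpg")[0]  # single-image inference
boxes = result.boxes.xyxy            # (N, 4) candidate boxes in xyxy format
scores = result.boxes.conf           # (N,) detection confidences

# Pre-filtering: NMS removes heavily overlapping boxes; keep at most 320 prompts (Sec. 6.2).
keep = nms(boxes, scores, iou_threshold=0.7)[:320]
box_prompts = boxes[keep]

# Box centers could serve as object-aware point prompts, but the boxes themselves are
# adopted as prompts because they are more informative and less ambiguous.
centers = (box_prompts[:, :2] + box_prompts[:, 2:]) / 2.0
```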
We follow SAM [14] to perform prompt-guided mask decoding in a batched manner. In contrast to the image encoder, where the batch dimension is the number of images, here the batch dimension is the number of prompts. It is worth noting that the prompt-guided mask decoder in SAM also accepts a box as input. Therefore, it is technically feasible to directly prompt the mask decoder with a set of boxes, which saves the step of deriving the center points. Even though this was not our original motivation, we find that this practice yields a non-trivial performance boost without any additional cost. In other words, it can be seen as a free trick for improving task performance. The prompt-aware solution in [14] requires mask filtering. Empirically, we find that this process can be very slow because the masks are high-dimensional. This is different from efficient box filtering, because a box has only four dimensions. This cumbersome mask filtering is optional in our proposed SegEvery framework because the mask decoder is prompted with only valid prompts. In other words, we keep all the generated masks since the prompts are sampled in an object-aware manner.
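Continuing the sketches above (predictor and image from the SegAny sketch, box_prompts from the sampling sketch), batched box-prompted decoding with SamPredictor.predict_torch might look as follows; treat it as an assumed usage of segment-anything's batched interface rather than the exact MobileSAMv2 implementation.

```python
# Batched, box-prompted mask decoding: the batch dimension is the number of prompts.
device = predictor.device
transformed = predictor.transform.apply_boxes_torch(box_prompts.to(device), image.shape[:2])

masks, iou_preds, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed,
    multimask_output=False,  # box prompts are unambiguous, so one mask per prompt suffices
)
# masks: (num_prompts, 1, H, W); no mask post-filtering is performed because every
# prompt was sampled in an object-aware manner.
```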
SegEvery has been framed in [14] as a zero-shot object proposal task with standard average recall (AR) as the performance metric. We follow the practice in [14] and adopt AR for masks at K proposals (mask AR@K), where K is the maximum allowable number of masks. By the definition of AR, AR@K gets higher when K is set to a larger value, which constitutes a less strict metric. Only AR@1000 is reported in [14], but we choose to report AR@K for K ranging from 10 to 1000. Without loss of generality, yet to save computation resources, we choose to report results on 100 images randomly sampled from the large-vocabulary instance segmentation (LVIS) dataset [6].
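For reference, the following is a simplified, single-image sketch of the mask AR@K definition (recall of ground-truth masks against the top-K proposals, averaged over IoU thresholds 0.5:0.95); the official evaluation follows the LVIS/COCO protocol and differs in implementation details.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

def mask_ar_at_k(proposals, gt_masks, k=1000, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Average recall of gt_masks against the top-k proposals over the IoU thresholds."""
    proposals = proposals[:k]  # K is the maximum allowable number of masks
    best_ious = [max((mask_iou(gt, p) for p in proposals), default=0.0) for gt in gt_masks]
    return float(np.mean([[iou >= t for iou in best_ious] for t in thresholds]))
```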
5.1. Main Results
What makes SegEvery much more computation-intensive than SegAny is the need to run the mask decoder with numerous sampled prompts [14]. Our proposed object-aware prompt sampling improves its efficiency by reducing the total number of prompts. In the following, we detail the difference in required computation time by roughly dividing the prompt-guided mask decoding pipeline into two stages: prompt encoding (including pre-sampling) and mask decoding (including post-filtering). Mask decoding is much heavier than simple prompt encoding. Besides the redundant sampled prompts, the default grid-search sampling in [14] also requires filtering the generated masks, which is also time-consuming. By contrast, our object-aware prompt sampling strategy does not need to filter the generated masks because the prompts are sampled in an object-aware manner, which saves the time of mask filtering. In the following, we compare the efficiency of the two sampling strategies in terms of the time spent on the prompt encoding and mask decoding stages.
Efficiency comparison. SegEvery with our proposed sampling strategy needs to run an object discovery algorithm to obtain object-aware prompts, which requires more time for prompt sampling than the default grid-search sampling in [14] but needs to encode far fewer prompts. For mask generation, the time spent on the mask decoder is roughly proportional to the number of sampled prompts. We find that the performance saturates when the number of prompts approaches 320, which we therefore set as the maximum number of detection boxes (see Sec. 6.2). Less computation is needed when object discovery generates fewer than 320 boxes, which occurs in many cases. Nonetheless, when performing the efficiency analysis, we compare our most computation-intensive scenario (max 320 prompts) with the grid-search strategy. The results in Table 2 show that our proposed prompt sampling strategy significantly improves the efficiency of SegEvery. Specifically, it reduces the time spent on the mask decoder by more than 10 times (from 1200 ms to 100 ms) while achieving overall superior performance.
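The timing breakdown reported here can be reproduced with a simple protocol like the one sketched below; sample_object_aware_prompts and decode_masks are hypothetical wrappers around the two stages, and CUDA synchronization is needed for the GPU intervals to be meaningful.

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Return fn's output and its wall-clock time in milliseconds."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, (time.perf_counter() - start) * 1000.0

# Prompt encoding (including pre-sampling) vs. mask decoding (including post-filtering).
box_prompts, t_prompt = timed(sample_object_aware_prompts, image)  # hypothetical helper
masks, t_decode = timed(decode_masks, predictor, box_prompts)      # hypothetical helper
print(f"prompt encoding: {t_prompt:.1f} ms, mask decoding: {t_decode:.1f} ms")
```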
Table 2. Efficiency comparison of the (prompt-guided) mask decoder between grid-search sampling and object-aware sampling. Note that the prompt encoding includes the prompt pre-sampling time, while the mask decoding includes the mask post-filtering time.
Performance comparison. We carefully follow the implementation practice recommended in [14] for zero-shot object proposal. By default, it is suggested to set the grid density to 64×64 and generate a total of 12288 (64×64×3) masks, out of which a maximum of 1000 masks are then selected given the mask AR@1000 metric. We have experimented with decreasing the grid density and/or setting the multi-mask option to false (single-mask mode). The results in Table 3 show that generating fewer masks by either of the above two practices leads to a performance drop, suggesting that the default grid-search sampling strategy relies heavily on generating redundant masks from which the final needed ones are selected. Moreover, we make several major observations by comparing SAM (the default grid-search prompt sampling) and MobileSAMv2 (our proposed object-aware prompt sampling). First, under the condition of prompting with the same type of prompt (points) and setting multi-mask to false, MobileSAMv2 (max 320 points) achieves performance comparable to SAM using 4096 points, suggesting that the object-aware property of our prompt sampling strategy significantly reduces redundancy. Boosted with the multi-mask option set to true, the default 64×64 grid density yields a higher performance (59.2%), which constitutes the best setup for the grid-search strategy. Similarly, we can also increase the performance of our object-aware point sampling by setting multi-mask to true. Note that the motivation for predicting three output masks of different granularities [14] is to address the ambiguity issue of a point prompt: a single point carries limited prompt information and thus causes ambiguity (see Figure 4 in [14] for more details). By contrast, a box prompt is much more informative and reduces the ambiguity to a very large extent. This is supported by our results in Table 3, where box prompts yield a significant performance boost in single-mask mode. Last, it is worth mentioning that, compared with the best result of the grid-search sampling strategy (with 64×64 points in multi-mask mode), our proposed sampling strategy (with max 320 box prompts) achieves comparable performance (59.3% vs. 59.2%). Limiting the maximum number of prompts to 256, our strategy still yields competitive performance (58.5%) compared with that of the grid-search strategy (34.6%) under the same condition. We also report AR@K for other K values in Table 4. When K is set to a relatively small value, our proposed object-aware sampling strategy with far fewer prompts leads to a performance boost by a large margin. Overall, our proposed approach achieves an average performance boost of 3.6% (42.5% vs. 38.9%).
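For comparison, the grid-search baseline corresponds to segment-anything's automatic mask generator, roughly configured as below (reusing sam and image from the first sketch); points_per_side is the only parameter varied in the grid-density ablation, and the remaining defaults are assumed.

```python
from segment_anything import SamAutomaticMaskGenerator

# Grid-search sampling: a 64x64 point grid gives 4096 prompts and, with three masks per
# prompt, 12288 candidate masks; the generator's post-filtering then removes low-quality
# and duplicate ones, which is the slow step discussed above.
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=64)
grid_masks = mask_generator.generate(image)
```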
Table 3. Zero-shot object proposal comparison between grid-search sampling and object-aware sampling (mask AR@1000 as the metric).
Table 4. Zero-shot object proposal comparison between grid-search sampling and object-aware sampling.
Table 5. Influence of the image encoders on MobileSAMv2 for zero-shot object proposal (mask@1000).
5.2. On the Compatibility with Distilled Image Encoders
In the above, we only consider the prompt-guided mask decoder; however, the whole pipeline also needs to run the image encoder once before running the mask decoder. As shown in Figure 1, the time spent on the image encoder is relatively small for SegEvery with grid-search point sampling. However, this is no longer the case when adopting our object-aware prompt sampling strategy, which reduces the time on the mask decoder to around 100 ms. Therefore, we consider reducing the time spent on the image encoder by replacing the original one (ViT-H) in SAM with a distilled one from the MobileSAM project [34]. The results with different distilled image encoders are shown in Table 5. We observe a moderate performance drop (from 59.2% to 56.3%) when EfficientViT-L2 is used. Given that EfficientViT-L2 runs in around 20 ms, which is significantly faster than ViT-H (more than 400 ms), it is worthwhile to replace the image encoder. Due to the simplicity and effectiveness of the decoupled knowledge distillation introduced in MobileSAM [34], a more powerful distilled image encoder is expected to emerge soon to further alleviate the performance drop. It is worth highlighting that MobileSAM and MobileSAMv2 solve two orthogonal issues: faster SegAny and faster SegEvery. Combining them constitutes a unified framework for efficient SegAny and SegEvery.
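The encoder swap itself is mechanical because the mobile_sam package mirrors the segment-anything interface; a sketch with the distilled ViT-Tiny encoder is shown below, and other distilled encoders such as the EfficientViT-L2 variant in Table 5 are expected to follow the same pattern with their own registry key and checkpoint.

```python
from mobile_sam import sam_model_registry as mobile_registry, SamPredictor

# Swap the heavyweight ViT-H encoder for the distilled ViT-Tiny encoder from MobileSAM.
mobile_sam = mobile_registry["vit_t"](checkpoint="mobile_sam.pt")
mobile_sam.eval()
predictor = SamPredictor(mobile_sam)
predictor.set_image(image)  # encoding drops from >400 ms (ViT-H) to tens of ms

# The object-aware box prompts and batched decoding from the earlier sketches are reused unchanged.
```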
6. Additional Comparison and Ablation Study
6.1. Comparison with the Prompt-free Approach
As discussed in [34], SegEvery is in essence not a promptable segmentation task and thus can be realized in a prompt-free manner. Such an approach has been attempted in [41] with YOLOv8-seg, which mainly augments YOLOv8-det with a protonet module to generate mask prototypes. An instance mask is obtained by convolving the mask prototypes with a mask coefficient vector whose length matches the prototype dimension (32 by default), which is mathematically a dot product. Here, we point out that the mask decoder of SAM [14] also generates a mask via a dot product between a mask coefficient (called a mask token in [14]) and a mask prototype (called the image embedding in [14]), which share the same dimension (32) so that the dot product can be computed. Intuitively, the quality of the generated mask relies on how well the mask coefficient and mask prototype interact with each other. The mask decoder in [14] adopts two-way attention to enable this interaction before performing the final dot product, which is the key foundation for guaranteeing high-quality masks in SAM. By contrast, there is no explicit interaction between the mask coefficients and mask prototypes in the prompt-free approach. With a single shared set of mask prototypes, it often predicts multiple objects at different regions of the image and thus relies on a bounding box to crop the mask. Cropping helps remove irrelevant masks outside the box but still fails to yield masks as high-quality as those of [14], at least partly due to the lack of interaction between mask coefficients and mask prototypes. Even though the prompt-free approach achieves the fastest speed, it results in a non-trivial performance drop (see Table 6). Its less satisfactory performance is mainly attributed to poor mask boundaries (see Figure 2). Compared with the prompt-free approach, the two prompt-aware approaches (SAM and MobileSAMv2) generate masks with much more fine-grained boundaries. SAM tends to over-segment things, while our MobileSAMv2 alleviates this tendency by utilizing its object-aware property.
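To make the contrast explicit, the prompt-free mask assembly is essentially the prototype-coefficient product sketched below (the shapes are illustrative defaults, not values taken from [41]); SAM additionally lets the mask tokens and image embedding interact through two-way attention before this final product.

```python
import torch

prototypes = torch.randn(32, 160, 160)  # (C, H', W') shared mask prototypes from the protonet
coefficients = torch.randn(5, 32)       # (N, C) one coefficient vector per detected instance

# Prompt-free assembly: a plain dot product, with no interaction between the two factors.
masks = torch.einsum("nc,chw->nhw", coefficients, prototypes).sigmoid() > 0.5
# Each mask is then cropped by its bounding box to suppress responses outside the object.
```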
Table 6. Zero-shot object proposal comparison between prompt-free and prompt-aware approaches (mask AR@1000).
Figure 2. Comparison between prompt-free and prompt-aware mask predictions. The prompt-free approach tends to predict masks with non-smooth boundaries compared with the prompt-aware approaches. Between the two prompt-aware approaches, SAM tends to over-segment things, while our MobileSAMv2 addresses this thanks to its object-aware property. Best viewed in color and zoomed in.
With mask AR@1000 as the metric, we find that our proposed sampling strategy often yields fewer than 1000 prompts, which motivates us to explore the influence of the maximum number of (box) prompts in our proposed prompt sampling strategy. The results in Table 7 show that increasing the number of box prompts is beneficial for a higher mask AR; however, the gain saturates once the number approaches 320. Therefore, by default, we set the maximum number of prompts in MobileSAMv2 to 320.
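A sketch of this prompt-budget ablation, reusing the names from the earlier sketches (predictor, image, box_prompts, mask_ar_at_k) and assuming the LVIS ground-truth masks are loaded as gt_masks:

```python
# Vary the cap on the number of box prompts and measure mask AR@1000 (cf. Table 7).
for max_prompts in (64, 128, 256, 320, 384):
    prompts = box_prompts[:max_prompts]  # NMS output is already sorted by confidence
    transformed = predictor.transform.apply_boxes_torch(
        prompts.to(predictor.device), image.shape[:2]
    )
    masks, _, _ = predictor.predict_torch(
        point_coords=None, point_labels=None, boxes=transformed, multimask_output=False
    )
    proposals = masks[:, 0].cpu().numpy()  # boolean masks, one per prompt
    print(max_prompts, mask_ar_at_k(proposals, gt_masks, k=1000))
```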
Table 7. Influence of the maximum number of prompts on MobileSAMv2 for zero-shot object proposal (mask@1000).
7. Conclusion and Future Work
Orthogonal to the MobileSAM project, which makes SegAny faster by distilling a lightweight image encoder, this project, termed MobileSAMv2, makes SegEvery faster by proposing a new prompt sampling strategy for the prompt-guided mask decoder. Replacing grid-search prompt sampling with our object-aware prompt sampling, we significantly improve the efficiency of SegEvery while achieving overall superior performance. We also demonstrate that our object-aware prompt sampling is compatible with the distilled image encoders in the MobileSAM project. Overall, our work constitutes a step towards a unified framework for efficient SegAny and SegEvery. Future work is needed to seek superior image encoder(s) and object discovery model(s).