SAM 3 introduces the Promptable Concept Segmentation (PCS) task and builds a unified detection-segmentation-tracking framework: (1) a presence head decouples recognition from localization, with a global presence token deciding whether the concept appears at all while each object query focuses on the conditional localization problem, which handles multi-instance detection in open-vocabulary settings; (2) a dual encoder-decoder architecture in which a detector and a tracker share the Perception Encoder (PE): the detector follows the DETR paradigm, with a fusion encoder cross-attending multimodal prompts (text, and the position/label/visual features of image exemplars) into the image features, while the tracker inherits the SAM 2 architecture and maintains object identities with a memory bank and masklets (spatio-temporal masks); (3) a detect-propagate-match pipeline that combines IoU matching with two temporal disambiguation strategies (a masklet detection score and periodic re-prompting) to address tracking drift and occlusion in crowded scenes; (4) four progressive training stages (PE pre-training, then detector pre-training, detector fine-tuning, and tracker training with a frozen backbone) combined with hard negatives, which markedly improves the model's ability to distinguish similar concepts; (5) a human-and-model-in-the-loop data engine in which AI verifiers (mask verification, MV, and exhaustivity verification, EV) automatically filter high-quality annotations and failure cases are actively mined for iterative improvement, yielding the SA-Co dataset with 4M unique concepts. The resulting model reaches 48.8 zero-shot mask AP on LVIS and more than doubles the performance of the best existing systems on the SA-Co benchmark.
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
The ability to find and segment anything in a visual scene is foundational for multimodal AI, powering applications in robotics, content creation, augmented reality, data annotation, and broader sciences. The SAM series (Kirillov et al., 2023; Ravi et al., 2024) introduced the promptable segmentation task for images and videos, focusing on Promptable Visual Segmentation (PVS) with points, boxes or masks to segment a single object per prompt. While these methods achieved a breakthrough, they did not address the general task of finding and segmenting all instances of a concept appearing anywhere in the input (e.g., all "cats" in a video).
Figure 1 SAM 3 improves over SAM 2 on promptable visual segmentation with clicks (left) and introduces the new promptable concept segmentation capability (right). Users can segment all instances of a visual concept specified by a short noun phrase, image exemplars (positive or negative), or a combination of both.
Figure 2 Examples of SAM 3 improving segmentation of open-vocabulary concepts compared to OWLv2 (Minderer et al., 2024), on the SA-Co benchmark. See §F.6.1 for additional SAM 3 outputs.
To fill this gap, we present SAM 3, a model that achieves a step change in promptable segmentation in images and videos, improving PVS relative to SAM 2 and setting a new standard for Promptable Concept Segmentation (PCS). We formalize the PCS task (§2) as taking text and/or image exemplars as input, and predicting instance and semantic masks for every single object matching the concept, while preserving object identities across video frames (see Fig. 1). To focus on recognizing atomic visual concepts, we constrain text to simple noun phrases (NPs) such as "red apple" or "striped cat". While SAM 3 is not designed for long referring expressions or queries requiring reasoning, we show that it can be straightforwardly combined with a Multimodal Large Language Model (MLLM) to handle more complex language prompts. Consistent with previous SAM versions, SAM 3 is fully interactive, allowing users to resolve ambiguities by adding refinement prompts to guide the model towards their intended output.
Our model (§3) consists of a detector and a tracker that share a vision encoder (Bolya et al., 2025). The detector is a DETR-based (Carion et al., 2020) model conditioned on text, geometry, and image exemplars. To address the challenge of open-vocabulary concept detection, we introduce a separate presence head to decouple recognition and localization, which is especially effective when training with challenging negative phrases. The tracker inherits the SAM 2 transformer encoder-decoder architecture, supporting video segmentation and interactive refinement. The decoupled design for detection and tracking avoids task conflict, as the detector needs to be identity agnostic, while the tracker's main objective is to separate identities in the video.
To unlock major performance gains, we build a human- and model-in-the-loop data engine (§4) that annotates a large and diverse training dataset. We innovate upon prior data engines in three key ways: (i) media curation: we curate more diverse media domains than past approaches that rely on homogeneous web sources, (ii) label curation: we significantly increase label diversity and difficulty by leveraging an ontology and multimodal LLMs as "AI annotators" to generate noun phrases and hard negatives, (iii) label verification: we double annotation throughput by fine-tuning MLLMs to be effective "AI verifiers" that achieve near-human accuracy.
Starting from noisy media-phrase-mask pseudo-labels, our data engine checks mask quality and exhaustivity using both human and AI verifiers, filtering out correctly labeled examples and identifying challenging error cases. Human annotators then focus on fixing these errors by manually correcting masks. This enables us to annotate high-quality training data with 4M unique phrases and 52M masks, and a synthetic dataset with 38M phrases and 1.4B masks. We additionally create the Segment Anything with Concepts (SA-Co) benchmark for PCS (§5) containing 207K unique concepts with exhaustive masks in 120K images and 1.7K videos, $>50\times$ more concepts than existing benchmarks.
Our experiments (§6) show that SAM 3 sets a new state-of-the-art in promptable segmentation, e.g., reaching a zero-shot mask AP of 48.8 on LVIS vs. the current best of 38.5, surpassing baselines on our new SA-Co benchmark by at least $2\times$ (see examples in Fig. 2), and improving upon SAM 2 on visual prompts. Ablations (§A) verify that the choice of backbone, novel presence head, and adding hard negatives all boost results, and establish scaling laws on the PCS task for both our high-quality and synthetic datasets. We open-source the SA-Co benchmark and release the SAM 3 checkpoints and inference code. On an H200 GPU, SAM 3 runs in 30 ms for a single image with 100+ detected objects. In video, the inference latency scales with the number of objects, sustaining near real-time performance for $\sim 5$ concurrent objects. We review related work in §7; next, we dive into the task.
Figure 3 Illustration of supported initial and optional interactive refinement prompts in the PCS task.
2 Promptable Concept Segmentation (PCS)
We define the Promptable Concept Segmentation task as follows: given an image or short video ($\leq$ 30 secs), detect, segment and track all instances of a visual concept specified by a short text phrase, image exemplars, or a combination of both. We restrict concepts to those defined by simple noun phrases (NPs) consisting of a noun and optional modifiers. Noun-phrase prompts (when provided) are global to all frames of the image/video, while image exemplars can be provided on individual frames as positive or negative bounding boxes to iteratively refine the target masks (see Fig. 3).
All prompts must be consistent in their category definition, or the model's behavior is undefined; e.g., "fish" cannot be refined with subsequent exemplar prompts of just the tail; instead the text prompt should be updated. Exemplar prompts are particularly useful when the model initially misses some instances, or when the concept is rare.
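A minimal sketch of how a PCS prompt could be represented, following the definition above: a noun phrase that is global to the image/video plus per-frame positive or negative exemplar boxes. The class and field names are hypothetical, not SAM 3's actual API.

```python
# Hypothetical container for a PCS concept prompt; names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ExemplarBox:
    frame_idx: int                                   # frame the exemplar is drawn on
    box_xyxy: Tuple[float, float, float, float]      # bounding box in pixel coordinates
    positive: bool                                   # True = positive exemplar, False = negative

@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None                # global to all frames, e.g. "yellow school bus"
    exemplars: List[ExemplarBox] = field(default_factory=list)  # optional per-frame refinements

# Example: start from text, then add a negative exemplar box on frame 0
# to suppress instances the user does not want.
prompt = ConceptPrompt(noun_phrase="yellow school bus")
prompt.exemplars.append(ExemplarBox(frame_idx=0, box_xyxy=(10, 20, 120, 180), positive=False))
```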
Our vocabulary includes any simple noun phrase groundable in a visual scene, which makes the task intrinsically ambiguous. There can be multiple interpretations of phrases arising from polysemy ("mouse" device vs. animal), subjective descriptors ("cozy", "large"), vague or context-dependent phrases that may not even be groundable ("brand identity"), boundary ambiguity (whether "mirror" includes the frame), and factors such as occlusion and blur that obscure the extent of the object. While similar issues appear in large closed-vocabulary corpora (e.g., LVIS (Gupta et al., 2019)), they are alleviated by carefully curating the vocabulary and setting a clear definition of all the classes of interest. We address the ambiguity problem by collecting test annotations from three experts, adapting the evaluation protocol to allow multiple valid interpretations (§E.3), designing the data pipeline/guidelines to minimize ambiguity in annotation, and adding an ambiguity module to the model (§C.2).
SAM 3 is a generalization of SAM 2, supporting the new PCS task (§2) as well as the PVS task. It takes concept prompts (simple noun phrases, image exemplars) or visual prompts (points, boxes, masks) to define the objects to be (individually) segmented spatio-temporally. Image exemplars and visual prompts can be iteratively added on individual frames to refine the target masks---false positive and false negative objects can be removed or added respectively using image exemplars and an individual mask(let) can be refined using PVS in the style of SAM 2. Our architecture is broadly based on the SAM and (M)DETR (Carion et al., 2020; Kamath et al., 2021) series. Fig. 4 shows the SAM 3 architecture, consisting of a dual encoder-decoder transformer---a detector for image-level capabilities---which is used in combination with a tracker and memory for video. The detector and tracker ingest vision-language inputs from an aligned Perception Encoder (PE) backbone (Bolya et al., 2025). We present an overview below, see §C for details.
Detector Architecture. The architecture of the detector follows the general DETR paradigm. The image and text prompt are first encoded by PE and image exemplars, if present, are encoded by an exemplar encoder. We refer to the image exemplar tokens and text tokens jointly as "prompt tokens". The fusion encoder then accepts the unconditioned embeddings from the image encoder and conditions them by cross-attending to the prompt tokens. The fusion is followed by a DETR-like decoder, where learned object queries cross-attend to the conditioned image embeddings from the fusion encoder.
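The conditioning step can be pictured with a minimal sketch: image tokens cross-attend to the prompt tokens (text plus exemplar tokens) inside a fusion layer. The dimensions, layer structure, and use of `nn.MultiheadAttention` are assumptions for illustration, not the actual SAM 3 implementation.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One illustrative fusion-encoder layer: image tokens are conditioned on prompt tokens."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: [B, N_img, dim]; prompt_tokens: [B, N_prompt, dim] (text + exemplar tokens)
        x = img_tokens
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        # condition the image embeddings by cross-attending to the prompt tokens
        x = x + self.cross_attn(self.n2(x), prompt_tokens, prompt_tokens)[0]
        x = x + self.mlp(self.n3(x))
        return x  # conditioned image embeddings passed to the DETR-like decoder
```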
Figure 4 SAM 3 architecture overview. See Fig. 10 for a more detailed diagram.
Each decoder layer predicts a classification logit for each object query (in our case, a binary label of whether the object corresponds to the prompt), and a delta from the bounding box predicted by the previous level, following Zhu et al. (2020). We use box-region-positional bias (Lin et al., 2023) to help focalize the attention on each object, but unlike recent DETR models, we stick to vanilla attention. During training, we adopt dual supervision from DAC-DETR (Hu et al., 2023), and the Align loss (Cai et al., 2024). The mask head is adapted from MaskFormer (Cheng et al., 2021). In addition, we also have a semantic segmentation head, which predicts a binary label for every pixel in the image, indicating whether or not it corresponds to the prompt. See §C for details.
Presence Token. It can be difficult for each of the proposal queries to both recognize (what) and localize (where) an object in the image/frame. For the recognition component, contextual cues from the entire image are important. However, forcing proposal queries to understand the global context can be counterproductive, as it conflicts with the inherently local nature of the localization objective. We decouple the recognition and localization steps by introducing a learned global presence token. This token is solely responsible for predicting whether the target concept in the form of a noun phrase (NP) is present in the image/frame, i.e. $p(\mathrm{NP}\ \text{is present in input})$. Each proposal query $q_i$ only needs to solve the localization problem $p(q_i\ \text{is a match} \mid \mathrm{NP}\ \text{is present in input})$. The final score for each proposal query is the product of its own score and the presence score.
Analysis: The presence-token mechanism resolves a tension in object detection: recognition needs global context to understand an object's semantics, while localization needs local precision to pin down its spatial extent. Traditional approaches make each object query carry both tasks, so the optimization objectives conflict. SAM 3 separates the two problems: a global presence token is dedicated to concept-level existence, aggregating information from the whole image to decide whether the target concept appears at all, while each proposal query focuses on conditional localization, i.e., finding a specific instance given that the concept is present. The design follows the logic of conditional probability: $p(\mathrm{NP}\ \text{is present in input})$ is the prior that the concept exists, and $p(q_i\ \text{is a match} \mid \mathrm{NP}\ \text{is present in input})$ is the localization probability conditioned on that. Taking the product as the final score means a query only scores high when the concept is truly present and the localization is accurate. The decoupling improves not only performance but also interpretability, since recognition and localization accuracy can be analyzed separately.
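A minimal sketch of the score combination just described: each query's final score is its conditional match probability multiplied by the global presence probability. The tensor shapes and the 0.5 cutoff (taken from the evaluation protocol in §5) are illustrative.

```python
import torch

def combine_scores(query_logits: torch.Tensor, presence_logit: torch.Tensor) -> torch.Tensor:
    """query_logits: [N_queries] logits for p(q_i is a match | NP is present);
    presence_logit: scalar logit for p(NP is present in input)."""
    p_match_given_present = query_logits.sigmoid()
    p_present = presence_logit.sigmoid()
    return p_match_given_present * p_present   # final per-query detection score

scores = combine_scores(torch.randn(300), torch.tensor(1.2))
keep = scores > 0.5   # downstream evaluation thresholds predictions at confidence 0.5
```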
Image Exemplars and Interactivity. SAM 3 supports image exemplars, given as a pair---a bounding box and an associated binary label (positive or negative)---which can be used in isolation or to supplement the text prompt. The model then detects all the instances that match the prompt. For example, given a positive bounding box on a dog, the model will detect all dogs in the image. This is different from the PVS task in SAM 1 and 2, where a visual prompt yields only a single object instance. Each image exemplar is encoded separately by the exemplar encoder using an embedding for the position, an embedding for the label, and ROI-pooled visual features, then concatenated and processed by a small transformer. The resulting prompt is concatenated to the text prompt to comprise the prompt tokens. Image exemplars can be interactively provided based on errors in current detections to refine the output.
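A minimal sketch of the exemplar encoding described above: each box contributes a position embedding, a positive/negative label embedding, and ROI-pooled visual features, which are concatenated and mixed by a small transformer. Module choices (e.g., `torchvision.ops.roi_align`) and dimensions are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ExemplarEncoder(nn.Module):
    """Illustrative encoder turning exemplar boxes into prompt tokens."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.pos_mlp = nn.Linear(4, dim)         # box coordinates -> position embedding
        self.label_emb = nn.Embedding(2, dim)    # 0 = negative exemplar, 1 = positive exemplar
        self.visual_proj = nn.Linear(dim, dim)   # projection of ROI-pooled visual features
        layer = nn.TransformerEncoderLayer(d_model=3 * dim, nhead=8, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(3 * dim, dim)

    def forward(self, feat_map: torch.Tensor, boxes_xyxy: torch.Tensor, labels: torch.Tensor):
        # feat_map: [1, dim, H, W]; boxes_xyxy: [K, 4] (float); labels: [K] in {0, 1} (long)
        rois = roi_align(feat_map, [boxes_xyxy], output_size=1).flatten(1)      # [K, dim]
        tok = torch.cat([self.pos_mlp(boxes_xyxy), self.label_emb(labels),
                         self.visual_proj(rois)], dim=-1)                       # [K, 3*dim]
        tok = self.mixer(tok.unsqueeze(0))                                      # [1, K, 3*dim]
        return self.out(tok)  # [1, K, dim] exemplar tokens, concatenated with the text tokens
```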
Tracker and Video Architecture. Given a video and a prompt $P$, we use the detector and a tracker (see Fig. 4) to detect and track objects corresponding to the prompt throughout the video. On each frame, the detector finds new objects $\mathcal{O}_t$ and the tracker propagates masklets $\mathcal{M}_{t-1}$ (spatio-temporal masks) from the previous frame at time $t-1$ to their new locations $\hat{\mathcal{M}}_t$ on the current frame at time $t$. We use a matching function to associate the propagated masklets $\hat{\mathcal{M}}_t$ with new object masks $\mathcal{O}_t$ emerging in the current frame,

$$\hat{\mathcal{M}}_t = \mathrm{propagate}\left(\mathcal{M}_{t-1}\right), \quad \mathcal{O}_t = \mathrm{detect}\left(I_t, P\right), \quad \mathcal{M}_t = \mathrm{match\text{-}and\text{-}update}\left(\hat{\mathcal{M}}_t, \mathcal{O}_t\right).$$
Analysis: The video architecture pairs a detector and a tracker so that temporal continuity and per-frame variability are handled by separate modules: the detector finds objects matching the prompt independently on each frame, while the tracker maintains consistent identities over time. Masklets extend 2D spatial masks to spatio-temporal masks that encode both an object's spatial boundary and its temporal evolution. The propagation step $\hat{\mathcal{M}}_t = \mathrm{propagate}(\mathcal{M}_{t-1})$ projects the previous frame's masks to their expected positions on the current frame (conceptually related to techniques such as optical flow, feature matching, or motion prediction). The detection step $\mathcal{O}_t = \mathrm{detect}(I_t, P)$ runs concept detection on the current frame independently of history, which lets the system discover newly appearing instances and avoids losing targets to tracking drift. The match-and-update step $\mathcal{M}_t = \mathrm{match\text{-}and\text{-}update}(\hat{\mathcal{M}}_t, \mathcal{O}_t)$ solves the data-association problem: deciding which propagated masklets correspond to which new detections, and handling appearance, disappearance, occlusion, and re-appearance. This dual-path design combines the temporal continuity of tracking with the robustness of per-frame detection, preserving identities while adapting to appearance and scene changes.
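A minimal sketch of the per-frame loop implied by the equation above. `detect`, `propagate`, and `match_and_update` are placeholders for the detector, the SAM 2 style tracker step, and the IoU-based association described in the following paragraphs.

```python
def run_video_pcs(frames, prompt, detect, propagate, match_and_update):
    """Illustrative detect-propagate-match loop for video PCS."""
    masklets = {}                                         # object id -> current mask
    for t, frame in enumerate(frames):
        detections = detect(frame, prompt)                # O_t: per-frame concept detections
        if t == 0:
            # a masklet is initialized for every object detected on the first frame
            masklets = {i: det for i, det in enumerate(detections)}
        else:
            propagated = propagate(masklets, frame)               # M_hat_t: tracker predictions
            masklets = match_and_update(propagated, detections)   # M_t: associated and updated
        yield t, masklets
```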
Tracking an Object with SAM 2 Style Propagation. A masklet is initialized for every object detected on the first frame. Then, on each subsequent frame, the tracker module predicts the new masklet locations $\hat{\mathcal{M}}_t$ of those already-tracked objects based on their previous locations $\mathcal{M}_{t-1}$ through a single-frame propagation step similar to the video object segmentation task in SAM 2. The tracker shares the same image/frame encoder (PE backbone) as the detector. After training the detector, we freeze PE and train the tracker as in SAM 2, including a prompt encoder, mask decoder, memory encoder, and a memory bank that encodes the object's appearance using features from the past frames and conditioning frames (frames where the object is first detected or user-prompted). The memory encoder is a transformer with self-attention across visual features on the current frame and cross-attention from the visual features to the spatial memory features in the memory bank. We describe details of our video approach in §C.3.
Analysis: The tracking mechanism inherits SAM 2's core design but extends it to the multi-object setting. The masklet is the basic unit of spatio-temporal segmentation, carrying both spatial location and a temporal-continuity constraint; initialization on the first frame establishes a reference representation for each target. Single-frame propagation estimates the current-frame mask from the previous one, $\hat{\mathcal{M}}_t = f(\mathcal{M}_{t-1})$, handling gradual motion and deformation without the cost of global optimization. Sharing the encoder between detector and tracker keeps their features consistent, which stabilizes tracking. Freezing the Perception Encoder follows a transfer-learning strategy: the detector is trained first to obtain strong visual representations, which are then held fixed while the tracker is trained, so tracker training cannot degrade them. The memory bank is the core of the tracker: it stores each object's appearance from past frames and from conditioning frames (frames where the object is first detected or explicitly prompted by the user, which provide the most reliable references). The memory encoder's two attention paths serve different purposes: self-attention over the current frame builds a complete representation of the object in that frame, while cross-attention to the stored memory features matches the current frame against the object's history, enabling feature propagation along the time axis.
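A minimal sketch of the per-object memory bank described above: encoded features from conditioning frames (first detection or user prompts) are kept for the object's lifetime, while recent frames are retained only when the object is confidently present. The capacity and confidence gate are illustrative assumptions.

```python
from collections import deque

class MemoryBank:
    """Illustrative per-object memory for SAM 2 style propagation."""
    def __init__(self, max_recent: int = 6, conf_thresh: float = 0.5):
        self.conditioning = []                   # kept for the object's lifetime
        self.recent = deque(maxlen=max_recent)   # rolling window of recent-frame memories
        self.conf_thresh = conf_thresh

    def add(self, frame_idx, memory_features, confidence, is_conditioning=False):
        if is_conditioning:
            self.conditioning.append((frame_idx, memory_features))
        elif confidence >= self.conf_thresh:
            # at inference, only frames where the object is confidently present are retained
            self.recent.append((frame_idx, memory_features))

    def context(self):
        # memories the tracker cross-attends to when propagating the masklet
        return self.conditioning + list(self.recent)
```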
Figure 5 Overview of the final SAM 3 data engine. See §E.1 for details of collected data.
During inference, we only retain frames where the object is confidently present in the memory bank. The mask decoder is a two-way transformer between the encoder hidden states and the output tokens. To handle ambiguity, we predict three output masks for every tracked object on each frame along with their confidence, and select the most confident output as the predicted mask on the current frame.
Matching and Updating Based on Detections. After obtaining the tracked masks $\hat{\mathcal{M}}_t$, we match them with the current frame detections $\mathcal{O}_t$ through a simple IoU based matching function (§C.3) and add them to $\mathcal{M}_t$ on the current frame. We further spawn new masklets for all newly detected objects that are not matched. The merging might suffer from ambiguities, especially in crowded scenes. We address this with two temporal disambiguation strategies outlined next.
Analysis: Matching and updating is the data-association module of the video tracker, responsible for maintaining consistent object correspondences over time. IoU measures the overlap between two regions as the ratio of their intersection to their union. In this context, $\hat{\mathcal{M}}_t$ are the masks propagated from the previous frame and $\mathcal{O}_t$ are the objects detected independently on the current frame. Matching is essentially a bipartite assignment: each propagated masklet is associated with its best-fitting detection, or left unmatched. Matched pairs are merged into $\mathcal{M}_t$, updating each object's spatial location while preserving its temporal identity, and unmatched detections spawn new masklets so that newly appearing objects are not missed by relying on propagation alone. Crowded scenes cause ambiguity because overlapping and occluded objects lead to one-to-many or many-to-one IoU conflicts; the temporal disambiguation strategies exploit continuity over multiple time steps to resolve these spatial ambiguities more reliably.
First, we use temporal information in the form of a masklet detection score (§C.3) to measure how consistently a masklet is matched to a detection within a temporal window (based on the number of past frames where it was matched to a detection). If a masklet's detection score falls below a threshold, we suppress it. Second, we use the detector outputs to resolve specific failure modes of the tracker due to occlusions or distractors. We periodically re-prompt the tracker with high-confidence detection masks $\mathcal{O}_t$, replacing the tracker's own predictions $\hat{\mathcal{M}}_t$. This ensures that the memory bank has recent and reliable references (other than the tracker's own predictions).
Analysis: The first strategy is a statistical-consistency check: the masklet detection score measures how often a masklet has been matched to a detection over a recent temporal window. The underlying assumption is that a real object trajectory produces stable detection responses over time, whereas spurious trajectories caused by noise, false detections, or tracking drift match inconsistently; suppressing masklets whose score drops below a threshold acts as trajectory quality control. The second strategy targets an inherent limitation of propagation-only tracking: occlusion, fast motion, and appearance change cause errors to accumulate. Periodic re-prompting breaks this accumulation by injecting independent detection evidence; when a detection is sufficiently confident, the tracker's prediction is replaced ($\hat{\mathcal{M}}_t \leftarrow \mathcal{O}_t$), effectively trusting the detector over the tracker at that frame. Updating the memory bank with these high-quality detections keeps the object's appearance model accurate over long horizons and prevents the feature drift that would result from conditioning only on the tracker's own, possibly biased, predictions.
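A minimal sketch elaborating the match-and-update step with the two strategies just described: greedy IoU matching, suppression via a masklet detection score over a temporal window, and periodic re-prompting from high-confidence detections. The thresholds, window length, and greedy (rather than, e.g., Hungarian) matching are assumptions; the actual matching function is specified in the paper's §C.3.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union > 0 else 0.0

def match_and_update(propagated, detections, history, t,
                     iou_thresh=0.5, score_thresh=0.3, window=8, reprompt_every=4):
    """propagated: {obj_id: mask}; detections: [{"mask": ..., "score": ...}]; history: {obj_id: [0/1, ...]}."""
    updated, used = {}, set()
    for obj_id, pmask in propagated.items():
        ious = [mask_iou(pmask, d["mask"]) for d in detections]
        j = int(np.argmax(ious)) if ious else -1
        matched = j >= 0 and ious[j] >= iou_thresh and j not in used
        if matched:
            used.add(j)
        history.setdefault(obj_id, []).append(1 if matched else 0)
        # strategy 1: suppress masklets that are rarely matched within the temporal window
        if np.mean(history[obj_id][-window:]) < score_thresh:
            continue
        if matched and t % reprompt_every == 0 and detections[j]["score"] > 0.8:
            # strategy 2: periodically replace the tracker's mask with a high-confidence
            # detection so the memory bank gets a recent, reliable reference
            updated[obj_id] = detections[j]["mask"]
        else:
            updated[obj_id] = pmask
    # spawn new masklets for detections that were not matched to any masklet
    next_id = max(list(propagated) + [-1]) + 1
    for j, d in enumerate(detections):
        if j not in used:
            updated[next_id] = d["mask"]
            next_id += 1
    return updated
```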
Instance Refinement with Visual Prompts. After obtaining the initial set of masks (or masklets), SAM 3 allows refining individual mask(let)s using positive and negative clicks. Specifically, given the user clicks, we apply the prompt encoder to encode them, and feed the encoded prompt into the mask decoder to predict an adjusted mask. In videos, the mask is then propagated across the entire video to obtain a refined masklet.
Training Stages. We train SAM 3 in four stages that progressively add data and capabilities: 1) Perception Encoder (PE) pre-training, 2) detector pre-training, 3) detector fine-tuning, and 4) tracker training with a frozen backbone. See §C.4.1 for details.
Achieving a step change in PCS with SAM 3 requires training on a large, diverse set of concepts and visual domains, beyond existing datasets (see Fig. 12). We build an efficient data engine that iteratively generates annotated data via a feedback loop with SAM 3, human annotators, and AI annotators, actively mining media-phrase pairs on which the current version of SAM 3 fails to produce high-quality training data to further improve the model. By delegating certain tasks to AI annotators---models that match or surpass human accuracy---we more than double the throughput compared to a human-only annotation pipeline. We develop the data engine in four phases, with each phase increasing the use of AI models to steer human effort to the most challenging failure cases, alongside expanding visual domain coverage. Phases 1-3 focus only on images, with Phase 4 expanding to videos. We describe the key steps here; details and metrics are in §D.
Figure 6 Example video (top) and images (bottom) from SA-Co with annotated phrases and instance masks/IDs.
Data Engine Components (Fig. 5). Media inputs (image or video) are mined from a large pool with the help of a curated ontology. An AI model proposes noun phrases (NPs) describing visual concepts, followed by another model (e.g., SAM 3) that generates candidate instance masks for each proposed NP. The proposed masks are verified by a two-step process: first, in Mask Verification (MV), annotators accept or reject masks based on their quality and relevance to the NP. Second, in Exhaustivity Verification (EV), annotators check if all instances of the NP have been masked in the input. Any media-NP pairs that did not pass the exhaustivity check are sent to a manual correction stage, where humans add, remove or edit masks (using SAM 1 in a browser-based tool), or use "group" masks for small, hard-to-separate objects. Annotators may reject ungroundable or ambiguous phrases.
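A minimal sketch of the verification routing described above: proposed masks pass through mask verification (MV), then exhaustivity verification (EV); anything failing EV goes to manual correction. The verifier interfaces are hypothetical stand-ins for the human or AI annotators in the pipeline.

```python
def annotate(image, noun_phrase, propose_masks, mv_verifier, ev_verifier, manual_fix):
    """Illustrative MV -> EV -> manual-correction flow for one media-NP pair."""
    masks = propose_masks(image, noun_phrase)
    # MV: keep only masks judged good and relevant to the noun phrase
    masks = [m for m in masks if mv_verifier(image, noun_phrase, m)]
    # EV: check that every instance of the noun phrase is covered
    if ev_verifier(image, noun_phrase, masks):
        return masks, "auto_accepted"
    # otherwise humans add, remove, or edit masks (or reject an ungroundable phrase)
    return manual_fix(image, noun_phrase, masks), "human_corrected"
```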
Phase 1: Human Verification. We first randomly sample images and generate NP proposals with a simple captioner and parser. The initial mask proposal model is SAM 2 prompted with the output of an off-the-shelf open-vocabulary detector, and the initial verifiers are human. In this phase, we collected 4.3M image-NP pairs as the initial SA-Co/HQ dataset. We train SAM 3 on this data and use it as the mask proposal model for the next phase.
Phase 2: Human + AI Verification. In this next phase, we use human accept/reject labels from the MV and EV tasks collected in Phase 1 to fine-tune Llama 3.2 (Dubey et al., 2024) to create AI verifiers that automatically perform the MV and EV tasks. These models receive image-phrase-mask triplets and output multiple-choice ratings of mask quality or exhaustivity. This new auto-verification process allows our human effort to be focused on the most challenging cases. We continue to re-train SAM 3 on newly collected data and update it 6 times. As SAM 3 and AI verifiers improve, a higher proportion of labels are auto-generated, further accelerating data collection. The introduction of AI verifiers for MV and EV roughly doubles the data engine's throughput vs. human annotators. We refer to §A.4 for detailed analysis of how AI verifiers improve the data engine's throughput. We further upgrade the NP proposal step to a Llama-based pipeline that also proposes hard negative NPs adversarial to SAM 3. Phase 2 adds 122M image-NP pairs to SA-Co/HQ.
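One way to picture how AI verifier ratings steer human effort, as a hedged sketch: confident ratings are applied automatically, while uncertain cases are routed to human verifiers. The rating labels and confidence cutoff are illustrative assumptions, not the paper's actual protocol.

```python
def route(ai_rating: str, ai_confidence: float, conf_cutoff: float = 0.9) -> str:
    """Illustrative routing of one MV/EV judgment from an AI verifier."""
    if ai_confidence < conf_cutoff:
        return "human_review"      # steer human effort to the most challenging cases
    if ai_rating in {"good_mask", "exhaustive"}:
        return "auto_accept"
    return "auto_reject"           # e.g. a bad mask, or instances are missing
```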
Phase 3: Scaling and Domain Expansion. In the third phase, we use AI models to mine increasingly challenging cases and broaden domain coverage in SA-Co/HQ to 15 datasets (Fig. 15). A domain is a unique distribution of text and visual data. In new domains, the MV AI verifier performs well zero-shot, but the EV AI verifier needs to be improved with modest domain-specific human supervision. We also expand concept coverage to long-tail, fine-grained concepts by extracting NPs from the image alt-text where available and by mining concepts from a 22.4M node SA-Co ontology (§D.2) based on Wikidata (17 top-level categories, 72 sub-categories). We iterate SAM 3 training 7 times and AI verifiers 3 times, and add 19.5M image-NP pairs to SA-Co/HQ.
Phase 4: Video Annotation. This phase extends the data engine to video. We use a mature image SAM 3 to collect targeted quality annotations that capture video-specific challenges. The data mining pipeline applies scene/motion filters, content balancing, ranking, and targeted searches. Video frames are sampled (randomly or by object density) and sent to the image annotation flow (from phase 3). Masklets (spatio-temporal masks) are produced with SAM 3 (now extended to video) and post-processed via deduplication and removal of trivial masks. Because video annotation is more difficult, we concentrate humans on likely failures by favoring clips with many crowded objects and tracking failures. The collected video data SA-Co/VIDEO consists of 52.5K videos and 467K masklets. See §D.6 for details.
Training Data. We collect three image datasets for the PCS task: (i) SA-Co/HQ, the high-quality image data collected from the data engine in phases 1-4, (ii) SA-Co/SYN, a synthetic dataset of images labeled by a mature data engine (phase 3) without human involvement, and (iii) SA-Co/EXT, 15 external datasets that have instance mask annotations, enriched with hard negatives using our ontology pipeline. Notably, in the SA-Co/HQ dataset we annotate 5.2M images and 4M unique NPs, making it the largest high-quality open-vocab segmentation dataset. We also annotate a video dataset, SA-Co/VIDEO, containing 52.5K videos and 24.8K unique NPs, forming 134K video-NP pairs. The videos on average have 84.1 frames at 6 fps. See §E.1 for details including full statistics, comparison with existing datasets and the distribution of concepts.
SA-Co Benchmark. The SA-Co evaluation benchmark has 207K unique phrases, 121K images and videos, and over 3M media-phrase pairs with hard negative labels to test open-vocabulary recognition. It has 4 splits: SA-Co/Gold has seven domains and each image-NP pair is annotated by three different annotators (used to measure human performance); SA-Co/Silver has ten domains and only one human annotation per image-NP pair; SA-Co/Bronze and SA-Co/Bio are nine existing datasets either with existing mask annotations or masks generated by using boxes as prompts to SAM 2. The SA-Co/VEval benchmark has three domains and one annotator per video-NP pair. See Tab. 28 for dataset statistics and Fig. 6 for example annotations.
Metrics. We aim to measure the usefulness of the model in downstream applications. Detection metrics such as average precision (AP) do not account for calibration, which means that models can be difficult to use in practice. To remedy this, we only evaluate predictions with confidence above 0.5, effectively introducing a threshold that mimics downstream usages and enforces good calibration. The PCS task can be naturally split into two sub-tasks, localization and classification. We evaluate localization using positive micro F1 ($\mathrm{pmF_1}$) on positive media-phrase pairs with at least one ground-truth mask. Classification is measured with the image-level Matthews Correlation Coefficient (IL_MCC), which ranges in $[-1, 1]$ and evaluates binary prediction at the image level ("is the object present?") without regard for mask quality. Our main metric, classification-gated F1 ($\mathrm{cgF_1}$), combines these as follows: $\mathrm{cgF_1} = 100 \cdot \mathrm{pmF_1} \cdot \mathrm{IL\_MCC}$. Full definitions are in §E.3.
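A minimal sketch of the metric combination defined above: predictions are thresholded at confidence 0.5 upstream, $\mathrm{pmF_1}$ is a micro F1 over positive media-phrase pairs, IL_MCC scores the image-level presence decision, and $\mathrm{cgF_1} = 100 \cdot \mathrm{pmF_1} \cdot \mathrm{IL\_MCC}$. The per-pair TP/FP/FN counts are assumed to be precomputed by a mask-matching step that is simplified away here; full definitions are in §E.3.

```python
from sklearn.metrics import matthews_corrcoef

def cg_f1(per_pair_tp, per_pair_fp, per_pair_fn, gt_present, pred_present) -> float:
    """Illustrative cgF1: positive micro F1 gated by image-level MCC."""
    tp, fp, fn = sum(per_pair_tp), sum(per_pair_fp), sum(per_pair_fn)
    pm_f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    # image-level MCC on the binary "is the concept present?" decision (incl. negative pairs)
    il_mcc = matthews_corrcoef(gt_present, pred_present)
    return 100 * pm_f1 * il_mcc

# e.g. two positive pairs plus image-level decisions over four media-phrase pairs
score = cg_f1([3, 1], [0, 1], [1, 0], gt_present=[1, 1, 0, 0], pred_present=[1, 1, 0, 1])
```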
Handling Ambiguity. We collect 3 annotations per NP on SA-Co/Gold. We measure oracle accuracy by comparing each prediction to all ground truths and selecting the best score. See §E.3.
We evaluate SAM 3 across image and video segmentation, few-shot adaptation to detection and counting benchmarks, and segmentation with complex language queries with SAM 3 + MLLM. We also show a subset of ablations, with more in §A. References, more results and details are in §F.
Image PCS with Text. We evaluate instance segmentation, box detection, and semantic segmentation on external and our benchmarks. SAM 3 is prompted with a single NP at a time, and predicts instance masks, bounding boxes, or semantic masks. As baselines, we evaluate OWLv2, GroundingDino (gDino), and LLMDet on box detection, and prompt SAM 1 with their boxes to evaluate segmentation. We also compare to APE, DINO-X, and Gemini 2.5 Flash, a generalist LLM. Tab. 1 shows that zero-shot, SAM 3 sets a new state-of-the-art on closed-vocabulary COCO, COCO-O and on LVIS boxes, and is significantly better on LVIS masks. On open-vocabulary SA-Co/Gold, SAM 3 achieves more than double the $\mathrm{cgF_1}$ score of the strongest baseline OWLv2*, and 74% of the estimated human performance. The improvements are even higher on the other SA-Co splits. Open-vocabulary semantic segmentation results on ADE-847, PascalConcept-59, and Cityscapes show that SAM 3 outperforms APE, a strong specialist baseline. See §F.1 for details.
Table 1 Evaluation on image concept segmentation with text. $\mathrm{AP_o}$ corresponds to COCO-O accuracy; $\star$: partially trained on LVIS; $\dagger$: from original papers; $\delta$: from DINO-X API. Gray numbers indicate usage of respective closed-set training data (LVIS/COCO). See §F.1 for more baselines and results and §E.4 for details of human performance.
Table 2 Zero-shot and 10-shot transfer on in-the-wild datasets.
Table 3 Prompting with 1 exemplar on COCO, LVIS and ODinW13. Evaluation per prompt type: T (text-only), I (image-only), and T+I (combined text and image). $\mathrm{AP^+}$ is evaluated only on positive examples.
Few-Shot Adaptation. We evaluate zero- and few-shot transfer of SAM 3 on ODinW13 and RF100-VL, with their original labels as prompts. We do not perform any prompt tuning. We fine-tune SAM 3 without mask loss, and report average bbox mAP in Tab. 2. SAM 3 achieves state-of-the-art 10-shot performance, surpassing in-context prompting in Gemini and object detection experts (gDino); more details in §F.3. RF100-VL contains domains with specialized prompts that are out of SAM 3's current scope, but SAM 3 adapts through fine-tuning more efficiently than baselines.
PCS with 1 Exemplar. We first evaluate image exemplars using a single input box sampled at random from the ground truth. This can be done only on "positive" data, where each prompted object appears in the image. We report the corresponding $\mathrm{AP^+}$ in Tab. 3 across three settings: text prompt (T), exemplar image (I), and both text and image (T+I); SAM 3 outperforms prior state-of-the-art T-Rex2 by a healthy margin on COCO (+18.3), LVIS (+10.3), and ODinW (+20.5). See §F.2 for more details and results on SA-Co/Gold.
PCS with K Exemplars. Next, we evaluate SAM 3 in an interactive setting, simulating collaboration with a human annotator. Starting with a text prompt, we iteratively add one exemplar prompt at a time: missed ground truths are candidate positive prompts, false positive detections are candidate negative prompts. Results (Fig. 7) are compared to a perfect PVS baseline, where we simulate the user manually fixing errors using ideal box-to-mask corrections. SAM 3's PCS improves $\mathrm{cgF_1}$ more quickly, as it generalizes from exemplars (e.g., detecting or suppressing similar objects), while PVS only corrects individual instances. After 3 clicks, interactive PCS outperforms text-only by +21.6 $\mathrm{cgF_1}$ points and PVS refinement by +2.0. Performance plateaus after 4 clicks, as exemplars cannot fix poor-quality masks. Simulating a hybrid switch to PVS at this point yields further gains, showing that the two modes are complementary.
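A minimal sketch of the interactive simulation described above: starting from a text prompt, one exemplar is added per round, taking a missed ground truth as a positive or a false-positive detection as a negative. `segment`, `find_missed`, and `find_false_positives` are hypothetical helpers standing in for the model call and the error analysis.

```python
def simulate_interactive_pcs(image, noun_phrase, gt_masks, segment,
                             find_missed, find_false_positives, rounds=4):
    """Illustrative K-exemplar refinement loop against ground truth."""
    exemplars = []
    preds = segment(image, noun_phrase, exemplars)
    for _ in range(rounds):
        missed = find_missed(preds, gt_masks)                # candidate positive prompts
        false_pos = find_false_positives(preds, gt_masks)    # candidate negative prompts
        if missed:
            exemplars.append(("positive", missed[0]))
        elif false_pos:
            exemplars.append(("negative", false_pos[0]))
        else:
            break                                            # nothing left to correct
        preds = segment(image, noun_phrase, exemplars)
    return preds, exemplars
```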
Figure 7 $\mathrm{cgF_1}$ vs. number of interactive box prompts for SAM 3 compared to the ideal PVS baseline, averaged over SA-Co/Gold phrases.
Object Counting. We evaluate on object counting benchmarks CountBench and PixMo-Count to compare with several MLLMs using Accuracy (%) and Mean Absolute Error (MAE) from previous technical reports and our own evaluations. See Tab. 4 for results and §F.4 for more evaluation details. Compared to MLLMs, SAM 3 not only achieves good object counting accuracy, but also provides object segmentation that most MLLMs cannot provide.
Video PCS with Text. We evaluate video segmentation with text prompts on both our SA-Co/VEval benchmark and existing public benchmarks. For SA-Co/VEval, we report $\mathrm{cgF_1}$ and pHOTA metrics (defined in §F.5) across its subsets (SA-V, YT-Temporal-1B, SmartGlasses). For public benchmarks, we use their official metrics. Baselines include GLEE, an open-vocabulary image and video segmentation model, "LLMDet + SAM 3 Tracker" (replacing our detector with LLMDet), and "SAM 3 Detector + T-by-D" (replacing our tracker with an association module based on the tracking-by-detection paradigm). In Tab. 5, SAM 3 largely outperforms these baselines, especially on benchmarks with a very large number of noun phrases. On SA-Co/VEval it reaches over 80% of human pHOTA. See §F.5 for more details.
Table 4 Accuracy on counting benchmarks. Gray indicates usage of training sets.
Table 5 Video PCS from a text prompt (open-vocabulary video instance segmentation) on SA-Co/VEval and public benchmarks (see Tab. 39 for more results and analyses). SAM 3 shows strong performance, especially on benchmarks with a large number of NPs. †: GLEE and LLMDet do not perform well zero-shot on SA-Co/VEval.
PVS. We evaluate SAM 3 on a range of visual prompting tasks, including Video Object Segmentation (VOS) and interactive image segmentation. Tab. 6 compares SAM 3 to recent state-of-the-art methods on the VOS task. SAM 3 achieves significant improvements over SAM 2 on most benchmarks, particularly on the challenging MOSEv2 dataset, where SAM 3 outperforms prior work by 6.5 points. For the interactive image segmentation task, we evaluate SAM 3 on the 37 datasets benchmark introduced in Ravi et al. (2024). As shown in Tab. 7, SAM 3 outperforms SAM 2 on average mIoU. See also §F.6 and Fig. 21 for interactive video segmentation.
Table 6 SAM 3 improves over SAM 2 in VOS. †: Zero-shot.
Table 7 Interactive image segmentation on the SA-37 benchmark.
SAM 3 Agent. We experiment with an MLLM that uses SAM 3 as a tool to segment more complex text queries (see Fig. 25). The MLLM proposes noun phrase queries to prompt SAM 3 and analyzes the returned masks, iterating until the masks are satisfactory. Tab. 8 shows that this "SAM 3 Agent" evaluated zero-shot on ReasonSeg and OmniLabel surpasses prior work without training on any referring expression segmentation or reasoning segmentation data. SAM 3 Agent also outperforms previous zero-shot results on RefCOCO+ and RefCOCOg. SAM 3 can be combined with various MLLMs using the same set of system prompts for all of them, showing SAM 3's robustness. See §G for more details.
Table 8 SAM 3 Agent results. Gray indicates fine-tuned results on ReasonSeg (train), * indicates reproduced results, underline indicates the main metric. †: LISA-13B-LLaVA1.5 for ReasonSeg; REAL for OmniLabel.
Table 9 Selected model and data ablations on SA-Co/Gold. Numbers across tables are not directly comparable.
Selected Ablations. In Tab. 9 we report a subset of the more extensive ablations from §A. Note that the ablated models are from different, shorter training runs than the model evaluated above. The presence head boosts $\mathrm{cgF_1}$ by +1.5 (9a), improving image-level recognition measured by IL_MCC by +0.05. Tab. 9b shows that adding hard negatives significantly improves the model performance, most notably the image-level IL_MCC from 0.44 to 0.68. Tab. 9c shows that synthetic (SYN) training data improves over the external (EXT) by +8.8 $\mathrm{cgF_1}$ and our high-quality (HQ) annotations add +14.6 $\mathrm{cgF_1}$ on top of this baseline. We present detailed data scaling laws of both types of data in §A.2, showing their effectiveness on both in-domain and out-of-domain test sets. In Tab. 9d, we show how AI verifiers can improve pseudo-labels. Replacing the presence score from SAM 3 with that score from the exhaustivity verification (EV) AI verifier boosts $\mathrm{cgF_1}$ by +7.2. Using the mask verification (MV) AI verifier to remove bad masks adds another 1.1 points. Overall, AI verifiers close half of the gap between SAM 3's and human performance.
Domain adaptation ablation. With domain-specific synthetic data generated by SAM 3 + AI verifiers, we show that one can significantly improve performance on a new domain without any human annotation. We hold out one of the SA-Co domains, "Food&drink", from training SAM 3 and AI verifiers. We then use three variants of training data for the novel "Food&drink" domain: high-quality AI+human annotations as in SA-Co/HQ (referred to as SA-Co/HQ-Food), synthetic annotations as in SA-Co/SYN, using AI but no humans (SA-Co/SYN-Food), and pseudo-labels generated before the AI verification step, i.e. skipping both AI verifiers and humans (PL-Food). Fig. 8 plots performance on the "Food&drink" test set of the SA-Co/Gold benchmark as each type of training data is scaled up. We mix the domain specific data and high-quality general domain data at a 1:1 ratio. PL-Food provides some improvement compared to the baseline SAM 3 (zero-shot), but is far below the other variants due to its lower quality. HQ-Food and SYN-Food show similar scaling behavior, with SYN-Food slightly lower but eventually catching up, without incurring any human annotation cost. This points to a scalable way to improve performance on new data distributions. More details are in §A.3.
Figure 8 Domain adaptation via synthetic data. Synthetic (SYN) data generated by SAM 3 + AI verifiers (teacher system) achieves similar scaling behavior as human-annotated (HQ) data.
Promptable and Interactive Visual Segmentation. SAM (Kirillov et al., 2023) introduces "promptable" image segmentation with interactive refinement. While the original task definition included text prompts, they were not fully developed. SAM 2 (Ravi et al., 2024) extended the promptable visual segmentation task to video, allowing refinement points on any frame. SAM 3 inherits geometry-based segmentation while extending to include text and image exemplar prompts to segment all instances of a concept in images and videos.
Open-Vocabulary Detection and Segmentation in Images exhaustively labels every instance of an open-vocabulary object category with a coarse bounding box (detection) or a fine-grained pixel mask (segmentation). Recent open-vocabulary (OV) detection (Gu et al., 2021; Minderer et al., 2022) and segmentation (Ding et al., 2022; Liang et al., 2023) methods leverage large-scale vision-language encoders such as CLIP (Radford et al., 2021) to handle categories described by arbitrary text, even those never seen during training. While DETR (Carion et al., 2020) is limited to a closed set of categories seen during training, MDETR (Kamath et al., 2021) evolves the approach to condition on raw text queries. Image exemplars used as prompts to specify the desired object category (e.g., DINOv (Li et al., 2023a), T-Rex2 (Jiang et al., 2024)) present a practical alternative to text, but fall short in conveying the abstract concept of objects as effectively as text prompts. We introduce a new benchmark for OV segmentation with $>100\times$ more unique concepts than prior work.
Visual Grounding localizes a language expression referring to a region of the image with a box or mask. Plummer et al. (2020) introduce phrase detection as both deciding whether the phrase is relevant to an image and localizing it. GLIP (Li et al., 2022b) and GroundingDino (Liu et al., 2023) formulate object detection as phrase grounding, unifying both tasks during training. MQ-GLIP (Xu et al., 2023) adds image exemplars to text as queries. Building on this trend toward models supporting multiple tasks and modalities, GLEE (Wu et al., 2024a) allows text phrases, referring expressions, and visual prompts for category and instance grounding in both images and videos. Unlike SAM 3, GLEE does not support exemplars or interactive refinement. LISA (Lai et al., 2024) allows segmentation that requires reasoning, while OMG-LLaVa (Zhang et al., 2024a) and GLaMM (Rasheed et al., 2024) generate natural language responses interleaved with corresponding segmentation masks, with GLaMM accepting both textual and optional image prompts as input. Some general-purpose MLLMs can output boxes and masks (Gemini2.5 (Comanici et al., 2025)) or points (Molmo (Deitke et al., 2025)). SAM 3 can be used as a "vision tool" in combination with an MLLM (§6).
Multi-Object Tracking and Segmentation methods identify object instances in video and track them, associating each with a unique ID. In tracking-by-detection methods, detection is performed independently on each frame to produce boxes and confidence scores, followed by association of boxes using motion-based and appearance-based matching as in SORT (Bewley et al., 2016; Wojke et al., 2017), Tracktor (Bergmann et al., 2019), ByteTrack (Zhang et al., 2022c), SAM2MOT (Jiang et al., 2025), or OC-SORT (Cao et al., 2023). An alternative is an end-to-end trainable architecture that jointly detects and associates objects, e.g., TrackFormer (Meinhardt et al., 2022), TransTrack (Sun et al., 2020), or MOTR (Zeng et al., 2022). TrackFormer uses a DETR-like encoder-decoder that initializes new tracks from static object queries and auto-regressively follows existing tracks with identity-preserving track queries. A challenge with joint models is the conflict between detection and tracking (Feichtenhofer et al., 2017; Yu et al., 2023a), where one needs to focus on semantics while the other on disentangling identities, even if their spatial locations overlap over time. SAM 3 is a strong image detector tightly integrated into a tracker to segment concepts in videos.
We present Segment Anything with Concepts, enabling open-vocabulary text and image exemplars as prompts in interactive segmentation. Our principal contributions are: (i) introducing the PCS task and SA-Co benchmark, (ii) an architecture that decouples recognition, localization and tracking and extends SAM 2 to solve concept segmentation while retaining visual segmentation capabilities, (iii) a high-quality, efficient data engine that leverages the complementary strengths of human and AI annotators. SAM 3 achieves state-of-the-art results, doubling performance over prior systems for PCS on SA-Co in images and videos. That said, our model has several limitations. For example, it struggles to generalize to out-of-domain terms, which could be mitigated by automatic domain expansion but requires extra training. We discuss this and other limitations of our model in §B. We believe SAM 3 and the SA-Co benchmark will be important milestones and pave the way for future research and applications in computer vision.