Building a Smart Image Annotation Tool with SAM3 + FastAPI: A Hands-On Walkthrough

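This article documents the full development of a web-based image annotation tool built on Meta's SAM3 model. The tool supports text-driven segmentation, click-based interactive segmentation, box-prompt segmentation, batch auto-labeling, and polygon contour editing. Annotations can be exported in YOLO and COCO JSON formats and used directly for model training.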

Why build this tool

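When training an object detection or instance segmentation model, the most time-consuming step is not the training itself but the data annotation. Traditional tools such as LabelImg and CVAT require drawing every box or outline by hand, so labeling 100 images can take a full day.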

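In November 2025, Meta released SAM3 (Segment Anything with Concepts), the first SAM release to support open-vocabulary segmentation: given any short text phrase (such as "person", "crack", or "cell"), the model segments every matching instance in the image. Annotation can therefore move from outlining objects one by one to typing a word and having every instance labeled.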

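But SAM3 is only a model, not an annotation tool. It has no UI, no annotation management, and no data export. So I built a complete web-based annotation tool on top of the SAM3 source code, and this article is the development log.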

Technology choices

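Layer	Choice	Rationale
AI model	SAM3 (deployed locally)	Open-vocabulary plus interactive segmentation; the best fit for annotation
Backend	FastAPI	Python ecosystem, same language as SAM3, async and fast
Frontend	React + TypeScript + Ant Design	Mature component ecosystem
Canvas	react-konva	Canvas 2D rendering with image overlays, mouse interaction, and shape dragging
Mask encoding/decoding	pycocotools	Standard COCO RLE format, widely compatible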

Overall architecture

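The UI uses a three-column layout: the image list on the left, the canvas in the middle, and the tools and annotation management on the right: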

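┌────────────────────┬──────────────────────────────┬───────────────────────┐
│ Image list         │ Canvas area                  │ Tool panel            │
│                    │                              │                       │
│ Batch upload       │                              │ Single upload         │
│ Batch auto-label   │ Image + mask overlay         │ Text / click / box    │
│ Thumbnails         │ Click markers / box preview  │ Segmentation results  │
│ Annotation status  │ Polygon vertex editing       │ Saved annotations     │
│                    │                              │ Export YOLO/COCO      │
└────────────────────┴──────────────────────────────┴───────────────────────┘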

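The frontend and backend communicate over a REST API. Mask data is RLE-encoded for compact transfer, and the mask visualization (translucent fill plus contour stroke) is rendered by the backend as a PNG and sent to the frontend as base64.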

Backend core: wrapping SAM3 as a model service

The core of the backend is the SAM3Service class, which handles model loading, image feature caching, and segmentation inference.

Lazy model loading

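The SAM3 model is large and takes several seconds to load. The service loads it lazily, initializing it only when the first request arrives: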

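import threading
from collections import OrderedDict

# Import paths as given in the SAM3 repository README; adjust if your checkout differs.
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor


class SAM3Service:
    def __init__(self, max_cache_size=10):
        self._model = None
        self._processor = None
        self._lock = threading.Lock()
        self._state_cache = OrderedDict()  # LRU cache of per-image states
        self._max_cache_size = max_cache_size

    def _ensure_model(self):
        # Fast path: the model is already loaded, no lock needed
        if self._processor is not None:
            return
        with self._lock:
            # Re-check inside the lock in case another thread loaded it first
            if self._processor is not None:
                return
            self._model = build_sam3_image_model(
                enable_inst_interactivity=True,  # enable click/box segmentation
            )
            self._processor = Sam3Processor(self._model, confidence_threshold=0.5)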

enable_inst_interactivity=True is the key parameter: with it enabled, the model also loads a SAM1-compatible interactive predictor that supports click and box prompts.
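
The check, lock, re-check pattern used by _ensure_model can be tried in isolation. The sketch below is a minimal, hypothetical stand-in: LazyResource and its factory argument are illustrative names, not part of SAM3 or of the tool, and a plain dict takes the place of the model so the example runs without a GPU.

```python
import threading


class LazyResource:
    """Builds an expensive resource once, on first use, even under concurrent calls."""

    def __init__(self, factory):
        self._factory = factory      # hypothetical stand-in for the SAM3 model builder
        self._value = None
        self._lock = threading.Lock()
        self.build_count = 0         # only here to show that the factory runs once

    def get(self):
        # Fast path: no lock once the resource exists.
        if self._value is not None:
            return self._value
        with self._lock:
            # Re-check inside the lock: another thread may have finished
            # building while this one was waiting.
            if self._value is None:
                self._value = self._factory()
                self.build_count += 1
        return self._value


resource = LazyResource(lambda: {"name": "dummy-model"})
threads = [threading.Thread(target=resource.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(resource.build_count)  # 1
```

Eight threads request the resource at once, yet the factory runs a single time, which is the property the service relies on when several requests arrive before the model has finished loading.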

Image feature caching

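set_image() is the most expensive operation because it runs the full vision encoder once; later segmentation calls only run the lightweight text encoder or decoder heads. Image features therefore have to be cached: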

def load_image(self, image_id, image):
    self._ensure_model()
    with torch.autocast("cuda", dtype=torch.bfloat16), torch.inference_mode():
        state = self._processor.set_image(image)
    self._put_state(image_id, state)  # store in the LRU cache
    return {"image_id": image_id, "width": image.size[0], "height": image.size[1]}

The cache uses an LRU policy: when it exceeds its limit, the least recently used state is evicted and its GPU memory is released explicitly:

def _put_state(self, image_id, state):
    self._state_cache[image_id] = state
    self._state_cache.move_to_end(image_id)
    while len(self._state_cache) > self._max_cache_size:
        _, evicted = self._state_cache.popitem(last=False)
        self._release_state_tensors(evicted)  # free the GPU tensors
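
An LRU cache only behaves as "least recently used" if reads also refresh an entry's position, which the excerpt above does not show. The following self-contained sketch illustrates both halves under that assumption; LRUStateCache and the on_evict hook are hypothetical names standing in for the service's cache and its tensor-release step.

```python
from collections import OrderedDict


class LRUStateCache:
    """Minimal LRU cache; `on_evict` is a hypothetical hook for freeing GPU tensors."""

    def __init__(self, max_size, on_evict=None):
        self._items = OrderedDict()
        self._max_size = max_size
        self._on_evict = on_evict

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a hit makes the entry the most recent
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            old_key, old_value = self._items.popitem(last=False)  # oldest first
            if self._on_evict is not None:
                self._on_evict(old_key, old_value)

    def keys(self):
        return list(self._items)


evicted = []
cache = LRUStateCache(max_size=2, on_evict=lambda k, v: evicted.append(k))
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # touch "a" so "b" becomes the oldest entry
cache.put("c", 3)     # exceeds the limit and evicts "b"
print(evicted, cache.keys())  # ['b'] ['a', 'c']
```

Without the move_to_end call in get, the cache would degrade to first-in-first-out and could evict the image the annotator is currently working on.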

Three segmentation modes

Text segmentation: call Sam3Processor.set_text_prompt() with a text phrase; it returns masks for all matching instances:

def text_prompt(self, image_id, text):
    state = self._get_or_load_state(image_id)
    state = self._processor.set_text_prompt(text, state)
    return self._format_result(state)

Click segmentation: call model.predict_inst() (the SAM1-compatible interface) with positive and negative point coordinates:

def click_prompt(self, image_id, points, labels):
    state = self._get_or_load_state(image_id)
    # img_w / img_h are the original image size in pixels (lookup omitted in this excerpt)
    # Normalized coordinates -> pixel coordinates
    point_coords = np.array([[p[0] * img_w, p[1] * img_h] for p in points])
    point_labels = np.array(labels)  # 1 = positive point, 0 = negative point
    # One point: request multiple masks and keep the best; several points: a single mask
    use_multimask = len(points) == 1
    masks_np, scores_np, _ = self._model.predict_inst(
        state,
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=use_multimask,
    )

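One key detail: predict_inst is a method of the Sam3Image model. It pulls features from the backbone_out that Sam3Processor.set_image() has already computed, so the vision encoder does not run again. That is why the first load of an image is slow (a few seconds) while subsequent click segmentation is fast (milliseconds).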

Box segmentation: also call predict_inst, this time passing the box parameter:

def box_prompt(self, image_id, box, label):
    state = self._get_or_load_state(image_id)
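    # box is normalized (cx, cy, w, h); convert it to pixel (x1, y1, x2, y2).
    # img_w / img_h are the original image size in pixels (lookup omitted in this excerpt).
    cx, cy, w, h = box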
    box_pixels = np.array([
        (cx - w/2) * img_w, (cy - h/2) * img_h,
        (cx + w/2) * img_w, (cy + h/2) * img_h,
    ])
    masks_np, scores_np, _ = self._model.predict_inst(
        state, box=box_pixels, multimask_output=False,
    )
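
Both prompt types convert normalized coordinates from the frontend into pixel coordinates before calling the model. A minimal sketch of those two conversions as standalone helpers follows; the function names are hypothetical, and the clamping step is an addition for robustness that the excerpts above do not show.

```python
import numpy as np


def points_to_pixels(points, img_w, img_h):
    """Convert normalized (x, y) points in [0, 1] to pixel coordinates."""
    pts = np.asarray(points, dtype=np.float32).reshape(-1, 2)
    return pts * np.array([img_w, img_h], dtype=np.float32)


def cxcywh_to_xyxy_pixels(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    xyxy = np.array(
        [(cx - w / 2) * img_w, (cy - h / 2) * img_h,
         (cx + w / 2) * img_w, (cy + h / 2) * img_h],
        dtype=np.float32,
    )
    # Clamp so a box dragged past the image border stays inside the image.
    limits = np.array([img_w, img_h, img_w, img_h], dtype=np.float32)
    return np.clip(xyxy, 0, limits)


pixel_points = points_to_pixels([[0.5, 0.25]], img_w=640, img_h=480)
pixel_box = cxcywh_to_xyxy_pixels((0.5, 0.5, 0.5, 0.5), img_w=640, img_h=480)
print(pixel_points)  # [[320. 120.]]
print(pixel_box)     # [160. 120. 480. 360.]
```

Keeping the API in normalized coordinates means the frontend never needs to know the original image resolution, only the displayed size.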

Mask visualization

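The mask visualization is generated on the backend so the frontend avoids complex pixel manipulation. The core is the _generate_overlay function, which produces a PNG with a translucent fill and a contour stroke: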

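import base64
import io

import numpy as np
from PIL import Image as PILImage
from scipy.ndimage import binary_dilation

# Fallback palette (illustrative values) used when the caller passes no colors
DEFAULT_COLORS = [(255, 56, 56), (56, 168, 255), (72, 220, 120), (255, 178, 29)]


def _generate_overlay(masks, img_h, img_w, colors=None):
    colors = colors or DEFAULT_COLORS  # the original indexed colors even when it was None
    overlay = np.zeros((img_h, img_w, 4), dtype=np.uint8)
    for i, mask in enumerate(masks):
        color = colors[i % len(colors)]
        binary = mask > 0.5
        # Translucent fill
        overlay[binary, :3] = color
        overlay[binary, 3] = 80
        # Edge detection: mark both pixels of every foreground/background transition
        edge = np.zeros_like(binary, dtype=bool)
        edge[1:, :] |= binary[1:, :] != binary[:-1, :]
        edge[:-1, :] |= binary[1:, :] != binary[:-1, :]
        edge[:, 1:] |= binary[:, 1:] != binary[:, :-1]
        edge[:, :-1] |= binary[:, 1:] != binary[:, :-1]
        # Thicken the contour by one pixel and draw it fully opaque
        thick_edge = binary_dilation(edge, iterations=1)
        overlay[thick_edge, :3] = color
        overlay[thick_edge, 3] = 255
    # Encode as a base64 PNG; an (H, W, 4) uint8 array is read as RGBA
    img = PILImage.fromarray(overlay)
    buf = io.BytesIO()
    img.save(buf, format='PNG', optimize=True)
    return base64.b64encode(buf.getvalue()).decode('utf-8')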

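Contour detection is simple: wherever a pixel differs from its vertical or horizontal neighbor, the mask changes between foreground (mask=1) and background (mask=0), so the pixels on both sides of that transition are marked as edge. binary_dilation then thickens the edge by one pixel so the contour line stands out.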
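
As a quick check of the contour logic, the sketch below reruns the same neighbor comparison on a 7x7 mask containing a 3x3 square. The function name boundary_pixels is made up for illustration. Note that the comparison flags pixels on both sides of the boundary, so the contour is already two pixels wide before dilation.

```python
import numpy as np


def boundary_pixels(binary):
    """Mark pixels on either side of a foreground/background transition."""
    edge = np.zeros_like(binary, dtype=bool)
    vertical = binary[1:, :] != binary[:-1, :]    # change between row r and row r+1
    horizontal = binary[:, 1:] != binary[:, :-1]  # change between col c and col c+1
    edge[1:, :] |= vertical
    edge[:-1, :] |= vertical
    edge[:, 1:] |= horizontal
    edge[:, :-1] |= horizontal
    return edge


mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True          # a 3x3 square of foreground
edge = boundary_pixels(mask)

print(edge[3, 3])   # False: the square's center has no background neighbor
print(edge[2, 2])   # True: foreground pixel on the boundary
print(edge[1, 3])   # True: background pixel just above the square
```

Only 4-connected neighbors are compared, so a background pixel that touches the square diagonally, such as (1, 1), is not marked.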

The generated PNG is base64-encoded and placed in the overlay field of the API response; the frontend renders it onto the canvas directly as an Image.

API design

All endpoints follow a RESTful style:

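POST /api/image/upload          # single upload (features extracted automatically)
POST /api/image/upload_batch    # batch upload (files are only saved)
GET  /api/image/list            # image list (with annotation counts)
GET  /api/image/{id}/thumbnail  # thumbnail
GET  /api/image/{id}/file       # original image

POST /api/prompt/text           # text segmentation
POST /api/prompt/click          # click segmentation (accumulates positive/negative points)
POST /api/prompt/box            # box segmentation
POST /api/prompt/reset          # reset prompts

POST /api/annotation/save       # save annotations
GET  /api/annotation/{image_id} # query annotations
DELETE /api/annotation/{id}     # delete an annotation

GET  /api/export/coco           # export COCO JSON
GET  /api/export/yolo           # export a YOLO-format zip

POST /api/batch/auto_label      # batch auto-labeling (progress streamed over SSE)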

The segmentation endpoints share a single response format:

{
  "masks": ["<RLE-encoded mask>", ...],
  "boxes": [[x1, y1, x2, y2], ...],
  "scores": [0.95, ...],
  "count": 3,
  "overlay": "<base64 PNG>"
}

Frontend core: canvas interaction

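The canvas component is built on react-konva. The main challenge is layering the original image, the mask overlay, click markers, the box preview, and polygon editing on a single canvas.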

Fitting the image to the canvas

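The canvas must fit the container width while capping the maximum height; the scale is also capped at 1 so small images are never upscaled: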

const maxWidth = containerWidth - 16;
const maxHeight = window.innerHeight * 0.85;
const scaleByWidth = imageWidth > 0 ? maxWidth / imageWidth : 1;
const scaleByHeight = imageHeight > 0 ? maxHeight / imageHeight : 1;
const scale = Math.min(scaleByWidth, scaleByHeight, 1);
const displayWidth = imageWidth * scale;
const displayHeight = imageHeight * scale;

Handling left and right clicks in click mode

An easy pitfall: the canvas onClick event does not fire for the right mouse button. Handle both buttons in onMouseUp instead:

const handleMouseUp = useCallback((e) => {
  const isRightClick = e.evt.button === 2;
  if (toolMode === 'click') {
    const label = isRightClick ? 0 : 1;  // right button = negative point, left button = positive point
    onClickPrompt({ x: nx, y: ny, label });
  }
  if (toolMode === 'box' && boxStart) {
    onBoxPrompt([cx, cy, nw, nh], !isRightClick);
  }
}, [...]);

The browser's context menu also has to be disabled:

const handleContextMenu = useCallback((e) => {
  e.evt.preventDefault();
}, []);

Polygon vertex editing

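Saved annotations can be converted to polygon contours for fine-grained editing. The backend uses OpenCV to extract and simplify the contours: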

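import cv2
import numpy as np


def mask_to_polygon(mask, tolerance=2.0):
    # findContours expects a single-channel 8-bit image, so binarize bool/float masks first
    mask_u8 = (np.asarray(mask) > 0.5).astype(np.uint8)
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for contour in contours:
        # Douglas-Peucker simplification; tolerance is the maximum deviation in pixels
        approx = cv2.approxPolyDP(contour, tolerance, True)
        if len(approx) >= 3:  # a polygon needs at least three vertices
            polygons.append(approx.reshape(-1).tolist())  # flat [x1, y1, x2, y2, ...]
    return polygons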

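The frontend renders the result with react-konva's Line (polygon outline) and Circle (vertices). Vertices can be dragged to move or double-clicked to delete, and clicking the midpoint of an edge inserts a new vertex.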

Batch auto-labeling

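Batch labeling is the key efficiency feature. The user uploads a folder of images and enters a text prompt and a class name; the backend processes the images one by one and pushes progress via SSE (Server-Sent Events):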

@app.post("/api/batch/auto_label")
async def batch_auto_label(req: dict):
    # Excerpt: image_ids, text, file_path and saved come from request parsing
    # and bookkeeping that is omitted here.
    def generate():
        for idx, image_id in enumerate(image_ids):
            # Load image features on demand
            if sam3_service._get_state(image_id) is None:
                image = Image.open(file_path).convert("RGB")
                sam3_service.load_image(image_id, image)
            # Text segmentation
            result = sam3_service.text_prompt(image_id, text)
            # Save the annotations
            for i in range(result["count"]):
                _annotations.append({...})
            yield f"data: {json.dumps({'status': 'done', 'count': saved})}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")

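The frontend reads the SSE stream with the Fetch API, since the browser's EventSource only issues GET requests and this endpoint is a POST, and updates the progress bar in real time.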
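
The wire format is easy to get subtly wrong, because a stream reader receives arbitrary chunks rather than whole events. The sketch below shows one way to frame and reassemble events, written in Python for consistency with the backend code. The payload fields other than status and count (current, total, image_id), the "progress" status value, and the label_one callable are assumptions made for the example, not the tool's actual schema.

```python
import json


def sse_event(payload):
    """Serialize one JSON payload as a Server-Sent Events frame."""
    return f"data: {json.dumps(payload)}\n\n"


def progress_stream(image_ids, label_one):
    """Yield a progress frame per image, then a final summary frame.

    `label_one` is a hypothetical callable returning the number of
    annotations saved for one image.
    """
    saved = 0
    for idx, image_id in enumerate(image_ids, start=1):
        count = label_one(image_id)
        saved += count
        yield sse_event({"status": "progress", "current": idx,
                         "total": len(image_ids), "image_id": image_id,
                         "count": count})
    yield sse_event({"status": "done", "count": saved})


def parse_sse(chunks):
    """Reassemble frames from arbitrarily split text chunks, as a stream reader must."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n\n" in buffer:
            frame, buffer = buffer.split("\n\n", 1)
            if frame.startswith("data: "):
                yield json.loads(frame[len("data: "):])


stream = "".join(progress_stream(["img1", "img2"], lambda _id: 2))
chunks = [stream[i:i + 7] for i in range(0, len(stream), 7)]  # simulate partial reads
events = list(parse_sse(chunks))
print(events[-1])  # {'status': 'done', 'count': 4}
```

The same buffering logic applies on the frontend: split on the blank line that ends each frame and keep any incomplete tail for the next chunk.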

Data export

YOLO format

The export is a zip archive containing images/, labels/, and classes.txt:

# Convert a pixel-coordinate bounding box to YOLO's normalized format
cx = ((box[0] + box[2]) / 2) / img_w
cy = ((box[1] + box[3]) / 2) / img_h
w = (box[2] - box[0]) / img_w
h = (box[3] - box[1]) / img_h
line = f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
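
Wrapped in functions, the conversion can be exercised with a concrete box. This is a sketch only: the annotation record layout ({"category", "box"}) and the helper names are assumptions, and the class index is taken to be the position of the class name in classes.txt.

```python
def xyxy_to_yolo_line(class_id, box, img_w, img_h):
    """Format a pixel-space (x1, y1, x2, y2) box as one YOLO label line."""
    x1, y1, x2, y2 = box
    cx = ((x1 + x2) / 2) / img_w
    cy = ((y1 + y2) / 2) / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def write_labels(annotations, class_names, img_w, img_h):
    """Build the text of one labels/<image>.txt file.

    `annotations` is a hypothetical list of
    {"category": str, "box": [x1, y1, x2, y2]} records.
    """
    lines = [
        xyxy_to_yolo_line(class_names.index(a["category"]), a["box"], img_w, img_h)
        for a in annotations
    ]
    return "\n".join(lines)


label_text = write_labels(
    [{"category": "person", "box": [160, 120, 480, 360]}],
    class_names=["crack", "person"],
    img_w=640, img_h=480,
)
print(label_text)  # 1 0.500000 0.500000 0.500000 0.500000
```

A box covering the central quarter of a 640x480 image yields 0.5 for all four values, which makes a convenient sanity check for the normalization.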

COCO JSON format

Masks use pycocotools' standard RLE encoding, which is directly compatible with frameworks such as Detectron2 and MMDetection.
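
For orientation, a COCO file is a dict with images, categories, and annotations lists, and its bbox field is [x, y, width, height] rather than the [x1, y1, x2, y2] used in the API responses. The sketch below assembles such a dict; the input record fields and the build_coco name are hypothetical, the RLE counts value is a placeholder, and the box area is used as a rough stand-in for the mask area. One practical detail: pycocotools returns counts as bytes, so it has to be decoded to a string before json.dumps.

```python
import json


def build_coco(images, annotations, class_names):
    """Assemble a minimal COCO dict from hypothetical annotation records.

    images:      {"id", "file_name", "width", "height"}
    annotations: {"image_id", "category", "box": [x1, y1, x2, y2],
                  "rle": {"size": [h, w], "counts": str}}
    """
    cat_ids = {name: i + 1 for i, name in enumerate(class_names)}  # ids commonly start at 1
    coco = {
        "images": list(images),
        "categories": [{"id": cid, "name": name} for name, cid in cat_ids.items()],
        "annotations": [],
    }
    for ann_id, ann in enumerate(annotations, start=1):
        x1, y1, x2, y2 = ann["box"]
        coco["annotations"].append({
            "id": ann_id,
            "image_id": ann["image_id"],
            "category_id": cat_ids[ann["category"]],
            "bbox": [x1, y1, x2 - x1, y2 - y1],   # COCO boxes are [x, y, width, height]
            "area": (x2 - x1) * (y2 - y1),        # box area as a stand-in for mask area
            "segmentation": ann["rle"],
            "iscrowd": 0,
        })
    return coco


coco = build_coco(
    images=[{"id": 1, "file_name": "a.jpg", "width": 640, "height": 480}],
    annotations=[{"image_id": 1, "category": "person", "box": [160, 120, 480, 360],
                  "rle": {"size": [480, 640], "counts": "<compressed RLE string>"}}],
    class_names=["crack", "person"],
)
coco_json = json.dumps(coco)
```

If precise areas matter for evaluation, pycocotools can compute the mask area from the RLE instead of the box approximation used here.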

Pitfalls encountered during development

1. reset_all_prompts has no return value

SAM3's Sam3Processor.reset_all_prompts() modifies the state in place and does not return a new object. If you write state = processor.reset_all_prompts(state), state becomes None and every subsequent operation fails.

2. The multimask strategy for click segmentation

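The recommended practice for SAM is multimask_output=True for a single point (three candidates are returned; keep the best) and multimask_output=False for multiple points (one combined result). With multimask enabled for multiple points, the model may pick a mask that covers only part of the object.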
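
Keeping the best candidate amounts to an argmax over the returned scores. A minimal sketch, assuming masks arrive as an (N, H, W) array and scores as an (N,) array, which is the layout SAM-style predictors commonly use; both helper names are hypothetical.

```python
import numpy as np


def pick_best_mask(masks, scores):
    """Return the highest-scoring candidate mask and its score."""
    masks = np.asarray(masks)
    scores = np.asarray(scores)
    best = int(np.argmax(scores))
    return masks[best], float(scores[best])


def use_multimask(num_points):
    """A single click is ambiguous, so ask for several candidates only then."""
    return num_points == 1


candidates = np.zeros((3, 4, 4), dtype=bool)
candidates[1, :2, :2] = True                       # pretend candidate 1 is the best one
mask, score = pick_best_mask(candidates, [0.61, 0.93, 0.78])
print(score, int(mask.sum()))  # 0.93 4
```

With multimask_output=False the predictor returns a single mask, so the same helper still works and simply selects index 0.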

3. Box segmentation cannot use add_geometric_prompt

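Sam3Processor.add_geometric_prompt expects a text prompt to be set first. Without one, it uses "visual" as a placeholder, which causes every object in the image to be detected. Box prompts should go through the box parameter of predict_inst, which performs standalone interactive segmentation and does not depend on text.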

4. Row/column order in RLE encoding and decoding

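COCO RLE flattens the mask column by column (Fortran order). A hand-written encoder or decoder easily gets the row/column order wrong, and the mask then renders as vertical stripes. Use the standard pycocotools implementation rather than reinventing the wheel.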
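
The effect of the flattening order is visible even on a 2x3 mask. The sketch below computes uncompressed run lengths (starting with the count of leading zeros, as COCO does) in both orders. It is an illustration of the ordering issue only, not a replacement for pycocotools, and rle_counts is a name invented for this example.

```python
import numpy as np


def rle_counts(mask, order):
    """Uncompressed run lengths, starting with the number of leading zeros."""
    flat = np.asarray(mask, dtype=np.uint8).flatten(order=order)
    counts, current, run = [], 0, 0
    for value in flat:
        if value == current:
            run += 1
        else:
            counts.append(run)
            current, run = int(value), 1
    counts.append(run)
    return counts


mask = np.array([[0, 1, 1],
                 [0, 0, 0]], dtype=np.uint8)

print(rle_counts(mask, order="F"))  # [2, 1, 1, 1, 1]  column-major, what COCO expects
print(rle_counts(mask, order="C"))  # [1, 2, 3]        row-major, the easy mistake
```

The two orders produce different run lengths for the same mask, so decoding with the wrong one scrambles the pixels instead of raising an error, which is why the bug tends to show up only visually.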

5. antd Upload fires repeatedly in directory mode

In directory mode, antd's Upload component calls beforeUpload once per file, each time with the full fileList. Use a ref to deduplicate; otherwise the same batch of files gets uploaded multiple times.

This article is based on an actual development process. The SAM3 model version is SAM 3.1 (March 2026); the frontend uses React 18 + Ant Design + react-konva, and the backend uses FastAPI + PyTorch.