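Alibaba Cloud VIS + Qwen-Vision Custom Image Matting System: Implementation Guide

System Workflow in Detail

Overall Workflow

1. User input → 2. Qwen-Vision semantic understanding → 3. Generate a structured prompt → 4. Call VIS image segmentation → 5. Process the segmentation result → 6. Return the final matting result → 7. User confirmation/adjustment

Detailed Steps

🔹 Step 1: User Input

What it does: The user supplies the image to process and a natural-language instruction.

Input format:

json

{
  "image_url": "https://example.com/cat.jpg",
  "instruction": "Please cut out the black cat in the lower-left corner in full"
}

Notes:

The image URL must be reachable from the public internet
The instruction should be as specific as possible (position, color, distinguishing features)
Base64-encoded images are supported (they must be converted to data URL format)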
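
The Base64 note above can be illustrated with a minimal sketch. The helper name `to_data_url` and its `filename` parameter are hypothetical and not part of any SDK; whether a given endpoint accepts data URLs should be checked against its documentation.

```python
import base64
import mimetypes

def to_data_url(image_bytes, filename="image.jpg"):
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    # Infer the MIME type from the file extension, e.g. .png -> image/png
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

In practice the bytes would come from reading a local file in binary mode, and the resulting string would be used in place of `image_url`.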

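🔹 Step 2: Qwen-Vision Semantic Understanding

What it does: Calls the Qwen-VL model to interpret the image and the user's instruction and to locate the target object.

API: dashscope.MultiModalConversation

Model: qwen-vl-plus

Request parameters:

json

{
  "model": "qwen-vl-plus",
  "messages": [
    {
      "role": "user",
      "content": [
        {"image": "https://example.com/cat.jpg"},
        {"text": "Analyze this image and return the bounding box coordinates [x1,y1,x2,y2] of 'the black cat in the lower-left corner', in JSON format"}
      ]
    }
  ]
}

Example response:

json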

{
  "output": {
    "choices": [
      {
        "message": {
          "content": "{\"box\": [50, 200, 180, 350], \"object\": \"black cat\"}"
        }
      }
    ]
  }
}

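Key parameters:

Parameter	Type	Required	Description
model	string	Yes	Model name; qwen-vl-plus is recommended
image	string	Yes	Image URL or Base64-encoded image
text	string	Yes	A carefully designed prompt that asks for structured output

Prompt design tips:

State the output format explicitly: "Return JSON containing a box field"
Specify the coordinate format: "Coordinates are [x1,y1,x2,y2], with the origin at the top-left corner"
Constrain the target: "Return only the single target that best matches the description"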
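
As a minimal sketch, the three tips can be combined into one prompt string. `build_grounding_prompt` is a hypothetical helper, and the exact wording is an assumption that would need tuning against real model output.

```python
def build_grounding_prompt(target_description):
    """Compose a Qwen-VL text prompt from the three tips (hypothetical helper)."""
    parts = [
        f"Analyze this image and locate '{target_description}'.",
        # Tip 1: state the output format explicitly
        "Return JSON only, with a box field and an object field.",
        # Tip 2: specify the coordinate format
        "The box format is [x1,y1,x2,y2], with the origin at the top-left corner.",
        # Tip 3: constrain the target
        "Return only the single target that best matches the description.",
    ]
    return " ".join(parts)
```

The returned string would go into the `text` entry of the message content shown in the request parameters.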

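🔹 Step 3: Generate a Structured Prompt

What it does: Parses the Qwen-VL response and converts it into a prompt format that VIS accepts.

Processing logic:

Extract the coordinate data from the JSON
Validate the coordinates (check that they fall within the image dimensions)
Convert them into the prompt format VIS requires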
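
The validation step is not shown in the code example that follows, so here is a rough sketch of what it might look like. `validate_box` is a hypothetical helper, and it assumes the caller already knows the image width and height in pixels.

```python
def validate_box(box, image_width, image_height):
    """Return the box as integers if it fits the image, otherwise raise ValueError."""
    if not isinstance(box, (list, tuple)) or len(box) != 4:
        raise ValueError(f"Expected 4 coordinates, got {box!r}")
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    # The box must have positive area and lie inside the image bounds
    inside = 0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height
    if not inside:
        raise ValueError(
            f"Box {box!r} is degenerate or outside a {image_width}x{image_height} image"
        )
    return [x1, y1, x2, y2]
```

Rejecting a bad box here avoids spending a VIS call on coordinates that cannot succeed.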

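Code example:

python

import ast
import json

def parse_qwen_response(qwen_output):
    """Parse the Qwen-VL response into a prompt that VIS can use."""
    # Extract the message content from the Qwen output
    content = qwen_output["output"]["choices"][0]["message"]["content"]
    # Content may arrive as a list of parts such as [{"text": "..."}]
    if isinstance(content, list):
        content = "".join(p.get("text", "") for p in content if isinstance(p, dict))
    try:
        data = json.loads(content)
    except json.JSONDecodeError as e:
        # Fall back to a more lenient parse (e.g. single-quoted dicts)
        try:
            data = ast.literal_eval(content)
        except (ValueError, SyntaxError):
            raise ValueError(f"Unable to parse Qwen output: {e}") from e

    # Validate the data structure
    if not isinstance(data, dict) or "box" not in data:
        raise ValueError("Response is missing the 'box' field")

    # Return a normalized prompt structure
    return {
        "type": "box",
        "coords": data["box"],
        "object": data.get("object", "target"),
    }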

Prompt type reference:

How the user describes the target	Prompt type	Example
"Tap this button"	point	`{"type": "point", "coords": [x, y]}`
"The person in red in the middle"	box	`{"type": "box", "coords": [x1,y1,x2,y2]}`
"Cut along this outline"	points	`{"type": "points", "coords": [[x1,y1], [x2,y2], ...]}`
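
A small sketch of how the three prompt shapes in the table might be constructed and checked in one place. `make_prompt` is a hypothetical helper; the shapes themselves are taken from the table.

```python
def make_prompt(prompt_type, coords):
    """Build a prompt dict of the shape listed in the table, checking the coordinate layout."""
    def is_number(v):
        return isinstance(v, (int, float))

    if prompt_type == "point":
        ok = len(coords) == 2 and all(is_number(v) for v in coords)
    elif prompt_type == "box":
        ok = len(coords) == 4 and all(is_number(v) for v in coords)
    elif prompt_type == "points":
        # A non-empty list of [x, y] pairs
        ok = len(coords) > 0 and all(
            isinstance(p, (list, tuple)) and len(p) == 2 for p in coords
        )
    else:
        raise ValueError(f"Unknown prompt type: {prompt_type!r}")
    if not ok:
        raise ValueError(f"Coordinates {coords!r} do not match type {prompt_type!r}")
    return {"type": prompt_type, "coords": list(coords)}
```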

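🔹 Step 4: Call VIS Image Segmentation

What it does: Uses the structured prompt to call VIS and perform high-precision image segmentation.

API: SegmentImageRequest (Vision Intelligence Platform)

HTTP request:

http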

POST /image_segmentation HTTP/1.1
Host: vis.cn-shanghai.aliyuncs.com
Content-Type: application/json
Authorization: YOUR_ACCESS_TOKEN

{
  "Action": "SegmentImage",
  "ImageURL": "https://example.com/cat.jpg",
  "Method": "interactive",
  "Prompt": {
    "Type": "box",
    "Coords": [50, 200, 180, 350]
  }
}

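Request parameters in detail:

Parameter	Type	Required	Description
ImageURL	string	Yes	URL of the original image
Method	string	Yes	Always "interactive" (interactive segmentation)
Prompt.Type	string	Yes	Prompt type: box/point/points
Prompt.Coords	array	Yes	Coordinate data; the layout depends on Type
OutputFormat	string	No	Output format: png/mask; defaults to png

Coordinate system:

The origin (0,0) is at the top-left corner of the image
Coordinates are absolute pixel values (not normalized)
box format: [x_min, y_min, x_max, y_max]
point format: [x, y]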
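
Because VIS expects absolute pixels, a box that arrives on a normalized scale has to be converted first. The sketch below is illustrative only: `to_absolute_box` is a hypothetical helper, and the default scale of 1000 is an assumption that should be checked against the output convention of the model version in use.

```python
def to_absolute_box(box, image_width, image_height, scale=1000):
    """Convert a box on a 0..scale grid to absolute pixel coordinates."""
    x1, y1, x2, y2 = box
    abs_box = [
        round(x1 / scale * image_width),
        round(y1 / scale * image_height),
        round(x2 / scale * image_width),
        round(y2 / scale * image_height),
    ]
    # Clamp to the image bounds in case of rounding or slight overshoot
    limits = [image_width, image_height, image_width, image_height]
    return [min(max(v, 0), limit) for v, limit in zip(abs_box, limits)]
```

If the model already returns absolute pixels, this step should be skipped entirely.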

🔹 Step 5: Process the Segmentation Result

What it does: Parses the VIS response and processes the segmented image.

Response structure:

json

{
  "MaskURL": "https://vis-result-bucket/mask_123.png",
  "AlphaImageURL": "https://vis-result-bucket/alpha_123.png",
  "Contours": [
    {"x": 52, "y": 205},
    {"x": 55, "y": 208},
    ...
  ],
  "Confidence": 0.96
}

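Key fields:

Field	Type	Description
MaskURL	string	URL of the grayscale mask image (values 0-255; 255 is foreground)
AlphaImageURL	string	URL of the result image with a transparency channel (PNG)
Contours	array	Array of contour point coordinates (for fine adjustment)
Confidence	float	Segmentation confidence (0.0-1.0)

Suggested post-processing:

python

import requests
from PIL import Image
import numpy as np
from io import BytesIO

def process_segmentation_result(result):
    """Download and post-process the segmentation result."""
    # Download the image with the transparency channel
    response = requests.get(result["AlphaImageURL"], timeout=30)
    response.raise_for_status()
    img = Image.open(BytesIO(response.content))
    
    # (Optional) edge refinement
    if "Contours" in result:
        # Collect contour points as an (N, 2) array, e.g. for smoothing with OpenCV
        contours = np.array([[p["x"], p["y"]] for p in result["Contours"]])
        # ...edge refinement code
    
    return img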
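
The edge refinement step is left open in the snippet above. As one possible illustration, the sketch below smooths the `Contours` points with a circular moving average in pure Python. It is a simple stand-in, not the OpenCV-based refinement the comment refers to, and `smooth_contour` is a hypothetical helper.

```python
def smooth_contour(points, window=3):
    """Smooth a closed contour of {"x", "y"} points with a circular moving average."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    n = len(points)
    if n == 0:
        return []
    half = window // 2
    smoothed = []
    for i in range(n):
        # Wrap around the ends because the contour is treated as closed
        neighbours = [points[(i + k) % n] for k in range(-half, half + 1)]
        avg_x = sum(p["x"] for p in neighbours) / window
        avg_y = sum(p["y"] for p in neighbours) / window
        smoothed.append({"x": avg_x, "y": avg_y})
    return smoothed
```

A larger window removes more jitter but also rounds off genuine corners, so the value would need tuning per use case.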

🔹 Step 6: Return the Final Matting Result

What it does: Returns the processed matting result to the user.

Standard response format:

json

{
  "status": "success",
  "result": {
    "original_image": "https://example.com/cat.jpg",
    "mask_image": "https://vis-result-bucket/mask_123.png",
    "transparent_image": "https://vis-result-bucket/alpha_123.png",
    "confidence": 0.96,
    "object_name": "black cat",
    "processing_time": 1.25
  }
}

Error response format:

json

{
  "status": "error",
  "code": "VIS_400",
  "message": "Invalid prompt coordinates",
  "details": {
    "coords": [
      50,
      200,
      180,
      350
    ],
    "image_size": [
      800,
      600
    ]
  }
}

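🔹 Step 7: User Confirmation/Adjustment

What it does: The user accepts the result or asks for an adjustment.

Interaction options:

✅ Accept the result
✏️ Adjust the region manually (sends back new coordinates)
🔄 Describe the target again (new instruction)

Example adjustment request:

json

{
  "action": "adjust",
  "new_instruction": "Please cut out only the cat's head, excluding the body",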
  "additional_points": [
    [
      100,
      250
    ],
    [
      120,
      260
    ]
  ]
}

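Adjustment flow:

Receive the user's adjustment request
Merge the new instruction and coordinates into the original request
Call Qwen-VL again to interpret the updated request
Call VIS with the enriched prompt
Return the refined result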
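
The merge step in this flow could be sketched as follows. `merge_adjustment` is a hypothetical helper; the field names `action`, `new_instruction`, and `additional_points` come from the example adjustment request, while carrying `additional_points` on the merged request is an assumption.

```python
def merge_adjustment(original_request, adjustment):
    """Merge an adjust request into the original request without mutating either."""
    if adjustment.get("action") != "adjust":
        raise ValueError("Only 'adjust' actions can be merged")
    merged = dict(original_request)
    # A new instruction replaces the old one; the image URL is kept
    if adjustment.get("new_instruction"):
        merged["instruction"] = adjustment["new_instruction"]
    # Extra points accumulate across successive adjustments
    previous = original_request.get("additional_points", [])
    extra = adjustment.get("additional_points") or []
    merged["additional_points"] = [list(p) for p in previous] + [list(p) for p in extra]
    return merged
```

The merged request would then be passed back through Steps 2 to 6.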

🛠️ Practical Usage Guide

1. Preparation

Activate the services:

Vision Intelligence Open Platform
Tongyi Lab

Obtain credentials:

python

# Obtain these from the Alibaba Cloud console
DASHSCOPE_API_KEY = "your_dashscope_api_key"
ALIYUN_ACCESS_KEY_ID = "your_access_key_id"
ALIYUN_ACCESS_KEY_SECRET = "your_access_key_secret"

Install dependencies:

bash

pip install aliyun-python-sdk-core dashscope requests pillow numpy

2. Complete Implementation

python

import json
import time
import requests
from io import BytesIO
from PIL import Image
import dashscope
from aliyunsdkcore.client import AcsClient
from aliyunsdkvis.request.v20200703 import SegmentImageRequest

# Initialize the clients
dashscope.api_key = "your_dashscope_api_key"
client = AcsClient(
    "your_access_key_id",
    "your_access_key_secret",
    "cn-shanghai"
)

def analyze_with_qwen(image_url, instruction):
    """Call Qwen-VL to interpret the image and instruction and return the target coordinates."""
    start_time = time.time()
    response = dashscope.MultiModalConversation.call(
        model="qwen-vl-plus",
        messages=[
            {
                "role": "user",
                "content": [
                    {"image": image_url},
                    {"text": f"{instruction}. Return the bounding box coordinates [x1,y1,x2,y2] in JSON format, containing only the box and object fields"}
                ]
            }
        ]
    )
    response["processing_time"] = time.time() - start_time
    return response

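def parse_qwen_response(qwen_output):
    """Parse the Qwen-VL response into a prompt that VIS can use."""
    import ast
    content = qwen_output["output"]["choices"][0]["message"]["content"]
    # Content may arrive as a list of parts such as [{"text": "..."}]
    if isinstance(content, list):
        content = "".join(p.get("text", "") for p in content if isinstance(p, dict))
    try:
        data = json.loads(content)
    except json.JSONDecodeError as e:
        # Fall back to a more lenient parse (e.g. single-quoted dicts)
        try:
            data = ast.literal_eval(content)
        except (ValueError, SyntaxError):
            raise ValueError(f"Unable to parse Qwen output: {e}") from e
    if not isinstance(data, dict) or "box" not in data:
        raise ValueError("Response is missing the 'box' field")
    return {
        "type": "box",
        "coords": data["box"],
        "object": data.get("object", "target")
    }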

def call_vis_segmentation(image_url, prompt):
    """Call Alibaba Cloud VIS to segment the image."""
    start_time = time.time()
    request = SegmentImageRequest.SegmentImageRequest()
    request.set_accept_format('json')
    request.set_ImageURL(image_url)
    request.set_Method("interactive")
    request.set_Prompt({
        "Type": prompt["type"],
        "Coords": prompt["coords"]
    })
    
    response = client.do_action_with_exception(request)
    data = json.loads(response)["Data"]
    # Attach the elapsed time to the returned payload so callers can read it
    data["processing_time"] = time.time() - start_time
    return data

def custom_image_matting(image_url, instruction, max_retries=3):
    """
    Main entry point for custom image matting.
    
    Args:
        image_url (str): Image URL
        instruction (str): User instruction
        max_retries (int): Maximum number of retries
        
    Returns:
        dict: Dictionary containing the result and metadata
    """
    try:
        # Step 2: Qwen-VL semantic understanding
        qwen_response = analyze_with_qwen(image_url, instruction)
        
        # Step 3: generate the structured prompt
        prompt = parse_qwen_response(qwen_response)
        
        # Step 4: call VIS segmentation
        vis_result = call_vis_segmentation(image_url, prompt)
        
        # Step 5: process the result
        if "AlphaImageURL" in vis_result:
            # Step 6: return the final result
            return {
                "status": "success",
                "result": {
                    "original_image": image_url,
                    "transparent_image": vis_result["AlphaImageURL"],
                    "mask_image": vis_result.get("MaskURL"),
                    "confidence": vis_result.get("Confidence", 0.0),
                    # Use the object name reported by Qwen-VL, not the raw instruction
                    "object_name": prompt["object"],
                    "processing_time": vis_result.get("processing_time"),
                    "prompt": prompt
                }
            }
        else:
            return {
                "status": "error",
                "code": "VIS_NO_RESULT",
                "message": "VIS response is missing required fields",
                "details": vis_result
            }
            
    except Exception as e:
        if max_retries > 0:
            # Retry recursively (up to max_retries additional attempts)
            return custom_image_matting(image_url, instruction, max_retries-1)
        return {
            "status": "error",
            "code": "PROCESSING_ERROR",
            "message": str(e),
            "details": {
                "image_url": image_url,
                "instruction": instruction
            }
        }

3. Usage Example

python

# Basic usage
result = custom_image_matting(
    image_url="https://your-bucket/example.jpg",
    instruction="Please cut out the man in the red T-shirt in full"
)

if result["status"] == "success":
    print("Matting succeeded! Transparent image URL:", result["result"]["transparent_image"])
    # Download and display the result
    def download_image(url):
        response = requests.get(url)
        return Image.open(BytesIO(response.content))
    img = download_image(result["result"]["transparent_image"])
    img.show()
else:
    print("Matting failed:", result["message"])

# Handle an adjustment request
adjust_result = custom_image_matting(
    image_url="https://your-bucket/example.jpg",
    instruction="Please cut out only the cat's head, excluding the body"
)

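4. Troubleshooting

Problem	Solution
Qwen-VL returns non-JSON output	1. Use a stricter prompt 2. Implement multi-level parsing (regex/AST) 3. Set a maximum retry count
VIS segmentation is imprecise	1. Try a point prompt (point) instead of a box prompt (box) 2. Add more prompt points 3. Add edge refinement in post-processing
Requests fail under high concurrency	1. Configure rate limiting with Alibaba Cloud API Gateway 2. Implement a request queue and retry mechanism 3. Consider Function Compute (FC) for automatic scaling
Chinese instructions are poorly understood	1. Prefix the instruction with "【中文指令】" ("[Chinese instruction]") 2. Provide example instructions 3. Use a more detailed description (color, position, distinguishing features)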
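
For the first row of the table, the multi-level parsing idea might look like the sketch below: a regex pulls out the brace-delimited span, then JSON parsing is tried before a Python-literal fallback. `extract_json_object` is a hypothetical helper and only handles output that contains a single object.

```python
import ast
import json
import re

def extract_json_object(text):
    """Pull a {...} object out of free-form model output."""
    # Level 1: grab everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    candidate = match.group(0)
    try:
        # Level 2: strict JSON
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Level 3: tolerate Python-style literals such as single quotes
        try:
            return ast.literal_eval(candidate)
        except (ValueError, SyntaxError) as exc:
            raise ValueError("Found braces but could not parse them") from exc
```

This tolerates replies wrapped in explanatory text or Markdown code fences, which is a common reason strict `json.loads` fails on model output.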

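💡 Optimization Suggestions

Caching

Cache requests that pair the same image with a similar instruction
Store recent results in Redis (with a 24-hour expiry)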
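
The caching idea can be sketched with an in-memory dictionary standing in for Redis. `ResultCache` is a hypothetical class, and "similar instruction" is approximated here only by collapsing whitespace and lowercasing, which is a much weaker notion of similarity than semantic matching.

```python
import hashlib
import time

class ResultCache:
    """In-memory TTL cache keyed by image URL plus normalized instruction."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store = {}

    @staticmethod
    def make_key(image_url, instruction):
        # Collapse whitespace and case so trivially different instructions share a key
        normalized = " ".join(instruction.split()).lower()
        digest = hashlib.sha256(f"{image_url}\n{normalized}".encode("utf-8")).hexdigest()
        return f"matting:{digest}"

    def get(self, image_url, instruction):
        key = self.make_key(image_url, instruction)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on read
            del self._store[key]
            return None
        return value

    def set(self, image_url, instruction, value):
        key = self.make_key(image_url, instruction)
        self._store[key] = (self._clock() + self._ttl, value)
```

With Redis, the same key could be stored with a 24-hour expiry set on write, so the server handles eviction instead of the lazy check in `get`.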

Batch processing

python

def batch_matting(image_urls, instructions):
    """Process multiple matting requests concurrently."""
    from concurrent.futures import ThreadPoolExecutor
    
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [
            executor.submit(custom_image_matting, url, instr)
            for url, instr in zip(image_urls, instructions)
        ]
        return [f.result() for f in futures]

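Front-end integration

Build a visual adjustment interface (with a draggable bounding box)
Add live preview
Support freehand corrections

Error monitoring

Log failed requests and their causes
Set thresholds that trigger automatic alerts
Analyze common failure patterns regularly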
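
A minimal sketch of the monitoring points above, assuming results follow the `status`/`code` shape of the response formats in Step 6. `FailureMonitor`, its window size, and its 20% threshold are all illustrative assumptions.

```python
from collections import Counter, deque

class FailureMonitor:
    """Track recent outcomes, flag a high failure rate, and count failure codes."""

    def __init__(self, window_size=100, alert_threshold=0.2):
        # Only the most recent window_size outcomes count toward the rate
        self._outcomes = deque(maxlen=window_size)
        # Failure codes are counted cumulatively, not per window
        self._reasons = Counter()
        self._threshold = alert_threshold

    def record(self, result):
        failed = result.get("status") != "success"
        self._outcomes.append(failed)
        if failed:
            self._reasons[result.get("code", "UNKNOWN")] += 1

    def failure_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self):
        return self.failure_rate() > self._threshold

    def top_failures(self, n=3):
        return self._reasons.most_common(n)
```

Each value returned by `custom_image_matting` would be passed to `record`, with `should_alert` polled by whatever alerting channel is in use.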