Tongyi Qianwen (5): Image Analysis

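5. Multimodal

5.1. Image Analysis

5.1.1. Introduction

Qwen-VL (Tongyi Qianwen VL) is a Large Vision Language Model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input and produces text and bounding boxes as output. It supports Chinese multimodal dialogue and multi-image dialogue, and is reported to outperform general-purpose models of comparable size. It is described as the first general-purpose model to support Chinese open-domain grounding and the first open-source LVLM with 448 resolution.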

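The main features of the Qwen-VL model are:

Strong performance: on standard English benchmarks across four categories of multimodal tasks (zero-shot captioning, VQA, DocVQA, and grounding), it achieves the best results among general-purpose models of comparable size;
Multilingual dialogue: natively supports dialogue in English, Chinese, and other languages, with end-to-end recognition of long Chinese and English text in images;
Interleaved multi-image dialogue: supports multi-image input and comparison, question answering about a specified image, and creative writing based on multiple images;
First general-purpose model with Chinese open-domain grounding: bounding boxes can be annotated from open-domain expressions written in Chinese;
Fine-grained recognition and understanding: other open-source LVLMs currently use 224 resolution, while Qwen-VL is the first open-source LVLM at 448 resolution. The higher resolution improves fine-grained text recognition, document question answering, and bounding-box annotation.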

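The upgraded Qwen-VL models (qwen-vl-plus / qwen-vl-max) add the following:

Greatly improved handling of text in images, helping you extract, organize, and summarize textual information.
A wider range of supported resolutions: images of any resolution and aspect ratio can be processed, and large or very tall images remain legible.
Stronger visual reasoning and decision making, suitable for building visual agents and extending what LLM-based agents can do.
Better problem solving from pictures: take a photo of an exercise, send it to Qwen-VL, and the model works through the solution step by step.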

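5.1.2. Model Overview

The user sends a message sequence (messages) that contains the multi-turn conversation history and the current instruction, with content supplied as text and image URLs. The model returns its generated reply as output. During this process, text is converted into a token sequence that the language model can process.

Images are converted into token sequences according to their pixel dimensions: each 28×28 pixel block corresponds to one token. If the width or height is not an integer multiple of 28, it is rounded up to the next multiple of 28 for the calculation. A single image accounts for at least 4 tokens and at most 1280 tokens.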
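
The rounding rule above can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical, and clamping the result into the range 4 to 1280 is an assumption about how the stated minimum and maximum are applied (the service may resize large images instead).

```java
// Hypothetical helper, not part of the DashScope SDK.
final class ImageTokenEstimator {
    private static final int PATCH = 28;        // one token per 28x28 pixel block
    private static final int MIN_TOKENS = 4;    // stated minimum per image
    private static final int MAX_TOKENS = 1280; // stated maximum per image

    /** Estimates the token count for an image of the given pixel size. */
    static int estimateTokens(int width, int height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("width and height must be positive");
        }
        // Round each side up to the next multiple of 28, counted in blocks.
        long cols = ((long) width + PATCH - 1) / PATCH;
        long rows = ((long) height + PATCH - 1) / PATCH;
        long tokens = cols * rows;
        // Assumption: values outside [4, 1280] are clamped to the bounds.
        return (int) Math.max(MIN_TOKENS, Math.min(MAX_TOKENS, tokens));
    }

    public static void main(String[] args) {
        System.out.println(estimateTokens(448, 448));   // 16 * 16 blocks = 256
        System.out.println(estimateTokens(1024, 1024)); // 37 * 37 = 1369, clamped to 1280
        System.out.println(estimateTokens(20, 20));     // 1 block, raised to 4
    }
}
```

For example, a 448×448 image covers 16×16 blocks, so under these assumptions it counts as 256 tokens.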

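Model	Price	Base rate limit
qwen-vl-plus	CNY 0.008 / 1,000 tokens	Throttling is triggered when either limit is exceeded: requests ≤ 60 QPM (at most 60 complete requests per minute); token usage ≤ 100,000 TPM (at most 100,000 tokens consumed per minute).
qwen-vl-max	CNY 0.02 / 1,000 tokens	Throttling is triggered when either limit is exceeded: requests ≤ 15 QPM (at most 15 complete requests per minute); token usage ≤ 25,000 TPM (at most 25,000 tokens consumed per minute).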
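
As a rough illustration of the pricing table, the sketch below multiplies a token count by the listed unit price. The class and method names are hypothetical, and it assumes every token of a request is billed at the single listed rate; check the current price list before relying on the numbers.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of the DashScope SDK.
final class VlCostEstimator {
    private static final BigDecimal PLUS_PRICE = new BigDecimal("0.008"); // CNY per 1,000 tokens
    private static final BigDecimal MAX_PRICE = new BigDecimal("0.02");   // CNY per 1,000 tokens
    private static final BigDecimal PER = BigDecimal.valueOf(1000);

    /** Returns the estimated cost in CNY for the given model and token count. */
    static BigDecimal estimateCostYuan(String model, long tokens) {
        if (tokens < 0) {
            throw new IllegalArgumentException("tokens must not be negative");
        }
        BigDecimal price;
        if ("qwen-vl-plus".equals(model)) {
            price = PLUS_PRICE;
        } else if ("qwen-vl-max".equals(model)) {
            price = MAX_PRICE;
        } else {
            throw new IllegalArgumentException("unknown model: " + model);
        }
        // Dividing by 1000 always terminates, so no rounding mode is needed.
        return price.multiply(BigDecimal.valueOf(tokens)).divide(PER);
    }

    public static void main(String[] args) {
        // One full minute at the qwen-vl-plus TPM limit (100,000 tokens).
        System.out.println(estimateCostYuan("qwen-vl-plus", 100_000));
    }
}
```

Under these assumptions, running qwen-vl-plus at its full 100,000 TPM limit for one minute costs about CNY 0.8.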

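5.1.3. Image Limits

Input images are subject to the following limits:

The image file size must not exceed 10 MB.
The total number of pixels must not exceed 1,048,576, which equals the pixel count of a 1024×1024 image.
At most 10 images can be uploaded in a single request.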
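
The limits above can be checked on the client before a request is sent. The following is a minimal sketch with hypothetical class and method names; it assumes that "10 MB" means 10 × 1024 × 1024 bytes, which the source does not specify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check, not part of the DashScope SDK.
final class ImageLimits {
    static final long MAX_BYTES = 10L * 1024 * 1024; // assumption: 10 MB = 10 MiB
    static final long MAX_PIXELS = 1_048_576L;       // 1024 * 1024
    static final int MAX_IMAGES = 10;

    /** Returns a list of violated limits for one image; empty if the image is acceptable. */
    static List<String> violations(long fileBytes, int width, int height) {
        List<String> problems = new ArrayList<>();
        if (fileBytes > MAX_BYTES) {
            problems.add("file size exceeds 10 MB");
        }
        if ((long) width * height > MAX_PIXELS) {
            problems.add("pixel count exceeds 1,048,576");
        }
        return problems;
    }

    /** Returns true if the number of images in one request is within the limit. */
    static boolean withinCountLimit(int imageCount) {
        return imageCount >= 0 && imageCount <= MAX_IMAGES;
    }

    public static void main(String[] args) {
        System.out.println(violations(2_000_000L, 1024, 1024)); // no violations
        System.out.println(violations(2_000_000L, 2000, 2000)); // too many pixels
        System.out.println(withinCountLimit(11));               // over the image count limit
    }
}
```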

Supported image formats:

Format	Content Type	File extensions
BMP	image/bmp	.bmp
JPEG	image/jpeg	.jpeg, .jpg
PNG	image/png	.png
TIFF	image/tiff	.tif, .tiff
WEBP	image/webp	.webp
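
The format table can be turned into a simple lookup by file extension. This is a sketch with hypothetical names; it inspects only the file name and does not verify the actual file contents.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup built from the table above, not part of the DashScope SDK.
final class SupportedImageFormats {
    private static final Map<String, String> CONTENT_TYPES = new HashMap<>();

    static {
        CONTENT_TYPES.put("bmp", "image/bmp");
        CONTENT_TYPES.put("jpeg", "image/jpeg");
        CONTENT_TYPES.put("jpg", "image/jpeg");
        CONTENT_TYPES.put("png", "image/png");
        CONTENT_TYPES.put("tif", "image/tiff");
        CONTENT_TYPES.put("tiff", "image/tiff");
        CONTENT_TYPES.put("webp", "image/webp");
    }

    /** Returns the content type for a file name, or null if the extension is not listed. */
    static String contentTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return CONTENT_TYPES.get(ext);
    }

    public static void main(String[] args) {
        System.out.println(contentTypeFor("photo.JPG")); // image/jpeg
        System.out.println(contentTypeFor("anim.gif"));  // null, GIF is not in the table
    }
}
```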

5.1.4. Online Images


import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.Arrays;
import java.util.Collections;

@RestController
@RequestMapping("/tongyi")
public class AigcVlController {

    @Value("${tongyi.api-key}")
    private String apiKey;


    public  MultiModalConversationResult simpleMultiModalConversationCall(String message)
            throws ApiException, NoApiKeyException, UploadFileException {
        // Set the API key
        Constants.apiKey = apiKey;

        MultiModalConversation conv = new MultiModalConversation();

        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", message)))
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model(MultiModalConversation.Models.QWEN_VL_PLUS)
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result);

        return result;
    }


    @RequestMapping("/aigc/vl")
    public String callBase(@RequestParam(value = "message", required = false, defaultValue = "这是什么?") String message) throws NoApiKeyException, InputRequiredException {

        try {
            MultiModalConversationResult result = simpleMultiModalConversationCall(message);
            return result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text").toString();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        return null;
    }
}

5.1.4.1. Test


GET http://localhost:8081/tongyi/aigc/vl

HTTP/1.1 200 

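This picture shows a woman and her dog on a beach. They appear to be enjoying each other's company, with the dog sitting on the sand and holding out a paw to shake hands or interact with the woman. The background is a beautiful sunset, with waves gently lapping at the shoreline.

Please note that my description is based on what is visible in the image and does not include any speculative interpretation beyond the visual information. If you need more information about the scene, objects, or other details, please let me know!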

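5.1.5. Local Images

The API can also be called with local files. When passing a file path, adjust it to your operating system and the file's location, as shown in the table below.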

OS	File path to pass	Example
Linux or macOS	`file://{absolute path of the file}`	`file:///home/images/test.png`
Windows	`file:///{absolute path of the file}`	`file:///D:/images/test.png`
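
The two path rules in the table can be combined into one helper. This is a sketch with hypothetical names that works purely on strings: it assumes the input is already an absolute path and treats anything not starting with `/` as a Windows path.

```java
// Hypothetical helper, not part of the DashScope SDK.
final class LocalFileUrl {
    /** Builds the file URL expected for a local image from an absolute path. */
    static String toFileUrl(String absolutePath) {
        // Windows paths may use backslashes; URLs always use forward slashes.
        String normalized = absolutePath.replace('\\', '/');
        if (normalized.startsWith("/")) {
            // Linux or macOS: file:// + /home/... gives file:///home/...
            return "file://" + normalized;
        }
        // Windows: file:/// + D:/... gives file:///D:/...
        return "file:///" + normalized;
    }

    public static void main(String[] args) {
        System.out.println(toFileUrl("/home/images/test.png")); // file:///home/images/test.png
        System.out.println(toFileUrl("D:\\images\\test.png"));  // file:///D:/images/test.png
    }
}
```

The helper deliberately does not percent-encode the path, because the examples in the table pass the path through unchanged.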

Code example


import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.Arrays;
import java.util.HashMap;

@RestController
@RequestMapping("/tongyi")
public class AigcVlLocalController {

    @Value("${tongyi.api-key}")
    private String apiKey;


    public MultiModalConversationResult callWithLocalFile(String message)
            throws ApiException, NoApiKeyException, UploadFileException {
        // Set the API key
        Constants.apiKey = apiKey;

        String localFilePath1 = "file:///D:/upload/图片1.png";
        String localFilePath2 = "file:///D:/upload/图片2.png";
        // Create the conversation
        MultiModalConversation conv = new MultiModalConversation();

        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{ put("image", localFilePath1); } },
                        new HashMap<String, Object>(){{ put("image", localFilePath2); } },
                        new HashMap<String, Object>(){{ put("text", message); } } )
                )
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model(MultiModalConversation.Models.QWEN_VL_PLUS)
                .message(userMessage)
                .build();

        MultiModalConversationResult result = conv.call(param);

        return result;
    }


    @RequestMapping("/aigc/local")
    public String callBase(@RequestParam(value = "message", required = false, defaultValue = "这是什么?") String message) throws NoApiKeyException, InputRequiredException {

        try {
            MultiModalConversationResult result = callWithLocalFile(message);
            return result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text").toString();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        return null;
    }
}

5.1.5.1. Test


### Ask the model to rate the similarity of the two images on a 10-point scale
GET http://localhost:8081/tongyi/aigc/local?message=10分满分,两张图片的相似度你给几分
    
    
    
### Ask the model to extract the text from the first image
GET http://localhost:8081/tongyi/aigc/local?message=提取第一张图片上的文字