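Azure Machine Learning - How to Use GPT-4 Turbo with Vision

An introduction to using GPT-4 Turbo with Vision in Azure.
Follow TechLead for knowledge sharing across all dimensions of AI. The author has 10+ years of experience in internet service architecture, AI product R&D, and team management; holds a bachelor's degree from Tongji University and a master's degree from Fudan University; is a member of the Fudan Robotics Intelligence Lab, an Alibaba Cloud certified senior architect, and a project management professional; and has led R&D for AI products with revenue in the hundreds of millions.

Introduction to GPT-4 Turbo with Vision

GPT-4 Turbo with Vision is a large multimodal model (LMM) developed by OpenAI that can analyze images and provide textual responses to questions about them. It combines natural language processing with visual understanding, so it can answer general questions about an image. With Vision enhancement, it can also take video as input.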

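Call the Chat Completion API

The following REST command shows the most basic way to use the GPT-4 Turbo with Vision model from code.

Send a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/{DEPLOYMENT_NAME}/chat/completions?api-version=2023-12-01-preview

RESOURCE_NAME is the name of your OpenAI resource
DEPLOYMENT_NAME is the name of your GPT-4 Turbo with Vision model deployment

Required headers:

Content-Type: application/json
api-key: {API_KEY}

Body: The following is a sample request body. The format is the same as the chat completions API for GPT-4, except that the message content can be an array containing strings and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image). Remember to set a "max_tokens" value, or the return output will be cut off.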

{
    "messages": [ 
        {
            "role": "system", 
            "content": "You are a helpful assistant." 
        },
        {
            "role": "user", 
            "content": [
            {
                "type": "text",
                "text": "Describe this picture:"
            },
            {
                "type": "image_url",
                "image_url": {
                        "url": "<URL or base-64-encoded image>"
                    }
                } 
           ] 
        }
    ],
    "max_tokens": 100, 
    "stream": false 
}
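
As a rough illustration of the call described above, the following Python sketch assembles the URL, the two required headers, and the request body using only the standard library. The helper names (build_chat_url, build_vision_body, build_request) are hypothetical and exist only for this example; the URL template, header names, and body fields are taken from the text.

```python
import json
import urllib.request

API_VERSION = "2023-12-01-preview"  # the api-version used in the URL above


def build_chat_url(resource_name: str, deployment_name: str) -> str:
    """Fill the endpoint template from the text with concrete names."""
    return (
        f"https://{resource_name}.openai.azure.com/openai/deployments/"
        f"{deployment_name}/chat/completions?api-version={API_VERSION}"
    )


def build_vision_body(prompt: str, image_url: str, max_tokens: int = 100) -> dict:
    """Build the sample body: a system message plus one text part and one image part."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        "max_tokens": max_tokens,  # set explicitly so the output is not cut off
        "stream": False,
    }


def build_request(url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Attach the required headers. Nothing is sent until urlopen() is called."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
```

Passing the result of build_request to urllib.request.urlopen would send the call; that step is left out here because it needs a real resource name, deployment name, and key.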

Output

The API response should look like the following:

{
    "id": "chatcmpl-8VAVx58veW9RCm5K1ttmxU6Cm4XDX",
    "object": "chat.completion",
    "created": 1702439277,
    "model": "gpt-4",
    "prompt_filter_results": [
        {
            "prompt_index": 0,
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "choices": [
        {
            "finish_details": {
                "type": "stop",
                "stop": "<|fim_suffix|>"
            },
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The picture shows an individual dressed in formal attire, which includes a black tuxedo with a black bow tie. There is an American flag on the left lapel of the individual's jacket. The background is predominantly blue with white text that reads \"THE KENNEDY PROFILE IN COURAGE AWARD\" and there are also visible elements of the flag of the United States placed behind the individual."
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "usage": {
        "prompt_tokens": 1156,
        "completion_tokens": 80,
        "total_tokens": 1236
    }
}

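Every response includes a "finish_details" field. Its "type" subfield has the following possible values:

stop: The API returned the complete model output.
max_tokens: The model output is incomplete because of the max_tokens input parameter or the model's token limit.
content_filter: Content was omitted because of a flag from the content filters.

If finish_details.type is stop, there is also a "stop" property that specifies the token that caused the output to end.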
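
A caller usually wants to branch on these values. The sketch below is one possible way to do that on a response that has already been parsed into a Python dict; describe_finish is a hypothetical helper name, and the wording of the returned strings is illustrative only.

```python
def describe_finish(response: dict) -> str:
    """Summarize why the first choice ended, based on its finish_details field."""
    details = response["choices"][0].get("finish_details", {})
    kind = details.get("type")
    if kind == "stop":
        # A "stop" property names the token that ended the output.
        return f"complete (stop token: {details.get('stop')!r})"
    if kind == "max_tokens":
        return "truncated: raise max_tokens or shorten the prompt"
    if kind == "content_filter":
        return "content omitted by the content filter"
    return f"unrecognized finish type: {kind!r}"
```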

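Detail parameter settings in image processing: low, high, auto

The detail parameter offers three choices, low, high, or auto, that adjust how the model interprets and processes images. The default setting is auto, where the model decides between low and high based on the size of the image input.

low setting: the model does not activate "high res" mode and instead processes a lower-resolution 512x512 version of the image. This gives quicker responses and lower token consumption in scenarios where fine detail is not important.
high setting: the model activates "high res" mode. Here, the model first views the low-resolution image and then generates detailed 512x512 segments from the input image. Each segment uses double the token budget, which allows a more detailed interpretation of the image.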
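
The text does not show where the detail value goes in the request. In the OpenAI chat completions message format it sits next to url inside the image_url object, and the sketch below assumes the Azure preview API accepts the same shape. The helper names are hypothetical, and the data: URL form used for base-64 images is likewise an assumption about how an encoded image is passed in the url field.

```python
import base64

ALLOWED_DETAIL = ("low", "high", "auto")


def image_part(url: str, detail: str = "auto") -> dict:
    """Build one image content part with an explicit detail setting."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {ALLOWED_DETAIL}, got {detail!r}")
    # Assumption: "detail" is a sibling of "url" inside "image_url".
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base-64 data URL (assumed input form)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

For example, image_part(to_data_url(raw_bytes), detail="low") would request the faster, cheaper mode for a local image.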

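Use Vision enhancement with images

GPT-4 Turbo with Vision provides exclusive access to Azure AI Services tailored enhancements. When combined with Azure AI Vision, it enhances the chat experience by giving the chat model more detailed information about visible text in the image and the locations of objects.

The Optical Character Recognition (OCR) integration allows the model to produce higher-quality responses for dense text, transformed images, and number-heavy financial documents. It also covers a wider range of languages.

The object grounding integration brings a new layer to data analysis and user interaction, because the feature can visually distinguish and highlight important elements in the images it processes.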

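Send a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/{DEPLOYMENT_NAME}/extensions/chat/completions?api-version=2023-12-01-preview

RESOURCE_NAME is the name of your OpenAI resource
DEPLOYMENT_NAME is the name of your GPT-4 Turbo with Vision model deployment

Required headers:

Content-Type: application/json
api-key: {API_KEY}

Body:

The format is similar to that of the chat completions API for GPT-4, but the message content can be an array containing strings and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image).

You must also include the enhancements and dataSources objects. enhancements represents the Vision enhancement features requested in the chat. It has a grounding property and an ocr property, each of which has a boolean enabled property. Use these to request the OCR service and/or the object detection/grounding service. dataSources represents the Computer Vision resource data needed for Vision enhancement. It has a type property, which should be "AzureComputerVision", and a parameters property. Set endpoint and key to the endpoint URL and access key of your Computer Vision resource. Remember to set a "max_tokens" value, or the return output will be cut off.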

{
    "enhancements": {
            "ocr": {
              "enabled": true
            },
            "grounding": {
              "enabled": true
            }
    },
    "dataSources": [
    {
        "type": "AzureComputerVision",
        "parameters": {
            "endpoint": "<your_computer_vision_endpoint>",
            "key": "<your_computer_vision_key>"
        }
    }],
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
            {
                "type": "text",
                "text": "Describe this picture:"
            },
            {
                "type": "image_url",
                "image_url": {
                        "url":"<URL or base-64-encoded image>" 
                    }
                }
           ] 
        }
    ],
    "max_tokens": 100, 
    "stream": false 
}
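
Since the enhanced request is the basic chat body plus two extra objects sent to the /extensions/ route, it can be derived from an existing body. The following sketch shows one way to do so; add_image_enhancements and extensions_url are hypothetical helper names, while the field names and the URL template come from the text.

```python
import copy


def extensions_url(resource_name: str, deployment_name: str,
                   api_version: str = "2023-12-01-preview") -> str:
    """Endpoint for enhanced calls: note the extra /extensions/ path segment."""
    return (
        f"https://{resource_name}.openai.azure.com/openai/deployments/"
        f"{deployment_name}/extensions/chat/completions?api-version={api_version}"
    )


def add_image_enhancements(body: dict, cv_endpoint: str, cv_key: str,
                           ocr: bool = True, grounding: bool = True) -> dict:
    """Return a copy of a chat body with enhancements and dataSources added."""
    enhanced = copy.deepcopy(body)  # leave the caller's body untouched
    enhanced["enhancements"] = {
        "ocr": {"enabled": ocr},
        "grounding": {"enabled": grounding},
    }
    enhanced["dataSources"] = [
        {
            "type": "AzureComputerVision",
            "parameters": {"endpoint": cv_endpoint, "key": cv_key},
        }
    ]
    return enhanced
```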

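Output

The chat responses you receive from the model should now include enhanced information about the image, such as object labels and bounding boxes, and OCR results. The API response should look like the following: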

{
    "id": "chatcmpl-8UyuhLfzwTj34zpevT3tWlVIgCpPg",
    "object": "chat.completion",
    "created": 1702394683,
    "model": "gpt-4",
    "choices":
    [
        {
            "finish_details":
            {
                "type": "stop",
                "stop": "<|fim_suffix|>"
            },
            "index": 0,
            "message":
            {
                "role": "assistant",
                "content": "The image shows a close-up of an individual with dark hair and what appears to be a short haircut. The person has visible ears and a bit of their neckline. The background is a neutral light color, providing a contrast to the dark hair."
            },
            "enhancements":
            {
                "grounding":
                {
                    "lines":
                    [
                        {
                            "text": "The image shows a close-up of an individual with dark hair and what appears to be a short haircut. The person has visible ears and a bit of their neckline. The background is a neutral light color, providing a contrast to the dark hair.",
                            "spans":
                            [
                                {
                                    "text": "the person",
                                    "length": 10,
                                    "offset": 99,
                                    "polygon": [{"x":0.11950000375509262,"y":0.4124999940395355},{"x":0.8034999370574951,"y":0.4124999940395355},{"x":0.8034999370574951,"y":0.6434999704360962},{"x":0.11950000375509262,"y":0.6434999704360962}]
                                }
                            ]
                        }
                    ],
                    "status": "Success"
                }
            }
        }
    ],
    "usage":
    {
        "prompt_tokens": 816,
        "completion_tokens": 49,
        "total_tokens": 865
    }
}
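
In the sample above, each grounding span carries a text, an offset and length into the line text, and a polygon whose coordinates all lie between 0 and 1. The sketch below assumes those coordinates are fractions of the image width and height, which the text does not state explicitly, and converts each polygon into a pixel bounding box. grounding_boxes is a hypothetical helper name.

```python
def grounding_boxes(response: dict, width: int, height: int) -> list:
    """Collect (text, pixel box) pairs from the grounding spans of a response.

    Assumption: polygon x/y values are normalized to the range 0..1.
    Each box is (left, top, right, bottom) in pixels.
    """
    boxes = []
    for choice in response.get("choices", []):
        grounding = choice.get("enhancements", {}).get("grounding", {})
        for line in grounding.get("lines", []):
            for span in line.get("spans", []):
                xs = [point["x"] * width for point in span["polygon"]]
                ys = [point["y"] * height for point in span["polygon"]]
                box = (round(min(xs)), round(min(ys)),
                       round(max(xs)), round(max(ys)))
                boxes.append({"text": span["text"], "box": box})
    return boxes
```

A response without an enhancements object simply yields an empty list, so the same function can be applied to basic and enhanced responses alike.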

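Every response includes a "finish_details" field. Its "type" subfield has the following possible values:

stop: The API returned the complete model output.
max_tokens: The model output is incomplete because of the max_tokens input parameter or the model's token limit.
content_filter: Content was omitted because of a flag from the content filters.

If finish_details.type is stop, there is also a "stop" property that specifies the token that caused the output to end.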

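Use Vision enhancement with video

GPT-4 Turbo with Vision provides exclusive access to Azure AI Services tailored enhancements. The video prompt integration uses Azure AI Vision video retrieval to sample a set of frames from a video and create a transcript of the speech in the video. It enables the AI model to give summaries and answers about video content.

Follow these steps to set up a video retrieval system and integrate it with your AI chat model:

Get an Azure AI Vision resource in the same region as the Azure OpenAI resource you're using.
Create a video retrieval index by following the instructions in the guide "Do video retrieval using vectorization". Return to this guide once your index is created.
Save the index name, the documentId values of your videos, and the blob storage SAS URLs of your videos to a temporary location. You'll need these parameters in later steps.
Prepare a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/{DEPLOYMENT_NAME}/extensions/chat/completions?api-version=2023-12-01-preview
- RESOURCE_NAME is the name of your OpenAI resource
- DEPLOYMENT_NAME is the name of your GPT-4 Turbo with Vision model deployment
Required headers:
- Content-Type: application/json
- api-key: {API_KEY}

Add the following JSON structure to the request body: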

{
    "enhancements": {
            "video": {
              "enabled": true
            }
    },
    "dataSources": [
    {
        "type": "AzureComputerVisionVideoIndex",
        "parameters": {
            "endpoint": "<your_computer_vision_endpoint>",
            "key": "<your_computer_vision_key>",
            "computerVisionBaseUrl": "<your_computer_vision_endpoint>",
            "computerVisionApiKey": "<your_computer_vision_key>",
            "indexName": "<name_of_your_index>",
            "videoUrls": ["<your_video_SAS_URL>"]
        }
    }],
    "messages": [ 
        {
            "role": "system", 
            "content": "You are a helpful assistant." 
        },
        {
            "role": "user",
            "content": [
                    {
                        "type": "text",
                        "text": "Describe this video:"
                    }
                ]
        },
        {
            "role": "user",
            "content": [
                    {
                        "type": "acv_document_id",
                        "acv_document_id": "<your_video_ID>"
                    }
                ]
        }
    ],
    "max_tokens": 100
}
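
The video request body has several placeholders that are easy to mix up, so it can help to generate it from named arguments. The sketch below mirrors the JSON structure above field for field; build_video_body is a hypothetical helper name, and the same endpoint and key values are reused for both pairs of Computer Vision parameters, as in the sample.

```python
import json


def build_video_body(question: str, cv_endpoint: str, cv_key: str,
                     index_name: str, video_sas_url: str, document_id: str,
                     max_tokens: int = 100) -> dict:
    """Build a video prompt body with one question and one indexed video."""
    return {
        "enhancements": {"video": {"enabled": True}},
        "dataSources": [
            {
                "type": "AzureComputerVisionVideoIndex",
                "parameters": {
                    "endpoint": cv_endpoint,
                    "key": cv_key,
                    "computerVisionBaseUrl": cv_endpoint,
                    "computerVisionApiKey": cv_key,
                    "indexName": index_name,
                    "videoUrls": [video_sas_url],
                },
            }
        ],
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [{"type": "text", "text": question}]},
            {
                "role": "user",
                "content": [
                    {"type": "acv_document_id", "acv_document_id": document_id}
                ],
            },
        ],
        "max_tokens": max_tokens,
    }


def to_json(body: dict) -> str:
    """Serialize with json.dumps, which never emits a trailing comma."""
    return json.dumps(body, indent=4)
```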

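The request includes the enhancements and dataSources objects. enhancements represents the Vision enhancement features requested in the chat. dataSources represents the Computer Vision resource data needed for Vision enhancement. It has a type property, which should be "AzureComputerVisionVideoIndex", and a parameters property that contains your AI Vision and video information.

Next, fill in all the <placeholder> fields above with your own information: enter the endpoint URLs and keys of your OpenAI and AI Vision resources where appropriate, and use the video index information you saved in the earlier step.
Then send the POST request to the API endpoint. It should contain your OpenAI and AI Vision credentials, the name of your video index, and the ID and SAS URL of a single video.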

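Output

The chat responses you receive from the model should include information about the video. The API response should look like the following: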

{
    "id": "chatcmpl-8V4J2cFo7TWO7rIfs47XuDzTKvbct",
    "object": "chat.completion",
    "created": 1702415412,
    "model": "gpt-4",
    "choices":
    [
        {
            "finish_details":
            {
                "type": "stop",
                "stop": "<|fim_suffix|>"
            },
            "index": 0,
            "message":
            {
                "role": "assistant",
                "content": "The advertisement video opens with a blurred background that suggests a serene and aesthetically pleasing environment, possibly a workspace with a nature view. As the video progresses, a series of frames showcase a digital interface with search bars and prompts like \"Inspire new ideas,\" \"Research a topic,\" and \"Organize my plans,\" suggesting features of a software or application designed to assist with productivity and creativity.\n\nThe color palette is soft and varied, featuring pastel blues, pinks, and purples, creating a calm and inviting atmosphere. The backgrounds of some frames are adorned with abstract, organically shaped elements and animations, adding to the sense of innovation and modernity.\n\nMidway through the video, the focus shifts to what appears to be a browser or software interface with the phrase \"Screens simulated, subject to change; feature availability and timing may vary,\" indicating the product is in development and that the visuals are illustrative of its capabilities.\n\nThe use of text prompts continues with \"Help me relax,\" followed by a demonstration of a 'dark mode' feature, providing a glimpse into the software's versatility and user-friendly design.\n\nThe video concludes by revealing the product name, \"Copilot,\" and positioning it as \"Your everyday AI companion,\" implying the use of artificial intelligence to enhance daily tasks. The final frames feature the Microsoft logo, associating the product with the well-known technology company.\n\nIn summary, the advertisement video is for a Microsoft product named \"Copilot,\" which seems to be an AI-powered software tool aimed at improving productivity, creativity, and organization for its users. The video conveys a message of innovation, ease, and support in daily digital interactions through a visually appealing and calming presentation."
            }
        }
    ],
    "usage":
    {
        "prompt_tokens": 2068,
        "completion_tokens": 341,
        "total_tokens": 2409
    }
}

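Every response includes a "finish_details" field. Its "type" subfield has the following possible values:

stop: The API returned the complete model output.
max_tokens: The model output is incomplete because of the max_tokens input parameter or the model's token limit.
content_filter: Content was omitted because of a flag from the content filters.

If finish_details.type is stop, there is also a "stop" property that specifies the token that caused the output to end.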

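Pricing example for video prompts

Pricing for GPT-4 Turbo with Vision is dynamic and depends on the specific features and inputs used. For a comprehensive view of Azure OpenAI pricing, see Azure OpenAI Pricing.

The base charges and additional features are outlined below:

Base pricing for GPT-4 Turbo with Vision is:

Input: $0.01 per 1,000 tokens
Output: $0.03 per 1,000 tokens

Video prompt integration with the Video Retrieval add-on:

Ingestion: $0.05 per minute of video
Transactions: $0.25 per 1,000 queries of the Video Retrieval index

Processing video involves extra tokens to identify the key frames for analysis. The number of these additional tokens is roughly equal to the number of tokens in the text input plus 700 tokens.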

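Calculation

For a typical use case, suppose I use a 3-minute video with a 100-token prompt input. The section of video has a transcript that is 100 tokens long, and processing the prompt generates 100 tokens of output. The pricing for this transaction would be as follows:

Item	Detail	Total cost
GPT-4 Turbo with Vision input tokens	100 text tokens	$0.001
Additional cost to identify frames	100 input tokens + 700 tokens + 1 Video Retrieval transaction	$0.00825
Image inputs and transcript input	20 images (85 tokens each) + 100 transcript tokens	$0.018
Output tokens	100 tokens (assumed)	$0.003
Total cost		$0.03025

Additionally, there is a one-time indexing cost of $0.15 to generate the Video Retrieval index for this 3-minute segment of video. This index can be reused across any number of Video Retrieval and GPT-4 Turbo with Vision calls.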
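
The arithmetic behind the table can be reproduced in a few lines. The sketch below uses only the rates and token counts quoted above (20 frames at 85 tokens each, 700 overhead tokens, one retrieval transaction per call); the function and constant names are hypothetical, and actual billing may differ from this simplified model.

```python
INPUT_PER_1K_TOKENS = 0.01        # USD, base input price
OUTPUT_PER_1K_TOKENS = 0.03       # USD, base output price
INGESTION_PER_MINUTE = 0.05       # USD, Video Retrieval ingestion
RETRIEVAL_PER_1K_QUERIES = 0.25   # USD, Video Retrieval transactions
FRAME_OVERHEAD_TOKENS = 700       # extra tokens used to identify frames
FRAMES_PER_VIDEO = 20             # frames sampled per video
TOKENS_PER_FRAME = 85             # tokens per image in the example


def video_prompt_cost(prompt_tokens: int, transcript_tokens: int,
                      output_tokens: int) -> dict:
    """Break one video prompt call into the four line items of the table."""
    per_input_token = INPUT_PER_1K_TOKENS / 1000
    items = {
        "prompt": prompt_tokens * per_input_token,
        "frame_selection": (prompt_tokens + FRAME_OVERHEAD_TOKENS) * per_input_token
                           + RETRIEVAL_PER_1K_QUERIES / 1000,
        "frames_and_transcript": (FRAMES_PER_VIDEO * TOKENS_PER_FRAME
                                  + transcript_tokens) * per_input_token,
        "output": output_tokens * OUTPUT_PER_1K_TOKENS / 1000,
    }
    items["total"] = sum(items.values())
    return items


def indexing_cost(video_minutes: float) -> float:
    """One-time ingestion cost for building the Video Retrieval index."""
    return video_minutes * INGESTION_PER_MINUTE
```

Calling video_prompt_cost(100, 100, 100) gives the four line items of the table, which sum to about $0.03025, and indexing_cost(3) gives the one-time $0.15.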

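Limitations

Image support

Limitation of image enhancements per chat session: Enhancements cannot be applied to multiple images within a single chat call.
Maximum input image size: The maximum size for input images is restricted to 20 MB.
Object grounding in the enhancement API: When the enhancement API is used for object grounding and the model detects duplicates of an object, it generates one bounding box and label for all the duplicates instead of a separate one for each.
Low resolution accuracy: Analyzing images with the "low resolution" setting gives faster responses and uses fewer input tokens for certain use cases. However, it could affect the accuracy of object and text recognition within the image.
Image chat restriction: When you upload images in the chat playground or the API, there is a limit of 10 images per chat call.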

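The image limits above can be checked on the client before a request is sent. The sketch below is one possible pre-flight check; check_image_limits is a hypothetical helper, and reading "20 MB" as 20 * 1024 * 1024 bytes is an assumption, since the text does not say which definition of megabyte applies.

```python
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" interpreted as 20 MiB
MAX_IMAGES_PER_CALL = 10            # limit per chat call, from the list above


def check_image_limits(image_sizes: list, enhancements_enabled: bool = False) -> list:
    """Return a list of human-readable problems; an empty list means no issue found.

    image_sizes holds the size in bytes of each image in one chat call.
    """
    problems = []
    if len(image_sizes) > MAX_IMAGES_PER_CALL:
        problems.append(
            f"{len(image_sizes)} images exceeds the limit of {MAX_IMAGES_PER_CALL} per call"
        )
    if enhancements_enabled and len(image_sizes) > 1:
        problems.append("enhancements cannot be applied to multiple images in one call")
    for index, size in enumerate(image_sizes):
        if size > MAX_IMAGE_BYTES:
            problems.append(f"image {index} is {size} bytes, above the 20 MB limit")
    return problems
```
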
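Video support

Low resolution: Video frames are analyzed using the "low resolution" setting of GPT-4 Turbo with Vision, which may affect the accuracy of small object and text recognition in the video.
Video file limits: Both MP4 and MOV file types are supported. In the Azure AI playground, videos must be less than 3 minutes long. When you use the API, there is no such limitation.
Prompt limits: Video prompts contain only one video and no images. In the playground, you can clear the session to try another video or images.
Limited frame selection: The service currently selects 20 frames from the entire video, which might not capture all the critical moments or details. Frame selection can be approximately evenly spread through the video or focused by a specific video retrieval query, depending on the prompt.
Language support: The system currently supports mainly English for grounding with transcripts. Transcripts don't provide accurate information about song lyrics.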
