| Step | Data Format | Example Tensor Shape | Detailed Explanation |
|---|---|---|---|
| 1. Raw Image Input | Pixel Tensor | [1, 3, 224, 224] | A batch of one 224x224 color image with 3 (RGB) channels. |
| 2. ViT Patching | Patch Sequence | [1, 196, 768] | The image is sliced into 14x14 = 196 non-overlapping 16x16 patches, and each patch is flattened into a 16x16x3 = 768-dimensional vector. |
| 3. ViT Output | Visual Feature Sequence | [1, 196, 768] | The ViT encoder processes the patch sequence; the shape is unchanged, but each feature now carries global information about the image. |
| 4. Connector Projection | Aligned Visual Features | [1, 196, 4096] | Each visual feature is projected from 768 to 4096 dimensions to match the LLM's word embedding dimension. |
| 5. Text Embedding | Embedded Text Tokens | [1, 5, 4096] | The question "What is in the image?" is tokenized into five tokens in this example (the exact count depends on the tokenizer), and each token is embedded as a 4096-dimensional vector. |
| 6. Multimodal Concatenation | Joint Visual + Text Input Features | [1, 201, 4096] | The 196 visual tokens and 5 text tokens are concatenated along the sequence dimension: 196 + 5 = 201. |
| 7. LLM Generation Output | Generated Token ID Sequence | [1, 7] | The LLM generates a 7-token answer conditioned on the joint features. |
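
The shape arithmetic in the table can be traced with a few lines of plain Python. This is a minimal sketch that only does shape bookkeeping and runs no real model; the helper names (`patchify_shape`, `project_shape`, `concat_seq_shape`) are hypothetical, and it assumes a patch size of 16, no extra [CLS] token, and the 5-token question and 7-token answer used as examples in the table.

```python
# Shape bookkeeping only: no real model is run. All helper names are illustrative.

def patchify_shape(image_shape, patch_size):
    """Shape after slicing an image batch into flattened, non-overlapping patches."""
    batch, channels, height, width = image_shape
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = channels * patch_size * patch_size
    return (batch, num_patches, patch_dim)


def project_shape(seq_shape, out_dim):
    """Shape after a linear projection applied to the last dimension."""
    batch, seq_len, _ = seq_shape
    return (batch, seq_len, out_dim)


def concat_seq_shape(a_shape, b_shape):
    """Shape after concatenating two sequences along the sequence dimension."""
    if a_shape[0] != b_shape[0] or a_shape[2] != b_shape[2]:
        raise ValueError("batch size and hidden size must match")
    return (a_shape[0], a_shape[1] + b_shape[1], a_shape[2])


image = (1, 3, 224, 224)                 # step 1: one RGB image
patches = patchify_shape(image, 16)      # step 2: 14 * 14 patches, 16 * 16 * 3 values each
vit_out = patches                        # step 3: encoder output keeps the shape (assumed hidden size 768)
visual = project_shape(vit_out, 4096)    # step 4: connector maps 768 -> 4096
text = (1, 5, 4096)                      # step 5: assumes the question becomes 5 tokens
joint = concat_seq_shape(visual, text)   # step 6: 196 + 5 tokens
answer = (1, 7)                          # step 7: assumed 7 generated token IDs

for name, shape in [("image", image), ("patches", patches), ("vit_out", vit_out),
                    ("visual", visual), ("text", text), ("joint", joint),
                    ("answer", answer)]:
    print(f"{name:8s} {list(shape)}")
```

Changing the image size, patch size, or prompt length in the sketch shows how the joint sequence length fed to the LLM scales: the visual token count grows with the square of the image side divided by the patch size.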

Posted by 翩若惊鸿 on 2026-01-02 21:51