| Step | Data Format | Example Matrix Shape | Detailed Explanation |
|---|---|---|---|
| 1. Raw Image Input | Pixel Matrix | [1, 3, 224, 224] | A single 224x224 color image with 3 channels (RGB), in [batch, channels, height, width] layout. |
| 2. ViT Patching | Patch Sequence | [1, 196, 768] | The image is sliced into 196 patches of size 16x16 (a 14x14 grid), and each patch is flattened into a 768-dimensional vector (16 x 16 x 3 = 768). |
| 3. ViT Output | Visual Feature Sequence | [1, 196, 768] | The ViT encoder processes the patch sequence; the shape is unchanged, but each output vector now carries global context from the whole image. |
| 4. Connector Projection | Aligned Visual Features | [1, 196, 4096] | Visual features are projected from 768 to 4096 dimensions to match the LLM's word embedding dimension. |
| 5. Text Embedding | Embedded Text Tokens | [1, 5, 4096] | The question "What is in the image?" is tokenized into 5 tokens, and each token is embedded as a 4096-dimensional vector. |
| 6. Multimodal Concatenation | Joint Visual + Text Features | [1, 201, 4096] | The 196 visual tokens and 5 text tokens are concatenated along the sequence dimension (196 + 5 = 201). |
| 7. LLM Generation Output | Generated Token ID Sequence | [1, 7] | The LLM generates a 7-token answer conditioned on the joint features. |
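
The shape bookkeeping in steps 1 through 6 can be checked with a short NumPy sketch. This is only an illustration under stated assumptions: the weights are random, the ViT encoder is replaced by a pass-through placeholder (a real encoder would change the values but keep the shape), and the names `patchify`, `w_proj`, and `text_tokens` are hypothetical, not taken from any particular model or library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: one RGB image in [batch, channels, height, width] layout.
image = rng.standard_normal((1, 3, 224, 224)).astype(np.float32)


def patchify(img, patch=16):
    """Split [B, C, H, W] into flattened patches [B, N, C * patch * patch]."""
    b, c, h, w = img.shape
    gh, gw = h // patch, w // patch            # 14 x 14 grid for a 224x224 image
    x = img.reshape(b, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)          # [B, gh, gw, C, patch, patch]
    return x.reshape(b, gh * gw, c * patch * patch)


# Step 2: 196 patches, each flattened to 16 * 16 * 3 = 768 values.
patches = patchify(image)

# Step 3: placeholder for the ViT encoder (assumption: shape-preserving).
vit_features = patches

# Step 4: connector modeled as a single linear projection 768 -> 4096
# (hypothetical random weights, no bias).
w_proj = (rng.standard_normal((768, 4096)) * 0.02).astype(np.float32)
visual_tokens = vit_features @ w_proj

# Step 5: stand-in for the embeddings of 5 text tokens.
text_tokens = rng.standard_normal((1, 5, 4096)).astype(np.float32)

# Step 6: concatenate along the sequence dimension (axis 1).
joint = np.concatenate([visual_tokens, text_tokens], axis=1)

for name, arr in [("image", image), ("patches", patches),
                  ("visual_tokens", visual_tokens),
                  ("text_tokens", text_tokens), ("joint", joint)]:
    print(f"{name:14s} {arr.shape}")
```

Step 7 is left out because it requires an actual language model; the sketch only confirms that the shapes in the table are mutually consistent.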
翩若惊鸿_2026-01-02 21:51