| Step | Data Format | Example Matrix Shape | Detailed Explanation |
|---|---|---|---|
| 1. Raw Image Input | Pixel Matrix | [1, 3, 224, 224] | A single 224x224 color image with 3 channels (RGB). |
| 2. ViT Patching | Patch Sequence | [1, 196, 768] | The image is sliced into a 14x14 grid of 196 patches, each 16x16 pixels, and each patch is flattened into a 768-dimensional vector (16 x 16 x 3). |
| 3. ViT Output | Visual Feature Sequence | [1, 196, 768] | The ViT encoder processes the patch sequence and returns a feature sequence of the same shape, in which each position carries global information. |
| 4. Connector Projection | Aligned Visual Features | [1, 196, 4096] | Visual features are projected into a 4096-dimensional space to match the LLM's word embedding dimension. |
| 5. Text Embedding | Tokenized Text Features | [1, 5, 4096] | The question "What is in the image?" is tokenized into 5 tokens, each embedded as a 4096-dimensional vector. |
| 6. Multimodal Concatenation | Joint Visual + Text Input Features | [1, 201, 4096] | The 196 visual tokens and 5 text tokens are concatenated along the sequence dimension (196 + 5 = 201). |
| 7. LLM Generation Output | Generated Token ID Sequence | [1, 7] | The LLM generates a 7-token answer conditioned on the joint features. |
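
The shape arithmetic in the table can be checked with a small shape-only sketch. No model is run here: the helper names (`patchify_shape`, `project_shape`, `concat_shape`) are hypothetical and exist only for illustration, and the values 16, 4096, and 5 are taken from the table's example rather than from any particular model's configuration.

```python
# Shape-only walk-through of the pipeline; these helpers are hypothetical
# and only track tensor shapes, they do not compute any features.

def patchify_shape(image_shape, patch_size=16):
    """Map a [batch, channels, height, width] image to [batch, num_patches, patch_dim]."""
    batch, channels, height, width = image_shape
    # The image must divide evenly into patches.
    assert height % patch_size == 0 and width % patch_size == 0
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = channels * patch_size * patch_size  # flattened pixels per patch
    return [batch, num_patches, patch_dim]

def project_shape(seq_shape, out_dim):
    """A linear projection changes only the last (feature) dimension."""
    batch, seq_len, _ = seq_shape
    return [batch, seq_len, out_dim]

def concat_shape(a, b):
    """Concatenate two sequences along the sequence dimension (axis 1)."""
    assert a[0] == b[0] and a[2] == b[2], "batch and feature dims must match"
    return [a[0], a[1] + b[1], a[2]]

image = [1, 3, 224, 224]                # step 1: raw image
patches = patchify_shape(image)         # step 2: patch sequence
vit_out = list(patches)                 # step 3: encoder keeps the shape
visual = project_shape(vit_out, 4096)   # step 4: connector projection
text = [1, 5, 4096]                     # step 5: assumed 5 text tokens
joint = concat_shape(visual, text)      # step 6: multimodal concatenation

for name, shape in [("patches", patches), ("visual", visual), ("joint", joint)]:
    print(name, shape)
```

The assertion in `concat_shape` reflects why step 4 is needed: concatenation along the sequence dimension only works once the visual features share the text embeddings' feature dimension.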