[Untitled]

| Step | Data Format | Example Matrix Shape | Detailed Explanation |
| --- | --- | --- | --- |
| 1. Raw Image Input | Pixel Matrix | [1, 3, 224, 224] | A single 224x224 three-channel (RGB) color image. |
| 2. ViT Patching | Patch Sequence | [1, 196, 768] | The image is sliced into 196 patches of 16x16 pixels, and each patch is flattened into a vector. |
| 3. ViT Output | Visual Feature Sequence | [1, 196, 768] | The ViT encoder processes the patch sequence into a feature sequence that carries global information. |
| 4. Connector Projection | Aligned Visual Features | [1, 196, 4096] | Visual features are projected into a 4096-dimensional space to match the LLM's word embedding dimension. |
| 5. Text Embedding | Tokenized Text Features | [1, 5, 4096] | The question "What is in the image?" is tokenized and embedded into five 4096-dimensional vectors. |
| 6. Multimodal Concatenation | Joint Visual + Text Input Features | [1, 201, 4096] | The 196 visual tokens and 5 text tokens are concatenated along the sequence dimension. |
| 7. LLM Generation | Output Token ID Sequence | [1, 7] | The LLM generates a 7-token answer based on the joint features. |
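
The shapes in the table can be reproduced with a short NumPy sketch. This is only an illustration of the tensor bookkeeping, not a working model: the ViT encoder is replaced by a shape-preserving placeholder, the projection weights and embedding table are random, and the vocabulary size, token IDs, and variable names are all made up for the example. A real ViT also applies a learned patch embedding and usually adds position embeddings (and often a class token), which the table leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: one RGB image in [batch, channels, height, width] layout.
image = rng.random((1, 3, 224, 224), dtype=np.float32)

# Step 2: cut the image into non-overlapping 16x16 patches and flatten each.
P = 16
B, C, H, W = image.shape
patches = image.reshape(B, C, H // P, P, W // P, P)
patches = patches.transpose(0, 2, 4, 1, 3, 5)        # [1, 14, 14, 3, 16, 16]
patches = patches.reshape(B, (H // P) * (W // P), C * P * P)  # [1, 196, 768]

# Step 3: placeholder for the ViT encoder. A real encoder runs attention
# layers here; it keeps the [1, 196, 768] shape, so identity is enough
# for tracking shapes.
vit_out = patches

# Step 4: connector, sketched as a single linear projection 768 -> 4096
# (random weights, assumption: no bias, no nonlinearity).
w_proj = rng.standard_normal((768, 4096), dtype=np.float32) * 0.02
visual_tokens = vit_out @ w_proj                     # [1, 196, 4096]

# Step 5: text embedding lookup. The vocabulary size and the five token
# IDs are hypothetical; a real tokenizer decides both.
vocab_size = 1000
embedding_table = rng.standard_normal((vocab_size, 4096), dtype=np.float32)
token_ids = np.array([[11, 22, 33, 44, 55]])         # [1, 5]
text_tokens = embedding_table[token_ids]             # [1, 5, 4096]

# Step 6: concatenate along the sequence dimension (axis 1).
joint = np.concatenate([visual_tokens, text_tokens], axis=1)  # [1, 201, 4096]

# Step 7: stand-in for the LLM output, showing only the shape of a
# 7-token answer; no generation actually happens here.
generated_ids = np.zeros((1, 7), dtype=np.int64)

for name, arr in [("image", image), ("patches", patches),
                  ("vit_out", vit_out), ("visual_tokens", visual_tokens),
                  ("text_tokens", text_tokens), ("joint", joint),
                  ("generated_ids", generated_ids)]:
    print(f"{name:14s} {list(arr.shape)}")
```

The 196 in step 2 comes from (224 / 16) x (224 / 16) = 14 x 14 patches, and 768 from 16 x 16 x 3 values per patch. The 201 in step 6 is simply 196 + 5.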