[Untitled]

| Step | Data Format | Example Matrix Shape | Detailed Explanation |
| --- | --- | --- | --- |
| 1. Raw Image Input | Pixel Matrix | [1, 3, 224, 224] | A single 224x224 color image with 3 channels (RGB). |
| 2. ViT Patching | Patch Sequence | [1, 196, 768] | The image is sliced into 196 patches of 16x16 pixels (a 14x14 grid), and each patch is flattened into a 768-dimensional vector (16 x 16 x 3). |
| 3. ViT Output | Visual Feature Sequence | [1, 196, 768] | The ViT encoder processes the patch sequence into a feature sequence that carries global information. |
| 4. Connector Projection | Aligned Visual Features | [1, 196, 4096] | Visual features are projected into a 4096-dimensional space to match the LLM's word embedding dimension. |
| 5. Text Embedding | Tokenized Text Features | [1, 5, 4096] | The question "What is in the image?" is tokenized and embedded into five 4096-dimensional vectors. |
| 6. Multimodal Concatenation Input | Visual + Text Joint Features | [1, 201, 4096] | The 196 visual tokens and 5 text tokens are concatenated along the sequence dimension (196 + 5 = 201). |
| 7. LLM Generation Output | Generated Token ID Sequence | [1, 7] | The LLM generates a 7-token answer based on the joint features. |
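
The shape arithmetic in the table can be checked with a short script. This is only a minimal sketch in plain Python: the function name `trace_shapes` and all of its parameter names are hypothetical, and the defaults simply mirror the example values in the table (16x16 patches, a 768-dimensional ViT feature, a 4096-dimensional LLM embedding, 5 text tokens, 7 generated tokens). It computes shapes only and runs no model.

```python
def trace_shapes(batch=1, channels=3, image_size=224, patch_size=16,
                 vit_dim=768, llm_dim=4096,
                 num_text_tokens=5, num_generated=7):
    """Return the tensor shape at each step of the table, in order.

    All names and defaults are illustrative assumptions taken from the
    example values in the table; no real model is involved.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    patches_per_side = image_size // patch_size      # 224 // 16 = 14
    num_patches = patches_per_side ** 2              # 14 * 14 = 196
    patch_dim = patch_size * patch_size * channels   # 16 * 16 * 3 = 768

    return {
        "raw_image": (batch, channels, image_size, image_size),
        "patch_sequence": (batch, num_patches, patch_dim),
        "vit_output": (batch, num_patches, vit_dim),
        "projected_visual": (batch, num_patches, llm_dim),
        "text_embedding": (batch, num_text_tokens, llm_dim),
        # Concatenation happens along the sequence dimension (axis 1).
        "joint_input": (batch, num_patches + num_text_tokens, llm_dim),
        "generated_ids": (batch, num_generated),
    }


shapes = trace_shapes()
for name, shape in shapes.items():
    print(f"{name:17s} {list(shape)}")
```

Changing `patch_size` or `image_size` shows how the visual token count drives the joint sequence length: for example, `trace_shapes(patch_size=14)` yields a 16x16 grid of 256 patches, so the joint input grows to 261 tokens.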