Untitled

| Step | Data Format | Example Matrix Shape | Detailed Explanation |
| --- | --- | --- | --- |
| 1. Raw Image Input | Pixel Matrix | [1, 3, 224, 224] | A single 224x224 3-channel (RGB) color image. |
| 2. ViT Patching | Patch Sequence | [1, 196, 768] | The image is sliced into 196 patches of size 16x16 and flattened into vectors. |
| 3. ViT Output | Visual Feature Sequence | [1, 196, 768] | Processed by the ViT encoder to obtain a feature sequence containing global information. |
| 4. Connector Projection | Aligned Visual Features | [1, 196, 4096] | Visual features are projected into a 4096-dimensional space to align with the LLM's word embedding dimension. |
| 5. Text Embedding | Tokenized Text Features | [1, 5, 4096] | The question "What is in the image?" is tokenized and embedded into five 4096-dimensional vectors. |
| 6. Multimodal Concatenation Input | Visual + Text Joint Features | [1, 201, 4096] | Concatenating 196 visual tokens and 5 text tokens along the sequence dimension. |
| 7. LLM Generation Output | Generated Token ID Sequence | [1, 7] | The LLM generates a 7-token answer based on the joint features. |
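
The shapes in steps 1 through 6 can be traced with a few lines of NumPy. This is only a rough sketch under stated assumptions: the weights are random, the ViT encoder is replaced by an identity placeholder (a real encoder also keeps the [1, 196, 768] shape), and the vocabulary size and token IDs are hypothetical values chosen for illustration. Step 7 is omitted because it requires an actual language model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: one 224x224 RGB image as [batch, channels, height, width].
image = rng.random((1, 3, 224, 224), dtype=np.float32)

# Step 2: cut the image into non-overlapping 16x16 patches and flatten each.
patch = 16
b, c, h, w = image.shape
grid_h, grid_w = h // patch, w // patch          # 14 x 14 = 196 patches
patches = image.reshape(b, c, grid_h, patch, grid_w, patch)
patches = patches.transpose(0, 2, 4, 1, 3, 5)    # [b, grid_h, grid_w, c, p, p]
patches = patches.reshape(b, grid_h * grid_w, c * patch * patch)  # 3*16*16 = 768

# Step 3: placeholder for the ViT encoder (assumption: shape-preserving).
vit_out = patches

# Step 4: connector, sketched as a single linear projection 768 -> 4096.
w_proj = rng.standard_normal((768, 4096), dtype=np.float32) * 0.02
visual_tokens = vit_out @ w_proj

# Step 5: look up 5 token IDs in an embedding table.
vocab_size = 1000                                # hypothetical vocabulary size
embed_table = rng.standard_normal((vocab_size, 4096), dtype=np.float32)
token_ids = np.array([[11, 22, 33, 44, 55]])     # hypothetical IDs for the question
text_tokens = embed_table[token_ids]

# Step 6: concatenate along the sequence dimension (axis 1): 196 + 5 = 201.
joint = np.concatenate([visual_tokens, text_tokens], axis=1)

for name, arr in [("image", image), ("patches", patches),
                  ("visual_tokens", visual_tokens),
                  ("text_tokens", text_tokens), ("joint", joint)]:
    print(name, arr.shape)
```

Concatenation only works because step 4 brings the visual features to the same last dimension (4096) as the text embeddings; without the connector, the [1, 196, 768] and [1, 5, 4096] tensors could not be joined along the sequence axis.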