| Step | Data Format | Example Matrix Shape | Detailed Explanation |
|---|---|---|---|
| 1. Raw Image Input | Pixel Matrix | [1, 3, 224, 224] | A single 224x224 color image with 3 channels (RGB). |
| 2. ViT Patching | Patch Sequence | [1, 196, 768] | The image is sliced into a 14x14 grid of 196 patches, each 16x16 pixels, and every patch is flattened into a vector of 16 x 16 x 3 = 768 values. |
| 3. ViT Output | Visual Feature Sequence | [1, 196, 768] | The ViT encoder processes the patch sequence and returns a feature sequence of the same shape in which each token carries global information. |
| 4. Connector Projection | Aligned Visual Features | [1, 196, 4096] | Visual features are projected into a 4096-dimensional space to match the LLM's word embedding dimension. |
| 5. Text Embedding | Tokenized Text Features | [1, 5, 4096] | The question "What is in the image?" is tokenized and embedded into five 4096-dimensional vectors. |
| 6. Multimodal Concatenation | Joint Visual + Text Features | [1, 201, 4096] | The 196 visual tokens and 5 text tokens are concatenated along the sequence dimension (196 + 5 = 201). |
| 7. LLM Generation Output | Generated Token ID Sequence | [1, 7] | The LLM generates a 7-token answer conditioned on the joint features. |
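
The shapes in steps 1 through 6 can be traced with a minimal NumPy sketch. It is only an illustration of the tensor shapes, not a real model: the helper `patchify`, the random projection matrix `connector_w`, and the random text embeddings are assumptions introduced here, and the ViT encoder is replaced by an identity placeholder because it preserves the `[1, 196, 768]` shape.

```python
import numpy as np

def patchify(images, patch=16):
    """Split [B, C, H, W] images into flattened, non-overlapping patches."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch                    # 14 x 14 grid for 224 / 16
    x = images.reshape(b, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)                  # [B, gh, gw, C, patch, patch]
    return x.reshape(b, gh * gw, c * patch * patch)    # [B, 196, 768]

rng = np.random.default_rng(0)

# Step 1: raw image input (random values stand in for real pixels).
image = rng.standard_normal((1, 3, 224, 224)).astype(np.float32)

# Step 2: ViT patching.
patches = patchify(image)

# Step 3: ViT output. Identity placeholder: a real encoder keeps this shape.
vit_out = patches

# Step 4: connector projection, 768 -> 4096 (hypothetical random weights).
connector_w = rng.standard_normal((768, 4096)).astype(np.float32) * 0.02
visual_tokens = vit_out @ connector_w

# Step 5: text embedding (random stand-in for 5 embedded text tokens).
text_tokens = rng.standard_normal((1, 5, 4096)).astype(np.float32)

# Step 6: concatenate along the sequence dimension.
joint = np.concatenate([visual_tokens, text_tokens], axis=1)

for name, t in [("image", image), ("patches", patches),
                ("visual_tokens", visual_tokens),
                ("text_tokens", text_tokens), ("joint", joint)]:
    print(f"{name:14s} {list(t.shape)}")
```

Step 7 is omitted because it requires an actual language model; its output would be an integer array of token IDs with shape `[1, 7]` for the 7-token answer in the table.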
翩若惊鸿_2026-01-02 21:51