关于Transformer的理解

关于Transformer, QKV的意义表示其更像是一个可学习的查询系统,或许以前搜索引擎的算法就与此有关或者某个分支的搜索算法与此类似。


Can anyone help me to understand this image? - #2 by J_Johnson - nlp - PyTorch Forums

Embeddings - these are learnable weights where each token(token could be a word, sentence piece, subword, character, etc) are converted into a vector, say, with 500 values between 0 and 1 that are trainable.

Positional Encoding - for each token, we want to inform the model where it's located, orderwise. This is because linear layers are not ideal for handling sequential information. So we manually pass this in by adding a vector of sine and cosine values on the first 2 elements in the embedding vector.

This sequence of vectors goes through an attention layer, which basically is like a learnable digitized database search function with keys, queries and values. In this case, we are "searching" for the most likely next token.

The Feed Forward is just a basic linear layer, but is applied across each embedding in the sequence separately(i.e. 3 dim tensor instead of 2 dim).

Then the final Linear layer is where we want to get out our predicted next token in the form of a vector of probabilities, which we apply a softmax to put the values in the range of 0 to 1.

There are two sides because when that diagram was developed, it was being used in language translations. But generative language models for next token prediction just use the Transformer decoder and not the encoder.

Here is a PyTorch tutorial that might help you go through how it works.

Language Modeling with nn.Transformer and torchtext --- PyTorch Tutorials 2.0.1+cu117 documentation


相关推荐
jerryinwuhan9 分钟前
LORA时间
人工智能
码农葫芦侠10 分钟前
Vercel Labs Skills:AI 编程安装「技能Skills」的工具
人工智能·ai·ai编程
宝贝儿好12 分钟前
【强化学习】第十章:连续动作空间强化学习:随机高斯策略、DPG算法
人工智能·python·深度学习·算法·机器人
未来之窗软件服务15 分钟前
AI人工智能(二十三)错误示范ASR 语音识别C#—东方仙盟练气期
人工智能·c#·语音识别·仙盟创梦ide·东方仙盟
金智维科技官方22 分钟前
智能体,重构企业自动化未来
人工智能·自动化·agent·智能体·数字员工
桂花饼22 分钟前
谷歌正式发布 Gemini 3.1 Pro:核心智能升级与国内极速接入指南
人工智能·qwen3-next·claude-sonnet·sora2pro·gemini-3.1pro·grok-420-fast·openclaw 配置教程
Mixtral43 分钟前
2026年3款AI会议记录工具测评:告别会后整理
人工智能
Evand J44 分钟前
【课题推荐】深度学习驱动的交通流量预测系统(基于LSTM的交通流量预测系统),MATLAB实现
人工智能·深度学习·matlab·课题简介
甲枫叶1 小时前
【claude热点资讯】Claude Code 更新:手机遥控电脑开发,Remote Control 功能上线
java·人工智能·智能手机·产品经理·ai编程
光头颜1 小时前
任务分解与子 Agent 调度:Controller/Worker 模式的最小可运行实现(SQL + 文档 RAG)
人工智能·智能体