tokenizer.apply_chat_template()

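tokenizer.apply_chat_template() is the method in the Hugging Face transformers library that formats a conversation history (messages) into input text the model can accept. It is essential when working with chat models such as Qwen, Llama-3, ChatGLM, and Phi-3.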

Purpose

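Converts a structured list of conversation messages (e.g. [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]) into the prompt format the model expects, returned either as a string or as token IDs.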

Basic usage

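from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hello! How can I help you?"},
    {"role": "user", "content": "What's the weather like today?"}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,             # False: return a string; True: return token IDs
    add_generation_prompt=True  # append the assistant start marker (used for generation)
)

print(prompt)

Example output (Qwen3 format):

<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hello! How can I help you?<|im_end|>
<|im_start|>user
What's the weather like today?<|im_end|>
<|im_start|>assistant

Note: whether a default system message is inserted depends on the model's template. Older Qwen templates such as Qwen2.5 insert one when none is provided; the Qwen3 template does not, so no system block appears above. apply_chat_template has no documented add_system_message argument; to control the system prompt, put your own {"role": "system", "content": "..."} message at the start of messages.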
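
To make the ChatML-style output above less opaque, here is a minimal pure-Python sketch of what such a template does. `render_chatml` is a hypothetical helper written for illustration only; it is not part of transformers, and it ignores tools, thinking blocks, and the other cases a real Jinja template such as Qwen's handles.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical helper: format messages in a ChatML-like layout.

    Illustration only; real chat templates are Jinja and cover more cases.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hello! How can I help you?"},
    {"role": "user", "content": "What's the weather like today?"},
]

prompt = render_chatml(messages, add_generation_prompt=True)
print(prompt)
```

With add_generation_prompt=False the string ends right after the last `<|im_end|>`, which is the shape usually wanted for training text rather than for generation.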

Key parameters

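Parameter	Type	Description
`conversation`	`List[Dict]`	Conversation history; each dict has a `"role"` (user/assistant/system) and a `"content"`
`tokenize`	`bool`	`False` → returns a `str`; `True` → returns `List[int]` (token IDs)
`add_generation_prompt`	`bool`	If `True`, appends the assistant start marker at the end (e.g. the trailing `<|im_start|>assistant` in the Qwen output above) so the model generates a reply
`return_tensors`	`str` (e.g., `"pt"`)	When `tokenize=True`, the tensor type to return; `padding` and `truncation` are only needed when batching conversations of different lengths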

Common scenarios

1. Building a prompt at inference time

messages = [
    {"role": "system", "content": "You are an assistant."},
    {"role": "user", "content": "Hello"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"  # return PyTorch tensors
).to("cuda")  # move to the GPU; assumes CUDA is available

2. Preprocessing a dataset for fine-tuning

def format_example(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
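            add_generation_prompt=False  # fine-tuning data usually holds the full conversation (including the assistant reply)
        )
    }

# raw_dataset is assumed to be a datasets.Dataset with a "messages" column, loaded elsewhere
dataset = raw_dataset.map(format_example)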

3. Custom chat template (advanced)

If the model has no built-in template, you can set one manually:

tokenizer.chat_template = "{% for message in messages %}{{message['role']}}: {{message['content']}}\n{% endfor %}"
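
The one-line template above never opens an assistant turn, so add_generation_prompt would have no effect with it. The sketch below extends the same "role: content" layout with a generation prompt and renders it with plain jinja2, the template engine that chat templates are written for. The `assistant: ` marker and the variable name `chat_template` are assumptions made for illustration, and plain jinja2 rendering only approximates what apply_chat_template does internally.

```python
from jinja2 import Template

# Hypothetical template: "role: content" lines, plus an optional generation prompt
chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}assistant: {% endif %}"
)

messages = [{"role": "user", "content": "Hello"}]

# Render once for generation and once without the generation prompt
rendered = Template(chat_template).render(
    messages=messages, add_generation_prompt=True
)
rendered_plain = Template(chat_template).render(
    messages=messages, add_generation_prompt=False
)

print(repr(rendered))
print(repr(rendered_plain))
```

Assigning this string to tokenizer.chat_template, as in the line above, should make apply_chat_template honor add_generation_prompt for this layout.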

Notes

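Not every tokenizer supports it
- The model's tokenizer must define a chat_template (normally shipped with its tokenizer config on the HF Hub); most modern chat models do
- You can inspect the current template via tokenizer.chat_template
Qwen, Llama-3, Yi, and others use different formats
- Qwen uses <|im_start|> / <|im_end|>
- Llama-3 uses <|start_header_id|>user<|end_header_id|>\n\n...<|eot_id|>
- Do not mix templates across models!
Fine-tuning vs. inference
- Fine-tuning: the input should contain the full conversation (user + assistant)
- Inference: the conversation ends with the user turn (earlier turns may be included); set add_generation_prompt=True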
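
The last two notes can be illustrated with a small pure-Python sketch of the Llama-3 layout. `render_llama3` is a hypothetical helper that only mimics the marker structure listed above; it is not the model's actual Jinja template and is not part of transformers.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Hypothetical helper mimicking the Llama-3 chat layout (illustration only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn marker
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant header so the model writes the next reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]

# Fine-tuning: the full conversation, including the assistant reply
train_text = render_llama3(conversation)

# Inference: stop at the last user turn and open an assistant header
infer_prompt = render_llama3(conversation[:1], add_generation_prompt=True)

print(train_text)
print(infer_prompt)
```

None of these markers overlap with Qwen's <|im_start|> / <|im_end|>, which is why text formatted for one model family should not be fed to another.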

How do I view the current tokenizer's chat template?

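# The template is a Jinja string, so print() already shows it with real line breaks
print(tokenizer.chat_template)

# json.dumps gives the escaped, single-line form instead (useful for pasting into a JSON config)
import json
print(json.dumps(tokenizer.chat_template))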