DeepSeek-R1-Qwen-32B: BPE algorithm and LlamaTokenizerFast tokenizer configuration

This is the tokenizer configuration file (tokenizer_config.json) of a DeepSeek model. Let me walk you through the key settings:


{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<｜begin▁of▁sentence｜>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "<｜end▁of▁sentence｜>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "legacy": true,
  "model_max_length": 16384,
  "pad_token": {
    "__type": "AddedToken",
    "content": "<｜end▁of▁sentence｜>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sp_model_kwargs": {},
  "unk_token": null,
  "tokenizer_class": "LlamaTokenizerFast",
  "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool 
%}{{'<｜tool▁outputs▁end｜>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<｜Assistant｜>'}}{% endif %}"
}

Basic information

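| Property | Value | Description |
| --- | --- | --- |
| `tokenizer_class` | `LlamaTokenizerFast` | Uses the fast Llama tokenizer class |
| `model_max_length` | 16384 | Maximum sequence length of 16K tokens |
| `legacy` | true | Uses the tokenizer's legacy behavior |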
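
As a quick check of these fields, the sketch below parses a trimmed copy of the configuration with Python's standard `json` module. The `CONFIG_TEXT` excerpt and the `summary` dictionary are illustrative names, not part of the original file.

```python
import json

# Trimmed excerpt of the tokenizer_config.json discussed here (illustrative subset).
CONFIG_TEXT = """
{
  "add_bos_token": true,
  "add_eos_token": false,
  "legacy": true,
  "model_max_length": 16384,
  "unk_token": null,
  "tokenizer_class": "LlamaTokenizerFast"
}
"""

config = json.loads(CONFIG_TEXT)

# JSON true/false/null become Python True/False/None.
summary = {
    "tokenizer_class": config["tokenizer_class"],
    "model_max_length": config["model_max_length"],
    "legacy": config["legacy"],
    "has_unk_token": config["unk_token"] is not None,
}
print(summary)
```

Reading the file this way only inspects the settings; it does not load or run the tokenizer itself.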

Special tokens

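| Token | Content | Purpose |
| --- | --- | --- |
| BOS (beginning of sequence) | `<｜begin▁of▁sentence｜>` | Marks the start of a sequence |
| EOS (end of sequence) | `<｜end▁of▁sentence｜>` | Marks the end of a sequence |
| PAD (padding) | Same as EOS | Used for padding; shares the EOS token |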

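Note: the ▁ in these token names is U+2581 (LOWER ONE EIGHTH BLOCK), not an ASCII underscore. SentencePiece uses this character as its whitespace marker, and the token names use it in place of spaces. Likewise, ｜ is the fullwidth vertical line (U+FF5C), not the ASCII | character.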
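
Because these characters are easy to mistype, here is a small sketch that inspects the BOS token with the standard `unicodedata` module. The token is written with escape sequences so the code points are explicit; `ascii_lookalike` is a hypothetical mistyped version.

```python
import unicodedata

# The BOS token, spelled with explicit escapes for its two non-ASCII characters.
BOS = "<\uff5cbegin\u2581of\u2581sentence\uff5c>"

# Map every non-ASCII character in the token to its official Unicode name.
non_ascii = {ch: unicodedata.name(ch) for ch in BOS if ord(ch) > 127}
for ch, name in non_ascii.items():
    print(f"U+{ord(ch):04X} {name}")

# A lookalike typed with ASCII '|' and '_' is a different string,
# so it would not be recognized as the special token.
ascii_lookalike = "<|begin_of_sentence|>"
print(BOS == ascii_lookalike)
```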

Chat template structure

This Jinja2 template defines the conversation format and supports tool calls (function calling):

Role markers

User : <｜User｜>
Assistant : <｜Assistant｜>

Tool call markers


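<｜tool▁calls▁begin｜>     # start of the tool-call block
<｜tool▁call▁begin｜>       # start of a single tool call
<｜tool▁sep｜>              # separator between the call type and the function name
<｜tool▁call▁end｜>         # end of a single tool call
<｜tool▁calls▁end｜>        # end of the tool-call block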
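
To make the role of each marker concrete, here is a minimal sketch of how the template assembles one tool call. It covers only the part common to both template branches; the function name `format_tool_call` and the `get_weather` example are hypothetical. Note that the template concatenates `arguments` directly, so it must already be a JSON string.

```python
FENCE = "`" * 3  # three backticks, built indirectly to keep this listing readable


def format_tool_call(tool):
    """Render a single tool call: type, separator, name, then fenced JSON arguments."""
    return (
        "<｜tool▁call▁begin｜>" + tool["type"]
        + "<｜tool▁sep｜>" + tool["function"]["name"]
        + "\n" + FENCE + "json\n" + tool["function"]["arguments"]
        + "\n" + FENCE + "<｜tool▁call▁end｜>"
    )


# Hypothetical tool call; "arguments" is a JSON-encoded string, not a dict.
call = {
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}
rendered = format_tool_call(call)

# The first call in a conversation is prefixed with the assistant and block-start markers.
print("<｜Assistant｜><｜tool▁calls▁begin｜>" + rendered)
```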

Tool output markers


<｜tool▁outputs▁begin｜>    # start of the tool-output block
<｜tool▁output▁begin｜>     # start of a single tool output
<｜tool▁output▁end｜>       # end of a single tool output
<｜tool▁outputs▁end｜>      # end of the tool-output block
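
The following sketch shows how these markers wrap a run of consecutive `tool` messages, assuming it is the first such run in the conversation (the template tracks "first output" once per conversation). The function name `format_tool_outputs` and the sample results are hypothetical.

```python
OUTPUTS_BEGIN = "<｜tool▁outputs▁begin｜>"
OUTPUT_BEGIN = "<｜tool▁output▁begin｜>"
OUTPUT_END = "<｜tool▁output▁end｜>"
OUTPUTS_END = "<｜tool▁outputs▁end｜>"


def format_tool_outputs(outputs):
    """Wrap each tool result; later results are separated by a newline."""
    if not outputs:
        return ""  # no tool messages, so no markers are emitted
    parts = []
    for i, content in enumerate(outputs):
        if i == 0:
            # The first output also opens the whole block.
            parts.append(OUTPUTS_BEGIN + OUTPUT_BEGIN + content + OUTPUT_END)
        else:
            parts.append("\n" + OUTPUT_BEGIN + content + OUTPUT_END)
    # The block is closed once, before the next assistant turn or at the end.
    parts.append(OUTPUTS_END)
    return "".join(parts)


print(format_tool_outputs(['{"temp": 21}', '{"humidity": 40}']))
```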

Conversation format examples

Plain conversation:


<｜begin▁of▁sentence｜>system prompt<｜User｜>user input<｜Assistant｜>assistant reply<｜end▁of▁sentence｜>
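
The same layout can be reproduced with a few lines of plain Python. This is a simplified sketch of the template's non-tool path only: it ignores tool messages and the `</think>` handling, and `render_plain_chat` is a hypothetical helper rather than a library function.

```python
BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"


def render_plain_chat(messages, add_generation_prompt=False):
    """Render system/user/assistant messages the way the template's plain path does."""
    # The template keeps the content of the last system message and puts it right after BOS.
    system_prompt = ""
    for message in messages:
        if message["role"] == "system":
            system_prompt = message["content"]

    out = BOS + system_prompt
    for message in messages:
        if message["role"] == "user":
            out += "<｜User｜>" + message["content"]
        elif message["role"] == "assistant":
            # Every finished assistant turn is terminated with EOS.
            out += "<｜Assistant｜>" + message["content"] + EOS

    # Optionally open a new assistant turn for the model to complete.
    if add_generation_prompt:
        out += "<｜Assistant｜>"
    return out


history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
print(render_plain_chat(history))
print(render_plain_chat([{"role": "user", "content": "Hi"}], add_generation_prompt=True))
```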

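Tool call (intended layout; as written, the template emits the closing <｜tool▁calls▁end｜><｜end▁of▁sentence｜> only in its branch for the second and later calls):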


<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>function<｜tool▁sep｜>tool name
```json
{arguments}
```<｜tool▁call▁end｜><｜tool▁calls▁end｜><｜end▁of▁sentence｜>

Key features

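Tool calling support: a complete set of function-calling markers.
Generation control: add_generation_prompt controls whether <｜Assistant｜> is appended at the end to prompt the model to continue generating; it is skipped when the conversation ends with a tool output.
Reasoning stripping: if an assistant reply contains </think>, only the text after the last </think> is kept, so earlier reasoning is dropped from the rendered history.
No UNK token: unk_token is null, so no unknown-token fallback is configured, presumably because a byte-level BPE vocabulary can encode arbitrary input without one.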
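
The chat template contains the expression `content.split('</think>')[-1]` for assistant messages. A minimal Python equivalent is sketched below; `strip_reasoning` is a hypothetical name and the sample reply is made up for illustration.

```python
def strip_reasoning(content):
    """Keep only the text after the last '</think>', as the template does."""
    if "</think>" in content:
        return content.split("</think>")[-1]
    return content


reply = "<think>2 + 2 is 4.</think>The answer is 4."
print(strip_reasoning(reply))  # The answer is 4.
```

Replies without a `</think>` tag pass through unchanged, so the same code path handles both reasoning and non-reasoning turns.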

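This is a typical configuration for a DeepSeek-R1 distilled model: a 16K maximum length, </think> handling for reasoning output, and tool-call markers built into the chat template.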
