CS336——2. Tokenizer

文章目录

[1. tokenization(令牌/词元化)](#1. tokenization(令牌/词元化))
- [1.1 简单介绍](#1.1 简单介绍)
- [1.2 tokenizer演示](#1.2 tokenizer演示)
- [1.4 观测总结(Observations)](#1.4 观测总结(Observations))
- [1.X 为什么空格是和后面的词语合并](#1.X 为什么空格是和后面的词语合并)
[2. 常见的分词实现](#2. 常见的分词实现)
- [2.1 Character-based tokenization](#2.1 Character-based tokenization)
- - [2.1.1 原理介绍](#2.1.1 原理介绍)
  - [2.1.2 简单实现](#2.1.2 简单实现)
- [2.2 Byte-based tokenization](#2.2 Byte-based tokenization)
- - [2.2.1 原理介绍](#2.2.1 原理介绍)
  - [2.2.2 简单实现](#2.2.2 简单实现)
- [2.3 Word-based tokenization](#2.3 Word-based tokenization)
- [2.4 Byte Pair Encoding (BPE)](#2.4 Byte Pair Encoding (BPE))
- - [2.4.1 原理介绍](#2.4.1 原理介绍)
  - [2.4.2 简单实现](#2.4.2 简单实现)
[3. 作业](#3. 作业)
[4. 总结](#4. 总结)
[X. 其他](#X. 其他)
- [X.1. [UNK]影响困惑度计算](#X.1. [UNK]影响困惑度计算)

链接：

b站视频链接：斯坦福CS336：大模型从0到1｜第一讲：概述和tokenization【中英双语】， 59:11开始讲分词
课程ppt：
- 演示版：https://stanford-cs336.github.io/spring2025-lectures/?trace=var/traces/lecture_01.json
- 源文件版：https://github.com/stanford-cs336/spring2025-lectures/blob/main/lecture_01.py

1. tokenization(令牌/词元化)

可以去看看 Andrej Karpathy关于tokenization的视频，

原始的YouTube链接,
B站搬运的视频地址: 【Andrej Karpathy】中文字幕｜Let's build the GPT Tokenizer
另外， Andrej Karpathy有一个GPT系列的视频（包含上面这个Tokenizer的视频）合集：【精译⚡从零开始构建 GPT 系列】Andrej Karpathy

1.1 简单介绍

txt 复制代码

# 原始文本通常是使用 unicode string表示的, 例如：
string = "Hello, 🌍! 你好!"

# 如果要将其作为语言模型的输入，则需要把这些unicode表示的字符串，转为一个整数序列
# 不一定是一个字符对应一个整数，例如：上述string可以使用下面的整数序列表示
# 则其中每个整数表示一个token（即一个token不一定是一个字，一般是2~3个汉字）
indices = [15496, 11, 995, 0]

# 语言模型会对这个整数序列生成一个概率分布(probability distribution)

因此需要：
1. 可以把`unicode strings`编码(encode)成`tokens`的一个程序
2. 同样，需要有一个把`tokens`解码成一个`unicode strings`

Tokenizer就是完成tokens和strings之间互相转换的一个类

在DeepSeek-Token 用量计算中，有：

一般情况下模型中 token 和字数的换算比例大致如下：

1 个英文字符 ≈ 0.3 个 token。

1 个中文字符 ≈ 0.6 个 token。

可以直接看看transformers库中关于Tokenizer的描述和实现：

huggingface/transformers/tokenizer
Summary of the tokenizers
网上一些从0实现的版本基本也都是调用transformers库：
- 个人构建MoE大模型：从预训练到DPO的完整实践→ llm_trainer/llm_trainer
  /tokenizer.py
- 上面这个库其实就是精简版本的llama-factory

1.2 tokenizer演示

网页链接： tiktokenizer.vercel.app, https://tiktokenizer.vercel.app/?encoder=gpt2

右上角的gpt-4o是选择的tokenizer，支持很多开源的，可以换着感受下

注意：空格也是token的一部分。上面将hello, hello分成了三个词，分别是hello,对应的token标记是 24912， ,对应的token标记是11，以及 hello对应的token标记是40617。

所以同一个词语，有空格，和没空格，属于不同的token（会被划分成不同的词语）,同时由于英语的语法习惯，空格一般会和其后的词语共同划分为一个token。详见 1.X 为什么空格是和后面的词语合并 说明

不像传统的NLP，传统的NLP分词的时候，会忽略/删除空格或者标点符号等stop words，比如：英语分词就是直接用空格等符号来划分边界的。导致分词操作不可逆，llm中使用的Tokenizer则是可逆操作(reversible operations)。

另外，也可以看看Tokenizer对于数字的划分，第一行的 \n是 793，下面的,\n是412，最后的\n是198。

可以看到，对于数字的划分，不是靠千分位，或者数值大小等区分的，似乎没什么规律

1.4 观测总结(Observations)

一个词与其前面的空格属于同一个 token。
- A word and its preceding space are part of the same token (e.g., " world").
同一个词，在句首和句中的表示不同
- A word at the beginning and in the middle are represented differently (e.g., "hello hello").
数字字符串会被分成多个数字
- Numbers are tokenized into every few digits.

测试这种Tokenizer方式是否可以实现可逆操作（即对原始字符串编码，对编码后的结果再解码，看解码结果是否等于原始字符串。即是一个round trip(可往返)）：

python 复制代码

# 摘取自 https://stanford-cs336.github.io/spring2025-lectures/?trace=var/traces/lecture_01.json

# https://huggingface.co/docs/transformers/model_doc/gpt2
# https://huggingface.co/openai-community/gpt2/tree/main  gpt2模型大概也就5G，普通电脑跑的起来 懒得下载模型就去kaggle上启动一个直接跑，很快

import tiktoken
def get_gpt2_tokenizer():
    # Code: https://github.com/openai/tiktoken
    # You can use cl100k_base for the gpt3.5-turbo or gpt4 tokenizer
    return tiktoken.get_encoding("gpt2")
def get_compression_ratio(string: str, indices: list[int]) -> float:
    """Given `string` that has been tokenized into `indices`, ."""
    num_bytes = len(bytes(string, encoding="utf-8"))  # @inspect num_bytes
    num_tokens = len(indices)                       # @inspect num_tokens
    return num_bytes / num_tokens
    
tokenizer = get_gpt2_tokenizer()
string = "Hello, 🌍! 你好!"  # @inspect string

indices = tokenizer.encode(string)  # @inspect indices
reconstructed_string = tokenizer.decode(indices)  # @inspect reconstructed_string
assert string == reconstructed_string
compression_ratio = get_compression_ratio(string, indices)  # @inspect compression_ratio
# 字节数/Token数  这里的答案是 1.6666666667，即每个Token代表1.6字节的数据
# 压缩比

labels = ["string", "indices", "reconstructed_string"]

# 计算最大标签长度（加上冒号和空格）
max_label_width = max(len(label) for label in labels) + 2  # +2 用于 ": "（冒号+空格）
# 打印对齐的输出
print(f"{'string:'.ljust(max_label_width)}{string}")
print(f"{'indices:'.ljust(max_label_width)}{indices}")
print(f"{'reconstructed_string:'.ljust(max_label_width)}{reconstructed_string}")
# 得到以下输出：
# string:               Hello, 🌍! 你好!
# indices:              [15496, 11, 12520, 234, 235, 0, 220, 19526, 254, 25001, 121, 0]
# reconstructed_string: Hello, 🌍! 你好!

# 即： 对原始内容编码后再解码，可以无损得到原始内容

1.X 为什么空格是和后面的词语合并

问了下Qwen3-Max，问题：

为什么BPE分词中，空格一般是和其后的词语共同划分为同一个token，而不是其前面的词语

有以下回答：

在 BPE（Byte Pair Encoding）分词中，空格通常被视为其后词语的一部分（即与后面的词合并为一个 token），而不是前面词语的一部分，这主要是出于以下几个原因：

1. 统一处理词边界的方式

在大多数基于 BPE 的实现（如 Hugging Face 的 tokenizers、原始 BPE 算法）中，为了明确区分词与词之间的边界，会在每个词的开头添加一个特殊符号（通常是空格），然后将这个空格视为该词的一部分。

例如，原始句子：

复制代码

Hello world

预处理后变成（每个词前加空格，除了第一个）：

复制代码

HelloĠworld

（其中 Ġ 表示空格，常用于可视化）

这样，" world"（带前导空格）被视为一个整体单位参与 BPE 合并。这样做可以确保：

模型能区分 "world"（句首）和 " world"（非句首）；
保留词边界信息，对语言建模非常重要（比如预测下一个词时，知道当前词是否在句中）。

2. 避免歧义和边界模糊

如果空格被归到前一个词，那么：

句尾词会包含一个空格，而句中词不会（因为下一个词会"抢走"空格），造成不一致；
在解码时难以判断一个 token 是否后面还有内容；
对于最后一个词（后面没有空格），处理逻辑会变得复杂。

而将空格作为后一个词的前缀，则：

所有非句首词都以空格开头；
句首词没有前导空格；
结构统一，易于编码和解码。

3. 与子词分词目标一致

BPE 的目标是将常见字符对逐步合并成子词单元。如果空格被当作独立符号或附着在前一个词上，会导致：

高频词如 "the" 和 " the" 被视为完全不同的单元；
无法有效共享子词表示（比如 "the" 在句首和句中应有相似表示）。

通过将空格作为后一个词的前缀，BPE 可以学习到：

"the"（句首）
" the"（句中）
作为两个不同的 token，但它们的语义和上下文角色不同，这种区分是合理的。

4. 实际实现惯例（如 GPT、RoBERTa 等）

主流模型（如 GPT 系列）使用 BPE 时，都采用 "空格属于后一个词" 的策略。例如：

"Hello world" → tokens: ["Hello", "Ġworld"]
"I love NLP" → tokens: ["I", "Ġlove", "ĠNLP"]

其中 Ġ 就代表空格，是 token 的一部分。

这种设计使得：

解码时只需拼接所有 token 字符串即可还原原文；
无需额外存储"是否后接空格"的元信息。

总结

空格被划归到其后的词语，是为了统一词边界表示、避免歧义、便于子词学习，并与主流实现保持一致。这是一种经过实践验证的高效且清晰的设计选择。

如果你查看 Hugging Face 的 tokenizer 输出，会发现几乎所有基于 BPE 的模型（如 GPT-2、RoBERTa）都采用这种策略。

2. 常见的分词实现

2.1 Character-based tokenization

基于字符的分词，是最简单的一种分词方式。

一个Unicode字符串，就是一个Unicode字符组成的序列。

2.1.1 原理介绍

python 复制代码

# 每个字符都可以通过`ord`转为一个整数
In [1]: assert ord("a") == 97
In [2]: assert ord("🌍") == 127757
# 可以通过chr从整数转为字符
In [3]: assert chr(97) == "a"
In [4]: assert chr(127757) == "🌍"

# 更熟悉的可能是Unicode 
# 下面是一种 Unicode 转义序列表示法（Unicode escape sequence），常用于编程语言（如 Python、Java、JavaScript）中以 ASCII 安全的方式表示 Unicode 字符。
uni_str ="\u9700\u8981\u4f7f\u7528\u4ec0\u4e48\u8d26\u53f7\u767b\u5f55"

In [8]: hex_str = ['0x9700','0x8981','0x4f7f','0x7528','0x4ec0', '0x4e48','0x8d2
   ...: 6','0x53f7', '0x767b','0x5f55']
   ...: uni_int = [ int(i, 16) for i in hex_str]
   ...: for i in uni_int:
   ...:   print(chr(i))
   ...: 
需
要
使
用
什
么
账
号
登
录

In [9]: uni_str ="\u9700\u8981\u4f7f\u7528\u4ec0\u4e48\u8d26\u53f7\u767b\u5f55"
In [10]: uni_str
Out[10]: '需要使用什么账号登录'

2.1.2 简单实现

这里就可以定义一个Character-based tokenizer，来获取整数表示词元。直接在ipython中测试以下代码：

python 复制代码

from abc import ABC
class Tokenizer(ABC):
    """Abstract interface for a tokenizer."""
    def encode(self, string: str) -> list[int]:
        raise NotImplementedError

    def decode(self, indices: list[int]) -> str:
        raise NotImplementedError
# 实现很简单，就是使用`ord`和`chr`函数，分别把一串字符转为整数列表，以及从整数列表转为字符序列
class CharacterTokenizer(Tokenizer):
    """Represent a string as a sequence of Unicode code points."""
    def encode(self, string: str) -> list[int]:
        return list(map(ord, string))

    def decode(self, indices: list[int]) -> str:
        return "".join(map(chr, indices))

tokenizer = CharacterTokenizer()
string = "Hello, 🌍! 你好!"  # @inspect string
indices = tokenizer.encode(string)  # @inspect indices
reconstructed_string = tokenizer.decode(indices)  # @inspect reconstructed_string
assert string == reconstructed_string
print(indices) # 长度13
[72, 101, 108, 108, 111, 44, 32, 127757, 33, 32, 20320, 22909, 33]

def get_compression_ratio(string: str, indices: list[int]) -> float:
    """Given `string` that has been tokenized into `indices`, ."""
    num_bytes = len(bytes(string, encoding="utf-8"))  # @inspect num_bytes
    num_tokens = len(indices)                       # @inspect num_tokens
    return num_bytes / num_tokens

print(get_compression_ratio(string, indices)) # 20/13
1.5384615384615385

首先，一个Unicode字符，中文部分是使用2~4个字节表示的，ASCII部分，则是1个字节。所以总共字节数其实是：10+4+2*3=20(hello 一个逗号以及两个空格和两个叹号，都是1个字节，地球符号是4个字节，你好两个字每个都是3字节)。所以压缩比不是恒定的1

另外，这种编码的问题在于：

有的整数值很大，比如： 127757(🌍) → 词表/字典很大
- 根据wiki: List of Unicode characters：截止到 Unicode v17.0版本，现有 297,334(297K)的Unicode character有code points(整数编码对应)
整体相当于是在词汇表/字典中，为每个字符平均分配一个槽位(slot)
但是有些词的出现频率会远远高于其他词，因此这种平均/uniform统一分配的方式有些浪费。并不是一种能很好的利用运算开销(budget)的方式。→ 词频差距大，无法有效利用
所以这种基于字符编码的分词方式，比较naive

2.2 Byte-based tokenization

Unicode字符串可以使用字节序列来表示，使用0~255之间的一个整数来表示字节

最常见的Unicode编码就是utf-8。

2.2.1 原理介绍

关于UTF-8和ASCII编码的关系：

UTF-8 编码完全兼容并包含 ASCII 编码
对于 Unicode 码点在 U+0000 到 U+007F 范围内的字符（即 ASCII 字符），UTF-8 使用与 ASCII 完全相同的单字节编码方式。
- UTF-8 对 U+0000 ~ U+007F 的编码规则：
- 如果字符的 Unicode 码点 ≤ 127（即 ≤ 0x7F），直接用一个字节表示，且该字节的值等于码点本身。
- 例如，'a'（U+0061 = 97）在 UTF-8 中编码为单个字节：0x61（十进制 97）。
超出 ASCII 范围的字符：对于码点大于 U+007F 的字符（如中文、希腊字母、表情符号等），UTF-8 使用 2 到 4 个字节进行编码，而 ASCII 无法表示这些字符。

python 复制代码

# 1. 一些utf-8字符是用单字节表示的
#  Python 中的 bytes 表示:为了可读性，对可打印的 ASCII 字符（如字母、数字、常见符号）直接显示为字符形式，而不是 \x61
assert bytes("a", encoding="utf-8") == b"a"
print(bytes("a", encoding="utf-8")) # b'a'

In [3]: print(b'\x61')
b'a'

# 2. 其他中文，表情符号等都是2~4字节表示的
assert bytes("🌍", encoding="utf-8") == b"\xf0\x9f\x8c\x8d"
print(bytes("🌍", encoding="utf-8")) # b'\xf0\x9f\x8c\x8d'

assert bytes("你", encoding="utf-8") == b"\xe4\xbd\xa0"
print(bytes("你", encoding="utf-8")) # b'\xe4\xbd\xa0'

参考：菜鸟教程-ASCII 表

2.2.2 简单实现

python 复制代码

from abc import ABC
class Tokenizer(ABC):
    """Abstract interface for a tokenizer."""
    def encode(self, string: str) -> list[int]:
        raise NotImplementedError

    def decode(self, indices: list[int]) -> str:
        raise NotImplementedError
        
class ByteTokenizer(Tokenizer):
    """Represent a string as a sequence of bytes."""
    def encode(self, string: str) -> list[int]:
        string_bytes = string.encode("utf-8")  # @inspect string_bytes
        indices = list(map(int, string_bytes))  # @inspect indices
        return indices

    def decode(self, indices: list[int]) -> str:
        string_bytes = bytes(indices)  # @inspect string_bytes
        string = string_bytes.decode("utf-8")  # @inspect string
        return string

tokenizer = ByteTokenizer()
string = "Hello, 🌍! 你好!"  # @inspect string
indices = tokenizer.encode(string)  # @inspect indices
reconstructed_string = tokenizer.decode(indices)  # @inspect reconstructed_string
assert string == reconstructed_string

print(indices) # 长度为20
[72(h), 101(e), 108(l), 108(l), 111(o), 44(,), 32(英文空格), 240, 159, 140, 
141(🌏), 33(!), 32(英文空格), 228, 189, 160(你), 229, 165, 189(好), 33(!)]

def get_compression_ratio(string: str, indices: list[int]) -> float:
    """Given `string` that has been tokenized into `indices`, ."""
    num_bytes = len(bytes(string, encoding="utf-8"))  # @inspect num_bytes
    num_tokens = len(indices)                       # @inspect num_tokens
    return num_bytes / num_tokens

print(get_compression_ratio(string, indices)) # 1.0 压缩比本身就是描述每个token占用的字节数

可以看到，由于每个字节是8位数， 2 8 = 256 2^8=256 28=256，所以每个数值都是低于256的数值，这样词表就会很小。。。只有256个词

虽然词频可能无法反映出一些东西，但是不会存在很严重的稀疏性问题(对比字符编码)

但是字节编码存在的一个潜在问题是： long sequences（相对于字符编码，同样的"Hello, 🌍! 你好!", 字符编码长度是13，而字节编码则变成了20，长度↑35%）。对于中文等，最差的情况就是4倍于原始序列长度(压缩比是4)，即一个token是0.25个汉字。。

而目前的大模型的token，一般认为一个token是1.5个汉字。

考虑到Transformer的上下文长度是有限的（因为注意力是二次的quadratic），输入序列太长就不太好。

2.3 Word-based tokenization

基于词的分词，是传统的NLP里最常见的一种方式，即：把字符串序列分成词语。

python 复制代码

# regex比re更强大  pip install regex -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple 
import regex

string = "I'll say supercalifragilisticexpialidocious!"
segments = regex.findall(r"\w+|.", string)  # @inspect segments
# \w 匹配任意字母数字字符，. 匹配任意字符（除了换行符）
# 这个模式会匹配字符串中的每一个字符，但会将连续的单词字符组合成一个整体
print(segments)
['I', "'", 'll', ' ', 'say', ' ', 'supercalifragilisticexpialidocious', '!']

# GPT2 用来进行 pre Tokenizer 的一个正则表达式
# https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py#L23
GPT2_TOKENIZER_REGEX = \
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
pattern = GPT2_TOKENIZER_REGEX  # @inspect pattern
segments = regex.findall(pattern, string)  # @inspect segments
print(segments)
['I', "'ll", ' say', ' supercalifragilisticexpialidocious', '!']

这样做的问题在于：

词的数量会很大（词表不可控，这种分词方式就类似于Unicode characters，基本就是一个字/词对应一个整数）
长尾效应会导致有很多词出现次数很少，但是这种低频词的数量又会很多，导致模型对这些词无法有很充足的认知
词表数量不固定，如果遇到一个新的输入，其中包含的segments不在训练时候的词表里，那么就需要添加一个[UNK] (Unkown)标签，而[UNK]这个标签会影响困惑度计算。详见: X.1. [UNK]影响困惑度计算

2.4 Byte Pair Encoding (BPE)

2.4.1 原理介绍

wiki-Byte_pair_encoding

BPE是在1994年由 Philip Gage 提出的用于数据压缩的算法，论文
首次被引入NLP是用于进行翻译任务，在此之前，主流的论文都是使用Word Tokenizer的。Sennrich+ 2015-Neural Machine Translation of Rare Words with Subword Units
之后BPE就被GPT-2使用，进入了语言建模的时代，Radford+ 2019-Language Models are Unsupervised Multitask Learners
- 注意，GPT-2的论文中，先使用word-based tokenization把文本分割成初步的segments，然后再对每个segment，运行original BPE算法。

基本思想：

不再像以前一样预先设定好如何分割的概念，而是在原始文本上训练来自动获取词表（train the tokenizer on raw text to automatically determine the vocabulary.）
符合 常见的字符序列由单个词元（token）表示，而罕见的字符序列则由多个词元表示 这一直觉。（common sequences of characters are represented by a single token, rare sequences are represented by many tokens.）
这个直觉很类似哈夫曼编码，算法-哈夫曼编码或者哈夫曼树与哈夫曼编码：聪明的数据压缩技术，
- 使用频率越高的字符，采用越短的编码~
- 在一组字符的哈夫曼编码中，任一字符的哈弗曼编码不可能是另一字符的哈弗曼编码的前缀

BPE算法的概要实现：

把字符串序列转为字节序列
连续合并最常见的相邻对(successively merge the most common pair of adjacent tokens.)，即：如果一个标记对频繁出现，那么就把它压缩成一个token。

2.4.2 简单实现

建议用Debug模式运行下面代码，查看其中一些标记为inspect的变量，更高的理解执行过程

更详细的说明过程位于： https://github.com/CastleDream/cs336/blob/main/ch1_Tokenizer/code/c1_BPE.py

这个代码还是要自己跑一遍，才能印象深刻

python 复制代码

from collections import defaultdict
def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:  # @inspect indices, @inspect pair, @inspect new_index
    """Return `indices`, but with all instances of `pair` replaced with `new_index`."""
    new_indices = []  # @inspect new_indices
    i = 0  # @inspect i
    while i < len(indices):
        if i + 1 < len(indices) and indices[i] == pair[0] and indices[i + 1] == pair[1]:
            new_indices.append(new_index)
            i += 2
        else:
            new_indices.append(indices[i])
            i += 1
    return new_indices

# dataclass的作用直接问GPT
from dataclasses import dataclass
@dataclass(frozen=True)
class BPETokenizerParams:
    """All you need to specify a BPETokenizer."""
    vocab: dict[int, bytes]     # index -> bytes
    merges: dict[tuple[int, int], int]  # index1,index2 -> new_index

def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:
    """
    Args:
        num_merges(int): 表示要执行多少次合并，整个合并过程要重复多少遍
    Returns:
        BPETokenizerParams: 
    """
    # 1. 把字符串序列转为字节/整数序列
    indices = list(map(int, string.encode("utf-8")))  # @inspect indices
    print(indices)
    # 记录合并了哪些内容，一个映射，键的两个整数表示字节，或者已经存在的token，值表示合并后新创建的token
    merges: dict[tuple[int, int], int] = {}  # index1, index2 => merged index,
    print("merges: ", merges)
    # 方便表示索引到字节的映射
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}  # index -> bytes, 
    print("vocab: ", vocab)
    # 2. 开始循环
    for i in range(num_merges):
        # 计算每个tokens对共现的次数 Count the number of occurrences of each pair of tokens
        counts = defaultdict(int)
        for index1, index2 in zip(indices, indices[1:]):  # For each adjacent pair
            counts[(index1, index2)] += 1  # @inspect counts
        print(f"num_merges={i}, counts={counts}")
        # 找到出现次数最多的对，Find the most common pair.
        pair = max(counts, key=counts.get)  # @inspect pair
        index1, index2 = pair
        # 合并出现自出最多的对 Merge that pair.
        new_index = 256 + i  # @inspect new_index
        merges[pair] = new_index  # @inspect merges
        vocab[new_index] = vocab[index1] + vocab[index2]  # @inspect vocab
        indices = merge(indices, pair, new_index)  # @inspect indices
    return BPETokenizerParams(vocab=vocab, merges=merges)

string = "the cat in the hat"  # @inspect string
params = train_bpe(string, num_merges=3)

# indices 输出：[116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]
# merges 输出：{}
# vocab 输出：vocab:  {0: b'\x00', 1: b'\x01', 2: b'\x02', 3: b'\x03', 4: b'\x04', 5: b'\x05', 6: b'\x06', ..., 121: b'y', 122: b'z', ..., 255: b'\xff'}

# 即 116, 104相邻的情况有2次，必须是116在前，然后104； 116,104≠ 104,116（是排列，不是组合）
# num_merges=0, counts={(116, 104): 2, (104, 101): 2, (101, 32): 2, (32, 99): 1, (99, 97): 1, (97, 116): 2, (116, 32): 1, (32, 105): 1, (105, 110): 1, (110, 32): 1, (32, 116): 1, (32, 104): 1, (104, 97): 1})
# num_merges=1, counts={(256, 101): 2, (101, 32): 2, (32, 99): 1, (99, 97): 1, (97, 116): 2, (116, 32): 1, (32, 105): 1, (105, 110): 1, (110, 32): 1, (32, 256): 1, (32, 104): 1, (104, 97): 1})
# num_merges=2, counts={(257, 32): 2, (32, 99): 1, (99, 97): 1, (97, 116): 2, (116, 32): 1, (32, 105): 1, (105, 110): 1, (110, 32): 1, (32, 257): 1, (32, 104): 1, (104, 97): 1})

3. 作业

上面实现的BPE算法很简陋，可以根据以下提示优化上面的BPE算法：

encode() currently loops over all merges. Only loop over merges that matter.
- 上面编码的时候，会对所有merge对象进行搜索，但是实际上并不会用到所有的merge组合，应该只处理那些存在的merge对象
Detect and preserve special tokens (e.g., <|endoftext|>).
Use pre-tokenization (e.g., the GPT-2 tokenizer regex).
Try to make the implementation as fast as possible.

4. 总结

Tokenizer: strings <-> tokens (indices)
Character-based, byte-based, word-based tokenization highly suboptimal
BPE is an effective heuristic that looks at corpus statistics
Tokenization is a necessary evil, maybe one day we'll just do it from bytes...

X. 其他

X.1. [UNK]影响困惑度计算

Qwen3-Max使用[UNK]会mess up perplexity calculations吗？举例说明，得到以下回答：

UNK\]（表示"未知词"）会显著影响语言模型的困惑度（perplexity）计算。原因在于： * 当一个词被替换为 \[UNK\] 时，模型无法准确估计该词的真实概率 * 从而可能低估或高估整个句子的可能性，导致 perplexity 偏高（通常偏高，因为 \[UNK\] 的概率往往较低）。 **困惑度（Perplexity）简介** 困惑度是对语言模型预测能力的一种度量，定义为： Perplexity = exp ⁡ ( − 1 N ∑ i = 1 N log ⁡ P ( w i ∣ w 1 , ... , w i − 1 ) ) \\text{Perplexity} = \\exp\\left(-\\frac{1}{N} \\sum_{i=1}\^{N} \\log P(w_i \\mid w_1, \\dots, w_{i-1})\\right) Perplexity=exp(−N1i=1∑NlogP(wi∣w1,...,wi−1)) 其中 (N) 是词的数量，(P(w_i \\mid \\text{context})) 是模型预测第 (i) 个词的概率。 **\[UNK\] 如何影响 perplexity?** 假设我们有一个测试句子： > "Kubernetes is powerful." 但词汇表中没有 "Kubernetes"，于是它被替换为 \[UNK\]： > "\[UNK\] is powerful." **情况 1：词汇表中包含 "Kubernetes"** * 模型可能给 "Kubernetes" 赋予合理的概率（比如 0.01）。 * 整体句子 log 概率 = log(0.01) + log(P("is" \| "Kubernetes")) + log(P("powerful" \| "Kubernetes is")) * 假设结果为：-4.6 -- 1.2 -- 2.0 = -7.8 * Perplexity ≈ exp(7.8 / 3) ≈ exp(2.6) ≈ 13.5 \*\* 情况 2：词汇表不含 "Kubernetes"，被替换为 \[UNK\]\*\* * 模型对 \[UNK\] 的概率可能非常低（比如 0.0001），因为 \[UNK\] 通常很少出现在训练数据中（或被刻意限制频率）。 * 句子变为："\[UNK\] is powerful." * 整体 log 概率 = log(0.0001) + log(P("is" \| "\[UNK\]")) + log(P("powerful" \| "\[UNK\] is")) * = -9.2 -- 1.5 -- 2.2 = -12.9 * Perplexity ≈ exp(12.9 / 3) ≈ exp(4.3) ≈ 73.7 > **对比**：困惑度从 \~13.5 暴涨到 \~73.7，仅因一个词被替换为 \[UNK\]。 **更严重的后果** * 如果测试集中有很多未登录词（OOV, out-of-vocabulary），\[UNK\] 频繁出现，会导致 perplexity 严重失真，无法真实反映模型的语言建模能力。 * 在实际评估中，这种偏差可能误导模型选择或超参调优。 参考： * ✅[深入理解语言模型的困惑度(perplexity)](https://zhuanlan.zhihu.com/p/686808564)