How CLIPTokenizer uses a known vocabulary to BPE-encode input text

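I recently ran into a tokenization problem involving CLIP. I looked online for articles explaining how CLIP tokenizes text, but they all lacked the necessary details, so in the end I read the code of the transformers package directly.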

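CLIP uses the BPE algorithm. Encoding consults two tables:

1. The merge table. It lists symbol pairs ordered from most to least frequent and is used later when merging the pieces of a word. On HuggingFace the file is named merges.txt.
2. The vocabulary. This is a mapping table that turns the merged pieces into token_ids. On HuggingFace the file is named vocab.json.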

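I debugged with the tables provided by the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k repository (the local script further down loads a copy stored in a directory named CLIP-ViT-g-14-laion2B-s34B-b88K). Other projects built on CLIP models should, in theory, use the same tables.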

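Encoding has roughly four steps:

1. Normalize the text (normalization).
2. Pre-tokenize the normalized text (pre-tokenization).
3. Split each pre-tokenized word into characters, then merge them in the order given by the merge table (table 1 above).
4. Map the merged strings to token_ids through the vocabulary (table 2 above).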

The start and end IDs are then added to the left and right of the token_id sequence:

startoftext = 49406
endoftext = 49407

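The CLIP model handles exactly 77 token_ids. If the sequence is shorter than 77 after the steps above, the remainder is padded with 0 (this follows an .ipynb notebook in the mlfoundations/open_clip repository, whose output shows zero padding). If it is longer than 77, everything past position 77 is cut off and the 77th token is set to endoftext.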
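The padding and truncation rule just described can be sketched as below. This is a minimal illustration, not code from transformers or open_clip: the function name `to_context` and its parameters are made up for this example, and the pad value 0 follows the open_clip notebook mentioned above.

```python
SOT = 49406  # <|startoftext|>
EOT = 49407  # <|endoftext|>


def to_context(ids, context_length=77, pad_id=0):
    """Wrap ids with start/end markers, then truncate or pad to context_length.

    Hypothetical helper for illustration only.
    """
    seq = [SOT] + list(ids) + [EOT]
    if len(seq) > context_length:
        # Too long: cut off the tail and force the last slot to be endoftext.
        seq = seq[:context_length]
        seq[-1] = EOT
    # Too short: fill the remaining slots with the pad value.
    return seq + [pad_id] * (context_length - len(seq))


short = to_context([5031, 1807])
long_ = to_context(range(1, 200))
print(len(short), short[:5])
print(len(long_), long_[-1])
```

Both results have length 77; the short one ends in zeros, the long one ends in the endoftext ID.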

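The encoding process

1. Normalization

This step cleans up the text: invalid characters are dropped, extra whitespace is collapsed, and upper case is converted to lower case so that later steps are simpler. It consists of the following operations:

1. Remove NUL (\0), U+FFFD, and control characters, and replace every whitespace character (including line breaks and tabs) with a space.
2. Add a space on each side of every CJK ideograph.
3. Normalize the string to Unicode NFC form.
4. Strip leading and trailing whitespace and split the string on whitespace into a list.
5. Convert each item of the list to lower case.

After these five operations the items are joined again with single spaces.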

During debugging, the text before and after these five operations was:

Before:
Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
After:
electric power is everywhere present in unlimited quantities, it can drive the world's machinery without the need of coal, oil, gas or any other fuel.

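The Before text, i.e. the raw input, has three lines. During processing each line break was replaced by a space, so the result is a single line, and the upper-case letters were converted to lower case.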
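The five normalization operations can be approximated with the standard library alone. The sketch below is a simplified re-implementation for illustration, not the library code: the helper names (`is_whitespace`, `is_control`, `is_cjk_ideograph`, `normalize`) are my own, and only the two largest CJK ideograph ranges are checked.

```python
import unicodedata


def is_whitespace(ch):
    # Space, tab, newline, carriage return, or any Unicode space separator.
    return ch in " \t\n\r" or unicodedata.category(ch) == "Zs"


def is_control(ch):
    # Tab/newline/carriage return count as whitespace, not control characters.
    if ch in "\t\n\r":
        return False
    return unicodedata.category(ch).startswith("C")


def is_cjk_ideograph(cp):
    # Simplification: only two of the CJK ideograph ranges.
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def normalize(text):
    cleaned = []
    for ch in text:
        if ord(ch) in (0, 0xFFFD) or is_control(ch):
            continue  # operation 1: drop invalid characters
        cleaned.append(" " if is_whitespace(ch) else ch)
    # operation 2: surround CJK ideographs with spaces
    spaced = "".join(f" {ch} " if is_cjk_ideograph(ord(ch)) else ch for ch in cleaned)
    # operation 3: NFC; operations 4 and 5: split, lower-case, re-join
    nfc = unicodedata.normalize("NFC", spaced)
    return " ".join(item.lower() for item in nfc.split())


raw = (
    "Electric Power is Everywhere Present In Unlimited Quantities,\n"
    "It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas\n"
    "Or Any Other Fuel."
)
print(normalize(raw))
```

Run on the three-line input, this prints the single lower-case line shown in the After text.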

2. Pre-tokenization

A regular expression splits the string into a list of strings:

<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+

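The regular expression does the following:

1. Matches <|startoftext|>
2. Matches <|endoftext|>
3. Matches 's
4. Matches 't
5. Matches 're
6. Matches 've
7. Matches 'm
8. Matches 'll
9. Matches 'd
10. [\p{L}]+ matches one or more Unicode letters (upper and lower case alike), i.e. a run of consecutive letters.
11. [\p{N}] matches a single numeric character; consecutive digits are matched one at a time.
12. [^\s\p{L}\p{N}]+ matches one or more characters that are not whitespace, Unicode letters, or numeric characters.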

These rules split the normalized text into multiple strings. The elements of the resulting list are shown below, one per line:

electric
power
is
everywhere
present
in
unlimited
quantities
,
it
can
drive
the
world
's
machinery
without
the
need
of
coal
,
oil
,
gas
or
any
other
fuel
.

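Note: the list contains no whitespace, because whitespace is not matched by any alternative in the regular expression. Rule 11 splits a run of digits into individual digits; for example, 1234567 becomes 1/2/3/4/5/6/7.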
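The pattern relies on \p{L} and \p{N}, which Python's built-in re module does not support; the library uses the third-party regex package for that. To try the splitting behaviour without extra dependencies, the sketch below swaps in an approximation built from standard-library character classes. The stand-ins are my own assumption and agree with the real pattern for ordinary ASCII text, not necessarily for every Unicode character.

```python
import re

# Approximation of the CLIP pre-tokenization pattern using only stdlib `re`.
APPROX_PAT = re.compile(
    r"<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d"
    r"|[^\W\d_]+"       # stand-in for [\p{L}]+ : a run of letters
    r"|\d"              # stand-in for [\p{N}]  : exactly one digit
    r"|(?:[^\s\w]|_)+"  # stand-in for [^\s\p{L}\p{N}]+ : a run of punctuation
)


def pre_tokenize(text):
    """Return the list of non-whitespace pieces matched by the pattern."""
    return APPROX_PAT.findall(text)


pieces = pre_tokenize("the world's machinery, 1234567")
print(pieces)
```

The alternatives are tried from left to right, which is why world's is split into world and 's rather than world, ' and s.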

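3. Splitting and merging

This step is the BPE encoding itself. It operates on each string produced by the previous step: every element of the list above goes through its own independent BPE pass, one element at a time, with no connection between neighbouring elements.

Splitting is simple: each string is broken into individual characters, and </w> is appended to the last character. For example, the first element, electric, becomes 8 symbols: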

e l e c t r i c</w>

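The symbols are then merged according to how frequently the pairs occur, using the merge table (merges.txt). The steps are as follows.

Step one: split

First, form every pair of adjacent symbols and collect the pairs into a set. For the example e l e c t r i c</w> the set is: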

(e l) (l e) (e c) (c t) (t r) (r i) (i c</w>)

The set contains 7 pairs.

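Step two: merge

Look up the position of each pair in the merge table; the earlier a pair appears, the more frequent it was in the training corpus. For example, the 7 pairs appear on these lines:

Pair	Line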
`(e l)`	34
`(l e)`	24
`(e c)`	237
`(c t)`	2871
`(t r)`	125
`(r i)`	43
`(i c</w>)`	519

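(l e) appears on line 24, the earliest position of the seven, so it is merged first. In other words, le is now treated as a single symbol and is no longer split apart within electric.

Before merging le: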

e l e c t r i c</w>

After merging:

e le c t r i c</w>

The pair set changes from

(e l) (l e) (e c) (c t) (t r) (r i) (i c</w>)

to:

(e le) (le c) (c t) (t r) (r i) (i c</w>)
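One round of pairing and merging on electric can be reproduced with a few lines of Python. This is an illustrative sketch: `RANKS` contains only the seven line numbers listed in the table above, and the helper names `adjacent_pairs` and `merge_pair` are my own rather than names from the library.

```python
# Line numbers of the seven pairs, copied from the table above.
RANKS = {
    ("e", "l"): 34, ("l", "e"): 24, ("e", "c"): 237, ("c", "t"): 2871,
    ("t", "r"): 125, ("r", "i"): 43, ("i", "c</w>"): 519,
}


def adjacent_pairs(word):
    """Set of all neighbouring symbol pairs in a tuple of symbols."""
    return set(zip(word, word[1:]))


def merge_pair(word, pair):
    """Merge every occurrence of `pair` in `word`, scanning left to right."""
    first, second = pair
    merged, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
            merged.append(first + second)
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)


word = tuple("electri") + ("c</w>",)   # ('e','l','e','c','t','r','i','c</w>')
pairs = adjacent_pairs(word)
best = min(pairs, key=lambda p: RANKS.get(p, float("inf")))  # smallest line number
merged = merge_pair(word, best)
print(best, merged)
```

The selected pair is (l, e), and the merged word has 7 symbols with 6 adjacent pairs, matching the sets shown above.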

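Step three: repeat steps one and two

Each merge leaves fewer pairs in the set built for the next round. Pairing and merging are repeated until none of the remaining pairs can be found in the merge table, or until only a single symbol is left.

After this step, every element of the example looks unchanged apart from the appended </w>:

Before BPE	After BPE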
electric	`electric</w>`
power	`power</w>`
is	`is</w>`
everywhere	`everywhere</w>`
present	`present</w>`
in	`in</w>`
unlimited	`unlimited</w>`
quantities	`quantities</w>`
,	`,</w>`
it	`it</w>`
can	`can</w>`
drive	`drive</w>`
the	`the</w>`
world	`world</w>`
's	`'s</w>`
machinery	`machinery</w>`
without	`without</w>`
the	`the</w>`
need	`need</w>`
of	`of</w>`
coal	`coal</w>`
,	`,</w>`
oil	`oil</w>`
,	`,</w>`
gas	`gas</w>`
or	`or</w>`
any	`any</w>`
other	`other</w>`
fuel	`fuel</w>`
.	`.</w>`

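Nothing else changed because every word in the debugging sentence is a common word, so each one can be merged all the way back together using the merge table. Mashing the keyboard at random shows the difference, for example:

Before BPE	After BPE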
diwocrnuafjkalopwqqo	`di` `wo` `cr` `nu` `af` `j` `kal` `op` `w` `qq` `o</w>`

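4. Mapping to IDs

This step uses the vocabulary mentioned above, vocab.json, to look up, in order, the ID of every piece produced by splitting and merging. The result is the final token_id sequence.

Taking the last example:

Subword	`di`	`wo`	`cr`	`nu`	`af`	`j`	`kal`	`op`	`w`	`qq`	`o</w>`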
ID	570	1613	1075	1156	702	73	4187	676	86	31960	334

One special case: if a piece cannot be found in the vocabulary, its ID is set to the ID of UNK, meaning an unknown token. In CLIP, UNK = endoftext = 49407.
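The lookup with its UNK fallback amounts to a dictionary get with a default. The sketch below uses a tiny excerpt of the vocabulary, containing only the IDs quoted in this article; `VOCAB_EXCERPT` and `tokens_to_ids` are illustrative names, and the real vocab.json holds far more entries.

```python
# Excerpt only: the IDs quoted in the table above plus the endoftext entry.
VOCAB_EXCERPT = {
    "di": 570, "wo": 1613, "cr": 1075, "nu": 1156, "af": 702, "j": 73,
    "kal": 4187, "op": 676, "w": 86, "qq": 31960, "o</w>": 334,
    "<|endoftext|>": 49407,
}
UNK_TOKEN = "<|endoftext|>"  # CLIP reuses endoftext as the unknown token


def tokens_to_ids(tokens, vocab=VOCAB_EXCERPT):
    """Look up each piece; pieces missing from the vocabulary get the UNK id."""
    unk_id = vocab[UNK_TOKEN]
    return [vocab.get(token, unk_id) for token in tokens]


pieces = ["di", "wo", "cr", "nu", "af", "j", "kal", "op", "w", "qq", "o</w>"]
print(tokens_to_ids(pieces))
print(tokens_to_ids(["not-in-this-excerpt"]))
```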

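Debugging the code locally

This part records the environment, package versions, and key source code I used for local debugging, along with notes on the logic.

Environment

Windows, with this version of the transformers package: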

$ pip freeze | grep transformers
DEPRECATION: Loading egg at d:\programs\python3\lib\site-packages\vboxapi-1.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
sentence-transformers==2.2.2
transformers==4.36.2

Locations of the key classes

CLIPTokenizer:

D:\programs\python3\lib\site-packages\transformers\models\clip\tokenization_clip.py

PreTrainedTokenizer:

D:\programs\python3\lib\site-packages\transformers\tokenization_utils.py

PreTrainedTokenizerBase:

D:\programs\python3\lib\site-packages\transformers\tokenization_utils_base.py

SpecialTokensMixin:

D:\programs\python3\lib\site-packages\transformers\tokenization_utils_base.py

PushToHubMixin:

D:\programs\python3\lib\site-packages\transformers\utils\hub.py

Inheritance:

                                                                  -> SpecialTokensMixin
CLIPTokenizer -> PreTrainedTokenizer -> PreTrainedTokenizerBase /
                                                                \
                                                                  -> PushToHubMixin

Debugging script

The local debugging script, tokenizer.py:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    'D:\\files\\CLIP-ViT-g-14-laion2B-s34B-b88K'
)

text = """Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel."""

print("Input:\n" + text)
print(tokenizer(text))

Vocabulary files that are loaded

Adding temporary print statements to CLIPTokenizer.__init__ as shown below and running the script reveals which files are loaded:

print('Load vocab_file, vocab_file=' + vocab_file)
with open(vocab_file, encoding="utf-8") as vocab_handle:
    self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
self.errors = errors  # how to handle errors in decoding
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
print('Load merges_file, merges_file=' + merges_file)
with open(merges_file, encoding="utf-8") as merges_handle:
    bpe_merges = merges_handle.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]

The file locations are printed in the run below:

$ python tokenizer.py
Load vocab_file, vocab_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\vocab.json
Load merges_file, merges_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\merges.txt
Input:
Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
{'input_ids': [49406, 5031, 1807, 533, 6364, 2881, 530, 11015, 33917, 267, 585, 753, 2620, 518, 1002, 568, 21858, 2193, 518, 1262, 539, 7919, 267, 2870, 267, 2474, 541, 1504, 1010, 5945, 269, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

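Note: for the second file, the first line is dropped and the remaining lines 2 through 48895 are kept, one list element per line, in the variable bpe_merges. Then the following runs: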

bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

bpe_merges looks like ['i n', 't h', 'a n', 'r e', ...], where each string has a space in the middle. Each element is split on that space into a tuple, giving:

[('i', 'n'), ('t', 'h'), ('a', 'n'), ('r', 'e'), ...]

The second statement, self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))), builds a dictionary that maps each tuple to its position in the list:

self.bpe_ranks = {
    ('i', 'n'): 0,
    ('t', 'h'): 1,
    ('a', 'n'): 2,
    ('r', 'e'): 3,
    ...
}

I note this here because self.bpe_ranks is used later.
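The construction of the rank table can be tried on a tiny in-memory stand-in for merges.txt. In the sketch below, the four merge entries are the ones quoted above, while the header line `#version: 0.2` and the helper name `load_ranks` are assumptions made for this example.

```python
# Stand-in for merges.txt: one header line (assumed), then one pair per line.
MERGES_TEXT = "#version: 0.2\ni n\nt h\na n\nr e\n"


def load_ranks(text, stop=None):
    """Drop the header, keep lines [1:stop], map each pair to its position."""
    lines = text.strip().split("\n")[1:stop]
    merges = [tuple(line.split()) for line in lines]
    return dict(zip(merges, range(len(merges))))


ranks = load_ranks(MERGES_TEXT)
print(ranks)

# The slice bound used by the library evaluates to 48895, so indices 1..48894
# are kept, i.e. file lines 2 through 48895.
stop = 49152 - 256 - 2 + 1
print(stop, len(range(1, stop)))
```

A lower rank means the pair was learned earlier, which is why the BPE loop later picks the pair with the smallest value.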

The encoding function

Printing the input and output

Add one line that prints the argument to the following CLIPTokenizer method:

def _tokenize(self, text):
    """Tokenize a string."""
    print('Called method _tokenize(), param text=' + text)

Running again prints:

$ python tokenizer.py
Load vocab_file, vocab_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\vocab.json
Load merges_file, merges_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\merges.txt
Input:
Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
Called method _tokenize(), param text=Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
{'input_ids': [49406, 5031, 1807, 533, 6364, 2881, 530, 11015, 33917, 267, 585, 753, 2620, 518, 1002, 568, 21858, 2193, 518, 1262, 539, 7919, 267, 2870, 267, 2474, 541, 1504, 1010, 5945, 269, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

To inspect what the method returns, print the return value and its type. The modified code:

for token in re.findall(self.pat, text):
    token = "".join(
        self.byte_encoder[b] for b in token.encode("utf-8")
    )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
    bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
print('bpe_tokens=')
print(bpe_tokens)
print('bpe_tokens type=')
print(type(bpe_tokens))
return bpe_tokens

Running it gives:

$ python tokenizer.py
Load vocab_file, vocab_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\vocab.json
Load merges_file, merges_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\merges.txt
Input:
Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
Called method _tokenize(), param text=Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
bpe_tokens=
['electric</w>', 'power</w>', 'is</w>', 'everywhere</w>', 'present</w>', 'in</w>', 'unlimited</w>', 'quantities</w>', ',</w>', 'it</w>', 'can</w>', 'drive</w>', 'the</w>', 'world</w>', "'s</w>", 'machinery</w>', 'without</w>', 'the</w>', 'need</w>', 'of</w>', 'coal</w>', ',</w>', 'oil</w>', ',</w>', 'gas</w>', 'or</w>', 'any</w>', 'other</w>', 'fuel</w>', '.</w>']
bpe_tokens type=
<class 'list'>
{'input_ids': [49406, 5031, 1807, 533, 6364, 2881, 530, 11015, 33917, 267, 585, 753, 2620, 518, 1002, 568, 21858, 2193, 518, 1262, 539, 7919, 267, 2870, 267, 2474, 541, 1504, 1010, 5945, 269, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Inside the _tokenize method

Because the method contains an if statement, I checked which branch actually runs:

if self.fix_text is None:
    print('self.fix_text is None')
    text = " ".join(self.nlp.tokenize(text))
else:
    print('self.fix_text is not None')
    text = whitespace_clean(self.fix_text(text)).lower()

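The output is below. It shows that the if condition is true, so the first branch ran (self.fix_text is None when ftfy cannot be imported, as the __init__ code further down shows).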

$ python tokenizer.py
Load vocab_file, vocab_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\vocab.json
Load merges_file, merges_file=D:\files\CLIP-ViT-g-14-laion2B-s34B-b88K\merges.txt
Input:
Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
Called method _tokenize(), param text=Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
self.fix_text is None
bpe_tokens=
['electric</w>', 'power</w>', 'is</w>', 'everywhere</w>', 'present</w>', 'in</w>', 'unlimited</w>', 'quantities</w>', ',</w>', 'it</w>', 'can</w>', 'drive</w>', 'the</w>', 'world</w>', "'s</w>", 'machinery</w>', 'without</w>', 'the</w>', 'need</w>', 'of</w>', 'coal</w>', ',</w>', 'oil</w>', ',</w>', 'gas</w>', 'or</w>', 'any</w>', 'other</w>', 'fuel</w>', '.</w>']
bpe_tokens type=
<class 'list'>
{'input_ids': [49406, 5031, 1807, 533, 6364, 2881, 530, 11015, 33917, 267, 585, 753, 2620, 518, 1002, 568, 21858, 2193, 518, 1262, 539, 7919, 267, 2870, 267, 2474, 541, 1504, 1010, 5945, 269, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The complete method is below. _tokenize() splits text into pieces, marks word ends with </w>, and returns the pieces as a list.

def _tokenize(self, text):
    bpe_tokens = []
    if self.fix_text is None:
        text = " ".join(self.nlp.tokenize(text))
    else:
        text = whitespace_clean(self.fix_text(text)).lower()

    for token in re.findall(self.pat, text):
        token = "".join(
            self.byte_encoder[b] for b in token.encode("utf-8")
        )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
        bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
    return bpe_tokens

Whichever branch of the if-else runs, its purpose is to reassign the variable text. Printing shows text before and after the assignment:

Before:
Electric Power is Everywhere Present In Unlimited Quantities,
It Can Drive The World's Machinery Without The Need Of Coal, Oil, Gas
Or Any Other Fuel.
After:
electric power is everywhere present in unlimited quantities, it can drive the world's machinery without the need of coal, oil, gas or any other fuel.

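Judging from the output, self.nlp.tokenize(text) lower-cases the text and turns the line breaks into spaces. Concretely, it does the following:

1. Removes NUL (\0), U+FFFD, and control characters, and replaces whitespace characters (including line breaks) with spaces
2. Adds a space on each side of every CJK ideograph
3. Normalizes the whole string to Unicode NFC form
4. Strips leading and trailing whitespace and splits the string on whitespace into a list
5. Lower-cases each item of the list

See the section on self.nlp.tokenize below for the detailed walkthrough.

Next comes regular-expression matching. The pattern is: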

"<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+"

Printing the pattern while debugging shows:

<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+

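It does the following:

1. Matches <|startoftext|>
2. Matches <|endoftext|>
3. Matches 's
4. Matches 't
5. Matches 're
6. Matches 've
7. Matches 'm
8. Matches 'll
9. Matches 'd
10. [\p{L}]+ matches one or more Unicode letters (upper and lower case alike), i.e. a run of consecutive letters.
11. [\p{N}] matches a single numeric character; consecutive digits are matched one at a time.
12. [^\s\p{L}\p{N}]+ matches one or more characters that are not whitespace, Unicode letters, or numeric characters.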

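Note that none of the alternatives match whitespace, so whitespace is filtered out by this step.

Next, for token in re.findall(self.pat, text): iterates over the pieces matched by the pattern, one at a time. Printing the token handled in each iteration: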

for token in re.findall(self.pat, text):
    print('token=#' + token + '#') # the # markers make it visible that token contains no whitespace

The tokens are as follows (the text between the # signs is the lower-cased word from above; the spaces are gone):

token=#electric#
token=#power#
token=#is#
token=#everywhere#
token=#present#
token=#in#
token=#unlimited#
token=#quantities#
token=#,#
token=#it#
token=#can#
token=#drive#
token=#the#
token=#world#
token=#'s#
token=#machinery#
token=#without#
token=#the#
token=#need#
token=#of#
token=#coal#
token=#,#
token=#oil#
token=#,#
token=#gas#
token=#or#
token=#any#
token=#other#
token=#fuel#
token=#.#

Then each token is encoded to bytes, and every byte is converted through a prebuilt mapping:

token = "".join(
    self.byte_encoder[b] for b in token.encode("utf-8")
)  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
print('token=~' + token + '~') # debug output that I added

As the code shows, the mapped characters are joined back into one string (printed between tildes ~). The output:

token=#electric#
token=~electric~
token=#power#
token=~power~
token=#is#
token=~is~
token=#everywhere#
token=~everywhere~
token=#present#
token=~present~
token=#in#
token=~in~
token=#unlimited#
token=~unlimited~
token=#quantities#
token=~quantities~
token=#,#
token=~,~
token=#it#
token=~it~
token=#can#
token=~can~
token=#drive#
token=~drive~
token=#the#
token=~the~
token=#world#
token=~world~
token=#'s#
token=~'s~
token=#machinery#
token=~machinery~
token=#without#
token=~without~
token=#the#
token=~the~
token=#need#
token=~need~
token=#of#
token=~of~
token=#coal#
token=~coal~
token=#,#
token=~,~
token=#oil#
token=~oil~
token=#,#
token=~,~
token=#gas#
token=~gas~
token=#or#
token=~or~
token=#any#
token=~any~
token=#other#
token=~other~
token=#fuel#
token=~fuel~
token=#.#
token=~.~

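Nothing changes here, because every byte of these tokens is printable ASCII, which the table maps to itself. The step exists so that BPE never sees whitespace or control characters: such bytes, like the bytes of non-ASCII text, are replaced by other printable characters. self.byte_encoder is assigned by self.byte_encoder = bytes_to_unicode(), and bytes_to_unicode is: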

@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a mapping to unicode strings. We specifically avoids mapping to whitespace/control
    characters the bpe code barfs on.

    The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab
    if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for
    decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup
    tables between utf-8 bytes and unicode strings.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

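The function builds a mapping from byte values (0 to 255) to Unicode characters. The mapping deliberately avoids whitespace and control characters, because they cause problems in BPE (Byte Pair Encoding).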
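To see what the table does to input that is not plain printable ASCII, the function can be exercised on its own. The block below restates bytes_to_unicode in compact form, with numeric code points in place of the character literals, and adds a helper, `map_bytes`, that is my own name for the join expression used in _tokenize.

```python
def bytes_to_unicode():
    # Bytes that already are printable characters keep their own code point:
    # "!".."~", "¡".."¬", "®".."ÿ".
    bs = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Whitespace/control bytes are shifted to code points 256 and up.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_ENCODER = bytes_to_unicode()


def map_bytes(token):
    """UTF-8 encode the token, then replace every byte by its mapped character."""
    return "".join(BYTE_ENCODER[b] for b in token.encode("utf-8"))


print(map_bytes("electric"))           # printable ASCII is left unchanged
print(repr(BYTE_ENCODER[ord(" ")]))    # the space byte maps to a visible character
print(len(map_bytes("\u4e2d")))        # one CJK ideograph is three UTF-8 bytes
```

All 256 byte values receive distinct characters, so the mapping is reversible, which is what self.byte_decoder relies on.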

Finally, each token is BPE-encoded. The call is this line:

bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))

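The order of execution is:

1. self.bpe(token): the main part of the BPE encoding
2. .split(" "): split the result on spaces into a list
3. bpe_token for bpe_token in ...: iterate over the pieces
4. bpe_tokens.extend(): append them to the return value

Inside bpe

The token is split into individual characters and </w> is appended to the last one. For example, without becomes: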

('w', 'i', 't', 'h', 'o', 'u', 't</w>')

Adjacent symbols are then paired in order (by calling get_pairs) and collected into the set pairs: w with i, i with t, t with h, and so on:

pairs={('w', 'i'), ('i', 't'), ('t', 'h'), ('h', 'o'), ('o', 'u'), ('u', 't</w>')}

If there are no pairs at all (that is, the token passed in has only one character), token + </w> is returned directly.
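get_pairs is called here but its source is not listed in this article. A sketch that is consistent with the pairs shown above could look like the following; treat it as an illustration of the behaviour rather than a verbatim copy of the library function.

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a tuple of symbols."""
    pairs = set()
    prev = word[0]
    for symbol in word[1:]:
        pairs.add((prev, symbol))
        prev = symbol
    return pairs


word = ("w", "i", "t", "h", "o", "u", "t</w>")
print(get_pairs(word))

# A single-character token yields a one-symbol word and therefore no pairs.
print(get_pairs(("a</w>",)))
```

Because the result is a set, repeated pairs collapse into one entry and the printed order is arbitrary.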

Then comes the loop.

while True:
    bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
    if bigram not in self.bpe_ranks: # none of the remaining pairs appears in merges.txt, so merging is finished
        break
    first, second = bigram
    new_word = []
    i = 0
    while i < len(word):
        try:
            j = word.index(first, i)
        except ValueError:
            new_word.extend(word[i:])
            break
        else:
            new_word.extend(word[i:j])
            i = j

        if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
            new_word.append(first + second)
            i += 2
        else:
            new_word.append(word[i])
            i += 1
    new_word = tuple(new_word)
    word = new_word
    if len(word) == 1:
        break
    else:
        pairs = get_pairs(word)
word = " ".join(word)

There are two nested loops. The outer one:

while True:
    bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
    if bigram not in self.bpe_ranks: # none of the remaining pairs appears in merges.txt, so merging is finished
        break
    first, second = bigram
    # ...... inner loop omitted
    new_word = tuple(new_word)
    word = new_word
    if len(word) == 1:
        break
    else:
        pairs = get_pairs(word)
word = " ".join(word)

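The outer loop finds, among all current pairs, the one with the smallest line number in self.bpe_ranks (that is, in merges.txt), namely (first, second), and hands it to the inner loop, after which new pairs are computed. The outer loop exits when none of the pairs is in self.bpe_ranks, or when word has been reduced to a single symbol. new_word is the result produced by the inner loop: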

new_word = []
i = 0
while i < len(word):
    try:
        j = word.index(first, i)
    except ValueError:
        new_word.extend(word[i:])
        break
    else:
        new_word.extend(word[i:j])
        i = j

    if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
        new_word.append(first + second)
        i += 2
    else:
        new_word.append(word[i])
        i += 1

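The inner loop finds every occurrence in word of the lowest-ranked pair chosen by the outer loop (bigram in the output below) and merges each one (new_word is the merged result). The outer loop then calls get_pairs again to produce the new pairs. The debug output for without: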

bpe begin, no cache
word = tuple(1) + (2 + "</w>")
1=
withou
2=
t
word=
('w', 'i', 't', 'h', 'o', 'u', 't</w>')
pairs=
{('h', 'o'), ('o', 'u'), ('u', 't</w>'), ('w', 'i'), ('i', 't'), ('t', 'h')}
bigram=
('t', 'h')
new_word=
['w', 'i', 'th', 'o', 'u', 't</w>']
bigram=
('o', 'u')
new_word=
['w', 'i', 'th', 'ou', 't</w>']
bigram=
('ou', 't</w>')
new_word=
['w', 'i', 'th', 'out</w>']
bigram=
('w', 'i')
new_word=
['wi', 'th', 'out</w>']
bigram=
('wi', 'th')
new_word=
['with', 'out</w>']
bigram=
('with', 'out</w>')
new_word=
['without</w>']
bpe end, word=(without</w>)
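The trace above can be reproduced without the library by running the same loop against a toy rank table. In the sketch below, `TOY_MERGES` lists only the six merges that appear in the trace, in the order they were applied; the real ranks in merges.txt are different numbers, and the simplified function `bpe` is an illustration, not the library method.

```python
# Six merges in the order seen in the trace; ranks 0..5 are placeholders.
TOY_MERGES = [("t", "h"), ("o", "u"), ("ou", "t</w>"),
              ("w", "i"), ("wi", "th"), ("with", "out</w>")]
TOY_RANKS = {pair: rank for rank, pair in enumerate(TOY_MERGES)}


def bpe(token, ranks):
    """Return (space-joined pieces, list of merged bigrams) for one token."""
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    trace = []
    while len(word) > 1:
        pairs = set(zip(word, word[1:]))
        bigram = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if bigram not in ranks:
            break  # nothing left that the merge table knows about
        first, second = bigram
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
        trace.append(bigram)
    return " ".join(word), trace


result, steps = bpe("without", TOY_RANKS)
print(result)
for step in steps:
    print(step)
```

With this table the bigrams are chosen in the same order as in the debug output. A token whose pairs are all missing from the table comes back split into single characters.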

How self.nlp.tokenize works

Searching the file shows that self.nlp is assigned in the __init__ method of CLIPTokenizer:

def __init__(
    self,
    vocab_file,
    merges_file,
    errors="replace",
    unk_token="<|endoftext|>",
    bos_token="<|startoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|endoftext|>",  # hack to enable padding
    **kwargs,
):
    bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
    eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
    unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
    try:
        import ftfy

        self.fix_text = ftfy.fix_text
    except ImportError:
        logger.info("ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy.")
        self.nlp = BasicTokenizer(strip_accents=False, do_split_on_punc=False)
        self.fix_text = None

The BasicTokenizer class is in the same file as CLIPTokenizer. The tokenize method called above is:

def tokenize(self, text, never_split=None):
    """
    Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer.

    Args:
        never_split (`List[str]`, *optional*)
            Kept for backward compatibility purposes. Now implemented directly at the base class level (see
            [`PreTrainedTokenizer.tokenize`]) List of token not to split.
    """
    # union() returns a new set by concatenating the two sets.
    never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
    text = self._clean_text(text)

    # This was added on November 1st, 2018 for the multilingual and Chinese
    # models. This is also applied to the English models now, but it doesn't
    # matter since the English models were not trained on any Chinese data
    # and generally don't have any Chinese data in them (there are Chinese
    # characters in the vocabulary because Wikipedia does have some Chinese
    # words in the English Wikipedia.).
    if self.tokenize_chinese_chars:
        text = self._tokenize_chinese_chars(text)
    # prevents treating the same character with different unicode codepoints as different characters
    unicode_normalized_text = unicodedata.normalize("NFC", text)
    orig_tokens = whitespace_tokenize(unicode_normalized_text)
    split_tokens = []
    for token in orig_tokens:
        if token not in never_split:
            if self.do_lower_case:
                token = token.lower()
                if self.strip_accents is not False:
                    token = self._run_strip_accents(token)
            elif self.strip_accents:
                token = self._run_strip_accents(token)
        split_tokens.extend(self._run_split_on_punc(token, never_split))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens

The implementation contains several if statements. The first one processes the input text; debugging shows that if self.tokenize_chinese_chars: is true at run time, so self._tokenize_chinese_chars(text) is executed.

First processing of text

text = self._clean_text(text) calls this method:

def _clean_text(self, text):
    """Performs invalid character removal and whitespace cleanup on text."""
    output = []
    for char in text:
        cp = ord(char)
        if cp == 0 or cp == 0xFFFD or _is_control(char):
            continue
        if _is_whitespace(char):
            output.append(" ")
        else:
            output.append(char)
    return "".join(output)

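The method drops NUL, U+FFFD, and control characters, and replaces every whitespace character, including line breaks, with a space.

Second processing of text

_tokenize_chinese_chars is implemented as follows: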

def _tokenize_chinese_chars(self, text):
    """Adds whitespace around any CJK character."""
    output = []
    for char in text:
        cp = ord(char)
        if self._is_chinese_char(cp):
            output.append(" ")
            output.append(char)
            output.append(" ")
        else:
            output.append(char)
    return "".join(output)

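The code walks through the input text. If the current character is a CJK ideograph, a space is added on each side of it; otherwise the character is kept unchanged. The modified string is returned.

The check is shown below. It covers the CJK ideograph blocks, so Chinese characters as well as Japanese kanji and Korean hanja return True. As the code comment notes, hiragana, katakana, and hangul are in other blocks and are not included: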

def _is_chinese_char(self, cp):
    """Checks whether CP is the codepoint of a CJK character."""
    # This defines a "chinese character" as anything in the CJK Unicode block:
    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    #
    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
    # despite its name. The modern Korean Hangul alphabet is a different block,
    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
    # space-separated words, so they are not treated specially and handled
    # like the all of the other languages.
    if (
        (cp >= 0x4E00 and cp <= 0x9FFF)
        or (cp >= 0x3400 and cp <= 0x4DBF)  #
        or (cp >= 0x20000 and cp <= 0x2A6DF)  #
        or (cp >= 0x2A700 and cp <= 0x2B73F)  #
        or (cp >= 0x2B740 and cp <= 0x2B81F)  #
        or (cp >= 0x2B820 and cp <= 0x2CEAF)  #
        or (cp >= 0xF900 and cp <= 0xFAFF)
        or (cp >= 0x2F800 and cp <= 0x2FA1F)  #
    ):  #
        return True

    return False

After these two passes over text, the following produces the next result:

# prevents treating the same character with different unicode codepoints as different characters
unicode_normalized_text = unicodedata.normalize("NFC", text)
orig_tokens = whitespace_tokenize(unicode_normalized_text)

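unicodedata is a module in the Python standard library. It gives access to the Unicode Character Database: character categories, names, numeric values, and normalization. Its normalize function converts text to the requested Unicode normalization form. Here it converts text to NFC, so that a character written as a base letter plus combining marks becomes the equivalent precomposed character where one exists.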
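A short example makes the effect of NFC visible. It uses only unicodedata.normalize from the standard library; the sample strings are my own.

```python
import unicodedata

# "é" written as the letter e followed by a combining acute accent (2 code points).
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "\u00e9")  # the single precomposed code point for é

# Plain ASCII text is already in NFC form and is returned unchanged.
print(unicodedata.normalize("NFC", "Electric") == "Electric")
```

Without this step, the two spellings of the same visible character would reach BPE as different byte sequences.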

whitespace_tokenize is in the same file; its code is below. It strips all whitespace from both ends of the text, then splits the text on the whitespace between words and returns the pieces as a list.

# Copied from transformers.models.bert.tokenization_bert.whitespace_tokenize
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens

The next step is below, with my comments on what executes. It processes orig_tokens to produce split_tokens.

print('orig_tokens=')
print(orig_tokens)
split_tokens = []
for token in orig_tokens:
    if token not in never_split: # never_split is an empty set here
        if self.do_lower_case: # do_lower_case is true here
            token = token.lower()
            if self.strip_accents is not False: # this condition is false,
                token = self._run_strip_accents(token) # so this line never runs
        elif self.strip_accents:
            token = self._run_strip_accents(token) # this line never runs either
    split_tokens.extend(self._run_split_on_punc(token, never_split)) # _run_split_on_punc simply returns [token]

Because the logic is fairly detailed, I printed the values before and after:

orig_tokens=
['Electric', 'Power', 'is', 'Everywhere', 'Present', 'In', 'Unlimited', 'Quantities,', 'It', 'Can', 'Drive', 'The', "World's", 'Machinery', 'Without', 'The', 'Need', 'Of', 'Coal,', 'Oil,', 'Gas', 'Or', 'Any', 'Other', 'Fuel.']
split_tokens=
['electric', 'power', 'is', 'everywhere', 'present', 'in', 'unlimited', 'quantities,', 'it', 'can', 'drive', 'the', "world's", 'machinery', 'without', 'the', 'need', 'of', 'coal,', 'oil,', 'gas', 'or', 'any', 'other', 'fuel.']

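Debugging shows that never_split is an empty set here, so the first if in the snippet above is always entered.
self.do_lower_case is true, so tokens are always lower-cased.
As the comments indicate, self._run_strip_accents(token) is never executed.
_run_split_on_punc returns [token] directly after an internal check, so it can effectively be ignored.

Token to ID

_convert_token_to_id is called to convert each token to its ID: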

def _convert_token_to_id(self, token):
    """Converts a token (str) in an id using the vocab."""
    id_ = self.encoder.get(token, self.encoder.get(self.unk_token))
    print('token=%s,id=%d'%(token, id_))
    return id_

Output:

token=electric</w>,id=5031
token=power</w>,id=1807
token=is</w>,id=533
token=everywhere</w>,id=6364
token=present</w>,id=2881
token=in</w>,id=530
token=unlimited</w>,id=11015
token=quantities</w>,id=33917
token=,</w>,id=267
token=it</w>,id=585
token=can</w>,id=753
token=drive</w>,id=2620
token=the</w>,id=518
token=world</w>,id=1002
token='s</w>,id=568
token=machinery</w>,id=21858
token=without</w>,id=2193
token=the</w>,id=518
token=need</w>,id=1262
token=of</w>,id=539
token=coal</w>,id=7919
token=,</w>,id=267
token=oil</w>,id=2870
token=,</w>,id=267
token=gas</w>,id=2474
token=or</w>,id=541
token=any</w>,id=1504
token=other</w>,id=1010
token=fuel</w>,id=5945
token=.</w>,id=269

self.encoder here is the vocabulary loaded during initialization:

with open(vocab_file, encoding="utf-8") as vocab_handle:
    self.encoder = json.load(vocab_handle)

This step is simple: it just returns the value stored under the key in the JSON file.
