Part 4: GLM-4v-9b Tokenizer Source Code Walkthrough

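Contents

Preface
I. Source code for loading the GLM-4v-9b tokenizer
II. How Hugging Face loads a tokenizer, using the GLM-4v-9b tokenizer as the example
III. Experiments to verify the GLM-4v-9b tokenizer behavior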

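Preface

GLM-4v-9b, from Zhipu AI and Tsinghua University, is a multimodal large model that is well suited to applications in China, where models developed abroad are often poorly localized. This column covers environment setup, data processing, and reading the source code of the vision and language models. It also rebuilds the GLM model with Hugging Face, so that readers can understand, modify, and apply GLM models and go on to build and modify multimodal large models of their own. This part walks through the source code of the GLM-4v-9b tokenizer.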

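Part 1: GLM-4v-9B installation, inference, and training tutorial

Part 2: GLM-4v-9B data loading source code walkthrough

Part 3: GLM-4v-9B data loading with Hugging Face (a general data loading example for large models)

Part 4: GLM-4v-9b tokenizer source code walkthrough

Part 5: GLM-4v-9b model loading source code walkthrough (model parameters)

Part 6: GLM-4v-9b model loading source code walkthrough (loading methods)

Part 7: GLM-4v-9b vision model source code walkthrough

Part 8: GLM-4v-9b language model source code walkthrough (ChatGLMForConditionalGeneration)

Part 9: Tracing the data flow of ChatGLMForConditionalGeneration in the debugger to understand the GLM-4v-9b architecture

Part 10: Tracing the data flow of ChatGLMModel in the debugger to understand how the vision and language models are combined

Part 11: Demo: rebuilding the GLM-4v-9B data processing code with Hugging Face

Part 12: Demo: rebuilding the GLM-4v-9B training code with Hugging Face
Parts 11 and 12 rebuild GLM-4v-9B with Hugging Face once the model is understood, so that readers can freely build their own multimodal large models.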

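This part covers loading the GLM-4v-9b tokenizer. Loading goes through Hugging Face and is very simple, so this part uses the GLM tokenizer as an example and concentrates on the Hugging Face source code that loads a tokenizer. One point is especially worth knowing: with the entry below in the tokenizer's JSON configuration, the tokenizer is loaded through the ChatGLM4Tokenizer class in tokenization_chatglm.py. This will be useful later when building your own model. The JSON entry is: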

"auto_map": {
    "AutoTokenizer": [
        "tokenization_chatglm.ChatGLM4Tokenizer",
        null
    ]
},

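I. Source code for loading the GLM-4v-9b tokenizer

If you are familiar with the Hugging Face library, loading a tokenizer is very simple and takes only a few lines of code. In this short section I show how the GLM-4v-9b tokenizer is loaded. The later sections then concentrate on how Hugging Face loads a tokenizer internally.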

1. The main tokenizer loading call in GLM-4v-9b

Source: finetune_demo/finetune_vision.py --> load_tokenizer_and_model

finetune_vision.py loads the tokenizer through a function written by the GLM team. It is called as follows:

tokenizer, model = load_tokenizer_and_model(model_dir, peft_config=ft_config.peft_config)

2. Loading the GLM-4v-9b tokenizer through Hugging Face

In other words, load_tokenizer_and_model loads both the tokenizer and the model. The tokenizer is loaded by this line:

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

model_dir is the directory that contains the tokenizer files; this one Hugging Face call is enough to load them.

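3. The custom GLM-4v-9b tokenizer class

The weights directory contains a tokenizer_config.json file used by the tokenizer. Its "auto_map" parameter has an "AutoTokenizer" entry whose value, "tokenization_chatglm.ChatGLM4Tokenizer", tells Hugging Face how to load the custom tokenizer: from this value Hugging Face locates the file tokenization_chatglm.py and uses the ChatGLM4Tokenizer class in it to load the tokenizer. This class is GLM's tokenizer. It inherits from Hugging Face's PreTrainedTokenizer, which in turn inherits from the base class PreTrainedTokenizerBase. I only mention this here; readers can explore the details on their own.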

class ChatGLM4Tokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}
    model_input_names = ["input_ids", "attention_mask", "position_ids"]
    def __init__(
            self,
            vocab_file,
            padding_side="left",
            clean_up_tokenization_spaces=False,
            encode_special_tokens=False,
            image_size=None,
            **kwargs
    ):
        ...
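
To make the mapping from the auto_map value to a file and a class concrete, here is a minimal sketch. The helper name `resolve_class_reference` is hypothetical, and the sketch assumes the plain `module.ClassName` form shown in the configuration above; it illustrates the idea and is not the Hugging Face implementation.

```python
import os
from typing import Tuple


def resolve_class_reference(class_ref: str, model_dir: str) -> Tuple[str, str]:
    """Split 'module.ClassName' into (path of the module file, class name).

    Hypothetical helper for illustration only.
    """
    # Everything before the last dot is the module, the rest is the class.
    module_name, class_name = class_ref.rsplit(".", 1)
    module_file = os.path.join(model_dir, module_name + ".py")
    return module_file, class_name


module_file, class_name = resolve_class_reference(
    "tokenization_chatglm.ChatGLM4Tokenizer", "tokenizer_file"
)
print(module_file)  # tokenization_chatglm.py inside the given directory
print(class_name)
```

The directory name `tokenizer_file` is only a placeholder for wherever the tokenizer files live.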

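II. How Hugging Face loads a tokenizer, using the GLM-4v-9b tokenizer as the example

Here I use the tokenizer from the GLM source as an example to explain how Hugging Face handles tokenizers; the same applies to the tokenizers of other models. The goal of this section is to be able to add our own tokens, that is, to learn how to build on the Hugging Face classes to create a tokenizer for our own domain.

1. The main tokenizer loading call (Hugging Face)

To make the loading process easier to follow, I wrote a small demo: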

from transformers import AutoTokenizer

if __name__ == '__main__':
    model_dir = '/model_experiment_tj/tokenizer_file'
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, do_lower_case=False)

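The contents of the tokenizer directory are shown below:

The important files in the figure are tokenization_chatglm.py, generation_config.json, tokenizer_config.json, and tokenizer.model, which I explain in what follows.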

2. AutoTokenizer.from_pretrained source code (Hugging Face)

Called from: finetune_demo/finetune_vision.py --> load_tokenizer_and_model; source: transformers/models/auto/tokenization_auto.py --> class AutoTokenizer --> from_pretrained

The tokenizer is loaded through the class below. The whole class is shown, but the method of interest is from_pretrained:

class AutoTokenizer:
    r"""
    This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when
    created with the [`AutoTokenizer.from_pretrained`] class method.

    This class cannot be instantiated directly using `__init__()` (throws an error).
    """

    def __init__(self):
        raise EnvironmentError(
            "AutoTokenizer is designed to be instantiated "
            "using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method."
        )

    @classmethod
    @replace_list_option_in_docstrings(TOKENIZER_MAPPING_NAMES)
    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
        ...

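Note that **kwargs lets the caller pass additional arguments through.

The location of this method is shown in the figure below:

3. Parameters of from_pretrained (Hugging Face)

The docstring of from_pretrained describes its parameters as follows: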

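 Instantiate one of the tokenizer classes of the library from a pretrained model vocabulary.

 The tokenizer class to instantiate is selected based on the `model_type` property of the config object (either passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it is missing, by falling back to pattern matching on `pretrained_model_name_or_path`:

 List options

 Params:
     pretrained_model_name_or_path (`str` or `os.PathLike`):
         Can be either:
             - A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co.
             - A path to a *directory* containing the vocabulary files required by the tokenizer, for instance saved using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g. `./my_model_directory/`.
             - A path or URL to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet). (Not applicable to all derived classes)
     inputs (additional positional arguments, *optional*):
         Will be passed along to the tokenizer `__init__()` method.
     config ([`PretrainedConfig`], *optional*):
         The configuration object used to determine the tokenizer class to instantiate.
     cache_dir (`str` or `os.PathLike`, *optional*):
         Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
     force_download (`bool`, *optional*, defaults to `False`):
         Whether or not to force the (re-)download of the model weights and configuration files and override the cached versions if they exist.
     resume_download:
         Deprecated and ignored. All downloads are now resumed by default when possible.
         Will be removed in v5 of Transformers.
     proxies (`Dict[str, str]`, *optional*):
         A dictionary of proxy servers to use by protocol or endpoint, e.g. `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
     revision (`str`, *optional*, defaults to `"main"`):
         The specific model version to use. It can be a branch name, a tag name, or a commit id. Since a git-based system is used for storing models and other artifacts on huggingface.co, `revision` can be any identifier allowed by git.
     subfolder (`str`, *optional*):
         In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base), specify it here.
     use_fast (`bool`, *optional*, defaults to `True`):
         Use a [fast Rust-based tokenizer](https://huggingface.co/docs/tokenizers/index) if it is supported for a given model. If a fast tokenizer is not available for a given model, a normal Python-based tokenizer is returned instead.
     tokenizer_type (`str`, *optional*):
         Tokenizer type to be loaded.
     trust_remote_code (`bool`, *optional*, defaults to `False`):
         Whether or not to allow custom models defined on the Hub in their own modeling files. This option should only be set for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine.
     kwargs (additional keyword arguments, *optional*):
         Will be passed to the tokenizer `__init__()` method. Can be used to set special tokens like `bos_token`, `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, `mask_token`, `additional_special_tokens`. See the parameters of `__init__()` for more details.

 Examples: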

 ```python
 >>> from transformers import AutoTokenizer

 >>> # Download vocabulary from huggingface.co and cache.
 >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

 >>> # Download vocabulary from huggingface.co (user-uploaded) and cache.
 >>> tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

 >>> # If vocabulary files are in a directory (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
 >>> # tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")

 >>> # Download vocabulary from huggingface.co and define model-specific arguments
 >>> tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)
 ```

4. Reading tokenizer_config.json (Hugging Face --> from_pretrained)

From the call above, execution enters get_tokenizer_config to obtain tokenizer_config. Here pretrained_model_name_or_path is the path that contains the tokenizer files. The call is:

Called from: transformers/models/auto/tokenization_auto.py --> AutoTokenizer.from_pretrained; source: transformers/models/auto/tokenization_auto.py --> get_tokenizer_config

tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)

get_tokenizer_config source code

The source is:

def get_tokenizer_config(
    pretrained_model_name_or_path: Union[str, os.PathLike],
    cache_dir: Optional[Union[str, os.PathLike]] = None,
    force_download: bool = False,
    resume_download: Optional[bool] = None,
    proxies: Optional[Dict[str, str]] = None,
    token: Optional[Union[bool, str]] = None,
    revision: Optional[str] = None,
    local_files_only: bool = False,
    subfolder: str = "",
    **kwargs,
):
    """
    Loads the tokenizer configuration from a pretrained model tokenizer configuration.
    """
    use_auth_token = kwargs.pop("use_auth_token", None)
    if use_auth_token is not None:
        warnings.warn(
            "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
            FutureWarning,
        )
        if token is not None:
            raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
        token = use_auth_token

    commit_hash = kwargs.get("_commit_hash", None)
    resolved_config_file = cached_file(
        pretrained_model_name_or_path,
        TOKENIZER_CONFIG_FILE,
        cache_dir=cache_dir,
        force_download=force_download,
        resume_download=resume_download,
        proxies=proxies,
        token=token,
        revision=revision,
        local_files_only=local_files_only,
        subfolder=subfolder,
        _raise_exceptions_for_gated_repo=False,
        _raise_exceptions_for_missing_entries=False,
        _raise_exceptions_for_connection_errors=False,
        _commit_hash=commit_hash,
    )
    if resolved_config_file is None:
        logger.info("Could not locate the tokenizer configuration file, will try to use the model config instead.")
        return {}
    commit_hash = extract_commit_hash(resolved_config_file, commit_hash)

    with open(resolved_config_file, encoding="utf-8") as reader:
        result = json.load(reader)
    result["_commit_hash"] = commit_hash
    return result

a. Resolving the path to tokenizer_config.json

I walk through this function in execution order. The first step obtains the full path of tokenizer_config.json:
Source: transformers/models/auto/tokenization_auto.py --> get_tokenizer_config

resolved_config_file = cached_file(
    pretrained_model_name_or_path,
    TOKENIZER_CONFIG_FILE,
    cache_dir=cache_dir,
    force_download=force_download,
    resume_download=resume_download,
    proxies=proxies,
    token=token,
    revision=revision,
    local_files_only=local_files_only,
    subfolder=subfolder,
    _raise_exceptions_for_gated_repo=False,
    _raise_exceptions_for_missing_entries=False,
    _raise_exceptions_for_connection_errors=False,
    _commit_hash=commit_hash,
)

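cached_file source code

When a local tokenizer directory is given, cached_file returns the path of tokenizer_config.json inside that directory. The other cases (remote repositories, downloads, and so on) are not our concern for now.

Source: transformers/utils/hub.py --> cached_file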

def cached_file(
    path_or_repo_id: Union[str, os.PathLike],  # path to the tokenizer directory
    filename: str,  # tokenizer_config.json
    cache_dir: Optional[Union[str, os.PathLike]] = None,
    force_download: bool = False,
    resume_download: Optional[bool] = None,
    proxies: Optional[Dict[str, str]] = None,
    token: Optional[Union[bool, str]] = None,
    revision: Optional[str] = None,
    local_files_only: bool = False,
    subfolder: str = "",
    repo_type: Optional[str] = None,
    user_agent: Optional[Union[str, Dict[str, str]]] = None,
    _raise_exceptions_for_gated_repo: bool = True,
    _raise_exceptions_for_missing_entries: bool = True,
    _raise_exceptions_for_connection_errors: bool = True,
    _commit_hash: Optional[str] = None,
    **deprecated_kwargs,
) -> Optional[str]:
 
    use_auth_token = deprecated_kwargs.pop("use_auth_token", None)
    ...

    path_or_repo_id = str(path_or_repo_id)
    full_filename = os.path.join(subfolder, filename)  # file name relative to the directory (tokenizer_config.json here)
    if os.path.isdir(path_or_repo_id):
        resolved_file = os.path.join(os.path.join(path_or_repo_id, subfolder), filename)
        if not os.path.isfile(resolved_file):
            if _raise_exceptions_for_missing_entries and filename not in ["config.json", f"{subfolder}/config.json"]:
                raise EnvironmentError(
                    f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout "
                    f"'https://huggingface.co/{path_or_repo_id}/tree/{revision}' for available files."
                )
            else:
                return None
        return resolved_file  # the function exits here and returns the path of tokenizer_config.json

    # The remaining code handles downloading and related cases; it is not covered here.
    ...

b. Reading the contents of tokenizer_config.json

The file is then read directly with json.load:

    with open(resolved_config_file, encoding="utf-8") as reader:
        result = json.load(reader)
    result["_commit_hash"] = commit_hash
    return result
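
Steps a and b, restricted to the local-directory case, can be sketched with the standard library alone. The function name `load_local_tokenizer_config` is hypothetical, and the sketch deliberately leaves out everything cached_file does for remote repositories, caching, and commit hashes.

```python
import json
import os
import tempfile
from typing import Any, Dict

TOKENIZER_CONFIG_FILE = "tokenizer_config.json"


def load_local_tokenizer_config(model_dir: str, subfolder: str = "") -> Dict[str, Any]:
    """Return the parsed tokenizer_config.json, or {} if the file is missing."""
    # Step a: build the path inside the local directory.
    resolved = os.path.join(model_dir, subfolder, TOKENIZER_CONFIG_FILE)
    if not os.path.isfile(resolved):
        return {}
    # Step b: read the JSON content.
    with open(resolved, encoding="utf-8") as reader:
        return json.load(reader)


# A throwaway directory stands in for the model folder.
with tempfile.TemporaryDirectory() as demo_dir:
    empty_result = load_local_tokenizer_config(demo_dir)
    with open(os.path.join(demo_dir, TOKENIZER_CONFIG_FILE), "w", encoding="utf-8") as f:
        json.dump({"tokenizer_class": "ChatGLM4Tokenizer"}, f)
    loaded_result = load_local_tokenizer_config(demo_dir)

print(empty_result)
print(loaded_result)
```

Returning an empty dictionary for a missing file mirrors the `return {}` branch of get_tokenizer_config quoted earlier.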

5. Contents of tokenizer_config.json (supplementary example)

The contents of tokenizer_config.json have now been read into the variable tokenizer_config and will be used later, so they matter. Here is the file for the GLM tokenizer:
tokenizer_config.json

{
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_chatglm.ChatGLM4Tokenizer",
      null
    ]
  },
  "added_tokens_decoder": {
    "151329": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151330": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151331": {
      "content": "[gMASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151332": {
      "content": "[sMASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151333": {
      "content": "<sop>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151334": {
      "content": "<eop>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151335": {
      "content": "<|system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151336": {
      "content": "<|user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151337": {
      "content": "<|assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151338": {
      "content": "<|observation|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151339": {
      "content": "<|begin_of_image|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151340": {
      "content": "<|end_of_image|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151341": {
      "content": "<|begin_of_video|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151342": {
      "content": "<|end_of_video|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": ["<|endoftext|>", "[MASK]", "[gMASK]", "[sMASK]", "<sop>", "<eop>", "<|system|>",
                               "<|user|>", "<|assistant|>", "<|observation|>", "<|begin_of_image|>", "<|end_of_image|>",
                               "<|begin_of_video|>", "<|end_of_video|>"],
  "clean_up_tokenization_spaces": false,
  "do_lower_case": false,
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656,
  "padding_side": "left",
  "remove_space": false,
  "tokenizer_class": "ChatGLM4Tokenizer",
  "image_size": 1120
}
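
As a small illustration of how this file can be inspected, the sketch below builds an id-to-token table from added_tokens_decoder. It uses an abbreviated copy of the configuration (three of the fourteen entries) so that it runs on its own; the variable names are my own.

```python
import json

# Abbreviated copy of the configuration above (three of the fourteen entries).
config_text = """
{
  "added_tokens_decoder": {
    "151329": {"content": "<|endoftext|>", "special": true},
    "151339": {"content": "<|begin_of_image|>", "special": true},
    "151340": {"content": "<|end_of_image|>", "special": true}
  },
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
}
"""
config = json.loads(config_text)

# JSON object keys are strings, so convert them to integer ids.
id_to_token = {
    int(token_id): entry["content"]
    for token_id, entry in config["added_tokens_decoder"].items()
}
token_to_id = {token: token_id for token_id, token in id_to_token.items()}

# eos_token and pad_token are given by content; look up their ids.
eos_id = token_to_id[config["eos_token"]]
pad_id = token_to_id[config["pad_token"]]
print(id_to_token)
print(eos_id, pad_id)
```

In this configuration eos_token and pad_token share the same content, so they resolve to the same id.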

6. Getting the tokenizer class name (Hugging Face --> from_pretrained)

config_tokenizer_class = tokenizer_config.get("tokenizer_class")

This yields a tokenizer_class of ChatGLM4Tokenizer, which comes from this line of tokenizer_config.json:

"tokenizer_class": "ChatGLM4Tokenizer",

7. Getting the auto_map entry

The following code reads the auto_map entry from tokenizer_config.json:

tokenizer_auto_map = None
if "auto_map" in tokenizer_config:
    if isinstance(tokenizer_config["auto_map"], (tuple, list)):
        # Legacy format for dynamic tokenizers
        tokenizer_auto_map = tokenizer_config["auto_map"]
    else:
        tokenizer_auto_map = tokenizer_config["auto_map"].get("AutoTokenizer", None)

Because auto_map is present, tokenizer_auto_map is the list ["tokenization_chatglm.ChatGLM4Tokenizer", None], taken from this entry:

"auto_map": {
    "AutoTokenizer": [
      "tokenization_chatglm.ChatGLM4Tokenizer",
      null
    ]
  },

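8. Obtaining and instantiating the tokenizer class

This is where AutoTokenizer.from_pretrained finally returns: the tokenizer class is resolved, and its own from_pretrained is called to produce the tokenizer object.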

if has_remote_code and trust_remote_code:
    if use_fast and tokenizer_auto_map[1] is not None:
        class_ref = tokenizer_auto_map[1]
    else:
        class_ref = tokenizer_auto_map[0]
    tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
    _ = kwargs.pop("code_revision", None)
    if os.path.isdir(pretrained_model_name_or_path):
        tokenizer_class.register_for_auto_class()
    return tokenizer_class.from_pretrained(
        pretrained_model_name_or_path, *inputs, trust_remote_code=trust_remote_code, **kwargs
    )
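
Steps 7 and 8 together decide which class reference is used. The sketch below replays that decision on plain dictionaries; `select_class_ref` is a hypothetical name, and the function only mirrors the branches quoted above rather than calling into the library.

```python
from typing import Any, Dict, Optional


def select_class_ref(tokenizer_config: Dict[str, Any], use_fast: bool = True) -> Optional[str]:
    """Pick the class reference from auto_map, following the quoted branches."""
    auto_map = tokenizer_config.get("auto_map")
    if auto_map is None:
        return None
    if isinstance(auto_map, (tuple, list)):
        pair = auto_map  # legacy format: the value itself is the pair
    else:
        pair = auto_map.get("AutoTokenizer")
    if pair is None:
        return None
    slow_ref, fast_ref = pair[0], pair[1]
    # The fast class is used only if requested and actually provided.
    if use_fast and fast_ref is not None:
        return fast_ref
    return slow_ref


glm_config = {"auto_map": {"AutoTokenizer": ["tokenization_chatglm.ChatGLM4Tokenizer", None]}}
print(select_class_ref(glm_config, use_fast=True))
print(select_class_ref(glm_config, use_fast=False))
```

Because the second slot of the GLM entry is null, both calls return the same reference, so the value of use_fast makes no difference for this tokenizer.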

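a. get_class_from_dynamic_module

This call loads the module named in class_ref and returns the tokenizer class itself (ChatGLM4Tokenizer here); the instance is created afterwards by from_pretrained. The internals go through several layers of helpers, so a rough understanding is enough for now.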

tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
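
The core idea, importing a Python file by path and fetching a class from it by name, can be sketched with importlib from the standard library. This is a simplified analogue under my own assumptions, not the library's code: `load_class_from_file`, `tokenization_demo.py`, and `DemoTokenizer` are hypothetical names, and the real function handles more cases than this.

```python
import importlib.util
import os
import tempfile


def load_class_from_file(module_file: str, class_name: str):
    """Import a Python file by path and return the named class (not an instance)."""
    module_name = os.path.splitext(os.path.basename(module_file))[0]
    spec = importlib.util.spec_from_file_location(module_name, module_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, class_name)


# A throwaway module standing in for tokenization_chatglm.py.
DEMO_SOURCE = '''
class DemoTokenizer:
    def __init__(self, vocab_file):
        self.vocab_file = vocab_file
'''

with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "tokenization_demo.py")
    with open(demo_file, "w", encoding="utf-8") as f:
        f.write(DEMO_SOURCE)
    demo_class = load_class_from_file(demo_file, "DemoTokenizer")
    demo_instance = demo_class(vocab_file="tokenizer.model")

print(demo_class.__name__)
print(demo_instance.vocab_file)
```

Since the file's top-level code is executed on import, this also shows why trust_remote_code should only be enabled for code you have read.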

b. tokenizer_class.from_pretrained

Source: transformers/tokenization_utils_base.py --> PreTrainedTokenizerBase --> from_pretrained

Execution goes straight into the class's from_pretrained, which also takes the files 'added_tokens.json', 'special_tokens_map.json', 'tokenizer_config.json', and 'tokenizer.json' into account:

   # At this point pretrained_model_name_or_path is either a directory or a model identifier name
   additional_files_names = {
       "added_tokens_file": ADDED_TOKENS_FILE,  # kept only for legacy
       "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,  # kept only for legacy
       "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
       # tokenizer_file used to initialize a slow from a fast. Properly copy the `addedTokens` instead of adding in random orders
       "tokenizer_file": FULL_TOKENIZER_FILE,
   }
   vocab_files = {**cls.vocab_files_names, **additional_files_names}
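
To see what the merged vocab_files dictionary looks like for the GLM tokenizer, the sketch below repeats the merge with the file names spelled out and checks which of them exist in a directory. The helper `find_existing_files` is hypothetical, and the directory is a temporary stand-in with two empty files.

```python
import os
import tempfile

# tokenizer.model comes from ChatGLM4Tokenizer.vocab_files_names;
# the other four are the file names listed in the text above.
vocab_files_names = {"vocab_file": "tokenizer.model"}
additional_files_names = {
    "added_tokens_file": "added_tokens.json",
    "special_tokens_map_file": "special_tokens_map.json",
    "tokenizer_config_file": "tokenizer_config.json",
    "tokenizer_file": "tokenizer.json",
}
vocab_files = {**vocab_files_names, **additional_files_names}


def find_existing_files(model_dir, files):
    """Map each key to its full path if the file exists, else None."""
    found = {}
    for key, name in files.items():
        path = os.path.join(model_dir, name)
        found[key] = path if os.path.isfile(path) else None
    return found


with tempfile.TemporaryDirectory() as demo_dir:
    # Create only two of the five files, as empty placeholders.
    for name in ("tokenizer.model", "tokenizer_config.json"):
        open(os.path.join(demo_dir, name), "w").close()
    found = find_existing_files(demo_dir, vocab_files)

print(sorted(key for key, path in found.items() if path is not None))
```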

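9. Summary

The tokenizer is obtained from the tokenizer directory, and tokenizer_config.json plays a key role in this. In particular, "auto_map": { "AutoTokenizer": [ "tokenization_chatglm.ChatGLM4Tokenizer", null ] } determines which class implements the tokenizer, and "tokenizer_class": "ChatGLM4Tokenizer" gives the class name. The reference tokenization_chatglm.ChatGLM4Tokenizer is resolved to the file tokenization_chatglm.py, which is found by combining it with the tokenizer directory, and the tokenizer class is then looked up in that file by name.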

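III. Experiments to verify the GLM-4v-9b tokenizer behavior

We now know that tokenizer_config.json is important. If we modify its contents, do the corresponding tokens change as well? The experiments below test this idea.

The verification code is: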

from transformers import AutoTokenizer

if __name__ == '__main__':
    model_dir = '/GLM-4V-9B/GLM-4-main/model_glm_tj/model_experiment_tj/tokenizer_file'
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

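1. Inspecting the original GLM tokens

The GLM vocabulary ends at 151343 tokens, so the last id is 151342. We look at the ids just before and just after that boundary.

a. Tokens in [151329, 151342]

The code is: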

tokenizer.decode(list(range(151329, 151343)))

The resulting tokens are:

'<|endoftext|> [MASK] [gMASK] [sMASK] <sop> <eop> <|system|> <|user|> <|assistant|> <|observation|> <|begin_of_image|> <|end_of_image|> <|begin_of_video|> <|end_of_video|>'

b. Ids after 151342

The code and results are:

tokenizer.decode([151343])  # result: ''

tokenizer.decode([151343, 151344])  # result: ''

All later ids give the same empty result.
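
The manual checks above can be automated with a small helper that compares each entry of added_tokens_decoder with what decode returns for that id. This is a sketch under assumptions: `find_decode_mismatches` is a hypothetical name, `StubTokenizer` is a stand-in used only so the example runs without the model files, and "<|my_token|>" is an invented token. With the real tokenizer, pass the object returned by AutoTokenizer.from_pretrained and the parsed tokenizer_config.json instead.

```python
from typing import Any, Dict, List, Tuple


def find_decode_mismatches(
    tokenizer: Any, tokenizer_config: Dict[str, Any]
) -> List[Tuple[int, str, str]]:
    """Return (id, expected, decoded) for every added token that decodes differently."""
    mismatches = []
    for token_id, entry in tokenizer_config.get("added_tokens_decoder", {}).items():
        decoded = tokenizer.decode([int(token_id)])
        if decoded != entry["content"]:
            mismatches.append((int(token_id), entry["content"], decoded))
    return mismatches


class StubTokenizer:
    """Stand-in with a decode() method, used only to exercise the helper."""

    def __init__(self, table):
        self.table = table

    def decode(self, ids):
        # Unknown ids decode to an empty string, as observed above.
        return " ".join(self.table.get(i, "") for i in ids)


demo_config = {
    "added_tokens_decoder": {
        "151329": {"content": "<|endoftext|>"},
        "151343": {"content": "<|my_token|>"},  # hypothetical new entry
    }
}
stub = StubTokenizer({151329: "<|endoftext|>"})
print(find_decode_mismatches(stub, demo_config))
```

An empty list means every configured token decodes to its declared content; any entry in the list points at an id whose configuration and decoding disagree.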

2. Adding an arbitrary token and id

We add a token with id 151343 to tokenizer_config.json,