Large Language Model (LLM) Tokenizers - bos_token - eos_token - unk_token

Large Language Model {LLM} Tokenizers - bos_token - eos_token - unk_token

  • [1. NVIDIA NeMo Framework](#1. NVIDIA NeMo Framework)
    • [1.1. Tokenizers](#1.1. Tokenizers)
  • [2. PyTorch Module code](#2. PyTorch Module code)
    • [2.1. `torchtune.modules.tokenizers._tiktoken`](#2.1. torchtune.modules.tokenizers._tiktoken)
  • References

1. NVIDIA NeMo Framework

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).

It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

NeMo Framework provides end-to-end support for developing Large Language Models (LLMs) and Multimodal Models (MMs).

1.1. Tokenizers

复制代码
class nemo.collections.common.tokenizers.AutoTokenizer(
    pretrained_model_name: str,
    vocab_file: str | None = None,
    merges_file: str | None = None,
    mask_token: str | None = None,
    bos_token: str | None = None,
    eos_token: str | None = None,
    pad_token: str | None = None,
    sep_token: str | None = None,
    cls_token: str | None = None,
    unk_token: str | None = None,
    additional_special_tokens: List | None = [],
    use_fast: bool | None = False,
    trust_remote_code: bool | None = False,
)

pretrained_model_name - corresponds to HuggingFace-AutoTokenizer's 'pretrained_model_name_or_path' input argument.

vocab_file - path to file with vocabulary which consists of characters separated by newlines.

mask_token - mask token

bos_token - the beginning of sequence token

eos_token - the end of sequence token. Usually equal to sep_token

pad_token - token to use for padding

sep_token - token used for separating sequences

cls_token - class token. Usually equal to bos_token

unk_token - token to use for unknown tokens

additional_special_tokens - list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

use_fast - whether to use fast HuggingFace tokenizer

2. PyTorch Module code

https://pytorch.org/torchtune/0.1/_modules/index.html

2.1. torchtune.modules.tokenizers._tiktoken

https://pytorch.org/torchtune/0.1/_modules/torchtune/modules/tokenizers/_tiktoken.html

复制代码
        path (str): Path to pretrained tokenizer checkpoint file.
        name (str): Name of the tokenizer (used by tiktoken for identification).
        pattern (str): Regex pattern used to for string parsing.
        all_special_tokens (Optional[List[str]]): List of all special tokens. 
            First element must be bos token, second element must be eos token, final element must be python tag. 
            All elements must be unique. Length must be at most 256. Default: None (will use ALL_SPECIAL_TOKENS)
        bos_token (str): Beginning of sequence token. Defaults to BEGIN_OF_TEXT.
        eos_token (str): End of sequence token. Defaults to END_OF_TEXT.
        start_header_id (str): Start header token. Defaults to START_HEADER_ID.
        end_header_id (str): End header token. Defaults to END_HEADER_ID.
        step_id (str): Step token. Defaults to STEP_ID.
        eom_id (str): End of message token. Defaults to EOM_ID.
        eot_id (str): End of turn token. Defaults to EOT_ID.
        python_tag (str): Python tag token. Defaults to PYTHON_TAG.

References

1\] Yongqiang Cheng, \[2\] How do LLMs process text data - A deep dive into Tokenization (Part-1),

相关推荐
twc8291 小时前
LLM辅助编程:从直接生成到测试驱动的质量跃迁
软件测试·大模型·llm
岁岁种桃花儿1 小时前
AI超级智能开发系列从入门到上天第九篇:SpringAI搭建本地知识库
数据库·人工智能·ai·llm·智能体
得物技术2 小时前
Claude在得物App数仓的深度集成与效能演进
大数据·人工智能·llm
数据智能老司机2 小时前
知识图谱与 LLM 的实战应用——从本体构建你的第一个知识图谱
llm
猫咪老师3 小时前
RAG与GraphRAG介绍
人工智能·算法·llm
arvin_xiaoting3 小时前
OpenClaw学习总结_II_频道系统_4:Slack集成详解
前端·学习·自动化·llm·ai agent·飞书机器人·openclaw
怕浪猫4 小时前
第5章 输出解析:让模型返回结构化数据
langchain·llm·ai编程
带娃的IT创业者4 小时前
LLM Function Calling 的 Schema 陷阱与纯语言输出双重保障
llm·function call
阿里云大数据AI技术18 小时前
告别“金鱼记忆”:Hologres + Mem0,为大模型打造企业级长记忆引擎
人工智能·llm