Large Language Model (LLM) Tokenizers - bos_token - eos_token - unk_token

  • [1. NVIDIA NeMo Framework](#1-nvidia-nemo-framework)
    • [1.1. Tokenizers](#11-tokenizers)
  • [2. PyTorch Module code](#2-pytorch-module-code)
    • [2.1. `torchtune.modules.tokenizers._tiktoken`](#21-torchtunemodulestokenizers_tiktoken)

1. NVIDIA NeMo Framework

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal models, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).

It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

NeMo Framework provides end-to-end support for developing Large Language Models (LLMs) and Multimodal Models (MMs).

1.1. Tokenizers

```python
class nemo.collections.common.tokenizers.AutoTokenizer(
    pretrained_model_name: str,
    vocab_file: str | None = None,
    merges_file: str | None = None,
    mask_token: str | None = None,
    bos_token: str | None = None,
    eos_token: str | None = None,
    pad_token: str | None = None,
    sep_token: str | None = None,
    cls_token: str | None = None,
    unk_token: str | None = None,
    additional_special_tokens: List | None = [],
    use_fast: bool | None = False,
    trust_remote_code: bool | None = False,
)
```

pretrained_model_name - corresponds to the HuggingFace AutoTokenizer's 'pretrained_model_name_or_path' input argument.

vocab_file - path to a file with the vocabulary, one entry per line.

merges_file - path to the BPE merges file, if the tokenizer uses one.

mask_token - mask token

bos_token - the beginning-of-sequence token

eos_token - the end-of-sequence token. Usually equal to sep_token

pad_token - token to use for padding

sep_token - token used for separating sequences

cls_token - class token. Usually equal to bos_token

unk_token - token to use for unknown tokens

additional_special_tokens - list of other tokens besides the standard special tokens (bos, eos, pad, etc.), for example the sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

use_fast - whether to use the fast (Rust-based) HuggingFace tokenizer

trust_remote_code - whether to trust and execute custom tokenizer code from the model repository (forwarded to HuggingFace)
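
To make the special-token arguments concrete, here is a minimal sketch (not taken from the NeMo docs) that wraps a HuggingFace GPT-2 tokenizer and overrides bos_token, eos_token, unk_token, and pad_token. The model name and token strings are illustrative assumptions; the round-trip helpers text_to_ids / ids_to_text come from NeMo's common tokenizer interface.

```python
# Minimal sketch, assuming NeMo is installed and "gpt2" is reachable on
# the HuggingFace Hub; the special-token strings here are illustrative.
from nemo.collections.common.tokenizers import AutoTokenizer

tokenizer = AutoTokenizer(
    pretrained_model_name="gpt2",  # forwarded to HF as pretrained_model_name_or_path
    bos_token="<s>",               # beginning-of-sequence token
    eos_token="</s>",              # end-of-sequence token
    unk_token="<unk>",             # substituted for out-of-vocabulary input
    pad_token="<pad>",             # used to pad batches to equal length
)

ids = tokenizer.text_to_ids("Hello, world!")  # text -> token ids
print(ids)
print(tokenizer.ids_to_text(ids))             # token ids -> text
```

Overriding eos_token and pad_token this way is common when a base checkpoint (such as GPT-2) ships without a dedicated padding token.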

2. PyTorch Module code

https://pytorch.org/torchtune/0.1/_modules/index.html

2.1. torchtune.modules.tokenizers._tiktoken

https://pytorch.org/torchtune/0.1/_modules/torchtune/modules/tokenizers/_tiktoken.html

```
path (str): Path to pretrained tokenizer checkpoint file.
name (str): Name of the tokenizer (used by tiktoken for identification).
pattern (str): Regex pattern used for string parsing.
all_special_tokens (Optional[List[str]]): List of all special tokens.
    First element must be bos token, second element must be eos token, final element must be python tag.
    All elements must be unique. Length must be at most 256. Default: None (will use ALL_SPECIAL_TOKENS)
bos_token (str): Beginning of sequence token. Defaults to BEGIN_OF_TEXT.
eos_token (str): End of sequence token. Defaults to END_OF_TEXT.
start_header_id (str): Start header token. Defaults to START_HEADER_ID.
end_header_id (str): End header token. Defaults to END_HEADER_ID.
step_id (str): Step token. Defaults to STEP_ID.
eom_id (str): End of message token. Defaults to EOM_ID.
eot_id (str): End of turn token. Defaults to EOT_ID.
python_tag (str): Python tag token. Defaults to PYTHON_TAG.
```
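
As a usage illustration, here is a minimal sketch assuming the class is exported as torchtune.modules.tokenizers.TikTokenTokenizer (as in torchtune 0.1) and using a hypothetical checkpoint path; add_bos/add_eos control whether the bos/eos special-token ids are added around the encoded text.

```python
# Minimal sketch: the checkpoint path below is hypothetical; encode/decode
# follow the torchtune 0.1 tiktoken tokenizer documented above.
from torchtune.modules.tokenizers import TikTokenTokenizer

tokenizer = TikTokenTokenizer(path="/path/to/llama3/tokenizer.model")

# add_bos/add_eos wrap the sequence in the BEGIN_OF_TEXT / END_OF_TEXT ids.
token_ids = tokenizer.encode("Hello, world!", add_bos=True, add_eos=True)
print(token_ids)
print(tokenizer.decode(token_ids))
```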

