Large Language Model (LLM) Tokenizers - bos_token - eos_token - unk_token

Large Language Model {LLM} Tokenizers - bos_token - eos_token - unk_token

  • [1. NVIDIA NeMo Framework](#1. NVIDIA NeMo Framework)
    • [1.1. Tokenizers](#1.1. Tokenizers)
  • [2. PyTorch Module code](#2. PyTorch Module code)
    • [2.1. `torchtune.modules.tokenizers._tiktoken`](#2.1. torchtune.modules.tokenizers._tiktoken)
  • References

1. NVIDIA NeMo Framework

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).

It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

NeMo Framework provides end-to-end support for developing Large Language Models (LLMs) and Multimodal Models (MMs).

1.1. Tokenizers

复制代码
class nemo.collections.common.tokenizers.AutoTokenizer(
    pretrained_model_name: str,
    vocab_file: str | None = None,
    merges_file: str | None = None,
    mask_token: str | None = None,
    bos_token: str | None = None,
    eos_token: str | None = None,
    pad_token: str | None = None,
    sep_token: str | None = None,
    cls_token: str | None = None,
    unk_token: str | None = None,
    additional_special_tokens: List | None = [],
    use_fast: bool | None = False,
    trust_remote_code: bool | None = False,
)

pretrained_model_name - corresponds to HuggingFace-AutoTokenizer's 'pretrained_model_name_or_path' input argument.

vocab_file - path to file with vocabulary which consists of characters separated by newlines.

mask_token - mask token

bos_token - the beginning of sequence token

eos_token - the end of sequence token. Usually equal to sep_token

pad_token - token to use for padding

sep_token - token used for separating sequences

cls_token - class token. Usually equal to bos_token

unk_token - token to use for unknown tokens

additional_special_tokens - list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

use_fast - whether to use fast HuggingFace tokenizer

2. PyTorch Module code

https://pytorch.org/torchtune/0.1/_modules/index.html

2.1. torchtune.modules.tokenizers._tiktoken

https://pytorch.org/torchtune/0.1/_modules/torchtune/modules/tokenizers/_tiktoken.html

复制代码
        path (str): Path to pretrained tokenizer checkpoint file.
        name (str): Name of the tokenizer (used by tiktoken for identification).
        pattern (str): Regex pattern used to for string parsing.
        all_special_tokens (Optional[List[str]]): List of all special tokens. 
            First element must be bos token, second element must be eos token, final element must be python tag. 
            All elements must be unique. Length must be at most 256. Default: None (will use ALL_SPECIAL_TOKENS)
        bos_token (str): Beginning of sequence token. Defaults to BEGIN_OF_TEXT.
        eos_token (str): End of sequence token. Defaults to END_OF_TEXT.
        start_header_id (str): Start header token. Defaults to START_HEADER_ID.
        end_header_id (str): End header token. Defaults to END_HEADER_ID.
        step_id (str): Step token. Defaults to STEP_ID.
        eom_id (str): End of message token. Defaults to EOM_ID.
        eot_id (str): End of turn token. Defaults to EOT_ID.
        python_tag (str): Python tag token. Defaults to PYTHON_TAG.

References

1 Yongqiang Cheng, https://yongqiang.blog.csdn.net/

2 How do LLMs process text data - A deep dive into Tokenization (Part-1), https://gdevakumar.medium.com/how-do-llms-process-text-data-a-deep-dive-into-tokenization-part-1-342bd365c6dc

相关推荐
装不满的克莱因瓶17 小时前
学习并掌握 LangChain 检索器的作用,实现让 LLM 动态调用知识库功能
人工智能·python·ai·langchain·llm·agent·智能体
惟愿光怪陆离18 小时前
OpenCode 注意事项
llm
初旭save1 天前
Agent Skill 不是写 Prompt,是给 LLM 做存储分层
llm·agent·claude
AINative软件工程1 天前
LLM 应用的 Rate Limiting 工程实战:Per-User Token 配额、滑动窗口限流与优先级队列的生产落地
llm
晨欣2 天前
Claude Opus 4.8:模型小幅升级,平台大步向前
llm·claude·anthropic·claude code·harness
lhxcc_fly2 天前
6.LangChain--RAG
langchain·llm·rag
lhxcc_fly2 天前
6.1RAG--文档加载器
langchain·llm·rag
AINative软件工程2 天前
LLM 推理成本工程:从 Token 计量到分层路由的生产降本实践
llm
dy_Alley2 天前
从输入到决策:意图识别在 AI 架构中的定位与应用 — 第六章《置信度决策路由》
llm
dy_Alley2 天前
从输入到决策:意图识别在 AI 架构中的定位与应用 — 第七章《集成与组装》
llm