Large Language Model (LLM) Tokenizers - bos_token - eos_token - unk_token


  • [1. NVIDIA NeMo Framework](#1-nvidia-nemo-framework)
    • [1.1. Tokenizers](#11-tokenizers)
  • [2. PyTorch Module code](#2-pytorch-module-code)
    • [2.1. `torchtune.modules.tokenizers._tiktoken`](#21-torchtunemodulestokenizers_tiktoken)

1. NVIDIA NeMo Framework

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal models, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).

It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

NeMo Framework provides end-to-end support for developing Large Language Models (LLMs) and Multimodal Models (MMs).

1.1. Tokenizers

```python
class nemo.collections.common.tokenizers.AutoTokenizer(
    pretrained_model_name: str,
    vocab_file: str | None = None,
    merges_file: str | None = None,
    mask_token: str | None = None,
    bos_token: str | None = None,
    eos_token: str | None = None,
    pad_token: str | None = None,
    sep_token: str | None = None,
    cls_token: str | None = None,
    unk_token: str | None = None,
    additional_special_tokens: List | None = [],
    use_fast: bool | None = False,
    trust_remote_code: bool | None = False,
)
```

pretrained_model_name - corresponds to the HuggingFace AutoTokenizer's `pretrained_model_name_or_path` argument.

vocab_file - path to a file with the vocabulary, whose entries are separated by newlines.

mask_token - mask token

bos_token - the beginning of sequence token

eos_token - the end of sequence token. Usually equal to sep_token

pad_token - token to use for padding

sep_token - token used for separating sequences

cls_token - class token. Usually equal to bos_token

unk_token - token to use for unknown tokens

additional_special_tokens - list of other tokens beside the standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (`<extra_id_0>`, `<extra_id_1>`, etc.)

use_fast - whether to use the fast (Rust-based) HuggingFace tokenizer implementation

trust_remote_code - whether to trust and execute custom tokenizer code shipped with the model repository
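As a concrete illustration, the sketch below constructs such a tokenizer and round-trips a string through it. This is a minimal sketch, assuming `nemo_toolkit` is installed and the HuggingFace `gpt2` checkpoint is reachable; the custom token strings are purely illustrative, and the `text_to_ids` / `ids_to_text` helpers and the `bos_id` / `eos_id` / `unk_id` properties are assumed from NeMo's common tokenizer interface.

```python
# Minimal sketch: build a NeMo AutoTokenizer around the HuggingFace "gpt2"
# checkpoint (assumes nemo_toolkit is installed; token strings are illustrative).
from nemo.collections.common.tokenizers import AutoTokenizer

tokenizer = AutoTokenizer(
    pretrained_model_name="gpt2",
    bos_token="<s>",    # beginning-of-sequence token
    eos_token="</s>",   # end-of-sequence token
    unk_token="<unk>",  # substituted for out-of-vocabulary input
)

# Round-trip a string through the tokenizer.
ids = tokenizer.text_to_ids("Hello, world!")
print(ids)
print(tokenizer.ids_to_text(ids))

# Special tokens are also exposed as integer ids (assumed properties
# from NeMo's common tokenizer interface).
print(tokenizer.bos_id, tokenizer.eos_id, tokenizer.unk_id)
```

Passing explicit `bos_token` / `eos_token` / `unk_token` strings like this overrides whatever the pretrained checkpoint ships with, which matters mainly when a downstream model expects specific special-token conventions.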

2. PyTorch Module code

https://pytorch.org/torchtune/0.1/_modules/index.html

2.1. `torchtune.modules.tokenizers._tiktoken`

https://pytorch.org/torchtune/0.1/_modules/torchtune/modules/tokenizers/_tiktoken.html

The module's docstring lists the following constructor arguments:

```
path (str): Path to pretrained tokenizer checkpoint file.
name (str): Name of the tokenizer (used by tiktoken for identification).
pattern (str): Regex pattern used for string parsing.
all_special_tokens (Optional[List[str]]): List of all special tokens.
    First element must be bos token, second element must be eos token, final element must be python tag.
    All elements must be unique. Length must be at most 256. Default: None (will use ALL_SPECIAL_TOKENS)
bos_token (str): Beginning of sequence token. Defaults to BEGIN_OF_TEXT.
eos_token (str): End of sequence token. Defaults to END_OF_TEXT.
start_header_id (str): Start header token. Defaults to START_HEADER_ID.
end_header_id (str): End header token. Defaults to END_HEADER_ID.
step_id (str): Step token. Defaults to STEP_ID.
eom_id (str): End of message token. Defaults to EOM_ID.
eot_id (str): End of turn token. Defaults to EOT_ID.
python_tag (str): Python tag token. Defaults to PYTHON_TAG.
```
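Putting the arguments together, here is a minimal usage sketch, assuming torchtune 0.1 with its tiktoken-based tokenizer class (`TikTokenTokenizer`) and a Llama-3-style tokenizer checkpoint; the file path is a placeholder, and the `add_bos` / `add_eos` flags on `encode` are assumed from the same module.

```python
# Minimal sketch (assumes torchtune==0.1 and a Llama-3-style tiktoken
# checkpoint file; the path below is a placeholder).
from torchtune.modules.tokenizers import TikTokenTokenizer

tokenizer = TikTokenTokenizer(path="/path/to/tokenizer.model")

# encode() can prepend the bos token and append the eos token.
token_ids = tokenizer.encode("Hello, world!", add_bos=True, add_eos=True)
print(token_ids)  # first id should be the bos id, last the eos id

print(tokenizer.decode(token_ids))
```

Because `all_special_tokens` defaults to the module's `ALL_SPECIAL_TOKENS`, the default construction already reserves ids for the bos token, the eos token, and the other special tokens described above.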

