Large Language Model (LLM) Tokenizers - bos_token - eos_token - unk_token

Large Language Model {LLM} Tokenizers - bos_token - eos_token - unk_token

  • [1. NVIDIA NeMo Framework](#1. NVIDIA NeMo Framework)
    • [1.1. Tokenizers](#1.1. Tokenizers)
  • [2. PyTorch Module code](#2. PyTorch Module code)
    • [2.1. `torchtune.modules.tokenizers._tiktoken`](#2.1. torchtune.modules.tokenizers._tiktoken)
  • References

1. NVIDIA NeMo Framework

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).

It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

NeMo Framework provides end-to-end support for developing Large Language Models (LLMs) and Multimodal Models (MMs).

1.1. Tokenizers

复制代码
class nemo.collections.common.tokenizers.AutoTokenizer(
    pretrained_model_name: str,
    vocab_file: str | None = None,
    merges_file: str | None = None,
    mask_token: str | None = None,
    bos_token: str | None = None,
    eos_token: str | None = None,
    pad_token: str | None = None,
    sep_token: str | None = None,
    cls_token: str | None = None,
    unk_token: str | None = None,
    additional_special_tokens: List | None = [],
    use_fast: bool | None = False,
    trust_remote_code: bool | None = False,
)

pretrained_model_name - corresponds to HuggingFace-AutoTokenizer's 'pretrained_model_name_or_path' input argument.

vocab_file - path to file with vocabulary which consists of characters separated by newlines.

mask_token - mask token

bos_token - the beginning of sequence token

eos_token - the end of sequence token. Usually equal to sep_token

pad_token - token to use for padding

sep_token - token used for separating sequences

cls_token - class token. Usually equal to bos_token

unk_token - token to use for unknown tokens

additional_special_tokens - list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

use_fast - whether to use fast HuggingFace tokenizer

2. PyTorch Module code

https://pytorch.org/torchtune/0.1/_modules/index.html

2.1. torchtune.modules.tokenizers._tiktoken

https://pytorch.org/torchtune/0.1/_modules/torchtune/modules/tokenizers/_tiktoken.html

复制代码
        path (str): Path to pretrained tokenizer checkpoint file.
        name (str): Name of the tokenizer (used by tiktoken for identification).
        pattern (str): Regex pattern used to for string parsing.
        all_special_tokens (Optional[List[str]]): List of all special tokens. 
            First element must be bos token, second element must be eos token, final element must be python tag. 
            All elements must be unique. Length must be at most 256. Default: None (will use ALL_SPECIAL_TOKENS)
        bos_token (str): Beginning of sequence token. Defaults to BEGIN_OF_TEXT.
        eos_token (str): End of sequence token. Defaults to END_OF_TEXT.
        start_header_id (str): Start header token. Defaults to START_HEADER_ID.
        end_header_id (str): End header token. Defaults to END_HEADER_ID.
        step_id (str): Step token. Defaults to STEP_ID.
        eom_id (str): End of message token. Defaults to EOM_ID.
        eot_id (str): End of turn token. Defaults to EOT_ID.
        python_tag (str): Python tag token. Defaults to PYTHON_TAG.

References

1 Yongqiang Cheng, https://yongqiang.blog.csdn.net/

2 How do LLMs process text data - A deep dive into Tokenization (Part-1), https://gdevakumar.medium.com/how-do-llms-process-text-data-a-deep-dive-into-tokenization-part-1-342bd365c6dc

相关推荐
AndrewHZ36 分钟前
【LLM技术全景】规模定律与模型演进:为什么模型越大越强?
人工智能·gpt·深度学习·语言模型·llm·openai·规模定律
装不满的克莱因瓶1 小时前
了解 LangChain 中的 LLM 与 ChatModel 的差异
人工智能·python·ai·langchain·llm·agent·chatmodel
颜酱1 小时前
LangChain 工具调用:从原理、入门到落地
langchain·llm
swipe1 小时前
做多轮对话 Agent,为什么我建议把短期记忆放到 Redis
后端·面试·llm
swipe2 小时前
别再把关系库和向量库拆开了:PostgreSQL 搭建 AI 长期记忆层实战
面试·langchain·llm
元Y亨H5 小时前
大数据转大模型(LLM)进阶学习路线图
大数据·llm
小lan猫9 小时前
用 AI Agent 让购物更便捷:LumiGlow 电商网站实践
前端框架·llm·agent
meilindehuzi_a9 小时前
全栈进阶:告别 Node 繁琐配置,用下一代运行时 Bun 丝滑构建 AI Agent 客户端
人工智能·llm
sg_knight9 小时前
Claude Code、Cursor、Copilot、openCode,到底怎么选
llm·copilot·agent·claude·code·codex·claude-code
程序员三明治10 小时前
RAG 元数据的作用与管理:让知识库回答可追溯、可过滤、可维护
人工智能·llm·知识库·元数据·rag·java后端