Integrating external model architectures into vLLM with plugins

Introduction to model architectures

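vLLM is one of the most popular solutions for serving LLM inference. Besides throughput and performance optimizations, it ships with implementations of many model architectures, which live in the models directory of the vLLM source tree.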

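vLLM's model architectures track those in Hugging Face transformers. When a new architecture is added to transformers, the vLLM community typically follows up by implementing a class of the same name. In addition, every model architecture class in vLLM has to be registered in registry.py.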

A model architecture is not the same thing as a model name. For any Hugging Face model, you can find its architecture in its config.json. For Qwen2.5-32B, for example, look here:

huggingface.co/Qwen/Qwen2....


{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  ...
}

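So the model architecture of Qwen2.5-32B is Qwen2ForCausalLM, which is defined in the transformers repository: github.com/huggingface... When vLLM starts, it also parses config.json to get this architecture name, then looks up the registered class of the same name in vLLM and uses it to load the model weight files.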
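
To check the architecture of a locally downloaded checkpoint without opening the file by hand, a few lines of standard-library Python are enough. This is only a sketch: the function name read_architectures and the local path are made up for illustration, and it is not how vLLM itself parses the config.

```python
import json
from pathlib import Path


def read_architectures(config_path):
    """Return the list stored under "architectures" in a config.json file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # A config without the key yields an empty list instead of raising.
    return config.get("architectures", [])


if __name__ == "__main__":
    # Hypothetical local path; point it at a downloaded model directory.
    path = Path("./Qwen2.5-32B/config.json")
    if path.exists():
        print(read_architectures(path))
```

The returned name is the string that has to match a registered architecture class in vLLM.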

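Inside a company, custom model architectures also appear, built for a particular business, product, or problem. These models are not contributed to upstream vLLM (there would be no point). During training, algorithm engineers usually define the new architecture by modifying an existing model from the transformers library. For production, performance requirements call for serving with vLLM, so the architecture has to be reimplemented in vLLM as well; otherwise vLLM fails with "not supported" when it loads the trained weights.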

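There are two ways to do this. One is to write the model architecture as a Python file directly in vLLM's models directory and register it in registry.py; I call this internal integration. The other is to implement the new architecture in a separate Python package (a pip-installable package) and register it from there; I call this external integration (the official name is Out-of-Tree Model Integration).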

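Internal integration is easy to understand but inflexible, in the following ways.

You customize a model architecture on top of some vLLM version and modify registry.py, then redistribute that vLLM build to your internal deployment environment. When you later upgrade to a new vLLM version (vLLM releases very frequently), you may have to redo the same changes on top of the new version.
vLLM may be distributed through a conda-like environment (Mamba, Miniforge, ...). Such an environment is large and comes with many packages preinstalled, vLLM among them. If every new model architecture means modifying vLLM, the whole environment has to be redistributed each time.

So integrating new model architectures outside vLLM is the sensible choice.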

How do you integrate a new model architecture outside vLLM?

If you search for keywords such as "vllm new model", the first thing you find is a classic page from the vLLM 0.6.0 documentation:

docs.vllm.ai/en/v0.6.0/m...

But at the end it only mentions in passing:

If you are running api server with vllm serve , you can wrap the entrypoint with the following code:


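from vllm import ModelRegistry
from your_code import YourModelForCausalLM  # placeholder module name
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
import runpy
runpy.run_module('vllm.entrypoints.openai.api_server', run_name='__main__')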

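If you launch vLLM from your own Python script that calls the vllm library, running the registration code is naturally easy. If you run the model with vllm serve, you need to wrap the entrypoint, but the page does not say how. The author assumes every user already has that background.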

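Reading the official example

Later versions of the vLLM documentation do give a more concrete example, but it lives in the documentation of the plugin system.

Its main purpose is to demonstrate vLLM's plugin mechanism, with external model integration used as the example, so a direct search may not find it: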


# inside `setup.py` file
from setuptools import setup

setup(name='vllm_add_dummy_model',
    version='0.1',
    packages=['vllm_add_dummy_model'],
    entry_points={
        'vllm.general_plugins':
        ["register_dummy_model = vllm_add_dummy_model:register"]
    })

# inside `vllm_add_dummy_model/__init__.py` file
def register():
    from vllm import ModelRegistry

    if "MyLlava" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MyLlava",
            "vllm_add_dummy_model.my_llava:MyLlava",
        )

This is how to wrap the entrypoint; it takes two files.

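Aside: the comment in this official example used to be wrong, and I submitted a fix: github.com/vllm-projec... . The earlier version wrote vllm_add_dummy_model.py where it should have said vllm_add_dummy_model/__init__.py. A single vllm_add_dummy_model.py file would also work, as long as packages=['vllm_add_dummy_model'] in setup.py is changed to py_modules=['vllm_add_dummy_model'].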

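First, create your own Python package and implement the registration of the new architecture in it, that is, call the register_model method of vLLM's ModelRegistry. The first argument is the architecture name and the second is the import path of the architecture class; import paths are covered in more detail below. You can also register the class object directly: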


# inside `vllm_add_dummy_model/__init__.py` file
from vllm_add_dummy_model.my_llava import MyLlava
def register():
    from vllm import ModelRegistry

    if "MyLlava" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MyLlava",
            MyLlava
        )

Then set the entry_points argument in the package's setup.py so that it points at the import path of the registration function (register):


    entry_points={
        'vllm.general_plugins':
        ["register_dummy_model = vllm_add_dummy_model:register"]
    })

Import paths

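A word on import paths. The module:attribute form is the standard notation for entry points in Python packaging: the part before the colon is the module path, and the part after it names the symbol to import. For vLLM model registration, the imported symbol only has to be callable. It can be a function, or an object whose class defines a __call__ method.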

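Now the path part. In xxx:register, xxx can be a file xxx.py, or a directory xxx whose __init__.py defines the register function. The function can also sit deeper in the tree, with dots as separators, for example xxx.yyy:register. That can mean a file yyy.py inside directory xxx that contains a function register, or a directory yyy under xxx whose __init__.py contains register.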
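
The resolution rule can be sketched in a few lines. The helper name resolve is made up for illustration, and standard-library modules (json, os.path) stand in for xxx and xxx.yyy so the snippet runs without any extra package. The last lines show that EntryPoint.load() from importlib.metadata resolves the same kind of string.

```python
import importlib
from importlib.metadata import EntryPoint


def resolve(import_path):
    """Resolve a "package.module:attribute" string to the object it names."""
    module_name, _, attr_path = import_path.partition(":")
    obj = importlib.import_module(module_name)
    # The part after the colon may itself be dotted, e.g. "os:path.join".
    for attr in filter(None, attr_path.split(".")):
        obj = getattr(obj, attr)
    return obj


# Standard-library targets stand in for "xxx:register" and "xxx.yyy:register".
dumps = resolve("json:dumps")
join = resolve("os.path:join")

# EntryPoint.load() from the standard library performs the same resolution.
ep = EntryPoint(name="demo", value="json:dumps", group="demo.group")
assert ep.load() is dumps
```

Whether the module is a single .py file or a directory with an __init__.py makes no difference to the caller; importlib.import_module handles both.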

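Activation method 1

How does this package take effect? If it is named vllm_add_dummy_model, it is enough to pip install vllm_add_dummy_model into the same environment as vLLM.

That is fine if the conda environment is easy to change. But if I can freely install a new package into the deployment conda environment, I could just as well patch the vLLM source to add the new architecture, and that does not solve the inflexibility described earlier. For more flexibility, I recommend the method below.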

Activation method 2

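The most common way to start vLLM is the vllm serve command. Some deployments use Ray Serve instead, running vLLM's engine inside a Ray actor and starting the service with serve run. Either way, the following works without modifying the vLLM source or updating the conda environment.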

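In production, the conda environment (with all third-party packages) and the custom code are usually kept separate (if they are not, consider separating them). In a development environment, go to the directory that contains the new architecture code described above and run: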


python setup.py bdist_wheel

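This generates an egg-info directory, for example vllm_add_dummy_model.egg-info.

Then package the model file, the registration file, and this egg-info directory together (or use the .whl file produced by the command above) and distribute the result to production.

When starting vLLM, cd into the directory where the package was unpacked and run vllm serve. The new model architecture is then recognized.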

setuptools entry points

setuptools started as a third-party library for building Python projects, but it has long been the de facto standard.


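setup.py is the "configuration" file in which setuptools describes how a project is built, comparable to a Makefile (make) or a BUILD file (Bazel) for C++. Because the project being built is Python, there is more room for tricks, and the entry point mechanism is one of them. As seen above, the entry_points argument registers entry points through which our custom code gets executed.

setuptools has some built-in entry point types, such as console_scripts and gui_scripts. For example: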


setup(
    # ...,
    entry_points={
        'console_scripts': [
            'hello-world = timmins:hello_world',
        ]
    }
)

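You can also define your own entry point types.

So when does a plugin registered through an entry point get executed?

There are two ways: the user invokes it explicitly, or the host program discovers and loads it dynamically.

Built-in types such as console_scripts are invoked explicitly by the user. The console_scripts entry named hello-world registered above installs a command called hello-world on the PATH. Typing hello-world in a terminal runs timmins:hello_world.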


$ hello-world
Hello world
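
For completeness, the function behind that command could look like the sketch below. The module name timmins and the function name hello_world come from the entry point string above; the body is assumed from the output shown.

```python
# timmins/__init__.py (module and function names follow the entry point above)
def hello_world():
    """Target of the `hello-world = timmins:hello_world` entry point."""
    print("Hello world")


if __name__ == "__main__":
    hello_world()
```

The generated hello-world command is a thin wrapper that imports timmins and calls hello_world().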

The real power of entry points lies in the second way, which is the core of Python plugin systems. Wikipedia defines a plug-in as follows:

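It extends the functionality of an existing software system without rebuilding the system. Plug-in functionality is one way to make a system customizable.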

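The vllm.general_plugins group used for model registration is an entry point group named general_plugins that vLLM implements itself; vLLM simply calls it a plugin. Below I use the source of vllm.general_plugins as a starting point for explaining how such plugins are implemented.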

The vllm.general_plugins plugin

In vllm/plugins/__init__.py in the vLLM repository, the variable DEFAULT_PLUGINS_GROUP is defined as the string vllm.general_plugins.


# Default plugins group will be loaded in all processes(process0, engine core
# process and worker processes)
DEFAULT_PLUGINS_GROUP = "vllm.general_plugins"

The name says plugin group. It is a "group" because more than one callback can be registered under it.

Besides vllm.general_plugins, vLLM also supports the following plugin groups:


# IO processor plugins group will be loaded in process0 only
IO_PROCESSOR_PLUGINS_GROUP = "vllm.io_processor_plugins"
# Platform plugins group will be loaded in all processes when
# `vllm.platforms.current_platform` is called and the value not initialized,
PLATFORM_PLUGINS_GROUP = "vllm.platform_plugins"
# Stat logger plugins group will be loaded in process0 only when serve vLLM with
# async mode.
STAT_LOGGER_PLUGINS_GROUP = "vllm.stat_logger_plugins"

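The other plugin types are not covered in detail here. In brief:

vllm.io_processor_plugins: plugins that implement custom IO processors.
vllm.platform_plugins: plugins that let vLLM run on new platforms (GPUs).
vllm.stat_logger_plugins: plugins for custom stat loggers.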

load_plugins_by_group()

All plugins in vLLM are loaded by the load_plugins_by_group function, which takes one parameter, group:


def load_plugins_by_group(group: str) -> dict[str, Callable[[], Any]]:
    from importlib.metadata import entry_points

    allowed_plugins = envs.VLLM_PLUGINS

    discovered_plugins = entry_points(group=group)
    if len(discovered_plugins) == 0:
        logger.debug("No plugins for group %s found.", group)
        return {}

    # Check if the only discovered plugin is the default one
    is_default_group = group == DEFAULT_PLUGINS_GROUP
    # Use INFO for non-default groups and DEBUG for the default group
    log_level = logger.debug if is_default_group else logger.info

    log_level("Available plugins for group %s:", group)
    for plugin in discovered_plugins:
        log_level("- %s -> %s", plugin.name, plugin.value)

    if allowed_plugins is None:
        log_level(
            "All plugins in this group will be loaded. "
            "Set `VLLM_PLUGINS` to control which plugins to load."
        )

    plugins = dict[str, Callable[[], Any]]()
    for plugin in discovered_plugins:
        if allowed_plugins is None or plugin.name in allowed_plugins:
            if allowed_plugins is not None:
                log_level("Loading plugin %s", plugin.name)

            try:
                func = plugin.load()
                plugins[plugin.name] = func
            except Exception:
                logger.exception("Failed to load plugin %s", plugin.name)

    return plugins

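The main logic first calls entry_points() to get the discovered plugins (discovered_plugins), then iterates over them and builds a dict to return. The keys are plugin names and the values are the callbacks.

The key is the entry_points() function of importlib.metadata, which queries the entry points registered by every package installed in the current Python environment. importlib.metadata has been in the standard library since Python 3.8, and the group= keyword used here is supported from Python 3.10. The call returns a collection of EntryPoint objects, so plugin in for plugin in discovered_plugins: is an EntryPoint and plugin.name is the plugin's name. Taking the earlier code as an example: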


    entry_points={
        'vllm.general_plugins':
        ["register_dummy_model = vllm_add_dummy_model:register"]
    })

register_dummy_model is plugin.name, and plugin.load() resolves the import path vllm_add_dummy_model:register to the actual func. In other words, it turns a string into a function (any callable works).
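
The name-based filtering and the load() step can be tried in isolation. The sketch below is a simplified stand-in for the loop in load_plugins_by_group(), not the vLLM code itself: load_allowed is a made-up name, the entry points are constructed by hand instead of being discovered, and they point at standard-library callables so the snippet runs anywhere.

```python
from importlib.metadata import EntryPoint


def load_allowed(discovered, allowed=None):
    """Load entry points whose name is allowed; None means load all."""
    loaded = {}
    for ep in discovered:
        if allowed is None or ep.name in allowed:
            loaded[ep.name] = ep.load()  # import path -> callable
    return loaded


# Stand-in entry points that point at standard-library callables.
discovered = [
    EntryPoint(name="to_json", value="json:dumps", group="demo.plugins"),
    EntryPoint(name="join_path", value="os.path:join", group="demo.plugins"),
]

everything = load_allowed(discovered)
only_json = load_allowed(discovered, allowed=["to_json"])
print(sorted(everything))  # ['join_path', 'to_json']
print(list(only_json))     # ['to_json']
```

In vLLM the allow-list comes from VLLM_PLUGINS; when it is unset, every discovered plugin in the group is loaded.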

load_general_plugins()

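load_plugins_by_group() is the generic loader for plugin groups, and it can load every plugin type mentioned above. For vllm.general_plugins there is an extra wrapper: load_general_plugins().

This function does two things. 1) vllm.general_plugins plugins must be idempotent, that is, loading them more than once has no side effects. The function uses the global variable plugins_loaded to detect whether plugin loading has already run in the current process and returns immediately if it has. 2) load_plugins_by_group() only returns a dict of plugin callbacks; the callbacks have not actually been executed, so this function iterates over them and calls each one.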


def load_general_plugins():
    """WARNING: plugins can be loaded for multiple times in different
    processes. They should be designed in a way that they can be loaded
    multiple times without causing issues.
    """
    global plugins_loaded
    if plugins_loaded:
        return
    plugins_loaded = True

    plugins = load_plugins_by_group(group=DEFAULT_PLUGINS_GROUP)
    # general plugins, we only need to execute the loaded functions
    for func in plugins.values():
        func()
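
The two layers of protection, a per-process flag in the loader and an idempotent callback in the plugin, can be sketched without vLLM. Everything here is a stand-in: registry replaces ModelRegistry, and load_once and register are illustrative names.

```python
registry = {}  # stand-in for a model registry
_plugins_loaded = False


def register():
    """Hypothetical plugin callback, safe to call more than once."""
    if "MyLlava" not in registry:
        registry["MyLlava"] = "vllm_add_dummy_model.my_llava:MyLlava"


def load_once(callbacks):
    """Run the callbacks only on the first call within this process."""
    global _plugins_loaded
    if _plugins_loaded:
        return False
    _plugins_loaded = True
    for func in callbacks:
        func()
    return True


print(load_once([register]))  # True: callbacks ran
print(load_once([register]))  # False: the flag short-circuits the call
register()                    # a direct second call leaves the registry unchanged
print(len(registry))          # 1
```

The flag only protects a single process. Each worker process has its own copy of it and runs the callbacks again, which is why the docstring above asks plugins to tolerate repeated loading.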

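Now look at where load_general_plugins() is called. There are several places:

[Screenshot: call sites of load_general_plugins() in the vLLM source]

This matches the earlier point that the host program controls when custom plugins are loaded. The standard library module importlib.metadata provides discovery of all entry points in the environment (the entry_points function), and vLLM decides, as needed, where to discover entry points and load the registered plugins.

EngineCore

The screenshot shows many call sites of load_general_plugins(), because vllm.general_plugins can be used for more than registering model architectures. The one we care about is in core.py: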


class EngineCore:
    """Inner loop of vLLM's Engine."""

    def __init__(
        self,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
        executor_fail_callback: Callable | None = None,
    ):
        # plugins need to be loaded at the engine/scheduler level too
        from vllm.plugins import load_general_plugins

        load_general_plugins()

        self.vllm_config = vllm_config
        if vllm_config.parallel_config.data_parallel_rank == 0:
            logger.info(
                "Initializing a V1 LLM engine (v%s) with config: %s",
                VLLM_VERSION,
                vllm_config,
            )

        self.log_stats = log_stats

        # Setup Model.
        self.model_executor = executor_class(vllm_config)
        ...

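In vLLM's EngineCore, load_general_plugins() is called before model_executor is initialized (that is, before the model is loaded). This loads the plugins registered under vllm.general_plugins, so model architectures defined outside vLLM become visible.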