14 - vLLM LoRA Adapter System: A Deep Dive

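Scope

LoRA (Low-Rank Adaptation) is the vLLM subsystem behind efficient multi-tenant adapter serving (Multi-Tenant LoRA Serving). It follows the architecture of the Punica paper and lets a single inference batch run several different LoRA adapters at once. Low-rank matrix factorization combined with fused batched GEMM kernels keeps the overhead of switching adapters small. This article covers the full path from configuration, weight management, and request binding through layer replacement to the GPU kernels.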
[Diagram: system overview. Groups: LoRA core layer, Layer implementations, Runtime engine, Triton/CUDA kernel. Nodes: LoRAConfig (configuration parameters), LoRARequest (request binding), LoRALayerWeights (weight storage), LoRAModel (model wrapper), BaseLayerWithLoRA (abstract base class), ColumnParallelLinear+LoRA, RowParallelLinear+LoRA, FusedMoE+LoRA, LogitsProcessor+LoRA, ReplicatedLinear+LoRA, VocabParallelEmbedding+LoRA, WorkerLoRAManager (worker-level management), LoRAModelManager (model manager), LoRAResolver (adapter resolution), PunicaWrapper (GPU compute library), batched GPU computation.]

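1. LoRA Overview

1.1 How Low-Rank Adaptation Works

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Its core idea is to factor the large weight update $\Delta W$ of full fine-tuning into the product of two low-rank matrices: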

$$\Delta W = B \times A, \quad \text{where } A \in \mathbb{R}^{r \times k}, \; B \in \mathbb{R}^{d \times r}$$

$A$ (lora_a): the down-projection matrix, mapping the input from $k$ dimensions to $r$ dimensions ($r \ll \min(k, d)$)
$B$ (lora_b): the up-projection matrix, mapping the $r$-dimensional result back to $d$ dimensions
rank ($r$): the low-rank dimension; typical values are 8, 16, 32, and 64
scaling factor: $\text{scaling} = \frac{\alpha}{r}$, where $\alpha$ is lora_alpha

Forward computation (the update $\Delta W = BA$ is applied with the scaling factor $\frac{\alpha}{r}$):

$$h = W_0 x + \frac{\alpha}{r}\,\Delta W x = W_0 x + \frac{\alpha}{r}\,B A x$$
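
To make the formula concrete, here is a minimal pure-Python sketch of the forward pass on toy matrices. The helper names (`matvec`, `lora_forward`, `param_counts`) and all numbers are illustrative assumptions, not vLLM code; the sketch only shows the order of operations (shrink with A, expand with B, scale by alpha / r) and how small the low-rank update is.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0, a, b, x, alpha):
    """Compute h = W0 x + (alpha / r) * B (A x); r is the row count of A."""
    r = len(a)
    scaling = alpha / r
    base = matvec(w0, x)      # base layer output, d values
    low = matvec(a, x)        # shrink: k -> r
    delta = matvec(b, low)    # expand: r -> d
    return [h0 + scaling * dh for h0, dh in zip(base, delta)]


def param_counts(d, k, r):
    """Return (size of a full d x k update, size of the LoRA update)."""
    return d * k, r * (d + k)


# Toy sizes: k = 3 inputs, d = 2 outputs, rank r = 1.
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]       # shape (r, k)
B = [[0.5], [-0.5]]         # shape (d, r)
x = [1.0, 2.0, 3.0]

h = lora_forward(W0, A, B, x, alpha=2.0)
full, low_rank = param_counts(4096, 4096, 16)
print(h, full, low_rank)
```

With these numbers the base output is [1, 2], the shrink step yields [6], and the scaled expand step adds [6, -6]. For a 4096 x 4096 weight and rank 16, the LoRA update stores 128 times fewer values than a full update.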

vLLM's LoRA implementation derives from the paper Punica: Multi-Tenant LoRA Serving. Its key innovations are:

Batched GEMM: the $A$ and $B$ matrices of multiple LoRA adapters are stacked into 4D tensors of shape (max_loras, 1, rank, dim), so that a Triton/CUDA kernel can serve several adapters in a single forward pass
Dynamic Dispatch: token_lora_indices maps every token to its LoRA slot index, which gives per-token adapter selection
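
The two ideas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function `batched_lora`, the convention that index -1 means "no adapter", and the toy weights are made up for the example, and a real kernel fuses the loop into one GPU launch instead of iterating token by token.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def batched_lora(xs, token_lora_indices, a_stacked, b_stacked):
    """Return the LoRA delta of every token, picking the adapter per token.

    a_stacked[slot] has shape (rank, k) and b_stacked[slot] has shape (d, rank).
    A slot index of -1 means "no adapter" (an assumption of this sketch).
    """
    d = len(b_stacked[0])
    deltas = []
    for x, slot in zip(xs, token_lora_indices):
        if slot < 0:
            deltas.append([0.0] * d)
            continue
        deltas.append(matvec(b_stacked[slot], matvec(a_stacked[slot], x)))
    return deltas


# Two adapter slots, k = 2, rank = 1, d = 2.
a_stacked = [[[1.0, 0.0]], [[0.0, 1.0]]]
b_stacked = [[[1.0], [1.0]], [[2.0], [0.0]]]

# Three tokens in one batch: slot 0, slot 1, and no adapter.
xs = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
token_lora_indices = [0, 1, -1]

deltas = batched_lora(xs, token_lora_indices, a_stacked, b_stacked)
print(deltas)
```

The same input vector produces three different deltas because each token reads a different slot of the stacked weights.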

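2. LoRAConfig Configuration Parameters

Source: lora.py (file:///workspace/vllm/config/lora.py)

LoRAConfig is the core configuration class that controls LoRA behavior in vLLM. It uses Pydantic for parameter validation.

2.1 Key Options

Parameter	Type	Default	Description
`max_lora_rank`	`MaxLoRARanks`	`16`	Maximum supported LoRA rank; allowed values: 1/8/16/32/64/128/256/320/512
`max_loras`	`int`	`1`	Maximum number of LoRA adapters that can be present in a single batch
`fully_sharded_loras`	`bool`	`False`	Whether to fully shard the LoRA weights across tensor-parallel ranks (the S-LoRA strategy)
`max_cpu_loras`	`int | None`	`None`	Maximum number of LoRA adapters cached in CPU memory; falls back to `max_loras` when unset
`lora_dtype`	`torch.dtype | str`	`"auto"`	
`target_modules`	`list[str] | None`	`None`	
`default_mm_loras`	`dict[str, str] | None`	`None`	
`enable_tower_connector_lora`	`bool`	`False`	Whether to enable LoRA for the vision encoder and connector (experimental)
`specialize_active_lora`	`bool`	`False`	Whether to specialize CUDA graphs by the number of active LoRAs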

2.2 Validation Logic

python

# /workspace/vllm/config/lora.py L101-L118
@model_validator(mode="after")
def _validate_lora_config(self) -> Self:
    if self.max_cpu_loras is None:
        self.max_cpu_loras = self.max_loras
    elif self.max_cpu_loras < self.max_loras:
        raise ValueError(
            f"max_cpu_loras ({self.max_cpu_loras}) must be >= "
            f"max_loras ({self.max_loras})."
        )
    if envs.VLLM_LORA_ENABLE_DUAL_STREAM and not current_platform.is_cuda_alike():
        raise ValueError("Dual CUDA streams are only supported on CUDA platforms.")
    ...
    return self

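Key constraints:

max_cpu_loras >= max_loras: the CPU cache must hold at least as many adapters as can be active on the GPU
Dual stream is available only on CUDA-alike platforms (the check is current_platform.is_cuda_alike())
fully_sharded_loras is incompatible with dual stream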
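
The first constraint is easy to reproduce in isolation. The sketch below uses a plain dataclass named `SketchLoRAConfig`, a hypothetical stand-in that models only two fields; the real class relies on a Pydantic validator as shown in the excerpt, and the platform and dual-stream checks are left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchLoRAConfig:
    """Hypothetical stand-in for LoRAConfig; only two fields are modelled."""

    max_loras: int = 1
    max_cpu_loras: Optional[int] = None

    def __post_init__(self) -> None:
        if self.max_cpu_loras is None:
            # Unset: size the CPU cache to match the GPU capacity.
            self.max_cpu_loras = self.max_loras
        elif self.max_cpu_loras < self.max_loras:
            raise ValueError(
                f"max_cpu_loras ({self.max_cpu_loras}) must be >= "
                f"max_loras ({self.max_loras})."
            )


defaulted = SketchLoRAConfig(max_loras=4)
print(defaulted.max_cpu_loras)

try:
    SketchLoRAConfig(max_loras=4, max_cpu_loras=2)
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)
```

Failing at construction time means a misconfigured cache size is reported before any adapter is loaded.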

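2.3 Hash Computation

compute_hash() identifies the combination of settings that affects the computation graph; the resulting string serves as part of the CUDA graph cache key:

python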

# /workspace/vllm/config/lora.py L75-L99
def compute_hash(self) -> str:
    factors: list[Any] = []
    factors.append(self.max_lora_rank)
    factors.append(self.max_loras)
    factors.append(self.fully_sharded_loras)
    factors.append(self.lora_dtype)
    factors.append(self.enable_tower_connector_lora)
    factors.append(
        tuple(sorted(self.target_modules)) if self.target_modules else None
    )
    hash_str = safe_hash(str(factors).encode(), usedforsecurity=False).hexdigest()
    return hash_str
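
A small sketch shows two properties of this scheme: sorting target_modules makes the hash independent of list order, and fields that are not appended to factors do not influence it. The function `config_hash` is hypothetical, and `hashlib.sha256` is an assumption standing in for vLLM's `safe_hash` helper.

```python
import hashlib


def config_hash(max_lora_rank, max_loras, target_modules=None):
    """Hash only the factors assumed to affect the computation graph."""
    factors = [
        max_lora_rank,
        max_loras,
        # Sorting makes the hash independent of the order of the list.
        tuple(sorted(target_modules)) if target_modules else None,
    ]
    return hashlib.sha256(str(factors).encode()).hexdigest()


h1 = config_hash(16, 4, ["q_proj", "v_proj"])
h2 = config_hash(16, 4, ["v_proj", "q_proj"])   # same modules, other order
h3 = config_hash(32, 4, ["q_proj", "v_proj"])   # different rank

print(h1 == h2, h1 == h3)
```

In the excerpt above, max_cpu_loras is not among the factors, so changing it would leave the hash unchanged.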

3. LoRAModel: Model Wrapper

Source: lora_model.py (file:///workspace/vllm/lora/lora_model.py)

A LoRAModel is the complete representation of one loaded LoRA adapter and holds the weights of every layer that the adapter touches.

3.1 Data Structure

python

# /workspace/vllm/lora/lora_model.py L24-L46
class LoRAModel:
    """A LoRA fine-tuned model."""

    def __init__(
        self,
        lora_model_id: int,       # globally unique LoRA ID (> 0)
        rank: int,                 # actual rank of this adapter
        loras: dict[str, LoRALayerWeights],  # module_name -> weights
    ) -> None:
        self.id = lora_model_id
        assert lora_model_id > 0, (
            f"a valid lora id should be greater than 0, got {self.id}"
        )
        self.rank = rank
        self.loras: dict[str, LoRALayerWeights] = loras

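Core attributes:

id: a globally unique integer identifier; the manager maps it to a slot index along the first dimension of the stacked weight tensors
rank: the actual rank of this adapter (must be <= max_lora_rank)
loras: a dictionary {module_name: LoRALayerWeights} keyed by model layer name (for example model.layers.0.self_attn.qkv_proj)

3.2 Building from Tensors

python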

# /workspace/vllm/lora/lora_model.py L73-L121
@classmethod
def from_lora_tensors(
    cls,
    lora_model_id: int,
    tensors: dict[str, torch.Tensor],   # raw weight dictionary
    peft_helper: PEFTHelper,             # PEFT configuration
    device: str = "cuda",
    dtype: torch.dtype | None = None,
    model_vocab_size: int | None = None,
    weights_mapper: WeightsMapper | None = None,
    skip_prefixes: list[str] | None = None,
) -> "LoRAModel":

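Build steps:

Iterate over every tensor name in tensors
Skip base embedding weights and modules that match the given prefixes
Call parse_fine_tuned_lora_name() to obtain module_name and whether the tensor is lora_a
Create one LoRALayerWeights object per module and fill in the matching lora_a / lora_b tensor
Optionally pin the tensors in memory (pin_memory, a CPU-side optimization)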
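
The grouping step can be sketched as follows. `split_lora_name` is a hypothetical, much simplified stand-in for parse_fine_tuned_lora_name(); it assumes PEFT-style names of the form `base_model.model.<module>.lora_A.weight`, and the strings "A0" and "B0" stand in for real tensors.

```python
PREFIX = "base_model.model."


def split_lora_name(name):
    """Hypothetical parser: return (module_name, is_lora_a)."""
    parts = name.split(".")
    if len(parts) < 3 or parts[-1] != "weight" or parts[-2] not in ("lora_A", "lora_B"):
        raise ValueError(f"unsupported LoRA tensor name: {name}")
    module = ".".join(parts[:-2])
    if module.startswith(PREFIX):
        module = module[len(PREFIX):]
    return module, parts[-2] == "lora_A"


def group_tensors(tensors, skip_prefixes=()):
    """Collect the lora_a / lora_b entries of every module into one record."""
    grouped = {}
    for name, tensor in tensors.items():
        module, is_a = split_lora_name(name)
        if any(module.startswith(p) for p in skip_prefixes):
            continue
        record = grouped.setdefault(module, {"lora_a": None, "lora_b": None})
        record["lora_a" if is_a else "lora_b"] = tensor
    return grouped


tensors = {
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight": "A0",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight": "B0",
    "base_model.model.vision.proj.lora_A.weight": "A1",
}
grouped = group_tensors(tensors, skip_prefixes=("vision.",))
print(grouped)
```

Keying the result by module name is what lets each replaced layer later look up its own pair of matrices.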

3.3 Loading from a Local Checkpoint

python

# /workspace/vllm/lora/lora_model.py L123-L244
@classmethod
def from_local_checkpoint(
    cls,
    lora_dir: str,                          # path to the LoRA directory
    expected_lora_modules: set[str],         # set of expected target modules
    peft_helper: PEFTHelper,
    *,
    lora_model_id: int | None = None,
    device: str = "cuda",
    ...
) -> "LoRAModel":

Supported file formats, in priority order:

Tensorizer format: adapter_model.tensors (via TensorDeserializer)
SafeTensors format: adapter_model.safetensors (recommended)
PyTorch format: adapter_model.bin or adapter_model.pt

Module validation: at load time the checkpoint's modules are checked against expected_lora_modules, which prevents loading a mismatched adapter.
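
A rough sketch of both ideas, using only the standard library. The helpers `pick_checkpoint` and `unexpected_modules` are hypothetical; treating the format choice as a pure file-existence check and matching modules by their last name component are simplifying assumptions of this example, not a description of the actual loader.

```python
import os
import tempfile

# Priority order as listed above; the file names come from the text.
CANDIDATES = (
    "adapter_model.tensors",
    "adapter_model.safetensors",
    "adapter_model.bin",
    "adapter_model.pt",
)


def pick_checkpoint(lora_dir):
    """Hypothetical helper: return the first candidate file that exists."""
    for fname in CANDIDATES:
        path = os.path.join(lora_dir, fname)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"no adapter weights found in {lora_dir}")


def unexpected_modules(checkpoint_modules, expected_lora_modules):
    """Return checkpoint modules whose last name component is not expected."""
    return sorted(
        m for m in checkpoint_modules
        if m.split(".")[-1] not in expected_lora_modules
    )


with tempfile.TemporaryDirectory() as lora_dir:
    # Create two empty candidate files; the higher-priority one should win.
    for fname in ("adapter_model.bin", "adapter_model.safetensors"):
        open(os.path.join(lora_dir, fname), "wb").close()
    chosen = os.path.basename(pick_checkpoint(lora_dir))

bad = unexpected_modules(
    ["model.layers.0.self_attn.q_proj", "model.layers.0.mlp.router"],
    {"q_proj", "v_proj"},
)
print(chosen, bad)
```

A non-empty result from the module check is the signal to reject the adapter before any weights are copied.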

3.4 Clone and Lookup Interfaces

python

# /workspace/vllm/lora/lora_model.py L48-L63
def clone(self, lora_model_id: int) -> "LoRAModel":
    """Return a copy with a different ID that shares the underlying tensors."""
    return self.__class__(lora_model_id, rank=self.rank, loras=self.loras.copy())

def get_lora(self, module_name: str) -> LoRALayerWeights | None:
    """Look up the LoRA weights for a module name."""
    return self.loras.get(module_name, None)

def check_lora_name(self, lora_name: str) -> bool:
    return lora_name in self.loras
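
Because clone() calls dict.copy(), the copy is shallow: the clone gets its own dictionary but points at the same weight objects. The class `SketchLoRAModel` below is a hypothetical miniature that keeps only what is needed to show this, with plain lists standing in for tensors.

```python
class SketchLoRAModel:
    """Hypothetical miniature of LoRAModel with the same clone() logic."""

    def __init__(self, lora_model_id, rank, loras):
        self.id = lora_model_id
        self.rank = rank
        self.loras = loras

    def clone(self, lora_model_id):
        # dict.copy() is shallow: new dict, same value objects.
        return self.__class__(lora_model_id, rank=self.rank, loras=self.loras.copy())


weights = [0.1, 0.2]                      # stands in for a tensor
original = SketchLoRAModel(1, 8, {"q_proj": weights})
duplicate = original.clone(2)

# Adding an entry to the clone does not touch the original dictionary...
duplicate.loras["v_proj"] = [0.3]
# ...but existing entries are the very same objects, so no memory is copied.
shared = duplicate.loras["q_proj"] is original.loras["q_proj"]
print(shared, sorted(original.loras), sorted(duplicate.loras))
```

The practical consequence is that an in-place edit of a shared weight object would be visible through both models.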

4. LoRAWeights: Weight Management

Source: lora_weights.py (file:///workspace/vllm/lora/lora_weights.py)

This module defines the data structures for LoRA weights and the logic for packing them.

4.1 LoRALayerWeights: Per-Layer Weights

python

# /workspace/vllm/lora/lora_weights.py L13-L96
class LoRALayerWeights:
    """LoRA weights for a layer composed of two low rank matrixes."""

    def __init__(
        self,
        module_name: str,           # module name
        rank: int,                   # rank
        lora_alpha: int,             # alpha value
        lora_a: torch.Tensor,        # A matrix, shape: (rank, input_dim)
        lora_b: torch.Tensor,        # B matrix, shape: (output_dim, rank)
        scaling: float | None = None,# scaling factor
    ) -> None:
        ...
        if scaling is None:
            self.scaling = self.lora_alpha / self.rank  # default: scaling = alpha / rank

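Key attributes and methods:

Attribute/method	Returns	Description
`input_dim`	`int`	`lora_a.shape[1]`, the input dimension
`output_dim`	`int`	`lora_b.shape[0]`, the output dimension
`is_packed`	`bool`	Whether this is a packed structure (the base class returns False)
`optimize()`	`self`	Merges scaling into lora_b (avoids a multiplication at runtime)

The optimize operation pre-multiplies lora_b by the scaling factor, which removes one multiplication from every inference step:

python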

# /workspace/vllm/lora/lora_weights.py L36-L42
def optimize(self) -> "LoRALayerWeights":
    """Optimize the LoRA by merging the scaling into lora_b."""
    if self.scaling == 1:
        return self
    self.lora_b *= self.scaling
    self.scaling = 1
    return self
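
That the merge does not change the result follows from linearity: s * (B (A x)) equals (s * B) (A x). A quick numeric check in plain Python, with made-up matrices and hypothetical helper names:

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def scale_rows(m, s):
    """Return a copy of matrix m with every entry multiplied by s."""
    return [[s * value for value in row] for row in m]


A = [[1.0, 2.0]]          # shape (rank=1, input_dim=2)
B = [[3.0], [4.0]]        # shape (output_dim=2, rank=1)
x = [1.0, 1.0]
scaling = 0.5

# Scaling applied at runtime, after the expand step.
runtime = [scaling * value for value in matvec(B, matvec(A, x))]

# Scaling merged into lora_b once, ahead of time.
merged = matvec(scale_rows(B, scaling), matvec(A, x))

print(runtime, merged)
```

Both paths give the same vector, so the one-time merge trades a per-token multiplication for a single pass over lora_b at load time.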

4.2 PackedLoRALayerWeights: Packed Weights

Used for packed layers (for example qkv_proj, which merges Q/K/V into one linear layer); it bundles the LoRAs of several sub-layers into a single object:

python

# /workspace/vllm/lora/lora_weights.py L99-L249
class PackedLoRALayerWeights(LoRALayerWeights):
    def __init__(
        self,
        module_name: str,
        rank: int,
        lora_alphas: list[int | None],     # alpha of each sub-layer
        lora_a: list[torch.Tensor | None], # A of each sub-layer
        lora_b: list[torch.Tensor | None], # B of each sub-layer
        scaling: list[float] | None = None,
    ) -> None:

The pack method combines several LoRALayerWeights into one PackedLoRALayerWeights:

python

@classmethod
def pack(cls, loras: GenericSequence["LoRALayerWeights | None"]) -> "PackedLoRALayerWeights":
    """Pack a list of LoRAs into a single LoRA.
    If LoRA is None, it signifies that the submodule does not have a LoRA.
    """

The pack_moe method is dedicated to MoE layers and organizes the expert weights as (w1, w2, w3) triples:

python

# /workspace/vllm/lora/lora_weights.py L154-L228
@classmethod
def pack_moe(
    cls,
    loras: GenericSequence["LoRALayerWeights | None"],
    module_name: str,
    is_non_gated_moe: bool = False,
) -> "PackedLoRALayerWeights":

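Key details of MoE packing:

Iterate over num_experts and extract (w1_lora, w2_lora, w3_lora) for each expert
Use torch.stack to stack weights of the same kind along the expert dimension
For non-gated MoE (no gate_proj), reuse the w1 weights as w3 and set last_scaling=1.0 to avoid scaling twice

4.3 Creating Dummy Weights

Used for warmup and as slot placeholders:

python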

# /workspace/vllm/lora/lora_weights.py L72-L96
@classmethod
def create_dummy_lora_weights(
    cls,
    module_name: str,
    input_dim: int,
    output_dim: int,
    rank: int,
    dtype: torch.dtype,
    device: torch.types.Device,
) -> "LoRALayerWeights":
    pin_memory = str(device) == "cpu" and is_pin_memory_available()
    lora_a = torch.zeros([rank, input_dim], dtype=dtype, device=device, pin_memory=pin_memory)
    lora_b = torch.zeros([output_dim, rank], dtype=dtype, device=device, pin_memory=pin_memory)
    return cls(module_name, rank=rank, lora_alpha=1, lora_a=lora_a, lora_b=lora_b)

5. LoRA Request: Request-Level Binding

Source: request.py (file:///workspace/vllm/lora/request.py)

LoRARequest is the contract that binds a user request to a LoRA adapter.

5.1 Data Structure

python

# /workspace/vllm/lora/request.py L8-L66
class LoRARequest(
    msgspec.Struct,
    omit_defaults=True,
    array_like=True,
):
    """
    Request for a LoRA adapter.

    lora_int_id must be globally unique for a given adapter.
    """

    lora_name: str                    # LoRA name (human-readable)
    lora_int_id: int                  # globally unique integer ID (> 0)
    lora_path: str = ""               # adapter file path or HuggingFace repo ID
    base_model_name: str | None = None  # base model name
    tensorizer_config_dict: dict | None = None  # Tensorizer deserialization config
    load_inplace: bool = False        # whether to force a reload (overwrites the loaded adapter)

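5.2 Key Design Decisions

msgspec.Struct: the class uses msgspec rather than a dataclass for three reasons:

Fast serialization and deserialization
array_like=True encodes the struct as a positional array instead of a key/value map, which makes the encoded message more compact
omit_defaults=True leaves out fields that still hold their default value, which reduces the amount of data transferred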
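
The size effect of positional (array-like) encoding can be illustrated without msgspec. The sketch below uses the standard-library json module purely as a stand-in, with made-up field values; it is not how msgspec encodes data internally, it only shows why dropping field names shrinks a message.

```python
import json

# Hypothetical request fields; the values are made up for the example.
record = {
    "lora_name": "sql-adapter",
    "lora_int_id": 1,
    "lora_path": "/adapters/sql",
}

# Map-style encoding repeats every field name in every message.
as_object = json.dumps(record, separators=(",", ":"))

# Array-style encoding keeps only the values, in declaration order.
as_array = json.dumps(list(record.values()), separators=(",", ":"))

print(len(as_object), as_object)
print(len(as_array), as_array)
```

The receiver can rebuild the fields from the array because both sides agree on the field order, which is the trade-off of a positional format: compact, but sensitive to reordering fields.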

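Validation of the ID and path (the code checks that the ID is positive and the path is non-empty; keeping lora_int_id unique is left to the caller):

python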

# /workspace/vllm/lora/request.py L32-L37
def __post_init__(self):
    if self.lora_int_id < 1:
        raise ValueError(f"id must be > 0, got {self.lora_int_id}")
    assert self.lora_path, "lora_path cannot be empty"

Equality and hashing are based on lora_name, so adapters with the same name deduplicate correctly across engines:

python

# /workspace/vllm/lora/request.py L51-L66
def __eq__(self, value: object) -> bool:
    return isinstance(value, self.__class__) and self.lora_name == value.lora_name

def __hash__(self) -> int:
    return hash(self.lora_name)
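
The effect of name-based equality is easiest to see with a set. `SketchLoRARequest` is a hypothetical plain-Python class that copies only the validation, `__eq__`, and `__hash__` logic shown above; the names and paths are made up.

```python
class SketchLoRARequest:
    """Hypothetical miniature of LoRARequest (no msgspec dependency)."""

    def __init__(self, lora_name, lora_int_id, lora_path):
        if lora_int_id < 1:
            raise ValueError(f"id must be > 0, got {lora_int_id}")
        if not lora_path:
            raise ValueError("lora_path cannot be empty")
        self.lora_name = lora_name
        self.lora_int_id = lora_int_id
        self.lora_path = lora_path

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.lora_name == other.lora_name

    def __hash__(self):
        return hash(self.lora_name)


a = SketchLoRARequest("sql", 1, "/adapters/sql")
b = SketchLoRARequest("sql", 2, "/adapters/sql-v2")   # same name, other id/path
c = SketchLoRARequest("chat", 3, "/adapters/chat")

unique = {a, b, c}
print(len(unique), a == b, a == c)
```

Requests a and b collapse into one set entry even though their IDs and paths differ, so under this scheme each adapter needs its own distinct name.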

Convenience properties:

python

@property
def adapter_id(self):
    return self.lora_int_id

@property
def name(self):
    return self.lora_name

@property
def path(self):
    return self.lora_path

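6. LoRA Layer Implementations

Source directory: layers/ (file:///workspace/vllm/lora/layers/)

vLLM replaces the Linear/Embedding/MoE layers of the original model with LoRA-capable versions, which form a complete class hierarchy.

6.1 Class Hierarchy Overview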

[Diagram: class hierarchy rooted at nn.Module. Nodes: BaseLayerWithLoRA (abstract base class), BaseLinearLayerWithLoRA (shared Linear logic), ColumnParallelLinearWithLoRA, MergedColumnParallelLinearWithLoRA (gate_up_proj and similar, 2 slices), MergedColumnParallelLinearVariableSliceWithLoRA (3+ slices), QKVParallelLinearWithLoRA (qkv_proj, 1 slice), MergedQKVParallelLinearWithLoRA (qkv_proj, 3 slices), ColumnParallelLinearWithShardedLoRA (S-LoRA), MergedColumnParallelLinearWithShardedLoRA, QKVParallelLinearWithShardedLoRA, MergedQKVParallelLinearWithShardedLoRA, RowParallelLinearWithLoRA, RowParallelLinearWithShardedLoRA, ReplicatedLinearWithLoRA, FusedMoEWithLoRA, FusedMoE3DWithLoRA, LogitsProcessorWithLoRA, VocabParallelEmbeddingWithLoRA.]

6.2 BaseLayerWithLoRA: Abstract Base Class

Source: base.py (file:///workspace/vllm/lora/layers/base.py)

Defines the interface contract that every LoRA layer must implement:

python

# /workspace/vllm/lora/layers/base.py L16-L78
class BaseLayerWithLoRA(nn.Module):
    def slice_lora_a(self, lora_a): ...      # shard lora_a for TP
    def slice_lora_b(self, lora_b): ...      # shard lora_b for TP
    def create_lora_weights(...): ...        # allocate the stacked weight buffers
    def reset_lora(self, index: int): ...    # zero out one slot
    def set_lora(self, index, lora_a, lora_b): ...  # write weights into the given slot
    def set_mapping(self, punica_wrapper): ...       # bind the PunicaWrapper
    @classmethod
    def can_replace_layer(cls, source_layer, lora_config, packed_modules_list, model_config): ...

6.3 BaseLinearLayerWithLoRA: Shared Base for Linear Layers

Source: base_linear.py (file:///workspace/vllm/lora/layers/base_linear.py)

Holds the logic shared by all Linear-type LoRA layers: weight buffer allocation, synchronous and asynchronous apply, dual-stream support, and so on.

6.3.1 Weight Buffer Allocation

python

# /workspace/vllm/lora/layers/base_linear.py L106-L157
def create_lora_weights(
    self,
    max_loras: int,
    lora_config: LoRAConfig,
    model_config: PretrainedConfig | None = None,
) -> None:
    self.lora_config = lora_config
    # choose the output sizes based on the type of base_layer
    if isinstance(self.base_layer, ReplicatedLinear):
        lora_a_out_size = lora_config.max_lora_rank
        lora_b_out_size = self.output_size
    elif isinstance(self.base_layer, ColumnParallelLinear):
        lora_a_out_size = (
            lora_config.max_lora_rank
            if not lora_config.fully_sharded_loras
            else divide(lora_config.max_lora_rank, self.tp_size)
        )
        lora_b_out_size = self.output_size
    elif isinstance(self.base_layer, RowParallelLinear):
        lora_a_out_size = lora_config.max_lora_rank
        lora_b_out_size = (
            self.output_size
            if not lora_config.fully_sharded_loras
            else divide(self.output_size, self.tp_size)
        )

    # allocate the stacked weight buffers
    # shape: (max_loras, 1, lora_out_size, input/output_size)
    self.lora_a_stacked = tuple(
        torch.zeros(max_loras, 1, lora_a_out_size, self.input_size,
                     dtype=lora_config.lora_dtype, device=self.device)
        for _ in range(self.n_slices)
    )
    self.lora_b_stacked = tuple(
        torch.zeros(max_loras, 1, lora_b_out_size, lora_config.max_lora_rank,
                     dtype=lora_config.lora_dtype, device=self.device)
        for _ in range(self.n_slices)
    )

Shape of the stacked weight tensors:

Dimension 0 (max_loras): one slot per LoRA adapter
Dimension 1 (size 1): reserved for broadcasting (required by the Punica kernels)
Dimensions 2-3: the actual weight matrix
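
One consequence of preallocating every slot at max_lora_rank is that an adapter with a smaller rank simply leaves the remaining rows at zero, and zero rows contribute nothing to the product. The class `SlotBuffer` below is a hypothetical miniature of a stacked lora_a buffer built from nested lists; it drops the singleton dimension and is not the vLLM implementation.

```python
def zeros(rows, cols):
    """Return a rows x cols matrix of zeros as nested lists."""
    return [[0.0] * cols for _ in range(rows)]


class SlotBuffer:
    """Hypothetical stacked lora_a buffer of shape (max_loras, max_rank, input_size)."""

    def __init__(self, max_loras, max_rank, input_size):
        self.max_rank = max_rank
        self.input_size = input_size
        self.slots = [zeros(max_rank, input_size) for _ in range(max_loras)]

    def reset_lora(self, index):
        """Zero out one slot."""
        self.slots[index] = zeros(self.max_rank, self.input_size)

    def set_lora(self, index, lora_a):
        """Write lora_a (rank x input_size) into a slot, zero-padded to max_rank."""
        if len(lora_a) > self.max_rank:
            raise ValueError("adapter rank exceeds max_rank")
        self.reset_lora(index)
        for r, row in enumerate(lora_a):
            self.slots[index][r] = list(row)


buf = SlotBuffer(max_loras=2, max_rank=4, input_size=3)
buf.set_lora(1, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # a rank-2 adapter
print(buf.slots[1])
```

Resetting before writing matters: without it, rows left over from a previously loaded adapter with a larger rank would leak into the new one.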

6.3.2 Synchronous and Dual-Stream Execution

python

# /workspace/vllm/lora/layers/base_linear.py L192-236
def apply(self, x: torch.Tensor, bias: torch.Tensor | None = None) -> torch.Tensor:
    if self._enable_aux_cuda_stream and is_forward_context_available():
        # dual CUDA stream mode: the base layer and LoRA run in parallel on different streams
        output_size = sum(self.output_slices)
        return torch.ops.vllm.lora_linear_async(
            self.layer_name, output_size, x, bias
        )
    else:
        # synchronous mode: run sequentially
        return self._apply_sync(x, bias)

def _apply_sync(self, x, bias):
    output = self.base_layer.quant_method.apply(self.base_layer, x, bias)
    return self._apply_lora_to_output(x, output)

Dual-stream optimization (VLLM_LORA_ENABLE_DUAL_STREAM=True):

The base linear layer runs on the default CUDA stream
The LoRA computation runs on an auxiliary stream
The two synchronize through torch.cuda.Event, so their work overlaps

6.3.3 Core LoRA Application Logic

python

# /workspace/vllm/lora/layers/base_linear.py L213-236
def _apply_lora_to_output(self, x: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
    original_shape = output.shape if output.ndim == 3 else None

    # the transformers backend needs the batch dimension flattened
    if x.ndim == 3 and output.ndim == 3:
        output = output.flatten(0, 1)
        x = x.flatten(0, 1)

    # call PunicaWrapper to run the batched LoRA GEMM
    lora_output: torch.Tensor | None = self.punica_wrapper.add_lora_linear(
        output, x, self.lora_a_stacked, self.lora_b_stacked, 1.0, self.output_slices
    )
    if not current_platform.can_update_inplace():
        output = lora_output

    if original_shape is not None:
        output = output.reshape(original_shape)
    return output

6.4 ColumnParallelLinearWithLoRA

Source: column_parallel_linear.py (file:///workspace/vllm/lora/layers/column_parallel_linear.py)

The LoRA wrapper for ColumnParallelLinear and its variants. Its defining trait is that lora_b is sharded along the output dimension to match TP.

python

# /workspace/vllm/lora/layers/column_parallel_linear.py L102-L128
def slice_lora_b(self, lora_b: torch.Tensor) -> torch.Tensor:
    if self.is_merged_col_linear:
        # MergedColumnParallelLinear (e.g. gate_up_proj): split into two halves, then shard each half
        shard_size = self.output_size // 2
        offset = lora_b.shape[0] // 2
        left_weight = lora_b[self.tp_rank * shard_size : (self.tp_rank + 1) * shard_size, :]
        right_weight = lora_b[offset + self.tp_rank * shard_size : offset + (self.tp_rank + 1) * shard_size, :]
        lora_b = torch.cat([left_weight, right_weight], dim=0)
    else:
        # plain ColumnParallelLinear
        shard_size = self.output_size
        start_idx = self.tp_rank * shard_size
        end_idx = (self.tp_rank + 1) * shard_size
        lora_b = lora_b[start_idx:end_idx, :]
    return lora_b

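Subclass family:

Class	Use case	n_slices	Special logic
`ColumnParallelLinearWithLoRA`	Plain ColumnParallelLinear	1	Shards lora_b
`MergedColumnParallelLinearWithLoRA`	2-slice merged layers such as gate_up_proj	2	Handles the two sub-LoRAs separately
`QKVParallelLinearWithLoRA`	qkv_proj (single LoRA)	1	Shards Q/K/V separately by head count
`MergedQKVParallelLinearWithLoRA`	qkv_proj (three independent LoRAs)	3	Independent slice and sharding for each of Q/K/V
`MergedColumnParallelLinearVariableSliceWithLoRA`	Merged layers with 3+ slices	3+	Dynamic slice count
`*WithShardedLoRA` variants	S-LoRA fully-sharded mode	Same as the base class	Additionally shards lora_a

Fully-sharded variants (the S-LoRA strategy): following the paper S-LoRA: Serving Thousands of Concurrent LoRA Adapters, these variants additionally shard lora_a along the rank dimension and turn apply() into a two-stage computation, a local shrink followed by all_gather + add_expand:

python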

# /workspace/vllm/lora/layers/column_parallel_linear.py L24-L80 (辅助函数 _mcp_apply)
def _mcp_apply(x, bias, layer):
    output = layer.base_layer.quant_method.apply(layer.base_layer, x, bias)
    x = x.view(-1, x.shape[-1])
    output, out_orig_shape = output.view(-1, output.shape[-1]), output.shape

    # Stage 1: Shrink (local GEMM with sharded lora_a)
    buffers = torch.empty_strided(buffer_shape, ...)
    buffers.zero_()
    shrunk_buffers = layer.punica_wrapper.add_shrink(buffers, x, layer.lora_a_stacked, 1.0)

    # All-gather intermediate results
    buffers = tensor_model_parallel_all_gather(buffers)

    # Stage 2: Expand (GEMM with lora_b, add to output)
    lora_output = layer.punica_wrapper.add_expand(
        output, buffers, layer.lora_b_stacked, layer.output_slices,
        offset_start=0, add_input=True,
    )
    ...
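
The shrink, all-gather, expand flow can be simulated on the CPU to check that it reproduces the unsharded result. In this sketch the tensor-parallel ranks are loop iterations and the all-gather is a list concatenation; every name and number is a made-up illustration, not vLLM code.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


tp_size = 2
A = [[1.0, 0.0],
     [0.0, 1.0]]                 # lora_a, shape (rank=2, k=2)
B = [[1.0, 1.0],
     [1.0, -1.0],
     [2.0, 0.0],
     [0.0, 2.0]]                 # lora_b, shape (d=4, rank=2)
x = [3.0, 5.0]

# Stage 1 (shrink): each rank owns rank / tp_size rows of lora_a.
rank_shard = len(A) // tp_size
local_shrinks = [
    matvec(A[r * rank_shard:(r + 1) * rank_shard], x) for r in range(tp_size)
]

# All-gather: every rank receives the concatenated shrink results.
gathered = [value for part in local_shrinks for value in part]

# Stage 2 (expand): each rank owns d / tp_size rows of lora_b.
out_shard = len(B) // tp_size
local_outputs = [
    matvec(B[r * out_shard:(r + 1) * out_shard], gathered) for r in range(tp_size)
]

sharded = [value for part in local_outputs for value in part]
reference = matvec(B, matvec(A, x))
print(sharded, reference)
```

Each rank does only 1 / tp_size of the shrink work, and the data exchanged in the all-gather has just rank elements per token, which is small compared with the hidden size.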

6.5 RowParallelLinearWithLoRA

Source: row_parallel_linear.py (file:///workspace/vllm/lora/layers/row_parallel_linear.py)

The LoRA wrapper for RowParallelLinear. Its defining trait is that lora_a is sharded along the input dimension while lora_b stays unsharded (the RowParallel output has to go through all_reduce anyway).

python

# /workspace/vllm/lora/layers/row_parallel_linear.py L32-40
def slice_lora_a(self, lora_a: torch.Tensor) -> torch.Tensor:
    shard_size = self.input_size
    start_idx = self.tp_rank * shard_size
    end_idx = (self.tp_rank + 1) * shard_size
    lora_a = lora_a[:, start_idx:end_idx]
    return lora_a

The sharded variant (RowParallelLinearWithShardedLoRA) additionally shards lora_b and uses a shrink + all_reduce + expand pattern similar to the column-parallel sharded case.
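
The slicing rules of sections 6.4 and 6.5 can be verified with a toy example: column-parallel slicing of lora_b yields output slices that concatenate to the full result, and row-parallel slicing of lora_a yields partial sums that add up to the full result. The helper names and values below are hypothetical, and the sum over ranks stands in for all_reduce.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def slice_lora_a_row_parallel(lora_a, tp_rank, shard_size):
    """Keep the input columns owned by this rank; lora_a has shape (rank, k)."""
    start = tp_rank * shard_size
    return [row[start:start + shard_size] for row in lora_a]


def slice_lora_b_column_parallel(lora_b, tp_rank, shard_size):
    """Keep the output rows owned by this rank; lora_b has shape (d, rank)."""
    start = tp_rank * shard_size
    return lora_b[start:start + shard_size]


tp_size = 2

# Row parallel: every rank sees only its slice of the input features.
A = [[1.0, 2.0, 3.0, 4.0]]
x = [1.0, 1.0, 1.0, 1.0]
in_shard = len(x) // tp_size
partials = [
    matvec(slice_lora_a_row_parallel(A, r, in_shard),
           x[r * in_shard:(r + 1) * in_shard])
    for r in range(tp_size)
]
reduced = [sum(values) for values in zip(*partials)]   # stands in for all_reduce

# Column parallel: every rank produces its slice of the output features.
B = [[1.0], [2.0], [3.0], [4.0]]
v = [2.0]
out_shard = len(B) // tp_size
gathered = []
for r in range(tp_size):
    gathered.extend(matvec(slice_lora_b_column_parallel(B, r, out_shard), v))

print(reduced, gathered)
```

Because the expand step is linear, the sum over ranks does not have to happen right after the shrink; it can be deferred to the all_reduce that the row-parallel layer already performs on its output.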

6.6 FusedMoEWithLoRA

Source: fused_moe.py (file:///workspace/vllm/lora/layers/fused_moe.py)

The most complex LoRA layer: it adds LoRA support to MoE (Mixture of Experts) layers and manages LoRA weights for three weight groups, w1 (gate_proj), w2 (down_proj), and w3 (up_proj).

6.6.1 Weight Buffer Layout

python

# /workspace/vllm/lora/layers/fused_moe.py L84-144
def _create_lora_a_weights(self, max_loras, lora_config):
    # w13 (w1/w3) lora_a: (max_loras, num_experts, rank, hidden_size) × _w13_slices
    self.w13_lora_a_stacked: tuple[torch.Tensor, ...] = tuple(
        torch.zeros((max_loras, self.base_layer.local_num_experts,
                     lora_config.max_lora_rank if not self.fully_sharded
                     else divide(lora_config.max_lora_rank, self.tp_size),
                     self.base_layer.hidden_size),
                    dtype=lora_config.lora_dtype, device=self.device)
        for _ in range(self._w13_slices)  # gated MoE: 2 slices; non-gated: 1 slice
    )

    # w2 lora_a: (max_loras, num_experts, rank, intermediate_size_per_partition)
    self.w2_lora_a_stacked: tuple[torch.Tensor, ...] = (...)

def _create_lora_b_weights(self, max_loras, lora_config):
    # w13 lora_b: (max_loras, num_experts, intermediate_size_per_partition, rank)
    # w2 lora_b: (max_loras, num_experts, hidden_size[, //tp_size], rank)

Notable design points:

An adapter_enabled tensor tracks which slots hold a valid adapter
_w13_slices: 2 for gated MoE such as Mixtral (gate_proj + up_proj), 1 for non-gated MoE
Compatibility check for EP (Expert Parallelism)

6.6.2 TP Sharding Logic in set_lora

python

# /workspace/vllm/lora/layers/fused_moe.py L282-344
def set_lora(self, index, lora_a, lora_b):
    assert isinstance(lora_a, list)  # [w1_lora_a, w2_lora_a, w3_lora_a]
    assert isinstance(lora_b, list)  # [w1_lora_b, w2_lora_b, w3_lora_b]

    self.reset_lora(index)
    self.adapter_enabled[index] = 1

    w1_lora_a, w2_lora_a, w3_lora_a = lora_a
    w1_lora_b, w2_lora_b, w3_lora_b = lora_b

    # shard w1/w3 along the rank dimension (if fully_sharded)
    sliced_w1_lora_a = self._slice_w13_a(w1_lora_a)
    sliced_w1_lora_b = self._slice_w13_b(w1_lora_b)

    # shard w2 accordingly
    sliced_w2_lora_a = self._slice_w2_a(w2_lora_a)
    sliced_w2_lora_b = self._slice_w2_b(w2_lora_b)

    # Copy to stacked buffers
    self.w13_lora_a_stacked[0][index, :, :, :].copy_(sliced_w1_lora_a, non_blocking=True)
    ...

6.6.3 FusedMoE3DWithLoRA: 3D MoE Variant

Targets 3D MoE architectures such as DeepSeek; the lora_b of w1 and w3 are concatenated along the intermediate dimension:

python

# /workspace/vllm/lora/layers/fused_moe.py L388-401
def _create_lora_b_weights(self, max_loras, lora_config):
    # w13 lora_b: intermediate_size * 2 (w1+w3 concatenated)
    self.w13_lora_b_stacked: tuple[torch.Tensor] = tuple(
        torch.zeros((max_loras, self.base_layer.local_num_experts,
                     self.base_layer.intermediate_size_per_partition * 2,
                     lora_config.max_lora_rank), ...)
    )

6.7 LogitsProcessorWithLoRA

Source: logits_processor.py (file:///workspace/vllm/lora/layers/logits_processor.py)

Adds LoRA support to LogitsProcessor (the LM head), which lets a LoRA adapter modify the distribution over the vocabulary.

What is special here:

vocab_size is capped at 258048
Handles the sharded-to-full vocabulary mapping (vocabulary reordering under TP)
Uses the dedicated add_lora_logits Punica interface

python

# /workspace/vllm/lora/layers/logits_processor.py L141-190
def _get_logits(self, hidden_states, lm_head, embedding_bias=None):
    # actual_lm_head is derived from lm_head in lines elided from this excerpt
    logits = actual_lm_head.quant_method.apply(actual_lm_head, hidden_states)
    if embedding_bias is not None:
        logits += embedding_bias

    logits = self.base_layer._gather_logits(logits)

    # Reindex for sharded vocab
    if self.sharded_to_full_mapping_gpu is not None:
        logits = logits[:, self.sharded_to_full_mapping_gpu]

    # Apply LoRA to logits
    lora_output = self.punica_wrapper.add_lora_logits(
        logits, hidden_states, self.lora_a_stacked, self.lora_b_stacked, 1.0
    )
    ...
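
The reindexing step `logits[:, mapping]` can be illustrated without tensors. The sketch below is a plain-Python stand-in, not vLLM code; the helper name `reorder_logits` and the two-rank layout (each rank contributing a base-vocabulary shard followed by an added-token shard) are assumptions made only for illustration.

```python
def reorder_logits(logits_row, sharded_to_full_mapping):
    """Return a row whose column j is logits_row[sharded_to_full_mapping[j]].

    This mirrors the column gather `logits[:, mapping]` for a single row.
    """
    return [logits_row[src] for src in sharded_to_full_mapping]


# Assumed layout after gathering from 2 TP ranks:
# rank 0 -> [b0, b1, a0], rank 1 -> [b2, b3, a1]
# where b* are base-vocabulary logits and a* are added-token logits.
gathered = ["b0", "b1", "a0", "b2", "b3", "a1"]

# Full-vocabulary order: all base tokens first, then all added tokens.
mapping = [0, 1, 3, 4, 2, 5]

reordered = reorder_logits(gathered, mapping)
print(reordered)
```

The mapping is applied before the LoRA contribution is added, so the LoRA output and the base logits line up on the same vocabulary order.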

6.8 ReplicatedLinearWithLoRA

Source: $replicated_linear.py$ (file:///workspace/vllm/lora/layers/replicated_linear.py)

Used for ReplicatedLinear layers (for example GateLinear). No TP sharding is needed because the weights themselves are replicated.

6.9 VocabParallelEmbeddingWithLoRA

Source: $vocal_parallel_embedding.py$ (file:///workspace/vllm/lora/layers/vocal_parallel_embedding.py)

Adds LoRA to the embedding layer and supports extending the embeddings for newly added tokens.

Key implementation:

# /workspace/vllm/lora/layers/vocal_parallel_embedding.py L96-126
def forward(self, x: torch.Tensor) -> torch.Tensor:
    num_tokens = x.shape[0]
    indices_1 = self.punica_wrapper._embeddings_indices[1][:num_tokens]

    # Lookup LoRA A embeddings (like an embedding table lookup)
    full_lora_a_embeddings = F.embedding(x + indices_1, self.lora_a_stacked_2d)

    full_output = self.base_layer.forward(x)
    full_output_org = full_output  # kept to restore the original shape (reshape logic elided)

    # Apply LoRA: output += lora_A_embedding @ lora_B
    lora_output = self.punica_wrapper.add_lora_embedding(
        full_output, full_lora_a_embeddings, self.lora_b_stacked, add_input=True
    )
    return full_output.view_as(full_output_org)

Note: in the embedding case, lora_a acts like an additional embedding lookup table rather than as an operand of a conventional matrix multiplication.
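
A small plain-Python sketch of why a lookup is enough: multiplying a one-hot token vector by lora_a selects exactly one row of it. The sketch assumes that `indices_1` acts as a per-token row offset into a table that stacks all adapter slots; the shapes, values, and names (`lora_a_slots`, `slot_of_token`) are illustrative and not taken from vLLM.

```python
def matmul(a, b):
    """Plain-Python matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


vocab_size, rank = 4, 2

# Two adapter slots; each lora_a is stored as (vocab_size, rank) so that a
# token id selects one row, just like an embedding table.
lora_a_slots = [
    [[1, 0], [0, 1], [1, 1], [2, 0]],  # slot 0
    [[0, 2], [3, 0], [0, 0], [1, 1]],  # slot 1
]

# Flatten the slots into one 2D table of shape (num_slots * vocab_size, rank).
lora_a_2d = [row for slot in lora_a_slots for row in slot]

token_ids = [2, 1]
slot_of_token = [0, 1]

# Per-token row offset into the flattened table (the assumed role of indices_1).
offsets = [slot * vocab_size for slot in slot_of_token]
looked_up = [lora_a_2d[t + o] for t, o in zip(token_ids, offsets)]

# The same result through an explicit one-hot matmul against the slot's lora_a.
one_hot = [[1 if j == t else 0 for j in range(vocab_size)] for t in token_ids]
via_matmul = [
    matmul([one_hot[i]], lora_a_slots[slot_of_token[i]])[0]
    for i in range(len(token_ids))
]

print(looked_up, via_matmul)
```

The lookup avoids materializing the one-hot vectors, which would be as wide as the vocabulary.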

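7. Punica Wrapper: GPU LoRA compute library integration

Source directory: $punica_wrapper/$ (file:///workspace/vllm/lora/punica_wrapper/)

Punica Wrapper is the compute-engine abstraction layer of vLLM's LoRA system. It wraps the underlying Triton/CUDA/XPU kernel calls.

7.1 Architecture layers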

[Figure: Punica wrapper class hierarchy. PunicaWrapperABC (abstract base class) → PunicaWrapperBase (metadata management + shared logic) → three backends: PunicaWrapperGPU (Triton kernel implementation), PunicaWrapperCPU (PyTorch op implementation), PunicaWrapperXPU (Intel XPU IPEX kernels). Edge labels to kernels: lora_shrink → Triton Shrink Kernel; lora_expand → Triton Expand Kernel; fused_moe_lora → Triton MoE LoRA Kernel; bgmv/sgmv → PyTorch Native Ops; bgmv → IPEX Custom Ops.]

7.2 PunicaWrapperBase: the metadata hub

Source: $punica_base.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_base.py)

Maintains all the state and mapping data needed for multi-LoRA inference.

7.2.1 Core state tensors

# /workspace/vllm/lora/punica_wrapper/punica_base.py L131-166
class PunicaWrapperBase(PunicaWrapperABC):
    def __init__(self, max_num_batched_tokens, max_batches, device, **kwargs):
        # Token-level LoRA index mapping (used for decode)
        self._token_lora_indices = torch.empty(
            max_num_batched_tokens, dtype=torch.long, device=device)
        # Sampler indices (used by the logits processor)
        self._sampler_indices = torch.empty(...)
        self._sampler_indices_padded = torch.empty(...)
        # Embedding indices (used by the embedding layer)
        self._embeddings_indices = torch.empty(2, max_num_batched_tokens, ...)

        # Metadata used only during prefill
        self._seq_start_locs = torch.empty(max_batches, ...)   # sequence start positions
        self._seq_lengths = torch.empty(max_batches, ...)       # sequence lengths
        self._lora_indices_per_batch = torch.empty(...)          # LoRA ID of each batch entry

7.2.2 Metadata update flow

# /workspace/vllm/lora/punica_wrapper/punica_base.py L284-299
def update_metadata(self, mapping, lora_index_to_id, max_loras, vocab_size, **kwargs):
    # 1. Update the base mappings (token -> LoRA index)
    self._update_base_metadata(mapping, lora_index_to_id, max_loras, vocab_size)

    # 2. During prefill, additionally compute batch-level metadata
    if mapping.is_prefill:
        self._update_prefill_metadata(self.token_lora_indices)
        self.is_prefill = True
    else:
        self.is_prefill = False

7.2.3 Core kernel interface semantics

Method	Semantics	Purpose
`add_shrink(y, x, lora_a, scale)`	`y[i] += (x @ lora_a[i]) * scale`	GEMM with LoRA A (down-projection)
`add_expand(y, x, lora_b, slices, offset)`	`y[:, offset:offset+slice] += x[i] @ lora_b[i]`	GEMM with LoRA B (up-projection)
`add_lora_linear(y, x, lora_a, lora_b, scale, slices)`	shrink followed by expand	Full LoRA linear layer
`add_lora_embedding(y, x, lora_b)`	`y += x @ lora_b`	Embedding LoRA
`add_lora_logits(y, x, lora_a, lora_b, scale)`	`buffer=(x@lora_a)*scale; y+=buffer@lora_b`	Logits LoRA
`add_lora_fused_moe(...)`	fused MoE LoRA forward	MoE LoRA
`add_lora_w13(...)`	w1/w3 LoRA + routing alignment	MoE gate/up proj
`add_lora_w2(...)`	w2 LoRA (reuses w13 routing)	MoE down proj
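
The shrink and expand rows of the table can be read as the following per-token reference loops. This is a plain-Python sketch of the semantics only, not the Triton kernels: the extra `token_lora_indices` argument, the `(hidden, rank)` / `(rank, out)` weight layouts, and the convention that `-1` means "no LoRA" are simplifying assumptions chosen for readability.

```python
def add_shrink(y, x, lora_a, token_lora_indices, scale):
    """y[t] += (x[t] @ lora_a[slot]) * scale for every token t that has a LoRA."""
    for t, slot in enumerate(token_lora_indices):
        if slot < 0:  # -1: this token uses no adapter
            continue
        a = lora_a[slot]  # assumed layout: (hidden, rank)
        for r in range(len(y[t])):
            y[t][r] += scale * sum(x[t][h] * a[h][r] for h in range(len(x[t])))


def add_expand(y, x, lora_b, token_lora_indices, offset=0):
    """y[t][offset:offset+out] += x[t] @ lora_b[slot]."""
    for t, slot in enumerate(token_lora_indices):
        if slot < 0:
            continue
        b = lora_b[slot]  # assumed layout: (rank, out)
        for o in range(len(b[0])):
            y[t][offset + o] += sum(x[t][r] * b[r][o] for r in range(len(x[t])))


# Two tokens, hidden=2, rank=1, out=2; the second token has no adapter.
x = [[1.0, 2.0], [3.0, 4.0]]
lora_a = [[[1.0], [1.0]]]   # one slot, (hidden=2, rank=1)
lora_b = [[[2.0, 0.5]]]     # one slot, (rank=1, out=2)
token_lora_indices = [0, -1]

buffer = [[0.0], [0.0]]
add_shrink(buffer, x, lora_a, token_lora_indices, scale=1.0)

y = [[10.0, 10.0], [20.0, 20.0]]  # stands in for the base-layer output
add_expand(y, buffer, lora_b, token_lora_indices)
print(buffer, y)
```

Chaining the two calls through `buffer` is the composition that the `add_lora_linear` row describes.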

7.3 PunicaWrapperGPU: Triton implementation

Source: $punica_gpu.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_gpu.py)

Implemented with custom kernels written in Triton. This is the main backend in production environments.

7.3.1 Initialization and CUDA Graph specialization

# /workspace/vllm/lora/punica_wrapper/punica_gpu.py L40-73
def __init__(self, max_num_batched_tokens, max_batches, device, **kwargs):
    PunicaWrapperBase.__init__(self, ...)
    self.lora_config = kwargs["lora_config"]
    self.max_loras = self.lora_config.max_loras

    # Compute the captured LoRA counts used for CUDA Graph specialization
    captured_lora_counts = get_captured_lora_counts(
        self.max_loras, self.lora_config.specialize_active_lora)

    # Token-level mapping metadata (for the decode-stage sgmv)
    self.token_mapping_meta = LoRAKernelMeta.make(
        self.max_loras, max_num_batched_tokens, device=device,
        captured_lora_counts=captured_lora_counts)

    # Prompt-level mapping metadata (for the prefill/logits sgmv)
    self.prompt_mapping_meta = LoRAKernelMeta.make(
        self.max_loras, max_num_batched_tokens, device=device,
        captured_lora_counts=captured_lora_counts)

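CUDA Graph specialization (specialize_active_lora): when enabled, a separate CUDA graph is captured for each count of active LoRAs (powers of 2, up to max_loras), which optimizes performance across different load patterns.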
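
To make "powers of 2, up to max_loras" concrete, here is a minimal sketch. The exact output of `get_captured_lora_counts` is not shown in this article, so both helpers below (`power_of_two_lora_counts`, `pick_captured_count`) are hypothetical, and including `max_loras` itself when it is not a power of 2 is an assumption.

```python
def power_of_two_lora_counts(max_loras: int) -> list[int]:
    """Hypothetical helper: 1, 2, 4, ... below max_loras, plus max_loras itself."""
    counts = []
    n = 1
    while n < max_loras:
        counts.append(n)
        n *= 2
    counts.append(max_loras)  # assumption: the maximum is always captured
    return counts


def pick_captured_count(active_loras: int, counts: list[int]) -> int:
    """Smallest captured count that can accommodate the active adapters."""
    return next(c for c in counts if c >= active_loras)


counts = power_of_two_lora_counts(6)
print(counts, pick_captured_count(3, counts), pick_captured_count(5, counts))
```

Capturing only a logarithmic number of graphs bounds the capture cost while still letting a lightly loaded batch run a graph sized for few adapters.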

7.3.2 add_lora_linear implementation details

# /workspace/vllm/lora/punica_wrapper/punica_gpu.py L203-264
def add_lora_linear(self, y, x, lora_a_stacked, lora_b_stacked, scale, output_slices, *, buffer=None, **kwargs):
    assert len(lora_a_stacked) == len(lora_b_stacked) == len(output_slices)
    assert buffer is None, "buffer should be created internally"

    r = lora_b_stacked[0].size(-1)
    # Allocate a float32 buffer (zeroed inside the Triton kernel)
    buffer = torch.empty(
        (len(output_slices), x.size(0), r),
        dtype=torch.float32, device=x.device)

    add_inputs = kwargs.pop("add_inputs", True)
    # Stage 1: Shrink - x @ lora_a -> buffer
    self.add_shrink(buffer, x, lora_a_stacked, scale, **kwargs)
    # Stage 2: Expand - buffer @ lora_b -> y
    self.add_expand(y, buffer, lora_b_stacked, output_slices, add_inputs=add_inputs, **kwargs)
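
For packed layers, `output_slices` determines where each expand result lands in `y` and how many planes the intermediate buffer has. The sketch below only computes those shapes and offsets; the helper names and the slice widths are illustrative assumptions, not vLLM code.

```python
from itertools import accumulate


def slice_offsets(output_slices):
    """Start column of each slice inside the packed output tensor."""
    return [0, *accumulate(output_slices)][:-1]


def shrink_buffer_shape(output_slices, num_tokens, rank):
    """One (num_tokens, rank) plane per slice, as in the buffer allocated above."""
    return (len(output_slices), num_tokens, rank)


# Illustrative widths for a packed q/k/v projection.
output_slices = (4096, 1024, 1024)
offsets = slice_offsets(output_slices)
shape = shrink_buffer_shape(output_slices, num_tokens=8, rank=16)
print(offsets, shape)
```

Each slice i then contributes `y[:, offsets[i]:offsets[i] + output_slices[i]] += buffer[i] @ lora_b[i]`, which matches the add_expand semantics listed in section 7.2.3.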

7.3.3 MoE LoRA computation

MoE LoRA is the most complex compute path. It involves aligning token-to-expert routing and running fused GEMMs:

# /workspace/vllm/lora/punica_wrapper/punica_gpu.py L489-620
def add_lora_w13(self, y, x, lora_a_stacked, lora_b_stacked,
                 topk_ids, topk_weights, expert_map, w1, w2,
                 num_tokens, top_k_num, max_loras, adapter_enabled,
                 local_num_experts, top_k, num_slices, fully_sharded,
                 use_tuned_config, token_lora_mapping=None):

    # 1. Pick the best kernel config (tuned or heuristic)
    if use_tuned_config:
        shrink_config = get_lora_op_configs(op_type="fused_moe_lora_w13_shrink", ...)
        expand_config = get_lora_op_configs(op_type="fused_moe_lora_w13_expand", ...)
    else:
        shrink_config = try_get_optimal_moe_lora_config(op_type="fused_moe_lora_w13_shrink", ...)
        ...

    # 2. Align to the block size (organize tokens and experts into block-sized chunks)
    SPARSITY_FACTOR = 8
    naive_block_assignment = (
        expert_map is None
        and num_tokens * top_k * SPARSITY_FACTOR <= local_num_experts * max_loras
    )
    (token_lora_mapping, sorted_token_ids_lora,
     expert_ids_lora, num_tokens_post_padded_lora) = self.moe_lora_align_block_size(...)

    # 3. Run the fused MoE LoRA forward pass
    self.add_lora_fused_moe(y.view(-1, top_k_num, y.shape[-1]), x,
                            lora_a_stacked, lora_b_stacked, topk_weights,
                            _sorted, _eids, num_tokens_post_padded_lora, ...)

    # 4. Return the routing tensors so that w2 can reuse them
    return (sorted_token_ids_lora, expert_ids_lora, num_tokens_post_padded_lora, token_lora_mapping)
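
The `naive_block_assignment` condition in step 2 is easy to evaluate by hand. The function below restates it as a standalone predicate; the wrapper name `use_naive_block_assignment` and the sample sizes are assumptions for illustration, while the inequality itself is copied from the excerpt.

```python
SPARSITY_FACTOR = 8


def use_naive_block_assignment(expert_map, num_tokens, top_k,
                               local_num_experts, max_loras):
    """True when routed (token, expert) pairs are sparse relative to the
    number of (expert, LoRA) combinations and no expert map is in use."""
    return (
        expert_map is None
        and num_tokens * top_k * SPARSITY_FACTOR <= local_num_experts * max_loras
    )


# Small decode batch: 4 * 2 * 8 = 64 <= 64 * 4 = 256 -> naive path.
small = use_naive_block_assignment(None, 4, 2, 64, 4)
# Large prefill batch: 512 * 2 * 8 = 8192 > 256 -> regular alignment path.
large = use_naive_block_assignment(None, 512, 2, 64, 4)
# Any expert map disables the naive path regardless of sizes.
mapped = use_naive_block_assignment([0, 1, 2], 4, 2, 64, 4)
print(small, large, mapped)
```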

7.4 PunicaWrapperCPU: PyTorch op implementation

Source: $punica_cpu.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_cpu.py)

Implemented with native PyTorch operations, for environments without a GPU or Triton. It distinguishes two modes, prefill and decode:

# /workspace/vllm/lora/punica_wrapper/punica_cpu.py L38-163
class PunicaWrapperCPU(PunicaWrapperBase):
    # Decode: uses bgmv (batched gather matrix-vector multiply), per-token granularity
    def _shrink_decode(self, y, x, w_t_all, scale):
        bgmv_shrink(x, w_t_all, y, self.token_lora_indices, scale)

    # Prefill: uses sgmv (segmented gather matrix-vector multiply), per-sequence granularity
    def _shrink_prefill(self, y, x, w_t_all, scale):
        if self.no_lora:
            return
        sgmv_shrink(x, w_t_all, y, *self.prefill_metadata, scale)

Prefill vs. decode:

Decode: each token may use a different LoRA, so tokens are indexed one by one through token_lora_indices.
Prefill: all tokens of a sequence share the same LoRA, so they are processed in bulk using prefill_metadata (seq_start_locs, seq_lengths).
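
The relationship between the two representations can be sketched as a run-length compression of the per-token indices. What `_update_prefill_metadata` actually computes is not shown in this article, so the function below is an assumption-based illustration using only the standard library.

```python
from itertools import groupby


def prefill_metadata(token_lora_indices):
    """Compress per-token LoRA indices into per-segment metadata.

    Returns (seq_start_locs, seq_lengths, lora_indices_per_batch), one entry
    per run of consecutive tokens that share the same LoRA index.
    """
    seq_start_locs, seq_lengths, lora_indices_per_batch = [], [], []
    start = 0
    for lora_index, run in groupby(token_lora_indices):
        length = len(list(run))
        seq_start_locs.append(start)
        seq_lengths.append(length)
        lora_indices_per_batch.append(lora_index)
        start += length
    return seq_start_locs, seq_lengths, lora_indices_per_batch


# Four runs: LoRA 0, LoRA 1, no LoRA (-1), then LoRA 0 again.
token_lora_indices = [0, 0, 0, 1, 1, -1, -1, -1, -1, 0]
starts, lengths, loras = prefill_metadata(token_lora_indices)
print(starts, lengths, loras)
```

Note that in this sketch two adjacent sequences using the same adapter collapse into one segment, which is harmless because they would be multiplied by the same weights anyway.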

7.5 PunicaWrapperXPU: Intel XPU implementation

Source: $punica_xpu.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_xpu.py)

An implementation for the Intel XPU (GPU) platform. It uses IPEX custom kernels (bgmv_shrink, bgmv_expand, bgmv_expand_slice) and marks dynamic shapes to support torch.compile:

# /workspace/vllm/lora/punica_wrapper/punica_xpu.py L46-48
def __init__(self, ...):
    ...
    torch._dynamo.mark_dynamic(self._token_lora_indices, 0)
    torch._dynamo.mark_dynamic(self._embeddings_indices, 1)
    torch._dynamo.mark_dynamic(self._sampler_indices_padded, 0)

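8. LoRA Resolver: adapter resolution

Source: $resolver.py$ (file:///workspace/vllm/lora/resolver.py)

Defines the abstract interface for discovering and resolving LoRA adapters, so that adapters can be fetched from different sources (local file system, HuggingFace Hub, cloud storage, and so on).

8.1 Abstract interface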

# /workspace/vllm/lora/resolver.py L14-41
class LoRAResolver(ABC):
    @abstractmethod
    async def resolve_lora(
        self, base_model_name: str, lora_name: str
    ) -> LoRARequest | None:
        """Abstract method to resolve and fetch a LoRA model adapter.

        Args:
            base_model_name: The name/identifier of the base model.
            lora_name: The name/identifier of the LoRA model to resolve.

        Returns:
            Optional[LoRARequest]: The resolved LoRA request, or None if not found.
        """
        pass

8.2 Registry

# /workspace/vllm/lora/resolver.py L43-88
@dataclass
class _LoRAResolverRegistry:
    resolvers: dict[str, LoRAResolver] = field(default_factory=dict)

    def register_resolver(self, resolver_name: str, resolver: LoRAResolver) -> None: ...
    def get_resolver(self, resolver_name: str) -> LoRAResolver: ...
    def get_supported_resolvers(self) -> Set[str]: ...

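LoRAResolverRegistry = _LoRAResolverRegistry()  # global singleton

Registry design: a plugin-style architecture lets third parties register custom resolvers (for example an S3 resolver or a database resolver), which are then looked up by name.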
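
The plugin pattern can be tried out without importing vLLM. In the sketch below, `FakeLoRARequest`, `Resolver`, `DictResolver`, and `Registry` are stand-ins written for this example; they mirror the method names shown above but are not the real classes, and the model and adapter names are made up.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeLoRARequest:  # stand-in for vLLM's LoRARequest
    lora_name: str
    lora_int_id: int
    lora_path: str


class Resolver(ABC):  # stand-in mirroring the LoRAResolver interface
    @abstractmethod
    async def resolve_lora(
        self, base_model_name: str, lora_name: str
    ) -> Optional[FakeLoRARequest]: ...


class DictResolver(Resolver):
    """Hypothetical resolver backed by an in-memory (model, adapter) -> path table."""

    def __init__(self, table):
        self._table = table
        self._next_id = 1

    async def resolve_lora(self, base_model_name, lora_name):
        path = self._table.get((base_model_name, lora_name))
        if path is None:
            return None  # unknown adapter: let the caller try another resolver
        request = FakeLoRARequest(lora_name, self._next_id, path)
        self._next_id += 1
        return request


@dataclass
class Registry:  # stand-in for the registry shown above
    resolvers: dict = field(default_factory=dict)

    def register_resolver(self, name, resolver):
        self.resolvers[name] = resolver

    def get_resolver(self, name):
        return self.resolvers[name]

    def get_supported_resolvers(self):
        return set(self.resolvers)


registry = Registry()
registry.register_resolver(
    "dict", DictResolver({("base-model", "sql-adapter"): "/adapters/sql"})
)
resolver = registry.get_resolver("dict")
found = asyncio.run(resolver.resolve_lora("base-model", "sql-adapter"))
missing = asyncio.run(resolver.resolve_lora("base-model", "unknown"))
print(found, missing)
```

Returning `None` for an unknown adapter, rather than raising, matches the `LoRARequest | None` return type of the abstract interface.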

9. WorkerManager: worker-level management

Source: $worker_manager.py$ (file:///workspace/vllm/lora/worker_manager.py)

WorkerLoRAManager manages the LoRA lifecycle inside a worker process: it loads, unloads, activates, and caches adapters.

9.1 Architectural role

[Figure: architectural role of WorkerLoRAManager. Nodes: API Server (receives requests), LLMEngine (scheduling hub), WorkerLoRAManager (worker-side management), LoRAModelManager (model manager), GPU Stacked Weights (max_loras slots), Disk/HF Hub (adapter files), LoRAResolver (path resolution). Edge labels: set_active_adapters(), slot management, _load_adapter().]

9.2 WorkerLoRAManager: the basic manager

# /workspace/vllm/lora/worker_manager.py L25-223
class WorkerLoRAManager:
    """WorkerLoRAManager that manages LoRA models on the worker side.

    Every request, the requested LoRAs will be loaded (unless they are already
    loaded), and every other LoRA will be unloaded."""

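Core lifecycle methods:

Method	Function
`create_lora_manager(model, vllm_config)`	Initializes the LoRAModelManager and replaces layers in the model
`set_active_adapters(requests, mapping)`	Main entry point: sets the active adapters for the current requests
`add_adapter(adapter_request)`	Loads a single adapter onto the GPU
`remove_adapter(adapter_id)`	Unloads a single adapter
`list_adapters()`	Lists the IDs of the currently loaded adapters
`add_dummy_lora(lora_request, rank)`	Adds a dummy LoRA (used for warmup)
`pin_adapter(adapter_id)`	Pins an adapter so that it is not evicted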

9.2.1 Adapter loading flow

# /workspace/vllm/lora/worker_manager.py L99-157
def _load_adapter(self, lora_request: LoRARequest) -> LoRAModel:
    # 1. Determine the list of expected target modules
    supported_lora_modules = self._adapter_manager.supported_lora_modules
    packed_modules_mapping = self._adapter_manager.packed_modules_mapping
    expected_lora_lst = []
    for module in supported_lora_modules:
        if module in packed_modules_mapping:
            expected_lora_lst.extend(packed_modules_mapping[module])
        else:
            expected_lora_lst.append(module)
    expected_lora_modules = set(expected_lora_lst)  # line elided in the original excerpt

    # 2. Resolve the absolute path of the adapter
    lora_path = get_adapter_absolute_path(lora_request.lora_path)

    # 3. Load the PEFT config
    peft_helper = PEFTHelper.from_local_dir(lora_path, ...)

    # 4. Validate the LoRA config
    peft_helper.validate_legal(self.lora_config)

    # 5. Get the model-specific weight mapper and skip prefixes
    #    (`model` is obtained from the adapter manager in lines elided here)
    hf_to_vllm_mapper = getattr(model, "hf_to_vllm_mapper", None)
    lora_skip_prefixes = getattr(model, "lora_skip_prefixes", None)

    # 6. Load the LoRAModel from the checkpoint (onto the CPU)
    lora = self._lora_model_cls.from_local_checkpoint(
        lora_path, expected_lora_modules, peft_helper=peft_helper,
        lora_model_id=lora_request.lora_int_id, device="cpu", ...)
    return lora

9.2.2 Strategy for applying active adapters

# /workspace/vllm/lora/worker_manager.py L178-206
def _apply_adapters(self, adapter_requests: set[Any]) -> None:
    existing_adapters = self.list_adapters()
    models_map = {
        adapter_request.adapter_id: adapter_request
        for adapter_request in adapter_requests if adapter_request
    }
    # Capacity check
    if len(models_map) > self._adapter_manager.adapter_slots:
        raise RuntimeError(f"Requested models ({len(models_map)}) > slots ({self._adapter_manager.adapter_slots})")

    requested_ids = set(models_map)
    # Unload adapters that are no longer needed
    for adapter_id in existing_adapters - requested_ids:
        self.remove_adapter(adapter_id)
    # Load adapters that are newly needed
    for adapter_id in requested_ids - existing_adapters:
        self.add_adapter(models_map[adapter_id])

Strategy: exact matching. Only the adapters needed by the current batch are kept; all others are unloaded.
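
The exact-match behaviour reduces to two set differences. The standalone function below is a simplified sketch of that planning step; the name `plan_adapter_changes` and the integer adapter IDs are illustrative assumptions.

```python
def plan_adapter_changes(existing, requested, slots):
    """Return (to_remove, to_add) so that loaded adapters equal `requested`."""
    if len(requested) > slots:
        raise RuntimeError(
            f"Requested models ({len(requested)}) > slots ({slots})"
        )
    to_remove = existing - requested   # loaded but not needed by this batch
    to_add = requested - existing      # needed by this batch but not loaded
    return to_remove, to_add


to_remove, to_add = plan_adapter_changes(
    existing={1, 2, 3}, requested={2, 4}, slots=4
)
print(to_remove, to_add)
```

Adapter 2 is left alone because it is in both sets; this is the only reuse the basic manager offers across batches.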

9.3 LRUCacheWorkerLoRAManager: the LRU cache manager

Extends WorkerLoRAManager with an LRU eviction policy, which suits workloads that switch adapters frequently:

# /workspace/vllm/lora/worker_manager.py L226-302
class LRUCacheWorkerLoRAManager(WorkerLoRAManager):
    """Uses an LRU Cache. Least recently used LoRAs will be unloaded
    if the cache is above capacity."""

    def add_adapter(self, lora_request: LoRARequest) -> bool:
        if (lora_request.lora_int_id not in self.list_adapters()
                or lora_request.load_inplace):
            # Load the new adapter first (to make sure it is valid)
            lora = self._load_adapter(lora_request)

            # Remove the existing version (supports in-place updates)
            self._adapter_manager.remove_adapter(lora.id)

            # When over capacity, evict the least recently used adapter
            if len(self._adapter_manager) + 1 > self._adapter_manager.capacity:
                self._adapter_manager.remove_oldest_adapter()

            # Add the new adapter
            loaded = self._adapter_manager.add_adapter(lora)
        else:
            # Already cached: touch it to update its LRU position
            loaded = (self._adapter_manager.get_adapter(lora_request.lora_int_id) is not None)

        self._adapter_manager.activate_adapter(lora_request.lora_int_id)
        return loaded

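LRU strategy vs. basic strategy:

Feature	WorkerLoRAManager	LRUCacheWorkerLoRAManager
Unload policy	Exact match (everything not needed is unloaded)	LRU eviction (the oldest is evicted when over capacity)
Cache hit	Skipped	Touch updates the access time
In-place update	Not supported	Supported (`load_inplace`)
Best suited for	A small, fixed set of adapters	Rotating through many adapters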
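
The LRU column of the table can be sketched with `collections.OrderedDict`. This is a minimal illustration of touch-on-hit and evict-oldest-on-overflow, not the vLLM cache; the class `LRUAdapterCache` and its `evicted` log are hypothetical.

```python
from collections import OrderedDict


class LRUAdapterCache:
    """Keeps at most `capacity` adapters; evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first
        self.evicted = []            # eviction log, for demonstration only

    def add(self, adapter_id, adapter):
        if adapter_id in self._items:
            self._items.move_to_end(adapter_id)  # cache hit: touch
            return False
        if len(self._items) + 1 > self.capacity:
            oldest_id, _ = self._items.popitem(last=False)
            self.evicted.append(oldest_id)
        self._items[adapter_id] = adapter
        return True

    def ids(self):
        return list(self._items)


cache = LRUAdapterCache(capacity=2)
cache.add(1, "adapter-1")
cache.add(2, "adapter-2")
cache.add(1, "adapter-1")   # hit: adapter 1 becomes the most recently used
cache.add(3, "adapter-3")   # over capacity: adapter 2 is evicted
print(cache.ids(), cache.evicted)
```

Because the hit on adapter 1 moved it to the end, adapter 2 is the one evicted, even though it was loaded later.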

10. Dynamic loading and unloading

10.1 End-to-end request handling flow

[Sequence diagram. Participants: user request, LLMEngine, Scheduler, WorkerLoRAManager, LoRAModelManager, LoRA Layers, PunicaWrapper, Disk. Alternative branches: [new LoRA request], [LoRA no longer needed]. Messages in order: add_request(lora_request); schedule(); set_active_adapters({req1, req2}); _apply_adapters(); add_adapter(new_req); _load_adapter(new_req); from_local_checkpoint(); LoRAModel (CPU); add_adapter(lora_model); set_lora(slot_id, lora_a, lora_b); update_metadata(mapping); execute_model(); forward(hidden_states); add_lora_linear(output, x, ...); LoRA applied output; output_with_lora; set_active_adapters({remaining}); remove_adapter(old_id); reset_lora(old_slot).]

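10.2 Slot-based weight management

vLLM manages LoRA weights on the GPU with a pre-allocated slot strategy:

Initialization: create_lora_weights(max_loras) pre-allocates zero tensors of shape (max_loras, 1, rank, dim).
Loading: set_lora(slot_index, lora_a, lora_b) copies the weights into the corresponding slot with copy_.
Activation: LoRAMapping establishes the token → slot_index mapping.
Computation: the Punica kernel reads weights from the slot selected by token_lora_indices.
Unloading: reset_lora(slot_index) zeroes the corresponding slot.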
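
The slot lifecycle above can be mimicked with nested lists in place of GPU tensors. The class below is a simplified sketch: `SlotPool`, its first-free-slot policy, and the single-matrix-per-slot layout are assumptions for illustration, not the vLLM data structures.

```python
class SlotPool:
    """Pre-allocated weight slots that are overwritten in place."""

    def __init__(self, max_loras, rank, dim):
        # Initialization: allocate every slot once, filled with zeros.
        self.weights = [
            [[0.0] * dim for _ in range(rank)] for _ in range(max_loras)
        ]
        self.slot_to_id = [None] * max_loras

    def set_lora(self, adapter_id, lora_a):
        # Loading: copy into the first free slot (raises ValueError if full).
        slot = self.slot_to_id.index(None)
        for r, row in enumerate(lora_a):
            self.weights[slot][r][:] = row  # in-place copy, like copy_()
        self.slot_to_id[slot] = adapter_id
        return slot

    def reset_lora(self, slot):
        # Unloading: zero the slot; the storage itself is kept.
        for row in self.weights[slot]:
            row[:] = [0.0] * len(row)
        self.slot_to_id[slot] = None


pool = SlotPool(max_loras=2, rank=1, dim=2)
first = pool.set_lora(7, [[1.0, 2.0]])
second = pool.set_lora(9, [[3.0, 4.0]])
pool.reset_lora(first)
cleared = [row[:] for row in pool.weights[first]]  # snapshot after the reset
reused = pool.set_lora(11, [[5.0, 6.0]])
print(first, second, cleared, reused, pool.weights)
```

Since the storage never changes shape or address, swapping adapters does not allocate memory, which is also what makes the buffers usable from captured CUDA graphs.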

10.3 LoRAMapping: mapping tokens to adapters

LoRAMapping is the data structure that bridges request-level LoRA binding and kernel-level computation:

Decode: each token has its own lora_index (-1 means no LoRA); together they form the token_lora_indices vector.
Prefill: each sequence shares one lora_index, used together with seq_start_locs and seq_lengths.
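
Building the token_lora_indices vector amounts to translating adapter IDs into slot indices. The sketch below assumes that `lora_index_to_id` lists the adapter ID held by each slot (with `None` for an empty slot) and that an ID of 0 marks a request without a LoRA; both conventions, and the function name, are assumptions for this example.

```python
def token_slot_indices(token_lora_ids, lora_index_to_id):
    """Map each token's adapter ID to its slot index, or -1 if it has none."""
    id_to_index = {
        lora_id: index
        for index, lora_id in enumerate(lora_index_to_id)
        if lora_id is not None
    }
    return [id_to_index.get(lora_id, -1) for lora_id in token_lora_ids]


lora_index_to_id = [42, None, 7]   # slot 0 holds adapter 42, slot 2 holds adapter 7
token_lora_ids = [7, 7, 0, 42]     # 0: token from a request without a LoRA
indices = token_slot_indices(token_lora_ids, lora_index_to_id)
print(indices)
```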

10.4 Key performance optimizations

Optimization	Description	Code location
Batched GEMM	The A/B matrices of multiple LoRAs are stacked into 4D tensors and handled in a single kernel call	`lora_a/b_stacked`
Scaling Pre-fusion	`optimize()` folds alpha/r into lora_b, removing the multiplication at runtime	$lora_weights.py L36$ (file:///workspace/vllm/lora/lora_weights.py#L36-L42)
Dual CUDA Stream	The base linear layer and the LoRA path run in parallel on different streams	$base_linear.py L238$ (file:///workspace/vllm/lora/layers/base_linear.py#L238-L302)
CUDA Graph Specialization	Graphs are specialized by the number of active LoRAs (powers of 2)	`specialize_active_lora`
Float32 Buffer	Intermediate shrink results use float32; see Triton issue #1387	$punica_gpu.py L246$ (file:///workspace/vllm/lora/punica_wrapper/punica_gpu.py#L246-L248)
Dummy LoRA Reuse	During warmup the same dummy LoRA object is reused (cloned with different IDs)	$worker_manager.py L159$ (file:///workspace/vllm/lora/worker_manager.py#L159-L170)
Non-blocking Copy	`copy_(non_blocking=True)` copies weights asynchronously	The `set_lora` methods
S-LoRA Sharding	In fully-sharded mode, lora_a is additionally sharded to reduce communication volume	`*WithShardedLoRA` subclasses
MoE Routing Reuse	The w13 routing results (sorted_ids, expert_ids) are reused directly by w2	$punica_gpu.py L615$ (file:///workspace/vllm/lora/punica_wrapper/punica_gpu.py#L615-L620)
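
The scaling pre-fusion row relies on a simple identity: scaling the product by alpha/r equals scaling lora_b once ahead of time. The plain-Python check below uses made-up values and a hypothetical helper `lora_delta`; it illustrates the identity, not the body of `optimize()`.

```python
def lora_delta(x, lora_a, lora_b, scaling=1.0):
    """Compute scaling * (x @ lora_a @ lora_b) for a single input vector x."""
    hidden = [
        sum(x[h] * lora_a[h][r] for h in range(len(x)))
        for r in range(len(lora_a[0]))
    ]
    return [
        scaling * sum(hidden[r] * lora_b[r][o] for r in range(len(hidden)))
        for o in range(len(lora_b[0]))
    ]


alpha, r = 16, 8
scaling = alpha / r                    # 2.0

x = [1.0, 2.0]
lora_a = [[1.0], [0.5]]                # (hidden=2, rank=1)
lora_b = [[3.0, -1.0]]                 # (rank=1, out=2)

# Runtime scaling: multiply on every forward pass.
runtime = lora_delta(x, lora_a, lora_b, scaling)

# Pre-fusion: scale lora_b once at load time, then use scaling = 1.
fused_b = [[scaling * v for v in row] for row in lora_b]
fused = lora_delta(x, lora_a, fused_b)

print(runtime, fused)
```

The values are chosen so that the floating-point results are exactly equal; with arbitrary weights the two paths agree only up to rounding.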

Appendix: key file index

File	Core responsibility
$config/lora.py$ (file:///workspace/vllm/config/lora.py)	LoRAConfig definition and validation
$lora/lora_model.py$ (file:///workspace/vllm/lora/lora_model.py)	LoRAModel wrapper and checkpoint loading
$lora/lora_weights.py$ (file:///workspace/vllm/lora/lora_weights.py)	LoRALayerWeights / PackedLoRALayerWeights data structures
$lora/request.py$ (file:///workspace/vllm/lora/request.py)	LoRARequest, the request-binding data structure
$lora/layers/base.py$ (file:///workspace/vllm/lora/layers/base.py)	BaseLayerWithLoRA abstract base class
$lora/layers/base_linear.py$ (file:///workspace/vllm/lora/layers/base_linear.py)	Shared logic of BaseLinearLayerWithLoRA
$lora/layers/column_parallel_linear.py$ (file:///workspace/vllm/lora/layers/column_parallel_linear.py)	Column Parallel Linear + LoRA (including the QKV/Merged/Sharded variants)
$lora/layers/row_parallel_linear.py$ (file:///workspace/vllm/lora/layers/row_parallel_linear.py)	Row Parallel Linear + LoRA (including the Sharded variant)
$lora/layers/fused_moe.py$ (file:///workspace/vllm/lora/layers/fused_moe.py)	FusedMoE + LoRA (including the 3D MoE variant)
$lora/layers/logits_processor.py$ (file:///workspace/vllm/lora/layers/logits_processor.py)	LogitsProcessor + LoRA
$lora/layers/replicated_linear.py$ (file:///workspace/vllm/lora/layers/replicated_linear.py)	ReplicatedLinear + LoRA
$lora/layers/vocal_parallel_embedding.py$ (file:///workspace/vllm/lora/layers/vocal_parallel_embedding.py)	VocabParallelEmbedding + LoRA
$lora/punica_wrapper/punica_base.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_base.py)	PunicaWrapper abstract base class and metadata management
$lora/punica_wrapper/punica_gpu.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_gpu.py)	GPU/Triton kernel implementation
$lora/punica_wrapper/punica_cpu.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_cpu.py)	CPU PyTorch op implementation
$lora/punica_wrapper/punica_xpu.py$ (file:///workspace/vllm/lora/punica_wrapper/punica_xpu.py)	Intel XPU/IPEX implementation
$lora/resolver.py$ (file:///workspace/vllm/lora/resolver.py)	Abstract interface and registry for LoRA adapter resolution
$lora/worker_manager.py$ (file:///workspace/vllm/lora/worker_manager.py)	Worker-level LoRA lifecycle management