NVIDIA LLM Inference Framework: TensorRT-LLM Software Flow (Part 3) trtllm-serve Startup Flow - HTTP Request

trtllm-serve Startup Flow - HTTP Request

  • Parts 1 and 2 of the TensorRT-LLM software flow series covered the whole flow of loading the model and starting the server; once the server is up, it starts accepting HTTP requests
  • This post walks through how the server receives an HTTP request and the full flow of processing that request into a response

Register Routes

  • When OpenAIServer is initialized it calls the register_routes function, which registers each request type with the router; when a matching request arrives, the corresponding handler is invoked
  • Here curl acts as the client and sends a request to the server; the path is /v1/completions, so the openai_completion function handles it (a Python equivalent of this request is shown after the curl example below)
py
# tensorrt_llm\serve\openai_server.py
def register_routes(self):
    self.app.add_api_route("/health", self.health, methods=["GET"])
    self.app.add_api_route("/kv_cache_events", self.get_kv_cache_events, methods=["POST"])
    self.app.add_api_route("/v1/completions",
                           self.openai_completion,
                           methods=["POST"])
    self.app.add_api_route("/v1/chat/completions",
                           self.openai_chat if not self.use_harmony else self.chat_harmony,
                           methods=["POST"])
    self.app.add_api_route("/v1/responses",
                           self.openai_responses,
                           methods=["POST"])
    if self.llm.args.return_perf_metrics:
        # register /prometheus/metrics
        self.mount_metrics()

# curl HTTP request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "Qwen3-7B/",
      "prompt": "Please describe what is Qwen.",
      "max_tokens": 12,
      "temperature": 0
  }'
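
For reference, the same request can be issued from Python; this is just a client-side equivalent of the curl command above, assuming the `requests` package is installed and the server from the previous posts is listening on localhost:8000.
py
import requests

# Same payload as the curl example; the /v1/completions route dispatches to openai_completion.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "Qwen3-7B/",
        "prompt": "Please describe what is Qwen.",
        "max_tokens": 12,
        "temperature": 0,
    },
)
print(resp.json())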

openai_completion

  • The prompt_inputs function converts the input prompt into a TextPrompt instance, and the TextPrompt is then tokenized
  • The tokenization work is offloaded to a thread via asyncio.to_thread; since input_processor was assigned to DefaultInputProcessor earlier in the create_input_processor function, this directly invokes the __call__ method of the DefaultInputProcessor class
  • __call__ uses the Qwen-specific tokenizer configured earlier (the --tokenizer ./tmp/Qwen/7B/ argument passed to trtllm-serve) to split the prompt into tokens and then map each token to its numeric ID (the prompt_token_ids); a standalone sketch of this tokenization step appears after this list
  • The generate_async function instantiates a GenerationRequest and submits it to the task queue
py
# tensorrt_llm\serve\openai_server.py
prompt_token_ids, extra_processed_inputs = await asyncio.to_thread(self.llm.input_processor, prompt, sampling_params)

self.input_processor = create_input_processor(self._hf_model_dir,
                                              self.tokenizer)

# tensorrt_llm\inputs\registry.py
class DefaultInputProcessor(InputProcessor):
    """Preprocess the inputs to the model."""

    def __init__(self,
                 model_path,
                 model_config,
                 tokenizer,
                 trust_remote_code: bool = True) -> None:
        self.tokenizer = tokenizer
        self.model_config = model_config
        self.model_path = model_path
        self.multimodal_hashing_supported = None

    def __call__(
        self, inputs: TextPrompt, sampling_params: SamplingParams
    ) -> Tuple[List[int], Optional[ExtraProcessedInputs]]:
        """The default input processor handles only tokenization."""
        with nvtx_range_debug("tokenize prompt"):
            kwargs = {}  # placeholder: the full source builds extra tokenizer kwargs (e.g. truncation) from sampling_params
            try:
                token_ids = self.tokenizer.encode(
                    inputs["prompt"],
                    add_special_tokens=sampling_params.add_special_tokens,
                    **kwargs)
            except:
                # Tiktoken path
                token_ids = self.tokenizer.encode(
                    inputs["prompt"], allowed_special=toktoken_special_tokens)
        return token_ids, None
  • After submit returns, create_task spawns a task that executes the coroutine object
  • The merge_completion_responses function assembles the results into a CompletionResponse
  • Finally, JSONResponse wraps the generated CompletionResponse as a standard JSON response following the OpenAI API protocol and returns it to the client
    Flow diagram:
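
To make the tokenization step concrete, here is a small standalone sketch of what DefaultInputProcessor ultimately does for a plain text prompt. It calls Hugging Face transformers directly rather than going through TensorRT-LLM, and it assumes the tokenizer files passed via --tokenizer are available at the local path shown.
py
from transformers import AutoTokenizer

# The directory passed to trtllm-serve via --tokenizer in this setup.
tokenizer = AutoTokenizer.from_pretrained("./tmp/Qwen/7B/")

prompt = "Please describe what is Qwen."
# encode() maps the prompt text to token IDs; the server passes add_special_tokens
# from sampling_params, here we just use the tokenizer default.
prompt_token_ids = tokenizer.encode(prompt)
print(prompt_token_ids)
# The intermediate string tokens can be inspected as well.
print(tokenizer.convert_ids_to_tokens(prompt_token_ids))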

submit

  • The main job of the submit function is to hand the request off to the LLM for processing
  • Calling self.submit actually runs the submit function of class GenerationExecutorProxy, because self._executor = self._executor_cls.create(...) earlier produced a GenerationExecutorProxy instance
py
# tensorrt_llm\executor\executor.py (excerpt from generate_async)
def generate_async():
    result = self.submit(request)

## tensorrt_llm\executor\proxy.py
def submit(self, request: GenerationRequest) -> GenerationResult:
    with nvtx_range_debug("request_queue.put"):
        self.request_queue.put(request)

## tensorrt_llm\executor\worker.py
## worker_main runs in the worker process; `worker` is a GenerationExecutorWorker instance
def worker_main():  # full parameter list elided from this excerpt
	request_queue = IpcQueue(worker_queues.request_queue_addr,
	                         is_server=False,
	                         name="worker_request_queue")

	with worker:
	    try:
	        worker.block_subordinates()
	        if is_leader:
	            if postproc_worker_config.enabled:
	                worker.set_postproc_queues(result_queues)
	            else:
	                worker.set_result_queue(result_queue)
	            # Send ready signal with confirmation
	            ready_msg = (ready_signal, None)
	            if not worker_init_status_queue.notify_with_retry(ready_msg):
	                logger.warning(
	                    "Failed to deliver ready signal to proxy, continuing anyway"
	                )
	            while (req := request_queue.get()) is not None:
	                if isinstance(req, CancellingRequest):
	                    worker.abort_request(req.id)
	                elif isinstance(req, GenerationRequest):
	                    try:
	                        worker.submit(req)
	                    except RequestError as e:
	                        logger.error(f"submit request failed: {e}")
	                        logger.error(traceback.format_exc())
	                        worker._await_response_helper.temp_error_responses.put(
	                            ErrorResponse(req.id, e, req.id))
	                else:
	                    raise ValueError(f"Unknown request type: {type(req)}")
  • The submit function of class GenerationExecutorProxy puts the request into request_queue, where it waits to be picked up by the worker_main function (a simplified sketch of this proxy/worker hand-off follows this list)
  • worker_main fetches the request and calls worker.submit(req) directly; the actual implementation is the submit function of class BaseWorker
  • A GenerationResult is built from the Request and associated with the client_id
  • tllm.Request is used to call into the C++ layer and instantiate the Request there
  • Once executor_request is created, self.engine.enqueue_request(executor_request) lands in executor.cpp in the C++ layer, because the engine is an instance of the C++ class Executor
  • Note: the _enqueue_request code carries a comment about MULTIMODAL; currently only the PyTorch backend handles multimodal inputs, and the TensorRT backend apparently does not support them yet (as of the 2025/11/2 main branch)
    Flow diagram:
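
The proxy/worker hand-off boils down to a producer/consumer queue. The sketch below is a simplified, self-contained Python illustration of that pattern; it uses multiprocessing.Queue in place of TensorRT-LLM's IpcQueue and a plain dict in place of GenerationRequest, so it is conceptual rather than the actual implementation.
py
import multiprocessing as mp

def worker_main(request_queue, result_queue):
    # Worker loop: block on the queue until a None sentinel arrives,
    # mirroring `while (req := request_queue.get()) is not None` above.
    while (req := request_queue.get()) is not None:
        # Stand-in for worker.submit(req) followed by engine execution.
        result_queue.put({"client_id": req["client_id"], "text": "generated text"})

if __name__ == "__main__":
    request_queue, result_queue = mp.Queue(), mp.Queue()
    worker = mp.Process(target=worker_main, args=(request_queue, result_queue))
    worker.start()

    # Proxy side: submit() just enqueues the request and returns immediately.
    request_queue.put({"client_id": 0, "prompt": "Please describe what is Qwen."})
    print(result_queue.get())

    request_queue.put(None)  # shut the worker down
    worker.join()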

tllm.Request and engine.enqueue_request

  • Crossing from the Python layer to the C++ layer always goes through the pybind bindings
  • The Request call lands in request.cpp, where the logic is simply to instantiate the object and assign its parameters
cpp
//cpp\tensorrt_llm\pybind\executor\request.cpp
py::class_<tle::Request> request(m, "Request", pybind11::dynamic_attr());

//cpp\tensorrt_llm\executor\request.cpp
Request::Request(VecTokens inputTokenIds, SizeType32 maxTokens, bool streaming, SamplingConfig const& samplingConfig,
    OutputConfig const& outputConfig, std::optional<SizeType32> const& endId, std::optional<SizeType32> const& padId,
    std::optional<std::vector<SizeType32>> positionIds, std::optional<std::list<VecTokens>> badWords,
    std::optional<std::list<VecTokens>> stopWords, std::optional<Tensor> embeddingBias,
    std::optional<ExternalDraftTokensConfig> externalDraftTokensConfig, std::optional<PromptTuningConfig> pTuningConfig,
    std::optional<MultimodalInput> multimodalInput, std::optional<Tensor> multimodalEmbedding,
    std::optional<MropeConfig> mRopeConfig, std::optional<LoraConfig> loraConfig,
    std::optional<LookaheadDecodingConfig> lookaheadConfig,
    std::optional<KvCacheRetentionConfig> kvCacheRetentionConfig, std::optional<std::string> logitsPostProcessorName,
    std::optional<LogitsPostProcessor> logitslogitsPostProcessor, std::optional<VecTokens> encoderInputTokenIds,
    std::optional<IdType> clientId, bool returnAllGeneratedTokens, float priority, RequestType type,
    std::optional<ContextPhaseParams> contextPhaseParams, std::optional<Tensor> encoderInputFeatures,
    std::optional<SizeType32> encoderOutputLength, std::optional<Tensor> crossAttentionMask,
    SizeType32 numReturnSequences, std::optional<EagleConfig> eagleConfig, std::optional<Tensor> skipCrossAttnBlocks,
    std::optional<GuidedDecodingParams> guidedDecodingParams, std::optional<SizeType32> languageAdapterUid,
    std::optional<MillisecondsType> allottedTimeMs, std::optional<CacheSaltIDType> cacheSaltID)
    : mImpl(std::make_unique<Impl>(std::move(inputTokenIds), maxTokens, streaming, samplingConfig, outputConfig, endId,
        padId, std::move(positionIds), std::move(badWords), std::move(stopWords), std::move(embeddingBias),
        std::move(externalDraftTokensConfig), std::move(pTuningConfig), std::move(multimodalInput),
        std::move(multimodalEmbedding), std::move(mRopeConfig), std::move(loraConfig), lookaheadConfig,
        std::move(kvCacheRetentionConfig), std::move(logitsPostProcessorName), std::move(logitslogitsPostProcessor),
        std::move(encoderInputTokenIds), clientId, returnAllGeneratedTokens, priority, type,
        std::move(contextPhaseParams), std::move(encoderInputFeatures), encoderOutputLength, crossAttentionMask,
        numReturnSequences, eagleConfig, skipCrossAttnBlocks, std::move(guidedDecodingParams), languageAdapterUid,
        allottedTimeMs, cacheSaltID))
{
}

//cpp\tensorrt_llm\executor\executor.cpp
IdType Executor::enqueueRequest(Request const& llmRequest)
{
    return mImpl->enqueueRequest(llmRequest);
}


//cpp\tensorrt_llm\executor\executorImpl.cpp
IdType Executor::Impl::enqueueRequest(Request const& request)
{
    return enqueueRequests({&request, 1}).at(0);
}
  • A call to enqueueRequest on Executor is forwarded through Executor::Impl
  • When executorImpl runs enqueueRequest, it calls mQueuedReqCv.notify_one, which wakes the mQueuedReqCv.wait inside the getLeaderNewReqWithIds function
  • nextRequest is taken off the queue and pushed into reqWithIds, which is then returned to the getNewReqWithIds function
  • After getNewReqWithIds returns to the outermost while loop, the forwardAsync function is executed, which in turn calls TrtGptModelInflightBatching::forwardAsync, where the requests are processed asynchronously on the GPU
  • Finally, updateIterationStats and updateRequestStats are executed
  • The executionLoop function is started in Executor::Impl::initialize, which runs as part of the TensorRT-LLM initialization flow; a simplified sketch of this enqueue/notify/loop interplay follows the snippet below
cpp
// Executor::Impl::initialize function
mExecutionThread = std::thread(&Impl::executionLoop, this);
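
The enqueue/notify/wait interplay described above can be pictured with an ordinary condition variable. The following is a simplified Python analogue, not the C++ implementation: queued_req_cv plays the role of mQueuedReqCv, and the forward step is reduced to a print.
py
import threading
from collections import deque

queued_requests = deque()
queued_req_cv = threading.Condition()   # plays the role of mQueuedReqCv
shutdown = False

def enqueue_request(request):
    # Producer side: push the request and wake the execution loop (notify_one).
    with queued_req_cv:
        queued_requests.append(request)
        queued_req_cv.notify()

def execution_loop():
    # Consumer side: sleep until notified, fetch the queued requests, run a forward step.
    while True:
        with queued_req_cv:
            queued_req_cv.wait_for(lambda: queued_requests or shutdown)
            if shutdown and not queued_requests:
                return
            active_requests = list(queued_requests)
            queued_requests.clear()
        print(f"forwardAsync on {len(active_requests)} request(s)")  # stands in for mModel->forwardAsync

worker = threading.Thread(target=execution_loop)
worker.start()
enqueue_request({"prompt": "Please describe what is Qwen."})

with queued_req_cv:
    shutdown = True
    queued_req_cv.notify()
worker.join()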

Flow diagram:

cpp
//executorImpl.cpp
// assignment flow in Executor::Impl::loadModel
mModel = createModel(rawEngine, modelConfig, worldConfig, executorConfig);

std::shared_ptr<Model> Executor::Impl::createModel(runtime::RawEngine const& rawEngine,
    runtime::ModelConfig const& modelConfig, runtime::WorldConfig const& worldConfig,
    ExecutorConfig const& executorConfig)
{
    auto const gptModelType = [&executorConfig, &modelConfig]()
    {
        switch (executorConfig.getBatchingType())
        {
        case BatchingType::kSTATIC:
            TLLM_THROW(
                "Static batching type is deprecated. Please use in-flight batching with "
                "CapacitySchedulerPolicy::kSTATIC_BATCH instead.");
        case BatchingType::kINFLIGHT:
            return modelConfig.isRnnBased() ? batch_manager::TrtGptModelType::InflightBatching
                                            : batch_manager::TrtGptModelType::InflightFusedBatching;
        default: TLLM_THROW("Invalid batching strategy");
        }
    }();

    bool const isLeaderInOrchMode = (mCommMode == CommunicationMode::kORCHESTRATOR) && mIsLeader;
    auto const& fixedExecutorConfig = executorConfigIsValid(executorConfig, modelConfig)
        ? executorConfig
        : fixExecutorConfig(executorConfig, modelConfig);
    
    std::cout << "Entering Impl() createModel gptModelType :"<< std::endl;


    return batch_manager::TrtGptModelFactory::create(
        rawEngine, modelConfig, worldConfig, gptModelType, fixedExecutorConfig, isLeaderInOrchMode);
}

//trtGptModelFactory.h
    static std::shared_ptr<TrtGptModel> create(runtime::RawEngine const& rawEngine,
        runtime::ModelConfig const& modelConfig, runtime::WorldConfig const& worldConfig, TrtGptModelType modelType,
        executor::ExecutorConfig const& executorConfig, bool isLeaderInOrchMode)
    {
        auto logger = std::make_shared<runtime::TllmLogger>();
        auto const device = worldConfig.getDevice();
        auto const rank = worldConfig.getRank();
        TLLM_LOG_INFO("Rank %d is using GPU %d", rank, device);
        TLLM_CUDA_CHECK(cudaSetDevice(device));
        std::cout << "modelType = " << static_cast<int>(modelType) << std::endl;

        if ((modelType == TrtGptModelType::InflightBatching) || (modelType == TrtGptModelType::InflightFusedBatching))
        {
            executor::ExecutorConfig const& fixedExecutorConfig
                = TrtGptModelInflightBatching::executorConfigIsValid(modelConfig, executorConfig)
                ? executorConfig
                : TrtGptModelInflightBatching::fixExecutorConfig(modelConfig, executorConfig);
            bool const ctxGenFusion = modelType == TrtGptModelType::InflightFusedBatching;
            return std::make_shared<TrtGptModelInflightBatching>(
                logger, modelConfig, worldConfig, rawEngine, ctxGenFusion, fixedExecutorConfig, isLeaderInOrchMode);
        }

        throw std::runtime_error("Invalid modelType in trtGptModelFactory");
    }

// call site
void Executor::Impl::forwardAsync(RequestList& activeRequests)
{
    try
    {
        TLLM_LOG_DEBUG("num active requests in scope: %d", activeRequests.size());
        mModel->forwardAsync(activeRequests);
    }
}

forwardAsync calls mModel->forwardAsync; mModel was assigned in createModel (via TrtGptModelFactory::create) to a TrtGptModelInflightBatching instance.
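
The net effect of this factory path is ordinary polymorphic dispatch: createModel picks the concrete model class from the batching type, and the executor only ever calls the base-class interface. A minimal Python sketch of that shape, illustrative only, with simplified names standing in for TrtGptModel and TrtGptModelInflightBatching:
py
from abc import ABC, abstractmethod

class Model(ABC):
    @abstractmethod
    def forward_async(self, active_requests): ...

class InflightBatchingModel(Model):
    # Stands in for TrtGptModelInflightBatching: schedules and executes one step of the batch.
    def forward_async(self, active_requests):
        print(f"inflight batching step over {len(active_requests)} request(s)")

def create_model(batching_type):
    # Mirrors the shape of Executor::Impl::createModel: pick the concrete model from the batching type.
    if batching_type == "INFLIGHT":
        return InflightBatchingModel()
    raise ValueError("Invalid batching strategy")

model = create_model("INFLIGHT")           # what mModel = createModel(...) amounts to
model.forward_async(["req0", "req1"])      # forwardAsync only sees the base interface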

forwardAsync

  • The forwardAsync function of class TrtGptModelInflightBatching is responsible for asynchronously scheduling and executing requests in the LLM decoding phase.
  • mCapacityScheduler schedules the activeRequests with one of three policies (MaxRequestsScheduler, MaxUtilizationScheduler, GuaranteedNoEvictScheduler), splitting them into fittingRequests (can run now) and requestsToPause (must be paused). mPauseRequests then handles the requestsToPause: releasing their resources, updating their state, and removing them from the active queue. Together these two steps implement dynamic scheduling of requests.
cpp
// TensorRT-LLM-main\cpp\tensorrt_llm\batch_manager\capacityScheduler.cpp
std::tuple<RequestVector, RequestVector, RequestVector> CapacityScheduler::operator()(RequestList const& activeRequests,
    OptionalRef<kv_cache_manager::BaseKVCacheManager> kvCacheManager,
    OptionalRef<BasePeftCacheManager const> peftCacheManager,
    OptionalRef<kv_cache_manager::BaseKVCacheManager const> crossKvCacheManager) const
{
    NVTX3_SCOPED_RANGE(capacitySchedulerScheduling);
    return std::visit(
        [&activeRequests, &kvCacheManager, &crossKvCacheManager, &peftCacheManager](
            auto const& scheduler) -> std::tuple<RequestVector, RequestVector, RequestVector>
        {
            RequestVector tmpFittingRequests;
            RequestVector pausedRequests;
            if constexpr (std::is_same_v<std::decay_t<decltype(scheduler)>, MaxRequestsScheduler>)
            {
                std::tie(tmpFittingRequests, pausedRequests) = scheduler(activeRequests);
            }
            else if constexpr (std::is_same_v<std::decay_t<decltype(scheduler)>, MaxUtilizationScheduler>)
            {
                std::tie(tmpFittingRequests, pausedRequests)
                    = scheduler(*kvCacheManager, peftCacheManager, activeRequests);
            }
            else if constexpr (std::is_same_v<std::decay_t<decltype(scheduler)>, GuaranteedNoEvictScheduler>
                || std::is_same_v<std::decay_t<decltype(scheduler)>, StaticBatchScheduler>)
            {
                std::tie(tmpFittingRequests, pausedRequests)
                    = scheduler(*kvCacheManager, crossKvCacheManager, peftCacheManager, activeRequests);
            }
            else
            {
                throw std::runtime_error("Unsupported capacity scheduler policy");
            }
            TLLM_LOG_DEBUG("[Summary] Capacity scheduler allows %d requests, pauses %d requests",
                tmpFittingRequests.size(), pausedRequests.size());

            RequestVector fittingRequests;
            RequestVector fittingDisaggGenInitRequests;
            for (auto const& llmReq : tmpFittingRequests)
            {
                if (llmReq->isDisaggGenerationInitState())
                {
                    fittingDisaggGenInitRequests.push_back(llmReq);
                }
                else
                {
                    fittingRequests.push_back(llmReq);
                }
            }

            return {std::move(fittingRequests), std::move(fittingDisaggGenInitRequests), std::move(pausedRequests)};
        },
        mScheduler);
}
I find this code efficient and concise: if constexpr resolves the scheduler branch at compile time, std::visit safely accesses the concrete type stored in the variant, and std::tie unpacks the returned tuple into both vectors in a single assignment.

Inside MaxUtilizationScheduler's scheduling method, requests are scheduled according to the following predicate:
// MaxUtilizationScheduler::operator(), capacityScheduler.cpp
auto startedReqLambda = [this](std::shared_ptr<LlmRequest> const& req)
{
    return (req->hasReachedState(getNoScheduleUntilState()) && !req->hasReachedState(getNoScheduleAfterState())
        && ((req->isContextInitState() && !req->isFirstContextChunk()) || req->isGenerationInProgressState()));
};
I originally had a question: if the requests all have the same priority, what is the point of swapping one request for another?
The answer is actually in the comment: "Find last active in case we need to evict".
The request that has been inactive the longest generally makes the least efficient use of the hardware, so it can be evicted and paused first and replaced by requests that use the hardware more efficiently; this idea fits the Max Utilization policy.
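
In other words, the eviction choice amounts to scanning the active list from the back for a request that has already started. A toy sketch of that selection (not the scheduler's code):
py
# Toy model: active_requests is ordered oldest-first, and `started` marks
# requests that already hold resources (context begun or generation in progress).
active_requests = [
    {"id": 0, "started": True},
    {"id": 1, "started": True},
    {"id": 2, "started": False},
]

# Scan from the back for a started request; that is the candidate to pause
# when a newly scheduled request no longer fits.
to_evict = next((req for req in reversed(active_requests) if req["started"]), None)
print(to_evict)   # {'id': 1, 'started': True}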

TensorRT-LLM does not schedule requests by priority here; request priority is business logic that users have to define themselves, so TensorRT-LLM does not provide this natively.
  • changeSpecDecMode switches the speculative decoding mode; currently there is only one algorithm, Lookahead
  • AssignReqSeqSlots maps each request onto a sequence slot, which pre-allocates and reuses GPU memory and compute resources to avoid dynamic allocation overhead and memory fragmentation, supports the dynamic scheduling of inflight batching (breaking the efficiency limits of static batching), and manages each sequence's state over its whole lifetime for consistent, traceable inference; with this design, the KV cache tied to a slot can be reused across the requests mapped to it (a conceptual sketch of the slot idea follows this list)
  • executeBatch runs the requests on the GPU; prepareBuffers calls prepareStep, which converts the user's inference requests (context requests + generation requests) into the GPU buffer data the model needs
  • executeContext first checks whether a CUDA graph is available and takes one of two paths: without one, the work is enqueued directly on the CUDA stream; with one, execution goes through the CUDA graph
  • In debug mode the IO tensors are saved so you can check whether an abnormal output is caused by the logits tensor or something else; IO tensors here means the data in inputMap and outputMap
  • executeBatch distinguishes whether context/generation fusion is enabled; without fusion, executeStep is run separately for contextRequests and generationRequests.
  • Later the handling of contextRequests and generationRequests is swapped back and forth, so that in inflight batching mode prefill and decode keep alternating.
    Flow diagram:
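
As a rough illustration of the sequence-slot idea behind AssignReqSeqSlots (a fixed pool of pre-allocated slots recycled across requests), here is a small standalone Python sketch; it is conceptual only and does not reflect the real slot manager.
py
class SeqSlotManager:
    """Toy fixed-size slot pool: a request keeps its slot for its whole lifetime."""

    def __init__(self, max_num_sequences):
        self.free_slots = list(range(max_num_sequences))   # pre-allocated slot IDs
        self.slot_of_request = {}                           # request_id -> slot

    def assign(self, request_id):
        # Reuse the request's slot if it already has one (e.g. a request resuming after a pause).
        if request_id in self.slot_of_request:
            return self.slot_of_request[request_id]
        slot = self.free_slots.pop()                        # grab a free slot, no dynamic allocation
        self.slot_of_request[request_id] = slot
        return slot

    def release(self, request_id):
        # Return the slot to the pool once the request finishes or is evicted.
        self.free_slots.append(self.slot_of_request.pop(request_id))

slots = SeqSlotManager(max_num_sequences=4)
print(slots.assign(request_id=7))    # 3
print(slots.assign(request_id=9))    # 2
slots.release(request_id=7)
print(slots.assign(request_id=11))   # 3 again: the freed slot is recycled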

Summary

  • After forwardAsync finishes, the resulting data is synchronized via MPI communication, then assembled and packaged into a response message that is returned to the client.

Notes on Runtime Errors

C++ code changes not taking effect

I have run into cases where I modified the C++ code and rebuilt with python3 ./scripts/build_wheel.py, but the change did not take effect: the build even reported errors from my edits, yet the installed package still did not pick them up.

The most reliable fix is to first run pip uninstall tensorrt_llm and then pip install ./build/tensorrt_llm*.whl,

because installing directly over the old package may be skipped when the version number has not changed.

bash
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 5, in <module>
    from tensorrt_llm.commands.serve import main
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/__init__.py", line 66, in <module>
    import tensorrt_llm._torch.models as torch_models
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/__init__.py", line 1, in <module>
    from .llm import LLM
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/llm.py", line 1, in <module>
    from tensorrt_llm.llmapi.llm import _TorchLLM
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/__init__.py", line 2, in <module>
    from ..executor import CompletionOutput, LoRARequest, RequestError
ImportError: cannot import name 'CompletionOutput' from 'tensorrt_llm.executor' (/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/__init__.py)

Running trtllm-serve produced the error above; the cause is most likely an incomplete installation of the TensorRT-LLM Python wheel, and re-installing with pip install -e . resolves it.