Observing the vLLM Model Deployment Process Through Its Runtime Logs

While exploring LLaMA-Factory, we watched vLLM deploy a model. Judging from the args dump below, the server was launched through vLLM's OpenAI-compatible entrypoint (python -m vllm.entrypoints.openai.api_server) with tensor_parallel_size=1. Startup log for the first run:

INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-16 10:17:52 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-16 10:17:54,532	INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
INFO 04-16 10:18:10 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-16 10:18:10 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:18:14 serving_chat.py:331] Using default chat template:
INFO 04-16 10:18:14 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-16 10:18:14 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-16 10:18:14 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-16 10:18:14 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [528169]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

This log comes from an application running on a server: a vLLM inference engine serving a large language model, with its worker processes managed through the Ray distributed computing framework. Here is a walkthrough of the key points in the log:

  1. API server version

    INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1

    The vLLM API server is version 0.4.0.post1.

  2. Startup arguments

    INFO 04-16 10:17:52 api_server.py:150] args: Namespace(...)

    This line dumps every argument the API server was started with: host and port, log level, CORS settings (allowed origins, methods, and headers), API key, served model name, LoRA modules, chat template, response role, SSL configuration, parallelism settings, and more.

  3. LLM engine initialization

    INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...

    The LLM engine initializes with the resolved configuration: model and tokenizer paths, revision, trust_remote_code, dtype (torch.bfloat16), max_seq_len=32768, tensor_parallel_size=1, and so on.

  4. Special tokens

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

    Special tokens were added to the tokenizer's vocabulary. The message is a reminder that the embeddings for those tokens only carry meaning if they were fine-tuned or trained; for a released chat model like this one they already have been, so the line is informational.

  5. FlashAttention not found

    INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.

    The flash-attn package is not installed, so vLLM cannot use the FlashAttention backend. Installing it improves attention performance.

  6. XFormers backend

    INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.

    With FlashAttention unavailable, vLLM falls back to the xFormers attention backend.

  7. Model weight loading

    INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB

    The loaded weights occupy about 26.67 GB of GPU memory (the figure is the resident size of the weights, not a transient loading cost). That is consistent with Qwen1.5-MoE-A2.7B's roughly 14.3B total parameters in bfloat16: 14.3B x 2 bytes is about 28.6 GB, i.e. roughly 26.6 GiB.

  8. GPU and CPU block counts

    INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365

    These are not Ray resources but vLLM's PagedAttention KV-cache blocks: 14171 blocks fit in the remaining GPU memory and 1365 blocks in the CPU swap space. With block_size=16, the GPU cache can hold 14171 x 16 = 226,736 tokens of KV state.

  9. CUDA graph capture finished

    INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.

    Capturing the model's CUDA graphs finished in 4 seconds. As the preceding lines warn, CUDA graphs can consume an extra 1~3 GiB of memory per GPU; set enforce_eager=True to skip capture if memory is tight.

  10. Server startup

    INFO:     Started server process [528169]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

    The Uvicorn server process has started and the application is listening on port 8000, exposing an OpenAI-compatible API (see the client sketch after this list). Taken together, these log lines tell a developer how the server was configured, how the model was loaded, which attention backend was selected, and whether the server came up healthy.
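With the server listening on port 8000, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the openai Python package (v1.x), which is an assumption since the original post shows no client; the model name must match the --served-model-name argument from the args dump above.

# Minimal client sketch for the vLLM OpenAI-compatible server started above.
# Assumes `pip install openai` (1.x) and the server reachable on localhost:8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # the server was started without --api-key, so any value works
)

response = client.chat.completions.create(
    model="Qwen1___5-MoE-A2___7B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)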

The first time we started the model we saw that the FlashAttention package was not installed, so we built and installed it from source:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
pip install .
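After the build finishes, a quick import check confirms the package is visible to the environment vLLM runs in (a minimal sketch; flash-attn exposes a __version__ attribute):

# Sanity check after building flash-attn from source: if this import fails,
# vLLM will fall back to the xFormers backend again on the next start.
import flash_attn
print(flash_attn.__version__)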

Startup log for the second run, after installing flash-attn and adding --tensor-parallel-size 2:

(qwen2_moe) ca2@ubuntu:~$ python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1___5-MoE-A2___7B --model /home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat  --worker-use-ray --tensor-parallel-size 2
INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-17 05:46:14 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-17 05:46:17,103	INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 selector.py:16] Using FlashAttention backend.
INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:28 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:31 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
INFO 04-17 05:46:39 serving_chat.py:331] Using default chat template:
INFO 04-17 05:46:39 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-17 05:46:39 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-17 05:46:39 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-17 05:46:39 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
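Right before each excerpt ends, both logs print the default chat template that serving_chat.py will apply to incoming messages: Qwen's ChatML format. The sketch below re-renders that template with the jinja2 package (used here only for illustration; vLLM applies it through its own chat-template machinery) to show the exact prompt string the server builds:

# Render the default ChatML chat template printed in the log (serving_chat.py)
# to see what prompt string the server builds from a chat request.
# Requires `pip install jinja2`.
from jinja2 import Template

chat_template = (
    "{% for message in messages %}"
    "{% if loop.first and messages[0]['role'] != 'system' %}"
    "{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}"
    "{% endif %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] }}"
    "{% if (loop.last and add_generation_prompt) or not loop.last %}"
    "{{ '<|im_end|>' + '\n' }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}"
    "{{ '<|im_start|>assistant\n' }}"
    "{% endif %}"
)

messages = [{"role": "user", "content": "Hello!"}]
print(Template(chat_template).render(messages=messages, add_generation_prompt=True))
# <|im_start|>system
# You are a helpful assistant<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant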

This second log shows the same vLLM API server starting up, again with Ray managing the workers, but now with flash-attn installed and --tensor-parallel-size 2. Here is a walkthrough of the key points:

  1. API server version

    INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1

    The vLLM API server is still version 0.4.0.post1.

  2. Startup arguments

    INFO 04-17 05:46:14 api_server.py:150] args: Namespace(...)

    The same argument dump as in the first run; the notable difference is tensor_parallel_size=2, set on the command line with --tensor-parallel-size 2.

  3. LLM engine initialization

    INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...

    The engine initializes with the same configuration except tensor_parallel_size=2: the model is sharded across two GPUs, and a second Ray worker (the lines prefixed with "(RayWorkerVllm pid=633099)") mirrors the main process's output.

  4. Special tokens

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

    The same informational message about special-token embeddings as in the first run.

  5. FlashAttention backend

    INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.

    With flash-attn installed, vLLM now selects the FlashAttention backend instead of falling back to xFormers. The log also shows NCCL 2.18.1 being used for tensor-parallel communication, plus a warning that custom allreduce is disabled because the GPUs lack P2P capability (pass disable_custom_all_reduce=True to silence it).

  6. GPU and CPU block counts

    INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730

    The KV cache now gets 37540 GPU blocks and 2730 CPU blocks. Because each GPU holds only half the weights (13.3677 GB instead of 26.6740 GB), far more memory is left for the KV cache; the arithmetic after this list makes the comparison concrete.

  7. CUDA graph capture finished

    INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.

    CUDA graph capture again finishes in 4 seconds, on both the main process and the Ray worker.

  8. Server startup

    The excerpt above ends at the chat-template dump; startup then proceeds exactly as in the first run, with Uvicorn starting the server process and listening on port 8000. As before, the log captures the configuration, model loading and sharding, attention backend selection, and server health in one place.
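A quick comparison of the two runs, using only numbers taken from the logs, shows what tensor parallelism bought us (block_size=16 comes from the startup arguments):

# Compare the two startup logs: per-GPU weight footprint and KV-cache capacity.
block_size = 16                # tokens per PagedAttention block (from the args)

run1_weights_gb = 26.6740      # run 1: "Loading model weights took 26.6740 GB"
run1_gpu_blocks = 14171        # run 1: "# GPU blocks: 14171"

run2_weights_gb = 13.3677      # run 2: per worker, i.e. per GPU
run2_gpu_blocks = 37540        # run 2: "# GPU blocks: 37540"

print(run2_weights_gb / run1_weights_gb)   # ~0.50: weights split in half per GPU
print(run1_gpu_blocks * block_size)        # 226736 tokens of KV-cache capacity
print(run2_gpu_blocks * block_size)        # 600640 tokens of KV-cache capacity

Halving the per-GPU weight footprint frees enough memory to more than double the KV-cache capacity, which is what lets the same hardware batch more concurrent sequences.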

I hope you found this walkthrough helpful.
