Observing how vLLM deploys a model through its runtime logs

While exploring LLaMA Factory, we came across the startup log that is printed when a model is deployed with vLLM.

Log walkthrough

INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-16 10:17:52 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-16 10:17:54,532	INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
INFO 04-16 10:18:10 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-16 10:18:10 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:18:14 serving_chat.py:331] Using default chat template:
INFO 04-16 10:18:14 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-16 10:18:14 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-16 10:18:14 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-16 10:18:14 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [528169]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

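This log comes from vLLM's OpenAI-compatible API server, an inference and serving engine for large language models (LLMs). Because `worker_use_ray=True` is set, the model worker is managed through the Ray distributed computing framework. The key points of the log are explained below: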

API server version:
```
INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
```
The vLLM API server version is 0.4.0.post1.
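
Startup arguments:
```
INFO 04-16 10:17:52 api_server.py:150] args: Namespace(...)
```
This line lists the arguments the API server was started with, including host, port, log level, allowed origins, allowed methods, allowed headers, API key, served model name, LoRA modules, chat template, response role, and SSL settings.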
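
Because the `args:` line is a Python `Namespace(...)` repr, its settings can be read programmatically. The sketch below is one possible approach and not part of vLLM: the helper `parse_namespace` is hypothetical, and the sample string is a shortened copy of the line from the log. It relies on every value being a plain Python literal, which holds for this log.

```python
import ast


def parse_namespace(text: str) -> dict:
    """Parse a 'Namespace(key=value, ...)' repr into a dict.

    Hypothetical helper for illustration; it only works when every
    value is a Python literal (None, numbers, strings, lists, ...).
    """
    call = ast.parse(text.strip(), mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a Namespace(...) expression")
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


# Shortened copy of the args line from the first startup log.
sample = (
    "Namespace(host=None, port=8000, "
    "served_model_name='Qwen1___5-MoE-A2___7B', "
    "worker_use_ray=True, tensor_parallel_size=1, block_size=16, "
    "swap_space=4, gpu_memory_utilization=0.9)"
)

args = parse_namespace(sample)
print(args["port"], args["tensor_parallel_size"], args["gpu_memory_utilization"])
```

Parsing the repr as an expression avoids hand-written string splitting, which would break on values such as `allowed_origins=['*']` that contain commas or brackets.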
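
LLM engine initialization:
```
INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
```
The LLM engine is being initialized with a specific configuration: model path, tokenizer, revision, `trust_remote_code`, data type (`torch.bfloat16`), maximum sequence length (32768), and so on.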
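
Special tokens:
```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
This message is printed when the tokenizer is loaded. It says that special tokens have been added to the vocabulary and reminds you that their embeddings must have been trained or fine-tuned. It is a general reminder, and it is usually harmless when serving a released chat model whose special tokens were already trained.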

FlashAttention package not found:
```
INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
```
The FlashAttention package is not installed, so vLLM cannot use it as the attention backend. Installing the package should improve performance.

XFormers backend in use:
```
INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
```
vLLM falls back to the XFormers attention backend.

Loading model weights:
```
INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
```
Loading the model weights consumed about 26.67 GB of GPU memory. The figure is a memory size, not a duration.
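
The reported figure is consistent with a quick back-of-the-envelope estimate. The sketch below assumes a total of about 14.3 billion parameters for Qwen1.5-MoE-A2.7B (a number taken from the model's published description, not from this log) and 2 bytes per parameter for `torch.bfloat16`, the dtype shown in the engine config. It also assumes that "GB" in the log means 2^30 bytes.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate weight memory in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# Assumption: ~14.3e9 total parameters. All experts of a MoE model must be
# resident in memory, even though only a subset is active per token.
estimate = weight_memory_gib(14.3e9)
print(f"estimated weights: {estimate:.2f} GiB (log reports 26.6740 GB)")
```

This is only a sanity check: the real number also depends on how the checkpoint stores individual tensors.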
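
Number of GPU and CPU blocks:
```
INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
```
These are KV cache blocks allocated by vLLM, not Ray resources: 14171 blocks in GPU memory and 1365 blocks in CPU memory reserved as swap space (`swap_space=4`). With `block_size=16`, each block holds the KV cache for 16 tokens.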
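
vLLM allocates its KV cache in fixed-size blocks, and each block holds `block_size` tokens (16 in the startup args), so the block counts translate directly into token capacity. A minimal sketch of that arithmetic; the function name is made up for illustration, and 32768 is the `max_seq_len` from the engine config:

```python
def kv_cache_token_capacity(num_blocks: int, block_size: int = 16) -> int:
    """Total number of tokens whose KV cache fits in `num_blocks` blocks."""
    return num_blocks * block_size


gpu_tokens = kv_cache_token_capacity(14171)  # blocks in GPU memory
cpu_tokens = kv_cache_token_capacity(1365)   # blocks in CPU swap space

# How many sequences of the maximum length (max_seq_len=32768) fit at once.
full_length_sequences = gpu_tokens // 32768

print(gpu_tokens, cpu_tokens, full_length_sequences)
```

Shorter requests need fewer blocks, so in practice many more than this number of sequences can be batched together.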

CUDA graph capture finished:
```
INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
```
Capturing the model's CUDA graphs finished in 4 seconds.

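Server startup:
```
INFO:     Started server process [528169]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
The server process has started, application startup is complete, and the server is now listening on port 8000.

That covers the key points of the log. Logs like these help developers check the server's startup status, configuration parameters, model loading state, attention backend selection, and running state.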
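
Once Uvicorn reports that it is listening, the server can be queried over HTTP. The sketch below uses only the standard library. It assumes the server is reachable at `http://localhost:8000` and serves the OpenAI-style `/v1/chat/completions` route; the model name is the `served_model_name` from the startup args, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server runs on this machine
MODEL = "Qwen1___5-MoE-A2___7B"     # served_model_name from the startup args


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello")
print(request.full_url)
```

Calling `send_chat("Hello")` only works while the server from the log is running; building the request is kept separate so the payload can be inspected without a live server.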

On the first launch we found that the FlashAttention package was not installed, so we downloaded the source and installed it from there.

```
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
pip install .
```
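
Before restarting the server, it can be worth confirming that the package is importable in the same Python environment. A small sketch using only the standard library; it assumes the package's import name is `flash_attn`, and the helper `has_package` is hypothetical:

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if the import system can locate a top-level package."""
    return importlib.util.find_spec(name) is not None


print("flash_attn available:", has_package("flash_attn"))
```

This only checks that the package can be found, not that its compiled extension matches the installed CUDA and PyTorch versions.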

Second startup log

(qwen2_moe) ca2@ubuntu:~$ python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1___5-MoE-A2___7B --model /home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat  --worker-use-ray --tensor-parallel-size 2
INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-17 05:46:14 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-17 05:46:17,103	INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 selector.py:16] Using FlashAttention backend.
INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:28 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:31 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
INFO 04-17 05:46:39 serving_chat.py:331] Using default chat template:
INFO 04-17 05:46:39 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-17 05:46:39 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-17 05:46:39 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-17 05:46:39 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

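This log comes from the same vLLM API server, this time started with `--worker-use-ray --tensor-parallel-size 2`, so lines from a Ray worker (`RayWorkerVllm`) appear alongside those of the main process. The key points of the log are explained below: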

API server version:
```
INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
```
The vLLM API server version is 0.4.0.post1, the same as in the first run.
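
Startup arguments:
```
INFO 04-17 05:46:14 api_server.py:150] args: Namespace(...)
```
This line lists the arguments the API server was started with, including host, port, log level, allowed origins, allowed methods, allowed headers, API key, served model name, LoRA modules, chat template, response role, and SSL settings. Compared with the first run, `tensor_parallel_size` is now 2.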
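
LLM engine initialization:
```
INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
```
The LLM engine is being initialized with a specific configuration: model path, tokenizer, revision, `trust_remote_code`, data type, maximum sequence length, and so on.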
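
Special tokens:
```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
As in the first run, this message is printed when the tokenizer is loaded. It reminds you that the embeddings of added special tokens must have been trained or fine-tuned, and it is usually harmless when serving a released chat model.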

FlashAttention backend in use:
```
INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
```
With the package installed, vLLM now selects the FlashAttention backend instead of XFormers.
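
Number of GPU and CPU blocks:
```
INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
```
vLLM has allocated 37540 KV cache blocks in GPU memory and 2730 blocks in CPU swap space, up from 14171 and 1365 in the first run. These are vLLM's KV cache blocks, not Ray resources.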
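
To compare the two runs, the block counts can be pulled straight out of the log lines. This is a sketch with a hypothetical helper `parse_block_counts`; the regular expression is written against the exact line format shown in these two logs and may not match other vLLM versions.

```python
import re
from typing import Tuple

BLOCKS_RE = re.compile(r"# GPU blocks: (\d+), # CPU blocks: (\d+)")


def parse_block_counts(line: str) -> Tuple[int, int]:
    """Return (gpu_blocks, cpu_blocks) from a block-count log line."""
    match = BLOCKS_RE.search(line)
    if match is None:
        raise ValueError("no block counts found in line")
    return int(match.group(1)), int(match.group(2))


first_run = "INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365"
second_run = "INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730"

gpu1, cpu1 = parse_block_counts(first_run)
gpu2, cpu2 = parse_block_counts(second_run)
print(f"GPU blocks grew {gpu2 / gpu1:.2f}x, CPU blocks grew {cpu2 / cpu1:.2f}x")
```

The CPU block count exactly doubles, while the GPU block count grows by more than 2x. One plausible reading is that with `tensor_parallel_size=2` each GPU holds roughly half of the weights (13.3677 GB per process in the log), which leaves more memory for KV cache.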

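CUDA graph capture finished:
```
INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
```
Capturing the model's CUDA graphs finished in 4 seconds, on both the main process and the Ray worker.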

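Server startup:
```
INFO:     Started server process [528169]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
The server process starts, application startup completes, and the server listens on port 8000. Note that the second log excerpt stops before these Uvicorn lines; the lines quoted here, including process ID 528169, are identical to the output of the first run.

As with the first run, these lines let developers confirm the startup status, configuration parameters, model loading state, and attention backend selection.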
