While exploring LLaMA Factory, we looked at how it handles model deployment.
Startup log and walkthrough
INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-16 10:17:52 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-16 10:17:54,532 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
INFO 04-16 10:18:10 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-16 10:18:10 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:18:14 serving_chat.py:331] Using default chat template:
INFO 04-16 10:18:14 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-16 10:18:14 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-16 10:18:14 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-16 10:18:14 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Started server process [528169]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
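This log comes from a vLLM OpenAI-compatible API server (vLLM is an open-source LLM inference engine, not a Microsoft product) that was started with worker_use_ray=True, so its workers are managed through the Ray distributed computing framework. The key points of the log are: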
- API server version:
INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
The vLLM API server is version 0.4.0.post1.
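- Startup arguments:
INFO 04-16 10:17:52 api_server.py:150] args: Namespace(...)
This line lists the arguments the API server was started with: host, port, log level, allowed origins, methods and headers, API key, served model name, LoRA modules, chat template, response role, SSL settings, and so on.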
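- LLM engine initialization:
INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
The LLM engine is initialized with the given configuration: model path, tokenizer, revision, trust_remote_code, data type (torch.bfloat16 here), maximum sequence length (32768 here), and so on.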
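- Special tokens:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The tokenizer reports that special tokens were added to the vocabulary and reminds you that their embeddings must have been fine-tuned or trained. When serving an already-trained chat model, this is usually just a reminder rather than an error.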
- FlashAttention package not found:
INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
The FlashAttention package is not installed, so vLLM cannot use it. Installing it can improve performance.
- XFormers backend in use:
INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
Without FlashAttention, vLLM falls back to the XFormers attention backend.
- Loading the model weights:
INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
The model weights occupy about 26.67 GB of GPU memory. The figure is a size, not a duration.
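- GPU and CPU block counts:
INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
These are KV-cache blocks, not Ray resources. With block_size=16, each block holds 16 tokens. The 14171 GPU blocks are allocated from the GPU memory left within gpu_memory_utilization=0.9 after the weights are loaded, and the 1365 CPU blocks are swap space in host memory (swap_space=4).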
- CUDA graph capture finished:
INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
Capturing the model as CUDA graphs finished in 4 seconds.
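- Server startup:
INFO: Started server process [528169]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
The server process has started, application startup has completed, and Uvicorn is now listening on port 8000 on all interfaces.
Those are the key points of the log. Together they tell a developer how the server started, which arguments it used, how much memory the model took, which attention backend was selected, and whether the server is up.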
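Once the server is listening on port 8000, a chat request can be sent to its OpenAI-compatible route. The following is a minimal sketch using only the standard library. The host `localhost`, the helper names `build_chat_request` and `ask`, and the `max_tokens` value are assumptions for illustration; the model name is the `served_model_name` from the startup arguments.

```python
import json
import urllib.request

# Assumption: the client runs on the same machine as the server.
BASE_URL = "http://localhost:8000"
# Taken from served_model_name in the startup arguments.
MODEL_NAME = "Qwen1___5-MoE-A2___7B"


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Hello")` only works while the server is running. No system message is needed, because the default chat template shown in the log inserts one when the first message is not a system message.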
On the first launch we found that the FlashAttention package was not installed, so we downloaded the source and installed it from there:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
pip install .
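Before restarting the server, it can be worth checking that the package is visible from the same Python environment. A small sketch, where `is_installed` is a hypothetical helper name and `flash_attn` is the import name of the flash-attention package:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Run this inside the environment that launches vLLM.
print("flash_attn available:", is_installed("flash_attn"))
```

This only confirms that the module can be located. Whether vLLM actually selects it is shown by the backend line in the next startup log.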
Second startup log
(qwen2_moe) ca2@ubuntu:~$ python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1___5-MoE-A2___7B --model /home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat --worker-use-ray --tensor-parallel-size 2
INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-17 05:46:14 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-17 05:46:17,103 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 selector.py:16] Using FlashAttention backend.
INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:28 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:31 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
INFO 04-17 05:46:39 serving_chat.py:331] Using default chat template:
INFO 04-17 05:46:39 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-17 05:46:39 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-17 05:46:39 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-17 05:46:39 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
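The `args: Namespace(...)` lines are long, so it is easy to miss what changed between the two runs. The sketch below parses such a line into a dictionary and reports the differing keys. The helper names `parse_namespace` and `diff_args` are hypothetical, and the two strings are abbreviated copies of the real lines; in practice you would paste the full text after `args:`.

```python
import ast


def parse_namespace(text: str) -> dict:
    """Parse a 'Namespace(key=value, ...)' string into a dict of Python literals."""
    call = ast.parse(text.strip(), mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a Namespace(...) expression")
    # literal_eval only accepts literals, so no code from the log is executed.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


def diff_args(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value changed."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Abbreviated copies of the args lines from the first and second run.
first = ("Namespace(port=8000, worker_use_ray=True, tensor_parallel_size=1, "
         "block_size=16, swap_space=4, gpu_memory_utilization=0.9)")
second = ("Namespace(port=8000, worker_use_ray=True, tensor_parallel_size=2, "
          "block_size=16, swap_space=4, gpu_memory_utilization=0.9)")

print(diff_args(parse_namespace(first), parse_namespace(second)))
# prints {'tensor_parallel_size': (1, 2)}
```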
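This log is from the same vLLM API server, this time started with --tensor-parallel-size 2 and with FlashAttention installed. Each of the two workers now loads 13.3677 GB of weights instead of one GPU loading 26.6740 GB. The key points of the log are: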
- API server version:
INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
The vLLM API server is version 0.4.0.post1.
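- Startup arguments:
INFO 04-17 05:46:14 api_server.py:150] args: Namespace(...)
This line lists the arguments the API server was started with: host, port, log level, allowed origins, methods and headers, API key, served model name, LoRA modules, chat template, response role, SSL settings, and so on.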
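- LLM engine initialization:
INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
The LLM engine is initialized with the given configuration: model path, tokenizer, revision, trust_remote_code, data type, maximum sequence length, and so on. This time the configuration shows tensor_parallel_size=2.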
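- Special tokens:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The tokenizer reports that special tokens were added to the vocabulary and reminds you that their embeddings must have been fine-tuned or trained. When serving an already-trained chat model, this is usually just a reminder rather than an error.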
- FlashAttention backend in use:
INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
With the package installed, vLLM now selects the FlashAttention backend instead of XFormers.
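- GPU and CPU block counts:
INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
These are KV-cache blocks, not Ray resources. There are now 37540 GPU blocks and 2730 CPU blocks, up from 14171 and 1365 in the first run, because the weights are split across two GPUs and more memory is left for the cache.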
- CUDA graph capture finished:
INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
Capturing the model as CUDA graphs finished in 4 seconds.
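- Server startup:
INFO: Started server process [528169]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
The server process has started, application startup has completed, and Uvicorn is now listening on port 8000 on all interfaces.
Those are the key points of the log. Together they tell a developer how the server started, which arguments it used, how much memory the model took, which attention backend was selected, and whether the server is up.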
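The block counts from the two runs can be turned into token capacities with a little arithmetic. The sketch below uses only numbers from the logs (block_size=16, max_seq_len=32768, swap_space=4). The per-block size is an inference: it assumes the CPU block count is the swap space divided by the size of one block on a worker, which matches the numbers here but is not stated in the log. `summarize` is a hypothetical helper name.

```python
BLOCK_SIZE = 16       # tokens per KV-cache block (block_size=16 in the args)
MAX_SEQ_LEN = 32768   # max_seq_len from the engine config line
SWAP_SPACE_GIB = 4    # swap_space=4 in the args

# (# GPU blocks, # CPU blocks) as reported by the two runs.
RUNS = {
    "tensor_parallel_size=1": (14171, 1365),
    "tensor_parallel_size=2": (37540, 2730),
}


def summarize(gpu_blocks: int, cpu_blocks: int) -> dict:
    """Convert block counts into token capacities and an estimated block size."""
    gpu_tokens = gpu_blocks * BLOCK_SIZE
    return {
        "gpu_tokens": gpu_tokens,
        "cpu_tokens": cpu_blocks * BLOCK_SIZE,
        # How many sequences of maximum length fit in the GPU cache at once.
        "full_length_seqs": gpu_tokens / MAX_SEQ_LEN,
        # Assumption: CPU blocks = swap space / bytes per block on one worker.
        "approx_mib_per_block": SWAP_SPACE_GIB * 1024 / cpu_blocks,
    }


for name, (gpu, cpu) in RUNS.items():
    print(name, summarize(gpu, cpu))
```

Under these assumptions the first run can cache 226736 tokens on the GPU (just under seven full-length sequences) and the second run 600640 tokens. The estimated block size drops from about 3 MiB to about 1.5 MiB per worker, which would be consistent with each GPU holding half of every block under tensor parallelism.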