This post deploys the model with the FastAPI and Uvicorn libraries, assuming the Llama2 (Atom-7B) model has already been downloaded to:
/home/runUser/llama/atom-7B
1. Locating the accelerate_server.py file path
Run the accelerate_server.py script from the repository to start the API service:
(pytorch) runUser@**:~$ python accelerate_server.py --model_path /home/runUser/llama/atom-7B --gpus "0" --infer_dtype "int16" --model_source "llama2_chinese"
python: can't open file '/home/runUser/accelerate_server.py': [Errno 2] No such file or directory
It reports that accelerate_server.py cannot be found, so use the find command to locate it:
(pytorch) runUser@**:~$ find /home/runUser/ -name "accelerate_server.py"
/home/runUser/install_model/Llama-Chinese/scripts/api/accelerate_server.py
cd into that directory, and the script command runs:
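(pytorch) runUser@**:~$ cd /home/runUser/install_model/Llama-Chinese/scripts/api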
2. TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'use_flash_attention_2'
Running the script raises the following error:
(pytorch) runUser@l**:~/install_model/Llama-Chinese/scripts/api$ python accelerate_server.py --model_path /home/runUser/llama/atom-7B --gpus "0" --infer_dtype "float16" --model_source "llama2_chinese"
/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/sklearn/utils/_param_validation.py:14: UserWarning: A NumPy version >=1.22.4 and <2.3.0 is required for this version of SciPy (detected version 2.3.0)
from scipy.sparse import csr_matrix, issparse
get_world_size:1
`torch_dtype` is deprecated! Use `dtype` instead!
Traceback (most recent call last):
File "/home/runUser/install_model/Llama-Chinese/scripts/api/accelerate_server.py", line 182, in <module>
model = AutoModelForCausalLM.from_pretrained(args.model_path, **kwargs,trust_remote_code=True,use_flash_attention_2=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 597, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/transformers/modeling_utils.py", line 288, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5106, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'use_flash_attention_2'
Line 182 of accelerate_server.py is the call shown in the traceback:

model = AutoModelForCausalLM.from_pretrained(args.model_path, **kwargs, trust_remote_code=True, use_flash_attention_2=True)

Newer transformers releases no longer accept `use_flash_attention_2` in `from_pretrained`, so the unrecognized keyword gets forwarded to `LlamaForCausalLM.__init__()`, which rejects it. Deleting `use_flash_attention_2=True,` removes the error, and the script runs:
(pytorch) runUser@**:~/install_model/Llama-Chinese/scripts/api$ python accelerate_server.py --model_path /home/runUser/llama/atom-7B --gpus "0" --infer_dtype "float16" --model_source "llama2_chinese"
/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/sklearn/utils/_param_validation.py:14: UserWarning: A NumPy version >=1.22.4 and <2.3.0 is required for this version of SciPy (detected version 2.3.0)
from scipy.sparse import csr_matrix, issparse
get_world_size:1
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.04s/it]
Some parameters are on the meta device because they were offloaded to the cpu.
INFO: Started server process [54825]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
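If you would rather keep FlashAttention-2 than drop it, current transformers versions expose the same switch through the `attn_implementation` argument. A sketch, assuming the flash-attn package is installed and the GPU supports it (whether the Atom-7B custom modeling code honors the flag I have not verified):

```python
model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    **kwargs,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # replaces the removed use_flash_attention_2=True
)
```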
3. AttributeError: 'DynamicCache' object has no attribute 'seen_tokens'
Sending a request to the API server with the accelerate_client.py script returns HTTP Error 500: Internal Server Error. The server-side traceback:
INFO: 127.0.0.1:35030 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/fastapi/applications.py", line 1135, in __call__
await super().__call__(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/applications.py", line 107, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
await route.handle(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
await self.app(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/fastapi/routing.py", line 115, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/fastapi/routing.py", line 101, in app
response = await f(request)
^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/fastapi/routing.py", line 355, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/fastapi/routing.py", line 243, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/install_model/Llama-Chinese/scripts/api/accelerate_server.py", line 131, in create_item
generate_ids = model.generate(**generate_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/transformers/generation/utils.py", line 2539, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/transformers/generation/utils.py", line 2860, in _sample
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runUser/.cache/huggingface/modules/transformers_modules/atom-7B/model_atom.py", line 1379, in prepare_inputs_for_generation
past_length = past_key_values.seen_tokens
^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DynamicCache' object has no attribute 'seen_tokens'
The traceback points at the cause: with `trust_remote_code=True`, transformers loads the custom modeling code bundled with the checkpoint (model_atom.py), and that code still reads `past_key_values.seen_tokens`, an attribute newer transformers removed from `DynamicCache`. Also deleting `trust_remote_code=True` from line 182 of accelerate_server.py, leaving

model = AutoModelForCausalLM.from_pretrained(args.model_path, **kwargs)

makes the loader fall back to the library's built-in LlamaForCausalLM. Restart the server, and the request succeeds:
INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
generate_kwargs: {'input_ids': tensor([[ 1, 12968, 29901, 29871, 63694, 32164, 13, 2, 1, 4007,
22137, 29901, 29871]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0'), 'max_new_tokens': 2048, 'do_sample': True, 'top_p': 0.95, 'top_k': 50, 'temperature': 0.3, 'num_beams': 1, 'repetition_penalty': 1.2, 'max_length': 2048}
Both `max_new_tokens` (=2048) and `max_length`(=2048) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
INFO: 127.0.0.1:51542 - "POST /generate HTTP/1.1" 200 OK
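If you need to keep the checkpoint's custom code instead, an alternative (my own sketch, not a fix taken from the repository) is to port model_atom.py to the current Cache API, which exposes the token count through a method rather than an attribute:

```python
# In model_atom.py, prepare_inputs_for_generation (around line 1379):
# old attribute, removed from DynamicCache:
#     past_length = past_key_values.seen_tokens
# current equivalent on the Cache API:
past_length = past_key_values.get_seq_length()
```

Incidentally, the log above also warns that `max_new_tokens` and `max_length` are both set; dropping `max_length` from the generate_kwargs dict that accelerate_server.py builds would silence that warning.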
4. `torch_dtype` is deprecated! — not important, but it bothers me
On a successful start, there is a warning: `torch_dtype` is deprecated!
(pytorch) runUser@**:~/install_model/Llama-Chinese/scripts/api$ python accelerate_server.py --model_path /home/runUser/llama/atom-7B --gpus "0" --infer_dtype "float16" --model_source "llama2_chinese"
/home/runUser/anaconda3/envs/pytorch/lib/python3.12/site-packages/sklearn/utils/_param_validation.py:14: UserWarning: A NumPy version >=1.22.4 and <2.3.0 is required for this version of SciPy (detected version 2.3.0)
from scipy.sparse import csr_matrix, issparse
get_world_size:1
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.03s/it]
Some parameters are on the meta device because they were offloaded to the cpu.
INFO: Started server process [55362]
INFO: Waiting for application startup.
INFO: Application startup complete.
Line 178 of accelerate_server.py sets the deprecated key (presumably something like `kwargs["torch_dtype"] = dtype`). Change it to

kwargs["dtype"] = dtype

and the warning no longer appears.
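If the script also needs to run on older transformers releases that only understand `torch_dtype`, a version check works. A minimal sketch; the 4.56.0 cutoff is my assumption about when the rename landed:

```python
import transformers
from packaging import version

# Pick the kwarg name by library version; the 4.56.0 cutoff is an assumption.
if version.parse(transformers.__version__) >= version.parse("4.56.0"):
    kwargs["dtype"] = dtype
else:
    kwargs["torch_dtype"] = dtype
```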