Local ERNIE Model Inference on Ubuntu 25.04 in 10 Minutes

Part 0. Hardware Overview

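Until now I had no suitable machine: the MacBook I have on hand could not run the model no matter what I tried. Seeing that everyone else was already using it, I hurried home, dug out my old desktop PC, and started installing. The configuration, as shown in the system screenshot taken after installing Ubuntu, is an NVIDIA 3060 graphics card and 64 GB of RAM, which should be about enough.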

Part 1. Environment Setup

1. System setup

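Without further ado, I wiped Windows 11 and installed the latest Ubuntu release, Ubuntu 25.04, from a USB drive. There is nothing special to it; just run the installer. The kernel information is as follows: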

(p3) livingbody@gaint:~$ uname -a
Linux gaint 6.14.0-23-generic #23-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 13 23:02:20 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

2. conda environment setup

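Download the Miniconda installer: open the Tsinghua mirror at mirrors.tuna.tsinghua.edu.cn/anaconda/mi... and download the latest installer package.
Install Miniconda: give the installer execute permission and then run it, as shown below: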

chmod +x Downloads/Miniconda3-py39_4.9.2-Linux-x86_64.sh 
./Downloads/Miniconda3-py39_4.9.2-Linux-x86_64.sh

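Configure the Tsinghua mirrors: download the oh-my-tuna.py project and follow its instructions. If GitHub is hard to reach, you can use GitCode instead: gitcode.com/gh_mirrors/...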

wget https://tuna.moe/oh-my-tuna/oh-my-tuna.py
python oh-my-tuna.py

3. Creating the inference environment

Create the Python environment

conda create -n p3 python=3.12
conda activate p3

Install the GPU build of PaddlePaddle

python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu129/
python -c "import paddle;paddle.utils.run_check()"

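This approach spares you the tedious work of downloading and installing CUDA and cuDNN yourself, and I highly recommend it. With a fast enough connection, the installation finishes in about a minute.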

Install FastDeploy

python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89

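If pip reports that no matching package was found, open www.paddlepaddle.org.cn/packages/st... , download the package directly, and force-install it. I have tested this and it works.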
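Before moving on, it can help to confirm that both packages actually landed in the active p3 environment. The snippet below is only a sanity-check sketch and is not part of the original steps. It uses nothing but the standard library and assumes the importable module names are `paddle` and `fastdeploy`, the names used by the commands in this post.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report(modules=("paddle", "fastdeploy")) -> dict:
    """Map each module name to whether it is importable in this interpreter."""
    return {name: is_installed(name) for name in modules}


if __name__ == "__main__":
    # Show which interpreter is running, then the status of each package.
    print("python:", sys.version.split()[0])
    for name, ok in report().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If either package shows up as MISSING, the most likely cause is that pip ran outside the p3 environment, so re-run `conda activate p3` and install again.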

Part 2. Downloading, Loading, and Testing the Model

1. Downloading and loading the model

Run the following command in a terminal to download and load the model:

python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-0.3B-Paddle \
       --port 8180 \
       --metrics-port 8181 \
       --engine-worker-queue-port 8182 \
       --max-model-len 32768 \
       --max-num-seqs 32
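The server takes a while to come up, especially on the first run when the model still has to be downloaded. A script that wants to call it right away can poll the service port first. This is a minimal sketch using only the standard library; the helper name `wait_for_port` and its default timeout are my own choices, and an open port is only a rough readiness signal.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll until host:port accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: give up after the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the command above, `wait_for_port("127.0.0.1", 8180)` would be the natural call; raise `timeout` if the first download is slow.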

The downloaded model is saved under PaddlePaddle/ERNIE-4.5-0.3B-Paddle:

(base) livingbody@gaint:~$ ls PaddlePaddle/ERNIE-4.5-0.3B-Paddle/ -la
total 706276
drwxrwxr-x 3 livingbody livingbody      4096 Jul  6 16:04 .
drwxrwxr-x 3 livingbody livingbody      4096 Jul  6 16:39 ..
-rw-rw-r-- 1 livingbody livingbody     23133 Jul  6 16:04 added_tokens.json
-rw-rw-r-- 1 livingbody livingbody       556 Jul  6 16:04 config.json
-rw-rw-r-- 1 livingbody livingbody       125 Jul  6 16:04 generation_config.json
-rw-rw-r-- 1 livingbody livingbody     11366 Jul  6 16:04 LICENSE
-rw-rw-r-- 1 livingbody livingbody 721508576 Jul  6 16:04 model.safetensors
-rw------- 1 livingbody livingbody       658 Jul  6 16:04 .msc
-rw-rw-r-- 1 livingbody livingbody        67 Jul  6 16:18 .mv
-rw-rw-r-- 1 livingbody livingbody      7690 Jul  6 16:04 README.md
-rw-rw-r-- 1 livingbody livingbody     15404 Jul  6 16:04 special_tokens_map.json
drwxrwxr-x 2 livingbody livingbody      4096 Jul  6 16:04 ._tmp
-rw-rw-r-- 1 livingbody livingbody      1248 Jul  6 16:04 tokenizer_config.json
-rw-rw-r-- 1 livingbody livingbody   1614363 Jul  6 16:04 tokenizer.model
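To double-check that the download is intact, one option is to read the header of model.safetensors and count the parameters it declares. The sketch below relies on the published safetensors layout (an 8-byte little-endian length followed by a JSON header) and reads no tensor data; the function names are my own. As a rough cross-check, the 721,508,576-byte file in the listing would hold about 0.36 billion weights if each is stored in 2 bytes, which is an assumption about the dtype.

```python
import json
import struct
from math import prod


def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file (no tensor data)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len).decode("utf-8"))


def count_parameters(header: dict) -> int:
    """Sum the element counts of all tensors listed in the header."""
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )
```

Usage would look like `count_parameters(read_safetensors_header("PaddlePaddle/ERNIE-4.5-0.3B-Paddle/model.safetensors"))`. A truncated download typically fails here with a JSON decoding error or gives an implausible count.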

2. Calling the model

Once started, the server logs the following endpoints, which you can call:

INFO     2025-07-06 16:05:14,001 11789 engine.py[line:276] Worker processes are launched with 15.871807098388672 seconds.
INFO     2025-07-06 16:05:14,001 11789 api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
INFO     2025-07-06 16:05:14,002 11789 api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
INFO     2025-07-06 16:05:14,002 11789 api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
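The endpoints in the log above speak plain HTTP, so they can also be exercised without installing any client library. Here is a minimal sketch with the standard library that sends a non-streaming request to the chat completion endpoint. It assumes the request and response bodies follow the OpenAI chat format (in particular `choices[0].message.content`), which I have not verified against the server; the function names are my own.

```python
import json
import urllib.request


def build_chat_request(prompt: str, host: str = "127.0.0.1", port: int = 8180) -> urllib.request.Request:
    """Build a non-streaming POST to the chat completion endpoint."""
    payload = {
        "model": "null",  # placeholder, same value the client example in this post uses
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the reply text (assumed OpenAI-style body)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Please talk about the SUN"))` should print the whole reply at once instead of streaming it.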

Call the service by URL; no api_key is required.

import openai

host = "0.0.0.0"
port = "8180"
# No real key is needed for the local server; "null" is just a placeholder.
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "You are a very useful assistant."},
        {"role": "user", "content": "Please talk about the SUN"},
    ],
    stream=True,
)
# Print each streamed delta as it arrives, skipping chunks that carry no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
print('\n')
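If you want the full reply as a single string rather than printing it piece by piece, the streamed deltas can be collected first. This is a small sketch of that idea. `collect_stream` is my own helper, and `fake_chunk` is a hypothetical stand-in that only mimics the `choices[0].delta.content` shape so the demo runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, used only for the demo."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("The "), fake_chunk(None), fake_chunk("Sun")]
print(collect_stream(demo))  # prints: The Sun
```

With the real client, passing the streamed `response` object to `collect_stream` in place of `demo` should work the same way.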

Part 3. Summary of Difficulties

1. You really do need a GPU; without one this is hard to pull off.

2. If you follow the official installation instructions, CUDA and cuDNN are installed directly through pip, which saves a lot of trouble.

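3. Deploying with FastDeploy takes very little effort; even the model download is automatic. In my test it was fast enough to saturate my connection.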