Part 0: Hardware Overview
I hadn't had a proper computer for this for a while, and my MacBook just couldn't run it. Seeing that everyone else was already up and running, I hurried home, dug out my old desktop, and started installing. Below is a system screenshot taken after installing Ubuntu, with the specs shown there: an NVIDIA RTX 3060 GPU and 64 GB of RAM, which should be about enough.

Part 1: Environment Setup
1. System preparation
Without further ado, I wiped Windows 11 and did a USB install of the latest Ubuntu, version 25.04. Kernel info:
```bash
(p3) livingbody@gaint:~$ uname -a
Linux gaint 6.14.0-23-generic #23-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 13 23:02:20 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
```

2. Conda environment setup
- Download the Miniconda installer: open the Tsinghua (TUNA) mirror at mirrors.tuna.tsinghua.edu.cn/anaconda/mi... and download the latest installer.
- Install Miniconda: give the installer execute permission, then run it, as shown below:
```bash
chmod +x Downloads/Miniconda3-py39_4.9.2-Linux-x86_64.sh
./Downloads/Miniconda3-py39_4.9.2-Linux-x86_64.sh
```
- Set the TUNA mirrors: download the oh-my-tuna.py script and follow its instructions; if GitHub is hard to reach, use the GitCode mirror: gitcode.com/gh_mirrors/...
```bash
wget https://tuna.moe/oh-my-tuna/oh-my-tuna.py
python oh-my-tuna.py
```
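To confirm the script took effect for pip, you can inspect the config it writes; a minimal sketch, assuming pip's standard `[global] index-url` config format (the exact file oh-my-tuna touches may differ on your system):

```python
# Sanity check that oh-my-tuna pointed pip at the TUNA mirror.
# NOTE: the [global] index-url key follows pip's standard config format;
# which file oh-my-tuna actually rewrites is an assumption here.
import configparser

def pip_uses_tuna(pip_conf_text: str) -> bool:
    """Return True if this pip config routes installs through the TUNA mirror."""
    cfg = configparser.ConfigParser()
    cfg.read_string(pip_conf_text)
    return "pypi.tuna.tsinghua.edu.cn" in cfg.get("global", "index-url", fallback="")

sample = "[global]\nindex-url = https://pypi.tuna.tsinghua.edu.cn/simple\n"
print(pip_uses_tuna(sample))  # → True
```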
3. Creating the inference environment
- Create the Python environment
```bash
conda create -n p3 python=3.12
conda activate p3
```
- Install the GPU build of PaddlePaddle
```bash
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu129/
python -c "import paddle;paddle.utils.run_check()"
```

This approach skips the tedium of downloading and installing CUDA and cuDNN yourself, which is a huge convenience; with a fast enough connection the install finishes in about a minute.
- Install FastDeploy
```bash
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89
```
If pip reports that no suitable package is available, open www.paddlepaddle.org.cn/packages/st... , download the wheel directly, and force-install it; this worked for me.
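The `86_89` suffix in that index appears to correspond to GPU compute capabilities 8.6 and 8.9 (RTX 30- and 40-series cards, including this 3060). A hypothetical helper for picking an index by card; the `80_90` index name for data-center cards is an assumption, not something verified here:

```python
# Hypothetical helper: pick a FastDeploy wheel index by GPU compute capability.
# "86_89" in the URL above matches SM 8.6 (RTX 30xx, e.g. this 3060) and
# SM 8.9 (RTX 40xx); the "80_90" index name for A100/H100-class cards is
# an assumption, not verified against the package server.
BASE = "https://www.paddlepaddle.org.cn/packages/stable/"

def fastdeploy_index(sm: str) -> str:
    """Map a compute capability string like '8.6' to a package index URL."""
    if sm in ("8.6", "8.9"):
        return BASE + "fastdeploy-gpu-86_89/"
    if sm in ("8.0", "9.0"):
        return BASE + "fastdeploy-gpu-80_90/"
    raise ValueError(f"no known FastDeploy wheel index for SM {sm}")

print(fastdeploy_index("8.6"))
```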
Part 2: Model Download, Loading, and Testing
1. Downloading and loading the model
Running the following command in a terminal downloads the model and loads it for serving:
bash
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32
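When juggling ports or context lengths across launches, it can help to assemble this command from one place. A minimal sketch; the flag names are copied verbatim from the command above, while the helper itself and its defaults are mine:

```python
# Sketch: build the api_server launch command from one set of parameters.
# Flag names match the command shown above; defaults are this post's values.
import shlex

def launch_cmd(model: str, port: int = 8180, metrics_port: int = 8181,
               queue_port: int = 8182, max_model_len: int = 32768,
               max_num_seqs: int = 32) -> str:
    args = ["python", "-m", "fastdeploy.entrypoints.openai.api_server",
            "--model", model,
            "--port", str(port),
            "--metrics-port", str(metrics_port),
            "--engine-worker-queue-port", str(queue_port),
            "--max-model-len", str(max_model_len),
            "--max-num-seqs", str(max_num_seqs)]
    return shlex.join(args)

print(launch_cmd("baidu/ERNIE-4.5-0.3B-Paddle"))
```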

The downloaded model is saved under the PaddlePaddle/ERNIE-4.5-0.3B-Paddle directory:
```bash
(base) livingbody@gaint:~$ ls PaddlePaddle/ERNIE-4.5-0.3B-Paddle/ -la
total 706276
drwxrwxr-x 3 livingbody livingbody 4096 Jul 6 16:04 .
drwxrwxr-x 3 livingbody livingbody 4096 Jul 6 16:39 ..
-rw-rw-r-- 1 livingbody livingbody 23133 Jul 6 16:04 added_tokens.json
-rw-rw-r-- 1 livingbody livingbody 556 Jul 6 16:04 config.json
-rw-rw-r-- 1 livingbody livingbody 125 Jul 6 16:04 generation_config.json
-rw-rw-r-- 1 livingbody livingbody 11366 Jul 6 16:04 LICENSE
-rw-rw-r-- 1 livingbody livingbody 721508576 Jul 6 16:04 model.safetensors
-rw------- 1 livingbody livingbody 658 Jul 6 16:04 .msc
-rw-rw-r-- 1 livingbody livingbody 67 Jul 6 16:18 .mv
-rw-rw-r-- 1 livingbody livingbody 7690 Jul 6 16:04 README.md
-rw-rw-r-- 1 livingbody livingbody 15404 Jul 6 16:04 special_tokens_map.json
drwxrwxr-x 2 livingbody livingbody 4096 Jul 6 16:04 ._tmp
-rw-rw-r-- 1 livingbody livingbody 1248 Jul 6 16:04 tokenizer_config.json
-rw-rw-r-- 1 livingbody livingbody 1614363 Jul 6 16:04 tokenizer.model
```
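As a quick sanity check, the model.safetensors size lines up with the model's advertised size, assuming a 2-byte (bf16/fp16) parameter dtype:

```python
# Quick sanity check: at an assumed 2 bytes per parameter (bf16/fp16),
# the model.safetensors size above implies roughly 0.36B parameters,
# consistent with the "0.3B" in the model name.
SAFETENSORS_BYTES = 721_508_576
params_est = SAFETENSORS_BYTES / 2
print(f"~{params_est / 1e9:.2f}B parameters")  # → ~0.36B parameters
```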
2. Calling the model
Once started, the server logs the following endpoints, which are then available to call:
```bash
INFO 2025-07-06 16:05:14,001 11789 engine.py[line:276] Worker processes are launched with 15.871807098388672 seconds.
INFO 2025-07-06 16:05:14,001 11789 api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
INFO 2025-07-06 16:05:14,002 11789 api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
INFO 2025-07-06 16:05:14,002 11789 api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
```
Calls go through the URL; no API key is required (any placeholder value works).
```python
import openai

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "You are a very useful assistant."},
        {"role": "user", "content": "Please talk about the SUN"},
    ],
    stream=True,
)
# Print each streamed delta as it arrives.
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
