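byzer-llm: a Ray-based solution covering the full lifecycle of large language models (LLMs), offering simple, efficient, low-cost pretraining, fine-tuning, and serving for everyone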

Project home: https://github.com/allwefantasy/byzer-llm

Manual: https://github.com/allwefantasy/byzer-llm/blob/master/docs/zh/001_%E4%B8%80%E4%B8%AA%E5%8A%AA%E5%8A%9B%E6%88%90%E4%B8%BA%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%BC%96%E7%A8%8B%E6%8E%A5%E5%8F%A3%E7%9A%84%E7%A5%9E%E5%A5%87Python%E5%BA%93.md

Byzer-LLM is built on Ray and covers the full lifecycle of large language models (LLMs): pretraining, fine-tuning, deployment, and inference serving.

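What sets Byzer-LLM apart:

  1. Full lifecycle management: it supports the whole pipeline of pretraining, fine-tuning, deployment, and inference serving
  2. Python and SQL API interfaces
  3. A Ray-based architecture that scales out easily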

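Terminology

SaaS model: an LLM/AI capability packaged as a cloud service and billed to companies or individuals by subscription or per call. This is what people usually mean when they talk about calling a hosted AI model.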

Installation and usage

Install directly with pip:

pip install byzer-llm

Start the Ray service:

ray start --head

Once the service is up, it prints:

Local node IP: 192.168.0.95
/home/skywalk/minipy312/lib/python3.12/site-packages/ray/thirdparty_files/psutil/__init__.py:2017: RuntimeWarning: shared, active, inactive memory stats couldn't be determined and were set to 0
  ret = _psplatform.virtual_memory()

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='192.168.0.95:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

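Installing Auto-Coder to get byzer-llm

According to the manual, installing byzer-llm directly with pip makes the later configuration steps tedious. Installing Auto-Coder instead sets everything up automatically. Run the following together:

pip install pip -U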
pip install -U auto-coder
ray start --head

Output:

ray start --head
Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]:
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.25.183.186

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.25.183.186:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    127.0.0.1:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.

This is where it gets interesting: let's put together a Ray cluster.

Joining the Ray cluster

The Ray head node was started earlier on 192.168.1.5. Now add a second machine to it:

ray start --address='192.168.1.5:6379'

The node has joined:

ray status
======== Autoscaler status: 2025-11-01 19:16:47.050543 ========
Node status
---------------------------------------------------------------
Active:
 1 node_dfa56c248840a471c7b0be2e7dbeb3fb28041a6010a6887cabf07c79
 1 node_f32f6c64e9df213e0c403f833f54d28a0493fe8d800284f46e9f878c
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/28.0 CPU
 0.0/1.0 GPU
 0B/23.64GiB memory
 0B/10.13GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
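
When scripting against the cluster, the Resources section of this output can be read programmatically. The following is a minimal sketch that assumes the text layout shown above stays as it is; `parse_ray_resources` is a hypothetical helper written for this post, not part of Ray.

```python
import re

# Sample copied from the `ray status` output above.
SAMPLE = """\
Total Usage:
 0.0/28.0 CPU
 0.0/1.0 GPU
 0B/23.64GiB memory
 0B/10.13GiB object_store_memory
"""

# Matches lines such as " 0.0/28.0 CPU" or " 0B/23.64GiB memory".
_LINE = re.compile(r"^\s*([\d.]+)([A-Za-z]*)/([\d.]+)([A-Za-z]*)\s+(\w+)\s*$")


def parse_ray_resources(text):
    """Return {resource_name: (used, total, unit_of_total)} for matching lines."""
    resources = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            used, _used_unit, total, unit, name = match.groups()
            resources[name] = (float(used), float(total), unit)
    return resources


if __name__ == "__main__":
    for name, (used, total, unit) in parse_ray_resources(SAMPLE).items():
        print(f"{name}: {used} of {total} {unit}".rstrip())
```

With two nodes joined, the CPU total of 28.0 is the sum over both machines, so a parsed total that is lower than expected is a quick hint that a node did not join.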

Applications

Using byzer-llm as an LLM relay

Skipped for now, since I am not yet familiar with it.

The command is:

byzerllm deploy --pretrained_model_type saas/openai \
--cpus_per_worker 0.001 \
--gpus_per_worker 0 \
--num_workers 3 \
--infer_params saas.api_key=${MODEL_OPENAI_TOKEN} saas.model=gpt-3.5-turbo-0125 \
--model gpt3_5_chat
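
The command above relies on the shell expanding `${MODEL_OPENAI_TOKEN}`; if that variable is unset, the key is silently passed as an empty string. The sketch below rebuilds the same argument list in Python so the variable can be checked first. The flags and values are copied from the command above; `build_deploy_command` is a hypothetical helper, and the script only performs a dry run that prints the command with the key redacted.

```python
import os
import shlex


def build_deploy_command(api_key, model_alias="gpt3_5_chat",
                         saas_model="gpt-3.5-turbo-0125", num_workers=3):
    """Build the argv list for the `byzerllm deploy` call shown above."""
    return [
        "byzerllm", "deploy",
        "--pretrained_model_type", "saas/openai",
        "--cpus_per_worker", "0.001",
        "--gpus_per_worker", "0",
        "--num_workers", str(num_workers),
        # Two separate arguments follow --infer_params, as in the shell form.
        "--infer_params", f"saas.api_key={api_key}", f"saas.model={saas_model}",
        "--model", model_alias,
    ]


if __name__ == "__main__":
    token = os.environ.get("MODEL_OPENAI_TOKEN", "")
    if not token:
        print("MODEL_OPENAI_TOKEN is not set; the API key would be empty.")
    else:
        # Dry run only: print the command with the key redacted.
        print(shlex.join(build_deploy_command("<redacted>")))
```

To actually deploy, the list returned by `build_deploy_command(token)` could be handed to `subprocess.run(..., check=True)` on a machine where byzerllm is installed and Ray is running.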

Using byzer-llm to launch a local LLM

This is recorded in a separate document.

Debugging

Error: server_ Failed to start the grpc server

  File "/home/skywalk/minipy312/lib/python3.12/site-packages/ray/_private/node.py", line 796, in _init_gcs_client
    raise RuntimeError(
RuntimeError: Failed to start GCS. Last 1 lines of error files:
2025-05-28 19:20:11,803 C 54659 54659 (gcs_server) grpc_server.cc:128: Check failed: server_ Failed to start the grpc server. The specified port is 6379. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running sudo lsof -i :6379 to check if there are other processes listening to the port.
.Please check /tmp/ray/session_2025-05-28_19-20-11_612410_54651/logs/gcs_server.out for details. Last connection error: None
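
The message suggests checking whether another process is already listening on port 6379. Where `lsof` is not available, a quick probe from Python gives the same yes/no answer. This is a minimal sketch using only the standard library; `port_in_use` is a hypothetical helper, and it only checks for a TCP listener on the given host (127.0.0.1 by default).

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 6379 is the GCS port named in the error message above.
    print("port 6379 in use:", port_in_use(6379))
```

A `True` result before `ray start --head` means a leftover Ray process or an unrelated service (a separately installed redis also defaults to 6379) is holding the port.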

Error: Session name session_2025-05-29_09-05-56_429060_59678 does not match persisted value

AssertionError: Session name session_2025-05-29_09-05-56_429060_59678 does not match persisted value b'session_2025-05-28_19-39-24_560590_55048'. Perhaps there was an error connecting to Redis.

Clear /tmp/ray/:

rm -rf /tmp/ray/

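The error persists, so kill the Ray processes:

# Kill all Ray processes (note: this matches any process whose command line contains "ray")
ps aux | grep ray | grep -v grep | awk '{print $2}' | xargs kill -9

# Kill any leftover redis-server (on the assumption that Ray started its own)
ps aux | grep redis | grep -v grep | awk '{print $2}' | xargs kill -9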

Still failing.

Install redis:

sudo pkg install redis

To setup "redis" you need to edit the configuration file:
/usr/local/etc/redis.conf

To run redis from startup, add redis_enable="YES"
in your /etc/rc.conf.

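Starting redis makes no difference; the error is still there.

Shelved for now.

In bash on FreeBSD, with a Python 3.12 installed through the Linux compatibility layer, Auto-Coder installs fine but Ray will not start.

Startup error: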

  File "/home/skywalk/minipy312/lib/python3.12/site-packages/ray/_private/node.py", line 364, in __init__
    self.start_head_processes()
  File "/home/skywalk/minipy312/lib/python3.12/site-packages/ray/_private/node.py", line 1458, in start_head_processes
    self.start_gcs_server()
  File "/home/skywalk/minipy312/lib/python3.12/site-packages/ray/_private/node.py", line 1225, in start_gcs_server
    process_info = ray._private.services.start_gcs_server(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skywalk/minipy312/lib/python3.12/site-packages/ray/_private/services.py", line 1515, in start_gcs_server
    stdout_file = open(os.devnull, "w")
                  ^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/dev/null'

I checked, and /dev/null does have write permission, yet the error would not go away.

So switch into the Linux compatibility environment instead:

sudo chroot /compat/ubuntu22/ /bin/bash

Then install:

pip install pip -U
pip install -U auto-coder
ray start --head

Ray startup error: Ray component worker_ports is trying to use a port number 12868 that is used by other components.

Startup command:

ray start --head

    raise ValueError(
ValueError: Ray component worker_ports is trying to use a port number 12868 that is used by other components.
Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 6379, 'client_server': 10001, 'dashboard': 8265, 'dashboard_agent_grpc': 12868, 'dashboard_agent_http': 52365, 'runtime_env_agent': 33302, 'metrics_export': 63589, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}
If you allocate ports, please make sure the same port is not used by multiple components.
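
The port information in this message is enough to see which component collides with the worker range. The sketch below copies the dictionary from the error and lists every fixed port that falls inside 10002 to 19999; `find_worker_port_conflicts` is a hypothetical helper for illustration, not a Ray API.

```python
# Port information copied from the error message above.
PORT_INFO = {
    "gcs": "random",
    "object_manager": "random",
    "node_manager": "random",
    "gcs_server": 6379,
    "client_server": 10001,
    "dashboard": 8265,
    "dashboard_agent_grpc": 12868,
    "dashboard_agent_http": 52365,
    "runtime_env_agent": 33302,
    "metrics_export": 63589,
    "redis_shards": "random",
}

# "9998 ports from 10002 to 19999" in the message.
WORKER_PORT_RANGE = range(10002, 20000)


def find_worker_port_conflicts(port_info, worker_range):
    """Return {component: port} for fixed ports inside the worker port range."""
    return {
        name: port
        for name, port in port_info.items()
        if isinstance(port, int) and port in worker_range
    }


if __name__ == "__main__":
    print(find_worker_port_conflicts(PORT_INFO, WORKER_PORT_RANGE))
    # prints {'dashboard_agent_grpc': 12868}
```

Only dashboard_agent_grpc lands in the worker range here. If the collision keeps recurring, narrowing the worker range with --min-worker-port and --max-worker-port (both named in the earlier gRPC error message) is one option to try.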

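Not a big problem. My guess was that the port was still held by the earlier process, although the port information shows dashboard_agent_grpc on 12868, which lies inside the worker port range 10002 to 19999, so the collision may simply be between Ray's own components.

Run ray start --head once more and, what do you know, it starts:

ray start --head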
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 192.168.1.5
/home/skywalk/py312/lib/python3.12/site-packages/ray/thirdparty_files/psutil/__init__.py:2017: RuntimeWarning: shared, active, inactive memory stats couldn't be determined and were set to 0
  ret = _psplatform.virtual_memory()

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='192.168.1.5:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    127.0.0.1:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.

Error when joining the cluster: RuntimeError: Version mismatch

ray start --address='192.168.1.5:6379'

  File "/home/skywalk/py312/lib/python3.12/site-packages/ray/scripts/scripts.py", line 1164, in start
    node.check_version_info()
  File "/home/skywalk/py312/lib/python3.12/site-packages/ray/_private/node.py", line 454, in check_version_info
    ray._private.utils.check_version_info(
  File "/home/skywalk/py312/lib/python3.12/site-packages/ray/_private/utils.py", line 1569, in check_version_info
    raise RuntimeError(error_message)
RuntimeError: Version mismatch: The cluster was started with:
    Ray: 2.47.1
    Python: 3.12.9
This process on node 172.25.183.186 was started with:
    Ray: 2.47.1
    Python: 3.12.3
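
As the message shows, the join is rejected even though only the Python patch version differs (3.12.9 against 3.12.3). A small pre-flight check on the joining machine can catch this before running ray start. This is a sketch under assumptions: the two dictionaries are copied by hand from the error above, and `version_mismatches` is a hypothetical helper, not something Ray provides.

```python
import platform

# Values copied from the error message above.
CLUSTER = {"Ray": "2.47.1", "Python": "3.12.9"}
JOINING_NODE = {"Ray": "2.47.1", "Python": "3.12.3"}


def version_mismatches(cluster, node):
    """Return [(component, cluster_version, node_version)] for every difference."""
    return [
        (name, cluster[name], node.get(name))
        for name in cluster
        if cluster[name] != node.get(name)
    ]


if __name__ == "__main__":
    for name, want, have in version_mismatches(CLUSTER, JOINING_NODE):
        print(f"{name}: cluster has {want}, this node has {have}")
    # Version of the interpreter running this script, for comparison.
    print("local Python:", platform.python_version())
```

Running it with the interpreter that will launch Ray shows at a glance whether the local Python matches the head node.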

Upgrade the local Python from 3.12.3 to 3.12.9 (strictly speaking, this is a reinstall). First install pyenv:

curl https://pyenv.run | bash

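Install Python 3.12.9 (and make sure the new interpreter is the one actually in use afterwards, for example via pyenv global 3.12.9):

pyenv install 3.12.9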

Install Auto-Coder:

pip install pip -U
pip install -U auto-coder

Join the Ray cluster:

ray start --address='192.168.1.5:6379'

The join succeeds. Check the status:

ray status
======== Autoscaler status: 2025-11-01 19:16:47.050543 ========
Node status
---------------------------------------------------------------
Active:
 1 node_dfa56c248840a471c7b0be2e7dbeb3fb28041a6010a6887cabf07c79
 1 node_f32f6c64e9df213e0c403f833f54d28a0493fe8d800284f46e9f878c
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/28.0 CPU
 0.0/1.0 GPU
 0B/23.64GiB memory
 0B/10.13GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)