Here is a screenshot of the OpenAI Data Analysis tool:
The basic usage is simple: upload a file, then run all kinds of statistics and visualizations on it through natural language.
Today I'd like to introduce a similar tool, built on Byzer-Agent, that works much like OpenAI Data Analysis.
A quick digression first
From the previous post, Byzer-LLM supports both the open-source and SaaS versions of Tongyi Qianwen, we know that Byzer-LLM lets users deploy and invoke the mainstream open-source LLMs and SaaS models on the market through one unified interface.
The post before that, Bringing Function Calling and Respond With Class to open-source LLMs, covered a number of other interesting features Byzer-LLM provides. Beyond deploying models, Byzer-LLM can also pretrain LLMs from scratch and fine-tune them.
The feature set is broad because we treat Byzer-LLM as LLM infrastructure: everything directly related to large models is folded into it. A glance at the architecture diagram makes this clear.
Byzer-Agent
Byzer-Agent is a distributed agent framework. Like Byzer-LLM, it is built on top of Ray. You can visit its homepage:
https://github.com/allwefantasy/byzer-agent
Data Analysis
Data Analysis, built on Byzer-Agent, is a typical multi-agent example. Here is the architecture diagram:
To run the tool, you need to install the following in advance:
- Byzer-LLM : https://github.com/allwefantasy/byzer-llm
Follow the installation instructions in its docs; it essentially boils down to installing a set of packages, which is straightforward. Then start a Ray instance and you are ready to go.
The easiest way to get it running is to sign up for one of the domestic SaaS models. Here I use Qwen.
First, connect to the cluster:
```python
import ray
from byzerllm.utils.client import ByzerLLM, Templates

ray.init(address="auto", namespace="default")
```
Next, deploy a SaaS model:
```python
llm = ByzerLLM()

chat_model_name = "qianwen_chat"

# Undeploy first if a model with the same name already exists
if llm.is_model_exist(chat_model_name):
    llm.undeploy(chat_model_name)

llm.setup_num_workers(1).setup_gpus_per_worker(0)
llm.deploy(model_path="",
           pretrained_model_type="saas/qianwen",
           udf_name=chat_model_name,
           infer_params={
               "saas.api_key": "xxxx",
               "saas.model": "qwen-max-1201"
           })
```
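To sanity-check the deployment, and to see the unified invocation interface mentioned earlier, you can send a quick test message. This is a minimal sketch and assumes the chat_oai call behaves as in the Byzer-LLM examples:
```python
# Make the freshly deployed model the default target for chat calls
llm.setup_default_model_name(chat_model_name)

# Send a single test message; the result is a list and the generated text is in .output
t = llm.chat_oai(conversations=[{"role": "user", "content": "Hello, who are you?"}])
print(t[0].output)
```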
Of course, you can deploy an open-source model in the same way; just specify the model_path and the model type. For example, deploying an open-source model looks like this:
```python
import ray
from byzerllm.utils.client import ByzerLLM, InferBackend

ray.init(address="auto", namespace="default")

llm = ByzerLLM()
chat_model_name = "qianwen_chat"
model_location = "/home/byzerllm/models/Qwen-72B-Chat"

llm.setup_gpus_per_worker(8).setup_num_workers(1).setup_infer_backend(InferBackend.VLLM)
llm.deploy(
    model_path=model_location,
    pretrained_model_type="custom/auto",
    udf_name=chat_model_name,
    infer_params={}
)
```
Here I downloaded the Qwen-72B-Chat model and deployed it.
Now we can start a Data Analysis instance:
```python
from byzerllm.apps.agent import Agents
from byzerllm.apps.agent.user_proxy_agent import UserProxyAgent
from byzerllm.apps.agent.extensions.data_analysis import DataAnalysis
from byzerllm.utils.client import ByzerLLM, Templates

chat_model_name = "qianwen_chat"

llm = ByzerLLM()
llm.setup_default_model_name(chat_model_name)
llm.setup_template(chat_model_name, template=Templates.qwen())

user = Agents.create_local_agent(UserProxyAgent, "user", llm, None,
                                 human_input_mode="NEVER",
                                 max_consecutive_auto_reply=0)

data_analysis = DataAnalysis("chat4", "william",
                             "/home/byzerllm/projects/jupyter-workspace/test.csv",
                             llm, None)

# Helper to render a base64-encoded image returned by the visualization agent
def show_image(content):
    from IPython.display import display, Image
    import base64
    img = Image(base64.b64decode(content))
    display(img)
```
When we initialize DataAnalysis, there is effectively a session concept: the first two parameters uniquely identify the session. During initialization it automatically communicates with the PreviewAgent to preview the file we specified. Below is the automatic communication between the agents during initialization:
````text
use_shared_disk: False file_path: /home/byzerllm/projects/jupyter-workspace/test.csv new_file_path: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to privew_file_agent):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) We have a file, the file path is: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv , please preview this file
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) privew_file_agent (to python_interpreter):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) Here's the Python code that meets your requirements:
(DataAnalysisPipeline pid=2134293) ```python
(DataAnalysisPipeline pid=2134293) import pandas as pd
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) file_path = "/home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv"
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) try:
(DataAnalysisPipeline pid=2134293) # Read the file based on its suffix
(DataAnalysisPipeline pid=2134293) if file_path.endswith(".csv"):
(DataAnalysisPipeline pid=2134293) df = pd.read_csv(file_path)
(DataAnalysisPipeline pid=2134293) elif file_path.endswith(".xlsx") or file_path.endswith(".xls"):
(DataAnalysisPipeline pid=2134293) df = pd.read_excel(file_path)
(DataAnalysisPipeline pid=2134293) else:
(DataAnalysisPipeline pid=2134293) raise ValueError(f"Unsupported file type: {file_path}")
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) # Set the flag to indicate successful loading
(DataAnalysisPipeline pid=2134293) loaded_successfully = True
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) # Show the first 5 rows of the file
(DataAnalysisPipeline pid=2134293) file_preview = df.head()
(DataAnalysisPipeline pid=2134293) except Exception as e:
(DataAnalysisPipeline pid=2134293) # Set the flag to indicate failed loading
(DataAnalysisPipeline pid=2134293) loaded_successfully = False
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) # Show an error message
(DataAnalysisPipeline pid=2134293) file_preview = f"Error occurred while loading the file: {str(e)}"
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) # Print the result
(DataAnalysisPipeline pid=2134293) print(file_preview)
(DataAnalysisPipeline pid=2134293) ```
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) In this code, we first define the `file_path` variable to store the path of the file to be previewed. Then, we use a `try`-`except` block to handle possible exceptions during file loading.
(DataAnalysisPipeline pid=2134293) Inside the `try` block, we check the suffix of the file path to determine how to read the file. If the file is a CSV file, we use the `pd.read_csv()` function to load it into a DataFrame. If the file is an Excel file (either .xlsx or .xls format), we use the `pd.read_excel()` function to load it. If the file has an unsupported suffix, we raise a `ValueError` exception.
(DataAnalysisPipeline pid=2134293) If the file is loaded successfully, we set the `loaded_successfully` flag to `True`, and use the `head()` method of the DataFrame to get the first 5 rows of the file, which is stored in the `file_preview` variable.
(DataAnalysisPipeline pid=2134293) If any exception occurs during file loading, we set the `loaded_successfully` flag to `False`, and store an error message in the `file_preview` variable.
(DataAnalysisPipeline pid=2134293) Finally, we print the contents of the `file_preview` variable to show the result of file previewing.
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) python_interpreter (to privew_file_agent):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) exitcode: 0 (execution succeeded)
(DataAnalysisPipeline pid=2134293) Code output: ID Deaths Year Entity
(DataAnalysisPipeline pid=2134293) 0 1 1267360 1900 All natural disasters
(DataAnalysisPipeline pid=2134293) 1 2 200018 1901 All natural disasters
(DataAnalysisPipeline pid=2134293) 2 3 46037 1902 All natural disasters
(DataAnalysisPipeline pid=2134293) 3 4 6506 1903 All natural disasters
(DataAnalysisPipeline pid=2134293) 4 5 22758 1905 All natural disasters
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) privew_file_agent (to data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) ID,Deaths,Year,Entity
(DataAnalysisPipeline pid=2134293) 1,1267360,1900,All natural disasters
(DataAnalysisPipeline pid=2134293) 2,200018,1901,All natural disasters
(DataAnalysisPipeline pid=2134293) 3,46037,1902,All natural disasters
(DataAnalysisPipeline pid=2134293) 4,6506,1903,All natural disasters
(DataAnalysisPipeline pid=2134293) 5,22758,1905,All natural disasters
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) sync the conversation of preview_file_agent to other agents
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to assistant_agent):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) We have a file, the file path is: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv , please preview this file
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to assistant_agent):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) ID,Deaths,Year,Entity
(DataAnalysisPipeline pid=2134293) 1,1267360,1900,All natural disasters
(DataAnalysisPipeline pid=2134293) 2,200018,1901,All natural disasters
(DataAnalysisPipeline pid=2134293) 3,46037,1902,All natural disasters
(DataAnalysisPipeline pid=2134293) 4,6506,1903,All natural disasters
(DataAnalysisPipeline pid=2134293) 5,22758,1905,All natural disasters
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to visualization_agent):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) We have a file, the file path is: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv , please preview this file
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to visualization_agent):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) ID,Deaths,Year,Entity
(DataAnalysisPipeline pid=2134293) 1,1267360,1900,All natural disasters
(DataAnalysisPipeline pid=2134293) 2,200018,1901,All natural disasters
(DataAnalysisPipeline pid=2134293) 3,46037,1902,All natural disasters
(DataAnalysisPipeline pid=2134293) 4,6506,1903,All natural disasters
(DataAnalysisPipeline pid=2134293) 5,22758,1905,All natural disasters
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
````
The flow is quite clear.
Next, let's ask a question:
```python
# "Using the file, compute the total number of deaths in 1901"
data_analysis.analyze("根据文件统计下1901年总死亡人数")
```
The corresponding agent collaboration process looks like this:
````text
(DataAnalysisPipeline pid=2134293) user_data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) 根据文件统计下1901年总死亡人数
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) Select agent: assistant_agent to answer the question: 根据文件统计下1901年总死亡人数
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to assistant_agent):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) 根据文件统计下1901年总死亡人数
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) assistant_agent (to python_interpreter):
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) ```python
(DataAnalysisPipeline pid=2134293) # filename: stats.py
(DataAnalysisPipeline pid=2134293) import pandas as pd
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) def get_total_deaths_year(year):
(DataAnalysisPipeline pid=2134293) df = pd.read_csv("/home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv")
(DataAnalysisPipeline pid=2134293) total_deaths = df[df["Year"] == year]["Deaths"].sum()
(DataAnalysisPipeline pid=2134293) return total_deaths
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) total_deaths_1901 = get_total_deaths_year(1901)
(DataAnalysisPipeline pid=2134293) print(f"The total number of deaths in 1901 is {total_deaths_1901}.")
(DataAnalysisPipeline pid=2134293) ```
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) Run the above Python script to calculate the total number of deaths in 1901.
(DataAnalysisPipeline pid=2134293)
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
````
You can run:
```python
o = data_analysis.output()
o["content"]
```
to get the final output:
```text
exitcode: 0 (execution succeeded)
Code output: The total number of deaths in 1901 is 400036.
```
We can also do some visualization:
```python
# "Plot a line chart of Deaths by Year"
data_analysis.analyze("根据年份按 Deaths 绘制一个折线图")
```
The final result is the base64 encoding of an image.
You can display the image with the following code:
```python
o = data_analysis.output()
show_image(o["content"])
```
The image looks like this:
The curve in this chart looks a bit messy; I'm not sure whether that is a data issue or a plotting issue.
You can also inspect all of the conversations:
```python
data_analysis.get_chat_messages()
```
This shows the message history with every agent:
Finally, you can call close to end the session and release its resources. If the user has no further interaction within 60 minutes, the resources are released automatically as well.
```python
data_analysis.close()
```
You can easily wrap all of this into a web service and offer data analysis as a service.
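For example, here is a minimal sketch of such a wrapper. The use of FastAPI, the endpoint paths, and the request fields are my own illustrative choices, not part of Byzer-Agent; only the DataAnalysis calls mirror the ones shown above:
```python
import ray
from fastapi import FastAPI
from pydantic import BaseModel

from byzerllm.utils.client import ByzerLLM, Templates
from byzerllm.apps.agent.extensions.data_analysis import DataAnalysis

# Connect to the same Ray cluster used above
ray.init(address="auto", namespace="default", ignore_reinit_error=True)

llm = ByzerLLM()
llm.setup_default_model_name("qianwen_chat")
llm.setup_template("qianwen_chat", template=Templates.qwen())

app = FastAPI()

# Keep one DataAnalysis session per (session_id, owner) pair
sessions = {}

class AnalyzeRequest(BaseModel):
    session_id: str
    owner: str
    file_path: str
    question: str

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    key = (req.session_id, req.owner)
    if key not in sessions:
        sessions[key] = DataAnalysis(req.session_id, req.owner, req.file_path, llm, None)
    data_analysis = sessions[key]
    data_analysis.analyze(req.question)
    o = data_analysis.output()
    return {"content": o["content"]}

@app.post("/close")
def close(session_id: str, owner: str):
    key = (session_id, owner)
    if key in sessions:
        sessions.pop(key).close()
    return {"status": "closed"}
```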
Summary
DataAnalysis shows that the Byzer-Agent framework can effectively support building fairly complex LLM applications, and Byzer-LLM makes it easy to interact with the underlying models. Next time we'll take a look at Byzer-Retrieval; stay tuned.