Building an Intelligent Data Analysis Tool on the Byzer-Agent Framework

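Below is a screenshot of OpenAI's Data Analysis tool:

The basic usage is to upload a file and then run all kinds of statistics and visualizations on it using natural language.

Today I'd like to introduce a similar tool built on Byzer-Agent.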

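A quick digression first

From the previous post, "Byzer-LLM Supports Both the Open-Source and SaaS Versions of Tongyi Qianwen", we know that Byzer-LLM lets users deploy and call the mainstream open-source LLMs and SaaS models on the market in one unified way.

In the post before that, "Bringing Function Calling and Respond With Class to Open-Source LLMs", we covered a number of interesting features that Byzer-LLM provides. Beyond deploying LLMs, Byzer-LLM can also pretrain an LLM from scratch and fine-tune one.

The feature list is long because we treat Byzer-LLM as LLM infrastructure: anything directly related to large models gets folded into Byzer-LLM. A glance at the architecture diagram makes this clear.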

Byzer-Agent

Byzer-Agent is a distributed agent framework. Like Byzer-LLM, it is built on Ray. You can visit its homepage:

https://github.com/allwefantasy/byzer-agent

Data Analysis

Data Analysis, built on Byzer-Agent, is a typical multi-agent example. Below is the architecture diagram:

To run the tool, you need to install the following beforehand:

  1. Byzer-LLM : https://github.com/allwefantasy/byzer-llm

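Install it as the documentation describes, which mostly means installing a set of packages. It is straightforward. Then start a Ray instance and you are ready to go.

The easiest way to get running is to apply for access to a SaaS model from a provider in China. I use Qwen here.

First, connect to the cluster: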

import ray
from byzerllm.utils.client import ByzerLLM,Templates
ray.init(address="auto",namespace="default")

Next, start a SaaS model:

llm = ByzerLLM()
chat_model_name = "qianwen_chat"


if llm.is_model_exist(chat_model_name):
    llm.undeploy(chat_model_name)


llm.setup_num_workers(1).setup_gpus_per_worker(0)
llm.deploy(model_path="",
        pretrained_model_type="saas/qianwen",
        udf_name=chat_model_name,
        infer_params={
            "saas.api_key":"xxxx",            
            "saas.model":"qwen-max-1201"
        })

Of course, you can start an open-source model in the same way; you only need to specify model_path and the model type. For example:

import ray
from byzerllm.utils.client import ByzerLLM, InferBackend

ray.init(address="auto", namespace="default")
llm = ByzerLLM()

chat_model_name = "qianwen_chat"
model_location = "/home/byzerllm/models/Qwen-72B-Chat"

# One worker using 8 GPUs, served through the vLLM backend
llm.setup_gpus_per_worker(8).setup_num_workers(1).setup_infer_backend(InferBackend.VLLM)
llm.deploy(
    model_path=model_location,
    pretrained_model_type="custom/auto",
    udf_name=chat_model_name,
    infer_params={}
)

Here I downloaded the Qwen-72B-Chat model and deployed it.

Now we can start a Data Analysis instance:

from byzerllm.apps.agent import Agents
from byzerllm.apps.agent.user_proxy_agent import UserProxyAgent
from byzerllm.apps.agent.extensions.data_analysis import DataAnalysis
from byzerllm.utils.client import ByzerLLM,Templates


chat_model_name = "qianwen_chat"
llm = ByzerLLM()
llm.setup_default_model_name(chat_model_name) 
llm.setup_template(chat_model_name,template=Templates.qwen())


user = Agents.create_local_agent(UserProxyAgent,"user",llm,None,
                                human_input_mode="NEVER",
                                max_consecutive_auto_reply=0)


data_analysis = DataAnalysis("chat4","william","/home/byzerllm/projects/jupyter-workspace/test.csv",
                             llm,None)


def show_image(content):
    from IPython.display import display, Image
    import base64             
    img = Image(base64.b64decode(content))
    display(img)

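Initializing DataAnalysis creates a session, and the first two arguments (here "chat4" and "william") together identify it uniquely. During initialization it automatically talks to the file preview agent (named privew_file_agent in the log) to preview the file we specified. Below is the automatic exchange among the agents during initialization: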

use_shared_disk: False file_path: /home/byzerllm/projects/jupyter-workspace/test.csv new_file_path: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to privew_file_agent):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) We have a file, the file path is: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv , please preview this file
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) privew_file_agent (to python_interpreter):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) Here's the Python code that meets your requirements:
(DataAnalysisPipeline pid=2134293) ```python
(DataAnalysisPipeline pid=2134293) import pandas as pd
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) file_path = "/home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv"
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) try:
(DataAnalysisPipeline pid=2134293)     # Read the file based on its suffix
(DataAnalysisPipeline pid=2134293)     if file_path.endswith(".csv"):
(DataAnalysisPipeline pid=2134293)         df = pd.read_csv(file_path)
(DataAnalysisPipeline pid=2134293)     elif file_path.endswith(".xlsx") or file_path.endswith(".xls"):
(DataAnalysisPipeline pid=2134293)         df = pd.read_excel(file_path)
(DataAnalysisPipeline pid=2134293)     else:
(DataAnalysisPipeline pid=2134293)         raise ValueError(f"Unsupported file type: {file_path}")
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293)     # Set the flag to indicate successful loading
(DataAnalysisPipeline pid=2134293)     loaded_successfully = True
(DataAnalysisPipeline pid=2134293)     
(DataAnalysisPipeline pid=2134293)     # Show the first 5 rows of the file
(DataAnalysisPipeline pid=2134293)     file_preview = df.head()
(DataAnalysisPipeline pid=2134293) except Exception as e:
(DataAnalysisPipeline pid=2134293)     # Set the flag to indicate failed loading
(DataAnalysisPipeline pid=2134293)     loaded_successfully = False
(DataAnalysisPipeline pid=2134293)     
(DataAnalysisPipeline pid=2134293)     # Show an error message
(DataAnalysisPipeline pid=2134293)     file_preview = f"Error occurred while loading the file: {str(e)}"
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) # Print the result
(DataAnalysisPipeline pid=2134293) print(file_preview)
(DataAnalysisPipeline pid=2134293) ```
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) In this code, we first define the `file_path` variable to store the path of the file to be previewed. Then, we use a `try`-`except` block to handle possible exceptions during file loading.
(DataAnalysisPipeline pid=2134293) Inside the `try` block, we check the suffix of the file path to determine how to read the file. If the file is a CSV file, we use the `pd.read_csv()` function to load it into a DataFrame. If the file is an Excel file (either .xlsx or .xls format), we use the `pd.read_excel()` function to load it. If the file has an unsupported suffix, we raise a `ValueError` exception.
(DataAnalysisPipeline pid=2134293) If the file is loaded successfully, we set the `loaded_successfully` flag to `True`, and use the `head()` method of the DataFrame to get the first 5 rows of the file, which is stored in the `file_preview` variable.
(DataAnalysisPipeline pid=2134293) If any exception occurs during file loading, we set the `loaded_successfully` flag to `False`, and store an error message in the `file_preview` variable.
(DataAnalysisPipeline pid=2134293) Finally, we print the contents of the `file_preview` variable to show the result of file previewing.
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) python_interpreter (to privew_file_agent):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) exitcode: 0 (execution succeeded)
(DataAnalysisPipeline pid=2134293) Code output:    ID   Deaths  Year                 Entity
(DataAnalysisPipeline pid=2134293) 0   1  1267360  1900  All natural disasters
(DataAnalysisPipeline pid=2134293) 1   2   200018  1901  All natural disasters
(DataAnalysisPipeline pid=2134293) 2   3    46037  1902  All natural disasters
(DataAnalysisPipeline pid=2134293) 3   4     6506  1903  All natural disasters
(DataAnalysisPipeline pid=2134293) 4   5    22758  1905  All natural disasters
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) privew_file_agent (to data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) ID,Deaths,Year,Entity
(DataAnalysisPipeline pid=2134293) 1,1267360,1900,All natural disasters
(DataAnalysisPipeline pid=2134293) 2,200018,1901,All natural disasters
(DataAnalysisPipeline pid=2134293) 3,46037,1902,All natural disasters
(DataAnalysisPipeline pid=2134293) 4,6506,1903,All natural disasters
(DataAnalysisPipeline pid=2134293) 5,22758,1905,All natural disasters
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) sync the conversation of preview_file_agent to other agents
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to assistant_agent):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) We have a file, the file path is: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv , please preview this file
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to assistant_agent):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) ID,Deaths,Year,Entity
(DataAnalysisPipeline pid=2134293) 1,1267360,1900,All natural disasters
(DataAnalysisPipeline pid=2134293) 2,200018,1901,All natural disasters
(DataAnalysisPipeline pid=2134293) 3,46037,1902,All natural disasters
(DataAnalysisPipeline pid=2134293) 4,6506,1903,All natural disasters
(DataAnalysisPipeline pid=2134293) 5,22758,1905,All natural disasters
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to visualization_agent):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) We have a file, the file path is: /home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv , please preview this file
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to visualization_agent):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) ID,Deaths,Year,Entity
(DataAnalysisPipeline pid=2134293) 1,1267360,1900,All natural disasters
(DataAnalysisPipeline pid=2134293) 2,200018,1901,All natural disasters
(DataAnalysisPipeline pid=2134293) 3,46037,1902,All natural disasters
(DataAnalysisPipeline pid=2134293) 4,6506,1903,All natural disasters
(DataAnalysisPipeline pid=2134293) 5,22758,1905,All natural disasters
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------

The message flow in this log is quite easy to follow.
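
To make the order of events in the initialization log easier to see, here is a toy sketch in plain Python. It is not Byzer-Agent code: preview_file, init_session and the list-based histories are hypothetical names invented for illustration, and only the agent names and the sample rows are copied from the log. It mimics two steps: a preview step returns the header plus the first five rows, and then both the request and the preview are replayed to the other agents so that they share the same context.

```python
import csv
import io

# The header and first rows, copied from the preview shown in the log
SAMPLE = (
    "ID,Deaths,Year,Entity\n"
    "1,1267360,1900,All natural disasters\n"
    "2,200018,1901,All natural disasters\n"
    "3,46037,1902,All natural disasters\n"
)

def preview_file(csv_text, n=5):
    """Return the header plus the first n data rows as CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows[: n + 1])
    return out.getvalue()

def init_session(csv_text, other_agents=("assistant_agent", "visualization_agent")):
    """Build one message history per agent (hypothetical structure)."""
    request = "We have a file, please preview this file"
    preview = preview_file(csv_text)
    histories = {"privew_file_agent": [request, preview]}
    # Replay the same two messages to every other agent
    for name in other_agents:
        histories[name] = [request, preview]
    return histories

histories = init_session(SAMPLE)
print(sorted(histories))
```

After this step every agent has seen the same file preview, which is why a later question can be routed to either the assistant or the visualization agent without reading the file again.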

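Next, let's ask a question (the prompt means "Based on the file, compute the total number of deaths in 1901"):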

data_analysis.analyze("根据文件统计下1901年总死亡人数")

The corresponding agent collaboration looks like this:

(DataAnalysisPipeline pid=2134293) user_data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) 根据文件统计下1901年总死亡人数
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) Select agent: assistant_agent to answer the question: 根据文件统计下1901年总死亡人数
(DataAnalysisPipeline pid=2134293) data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac (to assistant_agent):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) 根据文件统计下1901年总死亡人数
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------
(DataAnalysisPipeline pid=2134293) assistant_agent (to python_interpreter):
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) ```python
(DataAnalysisPipeline pid=2134293) # filename: stats.py
(DataAnalysisPipeline pid=2134293) import pandas as pd
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) def get_total_deaths_year(year):
(DataAnalysisPipeline pid=2134293)     df = pd.read_csv("/home/byzerllm/projects/jupyter-workspace/data_analysis_pp_e61639d1e6e6504af87495b8bf80ecac.csv")
(DataAnalysisPipeline pid=2134293)     total_deaths = df[df["Year"] == year]["Deaths"].sum()
(DataAnalysisPipeline pid=2134293)     return total_deaths
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) total_deaths_1901 = get_total_deaths_year(1901)
(DataAnalysisPipeline pid=2134293) print(f"The total number of deaths in 1901 is {total_deaths_1901}.")
(DataAnalysisPipeline pid=2134293) ```
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) Run the above Python script to calculate the total number of deaths in 1901.
(DataAnalysisPipeline pid=2134293) 
(DataAnalysisPipeline pid=2134293) --------------------------------------------------------------------------------

Run the following:

o = data_analysis.output()
o["content"]

This returns the final output:

exitcode: 0 (execution succeeded)
Code output: The total number of deaths in 1901 is 400036.
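
If you want to use this result in a program rather than read it, the text has to be split into an exit code and the printed output. The sketch below assumes the content always follows the two-part layout shown above ("exitcode: N ..." first, then "Code output: ..."); parse_execution_result is a hypothetical helper, not part of Byzer-Agent.

```python
import re

def parse_execution_result(content):
    """Split interpreter output into (exit_code, printed_text)."""
    match = re.match(r"exitcode:\s*(-?\d+)", content)
    if match is None:
        raise ValueError("no exitcode line found")
    exit_code = int(match.group(1))
    # Everything after the "Code output:" marker is the script's stdout
    _, marker, output = content.partition("Code output:")
    printed = output.strip() if marker else ""
    return exit_code, printed

sample = (
    "exitcode: 0 (execution succeeded)\n"
    "Code output: The total number of deaths in 1901 is 400036."
)
print(parse_execution_result(sample))
# (0, 'The total number of deaths in 1901 is 400036.')
```

A non-zero exit code would then be a signal to show the user an error instead of the answer.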

We can also do visualization (the prompt means "Plot a line chart of Deaths by year"):

data_analysis.analyze("根据年份按 Deaths 绘制一个折线图")

The final result is a base64-encoded image.

You can display the image with the following code:

o = data_analysis.output()
show_image(o["content"])
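
Outside a notebook, for example in a script, display() is not available, so you might write the image to disk instead. The sketch below assumes only what is stated above, namely that the content is a base64 string of image bytes. The image format is not stated, so the hypothetical helper save_base64_image guesses the file extension from the leading bytes.

```python
import base64
from pathlib import Path

# Leading bytes ("magic numbers") of two common image formats
SIGNATURES = {b"\x89PNG\r\n\x1a\n": ".png", b"\xff\xd8\xff": ".jpg"}

def save_base64_image(content, stem):
    """Decode base64 image data and write it to '<stem><ext>'."""
    data = base64.b64decode(content)
    suffix = next(
        (ext for sig, ext in SIGNATURES.items() if data.startswith(sig)),
        ".bin",  # unknown format: keep the bytes, use a neutral extension
    )
    path = Path(str(stem) + suffix)
    path.write_bytes(data)
    return path
```

With a live session this could be called as save_base64_image(o["content"], "deaths_by_year").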

The resulting image:

The curve looks a bit messy here; I am not sure whether the cause is the data or the plotting.
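
One possible explanation, which is an assumption and has not been checked against the file: the file may contain several rows per year. The 1901 total reported earlier, 400036, is exactly twice the 200018 shown for 1901 in the preview, which would be consistent with that. If raw rows are plotted in file order, the line visits the same year more than once and zigzags. A minimal sketch of the usual remedy is to sum per year and sort before plotting. The rows below are made-up toy values, not taken from the file, and deaths_per_year is a hypothetical helper.

```python
from collections import defaultdict

def deaths_per_year(rows):
    """Sum deaths per year and return (year, total) pairs sorted by year."""
    totals = defaultdict(int)
    for year, deaths in rows:
        totals[year] += deaths
    return sorted(totals.items())

# Toy rows (year, deaths): unsorted, with repeated years
toy_rows = [(1902, 30), (1900, 10), (1901, 5), (1900, 7), (1901, 5)]
print(deaths_per_year(toy_rows))
# [(1900, 17), (1901, 10), (1902, 30)]
```

Asking the tool to aggregate by year first, or to draw one line per Entity, would be a quick way to test this guess.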

You can also view the whole conversation:

data_analysis.get_chat_messages()

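This shows the message history with all agents:

Finally, you can call close to end the conversation and release its resources. If the user does not interact with the session for 60 minutes, the resources are also released automatically.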

data_analysis.close()
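
The 60-minute automatic release is handled by the framework. How it is implemented is not shown here, so the following is only a generic sketch of idle-timeout bookkeeping, with hypothetical names (IdleSessions, touch, expire_idle). Each session is keyed by the pair of identifying arguments, every interaction refreshes a timestamp, and a periodic sweep closes sessions that have been idle for too long.

```python
import time

class IdleSessions:
    """Track last activity per session and close the idle ones."""

    def __init__(self, timeout_seconds=60 * 60, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self._sessions = {}  # key -> (last_active, close_callback)

    def touch(self, key, close_callback=None):
        """Record an interaction; keep the earlier callback if none is given."""
        previous = self._sessions.get(key)
        if close_callback is None and previous is not None:
            close_callback = previous[1]
        self._sessions[key] = (self.clock(), close_callback)

    def expire_idle(self):
        """Close and forget every session idle for at least the timeout."""
        now = self.clock()
        expired = [
            key for key, (last, _) in self._sessions.items()
            if now - last >= self.timeout_seconds
        ]
        for key in expired:
            _, close_callback = self._sessions.pop(key)
            if close_callback is not None:
                close_callback()
        return expired
```

In an application of your own, the callback registered for a key such as ("chat4", "william") could simply be data_analysis.close.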

You can easily wrap this in a web service to offer data analysis to external users.
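
As a rough illustration of such a wrapper, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the JSON shape with a "question" field, the function names handle_analyze, make_handler and serve, and the port. The analysis itself is passed in as a plain callable, so the HTTP part stays independent of Byzer-Agent.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_analyze(body, analyze):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        question = json.loads(body)["question"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a 'question' field"}
        return 400, json.dumps(error).encode("utf-8")
    result = analyze(question)
    return 200, json.dumps({"content": result}, ensure_ascii=False).encode("utf-8")

def make_handler(analyze):
    """Build a request handler class bound to one analyze callable."""
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, data = handle_analyze(self.rfile.read(length), analyze)
            self.send_response(status)
            self.send_header("Content-Type", "application/json; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)
    return Handler

def serve(analyze, port=8000):
    """Serve POST requests on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(analyze)).serve_forever()
```

With a live session, the callable could call data_analysis.analyze(question) and then return data_analysis.output()["content"], the two calls used earlier in this post. A real service would also need authentication and one session per user.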

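Summary

DataAnalysis shows that the Byzer-Agent framework can effectively support building fairly complex LLM applications, while Byzer-LLM makes it easy to interact with large models. Next time we will talk about Byzer-Retrieval. Stay tuned.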