LLM 构建Data Multi-Agents 赋能数据分析平台的实践之②：数据治理之一

概述

数据治理不仅是产业数字化转型的基石，更是推动产业向更高层次、更精细化、更智能的方向发展的重要引擎。通过科学有效的数据治理实践，产业能够在数字化进程中实现数据驱动的决策与行动，最终达到转型升级的战略目标。
一、数据治理在产业数字化中的作用至关重要，主要体现在以下几个方面：

1、数据质量保障：数据治理首先确保产业数据的准确无误、完整一致和时效恰当。通过建立数据管理制度、执行数据质量检查和管控流程，产业能够显著提高数据质量，消除错误和冗余，增强对内部及外部数据源的信任程度，为后续数据分析和应用提供坚实的基础。

2、合规与风险管理：在产业数字化中，数据治理确保企业严格遵循法规要求，尤其是在数据隐私保护和信息安全方面。它帮助产业规避因数据滥用、泄露或不合规带来的法律责任和信誉损失，通过设定严格的访问权限、审计跟踪和数据生命周期管理，降低数据风险。

3、数据整合与分析：数据治理促进了跨部门、跨系统数据的有效整合，使得产业能更便捷地进行大数据分析和挖掘。通过数据标准化和统一视图的创建，产业可以直观地呈现数据，发掘隐藏的商业洞察，进而指导决策、优化流程、创新服务和产品。

4、供应链优化与协同：特别是在供应链领域，数据治理有助于实现供应链的全程可视化和透明化，提高响应速度，减少库存成本，改善物流效率，推动整个产业链条的协同效应。

5、构建数据资产：数据治理将原始数据转化为可信赖的数据资产，赋能企业在数字经济时代下构建核心竞争力。通过数据治理的各个环节，企业能够提炼出具有战略价值的数据资产，使其在市场竞争中占据优势地位。

当前数据治理工作存在以下问题：
二、从数据治理体系建设看：

1、制定的数据标准、数据规范形同虚设，很难真正借助数据标准知识来分析数据质量情况；

2、组织普遍存在"先建设后治理"或"应急式治理"的现象，即在信息化建设初期并未充分考虑数据治理的重要性，导致后期大量历史遗留问题难以解决，如元数据混乱、数据血缘不明晰等。

3、治理体系工作量大，项目整体资金支持不够，但是数据治理工作是一个繁琐的流程；
三、从数据治理产品来看：

1、技术与应用脱节：市场上的数据治理产品功能模块较为分散，包括元数据管理、数据标准管理、数据模型管理、数据质量管理、主数据管理等，这些工具之间集成度不高，可能造成治理碎片化，难以形成统一有效的治理能力。

执行与落地难题：

2、数据治理项目往往以项目制的形式推进，而数据治理本身应该是一个长期、持续的过程。短期项目式的治理很难彻底解决问题，且项目完成后往往缺乏延续性的运维和优化，导致治理效果不佳。

在前述文章中我们讨论了将LLM引入数据平台的设想、可行性及应用场景，并在该文中【LLM 构建Data Multi-Agents 赋能数据分析平台的实践之①：数据采集】构建了一个数据采集的multi-agents，本文将继续讨论LLM构建Data Multi Agents应用于数据治理领域的可能性，并提出数据治理的新方式。

整体设计

整体系统的流程是：

1）补充数据治理领域知识：建立数据标准私有知识库(包括数据规范、数据标准、数据指标方法、数据治理流程、数据质量检查方案等)--->建立知识检索系统(RAG)，检索与所需治理的数据相匹配的数据质量标准知识块；

2）建立数据治理Agents，根据第一步检索到的数据质量标准知识块作为限定条件，分析数据的合规性；

3）数据治理Agents可以根据llm内拥有的知识不依赖数据标准知识块进行主动数据质量问题的发现：如Z-score（标准化得分）、K近邻(KNN)异常检测、高斯混合模型(Gaussian Mixture Model, GMM)等，这些方法可以自主发现疑似异常值，但是缺少领域知识的嵌入，无法做到精准的异常检测，所以通过构建一个领域数据质量标准知识库，增强llm对领域知识的对齐能力。

通义千问数据质量问题检测方法：------>该方法缺少领域知识。

数据治理实践

本次实践要分析全国生猪出栏体重数据中存在的异常值，其中很重要的一个数据标准是生猪出栏体重限定在70~200kg，本次任务需要找出数据中超过该限值的数据项；

1.装载llm：因为算力不足的问题，本次实践使用llamacpp装载openchat.gguf模型。下载方式：huggingface国内镜像：https://hf-mirror.com/

python 复制代码

!HF_ENDPOINT=https://hf-mirror.com huggingface-cli download --resume-download second-state/OpenChat-3.5-0106-GGUF openchat-3.5-0106-Q8_0.gguf --local-dir ./Qwen/gpt2

加载llm模型：

python 复制代码

import os
from langchain.vectorstores import Chroma
from langchain.prompts import PromptTemplate
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

n_gpu_layers =0  # Change this value based on your model and your GPU VRAM pool.
n_batch = 5120  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./Qwen/gpt2/openchat-3.5-0106-Q8_0.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    max_tokens=200000,
    n_ctx=8912,
    callback_manager=callback_manager,
    verbose=True, # Verbose is required to pass to the callback manager
    context_window=4096,
)

2.构建数据质量分析Agents

1）工具库构建

--->工具库构建：①：数据块检索工具：Search_for_data_retriever，用于检索所需治理的数据

python 复制代码

df = pd.read_excel('data.xlsx')
model_id = "iic/nlp_corom_sentence-embedding_english-base"
embeddings = ModelScopeEmbeddings(model_id=model_id)
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path="./data.csv")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=150000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)
vectordb = Chroma.from_documents(documents=split_docs,embedding=embeddings)
retriever_tool = create_retriever_tool(
    vectordb.as_retriever(), "Search_for_data_retriever", "which you can retriever relate data block."
)

------>工具库构建构建：python_repl，用于生成python代码分析检索到的数据块

python 复制代码

repl = PythonAstREPLTool(
            locals={"df": df},
            name="python_repl",
            description="The tool is used to generate Python code analysis based on the data blocks retrieved by the first tool named 'Search_for_data_retriever',runs the code and outputs both the code and the results of the computation.",
            #args_schema=PythonInputs,
            )

工具库：tools = [repl,retriever_tool]

2）提示词工程

python 复制代码

TEMPLATE = """You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
It is important to understand the attributes of the dataframe before working with it. This is the result of running `df.head().to_markdown()`

<df>
{dhead}
</df>

You are not meant to use only these rows to answer questions - they are meant as a way of telling you about the shape and schema of the dataframe.
You also do not have use only the information here to answer questions - you can run intermediate queries to do exporatory data analysis to give you more information as needed.

You have a tool called `Search_for_data_retriever` through which you can retriever relate data block.And then you have a tool called python_repl through which you can write code to anylsis the data block retrivered by `Search_for_data_retriever`. Otherwise, try to solve it with code.
You possess two essential tools: 

1. `Search_for_data_retriever`: This tool allows you to fetch relevant data blocks by specifying a search query. For instance, you might use the query 'National pig slaughter weight data' to retrieve related information.
2. `python_repl`: With this tool, you can analyze and process the data retrieved from `Search_for_data_retriever` using Python code.

When facing a question, assess whether you need to employ these tools iteratively.

Example：
<question>It is known that the price interval for apples in the market is between 5 and 20 yuan per kilogram; your task is to detect any anomalous values within the set of apple market price quotes. </question>
<logic>
  First, To begin with, you should scrutinize the user's query, where in the user has furnished an essential detail: that the benchmark data range for apple prices is from 5 to 20 yuan per kilogram.
  Then, use `Search_for_data_retriever` to gather the necessary data.
  retriever output:apple market price
  Then, leverage `python_repl` to detect any anomalous values based on the retrieved data.
  the code write by python_repl:
  '''import pandas as pd
     df["outliers"] = (df["apple market price"] > 200) | (df["apple market price"] <5))
     df["outliers"] = df["outliers"].astype(int)
     df.dropna(subset=["outliers"], inplace=True)
     df.to_csv("apple market price anomalous.csv", index=False)
  the output should print the anomalous,and output a file named apple market price anomalous.csv.
</logic>
If you have not found the precise answer, please iteratively use the tools.
"""
template = TEMPLATE.format(dhead=df.head().to_markdown())

提示词工程提示：经过测试，提示词工程最好包括两个部分，即任务描述以及例子提示。
3）Multi-Agents构建

python 复制代码

#记忆链
memory = ConversationBufferMemory(
    memory_key="chat_history",
    human_prefix="Input",
    ai_prefix="Response",
    input_key="input",
    return_messages=False,
)
#agents构建
agent = ZeroShotAgent.from_llm_and_tools(
            llm=llm,
            tools=tools,
            prefix=template,
        )
agent_executor = AgentExecutor(agent=agent, tools=tools, max_iterations=25, handle_parsing_errors=True,early_stopping_method="generate",verbose=True,memory=memory)

**运行测试①：**不提供数据质量标准知识块，检测异常数据

python 复制代码

query = """
Utilize data quality anomaly detection algorithms to pinpoint irregularities within the dataset. 
Assess the overall data quality scenario encompassing dimensions like data consistency, integrity, and promptness, subsequently generating an exceptional data file as output. 
To identify unusual patterns specifically in the pigs' market weight dataset, execute queries targeting outlier data points. 
Furthermore, uncover additional anomalous records and compile them into a distinct abnormal data file.
"""
agent_executor.invoke({'input':query})

分析计算过程：

①自主构建数据质量检测算法Z-score并决定阈值为2，检测异常数据。

结果：该方法能获取到一些异常值，但是如果没有更多的领域知识给到系统，只能检测出有限的异常值。

**运行测试②：**问题提示中加入数据标准描述：Given that the known weight range for pigs at market is between 70kg and 200kg。

python 复制代码

agent_executor.invoke({'input':
					'Given that the known weight range for pigs at market is between 70kg and 200kg, there have lots outliers data, 
you would need to query the the pigs market weights dataset 
to identify the rows containing outlier data.
Generate a separate file for abnormal data.'})

分析运行过程：

①Data Multi Agents首先使用"Search_for_data_retriever"检索到相关的数据块：

②使用"python repl"生成数据分析代码

③运行结果并自我修正过程

④找到最优结果：

输出的代码

python 复制代码

df[(df[['Market Weight(Henan Province)/kg', 'Market Weight(Hunan Province)/kg', 'Market Weight(Anhui Province)/kg']] >= 70) & (df[['Market Weight(Henan Province)/kg', 'Market Weight(Hunan Province)/kg', 'Market Weight(Anhui Province)/kg']] <= 200)].to_csv('normal.csv')
 df[(df[['Market Weight(Henan Province)/kg', 'Market Weight(Hunan Province)/kg', 'Market Weight(Anhui Province)/kg']] < 70) | (df[['Market Weight(Henan Province)/kg', 'Market Weight(Hunan Province)/kg', 'Market Weight(Anhui Province)/kg']] > 200)].to_csv('outliers.csv')

结果：输出两份文件，一份将异常数据剔除，另一份异常数据

小结

通过本次测试，初步验证了将LLM构建的Data Multi-Agents嵌入数据平台用作数据治理的可行性，在测试过程中有如下问题尚需考虑：

1、Agents推理的速度、精度和准确度比较依赖大模型本身的逻辑推理能力，在初步测试中，我使用开源的通义千问、YI、百川等模型可以完成任务的规划，但是生成的python代码错误极多，导致Agents迭代自身依然找不到代码的错误而失败。openchat这个开源模型，我试下来还是比较合理。

2、Agents的推理准确性和任务规划、执行过程对于提示词的依赖性依然很强，在构建tools、Agents 的时候尽量将tool工具的描述写的准确以及Agents 的提示词最好提供一个相似任务的示例。对于python_repl工具，应该描述df数据的简要信息，生成代码过程会更精确、迭代次数少。

3、本次实践尚需用户提供领域知识，如市场价格的区间、单位等等，没用利用行业和企业构建的数据标准知识、行业信息等额外知识，实现完全自主分析数据质量并将异常数据替换。下一个项目将完成上述任务。