智能问数系统(二):数据分析师Python
文章目录
摘要
- 由于SQL(Structured Query Language)本质上是为数据检索和关系运算设计的,虽然 SQL 支持 SUM, AVG, COUNT 等基本聚合,但对于复杂的统计分析,SQL 的表达力有限,SQL 返回的是平铺的二维表(Rows & Columns)
- 比如我们通过Text-to-SQL把自然语言:统计分析2025年6月分的每个产品的销售额 转换为如下的SQL语句:
sql
select `order_items`.`product_id`, `order_items`.`order_id`, `products`.`name`, `order_items`.`quantity`, `order_items`.`unit_price`
from `orders` join `order_items` on `orders`.`id` = `order_items`.`order_id` join `products` on `order_items`.`product_id` = `products`.`id`
where `orders`.`order_date` >= '2025-06-01' and `orders`.`order_date` <= '2025-06-30' and `orders`.`status` = 'completed';
- SQL执行结果如下,下面数据直接平铺展示不能体现数据的价值,因此我们可以通过Python的Pandas库对数据进行技术分析体现数据的价值。

Python代码生成
- 我们可以通过提示词(提供示例)来让LLM来生成可执行的Python代码:ChatClientRequest
markdown
# ROLE: 专业Python数据分析工程师
你是一个专业的Python数据分析工程师,你的任务是根据用户提供的自然语言分析需求、数据库表结构和SQL查询结果样例,编写一段**可直接运行的无状态Python脚本**。
**核心要求**:生成的代码必须满足:
① 输入SQL结果JSON → ② 执行分析 → ③ 输出JSON结果 的完整闭环。
④ 异常时通过stderr提供可调试的完整堆栈信息。
# 示例输出
假设用户需求是"统计每个渠道的线索数量和转化率",生成的代码可能如下:
```python
import sys
import json
import traceback
import pandas as pd
try:
# 从stdin读取输入数据
input_data = json.load(sys.stdin)
# 转换为DataFrame
df = pd.DataFrame(input_data)
# 自动类型推断:将字符串形式的数值转换为实际数值类型
for col in df.columns:
df[col] = pd.to_numeric(df[col], errors='ignore')
# 动态分析逻辑
result = {
"channel_stats": []
}
# 计算每个渠道的线索数量和转化率
if "channel" in df.columns and "conversion" in df.columns:
grouped = df.groupby("channel").agg(
lead_count=("conversion", "size"),
conversion_rate=("conversion", "mean")
).reset_index()
result["channel_stats"] = grouped.to_dict(orient="records")
# 输出结果为JSON对象
print(json.dumps(result, ensure_ascii=False))
except Exception:
# 捕获异常并输出堆栈信息到stderr
traceback.print_exc(file=sys.stderr)
sys.exit(1)
```
=== 上下文信息 ===
【表结构】
```
{"name":"saa_data_agent","table":[{"name":"products","description":"商品表","column":[{"name":"stock","description":"商品库存数量","enumeration":0,"type":"number","data":["100","500","200"]},{"name":"name","description":"商品名称","enumeration":0,"type":"text","data":["智能手机","T恤衫","小说"]},{"name":"price","description":"商品单价","enumeration":0,"type":"number","data":["2999.00","89.00","39.00"]},{"name":"id","description":"商品ID,主键自增","enumeration":0,"type":"number","data":["1","2","3"]},{"name":"created_at","description":"商品上架时间","enumeration":0,"type":"datetime","data":["2026-01-12 09:27:11"]}],"primaryKeys":["id"]},{"name":"order_items","description":"订单明细表","column":[{"name":"product_id","description":"商品ID","enumeration":0,"type":"number","data":["1","2","3"]},{"name":"id","description":"订单明细ID,主键自增","enumeration":0,"type":"number","data":["1","2","3"]},{"name":"order_id","description":"订单ID","enumeration":0,"type":"number","data":["1","2","3"]},{"name":"quantity","description":"购买数量","enumeration":0,"type":"number","data":["1","2"]},{"name":"unit_price","description":"下单时商品单价","enumeration":0,"type":"number","data":["2999.00","89.00","39.00"]}],"primaryKeys":["id"]},{"name":"orders","description":"订单表","column":[{"name":"status","description":"订单状态(pending/completed/cancelled等)","enumeration":0,"type":"text","data":["completed","pending","cancelled"]},{"name":"order_date","description":"下单时间","enumeration":0,"type":"datetime","data":["2025-06-01 10:10:00","2025-06-02 09:23:00","2025-06-03 13:45:00"]},{"name":"user_id","description":"下单用户ID","enumeration":0,"type":"number","data":["1","2","3"]},{"name":"id","description":"订单ID,主键自增","enumeration":0,"type":"number","data":["1","2","3"]},{"name":"total_amount","description":"订单总金额","enumeration":0,"type":"number","data":["3088.00","39.00","1204.00"]}],"primaryKeys":["id"]}],"foreignKeys":["order_items.order_id=orders.id","product_categories.category_id=categories.id","orders.user_id=users.id","product_categories.product_id=products.id","order_items.product_id=products.id","orders.user_id=users.id"]}
```
【输入样例】
```json
[{"quantity":"1","product_id":"1","name":"智能手机","unit_price":"2999.00"},{"quantity":"1","product_id":"2","name":"T恤衫","unit_price":"89.00"},{"quantity":"2","product_id":"4","name":"咖啡机","unit_price":"599.00"},{"quantity":"2","product_id":"5","name":"牛奶","unit_price":"3.00"},{"quantity":"1","product_id":"6","name":"笔记本电脑","unit_price":"4999.00"}]
```
【方案指导】
```json
{"instruction":"读取上一步的查询结果,计算每个产品的销售额(销售额 = unit_price * quantity),按 product_id 和商品名称分组进行汇总,并生成销售额从高到低的排序列表,同时绘制前10名产品的销售额柱状图。"}
```
---
=== 用户输入 ===
接下来是用户的需求:统计分析2025年6月(2025-06-01至2025-06-30)每个产品的销售额
- LLM的response响应text对应的python脚本如下:
python
import sys
import json
import traceback
import pandas as pd
try:
# 从stdin读取输入数据
input_data = json.load(sys.stdin)
# 转换为DataFrame
df = pd.DataFrame(input_data)
# 自动类型推断:将字符串形式的数值转换为实际数值类型
for col in df.columns:
df[col] = pd.to_numeric(df[col], errors='ignore')
# 检查是否包含必要字段,若不包含则尝试从orders和order_items关联分析
if "order_date" in df.columns:
# 筛选2025年6月的数据(2025-06-01 至 2025-06-30)
df["order_date"] = pd.to_datetime(df["order_date"])
start_date = pd.Timestamp("2025-06-01")
end_date = pd.Timestamp("2025-06-30 23:59:59")
filtered_df = df[(df["order_date"] >= start_date) & (df["order_date"] <= end_date)]
else:
# 如果没有order_date,假定输入已经是6月的订单明细(如order_items join products)
filtered_df = df.copy()
# 计算每个产品的销售额:unit_price * quantity
if all(col in filtered_df.columns for col in ["unit_price", "quantity", "product_id"]):
filtered_df["sales"] = filtered_df["unit_price"] * filtered_df["quantity"]
# 按 product_id 和商品名称(如果有name字段)分组汇总销售额
group_cols = ["product_id"]
if "name" in filtered_df.columns:
group_cols.append("name")
sales_summary = filtered_df.groupby(group_cols)["sales"].sum().reset_index()
# 排序:按销售额从高到低
sales_summary = sales_summary.sort_values(by="sales", ascending=False).reset_index(drop=True)
# 生成结果列表
result = {
"product_sales": sales_summary.to_dict(orient="records")
}
else:
# 缺少关键字段时返回空结果
result = {
"product_sales": [],
"error": "Missing required fields: 'unit_price', 'quantity', or 'product_id'"
}
# 输出结果为JSON对象
print(json.dumps(result, ensure_ascii=False))
except Exception:
# 捕获异常并输出堆栈信息到stderr
traceback.print_exc(file=sys.stderr)
sys.exit(1)
Python脚本执行
- Python脚步执行的三种方式
- Docker:通过Docker容器的方式运(生产环境推荐Docker)
- Local :使用本地Python3 环境运行代码的实现类,要求本地的Python3 需要有pandas等数据分析库。(本地测试)
- AI模拟 :使用AI 模拟运行Python 代码(便于在无Docker环境测试)
Docker执行
- 核心思想是:"主机负责调度,容器负责执行,文件系统作为通信桥梁"
java
protected TaskResponse execTaskInContainer(TaskRequest request, String containerId) {
// 该目录将用于 与容器共享文件
Path tempDir = this.containerTempPath.get(containerId);
if (tempDir == null) {
log.error("Container '{}' does not exist work dir", containerId);
return TaskResponse.exception("Container '" + containerId + "' does not exist work dir");
}
// 写入Py代码和标准输入
try {
Files.write(tempDir.resolve("script.py"),
StringUtils.hasText(request.code()) ? request.code().getBytes() : "".getBytes());
Files.write(tempDir.resolve("requirements.txt"),
StringUtils.hasText(request.requirement()) ? request.requirement().getBytes() : "".getBytes());
Files.write(tempDir.resolve("input_data.txt"),
StringUtils.hasText(request.input()) ? request.input().getBytes() : "".getBytes());
Files.write(tempDir.resolve("stdout.txt"), "".getBytes());
Files.write(tempDir.resolve("stderr.txt"), "".getBytes());
}
catch (Exception e) {
return TaskResponse.exception(e.getMessage());
}
try {
// 启动容器执行
dockerClient.startContainerCmd(containerId).exec();
dockerClient.waitContainerCmd(containerId)
.start()
.awaitCompletion(this.properties.getContainerTimeout(), TimeUnit.SECONDS);
// 读取stdout和stderr
String stdout = Files.readString(tempDir.resolve("stdout.txt"));
String stderr = Files.readString(tempDir.resolve("stderr.txt"));
// 返回结果
InspectContainerResponse inspectResponse = dockerClient.inspectContainerCmd(containerId).exec();
int exitCode = Objects.requireNonNull(inspectResponse.getState().getExitCodeLong()).intValue();
if (exitCode != 0) {
return TaskResponse.failure(stdout, stderr);
}
return TaskResponse.success(stdout);
}
catch (Exception e) {
log.error("Error when creating container in docker: {}", e.getMessage());
return TaskResponse.exception(e.getMessage());
}
}
Local
java
@Override
protected TaskResponse execTaskInContainer(TaskRequest request, String containerId) {
// 获取隔离目录
Path container = this.containers.get(containerId);
// 写入Py代码和标准输入
Path scriptFile = container.resolve("script.py");
Path stdinFile = container.resolve("stdin.txt");
Path requirementFile = container.resolve("requirements.txt");
try {
Files.write(scriptFile, Optional.ofNullable(request.code()).orElse("").getBytes());
Files.write(stdinFile, Optional.ofNullable(request.input()).orElse("").getBytes());
Files.write(requirementFile, Optional.ofNullable(request.requirement()).orElse("").getBytes());
}
catch (Exception e) {
log.error("Create temp file failed: {}", e.getMessage(), e);
return TaskResponse.exception(e.getMessage());
}
// 如果有requirements,则先安装依赖
if (this.checkProgramExists(pipNames) != null && StringUtils.hasText(request.requirement())) {
//删除了
}
// 运行Python代码
Process process = null;
try {
ProcessBuilder pb = new ProcessBuilder(this.checkProgramExists(pythonNames),
scriptFile.toAbsolutePath().toString());
pb.directory(container.toFile());
pb.redirectInput(stdinFile.toFile());
process = pb.start();
// 读取stdout和stderr
StringWriter stdoutWriter = new StringWriter();
StringWriter stderrWriter = new StringWriter();
try (BufferedReader stdoutReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
BufferedReader stderrReader = new BufferedReader(new InputStreamReader(process.getErrorStream()))) {
CompletableFuture<Void> stdoutFuture = CompletableFuture.runAsync(() -> {
stdoutReader.transferTo(stdoutWriter);
});
CompletableFuture<Void> stderrFuture = CompletableFuture.runAsync(() -> {
stderrReader.transferTo(stderrWriter);
});
// 等待进程完成,带超时限制
boolean completed = process.waitFor(this.parseToMilliseconds(this.properties.getCodeTimeout()),
TimeUnit.MILLISECONDS);
// 等待输出读取完成,给输出读取额外2秒时间
CompletableFuture.allOf(stdoutFuture, stderrFuture).get(2, TimeUnit.SECONDS);
}
// 返回结果
int exitCode = process.exitValue();
String stdout = stdoutWriter.toString();
String stderr = stderrWriter.toString();
if (exitCode != 0) {
return TaskResponse.failure(stdout, stderr);
}
else {
return TaskResponse.success(stdout);
}
}
catch (Exception e) {
log.error("Python execution failed: {}", e.getMessage(), e);
return TaskResponse.exception(e.getMessage());
}
finally {
if (process != null && process.isAlive()) {
process.destroyForcibly();
}
}
}
AI模拟
- 通过调用LLM进行模拟执行
java
@Override
public TaskResponse runTask(TaskRequest request) {
String userPrompt = String.format("""
【代码】
```python
%s
```
【标准输入】
```json
%s
```
""", request.code(), request.input());
String output = llmService.toStringFlux(llmService.call(SYSTEM_PROMPT, userPrompt))
.collect(StringBuilder::new, StringBuilder::append)
.map(StringBuilder::toString)
.block();
return TaskResponse.success(output);
}
Python总结分析
- Python执行输出结果
jso
{"product_sales":[{"product_id":6,"name":"笔记本电脑","sales":4999.0},{"product_id":1,"name":"智能手机","sales":2999.0},{"product_id":4,"name":"咖啡机","sales":1198.0},{"product_id":9,"name":"羽绒服","sales":1197.0},{"product_id":10,"name":"历史书","sales":276.0},{"product_id":2,"name":"T恤衫","sales":178.0},{"product_id":5,"name":"牛奶","sales":31.0},{"product_id":8,"name":"巧克力","sales":25.0},{"product_id":3,"name":"小说","sales":16.0}]}
- 对于Python执行输出结果可以再通过LLM进行分析总结
markdown
# ROLE: 数据分析报告撰写专家
你是一位专业的数据分析报告撰写专家,擅长将复杂的数据分析结果转化为清晰、准确、易懂的自然语言总结。
你的任务是根据用户的原始查询需求和Python脚本的分析输出结果,生成一段**结构清晰、语言简洁、内容准确**的总结性描述。
---
### 输入信息
【用户原始查询】
统计分析2025年6月(2025-06-01至2025-06-30)每个产品的销售额
【Python分析结果】
{"product_sales":[{"product_id":6,"name":"笔记本电脑","sales":4999.0},{"product_id":1,"name":"智能手机","sales":2999.0},{"product_id":4,"name":"咖啡机","sales":1198.0},{"product_id":9,"name":"羽绒服","sales":1197.0},{"product_id":10,"name":"历史书","sales":276.0},{"product_id":2,"name":"T恤衫","sales":178.0},{"product_id":5,"name":"牛奶","sales":31.0},{"product_id":8,"name":"巧克力","sales":25.0},{"product_id":3,"name":"小说","sales":16.0}]}
---
### 输出模板
请根据以下模板生成总结内容:
```text
[总结内容]
```
请根据以上规则生成符合要求的总结内容:
- 最终分析结果
tex
2025年6月,笔记本电脑销售额为4999.0元,位居第一;智能手机销售额为2999.0元,排名第二;咖啡机销售额为1198.0元,位列第三。其他产品销售额依次为羽绒服1197.0元、历史书276.0元、T恤衫178.0元、牛奶31.0元、巧克力25.0元、小说16.0元。
总结
- 可以防止 Token 溢出: SQL 可能返回上万条结果,直接塞给 LLM 会导致上下文溢出且费用昂贵。Python 可以对数据进行采样、分桶(Binning)或特征提取,将"脱水"后的精简摘要发给 LLM。
- 异常检测: Python 脚本可以快速扫描结果集中的离群值(Outliers)或空值,并在分析报告中自动标注异常。
- 通过 Schema 增强 确保取数准确,通过 Python 脚本 确保分析深度,最后通过 LLM 总结 确保回复质量