OpenClaw in Practice: Building a Data Analysis Platform

Contents

    • Abstract
    • 1. Introduction - Data Analysis Platform Overview
      • 1.1 Data Analysis Requirements
      • 1.2 Platform Architecture
      • 1.3 Core Feature Plan
    • 2. Data Collection Module
      • 2.1 Multi-Source Data Connectors
      • 2.2 Data Collection Scheduler
    • 3. Data Cleaning Module
      • 3.1 Data Quality Checks
      • 3.2 Data Cleaning Processor
    • 4. Data Analysis Engine
      • 4.1 Statistical Analyzer
      • 4.2 Natural Language Query
    • 5. Visualization Engine
      • 5.1 Chart Generator
    • 6. Best Practices
      • 6.1 Platform Design Principles
      • 6.2 Common Issues
    • 7. Summary
      • 7.1 Key Takeaways
      • 7.2 Next Steps
    • References

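Abstract

This article walks through a complete data analysis platform case study to show how OpenClaw can be used to build an intelligent data analysis system. It covers data collection, data cleaning, data analysis, and visualization, with the system design and code for each stage, so that developers can see how OpenClaw applies to data analysis scenarios and follow the full build process.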


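1. Introduction - Data Analysis Platform Overview

1.1 Data Analysis Requirements

Enterprise data analysis faces several challenges that traditional approaches struggle to meet:

| Challenge | Traditional approach | OpenClaw approach |
| --- | --- | --- |
| Scattered data | Manual consolidation | Automated collection and integration |
| High barrier to analysis | Requires professional analysts | Natural language queries |
| Slow response | Batch processing | Real-time analysis |
| Shallow insight | Descriptive analytics | Predictive analytics |
| Hard to collaborate | Report distribution | Intelligent Q&A |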

1.2 Platform Architecture

[Architecture diagram: four layers and their components]

    • Data source layer: databases, API interfaces, file uploads, real-time streams
    • Data processing layer: data collection, data cleaning, data transformation, data storage
    • Analysis engine layer: statistical analysis, machine learning, natural language query, visualization engine
    • Application service layer: report center, intelligent Q&A, alerting system, data API

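1.3 Core Feature Plan

| Module | Core capability | Implementation |
| --- | --- | --- |
| Data collection | Multi-source data ingestion | Connectors + streaming collection |
| Data cleaning | Data quality assurance | Rule engine + anomaly detection |
| Data analysis | Multi-dimensional analysis | SQL + ML |
| Intelligent query | Natural language interaction | NLP + SQL generation |
| Visualization | Chart display | Chart library + automatic recommendation |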

2. Data Collection Module

2.1 Multi-Source Data Connectors

from abc import ABC, abstractmethod
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
import pandas as pd

@dataclass
class DataSource:
    """Data source configuration."""
    name: str
    type: str
    connection: Dict[str, Any]
    schema: Optional[Dict] = None

class DataConnector(ABC):
    """Base class for data connectors."""
    
    @abstractmethod
    def connect(self, source: DataSource) -> bool:
        """Open a connection to the data source."""
        pass
    
    @abstractmethod
    def fetch(self, query: str) -> pd.DataFrame:
        """Fetch data and return it as a DataFrame."""
        pass
    
    @abstractmethod
    def get_schema(self) -> Dict:
        """Return the structure of the data source."""
        pass

class DatabaseConnector(DataConnector):
    """Database connector."""
    
    def __init__(self):
        self.connection = None
        self.source = None
    
    def connect(self, source: DataSource) -> bool:
        """Open a database connection."""
        self.source = source
        
        # Pick the driver based on the database type
        db_type = source.type
        
        if db_type == "mysql":
            import pymysql
            self.connection = pymysql.connect(
                host=source.connection["host"],
                port=source.connection.get("port", 3306),
                user=source.connection["user"],
                password=source.connection["password"],
                database=source.connection["database"]
            )
        
        elif db_type == "postgresql":
            import psycopg2
            self.connection = psycopg2.connect(
                host=source.connection["host"],
                port=source.connection.get("port", 5432),
                user=source.connection["user"],
                password=source.connection["password"],
                database=source.connection["database"]
            )
        
        elif db_type == "sqlite":
            import sqlite3
            self.connection = sqlite3.connect(source.connection["path"])
        
        return self.connection is not None
    
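    def fetch(self, query: str) -> pd.DataFrame:
        """Run a query and return the result as a DataFrame."""
        if not self.connection:
            raise RuntimeError("No connection established")
        
        return pd.read_sql(query, self.connection)
    
    def get_schema(self) -> Dict:
        """Return the database structure."""
        if self.source.type == "mysql":
            # Alias the columns explicitly: MySQL 8 may return
            # information_schema column names in upper case otherwise
            query = f"""
            SELECT table_name AS table_name, column_name AS column_name, data_type AS data_type
            FROM information_schema.columns 
            WHERE table_schema = '{self.source.connection["database"]}'
            """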
        elif self.source.type == "postgresql":
            query = f"""
            SELECT table_name, column_name, data_type 
            FROM information_schema.columns 
            WHERE table_schema = 'public'
            """
        else:
            return {}
        
        df = self.fetch(query)
        
        # Build the schema dictionary
        schema = {}
        for _, row in df.iterrows():
            table = row["table_name"]
            if table not in schema:
                schema[table] = {"columns": []}
            schema[table]["columns"].append({
                "name": row["column_name"],
                "type": row["data_type"]
            })
        
        return schema

class APIConnector(DataConnector):
    """API connector."""
    
    def __init__(self):
        self.source = None
        self.session = None
    
    def connect(self, source: DataSource) -> bool:
        """Configure the API session."""
        self.source = source
        
        import requests
        self.session = requests.Session()
        
        # Set up authentication
        if "api_key" in source.connection:
            self.session.headers["Authorization"] = f"Bearer {source.connection['api_key']}"
        
        return True
    
    def fetch(self, endpoint: str, params: Optional[Dict] = None) -> pd.DataFrame:
        """Fetch data from an API endpoint."""
        if not self.session:
            raise RuntimeError("No connection established")
        
        base_url = self.source.connection["base_url"]
        url = f"{base_url}/{endpoint}"
        
        # A timeout keeps a stalled endpoint from blocking the caller indefinitely
        response = self.session.get(url, params=params, timeout=30)
        
        if response.status_code == 200:
            data = response.json()
            
            # Handle nested payloads
            if isinstance(data, list):
                return pd.DataFrame(data)
            elif isinstance(data, dict):
                # Look for a list of records under common wrapper keys
                for key in ["data", "results", "items"]:
                    if key in data and isinstance(data[key], list):
                        return pd.DataFrame(data[key])
                return pd.DataFrame([data])
        
        raise RuntimeError(f"API request failed: {response.status_code}")
    
    def get_schema(self) -> Dict:
        """Return the structure of the API data."""
        # Placeholder: the structure could be inferred from sample data
        return {}

class FileConnector(DataConnector):
    """File connector."""
    
    def __init__(self):
        self.source = None
        self.data = None
    
    def connect(self, source: DataSource) -> bool:
        """Load the file into memory."""
        self.source = source
        file_path = source.connection["path"]
        file_type = source.connection.get("type", "csv")
        
        if file_type == "csv":
            self.data = pd.read_csv(file_path)
        elif file_type == "excel":
            self.data = pd.read_excel(file_path)
        elif file_type == "json":
            self.data = pd.read_json(file_path)
        elif file_type == "parquet":
            self.data = pd.read_parquet(file_path)
        else:
            return False
        
        return True
    
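    def fetch(self, query: Optional[str] = None) -> pd.DataFrame:
        """Return a copy of the loaded data (the query argument is ignored)."""
        if self.data is None:
            raise RuntimeError("No file loaded; call connect() first")
        return self.data.copy()
    
    def get_schema(self) -> Dict:
        """Return the column names and dtypes."""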
        if self.data is None:
            return {}
        
        return {
            "columns": [
                {"name": col, "type": str(self.data[col].dtype)}
                for col in self.data.columns
            ]
        }

# Usage example
# Database connection
db_source = DataSource(
    name="main_db",
    type="mysql",
    connection={
        "host": "localhost",
        "user": "root",
        "password": "password",
        "database": "mydb"
    }
)

db_connector = DatabaseConnector()
db_connector.connect(db_source)
df = db_connector.fetch("SELECT * FROM users LIMIT 10")
print(f"Fetched {len(df)} records")

# API connection
api_source = DataSource(
    name="weather_api",
    type="api",
    connection={
        "base_url": "https://api.weather.com",
        "api_key": "your_api_key"
    }
)

api_connector = APIConnector()
api_connector.connect(api_source)
weather_df = api_connector.fetch("current", {"city": "beijing"})
print(f"Weather data: {weather_df.shape}")

# File connection
file_source = DataSource(
    name="sales_data",
    type="file",
    connection={
        "path": "/data/sales.csv",
        "type": "csv"
    }
)

file_connector = FileConnector()
file_connector.connect(file_source)
sales_df = file_connector.fetch()
print(f"Sales data: {sales_df.shape}")
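
`DatabaseConnector.get_schema` returns an empty dict for SQLite. The sketch below shows one way the same `{table: {"columns": [...]}}` structure could be produced for SQLite, using `sqlite_master` and `PRAGMA table_info` from the standard library. The helper name `sqlite_schema` and the demo table are hypothetical and are not part of the connector above.

```python
import sqlite3
from typing import Dict


def sqlite_schema(connection: sqlite3.Connection) -> Dict:
    """Hypothetical helper: build {table: {"columns": [...]}} for a SQLite database."""
    schema: Dict = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk)
        quoted = table.replace('"', '""')
        columns = connection.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = {
            "columns": [{"name": col[1], "type": col[2]} for col in columns]
        }
    return schema


# Demo on an in-memory database with a made-up table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
demo_schema = sqlite_schema(conn)
print(demo_schema)
```

A helper like this could be called from the `else` branch of `get_schema` when `source.type` is `"sqlite"`, so that all three database types return the same shape.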

2.2 Data Collection Scheduler

from dataclasses import dataclass
from typing import Callable, Dict, List, Optional
from datetime import datetime, timedelta
import threading
import time

@dataclass
class CollectionTask:
    """A scheduled collection task."""
    id: str
    name: str
    connector: DataConnector
    query: str
    schedule: str  # schedule keyword such as "every_hour" (see _parse_schedule)
    callback: Callable
    last_run: Optional[datetime] = None
    next_run: Optional[datetime] = None
    status: str = "pending"

class DataCollectionScheduler:
    """Data collection scheduler."""
    
    def __init__(self):
        self.tasks: Dict[str, CollectionTask] = {}
        self.running = False
        self.thread = None
    
    def add_task(self, task: CollectionTask):
        """Register a collection task."""
        # Compute the first run time
        task.next_run = self._parse_schedule(task.schedule)
        self.tasks[task.id] = task
    
    def remove_task(self, task_id: str):
        """Remove a collection task."""
        if task_id in self.tasks:
            del self.tasks[task_id]
    
    def start(self):
        """Start the scheduler."""
        self.running = True
        self.thread = threading.Thread(target=self._run_loop, daemon=True)
        self.thread.start()
    
    def stop(self):
        """Stop the scheduler."""
        self.running = False
        if self.thread:
            self.thread.join(timeout=5)
    
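    def _run_loop(self):
        """Scheduling loop."""
        while self.running:
            now = datetime.now()
            
            # Iterate over a snapshot so add_task/remove_task calls from
            # other threads cannot change the dict mid-iteration
            for task in list(self.tasks.values()):
                if task.next_run and now >= task.next_run:
                    self._execute_task(task)
                    task.last_run = now
                    task.next_run = self._parse_schedule(task.schedule)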
            
            time.sleep(1)
    
    def _execute_task(self, task: CollectionTask):
        """Run a single collection task."""
        try:
            task.status = "running"
            
            # Fetch the data
            df = task.connector.fetch(task.query)
            
            # Hand the result to the callback
            task.callback(df)
            
            task.status = "success"
        
        except Exception as e:
            task.status = "failed"
            print(f"Task {task.name} failed: {e}")
    
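    def _parse_schedule(self, schedule: str) -> datetime:
        """Translate a schedule keyword into the next run time."""
        # Simplified: only a few keywords are supported;
        # a real deployment should use the croniter library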
        
        if schedule == "every_minute":
            return datetime.now() + timedelta(minutes=1)
        elif schedule == "every_hour":
            return datetime.now() + timedelta(hours=1)
        elif schedule == "every_day":
            return datetime.now() + timedelta(days=1)
        else:
            return datetime.now() + timedelta(hours=1)
    
    def get_task_status(self) -> List[Dict]:
        """Return the status of every task."""
        return [
            {
                "id": task.id,
                "name": task.name,
                "status": task.status,
                "last_run": task.last_run.isoformat() if task.last_run else None,
                "next_run": task.next_run.isoformat() if task.next_run else None
            }
            for task in self.tasks.values()
        ]

# Usage example
scheduler = DataCollectionScheduler()

def process_data(df: pd.DataFrame):
    """Process the collected data."""
    print(f"Processing {len(df)} rows")
    # Store in the data warehouse
    # or trigger downstream analysis

# Add a collection task
task1 = CollectionTask(
    id="task_001",
    name="User data collection",
    connector=db_connector,
    query="SELECT * FROM users WHERE created_at > NOW() - INTERVAL 1 HOUR",
    schedule="every_hour",
    callback=process_data
)

scheduler.add_task(task1)
scheduler.start()
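
`_parse_schedule` computes the next run from `datetime.now()`, so the time each run takes pushes every later run back a little. A minimal sketch of one alternative, assuming the same three schedule keywords: step forward from the previously scheduled time and skip any slots that were missed. The function name `next_run_after` and the `INTERVALS` table are illustrative and do not appear in the scheduler above.

```python
from datetime import datetime, timedelta

# Assumed keyword-to-interval table, mirroring the keywords in _parse_schedule
INTERVALS = {
    "every_minute": timedelta(minutes=1),
    "every_hour": timedelta(hours=1),
    "every_day": timedelta(days=1),
}


def next_run_after(schedule: str, previous: datetime, now: datetime) -> datetime:
    """Return the first slot after `previous` that is strictly later than `now`."""
    # Unknown keywords fall back to one hour, as in the original method
    interval = INTERVALS.get(schedule, timedelta(hours=1))
    missed = (now - previous) // interval  # whole intervals already elapsed
    return previous + (max(missed, 0) + 1) * interval


previous = datetime(2024, 1, 1, 12, 0, 0)
# The task finished 7 seconds after its slot; the next slot is still on the hour
on_time = next_run_after("every_hour", previous, datetime(2024, 1, 1, 12, 0, 7))
# The scheduler was down for a while; missed slots are skipped, not replayed
after_gap = next_run_after("every_hour", previous, datetime(2024, 1, 1, 15, 30, 0))
print(on_time, after_gap)
```

Whether missed slots should be skipped or replayed depends on the task; skipping is the simpler choice when each run already queries a sliding time window.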

3. Data Cleaning Module

3.1 Data Quality Checks

from typing import Dict, List, Tuple
import numpy as np

class DataQualityChecker:
    """Data quality checker."""
    
    def __init__(self):
        self.rules: List[Dict] = []
    
    def add_rule(self, column: str, rule_type: str, params: Dict = None):
        """Add a quality rule."""
        self.rules.append({
            "column": column,
            "type": rule_type,
            "params": params or {}
        })
    
    def check(self, df: pd.DataFrame) -> Dict:
        """Run all quality rules against the DataFrame."""
        results = {
            "total_rows": len(df),
            "total_columns": len(df.columns),
            "issues": [],
            "score": 100
        }
        
        for rule in self.rules:
            column = rule["column"]
            rule_type = rule["type"]
            params = rule["params"]
            
            if column not in df.columns:
                results["issues"].append({
                    "column": column,
                    "type": "missing_column",
                    "message": f"Column {column} does not exist"
                })
                continue
            
            col_data = df[column]
            
            if rule_type == "not_null":
                null_count = col_data.isnull().sum()
                if null_count > 0:
                    results["issues"].append({
                        "column": column,
                        "type": "null_values",
                        "count": null_count,
                        "percentage": null_count / len(df) * 100
                    })
            
            elif rule_type == "unique":
                dup_count = col_data.duplicated().sum()
                if dup_count > 0:
                    results["issues"].append({
                        "column": column,
                        "type": "duplicates",
                        "count": dup_count
                    })
            
            elif rule_type == "range":
                min_val = params.get("min")
                max_val = params.get("max")
                
                if min_val is not None:
                    below_min = (col_data < min_val).sum()
                    if below_min > 0:
                        results["issues"].append({
                            "column": column,
                            "type": "below_min",
                            "count": below_min,
                            "min": min_val
                        })
                
                if max_val is not None:
                    above_max = (col_data > max_val).sum()
                    if above_max > 0:
                        results["issues"].append({
                            "column": column,
                            "type": "above_max",
                            "count": above_max,
                            "max": max_val
                        })
            
            elif rule_type == "pattern":
                pattern = params.get("pattern")
                if pattern:
                    # astype(str) turns nulls into the literal string "nan" before matching
                    invalid = ~col_data.astype(str).str.match(pattern, na=False)
                    invalid_count = invalid.sum()
                    if invalid_count > 0:
                        results["issues"].append({
                            "column": column,
                            "type": "pattern_mismatch",
                            "count": invalid_count,
                            "pattern": pattern
                        })
        
        # Compute the quality score: each issue costs its percentage, or a flat 5 points
        if results["issues"]:
            issue_penalty = sum(
                issue.get("percentage", 5) 
                for issue in results["issues"]
            )
            results["score"] = max(0, 100 - issue_penalty)
        
        return results
    
    def get_profile(self, df: pd.DataFrame) -> Dict:
        """Build a profile of the data."""
        profile = {
            "row_count": len(df),
            "column_count": len(df.columns),
            "memory_usage": df.memory_usage(deep=True).sum(),
            "columns": {}
        }
        
        for col in df.columns:
            col_data = df[col]
            col_profile = {
                "dtype": str(col_data.dtype),
                "null_count": col_data.isnull().sum(),
                "null_percentage": col_data.isnull().sum() / len(df) * 100,
                "unique_count": col_data.nunique()
            }
            
            # Statistics for numeric columns (any integer or float width)
            if pd.api.types.is_numeric_dtype(col_data) and not pd.api.types.is_bool_dtype(col_data):
                col_profile.update({
                    "min": col_data.min(),
                    "max": col_data.max(),
                    "mean": col_data.mean(),
                    "median": col_data.median(),
                    "std": col_data.std()
                })
            
            # Statistics for string columns
            elif col_data.dtype == "object":
                col_profile.update({
                    "min_length": col_data.astype(str).str.len().min(),
                    "max_length": col_data.astype(str).str.len().max(),
                    "avg_length": col_data.astype(str).str.len().mean()
                })
            
            profile["columns"][col] = col_profile
        
        return profile

# Usage example
checker = DataQualityChecker()

# Add quality rules
checker.add_rule("user_id", "not_null")
checker.add_rule("user_id", "unique")
checker.add_rule("age", "range", {"min": 0, "max": 150})
checker.add_rule("email", "pattern", {"pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"})

# Run the checks
df = pd.DataFrame({
    "user_id": [1, 2, 3, None, 5],
    "age": [25, 30, -5, 40, 200],
    "email": ["a@b.com", "invalid", "c@d.com", "e@f.com", "g@h.com"]
})

results = checker.check(df)
print(f"Quality score: {results['score']}")
print(f"Issues found: {len(results['issues'])}")

# Build a data profile
profile = checker.get_profile(df)
print(f"Profile: {profile['row_count']} rows, {profile['column_count']} columns")
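
The scoring rule in `check` can be traced by hand for the sample DataFrame: an issue that carries a `percentage` costs that many points, and any other issue costs a flat 5. The sketch below replays that arithmetic on the four issues the sample should produce (one null `user_id`, one `age` below the minimum, one above the maximum, one malformed `email`). The issue list is written out by hand here rather than generated by the checker.

```python
# Issues expected for the 5-row sample, written out by hand
issues = [
    {"column": "user_id", "type": "null_values", "count": 1, "percentage": 1 / 5 * 100},
    {"column": "age", "type": "below_min", "count": 1},           # -5 is below 0
    {"column": "age", "type": "above_max", "count": 1},           # 200 is above 150
    {"column": "email", "type": "pattern_mismatch", "count": 1},  # "invalid"
]

# Same rule as DataQualityChecker.check: the percentage if present, otherwise 5 points
penalty = sum(issue.get("percentage", 5) for issue in issues)
score = max(0, 100 - penalty)
print(score)
```

The penalty is 20 + 5 + 5 + 5 = 35, so the score is 65. Only the null check reports a percentage, so a column with thousands of out-of-range values still costs the same 5 points as a column with one.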

3.2 Data Cleaning Processor

from typing import Any, Callable, Dict, List, Tuple

class DataCleaner:
    """Data cleaning processor: runs a configurable sequence of steps."""
    
    def __init__(self):
        self.steps: List[Dict] = []
    
    def add_step(self, name: str, processor: Callable, params: Dict = None):
        """Add a cleaning step."""
        self.steps.append({
            "name": name,
            "processor": processor,
            "params": params or {}
        })
    
    def clean(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, Dict]:
        """Run all cleaning steps and return the result with a report."""
        original_count = len(df)
        report = {
            "original_rows": original_count,
            "steps": []
        }
        
        result_df = df.copy()
        
        for step in self.steps:
            before_count = len(result_df)
            
            result_df = step["processor"](result_df, **step["params"])
            
            after_count = len(result_df)
            
            report["steps"].append({
                "name": step["name"],
                "rows_before": before_count,
                "rows_after": after_count,
                "rows_removed": before_count - after_count
            })
        
        report["final_rows"] = len(result_df)
        report["rows_removed"] = original_count - len(result_df)
        
        return result_df, report

# Predefined cleaning steps
def remove_duplicates(df: pd.DataFrame, subset: List[str] = None) -> pd.DataFrame:
    """Drop duplicate rows."""
    return df.drop_duplicates(subset=subset)

def fill_missing(df: pd.DataFrame, columns: Dict[str, Any] = None) -> pd.DataFrame:
    """Fill missing values, using one fill value per column."""
    result = df.copy()
    
    for col, value in (columns or {}).items():
        if col in result.columns:
            result[col] = result[col].fillna(value)
    
    return result

def remove_outliers(df: pd.DataFrame, column: str, method: str = "iqr", threshold: float = 1.5) -> pd.DataFrame:
    """Remove outliers from a column."""
    if column not in df.columns:
        return df
    
    if method == "iqr":
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        
        return df[(df[column] >= lower) & (df[column] <= upper)]
    
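    elif method == "zscore":
        # Compute z-scores on the original index so the mask stays aligned
        # with df; rows with a missing value are dropped, as in the IQR branch
        col = df[column]
        z_scores = (col - col.mean()) / col.std(ddof=0)
        return df[z_scores.abs() <= threshold]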
    
    return df

def standardize_text(df: pd.DataFrame, column: str, lowercase: bool = True, strip: bool = True) -> pd.DataFrame:
    """Normalize a text column."""
    result = df.copy()
    
    if column in result.columns:
        original = result[column]
        text = original.astype(str)
        if lowercase:
            text = text.str.lower()
        if strip:
            text = text.str.strip()
        # Keep missing values missing instead of the literal string "nan"
        result[column] = text.where(original.notna(), original)
    
    return result

def convert_types(df: pd.DataFrame, columns: Dict[str, str]) -> pd.DataFrame:
    """Convert column data types."""
    result = df.copy()
    
    for col, dtype in columns.items():
        if col in result.columns:
            try:
                result[col] = result[col].astype(dtype)
            except Exception as e:
                print(f"Failed to convert {col}: {e}")
    
    return result

# Usage example
cleaner = DataCleaner()

# Add cleaning steps
cleaner.add_step("Remove duplicates", remove_duplicates, {"subset": ["user_id"]})
cleaner.add_step("Fill missing values", fill_missing, {"columns": {"age": 0, "name": "unknown"}})
cleaner.add_step("Remove outliers", remove_outliers, {"column": "age", "method": "iqr"})
cleaner.add_step("Normalize text", standardize_text, {"column": "email", "lowercase": True})
cleaner.add_step("Convert types", convert_types, {"columns": {"age": "int64", "created_at": "datetime64[ns]"}})

# Run the cleaning pipeline
cleaned_df, report = cleaner.clean(df)
print(f"Cleaning report: {report}")
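
The IQR rule in `remove_outliers` keeps values inside [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]. As a worked check on the sample `age` column, the sketch below recomputes the bounds with the standard library. `statistics.quantiles(..., method="inclusive")` is used here as a stand-in for the pandas default (linear interpolation); for this data the two give the same quartiles.

```python
from statistics import quantiles

ages = [25, 30, -5, 40, 200]  # the sample age column from the quality-check example

# quantiles() sorts the data internally and returns the three quartile cut points
q1, _median, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = [a for a in ages if lower <= a <= upper]
print(lower, upper, kept)
```

With Q1 = 25 and Q3 = 40 the bounds are 2.5 and 62.5, so the two implausible ages (-5 and 200) are dropped and the other three rows survive.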

4. Data Analysis Engine

4.1 Statistical Analyzer

from typing import Dict, List, Optional
from scipy import stats
import numpy as np

class StatisticalAnalyzer:
    """Statistical analyzer."""
    
    def __init__(self, df: pd.DataFrame):
        self.df = df
    
    def descriptive_stats(self, columns: List[str] = None) -> Dict:
        """Descriptive statistics."""
        if columns is None:
            columns = self.df.select_dtypes(include=[np.number]).columns.tolist()
        
        result = {}
        
        for col in columns:
            if col not in self.df.columns:
                continue
            
            data = self.df[col].dropna()
            
            result[col] = {
                "count": len(data),
                "mean": data.mean(),
                "std": data.std(),
                "min": data.min(),
                "q1": data.quantile(0.25),
                "median": data.median(),
                "q3": data.quantile(0.75),
                "max": data.max(),
                "skewness": data.skew(),
                "kurtosis": data.kurtosis()
            }
        
        return result
    
    def correlation_analysis(self, method: str = "pearson") -> pd.DataFrame:
        """Correlation analysis over the numeric columns."""
        numeric_df = self.df.select_dtypes(include=[np.number])
        return numeric_df.corr(method=method)
    
    def hypothesis_test(self, column1: str, column2: str, test_type: str = "ttest") -> Dict:
        """Hypothesis test between two columns."""
        data1 = self.df[column1].dropna()
        data2 = self.df[column2].dropna()
        
        if test_type == "ttest":
            statistic, pvalue = stats.ttest_ind(data1, data2)
            return {
                "test": "t-test",
                "statistic": statistic,
                "p_value": pvalue,
                "significant": pvalue < 0.05
            }
        
        elif test_type == "mannwhitney":
            statistic, pvalue = stats.mannwhitneyu(data1, data2)
            return {
                "test": "Mann-Whitney U",
                "statistic": statistic,
                "p_value": pvalue,
                "significant": pvalue < 0.05
            }
        
        elif test_type == "chi2":
            contingency = pd.crosstab(self.df[column1], self.df[column2])
            statistic, pvalue, dof, expected = stats.chi2_contingency(contingency)
            return {
                "test": "Chi-square",
                "statistic": statistic,
                "p_value": pvalue,
                "dof": dof,
                "significant": pvalue < 0.05
            }
        
        return {}
    
    def anova(self, group_column: str, value_column: str) -> Dict:
        """One-way analysis of variance (ANOVA)."""
        groups = self.df.groupby(group_column)[value_column]
        
        group_data = [group.dropna().values for name, group in groups]
        
        statistic, pvalue = stats.f_oneway(*group_data)
        
        return {
            "test": "ANOVA",
            "statistic": statistic,
            "p_value": pvalue,
            "significant": pvalue < 0.05,
            "groups": len(group_data)
        }
    
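    def time_series_analysis(self, date_column: str, value_column: str, freq: str = "D") -> Dict:
        """Time series analysis: resampled statistics plus a detrended series."""
        df = self.df.copy()
        df[date_column] = pd.to_datetime(df[date_column])
        # Sort by time so the detrending below sees the values in order
        df = df.set_index(date_column).sort_index()
        
        # Resample at the requested frequency
        resampled = df[value_column].resample(freq)
        
        result = {
            # Statistics per `freq` bucket (daily with the default "D")
            "daily_stats": {
                "mean": resampled.mean().to_dict(),
                "sum": resampled.sum().to_dict(),
                "count": resampled.count().to_dict()
            }
        }
        
        # Trend analysis: remove the linear trend from the non-null values
        from scipy.signal import detrend
        values = df[value_column].dropna().values
        detrended = detrend(values)
        
        result["detrended"] = detrended.tolist()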
        
        return result

# Usage example
analyzer = StatisticalAnalyzer(sales_df)

# Descriptive statistics
stats_result = analyzer.descriptive_stats(["price", "quantity", "total"])
print(f"Statistics: {stats_result}")

# Correlation analysis
corr = analyzer.correlation_analysis()
print(f"Correlation matrix:\n{corr}")

# Hypothesis test
test_result = analyzer.hypothesis_test("group_a", "group_b", "ttest")
print(f"Test result: {test_result}")
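
`hypothesis_test` calls `stats.ttest_ind` with its defaults, which assume the two groups have equal variance. When that assumption is doubtful, Welch's variant is a common alternative (in SciPy it is selected with `equal_var=False`). The sketch below computes Welch's t statistic by hand with the standard library to show what the test measures; the two sample lists are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: difference in means over its estimated standard error."""
    # statistics.variance is the sample variance (divides by n - 1)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat = welch_t(group_a, group_b)
print(round(t_stat, 4))
```

This covers only the statistic; the p-value also needs the Welch-Satterthwaite degrees of freedom, which SciPy handles internally.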

4.2 Natural Language Query

from typing import Dict, List, Optional
import re

class NaturalLanguageQuery:
    """Natural language query processor."""
    
    def __init__(self, schema: Dict):
        self.schema = schema
        self.query_templates = self._build_templates()
    
    def _build_templates(self) -> List[Dict]:
        """Build the query templates (the regex patterns match Chinese-language questions)."""
        return [
            {
                "pattern": r"(.+)的平均值",
                "sql_template": "SELECT AVG({column}) FROM {table}",
                "type": "aggregation"
            },
            {
                "pattern": r"(.+)的总和",
                "sql_template": "SELECT SUM({column}) FROM {table}",
                "type": "aggregation"
            },
            {
                "pattern": r"(.+)的最大值",
                "sql_template": "SELECT MAX({column}) FROM {table}",
                "type": "aggregation"
            },
            {
                "pattern": r"(.+)的最小值",
                "sql_template": "SELECT MIN({column}) FROM {table}",
                "type": "aggregation"
            },
            {
                "pattern": r"按(.+)分组统计(.+)",
                "sql_template": "SELECT {group_column}, COUNT(*) FROM {table} GROUP BY {group_column}",
                "type": "grouping"
            },
            {
                "pattern": r"(.+)前(\d+)名",
                "sql_template": "SELECT * FROM {table} ORDER BY {column} DESC LIMIT {limit}",
                "type": "ranking"
            }
        ]
    
    def parse(self, question: str) -> Dict:
        """Parse a natural language question into SQL."""
        result = {
            "question": question,
            "sql": None,
            "type": None,
            "confidence": 0
        }
        
        for template in self.query_templates:
            match = re.search(template["pattern"], question)
            
            if match:
                result["type"] = template["type"]
                
                # Extract the parameters
                if template["type"] == "aggregation":
                    column_name = match.group(1)
                    column = self._find_column(column_name)
                    table = self._find_table(column)
                    
                    if column and table:
                        result["sql"] = template["sql_template"].format(
                            column=column,
                            table=table
                        )
                        result["confidence"] = 0.8
                
                elif template["type"] == "grouping":
                    group_col_name = match.group(1)
                    # Captured but not used: the grouping template always emits COUNT(*)
                    value_col_name = match.group(2)
                    
                    group_column = self._find_column(group_col_name)
                    table = self._find_table(group_column)
                    
                    if group_column and table:
                        result["sql"] = template["sql_template"].format(
                            group_column=group_column,
                            table=table
                        )
                        result["confidence"] = 0.7
                
                elif template["type"] == "ranking":
                    column_name = match.group(1)
                    limit = match.group(2)
                    
                    column = self._find_column(column_name)
                    table = self._find_table(column)
                    
                    if column and table:
                        result["sql"] = template["sql_template"].format(
                            column=column,
                            table=table,
                            limit=limit
                        )
                        result["confidence"] = 0.8
                
                break
        
        return result
    
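    def _find_column(self, name: str) -> Optional[str]:
        """Find a column whose name or alias contains the given text"""
        name_lower = name.strip().lower()
        
        for table_name, table_info in self.schema.items():
            for column in table_info.get("columns", []):
                # "aliases" is an optional schema key (an addition in this version) so that
                # Chinese terms such as "价格" can resolve to an English column name.
                candidates = [column["name"], *column.get("aliases", [])]
                if any(name_lower in candidate.lower() for candidate in candidates):
                    return column["name"]
        
        return None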
    
    def _find_table(self, column: str) -> Optional[str]:
        """Find the table that contains the given column"""
        for table_name, table_info in self.schema.items():
            for col in table_info.get("columns", []):
                if col["name"] == column:
                    return table_name
        
        return None
    
    def execute(self, question: str, connector: DataConnector) -> pd.DataFrame:
        """Parse the question and run the resulting SQL; returns an empty DataFrame if parsing fails"""
        parsed = self.parse(question)
        
        if parsed["sql"]:
            return connector.fetch(parsed["sql"])
        
        return pd.DataFrame()

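# Usage example
# "aliases" lists the Chinese terms a question may use for each column
schema = {
    "sales": {
        "columns": [
            {"name": "product", "type": "string", "aliases": ["产品"]},
            {"name": "price", "type": "float", "aliases": ["价格"]},
            {"name": "quantity", "type": "int", "aliases": ["数量"]},
            {"name": "region", "type": "string", "aliases": ["地区"]}
        ]
    }
}

nlq = NaturalLanguageQuery(schema)

# Parse a question ("价格的平均值" means "average of price")
result = nlq.parse("价格的平均值")
print(f"SQL: {result['sql']}")
print(f"Confidence: {result['confidence']}")

# Execute a query ("销售额前10名" means "top 10 by sales amount"); this needs a
# connector and a schema column that matches "销售额"
# df = nlq.execute("销售额前10名", db_connector)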
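
Before wiring in a real connector, the SQL strings that `parse` produces can be checked against a throwaway in-memory SQLite database. The sketch below is only an illustration under stated assumptions: the `sales` table mirrors the example schema, the rows are made-up sample data, and `run_sql` is a hypothetical helper standing in for a connector's `fetch` method.

```python
import sqlite3
from typing import List, Tuple


def run_sql(conn: sqlite3.Connection, sql: str) -> List[Tuple]:
    """Hypothetical stand-in for connector.fetch(): run a query and return all rows."""
    return conn.execute(sql).fetchall()


# In-memory database with the same columns as the example schema
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, price REAL, quantity INTEGER, region TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        # Made-up sample rows for illustration only
        ("A", 10.0, 120, "North"),
        ("B", 20.0, 40, "South"),
        ("C", 30.0, 100, "North"),
    ],
)

# SQL in the shape the aggregation, grouping, and ranking templates generate
avg_rows = run_sql(conn, "SELECT AVG(price) FROM sales")
group_rows = run_sql(conn, "SELECT region, COUNT(*) FROM sales GROUP BY region")
top_rows = run_sql(conn, "SELECT * FROM sales ORDER BY price DESC LIMIT 2")

print(avg_rows)
print(group_rows)
print(top_rows)
```

Running the generated statements this way catches template mistakes (a wrong column, a missing `GROUP BY`) without touching production data.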

5. Visualization Engine

5.1 Chart Generator

python
from typing import Dict, List, Optional

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

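class ChartGenerator:
    """Chart generator"""
    
    def __init__(self, style: str = "seaborn-v0_8"):
        # The bare "seaborn" style name was deprecated in matplotlib 3.6;
        # fall back to the default style if the requested one is unavailable.
        if style in plt.style.available:
            plt.style.use(style)
        else:
            plt.style.use("default")
        self.figures: List[plt.Figure] = []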
    
    def bar_chart(self, df: pd.DataFrame, x: str, y: str, title: Optional[str] = None) -> plt.Figure:
        """Bar chart"""
        fig, ax = plt.subplots(figsize=(10, 6))
        
        df.plot.bar(x=x, y=y, ax=ax)
        
        if title:
            ax.set_title(title)
        
        ax.set_xlabel(x)
        ax.set_ylabel(y)
        
        plt.tight_layout()
        self.figures.append(fig)
        
        return fig
    
    def line_chart(self, df: pd.DataFrame, x: str, y: str, title: Optional[str] = None) -> plt.Figure:
        """Line chart"""
        fig, ax = plt.subplots(figsize=(12, 6))
        
        df.plot.line(x=x, y=y, ax=ax)
        
        if title:
            ax.set_title(title)
        
        plt.tight_layout()
        self.figures.append(fig)
        
        return fig
    
    def pie_chart(self, df: pd.DataFrame, values: str, labels: str, title: Optional[str] = None) -> plt.Figure:
        """Pie chart"""
        fig, ax = plt.subplots(figsize=(8, 8))
        
        ax.pie(df[values], labels=df[labels], autopct='%1.1f%%')
        
        if title:
            ax.set_title(title)
        
        self.figures.append(fig)
        
        return fig
    
    def scatter_plot(self, df: pd.DataFrame, x: str, y: str, hue: Optional[str] = None, title: Optional[str] = None) -> plt.Figure:
        """Scatter plot, optionally colored by the categories in the hue column"""
        fig, ax = plt.subplots(figsize=(10, 8))
        
        if hue:
            for category in df[hue].unique():
                subset = df[df[hue] == category]
                ax.scatter(subset[x], subset[y], label=category, alpha=0.6)
            ax.legend()
        else:
            ax.scatter(df[x], df[y], alpha=0.6)
        
        ax.set_xlabel(x)
        ax.set_ylabel(y)
        
        if title:
            ax.set_title(title)
        
        plt.tight_layout()
        self.figures.append(fig)
        
        return fig
    
    def heatmap(self, df: pd.DataFrame, title: Optional[str] = None) -> plt.Figure:
        """Heatmap of a numeric DataFrame (for example, a correlation matrix)"""
        fig, ax = plt.subplots(figsize=(10, 8))
        
        sns.heatmap(df, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
        
        if title:
            ax.set_title(title)
        
        plt.tight_layout()
        self.figures.append(fig)
        
        return fig
    
    def histogram(self, df: pd.DataFrame, column: str, bins: int = 30, title: Optional[str] = None) -> plt.Figure:
        """Histogram"""
        fig, ax = plt.subplots(figsize=(10, 6))
        
        ax.hist(df[column], bins=bins, edgecolor='black')
        
        ax.set_xlabel(column)
        ax.set_ylabel('Frequency')
        
        if title:
            ax.set_title(title)
        
        plt.tight_layout()
        self.figures.append(fig)
        
        return fig
    
    def box_plot(self, df: pd.DataFrame, x: str, y: str, title: Optional[str] = None) -> plt.Figure:
        """Box plot of y grouped by x"""
        fig, ax = plt.subplots(figsize=(10, 6))
        
        df.boxplot(column=y, by=x, ax=ax)
        
        if title:
            ax.set_title(title)
        
        plt.tight_layout()
        self.figures.append(fig)
        
        return fig
    
    def save_all(self, directory: str, prefix: str = "chart") -> None:
        """Save all generated charts as PNG files"""
        import os
        
        os.makedirs(directory, exist_ok=True)
        
        for i, fig in enumerate(self.figures):
            path = os.path.join(directory, f"{prefix}_{i+1}.png")
            fig.savefig(path, dpi=150)
            print(f"Saved chart: {path}")
    
    def close_all(self) -> None:
        """Close all charts and release their memory"""
        for fig in self.figures:
            plt.close(fig)
        self.figures.clear()

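# Usage example
# Illustrative sample data (made up for this example) so the calls below run as written
sales_df = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "sales": [1200.0, 800.0, 1500.0, 600.0],
    "price": [10.0, 20.0, 15.0, 30.0],
    "quantity": [120, 40, 100, 20],
    "region": ["North", "South", "North", "South"],
})
time_df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="D"),
    "revenue": [100.0, 120.0, 90.0, 150.0],
})
category_df = pd.DataFrame({
    "category": ["X", "Y", "Z"],
    "amount": [50.0, 30.0, 20.0],
})

chart_gen = ChartGenerator()

# Generate charts
chart_gen.bar_chart(sales_df, x="product", y="sales", title="Sales by Product")
chart_gen.line_chart(time_df, x="date", y="revenue", title="Revenue Trend")
chart_gen.pie_chart(category_df, values="amount", labels="category", title="Category Share")
chart_gen.scatter_plot(sales_df, x="price", y="quantity", hue="region", title="Price vs. Quantity")

# Save charts (a relative path avoids needing write access to the filesystem root)
chart_gen.save_all("output/charts", "analysis")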
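
The summary lists automatic chart recommendation as a visualization feature, while the generator above leaves the choice of chart to the caller. One simple way to approach it is a small rule table keyed on column types. The sketch below is an assumption, not part of OpenClaw: `recommend_chart` and its rules are hypothetical, and the type names follow the `type` field of the example schema in section 4.2 (with `date` and `datetime` added for time columns).

```python
from typing import Optional

NUMERIC_TYPES = {"int", "float"}
TIME_TYPES = {"date", "datetime"}


def recommend_chart(x_type: str, y_type: Optional[str] = None) -> str:
    """Return the name of a ChartGenerator method for the given column types.

    Hypothetical rules:
    - one numeric column          -> histogram
    - one non-numeric column      -> pie_chart (after counting rows per category)
    - time x, numeric y           -> line_chart
    - numeric x, numeric y        -> scatter_plot
    - any other x, numeric y      -> bar_chart
    """
    if y_type is None:
        return "histogram" if x_type in NUMERIC_TYPES else "pie_chart"
    if y_type not in NUMERIC_TYPES:
        raise ValueError(f"y column must be numeric, got {y_type!r}")
    if x_type in TIME_TYPES:
        return "line_chart"
    if x_type in NUMERIC_TYPES:
        return "scatter_plot"
    return "bar_chart"


print(recommend_chart("string", "float"))
print(recommend_chart("date", "float"))
print(recommend_chart("float", "int"))
print(recommend_chart("float"))
```

The returned name can be resolved with `getattr(chart_gen, name)` to call the matching method on a `ChartGenerator` instance.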

6. Best Practices

6.1 Platform Design Principles

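| Principle | Description | Practice |
| --- | --- | --- |
| Ease of use | Lower the barrier to entry | Natural language query |
| Extensibility | Support new data sources | Plug-in connectors |
| Performance | Fast responses | Caching + indexing |
| Security | Data protection | Access control |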

6.2 Common Issues

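| Problem | Cause | Solution |
| --- | --- | --- |
| Slow queries | Large data volume | Partitioning + indexing |
| Inaccurate results | Poor data quality | Data cleaning |
| Garbled chart text | Active font lacks CJK glyphs (often mistaken for an encoding problem) | Set a CJK-capable font |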
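
For the garbled chart text issue, a common fix is to point matplotlib at a font that contains CJK glyphs before drawing. A minimal sketch, assuming at least one of the listed fonts is installed; the font names are examples and vary by operating system:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the example runs without a display

import matplotlib.pyplot as plt
from matplotlib import font_manager

# Candidate CJK-capable fonts (assumed names; availability depends on the OS)
candidates = ["SimHei", "Microsoft YaHei", "PingFang SC", "Noto Sans CJK SC"]

# Keep only the candidates that matplotlib can actually find on this machine
installed = {font.name for font in font_manager.fontManager.ttflist}
available = [name for name in candidates if name in installed]

if available:
    # Put the CJK fonts first in the sans-serif fallback list
    plt.rcParams["font.sans-serif"] = available + list(plt.rcParams["font.sans-serif"])

# Draw the minus sign as an ASCII hyphen, which every font contains
plt.rcParams["axes.unicode_minus"] = False

fig, ax = plt.subplots()
ax.set_title("产品销售额")  # "Sales by Product"
fig.canvas.draw()
plt.close(fig)
```

If `available` turns out empty, installing one CJK font and clearing matplotlib's font cache is usually the next step.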

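7. Summary

7.1 Key Takeaways

Using a complete data analysis platform as a case study, this article has shown how OpenClaw can be applied to data analysis scenarios: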

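| Module | Core function | Key techniques |
| --- | --- | --- |
| Data collection | Multi-source ingestion | Connectors + scheduling |
| Data cleaning | Quality assurance | Rules + processors |
| Data analysis | Multi-dimensional analysis | Statistics + ML |
| Intelligent query | Natural language | NLP + SQL |
| Visualization | Chart display | Automatic recommendation |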

7.2 Next Steps

  • Part 76: OpenClaw Case Study: Content Creation System
