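Preface
💡 Pain points: Is database operations work too complex? Does SQL tuning depend on individual experience? Is troubleshooting slow? Is the risk of operator error high?
🎯 Solution: build an AI Agent database operations assistant that handles automated monitoring, intelligent diagnosis, SQL optimization, and self-healing, with the stated goal of making DBAs up to 10x more efficient.
AI Agent database operations capability matrix: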
[Diagram: Input (monitoring metrics, slow query logs, alert events, natural language) → AI Agent (perception layer, diagnosis layer, reasoning layer, execution layer) → Output (diagnostic report, optimization plan, automatic repair, SQL suggestions)]
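Traditional DBA vs. AI Agent DBA:
| Capability | Traditional DBA | AI Agent DBA |
|---|---|---|
| 24/7 monitoring | ⚠️ On-call rotation | ✅ Around the clock |
| Incident response | 🐢 Minutes to hours | ⚡ Seconds |
| SQL tuning | ⚠️ Depends on experience | ✅ Automated analysis |
| Capacity planning | ⚠️ Rough estimates | ✅ Data-driven forecasting |
| Change review | ✅ Human sign-off | ⚠️ Requires an approval mechanism |
| Knowledge retention | ❌ Lost with staff turnover | ✅ Persistent knowledge base |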
1. Database Connections and Monitoring
1.1 Unified Database Connection Layer
python
# ===== 数据库连接抽象层 =====
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Any, Tuple
from datetime import datetime, timedelta
import time
import json
@dataclass
class DatabaseConfig:
"""数据库配置"""
db_type: str # mysql / postgresql / mongodb / redis
host: str
port: int
database: str
username: str = ""
password: str = ""
pool_size: int = 5
connect_timeout: int = 10
@property
def safe_display(self) -> str:
return f"{self.db_type}://{self.host}:{self.port}/{self.database}"
class DatabaseConnection(ABC):
"""数据库连接抽象基类"""
def __init__(self, config: DatabaseConfig):
self.config = config
self._conn = None
self._connected = False
@abstractmethod
def connect(self) -> bool:
"""建立连接"""
pass
@abstractmethod
def execute(self, sql: str, params: tuple = None) -> Any:
"""执行 SQL"""
pass
@abstractmethod
def query(self, sql: str, params: tuple = None) -> List[Dict]:
"""查询"""
pass
@abstractmethod
def close(self):
"""关闭连接"""
pass
@abstractmethod
def get_metrics(self) -> Dict:
"""获取数据库指标"""
pass
def __enter__(self):
self.connect()
return self
def __exit__(self, *args):
self.close()
class MySQLConnection(DatabaseConnection):
"""MySQL 连接"""
def connect(self) -> bool:
try:
import pymysql
self._conn = pymysql.connect(
host=self.config.host,
port=self.config.port,
user=self.config.username,
password=self.config.password,
database=self.config.database,
connect_timeout=self.config.connect_timeout,
cursorclass=pymysql.cursors.DictCursor
)
self._connected = True
return True
except Exception as e:
print(f"MySQL 连接失败: {e}")
return False
def execute(self, sql: str, params: tuple = None) -> Any:
with self._conn.cursor() as cursor:
cursor.execute(sql, params)
self._conn.commit()
return cursor.rowcount
def query(self, sql: str, params: tuple = None) -> List[Dict]:
with self._conn.cursor() as cursor:
cursor.execute(sql, params)
return cursor.fetchall()
def close(self):
if self._conn:
self._conn.close()
self._connected = False
    def get_metrics(self) -> Dict:
        """Collect MySQL metrics. Values from SHOW STATUS are cumulative since server start."""
        def status(name: str, default: int = 0) -> int:
            rows = self.query("SHOW STATUS LIKE %s", (name,))
            return int(rows[0]["Value"]) if rows else default

        metrics = {}
        # Current connections (a gauge)
        metrics["connections"] = status("Threads_connected")
        # Configured connection limit, used by the health check to compute usage
        rows = self.query("SHOW VARIABLES LIKE 'max_connections'")
        metrics["max_connections"] = int(rows[0]["Value"]) if rows else 151
        # Total statements executed (cumulative; QPS is the delta between two samples)
        metrics["queries"] = status("Queries")
        # Slow queries (cumulative)
        metrics["slow_queries"] = status("Slow_queries")
        # InnoDB rows read (cumulative); the status variable is Innodb_rows_read
        metrics["innodb_rows_read"] = status("Innodb_rows_read")
        # Buffer pool hit rate: share of logical reads served without a disk read
        reads = status("Innodb_buffer_pool_read_requests", default=1)
        misses = status("Innodb_buffer_pool_reads")
        metrics["buffer_pool_hit_rate"] = round((1 - misses / max(reads, 1)) * 100, 2)
        return metrics
class PostgreSQLConnection(DatabaseConnection):
"""PostgreSQL 连接"""
def connect(self) -> bool:
try:
import psycopg2
self._conn = psycopg2.connect(
host=self.config.host,
port=self.config.port,
user=self.config.username,
password=self.config.password,
dbname=self.config.database,
connect_timeout=self.config.connect_timeout
)
self._connected = True
return True
except Exception as e:
print(f"PostgreSQL 连接失败: {e}")
return False
def execute(self, sql: str, params: tuple = None) -> Any:
with self._conn.cursor() as cursor:
cursor.execute(sql, params)
self._conn.commit()
return cursor.rowcount
def query(self, sql: str, params: tuple = None) -> List[Dict]:
import psycopg2.extras
with self._conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cursor:
cursor.execute(sql, params)
return [dict(row) for row in cursor.fetchall()]
def close(self):
if self._conn:
self._conn.close()
self._connected = False
    def get_metrics(self) -> Dict:
        """Collect key PostgreSQL metrics."""
        metrics = {}
        # Connections (all sessions in pg_stat_activity, including idle ones)
        result = self.query("SELECT count(*) AS cnt FROM pg_stat_activity")
        metrics["connections"] = result[0]["cnt"] if result else 0
        # Database size in bytes; the name is passed as a parameter, not interpolated
        result = self.query("SELECT pg_database_size(%s) AS size", (self.config.database,))
        metrics["database_size"] = result[0]["size"] if result else 0
        # Cache hit ratio; NULLIF avoids division by zero before any blocks are read
        result = self.query("""
            SELECT sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
            FROM pg_statio_user_tables
        """)
        metrics["cache_hit_ratio"] = round(
            float(result[0]["ratio"] or 0) * 100, 2
        ) if result else 0
        # Lock requests still waiting to be granted (not the same as deadlocks)
        result = self.query("SELECT count(*) AS cnt FROM pg_locks WHERE NOT granted")
        metrics["waiting_locks"] = result[0]["cnt"] if result else 0
        return metrics
class ConnectionFactory:
"""连接工厂"""
_registry: Dict[str, type] = {
"mysql": MySQLConnection,
"postgresql": PostgreSQLConnection,
}
@classmethod
def create(cls, config: DatabaseConfig) -> DatabaseConnection:
conn_class = cls._registry.get(config.db_type)
if not conn_class:
raise ValueError(f"不支持的数据库类型: {config.db_type}")
return conn_class(config)
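# Usage. Credentials are read from environment variables (DB_USER and DB_PASSWORD are
# placeholder names) rather than hard-coded; prefer a least-privilege account over root.
import os
config = DatabaseConfig(
    db_type="mysql",
    host="localhost",
    port=3306,
    database="production",
    username=os.environ.get("DB_USER", "root"),
    password=os.environ.get("DB_PASSWORD", "")
)
conn = ConnectionFactory.create(config)
conn.connect()  # create() only builds the object; connect before running queries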
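The MySQL `get_metrics` above labels the `Queries` status value as QPS, but that value (like `Slow_queries`) is a counter that only grows. A per-second rate needs two samples and the time between them. The following is a minimal sketch of that conversion; `CounterSample` and `counters_to_rates` are hypothetical helper names, not part of the original code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CounterSample:
    """One reading of cumulative counters (hypothetical helper type)."""
    timestamp: float            # seconds since the epoch
    counters: Dict[str, int]    # e.g. {"queries": 120000, "slow_queries": 40}


def counters_to_rates(prev: CounterSample, curr: CounterSample) -> Dict[str, float]:
    """Convert two cumulative samples into per-second rates."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    rates: Dict[str, float] = {}
    for name, value in curr.counters.items():
        if name not in prev.counters:
            continue
        delta = value - prev.counters[name]
        # A negative delta means the counter was reset (for example by a restart); skip it.
        if delta < 0:
            continue
        rates[f"{name}_per_sec"] = delta / elapsed
    return rates


prev = CounterSample(timestamp=1000.0, counters={"queries": 120_000, "slow_queries": 40})
curr = CounterSample(timestamp=1060.0, counters={"queries": 126_000, "slow_queries": 43})
print(counters_to_rates(prev, curr))
```

With these two samples taken 60 seconds apart, 6,000 new statements correspond to 100 queries per second. Feeding rates rather than raw counters into trend and anomaly checks avoids flagging a counter simply because it keeps increasing.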
1.2 Real-Time Metric Collection
python
# ===== 实时监控采集器 =====
import threading
import time
from collections import deque
@dataclass
class MetricPoint:
"""指标数据点"""
timestamp: float
value: float
tags: Dict[str, str] = field(default_factory=dict)
class MetricCollector:
"""指标采集器"""
    def __init__(self, conn: DatabaseConnection, interval: int = 60):
        self.conn = conn
        self.interval = interval
        self._metrics: Dict[str, deque] = {}
        # Keep 24 hours of points at the configured interval (1440 points at 60 s)
        self._max_points = max(1, 24 * 3600 // max(interval, 1))
        self._running = False
        self._thread = None
        self._baseline: Dict[str, float] = {}
def start(self):
"""启动采集"""
if self._running:
return
self._running = True
self._thread = threading.Thread(target=self._collect_loop, daemon=True)
self._thread.start()
print(f"📊 监控采集已启动 (间隔: {self.interval}s)")
def stop(self):
"""停止采集"""
self._running = False
if self._thread:
self._thread.join(timeout=5)
print("📊 监控采集已停止")
    def _collect_loop(self):
        """Collection loop: sample metrics, then sleep for the configured interval."""
        while self._running:
            try:
                metrics = self.conn.get_metrics()
                now = time.time()
                for key, value in metrics.items():
                    if key not in self._metrics:
                        self._metrics[key] = deque(maxlen=self._max_points)
                    self._metrics[key].append(MetricPoint(
                        timestamp=now, value=float(value)
                    ))
                # The baseline is captured once, from the first successful sample
                if not self._baseline:
                    self._baseline = {k: float(v) for k, v in metrics.items()}
            except Exception as e:
                print(f"Collection error: {e}")
            time.sleep(self.interval)
def get_current(self) -> Dict[str, float]:
"""获取当前指标"""
return {
key: points[-1].value if points else 0
for key, points in self._metrics.items()
}
def get_history(self, metric: str, minutes: int = 60) -> List[MetricPoint]:
"""获取历史指标"""
if metric not in self._metrics:
return []
cutoff = time.time() - minutes * 60
return [
p for p in self._metrics[metric]
if p.timestamp >= cutoff
]
def detect_anomaly(self, metric: str, threshold: float = 2.0) -> bool:
"""简单异常检测(Z-Score)"""
history = self.get_history(metric, minutes=60)
if len(history) < 10:
return False
values = [p.value for p in history]
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
if std == 0:
return False
current = values[-1]
z_score = abs(current - mean) / std
return z_score > threshold
def get_trend(self, metric: str, minutes: int = 60) -> str:
"""获取趋势"""
history = self.get_history(metric, minutes)
if len(history) < 5:
return "insufficient_data"
recent = [p.value for p in history[-5:]]
older = [p.value for p in history[-10:-5]] if len(history) >= 10 else recent[:3]
recent_avg = sum(recent) / len(recent)
older_avg = sum(older) / len(older)
change = (recent_avg - older_avg) / max(older_avg, 0.001) * 100
if change > 10:
return "increasing"
elif change < -10:
return "decreasing"
return "stable"
def health_check(self) -> Dict:
"""健康检查"""
current = self.get_current()
issues = []
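        # Connection usage. max_connections comes from the baseline when the backend
        # reports it; otherwise fall back to 151, the MySQL default.
        max_conn = max(self._baseline.get("max_connections", 151), 1)
        conn_usage = current.get("connections", 0) / max_conn * 100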
if conn_usage > 80:
issues.append({
"severity": "critical" if conn_usage > 95 else "warning",
"metric": "connections",
"message": f"连接数使用率 {conn_usage:.1f}%(阈值 80%)",
"value": current.get("connections", 0)
})
# 缓冲池命中率
hit_rate = current.get("buffer_pool_hit_rate", 100)
if hit_rate < 95:
issues.append({
"severity": "warning",
"metric": "buffer_pool_hit_rate",
"message": f"缓冲池命中率 {hit_rate}%(阈值 95%)",
"value": hit_rate
})
# 慢查询
slow_queries = current.get("slow_queries", 0)
baseline_slow = self._baseline.get("slow_queries", 0)
if slow_queries > baseline_slow * 2 and slow_queries > 10:
issues.append({
"severity": "warning",
"metric": "slow_queries",
"message": f"慢查询增长过快: {slow_queries}(基线 {baseline_slow})",
"value": slow_queries
})
# 异常指标
for metric in current:
if self.detect_anomaly(metric):
issues.append({
"severity": "info",
"metric": metric,
"message": f"{metric} 检测到异常波动",
"value": current[metric]
})
return {
"status": "unhealthy" if any(i["severity"] == "critical" for i in issues) else "healthy",
"issues": issues,
"metrics": current,
"timestamp": datetime.now().isoformat()
}
# 使用
collector = MetricCollector(conn, interval=30)
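`detect_anomaly` above computes the mean and standard deviation over a window that already contains the latest point, so a spike partly hides itself by inflating both values. A variant that scores the latest point against the earlier points only (leave-one-out) can be sketched as follows; `zscore_of_latest` is a hypothetical name and the sample values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def zscore_of_latest(values: Sequence[float]) -> float:
    """Z-score of the last value measured against all earlier values."""
    if len(values) < 3:
        raise ValueError("need at least two history points plus the latest value")
    history, latest = values[:-1], values[-1]
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: no change is normal, any change is maximally unusual
        return 0.0 if latest == history[0] else float("inf")
    return abs(latest - mean(history)) / sigma


steady = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
print(zscore_of_latest(steady + [101]) > 2.0)   # small wobble
print(zscore_of_latest(steady + [160]) > 2.0)   # large spike
```

The threshold of 2.0 matches the default used by `detect_anomaly`. Whether leave-one-out is preferable depends on the metric; for short windows it tends to react more sharply to single outliers.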
2. Slow Query Analysis
2.1 Slow Query Collection and Parsing
python
# ===== 慢查询分析 =====
import re
from dataclasses import dataclass
@dataclass
class SlowQuery:
"""慢查询记录"""
sql: str
query_time: float # 秒
lock_time: float
rows_examined: int
rows_sent: int
timestamp: str = ""
database: str = ""
user: str = ""
host: str = ""
@property
def efficiency(self) -> float:
"""查询效率(rows_sent / rows_examined)"""
if self.rows_examined == 0:
return 1.0
return self.rows_sent / self.rows_examined
@property
def is_full_scan(self) -> bool:
"""是否全表扫描"""
return self.rows_examined > 10000 and self.efficiency < 0.01
@property
def severity(self) -> str:
if self.query_time > 60:
return "critical"
elif self.query_time > 10:
return "major"
elif self.query_time > 3:
return "minor"
return "info"
class SlowQueryAnalyzer:
"""慢查询分析器"""
def __init__(self, conn: DatabaseConnection):
self.conn = conn
def get_slow_queries(self, limit: int = 50, min_time: float = 1.0) -> List[SlowQuery]:
"""获取慢查询"""
if isinstance(self.conn, MySQLConnection):
return self._get_mysql_slow_queries(limit, min_time)
elif isinstance(self.conn, PostgreSQLConnection):
return self._get_pg_slow_queries(limit, min_time)
return []
    def _get_mysql_slow_queries(self, limit: int, min_time: float) -> List[SlowQuery]:
        """MySQL slow statements from the performance_schema digest summary.

        Timer columns are in picoseconds. Row counts are divided by COUNT_STAR so they
        are per-execution averages, consistent with the average time.
        """
        queries = self.conn.query("""
            SELECT
                DIGEST_TEXT as sql_text,
                SUM_TIMER_WAIT / 1000000000000 as total_time_sec,
                AVG_TIMER_WAIT / 1000000000000 as avg_time_sec,
                SUM_ROWS_EXAMINED / NULLIF(COUNT_STAR, 0) as rows_examined,
                SUM_ROWS_SENT / NULLIF(COUNT_STAR, 0) as rows_sent,
                COUNT_STAR as exec_count,
                FIRST_SEEN as first_seen,
                LAST_SEEN as last_seen
            FROM performance_schema.events_statements_summary_by_digest
            WHERE AVG_TIMER_WAIT / 1000000000000 > %s
            ORDER BY AVG_TIMER_WAIT DESC
            LIMIT %s
        """, (min_time, limit))
        results = []
        for row in queries:
            results.append(SlowQuery(
                # DIGEST_TEXT can be NULL, so fall back to an empty string
                sql=row.get("sql_text") or "",
                query_time=float(row.get("avg_time_sec") or 0),
                lock_time=0,
                rows_examined=int(row.get("rows_examined") or 0),
                rows_sent=int(row.get("rows_sent") or 0),
                timestamp=str(row.get("last_seen") or "")
            ))
        return results
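    def _get_pg_slow_queries(self, limit: int, min_time: float) -> List[SlowQuery]:
        """PostgreSQL slow statements from the pg_stat_statements extension.

        mean_exec_time and max_exec_time are in milliseconds and exist from
        PostgreSQL 13; older versions name them mean_time and max_time.
        """
        queries = self.conn.query("""
            SELECT
                query,
                mean_exec_time / 1000 as avg_time_sec,
                rows,
                calls,
                max_exec_time / 1000 as max_time_sec
            FROM pg_stat_statements
            WHERE mean_exec_time / 1000 > %s
            ORDER BY mean_exec_time DESC
            LIMIT %s
        """, (min_time, limit))
        results = []
        for row in queries:
            results.append(SlowQuery(
                sql=row.get("query") or "",
                query_time=float(row.get("avg_time_sec") or 0),
                lock_time=0,
                # pg_stat_statements has no rows-examined counter; "rows" (rows returned
                # or affected) is used for both fields, so efficiency is always 1.0 here
                rows_examined=int(row.get("rows") or 0),
                rows_sent=int(row.get("rows") or 0)
            ))
        return results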
def analyze_query(self, query: SlowQuery) -> Dict:
"""分析单条慢查询"""
analysis = {
"sql": query.sql[:200],
"severity": query.severity,
"query_time": query.query_time,
"efficiency": round(query.efficiency, 4),
"issues": [],
"suggestions": []
}
# 1. 全表扫描检测
if query.is_full_scan:
analysis["issues"].append("全表扫描")
analysis["suggestions"].append(
self._suggest_index(query.sql)
)
# 2. SELECT * 检测
if re.search(r'SELECT\s+\*', query.sql, re.IGNORECASE):
analysis["issues"].append("SELECT * 使用")
analysis["suggestions"].append("只查询需要的列,避免 SELECT *")
# 3. 无 WHERE 条件
if not re.search(r'\bWHERE\b', query.sql, re.IGNORECASE):
analysis["issues"].append("缺少 WHERE 条件")
analysis["suggestions"].append("添加 WHERE 条件限制结果集")
# 4. LIKE 前缀通配符
if re.search(r"LIKE\s+['\"]%", query.sql, re.IGNORECASE):
analysis["issues"].append("LIKE 前缀通配符")
analysis["suggestions"].append("避免 LIKE '%xxx',无法使用索引")
# 5. 子查询
if re.search(r'\bSELECT\b.*\bFROM\b.*\(\s*SELECT', query.sql, re.IGNORECASE | re.DOTALL):
analysis["issues"].append("嵌套子查询")
analysis["suggestions"].append("考虑将子查询改写为 JOIN")
# 6. ORDER BY 常量
if re.search(r'ORDER\s+BY\s+\d+', query.sql, re.IGNORECASE):
analysis["issues"].append("ORDER BY 列号")
analysis["suggestions"].append("使用列名替代列号,提升可读性")
# 7. OR 条件
if re.search(r'\bOR\b', query.sql, re.IGNORECASE):
analysis["issues"].append("OR 条件")
analysis["suggestions"].append("OR 可能导致索引失效,考虑 UNION ALL")
# 8. 大量行扫描低效
if query.efficiency < 0.001 and query.rows_examined > 1000:
analysis["issues"].append(f"低效扫描(效率 {query.efficiency:.4f})")
analysis["suggestions"].append("检查索引覆盖和查询条件选择性")
return analysis
    def _suggest_index(self, sql: str) -> str:
        """Suggest an index from the WHERE and JOIN columns (regex heuristic, not a parser)."""
        # Table name
        table_match = re.search(r'FROM\s+(\w+)', sql, re.IGNORECASE)
        if not table_match:
            return "Could not parse the table name"
        table = table_match.group(1)
        # Columns in the WHERE clause; DOTALL lets the clause span multiple lines
        where_match = re.search(
            r'WHERE\s+(.+?)(?:\s+GROUP|\s+ORDER|\s+LIMIT|$)',
            sql, re.IGNORECASE | re.DOTALL
        )
        columns = []
        if where_match:
            col_matches = re.findall(
                r'(\w+)\s*[=<>!]|(\w+)\s+(?:IN|LIKE|BETWEEN)',
                where_match.group(1), re.IGNORECASE
            )
            for m in col_matches:
                col = m[0] or m[1]
                if col and col.upper() not in ('AND', 'OR', 'NOT', 'NULL') and col not in columns:
                    columns.append(col)
        # Columns used in JOIN conditions on this table
        join_matches = re.findall(r'JOIN\s+\w+\s+ON\s+(\w+\.\w+)\s*=\s*(\w+\.\w+)', sql, re.IGNORECASE)
        for m in join_matches:
            for col_ref in m:
                col = col_ref.split(".")[1]
                if col_ref.startswith(f"{table}.") and col not in columns:
                    columns.append(col)
        if columns:
            # Use the same columns for the index name and the index definition
            index_cols = columns[:3]
            cols_str = ", ".join(index_cols)
            return f"Suggested index: CREATE INDEX idx_{table}_{'_'.join(index_cols)} ON {table}({cols_str})"
        return f"Table {table} may need an index"
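    def get_explain(self, sql: str) -> List[Dict]:
        """Get the execution plan for a single SELECT statement.

        EXPLAIN cannot take the statement as a bound parameter, so the text is checked
        first. Normalized digest text containing ? placeholders cannot be explained.
        """
        statement = sql.strip().rstrip(";")
        if ";" in statement or not re.match(r"SELECT\b", statement, re.IGNORECASE):
            raise ValueError("get_explain only accepts a single SELECT statement")
        if isinstance(self.conn, MySQLConnection):
            return self.conn.query(f"EXPLAIN {statement}")
        elif isinstance(self.conn, PostgreSQLConnection):
            return self.conn.query(f"EXPLAIN (FORMAT JSON) {statement}")
        return []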
def generate_report(self, queries: List[SlowQuery] = None) -> str:
"""生成慢查询报告"""
if queries is None:
queries = self.get_slow_queries()
if not queries:
return "✅ 没有发现慢查询"
lines = [
"🐌 慢查询分析报告",
"=" * 60,
f"发现慢查询: {len(queries)} 条",
""
]
for i, query in enumerate(queries[:20], 1):
analysis = self.analyze_query(query)
icon = {"critical": "🔴", "major": "🟡", "minor": "🔵", "info": "⚪"}[query.severity]
lines.append(f"{icon} #{i} [{query.severity}] 耗时 {query.query_time:.2f}s")
lines.append(f" SQL: {query.sql[:100]}...")
lines.append(f" 扫描: {query.rows_examined} 行 → 返回 {query.rows_sent} 行")
lines.append(f" 效率: {query.efficiency:.4f}")
if analysis["issues"]:
lines.append(f" 问题: {', '.join(analysis['issues'])}")
if analysis["suggestions"]:
lines.append(f" 建议: {analysis['suggestions'][0]}")
lines.append("")
return "\n".join(lines)
# 使用
analyzer = SlowQueryAnalyzer(conn)
report = analyzer.generate_report()
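The analyzer above reads aggregated statistics from `performance_schema` and `pg_stat_statements`. When only the MySQL slow query log file is available, the same `SlowQuery` fields can be recovered by parsing it. The sketch below assumes the common log layout in which each entry has a `# Query_time: ... Lock_time: ... Rows_sent: ... Rows_examined: ...` line followed by the statement; the exact layout varies by server version and settings, and `parse_slow_log` and `SAMPLE_LOG` are hypothetical names with made-up data.

```python
import re
from typing import Dict, List

# Statistics line that precedes the statement in each slow log entry
STATS_RE = re.compile(
    r"# Query_time:\s*(?P<query_time>[\d.]+)\s+Lock_time:\s*(?P<lock_time>[\d.]+)\s+"
    r"Rows_sent:\s*(?P<rows_sent>\d+)\s+Rows_examined:\s*(?P<rows_examined>\d+)"
)


def parse_slow_log(text: str) -> List[Dict]:
    """Parse slow log text into dicts whose keys match the SlowQuery fields."""
    entries: List[Dict] = []
    current = None
    for line in text.splitlines():
        stats = STATS_RE.match(line)
        if stats:
            current = {
                "query_time": float(stats["query_time"]),
                "lock_time": float(stats["lock_time"]),
                "rows_sent": int(stats["rows_sent"]),
                "rows_examined": int(stats["rows_examined"]),
                "sql": "",
            }
            entries.append(current)
        elif current is not None and line.strip() and not line.startswith("#"):
            if line.startswith("SET timestamp="):
                continue  # bookkeeping statement written by the server, not the query
            current["sql"] = (current["sql"] + " " + line.strip()).strip()
    return entries


SAMPLE_LOG = """\
# Time: 2024-01-15T10:23:45.123456Z
# User@Host: app[app] @ 10.0.0.5 [10.0.0.5]  Id: 1234
# Query_time: 12.345678  Lock_time: 0.000120 Rows_sent: 10  Rows_examined: 1500000
SET timestamp=1705314225;
SELECT * FROM orders
WHERE customer_id = 42;
# Time: 2024-01-15T10:24:01.000001Z
# User@Host: app[app] @ 10.0.0.5 [10.0.0.5]  Id: 1240
# Query_time: 3.200000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 20
SET timestamp=1705314241;
SELECT count(*) FROM users;
"""

for entry in parse_slow_log(SAMPLE_LOG):
    print(entry["query_time"], entry["rows_examined"], entry["sql"])
```

Unlike the digest tables, the log keeps literal values, so pattern checks such as the leading-wildcard `LIKE` rule can actually match.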
2.2 Automatic SQL Optimization
python
# ===== SQL 自动优化 =====
class SQLOptimizer:
"""SQL 优化器"""
# 优化规则
    RULES = {
        "select_star": {
            "pattern": r"SELECT\s+\*",
            "message": "Avoid SELECT *",
            "auto_fix": False
        },
        "implicit_conversion": {
            # A quoted number compared with a column, e.g. WHERE user_id = '123'
            "pattern": r"WHERE\s+\w+\s*=\s*['\"]\d+['\"]",
            "message": "Implicit type conversion may prevent index use",
            "auto_fix": False
        },
        "or_condition": {
            "pattern": r"\bOR\b",
            "message": "OR may prevent index use",
            # No safe automatic rewrite is implemented for OR
            "auto_fix": False
        },
        "subquery": {
            "pattern": r"\bIN\s*\(\s*SELECT\b",
            "message": "An IN subquery can often be rewritten as a JOIN",
            "auto_fix": True
        },
        "like_prefix": {
            "pattern": r"LIKE\s+['\"]%",
            "message": "A leading wildcard cannot use an index",
            "auto_fix": False
        },
        "function_on_index": {
            "pattern": r"WHERE\s+\w+\s*\(",
            "message": "Applying a function to a column in WHERE prevents index use",
            "auto_fix": True
        },
        "order_by_rand": {
            "pattern": r"ORDER\s+BY\s+RAND\s*\(",
            "message": "ORDER BY RAND() forces a full table scan",
            "auto_fix": True
        },
        "group_by_without_index": {
            "pattern": r"GROUP\s+BY",
            "message": "GROUP BY may need index support",
            "auto_fix": False
        },
        "limit_without_order": {
            # Matches LIMIT only when the statement contains no ORDER BY
            "pattern": r"(?s)^(?!.*\bORDER\s+BY\b).*\bLIMIT\s+\d+",
            "message": "LIMIT without ORDER BY returns rows in an undefined order",
            "auto_fix": False
        },
    }
def analyze(self, sql: str) -> Dict:
"""分析 SQL 并给出优化建议"""
issues = []
optimizations = []
for rule_id, rule in self.RULES.items():
if re.search(rule["pattern"], sql, re.IGNORECASE):
issue = {
"rule": rule_id,
"message": rule["message"],
"auto_fixable": rule["auto_fix"]
}
issues.append(issue)
if rule["auto_fix"]:
optimized = self._apply_fix(sql, rule_id)
if optimized != sql:
optimizations.append({
"rule": rule_id,
"original": sql,
"optimized": optimized
})
return {
"sql": sql,
"issues": issues,
"optimizations": optimizations,
"issue_count": len(issues),
"auto_fix_count": len(optimizations)
}
    def _apply_fix(self, sql: str, rule_id: str) -> str:
        """Apply one rewrite rule; returns the SQL unchanged when no safe rewrite is known."""
        if rule_id in ("or_condition", "subquery"):
            # OR -> UNION ALL and IN (SELECT ...) -> JOIN both need schema knowledge
            # (duplicate rows, NULL handling, column uniqueness) to stay equivalent, so a
            # regex rewrite is not attempted. The issue is still reported by analyze().
            return sql
        elif rule_id == "function_on_index":
            # DATE(col) = '2024-01-01'  ->  (col >= '2024-01-01' AND col < '2024-01-02')
            def to_range(match):
                col, val = match.group(1), match.group(2)
                try:
                    next_day = self._next_day(val)
                except ValueError:
                    return match.group(0)  # not a valid calendar date; leave as is
                return f"({col} >= '{val}' AND {col} < '{next_day}')"
            # Only DATE(...) comparisons are rewritten; other functions are left alone
            return re.sub(
                r"\bDATE\(\s*(\w+)\s*\)\s*=\s*['\"](\d{4}-\d{2}-\d{2})['\"]",
                to_range, sql, flags=re.IGNORECASE
            )
        elif rule_id == "order_by_rand":
            # A block comment is used because "--" would comment out a LIMIT on the same
            # line. Ordering by id assumes an id column exists and is no longer random.
            return re.sub(
                r"ORDER\s+BY\s+RAND\s*\(\s*\)",
                "ORDER BY id /* pick random rows in application code instead of ORDER BY RAND() */",
                sql, flags=re.IGNORECASE
            )
        return sql
@staticmethod
def _next_day(date_str: str) -> str:
from datetime import datetime, timedelta
dt = datetime.strptime(date_str, "%Y-%m-%d")
return (dt + timedelta(days=1)).strftime("%Y-%m-%d")
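    def optimize(self, sql: str) -> str:
        """Apply every auto-fixable rule in turn, each on the result of the previous one."""
        optimized = sql
        for rule_id, rule in self.RULES.items():
            if rule["auto_fix"] and re.search(rule["pattern"], optimized, re.IGNORECASE):
                optimized = self._apply_fix(optimized, rule_id)
        return optimized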
class LLMSQLOptimizer:
"""LLM 驱动的 SQL 优化"""
def __init__(self, model: str = "gpt-4o", api_key: str = None):
from openai import OpenAI
self.client = OpenAI(api_key=api_key)
self.model = model
def optimize(
self,
sql: str,
db_type: str = "mysql",
table_schema: str = "",
explain_result: str = ""
) -> Dict:
"""LLM 优化 SQL"""
system_prompt = f"""你是一个数据库 SQL 优化专家。请分析并优化以下 {db_type} SQL 查询。
分析维度:
1. 索引使用情况
2. 查询效率
3. 是否有全表扫描
4. 是否可以改写为更高效的等价形式
5. 是否有锁竞争风险
输出 JSON:
{{
"issues": ["问题1", "问题2"],
"optimized_sql": "优化后的 SQL",
"explanation": "优化说明",
"estimated_improvement": "预估提升幅度",
"index_suggestions": ["建议索引1", "建议索引2"]
}}"""
user_msg = f"原始 SQL:\n```sql\n{sql}\n```"
if table_schema:
user_msg += f"\n\n表结构:\n{table_schema}"
if explain_result:
user_msg += f"\n\n执行计划:\n{explain_result}"
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_msg}
],
response_format={"type": "json_object"},
temperature=0.1
)
import json
return json.loads(response.choices[0].message.content)
# 使用
optimizer = SQLOptimizer()
test_sql = """
SELECT * FROM orders
WHERE DATE(created_at) = '2024-01-15'
AND status IN (SELECT id FROM order_status WHERE name = 'pending')
ORDER BY RAND()
LIMIT 10
"""
result = optimizer.analyze(test_sql)
print(f"发现 {result['issue_count']} 个问题,{result['auto_fix_count']} 个可自动修复")
for issue in result["issues"]:
print(f" ⚠️ {issue['message']} (自动修复: {'是' if issue['auto_fixable'] else '否'})")
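`LLMSQLOptimizer.optimize` above returns whatever JSON the model produced. Since the reply may be empty, malformed, or missing fields, it can help to check it against the structure requested in the prompt before using `optimized_sql`. This is a minimal sketch under that assumption; `parse_optimizer_reply` and `EXPECTED_FIELDS` are hypothetical names, and the field list simply mirrors the JSON template in the system prompt.

```python
import json
from typing import Any, Dict

# Field names and types requested in the system prompt's JSON template
EXPECTED_FIELDS = {
    "issues": list,
    "optimized_sql": str,
    "explanation": str,
    "estimated_improvement": str,
    "index_suggestions": list,
}


def parse_optimizer_reply(raw: Any, original_sql: str) -> Dict[str, Any]:
    """Validate a model reply; fall back to the original SQL when it is unusable."""
    def failure(reason: str) -> Dict[str, Any]:
        return {"ok": False, "error": reason, "optimized_sql": original_sql}

    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        return failure(f"invalid JSON: {exc}")
    if not isinstance(data, dict):
        return failure("reply is not a JSON object")
    bad = [name for name, typ in EXPECTED_FIELDS.items()
           if not isinstance(data.get(name), typ)]
    if bad:
        return failure(f"missing or mistyped fields: {bad}")
    if not data["optimized_sql"].strip():
        return failure("optimized_sql is empty")
    return {"ok": True, **data}


reply = json.dumps({
    "issues": ["SELECT *"],
    "optimized_sql": "SELECT id FROM orders WHERE status = 'pending'",
    "explanation": "Select only the needed column",
    "estimated_improvement": "unknown",
    "index_suggestions": [],
})
print(parse_optimizer_reply(reply, "SELECT * FROM orders")["ok"])
print(parse_optimizer_reply("not json", "SELECT * FROM orders")["error"])
```

A structurally valid reply still says nothing about whether the rewritten SQL is equivalent to the original, so comparing execution plans or results on a test dataset remains necessary before adopting it.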
3. Intelligent Diagnosis Engine
3.1 Fault Diagnosis Agent
python
# ===== 故障诊断 Agent =====
class DiagnosticAgent:
"""数据库故障诊断 Agent"""
def __init__(self, conn: DatabaseConnection, collector: MetricCollector = None):
self.conn = conn
self.collector = collector or MetricCollector(conn)
self.diagnoses: List[Dict] = []
def diagnose(self, symptom: str = "") -> Dict:
"""诊断数据库问题"""
health = self.collector.health_check()
# 收集诊断数据
diag_data = {
"symptom": symptom,
"health": health,
"slow_queries": [],
"locks": [],
"processes": [],
"table_stats": []
}
        # Each collection step is independent; failures are recorded in
        # diag_data["errors"] rather than silently dropped
        collectors = {
            "slow_queries": lambda: [
                {"sql": q.sql[:100], "time": q.query_time, "severity": q.severity}
                for q in SlowQueryAnalyzer(self.conn).get_slow_queries(limit=10)
            ],
            "locks": self._get_lock_info,
            "processes": self._get_process_list,
            "table_stats": self._get_table_stats,
        }
        for key, collect in collectors.items():
            try:
                diag_data[key] = collect()
            except Exception as e:
                diag_data.setdefault("errors", []).append(f"{key}: {e}")
# 分析诊断
diagnosis = self._analyze(diag_data)
self.diagnoses.append(diagnosis)
return diagnosis
    def _get_lock_info(self) -> List[Dict]:
        """Current lock waits.

        information_schema.innodb_locks was removed in MySQL 8.0, so this reads the
        sys.innodb_lock_waits view (MySQL 5.7+), which pairs each waiting transaction
        with the one blocking it.
        """
        if isinstance(self.conn, MySQLConnection):
            return self.conn.query("SELECT * FROM sys.innodb_lock_waits")
        elif isinstance(self.conn, PostgreSQLConnection):
            return self.conn.query("SELECT * FROM pg_locks WHERE NOT granted")
        return []
def _get_process_list(self) -> List[Dict]:
"""获取进程列表"""
if isinstance(self.conn, MySQLConnection):
return self.conn.query("SHOW FULL PROCESSLIST")
elif isinstance(self.conn, PostgreSQLConnection):
return self.conn.query("""
SELECT pid, usename, application_name, state,
query, now() - query_start as duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start
""")
return []
    def _get_table_stats(self) -> List[Dict]:
        """Table size statistics (TABLE_ROWS is an estimate for InnoDB tables)."""
        if isinstance(self.conn, MySQLConnection):
            return self.conn.query("""
                SELECT
                    TABLE_NAME,
                    TABLE_ROWS,
                    DATA_LENGTH / 1024 / 1024 as data_mb,
                    INDEX_LENGTH / 1024 / 1024 as index_mb,
                    DATA_LENGTH / NULLIF(TABLE_ROWS, 0) as avg_row_size
                FROM information_schema.TABLES
                WHERE TABLE_SCHEMA = %s
                ORDER BY DATA_LENGTH DESC
                LIMIT 20
            """, (self.conn.config.database,))
        return []
def _analyze(self, data: Dict) -> Dict:
"""分析诊断数据"""
findings = []
actions = []
severity = "healthy"
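        health = data.get("health", {})
        # 1. Health check issues. health["status"] is only "unhealthy" for critical
        # issues, so iterate over the issue list directly to keep warnings visible.
        for issue in health.get("issues", []):
            if issue["severity"] == "critical":
                severity = "critical"
                findings.append(f"🔴 {issue['message']}")
            elif issue["severity"] == "warning":
                if severity != "critical":
                    severity = "warning"
                findings.append(f"🟡 {issue['message']}")
            else:
                findings.append(f"🔵 {issue['message']}")
        # 2. Slow queries
        slow_queries = data.get("slow_queries", [])
        critical_slow = [q for q in slow_queries if q.get("severity") == "critical"]
        if critical_slow:
            if severity != "critical":
                severity = "warning"
            findings.append(f"🐌 Found {len(critical_slow)} critical slow queries")
            actions.append("Analyze the slow query log and add missing indexes")
        # 3. Lock waits
        locks = data.get("locks", [])
        if locks:
            if severity != "critical":
                severity = "warning"
            findings.append(f"🔒 Found {len(locks)} lock waits")
            actions.append("Check for long transactions; consider splitting them or adding indexes to narrow the locked range")
        # 4. Long-running queries (idle "Sleep" connections in the MySQL process list are ignored)
        processes = data.get("processes", [])
        long_queries = [
            p for p in processes
            if p.get("Command") != "Sleep" and self._get_duration_seconds(p) > 300
        ]
        if long_queries:
            findings.append(f"⏳ Found {len(long_queries)} queries running for more than 5 minutes")
            actions.append("Check whether the long-running queries can be optimized or need to be killed")
        # 5. Large tables (data_mb can be NULL, for example for views)
        large_tables = [
            t for t in data.get("table_stats", [])
            if float(t.get("data_mb") or 0) > 1000
        ]
        if large_tables:
            findings.append(f"📦 Found {len(large_tables)} tables larger than 1000 MB")
            actions.append("Consider partitioning or archiving historical data")
        if not findings:
            findings.append("✅ Database status is normal")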
return {
"severity": severity,
"symptom": data.get("symptom", ""),
"findings": findings,
"actions": actions,
"data": {
"slow_queries": len(slow_queries),
"locks": len(locks),
"long_queries": len(long_queries),
"large_tables": len(large_tables)
},
"timestamp": datetime.now().isoformat()
}
@staticmethod
def _get_duration_seconds(process: Dict) -> float:
"""获取进程持续时间(秒)"""
duration = process.get("Time", process.get("duration", 0))
if isinstance(duration, timedelta):
return duration.total_seconds()
return float(duration) if duration else 0
def generate_diagnosis_report(self, diagnosis: Dict) -> str:
"""生成诊断报告"""
lines = [
"🏥 数据库诊断报告",
"=" * 50,
f"时间: {diagnosis['timestamp']}",
f"状态: {diagnosis['severity']}",
f"症状: {diagnosis.get('symptom', '无')}",
"",
"发现:",
]
for finding in diagnosis["findings"]:
lines.append(f" {finding}")
if diagnosis["actions"]:
lines.append("")
lines.append("建议操作:")
for action in diagnosis["actions"]:
lines.append(f" 💡 {action}")
return "\n".join(lines)
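`_analyze` above raises the overall severity with repeated `if severity != "critical"` checks. The same rule, "the overall severity is the highest severity seen", can be expressed with an explicit ranking. The sketch below is one way to do that; `SEVERITY_RANK`, `escalate`, and `overall_severity` are hypothetical names, and the ordering of levels is an assumption based on how the code above uses them.

```python
from typing import Iterable

# Assumed ordering of the severity levels, lowest to highest
SEVERITY_RANK = {"healthy": 0, "info": 1, "warning": 2, "critical": 3}


def escalate(current: str, new: str) -> str:
    """Return whichever of the two severities ranks higher."""
    return new if SEVERITY_RANK[new] > SEVERITY_RANK[current] else current


def overall_severity(severities: Iterable[str]) -> str:
    """Fold a sequence of per-finding severities into one overall level."""
    result = "healthy"
    for level in severities:
        result = escalate(result, level)
    return result


print(overall_severity(["warning", "critical", "info"]))
print(overall_severity([]))
```

With a ranking table, adding a new level or a new kind of finding does not require touching every branch that updates the severity.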
3.2 LLM-Enhanced Diagnosis
python
# ===== LLM 增强诊断 =====
class LLMDiagnosticAgent:
"""LLM 增强的诊断 Agent"""
def __init__(self, conn: DatabaseConnection, model: str = "gpt-4o", api_key: str = None):
self.conn = conn
self.basic_agent = DiagnosticAgent(conn)
from openai import OpenAI
self.client = OpenAI(api_key=api_key)
self.model = model
def diagnose(self, symptom: str = "") -> Dict:
"""LLM 增强诊断"""
# 1. 基础诊断
basic = self.basic_agent.diagnose(symptom)
# 2. LLM 深度分析
context = self._build_context(basic)
system_prompt = """你是一个资深 DBA,擅长数据库故障诊断。
根据提供的诊断数据,请:
1. 分析根本原因
2. 给出具体修复步骤
3. 评估风险等级
4. 提供预防建议
输出 JSON:
{
"root_cause": "根本原因",
"severity": "critical|warning|info",
"fix_steps": ["步骤1", "步骤2"],
"risk_assessment": "风险评估",
"prevention": ["建议1", "建议2"],
"estimated_recovery_time": "预估恢复时间"
}"""
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": context}
],
response_format={"type": "json_object"},
temperature=0.1
)
import json
llm_result = json.loads(response.choices[0].message.content)
# 合并结果
return {
**basic,
"llm_analysis": llm_result
}
def _build_context(self, diagnosis: Dict) -> str:
"""构建 LLM 上下文"""
lines = [
f"数据库类型: {self.conn.config.db_type}",
f"数据库地址: {self.conn.config.safe_display}",
f"症状描述: {diagnosis.get('symptom', '无')}",
"",
"基础诊断发现:"
]
for finding in diagnosis.get("findings", []):
lines.append(f" - {finding}")
data = diagnosis.get("data", {})
if data:
lines.append("")
lines.append("统计数据:")
for key, value in data.items():
lines.append(f" - {key}: {value}")
return "\n".join(lines)
def natural_language_query(self, question: str) -> Dict:
"""自然语言查询数据库"""
# 获取数据库 schema
schema = self._get_schema()
system_prompt = f"""你是一个 SQL 专家。根据用户的问题,生成合适的 SQL 查询。
数据库类型: {self.conn.config.db_type}
数据库: {self.conn.config.database}
表结构:
{schema}
输出 JSON:
{{
"sql": "生成的 SQL",
"explanation": "SQL 说明",
"is_safe": true/false // 是否安全(只读)
}}"""
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": question}
],
response_format={"type": "json_object"},
temperature=0.1
)
import json
result = json.loads(response.choices[0].message.content)
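        # Execute only when the model marks the SQL as safe AND it is a single
        # read-only statement; the model's own "is_safe" flag is not trusted alone.
        sql_text = (result.get("sql") or "").strip().rstrip(";")
        read_only = sql_text.upper().startswith(("SELECT", "SHOW", "EXPLAIN")) and ";" not in sql_text
        if result.get("is_safe", False) and read_only:
            try:
                rows = self.conn.query(sql_text)
                result["data"] = rows[:100]  # cap the number of returned rows
            except Exception as e:
                result["error"] = str(e)
        return result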
def _get_schema(self) -> str:
"""获取数据库 Schema"""
if isinstance(self.conn, MySQLConnection):
tables = self.conn.query("""
SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, COLUMN_KEY
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = %s
ORDER BY TABLE_NAME, ORDINAL_POSITION
""", (self.conn.config.database,))
schema_lines = []
current_table = ""
for t in tables:
if t["TABLE_NAME"] != current_table:
current_table = t["TABLE_NAME"]
schema_lines.append(f"\nTABLE {current_table}:")
key = f" [{t['COLUMN_KEY']}]" if t["COLUMN_KEY"] else ""
schema_lines.append(f" {t['COLUMN_NAME']} {t['COLUMN_TYPE']}{key}")
return "\n".join(schema_lines)
return "Schema 不可用"
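The natural-language query path runs SQL that was written by the model, so it helps to validate the statement independently before executing it. The following is a minimal sketch of such a check. `is_read_only_sql` is a hypothetical helper that is not part of the original code, and it is a conservative keyword heuristic rather than a SQL parser, so it may reject some harmless queries.

```python
import re

# Hypothetical helper (not in the original code): a conservative read-only check.
_READ_ONLY_PREFIX = re.compile(r"^\s*(SELECT|SHOW|EXPLAIN)\b", re.IGNORECASE)
_WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT|REVOKE)\b"
    r"|\bINTO\s+OUTFILE\b",
    re.IGNORECASE,
)


def is_read_only_sql(sql: str) -> bool:
    """Return True only for a single statement that looks read-only."""
    statement = sql.strip().rstrip(";")
    if not statement or ";" in statement:
        # Reject empty input and stacked statements such as "SELECT 1; DROP ...".
        return False
    if not _READ_ONLY_PREFIX.match(statement):
        # The statement must start with SELECT, SHOW, or EXPLAIN.
        return False
    # Reject write keywords anywhere in the text, even inside string literals;
    # false positives are acceptable for a safety gate.
    return _WRITE_KEYWORDS.search(statement) is None
```

A gate like this would sit in front of `self.conn.query(...)`. Running the agent under a database account with read-only privileges remains the stronger safeguard.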
4. Automated Inspection
4.1 Inspection Framework
python
# ===== Automated inspection framework =====
class InspectionCheck:
"""巡检检查项"""
def __init__(self, name: str, category: str, severity: str = "warning"):
self.name = name
self.category = category
self.severity = severity
@property
def check_id(self) -> str:
return f"{self.category}_{self.name}".replace(" ", "_").lower()
class DatabaseInspector:
"""数据库巡检器"""
def __init__(self, conn: DatabaseConnection):
self.conn = conn
self.results: List[Dict] = []
def run_full_inspection(self) -> Dict:
"""执行完整巡检"""
print("🔍 开始数据库巡检...")
checks = [
self._check_connections,
self._check_slow_queries,
self._check_locks,
self._check_table_fragmentation,
self._check_index_usage,
self._check_replication,
self._check_disk_usage,
self._check_security,
self._check_configuration,
]
results = []
for check in checks:
try:
result = check()
results.append(result)
status = "✅" if result["status"] == "pass" else "❌"
print(f" {status} {result['name']}: {result.get('message', '')}")
except Exception as e:
results.append({
"name": check.__name__,
"status": "error",
"message": str(e)
})
print(f" ⚠️ {check.__name__}: 执行出错 {e}")
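        # Summarise. Checks that raised an exception are counted separately and
        # also make the overall result unhealthy, because their state is unknown.
        passed = sum(1 for r in results if r["status"] == "pass")
        failed = sum(1 for r in results if r["status"] == "fail")
        warnings = sum(1 for r in results if r["status"] == "warning")
        errors = sum(1 for r in results if r["status"] == "error")
        summary = {
            "total": len(results),
            "passed": passed,
            "failed": failed,
            "warnings": warnings,
            "errors": errors,
            "overall": "healthy" if failed == 0 and errors == 0 else "unhealthy",
            "checks": results,
            "timestamp": datetime.now().isoformat()
        }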
self.results = results
return summary
def _check_connections(self) -> Dict:
"""检查连接使用"""
if isinstance(self.conn, MySQLConnection):
current = self.conn.query("SHOW STATUS LIKE 'Threads_connected'")[0]["Value"]
max_conn = self.conn.query("SHOW VARIABLES LIKE 'max_connections'")[0]["Value"]
usage = int(current) / int(max_conn) * 100
if usage > 90:
return {"name": "连接使用率", "status": "fail", "message": f"连接使用率 {usage:.1f}%(危险)", "value": usage}
elif usage > 70:
return {"name": "连接使用率", "status": "warning", "message": f"连接使用率 {usage:.1f}%(偏高)", "value": usage}
return {"name": "连接使用率", "status": "pass", "message": f"连接使用率 {usage:.1f}%", "value": usage}
return {"name": "连接使用率", "status": "pass", "message": "非 MySQL 跳过"}
def _check_slow_queries(self) -> Dict:
"""检查慢查询"""
try:
analyzer = SlowQueryAnalyzer(self.conn)
slow = analyzer.get_slow_queries(limit=100, min_time=1.0)
critical = sum(1 for q in slow if q.severity == "critical")
if critical > 0:
return {"name": "慢查询", "status": "fail", "message": f"{critical} 条严重慢查询", "value": len(slow)}
elif len(slow) > 10:
return {"name": "慢查询", "status": "warning", "message": f"{len(slow)} 条慢查询", "value": len(slow)}
return {"name": "慢查询", "status": "pass", "message": f"{len(slow)} 条慢查询", "value": len(slow)}
except Exception:
return {"name": "慢查询", "status": "pass", "message": "无法获取慢查询"}
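    def _check_locks(self) -> Dict:
        """Check lock waits."""
        if isinstance(self.conn, MySQLConnection):
            # information_schema.innodb_lock_waits was removed in MySQL 8.0, where
            # performance_schema.data_lock_waits replaces it. Try the newer table first.
            for source in ("performance_schema.data_lock_waits", "information_schema.innodb_lock_waits"):
                try:
                    locks = self.conn.query(f"SELECT COUNT(*) AS cnt FROM {source}")
                except Exception:
                    continue  # table not available on this server version
                lock_count = int(locks[0]["cnt"]) if locks else 0
                if lock_count > 5:
                    return {"name": "Lock waits", "status": "fail", "message": f"{lock_count} lock waits", "value": lock_count}
                elif lock_count > 0:
                    return {"name": "Lock waits", "status": "warning", "message": f"{lock_count} lock waits", "value": lock_count}
                return {"name": "Lock waits", "status": "pass", "message": "No lock waits", "value": 0}
            return {"name": "Lock waits", "status": "warning", "message": "Could not read lock-wait information"}
        return {"name": "Lock waits", "status": "pass", "message": "Skipped (not MySQL)"}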
def _check_table_fragmentation(self) -> Dict:
"""检查表碎片"""
if isinstance(self.conn, MySQLConnection):
tables = self.conn.query("""
SELECT
TABLE_NAME,
DATA_LENGTH / 1024 / 1024 as data_mb,
DATA_FREE / 1024 / 1024 as free_mb,
ROUND(DATA_FREE / NULLIF(DATA_LENGTH, 0) * 100, 2) as fragmentation_pct
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = %s AND DATA_FREE > 100 * 1024 * 1024
ORDER BY DATA_FREE DESC
LIMIT 10
""", (self.conn.config.database,))
            # fragmentation_pct is NULL when DATA_LENGTH is 0, so fall back to 0 before float().
            fragmented = [t for t in tables if float(t.get("fragmentation_pct") or 0) > 30]
if fragmented:
names = [t["TABLE_NAME"] for t in fragmented]
return {
"name": "表碎片",
"status": "warning",
"message": f"{len(fragmented)} 个表碎片率 > 30%: {', '.join(names[:3])}",
"value": len(fragmented)
}
return {"name": "表碎片", "status": "pass", "message": "碎片率正常"}
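    def _check_index_usage(self) -> Dict:
        """List non-unique secondary indexes as candidates for a usage review."""
        if isinstance(self.conn, MySQLConnection):
            # Join on schema as well as table name; joining on TABLE_NAME alone
            # pairs same-named tables from different schemas.
            # Note: this only enumerates indexes. It does not prove an index is unused.
            indexes = self.conn.query("""
                SELECT
                    t.TABLE_NAME,
                    s.INDEX_NAME,
                    s.CARDINALITY
                FROM information_schema.TABLES t
                JOIN information_schema.STATISTICS s
                    ON t.TABLE_SCHEMA = s.TABLE_SCHEMA
                    AND t.TABLE_NAME = s.TABLE_NAME
                WHERE t.TABLE_SCHEMA = %s
                    AND s.NON_UNIQUE = 1
                    AND s.INDEX_NAME != 'PRIMARY'
                ORDER BY t.TABLE_NAME
            """, (self.conn.config.database,))
            return {"name": "Index usage", "status": "pass", "message": f"Reviewed {len(indexes)} index entries", "value": len(indexes)}
        return {"name": "Index usage", "status": "pass", "message": "Skipped (not MySQL)"}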
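    def _check_replication(self) -> Dict:
        """Check replication status."""
        if isinstance(self.conn, MySQLConnection):
            status = None
            # SHOW REPLICA STATUS is the current statement (MySQL 8.0.22+);
            # SHOW SLAVE STATUS is the older spelling, kept as a fallback.
            for statement in ("SHOW REPLICA STATUS", "SHOW SLAVE STATUS"):
                try:
                    status = self.conn.query(statement)
                    break
                except Exception:
                    continue
            if status:
                row = status[0]
                # Column names differ between the two statements.
                io_running = (row.get("Replica_IO_Running") or row.get("Slave_IO_Running")) == "Yes"
                sql_running = (row.get("Replica_SQL_Running") or row.get("Slave_SQL_Running")) == "Yes"
                lag = float(row.get("Seconds_Behind_Source") or row.get("Seconds_Behind_Master") or 0)
                if not io_running or not sql_running:
                    return {"name": "Replication", "status": "fail", "message": f"Replication stopped: IO={io_running}, SQL={sql_running}"}
                elif lag > 60:
                    return {"name": "Replication", "status": "warning", "message": f"Replication lag {lag}s"}
                return {"name": "Replication", "status": "pass", "message": f"Replication healthy, lag {lag}s"}
        return {"name": "Replication", "status": "pass", "message": "Replication not configured"}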
    def _check_disk_usage(self) -> Dict:
        """Check database size (data plus indexes)."""
        if isinstance(self.conn, MySQLConnection):
            try:
                # Bind the schema name as a parameter instead of formatting it into the SQL.
                size = self.conn.query("""
                    SELECT
                        ROUND(SUM(DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024, 2) AS total_gb
                    FROM information_schema.TABLES
                    WHERE TABLE_SCHEMA = %s
                """, (self.conn.config.database,))
                # SUM() yields NULL for a schema without tables; treat that as 0.
                total_gb = float(size[0]["total_gb"] or 0) if size else 0.0
                if total_gb > 100:
                    return {"name": "Disk usage", "status": "warning", "message": f"Database size {total_gb} GB"}
                return {"name": "Disk usage", "status": "pass", "message": f"Database size {total_gb} GB"}
            except Exception:
                pass
        return {"name": "Disk usage", "status": "pass", "message": "Unavailable"}
def _check_security(self) -> Dict:
"""检查安全配置"""
issues = []
if isinstance(self.conn, MySQLConnection):
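            # Flag accounts with an empty password or the legacy
            # mysql_native_password plugin (one query covers both cases).
            weak_auth_users = self.conn.query("""
                SELECT User, Host FROM mysql.user
                WHERE authentication_string = '' OR plugin = 'mysql_native_password'
            """)
            if weak_auth_users:
                issues.append(f"{len(weak_auth_users)} account(s) have an empty password or use legacy authentication")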
# 检查远程 root
remote_root = self.conn.query("""
SELECT Host FROM mysql.user
WHERE User = 'root' AND Host != 'localhost' AND Host != '127.0.0.1'
""")
if remote_root:
issues.append("root 用户允许远程登录")
if issues:
return {"name": "安全配置", "status": "warning", "message": "; ".join(issues)}
return {"name": "安全配置", "status": "pass", "message": "安全配置正常"}
def _check_configuration(self) -> Dict:
"""检查配置"""
recommendations = []
if isinstance(self.conn, MySQLConnection):
# innodb_buffer_pool_size
buffer_size = self.conn.query("SHOW VARIABLES LIKE 'innodb_buffer_pool_size'")
if buffer_size:
size_mb = int(buffer_size[0]["Value"]) / 1024 / 1024
if size_mb < 1024:
recommendations.append(f"innodb_buffer_pool_size={size_mb:.0f}MB 偏小,建议 >= 1GB")
# slow_query_log
slow_log = self.conn.query("SHOW VARIABLES LIKE 'slow_query_log'")
if slow_log and slow_log[0]["Value"] == "OFF":
recommendations.append("建议开启 slow_query_log")
# long_query_time
long_time = self.conn.query("SHOW VARIABLES LIKE 'long_query_time'")
if long_time and float(long_time[0]["Value"]) > 5:
recommendations.append(f"long_query_time={long_time[0]['Value']}s 偏大,建议 <= 2s")
if recommendations:
return {"name": "配置优化", "status": "warning", "message": "; ".join(recommendations)}
return {"name": "配置优化", "status": "pass", "message": "配置合理"}
def generate_report(self, summary: Dict) -> str:
"""生成巡检报告"""
lines = [
"🔍 数据库巡检报告",
"=" * 50,
f"时间: {summary['timestamp']}",
f"总体状态: {'✅ 健康' if summary['overall'] == 'healthy' else '❌ 不健康'}",
f"通过: {summary['passed']} | 警告: {summary['warnings']} | 失败: {summary['failed']}",
"",
"检查详情:",
]
for check in summary["checks"]:
icon = {"pass": "✅", "warning": "⚠️", "fail": "❌", "error": "💥"}[check["status"]]
lines.append(f" {icon} {check['name']}: {check.get('message', '')}")
return "\n".join(lines)
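Several checks above repeat the same pattern: compare a number against a warning threshold and a failure threshold, then build a result dict. A minimal sketch of that pattern as a shared helper follows. `classify` is a hypothetical name that is not part of the original code, and the message format is an assumption.

```python
from typing import Dict


def classify(name: str, value: float, warn_at: float, fail_at: float, unit: str = "") -> Dict:
    """Map a numeric measurement to the pass/warning/fail dict used by the checks."""
    if value > fail_at:
        status = "fail"
    elif value > warn_at:
        status = "warning"
    else:
        status = "pass"
    return {
        "name": name,
        "status": status,
        "message": f"{name}: {value:.1f}{unit}",
        "value": value,
    }


# Example: connection usage with the 70% / 90% thresholds used in _check_connections.
result = classify("Connection usage", 75.0, warn_at=70, fail_at=90, unit="%")
```

With a helper like this, each check only has to collect its measurement and pick thresholds, which keeps status handling in one place.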
5. Change Management
5.1 SQL Change Review
python
# ===== SQL change review =====
class SQLChangeReviewer:
"""SQL 变更审核"""
DANGEROUS_PATTERNS = [
(r'\bDROP\s+TABLE\b', "DROP TABLE", "critical"),
(r'\bDROP\s+DATABASE\b', "DROP DATABASE", "critical"),
(r'\bTRUNCATE\s+TABLE\b', "TRUNCATE TABLE", "critical"),
(r'\bDELETE\s+FROM\b(?!.*\bWHERE\b)', "无 WHERE 的 DELETE", "critical"),
(r'\bUPDATE\s+\w+\s+SET\b(?!.*\bWHERE\b)', "无 WHERE 的 UPDATE", "critical"),
(r'\bALTER\s+TABLE\b.*\bDROP\s+COLUMN\b', "删除列", "major"),
(r'\bALTER\s+TABLE\b.*\bMODIFY\s+COLUMN\b', "修改列类型", "major"),
(r'\bALTER\s+TABLE\b.*\bADD\s+INDEX\b', "添加索引(大表可能锁表)", "warning"),
(r'\bRENAME\s+TABLE\b', "重命名表", "warning"),
]
def __init__(self, conn: DatabaseConnection = None):
self.conn = conn
def review(self, sql: str, auto_approve_safe: bool = False) -> Dict:
"""审核 SQL 变更"""
sql = sql.strip().rstrip(";")
issues = []
max_severity = "info"
# 1. 危险模式检查
for pattern, name, severity in self.DANGEROUS_PATTERNS:
if re.search(pattern, sql, re.IGNORECASE):
issues.append({
"rule": name,
"severity": severity,
"message": f"检测到危险操作: {name}"
})
if severity == "critical":
max_severity = "critical"
elif severity == "major" and max_severity != "critical":
max_severity = "major"
elif max_severity == "info":
max_severity = "warning"
# 2. 影响行数估算
affected_rows = self._estimate_affected_rows(sql)
# 3. 锁定风险评估
lock_risk = self._assess_lock_risk(sql)
# 4. 回滚方案
rollback = self._generate_rollback(sql)
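        # Decide whether the change may run automatically. The estimate is -1 when
        # unknown, so require a known, non-negative count rather than just "< 1000".
        can_auto_execute = (
            max_severity == "info"
            and 0 <= affected_rows < 1000
            and lock_risk == "low"
            and auto_approve_safe
        )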
return {
"sql": sql,
"severity": max_severity,
"issues": issues,
"affected_rows_estimated": affected_rows,
"lock_risk": lock_risk,
"rollback_sql": rollback,
"can_auto_execute": can_auto_execute,
"requires_approval": max_severity in ("critical", "major"),
"reviewed_at": datetime.now().isoformat()
}
def _estimate_affected_rows(self, sql: str) -> int:
"""估算影响行数"""
if not self.conn:
return -1 # 无法估算
try:
# 提取表名
table_match = re.search(r'(?:FROM|UPDATE|INTO)\s+(\w+)', sql, re.IGNORECASE)
if not table_match:
return -1
table = table_match.group(1)
if re.match(r'\s*DELETE', sql, re.IGNORECASE):
where_match = re.search(r'WHERE\s+(.+?)$', sql, re.IGNORECASE)
if where_match:
count_sql = f"SELECT COUNT(*) as cnt FROM {table} WHERE {where_match.group(1)}"
else:
count_sql = f"SELECT COUNT(*) as cnt FROM {table}"
result = self.conn.query(count_sql)
return int(result[0]["cnt"]) if result else 0
elif re.match(r'\s*UPDATE', sql, re.IGNORECASE):
where_match = re.search(r'WHERE\s+(.+?)$', sql, re.IGNORECASE)
if where_match:
count_sql = f"SELECT COUNT(*) as cnt FROM {table} WHERE {where_match.group(1)}"
result = self.conn.query(count_sql)
return int(result[0]["cnt"]) if result else 0
return -1 # 无 WHERE,可能全表更新
except Exception:
pass
return -1
def _assess_lock_risk(self, sql: str) -> str:
"""评估锁风险"""
if re.match(r'\s*(DROP|TRUNCATE|ALTER)', sql, re.IGNORECASE):
return "high"
elif re.match(r'\s*(DELETE|UPDATE)', sql, re.IGNORECASE):
if not re.search(r'\bWHERE\b', sql, re.IGNORECASE):
return "high"
elif re.search(r'\bLIMIT\b', sql, re.IGNORECASE):
return "low"
return "medium"
elif re.match(r'\s*(INSERT|CREATE)', sql, re.IGNORECASE):
return "low"
return "medium"
def _generate_rollback(self, sql: str) -> str:
"""生成回滚 SQL"""
if re.match(r'\s*CREATE\s+TABLE\s+(\w+)', sql, re.IGNORECASE):
match = re.search(r'CREATE\s+TABLE\s+(\w+)', sql, re.IGNORECASE)
if match:
return f"DROP TABLE IF EXISTS {match.group(1)};"
elif re.match(r'\s*DROP\s+TABLE\s+(\w+)', sql, re.IGNORECASE):
return "-- ⚠️ DROP TABLE 无法自动回滚,请确保有备份"
elif re.match(r'\s*INSERT\s+INTO', sql, re.IGNORECASE):
match = re.search(r'INSERT\s+INTO\s+(\w+)', sql, re.IGNORECASE)
if match:
return f"-- 需要根据主键生成 DELETE 语句"
elif re.match(r'\s*ALTER\s+TABLE', sql, re.IGNORECASE):
return "-- ALTER TABLE 部分操作可逆,需具体分析"
return "-- 无自动回滚方案,请手动准备"
# 使用
reviewer = SQLChangeReviewer()
test_sqls = [
"DELETE FROM users WHERE last_login < '2023-01-01'",
"DROP TABLE temp_logs",
"UPDATE products SET price = price * 1.1",
"ALTER TABLE orders ADD COLUMN shipping_address VARCHAR(500)",
"INSERT INTO audit_log (action, user_id) VALUES ('login', 123)",
]
for sql in test_sqls:
result = reviewer.review(sql)
icon = {"critical": "🔴", "major": "🟡", "warning": "🔵", "info": "✅"}[result["severity"]]
print(f"{icon} {sql[:50]}... → 严重度: {result['severity']}, 锁风险: {result['lock_risk']}")
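The "no WHERE" patterns rely on a negative lookahead over the rest of the statement, so a `WHERE` inside a subquery can hide the fact that the outer statement has none. The sketch below shows one way to look only at the top level by removing parenthesised groups first. `has_top_level_where` is a hypothetical helper, not part of the original reviewer, and it does not account for parentheses inside string literals.

```python
import re


def has_top_level_where(sql: str) -> bool:
    """Return True if a WHERE keyword appears outside any parenthesised group."""
    previous = None
    current = sql
    # Repeatedly remove the innermost (...) groups until nothing changes,
    # which leaves only the top-level text of the statement.
    while current != previous:
        previous = current
        current = re.sub(r"\([^()]*\)", " ", current)
    return re.search(r"\bWHERE\b", current, re.IGNORECASE) is not None


# The outer UPDATE has no WHERE, even though the subquery does.
risky = "UPDATE t SET x = (SELECT y FROM u WHERE u.id = 1)"
# A multi-line DELETE with a genuine top-level WHERE.
scoped = "DELETE FROM logs\nWHERE created_at < '2023-01-01'"
```

A production reviewer would more likely use a real SQL parser. This version only illustrates why plain keyword matching can under-report risk.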
5.2 Change Executor
python
# ===== Safe change executor =====
class SafeChangeExecutor:
"""安全变更执行器"""
def __init__(self, conn: DatabaseConnection, reviewer: SQLChangeReviewer = None):
self.conn = conn
self.reviewer = reviewer or SQLChangeReviewer(conn)
self.execution_log: List[Dict] = []
def execute(
self,
sql: str,
force: bool = False,
dry_run: bool = False,
timeout: int = 300
) -> Dict:
"""安全执行 SQL 变更"""
# 1. 审核
review = self.reviewer.review(sql)
        # 2. Block anything that needs approval unless the caller forces it.
        #    requires_approval already covers both "critical" and "major",
        #    so a single check is enough.
        if review["requires_approval"] and not force:
            high_risk = review["severity"] == "critical"
            return {
                "status": "blocked",
                "reason": "High-risk operation blocked" if high_risk
                          else f"SQL requires approval (severity: {review['severity']})",
                "review": review,
                "hint": "After manual confirmation, rerun with force=True (not recommended for routine use)"
            }
# 3. Dry run
if dry_run:
return {
"status": "dry_run",
"review": review,
"message": f"Dry run 完成。预估影响 {review['affected_rows_estimated']} 行"
}
# 4. 执行前备份(可选)
backup_sql = self._generate_backup(sql)
# 5. 执行
start_time = time.time()
try:
affected = self.conn.execute(sql)
duration = time.time() - start_time
result = {
"status": "success",
"affected_rows": affected,
"duration": round(duration, 3),
"review": review,
"backup_sql": backup_sql,
"rollback_sql": review.get("rollback_sql", "")
}
except Exception as e:
duration = time.time() - start_time
result = {
"status": "error",
"error": str(e),
"duration": round(duration, 3),
"review": review,
"rollback_sql": review.get("rollback_sql", "")
}
# 记录日志
self.execution_log.append({
**result,
"sql": sql,
"timestamp": datetime.now().isoformat()
})
return result
    def _generate_backup(self, sql: str) -> str:
        """Generate a backup statement for the rows an UPDATE or DELETE will touch."""
        # MySQL has no "SELECT ... INTO <new table>"; CREATE TABLE ... AS SELECT is used instead.
        if re.match(r'\s*UPDATE', sql, re.IGNORECASE):
            table_match = re.search(r'UPDATE\s+(\w+)', sql, re.IGNORECASE)
        elif re.match(r'\s*DELETE', sql, re.IGNORECASE):
            table_match = re.search(r'FROM\s+(\w+)', sql, re.IGNORECASE)
        else:
            return "-- No backup needed"
        if not table_match:
            return "-- Could not determine the table to back up"
        table = table_match.group(1)
        where_match = re.search(r'WHERE\s+(.+?)$', sql, re.IGNORECASE | re.DOTALL)
        # Without a WHERE clause the statement touches every row, so copy the whole table.
        where_clause = f" WHERE {where_match.group(1)}" if where_match else ""
        return f"CREATE TABLE backup_{table}_{int(time.time())} AS SELECT * FROM {table}{where_clause}"
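The executor's dry run only returns the review. For DML, one common way to measure the real effect is to run the statement inside a transaction and roll it back. The sketch below shows the idea using Python's built-in `sqlite3` as a stand-in database; this is an assumption made so the example is self-contained, and `dry_run_dml` is a hypothetical helper. In MySQL the same approach applies only to DML on transactional tables, because DDL statements commit implicitly.

```python
import sqlite3


def dry_run_dml(conn: sqlite3.Connection, sql: str) -> int:
    """Execute one DML statement, read the affected-row count, then roll back."""
    try:
        cursor = conn.execute(sql)
        return cursor.rowcount
    finally:
        # Always undo the change, whether the statement succeeded or raised.
        # Note: this also discards any other uncommitted work on the connection.
        conn.rollback()


# Stand-in database with three log rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany("INSERT INTO logs (year) VALUES (?)", [(2021,), (2022,), (2024,)])
conn.commit()

affected = dry_run_dml(conn, "DELETE FROM logs WHERE year < 2023")
remaining = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(f"would delete {affected} rows; table still has {remaining} rows")
```

Unlike a `SELECT COUNT(*)` estimate, this takes the same locks as the real change, so it is better suited to a staging copy than to a busy production table.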
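6. Capacity Planning and Forecasting
6.1 Capacity Forecasting
python
# ===== Capacity planning and forecasting =====
from datetime import datetime, timedelta  # used by estimate_disk_full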
class CapacityPlanner:
"""容量规划器"""
def __init__(self, collector: MetricCollector):
self.collector = collector
def predict_growth(self, metric: str, days: int = 30) -> Dict:
"""预测指标增长"""
history = self.collector.get_history(metric, minutes=60 * 24 * 7) # 7 天数据
if len(history) < 24:
return {"status": "insufficient_data", "data_points": len(history)}
# 简单线性回归
values = [p.value for p in history]
n = len(values)
x = list(range(n))
sum_x = sum(x)
sum_y = sum(values)
sum_xy = sum(xi * yi for xi, yi in zip(x, values))
sum_x2 = sum(xi ** 2 for xi in x)
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n
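        # Forecast. The last observed point has index n - 1, so the target index
        # is (n - 1) + future_points.
        current = values[-1]
        future_points = days * 24  # assumes one data point per hour
        predicted = intercept + slope * (n - 1 + future_points)
        growth_rate = (predicted - current) / max(current, 0.001) * 100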
return {
"status": "ok",
"current": round(current, 2),
"predicted": round(predicted, 2),
"growth_rate_pct": round(growth_rate, 2),
"daily_growth": round(slope * 24, 4),
"slope": round(slope, 6),
"days": days,
"will_exceed_threshold": None
}
def estimate_disk_full(self, current_gb: float, growth_gb_per_day: float,
total_disk_gb: float, safety_pct: float = 80) -> Dict:
"""预估磁盘满时间"""
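        threshold = total_disk_gb * safety_pct / 100
        remaining = threshold - current_gb
        if remaining <= 0:
            # Already past the safety threshold; a forecast date would lie in the past.
            return {"status": "exceeded", "message": "Usage is already at or above the safety threshold"}
        if growth_gb_per_day <= 0:
            return {"status": "no_growth", "message": "No growth trend"}
        days_until_full = remaining / growth_gb_per_day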
return {
"current_gb": round(current_gb, 2),
"growth_gb_per_day": round(growth_gb_per_day, 3),
"threshold_gb": round(threshold, 2),
"days_until_threshold": round(days_until_full, 1),
"estimated_full_date": (
datetime.now() + timedelta(days=days_until_full)
).strftime("%Y-%m-%d"),
"urgency": "critical" if days_until_full < 7 else "warning" if days_until_full < 30 else "normal"
}
def recommend_scaling(self, metrics: Dict) -> List[Dict]:
"""推荐扩容方案"""
recommendations = []
# 连接数
conn_usage = metrics.get("connections", 0) / max(metrics.get("max_connections", 151), 1)
if conn_usage > 0.7:
recommendations.append({
"resource": "连接数",
"current_usage": f"{conn_usage:.1%}",
"recommendation": "增加 max_connections 或使用连接池",
"priority": "high" if conn_usage > 0.9 else "medium"
})
# 缓冲池
buffer_hit = metrics.get("buffer_pool_hit_rate", 100)
if buffer_hit < 95:
recommendations.append({
"resource": "InnoDB 缓冲池",
"current_usage": f"命中率 {buffer_hit}%",
"recommendation": "增大 innodb_buffer_pool_size",
"priority": "high" if buffer_hit < 90 else "medium"
})
# 慢查询
slow_q = metrics.get("slow_queries", 0)
if slow_q > 50:
recommendations.append({
"resource": "查询性能",
"current_usage": f"{slow_q} 慢查询",
"recommendation": "优化慢查询或升级 CPU",
"priority": "high"
})
return recommendations
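The forecast in `predict_growth` is an ordinary least-squares line fitted to equally spaced samples. The worked sketch below fits a series whose answer is known in advance (y = 2x + 1) and then asks how long until a threshold is crossed. `fit_line` and `days_until_threshold` are hypothetical names, and the one-sample-per-hour spacing is the same assumption the original code makes.

```python
from typing import List, Optional, Tuple


def fit_line(values: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    xs = range(n)
    sum_x = sum(xs)
    sum_y = sum(values)
    sum_xy = sum(x * y for x, y in zip(xs, values))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


def days_until_threshold(values: List[float], threshold: float,
                         points_per_day: int = 24) -> Optional[float]:
    """Days from the last sample until the fitted line reaches the threshold."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None  # flat or shrinking series never reaches the threshold
    crossing_index = (threshold - intercept) / slope
    points_ahead = max(crossing_index - (len(values) - 1), 0.0)
    return points_ahead / points_per_day


# 48 hourly samples of y = 2x + 1; the last sample (x = 47) is 95.
hourly = [1.0 + 2.0 * i for i in range(48)]
slope, intercept = fit_line(hourly)
days = days_until_threshold(hourly, threshold=143.0)
```

A straight line is a reasonable first approximation for steadily growing data volume. Seasonal or bursty metrics would need a different model.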
7. Knowledge Base and Natural-Language Interaction
7.1 DBA Knowledge Base
python
# ===== DBA knowledge base =====
class DBAKnowledgeBase:
"""DBA 知识库"""
KNOWLEDGE = {
"mysql_high_cpu": {
"symptoms": ["CPU 使用率高", "查询响应慢"],
"causes": [
"缺少索引导致全表扫描",
"复杂查询(多表 JOIN、子查询)",
"锁竞争",
"缓存命中率低"
],
"diagnosis_steps": [
"SHOW FULL PROCESSLIST 查看当前查询",
"分析慢查询日志",
"检查 EXPLAIN 执行计划",
"检查 InnoDB 状态"
],
"solutions": [
"添加缺失索引",
"优化查询(避免 SELECT *、子查询)",
"增大缓冲池",
"考虑读写分离"
]
},
"mysql_high_memory": {
"symptoms": ["内存使用持续增长", "OOM Kill"],
"causes": [
"innodb_buffer_pool_size 设置过大",
"临时表过大",
"连接数过多",
"内存泄漏(Bug)"
],
"diagnosis_steps": [
"检查 buffer pool 大小配置",
"检查临时表使用",
"检查连接数和每个连接内存",
"监控内存增长趋势"
],
"solutions": [
"合理设置 innodb_buffer_pool_size(通常 50-70% 物理内存)",
"优化查询减少临时表",
"使用连接池控制连接数",
"升级 MySQL 版本修复已知泄漏"
]
},
"mysql_deadlock": {
"symptoms": ["Deadlock found when trying to get lock"],
"causes": [
"事务交叉锁定不同资源",
"长事务持有锁时间过长",
"索引不当导致锁升级"
],
"diagnosis_steps": [
"SHOW ENGINE INNODB STATUS",
"分析死锁日志",
"检查涉及的事务和 SQL"
],
"solutions": [
"按固定顺序访问表和行",
"缩短事务",
"使用较低隔离级别(RC)",
"添加合适索引避免锁升级"
]
},
"replication_lag": {
"symptoms": ["Seconds_Behind_Master 持续增长"],
"causes": [
"大事务",
"从库性能不足",
"网络延迟",
"从库无索引导致 SQL 执行慢"
],
"diagnosis_steps": [
"检查主库 binlog 大小",
"检查从库 IO/SQL 线程状态",
"检查从库负载",
"对比主从延迟"
],
"solutions": [
"拆分大事务",
"从库升级硬件",
"并行复制(slave_parallel_workers)",
"从库添加索引"
]
},
}
def search(self, query: str) -> List[Dict]:
"""搜索知识库"""
results = []
query_lower = query.lower()
for key, knowledge in self.KNOWLEDGE.items():
score = 0
# 匹配症状
for symptom in knowledge.get("symptoms", []):
if any(w in symptom.lower() for w in query_lower.split()):
score += 2
# 匹配关键词
if any(w in key.lower() for w in query_lower.split()):
score += 3
if score > 0:
results.append({**knowledge, "key": key, "score": score})
return sorted(results, key=lambda x: x["score"], reverse=True)
def get_troubleshooting_guide(self, symptom: str) -> str:
"""获取故障排查指南"""
results = self.search(symptom)
if not results:
return f"未找到与 '{symptom}' 相关的知识"
best = results[0]
lines = [
f"📖 故障排查指南: {best['key']}",
"=" * 40,
"",
"症状:",
]
for s in best.get("symptoms", []):
lines.append(f" - {s}")
lines.append("\n可能原因:")
for c in best.get("causes", []):
lines.append(f" - {c}")
lines.append("\n排查步骤:")
for i, step in enumerate(best.get("diagnosis_steps", []), 1):
lines.append(f" {i}. {step}")
lines.append("\n解决方案:")
for s in best.get("solutions", []):
lines.append(f" 💡 {s}")
return "\n".join(lines)
# 使用
kb = DBAKnowledgeBase()
print(kb.get_troubleshooting_guide("CPU 高"))
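The `search` method tokenises the query with `str.split()`, which works when the user separates words with spaces but yields a single token for an unbroken Chinese sentence. One lightweight alternative is to compare character bigrams. This is a minimal sketch of that swapped-in technique, not the article's method; `char_bigrams` and `overlap_score` are hypothetical names.

```python
from typing import Set


def char_bigrams(text: str) -> Set[str]:
    """Return the set of adjacent character pairs, ignoring whitespace and case."""
    compact = "".join(ch for ch in text.lower() if not ch.isspace())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def overlap_score(query: str, candidate: str) -> int:
    """Count the character bigrams shared by the query and the candidate."""
    return len(char_bigrams(query) & char_bigrams(candidate))


# An unbroken query: whitespace splitting would treat it as one token.
query = "数据库CPU使用率高怎么办"
cpu_score = overlap_score(query, "CPU 使用率高")
memory_score = overlap_score(query, "内存使用持续增长")
```

Here the CPU symptom shares far more bigrams with the query than the memory symptom does, so it would rank first. A dedicated tokenizer or embedding search would be the next step for a larger knowledge base.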
7.2 Natural-Language Interaction Agent
python
# ===== Natural-language interaction agent =====
class DBAgent:
"""数据库运维 Agent"""
def __init__(self, conn: DatabaseConnection, model: str = "gpt-4o", api_key: str = None):
self.conn = conn
self.collector = MetricCollector(conn)
self.diagnostic = DiagnosticAgent(conn, self.collector)
self.inspector = DatabaseInspector(conn)
self.sql_reviewer = SQLChangeReviewer(conn)
self.executor = SafeChangeExecutor(conn, self.sql_reviewer)
self.knowledge = DBAKnowledgeBase()
from openai import OpenAI
self.llm = OpenAI(api_key=api_key)
self.model = model
def chat(self, message: str) -> str:
"""自然语言交互"""
intent = self._detect_intent(message)
if intent == "health":
health = self.collector.health_check()
return self._format_health(health)
elif intent == "diagnose":
diagnosis = self.diagnostic.diagnose(message)
return self.diagnostic.generate_diagnosis_report(diagnosis)
elif intent == "slow_query":
analyzer = SlowQueryAnalyzer(self.conn)
return analyzer.generate_report()
elif intent == "inspect":
summary = self.inspector.run_full_inspection()
return self.inspector.generate_report(summary)
elif intent == "optimize_sql":
sql = self._extract_sql(message)
if sql:
optimizer = SQLOptimizer()
result = optimizer.analyze(sql)
return self._format_optimization(result)
return "请提供需要优化的 SQL 语句"
elif intent == "review_sql":
sql = self._extract_sql(message)
if sql:
result = self.sql_reviewer.review(sql)
return self._format_review(result)
return "请提供需要审核的 SQL 语句"
elif intent == "knowledge":
return self.knowledge.get_troubleshooting_guide(message)
else:
return self._llm_chat(message)
def _detect_intent(self, message: str) -> str:
"""检测意图"""
message_lower = message.lower()
        # Order matters: the first matching intent wins, so the specific phrase
        # "慢查询" (slow query) must be tested before the generic "慢" (slow).
        intents = {
            "slow_query": ["慢查询", "slow query", "慢sql"],
            "health": ["健康", "状态", "health", "status", "监控"],
            "diagnose": ["诊断", "故障", "慢", "卡", "报错", "diagnose", "troubleshoot"],
            "inspect": ["巡检", "检查", "inspect", "check"],
            "optimize_sql": ["优化", "optimize", "调优"],
            "review_sql": ["审核", "review", "变更", "执行"],
            "knowledge": ["怎么", "如何", "为什么", "是什么"],
        }
for intent, keywords in intents.items():
if any(kw in message_lower for kw in keywords):
return intent
return "general"
def _extract_sql(self, message: str) -> str:
"""提取 SQL"""
# 尝试提取代码块中的 SQL
import re
match = re.search(r'```sql\s*(.*?)\s*```', message, re.DOTALL)
if match:
return match.group(1)
match = re.search(r'```\s*(.*?)\s*```', message, re.DOTALL)
if match:
content = match.group(1).strip()
if content.upper().startswith(("SELECT", "INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP")):
return content
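        # Inline SQL: take everything from the first SQL keyword onward, so that
        # messages with a prefix (for example "优化这个 SQL: SELECT ...") also work.
        match = re.search(
            r'\b(SELECT|INSERT|UPDATE|DELETE|CREATE|ALTER|DROP)\b.*',
            message, re.IGNORECASE | re.DOTALL
        )
        if match:
            return match.group(0).strip()
        return ""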
def _llm_chat(self, message: str) -> str:
"""LLM 通用对话"""
context = self._build_context()
response = self.llm.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": f"你是一个数据库运维助手。\n\n当前数据库状态:\n{context}"},
{"role": "user", "content": message}
],
temperature=0.3
)
return response.choices[0].message.content
def _build_context(self) -> str:
"""构建上下文"""
try:
health = self.collector.health_check()
metrics = health.get("metrics", {})
return f"""
数据库: {self.conn.config.safe_display}
状态: {health.get('status', 'unknown')}
问题: {len(health.get('issues', []))}
指标: {json.dumps(metrics, ensure_ascii=False, default=str)}
"""
except Exception:
return "无法获取数据库状态"
@staticmethod
def _format_health(health: Dict) -> str:
lines = [
f"📊 数据库状态: {'✅ 健康' if health['status'] == 'healthy' else '❌ 不健康'}",
]
for issue in health.get("issues", []):
icon = "🔴" if issue["severity"] == "critical" else "🟡"
lines.append(f" {icon} {issue['message']}")
return "\n".join(lines)
@staticmethod
def _format_optimization(result: Dict) -> str:
lines = [
f"🔍 SQL 优化分析",
f" 问题数: {result['issue_count']}",
f" 可自动修复: {result['auto_fix_count']}",
]
for issue in result.get("issues", []):
lines.append(f" ⚠️ {issue['message']}")
for opt in result.get("optimizations", []):
lines.append(f" ✨ {opt['rule']}: 已优化")
return "\n".join(lines)
@staticmethod
def _format_review(result: Dict) -> str:
icon = {"critical": "🔴", "major": "🟡", "warning": "🔵", "info": "✅"}[result["severity"]]
lines = [
f"{icon} SQL 审核结果",
f" 严重度: {result['severity']}",
f" 锁风险: {result['lock_risk']}",
f" 需要审批: {'是' if result['requires_approval'] else '否'}",
]
for issue in result.get("issues", []):
lines.append(f" ⚠️ {issue['message']}")
if result.get("rollback_sql"):
lines.append(f" 🔄 回滚: {result['rollback_sql']}")
return "\n".join(lines)
# 使用示例
if __name__ == '__main__':
config = DatabaseConfig(
db_type="mysql",
host="localhost",
port=3306,
database="production",
username="root",
password="secret"
)
conn = ConnectionFactory.create(config)
agent = DBAgent(conn)
# 自然语言交互
print(agent.chat("数据库健康状态如何?"))
print(agent.chat("帮我分析慢查询"))
print(agent.chat("优化这个 SQL: SELECT * FROM orders WHERE DATE(created_at) = '2024-01-15'"))
print(agent.chat("审核这个变更: DELETE FROM logs WHERE created_at < '2023-01-01'"))
print(agent.chat("CPU 使用率高怎么排查?"))
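Keyword routing of this kind is sensitive to overlapping keywords: 慢 ("slow") is a substring of 慢查询 ("slow query"), so whichever intent is tested first wins. An order-independent alternative is to prefer the longest matching keyword. The sketch below illustrates that idea with a reduced rule set; `detect_intent_longest` and `RULES` are hypothetical and not part of the original agent.

```python
from typing import Dict, List

# Reduced, illustrative rule set; the generic intent is listed first on purpose.
RULES: Dict[str, List[str]] = {
    "diagnose": ["慢", "故障", "diagnose"],
    "slow_query": ["慢查询", "slow query"],
    "health": ["健康", "health"],
}


def detect_intent_longest(message: str, rules: Dict[str, List[str]]) -> str:
    """Pick the intent whose matching keyword is the longest; default to 'general'."""
    text = message.lower()
    best_intent, best_length = "general", 0
    for intent, keywords in rules.items():
        for keyword in keywords:
            if keyword in text and len(keyword) > best_length:
                best_intent, best_length = intent, len(keyword)
    return best_intent


specific = detect_intent_longest("帮我分析慢查询", RULES)
generic = detect_intent_longest("数据库很慢", RULES)
```

Longest-match routing removes the dependence on dict order, at the cost of still being purely lexical. Ambiguous messages could fall back to the LLM path.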
8. Summary
8.1 Architecture Overview
[Architecture diagram: not rendered in the source. The surviving fragment shows intent recognition (INTENT) feeding route dispatch (ROUTE).]
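8.2 Capability Matrix
| Capability | Rule engine | LLM | Human |
|---|---|---|---|
| Real-time monitoring | ✅ Automatic | - | Confirm |
| Slow-query analysis | ✅ Automatic | ✅ Enhanced | Review |
| SQL optimization | ✅ Rule-based | ✅ In-depth | Confirm |
| Fault diagnosis | ✅ Preliminary | ✅ In-depth | Final decision |
| Change review | ✅ Automatic | ⚠️ Assists | Approve |
| Change execution | ✅ Safe execution | - | Confirm high-risk changes |
| Inspection | ✅ Automatic | - | Review |
| Capacity forecasting | ✅ Data-driven | ⚠️ Assists | Decide |
| Knowledge lookup | ✅ Knowledge base | ✅ Enhanced | - |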
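8.3 Best Practices
| Practice | Description |
|---|---|
| Tiered decisions | Low-risk changes run automatically, medium-risk changes need human approval, and high-risk changes are never automated |
| Check before changing | Run dry_run before executing and provide a rollback plan |
| Knowledge accumulation | Feed incident-handling experience back into the knowledge base |
| Gradual automation | Progress step by step: monitoring → diagnosis → recommendations → semi-automatic → fully automatic |
| Safety boundary | DROP, TRUNCATE, and UPDATE/DELETE without WHERE always require human approval |
| Observability | Record every automated operation in an audit log |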
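The tiered-decision practice can be written down as an explicit policy table so that the mapping is reviewable in one place. The sketch below assumes the reviewer's four severity labels (`info`, `warning`, `major`, `critical`) and maps them onto the three tiers; the `Action` enum, the mapping itself, and `decide` are hypothetical choices for illustration.

```python
from enum import Enum


class Action(Enum):
    AUTO_EXECUTE = "auto_execute"          # low risk: run without a human
    NEEDS_APPROVAL = "needs_approval"      # medium risk: wait for approval
    BLOCK_AUTOMATION = "block_automation"  # high risk: never automated


# Assumed mapping from severity labels to tiers.
_POLICY = {
    "info": Action.AUTO_EXECUTE,
    "warning": Action.NEEDS_APPROVAL,
    "major": Action.NEEDS_APPROVAL,
    "critical": Action.BLOCK_AUTOMATION,
}


def decide(severity: str) -> Action:
    """Return the tier for a severity label; unknown labels get the strictest tier."""
    return _POLICY.get(severity, Action.BLOCK_AUTOMATION)
```

Defaulting unknown labels to the strictest tier means a typo or a newly added severity fails closed rather than open.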
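8.4 Approach Comparison
| Approach | Automation level | Safety | Typical use |
|---|---|---|---|
| Rules only | Medium | High | Routine monitoring, inspection |
| Rules + LLM | High | Medium | Intelligent diagnosis, SQL optimization |
| LLM only | High | Low | Knowledge Q&A, assisted analysis |
| Hybrid (recommended) | High | High | Production operations |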
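This article walks through the technical stack for AI Agent-driven database operations: metric collection and intelligent diagnosis, SQL optimization and change management, automated inspection and capacity planning, and a knowledge base with natural-language interaction.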