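Table of Contents
- Abstract
- [1. Introduction - Intelligent Operations Overview](#1. Introduction - Intelligent Operations Overview)
  - [1.1 The Need for Operations Automation](#1.1 The Need for Operations Automation)
  - [1.2 System Architecture](#1.2 System Architecture)
  - [1.3 Core Feature Plan](#1.3 Core Feature Plan)
- [2. Monitoring Data Collection](#2. Monitoring Data Collection)
  - [2.1 Host Monitoring Collector](#2.1 Host Monitoring Collector)
  - [2.2 Application Monitoring Collector](#2.2 Application Monitoring Collector)
- [3. Log Analysis System](#3. Log Analysis System)
  - [3.1 Log Collection and Parsing](#3.1 Log Collection and Parsing)
- [4. Alert Management System](#4. Alert Management System)
  - [4.1 Alert Rule Engine](#4.1 Alert Rule Engine)
  - [4.2 Alert Noise Reduction](#4.2 Alert Noise Reduction)
- [5. Fault Diagnosis System](#5. Fault Diagnosis System)
  - [5.1 Root Cause Analysis](#5.1 Root Cause Analysis)
- [6. Automated Remediation System](#6. Automated Remediation System)
  - [6.1 Automated Remediation Executor](#6.1 Automated Remediation Executor)
- [7. Best Practices](#7. Best Practices)
  - [7.1 System Design Principles](#7.1 System Design Principles)
  - [7.2 Common Problems](#7.2 Common Problems)
- [8. Summary](#8. Summary)
  - [8.1 Key Takeaways](#8.1 Key Takeaways)
  - [8.2 Next Steps](#8.2 Next Steps)
- References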
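Abstract
This article walks through a complete automated operations system to show how OpenClaw can be used to build an intelligent operations platform. It covers the core functions of monitoring and alerting, log analysis, fault diagnosis, and automated remediation, so that developers can apply OpenClaw to operations automation. The system design and the code together show how such a system is built from end to end. 🔧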
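1. Introduction - Intelligent Operations Overview
1.1 The Need for Operations Automation
Operating modern IT systems raises challenges that traditional operations practices struggle to meet:

| Challenge | Traditional Ops | OpenClaw Intelligent Ops |
| --- | --- | --- |
| Alert storms | Manual triage | Intelligent aggregation and noise reduction |
| Fault localization | Layer-by-layer troubleshooting | Root cause analysis |
| Remediation efficiency | Manual execution | Automated remediation |
| Knowledge transfer | Depends on individual experience | Captured in a knowledge base |
| 24/7 response | On-call shifts | Around-the-clock AI |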
1.2 System Architecture
[Architecture diagram: four layers, from data collection up to automated execution]
- Data collection layer: host monitoring, application monitoring, log collection, network monitoring
- Data processing layer: data cleaning, metric aggregation, log parsing, anomaly detection
- Intelligent analysis layer: alert analysis, root cause localization, trend forecasting, capacity planning
- Automated execution layer: alert notification, automated remediation, ticket creation, report generation
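1.3 Core Feature Plan

| Module | Core Capability | Implementation |
| --- | --- | --- |
| Monitoring collection | Multi-source data collection | Prometheus + custom collectors |
| Log analysis | Log parsing and analysis | ELK + AI analysis |
| Alert management | Intelligent alert noise reduction | Rule engine + ML |
| Fault diagnosis | Root cause analysis | Knowledge graph + inference |
| Automated remediation | Self-healing | Ansible + scripts |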
2. Monitoring Data Collection
2.1 Host Monitoring Collector
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import threading
import time


@dataclass
class Metric:
    """A single monitoring metric sample."""
    name: str
    value: float
    timestamp: float
    tags: Dict[str, str]
    unit: str = ""


class HostMonitor:
    """Host monitoring collector."""

    def __init__(self, host: str, interval: int = 60):
        self.host = host
        self.interval = interval
        self.running = False
        self.metrics: List[Metric] = []
        self.collectors = {
            "cpu": self._collect_cpu,
            "memory": self._collect_memory,
            "disk": self._collect_disk,
            "network": self._collect_network,
        }

    def start(self):
        """Start collecting in a background daemon thread."""
        self.running = True
        thread = threading.Thread(target=self._collect_loop, daemon=True)
        thread.start()

    def stop(self):
        """Stop collecting; the loop exits after its current sleep."""
        self.running = False

    def _collect_loop(self):
        """Run every collector, then sleep for one interval."""
        while self.running:
            for collector_name, collector_func in self.collectors.items():
                try:
                    self.metrics.extend(collector_func())
                except Exception as e:
                    print(f"Collecting {collector_name} failed: {e}")
            time.sleep(self.interval)

    def _collect_cpu(self) -> List[Metric]:
        """Collect CPU metrics."""
        # A real collector would use psutil or SSH; the values here are
        # fixed placeholders to keep the example simple.
        now = time.time()
        tags = {"host": self.host}
        return [
            Metric(name="cpu.usage", value=45.5, timestamp=now, tags=tags, unit="%"),
            Metric(name="cpu.load1", value=2.5, timestamp=now, tags=tags),
            Metric(name="cpu.load5", value=3.2, timestamp=now, tags=tags),
            Metric(name="cpu.load15", value=2.8, timestamp=now, tags=tags),
        ]

    def _collect_memory(self) -> List[Metric]:
        """Collect memory metrics (placeholder values)."""
        now = time.time()
        tags = {"host": self.host}
        gib = 1024 * 1024 * 1024
        return [
            Metric(name="memory.usage", value=75.3, timestamp=now, tags=tags, unit="%"),
            Metric(name="memory.used", value=12.5 * gib, timestamp=now, tags=tags,
                   unit="bytes"),  # 12.5 GB
            Metric(name="memory.total", value=16 * gib, timestamp=now, tags=tags,
                   unit="bytes"),  # 16 GB
        ]

    def _collect_disk(self) -> List[Metric]:
        """Collect disk metrics for the root partition (placeholder values)."""
        now = time.time()
        tags = {"host": self.host, "mount": "/"}
        return [
            Metric(name="disk.usage", value=68.5, timestamp=now, tags=tags, unit="%"),
            Metric(name="disk.iops.read", value=150, timestamp=now, tags=tags,
                   unit="ops/s"),
            Metric(name="disk.iops.write", value=80, timestamp=now, tags=tags,
                   unit="ops/s"),
        ]

    def _collect_network(self) -> List[Metric]:
        """Collect network metrics for eth0 (placeholder values)."""
        now = time.time()
        tags = {"host": self.host, "interface": "eth0"}
        mib = 1024 * 1024
        return [
            Metric(name="network.bytes.in", value=50 * mib, timestamp=now, tags=tags,
                   unit="bytes/s"),  # 50 MB/s
            Metric(name="network.bytes.out", value=30 * mib, timestamp=now, tags=tags,
                   unit="bytes/s"),  # 30 MB/s
            Metric(name="network.packets.in", value=50000, timestamp=now, tags=tags,
                   unit="packets/s"),
            Metric(name="network.packets.out", value=30000, timestamp=now, tags=tags,
                   unit="packets/s"),
        ]

    def get_metrics(self, name: Optional[str] = None,
                    since: Optional[float] = None) -> List[Metric]:
        """Return collected metrics, optionally filtered by name and start time."""
        result = list(self.metrics)  # snapshot; the collector thread keeps appending
        if name:
            result = [m for m in result if m.name == name]
        if since is not None:
            result = [m for m in result if m.timestamp >= since]
        return result


# Usage example
monitor = HostMonitor("server-01", interval=30)
monitor.start()

# Wait for a few collection cycles
time.sleep(60)

# Read the collected metrics
metrics = monitor.get_metrics()
print(f"Collected {len(metrics)} metrics")
monitor.stop()
```
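The collectors above return fixed sample values. As a minimal sketch of what a real local reading could look like using only the standard library, the example below reads load averages with `os.getloadavg()` and disk usage with `shutil.disk_usage()`. The `Sample` dataclass and the `collect_local` function are hypothetical names introduced for illustration; they are not part of OpenClaw or of the collector above.

```python
import os
import shutil
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical stand-in for the Metric dataclass used in this section."""
    name: str
    value: float
    timestamp: float
    tags: Dict[str, str] = field(default_factory=dict)
    unit: str = ""


def collect_local(host: str, mount: str = "/") -> List[Sample]:
    """Read load averages and disk usage for the local machine."""
    now = time.time()
    samples: List[Sample] = []

    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            loads = os.getloadavg()
        except OSError:
            loads = ()
        for label, load in zip(("load1", "load5", "load15"), loads):
            samples.append(Sample(f"cpu.{label}", load, now, {"host": host}))

    # shutil.disk_usage() returns a named tuple (total, used, free) in bytes.
    usage = shutil.disk_usage(mount)
    percent = usage.used / usage.total * 100 if usage.total else 0.0
    samples.append(Sample("disk.usage", round(percent, 1), now,
                          {"host": host, "mount": mount}, "%"))
    return samples


for s in collect_local("server-01"):
    print(f"{s.name} = {s.value}{s.unit}")
```

A function of this shape could replace the body of `_collect_cpu` or `_collect_disk`; remote hosts would still need an agent or SSH, which this sketch does not cover.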
2.2 Application Monitoring Collector
```python
from typing import Dict, List
import time

import requests

# Metric is the dataclass defined in section 2.1.


class ApplicationMonitor:
    """Application monitoring collector."""

    def __init__(self):
        self.apps: Dict[str, dict] = {}

    def register_app(self, app_name: str, endpoints: Dict):
        """
        Register an application.

        Args:
            app_name: application name
            endpoints: endpoint configuration, for example
                {
                    "health": "http://app:8080/health",
                    "metrics": "http://app:8080/metrics",
                    "ready": "http://app:8080/ready"
                }
        """
        self.apps[app_name] = {
            "name": app_name,
            "endpoints": endpoints,
            "status": "unknown",
        }

    def check_health(self, app_name: str) -> Dict:
        """Check the health status of an application."""
        app = self.apps.get(app_name)
        if not app:
            return {"error": "application not registered"}
        health_url = app["endpoints"].get("health")
        if not health_url:
            return {"error": "no health check endpoint configured"}
        try:
            start_time = time.time()
            response = requests.get(health_url, timeout=5)
            elapsed = time.time() - start_time
            if response.status_code == 200:
                app["status"] = "healthy"
                is_json = response.headers.get("content-type", "").startswith(
                    "application/json")
                return {
                    "status": "healthy",
                    "response_time": elapsed,
                    "details": response.json() if is_json else response.text,
                }
            app["status"] = "unhealthy"
            return {"status": "unhealthy", "status_code": response.status_code}
        except requests.exceptions.Timeout:
            app["status"] = "timeout"
            return {"status": "timeout", "error": "request timed out"}
        except Exception as e:
            app["status"] = "error"
            return {"status": "error", "error": str(e)}

    def collect_metrics(self, app_name: str) -> List[Metric]:
        """Collect application metrics."""
        app = self.apps.get(app_name)
        if not app:
            return []
        metrics_url = app["endpoints"].get("metrics")
        if not metrics_url:
            return []
        try:
            response = requests.get(metrics_url, timeout=10)
            if response.status_code == 200:
                # Parse metrics in the Prometheus text format
                return self._parse_prometheus_metrics(response.text, app_name)
        except Exception as e:
            print(f"Collecting metrics failed: {e}")
        return []

    def _parse_prometheus_metrics(self, text: str, app_name: str) -> List[Metric]:
        """Parse metrics in the Prometheus text format (simplified)."""
        metrics = []
        for line in text.split('\n'):
            if line.startswith('#') or not line.strip():
                continue
            try:
                # Simplified parsing: metric_name{labels} value [timestamp]
                tags = {"app": app_name}
                if '{' in line:
                    name, rest = line.split('{', 1)
                    labels_str, value_part = rest.rsplit('}', 1)
                    value = float(value_part.split()[0])
                    # Parse labels; commas inside label values are not handled
                    for label in labels_str.split(','):
                        if '=' in label:
                            k, v = label.split('=', 1)
                            tags[k.strip()] = v.strip().strip('"')
                else:
                    parts = line.split()
                    name = parts[0]
                    value = float(parts[1])
                metrics.append(Metric(
                    name=name,
                    value=value,
                    timestamp=time.time(),
                    tags=tags,
                ))
            except Exception:
                # Skip lines that do not match the simplified format
                continue
        return metrics

    def get_all_status(self) -> Dict:
        """Return the health status of every registered application."""
        result = {}
        for app_name in self.apps:
            result[app_name] = self.check_health(app_name)
        return result


# Usage example
app_monitor = ApplicationMonitor()

# Register applications
app_monitor.register_app("web-api", {
    "health": "http://web-api:8080/health",
    "metrics": "http://web-api:8080/metrics",
})
app_monitor.register_app("user-service", {
    "health": "http://user-service:8081/health",
    "metrics": "http://user-service:8081/metrics",
})

# Check health status
status = app_monitor.get_all_status()
for app_name, app_status in status.items():
    # Results without a "status" key carry only an "error" message
    print(f"{app_name}: {app_status.get('status', app_status.get('error'))}")
```
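The simplified parser in `_parse_prometheus_metrics` splits label pairs on commas, so a label value that itself contains a comma is mis-parsed. The sketch below shows one way a regular expression could read a single sample line, including quoted label values and an optional trailing timestamp. `parse_line`, `LINE_RE`, and `LABEL_RE` are hypothetical names for this illustration, and the expressions cover the common cases of the text format rather than the full specification.

```python
import re
from typing import Dict, Optional, Tuple

# name{labels} value [timestamp]; the label block is optional.
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?\s*$'
)
# key="value", where the value may contain backslash-escaped characters.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

_ESCAPES = {"n": "\n", '"': '"', "\\": "\\"}


def _unescape(value: str) -> str:
    """Resolve the escape sequences \\n, \\" and \\\\ in a label value."""
    return re.sub(r'\\(.)', lambda m: _ESCAPES.get(m.group(1), m.group(0)), value)


def parse_line(line: str) -> Optional[Tuple[str, Dict[str, str], float]]:
    """Return (name, labels, value), or None for comments and bad lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if not match:
        return None
    name, labels_str, value_str, _timestamp = match.groups()
    try:
        value = float(value_str)  # float() also accepts NaN, +Inf and -Inf
    except ValueError:
        return None
    labels = {k: _unescape(v) for k, v in LABEL_RE.findall(labels_str or "")}
    return name, labels, value


print(parse_line('http_requests_total{method="GET",path="/a=b,c"} 1027 1395066363000'))
```

For production use, a maintained parser from a Prometheus client library is likely a safer choice than hand-written expressions.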
3. Log Analysis System
3.1 Log Collection and Parsing
```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import re
from datetime import datetime, timedelta

ERROR_LEVELS = {"ERROR", "FATAL", "CRITICAL"}


@dataclass
class LogEntry:
    """A parsed log entry."""
    timestamp: datetime
    level: str
    message: str
    source: str
    metadata: Dict


class LogParser:
    """Log parser."""

    def __init__(self):
        self.patterns = {
            "nginx": r'(?P<ip>[\d.]+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>[^\s]+) HTTP/[\d.]+" (?P<status>\d+) (?P<size>\d+)',
            "java": r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[(?P<thread>[^\]]+)\] (?P<level>\w+)\s+(?P<class>[^\s]+) - (?P<message>.+)',
            "python": r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (?P<level>\w+) - (?P<message>.+)',
            "syslog": r'(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<process>[^\[]+)\[(?P<pid>\d+)\]: (?P<message>.+)'
        }

    def parse(self, line: str, log_type: str) -> Optional[LogEntry]:
        """Parse one log line; return None if the type or format is unknown."""
        pattern = self.patterns.get(log_type)
        if not pattern:
            return None
        match = re.match(pattern, line)
        if not match:
            return None
        groups = match.groupdict()
        # Parse the timestamp
        timestamp = self._parse_timestamp(groups.get("timestamp", ""))
        # Determine the log level (formats without one default to INFO)
        level = groups.get("level", "INFO").upper()
        # Extract the message
        message = groups.get("message", line)
        # Everything else becomes metadata
        metadata = {k: v for k, v in groups.items()
                    if k not in ["timestamp", "level", "message"]}
        return LogEntry(
            timestamp=timestamp,
            level=level,
            message=message,
            source=log_type,
            metadata=metadata,
        )

    def _parse_timestamp(self, ts_str: str) -> datetime:
        """Parse a timestamp into a naive local datetime."""
        formats = [
            "%Y-%m-%d %H:%M:%S,%f",
            "%d/%b/%Y:%H:%M:%S %z",
            "%b %d %H:%M:%S",
        ]
        for fmt in formats:
            try:
                parsed = datetime.strptime(ts_str, fmt)
            except ValueError:
                continue
            if parsed.tzinfo is not None:
                # Drop the offset so entries compare with datetime.now()
                parsed = parsed.astimezone().replace(tzinfo=None)
            if parsed.year == 1900:
                # syslog timestamps carry no year; assume the current one
                parsed = parsed.replace(year=datetime.now().year)
            return parsed
        # Unknown format: fall back to the time of parsing
        return datetime.now()


class LogAnalyzer:
    """Log analyzer."""

    def __init__(self):
        self.parser = LogParser()
        self.logs: List[LogEntry] = []
        self.error_patterns = [
            r'Exception',
            r'Error',
            r'Failed',
            r'Timeout',
            r'Connection refused',
        ]

    def add_log(self, line: str, log_type: str):
        """Parse a line and store it if it matches the given log type."""
        entry = self.parser.parse(line, log_type)
        if entry:
            self.logs.append(entry)

    def analyze_errors(
        self, time_range: Optional[Tuple[datetime, datetime]] = None
    ) -> List[Dict]:
        """Return error-level entries, optionally limited to a time range."""
        errors = []
        for log in self.logs:
            if log.level not in ERROR_LEVELS:
                continue
            # Skip entries outside the requested time range
            if time_range and not (time_range[0] <= log.timestamp <= time_range[1]):
                continue
            # Record which known error patterns the message matches
            matched_patterns = [p for p in self.error_patterns
                                if re.search(p, log.message)]
            errors.append({
                "timestamp": log.timestamp.isoformat(),
                "level": log.level,
                "message": log.message,
                "source": log.source,
                "patterns": matched_patterns,
                "metadata": log.metadata,
            })
        return errors

    def get_statistics(self) -> Dict:
        """Return entry counts by level and by source."""
        level_counts = {}
        source_counts = {}
        for log in self.logs:
            level_counts[log.level] = level_counts.get(log.level, 0) + 1
            source_counts[log.source] = source_counts.get(log.source, 0) + 1
        return {
            "total_logs": len(self.logs),
            "by_level": level_counts,
            "by_source": source_counts,
        }

    def detect_anomalies(self) -> List[Dict]:
        """Detect a spike in errors over the last five minutes."""
        anomalies = []
        # Simplified: count the errors logged in the last 5 minutes
        recent_time = datetime.now() - timedelta(minutes=5)
        recent_errors = [log for log in self.logs
                         if log.level in ERROR_LEVELS and log.timestamp >= recent_time]
        if len(recent_errors) > 10:  # threshold
            anomalies.append({
                "type": "error_spike",
                "count": len(recent_errors),
                "severity": "high" if len(recent_errors) > 50 else "medium",
            })
        return anomalies


# Usage example
analyzer = LogAnalyzer()

# Add log lines
analyzer.add_log('2026-04-20 12:00:00,123 - ERROR - Connection refused to database', 'python')
analyzer.add_log('2026-04-20 12:00:01,456 - INFO - Request processed successfully', 'python')
analyzer.add_log('2026-04-20 12:00:02,789 - ERROR - Timeout waiting for response', 'python')

# Analyze errors
errors = analyzer.analyze_errors()
print(f"Found {len(errors)} errors")

# Get statistics
stats = analyzer.get_statistics()
print(f"Log statistics: {stats}")
```
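`detect_anomalies` compares an absolute error count against a fixed threshold of 10 and measures the window from `datetime.now()`, so it cannot be replayed over historical logs. A rough alternative, sketched below, takes the reference time as a parameter and flags a spike by the share of error entries in the window. The function name `error_rate_spike` and the defaults `min_events=20` and `ratio_threshold=0.2` are assumptions chosen for illustration, not tuned values.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple

ERROR_LEVEL_NAMES = {"ERROR", "FATAL", "CRITICAL"}


def error_rate_spike(
    entries: Iterable[Tuple[datetime, str]],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
    min_events: int = 20,
    ratio_threshold: float = 0.2,
) -> Optional[Dict]:
    """Flag a spike when errors make up a large share of recent entries.

    entries: (timestamp, level) pairs; timestamps must be comparable to `now`.
    Returns a summary dict, or None when there is no spike.
    """
    recent = [level for ts, level in entries if now - window <= ts <= now]
    if len(recent) < min_events:
        return None  # too little traffic for a meaningful ratio
    errors = sum(1 for level in recent if level in ERROR_LEVEL_NAMES)
    ratio = errors / len(recent)
    if ratio < ratio_threshold:
        return None
    return {
        "type": "error_rate_spike",
        "errors": errors,
        "total": len(recent),
        "ratio": round(ratio, 3),
    }


# Replay a fixed window: 30 INFO entries and 10 ERROR entries.
reference = datetime(2026, 4, 20, 12, 0, 0)
sample = [(reference - timedelta(seconds=i), "INFO") for i in range(30)]
sample += [(reference - timedelta(seconds=i), "ERROR") for i in range(10)]
print(error_rate_spike(sample, reference))
```

Using a ratio rather than a count keeps the check meaningful as traffic grows, at the cost of needing a minimum event count to avoid noise on quiet services.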
4. Alert Management System
4.1 Alert Rule Engine
```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional
from enum import Enum
import time

# Metric is the dataclass defined in section 2.1.


class AlertSeverity(Enum):
    """Alert severity levels."""
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"


@dataclass
class Alert:
    """An alert instance."""
    id: str
    name: str
    severity: AlertSeverity
    message: str
    source: str
    timestamp: float
    labels: Dict[str, str]
    status: str = "firing"
    acknowledged: bool = False
    resolved_at: Optional[float] = None


class AlertRule:
    """An alert rule."""

    def __init__(self, name: str, condition: Callable, severity: AlertSeverity,
                 duration: int = 0, labels: Optional[Dict[str, str]] = None):
        self.name = name
        self.condition = condition
        self.severity = severity
        self.duration = duration  # seconds the condition must hold before firing
        self.labels = labels or {}
        self.firing_since: Optional[float] = None


class AlertManager:
    """Alert manager."""

    def __init__(self):
        self.rules: Dict[str, AlertRule] = {}
        self.alerts: Dict[str, Alert] = {}
        self.handlers: List[Callable] = []

    def add_rule(self, rule: AlertRule):
        """Add an alert rule."""
        self.rules[rule.name] = rule

    def add_handler(self, handler: Callable):
        """Add an alert handler."""
        self.handlers.append(handler)

    def evaluate(self, metrics: List[Metric]):
        """Evaluate every rule against the given metrics."""
        for rule_name, rule in self.rules.items():
            try:
                is_firing = rule.condition(metrics)
                if is_firing:
                    if rule.firing_since is None:
                        rule.firing_since = time.time()
                    # Fire only once the condition has held for the duration
                    if time.time() - rule.firing_since >= rule.duration:
                        self._fire_alert(rule, metrics)
                elif rule.firing_since is not None:
                    self._resolve_alert(rule)
                    rule.firing_since = None
            except Exception as e:
                print(f"Evaluating rule {rule_name} failed: {e}")

    def _fire_alert(self, rule: AlertRule, metrics: List[Metric]):
        """Fire an alert once per firing episode of a rule."""
        # Key the ID on when the episode started, so repeated evaluations
        # of the same episode do not create duplicate alerts.
        started = rule.firing_since if rule.firing_since is not None else time.time()
        alert_id = f"alert_{rule.name}_{int(started)}"
        if alert_id not in self.alerts:
            alert = Alert(
                id=alert_id,
                name=rule.name,
                severity=rule.severity,
                message=self._generate_message(rule, metrics),
                source="monitor",
                timestamp=time.time(),
                labels=rule.labels,
            )
            self.alerts[alert_id] = alert
            # Call the handlers
            for handler in self.handlers:
                try:
                    handler(alert)
                except Exception as e:
                    print(f"Handler failed: {e}")

    def _resolve_alert(self, rule: AlertRule):
        """Resolve all firing alerts of a rule."""
        for alert in self.alerts.values():
            if alert.name == rule.name and alert.status == "firing":
                alert.status = "resolved"
                alert.resolved_at = time.time()

    def _generate_message(self, rule: AlertRule, metrics: List[Metric]) -> str:
        """Generate the alert message (simplified)."""
        return f"Alert rule {rule.name} triggered"

    def get_active_alerts(self) -> List[Alert]:
        """Return alerts that are still firing."""
        return [a for a in self.alerts.values() if a.status == "firing"]

    def acknowledge(self, alert_id: str):
        """Acknowledge an alert."""
        if alert_id in self.alerts:
            self.alerts[alert_id].acknowledged = True


# Define alert conditions
def cpu_high_condition(metrics: List[Metric]) -> bool:
    """Average CPU usage above 80%."""
    cpu_metrics = [m for m in metrics if m.name == "cpu.usage"]
    if cpu_metrics:
        avg_cpu = sum(m.value for m in cpu_metrics) / len(cpu_metrics)
        return avg_cpu > 80
    return False


def memory_high_condition(metrics: List[Metric]) -> bool:
    """Average memory usage above 90%."""
    mem_metrics = [m for m in metrics if m.name == "memory.usage"]
    if mem_metrics:
        avg_mem = sum(m.value for m in mem_metrics) / len(mem_metrics)
        return avg_mem > 90
    return False


def disk_full_condition(metrics: List[Metric]) -> bool:
    """Any disk above 85% usage."""
    return any(m.value > 85 for m in metrics if m.name == "disk.usage")


# Usage example
alert_manager = AlertManager()

# Add rules
alert_manager.add_rule(AlertRule(
    name="cpu_high",
    condition=cpu_high_condition,
    severity=AlertSeverity.WARNING,
    duration=60,
    labels={"team": "ops"},
))
alert_manager.add_rule(AlertRule(
    name="memory_high",
    condition=memory_high_condition,
    severity=AlertSeverity.CRITICAL,
    duration=30,
    labels={"team": "ops"},
))
alert_manager.add_rule(AlertRule(
    name="disk_full",
    condition=disk_full_condition,
    severity=AlertSeverity.WARNING,
    labels={"team": "ops"},
))


# Add a handler
def send_notification(alert: Alert):
    """Send a notification."""
    print(f"[{alert.severity.value.upper()}] {alert.name}: {alert.message}")


alert_manager.add_handler(send_notification)

# Evaluate the rules
metrics = [
    Metric("cpu.usage", 85, time.time(), {"host": "server-01"}),
    Metric("memory.usage", 92, time.time(), {"host": "server-01"}),
]
# Both matching rules have a non-zero duration, so this first call only
# starts their timers; they fire once later evaluate() calls show the
# condition has held for 60 s and 30 s respectively.
alert_manager.evaluate(metrics)
```
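The `duration` field means a rule must hold continuously before it fires, which is easiest to see with an explicit clock. The sketch below isolates that state machine into a hypothetical `PendingTracker` class that takes the current time as an argument, so the transitions can be stepped through without waiting. It illustrates the idea only and is not a drop-in part of `AlertManager`.

```python
from typing import Optional


class PendingTracker:
    """Track whether a condition has held long enough to fire."""

    def __init__(self, duration: float):
        self.duration = duration            # seconds the condition must hold
        self.since: Optional[float] = None  # when the condition first became true

    def update(self, condition_met: bool, now: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for this evaluation."""
        if not condition_met:
            self.since = None  # any gap resets the timer
            return "inactive"
        if self.since is None:
            self.since = now
        return "firing" if now - self.since >= self.duration else "pending"


tracker = PendingTracker(duration=60)
for t, met in [(0, True), (30, True), (60, True), (90, False), (120, True)]:
    print(t, tracker.update(met, now=t))
```

Passing `now` in, rather than calling `time.time()` inside the method, also makes the rule logic straightforward to unit test.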
4.2 Alert Noise Reduction
```python
from typing import Dict, List
from collections import defaultdict
import time

# Alert and AlertSeverity are defined in section 4.1.


class AlertDeduplicator:
    """Alert deduplicator."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.seen_alerts: Dict[str, float] = {}

    def should_send(self, alert: Alert) -> bool:
        """Decide whether an alert should be sent."""
        # Build the alert fingerprint
        fingerprint = self._generate_fingerprint(alert)
        now = time.time()
        # Suppress the alert if the same fingerprint was sent within the window
        if fingerprint in self.seen_alerts:
            if now - self.seen_alerts[fingerprint] < self.window:
                return False
        # Record the send time
        self.seen_alerts[fingerprint] = now
        # Drop expired records
        self._cleanup(now)
        return True

    def _generate_fingerprint(self, alert: Alert) -> str:
        """Build a fingerprint from rule name, source, and host."""
        return f"{alert.name}:{alert.source}:{alert.labels.get('host', '')}"

    def _cleanup(self, now: float):
        """Remove records older than twice the window."""
        expired = [k for k, v in self.seen_alerts.items() if now - v > self.window * 2]
        for k in expired:
            del self.seen_alerts[k]


class AlertAggregator:
    """Alert aggregator."""

    def __init__(self, window_seconds: int = 60):
        # The window is stored but not yet used; this simplified version
        # aggregates by count only.
        self.window = window_seconds
        self.pending: Dict[str, List[Alert]] = defaultdict(list)

    def add(self, alert: Alert) -> bool:
        """
        Add an alert.

        Returns:
            True if the group should be sent immediately.
        """
        # Group by rule name
        group_key = alert.name
        self.pending[group_key].append(alert)
        # Send once the group reaches the aggregation threshold
        return len(self.pending[group_key]) >= 5

    def get_aggregated(self) -> Dict[str, List[Alert]]:
        """Return the aggregated alerts and clear the pending groups."""
        result = dict(self.pending)
        self.pending.clear()
        return result


# Usage example
deduplicator = AlertDeduplicator(window_seconds=300)
aggregator = AlertAggregator(window_seconds=60)

# Process an alert
alert1 = Alert(
    id="alert_001",
    name="cpu_high",
    severity=AlertSeverity.WARNING,
    message="CPU usage too high",
    source="monitor",
    timestamp=time.time(),
    labels={"host": "server-01"},
)
if deduplicator.should_send(alert1):
    print("Sending alert")
else:
    print("Alert deduplicated")
```
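`AlertAggregator` stores `window_seconds` but groups alerts by count only. One way the time window could be put to use is sketched below: alerts are collected per rule name, and a single summary line is emitted once the oldest alert in a group is at least one window old. `WindowedAggregator` is a hypothetical class, and alerts are reduced to `(name, host)` pairs to keep the example self-contained.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class WindowedAggregator:
    """Group alerts by rule name and flush one summary per elapsed window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        # rule name -> list of (arrival time, host)
        self._groups: Dict[str, List[Tuple[float, str]]] = defaultdict(list)

    def add(self, name: str, host: str, now: float) -> None:
        """Record one alert occurrence."""
        self._groups[name].append((now, host))

    def flush(self, now: float) -> List[str]:
        """Emit a summary for each group whose oldest alert has aged out."""
        summaries = []
        for name in list(self._groups):
            items = self._groups[name]
            if now - items[0][0] >= self.window:
                hosts = sorted({host for _, host in items})
                summaries.append(
                    f"{name}: {len(items)} alerts on {len(hosts)} hosts "
                    f"({', '.join(hosts)})"
                )
                del self._groups[name]
        return summaries


agg = WindowedAggregator(window_seconds=60)
agg.add("cpu_high", "server-01", now=0)
agg.add("cpu_high", "server-02", now=10)
agg.add("cpu_high", "server-01", now=20)
print(agg.flush(now=30))  # window not yet elapsed
print(agg.flush(now=60))  # one summary for the whole group
```

A scheduler or the evaluation loop would need to call `flush` periodically; urgent severities would probably bypass aggregation entirely.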
5. Fault Diagnosis System
5.1 Root Cause Analysis
```python
from typing import Dict, List
from dataclasses import dataclass


@dataclass
class DiagnosisResult:
    """Diagnosis result."""
    root_cause: str
    confidence: float
    evidence: List[str]
    suggestions: List[str]


class FaultDiagnoser:
    """Fault diagnoser."""

    def __init__(self):
        self.knowledge_base = self._load_knowledge()
        self.causality_graph = self._build_causality_graph()

    def _load_knowledge(self) -> Dict:
        """Load the knowledge base."""
        return {
            "cpu_high": {
                "causes": ["process_runaway", "insufficient_resources", "traffic_spike"],
                "symptoms": ["high_load", "slow_response"],
                "solutions": ["kill_process", "scale_out", "optimize_code"]
            },
            "memory_high": {
                "causes": ["memory_leak", "large_cache", "insufficient_memory"],
                "symptoms": ["oom", "slow_gc"],
                "solutions": ["restart_service", "increase_memory", "fix_leak"]
            },
            "disk_full": {
                "causes": ["log_bloat", "large_files", "insufficient_disk"],
                "symptoms": ["write_failed", "service_down"],
                "solutions": ["clean_logs", "delete_files", "expand_disk"]
            },
            "network_error": {
                "causes": ["dns_failure", "firewall_block", "network_down"],
                "symptoms": ["connection_timeout", "connection_refused"],
                "solutions": ["check_dns", "check_firewall", "check_network"]
            }
        }

    def _build_causality_graph(self) -> Dict:
        """Build the causality graph (cause -> effects)."""
        return {
            "traffic_spike": ["cpu_high", "memory_high", "network_congestion"],
            "memory_leak": ["memory_high", "oom_killer", "service_crash"],
            "disk_full": ["write_failed", "service_down"],
            "dns_failure": ["connection_timeout", "service_unavailable"],
            "process_runaway": ["cpu_high", "system_hang"]
        }

    def diagnose(self, symptoms: List[str], context: Dict) -> DiagnosisResult:
        """
        Diagnose a fault.

        Args:
            symptoms: list of observed symptoms
            context: contextual information

        Returns:
            The diagnosis result.
        """
        # Match symptoms against the knowledge base
        matched_issues = []
        for issue, info in self.knowledge_base.items():
            overlap = set(symptoms) & set(info["symptoms"])
            if overlap:
                matched_issues.append({
                    "issue": issue,
                    "causes": info["causes"],
                    "solutions": info["solutions"],
                    # Share of the issue's known symptoms that were observed
                    "match_score": len(overlap) / len(info["symptoms"]),
                })
        if not matched_issues:
            return DiagnosisResult(
                root_cause="unknown",
                confidence=0,
                evidence=[],
                suggestions=["Provide more information to help with the diagnosis"],
            )
        # Sort and pick the most likely issue
        matched_issues.sort(key=lambda x: x["match_score"], reverse=True)
        top_match = matched_issues[0]
        # Determine the root cause
        root_cause = self._determine_root_cause(top_match, context)
        # Collect evidence
        evidence = self._collect_evidence(top_match, symptoms, context)
        # Generate suggestions
        suggestions = top_match["solutions"]
        return DiagnosisResult(
            root_cause=root_cause,
            confidence=top_match["match_score"],
            evidence=evidence,
            suggestions=suggestions,
        )

    def _determine_root_cause(self, match: Dict, context: Dict) -> str:
        """Determine the root cause."""
        causes = match["causes"]
        # A fuller version would rank causes using the context;
        # this simplified version returns the first one.
        return causes[0] if causes else match["issue"]

    def _collect_evidence(self, match: Dict, symptoms: List[str],
                          context: Dict) -> List[str]:
        """Collect evidence."""
        evidence = []
        evidence.append(f"Observed symptoms: {', '.join(symptoms)}")
        evidence.append(f"Matched issue: {match['issue']}")
        evidence.append(f"Possible causes: {', '.join(match['causes'])}")
        # Add evidence from the context
        if "cpu_usage" in context:
            evidence.append(f"CPU usage: {context['cpu_usage']}%")
        if "memory_usage" in context:
            evidence.append(f"Memory usage: {context['memory_usage']}%")
        return evidence


# Usage example
diagnoser = FaultDiagnoser()

# Diagnose a fault
result = diagnoser.diagnose(
    symptoms=["high_load", "slow_response"],
    context={
        "cpu_usage": 95,
        "memory_usage": 60,
    },
)
print(f"Root cause: {result.root_cause}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Evidence: {result.evidence}")
print(f"Suggestions: {result.suggestions}")
```
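The `causality_graph` built in `FaultDiagnoser.__init__` is not consulted by `diagnose`, which always returns the first listed cause. As a rough sketch of how the graph could inform the choice, the function below scores each candidate cause by the share of its known effects that are currently observed. `rank_root_causes` is a hypothetical helper, and this coverage score is a simple heuristic assumed for illustration rather than a full inference method.

```python
from typing import Dict, List, Set, Tuple


def rank_root_causes(
    graph: Dict[str, List[str]], observed: Set[str]
) -> List[Tuple[str, float]]:
    """Rank causes by the fraction of their effects that were observed."""
    ranked = []
    for cause, effects in graph.items():
        if not effects:
            continue
        hits = observed & set(effects)
        if hits:
            ranked.append((cause, len(hits) / len(effects)))
    # Highest coverage first; ties are broken alphabetically for stable output.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Same cause -> effects mapping as _build_causality_graph in this section.
causality_graph = {
    "traffic_spike": ["cpu_high", "memory_high", "network_congestion"],
    "memory_leak": ["memory_high", "oom_killer", "service_crash"],
    "disk_full": ["write_failed", "service_down"],
    "dns_failure": ["connection_timeout", "service_unavailable"],
    "process_runaway": ["cpu_high", "system_hang"],
}

for cause, score in rank_root_causes(causality_graph, {"cpu_high", "memory_high"}):
    print(f"{cause}: {score:.2f}")
```

With both `cpu_high` and `memory_high` active, a traffic spike explains more of what is observed than a runaway process does, which is the kind of distinction the first-cause shortcut cannot make.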
6. Automated Remediation System
6.1 Automated Remediation Executor
```python
from typing import Callable, Dict, List, Optional
from dataclasses import dataclass
import subprocess
import time


@dataclass
class RemediationAction:
    """A remediation action."""
    name: str
    description: str
    executor: Callable
    params: Dict
    rollback: Optional[Callable] = None


class AutoRemediation:
    """Automated remediation system."""

    def __init__(self):
        self.actions: Dict[str, RemediationAction] = {}
        self.history: List[Dict] = []
        self._register_default_actions()

    def _register_default_actions(self):
        """Register the default remediation actions."""
        self.register_action(RemediationAction(
            name="restart_service",
            description="Restart a service",
            executor=self._restart_service,
            params={"service_name": None},
            rollback=self._rollback_restart,
        ))
        self.register_action(RemediationAction(
            name="clean_logs",
            description="Clean up log files",
            executor=self._clean_logs,
            params={"log_dir": "/var/log", "days": 7},
        ))
        self.register_action(RemediationAction(
            name="kill_process",
            description="Terminate a process",
            executor=self._kill_process,
            params={"process_name": None},
        ))
        self.register_action(RemediationAction(
            name="scale_out",
            description="Scale out instances",
            executor=self._scale_out,
            params={"service": None, "count": 1},
        ))

    def register_action(self, action: RemediationAction):
        """Register a remediation action."""
        self.actions[action.name] = action

    def execute(self, action_name: str, params: Optional[Dict] = None) -> Dict:
        """
        Execute a remediation action.

        Args:
            action_name: action name
            params: parameters that override the action's defaults

        Returns:
            The execution result.
        """
        action = self.actions.get(action_name)
        if not action:
            return {"success": False, "error": "action does not exist"}
        # Merge the parameters
        final_params = {**action.params, **(params or {})}
        # Execute
        try:
            result = action.executor(final_params)
            # Record the history
            self.history.append({
                "action": action_name,
                "params": final_params,
                "result": result,
                "timestamp": time.time(),
            })
            return result
        except Exception as e:
            return {"success": False, "error": str(e)}

    def _restart_service(self, params: Dict) -> Dict:
        """Restart a service with systemctl."""
        service_name = params.get("service_name")
        if not service_name:
            return {"success": False, "error": "missing service name"}
        try:
            result = subprocess.run(
                ["systemctl", "restart", service_name],
                capture_output=True,
                text=True,
                timeout=30,
            )
            if result.returncode == 0:
                return {"success": True, "message": f"Service {service_name} restarted"}
            return {"success": False, "error": result.stderr}
        except subprocess.TimeoutExpired:
            return {"success": False, "error": "restart timed out"}

    def _clean_logs(self, params: Dict) -> Dict:
        """Delete *.log files older than the given number of days."""
        log_dir = params.get("log_dir", "/var/log")
        days = params.get("days", 7)
        try:
            # Find and delete old logs
            result = subprocess.run(
                ["find", log_dir, "-name", "*.log", "-mtime", f"+{days}", "-delete"],
                capture_output=True,
                text=True,
            )
            if result.returncode != 0:
                return {"success": False, "error": result.stderr}
            return {"success": True,
                    "message": f"Cleaned logs older than {days} days in {log_dir}"}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def _kill_process(self, params: Dict) -> Dict:
        """Terminate processes whose command line matches the name."""
        process_name = params.get("process_name")
        if not process_name:
            return {"success": False, "error": "missing process name"}
        try:
            # pkill -f matches against the full command line
            result = subprocess.run(
                ["pkill", "-f", process_name],
                capture_output=True,
                text=True,
            )
            if result.returncode != 0:
                # pkill exits with 1 when no process matched
                return {"success": False,
                        "error": result.stderr or f"no process matched {process_name}"}
            return {"success": True, "message": f"Terminated process {process_name}"}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def _scale_out(self, params: Dict) -> Dict:
        """Scale out (simplified: no real scaling is performed)."""
        service = params.get("service")
        count = params.get("count", 1)
        return {"success": True, "message": f"Added {count} instance(s) to {service}"}

    def _rollback_restart(self, params: Dict):
        """Roll back a restart (not implemented in this example)."""
        pass

    def get_history(self, limit: int = 100) -> List[Dict]:
        """Return the most recent execution history."""
        return self.history[-limit:]


# Usage example
# Note: these calls run real commands (systemctl, find -delete) and
# normally require root privileges.
remediation = AutoRemediation()

# Execute a remediation
result = remediation.execute("restart_service", {"service_name": "nginx"})
print(result)

# Clean up logs
result = remediation.execute("clean_logs", {"log_dir": "/var/log/nginx", "days": 3})
print(result)
```
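`RemediationAction` carries a `rollback` callable, but `execute` never invokes it. The sketch below shows one way an executor and its rollback could be tied together, with a dry-run switch so an action can be previewed without touching the system. `run_with_rollback` and the fake actions are hypothetical names used only for this illustration; no real commands are run.

```python
from typing import Callable, Dict, Optional


def run_with_rollback(
    executor: Callable[[Dict], Dict],
    rollback: Optional[Callable[[Dict], None]],
    params: Dict,
    dry_run: bool = False,
) -> Dict:
    """Run an action; on failure, call its rollback if one is defined."""
    if dry_run:
        # Report what would run without executing anything.
        return {"success": True, "dry_run": True, "params": params}
    try:
        result = executor(params)
    except Exception as exc:
        result = {"success": False, "error": str(exc)}
    if not result.get("success") and rollback is not None:
        try:
            rollback(params)
            result["rolled_back"] = True
        except Exception as exc:
            result["rolled_back"] = False
            result["rollback_error"] = str(exc)
    return result


# Fake actions standing in for real executors.
def failing_restart(params: Dict) -> Dict:
    raise RuntimeError("service failed to start")


def restore_previous(params: Dict) -> None:
    print(f"rolling back {params['service_name']}")


print(run_with_rollback(failing_restart, restore_previous, {"service_name": "nginx"}))
print(run_with_rollback(failing_restart, restore_previous,
                        {"service_name": "nginx"}, dry_run=True))
```

Recording whether the rollback itself succeeded matters for auditing, since a failed rollback usually needs a human to step in.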
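7. Best Practices
7.1 System Design Principles

| Principle | Description | Practice |
| --- | --- | --- |
| Automation first | Reduce manual intervention | Automatic detection + automatic remediation |
| Observability | Comprehensive monitoring | Metrics + logs + traces |
| Fault tolerance | High availability | Redundancy + graceful degradation |
| Security and control | Auditable operations | Access control + audit logs |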
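7.2 Common Problems

| Problem | Cause | Solution |
| --- | --- | --- |
| Too many false positives | Poorly chosen thresholds | Dynamic thresholds + noise reduction |
| Remediation fails | Insufficient permissions | Check execution permissions |
| Slow response | Overly complex workflow | Streamline the workflow |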
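The dynamic thresholds suggested for reducing false positives can take many forms. A minimal sketch of one common approach follows: a value is flagged when it exceeds the rolling mean of recent samples by more than `k` standard deviations. The class name `DynamicThreshold` and the defaults (`window=30`, `k=3.0`, `min_samples=5`, `min_sigma=1.0`) are assumptions for illustration and would need tuning against real data.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Flag values far above the rolling mean of recent samples."""

    def __init__(self, window: int = 30, k: float = 3.0,
                 min_samples: int = 5, min_sigma: float = 1.0):
        self.values = deque(maxlen=window)  # most recent samples only
        self.k = k
        self.min_samples = min_samples
        self.min_sigma = min_sigma          # floor so flat history is not over-sensitive

    def is_anomalous(self, value: float) -> bool:
        """Check a value against the current history, then record it."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = max(pstdev(self.values), self.min_sigma)
            anomalous = value > mu + self.k * sigma
        self.values.append(value)
        return anomalous


detector = DynamicThreshold()
for cpu in [40, 42, 41, 39, 43, 40, 41, 95, 42]:
    print(cpu, detector.is_anomalous(cpu))
```

This version records anomalous values in the history too, which lets the threshold adapt to a sustained shift; excluding them instead would keep the baseline stricter.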
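8. Summary
8.1 Key Takeaways
Through a complete automated operations case study, this article has shown how OpenClaw applies to operations scenarios:

| Module | Core Function | Key Techniques |
| --- | --- | --- |
| Monitoring collection | Multi-source data collection | Host + application monitoring |
| Log analysis | Log parsing and analysis | Pattern matching + anomaly detection |
| Alert management | Intelligent alerting | Rule engine + noise reduction |
| Fault diagnosis | Root cause analysis | Knowledge base + inference |
| Automated remediation | Self-healing | Action execution + rollback |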
8.2 Next Steps
- Part 75: OpenClaw in Practice: Data Analysis Platform
References