OpenClaw in Practice: Building an Automated Operations System

Contents

    • Abstract
    • 1. Introduction - Intelligent Operations Overview
      • 1.1 The Need for Operations Automation
      • 1.2 System Architecture Design
      • 1.3 Core Feature Planning
    • 2. Monitoring Data Collection
      • 2.1 Host Monitoring Collector
      • 2.2 Application Monitoring Collector
    • 3. Log Analysis System
      • 3.1 Log Collection and Parsing
    • 4. Alert Management System
      • 4.1 Alert Rule Engine
      • 4.2 Alert Noise Reduction
    • 5. Fault Diagnosis System
      • 5.1 Root Cause Analysis
    • 6. Automated Remediation System
      • 6.1 Automated Remediation Executor
    • 7. Best Practices
      • 7.1 System Design Principles
      • 7.2 Common Problems
    • 8. Summary
      • 8.1 Key Takeaways
      • 8.2 Next Steps
    • References

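Abstract

This article walks through a complete automated operations system to show how OpenClaw can be used to build an intelligent operations platform. It covers the core functions of monitoring and alerting, log analysis, fault diagnosis, and automated remediation. The system design is paired with code throughout, so readers can follow the full construction of an intelligent operations system and apply OpenClaw to operations automation.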


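1. Introduction - Intelligent Operations Overview

1.1 The Need for Operations Automation

Operating modern IT systems poses challenges that traditional, manual operations struggle to meet:

| Challenge | Traditional operations | OpenClaw intelligent operations |
| --- | --- | --- |
| Alert storms | Manual filtering and handling | Intelligent aggregation and noise reduction |
| Fault localization | Layer-by-layer troubleshooting | Root cause analysis |
| Remediation efficiency | Manual execution | Automated remediation |
| Knowledge transfer | Depends on individual experience | Captured in a knowledge base |
| 24/7 response | Shift-based on-call | Round-the-clock AI |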

1.2 System Architecture Design

[Architecture diagram: four layers]

    • Data collection layer: host monitoring, application monitoring, log collection, network monitoring
    • Data processing layer: data cleaning, metric aggregation, log parsing, anomaly detection
    • Intelligent analysis layer: alert analysis, root cause localization, trend prediction, capacity planning
    • Automated execution layer: alert notification, automated remediation, ticket creation, report generation

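1.3 Core Feature Planning

| Module | Core capability | Implementation |
| --- | --- | --- |
| Monitoring collection | Multi-source data collection | Prometheus + custom collectors |
| Log analysis | Log parsing and analysis | ELK + AI analysis |
| Alert management | Intelligent alert noise reduction | Rule engine + ML |
| Fault diagnosis | Root cause analysis | Knowledge graph + inference |
| Automated remediation | Self-healing | Ansible + scripts |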

2. Monitoring Data Collection

2.1 Host Monitoring Collector

python
from dataclasses import dataclass
from typing import Dict, List, Optional
import time
import threading
import json

@dataclass
class Metric:
    """A single monitoring metric sample."""
    name: str
    value: float
    timestamp: float
    tags: Dict[str, str]
    unit: str = ""

class HostMonitor:
    """Host monitoring collector."""
    
    def __init__(self, host: str, interval: int = 60):
        self.host = host
        self.interval = interval  # seconds between collection rounds
        self.running = False
        self.metrics: List[Metric] = []
        self.collectors = {
            "cpu": self._collect_cpu,
            "memory": self._collect_memory,
            "disk": self._collect_disk,
            "network": self._collect_network
        }
    
    def start(self):
        """Start collecting in a background daemon thread."""
        self.running = True
        thread = threading.Thread(target=self._collect_loop, daemon=True)
        thread.start()
    
    def stop(self):
        """Stop collecting; the loop exits after its current sleep."""
        self.running = False
    
    def _collect_loop(self):
        """Run every collector once per interval until stopped."""
        while self.running:
            for collector_name, collector_func in self.collectors.items():
                try:
                    metrics = collector_func()
                    self.metrics.extend(metrics)
                except Exception as e:
                    print(f"Failed to collect {collector_name}: {e}")
            
            time.sleep(self.interval)
    
    def _collect_cpu(self) -> List[Metric]:
        """Collect CPU metrics."""
        # A real collector would read these via psutil or over SSH;
        # this simplified version returns fixed sample values.
        
        return [
            Metric(
                name="cpu.usage",
                value=45.5,
                timestamp=time.time(),
                tags={"host": self.host},
                unit="%"
            ),
            Metric(
                name="cpu.load1",
                value=2.5,
                timestamp=time.time(),
                tags={"host": self.host}
            ),
            Metric(
                name="cpu.load5",
                value=3.2,
                timestamp=time.time(),
                tags={"host": self.host}
            ),
            Metric(
                name="cpu.load15",
                value=2.8,
                timestamp=time.time(),
                tags={"host": self.host}
            )
        ]
    
    def _collect_memory(self) -> List[Metric]:
        """Collect memory metrics (fixed sample values)."""
        return [
            Metric(
                name="memory.usage",
                value=75.3,
                timestamp=time.time(),
                tags={"host": self.host},
                unit="%"
            ),
            Metric(
                name="memory.used",
                value=12.5 * 1024 * 1024 * 1024,  # 12.5 GB
                timestamp=time.time(),
                tags={"host": self.host},
                unit="bytes"
            ),
            Metric(
                name="memory.total",
                value=16 * 1024 * 1024 * 1024,  # 16 GB
                timestamp=time.time(),
                tags={"host": self.host},
                unit="bytes"
            )
        ]
    
    def _collect_disk(self) -> List[Metric]:
        """Collect disk metrics (fixed sample values)."""
        metrics = []
        
        # Root partition
        metrics.extend([
            Metric(
                name="disk.usage",
                value=68.5,
                timestamp=time.time(),
                tags={"host": self.host, "mount": "/"},
                unit="%"
            ),
            Metric(
                name="disk.iops.read",
                value=150,
                timestamp=time.time(),
                tags={"host": self.host, "mount": "/"},
                unit="ops/s"
            ),
            Metric(
                name="disk.iops.write",
                value=80,
                timestamp=time.time(),
                tags={"host": self.host, "mount": "/"},
                unit="ops/s"
            )
        ])
        
        return metrics
    
    def _collect_network(self) -> List[Metric]:
        """Collect network metrics (fixed sample values)."""
        return [
            Metric(
                name="network.bytes.in",
                value=1024 * 1024 * 50,  # 50 MB/s
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="bytes/s"
            ),
            Metric(
                name="network.bytes.out",
                value=1024 * 1024 * 30,  # 30 MB/s
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="bytes/s"
            ),
            Metric(
                name="network.packets.in",
                value=50000,
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="packets/s"
            ),
            Metric(
                name="network.packets.out",
                value=30000,
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="packets/s"
            )
        ]
    
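    def get_metrics(self, name: Optional[str] = None, since: Optional[float] = None) -> List[Metric]:
        """Return collected metrics, optionally filtered by name and minimum timestamp."""
        # Copy so callers never iterate the list the collector thread is extending
        result = list(self.metrics)
        
        if name is not None:
            result = [m for m in result if m.name == name]
        
        if since is not None:
            result = [m for m in result if m.timestamp >= since]
        
        return result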

# Usage example
monitor = HostMonitor("server-01", interval=30)
monitor.start()

# Wait for a few collection rounds
time.sleep(60)

# Read back the collected metrics
metrics = monitor.get_metrics()
print(f"Collected {len(metrics)} metrics")
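
The collectors above return fixed sample values. As a minimal sketch of what real collection could look like using only the Python standard library, the helper below reads load averages and disk usage from the local machine. The function name `collect_load_and_disk` is hypothetical and not part of OpenClaw, and the sketch assumes a Unix-like host (`os.getloadavg()` is unavailable on Windows, in which case the load metrics are skipped).

```python
import os
import shutil
import time
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Metric:
    """Same shape as the Metric dataclass used in this section."""
    name: str
    value: float
    timestamp: float
    tags: Dict[str, str]
    unit: str = ""


def collect_load_and_disk(host: str, mount: str = "/") -> List[Metric]:
    """Hypothetical helper: read load averages and disk usage from the local host."""
    now = time.time()
    metrics: List[Metric] = []

    try:
        # os.getloadavg() returns the 1-, 5- and 15-minute load averages (Unix only).
        load1, load5, load15 = os.getloadavg()
        for name, value in (("cpu.load1", load1),
                            ("cpu.load5", load5),
                            ("cpu.load15", load15)):
            metrics.append(Metric(name, value, now, {"host": host}))
    except (AttributeError, OSError):
        # Load averages are unavailable on this platform; skip them.
        pass

    # shutil.disk_usage() returns total, used and free bytes for the filesystem.
    usage = shutil.disk_usage(mount)
    percent = usage.used / usage.total * 100 if usage.total else 0.0
    metrics.append(Metric("disk.usage", round(percent, 1), now,
                          {"host": host, "mount": mount}, unit="%"))
    return metrics


if __name__ == "__main__":
    for m in collect_load_and_disk("server-01"):
        print(f"{m.name} = {m.value}{m.unit}")
```

A function like this could stand in for `_collect_cpu` and `_collect_disk` when the monitor runs on the host it observes; remote hosts would still need an agent or SSH, as the original comment notes.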

2.2 Application Monitoring Collector

python
from typing import Dict, List
import time
import requests

# Metric is the dataclass defined in section 2.1.

class ApplicationMonitor:
    """Application monitoring collector."""
    
    def __init__(self):
        self.apps: Dict[str, dict] = {}
    
    def register_app(self, app_name: str, endpoints: Dict):
        """
        Register an application.
        
        Args:
            app_name: Application name.
            endpoints: Endpoint configuration, for example
                {
                    "health": "http://app:8080/health",
                    "metrics": "http://app:8080/metrics",
                    "ready": "http://app:8080/ready"
                }
        """
        self.apps[app_name] = {
            "name": app_name,
            "endpoints": endpoints,
            "status": "unknown"
        }
    
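    def check_health(self, app_name: str) -> Dict:
        """Check an application's health endpoint."""
        app = self.apps.get(app_name)
        if not app:
            return {"status": "unknown", "error": "application not registered"}
        
        health_url = app["endpoints"].get("health")
        if not health_url:
            return {"status": "unknown", "error": "no health endpoint configured"}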
        
        try:
            start_time = time.time()
            response = requests.get(health_url, timeout=5)
            elapsed = time.time() - start_time
            
            if response.status_code == 200:
                app["status"] = "healthy"
                return {
                    "status": "healthy",
                    "response_time": elapsed,
                    "details": response.json() if response.headers.get("content-type", "").startswith("application/json") else response.text
                }
            else:
                app["status"] = "unhealthy"
                return {
                    "status": "unhealthy",
                    "status_code": response.status_code
                }
        
        except requests.exceptions.Timeout:
            app["status"] = "timeout"
            return {"status": "timeout", "error": "request timed out"}
        
        except Exception as e:
            app["status"] = "error"
            return {"status": "error", "error": str(e)}
    
    def collect_metrics(self, app_name: str) -> List[Metric]:
        """Collect an application's metrics from its metrics endpoint."""
        app = self.apps.get(app_name)
        if not app:
            return []
        
        metrics_url = app["endpoints"].get("metrics")
        if not metrics_url:
            return []
        
        try:
            response = requests.get(metrics_url, timeout=10)
            
            if response.status_code == 200:
                # Parse the Prometheus text-format response
                return self._parse_prometheus_metrics(response.text, app_name)
            
        except Exception as e:
            print(f"Failed to collect metrics for {app_name}: {e}")
        
        return []
    
    def _parse_prometheus_metrics(self, text: str, app_name: str) -> List[Metric]:
        """Parse Prometheus text-format metrics (simplified)."""
        metrics = []
        
        for line in text.split('\n'):
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            
            try:
                tags = {"app": app_name}
                # Simplified parsing: metric_name{labels} value [timestamp]
                if '{' in line:
                    name, rest = line.split('{', 1)
                    # rpartition tolerates '}' inside a quoted label value
                    labels_str, _, value_part = rest.rpartition('}')
                    
                    # Parse labels; commas inside quoted values are not handled
                    for label in labels_str.split(','):
                        if '=' in label:
                            k, v = label.split('=', 1)
                            tags[k.strip()] = v.strip().strip('"')
                else:
                    name, value_part = line.split(None, 1)
                
                # The first field after the name is the value; an optional
                # trailing timestamp is ignored.
                value = float(value_part.split()[0])
                
                metrics.append(Metric(
                    name=name.strip(),
                    value=value,
                    timestamp=time.time(),
                    tags=tags
                ))
            
            except (ValueError, IndexError):
                # Skip lines that do not fit the simplified format
                continue
        
        return metrics
    
    def get_all_status(self) -> Dict:
        """Run a health check for every registered application."""
        result = {}
        
        for app_name in self.apps:
            result[app_name] = self.check_health(app_name)
        
        return result

# Usage example
app_monitor = ApplicationMonitor()

# Register applications
app_monitor.register_app("web-api", {
    "health": "http://web-api:8080/health",
    "metrics": "http://web-api:8080/metrics"
})

app_monitor.register_app("user-service", {
    "health": "http://user-service:8081/health",
    "metrics": "http://user-service:8081/metrics"
})

# Check health status
status = app_monitor.get_all_status()
for app_name, app_status in status.items():
    print(f"{app_name}: {app_status['status']}")
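
Health check results are plain dictionaries, while the alert rules in section 4 operate on metrics. One possible bridge, sketched here under stated assumptions, is to convert each health result into metrics. The helper `health_to_metrics` and the metric names `app.up` and `app.response_time` are hypothetical choices for illustration, not names defined by OpenClaw or by the original code.

```python
import time
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Metric:
    """Same shape as the Metric dataclass from section 2.1."""
    name: str
    value: float
    timestamp: float
    tags: Dict[str, str]
    unit: str = ""


def health_to_metrics(app_name: str, health: Dict) -> List[Metric]:
    """Hypothetical helper: turn a check_health() result into metrics."""
    now = time.time()
    tags = {"app": app_name}

    # 1.0 when the check reported "healthy", 0.0 for any other outcome.
    up = 1.0 if health.get("status") == "healthy" else 0.0
    metrics = [Metric("app.up", up, now, tags)]

    # Response time is only present on successful checks.
    if "response_time" in health:
        metrics.append(Metric("app.response_time", health["response_time"],
                              now, tags, unit="s"))
    return metrics


if __name__ == "__main__":
    sample = {"status": "healthy", "response_time": 0.042}
    for m in health_to_metrics("web-api", sample):
        print(m.name, m.value, m.unit)
```

With metrics in this shape, an availability rule can be written the same way as the CPU and memory conditions later in the article.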

3. Log Analysis System

3.1 Log Collection and Parsing

python
from dataclasses import dataclass
from typing import List, Dict, Optional
import re
from datetime import datetime, timedelta

@dataclass
class LogEntry:
    """A parsed log entry."""
    timestamp: datetime
    level: str
    message: str
    source: str
    metadata: Dict

class LogParser:
    """Log parser with one regular expression per log type."""
    
    def __init__(self):
        self.patterns = {
            "nginx": r'(?P<ip>[\d.]+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>[^\s]+) HTTP/[\d.]+" (?P<status>\d+) (?P<size>\d+)',
            "java": r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[(?P<thread>[^\]]+)\] (?P<level>\w+)\s+(?P<class>[^\s]+) - (?P<message>.+)',
            "python": r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (?P<level>\w+) - (?P<message>.+)',
            "syslog": r'(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<process>[^\[]+)\[(?P<pid>\d+)\]: (?P<message>.+)'
        }
    
    def parse(self, line: str, log_type: str) -> Optional[LogEntry]:
        """Parse one log line; return None if the type is unknown or the line does not match."""
        pattern = self.patterns.get(log_type)
        if not pattern:
            return None
        
        match = re.match(pattern, line)
        if not match:
            return None
        
        groups = match.groupdict()
        
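        # Parse the timestamp
        timestamp = self._parse_timestamp(groups.get("timestamp", ""))
        
        # Determine the log level (formats without a level default to INFO)
        level = groups.get("level", "INFO").upper()
        
        # Extract the message (fall back to the whole line)
        message = groups.get("message", line)
        
        # Everything else the pattern captured becomes metadata
        metadata = {k: v for k, v in groups.items() if k not in ["timestamp", "level", "message"]}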
        
        return LogEntry(
            timestamp=timestamp,
            level=level,
            message=message,
            source=log_type,
            metadata=metadata
        )
    
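    def _parse_timestamp(self, ts_str: str) -> datetime:
        """Parse a timestamp, trying each known format in turn."""
        formats = [
            "%Y-%m-%d %H:%M:%S,%f",  # Java / Python logging
            "%d/%b/%Y:%H:%M:%S %z",  # nginx access log
            "%b %d %H:%M:%S"         # syslog (no year)
        ]
        
        for fmt in formats:
            try:
                parsed = datetime.strptime(ts_str, fmt)
            except ValueError:
                continue
            if "%Y" not in fmt:
                # syslog omits the year; assume the current one
                parsed = parsed.replace(year=datetime.now().year)
            return parsed
        
        # Fall back to the current time when no format matches
        return datetime.now()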

class LogAnalyzer:
    """Log analyzer."""
    
    def __init__(self):
        self.parser = LogParser()
        self.logs: List[LogEntry] = []
        self.error_patterns = [
            r'Exception',
            r'Error',
            r'Failed',
            r'Timeout',
            r'Connection refused'
        ]
    
    def add_log(self, line: str, log_type: str):
        """Parse a log line and store it if it matches."""
        entry = self.parser.parse(line, log_type)
        if entry:
            self.logs.append(entry)
    
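    def analyze_errors(self, time_range: Optional[tuple] = None) -> List[Dict]:
        """Collect error-level entries, optionally limited to a (start, end) time range."""
        errors = []
        
        for log in self.logs:
            if log.level in ["ERROR", "FATAL", "CRITICAL"]:
                # Skip entries outside the requested time range
                if time_range:
                    if not (time_range[0] <= log.timestamp <= time_range[1]):
                        continue
                
                # Record which known error patterns the message matches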
                matched_patterns = []
                for pattern in self.error_patterns:
                    if re.search(pattern, log.message):
                        matched_patterns.append(pattern)
                
                errors.append({
                    "timestamp": log.timestamp.isoformat(),
                    "level": log.level,
                    "message": log.message,
                    "source": log.source,
                    "patterns": matched_patterns,
                    "metadata": log.metadata
                })
        
        return errors
    
    def get_statistics(self) -> Dict:
        """Return log counts by level and by source."""
        level_counts = {}
        source_counts = {}
        
        for log in self.logs:
            level_counts[log.level] = level_counts.get(log.level, 0) + 1
            source_counts[log.source] = source_counts.get(log.source, 0) + 1
        
        return {
            "total_logs": len(self.logs),
            "by_level": level_counts,
            "by_source": source_counts
        }
    
    def detect_anomalies(self) -> List[Dict]:
        """Detect anomalies in the collected logs."""
        anomalies = []
        
        # Detect a sudden rise in the error rate.
        # Simplified: count the errors seen in the last 5 minutes.
        recent_time = datetime.now() - timedelta(minutes=5)
        recent_errors = [log for log in self.logs 
                        if log.level in ["ERROR", "FATAL"] and log.timestamp >= recent_time]
        
        if len(recent_errors) > 10:  # threshold
            anomalies.append({
                "type": "error_spike",
                "count": len(recent_errors),
                "severity": "high" if len(recent_errors) > 50 else "medium"
            })
        
        return anomalies

# Usage example
analyzer = LogAnalyzer()

# Add log lines
analyzer.add_log('2026-04-20 12:00:00,123 - ERROR - Connection refused to database', 'python')
analyzer.add_log('2026-04-20 12:00:01,456 - INFO - Request processed successfully', 'python')
analyzer.add_log('2026-04-20 12:00:02,789 - ERROR - Timeout waiting for response', 'python')

# Analyze errors
errors = analyzer.analyze_errors()
print(f"Found {len(errors)} errors")

# Get statistics
stats = analyzer.get_statistics()
print(f"Log statistics: {stats}")
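
Because `detect_anomalies` compares against `datetime.now()`, it is awkward to exercise with historical log lines such as the ones above. The sketch below shows the same five-minute error-spike check with the reference time passed in as a parameter, which makes it easy to replay old logs or write tests. `LogEvent` and `detect_error_spike` are hypothetical names introduced for this illustration; the thresholds mirror those in the original code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class LogEvent:
    """Hypothetical minimal stand-in for LogEntry (timestamp and level only)."""
    timestamp: datetime
    level: str


def detect_error_spike(events: List[LogEvent], now: datetime,
                       window: timedelta = timedelta(minutes=5),
                       threshold: int = 10) -> Optional[dict]:
    """Return an anomaly record if more than `threshold` errors fall inside the window."""
    start = now - window
    recent = [e for e in events
              if e.level in ("ERROR", "FATAL") and start <= e.timestamp <= now]
    if len(recent) <= threshold:
        return None
    return {
        "type": "error_spike",
        "count": len(recent),
        "severity": "high" if len(recent) > 50 else "medium",
    }


# Replay 12 errors, one per second, ending at a fixed reference time.
now = datetime(2026, 4, 20, 12, 5, 0)
events = [LogEvent(now - timedelta(seconds=i), "ERROR") for i in range(12)]
print(detect_error_spike(events, now))
```

Passing the clock in explicitly is a design choice rather than a requirement; the original method could equally accept an optional `now` argument that defaults to the current time.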

4. Alert Management System

4.1 Alert Rule Engine

python
from dataclasses import dataclass
from typing import Dict, List, Callable, Optional
from enum import Enum
import time

# Metric is the dataclass defined in section 2.1.

class AlertSeverity(Enum):
    """Alert severity levels."""
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"

@dataclass
class Alert:
    """An alert instance."""
    id: str
    name: str
    severity: AlertSeverity
    message: str
    source: str
    timestamp: float
    labels: Dict[str, str]
    status: str = "firing"
    acknowledged: bool = False
    resolved_at: Optional[float] = None

class AlertRule:
    """Alert rule."""
    
    def __init__(self, name: str, condition: Callable, severity: AlertSeverity,
                 duration: int = 0, labels: Optional[Dict] = None):
        self.name = name
        self.condition = condition
        self.severity = severity
        self.duration = duration  # seconds the condition must hold before firing
        self.labels = labels or {}
        self.firing_since = None  # time the condition first became true

class AlertManager:
    """Alert manager."""
    
    def __init__(self):
        self.rules: Dict[str, AlertRule] = {}
        self.alerts: Dict[str, Alert] = {}
        self.handlers: List[Callable] = []
    
    def add_rule(self, rule: AlertRule):
        """Register an alert rule."""
        self.rules[rule.name] = rule
    
    def add_handler(self, handler: Callable):
        """Register a handler that is called for every new alert."""
        self.handlers.append(handler)
    
    def evaluate(self, metrics: List[Metric]):
        """Evaluate every rule against the given metrics."""
        for rule_name, rule in self.rules.items():
            try:
                is_firing = rule.condition(metrics)
                
                if is_firing:
                    if rule.firing_since is None:
                        rule.firing_since = time.time()
                    
                    # Fire only once the condition has held for the rule's duration
                    if time.time() - rule.firing_since >= rule.duration:
                        self._fire_alert(rule, metrics)
                else:
                    if rule.firing_since is not None:
                        self._resolve_alert(rule)
                    rule.firing_since = None
            
            except Exception as e:
                print(f"Failed to evaluate rule {rule_name}: {e}")
    
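    def _fire_alert(self, rule: AlertRule, metrics: List[Metric]):
        """Fire an alert for the rule unless one is already firing."""
        # The ID embeds a timestamp, so it cannot be used to detect repeats;
        # check for an existing firing alert with the same rule name instead.
        if any(a.name == rule.name and a.status == "firing"
               for a in self.alerts.values()):
            return
        
        now = time.time()
        alert = Alert(
            id=f"alert_{rule.name}_{int(now)}",
            name=rule.name,
            severity=rule.severity,
            message=self._generate_message(rule, metrics),
            source="monitor",
            timestamp=now,
            labels=rule.labels
        )
        self.alerts[alert.id] = alert
        
        # Notify every registered handler
        for handler in self.handlers:
            try:
                handler(alert)
            except Exception as e:
                print(f"Handler failed: {e}")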
    
    def _resolve_alert(self, rule: AlertRule):
        """Mark the rule's firing alerts as resolved."""
        for alert in self.alerts.values():
            if alert.name == rule.name and alert.status == "firing":
                alert.status = "resolved"
                alert.resolved_at = time.time()
    
    def _generate_message(self, rule: AlertRule, metrics: List[Metric]) -> str:
        """Build the alert message."""
        # Simplified: the message only names the rule
        return f"Alert rule {rule.name} fired"
    
    def get_active_alerts(self) -> List[Alert]:
        """Return the alerts that are still firing."""
        return [a for a in self.alerts.values() if a.status == "firing"]
    
    def acknowledge(self, alert_id: str):
        """Acknowledge an alert."""
        if alert_id in self.alerts:
            self.alerts[alert_id].acknowledged = True

# Alert rule conditions
def cpu_high_condition(metrics: List[Metric]) -> bool:
    """Average CPU usage above 80%."""
    cpu_metrics = [m for m in metrics if m.name == "cpu.usage"]
    if cpu_metrics:
        avg_cpu = sum(m.value for m in cpu_metrics) / len(cpu_metrics)
        return avg_cpu > 80
    return False

def memory_high_condition(metrics: List[Metric]) -> bool:
    """Average memory usage above 90%."""
    mem_metrics = [m for m in metrics if m.name == "memory.usage"]
    if mem_metrics:
        avg_mem = sum(m.value for m in mem_metrics) / len(mem_metrics)
        return avg_mem > 90
    return False

def disk_full_condition(metrics: List[Metric]) -> bool:
    """Any disk above 85% usage."""
    disk_metrics = [m for m in metrics if m.name == "disk.usage"]
    for m in disk_metrics:
        if m.value > 85:
            return True
    return False

# Usage example
alert_manager = AlertManager()

# Add rules
alert_manager.add_rule(AlertRule(
    name="cpu_high",
    condition=cpu_high_condition,
    severity=AlertSeverity.WARNING,
    duration=60,
    labels={"team": "ops"}
))

alert_manager.add_rule(AlertRule(
    name="memory_high",
    condition=memory_high_condition,
    severity=AlertSeverity.CRITICAL,
    duration=30,
    labels={"team": "ops"}
))

alert_manager.add_rule(AlertRule(
    name="disk_full",
    condition=disk_full_condition,
    severity=AlertSeverity.WARNING,
    labels={"team": "ops"}
))

# Add a handler
def send_notification(alert: Alert):
    """Send a notification (here, print to stdout)."""
    print(f"[{alert.severity.value.upper()}] {alert.name}: {alert.message}")

alert_manager.add_handler(send_notification)

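# Evaluate the rules
metrics = [
    Metric("cpu.usage", 85, time.time(), {"host": "server-01"}),
    Metric("memory.usage", 92, time.time(), {"host": "server-01"})
]

# The first call only starts each rule's duration timer (60 s and 30 s here);
# an alert fires once a later evaluation finds the condition still true
# after that duration has elapsed.
alert_manager.evaluate(metrics)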
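
The `duration` field means a rule fires only after its condition has stayed true for that many seconds across repeated evaluations. The following sketch isolates that timing logic with the clock passed in explicitly, so the behavior can be traced step by step. `DurationGate` is a hypothetical class written for this illustration; it mirrors the `firing_since` handling in `AlertManager.evaluate` but is not part of the original code.

```python
from typing import Optional


class DurationGate:
    """Hypothetical helper: report True only after a condition has held for `duration` seconds."""

    def __init__(self, duration: float):
        self.duration = duration
        self.firing_since: Optional[float] = None

    def update(self, condition: bool, now: float) -> bool:
        if not condition:
            # Condition cleared: reset the timer.
            self.firing_since = None
            return False
        if self.firing_since is None:
            # First evaluation where the condition is true: start the timer.
            self.firing_since = now
        return now - self.firing_since >= self.duration


gate = DurationGate(duration=60)
print(gate.update(True, now=0))    # False: timer just started
print(gate.update(True, now=30))   # False: held for 30 s only
print(gate.update(True, now=60))   # True: held for the full 60 s
print(gate.update(False, now=70))  # False: condition cleared, timer reset
print(gate.update(True, now=80))   # False: timer restarted
```

This is why a single `evaluate` call with a non-zero duration produces no alert: the rule needs at least one more evaluation after the duration has elapsed.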

4.2 Alert Noise Reduction

python
from typing import Dict, List
from collections import defaultdict
import time

# Alert and AlertSeverity are defined in section 4.1.

class AlertDeduplicator:
    """Alert deduplicator."""
    
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.seen_alerts: Dict[str, float] = {}
    
    def should_send(self, alert: Alert) -> bool:
        """Return True unless the same alert was already sent within the window."""
        # Build the alert fingerprint
        fingerprint = self._generate_fingerprint(alert)
        
        now = time.time()
        
        # Suppress the alert if it was already sent within the window
        if fingerprint in self.seen_alerts:
            if now - self.seen_alerts[fingerprint] < self.window:
                return False
        
        # Record the send time
        self.seen_alerts[fingerprint] = now
        
        # Drop expired records
        self._cleanup(now)
        
        return True
    
    def _generate_fingerprint(self, alert: Alert) -> str:
        """Build a fingerprint from the alert name, source, and host label."""
        return f"{alert.name}:{alert.source}:{alert.labels.get('host', '')}"
    
    def _cleanup(self, now: float):
        """Drop records older than twice the window."""
        expired = [k for k, v in self.seen_alerts.items() if now - v > self.window * 2]
        for k in expired:
            del self.seen_alerts[k]

class AlertAggregator:
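    """Alert aggregator."""
    
    def __init__(self, window_seconds: int = 60):
        # Stored for a time-based flush; this simplified version only
        # signals a send on the count threshold in add().
        self.window = window_seconds
        self.pending: Dict[str, List[Alert]] = defaultdict(list)
    
    def add(self, alert: Alert) -> bool:
        """
        Add an alert to its group.
        
        Returns:
            True if the group should be sent immediately.
        """
        # Group by rule name
        group_key = alert.name
        
        self.pending[group_key].append(alert)
        
        # Signal a send once the group reaches the aggregation threshold
        return len(self.pending[group_key]) >= 5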
    
    def get_aggregated(self) -> Dict[str, List[Alert]]:
        """Return all pending groups and clear them."""
        result = dict(self.pending)
        self.pending.clear()
        return result

# Usage example
deduplicator = AlertDeduplicator(window_seconds=300)
aggregator = AlertAggregator(window_seconds=60)

# Process an alert
alert1 = Alert(
    id="alert_001",
    name="cpu_high",
    severity=AlertSeverity.WARNING,
    message="CPU usage is too high",
    source="monitor",
    timestamp=time.time(),
    labels={"host": "server-01"}
)

if deduplicator.should_send(alert1):
    print("Alert sent")
else:
    print("Alert suppressed as a duplicate")
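
The aggregator above accepts a `window_seconds` argument but only releases a group when it reaches five alerts. As one possible way to also honor the time window, the sketch below flushes a group either when it is full or when its oldest alert is older than the window. `WindowedAggregator`, `flush_due`, and `max_group_size` are hypothetical names for this illustration, and messages are plain strings to keep the example self-contained.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class WindowedAggregator:
    """Hypothetical variant: flush a group on size or when its oldest alert exceeds the window."""

    def __init__(self, window_seconds: float = 60, max_group_size: int = 5):
        self.window = window_seconds
        self.max_group_size = max_group_size
        # group key -> list of (arrival time, message)
        self.pending: Dict[str, List[Tuple[float, str]]] = defaultdict(list)

    def add(self, group_key: str, message: str, now: float) -> None:
        self.pending[group_key].append((now, message))

    def flush_due(self, now: float) -> Dict[str, List[str]]:
        """Remove and return the groups that are full or older than the window."""
        due = {}
        for key in list(self.pending):
            items = self.pending[key]
            oldest = items[0][0]
            if len(items) >= self.max_group_size or now - oldest >= self.window:
                due[key] = [msg for _, msg in items]
                del self.pending[key]
        return due


agg = WindowedAggregator(window_seconds=60)
agg.add("cpu_high", "server-01 CPU high", now=0)
agg.add("cpu_high", "server-02 CPU high", now=10)
agg.add("disk_full", "server-03 disk full", now=50)
print(agg.flush_due(now=30))  # {}
print(agg.flush_due(now=60))  # {'cpu_high': ['server-01 CPU high', 'server-02 CPU high']}
```

A periodic task would call `flush_due` and send one combined notification per returned group, so a burst of related alerts reaches the on-call engineer as a single message.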

5. Fault Diagnosis System

5.1 Root Cause Analysis

python
from typing import Dict, List, Set, Optional
from dataclasses import dataclass

@dataclass
class DiagnosisResult:
    """Diagnosis result."""
    root_cause: str
    confidence: float
    evidence: List[str]
    suggestions: List[str]

class FaultDiagnoser:
    """Fault diagnoser."""
    
    def __init__(self):
        self.knowledge_base = self._load_knowledge()
        self.causality_graph = self._build_causality_graph()
    
    def _load_knowledge(self) -> Dict:
        """Load the knowledge base: causes, symptoms, and solutions per issue."""
        return {
            "cpu_high": {
                "causes": ["process_runaway", "insufficient_resources", "traffic_spike"],
                "symptoms": ["high_load", "slow_response"],
                "solutions": ["kill_process", "scale_out", "optimize_code"]
            },
            "memory_high": {
                "causes": ["memory_leak", "large_cache", "insufficient_memory"],
                "symptoms": ["oom", "slow_gc"],
                "solutions": ["restart_service", "increase_memory", "fix_leak"]
            },
            "disk_full": {
                "causes": ["log_bloat", "large_files", "insufficient_disk"],
                "symptoms": ["write_failed", "service_down"],
                "solutions": ["clean_logs", "delete_files", "expand_disk"]
            },
            "network_error": {
                "causes": ["dns_failure", "firewall_block", "network_down"],
                "symptoms": ["connection_timeout", "connection_refused"],
                "solutions": ["check_dns", "check_firewall", "check_network"]
            }
        }
    
    def _build_causality_graph(self) -> Dict:
        """Build the causality graph: each cause maps to the effects it can produce."""
        return {
            "traffic_spike": ["cpu_high", "memory_high", "network_congestion"],
            "memory_leak": ["memory_high", "oom_killer", "service_crash"],
            "disk_full": ["write_failed", "service_down"],
            "dns_failure": ["connection_timeout", "service_unavailable"],
            "process_runaway": ["cpu_high", "system_hang"]
        }
    
    def diagnose(self, symptoms: List[str], context: Dict) -> DiagnosisResult:
        """
        Diagnose a fault.
        
        Args:
            symptoms: Observed symptoms.
            context: Context information, such as current metric values.
        
        Returns:
            The diagnosis result.
        """
        # Match the symptoms against the knowledge base
        matched_issues = []
        
        for issue, info in self.knowledge_base.items():
            # Symptoms shared between the observation and this issue
            overlap = set(symptoms) & set(info["symptoms"])
            
            if overlap:
                matched_issues.append({
                    "issue": issue,
                    "causes": info["causes"],
                    "solutions": info["solutions"],
                    "match_score": len(overlap) / len(info["symptoms"])
                })
        if not matched_issues:
            return DiagnosisResult(
                root_cause="unknown",
                confidence=0,
                evidence=[],
                suggestions=["请提供更多信息以帮助诊断"]
            )
        
        # 排序并选择最可能的原因
        matched_issues.sort(key=lambda x: x["match_score"], reverse=True)
        top_match = matched_issues[0]
        
        # 确定根因
        root_cause = self._determine_root_cause(top_match, context)
        
        # 收集证据
        evidence = self._collect_evidence(top_match, symptoms, context)
        
        # 生成建议
        suggestions = top_match["solutions"]
        
        return DiagnosisResult(
            root_cause=root_cause,
            confidence=top_match["match_score"],
            evidence=evidence,
            suggestions=suggestions
        )
    
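    def _determine_root_cause(self, match: Dict, context: Dict) -> str:
        """Determine the root cause."""
        causes = match["causes"]
        
        # A fuller version would use the context to pick the most likely cause.
        # Simplified implementation: return the first listed cause (context is unused).
        return causes[0] if causes else match["issue"]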
    
    def _collect_evidence(self, match: Dict, symptoms: List[str], context: Dict) -> List[str]:
        """Collect evidence."""
        evidence = []
        
        evidence.append(f"Observed symptoms: {', '.join(symptoms)}")
        evidence.append(f"Matched issue: {match['issue']}")
        evidence.append(f"Possible causes: {', '.join(match['causes'])}")
        
        # Add evidence from the context
        if "cpu_usage" in context:
            evidence.append(f"CPU usage: {context['cpu_usage']}%")
        
        if "memory_usage" in context:
            evidence.append(f"Memory usage: {context['memory_usage']}%")
        
        return evidence

# Usage example
diagnoser = FaultDiagnoser()

# Diagnose a fault
result = diagnoser.diagnose(
    symptoms=["high_load", "slow_response"],
    context={
        "cpu_usage": 95,
        "memory_usage": 60
    }
)

print(f"Root cause: {result.root_cause}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Evidence: {result.evidence}")
print(f"Suggestions: {result.suggestions}")
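
The causality graph built by `_build_causality_graph` is not consulted by `diagnose`. One way it might be used, sketched here on the assumption that the graph maps each cause to its direct effects, is to walk it breadth-first and rank causes by how many observed effects they explain. `CAUSALITY_GRAPH`, `downstream_effects`, and `rank_causes` are illustrative names, not part of the original class. In the graph as given, no effect is itself a key, so the walk stops after one hop.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Standalone copy of the graph returned by _build_causality_graph
CAUSALITY_GRAPH: Dict[str, List[str]] = {
    "traffic_spike": ["cpu_high", "memory_high", "network_congestion"],
    "memory_leak": ["memory_high", "oom_killer", "service_crash"],
    "disk_full": ["write_failed", "service_down"],
    "dns_failure": ["connection_timeout", "service_unavailable"],
    "process_runaway": ["cpu_high", "system_hang"],
}


def downstream_effects(graph: Dict[str, List[str]], cause: str) -> Set[str]:
    """Return every effect reachable from `cause` (hypothetical helper)."""
    seen: Set[str] = set()
    queue = deque([cause])
    while queue:
        node = queue.popleft()
        for effect in graph.get(node, []):
            if effect not in seen:
                seen.add(effect)
                queue.append(effect)
    return seen


def rank_causes(graph: Dict[str, List[str]], observed: List[str]) -> List[Tuple[str, int]]:
    """Rank candidate causes by the number of observed effects they explain."""
    observed_set = set(observed)
    scores = []
    for cause in graph:
        explained = downstream_effects(graph, cause) & observed_set
        if explained:
            scores.append((cause, len(explained)))
    # Stable sort: ties keep the graph's insertion order
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores


ranking = rank_causes(CAUSALITY_GRAPH, ["cpu_high", "memory_high"])
print(ranking)
```

Here `traffic_spike` ranks first because it is the only cause that explains both observed effects. A ranking like this could feed `_determine_root_cause` in place of returning the first listed cause.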

6. Automated Remediation System

6.1 Automated Remediation Executor

from typing import Dict, List, Callable, Optional
from dataclasses import dataclass
import subprocess
import time  # needed for the history timestamps in AutoRemediation.execute

@dataclass
class RemediationAction:
    """A remediation action."""
    name: str
    description: str
    executor: Callable
    params: Dict
    rollback: Optional[Callable] = None

class AutoRemediation:
    """Automated remediation system."""
    
    def __init__(self):
        self.actions: Dict[str, RemediationAction] = {}
        self.history: List[Dict] = []
        self._register_default_actions()
    
    def _register_default_actions(self):
        """Register the default remediation actions."""
        self.register_action(RemediationAction(
            name="restart_service",
            description="Restart a service",
            executor=self._restart_service,
            params={"service_name": None},
            rollback=self._rollback_restart
        ))
        
        self.register_action(RemediationAction(
            name="clean_logs",
            description="Clean up log files",
            executor=self._clean_logs,
            params={"log_dir": "/var/log", "days": 7}
        ))
        
        self.register_action(RemediationAction(
            name="kill_process",
            description="Terminate a process",
            executor=self._kill_process,
            params={"process_name": None}
        ))
        
        self.register_action(RemediationAction(
            name="scale_out",
            description="Scale out instances",
            executor=self._scale_out,
            params={"service": None, "count": 1}
        ))
    
    def register_action(self, action: RemediationAction):
        """Register a remediation action."""
        self.actions[action.name] = action
    
    def execute(self, action_name: str, params: Optional[Dict] = None) -> Dict:
        """
        Execute a remediation action.
        
        Args:
            action_name: Name of a registered action.
            params: Parameters that override the action's defaults.
        
        Returns:
            The execution result.
        """
        action = self.actions.get(action_name)
        if not action:
            return {"success": False, "error": f"Unknown action: {action_name}"}
        
        # Merge the action's defaults with the caller's overrides
        final_params = {**action.params, **(params or {})}
        
        # Execute
        try:
            result = action.executor(final_params)
            
            # Record the run in the history
            self.history.append({
                "action": action_name,
                "params": final_params,
                "result": result,
                "timestamp": time.time()
            })
            
            return result
        
        except Exception as e:
            return {"success": False, "error": str(e)}
    
    def _restart_service(self, params: Dict) -> Dict:
        """Restart a service."""
        service_name = params.get("service_name")
        if not service_name:
            return {"success": False, "error": "Missing service name"}
        
        # Restart with systemctl
        try:
            result = subprocess.run(
                ["systemctl", "restart", service_name],
                capture_output=True,
                text=True,
                timeout=30
            )
            
            if result.returncode == 0:
                return {"success": True, "message": f"Service {service_name} restarted"}
            else:
                return {"success": False, "error": result.stderr}
        
        except subprocess.TimeoutExpired:
            return {"success": False, "error": "Restart timed out"}
    
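    def _clean_logs(self, params: Dict) -> Dict:
        """Delete old log files."""
        log_dir = params.get("log_dir", "/var/log")
        days = params.get("days", 7)
        
        try:
            # Find and delete *.log files older than `days` days
            result = subprocess.run(
                ["find", log_dir, "-name", "*.log", "-mtime", f"+{days}", "-delete"],
                capture_output=True,
                text=True
            )
            
            # find exits non-zero when it hits an error (e.g. permission denied)
            if result.returncode != 0:
                return {"success": False, "error": result.stderr}
            
            return {"success": True, "message": f"Cleaned logs older than {days} days in {log_dir}"}
        
        except Exception as e:
            return {"success": False, "error": str(e)}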
    
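    def _kill_process(self, params: Dict) -> Dict:
        """Terminate a process."""
        process_name = params.get("process_name")
        if not process_name:
            return {"success": False, "error": "Missing process name"}
        
        try:
            # pkill -f matches the pattern against the full command line
            result = subprocess.run(
                ["pkill", "-f", process_name],
                capture_output=True,
                text=True
            )
            
            if result.returncode == 0:
                return {"success": True, "message": f"Sent termination signal to processes matching {process_name}"}
            if result.returncode == 1:
                # Exit status 1 means no process matched the pattern
                return {"success": False, "error": f"No process matched {process_name}"}
            return {"success": False, "error": result.stderr}
        
        except Exception as e:
            return {"success": False, "error": str(e)}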
    
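    def _scale_out(self, params: Dict) -> Dict:
        """Scale out instances."""
        # Simplified placeholder: reports success without calling any scaling API
        service = params.get("service")
        count = params.get("count", 1)
        
        return {"success": True, "message": f"Added {count} instance(s) for {service}"}
    
    def _rollback_restart(self, params: Dict):
        """Roll back a restart."""
        # Rollback logic is left unimplemented here
        pass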
    
    def get_history(self, limit: int = 100) -> List[Dict]:
        """Return the most recent execution history entries."""
        return self.history[-limit:]

# Usage example
remediation = AutoRemediation()

# Run a remediation
result = remediation.execute("restart_service", {"service_name": "nginx"})
print(result)

# Clean up logs
result = remediation.execute("clean_logs", {"log_dir": "/var/log/nginx", "days": 3})
print(result)
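
The diagnoser's suggestions and the registered remediation actions share some names (`kill_process`, `scale_out`), while others such as `optimize_code` have no executor. A minimal sketch of how the two might be connected, assuming a policy where only an explicit allowlist runs without human approval, follows. `plan_remediation` and the `auto_approved` set are hypothetical and do not appear in the original code.

```python
from typing import Dict, Iterable, List, Set


def plan_remediation(
    suggestions: Iterable[str],
    registered: Set[str],
    auto_approved: Set[str],
) -> Dict[str, List[str]]:
    """Split suggested fixes into automatic, approval-gated, and manual steps."""
    plan: Dict[str, List[str]] = {"auto": [], "needs_approval": [], "manual": []}
    for name in suggestions:
        if name not in registered:
            # No executor exists, so a human has to handle it
            plan["manual"].append(name)
        elif name in auto_approved:
            plan["auto"].append(name)
        else:
            plan["needs_approval"].append(name)
    return plan


# Action names registered by AutoRemediation._register_default_actions
registered_actions = {"restart_service", "clean_logs", "kill_process", "scale_out"}
# Assumed policy: only low-risk actions run unattended
auto_approved_actions = {"clean_logs", "scale_out"}

plan = plan_remediation(
    ["kill_process", "scale_out", "optimize_code"],
    registered_actions,
    auto_approved_actions,
)
print(plan)
```

Only the names under `"auto"` would then be passed to `remediation.execute`. Gating destructive actions such as `kill_process` behind approval is a design choice assumed here, not something the original system prescribes.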

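7. Best Practices

7.1 System Design Principles

| Principle | Description | Practice |
| --- | --- | --- |
| Automation first | Reduce manual intervention | Automated detection + automated remediation |
| Observability | Comprehensive monitoring | Metrics + logs + traces |
| Fault tolerance | High system availability | Redundancy + graceful degradation |
| Security and control | Operations are auditable | Access control + audit logs |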
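
The access-control and audit-log practice can be sketched as a thin wrapper around any `execute(action_name, params)` callable, such as `AutoRemediation.execute`. The class name `AuditedExecutor`, the role names, and the permission map below are assumptions made for illustration. A stand-in executor is used so the example runs without touching the system.

```python
import time
from typing import Callable, Dict, List, Set


class AuditedExecutor:
    """Wrap an execute(action_name, params) callable with a role check and an audit trail."""

    def __init__(self, execute: Callable[[str, Dict], Dict], permissions: Dict[str, Set[str]]):
        self._execute = execute
        self._permissions = permissions  # role -> action names the role may run
        self.audit_log: List[Dict] = []

    def run(self, role: str, action_name: str, params: Dict) -> Dict:
        allowed = action_name in self._permissions.get(role, set())
        if allowed:
            result = self._execute(action_name, params)
        else:
            result = {"success": False, "error": f"role '{role}' may not run {action_name}"}
        # Every attempt is recorded, whether it was allowed or denied
        self.audit_log.append({
            "timestamp": time.time(),
            "role": role,
            "action": action_name,
            "params": dict(params),
            "allowed": allowed,
            "success": bool(result.get("success")),
        })
        return result


def fake_execute(action_name: str, params: Dict) -> Dict:
    """Stand-in for AutoRemediation.execute that performs no real work."""
    return {"success": True, "message": f"ran {action_name}"}


executor = AuditedExecutor(
    fake_execute,
    {"sre": {"restart_service", "clean_logs"}, "viewer": set()},
)
ok = executor.run("sre", "restart_service", {"service_name": "nginx"})
denied = executor.run("viewer", "restart_service", {"service_name": "nginx"})
print(ok, denied, len(executor.audit_log))
```

Recording denied attempts alongside successful ones keeps the trail useful for reviewing who tried to do what. In a real deployment the log would go to durable storage instead of an in-memory list.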

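7.2 Common Problems

| Problem | Cause | Solution |
| --- | --- | --- |
| Too many false alarms | Poorly chosen thresholds | Dynamic thresholds + noise reduction |
| Remediation fails | Insufficient permissions | Check execution permissions |
| Slow response | Overly complex workflow | Streamline the workflow |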
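
A dynamic threshold, as suggested for reducing false alarms, can take several forms. The sketch below assumes one common variant: flag a value when it exceeds the rolling mean by `k` standard deviations. The class name `DynamicThreshold` and its default parameters are illustrative choices, not values taken from the original system.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Flag values that exceed the rolling mean by k standard deviations."""

    def __init__(self, window: int = 30, k: float = 3.0,
                 min_samples: int = 5, min_margin: float = 1.0):
        self.k = k
        self.min_samples = min_samples
        self.min_margin = min_margin  # floor so a flat history does not flag tiny changes
        self._values = deque(maxlen=window)

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self._values) >= self.min_samples:
            mu = mean(self._values)
            sigma = pstdev(self._values)
            margin = max(self.k * sigma, self.min_margin)
            anomalous = value > mu + margin
        # The value joins the history either way, so the baseline keeps adapting
        self._values.append(value)
        return anomalous


detector = DynamicThreshold()
cpu_samples = [50, 52, 48, 51, 49, 50, 95]
flags = [detector.is_anomalous(v) for v in cpu_samples]
print(flags)
```

The first five samples only build the baseline, so nothing is flagged until enough history exists. Compared with a fixed threshold such as 80%, this adapts to hosts whose normal load differs, at the cost of gradually treating a sustained shift as the new normal.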

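8. Summary

8.1 Key Takeaways

This article used a complete automated operations system as a case study to show how OpenClaw can be applied in operations scenarios:

| Module | Core function | Key techniques |
| --- | --- | --- |
| Monitoring collection | Multi-source data collection | Host + application monitoring |
| Log analysis | Log parsing and analysis | Pattern matching + anomaly detection |
| Alert management | Intelligent alerting | Rule engine + noise reduction |
| Fault diagnosis | Root cause analysis | Knowledge base + inference |
| Automated remediation | Self-healing | Action execution + rollback |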

8.2 Next Steps

  • Part 75: OpenClaw in Practice: Data Analytics Platform

References