OpenClaw in Practice: Building an Automated Operations System

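- Abstract
- [1. Introduction - Intelligent Operations Overview](#1. Introduction - Intelligent Operations Overview)
  - [1.1 The Need for Operations Automation](#1.1 The Need for Operations Automation)
  - [1.2 System Architecture Design](#1.2 System Architecture Design)
  - [1.3 Core Feature Planning](#1.3 Core Feature Planning)
- [2. Monitoring Data Collection](#2. Monitoring Data Collection)
  - [2.1 Host Monitoring Collector](#2.1 Host Monitoring Collector)
  - [2.2 Application Monitoring Collector](#2.2 Application Monitoring Collector)
- [3. Log Analysis System](#3. Log Analysis System)
  - [3.1 Log Collection and Parsing](#3.1 Log Collection and Parsing)
- [4. Alert Management System](#4. Alert Management System)
  - [4.1 Alert Rule Engine](#4.1 Alert Rule Engine)
  - [4.2 Alert Noise Reduction](#4.2 Alert Noise Reduction)
- [5. Fault Diagnosis System](#5. Fault Diagnosis System)
  - [5.1 Root Cause Analysis](#5.1 Root Cause Analysis)
- [6. Automated Remediation System](#6. Automated Remediation System)
  - [6.1 Automated Remediation Executor](#6.1 Automated Remediation Executor)
- [7. Best Practices](#7. Best Practices)
  - [7.1 System Design Principles](#7.1 System Design Principles)
  - [7.2 Common Issues](#7.2 Common Issues)
- [8. Summary](#8. Summary)
  - [8.1 Key Takeaways](#8.1 Key Takeaways)
  - [8.2 Next Steps](#8.2 Next Steps)
- References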

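Abstract

This article walks through a complete automated operations system to show how to build an intelligent operations platform with OpenClaw. It covers four core functions: monitoring and alerting, log analysis, fault diagnosis, and automated remediation. Each comes with a system design and code, so that developers can see how OpenClaw applies to operations automation and how the parts of such a system fit together. 🔧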

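1. Introduction - Intelligent Operations Overview

1.1 The Need for Operations Automation

Operating modern IT systems raises challenges that traditional, manual operations struggle to meet:

Challenge	Traditional operations	Intelligent operations with OpenClaw
Alert storms	Manual triage	Intelligent aggregation and noise reduction
Fault localization	Layer-by-layer troubleshooting	Root cause analysis
Remediation efficiency	Manual execution	Automated remediation
Knowledge transfer	Depends on individual experience	Captured in a knowledge base
24/7 response	On-call shifts	Around-the-clock AI

1.2 System Architecture Design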

[Figure: system architecture flowchart with four layers, from data collection up to automated execution]

Data collection layer: host monitoring, application monitoring, log collection, network monitoring
Data processing layer: data cleaning, metric aggregation, log parsing, anomaly detection
Intelligent analysis layer: alert analysis, root cause localization, trend forecasting, capacity planning
Automated execution layer: alert notification, automated remediation, ticket creation, report generation
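
The four layers can be read as the stages of one pipeline: collect, process, analyze, act. The following is a minimal sketch of that flow, not OpenClaw's API. Every function name, the sample values, and the 80% threshold are assumptions made for illustration.

```python
from typing import Dict, List

# A sample is a plain dict here; all names in this sketch are hypothetical.
Sample = Dict[str, float]


def collect() -> List[Sample]:
    """Collection layer: return raw samples (hard-coded stand-ins)."""
    return [
        {"cpu.usage": 45.5},
        {"cpu.usage": 92.0},
        {"cpu.usage": -1.0},  # invalid reading, removed by the processing layer
    ]


def process(samples: List[Sample]) -> List[Sample]:
    """Processing layer: drop percentages outside the 0-100 range."""
    return [s for s in samples if 0.0 <= s["cpu.usage"] <= 100.0]


def analyze(samples: List[Sample], threshold: float = 80.0) -> List[str]:
    """Analysis layer: describe every sample above the threshold."""
    return [
        f"cpu.usage={s['cpu.usage']} exceeds {threshold}"
        for s in samples
        if s["cpu.usage"] > threshold
    ]


def act(findings: List[str]) -> List[str]:
    """Execution layer: turn each finding into a notification string."""
    return [f"NOTIFY: {finding}" for finding in findings]


def run_pipeline() -> List[str]:
    """Run the four layers in order and return the notifications."""
    return act(analyze(process(collect())))
```

Keeping each layer a separate function with a narrow input and output makes it possible to replace one layer (for example, swapping the hard-coded collector for a real one) without touching the others.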

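1.3 Core Feature Planning

Module	Core capability	Implementation
Monitoring collection	Multi-source data collection	Prometheus + custom collectors
Log analysis	Log parsing and analysis	ELK + AI analysis
Alert management	Intelligent alert noise reduction	Rule engine + ML
Fault diagnosis	Root cause analysis	Knowledge graph + inference
Automated remediation	Self-healing	Ansible + scripts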

2. Monitoring Data Collection

2.1 Host Monitoring Collector

python

from dataclasses import dataclass
from typing import Dict, List, Optional
import time
import threading
import json

@dataclass
class Metric:
    """A single monitoring metric sample."""
    name: str
    value: float
    timestamp: float
    tags: Dict[str, str]
    unit: str = ""

class HostMonitor:
    """Host monitoring collector."""
    
    def __init__(self, host: str, interval: int = 60):
        self.host = host
        self.interval = interval
        self.running = False
        self.metrics: List[Metric] = []
        self.collectors = {
            "cpu": self._collect_cpu,
            "memory": self._collect_memory,
            "disk": self._collect_disk,
            "network": self._collect_network
        }
    
    def start(self):
        """Start collection in a background daemon thread."""
        self.running = True
        thread = threading.Thread(target=self._collect_loop, daemon=True)
        thread.start()
    
    def stop(self):
        """Stop collection; the loop exits after its current sleep."""
        self.running = False
    
    def _collect_loop(self):
        """Collection loop: run every collector, then sleep for one interval."""
        while self.running:
            for collector_name, collector_func in self.collectors.items():
                try:
                    metrics = collector_func()
                    self.metrics.extend(metrics)
                except Exception as e:
                    print(f"Failed to collect {collector_name}: {e}")
            
            time.sleep(self.interval)
    
    def _collect_cpu(self) -> List[Metric]:
        """Collect CPU metrics."""
        # A real collector would read these via psutil or over SSH;
        # hard-coded sample values keep the example simple.
        
        return [
            Metric(
                name="cpu.usage",
                value=45.5,
                timestamp=time.time(),
                tags={"host": self.host},
                unit="%"
            ),
            Metric(
                name="cpu.load1",
                value=2.5,
                timestamp=time.time(),
                tags={"host": self.host}
            ),
            Metric(
                name="cpu.load5",
                value=3.2,
                timestamp=time.time(),
                tags={"host": self.host}
            ),
            Metric(
                name="cpu.load15",
                value=2.8,
                timestamp=time.time(),
                tags={"host": self.host}
            )
        ]
    
    def _collect_memory(self) -> List[Metric]:
        """Collect memory metrics (hard-coded sample values)."""
        return [
            Metric(
                name="memory.usage",
                value=75.3,
                timestamp=time.time(),
                tags={"host": self.host},
                unit="%"
            ),
            Metric(
                name="memory.used",
                value=12.5 * 1024 * 1024 * 1024,  # 12.5 GB
                timestamp=time.time(),
                tags={"host": self.host},
                unit="bytes"
            ),
            Metric(
                name="memory.total",
                value=16 * 1024 * 1024 * 1024,  # 16 GB
                timestamp=time.time(),
                tags={"host": self.host},
                unit="bytes"
            )
        ]
    
    def _collect_disk(self) -> List[Metric]:
        """Collect disk metrics (hard-coded sample values)."""
        metrics = []
        
        # Root partition
        metrics.extend([
            Metric(
                name="disk.usage",
                value=68.5,
                timestamp=time.time(),
                tags={"host": self.host, "mount": "/"},
                unit="%"
            ),
            Metric(
                name="disk.iops.read",
                value=150,
                timestamp=time.time(),
                tags={"host": self.host, "mount": "/"},
                unit="ops/s"
            ),
            Metric(
                name="disk.iops.write",
                value=80,
                timestamp=time.time(),
                tags={"host": self.host, "mount": "/"},
                unit="ops/s"
            )
        ])
        
        return metrics
    
    def _collect_network(self) -> List[Metric]:
        """Collect network metrics (hard-coded sample values)."""
        return [
            Metric(
                name="network.bytes.in",
                value=1024 * 1024 * 50,  # 50 MB/s
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="bytes/s"
            ),
            Metric(
                name="network.bytes.out",
                value=1024 * 1024 * 30,  # 30 MB/s
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="bytes/s"
            ),
            Metric(
                name="network.packets.in",
                value=50000,
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="packets/s"
            ),
            Metric(
                name="network.packets.out",
                value=30000,
                timestamp=time.time(),
                tags={"host": self.host, "interface": "eth0"},
                unit="packets/s"
            )
        ]
    
    def get_metrics(self, name: Optional[str] = None, since: Optional[float] = None) -> List[Metric]:
        """Return collected metrics, optionally filtered by name and start time."""
        result = self.metrics
        
        if name:
            result = [m for m in result if m.name == name]
        
        if since is not None:
            result = [m for m in result if m.timestamp >= since]
        
        return result

# Usage example
monitor = HostMonitor("server-01", interval=30)
monitor.start()

# Wait while metrics are collected
time.sleep(60)

# Fetch the collected metrics
metrics = monitor.get_metrics()
print(f"Collected {len(metrics)} metrics")
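
The collectors in HostMonitor return hard-coded values. As a rough idea of what a real local collector might look like, here is a sketch that uses only the Python standard library: `shutil.disk_usage` for disk usage and `os.getloadavg` for load averages. The function name `collect_local` and the dict layout are assumptions made for illustration; the article's Metric dataclass could be used in place of the dicts, and psutil would be needed for CPU or memory percentages.

```python
import os
import shutil
import time
from typing import Dict, List


def collect_local(mount: str = "/") -> List[Dict]:
    """Collect disk usage and load averages from the local machine."""
    now = time.time()

    # shutil.disk_usage returns a named tuple (total, used, free) in bytes.
    usage = shutil.disk_usage(mount)
    percent = usage.used / usage.total * 100 if usage.total else 0.0
    samples: List[Dict] = [{
        "name": "disk.usage",
        "value": round(percent, 1),
        "timestamp": now,
        "tags": {"mount": mount},
        "unit": "%",
    }]

    # os.getloadavg exists only on Unix-like systems and may raise OSError.
    try:
        loads = os.getloadavg()
    except (AttributeError, OSError):
        loads = ()

    for name, value in zip(("cpu.load1", "cpu.load5", "cpu.load15"), loads):
        samples.append({
            "name": name,
            "value": value,
            "timestamp": now,
            "tags": {},
            "unit": "",
        })

    return samples
```

On platforms without `os.getloadavg`, the function still returns the disk sample and simply omits the load averages.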

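2.2 Application Monitoring Collector

python

import time
from typing import Dict, List

import requests

# Metric is the dataclass defined in section 2.1.

class ApplicationMonitor:
    """Application monitoring collector."""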
    
    def __init__(self):
        self.apps: Dict[str, dict] = {}
    
    def register_app(self, app_name: str, endpoints: Dict):
        """
        Register an application.
        
        Args:
            app_name: application name
            endpoints: endpoint configuration, for example
                {
                    "health": "http://app:8080/health",
                    "metrics": "http://app:8080/metrics",
                    "ready": "http://app:8080/ready"
                }
        """
        self.apps[app_name] = {
            "name": app_name,
            "endpoints": endpoints,
            "status": "unknown"
        }
    
    def check_health(self, app_name: str) -> Dict:
        """Check an application's health status."""
        app = self.apps.get(app_name)
        if not app:
            return {"error": "application not registered"}
        
        health_url = app["endpoints"].get("health")
        if not health_url:
            return {"error": "no health check endpoint configured"}
        
        try:
            start_time = time.time()
            response = requests.get(health_url, timeout=5)
            elapsed = time.time() - start_time
            
            if response.status_code == 200:
                app["status"] = "healthy"
                return {
                    "status": "healthy",
                    "response_time": elapsed,
                    "details": response.json() if response.headers.get("content-type", "").startswith("application/json") else response.text
                }
            else:
                app["status"] = "unhealthy"
                return {
                    "status": "unhealthy",
                    "status_code": response.status_code
                }
        
        except requests.exceptions.Timeout:
            app["status"] = "timeout"
            return {"status": "timeout", "error": "request timed out"}
        
        except Exception as e:
            app["status"] = "error"
            return {"status": "error", "error": str(e)}
    
    def collect_metrics(self, app_name: str) -> List[Metric]:
        """Collect an application's metrics."""
        app = self.apps.get(app_name)
        if not app:
            return []
        
        metrics_url = app["endpoints"].get("metrics")
        if not metrics_url:
            return []
        
        try:
            response = requests.get(metrics_url, timeout=10)
            
            if response.status_code == 200:
                # Parse metrics in the Prometheus text format
                return self._parse_prometheus_metrics(response.text, app_name)
            
        except Exception as e:
            print(f"Failed to collect metrics: {e}")
        
        return []
    
    def _parse_prometheus_metrics(self, text: str, app_name: str) -> List[Metric]:
        """Parse metrics in the Prometheus text format (simplified)."""
        metrics = []
        
        for line in text.split('\n'):
            if line.startswith('#') or not line.strip():
                continue
            
            try:
                # Simplified parsing of: metric_name{labels} value [timestamp]
                # Label values that contain commas are not handled.
                if '{' in line:
                    name_part, value_part = line.rsplit('}', 1)
                    name, labels_str = name_part.split('{', 1)
                    # The first token is the value; a timestamp may follow it.
                    value = float(value_part.split()[0])
                    
                    # Parse labels
                    tags = {"app": app_name}
                    for label in labels_str.split(','):
                        if '=' in label:
                            k, v = label.split('=', 1)
                            tags[k.strip()] = v.strip().strip('"')
                else:
                    parts = line.split()
                    name = parts[0]
                    value = float(parts[1])
                    tags = {"app": app_name}
                
                metrics.append(Metric(
                    name=name,
                    value=value,
                    timestamp=time.time(),
                    tags=tags
                ))
            
            except (ValueError, IndexError):
                # Skip lines that do not fit the simplified format
                continue
        
        return metrics
    
    def get_all_status(self) -> Dict:
        """Return the health status of every registered application."""
        result = {}
        
        for app_name in self.apps:
            result[app_name] = self.check_health(app_name)
        
        return result

# Usage example
app_monitor = ApplicationMonitor()

# Register applications
app_monitor.register_app("web-api", {
    "health": "http://web-api:8080/health",
    "metrics": "http://web-api:8080/metrics"
})

app_monitor.register_app("user-service", {
    "health": "http://user-service:8081/health",
    "metrics": "http://user-service:8081/metrics"
})

# Check health status
status = app_monitor.get_all_status()
for app_name, app_status in status.items():
    # Results for misconfigured apps carry only an "error" key
    print(f"{app_name}: {app_status.get('status', 'error')}")
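
The parser in ApplicationMonitor splits label strings on commas, so it cannot handle label values that themselves contain commas, equals signs, or braces. One possible alternative is a regex-based parser for a single sample line, sketched here under the assumption that input follows the Prometheus text exposition format (`name{label="value",...} value [timestamp]`). The names `parse_sample`, `_SAMPLE_RE`, and `_LABEL_RE` are hypothetical, and the sketch ignores the optional timestamp.

```python
import re
from typing import Dict, Optional, Tuple

# name, optional {labels}, value, optional integer timestamp
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# label="value", where the value may contain backslash escapes
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"')
_ESCAPES = {"n": "\n", '"': '"', "\\": "\\"}


def _unescape(value: str) -> str:
    """Resolve the escapes \\n, \\" and \\\\ inside a label value."""
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(0)), value)


def parse_sample(line: str) -> Optional[Tuple[str, Dict[str, str], float]]:
    """Parse one sample line into (name, labels, value), or None if it is not a sample."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None

    match = _SAMPLE_RE.match(line)
    if match is None:
        return None

    try:
        value = float(match.group("value"))
    except ValueError:
        return None

    labels = {
        key: _unescape(raw)
        for key, raw in _LABEL_RE.findall(match.group("labels") or "")
    }
    return match.group("name"), labels, value
```

Matching quoted label values as a unit is what lets commas and braces inside a value pass through untouched.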

3. Log Analysis System

3.1 Log Collection and Parsing

python

from dataclasses import dataclass
from typing import List, Dict, Optional
import re
from datetime import datetime, timedelta

@dataclass
class LogEntry:
    """A parsed log entry."""
    timestamp: datetime
    level: str
    message: str
    source: str
    metadata: Dict

class LogParser:
    """Log parser with one regular expression per log type."""
    
    def __init__(self):
        self.patterns = {
            "nginx": r'(?P<ip>[\d.]+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>[^\s]+) HTTP/[\d.]+" (?P<status>\d+) (?P<size>\d+)',
            "java": r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[(?P<thread>[^\]]+)\] (?P<level>\w+)\s+(?P<class>[^\s]+) - (?P<message>.+)',
            "python": r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (?P<level>\w+) - (?P<message>.+)',
            "syslog": r'(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<process>[^\[]+)\[(?P<pid>\d+)\]: (?P<message>.+)'
        }
    
    def parse(self, line: str, log_type: str) -> Optional[LogEntry]:
        """Parse one log line; return None if the type or format is unknown."""
        pattern = self.patterns.get(log_type)
        if not pattern:
            return None
        
        match = re.match(pattern, line)
        if not match:
            return None
        
        groups = match.groupdict()
        
        # Parse the timestamp
        timestamp = self._parse_timestamp(groups.get("timestamp", ""))
        
        # Determine the log level (formats without one default to INFO)
        level = groups.get("level", "INFO").upper()
        
        # Extract the message
        message = groups.get("message", line)
        
        # Keep every other captured group as metadata
        metadata = {k: v for k, v in groups.items() if k not in ["timestamp", "level", "message"]}
        
        return LogEntry(
            timestamp=timestamp,
            level=level,
            message=message,
            source=log_type,
            metadata=metadata
        )
    
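    def _parse_timestamp(self, ts_str: str) -> datetime:
        """Parse a timestamp, falling back to the current time on failure."""
        formats = [
            "%Y-%m-%d %H:%M:%S,%f",
            "%d/%b/%Y:%H:%M:%S %z",
            "%b %d %H:%M:%S"
        ]
        
        for fmt in formats:
            try:
                parsed = datetime.strptime(ts_str, fmt)
            except ValueError:
                continue
            if "%Y" not in fmt:
                # Syslog timestamps carry no year (strptime defaults to 1900),
                # so assume the current year.
                parsed = parsed.replace(year=datetime.now().year)
            return parsed
        
        return datetime.now()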

class LogAnalyzer:
    """Log analyzer."""
    
    def __init__(self):
        self.parser = LogParser()
        self.logs: List[LogEntry] = []
        self.error_patterns = [
            r'Exception',
            r'Error',
            r'Failed',
            r'Timeout',
            r'Connection refused'
        ]
    
    def add_log(self, line: str, log_type: str):
        """Parse a log line and store it if parsing succeeds."""
        entry = self.parser.parse(line, log_type)
        if entry:
            self.logs.append(entry)
    
    def analyze_errors(self, time_range: Optional[tuple] = None) -> List[Dict]:
        """Collect error-level logs, optionally within a (start, end) range."""
        errors = []
        
        for log in self.logs:
            if log.level in ["ERROR", "FATAL", "CRITICAL"]:
                # Skip entries outside the time range
                if time_range:
                    if not (time_range[0] <= log.timestamp <= time_range[1]):
                        continue
                
                # Match known error patterns
                matched_patterns = []
                for pattern in self.error_patterns:
                    if re.search(pattern, log.message):
                        matched_patterns.append(pattern)
                
                errors.append({
                    "timestamp": log.timestamp.isoformat(),
                    "level": log.level,
                    "message": log.message,
                    "source": log.source,
                    "patterns": matched_patterns,
                    "metadata": log.metadata
                })
        
        return errors
    
    def get_statistics(self) -> Dict:
        """Return log counts in total, by level, and by source."""
        level_counts = {}
        source_counts = {}
        
        for log in self.logs:
            level_counts[log.level] = level_counts.get(log.level, 0) + 1
            source_counts[log.source] = source_counts.get(log.source, 0) + 1
        
        return {
            "total_logs": len(self.logs),
            "by_level": level_counts,
            "by_source": source_counts
        }
    
    def detect_anomalies(self) -> List[Dict]:
        """Detect anomalies in the stored logs."""
        anomalies = []
        
        # Detect a spike in errors.
        # Simplified: count the errors from the last 5 minutes.
        recent_time = datetime.now() - timedelta(minutes=5)
        recent_errors = [log for log in self.logs 
                        if log.level in ["ERROR", "FATAL"] and log.timestamp >= recent_time]
        
        if len(recent_errors) > 10:  # threshold
            anomalies.append({
                "type": "error_spike",
                "count": len(recent_errors),
                "severity": "high" if len(recent_errors) > 50 else "medium"
            })
        
        return anomalies

# Usage example
analyzer = LogAnalyzer()

# Add logs
analyzer.add_log('2026-04-20 12:00:00,123 - ERROR - Connection refused to database', 'python')
analyzer.add_log('2026-04-20 12:00:01,456 - INFO - Request processed successfully', 'python')
analyzer.add_log('2026-04-20 12:00:02,789 - ERROR - Timeout waiting for response', 'python')

# Analyze errors
errors = analyzer.analyze_errors()
print(f"Found {len(errors)} errors")

# Get statistics
stats = analyzer.get_statistics()
print(f"Log statistics: {stats}")
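
The detect_anomalies method uses a fixed count over the last five minutes, which can misfire on services that are always noisy. One common refinement is to compare the recent window against the window just before it. The sketch below illustrates that idea only: the function name `detect_error_spike`, the `ratio` parameter, and its default of 3.0 are assumptions, and the clock is passed in as `now` so the logic can be tested deterministically.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def detect_error_spike(
    error_times: List[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
    min_count: int = 10,
    ratio: float = 3.0,
) -> Optional[Dict]:
    """Report a spike when recent errors are numerous and well above the baseline.

    The recent window is (now - window, now]; the baseline is the window of
    the same length immediately before it.
    """
    recent_start = now - window
    baseline_start = recent_start - window

    recent = sum(1 for t in error_times if recent_start < t <= now)
    baseline = sum(1 for t in error_times if baseline_start < t <= recent_start)

    # Require an absolute minimum so quiet services do not alert on noise.
    if recent < min_count:
        return None
    # Require growth relative to the baseline (treated as at least 1).
    if recent < ratio * max(baseline, 1):
        return None

    return {
        "type": "error_spike",
        "count": recent,
        "baseline": baseline,
        "severity": "high" if recent > 50 else "medium",
    }
```

With the defaults, 12 errors in the last five minutes after 2 in the five minutes before would count as a spike, while 12 after 10 would not.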

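4. Alert Management System

4.1 Alert Rule Engine

python

import time
from dataclasses import dataclass
from typing import Dict, List, Callable, Optional
from enum import Enum

# Metric is the dataclass defined in section 2.1.

class AlertSeverity(Enum):
    """Alert severity levels."""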
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"

@dataclass
class Alert:
    """An alert raised by a rule."""
    id: str
    name: str
    severity: AlertSeverity
    message: str
    source: str
    timestamp: float
    labels: Dict[str, str]
    status: str = "firing"
    acknowledged: bool = False
    resolved_at: Optional[float] = None

class AlertRule:
    """An alert rule: a condition, a severity, and a required duration."""
    
    def __init__(self, name: str, condition: Callable, severity: AlertSeverity,
                 duration: int = 0, labels: Optional[Dict[str, str]] = None):
        self.name = name
        self.condition = condition
        self.severity = severity
        self.duration = duration  # seconds the condition must hold before firing
        self.labels = labels or {}
        self.firing_since = None

class AlertManager:
    """Alert manager."""
    
    def __init__(self):
        self.rules: Dict[str, AlertRule] = {}
        self.alerts: Dict[str, Alert] = {}
        self.handlers: List[Callable] = []
    
    def add_rule(self, rule: AlertRule):
        """Add an alert rule."""
        self.rules[rule.name] = rule
    
    def add_handler(self, handler: Callable):
        """Add an alert handler."""
        self.handlers.append(handler)
    
    def evaluate(self, metrics: List[Metric]):
        """Evaluate every rule against the given metrics."""
        for rule_name, rule in self.rules.items():
            try:
                is_firing = rule.condition(metrics)
                
                if is_firing:
                    if rule.firing_since is None:
                        rule.firing_since = time.time()
                    
                    # Fire only once the condition has held for the duration
                    if time.time() - rule.firing_since >= rule.duration:
                        self._fire_alert(rule, metrics)
                else:
                    if rule.firing_since is not None:
                        self._resolve_alert(rule)
                    rule.firing_since = None
            
            except Exception as e:
                print(f"Failed to evaluate rule {rule_name}: {e}")
    
    def _fire_alert(self, rule: AlertRule, metrics: List[Metric]):
        """Fire an alert for a rule unless one is already active."""
        # The id embeds a timestamp, so without this check every evaluation
        # of a still-firing rule would create another alert.
        if any(a.name == rule.name and a.status == "firing"
               for a in self.alerts.values()):
            return
        
        alert_id = f"alert_{rule.name}_{int(time.time())}"
        
        if alert_id not in self.alerts:
            alert = Alert(
                id=alert_id,
                name=rule.name,
                severity=rule.severity,
                message=self._generate_message(rule, metrics),
                source="monitor",
                timestamp=time.time(),
                labels=rule.labels
            )
            
            self.alerts[alert_id] = alert
            
            # Call the handlers
            for handler in self.handlers:
                try:
                    handler(alert)
                except Exception as e:
                    print(f"Handler failed: {e}")
    
    def _resolve_alert(self, rule: AlertRule):
        """Resolve the active alerts of a rule."""
        for alert in self.alerts.values():
            if alert.name == rule.name and alert.status == "firing":
                alert.status = "resolved"
                alert.resolved_at = time.time()
    
    def _generate_message(self, rule: AlertRule, metrics: List[Metric]) -> str:
        """Generate the alert message."""
        # Simplified implementation
        return f"Alert rule {rule.name} triggered"
    
    def get_active_alerts(self) -> List[Alert]:
        """Return the alerts that are still firing."""
        return [a for a in self.alerts.values() if a.status == "firing"]
    
    def acknowledge(self, alert_id: str):
        """Acknowledge an alert."""
        if alert_id in self.alerts:
            self.alerts[alert_id].acknowledged = True

# Define the alert rule conditions
def cpu_high_condition(metrics: List[Metric]) -> bool:
    """True when average CPU usage exceeds 80%."""
    cpu_metrics = [m for m in metrics if m.name == "cpu.usage"]
    if cpu_metrics:
        avg_cpu = sum(m.value for m in cpu_metrics) / len(cpu_metrics)
        return avg_cpu > 80
    return False

def memory_high_condition(metrics: List[Metric]) -> bool:
    """True when average memory usage exceeds 90%."""
    mem_metrics = [m for m in metrics if m.name == "memory.usage"]
    if mem_metrics:
        avg_mem = sum(m.value for m in mem_metrics) / len(mem_metrics)
        return avg_mem > 90
    return False

def disk_full_condition(metrics: List[Metric]) -> bool:
    """True when any disk is more than 85% full."""
    disk_metrics = [m for m in metrics if m.name == "disk.usage"]
    for m in disk_metrics:
        if m.value > 85:
            return True
    return False

# Usage example
alert_manager = AlertManager()

# Add rules
alert_manager.add_rule(AlertRule(
    name="cpu_high",
    condition=cpu_high_condition,
    severity=AlertSeverity.WARNING,
    duration=60,
    labels={"team": "ops"}
))

alert_manager.add_rule(AlertRule(
    name="memory_high",
    condition=memory_high_condition,
    severity=AlertSeverity.CRITICAL,
    duration=30,
    labels={"team": "ops"}
))

alert_manager.add_rule(AlertRule(
    name="disk_full",
    condition=disk_full_condition,
    severity=AlertSeverity.WARNING,
    labels={"team": "ops"}
))

# Add a handler
def send_notification(alert: Alert):
    """Send a notification (printed here)."""
    print(f"[{alert.severity.value.upper()}] {alert.name}: {alert.message}")

alert_manager.add_handler(send_notification)

# Evaluate the rules
metrics = [
    Metric("cpu.usage", 85, time.time(), {"host": "server-01"}),
    Metric("memory.usage", 92, time.time(), {"host": "server-01"})
]

# cpu_high and memory_high have non-zero durations, so this first call only
# starts their timers. An alert fires once a later evaluate() call finds the
# condition still true after 60 s (cpu_high) or 30 s (memory_high).
alert_manager.evaluate(metrics)
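
The duration check in AlertManager is easier to follow as a small state machine: a rule is idle, then pending while its condition holds, then firing once the duration has elapsed, and it resolves when the condition clears. The sketch below isolates that logic. The class name `PendingRule` and its `update` method are hypothetical, and the current time is passed in as `now` rather than read from `time.time()` so the transitions can be checked without waiting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingRule:
    """Threshold rule that fires only after the condition holds for `duration` seconds."""
    name: str
    threshold: float
    duration: float
    pending_since: Optional[float] = None
    firing: bool = False

    def update(self, value: float, now: float) -> Optional[str]:
        """Feed one observation; return "fired" or "resolved" on a state change."""
        if value > self.threshold:
            # Remember when the condition first became true.
            if self.pending_since is None:
                self.pending_since = now
            # Fire once, when the condition has held long enough.
            if not self.firing and now - self.pending_since >= self.duration:
                self.firing = True
                return "fired"
            return None

        # Condition is false: reset the timer and resolve if needed.
        self.pending_since = None
        if self.firing:
            self.firing = False
            return "resolved"
        return None


rule = PendingRule("cpu_high", threshold=80, duration=60)
observations = [(85, 0), (90, 30), (88, 60), (91, 90), (50, 120), (50, 150)]
events = [rule.update(value, now) for value, now in observations]
```

Because `update` reports only state changes, a handler attached to it is called once per incident instead of once per evaluation.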

4.2 Alert Noise Reduction

python

import time
from typing import Dict, List
from collections import defaultdict

# Alert and AlertSeverity are defined in section 4.1.

class AlertDeduplicator:
    """Suppresses repeats of the same alert within a time window."""
    
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.seen_alerts: Dict[str, float] = {}
    
    def should_send(self, alert: Alert) -> bool:
        """Decide whether an alert should be sent."""
        # Generate the alert fingerprint
        fingerprint = self._generate_fingerprint(alert)
        
        now = time.time()
        
        # Suppress if the same fingerprint was sent within the window
        if fingerprint in self.seen_alerts:
            if now - self.seen_alerts[fingerprint] < self.window:
                return False
        
        # Record the send time
        self.seen_alerts[fingerprint] = now
        
        # Remove expired records
        self._cleanup(now)
        
        return True
    
    def _generate_fingerprint(self, alert: Alert) -> str:
        """Build a fingerprint from the alert name, source, and host label."""
        return f"{alert.name}:{alert.source}:{alert.labels.get('host', '')}"
    
    def _cleanup(self, now: float):
        """Remove records older than twice the window."""
        expired = [k for k, v in self.seen_alerts.items() if now - v > self.window * 2]
        for k in expired:
            del self.seen_alerts[k]

class AlertAggregator:
    """Groups alerts by rule name."""
    
    def __init__(self, window_seconds: int = 60):
        self.window = window_seconds
        self.pending: Dict[str, List[Alert]] = defaultdict(list)
    
    def add(self, alert: Alert) -> bool:
        """
        Add an alert to its group.
        
        Returns:
            True if the group should be sent immediately
        """
        # Group by rule name
        group_key = alert.name
        
        self.pending[group_key].append(alert)
        
        # Check whether the group has reached the aggregation threshold
        if len(self.pending[group_key]) >= 5:
            return True
        
        return False
    
    def get_aggregated(self) -> Dict[str, List[Alert]]:
        """Return the pending groups and clear them."""
        result = dict(self.pending)
        self.pending.clear()
        return result

# Usage example
deduplicator = AlertDeduplicator(window_seconds=300)
aggregator = AlertAggregator(window_seconds=60)

# Process an alert
alert1 = Alert(
    id="alert_001",
    name="cpu_high",
    severity=AlertSeverity.WARNING,
    message="CPU usage is too high",
    source="monitor",
    timestamp=time.time(),
    labels={"host": "server-01"}
)

if deduplicator.should_send(alert1):
    print("Alert sent")
else:
    print("Alert suppressed as a duplicate")
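
AlertAggregator stores `window_seconds` but flushes only on the count threshold, so a group with fewer than five alerts waits until `get_aggregated` is called. One way to make the window take effect is to record when each group was opened and flush groups that are either full or expired. The sketch below assumes that design: the class `WindowedAggregator`, its `flush_due` method, and the summary string format are all hypothetical, and alerts are reduced to plain strings to keep it self-contained.

```python
from collections import defaultdict
from typing import Dict, List


class WindowedAggregator:
    """Batches messages per group and flushes a group when it is full or its window expires."""

    def __init__(self, window_seconds: float = 60, max_batch: int = 5):
        self.window = window_seconds
        self.max_batch = max_batch
        self._groups: Dict[str, List[str]] = defaultdict(list)
        self._opened_at: Dict[str, float] = {}

    def add(self, group: str, message: str, now: float) -> None:
        """Add a message; the window starts with the group's first message."""
        if group not in self._opened_at:
            self._opened_at[group] = now
        self._groups[group].append(message)

    def flush_due(self, now: float) -> Dict[str, str]:
        """Return one summary per group that is full or older than the window."""
        summaries: Dict[str, str] = {}
        for group in list(self._groups):
            messages = self._groups[group]
            full = len(messages) >= self.max_batch
            expired = now - self._opened_at[group] >= self.window
            if full or expired:
                summaries[group] = (
                    f"{group}: {len(messages)} alert(s), first: {messages[0]}"
                )
                del self._groups[group]
                del self._opened_at[group]
        return summaries
```

A caller would invoke `flush_due` periodically, for example from the same loop that evaluates the alert rules, and send one notification per returned summary.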

5. Fault Diagnosis System

5.1 Root Cause Analysis

python

from typing import Dict, List, Set, Optional
from dataclasses import dataclass

@dataclass
class DiagnosisResult:
    """Result of a fault diagnosis."""
    root_cause: str
    confidence: float
    evidence: List[str]
    suggestions: List[str]

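class FaultDiagnoser:
    """Fault diagnoser backed by a small knowledge base."""
    
    def __init__(self):
        self.knowledge_base = self._load_knowledge()
        self.causality_graph = self._build_causality_graph()
    
    def _load_knowledge(self) -> Dict:
        """Load the knowledge base: causes, symptoms, and solutions per issue."""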
        return {
            "cpu_high": {
                "causes": ["process_runaway", "insufficient_resources", "traffic_spike"],
                "symptoms": ["high_load", "slow_response"],
                "solutions": ["kill_process", "scale_out", "optimize_code"]
            },
            "memory_high": {
                "causes": ["memory_leak", "large_cache", "insufficient_memory"],
                "symptoms": ["oom", "slow_gc"],
                "solutions": ["restart_service", "increase_memory", "fix_leak"]
            },
            "disk_full": {
                "causes": ["log_bloat", "large_files", "insufficient_disk"],
                "symptoms": ["write_failed", "service_down"],
                "solutions": ["clean_logs", "delete_files", "expand_disk"]
            },
            "network_error": {
                "causes": ["dns_failure", "firewall_block", "network_down"],
                "symptoms": ["connection_timeout", "connection_refused"],
                "solutions": ["check_dns", "check_firewall", "check_network"]
            }
        }
    
    def _build_causality_graph(self) -> Dict:
        """Build the causality graph, mapping each cause to its effects."""
        return {
            "traffic_spike": ["cpu_high", "memory_high", "network_congestion"],
            "memory_leak": ["memory_high", "oom_killer", "service_crash"],
            "disk_full": ["write_failed", "service_down"],
            "dns_failure": ["connection_timeout", "service_unavailable"],
            "process_runaway": ["cpu_high", "system_hang"]
        }
    
    def diagnose(self, symptoms: List[str], context: Dict) -> DiagnosisResult:
        """
        Diagnose a fault.

        Args:
            symptoms: Observed symptoms.
            context: Context information, such as current metrics.

        Returns:
            The diagnosis result.
        """
        # Match the observed symptoms against the knowledge base
        observed = set(symptoms)
        matched_issues = []

        for issue, info in self.knowledge_base.items():
            # Symptoms shared by the observation and this known issue
            overlap = observed & set(info["symptoms"])

            if overlap:
                matched_issues.append({
                    "issue": issue,
                    "causes": info["causes"],
                    "solutions": info["solutions"],
                    # Fraction of the issue's known symptoms that were observed
                    "match_score": len(overlap) / len(info["symptoms"])
                })

        if not matched_issues:
            return DiagnosisResult(
                root_cause="unknown",
                confidence=0,
                evidence=[],
                suggestions=["Please provide more information to help with the diagnosis"]
            )

        # Sort by score and take the most likely issue
        matched_issues.sort(key=lambda x: x["match_score"], reverse=True)
        top_match = matched_issues[0]

        # Determine the root cause
        root_cause = self._determine_root_cause(top_match, context)

        # Collect evidence
        evidence = self._collect_evidence(top_match, symptoms, context)

        # Use the issue's known solutions as suggestions
        suggestions = top_match["solutions"]

        return DiagnosisResult(
            root_cause=root_cause,
            confidence=top_match["match_score"],
            evidence=evidence,
            suggestions=suggestions
        )
    
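    def _determine_root_cause(self, match: Dict, context: Dict) -> str:
        """Determine the root cause."""
        causes = match["causes"]

        # A fuller implementation would use the context to pick the most likely cause.
        # Simplified here: return the first listed cause, falling back to the issue name.
        return causes[0] if causes else match["issue"]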
    
    def _collect_evidence(self, match: Dict, symptoms: List[str], context: Dict) -> List[str]:
        """Collect evidence supporting the diagnosis."""
        evidence = []

        evidence.append(f"Symptoms detected: {', '.join(symptoms)}")
        evidence.append(f"Matched issue: {match['issue']}")
        evidence.append(f"Possible causes: {', '.join(match['causes'])}")

        # Add evidence from the context when the metrics are available
        if "cpu_usage" in context:
            evidence.append(f"CPU usage: {context['cpu_usage']}%")

        if "memory_usage" in context:
            evidence.append(f"Memory usage: {context['memory_usage']}%")

        return evidence

# Usage example
diagnoser = FaultDiagnoser()

# Diagnose a fault
result = diagnoser.diagnose(
    symptoms=["high_load", "slow_response"],
    context={
        "cpu_usage": 95,
        "memory_usage": 60
    }
)

print(f"Root cause: {result.root_cause}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Evidence: {result.evidence}")
print(f"Suggestions: {result.suggestions}")
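
The `diagnose` method above matches symptoms against the knowledge base but never consults the graph returned by `_build_causality_graph`. As a minimal sketch of how that graph could be put to use, the hypothetical helper `rank_root_causes` below (not part of the original code) scores each cause by the share of its known effects that were actually observed. The graph literal is copied from the listing; the scoring rule and the tie-breaking order are assumptions.

```python
from typing import Dict, List, Tuple

# Copied from _build_causality_graph above: cause -> effects it can trigger
CAUSALITY_GRAPH: Dict[str, List[str]] = {
    "traffic_spike": ["cpu_high", "memory_high", "network_congestion"],
    "memory_leak": ["memory_high", "oom_killer", "service_crash"],
    "disk_full": ["write_failed", "service_down"],
    "dns_failure": ["connection_timeout", "service_unavailable"],
    "process_runaway": ["cpu_high", "system_hang"],
}


def rank_root_causes(observed: List[str],
                     graph: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Hypothetical helper: rank causes by the share of their effects that were observed."""
    seen = set(observed)
    ranked = []
    for cause, effects in graph.items():
        hits = seen & set(effects)
        if hits:
            ranked.append((cause, len(hits) / len(effects)))
    # Highest coverage first; ties are broken alphabetically for stable output
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


if __name__ == "__main__":
    for cause, score in rank_root_causes(["cpu_high", "system_hang"], CAUSALITY_GRAPH):
        print(f"{cause}: {score:.2f}")
```

With `cpu_high` and `system_hang` observed, `process_runaway` ranks first because both of its effects are present, while `traffic_spike` explains only one of its three. A ranking like this could feed `_determine_root_cause` in place of the "return the first cause" shortcut.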

6. Automatic Remediation System

6.1 Automated Remediation Executor

python

from typing import Dict, List, Callable, Optional
from dataclasses import dataclass
import subprocess
import time  # Needed for the history timestamps recorded in execute()

@dataclass
class RemediationAction:
    """A remediation action."""
    name: str
    description: str
    executor: Callable
    params: Dict
    rollback: Optional[Callable] = None

class AutoRemediation:
    """Automatic remediation system."""

    def __init__(self):
        self.actions: Dict[str, RemediationAction] = {}
        self.history: List[Dict] = []
        self._register_default_actions()

    def _register_default_actions(self):
        """Register the default remediation actions."""
        self.register_action(RemediationAction(
            name="restart_service",
            description="Restart a service",
            executor=self._restart_service,
            params={"service_name": None},
            rollback=self._rollback_restart
        ))

        self.register_action(RemediationAction(
            name="clean_logs",
            description="Clean up log files",
            executor=self._clean_logs,
            params={"log_dir": "/var/log", "days": 7}
        ))

        self.register_action(RemediationAction(
            name="kill_process",
            description="Terminate a process",
            executor=self._kill_process,
            params={"process_name": None}
        ))

        self.register_action(RemediationAction(
            name="scale_out",
            description="Scale out instances",
            executor=self._scale_out,
            params={"service": None, "count": 1}
        ))

    def register_action(self, action: RemediationAction):
        """Register a remediation action under its name."""
        self.actions[action.name] = action
    
    def execute(self, action_name: str, params: Optional[Dict] = None) -> Dict:
        """
        Execute a remediation action.

        Args:
            action_name: Name of a registered action.
            params: Parameters that override the action's defaults.

        Returns:
            The execution result.
        """
        action = self.actions.get(action_name)
        if not action:
            return {"success": False, "error": "Action not found"}

        # Merge the defaults with the caller's overrides
        final_params = {**action.params, **(params or {})}

        # Execute
        try:
            result = action.executor(final_params)

            # Record the run in the history
            self.history.append({
                "action": action_name,
                "params": final_params,
                "result": result,
                "timestamp": time.time()
            })

            return result

        except Exception as e:
            return {"success": False, "error": str(e)}
    
    def _restart_service(self, params: Dict) -> Dict:
        """Restart a service."""
        service_name = params.get("service_name")
        if not service_name:
            return {"success": False, "error": "Missing service name"}

        # Restart through systemctl
        try:
            result = subprocess.run(
                ["systemctl", "restart", service_name],
                capture_output=True,
                text=True,
                timeout=30
            )

            if result.returncode == 0:
                return {"success": True, "message": f"Service {service_name} restarted"}
            else:
                return {"success": False, "error": result.stderr}

        except subprocess.TimeoutExpired:
            return {"success": False, "error": "Restart timed out"}
    
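    def _clean_logs(self, params: Dict) -> Dict:
        """Delete old log files."""
        log_dir = params.get("log_dir", "/var/log")
        days = params.get("days", 7)

        try:
            # Find and delete *.log files last modified more than `days` days ago
            result = subprocess.run(
                ["find", log_dir, "-name", "*.log", "-mtime", f"+{days}", "-delete"],
                capture_output=True,
                text=True
            )

            # find exits non-zero on errors such as a missing directory or denied permission
            if result.returncode != 0:
                return {"success": False, "error": result.stderr}

            return {"success": True, "message": f"Cleaned logs older than {days} days in {log_dir}"}

        except Exception as e:
            return {"success": False, "error": str(e)}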
    
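    def _kill_process(self, params: Dict) -> Dict:
        """Terminate a process."""
        process_name = params.get("process_name")
        if not process_name:
            return {"success": False, "error": "Missing process name"}

        try:
            # -f matches against the full command line, so use a specific pattern
            result = subprocess.run(
                ["pkill", "-f", process_name],
                capture_output=True,
                text=True
            )

            # pkill exits 0 when at least one process matched and 1 when none did
            if result.returncode == 0:
                return {"success": True, "message": f"Terminated process {process_name}"}
            if result.returncode == 1:
                return {"success": False, "error": f"No process matched {process_name}"}
            return {"success": False, "error": result.stderr}

        except Exception as e:
            return {"success": False, "error": str(e)}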
    
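    def _scale_out(self, params: Dict) -> Dict:
        """Scale out a service."""
        # Simplified placeholder: no instances are actually created here
        service = params.get("service")
        count = params.get("count", 1)

        return {"success": True, "message": f"Added {count} instance(s) for {service}"}

    def _rollback_restart(self, params: Dict):
        """Roll back a restart."""
        # Placeholder: rollback logic goes here
        pass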
    
    def get_history(self, limit: int = 100) -> List[Dict]:
        """Return the most recent execution records."""
        return self.history[-limit:]

# Usage example
remediation = AutoRemediation()

# Run a remediation action
result = remediation.execute("restart_service", {"service_name": "nginx"})
print(result)

# Clean up logs
result = remediation.execute("clean_logs", {"log_dir": "/var/log/nginx", "days": 3})
print(result)
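
`RemediationAction` carries an optional `rollback` callable, but `execute` above never invokes it. One possible way to wire it in is sketched below. The `Action` dataclass and the `run_with_rollback` function are hypothetical stand-ins written for this example so that the block runs on its own, and the policy (roll back whenever the executor reports failure or raises) is an assumption rather than the original design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Action:
    """Stand-in for RemediationAction, reduced to the fields this sketch needs."""
    name: str
    executor: Callable[[Dict], Dict]
    rollback: Optional[Callable[[Dict], None]] = None


def run_with_rollback(action: Action, params: Dict) -> Dict:
    """Run the executor; if it fails or raises, call the rollback when one exists."""
    try:
        result = action.executor(params)
    except Exception as exc:
        result = {"success": False, "error": str(exc)}

    rolled_back = False
    if not result.get("success") and action.rollback is not None:
        action.rollback(params)
        rolled_back = True
    return {**result, "rolled_back": rolled_back}


if __name__ == "__main__":
    undo_log = []

    def failing_deploy(params: Dict) -> Dict:
        raise RuntimeError("deploy failed")

    def undo_deploy(params: Dict) -> None:
        undo_log.append(params["service"])

    action = Action("deploy", failing_deploy, undo_deploy)
    print(run_with_rollback(action, {"service": "nginx"}))
    print(undo_log)
```

Returning `rolled_back` alongside the result lets the execution history show whether a failed action left the system in its previous state.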

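7. Best Practices

7.1 System Design Principles

Principle	Description	Practice
Automation first	Reduce manual intervention	Automatic detection + automatic remediation
Observability	Comprehensive monitoring	Metrics + logs + traces
Fault tolerance	Keep the system highly available	Redundancy + graceful degradation
Security and control	Make every operation auditable	Access control + audit logs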
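
The "access control + audit logs" practice in the last row can be sketched as a thin wrapper around an executor such as `AutoRemediation.execute`. Everything named here (`ROLE_PERMISSIONS`, `AUDIT_LOG`, `audited_execute`, and the three roles) is hypothetical and chosen only for illustration; a real system would load permissions from configuration and write audit records to durable storage.

```python
import json
import time
from typing import Callable, Dict, List, Set

# Hypothetical role-to-action allow-list
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": set(),
    "operator": {"clean_logs", "restart_service"},
    "admin": {"clean_logs", "restart_service", "kill_process", "scale_out"},
}

# Kept in memory for the sketch only
AUDIT_LOG: List[str] = []


def audited_execute(role: str, action_name: str, params: Dict,
                    runner: Callable[[str, Dict], Dict]) -> Dict:
    """Check the role's permission, run the action, and append a JSON audit record."""
    allowed = action_name in ROLE_PERMISSIONS.get(role, set())
    if allowed:
        result = runner(action_name, params)
    else:
        result = {"success": False, "error": "permission denied"}
    # Denied attempts are recorded too, so the log shows who tried what
    AUDIT_LOG.append(json.dumps({
        "timestamp": time.time(),
        "role": role,
        "action": action_name,
        "params": params,
        "allowed": allowed,
        "success": bool(result.get("success")),
    }))
    return result


if __name__ == "__main__":
    def fake_runner(name: str, params: Dict) -> Dict:
        return {"success": True, "message": f"ran {name}"}

    print(audited_execute("operator", "clean_logs", {"days": 3}, fake_runner))
    print(audited_execute("operator", "kill_process", {"process_name": "x"}, fake_runner))
    print(len(AUDIT_LOG))
```

Passing the executor in as `runner` keeps the permission check separate from the remediation logic, so the same wrapper could sit in front of `remediation.execute` without changing it.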

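7.2 Common Issues

Issue	Cause	Solution
Too many false alarms	Poorly chosen thresholds	Dynamic thresholds + noise reduction
Remediation fails	Insufficient permissions	Check the executor's permissions
Slow response	Overly complex workflow	Streamline the workflow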
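
The "dynamic thresholds" remedy in the first row can take many forms. The sketch below shows one simple variant, assuming a rolling window and a mean-plus-k-standard-deviations rule; the class name `DynamicThreshold` and its default parameters are illustrative choices, not part of the original system.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque


class DynamicThreshold:
    """Flag a value as anomalous when it exceeds mean + k * stddev of recent samples."""

    def __init__(self, window: int = 60, k: float = 3.0, min_samples: int = 10):
        self.samples: Deque[float] = deque(maxlen=window)
        self.k = k
        self.min_samples = min_samples

    def check(self, value: float) -> bool:
        """Return True if value is anomalous, then add it to the window."""
        anomalous = False
        # Stay silent until there is enough history to estimate a baseline
        if len(self.samples) >= self.min_samples:
            threshold = mean(self.samples) + self.k * pstdev(self.samples)
            anomalous = value > threshold
        self.samples.append(value)
        return anomalous


if __name__ == "__main__":
    detector = DynamicThreshold(window=20, k=3.0, min_samples=5)
    for cpu in [50, 52, 48, 51, 49, 50, 51, 95]:
        print(cpu, detector.check(cpu))
```

Because the threshold follows recent behavior, a host that normally runs at 80% CPU does not alert constantly, while a jump well above its own baseline still does. Two limitations of this sketch: anomalous samples also enter the window and raise the baseline, and a perfectly flat series has zero deviation, so any increase is flagged.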

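8. Summary

8.1 Key Takeaways

This article used a complete automated operations system as a case study to show how OpenClaw applies to operations scenarios:

Module	Core function	Key techniques
Monitoring collection	Multi-source data collection	Host + application monitoring
Log analysis	Log parsing and analysis	Pattern matching + anomaly detection
Alert management	Intelligent alerting	Rule engine + noise reduction
Fault diagnosis	Root cause analysis	Knowledge base + inference
Automatic remediation	Self-healing	Action execution + rollback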

8.2 Next Steps

Part 75: OpenClaw Case Study: Data Analytics Platform
