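Automated Operations in Practice: A Complete Approach to Monitoring, Alerting, and Automation

Hi everyone, I'm Di Ge. Automated operations are key to keeping systems running reliably. From monitoring and alerting to automated operations, and from self-healing to AIOps, we have moved step by step from manual work to automation. Today let's go over some best practices for automated operations.

Monitoring and alerting architecture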

┌──────────────────────────────────────────────────────────┐
│           Monitoring and Alerting Architecture           │
├──────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │  Collection  │  │   Storage    │  │   Alerting   │    │
│  │   Exporter   │  │  Prometheus  │  │ Alertmanager │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘    │
│         │                 │                 │            │
│         ▼                 ▼                 ▼            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │ Visualization│  │  Automation  │  │ Log Analysis │    │
│  │   Grafana    │  │  Operations  │  │   ELK/Loki   │    │
│  └──────────────┘  └──────────────┘  └──────────────┘    │
└──────────────────────────────────────────────────────────┘

Monitoring configuration

Prometheus configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'spring-app'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/actuator/prometheus'
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
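
Once Prometheus is scraping, it is worth checking that every target is actually up. Here is a minimal Python sketch that asks the Prometheus HTTP API (`/api/v1/query`) for the `up` metric and lists the targets reporting 0. The base URL comes from the `prometheus` job above; the helper names `fetch_up` and `down_targets` and the `SAMPLE` payload are my own, hand-written for illustration rather than captured from a real server.

```python
import json
import urllib.parse
import urllib.request


def fetch_up(base_url="http://localhost:9090"):
    """Run the instant query `up` against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{query}", timeout=5) as resp:
        return json.load(resp)


def down_targets(payload):
    """Return (job, instance) pairs whose `up` sample is 0."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    down = []
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [timestamp, "value-as-string"].
        _timestamp, value = series["value"]
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job"), labels.get("instance")))
    return down


# Hand-written sample in the API's response shape (assumption, not real output).
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "prometheus", "instance": "localhost:9090"},
             "value": [1700000000.0, "1"]},
            {"metric": {"job": "spring-app", "instance": "app:8080"},
             "value": [1700000000.0, "0"]},
        ],
    },
}
```

Against a live server you would call `down_targets(fetch_up())`; with the sample above, only the `spring-app` target is reported as down.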

Alert rules

groups:
  - name: example_alerts
    rules:
      - alert: HighCPUUsage
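        # rate() over a 5m window is steadier for alerting than irate(), which only looks at the last two samples
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80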
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Instance {{ $labels.instance }} is not responding"
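
To see what the HighCPUUsage expression actually computes, here is a small worked sketch in Python. `node_cpu_seconds_total{mode="idle"}` counts the seconds each core has spent idle, so its per-second rate is the idle fraction of that core; averaging over the cores of an instance and subtracting from 100 gives the busy percentage. The function name and the numbers in the usage note are made up for illustration.

```python
from statistics import mean


def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Busy CPU percentage for one instance.

    idle_start / idle_end: per-core values of the idle counter (in seconds)
    at the start and end of the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need matching, non-empty per-core samples")
    # Per-second increase of each idle counter = fraction of time the core was idle.
    idle_fractions = [
        (end - start) / window_seconds for start, end in zip(idle_start, idle_end)
    ]
    # Same shape as: 100 - (avg by(instance) (<idle rate>) * 100)
    return 100 - mean(idle_fractions) * 100


def breaches_threshold(usage_percent, threshold=80):
    """Mirror the `> 80` comparison in the rule (the `for: 5m` hold is not modeled)."""
    return usage_percent > threshold
```

For example, two cores that gained 30s and 6s of idle time over a 60s window have idle fractions 0.5 and 0.1, so the instance is roughly 70% busy and stays under the 80 threshold.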

Automated operations

Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
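
If you are wondering how the HPA turns those utilization targets into a replica count, the Kubernetes documentation gives the core formula as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skips scaling when the ratio is within a tolerance (10% by default), and takes the largest result when several metrics are configured. The Python sketch below follows that description under simplifying assumptions: it ignores Pod readiness, missing metrics, and stabilization windows, and the function names are mine.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     tolerance=0.1):
    """Replica count suggested by a single metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_decision(current_replicas, metrics, min_replicas=3, max_replicas=10):
    """Combine several metrics and clamp to the HPA bounds.

    metrics: list of (current_utilization, target_utilization) pairs,
    e.g. [(cpu_now, 70), (memory_now, 80)] for the manifest above.
    """
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics
    ]
    # The largest proposal wins, then minReplicas/maxReplicas apply.
    return max(min_replicas, min(max_replicas, max(proposals)))
```

With 4 replicas, CPU at 90% against the 70% target and memory at 60% against 80%, CPU asks for 6 replicas and memory for 3, so the sketch settles on 6.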

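Keeping Pods available (PodDisruptionBudget)

A PodDisruptionBudget does not restart failed Pods by itself: that job belongs to the kubelet (restart policy and liveness probes) and the Deployment controller. What the PDB below does is keep at least 2 Pods of the app running during voluntary disruptions such as node drains and rolling node upgrades.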

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: app
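
As a rough mental model of `minAvailable: 2`, the number of Pods that may be evicted voluntarily at any moment is the number of healthy Pods minus `minAvailable`, never below zero. The tiny Python sketch below only illustrates that arithmetic; the function name is hypothetical and it is not how you would query a real cluster.

```python
def disruptions_allowed(healthy_pods, min_available=2):
    """How many Pods could be evicted voluntarily right now."""
    return max(0, healthy_pods - min_available)
```

With the HPA floor of 3 replicas all healthy, one Pod can be drained at a time; if one of them is already unhealthy, evictions are blocked until it recovers.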

Automated backups

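#!/bin/bash
# Stop on errors, unset variables, and failed pipeline stages
set -euo pipefail

# Database backup script
BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d_%H%M%S)

# Back up MySQL (DB_PASSWORD must be set; note that -p on the command line is visible in the process list)
mysqldump -u root -p"${DB_PASSWORD:?DB_PASSWORD is not set}" example > "$BACKUP_DIR/mysql_backup_$DATE.sql"

# Back up Redis (SAVE blocks the server until the snapshot is written)
redis-cli SAVE
cp /var/lib/redis/dump.rdb "$BACKUP_DIR/redis_backup_$DATE.rdb"

# Delete backups older than 7 days
find "$BACKUP_DIR" -type f -mtime +7 -delete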
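
Before letting a cleanup job delete anything, I like to do a dry run. The Python sketch below lists the files that the 7-day retention rule would remove without touching them. It mirrors `find -mtime +7`, which counts age in whole days (fractions dropped), so a file has to be at least 8 days old to match. The function names are mine, and `/backup` is the directory from the script above.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def is_expired(mtime, now, max_age_days=7):
    """True if a file's age in whole days exceeds max_age_days (like `-mtime +7`)."""
    age_days = int((now - mtime) // SECONDS_PER_DAY)
    return age_days > max_age_days


def expired_backups(backup_dir="/backup", max_age_days=7, now=None):
    """Dry run: return the backup files the cleanup step would delete."""
    now = time.time() if now is None else now
    return sorted(
        path
        for path in Path(backup_dir).rglob("*")  # find also recurses into subdirectories
        if path.is_file() and is_expired(path.stat().st_mtime, now, max_age_days)
    )
```

Printing `expired_backups()` on the backup host shows what the next cleanup would remove, which is a cheap sanity check before trusting `-delete`.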

Log analysis

ELK configuration

# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: elasticsearch:8.8.0
    environment:
      - discovery.type=single-node
      # Elasticsearch 8.x enables security (TLS + auth) by default; disabled here for local testing only
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  logstash:
    image: logstash:8.8.0
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    depends_on:
      - elasticsearch

  kibana:
    image: kibana:8.8.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

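Best practices checklist

Area	Best practice
Monitoring	Use Prometheus + Grafana
Alerting	Write sensible alert rules and avoid alert fatigue
Scaling	Use HPA, driven by CPU, memory, or custom metrics
Backups	Back up on a schedule and automate cleanup
Logging	Use ELK or Loki for log analysis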

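Speaking of automation, my husky, who is named Docker, recently learned "automatic feeding": at mealtime every day he walks over to his food bowl on his own and waits, and he even taps the bowl with his paw to remind me. That is a higher level of automation than our ops systems 😂

I'm Di Ge, see you next time!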