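Automated Operations in Practice: A Complete Approach to Monitoring, Alerting, and Automation
Hi everyone, I'm Di Ge. Automated operations is key to keeping systems running reliably. From monitoring and alerting to automated operations, and from self-healing to AIOps, the field has steadily moved from manual work to automation. Today I'll walk through the best practices for automated operations.
Monitoring and Alerting Architecture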
┌─────────────────────────────────────────────────────────────────────┐
│                Monitoring and Alerting Architecture                 │
├─────────────────────────────────────────────────────────────────────┤
│   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐   │
│   │ Data Collection  │   │   Data Storage   │   │     Alerting     │   │
│   │     Exporter     │   │    Prometheus    │   │     AlertMgr     │   │
│   └────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘   │
│            │                      │                      │          │
│            ▼                      ▼                      ▼          │
│   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐   │
│   │  Visualization   │   │    Automated     │   │   Log Analysis   │   │
│   │     Grafana      │   │    Operations    │   │     ELK/Loki     │   │
│   └──────────────────┘   └──────────────────┘   └──────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
Monitoring Configuration
Prometheus Configuration
yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'spring-app'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/actuator/prometheus'
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
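The scrape configuration above only collects metrics. For Prometheus to evaluate alerting rules and hand firing alerts over to Alertmanager, the same file usually also needs `rule_files` and `alerting` sections. Here is a minimal sketch; the rules directory and the `alertmanager:9093` address are my own placeholders, not part of the original setup:

```yaml
# Additional top-level sections for prometheus.yml
rule_files:
  - /etc/prometheus/rules/*.yml        # assumed location of the rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed Alertmanager host:port
```

Without `rule_files`, Prometheus never loads the rules; without `alerting`, the alerts show up in the Prometheus UI but no notification is sent.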
Alerting Rules
yaml
groups:
  - name: example_alerts
    rules:
      - alert: HighCPUUsage
        # rate() over 5m is steadier than irate() for alerting; irate() is meant for graphing fast-moving counters
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Instance {{ $labels.instance }} is not responding"
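Rules only decide when an alert fires; where it ends up is Alertmanager's job (the AlertMgr box in the diagram). Below is a minimal `alertmanager.yml` sketch that routes on the `severity` label used above. The receiver names and webhook URLs are hypothetical, and the timing values are just reasonable starting points:

```yaml
route:
  receiver: default
  group_by: ['alertname', 'job']   # bundle related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall

receivers:
  - name: default
    webhook_configs:
      - url: 'http://alert-hook.example.internal/default'   # hypothetical endpoint
  - name: oncall
    webhook_configs:
      - url: 'http://alert-hook.example.internal/oncall'    # hypothetical endpoint

inhibit_rules:
  # While a critical alert fires for an instance, mute its warning-level alerts
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: ['instance']
```

Grouping and inhibition are the two cheapest ways to cut down on notification noise: one message per incident instead of one per alert.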
Automated Operations
Autoscaling
yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
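For some intuition about what `averageUtilization: 70` does: utilization is measured against each container's resource request, and the HPA controller's core calculation is the ratio formula below. The numbers in the worked example (4 replicas averaging 90% CPU) are made up for illustration:

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 4 \times \frac{90}{70} \right\rceil
  = \lceil 5.14 \rceil
  = 6
\]
```

When several metrics are configured, as with CPU and memory above, the controller computes a replica count for each metric and takes the largest, then keeps the result within `minReplicas` and `maxReplicas` (3 and 10 here).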
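Automatically Restarting Failed Pods
Note that a PodDisruptionBudget does not restart anything by itself: it caps how many Pods voluntary disruptions such as node drains can take down at once. Restarting crashed containers is handled by the kubelet, and replacing lost Pods by the Deployment controller.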
yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: app
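Since this section is about restarting failed Pods, here is a sketch of the piece that actually triggers restarts: a liveness probe on the Deployment. I'm assuming the `app` Deployment is the Spring Boot service scraped earlier and that its Actuator health probe endpoints are enabled; the image name and resource numbers are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  # replicas is intentionally omitted so the HPA owns the replica count
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          resources:
            requests:            # HPA utilization targets are percentages of these
              cpu: 500m
              memory: 512Mi
          livenessProbe:         # kubelet restarts the container after 3 failed checks in a row
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:        # failing Pods stop receiving traffic but are not restarted
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
```

The two probes do different jobs on purpose: liveness handles "this process is stuck, restart it", readiness handles "this process is busy or warming up, leave it alone but stop sending it requests".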
Automated Backups
bash
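#!/bin/bash
# Database backup script
set -euo pipefail   # stop on the first failed command or unset variable

BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d_%H%M%S)
: "${DB_PASSWORD:?DB_PASSWORD must be set}"
mkdir -p "$BACKUP_DIR"

# Back up MySQL (--single-transaction takes a consistent InnoDB snapshot without locking tables)
mysqldump -u root -p"$DB_PASSWORD" --single-transaction example > "$BACKUP_DIR/mysql_backup_$DATE.sql"

# Back up Redis (SAVE blocks the server while the snapshot is written; consider BGSAVE on busy instances)
redis-cli SAVE
cp /var/lib/redis/dump.rdb "$BACKUP_DIR/redis_backup_$DATE.rdb"

# Delete backups older than 7 days (only files this script created)
find "$BACKUP_DIR" -type f -name '*_backup_*' -mtime +7 -delete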
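A backup is only "automated" once something runs it on a schedule. Since the rest of this setup lives on Kubernetes, one option is a CronJob. The sketch below covers only the MySQL part and makes several assumptions of mine: a MySQL Service named `mysql`, a Secret `mysql-credentials` with a `password` key, and a PersistentVolumeClaim `backup-pvc` for the dump files.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
spec:
  schedule: "0 2 * * *"          # every day at 02:00 (controller time zone)
  concurrencyPolicy: Forbid      # never run two backups at the same time
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: mysql:8.0
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  mysqldump -h mysql -u root -p"$DB_PASSWORD" --single-transaction example
                  > /backup/mysql_backup_$(date +%Y%m%d_%H%M%S).sql
              env:
                - name: DB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mysql-credentials   # hypothetical Secret
                      key: password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc           # hypothetical PVC
```

On a plain host, a crontab entry pointing at the script does the same job with far less ceremony.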
Log Analysis
ELK Configuration
yaml
# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: elasticsearch:8.8.0
    environment:
      - discovery.type=single-node
      # 8.x enables TLS and authentication by default; turned off here for local testing only
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
  logstash:
    image: logstash:8.8.0
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    depends_on:
      - elasticsearch
  kibana:
    image: kibana:8.8.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
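One gotcha with this compose file: it bind-mounts `./logstash/config/logstash.yml`, and if that file does not exist on the host, Docker creates a directory in its place and Logstash fails to start. A minimal file to put there might look like the sketch below; the two settings are just a conservative starting point, not the author's actual configuration:

```yaml
# ./logstash/config/logstash.yml
api.http.host: "0.0.0.0"   # make the Logstash monitoring API reachable from outside the container
log.level: info
```

The pipeline definitions themselves go into `./logstash/pipeline`, which the compose file mounts separately.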
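Best Practices Checklist
| Area | Best practice |
|---|---|
| Monitoring | Use Prometheus + Grafana |
| Alerting | Configure sensible alerting rules and avoid alert fatigue |
| Scaling | Use HPA driven by CPU, memory, or custom metrics |
| Backups | Back up on a regular schedule and automate cleanup |
| Logging | Use ELK or Loki for log analysis |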
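The Grafana half of "Prometheus + Grafana" doesn't need to be clicked together by hand. Grafana can load data sources from provisioning files at startup; here is a minimal sketch, assuming Prometheus is reachable at `http://prometheus:9090` (my placeholder) and the file is placed under `/etc/grafana/provisioning/datasources/`:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                # Grafana's backend queries Prometheus on the browser's behalf
    url: http://prometheus:9090  # assumed Prometheus address
    isDefault: true
```

Keeping this file in version control alongside the Prometheus config means a fresh Grafana instance comes up already wired to the right data source.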
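Speaking of automated operations, my husky, whose name is Docker, recently figured out "automatic feeding": every day at mealtime he goes and waits by his food bowl on his own, and even taps the bowl with his paw to remind me. That's a higher level of automation than our ops systems 😂
I'm Di Ge, see you next time!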