Monitoring and Alerting Plan
This article describes a monitoring and alerting setup built on Prometheus and Grafana.
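1. Monitoring Architecture
The stack is layered: Grafana dashboards query the Prometheus server, which scrapes one exporter per component.
- Grafana Dashboard
- Prometheus Server
  - Nova Exporter
  - Neutron Exporter
  - Keystone Exporter
  - MariaDB Exporter
  - RabbitMQ Exporter
  - Memcached Exporter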
2. Prometheus Configuration
2.1 Deploy Prometheus
bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace
2.2 Alert Rules
yaml
# prometheus-rules.yaml
# Note: PrometheusRule is a Prometheus Operator resource. The plain prometheus
# chart from 2.1 does not install that CRD, so this manifest assumes the
# Operator CRDs are already present (for example via kube-prometheus-stack).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openstack-alerts
  namespace: monitoring
spec:
  groups:
    - name: openstack
      interval: 30s
      rules:
        # Nova service monitoring
        - alert: NovaComputeDown
          expr: up{job="nova-compute"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Nova compute service down"
        # Neutron agent monitoring
        - alert: NeutronAgentDown
          expr: neutron_agent_state{adminState="up"} == 0
          for: 5m
          labels:
            severity: warning
        # Database connection pool usage
        - alert: MariaDBConnectionPoolHigh
          expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
          for: 5m
          labels:
            severity: warning
        # RabbitMQ queue backlog
        - alert: RabbitMQQueueBacklog
          expr: rabbitmq_queue_messages > 1000
          for: 10m
          labels:
            severity: warning
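To make the MariaDBConnectionPoolHigh threshold concrete, here is a minimal shell sketch of the same arithmetic for a single sample. The function name `pool_usage_state` is hypothetical and exists only for illustration; it does not model the `for: 5m` hold period that Prometheus applies before the alert fires.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any exporter or Prometheus tooling).
# Mirrors the rule expression:
#   threads_connected / max_connections > 0.8
# Prints "firing", "ok", or "unknown" (when max_connections is not positive).
pool_usage_state() {
  local connected="$1" max="$2" threshold="${3:-0.8}"
  awk -v c="$connected" -v m="$max" -v t="$threshold" 'BEGIN {
    if (m <= 0) { print "unknown"; exit }
    print (((c / m) > t) ? "firing" : "ok")
  }'
}
```

For example, `pool_usage_state 130 151` compares 130 connected threads against a limit of 151, a ratio of about 0.86, which is above the 0.8 threshold.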
3. Grafana Dashboard
3.1 Deploy Grafana
bash
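# Add the Grafana chart repository first; section 2.1 only added prometheus-community
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set adminPassword='<your-grafana-password>'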
3.2 Access Grafana
bash
kubectl port-forward -n monitoring svc/grafana 3000:80
# Open http://localhost:3000
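Before opening the browser, it can help to confirm that Grafana is answering through the port-forward. The sketch below assumes the port-forward from this section is running and relies on Grafana's `/api/health` endpoint, which reports `"database": "ok"` when healthy. The function name `grafana_is_healthy` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if Grafana's health endpoint reports a
# healthy database, non-zero otherwise. The base URL defaults to the
# port-forward address used in this section.
grafana_is_healthy() {
  local base_url="${1:-http://localhost:3000}"
  local body
  # -f makes curl fail on HTTP errors; --max-time avoids hanging forever.
  body="$(curl -fsS --max-time 5 "${base_url}/api/health")" || return 1
  printf '%s' "$body" | grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}
```

A typical call would be `grafana_is_healthy && echo "Grafana is up"`.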
3.3 Import Dashboards
- Log in to Grafana
- Import dashboards by ID:
  - OpenStack Overview: custom
  - Node Exporter: 1860
  - MySQL: 7362
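4. Key Monitoring Metrics
4.1 Nova Metrics
# Number of running VMs
nova_running_vms
# VM creation error rate (errors per second, averaged over 5m)
rate(nova_instance_create_errors[5m])
# Hypervisor vCPU utilization (%)
nova_hypervisor_vcpus_used / nova_hypervisor_vcpus * 100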
4.2 Neutron Metrics
# Agent state
neutron_agent_state
# Number of networks
neutron_networks_total
# Number of ports
neutron_ports_total
4.3 Database Metrics
# Current connections
mysql_global_status_threads_connected
# Slow queries (cumulative counter)
mysql_global_status_slow_queries
# QPS
rate(mysql_global_status_questions[5m])
4.4 RabbitMQ Metrics
# Queue depth
rabbitmq_queue_messages
# Message publish rate
rate(rabbitmq_queue_messages_published_total[5m])
# Number of connections
rabbitmq_connections
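The expressions in this section can be tried outside Grafana by sending them to the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not a definitive client: the function name `prom_query` and the `PROM_URL` variable are assumptions made for this example, and the default address assumes Prometheus has been made reachable on `localhost:9090`, for instance through a port-forward.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs an instant query against the Prometheus HTTP API
# and prints the raw JSON response. PROM_URL is an assumed variable name.
prom_query() {
  local expr="$1"
  local base_url="${PROM_URL:-http://localhost:9090}"
  # -G sends the data as URL query parameters; --data-urlencode escapes
  # characters such as {, }, " and / that appear in PromQL.
  curl -fsS -G --max-time 10 \
    --data-urlencode "query=${expr}" \
    "${base_url}/api/v1/query"
}
```

For example, `prom_query 'rate(mysql_global_status_questions[5m])'` returns the current QPS series as JSON. Letting curl do the URL encoding avoids hand-escaping the brackets and quotes in each expression.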
5. Alert Notification
5.1 Configure Alertmanager
yaml
# alertmanager-config.yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: '<your-smtp-password>'
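One way to check that the route and the email receiver work end to end is to post a hand-made alert to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below is an illustration under assumptions: the function names and the `ALERTMANAGER_URL` variable are hypothetical, and the default address assumes Alertmanager is reachable on `localhost:9093`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sending a manual test alert.

# Builds the JSON body: a one-element array with labels and annotations.
# The inputs are inserted as-is, so they must not contain double quotes.
build_test_alert() {
  local name="${1:-TestAlert}" severity="${2:-warning}"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]' \
    "$name" "$severity"
}

# Posts the alert. ALERTMANAGER_URL is an assumed variable name.
send_test_alert() {
  local base_url="${ALERTMANAGER_URL:-http://localhost:9093}"
  build_test_alert "$@" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${base_url}/api/v2/alerts"
}
```

Running `send_test_alert NovaComputeDown critical` should produce an email after the configured `group_wait` if the SMTP settings are correct.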
6. Log Aggregation
6.1 EFK Stack
bash
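# Add the Elastic chart repository
helm repo add elastic https://helm.elastic.co
helm repo update
# Deploy Elasticsearch
helm install elasticsearch elastic/elasticsearch \
  -n logging --create-namespace
# Deploy Fluentd
kubectl apply -f fluentd-daemonset.yaml
# Deploy Kibana
helm install kibana elastic/kibana -n logging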
7. Health Check Script
bash
#!/bin/bash
# health-check.sh
echo "=== OpenStack health check ==="
# 1. Service status
echo "[1/5] Checking service status"
openstack service list
# 2. Compute services
echo "[2/5] Checking compute services"
openstack compute service list
# 3. Network agents
echo "[3/5] Checking network agents"
openstack network agent list
# 4. Hypervisors
echo "[4/5] Checking hypervisors"
openstack hypervisor list
openstack hypervisor stats show
# 5. Pod status (lists pods that are neither Running nor Completed)
echo "[5/5] Checking pod status"
kubectl get pods -n openstack | grep -v Running | grep -v Completed
echo "=== Check complete ==="
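The script above only prints its findings, so a cron job or CI step cannot tell whether anything is wrong. As a possible extension, the sketch below counts unhealthy pods so the result can drive an exit code. The function name `count_unhealthy_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods --no-headers`, where the third column is STATUS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints how many pods are neither Running nor Completed.
count_unhealthy_pods() {
  # Column 3 is STATUS in the default output; "n + 0" prints 0 for no matches.
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}
```

It could be used at the end of the script like this: `bad="$(kubectl get pods -n openstack --no-headers | count_unhealthy_pods)"; [ "$bad" -eq 0 ] || exit 1`.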
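8. Monitoring Best Practices
- Set alert thresholds deliberately, and pair them with a `for:` duration so that short spikes do not fire (as in the rules in section 2.2)
- Avoid alert storms by grouping related alerts (see `group_by` in section 5.1)
- Review alert rules regularly
- Retain historical monitoring data
- Establish an on-call rotation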