OpenStack on Kubernetes in Production: A Hands-On Guide (Part 17)

Monitoring and Alerting

This article describes a monitoring and alerting stack for OpenStack on Kubernetes, built on Prometheus and Grafana.

1. Monitoring Architecture

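Grafana dashboards read from the Prometheus server, which scrapes one exporter per service:

Grafana Dashboard
  └── Prometheus Server
        ├── Nova Exporter
        ├── Neutron Exporter
        ├── Keystone Exporter
        ├── MariaDB Exporter
        ├── RabbitMQ Exporter
        └── Memcached Exporter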

2. Prometheus Configuration

2.1 Deploying Prometheus

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace
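
After the install, it helps to confirm that every pod in the namespace is actually Ready before moving on. The sketch below is one way to do that; `pods_all_ready` is a hypothetical helper name (not part of kubectl or Helm), and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Sketch: succeed only if every pod listed on stdin is fully Ready.
# Input is the output of `kubectl get pods --no-headers`, read from stdin
# so the logic can be exercised without a cluster.
pods_all_ready() {
  awk '
    {
      split($2, r, "/")   # READY column looks like "1/2"
      # Completed pods (finished Jobs) report 0/1 and are not a problem.
      if ($3 != "Completed" && r[1] != r[2]) {
        bad++
        print "not ready: " $1
      }
    }
    END { exit (bad > 0) }
  '
}

# Usage against a live cluster (namespace taken from the install command above):
#   kubectl get pods -n monitoring --no-headers | pods_all_ready && echo "all ready"
```

Reading from stdin keeps the check separate from the cluster call, so the same function can be reused for other namespaces.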

2.2 Alert Rules

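# prometheus-rules.yaml
# Note: PrometheusRule is a Prometheus Operator CRD. The plain prometheus chart
# from section 2.1 does not install it; the kube-prometheus-stack chart does.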
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openstack-alerts
  namespace: monitoring
spec:
  groups:
  - name: openstack
    interval: 30s
    rules:
    # Nova service monitoring
    - alert: NovaComputeDown
      expr: up{job="nova-compute"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Nova compute service down"
        
    # Neutron agent monitoring
    - alert: NeutronAgentDown
      expr: neutron_agent_state{adminState="up"} == 0
      for: 5m
      labels:
        severity: warning
        
    # Database connection pool usage
    - alert: MariaDBConnectionPoolHigh
      expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
      for: 5m
      labels:
        severity: warning
        
    # RabbitMQ queue backlog
    - alert: RabbitMQQueueBacklog
      expr: rabbitmq_queue_messages > 1000
      for: 10m
      labels:
        severity: warning

3. Grafana Dashboards

3.1 Deploying Grafana

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set adminPassword='<your-grafana-password>'

3.2 Accessing Grafana

kubectl port-forward -n monitoring svc/grafana 3000:80
# Open http://localhost:3000 in a browser
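
With the port-forward running, Grafana's `/api/health` endpoint gives a quick check without logging in. The sketch below only inspects the `database` field of that JSON response; `grafana_db_ok` is a hypothetical helper name, and the pattern match is a simplification that avoids depending on `jq`.

```shell
# Sketch: succeed if the health JSON on stdin reports "database": "ok".
grafana_db_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage (assumes the port-forward above is active on localhost:3000):
#   curl -fsS http://localhost:3000/api/health | grafana_db_ok && echo "Grafana is healthy"
```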

3.3 Importing Dashboards

Log in to Grafana, then import the following dashboards by ID:
- OpenStack Overview: custom (no public ID)
- Node Exporter: 1860
- MySQL: 7362

4. Key Metrics

4.1 Nova Metrics

# Number of running VMs
nova_running_vms

# VM creation error rate (errors per second over 5 minutes)
rate(nova_instance_create_errors[5m])

# Hypervisor vCPU utilization (percent)
nova_hypervisor_vcpus_used / nova_hypervisor_vcpus * 100
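
These expressions can also be evaluated from the command line through the Prometheus HTTP API (`/api/v1/query`). The sketch below is a thin wrapper around curl; `prom_query` and the `PROM_URL` variable are hypothetical names, and the address assumes you have port-forwarded the Prometheus server to localhost:9090 (the service name in the usage comment is an assumption about the chart's defaults).

```shell
# Sketch: run an instant PromQL query and print the raw JSON response.
# PROM_URL is an assumption; adjust it to wherever Prometheus is reachable.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
  # -G sends the url-encoded data as a query string on a GET request.
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   kubectl port-forward -n monitoring svc/prometheus-server 9090:80 &
#   prom_query 'nova_hypervisor_vcpus_used / nova_hypervisor_vcpus * 100'
```

Letting curl do the URL encoding avoids hand-escaping characters such as `/`, `*`, and `{}` in PromQL.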

4.2 Neutron Metrics

# Agent state
neutron_agent_state

# Number of networks
neutron_networks_total

# Number of ports
neutron_ports_total

4.3 Database Metrics

# Open connections
mysql_global_status_threads_connected

# Slow queries (cumulative counter)
mysql_global_status_slow_queries

# Queries per second (QPS)
rate(mysql_global_status_questions[5m])

4.4 RabbitMQ Metrics

# Queue depth
rabbitmq_queue_messages

# Message publish rate
rate(rabbitmq_queue_messages_published_total[5m])

# Number of connections
rabbitmq_connections

5. Alert Notifications

5.1 Configuring Alertmanager

# alertmanager-config.yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'

receivers:
- name: 'default'
  email_configs:
  - to: 'admin@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_password: '<your-smtp-password>'
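
Before relying on the email receiver, it is worth pushing a manual test alert through Alertmanager's v2 API (`POST /api/v2/alerts`) and confirming that the mail arrives. The sketch below only builds the JSON payload; `test_alert_payload` is a hypothetical helper, the alert name and summary are placeholders, and the address in the usage comment assumes Alertmanager has been port-forwarded to localhost:9093.

```shell
# Sketch: print a one-element alert array suitable for POST /api/v2/alerts.
# Arguments: alert name, severity label. Values must not contain quotes.
test_alert_payload() {
  local name="$1"
  local severity="$2"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]' \
    "$name" "$severity"
}

# Usage (assumes Alertmanager is reachable on localhost:9093):
#   test_alert_payload TestAlert warning |
#     curl -fsS -X POST -H 'Content-Type: application/json' \
#       --data-binary @- http://localhost:9093/api/v2/alerts
```

With the route shown above, the notification should be sent roughly `group_wait` (10s) after the alert is received.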

6. Log Aggregation

6.1 EFK Stack

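# Add the Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Deploy Elasticsearch
helm install elasticsearch elastic/elasticsearch \
  -n logging --create-namespace

# Deploy Fluentd
kubectl apply -f fluentd-daemonset.yaml

# Deploy Kibana
helm install kibana elastic/kibana -n logging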

7. Health Check Script

#!/bin/bash
# health-check.sh

echo "=== OpenStack Health Check ==="

# 1. Service status
echo "[1/5] Checking services"
openstack service list

# 2. Compute services
echo "[2/5] Checking compute services"
openstack compute service list

# 3. Network agents
echo "[3/5] Checking network agents"
openstack network agent list

# 4. Hypervisors
echo "[4/5] Checking hypervisors"
openstack hypervisor list
openstack hypervisor stats show

# 5. Pod status: list only pods that are neither Running nor Completed
echo "[5/5] Checking pod status"
kubectl get pods -n openstack --no-headers | grep -Ev 'Running|Completed' \
  || echo "All pods are Running or Completed"

echo "=== Check complete ==="
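
The script above prints the output of each command but always exits with status 0, so it cannot drive a cron job or CI gate on its own. One possible extension is a small wrapper that counts failed commands; `run_check` and `failures` are hypothetical names introduced for this sketch, and it only detects commands that exit non-zero, not services that are listed as down.

```shell
# Sketch: run a labeled check and count commands that exit non-zero.
failures=0

run_check() {
  local label="$1"
  shift
  echo "[$label]"
  if ! "$@"; then
    echo "FAILED: $label" >&2
    failures=$((failures + 1))
  fi
}

# Usage inside health-check.sh:
#   run_check "services" openstack service list
#   run_check "compute"  openstack compute service list
#   run_check "agents"   openstack network agent list
#   exit "$failures"
```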

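8. Monitoring Best Practices

Set reasonable alert thresholds
Avoid alert storms
Review alert rules regularly
Retain historical monitoring data
Establish an on-call rotation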

Next: Backup and Recovery