Microservice Monitoring: Prometheus and Grafana in Practice
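Hi everyone, I'm 欧阳瑞 (Rich Own). Today I want to talk about microservice monitoring. As a full-stack developer, I see monitoring as key to keeping a system running reliably, so in this post I'll share my hands-on experience with Prometheus and Grafana.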
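Why Monitor?
| Scenario | What monitoring gives you |
|---|---|
| Troubleshooting | Locate problems quickly |
| Performance tuning | Find performance bottlenecks |
| Capacity planning | Forecast resource needs |
| Security auditing | Trace anomalous behavior |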
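Introduction to Prometheus
Prometheus is an open-source monitoring system with the following features:
- A multi-dimensional data model (metric names plus key/value labels)
- A flexible query language (PromQL)
- An efficient time series database
- Built-in alerting rules, with notifications routed through Alertmanager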
Installing Prometheus
bash
# Install with Docker
docker run -d --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
Configuration File
yaml
# prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:3000']
    metrics_path: '/metrics'
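Metric Types
python
from prometheus_client import Counter, Gauge, Histogram, Summary
# Counter: a value that only goes up (and resets to zero on restart)
http_requests_total = Counter('http_requests_total', 'Total HTTP requests')
# Gauge: a value that can go up and down
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')
# Histogram: observations counted into configurable buckets
request_duration = Histogram('request_duration_seconds', 'Request duration')
# Summary: tracks the count and sum of observations
response_size = Summary('response_size_bytes', 'Response size')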
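The histogram is the type that tends to confuse people, so here is a minimal, standard-library-only sketch of how observations end up in cumulative buckets. `ToyHistogram` and its bucket bounds are my own illustrative assumptions; this is not how `prometheus_client` is implemented internally.

```python
import bisect
import math


class ToyHistogram:
    """Toy model of a histogram (hypothetical, for illustration only)."""

    def __init__(self, bounds=(0.1, 0.5, 1.0)):
        # Upper bounds of the buckets; a final +Inf bucket catches everything.
        self.bounds = sorted(bounds) + [math.inf]
        self.per_bucket = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Pick the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.per_bucket[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        # Exposed bucket counts are cumulative: each includes all smaller buckets.
        total, out = 0, []
        for bound, n in zip(self.bounds, self.per_bucket):
            total += n
            out.append((bound, total))
        return out


hist = ToyHistogram()
for seconds in (0.05, 0.3, 0.3, 0.8, 2.0):
    hist.observe(seconds)
print(hist.cumulative())
```

The cumulative layout is what lets a query estimate quantiles later from nothing more than the bucket counters.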
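Hands-On: Monitoring an API Service
python
from flask import Flask
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# Request counter, labeled by HTTP method and endpoint
REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
# Request latency in seconds (default buckets)
DURATION = Histogram('request_duration_seconds', 'Request duration')

@app.route('/')
@DURATION.time()  # observe how long each call takes
def index():
    REQUESTS.labels(method='GET', endpoint='/').inc()
    return 'Hello World'

@app.route('/metrics')
def metrics():
    # Expose every registered metric in the Prometheus text format
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

if __name__ == '__main__':
    # Listen on all interfaces so Prometheus can reach the service from another container
    app.run(host='0.0.0.0', port=3000)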
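When Prometheus scrapes `/metrics`, it receives plain text, one sample per line. To show what that looks like, here is a rough sketch that parses a small sample. The `SAMPLE` text is illustrative rather than captured from the service above, and `parse_samples` is a hypothetical helper that ignores timestamps and escaped quotes, so treat it as a reading aid and not a complete parser.

```python
import re

# Illustrative sample of the text exposition format; real output has more series.
SAMPLE = '''# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/"} 3.0
request_duration_seconds_count 3.0
'''

# name, optional {labels}, then the value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples; comments and blank lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # this sketch silently ignores lines it does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


for sample in parse_samples(SAMPLE):
    print(sample)
```

Each distinct combination of metric name and label values is its own time series, which is why the label design matters so much.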
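Grafana Setup
bash
# Install Grafana with Docker
# Grafana listens on port 3000 by default; pick another host port if the API service already uses 3000 on this machine
docker run -d --name grafana \
  -p 3000:3000 \
  -v /path/to/grafana-data:/var/lib/grafana \
  grafana/grafana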
Configuring the Data Source
yaml
# Add Prometheus as a data source (Grafana provisioning file)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true
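Before wiring up dashboards, I find it useful to check that the data source URL actually answers queries. Below is a small sketch against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`), using only the standard library. The helper names are my own, and `query` only works if a Prometheus server is reachable at the URL you pass in.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({'query': promql})
    return base_url.rstrip('/') + '/api/v1/query?' + params


def extract_values(payload):
    """Return (labels, value) pairs from an instant-vector response."""
    if payload.get('status') != 'success':
        raise RuntimeError('query failed: %s' % payload.get('error'))
    return [(s['metric'], float(s['value'][1])) for s in payload['data']['result']]


def query(base_url, promql, timeout=5):
    # Performs a real HTTP request; requires a reachable Prometheus server.
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_values(json.load(resp))


# Hand-written response in the shape the API returns, used to exercise the parser.
example = {
    'status': 'success',
    'data': {
        'resultType': 'vector',
        'result': [{'metric': {'job': 'api-service'}, 'value': [1700000000, '1']}],
    },
}
print(build_query_url('http://prometheus:9090', 'up'))
print(extract_values(example))
```

Sample values arrive as strings in the JSON response, which is why the sketch converts them with `float`.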
Creating a Dashboard
json
{
  "dashboard": {
    "id": null,
    "title": "API Monitoring",
    "panels": [
      {
        "type": "graph",
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "type": "graph",
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))",
            "legendFormat": "P95"
          }
        ]
      }
    ]
  }
}
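The P95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The sketch below is my simplified version of that idea, assuming sorted buckets that end with `+Inf`, a lowest bucket that starts at 0, and `0 < q <= 1`. It skips the edge cases the real function handles.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes sorted buckets ending with +Inf and 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Linear interpolation between the bucket's lower and upper bound.
    return lower + (upper - lower) * (rank - below) / (count - below)


buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = quantile_from_buckets(0.5, buckets)
p95 = quantile_from_buckets(0.95, buckets)
print(p50, p95)
```

Because the result is interpolated, its accuracy depends on the bucket boundaries: with these buckets, any P95 above one second is reported as exactly 1.0.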
Alerting Rules
yaml
# alerting_rules.yml (load it through rule_files in prometheus.yml)
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        # http_errors_total is assumed to be an error counter exported by your service
        expr: rate(http_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/s for API service"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
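The `for: 5m` clause is easy to misread: an alert does not fire the moment its expression becomes true. It first sits in a pending state and only fires once the condition has held for the whole duration. Here is a toy simulation of that behavior. `alert_states` and the sample data are my own illustration, not Prometheus code, and the model assumes one sample per rule evaluation.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each evaluation.

    samples: list of (timestamp_seconds, value) pairs, one per rule evaluation.
    """
    states = []
    active_since = None  # when the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append('firing' if held >= for_seconds else 'pending')
        else:
            active_since = None  # a false evaluation resets the timer
            states.append('inactive')
    return states


# One evaluation per minute; the condition dips below the threshold at t=240.
samples = [(0, 0.05), (60, 0.2), (120, 0.3), (180, 0.2), (240, 0.05),
           (300, 0.2), (360, 0.2), (420, 0.2), (480, 0.2), (540, 0.2), (600, 0.2)]
states = alert_states(samples, threshold=0.1, for_seconds=300)
print(states)
```

In this run the brief dip at t=240 resets the timer, so the alert only fires at t=600, five minutes after the condition became true again at t=300.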
Best Practices
1. Metric naming conventions
python
# <namespace>_<name>_<unit>, with a _total suffix for counters
http_requests_total
memory_usage_bytes
request_duration_seconds
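If you want to enforce a convention like this in code review or CI, a tiny checker goes a long way. The sketch below validates names against the classic Prometheus metric-name pattern and a suffix list. `UNIT_SUFFIXES` is a hypothetical allow-list I picked for illustration, so adjust it to the units your services actually use.

```python
import re

# Classic pattern for valid Prometheus metric names.
VALID_NAME = re.compile(r'^[a-zA-Z_:][a-zA-Z0-9_:]*$')
# Hypothetical allow-list of suffixes for this sketch; extend as needed.
UNIT_SUFFIXES = ('_seconds', '_bytes', '_ratio', '_total', '_info')


def check_metric_name(name):
    """Return a list of problems; an empty list means the name looks fine."""
    problems = []
    if not VALID_NAME.match(name):
        problems.append('invalid characters')
    if not name.endswith(UNIT_SUFFIXES):
        problems.append('missing unit or _total suffix')
    return problems


for candidate in ('http_requests_total', 'request_duration_ms', 'request-duration'):
    print(candidate, check_metric_name(candidate))
```

The check flags `request_duration_ms` because Prometheus conventions favor base units such as seconds and bytes.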
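2. Label management
python
from prometheus_client import Counter
# Every label passed to .labels() must be declared on the metric first
REQUESTS = Counter('http_requests_total', 'Total HTTP requests',
                   ['method', 'endpoint', 'status_code'])
REQUESTS.labels(
    method='GET',
    endpoint='/api/users',
    status_code='200'
).inc()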
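One thing to watch with labels like `endpoint`: every distinct label value creates a new time series, so raw paths such as `/api/users/42` can blow up the number of series. A common remedy is to normalize the path before using it as a label value. Here is a minimal sketch. `normalize_endpoint` and its rules (numeric IDs and UUIDs become `:id`) are assumptions for illustration; a real service would usually take the route template from its web framework.

```python
import re

# Hypothetical rule: numeric IDs and UUIDs are treated as path parameters.
_UUID = re.compile(
    r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)


def normalize_endpoint(path):
    """Collapse variable path segments so the 'endpoint' label stays low-cardinality."""
    path = path.split('?', 1)[0]  # drop the query string
    segments = []
    for segment in path.split('/'):
        if segment.isdigit() or _UUID.match(segment):
            segments.append(':id')
        else:
            segments.append(segment)
    return '/'.join(segments) or '/'


print(normalize_endpoint('/api/users/42?page=2'))
print(normalize_endpoint('/api/orders/123e4567-e89b-12d3-a456-426614174000/items'))
```

With this in place, requests for any user ID are counted under the single series `endpoint="/api/users/:id"`.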
3. Visualization tips
json
{
  "panels": [
    {
      "type": "stat",
      "title": "Average Latency",
      "targets": [
        {
          "expr": "sum(rate(request_duration_seconds_sum[5m])) / sum(rate(request_duration_seconds_count[5m]))"
        }
      ]
    },
    {
      "type": "gauge",
      "title": "Memory Usage (%)",
      "targets": [
        {
          "expr": "memory_usage_bytes / memory_total_bytes * 100"
        }
      ]
    }
  ]
}
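Summary
Prometheus and Grafana are a widely used pairing for monitoring. With well-designed metrics and sensible dashboards, you can keep a close watch on how your system is running.
My bearded dragon, Hash, has its own take on monitoring: it is always watching for changes in its surroundings, which may well be nature's own "monitoring system"!
If you're interested in monitoring, leave a comment and let's talk! I'm 欧阳瑞, and the geek's road never ends!
Tech stack: Prometheus · Grafana · Monitoring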