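Prometheus Monitoring Metrics Integration Guide
Introduction
In the cloud-native era, application monitoring is a core part of system observability. Prometheus, a CNCF graduated project, has become the de facto standard for cloud-native monitoring. Its multi-dimensional data model, flexible query language (PromQL), and efficient time-series storage engine make it well suited to monitoring distributed systems.
This article explains how to integrate application metrics with Prometheus, from basic concepts through advanced practice. A complete Python example shows how to expose and scrape custom metrics.
Prometheus Basics
Prometheus Architecture Overview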
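[Figure: Prometheus architecture. Components shown: application exposing a /metrics endpoint, service discovery, scrape targets, Pushgateway, Prometheus Server (HTTP pull), TSDB storage, PromQL queries, Alertmanager, Grafana, API clients.]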
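Core Concepts
1. Data Model
Prometheus uses a multi-dimensional data model. Each time series consists of:
- Metric name: identifies what is being measured
- Labels: key-value pairs that distinguish different dimensions of the same metric
- Timestamp: the time at which the sample was taken
- Sample value: a 64-bit floating-point number
Example of the exposition format:
http_requests_total{method="POST", handler="/api", status="200"} 1027
http_requests_total{method="POST", handler="/api", status="400"} 3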
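To make the data model concrete, here is a minimal sketch of how a metric name, a label set, and a value combine into one exposition line. The helper `format_sample` is a hypothetical name introduced for illustration; real applications should let a client library produce this output.

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as `name{key="value",...} value` (illustrative helper)."""

    def escape(text: str) -> str:
        # Backslash, double quote, and newline must be escaped in label values.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return f"{name} {value:g}"
    # Sorting the labels keeps the output stable between calls.
    body = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{body}}} {value:g}"


line = format_sample(
    "http_requests_total",
    {"method": "POST", "handler": "/api", "status": "200"},
    1027,
)
print(line)
```

Each distinct combination of label values is a separate time series, which is why the two example lines above are stored and queried independently.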
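2. Metric Types
Prometheus defines four core metric types:
- Counter: a monotonically increasing value used for cumulative totals
  - Formula: $C(t) = C(t-1) + \Delta$, with $\Delta \ge 0$
  - Examples: total requests, total errors
- Gauge: a value that can go up or down and reflects the current state
  - Examples: CPU usage, memory usage, concurrent connections
- Histogram: counts observations into buckets to capture their distribution
  - Formula: $P_{\text{quantile}} = \text{value at quantile}$
  - Example: request latency distribution
- Summary: similar to a histogram, but quantiles are computed on the client side
  - Example: request latency percentiles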
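The semantic differences between the types can be sketched in a few lines of plain Python. The classes below (`SketchCounter`, `SketchGauge`, `SketchHistogram`) are hypothetical teaching aids, not the client library's implementation; they only mirror the behavior described above.

```python
class SketchCounter:
    """Only ever increases: C(t) = C(t-1) + delta, with delta >= 0."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class SketchGauge:
    """Can be set to any value, so it reflects the current state."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class SketchHistogram:
    """Counts observations into cumulative buckets (each bucket means 'le' <= bound)."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)
        self.total = 0.0

    def observe(self, value: float) -> None:
        self.total += value
        for index, bound in enumerate(self.bounds):
            if value <= bound:
                # Cumulative: every bucket whose bound covers the value is incremented.
                self.counts[index] += 1


latency = SketchHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 2.0):
    latency.observe(seconds)
print(dict(zip(latency.bounds, latency.counts)))
```

Because the buckets are cumulative, the `+Inf` bucket always equals the total number of observations.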
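Integrating Monitoring Metrics
1. Expose a Metrics Endpoint
Prometheus pulls metrics over HTTP. The application must expose a /metrics endpoint that returns data in the Prometheus exposition format.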
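As a rough illustration of what "exposing an endpoint" means at the HTTP level, the sketch below serves a single hard-coded gauge using only the Python standard library. The metric name `demo_up` and the helper `fetch_metrics_once` are assumptions made for this example; in practice a client library generates the response body.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP demo_up Whether the demo app is up.\n"
            "# TYPE demo_up gauge\n"
            "demo_up 1\n"
        ).encode("utf-8")
        self.send_response(200)
        # Content type of the Prometheus text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep the example quiet


def fetch_metrics_once() -> str:
    """Start the server on a free port, scrape it once, and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", "/metrics")
        text = conn.getresponse().read().decode("utf-8")
        conn.close()
        return text
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_metrics_once())
```

The scrape is just an HTTP GET, which is why `curl` is usually the first debugging tool for a target that Prometheus reports as down.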
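2. Choose a Client Library
Pick the Prometheus client library that matches your technology stack:
- Python: prometheus-client
- Java: micrometer or simpleclient
- Go: prometheus/client_golang
- Node.js: prom-client
3. Metric Design Principles
Follow these principles when designing metrics:
- Single responsibility: each metric measures exactly one thing
- Clear naming convention: separate words with _, as in http_requests_total
- Meaningful labels: use labels to distinguish the dimensions of a metric
- Avoid label cardinality explosion: do not use high-cardinality fields as labels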
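The cardinality principle is easier to appreciate with numbers. The sketch below estimates the worst-case number of time series for one metric as the product of the distinct values per label; the function name `estimate_series` and the value counts are illustrative assumptions.

```python
from math import prod


def estimate_series(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(distinct_values_per_label.values())


# Bounded labels: a handful of methods, routes, and status codes.
bounded = estimate_series({"method": 5, "endpoint": 20, "status": 10})

# Adding a user_id label multiplies every existing series by the user count.
with_user_id = estimate_series(
    {"method": 5, "endpoint": 20, "status": 10, "user_id": 100_000}
)

print(bounded, with_user_id)
```

The real series count is usually lower than this bound because not every combination occurs, but the multiplication shows why a single unbounded label dominates everything else.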
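4. Service Discovery Configuration
Prometheus supports several service discovery mechanisms:
- Static configuration
- DNS service discovery
- Kubernetes service discovery
- Consul service discovery
- File-based service discovery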
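File-based discovery is the easiest mechanism to script: Prometheus watches JSON (or YAML) files listed under `file_sd_configs` and reloads targets when they change. The following sketch writes such a file; the helper name `write_file_sd`, the file name, and the label values are assumptions for illustration.

```python
import json
import os
import tempfile


def write_file_sd(path: str, targets: list[str], labels: dict[str, str]) -> None:
    """Write a target-group file in the shape file-based discovery expects."""
    groups = [{"targets": targets, "labels": labels}]
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2)
    # Rename atomically so Prometheus never reads a half-written file.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as directory:
    sd_file = os.path.join(directory, "demo-app.json")
    write_file_sd(sd_file, ["localhost:8001"], {"app": "metrics-demo"})
    with open(sd_file, encoding="utf-8") as handle:
        loaded = json.load(handle)

print(loaded)
```

A deployment script or sidecar can regenerate this file whenever instances come and go, without restarting Prometheus.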
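Python Application Metrics Integration
Project Structure
prometheus-metrics-demo/
├── app.py              # Main application
├── metrics.py          # Metric definitions
├── business_logic.py   # Simulated business logic
├── config.yaml         # Configuration file
├── requirements.txt    # Dependencies
└── prometheus.yaml     # Prometheus configuration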
Complete Code
1. Dependencies: requirements.txt
txt
prometheus-client==0.17.1
flask==2.3.3
requests==2.31.0
pyyaml==6.0
numpy==1.24.3
psutil==5.9.5
2. Configuration File: config.yaml
yaml
app:
  name: "metrics-demo-app"
  version: "1.0.0"
  environment: "development"
  port: 8000
metrics:
  enabled: true
  path: "/metrics"
  port: 8001
  collect_interval: 15  # seconds
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
endpoints:
  external_api: "https://httpbin.org"
  internal_api: "http://localhost:8000/api"
3. Metric Definitions: metrics.py
python
"""
Prometheus指标定义模块
负责定义和注册所有监控指标
"""
from prometheus_client import (
Counter, Gauge, Histogram, Summary,
generate_latest, CONTENT_TYPE_LATEST,
CollectorRegistry, Info
)
from prometheus_client.exposition import MetricsHandler
import time
import psutil
import os
from typing import Dict, Any, Optional
import threading
class ApplicationMetrics:
"""
应用监控指标管理器
统一管理所有Prometheus指标,确保指标命名规范一致
"""
def __init__(self, app_name: str = "unknown"):
"""
初始化指标管理器
Args:
app_name: 应用名称,用于指标前缀
"""
self.app_name = app_name
self.registry = CollectorRegistry()
# 应用信息指标
self.app_info = Info(
f'{self.app_name}_info',
'Application information',
registry=self.registry
)
# HTTP请求指标
self.http_requests_total = Counter(
f'{self.app_name}_http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status'],
registry=self.registry
)
self.http_request_duration_seconds = Histogram(
f'{self.app_name}_http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint'],
buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 5.0),
registry=self.registry
)
self.http_request_size_bytes = Summary(
f'{self.app_name}_http_request_size_bytes',
'HTTP request size in bytes',
['method', 'endpoint'],
registry=self.registry
)
# 业务指标
self.orders_processed_total = Counter(
f'{self.app_name}_orders_processed_total',
'Total orders processed',
['status', 'payment_method'],
registry=self.registry
)
self.order_value = Summary(
f'{self.app_name}_order_value',
'Order value statistics',
['currency'],
registry=self.registry
)
# 系统资源指标
self.cpu_usage_percent = Gauge(
f'{self.app_name}_cpu_usage_percent',
'CPU usage percentage',
['cpu_id'],
registry=self.registry
)
self.memory_usage_bytes = Gauge(
f'{self.app_name}_memory_usage_bytes',
'Memory usage in bytes',
registry=self.registry
)
self.memory_available_bytes = Gauge(
f'{self.app_name}_memory_available_bytes',
'Available memory in bytes',
registry=self.registry
)
self.disk_usage_percent = Gauge(
f'{self.app_name}_disk_usage_percent',
'Disk usage percentage',
['mountpoint'],
registry=self.registry
)
# 应用性能指标
self.active_threads = Gauge(
f'{self.app_name}_active_threads',
'Number of active threads',
registry=self.registry
)
self.active_connections = Gauge(
f'{self.app_name}_active_connections',
'Number of active database connections',
registry=self.registry
)
self.queue_size = Gauge(
f'{self.app_name}_queue_size',
'Size of processing queue',
registry=self.registry
)
# 错误指标
self.errors_total = Counter(
f'{self.app_name}_errors_total',
'Total errors',
['type', 'component'],
registry=self.registry
)
# 缓存指标
self.cache_hits_total = Counter(
f'{self.app_name}_cache_hits_total',
'Total cache hits',
['cache_name'],
registry=self.registry
)
self.cache_misses_total = Counter(
f'{self.app_name}_cache_misses_total',
'Total cache misses',
['cache_name'],
registry=self.registry
)
# 初始化应用信息
self._set_app_info()
# 启动系统指标收集器
self._start_system_metrics_collector()
def _set_app_info(self):
"""设置应用信息"""
self.app_info.info({
'version': '1.0.0',
'environment': os.getenv('APP_ENV', 'development'),
'build_date': '2024-01-01',
'commit_hash': 'abc123'
})
def _start_system_metrics_collector(self):
"""启动系统指标收集线程"""
def collect_system_metrics():
while True:
try:
self._update_system_metrics()
except Exception as e:
self.record_error('system_metrics', 'metrics_collector', str(e))
time.sleep(15) # 每15秒收集一次
collector_thread = threading.Thread(
target=collect_system_metrics,
daemon=True,
name='system-metrics-collector'
)
collector_thread.start()
def _update_system_metrics(self):
"""更新系统资源指标"""
# CPU使用率
cpu_percent = psutil.cpu_percent(percpu=True)
for i, percent in enumerate(cpu_percent):
self.cpu_usage_percent.labels(cpu_id=str(i)).set(percent)
# 内存使用情况
memory = psutil.virtual_memory()
self.memory_usage_bytes.set(memory.used)
self.memory_available_bytes.set(memory.available)
# 磁盘使用情况
for partition in psutil.disk_partitions():
try:
usage = psutil.disk_usage(partition.mountpoint)
self.disk_usage_percent.labels(
mountpoint=partition.mountpoint
).set(usage.percent)
except (PermissionError, FileNotFoundError):
continue
# 活跃线程数
self.active_threads.set(threading.active_count())
def record_http_request(self, method: str, endpoint: str,
status: str, duration: float, size: int = 0):
"""
记录HTTP请求指标
Args:
method: HTTP方法
endpoint: 请求端点
status: HTTP状态码
duration: 请求持续时间(秒)
size: 请求大小(字节)
"""
self.http_requests_total.labels(
method=method,
endpoint=endpoint,
status=status
).inc()
self.http_request_duration_seconds.labels(
method=method,
endpoint=endpoint
).observe(duration)
if size > 0:
self.http_request_size_bytes.labels(
method=method,
endpoint=endpoint
).observe(size)
def record_order(self, status: str, payment_method: str,
value: float, currency: str = 'USD'):
"""
记录订单处理指标
Args:
status: 订单状态
payment_method: 支付方式
value: 订单金额
currency: 货币类型
"""
self.orders_processed_total.labels(
status=status,
payment_method=payment_method
).inc()
self.order_value.labels(currency=currency).observe(value)
def record_error(self, error_type: str, component: str,
description: Optional[str] = None):
"""
记录错误指标
Args:
error_type: 错误类型
component: 发生错误的组件
description: 错误描述(可选)
"""
self.errors_total.labels(
type=error_type,
component=component
).inc()
def record_cache_operation(self, cache_name: str, hit: bool):
"""
记录缓存操作指标
Args:
cache_name: 缓存名称
hit: 是否命中
"""
if hit:
self.cache_hits_total.labels(cache_name=cache_name).inc()
else:
self.cache_misses_total.labels(cache_name=cache_name).inc()
def update_queue_size(self, size: int):
"""更新队列大小指标"""
self.queue_size.set(size)
def update_active_connections(self, count: int):
"""更新活跃连接数指标"""
self.active_connections.set(count)
def get_metrics(self):
"""获取指标数据"""
return generate_latest(self.registry)
# 全局指标实例
metrics = ApplicationMetrics("demo_app")
4. Business Logic: business_logic.py
python
"""
业务逻辑模拟模块
演示如何在实际业务中集成指标收集
"""
import random
import time
import threading
from typing import Dict, Any
from queue import Queue, Empty
from datetime import datetime
import uuid
from metrics import metrics
class Order:
"""订单类"""
def __init__(self, order_id: str, customer_id: str,
items: list, payment_method: str):
self.order_id = order_id
self.customer_id = customer_id
self.items = items
self.payment_method = payment_method
self.total_amount = sum(item['price'] * item['quantity']
for item in items)
self.status = 'pending'
self.created_at = datetime.now()
self.processed_at = None
self.currency = 'USD'
class OrderProcessor:
"""订单处理器"""
def __init__(self, max_workers: int = 3):
"""
初始化订单处理器
Args:
max_workers: 最大工作线程数
"""
self.order_queue = Queue()
self.max_workers = max_workers
self.workers = []
self.running = False
self.processed_orders = []
# 模拟的数据库连接池
self.active_db_connections = 0
self.max_db_connections = 10
# 模拟的缓存
self.cache = {}
def start(self):
"""启动订单处理器"""
if self.running:
return
self.running = True
for i in range(self.max_workers):
worker = threading.Thread(
target=self._process_orders_worker,
daemon=True,
name=f'order-worker-{i}'
)
worker.start()
self.workers.append(worker)
print(f"Started {self.max_workers} order processing workers")
# 启动指标更新线程
metrics_thread = threading.Thread(
target=self._update_metrics,
daemon=True,
name='metrics-updater'
)
metrics_thread.start()
def stop(self):
"""停止订单处理器"""
self.running = False
for worker in self.workers:
worker.join(timeout=5)
self.workers.clear()
def submit_order(self, order_data: Dict[str, Any]) -> str:
"""
提交新订单
Args:
order_data: 订单数据
Returns:
订单ID
"""
order_id = str(uuid.uuid4())[:8]
order = Order(
order_id=order_id,
customer_id=order_data.get('customer_id', 'anonymous'),
items=order_data.get('items', []),
payment_method=order_data.get('payment_method', 'credit_card')
)
# 将订单放入队列
self.order_queue.put(order)
# 更新队列大小指标
metrics.update_queue_size(self.order_queue.qsize())
print(f"Submitted order {order_id} with amount ${order.total_amount:.2f}")
return order_id
def _process_orders_worker(self):
"""订单处理工作线程"""
while self.running:
try:
# 从队列获取订单(最多等待1秒)
order = self.order_queue.get(timeout=1)
# 模拟处理延迟
processing_time = random.uniform(0.1, 2.0)
time.sleep(processing_time)
# 模拟数据库操作
self._simulate_db_operation()
# 模拟缓存查询
cache_key = f"customer:{order.customer_id}"
cache_hit = self._check_cache(cache_key)
# 随机决定订单状态(模拟成功率)
success_rate = 0.95 # 95%成功率
status = 'completed' if random.random() < success_rate else 'failed'
if status == 'failed':
error_type = random.choice(['payment_failed', 'inventory_error', 'system_error'])
metrics.record_error(error_type, 'order_processor',
f"Order {order.order_id} failed")
# 记录订单处理指标
metrics.record_order(
status=status,
payment_method=order.payment_method,
value=order.total_amount,
currency=order.currency
)
# 记录缓存指标
metrics.record_cache_operation('customer_cache', cache_hit)
# 更新订单状态
order.status = status
order.processed_at = datetime.now()
self.processed_orders.append(order)
print(f"Processed order {order.order_id}: {status}")
# 更新队列大小指标
metrics.update_queue_size(self.order_queue.qsize())
except Empty:
continue
except Exception as e:
metrics.record_error('processing_error', 'order_worker', str(e))
print(f"Error processing order: {e}")
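    def _simulate_db_operation(self):
        """Simulate a database operation."""
        # Acquire a connection only when the pool has capacity, and release
        # exactly what was acquired so the gauge can never go negative.
        # The counter is not lock-protected; this is a simulation only.
        if self.active_db_connections < self.max_db_connections:
            self.active_db_connections += 1
            metrics.update_active_connections(self.active_db_connections)
            try:
                # Simulated query time
                time.sleep(random.uniform(0.01, 0.1))
            finally:
                self.active_db_connections -= 1
                metrics.update_active_connections(self.active_db_connections)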
def _check_cache(self, key: str) -> bool:
"""检查缓存(模拟)"""
# 模拟缓存命中率
hit_rate = 0.7 # 70%缓存命中率
hit = random.random() < hit_rate
if not hit and key not in self.cache:
# 模拟缓存未命中时的数据加载
self.cache[key] = {'data': 'cached_value', 'timestamp': time.time()}
return hit
def _update_metrics(self):
"""定期更新指标"""
while self.running:
try:
# 更新队列大小指标
metrics.update_queue_size(self.order_queue.qsize())
# 更新活跃连接数指标
metrics.update_active_connections(self.active_db_connections)
time.sleep(10) # 每10秒更新一次
except Exception as e:
print(f"Error updating metrics: {e}")
def get_stats(self) -> Dict[str, Any]:
"""获取处理器统计信息"""
return {
'queue_size': self.order_queue.qsize(),
'processed_count': len(self.processed_orders),
'active_workers': len([w for w in self.workers if w.is_alive()]),
'active_connections': self.active_db_connections
}
# 模拟外部API调用
class ExternalAPIClient:
"""外部API客户端(模拟)"""
def __init__(self, base_url: str):
self.base_url = base_url
def make_request(self, method: str, endpoint: str,
data: Dict[str, Any] = None) -> Dict[str, Any]:
"""
发送HTTP请求并记录指标
Args:
method: HTTP方法
endpoint: API端点
data: 请求数据
Returns:
响应数据
"""
start_time = time.time()
try:
# 模拟HTTP请求延迟
delay = random.uniform(0.05, 1.5)
time.sleep(delay)
# 模拟不同的响应状态
success_rate = 0.9 # 90%成功率
if random.random() < success_rate:
status = '200'
response = {'success': True, 'data': {'id': 123}}
else:
status = random.choice(['400', '500'])
response = {'success': False, 'error': 'Request failed'}
# 模拟响应大小
response_size = random.randint(100, 5000)
# 记录HTTP指标
metrics.record_http_request(
method=method,
endpoint=endpoint,
status=status,
duration=time.time() - start_time,
size=response_size
)
return response
except Exception as e:
# 记录异常指标
duration = time.time() - start_time
metrics.record_http_request(
method=method,
endpoint=endpoint,
status='500',
duration=duration
)
metrics.record_error('http_error', 'external_api', str(e))
return {'success': False, 'error': str(e)}
# 生成模拟订单数据
def generate_mock_order() -> Dict[str, Any]:
"""生成模拟订单数据"""
products = [
{'name': 'Laptop', 'price': 999.99},
{'name': 'Phone', 'price': 699.99},
{'name': 'Tablet', 'price': 399.99},
{'name': 'Headphones', 'price': 199.99},
{'name': 'Charger', 'price': 29.99}
]
num_items = random.randint(1, 5)
items = random.sample(products, num_items)
for item in items:
item['quantity'] = random.randint(1, 3)
payment_methods = ['credit_card', 'paypal', 'apple_pay', 'google_pay']
return {
'customer_id': f'customer_{random.randint(1000, 9999)}',
'items': items,
'payment_method': random.choice(payment_methods)
}
5. Main Application: app.py
python
"""
Prometheus指标集成演示应用
主应用入口,集成Flask Web服务和指标端点
"""
import os
import random
import yaml
import time
import threading
from datetime import datetime
from typing import Dict, Any
from http.server import HTTPServer
from prometheus_client.exposition import ThreadingWSGIServer
from flask import Flask, request, jsonify, Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from metrics import metrics
from business_logic import OrderProcessor, ExternalAPIClient, generate_mock_order
class MetricsApp:
"""指标集成应用主类"""
def __init__(self, config_path: str = 'config.yaml'):
"""
初始化应用
Args:
config_path: 配置文件路径
"""
# 加载配置
self.config = self._load_config(config_path)
# 初始化Flask应用
self.flask_app = Flask(__name__)
self._setup_routes()
# 初始化组件
self.order_processor = OrderProcessor(
max_workers=self.config['app'].get('workers', 3)
)
self.api_client = ExternalAPIClient(
self.config['endpoints']['external_api']
)
# 应用状态
self.start_time = datetime.now()
self.request_count = 0
# 启动指标服务器
self._start_metrics_server()
def _load_config(self, config_path: str) -> Dict[str, Any]:
"""加载配置文件"""
try:
with open(config_path, 'r') as f:
config = yaml.safe_load(f)
# 设置环境变量覆盖
config['app']['port'] = int(os.getenv('APP_PORT', config['app']['port']))
config['metrics']['port'] = int(os.getenv('METRICS_PORT', config['metrics']['port']))
return config
except Exception as e:
print(f"Error loading config: {e}")
return self._get_default_config()
def _get_default_config(self) -> Dict[str, Any]:
"""获取默认配置"""
return {
'app': {
'name': 'metrics-demo-app',
'port': 8000,
'workers': 3
},
'metrics': {
'port': 8001,
'path': '/metrics'
},
'endpoints': {
'external_api': 'https://httpbin.org'
}
}
def _setup_routes(self):
"""设置Flask路由"""
@self.flask_app.route('/')
def index():
"""首页"""
self._record_request('GET', '/')
return jsonify({
'app': self.config['app']['name'],
'version': '1.0.0',
'status': 'running',
'uptime': str(datetime.now() - self.start_time)
})
@self.flask_app.route('/api/health')
def health():
"""健康检查端点"""
self._record_request('GET', '/api/health')
return jsonify({
'status': 'healthy',
'timestamp': datetime.now().isoformat()
})
@self.flask_app.route('/api/orders', methods=['POST'])
def create_order():
"""创建订单端点"""
start_time = time.time()
try:
order_data = request.get_json() or generate_mock_order()
# 验证订单数据
if not order_data.get('items'):
order_data = generate_mock_order()
# 提交订单
order_id = self.order_processor.submit_order(order_data)
# 调用外部API(模拟)
api_response = self.api_client.make_request(
'POST',
'/api/orders',
{'order_id': order_id}
)
response_data = {
'order_id': order_id,
'status': 'accepted',
'api_response': api_response,
'estimated_processing_time': '10-30 seconds'
}
# 记录成功的HTTP请求
self._record_request_with_duration(
                    'POST', '/api/orders', '202',
time.time() - start_time,
len(str(response_data))
)
return jsonify(response_data), 202
except Exception as e:
# 记录失败的HTTP请求
self._record_request_with_duration(
'POST', '/api/orders', '500',
time.time() - start_time
)
# 记录错误指标
metrics.record_error('api_error', 'create_order', str(e))
return jsonify({
'error': 'Failed to create order',
'details': str(e)
}), 500
@self.flask_app.route('/api/orders/<order_id>')
def get_order(order_id: str):
"""获取订单状态"""
            # Label with the route template, not the concrete ID, to keep label cardinality bounded
            self._record_request('GET', '/api/orders/<order_id>')
# 模拟查询逻辑
return jsonify({
'order_id': order_id,
'status': random.choice(['processing', 'completed', 'failed']),
'created_at': datetime.now().isoformat()
})
@self.flask_app.route('/api/stats')
def get_stats():
"""获取应用统计信息"""
self._record_request('GET', '/api/stats')
stats = {
'app': {
'name': self.config['app']['name'],
'uptime': str(datetime.now() - self.start_time),
'request_count': self.request_count
},
'order_processor': self.order_processor.get_stats(),
'timestamp': datetime.now().isoformat()
}
return jsonify(stats)
@self.flask_app.route('/api/metrics/prometheus')
def prometheus_metrics():
"""Prometheus指标端点"""
self._record_request('GET', '/api/metrics/prometheus')
            return Response(
                metrics.get_metrics(),  # serialize the app's custom registry, not the default one
                mimetype=CONTENT_TYPE_LATEST
            )
@self.flask_app.errorhandler(404)
def not_found(error):
"""404错误处理"""
            # Arbitrary unknown paths would create unbounded label values, so use a fixed placeholder
            self._record_request(request.method, 'unmatched', '404')
return jsonify({'error': 'Not found'}), 404
@self.flask_app.errorhandler(500)
def internal_error(error):
"""500错误处理"""
self._record_request(request.method, request.path, '500')
return jsonify({'error': 'Internal server error'}), 500
def _record_request(self, method: str, endpoint: str, status: str = '200'):
"""记录HTTP请求(简化版)"""
metrics.record_http_request(method, endpoint, status, 0.1)
self.request_count += 1
def _record_request_with_duration(self, method: str, endpoint: str,
status: str, duration: float, size: int = 0):
"""记录HTTP请求(带持续时间)"""
metrics.record_http_request(method, endpoint, status, duration, size)
self.request_count += 1
def _start_metrics_server(self):
"""启动独立的指标服务器"""
def run_metrics_server():
"""运行指标服务器"""
metrics_port = self.config['metrics']['port']
metrics_path = self.config['metrics']['path']
# 创建简单的HTTP服务器用于指标端点
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server
            # Serve the custom registry that holds the application's metrics
            metrics_app = make_wsgi_app(metrics.registry)
def combined_app(environ, start_response):
if environ['PATH_INFO'] == metrics_path:
return metrics_app(environ, start_response)
else:
start_response('404 Not Found', [('Content-Type', 'text/plain')])
return [b'Not Found']
print(f"Starting metrics server on port {metrics_port}")
httpd = make_server('', metrics_port, combined_app)
httpd.serve_forever()
# 在独立线程中启动指标服务器
metrics_thread = threading.Thread(
target=run_metrics_server,
daemon=True,
name='metrics-server'
)
metrics_thread.start()
def run(self):
"""运行应用"""
# 启动订单处理器
self.order_processor.start()
# 启动模拟负载生成器
self._start_load_generator()
# 运行Flask应用
app_port = self.config['app']['port']
print(f"Starting application server on port {app_port}")
print(f"Metrics available at http://localhost:{self.config['metrics']['port']}/metrics")
print(f"Health check at http://localhost:{app_port}/api/health")
self.flask_app.run(
host='0.0.0.0',
port=app_port,
debug=False,
threaded=True
)
def _start_load_generator(self):
"""启动模拟负载生成器"""
def generate_load():
"""生成模拟负载"""
import random
while True:
try:
# 随机生成一些订单
if random.random() < 0.3: # 30%的概率生成订单
order_data = generate_mock_order()
# 使用线程提交订单,避免阻塞
threading.Thread(
target=self.order_processor.submit_order,
args=(order_data,),
daemon=True
).start()
# 随机调用外部API
if random.random() < 0.2: # 20%的概率调用API
self.api_client.make_request(
'GET',
'/api/status'
)
# 随机间隔
time.sleep(random.uniform(0.5, 5))
except Exception as e:
print(f"Error in load generator: {e}")
load_thread = threading.Thread(
target=generate_load,
daemon=True,
name='load-generator'
)
load_thread.start()
def main():
"""主函数"""
print("=" * 60)
print("Prometheus Metrics Integration Demo")
print("=" * 60)
# 创建并运行应用
app = MetricsApp('config.yaml')
try:
app.run()
except KeyboardInterrupt:
print("\nShutting down...")
app.order_processor.stop()
print("Application stopped.")
if __name__ == '__main__':
import random # 在模块级别导入
main()
6. Prometheus Configuration: prometheus.yaml
yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: 'development'
    cluster: 'demo-cluster'
# Alerting rule files
rule_files:
  # - "alerts/*.yaml"
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
# Scrape configuration
scrape_configs:
  # Monitor Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 10s
    metrics_path: '/metrics'
  # Monitor the demo application (dedicated metrics port)
  - job_name: 'demo-app'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8001']
        labels:
          app: 'metrics-demo'
          version: '1.0.0'
    # No relabel_configs are needed here: instance defaults to __address__,
    # and static targets carry no __meta_* labels. Relabeling from a missing
    # source label would overwrite the version label above with an empty value.
  # Monitor the application API endpoint
  - job_name: 'demo-app-api'
    scrape_interval: 30s
    metrics_path: '/api/metrics/prometheus'
    static_configs:
      - targets: ['localhost:8000']
    metric_relabel_configs:
      # The regex is fully anchored, so it must include the demo_app_ prefix
      # and the histogram series suffixes.
      - source_labels: [__name__]
        regex: 'demo_app_(http_requests_total|http_request_duration_seconds_(bucket|sum|count)|orders_processed_total)'
        action: keep
  # Black-box monitoring (HTTP probes)
  - job_name: 'blackbox-http'
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://localhost:8000/api/health
          - http://localhost:8001/metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # blackbox exporter address
  # System monitoring (Node Exporter)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 30s
# Remote write (optional)
remote_write:
  - url: "http://remote-prometheus:9090/api/v1/write"
    queue_config:
      capacity: 2500
      max_shards: 200
# Remote read (optional)
remote_read:
  - url: "http://remote-prometheus:9090/api/v1/read"
    read_recent: true
Querying and Analyzing Metrics
PromQL Query Examples
1. Basic Queries
promql
# Total number of HTTP requests
sum(demo_app_http_requests_total)
# Request rate grouped by endpoint
sum by (endpoint) (rate(demo_app_http_requests_total[5m]))
# Request success ratio
sum(rate(demo_app_http_requests_total{status=~"2.."}[5m]))
/
sum(rate(demo_app_http_requests_total[5m]))
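The `rate()` calls in these queries turn raw counter samples into a per-second increase and tolerate counter resets. The sketch below shows the core idea in plain Python. It is a simplification: the real `rate()` also extrapolates to the window boundaries, and the function name `simple_rate` and the sample values are made up for illustration.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset (for example after a restart),
        # so the new value itself is the increase since the reset.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# The counter restarts from zero between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
requests_per_second = simple_rate(window)
print(round(requests_per_second, 3))
```

This reset handling is the reason counters should be queried through `rate()` or `increase()` rather than by subtracting raw values.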
2. Performance Analysis
promql
# 95th-percentile request latency
histogram_quantile(0.95,
  sum(rate(demo_app_http_request_duration_seconds_bucket[5m]))
  by (le, endpoint))
# Average order value (the order_value Summary records amounts, not durations)
rate(demo_app_order_value_sum[1h])
/
rate(demo_app_order_value_count[1h])
# CPU usage trend
avg_over_time(demo_app_cpu_usage_percent[5m])
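`histogram_quantile` estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. The following sketch reproduces that calculation for a single set of cumulative buckets. The bucket bounds match the ones defined in metrics.py, while the counts are invented for the example, and the function assumes at least one finite bucket before `+Inf`.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs ending in +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == float("inf"):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


inf = float("inf")
latency_buckets = [
    (0.01, 5), (0.05, 20), (0.1, 50), (0.5, 90), (1.0, 98), (5.0, 100), (inf, 100),
]
p95 = histogram_quantile(0.95, latency_buckets)
print(p95)
```

The estimate can only be as precise as the bucket layout: with the bounds above, any p95 between 0.5 s and 1.0 s is interpolated within that one wide bucket, so bucket boundaries should sit near the latencies you care about.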
3. Business Metrics Analysis
promql
# Order processing rate
sum(rate(demo_app_orders_processed_total[5m]))
  by (status, payment_method)
# Cache hit ratio
sum(rate(demo_app_cache_hits_total[5m]))
/
(sum(rate(demo_app_cache_hits_total[5m]))
  + sum(rate(demo_app_cache_misses_total[5m])))
Grafana Dashboard Configuration
Example dashboard JSON
json
{
  "dashboard": {
    "title": "Application Monitoring Dashboard",
    "panels": [
      {
        "title": "HTTP Request Rate",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(demo_app_http_requests_total[5m])) by (endpoint)",
          "legendFormat": "{{endpoint}}"
        }]
      },
      {
        "title": "Request Latency Distribution",
        "type": "heatmap",
        "targets": [{
          "expr": "rate(demo_app_http_request_duration_seconds_bucket[5m])",
          "format": "heatmap"
        }]
      }
    ]
  }
}
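Best Practices and Optimization
1. Metric Design Best Practices
Naming conventions
- Use _ as the word separator
- End counters with _total
- End duration metrics with _seconds
- Use unit suffixes such as _bytes and _percent
Label design
text
# Good label design
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# Avoid label cardinality explosion
# Wrong: using the user ID as a label
http_requests_total{user_id="12345", ...}
# Right: using the user type or role
http_requests_total{user_type="premium", ...}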
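Request paths are a common hidden source of high cardinality, because every order or user ID embedded in a URL becomes a new label value. One way to guard against this, sketched below, is to collapse ID-like path segments into a placeholder before using the path as a label. The function `normalize_endpoint`, the `:id` placeholder, and the heuristic for what counts as an ID are assumptions for this example; using the framework's route template is more reliable when it is available.

```python
import re

# Heuristic: purely numeric segments, hex strings of 8+ characters, or UUID-shaped segments.
_ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8,}|[0-9a-fA-F-]{36})$")


def normalize_endpoint(path: str) -> str:
    """Replace ID-like path segments so the endpoint label stays low-cardinality."""
    segments = [":id" if _ID_SEGMENT.match(segment) else segment
                for segment in path.split("/")]
    return "/".join(segments)


for raw in ("/api/orders/3f2a9c1b", "/api/users/12345", "/api/health"):
    print(raw, "->", normalize_endpoint(raw))
```

With this normalization, all order lookups share one `endpoint` label value instead of creating a new time series per order.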
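2. Performance Optimization
Batching metric updates
python
import threading

# Buffer metric updates and apply them together, so hot paths only append to a list.
# Whether this reduces contention depends on the workload; measure before adopting it.
class BatchMetricUpdater:
    def __init__(self):
        self.batch = []
        self._lock = threading.Lock()

    def add_metric(self, metric_callable, *args):
        with self._lock:
            self.batch.append((metric_callable, args))

    def flush(self):
        # Swap the buffer out under the lock, then apply the updates outside it.
        with self._lock:
            pending, self.batch = self.batch, []
        for metric_callable, args in pending:
            metric_callable(*args)
Memory optimization
python
# Hold metrics through weak references so the manager itself does not keep them alive
import weakref

class MetricManager:
    def __init__(self):
        self._metrics = weakref.WeakValueDictionary()

    def get_metric(self, name):
        metric = self._metrics.get(name)
        if metric is None:
            metric = self._create_metric(name)
            self._metrics[name] = metric
        return metric

    def _create_metric(self, name):
        # Placeholder: build and return the metric object for `name`.
        raise NotImplementedError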
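3. High-Availability Deployment Architecture
[Figure: high-availability deployment architecture. Groups shown: federation cluster, monitoring cluster, production cluster. Components shown: federated Prometheus, Prometheus Server A, Prometheus HA pair, Prometheus Server B, remote storage, long-term storage, Alertmanager cluster, notification channels, Grafana, load balancer, application instances 1-3.]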
Troubleshooting and Debugging
Common Problems and Solutions
1. Metrics Endpoint Unreachable
bash
# Check that the endpoint is reachable
curl http://localhost:8001/metrics
# Validate the Prometheus configuration
promtool check config prometheus.yaml
# Follow the Prometheus logs
journalctl -u prometheus -f
2. Missing or Abnormal Metrics
python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Validate the exposition format of the application's own registry
from prometheus_client.parser import text_string_to_metric_families
from metrics import metrics
metrics_data = metrics.get_metrics()
for family in text_string_to_metric_families(metrics_data.decode()):
    print(f"Family: {family.name}")
    for sample in family.samples:
        print(f"  Sample: {sample}")
3. Performance Problems
python
# Measure how long metric collection takes
from prometheus_client import Summary
collect_duration = Summary('metrics_collect_duration_seconds',
                           'Time spent collecting metrics')
@collect_duration.time()
def collect_metrics():
    # Metric collection logic goes here
    pass
Code Self-Check and Testing
Code quality checks
python
"""
代码自查和测试模块
确保代码质量并减少BUG
"""
import unittest
import time
import threading
from io import StringIO
from prometheus_client import REGISTRY
from metrics import ApplicationMetrics
from business_logic import OrderProcessor, generate_mock_order
class TestApplicationMetrics(unittest.TestCase):
"""应用指标测试类"""
def setUp(self):
"""测试前置设置"""
self.metrics = ApplicationMetrics("test_app")
def test_counter_increment(self):
"""测试计数器递增"""
# 记录HTTP请求
self.metrics.record_http_request("GET", "/test", "200", 0.1)
# 获取指标数据
metrics_data = self.metrics.get_metrics().decode()
# 验证计数器存在
self.assertIn('test_app_http_requests_total', metrics_data)
self.assertIn('method="GET"', metrics_data)
self.assertIn('endpoint="/test"', metrics_data)
def test_gauge_set_value(self):
"""测试仪表盘设置值"""
# 设置队列大小
self.metrics.update_queue_size(42)
metrics_data = self.metrics.get_metrics().decode()
self.assertIn('test_app_queue_size 42.0', metrics_data)
def test_error_recording(self):
"""测试错误记录"""
# 记录错误
self.metrics.record_error("validation", "user_service", "Invalid input")
metrics_data = self.metrics.get_metrics().decode()
self.assertIn('type="validation"', metrics_data)
self.assertIn('component="user_service"', metrics_data)
def test_concurrent_access(self):
"""测试并发访问"""
def worker():
for _ in range(100):
self.metrics.record_http_request(
"POST", "/api", "200", 0.05
)
# 创建多个线程并发访问
threads = []
for i in range(10):
t = threading.Thread(target=worker)
threads.append(t)
t.start()
# 等待所有线程完成
for t in threads:
t.join()
# 验证指标数据完整性
metrics_data = self.metrics.get_metrics().decode()
self.assertIn('test_app_http_requests_total', metrics_data)
class TestOrderProcessor(unittest.TestCase):
"""订单处理器测试类"""
def setUp(self):
self.processor = OrderProcessor(max_workers=2)
def test_order_submission(self):
"""测试订单提交"""
order_data = generate_mock_order()
order_id = self.processor.submit_order(order_data)
self.assertIsInstance(order_id, str)
self.assertEqual(len(order_id), 8)
def test_processor_start_stop(self):
"""测试处理器启动和停止"""
self.processor.start()
self.assertTrue(self.processor.running)
# 提交一些订单
for _ in range(5):
self.processor.submit_order(generate_mock_order())
# 等待处理
time.sleep(2)
self.processor.stop()
self.assertFalse(self.processor.running)
def test_metrics_integration(self):
"""测试指标集成"""
from metrics import metrics
# 提交订单
order_data = generate_mock_order()
self.processor.submit_order(order_data)
# 验证指标更新
time.sleep(1)
metrics_data = metrics.get_metrics().decode()
# 检查队列大小指标
self.assertIn('demo_app_queue_size', metrics_data)
def run_all_tests():
"""运行所有测试"""
suite = unittest.TestLoader().loadTestsFromTestCase(TestApplicationMetrics)
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestOrderProcessor))
runner = unittest.TextTestRunner(verbosity=2)
result = runner.run(suite)
return result.wasSuccessful()
if __name__ == '__main__':
print("开始运行代码自查测试...")
print("=" * 60)
success = run_all_tests()
print("=" * 60)
if success:
print("✅ 所有测试通过!代码质量良好。")
else:
print("❌ 测试失败,请检查代码。")
# 内存泄漏检查
print("\n内存使用检查...")
import gc
gc.collect()
print(f"存活对象数量: {len(gc.get_objects())}")
# 指标格式验证
print("\n指标格式验证...")
from prometheus_client.parser import text_string_to_metric_families
test_metrics = ApplicationMetrics("validation_app")
test_metrics.record_http_request("GET", "/validate", "200", 0.1)
metrics_data = test_metrics.get_metrics()
try:
families = list(text_string_to_metric_families(metrics_data.decode()))
print(f"✅ 发现 {len(families)} 个指标族,格式正确")
for family in families[:3]: # 显示前3个指标族
print(f" - {family.name}: {len(family.samples)} 个样本")
except Exception as e:
print(f"❌ 指标格式错误: {e}")
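Summary
This article walked through the complete process of integrating monitoring metrics with Prometheus, from basic concepts to advanced practice, covering:
- Prometheus core concepts: data model, metric types, architecture
- Integration methods: client library selection, metric design principles, service discovery configuration
- A complete implementation example: building an observable application with Python and Flask
- Best practices: metric design, performance optimization, high-availability architecture
- Troubleshooting: solutions to common problems and debugging techniques
With this guidance you can:
- ✅ Design metrics that follow Prometheus conventions
- ✅ Integrate metric collection and exposition into an application
- ✅ Configure Prometheus for scraping and alerting
- ✅ Build monitoring dashboards in Grafana
- ✅ Apply production best practices
Metrics integration is a key step in building reliable, observable systems. Done well, monitoring helps you detect problems quickly and also provides the data needed for capacity planning, performance tuning, and business analysis.
Next Steps
- Learn more: read the official Prometheus documentation to explore advanced features
- Deploy in practice: run and tune the monitoring system in a production environment
- Extend monitoring: integrate logs and traces for full observability
- Automate: manage monitoring configuration with Infrastructure as Code
- Harden security: apply access control to metrics and encrypt data
Recommended resources: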