Prometheus Python Client Hands-On Guide: Building a Custom Exporter from Scratch

Contents

- I. Environment Setup
- II. Metric Types
- III. Core API
  - 1. Gauge
  - 2. Counter
  - 3. Histogram
  - 4. Summary
- IV. A Simple Exporter
- V. An Advanced Exporter
- VI. Comprehensive Example
- VII. Connecting to Prometheus
- VIII. Summary

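prometheus_client is the official Prometheus client library for Python. It lets a Python application define and update monitoring metrics and expose them over an HTTP endpoint for Prometheus to scrape.

This article is hands-on: it walks through the core usage and ends with a tool that collects system metrics in the style of node_exporter.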

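I. Environment Setup

prometheus-client: the Prometheus client library for Python
psutil: collects system metrics such as CPU, memory, and disk usage (used here to simulate a system-monitoring scenario)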


pip install prometheus-client psutil

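II. Metric Types

prometheus_client provides four core metric types that cover the vast majority of monitoring scenarios. Learn what each one is for first, then look at the code:

Metric type	Key characteristic	Typical use
Gauge	Value can go up and down	CPU usage, memory in use, process count
Counter	Only increases (resets to zero when the process restarts)	Total requests, error count, network traffic
Histogram	Counts observations in buckets; quantiles are computed on the Prometheus server	Request duration, response size distribution
Summary	Tracks the count and sum of observations (the Python client does not expose quantiles)	Average latency when no buckets are predefined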

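III. Core API

1. Gauge

Core methods: set(value), inc() (add 1 by default), dec() (subtract 1 by default). Both inc() and dec() accept an optional amount.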


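from prometheus_client import Gauge

# Define the metric (name, help text)
gauge = Gauge("test_gauge", "Test gauge metric")
# Set an absolute value
gauge.set(50)
# Increase or decrease by a given amount
gauge.inc(2)   # now 52
gauge.dec(10)  # now 42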
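
Besides set(), inc(), and dec(), a Gauge can also track how many calls are currently running through track_inprogress(). The following is a minimal sketch; the metric name demo_jobs_in_progress and the private CollectorRegistry are assumptions made so the value can be read back without touching the global registry.

```python
from prometheus_client import CollectorRegistry, Gauge

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

in_progress = Gauge(
    "demo_jobs_in_progress",  # hypothetical metric name
    "Jobs currently being processed",
    registry=registry,
)


def current_value():
    """Read the gauge back from the registry (handy in tests)."""
    return registry.get_sample_value("demo_jobs_in_progress")


@in_progress.track_inprogress()  # +1 when the call starts, -1 when it ends
def process_job():
    return current_value()


value_during_job = process_job()
value_after_job = current_value()
print(value_during_job, value_after_job)
```

The gauge reads 1 while process_job() is running and drops back to 0 afterwards, which is the typical shape of an "in-flight requests" metric.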

2. Counter

Core methods: inc() (add 1), inc(value) (add a given non-negative amount)


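from prometheus_client import Counter

# Define the metric with two labels: method and endpoint
counter = Counter("request_total", "Total number of requests", ["method", "endpoint"])
# Pick a label combination, then increment it
counter.labels(method="GET", endpoint="/api").inc()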
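
A Counter can also count exceptions directly through count_exceptions(), which works as a context manager or decorator. The sketch below assumes a hypothetical parse_port() helper and a private registry; note that a counter declared as demo_failures is exposed with the _total suffix.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()  # private registry, for isolation only

failures = Counter(
    "demo_failures",  # exposed as demo_failures_total
    "Failed operations",
    ["operation"],
    registry=registry,
)


def parse_port(text):
    """Hypothetical helper: every ValueError raised here bumps the counter."""
    with failures.labels(operation="parse_port").count_exceptions(ValueError):
        return int(text)


for raw in ["8000", "not-a-port", "9090", ""]:
    try:
        parse_port(raw)
    except ValueError:
        pass  # the exception still propagates; the counter only observes it

failed = registry.get_sample_value(
    "demo_failures_total", {"operation": "parse_port"}
)
print(failed)
```

Two of the four inputs cannot be parsed, so the counter ends at 2.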

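3. Histogram

Records the distribution of durations or sizes in buckets. Quantiles such as P50/P95/P99 are computed later on the Prometheus server (for example with histogram_quantile):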


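import time

from prometheus_client import Histogram

# Define the metric with custom duration buckets (unit: seconds)
histo = Histogram("request_duration_seconds", "Request duration", buckets=(0.1, 0.5, 1.0, 5.0))
# Record a single observation
histo.observe(0.3)

# Decorator that times every call automatically
@histo.time()
def handle_request():
    time.sleep(0.2)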
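
Histogram buckets are cumulative: each le ("less than or equal") bucket counts every observation at or below its upper bound, and an implicit +Inf bucket holds them all. The sketch below illustrates this with made-up observations and a private registry; the metric name is hypothetical.

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()  # private registry, for isolation only

latency = Histogram(
    "demo_request_duration_seconds",  # hypothetical metric name
    "Request duration",
    buckets=(0.1, 0.5, 1.0, 5.0),
    registry=registry,
)

# Made-up observations, in seconds
for seconds in (0.05, 0.3, 0.3, 0.7, 2.0, 9.0):
    latency.observe(seconds)


def bucket(le):
    """Return the cumulative count of the bucket with the given upper bound."""
    return registry.get_sample_value(
        "demo_request_duration_seconds_bucket", {"le": le}
    )


counts = {le: bucket(le) for le in ("0.1", "0.5", "1.0", "5.0", "+Inf")}
total = registry.get_sample_value("demo_request_duration_seconds_count")
print(counts, total)
```

The 9.0 observation exceeds every configured bound and lands only in +Inf, which is why bucket boundaries should cover the latencies you actually expect.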

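4. Summary

Needs no buckets and is used the same way as Histogram. Note that the Python client's Summary only exposes the count and the sum of observations; it does not compute quantiles: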


import time

from prometheus_client import Summary

summary = Summary("request_latency_seconds", "Request latency")
# Time a block of code automatically
with summary.time():
    time.sleep(0.4)
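
To see what a Summary actually exposes, the sketch below records three made-up observations into a private registry and inspects the resulting samples. The metric name is hypothetical; the point is that only _count and _sum carry data, so an average can be derived but a percentile cannot.

```python
from prometheus_client import CollectorRegistry, Summary

registry = CollectorRegistry()  # private registry, for isolation only

latency = Summary(
    "demo_latency_seconds",  # hypothetical metric name
    "Request latency",
    registry=registry,
)

for seconds in (0.2, 0.4, 0.6):  # made-up observations
    latency.observe(seconds)

count = registry.get_sample_value("demo_latency_seconds_count")
total = registry.get_sample_value("demo_latency_seconds_sum")
mean = total / count  # the average is all a Summary can give you here

# Check whether any exposed sample carries a "quantile" label
has_quantiles = any(
    "quantile" in sample.labels
    for metric in registry.collect()
    for sample in metric.samples
)
print(count, total, mean, has_quantiles)
```

If percentiles are needed, a Histogram with suitable buckets is the usual choice with this client.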

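IV. A Simple Exporter

Start with a minimal working version that exposes two system metrics. It can be up and running in about a minute: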


#!/usr/bin/env python3
# simple_exporter.py
from prometheus_client import start_http_server, Gauge
import psutil
import time

# 1. Define the Gauge metrics (name, help text)
cpu_usage = Gauge("system_cpu_usage_percent", "System CPU usage percentage")
memory_used = Gauge("system_memory_used_bytes", "System memory in use, in bytes")

def collect_metrics():
    """Read system metrics and store them in the metric objects."""
    # CPU usage (blocks for 1 second while sampling)
    cpu_usage.set(psutil.cpu_percent(interval=1))
    # Memory in use
    mem = psutil.virtual_memory()
    memory_used.set(mem.used)

if __name__ == "__main__":
    # 2. Start the HTTP server and expose metrics on port 8000
    start_http_server(8000)
    print("Exporter started: http://localhost:8000/metrics")

    # 3. Collect metrics in a loop
    while True:
        collect_metrics()
        time.sleep(15)  # wait 15 seconds between collections

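Run and verify:

Run the script: python simple_exporter.py
Open http://localhost:8000/metrics in a browser. Alongside the client's built-in process metrics, the custom metrics appear in output similar to this: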


# HELP system_cpu_usage_percent System CPU usage percentage
# TYPE system_cpu_usage_percent gauge
system_cpu_usage_percent 12.5
# HELP system_memory_used_bytes System memory in use, in bytes
# TYPE system_memory_used_bytes gauge
system_memory_used_bytes 8589934592
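
The same check can be scripted instead of done in a browser. The sketch below parses exposition text with the client's own parser; the sample text is copied from the output above, and parse_metrics() is a hypothetical helper that ignores labels for brevity.

```python
from prometheus_client.parser import text_string_to_metric_families

# Exposition text as an exporter would serve it on /metrics
sample_output = """\
# HELP system_cpu_usage_percent System CPU usage percentage
# TYPE system_cpu_usage_percent gauge
system_cpu_usage_percent 12.5
# HELP system_memory_used_bytes System memory in use, in bytes
# TYPE system_memory_used_bytes gauge
system_memory_used_bytes 8589934592
"""


def parse_metrics(text):
    """Return {sample_name: value}; labels are ignored in this sketch."""
    values = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            values[sample.name] = sample.value
    return values


parsed = parse_metrics(sample_output)
print(parsed)
```

Against a running exporter, the text could be fetched first with urllib.request.urlopen("http://localhost:8000/metrics").read().decode() and passed to the same helper.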

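V. An Advanced Exporter

In production, exporters are usually wrapped in a class that manages several metrics and labels. The version below covers CPU, memory, disk, and network metrics with a cleaner structure: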


#!/usr/bin/env python3
# advanced_exporter.py
from prometheus_client import start_http_server, Gauge, Counter
import psutil
import time
import threading
import logging

# Logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

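class SystemExporter:
    def __init__(self):
        # Define all monitored metrics
        # CPU metrics
        self.cpu_usage = Gauge("system_cpu_usage_percent", "CPU usage percentage")
        self.cpu_count = Gauge("system_cpu_count", "Number of logical CPU cores")

        # Memory metrics
        self.memory_used = Gauge("system_memory_used_bytes", "Memory in use, in bytes")
        self.memory_total = Gauge("system_memory_total_bytes", "Total memory, in bytes")

        # Disk metrics (labelled by partition)
        self.disk_used = Gauge("system_disk_used_bytes", "Used disk space, in bytes", ["partition"])

        # Network metrics (Counter: only ever increases)
        self.net_sent = Counter("system_net_sent_bytes_total", "Total bytes sent", ["interface"])
        self.net_recv = Counter("system_net_recv_bytes_total", "Total bytes received", ["interface"])
        # psutil reports cumulative totals, so remember the last reading and
        # add only the difference to the Counters on each cycle.
        self._last_net = psutil.net_io_counters()

    def collect(self):
        """Collect all system metrics."""
        try:
            # CPU
            self.cpu_usage.set(psutil.cpu_percent(interval=1))
            self.cpu_count.set(psutil.cpu_count())

            # Memory
            mem = psutil.virtual_memory()
            self.memory_used.set(mem.used)
            self.memory_total.set(mem.total)

            # Disk (label value: partition="/")
            disk = psutil.disk_usage("/")
            self.disk_used.labels(partition="/").set(disk.used)

            # Network (aggregate over all interfaces, exposed as interface="default")
            net = psutil.net_io_counters()
            # max(0, ...) guards against the OS counters wrapping or resetting
            self.net_sent.labels(interface="default").inc(max(0, net.bytes_sent - self._last_net.bytes_sent))
            self.net_recv.labels(interface="default").inc(max(0, net.bytes_recv - self._last_net.bytes_recv))
            self._last_net = net

            logger.info("Metrics collected")
        except Exception:
            # logger.exception records the traceback along with the message
            logger.exception("Metric collection failed")

    def start_collect_loop(self, interval=15):
        """Collect metrics repeatedly in a background thread."""
        def loop():
            while True:
                self.collect()
                time.sleep(interval)
        threading.Thread(target=loop, daemon=True).start()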

if __name__ == "__main__":
    exporter = SystemExporter()
    exporter.start_collect_loop(15)  # collect every 15 seconds
    start_http_server(8000)
    logger.info("Exporter running: http://localhost:8000/metrics")

    # Keep the main thread alive (the collector thread is a daemon)
    while True:
        time.sleep(1)
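
An alternative to a timed background loop is a custom collector, which reads values only when Prometheus scrapes, so the data is never stale. The sketch below is one possible shape, not the article's exporter: DiskCollector is a hypothetical class, and it uses the standard library's shutil.disk_usage in place of psutil so that it runs without extra dependencies.

```python
import shutil

from prometheus_client import CollectorRegistry
from prometheus_client.core import GaugeMetricFamily


class DiskCollector:
    """Hypothetical collector: reads disk usage at scrape time, not on a timer."""

    def __init__(self, paths):
        self.paths = list(paths)

    def collect(self):
        # Called by the registry on every scrape
        used = GaugeMetricFamily(
            "system_disk_used_bytes",
            "Used disk space, in bytes",
            labels=["partition"],
        )
        for path in self.paths:
            # shutil.disk_usage stands in for psutil.disk_usage here
            used.add_metric([path], shutil.disk_usage(path).used)
        yield used


registry = CollectorRegistry()  # private registry, for isolation only
registry.register(DiskCollector(["/"]))

disk_used = registry.get_sample_value(
    "system_disk_used_bytes", {"partition": "/"}
)
print(disk_used)
```

To serve it, the registry can be passed along as start_http_server(8000, registry=registry). The trade-off is that slow reads now delay the scrape itself, so this pattern suits cheap measurements best.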

VI. Comprehensive Example


from prometheus_client import Counter, Histogram, Summary, start_http_server
import time
import random
from flask import Flask

app = Flask(__name__)

# 1. Counter: total requests and total errors
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_errors_total = Counter(
    'http_errors_total',
    'Total HTTP errors',
    ['error_type']
)

# 2. Histogram: response-time distribution (used to compute percentiles in Prometheus)
request_duration_histogram = Histogram(
    'request_duration_histogram_seconds',
    'Request duration (Histogram)',
    ['endpoint'],
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 5.0)
)

# 3. Summary: response-time count and sum (the Python client exposes no quantiles)
request_duration_summary = Summary(
    'request_duration_summary_seconds',
    'Request duration (Summary)',
    ['endpoint']
)

@app.route('/api/users')
def get_users():
    """Return the user list."""
    start_time = time.time()

    try:
        # Simulate some work
        time.sleep(random.uniform(0.01, 0.5))

        # Record the successful request
        http_requests_total.labels(
            method='GET',
            endpoint='/api/users',
            status=200
        ).inc()

        # Record the response time in both the Histogram and the Summary
        duration = time.time() - start_time
        request_duration_histogram.labels(endpoint='/api/users').observe(duration)
        request_duration_summary.labels(endpoint='/api/users').observe(duration)

        return {'users': ['Alice', 'Bob']}, 200

    except Exception as e:
        # Record the error
        http_errors_total.labels(error_type=type(e).__name__).inc()
        http_requests_total.labels(
            method='GET',
            endpoint='/api/users',
            status=500
        ).inc()
        return {'error': str(e)}, 500

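if __name__ == '__main__':
    # Expose metrics on port 8000 (served from a background thread)
    start_http_server(8000)

    # Once running, the two endpoints are:
    # - Application: http://localhost:5000/api/users
    # - Metrics:     http://localhost:8000/metrics

    # Start the application on port 5000 (this call blocks)
    app.run(port=5000)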
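
Repeating the counting and timing code in every handler gets tedious. One way to factor it out is a decorator, sketched below without Flask so it runs on its own. The instrument() decorator, the demo_ metric names, and the convention that handlers return a (body, status) tuple are all assumptions of this sketch.

```python
import functools
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()  # private registry, for isolation only

REQUESTS = Counter(
    "demo_http_requests", "Total HTTP requests",
    ["endpoint", "status"], registry=registry,
)
DURATION = Histogram(
    "demo_http_request_duration_seconds", "Request duration",
    ["endpoint"], registry=registry,
)


def instrument(endpoint):
    """Hypothetical decorator: count and time a handler returning (body, status)."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = 500  # assumed until the handler returns normally
            try:
                body, status = handler(*args, **kwargs)
                return body, status
            finally:
                # Runs on success and on exceptions alike
                DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)
                REQUESTS.labels(endpoint=endpoint, status=str(status)).inc()
        return wrapper
    return decorator


@instrument("/api/users")
def get_users():
    return {"users": ["Alice", "Bob"]}, 200


@instrument("/api/fail")
def failing():
    raise RuntimeError("boom")


get_users()
try:
    failing()
except RuntimeError:
    pass  # unlike the handler above, this sketch lets the exception propagate

ok = registry.get_sample_value(
    "demo_http_requests_total", {"endpoint": "/api/users", "status": "200"})
failed = registry.get_sample_value(
    "demo_http_requests_total", {"endpoint": "/api/fail", "status": "500"})
observed = registry.get_sample_value(
    "demo_http_request_duration_seconds_count", {"endpoint": "/api/users"})
print(ok, failed, observed)
```

In a Flask app the decorator would sit directly below @app.route(...), so that Flask registers the instrumented wrapper.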

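VII. Connecting to Prometheus

Once the exporter is running, add a scrape job to the Prometheus configuration file prometheus.yml and Prometheus will collect the metrics automatically: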


scrape_configs:
  - job_name: "python_exporter"
    scrape_interval: 15s  # scrape metrics every 15 seconds
    static_configs:
      - targets: ["localhost:8000"]  # address of your exporter

After restarting Prometheus, the custom metrics can be queried in the Prometheus UI.
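
The metrics can also be queried from a script through the Prometheus HTTP API's instant-query endpoint, /api/v1/query. The sketch below assumes Prometheus listens on its default address http://localhost:9090; build_query_url() and instant_query() are hypothetical helpers, and only the URL builder is exercised here because the query itself needs a running server.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instant_query(base_url, promql, timeout=5):
    """Run an instant query and return the result list (needs a live server)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload["data"]["result"]


# Assumed address: Prometheus on its default port 9090
url = build_query_url("http://localhost:9090", "system_cpu_usage_percent")
print(url)
```

With Prometheus running, instant_query("http://localhost:9090", "system_cpu_usage_percent") would return one entry per matching time series.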

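VIII. Summary

Core value: implement custom monitoring quickly in Python, as a replacement for or complement to node_exporter
Choosing a metric type: use Gauge for values that fluctuate, Counter for cumulative totals, and Histogram for duration distributions
Labels: labels(key=value) gives a metric multiple dimensions (for example partition, interface, status code)
Deployment practice: collect metrics in a background thread and expose them on a separate HTTP port so that business logic is never blocked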