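Prometheus Quick Start in Practice

Introduction

Prometheus was inspired by Borgmon, Google's internal monitoring system (much as Kubernetes grew out of Google's Borg system). In May 2016 it became the second project to join the CNCF, after Kubernetes, and version 1.0 was released in July of the same year. Version 2.0, built on an entirely new storage layer, followed at the end of 2017 and works better with container and cloud platforms.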

Official website: https://prometheus.io

Source code: https://github.com/prometheus

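Advantages

Prometheus is a complete open-source monitoring solution. It overturns the testing and alerting model of traditional monitoring systems and replaces it with a new model built on centralized rule evaluation, unified analysis, and alerting. Compared with traditional monitoring systems it has the following advantages.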

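Easy to manage

Prometheus is deployed as a single binary compiled from Go, with no third-party dependencies, and it can use service discovery to manage monitoring targets dynamically.

Monitoring the internal state of services

Prometheus provides client libraries for common programming languages. Applications can use them to expose application-level metrics so that their internal runtime state can be collected.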

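A powerful query language: PromQL

Prometheus has a built-in query language, PromQL, for querying and aggregating monitoring data. PromQL is also used for data visualization (for example in Grafana) and in alerting rules.

Efficient

In a monitoring system, a large number of monitoring tasks inevitably produces a large volume of data, and Prometheus handles that data efficiently.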

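Scalable

Prometheus is fairly simple to configure. You can run an independent Prometheus server in each data center, or use federation to combine several Prometheus instances into one logical cluster. When a single Prometheus server has too many tasks to handle, it can be scaled out through functional sharding combined with federation.

Easy to integrate

Official client SDKs are available for several languages, so applications can be brought under monitoring quickly. Prometheus can also be integrated with other monitoring systems.

Visualization

Prometheus server ships with a built-in UI for querying data and plotting graphs, and it can be connected to Grafana to display metrics in polished dashboards.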

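Architecture

Prometheus scrapes data from the Pushgateway and from jobs, stores it in its backend storage, makes it queryable through PromQL, and pushes alerts to Alertmanager. Alertmanager then sends notifications according to its routing rules.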

Prometheus server

The core Prometheus component, responsible for collecting, storing, and querying monitoring data.

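Exporter

Put simply, an exporter is the collection side: it runs an HTTP service that exposes a URL, and Prometheus server obtains the monitoring data by requesting that endpoint. Exporters fall into two broad categories.

Direct collection: the target has built-in support for Prometheus monitoring, for example cAdvisor and Kubernetes.

Indirect collection: the original target does not support Prometheus, so a collector program has to be written with the Prometheus client libraries, for example MySQL Exporter and JMX Exporter.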
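
To make the idea of an endpoint concrete, here is a minimal sketch of an exporter written with only the Python standard library. It is not how the official exporters are built (they use the Prometheus client libraries); the metric names, the SAMPLES data, and port 8000 are all made up for illustration, and label-value escaping is left out.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {(name, ((label, value), ...)): number} as plain-text metric lines."""
    lines = []
    for (name, labels), value in samples.items():
        label_text = ",".join(f'{k}="{v}"' for k, v in labels)
        selector = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical data; a real exporter would read it from the monitored system.
SAMPLES = {
    ("demo_queue_length", ()): 3,
    ("demo_jobs_processed_total", (("status", "ok"),)): 42,
}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on http://localhost:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding localhost:8000 as a target would let Prometheus server scrape these two sample values.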

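Alertmanager

Prometheus supports alerting rules written in PromQL. When a rule's condition is met, an alert is generated and sent to Alertmanager for processing. Alertmanager can deliver notifications by email or Slack, or through a webhook for custom integrations.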

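Pushgateway

Prometheus collects data with a pull model, so Prometheus server must be able to reach each exporter over the network. When the network does not allow a direct connection, a Pushgateway can act as a relay: jobs on the internal network push their metrics to the gateway, and Prometheus pulls the data from the Pushgateway.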
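
The push side can be sketched roughly as follows. The Pushgateway accepts metrics over HTTP at /metrics/job/<job>, optionally followed by further label pairs such as /instance/<instance>. The gateway address, job name, and metric used here are assumptions for illustration, and the sketch assumes that job and instance names contain no slashes.

```python
import urllib.request
from urllib.parse import quote


def push_url(gateway, job, instance=None):
    """Build the Pushgateway URL for a job (and optionally an instance)."""
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    if instance is not None:
        url += f"/instance/{quote(instance, safe='')}"
    return url


def push(gateway, job, body, instance=None):
    """PUT a text-format metrics body to the gateway and return the HTTP status."""
    request = urllib.request.Request(
        push_url(gateway, job, instance),
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Example call against a hypothetical gateway on its conventional port 9091:
# push("http://localhost:9091", "batch_job", "demo_last_run_seconds 12.5\n")
```

Prometheus then scrapes the Pushgateway like any other target, so the pushed values reach the server through the normal pull path.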

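Web UI

Prometheus has a simple built-in web console for querying metrics and viewing configuration, service discovery status, and so on. In practice, Grafana is usually used for browsing metrics and building dashboards, with Prometheus as its data source.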

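Data model

Prometheus stores all data as time series. Each time series is identified by a metric name and a set of key-value pairs called labels; samples with the same metric name and the same label set belong to the same time series.

Format: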

# A metric name followed by a set of key-value labels
<metric name>{<label name>=<label value>, ...}

For example, the metric name api_http_requests_total with the labels method="POST" and handler="/messages" is written as:

api_http_requests_total{method="POST", handler="/messages"}
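
The notation above can be reproduced with a few lines of Python. This is only an illustrative sketch: format_series and series_identity are hypothetical helper names, not part of any Prometheus library, and label-value escaping is ignored.

```python
def format_series(name, labels):
    """Write a metric name and its labels in the <name>{<label>="<value>", ...} notation."""
    pairs = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{name}{{{pairs}}}"


def series_identity(name, labels):
    """Two samples belong to the same time series when this key is equal."""
    return (name, frozenset(labels.items()))


post_messages = {"method": "POST", "handler": "/messages"}
print(format_series("api_http_requests_total", post_messages))
```

The order of the labels does not matter for identity, but changing any label value (for example method="GET") yields a different time series.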

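Metric types

Prometheus has four metric types: Counter, Gauge, Histogram, and Summary.

Counter

A counter only increases and never decreases (apart from resetting to zero when the process restarts). It describes the cumulative state of a metric, such as the total number of requests, http_requests_total.

Gauge

A gauge can go up and down. It describes the current state of a metric, such as free system memory, node_memory_MemFree_bytes.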

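Histogram

A histogram describes the distribution of a metric. Taking request latency as an example: out of 100,000 requests in total, 50,000 took less than 10 ms, 90,000 took less than 50 ms, and 99,000 took less than 100 ms.

Summary

Like a histogram, a summary describes the distribution of a metric, but in a different form. For the same request latency, a summary would report: out of 100,000 requests in total, 50% took less than 10 ms, 90% took less than 50 ms, and 99% took less than 100 ms.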
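
The difference between the two forms can be shown on a small data set. The sketch below is a simplification under stated assumptions: the latency values are invented, buckets count observations less than or equal to each bound, and the quantiles use a simple nearest-rank rule over the raw values, whereas real client libraries estimate summary quantiles over a stream.

```python
import math


def histogram_buckets(observations, upper_bounds):
    """Cumulative counts: how many observations are <= each upper bound."""
    return {le: sum(1 for x in observations if x <= le) for le in upper_bounds}


def summary_quantiles(observations, quantiles):
    """Nearest-rank quantiles computed over the raw observations."""
    ordered = sorted(observations)
    result = {}
    for q in quantiles:
        rank = max(1, math.ceil(q * len(ordered)))
        result[q] = ordered[rank - 1]
    return result


# Hypothetical response times in milliseconds.
latencies_ms = [4, 6, 8, 9, 12, 20, 35, 48, 70, 150]

buckets = histogram_buckets(latencies_ms, [10, 50, 100])
quantiles = summary_quantiles(latencies_ms, [0.5, 0.9, 0.99])
print(buckets)    # {10: 4, 50: 8, 100: 9}
print(quantiles)  # {0.5: 12, 0.9: 70, 0.99: 150}
```

A histogram answers "how many requests fell under each threshold", while a summary answers "which latency did a given share of requests stay under".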

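Installation

Now that we have a rough understanding of Prometheus, let's install it.

Installing on Linux

Prometheus is written in Go, so installing it only requires downloading the binary package.

Download the latest version from the official website.

Download page: https://prometheus.io/download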

[root@localhost prometheus]# tar -zxvf prometheus-2.37.1.linux-amd64.tar.gz 
prometheus-2.37.1.linux-amd64/
prometheus-2.37.1.linux-amd64/consoles/
prometheus-2.37.1.linux-amd64/consoles/index.html.example
prometheus-2.37.1.linux-amd64/consoles/node-cpu.html
prometheus-2.37.1.linux-amd64/consoles/node-disk.html
prometheus-2.37.1.linux-amd64/consoles/node-overview.html
prometheus-2.37.1.linux-amd64/consoles/node.html
prometheus-2.37.1.linux-amd64/consoles/prometheus-overview.html
prometheus-2.37.1.linux-amd64/consoles/prometheus.html
prometheus-2.37.1.linux-amd64/console_libraries/
prometheus-2.37.1.linux-amd64/console_libraries/menu.lib
prometheus-2.37.1.linux-amd64/console_libraries/prom.lib
prometheus-2.37.1.linux-amd64/prometheus.yml
prometheus-2.37.1.linux-amd64/LICENSE
prometheus-2.37.1.linux-amd64/NOTICE
prometheus-2.37.1.linux-amd64/prometheus
prometheus-2.37.1.linux-amd64/promtool
[root@localhost prometheus]# cd prometheus-2.37.1.linux-amd64
[root@localhost prometheus-2.37.1.linux-amd64]# ll
total 206252
drwxr-xr-x. 2 3434 3434        38 Sep 12 09:04 console_libraries
drwxr-xr-x. 2 3434 3434       173 Sep 12 09:04 consoles
-rw-r--r--. 1 3434 3434     11357 Sep 12 09:04 LICENSE
-rw-r--r--. 1 3434 3434      3773 Sep 12 09:04 NOTICE
-rwxr-xr-x. 1 3434 3434 109681846 Sep 12 08:46 prometheus
-rw-r--r--. 1 3434 3434       934 Sep 12 09:04 prometheus.yml
-rwxr-xr-x. 1 3434 3434 101497637 Sep 12 08:49 promtool
[root@localhost prometheus-2.37.1.linux-amd64]# ./prometheus --help
usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help                     Show context-sensitive help (also try --help-long and --help-man).
      --version                  Show application version.

Prometheus is started with a YAML configuration file. When running the binary directly, use the following command:

./prometheus --config.file=prometheus.yml

The basic configuration in prometheus.yml looks like this:

global:
  scrape_interval:     15s
  evaluation_interval: 15s
rule_files:
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

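This configuration file contains three sections: global, rule_files, and scrape_configs.

global

This section controls the global configuration of Prometheus server:

- scrape_interval: how often Prometheus scrapes metrics. The built-in default is 1m; this file sets it to 15s, and the value can be overridden;

- evaluation_interval: how often rules are evaluated. Prometheus uses rules to produce new time series or to generate alerts.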

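rule_files

Specifies where the rule files are located. Prometheus loads rules from these files and uses them to produce new time series or alerts. No rules are configured yet.

scrape_configs

Controls which resources Prometheus monitors.

Because Prometheus exposes its own monitoring data over HTTP, it can also monitor its own health. The default configuration contains a single job named prometheus, which collects time series from the Prometheus server itself. The job has a single, statically configured target: port 9090 on localhost. By default Prometheus scrapes metrics from the /metrics path of each target, so this job scrapes http://localhost:9090/metrics. The collected time series describe the state and performance of the Prometheus server. To monitor other resources, add them under scrape_configs.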
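
How a job entry turns into a scrape URL can be sketched like this. The job is represented as a plain Python dict that mirrors the YAML; scrape_urls is a hypothetical helper, and it covers only static_configs with the default scheme and path described above.

```python
def scrape_url(target, scheme="http", metrics_path="/metrics"):
    """Combine scheme, host:port target, and path into the URL that gets scraped."""
    return f"{scheme}://{target}{metrics_path}"


def scrape_urls(scrape_config):
    """Expand one scrape_configs entry (as a dict) into its scrape URLs."""
    scheme = scrape_config.get("scheme", "http")
    path = scrape_config.get("metrics_path", "/metrics")
    return [
        scrape_url(target, scheme, path)
        for group in scrape_config.get("static_configs", [])
        for target in group.get("targets", [])
    ]


# The default job from prometheus.yml, written as a dict.
default_job = {
    "job_name": "prometheus",
    "static_configs": [{"targets": ["localhost:9090"]}],
}
print(scrape_urls(default_job))  # ['http://localhost:9090/metrics']
```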

[root@localhost prometheus-2.37.1.linux-amd64]# ./prometheus --config.file=prometheus.yml

Installing with Docker

Docker users can start Prometheus server directly from the Prometheus image:

docker run -d -p 9090:9090 -v /etc/prometheus:/etc/prometheus prom/prometheus

Once it is running, the Prometheus UI is available at http://localhost:9090.
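
Besides the UI, the same address serves an HTTP API, and an instant PromQL query can be sent to /api/v1/query. The sketch below assumes a server on localhost:9090 and uses the expression up purely as an example; the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url, expr):
    """Run the query and return the decoded JSON response (needs a running server)."""
    with urllib.request.urlopen(instant_query_url(base_url, expr), timeout=5) as response:
        return json.load(response)


print(instant_query_url("http://localhost:9090", "up"))
# http://localhost:9090/api/v1/query?query=up
```

With a server running, instant_query("http://localhost:9090", "up") should return a JSON document describing each scrape target that matches the expression.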

Configuration file in detail

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

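global

This section holds the global Prometheus configuration, such as the scrape interval and the scrape timeout.

rule_files

This section lists the rule files. Prometheus evaluates the alerting rules they contain and pushes the resulting alerts to Alertmanager.

scrape_configs

This section holds the scrape configuration, which defines how Prometheus collects data.

alerting

This section holds the alerting configuration, mainly the addresses of the Alertmanager instances that Prometheus pushes alerts to.

remote_write

Specifies the write API address of the remote storage backend.

remote_read

Specifies the read API address of the remote storage backend.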

global

# How frequently to scrape targets by default.
[ scrape_interval: <duration> | default = 1m ] # scrape interval

# How long until a scrape request times out.
[ scrape_timeout: <duration> | default = 10s ] # scrape timeout

# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ] # rule evaluation interval

# The labels to add to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
external_labels: # labels attached for external systems
  [ <labelname>: <labelvalue> ... ]
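
The <duration> values above are strings such as 15s or 1m. As a rough illustration of what they mean, the sketch below converts such a string to seconds. It is not the parser Prometheus itself uses: parse_duration is a hypothetical helper that accepts the units ms, s, m, h, d, w, and y (with a year taken as 365 days) and does not enforce any ordering of units.

```python
import re

# Seconds per unit; a week is 7 days and a year is 365 days.
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
# "ms" is listed before "m" so that "15ms" is not read as 15 minutes.
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text):
    """Convert a duration string such as '15s' or '1h30m' to seconds."""
    if not text:
        raise ValueError("empty duration")
    position, total = 0, 0.0
    while position < len(text):
        match = _PART.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
        position = match.end()
    return total


print(parse_duration("15s"))    # 15.0
print(parse_duration("1m"))     # 60.0
print(parse_duration("1h30m"))  # 5400.0
```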

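scrape_config

A scrape_config section specifies a set of targets and parameters. Targets are the instances, that is, the endpoints to scrape; the parameters describe how to scrape them. The main parameters are:

scrape_interval

Scrape interval; inherits the global value by default.

scrape_timeout

Scrape timeout; inherits the global value by default.

metrics_path

Scrape path; defaults to /metrics.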

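scheme

Protocol used for scraping, http or https.

params

URL query parameters.

basic_auth

HTTP basic authentication credentials.

*_sd_configs

Service discovery configuration.

static_configs

Statically configured targets for the job.

relabel_configs

Relabeling settings.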
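
The inheritance rule for scrape_interval and scrape_timeout can be sketched as a small merge: built-in defaults first, then the global section, then the job's own values. The configuration is represented as plain dicts, effective_job_settings is a hypothetical helper, and only the four parameters with defaults mentioned in this article are covered.

```python
# Built-in defaults as listed in the global section and the parameter descriptions.
GLOBAL_DEFAULTS = {"scrape_interval": "1m", "scrape_timeout": "10s"}
JOB_DEFAULTS = {"metrics_path": "/metrics", "scheme": "http"}


def effective_job_settings(global_config, job_config):
    """Resolve the settings one job is scraped with."""
    settings = {**GLOBAL_DEFAULTS, **JOB_DEFAULTS}
    # Values from the global section replace the built-in defaults.
    settings.update({k: v for k, v in global_config.items() if k in GLOBAL_DEFAULTS})
    # Values set on the job itself take precedence over everything else.
    settings.update({k: v for k, v in job_config.items() if k in settings})
    return settings


global_section = {"scrape_interval": "15s", "evaluation_interval": "15s"}
print(effective_job_settings(global_section, {"job_name": "prometheus"}))
# {'scrape_interval': '15s', 'scrape_timeout': '10s', 'metrics_path': '/metrics', 'scheme': 'http'}
```

A job that sets its own scrape_interval, for example 30s, keeps that value regardless of what the global section says.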