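Custom monitoring in Prometheus (Pushgateway and blackbox_exporter) and remote storage with VictoriaMetrics

1 Collecting metrics with Pushgateway

1.1 Pushing custom key-value metrics

When you need a metric that no exporter provides, you can write a script, run it on a schedule (for example with cron), and have it push the data to Pushgateway, which Prometheus then scrapes.

1. Deploy Pushgateway, using node 10.0.0.41 as an example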

1. Download the release
wget https://github.com/prometheus/pushgateway/releases/download/v1.9.0/pushgateway-1.9.0.linux-amd64.tar.gz

2. Unpack the archive
tar xf pushgateway-1.9.0.linux-amd64.tar.gz -C /zhiyong18/softwares/

3. Start Pushgateway; it listens on port 9091 by default
cd /zhiyong18/softwares/pushgateway-1.9.0.linux-amd64/
./pushgateway

4. Open the Pushgateway web UI
http://10.0.0.41:9091/#

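2. Add a new scrape job to Prometheus

[root@prometheus-server31 ~]# vim /zhiyong18/softwares/prometheus-2.53.2.linux-amd64/prometheus.yml
...
  - job_name: zhiyong18-zhiyong-pushgateway
    # Defaults to false when omitted.
    # true:  if a scraped metric carries a label that conflicts with one Prometheus attaches
    #        (such as job or instance), the label from the scraped data is kept.
    # false: the Prometheus-side label is kept and the scraped one is renamed to "exported_<label>".
    honor_labels: true
    static_configs:
    - targets:
      - 10.0.0.41:9091

3. Verify: open http://10.0.0.31:9090/targets and the new target should be listed.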

4. Push a test sample to Pushgateway. The data is a key-value pair: the key is the metric name (a string) and the value must be numeric.

echo "wzy_age 18" | curl --data-binary @- \
http://10.0.0.41:9091/metrics/job/zhiyong18_student/instance/10.0.0.31

5. Open Prometheus and query wzy_age to see the received sample.
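
To make the effect of honor_labels in the job above more concrete, here is a minimal Python sketch of the conflict rule described in the configuration comments. It is a simplified model and not Prometheus's actual code; the function name merge_labels and the two sample label sets are hypothetical and chosen only for illustration.

```python
def merge_labels(scraped, target, honor_labels):
    """Merge labels carried by a scraped sample with labels attached by the scrape target."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped and honor_labels:
            # honor_labels: true -> keep the value that came with the scraped data
            continue
        if name in scraped:
            # honor_labels: false -> the server-side value wins and the
            # scraped value is preserved under an "exported_" prefix
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged


# Labels pushed with the sample (from the grouping key in the push URL)
pushed = {"job": "zhiyong18_student", "instance": "10.0.0.31"}
# Labels Prometheus would attach for the Pushgateway scrape job
server = {"job": "zhiyong18-zhiyong-pushgateway", "instance": "10.0.0.41:9091"}

print(merge_labels(pushed, server, honor_labels=True))
print(merge_labels(pushed, server, honor_labels=False))
```

With honor_labels set to true the pushed job and instance survive, which is usually what you want for Pushgateway, since those labels identify where the data really came from.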

1.2 Example: monitoring TCP connection states

1. Write a script that periodically sends the local TCP state counts to the Pushgateway at 10.0.0.41

cat /usr/local/bin/tcp_status.sh
#!/bin/bash
# Counters for the 12 TCP states
ESTABLISHED_COUNT=0
SYN_SENT_COUNT=0
SYN_RECV_COUNT=0
FIN_WAIT1_COUNT=0
FIN_WAIT2_COUNT=0
TIME_WAIT_COUNT=0
CLOSE_COUNT=0
CLOSE_WAIT_COUNT=0
LAST_ACK_COUNT=0
LISTEN_COUNT=0
CLOSING_COUNT=0
UNKNOWN_COUNT=0

# Job name
JOB_NAME=tcp_status
# Instance name
INSTANCE_NAME=harbor250
# Pushgateway host
HOST=10.0.0.41
# Pushgateway port
PORT=9091

# The 12 TCP states
ALL_STATUS=(ESTABLISHED SYN_SENT SYN_RECV FIN_WAIT1 FIN_WAIT2 TIME_WAIT CLOSE CLOSE_WAIT LAST_ACK LISTEN CLOSING UNKNOWN)

# Associative array, similar to a Python dict or a Go map
declare -A tcp_status

# Count the connections in each of the 12 states
for i in "${ALL_STATUS[@]}"
do
  # -w matches whole words, so CLOSE does not also count CLOSE_WAIT
  temp=$(netstat -untalp | grep -cw "$i")
  tcp_status[${i}]=$temp
done

# Push the results to Pushgateway, one metric per state
for i in "${!tcp_status[@]}"
do
   data="$i ${tcp_status[$i]}"
   # Note: pushing one metric name with a different label per request, as in the
   # commented line below, keeps only the last state. A POST to the same grouping key
   # replaces every earlier series with that metric name, so all label variants
   # would have to be sent in a single request body.
   #data="zhiyong18_tcp_all_status{status=\"$i\"} ${tcp_status[$i]}"
   #echo $data
   echo "$data" | curl --data-binary @- "http://${HOST}:${PORT}/metrics/job/${JOB_NAME}/instance/${INSTANCE_NAME}"
   # sleep 1
done

Run it: bash /usr/local/bin/tcp_status.sh

2. Open Prometheus and query the pushed metrics, for example ESTABLISHED or TIME_WAIT.
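
The limitation noted in the script's TODO comes from how Pushgateway treats repeated pushes to the same grouping key: a POST replaces the earlier series that share the pushed metric name. One way around it is to send all states in a single request body. The following is a rough sketch in Python using only the standard library. The helper name build_push is hypothetical, the counts are made-up sample values, and the metric name zhiyong18_tcp_all_status is taken from the commented-out line in the script.

```python
from urllib.request import Request


def build_push(host, port, job, instance, counts):
    """Build one POST request that carries every TCP state as a labeled series."""
    lines = ["# TYPE zhiyong18_tcp_all_status gauge"]
    for state, count in sorted(counts.items()):
        lines.append(f'zhiyong18_tcp_all_status{{status="{state}"}} {count}')
    # The text exposition format expects the body to end with a newline
    body = "\n".join(lines) + "\n"
    url = f"http://{host}:{port}/metrics/job/{job}/instance/{instance}"
    return Request(url, data=body.encode(), method="POST")


# Made-up sample counts, for illustration only
req = build_push("10.0.0.41", 9091, "tcp_status", "harbor250",
                 {"ESTABLISHED": 12, "TIME_WAIT": 3, "LISTEN": 8})
print(req.full_url)
print(req.data.decode())
# To actually send it: urllib.request.urlopen(req)
```

Because all series arrive in one body, they can then be queried with a single name and filtered by the status label.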

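2 Black-box and white-box monitoring

2.1 Introduction

Black-box monitoring: symptom-oriented. It looks at the system from the outside and reports its current state rather than predicting future problems. For example, when the system fails, black-box monitoring raises an alert.

White-box monitoring: goes deeper and relies on information exposed from inside the system, such as logs and HTTP endpoints. It can detect current problems, anticipate imminent ones, and even reveal problems that are masked by retries.

Prometheus performs black-box monitoring with blackbox_exporter

blackbox_exporter overview

blackbox_exporter probes targets over HTTP, HTTPS, DNS, TCP, ICMP, and gRPC:
HTTP: probe a website and treat a 200 status code as a sign that the service is healthy
TCP: check whether a port on a host is listening
ICMP: ping a host to check connectivity
gRPC: call an interface and verify that the service works
DNS: check domain name resolution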

2.2 Example: monitoring website status with blackbox_exporter

01 Install blackbox_exporter

It can be installed on any node; it is not an agent that has to run on the monitored host.

1 Download the release
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz

2 Unpack the archive
tar xvf blackbox_exporter-0.25.0.linux-amd64.tar.gz -C /zhiyong18/softwares/

3 Start the service
cd /zhiyong18/softwares/blackbox_exporter-0.25.0.linux-amd64/
./blackbox_exporter

4 Open the blackbox_exporter web UI
http://10.0.0.32:9115/

02 Add blackbox_exporter probes

1. Edit the Prometheus configuration file and create a scrape job for blackbox_exporter

[root@prs31~]# cat /zhiyong18/softwares/prometheus-2.53.2.linux-amd64/prometheus.yml
...
  - job_name: 'zhiyong18-blackbox-exporter-http'
    # Scrape path; defaults to "/metrics" when omitted
    metrics_path: /probe
    # URL query parameters
    params:
      # Use blackbox_exporter's http_2xx module, which checks for a 2xx response status
      module: [http_2xx]
      # These two are custom URL parameters (not labels), added only for illustration
      names: ["zhiyong18"]
      name: ["zhiyong"]
    # Static configuration: targets are listed by hand
    static_configs:
        # Targets to probe
      - targets:
          # HTTPS is supported
        - https://www.jd.com/
          # HTTP is supported; Grafana as an example
        - http://10.0.0.31:3000
          # HTTP with a custom port; the Prometheus web UI as an example
        - http://10.0.0.31:9090
    # Relabel the targets
    relabel_configs:
        # Source label. "__address__" is a built-in label holding the target's address as listed above
      - source_labels: [__address__]
        # Target label. This adds a "target" parameter to the endpoint URL, naming the target to probe
        target_label: __param_target
        # Action to perform; defaults to "replace". Common actions: replace, keep, drop.
        # The full list is in the docs: https://prometheus.io/docs/prometheus/2.53/configuration/configuration/#relabel_action
        # Here it copies "__address__" into the target parameter.
        action: replace
      - source_labels: [__param_target]
        target_label: instance
        #target_label: instance2024

        # The two blocks above can also be written as:
      # - source_labels: [__address__]
      #   target_label: instance
      #   action: replace
      # - source_labels: [instance]
      #   target_label: __param_target
      #   action: replace
      - target_label: __address__
        # Replacement value: the address of the blackbox_exporter host
        replacement: 10.0.0.32:9115
Configuration without comments (this is the version used for the final test):

  - job_name: 'zhiyong18-blackbox-exporter-http'
    metrics_path: '/probe'
    params:
      module: [http_2xx]
      names: ["zhiyong18"]
      name: ["zhiyong"]
    static_configs:
    - targets:
      - https://www.jd.com/
      - http://10.0.0.31:3000

    # Rewrite labels
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
      action: replace

    - source_labels: [__param_target]
      target_label: instance

    # Send the scrape to blackbox_exporter, which probes the target
    - target_label: __address__
      replacement: 10.0.0.32:9115
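
The three relabel steps can be hard to follow on first read, so here is a small Python sketch that applies them by hand to one target and prints the URL Prometheus would end up scraping. It is only a model of the configuration above, not Prometheus's relabeling engine; the function name relabel_for_blackbox is hypothetical, and the custom names/name parameters are left out for brevity.

```python
from urllib.parse import urlencode


def relabel_for_blackbox(target, module, exporter):
    """Apply the three relabel steps to one target; return (labels, scrape URL)."""
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the original target into the ?target= URL parameter
    labels["__param_target"] = labels["__address__"]
    # Step 2: expose the probed target as the instance label
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at blackbox_exporter
    labels["__address__"] = exporter
    query = urlencode({"module": labels["__param_module"],
                       "target": labels["__param_target"]})
    url = "http://" + labels["__address__"] + "/probe?" + query
    return labels, url


labels, url = relabel_for_blackbox("https://www.jd.com/", "http_2xx", "10.0.0.32:9115")
print(labels["instance"])
print(url)
```

The point to notice is that the site being probed ends up in the instance label and the target parameter, while the request itself goes to blackbox_exporter on 10.0.0.32:9115.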

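03 Verify

1. Open the Prometheus web UI: http://10.0.0.31:9090/targets

2. Open the blackbox_exporter web UI: http://10.0.0.32:9115/

3. Visualize the data in Grafana; these two dashboard template IDs are a good starting point

7587
13659

The dashboards show the probed site's metrics: SSL certificate expiry, traffic, status codes, and so on.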

2.3 Host liveness checks with ICMP

1. Edit the Prometheus configuration file and add an ICMP probe job

  - job_name: 'zhiyong18-blackbox-exporter-icmp'
    metrics_path: /probe
    params:
      # If no module is given, blackbox_exporter defaults to "http_2xx"
      module: [icmp]
    static_configs:
    - targets:
      - 10.0.0.41
      - 10.0.0.42
      - 10.0.0.66
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      # If instance is not set here, it takes the value of "__address__"
      target_label: instance
    - target_label: __address__
      replacement: 10.0.0.32:9115

2. Check the targets page in Prometheus

3. Open the blackbox_exporter page

4. View the dashboard

Use template 13659 and filter by the job "zhiyong18-blackbox-exporter-icmp".

2.4 Monitoring TCP port availability

1. Edit the configuration

1 Edit the Prometheus configuration file
[root@prometheus-server31 ~]# vim /zhiyong18/softwares/prometheus-2.53.2.linux-amd64/prometheus.yml

...
scrape_configs:
  ...
  - job_name: 'zhiyong18-blackbox-exporter-tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 10.0.0.41:80
          - 10.0.0.42:22
          - 10.0.0.31:9090
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.0.0.32:9115

2. Open Prometheus

3. View the data in Grafana

Filter by the job "zhiyong18-blackbox-exporter-tcp".
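
For intuition about what a TCP connect probe checks, the sketch below does a comparable test in plain Python: it tries to open a TCP connection and reports whether that succeeded. This is an illustration of the idea only and not how blackbox_exporter is implemented; the function name tcp_port_open and the timeout value are assumptions.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def probe_targets(targets, timeout=3.0):
    """Map each "host:port" string to 1 (reachable) or 0 (not reachable)."""
    results = {}
    for target in targets:
        host, port = target.rsplit(":", 1)
        results[target] = 1 if tcp_port_open(host, int(port), timeout) else 0
    return results
```

For example, probe_targets(["10.0.0.41:80", "10.0.0.42:22", "10.0.0.31:9090"]) would return a dict of 1/0 values, similar in spirit to the probe_success metric shown for each target.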

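3 Extending Prometheus

3.1 Remote storage with VictoriaMetrics

VictoriaMetrics is a fast, cost-effective, and scalable monitoring solution and time series database. Keeping all data on a single Prometheus node is a single point of failure, so it is worth adding external storage that can be distributed and highly available.

Website: https://victoriametrics.com/

Documentation: https://docs.victoriametrics.com/

GitHub: https://github.com/VictoriaMetrics/VictoriaMetrics

Quick start: https://docs.victoriametrics.com/quick-start/

Cluster deployment reference (unofficial)

01 Install VictoriaMetrics

Single-node installation, using 10.0.0.32 as an example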

1 Download the release
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.93.16/victoria-metrics-linux-amd64-v1.93.16.tar.gz

2 Unpack the archive
tar xf victoria-metrics-linux-amd64-v1.93.16.tar.gz -C /usr/local/bin/

3 Write the systemd unit file (-retentionPeriod=6 keeps 6 months of data; without a suffix the unit is months)
cat > /etc/systemd/system/victoria-metrics.service <<'EOF'
[Unit]
Description=zhiyong18 Linux VictoriaMetrics Server
Documentation=https://docs.victoriametrics.com/
After=network.target

[Service]
ExecStart=/usr/local/bin/victoria-metrics-prod  \
   -httpListenAddr=0.0.0.0:8428 \
   -storageDataPath=/zhiyong18/data/victoria-metrics \
   -retentionPeriod=6

[Install]
WantedBy=multi-user.target
EOF

4 Start the service
systemctl daemon-reload
systemctl enable --now victoria-metrics.service
systemctl status victoria-metrics

5 Check that the port is listening
ss -ntl | grep 8428
LISTEN 0      4096         0.0.0.0:8428      0.0.0.0:*

6 Open the web UI
http://10.0.0.32:8428/

02 Use VictoriaMetrics

1. Edit the configuration file on prometheus-server 31: vim /zhiyong18/softwares/prometheus-2.53.2.linux-amd64/prometheus.yml

...
# Configure the VictoriaMetrics address as a top-level field; do not nest it under global
remote_write:
- url: http://10.0.0.32:8428/api/v1/write

2. Restart Prometheus so the new configuration takes effect

systemctl stop prometheus-server

/zhiyong18/softwares/prometheus-2.53.2.linux-amd64/prometheus \
--config.file=/zhiyong18/softwares/prometheus-2.53.2.linux-amd64/prometheus.yml

03 Configure a new Grafana data source

1. The data is now also written to VictoriaMetrics, so add a new data source in Grafana that points to it; otherwise Grafana cannot query the data stored there.

2. When importing the dashboard (1860), select the new data source.
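
Before wiring up Grafana, it can help to confirm that samples are really arriving in VictoriaMetrics. VictoriaMetrics exposes a Prometheus-compatible query API, so a quick check might look like the Python sketch below. The helper names instant_query_url and instant_query are hypothetical, and the query string is only an example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against a Prometheus-compatible API."""
    query = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instant_query(base_url, promql, timeout=5):
    """Run the query and return the list of result series (needs network access)."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]


# Only builds and prints the URL; call instant_query(...) to actually run it
print(instant_query_url("http://10.0.0.32:8428", "up"))
```

If instant_query("http://10.0.0.32:8428", "up") returns a non-empty list, remote_write is delivering data and the same base URL can be used for the Grafana data source.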

3.2 Custom exporter (Python example)

01 Write a custom exporter in Python

1. Install pip3 on the prometheus-server31 node

apt update
apt install -y python3-pip

# Create a dedicated user with a home directory
useradd -m -s /bin/bash python

2. Switch to the python user and do the following:

1 Edit the shell configuration (optional)
su - python
vim .bashrc
...
# Uncomment the following line to enable a colored prompt
force_color_prompt=yes

2 Configure a pip mirror
mkdir ~/.pip; vim ~/.pip/pip.conf
# Comment out the previous settings and add the Aliyun mirror
# [global]
# index-url=https://pypi.tuna.tsinghua.edu.cn/simple
# [install]
# trusted-host=pypi.douban.com
[global]
index-url=https://mirrors.aliyun.com/pypi/simple
[install]
trusted-host=mirrors.aliyun.com


3 Install the required modules
pip3 install flask prometheus_client
pip3 list

4 Create the code directory
mkdir code

5 Write the Python code
cd code
cat > flask_metric.py <<'EOF'
#!/usr/bin/python3

from prometheus_client import start_http_server, Counter, Summary
from flask import Flask, jsonify
from wsgiref.simple_server import make_server

app = Flask(__name__)

# Metrics that track time spent and the number of requests served
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')
COUNTER_TIME = Counter("request_count", "Total request count of the host")

@app.route("/apps")
@REQUEST_TIME.time()
def requests_count():
    COUNTER_TIME.inc()
    return jsonify({"office": "wenzy18@qq.com"}, {"author": "Wen Zhiyong"})

if __name__ == "__main__":
    # Expose the metrics on port 8000
    start_http_server(8000)
    # Serve the Flask app on port 8001
    httpd = make_server('0.0.0.0', 8001, app)
    httpd.serve_forever()
EOF


6 Start the Python program
python3 flask_metric.py

# No client has connected yet, so there is no output

3. Test from a client; any node will do

cat > zhiyong18_curl_metrics.sh <<'EOF'
#!/bin/bash

URL=http://10.0.0.31:8001/apps

while true; do
    # 1 to 50 requests per round, then sleep 1 to 5 seconds
    curl_num=$(( RANDOM % 50 + 1 ))
    sleep_num=$(( RANDOM % 5 + 1 ))
    for c_num in $(seq "$curl_num"); do
        curl -s "$URL" &> /dev/null
    done
    sleep "$sleep_num"
done
EOF

bash zhiyong18_curl_metrics.sh

# Back in the window running the Python program, output like the following appears
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55
10.0.0.31 - - [14/Dec/2024 22:33:28] "GET /apps HTTP/1.1" 200 55

4. flask_metric.py is also a simple web server, so http://10.0.0.31:8001/apps can be opened directly in a browser.

02 Scrape the exporter

1. Edit the Prometheus configuration (vim /zhiyong18/softwares/prometheus-2.53.2.linux-amd64/prometheus.yml) and add a job

  - job_name: "wzy python 自定义exporter"
    static_configs:
    - targets:
      - 10.0.0.31:8000

2. The new target appears at http://10.0.0.31:9090/targets

3. Run a query at http://10.0.0.31:9090/

request_count_total

4. The metric can be graphed in Grafana with the same PromQL queries

# Total number of requests to /apps
request_count_total

# Requests per minute (increase over the last 1m)
increase(request_count_total{job="wzy python 自定义exporter"}[1m])

# Per-second instantaneous rate, computed from the last two samples in the 1m window
irate(request_count_total{job="wzy python 自定义exporter"}[1m])

# Average processing time per request (total seconds / total requests)
request_processing_seconds_sum{job="wzy python 自定义exporter"} / request_processing_seconds_count{job="wzy python 自定义exporter"}
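
To build intuition for what these queries compute, here is a deliberately simplified Python sketch. It sums the growth between consecutive counter samples, treating a drop as a counter reset, and divides a Summary's sum by its count. Real PromQL functions also extrapolate to the window boundaries, which this sketch ignores; the function names and the sample values are made up for illustration.

```python
def counter_increase(samples):
    """Sum the growth between consecutive counter samples; a drop means a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the whole current value is new
        total += (cur - prev) if cur >= prev else cur
    return total


def average_latency(sum_seconds, count):
    """Average seconds per request, as in _sum divided by _count."""
    return sum_seconds / count if count else float("nan")


# Made-up samples: the counter resets between 160 and 5 (for example after a restart)
print(counter_increase([100, 130, 160, 5, 25]))  # 30 + 30 + 5 + 20
print(average_latency(12.5, 50))
```

The reset handling is why a restart of flask_metric.py does not produce a negative spike in the increase() graph.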