Prometheus + Grafana + AlertManager: Complete Installation Guide

Contents

  • 1. Overview
  • 2. Installing software on the monitored server
    • 2.1 Installing Docker and Docker Compose
    • 2.2 Containerized deployment of base software and exporters
  • 3. Installing core components on the monitoring server
    • 3.1 Installing Prometheus
      • 3.1.1 Installation steps
      • 3.1.2 Editing prometheus.yml
      • 3.1.3 Adding alerting rule files
      • 3.1.4 Custom systemd (systemctl) service
      • 3.1.5 Starting Prometheus
      • 3.1.6 Accessing the Prometheus web UI
      • 3.1.7 Checking that targets are loaded
      • 3.1.8 Checking that rules are loaded
      • 3.1.9 Notes
    • 3.2 Installing Grafana
      • 3.2.1 Installation steps
      • 3.2.2 Custom systemd (systemctl) service
      • 3.2.3 Starting Grafana
      • 3.2.4 Accessing the Grafana web UI
      • 3.2.5 Adding the Prometheus data source
      • 3.2.6 Adding a dashboard (server monitoring)
      • 3.2.7 Adding a dashboard (container monitoring)
      • 3.2.8 Adding Java monitoring (JVM dashboard)
      • 3.2.9 Adding MySQL monitoring (MySQL dashboard)
      • 3.2.10 Adding Nginx monitoring (Nginx dashboard)
      • 3.2.11 Adding Redis monitoring (Redis dashboard)
      • 3.2.12 Adding black-box monitoring (black-box dashboard)
    • 3.3 Installing AlertManager

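1. Overview

Server resources:

| Server name | IP | CPU | Memory | Description |
| --- | --- | --- | --- | --- |
| prometheus | 192.168.25.41 | 1 | 1.9G | Runs Prometheus, Grafana, and AlertManager; plays the role of the monitoring server |
| prometheus-monitor-node | 192.168.25.42 | 1 | 1.9G | Runs base software and application services; plays the role of the monitored server |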

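Software installed on the monitoring server (192.168.25.41):

| IP | Port | Software | Version | Install method | Package file | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 192.168.25.41 | 9090 | prometheus | 3.1.0 | Binary | prometheus-3.1.0.linux-amd64.tar.gz | Core monitoring component |
| 192.168.25.41 | 3000 | grafana | 11.5.1 | Binary | grafana-enterprise-11.5.1.linux-amd64.tar.gz | Graphical display of monitoring data |
| 192.168.25.41 | 9093 | alertmanager | 0.28.0 | Binary | alertmanager-0.28.0.linux-amd64.tar.gz | Alert notifications |
| 192.168.25.41 | 9100 | node_exporter | 1.8.2 | Binary | node_exporter-1.8.2.linux-amd64.tar.gz | (Optional) Collects server resource metrics and exposes them to Prometheus |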

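Software installed on the monitored server (192.168.25.42):

| IP | Port | Software | Version | Install method | Notes |
| --- | --- | --- | --- | --- | --- |
| 192.168.25.42 | | docker | 24.0.2 | Binary | Runs the other software and services |
| 192.168.25.42 | | docker-compose | v2.5.0 | Binary | |
| 192.168.25.42 | 80 | nginx | 1.27.4 | Container | Nginx service |
| 192.168.25.42 | 6379 | redis | 6.2.17 | Container | Redis service |
| 192.168.25.42 | 3306 | mysql | 8.0.37 | Container | MySQL service |
| 192.168.25.42 | 8081 | java-web-demo | 0.0.4 | Container | Java application service (the author's private image) |
| 192.168.25.42 | 8080 | cadvisor | v0.33.0 | Container | Collects Docker container metrics |
| 192.168.25.42 | 9091 | pushgateway | v1.11.0 | Container | Jobs push their metrics to Pushgateway, and Prometheus scrapes them from there |
| 192.168.25.42 | 9100 | node_exporter | v1.5.0 | Container | Collects server metrics |
| 192.168.25.42 | 9113 | nginx_exporter | 0.11 | Container | Collects Nginx metrics |
| 192.168.25.42 | 9121 | redis_exporter | latest (no tag pinned) | Container | Collects Redis metrics |
| 192.168.25.42 | 9104 | mysqld-exporter | latest (no tag pinned) | Container | Collects MySQL metrics (also works with MariaDB) |
| 192.168.25.42 | 9115 | blackbox_exporter | v0.25.0 | Container | Probes external endpoints over HTTP, TCP, ICMP, and so on |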

Note: the packages can be downloaded from https://prometheus.io/download/

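2. Installing software on the monitored server

2.1 Installing Docker and Docker Compose

Reference installation guide: Offline Installation of Docker 24 on CentOS 7.9 (pitfall-free edition)

Alternatively, use the docker-v1.24 one-click deployment script directly.

2.2 Containerized deployment of base software and exporters

Create a docker-compose.yaml file with the following content:
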
version: '3.3'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  nginx:
    image: nginx:1.27.4
    container_name: nginx
    restart: always
    ports:
      - 80:80
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /root/default.conf:/etc/nginx/conf.d/default.conf
    environment:
      - TZ=Asia/Shanghai
  redis:
    image: redis:6.2.17
    container_name: redis
    command: redis-server --requirepass 123456 --maxmemory 512mb
    restart: always
    #volumes:
    #  - /data/redis/data:/data
    ports:
      - 6379:6379
  db:
    image: mysql:8.0.37
    restart: always
    container_name: mysql
    environment:
      TZ: Asia/Shanghai
      LANG: en_US.UTF-8
      MYSQL_ROOT_PASSWORD: 123456
    command:
      --default-authentication-plugin=mysql_native_password
      --character-set-server=utf8mb4
      --collation-server=utf8mb4_general_ci
      --lower_case_table_names=1
      --performance_schema=1
      --sql-mode=""
      --skip-log-bin
    volumes:
      - /data/mysql/conf:/etc/mysql/conf.d # config file mount
      - /data/mysql/data:/var/lib/mysql # data file mount
    ports:
       - 3306:3306
  java-web-demo:
    image: java-web-demo:0.0.4
    container_name: springboot-web-demo 
    restart: always
    ports:
      - 8081:8080 

  pushgateway:
    image: prom/pushgateway:v1.11.0
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"

  cadvisor:
    image: google/cadvisor:v0.33.0
    container_name: cadvisor
    restart: always
    privileged: true
    ports:
      - 8080:8080
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    environment:
      - TZ=Asia/Shanghai
    networks:
      - monitoring
  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
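    # With network_mode "host" the exporter listens on the host's port 9100 directly,
    # so no ports mapping is declared (published ports are not used in host mode).
    network_mode: "host"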
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    environment:
      TZ: Asia/Shanghai
    command: 
      - '--web.listen-address=:9100'
      - '--path.procfs=/host/proc' 
      - '--path.sysfs=/host/sys'
      - "--path.rootfs=/rootfs"
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'
  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:0.11
    container_name: nginx_exporter
    hostname: nginx_exporter
    command:
     - '-nginx.scrape-uri=http://192.168.25.42/stub_status'
    restart: always
    ports:
    - "9113:9113"
  redis_exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
    restart: always
    environment:
      REDIS_ADDR: "192.168.25.42:6379"
      REDIS_PASSWORD: 123456
    ports:
      - "9121:9121"
  mysqld-exporter:
    image: prom/mysqld-exporter
    container_name: mysqld-exporter
    restart: always
    command:
      - '--collect.info_schema.processlist'
      - '--collect.info_schema.innodb_metrics'
      - '--collect.info_schema.tablestats'
      - '--collect.info_schema.tables'
      - '--collect.info_schema.userstats'
      - '--collect.engine_innodb_status'
      - '--config.my-cnf=/my.cnf'
    volumes:
      - /root/my.cnf:/my.cnf
    ports:
      - 9104:9104
  blackbox_exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox_exporter
    restart: always
    volumes:
    - /data/blackbox_exporter:/etc/blackbox_exporter
    ports:
    - 9115:9115

The volume mounts above map files from the host into the containers, so the following files and directories must exist on the host.

Full content of the Nginx default.conf file:

[root@prometheus-monitor-node ~]# cat default.conf
server {
    listen       80;
    listen  [::]:80;
    server_name  localhost;

    #access_log  /var/log/nginx/host.access.log  main;

    location / {
        root   /usr/share/nginx/html;
        index  index.html index.htm;
    }

    location /stub_status {
        stub_status on;
        access_log off;
        #allow <IP of nginx_exporter>;
        allow 0.0.0.0/0;
        deny all;
    }

    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
}

Full content of the MySQL my.cnf file (read by mysqld-exporter):

[root@prometheus-monitor-node ~]# cat my.cnf 
[client]
host=192.168.25.42
user=exporter
password=123456
port=3306
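
The my.cnf above logs in as a MySQL user named exporter, which has to exist before mysqld-exporter can collect anything. The following is a minimal sketch of creating it, not part of the original setup: the grant list follows the mysqld_exporter README, while the '%' host part and the helper name exporter_grant_sql are assumptions made for this two-server layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the SQL that creates the account referenced in my.cnf.
# Assumption: the exporter connects over the network, hence the '%' host part.
exporter_grant_sql() {
  local user="${1:-exporter}" password="${2:-123456}"
  cat <<SQL
CREATE USER IF NOT EXISTS '${user}'@'%' IDENTIFIED BY '${password}' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO '${user}'@'%';
SQL
}

# Example (not run here), using the container name and root password from docker-compose.yaml:
#   exporter_grant_sql | docker exec -i mysql mysql -uroot -p123456
```

Printing the SQL instead of executing it keeps the helper easy to review before it touches the database.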

Full content of the config.yml file for blackbox_exporter (black-box probing):

  • Note: /data/blackbox_exporter contains only this one config.yml file.

mkdir -p /data/blackbox_exporter

[root@prometheus-monitor-node ~]# cat /data/blackbox_exporter/config.yml 
modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

Start the services:

docker-compose -f docker-compose.yaml up -d

The result looks like this:

[root@prometheus-monitor-node ~]# docker ps
CONTAINER ID   IMAGE                                  COMMAND                   CREATED          STATUS                    PORTS                                                  NAMES
e45f0d5ece57   google/cadvisor:v0.33.0                "/usr/bin/cadvisor -..."   32 minutes ago   Up 32 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp              cadvisor
394ca6fc8915   prom/pushgateway:v1.11.0               "/bin/pushgateway"        3 days ago       Up 55 minutes             0.0.0.0:9091->9091/tcp, :::9091->9091/tcp              pushgateway
f5e38277ceca   prom/blackbox-exporter:v0.25.0         "/bin/blackbox_expor..."   7 days ago       Up 55 minutes             0.0.0.0:9115->9115/tcp, :::9115->9115/tcp              blackbox_exporter
1e38cd21cf0e   nginx:1.27.4                           "/docker-entrypoint...."   7 days ago       Up 55 minutes             0.0.0.0:80->80/tcp, :::80->80/tcp                      nginx
bdc5b8b4b4f4   java-web-demo:0.0.4                    "java -jar app.jar"       9 days ago       Up 55 minutes             0.0.0.0:8081->8080/tcp, :::8081->8080/tcp              springboot-web-demo
e0b0e18151ea   prom/mysqld-exporter                   "/bin/mysqld_exporte..."   10 days ago      Up 55 minutes             0.0.0.0:9104->9104/tcp, :::9104->9104/tcp              mysqld-exporter
1528db64726b   redis:6.2.17                           "docker-entrypoint.s..."   10 days ago      Up 55 minutes             0.0.0.0:6379->6379/tcp, :::6379->6379/tcp              redis
82eb7094818c   mysql:8.0.37                           "docker-entrypoint.s..."   10 days ago      Up 55 minutes             0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   mysql
452c7fb8c129   oliver006/redis_exporter               "/redis_exporter"         10 days ago      Up 55 minutes             0.0.0.0:9121->9121/tcp, :::9121->9121/tcp              redis_exporter
c63f2666c72b   nginx/nginx-prometheus-exporter:0.11   "/usr/bin/nginx-prom..."   10 days ago      Up 55 minutes             0.0.0.0:9113->9113/tcp, :::9113->9113/tcp              nginx_exporter
50ca88f777b8   prom/node-exporter:v1.5.0              "/bin/node_exporter ..."   10 days ago      Up 55 minutes                                                                    node-exporter
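
A running container does not guarantee that its metrics endpoint answers. As a rough sketch, the ports from docker-compose.yaml can be probed in one pass; the helper name check_exporters is hypothetical, and the paths assume every exporter serves /metrics while the Spring Boot demo serves /actuator/prometheus.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe each metrics endpoint on the monitored host.
HOST="${HOST:-192.168.25.42}"

check_exporters() {
  local failed=0 entry port path
  # port:path pairs taken from the docker-compose.yaml above
  for entry in 8080:/metrics 9091:/metrics 9100:/metrics 9113:/metrics \
               9121:/metrics 9104:/metrics 9115:/metrics 8081:/actuator/prometheus; do
    port="${entry%%:*}"
    path="${entry#*:}"
    # -f makes curl return non-zero on HTTP errors such as 404 or 500.
    if curl -fsS -o /dev/null --max-time 5 "http://${HOST}:${port}${path}"; then
      echo "OK   ${port}${path}"
    else
      echo "FAIL ${port}${path}"
      failed=1
    fi
  done
  return "$failed"
}

# Example (not run here): check_exporters
```

The function returns non-zero if any endpoint fails, so it can also gate a deployment script.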

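Note 1: list containers with docker: docker ps
Note 2: restart a container with docker: docker restart <container name>
Note 3: follow a container's logs with docker: docker logs -f <container name>
Note 4: list containers with docker-compose: docker-compose ps
Note 5: restart or reload a single service with docker-compose: docker-compose -f docker-compose.yaml up -d <service name>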

3. Installing core components on the monitoring server

3.1 Installing Prometheus

3.1.1 Installation steps

# Download the release tarball
wget https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
# Create the installation directory
mkdir -p /opt/software/prometheus/
# Extract into that directory
tar -zxvf prometheus-3.1.0.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename to a version-independent path
mv /opt/software/prometheus/prometheus-3.1.0.linux-amd64 /opt/software/prometheus/prometheus

This is my directory layout once all configuration is finished (refer back to it if the later steps get confusing):

[root@prometheus prometheus]# tree /opt/software/prometheus/prometheus

├── data
├── EOF
├── LICENSE
├── NOTICE
├── prometheus
├── prometheus.yml
├── prometheus.yml_bak
├── promtool
├── rules
│   ├── alert.yml
│   ├── blackbox_exporter.yml
│   ├── docker.yml
│   ├── mysqld.yml
│   ├── nginx.yml
│   ├── node-exporter.yml
│   ├── redis.yml
│   └── springboot.yml
└── targets
    ├── blackbox_http.yml
    ├── blackbox_icmp.yml
    ├── blackbox_tcp.yml
    ├── pushgateway.yml
    ├── springboot.yml
    └── targets.yml

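3.1.2 Editing prometheus.yml

Note: scrape targets in prometheus.yml can be configured in two ways: statically with static_configs, or dynamically with file-based service discovery (file_sd_configs). Feel free to try both; the second is recommended.

Method 1 (static_configs; every target change means editing prometheus.yml and reloading Prometheus manually):

Edit prometheus.yml so that its full content is:
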
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "alertmanager"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9093"]
  - job_name: "node_exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9100"]
        labels:
          instance: Prometheus服务器
          myCusLabel: myCusVal
      - targets: ["192.168.25.42:9100"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "cadvisor"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:8080"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "nginx-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9113"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "redis-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9121"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "mysql-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9104"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "springboot-demo"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ["192.168.25.42:8081"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
          mytype: springboot
  - job_name: "pushgateway"
    scrape_interval: 15s
    honor_labels: true  # keep labels (e.g. job, instance) carried by the pushed metrics instead of overwriting them with the scrape target's labels
    static_configs: 
      - targets: ["192.168.25.42:9091"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://www.baidu.com
        - https://www.jd.com
        labels:
          company: "外部公司"
          project: "外部项目"
          env: "dev"
      - targets: 
        - http://192.168.25.42:8081/
        labels:
          describe: springboot-web-demo应用
          company: "内部公司"
          project: "内部项目"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: 
        - 192.168.25.42:22
        - 192.168.25.41:9090
        labels:
          company: "内部公司"
          project: "测试项目"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
  # ICMP (ping) probe configuration
  - job_name: "blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: 
        - 192.168.25.42
        - 192.168.25.200  # an IP that does not exist
        labels:
          company: "内部公司"
          project: "测试项目"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
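
With the static method, Prometheus only sees a change after the file has been reloaded. A minimal sketch of a "validate, then reload" step follows. promtool check config and the SIGHUP reload are standard Prometheus features; the helper name reload_prometheus and the PROMTOOL and PROM_CONFIG variables are hypothetical, with paths taken from the installation directory above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate prometheus.yml, then ask the running server to reload it.
PROMTOOL="${PROMTOOL:-/opt/software/prometheus/prometheus/promtool}"
PROM_CONFIG="${PROM_CONFIG:-/opt/software/prometheus/prometheus/prometheus.yml}"

reload_prometheus() {
  # promtool ships in the same tarball; a failed check aborts before the reload.
  "$PROMTOOL" check config "$PROM_CONFIG" || return 1
  # Prometheus re-reads its configuration when it receives SIGHUP.
  pkill -HUP -x prometheus
}

# Example (not run here): reload_prometheus
```

If Prometheus is started with --web.enable-lifecycle, an HTTP POST to /-/reload on port 9090 achieves the same reload without sending a signal.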

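Method 2 (file_sd_configs; changes to the target files are picked up automatically, while changes to prometheus.yml itself still need a reload):

Edit prometheus.yml so that its full content is:
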
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "file-target"
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/targets.yml
  - job_name: "file-sd-pushgateway"
    scrape_interval: 15s
    honor_labels: true  # when true, labels (e.g. job, instance) carried by the pushed metrics are kept instead of being overwritten by the scrape target's labels
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/pushgateway.yml    
  - job_name: "file-application"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/springboot.yml
      
  - job_name: "file-blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - refresh_interval: 10s
        files: 
          - targets/blackbox_http.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__param_target]
        target_label: __param_myparam
        replacement: myparamVal
      - target_label: __address__
        replacement: 192.168.25.42:9115
      #- regex: "__(.*)__"    # labels of the form __xx__ are hidden by default; these two lines copy them to plain xx labels so that they become visible
      #  action: labelmap
  - job_name: "file-blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - refresh_interval: 10s
        files: 
          - targets/blackbox_tcp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
  # ICMP (ping) probe configuration
  - job_name: "file-blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    file_sd_configs:
      - refresh_interval: 10s
        files: 
          - targets/blackbox_icmp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
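
The relabel_configs in the blackbox jobs copy each listed target into the target URL parameter, keep it as the instance label, and then point the scrape at the exporter on 192.168.25.42:9115. The request Prometheus ends up sending can be reproduced by hand, which helps when a probe misbehaves. This is only a sketch; the helper name probe and the BLACKBOX variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: issue the same kind of request Prometheus sends after relabeling.
BLACKBOX="${BLACKBOX:-192.168.25.42:9115}"

probe() {
  local module="$1" target="$2"
  # -G turns the url-encoded pairs into the query string of a GET request.
  curl -sS -G "http://${BLACKBOX}/probe" \
    --data-urlencode "module=${module}" \
    --data-urlencode "target=${target}"
}

# Example (not run here): probe http_2xx https://www.baidu.com | grep probe_success
```

The module argument must match a module name defined in the blackbox_exporter config.yml, such as http_2xx, tcp_connect, or icmp.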

Create the targets directory, which holds the file-based discovery files:

mkdir -p /opt/software/prometheus/prometheus/targets/

Add /opt/software/prometheus/prometheus/targets/targets.yml with the following full content:

- targets: ["192.168.25.41:9090"]
  labels:
    job: prometheus
- targets: ["192.168.25.41:9093"]
  labels:
    job: alertmanager
- targets: ["192.168.25.41:9100"]
  labels:
    job: node_exporter
    instance: Prometheus服务器
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9100"]
  labels:
    job: node_exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:8080"]
  labels:
    job: cadvisor
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9113"]
  labels:
    job: nginx-exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9121"]
  labels:
    job: redis-exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9104"]
  labels:
    job: mysql-exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal

Add /opt/software/prometheus/prometheus/targets/springboot.yml with the following content:

- targets: ["192.168.25.42:8081"]
  labels:
    job: java-application
    instance: 被监控服务器42
    myCusLabel: myCusVal
    mytype: springboot

Add /opt/software/prometheus/prometheus/targets/pushgateway.yml with the following full content:

- targets: ["192.168.25.42:9091"]
  labels:
    job: pushgateway
    instance: 被监控服务器42
    myCusLabel: myCusVal
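
To see what honor_labels changes in practice, a sample can be pushed to the Pushgateway by hand. The sketch below uses the Pushgateway's /metrics/job/<job> push URL; the helper name push_metric and the example job and metric names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: push one sample to the Pushgateway from the compose file.
PUSHGATEWAY="${PUSHGATEWAY:-192.168.25.42:9091}"

push_metric() {
  local job="$1" name="$2" value="$3"
  # The text exposition format needs a trailing newline, which printf supplies.
  printf '%s %s\n' "$name" "$value" |
    curl -sS --data-binary @- "http://${PUSHGATEWAY}/metrics/job/${job}"
}

# Example (not run here): push_metric demo_job demo_metric 42
```

With honor_labels: true the scraped series keeps job="demo_job" from the push; without it, Prometheus would move that value to exported_job and apply the labels of the pushgateway target instead.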

Add /opt/software/prometheus/prometheus/targets/blackbox_http.yml with the following full content:

- targets:
    - https://www.baidu.com
    - https://www.jd.com
  labels:
    job: blackbox_http
    company: "外部公司"
    project: "外部项目"
    env: "dev"
- targets: 
    - http://192.168.25.42:8081/
  labels:
    job: blackbox_http
    describe: springboot-web-demo应用
    company: "内部公司"
    project: "内部项目"
    env: "dev"

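Add /opt/software/prometheus/prometheus/targets/blackbox_tcp.yml with the following full content (the tcp_connect module needs host:port targets, matching the static configuration in Method 1):

- targets: 
    - 192.168.25.42:22    # endpoint to probe
    - 192.168.25.41:9090  # endpoint to probe
  labels:
    job: blackbox_tcp
    company: "内部公司"
    project: "测试项目"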

Add /opt/software/prometheus/prometheus/targets/blackbox_icmp.yml with the following full content:

- targets: 
    - 192.168.25.42   # IP to ping
    - 192.168.25.200  # IP to ping (does not exist)
  labels:
    job: blackbox_icmp
    company: "内部公司"
    project: "测试项目"
    env: "dev"
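
Because Prometheus re-reads these files on its own, it is safer to replace a target file in one step than to edit it in place, so that a half-written file is never read. A minimal sketch under that assumption follows; the helper name add_target is hypothetical, and it builds the new version next to the original before renaming it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a target group to a file_sd file via an atomic rename.
TARGETS_FILE="${TARGETS_FILE:-/opt/software/prometheus/prometheus/targets/targets.yml}"

add_target() {
  local address="$1" job="$2" tmp
  # Build the new version next to the original so that mv is a same-filesystem rename.
  tmp="$(mktemp "${TARGETS_FILE}.XXXXXX")" || return 1
  if [ -f "$TARGETS_FILE" ]; then
    cat "$TARGETS_FILE" > "$tmp"
    # Make sure the copied content ends with a newline before appending.
    [ -z "$(tail -c1 "$tmp")" ] || echo >> "$tmp"
  fi
  printf -- '- targets: ["%s"]\n  labels:\n    job: %s\n' "$address" "$job" >> "$tmp"
  chmod 644 "$tmp"   # mktemp creates the file with mode 600
  mv "$tmp" "$TARGETS_FILE"
}

# Example (not run here; 192.168.25.43 is a made-up host): add_target 192.168.25.43:9100 node_exporter
```

The new target should appear on the Prometheus targets page within the configured refresh_interval, with no reload.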

3.1.3 Adding alerting rule files

Create the rules directory:

mkdir /opt/software/prometheus/prometheus/rules/

Add /opt/software/prometheus/prometheus/rules/alert.yml with the following full content:

groups:
- name: Prometheus alert
  rules:
  - alert: 服务告警
    expr: up==0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Service down, instance {{$labels.instance}}"
      description: "The {{$labels.job}} service is down"
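
A syntax error in any rule file prevents Prometheus from loading its configuration, so it can help to check the files first. The sketch below runs promtool check rules over the rules directory; the helper name check_rules and the PROMTOOL and RULES_DIR variables are hypothetical, with defaults taken from the installation path used here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run promtool over every rule file before Prometheus loads them.
PROMTOOL="${PROMTOOL:-/opt/software/prometheus/prometheus/promtool}"
RULES_DIR="${RULES_DIR:-/opt/software/prometheus/prometheus/rules}"

check_rules() {
  local file failed=0
  for file in "${RULES_DIR}"/*.yml; do
    [ -e "$file" ] || continue   # the glob stays literal when the directory is empty
    "$PROMTOOL" check rules "$file" || failed=1
  done
  return "$failed"
}

# Example (not run here): check_rules && echo "all rule files are valid"
```

The loop keeps going after a failure so that every broken file is reported in one run.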

Add /opt/software/prometheus/prometheus/rules/node-exporter.yml with the following full content:

groups:
 - name: node-exporter
   rules:
   - alert: HostOutOfMemory
     expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host out of memory, instance: {{ $labels.instance }}"
       description: "Available memory < 10%, current value: {{ $value }}"
   - alert: HostMemoryUnderMemoryPressure
     expr: rate(node_vmstat_pgmajfault[1m]) > 1000
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host under memory pressure, instance: {{ $labels.instance }}"
       description: "The node is under heavy memory pressure: high rate of major page faults, current value: {{ $value }}"
   - alert: HostUnusualNetworkThroughputIn
     expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Unusual inbound network throughput, instance: {{ $labels.instance }}"
       description: "Inbound network traffic > 100 MB/s, current value: {{ $value }}"
   - alert: HostUnusualNetworkThroughputOut
     expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Unusual outbound network throughput, instance: {{ $labels.instance }}"
       description: "Outbound network traffic > 100 MB/s, current value: {{ $value }}"
   - alert: HostUnusualDiskReadRate
     expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk read rate, instance: {{ $labels.instance }}"
       description: "Disk reads > 50 MB/s, current value: {{ $value }}"
   - alert: HostUnusualDiskWriteRate
     expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk write rate, instance: {{ $labels.instance }}"
       description: "Disk writes > 50 MB/s, current value: {{ $value }}"
   - alert: HostOutOfDiskSpace
     expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host out of disk space, instance: {{ $labels.instance }}"
       description: "Free disk space < 10%, current value: {{ $value }}"
   - alert: HostDiskWillFillIn24Hours
     expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
     for: 2m
     labels:
       severity: warning
     annotations:
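       summary: "Disk predicted to fill within 24 hours, instance: {{ $labels.instance }}"
       description: "At the current write rate the filesystem is predicted to run out of space within 24 hours, current value: {{ $value }}"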
   - alert: HostOutOfInodes
     expr: node_filesystem_files_free{mountpoint ="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/"} == 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host out of inodes, instance: {{ $labels.instance }}"
       description: "Free inodes < 10%, current value: {{ $value }}"
   - alert: HostUnusualDiskReadLatency
     expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk read latency, instance: {{ $labels.instance }}"
       description: "Disk read latency > 100ms, current value: {{ $value }}"
   - alert: HostUnusualDiskWriteLatency
     expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk write latency, instance: {{ $labels.instance }}"
       description: "Disk write latency > 100ms, current value: {{ $value }}"
   - alert: high_load 
     expr: node_load1 > 4
     for: 2m
     labels:
       severity: page
     annotations:
       summary: "1-minute load average too high, instance: {{ $labels.instance }}"
       description: "1-minute load average > 4 for 2 minutes, current value: {{ $value }}"
   - alert: HostHighCpuLoad
     expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
     for: 1m
     labels:
       severity: warning
     annotations:
       summary: "High CPU usage, instance: {{ $labels.instance }}"
       description: "CPU usage > 80%, current value: {{ $value }}"
   - alert: HostCpuStealNoisyNeighbor
     expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
     for: 0m
     labels:
       severity: warning
     annotations:
       summary: "Abnormal CPU steal, instance: {{ $labels.instance }}"
       description: "CPU steal > 10%. A noisy neighbor is hurting VM performance, or a spot instance may be out of credit, current value: {{ $value }}"
   - alert: HostSwapIsFillingUp
     expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Swap usage abnormal, instance: {{ $labels.instance }}"
       description: "Swap usage > 80%"
   - alert: HostNetworkReceiveErrors
     expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Network receive errors, instance: {{ $labels.instance }}"
       description: "Interface {{ $labels.device }} had a receive error ratio above 0.01 over the last 2 minutes, current value: {{ $value }}"
   - alert: HostNetworkTransmitErrors
     expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Network transmit errors, instance: {{ $labels.instance }}"
       description: "Interface {{ $labels.device }} had a transmit error ratio above 0.01 over the last 2 minutes, current value: {{ $value }}"
   - alert: HostNetworkInterfaceSaturated
     expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
     for: 1m
     labels:
       severity: warning
     annotations:
       summary: "Network interface saturated, instance: {{ $labels.instance }}"
       description: "Interface {{ $labels.device }} is overloaded, current value: {{ $value }}"
   - alert: HostConntrackLimit
     expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Conntrack table near its limit, instance: {{ $labels.instance }}"
       description: "Conntrack entries exceed 80% of the limit, current ratio: {{ $value }}"
   - alert: HostClockSkew
     expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Clock skew detected, instance: {{ $labels.instance }}"
       description: "Clock skew detected; the clock is out of sync. Value: {{ $value }}"
   - alert: HostClockNotSynchronising
     expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Clock not synchronising, instance: {{ $labels.instance }}"
       description: "The clock is not synchronising"
   - alert: NodeFileDescriptorLimit
     expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
     for: 1m
     labels:
       severity: warning
     annotations:
       summary: "The kernel is predicted to exhaust its file descriptor limit soon"
       description: "{{ $labels.instance }} has allocated more than 80% of its file descriptor limit, current value: {{ $value }}"

Add /opt/software/prometheus/prometheus/rules/docker.yml with the following full content:

groups:
- name: DockerContainers
  rules:
  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Docker container killed, instance: {{ $labels.instance }}"
      description: "The container has not been seen for {{ $value }} seconds"
  # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No containers found, instance: {{ $labels.instance }}"
      description: "No container has been seen for 5 minutes, value: {{ $value }}"
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage alert, instance: {{ $labels.instance }}"
      description: "Container CPU usage is above 300%, current value: {{ $value }}"
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container memory usage alert, instance: {{ $labels.instance }}"
      description: "Container memory usage is above 80%, current value: {{ $value }}"
  - alert: ContainerVolumeIoUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container volume I/O usage alert, instance: {{ $labels.instance }}"
      description: "Container volume I/O usage is above 80%, current value: {{ $value }}"
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU throttling alert, instance: {{ $labels.instance }}"
      description: "The container is being throttled, current value: {{ $value }}"

Add /opt/software/prometheus/prometheus/rules/mysqld.yml with the following full content:

groups:
- name: MySQL
  rules:
  - alert: MysqlDown
    expr: mysql_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "MySQL Down,实例:{{ $labels.instance }}"
      description: "MySQL_exporter连不上MySQL了,当前状态为:{{ $value }}"    
  - alert: MysqlTooManyConnections
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Mysql连接数过多告警,实例:{{ $labels.instance }}"
      description: "MySQL连接数>80%,当前值:{{ $value }}"
  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) > 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Mysql运行的线程过多,实例:{{ $labels.instance }}"
      description: "Mysql运行的线程 > 20,当前运行的线程:{{ $value }}" 
  - alert: MysqlSlowQueries
    expr: increase(mysql_global_status_slow_queries[2m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Mysql慢日志告警,实例:{{ $labels.instance }}"
      description: "MySQL在过去2分钟有新的{{ $value }}条慢查询"
  #MySQL innodb 日志写入停滞
  - alert: MysqlInnodbLogWaits
    expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "MySQL innodb日志等待,实例:{{ $labels.instance }}"
      description: "MySQL innodb日志写入停滞,当前值: {{ $value }}"
  - alert: MysqlRestarted
    expr: mysql_global_status_uptime < 60
    for: 0m
    labels:
      severity: info
    annotations:
      summary: "MySQL 重启,实例:{{ $labels.instance }}"
      description: "不到一分钟前,MySQL重启过"

  - alert: RowLockCurrentWaits
    expr: mysql_global_status_innodb_row_lock_current_waits > 0
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "MySQL有锁等待,实例:{{ $labels.instance }}"
      description: "当前有{{ $value }}个锁等待"

Add the file /opt/software/prometheus/prometheus/rules/nginx.yml with the following complete content:

yaml
groups:
- name: nginx
  rules:
  # Alert when any instance has been unreachable for more than 30 seconds
  - alert: NginxDown
    expr: nginx_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "nginx is unhealthy, instance: {{ $labels.instance }}"
      description: "{{ $labels.job }} nginx is down"

Add the file /opt/software/prometheus/prometheus/rules/redis.yml with the following complete content:

yaml
groups:
- name: redis
  rules:
  - alert: RedisDown
    expr: redis_up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Redis down, instance: {{ $labels.instance }}"
      description: "Redis instance is down"
  - alert: RedisMissingBackup
    expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Redis backup missing, instance: {{ $labels.instance }}"
      description: "Redis has not been backed up for 24 hours"

  # The "and on(instance)" clause skips instances without a maxmemory limit
  # (redis_memory_max_bytes == 0), where the division would otherwise yield +Inf.
  - alert: RedisOutOfConfiguredMaxmemory
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis is running out of configured maxmemory, instance: {{ $labels.instance }}"
      description: "Redis memory usage is above 90% of the configured maxmemory"
  - alert: RedisTooManyConnections
    expr: redis_connected_clients > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many Redis connections, instance: {{ $labels.instance }}"
      description: "Current number of Redis connections: {{ $value }}"
  - alert: RedisNotEnoughConnections
    expr: redis_connected_clients < 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis does not have enough connections, instance: {{ $labels.instance }}"
      description: "Current number of Redis connections: {{ $value }}"
  - alert: RedisRejectedConnections
    expr: increase(redis_rejected_connections_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Redis rejected connections, instance: {{ $labels.instance }}"
      description: "Some connections to Redis were rejected: {{ $value }}"

Add the file /opt/software/prometheus/prometheus/rules/springboot.yml with the following complete content:

yaml
groups:
- name: SpringBoot
  rules:
  - alert: SpringBootErrorEvents
    expr: increase(logback_events_total{level="error"}[3m]) > 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Spring Boot error events, instance: {{ $labels.instance }}"
      description: "{{ $value }} new error events were logged in the last 3 minutes"

Add the file /opt/software/prometheus/prometheus/rules/blackbox_exporter.yml with the following complete content:

yaml
groups:
- name: Blackbox
  rules:
  - alert: 黑盒子探测失败告警  # blackbox probe failed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Blackbox probe failed {{ $labels.instance }}"
      description: "Blackbox probe failed, current value: {{ $value }}"
  - alert: 请求慢告警  # slow request
    expr: avg_over_time(probe_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Slow request {{ $labels.instance }}"
      description: "Request took longer than 1 second, value: {{ $value }}"
  - alert: http状态码检测失败  # HTTP status code check failed
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "HTTP status code check failed {{ $labels.instance }}"
      description: "HTTP status code is not in 200-399, current status code: {{ $value }}"
  - alert: ssl证书即将到期  # SSL certificate expiring soon (30 days)
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Certificate expiring soon {{ $labels.instance }}"
      description: "SSL certificate expires within 30 days, seconds remaining: {{ $value }}"

  - alert: ssl证书即将到期  # SSL certificate expiring soon (3 days)
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Certificate expiring soon {{ $labels.instance }}"
      description: "SSL certificate expires within 3 days, seconds remaining: {{ $value }}"

  - alert: ssl证书已过期  # SSL certificate expired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Certificate expired {{ $labels.instance }}"
      description: "SSL certificate has expired; confirm whether it is still in use"
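
Before starting Prometheus, the rule files added in this section can be checked for syntax errors. The following is a minimal sketch, assuming the promtool binary that ships in the Prometheus tarball; the function name check_rule_files is only illustrative.

```shell
# Validate every rule file in a directory with promtool.
# Usage: check_rule_files <promtool-binary> <rules-directory>
check_rule_files() {
  crf_failed=0
  for crf_file in "$2"/*.yml; do
    # Skip the literal pattern when the directory holds no .yml files.
    [ -e "$crf_file" ] || continue
    "$1" check rules "$crf_file" || crf_failed=1
  done
  return "$crf_failed"
}
```

With the paths used in this article, the call would be: check_rule_files /opt/software/prometheus/prometheus/promtool /opt/software/prometheus/prometheus/rules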

3.1.4 Custom systemd unit

bash
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/prometheus/prometheus \
  --config.file=/opt/software/prometheus/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/software/prometheus/prometheus/data \
  --storage.tsdb.retention.time=60d \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target

EOF

Reload systemd so that it picks up the new unit file:

bash
systemctl daemon-reload

3.1.5 Start Prometheus

bash
# Start Prometheus
systemctl start prometheus
# Enable Prometheus at boot
systemctl enable prometheus

3.1.6. Access the Prometheus web UI

Open in a browser: http://192.168.25.41:9090/

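3.1.7 Check that the targets are loaded

Open http://192.168.25.41:9090/targets and confirm that every configured target is listed and reported as up.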



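3.1.8 Check that the rules are loaded

Open http://192.168.25.41:9090/rules and confirm that the rule groups from the rules directory are listed.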
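
The target check in 3.1.7 and the rule check here can also be done from the command line through the Prometheus HTTP API (/api/v1/targets and /api/v1/rules), whose JSON responses carry a "health" field per target or rule. The sketch below assumes the compact JSON that Prometheus emits; summarize_health is a hypothetical helper name.

```shell
# Count how many objects report each "health" value in a Prometheus API response.
# Reads JSON on stdin; prints "<health> <count>" lines, sorted by health value.
summarize_health() {
  grep -o '"health":"[a-z]*"' |
    awk -F'"' '{ count[$4]++ } END { for (h in count) print h, count[h] }' |
    sort
}
```

Example calls: curl -s http://192.168.25.41:9090/api/v1/targets | summarize_health, and the same with /api/v1/rules. Any line other than "up" (targets) or "ok" (rules) deserves a closer look in the web UI.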



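3.1.9. Notes

Note 1: curl -X POST http://192.168.25.41:9090/-/reload reloads the Prometheus configuration file while the server is running (this relies on the --web.enable-lifecycle flag set in the unit file above).
Note 2: View the Prometheus logs with journalctl -u prometheus.service
Note 3: The Prometheus web UI is at http://192.168.25.41:9090/
Note 4: Check the configuration file syntax with ./promtool check config prometheus.yml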
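
Notes 1 and 4 combine naturally: validate first, reload only if the check passes. A minimal sketch, with reload_if_valid as an illustrative function name:

```shell
# Reload Prometheus only when promtool accepts the configuration.
# Usage: reload_if_valid <promtool-binary> <config-file> <prometheus-base-url>
reload_if_valid() {
  if "$1" check config "$2"; then
    curl -fsS -X POST "$3/-/reload"
  else
    echo "configuration check failed; Prometheus was not reloaded" >&2
    return 1
  fi
}
```

For this setup the call would be: reload_if_valid /opt/software/prometheus/prometheus/promtool /opt/software/prometheus/prometheus/prometheus.yml http://192.168.25.41:9090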

3.2. Grafana installation

3.2.1. Installation steps

Grafana download page: https://grafana.com/grafana/download/11.5.1?platform=linux

bash
# Download the package
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-11.5.1.linux-amd64.tar.gz
# Extract it to the target directory
tar -zxvf grafana-enterprise-11.5.1.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename the extracted directory
mv /opt/software/prometheus/grafana-v11.5.1 /opt/software/prometheus/grafana

3.2.2 Custom systemd unit

bash
cat > /etc/systemd/system/grafana-server.service << 'EOF'
[Unit]
Description=Grafana server
Documentation=http://docs.grafana.org
[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/grafana/bin/grafana-server \
  --config=/opt/software/prometheus/grafana/conf/defaults.ini \
  --homepath=/opt/software/prometheus/grafana
[Install]
WantedBy=multi-user.target

EOF

Reload systemd so that it picks up the new unit file:

bash
systemctl daemon-reload

3.2.3 Start Grafana

bash
# Start Grafana
systemctl start grafana-server
# Enable Grafana at boot
systemctl enable grafana-server

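3.2.4 Access the Grafana web UI

Open in a browser: http://192.168.25.41:3000/

Default user name/password: admin/admin

The default credentials are defined in /opt/software/prometheus/grafana/conf/defaults.ini.

Enter the user name and password to log in.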

3.2.5 Add the Prometheus data source


Then keep clicking through the remaining steps to create the data source.
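
As an alternative to clicking through the UI, Grafana can load data sources from provisioning files at startup. The sketch below is an assumption-based example, not part of the original setup: it writes a provisioning file that points at the Prometheus server from this article, and the function name and file name prometheus.yaml are illustrative.

```shell
# Write a Grafana data source provisioning file for the Prometheus server.
# Usage: write_prometheus_datasource <provisioning-datasources-directory>
write_prometheus_datasource() {
  mkdir -p "$1" && cat > "$1/prometheus.yaml" << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.25.41:9090
    isDefault: true
EOF
}
```

With the home path used in the unit file, the directory would be /opt/software/prometheus/grafana/conf/provisioning/datasources (Grafana's default provisioning location relative to its home path); restart grafana-server afterwards so the file is read.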

3.2.6 Add a dashboard (server monitoring)

Node Exporter Full dashboard on Grafana: https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Direct download link: https://grafana.com/api/dashboards/1860/revisions/33/download

Download the JSON file and import it into Grafana as a dashboard.

3.2.7 Add a dashboard (container monitoring)

Container monitoring dashboard on Grafana: https://grafana.com/grafana/dashboards/11600-docker-container/

Direct download link: https://grafana.com/api/dashboards/11600/revisions/1/download

After downloading the JSON file, import it into Grafana as a dashboard.

3.2.8 Add Java monitoring (JVM dashboard)

JVM (Micrometer) dashboard on Grafana: https://grafana.com/grafana/dashboards/4701-jvm-micrometer/

Direct download link: https://grafana.com/api/dashboards/4701/revisions/10/download

After downloading the JSON file, import it into Grafana as a dashboard.

3.2.9 Add MySQL monitoring (MySQL dashboard)

MySQL dashboard on Grafana: https://grafana.com/grafana/dashboards/7362-mysql-overview/

Direct download link: https://grafana.com/api/dashboards/7362/revisions/5/download

After downloading the JSON file, import it into Grafana as a dashboard.

3.2.10 Add Nginx monitoring (Nginx dashboard)

Nginx dashboard on Grafana: https://grafana.com/grafana/dashboards/12708-nginx/

Direct download link: https://grafana.com/api/dashboards/12708/revisions/1/download

After downloading the JSON file, import it into Grafana as a dashboard.

3.2.11 Add Redis monitoring (Redis dashboard)

Redis dashboard on Grafana: https://grafana.com/grafana/dashboards/11835-redis-dashboard-for-prometheus-redis-exporter-helm-stable-redis-ha

Direct download link: https://grafana.com/api/dashboards/11835/revisions/1/download

After downloading the JSON file, import it into Grafana as a dashboard.

3.2.12 Add blackbox monitoring (blackbox dashboard)

Blackbox exporter dashboard on Grafana: https://grafana.com/grafana/dashboards/9965-1-blackbox-exporter-dashboard-20220412/

Direct download link: https://grafana.com/api/dashboards/9965/revisions/3/download

After downloading the JSON file, import it into Grafana as a dashboard.
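
All the download links in sections 3.2.6 to 3.2.12 follow the same pattern (dashboard ID plus revision), so the JSON files can be fetched in one pass. A minimal sketch; the function names and the dashboard-<id>.json file naming are illustrative choices.

```shell
# Build the grafana.com download URL for a dashboard ID and revision.
dashboard_url() {
  echo "https://grafana.com/api/dashboards/$1/revisions/$2/download"
}

# Download every dashboard listed in sections 3.2.6 to 3.2.12 into a directory.
# Usage: download_dashboards <target-directory>
download_dashboards() {
  mkdir -p "$1" || return 1
  for dd_pair in 1860:33 11600:1 4701:10 7362:5 12708:1 11835:1 9965:3; do
    dd_id=${dd_pair%%:*}
    dd_rev=${dd_pair##*:}
    curl -fsSL -o "$1/dashboard-$dd_id.json" "$(dashboard_url "$dd_id" "$dd_rev")" || return 1
  done
}
```

The downloaded files can then be imported one by one through the Grafana dashboard import page.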

3.3. AlertManager installation

3.3.1. Installation steps

Download the package from the official release page:

bash
# Download the package
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.0/alertmanager-0.28.0.linux-amd64.tar.gz
# Extract it to the target directory
tar -zxvf alertmanager-0.28.0.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename the extracted directory
mv /opt/software/prometheus/alertmanager-0.28.0.linux-amd64 /opt/software/prometheus/alertmanager

3.3.2. Modify alertmanager.yml

bash
vim /opt/software/prometheus/alertmanager/alertmanager.yml

The complete content is as follows:

yaml
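global:
  # 163 SMTP server
  smtp_smarthost: 'smtp.163.com:465'
  # Sender address (placeholder; replace with your own 163 mailbox)
  smtp_from: 'your-account@163.com'
  # SMTP user name, i.e. the same mailbox (placeholder)
  smtp_auth_username: 'your-account@163.com'
  # SMTP password (the authorization code)
  smtp_auth_password: 'xxxxxxxxx'
  # TLS requirement; false turns it off
  smtp_require_tls: false

# Template configuration
templates:
  - '/opt/software/prometheus/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname']
  # When an alert arrives, wait group_wait (10s) to see whether more alerts arrive, and send them together
  group_wait: 10s
  # After a notification has been sent for a group, new alerts for that group wait group_interval before being sent
  group_interval: 10s
  # If a notification was sent and the problem is still unresolved, wait repeat_interval before sending it again
  repeat_interval: 1m
  # Default receiver. This parameter is required and must match one of the receiver names below
  #receiver: 'email'   # email notification
  #receiver: 'wechat'  # WeCom notification
  receiver: 'web.hook' # API (webhook) notification

receivers:
  - name: 'email'
    email_configs:
      # Recipient 1 (placeholder address)
    - to: 'recipient1@example.com'
      # Also notify when the alert is resolved
      send_resolved: true
      # Email template to use
      html: '{{ template "email.html" .}}'
      # Recipient 2 (placeholder address)
    - to: 'recipient2@example.com'
      # Also notify when the alert is resolved
      send_resolved: true
      # Email template to use
      html: '{{ template "email.html" .}}'

  - name: "wechat"
    webhook_configs:
      # URL exposed by the WeCom notification service, described later.
      - url: 'http://192.168.25.41:8999/webhook?key=xxxxxx'
        send_resolved: true

  - name: 'web.hook'
    webhook_configs:
      # API endpoint of a user-defined service that processes the alert content.
      - url: 'http://192.168.25.2:8080/webhook'

# Inhibition rule. Meaning: if an alert with severity "critical" (the source alert) is firing and another alert with severity "warning" (the target alert) exists, and both alerts have the same values for the alertname, dev and instance labels, the "warning" alert is not sent or shown.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']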

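3.3.3. Email notification configuration

3.3.3.1 Get the 163 mail authorization code

Location: Settings > POP3/SMTP/IMAP > Add authorization code

3.3.3.2 Custom email template (optional)

Note: a custom email template is optional. If you do not define one, the default email template is used.

bash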
mkdir /opt/software/prometheus/alertmanager/template/
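
The alertmanager.yml in section 3.3.2 refers to a template named "email.html", which has to be defined in a .tmpl file inside this directory when the email receiver is used. The author's template is not shown here, so the following is only a hypothetical minimal one (the file name email.tmpl and the HTML layout are assumptions); it uses the standard Alertmanager template fields .Alerts, .Status, .Labels, .Annotations and .StartsAt.

```shell
# Write a minimal Alertmanager template that defines "email.html".
# Usage: write_email_template <template-directory>
write_email_template() {
  mkdir -p "$1" && cat > "$1/email.tmpl" << 'EOF'
{{ define "email.html" }}
{{ range .Alerts }}
<p>
  <b>Alert:</b> {{ .Labels.alertname }}<br>
  <b>Status:</b> {{ .Status }}<br>
  <b>Severity:</b> {{ .Labels.severity }}<br>
  <b>Instance:</b> {{ .Labels.instance }}<br>
  <b>Summary:</b> {{ .Annotations.summary }}<br>
  <b>Description:</b> {{ .Annotations.description }}<br>
  <b>Started at:</b> {{ .StartsAt }}
</p>
{{ end }}
{{ end }}
EOF
}
```

Calling write_email_template /opt/software/prometheus/alertmanager/template/ places the file where the templates setting in alertmanager.yml looks for it.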

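3.3.4. WeCom (企业微信) notification configuration

3.3.4.1. Get the WeCom robot webhook

1. In WeCom, first create a personal organization, then create a robot in a group chat.

2. After the organization is created, open the all-staff group (the group that was just created):

3. Add a robot to the all-staff group:

Once the robot is added, it gets a webhook address that contains a KEY. This KEY is important: it is used in alertmanager.yml when configuring WeCom alert notifications. See section 3.3.2 above for the configuration.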
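
Before wiring the robot into Alertmanager, it can help to confirm that the KEY works by sending a plain text message straight to the robot webhook. A minimal sketch, assuming the standard WeCom group-robot endpoint and text message format; the function names are illustrative.

```shell
# Build the JSON body for a WeCom group-robot text message.
# The content must not contain double quotes or backslashes (no escaping is done here).
wecom_text_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}' "$1"
}

# Send a text message to the robot identified by its webhook key.
# Usage: send_wecom_text <robot-key> <message>
send_wecom_text() {
  curl -fsS -H 'Content-Type: application/json' \
    -d "$(wecom_text_payload "$2")" \
    "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$1"
}
```

If the message appears in the group, the same KEY can be used in the webhook URL configured in alertmanager.yml.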

3.3.4.2. Set up the WeCom robot alert notification service

Download: https://download.csdn.net/download/jianghuchuang/90415760

After downloading and unpacking, the files look like this:

bash
[root@prometheus alertmanager-wechatrobot-webhook]# pwd
/opt/software/prometheus/alertmanager-wechatrobot-webhook

[root@prometheus alertmanager-wechatrobot-webhook]# ll
总用量 10700
-rwxr-xr-x 1 root root 10924527 2月  16 18:50 alertmanager-wechatrobot-webhook
-rw-r--r-- 1 root root      265 2月  16 18:51 alertmanager-wechatrobot-webhook.service
-rw-r--r-- 1 root root      554 2月  16 18:50 Dockerfile
-rw-r--r-- 1 root root    11558 2月  16 18:50 LICENSE
-rw-r--r-- 1 root root      684 2月  16 18:50 README.md
-rw-r--r-- 1 root root      164 2月  16 18:50 start.sh

Custom systemd unit:

bash
cat > /etc/systemd/system/alertmanager-wechatrobot-webhook.service << 'EOF'
[Unit]
Description=alertmanager-wechatrobot-webhook

[Service]
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/alertmanager-wechatrobot-webhook/alertmanager-wechatrobot-webhook

[Install]
WantedBy=multi-user.target

EOF

Start the service:

bash
# Reload systemd
systemctl daemon-reload
# Start the service
systemctl start alertmanager-wechatrobot-webhook
# Enable the service at boot
systemctl enable alertmanager-wechatrobot-webhook

Note: once started, the service exposes this endpoint:
http://192.168.25.41:8999/webhook?key=xxxxxx. Configure this address in alertmanager.yml (see section 3.3.2 above). The key is the key from the WeCom robot webhook.

3.3.5 Custom notification service (Spring Boot example)

A custom service receives the alert messages. Spring Boot 2.6.2 is used as the example.

3.3.5.1 Modify pom.xml

xml
<!-- spring-boot-actuator dependency -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
    <version>2.6.2</version>
</dependency>
<!-- Prometheus registry. Do not add a version number here, otherwise the /actuator/prometheus endpoint cannot be reached -->
<!-- https://mvnrepository.com/artifact/io.micrometer/micrometer-registry-prometheus -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

3.3.5.2 Modify application.yml

The complete application.yml is as follows:

yaml
server:
  port: 8080
spring:
  application:
    name: my-application-name
#management:
#  endpoints:
#    web:
#      exposure:
#        include: "*"
#  metrics:
#    tags:
#      myApplication: ${spring.application.name}
#      myTag:  my-application-tag

management:
  metrics:
    tags:
      application: ${spring.application.name}
    export:
      prometheus:
        enabled: true
  # Expose the endpoint over HTTP
  endpoints:
    web:
      exposure:
        include: prometheus
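
Once the application is running with this configuration, the /actuator/prometheus endpoint should return metrics in the Prometheus text format, including the logback_events_total counter used by the springboot.yml rule. The helper below is a small sketch for checking that from the command line; metric_total is a hypothetical name, and it assumes samples without trailing timestamps.

```shell
# Sum the values of one metric family in Prometheus text-format output read from stdin.
# Usage: ... | metric_total <metric-name>
metric_total() {
  awk -v name="$1" '
    # Match "name{labels} value" and "name value"; "# HELP" / "# TYPE" lines never match.
    index($0, name "{") == 1 || index($0, name " ") == 1 { sum += $NF; found = 1 }
    END { if (found) print sum; else exit 1 }
  '
}
```

For example, assuming the application listens on localhost:8080: curl -s http://localhost:8080/actuator/prometheus | grep 'level="error"' | metric_total logback_events_total prints the number of error events logged so far, and the command fails if the metric is missing.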

3.3.5.3 Add the webhook endpoint

java
package com.jxf.controller;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Date;

@Slf4j
@RestController
@RequestMapping("")
public class WebController {

    /**
     * Handles the alert notifications sent by Alertmanager.
     * @param payload 告警内容示例:{"receiver":"web\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"http状态码检测失败","company":"外部公司","env":"dev","instance":"https://www.baidu.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"HTTP状态码非 200-399,当前状态码为:0","summary":"http状态码检测失败https://www.baidu.com"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_http_status_code+%3C%3D+199+or+probe_http_status_code+%3E%3D+400\u0026g0.tab=1","fingerprint":"43644ea5fd1763ae"},{"status":"firing","labels":{"alertname":"请求慢告警","company":"外部公司","env":"dev","instance":"https://www.baidu.com","job":"blackbox_http","project":"外部项目","severity":"warning"},"annotations":{"description":"请求时间超过1秒,值为:8.00394340575","summary":"请求慢https://www.baidu.com"},"startsAt":"2025-02-16T13:46:24.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=avg_over_time%28probe_duration_seconds%5B1m%5D%29+%3E+1\u0026g0.tab=1","fingerprint":"8e7993450dd4f17a"},{"status":"firing","labels":{"alertname":"黑盒子探测失败告警","company":"外部公司","env":"dev","instance":"https://www.baidu.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"黑盒子检测失败,当前值:0","summary":"黑盒子探测失败https://www.baidu.com"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_success+%3D%3D+0\u0026g0.tab=1","fingerprint":"62e21997d2d79e55"},{"status":"firing","labels":{"alertname":"http状态码检测失败","company":"外部公司","env":"dev","instance":"https://www.jd.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"HTTP状态码非 
200-399,当前状态码为:0","summary":"http状态码检测失败https://www.jd.com"},"startsAt":"2025-02-16T13:44:54.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_http_status_code+%3C%3D+199+or+probe_http_status_code+%3E%3D+400\u0026g0.tab=1","fingerprint":"9760f8b330d656e7"},{"status":"firing","labels":{"alertname":"请求慢告警","company":"外部公司","env":"dev","instance":"https://www.jd.com","job":"blackbox_http","project":"外部项目","severity":"warning"},"annotations":{"description":"请求时间超过1秒,值为:8.003534726249999","summary":"请求慢https://www.jd.com"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=avg_over_time%28probe_duration_seconds%5B1m%5D%29+%3E+1\u0026g0.tab=1","fingerprint":"a5440fd5910568ef"},{"status":"firing","labels":{"alertname":"黑盒子探测失败告警","company":"外部公司","env":"dev","instance":"https://www.jd.com","job":"blackbox_http","project":"外部项目","severity":"critical"},"annotations":{"description":"黑盒子检测失败,当前值:0","summary":"黑盒子探测失败https://www.jd.com"},"startsAt":"2025-02-16T13:44:54.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_success+%3D%3D+0\u0026g0.tab=1","fingerprint":"d18163c9f43b4cba"},{"status":"firing","labels":{"alertname":"请求慢告警","company":"内部公司","instance":"192.168.25.200","job":"blackbox_icmp","project":"测试项目","severity":"warning"},"annotations":{"description":"请求时间超过1秒,值为:9.50104220425","summary":"请求慢192.168.25.200"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=avg_over_time%28probe_duration_seconds%5B1m%5D%29+%3E+1\u0026g0.tab=1","fingerprint":"f952edd5054e1767"},{"status":"firing","labels":{"alertname":"黑盒子探测失败告警","company":"内部公司","instance":"192.168.25.200","job":"blackbox_icmp","project":"测试项目","severity":"critical"},"annotations":{"description":"黑盒子检测失败,当前值:0","summary":"黑盒子探测失败192.168.25.200"},"startsAt":"2025-02-16T13:44:54.052
Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_success+%3D%3D+0\u0026g0.tab=1","fingerprint":"e79230366cea2dc2"}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"http://prometheus:9093","version":"4","groupKey":"{}:{}","truncatedAlerts":0}
     * @return
     */
    @PostMapping("/webhook")
    public ResponseEntity<String> receiveAlert(@RequestBody String payload) {
        JSONObject data = JSON.parseObject(payload);
        JSONArray alerts = JSON.parseArray(data.getString("alerts"));
        log.error("======================Received alerts ({})========================", alerts.size());
        for (int i = 0; i < alerts.size(); i++) {
            JSONObject alert = alerts.getJSONObject(i);
            String status = alert.getString("status");
            // Use "yyyy" (calendar year); "YYYY" is the week-based year and can be off around New Year.
            String startsAt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Date.from(Instant.parse(alert.getString("startsAt"))));
            String endsAt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Date.from(Instant.parse(alert.getString("endsAt"))));

            JSONObject annotations = alert.getJSONObject("annotations");
            String summary = annotations.getString("summary");
            String description = annotations.getString("description");
            log.error("Received alert: status {}, starts at: {}, ends at: {}, summary: {}, details: {}", status, startsAt, endsAt, summary, description);
        }
        log.error("====================================================");

        // Respond to confirm that the alerts were received
        return new ResponseEntity<>("Alerts received", HttpStatus.OK);
    }

}
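
The endpoint can be exercised without waiting for a real alert by posting a hand-made body. The sketch below builds a minimal payload containing only the fields the controller reads (alerts, status, startsAt, endsAt and the summary and description annotations); the function name is illustrative and the timestamps are taken from the sample payload in the Javadoc.

```shell
# Build a minimal Alertmanager-style webhook body with a single alert.
# Usage: sample_webhook_payload <status> <summary> <description>
# The arguments must not contain double quotes or backslashes (no escaping is done here).
sample_webhook_payload() {
  printf '{"status":"%s","alerts":[{"status":"%s","annotations":{"summary":"%s","description":"%s"},"startsAt":"2025-02-16T13:44:39.052Z","endsAt":"0001-01-01T00:00:00Z"}]}' \
    "$1" "$1" "$2" "$3"
}
```

Example: curl -fsS -H 'Content-Type: application/json' -d "$(sample_webhook_payload firing 'test summary' 'test description')" http://192.168.25.2:8080/webhook should return "Alerts received" and produce one log entry in the application.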

3.3.6 Custom systemd unit

bash
cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=root
Group=root
ExecStart=/opt/software/prometheus/alertmanager/alertmanager \
  --config.file=/opt/software/prometheus/alertmanager/alertmanager.yml \
  --storage.path=/opt/software/prometheus/alertmanager/data

Restart=always

[Install]
WantedBy=multi-user.target

EOF

Reload systemd so that it picks up the new unit file:

bash
systemctl daemon-reload

3.3.7 Start AlertManager

bash
# Start alertmanager
systemctl start alertmanager
# Enable alertmanager at boot
systemctl enable alertmanager
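
To check the notification path end to end without waiting for a rule to fire, a test alert can be pushed directly into Alertmanager through its v2 HTTP API. A minimal sketch; the function name and the label values in the example call are made up for illustration.

```shell
# Build a one-element alert list for Alertmanager's v2 API.
# Usage: test_alert_payload <alertname> <severity> <instance>
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"manual test alert"}}]' \
    "$1" "$2" "$3"
}
```

Example: curl -fsS -H 'Content-Type: application/json' -d "$(test_alert_payload ManualTest warning test-host)" http://192.168.25.41:9093/api/v2/alerts. The alert should then show up in the Alertmanager UI on port 9093 and be routed to the receiver configured in alertmanager.yml.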

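3.3.8 Alert notification results

3.3.8.1 Email alert result

Note: first make sure your service ...

With the default email template, the alert email looks like this:


With the custom email template, it looks like this:

3.3.8.2 WeCom alert result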