Prometheus + Grafana + AlertManager: Complete Installation Guide

Table of Contents

- 1. Overview
- 2. Installing software on the monitored server
  - 2.1 Installing Docker and Docker Compose
  - 2.2 Containerized deployment of base software and exporters
- 3. Installing the core components on the monitoring server
  - 3.1 Installing Prometheus
    - 3.1.1 Installation steps
    - 3.1.2 Editing prometheus.yml
    - 3.1.3 Adding alert rule files
    - 3.1.4 Custom systemd service (systemctl)
    - 3.1.5 Starting Prometheus
    - 3.1.6 Accessing the Prometheus web UI
    - 3.1.7 Checking that targets are loaded
    - 3.1.8 Checking that rules are loaded
    - 3.1.9 Notes
  - 3.2 Installing Grafana
    - 3.2.1 Installation steps
    - 3.2.2 Custom systemd service (systemctl)
    - 3.2.3 Starting Grafana
    - 3.2.4 Accessing the Grafana web UI
    - 3.2.5 Adding the Prometheus data source
    - 3.2.6 Adding a dashboard (server monitoring)
    - 3.2.7 Adding a dashboard (container monitoring)
    - 3.2.8 Adding Java monitoring (JVM dashboard)
    - 3.2.9 Adding MySQL monitoring (MySQL dashboard)
    - 3.2.10 Adding Nginx monitoring (Nginx dashboard)
    - 3.2.11 Adding Redis monitoring (Redis dashboard)
    - 3.2.12 Adding black-box monitoring (black-box dashboard)
  - 3.3 Installing AlertManager
    - 3.3.1 Installation steps
    - 3.3.2 Editing alertmanager.yml
    - 3.3.3 Email notification configuration
      - 3.3.3.1 Getting a 163 Mail authorization code
      - 3.3.3.2 Defining a custom email template (optional)
    - 3.3.4 WeCom (企业微信) notification configuration
      - 3.3.4.1 Getting the WeCom bot webhook
      - 3.3.4.2 Configuring the WeCom bot alert notification service
    - 3.3.5 Custom notification service configuration (Spring Boot example)
      - 3.3.5.1 Editing pom.xml
      - 3.3.5.2 Editing application.yml
      - 3.3.5.3 Adding the webhook endpoint
    - 3.3.6 Custom systemd service (systemctl)
    - 3.3.7 Starting AlertManager
    - 3.3.8 Alert results
      - 3.3.8.1 Email alert result
      - 3.3.8.2 WeCom alert result

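1. Overview

Server resources:

| Server name | IP | CPU | Memory | Description |
| --- | --- | --- | --- | --- |
| prometheus | 192.168.25.41 | 1 | 1.9G | Runs Prometheus, Grafana, and AlertManager; acts as the monitoring server |
| prometheus-monitor-node | 192.168.25.42 | 1 | 1.9G | Runs base software and application services; acts as the monitored server |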

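Software installed on the monitoring server (192.168.25.41):

| IP | Port | Software | Version | Install method | Package file | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 192.168.25.41 | 9090 | prometheus | 3.1.0 | binary | prometheus-3.1.0.linux-amd64.tar.gz | Core monitoring component |
| 192.168.25.41 | 3000 | grafana | 11.5.1 | binary | grafana-enterprise-11.5.1.linux-amd64.tar.gz | Graphical display of monitoring data |
| 192.168.25.41 | 9093 | alertmanager | 0.28.0 | binary | alertmanager-0.28.0.linux-amd64.tar.gz | Alert notifications |
| 192.168.25.41 | 9100 | node_exporter | 1.8.2 | binary | node_exporter-1.8.2.linux-amd64.tar.gz | (Optional) Collects server resource metrics and exposes them to Prometheus |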

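Software installed on the monitored server (192.168.25.42):

| IP | Port | Software | Version | Install method | Notes |
| --- | --- | --- | --- | --- | --- |
| 192.168.25.42 | | docker | 24.0.2 | binary | Runs the other software and services |
| 192.168.25.42 | | docker-compose | v2.5.0 | binary | |
| 192.168.25.42 | 80 | nginx | 1.27.4 | container | Nginx service |
| 192.168.25.42 | 6379 | redis | 6.2.17 | container | Redis service |
| 192.168.25.42 | 3306 | mysql | 8.0.37 | container | MySQL service |
| 192.168.25.42 | 8081 | java-web-demo | 0.0.4 | container | Java application service (the author's private image) |
| 192.168.25.42 | 8080 | cadvisor | v0.33.0 | container | Collects Docker monitoring data |
| 192.168.25.42 | 9091 | pushgateway | v1.11.0 | container | Clients push metrics to Pushgateway; Prometheus then scrapes them from it |
| 192.168.25.42 | 9100 | node_exporter | v1.5.0 | container | Collects server metrics |
| 192.168.25.42 | 9113 | nginx_exporter | 0.11 | container | Collects Nginx metrics |
| 192.168.25.42 | 9121 | redis_exporter | latest (untagged image) | container | Collects Redis metrics |
| 192.168.25.42 | 9104 | mysqld-exporter | latest (untagged image) | container | Collects MySQL metrics (also works with MariaDB) |
| 192.168.25.42 | 9115 | blackbox_exporter | v0.25.0 | container | Probes external endpoints over HTTP, TCP, ICMP, and so on |

Note: the Prometheus components can be downloaded from https://prometheus.io/download/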

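2. Installing software on the monitored server

2.1 Installing Docker and Docker Compose

See the installation guide: Offline installation of Docker 24 on CentOS 7.9 (pitfall-free edition)

Or use the one-click deployment script directly: docker-v1.24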

2.2 Containerized deployment of base software and exporters

Create a docker-compose.yaml file with the following content:

version: '3.3'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  nginx:
    image: nginx:1.27.4
    container_name: nginx
    restart: always
    ports:
      - 80:80
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /root/default.conf:/etc/nginx/conf.d/default.conf
    environment:
      - TZ=Asia/Shanghai
  redis:
    image: redis:6.2.17
    container_name: redis
    command: redis-server --requirepass 123456 --maxmemory 512mb
    restart: always
    #volumes:
    #  - /data/redis/data:/data
    ports:
      - 6379:6379
  db:
    image: mysql:8.0.37
    restart: always
    container_name: mysql
    environment:
      TZ: Asia/Shanghai
      LANG: en_US.UTF-8
      MYSQL_ROOT_PASSWORD: 123456
    command:
      --default-authentication-plugin=mysql_native_password
      --character-set-server=utf8mb4
      --collation-server=utf8mb4_general_ci
      --lower_case_table_names=1
      --performance_schema=1
      --sql-mode=""
      --skip-log-bin
    volumes:
      - /data/mysql/conf:/etc/mysql/conf.d # config file mount
      - /data/mysql/data:/var/lib/mysql # data file mount
    ports:
       - 3306:3306
  java-web-demo:
    image: java-web-demo:0.0.4
    container_name: springboot-web-demo 
    restart: always
    ports:
      - 8081:8080 

  pushgateway:
    image: prom/pushgateway:v1.11.0
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"

  cadvisor:
    image: google/cadvisor:v0.33.0
    container_name: cadvisor
    restart: always
    privileged: true
    ports:
      - 8080:8080
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    environment:
      - TZ=Asia/Shanghai
    networks:
      - monitoring
  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
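    # Note: with network_mode "host" the ports mapping below has no effect;
    # node_exporter listens directly on the host's port 9100.
    ports:
      - 9100:9100
    network_mode: "host"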
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    environment:
      TZ: Asia/Shanghai
    command: 
      - '--web.listen-address=:9100'
      - '--path.procfs=/host/proc' 
      - '--path.sysfs=/host/sys'
      - "--path.rootfs=/rootfs"
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'
  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:0.11
    container_name: nginx_exporter
    hostname: nginx_exporter
    command:
     - '-nginx.scrape-uri=http://192.168.25.42/stub_status'
    restart: always
    ports:
    - "9113:9113"
  redis_exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
    restart: always
    environment:
      REDIS_ADDR: "192.168.25.42:6379"
      REDIS_PASSWORD: 123456
    ports:
      - "9121:9121"
  mysqld-exporter:
    image: prom/mysqld-exporter
    container_name: mysqld-exporter
    restart: always
    command:
      - '--collect.info_schema.processlist'
      - '--collect.info_schema.innodb_metrics'
      - '--collect.info_schema.tablestats'
      - '--collect.info_schema.tables'
      - '--collect.info_schema.userstats'
      - '--collect.engine_innodb_status'
      - '--config.my-cnf=/my.cnf'
    volumes:
      - /root/my.cnf:/my.cnf
    ports:
      - 9104:9104
  blackbox_exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox_exporter
    restart: always
    volumes:
    - /data/blackbox_exporter:/etc/blackbox_exporter
    ports:
    - 9115:9115

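The volume mounts above map files from the host into the containers, so the corresponding files and directories must already exist on the host. They are listed below.

Full content of Nginx's default.conf: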

[root@prometheus-monitor-node ~]# cat default.conf
server {
    listen       80;
    listen  [::]:80;
    server_name  localhost;

    #access_log  /var/log/nginx/host.access.log  main;

    location / {
        root   /usr/share/nginx/html;
        index  index.html index.htm;
    }

    location /stub_status {
        stub_status on;
        access_log off;
        #allow <nginx_exporter IP>;   # use this instead of 0.0.0.0/0 to restrict access
        allow 0.0.0.0/0;
        deny all;
    }

    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
}
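
The /stub_status location is what nginx_exporter scrapes. As a rough illustration of what it reads there, the sketch below parses the plain-text page into counters. The sample text follows the format shown in the Nginx stub_status documentation; the function name `parse_stub_status` and the dictionary keys are hypothetical, and this is not the exporter's actual code.

```python
import re

# Sample page in the format documented for ngx_http_stub_status_module.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Parse a stub_status page into a dict of integer counters."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = (int(x) for x in totals.groups())
    reading, writing, waiting = (int(x) for x in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

In this setup the real page would be fetched from http://192.168.25.42/stub_status, the same URI passed to the exporter with -nginx.scrape-uri.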

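Full content of MySQL's my.cnf, which is mounted into mysqld-exporter. The exporter user referenced here must already exist in MySQL with the privileges the exporter needs: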

[root@prometheus-monitor-node ~]# cat my.cnf 
[client]
host=192.168.25.42
user=exporter
password=123456
port=3306
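
Before mounting the file into mysqld-exporter, it can help to confirm that the [client] section is complete. The following is a minimal sketch using Python's standard configparser; the list of required keys simply mirrors the four keys this article sets, and the helper name `missing_client_keys` is hypothetical.

```python
import configparser

# Keys this article's my.cnf sets in the [client] section (an assumption
# about what to check, not a list taken from mysqld-exporter itself).
REQUIRED_KEYS = ("host", "user", "password", "port")

SAMPLE_CNF = """[client]
host=192.168.25.42
user=exporter
password=123456
port=3306
"""


def missing_client_keys(cnf_text: str) -> list:
    """Return the required [client] keys that are absent from cnf_text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cnf_text)
    if not parser.has_section("client"):
        return list(REQUIRED_KEYS)
    return [k for k in REQUIRED_KEYS if not parser.has_option("client", k)]
```

An empty list means every expected key is present; it does not prove that the credentials are valid.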

Create the blackbox_exporter configuration directory. Note that /data/blackbox_exporter contains a single file, config.yml, and nothing else.

mkdir -p /data/blackbox_exporter

Full content of blackbox_exporter's config.yml:

[root@prometheus-monitor-node ~]# cat /data/blackbox_exporter/config.yml 
modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

Start the services:

docker-compose -f docker-compose.yaml up -d

The final result looks like this:

[root@prometheus-monitor-node ~]# docker ps
CONTAINER ID   IMAGE                                  COMMAND                   CREATED          STATUS                    PORTS                                                  NAMES
e45f0d5ece57   google/cadvisor:v0.33.0                "/usr/bin/cadvisor -..."   32 minutes ago   Up 32 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp              cadvisor
394ca6fc8915   prom/pushgateway:v1.11.0               "/bin/pushgateway"        3 days ago       Up 55 minutes             0.0.0.0:9091->9091/tcp, :::9091->9091/tcp              pushgateway
f5e38277ceca   prom/blackbox-exporter:v0.25.0         "/bin/blackbox_expor..."   7 days ago       Up 55 minutes             0.0.0.0:9115->9115/tcp, :::9115->9115/tcp              blackbox_exporter
1e38cd21cf0e   nginx:1.27.4                           "/docker-entrypoint...."   7 days ago       Up 55 minutes             0.0.0.0:80->80/tcp, :::80->80/tcp                      nginx
bdc5b8b4b4f4   java-web-demo:0.0.4                    "java -jar app.jar"       9 days ago       Up 55 minutes             0.0.0.0:8081->8080/tcp, :::8081->8080/tcp              springboot-web-demo
e0b0e18151ea   prom/mysqld-exporter                   "/bin/mysqld_exporte..."   10 days ago      Up 55 minutes             0.0.0.0:9104->9104/tcp, :::9104->9104/tcp              mysqld-exporter
1528db64726b   redis:6.2.17                           "docker-entrypoint.s..."   10 days ago      Up 55 minutes             0.0.0.0:6379->6379/tcp, :::6379->6379/tcp              redis
82eb7094818c   mysql:8.0.37                           "docker-entrypoint.s..."   10 days ago      Up 55 minutes             0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   mysql
452c7fb8c129   oliver006/redis_exporter               "/redis_exporter"         10 days ago      Up 55 minutes             0.0.0.0:9121->9121/tcp, :::9121->9121/tcp              redis_exporter
c63f2666c72b   nginx/nginx-prometheus-exporter:0.11   "/usr/bin/nginx-prom..."   10 days ago      Up 55 minutes             0.0.0.0:9113->9113/tcp, :::9113->9113/tcp              nginx_exporter
50ca88f777b8   prom/node-exporter:v1.5.0              "/bin/node_exporter ..."   10 days ago      Up 55 minutes                                                                    node-exporter

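Note 1: list containers with docker: docker ps
Note 2: restart a container with docker: docker restart <container name>
Note 3: follow a container's logs with docker: docker logs -f <container name>
Note 4: list containers with docker-compose: docker-compose ps
Note 5: restart or reload a specific service with docker-compose: docker-compose -f docker-compose.yaml up -d <service name>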
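
Once the containers are up, each exporter should answer on its metrics endpoint. The sketch below builds those URLs from the ports used in docker-compose.yaml and offers a simple reachability check. The host, ports, and paths come from this article; the names `ENDPOINTS`, `metrics_url`, and `is_reachable` are hypothetical helpers.

```python
import http.client
import urllib.error
import urllib.request

HOST = "192.168.25.42"

# service name -> (port, metrics path), as configured in this article
ENDPOINTS = {
    "node_exporter": (9100, "/metrics"),
    "cadvisor": (8080, "/metrics"),
    "nginx_exporter": (9113, "/metrics"),
    "redis_exporter": (9121, "/metrics"),
    "mysqld-exporter": (9104, "/metrics"),
    "blackbox_exporter": (9115, "/metrics"),
    "pushgateway": (9091, "/metrics"),
    "java-web-demo": (8081, "/actuator/prometheus"),
}


def metrics_url(name: str, host: str = HOST) -> str:
    """Build the metrics URL for one of the services above."""
    port, path = ENDPOINTS[name]
    return f"http://{host}:{port}{path}"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return False
```

Calling `is_reachable(metrics_url(name))` for each name in `ENDPOINTS` from the monitoring server gives a quick pass/fail list before Prometheus is configured.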

3. Installing the core components on the monitoring server

3.1 Installing Prometheus

3.1.1 Installation steps

# Download the release tarball
wget https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
# Create the installation directory
mkdir -p /opt/software/prometheus/
# Extract into the installation directory
tar -zxvf prometheus-3.1.0.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename the extracted directory
mv /opt/software/prometheus/prometheus-3.1.0.linux-amd64 /opt/software/prometheus/prometheus

This is my directory layout after all configuration is complete (refer to it if the later steps get confusing):

[root@prometheus prometheus]# tree /opt/software/prometheus/prometheus

├── data
├── EOF
├── LICENSE
├── NOTICE
├── prometheus
├── prometheus.yml
├── prometheus.yml_bak
├── promtool
├── rules
│   ├── alert.yml
│   ├── blackbox_exporter.yml
│   ├── docker.yml
│   ├── mysqld.yml
│   ├── nginx.yml
│   ├── node-exporter.yml
│   ├── redis.yml
│   └── springboot.yml
└── targets
    ├── blackbox_http.yml
    ├── blackbox_icmp.yml
    ├── blackbox_tcp.yml
    ├── pushgateway.yml
    ├── springboot.yml
    └── targets.yml

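3.1.2 Editing prometheus.yml

Note: scrape targets in prometheus.yml can be configured in two ways: statically in the file itself (static_configs), or dynamically through file-based service discovery (file_sd_configs). Feel free to try both; the second is recommended.

Option 1 (static configuration; Prometheus must be reloaded manually after every change):

Edit prometheus.yml so that its full content is: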

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "alertmanager"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9093"]
  - job_name: "node_exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.41:9100"]
        labels:
          instance: Prometheus服务器
          myCusLabel: myCusVal
      - targets: ["192.168.25.42:9100"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "cadvisor"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:8080"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "nginx-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9113"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "redis-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9121"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "mysql-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.25.42:9104"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "springboot-demo"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ["192.168.25.42:8081"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
          mytype: springboot
  - job_name: "pushgateway"
    scrape_interval: 15s
    honor_labels: true  # Keep the labels carried by the pushed metrics (e.g. job, instance) instead of overwriting them with the labels of the pushgateway target
    static_configs: 
      - targets: ["192.168.25.42:9091"]
        labels:
          instance: 被监控服务器42
          myCusLabel: myCusVal
  - job_name: "blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://www.baidu.com
        - https://www.jd.com
        labels:
          company: "外部公司"
          project: "外部项目"
          env: "dev"
      - targets: 
        - http://192.168.25.42:8081/
        labels:
          describe: springboot-web-demo应用
          company: "内部公司"
          project: "内部项目"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: 
        - 192.168.25.42:22
        - 192.168.25.41:9090
        labels:
          company: "内部公司"
          project: "测试项目"
          env: "test"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
  # ICMP (ping) probe configuration
  - job_name: "blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: 
        - 192.168.25.42
        - 192.168.25.200  # nonexistent IP
        labels:
          company: "内部公司"
          project: "测试项目"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
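
The three relabel_configs steps in each blackbox job are easy to misread, so here is a small simulation of what they do to one target. It is a simplified sketch, not Prometheus's relabeling engine: the function names are hypothetical, and the query-parameter order in a real scrape may differ.

```python
from urllib.parse import urlencode

BLACKBOX_ADDR = "192.168.25.42:9115"  # blackbox_exporter address from the config


def relabel_for_blackbox(target: str, module: str) -> dict:
    """Apply the three relabel steps from the blackbox jobs to one target."""
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the original address into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: expose the probed address as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at blackbox_exporter.
    labels["__address__"] = BLACKBOX_ADDR
    return labels


def scrape_url(labels: dict, metrics_path: str = "/probe") -> str:
    """Build the URL that ends up being scraped from the relabeled labels."""
    params = {
        key[len("__param_"):]: value
        for key, value in labels.items()
        if key.startswith("__param_")
    }
    return f"http://{labels['__address__']}{metrics_path}?{urlencode(params)}"
```

For the target https://www.baidu.com and the module http_2xx, the scrape goes to blackbox_exporter with module and target as query parameters, while the instance label still shows the probed site.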

Option 2 (file-based service discovery; changes to the target files are picked up automatically):

Edit prometheus.yml so that its full content is:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.25.41:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "file-target"
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/targets.yml
  - job_name: "file-sd-pushgateway"
    scrape_interval: 15s
    honor_labels: true  # When true, the labels carried by the pushed metrics are kept instead of being overwritten by the labels of the pushgateway target
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/pushgateway.yml    
  - job_name: "file-application"
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - targets/springboot.yml
      
  - job_name: "file-blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - refresh_interval: 10s
        files: 
          - targets/blackbox_http.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__param_target]
        target_label: __param_myparam
        replacement: myparamVal
      - target_label: __address__
        replacement: 192.168.25.42:9115
      #- regex: "__(.*)__"    # Labels of the form __xx__ are hidden by default; these two lines copy them to plain xx labels so that they become visible
      #  action: labelmap
  - job_name: "file-blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - refresh_interval: 10s
        files: 
          - targets/blackbox_tcp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115
  # ICMP (ping) probe configuration
  - job_name: "file-blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    file_sd_configs:
      - refresh_interval: 10s
        files: 
          - targets/blackbox_icmp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.25.42:9115

Create the targets directory, which holds the file-based discovery files:

mkdir -p /opt/software/prometheus/prometheus/targets/

Add the file /opt/software/prometheus/prometheus/targets/targets.yml with the following content:

- targets: ["192.168.25.41:9090"]
  labels:
    job: prometheus
- targets: ["192.168.25.41:9093"]
  labels:
    job: alertmanager
- targets: ["192.168.25.41:9100"]
  labels:
    job: node_exporter
    instance: Prometheus服务器
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9100"]
  labels:
    job: node_exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:8080"]
  labels:
    job: cadvisor
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9113"]
  labels:
    job: nginx-exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9121"]
  labels:
    job: redis-exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal
- targets: ["192.168.25.42:9104"]
  labels:
    job: mysql-exporter
    instance: 被监控服务器42
    myCusLabel: myCusVal

Add the file /opt/software/prometheus/prometheus/targets/springboot.yml with the following content:

- targets: ["192.168.25.42:8081"]
  labels:
    job: java-application
    instance: 被监控服务器42
    myCusLabel: myCusVal
    mytype: springboot

Add the file /opt/software/prometheus/prometheus/targets/pushgateway.yml with the following content:

- targets: ["192.168.25.42:9091"]
  labels:
    job: pushgateway
    instance: 被监控服务器42
    myCusLabel: myCusVal
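
For Prometheus to have anything to scrape from this target, a client first has to push metrics to Pushgateway. Below is a minimal sketch that builds such a push request with the standard library. The URL layout /metrics/job/<job>/instance/<instance> is Pushgateway's documented grouping-key path; the function name and the metric name used in the usage note are hypothetical, and label values containing a slash would need Pushgateway's base64 form, which this sketch does not handle.

```python
import urllib.request
from urllib.parse import quote

PUSHGATEWAY = "http://192.168.25.42:9091"  # Pushgateway address from this article


def build_push_request(job: str, instance: str, metrics: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that pushes simple untyped metrics."""
    # Text exposition format: one "name value" pair per line, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{PUSHGATEWAY}/metrics/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_push_request("demo_batch", "host-a", {"demo_last_success_timestamp_seconds": 1700000000}))`. With honor_labels: true in the scrape job, the pushed job and instance values are the ones that appear on the stored series.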

Add the file /opt/software/prometheus/prometheus/targets/blackbox_http.yml with the following content:

- targets:
    - https://www.baidu.com
    - https://www.jd.com
  labels:
    job: blackbox_http
    company: "外部公司"
    project: "外部项目"
    env: "dev"
- targets: 
    - http://192.168.25.42:8081/
  labels:
    job: blackbox_http
    describe: springboot-web-demo应用
    company: "内部公司"
    project: "内部项目"
    env: "dev"

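Add the file /opt/software/prometheus/prometheus/targets/blackbox_tcp.yml with the following content. The tcp_connect module needs host:port targets, so the same endpoints as in the static configuration are used:

- targets:
    - 192.168.25.42:22    # monitored endpoint
    - 192.168.25.41:9090  # monitored endpoint
  labels:
    job: blackbox_tcp
    company: "内部公司"
    project: "测试项目"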

Add the file /opt/software/prometheus/prometheus/targets/blackbox_icmp.yml with the following content:

- targets:
    - 192.168.25.42   # monitored IP
    - 192.168.25.200  # monitored IP
  labels:
    job: blackbox_icmp
    company: "内部公司"
    project: "测试项目"
    env: "dev"
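
Because Prometheus re-reads these files on its own, they can also be generated by a script. The sketch below writes a target list as JSON, which file_sd_configs accepts alongside YAML, and swaps it into place with a rename so that Prometheus is unlikely to read a half-written file. It assumes the files list in prometheus.yml is extended to match the .json file; the function name and the sample group are hypothetical.

```python
import json
import os
import tempfile

# One target group in the same shape as targets.yml above.
GROUPS = [
    {
        "targets": ["192.168.25.42:9100"],
        "labels": {"job": "node_exporter", "myCusLabel": "myCusVal"},
    },
]


def write_file_sd(path: str, groups: list) -> None:
    """Write target groups as JSON, replacing the destination in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, ensure_ascii=False, indent=2)
    # mkstemp creates the file as 0600; make it readable for the Prometheus user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

A call such as `write_file_sd("/opt/software/prometheus/prometheus/targets/targets.json", GROUPS)` would then take effect within the configured refresh_interval, with no reload of Prometheus.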

3.1.3 Adding alert rule files

Create the rules directory:

mkdir -p /opt/software/prometheus/prometheus/rules/

Add the file /opt/software/prometheus/prometheus/rules/alert.yml with the following content:

groups:
- name: Prometheus alert
  rules:
  - alert: 服务告警
    expr: up==0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Service down, instance: {{ $labels.instance }}"
      description: "Service {{ $labels.job }} is down"
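
The for: 30s clause means the alert does not fire the moment up==0 is first seen. The sketch below walks through the pending and firing states for a single series, assuming the 15s evaluation_interval configured earlier. It is a simplified model of the rule evaluation and ignores details such as keep_firing_for; the function name is hypothetical.

```python
def alert_states(condition_per_eval, hold_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    condition_per_eval: list of bools, True when the expression matched
    (for this rule, when up == 0) at that evaluation.
    """
    states = []
    active_since = None
    for index, matched in enumerate(condition_per_eval):
        now = index * interval_seconds
        if not matched:
            # The condition cleared, so the alert resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once the condition has held for the full duration.
        held_for = now - active_since
        states.append("firing" if held_for >= hold_seconds else "pending")
    return states
```

With `alert_states([False, True, True, True, False], hold_seconds=30, interval_seconds=15)` the target is pending for two evaluations, fires on the third, and returns to inactive once it is back up.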

Add the file /opt/software/prometheus/prometheus/rules/node-exporter.yml with the following content:

groups:
 - name: node-exporter
   rules:
   - alert: HostOutOfMemory
     expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host out of memory, instance: {{ $labels.instance }}"
       description: "Available memory is below 10%, current value: {{ $value }}"
   - alert: HostMemoryUnderMemoryPressure
     expr: rate(node_vmstat_pgmajfault[1m]) > 1000
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host under memory pressure, instance: {{ $labels.instance }}"
       description: "The node is under heavy memory pressure with a high rate of major page faults, current value: {{ $value }}"
   - alert: HostUnusualNetworkThroughputIn
     expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Unusual inbound network throughput, instance: {{ $labels.instance }}"
       description: "Inbound network traffic > 100 MB/s, current value: {{ $value }}"
   - alert: HostUnusualNetworkThroughputOut
     expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Unusual outbound network throughput, instance: {{ $labels.instance }}"
       description: "Outbound network traffic > 100 MB/s, current value: {{ $value }}"
   - alert: HostUnusualDiskReadRate
     expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk read rate, instance: {{ $labels.instance }}"
       description: "Disk reads > 50 MB/s, current value: {{ $value }}"
   - alert: HostUnusualDiskWriteRate
     expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk write rate, instance: {{ $labels.instance }}"
       description: "Disk writes > 50 MB/s, current value: {{ $value }}"
   - alert: HostOutOfDiskSpace
     expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host out of disk space, instance: {{ $labels.instance }}"
       description: "Less than 10% disk space left, current value: {{ $value }}"
   - alert: HostDiskWillFillIn24Hours
     expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Disk will fill within 24 hours, instance: {{ $labels.instance }}"
       description: "At the current write rate the disk is predicted to fill within 24 hours, current value: {{ $value }}"
   - alert: HostOutOfInodes
     expr: node_filesystem_files_free{mountpoint ="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/"} == 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Host out of inodes, instance: {{ $labels.instance }}"
       description: "Less than 10% of inodes left, current value: {{ $value }}"
   - alert: HostUnusualDiskReadLatency
     expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk read latency, instance: {{ $labels.instance }}"
       description: "Disk read latency > 100ms, current value: {{ $value }}"
   - alert: HostUnusualDiskWriteLatency
     expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Unusual disk write latency, instance: {{ $labels.instance }}"
       description: "Disk write latency > 100ms, current value: {{ $value }}"
   - alert: high_load 
     expr: node_load1 > 4
     for: 2m
     labels:
       severity: page
     annotations:
       summary: "High 1-minute load average, instance: {{ $labels.instance }}"
       description: "The 1-minute load average has been above 4 for 2 minutes, current value: {{ $value }}"
   - alert: HostHighCpuLoad
     expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
     for: 1m
     labels:
       severity: warning
     annotations:
       summary: "High CPU load, instance: {{ $labels.instance }}"
       description: "CPU usage > 80%, current value: {{ $value }}"
   - alert: HostCpuStealNoisyNeighbor
     expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
     for: 0m
     labels:
       severity: warning
     annotations:
       summary: "Unusual CPU steal, instance: {{ $labels.instance }}"
       description: "CPU steal > 10%. A noisy neighbor is killing VM performance, or a spot instance may be out of credit, current value: {{ $value }}"
   - alert: HostSwapIsFillingUp
     expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Swap is filling up, instance: {{ $labels.instance }}"
       description: "Swap usage > 80%"
   - alert: HostNetworkReceiveErrors
     expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Network receive errors, instance: {{ $labels.instance }}"
       description: "Interface {{ $labels.device }} had a receive error rate above 0.01 over the last 2 minutes, current value: {{ $value }}"
   - alert: HostNetworkTransmitErrors
     expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Network transmit errors, instance: {{ $labels.instance }}"
       description: "Interface {{ $labels.device }} had a transmit error rate above 0.01 over the last 2 minutes, current value: {{ $value }}"
   - alert: HostNetworkInterfaceSaturated
     expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
     for: 1m
     labels:
       severity: warning
     annotations:
       summary: "Network interface saturated, instance: {{ $labels.instance }}"
       description: "Interface {{ $labels.device }} is getting overloaded, current value: {{ $value }}"
   - alert: HostConntrackLimit
     expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Conntrack table near its limit, instance: {{ $labels.instance }}"
       description: "Connection tracking entries exceed 80% of the limit, current ratio: {{ $value }}"
   - alert: HostClockSkew
     expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Clock skew detected, instance: {{ $labels.instance }}"
       description: "Clock skew detected; the clock is out of sync. Value: {{ $value }}"
   - alert: HostClockNotSynchronising
     expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: "Clock not synchronising, instance: {{ $labels.instance }}"
       description: "The clock is not synchronising"
   - alert: NodeFileDescriptorLimit
     expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
     for: 1m
     labels:
       severity: warning
     annotations:
       summary: "The kernel is predicted to exhaust its file descriptor limit soon"
       description: "{{ $labels.instance }} has allocated more than 80% of its file descriptor limit, current value: {{ $value }}"
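
The HostDiskWillFillIn24Hours rule relies on predict_linear, which fits a straight line through the samples in the range and extrapolates it. The sketch below reproduces that idea with an ordinary least-squares fit. It is an approximation written for illustration, not Prometheus's implementation, and the sample data in the usage note is made up.

```python
def predict_linear(samples, eval_time, horizon_seconds):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Returns the predicted value horizon_seconds after eval_time.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


def shrinking_disk_samples():
    """Made-up data: 1 GB free, shrinking by 1 MB per minute for one hour."""
    return [(t, 1_000_000_000 - 1_000_000 * (t / 60)) for t in range(0, 3601, 60)]
```

For that data, `predict_linear(shrinking_disk_samples(), eval_time=3600, horizon_seconds=24 * 3600)` is negative, which is the condition `predict_linear(...[1h], 24 * 3600) < 0` in the rule: at the current rate the free space would run out within a day.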

Add the file /opt/software/prometheus/prometheus/rules/docker.yml with the following content:

groups:
- name: DockerContainers
  rules:
  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Docker container killed, instance: {{ $labels.instance }}"
      description: "A container has disappeared; seconds since it was last seen: {{ $value }}"
  # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No containers found, instance: {{ $labels.instance }}"
      description: "No container metrics have been present for 5 minutes, value: {{ $value }}"
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage alert, instance: {{ $labels.instance }}"
      description: "Container CPU usage is above 300%, current value: {{ $value }}"
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container memory usage alert, instance: {{ $labels.instance }}"
      description: "Container memory usage is above 80%, current value: {{ $value }}"
  - alert: ContainerVolumeIoUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container volume I/O usage alert, container: {{ $labels.instance }}"
      description: "Container volume I/O usage exceeds 80%, current value: {{ $value }}"
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU throttling alert, container: {{ $labels.instance }}"
      description: "The container is being throttled, current value: {{ $value }}"

Add the file /opt/software/prometheus/prometheus/rules/mysqld.yml with the following complete content:

groups:
- name: MySQL
  rules:
  - alert: MysqlDown
    expr: mysql_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "MySQL down, instance: {{ $labels.instance }}"
      description: "The MySQL exporter cannot connect to MySQL, current status: {{ $value }}"
  - alert: MysqlTooManyConnections
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many MySQL connections, instance: {{ $labels.instance }}"
      description: "MySQL connection usage is above 80%, current value: {{ $value }}"
  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) > 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many running MySQL threads, instance: {{ $labels.instance }}"
      description: "More than 20 MySQL threads are running, current count: {{ $value }}"
  - alert: MysqlSlowQueries
    expr: increase(mysql_global_status_slow_queries[2m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "MySQL slow query alert, instance: {{ $labels.instance }}"
      description: "MySQL logged {{ $value }} new slow queries in the last 2 minutes"
  # MySQL InnoDB log writes are stalling
  - alert: MysqlInnodbLogWaits
    expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "MySQL InnoDB log waits, instance: {{ $labels.instance }}"
      description: "MySQL InnoDB log writes are stalling, current value: {{ $value }}"
  - alert: MysqlRestarted
    expr: mysql_global_status_uptime < 60
    for: 0m
    labels:
      severity: info
    annotations:
      summary: "MySQL restarted, instance: {{ $labels.instance }}"
      description: "MySQL was restarted less than one minute ago"

  - alert: RowLockCurrentWaits
    expr: mysql_global_status_innodb_row_lock_current_waits > 0
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "MySQL row lock waits, instance: {{ $labels.instance }}"
      description: "There are currently {{ $value }} row lock waits"

Add the file /opt/software/prometheus/prometheus/rules/nginx.yml with the following complete content:

groups:
- name: nginx
  rules:
  # Alert when any instance has been unreachable for more than 30 seconds
  - alert: NginxDown
    expr: nginx_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "nginx failure, instance: {{ $labels.instance }}"
      description: "{{ $labels.job }}: nginx is down"

Add the file /opt/software/prometheus/prometheus/rules/redis.yml with the following complete content:

groups:
- name: redis
  rules:
  - alert: RedisDown
    expr: redis_up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: 'Redis down, instance: {{ $labels.instance }}'
      description: "The Redis instance is down"
  - alert: RedisMissingBackup
    expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Redis backup missing, instance: {{ $labels.instance }}"
      description: "Redis has not been backed up for 24 hours"

  - alert: RedisOutOfConfiguredMaxmemory
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis is running out of configured maxmemory, instance: {{ $labels.instance }}"
      description: "Redis memory usage exceeds 90% of the configured maximum"
  - alert: RedisTooManyConnections
    expr: redis_connected_clients > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many Redis connections, instance: {{ $labels.instance }}"
      description: "Current number of Redis connections: {{ $value }}"
  - alert: RedisNotEnoughConnections
    expr: redis_connected_clients < 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Not enough Redis connections, instance: {{ $labels.instance }}"
      description: "Current number of Redis connections: {{ $value }}"
  - alert: RedisRejectedConnections
    expr: increase(redis_rejected_connections_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Redis rejected connections, instance: {{ $labels.instance }}"
      description: "Some connections to Redis were rejected: {{ $value }}"

Add the file /opt/software/prometheus/prometheus/rules/springboot.yml with the following complete content:

groups:
- name: SpringBoot
  rules:
  - alert: SpringBootErrorEvents
    expr: increase(logback_events_total{level="error"}[3m]) > 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Spring Boot error events, instance: {{ $labels.instance }}"
      description: "{{ $value }} new error events in the last 3 minutes"

Add the file /opt/software/prometheus/prometheus/rules/blackbox_exporter.yml with the following complete content:

groups:
- name: Blackbox
  rules:
  - alert: BlackboxProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Blackbox probe failed, instance: {{ $labels.instance }}"
      description: "The blackbox probe failed, current value: {{ $value }}"
  - alert: BlackboxSlowRequest
    expr: avg_over_time(probe_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Slow request, instance: {{ $labels.instance }}"
      description: "The request took longer than 1 second, value: {{ $value }}"
  - alert: BlackboxHttpStatusCodeFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "HTTP status code check failed, instance: {{ $labels.instance }}"
      description: "The HTTP status code is not in the 200-399 range, current status code: {{ $value }}"
  - alert: BlackboxSslCertificateExpiringSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Certificate expiring soon, instance: {{ $labels.instance }}"
      description: "The SSL certificate expires within 30 days, seconds remaining: {{ $value }}"

  - alert: BlackboxSslCertificateExpiringSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Certificate expiring soon, instance: {{ $labels.instance }}"
      description: "The SSL certificate expires within 3 days, seconds remaining: {{ $value }}"

  - alert: BlackboxSslCertificateExpired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Certificate expired, instance: {{ $labels.instance }}"
      description: "The SSL certificate has expired; check whether it is still in use"
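
Before starting or reloading Prometheus, it may be worth validating every rule file with promtool, since a single YAML or PromQL mistake can prevent the configuration from loading. The following is a minimal sketch: the promtool path is an assumption (it is taken to sit next to the prometheus binary), and the function name check_rule_files is made up for illustration.

```shell
#!/usr/bin/env bash
# Path to promtool; assumed to live in the Prometheus install directory.
PROMTOOL="${PROMTOOL:-/opt/software/prometheus/prometheus/promtool}"

# Run "promtool check rules" on every .yml file in the given directory.
# Stops and returns non-zero as soon as one file fails validation.
check_rule_files() {
  local dir="$1" file
  for file in "$dir"/*.yml; do
    [ -e "$file" ] || continue   # skip when the directory has no .yml files
    "$PROMTOOL" check rules "$file" || return 1
  done
}

# Example (hypothetical usage):
# check_rule_files /opt/software/prometheus/prometheus/rules
```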

3.1.4 Create a custom systemd service

cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/prometheus/prometheus \
  --config.file=/opt/software/prometheus/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/software/prometheus/prometheus/data \
  --storage.tsdb.retention.time=60d \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target

EOF

Reload systemd so that it picks up the new unit file:

systemctl daemon-reload

3.1.5 Start Prometheus

# Start Prometheus
systemctl start prometheus
# Enable Prometheus at boot
systemctl enable prometheus

3.1.6. Access the Prometheus web UI

Open in a browser: http://192.168.25.41:9090/

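3.1.7 Check that the targets are loaded

Open http://192.168.25.41:9090/targets and confirm that every configured target is listed with state UP.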

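3.1.8 Check that the rules are loaded

Open http://192.168.25.41:9090/rules and confirm that the rule groups defined above are listed.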
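
The target and rule checks can also be scripted against the Prometheus HTTP API instead of the web UI. This is a minimal sketch, assuming the server address used throughout this article and assuming the API returns compact JSON (no spaces after colons), which is why plain grep is used here rather than a JSON parser. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Prometheus base URL; taken from this article's example environment.
PROM_URL="${PROM_URL:-http://192.168.25.41:9090}"

# Count active scrape targets grouped by health ("up", "down", "unknown").
target_health_summary() {
  curl -s "$PROM_URL/api/v1/targets?state=active" \
    | grep -o '"health":"[a-z]*"' | sort | uniq -c
}

# Print the number of alerting rules that Prometheus has loaded.
alerting_rule_count() {
  curl -s "$PROM_URL/api/v1/rules?type=alert" \
    | grep -o '"type":"alerting"' | wc -l | tr -d ' '
}

# Example (hypothetical usage):
# target_health_summary
# alerting_rule_count
```

If any target is reported as down, or the rule count is lower than the number of alerts defined in the rules directory, the Prometheus log is the first place to look.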

3.1.9. Notes

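Note 1: curl -X POST http://192.168.25.41:9090/-/reload reloads the Prometheus configuration at runtime (this relies on the --web.enable-lifecycle flag set in the service file).
Note 2: View the Prometheus logs with: journalctl -u prometheus.service
Note 3: Prometheus web UI: http://192.168.25.41:9090/
Note 4: Check the configuration file syntax with: ./promtool check config prometheus.yml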
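
The syntax check and the runtime reload can be combined so that a broken configuration is never sent to a running server. The following is a minimal sketch: the install directory and URL come from this article, while the function name safe_reload is made up for illustration.

```shell
#!/usr/bin/env bash
# Install directory and base URL as used in this article.
PROM_DIR="${PROM_DIR:-/opt/software/prometheus/prometheus}"
PROM_URL="${PROM_URL:-http://192.168.25.41:9090}"

# Validate prometheus.yml first; only trigger a reload when the check passes.
safe_reload() {
  "$PROM_DIR/promtool" check config "$PROM_DIR/prometheus.yml" || return 1
  curl -s -X POST "$PROM_URL/-/reload"
}

# Example (hypothetical usage): safe_reload && echo "reload requested"
```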

3.2. Grafana installation

3.2.1. Installation steps

Grafana download page: https://grafana.com/grafana/download/11.5.1?platform=linux

# Download the package
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-11.5.1.linux-amd64.tar.gz
# Extract it to the target directory
tar -zxvf grafana-enterprise-11.5.1.linux-amd64.tar.gz -C /opt/software/prometheus/
# Rename the extracted directory
mv /opt/software/prometheus/grafana-v11.5.1 /opt/software/prometheus/grafana

3.2.2 Create a custom systemd service

cat > /etc/systemd/system/grafana-server.service << 'EOF'
[Unit]
Description=Grafana server
Documentation=http://docs.grafana.org
[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/software/prometheus/grafana/bin/grafana-server \
  --config=/opt/software/prometheus/grafana/conf/defaults.ini \
  --homepath=/opt/software/prometheus/grafana
[Install]
WantedBy=multi-user.target

EOF

Reload systemd so that it picks up the new unit file:

systemctl daemon-reload

3.2.3 Start Grafana

# Start Grafana
systemctl start grafana-server
# Enable Grafana at boot
systemctl enable grafana-server

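3.2.4 Access the Grafana web UI

Open in a browser: http://192.168.25.41:3000/

Default username/password: admin/admin

The default credentials are defined in /opt/software/prometheus/grafana/conf/defaults.ini.

Enter the username and password to log in.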

3.2.5 Add the Prometheus data source

Then keep clicking through the remaining steps to create the data source.
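
As an alternative to the web UI, the data source can be created through Grafana's HTTP API. This is a minimal sketch under several assumptions: Grafana still uses the default admin/admin credentials, the data source is named "Prometheus", and Prometheus is reachable at the address used in this article. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Addresses and credentials from this article's example environment.
GRAFANA_URL="${GRAFANA_URL:-http://192.168.25.41:3000}"
GRAFANA_AUTH="${GRAFANA_AUTH:-admin:admin}"   # change if the password was updated
PROM_URL="${PROM_URL:-http://192.168.25.41:9090}"

# Build the JSON body describing a Prometheus data source.
datasource_payload() {
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$PROM_URL"
}

# Send the payload to Grafana's data source endpoint using basic auth.
add_prometheus_datasource() {
  curl -s -u "$GRAFANA_AUTH" -H 'Content-Type: application/json' \
    -X POST "$GRAFANA_URL/api/datasources" -d "$(datasource_payload)"
}

# Example (hypothetical usage): add_prometheus_datasource
```

If a data source with the same name already exists, Grafana is expected to reject the request, so the call is best used on a fresh installation.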

3.2.6 Add a dashboard (server monitoring dashboard)

Node Exporter Full dashboard on Grafana: https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Direct download link: https://grafana.com/api/dashboards/1860/revisions/33/download

Final result: