16. Big Data Monitoring

0. Overview

Main components of the monitoring stack.

Software versions.

1. Exporter Monitoring Configuration

1.1 node_exporter

Startup command

bash
nohup ./node_exporter &

Service

Create the file /etc/systemd/system/node_exporter.service:

ini
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/node_exporter/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

1.2 kafka_exporter

Startup script

bash
#!/bin/bash
cd /opt/apps/exporters/kafka_exporter || exit 1
nohup ./kafka_exporter --kafka.server=instance-kafka01:9092 --kafka.server=instance-kafka02:9092 --kafka.server=instance-kafka03:9092 \
--zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
--web.listen-address="172.16.0.243:9340" >/dev/null 2>&1 &

Service

Create the file /etc/systemd/system/kafka_exporter.service:

ini
[Unit]
Description=Kafka Exporter for Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/exporters/kafka_exporter/kafka_exporter \
  --kafka.server=instance-kafka01:9092 \
  --kafka.server=instance-kafka02:9092 \
  --kafka.server=instance-kafka03:9092 \
  --zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
  --web.listen-address=0.0.0.0:9340
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Starting the exporters

kafka_exporter is used as the example here; the other exporter services are managed the same way.

Commands

bash
sudo systemctl daemon-reload
sudo systemctl enable kafka_exporter
sudo systemctl start kafka_exporter

Check the service status

bash
sudo systemctl status kafka_exporter

2. Prometheus Configuration

2.1 Configure prometheus.yml

yaml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - instance-metric01:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "pushgateway"
    static_configs:
      - targets: ["instance-metric01:9091"]
  - job_name: "kafka"
    static_configs:
      - targets: ["instance-kafka02:9340"]
  - job_name: "node"
    static_configs:
      - targets: ["instance-kafka01:9100","instance-kafka02:9100","instance-kafka03:9100","instance-metric01:9100"]
    metric_relabel_configs:
    - action: replace
      source_labels: ["instance"]
      regex: ([^:]+):([0-9]+)
      replacement: $1
      target_label: "host_name"
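
The metric_relabel_configs block copies the host part of each target's instance label (e.g. instance-kafka01:9100) into a new host_name label, which the alert rules below group by. Prometheus anchors the regex against the entire label value, so the rule's behavior can be sanity-checked in isolation; a small Python sketch, for illustration only:

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it, hence fullmatch.
pattern = re.compile(r"([^:]+):([0-9]+)")

def host_name(instance: str) -> str:
    """Mimic replacement "$1": keep only the host part of host:port."""
    m = pattern.fullmatch(instance)
    return m.group(1) if m else instance

print(host_name("instance-kafka01:9100"))  # instance-kafka01
```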

2.2 Alert Rules Configuration

The rule files below live in the rules directory under the Prometheus installation directory.

cpu.yml

yaml
groups:
- name: cpu_state
  rules:
  - alert: CpuUsageHigh
    expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (host_name)) * 100 > 90
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "CPU usage on {{$labels.host_name}} is above 90%"
      description: "Server {{$labels.host_name}}: current CPU usage {{$value}}% exceeds 90%"
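
The expression takes the 2-minute average idle-CPU rate per host and converts it to a usage percentage; an average idle rate of 0.05 means the CPU was idle 5% of the time, i.e. about 95% busy. The same arithmetic as a Python sketch (illustrative only):

```python
def cpu_usage_percent(avg_idle_rate: float) -> float:
    """(1 - avg idle rate) * 100, as in the alert expression."""
    return (1 - avg_idle_rate) * 100

# Idle 5% of the time -> ~95% usage, above the 90% threshold.
print(cpu_usage_percent(0.05) > 90)  # True
```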

disk.yml

yaml
groups:
- name: disk_state
  rules:
  - alert: DiskUsageHigh
    expr: (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_avail_bytes{fstype=~"ext.?|xfs"}) / node_filesystem_size_bytes{fstype=~"ext.?|xfs"} * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Disk usage on {{$labels.host_name}} is above 80%"
      description: "Server {{$labels.host_name}}, mountpoint {{ $labels.mountpoint }}: current usage {{$value}}% exceeds 80%"
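
The disk expression is plain used/total arithmetic over ext*/xfs filesystems. As a sketch (illustrative only):

```python
def disk_usage_percent(size_bytes: int, avail_bytes: int) -> float:
    """(size - avail) / size * 100, mirroring the alert expression."""
    return (size_bytes - avail_bytes) / size_bytes * 100

# A 100 GiB filesystem with 15 GiB still available is ~85% used,
# which exceeds the 80% threshold.
print(disk_usage_percent(100 * 2**30, 15 * 2**30) > 80)  # True
```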

dispatcher.yml

yaml
groups:
- name: dispatcher_state
  rules:
  - alert: Dispatcher06Status
    expr: sum(dispatcher06_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.218 is writing 0 records; the process has a problem!"
  - alert: Dispatcher07Status
    expr: sum(dispatcher07_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.219 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk1Status
    expr: sum(dispatcherk1_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.243 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk2Status
    expr: sum(dispatcherk2_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.244 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk3Status
    expr: sum(dispatcherk3_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.245 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk4Status
    expr: sum(dispatcherk4_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.246 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk5Status
    expr: sum(dispatcherk5_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.247 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk6Status
    expr: sum(dispatcherk6_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.140 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk7Status
    expr: sum(dispatcherk7_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.141 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk8Status
    expr: sum(dispatcherk8_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.142 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk9Status
    expr: sum(dispatcherk9_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.143 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk13Status
    expr: sum(dispatcherk13_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.155 is writing 0 records; the process has a problem!"

dn.yml

yaml
groups:
- name: dn_state
  rules:
  - alert: DataNodeCapacityHigh
    expr: (sum(Hadoop_DataNode_DfsUsed{name="FSDatasetState"}) by (host_name) / sum(Hadoop_DataNode_Capacity{name="FSDatasetState"}) by(host_name)) * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "DataNode {{$labels.host_name}} is more than 80% full"
      description: "DataNode {{$labels.host_name}}: used capacity {{$value}}% exceeds 80% of total capacity"

kafka_lag.yml

yaml
groups:
- name: kafka_lag
  rules:
  - alert: KafkaConsumerLagHigh
    expr: sum(kafka_consumergroup_lag{topic!~"pct_.+"}) by(consumergroup,topic) > 500000 or sum(kafka_consumergroup_lag{topic=~"pct_.+"}) by(consumergroup,topic) > 2000000
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Consumer group {{$labels.consumergroup}} on topic {{$labels.topic}} has a message backlog"
      description: "Message lag: {{$value}}"
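
The expression applies two thresholds: topics matching pct_.+ tolerate a backlog of up to 2,000,000 messages, while all other topics alert above 500,000. The decision logic, sketched in Python (the topic names here are made up for illustration):

```python
import re

def lag_alert(topic: str, lag: int) -> bool:
    """Two-tier threshold from the rule: pct_* topics tolerate a larger backlog."""
    threshold = 2_000_000 if re.fullmatch(r"pct_.+", topic) else 500_000
    return lag > threshold

print(lag_alert("orders", 600_000))      # True: over the 500k default
print(lag_alert("pct_events", 600_000))  # False: pct_ topics allow up to 2M
```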

mem.yml

yaml
groups:
- name: memory_state
  rules:
  - alert: MemoryUsageHigh
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Memory usage on {{$labels.host_name}} is above 90%"
      description: "Server {{$labels.host_name}}: current memory usage {{$value}}% exceeds 90%"

process.yml

yaml
groups:
- name: proc_state
  rules:
  - alert: ProcessDown
    expr: namedprocess_namegroup_num_procs < 1
    for: 60s
    labels:
      severity: critical
      target: "{{$labels.app_name}}"
    annotations:
      summary: "Process {{$labels.app_name}} has stopped"
      description: "Process {{$labels.app_name}} on server {{$labels.host_name}} has stopped."

prometheus_process.yml

yaml
groups:
- name: proc_state
  rules:
  - alert: PrometheusComponentDown
    expr: sum(up) by(instance,job) == 0
    for: 30s
    labels:
      severity: critical
      target: "{{$labels.job}}"
    annotations:
      summary: "Process {{$labels.job}} has stopped"
      description: "Process {{$labels.job}} on server {{$labels.instance}} has stopped."

yarn.yml

yaml
groups:
- name: yarn_node
  rules:
  - alert: YarnActiveNodesLow
    expr: sum(Hadoop_ResourceManager_NumActiveNMs{job='rm'}) by (job) < 13 or sum(Hadoop_ResourceManager_NumActiveNMs{job='rmf'}) by (job) < 12
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "YARN cluster {{$labels.job}} has too few active NodeManagers"

2.3 Startup

Startup command

bash
nohup /opt/apps/prometheus/prometheus \
--web.listen-address="0.0.0.0:9090" \
--web.read-timeout=5m \
--web.max-connections=10  \
--storage.tsdb.retention=7d  \
--storage.tsdb.path="data/" \
--query.max-concurrency=20   \
--query.timeout=2m \
--web.enable-lifecycle \
> /opt/apps/prometheus/logs/start.log 2>&1 &

2.4 Reloading the Configuration

Because --web.enable-lifecycle is set above, the configuration can be reloaded without restarting Prometheus:

bash
curl -X POST http://localhost:9090/-/reload

3. Pushgateway

Startup command

bash
nohup /opt/apps/pushgateway/pushgateway \
--web.listen-address="0.0.0.0:9091" \
> /opt/apps/pushgateway/start.log 2>&1 &
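
Jobs such as the dispatcher processes, whose dispatcherXX_data series the rules in section 2.2 alert on, push their metrics to the Pushgateway over plain HTTP in the Prometheus text exposition format. A minimal sketch using only the Python standard library; the job name and value here are illustrative:

```python
from urllib.request import Request, urlopen

def exposition_body(metric: str, value: float) -> bytes:
    """Prometheus text exposition format: one `name value` line."""
    return f"{metric} {value}\n".encode()

def push(gateway: str, job: str, metric: str, value: float) -> None:
    """PUT the body to /metrics/job/<job> on the Pushgateway."""
    req = Request(f"http://{gateway}/metrics/job/{job}",
                  data=exposition_body(metric, value), method="PUT")
    urlopen(req).close()

# Network call left commented out; the address matches the config above.
# push("instance-metric01:9091", "dispatcher06", "dispatcher06_data", 1234)
print(exposition_body("dispatcher06_data", 1234).decode(), end="")
```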

4. Alertmanager

4.1 Configure alertmanager.yml

yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://mecury-ca01:9825/api/alarm/send'
    send_resolved: true
inhibit_rules:
  - source_match:
      alertname: 'ApplicationDown'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'job', "target", 'instance']
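
The inhibit rule above suppresses warning-level alerts while a critical ApplicationDown alert that agrees on all four equal labels is firing. The matching logic, sketched below, is an illustration only, not Alertmanager's implementation:

```python
EQUAL = ["alertname", "job", "target", "instance"]

def inhibited(source: dict, target: dict) -> bool:
    """True if `target` (a warning) is suppressed by a firing `source`."""
    if source.get("alertname") != "ApplicationDown":
        return False
    if source.get("severity") != "critical" or target.get("severity") != "warning":
        return False
    # All labels listed in `equal` must match between source and target.
    return all(source.get(k) == target.get(k) for k in EQUAL)

critical = {"alertname": "ApplicationDown", "severity": "critical",
            "job": "web", "target": "10.0.0.1", "instance": "10.0.0.1:8080"}
warning = {**critical, "severity": "warning"}
print(inhibited(critical, warning))  # True
```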

Configure the alert webhook address; the alert payload that Alertmanager POSTs to it looks like the following reference example:

json
{
  "version": "4",
  "groupKey": "alertname:ApplicationDown",
  "status": "firing",
  "receiver": "web.hook",
  "groupLabels": {
    "alertname": "ApplicationDown"
  },
  "commonLabels": {
    "alertname": "ApplicationDown",
    "severity": "critical",
    "instance": "10.0.0.1:8080",
    "job": "web",
    "target": "10.0.0.1"
  },
  "commonAnnotations": {
    "summary": "Web application is down",
    "description": "The web application at instance 10.0.0.1:8080 is not responding."
  },
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "ApplicationDown",
        "severity": "critical",
        "instance": "10.0.0.1:8080",
        "job": "web",
        "target": "10.0.0.1"
      },
      "annotations": {
        "summary": "Web application is down",
        "description": "The web application at instance 10.0.0.1:8080 is not responding."
      },
      "startsAt": "2025-06-19T04:30:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph?g0.expr=up%7Bjob%3D%22web%22%7D+%3D%3D+0",
      "fingerprint": "1234567890abcdef"
    }
  ]
}
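
The service behind the webhook URL receives that JSON via HTTP POST, with one entry per alert in the alerts array. A minimal parsing sketch (the sample values mirror the payload above):

```python
import json

def summarize_alerts(payload: str) -> list:
    """Flatten an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in json.loads(payload).get("alerts", []):
        labels = alert.get("labels", {})
        ann = alert.get("annotations", {})
        lines.append(f"[{alert.get('status')}] {labels.get('alertname')} "
                     f"on {labels.get('instance')}: {ann.get('summary')}")
    return lines

sample = json.dumps({"alerts": [{"status": "firing",
                                 "labels": {"alertname": "ApplicationDown",
                                            "instance": "10.0.0.1:8080"},
                                 "annotations": {"summary": "Web application is down"}}]})
print(summarize_alerts(sample)[0])
# [firing] ApplicationDown on 10.0.0.1:8080: Web application is down
```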

4.2 Startup

Startup script start.sh

bash
#!/bin/bash

nohup /opt/apps/alertmanager/alertmanager \
--config.file=/opt/apps/alertmanager/alertmanager.yml \
> /opt/apps/alertmanager/start.log 2>&1 &

5. Grafana

5.1 Installation

Startup command

bash
nohup /opt/apps/grafana/bin/grafana-server web > /opt/apps/grafana/grafana.log 2>&1 &

The default username and password are both admin.

5.2 Common Dashboard Templates

node: dashboard ID 16098

kafka: dashboard ID 7589

process: dashboard ID 249
