16. Big Data Monitoring

0. Overview

Main components of the monitoring stack.

Software versions.

1. Exporter Monitoring Configuration

1.1 node_exporter

Startup command

bash
nohup ./node_exporter &

Service

Create the file /etc/systemd/system/node_exporter.service:

ini
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/node_exporter/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

1.2 kafka_exporter

Startup script

bash
#!/bin/bash
cd /opt/apps/exporters/kafka_exporter || exit 1
nohup ./kafka_exporter --kafka.server=instance-kafka01:9092 --kafka.server=instance-kafka02:9092 --kafka.server=instance-kafka03:9092 \
--zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
--web.listen-address="172.16.0.243:9340" >/dev/null 2>&1 &

Service

Create the file /etc/systemd/system/kafka_exporter.service:

ini
[Unit]
Description=Kafka Exporter for Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/exporters/kafka_exporter/kafka_exporter \
  --kafka.server=instance-kafka01:9092 \
  --kafka.server=instance-kafka02:9092 \
  --kafka.server=instance-kafka03:9092 \
  --zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
  --web.listen-address=0.0.0.0:9340
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Starting the exporters

kafka_exporter is used as the example here; the other exporter services are managed the same way.

Commands

bash
sudo systemctl daemon-reload
sudo systemctl enable kafka_exporter
sudo systemctl start kafka_exporter

Check the service status

bash
sudo systemctl status kafka_exporter

2. Prometheus Configuration

2.1 Configure prometheus.yml

yaml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - instance-metric01:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "pushgateway"
    static_configs:
      - targets: ["instance-metric01:9091"]
  - job_name: "kafka"
    static_configs:
      - targets: ["instance-kafka02:9340"]
  - job_name: "node"
    static_configs:
      - targets: ["instance-kafka01:9100","instance-kafka02:9100","instance-kafka03:9100","instance-metric01:9100"]
    metric_relabel_configs:
    - action: replace
      source_labels: ["instance"]
      regex: ([^:]+):([0-9]+)
      replacement: $1
      target_label: "host_name"
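
The metric_relabel_configs block copies the host part of each target's instance label (e.g. instance-kafka01:9100) into a new host_name label, which the alert rules below group by. Prometheus anchors the regex against the entire label value, so the rule's behavior can be sanity-checked in isolation; a small Python sketch, for illustration only:

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it, hence fullmatch.
pattern = re.compile(r"([^:]+):([0-9]+)")

def host_name(instance: str) -> str:
    """Mimic replacement "$1": keep only the host part of host:port."""
    m = pattern.fullmatch(instance)
    return m.group(1) if m else instance

print(host_name("instance-kafka01:9100"))  # instance-kafka01
```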

2.2 Alert Rules Configuration

The rule files below live in the rules directory under the Prometheus installation directory.

cpu.yml

yaml
groups:
- name: cpu_state
  rules:
  - alert: CpuUsageHigh
    expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (host_name)) * 100 > 90
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "CPU usage on {{$labels.host_name}} is above 90%"
      description: "Server {{$labels.host_name}}: current CPU usage {{$value}}% exceeds 90%"
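
The expression takes the 2-minute average idle-CPU rate per host and converts it to a usage percentage; an average idle rate of 0.05 means the CPU was idle 5% of the time, i.e. about 95% busy. The same arithmetic as a Python sketch (illustrative only):

```python
def cpu_usage_percent(avg_idle_rate: float) -> float:
    """(1 - avg idle rate) * 100, as in the alert expression."""
    return (1 - avg_idle_rate) * 100

# Idle 5% of the time -> ~95% usage, above the 90% threshold.
print(cpu_usage_percent(0.05) > 90)  # True
```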

disk.yml

yaml
groups:
- name: disk_state
  rules:
  - alert: DiskUsageHigh
    expr: (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_avail_bytes{fstype=~"ext.?|xfs"}) / node_filesystem_size_bytes{fstype=~"ext.?|xfs"} * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Disk usage on {{$labels.host_name}} is above 80%"
      description: "Server {{$labels.host_name}}, mountpoint {{ $labels.mountpoint }}: current usage {{$value}}% exceeds 80%"
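
The disk expression is plain used/total arithmetic over ext*/xfs filesystems. As a sketch (illustrative only):

```python
def disk_usage_percent(size_bytes: int, avail_bytes: int) -> float:
    """(size - avail) / size * 100, mirroring the alert expression."""
    return (size_bytes - avail_bytes) / size_bytes * 100

# A 100 GiB filesystem with 15 GiB still available is ~85% used,
# which exceeds the 80% threshold.
print(disk_usage_percent(100 * 2**30, 15 * 2**30) > 80)  # True
```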

dispatcher.yml

yaml
groups:
- name: dispatcher_state
  rules:
  - alert: Dispatcher06Status
    expr: sum(dispatcher06_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.218 is writing 0 records; the process has a problem!"
  - alert: Dispatcher07Status
    expr: sum(dispatcher07_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.219 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk1Status
    expr: sum(dispatcherk1_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.243 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk2Status
    expr: sum(dispatcherk2_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.244 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk3Status
    expr: sum(dispatcherk3_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.245 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk4Status
    expr: sum(dispatcherk4_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.246 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk5Status
    expr: sum(dispatcherk5_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.247 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk6Status
    expr: sum(dispatcherk6_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.140 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk7Status
    expr: sum(dispatcherk7_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.141 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk8Status
    expr: sum(dispatcherk8_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.142 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk9Status
    expr: sum(dispatcherk9_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.143 is writing 0 records; the process has a problem!"
  - alert: Dispatcherk13Status
    expr: sum(dispatcherk13_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher write volume is 0"
      description: "The dispatcher on server 172.16.0.155 is writing 0 records; the process has a problem!"

dn.yml

yaml
groups:
- name: dn_state
  rules:
  - alert: DataNodeCapacityHigh
    expr: (sum(Hadoop_DataNode_DfsUsed{name="FSDatasetState"}) by (host_name) / sum(Hadoop_DataNode_Capacity{name="FSDatasetState"}) by(host_name)) * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "DataNode {{$labels.host_name}} is more than 80% full"
      description: "DataNode {{$labels.host_name}}: used capacity {{$value}}% exceeds 80% of total capacity"

kafka_lag.yml

yaml
groups:
- name: kafka_lag
  rules:
  - alert: KafkaConsumerLagHigh
    expr: sum(kafka_consumergroup_lag{topic!~"pct_.+"}) by(consumergroup,topic) > 500000 or sum(kafka_consumergroup_lag{topic=~"pct_.+"}) by(consumergroup,topic) > 2000000
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Consumer group {{$labels.consumergroup}} on topic {{$labels.topic}} has a message backlog"
      description: "Message lag: {{$value}}"
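
The expression applies two thresholds: topics matching pct_.+ tolerate a backlog of up to 2,000,000 messages, while all other topics alert above 500,000. The decision logic, sketched in Python (the topic names here are made up for illustration):

```python
import re

def lag_alert(topic: str, lag: int) -> bool:
    """Two-tier threshold from the rule: pct_* topics tolerate a larger backlog."""
    threshold = 2_000_000 if re.fullmatch(r"pct_.+", topic) else 500_000
    return lag > threshold

print(lag_alert("orders", 600_000))      # True: over the 500k default
print(lag_alert("pct_events", 600_000))  # False: pct_ topics allow up to 2M
```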

mem.yml

yaml
groups:
- name: memory_state
  rules:
  - alert: MemoryUsageHigh
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Memory usage on {{$labels.host_name}} is above 90%"
      description: "Server {{$labels.host_name}}: current memory usage {{$value}}% exceeds 90%"

process.yml

yaml
groups:
- name: proc_state
  rules:
  - alert: ProcessDown
    expr: namedprocess_namegroup_num_procs < 1
    for: 60s
    labels:
      severity: critical
      target: "{{$labels.app_name}}"
    annotations:
      summary: "Process {{$labels.app_name}} has stopped"
      description: "Process {{$labels.app_name}} on server {{$labels.host_name}} has stopped."

prometheus_process.yml

yaml
groups:
- name: proc_state
  rules:
  - alert: PrometheusComponentDown
    expr: sum(up) by(instance,job) == 0
    for: 30s
    labels:
      severity: critical
      target: "{{$labels.job}}"
    annotations:
      summary: "Process {{$labels.job}} has stopped"
      description: "Process {{$labels.job}} on server {{$labels.instance}} has stopped."

yarn.yml

yaml
groups:
- name: yarn_node
  rules:
  - alert: YarnActiveNodesLow
    expr: sum(Hadoop_ResourceManager_NumActiveNMs{job='rm'}) by (job) < 13 or sum(Hadoop_ResourceManager_NumActiveNMs{job='rmf'}) by (job) < 12
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "YARN cluster {{$labels.job}} has too few active NodeManagers"

2.3 Startup

Startup command

bash
nohup /opt/apps/prometheus/prometheus \
--web.listen-address="0.0.0.0:9090" \
--web.read-timeout=5m \
--web.max-connections=10  \
--storage.tsdb.retention=7d  \
--storage.tsdb.path="data/" \
--query.max-concurrency=20   \
--query.timeout=2m \
--web.enable-lifecycle \
> /opt/apps/prometheus/logs/start.log 2>&1 &

2.4 Reloading the Configuration

Because --web.enable-lifecycle is set above, the configuration can be reloaded without restarting Prometheus:

bash
curl -X POST http://localhost:9090/-/reload

3. Pushgateway

Startup command

bash
nohup /opt/apps/pushgateway/pushgateway \
--web.listen-address="0.0.0.0:9091" \
> /opt/apps/pushgateway/start.log 2>&1 &
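
Jobs such as the dispatcher processes, whose dispatcherXX_data series the rules in section 2.2 alert on, push their metrics to the Pushgateway over plain HTTP in the Prometheus text exposition format. A minimal sketch using only the Python standard library; the job name and value here are illustrative:

```python
from urllib.request import Request, urlopen

def exposition_body(metric: str, value: float) -> bytes:
    """Prometheus text exposition format: one `name value` line."""
    return f"{metric} {value}\n".encode()

def push(gateway: str, job: str, metric: str, value: float) -> None:
    """PUT the body to /metrics/job/<job> on the Pushgateway."""
    req = Request(f"http://{gateway}/metrics/job/{job}",
                  data=exposition_body(metric, value), method="PUT")
    urlopen(req).close()

# Network call left commented out; the address matches the config above.
# push("instance-metric01:9091", "dispatcher06", "dispatcher06_data", 1234)
print(exposition_body("dispatcher06_data", 1234).decode(), end="")
```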

4. Alertmanager

4.1 Configure alertmanager.yml

yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://mecury-ca01:9825/api/alarm/send'
    send_resolved: true
inhibit_rules:
  - source_match:
      alertname: 'ApplicationDown'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'job', "target", 'instance']
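
The inhibit rule above suppresses warning-level alerts while a critical ApplicationDown alert that agrees on all four equal labels is firing. The matching logic, sketched below, is an illustration only, not Alertmanager's implementation:

```python
EQUAL = ["alertname", "job", "target", "instance"]

def inhibited(source: dict, target: dict) -> bool:
    """True if `target` (a warning) is suppressed by a firing `source`."""
    if source.get("alertname") != "ApplicationDown":
        return False
    if source.get("severity") != "critical" or target.get("severity") != "warning":
        return False
    # All labels listed in `equal` must match between source and target.
    return all(source.get(k) == target.get(k) for k in EQUAL)

critical = {"alertname": "ApplicationDown", "severity": "critical",
            "job": "web", "target": "10.0.0.1", "instance": "10.0.0.1:8080"}
warning = {**critical, "severity": "warning"}
print(inhibited(critical, warning))  # True
```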

Configure the alert webhook address; the alert payload that Alertmanager POSTs to it looks like the following reference example:

json
{
  "version": "4",
  "groupKey": "alertname:ApplicationDown",
  "status": "firing",
  "receiver": "web.hook",
  "groupLabels": {
    "alertname": "ApplicationDown"
  },
  "commonLabels": {
    "alertname": "ApplicationDown",
    "severity": "critical",
    "instance": "10.0.0.1:8080",
    "job": "web",
    "target": "10.0.0.1"
  },
  "commonAnnotations": {
    "summary": "Web application is down",
    "description": "The web application at instance 10.0.0.1:8080 is not responding."
  },
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "ApplicationDown",
        "severity": "critical",
        "instance": "10.0.0.1:8080",
        "job": "web",
        "target": "10.0.0.1"
      },
      "annotations": {
        "summary": "Web application is down",
        "description": "The web application at instance 10.0.0.1:8080 is not responding."
      },
      "startsAt": "2025-06-19T04:30:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph?g0.expr=up%7Bjob%3D%22web%22%7D+%3D%3D+0",
      "fingerprint": "1234567890abcdef"
    }
  ]
}
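
The service behind the webhook URL receives that JSON via HTTP POST, with one entry per alert in the alerts array. A minimal parsing sketch (the sample values mirror the payload above):

```python
import json

def summarize_alerts(payload: str) -> list:
    """Flatten an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in json.loads(payload).get("alerts", []):
        labels = alert.get("labels", {})
        ann = alert.get("annotations", {})
        lines.append(f"[{alert.get('status')}] {labels.get('alertname')} "
                     f"on {labels.get('instance')}: {ann.get('summary')}")
    return lines

sample = json.dumps({"alerts": [{"status": "firing",
                                 "labels": {"alertname": "ApplicationDown",
                                            "instance": "10.0.0.1:8080"},
                                 "annotations": {"summary": "Web application is down"}}]})
print(summarize_alerts(sample)[0])
# [firing] ApplicationDown on 10.0.0.1:8080: Web application is down
```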

4.2 Startup

Startup script start.sh

bash
#!/bin/bash

nohup /opt/apps/alertmanager/alertmanager \
--config.file=/opt/apps/alertmanager/alertmanager.yml \
> /opt/apps/alertmanager/start.log 2>&1 &

5. Grafana

5.1 Installation

Startup command

bash
nohup /opt/apps/grafana/bin/grafana-server web > /opt/apps/grafana/grafana.log 2>&1 &

The default username and password are both admin.

5.2 Common Dashboard Templates

node: dashboard ID 16098

kafka: dashboard ID 7589

process: dashboard ID 249
