Binary deployment of Prometheus + Grafana + Alertmanager + node_exporter

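Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It was originally developed at SoundCloud and is now part of the Cloud Native Computing Foundation (CNCF). Its key features and concepts are: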

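**Time series database**: Prometheus stores all data as time series, each identified by a metric name and a set of key-value pairs called labels. This design allows high-dimensional data collection and querying.
**Data collection**: Prometheus uses a pull model. It scrapes the configured endpoints at a fixed interval, and those endpoints expose their metrics in a simple text format.
**Query language**: Prometheus provides a powerful query language, PromQL (Prometheus Query Language), for selecting and transforming time series data.
**Alerting**: Alerting is built in. Users define alerting rules as PromQL expressions, and Prometheus sends firing alerts to Alertmanager, which delivers them to notification channels such as email and Slack.
**Visualization**: Prometheus itself offers only basic graphing, but it integrates with tools such as Grafana for building dashboards and visualizing metrics.
**Scalability**: Prometheus is designed to handle large volumes of data and can be scaled horizontally by running multiple instances and combining them through federation.
**Service discovery**: Prometheus can discover scrape targets automatically through service discovery mechanisms such as Kubernetes and Consul, which makes it suitable for dynamic environments.
**Ecosystem**: Prometheus has a rich ecosystem of exporters, components that expose the metrics of other services and systems (databases, web servers, and so on) in a format Prometheus can scrape.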
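
To make the text format and the "metric name plus labels" model concrete, here is a minimal sketch. The sample lines are illustrative (real metric names, made-up values), and `sample_metrics` and `metric_samples` are hypothetical helper names, not part of any Prometheus tool. The parser assumes label values contain no spaces.

```shell
# Illustrative sample in the Prometheus text exposition format.
# The metric names are real; the values are made up for this example.
sample_metrics() {
  cat <<'EOF'
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.0e+09
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.8
EOF
}

# Print "series value" for every sample of one metric name.
# $1 = metric name; the exposition text is read from stdin.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      i = index($1, "{")                # the label set starts at the first brace
      bare = (i ? substr($1, 1, i - 1) : $1)
      if (bare == name) print $1, $NF
    }
  '
}

# Each distinct label set is its own time series:
sample_metrics | metric_samples node_cpu_seconds_total
```

Against a running exporter the same helper could be fed from `curl -s http://localhost:9100/metrics`, assuming node_exporter listens on its default port.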

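Overall, Prometheus is widely used to monitor cloud-native applications and microservices, and gives useful insight into system performance and reliability.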

Download the packages:

# Prometheus download page: https://prometheus.io/download/
wget https://github.com/prometheus/prometheus/releases/download/v2.53.3/prometheus-2.53.3.linux-amd64.tar.gz

# Grafana download page: https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-11.3.1.linux-amd64.tar.gz

# Alertmanager download page: https://prometheus.io/download/#alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz

# node_exporter download page: https://prometheus.io/download/#node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
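
Before unpacking, it may be worth checking each archive against its published SHA-256 digest. The sketch below assumes you have copied the expected digest by hand (the Prometheus project's GitHub release pages list a `sha256sums.txt` file among the assets); `verify_sha256` is a hypothetical helper name.

```shell
# Compare a file's SHA-256 digest with an expected value.
# $1 = file, $2 = expected hex digest (for example taken from sha256sums.txt)
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1" >&2
    return 1
  fi
}
```

Usage would look like `verify_sha256 prometheus-2.53.3.linux-amd64.tar.gz <digest>`, where `<digest>` is the value you copied from the release page.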

Extract:

mkdir -p /usr/local/prometheus
tar xvf prometheus-2.53.3.linux-amd64.tar.gz
mv prometheus-2.53.3.linux-amd64 /usr/local/prometheus/prometheus

tar xvf node_exporter-1.8.2.linux-amd64.tar.gz
mv node_exporter-1.8.2.linux-amd64 /usr/local/prometheus/node_exporter

tar xvf alertmanager-0.27.0.linux-amd64.tar.gz
mv alertmanager-0.27.0.linux-amd64 /usr/local/prometheus/alertmanager

tar xvf grafana-enterprise-11.3.1.linux-amd64.tar.gz
mv grafana-v11.3.1 /usr/local/prometheus/grafana
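
The unit files that follow depend on this exact directory layout, so a quick check after extraction can save a failed service start later. This is a sketch only; `check_layout` is a hypothetical helper, and the relative paths are the ones used elsewhere in this article.

```shell
# Report whether each expected binary exists and is executable under a prefix.
# $1 = install prefix (this article uses /usr/local/prometheus)
check_layout() {
  prefix="$1"
  status=0
  for bin in prometheus/prometheus prometheus/promtool \
             node_exporter/node_exporter \
             alertmanager/alertmanager \
             grafana/bin/grafana-server; do
    if [ -x "$prefix/$bin" ]; then
      echo "ok       $prefix/$bin"
    else
      echo "missing  $prefix/$bin"
      status=1
    fi
  done
  return "$status"
}
```

Run it as `check_layout /usr/local/prometheus`; a non-zero exit status means at least one binary is not where the service files expect it.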

Configure the Prometheus systemd service

cat > /usr/lib/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus/prometheus.yml \
  --storage.tsdb.path=/usr/local/prometheus/prometheus/data \
  --storage.tsdb.retention.time=60d \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target

EOF

Configure the Grafana systemd service

cat > /usr/lib/systemd/system/grafana.service << 'EOF'
[Unit]
Description=Grafana server
Documentation=http://docs.grafana.org
[Service]
Type=simple
User=prometheus
Group=prometheus
Restart=on-failure
ExecStart=/usr/local/prometheus/grafana/bin/grafana-server \
  --config=/usr/local/prometheus/grafana/conf/defaults.ini \
  --homepath=/usr/local/prometheus/grafana
[Install]
WantedBy=multi-user.target

EOF

Configure the Alertmanager systemd service

cat > /usr/lib/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/prometheus/alertmanager/alertmanager \
  --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml \
  --storage.path=/usr/local/prometheus/alertmanager/data
Restart=always

[Install]
WantedBy=multi-user.target

EOF

Configure the node_exporter systemd service

cat > /usr/lib/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/prometheus/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

EOF
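
A common reason for a unit failing to start is an `ExecStart=` path that does not match where the binary was actually unpacked. The following sketch pulls the binary path out of each unit file and checks it; `unit_exec_path` and `check_units` are hypothetical helpers, and the parsing assumes the binary is the first word after `ExecStart=`, as in the four files above.

```shell
# Print the binary named on the first ExecStart= line of a unit file.
# $1 = path to a .service file
unit_exec_path() {
  sed -n 's/^ExecStart=\([^ \\]*\).*/\1/p' "$1" | head -n 1
}

# Check that the binary referenced by each given unit file is executable.
check_units() {
  status=0
  for unit in "$@"; do
    bin=$(unit_exec_path "$unit")
    if [ -x "$bin" ]; then
      echo "ok       $unit -> $bin"
    else
      echo "missing  $unit -> $bin"
      status=1
    fi
  done
  return "$status"
}
```

For example: `check_units /usr/lib/systemd/system/prometheus.service /usr/lib/systemd/system/grafana.service`. On systems that ship it, `systemd-analyze verify` performs a more thorough check of the same files.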

Create the service user and enable the services at boot

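# Create a system account without a home directory or login shell.
useradd --no-create-home --shell /bin/false prometheus
# Alternative (run only one of the two useradd commands):
# useradd -M -s /usr/sbin/nologin prometheus

chown -R prometheus:prometheus /usr/local/prometheus
systemctl daemon-reload

systemctl enable --now prometheus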
systemctl status prometheus

systemctl enable --now alertmanager
systemctl status alertmanager

systemctl enable --now node_exporter
systemctl status node_exporter

systemctl enable --now grafana
systemctl status grafana
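
`systemctl status` only shows that a process is running. To confirm that each service also answers HTTP requests, a probe along these lines could be used. It is a sketch that assumes the default ports and a local installation; `health_url` and `check_health` are hypothetical helper names. The paths `/-/healthy` (Prometheus, Alertmanager) and `/api/health` (Grafana) are the health endpoints those projects provide, and node_exporter is probed through `/metrics`.

```shell
# Map a service name to the URL used as its liveness probe.
health_url() {
  case "$1" in
    prometheus)    echo "http://localhost:9090/-/healthy" ;;
    alertmanager)  echo "http://localhost:9093/-/healthy" ;;
    node_exporter) echo "http://localhost:9100/metrics" ;;
    grafana)       echo "http://localhost:3000/api/health" ;;
    *)             return 1 ;;
  esac
}

# Probe all four services and print one line per service.
check_health() {
  status=0
  for svc in prometheus alertmanager node_exporter grafana; do
    if curl -fsS -o /dev/null --max-time 5 "$(health_url "$svc")"; then
      echo "up    $svc"
    else
      echo "down  $svc"
      status=1
    fi
  done
  return "$status"
}
```

Calling `check_health` a few seconds after `systemctl enable --now` gives the services time to bind their ports.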

Ports:

root@u22pro:~# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      677/systemd-resolve
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      36805/sshd: /usr/sb
tcp6       0      0 :::3000                 :::*                    LISTEN      104697/grafana
tcp6       0      0 :::22                   :::*                    LISTEN      36805/sshd: /usr/sb
tcp6       0      0 :::9100                 :::*                    LISTEN      104555/node_exporte
tcp6       0      0 :::9090                 :::*                    LISTEN      104520/prometheus
tcp6       0      0 :::9093                 :::*                    LISTEN      104539/alertmanager
tcp6       0      0 :::9094                 :::*                    LISTEN      104539/alertmanager
root@u22pro:~#


3000 - Grafana web UI
9100 - node_exporter metrics endpoint (scraped by Prometheus)
9090 - Prometheus web UI and HTTP API
9093 - Alertmanager web UI and API
9094 - Alertmanager cluster (gossip) port
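
Rather than reading the listing by eye, the expected ports can be checked mechanically. This sketch reads `netstat -tnlp` (or `ss -tnl`) output on stdin and prints every expected port that has no LISTEN entry. `missing_ports` is a hypothetical helper, and it assumes the local address is the fourth column, as in the output above.

```shell
# Print each expected port that is not in LISTEN state.
# stdin = socket listing, arguments = port numbers to look for
missing_ports() {
  listing=$(cat)
  for port in "$@"; do
    printf '%s\n' "$listing" |
      awk -v p="$port" '
        /LISTEN/ && $4 ~ (":" p "$") { found = 1 }   # local address ends in :port
        END { exit !found }
      ' ||
      echo "$port"
  done
}
```

For example, `netstat -tnlp | missing_ports 3000 9090 9093 9094 9100` prints nothing when all five ports are open.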

prometheus.yml configuration

root@u22pro:/usr/local/prometheus/prometheus# cat prometheus.yml | grep -v '#'
global:

alerting:
  alertmanagers:
    - static_configs:
        - targets:

rule_files:
  - "alert.yml"

scrape_configs:
  - job_name: "prometheus"

    static_configs:
      - targets: ["localhost:9090"]

  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
    - targets: ['localhost:9100']
      labels:
        instance: Prometheus server
    - targets: ['192.168.50.5:9100']
      labels:
        instance: linux-192.168.50.5
    - targets: ['192.168.50.6:9100']
      labels:
        instance: linux-192.168.50.6
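
One thing to watch in the listing above: the stock `prometheus.yml` ships with its Alertmanager target commented out, so after `grep -v '#'` the `targets:` key under `alerting:` is empty. With an empty list, rules still evaluate but no alert reaches Alertmanager. Assuming Alertmanager runs on the same host on port 9093, as in this setup, that line would need to read `- targets: ["localhost:9093"]`. The sketch below is a rough heuristic check, not a YAML parser. `alerting_targets` is a hypothetical helper and only recognizes targets written inline on the `targets:` line.

```shell
# Print every non-empty inline target list found in the top-level
# "alerting:" section of a Prometheus config file.
# $1 = path to prometheus.yml
alerting_targets() {
  awk '
    /^[^ #]/ { in_alerting = ($0 ~ /^alerting:/) }       # a top-level key starts here
    in_alerting && /targets:[ \t]*[^ \t#]/ { print }     # targets: followed by a value
  ' "$1"
}
```

If `alerting_targets /usr/local/prometheus/prometheus/prometheus.yml` prints nothing, no Alertmanager address is configured in that form.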

alert.yml

root@u22pro:/usr/local/prometheus/prometheus# cat alert.yml
groups:
- name: Prometheus alert
  rules:
  # Fire when any scrape target has been unreachable for more than 30s.
  - alert: ServiceDown
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      instance: "{{ $labels.instance }}"
      description: "Service {{ $labels.job }} is down"


- name: Server resource monitoring
  rules:
  - alert: HighMemoryUsage
    expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80 # adjust the metric and threshold to what you actually monitor; the same applies to the other rules
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }}: memory usage is too high, please investigate"
      description: "{{ $labels.instance }}: memory usage is above 80%, currently {{ $value }}%."

  - alert: ServerDown
    expr: up == 0
    for: 1s
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: server is down, please investigate"
      description: "{{$labels.instance}}: scrape target is unreachable, current value of up is {{ $value }}."

  - alert: HighCpuLoad
    expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: CPU usage is too high, please investigate"
      description: "{{$labels.instance}}: CPU usage is above 90%, currently {{ $value }}%."

  - alert: DiskIOBusy
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance,job)* 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: disk I/O utilization is too high, please investigate"
      description: "{{$labels.instance}}: disk I/O utilization is above 90%, currently {{ $value }}%."


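  - alert: NetworkReceiveHigh
    # 10240000 bytes/s is roughly 82 Mbit/s; adjust the threshold to your link speed.
    expr: sum by (instance,job) (rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) > 10240000
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: inbound network bandwidth is too high, please investigate"
      description: "{{$labels.instance}}: inbound traffic has stayed above 10240000 bytes/s (about 82 Mbit/s) for 5 minutes. Current RX rate: {{ $value }} bytes/s."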

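  - alert: NetworkTransmitHigh
    # 10240000 bytes/s is roughly 82 Mbit/s; adjust the threshold to your link speed.
    expr: sum by (instance,job) (rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) > 10240000
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: outbound network bandwidth is too high, please investigate"
      description: "{{$labels.instance}}: outbound traffic has stayed above 10240000 bytes/s (about 82 Mbit/s) for 5 minutes. Current TX rate: {{ $value }} bytes/s."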

  - alert: TcpConnectionsHigh
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: too many established TCP connections"
      description: "{{$labels.instance}}: more than 10000 TCP connections are in the ESTABLISHED state, currently {{ $value }}."

  - alert: DiskSpaceLow
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.mountpoint}}: disk partition usage is too high, please investigate"
      description: "{{$labels.instance}}: disk partition usage is above 90%, currently {{ $value }}%."
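
The two network rules fire once the summed byte rate exceeds 102400 × 100 = 10,240,000 bytes per second, which is somewhat below a 100 Mbit/s link. The arithmetic is easy to get wrong because the metrics are in bytes while link speeds are quoted in bits, so here is a small worked sketch. `bytes_to_mbit` and `mbit_to_bytes` are hypothetical helpers using 1 Mbit = 1,000,000 bits.

```shell
# Convert a bytes-per-second value to whole megabits per second
# (integer division, so the result is truncated).
bytes_to_mbit() {
  echo $(( $1 * 8 / 1000000 ))
}

# Convert megabits per second to bytes per second.
mbit_to_bytes() {
  echo $(( $1 * 1000000 / 8 ))
}

bytes_to_mbit $((102400 * 100))   # prints 81
mbit_to_bytes 100                 # prints 12500000
```

So a rule meant to fire at 100 Mbit/s would compare the byte rate against 12500000. After editing the file, `./promtool check rules alert.yml` validates the rule syntax.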

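After changing the configuration, reload it manually (this endpoint is available because the service is started with --web.enable-lifecycle):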

curl -X POST http://localhost:9090/-/reload

Check that the Prometheus configuration file is valid:

./promtool check config prometheus.yml
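
Validation and reload can be chained so that a broken file is never sent to the running server. The sketch below assumes the install paths used in this article; `safe_reload` is a hypothetical helper, and the `PROMTOOL` and `PROM_URL` environment variables are conveniences introduced here, not settings Prometheus itself reads.

```shell
# Validate a config file with promtool and reload Prometheus only on success.
# $1 = path to prometheus.yml
safe_reload() {
  promtool_bin="${PROMTOOL:-/usr/local/prometheus/prometheus/promtool}"
  prom_url="${PROM_URL:-http://localhost:9090}"
  if "$promtool_bin" check config "$1"; then
    # Requires Prometheus to be started with --web.enable-lifecycle.
    curl -fsS -X POST "$prom_url/-/reload"
  else
    echo "config check failed; not reloading" >&2
    return 1
  fi
}
```

For example: `safe_reload /usr/local/prometheus/prometheus/prometheus.yml`.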

Open the Grafana web UI and add a Prometheus data source.
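
As an alternative to clicking through the UI, Grafana can load data sources from provisioning files at startup. The following is a sketch under assumptions: it presumes the provisioning directory is `conf/provisioning` under the Grafana home path (the default in `defaults.ini`) and that Prometheus is reachable at `http://localhost:9090`. `write_datasource` is a hypothetical helper name.

```shell
# Write a data source provisioning file for Grafana.
# $1 = provisioning directory, $2 = Prometheus base URL
write_datasource() {
  mkdir -p "$1/datasources"
  cat > "$1/datasources/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $2
    isDefault: true
EOF
}
```

For this installation the call would be `write_datasource /usr/local/prometheus/grafana/conf/provisioning http://localhost:9090`, followed by `systemctl restart grafana` so the file is picked up.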

Add a node_exporter dashboard.

Grafana server monitoring dashboard: