Prometheus Full-Stack Monitoring: Deployment and Usage Guide

Date: 2026-05-30

Environment: 4 × Huawei Cloud ECS, Hong Kong region (Ubuntu 24.04.4 LTS, kernel 6.8.0-106)

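[1. Overall Architecture](#1. Overall Architecture)
[2. Environment](#2. Environment)
[3. Node Exporter Deployment (Data Collection Layer)](#3. Node Exporter Deployment (Data Collection Layer))
[4. Prometheus Server Deployment (Core Collection Layer)](#4. Prometheus Server Deployment (Core Collection Layer))
[5. Alertmanager Deployment (Alert Routing Layer)](#5. Alertmanager Deployment (Alert Routing Layer))
[6. prometheus-webhook-dingtalk Deployment (DingTalk Integration)](#6. prometheus-webhook-dingtalk Deployment (DingTalk Integration))
[7. Grafana Deployment (Visualization Layer)](#7. Grafana Deployment (Visualization Layer))
[8. Alert Rule Configuration](#8. Alert Rule Configuration)
[9. DingTalk Alert Verification and Stress Tests](#9. DingTalk Alert Verification and Stress Tests)
[10. Grafana Dashboard Configuration](#10. Grafana Dashboard Configuration)
[11. PromQL Query Reference](#11. PromQL Query Reference)
[12. Day-to-Day Operations Command Reference](#12. Day-to-Day Operations Command Reference)
[13. Error Summary (15 Items)](#13. Error Summary (15 Items))
[14. Current Runtime Data](#14. Current Runtime Data)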

1. Overall Architecture

1.1 Three-Layer Monitoring Architecture

┌─────────────────────────────────────────────────────────┐
│            Visualization and Alerting Layer             │
│  ┌──────────────┐         ┌──────────────────────────┐  │
│  │ Grafana:3000 │         │    DingTalk group bot    │  │
│  │  22 panels   │         │   real-time alert push   │  │
│  └──────┬───────┘         └────────▲─────────────────┘  │
├─────────┼──────────────────────────┼────────────────────┤
│         │ Core collection/alerting │                    │
│  ┌──────▼──────┐ ┌────────────┐ ┌──┴───────────────┐    │
│  │ Prometheus  │→│Alertmanager│→│ Webhook DingTalk │    │
│  │   :9090     │ │   :9093    │ │   :8060          │    │
│  │scrape + TSDB│ │  routing   │ │ HMAC-SHA256 sign │    │
│  └──────▲──────┘ └────────────┘ └──────────────────┘    │
├─────────┼───────────────────────────────────────────────┤
│         │        Data Collection Layer                  │
│  ┌──────┴──────────────────────────────────────────┐    │
│  │  Node Exporter :9100 (×4 Huawei Cloud ECS)      │    │
│  │  CPU · memory · disk · network · system load    │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘

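1.2 Alert Data Flow

Prometheus rule evaluation → alert firing → Alertmanager grouping/deduplication
→ webhook POST → webhook-dingtalk HMAC signing → DingTalk API → group message

Key timing (with `for` set to 2m or 5m and `group_wait` set to 30s):

t=0          t=2m-5m         t=2m+30s        t=2m+30s+
Pending  →   Firing      →   group_wait  →   DingTalk push
(detected)   (for elapsed)   (first batch)   (message received)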
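
The timeline above reduces to simple arithmetic. The following is a minimal sketch, assuming the delay is just `for` plus `group_wait`; it ignores scrape and evaluation intervals and network latency, and the function name is hypothetical rather than part of any Prometheus API.

```python
from datetime import timedelta


def first_notification_delay(for_duration: timedelta, group_wait: timedelta) -> timedelta:
    """Approximate time from first detection (Pending) to the first DingTalk push.

    Simplification: the alert must stay active for `for_duration` before it
    fires, and Alertmanager then waits `group_wait` before sending the first
    notification of a new group.
    """
    return for_duration + group_wait


# Values taken from this guide: `for` of 2m or 5m, default group_wait of 30s.
for label, for_duration in (("for: 2m", timedelta(minutes=2)), ("for: 5m", timedelta(minutes=5))):
    delay = first_notification_delay(for_duration, timedelta(seconds=30))
    print(f"{label} -> first push after about {int(delay.total_seconds())}s")
```

With the values used in this guide, the first push arrives roughly 150 s (2m rules) or 330 s (5m rules) after the condition is first detected.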

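1.3 Component Versions

Component	Version	Install method	Notes
Prometheus Server	2.45.3	APT (Ubuntu 24.04 repository)	Debian-packaged build
Alertmanager	0.26.0	APT (Ubuntu 24.04 repository)	Binary: `prometheus-alertmanager`
Node Exporter	1.7.0	APT (Ubuntu 24.04 repository)	Binary: `prometheus-node-exporter`
Grafana	13.0.1	APT (Alibaba Cloud mirror)	Standalone `grafana-server` / `grafana-cli` binaries are deprecated in favor of `grafana <subcommand>`
prometheus-webhook-dingtalk	2.1.0	Downloaded locally, uploaded via SFTP	Single Go binary (~19 MB)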

2. Environment

2.1 Servers

Node	Public IP	Private IP	Role	Components
prom	120.46.81.16	192.168.0.228	Monitoring master	Prometheus + Alertmanager + Grafana + webhook-dingtalk + Node Exporter
node-1	120.46.209.49	192.168.0.207	Monitored node	Node Exporter
node-2	119.3.252.64	192.168.0.238	Monitored node	Node Exporter
node-3	1.92.79.147	192.168.0.152	Monitored node	Node Exporter

2.2 Ports and Services

Port	Service	Node	Access
3000	Grafana Web UI	prom	`http://120.46.81.16:3000`
9090	Prometheus API + UI	prom	`http://120.46.81.16:9090`
9093	Alertmanager UI	prom	`http://120.46.81.16:9093`
8060	DingTalk Webhook	prom (used from localhost only)	Not exposed externally
9100	Node Exporter	All 4 nodes	Private network, 192.168.0.x:9100

2.3 SSH Management Tools

Batch SSH tooling written in Python with paramiko, run from the local workstation:

bash

# D:\tools\ssh_exec.py: run a command on a single node
python ssh_exec.py prom "systemctl status prometheus"

# D:\tools\sftp_upload.py: upload a file over SFTP
python sftp_upload.py prom local_file /remote/path/

3. Node Exporter Deployment (Data Collection Layer)

3.1 Installation

bash

# Install directly from APT on Ubuntu 24.04
apt update
apt install -y prometheus-node-exporter

# Verify the version
prometheus-node-exporter --version
# node_exporter, version 1.7.0 (branch: debian/sid, revision: 1.7.0-1ubuntu0.3)

3.2 Service Management

bash

# Enable at boot and start
systemctl enable prometheus-node-exporter
systemctl start prometheus-node-exporter

# Verify the listener
ss -tlnp | grep 9100
# LISTEN 0 4096  *:9100  *:*  users:(("prometheus-node",pid=8214,fd=3))

# Verify metrics output
curl -s http://localhost:9100/metrics | head -5
# # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# # TYPE node_cpu_seconds_total counter
# node_cpu_seconds_total{cpu="0",mode="idle"} 22456.72

3.3 Batch Deployment

Repeat the steps above on all 4 servers. The SSH tool can run them in one pass:

bash

# Install on all nodes
python ssh_exec.py --all "apt install -y prometheus-node-exporter && systemctl enable --now prometheus-node-exporter"

4. Prometheus Server Deployment (Core Collection Layer)

4.1 Installation

bash

apt install -y prometheus

# Verify
prometheus --version
# prometheus, version 2.45.3+ds (branch: debian/sid, revision: 2.45.3+ds-2ubuntu0.3)

4.2 Configuration File (`/etc/prometheus/prometheus.yml`)

yaml

global:
  scrape_interval: 15s        # scrape every 15 seconds
  evaluation_interval: 15s    # evaluate rules every 15 seconds

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter on every node
  - job_name: 'node'
    static_configs:
      - targets:
        - '192.168.0.228:9100'   # prom (the master node itself)
        - '192.168.0.207:9100'   # node-1
        - '192.168.0.238:9100'   # node-2
        - '192.168.0.152:9100'   # node-3

Design decision: targets use private IPs because the Huawei Cloud ECS instances can reach each other over the private network, which avoids consuming public bandwidth.

4.3 Service Management

bash

systemctl enable prometheus
systemctl start prometheus

# Check status
systemctl status prometheus --no-pager -l
ss -tlnp | grep 9090

4.4 Verifying Scrape Targets

bash

curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool

# Expected: 5/5 targets UP (Prometheus itself + 4 Node Exporters)
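
Reading the full JSON by eye gets tedious, so a small summary helps. The sketch below counts targets by health; `summarize_targets` is a hypothetical helper, and `SAMPLE` is a hand-written payload shaped like the `/api/v1/targets` response (`data.activeTargets[].health`), not captured output.

```python
from collections import Counter


def summarize_targets(payload: dict) -> Counter:
    """Count active scrape targets by their health value ("up", "down", "unknown")."""
    return Counter(target["health"] for target in payload["data"]["activeTargets"])


# Hand-written sample mirroring the five targets configured in prometheus.yml.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"}, "health": "up"},
            {"labels": {"job": "node", "instance": "192.168.0.228:9100"}, "health": "up"},
            {"labels": {"job": "node", "instance": "192.168.0.207:9100"}, "health": "up"},
            {"labels": {"job": "node", "instance": "192.168.0.238:9100"}, "health": "up"},
            {"labels": {"job": "node", "instance": "192.168.0.152:9100"}, "health": "up"},
        ]
    },
}

summary = summarize_targets(SAMPLE)
print(f"{summary['up']}/{sum(summary.values())} targets UP")  # prints "5/5 targets UP"
```

To use it against a live server, the payload could be read with `json.load(sys.stdin)` from the `curl` command above instead of the sample.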

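5. Alertmanager Deployment (Alert Routing Layer)

5.1 Installation

bash

apt install -y prometheus-alertmanager

# Verify
prometheus-alertmanager --version
# alertmanager, version 0.26.0+ds (branch: debian/sid, revision: 0.26.0+ds-1ubuntu0.3)

Note: the binary installed by APT on Ubuntu 24.04 is named prometheus-alertmanager, not alertmanager. The packaged systemd unit already references the correct path.

5.2 Configuration File (`/etc/prometheus/alertmanager.yml`)

yaml

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'instance']
  group_wait: 30s          # wait before the first notification of a new group
  group_interval: 5m       # minimum interval between notifications when a group changes
  repeat_interval: 4h      # re-send interval while an alert keeps firing

  routes:
    # Catch-all route: no matchers, so it matches every alert.
    # `continue: true` lets evaluation proceed to the next sibling route,
    # so a critical alert matches both routes and is notified once per route.
    - receiver: 'dingtalk'
      continue: true

    # Extra handling for critical alerts
    - matchers:
        - severity="critical"
      receiver: 'dingtalk'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/dingtalk/send'
        send_resolved: true

  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/dingtalk/send'
        send_resolved: true

# Inhibition: while a critical alert fires, suppress warnings on the same instance.
# `equal` lists only `instance`: the critical and warning rules have different
# alertnames, so including `alertname` here would prevent the rule from ever matching.
inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [instance]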

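5.3 Key Parameters

Parameter	Value	Meaning
`group_wait`	30s	After the first alert of a group arrives, wait 30s to collect other alerts of the same group and send them together
`group_interval`	5m	Minimum interval between notifications when an existing group changes
`repeat_interval`	4h	Re-send every 4h while the same alert remains unresolved
`send_resolved`	true	Also send a notification when the alert resolves

5.4 Service Management

bash

systemctl enable prometheus-alertmanager
systemctl start prometheus-alertmanager
ss -tlnp | grep 9093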

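6. prometheus-webhook-dingtalk Deployment (DingTalk Integration)

6.1 Why This Component Is Needed

Alertmanager's native webhook payload is not compatible with the message format expected by DingTalk robots. prometheus-webhook-dingtalk sits in between and handles:

Format conversion: Alertmanager webhook JSON → DingTalk markdown message
HMAC signing: computes an HMAC-SHA256 signature from the request timestamp and the robot secret (not from the message body), as required by DingTalk's "sign" security setting
Template rendering: message templates can be customized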
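
For reference, the signing step can be reproduced in a few lines. This is a sketch of the scheme described in DingTalk's custom-robot documentation (sign the string "timestamp\nsecret" with the secret as the HMAC key, then Base64- and URL-encode the digest); it is not the component's actual source code, and the function names, URL, and secret below are placeholders.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """Return the URL-encoded signature for a given millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url: str, secret: str, timestamp_ms=None) -> str:
    """Append the timestamp and sign query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    sign = dingtalk_sign(secret, timestamp_ms)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


# Placeholder values for illustration only.
print(signed_url("https://oapi.dingtalk.com/robot/send?access_token=TOKEN", "SECexample"))
```

Because the signature depends on the current timestamp, a freshly signed URL has to be generated for each request, which is one reason to delegate this to the webhook component rather than hard-code a URL in Alertmanager.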

6.2 Installation (Local Download + SFTP Upload)

Because GitHub Releases are unreliable to reach from the Hong Kong servers, the binary is downloaded locally first and then uploaded:

bash

# 1. Download locally (Windows)
# From https://github.com/timonwong/prometheus-webhook-dingtalk/releases
# download prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

# 2. Extract, then upload over SFTP
python sftp_upload.py prom prometheus-webhook-dingtalk /opt/prometheus-webhook-dingtalk/

# 3. Set permissions on the server
chmod +x /opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk

6.3 Configuration File (`/opt/prometheus-webhook-dingtalk/config.yml`)

yaml

targets:
  dingtalk:
    url: "https://oapi.dingtalk.com/robot/send?access_token=090fba13440082ce1aab6f43a2cc47a7b9866521ddf94c7e9502ac100b87280c"
    secret: "SEC7acf1b961d10d3f476f2e9a793148482d75b627b61913c44833b49cc8c35e3a2"

6.4 systemd Unit (`/etc/systemd/system/prometheus-webhook-dingtalk.service`)

ini

[Unit]
Description=Prometheus Webhook DingTalk
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk
After=network.target

[Service]
Type=simple
User=root
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk \
  --config.file=/opt/prometheus-webhook-dingtalk/config.yml \
  --web.listen-address=:8060
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

6.5 Startup and Verification

bash

systemctl daemon-reload
systemctl enable --now prometheus-webhook-dingtalk

# Health check
curl -s http://localhost:8060/-/healthy
# Healthy

6.6 Webhook URL Path

In v2.1.0 the webhook path is /dingtalk/dingtalk/send (not /dingtalk/webhook/dingtalk). The route format changed in v2.x:

yaml

# Correct ✅ (v2.1.0)
url: 'http://localhost:8060/dingtalk/dingtalk/send'

# Wrong ❌ (old path, returns 404)
url: 'http://localhost:8060/dingtalk/webhook/dingtalk'

7. Grafana Deployment (Visualization Layer)

7.1 Choosing an APT Mirror

Downloading the 87 MB deb package from the official source apt.grafana.com ran at only 23 KB/s from Hong Kong and would have taken over an hour. The Alibaba Cloud mirror is used instead:

bash

# Add the Alibaba Cloud Grafana mirror
cat > /etc/apt/sources.list.d/grafana.list << 'EOF'
deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://mirrors.aliyun.com/grafana/apt stable main
EOF

# Import the GPG key
curl -fsSL https://mirrors.aliyun.com/grafana/apt/gpg.key \
  | gpg --dearmor > /etc/apt/keyrings/grafana.gpg

# Install
apt update && apt install -y grafana

7.2 Startup and Password Reset

bash

systemctl enable --now grafana-server

# grafana-cli is deprecated in Grafana v13
# Wrong ❌
grafana-cli admin reset-admin-password admin123

# Correct ✅
grafana cli --homepath /usr/share/grafana admin reset-admin-password admin123

7.3 Access Details

Item	Value
URL	`http://120.46.81.16:3000`
Username	`admin`
Password	`admin123`

8. Alert Rule Configuration

8.1 Rule File (`/etc/prometheus/rules/node_alerts.yml`)

yaml

groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is DOWN"
          description: "Node {{ $labels.instance }} has been down for > 2min"

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory > 85% on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk < 10% on {{ $labels.instance }}"
          description: "Disk available is {{ $value }}% on {{ $labels.instance }}"

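8.2 Rule Quick Reference

Alert	Expression	Threshold	for	Severity
InstanceDown	`up==0`	Node offline	2m	critical
HighCPUUsage	CPU usage > 80%	80%	5m	warning
HighMemoryUsage	Memory usage > 85%	85%	5m	warning
DiskSpaceLow	Disk available < 10%	10%	2m	critical

8.3 Purpose of the `for` Parameter

`for` is the minimum time the condition must hold continuously before the alert fires. For example, with CPU > 80% and for: 5m:

t=0: CPU jumps to 85% → Pending (not fired)
t=2m: still 83% → remains Pending
t=5m: still above the threshold → Firing → sent to Alertmanager

If the value drops back below the threshold at any point before t=5m, the alert returns to inactive and the timer restarts. This filters out short spikes and avoids false alarms.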
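
The Pending/Firing behaviour can be illustrated with a small simulation. This is a simplified model, assuming one sample per rule evaluation and a fixed evaluation interval; `alert_states` is a hypothetical function, not Prometheus code.

```python
def alert_states(samples, threshold, for_seconds, eval_interval):
    """Return the alert state at each evaluation for a `value > threshold` rule.

    samples: metric values observed at successive evaluations.
    The alert is "pending" while the condition holds but has not yet held for
    `for_seconds`, "firing" once it has, and "inactive" otherwise.
    """
    states = []
    active_since = None  # time at which the condition started holding
    for index, value in enumerate(samples):
        now = index * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any dip resets the timer
            states.append("inactive")
    return states


# One sample per minute, rule: CPU > 80 for 5m.
sustained = [85, 83, 83, 84, 82, 81]
spike = [85, 83, 40, 85, 83, 82]
print(alert_states(sustained, 80, 300, 60))
print(alert_states(spike, 80, 300, 60))
```

The sustained series reaches firing at the sixth evaluation (t=5m), while the series with a dip at t=2m never fires within the same period because the timer restarts.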

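8.4 Applying the Configuration

bash

# Option 1: hot-reload by sending SIGHUP.
# -x matches only the process named exactly "prometheus"; a bare `pgrep prometheus`
# would also match prometheus-node-exporter, prometheus-alertmanager, and so on.
kill -HUP "$(pgrep -x prometheus)"

# Option 2: reload through systemd
systemctl reload prometheus

# Verify that the rules were loaded
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool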

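9. DingTalk Alert Verification and Stress Tests

9.1 Test Strategy

To verify the full alert path Prometheus → Alertmanager → webhook-dingtalk → DingTalk group, three resource stress tests were run.

9.2 Preparation

Lower `for` to 5-10 seconds so that alerts trigger quickly:

yaml

# Temporary test rule (restore after testing)
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[30s])) * 100) > 80
  for: 5s    # shortened wait

9.3 CPU Stress Test

bash

# Run on node-3
apt install -y stress-ng
stress-ng --cpu 2 --timeout 180s

# Result: CPU 100% → HighCPUUsage fired → DingTalk notification, HTTP 200 ✅

9.4 Memory Stress Test

bash

# Allocate 3 GB in Python (the server has 3.3 GB in total)
python3 -c "
data = bytearray(3 * 1024 * 1024 * 1024)  # 3GB
import time; time.sleep(120)
"

# Result: memory 94% → HighMemoryUsage fired → DingTalk notification ✅

Lesson: a plain stress-ng --vm run allocating 2.5 GB only reached 74% and did not cross the 85% threshold. Python was used to control the allocated amount precisely.

9.5 Disk Stress Test

bash

dd if=/dev/zero of=/tmp/bigfile bs=1M count=5000 status=progress
# Wrote 5 GB → disk available dropped to 74% → DiskSpaceLow fired
# (threshold temporarily raised to < 80% available for the test) → DingTalk notification ✅

# Clean up
rm -f /tmp/bigfile

9.6 Test Results

Alert	Trigger value	Threshold	DingTalk notification	HTTP status
HighCPUUsage	100%	>80%	✅ Received	200
HighMemoryUsage	94%	>85%	✅ Received	200
DiskSpaceLow	74% available	<80% available (test)	✅ Received	200
HighCPUUsage (Resolved)	0.2%	-	✅ Resolved notice	200
HighMemoryUsage (Resolved)	12%	-	✅ Resolved notice	200
DiskSpaceLow (Resolved)	86% available	-	✅ Resolved notice	200

All 8 DingTalk notifications returned HTTP 200; the alert path was verified end to end.

9.7 Restoring the Production Rules

Restore immediately after testing:

bash

# Restore the original rule file
# (assumes the production rules were saved as node_alerts_prod.yml before testing)
cp /etc/prometheus/rules/node_alerts_prod.yml /etc/prometheus/rules/node_alerts.yml
systemctl reload prometheus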

10. Grafana Dashboard Configuration

10.1 Creating the Prometheus Data Source

bash

curl -s -u admin:admin123 -X POST http://localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

Response (json):

{
  "datasource": {
    "id": 1,
    "uid": "bfnmdr6ol0hz4b",
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "isDefault": true
  },
  "message": "Datasource added"
}
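
The same request can be issued from Python, which fits the Python-based tooling used elsewhere in this guide. The sketch below only builds the request with the standard library so it can be inspected before sending; `build_datasource_request` is a hypothetical helper, and the credentials are the ones shown in section 7.3.

```python
import base64
import json
import urllib.request


def build_datasource_request(base_url, user, password, prometheus_url):
    """Build (but do not send) the POST /api/datasources request for a Prometheus source."""
    body = json.dumps({
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",
        "isDefault": True,
    }).encode("utf-8")
    # HTTP Basic authentication: base64("user:password")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/datasources",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
    )


request = build_datasource_request("http://localhost:3000", "admin", "admin123", "http://localhost:9090")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` on the prom node would perform the actual call.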

10.2 Importing the Dashboard

bash

curl -s -u admin:admin123 -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @/root/grafana_dashboard_node_exporter.json

Response (json):

{
  "id": 2545816398360576,
  "uid": "node-exporter-full",
  "url": "/d/node-exporter-full/9a32b53",
  "status": "success"
}

10.3 Dashboard Panel Layout

Row	Section	Panels (left to right)
1	Node overview	Online nodes (Stat) · Alert count (Stat) · Scrape targets (Stat) · Prometheus uptime (Stat)
2	CPU	CPU usage line chart (TimeSeries; thresholds 60% yellow, 80% red) · Gauge (current value)
3	Memory	Usage line chart (70% yellow, 85% red) · Gauge · Memory details (Total/Avail)
4	Disk	Root partition usage (70% yellow, 85% red) · Gauge · Disk space details (Total/Avail)
5	Network	Receive rate (Mbps; eth0, excluding lo) · Transmit rate (Mbps; eth0, excluding lo)
6	System load	Load 1m / 5m / 15m (per-node trend) · Node uptime (Stat)

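10.4 Panel Types

Type	Count	Purpose
`row`	6	Collapsible logical sections
`stat`	6	Key-number cards
`timeseries`	8	Time-series line charts
`gauge`	4	Instant-value gauges

10.5 Threshold Color Scheme (Aligned with the Alert Rules)

Level	CPU	Memory	Disk	Color
Normal	0-60%	0-70%	0-70% used (≥30% available)	Green
Warning	60-80%	70-85%	15-30% available	Yellow
Critical	80%+	85%+	<15% available	Red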
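
The color scheme can be expressed as a small lookup. This is an illustrative sketch only: the function and dictionary names are hypothetical, and the disk thresholds are written as usage percentages (70% and 85% used), which is equivalent to the 30% and 15% availability bounds in the table.

```python
# (warning_threshold, critical_threshold) in percent used, per the table above.
THRESHOLDS = {
    "cpu": (60, 80),
    "memory": (70, 85),
    "disk": (70, 85),  # 70% used = 30% available, 85% used = 15% available
}


def color_for(resource: str, used_percent: float) -> str:
    """Map a usage percentage to the dashboard color for that resource."""
    warning, critical = THRESHOLDS[resource]
    if used_percent >= critical:
        return "red"
    if used_percent >= warning:
        return "yellow"
    return "green"


print(color_for("cpu", 0.6), color_for("memory", 94), color_for("disk", 75))
```

Note that the dashboard turns red for disk at 85% used, slightly earlier than the DiskSpaceLow alert, which fires only when less than 10% is available.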

10.6 Access Points

Item	Address
Grafana home	`http://120.46.81.16:3000`
Dashboard direct link	`http://120.46.81.16:3000/d/node-exporter-full`
Prometheus UI	`http://120.46.81.16:9090`
Alertmanager UI	`http://120.46.81.16:9093`

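11. PromQL Query Reference

11.1 CPU

promql

# CPU usage (%), derived from idle time
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle",job="node"}[5m])) * 100)

# Why rate() instead of the raw value?
# node_cpu_seconds_total is a counter (monotonically increasing).
# rate() gives the per-second increase; for mode="idle" that is the idle fraction of each core,
# so multiplying by 100 yields the idle percentage and 100 minus it yields usage.
# [5m] window: uses the last 5 minutes of data and smooths short fluctuations.

11.2 Memory

promql

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"})) * 100

# Why MemAvailable instead of MemFree?
# MemFree does not count reclaimable page cache and buffers as free.
# MemAvailable is the kernel's estimate of memory that is actually available
# (it matches the "available" column of free -h).

# Total memory (GB)
node_memory_MemTotal_bytes{job="node"} / 1024 / 1024 / 1024

# Available memory (GB)
node_memory_MemAvailable_bytes{job="node"} / 1024 / 1024 / 1024

# Used memory (GB)
(node_memory_MemTotal_bytes{job="node"} - node_memory_MemAvailable_bytes{job="node"}) / 1024 / 1024 / 1024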

11.3 Disk

promql

# Disk usage (%), root partition only
(1 - (
  node_filesystem_avail_bytes{job="node",mountpoint="/",fstype!="rootfs"} /
  node_filesystem_size_bytes{job="node",mountpoint="/",fstype!="rootfs"}
)) * 100

# Filters:
#   mountpoint="/"     → root partition only
#   fstype!="rootfs"   → exclude the rootfs pseudo-filesystem
#   Actual device: /dev/vda1 (Huawei Cloud system disk, 40 GB)

# Disk available percentage (used by the alert rule)
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Disk size (GB)
node_filesystem_size_bytes{job="node",mountpoint="/",fstype!="rootfs"} / 1024 / 1024 / 1024

# Disk available (GB)
node_filesystem_avail_bytes{job="node",mountpoint="/",fstype!="rootfs"} / 1024 / 1024 / 1024

11.4 Network

promql

# Network receive rate (Mbps)
rate(node_network_receive_bytes_total{job="node",device!="lo"}[5m]) * 8 / 1000000

# Network transmit rate (Mbps)
rate(node_network_transmit_bytes_total{job="node",device!="lo"}[5m]) * 8 / 1000000

# device!="lo"  → exclude the loopback interface
# *8            → bytes to bits
# /1000000      → bits per second to Mbps
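
The arithmetic behind these expressions is easy to check by hand. The sketch below is a simplification: real `rate()` also handles counter resets and extrapolates to the window boundaries, which this two-sample version does not. All function names and sample values are hypothetical.

```python
def per_second_rate(value_start, value_end, t_start, t_end):
    """Average per-second increase of a counter between two samples (no reset handling)."""
    return (value_end - value_start) / (t_end - t_start)


def bytes_per_second_to_mbps(bytes_per_second):
    """Convert bytes/s to megabits/s: multiply by 8 (bits), divide by 1,000,000."""
    return bytes_per_second * 8 / 1_000_000


def cpu_usage_percent(idle_rate):
    """CPU usage from the idle rate (idle CPU-seconds per second, between 0 and 1)."""
    return 100 - idle_rate * 100


# Made-up samples five minutes (300 s) apart.
rx_rate = per_second_rate(1_000_000, 38_500_000, 0, 300)   # receive counter in bytes
idle_rate = per_second_rate(22_000.0, 22_297.0, 0, 300)    # idle counter in seconds

print(f"receive: {bytes_per_second_to_mbps(rx_rate):.3f} Mbps")
print(f"cpu usage: {cpu_usage_percent(idle_rate):.1f} %")
```

For a counter that grows by 37.5 MB in five minutes the rate is 125,000 bytes/s, that is 1 Mbps; an idle counter that advances 297 s in 300 s corresponds to 1% CPU usage.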

11.5 System Load

promql

# 1-minute load
node_load1{job="node"}

# 5-minute load
node_load5{job="node"}

# 15-minute load
node_load15{job="node"}

11.6 Node Status

promql

# Number of online nodes
count(up{job="node"} == 1)

# Number of currently firing alerts
count(ALERTS{alertstate="firing"})

# Total number of scrape targets
count(up{job="node"})

# Prometheus uptime (hours)
(time() - process_start_time_seconds{job="prometheus"}) / 3600

# Node uptime (days)
(time() - node_boot_time_seconds{job="node"}) / 86400

11.7 Common Debugging Queries

promql

# List all alerts that are currently pending or firing
# (loaded rules themselves are listed via /api/v1/rules)
ALERTS

# All metrics of a specific node (the Prometheus UI offers autocompletion)
{instance="192.168.0.228:9100"}

# CPU mode breakdown over the last 5 minutes
rate(node_cpu_seconds_total{instance="192.168.0.228:9100"}[5m])

12. Day-to-Day Operations Command Reference

12.1 Service Management

bash

# Status of all Prometheus-related services
systemctl status prometheus prometheus-alertmanager \
  prometheus-webhook-dingtalk grafana-server prometheus-node-exporter

# Reload configuration (hot reload, no service interruption)
systemctl reload prometheus

# Reload Alertmanager
systemctl reload prometheus-alertmanager

# Follow logs
journalctl -u prometheus -f          # Prometheus
journalctl -u prometheus-alertmanager -f  # Alertmanager
journalctl -u prometheus-webhook-dingtalk -f  # Webhook
journalctl -u grafana-server -f      # Grafana

12.2 Prometheus Debugging

bash

# Target status
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool

# Loaded alert rules
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool

# Instant query
# (-G with --data-urlencode avoids curl's URL globbing of {} and [] and encodes the query)
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node"}'

# Range query (CPU over the past hour)
curl -s -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=rate(node_cpu_seconds_total{mode="idle"}[5m])' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'

# Check configuration syntax
promtool check config /etc/prometheus/prometheus.yml

# Check rule syntax
promtool check rules /etc/prometheus/rules/node_alerts.yml
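
PromQL expressions contain characters such as `{`, `}`, `[`, `"` that must be percent-encoded in a URL. When scripting the range query from Python, the standard library can do the encoding. This is a minimal sketch; `query_range_url` is a hypothetical helper that only builds the URL and does not contact the server.

```python
import time
from urllib.parse import parse_qs, urlencode, urlsplit


def query_range_url(base_url, query, start, end, step):
    """Build a /api/v1/query_range URL with all parameters percent-encoded."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


now = int(time.time())
url = query_range_url(
    "http://localhost:9090",
    'rate(node_cpu_seconds_total{mode="idle"}[5m])',
    start=now - 3600,  # one hour ago
    end=now,
    step=60,
)
print(url)

# Round-trip check: decoding the URL yields the original expression.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The resulting URL can be fetched with `urllib.request.urlopen` or pasted into `curl` without further quoting concerns.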

12.3 Alertmanager Debugging

bash

# Note: these commands use API v1, which works on Alertmanager 0.26 but is deprecated;
# the v2 equivalents are /api/v2/alerts and /api/v2/silences.

# Currently active alerts
curl -s http://localhost:9093/api/v1/alerts | python3 -m json.tool

# Silences
curl -s http://localhost:9093/api/v1/silences | python3 -m json.tool

# Create a silence manually
curl -s -X POST http://localhost:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d '{"matchers":[{"name":"alertname","value":"HighCPUUsage","isRegex":false}],
       "startsAt":"2026-05-30T12:00:00Z","endsAt":"2026-05-30T14:00:00Z",
       "createdBy":"admin","comment":"maintenance window"}'

12.4 Grafana Management

bash

# Export the dashboard (backup)
curl -s -u admin:admin123 http://localhost:3000/api/dashboards/uid/node-exporter-full \
  | python3 -m json.tool > dashboard-backup.json

# Reset the admin password
grafana cli --homepath /usr/share/grafana admin reset-admin-password <new_password>

# Data source health
curl -s -u admin:admin123 http://localhost:3000/api/datasources/1/health

12.5 Node Exporter

bash

# Verify that metrics are reachable on every node
for ip in 228 207 238 152; do
  echo -n "192.168.0.$ip:9100 → "
  curl -s -o /dev/null -w "%{http_code}" http://192.168.0.$ip:9100/metrics
  echo
done

# Check specific metrics
curl -s http://192.168.0.228:9100/metrics | grep -E "node_load1|node_cpu_seconds_total"

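13. Error Summary (15 Items)

Error 1: All direct GitHub Release downloads from Hong Kong timed out

Symptom:

curl -L https://github.com/prometheus/node_exporter/releases/download/v1.7.0/...
→ Connection timed out after 120s

Root cause: the Huawei Cloud Hong Kong ECS instances reached the GitHub Release CDN nodes extremely slowly; all 88 release downloads failed.

Solution: use the packaged version from the official Ubuntu 24.04 APT repository (prometheus-node-exporter).

Lesson: when deploying open-source components in overseas hosting environments, prefer the APT repository over direct GitHub Release downloads. APT is served through mirrors and is far more reliable than cross-border direct connections.

Error 2: A pre-created user of the wrong type broke the Prometheus APT install

Symptom: the manually created prometheus user was a regular user, whereas the APT package expects a system user generated by systemd sysusers.

Solution:

bash

userdel -r prometheus   # remove the regular user
apt install --reinstall prometheus   # reinstalling creates the system user automatically

Lesson: do not create the user in advance; let the APT package handle it.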

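Error 3: `labels` field in the wrong position in Prometheus 2.45 YAML

Symptom:

yaml

# ❌ Wrong: labels placed outside static_configs
scrape_configs:
  - job_name: 'node'
    labels:
      env: 'production'
    static_configs:
      ...

Error: parsing YAML: yaml: unmarshal errors: field labels not found

Root cause: in Prometheus 2.x, the labels field is only valid inside a static_configs entry.

Correct form:

yaml

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: [...]
        labels:
          env: 'production'

Error 4: `pkill prometheus` closed the SSH session

Symptom: the SSH connection dropped after running pkill prometheus.

Root cause: pkill matches its pattern as an unanchored regular expression, so `prometheus` is not limited to the Prometheus server process. It also matched other processes containing the keyword prometheus that belonged to the SSH session, and those were killed unintentionally.

Solution: always manage the service with systemctl restart prometheus rather than pkill.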

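Error 5: The 87 MB Grafana package downloaded at only 23 KB/s from the official APT source

Symptom:

Get:1 https://apt.grafana.com stable/main amd64 grafana 13.0.1 [87.3 MB]
  23.4 kB/s | 87 MB  estimated remaining 1h 23m

Root cause: very low bandwidth between the Huawei Cloud Hong Kong ECS and apt.grafana.com.

Solution: switch to the Alibaba Cloud mirror mirrors.aliyun.com/grafana/apt; the download then takes about 30 seconds.

Lesson: for servers hosted in China, prefer a domestic mirror when installing overseas-hosted components such as Grafana.

Error 6: `grafana-cli` is deprecated in Grafana v13

Symptom:

bash

grafana-cli admin reset-admin-password admin123
# Grafana-server Init Failed: Could not find config defaults, make sure homepath is set

Correct form:

bash

grafana cli --homepath /usr/share/grafana admin reset-admin-password admin123

Lesson: Grafana v13 restructures the CLI as grafana <subcommand>; grafana-cli no longer worked here.

Error 7: First Grafana login with admin/admin returned 401

Symptom:

json

{"message": "Invalid username or password", "statusCode": 401}

Root cause: a different password may have been preset during installation, or it was already changed at first login.

Solution: reset the password from the server with the CLI (see Error 6).

Error 8: All panels showed "No data" after importing the dashboard

Symptom: the dashboard JSON imported successfully, but every panel showed "No data".

Root cause: the Prometheus data source had not been created. Grafana does not discover a local Prometheus automatically.

Solution: create the data source through the API first, confirm connectivity, then import the dashboard.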

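Error 9: Webhook DingTalk URL path returned 404

Symptom: the Alertmanager log shows webhook response 404.

Wrong configuration:

yaml

url: 'http://localhost:8060/dingtalk/webhook/dingtalk'  # ❌ old path

Correct configuration (v2.1.0):

yaml

url: 'http://localhost:8060/dingtalk/dingtalk/send'     # ✅

Lesson: the route path differs between webhook-dingtalk versions; rely on the paths printed in the startup log.

Error 10: `$labels` / `$value` were expanded when writing YAML via a shell heredoc

Symptom: after writing the alert rules with a shell heredoc (cat > file << 'EOF'), Prometheus reported a syntax error at startup.

Root cause: a quoted delimiter ('EOF') disables expansion inside the heredoc itself, yet $value still ended up as an empty string. The likely explanation is that the expansion happened at an outer layer, for example when the whole heredoc was passed as a double-quoted command string through the SSH tool, so the variables were substituted before the heredoc was ever parsed.

Solution:

Write the rule file locally on Windows
Upload it to the server over SFTP (python sftp_upload.py prom rules.yml /etc/prometheus/rules/)
Validate the syntax on the server: promtool check rules /etc/prometheus/rules/node_alerts.yml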
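
Writing the file from Python sidesteps shell quoting entirely, because no shell ever parses the template placeholders. The sketch below is illustrative: `write_rule_file` is a hypothetical helper, the rule text is abbreviated from section 8.1, and the temporary directory stands in for the real target path.

```python
import tempfile
from pathlib import Path

# Raw string: "$labels" and "$value" are plain characters to Python.
RULE_TEXT = r"""groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
"""


def write_rule_file(path, text):
    """Write the rule text verbatim and return what was actually stored on disk."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    stored = write_rule_file(Path(tmp) / "node_alerts.yml", RULE_TEXT)
    print("{{ $value }}" in stored and "{{ $labels.instance }}" in stored)  # prints True
```

The file still needs to be uploaded and checked with `promtool check rules` as described in the steps above.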

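Error 11: Private IP vs public IP confusion

Symptom: the instance label of Node Exporter metrics shows the private IP (192.168.0.x), which users cannot reach directly from outside.

Explanation: this is not an error but a deliberate design:

Prometheus scrapes over the private network (low latency, no public bandwidth used)
Grafana and the Prometheus UI are reached over the public network (users browse from outside)
No additional tunnel or proxy is needed

Error 12: Memory stress test with `stress-ng --vm` allocated too little

Symptom: stress-ng --vm 1 --vm-bytes 2.5G only drove memory to 74% and did not cross the 85% threshold.

Root cause: the system has reclaimable buffer/cache, so usage after allocating 2.5 GB was lower than expected.

Solution: use Python bytearray(3GB) to control the allocation precisely, reaching 94%.

Lesson: stress-tool parameters need to be calculated from the actual physical memory, with some headroom.

Error 13: `rate()[1m]` window too long for the CPU stress test

Symptom: no alert fired after 2 minutes of load.

Root cause: rate(node_cpu_seconds_total{mode="idle"}[1m]) averages over 1 minute, and the idle first 30s pulled the average down.

Temporary fix: for testing, shrink the window to [30s] with for: 5s so the alert fires quickly.

Production recommendation: keep [5m] with for: 5m; a 5-minute rolling average is a common choice for production CPU alerts.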
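
The effect of the window length can be quantified with a simple model. The sketch assumes an idealized step load (0% before the test starts, 100% afterwards); the function names are hypothetical.

```python
def windowed_usage(busy_seconds, window_seconds):
    """Average CPU usage (%) over a window in which the CPU was fully busy
    for `busy_seconds` and idle for the rest."""
    busy = min(busy_seconds, window_seconds)
    return 100.0 * busy / window_seconds


def seconds_to_cross(threshold_percent, window_seconds):
    """Seconds of full load needed before the windowed average reaches the threshold."""
    return window_seconds * threshold_percent / 100.0


# 30 s into the load, a 1-minute window still averages in 30 s of idle time.
print(windowed_usage(30, 60))  # 50.0

for window in (30, 60, 300):
    print(f"[{window}s] window: 80% reached after {seconds_to_cross(80, window):.0f}s of full load")
```

Under this model the expression crosses 80% after 24 s with a [30s] window, 48 s with [1m], and 240 s with [5m]; the `for` duration is then added on top before the alert fires.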

Error 14: `alertmanager --version` reports command not found

Symptom:

bash

$ alertmanager --version
bash: alertmanager: command not found

Root cause: the binary installed by APT on Ubuntu 24.04 is named prometheus-alertmanager, not alertmanager.

Solution: run prometheus-alertmanager --version, or manage the service through systemd.

Error 15: Grafana Image Renderer plugin not installed

Symptom: dashboard screenshots cannot be rendered through the Grafana API (needed for automated reports).

Explanation: grafana-image-renderer is an optional plugin that requires Node.js + Chromium and is not preinstalled. It does not affect dashboard data display or interactive use.

To install it if needed:

bash

grafana cli plugins install grafana-image-renderer
apt install -y chromium-browser
systemctl restart grafana-server

14. Current Runtime Data

Collected at: 2026-05-30 20:00 CST

Collection method: Prometheus HTTP API /api/v1/query

14.1 Node Status

Node	Private IP	Status	Uptime
prom	192.168.0.228	✅ UP	Running continuously
node-1	192.168.0.207	✅ UP	Running continuously
node-2	192.168.0.238	✅ UP	Running continuously
node-3	192.168.0.152	✅ UP	Running continuously

14.2 CPU Usage

Node	CPU	Alert threshold	Status
prom	0.6%	>80%	🟢 Normal
node-1	0.2%	>80%	🟢 Normal
node-2	0.2%	>80%	🟢 Normal
node-3	0.2%	>80%	🟢 Normal

14.3 Memory Usage (3.3 GB total × 4)

Node	Usage	Alert threshold	Status
prom	19.2%	>85%	🟢 Normal
node-1	12.1%	>85%	🟢 Normal
node-2	13.2%	>85%	🟢 Normal
node-3	11.5%	>85%	🟢 Normal

14.4 Disk Usage (Root Partition /)

Node	Usage	Available	Size	Alert threshold (available)	Status
prom	17.6%	32.4 GB	39.3 GB	<10%	🟢 Normal
node-1	12.9%	34.2 GB	39.3 GB	<10%	🟢 Normal
node-2	12.9%	34.2 GB	39.3 GB	<10%	🟢 Normal
node-3	13.5%	33.9 GB	39.3 GB	<10%	🟢 Normal

14.5 Network Traffic (eth0)

Node	Receive	Transmit
prom	0.033 Mbps	0.003 Mbps
node-1	0.004 Mbps	0.010 Mbps
node-2	0.004 Mbps	0.010 Mbps
node-3	0.005 Mbps	0.010 Mbps

14.6 System Load

Node	Load 5m	Load 15m
prom	0.00	0.00
node-1	0.00	0.00
node-2	0.00	0.00
node-3	0.03	0.26

14.7 Current Alerts

ALERTS{alertstate="firing"}: (none)

All nodes are running normally with no active alerts.

Appendix A: Configuration Files

File	Location	Description
Prometheus config	`/etc/prometheus/prometheus.yml`	Scrape configuration + alerting settings
Alertmanager config	`/etc/prometheus/alertmanager.yml`	Alert routing
Alert rules	`/etc/prometheus/rules/node_alerts.yml`	4 rules
DingTalk config	`/opt/prometheus-webhook-dingtalk/config.yml`	accessToken + secret
DingTalk systemd unit	`/etc/systemd/system/prometheus-webhook-dingtalk.service`	Autostart service
Grafana dashboard	`D:\tools\grafana_dashboard_node_exporter.json`	Can be re-imported
SSH management tool	`D:\tools\ssh_exec.py`	Batch SSH
SFTP upload tool	`D:\tools\sftp_upload.py`	File upload
Metrics collection script	`D:\tools\collect_full.py`	Batch query of current values

Appendix B: Access Points

Service	URL	Authentication
Grafana dashboard	`http://120.46.81.16:3000/d/node-exporter-full`	admin / admin123
Grafana home	`http://120.46.81.16:3000`	admin / admin123
Prometheus UI	`http://120.46.81.16:9090`	None
Prometheus Targets	`http://120.46.81.16:9090/targets`	None
Prometheus Alerts	`http://120.46.81.16:9090/alerts`	None
Alertmanager UI	`http://120.46.81.16:9093`	None

Maintenance log:

2026-05-30: Initial version, covering full-stack deployment and usage of Prometheus v2.45.3 + Alertmanager v0.26.0 + Grafana v13.0.1 + webhook-dingtalk v2.1.0

Contents: architecture design / deployment steps / alert configuration / stress tests / PromQL reference / 15-item error summary / operations commands / current runtime data

Issued by: @WorkBuddy
