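Monitoring multiple websites with Prometheus and sending alerts to a WeCom (企业微信) group robot: a complete setup

This guide walks through monitoring the availability of multiple websites with Prometheus and the Blackbox Exporter, and sending alerts through Alertmanager to a WeCom group robot.

Overall architecture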

[Blackbox Exporter] → [Prometheus] → [Alertmanager] → [WeCom group robot]
       ↑                     ↑
 website probes          alert rules
(multiple sites)     (down, slow, etc.)

Complete configuration

mkdir -p website-monitoring/{bin,config,data,templates,logs}
cd website-monitoring

Download the required components

# Look up the latest release tags
BLACKBOX_VER=$(curl -s https://api.github.com/repos/prometheus/blackbox_exporter/releases/latest | grep tag_name | cut -d'"' -f4)
PROM_VER=$(curl -s https://api.github.com/repos/prometheus/prometheus/releases/latest | grep tag_name | cut -d'"' -f4)
ALERT_VER=$(curl -s https://api.github.com/repos/prometheus/alertmanager/releases/latest | grep tag_name | cut -d'"' -f4)


# Download and extract
wget https://github.com/prometheus/blackbox_exporter/releases/download/${BLACKBOX_VER}/blackbox_exporter-${BLACKBOX_VER:1}.linux-amd64.tar.gz
wget https://github.com/prometheus/prometheus/releases/download/${PROM_VER}/prometheus-${PROM_VER:1}.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/${ALERT_VER}/alertmanager-${ALERT_VER:1}.linux-amd64.tar.gz

tar zxvf blackbox_exporter-*.tar.gz --strip-components=1 -C bin --wildcards '*/blackbox_exporter'
# Also extract promtool and amtool; amtool is used later to check the Alertmanager config
tar zxvf prometheus-*.tar.gz --strip-components=1 -C bin --wildcards '*/prometheus' '*/promtool'
tar zxvf alertmanager-*.tar.gz --strip-components=1 -C bin --wildcards '*/alertmanager' '*/amtool'

rm *.tar.gz
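
The download URLs above combine the release tag (with its leading "v") and the bare version number (without it). As a minimal sketch of that naming pattern, the hypothetical helper `release_url` below builds the same URL from a repository name and a tag; the tag in the example call is only an illustration, not necessarily the latest release.

```shell
# Build the download URL for a Prometheus-project release asset.
# $1 = repository name (e.g. blackbox_exporter), $2 = release tag (e.g. v0.25.0)
release_url() {
  # ${2#v} removes one leading "v" if present, so "v0.25.0" becomes "0.25.0"
  printf 'https://github.com/prometheus/%s/releases/download/%s/%s-%s.linux-amd64.tar.gz\n' \
    "$1" "$2" "$1" "${2#v}"
}

# Example with an illustrative tag
release_url blackbox_exporter v0.25.0
```

Unlike `${VAR:1}`, which always drops the first character, `${VAR#v}` only strips a literal "v", so it is also safe for tags that have no prefix.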

Configuration files
config/blackbox.yml (probe module definitions):

modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: "ip4"
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      follow_redirects: true
      tls_config:
        insecure_skip_verify: true  # skip certificate verification

  https_2xx:
    prober: http
    timeout: 10s
    http:
      method: GET
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: true

config/prometheus.yml:


global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  # Scrape the Blackbox Exporter's own metrics
  - job_name: 'blackbox'
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:9115']
  
  # Probe multiple websites
  - job_name: 'website_availability'
    metrics_path: /probe
    params:
      module: [http_2xx]  # module defined in blackbox.yml
    
    # Websites to monitor
    static_configs:
      - targets:
          - https://www.example.com  # site 1
          - https://api.example.com  # site 2
          - https://app.example.com  # site 3
          - https://blog.example.org # external site
          - http://192.168.1.100     # internal site
    
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # Blackbox Exporter address

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  # Relative paths are resolved against the directory containing prometheus.yml (config/)
  - 'alert_rules.yml'
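
The relabel_configs in the website_availability job move each target URL into the `target` query parameter and point the scrape at the Blackbox Exporter. As a rough sketch of the request that results, the hypothetical helpers `urlencode` and `probe_url` below rebuild an equivalent probe URL by hand; they assume ASCII-only input and are meant for illustration, not as a description of Prometheus internals.

```shell
# Percent-encode a string for use as a URL query value (ASCII input assumed).
urlencode() {
  s="$1"; out=""
  while [ -n "$s" ]; do
    rest="${s#?}"        # everything after the first character
    c="${s%"$rest"}"     # the first character itself
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;                 # unreserved characters pass through
      *) out="$out$(printf '%%%02X' "'$c")" ;;         # everything else becomes %XX
    esac
    s="$rest"
  done
  printf '%s\n' "$out"
}

# $1 = exporter address, $2 = module name, $3 = website URL
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$(urlencode "$3")"
}

probe_url localhost:9115 http_2xx https://www.example.com
```

Fetching the printed URL with curl while the Blackbox Exporter is running returns the same probe_* metrics that Prometheus stores for that target.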

config/alert_rules.yml (website alert rules):

groups:
- name: website-monitoring
  rules:
  
  # Website unreachable
  - alert: WebsiteDown
    expr: probe_success{job="website_availability"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Website unreachable: {{ $labels.instance }}"
      description: "{{ $labels.instance }} has been unreachable for more than 1 minute"
      dashboard: "https://grafana.example.com/dashboard?var-site={{ $labels.instance }}"
  
  # Slow response
  - alert: WebsiteSlowResponse
    expr: probe_duration_seconds{job="website_availability"} > 3
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Slow website response: {{ $labels.instance }}"
      description: "{{ $labels.instance }} probe duration has stayed above 3 seconds (current: {{ $value | printf \"%.2f\" }}s)"
      
  # Abnormal HTTP status code
  - alert: WebsiteBadStatus
    expr: probe_http_status_code{job="website_availability"} >= 400
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Abnormal HTTP status code: {{ $labels.instance }}"
      description: "{{ $labels.instance }} returned error status code {{ $value }}"
  
  # SSL certificate expiring soon
  - alert: SSLCertExpiringSoon
    expr: (probe_ssl_earliest_cert_expiry{job="website_availability"} - time()) < 86400 * 7  # expires within 7 days
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expiring soon: {{ $labels.instance }}"
      description: "The SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
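
The SSLCertExpiringSoon expression subtracts the current Unix time from the certificate's expiry timestamp and compares the remaining seconds with a 7-day window. The same arithmetic can be sketched in shell; `cert_expiring_soon` is a hypothetical helper and the timestamps in the example are made up.

```shell
# Return success (0) when a certificate expires within the warning window.
# $1 = expiry time (Unix seconds), $2 = current time (Unix seconds), $3 = window in days
cert_expiring_soon() {
  [ $(( $1 - $2 )) -lt $(( 86400 * $3 )) ]
}

now=$(date +%s)
# A made-up certificate that expires 3 days from now falls inside the 7-day window
if cert_expiring_soon $(( now + 3 * 86400 )) "$now" 7; then
  echo "certificate expires within 7 days"
fi
```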

config/alertmanager.yml:


global:
  resolve_timeout: 10m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'wechat-bot'
  
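  # Route by severity
  routes:
  - matchers:
      - severity="critical"
    repeat_interval: 20m
    receiver: 'wechat-critical'

receivers:
# Note: webhook_configs POSTs Alertmanager's own JSON payload and does not render
# the message template below, whereas the WeCom robot API expects a body such as
# {"msgtype": "text", ...}. A small forwarding adapter between Alertmanager and
# these URLs is typically needed to convert the format.
- name: 'wechat-bot'
  webhook_configs:
  - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_WEBHOOK_KEY'
    send_resolved: true

- name: 'wechat-critical'
  webhook_configs:
  - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_CRITICAL_WEBHOOK_KEY'
    send_resolved: true
    
templates:
# Relative paths are resolved against the directory containing alertmanager.yml (config/)
- '../templates/wechat.tmpl'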
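
The WeCom robot endpoint expects a JSON body with a `msgtype` field, like the text message used in the curl test later in this guide, which is not the JSON that Alertmanager's generic webhook sends. As a minimal sketch of the target format only, the hypothetical helpers `json_escape` and `wecom_markdown_payload` below build a markdown-type message body from a plain string. They handle backslashes, double quotes and newlines, and are not a full JSON encoder or a complete adapter.

```shell
# Escape a string for embedding in a JSON string literal
# (backslash, double quote and newline only; other control characters are not handled).
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# Wrap a message in the markdown message body accepted by the WeCom robot webhook.
wecom_markdown_payload() {
  printf '{"msgtype":"markdown","markdown":{"content":"%s"}}\n' "$(json_escape "$1")"
}

wecom_markdown_payload 'Website down: "https://api.example.com"'

# To send it (assumes WEBHOOK_KEY is set in the environment):
# curl -X POST -H "Content-Type: application/json" \
#   -d "$(wecom_markdown_payload 'Test message')" \
#   "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$WEBHOOK_KEY"
```

An adapter service would receive Alertmanager's webhook POST, format each alert into a string, and forward a body of this shape to the robot URL.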

templates/wechat.tmpl (WeCom message template):

{{ define "wechat.message" }}
{{ if gt (len .Alerts.Firing) 0 }}
{{ range .Alerts.Firing }}
{{ if eq .Labels.severity "critical" }}
🚨🚨 **[Critical outage]** 🚨🚨
{{ else }}
⚠️ **[Website alert]** ⚠️
{{ end }}

📌 Website: {{ .Labels.instance }}
🔄 Status: {{ if eq .Status "firing" }}firing{{ else }}resolved{{ end }}
🔢 Alert: {{ if .Annotations.summary }}{{ .Annotations.summary }}{{ else }}{{ .Labels.alertname }}{{ end }}
⏰ Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}

{{ if .Annotations.description }}
📝 Details: 
{{ .Annotations.description }}
{{ end }}

{{ if .Annotations.dashboard }}
🔗 Dashboard: {{ .Annotations.dashboard }}
{{ end }}

{{ if eq .Labels.severity "critical" }}
<@all> Please handle this immediately!
{{ end }}
━━━━━━━━━━━━━━━━━
{{ end }}
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
{{ range .Alerts.Resolved }}
✅ **[Recovered]**
📌 Website: {{ .Labels.instance }}
⏱ Duration: {{ .EndsAt.Sub .StartsAt }}
🕒 Recovered at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
━━━━━━━━━━━━━━━━━
{{ end }}
{{ end }}
{{ end }}

Startup scripts
start-monitoring.sh:

#!/bin/bash

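# WeCom robot keys (replace with real values).
# Note: Alertmanager does not expand environment variables in its config file,
# so the keys in config/alertmanager.yml must be replaced there as well.
export WEBHOOK_KEY="your_wechat_robot_key"
export CRITICAL_KEY="your_critical_robot_key"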

# Prepare directories
mkdir -p data logs

# Start the Blackbox Exporter
echo "🚀 Starting Blackbox Exporter (port 9115)..."
./bin/blackbox_exporter --config.file=config/blackbox.yml > logs/blackbox.log 2>&1 &
BB_PID=$!
echo $BB_PID > data/blackbox.pid

# Start Alertmanager
echo "📡 Starting Alertmanager (port 9093)..."
./bin/alertmanager \
  --config.file=config/alertmanager.yml \
  --storage.path=data/alertmanager > logs/alertmanager.log 2>&1 &
AM_PID=$!
echo $AM_PID > data/alertmanager.pid

# Start Prometheus (--web.enable-lifecycle enables the /-/reload endpoint used by the helper scripts)
echo "📊 Starting Prometheus (port 9090)..."
./bin/prometheus \
  --config.file=config/prometheus.yml \
  --web.enable-lifecycle \
  --storage.tsdb.path=data/prometheus > logs/prometheus.log 2>&1 &
PROM_PID=$!
echo $PROM_PID > data/prometheus.pid

# Show status
echo ""
echo "✅ Monitoring stack started!"
echo "------------------------------------------"
echo "Prometheus:        http://localhost:9090"
echo "Alertmanager:      http://localhost:9093"
echo "Blackbox Exporter: http://localhost:9115"
echo "Grafana dashboard: https://grafana.example.com (optional)"
echo ""
echo "Monitored websites:"
# Print list items of the form "- http(s)://..."; $2 is the URL, so trailing comments are dropped
awk '$1 == "-" && $2 ~ /^https?:\/\// { print $2 }' config/prometheus.yml
echo "------------------------------------------"
echo "Press Ctrl+C to stop all services"

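# Stop all services. The function must be defined before the trap can call it.
stop_services() {
  echo ""
  echo "🛑 Stopping services..."
  kill "$PROM_PID" "$AM_PID" "$BB_PID" 2>/dev/null
  rm -f data/*.pid
  echo "✅ All services stopped"
  exit 0
}

trap stop_services SIGINT SIGTERM

# Block until interrupted
wait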
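
The start script records one PID file per service under data/. As a small sketch of how those files could be used for a status check, the hypothetical helper `is_running` below tests whether the recorded PID still belongs to a live process; the file names are the ones written by start-monitoring.sh.

```shell
# Return success if the PID stored in the given file belongs to a live process.
is_running() {
  [ -f "$1" ] || return 1
  pid=$(cat "$1")
  # kill -0 sends no signal; it only checks that the process exists and can be signalled
  kill -0 "$pid" 2>/dev/null
}

for name in prometheus alertmanager blackbox; do
  if is_running "data/$name.pid"; then
    echo "$name: running"
  else
    echo "$name: not running"
  fi
done
```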

stop-monitoring.sh:


#!/bin/bash

echo "🛑 Stopping the monitoring stack..."
if [ -f data/prometheus.pid ]; then
  kill "$(cat data/prometheus.pid)"
fi
if [ -f data/alertmanager.pid ]; then
  kill "$(cat data/alertmanager.pid)"
fi
if [ -f data/blackbox.pid ]; then
  kill "$(cat data/blackbox.pid)"
fi
rm -f data/*.pid
echo "✅ All services stopped"

Helper scripts for managing targets
add-website.sh (add a website to monitoring):

#!/bin/bash

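if [ $# -eq 0 ]; then
  echo "Usage: $0 <website URL>"
  echo "Example: $0 https://new-website.com"
  exit 1
fi

URL=$1
CONFIG="config/prometheus.yml"

# Check whether the URL is already a target (exact match on the list item)
if awk -v url="$URL" '$1 == "-" && $2 == url { found = 1 } END { exit !found }' "$CONFIG"; then
  echo "❌ Website is already monitored: $URL"
  exit 1
fi

# Append the URL to the block-style target list. The address only matches a bare
# "- targets:" line, so inline lists such as "- targets: ['localhost:9115']" are left alone.
sed -i "/^ *- targets:[[:space:]]*\$/a \ \ \ \ \ \ \ \ \ \ - $URL" "$CONFIG"

echo "✅ Added website to monitoring: $URL"
echo "Reloading Prometheus configuration..."

# -f makes curl fail on HTTP errors; the endpoint requires Prometheus to run with --web.enable-lifecycle
if curl -fsS -X POST http://localhost:9090/-/reload; then
  echo "🔄 Configuration reloaded"
else
  echo "⚠️ Reload failed, check the Prometheus status"
fi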
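
Because add-website.sh writes its argument straight into prometheus.yml, it can help to reject obviously malformed input first. A minimal sketch of such a check follows; `is_valid_url` is a hypothetical helper that only verifies the scheme and the absence of whitespace, not that the site exists.

```shell
# Accept only http:// or https:// URLs that contain no whitespace.
is_valid_url() {
  case "$1" in
    *[[:space:]]*) return 1 ;;              # whitespace would break the YAML list item
    http://?* | https://?*) return 0 ;;     # scheme followed by at least one character
    *) return 1 ;;
  esac
}

for candidate in "https://new-website.com" "ftp://files.example.com" "https://bad url"; do
  if is_valid_url "$candidate"; then
    echo "ok:       $candidate"
  else
    echo "rejected: $candidate"
  fi
done
```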

remove-website.sh (remove a website from monitoring):

#!/bin/bash

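if [ $# -eq 0 ]; then
  echo "Usage: $0 <website URL>"
  echo "Example: $0 https://new-website.com"
  exit 1
fi

URL=$1
CONFIG="config/prometheus.yml"
TEMP=$(mktemp)

# Write a copy without the matching list item (exact match, so similar URLs are kept)
awk -v url="$URL" '!($1 == "-" && $2 == url)' "$CONFIG" > "$TEMP"

# Check that something was actually removed
if [ "$(wc -l < "$CONFIG")" -eq "$(wc -l < "$TEMP")" ]; then
  echo "❌ Website is not monitored: $URL"
  rm -f "$TEMP"
  exit 1
fi

# Overwrite in place so the config file keeps its original permissions
cat "$TEMP" > "$CONFIG"
rm -f "$TEMP"
echo "✅ Removed website from monitoring: $URL"
echo "Reloading Prometheus configuration..."

# -f makes curl fail on HTTP errors; the endpoint requires Prometheus to run with --web.enable-lifecycle
if curl -fsS -X POST http://localhost:9090/-/reload; then
  echo "🔄 Configuration reloaded"
else
  echo "⚠️ Reload failed, check the Prometheus status"
fi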

Make the scripts executable

chmod +x *.sh
mkdir -p data logs

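WeCom robot configuration

1. Create two WeCom group robots:

Regular alert robot: for general alerts

Critical alert robot: for severe outages, configured to mention all members (@all)

2. Get each robot's webhook URL:

https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=XXXXX

3. Replace the keys in the configuration:

In config/alertmanager.yml, replace YOUR_WEBHOOK_KEY and YOUR_CRITICAL_WEBHOOK_KEY

In start-monitoring.sh, set the WEBHOOK_KEY and CRITICAL_KEY environment variables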

Usage

1. Start the monitoring stack

./start-monitoring.sh

2. Add a website to monitoring

./add-website.sh https://example.com

3. Stop the monitoring stack

./stop-monitoring.sh

4. Verify the configuration

# Check the Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "website_availability")'

# Check the alert rules
curl -s http://localhost:9090/api/v1/rules | jq .

# Send a test message to WeCom
curl -X POST -H "Content-Type: application/json" \
-d '{"msgtype": "text", "text": {"content": "Test message"}}' \
"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$WEBHOOK_KEY"

Example WeCom messages

Critical outage alert

🚨🚨 **[Critical outage]** 🚨🚨

📌 Website: https://api.example.com
🔄 Status: firing
🔢 Alert: Website unreachable: https://api.example.com
⏰ Started at: 2023-08-20 14:30:45

📝 Details: 
https://api.example.com has been unreachable for more than 1 minute

🔗 Dashboard: https://grafana.example.com/dashboard?var-site=https://api.example.com

Slow response alert

⚠️ **[Website alert]** ⚠️

📌 Website: https://app.example.com
🔄 Status: firing
🔢 Alert: Slow website response: https://app.example.com
⏰ Started at: 2023-08-20 13:45:22

📝 Details: 
https://app.example.com probe duration has stayed above 3 seconds (current: 4.75s)

🔗 Dashboard: https://grafana.example.com/dashboard?var-site=https://app.example.com

Recovery notification

✅ **[Recovered]**

📌 Website: https://api.example.com
⏱ Duration: 1h14m20s
🕒 Recovered at: 2023-08-20 15:45:05

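Best practices for monitoring multiple websites

Group websites into separate jobs

# prometheus.yml
- job_name: 'website_main'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://www.example.com
        - https://api.example.com
  relabel_configs: [...]  # same relabel_configs as the website_availability job
        
- job_name: 'website_partner'
  metrics_path: /probe
  params:
    module: [https_2xx]
  static_configs:
    - targets:
        - https://partner1.com
        - https://partner2.com
  relabel_configs: [...]  # same relabel_configs as the website_availability job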

Geographically distributed monitoring
Deploy Prometheus in multiple regions and label each job with its region:

- job_name: 'website_us'
  static_configs:
    - targets: [...]
  relabel_configs:
    - target_label: region
      replacement: "us"

- job_name: 'website_eu'
  static_configs:
    - targets: [...]
  relabel_configs:
    - target_label: region
      replacement: "eu"

Custom probe parameters
Add special-purpose modules to blackbox.yml:

modules:
  wordpress_auth:
    prober: http
    http:
      headers:
        Authorization: "Bearer YOUR_TOKEN"
      valid_status_codes: [200]

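Grafana dashboard integration
Create a website health overview dashboard. Prometheus data sources are queried with PromQL rather than SQL; queries along these lines give per-site response time and uptime over the dashboard's selected time range ($__range is a Grafana built-in variable):

# Average, minimum and maximum probe duration per site
avg_over_time(probe_duration_seconds{job="website_availability"}[$__range])
min_over_time(probe_duration_seconds{job="website_availability"}[$__range])
max_over_time(probe_duration_seconds{job="website_availability"}[$__range])

# Uptime percentage per site, highest first
sort_desc(avg_over_time(probe_success{job="website_availability"}[$__range]) * 100)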
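
The uptime percentage in this overview is simply the mean of the probe_success samples (1 for a successful probe, 0 for a failed one) multiplied by 100. A tiny sketch of that arithmetic, using a hypothetical helper `uptime_percent` and made-up sample values:

```shell
# Average a series of probe_success samples (1 = up, 0 = down) into an uptime percentage.
uptime_percent() {
  printf '%s\n' "$@" | awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", 100 * sum / n }'
}

# Four made-up samples with one failed probe
uptime_percent 1 1 0 1   # prints 75.00
```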

Performance tuning

Adjust the scrape interval:

# Use different intervals depending on how important a site is
- job_name: 'critical_websites'
  scrape_interval: 10s
  static_configs: [...]
  
- job_name: 'normal_websites'
  scrape_interval: 30s
  static_configs: [...]

Configure alert inhibition:

# alertmanager.yml
inhibit_rules:
  - source_match:
      alertname: 'WebsiteDown'
    target_match_re:
      alertname: 'WebsiteSlowResponse|WebsiteBadStatus'
    equal: ['instance']

Limit data retention:

# Prometheus startup flag
--storage.tsdb.retention.time=7d

Distribute the probing load:

Deploy Blackbox Exporter instances on servers in different regions

Troubleshooting

1. Probe results are inaccurate

Check the Blackbox Exporter log: tail -f logs/blackbox.log

Run a probe by hand: curl -v "http://localhost:9115/probe?target=https://example.com&module=http_2xx"

Increase the module timeout, for example: timeout: 15s

2. Alerts are not delivered to WeCom

Check that the robot key is correct

Validate the Alertmanager configuration: ./bin/amtool check-config config/alertmanager.yml

Check the Alertmanager log: tail -f logs/alertmanager.log

3. False positives

Increase the for duration, for example: for: 5m

Add a filter condition: expr: probe_success == 0 and on(instance) probe_http_status_code{job="website_availability"} != 0

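4. High resource usage

Lower the probe frequency, for example: scrape_interval: 60s

Reduce the number of monitored websites

Spread the load across multiple Blackbox Exporter instances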
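
When tuning the for duration against false positives, it helps to know roughly how many consecutive failing rule evaluations are needed before an alert fires. The sketch below uses a simplified model: the alert becomes pending at the first failing evaluation and fires once it has stayed active for at least the for duration. `evals_until_firing` is a hypothetical helper, and the model ignores scrape timing and evaluation jitter.

```shell
# Approximate number of consecutive failing evaluations before an alert fires.
# $1 = "for" duration in seconds, $2 = evaluation_interval in seconds
evals_until_firing() {
  # ceil($1 / $2) further evaluations after the first one that made the alert pending
  echo $(( ($1 + $2 - 1) / $2 + 1 ))
}

evals_until_firing 60 30    # for: 1m with a 30s interval, prints 3
evals_until_firing 300 30   # for: 5m with a 30s interval, prints 11
```

Under this model, raising for from 1m to 5m with the 30s evaluation_interval used here means a site has to fail about eleven evaluations in a row instead of three.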

With this setup you can monitor the health of multiple websites and notify the operations team through WeCom as soon as a problem appears.