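Monitoring Multiple Websites with Prometheus and Sending Alerts to a WeCom (企业微信) Group Robot: A Complete Setup
This guide shows how to monitor the availability of multiple websites with Prometheus and deliver alerts to a WeCom group robot through Alertmanager.
Architecture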
text
[Blackbox Exporter] → [Prometheus] → [Alertmanager] → [WeCom robot]
         ↑                  ↑
   website probes       alert rules
  (multiple sites)   (down, slow, etc.)
Complete configuration
- Create the monitoring directory layout
bash
mkdir -p website-monitoring/{bin,config,data,templates,logs}
cd website-monitoring
- Download the required components
bash
# Look up the latest release tags
BLACKBOX_VER=$(curl -s https://api.github.com/repos/prometheus/blackbox_exporter/releases/latest | grep tag_name | cut -d'"' -f4)
PROM_VER=$(curl -s https://api.github.com/repos/prometheus/prometheus/releases/latest | grep tag_name | cut -d'"' -f4)
ALERT_VER=$(curl -s https://api.github.com/repos/prometheus/alertmanager/releases/latest | grep tag_name | cut -d'"' -f4)
# Download the archives; ${VAR:1} drops the leading "v" of the tag for the file name
wget https://github.com/prometheus/blackbox_exporter/releases/download/${BLACKBOX_VER}/blackbox_exporter-${BLACKBOX_VER:1}.linux-amd64.tar.gz
wget https://github.com/prometheus/prometheus/releases/download/${PROM_VER}/prometheus-${PROM_VER:1}.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/${ALERT_VER}/alertmanager-${ALERT_VER:1}.linux-amd64.tar.gz
# Extract only the binaries into bin/ (amtool is used later to check the Alertmanager config)
tar zxvf blackbox_exporter-*.tar.gz --strip-components=1 -C bin --wildcards '*/blackbox_exporter'
tar zxvf prometheus-*.tar.gz --strip-components=1 -C bin --wildcards '*/prometheus'
tar zxvf alertmanager-*.tar.gz --strip-components=1 -C bin --wildcards '*/alertmanager' '*/amtool'
rm *.tar.gz
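The download URLs above follow one pattern: the release tag keeps its leading "v" in the path, while the file name uses the bare version. A minimal sketch of that pattern as a function; the name `release_url` and the tag `v1.2.3` are made up for illustration.

```bash
# Build the GitHub release tarball URL for a Prometheus-project component.
# Arguments: component name, release tag (e.g. v1.2.3), platform (default linux-amd64).
release_url() (
  component=$1
  tag=$2
  platform=${3:-linux-amd64}
  version=${tag#v}   # the tag carries a leading "v"; the file name does not
  printf 'https://github.com/prometheus/%s/releases/download/%s/%s-%s.%s.tar.gz\n' \
    "$component" "$tag" "$component" "$version" "$platform"
)

# Example with a made-up tag
release_url blackbox_exporter v1.2.3
```

Passing a third argument such as `linux-arm64` would select another platform, assuming the project publishes an archive for it.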
- Configuration files
config/blackbox.yml (probe module configuration):
yaml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: "ip4"
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      follow_redirects: true
      tls_config:
        insecure_skip_verify: true   # skip certificate verification
  https_2xx:
    prober: http
    timeout: 10s
    http:
      method: GET
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: true
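A module can be checked by hand before Prometheus is involved: the exporter's /probe endpoint returns plain-text metrics such as probe_success. The sketch below only parses such output; the helper name `metric_value` and the sample values are made up for illustration.

```bash
# Print the value of one metric from Prometheus text-format output on stdin.
# HELP/TYPE lines start with "#", so their first field never equals the metric name.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Shortened sample of what /probe typically returns (values are made up)
sample='# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_http_status_code 200
probe_duration_seconds 0.254'

printf '%s\n' "$sample" | metric_value probe_success
printf '%s\n' "$sample" | metric_value probe_http_status_code
```

Against a running exporter the same helper could be fed from `curl -s 'http://localhost:9115/probe?target=https://www.example.com&module=http_2xx'`.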
config/prometheus.yml:
yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
scrape_configs:
  # Scrape the Blackbox Exporter's own metrics
  - job_name: 'blackbox'
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:9115']
  # Probe multiple websites
  - job_name: 'website_availability'
    metrics_path: /probe
    params:
      module: [http_2xx]   # module defined in blackbox.yml
    # Websites to monitor
    static_configs:
      - targets:
          - https://www.example.com    # site 1
          - https://api.example.com    # site 2
          - https://app.example.com    # site 3
          - https://blog.example.org   # external site
          - http://192.168.1.100       # internal site
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # Blackbox Exporter address
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  # Resolved relative to the directory that contains prometheus.yml (config/)
  - 'alert_rules.yml'
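The three relabel rules move each website URL into the `target` query parameter, keep it as the `instance` label, and point the scrape at the exporter. A rough sketch of the request that results; the function names are made up, and the percent-encoding assumes ASCII input.

```bash
# Percent-encode a string for use as a URL query value (ASCII input assumed).
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Build the scrape URL that the relabeled target is equivalent to.
# Arguments: exporter address, module name, website URL.
probe_url() (
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$(urlencode "$3")"
)

probe_url localhost:9115 http_2xx https://www.example.com
```

The website never appears as the scrape address; only the exporter at localhost:9115 is contacted by Prometheus.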
config/alert_rules.yml (website alert rules):
yaml
groups:
  - name: website-monitoring
    rules:
      # Website completely unreachable
      - alert: WebsiteDown
        expr: probe_success{job="website_availability"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Website unreachable: {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been completely unreachable for more than 1 minute"
          dashboard: "https://grafana.example.com/dashboard?var-site={{ $labels.instance }}"
      # Website responding slowly
      - alert: WebsiteSlowResponse
        expr: probe_duration_seconds{job="website_availability"} > 3
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Website responding slowly: {{ $labels.instance }}"
          description: "{{ $labels.instance }} probe duration has stayed above 3 seconds (current: {{ $value | printf \"%.2f\" }} s)"
      # Abnormal HTTP status code
      - alert: WebsiteBadStatus
        expr: probe_http_status_code{job="website_availability"} >= 400
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal HTTP status code: {{ $labels.instance }}"
          description: "{{ $labels.instance }} returned HTTP status {{ $value }}"
      # SSL certificate about to expire (within 7 days)
      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry{job="website_availability"} - time()) < 86400 * 7
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon: {{ $labels.instance }}"
          description: "The SSL certificate of {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
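The certificate rule compares the seconds left until expiry with a 7-day window. The arithmetic can be sketched with plain shell integers; the timestamps are made up and the function name `cert_expiring_soon` is only for illustration.

```bash
# Seconds in the 7-day window used by the SSLCertExpiringSoon rule
threshold=$((86400 * 7))

# Succeed when the certificate expires within the window.
# Arguments: expiry time and current time, both in Unix seconds.
cert_expiring_soon() {
  [ $(( $1 - $2 )) -lt "$threshold" ]
}

# Made-up timestamps: the certificate expires 5 days after "now"
now=1700000000
expiry=$((now + 5 * 86400))
echo "threshold: $threshold s"
echo "remaining: $(( (expiry - now) / 86400 )) days"
cert_expiring_soon "$expiry" "$now" && echo "rule condition is true"
```

The value passed to `humanizeDuration` in the annotation is this same difference in seconds.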
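config/alertmanager.yml:
yaml
global:
  resolve_timeout: 10m
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'wechat-bot'
  # Route by severity
  routes:
    - match:
        severity: critical
      repeat_interval: 20m
      receiver: 'wechat-critical'
receivers:
  # NOTE: the generic webhook receiver posts Alertmanager's own JSON schema and
  # does not render templates/wechat.tmpl into the request body. The group robot
  # expects a body with "msgtype", so a small relay that converts the payload is
  # needed between Alertmanager and the robot URL.
  - name: 'wechat-bot'
    webhook_configs:
      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_WEBHOOK_KEY'
        send_resolved: true
  - name: 'wechat-critical'
    webhook_configs:
      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_CRITICAL_WEBHOOK_KEY'
        send_resolved: true
templates:
  # Resolved relative to the directory that contains alertmanager.yml (config/)
  - '../templates/wechat.tmpl'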
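The group robot accepts JSON bodies that carry a `msgtype` field, which differs from the JSON Alertmanager's webhook receiver posts. As a sketch of the body the robot side expects, the helper below wraps arbitrary text in a markdown message; the function names are made up, and the escaping covers only backslashes, double quotes and newlines.

```bash
# Escape a string for embedding in a JSON string literal.
# Handles backslash, double quote and newline; other control characters are not covered.
json_escape() {
  printf '%s\n' "$1" |
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Build the body of a markdown message for the group robot.
wecom_markdown_body() {
  printf '{"msgtype": "markdown", "markdown": {"content": "%s"}}\n' "$(json_escape "$1")"
}

wecom_markdown_body 'WebsiteDown: https://api.example.com
status "firing"'
```

A relay between Alertmanager and the robot would build such a body from each incoming alert and POST it to the robot URL.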
templates/wechat.tmpl (WeCom message template):
text
{{ define "wechat.message" }}
{{ if gt (len .Alerts.Firing) 0 }}
{{ range .Alerts.Firing }}
{{ if eq .Labels.severity "critical" }}
🚨🚨 **[CRITICAL OUTAGE]** 🚨🚨
{{ else }}
⚠️ **[WEBSITE ALERT]** ⚠️
{{ end }}
📌 Website: {{ .Labels.instance }}
🔄 Status: {{ if eq .Status "firing" }}failing{{ else }}recovered{{ end }}
🔢 Error: {{ if .Annotations.summary }}{{ .Annotations.summary }}{{ else }}{{ .Labels.alertname }}{{ end }}
⏰ Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if .Annotations.description }}
📝 Details:
{{ .Annotations.description }}
{{ end }}
{{ if .Annotations.dashboard }}
🔗 More information: {{ .Annotations.dashboard }}
{{ end }}
{{ if eq .Labels.severity "critical" }}
<@all> Please handle immediately!
{{ end }}
━━━━━━━━━━━━━━━━━
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
{{ range .Alerts.Resolved }}
✅ **[RECOVERED]**
📌 Website: {{ .Labels.instance }}
⏱ Duration: {{ .EndsAt.Sub .StartsAt }}
🕒 Recovered at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
━━━━━━━━━━━━━━━━━
{{ end }}
{{ end }}
{{ end }}
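The duration shown for a recovered alert is the end time minus the start time. For anyone reproducing that line outside the template, for example in a relay script, here is a minimal sketch that formats a number of seconds in the same h/m/s style; the function name `fmt_duration` is made up.

```bash
# Format a number of seconds as hours, minutes and seconds (e.g. 1h15m23s).
fmt_duration() (
  total=$1
  h=$((total / 3600))
  m=$((total % 3600 / 60))
  s=$((total % 60))
  out=
  [ "$h" -gt 0 ] && out="${h}h"
  # Minutes are printed when non-zero, or when hours were already printed
  { [ "$m" -gt 0 ] || [ -n "$out" ]; } && out="${out}${m}m"
  printf '%s%ss\n' "$out" "$s"
)

fmt_duration 4523
```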
- Startup scripts
start-monitoring.sh:
bash
#!/bin/bash
# WeCom robot keys (replace with real values). They are only available to shell
# commands such as a manual curl test; Alertmanager does not expand environment
# variables in alertmanager.yml, so the keys must be set there as well.
export WEBHOOK_KEY="your_wechat_robot_key"
export CRITICAL_KEY="your_critical_robot_key"
# Prepare directories
mkdir -p data logs
# Stop all services; defined before the trap so it exists when Ctrl+C arrives
stop_services() {
  echo ""
  echo "🛑 Stopping services..."
  kill $PROM_PID $AM_PID $BB_PID
  rm -f data/*.pid
  echo "✅ All services stopped"
  exit 0
}
trap stop_services SIGINT
# Start Blackbox Exporter
echo "🚀 Starting Blackbox Exporter (port 9115)..."
./bin/blackbox_exporter --config.file=config/blackbox.yml > logs/blackbox.log 2>&1 &
BB_PID=$!
echo $BB_PID > data/blackbox.pid
# Start Alertmanager
echo "📡 Starting Alertmanager (port 9093)..."
./bin/alertmanager \
  --config.file=config/alertmanager.yml \
  --storage.path=data/alertmanager > logs/alertmanager.log 2>&1 &
AM_PID=$!
echo $AM_PID > data/alertmanager.pid
# Start Prometheus (--web.enable-lifecycle allows POST /-/reload)
echo "📊 Starting Prometheus (port 9090)..."
./bin/prometheus \
  --config.file=config/prometheus.yml \
  --storage.tsdb.path=data/prometheus \
  --web.enable-lifecycle > logs/prometheus.log 2>&1 &
PROM_PID=$!
echo $PROM_PID > data/prometheus.pid
# Show status
echo ""
echo "✅ Monitoring stack started!"
echo "------------------------------------------"
echo "Prometheus: http://localhost:9090"
echo "Alertmanager: http://localhost:9093"
echo "Blackbox Exporter: http://localhost:9115"
echo "Grafana dashboard: https://grafana.example.com (optional)"
echo ""
echo "Monitored websites:"
# Print the second field so targets with inline comments are still listed
awk '/- targets:[[:space:]]*$/{flag=1; next} flag && /^[[:space:]]*- https?:/{print $2; next} {flag=0}' config/prometheus.yml
echo "------------------------------------------"
echo "Press Ctrl+C to stop all services"
# Block until interrupted
wait
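Listing the monitored sites from prometheus.yml is easy to get wrong: dropping every line that contains `#` also drops targets that carry an inline comment. A sketch of a listing that reads the second field of each list entry instead; the function name `list_targets` and the sample file are made up for illustration.

```bash
# Print the URL targets listed under "- targets:" blocks in a Prometheus config.
# Only the second field of each entry is printed, so inline comments are ignored.
list_targets() {
  awk '/- targets:[[:space:]]*$/ { in_list = 1; next }
       in_list && $1 == "-" && $2 ~ /^https?:\/\// { print $2; next }
       { in_list = 0 }' "$1"
}

# Made-up sample config written to a temporary file
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
scrape_configs:
  - job_name: 'blackbox'
    static_configs:
      - targets: ['localhost:9115']
  - job_name: 'website_availability'
    static_configs:
      - targets:
          - https://www.example.com   # site 1
          - http://192.168.1.100      # internal site
    relabel_configs:
      - source_labels: [__address__]
EOF

list_targets "$cfg"
```

The inline `['localhost:9115']` list is skipped on purpose, since it names the exporter rather than a website.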
stop-monitoring.sh:
bash
#!/bin/bash
echo "🛑 Stopping the monitoring stack..."
# Kill each service whose PID file exists
for name in prometheus alertmanager blackbox; do
  if [ -f "data/$name.pid" ]; then
    kill "$(cat "data/$name.pid")"
  fi
done
rm -f data/*.pid
echo "✅ All services stopped"
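A PID file can outlive its process, for instance after a crash, in which case a plain `kill` prints an error. A slightly more defensive sketch checks that the process is alive first; the function name `stop_by_pidfile` is made up.

```bash
# Stop the process recorded in a PID file if it is still running, then remove the file.
stop_by_pidfile() {
  if [ ! -f "$1" ]; then
    echo "no PID file: $1"
    return 0
  fi
  pid=$(cat "$1")
  if kill -0 "$pid" 2>/dev/null; then   # signal 0 only tests whether the process exists
    kill "$pid"
    echo "stopped PID $pid"
  else
    echo "PID $pid is not running"
  fi
  rm -f "$1"
}

# Usage with the PID files written by the start script
for name in prometheus alertmanager blackbox; do
  stop_by_pidfile "data/$name.pid"
done
```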
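- Helper scripts for managing targets
add-website.sh (add a website to monitoring):
bash
#!/bin/bash
if [ $# -eq 0 ]; then
  echo "Usage: $0 <website URL>"
  echo "Example: $0 https://new-website.com"
  exit 1
fi
URL=$1
CONFIG="config/prometheus.yml"
# Check whether this exact URL is already a list entry
if awk -v url="$URL" '$1 == "-" && $2 == url { found = 1 } END { exit !found }' "$CONFIG"; then
  echo "❌ Website is already monitored: $URL"
  exit 1
fi
# Add the URL to the website_availability job only, after its first target and
# with the same indentation (matching every "- targets:" line would also
# change the blackbox and alertmanagers entries)
TEMP=$(mktemp)
awk -v url="$URL" '
  /job_name:/ { in_job = ($0 ~ /website_availability/) }
  { print }
  in_job && /- targets:[[:space:]]*$/ { want = 1; next }
  want { match($0, /^[[:space:]]*/); print substr($0, 1, RLENGTH) "- " url; want = 0 }
' "$CONFIG" > "$TEMP" && mv "$TEMP" "$CONFIG"
echo "✅ Added website to monitoring: $URL"
echo "Reloading the Prometheus configuration..."
# Needs Prometheus started with --web.enable-lifecycle; -f makes HTTP errors fail
if curl -fsS -X POST http://localhost:9090/-/reload; then
  echo "🔄 Configuration reloaded"
else
  echo "⚠️ Reload failed, please check the Prometheus status"
fi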
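remove-website.sh (remove a website from monitoring):
bash
#!/bin/bash
if [ $# -eq 0 ]; then
  echo "Usage: $0 <website URL>"
  echo "Example: $0 https://new-website.com"
  exit 1
fi
URL=$1
CONFIG="config/prometheus.yml"
TEMP=$(mktemp)
# Write a copy without the list entry whose URL matches exactly
awk -v url="$URL" '!($1 == "-" && $2 == url)' "$CONFIG" > "$TEMP"
# Check whether anything was actually removed
if [ "$(wc -l < "$CONFIG")" -eq "$(wc -l < "$TEMP")" ]; then
  echo "❌ Website is not monitored: $URL"
  rm "$TEMP"
  exit 1
fi
mv "$TEMP" "$CONFIG"
echo "✅ Removed website from monitoring: $URL"
echo "Reloading the Prometheus configuration..."
# Needs Prometheus started with --web.enable-lifecycle; -f makes HTTP errors fail
if curl -fsS -X POST http://localhost:9090/-/reload; then
  echo "🔄 Configuration reloaded"
else
  echo "⚠️ Reload failed, please check the Prometheus status"
fi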
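Removing a target by substring is risky: filtering out every line that contains `https://example.com` also removes `https://example.com.cn`. A sketch of an exact-match removal that compares the whole URL field; the function name `remove_target` and the sample list are made up.

```bash
# Print a config with the list entry for exactly this URL removed.
# Arguments: config file, URL. Longer URLs that contain the same text are kept.
remove_target() {
  awk -v url="$2" '!($1 == "-" && $2 == url)' "$1"
}

# Made-up sample list written to a temporary file
list=$(mktemp)
cat > "$list" <<'EOF'
      - targets:
          - https://example.com      # short name
          - https://example.com.cn   # must survive the removal
EOF

remove_target "$list" https://example.com
```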
- Make the scripts executable
bash
chmod +x *.sh
mkdir -p data logs
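WeCom robot configuration
1. Create two WeCom group robots:
Regular alert robot: for general alerts
Critical alert robot: for critical outages, configured to @ all members
2. Get each robot's webhook URL:
text
https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=XXXXX
3. Replace the keys in the configuration:
In config/alertmanager.yml, replace YOUR_WEBHOOK_KEY and YOUR_CRITICAL_WEBHOOK_KEY
In start-monitoring.sh, set the WEBHOOK_KEY and CRITICAL_KEY environment variables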
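Once a key is in place, a test message shows whether the robot accepts it. The robot API answers with a JSON body in which `errcode` is 0 on success, and curl exits with 0 even when that body reports an error, so the body itself has to be checked. A minimal sketch; the function name `wecom_ok` is made up and the failure response is only an illustrative sample.

```bash
# Succeed only when a robot API response reports errcode 0.
wecom_ok() {
  printf '%s' "$1" | grep -Eq '"errcode"[[:space:]]*:[[:space:]]*0([^0-9]|$)'
}

wecom_ok '{"errcode":0,"errmsg":"ok"}' && echo "delivered"
# Illustrative failure body; real codes and texts depend on the error
wecom_ok '{"errcode":93000,"errmsg":"invalid webhook url"}' || echo "rejected"
```

With a real key, the argument could be the output of `curl -s -X POST -H "Content-Type: application/json" -d '...' "<robot URL>"`.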
Usage guide
1. Start the monitoring stack
bash
./start-monitoring.sh
2. Add a website to monitoring
bash
./add-website.sh https://example.com
3. Stop the monitoring stack
bash
./stop-monitoring.sh
4. Verify the configuration
bash
# Check the Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "website_availability")'
# Check the alert rules
curl -s http://localhost:9090/api/v1/rules | jq .
# Send a test message to WeCom (export WEBHOOK_KEY in the current shell first)
curl -X POST -H "Content-Type: application/json" \
  -d '{"msgtype": "text", "text": {"content": "test message"}}' \
  "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$WEBHOOK_KEY"
WeCom message examples
Critical outage alert
text
🚨🚨 **[CRITICAL OUTAGE]** 🚨🚨
📌 Website: https://api.example.com
🔄 Status: failing
🔢 Error: Website unreachable: https://api.example.com
⏰ Triggered at: 2023-08-20 14:30:45
📝 Details:
https://api.example.com has been completely unreachable for more than 1 minute
🔗 More information: https://grafana.example.com/dashboard?var-site=https://api.example.com
<@all> Please handle immediately!
Slow website response
text
⚠️ **[WEBSITE ALERT]** ⚠️
📌 Website: https://app.example.com
🔄 Status: failing
🔢 Error: Website responding slowly: https://app.example.com
⏰ Triggered at: 2023-08-20 13:45:22
📝 Details:
https://app.example.com probe duration has stayed above 3 seconds (current: 4.75 s)
Website recovered
text
✅ **[RECOVERED]**
📌 Website: https://api.example.com
⏱ Duration: 1h15m23s
🕒 Recovered at: 2023-08-20 15:45:05
Best practices for monitoring multiple websites
- Manage websites in groups
yaml
# prometheus.yml
- job_name: 'website_main'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://www.example.com
        - https://api.example.com
  relabel_configs: [...]
- job_name: 'website_partner'
  metrics_path: /probe
  params:
    module: [https_2xx]
  static_configs:
    - targets:
        - https://partner1.com
        - https://partner2.com
  relabel_configs: [...]
- Geographically distributed monitoring
Deploy Prometheus in multiple regions and label each job with its region:
yaml
- job_name: 'website_us'
  static_configs:
    - targets: [...]
  relabel_configs:
    - target_label: region
      replacement: "us"
- job_name: 'website_eu'
  static_configs:
    - targets: [...]
  relabel_configs:
    - target_label: region
      replacement: "eu"
- Custom probe parameters
Add special-purpose modules to blackbox.yml, for example a probe that sends an authorization header:
yaml
modules:
  wordpress_auth:
    prober: http
    http:
      headers:
        Authorization: "Bearer YOUR_TOKEN"
      valid_status_codes: [200]
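- Grafana dashboard integration
Build a website health overview dashboard. Grafana queries Prometheus with PromQL rather than SQL, so each column of the overview becomes one query:
promql
# Average, minimum and maximum response time per site over the dashboard time range
avg_over_time(probe_duration_seconds{job="website_availability"}[$__range])
min_over_time(probe_duration_seconds{job="website_availability"}[$__range])
max_over_time(probe_duration_seconds{job="website_availability"}[$__range])
# Uptime percentage per site, highest first
sort_desc(avg_over_time(probe_success{job="website_availability"}[$__range]) * 100)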
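Performance tuning
1. Adjust the scrape interval:
yaml
# Set the interval according to each site's importance
- job_name: 'critical_websites'
  scrape_interval: 10s
  static_configs: [...]
- job_name: 'normal_websites'
  scrape_interval: 30s
  static_configs: [...]
2. Configure alert inhibition, so that a site that is down does not also raise slow-response and bad-status alerts:
yaml
# alertmanager.yml
inhibit_rules:
  - source_match:
      alertname: 'WebsiteDown'
    target_match_re:
      alertname: 'WebsiteSlowResponse|WebsiteBadStatus'
    equal: ['instance']
3. Limit data retention:
bash
# Prometheus startup flag
--storage.tsdb.retention.time=7d
4. Spread the probing load:
Deploy Blackbox Exporter instances on servers in different regions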
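Troubleshooting
1. Probe results are inaccurate
Check the Blackbox Exporter log: tail -f logs/blackbox.log
Run a probe by hand: curl -v 'http://localhost:9115/probe?target=https://example.com&module=http_2xx'
Raise the module timeout, for example timeout: 15s
2. Alerts are not delivered to WeCom
Check that the robot key is correct
Validate the Alertmanager configuration: ./bin/amtool check-config config/alertmanager.yml
Check the Alertmanager log: tail -f logs/alertmanager.log
3. False positives
Increase the for duration, for example for: 5m
Add a filter condition so the alert fires only when the probe failed but an HTTP response was still received: expr: probe_success == 0 and on(instance) probe_http_status_code{job="website_availability"} != 0
4. High resource usage
Lower the probe frequency, for example scrape_interval: 60s
Reduce the number of monitored websites
Use several Blackbox Exporter instances to share the load
With this setup you can monitor the health of multiple websites and notify the operations team through WeCom as soon as a problem appears.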