1. Core Architecture Overview
plaintext
┌───────────────────────────────────────────────────────────────────┐
│                      Prometheus Architecture                      │
├───────────────────────────────────────────────────────────────────┤
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐                   │
│  │Exporter│  │Exporter│  │Exporter│  │Exporter│  (:9100/9104)     │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘                   │
│      └───────────┴─────┬─────┴───────────┘                        │
│                        │ Pull (15s)                               │
│                        ▼                                          │
│                 ┌──────────────┐                                  │
│                 │  Prometheus  │──┐                               │
│                 │    Server    │  │    ┌──────────────┐           │
│                 │ ┌──────────┐ │  └───▶│ Alertmanager │──▶ notify │
│                 │ │   TSDB   │ │       └──────────────┘           │
│                 │ └──────────┘ │                                  │
│                 └──────┬───────┘                                  │
│                        ▼                                          │
│                 ┌──────────────┐  (:3000)                         │
│                 │   Grafana    │                                  │
│                 └──────────────┘                                  │
└───────────────────────────────────────────────────────────────────┘
Flow: exporters expose /metrics → Prometheus pulls on a fixed interval → the TSDB stores the samples → alerting rules fire and Alertmanager routes the notifications → Grafana queries Prometheus for dashboards
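To make the first step of the flow concrete, here is a minimal sketch of the plain-text payload an exporter serves at /metrics. The helper `render_metrics` and the metric names `demo_requests_total` and `demo_up` are hypothetical, and the sketch skips label-value escaping and the optional `# HELP` / `# TYPE` lines that real exporters emit.

```python
def render_metrics(samples):
    """Render samples in the Prometheus text exposition format.

    samples: list of (metric_name, labels_dict, value) tuples.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Label pairs are written as key="value", comma-separated.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples, for illustration only.
body = render_metrics([
    ("demo_requests_total", {"method": "GET", "code": "200"}, 42),
    ("demo_up", {}, 1),
])
print(body, end="")
```

Each line is one sample: a metric name, an optional label set, and a value. Prometheus adds the `job` and `instance` labels itself at scrape time.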
2. Deployment (Docker Compose)
yaml
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      # Needed by the file_sd job in prometheus.yml (targets/*.json)
      - ./prometheus/targets:/etc/prometheus/targets
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'   # enables POST /-/reload
    networks:
      - monitoring
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123   # change before exposing Grafana
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host'
      # $$ is the Compose escape for a literal $
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro
    networks:
      - monitoring
networks:
  monitoring:
    driver: bridge
volumes:
  prometheus_data:
  grafana_data:
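3. Core Configuration (prometheus.yml explained)
yaml
# prometheus.yml
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often rules are evaluated
  external_labels:
    cluster: 'prod'          # attached to alerts and to remote-written or federated series
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - "rules/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'prod'
  # File-based service discovery: target files are re-read without a restart
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - 'targets/*.json'
        refresh_interval: 30s
  # Kubernetes pod discovery: works when Prometheus runs in-cluster (otherwise set api_server)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
  - job_name: 'relabel_demo'
    static_configs:
      - targets: ['192.168.1.100:8080']
    relabel_configs:
      # Copy the host part of __address__ into the instance label
      - source_labels: [__address__]
        regex: '([^:]+):(\d+)'
        target_label: instance
        replacement: '${1}'
      # Attach a static label
      - target_label: env
        replacement: 'prod'
      # Drop any remaining __meta_* labels
      - regex: '__meta_.*'
        action: labeldrop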
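The first rule of the `relabel_demo` job can be hard to picture, so here is a simplified Python model of a `replace` relabel step. The function `relabel_replace` is a hypothetical illustration, not Prometheus code. It assumes the documented behaviour that source labels are joined with `;`, the regex is anchored at both ends, and a non-matching regex leaves the labels untouched.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label, replacement, separator=";"):
    """Simplified model of a Prometheus `replace` relabel step."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex must match the whole value
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate ${1}-style references into Python's \g<1> syntax.
    template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


labels = {"__address__": "192.168.1.100:8080"}
labels = relabel_replace(labels, ["__address__"], r"([^:]+):(\d+)", "instance", "${1}")
print(labels["instance"])
```

With the target from the config above, `instance` ends up holding only the host part of the address, which keeps the port out of dashboards and alert text.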
4. Deploying Exporters
Installing node_exporter
bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
sudo tee /etc/systemd/system/node-exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now node-exporter
Common exporters
| Exporter | Port | Target | Key metrics |
|---|---|---|---|
| node_exporter | 9100 | Linux | cpu/mem/disk/net |
| windows_exporter | 9182 | Windows | iis/sqlserver |
| mysqld_exporter | 9104 | MySQL | queries/connections |
| postgres_exporter | 9187 | PostgreSQL | queries/buffers |
| redis_exporter | 9121 | Redis | memory/commands |
| blackbox_exporter | 9115 | HTTP/TCP | probe_success |
| cadvisor | 8080 | Docker | container_* |
5. PromQL Query Basics
promql
# Instant vectors
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)  # CPU usage (%)
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))  # memory usage (%)
# Range vectors (the [5m] / [1h] selectors passed to rate() and increase())
rate(node_cpu_seconds_total{mode="user"}[5m])  # per-second rate of change
increase(http_requests_total[1h])  # increase over the window
# Aggregation
sum by (instance, job) (rate(node_cpu_seconds_total[5m]))
count(node_cpu_seconds_total)
max by (service) (http_request_duration_seconds_bucket)
# Functions
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600)  # predicted value 4 hours ahead
irate(node_cpu_seconds_total{mode="user"}[5m])  # instantaneous rate from the last two samples
label_replace(up{job="node"}, "hostname", "$1", "instance", "([^:]+):.*")  # copy the host part of instance into hostname
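Since most of the queries above are built on `rate()`, a rough Python sketch of the idea may help. The function `simple_rate` is a hypothetical simplification: it handles counter resets the way counters are meant to be read, but it leaves out the extrapolation to the window boundaries that Prometheus applies, so its numbers will differ slightly from a real query.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: handles counter resets, skips boundary extrapolation.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset, so count from zero.
        increase += cur - prev if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return increase / duration


# Hypothetical samples scraped every 15s; the counter resets before t=30.
samples = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]
print(simple_rate(samples))
```

The reset handling is the reason `rate()` and `increase()` should only be applied to counters, and always before any aggregation such as `sum`.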
6. Alertmanager Alerting Configuration
Alert rules
yaml
# rules/alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80%, current: {{ $value | printf \"%.2f\" }}%"
      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 3m
        labels:
          severity: warning
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 2m
        labels:
          severity: critical
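The `for:` clause in the rules above is a common source of confusion, so here is a small sketch of how an alert moves from inactive to pending to firing across rule evaluations. The function `alert_states` is a hypothetical model, assuming one evaluation every 15s as in the `evaluation_interval` shown earlier. It ignores details such as per-label-set tracking.

```python
def alert_states(condition_by_eval, for_seconds, eval_interval=15):
    """Trace the state of one alert across successive rule evaluations.

    condition_by_eval: list of booleans, one per evaluation of the expr.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval
        if not is_true:
            active_since = None  # the pending timer restarts from scratch
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once the expr has held for the whole `for` duration.
        states.append("firing" if now - active_since >= for_seconds else "pending")
    return states


# NodeDown uses `for: 1m`: five true evaluations, one false, two true.
print(alert_states([True] * 5 + [False] + [True] * 2, for_seconds=60))
```

A single evaluation where the expression is false resets the timer, which is why a flapping target can stay in pending without ever notifying anyone.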
Alertmanager configuration
yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'   # placeholder; without a username no SMTP auth is attempted
  smtp_auth_password: 'xxxxxx'
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    # `matchers` replaces the deprecated `match` field
    - matchers:
        - severity="critical"
      receiver: 'critical-receiver'
      group_wait: 10s
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
    # Slack also needs a webhook URL: set api_url here or slack_api_url under global
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'critical-receiver'
    webhook_configs:
      - url: 'http://dingtalk:8060/dingtalk/webhook'
inhibit_rules:
  # While a critical alert fires, mute warning alerts for the same instance
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']
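The inhibit rule at the end of the configuration can be read as a simple predicate. The sketch below models just that one rule in Python. The function `is_inhibited` and the sample alerts are hypothetical, and real Alertmanager matching supports regex matchers and several rules at once.

```python
def is_inhibited(target, firing_alerts, equal=("instance",)):
    """Simplified model of one inhibit rule: critical mutes warning."""
    if target.get("severity") != "warning":
        return False  # the rule only targets warning alerts
    for source in firing_alerts:
        if source.get("severity") != "critical":
            continue  # only critical alerts can act as a source
        # Every label listed under `equal` must have the same value on both alerts.
        if all(source.get(name) == target.get(name) for name in equal):
            return True
    return False


# Hypothetical alerts, for illustration only.
firing = [
    {"alertname": "NodeDown", "severity": "critical", "instance": "node-a:9100"},
]
cpu_a = {"alertname": "HighCPU", "severity": "warning", "instance": "node-a:9100"}
cpu_b = {"alertname": "HighCPU", "severity": "warning", "instance": "node-b:9100"}

print(is_inhibited(cpu_a, firing))  # same instance as the critical alert
print(is_inhibited(cpu_b, firing))  # different instance
```

The `equal` list is what keeps the rule scoped: without it, one critical alert anywhere would silence every warning in the cluster.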
7. Common Commands and API
bash
# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# TSDB operations
curl http://localhost:9090/api/v1/status/tsdb
# The two admin calls below require Prometheus to be started with --web.enable-admin-api
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="test"}'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
# HTTP API
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node"}'
curl -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=up{job="node"}' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'
curl http://localhost:9090/api/v1/targets
curl http://localhost:9090/api/v1/alerts
curl http://localhost:9090/api/v1/rules
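For scripting, the same instant-query endpoint can be called from Python's standard library. This is a minimal sketch: the helpers `build_query_url`, `parse_vector` and `query` are hypothetical names, the base URL assumes the local setup used in the curl commands, and the sample payload is hand-written to show the documented response shape rather than captured from a real server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """URL for an instant query, with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Turn an instant-vector response into {instance: float_value}."""
    if payload["status"] != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def query(base_url, promql):
    """Run the query against a live server (not called in this sketch)."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return parse_vector(json.load(resp))


# Offline demonstration with a hand-written sample response.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "node-exporter:9100", "job": "node"},
             "value": [1704067200, "1"]},
        ],
    },
}
print(parse_vector(sample))
```

Sample values arrive as strings inside a `[timestamp, "value"]` pair, which is why the sketch converts them with `float()`.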
8. Troubleshooting
Problem 1: Target Down
bash
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
nc -zv <target_ip> <port>
docker logs node-exporter
curl http://<target>:9100/metrics
Problem 2: Missing metrics
bash
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i <metric>
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=<metric_name>_total'
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'
Problem 3: Alerts not firing
bash
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="alerting")'
curl -s http://localhost:9093/api/v2/status
curl -s http://localhost:9093/api/v2/silences
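| Check | Purpose |
|---|---|
| up{job="xxx"} | Confirm target status |
| rate(x[5m]) > 0 | Verify the metric exists and is increasing |
| ALERTS{alertname="xxx"} | Check alert state |
| promtool check config prometheus.yml | Validate the configuration file |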
9. Best Practices
Naming conventions
yaml
# Metric name: <domain>_<subsystem>_<name>_<unit>
node_memory_MemAvailable_bytes
http_request_duration_seconds
# Labels: app_name, env, region, cluster, instance
# Avoid high-cardinality labels (user_id, ip, etc.)
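These conventions are easy to check mechanically. The sketch below is a hypothetical linter: `lint_metric`, the suffix list and the high-cardinality set are illustrative assumptions, with the set taken from the labels named above. The name pattern is the classic Prometheus metric-name character set.

```python
import re

# Classic metric-name character set: letters, digits, underscores, colons.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
# Hypothetical house list of accepted unit suffixes.
UNIT_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio", "_info")
# Labels called out above as high-cardinality.
HIGH_CARDINALITY = {"user_id", "ip"}


def lint_metric(name, label_names):
    """Return a list of naming problems; an empty list means the metric passes."""
    problems = []
    if not METRIC_NAME.fullmatch(name):
        problems.append("invalid characters in metric name")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("no recognised unit suffix")
    for label in sorted(HIGH_CARDINALITY & set(label_names)):
        problems.append(f"high-cardinality label: {label}")
    return problems


print(lint_metric("http_request_duration_seconds", ["env", "region"]))
print(lint_metric("http-requests", ["env", "user_id"]))
```

A check like this fits naturally into a CI step for application code that defines its own metrics.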
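Federation
yaml
- job_name: 'federate'
  honor_labels: true          # keep the job/instance labels from the source servers
  metrics_path: '/federate'
  params:
    # Matches every series; in production, narrow this (for example to recording rules)
    'match[]': ['{__name__=~".+"}']
  static_configs:
    - targets:
        - 'prometheus-prod:9090'
        - 'prometheus-prod2:9090'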
High availability
plaintext
┌─────────────┐    ┌─────────────┐
│ Prometheus  │    │ Prometheus  │   # dual write
│   Primary   │    │   Replica   │
└──────┬──────┘    └──────┬──────┘
       └────────┬─────────┘
                ▼
        ┌────────────────┐
        │ Thanos Receive │           # unified storage
        └────────────────┘
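Remote storage
yaml
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
    queue_config:
      capacity: 10000
      max_shards: 30
# remote_read needs a backend that implements the Prometheus remote-read API.
# Thanos Query is not known to expose one natively, so verify this endpoint
# against your deployment; historical data is usually queried through Thanos Query itself.
remote_read:
  - url: http://thanos-query:10912/api/v1/read
    read_recent: true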
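Performance tuning
- Label cardinality: avoid more than 100,000 label combinations
- Scrape interval: 5s for high-frequency targets, 60s for low-frequency ones
- Recording rules: pre-aggregate complex queries
- Storage cleanup: set a sensible retention period
- Federation partitioning: split Prometheus by service domain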
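The cardinality limit in the first item is easier to reason about with numbers. As a rough sketch, the worst-case series count for one metric is the product of the distinct values of each label. The function `estimate_series` and the label counts below are hypothetical.

```python
from math import prod


def estimate_series(label_value_counts):
    """Worst-case series count for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_value_counts.values())


# Hypothetical label value counts, for illustration only.
bounded = {"instance": 50, "method": 5, "status": 10, "handler": 40}
unbounded = dict(bounded, user_id=10_000)

print(estimate_series(bounded))
print(estimate_series(unbounded))
```

The bounded label set already sits at the 100,000 mark, and adding a single `user_id` label multiplies it by the number of users. Real counts are usually lower than the worst case because not every combination occurs.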