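I. Prerequisites (already completed)
✅ Rocky Linux 9
✅ Docker installed and running
✅ Prometheus 3.12.0 (installed)
✅ Alertmanager 0.33.0 (installed, with 163 mailbox email alerts working)
✅ Node Exporter (host monitoring)
✅ MySQL Exporter (mysqld_exporter, installed)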
II. Deploy cAdvisor with the official Quick Start
1. Start cAdvisor
bash
VERSION=0.55.1
# Remove any previous cadvisor container; ignore the error if none exists yet
sudo docker rm -f cadvisor 2>/dev/null || true
sudo docker pull ghcr.io/google/cadvisor:$VERSION && \
sudo docker run -d \
  --name=cadvisor \
  --restart=always \
  --privileged \
  --device=/dev/kmsg \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker:/var/lib/docker:ro \
  -v /dev/disk:/dev/disk:ro \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  -p 8080:8080 \
  ghcr.io/google/cadvisor:$VERSION
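After `docker run` returns, cAdvisor may need a few seconds before it answers HTTP requests. A small polling loop prevents the checks below from running too early. This is a minimal sketch. `wait_for_ok` is a hypothetical helper name. It assumes cAdvisor's `/healthz` endpoint on the published port 8080 and treats any HTTP 2xx response as "ready".

```bash
# Hypothetical helper: poll a URL until it returns HTTP 2xx or retries run out.
# Assumes cAdvisor's /healthz endpoint on the port published above (8080).
wait_for_ok() {
  local url=$1 tries=${2:-30} delay=${3:-2} i=1
  while [ "$i" -le "$tries" ]; do
    # -s silent, -f non-zero exit on HTTP errors, response body discarded
    if curl -sf -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $tries attempt(s): $url" >&2
  return 1
}

# Example: wait up to about 60 s (30 tries x 2 s) for cAdvisor
# wait_for_ok http://localhost:8080/healthz 30 2
```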
2. Verify cAdvisor
bash
sudo docker ps --filter name=cadvisor
curl -s http://localhost:8080/metrics | head
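`head` shows only the first few lines of a very long output. To confirm that cAdvisor actually exports the metric families used by the alert rules later in this guide, you can filter the dump instead. This is a minimal sketch. `check_metrics` is a hypothetical helper name, and it relies only on `grep` matching sample lines in the Prometheus text format.

```bash
# Hypothetical helper: read a /metrics dump on stdin and report whether the
# metric families used by the alert rules in this guide are present.
check_metrics() {
  local dump m missing=0
  dump=$(cat)
  for m in container_cpu_usage_seconds_total \
           container_memory_usage_bytes \
           container_spec_memory_limit_bytes; do
    # A sample line starts with the metric name followed by "{" or a space
    if printf '%s\n' "$dump" | grep -Eq "^${m}[{ ]"; then
      echo "ok: $m"
    else
      echo "missing: $m"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# curl -s http://localhost:8080/metrics | check_metrics
```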
III. Add Docker monitoring to Prometheus
1. Edit the Prometheus configuration
bash
sudo vim /etc/prometheus/prometheus.yml
Append the following under scrape_configs:
yaml
  # Docker container monitoring (cAdvisor official Quick Start)
  - job_name: "docker"
    static_configs:
      - targets: ["localhost:8080"]
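You can make the append step idempotent so that running it twice does not create a second `docker` job. Prometheus rejects a configuration with duplicate job names. This is a sketch under stated assumptions. `add_docker_job` is hypothetical. It assumes `scrape_configs` is the last top-level section of the file, as in the default `prometheus.yml`, so the appended lines land inside it. It also assumes jobs are indented by two spaces.

```bash
# Hypothetical helper: append the docker scrape job only if it is missing.
# Assumptions: scrape_configs is the LAST top-level section of the file and
# job entries are indented by two spaces (as in the default prometheus.yml).
add_docker_job() {
  local file=$1
  if grep -Eq '^[[:space:]]*-[[:space:]]*job_name:[[:space:]]*"?docker"?[[:space:]]*$' "$file"; then
    echo "docker job already present in $file"
    return 0
  fi
  cp "$file" "$file.bak"                      # keep a backup before editing
  # Make sure the file ends with a newline before appending
  if [ -n "$(tail -c1 "$file")" ]; then echo >> "$file"; fi
  cat >> "$file" <<'EOF'
  # Docker container monitoring (cAdvisor official Quick Start)
  - job_name: "docker"
    static_configs:
      - targets: ["localhost:8080"]
EOF
  echo "docker job appended to $file (backup: $file.bak)"
}

# Usage (as root):
# add_docker_job /etc/prometheus/prometheus.yml
```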
✅ Complete scrape_configs example
yaml
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
        # The label name is added as a label `label_name=<label_value>` to any timeseries scraped from this config.
        labels:
          app: "prometheus"
  # System monitoring (Node Exporter)
  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]
  # MySQL monitoring (mysqld_exporter)
  - job_name: "mysql"
    static_configs:
      - targets: ["localhost:9104"]
  # Alertmanager monitoring
  - job_name: "alertmanager"
    static_configs:
      - targets: ["localhost:9093"]
        labels:
          app: "alertmanager"
  # ↓↓ Docker monitoring ↓↓
  - job_name: "docker"
    static_configs:
      - targets: ["localhost:8080"]
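Each target in this list needs a process listening on its port, or the job shows DOWN. Before restarting, you can run a quick pre-check against the output of `ss -ltn` (iproute2). This is a minimal sketch. `check_target_ports` is a hypothetical helper, and its port list is copied from the targets above. It only confirms that something is listening on each port, not that it is the expected exporter.

```bash
# Hypothetical helper: read `ss -ltn` output on stdin and check that every
# port used in the scrape config above has a listener
# (9090 Prometheus, 9100 node_exporter, 9104 mysqld_exporter,
#  9093 Alertmanager, 8080 cAdvisor).
check_target_ports() {
  local listening port rc=0
  # Skip the header; column 4 is "Local Address:Port", the port is after the last ":"
  listening=$(awk 'NR > 1 { n = split($4, a, ":"); print a[n] }' | sort -u)
  for port in 9090 9100 9104 9093 8080; do
    if printf '%s\n' "$listening" | grep -qx "$port"; then
      echo "listening: $port"
    else
      echo "not listening: $port"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
# ss -ltn | check_target_ports
```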
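IV. Docker alert rules (production-grade)
Add this group under the top-level groups: key of the rule file that rule_files in prometheus.yml points to.
yaml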
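  # ======================
  # Docker container alerts
  # ======================
  - name: docker
    rules:
      - alert: DockerDown
        expr: up{job="docker"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Docker monitoring service (cAdvisor) is unavailable"
          description: "cAdvisor {{ $labels.instance }} has been unreachable for more than 1 minute"
      - alert: ContainerHighCPU
        # Standalone cAdvisor labels Docker containers with "name"; the "container"
        # label (and the "POD" value) only exist when the kubelet exposes cAdvisor metrics.
        # rate() of CPU seconds = cores in use, so 0.8 means 80% of one core.
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) > 0.8
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage is high"
          description: "Container {{ $labels.name }} is using more than 80% of one CPU core"
      - alert: ContainerHighMemory
        # Only containers with a memory limit are evaluated; a limit of 0 (no limit set)
        # would otherwise divide by zero (+Inf) and fire for every unlimited container.
        expr: |
          sum by (name) (container_memory_usage_bytes{name!=""})
            / sum by (name) (container_spec_memory_limit_bytes{name!=""} > 0) > 0.8
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Container memory usage is high"
          description: "Container {{ $labels.name }} is using more than 80% of its memory limit"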
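The thresholds in these rules are easier to judge with the arithmetic written out. The CPU rule takes the per-second rate of a CPU-seconds counter, which equals the number of cores in use. The memory rule divides usage by the limit, and a limit of 0 means no limit is set, so such containers need to be excluded rather than divided by. The helpers below are illustrative only. They use hypothetical names and `awk` for floating point, and they ignore counter resets and the extrapolation that PromQL's `rate()` performs.

```bash
# Illustrative helpers (hypothetical names), simplified: no counter resets,
# no extrapolation as done by PromQL rate().

# CPU seconds consumed per wall-clock second = cores in use.
cpu_cores_used() {  # args: counter_start counter_end window_seconds
  awk -v a="$1" -v b="$2" -v w="$3" 'BEGIN { printf "%.2f\n", (b - a) / w }'
}

# Memory usage as a fraction of the limit; a limit of 0 means "no limit set".
mem_ratio() {  # args: usage_bytes limit_bytes
  awk -v u="$1" -v l="$2" 'BEGIN {
    if (l <= 0) { print "no-limit"; exit }
    printf "%.2f\n", u / l
  }'
}

cpu_cores_used 100 340 300      # 240 CPU-seconds over 5 min -> 0.80 cores
mem_ratio 900000000 1073741824  # 900 MB of a 1 GiB limit   -> 0.84
mem_ratio 900000000 0           # container without a limit -> no-limit
```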
V. Check and restart
bash
promtool check config /etc/prometheus/prometheus.yml
sudo systemctl restart prometheus
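Prometheus does not start with an invalid configuration, so restarting after a failed check would leave monitoring down. Chaining the two commands means the restart only happens when the check passes. This is a minimal sketch. `safe_restart_prometheus` is a hypothetical name, and the function must run as root.

```bash
# Hypothetical helper: restart Prometheus only if promtool accepts the config
# (promtool check config also checks the rule files listed under rule_files).
# Run as root; the default path is the one used in this guide.
safe_restart_prometheus() {
  local cfg=${1:-/etc/prometheus/prometheus.yml}
  if promtool check config "$cfg"; then
    systemctl restart prometheus && echo "prometheus restarted"
  else
    echo "config check failed; prometheus NOT restarted" >&2
    return 1
  fi
}

# Usage (as root):
# safe_restart_prometheus
```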
VI. Verify Docker monitoring
1. Open in a browser:
http://<server-IP>:9090 → Status → Targets, and confirm that the docker job is UP
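You can script the same check against the Prometheus HTTP API (`/api/v1/query`) using the query `up{job="docker"}`. A value of `"1"` means the last scrape succeeded. This is a sketch that avoids `jq` by matching the JSON with `grep`. `docker_target_up` is a hypothetical helper, and the pattern assumes the response format `"value":[<timestamp>,"<value>"]`.

```bash
# Hypothetical helper: read a /api/v1/query JSON response on stdin and succeed
# if a sample with value "1" (target up) is present.
docker_target_up() {
  grep -Eq '"value": *\[ *[0-9.e+]+ *, *"1" *\]'
}

# Usage:
# curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="docker"}' \
#   | docker_target_up && echo "docker target is UP"
```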

2. Trigger a test Docker alert
bash
sudo docker stop cadvisor
Wait 1 to 2 minutes (the rule uses for: 1m, so the alert is Pending before it turns Firing):
Prometheus → Alerts → DockerDown = Firing

✅ The 163 mailbox receives the Docker alert email

Restore:
bash
sudo docker start cadvisor
✅ A Resolved email is received
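Besides the UI and the mailbox, you can read the alert state from Alertmanager's v2 API (`/api/v2/alerts` on port 9093, the Alertmanager target in the scrape config). This endpoint lists alerts that have not yet been resolved, so DockerDown should appear while cadvisor is stopped and disappear after it is started again. This is a minimal `grep`-based sketch, and `alert_active` is a hypothetical helper.

```bash
# Hypothetical helper: read Alertmanager /api/v2/alerts JSON on stdin and
# succeed if an alert with the given alertname is listed.
alert_active() {  # arg: alertname, e.g. DockerDown
  grep -Eq "\"alertname\": *\"$1\""
}

# Usage (run while cadvisor is stopped, then again after it is started):
# if curl -s http://localhost:9093/api/v2/alerts | alert_active DockerDown; then
#   echo "DockerDown is active"
# else
#   echo "DockerDown is not active"
# fi
```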
