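Prometheus-02: Installation, Deployment, and Configuration Management in Detail
Covers three deployment methods (binary, Docker, Kubernetes) and configuration management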
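Related documentation
Official documentation
- Prometheus official installation guide - detailed installation steps and configuration notes
- Prometheus configuration documentation - full reference for configuration parameters
- Prometheus on Docker Hub - official Docker images
- Alertmanager installation documentation - installing and configuring the alert manager
GitHub projects
- Prometheus main project - source code and releases
- Node Exporter - system metrics collector
- Alertmanager project - alert management component
- Grafana project - data visualization platform
Chinese-language community resources
- Prometheus Chinese documentation - Chinese documentation maintained by the cloud native community
- Prometheus Monitoring in Practice - a detailed hands-on tutorial
- Container Monitoring Practice Guide - Docker monitoring stack examples
Deployment tools and scripts
- Prometheus Operator - Kubernetes operator
- kube-prometheus - Kubernetes monitoring stack
- Ansible Prometheus Role - automated deployment scripts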
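1. Prometheus Installation and Deployment Overview
1.1 Choosing a deployment method
Prometheus supports several deployment methods; choose the one that fits your environment and requirements:
Direct binary deployment
- Use case: traditional physical or virtual machines
- Advantages: simple to deploy, no container or orchestration overhead, easy to debug
- Disadvantages: services must be managed manually; limited scalability
Docker container deployment
- Use case: containerized environments, quick testing and validation
- Advantages: environment isolation, easy version management, fast deployment
- Disadvantages: additional container management overhead
Kubernetes cluster deployment
- Use case: cloud native environments, large-scale cluster monitoring
- Advantages: high availability, automatic scaling, service discovery
- Disadvantages: higher complexity; requires basic Kubernetes knowledge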
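1.2 System requirements
Hardware requirements
yaml
# Minimum configuration
CPU: 2 cores
Memory: 4 GB RAM
Disk: 50 GB (SSD recommended)
Network: 1 Gbps
# Recommended configuration (production)
CPU: 8 cores or more
Memory: 16 GB RAM or more
Disk: 200 GB+ SSD, IOPS > 3000
Network: 10 Gbps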
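The disk figures above are starting points. The Prometheus storage documentation gives a rough sizing rule: disk needed is approximately retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes after compression). The sketch below applies that rule; the ingestion rate of 10,000 samples per second and the 2 bytes per sample are assumptions for illustration, so substitute values measured on your own server.

```shell
# Estimate TSDB disk usage: retention seconds * samples per second * bytes per sample.
# Usage: estimate_disk_bytes RETENTION_DAYS SAMPLES_PER_SECOND BYTES_PER_SAMPLE
estimate_disk_bytes() {
  echo $(( $1 * 86400 * $2 * $3 ))
}

# Convert bytes to whole decimal gigabytes (rounded down).
bytes_to_gb() {
  echo $(( $1 / 1000000000 ))
}

# Assumed workload: 30 days retention, 10000 samples/s, 2 bytes per sample.
bytes=$(estimate_disk_bytes 30 10000 2)
echo "estimated disk: ${bytes} bytes (~$(bytes_to_gb "$bytes") GB)"
```

On a running server, the ingestion rate can be read from the query `rate(prometheus_tsdb_head_samples_appended_total[5m])`.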
Supported operating systems
- Linux (CentOS 7+, Ubuntu 18.04+, RHEL 7+)
- macOS (development and testing)
- Windows (test environments)
2. Binary Installation
2.1 Downloading the release archives
bash
# Set version variables
PROMETHEUS_VERSION="2.45.0"
ALERTMANAGER_VERSION="0.25.0"
NODE_EXPORTER_VERSION="1.6.0"
# Create directories
sudo mkdir -p /opt/prometheus
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mkdir -p /var/log/prometheus
# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
# Download Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
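Before unpacking, it is worth checking each archive against the checksums published with the release. The following is a minimal sketch, assuming the release page provides a `sha256sums.txt` file in the usual `<digest>  <filename>` format and that `sha256sum` is installed; the helper names are illustrative, not part of any Prometheus tooling.

```shell
# Verify a file against an expected SHA-256 digest.
# Returns 0 on match, 1 on mismatch or if the file is missing.
verify_sha256() {
  file="$1"; expected="$2"
  [ -f "$file" ] || return 1
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  [ "$actual" = "$expected" ]
}

# Look up the expected digest for a file name in a sha256sums.txt listing.
expected_sha256() {
  sums_file="$1"; name="$2"
  awk -v n="$name" '$2 == n { print $1 }' "$sums_file"
}

# Example usage (assumes sha256sums.txt was downloaded next to the archive):
#   f="prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
#   verify_sha256 "$f" "$(expected_sha256 sha256sums.txt "$f")" || echo "checksum mismatch: $f" >&2
```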
2.2 Installing Prometheus
bash
# Unpack (the archives were downloaded to /tmp)
cd /tmp
tar -xzf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROMETHEUS_VERSION}.linux-amd64
# Copy the binaries and console files
sudo cp prometheus promtool /usr/local/bin/
sudo cp -r consoles/ console_libraries/ /etc/prometheus/
# Create the service user and set permissions
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /var/log/prometheus
sudo chmod 755 /usr/local/bin/prometheus /usr/local/bin/promtool
2.3 Configuring Prometheus
Create the main configuration file:
yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-1'
rule_files:
  - "/etc/prometheus/rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 5s
    metrics_path: /metrics
  # Node Exporter monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'localhost:9100'
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
    scrape_interval: 10s
    scrape_timeout: 5s
  # Example application monitoring
  - job_name: 'webapp'
    static_configs:
      - targets: ['app1.example.com:8080', 'app2.example.com:8080']
    metrics_path: '/actuator/prometheus'
    scrape_interval: 30s
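When a job sets both `scrape_interval` and `scrape_timeout`, as the `node-exporter` job above does, Prometheus refuses to load the configuration if the explicit timeout is larger than the interval. The sketch below shows that comparison for simple single-unit durations such as `5s` or `5m`. It is an illustration only, with hypothetical helper names, and is not a substitute for `promtool check config`.

```shell
# Convert a simple duration with a single unit (s, m, h, d) to seconds.
duration_to_seconds() {
  value=${1%[smhd]}
  unit=${1#"$value"}
  case "$value" in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo $(( value * 60 )) ;;
    h) echo $(( value * 3600 )) ;;
    d) echo $(( value * 86400 )) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Succeeds when TIMEOUT ($1) does not exceed INTERVAL ($2).
timeout_fits_interval() {
  t=$(duration_to_seconds "$1") || return 2
  i=$(duration_to_seconds "$2") || return 2
  [ "$t" -le "$i" ]
}

# The node-exporter job: scrape_timeout 5s, scrape_interval 10s.
timeout_fits_interval 5s 10s && echo "ok: timeout fits within interval"
```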
2.4 Creating the systemd service
ini
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --storage.tsdb.retention.size=50GB \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.max-connections=512 \
    --web.read-timeout=5m \
    --log.level=info \
    --log.format=logfmt
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectHome=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/prometheus
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
[Install]
WantedBy=multi-user.target
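Long `ExecStart` lines make it easy to list the same flag twice. A repeated flag is at best redundant and may stop the service from starting. As a quick sanity check, the sketch below prints every long flag that occurs more than once in a unit file; it assumes GNU or BSD `grep` with `-o` support, and the function name is illustrative.

```shell
# Print every long flag (--name) that appears more than once on stdin.
find_duplicate_flags() {
  grep -oE -- '--[a-zA-Z0-9][a-zA-Z0-9._-]*' | sort | uniq -d
}

# Example usage:
#   find_duplicate_flags < /etc/systemd/system/prometheus.service
```

An empty result means no flag is repeated. After editing a unit file, `sudo systemctl daemon-reload` followed by `sudo systemctl enable --now prometheus` loads and starts the service.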
2.5 Installing Node Exporter
bash
# Unpack (the archives were downloaded to /tmp)
cd /tmp
tar -xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cd node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64
# Copy the binary
sudo cp node_exporter /usr/local/bin/
# Create the service user
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Node Exporter systemd service:
ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://prometheus.io/docs/guides/node-exporter/
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=0.0.0.0:9100 \
    --path.procfs=/proc \
    --path.sysfs=/sys \
    --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)" \
    --collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$$"
[Install]
WantedBy=multi-user.target
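In a systemd unit, `$$` is the escape for a literal `$`, so the exclusion patterns above reach Node Exporter with a single end-of-string anchor. To see which mount points such a pattern drops, it can be tried against sample paths. The sketch below uses a shortened form of the pattern with `grep -E`; Node Exporter uses Go regular expressions, but for a pattern this simple the two should agree.

```shell
# Shortened mount-point exclusion pattern, written with a single "$"
# because no systemd escaping applies in a shell script.
MOUNT_EXCLUDE='^/(sys|proc|dev|host|etc)($|/)'

# Succeeds when the given mount point would be excluded.
is_excluded_mount() {
  printf '%s\n' "$1" | grep -Eq "$MOUNT_EXCLUDE"
}

for m in /proc /sys/fs/cgroup /data /devices; do
  if is_excluded_mount "$m"; then
    echo "$m: excluded"
  else
    echo "$m: kept"
  fi
done
```

The trailing `($|/)` group is what keeps a path such as `/devices` from being excluded just because it starts with `/dev`.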
2.6 Installing Alertmanager
bash
# Unpack (the archives were downloaded to /tmp)
cd /tmp
tar -xzf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
cd alertmanager-${ALERTMANAGER_VERSION}.linux-amd64
# Copy the binaries and create directories
sudo cp alertmanager amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
# Create the service user and set permissions
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
Alertmanager configuration:
yaml
# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourcompany.com'
  smtp_auth_username: 'alerts@yourcompany.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      group_interval: 5m
      repeat_interval: 30m
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@yourcompany.com'
        # email_configs has no subject or body field:
        # the subject is set through headers, the plain-text body through text
        headers:
          Subject: '[ALERT] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@yourcompany.com'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |
          CRITICAL ALERT!
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}
    # Slack incoming webhooks expect Slack's own payload format,
    # so use slack_configs rather than the generic webhook_configs
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        send_resolved: true
Alertmanager systemd service:
ini
# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager for Prometheus
Documentation=https://prometheus.io/docs/alerting/alertmanager/
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager/ \
    --web.listen-address=0.0.0.0:9093 \
    --web.external-url=http://localhost:9093 \
    --cluster.listen-address=0.0.0.0:9094 \
    --log.level=info
[Install]
WantedBy=multi-user.target
3. Docker Container Deployment
3.1 Deploying with Docker Compose
Create the complete monitoring stack:
yaml
# docker-compose.yml
version: '3.8'
networks:
  monitoring:
    driver: bridge
volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
x-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
services:
  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
      # The admin API exposes TSDB delete and snapshot endpoints; enable it only on trusted networks
      - '--web.enable-admin-api'
      - '--log.level=info'
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/ready"]
      interval: 30s
      timeout: 10s
      retries: 5
  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging
  grafana:
    image: grafana/grafana:10.0.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    environment:
      # Example password only; replace it before exposing Grafana beyond a test machine
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-worldmap-panel
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging
  node-exporter:
    image: prom/node-exporter:v1.6.0
    container_name: node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      # "$$" is the Compose escape for a literal "$"
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging
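A fixed Grafana admin password in the Compose file is fine for a throwaway test but should not be reused elsewhere. One option is to generate a random secret into the project's `.env` file, which Compose reads for variable substitution. The sketch below assumes the Compose file is changed to `GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}`; the variable and function names are illustrative.

```shell
# Generate a 32-character hexadecimal secret from the kernel random source.
generate_secret() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# Write GRAFANA_ADMIN_PASSWORD to an env file readable only by its owner.
write_env_file() {
  env_file="$1"
  ( umask 077 && printf 'GRAFANA_ADMIN_PASSWORD=%s\n' "$(generate_secret)" > "$env_file" )
}

# Example usage, run once in the directory that holds docker-compose.yml:
#   write_env_file .env
```

Keep the `.env` file out of version control. Note that Grafana applies this setting only when it first initializes its database, so it does not change the password of an existing `grafana_data` volume.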
3.2 Container configuration files
Prometheus configuration (container version):
yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "/etc/prometheus/rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']
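3.3 Starting and managing the stack
bash
# Start the stack
docker-compose up -d
# Show service status
docker-compose ps
# Follow logs
docker-compose logs -f prometheus
docker-compose logs -f grafana
# Restart a service
docker-compose restart prometheus
# Reload Prometheus after a configuration change (SIGHUP to PID 1 in the container)
docker-compose exec prometheus kill -HUP 1
# Stop the stack
docker-compose down
# Full cleanup (also deletes the data volumes)
docker-compose down -v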
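After `docker-compose up -d` returns, the containers are running but Prometheus may still be replaying its write-ahead log, so scripts that query it immediately can fail. A small retry helper makes such scripts wait for the `/-/ready` endpoint. This is a minimal sketch; the function name and the attempt and delay values are arbitrary choices, and the usage line assumes `curl` is available on the host.

```shell
# Run a command until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between tries.
# Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$(( n + 1 ))
    sleep "$delay"
  done
}

# Example usage: wait up to about 60 seconds for Prometheus to become ready.
#   retry 30 2 curl -fsS -o /dev/null http://localhost:9090/-/ready
```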
4. Kubernetes Cluster Deployment
4.1 Using Prometheus Operator
Install Prometheus Operator through the kube-prometheus-stack Helm chart:
bash
# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --set prometheus.prometheusSpec.retention=30d \
    --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
    --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
    --set grafana.persistence.enabled=true \
    --set grafana.persistence.size=10Gi
4.2 Custom resource configuration
Prometheus custom resource:
yaml
# prometheus-server.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-server
  namespace: monitoring
spec:
  serviceAccountName: prometheus-server
  serviceMonitorSelector:
    matchLabels:
      app: prometheus
  ruleSelector:
    matchLabels:
      app: prometheus
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m
  retention: 30d
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/monitoring
                operator: In
                values: ["true"]
ServiceMonitor configuration (its `app: prometheus` label is what the `serviceMonitorSelector` of the Prometheus resource matches):
yaml
# servicemonitor-node-exporter.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: node-exporter
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      honorLabels: true
5. Configuration Management Best Practices
5.1 Organizing configuration files
bash
# Recommended directory layout
/etc/prometheus/
├── prometheus.yml          # Main configuration file
├── rules/                  # Alerting rules
│   ├── node-rules.yml
│   ├── app-rules.yml
│   └── kubernetes-rules.yml
├── targets/                # Service discovery targets
│   ├── static-targets.yml
│   └── file-sd/
├── tls/                    # TLS certificates
│   ├── prometheus.crt
│   └── prometheus.key
└── templates/              # Alert templates
    └── email.tmpl
5.2 Dynamic configuration management
File-based service discovery configuration:
yaml
# In prometheus.yml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s
Dynamic target file, saved as /etc/prometheus/targets/web-servers.json (JSON does not allow comments, so the path must not appear inside the file):
json
[
  {
    "targets": ["web1.example.com:9100", "web2.example.com:9100"],
    "labels": {
      "job": "web-servers",
      "env": "production",
      "team": "backend"
    }
  },
  {
    "targets": ["db1.example.com:9100", "db2.example.com:9100"],
    "labels": {
      "job": "database-servers",
      "env": "production",
      "team": "dba"
    }
  }
]
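Target files like the one above are usually written by a script or a deployment tool rather than by hand. Because Prometheus watches these files and re-reads them when they change, it is safer to write a temporary file and rename it into place so that a half-written file is never picked up. The sketch below generates a single target group this way; the function names are illustrative, and it assumes host names and label values contain no characters that need JSON escaping.

```shell
# Print one file_sd target group as JSON.
# Usage: make_target_group JOB ENV HOST:PORT [HOST:PORT...]
make_target_group() {
  job="$1"; env_name="$2"; shift 2
  targets=""
  for t in "$@"; do
    targets="${targets:+$targets, }\"$t\""
  done
  printf '[{"targets": [%s], "labels": {"job": "%s", "env": "%s"}}]\n' \
    "$targets" "$job" "$env_name"
}

# Write the target group to DEST through a temporary file in the same
# directory, then rename it into place (an atomic replace on one filesystem).
# Usage: write_targets_atomically DEST JOB ENV HOST:PORT [HOST:PORT...]
write_targets_atomically() {
  dest="$1"; shift
  tmp=$(mktemp "${dest}.XXXXXX") || return 1
  if make_target_group "$@" > "$tmp"; then
    # mktemp creates the file with mode 600; make it readable by the prometheus user
    chmod 644 "$tmp" && mv "$tmp" "$dest"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Example usage:
#   write_targets_atomically /etc/prometheus/targets/web-servers.json \
#     web-servers production web1.example.com:9100 web2.example.com:9100
```

The temporary name ends in a random suffix rather than `.json`, so it does not match the `*.json` glob while it is being written.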
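5.3 Configuration validation tools
bash
# Check configuration syntax
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
# Check rule syntax
/usr/local/bin/promtool check rules /etc/prometheus/rules/*.yml
# Run an instant query against a running server (promtool needs the server URL)
/usr/local/bin/promtool query instant http://localhost:9090 'up'
# Hot-reload the configuration (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Health checks
curl http://localhost:9090/-/healthy
curl http://localhost:9090/-/ready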
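The check and reload commands above are most useful when chained, so that a reload is attempted only after the configuration has passed validation. The wrapper below is a minimal sketch of that idea. The `PROMTOOL` and `PROMETHEUS_URL` variables are illustrative, and the reload request works only if the server was started with `--web.enable-lifecycle`.

```shell
# Path to promtool and base URL of the server; both can be overridden.
PROMTOOL="${PROMTOOL:-promtool}"
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# Validate the given configuration file and reload Prometheus only if it passes.
# Usage: safe_reload /etc/prometheus/prometheus.yml
safe_reload() {
  config="$1"
  if ! "$PROMTOOL" check config "$config"; then
    echo "config check failed; not reloading" >&2
    return 1
  fi
  curl -fsS -X POST "$PROMETHEUS_URL/-/reload"
}
```

For the systemd installation, `sudo systemctl reload prometheus` can replace the `curl` call, since the unit's `ExecReload` sends SIGHUP to the process.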
6. Monitoring Templates
6.1 Physical server monitoring template
Complete monitoring configuration for physical servers:
yaml
# Scrape job for physical servers (add under scrape_configs in prometheus.yml)
- job_name: 'physical-servers'
  static_configs:
    - targets:
        - 'server1.prod.com:9100'
        - 'server2.prod.com:9100'
        - 'server3.prod.com:9100'
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  relabel_configs:
    # Use the host name without the port as the instance label.
    # __address__ is left untouched so that each target is scraped directly.
    - source_labels: [__address__]
      regex: '([^:]+)(?::\d+)?'
      target_label: instance
      replacement: '$1'
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'node_cpu_seconds_total'
      target_label: __tmp_cpu_metric
      replacement: 'true'
# Alerting rules for physical servers (a separate rule file, e.g. /etc/prometheus/rules/node-rules.yml)
groups:
  - name: physical-server-alerts
    rules:
      - alert: HostDown
        expr: up{job="physical-servers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Physical server {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
      - alert: HighCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85%"
      - alert: DiskSpaceUsage
        expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is above 85% on {{ $labels.mountpoint }}"
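The threshold expressions above are simple ratios, and working one through by hand helps when choosing thresholds. The sketch below reproduces the `HighMemoryUsage` arithmetic for a hypothetical host with 16 GiB of memory and 2 GiB available. The byte values are made up for illustration, and the function names are not part of any Prometheus tooling.

```shell
# Percentage of memory in use: (1 - available / total) * 100.
# Usage: mem_usage_pct AVAILABLE_BYTES TOTAL_BYTES
mem_usage_pct() {
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Succeeds when VALUE is strictly greater than THRESHOLD (the alert condition holds).
exceeds_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Hypothetical host: 2 GiB available out of 16 GiB.
pct=$(mem_usage_pct 2147483648 17179869184)
if exceeds_threshold "$pct" 85; then
  echo "memory usage ${pct}% is above 85%: HighMemoryUsage condition holds"
fi
```

With these inputs the usage is 87.5%, so the expression would be true; the alert itself fires only after the condition has held for the 5 minutes set in `for`.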
6.2 Container monitoring template
Docker container monitoring configuration:
yaml
# Scrape job for containers (add under scrape_configs in prometheus.yml)
- job_name: 'docker-containers'
  static_configs:
    - targets: ['cadvisor:8080']
  scrape_interval: 15s
  metrics_path: /metrics
  relabel_configs:
    # Note: __meta_docker_* labels are populated only by docker_sd_configs;
    # with static_configs these two rules have no effect
    - source_labels: [__meta_docker_container_name]
      target_label: container_name
    - source_labels: [__meta_docker_container_label_com_docker_compose_service]
      target_label: compose_service
# Kubernetes Pod monitoring
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
# Alerting rules for containers (a separate rule file)
groups:
  - name: container-alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name!=""}) or (time() - container_last_seen{name!=""}) > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"
      - alert: ContainerHighCPU
        expr: sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"
      - alert: ContainerHighMemory
        # The "> 0" filter skips containers without a memory limit (limit reported as 0),
        # which would otherwise divide by zero and always fire
        expr: (container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high memory usage"
6.3 Application monitoring template
Java application monitoring configuration:
yaml
# Java application monitoring
- job_name: 'java-applications'
  static_configs:
    - targets:
        -