Prometheus-02: Installation, Deployment, and Configuration Management in Detail

Covers three deployment methods (binary, Docker, Kubernetes) and configuration management

1. Prometheus Installation and Deployment Overview

1.1 Choosing a Deployment Method

Prometheus supports several deployment methods; choose the one that fits your environment and requirements:

Binary deployment
  • Best for: traditional physical or virtual machine environments
  • Pros: simple to deploy, best performance, easy to debug
  • Cons: services must be managed by hand, limited scalability
Docker deployment
  • Best for: containerized environments, quick testing and validation
  • Pros: environment isolation, convenient version management, fast deployment
  • Cons: extra container-management overhead
Kubernetes deployment
  • Best for: cloud-native environments, large-scale cluster monitoring
  • Pros: high availability, automatic scaling, service discovery
  • Cons: higher complexity, requires Kubernetes knowledge

1.2 System Requirements

Hardware requirements
yaml
# Minimum configuration
CPU: 2 cores
Memory: 4GB RAM
Disk: 50GB SSD (recommended)
Network: 1Gbps

# Recommended configuration (production)
CPU: 8 cores or more
Memory: 16GB RAM or more
Disk: 200GB+ SSD, IOPS > 3000
Network: 10Gbps

Supported operating systems
  • Linux (CentOS 7+, Ubuntu 18.04+, RHEL 7+)
  • macOS (development and testing)
  • Windows (test environments)
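As a quick sanity check, a candidate Linux host can be compared against the minimum sizing with a short script. This is a minimal sketch: the thresholds mirror the minimum configuration above, and the /proc/meminfo path is Linux-specific.

```shell
#!/usr/bin/env bash
# Pre-flight check of host resources against the minimum sizing above.
MIN_CORES=2
MIN_MEM_GB=4

cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
if [ -r /proc/meminfo ]; then
  mem_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
  mem_gb=$((mem_kb / 1024 / 1024))
else
  mem_gb=0  # non-Linux platform; memory check is skipped
fi

echo "CPU cores: ${cores} (minimum ${MIN_CORES})"
echo "Memory:    ${mem_gb} GiB (minimum ${MIN_MEM_GB})"

[ "$cores" -ge "$MIN_CORES" ] || echo "WARN: below minimum CPU cores"
[ "$mem_gb" -ge "$MIN_MEM_GB" ] || echo "WARN: below minimum memory"
```

Disk throughput and IOPS depend on the volume backing /var/lib/prometheus and are best measured with a dedicated tool such as fio.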

2. Binary Installation

2.1 Download the Packages

bash
# Set version variables
PROMETHEUS_VERSION="2.45.0"
ALERTMANAGER_VERSION="0.25.0"
NODE_EXPORTER_VERSION="1.6.0"

# Create directories
sudo mkdir -p /opt/prometheus
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mkdir -p /var/log/prometheus

# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz

# Download Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz

# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz

2.2 Install Prometheus

bash
# Extract the archive
tar -xzf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROMETHEUS_VERSION}.linux-amd64

# Copy the binaries
sudo cp prometheus promtool /usr/local/bin/
sudo cp -r consoles/ console_libraries/ /etc/prometheus/

# Create a dedicated user and set permissions
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /var/log/prometheus
sudo chmod 755 /usr/local/bin/prometheus /usr/local/bin/promtool
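Before wiring up systemd, it is worth confirming the binaries are actually on PATH and executable. A small guarded loop (it only reports, never fails):

```shell
# Verify the installed binaries respond; prints the version of each.
for bin in prometheus promtool; do
  if command -v "$bin" >/dev/null 2>&1; then
    "$bin" --version
  else
    echo "NOT FOUND: $bin"
  fi
done
```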

2.3 Configure Prometheus

Create the main configuration file:

yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-1'

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 5s
    metrics_path: /metrics

  # Node Exporter targets
  - job_name: 'node-exporter'
    static_configs:
      - targets: 
        - 'localhost:9100'
        - '192.168.1.10:9100'
        - '192.168.1.11:9100'
    scrape_interval: 10s
    scrape_timeout: 5s
    
  # Application monitoring example
  - job_name: 'webapp'
    static_configs:
      - targets: ['app1.example.com:8080', 'app2.example.com:8080']
    metrics_path: '/actuator/prometheus'
    scrape_interval: 30s
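It pays to lint this file before the service ever starts; a guarded sketch using promtool (covered in more depth later), which skips gracefully if promtool is not installed yet:

```shell
# Validate prometheus.yml syntax before (re)starting the service
if command -v promtool >/dev/null 2>&1; then
  promtool check config /etc/prometheus/prometheus.yml
else
  echo "promtool not on PATH; install it first (see 2.2)"
fi
```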

2.4 Create the systemd Service

ini
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.max-connections=512 \
  --web.read-timeout=5m \
  --web.enable-lifecycle \
  --log.level=info \
  --log.format=logfmt

ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no

# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectHome=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/prometheus
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes

[Install]
WantedBy=multi-user.target

2.5 Install Node Exporter

bash
# Extract the archive
tar -xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cd node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64

# Copy the binary
sudo cp node_exporter /usr/local/bin/

# Create a dedicated user
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

Node Exporter systemd unit:

ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://prometheus.io/docs/guides/node-exporter/
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=0.0.0.0:9100 \
  --path.procfs=/proc \
  --path.sysfs=/sys \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)" \
  --collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$$"

[Install]
WantedBy=multi-user.target

2.6 Install Alertmanager

bash
# Extract the archive
tar -xzf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
cd alertmanager-${ALERTMANAGER_VERSION}.linux-amd64

# Copy the binaries and create directories
sudo cp alertmanager amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager

# Create a dedicated user and set permissions
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Alertmanager configuration:

yaml
# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourcompany.com'
  smtp_auth_username: 'alerts@yourcompany.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
    group_wait: 0s
    group_interval: 5m
    repeat_interval: 30m

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'team@yourcompany.com'
    subject: '[ALERT] {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
      {{ end }}

- name: 'critical-alerts'
  email_configs:
  - to: 'oncall@yourcompany.com'
    subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    body: |
      CRITICAL ALERT!
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Runbook: {{ .Annotations.runbook_url }}
      {{ end }}
  webhook_configs:
  - url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    send_resolved: true
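Like prometheus.yml, this file can be linted before the service starts, using the amtool binary shipped in the Alertmanager tarball. A guarded sketch:

```shell
# Validate alertmanager.yml before starting the service
if command -v amtool >/dev/null 2>&1; then
  amtool check-config /etc/alertmanager/alertmanager.yml
else
  echo "amtool not on PATH; copy it from the Alertmanager tarball first"
fi
```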

Alertmanager systemd unit:

ini
# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager for Prometheus
Documentation=https://prometheus.io/docs/alerting/alertmanager/
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/alertmanager \
  --config.file /etc/alertmanager/alertmanager.yml \
  --storage.path /var/lib/alertmanager/ \
  --web.listen-address=0.0.0.0:9093 \
  --web.external-url=http://localhost:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --log.level=info

[Install]
WantedBy=multi-user.target
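With all three unit files in place, the services can be enabled and started. The loop below is guarded so it is a no-op on hosts without systemd; the unit names match the files created above.

```shell
# Enable and start Prometheus, Node Exporter, and Alertmanager
for unit in prometheus node_exporter alertmanager; do
  echo "enabling ${unit}.service"
  if [ -d /run/systemd/system ] && command -v systemctl >/dev/null 2>&1; then
    sudo systemctl daemon-reload
    sudo systemctl enable --now "${unit}"
  fi
done
```

Afterwards, `systemctl status prometheus` and `curl http://localhost:9090/-/ready` confirm the server is serving.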

3. Docker Deployment

3.1 Docker Compose Deployment

Create a complete monitoring stack:

yaml
# docker-compose.yml
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

x-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"

services:
  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
      - '--log.level=info'
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/ready"]
      interval: 30s
      timeout: 10s
      retries: 5

  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging

  grafana:
    image: grafana/grafana:10.0.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123  # change this in production
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-worldmap-panel
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging

  node-exporter:
    image: prom/node-exporter:v1.6.0
    container_name: node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    networks:
      - monitoring
    restart: unless-stopped
    logging: *default-logging

3.2 Container Configuration Files

Prometheus configuration (container version):

yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']

  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

3.3 Start and Manage

bash
# Start the stack
docker-compose up -d

# Check service status
docker-compose ps

# Tail logs
docker-compose logs -f prometheus
docker-compose logs -f grafana

# Restart a service
docker-compose restart prometheus

# Reload after a configuration change (SIGHUP; since --web.enable-lifecycle
# is set above, curl -X POST http://localhost:9090/-/reload also works)
docker-compose exec prometheus kill -HUP 1

# Stop the stack
docker-compose down

# Full cleanup (including data volumes)
docker-compose down -v

4. Kubernetes Deployment

4.1 Using the Prometheus Operator

Install the Prometheus Operator:

bash
# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi
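After the Helm release is installed, verify the stack came up. The snippet below skips itself when no cluster is reachable; it assumes the release name `prometheus` used in the helm install above.

```shell
# Check that the kube-prometheus-stack pods are running
if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  kubectl -n monitoring get pods
  # Port-forward the Prometheus UI to localhost:9090 (Ctrl-C to stop):
  # kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090
else
  echo "no reachable cluster; skipping verification"
fi
```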

4.2 Custom Resource Configuration

Prometheus custom resource:

yaml
# prometheus-server.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-server
  namespace: monitoring
spec:
  serviceAccountName: prometheus-server
  serviceMonitorSelector:
    matchLabels:
      app: prometheus
  ruleSelector:
    matchLabels:
      app: prometheus
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m
  retention: 30d
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/monitoring
            operator: In
            values: ["true"]

ServiceMonitor configuration:

yaml
# servicemonitor-node-exporter.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: node-exporter
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    honorLabels: true

5. Configuration Management Best Practices

5.1 Configuration File Layout

bash
# Recommended directory layout
/etc/prometheus/
├── prometheus.yml          # main configuration file
├── rules/                  # alerting rules
│   ├── node-rules.yml
│   ├── app-rules.yml
│   └── kubernetes-rules.yml
├── targets/                # service discovery targets
│   ├── static-targets.yml
│   └── file-sd/
├── tls/                    # TLS certificates
│   ├── prometheus.crt
│   └── prometheus.key
└── templates/              # alert notification templates
    └── email.tmpl
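The tree above can be staged with a single mkdir invocation. BASE here is a hypothetical staging directory; on the real host, point it at /etc/prometheus and run with sudo.

```shell
# Stage the recommended layout in a scratch directory first
BASE="${BASE:-/tmp/prometheus-stage}"
mkdir -p "$BASE/rules" "$BASE/targets/file-sd" "$BASE/tls" "$BASE/templates"
touch "$BASE/prometheus.yml" "$BASE/templates/email.tmpl"
find "$BASE" | sort
```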

5.2 Dynamic Configuration

File-based service discovery:

yaml
# In prometheus.yml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
        - '/etc/prometheus/targets/*.json'
        - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

Dynamic target file:

json
// /etc/prometheus/targets/web-servers.json
[
  {
    "targets": ["web1.example.com:9100", "web2.example.com:9100"],
    "labels": {
      "job": "web-servers",
      "env": "production",
      "team": "backend"
    }
  },
  {
    "targets": ["db1.example.com:9100", "db2.example.com:9100"],
    "labels": {
      "job": "database-servers",
      "env": "production",
      "team": "dba"
    }
  }
]
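Because Prometheus re-reads file_sd files on its own (every refresh_interval, no reload needed), these JSON files are a natural hand-off point for inventory scripts or a CMDB export. A sketch that renders a plain host list into the format above; the host names and output path are illustrative:

```shell
# Render a plain host list into a file_sd JSON target file
OUT="${OUT:-/tmp/web-servers.json}"
HOSTS="web1.example.com web2.example.com"

{
  printf '[\n  {\n    "targets": ['
  sep=""
  for h in $HOSTS; do
    printf '%s"%s:9100"' "$sep" "$h"
    sep=", "
  done
  printf '],\n    "labels": {"job": "web-servers", "env": "production"}\n  }\n]\n'
} > "$OUT"
cat "$OUT"
```

Write to a temporary file and mv it into /etc/prometheus/targets/ atomically so Prometheus never reads a half-written file.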

5.3 Configuration Validation Tools

bash
# Check configuration syntax
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml

# Check rule syntax
/usr/local/bin/promtool check rules /etc/prometheus/rules/*.yml

# Run an instant query against a running server
/usr/local/bin/promtool query instant http://localhost:9090 'up'

# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Health checks
curl http://localhost:9090/-/healthy
curl http://localhost:9090/-/ready

6. Monitoring Templates

6.1 Physical Server Template

Complete physical server monitoring configuration:

yaml
# Physical server scrape job
- job_name: 'physical-servers'
  static_configs:
    - targets:
      - 'server1.prod.com:9100'
      - 'server2.prod.com:9100'
      - 'server3.prod.com:9100'
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  relabel_configs:
    # Attach a static datacenter label to every target in this job
    - target_label: datacenter
      replacement: dc1
  metric_relabel_configs:
    # Example: drop per-CPU modes that are rarely needed
    - source_labels: [__name__, mode]
      regex: 'node_cpu_seconds_total;(nice|irq|softirq)'
      action: drop

# Physical server alert rules (save as a separate file under rules/)
groups:
- name: physical-server-alerts
  rules:
  - alert: HostDown
    expr: up{job="physical-servers"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Physical server {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} has been down for more than 1 minute"

  - alert: HighCPUUsage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for 5 minutes"

  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 85%"

  - alert: DiskSpaceUsage
    expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk usage on {{ $labels.instance }}"
      description: "Disk usage is above 85% on {{ $labels.mountpoint }}"
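Rules like HostDown can be unit-tested offline with `promtool test rules`. A minimal sketch, assuming the alert group above is saved as /etc/prometheus/rules/physical-server-alerts.yml (the file names here are illustrative):

```yaml
# host-down.test.yml (run with: promtool test rules host-down.test.yml)
rule_files:
  - /etc/prometheus/rules/physical-server-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="physical-servers", instance="server1.prod.com:9100"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: physical-servers
              instance: server1.prod.com:9100
            exp_annotations:
              summary: "Physical server server1.prod.com:9100 is down"
              description: "server1.prod.com:9100 has been down for more than 1 minute"
```

The series is down for the whole window, so after the 1m `for` delay the alert is firing at eval_time 2m.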

6.2 Container Template

Docker container monitoring configuration:

yaml
# Container scrape job (metrics collected via cAdvisor)
- job_name: 'docker-containers'
  static_configs:
    - targets: ['cadvisor:8080']
  scrape_interval: 15s
  metrics_path: /metrics
  # Note: __meta_docker_* labels are only available with docker_sd_configs;
  # with a static cAdvisor target, container names arrive in the `name` metric label.

# Kubernetes pod monitoring
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__

# Container alert rules (save as a separate rules file)
groups:
- name: container-alerts
  rules:
  - alert: ContainerDown
    expr: absent(container_last_seen{name!=""}) or (time() - container_last_seen{name!=""}) > 60
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} is down"

  - alert: ContainerHighCPU
    expr: sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.name }} high CPU usage"

  - alert: ContainerHighMemory
    expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.name }} high memory usage"
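For the kubernetes-pods job above, workloads opt in to scraping via pod annotations. An illustrative (hypothetical) manifest fragment that those relabel rules would pick up:

```yaml
# Pod metadata for a scrape-enabled application
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  annotations:
    prometheus.io/scrape: "true"      # matched by the keep rule
    prometheus.io/path: "/metrics"    # overrides __metrics_path__
    prometheus.io/port: "8080"        # rewrites __address__ to pod_ip:8080
spec:
  containers:
    - name: demo-app
      image: demo-app:latest
      ports:
        - containerPort: 8080
```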

6.3 Application Template

Java application monitoring configuration:

yaml
# Java application scrape job
- job_name: 'java-applications'
  static_configs:
    - targets:
      -