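Preface
💡 Pain points: How do you monitor K8s? How do you configure Prometheus scraping? How do you write alert rules? How do you design Grafana dashboards? How do you deploy for high availability?
🎯 Solution: This article walks through the full Prometheus + Grafana pipeline: Prometheus architecture and configuration, automatic service discovery, PromQL query syntax, AlertManager alert management, Grafana dashboard design, recording rules for precomputation, high availability with Thanos/Cortex, and custom exporter development.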
[Architecture diagram. Data collection: application metrics (/metrics), K8s metrics (kube-state-metrics), node metrics (node_exporter), black-box probing (blackbox_exporter). Storage: Prometheus (TSDB), Recording Rules (precomputation). Visualization: Grafana (dashboards), AlertManager (alert routing).]
1. Prometheus Architecture and Configuration
1.1 Core Architecture
yaml
# ======== Prometheus on K8s (complete deployment) ========
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources: [nodes, nodes/metrics, nodes/proxy, pods, services, endpoints] # endpoints is required by the `role: endpoints` discovery used in the scrape configs
verbs: [get, list, watch]
- apiGroups: [""]
resources: [configmaps]
verbs: [get]
- apiGroups: ["extensions", "networking.k8s.io"]
resources: [ingresses]
verbs: [get, list, watch]
- nonResourceURLs: [/metrics, /metrics/cadvisor]
verbs: [get]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
securityContext:
runAsUser: 65534
runAsGroup: 65534
fsGroup: 65534
runAsNonRoot: true
containers:
- name: prometheus
image: prom/prometheus:v2.51.0
args:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --storage.tsdb.retention.time=30d
- --storage.tsdb.retention.size=50GB
- --web.enable-lifecycle
- --web.enable-remote-write-receiver
- --query.max-concurrency=20
- --query.timeout=2m
ports:
- containerPort: 9090
resources:
requests:
cpu: 500m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: data
mountPath: /prometheus
- name: rules
mountPath: /etc/prometheus/rules
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
periodSeconds: 5
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
periodSeconds: 15
volumes:
- name: config
configMap:
name: prometheus-config
- name: data
persistentVolumeClaim:
claimName: prometheus-data
- name: rules
configMap:
name: prometheus-rules
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus-data
namespace: monitoring
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 100Gi
storageClassName: standard
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
ports:
- port: 9090
targetPort: 9090
selector:
app: prometheus
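1.2 Prometheus Configuration in Detail
yaml
# ======== prometheus.yml (the key inside the prometheus-config ConfigMap; it must match --config.file) ========
global:
scrape_interval: 15s # default scrape interval
scrape_timeout: 10s # scrape timeout
evaluation_interval: 15s # rule evaluation interval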
external_labels:
cluster: production
region: us-east-1
env: prod
# ======== Alerting ========
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# ======== Rule files ========
rule_files:
- /etc/prometheus/rules/*.yml
# ======== Scrape configs ========
scrape_configs:
# ======== Prometheus self-monitoring ========
- job_name: prometheus
scrape_interval: 5s
static_configs:
- targets: [localhost:9090]
# ======== Kubernetes API Server ========
- job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [default]
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# ======== Kubernetes Nodes (kubelet) ========
- job_name: kubernetes-nodes
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# ======== Kubernetes Nodes (cadvisor) ========
- job_name: kubernetes-cadvisor
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_node_address_InternalIP]
target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
# ======== kube-state-metrics ========
- job_name: kube-state-metrics
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [monitoring]
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: kube-state-metrics
# ======== node-exporter ========
- job_name: node-exporter
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [monitoring]
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: node-exporter
# ======== Application auto-discovery ========
- job_name: myapp
kubernetes_sd_configs:
- role: pod
namespaces:
names: [production]
relabel_configs:
# Only scrape Pods annotated with prometheus.io/scrape=true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Combine the discovered Pod address with the annotated port. A label name written inside `replacement` is taken literally; only capture groups are expanded.
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- source_labels: [__meta_kubernetes_pod_label_app]
action: replace
target_label: app
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
# ======== Black-box probing ========
- job_name: blackbox-http
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://myapp.example.com/healthz
- https://api.example.com/v1/status
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# ======== Pushgateway (short-lived jobs) ========
- job_name: pushgateway
honor_labels: true # keep the labels pushed by the jobs instead of overwriting them
static_configs:
- targets: [pushgateway:9091]
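The `replace` relabel action used throughout the configuration above can be hard to picture from YAML alone. The following is a minimal Go sketch of its mechanics, assuming the commonly documented behaviour: source label values are joined with `;`, the regex is fully anchored, and capture groups are expanded into the target label. The function name `relabelReplace` and the sample label values are illustrative and are not part of Prometheus itself.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelReplace is a simplified model of the "replace" relabel action:
// join the source label values with ";", match the anchored regex, and on a
// match write the expanded replacement into the target label.
func relabelReplace(labels map[string]string, sources []string, regex, replacement, target string) {
	values := make([]string, 0, len(sources))
	for _, name := range sources {
		values = append(values, labels[name])
	}
	joined := strings.Join(values, ";")

	re := regexp.MustCompile("^(?:" + regex + ")$") // relabel regexes are anchored at both ends
	match := re.FindStringSubmatchIndex(joined)
	if match == nil {
		return // no match: the target label is left untouched
	}
	labels[target] = string(re.ExpandString(nil, replacement, joined, match))
}

func main() {
	// Hypothetical discovered target: Pod address plus an annotated metrics port.
	labels := map[string]string{
		"__address__": "10.0.0.7:8080",
		"__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
	}
	relabelReplace(labels,
		[]string{"__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"},
		`([^:]+)(?::\d+)?;(\d+)`, "$1:$2", "__address__")
	fmt.Println(labels["__address__"]) // prints 10.0.0.7:9102
}
```

Because only capture groups are substituted, any label whose value is needed in the result has to appear in `source_labels`.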
2. PromQL Query Syntax
2.1 Basic Queries
promql
# ======== PromQL quick reference ========
# ======== Instant vector selectors ========
# Return the latest sample of each matching series
# Raw CPU time counters for all non-idle modes (not yet a usage figure)
node_cpu_seconds_total{mode!="idle"}
# CPU usage in percent
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage in percent
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage in percent
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
# ======== Range vector selectors ========
# A [duration] suffix selects all samples inside that window
# Per-core idle fraction averaged over the last hour
rate(node_cpu_seconds_total{mode="idle"}[1h])
# Request rate over the last 5 minutes
rate(http_requests_total[5m])
# ======== Aggregation ========
# Request rate aggregated by app
sum by (app) (rate(http_requests_total[5m]))
# Aggregated by namespace and app
sum by (namespace, app) (rate(http_requests_total[5m]))
# Top 5 busiest Pods
topk(5, sum by (pod) (rate(http_requests_total[5m])))
# Average CPU usage per instance (as a fraction of 1)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Running Pods per namespace (the metric is 0 or 1 per phase, so sum rather than count)
sum by (namespace) (kube_pod_status_phase{phase="Running"})
# ======== Functions ========
# rate: per-second rate (for counters)
rate(http_requests_total[5m])
# irate: instantaneous rate (uses only the last two samples)
irate(http_requests_total[5m])
# increase: total increase over the window (for counters)
increase(http_requests_total[1h])
# deriv: derivative (trend of a gauge)
deriv(node_load1[1h])
# predict_linear: linear forecast
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) # predicted value 4 hours from now
# histogram_quantile: quantile from histogram buckets
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) # P95
# absent: detect a missing metric
absent(up{job="myapp"}) # returns 1 if myapp has no data
# time: current Unix time
time() - kube_pod_start_time # Pod uptime in seconds
# ======== Operators ========
# Arithmetic
http_requests_total * 2
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Comparison (filters series by default)
http_requests_total > 1000
node_cpu_usage > 0.8
# Set operators: and, or, unless
http_requests_total > 1000 and http_errors_total > 100
http_requests_total > 1000 or http_errors_total > 10
# unless: drop left-hand series that have a matching series on the right
http_requests_total > 1000 unless http_errors_total > 100
# bool modifier: return 0 or 1 instead of filtering
http_requests_total == bool 0
# ======== Subqueries ========
# Maximum 5-minute rate over the last hour, evaluated every 5 minutes
max_over_time(rate(http_requests_total[5m])[1h:5m])
# ======== Label operations ========
# label_replace: derive a label from another with a regex
label_replace(up{job="node"}, "hostname", "$1", "instance", "(.*):.*")
# label_join: concatenate label values
label_join(up{job="node"}, "host", ",", "instance", "job")
# ======== Common query templates ========
# HTTP request P95 latency (per service)
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
# HTTP request P99 latency (per route)
histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))
# Request rate (QPS)
sum by (service) (rate(http_requests_total[5m]))
# Error rate (percent)
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m])) * 100
# Pod restarts in the last hour
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))
# Pods that are neither Running nor Succeeded (filter on value 1, since every phase has a series)
kube_pod_status_phase{phase=~"Pending|Failed|Unknown"} == 1
# Deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0
# Containers terminated by OOMKill
kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
# PVC usage in percent (actual usage as reported by the kubelet)
kubelet_volume_stats_used_bytes
/ kubelet_volume_stats_capacity_bytes * 100
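Since `rate()` appears in almost every query above, it may help to see what it roughly computes. The Go sketch below is a simplified model, not the Prometheus implementation: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. Real Prometheus additionally extrapolates to the window boundaries, which is omitted here. The names `Sample` and `simpleRate` are illustrative.

```go
package main

import "fmt"

// Sample is one scraped counter value at a timestamp in seconds.
type Sample struct {
	T float64
	V float64
}

// simpleRate returns the per-second increase across the samples of a window.
// A drop in value is treated as a counter reset, so the value after the reset
// counts as fresh increase. No boundary extrapolation is applied.
func simpleRate(samples []Sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].V - samples[i-1].V
		if delta < 0 { // counter reset: the process restarted from zero
			delta = samples[i].V
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].T - samples[0].T
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// The counter resets between t=10 and t=20 (30 -> 5).
	window := []Sample{{0, 10}, {10, 30}, {20, 5}, {30, 25}}
	fmt.Println(simpleRate(window)) // prints 1.5 (increase 20+5+20 over 30s)
}
```

This is also why `rate()` and `increase()` should only be applied to counters: on a gauge, every decrease would be misread as a reset.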
3. AlertManager Alert Management
3.1 Alert Rules
yaml
# ======== prometheus-rules.yml ========
groups:
# ======== Infrastructure alerts ========
- name: infrastructure
rules:
# CPU usage > 90%
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }} (threshold: 90%)"
runbook_url: "https://runbooks.example.com/high-cpu"
dashboard_url: "https://grafana.example.com/d/node-dashboard?var-instance={{ $labels.instance }}"
# Memory usage > 85%
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }} (threshold: 85%)"
# Disk usage > 80%
- alert: HighDiskUsage
expr: (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"} - node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"} * 100 > 80
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High disk usage on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} usage is {{ $value }}% on {{ $labels.instance }}"
# Disk predicted to fill within 4 hours
- alert: DiskWillFillIn4Hours
expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}[1h], 4*3600) < 0
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Disk {{ $labels.mountpoint }} will fill in 4 hours on {{ $labels.instance }}"
# Node unreachable
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 3m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been down for more than 3 minutes"
# ======== K8s alerts ========
- name: kubernetes
rules:
# Pod not in Running state
- alert: PodNotRunning
expr: kube_pod_status_phase{phase!="Running", phase!="Succeeded"} > 0
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not Running"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in {{ $labels.phase }} state for more than 5 minutes"
# Deployment has fewer available replicas than desired
- alert: DeploymentReplicasMismatch
expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch"
# Pod CrashLooping
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
# Pod OOMKilled
- alert: PodOOMKilled
expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} == 1
for: 0m
labels:
severity: critical
team: platform
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
# PVC usage > 85% (kubelet volume stats report actual usage; requested vs. provisioned size does not)
- alert: PVCAlmostFull
expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is almost full"
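# HPA has reached its maximum replica count (metric and label names as exposed by kube-state-metrics v2+)
- alert: HPAMaxedOut
expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has reached max replicas"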
# ======== Application alerts ========
- name: application
rules:
# HTTP 5xx error rate > 5%
- alert: HighErrorRate
expr: |
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m])) * 100 > 5
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate for {{ $labels.service }}"
description: "Error rate is {{ $value }}% for {{ $labels.service }} (threshold: 5%)"
# HTTP P95 latency > 2s
- alert: HighLatencyP95
expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 2
for: 5m
labels:
severity: warning
team: backend
annotations:
summary: "High P95 latency for {{ $labels.service }}"
description: "P95 latency is {{ $value }}s for {{ $labels.service }}"
# Service unreachable
- alert: ServiceDown
expr: up{job="myapp"} == 0
for: 1m
labels:
severity: critical
team: backend
annotations:
summary: "Service {{ $labels.job }} is down"
# Database connection pool nearly exhausted (> 90% in use)
- alert: DatabasePoolExhausted
expr: db_pool_active_connections / db_pool_max_connections * 100 > 90
for: 3m
labels:
severity: critical
team: backend
annotations:
summary: "Database pool exhausted for {{ $labels.service }}"
description: "Pool utilization is {{ $value }}% for {{ $labels.service }}"
# ======== Recording rules (precomputation) ========
- name: recording_rules
rules:
# Precomputed CPU usage (reduces query-time work)
- record: node:cpu_usage:percent
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Precomputed memory usage
- record: node:memory_usage:percent
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Precomputed HTTP QPS
- record: service:http_requests:rate5m
expr: sum by (service) (rate(http_requests_total[5m]))
# Precomputed HTTP 5xx rate
- record: service:http_errors:rate5m
expr: sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
# HTTP error rate in percent
- record: service:http_error_rate:percent
expr: |
service:http_errors:rate5m
/ service:http_requests:rate5m * 100
3.2 AlertManager Configuration
yaml
# ======== alertmanager-config.yml ========
global:
resolve_timeout: 5m
smtp_smarthost: smtp.example.com:587
smtp_from: alerts@example.com
smtp_auth_username: alerts@example.com
smtp_auth_password: smtp_password
# ======== Routing ========
route:
group_by: [alertname, cluster, namespace]
group_wait: 30s # wait 30s before the first notification for a new group
group_interval: 5m # wait 5m before notifying about new alerts in an existing group
repeat_interval: 4h # resend a still-firing notification every 4h
receiver: default # fallback receiver
routes:
# Critical alerts → PagerDuty + Slack
- match:
severity: critical
receiver: pagerduty-critical
group_wait: 10s
group_interval: 1m
repeat_interval: 1h
continue: true # keep evaluating the following sibling routes
- match:
severity: critical
receiver: slack-critical
continue: true # without this, critical alerts never reach the team and namespace routes below
# Warning alerts → Slack
- match:
severity: warning
receiver: slack-warning
repeat_interval: 8h
continue: true # same reason: let the team routes match as well
# Infrastructure team alerts
- match:
team: infrastructure
receiver: slack-infra
group_by: [alertname, instance]
# Backend team alerts
- match_re:
team: backend|api
receiver: slack-backend
# Specific namespace
- match:
namespace: production
receiver: pagerduty-production
repeat_interval: 30m
# ======== Inhibition ========
inhibit_rules:
# NodeDown suppresses other alerts from the same node
- source_match:
alertname: NodeDown
target_match_re:
alertname: HighCPUUsage|HighMemoryUsage|HighDiskUsage
equal: [instance]
# Critical suppresses Warning
- source_match:
severity: critical
target_match:
severity: warning
equal: [alertname, namespace]
# ======== Receivers ========
receivers:
- name: default
email_configs:
- to: oncall@example.com
send_resolved: true
- name: pagerduty-critical
pagerduty_configs:
- routing_key: pagerduty-routing-key
severity: critical
description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
- name: slack-critical
slack_configs:
- channel: '#alerts-critical'
api_url: https://hooks.slack.com/services/xxx
title: "[CRITICAL] {{ .GroupLabels.alertname }}"
text: |
*Alert:* {{ .GroupLabels.alertname }}
*Cluster:* {{ .GroupLabels.cluster }}
*Namespace:* {{ .GroupLabels.namespace }}
*Summary:* {{ .CommonAnnotations.summary }}
*Description:* {{ .CommonAnnotations.description }}
*Dashboard:* {{ .CommonAnnotations.dashboard_url }}
*Runbook:* {{ .CommonAnnotations.runbook_url }}
send_resolved: true
- name: slack-warning
slack_configs:
- channel: '#alerts-warning'
api_url: https://hooks.slack.com/services/xxx
title: "[WARNING] {{ .GroupLabels.alertname }}"
text: "{{ .CommonAnnotations.summary }}"
send_resolved: true
- name: slack-infra
slack_configs:
- channel: '#infra-alerts'
api_url: https://hooks.slack.com/services/xxx
- name: slack-backend
slack_configs:
- channel: '#backend-alerts'
api_url: https://hooks.slack.com/services/xxx
- name: pagerduty-production
pagerduty_configs:
- routing_key: pagerduty-prod-key
severity: critical
4. Grafana Dashboard
4.1 Dashboard JSON Template
json
{
"dashboard": {
"uid": "myapp-dashboard",
"title": "MyApp - Service Overview",
"tags": ["myapp", "production"],
"timezone": "browser",
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"current": { "text": "Prometheus", "value": "Prometheus" }
},
{
"name": "namespace",
"type": "query",
"query": "label_values(kube_pod_info, namespace)",
"current": { "text": "production", "value": "production" }
},
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
"current": { "text": "myapp", "value": "myapp" }
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate (QPS)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total{namespace=\"$namespace\",service=\"$service\"}[5m]))",
"legendFormat": "{{service}} QPS"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"color": { "mode": "palette-classic" },
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 10
}
}
}
},
{
"id": 2,
"title": "Error Rate (%)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"targets": [
{
"expr": "service:http_error_rate:percent{service=\"$service\"}",
"legendFormat": "{{service}} Error Rate"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 3 },
{ "color": "red", "value": 5 }
]
}
}
}
},
{
"id": 3,
"title": "P50/P95/P99 Latency",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
"targets": [
{
"expr": "histogram_quantile(0.5, sum by (le) (rate(http_request_duration_seconds_bucket{namespace=\"$namespace\",service=\"$service\"}[5m])))",
"legendFormat": "P50"
},
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{namespace=\"$namespace\",service=\"$service\"}[5m])))",
"legendFormat": "P95"
},
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{namespace=\"$namespace\",service=\"$service\"}[5m])))",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 2 }
]
}
}
}
},
{
"id": 4,
"title": "Pod Status",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 8 },
"targets": [
{
"expr": "sum(kube_pod_status_phase{namespace=\"$namespace\",phase=\"Running\"})",
"legendFormat": "Running"
},
{
"expr": "sum(kube_pod_status_phase{namespace=\"$namespace\",phase!=\"Running\",phase!=\"Succeeded\"})",
"legendFormat": "Not Running"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
}
}
}
},
{
"id": 5,
"title": "CPU Usage",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
"targets": [
{
"expr": "sum by (pod) (rate(container_cpu_usage_seconds_total{namespace=\"$namespace\",pod=~\"$service.*\",container!=\"\"}[5m])) * 100",
"legendFormat": "{{pod}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100
}
}
},
{
"id": 6,
"title": "Memory Usage",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
"targets": [
{
"expr": "sum by (pod) (container_memory_working_set_bytes{namespace=\"$namespace\",pod=~\"$service.*\",container!=\"\"})",
"legendFormat": "{{pod}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes"
}
}
}
]
},
"overwrite": true
}
4.2 Dashboard Provisioning
yaml
# ======== Automatic Grafana dashboard import (Helm template: .Files.Get only works inside a chart) ========
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
myapp-dashboard.json: |
{{ .Files.Get "dashboards/myapp-dashboard.json" | nindent 4 }}
---
# ======== Grafana Dashboard Provider ========
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-providers
namespace: monitoring
data:
providers.yaml: |
apiVersion: 1
providers:
- name: default
orgId: 1
folder: General
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 30
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
- name: myapp
orgId: 1
folder: MyApp
type: file
options:
path: /var/lib/grafana/dashboards/myapp
5. Custom Exporter Development
5.1 Go Exporter Implementation
go
// ======== Custom Prometheus exporter (Go) ========
package main
import (
"context"
"log"
"net/http"
"os"
"strconv"
"sync"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// ======== Exporter struct ========
type MyAppExporter struct {
mutex sync.Mutex
// Gauges (can go up and down)
up prometheus.Gauge
totalConnections prometheus.Gauge
activeQueries prometheus.Gauge
// Counters (only increase)
requestsTotal *prometheus.CounterVec
errorsTotal *prometheus.CounterVec
// Histogram (distribution)
queryDuration *prometheus.HistogramVec
// Summary (client-side quantiles)
responseSize prometheus.Summary
// Data source
client MyAppClient
}
type MyAppClient interface {
GetStats(ctx context.Context) (*AppStats, error)
}
type AppStats struct {
TotalConnections int64
ActiveQueries int64
Errors int64
ResponseTimeMs float64
}
// ======== Metric definitions ========
func NewMyAppExporter(client MyAppClient) *MyAppExporter {
return &MyAppExporter{
up: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "myapp_up",
Help: "Was the last scrape of myapp successful.",
}),
totalConnections: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "myapp_connections", // gauges should not use the _total suffix, which is reserved for counters
Help: "Current number of connections.",
}),
activeQueries: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "myapp_queries_active",
Help: "Number of active queries.",
}),
requestsTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "myapp_requests_total",
Help: "Total number of requests.",
}, []string{"method", "path", "status_code"}),
errorsTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "myapp_errors_total",
Help: "Total number of errors.",
}, []string{"type"}),
queryDuration: prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "myapp_query_duration_seconds",
Help: "Query duration in seconds.",
Buckets: prometheus.DefBuckets, // default: .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10
// Custom buckets:
// Buckets: []float64{.01, .05, .1, .25, .5, 1, 2.5, 5, 10, 30},
}, []string{"query_type"}),
responseSize: prometheus.NewSummary(prometheus.SummaryOpts{
Name: "myapp_response_size_bytes",
Help: "Response size in bytes.",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
}),
client: client,
}
}
// ======== Describe + Collect (required by the prometheus.Collector interface) ========
func (e *MyAppExporter) Describe(ch chan<- *prometheus.Desc) {
e.up.Describe(ch)
e.totalConnections.Describe(ch)
e.activeQueries.Describe(ch)
e.requestsTotal.Describe(ch)
e.errorsTotal.Describe(ch)
e.queryDuration.Describe(ch)
e.responseSize.Describe(ch)
}
func (e *MyAppExporter) Collect(ch chan<- prometheus.Metric) {
e.mutex.Lock()
defer e.mutex.Unlock()
// Fetch the stats from the application
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
stats, err := e.client.GetStats(ctx)
if err != nil {
e.up.Set(0) // scrape failed
e.up.Collect(ch)
log.Printf("Failed to scrape myapp: %v", err)
return
}
// Set metric values
e.up.Set(1) // scrape succeeded
e.totalConnections.Set(float64(stats.TotalConnections))
e.activeQueries.Set(float64(stats.ActiveQueries))
// Emit all metrics
e.up.Collect(ch)
e.totalConnections.Collect(ch)
e.activeQueries.Collect(ch)
e.requestsTotal.Collect(ch)
e.errorsTotal.Collect(ch)
e.queryDuration.Collect(ch)
e.responseSize.Collect(ch)
}
// ======== Middleware (records request metrics automatically) ========
type MetricsMiddleware struct {
requestsTotal *prometheus.CounterVec
queryDuration *prometheus.HistogramVec
}
func (m *MetricsMiddleware) Handler(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Wrap the ResponseWriter to capture the status code
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
next.ServeHTTP(wrapped, r)
// Record metrics. Caution: a raw URL path as a label value can create unbounded cardinality; prefer a route template.
duration := time.Since(start).Seconds()
m.requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(wrapped.statusCode)).Inc()
m.queryDuration.WithLabelValues(r.Method).Observe(duration) // the histogram's only label is query_type; the HTTP method is used as its value here
})
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
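// ======== main ========
// myAppClientImpl is a hypothetical placeholder client; replace GetStats with a real call to your application.
type myAppClientImpl struct{ baseURL string }
func (c *myAppClientImpl) GetStats(ctx context.Context) (*AppStats, error) {
return &AppStats{}, nil // placeholder: a real client would query c.baseURL
}
func main() {
client := &myAppClientImpl{baseURL: os.Getenv("APP_URL")}
exporter := NewMyAppExporter(client)
prometheus.MustRegister(exporter)
// Attach the middleware
middleware := &MetricsMiddleware{
requestsTotal: exporter.requestsTotal,
queryDuration: exporter.queryDuration,
}
// Use a dedicated mux: registering a wrapper around http.DefaultServeMux on DefaultServeMux itself would recurse forever on "/".
mux := http.NewServeMux()
mux.Handle("/metrics", promhttp.Handler())
mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("ok"))
})
log.Println("Starting exporter on :9100")
log.Fatal(http.ListenAndServe(":9100", middleware.Handler(mux)))
}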
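What `promhttp.Handler()` ultimately serves on `/metrics` is plain text in the Prometheus exposition format. As a rough illustration of that output, and not a substitute for client_golang, the sketch below renders one counter family using only the standard library. The function `renderCounter` and the sample values are hypothetical; label escaping relies on Go's `%q`, which is only adequate for simple printable values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderCounter formats one counter family in the text exposition format.
// values maps a single label's value (here: the HTTP method) to its count.
func renderCounter(name, help, label string, values map[string]uint64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)

	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output

	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %d\n", name, label, k, values[k])
	}
	return b.String()
}

func main() {
	fmt.Print(renderCounter(
		"myapp_requests_total",
		"Total number of requests.",
		"method",
		map[string]uint64{"GET": 3, "POST": 1},
	))
}
```

Each sample line is `name{labels} value`, preceded by `# HELP` and `# TYPE` metadata, which is all a scrape needs to parse.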
6. High-Availability Deployment
6.1 Thanos Deployment
yaml
# ======== Thanos Sidecar mode (runs alongside Prometheus) ========
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus-thanos-sidecar
namespace: monitoring
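spec:
replicas: 1
selector:
matchLabels:
app: prometheus-thanos
template:
metadata:
labels:
app: prometheus-thanos
spec:
containers:
# Prometheus
- name: prometheus
image: prom/prometheus:v2.51.0
args:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --storage.tsdb.retention.time=6h # short local retention; long-term data lives in object storage
- --storage.tsdb.min-block-duration=2h # equal min/max disables local compaction, which the sidecar upload expects
- --storage.tsdb.max-block-duration=2h
ports:
- containerPort: 9090
volumeMounts:
- name: prometheus-data
mountPath: /prometheus
# Thanos Sidecar
- name: thanos-sidecar
image: thanosio/thanos:v0.34.0
args:
- sidecar
- --prometheus.url=http://localhost:9090
- --tsdb.path=/prometheus
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --objstore.config=$(OBJSTORE_CONFIG)
env:
- name: OBJSTORE_CONFIG
valueFrom:
secretKeyRef:
name: thanos-objstore
key: config
ports:
- containerPort: 10901 # gRPC
- containerPort: 10902 # HTTP
volumeMounts:
- name: prometheus-data
mountPath: /prometheus
volumes:
- name: prometheus-data
persistentVolumeClaim:
claimName: prometheus-data
# The prometheus.yml ConfigMap mount is omitted for brevity; see section 1.1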
yaml
# ======== Thanos objstore configuration (S3) ========
apiVersion: v1
kind: Secret
metadata:
name: thanos-objstore
namespace: monitoring
type: Opaque
stringData:
config: |
type: S3
config:
bucket: thanos-storage
endpoint: s3.amazonaws.com
region: us-east-1
access_key: AKIAIOSFODNN7EXAMPLE
secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
insecure: false
yaml
# ======== Thanos Query (global query gateway) ========
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-query
namespace: monitoring
spec:
replicas: 2
selector:
matchLabels:
app: thanos-query
template:
metadata:
labels:
app: thanos-query
spec:
containers:
- name: thanos-query
image: thanosio/thanos:v0.34.0
args:
- query
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local # --store is the older, deprecated spelling of this flag
- --endpoint=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc.cluster.local
- --query.auto-downsampling
ports:
- containerPort: 10901
- containerPort: 10902
7. Checklist
□ Prometheus configuration
□ scrape_configs (scrape targets)
□ global settings (intervals/labels)
□ kubernetes_sd_configs (service discovery)
□ relabel_configs (label rewriting)
□ rule_files (alerting/recording rules)
□ alerting (AlertManager endpoints)
□ storage.tsdb.retention (retention time)
□ PromQL queries
□ Instant/range vector selectors
□ rate/irate/increase
□ histogram_quantile (P95/P99)
□ Aggregation (sum/avg/topk/count)
□ Operators (arithmetic/comparison/set)
□ absent (missing-metric detection)
□ predict_linear (trend forecasting)
□ AlertManager
□ Alert rules (alert/expr/for/labels/annotations)
□ Routing (group_by/group_wait/repeat_interval)
□ Inhibition rules (inhibit_rules)
□ Receivers (email/slack/pagerduty)
□ Silences
□ Recording Rules
□ Precompute frequently used queries
□ Reduce dashboard query cost
□ Grafana
□ Dashboard JSON design
□ Template variables (datasource/namespace/service)
□ Threshold colors (thresholds)
□ Automatic provisioning
□ Custom exporter
□ Gauge/Counter/Histogram/Summary
□ Describe + Collect interface
□ Middleware integration
□ Health check (up metric)
□ High availability
□ Thanos Sidecar + Store Gateway
□ Thanos Query (global queries)
□ Object storage (S3/GCS)
□ Cortex/VictoriaMetrics comparison
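Summary
The three layers of a Prometheus monitoring stack:
| Layer | Function | Tools |
|---|---|---|
| Collection | Expose metrics + automatic discovery | Exporter + SD + Pushgateway |
| Storage | Persistent storage + precomputation | Prometheus TSDB + Recording Rules |
| Presentation | Visualization + alerting | Grafana + AlertManager |
The four metric types:
| Type | Use | Example | PromQL |
|---|---|---|---|
| Counter | Only increases (total requests) | http_requests_total | rate() / increase() |
| Gauge | Goes up and down (current value) | cpu_usage, connections | Query directly |
| Histogram | Distribution (latency) | request_duration_bucket | histogram_quantile() |
| Summary | Client-side quantiles | response_size | Query directly |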
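Suggested alert severity levels:
| Level | Response time | Notification channels |
|---|---|---|
| Critical | Within 5 minutes | PagerDuty + Slack + phone call |
| Warning | Within 30 minutes | Slack + Email |
| Info | Within 1 hour | Slack |
Recommended next steps:
- A complete Thanos deployment in practice (Query Frontend + Store Gateway + Compactor)
- VictoriaMetrics as an alternative in practice
- OpenTelemetry for unified observability (Traces + Metrics + Logs)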