Part 1: The Monitoring Landscape and Component Responsibilities
In the Kubernetes ecosystem, monitoring is not something a single tool can do on its own; it takes a whole suite of components working together:
text
┌────────────────────────────────────────────────────────┐
│                 Grafana (visualization)                │
│  turns raw metrics into dashboards; supports multiple  │
│  data sources and alerting                             │
└───────────────┬────────────────────────────────────────┘
                │
┌───────────────▼────────────────────────────────────────┐
│              Prometheus (monitoring brain)             │
│  scrapes and stores metrics, provides a query          │
│  language, and evaluates alerting rules                │
└───────────────┬────────────────────────────────────────┘
                │
    ┌───────────┼───────────┐
    ▼           ▼           ▼
┌──────┐    ┌──────┐   ┌─────────────┐
│Node  │    │cAd-  │   │kube-state-  │
│Export│    │visor │   │metrics      │
└──────┘    └──────┘   └─────────────┘
    │           │           │
    ▼           ▼           ▼
┌────────────────────────────────────────────────────────┐
│                   Kubernetes cluster                   │
│  node resources   container resources   object state  │
└────────────────────────────────────────────────────────┘
Where each component fits:
- Prometheus: the "central processor" of the monitoring system, responsible for metric collection, storage, querying, and alert evaluation
- cAdvisor: a "lens" into containers, collecting container-level metrics such as CPU, memory, and disk I/O
- kube-state-metrics: a "state snapshot machine" for cluster resources, exposing the state of Kubernetes objects such as Deployments and Pods
- Metrics Server: the "data supplier" for the HPA, providing the real-time resource metrics used for autoscaling (see the quick check after this list)
- Grafana: the "visualization magician" that turns monitoring data into readable dashboards
- Alertmanager: the "smart butler" for alerts, handling routing, deduplication, and notifications
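Metrics Server is not deployed in this article; assuming it is already installed in the cluster (managed offerings often ship it by default), its role is easy to see with kubectl top:
bash
# Requires Metrics Server; shows the same resource metrics the HPA consumes
kubectl top nodes
kubectl top pods -n kube-ops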
Part 2: Hands-on Deployment of the Monitoring Stack, Step by Step
Step 1: Create a dedicated namespace for monitoring
bash
# All monitoring components live in the kube-ops namespace for easier management
kubectl create ns kube-ops
Step 2: Deploy Prometheus (the monitoring brain)
2.1 Configuration file (ConfigMap)
yaml
# prometheus-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s            # scrape every 15 seconds
      scrape_timeout: 15s             # scrape timeout
    scrape_configs:
    - job_name: 'prometheus'          # first job: Prometheus monitors itself
      static_configs:
      - targets: ['localhost:9090']   # default Prometheus port
2.2 Persistent storage (PVC)
yaml
# prometheus-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-ops
spec:
  storageClassName: nfs-client   # NFS-backed StorageClass
  accessModes:
  - ReadWriteMany                # multi-node read/write
  resources:
    requests:
      storage: 10Gi              # request 10Gi of storage
2.3 Permissions (RBAC)
yaml
# prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes         # read nodes
  - nodes/proxy   # proxy kubelet/cAdvisor metrics via the API server (used in step 3.3)
  - services      # read services
  - endpoints     # read endpoints
  - pods          # read Pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-ops
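A quick way to confirm the binding works is to impersonate the ServiceAccount with kubectl auth can-i:
bash
# Both commands should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list pods --as=system:serviceaccount:kube-ops:prometheus
kubectl auth can-i watch nodes --as=system:serviceaccount:kube-ops:prometheus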
2.4 Deploy the Prometheus application
yaml
# prometheus-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-ops
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.4.3
        name: prometheus
        command: ["/bin/prometheus"]
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"   # keep data for 24 hours
        - "--web.enable-admin-api"          # enable the admin API
        - "--web.enable-lifecycle"          # allow config hot reload via HTTP
        ports:
        - containerPort: 9090
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus
      - name: config-volume
        configMap:
          name: prometheus-config
2.5 Expose the service (Service)
yaml
# prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: kube-ops
spec:
  selector:
    app: prometheus
  type: NodePort   # NodePort: reachable through any node IP
  ports:
  - name: web
    port: 9090
    targetPort: http
Deployment commands:
bash
# Apply in order
kubectl apply -f prometheus-cm.yaml
kubectl apply -f prometheus-pvc.yaml
kubectl apply -f prometheus-rbac.yaml
kubectl apply -f prometheus-deploy.yaml
kubectl apply -f prometheus-svc.yaml
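After applying, confirm that the Pod is running and note the NodePort that was assigned so you can open the Prometheus UI:
bash
# Check the Pod and find the assigned NodePort
kubectl get pods -n kube-ops -l app=prometheus
kubectl get svc prometheus -n kube-ops
# The UI is then reachable at http://<any-node-IP>:<NodePort>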
Step 3: Extend the monitoring scope
3.1 Monitor Ingress-Nginx
Ingress-Nginx exposes metrics on port 10254 by default, so all that is needed is a scrape configuration:
yaml
# Update prometheus-cm.yaml and add the following under scrape_configs
- job_name: 'ingressnginx20'
  static_configs:
  - targets: ['192.168.200.20:10254']   # first Ingress node
- job_name: 'ingressnginx30'
  static_configs:
  - targets: ['192.168.200.30:10254']   # second Ingress node
Hot-reloading the configuration:
bash
# Apply the updated ConfigMap
kubectl apply -f prometheus-cm.yaml
# Hot-reload Prometheus (requires --web.enable-lifecycle, enabled earlier)
curl -X POST "http://<Prometheus-Service-IP>:9090/-/reload"
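Keep in mind that a ConfigMap mounted as a volume can take up to about a minute to propagate into the Pod, so wait briefly before calling the reload endpoint. To confirm the new jobs were picked up, you can list the active targets (a minimal check, assuming the NodePort from step 2.5):
bash
# The new ingressnginx jobs should show up in the output
curl -s "http://<node-IP>:<NodePort>/api/v1/targets" | grep -o '"job":"[^"]*"' | sort -u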
3.2 Deploy Node Exporter (node-level metrics)
yaml
# prome-node-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet          # one instance per node
metadata:
  name: node-exporter
  namespace: kube-ops
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true       # share the host PID namespace
      hostIPC: true       # share the host IPC namespace
      hostNetwork: true   # use the host network
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        securityContext:
          privileged: true   # privileged mode to read host information
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        volumeMounts:
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
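Because the DaemonSet uses hostNetwork, every node now serves metrics on port 9100, but nothing scrapes them yet. A common approach is node service discovery with the kubelet port rewritten to 9100; a sketch of the extra job for prometheus.yml:
yaml
- job_name: 'kubernetes-nodes'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):10250'          # discovered address points at the kubelet port
    replacement: '${1}:9100'     # rewrite it to the node-exporter port
    target_label: __address__
    action: replace
  - action: labelmap             # carry node labels over onto the metrics
    regex: __meta_kubernetes_node_label_(.+)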
3.3 Monitor container resources (cAdvisor)
cAdvisor is built into the kubelet, and its metrics can be reached through the API server proxy:
yaml
# Add to the Prometheus configuration
- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
3.4 Monitor the API server
yaml
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https   # keep only the API server's https endpoint
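The component overview also lists kube-state-metrics for object state, which this article does not deploy. Assuming you install it separately (for example from its official manifests) so that a Service named kube-state-metrics exists in kube-ops, a scrape job could be as simple as:
yaml
- job_name: 'kube-state-metrics'
  static_configs:
  - targets: ['kube-state-metrics.kube-ops.svc:8080']   # default kube-state-metrics metrics port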
Step 4: Deploy Grafana (visualization)
4.1 Create storage
yaml
# grafana-volume.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: kube-ops
spec:
  storageClassName: nfs-client
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
4.2 Fix volume permissions (first deployment only)
yaml
# grafana-chown-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: grafana-chown
  namespace: kube-ops
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: grafana-chown
        command: ["chown", "-R", "472:472", "/var/lib/grafana"]
        image: busybox
        volumeMounts:
        - name: storage
          subPath: grafana
          mountPath: /var/lib/grafana
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana
4.3 Deploy Grafana
yaml
# grafana-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: kube-ops
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      securityContext:     # run as Grafana's uid/gid (472); fsGroup is a Pod-level field
        fsGroup: 472
        runAsUser: 472
      containers:
      - name: grafana
        image: grafana/grafana:5.3.4
        ports:
        - containerPort: 3000
          name: grafana
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin321
        readinessProbe:
          httpGet:
            path: /api/health
            port: 3000
          initialDelaySeconds: 60
          periodSeconds: 10
        volumeMounts:
        - mountPath: /var/lib/grafana
          subPath: grafana
          name: storage
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana
4.4 Expose the service
yaml
# grafana-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-ops
spec:
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    app: grafana
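Rather than adding the Prometheus data source by hand in the UI (as described at the end of Part 3), Grafana 5+ can also provision it from a file. A minimal sketch, assuming you mount this ConfigMap into the Grafana Pod at /etc/grafana/provisioning/datasources via an extra volume and volumeMount in grafana-deploy.yaml:
yaml
# grafana-datasources-cm.yaml (illustrative name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: kube-ops
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus.kube-ops.svc:9090
      isDefault: true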
Step 5: Configure alerting (Alertmanager)
5.1 Define alerting rules
yaml
# alert-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-ops
data:
  node-memory.rules: |
    groups:
    - name: node-alerts
      rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is above 80% (current value: {{ $value }}%)"
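For these rules to actually fire, Prometheus must load the rules file and know where Alertmanager lives. Assuming the prometheus-rules ConfigMap is mounted into the Prometheus Pod at /etc/prometheus/rules and Alertmanager is exposed as a Service named alertmanager in kube-ops (as in the sketch after section 5.2), prometheus.yml needs roughly:
yaml
# Additions to prometheus.yml (sketch)
rule_files:
- /etc/prometheus/rules/*.rules   # where the prometheus-rules ConfigMap is assumed to be mounted
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager.kube-ops.svc:9093']   # assumed Alertmanager Service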
5.2 Alertmanager configuration
yaml
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-ops
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'monitor@163.com'
      smtp_auth_username: 'monitor@163.com'
      smtp_auth_password: 'your-auth-code'
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'email'
    receivers:
    - name: 'email'
      email_configs:
      - to: 'admin@company.com'
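The Alertmanager workload itself is not shown in the steps above; a minimal sketch of a Deployment and Service that consume this ConfigMap (the image tag is an assumption chosen to match the era of the other components) might look like:
yaml
# alertmanager-deploy.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-ops
spec:
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.3
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        ports:
        - containerPort: 9093
        volumeMounts:
        - name: config
          mountPath: /etc/alertmanager
      volumes:
      - name: config
        configMap:
          name: alertmanager-config
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-ops
spec:
  selector:
    app: alertmanager
  ports:
  - port: 9093
    targetPort: 9093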
Part 3: Querying in Practice
Common PromQL examples:
promql
# 1. Node CPU usage (%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# 2. Node memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# 3. Top 10 Pods by memory usage (newer clusters use the labels `container` and `pod` instead)
topk(10, sum(container_memory_usage_bytes{container_name!=""}) by (pod_name, namespace))
# 4. Ingress request rate
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)
# 5. API server request latency (P99)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
Grafana dashboard setup:
- Open Grafana at http://<node-IP>:<NodePort>
- Add a data source: choose Prometheus and set the URL to http://prometheus.kube-ops.svc:9090
- Import dashboard templates by ID:
  - Kubernetes cluster monitoring: ID 7249
  - Node Exporter: ID 8919
  - cAdvisor: ID 14282
Part 4: Monitoring Best Practices
1. Consistent labeling
yaml
# Attach consistent labels to scraped metrics
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
  target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
  target_label: pod
2. Resource limits
yaml
# Give monitoring components sensible resource requests and limits
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"
3. Data retention policy
yaml
# Adjust retention to your needs (these flags require Prometheus 2.7+;
# the v2.4.3 image used above only supports --storage.tsdb.retention)
args:
- "--storage.tsdb.retention.time=30d"     # keep 30 days
- "--storage.tsdb.retention.size=100GB"   # or cap by total size
4. High availability
yaml
# Prometheus HA. Note: plain replicas scrape and store data independently,
# so give each replica its own volume rather than sharing the single PVC above.
spec:
  replicas: 2          # two replicas
  strategy:
    type: RollingUpdate
Part 5: Troubleshooting Guide
Common problems and fixes:
| Problem | Likely cause | Fix |
|---|---|---|
| A Prometheus target shows DOWN | Network unreachable / port not open | Check firewalls and the Service configuration |
| Grafana cannot connect to the data source | Network policy restrictions | Check NetworkPolicy objects |
| Metrics are missing | Scrape configuration error | Check the relabel configuration |
| Memory usage too high | Retention too long | Tune the retention flags |
Diagnostic commands:
bash
# Tail the Prometheus logs
kubectl logs -f -l app=prometheus -n kube-ops
# Inspect the live configuration
kubectl get configmap prometheus-config -n kube-ops -o yaml
# Verify that targets are being scraped
kubectl exec -it <prometheus-pod> -n kube-ops -- wget -O- http://localhost:9090/api/v1/targets
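If the NodePort is not reachable from your workstation, kubectl port-forward is a quick way to reach the UI or API while debugging:
bash
# Forward local port 9090 to the Prometheus Service
kubectl port-forward -n kube-ops svc/prometheus 9090:9090
# Then query http://localhost:9090/api/v1/targets locally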
Part 6: Summary
With the hands-on deployment in this article you now have a complete Kubernetes monitoring stack:
- Collection layer: Node Exporter + cAdvisor + kube-state-metrics
- Processing layer: Prometheus + Alertmanager
- Presentation layer: Grafana
This setup has the following strengths:
- Coverage: nodes, containers, applications, and Kubernetes objects
- Flexibility: custom metrics and alerting rules
- Visualization: rich Grafana dashboards
- Extensibility: new scrape targets are easy to add
Remember: monitoring is not a one-off task but a process of continuous tuning. As the business grows, keep adjusting the monitoring strategy and alert thresholds so that monitoring genuinely safeguards stable operation.