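Monitoring Servers and Kubernetes Cluster Resources with Prometheus + Grafana

This guide shows how to deploy Prometheus and Grafana on Rocky Linux 8.5 to monitor and visualize server resources and Kubernetes cluster components. It covers environment preparation, component installation, configuration, and alerting.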

1. Environment overview

| IP | Hostname | CPU / memory / disk | Installed software | OS | Architecture |
| --- | --- | --- | --- | --- | --- |
| 192.168.200.11 | `k8s-master` | 2C/2G/100G | `node_exporter` | Rocky Linux 8.5 | x86 |
| 192.168.200.12 | `k8s-node1` | 2C/2G/100G | `node_exporter` | Rocky Linux 8.5 | x86 |
| 192.168.200.13 | `k8s-node2` | 2C/2G/100G | `node_exporter` | Rocky Linux 8.5 | x86 |
| 192.168.200.14 | `prometheus` | 2C/2G/100G | `prometheus`, `pushgateway`, `alertmanager`, `grafana-enterprise`, `node_exporter` | Rocky Linux 8.5 | x86 |

For the Kubernetes deployment itself, see: https://blog.csdn.net/weixin_45867513/article/details/144565165?spm=1011.2415.3001.5331

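2. Core components and ports

| Component | Package | Version | Purpose | Default port |
| --- | --- | --- | --- | --- |
| Prometheus | prometheus-2.29.1.linux-amd64.tar.gz | v2.29.1 | Monitoring system that pulls and stores metrics | 9090 |
| Node Exporter | node_exporter-1.2.2.linux-amd64.tar.gz | v1.2.2 | Collects basic host resource metrics such as CPU and memory | 9100 |
| Pushgateway | pushgateway-1.4.1.linux-amd64.tar.gz | v1.4.1 | Accepts metrics that jobs push to it | 9091 |
| Alertmanager | alertmanager-0.28.1.linux-amd64.tar.gz | v0.28.1 | Receives alerts and sends notifications, for example by email or DingTalk | 9093 |
| Grafana | grafana-enterprise-12.0.0.linux-amd64.tar.gz | v12.0.0 | Visualizes the monitoring data | 3000 |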

Downloads:

Prometheus: https://prometheus.io/download/
Grafana: https://grafana.com/grafana/download?pg=graf&plcmt=deploy-box-1
Baidu Netdisk (all packages): https://pan.baidu.com/s/18ekfed1AxZ8BmwO_bJHxNA (extraction code: 4hzu)

3. Environment preparation (on 192.168.200.14)

3.1 Set the hostname and disable the firewall

[root@localhost ~]# systemctl disable --now firewalld
[root@localhost ~]# hostnamectl set-hostname prometheus && bash

3.2 Configure the hosts file

[root@prometheus ~]# cat >> /etc/hosts << EOF
192.168.200.11 k8s-master
192.168.200.12 k8s-node1
192.168.200.13 k8s-node2
192.168.200.14 prometheus
EOF
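
Running the heredoc above twice leaves duplicate lines in /etc/hosts. The following is a minimal sketch of an idempotent alternative; the helper name `add_host_entry` and the scratch file are illustrative assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# add_host_entry FILE IP NAME: append "IP NAME" to FILE unless NAME is already present.
# (Hypothetical helper, shown for illustration.)
add_host_entry() {
  local file=$1 ip=$2 name=$3
  # -w matches NAME as a whole word, so k8s-node1 does not match k8s-node10;
  # -F treats NAME as a fixed string rather than a regular expression.
  if grep -qwF -- "$name" "$file" 2>/dev/null; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Demo against a scratch file; point HOSTS_FILE at /etc/hosts for real use.
HOSTS_FILE=$(mktemp)
add_host_entry "$HOSTS_FILE" 192.168.200.11 k8s-master
add_host_entry "$HOSTS_FILE" 192.168.200.11 k8s-master   # second call is a no-op
add_host_entry "$HOSTS_FILE" 192.168.200.14 prometheus
cat "$HOSTS_FILE"
```

The check is by hostname only, so an existing entry with a different IP is left untouched and has to be corrected by hand.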

3.3 Set up passwordless SSH

[root@prometheus ~]# ssh-keygen
[root@prometheus ~]# ssh-copy-id prometheus
[root@prometheus ~]# ssh-copy-id k8s-master
[root@prometheus ~]# ssh-copy-id k8s-node1
[root@prometheus ~]# ssh-copy-id k8s-node2

3.4 Configure time synchronization

yum install -y chrony
echo "server ntp.aliyun.com iburst" >> /etc/chrony.conf
systemctl enable --now chronyd

3.5 Upload the packages

mkdir -pv /data/software
cd /data/software
# Upload all component tar.gz packages to this directory

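4. Install the Prometheus components

4.1 Install Prometheus

[root@prometheus software]# tar -xf prometheus-2.29.1.linux-amd64.tar.gz -C /usr/local/
[root@prometheus software]# mv /usr/local/prometheus-2.29.1.linux-amd64 /usr/local/prometheus-2.29.1

Configuration file: /usr/local/prometheus-2.29.1/prometheus.yml. Back it up before editing, then set the scrape jobs as follows: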

scrape_configs:
  - job_name: "prometheus"
    static_configs:                      # list the scrape targets statically
      - targets: ["localhost:9090"]      # Prometheus itself (default port 9090)
  - job_name: 'pushgateway'              # job name that identifies this scrape job
    static_configs:
      - targets: ['192.168.200.14:9091'] # IP of the host that runs Prometheus and Pushgateway
        labels:
          instance: pushgateway
  - job_name: 'node-exporter'            # must match the job label used in the alert rules (section 7.3)
    static_configs:
      # Hosts to scrape; Prometheus pulls from http://<ip>:9100/metrics
      - targets: ['192.168.200.11:9100','192.168.200.12:9100','192.168.200.13:9100','192.168.200.14:9100']

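This configuration defines three scrape jobs:

Prometheus itself: monitors the health of the Prometheus server.

Pushgateway: collects the metrics pushed by short-lived jobs.

Node Exporter: monitors system resource usage on all four hosts.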
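
When the host list grows, typing the target array by hand becomes error-prone. The sketch below renders one static scrape job from a list of hosts; `render_job` is a hypothetical helper introduced here for illustration, and its output still has to be pasted under `scrape_configs:` manually.

```shell
#!/usr/bin/env bash
# render_job JOB PORT HOST...: print one static scrape job in prometheus.yml syntax.
# (Hypothetical helper, shown for illustration.)
render_job() {
  local job=$1 port=$2
  shift 2
  local targets="" host
  for host in "$@"; do
    # Prefix a comma separator for every entry except the first
    targets+="${targets:+, }'${host}:${port}'"
  done
  printf '  - job_name: "%s"\n    static_configs:\n      - targets: [%s]\n' "$job" "$targets"
}

render_job node-exporter 9100 192.168.200.11 192.168.200.12 192.168.200.13 192.168.200.14
```

After editing prometheus.yml, `promtool check config /usr/local/prometheus-2.29.1/prometheus.yml` (promtool ships in the Prometheus tarball) reports syntax errors before the service is restarted.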

Create the systemd unit /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus-2.29.1/prometheus --config.file=/usr/local/prometheus-2.29.1/prometheus.yml
WorkingDirectory=/usr/local/prometheus-2.29.1
Restart=always

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload
systemctl enable --now prometheus
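
The unit files for Prometheus, Pushgateway, Alertmanager, and Grafana in this guide differ only in description, command line, and working directory. As a sketch, they could be generated from one template; `render_unit` is a hypothetical helper, not something the original setup uses.

```shell
#!/usr/bin/env bash
# render_unit DESCRIPTION EXEC_START WORKDIR: print a simple systemd service unit.
# (Hypothetical helper, shown for illustration.)
render_unit() {
  cat <<EOF
[Unit]
Description=$1
After=network.target

[Service]
Type=simple
ExecStart=$2
WorkingDirectory=$3
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}

# Prints the Prometheus unit to stdout; redirect it to
# /etc/systemd/system/prometheus.service to install it.
render_unit Prometheus \
  "/usr/local/prometheus-2.29.1/prometheus --config.file=/usr/local/prometheus-2.29.1/prometheus.yml" \
  /usr/local/prometheus-2.29.1
```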

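4.2 Install Pushgateway

Prometheus collects metrics by pulling them from its targets, which does not work well for short-lived or batch jobs that may exit before they are scraped. Pushgateway acts as an intermediary: such jobs push their metrics to it, and Prometheus then scrapes Pushgateway like any other target.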

[root@prometheus prometheus-2.29.1]# cd /data/software/
[root@prometheus software]# tar -xf pushgateway-1.4.1.linux-amd64.tar.gz 
[root@prometheus software]# mv pushgateway-1.4.1.linux-amd64 /usr/local/pushgateway-1.4.1

Create the unit /etc/systemd/system/pushgateway.service:

[Unit]
Description=Pushgateway
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/pushgateway-1.4.1/pushgateway --web.listen-address :9091
WorkingDirectory=/usr/local/pushgateway-1.4.1
Restart=always

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload
systemctl enable --now pushgateway
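
Once Pushgateway is running, a job can push a sample with a plain HTTP POST to `/metrics/job/<job>` (optionally followed by `/instance/<instance>`). The sketch below assumes the Pushgateway address from this guide; the job name, metric name, and helper functions are made up for illustration.

```shell
#!/usr/bin/env bash
PUSHGATEWAY_URL=${PUSHGATEWAY_URL:-http://192.168.200.14:9091}

# push_url JOB [INSTANCE]: grouping-key URL that Pushgateway accepts.
push_url() {
  local url="$PUSHGATEWAY_URL/metrics/job/$1"
  if [ -n "${2:-}" ]; then
    url+="/instance/$2"
  fi
  printf '%s\n' "$url"
}

# metric_line NAME VALUE: one sample in the text exposition format.
metric_line() {
  printf '%s %s\n' "$1" "$2"
}

# push_metric JOB NAME VALUE [INSTANCE]: send one sample (needs network access).
push_metric() {
  metric_line "$2" "$3" | curl -fsS --data-binary @- "$(push_url "$1" "${4:-}")"
}

# Dry run: show what would be sent without touching the network.
echo "POST $(push_url backup_job db01)"
metric_line backup_last_success_timestamp_seconds "$(date +%s)"
```

A real push would be `push_metric backup_job backup_last_success_timestamp_seconds "$(date +%s)" db01`; the sample then appears on http://192.168.200.14:9091 and is picked up at the next Prometheus scrape.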

4.3 Install Alertmanager

Alertmanager receives the alerts that Prometheus fires and sends the notifications.

[root@prometheus software]# tar -xf alertmanager-0.28.1.linux-amd64.tar.gz
[root@prometheus software]# mv alertmanager-0.28.1.linux-amd64 /usr/local/alertmanager-0.28.1

Create the unit /etc/systemd/system/alertmanager.service:

[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/alertmanager-0.28.1/alertmanager --config.file=/usr/local/alertmanager-0.28.1/alertmanager.yml
WorkingDirectory=/usr/local/alertmanager-0.28.1
Restart=always

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload
systemctl enable --now alertmanager

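4.4 Install Node Exporter (all nodes)

Node Exporter reports the metrics of the host it runs on, so it must be installed on every node to be monitored. Prometheus pulls samples from the HTTP endpoint that each exporter exposes (usually /metrics) and stores them in its own time series database.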

[root@prometheus software]# tar -xf node_exporter-1.2.2.linux-amd64.tar.gz 
[root@prometheus software]# mv node_exporter-1.2.2.linux-amd64 /usr/local/node_exporter-1.2.2

Copy the directory to the other hosts:

[root@prometheus software]# cd /usr/local/
[root@prometheus local]# scp -r node_exporter-1.2.2/ k8s-master:/usr/local/
LICENSE                                    100%   11KB   6.3MB/s   00:00    
NOTICE                                     100%  463   426.7KB/s   00:00    
node_exporter                              100%   18MB  64.7MB/s   00:00    
[root@prometheus local]# scp -r node_exporter-1.2.2/ k8s-node1:/usr/local/
LICENSE                                    100%   11KB   6.7MB/s   00:00    
NOTICE                                     100%  463   211.1KB/s   00:00    
node_exporter                              100%   18MB  66.2MB/s   00:00    
[root@prometheus local]# scp -r node_exporter-1.2.2/ k8s-node2:/usr/local/
LICENSE                                    100%   11KB   5.9MB/s   00:00    
NOTICE                                     100%  463   394.6KB/s   00:00    
node_exporter                              100%   18MB  83.0MB/s   00:00

Create the unit /etc/systemd/system/node_exporter.service:

[root@prometheus local]# vim /etc/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/node_exporter-1.2.2/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

Copy the unit file to the other hosts:

[root@prometheus local]# scp -r /etc/systemd/system/node_exporter.service k8s-master:/etc/systemd/system/
node_exporter.service                      100%  254   271.2KB/s   00:00    
[root@prometheus local]# scp -r /etc/systemd/system/node_exporter.service k8s-node1:/etc/systemd/system/
node_exporter.service                      100%  254   199.4KB/s   00:00    
[root@prometheus local]# scp -r /etc/systemd/system/node_exporter.service k8s-node2:/etc/systemd/system/
node_exporter.service                      100%  254   130.4KB/s   00:00

Start the exporter on every host:

[root@prometheus local]# systemctl enable --now node_exporter
[root@k8s-master ~]# systemctl enable --now node_exporter
[root@k8s-node1 ~]# systemctl enable --now node_exporter
[root@k8s-node2 ~]# systemctl enable --now node_exporter

Check port 9100 on every host:

http://192.168.200.11:9100/metrics

http://192.168.200.12:9100/metrics

http://192.168.200.13:9100/metrics

http://192.168.200.14:9100/metrics

If the page loads, Node Exporter is running and listening on that host, and it is serving metrics at http://IP:9100/metrics for Prometheus to pull.
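
The /metrics page is plain text, one sample per line, so a single value can be read from the command line. A minimal sketch, assuming an unlabelled metric such as `node_load1`; the helper `metric_value` and the sample text are illustrative.

```shell
#!/usr/bin/env bash
# metric_value NAME: print the value of the unlabelled sample NAME read from stdin.
# (Hypothetical helper, shown for illustration.)
metric_value() {
  # Comment lines start with "#", so their first field never equals NAME
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Offline demo with text in the exposition format
sample='# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemTotal_bytes 2.0e+09'
printf '%s\n' "$sample" | metric_value node_load1
```

Against a live exporter the same helper would be used as `curl -s http://192.168.200.11:9100/metrics | metric_value node_load1`. Samples that carry labels, such as `node_cpu_seconds_total{cpu="0",mode="idle"}`, are not matched by this simple comparison.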

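Make sure Prometheus and Pushgateway are running:

systemctl daemon-reload
systemctl enable --now prometheus
systemctl enable --now pushgateway

Then open the Prometheus web UI at http://192.168.200.14:9090. The Status → Targets page lists the configured scrape targets and their state.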

5. Install and use Grafana

5.1 Install Grafana

[root@prometheus software]# cd /data/software/
[root@prometheus software]# tar -xf grafana-enterprise-12.0.0.linux-amd64.tar.gz -C /usr/local/

Create the Grafana unit /etc/systemd/system/grafana.service:

[Unit]
Description=Grafana
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/grafana-v12.0.0/bin/grafana-server web
WorkingDirectory=/usr/local/grafana-v12.0.0
Restart=always

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload 
systemctl enable --now grafana.service

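Open http://192.168.200.14:3000 (default credentials: admin / admin). Grafana forces a password change on first login.

5.2 Add a data source

In Grafana, open the left-hand menu → Data sources.
Add a Prometheus data source and set its URL to the Prometheus server address: http://192.168.200.14:9090
Click Save & test at the bottom of the page and confirm that the connection succeeds.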
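
The same data source can also be created through Grafana's HTTP API (`POST /api/datasources`) instead of the UI. The sketch below only builds and prints the request body; the helper names are hypothetical, and the credentials passed to `create_datasource` are whatever admin password was set at first login.

```shell
#!/usr/bin/env bash
GRAFANA_URL=${GRAFANA_URL:-http://192.168.200.14:3000}

# datasource_json NAME URL: request body for a Prometheus data source.
datasource_json() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1" "$2"
}

# create_datasource USER:PASSWORD NAME URL: POST the body to Grafana (needs network access).
create_datasource() {
  datasource_json "$2" "$3" |
    curl -fsS -u "$1" -H 'Content-Type: application/json' \
         --data-binary @- "$GRAFANA_URL/api/datasources"
}

# Print the body only; nothing is sent until create_datasource is called.
datasource_json Prometheus http://192.168.200.14:9090
```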

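5.3 Import a dashboard

Click the search button in the upper-right corner.

Click Import.

Enter dashboard ID 8919 and import it.

5.4 View the dashboards

Many more dashboards are available on the Grafana site: https://grafana.com/grafana/dashboards/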

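6. Kubernetes cluster monitoring (kube-state-metrics)

6.1 Install kube-state-metrics

Purpose:
kube-state-metrics watches the Kubernetes API server and converts the state of cluster objects (Pods, Deployments, Services, Namespaces, ConfigMaps, and so on) into time series that Prometheus can scrape.

Create the manifest kube-state-metrics.yaml: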

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - serviceaccounts
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingressclasses
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterrolebindings
  - clusterroles
  - rolebindings
  - roles
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: 1.9.0
    spec:
      automountServiceAccountToken: true
      containers:
      - image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.9.2
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - name: http-metrics
    port: 9090
    protocol: TCP
    targetPort: http-metrics
    nodePort: 31031
  - name: telemetry
    port: 9091
    protocol: TCP
    targetPort: telemetry
    nodePort: 31032
  selector:
    app.kubernetes.io/name: kube-state-metrics

Apply the manifest:

kubectl apply -f kube-state-metrics.yaml

# Check that the Pod is running
[root@k8s-master ~]# kubectl get pod -A |grep kube-state-metrics
kube-system   kube-state-metrics-cc7d6998c-kjj4b        1/1     Running   0              20h

[root@k8s-master ~]# kubectl get svc -A |grep kube-state-metrics
kube-system   kube-state-metrics   NodePort    10.100.7.78     <none>        9090:31031/TCP,9091:31032/TCP   20h

# Add this job under scrape_configs in prometheus.yml
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ['192.168.200.11:31031']

# Restart Prometheus
systemctl restart prometheus
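
After the restart, the `up` metric tells whether the new job is being scraped. The sketch below builds the selector and wraps the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`); the helper names are hypothetical and the address is the one used in this guide.

```shell
#!/usr/bin/env bash
PROM_URL=${PROM_URL:-http://192.168.200.14:9090}

# up_query JOB: PromQL selector for the scrape health of one job (1 = up, 0 = down).
up_query() {
  printf 'up{job="%s"}\n' "$1"
}

# query_prometheus EXPR: run an instant query (needs network access).
# -G with --data-urlencode lets curl URL-encode the braces and quotes.
query_prometheus() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Dry run: print the expression that would be sent
echo "query: $(up_query kube-state-metrics)"
```

Running `query_prometheus "$(up_query kube-state-metrics)"` on the Prometheus host returns a JSON document; a sample value of "1" means the NodePort target answered the last scrape.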

6.2 Import a dashboard

Import a Kubernetes dashboard in Grafana in the same way as in section 5.3.

7. Email alerts

7.1 Configure Alertmanager for email

cp /usr/local/alertmanager-0.28.1/alertmanager.yml{,.bak}
vim /usr/local/alertmanager-0.28.1/alertmanager.yml

This guide sends alert emails through a 163.com mailbox:

global:
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'jin1205577136@163.com'
  smtp_auth_username: 'jin1205577136@163.com'
  smtp_auth_password: '<your-smtp-password-or-authorization-code>'  # never publish the real value
  smtp_require_tls: false

templates:
  - '/usr/local/alertmanager-0.28.1/custom.tmpl'

route:
  receiver: '邮件接收人'          # receiver name ("email recipients"); must match the name under receivers
  group_wait: 0s
  group_interval: 30s
  repeat_interval: 300s

receivers:
- name: '邮件接收人'
  email_configs:
  - to: 'jin1205577136@163.com'                   # separate multiple recipients with commas
    send_resolved: true
    html: '{{ template "custom_alert_template" . }}'
    headers:
      Subject: |
        {{- if eq .Status "firing" -}}
        [⚠️ Urgent] 🚨 Firing: {{ .CommonLabels.alertname }} - instance: {{ .CommonLabels.instance }}
        {{- else -}}
        [✅ Resolved] 🔔 Alert resolved: {{ .CommonLabels.alertname }} - instance: {{ .CommonLabels.instance }}
        {{- end -}}

DingTalk can be used instead of email; that configuration is listed as option 2 in section 7.2.

7.2 Notification templates

**Note:** Two options follow, email and DingTalk. Pick one of them.

**Option 1:** Email

# Create the template
cd /usr/local/alertmanager-0.28.1/
vim custom.tmpl

{{ define "custom_alert_template" }}
{{- if eq .Status "firing" -}}
<div style="max-width:750px; margin:0 auto; font-family:'Inter',-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,sans-serif; background-color:#f9fafb; border-radius:12px; overflow:hidden; box-shadow:0 10px 25px -5px rgba(0,0,0,0.1),0 8px 10px -6px rgba(0,0,0,0.1);">
  <div style="background:linear-gradient(135deg, #e53935 0%, #b71c1c 100%); color:white; padding:25px 30px; position:relative; overflow:hidden;">
    <div style="position:absolute; top:0; right:0; width:100px; height:100px; opacity:0.1;">
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" width="100" height="100">
        <path d="M50,10 C80,10 100,30 100,50 C100,80 80,100 50,100 C20,100 0,80 0,50 C0,30 20,10 50,10 Z" fill="white"></path>
        <path d="M50,30 C66.57,30 80,43.43 80,60 C80,76.57 66.57,90 50,90 C33.43,90 20,76.57 20,60 C20,43.43 33.43,30 50,30 Z" fill="none" stroke="white" stroke-width="5"></path>
        <circle cx="50" cy="60" r="10" fill="white"></circle>
      </svg>
    </div>
    <div style="position:relative; z-index:1;">
      <div style="display:flex; align-items:center; margin-bottom:15px;">
        <div style="width:50px; height:50px; border-radius:50%; background-color:rgba(255,255,255,0.2); display:flex; align-items:center; justify-content:center; margin-right:15px; animation:pulse 2s infinite;">
          <span style="font-size:28px;">🚨</span>
        </div>
        <div>
          <h2 style="margin:0; font-weight:500; font-size:22px;">[Firing] {{ .CommonLabels.alertname }}</h2>
          <p style="margin:0; font-size:14px; opacity:0.8;">Severity: {{ .CommonLabels.severity | title }}</p>
        </div>
      </div>
    </div>
  </div>
{{- else -}}
<div style="max-width:750px; margin:0 auto; font-family:'Inter',-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,sans-serif; background-color:#f9fafb; border-radius:12px; overflow:hidden; box-shadow:0 10px 25px -5px rgba(0,0,0,0.1),0 8px 10px -6px rgba(0,0,0,0.1);">
  <div style="background:linear-gradient(135deg, #43a047 0%, #2e7d32 100%); color:white; padding:25px 30px; position:relative; overflow:hidden;">
    <div style="position:absolute; top:0; right:0; width:100px; height:100px; opacity:0.1;">
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" width="100" height="100">
        <path d="M50,10 C80,10 100,30 100,50 C100,80 80,100 50,100 C20,100 0,80 0,50 C0,30 20,10 50,10 Z" fill="white"></path>
        <path d="M30,50 L50,70 L75,35" fill="none" stroke="white" stroke-width="8" stroke-linecap="round"></path>
      </svg>
    </div>
    <div style="position:relative; z-index:1;">
      <div style="display:flex; align-items:center; margin-bottom:15px;">
        <div style="width:50px; height:50px; border-radius:50%; background-color:rgba(255,255,255,0.2); display:flex; align-items:center; justify-content:center; margin-right:15px;">
          <span style="font-size:28px;">✅</span>
        </div>
        <div>
          <h2 style="margin:0; font-weight:500; font-size:22px;">[Resolved] {{ .CommonLabels.alertname }}</h2>
          <p style="margin:0; font-size:14px; opacity:0.8;">Status is back to normal</p>
        </div>
      </div>
    </div>
  </div>
{{- end }}

  <div style="padding:30px; background-color:white;">
    {{- if .Alerts -}}
      {{- range .Alerts }}
        <div style="margin-bottom:25px; border-radius:10px; overflow:hidden; box-shadow:0 4px 6px -1px rgba(0,0,0,0.1),0 2px 4px -1px rgba(0,0,0,0.06); transition:transform 0.3s ease, box-shadow 0.3s ease;">
          <div style="padding:20px; background-color:
            {{- if eq $.Status "firing" -}}
              {{- if eq .Labels.severity "critical" }}#ffebee{{ else if eq .Labels.severity "warning" }}#fff3e0{{ else }}#e8f5e9{{ end -}}
            {{- else -}}
              #e8f5e9
            {{- end -}};
            border-left:4px solid;
            border-left-color:
            {{- if eq $.Status "firing" -}}
              {{- if eq .Labels.severity "critical" }}#e53935{{ else if eq .Labels.severity "warning" }}#fb8c00{{ else }}#43a047{{ end -}}
            {{- else -}}
              #43a047
            {{- end -}};">
            
            <div style="display:flex; justify-content:space-between; align-items:flex-start;">
              <div style="display:flex; align-items:center;">
                <div style="width:40px; height:40px; border-radius:50%; background-color:
                  {{- if eq $.Status "firing" -}}
                    {{- if eq .Labels.severity "critical" }}rgba(229,57,53,0.1){{ else if eq .Labels.severity "warning" }}rgba(251,140,0,0.1){{ else }}rgba(67,160,71,0.1){{ end -}}
                  {{- else -}}
                    rgba(67,160,71,0.1)
                  {{- end -}};
                  display:flex; align-items:center; justify-content:center; margin-right:15px;">
                  <span style="font-size:20px; color:
                    {{- if eq $.Status "firing" -}}
                      {{- if eq .Labels.severity "critical" }}#e53935{{ else if eq .Labels.severity "warning" }}#fb8c00{{ else }}#43a047{{ end -}}
                    {{- else -}}
                      #43a047
                    {{- end -}};">
                    {{- if eq $.Status "firing" -}}
                      {{- if eq .Labels.severity "critical" }}🔥{{ else if eq .Labels.severity "warning" }}⚠️{{ else }}🔍{{ end -}}
                    {{- else -}}
                      ✅
                    {{- end -}}
                  </span>
                </div>
                <h3 style="margin:0; color:#2d3748; font-weight:500; font-size:18px;">{{ .Labels.alertname }}</h3>
              </div>
              <div style="padding:4px 12px; border-radius:4px; background-color:
                {{- if eq $.Status "firing" -}}
                  {{- if eq .Labels.severity "critical" }}rgba(229,57,53,0.1){{ else if eq .Labels.severity "warning" }}rgba(251,140,0,0.1){{ else }}rgba(67,160,71,0.1){{ end -}}
                {{- else -}}
                  rgba(67,160,71,0.1)
                {{- end -}};
                color:
                {{- if eq $.Status "firing" -}}
                  {{- if eq .Labels.severity "critical" }}#e53935{{ else if eq .Labels.severity "warning" }}#fb8c00{{ else }}#43a047{{ end -}}
                {{- else -}}
                  #43a047
                {{- end -}};
                font-size:13px; font-weight:500;">
                {{- if eq $.Status "firing" -}}
                  {{ .Labels.severity | title }}
                {{- else -}}
                  Resolved
                {{- end -}}
              </div>
            </div>
            
            <div style="margin-top:15px; padding-left:55px;">
              <div style="display:grid; grid-template-columns:1fr 1fr; gap:10px;">
                <div style="padding:10px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                  <p style="margin:0; font-size:13px; color:#718096;">Instance</p>
                  <p style="margin:0; font-size:16px; font-weight:500;">{{ .Labels.instance }}</p>
                </div>
                <div style="padding:10px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                  <p style="margin:0; font-size:13px; color:#718096;">Started at</p>
                  <p style="margin:0; font-size:16px; font-weight:500;">{{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
                </div>
              </div>
              
              <div style="margin-top:15px; padding:15px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                <p style="margin:0 0 5px 0; font-size:13px; color:#718096;">Summary</p>
                <p style="margin:0; font-size:16px;">{{ .Annotations.summary }}</p>
              </div>
              
              {{- if .Annotations.description -}}
              <div style="margin-top:15px; padding:15px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                <p style="margin:0 0 5px 0; font-size:13px; color:#718096;">Details</p>
                <p style="margin:0; font-size:16px;">{{ .Annotations.description }}</p>
              </div>
              {{- end -}}
              
              {{- if .Annotations.runbook_url -}}
              <div style="margin-top:15px;">
                <a href="{{ .Annotations.runbook_url }}" style="display:inline-flex; align-items:center; padding:8px 16px; background-color:#1976d2; color:white; border-radius:6px; text-decoration:none; font-weight:500; transition:background-color 0.3s ease;">
                  <span style="margin-right:8px;">📖</span>
                  View the runbook
                </a>
              </div>
              {{- end -}}
            </div>
          </div>
        </div>
      {{- end -}}
    {{- else -}}
      <div style="text-align:center; padding:40px 0; color:#6b7280;">
        <div style="font-size:64px; color:#e5e7eb; margin-bottom:20px;">✅</div>
        <h3 style="margin:0 0 10px 0; font-weight:400; color:#374151;">All clear</h3>
        <p style="margin:0; max-width:300px; margin:0 auto;">No active alerts need attention</p>
      </div>
    {{- end -}}
  </div>
  
  <div style="padding:15px 30px; background-color:#f9fafb; border-top:1px solid #e5e7eb; text-align:center; font-size:13px; color:#9ca3af;">
    <p style="margin:0;">This alert was generated automatically by Alertmanager</p>
    <p style="margin:5px 0 0 0;">View all alerts: <a href="http://192.168.200.14:9093" style="color:#1976d2; text-decoration:none;">http://192.168.200.14:9093</a></p>
  </div>
</div>

<style>
  @keyframes pulse {
    0% { transform: scale(1); box-shadow: 0 0 0 0 rgba(239, 83, 80, 0.7); }
    70% { transform: scale(1.05); box-shadow: 0 0 0 10px rgba(239, 83, 80, 0); }
    100% { transform: scale(1); box-shadow: 0 0 0 0 rgba(239, 83, 80, 0); }
  }
</style>
{{ end }}

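# Restart Alertmanager so that it loads the new configuration and template
systemctl restart alertmanager.service

**Option 2:** DingTalk

Caution: according to the Alertmanager configuration reference, `webhook_configs` has no `headers` or `body` field. Alertmanager always posts its own JSON payload, so a DingTalk robot normally needs an adapter in between. Treat the configuration below as a sketch and validate it with `amtool check-config alertmanager.yml` before relying on it.

vim alertmanager.yml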

global:
  # SMTP settings removed (email is not used in this option)

templates:
  - '/usr/local/alertmanager-0.28.1/custom.tmpl'  # DingTalk template

route:
  receiver: 'dingtalk-notification'  # default receiver is now DingTalk
  group_wait: 0s
  group_interval: 30s
  repeat_interval: 300s

receivers:
- name: 'dingtalk-notification'
  webhook_configs:
  - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'  # replace with your DingTalk webhook
    send_resolved: true
    http_config:
      tls_config:
        insecure_skip_verify: true
    headers:
      Content-Type: 'application/json'
    body: '{{ template "dingtalk_alert_template" . }}'  # refers to the DingTalk template

vim custom.tmpl

{{/* DingTalk template (JSON body, markdown message type) */}}
{{ define "dingtalk_alert_template" }}
{
  "msgtype": "markdown",
  "markdown": {
    "title": "[{{ if eq .Status "firing" }}🔥 Firing{{ else }}✅ Resolved{{ end }}] {{ .CommonLabels.alertname }}",
    "text": "### {{ if eq .Status "firing" }}🔥 Alert firing (severity: {{ .CommonLabels.severity | toUpper }}){{ else }}✅ Alert resolved{{ end }}\n\n**Alert**: {{ .CommonLabels.alertname }}\n\n**Instance**: {{ .CommonLabels.instance }}\n\n**Description**: {{ .CommonAnnotations.description }}\n\n{{ range .Alerts }}- **Summary**: {{ .Annotations.summary }}\n- **Time**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}\n{{ end }}"
  }
}
{{ end }}

Both channels can be enabled at the same time; this guide does not cover that setup.

7.3 Configure the alert rules

# Create the rule file
vim /usr/local/prometheus-2.29.1/alert-rules.yml

groups:
# Host resource rules (infrastructure stability).
# severity uses critical/warning because the email template in section 7.2 tests for those values.
- name: host-resource-rules
  rules:
    # High CPU usage on a host
    - alert: 主机CPU使用率过高
      expr: (100 - avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 20s
      labels:
        severity: critical
      annotations:
        summary: "Host {{ $labels.instance }} CPU usage is high (current: {{ $value | printf \"%.2f\" }}%)"
        description: "CPU usage has stayed above 80%, which can slow down services. Check for processes with high load."

    # High memory usage on a host (guards against OOM)
    - alert: 主机内存使用率过高
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
      for: 20s
      labels:
        severity: critical
      annotations:
        summary: "Host {{ $labels.instance }} memory usage is high (current: {{ $value | printf \"%.2f\" }}%)"
        description: "Memory usage has stayed above 85%, which can trigger the OOM killer and terminate services."

    # Low disk space on a host
    - alert: 磁盘空间不足
      expr: (1 - node_filesystem_free_bytes{fstype=~"ext4|xfs|btrfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}) * 100 > 80
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} partition {{ $labels.mountpoint }} is low on space (current: {{ $value | printf \"%.2f\" }}%)"
        description: "Disk usage is above 80%. Log writes may fail and applications may not start. Clean up or extend the volume."

    # High disk I/O latency: average time per completed read or write, in milliseconds
    - alert: 磁盘IO延迟过高
      expr: (rate(node_disk_read_time_seconds_total[5m]) + rate(node_disk_write_time_seconds_total[5m])) / (rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])) * 1000 > 100
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} disk {{ $labels.device }} I/O latency is high (current: {{ $value | printf \"%.2f\" }}ms)"
        description: "Disk I/O latency has stayed above 100 ms. Possible causes are a failing disk or excessive load."

    # High network bandwidth utilization (both sides keep the same labels, so they match one to one)
    - alert: 网络带宽利用率过高
      expr: (rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])) / node_network_speed_bytes * 100 > 80
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} interface {{ $labels.device }} bandwidth utilization is high (current: {{ $value | printf \"%.2f\" }}%)"
        description: "Bandwidth utilization has stayed above 80%, which can cause packet loss and request timeouts."

    # High system load: 1-minute load compared with 80% of the CPU core count
    - alert: 系统负载过高
      expr: node_load1 > on(instance) count by(instance)(node_cpu_seconds_total{mode="idle"}) * 0.8
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} system load is high (current: {{ $value | printf \"%.2f\" }})"
        description: "The 1-minute load exceeds 80% of the CPU core count. Possible causes are runaway processes or resource contention."

    # Node unreachable (the job label must match the job_name in prometheus.yml)
    - alert: 节点不可达
      expr: up{job="node-exporter"} == 0
      for: 20s
      labels:
        severity: critical
      annotations:
        summary: "Host {{ $labels.instance }} is unreachable"
        description: "Node Exporter cannot be scraped. The host may be down, the network interrupted, or the service stopped."

# Kubernetes resource rules (container platform stability).
# The container_* and kubelet_* metrics come from the kubelet/cAdvisor endpoints, which the
# scrape configuration in this guide does not include; those rules stay idle until such a job exists.
- name: kubernetes-resource-rules
  rules:
    # High Pod CPU usage relative to its CPU limit (quota / period = limit in cores)
    - alert: PodCPU使用率过高
      expr: sum by(namespace, pod)(rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])) / sum by(namespace, pod)(container_spec_cpu_quota{container!="POD", container!=""} / container_spec_cpu_period{container!="POD", container!=""}) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU usage is high (current: {{ $value | printf \"%.2f\" }}%)"
        description: "Pod CPU usage has stayed above 80% of its limit. The containers may be throttled, which hurts performance."

    # High Pod memory usage relative to its memory limit
    - alert: Pod内存使用率过高
      expr: sum by(namespace, pod)(container_memory_usage_bytes{container!="POD", container!=""}) / sum by(namespace, pod)(container_spec_memory_limit_bytes{container!="POD", container!=""}) * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is high (current: {{ $value | printf \"%.2f\" }}%)"
        description: "Pod memory usage has stayed above 90% of its limit, which can trigger an OOM kill and restart the container."

    # Deployment replica mismatch (service availability)
    - alert: Deployment副本不足
      expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} does not have the expected replicas"
        description: "{{ $value | printf \"%.0f\" }} replicas are expected, but the number of available replicas differs, which may affect service availability."

    # PVC running out of capacity
    - alert: PVCCapacity不足
      expr: (1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
      for: 24h
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is running out of capacity (current: {{ $value | printf \"%.2f\" }}%)"
        description: "PVC usage has been above 80% for 24 hours. Extend the volume to avoid failed writes."

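# Application rules (tied directly to the business).
# These are examples: the metrics below exist only if the applications or their exporters expose them.
- name: application-rules
  rules:
    # High HTTP error rate
    - alert: HTTP请求错误率过高
      expr: sum by(job)(rate(http_requests_total{status=~"5.."}[5m])) / sum by(job)(rate(http_requests_total[5m])) * 100 > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Service {{ $labels.job }} HTTP error rate is high (current: {{ $value | printf \"%.2f\" }}%)"
        description: "The HTTP 5xx rate has stayed above 5%. Possible causes are crashes, failing dependencies, or insufficient resources."

    # Database connection pool nearly exhausted
    - alert: 数据库连接池耗尽
      expr: database_connections_active / database_connections_max * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Database {{ $labels.instance }} connection pool is nearly exhausted (current: {{ $value | printf \"%.2f\" }}%)"
        description: "Connection pool usage is above 90%. New requests may fail to get a connection and the service may stop responding."

    # Message queue backlog
    - alert: 消息队列堆积
      expr: sum by(queue)(rabbitmq_queue_messages_ready) > 10000
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Queue {{ $labels.queue }} has a large backlog (current: {{ $value | printf \"%.0f\" }} messages)"
        description: "More than 10000 messages are waiting. The consumers may be a bottleneck; check the consumer logic or scale out."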
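
The host CPU rule above computes usage as 100 minus the idle share: the idle counter grows by at most one second per wall-clock second and core, so its rate is the idle fraction. The arithmetic can be checked by hand with two readings of the counter; `cpu_usage_percent` and the sample numbers are illustrative assumptions.

```shell
#!/usr/bin/env bash
# cpu_usage_percent IDLE_T0 IDLE_T1 INTERVAL_SECONDS
# Usage in percent for one core, from two readings of its idle counter (in seconds).
# (Hypothetical helper, shown for illustration.)
cpu_usage_percent() {
  awk -v a="$1" -v b="$2" -v dt="$3" \
    'BEGIN { printf "%.2f\n", 100 - (b - a) / dt * 100 }'
}

# The idle counter grew by 3 s during a 15 s scrape interval:
# idle share 20 %, usage 80 %, which is exactly the alert threshold.
cpu_usage_percent 1000 1003 15
```

Before restarting Prometheus, `promtool check rules /usr/local/prometheus-2.29.1/alert-rules.yml` validates the syntax of the rule file and of every expression in it.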

7.4 Connect Prometheus to Alertmanager

vim /usr/local/prometheus-2.29.1/prometheus.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["192.168.200.14:9093"]  # Alertmanager address (default port 9093)

rule_files:
  - "/usr/local/prometheus-2.29.1/alert-rules.yml"  # path of the alert rule file

# Restart Prometheus
systemctl restart prometheus.service

7.5 Test the CPU alert

Drive the CPU usage of 192.168.200.14 past the threshold to trigger the alert.

# Write the test script on 192.168.200.14
vim cpu-test.sh

#!/usr/bin/env bash
# File: cpu-test.sh
# Usage: bash cpu-test.sh [duration in seconds]
# Runs for 60 seconds by default, or for the number of seconds given as the first argument.

# Run time in seconds
DURATION=${1:-60}

# Time at which the busy loops stop
END_TIME=$((SECONDS + DURATION))

# Start one busy-loop subprocess per CPU core
CPU_CORES=$(nproc)
echo ">> Starting $CPU_CORES busy loops to saturate every CPU core for $DURATION seconds"

for ((i=1; i<=CPU_CORES; i++)); do
  (
    # Spin until the deadline is reached
    while [ "$SECONDS" -lt "$END_TIME" ]; do
      :  # no-op that keeps the CPU busy
    done
  ) &
done

# Wait for all subprocesses to finish
wait
echo ">> CPU stress test finished."

# Follow the Alertmanager log in one terminal before running the script
journalctl -u alertmanager.service -f

# Run the script in a second terminal
bash cpu-test.sh

# Log output
5月 14 03:01:39 prometheus alertmanager[7188]: time=2025-05-14T07:01:39.747Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=主机CPU使用率过高[c57a700][resolved]
5月 14 03:01:54 prometheus alertmanager[7188]: time=2025-05-14T07:01:54.746Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=主机CPU使用率过高[8ccbb6e][resolved]
5月 14 03:01:54 prometheus alertmanager[7188]: time=2025-05-14T07:01:54.752Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts="[主机CPU使用率过高[c57a700][resolved] 主机CPU使用率过高[8ccbb6e][resolved]]"
5月 14 03:01:56 prometheus alertmanager[7188]: time=2025-05-14T07:01:56.043Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=邮件接收人 integration=email[0] aggrGroup={}:{} attempts=1 duration=1.291165694s alerts="[主机CPU使用率过高[c57a700][resolved] 主机CPU使用率过高[8ccbb6e][resolved]]"
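
The notification path can also be exercised without loading the CPU by posting a hand-made alert to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below only builds and prints the JSON body; the helper names and the alert name `ManualTest` are made up for illustration.

```shell
#!/usr/bin/env bash
ALERTMANAGER_URL=${ALERTMANAGER_URL:-http://192.168.200.14:9093}

# alert_json ALERTNAME SEVERITY INSTANCE SUMMARY: body for POST /api/v2/alerts.
# The arguments must not contain double quotes, since no JSON escaping is done.
alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"%s"}}]\n' \
    "$1" "$2" "$3" "$4"
}

# send_test_alert ALERTNAME SEVERITY INSTANCE SUMMARY: post the alert (needs network access).
send_test_alert() {
  alert_json "$@" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
         --data-binary @- "$ALERTMANAGER_URL/api/v2/alerts"
}

# Print the body only; nothing is sent until send_test_alert is called.
alert_json ManualTest warning 192.168.200.14:9100 "manual test alert"
```

Calling `send_test_alert ManualTest warning 192.168.200.14:9100 "manual test alert"` should produce a "Received alert" line in the journal similar to the ones above, followed by a notification through the configured receiver.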
