Monitoring Servers and Kubernetes Cluster Resources with Prometheus + Grafana

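This document describes in detail how to deploy Prometheus + Grafana on Rocky Linux 8.5 to provide visual monitoring of server resources and Kubernetes cluster components. It covers environment preparation, component installation, configuration tuning, and alerting setup.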

1. Environment Overview

IP | Hostname | CPU/Memory/Disk | Installed software | OS | Architecture
192.168.200.11 | k8s-master | 2C/2G/100G | node_exporter | Rocky Linux 8.5 | x86
192.168.200.12 | k8s-node1 | 2C/2G/100G | node_exporter | Rocky Linux 8.5 | x86
192.168.200.13 | k8s-node2 | 2C/2G/100G | node_exporter | Rocky Linux 8.5 | x86
192.168.200.14 | prometheus | 2C/2G/100G | prometheus, pushgateway, alertmanager, grafana-enterprise, node_exporter | Rocky Linux 8.5 | x86

For the Kubernetes deployment, see https://blog.csdn.net/weixin_45867513/article/details/144565165?spm=1011.2415.3001.5331

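2. Core Components and Ports

Component | Package | Version | Purpose | Default port
Prometheus | prometheus-2.29.1.linux-amd64.tar.gz | v2.29.1 | Monitoring system; scrapes and stores metrics | 9090
Node Exporter | node_exporter-1.2.2.linux-amd64.tar.gz | v1.2.2 | Collects basic host resource metrics such as CPU and memory | 9100
Pushgateway | pushgateway-1.4.1.linux-amd64.tar.gz | v1.4.1 | Accepts metrics that jobs actively push | 9091
Alertmanager | alertmanager-0.28.1.linux-amd64.tar.gz | v0.28.1 | Receives alerts and sends notifications, for example by email or DingTalk | 9093
Grafana | grafana-enterprise-12.0.0.linux-amd64.tar.gz | v12.0.0 | Visualizes monitoring data | 3000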

Download method:

3. Environment Preparation (on 192.168.200.14)

3.1 Set the hostname and disable the firewall

[root@localhost ~]# systemctl disable --now firewalld
[root@localhost ~]# hostnamectl set-hostname prometheus && bash

3.2 Configure the hosts file

[root@prometheus ~]# cat >> /etc/hosts << EOF
192.168.200.11 k8s-master
192.168.200.12 k8s-node1
192.168.200.13 k8s-node2
192.168.200.14 prometheus
EOF
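
Because `cat >> /etc/hosts` appends on every run, repeating this step duplicates the entries. A minimal sketch of an idempotent variant follows. The function name `add_host_entry` and the `HOSTS_FILE` variable are illustrative assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Sketch: append an "IP hostname" pair to a hosts file only if the hostname is not there yet.
# HOSTS_FILE and add_host_entry are hypothetical names chosen for this example.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"

add_host_entry() {
  local ip="$1" name="$2"
  # -q: quiet, -w: match the hostname as a whole word, -F: treat it as a fixed string
  if ! grep -qwF -- "$name" "$HOSTS_FILE" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
  fi
}
```

It could be called once per host, for example `add_host_entry 192.168.200.11 k8s-master`; calling it again with the same hostname leaves the file unchanged.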

3.3 Set up passwordless SSH login

[root@prometheus ~]# ssh-keygen
[root@prometheus ~]# ssh-copy-id prometheus
[root@prometheus ~]# ssh-copy-id k8s-master
[root@prometheus ~]# ssh-copy-id k8s-node1
[root@prometheus ~]# ssh-copy-id k8s-node2

3.4 Configure time synchronization

yum install -y chrony
echo "server ntp.aliyun.com iburst" >> /etc/chrony.conf
systemctl enable --now chronyd

3.5 Upload the software packages

mkdir -pv /data/software
cd /data/software
# Upload the tar.gz packages of all components to this directory

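4. Installing the Prometheus Components

4.1 Installing Prometheus

[root@prometheus software]# tar -xf prometheus-2.29.1.linux-amd64.tar.gz -C /usr/local/
[root@prometheus software]# mv /usr/local/prometheus-2.29.1.linux-amd64 /usr/local/prometheus-2.29.1

Configuration file: /usr/local/prometheus-2.29.1/prometheus.yml. Back up the file before modifying it.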

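scrape_configs:                          # These first two lines already exist in the default file
  - job_name: "prometheus"
    static_configs:                      # Use static configuration to list the scrape targets
      - targets: ["localhost:9090"]      # Prometheus's own address (default port 9090)
  - job_name: 'pushgateway'              # Job name that identifies this scrape job
    static_configs:
        - targets: ['192.168.200.14:9091']  # IP of the host running Prometheus
          labels:
            instance: pushgateway
  - job_name: 'node-exporter'            # Must match the job label used by the alert rule up{job="node-exporter"}
    static_configs:
        - targets: ['192.168.200.11:9100','192.168.200.12:9100','192.168.200.13:9100','192.168.200.14:9100']   # Hosts to scrape; Prometheus pulls data from http://<ip>:9100/metrics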

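This configuration defines three kinds of monitoring:

Prometheus itself: monitors the running state of the Prometheus service.

Pushgateway: receives metrics pushed by short-lived jobs.

Node Exporter hosts: monitors system resource usage on multiple servers.

Create the systemd service: /etc/systemd/system/prometheus.service

[Unit]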
Description=Prometheus
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus-2.29.1/prometheus --config.file=/usr/local/prometheus-2.29.1/prometheus.yml
WorkingDirectory=/usr/local/prometheus-2.29.1
Restart=always

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload
systemctl enable --now prometheus

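4.2 Installing Pushgateway

Prometheus works on a pull model, which makes it hard to collect metrics from short-lived or batch jobs that may exit before they are scraped. Pushgateway acts as a relay between such jobs and Prometheus: the jobs push their metrics to Pushgateway, and Prometheus scrapes them from there.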
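
To make the push side concrete, here is a minimal sketch of how a job might send one sample to Pushgateway over its HTTP API (`/metrics/job/<job>/instance/<instance>`). The helper names, the job name, and the metric name are assumptions made for illustration; the address is the Pushgateway host from this document.

```shell
#!/usr/bin/env bash
# Sketch: push one gauge sample to Pushgateway over HTTP.
# PUSHGATEWAY_URL and all function names are hypothetical choices for this example.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://192.168.200.14:9091}"

# Build the grouping-key URL: /metrics/job/<job>/instance/<instance>
push_url() {
  local job="$1" instance="$2"
  printf '%s/metrics/job/%s/instance/%s' "$PUSHGATEWAY_URL" "$job" "$instance"
}

# Render a single sample in the Prometheus text exposition format.
render_metric() {
  local name="$1" value="$2"
  printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$value"
}

# Send the sample; curl reads the request body from stdin (@-).
push_metric() {
  local job="$1" instance="$2" name="$3" value="$4"
  render_metric "$name" "$value" |
    curl --silent --fail --data-binary @- "$(push_url "$job" "$instance")"
}
```

A batch job could then end with something like `push_metric nightly_backup "$(hostname)" backup_last_success_timestamp_seconds "$(date +%s)"`, and the sample would appear on the next scrape of the pushgateway job.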

[root@prometheus prometheus-2.29.1]# cd /data/software/
[root@prometheus software]# tar -xf pushgateway-1.4.1.linux-amd64.tar.gz 
[root@prometheus software]# mv pushgateway-1.4.1.linux-amd64 /usr/local/pushgateway-1.4.1

Create the service: /etc/systemd/system/pushgateway.service

[Unit]
Description=Pushgateway
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/pushgateway-1.4.1/pushgateway --web.listen-address :9091
WorkingDirectory=/usr/local/pushgateway-1.4.1
Restart=always

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload
systemctl enable --now pushgateway

4.3 Installing Alertmanager

Alertmanager handles alert notifications.

[root@prometheus software]# tar -xf alertmanager-0.28.1.linux-amd64.tar.gz 
[root@prometheus software]# mv alertmanager-0.28.1.linux-amd64 /usr/local/alertmanager-0.28.1

Create the service: /etc/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/alertmanager-0.28.1/alertmanager --config.file=/usr/local/alertmanager-0.28.1/alertmanager.yml
WorkingDirectory=/usr/local/alertmanager-0.28.1
Restart=always

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload
systemctl enable --now alertmanager

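4.4 Installing Node Exporter (all nodes)

Node Exporter reports metrics for the node it is installed on. Prometheus pulls monitoring samples from the HTTP endpoint the exporter exposes (usually /metrics) and stores them in its own time-series database.

[root@prometheus software]# tar -xf node_exporter-1.2.2.linux-amd64.tar.gz 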
[root@prometheus software]# mv node_exporter-1.2.2.linux-amd64 /usr/local/node_exporter-1.2.2

Copy it to the other hosts:

[root@prometheus software]# cd /usr/local/
[root@prometheus local]# scp -r node_exporter-1.2.2/ k8s-master:/usr/local/
LICENSE                                    100%   11KB   6.3MB/s   00:00    
NOTICE                                     100%  463   426.7KB/s   00:00    
node_exporter                              100%   18MB  64.7MB/s   00:00    
[root@prometheus local]# scp -r node_exporter-1.2.2/ k8s-node1:/usr/local/
LICENSE                                    100%   11KB   6.7MB/s   00:00    
NOTICE                                     100%  463   211.1KB/s   00:00    
node_exporter                              100%   18MB  66.2MB/s   00:00    
[root@prometheus local]# scp -r node_exporter-1.2.2/ k8s-node2:/usr/local/
LICENSE                                    100%   11KB   5.9MB/s   00:00    
NOTICE                                     100%  463   394.6KB/s   00:00    
node_exporter                              100%   18MB  83.0MB/s   00:00

Create the service: /etc/systemd/system/node_exporter.service

[root@prometheus local]# vim /etc/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/node_exporter-1.2.2/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

Copy the unit file to the other hosts:

[root@prometheus local]# scp -r /etc/systemd/system/node_exporter.service k8s-master:/etc/systemd/system/
node_exporter.service                      100%  254   271.2KB/s   00:00    
[root@prometheus local]# scp -r /etc/systemd/system/node_exporter.service k8s-node1:/etc/systemd/system/
node_exporter.service                      100%  254   199.4KB/s   00:00    
[root@prometheus local]# scp -r /etc/systemd/system/node_exporter.service k8s-node2:/etc/systemd/system/
node_exporter.service                      100%  254   130.4KB/s   00:00 

Start the node_exporter service on all hosts:

[root@prometheus local]# systemctl enable --now node_exporter
[root@k8s-master ~]# systemctl enable --now node_exporter
[root@k8s-node1 ~]# systemctl enable --now node_exporter
[root@k8s-node2 ~]# systemctl enable --now node_exporter

Open port 9100 on every host:

192.168.200.11:9100/metrics

192.168.200.12:9100/metrics

192.168.200.13:9100/metrics

192.168.200.14:9100/metrics

If the page loads, the Node Exporter service on that host has started and is listening, and it is serving data at http://IP:9100/metrics for Prometheus to scrape.
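
The /metrics page is plain text: comment lines start with `#`, and each sample line is a metric name followed by its value. As a rough illustration, the sketch below pulls one unlabeled metric out of that text. The helper name `metric_value` is an assumption for this example, and it deliberately ignores metrics that carry labels.

```shell
#!/usr/bin/env bash
# Sketch: read Prometheus text-format metrics on stdin and print the value of one unlabeled metric.
# metric_value is a hypothetical helper name.
metric_value() {
  local name="$1"
  # "# HELP" and "# TYPE" lines have "#" as their first field, so they never match the metric name.
  awk -v m="$name" '$1 == m { print $2; exit }'
}
```

Against a running exporter it could be used as `curl -s http://192.168.200.11:9100/metrics | metric_value node_load1`, which prints the host's current 1-minute load average.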

Start Prometheus and Pushgateway:

systemctl daemon-reload
systemctl enable --now prometheus
systemctl enable --now pushgateway

Open the Prometheus web UI at http://192.168.200.14:9090.

5. Installing and Using Grafana

5.1 Installing Grafana

[root@prometheus software]# cd /data/software/
[root@prometheus software]# tar -xf grafana-enterprise-12.0.0.linux-amd64.tar.gz -C /usr/local/

Create the Grafana service unit /etc/systemd/system/grafana.service:

[Unit]
Description=Grafana
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/grafana-v12.0.0/bin/grafana-server web
WorkingDirectory=/usr/local/grafana-v12.0.0
Restart=always

[Install]
WantedBy=multi-user.target

systemctl daemon-reload 
systemctl enable --now grafana.service 

Access URL: http://192.168.200.14:3000 (default username and password: admin / admin). Grafana prompts you to change the password at first login.

5.2 Adding a data source

  1. Open Data sources in the Grafana left-hand menu.
  2. Add a Prometheus data source and enter the Prometheus server address: http://192.168.200.14:9090
  3. Click the Save & Test button at the bottom of the page and confirm that the connection succeeds.
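
The same data source can also be declared in a provisioning file instead of through the UI. The sketch below writes such a file; the directory assumes the tarball layout used in this document (`conf/provisioning/datasources` under the Grafana home directory), and the function name is a hypothetical choice for this example.

```shell
#!/usr/bin/env bash
# Sketch: write a Grafana data source provisioning file for Prometheus.
# PROVISIONING_DIR assumes the tarball layout from this document; adjust it if yours differs.
PROVISIONING_DIR="${PROVISIONING_DIR:-/usr/local/grafana-v12.0.0/conf/provisioning/datasources}"

write_prometheus_datasource() {
  local url="$1"
  mkdir -p "$PROVISIONING_DIR"
  # The heredoc is unquoted so that ${url} is expanded into the file.
  cat > "$PROVISIONING_DIR/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}
```

After running `write_prometheus_datasource http://192.168.200.14:9090`, restarting the grafana service should make the data source appear without any clicks in the UI.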

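5.3 Importing a dashboard template

Click the search button in the upper-right corner.

Click Import dashboard.

Enter dashboard ID 8919 and import it.

5.4 Graphical data display

The Grafana website, https://grafana.com/grafana/dashboards/, offers many more dashboard templates to choose from.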

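6. Kubernetes Cluster Monitoring (installing kube-state-metrics)

6.1 Installing the kube-state-metrics service

  • Purpose
    kube-state-metrics watches the Kubernetes API server and converts the state of cluster resource objects (Pod, Deployment, Service, Namespace, ConfigMap, and so on) into time-series metrics that Prometheus can scrape.

Create the YAML file kube-state-metrics.yaml:

apiVersion: rbac.authorization.k8s.io/v1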
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - serviceaccounts
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingressclasses
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterrolebindings
  - clusterroles
  - rolebindings
  - roles
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: 1.9.0
    spec:
      automountServiceAccountToken: true
      containers:
      - image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.9.2
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 1.9.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - name: http-metrics
    port: 9090
    protocol: TCP
    targetPort: http-metrics
    nodePort: 31031
  - name: telemetry
    port: 9091
    protocol: TCP
    targetPort: telemetry
    nodePort: 31032
  selector:
    app.kubernetes.io/name: kube-state-metrics

kubectl apply -f kube-state-metrics.yaml 

# Check the pod status
[root@k8s-master01 ~]# kubectl get pod -A |grep kube-state-metrics
kube-system   kube-state-metrics-cc7d6998c-kjj4b        1/1     Running   0              20h

[root@k8s-master01 ~]# kubectl get svc -A |grep kube-state-metrics
kube-system   kube-state-metrics   NodePort    10.100.7.78     <none>        9090:31031/TCP,9091:31032/TCP   20h

# Add the following job to the scrape_configs section of prometheus.yml
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ['192.168.200.11:31031']

# Restart Prometheus
systemctl restart prometheus

6.2 Importing a dashboard template

7. Email Alerting

7.1 Alertmanager email configuration

cp /usr/local/alertmanager-0.28.1/alertmanager.yml{,.bak}

vim /usr/local/alertmanager-0.28.1/alertmanager.yml

This document uses a 163 mailbox to send alert email:

global:
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'jin1205577136@163.com'
  smtp_auth_username: 'jin1205577136@163.com'
  smtp_auth_password: '<your-smtp-authorization-code>'   # Placeholder: use the SMTP authorization code of your own mailbox
  smtp_require_tls: false

templates:
  - '/usr/local/alertmanager-0.28.1/custom.tmpl'

route:
  receiver: '邮件接收人'
  group_wait: 0s
  group_interval: 30s
  repeat_interval: 300s

receivers:
- name: '邮件接收人'
  email_configs:
  - to: 'jin1205577136@163.com'                   # Separate multiple recipients with commas
    send_resolved: true
    html: '{{ template "custom_alert_template" . }}'
    headers:
      Subject: |
        {{- if eq .Status "firing" -}}
        [⚠️ Urgent alert] 🚨 Firing: {{ .CommonLabels.alertname }} - instance: {{ .CommonLabels.instance }}
        {{- else -}}
        [✅ Recovered] 🔔 Alert resolved: {{ .CommonLabels.alertname }} - instance: {{ .CommonLabels.instance }}
        {{- end -}}

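To send alerts to DingTalk instead, the configuration looks like the following. Note that Alertmanager's webhook_configs does not define headers or body keys, so Alertmanager is likely to reject the file as written; a relay such as prometheus-webhook-dingtalk is normally placed between Alertmanager and the DingTalk robot.

global:
  # SMTP settings removed (not needed if email is no longer used)

templates:
  - '/usr/local/alertmanager-0.28.1/custom.tmpl'  # Only needs to contain the DingTalk template

route:
  receiver: 'dingtalk-notification'  # Default receiver changed to DingTalk
  group_wait: 0s
  group_interval: 30s
  repeat_interval: 300s

receivers:
- name: 'dingtalk-notification'
  webhook_configs:
  - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'  # Replace with your DingTalk Webhook
    send_resolved: true
    http_config:
      tls_config:
        insecure_skip_verify: true
    headers:
      Content-Type: 'application/json'
    body: '{{ template "dingtalk_alert_template" . }}'  # References the DingTalk template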

7.2 Email template configuration

**Note:** Two options follow, email and DingTalk. Choose one of them.

**Option 1:** Email

# Create the template
 cd /usr/local/alertmanager-0.28.1/
 vim custom.tmpl

{{ define "custom_alert_template" }}
{{- if eq .Status "firing" -}}
<div style="max-width:750px; margin:0 auto; font-family:'Inter',-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,sans-serif; background-color:#f9fafb; border-radius:12px; overflow:hidden; box-shadow:0 10px 25px -5px rgba(0,0,0,0.1),0 8px 10px -6px rgba(0,0,0,0.1);">
  <div style="background:linear-gradient(135deg, #e53935 0%, #b71c1c 100%); color:white; padding:25px 30px; position:relative; overflow:hidden;">
    <div style="position:absolute; top:0; right:0; width:100px; height:100px; opacity:0.1;">
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" width="100" height="100">
        <path d="M50,10 C80,10 100,30 100,50 C100,80 80,100 50,100 C20,100 0,80 0,50 C0,30 20,10 50,10 Z" fill="white"></path>
        <path d="M50,30 C66.57,30 80,43.43 80,60 C80,76.57 66.57,90 50,90 C33.43,90 20,76.57 20,60 C20,43.43 33.43,30 50,30 Z" fill="none" stroke="white" stroke-width="5"></path>
        <circle cx="50" cy="60" r="10" fill="white"></circle>
      </svg>
    </div>
    <div style="position:relative; z-index:1;">
      <div style="display:flex; align-items:center; margin-bottom:15px;">
        <div style="width:50px; height:50px; border-radius:50%; background-color:rgba(255,255,255,0.2); display:flex; align-items:center; justify-content:center; margin-right:15px; animation:pulse 2s infinite;">
          <span style="font-size:28px;">🚨</span>
        </div>
        <div>
          <h2 style="margin:0; font-weight:500; font-size:22px;">[Firing] {{ .CommonLabels.alertname }}</h2>
          <p style="margin:0; font-size:14px; opacity:0.8;">Severity: {{ .CommonLabels.severity | title }}</p>
        </div>
      </div>
    </div>
  </div>
{{- else -}}
<div style="max-width:750px; margin:0 auto; font-family:'Inter',-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,sans-serif; background-color:#f9fafb; border-radius:12px; overflow:hidden; box-shadow:0 10px 25px -5px rgba(0,0,0,0.1),0 8px 10px -6px rgba(0,0,0,0.1);">
  <div style="background:linear-gradient(135deg, #43a047 0%, #2e7d32 100%); color:white; padding:25px 30px; position:relative; overflow:hidden;">
    <div style="position:absolute; top:0; right:0; width:100px; height:100px; opacity:0.1;">
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" width="100" height="100">
        <path d="M50,10 C80,10 100,30 100,50 C100,80 80,100 50,100 C20,100 0,80 0,50 C0,30 20,10 50,10 Z" fill="white"></path>
        <path d="M30,50 L50,70 L75,35" fill="none" stroke="white" stroke-width="8" stroke-linecap="round"></path>
      </svg>
    </div>
    <div style="position:relative; z-index:1;">
      <div style="display:flex; align-items:center; margin-bottom:15px;">
        <div style="width:50px; height:50px; border-radius:50%; background-color:rgba(255,255,255,0.2); display:flex; align-items:center; justify-content:center; margin-right:15px;">
          <span style="font-size:28px;">✅</span>
        </div>
        <div>
          <h2 style="margin:0; font-weight:500; font-size:22px;">[Resolved] {{ .CommonLabels.alertname }}</h2>
          <p style="margin:0; font-size:14px; opacity:0.8;">Status is back to normal</p>
        </div>
      </div>
    </div>
  </div>
{{- end }}

  <div style="padding:30px; background-color:white;">
    {{- if .Alerts -}}
      {{- range .Alerts }}
        <div style="margin-bottom:25px; border-radius:10px; overflow:hidden; box-shadow:0 4px 6px -1px rgba(0,0,0,0.1),0 2px 4px -1px rgba(0,0,0,0.06); transition:transform 0.3s ease, box-shadow 0.3s ease;">
          <div style="padding:20px; background-color:
            {{- if eq $.Status "firing" -}}
              {{- if eq .Labels.severity "critical" }}#ffebee{{ else if eq .Labels.severity "warning" }}#fff3e0{{ else }}#e8f5e9{{ end -}}
            {{- else -}}
              #e8f5e9
            {{- end -}};
            border-left:4px solid;
            border-left-color:
            {{- if eq $.Status "firing" -}}
              {{- if eq .Labels.severity "critical" }}#e53935{{ else if eq .Labels.severity "warning" }}#fb8c00{{ else }}#43a047{{ end -}}
            {{- else -}}
              #43a047
            {{- end -}};">
            
            <div style="display:flex; justify-content:space-between; align-items:flex-start;">
              <div style="display:flex; align-items:center;">
                <div style="width:40px; height:40px; border-radius:50%; background-color:
                  {{- if eq $.Status "firing" -}}
                    {{- if eq .Labels.severity "critical" }}rgba(229,57,53,0.1){{ else if eq .Labels.severity "warning" }}rgba(251,140,0,0.1){{ else }}rgba(67,160,71,0.1){{ end -}}
                  {{- else -}}
                    rgba(67,160,71,0.1)
                  {{- end -}};
                  display:flex; align-items:center; justify-content:center; margin-right:15px;">
                  <span style="font-size:20px; color:
                    {{- if eq $.Status "firing" -}}
                      {{- if eq .Labels.severity "critical" }}#e53935{{ else if eq .Labels.severity "warning" }}#fb8c00{{ else }}#43a047{{ end -}}
                    {{- else -}}
                      #43a047
                    {{- end -}};">
                    {{- if eq $.Status "firing" -}}
                      {{- if eq .Labels.severity "critical" }}🔥{{ else if eq .Labels.severity "warning" }}⚠️{{ else }}🔍{{ end -}}
                    {{- else -}}
                      ✅
                    {{- end -}}
                  </span>
                </div>
                <h3 style="margin:0; color:#2d3748; font-weight:500; font-size:18px;">{{ .Labels.alertname }}</h3>
              </div>
              <div style="padding:4px 12px; border-radius:4px; background-color:
                {{- if eq $.Status "firing" -}}
                  {{- if eq .Labels.severity "critical" }}rgba(229,57,53,0.1){{ else if eq .Labels.severity "warning" }}rgba(251,140,0,0.1){{ else }}rgba(67,160,71,0.1){{ end -}}
                {{- else -}}
                  rgba(67,160,71,0.1)
                {{- end -}};
                color:
                {{- if eq $.Status "firing" -}}
                  {{- if eq .Labels.severity "critical" }}#e53935{{ else if eq .Labels.severity "warning" }}#fb8c00{{ else }}#43a047{{ end -}}
                {{- else -}}
                  #43a047
                {{- end -}};
                font-size:13px; font-weight:500;">
                {{- if eq $.Status "firing" -}}
                  {{ .Labels.severity | title }}
                {{- else -}}
                  Resolved
                {{- end -}}
              </div>
            </div>
            
            <div style="margin-top:15px; padding-left:55px;">
              <div style="display:grid; grid-template-columns:1fr 1fr; gap:10px;">
                <div style="padding:10px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                  <p style="margin:0; font-size:13px; color:#718096;">Instance</p>
                  <p style="margin:0; font-size:16px; font-weight:500;">{{ .Labels.instance }}</p>
                </div>
                <div style="padding:10px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                  <p style="margin:0; font-size:13px; color:#718096;">Started at</p>
                  <p style="margin:0; font-size:16px; font-weight:500;">{{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
                </div>
              </div>
              
              <div style="margin-top:15px; padding:15px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                <p style="margin:0 0 5px 0; font-size:13px; color:#718096;">Description</p>
                <p style="margin:0; font-size:16px;">{{ .Annotations.summary }}</p>
              </div>
              
              {{- if .Annotations.description -}}
              <div style="margin-top:15px; padding:15px; background-color:rgba(0,0,0,0.03); border-radius:6px;">
                <p style="margin:0 0 5px 0; font-size:13px; color:#718096;">Details</p>
                <p style="margin:0; font-size:16px;">{{ .Annotations.description }}</p>
              </div>
              {{- end -}}
              
              {{- if .Annotations.runbook_url -}}
              <div style="margin-top:15px;">
                <a href="{{ .Annotations.runbook_url }}" style="display:inline-flex; align-items:center; padding:8px 16px; background-color:#1976d2; color:white; border-radius:6px; text-decoration:none; font-weight:500; transition:background-color 0.3s ease;">
                  <span style="margin-right:8px;">📖</span>
                  View runbook
                </a>
              </div>
              {{- end -}}
            </div>
          </div>
        </div>
      {{- end -}}
    {{- else -}}
      <div style="text-align:center; padding:40px 0; color:#6b7280;">
        <div style="font-size:64px; color:#e5e7eb; margin-bottom:20px;">✅</div>
        <h3 style="margin:0 0 10px 0; font-weight:400; color:#374151;">All clear</h3>
        <p style="margin:0; max-width:300px; margin:0 auto;">No active alerts need attention</p>
      </div>
    {{- end -}}
  </div>
  
  <div style="padding:15px 30px; background-color:#f9fafb; border-top:1px solid #e5e7eb; text-align:center; font-size:13px; color:#9ca3af;">
    <p style="margin:0;">This alert was generated automatically by Alertmanager</p>
    <p style="margin:5px 0 0 0;">View all alerts: <a href="http://alertmanager-server:9093" style="color:#1976d2; text-decoration:none;">http://alertmanager-server:9093</a></p>
  </div>
</div>

<style>
  @keyframes pulse {
    0% { transform: scale(1); box-shadow: 0 0 0 0 rgba(239, 83, 80, 0.7); }
    70% { transform: scale(1.05); box-shadow: 0 0 0 10px rgba(239, 83, 80, 0); }
    100% { transform: scale(1); box-shadow: 0 0 0 0 rgba(239, 83, 80, 0); }
  }
</style>
{{ end }}

# Restart Alertmanager so that the new configuration and template take effect
systemctl restart alertmanager.service 

**Option 2:** DingTalk

Edit alertmanager.yml using the DingTalk configuration shown in section 7.1, replacing the access_token value in the Webhook URL with your own. Then create the template:

vim custom.tmpl
{{/* DingTalk template (JSON, Markdown message type) */}}
{{ define "dingtalk_alert_template" }}
{
  "msgtype": "markdown",
  "markdown": {
    "title": "[{{ if eq .Status "firing" }}🔥 Alert firing{{ else }}✅ Alert resolved{{ end }}] {{ .CommonLabels.alertname }}",
    "text": "### {{ if eq .Status "firing" }}🔥 Alert firing (severity: {{ .CommonLabels.severity | toUpper }}){{ else }}✅ Alert resolved{{ end }}\n**Alert name**: {{ .CommonLabels.alertname }}\n**Instance**: {{ .CommonLabels.instance }}\n**Description**: {{ .CommonAnnotations.description }}\n\n{{ range .Alerts }}\n- **Details**: {{ .Annotations.summary }}\n- **Time**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}\n{{ end }}"
  }
}
{{ end }}

Both options can be used at the same time; this document does not configure that.

7.3 Configuring alerting rules

# Create the rules file
vim /usr/local/prometheus-2.29.1/alert-rules.yml

groups:
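# Host resource rules (infrastructure stability)
# severity uses critical / warning, the values the email template in 7.2 compares against
- name: host-resource-rules
  rules:
    # High node CPU usage (system load)
    - alert: 主机CPU使用率过高
      expr: (100 - avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 20s
      labels:
        severity: critical
      annotations:
        summary: "Host {{ $labels.instance }} CPU usage is too high (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "Host CPU usage has stayed above 80%, which may slow service responses; check for high-load processes."

    # High node memory usage (guards against OOM)
    - alert: 主机内存使用率过高
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
      for: 20s
      labels:
        severity: critical
      annotations:
        summary: "Host {{ $labels.instance }} memory usage is too high (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "Host memory usage has stayed above 85%, which may trigger the OOM killer and terminate services."

    # Low node disk space (avoids system failure)
    - alert: 磁盘空间不足
      expr: (1 - node_filesystem_free_bytes{fstype=~"ext4|xfs|btrfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}) * 100 > 80
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} partition {{ $labels.mountpoint }} is low on space (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "Disk usage is above 80%, which may cause log writes to fail or applications to fail to start; clean up or expand the disk."

    # High node disk IO latency (affects read/write performance)
    # Average time per completed IO in milliseconds: time spent on IO divided by the number of IOs
    - alert: 磁盘IO延迟过高
      expr: (rate(node_disk_read_time_seconds_total[5m]) + rate(node_disk_write_time_seconds_total[5m])) / (rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])) * 1000 > 100
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} disk {{ $labels.device }} IO latency is too high (current value: {{ $value | printf \"%.2f\" }}ms)"
        description: "Disk IO latency has stayed above 100ms, possibly because of a failing disk, fragmentation, or excessive load."

    # High node network bandwidth utilization (network bottleneck)
    - alert: 网络带宽利用率过高
      expr: sum by(instance, device)(rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])) / on(instance, device) node_network_speed_bytes * 100 > 80
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} interface {{ $labels.device }} bandwidth utilization is too high (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "Network bandwidth utilization has stayed above 80%, which may cause packet loss and service timeouts."

    # High node load (overall system pressure)
    # count by(instance) of the idle series gives the number of CPU cores
    - alert: 系统负载过高
      expr: node_load1 > on(instance) count by(instance)(node_cpu_seconds_total{mode="idle"}) * 0.8
      for: 20s
      labels:
        severity: warning
      annotations:
        summary: "Host {{ $labels.instance }} system load is too high (current value: {{ $value | printf \"%.2f\" }})"
        description: "The 1-minute load exceeds 80% of the CPU core count, possibly because of leaking processes or resource contention."

    # Node unreachable (infrastructure availability)
    - alert: 节点不可达
      expr: up{job="node-exporter"} == 0
      for: 20s
      labels:
        severity: critical
      annotations:
        summary: "Host {{ $labels.instance }} is unreachable"
        description: "Node Exporter cannot be reached, possibly because the host crashed, the network is down, or the service failed."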

# Kubernetes resource rules (container platform stability)
- name: kubernetes-resource-rules
  rules:
    # High Pod CPU usage (container resources)
    # quota / period converts the CFS quota into a number of CPU cores
    - alert: PodCPU使用率过高
      expr: sum by(namespace, pod)(rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])) / sum by(namespace, pod)(container_spec_cpu_quota{container!="POD", container!=""} / container_spec_cpu_period{container!="POD", container!=""}) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU usage is too high (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "Pod CPU usage has stayed above 80% of its limit; the container may be throttled, which affects service performance."

    # High Pod memory usage (container resources)
    - alert: Pod内存使用率过高
      expr: sum by(namespace, pod)(container_memory_usage_bytes{container!="POD", container!=""}) / sum by(namespace, pod)(container_spec_memory_limit_bytes{container!="POD", container!=""}) * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is too high (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "Pod memory usage has stayed above 90%, which may trigger an OOM kill and restart the container."

    # Deployment has too few replicas (service availability)
    - alert: Deployment副本不足
      expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has too few replicas"
        description: "{{ $value | printf \"%.0f\" }} replicas are desired, but the number of available replicas differs, which may affect service availability."

    # PVC running out of capacity (storage resources)
    - alert: PVCCapacity不足
      expr: (1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
      for: 24h
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is running out of capacity (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "PVC usage is above 80%; expanding it within 24 hours is recommended so that data writes are not affected."

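# Application rules (directly tied to the business)
- name: application-rules
  rules:
    # High HTTP request error rate (business availability)
    - alert: HTTP请求错误率过高
      expr: sum by(job)(rate(http_requests_total{status=~"5.."}[5m])) / sum by(job)(rate(http_requests_total[5m])) * 100 > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Service {{ $labels.job }} HTTP error rate is too high (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "The HTTP 5xx error rate has stayed above 5%, possibly because of a service crash, a failing dependency, or insufficient resources."

    # Database connection pool exhausted (data access)
    - alert: 数据库连接池耗尽
      expr: database_connections_active / database_connections_max * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Database {{ $labels.instance }} connection pool is exhausted (current value: {{ $value | printf \"%.2f\" }}%)"
        description: "Connection pool usage is above 90%; new requests may be unable to obtain a connection, leaving the service unresponsive."

    # Message queue backlog (asynchronous processing)
    - alert: 消息队列堆积
      expr: sum by(queue)(rabbitmq_queue_messages_ready) > 10000
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Message queue {{ $labels.queue }} has a large backlog (current value: {{ $value | printf \"%.0f\" }} messages)"
        description: "More than 10000 messages are waiting in the queue; the consumers may be a bottleneck, so check the consumer logic or scale out."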
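
The host CPU rule turns the idle-time counter into a usage percentage: the per-second increase of node_cpu_seconds_total{mode="idle"} is the idle fraction, and 100 minus that fraction times 100 is the usage. The sketch below works through that arithmetic for a single core; the helper name and the sample numbers are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the arithmetic behind the CPU rule, for a single core.
# cpu_usage_percent is a hypothetical helper: it takes two idle-counter samples (seconds)
# and the number of seconds between them.
cpu_usage_percent() {
  local idle_start="$1" idle_end="$2" interval="$3"
  awk -v a="$idle_start" -v b="$idle_end" -v t="$interval" \
    'BEGIN { printf "%.2f\n", 100 - ((b - a) / t) * 100 }'
}
```

For example, `cpu_usage_percent 1000 1003 15` prints 80.00: the core was idle for 3 of the 15 seconds, so it was 80% busy, which sits exactly on the threshold and would not yet satisfy `> 80`. The rules file itself can be checked before Prometheus is restarted with `promtool check rules /usr/local/prometheus-2.29.1/alert-rules.yml`; promtool ships in the Prometheus tarball.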

7.4 Connecting Prometheus to Alertmanager

vim /usr/local/prometheus-2.29.1/prometheus.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["192.168.200.14:9093"]  # Alertmanager address (default port 9093)

rule_files:
  - "/usr/local/prometheus-2.29.1/alert-rules.yml"  # Path to the alerting rules file

# Restart the Prometheus service
systemctl restart prometheus.service 
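
To check the notification path without waiting for a real rule to fire, an alert can be posted by hand to Alertmanager's v2 API (`/api/v2/alerts`). The following is a minimal sketch under stated assumptions: the alert name ManualTestAlert, the helper names, and the label values are invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: post a hand-made alert to Alertmanager's v2 API to exercise the notification path.
# ALERTMANAGER_URL, ManualTestAlert and the function names are hypothetical.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://192.168.200.14:9093}"

# Build the JSON body: a list containing one alert with labels and annotations.
test_alert_json() {
  local instance="$1"
  printf '[{"labels":{"alertname":"ManualTestAlert","severity":"warning","instance":"%s"},"annotations":{"summary":"Manual test alert"}}]' "$instance"
}

# POST the body; curl reads it from stdin (@-).
send_test_alert() {
  test_alert_json "$1" | curl --silent --fail -X POST \
    -H 'Content-Type: application/json' --data-binary @- \
    "$ALERTMANAGER_URL/api/v2/alerts"
}
```

Running `send_test_alert 192.168.200.14:9100` should make the alert show up in the Alertmanager UI and, if the receiver is configured correctly, produce a notification.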

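7.5 CPU alert trigger test

Drive the CPU of host 192.168.200.14 past the threshold to trigger the alert.

# Write the test script on 192.168.200.14
vim cpu-test.sh

#!/usr/bin/env bash
# File name: cpu-test.sh
# Usage: chmod +x cpu-test.sh && ./cpu-test.sh [duration in seconds]
# Runs for 60 seconds by default, or for the number of seconds passed as the argument.

# Run duration (seconds)
DURATION=${1:-60}

# Compute the end time
END_TIME=$((SECONDS + DURATION))

# Start one child process per CPU core; each child spins in a busy loop
CPU_CORES=$(nproc)
echo ">> Starting $CPU_CORES busy loops to saturate all CPU cores for $DURATION seconds"

for ((i=1; i<=CPU_CORES; i++)); do
  (
    # The child busy-loops until the end time is reached
    while [ "$SECONDS" -lt "$END_TIME" ]; do
      :  # No-op that keeps the CPU busy
    done
  ) &
done

# Wait for all child processes to finish
wait
echo ">> CPU stress test finished."

# Run the script. Before running it, follow the Alertmanager log in a second terminal:
journalctl -u alertmanager.service -f 

bash cpu-test.sh

# Log output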
5月 14 03:01:39 prometheus alertmanager[7188]: time=2025-05-14T07:01:39.747Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=主机CPU使用率过高[c57a700][resolved]
5月 14 03:01:54 prometheus alertmanager[7188]: time=2025-05-14T07:01:54.746Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=主机CPU使用率过高[8ccbb6e][resolved]
5月 14 03:01:54 prometheus alertmanager[7188]: time=2025-05-14T07:01:54.752Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts="[主机CPU使用率过高[c57a700][resolved] 主机CPU使用率过高[8ccbb6e][resolved]]"
5月 14 03:01:56 prometheus alertmanager[7188]: time=2025-05-14T07:01:56.043Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=邮件接收人 integration=email[0] aggrGroup={}:{} attempts=1 duration=1.291165694s alerts="[主机CPU使用率过高[c57a700][resolved] 主机CPU使用率过高[8ccbb6e][resolved]]"

7.6 Checking the mailbox for the alert email
