K8s Cloud-Native Monitoring with Prometheus + Grafana

Table of Contents

[1. Overview](#1. Overview)

[1.1 System Architecture](#1.1 System Architecture)

[1.1.1 Architecture Diagram](#1.1.1 Architecture Diagram)

[1.2 Environment Preparation](#1.2 Environment Preparation)

[2. Deploying Prometheus](#2. Deploying Prometheus)

[2.1 Create the Namespace](#2.1 Create the Namespace)

[2.2 Create the ConfigMap](#2.2 Create the ConfigMap)

[2.3 Create the ServiceAccount, ClusterRole, ClusterRoleBinding, Service, Deployment, Ingress, and PersistentVolumeClaim](#2.3 Create the ServiceAccount, ClusterRole, ClusterRoleBinding, Service, Deployment, Ingress, and PersistentVolumeClaim)

[3. Deploying the Node Exporter Component](#3. Deploying the Node Exporter Component)

[3.1 Create the DaemonSet](#3.1 Create the DaemonSet)

[4. Deploying the Kube-state-metrics Component](#4. Deploying the Kube-state-metrics Component)

[4.1 Create the ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, and Service](#4.1 Create the ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, and Service)

[5. Deploying the Grafana Visualization Platform](#5. Deploying the Grafana Visualization Platform)

[5.1 Create the PersistentVolumeClaim, Deployment, and Service](#5.1 Create the PersistentVolumeClaim, Deployment, and Service)

[6. Deployment Commands](#6. Deployment Commands)

[7. Accessing the Services](#7. Accessing the Services)

[8. Grafana Dashboards](#8. Grafana Dashboards)

[8.1 Configure a Data Source for Grafana](#8.1 Configure a Data Source for Grafana)

[8.2 Import Dashboards](#8.2 Import Dashboards)

[8.3 Dashboard Display](#8.3 Dashboard Display)


1. Overview

Prometheus is an open-source monitoring and alerting system that is particularly well suited to cloud-native environments. This post walks through deploying a complete Prometheus monitoring stack on a Kubernetes cluster, including Prometheus Server, Node Exporter, Kube-state-metrics, and Grafana.

1.1 System Architecture

The Prometheus monitoring stack consists of the following components:

  • Prometheus Server: the core monitoring server, responsible for scraping and storing metrics

  • Node Exporter: collects node-level (host) metrics

  • Kube-state-metrics: collects Kubernetes cluster-state metrics

  • Grafana: data visualization and dashboards

1.1.1 Architecture Diagram

(Architecture diagram: Prometheus Server scrapes Node Exporter, Kube-state-metrics, cAdvisor, and the kube-apiserver; Grafana queries Prometheus for visualization.)

1.2 Environment Preparation

| IP | Hostname | Notes |
| --- | --- | --- |
| 192.168.48.11 | master1 | master node, k8s 1.32.7 |
| 192.168.48.12 | master2 | master node, k8s 1.32.7 |
| 192.168.48.13 | master3 | master node, k8s 1.32.7 |
| 192.168.48.14 | node01 | worker node, k8s 1.32.7 |
| 192.168.48.15 | node02 | worker node, k8s 1.32.7 |
| 192.168.48.16 | node03 | worker node, k8s 1.32.7 |
| 192.168.48.19 | database | Harbor registry, NFS server |

This guide uses a highly available k8s cluster, and all images are pulled from domestic (China) mirrors, so the deployment works even without a Harbor registry. If an image pull times out, leave a comment and I will update the post promptly. An NFS server is required; if you use another storage backend such as Ceph or hostPath, adjust the YAML manifests accordingly.
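Both PVCs in this guide request storageClassName: nfs-client, so before deploying anything it is worth confirming that this StorageClass exists in your cluster (the name comes from my NFS provisioner setup; adjust if yours differs):

kubectl get storageclass
# the output should include an entry named nfs-client; if not, revisit the NFS post linked below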

For setting up NFS shared storage for k8s, see this earlier post:

Setting up NFS shared storage for k8s

For building a k8s cluster, see this earlier post:

Deploying a k8s 1.32.7 cluster on openEuler 24.03 (one master, two workers)

For building a highly available k8s cluster, see this earlier post:

Deploying a k8s 1.32.7 HA cluster on openEuler 24.03 (three masters, three workers)

2. Deploying Prometheus

2.1 Create the Namespace

vim prometheus-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitor
  labels:
    name: monitor
    purpose: monitoring

2.2 Create the ConfigMap

vim prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    # Scrape Prometheus itself
    - job_name: 'prometheus'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [monitor]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: prometheus-svc
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: web
        action: keep

    # Scrape CoreDNS
    - job_name: 'coredns'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [kube-system]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: kube-dns
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: metrics
        action: keep

    # Scrape kube-apiserver
    - job_name: 'kube-apiserver'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: false
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default, kube-system]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: kubernetes
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: https
        action: keep

    # Scrape node-exporter
    - job_name: 'node-exporter'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace

    # Scrape cAdvisor
    - job_name: 'cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        insecure_skip_verify: true
        ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
      bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token'
      relabel_configs:
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
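One thing worth noting: the configuration above has no scrape job for kube-state-metrics, which section 4 deploys. A minimal sketch of an additional job you could append under scrape_configs, assuming the Service named kube-state-metrics in the monitor namespace as defined in section 4.1:

    # Scrape kube-state-metrics
    - job_name: 'kube-state-metrics'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [monitor]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: kube-state-metrics
        action: keep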

2.3 Create the ServiceAccount, ClusterRole, ClusterRoleBinding, Service, Deployment, Ingress, and PersistentVolumeClaim

vim prometheus.yaml
# ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
 
---
# ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extenstions"
  resources:
    - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
 
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitor
 
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: prometheus-svc
  namespace: monitor
  labels:
    app: prometheus
  annotations:
    prometheus_io_scrape: "true"  # scrape hint annotation (the scrape config in 2.2 actually selects this Service by name)
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
    - name: web
      nodePort: 32224
      port: 9090
      targetPort: http
 
---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitor
spec:
  ingressClassName: nginx
  rules:
  - host: www.myprometheus.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-svc
            port:
              number: 9090
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc  # PVC name
  namespace: monitor
spec:
  accessModes:
    - ReadWriteOnce  # access mode (options: ReadWriteOnce/ReadOnlyMany/ReadWriteMany)
  resources:
    requests:
      storage: 2Gi  # requested storage capacity
  storageClassName: nfs-client  # StorageClass to use (adjust for your cluster environment)
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      initContainers:
      - name: "change-permission-of-directory"
        image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/prometheus/busybox:latest
        command: ["/bin/sh"]
        args: ["-c","chown -R 65534:65534 /prometheus"]
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: "/etc/prometheus"
          name: config-volume
        - mountPath: "/prometheus"
          name: data
      containers:
      - image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/prom/prometheus:latest
        name: prometheus
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"  # path to the Prometheus config file
        - "--storage.tsdb.path=/prometheus"  # TSDB storage path
        - "--web.enable-lifecycle"  # allow hot reload via curl -X POST localhost:9090/-/reload
        - "--web.console.libraries=/usr/share/prometheus/console_libraries"
        - "--web.console.templates=/usr/share/prometheus/consoles"
        ports:
        - containerPort: 9090
          name: http
        volumeMounts:
        - mountPath: "/etc/prometheus"
          name: config-volume
        - mountPath: "/prometheus"
          name: data
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 100m
            memory: 512Mi
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus-pvc
      - configMap:
          name: prometheus-config
        name: config-volume
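Because --web.enable-lifecycle is enabled, Prometheus can reload its configuration without restarting the Pod. A sketch of the flow after editing the ConfigMap (note the kubelet can take a minute or so to project ConfigMap changes into the Pod):

kubectl apply -f prometheus-configmap.yaml
kubectl port-forward -n monitor svc/prometheus-svc 9090:9090 &
curl -X POST http://localhost:9090/-/reload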

3. Deploying the Node Exporter Component

3.1 Create the DaemonSet

vim node-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: node-exporter
        image: docker.io/prom/node-exporter:latest
        args:
        - --web.listen-address=$(HOSTIP):9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        ports:
        - containerPort: 9100
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        resources:
          requests:
            cpu: 150m
            memory: 180Mi
          limits:
            cpu: 150m
            memory: 180Mi
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        volumeMounts:
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: root
          mountPath: /host/root
          mountPropagation: HostToContainer
          readOnly: true
      tolerations:
      - operator: "Exists"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: root
        hostPath:
          path: /
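Since the DaemonSet runs with hostNetwork: true, every node now serves metrics on port 9100. A quick sanity check against any node from the table in section 1.2:

curl -s http://192.168.48.14:9100/metrics | head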

Create the Service

vim node-exporter-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitor
  labels:
    app: node-exporter
spec:
  selector:
    app: node-exporter
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100
  clusterIP: None  # headless Service (resolves directly to Pod IPs)
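Prometheus discovers the exporters through role: node rather than through this Service, but the headless Service still registers one endpoint per Pod, which is handy for verification:

kubectl get endpoints -n monitor node-exporter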

4. Deploying the Kube-state-metrics Component

4.1 Create the ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, and Service

vim kube-state-metrics.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitor
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.9.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: monitor
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
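To confirm kube-state-metrics is serving metrics, a quick check via a temporary port-forward (kube_pod_status_phase is one of its standard metrics):

kubectl port-forward -n monitor svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep kube_pod_status_phase | head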

5. Deploying the Grafana Visualization Platform

5.1 Create the PersistentVolumeClaim, Deployment, and Service

vim grafana.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc  # PVC name
  namespace: monitor
spec:
  accessModes:
    - ReadWriteOnce  # access mode (options: ReadWriteOnce/ReadOnlyMany/ReadWriteMany)
  resources:
    requests:
      storage: 2Gi  # requested storage capacity
  storageClassName: nfs-client  # StorageClass to use (adjust for your cluster environment)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-server
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      task: monitoring
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana/
          name: grafana-data
        env:
        - name: INFLUXDB_HOST
          value: monitoring-influxdb
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          value: /
      volumes:
      - name: grafana-data
        persistentVolumeClaim:
          claimName: grafana-pvc
      affinity:  # scheduling preference (optional)
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/monitoring
                operator: Exists
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: monitoring-grafana
  name: grafana-svc
  namespace: monitor
spec:
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 31091
  selector:
    k8s-app: grafana
  type: NodePort
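Note that GF_AUTH_ANONYMOUS_ENABLED=true combined with the Admin anonymous role means anyone who can reach port 31091 gets full admin access without logging in, which is only appropriate for a lab. A sketch of a safer variant using standard Grafana environment variables (the password value is a placeholder; in practice pull it from a Secret):

        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "change-me"  # placeholder; prefer valueFrom.secretKeyRef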

6. Deployment Commands

Deploy the components in the following order:

# 1. Create the namespace
kubectl apply -f prometheus-namespace.yaml

# 2. Deploy the Prometheus configuration
kubectl apply -f prometheus-configmap.yaml

# 3. Deploy the Prometheus server
kubectl apply -f prometheus.yaml

# 4. Deploy kube-state-metrics
kubectl apply -f kube-state-metrics.yaml

# 5. Deploy Node Exporter
kubectl apply -f node-exporter-daemonset.yaml
kubectl apply -f node-exporter-svc.yaml

# 6. Deploy Grafana
kubectl apply -f grafana.yaml

Check the Pod status:

[root@master1 prometheus]# kubectl get pod -n monitor 
NAME                                 READY   STATUS    RESTARTS   AGE
grafana-server-64c9777c7b-drgdd      1/1     Running   0          110m
kube-state-metrics-6db447664-6r2wp   1/1     Running   0          110m
node-exporter-ccwk8                  1/1     Running   0          110m
node-exporter-fbq22                  1/1     Running   0          110m
node-exporter-hbtm6                  1/1     Running   0          110m
node-exporter-ndbhh                  1/1     Running   0          110m
node-exporter-sbb4p                  1/1     Running   0          110m
node-exporter-xd467                  1/1     Running   0          110m
prometheus-7cd9944dc4-lbjwx          1/1     Running   0          110m
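It is also worth confirming that both PVCs bound successfully; they will sit in Pending if the nfs-client StorageClass is missing:

kubectl get pvc -n monitor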

7. Accessing the Services

Once the deployment completes, the services can be accessed as follows:

  • Prometheus: http://<node-ip>:32224 or http://www.myprometheus.com (requires DNS resolution for the Ingress host)

  • Grafana: http://<node-ip>:31091

A note up front: 192.168.48.10 is the HA VIP of my k8s cluster. If your cluster is not highly available, use the IP of the node where the Pod runs instead.

Access Prometheus: http://192.168.48.10:32224

Access Grafana: http://192.168.48.10:31091/
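To use the Ingress hostname instead of the NodePort, point the domain at the cluster entry IP, for example in /etc/hosts on your workstation (192.168.48.10 is the VIP used here; substitute your ingress controller's address if it differs):

echo "192.168.48.10 www.myprometheus.com" >> /etc/hosts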

8. Grafana Dashboards

8.1 Configure a Data Source for Grafana

Click Save & test at the bottom; the message "Successfully queried the Prometheus API." indicates success.
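For the data source URL, the in-cluster Service address works from Grafana's Pod (service name and namespace as defined in section 2.3); the NodePort URL http://192.168.48.10:32224 also works:

http://prometheus-svc.monitor.svc.cluster.local:9090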

8.2 Import Dashboards

Dashboard IDs:

  • Node monitoring: 16098

  • K8s cluster monitoring: 14249

8.3 Dashboard Display

(Screenshots of the imported dashboards.)
