k8s:Prometheus + Grafana + Alertmanager 构建监控平台

目录

一、安装node-exporter
1.下载所需镜像
2.编写node-export.yaml文件并应用
3.测试node-exporter并获取数据
二、Prometheus server安装和配置
1.创建sa(serviceaccount)账号,对sa做rbac授权
2.创建Prometheus数据存储目录
3.安装Prometheus server服务
3.1创建configmap用来存放Prometheus配置信息
3.2 通过deployment部署prometheus
3.3给Prometheus pod创建一个service
3.4 Prometheus 热加载
三、Grafana的安装和配置
1.Grafana介绍
2.安装Grafana
3.Grafana接入Prometheus数据源
3.1 浏览器访问
3.2 配置grafana界面
四、安装kube-state-metrics组件
1.介绍kube-state-metrics组件
2.安装kube-state-metrics组件
五、配置alertmanager组件
1.创建alertmanager-cm.yaml配置文件
2.Prometheus报警流程
3.创建Prometheus和告警规则的配置文件
4.安装Prometheus和alertmanager
4.1 安装
4.2 访问web界面查看效果


一、安装node-exporter

1.下载所需镜像
# 我直接用的镜像压缩包,上传到服务器然后docker load
# 所有节点都要有这个镜像
[root@k8s-master ~]# docker load -i node-exporter.tar.gz 
ad68498f8d86: Loading layer [==================================================>]  4.628MB/4.628MB
ad8512dce2a7: Loading layer [==================================================>]  2.781MB/2.781MB
cc1adb06ef21: Loading layer [==================================================>]   16.9MB/16.9MB
Loaded image: prom/node-exporter:v0.16.0
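
如果节点可以访问外网,也可以不使用镜像压缩包,直接在各节点拉取同版本镜像,下面是一个示意(镜像版本以实际使用的为准):

# 在每个节点直接从镜像仓库拉取
docker pull prom/node-exporter:v0.16.0
docker images | grep node-exporter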
2.编写node-export.yaml文件并应用
[root@k8s-master node-exporter]# vim node-export.yaml 
[root@k8s-master node-exporter]# cat node-export.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa        #记得创建命名空间,否则后面会出错
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
     name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true        #使用宿主机的PID名字空间,便于采集主机上的进程信息
      hostIPC: true        #使用宿主机的IPC名字空间
      hostNetwork: true        #使用宿主机网络,直接占用宿主机的9100端口,不需要再创建service
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15        #容器运行至少需要0.15核CPU
        securityContext:
          privileged: true        #开启特权模式
        args:
        - --path.procfs        #配置挂载宿主机的路径
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '"^/(sys|proc|dev|host|etc)($|/)"'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
[root@k8s-master node-exporter]# kubectl apply -f node-export.yaml 
Error from server (NotFound): error when creating "node-export.yaml": namespaces "monitor-sa" not found        #命名空间没有创建;
[root@k8s-master node-exporter]#  kubectl create ns monitor-sa
namespace/monitor-sa created
[root@k8s-master node-exporter]# kubectl apply -f node-export.yaml 
daemonset.apps/node-exporter created

# 查看创建好的pod;发现IP与宿主机IP相同
[root@k8s-master node-exporter]# kubectl get pod -n monitor-sa -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
node-exporter-fdvjc   1/1     Running   0          8m21s   192.168.22.136   k8s-node2    <none>           <none>
node-exporter-gzfnq   1/1     Running   0          8m21s   192.168.22.134   k8s-master   <none>           <none>
node-exporter-r85gw   1/1     Running   0          8m21s   192.168.22.135   k8s-node1    <none>           <none>
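
也可以顺便确认 DaemonSet 本身的状态,DESIRED/READY 的数量应与节点数一致(示意):

# 查看 DaemonSet 状态
kubectl get ds -n monitor-sa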
3.测试node-exporter并获取数据
# 通过curl  宿主机IP:9100/metrics 采集数据
# 我访问的是node1节点的CPU

[root@k8s-master node-exporter]# curl 192.168.22.135:9100/metrics | grep node_cpu_seconds
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 74373  100 74373    0     0   413k      0 --:--:-- --:--:-- --:--:--  417k
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.    #解释当前指标的含义
# TYPE node_cpu_seconds_total counter        #说明当前指标的数据类型
node_cpu_seconds_total{cpu="0",mode="idle"} 18145.96
node_cpu_seconds_total{cpu="0",mode="iowait"} 1.43
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0.05
node_cpu_seconds_total{cpu="0",mode="softirq"} 29.26
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 443.06
node_cpu_seconds_total{cpu="0",mode="user"} 383.4
node_cpu_seconds_total{cpu="1",mode="idle"} 18073.89
node_cpu_seconds_total{cpu="1",mode="iowait"} 1.23
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0.02
node_cpu_seconds_total{cpu="1",mode="softirq"} 61.35
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 446.99
node_cpu_seconds_total{cpu="1",mode="user"} 361.69

# node1节点的负载情况

[root@k8s-master node-exporter]# curl 192.168.22.135:9100/metrics | grep node_load
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.1        #最近一分钟以内的负载情况
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 0.09
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.04
100 74460  100 74460    0     0  6343k      0 --:--:-- --:--:-- --:--:-- 6610k

二、Prometheus server安装和配置

1.创建sa(serviceaccount)账号,对sa做rbac授权
1)创建一个 sa 账号 monitor
[root@k8s-master node-exporter]# kubectl create serviceaccount monitor -n monitor-sa
serviceaccount/monitor created
2)把 sa 账号 monitor 通过 clusterrolebinding 绑定到 clusterrole 上
[root@k8s-master node-exporter]# kubectl create clusterrolebinding monitor-clusterrolebinding -n monitor-sa --clusterrole=cluster-admin --serviceaccount=monitor-sa:monitor
clusterrolebinding.rbac.authorization.k8s.io/monitor-clusterrolebinding created
2.创建Prometheus数据存储目录
# 在node1节点创建目录

[root@k8s-node1 ~]# mkdir /data
[root@k8s-node1 ~]# chmod 777 /data
3.安装Prometheus server服务
3.1创建configmap用来存放Prometheus配置信息
1)创建yaml文件
[root@k8s-master yaml]# vim prometheus-cfg.yaml 
[root@k8s-master yaml]# cat prometheus-cfg.yaml
---
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s        #采集目标主机监控数据的时间间隔
      scrape_timeout: 10s        #数据采集超时时间,默认10秒
      evaluation_interval: 1m        #触发告警检测的时间,默认是1m
    scrape_configs:                #配置数据源,称为target,每个target用job_name命名
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:        #使用的是k8s的服务发现
      - role: node           #使用node角色,它使用默认的kubelet提供的http端口来发现集群中的每个node节点
      relabel_configs:        #重新标记
      - source_labels: [__address__]        #配置的原始标签,匹配地址
        regex: '(.*):10250'            #匹配带有10250端口的url
        replacement: '${1}:9100'        #把匹配到的 ip:10250 的 ip 保留
        target_label: __address__        #新生成的 url 是${1}获取到的 ip:9100
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'        # 抓取 cAdvisor 数据,是获取 kubelet 上/metrics/cadvisor 接口数据来获取容器的资源使用情况
      kubernetes_sd_configs:
      - role:  node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap            #把匹配到的标签保留
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints        #使用 k8s 中的 endpoint 服务发现,采集 apiserver 6443 端口获取到的数据
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]        #分别是 endpoint 对象的名称空间、endpoint 对象的服务名、endpoint 的端口名称
        action: keep            #采集满足条件的实例,其他实例不采集
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true        # 重新打标仅抓取到的具有 "prometheus.io/scrape: true" 的 annotation 的端点,意思是说如果某个 service 具有 prometheus.io/scrape = true annotation 声明则抓取
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)        #重新设置 scheme,匹配源标签__meta_kubernetes_service_annotation_prometheus_io_scheme 也就是 prometheus.io/scheme annotation,如果源标签的值匹配到 regex,则把值替换为__scheme__对应的值
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)            # 应用中自定义暴露的指标路径,这里要和 service 中的 annotation 做好约定,例如 service 中写 prometheus.io/app-metrics-path: '/metrics',那么这里的源标签就要写成 __meta_kubernetes_service_annotation_prometheus_io_app_metrics_path
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2        #暴露自定义的应用的端口,就是把地址和你在 service 中定义的 "prometheus.io/port = <port>" 声明做一个拼接,然后赋值给__address__,这样 prometheus 就能获取自定义应用的端口,然后通过这个端口再结合__metrics_path__来获取指标
      - action: labelmap            #保留下面匹配到的标签
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace        #替换__meta_kubernetes_namespace 变成 kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name 
2)应用并查看
[root@k8s-master yaml]# kubectl apply -f prometheus-cfg.yaml 
configmap/prometheus-config created
[root@k8s-master yaml]# kubectl get cm -n monitor-sa
NAME                DATA   AGE
prometheus-config   1      48m
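
如果本机装有 promtool(Prometheus 自带的校验工具),可以把 configmap 里的配置导出来做一次语法检查,下面是一个示意(假设 promtool 已在 PATH 中):

# 把 configmap 中的 prometheus.yml 导出到本地并校验语法
kubectl get cm prometheus-config -n monitor-sa -o jsonpath='{.data.prometheus\.yml}' > /tmp/prometheus.yml
promtool check config /tmp/prometheus.yml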
3.2 通过deployment部署prometheus
1)上传所需镜像
# node1节点,定义的yaml文件中指定了k8s-node1节点

[root@k8s-node1 ~]# docker load -i prometheus-2-2-1.tar.gz 
6a749002dd6a: Loading layer  1.338MB/1.338MB
5f70bf18a086: Loading layer  1.024kB/1.024kB
1692ded805c8: Loading layer  2.629MB/2.629MB
035489d93827: Loading layer  66.18MB/66.18MB
8b6ef3a2ab2c: Loading layer   44.5MB/44.5MB
ff98586f6325: Loading layer  3.584kB/3.584kB
017a13aba9f4: Loading layer   12.8kB/12.8kB
4d04d79bb1a5: Loading layer  27.65kB/27.65kB
75f6c078fa6b: Loading layer  10.75kB/10.75kB
5e8313e8e2ba: Loading layer  6.144kB/6.144kB
Loaded image: prom/prometheus:v2.2.1
2)编写yaml文件
[root@k8s-master yaml]# vim prometheus-deploy.yaml 
[root@k8s-master yaml]# cat prometheus-deploy.yaml 
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: k8s-node1
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
          - prometheus
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.tsdb.path=/prometheus        #旧数据存储目录
          - --storage.tsdb.retention=720h        #何时删除旧数据,默认为 15 天
          - --web.enable-lifecycle            #开启热加载
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
            items:
              - key: prometheus.yml
                path: prometheus.yml
                mode: 0644
        - name: prometheus-storage-volume
          hostPath:
           path: /data
           type: Directory
3)应用并查看
[root@k8s-master yaml]# kubectl apply -f prometheus-deploy.yaml 
deployment.apps/prometheus-server created
[root@k8s-master yaml]# kubectl get deploy -n monitor-sa
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-server   1/1     1            1           26s
3.3给Prometheus pod创建一个service
1)编写yaml文件
[root@k8s-master yaml]# vim prometheus-svc.yaml 
[root@k8s-master yaml]# cat prometheus-svc.yaml 
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: prometheus
    component: server
2)应用并查看
[root@k8s-master yaml]# kubectl apply -f prometheus-svc.yaml 
service/prometheus created
[root@k8s-master yaml]# kubectl get svc -n monitor-sa
NAME         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.104.137.10   <none>        9090:30481/TCP   12s
3)结果测试

通过查询可以看到service在宿主机上映射的端口是30481,访问k8s集群的node1节点的IP:端口/graph,就可以访问到web ui界面
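
除了浏览器访问,也可以先在命令行确认服务可达,并通过 HTTP API 试跑一条 PromQL,比如用 node_exporter 的 counter 指标计算各节点 CPU 使用率,下面是一个示意(IP 和 NodePort 以实际查询结果为准):

# 健康检查接口,正常时会返回健康提示
curl -s http://192.168.22.135:30481/-/healthy
# 通过 HTTP API 执行一条查询:基于 node_cpu_seconds_total 计算各节点 CPU 使用率
curl -s http://192.168.22.135:30481/api/v1/query --data-urlencode 'query=100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100'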

点击上方的Status中的Targets,可以看到以下界面:

3.4 Prometheus 热加载

每次修改配置文件后,可以对 prometheus 做热加载,即不停止 prometheus 进程就让新配置生效;热加载命令如下(其中 podIP 为 prometheus pod 的 IP):

curl -X POST http://podIP:9090/-/reload
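
比如可以先查出 Prometheus pod 的 IP 再触发热加载,下面是一个示意(依赖 deployment 模板里 app=prometheus 这个标签):

# 取出 prometheus pod 的 IP,然后调用 reload 接口
POD_IP=$(kubectl get pod -n monitor-sa -l app=prometheus -o jsonpath='{.items[0].status.podIP}')
curl -X POST http://$POD_IP:9090/-/reload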

三、Grafana的安装和配置

1.Grafana介绍

Grafana 是一个跨平台的开源度量分析和可视化工具,可以将采集的数据可视化地展示,并及时通知给告警接收方。它主要有以下特点:
1)展示方式:快速灵活的客户端图表,面板插件有许多不同方式的可视化指标和日志,官方库中具有丰富的仪表盘插件,比如热图、折线图、图表等多种展示方式;
2)数据源:Graphite、InfluxDB、OpenTSDB、Prometheus、Elasticsearch、CloudWatch 和 KairosDB 等;
3)通知提醒:以可视化方式定义最重要指标的警报规则,Grafana 将不断计算并发送通知,在数据达到阈值时通过 Slack、PagerDuty 等获得通知;
4)混合展示:在同一图表中混合使用不同的数据源,可以基于每个查询指定数据源,甚至自定义数据源;
5)注释:使用来自不同数据源的丰富事件注释图表,将鼠标悬停在事件上会显示完整的事件元数据和标记

2.安装Grafana
1)上传镜像
# node1节点

[root@k8s-node1 images-prometheus]# docker load -i heapster-grafana-amd64_v5_0_4.tar.gz 
6816d98be637: Loading layer  4.642MB/4.642MB
523feee8e0d3: Loading layer  161.5MB/161.5MB
43d2638621da: Loading layer  230.4kB/230.4kB
f24c0fa82e54: Loading layer   2.56kB/2.56kB
334547094992: Loading layer  5.826MB/5.826MB
Loaded image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
2)编写yaml文件
[root@k8s-master yaml]# vim grafana.yaml 
[root@k8s-master yaml]# cat grafana.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      task: monitoring
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      nodeName: k8s-node1
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
        - name: INFLUXDB_HOST
          value: monitoring-influxdb
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    # For use as a Cluster add-on (https://github.com/kubernetes/kubernetes/tree/master/cluster/addons)
    # If you are NOT using this as an addon, you should comment out this line.
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: monitoring-grafana
  name: monitoring-grafana
  namespace: kube-system
spec:
  # In a production setup, we recommend accessing Grafana through an external Loadbalancer
  # or through a public IP.
  # type: LoadBalancer
  # You could also use NodePort to expose the service at a randomly-generated port
  # type: NodePort
  ports:
  - port: 80
    targetPort: 3000
  selector:
    k8s-app: grafana
  type: NodePort
3)应用并查看
[root@k8s-master yaml]# kubectl apply -f grafana.yaml 
deployment.apps/monitoring-grafana created
service/monitoring-grafana created
[root@k8s-master yaml]# kubectl get pod -n kube-system -o wide | grep monitor
monitoring-grafana-7979b958c7-rxcw7   1/1     Running   0          64s     10.244.1.23      k8s-node1    <none>           <none>
[root@k8s-master yaml]# kubectl get svc -n kube-system
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
kube-dns             ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP,9153/TCP   3d16h
monitoring-grafana   NodePort    10.107.203.6   <none>        80:30244/TCP             3m48s
3.Grafana接入Prometheus数据源
3.1 浏览器访问
经上述查看,映射端口为30244,在浏览器输入IP:端口号即可访问
3.2 配置grafana界面
1)选择Create your first data source
2)导入监控模板

监控模板链接:https://grafana.com/dashboards

此处导入的是node_exporter.json文件

3)导入docker_rev1.json监控模板

跟上一步操作一样

4)如果 Grafana 导入 Prometheus 之后,发现仪表盘没有数据,如何排查?

打开 grafana 界面,找到仪表盘中对应没有数据的面板,查看它使用的查询指标;

例如 node_memory_MemTotal_bytes 就是 grafana 面板里用到的内存指标,需要到 Prometheus 的 web UI 界面确认这个指标名是否存在、是否有数据,两边一致仪表盘才能出图
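
排查时也可以直接用 Prometheus 的 HTTP API 确认指标是否存在、数据是否正常,下面是一个示意(IP 和 NodePort 以实际环境为准):

# 确认指标本身有数据
curl -s http://192.168.22.135:30481/api/v1/query --data-urlencode 'query=node_memory_MemTotal_bytes'
# 按常见的内存使用率算法手动算一遍,与仪表盘展示结果对比
curl -s http://192.168.22.135:30481/api/v1/query --data-urlencode 'query=(node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100'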

四、安装kube-state-metrics组件

1.介绍kube-state-metrics组件

kube-state-metrics 通过监听 API Server 生成有关资源对象的状态指标,比如 Deployment、Node、Pod。需要注意的是,kube-state-metrics 只是简单地提供 metrics 数据,并不会存储这些指标数据,所以我们可以使用 Prometheus 来抓取这些数据然后存储。它主要关注的是业务相关的一些元数据,比如 Deployment、Pod、副本状态等:调度了多少个 replicas?现在可用的有几个?多少个 Pod 是 running/stopped/terminated 状态?Pod 重启了多少次?有多少 job 在运行中?
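
装好之后,这类状态指标就可以在 Prometheus 里直接查询,下面是两个示意查询(指标名以 kube-state-metrics v1.9.0 实际暴露的为准,IP 和 NodePort 以实际环境为准):

# 各 deployment 当前可用的副本数
curl -s http://192.168.22.135:30481/api/v1/query --data-urlencode 'query=kube_deployment_status_replicas_available'
# 各 pod 所处的阶段(Running/Pending/Succeeded 等)
curl -s http://192.168.22.135:30481/api/v1/query --data-urlencode 'query=kube_pod_status_phase == 1'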

2.安装kube-state-metrics组件
1)创建sa并对其授权
[root@k8s-master yaml]# vim kube-state-metrics-rbac.yaml 
[root@k8s-master yaml]# cat kube-state-metrics-rbac.yaml 
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
[root@k8s-master yaml]# kubectl apply -f kube-state-metrics-rbac.yaml 
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@k8s-master yaml]# kubectl get sa -n kube-system | grep state
kube-state-metrics                   1         11m
2)上传镜像
[root@k8s-node1 images-prometheus]# docker load -i kube-state-metrics_1_9_0.tar.gz 
932da5156413: Loading layer  3.062MB/3.062MB
bd8df7c22fdb: Loading layer     31MB/31MB
Loaded image: quay.io/coreos/kube-state-metrics:v1.9.0
[root@k8s-node1 images-prometheus]# docker images | grep state
quay.io/coreos/kube-state-metrics                                v1.9.0              101b910a2162        4 years ago         32.8MB
3)编写yaml文件并应用
[root@k8s-master yaml]# vim kube-state-metrics-deploy.yaml 
[root@k8s-master yaml]# cat kube-state-metrics-deploy.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      nodeName: k8s-node1
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.9.0
        ports:
        - containerPort: 8080
[root@k8s-master yaml]# kubectl apply -f kube-state-metrics-deploy.yaml 
deployment.apps/kube-state-metrics created
[root@k8s-master yaml]# kubectl get pod -n kube-system | grep kube-state
kube-state-metrics-7684896db9-l5vsz   1/1     Running   0          61s
4)创建service
[root@k8s-master yaml]# vim kube-state-metrics-svc.yaml 
[root@k8s-master yaml]# cat kube-state-metrics-svc.yaml 
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
[root@k8s-master yaml]# kubectl apply -f kube-state-metrics-svc.yaml 
service/kube-state-metrics created
[root@k8s-master yaml]# kubectl get svc -n kube-system
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
kube-dns             ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   3d18h
kube-state-metrics   ClusterIP   10.104.238.225   <none>        8080/TCP                 19s
monitoring-grafana   NodePort    10.107.203.6     <none>        80:30244/TCP             148m
[root@k8s-master yaml]# 
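
可以先在集群内任一节点直接访问它的 ClusterIP,确认指标已经正常暴露(示意,IP 以上面查询到的为准):

# 直接访问 kube-state-metrics 的 8080 端口,确认有 kube_ 开头的指标输出
curl -s http://10.104.238.225:8080/metrics | grep kube_pod_status_phase | head -5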
5)导入监控模板

两个模板:

Kubernetes Cluster (Prometheus)-1577674936972.json

Kubernetes cluster monitoring (via Prometheus) (k8s 1.16)-1577691996738.json

五、配置alertmanager组件

1.创建alertmanager-cm.yaml配置文件
[root@k8s-master yaml]# vim alertmanager-cm.yaml 
[root@k8s-master yaml]# cat alertmanager-cm.yaml 
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'        #网易163邮箱
      smtp_from: '198********@163.com'        #从哪个邮箱发送邮件
      smtp_auth_username: '198********'        #发送邮件的用户
      smtp_auth_password: 'YLOPKFRHHONSHHXM'        #网易邮箱的授权码,要用自己的
      smtp_require_tls: false
    route:
      group_by: [alertname]         # 采用哪个标签来作为分组依据
      group_wait: 10s            # 组告警等待时间。也就是告警产生后等待 10s,如果有同组告警一起发出
      group_interval: 10s        # 上下两组发送告警的间隔时间
      repeat_interval: 10m        # 重复发送告警的时间,减少相同邮件的发送频率,默认是 1h
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '178******@qq.com'        #接受邮件的邮箱,不能跟上面的邮箱相同
        send_resolved: true
[root@k8s-master yaml]# kubectl apply -f alertmanager-cm.yaml 
configmap/alertmanager created
2.Prometheus报警流程

1)Prometheus Server 监控目标主机上暴露的 http 接口(这里假设接口 A),按配置的 scrape_interval 定义的时间间隔,定期采集目标主机上的监控数据。
2)当接口 A 不可用的时候,Server 端会持续尝试从接口中取数据,直到 scrape_timeout 时间后停止尝试,这时候把接口的状态变为 DOWN。
3)Prometheus 同时根据配置的 evaluation_interval 时间间隔,定期(默认 1m)对 Alert Rule 进行评估;当到达评估周期的时候,发现接口 A 为 DOWN,即 UP=0 为真,激活 Alert,进入 PENDING 状态,并记录当前 active 的时间;
4)当下一个 alert rule 的评估周期到来的时候,发现 UP=0 仍然为真,则判断警报 active 的时间是否已经超出 rule 里的 for 持续时间:如果未超出,则进入下一个评估周期;如果超出,则 alert 的状态变为 FIRING,同时调用 Alertmanager 接口,发送相关报警数据。
5)Alertmanager 收到报警数据后,会将警报信息进行分组,然后根据配置的 group_wait 时间先进行等待,等 wait 时间过后再发送报警信息。
6)属于同一个 Alert Group 的警报,在等待的过程中可能进入新的 alert,如果之前的报警已经成功发出,那么间隔 group_interval 的时间后再重新发送报警信息。比如配置的是邮件报警,那么同属一个 group 的报警信息会汇总在一封邮件里发送。
7)如果 Alert Group 里的警报一直没有发生变化并且已经成功发送,则等待 repeat_interval 时间间隔之后再重复发送相同的报警邮件;如果之前的警报没有成功发送,则相当于触发第 6 条条件,需要等待 group_interval 时间后重复发送

3.创建Prometheus和告警规则的配置文件
1)配置文件中需要修改的地方:与上一版配置相比,主要是新增了 rule_files、alerting 两段,以及 kubernetes-pods、kubernetes-schedule、kubernetes-controller-manager、kubernetes-kube-proxy、kubernetes-etcd 这些 job,还有 data 下新增的告警规则 rules.yml
2)删除上次设置的configmap配置
#  先删除上次设置的configmap
[root@k8s-master yaml]# kubectl delete -f prometheus-cfg.yaml 
configmap "prometheus-config" deleted
3)配置文件configmap
#  编写新的configmap配置文件

[root@k8s-master yaml]# vim prometheus-alertmanager-cfg.yaml
[root@k8s-master yaml]# cat prometheus-alertmanager-cfg.yaml 
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role:  node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name 
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.22.134:10251']
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.22.134:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.22.134:10249','192.168.22.135:10249','192.168.22.136:10249']
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.22.134:2379']
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: kube-proxy的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  kube-proxy的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: scheduler的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  scheduler的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: controller-manager的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  controller-manager的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 0
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: apiserver的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  apiserver的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: etcd的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  etcd的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: kube-state-metrics的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%"
          value: "{{ $value }}%"
          threshold: "80%"      
      - alert: kube-state-metrics的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 0
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%"
          value: "{{ $value }}%"
          threshold: "90%"      
      - alert: coredns的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%"
          value: "{{ $value }}%"
          threshold: "80%"      
      - alert: coredns的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%"
          value: "{{ $value }}%"
          threshold: "90%"      
      - alert: kube-proxy打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-kube-proxy"}  > 600
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kube-proxy打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-kube-proxy"}  > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-schedule打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-schedule"}  > 600
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-schedule打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-schedule"}  > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-controller-manager"}  > 600
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-controller-manager"}  > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-apiserver"}  > 600
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-apiserver"}  > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-etcd打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-etcd"}  > 600
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-etcd打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-etcd"}  > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"}  > 600
        for: 2s
        labels:
          severity: warnning 
        annotations:
          description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 打开句柄数超过600"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"}  > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 打开句柄数超过1000"
          value: "{{ $value }}"
      - alert: kube-proxy
        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"}  > 2000000000
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: scheduler
        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"}  > 2000000000
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager
        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"}  > 2000000000
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver
        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"}  > 2000000000
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kubernetes-etcd
        expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"}  > 2000000000
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kube-dns
        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"}  > 2000000000
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-controller-manager|kubernetes-apiserver"}[1m]))  > 1000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): TPS超过1000"
          value: "{{ $value }}"
          threshold: "1000"   
      - alert: Pod_restarts
        expr: kube_pod_container_status_restarts_total{namespace=~"kube-system|default|monitor-sa"} > 0
        for: 2s
        labels:
          severity: warnning
        annotations:
          description: "在{{$labels.namespace}}名称空间下发现{{$labels.pod}}这个pod下的容器{{$labels.container}}被重启,这个监控指标是由{{$labels.instance}}采集的"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Pod_waiting
        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.pod}}下的{{$labels.container}}启动异常等待中"
          value: "{{ $value }}"
          threshold: "1"   
      - alert: Pod_terminated
        expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|monitor-sa"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.pod}}下的{{$labels.container}}被删除"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Etcd_leader
        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 当前没有leader"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_leader_changes
        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 当前leader已发生改变"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_failed
        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 服务失败"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_db_total_size
        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}):db空间超过10G"
          value: "{{ $value }}"
          threshold: "10G"
      - alert: Endpoint_ready
        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.endpoint}}不可用"
          value: "{{ $value }}"
          threshold: "1"
    - name: 物理节点状态-监控告警
      rules:
      - alert: 物理节点cpu使用率
        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}cpu使用率过高"
          description: "{{ $labels.instance }}的cpu使用率超过90%,当前使用率[{{ $value }}],需要排查处理" 
      - alert: 物理节点内存使用率
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}内存使用率过高"
          description: "{{ $labels.instance }}的内存使用率超过90%,当前使用率[{{ $value }}],需要排查处理"
      - alert: InstanceDown
        expr: up == 0
        for: 2s
        labels:
          severity: critical
        annotations:   
          summary: "{{ $labels.instance }}: 服务器宕机"
          description: "{{ $labels.instance }}: 服务器延时超过2分钟"
      - alert: 物理节点磁盘的IO性能
        expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} 流入磁盘IO使用率过高!"
          description: "{{$labels.mountpoint }} 流入磁盘IO大于60%(目前使用:{{$value}})"
      - alert: 入网流量带宽
        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} 流入网络带宽过高!"
          description: "{{$labels.mountpoint }}流入网络带宽持续5分钟高于100M. RX带宽使用率{{$value}}"
      - alert: 出网流量带宽
        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} 流出网络带宽过高!"
          description: "{{$labels.mountpoint }}流出网络带宽持续5分钟高于100M. RX带宽使用率{{$value}}"
      - alert: TCP会话
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
          description: "{{$labels.mountpoint }} TCP_ESTABLISHED大于1000%(目前使用:{{$value}}%)"
      - alert: 磁盘容量
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
          description: "{{$labels.mountpoint }} 磁盘分区使用大于80%(目前使用:{{$value}}%)"
4)应用配置文件
[root@k8s-master yaml]# kubectl apply -f prometheus-alertmanager-cfg.yaml 
configmap/prometheus-config created
[root@k8s-master yaml]# kubectl get cm -n monitor-sa
NAME                DATA   AGE
alertmanager        1      25m
prometheus-config   2      3m20s
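
这份配置里规则比较多,建议把 rules.yml 导出来用 promtool 校验一遍语法,下面是一个示意(假设本机有 promtool):

# 导出 rules.yml 并检查告警规则语法
kubectl get cm prometheus-config -n monitor-sa -o jsonpath='{.data.rules\.yml}' > /tmp/rules.yml
promtool check rules /tmp/rules.yml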
4.安装Prometheus和alertmanager
4.1 安装
1)删除上述操作步骤安装的Prometheus的deployment资源
[root@k8s-master yaml]# kubectl delete -f prometheus-deploy.yaml 
deployment.apps "prometheus-server" deleted
2)生成etcd-certs
[root@k8s-master yaml]# kubectl -n monitor-sa create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created

[root@k8s-master yaml]# kubectl get secret -n monitor-sa
NAME                  TYPE                                  DATA   AGE
default-token-jjw8z   kubernetes.io/service-account-token   3      24h
etcd-certs            Opaque                                3      40s
monitor-token-jr24f   kubernetes.io/service-account-token   3      23h
3)拉取镜像
# 此处我用的node2节点

[root@k8s-node2 images-prometheus]# docker load -i alertmanager.tar.gz 
4febd3792a1f: Loading layer   1.36MB/1.36MB
68d1a8b41cc0: Loading layer  2.586MB/2.586MB
5f70bf18a086: Loading layer  1.024kB/1.024kB
30d4e7b232e4: Loading layer  12.77MB/12.77MB
6b961451fcb0: Loading layer  16.59MB/16.59MB
b5abc4736d3f: Loading layer  6.144kB/6.144kB
Loaded image: prom/alertmanager:v0.14.0
[root@k8s-node2 images-prometheus]# scp alertmanager.tar.gz k8s-node2:/root/
root@k8s-node2's password: 
alertmanager.tar.gz                                                100%   32MB  16.1MB/s   00:01    
[root@k8s-node2 images-prometheus]# docker images | grep alert
prom/alertmanager                                                v0.14.0             23744b2d645c        6 years ago         31.9MB
4)编写deployment的yaml文件并应用
[root@k8s-master yaml]# vim prometheus-alertmanager-deploy.yaml 
[root@k8s-master yaml]# cat prometheus-alertmanager-deploy.yaml 
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: k8s-node1            #此处指定的node1节点
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
      - name: alertmanager
        image: prom/alertmanager:v0.14.0
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage-volume
          hostPath:
           path: /data
           type: Directory
        - name: k8s-certs
          secret:
           secretName: etcd-certs
        - name: alertmanager-config
          configMap:
            name: alertmanager
        - name: alertmanager-storage
          hostPath:
           path: /data/alertmanager
           type: DirectoryOrCreate
        - name: localtime
          hostPath:
           path: /usr/share/zoneinfo/Asia/Shanghai

#  应用yaml文件

[root@k8s-master yaml]# kubectl apply -f prometheus-alertmanager-deploy.yaml 
deployment.apps/prometheus-server created
[root@k8s-master yaml]# kubectl get pod -n monitor-sa
NAME                                 READY   STATUS    RESTARTS   AGE
node-exporter-fdvjc                  1/1     Running   1          24h
node-exporter-gzfnq                  1/1     Running   0          24h
node-exporter-r85gw                  1/1     Running   0          24h
prometheus-server-6c5bc4d65b-9qzn6   2/2     Running   0          39s
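
这个 pod 里有 prometheus 和 alertmanager 两个容器,如果后面告警没有发出来,可以分别看两个容器的日志排查,下面是一个示意:

# 查看 alertmanager 容器日志(--log.level=debug 已打开)
kubectl logs -n monitor-sa -l app=prometheus -c alertmanager --tail=20
# 查看 prometheus 容器日志
kubectl logs -n monitor-sa -l app=prometheus -c prometheus --tail=20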
5)创建alertmanager的service以便于访问
[root@k8s-master yaml]# vim alertmanager-svc.yaml 
[root@k8s-master yaml]# cat alertmanager-svc.yaml 
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: 'true'
  name: alertmanager
  namespace: monitor-sa
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
[root@k8s-master yaml]# kubectl apply -f alertmanager-svc.yaml 
service/alertmanager created
[root@k8s-master yaml]# kubectl get svc -n monitor-sa
NAME           TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
alertmanager   NodePort   10.98.208.193   <none>        9093:30066/TCP   16s
prometheus     NodePort   10.104.137.10   <none>        9090:30481/TCP   20h
6)浏览器访问测试

通过上述查询,可以看到Prometheus映射的端口为30481,alertmanager映射的端口为30066;浏览器输入192.168.22.135:30066/#/alerts访问;

也就是http://node1节点IP:端口号/#/alerts

4.2 访问web界面查看效果
1)访问Prometheus的web页面

点击status中的targets;

2)修改配置文件
#  kube-scheduler:

vim /etc/kubernetes/manifests/kube-scheduler.yaml

#将里面的--bind-address=127.0.0.1改成192.168.22.134;把--port=0这一行删除;
#把httpGet:下面的host改成192.168.22.134
# 注意是改成master节点的IP


#   kube-controller-manager

vim /etc/kubernetes/manifests/kube-controller-manager.yaml

#将里面的--bind-address=127.0.0.1改成192.168.22.134;把--port=0这一行删除;
#把httpGet:下面的host改成192.168.22.134
# 注意是改成master节点的IP


#  改完之后重启kubelet

#  查看服务:kubectl  get cs ;status都是healthy即可


#  kube-proxy

kubectl edit configmap kube-proxy -n kube-system

#  把metricsBindAddress这段修改成metricsBindAddress: 0.0.0.0:10249

#然后再删除pod重新创建即可
kubectl get pods -n kube-system | grep kube-proxy |awk '{print $1}' | xargs kubectl delete pods -n kube-system
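
改完并重建 kube-proxy 的 pod 之后,可以在节点上确认 10249 端口的指标已经可以抓取,下面是一个示意(IP 为任一节点的 IP):

# 确认 kube-proxy 的 metrics 端口已对外暴露并能返回指标
curl -s http://192.168.22.134:10249/metrics | head -5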
3)再次访问web页面