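Architecture Study, Week 7: Prometheus

1. Monitoring System Fundamentals

2. Introduction to Prometheus

3. Single-Node Prometheus Deployment

4. Service Discovery and Alerting

4.1 Service Discovery

4.2 Implementing Alerting

5. Prometheus and Kubernetes

5.1 Kubernetes Metrics

5.2 Deploying Prometheus on a Cluster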

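1. Monitoring System Fundamentals

Components of a monitoring system:

- Collecting (scraping) metric data
- Storing metric data
- Analyzing and visualizing metric data
- Alerting

Monitoring layers (bottom up)

- System-level monitoring
-- System monitoring: CPU, Memory, Swap, Disk IO, Processes
-- Network monitoring: network devices, workload, network latency, packet loss rate
- Middleware and infrastructure monitoring
-- Message middleware: Kafka, RocketMQ, RabbitMQ, etc.
-- Web service containers: Tomcat, Jetty, etc.
-- Databases and caches: MySQL, PostgreSQL, MongoDB, Elasticsearch, Redis, etc.
-- Database connection pools: ShardingSphere, etc.
-- Storage systems: NFS, Ceph, etc.
- Application-level monitoring
-- Measures the state and performance of application code
- Business-level monitoring
-- Business interfaces: logins, registrations, orders, searches, payments, etc.

Monitoring methodologies

Service monitoring: Google's four golden signals:

1. Latency: the time it takes to serve a request
2. Traffic: also called throughput; measures the demand placed on the service
3. Errors: the number of failed requests, usually expressed as an absolute count or as a percentage of all requests
4. Saturation: how heavily resources are used, expressing overall resource utilization

System monitoring: the USE method (Utilization, Saturation and Errors), proposed by Brendan Gregg:

1. Utilization: the overall usage of a system resource
2. Saturation: how saturated a resource is, for example the average CPU run-queue length
3. Errors: error counts, for example "the NIC detected 14 Ethernet collisions while transmitting packets"

Container monitoring: Weave Cloud's RED method:

1. (Request) Rate: requests received per second
2. (Request) Errors: failed requests per second
3. (Request) Duration: the time each request takes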
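As a rough illustration of the RED method, the sketch below computes the three values from a handful of request records collected in one time window. The `Request` record and the sample data are hypothetical; a real service would expose counters and histograms and let Prometheus compute these values instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one handled request (not a Prometheus type).
    timestamp: float  # seconds since the start of the window
    duration: float   # seconds spent serving the request
    failed: bool


def red_metrics(requests, window_seconds):
    """Return (request rate, error rate, mean duration) for one window."""
    total = len(requests)
    failed = sum(1 for r in requests if r.failed)
    rate = total / window_seconds
    error_rate = failed / window_seconds
    mean_duration = sum(r.duration for r in requests) / total if total else 0.0
    return rate, error_rate, mean_duration


sample = [
    Request(0.5, 0.1, False),
    Request(1.0, 0.3, True),
    Request(1.5, 0.2, False),
    Request(9.0, 0.2, False),
]
rate, errors, duration = red_metrics(sample, window_seconds=10)
print(rate, errors, duration)
```

Four requests in a 10-second window, one of them failed, give a rate of 0.4 requests per second and 0.1 errors per second.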

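2. Introduction to Prometheus

Prometheus is a time series database and also the key component that scrapes metrics from monitoring targets, which it locates through service discovery or static configuration. Combined with other components of its ecosystem, such as Pushgateway, Alertmanager and Grafana, it forms a complete IT monitoring system.

Prometheus components (1-4 are the key components):

1. Prometheus Server: a TSDB with built-in data collection. It collects and stores time series data and supports service discovery, for example of Kubernetes Node, Pod, Endpoints, Service and Ingress objects
2. Alertmanager: receives alerts from Prometheus Server, then deduplicates, groups and routes them before delivering notifications to users
3. Exporters: expose the metrics of existing applications or services to Prometheus Server
4. Pushgateway: a gateway that receives metrics typically produced by short-lived jobs and lets Prometheus Server scrape them
5. Data Visualization: for example Grafana
6. Service Discovery: dynamically discovers the targets to monitor, which makes it an important part of monitoring configuration; it is currently built into Prometheus Server

How Prometheus scrapes data

Unlike many other TSDBs, Prometheus actively pulls data from each target instead of waiting for the monitored side to push it; it is not an event-driven storage system. Pulling keeps the configuration, including which metrics to collect and how often, centralized on Prometheus Server.

Three scrape paths

Exporters: deployed alongside an existing application or service; they convert its monitoring data into a format Prometheus understands and expose it at a given URL

Instrumentation: the application itself converts its monitoring data into the Prometheus format and exposes it at a given URL

Pushgateway: receives, by push, the metrics produced by short-lived jobs; Prometheus Server then pulls them periodically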
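To make "a format Prometheus understands" concrete, here is a minimal sketch that renders one metric family in the Prometheus text exposition format, which is what an exporter or an instrumented application serves at its metrics URL. The metric name `demo_http_requests_total` and its labels are made up for illustration, and the sketch skips label-value escaping, which a real client library handles.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Label values are not escaped here; this is a simplification.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric used only for this example.
text = render_metric(
    "demo_http_requests_total",
    "counter",
    "Total HTTP requests handled.",
    [
        ({"method": "get", "code": "200"}, 1027),
        ({"method": "post", "code": "500"}, 3),
    ],
)
print(text)
```

Each sample line is the metric name, an optional label set in braces, and the current value; the `# HELP` and `# TYPE` lines describe the family.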
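Prometheus monitoring data

Prometheus only stores aggregated time series data as key-value pairs; it does not store text. Because the same metric may apply to many targets or devices, Prometheus attaches labels as metadata, and these labels also serve as filters for selecting and aggregating metrics.

Metric types:

Counter: stores monotonically increasing data, such as the number of site visits. It cannot be negative and cannot decrease, but it can be reset to 0

Gauge: stores values that go up and down, such as the amount of free memory

Histogram: samples observations and counts them in configurable buckets. A Histogram stores more information: the number of samples that fall into each bucket (the buckets themselves are configurable), the sum of all sample values, and the total sample count. Prometheus can therefore use built-in functions to compute the sample average and quantiles

Summary: similar to a Histogram, but the monitored side computes the quantiles itself and returns the results in its response to Prometheus Server's scrape request; the quantile calculation therefore happens on the monitored (client) side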
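The sketch below shows, under simplifying assumptions, how a quantile can be estimated on the server side from cumulative Histogram buckets by linear interpolation, which is the idea behind the PromQL `histogram_quantile` function. The bucket boundaries and counts are invented, and the function is a teaching aid rather than a copy of Prometheus's implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) sorted by
    upper_bound; the last bound must be math.inf (the "+Inf" bucket).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: report the
                # highest finite bound instead of infinity.
                return lower_bound
            # Assume samples are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request-latency histogram (seconds, cumulative counts).
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.5, latency_buckets)
p95 = estimate_quantile(0.95, latency_buckets)
print(p50, p95)
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries match the real distribution; a Summary avoids that by computing quantiles on the client, at the cost of not being aggregatable across instances.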
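Instance: each network endpoint that Prometheus Server can scrape

Job: a collection of Instances that serve the same purpose, for example all MySQL processes in a MySQL primary-replica replication cluster

PromQL (Prometheus Query Language): the query language built into Prometheus. It supports real-time queries and aggregation, handles two kinds of vectors, and provides a set of built-in functions for data processing
-- Instant vector: for each time series, the sample at the most recent timestamp
-- Range vector: for each time series, all samples within a specified time range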
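The difference between the two vector kinds, and why a Counter that resets to 0 is still usable, can be sketched as follows. The sample data is invented, and `simple_rate` is a simplified stand-in for the PromQL `rate()` function: the real function also extrapolates to the edges of the time window, which is omitted here.

```python
def increase_with_resets(samples):
    """Total increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset: the counter restarted
    from 0, so the whole new value counts as increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed


# One series of a hypothetical counter, scraped every 15 seconds.
# The value drops at t=45 because the process restarted.
range_vector = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
instant_value = range_vector[-1]  # what an instant vector would hold

increase = increase_with_resets(range_vector)
per_second = simple_rate(range_vector)
print(instant_value, increase, per_second)
```

The instant vector only sees the latest sample, while the range vector carries every sample in the window, which is what rate-style functions need.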
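Prometheus features

Key features:

- Multi-dimensional data model: a time series is identified by its metric name and attached labels
- Its own query language: PromQL
- A single server node works on its own, with no dependency on distributed storage
- Collects metrics over HTTP using the pull model
- Supports push-style metric collection through Pushgateway
- Discovers targets dynamically through service discovery, or uses statically configured targets
- Supports multiple kinds of graphs and dashboards

Where Prometheus is not a good fit:

- Prometheus is a metrics monitoring system and is not suited to storing events or logs;
- Prometheus assumes that only recent monitoring data needs to be queried. Its local storage was designed for short-term data, so it is not meant for large volumes of historical data; to keep long-term history, the data must be stored in a system such as InfluxDB or OpenTSDB;
- Prometheus's clustering mechanism is not very mature;

Prometheus TSDB storage format:

- Each 2-hour time window is stored as a separate block;
- Older blocks are compacted and merged, so the number of blocks shrinks over time;
- The size of a block is not fixed, but a block holds at least two hours of data

The TSDB in Prometheus v3.0 can ingest millions of samples per second, and it uses a write-ahead log (WAL) so that recently ingested data held in memory survives a crash. How it works: newly scraped samples are first written to memory (and recorded in the WAL) and stay there for a while; they are then flushed to disk and memory-mapped. Once the memory-mapped chunks or in-memory chunks are old enough, they are written to disk as a persistent block. Blocks are merged over time and finally deleted once they exceed the retention period.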
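As a small worked example of the 2-hour block window, the sketch below maps a sample timestamp to the window it would fall into. It assumes windows are aligned to multiples of the block range counted from the Unix epoch (in milliseconds); the function name and the sample timestamp are illustrative only.

```python
BLOCK_RANGE_MS = 2 * 60 * 60 * 1000  # 2 hours, in milliseconds


def block_window(timestamp_ms, block_range_ms=BLOCK_RANGE_MS):
    """Return the [start, end) window that contains timestamp_ms.

    Assumption: windows are aligned to multiples of block_range_ms
    counted from the Unix epoch.
    """
    start = (timestamp_ms // block_range_ms) * block_range_ms
    return start, start + block_range_ms


sample_ts = 1_700_000_000_000  # an arbitrary timestamp in milliseconds
start, end = block_window(sample_ts)
print(start, end)
```

Every sample whose timestamp lies between `start` (inclusive) and `end` (exclusive) belongs to the same window; compaction later merges adjacent windows into larger blocks.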

3. Single-Node Prometheus Deployment

Download link: https://prometheus.io/download/

# Extract the downloaded package locally

root@k8s-master01:~# tar xf prometheus-2.53.3.linux-amd64.tar.gz -C /usr/local/
root@k8s-master01:~# ln -s /usr/local/prometheus-2.53.3.linux-amd64 /usr/local/prometheus
root@k8s-master01:~# cd /usr/local/prometheus/
root@k8s-master01:/usr/local/prometheus# ls
LICENSE NOTICE console_libraries consoles prometheus prometheus.yml promtool

root@k8s-master01:~# useradd -r prometheus # create the service user
root@k8s-master01:~# mkdir /usr/local/prometheus/data
root@k8s-master01:~# chown -R prometheus:prometheus /usr/local/prometheus/data
root@k8s-master01:~# vim /lib/systemd/system/prometheus.service # create the service unit file
[Unit]
Description=Monitoring system and time series database
Documentation=https://prometheus.io/docs/introduction/overview/

[Service]
Restart=always
User=prometheus
EnvironmentFile=-/etc/default/prometheus
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/usr/local/prometheus/data \
  --web.console.libraries=/usr/local/prometheus/console_libraries \
  --web.enable-lifecycle \
  $ARGS
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
LimitNOFILE=8192

[Install]
WantedBy=multi-user.target

root@k8s-master01:~# systemctl daemon-reload
root@k8s-master01:~# systemctl enable --now prometheus.service
root@k8s-master01:~# ss -tnlp | grep '9090' # the service listens on port 9090
LISTEN 0 4096 *:9090 *:* users:(("prometheus",pid=62248,fd=7))
root@k8s-master01:~# curl localhost:9090/metrics # view Prometheus's own metrics

# The web UI is now available at the host's IP on port 9090

# Deploy node-exporter

root@k8s-master01:~# tar xf node_exporter-1.8.2.linux-amd64.tar.gz -C /usr/local
root@k8s-master01:~# ln -s /usr/local/node_exporter-1.8.2.linux-amd64 /usr/local/node_exporter
root@k8s-master01:~# vim /lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter \
  --collector.ntp \
  --collector.mountstats \
  --collector.systemd \
  --collector.ethtool \
  --collector.tcpstat
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always

[Install]
WantedBy=multi-user.target

root@k8s-master01:~# systemctl enable --now node_exporter.service
root@k8s-master01:~# curl localhost:9100/metrics # view the metrics exposed by node-exporter

root@k8s-master01:~# vim /usr/local/prometheus/prometheus.yml # add the following settings
scrape_configs:
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]

root@k8s-master01:~# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml # check the Prometheus configuration file for errors

root@k8s-master01:~# systemctl reload prometheus.service
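Because the unit file starts Prometheus with `--web.enable-lifecycle`, the configuration can also be reloaded over HTTP by sending a POST to the `/-/reload` endpoint. The sketch below is one way to do that from Python; the base URL `http://localhost:9090` is assumed from the deployment above, and the helper names are made up for this example.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request for /-/reload."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code.

    Requires a running Prometheus started with --web.enable-lifecycle.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


request = build_reload_request()
print(request.get_method(), request.full_url)
```

Calling `reload_prometheus()` against a live server has the same effect as `systemctl reload prometheus.service`, which sends SIGHUP.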

# Other components such as consul exporter (9107), blackbox exporter (9115), Grafana (3000) and mysql exporter (9104) are deployed from binaries in the same way

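4. Service Discovery and Alerting

4.1 Service Discovery

Prometheus Server scrapes data with the pull model, so it has to know where each target is. In a small environment the targets can be listed with static_configs; in a large environment the target information has to be obtained from the underlying platform's API server, from DNS A records, and so on.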

# Static service discovery

# Install nginx and nginx-exporter with docker-compose or from packages

root@k8s-master01:~# cd /usr/local/prometheus
root@k8s-master01:/usr/local/prometheus# vim prometheus.yml # restore the default configuration
scrape_configs:
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]

root@k8s-master01:/usr/local/prometheus# systemctl reload prometheus.service

root@k8s-master01:/usr/local/prometheus# mkdir targets
root@k8s-master01:/usr/local/prometheus# vim targets/01.yaml # create the targets file
- targets:
    - localhost:9100
    - 172.29.7.20:9113
  labels:
    app: exporter

root@k8s-master01:/usr/local/prometheus# vim prometheus.yml # add the location of the targets files to the scrape configuration
......
  - job_name: "exporter"
    file_sd_configs:
      - files:
          - targets/*.yaml
        refresh_interval: 2m

root@k8s-master01:/usr/local/prometheus# ./promtool check config ./prometheus.yml # check the configuration file
Checking ./prometheus.yml

root@k8s-master01:/usr/local/prometheus# systemctl reload prometheus.service # reload the service
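File-based discovery is often fed by a script rather than edited by hand. The sketch below writes a target group file in JSON, which `file_sd_configs` also accepts for files ending in `.json`. It is only an illustration: the helper name is invented, the example writes into a temporary directory, and using it with the configuration above would assume the `files` list also matches `targets/*.json`.

```python
import json
import os
import tempfile


def write_targets_file(path, targets, labels):
    """Write one file_sd target group to path as JSON.

    The data is written to a temporary file first and then renamed, so a
    reader never sees a half-written file.
    """
    groups = [{"targets": sorted(targets), "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)
    return groups


# Demonstration in a throwaway directory; a real script would write
# into the targets/ directory that Prometheus watches.
with tempfile.TemporaryDirectory() as workdir:
    target_file = os.path.join(workdir, "01.json")
    write_targets_file(
        target_file,
        ["localhost:9100", "172.29.7.20:9113"],
        {"app": "exporter"},
    )
    with open(target_file) as handle:
        print(handle.read())
```

Prometheus picks up changes to such files on its own, so no reload is needed after the file is rewritten.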

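4.2 Implementing Alerting

When the volume of sample data is large, the load on Prometheus Server is high, and queries that run frequently and are expensive to compute respond with noticeable delay when evaluated in real time. To address this, a **recording rule** can evaluate frequently used or expensive expressions ahead of time and save the results as a new set of time series.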

# Configure recording rules

root@k8s-master01:/usr/local/prometheus# mkdir -pv rules/{recording,alter}
root@k8s-master01:/usr/local/prometheus# vim rules/recording/rule01.yaml # create the rule file
groups:
  - name: custom_rules
    interval: 5s
    rules:
      # the job label must match the job_name of the node_exporter targets ("exporter" in section 4.1)
      - record: instance:node_cpu:avg_rate5m
        expr: 100 - avg(irate(node_cpu_seconds_total{job="exporter", mode="idle"}[5m])) by (instance) * 100
      - record: instance:node_memory_MemFree_percent
        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
      - record: instance:root:node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

root@k8s-master01:/usr/local/prometheus# vim prometheus.yml # add the location of the rule files
rule_files:
  - "rules/recording/*.yaml"

root@k8s-master01:/usr/local/prometheus# ./promtool check config ./prometheus.yml
root@k8s-master01:/usr/local/prometheus# systemctl reload prometheus.service
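To show what the CPU recording rule computes, the sketch below reproduces its arithmetic on invented data: the idle rate of each CPU core is averaged per instance, multiplied by 100, and subtracted from 100. In PromQL the multiplication binds tighter than the subtraction, so the expression reads as 100 - (average idle * 100). The instance names and idle rates are made up.

```python
from collections import defaultdict


def cpu_usage_percent(idle_rates):
    """Compute CPU usage per instance from per-core idle rates.

    idle_rates is a list of (instance, cpu, idle_seconds_per_second).
    """
    per_instance = defaultdict(list)
    for instance, _cpu, idle in idle_rates:
        per_instance[instance].append(idle)
    return {
        instance: 100 - (sum(values) / len(values)) * 100
        for instance, values in per_instance.items()
    }


# Hypothetical idle rates, as irate(...{mode="idle"}[5m]) might return.
idle = [
    ("node-a:9100", "0", 0.90),
    ("node-a:9100", "1", 0.70),
    ("node-b:9100", "0", 0.50),
]
usage = cpu_usage_percent(idle)
print(usage)
```

Here node-a averages 80% idle across two cores, so its recorded CPU usage is about 20%, and node-b comes out at 50%.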

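An alerting rule (Alert rule) is another kind of PromQL expression defined in the Prometheus configuration. It is usually a boolean expression built on a query, and that expression is what triggers the alert. An expensive alert expression can first be saved as a recording rule to avoid possible delays.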

# Configure alerting rules

# Install Alertmanager

root@k8s-master01:/usr/local/prometheus# tar xf /root/alertmanager-0.27.0.linux-amd64.tar.gz -C /usr/local/
root@k8s-master01:/usr/local/prometheus# ln -s /usr/local/alertmanager-0.27.0.linux-amd64 /usr/local/alertmanager
root@k8s-master01:/usr/local/prometheus# cd ../alertmanager
root@k8s-master01:/usr/local/alertmanager# mkdir data
root@k8s-master01:/usr/local/alertmanager# chown -R prometheus:prometheus /usr/local/alertmanager/*
root@k8s-master01:/usr/local/alertmanager# vim /lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager/alertmanager \
  --config.file=/usr/local/alertmanager/alertmanager.yml \
  --storage.path=/usr/local/alertmanager/data/ \
  --data.retention=120h \
  --log.level=info
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always

[Install]
WantedBy=multi-user.target

root@k8s-master01:/usr/local/alertmanager# systemctl enable --now alertmanager.service

# Configure the mail service; alert emails are sent from mail01@localhost to wlm@localhost

root@k8s-master01:/usr/local/alertmanager# vim alertmanager.yml
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'mail01@localhost'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-receiver'
  routes:
    - match:
        severity: 'critical'
      receiver: 'email-receiver'

receivers:
  - name: 'email-receiver'
    email_configs:
      - to: 'wlm@localhost'
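The effect of `group_by: ['alertname']` can be pictured with a small sketch: alerts that carry identical label sets are treated as duplicates, and the remaining alerts are bucketed by the values of the grouping labels so that one notification can cover the whole group. This is a simplified model with invented alert data, not Alertmanager's actual code; timing settings such as group_wait are ignored.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Deduplicate alerts and group them by the given label names.

    alerts is a list of label dictionaries. Returns a dict that maps a
    tuple of grouping-label values to the alerts in that group.
    """
    groups = defaultdict(list)
    seen = set()
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # same label set as an earlier alert: a duplicate
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts; the third one repeats the first.
incoming = [
    {"alertname": "Instance", "instance": "localhost:9100", "severity": "critical"},
    {"alertname": "Instance", "instance": "172.29.7.20:9113", "severity": "critical"},
    {"alertname": "Instance", "instance": "localhost:9100", "severity": "critical"},
]
grouped = group_alerts(incoming, group_by=["alertname"])
print(grouped)
```

With this input, both distinct instances end up in the single group keyed by the alert name, so they would be reported together.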

# Add the alerting configuration

root@k8s-master01:/usr/local/prometheus# vim prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 172.29.7.20:9093

rule_files:
  - "rules/recording/*.yaml"
  - "rules/alter/*.yaml"

root@k8s-master01:/usr/local/prometheus# vim rules/alter/02.yaml
groups:
  - name: AllInstances
    rules:
      - alert: Instance
        expr: up == 0
        for: 1m
        annotations:
          title: 'Instance down'
          description: 'Instance has been down for more than 1 minute.'
          summary: 'Instance down'
        labels:
          severity: 'critical'

root@k8s-master01:/usr/local/prometheus# ./promtool check rules rules/alter/02.yaml
root@k8s-master01:/usr/local/prometheus# systemctl reload prometheus.service

# Stop the nginx service manually to test alerting
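The `for: 1m` clause means an alert does not fire the moment its expression becomes true: it first stays pending until the condition has held for the whole duration. The sketch below models that behavior for a single series; the evaluation times and results are invented, and the function name is only illustrative.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations is a list of (timestamp_seconds, condition_is_true).
    States are "inactive", "pending" or "firing".
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # condition cleared: start over
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 15 seconds; "up == 0" is true from t=15.
evaluations = [
    (0, False), (15, True), (30, True), (45, True),
    (60, True), (75, True), (90, False),
]
states = alert_states(evaluations, for_seconds=60)
print(states)
```

The alert only reaches the firing state at t=75, once the target has been down for a full 60 seconds, and returns to inactive as soon as the target is back.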

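5. Prometheus and Kubernetes

5.1 Kubernetes Metrics

kubectl top node/pod shows the resource usage of nodes or pods, but it depends on metrics being available (a metrics component such as metrics-server must be deployed before the command works). Other components, such as HPA v1 (Horizontal Pod Autoscaler), rely on the same metrics.k8s.io API group to scale out and in.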

root@k8s-master01:~# kubectl top nodes
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-master01   150m         7%     1118Mi          61%
k8s-master02   168m         8%     1016Mi          55%
k8s-master03   140m         7%     906Mi           49%
k8s-work01     73m          3%     1164Mi          63%
k8s-work02     60m          3%     558Mi           30%
k8s-work03     70m          3%     676Mi           37%

root@k8s-master01:~# kubectl top pod -n kube-system
NAME                                 CPU(cores)   MEMORY(bytes)
coredns-857d9ff4c9-899bh             2m           36Mi
coredns-857d9ff4c9-x486f             2m           33Mi
csi-nfs-controller-77c56d9d7-k6s5v   2m           72Mi
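The values in this output use Kubernetes quantity notation: `150m` means 150 millicores (0.15 of a core) and `1118Mi` means 1118 mebibytes. The sketch below converts the two forms that appear here into plain numbers. It handles only a few suffixes and is not a full quantity parser; the function names are made up.

```python
# Binary suffixes handled by this sketch (the full notation has more).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu(quantity):
    """Convert a CPU quantity to cores, e.g. "150m" is 0.15 cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity such as "1118Mi" to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


cores = parse_cpu("150m")
memory_bytes = parse_memory("1118Mi")
print(cores, memory_bytes)
```

Converting to plain numbers like this is handy when comparing the output against metrics that Prometheus stores in base units (cores and bytes).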

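The metrics component only collects CPU and memory data. To monitor Kubernetes more thoroughly, a third-party component such as Prometheus Server is needed to provide more metrics. Note: Prometheus's native metrics are not compatible with Kubernetes, and Prometheus cannot act as a metrics server by itself (it has no API group to serve as an entry point). Prometheus Adapter is needed to provide an API entry point, register an API group, and convert the metrics into a Kubernetes-compatible format.

Monitoring a Kubernetes cluster therefore requires deploying Prometheus Server for the monitoring itself, kube-state-metrics (which acts like an exporter for Kubernetes), InfluxDB or VictoriaMetrics for persistent metric storage, Alertmanager for alerting, and Grafana for visualization.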

5.2 Deploying Prometheus on a Cluster

# Deploy Prometheus on the Kubernetes cluster with Helm

# k8s-master01 172.29.7.10
# k8s-node01 172.29.7.11
# k8s-node02 172.29.7.12
# k8s-node03 172.29.7.13
# nfs-server 172.29.7.20

# Deploy nfs-csi; an NFS server (on a host or in a container) must be deployed first

root@nfs-server:~# apt install -y nfs-kernel-server # install the NFS server locally
root@nfs-server:~# mkdir /home/nfs # create the NFS directory
root@nfs-server:~# vim /etc/exports
/home/nfs *(rw,fsid=0,async,no_subtree_check,no_auth_nlm,insecure,no_root_squash)
root@nfs-server:~# systemctl start nfs-server

# Install the NFS client on all Kubernetes nodes

root@k8s-node01:~# apt -y install nfs-common
root@k8s-node02:~# apt -y install nfs-common
root@k8s-node03:~# apt -y install nfs-common

# Create a StorageClass on Kubernetes that uses the NFS server by default

root@k8s-master01:~# cat nfs-csi-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: nfs.csi.k8s.io
parameters:
  server: 172.29.7.20
  share: /home/nfs
reclaimPolicy: Retain
volumeBindingMode: Immediate

# Download the YAML manifests under deploy/v4.6.0 (tag v4.6.0) of the official kubernetes-csi/csi-driver-nfs GitHub repository and apply them to install csi-nfs

root@k8s-master01:~# ls
crd-csi-snapshot.yaml csi-nfs-driverinfo.yaml csi-snapshot-controller.yaml rbac-snapshot-controller.yaml
csi-nfs-controller.yaml csi-nfs-node.yaml rbac-csi-nfs.yaml
root@k8s-master01:~# kubectl apply -f .

# Deploy Prometheus directly with Helm

root@k8s-master01:~# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
root@k8s-master01:~# helm install my-prometheus prometheus-community/prometheus --version 27.3.1
root@k8s-master01:~# kubectl create ingress ingress-my-prometheus --rule="prometheus.wlm.com/*"=my-prometheus-server:80 --class=nginx --dry-run=client -o yaml > ingress-my-prometheus.yaml # generate the Ingress manifest
root@k8s-master01:~# kubectl get svc -n ingress-nginx
NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller   LoadBalancer   10.110.104.45   172.29.7.51   80:30182/TCP,443:31691/TCP   2d
root@k8s-master01:~# kubectl apply -f ingress-my-prometheus.yaml

# Map the domain name to the IP in the local hosts file, then open it in a browser
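Besides the browser, the deployment can be checked through the Prometheus HTTP API. The sketch below builds an instant query against `/api/v1/query` and sends it to the ingress IP with an explicit `Host` header, so it works without editing the hosts file. The IP `172.29.7.51` and the host name `prometheus.wlm.com` are taken from the output above; the helper names are invented, and the snippet assumes the ingress serves plain HTTP on port 80.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(ingress_ip, host, promql):
    """Build a GET request for an instant query routed by Host header."""
    query = urllib.parse.urlencode({"query": promql})
    url = f"http://{ingress_ip}/api/v1/query?{query}"
    return urllib.request.Request(url, headers={"Host": host})


def run_query(ingress_ip, host, promql, timeout=5):
    """Send the query and return the decoded JSON response.

    Needs network access to the ingress controller.
    """
    request = build_query_request(ingress_ip, host, promql)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


request = build_query_request("172.29.7.51", "prometheus.wlm.com", "up == 0")
print(request.full_url)
```

Running `run_query(...)` with the expression `up == 0` against a live cluster would list the targets that are currently down.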