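Microservice Monitoring with Prometheus, Go in Practice: Building Business Monitoring from Scratch

Introduction

Monitoring and performance analysis are key to keeping a system stable and efficient. As applications grow in scale and complexity, an effective monitoring solution becomes increasingly important. Prometheus is an open-source monitoring and alerting tool that collects, stores, and queries metric data and feeds visualization tools such as Grafana. In the cloud-native ecosystem, Prometheus has become the standard solution for monitoring k8s.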

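Core ideas of Prometheus

Multi-dimensional data model: Prometheus stores time series. Each series is identified by a metric name and an optional set of labels, and each data point is a timestamp plus a float value. This model lets you represent and query metrics along many dimensions.
Query language: Prometheus has its own query language, PromQL. It lets you extract information from time series, aggregate and compute over it, and define alerting rules.
HTTP-based collection: Prometheus scrapes (pulls) metrics from targets over HTTP, which suits cloud-native environments and microservice architectures well.
Dynamic discovery: Prometheus supports service discovery, so target instances that are added or removed are picked up automatically.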

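Basic concepts

Time Series

A time series is the basic unit of data in Prometheus. It has three main parts:

Timestamp: when the data point was collected.
Labels: key-value pairs that identify and distinguish time series, such as instance="webserver1" or job="api".
Sample value: the float value recorded at that timestamp, i.e. the measurement of the metric.

Time series can represent performance indicators of applications, system components, and services, such as request latency, CPU usage, and memory consumption.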
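
To make the data model concrete, here is a minimal stdlib-only sketch of how a metric name plus a label set identifies one series, and how samples hang off it. It is an illustration only, not Prometheus's internal representation; the `seriesID` helper and the sample values are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one data point of a series: a timestamp and a float value.
type sample struct {
	timestampMs int64
	value       float64
}

// seriesID renders a metric name and its labels in the familiar
// name{k="v",...} notation. Labels are sorted so that the same label set
// always yields the same identity. (Hypothetical helper for illustration.)
func seriesID(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	id := seriesID("http_requests_total", map[string]string{"job": "api", "instance": "webserver1"})
	points := []sample{{1700000000000, 41}, {1700000015000, 42}}
	fmt.Println(id, len(points), "samples")
}
```

Changing any label value produces a different identity, which is why each distinct label combination is stored as its own series.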

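Metrics

A metric is a set of time series that share the same name and describe the same kind of measurement, for example http_requests_total or cpu_usage. A metric name together with a set of labels uniquely identifies one time series.

Labels and Label Sets

Labels are a key part of the Prometheus data model and are used to distinguish and classify time series. They let you store many series under one metric name, each with different label values. This flexibility makes it easier to organize and query metrics for different monitoring needs.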

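Collectors

Prometheus relies on collectors to obtain metrics from different targets. A collector can be built in, developed by a third party, or written by you. Common options include:

HTTP Exporter: exposes metrics over HTTP, and Prometheus fetches them with HTTP requests.
Node Exporter: collects operating-system and hardware metrics such as CPU, memory, and disk.
Custom collectors: written for a specific application to cover its particular monitoring needs.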

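Query language (PromQL)

Prometheus Query Language (PromQL) is a language for querying and analyzing time series. It supports basic selectors, aggregation, vector operations, and functions, so you can extract useful information from a large volume of metric data. For example, you can use PromQL to compute averages and percentiles, or to derive composite metrics.

This flexibility lets you dig into monitoring data and better understand system performance and behavior.

Alerting Rules

Prometheus lets you define alerting rules that fire when a condition holds. A rule is based on the result of a PromQL expression, and you can set the trigger condition, how long it must hold, and labels such as severity. When a rule fires, Prometheus sends the alert to Alertmanager, which takes care of routing and notification so that action can be taken in time.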

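Data collection and storage

Collection methods

Prometheus collects data over HTTP and scrapes metrics periodically according to a set of configurations called jobs. Common collection methods are:

HTTP Exporters: many applications and services expose an HTTP endpoint from which Prometheus can fetch metrics. Node Exporter, for example, collects operating-system and hardware data.
Service Discovery: Prometheus supports service discovery mechanisms (such as Consul and Kubernetes) to find and monitor new target instances dynamically.
Pushgateway: intended for short-lived jobs. The application pushes its metrics to the Pushgateway, and Prometheus scrapes the Pushgateway, so it never has to reach the job itself.
Blackbox Exporter: probes and monitors network services. It can run HTTP, TCP, ICMP, and other probes and exposes the results as metrics for Prometheus.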

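Data storage

Prometheus stores the collected metrics in a local time series database whose layout is designed to store and query large volumes of data efficiently.

Time series database: it uses a compact, compressed format to save disk space. Data is stored in time order, which keeps queries efficient.
Retention policy: you can configure how long data is kept. Older data is deleted automatically so the database does not run out of space.
Blocks: storage is divided into blocks, each covering a time range. This lets Prometheus drop expired data quickly and load only the blocks a query needs.
Remote write and storage: besides local storage, Prometheus can write data to remote systems, which is useful for distributed deployments and for aggregating data centrally.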
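
The block and retention ideas above can be sketched in a few lines. This is a simplified model under stated assumptions, not the actual TSDB code: the `block` type, the fixed two-hour ranges, and both helper functions are hypothetical.

```go
package main

import "fmt"

// block is a simplified stand-in for a storage block covering [minT, maxT).
type block struct {
	minT, maxT int64 // Unix seconds
}

// blocksForQuery returns only the blocks that overlap the query range [from, to],
// so a query never has to touch unrelated data.
func blocksForQuery(blocks []block, from, to int64) []block {
	var out []block
	for _, b := range blocks {
		if b.minT <= to && b.maxT > from {
			out = append(out, b)
		}
	}
	return out
}

// applyRetention keeps the blocks that still hold data newer than now-retention.
// Whole blocks are dropped at once instead of deleting individual samples.
func applyRetention(blocks []block, now, retention int64) []block {
	var kept []block
	for _, b := range blocks {
		if b.maxT > now-retention {
			kept = append(kept, b)
		}
	}
	return kept
}

func main() {
	blocks := []block{{0, 7200}, {7200, 14400}, {14400, 21600}}
	fmt.Println(len(blocksForQuery(blocks, 8000, 9000)))
	fmt.Println(len(applyRetention(blocks, 21600, 10000)))
}
```

In this model a query for the range 8000..9000 only touches the middle block, and a retention of 10000 seconds evaluated at t=21600 removes the oldest block in one step.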

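Persistence and retention

Local storage in Prometheus keeps raw samples at full resolution and does not downsample or pre-aggregate older data. Over time, smaller blocks are compacted into larger ones, which keeps queries efficient. Long-term storage at scale is usually delegated to a remote storage system.

The retention configuration determines both disk usage and how far back data is available. Choosing it sensibly matters for performance and resource usage.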

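Metric types

Counter

A Counter is a cumulative metric that represents a monotonically increasing count. Its value can only increase, or be reset to 0 when the process restarts.

For example, you can use a counter for the number of HTTP requests served, which only ever grows.

// number of requests to an HTTP endpoint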
request_duration_count{path="/api/v1/index"} 5301
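
Because a counter may drop back to 0 on restart, anything that computes growth has to account for resets. Below is a minimal sketch of that idea; the `increase` function here is a simplified, hypothetical stand-in, and the real PromQL functions additionally extrapolate to the window boundaries, which is omitted.

```go
package main

import "fmt"

// increase sums the growth of a counter over consecutive samples.
// When a value is lower than its predecessor, the counter is assumed to have
// restarted from 0, so the new value itself counts as growth.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] >= samples[i-1] {
			total += samples[i] - samples[i-1]
		} else {
			total += samples[i] // reset detected
		}
	}
	return total
}

func main() {
	// 5301 -> 5310, then the process restarts and the counter resumes at 4, then 9.
	fmt.Println(increase([]float64{5301, 5310, 4, 9}))
}
```

The example counts 9 requests before the restart, 4 right after it, and 5 more afterwards, so the total growth is 18 even though the last raw value is only 9.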

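Gauge

A Gauge is a metric whose value can go up and down arbitrarily.

It usually represents a current value.

Examples: the CPU usage of a machine, or the state of the connections in a connection pool.

// number of idle connections in the MySQL connection pool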
mysql_pool_stats{db="book",form="idle"} 4

Histogram

A Histogram samples observations (usually request durations or response sizes) and counts them in buckets. The unit is up to you; the example below uses seconds.

// Of the 2 requests in total, 0 had a response time <= 0.005 s
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.005",} 0.0
// Of the 2 requests in total, 0 had a response time <= 0.01 s
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.01",} 0.0
// Of the 2 requests in total, 0 had a response time <= 0.025 s
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.025",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.05",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.075",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.1",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.25",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.5",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.75",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="1.0",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="2.5",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="5.0",} 0.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="7.5",} 2.0
// Of the 2 requests in total, 2 had a response time <= 10 s
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="10.0",} 2.0
http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="+Inf",} 2.0

The sum of all observed values is exposed as <basename>_sum.

// Meaning: the 2 HTTP requests took 13.107670803000001 seconds in total
http_requests_latency_seconds_histogram_sum{path="/",method="GET",code="200",} 13.107670803000001

The number of observations is exposed as <basename>_count. Its value equals <basename>_bucket{le="+Inf"}.

// Meaning: 2 HTTP requests have been observed so far
http_requests_latency_seconds_histogram_count{path="/",method="GET",code="200",} 2.0

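A bucket can be seen as one partition of the metric's value range, and the boundaries should be chosen according to how the values are distributed. Note that buckets are cumulative: each bucket includes the observations of all smaller buckets. If xxx_bucket{...,le="0.01"} is 10 and xxx_bucket{...,le="0.05"} is 30, then of those 30 observations, 10 took at most 10 ms and the remaining 20 took more than 10 ms but at most 50 ms.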
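
The cumulative relationship is easy to verify with a small sketch. The `perBucket` helper is a hypothetical name used only for this illustration.

```go
package main

import "fmt"

// perBucket converts cumulative bucket counts (as exposed with the le label)
// into the number of observations that fall inside each individual bucket.
func perBucket(cumulative []float64) []float64 {
	out := make([]float64, len(cumulative))
	prev := 0.0
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// le="0.01" -> 10, le="0.05" -> 30
	fmt.Println(perBucket([]float64{10, 30})) // [10 20]
}
```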

Summary

Like a Histogram, a Summary samples observations (usually request durations or response sizes), but it stores quantiles directly (they are computed in the client and then exposed) instead of deriving them from buckets.

The quantile distribution of the observations is exposed as <basename>{quantile="<φ>"}.

// Meaning: 50% of these 13 HTTP requests had a response time of at most 3.052404983 s
http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.5",} 3.052404983
// Meaning: 90% of these 13 HTTP requests had a response time of at most 8.003261666 s
http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.9",} 8.003261666

The sum of all observed values is exposed as <basename>_sum.

// Meaning: the 13 HTTP requests took 51.029495508 s in total
http_requests_latency_seconds_summary_sum{path="/",method="GET",code="200",} 51.029495508

The number of observations is exposed as <basename>_count.

// Meaning: 13 HTTP requests have been observed so far
http_requests_latency_seconds_summary_count{path="/",method="GET",code="200",} 13.0

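Histogram vs. Summary:

Both expose <basename>_sum and <basename>_count.
A Histogram needs <basename>_bucket to compute quantiles, while a Summary stores the quantile values directly.
Histogram observations are cheap on the client because the quantile computation happens at query time on the Prometheus server. A Summary does the opposite and carries most of the cost in the client. Histogram buckets can also be aggregated across instances, whereas precomputed Summary quantiles cannot.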
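
To show what "computing quantiles from buckets at query time" means, here is a stdlib-only sketch of linear interpolation inside the bucket that contains the target rank. It follows the general idea behind histogram_quantile but is a simplified illustration, not the server's implementation; the `bucket` type, `quantile` function, and `inf` variable are names assumed for this example.

```go
package main

import (
	"fmt"
	"math"
)

// inf is the upper bound of the le="+Inf" bucket.
var inf = math.Inf(1)

// bucket holds one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // value of the le label
	count      float64 // cumulative number of observations <= upperBound
}

// quantile estimates the q-quantile (0 < q <= 1). buckets must be sorted by
// upperBound and end with the +Inf bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls into +Inf: the best available answer is the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}
	lower, prevCount := 0.0, 0.0
	if i > 0 {
		lower, prevCount = buckets[i-1].upperBound, buckets[i-1].count
	}
	upper := buckets[i].upperBound
	// Assume observations are spread evenly inside the bucket.
	return lower + (upper-lower)*(rank-prevCount)/(buckets[i].count-prevCount)
}

func main() {
	buckets := []bucket{{10, 10}, {50, 30}, {inf, 30}}
	fmt.Println(quantile(0.5, buckets)) // 20
}
```

With 30 observations the median has rank 15, which lies in the (10, 50] bucket a quarter of the way through its 20 observations, giving 10 + 40 × 5/20 = 20. The estimate is only as precise as the bucket boundaries, which is why the boundaries should follow the real distribution.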

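Practice

We take the most common approach, HTTP exporters, as the example: the application exposes its metrics on an HTTP endpoint, and Prometheus scrapes that endpoint periodically.

Requirements analysis

Report content	Metric type
Total HTTP request QPS per application	counter
HTTP request QPS per endpoint	counter
Average HTTP response time per endpoint	counter
HTTP response time per application (percentiles)	histogram
MySQL connection pool info (idle, in_use, ...)	gauge
MySQL SQL QPS	counter
Redis connection pool info	gauge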

Metrics finally exposed

# MySQL statement execution latency
# HELP mysql_latency_bucket 
# TYPE mysql_latency_bucket histogram
mysql_latency_bucket_bucket{db="book",le="5"} 526404
mysql_latency_bucket_bucket{db="book",le="10"} 541715
mysql_latency_bucket_bucket{db="book",le="20"} 542265
mysql_latency_bucket_bucket{db="book",le="40"} 551277
mysql_latency_bucket_bucket{db="book",le="80"} 560642
mysql_latency_bucket_bucket{db="book",le="160"} 560827
mysql_latency_bucket_bucket{db="book",le="320"} 560827
mysql_latency_bucket_bucket{db="book",le="640"} 560827
mysql_latency_bucket_bucket{db="book",le="1280"} 560827
mysql_latency_bucket_bucket{db="book",le="2560"} 560827
mysql_latency_bucket_bucket{db="book",le="5120"} 560827
mysql_latency_bucket_bucket{db="book",le="10240"} 560827
mysql_latency_bucket_bucket{db="book",le="+Inf"} 560827
mysql_latency_bucket_sum{db="book"} 2.3091988300000303e+06
mysql_latency_bucket_count{db="book"} 560827


# MySQL connection pool info
# HELP mysql_pool_stats 
# TYPE mysql_pool_stats gauge
mysql_pool_stats{db="book",form="idle"} 4
mysql_pool_stats{db="book",form="in_use"} 0
mysql_pool_stats{db="book",form="max_open"} 40
mysql_pool_stats{db="book",form="opened"} 4
# HELP mysql_pool_wait_num_count 
# TYPE mysql_pool_wait_num_count gauge
mysql_pool_wait_num_count{db="book"} 0
# HELP mysql_pool_wait_time_count 
# TYPE mysql_pool_wait_time_count gauge
mysql_pool_wait_time_count{db="book"} 0


# MySQL query count (QPS is derived with rate())
# HELP mysql_sql_duration_count 
# TYPE mysql_sql_duration_count counter
mysql_sql_duration_count{db="book",operate="query",table="attribute"} 1503
mysql_sql_duration_count{db="book",operate="query",table="banner"} 16329
mysql_sql_duration_count{db="book",operate="query",table="box_project"} 9613
mysql_sql_duration_count{db="book",operate="query",table="category"} 166
mysql_sql_duration_count{db="book",operate="query",table="coupon"} 59253
mysql_sql_duration_count{db="book",operate="query",table="goods"} 52556
mysql_sql_duration_count{db="book",operate="query",table="order"} 3638
mysql_sql_duration_count{db="book",operate="query",table="recommend_goods"} 539
mysql_sql_duration_count{db="book",operate="update",table="coupon"} 132
mysql_sql_duration_count{db="book",operate="update",table="user_coupon"} 132


# Redis connection pool info
# HELP redis_pool_stats 
# TYPE redis_pool_stats gauge
redis_pool_stats{form="idle",host="127.0.0.1"} 1
redis_pool_stats{form="max_open",host="127.0.0.1"} 60
redis_pool_stats{form="opened",host="127.0.0.1"} 1

# request count per endpoint (QPS is derived with rate())
# HELP request_duration_count 
# TYPE request_duration_count counter
request_duration_count{path="/api/v1/index"} 5301
request_duration_count{path="/api/v1/boxProjectList"} 63
request_duration_count{path="/api/v1//Goods"} 5992
request_duration_count{path="/api/v1//GoodsList"} 36384

# cumulative response time per endpoint (ms)
# HELP request_duration_latency_ms 
# TYPE request_duration_latency_ms counter
request_duration_latency_ms{path="/api/v1/index"} 15301
request_duration_latency_ms{path="/api/v1/boxProjectList"} 301
request_duration_latency_ms{path="/api/v1//Goods"} 25992
request_duration_latency_ms{path="/api/v1//GoodsList"} 126384

# response time of all requests to the application
# HELP request_latency_bucket 
# TYPE request_latency_bucket histogram
request_latency_bucket_bucket{le="5"} 264971
request_latency_bucket_bucket{le="10"} 394475
request_latency_bucket_bucket{le="20"} 436793
request_latency_bucket_bucket{le="40"} 528383
request_latency_bucket_bucket{le="80"} 552429
request_latency_bucket_bucket{le="160"} 553348
request_latency_bucket_bucket{le="320"} 553352
request_latency_bucket_bucket{le="640"} 555777
request_latency_bucket_bucket{le="1280"} 556971
request_latency_bucket_bucket{le="2560"} 556973
request_latency_bucket_bucket{le="5120"} 556975
request_latency_bucket_bucket{le="10240"} 556975
request_latency_bucket_bucket{le="+Inf"} 556975
request_latency_bucket_sum 8.779842e+06
request_latency_bucket_count 556975

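Prometheus can attach common labels, such as the service name or node IP, to every metric scraped from a target, so the application code in this example does not add node or service labels itself.

Implementation (golang)

The goal is to provide an HTTP GET endpoint that exposes the metrics defined above for Prometheus to scrape. You could implement all of it yourself, but here we use the official Go client package (github.com/prometheus/client_golang/prometheus).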
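
For a sense of what "implementing it yourself" would involve, here is a stdlib-only sketch that serves a single counter in the Prometheus text format. It is an illustration under assumptions, not what the rest of this article uses: the `counter` type, the `demo_requests_total` metric name, and the `scrape` helper are all made up for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// counter is a hand-rolled, goroutine-safe monotonically increasing value.
type counter struct {
	v int64
}

func (c *counter) Inc() { atomic.AddInt64(&c.v, 1) }

// metricsHandler writes the counter in the text exposition format:
// a HELP line, a TYPE line, then one "name value" sample line.
func metricsHandler(c *counter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, "# HELP demo_requests_total Requests handled so far.")
		fmt.Fprintln(w, "# TYPE demo_requests_total counter")
		fmt.Fprintf(w, "demo_requests_total %d\n", atomic.LoadInt64(&c.v))
	}
}

// scrape performs one in-memory GET /metrics and returns the response body,
// which is what a Prometheus scrape would receive.
func scrape(c *counter) string {
	rec := httptest.NewRecorder()
	metricsHandler(c)(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	c := &counter{}
	c.Inc()
	c.Inc()
	fmt.Print(scrape(c))
}
```

Labels, histograms, escaping, and registration are all missing here, which is the practical reason to prefer the official client package.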

Defining the Prom struct

package metric

import (
	"log"
	"net/http"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/spf13/cast"
)

// Prom caches the metrics created so far, keyed by name plus label values.
type Prom struct {
	Gauges           map[string]prometheus.Gauge
	Counters         map[string]prometheus.Counter
	Histograms       map[string]prometheus.Histogram
	LatencyHistogram prometheus.Histogram

	counterMutex sync.Mutex
	gaugesMutex  sync.Mutex
}

type MysqlQueryLabel struct {
	Db      string
	Table   string
	Operate string
}

type MysqlPoolLabel struct {
	Db   string
	Form string
}

type RedisPoolLabel struct {
	Host string
	Form string
}

type Label struct {
	key   string
	value string
}

var (
	GlobalProm *Prom
)

func InitProm() *Prom {
	// Add Go module build info.
	prometheus.MustRegister(prometheus.NewBuildInfoCollector())
	GlobalProm = &Prom{
		Counters: make(map[string]prometheus.Counter),
		Gauges:   make(map[string]prometheus.Gauge),
		LatencyHistogram: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name: "request_latency_bucket",
			//Buckets: prometheus.LinearBuckets(10, 20, 12),
			Buckets: prometheus.ExponentialBuckets(5, 2, 12),
		}),
	}
	prometheus.MustRegister(GlobalProm.LatencyHistogram)
	return GlobalProm
}

func (p *Prom) GetGauge(name string, labels ...Label) prometheus.Gauge {
	key := name
	constLabels := make(prometheus.Labels)

	for _, v := range labels {
		key += v.key + ":" + v.value
		constLabels[v.key] = v.value
	}
	p.gaugesMutex.Lock()
	defer p.gaugesMutex.Unlock()
	if p.Gauges[key] == nil {
		p.Gauges[key] = prometheus.NewGauge(prometheus.GaugeOpts{
			Name:        name,
			ConstLabels: constLabels,
		})
		err := prometheus.Register(p.Gauges[key])
		if err != nil {
			log.Printf("[prometheus.Gauge.Register]:name:%s,err:%v", name, err)
		}
	}
	return p.Gauges[key]
}

func (p *Prom) GetCounter(name string, labels ...Label) prometheus.Counter {
	key := name
	constLabels := make(prometheus.Labels)

	for _, v := range labels {
		key += v.key + ":" + v.value
		constLabels[v.key] = v.value
	}
	p.counterMutex.Lock()
	defer p.counterMutex.Unlock()
	if p.Counters[key] == nil {
		p.Counters[key] = prometheus.NewCounter(prometheus.CounterOpts{
			Name:        name,
			ConstLabels: constLabels,
		})
		err := prometheus.Register(p.Counters[key])
		if err != nil {
			log.Printf("[prometheus.Counter.Register]:name:%s,err:%v", name, err)
		}
	}
	return p.Counters[key]
}

func (p *Prom) NewHistogram(name string, db string) prometheus.Histogram {
	histogram := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:        name,
		Buckets:     prometheus.ExponentialBuckets(5, 2, 12),
		ConstLabels: prometheus.Labels{"db": db},
	})
	prometheus.MustRegister(histogram)
	return histogram
}

func (p *Prom) NewMysqlBucket(db string) prometheus.Histogram {
	return p.NewHistogram("mysql_latency_bucket", db)
}

func (p *Prom) GetMysqlQueryCounter(in MysqlQueryLabel) prometheus.Counter {
	return p.GetCounter("mysql_sql_duration_count", Label{"db", in.Db}, Label{"table", in.Table}, Label{"operate", in.Operate})
}

func (p *Prom) GetRequestCounter(name string) prometheus.Counter {
	return p.GetCounter("request_duration_count", Label{"path", name})
}

func (p *Prom) GetRequestLatencyCounter(name string) prometheus.Counter {
	return p.GetCounter("request_duration_latency_ms", Label{"path", name})
}

func (p *Prom) GetRpcErrCounter(method string, code uint32) prometheus.Counter {
	return p.GetCounter("rpc_error_duration_count", Label{"path", method}, Label{"code", cast.ToString(code)})
}

func (p *Prom) GetMysqlPoolWaitNumGauge(in MysqlPoolLabel) prometheus.Gauge {
	return p.GetGauge("mysql_pool_wait_num_count", Label{"db", in.Db})
}

func (p *Prom) GetMysqlPoolWaitTimeGauge(in MysqlPoolLabel) prometheus.Gauge {
	return p.GetGauge("mysql_pool_wait_time_count", Label{"db", in.Db})
}

func (p *Prom) GetMysqlPoolGauge(in MysqlPoolLabel) prometheus.Gauge {
	return p.GetGauge("mysql_pool_stats", Label{"db", in.Db}, Label{"form", in.Form})
}

func (p *Prom) GetRedisPoolGauge(in RedisPoolLabel) prometheus.Gauge {
	return p.GetGauge("redis_pool_stats", Label{"host", in.Host}, Label{"form", in.Form})
}

func (p *Prom) MysqlQueryInc(in MysqlQueryLabel) {
	p.GetMysqlQueryCounter(in).Inc()
	return
}

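func (p *Prom) Inc(name string, t int64) {
	// per-path request count (QPS is derived with rate())
	p.GetRequestCounter(name).Inc()
	// latency distribution, used for percentiles
	p.LatencyHistogram.Observe(float64(t))
	// cumulative latency; divided by the request count it gives the average response time
	p.GetRequestLatencyCounter(name).Add(float64(t))
}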

func (p *Prom) Listen(addr string) {
	// Expose the registered metrics via HTTP.
	http.Handle("/metrics", promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{
			// OpenMetrics (needed for exemplars) is left disabled here.
			EnableOpenMetrics: false,
			Timeout:           time.Second * 3,
		},
	))
	go func() {
		log.Fatal(http.ListenAndServe(addr, nil))
	}()
}

Hooking into the HTTP middleware

func HttpServerInterceptor(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		t := time.Now()
		next(w, r)
		if metric.GlobalProm != nil {
			// record application QPS, per-endpoint QPS, and latency
			metric.GlobalProm.Inc(r.URL.Path, time.Since(t).Milliseconds())
		}
	}
}
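
The middleware above records only the path and the elapsed time. If you also want the response status as a label (as in the earlier code="200" examples), the handler's ResponseWriter has to be wrapped. The following is a stdlib-only sketch of that technique; `statusRecorder`, `observeFunc`, `withMetrics`, and `probe` are hypothetical names, and the `observe` callback stands in for whatever metric calls you would make.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// observeFunc receives the path, status code, and elapsed milliseconds of one request.
type observeFunc func(path string, status int, elapsedMs int64)

// withMetrics times the wrapped handler and reports the result to observe.
func withMetrics(next http.HandlerFunc, observe observeFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Default to 200 because handlers that never call WriteHeader imply it.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)
		observe(r.URL.Path, rec.status, time.Since(start).Milliseconds())
	}
}

// probe sends one in-memory request through the middleware and returns
// the path and status that the observe callback received.
func probe(path string, code int) (string, int) {
	var gotPath string
	var gotStatus int
	h := withMetrics(
		func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(code) },
		func(p string, s int, _ int64) { gotPath, gotStatus = p, s },
	)
	h(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	return gotPath, gotStatus
}

func main() {
	fmt.Println(probe("/api/v1/index", http.StatusNotFound))
}
```

Keep label cardinality in mind: raw paths that contain IDs, combined with status codes, can multiply the number of series quickly.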

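MySQL integration (gorm v1)

import (
	"database/sql/driver"
	"fmt"
	"os"
	"reflect"
	"regexp"
	"strconv"
	"strings"
	"time"
	"unicode"

	"github.com/jinzhu/gorm"
	_ "github.com/jinzhu/gorm/dialects/mysql" // registers the "mysql" dialect used by gorm.Open
	"github.com/prometheus/client_golang/prometheus"
	// Assumption: log.New() and log.JSONFormatter used below match the logrus API,
	// so "log" is taken to be an alias for logrus.
	log "github.com/sirupsen/logrus"
	// The metric package defined earlier must be imported as well; its path depends on your module.
)

// These two patterns mirror the unexported ones in gorm v1's logger.go.
var (
	sqlRegexp                = regexp.MustCompile(`\?`)
	numericPlaceHolderRegexp = regexp.MustCompile(`\$\d+`)
)

// Logger wraps gorm's log writer and feeds SQL latency into a histogram.
type Logger struct {
	gorm.LogWriter
	histogram *prometheus.Histogram
}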

// Print format & print log
func (logger Logger) Print(values ...interface{}) {
	logger.Println(LogFormatter(logger.histogram, values...)...)
}

func isPrintable(s string) bool {
	for _, r := range s {
		if !unicode.IsPrint(r) {
			return false
		}
	}
	return true
}

// LogFormatter formats the log line and records SQL execution time along the way
func LogFormatter(histogram *prometheus.Histogram, values ...interface{}) (messages []interface{}) {
	if len(values) > 1 {
		var (
			sql             string
			formattedValues []string
			level           = values[0]
			source          = fmt.Sprintf("(%v)", values[1])
		)

		messages = []interface{}{source}

		if len(values) == 2 {
			//remove the brackets
			source = fmt.Sprintf("%v", values[1])
			messages = []interface{}{source}
		}

		if level == "sql" {
			// duration in milliseconds, truncated to two decimals
			latencyMs := float64(values[2].(time.Duration).Nanoseconds()/1e4) / 100.0
			messages = []interface{}{fmt.Sprintf("mysqlSql:[%.2fms]", latencyMs)}
			// record the SQL latency in the histogram
			(*histogram).Observe(latencyMs)
			// sql
			for _, value := range values[4].([]interface{}) {
				indirectValue := reflect.Indirect(reflect.ValueOf(value))
				if indirectValue.IsValid() {
					value = indirectValue.Interface()
					if t, ok := value.(time.Time); ok {
						if t.IsZero() {
							formattedValues = append(formattedValues, fmt.Sprintf("'%v'", "0000-00-00 00:00:00"))
						} else {
							formattedValues = append(formattedValues, fmt.Sprintf("'%v'", t.Format("2006-01-02 15:04:05")))
						}
					} else if b, ok := value.([]byte); ok {
						if str := string(b); isPrintable(str) {
							formattedValues = append(formattedValues, fmt.Sprintf("'%v'", str))
						} else {
							formattedValues = append(formattedValues, "'<binary>'")
						}
					} else if r, ok := value.(driver.Valuer); ok {
						if value, err := r.Value(); err == nil && value != nil {
							formattedValues = append(formattedValues, fmt.Sprintf("'%v'", value))
						} else {
							formattedValues = append(formattedValues, "NULL")
						}
					} else {
						switch value.(type) {
						case int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64, float32, float64, bool:
							formattedValues = append(formattedValues, fmt.Sprintf("%v", value))
						default:
							formattedValues = append(formattedValues, fmt.Sprintf("'%v'", value))
						}
					}
				} else {
					formattedValues = append(formattedValues, "NULL")
				}
			}

			// differentiate between $n placeholders or else treat like ?
			if numericPlaceHolderRegexp.MatchString(values[3].(string)) {
				sql = values[3].(string)
				for index, value := range formattedValues {
					placeholder := fmt.Sprintf(`\$%d([^\d]|$)`, index+1)
					sql = regexp.MustCompile(placeholder).ReplaceAllString(sql, value+"$1")
				}
			} else {
				formattedValuesLength := len(formattedValues)
				for index, value := range sqlRegexp.Split(values[3].(string), -1) {
					sql += value
					if index < formattedValuesLength {
						sql += formattedValues[index]
					}
				}
			}

			messages = append(messages, sql)
			messages = append(messages, fmt.Sprintf(" rows:[%v]", strconv.FormatInt(values[5].(int64), 10)))
		} else {
			messages = append(messages, values[2:]...)
		}
	}

	return
}

func NewMysqlInstance(config *MysqlConfig) (*gorm.DB, error) {
	db, err := gorm.Open("mysql", fmt.Sprintf("%s:%s@(%s:%s)/%s?charset=%s&parseTime=True&loc=Local", config.User, config.Password, config.Host, config.Port, config.Db, config.Charset))
	if err != nil {
		return nil, err
	}
	db.SingularTable(config.SingularTable)
	gorm.DefaultTableNameHandler = func(db *gorm.DB, defaultTableName string) string {
		return config.Prefix + defaultTableName
	}
	
	db.DB().SetMaxOpenConns(config.MaxActive)
	db.DB().SetMaxIdleConns(config.MaxIdle)
	db.LogMode(config.LogModel)
    
	// custom logger: SQL latency is measured by intercepting gorm's log output
	loggerIns := log.New()
	loggerIns.SetOutput(os.Stdout)
	loggerIns.SetFormatter(&log.JSONFormatter{})
	histogram := metric.GlobalProm.NewMysqlBucket(config.Db)
	db.SetLogger(Logger{loggerIns, &histogram})

	if metric.GlobalProm != nil {
		getTableName := func(name string) string {
			return strings.Split(name, " ")[0]
		}
		queryCallback := func(scope *gorm.Scope) {
			metric.GlobalProm.MysqlQueryInc(metric.MysqlQueryLabel{Db: config.Db, Table: getTableName(scope.TableName()), Operate: "query"})
		}
		db.Callback().Query().After("gorm:query").Register("query:statistic", queryCallback)
		db.Callback().RowQuery().After("gorm:row_query").Register("row_query:statistic", queryCallback)

		updateCallback := func(scope *gorm.Scope) {
			metric.GlobalProm.MysqlQueryInc(metric.MysqlQueryLabel{Db: config.Db, Table: getTableName(scope.TableName()), Operate: "update"})
		}
		// Update, Create, and Delete are all counted under the "update" label
		db.Callback().Update().After("gorm:update").Register("update:statistic", updateCallback)
		db.Callback().Create().After("gorm:create").Register("create:statistic", updateCallback)
		db.Callback().Delete().After("gorm:delete").Register("delete:statistic", updateCallback)

		go func() {
			// sample the connection pool state every 2 seconds
			ticker := time.NewTicker(time.Second * 2)
			inUse := metric.GlobalProm.GetMysqlPoolGauge(metric.MysqlPoolLabel{config.Db, "in_use"})
			idle := metric.GlobalProm.GetMysqlPoolGauge(metric.MysqlPoolLabel{config.Db, "idle"})
			opened := metric.GlobalProm.GetMysqlPoolGauge(metric.MysqlPoolLabel{config.Db, "opened"})
			waitNum := metric.GlobalProm.GetMysqlPoolWaitNumGauge(metric.MysqlPoolLabel{Db: config.Db})
			waitTime := metric.GlobalProm.GetMysqlPoolWaitTimeGauge(metric.MysqlPoolLabel{Db: config.Db})
			metric.GlobalProm.GetMysqlPoolGauge(metric.MysqlPoolLabel{config.Db, "max_open"}).Set(float64(config.MaxActive))
			for range ticker.C {
				stats := db.DB().Stats()
				inUse.Set(float64(stats.InUse))
				idle.Set(float64(stats.Idle))
				opened.Set(float64(stats.OpenConnections))
				waitNum.Set(float64(stats.WaitCount))
				waitTime.Set(float64(stats.WaitDuration / time.Millisecond))
			}
		}()
	}

	return db, nil
}
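
It helps to look at which fields of the standard library's sql.DBStats are point-in-time values and which are running totals, because that determines how they should be queried. The sketch below uses only database/sql; `poolSnapshot` and `demoStats` are hypothetical helpers, and the map keys are illustrative.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// poolSnapshot converts sql.DBStats into plain numbers.
// InUse, Idle, and OpenConnections describe the pool right now (gauge-like).
// WaitCount and WaitDuration are totals since the pool was opened, so they only
// grow and are usually viewed through rate() or increase().
func poolSnapshot(s sql.DBStats) map[string]float64 {
	return map[string]float64{
		"in_use":       float64(s.InUse),
		"idle":         float64(s.Idle),
		"opened":       float64(s.OpenConnections),
		"wait_num":     float64(s.WaitCount),
		"wait_time_ms": float64(s.WaitDuration / time.Millisecond),
	}
}

// demoStats returns fixed numbers so the example runs without a database.
func demoStats() sql.DBStats {
	return sql.DBStats{
		OpenConnections: 5,
		InUse:           1,
		Idle:            4,
		WaitCount:       2,
		WaitDuration:    1500 * time.Millisecond,
	}
}

func main() {
	snap := poolSnapshot(demoStats())
	fmt.Println(snap["opened"], snap["wait_time_ms"])
}
```

This is why the wait-time panel later in the article applies rate() to the wait-time series: the exported value is a running total, and the rate shows how much waiting happened per second.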

Redis connection pool metrics (github.com/gomodule/redigo/redis)

type RedisConfig struct {
	Host        string
	Port        string
	Password    string
	Db          int
	IdleTimeout time.Duration
	Wait        bool
	MaxIdel     int
	MaxActive   int
}

// NewRedisInstance initializes the Redis pool and starts exporting its stats
func NewRedisInstance(config *RedisConfig) (*redis.Pool, error) {
	redisPool := &redis.Pool{
		MaxIdle:     config.MaxIdel,
		MaxActive:   config.MaxActive,
		IdleTimeout: config.IdleTimeout,
		Wait: config.Wait,
		Dial: func() (redis.Conn, error) {
			c, err := redis.Dial("tcp", fmt.Sprintf("%s:%s", config.Host, config.Port), redis.DialDatabase(config.Db))
			if err != nil {
				return nil, err
			}
			password := config.Password
			if password == "" {
				return c, nil
			}
			if _, err := c.Do("AUTH", password); err != nil {
				c.Close()
				return nil, err
			}
			return c, err
		},
	}

	conn := redisPool.Get()
	defer conn.Close()
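	r, err := redis.String(conn.Do("PING"))
	if err != nil {
		return nil, fmt.Errorf("redis ping failed: %w", err)
	}
	if r != "PONG" {
		// err is nil here, so it must not be dereferenced
		return nil, errors.New("redis ping failed: unexpected reply " + r)
	}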

	if metric.GlobalProm != nil {
		go func() {
			ticker := time.NewTicker(time.Second * 2)
			opened := metric.GlobalProm.GetRedisPoolGauge(metric.RedisPoolLabel{config.Host, "opened"})
			idle := metric.GlobalProm.GetRedisPoolGauge(metric.RedisPoolLabel{config.Host, "idle"})
			metric.GlobalProm.GetRedisPoolGauge(metric.RedisPoolLabel{config.Host, "max_open"}).Set(float64(config.MaxActive))
			for range ticker.C {
				stats := redisPool.Stats()
				idle.Set(float64(stats.IdleCount))
				opened.Set(float64(stats.ActiveCount))
			}
		}()
	}
	return redisPool, nil
}

Listening from the main entry point

prom := metric.InitProm()
prom.Listen(":9099")

At this point the service exposes its metrics at http://<ip>:9099/metrics

Prometheus configuration

Configuring the targets

# my global config
global:
  scrape_interval:     5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
  evaluation_interval: 5s # Evaluate rules every 5 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configuration for the application's metrics endpoint
# (the "book" service used as the example in this article).
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'book'

    # metrics_path defaults to '/metrics'
    metrics_path: '/metrics'
    # scheme defaults to 'http'.
    
    # several targets can be listed, separated by ","
    static_configs:
    - targets: ['127.0.0.1:9099']

Grafana integration

Configuring the Prometheus data source

Panel queries

#service qps
sum(rate(request_latency_bucket_count{service=~"$service",pod=~"$pod"}[30s])) by (service)

#request latency
histogram_quantile(0.90, sum(rate(request_latency_bucket_bucket{service=~"$service",pod=~"$pod"}[30s])) by (le))
histogram_quantile(0.95, sum(rate(request_latency_bucket_bucket{service=~"$service",pod=~"$pod"}[30s])) by (le))
histogram_quantile(0.99, sum(rate(request_latency_bucket_bucket{service=~"$service",pod=~"$pod"}[30s])) by (le))

#api qps
sum(rate(request_duration_count{service=~"$service",pod=~"$pod"}[30s])) by (path)

# average response time per endpoint
avg(increase(request_duration_latency_ms{pod=~"$pod"}[60s])  / increase(request_duration_count{pod=~"$pod"}[60s]) > 0) by (path)

#mysql latency
histogram_quantile(0.90, sum(rate(mysql_latency_bucket_bucket{service=~"$service",pod=~"$pod",db=~"$db"}[30s])) by (le))
histogram_quantile(0.95, sum(rate(mysql_latency_bucket_bucket{service=~"$service",pod=~"$pod",db=~"$db"}[30s])) by (le))
histogram_quantile(0.99, sum(rate(mysql_latency_bucket_bucket{service=~"$service",pod=~"$pod",db=~"$db"}[30s])) by (le))

#Mysql Pool Wait Time
sum(rate(mysql_pool_wait_time_count{service=~"$service",pod=~"$pod",db=~"$db"}[30s])) by (service)

#Mysql Pool
sum(mysql_pool_stats{service=~"$service",pod=~"$pod",db=~"$db"}) by (form)

#Mysql Table Qps
sum(rate(mysql_sql_duration_count{service=~"$service",pod=~"$pod",db=~"$db"}[30s])) by (table)

#Mysql Db Qps
sum(rate(mysql_sql_duration_count{service=~"$service",pod=~"$pod",db=~"$db"}[30s])) by (db)

#Redis Pool
sum(redis_pool_stats{service=~"$service",pod=~"$pod",host=~"$redis_host"}) by (form)

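The service and pod labels in the queries above come from the K8s Prometheus Operator used in this setup. The Operator maps them automatically, so the application does not need to expose them.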
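
The average response time query divides the growth of the cumulative latency counter by the growth of the request counter over the same window. The arithmetic can be sketched as follows; `avgLatencyMs` is a hypothetical helper and the numbers are made up.

```go
package main

import "fmt"

// avgLatencyMs mirrors increase(latency_ms) / increase(request_count) for one window.
// ok is false when no requests arrived. In PromQL that case is 0/0 = NaN,
// which the "> 0" comparison in the query filters out.
func avgLatencyMs(prevSumMs, curSumMs, prevCount, curCount float64) (avg float64, ok bool) {
	requests := curCount - prevCount
	if requests <= 0 {
		return 0, false
	}
	return (curSumMs - prevSumMs) / requests, true
}

func main() {
	// 100 new requests added 300 ms of total latency within the window.
	fmt.Println(avgLatencyMs(1000, 1300, 100, 200)) // 3 true
}
```

Because both inputs are counters, the result reflects only the requests inside the window rather than the lifetime average of the process.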

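Creating the panels

The dashboard contains one panel for each of the following:

Service QPS
Request latency percentiles
API QPS
MySQL latency
MySQL connection pool wait time
MySQL connection pool state
MySQL QPS per table
MySQL QPS per db
Redis connection pool info

Filter (variable) configuration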

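k8s environment

When an application runs on k8s, the pods it is deployed to are scheduled dynamically by k8s. We need the metrics of every pod, so a fixed list of targets cannot be configured.

Prometheus Operator

The Prometheus Operator is available for k8s. Once it is deployed in the cluster, targets can be configured dynamically by declaring a ServiceMonitor.

ServiceMonitor

The Operator detects ServiceMonitor resources automatically and updates the target list accordingly.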

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: book-metrics
  namespace: book
  annotations:
    # Alibaba Cloud Prometheus Operator 3.1 and later only picks up the ServiceMonitor when this annotation is present
    arms.prometheus.io/discovery: "true"
spec:
  endpoints:
    - interval: 15s
      # name of the Service port that exposes the metrics
      port: prometheus
      path: /metrics
  selector:
    matchLabels:
      app: book-metrics

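FAQ

Why does the panel show no data even though requests are arriving?

Counter metrics: this is usually because there are too few samples inside the range. rate() and related functions need at least two samples within the time window, because a counter stores a cumulative value and the result is computed from the difference between the current sample and an earlier one. Make sure the range in the query (for example [30s]) is comfortably larger than the scrape interval.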
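
The two-sample requirement can be checked with a small sketch. It is a simplified model: exact window boundary semantics differ between Prometheus versions, and `samplesInWindow` and `canRate` are hypothetical helpers.

```go
package main

import "fmt"

// samplesInWindow counts the scrapes whose timestamp falls inside (now-window, now].
func samplesInWindow(scrapeTimes []int64, now, window int64) int {
	n := 0
	for _, t := range scrapeTimes {
		if t > now-window && t <= now {
			n++
		}
	}
	return n
}

// canRate reports whether a rate-style function has the two samples it needs.
func canRate(scrapeTimes []int64, now, window int64) bool {
	return samplesInWindow(scrapeTimes, now, window) >= 2
}

func main() {
	// Scrapes every 30s, evaluated at t=100s.
	scrapes := []int64{10, 40, 70, 100}
	fmt.Println(canRate(scrapes, 100, 30), canRate(scrapes, 100, 120)) // false true
}
```

With a 30-second scrape interval, a [30s] range sees only one sample and yields nothing, while a [120s] range sees four and produces a value.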

Grafana variables cannot fetch their values

After a variable is configured, Grafana reports an error.

Check whether the label values API is permitted:

curl http://x.x.x.x:30006/api/v1/label/instance/values
{
    "status": "success",
    "data": [],
    "warnings": [
        "disable action"
    ]
}

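"disable action" means calls to this API are disabled. Check the configuration file and enable access.

Summary

This article briefly introduced the basic concepts of Prometheus and focused on how a business application can expose custom metrics. Once you understand how this works, you can build monitoring tailored to your own needs. Open-source services such as nginx and apisix generally ship standard HTTP exporters and Grafana dashboard templates that you can use directly. I hope this article helps you.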