Abstract: This article covers running Grafana + Prometheus + Loki in both k3s and Docker environments, with a focus on alert configuration and log ingestion.
k3s
- Installation docs: docs.k3s.io/zh/advanced...
I use Docker as the container runtime; the install command is:
curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn sh -s - --docker
Grafana
- Installation docs: grafana.com/docs/grafan...
- Dashboards: grafana.com/grafana/das...
Prometheus
- Official docs: prometheus.io/docs/promet...
- k8s + Operator docs: prometheus-operator.dev/docs/gettin...
Installation
This walkthrough installs the monitoring stack with kube-prometheus:
bash
# Create the namespace and CRDs, and then wait for them to be available before creating the remaining resources
# Note that due to some CRD size we are using kubectl server-side apply feature which is generally available since kubernetes 1.22.
# If you are using previous kubernetes versions this feature may not be available and you would need to use kubectl create instead.
kubectl apply --server-side -f manifests/setup
kubectl wait \
  --for condition=Established \
  --all CustomResourceDefinition \
  --namespace=monitoring
kubectl apply -f manifests/
For images that cannot be pulled from registry.k8s.io, pull mirrored copies and retag them:
bash
docker pull bitnami/kube-state-metrics:2.13.0
docker tag bitnami/kube-state-metrics:2.13.0 registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
docker pull v5cn/prometheus-adapter:v0.12.0
docker tag v5cn/prometheus-adapter:v0.12.0 registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.12.0
Exposing the Grafana svc port
- Change the svc type to NodePort
- Open the NetworkPolicy to all inbound traffic by changing its ingress as follows:
yaml
spec:
  egress:
    - {}
  ingress:
    - {}
  podSelector:
    matchLabels:
      app.kubernetes.io/component: grafana
      app.kubernetes.io/name: grafana
      app.kubernetes.io/part-of: kube-prometheus
  policyTypes:
    - Egress
    - Ingress
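The NodePort change from the first step can be sketched like this (a minimal fragment of the grafana Service in the monitoring namespace; the nodePort value 32000 is an arbitrary example, not from the source):

```yaml
# Hypothetical sketch: switch the grafana Service to NodePort.
# Edit with: kubectl -n monitoring edit svc grafana
spec:
  type: NodePort        # was ClusterIP
  ports:
    - name: http
      port: 3000
      targetPort: http
      nodePort: 32000   # example; must fall in the cluster's NodePort range
```

After this, Grafana is reachable on any node's IP at that port.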
Monitoring a Spring Boot project
springboot.yaml
yaml
apiVersion: v1
kind: Namespace
metadata:
  name: k8s-springboot
---
apiVersion: v1
kind: Service
metadata:
  name: k8s-springboot-demo
  namespace: k8s-springboot
  labels:
    app: k8s-springboot-demo
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP
      name: web
  selector:
    app: k8s-springboot-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-springboot-demo
  namespace: k8s-springboot
spec:
  selector:
    matchLabels:
      app: k8s-springboot-demo
  replicas: 1
  template:
    metadata:
      labels:
        app: k8s-springboot-demo
    spec:
      containers:
        - name: k8s-springboot-demo
          image: huzhihui/springboot:1.0.0
          ports:
            - containerPort: 8080
          # A probe may define exactly one handler (httpGet, exec, or
          # tcpSocket); mixing them is rejected by the API server, so only
          # the httpGet handler is kept here.
          livenessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            timeoutSeconds: 1
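For the /actuator/prometheus endpoint scraped later to exist, the application itself must expose it. A minimal application.yml sketch, assuming spring-boot-starter-actuator and micrometer-registry-prometheus are on the classpath (the common tag is a hypothetical addition):

```yaml
# Sketch of the Spring Boot side — assumes the actuator and
# micrometer-registry-prometheus dependencies are present.
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus   # exposes /actuator/prometheus
  metrics:
    tags:
      application: k8s-springboot-demo   # hypothetical label added to all metrics
```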
ServiceMonitor
Select the svc by label:
yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: k8s-springboot-demo
  namespace: k8s-springboot
  labels:
    team: k8s-springboot-demo
spec:
  selector:
    matchLabels:
      app: k8s-springboot-demo
  endpoints:
    - port: web
Prometheus
Registering the endpoint
If the metrics endpoint is not /metrics, you also need to configure the scrape path (e.g. via a ScrapeConfig).
yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: k8s-springboot
spec:
  serviceAccountName: prometheus
  # Select the ServiceMonitor created above by its team label
  # (the original used podMonitorSelector, which would never match it).
  serviceMonitorSelector:
    matchLabels:
      team: k8s-springboot-demo
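When the application serves its metrics at /actuator/prometheus rather than /metrics, the simplest option in this setup is the path field on the ServiceMonitor endpoint; a sketch (the interval is an example value):

```yaml
# Sketch: scrape a non-default metrics path via the ServiceMonitor endpoint.
spec:
  endpoints:
    - port: web
      path: /actuator/prometheus   # default would be /metrics
      interval: 15s                # example scrape interval
```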
Monitoring a Docker host
Deployment configuration
Grafana's docker-compose.yml
yaml
version: "3"
services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    networks:
      - default
    volumes:
      - ./data:/var/lib/grafana
networks:
  default:
    external:
      name: nisec
prometheus.yml
yaml
scrape_configs:
  # Prometheus
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # cadvisor
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080
  # node-exporter
  - job_name: node-exporter
    scrape_interval: 5s
    static_configs:
      - targets:
          - node-exporter:9100
  # java jvm
  - job_name: spring-boot
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - '192.168.137.2:8081'
docker-compose.yml
Give the mounted ./data directory 777 permissions so the Prometheus container can write to it:
yaml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    networks:
      - default
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./data:/prometheus
    depends_on:
      - cadvisor
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    expose:
      - 9100
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    networks:
      - default
    ports:
      - 8080:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
networks:
  default:
    external:
      name: nisec
Start everything:
docker compose up -d
Grafana dashboards used
- Configure the data source
- Docker + host resources: grafana.com/grafana/das...
- Host resources: grafana.com/grafana/das...
- Spring Boot: grafana.com/grafana/das...
- Spring Boot + HTTP endpoints: grafana.com/grafana/das...
To work around rootfs disk usage not being readable, take only the top series (topk):
promql
100 - ((topk(1,node_filesystem_avail_bytes{instance="$node",job="$job",mountpoint=~"/.*",fstype!="rootfs"}) * 100) / topk(1,node_filesystem_size_bytes{instance="$node",job="$job",mountpoint=~"/.*",fstype!="rootfs"}))
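The arithmetic this expression performs per filesystem can be sketched in Python with made-up byte counts (these are illustrative values, not real node-exporter samples):

```python
# Sketch of 100 - (avail * 100 / size): percent of the filesystem in use.
def used_percent(avail_bytes: float, size_bytes: float) -> float:
    """Disk usage percentage from free and total bytes."""
    return 100 - (avail_bytes * 100) / size_bytes

# 40 GiB free on a 100 GiB filesystem -> 60.0% used
print(used_percent(40 * 2**30, 100 * 2**30))  # 60.0
```

The topk(1, ...) wrappers in the PromQL simply keep the single largest series so the duplicated rootfs mount does not produce a second result.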
Alert configuration
- Add a group robot in DingTalk
- Configure the message template
Copy it verbatim:
go
{{ define "custom.alerts" -}}
{{ len .Alerts }} alert(s)
{{ range .Alerts -}}
{{ template "alert.summary_and_description" . -}}
{{ end -}}
{{ end -}}

{{ define "alert.summary_and_description" }}
--------------------
Summary: {{ .Annotations.summary }}
Status: {{ .Status }}
Description: {{ .Annotations.description }}
Detail: {{ .Values.A }}
StartsAt: {{ .StartsAt | tz "Asia/Chongqing" }}
EndsAt: {{ .EndsAt | tz "Asia/Chongqing" }}
{{ end -}}
- Configure the Grafana DingTalk contact point
- Create a notification policy
- Notification policy details
The configuration below enables label matching: only alerts whose labels match are routed through this policy.
- Setting reference
| Setting | Explanation |
|---|---|
| Group wait | How long to wait before the first send. With 10s, messages arriving within the 10-second window are batched and sent together at the 10-second mark; 0s sends immediately. |
| Group interval | Interval between sends when an alert in the group changes state (firing to resolved, or resolved to firing). With 5m: an alert sent at minute 01 that recovers at minute 02 is not sent immediately; the recovery goes out at minute 06. If it recovers at minute 07 instead, it is sent at minute 11 — sends land on multiples of the configured interval. |
| Repeat interval | How often a still-firing alert is re-sent. An alert that keeps firing without recovering is re-sent after this interval (in practice, plus the Group interval). |
Alert rule example
I use a MySQL table as the data source, which makes it easy to tweak the data.
- Enter the query
- Configure the alert condition
Alert when the value exceeds 40.
- Configure the evaluation frequency
In my dev environment it evaluates every 10s.
Pending period
How long the condition must hold before the alert actually fires. With 0s, the alert fires as soon as the condition is met. Otherwise, when the condition is first met the alert enters the pending period; if it still holds at the end of that period, the alert fires and notifications go out; if it stops holding, the alert returns to normal.
Loki
Ingesting logs into Grafana
Below is a simple local example.
loki-config.yaml
yaml
# This is a complete configuration to deploy Loki backed by the filesystem.
# The index will be shipped to the storage via tsdb-shipper.
auth_enabled: false
server:
  http_listen_port: 3100
common:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
  replication_factor: 1
  path_prefix: /tmp/loki
schema_config:
  configs:
    - from: 2020-05-15
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  filesystem:
    directory: /tmp/loki/chunks
promtail-config.yaml
yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
  - job_name: spring-boot
    static_configs:
      - targets:
          - localhost
        labels:
          job: static-server
          __path__: /var/log/static-server/*log
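Once Promtail ships these files, they can be queried from Grafana's Explore view with LogQL; two small examples against the job labels defined above:

```logql
# All lines from the Spring Boot log files
{job="static-server"}

# Lines containing "ERROR", counted per minute
count_over_time({job="static-server"} |= "ERROR" [1m])
```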
docker-compose.yml
yaml
version: "3"
networks:
  loki:
    external:
      name: nisec
services:
  loki:
    image: grafana/loki:3.2.1
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/mnt/config/loki-config.yaml
    command: -config.file=/mnt/config/loki-config.yaml
    networks:
      - loki
  promtail:
    image: grafana/promtail:3.2.1
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/mnt/config/promtail-config.yaml
    command: -config.file=/mnt/config/promtail-config.yaml
    depends_on:
      - loki
    networks:
      - loki
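To wire Loki into Grafana without clicking through the UI, Grafana's datasource provisioning can be used; a minimal sketch (the file path, mount point, and name are assumptions, not from the source):

```yaml
# Hypothetical provisioning file, e.g. ./provisioning/datasources/loki.yaml,
# mounted into the grafana container at /etc/grafana/provisioning/datasources/.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100   # Loki service name on the shared nisec network
    isDefault: false
```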