17. K8s Observability: Distributed (Full-Link) Tracing
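1. Getting to Know SkyWalking
1.1 Why a Distributed Tracing Platform Is Needed
- Quickly locate the point of failure
- Quickly locate performance bottlenecks
- Understand service dependencies
- Visualize traffic across the whole system
1.2 Core Components and Working Principles of Distributed Tracing
1.2.1 Core Concepts
- Trace: the complete processing of one request is called a Trace. It covers everything from the client issuing the request to the backend finishing it. A Trace is made up of multiple Spans.
- Span: a Span represents one unit of work inside a Trace, such as a function call or an HTTP request. Each Span records an operation name, a start time, an end time, and metadata about the operation. Spans have parent-child relationships, and the set of related Spans together describes one Trace.
- Trace ID and Span ID: every Trace has a unique Trace ID and every Span has a unique Span ID; a Span also carries a reference to its parent Span.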
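The relationship between these three concepts can be illustrated with a small sketch. The `Span` class, its field names, and the sample values below are made up for illustration and are not SkyWalking's internal data model; the point is only that a flat list of Spans can be reassembled into a tree through the parent references.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    operation: str
    start_ms: int
    end_ms: int
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_trace(spans: List[Span]) -> Span:
    """Link the spans of one trace via parent_id and return the root span."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.operation} ({span.duration_ms} ms)"]
    for child in sorted(span.children, key=lambda c: c.start_ms):
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical sample data: one request that ends in a database query.
spans = [
    Span("t-1", "s-1", None, "GET /orders", 0, 120),
    Span("t-1", "s-2", "s-1", "order.list", 5, 110),
    Span("t-1", "s-3", "s-2", "mysql SELECT", 10, 90),
]
root = build_trace(spans)
print("\n".join(render(root)))
```

The slowest leaf in such a tree is usually the first place to look when a request is slow.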

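1.2.2 How Distributed Tracing Works
1. The client sends a request
2. Service A starts handling the request and creates the initial Trace and Span
3. Service A forwards the request to service B, passing along the Trace ID and Span ID
4. Service B uses that information to create a new Span and marks the parent Span
5. When all services finish, the Span data each one produced is sent to the tracing platform for aggregation
6. Users can view the details of the whole Trace in the UI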
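The steps above can be sketched in a few lines. This is a toy model, not SkyWalking's protocol: the header names `x-trace-id` and `x-parent-span-id` are invented for the example (SkyWalking agents use their own `sw8` header format), and the `COLLECTED` list stands in for the tracing backend.

```python
import uuid
from typing import Dict, List, Optional

# Stand-in for the tracing backend that receives finished spans (step 5).
COLLECTED: List[Dict[str, Optional[str]]] = []

# Hypothetical header names, used only in this sketch.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


def new_id() -> str:
    return uuid.uuid4().hex[:16]


def start_span(service: str, headers: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Continue the incoming trace if the headers carry one, else start a new one."""
    return {
        "service": service,
        "trace_id": headers.get(TRACE_HEADER) or new_id(),
        "span_id": new_id(),
        "parent_id": headers.get(PARENT_HEADER),
    }


def service_b(headers: Dict[str, str]) -> None:
    # Step 4: create a new span and mark the caller's span as its parent.
    span = start_span("service-b", headers)
    COLLECTED.append(span)


def service_a(headers: Dict[str, str]) -> None:
    # Step 2: no trace context arrives, so this becomes the root span.
    span = start_span("service-a", headers)
    # Step 3: pass the Trace ID and this Span ID downstream.
    outgoing = {TRACE_HEADER: span["trace_id"], PARENT_HEADER: span["span_id"]}
    service_b(outgoing)
    COLLECTED.append(span)


service_a({})  # step 1: a client request without any trace context
for span in COLLECTED:
    print(span)
```

Both printed spans share one `trace_id`, and the `parent_id` of the service-b span equals the `span_id` of the service-a span, which is what lets the backend rebuild the call chain.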

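1.3 What Is SkyWalking?
SkyWalking is an application performance monitoring (APM) and observability analysis platform for distributed systems. It provides distributed tracing, metrics monitoring, diagnostic information, service mesh telemetry analysis, alerting, and a visual UI, helping developers and operations teams understand and manage their applications and services.
Core features:
- Distributed tracing: SkyWalking generates trace data for requests, showing the full call chain so that performance bottlenecks and root causes can be located
- Metrics analysis: measures service health through key performance indicators (KPIs) such as response time, throughput, and success rate
- Alerting: supports custom alarm rules and sends notifications automatically when anomalies are detected
- Rich UI: an intuitive Web UI for viewing traces, metrics, and the service topology
- Low intrusiveness: code-level monitoring is implemented through bytecode injection, so services can be connected without changing business logic
- Multi-language support: besides Java, it supports .NET Core, Node.js, Python, Go, and other languages
- Platform integration: integrates with service meshes and Kubernetes
1.4 SkyWalking Architecture

1.5 Core SkyWalking Terms
- Service: one application, or a group of applications, that provides the same function or business logic. It can be a microservice, a web service, a database, or another kind of backend service
- Instance: one concrete running copy of a service. In a distributed environment the same service may be deployed on several servers or containers; the service running on each of them is an Instance
- Endpoint: a specific path or interface of a service that can be called from outside; it is the entry point through which the service exposes its functionality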
2. Installing the SkyWalking Cluster
2.1 Cluster Plan
Hostname | Physical IP | OS | Resources | Role |
---|---|---|---|---|
k8s-master01 | 192.168.200.50 | Rocky 9.4 | 4 CPU cores, 8 GB RAM | Master node |
k8s-node01 | 192.168.200.51 | Rocky 9.4 | 4 CPU cores, 8 GB RAM | Node01 |
k8s-node02 | 192.168.200.52 | Rocky 9.4 | 4 CPU cores, 8 GB RAM | Node02 |
2.2 Installing the SkyWalking Cluster
# Add the SkyWalking Helm repository
[root@k8s-master01 ~]# export REPO=skywalking
[root@k8s-master01 ~]# helm repo add ${REPO} https://apache.jfrog.io/artifactory/skywalking-helm
# Download the SkyWalking chart
[root@k8s-master01 ~]# helm pull skywalking/skywalking
# Unpack the chart archive:
[root@k8s-master01 ~]# tar xf skywalking-4.3.0.tgz
[root@k8s-master01 ~]# cd skywalking
[root@k8s-master01 skywalking]# vim values.yaml
[root@k8s-master01 skywalking]# cat values.yaml
# Elasticsearch settings:
elasticsearch:
  antiAffinity: soft
  clusterHealthCheckParams: wait_for_status=green&timeout=10s
  clusterName: es-cluster
  config:
    host: elasticsearch
    password: admin
    port:
      http: 9200
    user: admin
  enabled: true
  esMajorVersion: "7"
  image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/elasticsearch
  imagePullPolicy: IfNotPresent
  imageTag: 7.5.1
  persistence:
    annotations: {}
    enabled: true
  replicas: 3
  resources:
    limits:
      cpu: 2000m
      memory: 3Gi
    requests:
      cpu: 1000m
      memory: 2Gi
  volumeClaimTemplate:
    storageClassName: nfs-csi
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 30Gi
initContainer:
  image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/busybox
  tag: "1.30"
# OAP image and resource settings:
oap:
  image:
    pullPolicy: IfNotPresent
    repository: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/skywalking-oap-server
    tag: 10.2.0
  javaOpts: -Xmx2g -Xms2g
  replicas: 3
  resources:
    limits:
      cpu: 2000m
      memory: 3Gi
    requests:
      cpu: 1000m
      memory: 2Gi
  storageType: elasticsearch
# UI settings:
ui:
  image:
    pullPolicy: IfNotPresent
    repository: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/skywalking-ui
    tag: 10.2.0
  replicas: 3
  service:
    annotations: {}
    externalPort: 80
    internalPort: 8080
    type: NodePort
# Edit the OAP probe settings in the Deployment template (lines 91-100 after the edit):
[root@k8s-master01 skywalking]# vim templates/oap-deployment.yaml
[root@k8s-master01 skywalking]# sed -n "91,100p" templates/oap-deployment.yaml
        livenessProbe:
          tcpSocket:
            port: 12800
          initialDelaySeconds: 300
          periodSeconds: 20
        readinessProbe:
          tcpSocket:
            port: 12800
          initialDelaySeconds: 300
          periodSeconds: 20
# Remove the conflicting resource templates
[root@k8s-master01 skywalking]# rm -rf charts/elasticsearch/templates/pod*
# Install:
[root@k8s-master01 skywalking]# helm install skywalking -n skywalking . --create-namespace
# Check the installation status:
[root@k8s-master01 skywalking]# kubectl get po -n skywalking
NAME READY STATUS RESTARTS AGE
es-cluster-master-0 1/1 Running 0 13m
es-cluster-master-1 1/1 Running 0 13m
es-cluster-master-2 1/1 Running 0 13m
skywalking-es-init-mkvw7 1/1 Running 0 13m
skywalking-oap-6d8f594b7c-7w785 1/1 Running 0 13m
skywalking-oap-6d8f594b7c-p4z64 1/1 Running 0 13m
skywalking-oap-6d8f594b7c-vnp8t 1/1 Running 0 13m
skywalking-ui-774674cc7-qcm79 1/1 Running 0 13m
skywalking-ui-774674cc7-qhgg8 1/1 Running 0 13m
skywalking-ui-774674cc7-qwkjm 1/1 Running 0 13m
# Check the Service
[root@k8s-master01 skywalking]# kubectl get svc skywalking-ui -n skywalking
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
skywalking-ui NodePort 10.108.110.98 <none> 80:31319/TCP 14m
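Once the Pods are Running, a quick reachability check can save a trip to the browser. The sketch below performs the same kind of plain TCP connect that the tcpSocket probes use. The node IP and NodePort in the usage line come from the cluster plan and the Service output shown here; they are assumptions about your environment and need adjusting for other clusters.

```python
import socket


def tcp_ready(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed values: a node IP from the cluster plan and the UI NodePort.
    print("skywalking-ui reachable:", tcp_ready("192.168.200.50", 31319))
```

A successful connect only shows that the port is open; it does not prove that the UI can reach the OAP backend.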

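2.3 Connecting a Java Service to SkyWalking
For Java:
- JAVA_TOOL_OPTIONS: JVM startup options; the agent can be loaded through this variable, for example -javaagent:/skywalking/agent/skywalking-agent.jar
- SW_AGENT_NAME: the service name. The suggested format is <group>::<logical name>; namespace::service-name is recommended
- SW_AGENT_INSTANCE_NAME: the instance name, used to tell apart different instances of the same service. The default is UUID@hostname; using the Pod name is recommended
- SW_AGENT_COLLECTOR_BACKEND_SERVICES: address of the SkyWalking OAP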
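The naming convention for SW_AGENT_NAME can be expressed as a tiny helper. The function names below are hypothetical and exist only to make the `<group>::<logical name>` format concrete; in the manifests the same string is assembled by Kubernetes from the Pod's namespace and `app` label.

```python
from typing import Tuple


def sw_service_name(namespace: str, app: str) -> str:
    """Compose a service name in the '<group>::<logical name>' format."""
    if not namespace or not app:
        raise ValueError("namespace and app must both be non-empty")
    return f"{namespace}::{app}"


def split_service_name(name: str) -> Tuple[str, str]:
    """Split a service name into (group, logical name); group is '' if absent."""
    group, sep, logical = name.partition("::")
    if not sep:
        return "", name
    return group, logical


print(sw_service_name("demo", "demo-handler"))   # demo::demo-handler
print(split_service_name("demo::demo-handler"))  # ('demo', 'demo-handler')
```

Using the namespace as the group keeps services with the same name in different namespaces apart in the UI.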
[root@k8s-master01 skywalking]# mkdir demo/
[root@k8s-master01 skywalking]# cd demo/
[root@k8s-master01 demo]# vim demo-handler-deploy-sw.yaml
[root@k8s-master01 demo]# cat demo-handler-deploy-sw.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: demo-handler
  name: demo-handler
  namespace: demo
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: demo-handler
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: demo-handler
    spec:
      volumes: # Added: shared volume plus an init container that supplies the agent
      - name: skywalking-agent
        emptyDir: {}
      initContainers:
      - name: agent-container
        image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/skywalking-java-agent:9.4.0-java8
        volumeMounts:
        - name: skywalking-agent
          mountPath: /agent
        command: [ "/bin/sh" ]
        args: [ "-c", "cp -R /skywalking/agent /agent/ ; mkdir -p /agent/agent/logs/ ; chown -R 1001.1001 /agent" ]
      containers:
      - env:
        - name: SPRING_PROFILES_ACTIVE
          value: k8supgrade
        - name: SERVER_PORT
          value: "8080"
        - name: JAVA_TOOL_OPTIONS # Added: environment variables for the agent
          value: "-javaagent:/skywalking/agent/skywalking-agent.jar"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: APP
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app']
        - name: SW_AGENT_NAME
          value: "$(NAMESPACE)::$(APP)"
        - name: SW_AGENT_INSTANCE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
          value: skywalking-oap.skywalking:11800
        image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/demo-handler:v1-upgrade
        imagePullPolicy: IfNotPresent
        volumeMounts: # Added: mount the agent volume
        - name: skywalking-agent
          mountPath: /skywalking
        livenessProbe:
          failureThreshold: 2
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2
        name: demo-handler
        readinessProbe:
          failureThreshold: 2
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
# Next, create the service and test it:
[root@k8s-master01 demo]# kubectl create namespace demo
[root@k8s-master01 demo]# kubectl create -f demo-handler-deploy-sw.yaml -n demo
# Check the Pod
[root@k8s-master01 demo]# kubectl get po -n demo -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
demo-handler-5b6f9dd9c7-88pr6 1/1 Running 0 77s 172.16.58.233 k8s-node02 <none> <none>
# Test access (repeat the request a few times)
[root@k8s-master01 demo]# curl 172.16.58.233:8080/api/generate
O4E,\1L!u-bzTE[7Fn#VCS+eK?fwcp|k
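Since the dashboards only fill up after several requests, the repeated curl call can be replaced by a small loop. This is a convenience sketch using only the Python standard library; the function name and its defaults are assumptions, and the URL must point at your own Pod IP.

```python
import time
import urllib.error
import urllib.request
from typing import Dict


def generate_traffic(url: str, count: int = 20, pause_s: float = 0.2,
                     timeout_s: float = 5.0) -> Dict[str, int]:
    """Send `count` GET requests to `url` and count successes and failures."""
    results = {"ok": 0, "failed": 0}
    for _ in range(count):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                resp.read()
                results["ok"] += 1
        except (urllib.error.URLError, OSError):
            # HTTP 4xx/5xx responses and connection errors both land here.
            results["failed"] += 1
        time.sleep(pause_s)
    return results
```

For example, `generate_traffic("http://<pod-ip>:8080/api/generate")` sends 20 requests, with `<pod-ip>` replaced by the IP from the `kubectl get po -owide` output.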
View the SkyWalking dashboards:

Topology map

2.4 Connecting a Go Service to SkyWalking
For Go:
- SW_AGENT_REPORTER_GRPC_BACKEND_SERVICE: address of the SkyWalking OAP
- SW_AGENT_NAME: the service name. The suggested format is <group>::<logical name>; namespace::service-name is recommended
- SW_AGENT_INSTANCE_NAME: the instance name, used to tell apart different instances of the same service. The default is UUID@hostname; using the Pod name is recommended
# Download the demo program (cloned into demo-order-master to match the paths below):
[root@habor ~]# git clone https://gitee.com/dukuan/demo-order.git demo-order-master
# Write the Dockerfile
[root@habor ~]# cd demo-order-master
[root@habor demo-order-master]# vim Dockerfile
[root@habor demo-order-master]# cat Dockerfile
FROM crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/skywalking-go:0.5.0-go1.22 AS builder
COPY ./ /go/src/
WORKDIR /go/src/
RUN export GO111MODULE=on && \
    export GOPROXY=https://goproxy.cn,direct && \
    skywalking-go-agent -inject /go/src && \
    go build -o ./order -toolexec="skywalking-go-agent" -a /go/src
FROM crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/alpine:3.20
COPY --from=builder /go/src/order .
CMD [ "./order" ]
# Build the image
[root@habor demo-order-master]# docker build -t crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/demo-order:v1 .
# Push the image to the registry
[root@habor demo-order-master]# docker push crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/demo-order:v1
# Prepare the MySQL manifests that the order service depends on:
[root@k8s-master01 demo]# vim mysql.yaml
[root@k8s-master01 demo]# cat mysql.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: mysql
  name: mysql
  namespace: demo
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mysql
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: mysql
    spec:
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: mysql-data
      containers:
      - env:
        - name: MYSQL_ROOT_PASSWORD
          value: password
        image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/mysql:8.0.20
        imagePullPolicy: IfNotPresent
        name: mysql
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
[root@k8s-master01 demo]# vim mysql-svc.yaml
[root@k8s-master01 demo]# cat mysql-svc.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: mysql
  name: mysql
  namespace: demo
spec:
  ports:
  - nodePort: 32541
    port: 3306
    protocol: TCP
    targetPort: 3306
  selector:
    app: mysql
  sessionAffinity: None
  type: NodePort
[root@k8s-master01 demo]# vim mysql-pvc.yaml
[root@k8s-master01 demo]# cat mysql-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
  namespace: demo
spec:
  resources:
    requests:
      storage: 5Gi
  volumeMode: Filesystem
  storageClassName: nfs-csi
  accessModes:
  - ReadWriteOnce
# Create the base component (MySQL):
[root@k8s-master01 demo]# kubectl create -f mysql.yaml -f mysql-svc.yaml -f mysql-pvc.yaml -n demo
# Check the Pod
[root@k8s-master01 demo]# kubectl get po -n demo
NAME READY STATUS RESTARTS AGE
....
mysql-6d698b4676-8hsn8 1/1 Running 0 3m22s
# Configure the database:
[root@k8s-master01 demo]# kubectl exec -it mysql-6d698b4676-8hsn8 -n demo -- bash
root@mysql-6d698b4676-8hsn8:/# mysql -uroot -ppassword
....
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> create database orders;
Query OK, 1 row affected (0.01 sec)
mysql> CREATE USER 'order'@'%' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.01 sec)
mysql> GRANT ALL ON orders.* TO 'order'@'%';
Query OK, 0 rows affected (0.02 sec)
# The Go agent is injected at compile time, so no agent options need to be passed at startup; only the related environment variables have to be set:
[root@k8s-master01 demo]# vim demo-order-deploy.yaml
[root@k8s-master01 demo]# cat demo-order-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: demo-order
  name: demo-order
  namespace: demo
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: demo-order
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: demo-order
    spec:
      containers:
      - env:
        - name: MYSQL_HOST
          value: mysql
        - name: MYSQL_PORT
          value: "3306"
        - name: MYSQL_USER
          value: order
        - name: MYSQL_PASSWORD
          value: password
        - name: MYSQL_DB
          value: orders
        # Added: variables read by the SkyWalking Go agent
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: APP
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app']
        - name: SW_AGENT_NAME
          value: "$(NAMESPACE)::$(APP)"
        #- name: SW_AGENT_NAME
        #  value: demo::demo-order
        - name: SW_AGENT_INSTANCE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: SW_AGENT_REPORTER_GRPC_BACKEND_SERVICE
          value: skywalking-oap.skywalking:11800
        image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/demo-order:v2 # use the tag you actually pushed (the build step above pushed v1)
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2
        name: demo-order
        readinessProbe:
          failureThreshold: 2
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
# Next, create the service and test it:
[root@k8s-master01 demo]# kubectl create -f demo-order-deploy.yaml -n demo
# Check the Pod
[root@k8s-master01 demo]# kubectl get po -n demo -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
demo-order-755cdc96-ltlzg 1/1 Running 0 65s 172.16.58.239 k8s-node02 <none> <none>
# Test access (repeat the request a few times)
[root@k8s-master01 demo]# curl 172.16.58.239:8080/orders
[{"id":1,"name":"Order 1","price":10},{"id":2,"name":"Order 2","price":20}]
View the SkyWalking dashboards:



The database is detected automatically

2.5 Cleaning Up the Environment
[root@k8s-master01 demo]# kubectl delete deploy -n demo --all
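3. Distributed Tracing Project Exercise
In the sections above, SkyWalking received trace data from both a Go and a Java service. Next, a complete project is used to reinforce what has been covered.
Project architecture:

3.1 Deploying the Services
3.1.1 Deploying the Database (reusing the configuration from the previous lab)
# Deploy the database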
[root@k8s-master01 demo]# kubectl create -f mysql.yaml -f mysql-svc.yaml -f
[root@k8s-master01 demo]# kubectl get po -n demo
NAME READY STATUS RESTARTS AGE
mysql-6d698b4676-sk8hj 1/1 Running 0 17s
# Create the database and the account
[root@k8s-master01 demo]# kubectl exec -it mysql-6d698b4676-sk8hj -n demo -- bash
root@mysql-6d698b4676-sk8hj:/# mysql -uroot -ppassword
....
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> create database orders;
Query OK, 1 row affected (0.04 sec)
mysql> CREATE USER 'order'@'%' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.02 sec)
mysql> GRANT ALL ON orders.* TO 'order'@'%';
Query OK, 0 rows affected (0.01 sec)
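3.1.2 Starting the order Service
# Start the order service. It is a Go program, so monitoring data is reported without any extra configuration:
# Reuse the Deployment from the previous lab and create a Service for it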
[root@k8s-master01 demo]# vim demo-order-svc.yaml
[root@k8s-master01 demo]# cat demo-order-svc.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: order
  name: order
  namespace: demo
spec:
  ports:
  - name: http-web
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: demo-order
  sessionAffinity: None
  type: ClusterIP
# Configure an external hostname
[root@k8s-master01 demo]# vim demo-order-ingress.yaml
[root@k8s-master01 demo]# cat demo-order-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-order
  namespace: demo
spec:
  ingressClassName: nginx
  rules:
  - host: demo.test.com
    http:
      paths:
      - backend:
          service:
            name: order
            port:
              number: 80
        path: /orders
        pathType: ImplementationSpecific
# Create the resources
[root@k8s-master01 demo]# kubectl create -f demo-order-deploy.yaml -f demo-order-svc.yaml -f demo-order-ingress.yaml -n demo
# Check the service status:
[root@k8s-master01 demo]# kubectl get pod -n demo -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
demo-order-755cdc96-8qlc9 1/1 Running 0 2m54s 172.16.58.245 k8s-node02 <none> <none>
mysql-6d698b4676-sk8hj 1/1 Running 0 111m 172.16.58.241 k8s-node02 <none> <none>
[root@k8s-master01 demo]# kubectl get svc,ingress -n demo
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mysql NodePort 10.111.54.12 <none> 3306:32541/TCP 111m
service/order ClusterIP 10.101.166.166 <none> 80/TCP 3m1s
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/demo-order nginx demo.test.com 192.168.200.52 80 3m1s
# Test access:
[root@k8s-master01 demo]# echo "192.168.200.52 demo.test.com" >> /etc/hosts
[root@k8s-master01 demo]# curl demo.test.com/orders
[{"id":1,"name":"Order 1","price":10},{"id":2,"name":"Order 2","price":20},{"id":3,"name":"Order 1","price":10},{"id":4,"name":"Order 2","price":20}]
3.1.3 Deploying the handler Service (reusing the configuration from the previous lab)
# Deploy the handler service
[root@k8s-master01 demo]# kubectl create -f demo-handler-deploy-sw.yaml -f demo-handler-svc.yaml -n demo
3.1.4 Deploying the receive Service
[root@k8s-master01 demo]# vim demo-receive-deploy.yaml
[root@k8s-master01 demo]# cat demo-receive-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: demo-receive
  name: demo-receive
  namespace: demo
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: demo-receive
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: demo-receive
    spec:
      volumes:
      - name: skywalking-agent
        emptyDir: {}
      initContainers:
      - name: agent-container
        image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/skywalking-java-agent:9.4.0-java8
        volumeMounts:
        - name: skywalking-agent
          mountPath: /agent
        command: [ "/bin/sh" ]
        args: [ "-c", "cp -R /skywalking/agent /agent/ ; mkdir -p /agent/agent/logs/ ; chown -R 1001.1001 /agent" ]
      containers:
      - env:
        - name: SPRING_PROFILES_ACTIVE
          value: k8supgrade
        - name: SERVER_PORT
          value: "8080"
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/skywalking/agent/skywalking-agent.jar"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: APP
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app']
        - name: SW_AGENT_NAME
          value: "$(NAMESPACE)::$(APP)"
        - name: SW_AGENT_INSTANCE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
          value: skywalking-oap.skywalking:11800
        volumeMounts:
        - name: skywalking-agent
          mountPath: /skywalking
        image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/demo-receive:v1-upgrade
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2
        name: demo-receive
        readinessProbe:
          failureThreshold: 2
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
[root@k8s-master01 demo]# vim demo-receive-svc.yaml
[root@k8s-master01 demo]# cat demo-receive-svc.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: demo-receive
  name: demo-receive
  namespace: demo
spec:
  ports:
  - name: http-web
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: demo-receive
  sessionAffinity: None
  type: ClusterIP
[root@k8s-master01 demo]# vim demo-receive-ingress.yaml
[root@k8s-master01 demo]# cat demo-receive-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
  name: demo-receive
  namespace: demo
spec:
  ingressClassName: nginx
  rules:
  - host: demo.test.com
    http:
      paths:
      - backend:
          service:
            name: demo-receive
            port:
              number: 8080
        path: /receiveapi(/|$)(.*)
        pathType: ImplementationSpecific
# Deploy the receive service:
[root@k8s-master01 demo]# kubectl create -f demo-receive-deploy.yaml -f demo-receive-svc.yaml -f demo-receive-ingress.yaml -n demo
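The receive Ingress combines a regular-expression path with the rewrite-target annotation `/$2`, which strips the `/receiveapi` prefix before the request reaches the backend. The effect can be reproduced with Python's `re` module. This is only an illustration of the pattern, assuming it is matched from the start of the path; the controller does the actual rewriting, and the sample paths are made up.

```python
import re
from typing import Optional

INGRESS_PATH = r"/receiveapi(/|$)(.*)"  # path from demo-receive-ingress.yaml
REWRITE_TARGET = r"/\2"                 # Python spelling of the annotation "/$2"


def rewrite(path: str) -> Optional[str]:
    """Return the path the backend would see, or None if the rule does not match."""
    m = re.match(INGRESS_PATH, path)
    if m is None:
        return None
    return m.expand(REWRITE_TARGET)


for p in ["/receiveapi/api/generate", "/receiveapi", "/receiveapix", "/orders"]:
    print(p, "->", rewrite(p))
```

The `(/|$)` group is what keeps a path such as `/receiveapix` from matching, so only `/receiveapi` itself and paths below it are routed to the receive service.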
3.1.5 Deploying the Frontend Service
[root@k8s-master01 demo]# vim demo-ui-deploy.yaml
[root@k8s-master01 demo]# cat demo-ui-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: demo-ui
  name: demo-ui
  namespace: demo
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: demo-ui
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: demo-ui
    spec:
      containers:
      - image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/demo-ui:sw
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 2
        name: demo-ui
        readinessProbe:
          failureThreshold: 2
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 2
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
[root@k8s-master01 demo]# vim demo-ui-svc.yaml
[root@k8s-master01 demo]# cat demo-ui-svc.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: demo-ui
  name: demo-ui
  namespace: demo
spec:
  ports:
  - name: http-web
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: demo-ui
  sessionAffinity: None
  type: ClusterIP
[root@k8s-master01 demo]# vim demo-ui-ingress.yaml
[root@k8s-master01 demo]# cat demo-ui-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ui
  namespace: demo
spec:
  ingressClassName: nginx
  rules:
  - host: demo.test.com
    http:
      paths:
      - backend:
          service:
            name: demo-ui
            port:
              number: 80
        path: /
        pathType: ImplementationSpecific
# Deploy the frontend service:
[root@k8s-master01 demo]# kubectl create -f demo-ui-deploy.yaml -f demo-ui-svc.yaml -f demo-ui-ingress.yaml -n demo
# After deployment, the final set of resources looks like this:
[root@k8s-master01 demo]# kubectl get po,svc,ingress -n demo
NAME READY STATUS RESTARTS AGE
pod/demo-handler-5b6f9dd9c7-g4k5s 1/1 Running 1 (25m ago) 26m
pod/demo-order-755cdc96-8qlc9 1/1 Running 0 47m
pod/demo-receive-5cf555cdfd-j5g76 1/1 Running 1 (14m ago) 16m
pod/demo-ui-66bb5f4d67-smbpb 1/1 Running 0 83s
pod/mysql-6d698b4676-sk8hj 1/1 Running 0 155m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/demo-receive ClusterIP 10.103.251.213 <none> 8080/TCP 16m
service/demo-ui ClusterIP 10.106.49.125 <none> 80/TCP 83s
service/handler ClusterIP 10.102.43.148 <none> 80/TCP 26m
service/mysql NodePort 10.111.54.12 <none> 3306:32541/TCP 155m
service/order ClusterIP 10.101.166.166 <none> 80/TCP 47m
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/demo-order nginx demo.test.com 192.168.200.52 80 47m
ingress.networking.k8s.io/demo-receive nginx demo.test.com 192.168.200.52 80 16m
ingress.networking.k8s.io/demo-ui nginx demo.test.com 192.168.200.52 80 83s
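Next, open the site in a browser:

3.2 Accessing and Monitoring the Services
Open the page and try generating a password and creating an order:

The architecture of the whole project then appears in the topology view:

Creating an order includes a random delay; the delay is visible in the trace information in SkyWalking:

3.3 Simulating a Failure
# Simulate a failure by scaling the handler service and the database down to zero:
[root@k8s-master01 demo]# kubectl scale deploy demo-handler mysql --replicas=0 -n demo
Access the page again and the failing call chain is collected: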

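4. SkyWalking Alarms
4.1 SkyWalking Alarm Notifications
SkyWalking can monitor and alert on the Metrics data it collects, so anomalies can be acted on promptly. With well-chosen alarm rules and hooks, potential problems can be caught early and located quickly.
The alarm core is driven by a set of rules and has three main parts:
- Metrics: performance metrics that SkyWalking collects for services, instances, and endpoints
- Rules: the conditions that trigger an alarm. They are defined by default in config/alarm-settings.yml and support comparison and logical operators
- Hooks: actions executed after an alarm is triggered, such as sending a notification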
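4.2 SkyWalking Alarm Rules
A SkyWalking alarm rule consists of the following elements:
- Rule name: globally unique; must end with _rule
- expression: written in MQE (Metrics Query Expression). The result type must be SINGLE_VALUE, and the root operation must be a comparison or boolean operation that yields 1 (true) or 0 (false). The alarm is triggered when the result is 1 (true)
- include-names: entity names to include, as a list; entities can be Services, Instances, Endpoints, and so on
- exclude-names: entity names to exclude
- include-names-regex: regular expression for names to include
- exclude-names-regex: regular expression for names to exclude
- tags: extra labels attached to the alarm, for example level=warning
- period: the size, in minutes, of the time window used to check the alarm condition
- silence-period: after an alarm is triggered, it is not triggered again for this period; if not set, it is the same as period
- hooks: names of the hooks bound to the rule when it triggers. The name format is {hookType}.{hookName} (for example slack.custom1), and each hook must be defined in the hooks section of alarm-settings.yml. If no hook name is given, the global hook is used
- message: the alarm message, used to describe the current alarm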
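The interaction between a rule firing and its silence-period can be modelled in a few lines. This is a simplified sketch, not the OAP implementation: it assumes one evaluation per minute and treats the silence window as covering the minutes from TN through TN + silence-period. The function name is hypothetical.

```python
from typing import List


def fire_times(results: List[bool], silence_period: int) -> List[int]:
    """Return the minutes at which a notification is sent.

    results[i] is the rule's expression result at minute i (True = 1).
    After firing at minute T, further hits are suppressed through
    minute T + silence_period.
    """
    fired = []
    silent_until = -1
    for minute, triggered in enumerate(results):
        if triggered and minute > silent_until:
            fired.append(minute)
            silent_until = minute + silence_period
    return fired


# The condition holds for ten consecutive minutes, silence-period is 3.
print(fire_times([True] * 10, silence_period=3))  # [0, 4, 8]
```

Without a silence period, a condition that stays true would notify on every evaluation, which is why the default falls back to the rule's period.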
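4.3 Configuring a DingTalk Alarm Robot
To use DingTalk alarms, first create a group chat, then add a robot to it:

Add a robot
Choose "Custom"

Enter a robot name and copy the secret
Finish adding the robot and copy the Webhook

4.4 Connecting SkyWalking to DingTalk Alarms
First, place the SkyWalking alarm configuration file under the SkyWalking chart directory:
# Create the directory for the alarm configuration
[root@k8s-master01 demo]# mkdir -p ../files/conf.d/oap
[root@k8s-master01 demo]# cd ../files/conf.d/oap
# Copy the alarm template file out of an OAP container
[root@k8s-master01 oap]# kubectl cp skywalking-oap-6d8f594b7c-xrnbr:/skywalking/config/alarm-settings.yml ./alarm-settings.yml -n skywalking
# Add the DingTalk hook
[root@k8s-master01 oap]# vim alarm-settings.yml
[root@k8s-master01 oap]# tail -14 alarm-settings.yml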
hooks:
  dingtalk:
    default:
      is-default: true
      text-template: |-
        {
          "msgtype": "text",
          "text": {
            "content": "Apache SkyWalking Alarm: \n %s."
          }
        }
      webhooks:
        - url: https://oapi.dingtalk.com/robot/send?access_token=c7cd207fd31cd72f433d67effda0568b681b10f626f97c02cb55f03b73b651c5
          secret: SECedef18728aa48ea6ca4c2f595967f6c389e2fc4d13bfca2741087b8c8878e017
# Apply the configuration (run from the SkyWalking chart root directory)
[root@k8s-master01 oap]# cd ../../..
[root@k8s-master01 skywalking]# helm upgrade skywalking . -n skywalking
# Check the Pod rollout status:
[root@k8s-master01 skywalking]# kubectl get po -n skywalking | grep oap
skywalking-oap-5644bbbd46-hvvxx 1/1 Running 0 11m
# Check that the configuration file has been updated:
[root@k8s-master01 skywalking]# kubectl exec skywalking-oap-5644bbbd46-hvvxx -n skywalking -- tail -14 config/alarm-settings.yml
Defaulted container "oap" out of: oap, wait-for-elasticsearch (init)
hooks:
  dingtalk:
    default:
      is-default: true
      text-template: |-
        {
          "msgtype": "text",
          "text": {
            "content": "Apache SkyWalking Alarm: \n %s."
          }
        }
      webhooks:
        - url: https://oapi.dingtalk.com/robot/send?access_token=c7cd207fd31cd72f433d67effda0568b681b10f626f97c02cb55f03b73b651c5
          secret: SECedef18728aa48ea6ca4c2f595967f6c389e2fc4d13bfca2741087b8c8878e017
Send requests to the services to trigger an alarm:

After a short wait, the alarm message shows up in DingTalk
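If no message arrives, it can help to test the robot independently of SkyWalking. A robot created with a secret expects each request URL to carry a timestamp and an HMAC-SHA256 signature. The sketch below builds such a URL and a text payload without sending anything; the function names are made up for this example, and the webhook and secret in the usage note are placeholders, not real credentials.

```python
import base64
import hashlib
import hmac
import json
import time
import urllib.parse
from typing import Optional


def signed_webhook_url(webhook_url: str, secret: str,
                       timestamp_ms: Optional[int] = None) -> str:
    """Append the timestamp and signature parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


def text_payload(alarm_message: str) -> bytes:
    """Build a JSON body shaped like the text-template in alarm-settings.yml."""
    body = {
        "msgtype": "text",
        "text": {"content": f"Apache SkyWalking Alarm: \n {alarm_message}."},
    }
    return json.dumps(body).encode("utf-8")
```

Calling `signed_webhook_url("https://oapi.dingtalk.com/robot/send?access_token=<token>", "<secret>")` yields a URL that can be used with an HTTP POST of `text_payload("test")` and the header `Content-Type: application/json`.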

4.5 Custom Alarm Rules
Besides the default alarms, custom rules can be added. For example, to monitor whether JVM threads of a Java service are blocked, use the instance_jvm_thread_blocked_state_thread_count metric.
# Example: alarm when the number of blocked JVM threads is greater than 5 in at least 2 of the last 5 minutes:
[root@k8s-master01 oap]# vim alarm-settings.yml
[root@k8s-master01 oap]# cat alarm-settings.yml
....
rules:
  thread_block_rule:
    expression: sum(instance_jvm_thread_blocked_state_thread_count > 5) >= 2
    period: 5 # evaluate the last 5 minutes of data
    message: "Blocked JVM threads of {name} exceeded 5 in at least 2 of the last 5 minutes"
....
# After changing the configuration file, apply it:
[root@k8s-master01 skywalking]# helm upgrade skywalking -n skywalking .
[root@k8s-master01 skywalking]# kubectl rollout restart deploy skywalking-oap -n skywalking
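The expression in thread_block_rule reads as follows: the inner comparison turns each per-minute value in the period into 1 or 0, sum counts the minutes that matched, and the outer comparison decides whether the alarm fires. A plain Python model of that reading, with a hypothetical function name and made-up sample values:

```python
from typing import List


def thread_block_rule(window: List[int], threshold: int = 5, min_hits: int = 2) -> bool:
    """Model of: sum(metric > threshold) >= min_hits over one period window.

    `window` holds one metric value per minute of the period.
    """
    hits = sum(1 for value in window if value > threshold)
    return hits >= min_hits


# Five minutes of blocked-thread counts (period: 5).
print(thread_block_rule([0, 6, 2, 7, 1]))  # True: two minutes exceeded 5
print(thread_block_rule([0, 6, 2, 3, 1]))  # False: only one minute exceeded 5
```

Requiring several matching minutes instead of one makes the rule less sensitive to a single short spike.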