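Spring Cloud: SkyWalking (Part 5)

11. Alarm Mechanism

Besides the trace views, performance profiling, and log management covered in earlier parts, SkyWalking also provides alarming. When an application misbehaves, it can send notifications automatically, so developers can catch problems early.

11.1 Alarm Rules

SkyWalking ships with a set of predefined rules in config/alarm-settings.yml, where we can add or modify rules.

The default rules are: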

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    # A MQE expression, the result type must be `SINGLE_VALUE` and the root operation of the expression must be a Compare Operation
    # which provides `1`(true) or `0`(false) result. When the result is `1`(true), the alarm will be triggered.
    expression: sum(service_resp_time > 1000) >= 3
    period: 10
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
#  service_resp_time_rule:
#    expression: avg(service_resp_time) > 1000
#    period: 10
#    silence-period: 5
#    message: Avg response time of service {name} is more than 1000ms in last 10 minutes.
  service_sla_rule:
    expression: sum(service_sla < 8000) >= 2
    # The length of time to evaluate the metrics
    period: 10
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    expression: sum(service_percentile{p='50,75,90,95,99'} > 1000) >= 3
    period: 10
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    expression: sum(service_instance_resp_time > 1000) >= 2
    period: 10
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    expression: sum(database_access_resp_time > 1000) >= 2
    period: 10
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    expression: sum(endpoint_relation_resp_time > 1000) >= 2
    period: 10
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_resp_time_rule:
#    expression: sum(endpoint_resp_time > 1000) >= 2
#    period: 10
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

#hooks:
#  webhook:
#    default:
#      is-default: true
#      urls:
#        - http://127.0.0.1/notify/
#        - http://127.0.0.1/go-wechat/

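Field reference:

*_rule: the rule name; it must be unique and end with _rule
expression: the alarm expression, written in MQE (Metrics Query Expression); the alarm fires when it evaluates to 1 (true)
period: the length, in minutes, of the time window over which the metrics are evaluated
silence-period: how long the rule stays silent after it fires, so the same alarm is not sent repeatedly; defaults to the value of period
message: the notification text; {name} is replaced with the name of the entity that triggered the alarm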
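
To make silence-period concrete, here is a minimal Java sketch of the idea. It is a simplified model, not SkyWalking's actual implementation: the class name SilenceGate is hypothetical, and it assumes the rule is checked once per minute.

```java
/** Simplified model of `silence-period`; NOT SkyWalking's real implementation. */
final class SilenceGate {
    private final int silencePeriod;   // number of checks to stay quiet after firing
    private int remainingSilence = 0;  // checks left before the rule may fire again

    SilenceGate(int silencePeriod) {
        this.silencePeriod = silencePeriod;
    }

    /**
     * Called once per check (assumption: one check per minute).
     *
     * @param conditionMet whether the rule expression evaluated to 1 on this check
     * @return true if a notification should be sent for this check
     */
    boolean onCheck(boolean conditionMet) {
        if (remainingSilence > 0) {
            remainingSilence--;        // still inside the silence window: suppress
            return false;
        }
        if (conditionMet) {
            remainingSilence = silencePeriod; // fire now, then go quiet
            return true;
        }
        return false;
    }
}
```

With new SilenceGate(3), a condition that stays true fires on the first check, is suppressed on the next three, and fires again on the fifth.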

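The default rules above mean:

  • Service response time exceeded 1 s in at least 3 of the last 10 minutes
  • Service success rate was below 80% in at least 2 of the last 10 minutes (service_sla is a percentage multiplied by 100, so 8000 means 80%)
  • One or more of the p50, p75, p90, p95, and p99 service response time percentiles exceeded 1 s in at least 3 of the last 10 minutes
  • Service instance response time exceeded 1 s in at least 2 of the last 10 minutes
  • Database access response time exceeded 1 s in at least 2 of the last 10 minutes
  • Endpoint relation response time exceeded 1 s in at least 2 of the last 10 minutes
  • Endpoint response time exceeded 1 s in at least 2 of the last 10 minutes (commented out by default, because endpoint rules cost more memory)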
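
The rule messages in the YAML speak of "3 minutes of last 10 minutes". The Java sketch below shows roughly what an expression such as sum(service_resp_time > 1000) >= 3 computes, assuming one metric value per minute in the window. WindowRule is a hypothetical helper written for illustration, not part of SkyWalking.

```java
import java.util.List;
import java.util.function.LongPredicate;

/** Illustrates `sum(metric <compare> threshold) >= minHits` over per-minute values. */
final class WindowRule {
    /**
     * @param perMinuteValues one metric value per minute of the period window
     * @param condition       the inner comparison, e.g. v -> v > 1000
     * @param minHits         how many minutes must satisfy the comparison
     */
    static boolean triggers(List<Long> perMinuteValues, LongPredicate condition, int minHits) {
        long hits = perMinuteValues.stream()
                .mapToLong(Long::longValue)
                .filter(condition)  // the comparison yields 1 (kept) or 0 (dropped) per minute
                .count();           // sum(...) adds those 1s up
        return hits >= minHits;     // the root comparison decides whether the alarm fires
    }
}
```

For example, WindowRule.triggers(values, v -> v > 1000, 3) models service_resp_time_rule, and WindowRule.triggers(values, v -> v < 8000, 2) models service_sla_rule.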

11.2 WebHook

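By default, alarm messages are only shown in the SkyWalking UI, but nobody can watch that page all the time, so SkyWalking needs to notify us actively.

11.2.1 About webhooks

A webhook is a mechanism that lets an application push events or data to external systems in real time. It is usually implemented as an HTTP callback, which enables automated message passing across systems.

Key characteristics:

  • Event-driven: an HTTP request (usually a POST) is sent to the target URL only when a predefined condition is met
  • Lightweight: the receiver only has to expose a reachable HTTP endpoint that accepts the data; it does not need to poll
  • Flexible: suits alarm notification, workflow triggering, data synchronization, and similar scenarios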
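
On the receiving side, a webhook endpoint can be very small. The sketch below uses the JDK's built-in com.sun.net.httpserver.HttpServer to accept POST requests. The class name WebhookReceiver, the /notify/ path, and the callback parameter are assumptions made for illustration; a real service would more likely expose a Spring MVC controller.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Minimal webhook receiver: accepts POSTs on /notify/ and passes the raw body to a callback. */
final class WebhookReceiver {
    static HttpServer start(int port, Consumer<String> onEvent) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/notify/", exchange -> {
            try {
                if (!"POST".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // only POST is accepted
                    return;
                }
                String body = new String(
                        exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                onEvent.accept(body);                      // hand the event to the application
                exchange.sendResponseHeaders(204, -1);     // acknowledge with an empty response
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }
}
```

Calling WebhookReceiver.start(8080, System.out::println) would print every body posted to http://127.0.0.1:8080/notify/; the sender never has to be polled.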

11.2.2 Webhooks in SkyWalking

SkyWalking supports webhook notification. The configuration ships commented out in alarm-settings.yml:

#hooks:
#  webhook:
#    default:
#      is-default: true
#      urls:
#        - http://127.0.0.1/notify/
#        - http://127.0.0.1/go-wechat/

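For details, see the related material on the official SkyWalking site.

When an alarm fires, SkyWalking generates an alarm event, wraps the alarm information as JSON, and sends it with an HTTP POST to the configured URLs.

About this JSON:

The format is defined by List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage> (see the corresponding class in the official GitHub repository). Its fields are:

  • scopeId, scope: the scope of the alarmed entity; the scopes are defined in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine in the repository
  • name: the name of the alarmed entity, such as a service name or an endpoint name
  • id0: the ID of the entity that matches name; for relation scopes, it is the ID of the source entity
  • id1: for relation scopes, the ID of the destination entity; otherwise empty
  • ruleName: the name of the alarm rule that fired
  • alarmMessage: the text of the alarm message
  • startTime: the time the alarm fired, as a Unix timestamp in milliseconds
  • tags: a list of tags carrying additional information about the alarm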
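
As a rough illustration of how a receiver might use these fields once the JSON has been parsed (parsing is left to whatever JSON library the project already uses), here is a hypothetical Java record holding a subset of them. AlarmPayload and toNotification are illustrative names, not SkyWalking classes, and records require Java 16 or later.

```java
import java.time.Instant;

/** Hypothetical holder for a subset of the webhook JSON fields; not a SkyWalking class. */
record AlarmPayload(String scope, String name, String ruleName,
                    String alarmMessage, long startTime) {

    /** Renders a one-line notification; startTime is interpreted as epoch milliseconds. */
    String toNotification() {
        return String.format("[%s] %s %s (%s): %s",
                Instant.ofEpochMilli(startTime), scope, name, ruleName, alarmMessage);
    }
}
```

Keeping the payload in a small immutable type like this makes it easy to format the same alarm differently for each downstream channel.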

To enable it, uncomment the webhook section in alarm-settings.yml:

hooks:
 webhook:
   default:
     is-default: true
     urls:
       - http://127.0.0.1/notify/
       - http://127.0.0.1/go-wechat/

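urls: the addresses that alarm messages are pushed to

Typical use cases:

  • Third-party integration: push alarms to collaboration tools such as WeCom (WeChat Work), Feishu, or email
  • Automated operations: trigger operations scripts (for example, automatic scale-out) or hook into an incident management system
  • Data aggregation and analysis: forward alarm events to a big data platform for statistical analysis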