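11. Alerting
Besides the trace visualization, performance profiling, and log management covered earlier, SkyWalking also provides alerting. When an application misbehaves, SkyWalking can send notifications automatically, which helps developers catch problems early.
11.1 Alarm rules
SkyWalking ships with a set of predefined rules in config/alarm-settings.yml, where we can add or modify rules.
The default rules are: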
yml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    # A MQE expression, the result type must be `SINGLE_VALUE` and the root operation of the expression must be a Compare Operation
    # which provides `1`(true) or `0`(false) result. When the result is `1`(true), the alarm will be triggered.
    expression: sum(service_resp_time > 1000) >= 3
    period: 10
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
#  service_resp_time_rule:
#    expression: avg(service_resp_time) > 1000
#    period: 10
#    silence-period: 5
#    message: Avg response time of service {name} is more than 1000ms in last 10 minutes.
  service_sla_rule:
    expression: sum(service_sla < 8000) >= 2
    # The length of time to evaluate the metrics
    period: 10
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    expression: sum(service_percentile{p='50,75,90,95,99'} > 1000) >= 3
    period: 10
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    expression: sum(service_instance_resp_time > 1000) >= 2
    period: 10
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    expression: sum(database_access_resp_time > 1000) >= 2
    period: 10
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    expression: sum(endpoint_relation_resp_time > 1000) >= 2
    period: 10
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
#  endpoint_resp_time_rule:
#    expression: sum(endpoint_resp_time > 1000) >= 2
#    period: 10
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
#hooks:
#  webhook:
#    default:
#      is-default: true
#      urls:
#        - http://127.0.0.1/notify/
#        - http://127.0.0.1/go-wechat/
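Field reference:
**_rule: the rule name; it must be unique and end with _rule
expression: the alarm expression (MQE); the alarm is triggered when it evaluates to 1 (true)
period: the length of the time window, in minutes, over which the metrics are evaluated
silence-period: how long the same alarm stays silent after it has been triggered; defaults to the value of period
message: the alarm message that is sent; {name} is replaced with the name of the alarm target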
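To make the interplay of expression, period, and silence-period concrete, here is a small simulation in Python. It is only a simplified model of a rule such as sum(service_resp_time > 1000) >= 3, not SkyWalking's actual MQE engine: it assumes one metric value per minute and one check per minute, and the function names rule_matches and simulate are made up for this sketch.

```python
from collections import deque


def rule_matches(window, threshold=1000, min_hits=3):
    """Mimic `sum(metric > threshold) >= min_hits` over one evaluation window."""
    hits = sum(1 for value in window if value > threshold)
    return hits >= min_hits


def simulate(values, period=10, silence_period=5, threshold=1000, min_hits=3):
    """Return the minute indexes at which an alarm fires in this simplified model.

    Assumption: one sample and one check per minute; the real OAP scheduler
    may differ in its exact timing.
    """
    window = deque(maxlen=period)  # keeps only the last `period` minutes
    silent_left = 0                # checks remaining in the silence period
    fired_at = []
    for minute, value in enumerate(values):
        window.append(value)
        if silent_left > 0:
            # Still silenced: skip this check entirely.
            silent_left -= 1
            continue
        if rule_matches(window, threshold, min_hits):
            fired_at.append(minute)
            silent_left = silence_period
    return fired_at
```

Calling simulate with a list of per-minute response times (in ms) shows that a rule does not fire on every slow minute: it needs enough slow minutes inside the window, and after firing it stays quiet for the silence period.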
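These rules mean:
- Service response time exceeded 1000 ms in at least 3 of the last 10 minutes
- Service success rate was below 80% in at least 2 of the last 10 minutes
- A service response-time percentile (p50, p75, p90, p95, or p99) exceeded 1000 ms in at least 3 of the last 10 minutes
- Service instance response time exceeded 1000 ms in at least 2 of the last 10 minutes
- Endpoint response time exceeded 1000 ms in at least 2 of the last 10 minutes (commented out by default, because endpoint alarms cost more memory)
- Database access response time exceeded 1000 ms in at least 2 of the last 10 minutes
- Endpoint relation response time exceeded 1000 ms in at least 2 of the last 10 minutes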
11.2 WebHook
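By default, alarm messages are shown in the SkyWalking UI. Nobody can watch that page all the time, so we need SkyWalking to notify us proactively.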

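11.2.1 About webhooks
A webhook is a mechanism that lets an application push events or data to an external system in real time. It is usually implemented as an HTTP callback, which enables automated message delivery across systems.
Key characteristics:
- Event-driven: an HTTP request (usually a POST) is sent to the target URL only when a predefined condition is triggered
- Lightweight: the receiver only needs to expose a reachable HTTP endpoint that accepts the data; it does not have to poll
- Flexible: suitable for alarm notification, workflow triggering, data synchronization, and similar scenarios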
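To illustrate how little the receiving side needs, here is a minimal sketch of a webhook receiver using only the Python standard library. The names received, WebhookHandler, and make_server, as well as port 8080, are assumptions made for this example; a real deployment would add authentication, persistence, and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # in-memory store of decoded payloads, for illustration only


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly as many bytes as the sender announced.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except ValueError:
            # Covers malformed JSON and undecodable bytes.
            self.send_response(400)
            self.end_headers()
            return
        received.append(payload)
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line


def make_server(host="127.0.0.1", port=8080):
    """Create the HTTP server; the caller decides when to start it."""
    return HTTPServer((host, port), WebhookHandler)
```

Running make_server().serve_forever() starts the receiver. Any sender that can issue an HTTP POST with a JSON body can then deliver events to it, with no polling on either side.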
11.2.2 Webhooks in SkyWalking
SkyWalking supports webhooks through the hooks section of alarm-settings.yml, which is commented out by default:
yml
#hooks:
#  webhook:
#    default:
#      is-default: true
#      urls:
#        - http://127.0.0.1/notify/
#        - http://127.0.0.1/go-wechat/
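When an alarm is triggered, SkyWalking generates an alarm event, wraps the alarm information as JSON, and sends it over HTTP to the configured URLs.
The JSON payload has the following structure.

The JSON format is defined by List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage> (see the corresponding code in the official GitHub repository):
- scopeId, scope: the monitoring scope of the alarm target; the scopes are defined in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine in the repository
- name: the name of the alarm target, such as a service name or an endpoint name
- id0: the ID of the scope entity that matches name; for a relation scope, it is the source entity ID
- id1: empty for most scopes; for a relation scope, it is the destination entity ID
- ruleName: the name of the alarm rule that was triggered
- alarmMessage: the content of the alarm message
- startTime: the time the alarm was triggered, as a timestamp in milliseconds
- tags: a list of tags carrying additional information about the alarm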
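The field list above can be turned into a small parser on the receiving side. The sketch below assumes the payload is a JSON array of objects with exactly the fields listed, and that each tag is an object with key and value entries. The Alarm dataclass, parse_alarms, and the sample values are made up for illustration and are not part of SkyWalking.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alarm:
    scope_id: int
    scope: str
    name: str
    id0: str
    id1: str
    rule_name: str
    alarm_message: str
    start_time_ms: int
    tags: dict = field(default_factory=dict)

    @property
    def started_at(self):
        """startTime is in milliseconds; convert it to an aware UTC datetime."""
        return datetime.fromtimestamp(self.start_time_ms / 1000, tz=timezone.utc)


def parse_alarms(body):
    """Decode a webhook body (str or bytes) into a list of Alarm objects."""
    alarms = []
    for item in json.loads(body):
        # Assumption: tags arrive as [{"key": ..., "value": ...}, ...].
        tags = {t["key"]: t["value"] for t in item.get("tags") or []}
        alarms.append(Alarm(
            scope_id=item["scopeId"],
            scope=item["scope"],
            name=item["name"],
            id0=item.get("id0", ""),
            id1=item.get("id1", ""),
            rule_name=item["ruleName"],
            alarm_message=item["alarmMessage"],
            start_time_ms=item["startTime"],
            tags=tags,
        ))
    return alarms


# Illustrative payload only; the values are invented for this example.
SAMPLE = json.dumps([{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceA",
    "id0": "12",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "Response time of service serviceA is more than 1000ms",
    "startTime": 1560524171000,
    "tags": [{"key": "level", "value": "WARNING"}],
}])
```

parse_alarms(SAMPLE) returns a single Alarm whose started_at property gives the trigger time as a UTC datetime, which is usually easier to display than the raw millisecond timestamp.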
To enable it, uncomment the webhook section in alarm-settings.yml:
yml
hooks:
  webhook:
    default:
      is-default: true
      urls:
        - http://127.0.0.1/notify/
        - http://127.0.0.1/go-wechat/
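urls: the addresses that alarm messages are pushed to
Typical use cases:
- Integrating third-party systems: push alarms to collaboration tools such as WeCom (企业微信), Feishu, or email
- Automated operations: trigger operations scripts (for example, automatic scale-out) or link to an incident management system
- Data aggregation and analysis: forward alarm events to a big data platform for statistical analysis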
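As a rough sketch of the first use case, the code below turns a decoded alarm list into a single text message and posts it to a chat-bot webhook. The message shape {"msgtype": "text", "text": {"content": ...}} is an assumption modeled on common group-bot webhooks, so check the documentation of the tool you actually use. The names build_text_message and forward, and the bot_url argument, are hypothetical.

```python
import json
import urllib.request


def build_text_message(alarms):
    """Build one text message summarizing all alarms in a webhook call.

    Assumption: the target bot accepts {"msgtype": "text", "text": {...}}.
    """
    lines = []
    for alarm in alarms:
        lines.append("[{scope}] {name} ({rule}): {msg}".format(
            scope=alarm.get("scope", "?"),
            name=alarm.get("name", "?"),
            rule=alarm.get("ruleName", "?"),
            msg=alarm.get("alarmMessage", ""),
        ))
    return {"msgtype": "text", "text": {"content": "\n".join(lines)}}


def forward(alarms, bot_url, timeout=5):
    """POST the summary to bot_url and return the HTTP status code."""
    data = json.dumps(build_text_message(alarms)).encode("utf-8")
    request = urllib.request.Request(
        bot_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A small service that receives SkyWalking's webhook call at one of the configured urls could call forward with the decoded alarm list and the bot's address. Keeping the message building separate from the HTTP call makes it easy to swap in a different target format.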