Table of Contents
- Preface
- 1. Preparation
- 2. Configure SkyWalking
  - 2.1 Edit alarm-settings.yml
  - 2.2 Restart SkyWalking
- 3. Receive the alarm email
- Summary
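Preface
Based on its alarm rules, SkyWalking calls a backend microservice endpoint through a webhook, and the microservice then sends the alarm email.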
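1. Preparation
Following the earlier chapters, make sure SkyWalking can collect monitoring data from the Linux host.
Following the previous chapter, start the microservice and make sure the skywalkingNotifyToQqEmail endpoint is available.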
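One way to confirm the endpoint is available before SkyWalking is involved is to POST a hand-built alarm payload to it. The sketch below assumes the webhook body is a JSON array of alarm objects with the fields documented for SkyWalking webhooks (`scopeId`, `scope`, `name`, `id0`, `id1`, `ruleName`, `alarmMessage`, `startTime`, `tags`); the exact field set may differ between versions. The helper name `build_alarm_payload` and all field values are hypothetical and exist only for this test.

```shell
#!/bin/sh
# Hypothetical helper: print a sample alarm payload shaped like the JSON array
# that a SkyWalking webhook is assumed to POST. All values are made up.
# Usage: build_alarm_payload [RULE_NAME] [ENTITY_NAME]
build_alarm_payload() {
  rule_name="${1:-cpu_usage_high_rule}"
  entity_name="${2:-demo-linux-host}"
  # startTime is a millisecond timestamp; `date +%s` returns seconds,
  # so three zeros are appended.
  start_ms="$(date +%s)000"
  cat <<EOF
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "${entity_name}",
  "id0": "test-id",
  "id1": "",
  "ruleName": "${rule_name}",
  "alarmMessage": "CPU usage of ${entity_name} is above 20%.",
  "startTime": ${start_ms},
  "tags": []
}]
EOF
}
```

The payload can then be sent with curl, for example `build_alarm_payload | curl -s -X POST -H 'Content-Type: application/json' --data-binary @- http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail`. If an email arrives, the microservice side of the chain works, provided the endpoint accepts this payload shape.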
2. Configure SkyWalking
2.1 Edit alarm-settings.yml
```bash
cd /opt/skywalking/apache-skywalking-apm-bin/config
vim alarm-settings.yml
```
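Uncomment the hooks section and change the url to the address of your own microservice.
Add one monitoring rule that alarms on high CPU usage. For testing, the threshold is lowered so that the alarm fires when usage exceeds 20%.
The complete configuration file: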
```yml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    # A MQE expression, the result type must be `SINGLE_VALUE` and the root operation of the expression must be a Compare Operation
    # which provides `1`(true) or `0`(false) result. When the result is `1`(true), the alarm will be triggered.
    expression: sum(service_resp_time > 1000) >= 3
    period: 10
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
#  service_resp_time_rule:
#    expression: avg(service_resp_time) > 1000
#    period: 10
#    silence-period: 5
#    message: Avg response time of service {name} is more than 1000ms in last 10 minutes.
  service_sla_rule:
    expression: sum(service_sla < 8000) >= 2
    # The length of time to evaluate the metrics
    period: 10
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    expression: sum(service_percentile{p='50,75,90,95,99'} > 1000) >= 3
    period: 10
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    expression: sum(service_instance_resp_time > 1000) >= 2
    period: 10
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    expression: sum(database_access_resp_time > 1000) >= 2
    period: 10
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    expression: sum(endpoint_relation_resp_time > 1000) >= 2
    period: 10
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
  cpu_usage_high_rule:
    # MQE expression that yields 1 (true) or 0 (false). Here the alarm fires when CPU usage exceeds 20%.
    expression: avg(meter_vm_cpu_total_percentage) > 20
    # Evaluation window: the last 1 minute of metrics
    period: 1
    # Silence period: after the alarm fires, it is not sent again for the next 3 checks (about 3 minutes)
    silence-period: 3
    # Alarm message
    message: CPU usage of {name} is above 20%.
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_resp_time_rule:
  #   expression: sum(endpoint_resp_time > 1000) >= 2
  #   period: 10
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
hooks:
  webhook:
    default:
      is-default: true
      urls:
        - http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail
#        - http://127.0.0.1/go-wechat/
```
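To make the two expression shapes in this file more concrete, here is a rough emulation in shell. It is not how the SkyWalking OAP evaluates MQE internally. It only illustrates the arithmetic, assuming one metric value per minute inside the `period` window. The function names `avg_exceeds` and `count_exceeds` are hypothetical.

```shell
#!/bin/sh
# Emulates `avg(metric) > THRESHOLD` over the values in the window.
# Usage: avg_exceeds THRESHOLD V1 [V2 ...]   prints 1 (alarm) or 0 (no alarm)
avg_exceeds() {
  threshold="$1"; shift
  # No data points in the window: nothing to alarm on.
  [ "$#" -gt 0 ] || { echo 0; return 0; }
  printf '%s\n' "$@" | awk -v t="$threshold" '
    { sum += $1; n++ }
    END { if (n > 0 && sum / n > t) print 1; else print 0 }'
}

# Emulates `sum(metric > THRESHOLD) >= MIN_COUNT`: counts how many values
# in the window exceed the threshold and compares that count to MIN_COUNT.
# Usage: count_exceeds THRESHOLD MIN_COUNT V1 [V2 ...]
count_exceeds() {
  threshold="$1"; min_count="$2"; shift 2
  [ "$#" -gt 0 ] || { echo 0; return 0; }
  printf '%s\n' "$@" | awk -v t="$threshold" -v m="$min_count" '
    $1 + 0 > t + 0 { hits++ }
    END { if (hits + 0 >= m + 0) print 1; else print 0 }'
}
```

With `period: 1` the CPU rule sees a single value, so `avg_exceeds 20 35` prints 1 while `avg_exceeds 20 12` prints 0. For `service_resp_time_rule`, `count_exceeds 1000 3 1200 900 1500 1100` prints 1 because three of the four values are above 1000.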
2.2 Restart SkyWalking
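```bash
ps -ef | grep skywalking
# 8295 and 8296 are the PIDs printed by ps in this example; substitute your own
kill 8295 8296
# startup.sh lives in the bin directory next to config
cd /opt/skywalking/apache-skywalking-apm-bin/bin
sh startup.sh
```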
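Because the PIDs change on every start, copying them by hand is easy to get wrong. A minimal sketch for pulling them out of `ps -ef` output follows. It assumes the PID is the second column and that every SkyWalking process has `skywalking` somewhere in its command line, which holds for the install path used here. The function name `skywalking_pids` is hypothetical.

```shell
#!/bin/sh
# Reads `ps -ef` output on stdin and prints the PID (second column) of every
# line that mentions "skywalking", skipping the grep/awk helper processes.
skywalking_pids() {
  awk '/skywalking/ && !/grep/ && !/awk/ { print $2 }'
}

# Example use (left commented out so that sourcing this file kills nothing):
#   pids="$(ps -ef | skywalking_pids)"
#   [ -n "$pids" ] && kill $pids
```

Printing the PIDs first and checking them before running `kill` is safer than piping straight into it, since the pattern also matches any unrelated process whose command line contains `skywalking`.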
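3. Receive the alarm email
Once SkyWalking has restarted, the alarm rules take effect. After a short while you will receive the alarm email.
The microservice also writes the alarm to its log.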
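Summary
If no email arrives, recheck the chain from the start: Linux metrics reaching SkyWalking, the alarm rule, the webhook url, and the microservice that sends the email.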
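When rechecking the webhook part of the chain, it helps to see which URLs are actually active in the configuration, since a line that is still commented out is ignored. The sketch below prints every uncommented `- http://...` list item in a file. It is a plain text match rather than a YAML parser, so it assumes URLs appear one per line as in the sample file, and the function name `webhook_urls` is hypothetical.

```shell
#!/bin/sh
# Prints uncommented list items of the form "- http://..." or "- https://..."
# from the given file. Lines starting with "#" do not match.
# Usage: webhook_urls /path/to/alarm-settings.yml
webhook_urls() {
  sed -n 's|^[[:space:]]*-[[:space:]]*\(https\{0,1\}://[^[:space:]]*\).*$|\1|p' "$1"
}
```

Running `webhook_urls /opt/skywalking/apache-skywalking-apm-bin/config/alarm-settings.yml` should print the microservice address. If it prints nothing, the hooks section is probably still commented out. Each printed URL can then be tested with curl from the SkyWalking host to confirm that it is reachable from there.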