Monitoring 08: SkyWalking Monitoring Alerts

Preface

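When one of its alarm rules fires, SkyWalking calls a backend microservice endpoint through a webhook, and the microservice then sends the alert email.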
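
To make the flow concrete, here is a minimal Python sketch of the receiving side. It assumes the webhook body is the JSON array of alarm messages described in SkyWalking's alarm documentation (fields such as `ruleName`, `alarmMessage`, `scope`, `name`, and `startTime` in milliseconds). The function name `build_alert_email` and the sample values are hypothetical, and the microservice from the previous chapter may well be written differently.

```python
import json
from datetime import datetime, timezone


def build_alert_email(raw_body: str) -> tuple[str, str]:
    """Turn a webhook body (assumed: JSON array of alarm messages) into (subject, body)."""
    alarms = json.loads(raw_body)
    lines = []
    for alarm in alarms:
        # startTime is assumed to be a Unix timestamp in milliseconds
        started = datetime.fromtimestamp(alarm["startTime"] / 1000, tz=timezone.utc)
        lines.append(
            f"[{alarm['ruleName']}] {alarm['alarmMessage']} "
            f"(scope={alarm.get('scope')}, started {started:%Y-%m-%d %H:%M:%S} UTC)"
        )
    subject = f"SkyWalking alert: {len(alarms)} alarm(s)"
    return subject, "\n".join(lines)


# Hypothetical sample payload, shaped like a single CPU alarm
sample = json.dumps([{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "vm-host-1",
    "ruleName": "cpu_usage_high_rule",
    "alarmMessage": "CPU usage of vm-host-1 is above 20%.",
    "startTime": 1700000000000,
}])

subject, body = build_alert_email(sample)
print(subject)
print(body)
```

Sending the mail itself (for example over SMTP) is left out; the sketch only covers turning the payload into readable text.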


1. Preparation

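Following the earlier chapters, make sure SkyWalking is already collecting Linux metrics.

Following the previous chapter, start the microservice and make sure its skywalkingNotifyToQqEmail endpoint is available.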

2. Configuring SkyWalking

2.1 Edit alarm-settings.yml

cd /opt/skywalking/apache-skywalking-apm-bin/config
vim alarm-settings.yml

Uncomment the hooks section and change the url to the address of your own microservice.

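Add one new alarm rule that fires when CPU usage is too high. For testing, the threshold is lowered so that usage above 20% triggers an alert.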
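
As a rough illustration of how such a rule behaves, the Python sketch below models the expression `avg(meter_vm_cpu_total_percentage) > 20` with `period: 1` and `silence-period: 3`. It is a simplified model under stated assumptions (one CPU sample per minute, one check per minute, notifications suppressed for the following `silence_period` minutes), not SkyWalking's actual implementation, and the function name `simulate_alerts` is hypothetical.

```python
def simulate_alerts(cpu_by_minute, threshold=20.0, period=1, silence_period=3):
    """Return the minute indexes at which a notification would be sent (simplified model)."""
    sent = []
    silent_until = -1  # last minute index that is still silenced
    for minute in range(len(cpu_by_minute)):
        # Average the samples in the evaluation window ending at this minute
        window = cpu_by_minute[max(0, minute - period + 1): minute + 1]
        matched = sum(window) / len(window) > threshold
        if matched and minute > silent_until:
            sent.append(minute)
            silent_until = minute + silence_period
    return sent


# Hypothetical per-minute CPU percentages
cpu = [10, 35, 40, 38, 42, 45, 15, 12]
print(simulate_alerts(cpu))  # prints [1, 5]
```

In this model the CPU stays above 20% from minute 1 to minute 5, but only two notifications go out, because the three minutes after the first one are silenced.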

The complete configuration file:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    # A MQE expression, the result type must be `SINGLE_VALUE` and the root operation of the expression must be a Compare Operation
    # which provides `1`(true) or `0`(false) result. When the result is `1`(true), the alarm will be triggered.
    expression: sum(service_resp_time > 1000) >= 3
    period: 10
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
#  service_resp_time_rule:
#    expression: avg(service_resp_time) > 1000
#    period: 10
#    silence-period: 5
#    message: Avg response time of service {name} is more than 1000ms in last 10 minutes.
  service_sla_rule:
    expression: sum(service_sla < 8000) >= 2
    # The length of time to evaluate the metrics
    period: 10
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    expression: sum(service_percentile{p='50,75,90,95,99'} > 1000) >= 3
    period: 10
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    expression: sum(service_instance_resp_time > 1000) >= 2
    period: 10
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    expression: sum(database_access_resp_time > 1000) >= 2
    period: 10
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    expression: sum(endpoint_relation_resp_time > 1000) >= 2
    period: 10
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
  cpu_usage_high_rule:
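    # MQE expression that yields 1 (true) or 0 (false). Here the alarm is triggered when average CPU usage exceeds 20%.
    expression: avg(meter_vm_cpu_total_percentage) > 20
    # Length of the evaluation window: the last 1 minute of metrics
    period: 1
    # Silence period: after the alarm triggers, it is not sent again for 3 minutes
    silence-period: 3
    # Alarm message
    message: CPU usage of {name} is above 20%.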
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_resp_time_rule:
#    expression: sum(endpoint_resp_time > 1000) >= 2
#    period: 10
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

hooks:
  webhook:
    default:
      is-default: true
      urls:
        - http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail
#        - http://127.0.0.1/go-wechat/

2.2 Restart SkyWalking

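# List the running SkyWalking processes and note their PIDs
ps -ef | grep skywalking
# Replace 8295 8296 with the PIDs shown on your machine
kill 8295 8296
# startup.sh lives in the distribution's bin directory
sh /opt/skywalking/apache-skywalking-apm-bin/bin/startup.sh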

3. Receiving the alert email

Once SkyWalking has restarted, the alarm rules take effect. An alert email should arrive shortly afterwards.


The microservice also writes a log entry when the webhook is called.


Summary

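If no email arrives, check each link in the chain from the start: whether SkyWalking is receiving the Linux metrics, whether the alarm rule is loaded and firing, whether the webhook url is reachable from the SkyWalking host, and whether the microservice itself can send mail.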
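
One way to isolate the microservice half of the chain is to post a hand-made alarm to the webhook url yourself. The Python sketch below assumes the endpoint accepts the same JSON array that SkyWalking would send; the helper names `build_test_request` and `send_test_alarm` and the sample values are hypothetical.

```python
import json
import urllib.request


def build_test_request(url: str) -> urllib.request.Request:
    """Build a POST request carrying one hand-made alarm (assumed payload shape)."""
    payload = [{
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "manual-test",
        "ruleName": "cpu_usage_high_rule",
        "alarmMessage": "Manual webhook test.",
        "startTime": 1700000000000,
    }]
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_alarm(url: str) -> int:
    """Send the test alarm and return the HTTP status code."""
    with urllib.request.urlopen(build_test_request(url), timeout=5) as resp:
        return resp.status


# Example call, using the webhook url from alarm-settings.yml:
# send_test_alarm("http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail")
```

If this call produces an email, the microservice side is working and the problem lies in SkyWalking's metrics or rule configuration. If it fails, look at the microservice and its mail settings first.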
