Monitoring-08: SkyWalking Monitoring Alerts

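Preface

SkyWalking evaluates its alarm rules and, when one fires, calls a backend microservice endpoint through a webhook; the microservice then sends the alert email.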
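
To make the webhook step concrete, here is a minimal Python sketch of what a receiving service might do with the request body. It assumes the payload shape described in the SkyWalking alarm documentation (a JSON array of objects with fields such as `scope`, `name`, `ruleName`, `alarmMessage`, and `startTime` in epoch milliseconds). The function name `format_alarm_email` and the sample values are hypothetical; this is not the microservice from the previous chapter.

```python
import json
from datetime import datetime, timezone


def format_alarm_email(body: str) -> tuple[str, str]:
    """Turn a SkyWalking webhook body into a (subject, text) pair.

    Assumes the body is a JSON array of alarm objects; missing fields
    fall back to placeholders rather than raising.
    """
    alarms = json.loads(body)
    lines = []
    for alarm in alarms:
        # startTime is assumed to be epoch milliseconds.
        started = datetime.fromtimestamp(
            alarm.get("startTime", 0) / 1000, tz=timezone.utc
        )
        lines.append(
            f"[{alarm.get('scope', '?')}] {alarm.get('name', '?')} "
            f"({alarm.get('ruleName', '?')}) at {started:%Y-%m-%d %H:%M:%S} UTC: "
            f"{alarm.get('alarmMessage', '')}"
        )
    subject = f"SkyWalking alert: {len(alarms)} alarm(s)"
    return subject, "\n".join(lines)


# Hypothetical sample payload for the cpu_usage_high_rule rule.
sample = json.dumps([{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "linux-host",
    "ruleName": "cpu_usage_high_rule",
    "alarmMessage": "CPU usage of linux-host is above 20%.",
    "startTime": 1700000000000,
}])
subject, text = format_alarm_email(sample)
print(subject)
print(text)
```

Because one webhook call may carry several alarms, the sketch builds one line per alarm and a single subject, so a burst of alarms would produce one email rather than many.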


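1. Preparation

Following the earlier chapters, make sure SkyWalking is receiving Linux metrics.

Following the previous chapter, start the microservice and make sure its skywalkingNotifyToQqEmail endpoint is available.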

2. Configuring SkyWalking

2.1 Edit alarm-settings.yml

bash
cd /opt/skywalking/apache-skywalking-apm-bin/config
vim alarm-settings.yml

Uncomment the hooks section and change the url to the address of your own microservice.
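
Before relying on SkyWalking to call the hook, it can help to post a hand-made alarm to the URL and confirm that an email arrives. The sketch below is one way to do that in Python with only the standard library. The payload fields follow the SkyWalking webhook format as documented; the helper names `build_test_request` and `send_test_alarm` and the `manual-test` service name are hypothetical, and whether the microservice accepts this exact body depends on how it was written in the previous chapter.

```python
import json
import urllib.request

# URL taken from the hooks section of alarm-settings.yml; replace with your own.
HOOK_URL = "http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail"


def build_test_request(url: str) -> urllib.request.Request:
    """Build a POST request that mimics a SkyWalking webhook call."""
    payload = [{
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "manual-test",  # hypothetical service name
        "ruleName": "cpu_usage_high_rule",
        "alarmMessage": "Manual webhook test.",
        "startTime": 0,
    }]
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_alarm(url: str = HOOK_URL, timeout: float = 5.0) -> int:
    """Send the test alarm and return the HTTP status code."""
    with urllib.request.urlopen(build_test_request(url), timeout=timeout) as resp:
        return resp.status
```

Calling `send_test_alarm()` from a machine that can reach the microservice should return a 2xx status if the endpoint is up; a connection error at this point suggests a wrong address or a firewall rather than a problem in SkyWalking.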

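Add one alarm rule that fires when CPU usage is too high. For testing purposes, the threshold is lowered so that the alarm fires above 20%.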

The complete configuration file:

yml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    # A MQE expression, the result type must be `SINGLE_VALUE` and the root operation of the expression must be a Compare Operation
    # which provides `1`(true) or `0`(false) result. When the result is `1`(true), the alarm will be triggered.
    expression: sum(service_resp_time > 1000) >= 3
    period: 10
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
#  service_resp_time_rule:
#    expression: avg(service_resp_time) > 1000
#    period: 10
#    silence-period: 5
#    message: Avg response time of service {name} is more than 1000ms in last 10 minutes.
  service_sla_rule:
    expression: sum(service_sla < 8000) >= 2
    # The length of time to evaluate the metrics
    period: 10
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    expression: sum(service_percentile{p='50,75,90,95,99'} > 1000) >= 3
    period: 10
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    expression: sum(service_instance_resp_time > 1000) >= 2
    period: 10
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    expression: sum(database_access_resp_time > 1000) >= 2
    period: 10
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    expression: sum(endpoint_relation_resp_time > 1000) >= 2
    period: 10
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
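  cpu_usage_high_rule:
    # MQE expression that yields 1 (true) or 0 (false). Here the alarm is triggered when CPU usage exceeds 20%.
    expression: avg(meter_vm_cpu_total_percentage) > 20
    # Evaluate the metrics over the last 1 minute
    period: 1
    # Silence period: after the alarm is triggered, do not send it again for 3 minutes
    silence-period: 3
    # Alarm message
    message: CPU usage of {name} is above 20%.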
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_resp_time_rule:
#    expression: sum(endpoint_resp_time > 1000) >= 2
#    period: 10
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

hooks:
  webhook:
    default:
      is-default: true
      urls:
        - http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail
#        - http://127.0.0.1/go-wechat/
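
To show how the two kinds of expressions in this file read, here is a small Python sketch that evaluates them over a list of per-minute values. It is a simplified model and not SkyWalking's implementation: it assumes one check per minute over the last `period` values, and that after a notification the next `silence_period` checks stay quiet. The function names and the sample data are hypothetical.

```python
from statistics import mean


def count_rule(window, threshold, min_hits):
    """sum(metric > threshold) >= min_hits over the values in the window."""
    return sum(1 for v in window if v > threshold) >= min_hits


def avg_rule(window, threshold):
    """avg(metric) > threshold over the values in the window."""
    return mean(window) > threshold


def simulate(values, period, silence_period, threshold):
    """Return the minutes (0-based) at which an avg rule would notify.

    Simplified model: one check per minute over the last `period` values,
    and after a notification the next `silence_period` checks stay quiet.
    """
    notified, silent_left = [], 0
    for minute in range(len(values)):
        window = values[max(0, minute - period + 1): minute + 1]
        if silent_left > 0:
            silent_left -= 1
            continue
        if avg_rule(window, threshold):
            notified.append(minute)
            silent_left = silence_period
    return notified


# service_resp_time_rule: do at least 3 of the last 10 one-minute values exceed 1000 ms?
resp_times = [200, 1500, 300, 1200, 250, 1100, 400, 300, 200, 100]
print(count_rule(resp_times, 1000, 3))

# cpu_usage_high_rule: period 1, silence-period 3, CPU held constantly at 35%.
print(simulate([35] * 10, period=1, silence_period=3, threshold=20))
```

Under this model, a CPU that stays above the threshold does not produce an email every minute; the silence period spaces the notifications out, which is why a small `silence-period` is convenient while testing.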

2.2 Restart SkyWalking

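bash
# Find the PIDs of the running SkyWalking processes
ps -ef | grep skywalking
# Stop them; 8295 and 8296 are the PIDs on the author's machine, use the ones printed above
kill 8295 8296
# Start SkyWalking again (startup.sh is in the bin directory of the installation)
sh startup.sh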

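3. Receiving the alert email

Once SkyWalking has restarted, the alarm rules take effect. After a short while you will receive the alert email.

The microservice also writes an entry to its log when the alarm arrives.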


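Summary

If no email arrives, check the chain from the start: whether SkyWalking is receiving the Linux metrics, whether the alarm rule and the hook url in alarm-settings.yml are correct, and whether the microservice endpoint is reachable and able to send mail.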
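
As a starting point for that check, the following Python sketch tests whether the TCP endpoints along the chain accept connections. It assumes the SkyWalking OAP server listens on its default ports 11800 (gRPC) and 12800 (HTTP), which may differ in your installation; the function names and the `oap_host` parameter are hypothetical. It only verifies reachability, not that the rule fires or that the mail is sent.

```python
import socket
from urllib.parse import urlparse


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def hook_endpoint(url: str) -> tuple[str, int]:
    """Extract (host, port) from a webhook URL, defaulting to port 80 or 443."""
    parts = urlparse(url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def check_chain(oap_host: str, hook_url: str) -> dict:
    """Check the TCP endpoints an alert passes through."""
    hook_host, hook_port = hook_endpoint(hook_url)
    return {
        "oap_grpc_11800": can_connect(oap_host, 11800),  # assumed default port
        "oap_http_12800": can_connect(oap_host, 12800),  # assumed default port
        "webhook": can_connect(hook_host, hook_port),
    }
```

For example, `check_chain("127.0.0.1", "http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail")` run on the SkyWalking host returns a dictionary of booleans; any `False` entry points at the part of the chain to inspect first.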
