Common alerting rules

node_exporter alerting rules

NodeCPUUsageHigh:

  - alert: NodeCPUUsageHigh
    expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage detected on {{ $labels.instance }}"
      description: "CPU usage has been above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a node's CPU usage stays above 80% for 5 minutes.
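The rules in this article are list items, so Prometheus can only load them once they sit under `groups[].rules` in a rule file. The following is a minimal sketch of that wrapper; the file name `node_rules.yml` and the group name `node-alerts` are assumptions chosen for illustration.

```yaml
# node_rules.yml (hypothetical file name)
groups:
  - name: node-alerts            # hypothetical group name
    rules:
      - alert: NodeCPUUsageHigh
        # 100 minus the average idle percentage per instance over 5 minutes
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for more than 5 minutes on {{ $labels.instance }}."
```

Such a file is referenced from `prometheus.yml` through the `rule_files` list, and `promtool check rules node_rules.yml` can validate it before a reload.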

NodeMemoryUsageHigh:

  - alert: NodeMemoryUsageHigh
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage detected on {{ $labels.instance }}"
      description: "Memory usage has been above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a node's memory usage stays above 90% for 5 minutes.

NodeDiskUsageHigh:

  - alert: NodeDiskUsageHigh
    expr: (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} - node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"}) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk usage detected on {{ $labels.instance }}"
      description: "Disk usage has been above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a filesystem on a node stays more than 85% full for 5 minutes. tmpfs and lxcfs mounts are excluded.

NodeFilesystemReadOnly:

  - alert: NodeFilesystemReadOnly
    expr: node_filesystem_readonly{fstype!~"tmpfs|fuse.lxcfs"} == 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Filesystem is read-only on {{ $labels.instance }}"
      description: "Filesystem has been read-only for more than 10 minutes on {{ $labels.instance }}."

Explanation: fires when a filesystem on a node has been read-only for 10 minutes.

NodeLoadAverageHigh:

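  - alert: NodeLoadAverageHigh
    expr: node_load1 > on (instance) (2 * count by (instance) (node_cpu_seconds_total{mode="system"}))
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High load average on {{ $labels.instance }}"
      description: "1-minute load average has been more than twice the number of CPUs for over 5 minutes on {{ $labels.instance }}."

Explanation: fires when a node's 1-minute load average stays above twice its CPU count for 5 minutes. The CPU count is taken per instance (count by (instance)) and matched on the instance label; a bare count() would return one label-less series that never matches node_load1.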
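The per-node CPU count that the load rule relies on can also be precomputed with a recording rule, which keeps the alert expression short. This is a sketch only; the group name and the rule name `instance:node_num_cpu:count` are assumptions, not names defined by node_exporter.

```yaml
groups:
  - name: node-recording-rules          # hypothetical group name
    rules:
      # One series per instance whose value is the number of logical CPUs.
      # Each CPU exposes exactly one series for a given mode, so counting
      # a single mode counts CPUs.
      - record: instance:node_num_cpu:count   # hypothetical rule name
        expr: count by (instance) (node_cpu_seconds_total{mode="idle"})
```

An alert could then compare `node_load1` against `2 * instance:node_num_cpu:count` with `on (instance)` matching.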

NodeNetworkDown:

  - alert: NodeNetworkDown
    expr: up{job="node_exporter"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Node down: {{ $labels.instance }}"
      description: "Node has been down for more than 10 minutes."

Explanation: fires when Prometheus has been unable to scrape a node's node_exporter for 10 minutes.
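A rule like this can be checked offline with promtool's unit-test format. The sketch below assumes the rule is stored in a file called `node_rules.yml` (a hypothetical name) and that the target is `host1:9100` (also hypothetical); it feeds a series that stays at 0 and expects the alert to be firing after the 10-minute `for` window.

```yaml
# node_rules_test.yml (hypothetical file name)
rule_files:
  - node_rules.yml               # hypothetical: the file holding NodeNetworkDown

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # up stays at 0 for 16 samples (minute 0 through minute 15)
      - series: 'up{job="node_exporter", instance="host1:9100"}'
        values: '0+0x15'
    alert_rule_test:
      - eval_time: 11m           # past the 10m "for" duration
        alertname: NodeNetworkDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node_exporter
              instance: host1:9100
            exp_annotations:
              summary: "Node down: host1:9100"
              description: "Node has been down for more than 10 minutes."
```

The test is run with `promtool test rules node_rules_test.yml`.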

NodeSwapUsageHigh:

  - alert: NodeSwapUsageHigh
    expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High swap usage on {{ $labels.instance }}"
      description: "Swap usage has been above 50% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a node's swap usage stays above 50% for 5 minutes.

NodeFileSystemInodesUsageHigh:

  - alert: NodeFileSystemInodesUsageHigh
    expr: (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High filesystem inode usage on {{ $labels.instance }}"
      description: "Filesystem inode usage has been above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when inode usage on a node's filesystem stays above 80% for 5 minutes.

NodeTemperatureHigh:

  - alert: NodeTemperatureHigh
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High temperature on {{ $labels.instance }}"
      description: "Node temperature has been above 75 degrees Celsius for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a hardware temperature sensor on a node reads above 75 °C for 5 minutes.

NodeProcessCountHigh:

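  - alert: NodeProcessCountHigh
    expr: node_processes_pids > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High process count on {{ $labels.instance }}"
      description: "Number of processes has been above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a node runs more than 500 processes for 5 minutes. node_processes_pids comes from node_exporter's processes collector, which is assumed to be enabled (--collector.processes); node_scrape_collector_duration_seconds only describes collector scrapes and cannot be used to count processes.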

windows_exporter alerting rules

WindowsCPUUsageHigh:

  - alert: WindowsCPUUsageHigh
    expr: avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) < 0.2
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage has been above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a Windows node's CPU usage stays above 80% (idle share below 20%) for 5 minutes.

WindowsMemoryUsageHigh:

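  - alert: WindowsMemoryUsageHigh
    expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage has been above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a Windows node's memory usage stays above 90% for 5 minutes. The subtraction is parenthesized because division and multiplication bind tighter than subtraction in PromQL; without the parentheses the expression does not compute a percentage.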

WindowsDiskUsageHigh:

  - alert: WindowsDiskUsageHigh
    expr: windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100 < 15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk usage on {{ $labels.instance }}"
      description: "Disk usage has been above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a logical disk on a Windows node stays more than 85% full (less than 15% free) for 5 minutes.

WindowsNetworkDown:

  - alert: WindowsNetworkDown
    expr: up{job="windows_exporter"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Node down: {{ $labels.instance }}"
      description: "Node has been down for more than 10 minutes."

Explanation: fires when Prometheus has been unable to scrape a Windows node's windows_exporter for 10 minutes.
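The two `up{job="..."}` rules only match if the scrape jobs carry exactly those job names. A minimal scrape configuration under that assumption might look like the sketch below; the host names are placeholders, and the ports are the exporters' defaults (9100 for node_exporter, 9182 for windows_exporter).

```yaml
# prometheus.yml fragment; target host names are hypothetical
scrape_configs:
  - job_name: node_exporter        # must match up{job="node_exporter"}
    static_configs:
      - targets: ["node1.example.internal:9100"]

  - job_name: windows_exporter     # must match up{job="windows_exporter"}
    static_configs:
      - targets: ["win1.example.internal:9182"]
```

If the jobs are named differently in an existing setup, it is simpler to adjust the `job` matcher in the rules than to rename the jobs.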

WindowsServiceNotRunning:

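  - alert: WindowsServiceNotRunning
    expr: windows_service_state{state="running"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Service not running on {{ $labels.instance }}"
      description: "A critical service has not been running for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a service on a Windows node has not been in the running state for 5 minutes. windows_service_state is 1 for the state a service is currently in and 0 otherwise. As written the rule matches every service; add a name matcher to restrict it to the services that actually matter.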

WindowsPageFileUsageHigh:

  - alert: WindowsPageFileUsageHigh
    expr: windows_os_paging_free_bytes / windows_os_paging_limit_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High page file usage on {{ $labels.instance }}"
      description: "Page file usage has been above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a Windows node's page file usage stays above 90% (less than 10% free) for 5 minutes.

WindowsHandleCountHigh:

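  - alert: WindowsHandleCountHigh
    expr: windows_process_handles > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High handle count on {{ $labels.instance }}"
      description: "Number of handles has been above 10,000 for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a process on a Windows node holds more than 10,000 handles for 5 minutes. The metric is reported per process by the process collector, and its name has changed between windows_exporter releases, so check it against the version you run.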

WindowsThreadCountHigh:

  - alert: WindowsThreadCountHigh
    expr: windows_process_threads > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High thread count on {{ $labels.instance }}"
      description: "Number of threads has been above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a process on a Windows node runs more than 500 threads for 5 minutes. The metric is reported per process, not per node.

WindowsProcessorQueueLengthHigh:

  - alert: WindowsProcessorQueueLengthHigh
    expr: windows_system_processor_queue_length > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High processor queue length on {{ $labels.instance }}"
      description: "Processor queue length has been above 10 for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a Windows node's processor queue length stays above 10 for 5 minutes.

WindowsDiskIOWaitHigh:

  - alert: WindowsDiskIOWaitHigh
    expr: avg by (instance) (rate(windows_logical_disk_idle_seconds_total[5m])) < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk I/O wait on {{ $labels.instance }}"
      description: "Disk I/O wait time has been high for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when the disks of a Windows node are idle less than 10% of the time on average, that is busy more than 90% of the time, for 5 minutes.

mysqld_exporter alerting rules

MySQLHighThreadsRunning:

  - alert: MySQLHighThreadsRunning
    expr: mysql_global_status_threads_running > 50
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High number of running threads in MySQL on {{ $labels.instance }}"
      description: "Number of running threads has been above 50 for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a MySQL instance has more than 50 running threads for 5 minutes.

MySQLSlowQueries:

  - alert: MySQLSlowQueries
    expr: rate(mysql_global_status_slow_queries[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Slow queries detected on {{ $labels.instance }}"
      description: "Slow queries have been logged continuously for the last 10 minutes on {{ $labels.instance }}."

Explanation: fires when a MySQL instance keeps recording new slow queries for 10 minutes.

MySQLReplicationLag:

  - alert: MySQLReplicationLag
    expr: mysql_slave_status_seconds_behind_master > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Replication lag on MySQL slave {{ $labels.instance }}"
      description: "Replication lag has been more than 10 seconds for over 5 minutes on {{ $labels.instance }}."

Explanation: fires when a MySQL replica lags more than 10 seconds behind its source for 5 minutes.

MySQLHighConnections:

  - alert: MySQLHighConnections
    expr: mysql_global_status_threads_connected > 200
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High number of connections to MySQL on {{ $labels.instance }}"
      description: "Number of connections has been above 200 for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a MySQL instance has more than 200 open connections for 5 minutes.

MySQLLowFreeDiskSpace:

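  - alert: MySQLLowFreeDiskSpace
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"} / node_filesystem_size_bytes{mountpoint="/var/lib/mysql"} * 100 < 10
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Low free disk space for MySQL on {{ $labels.instance }}"
      description: "Free disk space for the MySQL data directory has been below 10% for more than 10 minutes on {{ $labels.instance }}."

Explanation: fires when free space on the filesystem holding the MySQL data directory stays below 10% for 10 minutes. mysqld_exporter does not report disk capacity (datadir is a path, not a size), so the rule uses node_exporter filesystem metrics. The mountpoint /var/lib/mysql is an assumption; replace it with the mount that holds your data directory, and note that instance then refers to the node_exporter target.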

MySQLHighQueryTime:

  - alert: MySQLHighQueryTime
    expr: rate(mysql_global_status_questions[5m]) > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High query rate on MySQL on {{ $labels.instance }}"
      description: "Query rate has been above 1000 queries per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a MySQL instance handles more than 1000 queries per second for 5 minutes. Despite the alert name, the rule measures query rate, not query duration.

MySQLInnoDBBufferPoolUsageHigh:

  - alert: MySQLInnoDBBufferPoolUsageHigh
    expr: (mysql_global_status_innodb_buffer_pool_bytes_data / mysql_global_variables_innodb_buffer_pool_size) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High InnoDB buffer pool usage on MySQL on {{ $labels.instance }}"
      description: "InnoDB buffer pool usage has been above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when the data held in the InnoDB buffer pool stays above 85% of the configured pool size (innodb_buffer_pool_size) for 5 minutes.

MySQLMaxConnectionsReached:

  - alert: MySQLMaxConnectionsReached
    expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Max connections threshold reached on MySQL on {{ $labels.instance }}"
      description: "Connection usage has been above 90% of the max_connections limit for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a MySQL instance uses more than 90% of its max_connections limit for 5 minutes.

MySQLTableLocksContention:

  - alert: MySQLTableLocksContention
    expr: rate(mysql_global_status_table_locks_waited[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Table lock contention in MySQL on {{ $labels.instance }}"
      description: "More than 10 table lock waits per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: fires when a MySQL instance records more than 10 table lock waits per second for 5 minutes.

MySQLLongRunningQueries:

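  - alert: MySQLLongRunningQueries
    expr: mysql_global_status_queries{job="mysql"} > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Long running queries in MySQL on {{ $labels.instance }}"
      description: "More than 100 long running queries for over 10 minutes on {{ $labels.instance }}."

Explanation: intended to fire when more than 100 long-running queries persist for 10 minutes. mysql_global_status_queries is a cumulative counter of executed statements, so as written the expression compares the total number of statements since server start against 100 and will be true on almost any active server. Counting long-running queries requires per-thread data, for example from mysqld_exporter's processlist collector. The job="mysql" matcher must also match your scrape job name.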

blackbox_exporter alerting rules

EndpointDown:

  - alert: EndpointDown
    expr: probe_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Endpoint down: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has been down for more than 5 minutes."

Explanation: fires when probes against an endpoint have failed for 5 consecutive minutes.
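For `{{ $labels.instance }}` to show the probed endpoint rather than the exporter's own address, the scrape job has to rewrite labels in the usual blackbox_exporter way. The sketch below assumes a job named `blackbox_http`, the `http_2xx` module from the exporter's example configuration, an exporter listening on `127.0.0.1:9115`, and a placeholder target URL.

```yaml
# prometheus.yml fragment; job name, target and exporter address are assumptions
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]                 # module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com          # endpoint to probe (placeholder)
    relabel_configs:
      # Pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Expose the probed endpoint as the instance label used in annotations
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox_exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
```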

EndpointHighLatency:

  - alert: EndpointHighLatency
    expr: probe_duration_seconds > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has had a latency higher than 0.5 seconds for more than 5 minutes."

Explanation: fires when probing an endpoint takes longer than 0.5 seconds for 5 minutes.

EndpointDNSResolutionFailure:

  - alert: EndpointDNSResolutionFailure
    expr: probe_dns_lookup_time_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow DNS resolution for endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has had a DNS resolution time higher than 2 seconds for more than 5 minutes."

Explanation: fires when DNS resolution for an endpoint takes longer than 2 seconds for 5 minutes. Despite the alert name, the rule detects slow resolution rather than outright failure.

EndpointConnectionTimeout:

  - alert: EndpointConnectionTimeout
    expr: probe_http_duration_seconds{phase="connect"} > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow connection to endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has had a connection time higher than 1 second for more than 5 minutes."

Explanation: fires when establishing a connection to an endpoint takes longer than 1 second for 5 minutes. The rule uses the connect phase reported by the HTTP prober, so it applies to HTTP probes.

EndpointCertificateExpiry:

  - alert: EndpointCertificateExpiry
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expiry for endpoint: {{ $labels.instance }}"
      description: "The SSL certificate for endpoint {{ $labels.instance }} will expire in less than 7 days."

Explanation: fires when an endpoint's SSL certificate expires within 7 days (86400 seconds per day × 7).

HTTPResponseCodeMismatch:

  - alert: HTTPResponseCodeMismatch
    expr: probe_http_status_code != 200
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unexpected HTTP response code on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has returned an unexpected HTTP response code for more than 5 minutes."

Explanation: fires when an endpoint returns an HTTP status code other than 200 for 5 minutes.

EndpointHTTPRedirects:

  - alert: EndpointHTTPRedirects
    expr: probe_http_redirects > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Too many HTTP redirects on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has required more than 5 HTTP redirects for more than 5 minutes."

Explanation: fires when probing an endpoint follows more than 5 HTTP redirects for 5 minutes.

EndpointDNSLookupFailures:

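  - alert: EndpointDNSLookupFailures
    expr: probe_success{job="blackbox_dns"} == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DNS lookup failures on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has had DNS lookup failures for more than 10 minutes."

Explanation: fires when DNS probes for an endpoint have failed for 10 minutes. probe_dns_lookup_time_seconds is a gauge, so increase() over it does not indicate failures; a failed lookup shows up as probe_success == 0. The job name blackbox_dns is an assumption for a scrape job that uses a DNS probe module.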

EndpointTCPConnectionRefused:

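  - alert: EndpointTCPConnectionRefused
    expr: probe_success{job="blackbox_tcp"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "TCP connection failed on endpoint: {{ $labels.instance }}"
      description: "TCP connections to the endpoint {{ $labels.instance }} have failed for more than 5 minutes."

Explanation: fires when TCP connections to an endpoint have failed for 5 consecutive minutes. blackbox_exporter reports a failed TCP probe through probe_success; the job name blackbox_tcp is an assumption for a scrape job that uses a TCP probe module.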

EndpointSSLCertificateInvalid:

  - alert: EndpointSSLCertificateInvalid
    expr: probe_ssl_last_chain_expiry_timestamp_seconds - time() < 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Invalid SSL certificate on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has an expired SSL certificate chain."

Explanation: fires when an endpoint's SSL certificate chain has already expired. Other kinds of invalid certificate (wrong host name, untrusted issuer) are not covered by this expression.

cAdvisor alerting rules

ContainerCPUUsageHigh:

  - alert: ContainerCPUUsageHigh
    expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has used more than 80% of one CPU core for more than 5 minutes."

Explanation: fires when a container uses more than 0.8 CPU cores (80% of one core) for 5 minutes. The threshold is in cores, not a percentage of the container's CPU limit.

ContainerMemoryUsageHigh:

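  - alert: ContainerMemoryUsageHigh
    expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has had memory usage above 90% of its limit for more than 5 minutes."

Explanation: fires when a container uses more than 90% of its memory limit for 5 minutes. The > 0 filter drops containers without a limit, whose limit is reported as 0 and would otherwise make the ratio infinite.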

ContainerDiskIOHigh:

  - alert: ContainerDiskIOHigh
    expr: rate(container_fs_io_time_seconds_total[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk I/O on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has had disk I/O usage above 50% for more than 5 minutes."

Explanation: fires when a container spends more than 50% of the time doing disk I/O for 5 minutes.

ContainerNetworkErrors:

  - alert: ContainerNetworkErrors
    expr: rate(container_network_receive_errors_total[5m]) > 0 or rate(container_network_transmit_errors_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network errors on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has been experiencing network errors for more than 5 minutes."

Explanation: fires when a container keeps reporting receive or transmit errors for 5 minutes.

ContainerRestarting:

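  - alert: ContainerRestarting
    expr: increase(kube_pod_container_status_restarts_total[5m]) > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Container restarting frequently: {{ $labels.container }}"
      description: "Container {{ $labels.container }} has restarted more than once in the last 5 minutes."

Explanation: fires when a container keeps restarting more than once per 5-minute window. rate() returns restarts per second, so a threshold of 1 on it would mean more than one restart every second; increase() matches the intent. kube_pod_container_status_restarts_total comes from kube-state-metrics rather than cAdvisor, so this rule assumes a Kubernetes cluster where kube-state-metrics is scraped.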

ContainerOOMKilled:

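  - alert: ContainerOOMKilled
    expr: increase(container_oom_events_total[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Container OOMKilled: {{ $labels.container }}"
      description: "Container {{ $labels.container }} was killed due to out of memory in the last 5 minutes."

Explanation: fires as soon as a container has been OOM-killed within the last 5 minutes. PromQL cannot compare a metric against a string such as "OOMKilled", so the rule uses cAdvisor's OOM event counter, and for is set to 0m because the event is already bounded by the 5-minute window.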

ContainerFileSystemFull:

  - alert: ContainerFileSystemFull
    expr: container_fs_usage_bytes / container_fs_limit_bytes > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "File system full on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} file system usage has been above 90% for more than 5 minutes."

Explanation: fires when a container's filesystem usage stays above 90% for 5 minutes.

ContainerMemoryLeak:

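  - alert: ContainerMemoryLeak
    expr: delta(container_memory_usage_bytes[1h]) > 1000000000
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Possible memory leak on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has increased its memory usage by more than 1 GB in the last hour."

Explanation: fires when a container's memory usage keeps growing by more than 1 GB per hour, which may indicate a memory leak. container_memory_usage_bytes is a gauge, so the change is measured with delta(); increase() is meant for counters.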

ContainerHighCPUThrottling:

  - alert: ContainerHighCPUThrottling
    expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU throttling on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has been CPU-throttled in more than 20% of scheduling periods for more than 5 minutes."

Explanation: fires when a container is throttled in more than 20% of its CFS scheduling periods for 5 minutes. Both sides of the ratio count periods; dividing throttled seconds by a period count would mix units.

ContainerNetworkTrafficDrop:

  - alert: ContainerNetworkTrafficDrop
    expr: rate(container_network_receive_packets_dropped_total[5m]) > 0 or rate(container_network_transmit_packets_dropped_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network packet drops on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has been experiencing network packet drops for more than 5 minutes."

Explanation: fires when a container keeps dropping received or transmitted packets for 5 minutes.

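These alerting rules cover several exporters and help monitor the health of systems and services. With thresholds and for durations tuned to your environment, they surface problems early enough to act on, which helps keep systems stable and performant.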
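Every rule above sets a `severity` label of either `warning` or `critical`, and that label is what Alertmanager can route on. The following is a minimal sketch of such routing; the receiver names `default` and `oncall` are placeholders and carry no notification integrations here.

```yaml
# alertmanager.yml fragment; receiver names are hypothetical
route:
  receiver: default                # fallback, receives warning alerts
  group_by: [alertname, instance]
  routes:
    - matchers:
        - severity="critical"      # label set by the rules' labels block
      receiver: oncall

receivers:
  - name: default
  - name: oncall
```

In a real deployment each receiver would get an integration such as `email_configs` or `webhook_configs`.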
