node_exporter alerting rules
NodeCPUUsageHigh:
  - alert: NodeCPUUsageHigh
    expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage detected on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a node's CPU usage stays above 80% for 5 minutes.
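Each rule in this document is a list item that belongs under a `rules:` key inside a named group. A minimal sketch of how such a file could be laid out, assuming a hypothetical file name `node_alerts.yml` and a hypothetical group name `node_exporter_alerts` (neither appears in the original):

```yaml
# node_alerts.yml (hypothetical file name)
groups:
  - name: node_exporter_alerts   # hypothetical group name
    rules:
      - alert: NodeCPUUsageHigh
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
```

The file would then be listed under `rule_files` in `prometheus.yml`, and `promtool check rules node_alerts.yml` can validate its syntax before Prometheus is reloaded.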
NodeMemoryUsageHigh:
  - alert: NodeMemoryUsageHigh
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High Memory usage detected on {{ $labels.instance }}"
      description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a node's memory usage stays above 90% for 5 minutes.
NodeDiskUsageHigh:
  - alert: NodeDiskUsageHigh
    expr: (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} - node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"}) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk usage detected on {{ $labels.instance }}"
      description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when disk usage on a node's filesystem (excluding tmpfs and fuse.lxcfs) stays above 85% for 5 minutes.
NodeFilesystemReadOnly:
  - alert: NodeFilesystemReadOnly
    expr: node_filesystem_readonly{fstype!~"tmpfs|fuse.lxcfs"} == 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Filesystem is read-only on {{ $labels.instance }}"
      description: "Filesystem has been read-only for more than 10 minutes on {{ $labels.instance }}."
Explanation: Fires when a filesystem on a node has been read-only for 10 minutes.
NodeLoadAverageHigh:
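  - alert: NodeLoadAverageHigh
    # Count CPUs per target; "without (cpu, mode)" keeps the instance and job labels so both sides match
    expr: node_load1 > 2 * count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High load average on {{ $labels.instance }}"
      description: "1-minute load average is more than twice the number of CPUs for over 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a node's 1-minute load average stays above twice its CPU count for 5 minutes. The CPU count is aggregated per target; a bare count() would return a single label-less series that never matches node_load1.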
NodeNetworkDown:
  - alert: NodeNetworkDown
    expr: up{job="node_exporter"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Node down: {{ $labels.instance }}"
      description: "Node has been down for more than 10 minutes."
Explanation: Fires when Prometheus has been unable to scrape a node's node_exporter for 10 minutes.
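A rule like this can be exercised with promtool's rule unit tests before it is deployed. The following is a minimal sketch, assuming the NodeNetworkDown rule is stored inside a group in a hypothetical file `node_alerts.yml`; the test file name and the target `host1:9100` are also made up for illustration:

```yaml
# node_alerts_test.yml (hypothetical file name)
rule_files:
  - node_alerts.yml              # assumed to contain the NodeNetworkDown rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The target is reported as down (up == 0) for samples at 0m through 15m
      - series: 'up{job="node_exporter", instance="host1:9100"}'
        values: '0x15'
    alert_rule_test:
      # After the 10m "for" period has elapsed, the alert should be firing
      - eval_time: 11m
        alertname: NodeNetworkDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node_exporter
              instance: host1:9100
            exp_annotations:
              summary: "Node down: host1:9100"
              description: "Node has been down for more than 10 minutes."
```

The test would be run with `promtool test rules node_alerts_test.yml`.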
NodeSwapUsageHigh:
  - alert: NodeSwapUsageHigh
    expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High swap usage on {{ $labels.instance }}"
      description: "Swap usage is above 50% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a node's swap usage stays above 50% for 5 minutes.
NodeFileSystemInodesUsageHigh:
  - alert: NodeFileSystemInodesUsageHigh
    expr: (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High filesystem inodes usage on {{ $labels.instance }}"
      description: "Filesystem inodes usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when inode usage on a node's filesystem stays above 80% for 5 minutes.
NodeTemperatureHigh:
  - alert: NodeTemperatureHigh
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High temperature on {{ $labels.instance }}"
      description: "Node temperature is above 75 degrees Celsius for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a temperature sensor on a node stays above 75 degrees Celsius for 5 minutes.
NodeProcessCountHigh:
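  - alert: NodeProcessCountHigh
    # node_processes_pids comes from the processes collector (--collector.processes), which is disabled by default
    expr: node_processes_pids > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High process count on {{ $labels.instance }}"
      description: "Number of processes is above 500 for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when the number of processes on a node stays above 500 for 5 minutes. The original expression counted node_scrape_collector_duration_seconds series, which measures the number of collectors rather than processes.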
windows_exporter alerting rules
WindowsCPUUsageHigh:
  - alert: WindowsCPUUsageHigh
    expr: avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) < 0.2
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a Windows node's CPU usage stays above 80% (idle time below 20%) for 5 minutes.
WindowsMemoryUsageHigh:
  - alert: WindowsMemoryUsageHigh
    # Parentheses are required: without them the division binds before the subtraction
    expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High Memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a Windows node's memory usage stays above 90% for 5 minutes.
WindowsDiskUsageHigh:
  - alert: WindowsDiskUsageHigh
    expr: windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100 < 15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk usage on {{ $labels.instance }}"
      description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when disk usage on a Windows node stays above 85% (free space below 15%) for 5 minutes.
WindowsNetworkDown:
  - alert: WindowsNetworkDown
    expr: up{job="windows_exporter"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Node down: {{ $labels.instance }}"
      description: "Node has been down for more than 10 minutes."
Explanation: Fires when Prometheus has been unable to scrape a Windows node's windows_exporter for 10 minutes.
WindowsServiceNotRunning:
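  - alert: WindowsServiceNotRunning
    # The service collector exposes one 0/1 series per service and state, so test the "running" state
    expr: windows_service_state{state="running"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Service not running on {{ $labels.instance }}"
      description: "A critical service is not running for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a service on a Windows node has not been running for 5 minutes. As written the expression matches every service; add a matcher on the name label to restrict it to critical services.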
WindowsPageFileUsageHigh:
  - alert: WindowsPageFileUsageHigh
    expr: windows_os_paging_free_bytes / windows_os_paging_limit_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Page File usage on {{ $labels.instance }}"
      description: "Page file usage is above 90% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a Windows node's page file usage stays above 90% (free space below 10%) for 5 minutes.
WindowsHandleCountHigh:
  - alert: WindowsHandleCountHigh
    expr: windows_process_handles_count > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Handle count on {{ $labels.instance }}"
      description: "Number of handles is above 10,000 for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when the handle count of a process on a Windows node stays above 10,000 for 5 minutes.
WindowsThreadCountHigh:
  - alert: WindowsThreadCountHigh
    expr: windows_process_threads > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Thread count on {{ $labels.instance }}"
      description: "Number of threads is above 500 for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when the thread count of a process on a Windows node stays above 500 for 5 minutes.
WindowsProcessorQueueLengthHigh:
  - alert: WindowsProcessorQueueLengthHigh
    expr: windows_system_processor_queue_length > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Processor Queue Length on {{ $labels.instance }}"
      description: "Processor queue length is above 10 for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a Windows node's processor queue length stays above 10 for 5 minutes.
WindowsDiskIOWaitHigh:
  - alert: WindowsDiskIOWaitHigh
    # Idle time is exported as the counter windows_logical_disk_idle_seconds_total
    expr: avg by (instance) (rate(windows_logical_disk_idle_seconds_total[5m])) < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk I/O Wait on {{ $labels.instance }}"
      description: "Disk I/O wait time is high for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when the disks on a Windows node are idle less than 10% of the time, on average, for 5 minutes, which indicates heavy disk I/O.
mysqld_exporter alerting rules
MySQLHighThreadsRunning:
  - alert: MySQLHighThreadsRunning
    expr: mysql_global_status_threads_running > 50
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High number of running threads in MySQL on {{ $labels.instance }}"
      description: "Number of running threads is above 50 for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL instance has more than 50 running threads for 5 minutes.
MySQLSlowQueries:
  - alert: MySQLSlowQueries
    expr: rate(mysql_global_status_slow_queries[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Slow queries detected on {{ $labels.instance }}"
      description: "One or more slow queries detected in the last 10 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL instance keeps logging slow queries for 10 minutes.
MySQLReplicationLag:
  - alert: MySQLReplicationLag
    expr: mysql_slave_status_seconds_behind_master > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Replication lag on MySQL slave {{ $labels.instance }}"
      description: "Replication lag is more than 10 seconds for over 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL replica's replication lag stays above 10 seconds for 5 minutes.
MySQLHighConnections:
  - alert: MySQLHighConnections
    expr: mysql_global_status_threads_connected > 200
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High number of connections to MySQL on {{ $labels.instance }}"
      description: "Number of connections is above 200 for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL instance has more than 200 connections for 5 minutes.
MySQLLowFreeDiskSpace:
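  - alert: MySQLLowFreeDiskSpace
    # Assumption: node_exporter runs on the MySQL host and the data directory is its own
    # mountpoint at /var/lib/mysql; adjust the mountpoint matcher to your layout
    expr: node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"} / node_filesystem_size_bytes{mountpoint="/var/lib/mysql"} * 100 < 10
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Low free disk space for MySQL on {{ $labels.instance }}"
      description: "Free disk space for MySQL data directory is below 10% for more than 10 minutes on {{ $labels.instance }}."
Explanation: Fires when free disk space for the MySQL data directory stays below 10% for 10 minutes. mysqld_exporter does not export disk capacity (mysql_global_variables_datadir is not a size), so the expression relies on node_exporter filesystem metrics for the data directory's mountpoint.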
MySQLHighQueryTime:
  - alert: MySQLHighQueryTime
    expr: rate(mysql_global_status_questions[5m]) > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High query rate on MySQL on {{ $labels.instance }}"
      description: "Query rate is above 1000 queries per second for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL instance's query rate stays above 1000 queries per second for 5 minutes.
MySQLInnoDBBufferPoolUsageHigh:
  - alert: MySQLInnoDBBufferPoolUsageHigh
    # The pool's configured size comes from the innodb_buffer_pool_size server variable
    expr: (mysql_global_status_innodb_buffer_pool_bytes_data / mysql_global_variables_innodb_buffer_pool_size) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High InnoDB buffer pool usage on MySQL on {{ $labels.instance }}"
      description: "InnoDB buffer pool usage is above 85% for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL instance's InnoDB buffer pool usage stays above 85% for 5 minutes.
MySQLMaxConnectionsReached:
  - alert: MySQLMaxConnectionsReached
    expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Max connections threshold reached on MySQL on {{ $labels.instance }}"
      description: "Connections usage is above 90% of the max_connections limit for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL instance's connection count stays above 90% of max_connections for 5 minutes.
MySQLTableLocksContention:
  - alert: MySQLTableLocksContention
    expr: rate(mysql_global_status_table_locks_waited[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Table locks contention in MySQL on {{ $labels.instance }}"
      description: "More than 10 table locks waited per second for more than 5 minutes on {{ $labels.instance }}."
Explanation: Fires when a MySQL instance sees more than 10 table lock waits per second for 5 minutes.
MySQLLongRunningQueries:
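  - alert: MySQLLongRunningQueries
    expr: mysql_global_status_queries{job="mysql"} > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Long running queries in MySQL on {{ $labels.instance }}"
      description: "More than 100 long running queries for over 10 minutes on {{ $labels.instance }}."
Explanation: Intended to fire when a MySQL instance has more than 100 long-running queries for 10 minutes. Note that mysql_global_status_queries is a cumulative counter of executed statements, so as written the expression compares the total since server start against 100 and does not measure query duration.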
blackbox_exporter alerting rules
EndpointDown:
  - alert: EndpointDown
    expr: probe_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Endpoint down: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has been down for more than 5 minutes."
Explanation: Fires when probes of an endpoint have failed continuously for 5 minutes.
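For `{{ $labels.instance }}` to show the probed endpoint rather than the exporter's own address, the scrape job has to pass the target to blackbox_exporter and relabel it. A minimal sketch of such a scrape job follows; the job name `blackbox_http`, the `http_2xx` module, the exporter address `127.0.0.1:9115`, and the example target are all assumptions, not taken from the original:

```yaml
# Fragment of prometheus.yml (all names and addresses below are assumptions)
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]             # module defined in the blackbox_exporter config
    static_configs:
      - targets:
          - https://example.com      # endpoint to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed endpoint as the instance label used in the annotations
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox_exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
```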
EndpointHighLatency:
  - alert: EndpointHighLatency
    expr: probe_duration_seconds > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has a latency higher than 0.5 seconds for more than 5 minutes."
Explanation: Fires when an endpoint's probe duration stays above 0.5 seconds for 5 minutes.
EndpointDNSResolutionFailure:
  - alert: EndpointDNSResolutionFailure
    expr: probe_dns_lookup_time_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "DNS resolution failure for endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has DNS resolution time higher than 2 seconds for more than 5 minutes."
Explanation: Fires when DNS resolution for an endpoint takes longer than 2 seconds for 5 minutes.
EndpointConnectionTimeout:
  - alert: EndpointConnectionTimeout
    # Uses the connect phase of the HTTP probe timing, so this applies to HTTP probes only
    expr: probe_http_duration_seconds{phase="connect"} > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Connection timeout on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has connection timeout higher than 1 second for more than 5 minutes."
Explanation: Fires when establishing a connection to an endpoint takes longer than 1 second for 5 minutes.
EndpointCertificateExpiry:
  - alert: EndpointCertificateExpiry
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expiry for endpoint: {{ $labels.instance }}"
      description: "The SSL certificate for endpoint {{ $labels.instance }} will expire in less than 7 days."
Explanation: Fires when an endpoint's SSL certificate will expire within 7 days.
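The threshold `86400 * 7` is 7 days expressed in seconds (86400 seconds per day, 604800 in total), compared against the seconds remaining until the certificate's expiry timestamp. The same pattern can be reused with a shorter window for a more urgent tier. The rule below is a hypothetical companion, not part of the original set; its name and the 1-day threshold are assumptions:

```yaml
  # Hypothetical companion rule: escalate when less than 1 day (86400 s) remains
  - alert: EndpointCertificateExpiryCritical
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 1
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "SSL certificate about to expire for endpoint: {{ $labels.instance }}"
      description: "The SSL certificate for endpoint {{ $labels.instance }} will expire in less than 1 day."
```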
HTTPResponseCodeMismatch:
  - alert: HTTPResponseCodeMismatch
    expr: probe_http_status_code != 200
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unexpected HTTP response code on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} returned an unexpected HTTP response code for more than 5 minutes."
Explanation: Fires when an endpoint returns an HTTP status code other than 200 for 5 minutes.
EndpointHTTPRedirects:
  - alert: EndpointHTTPRedirects
    expr: probe_http_redirects > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Too many HTTP redirects on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} encountered more than 5 HTTP redirects for more than 5 minutes."
Explanation: Fires when probes of an endpoint follow more than 5 HTTP redirects for 5 minutes.
EndpointDNSLookupFailures:
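  - alert: EndpointDNSLookupFailures
    expr: increase(probe_dns_lookup_time_seconds[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DNS lookup failures on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} had DNS lookup failures for more than 10 minutes."
Explanation: Intended to fire when an endpoint has DNS lookup failures for 10 minutes. Note that probe_dns_lookup_time_seconds is a gauge of lookup duration, so increase() over it does not count failures; a failed probe is reported through probe_success == 0.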
EndpointTCPConnectionRefused:
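  - alert: EndpointTCPConnectionRefused
    # Probe results are reported through probe_success; the job name blackbox_tcp is an
    # assumption for a scrape job that uses a TCP probe module
    expr: probe_success{job="blackbox_tcp"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "TCP connection refused on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} refused TCP connections for more than 5 minutes."
Explanation: Fires when TCP connections to an endpoint have failed continuously for 5 minutes.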
EndpointSSLCertificateInvalid:
  - alert: EndpointSSLCertificateInvalid
    expr: probe_ssl_last_chain_expiry_timestamp_seconds - time() < 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Invalid SSL certificate on endpoint: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} has an invalid SSL certificate."
Explanation: Fires when an endpoint's SSL certificate chain has already expired, making the certificate invalid.
cAdvisor alerting rules
ContainerCPUUsageHigh:
  - alert: ContainerCPUUsageHigh
    expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has CPU usage above 80% for more than 5 minutes."
Explanation: Fires when a container's CPU usage stays above 80% of one core for 5 minutes.
ContainerMemoryUsageHigh:
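  - alert: ContainerMemoryUsageHigh
    # Containers without a memory limit report a limit of 0; filter them out to avoid dividing by zero
    expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has memory usage above 90% for more than 5 minutes."
Explanation: Fires when a container's memory usage stays above 90% of its memory limit for 5 minutes. Containers with no limit configured are skipped.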
ContainerDiskIOHigh:
  - alert: ContainerDiskIOHigh
    expr: rate(container_fs_io_time_seconds_total[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk I/O on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has disk I/O usage above 50% for more than 5 minutes."
Explanation: Fires when a container spends more than 50% of the time doing disk I/O for 5 minutes.
ContainerNetworkErrors:
  - alert: ContainerNetworkErrors
    expr: rate(container_network_receive_errors_total[5m]) > 0 or rate(container_network_transmit_errors_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network errors on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} is experiencing network errors for more than 5 minutes."
Explanation: Fires when a container keeps reporting network receive or transmit errors for 5 minutes.
ContainerRestarting:
  - alert: ContainerRestarting
    # cAdvisor has no restart counter; count changes of the container start time instead
    expr: changes(container_start_time_seconds[5m]) > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Container restarting frequently: {{ $labels.container }}"
      description: "Container {{ $labels.container }} has restarted more than once in the last 5 minutes."
Explanation: Fires when a container restarts more than once within a 5-minute window.
ContainerOOMKilled:
  - alert: ContainerOOMKilled
    # PromQL cannot compare against strings; use cAdvisor's OOM event counter instead
    expr: increase(container_oom_events_total[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Container OOMKilled: {{ $labels.container }}"
      description: "Container {{ $labels.container }} was killed due to out of memory in the last 5 minutes."
Explanation: Fires as soon as a container has been killed for running out of memory within the last 5 minutes.
ContainerFileSystemFull:
  - alert: ContainerFileSystemFull
    expr: container_fs_usage_bytes / container_fs_limit_bytes > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "File system full on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} file system usage is above 90% for more than 5 minutes."
Explanation: Fires when a container's filesystem usage stays above 90% for 5 minutes.
ContainerMemoryLeak:
  - alert: ContainerMemoryLeak
    # container_memory_usage_bytes is a gauge, so use delta() rather than increase()
    expr: delta(container_memory_usage_bytes[1h]) > 1000000000
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Possible memory leak on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} has increased its memory usage by more than 1 GB in the last hour."
Explanation: Fires when a container's memory usage has grown by more than 1 GB over the past hour, which may indicate a memory leak.
ContainerHighCPUThrottling:
  - alert: ContainerHighCPUThrottling
    # Compare throttled periods with total periods so that both sides count CFS periods
    expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU throttling on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} is experiencing CPU throttling more than 20% of the time for more than 5 minutes."
Explanation: Fires when a container is throttled in more than 20% of its CFS scheduling periods for 5 minutes.
ContainerNetworkTrafficDrop:
  - alert: ContainerNetworkTrafficDrop
    expr: rate(container_network_receive_packets_dropped_total[5m]) > 0 or rate(container_network_transmit_packets_dropped_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network traffic drops on container {{ $labels.container }}"
      description: "Container {{ $labels.container }} is experiencing network packet drops for more than 5 minutes."
Explanation: Fires when a container keeps dropping received or transmitted network packets for 5 minutes.
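These alerting rules are built on different exporters and help you monitor the health of systems and services. With suitable thresholds and conditions, you can detect and resolve potential problems early and keep your systems stable and performant.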
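Every rule carries a `severity` label of either `warning` or `critical`, which Alertmanager can use to decide where a notification goes. A minimal sketch of such routing, assuming Alertmanager 0.22 or later for the `matchers` syntax; the receiver names `default` and `oncall` are hypothetical, and their notification settings are left out:

```yaml
# Fragment of alertmanager.yml (receiver names are hypothetical)
route:
  receiver: default                  # warnings and anything unmatched land here
  group_by: [alertname, instance]
  routes:
    - matchers:
        - severity="critical"        # label set by the rules in this document
      receiver: oncall
receivers:
  - name: default
  - name: oncall
```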