Common alerting rules

node_exporter alerting rules

NodeCPUUsageHigh:

alert: NodeCPUUsageHigh
expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 5m
labels:
  severity: critical
annotations:
  summary: "High CPU usage detected on {{ $labels.instance }}"
  description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's CPU usage stays above 80% for 5 minutes.
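
The rules in this document are shown as bare entries. To load one, Prometheus expects it inside a rule group, in a file listed under `rule_files` in `prometheus.yml`. A minimal sketch, assuming a hypothetical file name `node_alerts.yml` and a hypothetical group name `node_exporter_alerts`:

```yaml
# node_alerts.yml (file name and group name are illustrative assumptions)
groups:
  - name: node_exporter_alerts
    rules:
      - alert: NodeCPUUsageHigh
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
```

Running `promtool check rules node_alerts.yml` should catch syntax errors before Prometheus reloads the file.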

NodeMemoryUsageHigh:

alert: NodeMemoryUsageHigh
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
  severity: critical
annotations:
  summary: "High Memory usage detected on {{ $labels.instance }}"
  description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's memory usage stays above 90% for 5 minutes.

NodeDiskUsageHigh:

alert: NodeDiskUsageHigh
expr: (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} - node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"}) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} * 100 > 85
for: 5m
labels:
  severity: warning
annotations:
  summary: "High Disk usage detected on {{ $labels.instance }}"
  description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a filesystem on a node stays above 85% usage for 5 minutes.
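
A static threshold reacts only once the disk is already nearly full. A complementary sketch, not part of the original rule set, uses `predict_linear` to warn when the available space is projected to run out. The alert name, the 6-hour lookback, the 24-hour horizon, and the 30-minute `for` are assumptions chosen for illustration:

```yaml
- alert: NodeDiskWillFillIn24Hours   # hypothetical name
  # Extrapolate the last 6h of available bytes 24h (86400s) into the future;
  # a negative projection means the filesystem would be full by then.
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}[6h], 86400) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk predicted to fill on {{ $labels.instance }}"
    description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is projected to run out of space within 24 hours."
```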

NodeFilesystemReadOnly:

alert: NodeFilesystemReadOnly
expr: node_filesystem_readonly{fstype!~"tmpfs|fuse.lxcfs"} == 1
for: 10m
labels:
  severity: critical
annotations:
  summary: "Filesystem is read-only on {{ $labels.instance }}"
  description: "Filesystem has been read-only for more than 10 minutes on {{ $labels.instance }}."

Explanation: Fires when a filesystem on a node has been read-only for 10 minutes.

NodeLoadAverageHigh:

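alert: NodeLoadAverageHigh
expr: node_load1 > on (instance) (2 * count by (instance) (node_cpu_seconds_total{mode="system"}))
for: 5m
labels:
  severity: warning
annotations:
  summary: "High load average on {{ $labels.instance }}"
  description: "1-minute load average is more than twice the number of CPUs for over 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's 1-minute load average stays above twice its CPU count for 5 minutes. The CPUs are counted per instance, and `on (instance)` matches each node's load against its own CPU count.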
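
The per-node CPU count used by this rule can be precomputed with a recording rule so that the alert expression stays short. A sketch, assuming a hypothetical recorded series name `instance:node_cpus:count` and a hypothetical group name `node_load`; annotations are left out for brevity:

```yaml
groups:
  - name: node_load   # hypothetical group name
    rules:
      # One series per instance holding the number of logical CPUs.
      - record: instance:node_cpus:count
        expr: count by (instance) (node_cpu_seconds_total{mode="idle"})
      - alert: NodeLoadAverageHigh
        # on (instance) matches the two sides by the instance label only.
        expr: node_load1 > on (instance) (2 * instance:node_cpus:count)
        for: 5m
        labels:
          severity: warning
```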

NodeNetworkDown:

alert: NodeNetworkDown
expr: up{job="node_exporter"} == 0
for: 10m
labels:
  severity: critical
annotations:
  summary: "Node down: {{ $labels.instance }}"
  description: "Node has been down for more than 10 minutes."

Explanation: Fires when Prometheus has been unable to scrape a node's node_exporter for 10 minutes.
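
Rules like this one can be checked offline with `promtool test rules`. The sketch below assumes the rule is saved in a hypothetical file `node_alerts.yml` and uses a made-up target `host1:9100`. It feeds in a series where `up` stays at 0 and expects the alert to be firing at the 10-minute mark:

```yaml
# node_alerts_test.yml (hypothetical file name)
rule_files:
  - node_alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 'up' is 0 for 16 consecutive samples (0m through 15m).
      - series: 'up{job="node_exporter", instance="host1:9100"}'
        values: '0x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: NodeNetworkDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node_exporter
              instance: host1:9100
            exp_annotations:
              summary: "Node down: host1:9100"
              description: "Node has been down for more than 10 minutes."
```

The test would be run with `promtool test rules node_alerts_test.yml`.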

NodeSwapUsageHigh:

alert: NodeSwapUsageHigh
expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
for: 5m
labels:
  severity: warning
annotations:
  summary: "High swap usage on {{ $labels.instance }}"
  description: "Swap usage is above 50% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's swap usage stays above 50% for 5 minutes.

NodeFileSystemInodesUsageHigh:

alert: NodeFileSystemInodesUsageHigh
expr: (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100 > 80
for: 5m
labels:
  severity: warning
annotations:
  summary: "High filesystem inodes usage on {{ $labels.instance }}"
  description: "Filesystem inodes usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when inode usage on a node's filesystem stays above 80% for 5 minutes.

NodeTemperatureHigh:

alert: NodeTemperatureHigh
expr: node_hwmon_temp_celsius > 75
for: 5m
labels:
  severity: warning
annotations:
  summary: "High temperature on {{ $labels.instance }}"
  description: "Node temperature is above 75 degrees Celsius for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a temperature sensor on a node reads above 75 degrees Celsius for 5 minutes.

NodeProcessCountHigh:

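alert: NodeProcessCountHigh
expr: node_processes_pids > 500
for: 5m
labels:
  severity: warning
annotations:
  summary: "High process count on {{ $labels.instance }}"
  description: "Number of processes is above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the number of processes on a node stays above 500 for 5 minutes. The `node_processes_pids` metric comes from node_exporter's `processes` collector, which has to be enabled explicitly (`--collector.processes`). Counting `node_scrape_collector_duration_seconds` series would count collectors, not processes.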

windows_exporter alerting rules

WindowsCPUUsageHigh:

alert: WindowsCPUUsageHigh
expr: avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) < 0.2
for: 5m
labels:
  severity: critical
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's CPU usage stays above 80% (idle time below 20%) for 5 minutes.

WindowsMemoryUsageHigh:

alert: WindowsMemoryUsageHigh
expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes * 100 > 90
for: 5m
labels:
  severity: critical
annotations:
  summary: "High Memory usage on {{ $labels.instance }}"
  description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's memory usage stays above 90% for 5 minutes. The parentheses around the subtraction are required because division and multiplication bind tighter than subtraction in PromQL.

WindowsDiskUsageHigh:

alert: WindowsDiskUsageHigh
expr: windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100 < 15
for: 5m
labels:
  severity: warning
annotations:
  summary: "High Disk usage on {{ $labels.instance }}"
  description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a volume on a Windows node stays above 85% usage (less than 15% free) for 5 minutes.

WindowsNetworkDown:

alert: WindowsNetworkDown
expr: up{job="windows_exporter"} == 0
for: 10m
labels:
  severity: critical
annotations:
  summary: "Node down: {{ $labels.instance }}"
  description: "Node has been down for more than 10 minutes."

Explanation: Fires when Prometheus has been unable to scrape a Windows node's windows_exporter for 10 minutes.

WindowsServiceNotRunning:

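alert: WindowsServiceNotRunning
expr: windows_service_state{state="running"} == 0
for: 5m
labels:
  severity: critical
annotations:
  summary: "Service not running on {{ $labels.instance }}"
  description: "A critical service is not running for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a service on a Windows node has not been in the running state for 5 minutes. This assumes a windows_exporter version that exposes `windows_service_state` with a `state` label. As written, the rule matches every service the exporter reports, so in practice a `name` matcher is needed to restrict it to the services that are actually critical.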

WindowsPageFileUsageHigh:

alert: WindowsPageFileUsageHigh
expr: windows_os_paging_free_bytes / windows_os_paging_limit_bytes * 100 < 10
for: 5m
labels:
  severity: warning
annotations:
  summary: "High Page File usage on {{ $labels.instance }}"
  description: "Page file usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's page file usage stays above 90% (less than 10% free) for 5 minutes.

WindowsHandleCountHigh:

alert: WindowsHandleCountHigh
expr: windows_process_handles > 10000
for: 5m
labels:
  severity: warning
annotations:
  summary: "High Handle count on {{ $labels.instance }}"
  description: "Number of handles is above 10,000 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a process on a Windows node holds more than 10,000 handles for 5 minutes. The metric is reported per process, so the threshold applies to each process rather than to the node as a whole.

WindowsThreadCountHigh:

alert: WindowsThreadCountHigh
expr: windows_process_threads > 500
for: 5m
labels:
  severity: warning
annotations:
  summary: "High Thread count on {{ $labels.instance }}"
  description: "Number of threads is above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a process on a Windows node runs more than 500 threads for 5 minutes. The metric is reported per process.

WindowsProcessorQueueLengthHigh:

alert: WindowsProcessorQueueLengthHigh
expr: windows_system_processor_queue_length > 10
for: 5m
labels:
  severity: warning
annotations:
  summary: "High Processor Queue Length on {{ $labels.instance }}"
  description: "Processor queue length is above 10 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's processor queue length stays above 10 for 5 minutes.

WindowsDiskIOWaitHigh:

alert: WindowsDiskIOWaitHigh
expr: avg by (instance) (rate(windows_logical_disk_idle_seconds_total{instance=~"[^:]+:.*"}[5m])) < 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "High Disk I/O Wait on {{ $labels.instance }}"
  description: "Disk I/O wait time is high for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the disks of a Windows node are idle less than 10% of the time, that is, busy more than 90% of the time, for 5 minutes.

mysqld_exporter alerting rules

MySQLHighThreadsRunning:

alert: MySQLHighThreadsRunning
expr: mysql_global_status_threads_running > 50
for: 5m
labels:
  severity: critical
annotations:
  summary: "High number of running threads in MySQL on {{ $labels.instance }}"
  description: "Number of running threads is above 50 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance has more than 50 running threads for 5 minutes.

MySQLSlowQueries:

alert: MySQLSlowQueries
expr: rate(mysql_global_status_slow_queries[5m]) > 0
for: 10m
labels:
  severity: warning
annotations:
  summary: "Slow queries detected on {{ $labels.instance }}"
  description: "One or more slow queries detected in the last 10 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance keeps recording slow queries for 10 minutes.

MySQLReplicationLag:

alert: MySQLReplicationLag
expr: mysql_slave_status_seconds_behind_master > 10
for: 5m
labels:
  severity: critical
annotations:
  summary: "Replication lag on MySQL slave {{ $labels.instance }}"
  description: "Replication lag is more than 10 seconds for over 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL replica lags more than 10 seconds behind its source for 5 minutes.

MySQLHighConnections:

alert: MySQLHighConnections
expr: mysql_global_status_threads_connected > 200
for: 5m
labels:
  severity: critical
annotations:
  summary: "High number of connections to MySQL on {{ $labels.instance }}"
  description: "Number of connections is above 200 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance has more than 200 open connections for 5 minutes.

MySQLLowFreeDiskSpace:

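alert: MySQLLowFreeDiskSpace
expr: node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} * 100 < 10
for: 10m
labels:
  severity: critical
annotations:
  summary: "Low free disk space for MySQL on {{ $labels.instance }}"
  description: "Free disk space for MySQL data directory is below 10% for more than 10 minutes on {{ $labels.instance }}."

Explanation: Fires when free disk space stays below 10% for 10 minutes. mysqld_exporter does not report disk sizes (`datadir` is a path, not a byte count), so the expression relies on node_exporter filesystem metrics from the database host. As written it covers every filesystem on that host; add a `mountpoint` matcher to limit it to the filesystem that holds the MySQL data directory.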

MySQLHighQueryTime:

alert: MySQLHighQueryTime
expr: rate(mysql_global_status_questions[5m]) > 1000
for: 5m
labels:
  severity: warning
annotations:
  summary: "High query rate on MySQL on {{ $labels.instance }}"
  description: "Query rate is above 1000 queries per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance handles more than 1,000 queries per second for 5 minutes. Despite the rule name, it measures query rate rather than query duration.

MySQLInnoDBBufferPoolUsageHigh:

alert: MySQLInnoDBBufferPoolUsageHigh
expr: (mysql_global_status_innodb_buffer_pool_bytes_data / mysql_global_variables_innodb_buffer_pool_size) * 100 > 85
for: 5m
labels:
  severity: warning
annotations:
  summary: "High InnoDB buffer pool usage on MySQL on {{ $labels.instance }}"
  description: "InnoDB buffer pool usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the InnoDB buffer pool of a MySQL instance stays above 85% usage for 5 minutes. Usage is the number of bytes holding data divided by the configured `innodb_buffer_pool_size`.

MySQLMaxConnectionsReached:

alert: MySQLMaxConnectionsReached
expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 90
for: 5m
labels:
  severity: critical
annotations:
  summary: "Max connections threshold reached on MySQL on {{ $labels.instance }}"
  description: "Connections usage is above 90% of the max_connections limit for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance's connection count stays above 90% of `max_connections` for 5 minutes.

MySQLTableLocksContention:

alert: MySQLTableLocksContention
expr: rate(mysql_global_status_table_locks_waited[5m]) > 10
for: 5m
labels:
  severity: warning
annotations:
  summary: "Table locks contention in MySQL on {{ $labels.instance }}"
  description: "More than 10 table locks waited per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance has more than 10 table lock waits per second for 5 minutes.

MySQLLongRunningQueries:

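alert: MySQLLongRunningQueries
expr: mysql_global_status_queries{job="mysql"} > 100
for: 10m
labels:
  severity: warning
annotations:
  summary: "Long running queries in MySQL on {{ $labels.instance }}"
  description: "More than 100 long running queries for over 10 minutes on {{ $labels.instance }}."

Explanation: Intended to fire when more than 100 long-running queries persist for 10 minutes. Note that `mysql_global_status_queries` is the cumulative number of statements executed since server start, so this expression does not actually measure long-running queries; that needs per-query data such as what mysqld_exporter's processlist collector provides.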

blackbox_exporter alerting rules

EndpointDown:

alert: EndpointDown
expr: probe_success == 0
for: 5m
labels:
  severity: critical
annotations:
  summary: "Endpoint down: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} has been down for more than 5 minutes."

Explanation: Fires when probes of an endpoint have been failing for 5 minutes.
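
For `{{ $labels.instance }}` to show the probed endpoint rather than the exporter's own address, the scrape job usually rewrites labels as sketched below. This follows the commonly used relabeling pattern for blackbox_exporter; the job name, the `http_2xx` module, the target URL, and the exporter address `127.0.0.1:9115` are assumptions about the deployment:

```yaml
scrape_configs:
  - job_name: blackbox_http          # hypothetical job name
    metrics_path: /probe
    params:
      module: [http_2xx]             # module assumed to exist in blackbox.yml
    static_configs:
      - targets:
          - https://example.com      # placeholder endpoint to probe
    relabel_configs:
      # Pass the endpoint to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Expose the endpoint as the instance label used in the annotations.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox_exporter itself.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```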

EndpointHighLatency:

alert: EndpointHighLatency
expr: probe_duration_seconds > 0.5
for: 5m
labels:
  severity: warning
annotations:
  summary: "High latency on endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} has a latency higher than 0.5 seconds for more than 5 minutes."

Explanation: Fires when probes of an endpoint take longer than 0.5 seconds for 5 minutes.

EndpointDNSResolutionFailure:

alert: EndpointDNSResolutionFailure
expr: probe_dns_lookup_time_seconds > 2
for: 5m
labels:
  severity: warning
annotations:
  summary: "DNS resolution failure for endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} has DNS resolution time higher than 2 seconds for more than 5 minutes."

Explanation: Fires when DNS resolution for an endpoint takes longer than 2 seconds for 5 minutes.

EndpointConnectionTimeout:

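alert: EndpointConnectionTimeout
expr: probe_http_duration_seconds{phase="connect"} > 1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Connection timeout on endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} has connection timeout higher than 1 second for more than 5 minutes."

Explanation: Fires when establishing a connection to an endpoint takes longer than 1 second for 5 minutes. For HTTP probes, blackbox_exporter reports connection time as the `connect` phase of `probe_http_duration_seconds`.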

EndpointCertificateExpiry:

alert: EndpointCertificateExpiry
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
for: 1h
labels:
  severity: warning
annotations:
  summary: "SSL certificate expiry for endpoint: {{ $labels.instance }}"
  description: "The SSL certificate for endpoint {{ $labels.instance }} will expire in less than 7 days."

Explanation: Fires when an endpoint's SSL certificate will expire within 7 days (86400 seconds per day times 7).

HTTPResponseCodeMismatch:

alert: HTTPResponseCodeMismatch
expr: probe_http_status_code != 200
for: 5m
labels:
  severity: warning
annotations:
  summary: "Unexpected HTTP response code on endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} returned an unexpected HTTP response code for more than 5 minutes."

Explanation: Fires when an endpoint returns an HTTP status code other than 200 for 5 minutes.

EndpointHTTPRedirects:

alert: EndpointHTTPRedirects
expr: probe_http_redirects > 5
for: 5m
labels:
  severity: warning
annotations:
  summary: "Too many HTTP redirects on endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} encountered more than 5 HTTP redirects for more than 5 minutes."

Explanation: Fires when probes of an endpoint follow more than 5 HTTP redirects for 5 minutes.

EndpointDNSLookupFailures:

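alert: EndpointDNSLookupFailures
expr: probe_success{job="blackbox_dns"} == 0
for: 10m
labels:
  severity: warning
annotations:
  summary: "DNS lookup failures on endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} had DNS lookup failures for more than 10 minutes."

Explanation: Fires when DNS probes of an endpoint have been failing for 10 minutes. `probe_dns_lookup_time_seconds` is a gauge of lookup duration, so applying `increase` to it cannot count failures; a failed lookup shows up as `probe_success == 0` on a probe that uses the DNS module. The job name `blackbox_dns` is a placeholder for whichever scrape job runs those probes.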

EndpointTCPConnectionRefused:

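alert: EndpointTCPConnectionRefused
expr: probe_success{job="blackbox_tcp"} == 0
for: 5m
labels:
  severity: critical
annotations:
  summary: "TCP connection refused on endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} refused TCP connections for more than 5 minutes."

Explanation: Fires when TCP probes of an endpoint have been failing for 5 minutes. blackbox_exporter reports the outcome of a TCP probe through `probe_success`; the job name `blackbox_tcp` is a placeholder for whichever scrape job uses the TCP module.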

EndpointSSLCertificateInvalid:

alert: EndpointSSLCertificateInvalid
expr: probe_ssl_last_chain_expiry_timestamp_seconds - time() < 0
for: 5m
labels:
  severity: critical
annotations:
  summary: "Invalid SSL certificate on endpoint: {{ $labels.instance }}"
  description: "The endpoint {{ $labels.instance }} has an invalid SSL certificate."

Explanation: Fires when the certificate chain presented by an endpoint has already expired. Other kinds of invalid certificate are not covered by this expression.

cAdvisor alerting rules

ContainerCPUUsageHigh:

alert: ContainerCPUUsageHigh
expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
for: 5m
labels:
  severity: warning
annotations:
  summary: "High CPU usage on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} has CPU usage above 80% for more than 5 minutes."

Explanation: Fires when a container uses more than 0.8 CPU cores, that is, 80% of one core, for 5 minutes. The threshold is relative to a single core, not to the container's CPU limit.

ContainerMemoryUsageHigh:

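alert: ContainerMemoryUsageHigh
expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
for: 5m
labels:
  severity: critical
annotations:
  summary: "High memory usage on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} has memory usage above 90% for more than 5 minutes."

Explanation: Fires when a container's memory usage stays above 90% of its limit for 5 minutes. The `> 0` filter skips containers without a memory limit, whose limit is reported as 0 and would otherwise turn the ratio into +Inf.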
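
`container_memory_usage_bytes` includes page cache that the kernel can reclaim, so it may overstate memory pressure. A variant sketch based on `container_memory_working_set_bytes`, which cAdvisor also exports; the alert name is hypothetical and the 90% threshold is carried over from the rule above:

```yaml
- alert: ContainerMemoryWorkingSetHigh   # hypothetical name
  # Working set excludes reclaimable cache; "> 0" drops containers with no limit.
  expr: container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High working-set memory on container {{ $labels.container }}"
    description: "Container {{ $labels.container }} has a working set above 90% of its memory limit for more than 5 minutes."
```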

ContainerDiskIOHigh:

alert: ContainerDiskIOHigh
expr: rate(container_fs_io_time_seconds_total[5m]) > 0.5
for: 5m
labels:
  severity: warning
annotations:
  summary: "High disk I/O on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} has disk I/O usage above 50% for more than 5 minutes."

Explanation: Fires when a container spends more than 50% of the time on disk I/O for 5 minutes.

ContainerNetworkErrors:

alert: ContainerNetworkErrors
expr: rate(container_network_receive_errors_total[5m]) > 0 or rate(container_network_transmit_errors_total[5m]) > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Network errors on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} is experiencing network errors for more than 5 minutes."

Explanation: Fires when a container keeps seeing receive or transmit errors for 5 minutes.

ContainerRestarting:

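alert: ContainerRestarting
expr: changes(container_start_time_seconds[5m]) > 1
for: 5m
labels:
  severity: critical
annotations:
  summary: "Container restarting frequently: {{ $labels.container }}"
  description: "Container {{ $labels.container }} has restarted more than once in the last 5 minutes."

Explanation: Fires when a container's start time changes more than once within 5 minutes, which indicates repeated restarts. `container_restart_count` is not a metric cAdvisor is known to export, so the rule watches `container_start_time_seconds` instead. This only works when the restarted container keeps the same label set.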

ContainerOOMKilled:

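alert: ContainerOOMKilled
expr: increase(container_oom_events_total[5m]) > 0
for: 0m
labels:
  severity: critical
annotations:
  summary: "Container OOMKilled: {{ $labels.container }}"
  description: "Container {{ $labels.container }} was killed due to out of memory in the last 5 minutes."

Explanation: Fires as soon as a container has recorded an out-of-memory kill in the last 5 minutes. The rule uses cAdvisor's `container_oom_events_total` counter, because PromQL cannot compare a sample against a string such as "OOMKilled". `for` is set to 0m since the 5-minute window is already part of the expression.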

ContainerFileSystemFull:

alert: ContainerFileSystemFull
expr: container_fs_usage_bytes / container_fs_limit_bytes > 0.9
for: 5m
labels:
  severity: critical
annotations:
  summary: "File system full on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} file system usage is above 90% for more than 5 minutes."

Explanation: Fires when a container's filesystem usage stays above 90% for 5 minutes.

ContainerMemoryLeak:

alert: ContainerMemoryLeak
expr: delta(container_memory_usage_bytes[1h]) > 1000000000
for: 1h
labels:
  severity: warning
annotations:
  summary: "Possible memory leak on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} has increased its memory usage by more than 1 GB in the last hour."

Explanation: Fires when a container's memory usage has grown by more than 1 GB over the past hour, which may indicate a memory leak. `delta` is used rather than `increase` because memory usage is a gauge, not a counter.

ContainerHighCPUThrottling:

alert: ContainerHighCPUThrottling
expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.2
for: 5m
labels:
  severity: warning
annotations:
  summary: "High CPU throttling on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} is experiencing CPU throttling more than 20% of the time for more than 5 minutes."

Explanation: Fires when more than 20% of a container's CFS scheduling periods are throttled over 5 minutes. Both sides of the ratio count periods, so the result is a dimensionless fraction.

ContainerNetworkTrafficDrop:

alert: ContainerNetworkTrafficDrop
expr: rate(container_network_receive_packets_dropped_total[5m]) > 0 or rate(container_network_transmit_packets_dropped_total[5m]) > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Network traffic drops on container {{ $labels.container }}"
  description: "Container {{ $labels.container }} is experiencing network packet drops for more than 5 minutes."

Explanation: Fires when a container keeps dropping received or transmitted packets for 5 minutes.

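These alerting rules, grouped by exporter, help monitor the health of systems and services. With thresholds and conditions tuned to your environment, you can detect and resolve potential problems early and keep systems stable and performant.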
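
The `severity` label carried by each rule is what Alertmanager typically uses to decide where a notification goes. A minimal routing sketch, assuming hypothetical receiver names (`default`, `oncall`, `team-chat`) with their notification settings left out, and an Alertmanager version that supports the `matchers` syntax:

```yaml
# alertmanager.yml fragment; receiver names are illustrative assumptions
route:
  receiver: default
  routes:
    # Critical alerts go to the on-call receiver.
    - matchers:
        - severity="critical"
      receiver: oncall
    # Warnings go to a lower-urgency channel.
    - matchers:
        - severity="warning"
      receiver: team-chat
receivers:
  # Each receiver would carry its own email, webhook, or chat configuration.
  - name: default
  - name: oncall
  - name: team-chat
```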