通过部署etcd-exporter+Prometheus,然后配置etcd相关告警可以及时发现etcd集群风险
常见监控项目
1. etcd集群无leader
Etcd cluster have no leader
- alert:EtcdNoLeader
expr: etcd_server_has_leader ==0
for:0m
labels:
severity: critical
annotations:
summary:EtcdnoLeader(instance {{ $labels.instance }})
description: "Etcd cluster have no leader\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"```
2. etcd grpc请求慢
GRPC requests slowing down, 99th percentile is over 0.15s
- alert:EtcdGrpcRequestsSlow
expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[1m]))by(grpc_service, grpc_method, le))>0.15
for:2m
labels:
severity: warning
annotations:
summary:Etcd GRPC requests slow (instance {{ $labels.instance }})
description: "GRPC requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
3. etcd http请求慢
HTTP requests slowing down, 99th percentile is over 0.15s
- alert: EtcdHttpRequestsSlow
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15
for: 2m
labels:
severity: warning
annotations:
summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
description: "HTTP requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
4. Etcd member communication slow
Etcd member communication slowing down, 99th percentile is over 0.15s
- alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15
for: 2m
labels:
severity: warning
annotations:
summary: Etcd member communication slow (instance {{ $labels.instance }})
description: "Etcd member communication slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
5. Etcd high fsync durations
Etcd WAL fsync duration increasing, 99th percentile is over 0.5s
- alert: EtcdHighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5
for: 2m
labels:
severity: warning
annotations:
summary: Etcd high fsync durations (instance {{ $labels.instance }})
description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n VALUE = {{ $