在建立了完整的CI/CD流水线后,我们现在需要配置智能的告警系统和自动扩缩容机制,确保应用在生产环境中能够自动应对流量变化并保持高可用性。
监控告警系统配置
Prometheus告警规则配置
yaml
# prometheus/alerts.yml - Prometheus告警规则配置
groups:
- name: springboot-app-alerts
rules:
# 应用可用性告警
- alert: SpringBootAppDown
expr: up{job="springboot-app"} == 0
for: 1m
labels:
severity: critical
team: backend
annotations:
summary: "SpringBoot应用宕机 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 已经宕机超过1分钟。"
runbook: "https://wiki.company.com/runbooks/springboot-app-down"
- alert: SpringBootAppHighErrorRate
expr: sum(rate(http_server_requests_seconds_count{job="springboot-app", status=~"5.."}[5m])) by (instance) / sum(rate(http_server_requests_seconds_count{job="springboot-app"}[5m])) by (instance) > 0.1
for: 2m
labels:
severity: critical
team: backend
annotations:
summary: "SpringBoot应用错误率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 5xx错误率超过10%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-high-error-rate"
# JVM性能告警
- alert: SpringBootAppHighMemoryUsage
expr: (sum(jvm_memory_used_bytes{job="springboot-app", area="heap"}) by (instance) / sum(jvm_memory_max_bytes{job="springboot-app", area="heap"}) by (instance)) > 0.8
for: 3m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用内存使用率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 堆内存使用率超过80%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-high-memory"
- alert: SpringBootAppHighCPUUsage
expr: process_cpu_usage{job="springboot-app"} > 0.8
for: 3m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用CPU使用率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} CPU使用率超过80%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-high-cpu"
# 垃圾回收告警
- alert: SpringBootAppHighGC
expr: rate(jvm_gc_pause_seconds_sum{job="springboot-app"}[5m]) > 0.1
for: 2m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用GC停顿时间过长 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} GC停顿时间过高,过去5分钟平均: {{ $value }}秒"
runbook: "https://wiki.company.com/runbooks/springboot-high-gc"
# 应用性能告警
- alert: SpringBootAppHighLatency
expr: histogram_quantile(0.95, rate(http_server_requests_seconds_bucket{job="springboot-app"}[5m])) > 2
for: 3m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用响应时间过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 95%响应时间超过2秒,当前值: {{ $value }}秒"
runbook: "https://wiki.company.com/runbooks/springboot-high-latency"
- alert: SpringBootAppLowThroughput
expr: rate(http_server_requests_seconds_count{job="springboot-app"}[5m]) < 10
for: 5m
labels:
severity: info
team: backend
annotations:
summary: "SpringBoot应用吞吐量过低 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 吞吐量异常低,过去5分钟平均: {{ $value }} req/s"
runbook: "https://wiki.company.com/runbooks/springboot-low-throughput"
# 数据库连接告警
- alert: SpringBootAppHighDatabaseConnections
expr: spring_datasource_max_connections{job="springboot-app"} - spring_datasource_active_connections{job="springboot-app"} < 5
for: 2m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用数据库连接池紧张 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 数据库连接池可用连接少于5个"
runbook: "https://wiki.company.com/runbooks/springboot-database-connections"
# 自定义业务指标告警
- alert: SpringBootAppHighOrderErrorRate
expr: rate(orders_failed_total{job="springboot-app"}[5m]) / rate(orders_processed_total{job="springboot-app"}[5m]) > 0.05
for: 2m
labels:
severity: critical
team: backend
annotations:
summary: "SpringBoot应用订单处理错误率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 订单处理错误率超过5%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-order-errors"
- name: infrastructure-alerts
rules:
# 节点资源告警
- alert: NodeHighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "节点CPU使用率过高 ({{ $labels.instance }})"
description: "节点 {{ $labels.instance }} CPU使用率超过80%,当前值: {{ $value }}%"
- alert: NodeHighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "节点内存使用率过高 ({{ $labels.instance }})"
description: "节点 {{ $labels.instance }} 内存使用率超过85%,当前值: {{ $value }}%"
- alert: NodeDiskSpaceRunningOut
expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "节点磁盘空间不足 ({{ $labels.instance }} {{ $labels.mountpoint }})"
description: "节点 {{ $labels.instance }} 挂载点 {{ $labels.mountpoint }} 磁盘使用率超过90%,当前值: {{ $value }}%"
- alert: NodeNetworkSaturation
expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024
for: 2m
labels:
severity: warning
team: infrastructure
annotations:
summary: "节点网络接收饱和 ({{ $labels.instance }})"
description: "节点 {{ $labels.instance }} 网络接收速率超过100MB/s"
- name: kubernetes-alerts
rules:
# Kubernetes集群告警
- alert: KubePodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Pod崩溃循环 ({{ $labels.namespace }}/{{ $labels.pod }})"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 正在崩溃循环"
- alert: KubeDeploymentReplicasMismatch
expr: kube_deployment_status_replicas_available != kube_deployment_spec_replicas
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Deployment副本数不匹配 ({{ $labels.namespace }}/{{ $labels.deployment }})"
description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} 可用副本数不匹配期望值"
- alert: KubeHPAReachedMax
expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "HPA达到最大副本数 ({{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }})"
description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} 已达到最大副本数 {{ $value }}"
Alertmanager配置
yaml
# alertmanager/alertmanager.yml - Alertmanager主配置
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alertmanager@company.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 5s
repeat_interval: 5m
routes:
- match:
team: backend
receiver: 'backend-critical'
- match:
team: platform
receiver: 'platform-critical'
- match:
severity: warning
receiver: 'warning-alerts'
group_wait: 30s
repeat_interval: 15m
- match:
severity: info
receiver: 'info-alerts'
group_wait: 1m
repeat_interval: 30m
receivers:
- name: 'default-receiver'
email_configs:
- to: 'devops@company.com'
subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Details:
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
{{ end }}
Runbook: {{ .Annotations.runbook }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'sre-team@company.com'
subject: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
🚨 CRITICAL ALERT
================
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
{{ end }}
Runbook: {{ .Annotations.runbook }}
Time: {{ .StartsAt }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#alerts-critical'
title: '🚨 Critical Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
color: 'danger'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
description: '{{ .GroupLabels.alertname }}'
details:
summary: '{{ .Annotations.summary }}'
description: '{{ .Annotations.description }}'
- name: 'backend-critical'
email_configs:
- to: 'backend-team@company.com'
subject: '🚨 BACKEND CRITICAL: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#backend-alerts'
title: '🚨 Backend Critical'
pagerduty_configs:
- service_key: 'backend-pagerduty-key'
- name: 'platform-critical'
email_configs:
- to: 'platform-team@company.com'
subject: '🚨 PLATFORM CRITICAL: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#platform-alerts'
- name: 'warning-alerts'
email_configs:
- to: 'devops@company.com'
subject: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#alerts-warning'
color: 'warning'
- name: 'info-alerts'
email_configs:
- to: 'devops@company.com'
subject: 'ℹ️ INFO: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#alerts-info'
color: 'good'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
- source_match:
severity: 'critical'
target_match:
severity: 'info'
equal: ['alertname', 'cluster', 'service']
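Alertmanager的路由树层级较深,上线前可以用amtool先做配置校验和路由测试。以下命令仅为示意,配置文件路径沿用上文假设:
bash
# 校验Alertmanager配置语法
amtool check-config alertmanager/alertmanager.yml
# 查看路由树结构
amtool config routes show --config.file=alertmanager/alertmanager.yml
# 用一组标签模拟路由,确认critical+backend会落到backend-critical接收器
amtool config routes test --config.file=alertmanager/alertmanager.yml severity=critical team=backend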
告警管理脚本
bash
#!/bin/bash
# alert_manager.sh - 告警管理系统
set -euo pipefail
# 配置
ALERTMANAGER_URL="http://alertmanager:9093"
PROMETHEUS_URL="http://prometheus:9090"
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/XXX/XXX"
ALERT_RULES_DIR="./prometheus/alerts"
BACKUP_DIR="./backups/alerts"
# 颜色定义
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
log() {
echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}"
}
# 检查告警管理器状态
check_alertmanager_status() {
log "检查Alertmanager状态..."
if curl -s "${ALERTMANAGER_URL}/-/healthy" > /dev/null; then
log "✅ Alertmanager运行正常"
return 0
else
echo -e "${RED}❌ Alertmanager不可用${NC}"
return 1
fi
}
# 重新加载告警规则
reload_prometheus_rules() {
log "重新加载Prometheus告警规则..."
if curl -s -X POST "${PROMETHEUS_URL}/-/reload" > /dev/null; then
log "✅ Prometheus规则重新加载成功"
else
echo -e "${RED}❌ Prometheus规则重新加载失败${NC}"
return 1
fi
}
# 重新加载Alertmanager配置
reload_alertmanager_config() {
log "重新加载Alertmanager配置..."
if curl -s -X POST "${ALERTMANAGER_URL}/-/reload" > /dev/null; then
log "✅ Alertmanager配置重新加载成功"
else
echo -e "${RED}❌ Alertmanager配置重新加载失败${NC}"
return 1
fi
}
# 验证告警规则语法
validate_alert_rules() {
log "验证告警规则语法..."
local rules_file="$1"
if [ ! -f "$rules_file" ]; then
echo -e "${RED}❌ 告警规则文件不存在: $rules_file${NC}"
return 1
fi
# 使用promtool验证规则
if command -v promtool >/dev/null 2>&1; then
if promtool check rules "$rules_file"; then
log "✅ 告警规则语法正确"
return 0
else
echo -e "${RED}❌ 告警规则语法错误${NC}"
return 1
fi
else
echo -e "${YELLOW}⚠️ promtool未安装,跳过语法检查${NC}"
return 0
fi
}
# 获取当前活跃告警
get_active_alerts() {
log "获取当前活跃告警..."
local response
response=$(curl -s "${ALERTMANAGER_URL}/api/v2/alerts" | jq -r '.[] | "\(.labels.alertname) - \(.status.state)"')
if [ -n "$response" ]; then
echo -e "${YELLOW}当前活跃告警:${NC}"
echo "$response"
else
log "✅ 无活跃告警"
fi
}
# 静默告警
silence_alert() {
local alert_name="$1"
local duration="${2:-1h}"
local creator="${3:-alert-manager}"
local comment="${4:-手动静默}"
log "静默告警: $alert_name, 时长: $duration"
local silence_data=$(cat << EOF
{
"matchers": [
{
"name": "alertname",
"value": "$alert_name",
"isRegex": false
}
],
"startsAt": "$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")",
"endsAt": "$(date -u -d "+$duration" +"%Y-%m-%dT%H:%M:%S.000Z")",
"createdBy": "$creator",
"comment": "$comment",
"status": {
"state": "active"
}
}
EOF
)
local response
response=$(curl -s -X POST \
-H "Content-Type: application/json" \
-d "$silence_data" \
"${ALERTMANAGER_URL}/api/v2/silences")
local silence_id=$(echo "$response" | jq -r '.silenceID')
if [ "$silence_id" != "null" ]; then
log "✅ 告警静默成功, ID: $silence_id"
echo "$silence_id"
else
echo -e "${RED}❌ 告警静默失败${NC}"
return 1
fi
}
# 取消静默
unsilence_alert() {
local silence_id="$1"
log "取消静默: $silence_id"
if curl -s -X DELETE "${ALERTMANAGER_URL}/api/v2/silence/$silence_id" > /dev/null; then
log "✅ 静默已取消"
else
echo -e "${RED}❌ 取消静默失败${NC}"
return 1
fi
}
# 发送测试告警
send_test_alert() {
local alert_name="TestAlert"
local severity="warning"
local instance="test-instance"
log "发送测试告警..."
local test_alert=$(cat << EOF
[
{
"labels": {
"alertname": "$alert_name",
"severity": "$severity",
"instance": "$instance",
"job": "springboot-app"
},
"annotations": {
"summary": "测试告警 - 请忽略",
"description": "这是一个测试告警,用于验证告警系统工作正常",
"runbook": "https://wiki.company.com/runbooks/test-alert"
},
"generatorURL": "http://test.example.com"
}
]
EOF
)
if curl -s -X POST \
-H "Content-Type: application/json" \
-d "$test_alert" \
"${ALERTMANAGER_URL}/api/v1/alerts" > /dev/null; then
log "✅ 测试告警发送成功"
else
echo -e "${RED}❌ 测试告警发送失败${NC}"
return 1
fi
}
# 备份告警配置
backup_alert_config() {
local backup_timestamp=$(date +"%Y%m%d_%H%M%S")
local backup_path="$BACKUP_DIR/$backup_timestamp"
log "备份告警配置到: $backup_path"
mkdir -p "$backup_path"
# 备份告警规则
cp -r "$ALERT_RULES_DIR" "$backup_path/"
# 备份Alertmanager配置
curl -s "${ALERTMANAGER_URL}/api/v1/status" | jq '.' > "$backup_path/alertmanager_status.json"
# 备份当前静默规则
curl -s "${ALERTMANAGER_URL}/api/v2/silences" | jq '.' > "$backup_path/silences.json"
log "✅ 告警配置备份完成"
}
# 生成告警报告
generate_alert_report() {
local report_file="alert_report_$(date +%Y%m%d_%H%M%S).html"
log "生成告警报告: $report_file"
# 获取告警统计
local alert_stats=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
--data-urlencode 'query=count by (severity) (ALERTS)' | jq -r '.data.result[] | "\(.metric.severity): \(.value[1])"')
# 生成HTML报告
cat > "$report_file" << EOF
<!DOCTYPE html>
<html>
<head>
<title>告警系统报告</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; }
.summary { background: #f4f4f4; padding: 20px; border-radius: 5px; }
.critical { color: #e74c3c; }
.warning { color: #f39c12; }
.info { color: #3498db; }
.stats { margin: 20px 0; }
.stat-item { margin: 10px 0; }
</style>
</head>
<body>
<h1>告警系统状态报告</h1>
<div class="summary">
<h2>系统概览</h2>
<p><strong>生成时间:</strong> $(date)</p>
<p><strong>Alertmanager:</strong> $ALERTMANAGER_URL</p>
<p><strong>Prometheus:</strong> $PROMETHEUS_URL</p>
</div>
<div class="stats">
<h2>告警统计</h2>
$(echo "$alert_stats" | while read line; do
severity=$(echo "$line" | cut -d: -f1)
count=$(echo "$line" | cut -d: -f2)
echo "<div class='stat-item'><span class='$severity'>$severity: $count</span></div>"
done)
</div>
<div class="active-alerts">
<h2>活跃告警</h2>
<pre>$(get_active_alerts)</pre>
</div>
</body>
</html>
EOF
log "✅ 告警报告已生成: $report_file"
}
# 显示使用说明
show_usage() {
cat << EOF
使用说明: $0 [命令]
命令:
status 检查告警系统状态
reload 重新加载配置
validate 验证告警规则语法
list 列出活跃告警
silence NAME [DURATION] 静默指定告警(默认1h)
unsilence ID 取消静默
test 发送测试告警
backup 备份告警配置
report 生成告警报告
help 显示此帮助信息
示例:
$0 status # 检查状态
$0 silence SpringBootAppHighCPUUsage 2h   # 静默2小时
$0 test # 发送测试告警
$0 report # 生成报告
EOF
}
# 主函数
main() {
local command=${1:-help}
case $command in
status)
check_alertmanager_status
get_active_alerts
;;
reload)
validate_alert_rules "$ALERT_RULES_DIR/alerts.yml"
reload_prometheus_rules
reload_alertmanager_config
;;
validate)
validate_alert_rules "$ALERT_RULES_DIR/alerts.yml"
;;
list)
get_active_alerts
;;
silence)
local alert_name=${2:-}   # 缺参时避免set -u直接报unbound variable
local duration=${3:-1h}
if [ -z "$alert_name" ]; then
echo -e "${RED}请指定要静默的告警名称${NC}"
exit 1
fi
silence_alert "$alert_name" "$duration"
;;
unsilence)
local silence_id=${2:-}
if [ -z "$silence_id" ]; then
echo -e "${RED}请指定要取消的静默ID${NC}"
exit 1
fi
unsilence_alert "$silence_id"
;;
test)
send_test_alert
;;
backup)
backup_alert_config
;;
report)
generate_alert_report
;;
help|*)
show_usage
;;
esac
}
# 执行主函数
main "$@"
Kubernetes自动扩缩容配置
HPA配置
yaml
# k8s/hpa.yaml - Horizontal Pod Autoscaler配置
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: springboot-app-hpa
namespace: production
labels:
app: springboot-app
version: v1
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: springboot-app
minReplicas: 2
maxReplicas: 10
metrics:
# CPU基于扩缩容
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# 内存基于扩缩容
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# 基于QPS的自定义指标扩缩容
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
# 基于响应时间的自定义指标
- type: Object
object:
metric:
name: http_request_duration_seconds
describedObject:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: springboot-app-ingress
target:
type: Value
value: "500m" # 500毫秒
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
selectPolicy: Max
---
# 基于Prometheus自定义指标的HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: springboot-app-custom-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: springboot-app
minReplicas: 2
maxReplicas: 20
metrics:
# 基于业务指标的扩缩容 - 订单处理速率
- type: Pods
pods:
metric:
name: orders_processed_per_minute
target:
type: AverageValue
averageValue: "1000"
# 基于错误率的扩缩容
- type: Pods
pods:
metric:
name: error_rate
target:
type: AverageValue
averageValue: "50" # 50个错误/分钟时扩容
behavior:
scaleDown:
stabilizationWindowSeconds: 600 # 缩容等待10分钟
policies:
- type: Percent
value: 10
periodSeconds: 120
scaleUp:
stabilizationWindowSeconds: 60 # 扩容等待1分钟
policies:
- type: Percent
value: 100
periodSeconds: 30
---
# VPA (Vertical Pod Autoscaler) 配置
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: springboot-app-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: springboot-app
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: "springboot-app"
minAllowed:
cpu: "100m"
memory: "128Mi"
maxAllowed:
cpu: "2"
memory: "2Gi"
controlledResources: ["cpu", "memory"]
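HPA与VPA配置提交后,可以用kubectl确认指标采集与扩缩容行为是否符合预期(VPA需要集群已安装Vertical Pod Autoscaler组件,资源名沿用上文):
bash
# 应用HPA/VPA配置
kubectl apply -f k8s/hpa.yaml
# 持续观察HPA的指标读数与副本数变化
kubectl -n production get hpa springboot-app-hpa -w
# 查看扩缩容事件,定位指标无法采集等问题
kubectl -n production describe hpa springboot-app-hpa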
自定义指标适配器配置
yaml
# k8s/prometheus-adapter.yaml - Prometheus适配器配置
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-adapter-config
namespace: monitoring
data:
config.yaml: |
rules:
# 自定义指标规则
custom:
# HTTP请求QPS指标
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "http_requests_total"
as: "http_requests_per_second"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
# HTTP请求延迟指标
- seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "http_request_duration_seconds"
as: "http_request_duration_seconds_p95"
metricsQuery: |
histogram_quantile(0.95,
sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (le, <<.GroupBy>>)
)
# 业务指标 - 订单处理速率
- seriesQuery: 'orders_processed_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "orders_processed_total"
as: "orders_processed_per_minute"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>) * 60'
# 业务指标 - 错误率
- seriesQuery: 'orders_failed_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "orders_failed_total"
as: "error_rate"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
# JVM内存使用率
- seriesQuery: 'jvm_memory_used_bytes{namespace!="",pod!="",area="heap"}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "jvm_memory_used_bytes"
as: "jvm_heap_usage_percent"
metricsQuery: |
(sum(jvm_memory_used_bytes{<<.LabelMatchers>>,area="heap"}) by (<<.GroupBy>>) /
sum(jvm_memory_max_bytes{<<.LabelMatchers>>,area="heap"}) by (<<.GroupBy>>)) * 100
# 外部指标(用于集群级扩缩容)
external:
- seriesQuery: 'nginx_ingress_controller_requests{namespace!="",ingress!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
ingress: {resource: "ingress"}
name:
matches: "nginx_ingress_controller_requests"
as: "ingress_requests_per_second"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus-adapter
namespace: monitoring
labels:
app: prometheus-adapter
spec:
replicas: 1
selector:
matchLabels:
app: prometheus-adapter
template:
metadata:
labels:
app: prometheus-adapter
spec:
containers:
- name: prometheus-adapter
image: directxman12/k8s-prometheus-adapter:v0.10.0
args:
- --secure-port=6443
- --cert-dir=/tmp
- --logtostderr=true
- --prometheus-url=http://prometheus-server.monitoring.svc.cluster.local
- --metrics-relist-interval=1m
- --v=6
- --config=/etc/adapter/config.yaml
ports:
- name: https
containerPort: 6443
volumeMounts:
- name: config
mountPath: /etc/adapter
readOnly: true
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 512Mi
volumes:
- name: config
configMap:
name: prometheus-adapter-config
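Prometheus Adapter部署完成后,HPA要能读到自定义指标,前提是custom.metrics.k8s.io聚合API正常工作。可以用下面的命令验证(命名空间与指标名沿用上文配置):
bash
# 确认自定义指标APIService已注册且可用
kubectl get apiservices | grep custom.metrics
# 列出适配器当前暴露的自定义指标
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
# 查询production命名空间下Pod级的QPS指标,HPA即按此取值
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .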
自动扩缩容管理脚本
bash
#!/bin/bash
# autoscaling_manager.sh - 自动扩缩容管理系统
set -euo pipefail
# 配置
KUBE_NAMESPACE="production"
KUBE_CONTEXT="production-cluster"
PROMETHEUS_URL="http://prometheus-server.monitoring.svc.cluster.local:9090"
DEPLOYMENT_NAME="springboot-app"
HPA_NAME="springboot-app-hpa"
# 颜色定义
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
log() {
echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}"
}
# 检查Kubernetes集群状态
check_cluster_status() {
log "检查Kubernetes集群状态..."
if kubectl --context="$KUBE_CONTEXT" cluster-info > /dev/null 2>&1; then
log "✅ Kubernetes集群连接正常"
else
echo -e "${RED}❌ 无法连接Kubernetes集群${NC}"
return 1
fi
# 检查节点状态
local node_count
node_count=$(kubectl --context="$KUBE_CONTEXT" get nodes --no-headers | awk '$2 == "Ready"' | wc -l)
log "集群中Ready节点数量: $node_count"
}
# 获取HPA状态
get_hpa_status() {
log "获取HPA状态..."
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get hpa "$HPA_NAME" -o wide
echo -e "\n${YELLOW}HPA详细状态:${NC}"
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" describe hpa "$HPA_NAME" | grep -A 10 "Metrics:"
}
# 获取当前指标值
get_current_metrics() {
log "获取当前监控指标..."
local metrics=(
"cpu_usage"
"memory_usage"
"http_requests_per_second"
"http_request_duration_seconds_p95"
"jvm_heap_usage_percent"
)
for metric in "${metrics[@]}"; do
echo -e "\n${BLUE}=== $metric ===${NC}"
case $metric in
cpu_usage)
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" top pods -l app="$DEPLOYMENT_NAME"
;;
memory_usage)
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" top pods -l app="$DEPLOYMENT_NAME"
;;
*)
# 从Prometheus获取自定义指标
get_prometheus_metric "$metric"
;;
esac
done
}
# 从Prometheus查询指标
get_prometheus_metric() {
local metric_name="$1"
local query
case $metric_name in
http_requests_per_second)
query='sum(rate(http_requests_total{app="springboot-app"}[2m]))'
;;
http_request_duration_seconds_p95)
query='histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="springboot-app"}[5m])) by (le))'
;;
jvm_heap_usage_percent)
query='(sum(jvm_memory_used_bytes{app="springboot-app",area="heap"}) / sum(jvm_memory_max_bytes{app="springboot-app",area="heap"})) * 100'
;;
container_cpu_usage)
# 报告中引用的容器级指标,来自cAdvisor;namespace/pod标签为假设的默认标签
query='sum(rate(container_cpu_usage_seconds_total{namespace="production",pod=~"springboot-app.*"}[5m]))'
;;
container_memory_usage)
query='sum(container_memory_working_set_bytes{namespace="production",pod=~"springboot-app.*"})'
;;
*)
echo "未知指标: $metric_name"
return 1
;;
esac
local response
response=$(curl -s "$PROMETHEUS_URL/api/v1/query" \
--data-urlencode "query=$query" | jq -r '.data.result[0].value[1] // "N/A"')
echo "当前值: $response"
}
# 调整HPA配置
adjust_hpa_config() {
local min_replicas="$1"
local max_replicas="$2"
local cpu_threshold="${3:-70}"
log "调整HPA配置: min=$min_replicas, max=$max_replicas, cpu_threshold=$cpu_threshold%"
# 备份当前配置
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get hpa "$HPA_NAME" -o yaml > "hpa_backup_$(date +%Y%m%d_%H%M%S).yaml"
# 更新HPA配置
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" patch hpa "$HPA_NAME" -p "
{
\"spec\": {
\"minReplicas\": $min_replicas,
\"maxReplicas\": $max_replicas,
\"metrics\": [
{
\"type\": \"Resource\",
\"resource\": {
\"name\": \"cpu\",
\"target\": {
\"type\": \"Utilization\",
\"averageUtilization\": $cpu_threshold
}
}
}
]
}
}"
log "✅ HPA配置更新完成"
get_hpa_status
}
# 手动扩容
scale_up() {
local replicas="$1"
log "手动扩容到 $replicas 个副本..."
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" scale deployment "$DEPLOYMENT_NAME" --replicas="$replicas"
# 等待扩容完成
wait_for_scale "$replicas"
}
# 手动缩容
scale_down() {
local replicas="$1"
log "手动缩容到 $replicas 个副本..."
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" scale deployment "$DEPLOYMENT_NAME" --replicas="$replicas"
# 等待缩容完成
wait_for_scale "$replicas"
}
# 等待扩缩容完成
wait_for_scale() {
local desired_replicas="$1"
local timeout=300
local start_time=$(date +%s)
log "等待扩缩容完成,期望副本数: $desired_replicas"
while true; do
local current_time=$(date +%s)
local elapsed=$((current_time - start_time))
if [ $elapsed -gt $timeout ]; then
echo -e "${RED}❌ 扩缩容超时${NC}"
return 1
fi
local current_replicas
current_replicas=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get deployment "$DEPLOYMENT_NAME" -o jsonpath='{.status.readyReplicas}')
if [ "$current_replicas" = "$desired_replicas" ]; then
log "✅ 扩缩容完成,当前副本数: $current_replicas"
break
fi
echo "⏳ 等待扩缩容... (当前: $current_replicas, 期望: $desired_replicas, 已等待: ${elapsed}s)"
sleep 10
done
}
# 模拟负载测试
simulate_load() {
local duration="${1:-5m}"
local concurrent_users="${2:-50}"
log "开始模拟负载测试,持续时间: $duration, 并发用户: $concurrent_users"
# 获取服务地址
local service_url
service_url=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get svc "$DEPLOYMENT_NAME" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
if [ -z "$service_url" ]; then
service_url=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get svc "$DEPLOYMENT_NAME" -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
fi
if [ -z "$service_url" ]; then
echo -e "${RED}❌ 无法获取服务地址${NC}"
return 1
fi
log "服务地址: http://$service_url"
# 使用hey进行负载测试
if command -v hey >/dev/null 2>&1; then
hey -z "$duration" -c "$concurrent_users" \
-m GET \
"http://$service_url/health" \
"http://$service_url/info"
else
echo -e "${YELLOW}⚠️ hey工具未安装,使用curl简单测试${NC}"
# 简单的负载测试
for i in $(seq 1 "$concurrent_users"); do
curl -s "http://$service_url/health" > /dev/null &
done
sleep "$(echo "$duration" | sed 's/m//')m"
fi
log "✅ 负载测试完成"
}
# 生成扩缩容报告
generate_scaling_report() {
local report_file="scaling_report_$(date +%Y%m%d_%H%M%S).html"
log "生成扩缩容报告: $report_file"
# 获取HPA事件历史
local hpa_events
hpa_events=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" describe hpa "$HPA_NAME" | grep -A 20 "Events:" || echo "无事件")
# 获取Pod重启历史
local pod_restarts
pod_restarts=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get pods -l app="$DEPLOYMENT_NAME" -o jsonpath='{.items[*].status.containerStatuses[0].restartCount}' | tr ' ' '\n' | sort -nr | head -1)
cat > "$report_file" << EOF
<!DOCTYPE html>
<html>
<head>
<title>自动扩缩容报告</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; }
.summary { background: #f4f4f4; padding: 20px; border-radius: 5px; }
.metrics { display: grid; grid-template-columns: repeat(2, 1fr); gap: 20px; margin: 20px 0; }
.metric-card { background: white; padding: 15px; border-radius: 5px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }
.events { background: #f8f9fa; padding: 15px; border-radius: 5px; }
</style>
</head>
<body>
<h1>自动扩缩容系统报告</h1>
<div class="summary">
<h2>系统概览</h2>
<p><strong>生成时间:</strong> $(date)</p>
<p><strong>应用:</strong> $DEPLOYMENT_NAME</p>
<p><strong>命名空间:</strong> $KUBE_NAMESPACE</p>
<p><strong>最大Pod重启次数:</strong> $pod_restarts</p>
</div>
<div class="metrics">
<div class="metric-card">
<h3>CPU使用率</h3>
<p>$(get_prometheus_metric "container_cpu_usage")</p>
</div>
<div class="metric-card">
<h3>内存使用率</h3>
<p>$(get_prometheus_metric "container_memory_usage")</p>
</div>
<div class="metric-card">
<h3>请求QPS</h3>
<p>$(get_prometheus_metric "http_requests_per_second")</p>
</div>
<div class="metric-card">
<h3>P95延迟</h3>
<p>$(get_prometheus_metric "http_request_duration_seconds_p95") 秒</p>
</div>
</div>
<div class="events">
<h2>HPA事件历史</h2>
<pre>$hpa_events</pre>
</div>
<div class="recommendations">
<h2>优化建议</h2>
<ul>
<li>根据业务高峰调整HPA的最小/最大副本数</li>
<li>监控JVM内存使用,考虑配置垂直扩缩容(VPA)</li>
<li>设置适当的扩缩容冷却时间,避免频繁扩缩容</li>
</ul>
</div>
</body>
</html>
EOF
log "✅ 扩缩容报告已生成: $report_file"
}
# 显示使用说明
show_usage() {
cat << EOF
使用说明: $0 [命令]
命令:
status 检查集群和HPA状态
metrics 获取当前监控指标
adjust MIN MAX [CPU_THRESHOLD] 调整HPA配置
scale-up REPLICAS 手动扩容
scale-down REPLICAS 手动缩容
simulate [DURATION] [USERS] 模拟负载测试
report 生成扩缩容报告
help 显示此帮助信息
示例:
$0 status # 检查状态
$0 adjust 2 10 80 # 调整HPA: min=2, max=10, CPU阈值=80%
$0 scale-up 5 # 手动扩容到5个副本
$0 simulate 10m 100 # 模拟10分钟100并发负载
$0 report # 生成报告
EOF
}
# 主函数
main() {
local command=${1:-help}
# 设置Kubernetes上下文
kubectl config use-context "$KUBE_CONTEXT" > /dev/null 2>&1 || true
case $command in
status)
check_cluster_status
get_hpa_status
;;
metrics)
get_current_metrics
;;
adjust)
local min_replicas=${2:-}   # 缺参时避免set -u直接报unbound variable
local max_replicas=${3:-}
local cpu_threshold=${4:-70}
if [ -z "$min_replicas" ] || [ -z "$max_replicas" ]; then
echo -e "${RED}请指定最小和最大副本数${NC}"
exit 1
fi
adjust_hpa_config "$min_replicas" "$max_replicas" "$cpu_threshold"
;;
scale-up)
local replicas=${2:-}
if [ -z "$replicas" ]; then
echo -e "${RED}请指定副本数${NC}"
exit 1
fi
scale_up "$replicas"
;;
scale-down)
local replicas=${2:-}
if [ -z "$replicas" ]; then
echo -e "${RED}请指定副本数${NC}"
exit 1
fi
scale_down "$replicas"
;;
simulate)
local duration=${2:-5m}
local users=${3:-50}
simulate_load "$duration" "$users"
;;
report)
generate_scaling_report
;;
help|*)
show_usage
;;
esac
}
# 执行主函数
main "$@"
智能扩缩容策略配置
基于预测的自动扩缩容
yaml
# k8s/keda-autoscaling.yaml - KEDA基于事件的自动扩缩容
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: springboot-app-keda
namespace: production
spec:
scaleTargetRef:
name: springboot-app
kind: Deployment
minReplicaCount: 2
maxReplicaCount: 20
cooldownPeriod: 300
pollingInterval: 30
triggers:
# 基于CPU的扩缩容
- type: cpu
metadata:
type: Utilization
value: "70"
# 基于内存的扩缩容
- type: memory
metadata:
type: Utilization
value: "80"
# 基于Prometheus指标的扩缩容
- type: prometheus
metadata:
serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
metricName: http_requests_per_second
threshold: "100"
query: |
sum(rate(http_requests_total{app="springboot-app"}[2m]))
# 基于响应时间的扩缩容
- type: prometheus
metadata:
serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
metricName: http_request_p95_latency
threshold: "1.0" # 1秒
query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{app="springboot-app"}[5m])) by (le)
)
# 基于消息队列的扩缩容(如果使用消息队列)
- type: kafka
metadata:
bootstrapServers: kafka-broker:9092
consumerGroup: springboot-consumer
topic: orders
lagThreshold: "50"
# 基于定时任务的扩缩容
- type: cron
metadata:
timezone: Asia/Shanghai
start: 0 9 * * 1-5 # 工作日9点开始
end: 0 18 * * 1-5 # 工作日18点结束
desiredReplicas: "5"
# 基于外部API的扩缩容(这里用KEDA的metrics-api scaler;valueLocation指响应JSON中的字段名,为示例假设)
- type: metrics-api
metadata:
url: "https://external-metrics-api.com/load"
valueLocation: "load"
targetValue: "1000"
---
# 预测性扩缩容配置
apiVersion: batch/v1
kind: CronJob
metadata:
name: predictive-scaling
namespace: production
spec:
schedule: "*/5 * * * *" # 每5分钟运行一次
jobTemplate:
spec:
template:
spec:
containers:
- name: predictive-scaling
image: python:3.9
command:
- /bin/sh
- -c
- |
pip install requests pandas scikit-learn
python /scripts/predictive_scaling.py
volumeMounts:
- name: scripts
mountPath: /scripts
volumes:
- name: scripts
configMap:
name: predictive-scaling-scripts
restartPolicy: OnFailure
---
apiVersion: v1
kind: ConfigMap
metadata:
name: predictive-scaling-scripts
namespace: production
data:
predictive_scaling.py: |
#!/usr/bin/env python3
"""
预测性扩缩容脚本
基于历史负载模式预测未来负载并调整HPA
"""
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import json
import os
# 配置
PROMETHEUS_URL = "http://prometheus-server.monitoring.svc.cluster.local:9090"
K8S_API_URL = "https://kubernetes-api"
NAMESPACE = "production"
DEPLOYMENT = "springboot-app"
def query_prometheus(query):
"""查询Prometheus指标"""
response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
if response.status_code == 200:
data = response.json()
if data['data']['result']:
return float(data['data']['result'][0]['value'][1])
return 0
def get_historical_metrics():
"""获取历史指标数据"""
end_time = datetime.now()
start_time = end_time - timedelta(hours=24)
# 查询过去24小时的QPS数据
query = f'avg_over_time(sum(rate(http_requests_total{{app="{DEPLOYMENT}"}}[5m]))[24h:5m])'
historical_data = query_prometheus(query)
return historical_data
def predict_future_load():
"""预测未来负载"""
# 基于历史数据的简单预测
historical_load = get_historical_metrics()
# 考虑时间因素(工作日/周末,白天/晚上)
now = datetime.now()
hour = now.hour
is_weekday = now.weekday() < 5
# 简单的预测模型
if is_weekday:
if 9 <= hour <= 12: # 上午工作时间
predicted_load = historical_load * 1.5
elif 13 <= hour <= 18: # 下午工作时间
predicted_load = historical_load * 1.8
elif 19 <= hour <= 23: # 晚上
predicted_load = historical_load * 1.2
else: # 深夜
predicted_load = historical_load * 0.5
else: # 周末
predicted_load = historical_load * 0.8
return max(predicted_load, 10) # 最小10 QPS
def calculate_desired_replicas(predicted_load):
"""计算期望的副本数"""
# 假设每个Pod可以处理50 QPS
pods_needed = max(2, int(predicted_load / 50) + 1)
return min(pods_needed, 20) # 最大20个副本
def update_hpa(replicas):
"""更新HPA配置"""
# 这里应该调用Kubernetes API更新HPA
# 为了安全,这里只打印日志
print(f"预测性扩缩容: 建议设置副本数为 {replicas}")
# 实际实现应该调用Kubernetes API
# patch_data = {
# "spec": {
# "minReplicas": replicas,
# "maxReplicas": max(replicas + 5, 20)
# }
# }
def main():
print("开始预测性扩缩容分析...")
# 预测未来负载
predicted_load = predict_future_load()
print(f"预测负载: {predicted_load:.2f} QPS")
# 计算所需副本数
desired_replicas = calculate_desired_replicas(predicted_load)
print(f"期望副本数: {desired_replicas}")
# 更新HPA配置
update_hpa(desired_replicas)
print("预测性扩缩容分析完成")
if __name__ == "__main__":
main()
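需要注意KEDA会为每个ScaledObject自动创建并管理一个HPA,因此生产上同一个Deployment不应同时由前文的独立HPA和ScaledObject控制,二者取其一即可。部署后可以用下面的命令确认KEDA与预测性CronJob是否正常(资源名沿用上文):
bash
# 查看ScaledObject状态及其管理的HPA
kubectl -n production get scaledobject springboot-app-keda
kubectl -n production get hpa
# 手动触发一次预测性扩缩容任务并查看输出
kubectl -n production create job predictive-scaling-manual --from=cronjob/predictive-scaling
kubectl -n production logs job/predictive-scaling-manual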
告警和扩缩容架构图
以下图表展示了完整的告警和自动扩缩容系统架构:
mermaid
flowchart TD
A[应用指标收集] --> B[Prometheus]
A --> C[应用日志]
B --> D[Alertmanager]
B --> E[Grafana]
B --> F[Prometheus Adapter]
C --> G[ELK/Loki]
D --> H[告警通知]
H --> H1[邮件]
H --> H2[Slack]
H --> H3[PagerDuty]
F --> I[Kubernetes Metrics API]
I --> J[HPA]
I --> K[KEDA]
J --> L[自动扩缩容]
K --> L
L --> M[Pod扩缩容]
M --> N[资源调整]
G --> O[日志分析]
E --> P[实时监控]
O --> Q[故障诊断]
P --> Q
Q --> R[性能优化]
R --> A
style A fill:#3498db,color:#fff
style D fill:#e74c3c,color:#fff
style L fill:#27ae60,color:#fff
style Q fill:#9b59b6,color:#fff
总结
通过本部分的配置,我们建立了完整的告警和自动扩缩容系统:
核心能力
- 智能告警: 多层次、多渠道的告警系统
- 自动扩缩容: 基于资源使用率和业务指标的自动扩缩容
- 预测性扩缩容: 基于历史模式的智能预测
- 全面监控: 从基础设施到应用层的全方位监控
关键技术
- Prometheus + Alertmanager: 强大的监控告警组合
- Kubernetes HPA: 原生自动扩缩容
- KEDA: 基于事件的自动扩缩容
- 自定义指标: 基于业务指标的智能扩缩容
- 多通知渠道: 邮件、Slack、PagerDuty等
最佳实践
- 分级告警: 根据严重程度分级处理告警
- 智能静默: 避免告警风暴,提高告警有效性
- 渐进式扩缩容: 平稳的扩缩容策略避免业务波动
- 预测性优化: 基于历史数据的智能预测
现在SpringBoot应用具备了企业级的智能运维能力,可以自动应对各种业务场景,确保系统的高可用性和稳定性。