在建立了完整的CI/CD流水线后,我们现在需要配置智能的告警系统和自动扩缩容机制,确保应用在生产环境中能够自动应对流量变化并保持高可用性。
监控告警系统配置
Prometheus告警规则配置
yaml
# prometheus/alerts.yml - Prometheus告警规则配置
groups:
- name: springboot-app-alerts
rules:
# 应用可用性告警
- alert: SpringBootAppDown
expr: up{job="springboot-app"} == 0
for: 1m
labels:
severity: critical
team: backend
annotations:
summary: "SpringBoot应用宕机 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 已经宕机超过1分钟。"
runbook: "https://wiki.company.com/runbooks/springboot-app-down"
- alert: SpringBootAppHighErrorRate
expr: sum(rate(http_server_requests_seconds_count{job="springboot-app", status=~"5.."}[5m])) by (instance) / sum(rate(http_server_requests_seconds_count{job="springboot-app"}[5m])) by (instance) > 0.1
for: 2m
labels:
severity: critical
team: backend
annotations:
summary: "SpringBoot应用错误率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 5xx错误率超过10%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-high-error-rate"
# JVM性能告警
- alert: SpringBootAppHighMemoryUsage
expr: (sum(jvm_memory_used_bytes{job="springboot-app", area="heap"}) by (instance) / sum(jvm_memory_max_bytes{job="springboot-app", area="heap"}) by (instance)) > 0.8
for: 3m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用内存使用率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 堆内存使用率超过80%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-high-memory"
- alert: SpringBootAppHighCPUUsage
expr: process_cpu_usage{job="springboot-app"} > 0.8
for: 3m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用CPU使用率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} CPU使用率超过80%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-high-cpu"
# 垃圾回收告警
- alert: SpringBootAppHighGC
expr: rate(jvm_gc_pause_seconds_sum{job="springboot-app"}[5m]) > 0.1
for: 2m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用GC停顿时间过长 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} GC停顿时间过高,过去5分钟平均: {{ $value }}秒"
runbook: "https://wiki.company.com/runbooks/springboot-high-gc"
# 应用性能告警
- alert: SpringBootAppHighLatency
expr: histogram_quantile(0.95, rate(http_server_requests_seconds_bucket{job="springboot-app"}[5m])) > 2
for: 3m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用响应时间过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 95%响应时间超过2秒,当前值: {{ $value }}秒"
runbook: "https://wiki.company.com/runbooks/springboot-high-latency"
- alert: SpringBootAppLowThroughput
expr: rate(http_server_requests_seconds_count{job="springboot-app"}[5m]) < 10
for: 5m
labels:
severity: info
team: backend
annotations:
summary: "SpringBoot应用吞吐量过低 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 吞吐量异常低,过去5分钟平均: {{ $value }} req/s"
runbook: "https://wiki.company.com/runbooks/springboot-low-throughput"
# 数据库连接告警
- alert: SpringBootAppHighDatabaseConnections
expr: spring_datasource_max_connections{job="springboot-app"} - spring_datasource_active_connections{job="springboot-app"} < 5
for: 2m
labels:
severity: warning
team: backend
annotations:
summary: "SpringBoot应用数据库连接池紧张 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 数据库连接池可用连接少于5个"
runbook: "https://wiki.company.com/runbooks/springboot-database-connections"
# 自定义业务指标告警
- alert: SpringBootAppHighOrderErrorRate
expr: rate(orders_failed_total{job="springboot-app"}[5m]) / rate(orders_processed_total{job="springboot-app"}[5m]) > 0.05
for: 2m
labels:
severity: critical
team: backend
annotations:
summary: "SpringBoot应用订单处理错误率过高 (实例 {{ $labels.instance }})"
description: "SpringBoot应用 {{ $labels.instance }} 订单处理错误率超过5%,当前值: {{ $value | humanizePercentage }}"
runbook: "https://wiki.company.com/runbooks/springboot-order-errors"
- name: infrastructure-alerts
rules:
# 节点资源告警
- alert: NodeHighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "节点CPU使用率过高 ({{ $labels.instance }})"
description: "节点 {{ $labels.instance }} CPU使用率超过80%,当前值: {{ $value }}%"
- alert: NodeHighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "节点内存使用率过高 ({{ $labels.instance }})"
description: "节点 {{ $labels.instance }} 内存使用率超过85%,当前值: {{ $value }}%"
- alert: NodeDiskSpaceRunningOut
expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "节点磁盘空间不足 ({{ $labels.instance }} {{ $labels.mountpoint }})"
description: "节点 {{ $labels.instance }} 挂载点 {{ $labels.mountpoint }} 磁盘使用率超过90%,当前值: {{ $value }}%"
- alert: NodeNetworkSaturation
expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024
for: 2m
labels:
severity: warning
team: infrastructure
annotations:
summary: "节点网络接收饱和 ({{ $labels.instance }})"
description: "节点 {{ $labels.instance }} 网络接收速率超过100MB/s"
- name: kubernetes-alerts
rules:
# Kubernetes集群告警
- alert: KubePodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Pod崩溃循环 ({{ $labels.namespace }}/{{ $labels.pod }})"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 正在崩溃循环"
- alert: KubeDeploymentReplicasMismatch
expr: kube_deployment_status_replicas_available != kube_deployment_spec_replicas
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Deployment副本数不匹配 ({{ $labels.namespace }}/{{ $labels.deployment }})"
description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} 可用副本数不匹配期望值"
- alert: KubeHPAReachedMax
expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "HPA达到最大副本数 ({{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }})"
description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} 已达到最大副本数 {{ $value }}"
Alertmanager配置
yaml
# alertmanager/alertmanager.yml - Alertmanager主配置
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alertmanager@company.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 5s
repeat_interval: 5m
routes:
- match:
team: backend
receiver: 'backend-critical'
- match:
team: platform
receiver: 'platform-critical'
- match:
severity: warning
receiver: 'warning-alerts'
group_wait: 30s
repeat_interval: 15m
- match:
severity: info
receiver: 'info-alerts'
group_wait: 1m
repeat_interval: 30m
receivers:
- name: 'default-receiver'
email_configs:
- to: 'devops@company.com'
subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Details:
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
{{ end }}
Runbook: {{ .Annotations.runbook }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'sre-team@company.com'
subject: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
🚨 CRITICAL ALERT
================
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
{{ end }}
Runbook: {{ .Annotations.runbook }}
Time: {{ .StartsAt }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#alerts-critical'
title: '🚨 Critical Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
color: 'danger'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
description: '{{ .GroupLabels.alertname }}'
details:
summary: '{{ .Annotations.summary }}'
description: '{{ .Annotations.description }}'
- name: 'backend-critical'
email_configs:
- to: 'backend-team@company.com'
subject: '🚨 BACKEND CRITICAL: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#backend-alerts'
title: '🚨 Backend Critical'
pagerduty_configs:
- service_key: 'backend-pagerduty-key'
- name: 'platform-critical'
email_configs:
- to: 'platform-team@company.com'
subject: '🚨 PLATFORM CRITICAL: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#platform-alerts'
- name: 'warning-alerts'
email_configs:
- to: 'devops@company.com'
subject: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#alerts-warning'
color: 'warning'
- name: 'info-alerts'
email_configs:
- to: 'devops@company.com'
subject: 'ℹ️ INFO: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
channel: '#alerts-info'
color: 'good'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
- source_match:
severity: 'critical'
target_match:
severity: 'info'
equal: ['alertname', 'cluster', 'service']
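Alertmanager的路由树层级较深,上线前可以用amtool先做配置校验和路由测试。以下命令仅为示意,配置文件路径沿用上文假设:
bash
# 校验Alertmanager配置语法
amtool check-config alertmanager/alertmanager.yml
# 查看路由树结构
amtool config routes show --config.file=alertmanager/alertmanager.yml
# 用一组标签模拟路由,确认critical+backend会落到backend-critical接收器
amtool config routes test --config.file=alertmanager/alertmanager.yml severity=critical team=backend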
告警管理脚本
bash
#!/bin/bash
# alert_manager.sh - 告警管理系统
set -euo pipefail
# 配置
ALERTMANAGER_URL="http://alertmanager:9093"
PROMETHEUS_URL="http://prometheus:9090"
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/XXX/XXX"
ALERT_RULES_DIR="./prometheus/alerts"
BACKUP_DIR="./backups/alerts"
# 颜色定义
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
log() {
echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}"
}
# 检查告警管理器状态
check_alertmanager_status() {
log "检查Alertmanager状态..."
if curl -s "${ALERTMANAGER_URL}/-/healthy" > /dev/null; then
log "✅ Alertmanager运行正常"
return 0
else
echo -e "${RED}❌ Alertmanager不可用${NC}"
return 1
fi
}
# 重新加载告警规则
reload_prometheus_rules() {
log "重新加载Prometheus告警规则..."
if curl -s -X POST "${PROMETHEUS_URL}/-/reload" > /dev/null; then
log "✅ Prometheus规则重新加载成功"
else
echo -e "${RED}❌ Prometheus规则重新加载失败${NC}"
return 1
fi
}
# 重新加载Alertmanager配置
reload_alertmanager_config() {
log "重新加载Alertmanager配置..."
if curl -s -X POST "${ALERTMANAGER_URL}/-/reload" > /dev/null; then
log "✅ Alertmanager配置重新加载成功"
else
echo -e "${RED}❌ Alertmanager配置重新加载失败${NC}"
return 1
fi
}
# 验证告警规则语法
validate_alert_rules() {
log "验证告警规则语法..."
local rules_file="$1"
if [ ! -f "$rules_file" ]; then
echo -e "${RED}❌ 告警规则文件不存在: $rules_file${NC}"
return 1
fi
# 使用promtool验证规则
if command -v promtool >/dev/null 2>&1; then
if promtool check rules "$rules_file"; then
log "✅ 告警规则语法正确"
return 0
else
echo -e "${RED}❌ 告警规则语法错误${NC}"
return 1
fi
else
echo -e "${YELLOW}⚠️ promtool未安装,跳过语法检查${NC}"
return 0
fi
}
# 获取当前活跃告警
get_active_alerts() {
log "获取当前活跃告警..."
local response
response=$(curl -s "${ALERTMANAGER_URL}/api/v2/alerts" | jq -r '.[] | "\(.labels.alertname) - \(.status.state)"')
if [ -n "$response" ]; then
echo -e "${YELLOW}当前活跃告警:${NC}"
echo "$response"
else
log "✅ 无活跃告警"
fi
}
# 静默告警
silence_alert() {
local alert_name="$1"
local duration="${2:-1h}"
local creator="${3:-alert-manager}"
local comment="${4:-手动静默}"
log "静默告警: $alert_name, 时长: $duration"
local silence_data=$(cat << EOF
{
"matchers": [
{
"name": "alertname",
"value": "$alert_name",
"isRegex": false
}
],
"startsAt": "$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")",
"endsAt": "$(date -u -d "+$duration" +"%Y-%m-%dT%H:%M:%S.000Z")",
"createdBy": "$creator",
"comment": "$comment",
"status": {
"state": "active"
}
}
EOF
)
local response
response=$(curl -s -X POST \
-H "Content-Type: application/json" \
-d "$silence_data" \
"${ALERTMANAGER_URL}/api/v2/silences")
local silence_id=$(echo "$response" | jq -r '.silenceID')
if [ "$silence_id" != "null" ]; then
log "✅ 告警静默成功, ID: $silence_id"
echo "$silence_id"
else
echo -e "${RED}❌ 告警静默失败${NC}"
return 1
fi
}
# 取消静默
unsilence_alert() {
local silence_id="$1"
log "取消静默: $silence_id"
if curl -s -X DELETE "${ALERTMANAGER_URL}/api/v2/silence/$silence_id" > /dev/null; then
log "✅ 静默已取消"
else
echo -e "${RED}❌ 取消静默失败${NC}"
return 1
fi
}
# 发送测试告警
send_test_alert() {
local alert_name="TestAlert"
local severity="warning"
local instance="test-instance"
log "发送测试告警..."
local test_alert=$(cat << EOF
[
{
"labels": {
"alertname": "$alert_name",
"severity": "$severity",
"instance": "$instance",
"job": "springboot-app"
},
"annotations": {
"summary": "测试告警 - 请忽略",
"description": "这是一个测试告警,用于验证告警系统工作正常",
"runbook": "https://wiki.company.com/runbooks/test-alert"
},
"generatorURL": "http://test.example.com"
}
]
EOF
)
if curl -s -X POST \
-H "Content-Type: application/json" \
-d "$test_alert" \
"${ALERTMANAGER_URL}/api/v1/alerts" > /dev/null; then
log "✅ 测试告警发送成功"
else
echo -e "${RED}❌ 测试告警发送失败${NC}"
return 1
fi
}
# 备份告警配置
backup_alert_config() {
local backup_timestamp=$(date +"%Y%m%d_%H%M%S")
local backup_path="$BACKUP_DIR/$backup_timestamp"
log "备份告警配置到: $backup_path"
mkdir -p "$backup_path"
# 备份告警规则
cp -r "$ALERT_RULES_DIR" "$backup_path/"
# 备份Alertmanager配置
curl -s "${ALERTMANAGER_URL}/api/v1/status" | jq '.' > "$backup_path/alertmanager_status.json"
# 备份当前静默规则
curl -s "${ALERTMANAGER_URL}/api/v2/silences" | jq '.' > "$backup_path/silences.json"
log "✅ 告警配置备份完成"
}
# 生成告警报告
generate_alert_report() {
local report_file="alert_report_$(date +%Y%m%d_%H%M%S).html"
log "生成告警报告: $report_file"
# 获取告警统计
local alert_stats=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
--data-urlencode 'query=count by (severity) (ALERTS)' | jq -r '.data.result[] | "\(.metric.severity): \(.value[1])"')
# 生成HTML报告
cat > "$report_file" << EOF
<!DOCTYPE html>
<html>
<head>
<title>告警系统报告</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; }
.summary { background: #f4f4f4; padding: 20px; border-radius: 5px; }
.critical { color: #e74c3c; }
.warning { color: #f39c12; }
.info { color: #3498db; }
.stats { margin: 20px 0; }
.stat-item { margin: 10px 0; }
</style>
</head>
<body>
<h1>告警系统状态报告</h1>
<div class="summary">
<h2>系统概览</h2>
<p><strong>生成时间:</strong> $(date)</p>
<p><strong>Alertmanager:</strong> $ALERTMANAGER_URL</p>
<p><strong>Prometheus:</strong> $PROMETHEUS_URL</p>
</div>
<div class="stats">
<h2>告警统计</h2>
$(echo "$alert_stats" | while read line; do
severity=$(echo "$line" | cut -d: -f1)
count=$(echo "$line" | cut -d: -f2)
echo "<div class='stat-item'><span class='$severity'>$severity: $count</span></div>"
done)
</div>
<div class="active-alerts">
<h2>活跃告警</h2>
<pre>$(get_active_alerts)</pre>
</div>
</body>
</html>
EOF
log "✅ 告警报告已生成: $report_file"
}
# 显示使用说明
show_usage() {
cat << EOF
使用说明: $0 [命令]
命令:
status 检查告警系统状态
reload 重新加载配置
validate 验证告警规则语法
list 列出活跃告警
silence NAME [DURATION] 静默指定告警(默认1h)
unsilence ID 取消静默
test 发送测试告警
backup 备份告警配置
report 生成告警报告
help 显示此帮助信息
示例:
$0 status # 检查状态
$0 silence SpringBootAppHighCPUUsage 2h   # 静默2小时
$0 test # 发送测试告警
$0 report # 生成报告
EOF
}
# 主函数
main() {
local command=${1:-help}
case $command in
status)
check_alertmanager_status
get_active_alerts
;;
reload)
validate_alert_rules "$ALERT_RULES_DIR/alerts.yml"
reload_prometheus_rules
reload_alertmanager_config
;;
validate)
validate_alert_rules "$ALERT_RULES_DIR/alerts.yml"
;;
list)
get_active_alerts
;;
silence)
local alert_name=${2:-}   # 缺参时避免set -u直接报unbound variable
local duration=${3:-1h}
if [ -z "$alert_name" ]; then
echo -e "${RED}请指定要静默的告警名称${NC}"
exit 1
fi
silence_alert "$alert_name" "$duration"
;;
unsilence)
local silence_id=${2:-}
if [ -z "$silence_id" ]; then
echo -e "${RED}请指定要取消的静默ID${NC}"
exit 1
fi
unsilence_alert "$silence_id"
;;
test)
send_test_alert
;;
backup)
backup_alert_config
;;
report)
generate_alert_report
;;
help|*)
show_usage
;;
esac
}
# 执行主函数
main "$@"
Kubernetes自动扩缩容配置
HPA配置
yaml
# k8s/hpa.yaml - Horizontal Pod Autoscaler配置
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: springboot-app-hpa
namespace: production
labels:
app: springboot-app
version: v1
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: springboot-app
minReplicas: 2
maxReplicas: 10
metrics:
# CPU基于扩缩容
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# 内存基于扩缩容
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# 基于QPS的自定义指标扩缩容
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
# 基于响应时间的自定义指标
- type: Object
object:
metric:
name: http_request_duration_seconds
describedObject:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: springboot-app-ingress
target:
type: Value
value: "500m" # 500毫秒
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
selectPolicy: Max
---
# 基于Prometheus自定义指标的HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: springboot-app-custom-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: springboot-app
minReplicas: 2
maxReplicas: 20
metrics:
# 基于业务指标的扩缩容 - 订单处理速率
- type: Pods
pods:
metric:
name: orders_processed_per_minute
target:
type: AverageValue
averageValue: "1000"
# 基于错误率的扩缩容
- type: Pods
pods:
metric:
name: error_rate
target:
type: AverageValue
averageValue: "50" # 50个错误/分钟时扩容
behavior:
scaleDown:
stabilizationWindowSeconds: 600 # 缩容等待10分钟
policies:
- type: Percent
value: 10
periodSeconds: 120
scaleUp:
stabilizationWindowSeconds: 60 # 扩容等待1分钟
policies:
- type: Percent
value: 100
periodSeconds: 30
---
# VPA (Vertical Pod Autoscaler) 配置
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: springboot-app-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: springboot-app
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: "springboot-app"
minAllowed:
cpu: "100m"
memory: "128Mi"
maxAllowed:
cpu: "2"
memory: "2Gi"
controlledResources: ["cpu", "memory"]
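HPA与VPA配置提交后,可以用kubectl确认指标采集与扩缩容行为是否符合预期(VPA需要集群已安装Vertical Pod Autoscaler组件,资源名沿用上文):
bash
# 应用HPA/VPA配置
kubectl apply -f k8s/hpa.yaml
# 持续观察HPA的指标读数与副本数变化
kubectl -n production get hpa springboot-app-hpa -w
# 查看扩缩容事件,定位指标无法采集等问题
kubectl -n production describe hpa springboot-app-hpa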
自定义指标适配器配置
yaml
# k8s/prometheus-adapter.yaml - Prometheus适配器配置
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-adapter-config
namespace: monitoring
data:
config.yaml: |
rules:
# 自定义指标规则
custom:
# HTTP请求QPS指标
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "http_requests_total"
as: "http_requests_per_second"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
# HTTP请求延迟指标
- seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "http_request_duration_seconds"
as: "http_request_duration_seconds_p95"
metricsQuery: |
histogram_quantile(0.95,
sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (le, <<.GroupBy>>)
)
# 业务指标 - 订单处理速率
- seriesQuery: 'orders_processed_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "orders_processed_total"
as: "orders_processed_per_minute"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>) * 60'
# 业务指标 - 错误率
- seriesQuery: 'orders_failed_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "orders_failed_total"
as: "error_rate"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
# JVM内存使用率
- seriesQuery: 'jvm_memory_used_bytes{namespace!="",pod!="",area="heap"}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "jvm_memory_used_bytes"
as: "jvm_heap_usage_percent"
metricsQuery: |
(sum(jvm_memory_used_bytes{<<.LabelMatchers>>,area="heap"}) by (<<.GroupBy>>) /
sum(jvm_memory_max_bytes{<<.LabelMatchers>>,area="heap"}) by (<<.GroupBy>>)) * 100
# 外部指标(用于集群级扩缩容)
external:
- seriesQuery: 'nginx_ingress_controller_requests{namespace!="",ingress!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
ingress: {resource: "ingress"}
name:
matches: "nginx_ingress_controller_requests"
as: "ingress_requests_per_second"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus-adapter
namespace: monitoring
labels:
app: prometheus-adapter
spec:
replicas: 1
selector:
matchLabels:
app: prometheus-adapter
template:
metadata:
labels:
app: prometheus-adapter
spec:
containers:
- name: prometheus-adapter
image: directxman12/k8s-prometheus-adapter:v0.10.0
args:
- --secure-port=6443
- --cert-dir=/tmp
- --logtostderr=true
- --prometheus-url=http://prometheus-server.monitoring.svc.cluster.local
- --metrics-relist-interval=1m
- --v=6
- --config=/etc/adapter/config.yaml
ports:
- name: https
containerPort: 6443
volumeMounts:
- name: config
mountPath: /etc/adapter
readOnly: true
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 512Mi
volumes:
- name: config
configMap:
name: prometheus-adapter-config
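Prometheus Adapter部署完成后,HPA要能读到自定义指标,前提是custom.metrics.k8s.io聚合API正常工作。可以用下面的命令验证(命名空间与指标名沿用上文配置):
bash
# 确认自定义指标APIService已注册且可用
kubectl get apiservices | grep custom.metrics
# 列出适配器当前暴露的自定义指标
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
# 查询production命名空间下Pod级的QPS指标,HPA即按此取值
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .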
自动扩缩容管理脚本
bash
#!/bin/bash
# autoscaling_manager.sh - 自动扩缩容管理系统
set -euo pipefail
# 配置
KUBE_NAMESPACE="production"
KUBE_CONTEXT="production-cluster"
PROMETHEUS_URL="http://prometheus-server.monitoring.svc.cluster.local:9090"
DEPLOYMENT_NAME="springboot-app"
HPA_NAME="springboot-app-hpa"
# 颜色定义
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
log() {
echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}"
}
# 检查Kubernetes集群状态
check_cluster_status() {
log "检查Kubernetes集群状态..."
if kubectl --context="$KUBE_CONTEXT" cluster-info > /dev/null 2>&1; then
log "✅ Kubernetes集群连接正常"
else
echo -e "${RED}❌ 无法连接Kubernetes集群${NC}"
return 1
fi
# 检查节点状态
local node_count
node_count=$(kubectl --context="$KUBE_CONTEXT" get nodes --no-headers | awk '$2 == "Ready"' | wc -l)
log "集群中Ready节点数量: $node_count"
}
# 获取HPA状态
get_hpa_status() {
log "获取HPA状态..."
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get hpa "$HPA_NAME" -o wide
echo -e "\n${YELLOW}HPA详细状态:${NC}"
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" describe hpa "$HPA_NAME" | grep -A 10 "Metrics:"
}
# 获取当前指标值
get_current_metrics() {
log "获取当前监控指标..."
local metrics=(
"cpu_usage"
"memory_usage"
"http_requests_per_second"
"http_request_duration_seconds_p95"
"jvm_heap_usage_percent"
)
for metric in "${metrics[@]}"; do
echo -e "\n${BLUE}=== $metric ===${NC}"
case $metric in
cpu_usage)
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" top pods -l app="$DEPLOYMENT_NAME"
;;
memory_usage)
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" top pods -l app="$DEPLOYMENT_NAME"
;;
*)
# 从Prometheus获取自定义指标
get_prometheus_metric "$metric"
;;
esac
done
}
# 从Prometheus查询指标
get_prometheus_metric() {
local metric_name="$1"
local query
case $metric_name in
http_requests_per_second)
query='sum(rate(http_requests_total{app="springboot-app"}[2m]))'
;;
http_request_duration_seconds_p95)
query='histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="springboot-app"}[5m])) by (le))'
;;
jvm_heap_usage_percent)
query='(sum(jvm_memory_used_bytes{app="springboot-app",area="heap"}) / sum(jvm_memory_max_bytes{app="springboot-app",area="heap"})) * 100'
;;
container_cpu_usage)
# 报告中引用的容器级指标,来自cAdvisor;namespace/pod标签为假设的默认标签
query='sum(rate(container_cpu_usage_seconds_total{namespace="production",pod=~"springboot-app.*"}[5m]))'
;;
container_memory_usage)
query='sum(container_memory_working_set_bytes{namespace="production",pod=~"springboot-app.*"})'
;;
*)
echo "未知指标: $metric_name"
return 1
;;
esac
local response
response=$(curl -s "$PROMETHEUS_URL/api/v1/query" \
--data-urlencode "query=$query" | jq -r '.data.result[0].value[1] // "N/A"')
echo "当前值: $response"
}
# 调整HPA配置
adjust_hpa_config() {
local min_replicas="$1"
local max_replicas="$2"
local cpu_threshold="${3:-70}"
log "调整HPA配置: min=$min_replicas, max=$max_replicas, cpu_threshold=$cpu_threshold%"
# 备份当前配置
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get hpa "$HPA_NAME" -o yaml > "hpa_backup_$(date +%Y%m%d_%H%M%S).yaml"
# 更新HPA配置
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" patch hpa "$HPA_NAME" -p "
{
\"spec\": {
\"minReplicas\": $min_replicas,
\"maxReplicas\": $max_replicas,
\"metrics\": [
{
\"type\": \"Resource\",
\"resource\": {
\"name\": \"cpu\",
\"target\": {
\"type\": \"Utilization\",
\"averageUtilization\": $cpu_threshold
}
}
}
]
}
}"
log "✅ HPA配置更新完成"
get_hpa_status
}
# 手动扩容
scale_up() {
local replicas="$1"
log "手动扩容到 $replicas 个副本..."
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" scale deployment "$DEPLOYMENT_NAME" --replicas="$replicas"
# 等待扩容完成
wait_for_scale "$replicas"
}
# 手动缩容
scale_down() {
local replicas="$1"
log "手动缩容到 $replicas 个副本..."
kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" scale deployment "$DEPLOYMENT_NAME" --replicas="$replicas"
# 等待缩容完成
wait_for_scale "$replicas"
}
# 等待扩缩容完成
wait_for_scale() {
local desired_replicas="$1"
local timeout=300
local start_time=$(date +%s)
log "等待扩缩容完成,期望副本数: $desired_replicas"
while true; do
local current_time=$(date +%s)
local elapsed=$((current_time - start_time))
if [ $elapsed -gt $timeout ]; then
echo -e "${RED}❌ 扩缩容超时${NC}"
return 1
fi
local current_replicas
current_replicas=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get deployment "$DEPLOYMENT_NAME" -o jsonpath='{.status.readyReplicas}')
if [ "$current_replicas" = "$desired_replicas" ]; then
log "✅ 扩缩容完成,当前副本数: $current_replicas"
break
fi
echo "⏳ 等待扩缩容... (当前: $current_replicas, 期望: $desired_replicas, 已等待: ${elapsed}s)"
sleep 10
done
}
# 模拟负载测试
simulate_load() {
local duration="${1:-5m}"
local concurrent_users="${2:-50}"
log "开始模拟负载测试,持续时间: $duration, 并发用户: $concurrent_users"
# 获取服务地址
local service_url
service_url=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get svc "$DEPLOYMENT_NAME" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
if [ -z "$service_url" ]; then
service_url=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get svc "$DEPLOYMENT_NAME" -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
fi
if [ -z "$service_url" ]; then
echo -e "${RED}❌ 无法获取服务地址${NC}"
return 1
fi
log "服务地址: http://$service_url"
# 使用hey进行负载测试
if command -v hey >/dev/null 2>&1; then
hey -z "$duration" -c "$concurrent_users" \
-m GET \
"http://$service_url/health" \
"http://$service_url/info"
else
echo -e "${YELLOW}⚠️ hey工具未安装,使用curl简单测试${NC}"
# 简单的负载测试
for i in $(seq 1 "$concurrent_users"); do
curl -s "http://$service_url/health" > /dev/null &
done
sleep "$(echo "$duration" | sed 's/m//')m"
fi
log "✅ 负载测试完成"
}
# 生成扩缩容报告
generate_scaling_report() {
local report_file="scaling_report_$(date +%Y%m%d_%H%M%S).html"
log "生成扩缩容报告: $report_file"
# 获取HPA事件历史
local hpa_events
hpa_events=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" describe hpa "$HPA_NAME" | grep -A 20 "Events:" || echo "无事件")
# 获取Pod重启历史
local pod_restarts
pod_restarts=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get pods -l app="$DEPLOYMENT_NAME" -o jsonpath='{.items[*].status.containerStatuses[0].restartCount}' | tr ' ' '\n' | sort -nr | head -1)
cat > "$report_file" << EOF
<!DOCTYPE html>
<html>
<head>
<title>自动扩缩容报告</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; }
.summary { background: #f4f4f4; padding: 20px; border-radius: 5px; }
.metrics { display: grid; grid-template-columns: repeat(2, 1fr); gap: 20px; margin: 20px 0; }
.metric-card { background: white; padding: 15px; border-radius: 5px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }
.events { background: #f8f9fa; padding: 15px; border-radius: 5px; }
</style>
</head>
<body>
<h1>自动扩缩容系统报告</h1>
<div class="summary">
<h2>系统概览</h2>
<p><strong>生成时间:</strong> $(date)</p>
<p><strong>应用:</strong> $DEPLOYMENT_NAME</p>
<p><strong>命名空间:</strong> $KUBE_NAMESPACE</p>
<p><strong>最大Pod重启次数:</strong> $pod_restarts</p>
</div>
<div class="metrics">
<div class="metric-card">
<h3>CPU使用率</h3>
<p>$(get_prometheus_metric "container_cpu_usage")</p>
</div>
<div class="metric-card">
<h3>内存使用率</h3>
<p>$(get_prometheus_metric "container_memory_usage")</p>
</div>
<div class="metric-card">
<h3>请求QPS</h3>
<p>$(get_prometheus_metric "http_requests_per_second")</p>
</div>
<div class="metric-card">
<h3>P95延迟</h3>
<p>$(get_prometheus_metric "http_request_duration_seconds_p95") 秒</p>
</div>
</div>
<div class="events">
<h2>HPA事件历史</h2>
<pre>$hpa_events</pre>
</div>
<div class="recommendations">
<h2>优化建议</h2>
<ul>
<li>根据业务高峰调整HPA的最小/最大副本数</li>
<li>监控JVM内存使用,考虑配置垂直扩缩容(VPA)</li>
<li>设置适当的扩缩容冷却时间,避免频繁扩缩容</li>
</ul>
</div>
</body>
</html>
EOF
log "✅ 扩缩容报告已生成: $report_file"
}
# 显示使用说明
show_usage() {
cat << EOF
使用说明: $0 [命令]
命令:
status 检查集群和HPA状态
metrics 获取当前监控指标
adjust MIN MAX [CPU_THRESHOLD] 调整HPA配置
scale-up REPLICAS 手动扩容
scale-down REPLICAS 手动缩容
simulate [DURATION] [USERS] 模拟负载测试
report 生成扩缩容报告
help 显示此帮助信息
示例:
$0 status # 检查状态
$0 adjust 2 10 80 # 调整HPA: min=2, max=10, CPU阈值=80%
$0 scale-up 5 # 手动扩容到5个副本
$0 simulate 10m 100 # 模拟10分钟100并发负载
$0 report # 生成报告
EOF
}
# 主函数
main() {
local command=${1:-help}
# 设置Kubernetes上下文
kubectl config use-context "$KUBE_CONTEXT" > /dev/null 2>&1 || true
case $command in
status)
check_cluster_status
get_hpa_status
;;
metrics)
get_current_metrics
;;
adjust)
local min_replicas=${2:-}   # 缺参时避免set -u直接报unbound variable
local max_replicas=${3:-}
local cpu_threshold=${4:-70}
if [ -z "$min_replicas" ] || [ -z "$max_replicas" ]; then
echo -e "${RED}请指定最小和最大副本数${NC}"
exit 1
fi
adjust_hpa_config "$min_replicas" "$max_replicas" "$cpu_threshold"
;;
scale-up)
local replicas=${2:-}
if [ -z "$replicas" ]; then
echo -e "${RED}请指定副本数${NC}"
exit 1
fi
scale_up "$replicas"
;;
scale-down)
local replicas=${2:-}
if [ -z "$replicas" ]; then
echo -e "${RED}请指定副本数${NC}"
exit 1
fi
scale_down "$replicas"
;;
simulate)
local duration=${2:-5m}
local users=${3:-50}
simulate_load "$duration" "$users"
;;
report)
generate_scaling_report
;;
help|*)
show_usage
;;
esac
}
# 执行主函数
main "$@"
智能扩缩容策略配置
基于预测的自动扩缩容
yaml
# k8s/keda-autoscaling.yaml - KEDA基于事件的自动扩缩容
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: springboot-app-keda
namespace: production
spec:
scaleTargetRef:
name: springboot-app
kind: Deployment
minReplicaCount: 2
maxReplicaCount: 20
cooldownPeriod: 300
pollingInterval: 30
triggers:
# 基于CPU的扩缩容
- type: cpu
metadata:
type: Utilization
value: "70"
# 基于内存的扩缩容
- type: memory
metadata:
type: Utilization
value: "80"
# 基于Prometheus指标的扩缩容
- type: prometheus
metadata:
serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
metricName: http_requests_per_second
threshold: "100"
query: |
sum(rate(http_requests_total{app="springboot-app"}[2m]))
# 基于响应时间的扩缩容
- type: prometheus
metadata:
serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
metricName: http_request_p95_latency
threshold: "1.0" # 1秒
query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{app="springboot-app"}[5m])) by (le)
)
# 基于消息队列的扩缩容(如果使用消息队列)
- type: kafka
metadata:
bootstrapServers: kafka-broker:9092
consumerGroup: springboot-consumer
topic: orders
lagThreshold: "50"
# 基于定时任务的扩缩容
- type: cron
metadata:
timezone: Asia/Shanghai
start: 0 9 * * 1-5 # 工作日9点开始
end: 0 18 * * 1-5 # 工作日18点结束
desiredReplicas: "5"
# 基于外部API的扩缩容(这里用KEDA的metrics-api scaler;valueLocation指响应JSON中的字段名,为示例假设)
- type: metrics-api
metadata:
url: "https://external-metrics-api.com/load"
valueLocation: "load"
targetValue: "1000"
---
# 预测性扩缩容配置
apiVersion: batch/v1
kind: CronJob
metadata:
name: predictive-scaling
namespace: production
spec:
schedule: "*/5 * * * *" # 每5分钟运行一次
jobTemplate:
spec:
template:
spec:
containers:
- name: predictive-scaling
image: python:3.9
command:
- /bin/sh
- -c
- |
pip install requests pandas scikit-learn
python /scripts/predictive_scaling.py
volumeMounts:
- name: scripts
mountPath: /scripts
volumes:
- name: scripts
configMap:
name: predictive-scaling-scripts
restartPolicy: OnFailure
---
apiVersion: v1
kind: ConfigMap
metadata:
name: predictive-scaling-scripts
namespace: production
data:
predictive_scaling.py: |
#!/usr/bin/env python3
"""
预测性扩缩容脚本
基于历史负载模式预测未来负载并调整HPA
"""
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import json
import os
# 配置
PROMETHEUS_URL = "http://prometheus-server.monitoring.svc.cluster.local:9090"
K8S_API_URL = "https://kubernetes-api"
NAMESPACE = "production"
DEPLOYMENT = "springboot-app"
def query_prometheus(query):
"""查询Prometheus指标"""
response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
if response.status_code == 200:
data = response.json()
if data['data']['result']:
return float(data['data']['result'][0]['value'][1])
return 0
def get_historical_metrics():
"""获取历史指标数据"""
end_time = datetime.now()
start_time = end_time - timedelta(hours=24)
# 查询过去24小时的QPS数据
query = f'avg_over_time(sum(rate(http_requests_total{{app="{DEPLOYMENT}"}}[5m]))[24h:5m])'
historical_data = query_prometheus(query)
return historical_data
def predict_future_load():
"""预测未来负载"""
# 基于历史数据的简单预测
historical_load = get_historical_metrics()
# 考虑时间因素(工作日/周末,白天/晚上)
now = datetime.now()
hour = now.hour
is_weekday = now.weekday() < 5
# 简单的预测模型
if is_weekday:
if 9 <= hour <= 12: # 上午工作时间
predicted_load = historical_load * 1.5
elif 13 <= hour <= 18: # 下午工作时间
predicted_load = historical_load * 1.8
elif 19 <= hour <= 23: # 晚上
predicted_load = historical_load * 1.2
else: # 深夜
predicted_load = historical_load * 0.5
else: # 周末
predicted_load = historical_load * 0.8
return max(predicted_load, 10) # 最小10 QPS
def calculate_desired_replicas(predicted_load):
"""计算期望的副本数"""
# 假设每个Pod可以处理50 QPS
pods_needed = max(2, int(predicted_load / 50) + 1)
return min(pods_needed, 20) # 最大20个副本
def update_hpa(replicas):
"""更新HPA配置"""
# 这里应该调用Kubernetes API更新HPA
# 为了安全,这里只打印日志
print(f"预测性扩缩容: 建议设置副本数为 {replicas}")
# 实际实现应该调用Kubernetes API
# patch_data = {
# "spec": {
# "minReplicas": replicas,
# "maxReplicas": max(replicas + 5, 20)
# }
# }
def main():
print("开始预测性扩缩容分析...")
# 预测未来负载
predicted_load = predict_future_load()
print(f"预测负载: {predicted_load:.2f} QPS")
# 计算所需副本数
desired_replicas = calculate_desired_replicas(predicted_load)
print(f"期望副本数: {desired_replicas}")
# 更新HPA配置
update_hpa(desired_replicas)
print("预测性扩缩容分析完成")
if __name__ == "__main__":
main()
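需要注意KEDA会为每个ScaledObject自动创建并管理一个HPA,因此生产上同一个Deployment不应同时由前文的独立HPA和ScaledObject控制,二者取其一即可。部署后可以用下面的命令确认KEDA与预测性CronJob是否正常(资源名沿用上文):
bash
# 查看ScaledObject状态及其管理的HPA
kubectl -n production get scaledobject springboot-app-keda
kubectl -n production get hpa
# 手动触发一次预测性扩缩容任务并查看输出
kubectl -n production create job predictive-scaling-manual --from=cronjob/predictive-scaling
kubectl -n production logs job/predictive-scaling-manual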
告警和扩缩容架构图
以下图表展示了完整的告警和自动扩缩容系统架构:
mermaid
flowchart TD
A[应用指标收集] --> B[Prometheus]
A --> C[应用日志]
B --> D[Alertmanager]
B --> E[Grafana]
B --> F[Prometheus Adapter]
C --> G[ELK/Loki]
D --> H[告警通知]
H --> H1[邮件]
H --> H2[Slack]
H --> H3[PagerDuty]
F --> I[Kubernetes Metrics API]
I --> J[HPA]
I --> K[KEDA]
J --> L[自动扩缩容]
K --> L
L --> M[Pod扩缩容]
M --> N[资源调整]
G --> O[日志分析]
E --> P[实时监控]
O --> Q[故障诊断]
P --> Q
Q --> R[性能优化]
R --> A
style A fill:#3498db,color:#fff
style D fill:#e74c3c,color:#fff
style L fill:#27ae60,color:#fff
style Q fill:#9b59b6,color:#fff
总结
通过本部分的配置,我们建立了完整的告警和自动扩缩容系统:
核心能力
- 智能告警: 多层次、多渠道的告警系统
- 自动扩缩容: 基于资源使用率和业务指标的自动扩缩容
- 预测性扩缩容: 基于历史模式的智能预测
- 全面监控: 从基础设施到应用层的全方位监控
关键技术
- Prometheus + Alertmanager: 强大的监控告警组合
- Kubernetes HPA: 原生自动扩缩容
- KEDA: 基于事件的自动扩缩容
- 自定义指标: 基于业务指标的智能扩缩容
- 多通知渠道: 邮件、Slack、PagerDuty等
最佳实践
- 分级告警: 根据严重程度分级处理告警
- 智能静默: 避免告警风暴,提高告警有效性
- 渐进式扩缩容: 平稳的扩缩容策略避免业务波动
- 预测性优化: 基于历史数据的智能预测
现在SpringBoot应用具备了企业级的智能运维能力,可以自动应对各种业务场景,确保系统的高可用性和稳定性。