Docker Containerization in Practice: Package and Deploy Your Spring Boot App with One Click (Part 3) - Configuring Alerts and Autoscaling

With a complete CI/CD pipeline in place, the next step is to configure an intelligent alerting system and automatic scaling, so the application can absorb traffic swings in production while staying highly available.

Monitoring and Alerting Configuration

Prometheus Alert Rules

yaml
# prometheus/alerts.yml - Prometheus alert rules
groups:
  - name: springboot-app-alerts
    rules:
      # Application availability
      - alert: SpringBootAppDown
        expr: up{job="springboot-app"} == 0
        for: 1m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Spring Boot application down (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} has been down for more than 1 minute."
          runbook: "https://wiki.company.com/runbooks/springboot-app-down"

      - alert: SpringBootAppHighErrorRate
        # 5xx requests as a share of all requests, so the threshold matches the
        # "error rate above 10%" wording below
        expr: |
          sum by (instance) (rate(http_server_requests_seconds_count{job="springboot-app", status=~"5.."}[5m]))
            /
          sum by (instance) (rate(http_server_requests_seconds_count{job="springboot-app"}[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} 5xx error rate is above 10%, current value: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.company.com/runbooks/springboot-high-error-rate"

      # JVM performance
      - alert: SpringBootAppHighMemoryUsage
        expr: (sum(jvm_memory_used_bytes{job="springboot-app", area="heap"}) by (instance) / sum(jvm_memory_max_bytes{job="springboot-app", area="heap"}) by (instance)) > 0.8
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High memory usage in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} heap usage is above 80%, current value: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.company.com/runbooks/springboot-high-memory"

      - alert: SpringBootAppHighCPUUsage
        expr: process_cpu_usage{job="springboot-app"} > 0.8
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High CPU usage in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} CPU usage is above 80%, current value: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.company.com/runbooks/springboot-high-cpu"

      # Garbage collection
      - alert: SpringBootAppHighGC
        expr: rate(jvm_gc_pause_seconds_sum{job="springboot-app"}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Long GC pauses in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} is spending too much time in GC pauses, 5-minute average: {{ $value }}s of pause per second"
          runbook: "https://wiki.company.com/runbooks/springboot-high-gc"

      # Application performance
      - alert: SpringBootAppHighLatency
        expr: histogram_quantile(0.95, rate(http_server_requests_seconds_bucket{job="springboot-app"}[5m])) > 2
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High response time in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} p95 response time is above 2 seconds, current value: {{ $value }}s"
          runbook: "https://wiki.company.com/runbooks/springboot-high-latency"

      - alert: SpringBootAppLowThroughput
        expr: rate(http_server_requests_seconds_count{job="springboot-app"}[5m]) < 10
        for: 5m
        labels:
          severity: info
          team: backend
        annotations:
          summary: "Low throughput in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} throughput is unusually low, 5-minute average: {{ $value }} req/s"
          runbook: "https://wiki.company.com/runbooks/springboot-low-throughput"

      # Database connection pool
      - alert: SpringBootAppHighDatabaseConnections
        # HikariCP pool metrics as exposed by Micrometer (Spring Boot's default pool)
        expr: hikaricp_connections_max{job="springboot-app"} - hikaricp_connections_active{job="springboot-app"} < 5
        for: 2m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Database connection pool under pressure in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} has fewer than 5 free connections in its pool"
          runbook: "https://wiki.company.com/runbooks/springboot-database-connections"

      # Custom business metrics
      - alert: SpringBootAppHighOrderErrorRate
        expr: rate(orders_failed_total{job="springboot-app"}[5m]) / rate(orders_processed_total{job="springboot-app"}[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High order-processing error rate in Spring Boot application (instance {{ $labels.instance }})"
          description: "Spring Boot application {{ $labels.instance }} order-processing error rate is above 5%, current value: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.company.com/runbooks/springboot-order-errors"

  - name: infrastructure-alerts
    rules:
      # Node resource alerts
      - alert: NodeHighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on node ({{ $labels.instance }})"
          description: "Node {{ $labels.instance }} CPU usage is above 80%, current value: {{ $value }}%"

      - alert: NodeHighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High memory usage on node ({{ $labels.instance }})"
          description: "Node {{ $labels.instance }} memory usage is above 85%, current value: {{ $value }}%"

      - alert: NodeDiskSpaceRunningOut
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Node running out of disk space ({{ $labels.instance }} {{ $labels.mountpoint }})"
          description: "Node {{ $labels.instance }} mount point {{ $labels.mountpoint }} disk usage is above 90%, current value: {{ $value }}%"

      - alert: NodeNetworkSaturation
        expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Node network receive saturation ({{ $labels.instance }})"
          description: "Node {{ $labels.instance }} is receiving more than 100 MB/s"

  - name: kubernetes-alerts
    rules:
      # Kubernetes cluster alerts
      - alert: KubePodCrashLooping
        # More than 3 restarts within 15 minutes indicates a crash loop
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod crash looping ({{ $labels.namespace }}/{{ $labels.pod }})"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      - alert: KubeDeploymentReplicasMismatch
        expr: kube_deployment_status_replicas_available != kube_deployment_spec_replicas
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Deployment replica mismatch ({{ $labels.namespace }}/{{ $labels.deployment }})"
          description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has fewer available replicas than desired"

      - alert: KubeHPAReachedMax
        expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "HPA at maximum replicas ({{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }})"
          description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has reached its maximum replica count {{ $value }}"

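Before loading these rules, it is worth checking them with promtool, which ships with Prometheus. Below is a minimal sketch of a rule unit test; the test file name alerts_test.yml and the sample instance label are assumptions for illustration:

bash
# Write a small unit test for the SpringBootAppDown rule (hypothetical test file)
cat > alerts_test.yml << 'EOF'
rule_files:
  - prometheus/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Instance is up for 2 minutes, then goes down
      - series: 'up{job="springboot-app", instance="app-1:8080"}'
        values: '1 1 0 0 0'
    alert_rule_test:
      - eval_time: 4m
        alertname: SpringBootAppDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: backend
              job: springboot-app
              instance: app-1:8080
EOF

promtool check rules prometheus/alerts.yml   # syntax and semantics
promtool test rules alerts_test.yml          # behavioral unit test
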
Alertmanager Configuration

yaml
# alertmanager/alertmanager.yml - main Alertmanager configuration
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alertmanager@company.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'  # placeholder; prefer smtp_auth_password_file in production

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 5s
      repeat_interval: 5m
      routes:
        - match:
            team: backend
          receiver: 'backend-critical'
        - match:
            team: platform
          receiver: 'platform-critical'
    
    - match:
        severity: warning
      receiver: 'warning-alerts'
      group_wait: 30s
      repeat_interval: 15m
    
    - match:
        severity: info
      receiver: 'info-alerts'
      group_wait: 1m
      repeat_interval: 30m

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'devops@company.com'
        subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
            Alert: {{ .Annotations.summary }}
            Description: {{ .Annotations.description }}
            Details:
            {{ range .Labels.SortedPairs }}  - {{ .Name }}: {{ .Value }}
            {{ end }}
            Runbook: {{ .Annotations.runbook }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: 'sre-team@company.com'
        subject: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
            🚨 CRITICAL ALERT
            ================
            Summary: {{ .Annotations.summary }}
            Description: {{ .Annotations.description }}
            
            Labels:
            {{ range .Labels.SortedPairs }}  - {{ .Name }}: {{ .Value }}
            {{ end }}
            
            Runbook: {{ .Annotations.runbook }}
            Time: {{ .StartsAt }}
          {{ end }}
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#alerts-critical'
        title: '🚨 Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
        color: 'danger'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        description: '{{ .GroupLabels.alertname }}'
        details:
          summary: '{{ .Annotations.summary }}'
          description: '{{ .Annotations.description }}'

  - name: 'backend-critical'
    email_configs:
      - to: 'backend-team@company.com'
        subject: '🚨 BACKEND CRITICAL: {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#backend-alerts'
        title: '🚨 Backend Critical'
    pagerduty_configs:
      - service_key: 'backend-pagerduty-key'

  - name: 'platform-critical'
    email_configs:
      - to: 'platform-team@company.com'
        subject: '🚨 PLATFORM CRITICAL: {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#platform-alerts'

  - name: 'warning-alerts'
    email_configs:
      - to: 'devops@company.com'
        subject: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#alerts-warning'
        color: 'warning'

  - name: 'info-alerts'
    email_configs:
      - to: 'devops@company.com'
        subject: 'ℹ️ INFO: {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#alerts-info'
        color: 'good'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
  
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'info'
    equal: ['alertname', 'cluster', 'service']

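The routing tree above can be exercised offline with amtool, the CLI that ships with Alertmanager. A quick sketch, assuming the file is saved as alertmanager/alertmanager.yml:

bash
# Validate the configuration file
amtool check-config alertmanager/alertmanager.yml

# Ask the routing tree which receiver a given label set would reach
amtool config routes test \
  --config.file=alertmanager/alertmanager.yml \
  severity=critical team=backend
# expected: backend-critical
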
Alert Management Script

bash
#!/bin/bash
# alert_manager.sh - alert management helper

set -euo pipefail

# Configuration
ALERTMANAGER_URL="http://alertmanager:9093"
PROMETHEUS_URL="http://prometheus:9090"
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/XXX/XXX"
ALERT_RULES_DIR="./prometheus"   # directory containing alerts.yml
BACKUP_DIR="./backups/alerts"

# Color definitions
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

log() {
    echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}"
}

# Check Alertmanager health
check_alertmanager_status() {
    log "Checking Alertmanager status..."
    
    if curl -s "${ALERTMANAGER_URL}/-/healthy" > /dev/null; then
        log "✅ Alertmanager is healthy"
        return 0
    else
        echo -e "${RED}❌ Alertmanager is unreachable${NC}"
        return 1
    fi
}

# Reload Prometheus alert rules
reload_prometheus_rules() {
    log "Reloading Prometheus alert rules..."
    
    # Requires Prometheus to run with --web.enable-lifecycle
    if curl -s -X POST "${PROMETHEUS_URL}/-/reload" > /dev/null; then
        log "✅ Prometheus rules reloaded"
    else
        echo -e "${RED}❌ Failed to reload Prometheus rules${NC}"
        return 1
    fi
}

# Reload Alertmanager configuration
reload_alertmanager_config() {
    log "Reloading Alertmanager configuration..."
    
    if curl -s -X POST "${ALERTMANAGER_URL}/-/reload" > /dev/null; then
        log "✅ Alertmanager configuration reloaded"
    else
        echo -e "${RED}❌ Failed to reload Alertmanager configuration${NC}"
        return 1
    fi
}

# Validate alert rule syntax
validate_alert_rules() {
    log "Validating alert rule syntax..."
    
    local rules_file="$1"
    
    if [ ! -f "$rules_file" ]; then
        echo -e "${RED}❌ Alert rules file not found: $rules_file${NC}"
        return 1
    fi
    
    # Validate with promtool if available
    if command -v promtool >/dev/null 2>&1; then
        if promtool check rules "$rules_file"; then
            log "✅ Alert rules are valid"
            return 0
        else
            echo -e "${RED}❌ Alert rules contain syntax errors${NC}"
            return 1
        fi
    else
        echo -e "${YELLOW}⚠️ promtool not installed, skipping syntax check${NC}"
        return 0
    fi
}

# List currently active alerts
get_active_alerts() {
    log "Fetching active alerts..."
    
    local response
    response=$(curl -s "${ALERTMANAGER_URL}/api/v2/alerts" | jq -r '.[] | "\(.labels.alertname) - \(.status.state)"')
    
    if [ -n "$response" ]; then
        echo -e "${YELLOW}Active alerts:${NC}"
        echo "$response"
    else
        log "✅ No active alerts"
    fi
}

# Silence an alert
silence_alert() {
    local alert_name="$1"
    local duration="${2:-1 hour}"   # GNU date syntax, e.g. "2 hours"
    local creator="${3:-alert-manager}"
    local comment="${4:-manual silence}"
    
    log "Silencing alert: $alert_name, duration: $duration"
    
    local silence_data
    silence_data=$(cat << EOF
{
    "matchers": [
        {
            "name": "alertname",
            "value": "$alert_name",
            "isRegex": false
        }
    ],
    "startsAt": "$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")",
    "endsAt": "$(date -u -d "+$duration" +"%Y-%m-%dT%H:%M:%S.000Z")",
    "createdBy": "$creator",
    "comment": "$comment",
    "status": {
        "state": "active"
    }
}
EOF
)
    
    local response
    response=$(curl -s -X POST \
        -H "Content-Type: application/json" \
        -d "$silence_data" \
        "${ALERTMANAGER_URL}/api/v2/silences")
    
    local silence_id
    silence_id=$(echo "$response" | jq -r '.silenceID')
    
    if [ "$silence_id" != "null" ]; then
        log "✅ Alert silenced, ID: $silence_id"
        echo "$silence_id"
    else
        echo -e "${RED}❌ Failed to silence alert${NC}"
        return 1
    fi
}

# Remove a silence
unsilence_alert() {
    local silence_id="$1"
    
    log "Expiring silence: $silence_id"
    
    if curl -s -X DELETE "${ALERTMANAGER_URL}/api/v2/silence/$silence_id" > /dev/null; then
        log "✅ Silence expired"
    else
        echo -e "${RED}❌ Failed to expire silence${NC}"
        return 1
    fi
}

# Send a test alert
send_test_alert() {
    local alert_name="TestAlert"
    local severity="warning"
    local instance="test-instance"
    
    log "Sending test alert..."
    
    local test_alert
    test_alert=$(cat << EOF
[
    {
        "labels": {
            "alertname": "$alert_name",
            "severity": "$severity",
            "instance": "$instance",
            "job": "springboot-app"
        },
        "annotations": {
            "summary": "Test alert - please ignore",
            "description": "This is a test alert used to verify that the alerting pipeline works",
            "runbook": "https://wiki.company.com/runbooks/test-alert"
        },
        "generatorURL": "http://test.example.com"
    }
]
EOF
)
    
    # Use the v2 API; the v1 alerts endpoint was removed in recent Alertmanager releases
    if curl -s -X POST \
        -H "Content-Type: application/json" \
        -d "$test_alert" \
        "${ALERTMANAGER_URL}/api/v2/alerts" > /dev/null; then
        log "✅ Test alert sent"
    else
        echo -e "${RED}❌ Failed to send test alert${NC}"
        return 1
    fi
}

# Back up alerting configuration
backup_alert_config() {
    local backup_timestamp
    backup_timestamp=$(date +"%Y%m%d_%H%M%S")
    local backup_path="$BACKUP_DIR/$backup_timestamp"
    
    log "Backing up alerting configuration to: $backup_path"
    
    mkdir -p "$backup_path"
    
    # Back up alert rules
    cp -r "$ALERT_RULES_DIR" "$backup_path/"
    
    # Back up Alertmanager status (includes the running configuration)
    curl -s "${ALERTMANAGER_URL}/api/v2/status" | jq '.' > "$backup_path/alertmanager_status.json"
    
    # Back up current silences
    curl -s "${ALERTMANAGER_URL}/api/v2/silences" | jq '.' > "$backup_path/silences.json"
    
    log "✅ Alerting configuration backed up"
}

# Generate an HTML alert report
generate_alert_report() {
    local report_file="alert_report_$(date +%Y%m%d_%H%M%S).html"
    
    log "Generating alert report: $report_file"
    
    # Alert counts by severity
    local alert_stats
    alert_stats=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
        --data-urlencode 'query=count by (severity) (ALERTS)' | jq -r '.data.result[] | "\(.metric.severity): \(.value[1])"')
    
    # Render the HTML report
    cat > "$report_file" << EOF
<!DOCTYPE html>
<html>
<head>
    <title>Alerting System Report</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 40px; }
        .summary { background: #f4f4f4; padding: 20px; border-radius: 5px; }
        .critical { color: #e74c3c; }
        .warning { color: #f39c12; }
        .info { color: #3498db; }
        .stats { margin: 20px 0; }
        .stat-item { margin: 10px 0; }
    </style>
</head>
<body>
    <h1>Alerting System Status Report</h1>
    <div class="summary">
        <h2>Overview</h2>
        <p><strong>Generated:</strong> $(date)</p>
        <p><strong>Alertmanager:</strong> $ALERTMANAGER_URL</p>
        <p><strong>Prometheus:</strong> $PROMETHEUS_URL</p>
    </div>
    
    <div class="stats">
        <h2>Alert Counts by Severity</h2>
        $(echo "$alert_stats" | while read -r line; do
            severity=$(echo "$line" | cut -d: -f1)
            count=$(echo "$line" | cut -d: -f2 | tr -d ' ')
            echo "<div class='stat-item'><span class='$severity'>$severity: $count</span></div>"
        done)
    </div>
    
    <div class="active-alerts">
        <h2>Active Alerts</h2>
        <pre>$(get_active_alerts)</pre>
    </div>
</body>
</html>
EOF

    log "✅ Alert report generated: $report_file"
}

# Print usage
show_usage() {
    cat << EOF
Usage: $0 [command]

Commands:
  status              Check alerting system status
  reload              Reload rules and configuration
  validate            Validate alert rule syntax
  list                List active alerts
  silence NAME [DUR]  Silence an alert (DUR in GNU date syntax, default "1 hour")
  unsilence ID        Expire a silence
  test                Send a test alert
  backup              Back up alerting configuration
  report              Generate an alert report
  help                Show this help

Examples:
  $0 status                                      # check status
  $0 silence SpringBootAppHighCPUUsage "2 hours" # silence for 2 hours
  $0 test                                        # send a test alert
  $0 report                                      # generate a report
EOF
}

# Main entry point
main() {
    local command=${1:-help}
    
    case $command in
        status)
            check_alertmanager_status
            get_active_alerts
            ;;
        reload)
            validate_alert_rules "$ALERT_RULES_DIR/alerts.yml"
            reload_prometheus_rules
            reload_alertmanager_config
            ;;
        validate)
            validate_alert_rules "$ALERT_RULES_DIR/alerts.yml"
            ;;
        list)
            get_active_alerts
            ;;
        silence)
            local alert_name="${2:-}"
            local duration="${3:-1 hour}"
            if [ -z "$alert_name" ]; then
                echo -e "${RED}Please specify the alert name to silence${NC}"
                exit 1
            fi
            silence_alert "$alert_name" "$duration"
            ;;
        unsilence)
            local silence_id="${2:-}"
            if [ -z "$silence_id" ]; then
                echo -e "${RED}Please specify the silence ID to expire${NC}"
                exit 1
            fi
            unsilence_alert "$silence_id"
            ;;
        test)
            send_test_alert
            ;;
        backup)
            backup_alert_config
            ;;
        report)
            generate_alert_report
            ;;
        help|*)
            show_usage
            ;;
    esac
}

# Run
main "$@"

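For ad-hoc operations, the same tasks can be done with amtool instead of raw curl; a sketch, assuming amtool can reach the same Alertmanager URL the script uses:

bash
# Silence a noisy alert for two hours, with author and comment for the audit trail
amtool --alertmanager.url=http://alertmanager:9093 \
  silence add alertname=SpringBootAppHighCPUUsage \
  --duration=2h --author="$(whoami)" --comment="rolling restart in progress"

# Inspect and expire silences
amtool --alertmanager.url=http://alertmanager:9093 silence query
amtool --alertmanager.url=http://alertmanager:9093 silence expire <silence-id>
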
Kubernetes Autoscaling Configuration

HPA Configuration

yaml
# k8s/hpa.yaml - Horizontal Pod Autoscaler configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: springboot-app-hpa
  namespace: production
  labels:
    app: springboot-app
    version: v1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: springboot-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    
    # Custom metric: scale on QPS (served by the Prometheus adapter below)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    
    # Custom metric: scale on response time measured at the Ingress
    - type: Object
      object:
        metric:
          name: http_request_duration_seconds
        describedObject:
          apiVersion: networking.k8s.io/v1
          kind: Ingress
          name: springboot-app-ingress
        target:
          type: Value
          value: "500m"  # 0.5 seconds, expressed as a Kubernetes quantity

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 4
          periodSeconds: 30
      selectPolicy: Max
---
# HPA driven by custom Prometheus metrics.
# Note: only one HPA should target a given Deployment at a time; deploy this as
# an alternative to springboot-app-hpa above, not alongside it.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: springboot-app-custom-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: springboot-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # Business metric: order-processing rate
    - type: Pods
      pods:
        metric:
          name: orders_processed_per_minute
        target:
          type: AverageValue
          averageValue: "1000"
    
    # Scale out when the error rate climbs
    - type: Pods
      pods:
        metric:
          name: error_rate
        target:
          type: AverageValue
          averageValue: "50"  # scale out at 50 errors/minute

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # wait 10 minutes before scaling down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
    scaleUp:
      stabilizationWindowSeconds: 60   # wait 1 minute before scaling up
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
---
# VPA (Vertical Pod Autoscaler) configuration.
# Caution: in Auto mode the VPA should not manage the same resources an HPA
# scales on (cpu/memory); combined with the HPAs above, prefer
# updateMode: "Off" and use the VPA for recommendations only.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: springboot-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: springboot-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "springboot-app"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
        controlledResources: ["cpu", "memory"]

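After applying the manifest, the HPA's metric readings and scaling decisions can be inspected directly; a short sketch using the resource names above:

bash
kubectl apply -f k8s/hpa.yaml

# Current targets vs. observed metric values
kubectl -n production get hpa springboot-app-hpa

# Watch scaling decisions live while traffic changes
kubectl -n production get hpa springboot-app-hpa -w

# Event trail explaining each scale-up and scale-down
kubectl -n production describe hpa springboot-app-hpa
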
Custom Metrics Adapter Configuration

yaml
# k8s/prometheus-adapter.yaml - Prometheus adapter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    # Custom metrics, served through the custom.metrics.k8s.io API
    rules:
      # HTTP request QPS
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "http_requests_total"
          as: "http_requests_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
      
      # HTTP request latency (p95)
      - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "http_request_duration_seconds"
          as: "http_request_duration_seconds_p95"
        metricsQuery: |
          histogram_quantile(0.95,
            sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (le, <<.GroupBy>>)
          )
      
      # Business metric: order-processing rate
      - seriesQuery: 'orders_processed_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "orders_processed_total"
          as: "orders_processed_per_minute"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
      
      # Business metric: error rate
      - seriesQuery: 'orders_failed_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "orders_failed_total"
          as: "error_rate"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
      
      # JVM heap usage percentage
      - seriesQuery: 'jvm_memory_used_bytes{namespace!="",pod!="",area="heap"}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "jvm_memory_used_bytes"
          as: "jvm_heap_usage_percent"
        metricsQuery: |
          (sum(jvm_memory_used_bytes{<<.LabelMatchers>>,area="heap"}) by (<<.GroupBy>>) /
          sum(jvm_memory_max_bytes{<<.LabelMatchers>>,area="heap"}) by (<<.GroupBy>>)) * 100
    
    # External metrics (for cluster-level scaling), served through external.metrics.k8s.io
    externalRules:
      - seriesQuery: 'nginx_ingress_controller_requests{namespace!="",ingress!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            ingress: {resource: "ingress"}
        name:
          matches: "nginx_ingress_controller_requests"
          as: "ingress_requests_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter
  namespace: monitoring
  labels:
    app: prometheus-adapter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter
    spec:
      # Note: a complete install also needs a ServiceAccount, RBAC, and an
      # APIService registration for custom.metrics.k8s.io (omitted here for brevity)
      containers:
      - name: prometheus-adapter
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.10.0
        args:
          - --secure-port=6443
          - --cert-dir=/tmp
          - --logtostderr=true
          - --prometheus-url=http://prometheus-server.monitoring.svc.cluster.local
          - --metrics-relist-interval=1m
          - --v=6
          - --config=/etc/adapter/config.yaml
        ports:
        - name: https
          containerPort: 6443
        volumeMounts:
        - name: config
          mountPath: /etc/adapter
          readOnly: true
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 512Mi
      volumes:
      - name: config
        configMap:
          name: prometheus-adapter-config

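Once the adapter is running and registered, the metrics it serves can be read straight from the aggregated API; a quick sketch (the metric names come from the rules above):

bash
# List every custom metric the adapter exposes
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name'

# Read the per-pod QPS metric the HPA consumes
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

# Read the external ingress metric
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/production/ingress_requests_per_second" | jq .
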
Autoscaling Management Script

bash
#!/bin/bash
# autoscaling_manager.sh - autoscaling management helper

set -euo pipefail

# Configuration
KUBE_NAMESPACE="production"
KUBE_CONTEXT="production-cluster"
PROMETHEUS_URL="http://prometheus-server.monitoring.svc.cluster.local:9090"
DEPLOYMENT_NAME="springboot-app"
HPA_NAME="springboot-app-hpa"

# Color definitions
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

log() {
    echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}"
}

# Check Kubernetes cluster status
check_cluster_status() {
    log "Checking Kubernetes cluster status..."
    
    if kubectl --context="$KUBE_CONTEXT" cluster-info > /dev/null 2>&1; then
        log "✅ Kubernetes cluster is reachable"
    else
        echo -e "${RED}❌ Cannot reach the Kubernetes cluster${NC}"
        return 1
    fi
    
    # Count Ready nodes (exact status match so NotReady is not counted)
    local node_count
    node_count=$(kubectl --context="$KUBE_CONTEXT" get nodes --no-headers | awk '$2 == "Ready"' | wc -l)
    log "Ready nodes in cluster: $node_count"
}

# Show HPA status
get_hpa_status() {
    log "Fetching HPA status..."
    
    kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get hpa "$HPA_NAME" -o wide
    
    echo -e "\n${YELLOW}HPA details:${NC}"
    kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" describe hpa "$HPA_NAME" | grep -A 10 "Metrics:"
}

# Show current metric values
get_current_metrics() {
    log "Fetching current metrics..."
    
    local metrics=(
        "cpu_usage"
        "memory_usage"
        "http_requests_per_second"
        "http_request_duration_seconds_p95"
        "jvm_heap_usage_percent"
    )
    
    for metric in "${metrics[@]}"; do
        echo -e "\n${BLUE}=== $metric ===${NC}"
        
        case $metric in
            cpu_usage|memory_usage)
                # kubectl top reports both CPU and memory per pod
                kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" top pods -l app="$DEPLOYMENT_NAME"
                ;;
            *)
                # Custom metrics come from Prometheus
                get_prometheus_metric "$metric"
                ;;
        esac
    done
}

# Query a metric from Prometheus
get_prometheus_metric() {
    local metric_name="$1"
    
    local query
    case $metric_name in
        http_requests_per_second)
            query='sum(rate(http_requests_total{app="springboot-app"}[2m]))'
            ;;
        http_request_duration_seconds_p95)
            query='histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="springboot-app"}[5m])) by (le))'
            ;;
        jvm_heap_usage_percent)
            query='(sum(jvm_memory_used_bytes{app="springboot-app",area="heap"}) / sum(jvm_memory_max_bytes{app="springboot-app",area="heap"})) * 100'
            ;;
        container_cpu_usage)
            # cAdvisor metrics, used by the HTML report below
            query="sum(rate(container_cpu_usage_seconds_total{namespace=\"$KUBE_NAMESPACE\", pod=~\"$DEPLOYMENT_NAME.*\"}[5m]))"
            ;;
        container_memory_usage)
            query="sum(container_memory_working_set_bytes{namespace=\"$KUBE_NAMESPACE\", pod=~\"$DEPLOYMENT_NAME.*\"})"
            ;;
        *)
            echo "Unknown metric: $metric_name"
            return 1
            ;;
    esac
    
    local response
    response=$(curl -s "$PROMETHEUS_URL/api/v1/query" \
        --data-urlencode "query=$query" | jq -r '.data.result[0].value[1] // "N/A"')
    
    echo "Current value: $response"
}

# Adjust HPA configuration
adjust_hpa_config() {
    local min_replicas="$1"
    local max_replicas="$2"
    local cpu_threshold="${3:-70}"
    
    log "Adjusting HPA: min=$min_replicas, max=$max_replicas, cpu_threshold=$cpu_threshold%"
    
    # Back up the current configuration first
    kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get hpa "$HPA_NAME" -o yaml > "hpa_backup_$(date +%Y%m%d_%H%M%S).yaml"
    
    # Patch the HPA. Note: patching .spec.metrics replaces the whole list,
    # so only the CPU metric remains after this call.
    kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" patch hpa "$HPA_NAME" -p "
    {
        \"spec\": {
            \"minReplicas\": $min_replicas,
            \"maxReplicas\": $max_replicas,
            \"metrics\": [
                {
                    \"type\": \"Resource\",
                    \"resource\": {
                        \"name\": \"cpu\",
                        \"target\": {
                            \"type\": \"Utilization\",
                            \"averageUtilization\": $cpu_threshold
                        }
                    }
                }
            ]
        }
    }"
    
    log "✅ HPA configuration updated"
    get_hpa_status
}

# Manual scale up
scale_up() {
    local replicas="$1"
    
    log "Manually scaling up to $replicas replicas..."
    
    # Note: if an HPA targets this Deployment, it will soon override manual replica counts
    kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" scale deployment "$DEPLOYMENT_NAME" --replicas="$replicas"
    
    # Wait for the scale-up to complete
    wait_for_scale "$replicas"
}

# Manual scale down
scale_down() {
    local replicas="$1"
    
    log "Manually scaling down to $replicas replicas..."
    
    kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" scale deployment "$DEPLOYMENT_NAME" --replicas="$replicas"
    
    # Wait for the scale-down to complete
    wait_for_scale "$replicas"
}

# Wait for scaling to complete
wait_for_scale() {
    local desired_replicas="$1"
    local timeout=300
    local start_time
    start_time=$(date +%s)
    
    log "Waiting for scaling to finish, desired replicas: $desired_replicas"
    
    while true; do
        local current_time
        current_time=$(date +%s)
        local elapsed=$((current_time - start_time))
        
        if [ $elapsed -gt $timeout ]; then
            echo -e "${RED}❌ Scaling timed out${NC}"
            return 1
        fi
        
        local current_replicas
        current_replicas=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get deployment "$DEPLOYMENT_NAME" -o jsonpath='{.status.readyReplicas}')
        current_replicas=${current_replicas:-0}   # readyReplicas is empty when no pod is ready
        
        if [ "$current_replicas" = "$desired_replicas" ]; then
            log "✅ Scaling complete, current replicas: $current_replicas"
            break
        fi
        
        echo "⏳ Waiting for scaling... (current: $current_replicas, desired: $desired_replicas, elapsed: ${elapsed}s)"
        sleep 10
    done
}

# Simulate load
simulate_load() {
    local duration="${1:-5m}"
    local concurrent_users="${2:-50}"
    
    log "Starting load test, duration: $duration, concurrent users: $concurrent_users"
    
    # Resolve the service address (LoadBalancer IP, falling back to hostname)
    local service_url
    service_url=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get svc "$DEPLOYMENT_NAME" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    if [ -z "$service_url" ]; then
        service_url=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get svc "$DEPLOYMENT_NAME" -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
    fi
    
    if [ -z "$service_url" ]; then
        echo -e "${RED}❌ Could not resolve the service address${NC}"
        return 1
    fi
    
    log "Service address: http://$service_url"
    
    # Prefer hey for load generation (hey targets a single URL; /health is cheap and always on)
    if command -v hey >/dev/null 2>&1; then
        hey -z "$duration" -c "$concurrent_users" \
            -m GET \
            "http://$service_url/health"
    else
        echo -e "${YELLOW}⚠️ hey is not installed, falling back to a simple curl burst${NC}"
        
        # Very rough fallback: one request per "user", in parallel
        for i in $(seq 1 "$concurrent_users"); do
            curl -s "http://$service_url/health" > /dev/null &
        done
        wait
        
        sleep "$duration"   # GNU sleep accepts suffixed durations such as 5m
    fi
    
    log "✅ Load test finished"
}

# Generate an HTML scaling report
generate_scaling_report() {
    local report_file="scaling_report_$(date +%Y%m%d_%H%M%S).html"
    
    log "Generating scaling report: $report_file"
    
    # HPA event history
    local hpa_events
    hpa_events=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" describe hpa "$HPA_NAME" | grep -A 20 "Events:" || echo "no events")
    
    # Highest container restart count across the app's pods
    local pod_restarts
    pod_restarts=$(kubectl --context="$KUBE_CONTEXT" -n "$KUBE_NAMESPACE" get pods -l app="$DEPLOYMENT_NAME" -o jsonpath='{.items[*].status.containerStatuses[0].restartCount}' | tr ' ' '\n' | sort -nr | head -1)
    
    cat > "$report_file" << EOF
<!DOCTYPE html>
<html>
<head>
    <title>Autoscaling Report</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 40px; }
        .summary { background: #f4f4f4; padding: 20px; border-radius: 5px; }
        .metrics { display: grid; grid-template-columns: repeat(2, 1fr); gap: 20px; margin: 20px 0; }
        .metric-card { background: white; padding: 15px; border-radius: 5px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }
        .events { background: #f8f9fa; padding: 15px; border-radius: 5px; }
    </style>
</head>
<body>
    <h1>Autoscaling System Report</h1>
    
    <div class="summary">
        <h2>Overview</h2>
        <p><strong>Generated:</strong> $(date)</p>
        <p><strong>Application:</strong> $DEPLOYMENT_NAME</p>
        <p><strong>Namespace:</strong> $KUBE_NAMESPACE</p>
        <p><strong>Max pod restart count:</strong> $pod_restarts</p>
    </div>
    
    <div class="metrics">
        <div class="metric-card">
            <h3>CPU usage</h3>
            <p>$(get_prometheus_metric "container_cpu_usage")</p>
        </div>
        <div class="metric-card">
            <h3>Memory usage</h3>
            <p>$(get_prometheus_metric "container_memory_usage")</p>
        </div>
        <div class="metric-card">
            <h3>Request QPS</h3>
            <p>$(get_prometheus_metric "http_requests_per_second")</p>
        </div>
        <div class="metric-card">
            <h3>P95 latency</h3>
            <p>$(get_prometheus_metric "http_request_duration_seconds_p95") s</p>
        </div>
    </div>
    
    <div class="events">
        <h2>HPA Event History</h2>
        <pre>$hpa_events</pre>
    </div>
    
    <div class="recommendations">
        <h2>Recommendations</h2>
        <ul>
            <li>Adjust the HPA min/max replica counts around known traffic peaks</li>
            <li>Watch JVM memory usage and consider vertical scaling (VPA)</li>
            <li>Use sensible stabilization windows to avoid scaling thrash</li>
        </ul>
    </div>
</body>
</html>
EOF

    log "✅ Scaling report generated: $report_file"
}

# Print usage
show_usage() {
    cat << EOF
Usage: $0 [command]

Commands:
  status                          Check cluster and HPA status
  metrics                         Show current metrics
  adjust MIN MAX [CPU_THRESHOLD]  Adjust HPA configuration
  scale-up REPLICAS               Manual scale up
  scale-down REPLICAS             Manual scale down
  simulate [DURATION] [USERS]     Run a load test
  report                          Generate a scaling report
  help                            Show this help

Examples:
  $0 status            # check status
  $0 adjust 2 10 80    # HPA: min=2, max=10, CPU threshold=80%
  $0 scale-up 5        # scale up to 5 replicas
  $0 simulate 10m 100  # 10-minute load test at 100 concurrent users
  $0 report            # generate a report
EOF
}

# Main entry point
main() {
    local command=${1:-help}
    
    # Select the Kubernetes context
    kubectl config use-context "$KUBE_CONTEXT" > /dev/null 2>&1 || true
    
    case $command in
        status)
            check_cluster_status
            get_hpa_status
            ;;
        metrics)
            get_current_metrics
            ;;
        adjust)
            local min_replicas="${2:-}"
            local max_replicas="${3:-}"
            local cpu_threshold="${4:-70}"
            
            if [ -z "$min_replicas" ] || [ -z "$max_replicas" ]; then
                echo -e "${RED}Please specify the minimum and maximum replica counts${NC}"
                exit 1
            fi
            
            adjust_hpa_config "$min_replicas" "$max_replicas" "$cpu_threshold"
            ;;
        scale-up)
            local replicas="${2:-}"
            if [ -z "$replicas" ]; then
                echo -e "${RED}Please specify the replica count${NC}"
                exit 1
            fi
            scale_up "$replicas"
            ;;
        scale-down)
            local replicas="${2:-}"
            if [ -z "$replicas" ]; then
                echo -e "${RED}Please specify the replica count${NC}"
                exit 1
            fi
            scale_down "$replicas"
            ;;
        simulate)
            local duration="${2:-5m}"
            local users="${3:-50}"
            simulate_load "$duration" "$users"
            ;;
        report)
            generate_scaling_report
            ;;
        help|*)
            show_usage
            ;;
    esac
}

# Run
main "$@"

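A simple way to see the whole loop work end to end is to drive traffic with the script in one terminal and watch the HPA react in another; a sketch using the names defined above:

bash
# Terminal 1: generate load for 10 minutes at 100 concurrent users
./autoscaling_manager.sh simulate 10m 100

# Terminal 2: watch the HPA targets and replica count react
kubectl -n production get hpa springboot-app-hpa -w

# Terminal 3 (optional): watch pods come and go
kubectl -n production get pods -l app=springboot-app -w
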
Smart Autoscaling Strategies

Prediction-Based Autoscaling

yaml
# k8s/keda-autoscaling.yaml - KEDA event-driven autoscaling.
# Note: KEDA creates and manages its own HPA for the target, so remove the
# standalone HPAs above before pointing a ScaledObject at the same Deployment.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: springboot-app-keda
  namespace: production
spec:
  scaleTargetRef:
    name: springboot-app
    kind: Deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 300
  pollingInterval: 30
  
  triggers:
    # CPU-based scaling
    - type: cpu
      metadata:
        type: Utilization
        value: "70"
    
    # Memory-based scaling
    - type: memory
      metadata:
        type: Utilization
        value: "80"
    
    # Scaling on a Prometheus metric (QPS)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: http_requests_per_second
        threshold: "100"
        query: |
          sum(rate(http_requests_total{app="springboot-app"}[2m]))
    
    # Scaling on response time
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: http_request_p95_latency
        threshold: "1.0"  # 1 second
        query: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket{app="springboot-app"}[5m])) by (le)
          )
    
    # Scaling on consumer lag (if a message queue is in use)
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker:9092
        consumerGroup: springboot-consumer
        topic: orders
        lagThreshold: "50"
    
    # Schedule-based scaling
    - type: cron
      metadata:
        timezone: Asia/Shanghai
        start: 0 9 * * 1-5    # weekdays, starting at 09:00
        end: 0 18 * * 1-5     # weekdays, ending at 18:00
        desiredReplicas: "5"
    
    # Scaling driven by an external scaler (a gRPC service implementing KEDA's
    # external-scaler contract; the address below is an assumed placeholder)
    - type: external
      metadata:
        scalerAddress: external-metrics-api.com:9090
        threshold: "1000"
---
# Predictive scaling CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: predictive-scaling
  namespace: production
spec:
  schedule: "*/5 * * * *"  # run every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: predictive-scaling
            image: python:3.9
            command:
            - /bin/sh
            - -c
            - |
              # For production, bake the dependency into a custom image instead
              pip install requests
              python /scripts/predictive_scaling.py
            volumeMounts:
            - name: scripts
              mountPath: /scripts
          volumes:
          - name: scripts
            configMap:
              name: predictive-scaling-scripts
          restartPolicy: OnFailure
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: predictive-scaling-scripts
  namespace: production
data:
  predictive_scaling.py: |
    #!/usr/bin/env python3
    """
    Predictive scaling script.
    Predicts future load from historical traffic patterns and suggests HPA settings.
    """
    
    from datetime import datetime
    
    import requests
    
    # Configuration
    PROMETHEUS_URL = "http://prometheus-server.monitoring.svc.cluster.local:9090"
    NAMESPACE = "production"
    DEPLOYMENT = "springboot-app"
    
    def query_prometheus(query):
        """Run an instant query against Prometheus and return the first value."""
        response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
        if response.status_code == 200:
            data = response.json()
            if data['data']['result']:
                return float(data['data']['result'][0]['value'][1])
        return 0
    
    def get_historical_metrics():
        """Average QPS over the last 24 hours (Prometheus subquery)."""
        query = f'avg_over_time(sum(rate(http_requests_total{{app="{DEPLOYMENT}"}}[5m]))[24h:5m])'
        return query_prometheus(query)
    
    def predict_future_load():
        """Predict the upcoming load."""
        historical_load = get_historical_metrics()
        
        # Factor in the time of day and day of week
        now = datetime.now()
        hour = now.hour
        is_weekday = now.weekday() < 5
        
        # Deliberately simple prediction model
        if is_weekday:
            if 9 <= hour <= 12:  # morning working hours
                predicted_load = historical_load * 1.5
            elif 13 <= hour <= 18:  # afternoon working hours
                predicted_load = historical_load * 1.8
            elif 19 <= hour <= 23:  # evening
                predicted_load = historical_load * 1.2
            else:  # late night
                predicted_load = historical_load * 0.5
        else:  # weekend
            predicted_load = historical_load * 0.8
        
        return max(predicted_load, 10)  # floor of 10 QPS
    
    def calculate_desired_replicas(predicted_load):
        """Derive a replica count from the predicted load."""
        # Assume each pod can handle 50 QPS
        pods_needed = max(2, int(predicted_load / 50) + 1)
        return min(pods_needed, 20)  # cap at 20 replicas
    
    def update_hpa(replicas):
        """Apply the recommendation to the HPA."""
        # A real implementation would patch the HPA through the Kubernetes API;
        # for safety, this version only logs the recommendation.
        print(f"Predictive scaling: recommended replica count is {replicas}")
        
        # patch_data = {
        #     "spec": {
        #         "minReplicas": replicas,
        #         "maxReplicas": max(replicas + 5, 20)
        #     }
        # }
    
    def main():
        print("Starting predictive scaling analysis...")
        
        # Predict the upcoming load
        predicted_load = predict_future_load()
        print(f"Predicted load: {predicted_load:.2f} QPS")
        
        # Derive the replica count
        desired_replicas = calculate_desired_replicas(predicted_load)
        print(f"Desired replicas: {desired_replicas}")
        
        # Apply (or in this sketch, log) the recommendation
        update_hpa(desired_replicas)
        
        print("Predictive scaling analysis complete")
    
    if __name__ == "__main__":
        main()

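Because KEDA materializes each ScaledObject as an HPA of its own (named keda-hpa-<scaledobject>), its behavior can be inspected with the usual tooling; a sketch, assuming the KEDA operator is installed:

bash
# READY/ACTIVE columns show whether the triggers resolve
kubectl -n production get scaledobject springboot-app-keda

# The HPA that KEDA manages on our behalf
kubectl -n production get hpa keda-hpa-springboot-app-keda

# Trigger errors surface as events on the ScaledObject
kubectl -n production describe scaledobject springboot-app-keda
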
Alerting and Autoscaling Architecture

The diagram below shows the complete alerting and autoscaling architecture:

mermaid
flowchart TD
    A[Application metrics] --> B[Prometheus]
    A --> C[Application logs]
    B --> D[Alertmanager]
    B --> E[Grafana]
    B --> F[Prometheus Adapter]
    C --> G[ELK/Loki]
    D --> H[Alert notifications]
    H --> H1[Email]
    H --> H2[Slack]
    H --> H3[PagerDuty]
    F --> I[Kubernetes Metrics API]
    I --> J[HPA]
    I --> K[KEDA]
    J --> L[Autoscaling]
    K --> L
    L --> M[Pod scale out/in]
    M --> N[Resource adjustment]
    G --> O[Log analysis]
    E --> P[Real-time monitoring]
    O --> Q[Fault diagnosis]
    P --> Q
    Q --> R[Performance tuning]
    R --> A
    style A fill:#3498db,color:#fff
    style D fill:#e74c3c,color:#fff
    style L fill:#27ae60,color:#fff
    style Q fill:#9b59b6,color:#fff

Summary

With the configuration in this part, we have built a complete alerting and autoscaling system:

Core Capabilities

  1. Intelligent alerting: multi-level, multi-channel alert routing
  2. Autoscaling: scaling driven by resource utilization and business metrics
  3. Predictive scaling: forecasts based on historical traffic patterns
  4. Full-stack monitoring: coverage from the infrastructure up to the application layer

Key Technologies

  • Prometheus + Alertmanager: a powerful monitoring and alerting pair
  • Kubernetes HPA: native horizontal autoscaling
  • KEDA: event-driven autoscaling
  • Custom metrics: scaling on business-level signals
  • Multiple notification channels: email, Slack, PagerDuty, and more

Best Practices

  1. Tiered alerting: route and handle alerts by severity
  2. Deliberate silencing: avoid alert storms and keep alerts actionable
  3. Gradual scaling: smooth scaling policies prevent traffic disruption
  4. Predictive optimization: let historical data inform capacity decisions

The Spring Boot application now has enterprise-grade operational intelligence: it can respond automatically to changing business conditions while staying highly available and stable.
