Spring Boot 3.2.x Actuator Monitoring Guide: From Health Checks to an Enterprise Monitoring Stack

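Suppose a production alert fires in the small hours and, half awake, you find yourself asking:

  1. Which service is failing: the gateway, the order service, or the payment service?
  2. Which component is failing: the database, Redis, or the message queue?
  3. How serious is it: degraded performance, or a complete outage?

Only after 20 minutes of digging do you discover that the Redis connection pool is exhausted. With proper monitoring in place, the same problem could have been pinpointed in 5 minutes.

This is the problem Actuator addresses: it gives a Spring Boot application production-ready monitoring capabilities, so that you can:

  • See the application's health in real time
  • Monitor system resources and performance metrics
  • Adjust application configuration at runtime
  • Locate and diagnose problems quickly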

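I. Spring Boot Actuator Core Concepts

1. What is Actuator?

Spring Boot Actuator adds production-ready features to an application and exposes monitoring and management functions through HTTP or JMX endpoints. It covers four main areas:

  1. Health checks: the health of the application and its dependencies
  2. Metrics collection: JVM, system, and application performance metrics
  3. Information exposure: application configuration and environment information
  4. Operations: changing log levels, shutting down the application, and so on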

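2. Endpoints

Actuator exposes monitoring data through endpoints. Spring Boot 3.2.x ships with more than 20 built-in endpoints; the most commonly used ones are listed below. Every endpoint except shutdown is enabled by default, but only health is exposed over HTTP until others are listed under management.endpoints.web.exposure.include.

Endpoint     Path                    Description                             Enabled by default
health       /actuator/health        Application health status               Yes
info         /actuator/info          Custom application information          Yes
metrics      /actuator/metrics       Application metrics                     Yes
loggers      /actuator/loggers       View and change log levels              Yes
env          /actuator/env           Environment and configuration values    Yes
beans        /actuator/beans         All Spring beans                        Yes
mappings     /actuator/mappings      URL mappings                            Yes
threaddump   /actuator/threaddump    Thread dump                             Yes
heapdump     /actuator/heapdump      Heap dump                               Yes
shutdown     /actuator/shutdown      Gracefully shuts down the application   No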

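II. Quick Start: Basic Configuration

1. Add the dependencies

xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<!-- Web support (needed for the HTTP endpoints) -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

<!-- Prometheus registry (needed for the /actuator/prometheus endpoint used below) -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>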

2. Basic configuration

yaml
# application.yml
spring:
  application:
    name: order-service

management:
  endpoints:
    web:
      # Base path for all endpoints
      base-path: /actuator
      exposure:
        # Which endpoints to expose (keep this list tight in production)
        include: "health,info,metrics,prometheus"
      # CORS settings
      cors:
        allowed-origins: "http://localhost:3000"
        allowed-methods: "GET,POST"

  # Per-endpoint settings
  endpoint:
    health:
      enabled: true
      show-details: when_authorized  # when component details are shown
      show-components: when_authorized
      # Health groups are configured under management.endpoint.health.group
      group:
        readiness:
          include: "db,redis,diskSpace"
          # The value must have the form "server:/path" or "management:/path"
          additional-path: "server:/readiness"
        liveness:
          include: "ping"
          additional-path: "server:/liveness"
    info:
      enabled: true
    metrics:
      enabled: true
    prometheus:
      enabled: true

  # Health indicator settings
  health:
    # Default for all indicators
    defaults:
      enabled: true
    # Individual indicators
    redis:
      enabled: true
    db:
      enabled: true
    diskspace:
      enabled: true
      threshold: 10MB

  # Prometheus export (property names as of Spring Boot 3.x)
  prometheus:
    metrics:
      export:
        enabled: true
        step: 1m

  # Metrics settings
  metrics:
    enable:
      jvm: true
      system: true
      logback: true
      process: true
    distribution:
      percentiles-histogram:
        http.server.requests: true
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}
    # HTTP server requests are timed automatically in Spring Boot 3.x, so the 2.x-era
    # management.metrics.web.server.request.autotime.* properties are no longer needed.

3. Verify the basic configuration

After starting the application, call the following endpoints:

bash
# Health check
curl http://localhost:8080/actuator/health

# Application information
curl http://localhost:8080/actuator/info

# List of all metrics
curl http://localhost:8080/actuator/metrics

# A specific metric
curl http://localhost:8080/actuator/metrics/jvm.memory.used

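Response example (with show-details: when_authorized, the components section is returned only to authorized callers; anonymous callers see just the status)

json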
// GET /actuator/health
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 500068036608,
        "free": 350067040256,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    }
  }
}
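
The top-level status in the response is derived from the statuses of the individual components. The following is a minimal, JDK-only sketch of that idea, assuming the default severity order used by Spring Boot's SimpleStatusAggregator (DOWN, OUT_OF_SERVICE, UP, UNKNOWN). The class name StatusAggregationSketch and the use of plain strings are illustrative assumptions, not Actuator API.

```java
import java.util.List;

// Hypothetical illustration only; the real logic lives in Spring Boot's SimpleStatusAggregator.
class StatusAggregationSketch {

    // Assumed default severity order: the first entry found among the components wins.
    private static final List<String> ORDER = List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    static String aggregate(List<String> componentStatuses) {
        for (String candidate : ORDER) {
            if (componentStatuses.contains(candidate)) {
                return candidate;
            }
        }
        // Nothing recognizable was reported (for example, an empty list).
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(aggregate(List.of("UP", "UP", "UP")));
        System.out.println(aggregate(List.of("UP", "DOWN", "UP")));
    }
}
```

In other words, a single DOWN component is enough to turn the whole application DOWN, which is why the choice of indicators in each health group matters.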

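III. Core Endpoints in Depth

1. Health endpoint

Health checks are the lifeline of a microservice architecture, and Spring Boot 3.2 continues to refine the health check mechanism.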

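1.1 Built-in health indicators

java
import java.util.Map;

import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.health.CompositeHealth;
import org.springframework.boot.actuate.health.HealthComponent;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.ApplicationContext;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

// Lists the registered health indicators once the application is ready
@Component
@Slf4j
public class HealthCheckDemo {

    @EventListener(ApplicationReadyEvent.class)
    public void showHealthIndicators(ApplicationReadyEvent event) {
        ApplicationContext context = event.getApplicationContext();
        HealthEndpoint healthEndpoint = context.getBean(HealthEndpoint.class);

        // Overall health
        HealthComponent health = healthEndpoint.health();
        log.info("Overall application status: {}", health.getStatus());

        // HealthComponent only exposes getStatus(); the per-indicator map is on CompositeHealth
        if (health instanceof CompositeHealth composite) {
            Map<String, HealthComponent> components = composite.getComponents();
            log.info("Number of health indicators: {}", components.size());

            components.forEach((name, component) ->
                log.info("Indicator: {} - status: {}", name, component.getStatus()));
        }
    }
}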

Output

2024-05-20 10:00:00.000 INFO  c.e.demo.HealthCheckDemo : Overall application status: UP
2024-05-20 10:00:00.001 INFO  c.e.demo.HealthCheckDemo : Number of health indicators: 8
2024-05-20 10:00:00.002 INFO  c.e.demo.HealthCheckDemo : Indicator: db - status: UP
2024-05-20 10:00:00.003 INFO  c.e.demo.HealthCheckDemo : Indicator: diskSpace - status: UP
2024-05-20 10:00:00.004 INFO  c.e.demo.HealthCheckDemo : Indicator: ping - status: UP
2024-05-20 10:00:00.005 INFO  c.e.demo.HealthCheckDemo : Indicator: redis - status: UP

1.2 Custom health indicators
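java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.connection.RedisConnection;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.core.RedisCallback;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Component;

// Custom Redis health indicator
@Component
@Slf4j
public class CustomRedisHealthIndicator implements HealthIndicator {

    private final RedisTemplate<String, String> redisTemplate;
    private final RedisConnectionFactory connectionFactory;

    public CustomRedisHealthIndicator(RedisTemplate<String, String> redisTemplate,
                                     RedisConnectionFactory connectionFactory) {
        this.redisTemplate = redisTemplate;
        this.connectionFactory = connectionFactory;
    }

    @Override
    public Health health() {
        try {
            // 1. Check connectivity
            long start = System.currentTimeMillis();
            String result = redisTemplate.execute((RedisCallback<String>) connection ->
                connection.ping());
            long responseTime = System.currentTimeMillis() - start;

            if (!"PONG".equals(result)) {
                return Health.down()
                    .withDetail("error", "Unexpected Redis response: " + result)
                    .build();
            }

            // 2. Read the full INFO output (memory, clients, and server sections) and
            //    close the connection afterwards so it is returned to the pool
            Properties info;
            try (RedisConnection connection = connectionFactory.getConnection()) {
                info = connection.serverCommands().info();
            }
            long usedMemory = Long.parseLong(info.getProperty("used_memory", "0"));
            long maxMemory = Long.parseLong(info.getProperty("maxmemory", "0"));
            // maxmemory is 0 when no limit is configured
            double memoryUsage = maxMemory > 0 ?
                (double) usedMemory / maxMemory * 100 : 0;

            // 3. Check the number of connected clients
            int connectedClients = Integer.parseInt(info.getProperty("connected_clients", "0"));

            // Build the health result
            Health.Builder builder = Health.up()
                .withDetail("response_time", responseTime + "ms")
                .withDetail("memory_usage", String.format("%.2f%%", memoryUsage))
                .withDetail("connected_clients", connectedClients)
                .withDetail("version", info.getProperty("redis_version", "unknown"));

            // Collect warnings in a list so that one does not overwrite another
            List<String> warnings = new ArrayList<>();
            if (responseTime > 100) {
                warnings.add("slow response time");
            }
            if (memoryUsage > 80) {
                warnings.add("high memory usage");
            }
            if (!warnings.isEmpty()) {
                builder.withDetail("warnings", warnings);
            }

            return builder.build();

        } catch (Exception e) {
            log.error("Redis health check failed", e);
            // Health.down(e) already records the exception as the "error" detail
            return Health.down(e).build();
        }
    }
}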

Registering custom indicators

java
@Configuration
public class HealthIndicatorConfig {

    // CustomRedisHealthIndicator is annotated with @Component, so component scanning already
    // registers it. Declaring a second @Bean named customRedisHealthIndicator here would
    // clash with that bean definition at startup.

    // Connection pool health check
    @Bean
    @ConditionalOnBean(DataSource.class)
    public HealthIndicator datasourceHealthIndicator(DataSource dataSource) {
        return () -> {
            try (Connection conn = dataSource.getConnection()) {
                // Validate with a 5-second timeout and report DOWN if validation fails
                if (!conn.isValid(5)) {
                    return Health.down()
                        .withDetail("validation", false)
                        .build();
                }

                if (dataSource instanceof HikariDataSource hikari) {
                    HikariPoolMXBean pool = hikari.getHikariPoolMXBean();

                    return Health.up()
                        .withDetail("active_connections", pool.getActiveConnections())
                        .withDetail("idle_connections", pool.getIdleConnections())
                        .withDetail("total_connections", pool.getTotalConnections())
                        .withDetail("threads_waiting", pool.getThreadsAwaitingConnection())
                        .withDetail("validation_timeout", "5s")
                        .build();
                }

                return Health.up()
                    .withDetail("validation", true)
                    .build();

            } catch (Exception e) {
                // Health.down(e) already records the exception as the "error" detail
                return Health.down(e).build();
            }
        };
    }
}

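Testing the endpoint

bash
# The component name is the bean name minus the HealthIndicator suffix,
# so customRedisHealthIndicator is served as "customRedis"
curl http://localhost:8080/actuator/health/customRedis

Response example

json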
{
  "status": "UP",
  "details": {
    "response_time": "15ms",
    "memory_usage": "45.23%",
    "connected_clients": 12,
    "version": "7.0.0"
  }
}

2. Metrics endpoint

Spring Boot uses Micrometer as its metrics facade, which supports many monitoring systems.

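2.1 Core metric categories

yaml
# Meters are switched on or off by name prefix under management.metrics.enable
management:
  metrics:
    enable:
      # JVM metrics
      jvm: true
      jvm.memory: true
      jvm.gc: true
      jvm.threads: true
      jvm.classes: true

      # System metrics
      system: true     # system.cpu.*, system.load.average.1m
      disk: true       # disk.free, disk.total

      # Process metrics
      process: true    # process.uptime, process.cpu.usage

      # Application metrics
      http.server.requests: true
      http.client.requests: true
      cache: true
      jdbc: true       # data source connections
      jms: true
      kafka: true

      # Logging metrics
      logback: true

    # Common tags added to every meter
    tags:
      application: ${spring.application.name}
      # spring.cloud.client.ip-address exists only when Spring Cloud Commons is on the classpath
      instance: ${spring.cloud.client.ip-address:unknown}:${server.port:8080}
      region: ${cloud.region:unknown}
      zone: ${cloud.zone:unknown}

2.2 Custom business metrics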
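java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

// Business metrics for the order service
@Component
@Slf4j
public class OrderMetrics {

    private final MeterRegistry registry;

    // Counter: number of orders created
    private final Counter orderCreatedCounter;

    // Timers: order processing time, one per outcome (tags are fixed when a meter is built)
    private final Timer successTimer;
    private final Timer failureTimer;

    // Distribution summary: order amounts
    private final DistributionSummary orderAmountSummary;

    // Backing value for the gauge: orders currently in progress
    private final AtomicInteger activeOrders = new AtomicInteger(0);

    public OrderMetrics(MeterRegistry registry) {
        this.registry = registry;

        // Counter
        orderCreatedCounter = Counter.builder("order.created")
            .description("Number of orders created")
            .tag("application", "order-service")
            .register(registry);

        // Timers
        successTimer = processTimer(true);
        failureTimer = processTimer(false);

        // Distribution summary, recorded directly in CNY (no extra scaling)
        orderAmountSummary = DistributionSummary.builder("order.amount")
            .description("Order amount distribution")
            .baseUnit("CNY")
            .register(registry);

        // Gauge rather than FunctionCounter: the value goes down as well as up
        Gauge.builder("order.active.count", activeOrders, AtomicInteger::get)
            .description("Orders currently in progress")
            .register(registry);
    }

    private Timer processTimer(boolean success) {
        return Timer.builder("order.process.time")
            .description("Order processing time")
            .tag("success", String.valueOf(success))
            .publishPercentiles(0.5, 0.95, 0.99)  // P50, P95, P99
            .publishPercentileHistogram()
            .register(registry);
    }

    /**
     * Records a created order.
     */
    public void recordOrderCreated(BigDecimal amount) {
        orderCreatedCounter.increment();

        // Record the amount once; scaling it both here and on the summary would count it twice
        orderAmountSummary.record(amount.doubleValue());

        log.info("Order creation metrics recorded, amount: {}", amount);
    }

    /**
     * Starts timing an order and marks it as in progress.
     */
    public Timer.Sample startOrderProcessing() {
        activeOrders.incrementAndGet();
        return Timer.start(registry);
    }

    public void endOrderProcessing(Timer.Sample sample, String orderId, boolean success) {
        // The order ID is logged, not tagged: one tag value per order would create
        // an unbounded number of time series
        sample.stop(success ? successTimer : failureTimer);

        // The order is no longer in progress
        activeOrders.decrementAndGet();

        log.info("Order processing finished: {}, success: {}", orderId, success);
    }

    /**
     * Returns a snapshot of the current metric values.
     */
    public Map<String, Object> getCurrentMetrics() {
        Map<String, Object> metrics = new LinkedHashMap<>();

        // Orders created
        metrics.put("orders_created", orderCreatedCounter.count());

        // Mean processing time of successful orders
        metrics.put("avg_process_time_ms", successTimer.mean(TimeUnit.MILLISECONDS));

        // Orders in progress
        metrics.put("active_orders", activeOrders.get());

        return metrics;
    }
}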

Using the metrics in business code

java
@Service
@Slf4j
public class OrderService {

    private final OrderMetrics orderMetrics;
    // OrderRepository is assumed to be the application's own repository for Order entities
    private final OrderRepository orderRepository;

    public OrderService(OrderMetrics orderMetrics, OrderRepository orderRepository) {
        this.orderMetrics = orderMetrics;
        this.orderRepository = orderRepository;
    }

    @Transactional
    public Order createOrder(CreateOrderRequest request) {
        // Start timing
        Timer.Sample timer = orderMetrics.startOrderProcessing();

        try {
            // Business logic...
            Order order = new Order();
            order.setAmount(request.getAmount());
            order.setStatus(OrderStatus.CREATED);

            // Save the order
            orderRepository.save(order);

            // Record metrics
            orderMetrics.recordOrderCreated(request.getAmount());

            // Stop timing (success)
            orderMetrics.endOrderProcessing(timer, order.getId(), true);

            return order;

        } catch (Exception e) {
            // Stop timing (failure)
            orderMetrics.endOrderProcessing(timer, "unknown", false);
            throw e;
        }
    }
}

Querying the metrics endpoint

bash
# List all metrics
curl http://localhost:8080/actuator/metrics

# Query specific metrics
curl http://localhost:8080/actuator/metrics/order.created
curl http://localhost:8080/actuator/metrics/order.process.time

Response example

json
// GET /actuator/metrics/order.created
{
  "name": "order.created",
  "description": "Number of orders created",
  "baseUnit": null,
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 1250.0
    }
  ],
  "availableTags": [
    {
      "tag": "application",
      "values": ["order-service"]
    }
  ]
}

// GET /actuator/metrics/order.process.time
{
  "name": "order.process.time",
  "description": "Order processing time",
  "baseUnit": "seconds",
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 1250.0
    },
    {
      "statistic": "TOTAL_TIME",
      "value": 625.5
    },
    {
      "statistic": "MAX",
      "value": 2.1
    }
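  ],
  // Percentiles are not part of this response; Micrometer publishes them as
  // separate gauges under order.process.time.percentile
  "availableTags": [
    {
      "tag": "success",
      "values": ["true", "false"]
    }
  ]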
}
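
The timer response does not report an average directly, but it can be derived from the COUNT and TOTAL_TIME measurements. Here is a small sketch of that arithmetic using the sample values above; TimerMeanSketch is a hypothetical helper, not part of Micrometer.

```java
// Hypothetical helper showing how mean latency follows from a timer's measurements.
class TimerMeanSketch {

    // Mean latency in seconds = TOTAL_TIME / COUNT; returns 0 when nothing was recorded.
    static double meanSeconds(double totalTimeSeconds, double count) {
        return count == 0 ? 0.0 : totalTimeSeconds / count;
    }

    public static void main(String[] args) {
        // Values taken from the order.process.time sample response:
        // TOTAL_TIME = 625.5 s over COUNT = 1250 orders.
        double mean = meanSeconds(625.5, 1250.0);
        System.out.println("mean seconds per order: " + mean);
    }
}
```

With these numbers the mean works out to roughly half a second per order, while MAX (2.1 s) shows how far the slowest recent request deviates from it.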

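3. Info endpoint

The info endpoint exposes static information about the application.

3.1 Basic configuration

yaml
# application.yml
management:
  info:
    # Git information (requires a git.properties file on the classpath)
    git:
      mode: full
    # Build information (requires META-INF/build-info.properties)
    build:
      enabled: true
    # Properties under the info.* prefix
    env:
      enabled: true
    # Java runtime information
    java:
      enabled: true
    # Operating system information
    os:
      enabled: true

# Custom information (the @...@ placeholders rely on Maven resource filtering)
info:
  app:
    name: "@project.name@"
    version: "@project.version@"
    description: "@project.description@"
  team:
    name: "Engineering"
    contact: "tech@example.com"
  policy:
    security-level: "high"
    compliance: "ISO27001"

3.2 Programmatic InfoContributor

java
// Custom info contributor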
@Component
public class CustomInfoContributor implements InfoContributor {
    
    @Value("${spring.application.name}")
    private String appName;
    
    @Autowired
    private ApplicationContext context;
    
    @Override
    public void contribute(Info.Builder builder) {
        // 1. Application runtime information
        builder.withDetail("application", Map.of(
            "name", appName,
            "contextPath", context.getApplicationName(),
            "startupTime", getStartupTime(),
            "uptime", getUptime()
        ));
        
        // 2. Bean statistics, grouped by the resource each bean definition came from
        String[] beanNames = context.getBeanDefinitionNames();
        Map<String, Long> beanStats = Arrays.stream(beanNames)
            .collect(Collectors.groupingBy(
                name -> {
                    BeanDefinition bd = ((ConfigurableApplicationContext) context)
                        .getBeanFactory().getBeanDefinition(name);
                    return bd.getResourceDescription() != null ? 
                        bd.getResourceDescription() : "unknown";
                },
                Collectors.counting()
            ));
        
        builder.withDetail("beans", Map.of(
            "total", beanNames.length,
            "statistics", beanStats
        ));
        
        // 3. Thread information
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        builder.withDetail("threads", Map.of(
            "total", threadBean.getThreadCount(),
            "daemon", threadBean.getDaemonThreadCount(),
            "peak", threadBean.getPeakThreadCount()
        ));
        
        // 4. System information
        Runtime runtime = Runtime.getRuntime();
        builder.withDetail("system", Map.of(
            "processors", runtime.availableProcessors(),
            "memory", Map.of(
                "total", runtime.totalMemory() / 1024 / 1024 + "MB",
                "free", runtime.freeMemory() / 1024 / 1024 + "MB",
                "max", runtime.maxMemory() / 1024 / 1024 + "MB"
            )
        ));
        
        // 5. Business information
        builder.withDetail("business", Map.of(
            "features", List.of("Order management", "Payment processing", "Inventory management"),
            "sla", "99.9%",
            "data-retention", "30 days"
        ));
    }
    
    private String getStartupTime() {
        try {
            long startTime = ManagementFactory.getRuntimeMXBean().getStartTime();
            return Instant.ofEpochMilli(startTime)
                .atZone(ZoneId.systemDefault())
                .format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
        } catch (Exception e) {
            return "unknown";
        }
    }
    
    private String getUptime() {
        try {
            long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
            long seconds = uptime / 1000;
            long days = seconds / 86400;
            long hours = (seconds % 86400) / 3600;
            long minutes = (seconds % 3600) / 60;
            return String.format("%dd %dh %dm", days, hours, minutes);
        } catch (Exception e) {
            return "unknown";
        }
    }
}

Querying the info endpoint

bash
curl http://localhost:8080/actuator/info

Response example (abridged; bean statistics are simplified for readability)

json
{
  "application": {
    "name": "order-service",
    "contextPath": "",
    "startupTime": "2024-05-20T10:00:00",
    "uptime": "2d 3h 15m"
  },
  "beans": {
    "total": 156,
    "statistics": {
      "Spring Boot": 45,
      "Spring Framework": 89,
      "Business beans": 22
    }
  },
  "threads": {
    "total": 45,
    "daemon": 20,
    "peak": 50
  },
  "system": {
    "processors": 8,
    "memory": {
      "total": "256MB",
      "free": "128MB",
      "max": "512MB"
    }
  },
  "business": {
    "features": ["Order management", "Payment processing", "Inventory management"],
    "sla": "99.9%",
    "data-retention": "30 days"
  }
}

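IV. Enterprise Monitoring Architecture

1. Monitoring architecture design

[Architecture diagram. Nodes: Spring Boot application, Actuator endpoints (health check /health, metrics /metrics, info /info), Kubernetes probes, Prometheus, Grafana dashboard, configuration management, autoscaling, automatic restart, alert rules, real-time monitoring, AlertManager, email/DingTalk/WeChat alerts.]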

2. Kubernetes integration

2.1 Health check probes

yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  # A Deployment requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: order-service:1.0.0
        # Readiness probe
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        # Liveness probe
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        # Startup probe
        startupProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
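
The probe settings translate into time budgets: Kubernetes gives up after failureThreshold consecutive failures, checked every periodSeconds. A rough sketch of that arithmetic follows (ProbeBudgetSketch is a hypothetical helper; it ignores timeoutSeconds and scheduling jitter).

```java
// Hypothetical helper: approximate time budget implied by a probe's settings.
class ProbeBudgetSketch {

    // Upper bound, in seconds, before the probe is considered failed.
    static int budgetSeconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    public static void main(String[] args) {
        // startupProbe: failureThreshold 30, periodSeconds 10
        System.out.println("startup budget (s): " + budgetSeconds(30, 10));
        // livenessProbe: failureThreshold 3, periodSeconds 10
        System.out.println("liveness budget (s): " + budgetSeconds(3, 10));
    }
}
```

With the values above the application has about 300 seconds to start, and once it is running a hung instance is restarted after roughly 30 seconds of failed liveness checks.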

2.2 ServiceMonitor configuration (Prometheus Operator)

yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: order-service
  endpoints:
  - port: http
    path: /actuator/prometheus
    interval: 30s
    scrapeTimeout: 10s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace
  namespaceSelector:
    any: true

3. Prometheus configuration

yaml
# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['order-service:8080', 'payment-service:8080', 'user-service:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      # __meta_kubernetes_* labels exist only with kubernetes_sd_configs, so a pod
      # label cannot be derived from static_configs targets

4. Grafana dashboard

JSON configuration for a Spring Boot application dashboard

json
{
  "dashboard": {
    "title": "Spring Boot Application Monitoring",
    "panels": [
      {
        "title": "Application availability",
        "type": "stat",
        "targets": [{
          "expr": "up{job=\"spring-boot-apps\"}",
          "legendFormat": "{{instance}}"
        }]
      },
      {
        "title": "JVM memory usage",
        "type": "graph",
        "targets": [{
          "expr": "sum(jvm_memory_used_bytes{application=\"order-service\", area=\"heap\"}) by (instance)",
          "legendFormat": "Heap used"
        }, {
          "expr": "sum(jvm_memory_max_bytes{application=\"order-service\", area=\"heap\"}) by (instance)",
          "legendFormat": "Heap max"
        }]
      },
      {
        "title": "HTTP requests per second",
        "type": "graph",
        "targets": [{
          "expr": "rate(http_server_requests_seconds_count{application=\"order-service\"}[5m])",
          "legendFormat": "{{method}} {{uri}} {{status}}"
        }]
      },
      {
        "title": "HTTP request latency",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket{application=\"order-service\"}[5m])) by (le, uri, method, status))",
          "legendFormat": "P95 {{method}} {{uri}}"
        }]
      },
      {
        "title": "Database connection pool",
        "type": "graph",
        "targets": [{
          "expr": "hikaricp_connections_active{application=\"order-service\"}",
          "legendFormat": "Active connections"
        }, {
          "expr": "hikaricp_connections_idle{application=\"order-service\"}",
          "legendFormat": "Idle connections"
        }]
      }
    ]
  }
}
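
The latency panel relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below illustrates that estimate in plain Java under simplifying assumptions (the lowest bucket starts at 0, and input is a single set of cumulative counts). HistogramQuantileSketch and its sample data are hypothetical and are not Prometheus code.

```java
// Hypothetical illustration of quantile estimation from cumulative histogram buckets.
class HistogramQuantileSketch {

    /**
     * @param q                quantile in (0, 1]
     * @param upperBounds      finite bucket upper bounds (the "le" label), ascending
     * @param cumulativeCounts cumulative counts: one per finite bucket, plus a final +Inf entry
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lowerBound = (i == 0) ? 0.0 : upperBounds[i - 1];
                double countBelow = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
                double countInBucket = cumulativeCounts[i] - countBelow;
                if (countInBucket == 0) {
                    return upperBounds[i];
                }
                // Linear interpolation within the bucket that holds the target rank
                return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / countInBucket;
            }
        }
        // The rank falls into the +Inf bucket: report the highest finite bound
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.2, 0.5, 1.0};       // seconds
        double[] counts = {500, 800, 950, 990, 1000}; // last entry is the +Inf bucket
        System.out.println("estimated P90 (s): " + quantile(0.9, bounds, counts));
    }
}
```

Because the estimate can never be more precise than the bucket boundaries, the choice of buckets (for example through SLO boundaries) directly limits how accurate the P95 on the dashboard can be.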

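V. Security Configuration

1. Protecting endpoints

yaml
# application.yml
spring:
  security:
    user:
      # Default user; ignored once the application defines its own UserDetailsService bean
      name: actuator
      password: ${ACTUATOR_PASSWORD:ChangeMe!}
      roles: ACTUATOR_ADMIN

management:
  endpoints:
    web:
      exposure:
        # Expose only what production needs
        include: "health,info,prometheus"
        exclude: "env,beans,loggers,heapdump,threaddump"

  endpoint:
    health:
      # Roles that count as "authorized" for show-details and show-components
      roles: "ACTUATOR_USER"
      show-details: when_authorized
    env:
      # Roles allowed to see unsanitized values
      roles: "ACTUATOR_ADMIN"
      show-values: when_authorized

# These roles properties only control how much detail a response contains. Who may call
# info, prometheus, beans, loggers, and the other endpoints is decided by Spring Security.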

2. Spring Security configuration

java
@Configuration
@EnableWebSecurity
public class ActuatorSecurityConfig {
    
    @Bean
    @Order(1)
    public SecurityFilterChain actuatorSecurityFilterChain(HttpSecurity http) 
            throws Exception {
        
        http
            .securityMatcher("/actuator/**")
            .authorizeHttpRequests(authz -> authz
                // Health checks are public (needed by Kubernetes probes)
                .requestMatchers("/actuator/health/**").permitAll()
                // Prometheus scrape endpoint (may need authentication)
                .requestMatchers("/actuator/prometheus").hasRole("ACTUATOR_USER")
                // Info endpoint
                .requestMatchers("/actuator/info").hasRole("ACTUATOR_USER")
                // Sensitive endpoints require the admin role
                .requestMatchers("/actuator/env", "/actuator/beans", 
                                "/actuator/loggers", "/actuator/heapdump",
                                "/actuator/threaddump").hasRole("ACTUATOR_ADMIN")
                // All other endpoints
                .anyRequest().authenticated()
            )
            .httpBasic(Customizer.withDefaults())
            .sessionManagement(session -> session
                .sessionCreationPolicy(SessionCreationPolicy.STATELESS)
            )
            .csrf(AbstractHttpConfigurer::disable);
        
        return http.build();
    }
    
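    // Demo credentials only: {noop} stores the password in plain text. In production,
    // load encoded passwords from a secret store instead of hard-coding them.
    @Bean
    public InMemoryUserDetailsManager userDetailsService() {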
        UserDetails user = User.builder()
            .username("monitor")
            .password("{noop}monitor123")
            .roles("ACTUATOR_USER")
            .build();
        
        UserDetails admin = User.builder()
            .username("admin")
            .password("{noop}admin123")
            .roles("ACTUATOR_ADMIN", "ACTUATOR_USER")
            .build();
        
        return new InMemoryUserDetailsManager(user, admin);
    }
}

3. IP allowlist

java
import java.util.List;
import java.util.stream.Stream;

import org.springframework.security.web.util.matcher.IpAddressMatcher;

// CIDR matching uses Spring Security's IpAddressMatcher in place of commons-net SubnetUtils,
// which excludes network and broadcast addresses by default and does not handle IPv6.
@Component
public class ActuatorIpFilter extends OncePerRequestFilter {

    private final List<IpAddressMatcher> allowedRanges = Stream.of(
            "10.0.0.0/8",      // private network
            "192.168.0.0/16",  // private network
            "172.16.0.0/12",   // private network (includes Docker's default bridge range)
            "127.0.0.1",       // IPv4 loopback
            "::1"              // IPv6 loopback
        ).map(IpAddressMatcher::new).toList();

    @Override
    protected void doFilterInternal(HttpServletRequest request, 
                                   HttpServletResponse response, 
                                   FilterChain filterChain) 
            throws ServletException, IOException {

        String requestUri = request.getRequestURI();

        // Filter Actuator endpoints only, and leave health checks open.
        // Adjust the prefix if management.endpoints.web.base-path is changed.
        if (requestUri.startsWith("/actuator") && 
            !requestUri.startsWith("/actuator/health")) {

            String clientIp = getClientIp(request);

            if (!isIpAllowed(clientIp)) {
                response.setStatus(HttpStatus.FORBIDDEN.value());
                response.getWriter().write("Access denied from IP: " + clientIp);
                return;
            }
        }

        filterChain.doFilter(request, response);
    }

    private String getClientIp(HttpServletRequest request) {
        // X-Forwarded-For and similar headers are supplied by the caller and can be forged,
        // so they must not drive an allowlist. Behind a trusted reverse proxy, set
        // server.forward-headers-strategy=native so that getRemoteAddr() reflects the client.
        return request.getRemoteAddr();
    }

    private boolean isIpAllowed(String ip) {
        try {
            return allowedRanges.stream().anyMatch(range -> range.matches(ip));
        } catch (IllegalArgumentException e) {
            // Unparseable address: deny access
            return false;
        }
    }
}
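
To make the CIDR check in the filter less of a black box, here is a minimal JDK-only sketch of prefix matching: an address is inside a range when its first N bits equal those of the network address. CidrMatchSketch is a hypothetical helper written for illustration; it assumes both inputs are IP literals and is not a replacement for a maintained library.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: checks whether an IP literal falls inside a CIDR range.
class CidrMatchSketch {

    static boolean inRange(String ip, String cidr) {
        try {
            String[] parts = cidr.split("/");
            // For IP literals getByName only parses; a host name would trigger a DNS lookup.
            byte[] network = InetAddress.getByName(parts[0]).getAddress();
            byte[] address = InetAddress.getByName(ip).getAddress();
            if (network.length != address.length) {
                return false; // IPv4 range against IPv6 address, or the reverse
            }
            // Without a "/N" suffix, every bit must match (single address)
            int prefixLength = parts.length > 1 ? Integer.parseInt(parts[1]) : network.length * 8;
            for (int bit = 0; bit < prefixLength; bit++) {
                int mask = 0x80 >> (bit % 8);
                if ((network[bit / 8] & mask) != (address[bit / 8] & mask)) {
                    return false;
                }
            }
            return true;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid IP literal: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inRange("10.1.2.3", "10.0.0.0/8"));
        System.out.println(inRange("172.32.0.1", "172.16.0.0/12"));
    }
}
```

Comparing bits rather than strings is what lets 172.16.0.0/12 cover 172.16.x.x through 172.31.x.x while rejecting 172.32.0.1.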

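VI. Enterprise Best Practices

1. Per-environment configuration

yaml
# application-dev.yml
management:
  endpoints:
    web:
      exposure:
        include: "*"  # expose every endpoint in development
  endpoint:
    health:
      show-details: always
  tracing:
    enabled: false  # tracing off in development

# application-prod.yml
management:
  endpoints:
    web:
      exposure:
        include: "health,info,prometheus,metrics"  # keep production exposure tight
      base-path: /internal/actuator  # non-default path; probes, scrape paths, and security rules must match
  endpoint:
    health:
      show-details: never  # no details in production
    shutdown:
      enabled: false  # shutdown endpoint disabled in production
  server:
    port: 9090  # separate management port; probes and scrapers must target it
  tracing:
    enabled: true
    sampling:
      probability: 0.1  # sample 10% of requests in production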

2. Alerting rules

yaml
# Prometheus alerting rules
groups:
- name: spring-boot-alerts
  rules:
  - alert: SpringBootAppDown
    expr: up{job="spring-boot-apps"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Spring Boot application down"
      description: "Application {{ $labels.instance }} has been down for more than 1 minute"
  
  # Aggregate by instance so that the instance label survives for the annotation
  - alert: HighMemoryUsage
    expr: (sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"})) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "JVM heap usage too high"
      description: "Application {{ $labels.instance }} heap usage is above 80%, current value {{ $value }}%"
  
  - alert: HighGCTime
    expr: rate(jvm_gc_pause_seconds_sum[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GC pause time too long"
      description: "Application {{ $labels.instance }} GC pause time is above the threshold"
  
  # Both sides are summed by instance; otherwise the 5xx series would only be
  # divided by themselves and the ratio would always be 100%
  - alert: HighErrorRate
    expr: sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by (instance) (rate(http_server_requests_seconds_count[5m])) * 100 > 5
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "HTTP error rate too high"
      description: "Application {{ $labels.instance }} 5xx error rate is above 5%, current value {{ $value }}%"
  
  - alert: HighLatency
    expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_server_requests_seconds_bucket[5m]))) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "HTTP request latency too high"
      description: "Application {{ $labels.instance }} P95 latency is above 1 second, current value {{ $value }}s"
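
The HighErrorRate rule compares how fast the 5xx counter grows with how fast the total request counter grows. The sketch below shows the same ratio computed from two counter snapshots in plain Java. It is a simplification (ErrorRateSketch is hypothetical, and Prometheus's rate() additionally handles counter resets and extrapolation), but the time window cancels out of the ratio in the same way.

```java
// Hypothetical helper: 5xx share of traffic between two counter snapshots.
class ErrorRateSketch {

    // Returns the percentage of requests that ended in a 5xx status during the interval.
    static double errorRatePercent(long errorsBefore, long errorsNow,
                                   long totalBefore, long totalNow) {
        long totalDelta = totalNow - totalBefore;
        if (totalDelta <= 0) {
            return 0.0; // no traffic in the interval (or the counter was reset)
        }
        return 100.0 * (errorsNow - errorsBefore) / totalDelta;
    }

    public static void main(String[] args) {
        // 60 new errors out of 1000 new requests
        double percent = errorRatePercent(10, 70, 1000, 2000);
        System.out.println("error rate %: " + percent + ", alert: " + (percent > 5));
    }
}
```

Alerting on the ratio rather than on the raw error count keeps the rule meaningful across services with very different traffic volumes.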

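3. Performance tuning

yaml
management:
  metrics:
    # Histogram settings
    distribution:
      slo:
        http.server.requests: 100ms, 200ms, 500ms, 1s, 2s
      percentiles-histogram:
        http.server.requests: true
      maximum-expected-value:
        http.server.requests: 10s

  # Endpoint response caching
  endpoint:
    health:
      cache:
        time-to-live: 10s
      # Liveness and readiness probe groups
      probes:
        enabled: true
    metrics:
      cache:
        time-to-live: 30s
    prometheus:
      cache:
        time-to-live: 15s

  # Availability state indicators used by the probe groups
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

# Actuator has no validation-query or timeout properties under management.health.db
# or management.health.redis; health check timeouts come from the underlying clients.
spring:
  datasource:
    hikari:
      connection-test-query: "SELECT 1"  # only needed for drivers without JDBC4 isValid()
      validation-timeout: 5000           # milliseconds
  data:
    redis:
      timeout: 3s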
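
The cache.time-to-live settings trade freshness for load: within the TTL, repeated calls reuse the previous result instead of recomputing it. A minimal sketch of that idea follows, with an injectable clock so the behaviour is easy to verify. TtlCacheSketch is a hypothetical class and does not reflect how Actuator implements its caching internally.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of time-to-live caching around an expensive computation.
class TtlCacheSketch<T> {

    private final Supplier<T> loader;
    private final long ttlMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    private T value;
    private long loadedAt;
    private boolean loaded;

    TtlCacheSketch(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value, reloading it once the TTL has elapsed.
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt >= ttlMillis) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        TtlCacheSketch<String> health =
            new TtlCacheSketch<>(() -> "UP", 10_000, System::currentTimeMillis);
        System.out.println(health.get()); // computed
        System.out.println(health.get()); // served from the cache
    }
}
```

A longer TTL shields slow dependencies from frequent polling, at the cost of reporting a failure up to one TTL late.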

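VII. Common Problems and Solutions

Problem 1: An endpoint returns 404

Cause: the endpoint is not enabled or not exposed

Solution

yaml
management:
  endpoints:
    web:
      exposure:
        include: "health,info,metrics"  # list the endpoints you need explicitly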
  endpoint:
    health:
      enabled: true
    info:
      enabled: true
    metrics:
      enabled: true
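
Whether an endpoint is reachable over HTTP depends on both lists: an ID must be included (directly or through "*") and must not be excluded, and an exclusion always wins. The sketch below captures that rule in plain Java. ExposureSketch is hypothetical and simplified: it ignores the default exposure that applies when include is empty, and it assumes the endpoint is enabled.

```java
import java.util.Set;

// Hypothetical sketch of how include/exclude lists combine for web exposure.
class ExposureSketch {

    static boolean isExposed(String endpointId, Set<String> include, Set<String> exclude) {
        // An exclusion always takes precedence over an inclusion
        if (exclude.contains("*") || exclude.contains(endpointId)) {
            return false;
        }
        return include.contains("*") || include.contains(endpointId);
    }

    public static void main(String[] args) {
        Set<String> include = Set.of("health", "info", "metrics");
        System.out.println(isExposed("metrics", include, Set.of())); // listed explicitly
        System.out.println(isExposed("env", include, Set.of()));     // not listed, so a 404
    }
}
```

When a 404 persists although the ID is included, checking the exclude list and the configured base path is usually the next step.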

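Problem 2: The health check reports DOWN

Cause: a dependency is unavailable

Troubleshooting helper

java
@Component
@Slf4j
public class HealthCheckDebugger {

    @EventListener(ApplicationReadyEvent.class)
    public void debugHealthStatus(ApplicationReadyEvent event) {
        ApplicationContext context = event.getApplicationContext();
        HealthEndpoint healthEndpoint = context.getBean(HealthEndpoint.class);

        HealthComponent health = healthEndpoint.health();
        if (Status.DOWN.equals(health.getStatus())) {
            log.error("Application health status is DOWN");

            // Per-component results are only available on CompositeHealth
            if (health instanceof CompositeHealth composite) {
                composite.getComponents().forEach((name, component) -> {
                    if (Status.DOWN.equals(component.getStatus())) {
                        // Details exist only on leaf Health objects
                        Object details = component instanceof Health leaf ? leaf.getDetails() : "n/a";
                        log.error("Failing component: {} - details: {}", name, details);
                    }
                });
            }
        }
    }
}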

Problem 3: Metric values look wrong

Cause: misconfigured metrics or a sampling issue

Validation helper

java
// Requires @EnableScheduling on a configuration class for @Scheduled to take effect
@Component
@Slf4j
public class MetricsValidator {

    @Autowired
    private MeterRegistry meterRegistry;

    @Scheduled(fixedDelay = 60000)  // check once a minute
    public void validateMetrics() {
        List<Meter> meters = meterRegistry.getMeters();

        log.info("Number of registered meters: {}", meters.size());

        // Check a key metric
        meters.stream()
            .filter(meter -> meter.getId().getName().startsWith("http.server.requests"))
            .findFirst()
            .ifPresent(meter -> {
                log.info("HTTP request metric: {}", meter.measure());
            });
    }
}

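Monitoring is a means, not an end.

A solid monitoring setup lets us detect problems early, locate failures quickly, and keep improving performance, so that users get a stable and reliable service.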
