Spring Boot Application Monitoring Integration Notes

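Background

The XScholar literature download application is built on Spring Boot and needs to be integrated with a Prometheus monitoring system. The application is already deployed and running on a server, and it has to expose a metrics endpoint for Prometheus to scrape.

Initial state

Application information

Framework: Spring Boot 2.x
Port: 10089
Server: Linux server (IPv4/IPv6 dual-stack network)
Prometheus: deployed as a Docker container

Existing dependencies

The project already includes the monitoring dependencies: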

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Integration log

Step 1: Configure the Spring Boot application

Basic configuration

# application-prod.yml
management:
  endpoints:
    web:
      exposure:
        include: "health,info,prometheus"
  endpoint:
    prometheus:
      enabled: true
    health:
      show-details: always
  metrics:
    export:
      prometheus:
        enabled: true

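Key settings

management.endpoints.web.exposure.include: exposes the prometheus endpoint over HTTP (at /actuator/prometheus)
management.endpoint.prometheus.enabled: enables the prometheus actuator endpoint itself
management.metrics.export.prometheus.enabled: enables exporting Micrometer metrics to the Prometheus registry (this is the Spring Boot 2.x property name)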
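
One quick way to confirm that these settings took effect is to look at the HTTP status the endpoint returns. The sketch below is illustrative and not part of the original setup: explain_actuator_status is a hypothetical helper, and the mapping lists common causes rather than every possibility.

```bash
# Hypothetical helper: map the HTTP status of /actuator/prometheus to a likely cause.
explain_actuator_status() {
  case "$1" in
    200)     echo "ok: endpoint is exposed and serving metrics" ;;
    401|403) echo "blocked: a security layer is rejecting the request" ;;
    404)     echo "missing: endpoint not exposed, or the Prometheus registry is not on the classpath" ;;
    000)     echo "unreachable: curl got no HTTP response (connection failed)" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# Usage on the application server (port 10089 as in this article):
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:10089/actuator/prometheus)
#   explain_actuator_status "$code"
```

A 404 here points back at the exposure list or the dependencies, while 000 points at the network problems described in the next steps.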

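Step 2: Network binding problem

A serious problem

After the application started, Prometheus could not scrape any data and the target was shown as DOWN.

Initial (faulty) configuration

# Faulty setup: the application ended up listening on localhost only
server:
  port: 10089
  # In this deployment the listener was bound to 127.0.0.1, so it was unreachable from outside.
  # Note: with server.address unset, Spring Boot normally binds to all interfaces, so a
  # loopback-only listener usually means an address was set elsewhere (another profile,
  # an environment variable, or a command-line argument).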

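Problem analysis

Local test passed: on the application server, curl localhost:10089/actuator/prometheus returned data as expected
Remote access failed: the endpoint was unreachable from the Prometheus container and from other servers
Network diagnosis: netstat -tlnp | grep 10089 showed the application bound only to 127.0.0.1

Solution

# Correct configuration: bind to all network interfaces
server:
  port: 10089
  address: 0.0.0.0  # key setting: bind to all network interfaces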

Verification

# Check which address the port is bound to
netstat -tlnp | grep 10089
# Expect 0.0.0.0:10089 rather than 127.0.0.1:10089

# Test access from outside
curl http://SERVER_IP:10089/actuator/health
curl http://SERVER_IP:10089/actuator/prometheus
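
Reading the bind address by eye is error-prone, because the same wildcard binding can appear in several notations. The following is a minimal sketch, assuming the "Local Address" column of netstat -tln or ss -tln as input; classify_listen_addr is a hypothetical name. On a dual-stack host a Java process bound to all interfaces is often listed as :::10089 (or [::]:10089 / *:10089 in ss), which the sketch also treats as a wildcard binding.

```bash
# Hypothetical helper: classify a "Local Address" value such as 127.0.0.1:10089.
classify_listen_addr() {
  local addr="${1%:*}"   # strip the trailing :PORT
  case "$addr" in
    127.*|::1|"[::1]")      echo "loopback-only" ;;
    0.0.0.0|"*"|::|"[::]")  echo "all-interfaces" ;;
    *)                      echo "specific-address" ;;
  esac
}

# Usage: classify every listener on port 10089
#   ss -tln | awk '$4 ~ /:10089$/ {print $4}' | while read -r a; do
#     classify_listen_addr "$a"
#   done
```

A result of loopback-only reproduces the situation from this step; specific-address means the application is reachable only through that one interface.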

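Step 3: IP address problem in the Prometheus configuration

The core problem

Even with the application bound to 0.0.0.0, Prometheus still could not scrape data.

Faulty Prometheus configuration

# prometheus.yml - faulty configuration
scrape_configs:
  - job_name: 'xscholar-scheduler'
    static_configs:
      - targets: ['localhost:10089']        # wrong: localhost inside the container
      # or
      - targets: ['10.10.132.55:10089']   # wrong: the internal IP was unreachable from the container

Root cause analysis

Container network isolation: Prometheus runs in a Docker container with its own network namespace
localhost resolution: inside the container, localhost refers to the container itself, not the host
Internal IP restrictions: depending on the Docker network mode and firewall rules, the container may be unable to reach the host's internal IP directly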
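
Because a loopback target is always wrong when Prometheus runs in its own network namespace, it can be caught before the configuration is deployed. The sketch below is a simple text-level check, not a YAML parser; find_loopback_targets is a hypothetical helper and the pattern only covers targets written on the same line as the targets: key.

```bash
# Hypothetical lint: print prometheus.yml lines whose targets point at a loopback address.
# Exit status is 0 when at least one suspicious line is found, 1 when none are.
find_loopback_targets() {
  grep -nE "targets:.*['\"[ ,](localhost|127\.[0-9.]+|\[::1\]):[0-9]+" "$1"
}

# Usage:
#   find_loopback_targets prometheus.yml && echo "loopback targets found: fix before deploying"
```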

Solution: use the public IP

# prometheus.yml - working configuration
scrape_configs:
  - job_name: 'xscholar-scheduler'
    static_configs:
      - targets: ['PUBLIC_IP:10089']  # the server's public IP
    metrics_path: '/actuator/prometheus'
    scrape_interval: 30s
    scrape_timeout: 10s
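
It can help to spell out the URL that Prometheus actually requests for this job, since it is assembled from scheme, target, and metrics_path. The sketch below mirrors that composition under the Prometheus defaults (scheme http, metrics_path /metrics); scrape_url is a hypothetical helper, and PUBLIC_IP remains the placeholder used throughout this article.

```bash
# Hypothetical helper: compose the URL Prometheus scrapes from scheme, target and metrics_path.
# Empty scheme or path fall back to the Prometheus defaults (http and /metrics).
scrape_url() {
  local scheme="${1:-http}" target="$2" path="${3:-/metrics}"
  printf '%s://%s%s\n' "$scheme" "$target" "$path"
}

# With metrics_path set as in the configuration above:
#   scrape_url http PUBLIC_IP:10089 /actuator/prometheus
# Without metrics_path, the default /metrics is requested, which this application does not serve:
#   scrape_url "" PUBLIC_IP:10089 ""
```

Requesting the composed URL with curl from inside the Prometheus container is a direct way to reproduce what the scraper sees.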

Network architecture

Internet
    ↓
Public IP (the server's public address)
    ↓
Server (runs the Spring Boot application)
    ↓ Docker network
Docker container (Prometheus)

Step 4: IPv4/IPv6 network stack problem

A trickier problem

After switching to the public IP, connections still failed intermittently, and the logs showed network timeouts.

Symptoms

# Errors in the Prometheus log
level=warn msg="Error on ingesting samples" err="connection refused"
level=warn msg="Scrape failed" target="PUBLIC_IP:10089" err="context deadline exceeded"
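
The two error strings above point in different directions, which is worth separating before digging further. The sketch below is a rough triage aid under general TCP assumptions, not an exhaustive diagnosis; explain_scrape_error is a hypothetical helper.

```bash
# Hypothetical helper: give a first guess at the cause of a scrape error message.
explain_scrape_error() {
  case "$1" in
    *"connection refused"*)
      echo "refused: the host answered but nothing accepted the connection (wrong bind address or port, or app not running)" ;;
    *"context deadline exceeded"*)
      echo "timeout: no reply within scrape_timeout (packets dropped by a firewall or route, or a slow endpoint)" ;;
    *)
      echo "unclassified: inspect the full log line" ;;
  esac
}

# Usage:
#   explain_scrape_error 'err="context deadline exceeded"'
```

A mix of both messages for the same target, as seen here, suggests that different connection attempts take different network paths.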

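Root cause analysis

Modern Linux servers usually support both IPv4 and IPv6. On a dual-stack host the JVM may prefer IPv6 sockets by default, which can cause connection problems.

JVM network stack configuration problem

# Problem: the option comes after the jar, so it is passed to the application as a program argument and the JVM ignores it
java -jar app.jar -Djava.net.preferIPv4Stack=true

Solution

# Correct JVM startup options
java -Djava.net.preferIPv4Stack=true \
     -Djava.net.preferIPv6Addresses=false \
     -jar xscholar-scheduler.jar

Option notes

preferIPv4Stack=true: forces the JVM to use the IPv4 network stack
preferIPv6Addresses=false: keeps IPv4 addresses preferred over IPv6 (false is already the default; it is set explicitly here)
Option position: must come before -jar; anything after the jar file is treated as an application argument and has no effect as a JVM option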
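
The position rule is easy to check mechanically. The following sketch walks a java command line and reports any -D or -X option that appears after the jar file; misplaced_jvm_opts is a hypothetical helper, and it only understands the plain "java ... -jar FILE ..." form used in this article.

```bash
# Hypothetical check: print -D/-X options that appear after "-jar <file>",
# where the JVM no longer interprets them.
misplaced_jvm_opts() {
  local state=0 arg    # 0 = before -jar, 1 = just saw -jar, 2 = past the jar file
  for arg in "$@"; do
    case "$state:$arg" in
      0:-jar)       state=1 ;;
      1:*)          state=2 ;;        # this argument is the jar file itself
      2:-D*|2:-X*)  echo "$arg" ;;    # looks like a JVM option but arrives as a program argument
    esac
  done
}

# Usage:
#   misplaced_jvm_opts java -jar app.jar -Djava.net.preferIPv4Stack=true
```

An empty result means every JVM option precedes -jar; application arguments such as --spring.profiles.active=prod are intentionally not reported.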

Step 5: Verify the metrics

Verify the metrics endpoint

# Check basic metrics
curl http://PUBLIC_IP:10089/actuator/prometheus | grep jvm_memory

# Check custom business metrics
curl http://PUBLIC_IP:10089/actuator/prometheus | grep daily_task

# Count the output lines (this includes # HELP and # TYPE comment lines)
curl http://PUBLIC_IP:10089/actuator/prometheus | wc -l
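
Since wc -l also counts comment lines, a slightly more informative count separates metric families from samples. This is a minimal sketch that reads the Prometheus text exposition format from stdin; summarize_metrics is a hypothetical helper and it assumes one # TYPE line per metric family.

```bash
# Hypothetical helper: summarize Prometheus text-format output read from stdin.
summarize_metrics() {
  awk '
    /^# TYPE /         { families++ }   # one TYPE line per metric family
    /^[^#]/ && NF > 0  { samples++ }    # non-comment, non-empty lines are samples
    END { printf "families=%d samples=%d\n", families, samples }
  '
}

# Usage:
#   curl -s http://PUBLIC_IP:10089/actuator/prometheus | summarize_metrics
```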

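Verify in Prometheus

# Check the target status (the job name is found under .labels.job)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="xscholar-scheduler")'

# Query a specific metric (-G with --data-urlencode avoids curl's {} URL globbing)
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="xscholar-scheduler"}'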
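
After a configuration change the target does not turn UP immediately, because the next scrape may be up to one scrape_interval away. A small polling loop avoids re-running the checks by hand. This is a sketch only: wait_until is a hypothetical helper, and the commented usage assumes Prometheus on localhost:9090 as elsewhere in this article.

```bash
# Hypothetical helper: retry a command until it succeeds or the attempts run out.
# Arguments: ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage: poll for up to about a minute until the job reports up == 1
#   target_is_up() {
#     curl -s -G http://localhost:9090/api/v1/query \
#       --data-urlencode 'query=up{job="xscholar-scheduler"} == 1' | grep -q '"value"'
#   }
#   wait_until 6 10 target_is_up && echo "target is UP"
```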

Complete configuration example

Spring Boot configuration

# application-prod.yml
server:
  port: 10089
  address: 0.0.0.0  # key: bind to all network interfaces

management:
  endpoints:
    web:
      exposure:
        include: "health,info,prometheus"
  endpoint:
    prometheus:
      enabled: true
    health:
      show-details: always
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: xscholar-scheduler
      environment: production

Prometheus configuration

# prometheus.yml
scrape_configs:
  - job_name: 'xscholar-scheduler'
    static_configs:
      - targets: ['PUBLIC_IP:10089']  # use the public IP
    metrics_path: '/actuator/prometheus'
    scrape_interval: 30s
    scrape_timeout: 10s
    honor_labels: true
    scheme: http

JVM startup configuration

#!/bin/bash
# start-app.sh
java -Djava.net.preferIPv4Stack=true \
     -Djava.net.preferIPv6Addresses=false \
     -Duser.timezone=Asia/Shanghai \
     -Xms1g -Xmx2g \
     -jar xscholar-scheduler.jar \
     --spring.profiles.active=prod

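Network troubleshooting flow

Layer 1: application checks

# 1. Check which address the application port is bound to
netstat -tlnp | grep 10089

# 2. Local access test
curl http://localhost:10089/actuator/health

# 3. Internal network access test
curl http://INTERNAL_IP:10089/actuator/health

# 4. Public network access test
curl http://PUBLIC_IP:10089/actuator/health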

Layer 2: network connectivity checks

# 1. Firewall check
sudo ufw status
sudo iptables -L -n | grep 10089

# 2. Port reachability test
telnet PUBLIC_IP 10089

# 3. Test from inside the Prometheus container
docker exec prometheus wget -O- http://PUBLIC_IP:10089/actuator/prometheus

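Layer 3: container network checks

# 1. Inspect the container network configuration
docker network ls
docker network inspect prometheus_monitoring

# 2. Container-to-host connectivity test (-c 3 so that ping stops after three packets)
docker exec prometheus ping -c 3 PUBLIC_IP

# 3. DNS lookup test (only meaningful when the target is a hostname rather than an IP)
docker exec prometheus nslookup PUBLIC_IP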
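
The three layers can be chained so that the first failing layer is obvious. The sketch below wraps each probe in a small reporter and stops at the first failure; check is a hypothetical helper, and the commented probes reuse the commands and placeholders from the layers above.

```bash
# Hypothetical helper: run one probe, print PASS/FAIL with a label, and return the probe's result.
check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    return 1
  fi
}

# Usage: work outward from the application and stop at the first failing layer
#   check "layer 1: local HTTP"      curl -fsS http://localhost:10089/actuator/health &&
#   check "layer 2: public HTTP"     curl -fsS http://PUBLIC_IP:10089/actuator/health &&
#   check "layer 3: from container"  docker exec prometheus wget -q -O- http://PUBLIC_IP:10089/actuator/prometheus
```

Chaining with && keeps the order of the layers explicit: a failure in layer 1 makes the later probes pointless, so they are skipped.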

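Pitfalls summary

Main difficulties

Binding semantics: the difference between localhost and 0.0.0.0 was not well understood
Container network isolation: how the Docker container network relates to the host network
IP address choice: reachability of the internal IP versus the public IP
IPv4/IPv6 stack: JVM network stack preference

Key lessons

Troubleshoot layer by layer: work through application → network → container in order
Understand the network: know how the container network relates to the host network
Option order: the position of a JVM option determines whether it takes effect
Verify configuration: verify each layer's configuration independently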

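Best practices

Network configuration guidelines

Application binding: bind to 0.0.0.0 consistently in production
IP address choice: prefer the public IP so that every component can reach the target
IPv4 first: force IPv4 in production to avoid compatibility problems

Troubleshooting toolkit

# Network diagnostic toolkit
netstat -tlnp | grep PORT        # check port binding
ss -tlnp | grep PORT            # modern replacement for netstat
curl -I http://IP:PORT          # HTTP connectivity test
telnet IP PORT                  # TCP connectivity test
nmap -p PORT IP                 # port scan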

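Monitoring verification checklist

The application port is bound to 0.0.0.0
Firewall rules allow the port
The metrics endpoint returns valid data
Prometheus can scrape the target successfully
The target status is UP
Metric data can be queried in Prometheus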

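Common mistakes

Mistake 1: binding to localhost only

# Faulty setup
server:
  port: 10089
  # No address: 0.0.0.0 here, and the listener ended up on 127.0.0.1
  # (Spring Boot's own default is all interfaces, so also check other profiles, env vars and arguments for an address override)

Symptom: local curl works, remote access fails
Fix: set address: 0.0.0.0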

Mistake 2: using localhost inside the container

# Faulty configuration
- targets: ['localhost:10089']

Symptom: Prometheus cannot connect to the target
Fix: use the host's public IP

Mistake 3: JVM option in the wrong position

# Faulty startup command
java -jar app.jar -Djava.net.preferIPv4Stack=true

Symptom: IPv6 is preferred, causing connection problems
Fix: put the option before -jar

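Performance considerations

Scrape frequency

# Adjust the scrape frequency to business needs
scrape_configs:
  - job_name: 'xscholar-scheduler'
    scrape_interval: 30s    # scrape the business application every 30 seconds
    scrape_timeout: 10s     # time out after 10 seconds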
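
The effect of the interval is simple arithmetic: each time series produces one sample per scrape. The sketch below works that out and also checks the one hard constraint between the two settings, namely that the timeout must not exceed the interval (Prometheus rejects such a configuration). Both function names are hypothetical, and the values are given in whole seconds.

```bash
# Hypothetical helpers for reasoning about scrape settings (values in seconds).

# Samples stored per time series per day for a given scrape_interval.
samples_per_day() {
  echo $(( 86400 / $1 ))
}

# Succeeds when scrape_timeout ($2) does not exceed scrape_interval ($1).
timing_ok() {
  [ "$2" -le "$1" ]
}

# Usage with the values above (30s interval, 10s timeout):
#   samples_per_day 30     # prints 2880
#   timing_ok 30 10 && echo "timeout fits inside the interval"
```

Halving the interval to 15 seconds would double the figure to 5760 samples per series per day, which is the storage trade-off to weigh against faster detection.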

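Metric filtering

# Keep only the metrics that are needed, to reduce storage pressure
# (metric_relabel_configs belongs inside a scrape_configs job entry)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(daily_task_.*|token_.*|last_task_.*|jvm_memory_.*)'
    action: keep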
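
Before applying a keep rule it is worth checking which metric names survive it, since everything else is dropped at scrape time. Prometheus anchors relabel regexes at both ends, which grep -x imitates closely enough for this pattern. The sketch below is an approximation using POSIX extended regex rather than the RE2 engine Prometheus uses; kept_by_filter is a hypothetical helper.

```bash
# Hypothetical helper: succeed when a metric name would be kept by the filter above.
# grep -x requires the whole name to match, mirroring Prometheus' fully anchored regex.
kept_by_filter() {
  printf '%s\n' "$1" | grep -qxE '(daily_task_.*|token_.*|last_task_.*|jvm_memory_.*)'
}

# Usage: list which metric names from the live endpoint would be dropped
#   curl -s http://PUBLIC_IP:10089/actuator/prometheus | awk '/^# TYPE / {print $3}' |
#     while read -r name; do kept_by_filter "$name" || echo "dropped: $name"; done
```

Because of the anchoring, a name such as my_token_count is dropped even though it contains token_; the prefix has to start the name.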

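Next steps

With the Spring Boot application successfully integrated with Prometheus, the next phase will focus on:

Designing and implementing custom business metrics
Analyzing metric data and tuning alerting rules
Performance monitoring and troubleshooting in practice

This phase focused on solving network connectivity problems so that monitoring data can be collected reliably, laying the groundwork for business monitoring and alerting later on.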