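Articles in this series:
[Spring Boot Made Easy] Building a Web Project from Scratch (Part 1): Using the Spring Boot Framework
[Spring Boot Made Easy] Building a Web Project from Scratch (Part 2): Using the Spring Framework
[Spring Boot Made Easy] Building a Web Project from Scratch (Part 3): Comparing Spring Boot and Spring in Practice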
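Preface:
In the previous articles we learned how to build a web project with the Spring Boot framework. Once the service is running, how do we know whether it is actually usable? For example: can web requests reach it? Is the Spring container healthy? Can it reach the database? Has business initialization finished? This is where health checks come in.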
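Current practice:
Which health-check approaches have we probably seen before? For example: using the telnet command to check whether the application's port is listening; using ps -ef | grep <process name> to check whether the process exists; filtering the logs for a keyword such as "service started successfully" to decide whether the service is up; or serving a static response and treating the service as available when a request to /nginx.html returns ok.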
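The telnet-style port check mentioned above can be sketched in plain Java. This is only a minimal illustration: the class name `PortProbe`, the method `isPortOpen`, and the port 8080 are assumptions made for this example, not part of any framework.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable
            return false;
        }
    }

    public static void main(String[] args) {
        // Port 8080 is assumed here because it is Spring Boot's default web port
        System.out.println("port 8080 open: " + isPortOpen("localhost", 8080, 1000));
    }
}
```

A successful connection only proves that something is listening on the port. It says nothing about whether the database or Redis behind the service is reachable, which is exactly the limitation of this kind of check.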
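These checks are useful when the application has only two states, "started successfully" and "failed to start". For example, if a failed MySQL or Redis connection at startup prevents the service from starting at all, every method above will detect the "failed to start" state. But suppose MySQL or Redis is not a hard dependency: if the Redis connection fails, the service can fall back to local handling and still serve web requests. In that case the methods above report "started successfully" but cannot tell that Redis is unavailable.
So what should we do if we want to see the health of every component right after startup, monitor component health in real time from a monitoring platform while the service is running, and send an alert to a DingTalk group when a component misbehaves? Spring Boot offers a solution: Actuator.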
Spring Boot health checks:
Let's start with the description in the official Spring Boot documentation:
Application Availability
When deployed on platforms, applications can provide information about their availability to the platform using infrastructure such as Kubernetes Probes. Spring Boot includes out-of-the box support for the commonly used "liveness" and "readiness" availability states. If you are using Spring Boot's "actuator" support then these states are exposed as health endpoint groups.
In addition, you can also obtain availability states by injecting the ApplicationAvailability interface into your own beans.
Liveness State
The "Liveness" state of an application tells whether its internal state allows it to work correctly, or recover by itself if it is currently failing. A broken "Liveness" state means that the application is in a state that it cannot recover from, and the infrastructure should restart the application.
--note--
In general, the "Liveness" state should not be based on external checks, such as health checks. If it did, a failing external system (a database, a Web API, an external cache) would trigger massive restarts and cascading failures across the platform.
The internal state of Spring Boot applications is mostly represented by the Spring ApplicationContext. If the application context has started successfully, Spring Boot assumes that the application is in a valid state. An application is considered live as soon as the context has been refreshed, see Spring Boot application lifecycle and related Application Events.
Readiness State
The "Readiness" state of an application tells whether the application is ready to handle traffic. A failing "Readiness" state tells the platform that it should not route traffic to the application for now. This typically happens during startup, while CommandLineRunner and ApplicationRunner components are being processed, or at any time if the application decides that it is too busy for additional traffic.
An application is considered ready as soon as application and command-line runners have been called, see Spring Boot application lifecycle and related Application Events.
--tip--
Tasks expected to run during startup should be executed by CommandLineRunner and ApplicationRunner components instead of using Spring component lifecycle callbacks such as @PostConstruct.
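To make the two states in the excerpt above concrete, here is a small plain-Java model of the transitions it describes. It is a sketch under stated assumptions, not Spring Boot's implementation: the class `AvailabilityModel` and its methods are hypothetical, the enum constants merely borrow the names of Spring Boot's `LivenessState` and `ReadinessState` values, and the initial values are a choice made for this model.

```java
class AvailabilityModel {

    enum Liveness { CORRECT, BROKEN }
    enum Readiness { ACCEPTING_TRAFFIC, REFUSING_TRAFFIC }

    // Assumed initial values for this model: nothing is usable before startup
    private volatile Liveness liveness = Liveness.BROKEN;
    private volatile Readiness readiness = Readiness.REFUSING_TRAFFIC;

    /** Internal state is initialized: the application is live but not yet ready. */
    void contextRefreshed() {
        liveness = Liveness.CORRECT;
    }

    /** Startup tasks (the role runners play in the excerpt) finish before traffic is accepted. */
    void runStartupTasks(Runnable... tasks) {
        for (Runnable task : tasks) {
            task.run();
        }
        readiness = Readiness.ACCEPTING_TRAFFIC;
    }

    /** A busy application can stop receiving traffic without being restarted. */
    void refuseTraffic() {
        readiness = Readiness.REFUSING_TRAFFIC;
    }

    Liveness liveness() { return liveness; }
    Readiness readiness() { return readiness; }

    public static void main(String[] args) {
        AvailabilityModel app = new AvailabilityModel();
        app.contextRefreshed();
        System.out.println(app.liveness() + " / " + app.readiness());
        app.runStartupTasks(() -> System.out.println("warming up cache"));
        System.out.println(app.liveness() + " / " + app.readiness());
    }
}
```

The point of the model is that the two states move independently: refusing traffic does not make the application "broken", so the platform should stop routing requests to it rather than restart it.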
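In one sentence: a service should be able to detect its own health from the inside, because if we rely only on outside observation, the service may already have been unhealthy for a while by the time anyone notices. Spring Boot reports availability along two dimensions, "liveness" and "readiness", and recommends running startup tasks in CommandLineRunner and ApplicationRunner components.
"If it did, a failing external system (a database, a Web API, an external cache) would trigger massive restarts and cascading failures across the platform." In other words, if the liveness state depended on checks of external systems, a single failing external system (a database, a web API, an external cache) would trigger mass restarts and cascading failures across the whole platform.
Why does detecting problems from the inside matter so much? Apart from extreme physical causes such as a severed fiber cable, a service rarely becomes unavailable all at once; it usually degrades gradually. Anyone with production experience will recognize this, especially those who have lived through an OOM incident. Memory is gradually consumed and never released, usage keeps growing, and slow endpoints start to appear. At first only one or two endpoints time out, and eventually all of them do. Once the number of timeouts reaches the caller's threshold, the caller switches to a degraded mode. Degradation can affect accuracy and may eventually lead to customer complaints, so the impact grows like a snowball. This "unavailable" state is visible inside the service before the outside world perceives it. If the service detects and handles it in time, there is a good chance the impact can be contained, which is why timely internal health checks matter.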
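The OOM scenario above suggests one thing a service can check from the inside: how full its heap is. The following is a minimal sketch using the JDK's `MemoryMXBean`. The class `HeapCheck`, the `Level` enum, and the 80% and 95% thresholds are assumptions chosen for illustration, not recommended values and not part of Actuator.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapCheck {

    enum Level { OK, WARNING, CRITICAL, UNKNOWN }

    /** Classifies heap usage by the ratio used/max against two thresholds. */
    static Level classify(long usedBytes, long maxBytes, double warnRatio, double criticalRatio) {
        if (maxBytes <= 0) {
            // MemoryUsage.getMax() returns -1 when the maximum is undefined
            return Level.UNKNOWN;
        }
        double ratio = (double) usedBytes / maxBytes;
        if (ratio >= criticalRatio) {
            return Level.CRITICAL;
        }
        if (ratio >= warnRatio) {
            return Level.WARNING;
        }
        return Level.OK;
    }

    /** Reads the current heap usage of this JVM and classifies it. */
    static Level currentLevel() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // Thresholds are illustrative assumptions
        return classify(heap.getUsed(), heap.getMax(), 0.80, 0.95);
    }

    public static void main(String[] args) {
        System.out.println("heap level: " + currentLevel());
    }
}
```

A single reading right before a garbage collection can look worse than it is, so a real check would more likely watch the trend over several samples before raising an alert.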
Next, let's look at how Actuator, the health-check tool provided by the Spring Boot framework, works.
Add the dependency to the classpath:
xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
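Configuration:
yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info # expose health and info over HTTP (health is already exposed by default; info is not in recent Spring Boot versions)
  endpoint:
    health:
      enabled: true # enable the health endpoint (enabled by default)
      show-details: always # show health details (for example, whether the database connection succeeded)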
Result:
Request: http://localhost:8080/actuator/health
The response lists the detailed status of every component (only MySQL has been configured as a dependency so far):
json
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 994662584320,
        "free": 838479388672,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    }
  }
}
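The outermost UP means the application as a whole is available: the Spring container is healthy, all "key dependencies" are reachable, and the service can handle business requests.
An inner UP, such as db: UP, means that particular component is healthy; here, the database connection works.
Actuator decides on UP according to the health-check rules we configure. For example, the service can be marked UP once the database is reachable, Redis answers a ping, and the cache has finished loading. We can define these criteria ourselves.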
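There are also non-UP statuses:
DOWN: the application or one of its dependencies is unavailable (for example, the database is unreachable or Redis is down);
OUT_OF_SERVICE: the application or one of its components has been taken out of service and should not receive traffic;
UNKNOWN: the status cannot be determined (for example, a check has no way to tell whether its dependency is healthy);
WARN: not a built-in status, but a custom status like this can be defined for a dependency that is available but at risk (for example, a very low cache hit rate). Note that the built-in disk space check reports DOWN, not a warning, when free space falls below the threshold.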
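How the per-component statuses combine into the overall status can be modeled in a few lines of plain Java. This is a simplified sketch, not Actuator's code: the class `StatusAggregation` is hypothetical, and the severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN is assumed to match Spring Boot's documented default, which can be changed through configuration.

```java
import java.util.Comparator;
import java.util.List;

class StatusAggregation {

    // Most severe first; assumed to mirror Spring Boot's default status order
    static final List<String> ORDER = List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Returns the most severe known status, or UNKNOWN if there is none. */
    static String aggregate(List<String> componentStatuses) {
        return componentStatuses.stream()
                .filter(ORDER::contains)                       // ignore statuses outside the order
                .min(Comparator.comparingInt(ORDER::indexOf))  // lowest index = most severe
                .orElse("UNKNOWN");
    }

    public static void main(String[] args) {
        // Component statuses: db, diskSpace, ping
        System.out.println(aggregate(List.of("DOWN", "UP", "UP")));
    }
}
```

Under this model a single DOWN component is enough to make the overall status DOWN, no matter how many other components are UP.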
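Let's look at an example:
With the service still running, I stop my MySQL server and check what the health check reports.
The application log shows a connection error: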
2025-12-22 14:29:39.584 WARN 24512 --- [nio-8080-exec-4] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@677cf13 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
Health check result:
json
{
  "status": "DOWN",
  "components": {
    "db": {
      "status": "DOWN",
      "details": {
        "error": "org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30006ms."
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 994662584320,
        "free": 838471426048,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    }
  }
}
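The status of db is DOWN, while the status of the other components is UP. The configured db is a key dependency of the service, so the overall status of the service is now DOWN as well, which avoids a "falsely healthy" result.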
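A monitoring job could poll this endpoint and raise an alert when the overall status is not UP. The sketch below uses only the JDK's `java.net.http.HttpClient`. The class `HealthPoller` and its methods are hypothetical, the regular expression is a shortcut that assumes the top-level "status" field comes first in the body (as in the responses shown above), and a real client would more likely use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HealthPoller {

    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    /** Extracts the first "status" value in the body; assumed to be the overall status. */
    static String overallStatus(String json) {
        Matcher matcher = STATUS.matcher(json);
        return matcher.find() ? matcher.group(1) : "UNKNOWN";
    }

    /** Calls the health endpoint and returns the overall status found in the response body. */
    static String fetchOverallStatus(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // Generous timeout: the db check above waited about 30 s before failing
                .timeout(Duration.ofSeconds(40))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return overallStatus(response.body());
    }

    public static void main(String[] args) {
        try {
            String status = fetchOverallStatus("http://localhost:8080/actuator/health");
            if (!"UP".equals(status)) {
                System.err.println("ALERT: service health is " + status);
            }
        } catch (Exception e) {
            // The endpoint itself could not be reached or did not answer in time
            System.err.println("ALERT: health endpoint unreachable: " + e);
        }
    }
}
```

Where the alert goes (for example, a DingTalk group webhook) is left out on purpose; the sketch only shows how the DOWN result above could be detected programmatically.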
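Summary:
With Spring Boot Actuator, a single HTTP call is enough to check the status of the service's internal dependencies. Pretty convenient, isn't it? But we know that the dependencies in production are far more complex than in our mock environment, so in the next chapter we will simulate a production environment as closely as we can and customize the health checks to give production a proper "check-up".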