Monitoring Spring Boot Applications in Docker: End-to-End Monitoring with Prometheus + Grafana

- Abstract
- Chapter 1: Monitoring Architecture Design
  - 1.1 Overall Monitoring Architecture
  - 1.2 Component Responsibilities
  - 1.3 Metric Categories
    - 1.3.1 Application-Level Metrics
    - 1.3.2 System-Level Metrics
    - 1.3.3 Business-Level Metrics
- Chapter 2: Configuring Monitoring in the Spring Boot Application
  - 2.1 Adding the Monitoring Dependencies
  - 2.2 Application Configuration
  - 2.3 Custom Business Metrics
  - 2.4 Integrating Metrics into the Business Service
  - 2.5 Controller-Layer Monitoring
  - 2.6 Data Model
- Chapter 3: Docker Deployment Configuration
  - 3.1 Dockerfile for the Spring Boot Application
  - 3.2 Complete Docker Compose Configuration
  - 3.3 Prometheus Configuration
  - 3.4 Alerting Rules
  - 3.5 Alertmanager Configuration
- Chapter 4: Grafana Dashboard Configuration
  - 4.1 Data Source Configuration
  - 4.2 Dashboard Provisioning
  - 4.3 Spring Boot Application Dashboard
  - 4.4 Complete Dashboard Configuration
- Chapter 5: Advanced Monitoring Features
  - 5.1 Custom Metrics Endpoint
  - 5.2 Distributed Tracing Integration
  - 5.3 Performance Testing and Monitoring Verification
- Chapter 6: Production Deployment and Optimization
  - 6.1 Production-Grade Docker Compose
  - 6.2 Persisting Monitoring Data
  - 6.3 Security Configuration
- Chapter 7: Troubleshooting and Performance Tuning
  - 7.1 Common Problems
    - 7.1.1 Metrics Are Not Collected
    - 7.1.2 Investigating Memory Leaks
  - 7.2 Performance Tuning Recommendations
- Summary
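Abstract
In a modern microservice architecture, monitoring Spring Boot applications is essential to keeping the system stable and fast. This article shows how to build a complete, Dockerized monitoring stack with Prometheus and Grafana, covering application metric exposure, container monitoring, and business metrics. Code samples and configuration files walk through the path from an empty project to a production-grade monitoring system.
Keywords: Spring Boot, Docker, Prometheus, Grafana, monitoring, microservices, containerization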
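Chapter 1: Monitoring Architecture Design
1.1 Overall Monitoring Architecture
A monitoring stack for a modern Spring Boot application should contain the following core components:
[Architecture diagram] Nodes: Spring Boot application, Micrometer metrics, Prometheus scraping, Prometheus storage, Grafana visualization, Docker containers, cAdvisor monitoring, host system, Node Exporter, alert notifications (Email/Slack/Webhook).
1.2 Component Responsibilities
| Component | Responsibility | Technology |
|---|---|---|
| Metric collection | Expose application metrics | Micrometer, Spring Boot Actuator |
| Metric scraping | Pull metrics at a regular interval | Prometheus |
| Container monitoring | Monitor container resources | cAdvisor |
| Host monitoring | Monitor system resources | Node Exporter |
| Visualization | Display and analyze metrics | Grafana |
| Alerting | Detect anomalies and send notifications | Alertmanager |
1.3 Metric Categories
1.3.1 Application-Level Metrics
- JVM memory, GC, thread pools
- HTTP request metrics and response times
- Custom business metrics
- Database connection pool metrics
1.3.2 System-Level Metrics
- CPU, memory, and disk utilization
- Network I/O and disk I/O
- Container resource usage
1.3.3 Business-Level Metrics
- Order volume, user activity
- Business exception counts
- Metrics for key business flows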
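Chapter 2: Configuring Monitoring in the Spring Boot Application
2.1 Adding the Monitoring Dependencies
First, add the required monitoring dependencies to pom.xml. The java.version property and spring-boot-maven-plugin are included so that the Java 17 build in section 3.1 produces an executable JAR, and spring-boot-starter-test supports the integration test in section 5.3:
xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>monitored-spring-boot-app</artifactId>
    <version>1.0.0</version>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.7.0</version>
    </parent>
    <properties>
        <java.version>17</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Actuator exposes the /actuator endpoints -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <!-- Micrometer registry that renders metrics in Prometheus format -->
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>com.h2database</groupId>
            <artifactId>h2</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <!-- Repackages the JAR so that "java -jar" can start it -->
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>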
2.2 Application Configuration
Configure application.yml to enable the monitoring endpoints:
yaml
server:
  port: 8080
management:
  endpoints:
    web:
      exposure:
        # env and beans reveal internals; expose them only in trusted environments
        include: health,info,metrics,prometheus,env,beans
      base-path: /actuator
    enabled-by-default: true
  endpoint:
    health:
      show-details: always
      show-components: always
    prometheus:
      enabled: true
  metrics:
    export:
      prometheus:
        enabled: true
    distribution:
      # Publishes histogram buckets so Prometheus can compute quantiles
      percentiles-histogram:
        http.server.requests: true
    web:
      server:
        request:
          autotime:
            enabled: true
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}
spring:
  application:
    name: order-service
  profiles:
    active: docker
  datasource:
    url: jdbc:h2:mem:testdb
    driver-class-name: org.h2.Driver
    username: sa
    password: ''
  jpa:
    database-platform: org.hibernate.dialect.H2Dialect
    hibernate:
      ddl-auto: create-drop
    show-sql: true
logging:
  level:
    org.springframework.web: DEBUG
    io.micrometer: DEBUG
2.3 Custom Business Metrics
Create custom metrics to monitor the business logic:
java
package com.example.monitoring.service;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

import java.util.concurrent.ConcurrentHashMap;

@Component
public class OrderMetrics {
    private final Counter orderCreatedCounter;
    private final Counter orderFailedCounter;
    private final Timer orderProcessingTimer;
    private final ConcurrentHashMap<String, Counter> statusCounters;
    private final MeterRegistry meterRegistry;

    public OrderMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderCreatedCounter = Counter.builder("order.created")
                .description("Number of orders created")
                .tag("application", "order-service")
                .register(meterRegistry);
        this.orderFailedCounter = Counter.builder("order.failed")
                .description("Number of failed orders")
                .tag("application", "order-service")
                .register(meterRegistry);
        this.orderProcessingTimer = Timer.builder("order.processing.time")
                .description("Time taken to process orders")
                .tag("application", "order-service")
                .register(meterRegistry);
        this.statusCounters = new ConcurrentHashMap<>();
    }

    public void incrementOrderCreated() {
        orderCreatedCounter.increment();
    }

    public void incrementOrderFailed(String reason) {
        orderFailedCounter.increment();
        // Count failures per reason. Keep the set of reasons small and fixed:
        // every distinct tag value creates a new time series in Prometheus.
        Counter reasonCounter = statusCounters.computeIfAbsent(reason,
                k -> Counter.builder("order.failed.by.reason")
                        .description("Orders failed by reason")
                        .tag("reason", k)
                        .register(meterRegistry));
        reasonCounter.increment();
    }

    public Timer.Sample startTimer() {
        return Timer.start(meterRegistry);
    }

    public void recordTimer(Timer.Sample sample) {
        sample.stop(orderProcessingTimer);
    }
}
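The dotted Micrometer names above do not appear verbatim in Prometheus, which matters when writing alert rules and dashboard queries. The following is a simplified sketch of the renaming, not Micrometer's actual implementation: it assumes only the common cases (dots become underscores, counters gain a `_total` suffix, timers are reported in seconds). The class name `PrometheusNames` is hypothetical.

```java
/** Simplified sketch of how dotted Micrometer names appear in Prometheus output. */
final class PrometheusNames {
    private PrometheusNames() {}

    /** Replaces every character that is not legal in a Prometheus metric name. */
    static String sanitize(String meterName) {
        return meterName.replaceAll("[^a-zA-Z0-9_:]", "_");
    }

    /** Counters are exposed with a "_total" suffix. */
    static String counter(String meterName) {
        String base = sanitize(meterName);
        return base.endsWith("_total") ? base : base + "_total";
    }

    /** Timers are exposed in seconds as "_count", "_sum" and "_max" series. */
    static String[] timer(String meterName) {
        String base = sanitize(meterName);
        if (!base.endsWith("_seconds")) {
            base += "_seconds";
        }
        return new String[] {base + "_count", base + "_sum", base + "_max"};
    }

    public static void main(String[] args) {
        System.out.println(counter("order.created"));
        System.out.println(String.join(", ", timer("order.processing.time")));
    }
}
```

Under these assumptions, `order.created` would be queried as `order_created_total` and the timer as `order_processing_time_seconds_count` and `order_processing_time_seconds_sum`. The output of /actuator/prometheus is the authoritative list.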
2.4 Integrating Metrics into the Business Service
Use the metrics in the business service. OrderRepository, not listed here, is assumed to be a Spring Data JpaRepository<Order, Long>:
java
package com.example.monitoring.service;

import com.example.monitoring.model.Order;
import com.example.monitoring.repository.OrderRepository;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.Optional;
import java.util.Random;

@Service
@Transactional
public class OrderService {
    private final OrderRepository orderRepository;
    private final OrderMetrics orderMetrics;
    private final Random random = new Random();

    public OrderService(OrderRepository orderRepository, OrderMetrics orderMetrics) {
        this.orderRepository = orderRepository;
        this.orderMetrics = orderMetrics;
    }

    public Order createOrder(Order order) {
        // Start timing
        Timer.Sample timerSample = orderMetrics.startTimer();
        try {
            // Simulate business processing
            simulateProcessing();
            // Simulate a random failure (10% failure rate)
            if (random.nextInt(10) == 0) {
                throw new RuntimeException("Payment processing failed");
            }
            Order savedOrder = orderRepository.save(order);
            orderMetrics.incrementOrderCreated();
            return savedOrder;
        } catch (RuntimeException e) {
            // The message becomes a tag value, so it should come from a small fixed set
            orderMetrics.incrementOrderFailed(e.getMessage());
            throw e;
        } finally {
            // Record the processing time for both successes and failures
            orderMetrics.recordTimer(timerSample);
        }
    }

    public Optional<Order> getOrder(Long id) {
        return orderRepository.findById(id);
    }

    private void simulateProcessing() {
        try {
            // Simulate 100-500 ms of processing time
            Thread.sleep(100 + random.nextInt(400));
        } catch (InterruptedException e) {
            // Restore the interrupt flag and surface the failure as unchecked
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Order processing interrupted", e);
        }
    }
}
2.5 Controller-Layer Monitoring
Spring Boot Actuator records the http.server.requests timer for every handler automatically, so the controller needs no explicit metrics code:
java
package com.example.monitoring.controller;

import com.example.monitoring.model.Order;
import com.example.monitoring.service.OrderService;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.util.Optional;

@RestController
@RequestMapping("/api/orders")
public class OrderController {
    private final OrderService orderService;

    public OrderController(OrderService orderService) {
        this.orderService = orderService;
    }

    @PostMapping
    public ResponseEntity<Order> createOrder(@RequestBody Order order) {
        try {
            Order createdOrder = orderService.createOrder(order);
            return ResponseEntity.ok(createdOrder);
        } catch (Exception e) {
            // A processing failure is a server-side error; returning 5xx also lets
            // the outcome="SERVER_ERROR" series used by the error-rate alert see it
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).build();
        }
    }

    @GetMapping("/{id}")
    public ResponseEntity<Order> getOrder(@PathVariable Long id) {
        Optional<Order> order = orderService.getOrder(id);
        return order.map(ResponseEntity::ok)
                .orElse(ResponseEntity.notFound().build());
    }

    @GetMapping("/health")
    public ResponseEntity<String> health() {
        return ResponseEntity.ok("Service is healthy");
    }
}
2.6 Data Model
java
package com.example.monitoring.model;

import javax.persistence.*;
import java.time.LocalDateTime;

@Entity
@Table(name = "orders")
public class Order {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    private String orderNumber;
    private Double amount;
    private String customerEmail;
    @Enumerated(EnumType.STRING)
    private OrderStatus status;
    private LocalDateTime createdAt;

    // Constructor
    public Order() {
        this.createdAt = LocalDateTime.now();
        this.status = OrderStatus.PENDING;
    }

    public enum OrderStatus {
        PENDING, PROCESSING, COMPLETED, FAILED
    }

    // Getters and setters
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getOrderNumber() { return orderNumber; }
    public void setOrderNumber(String orderNumber) { this.orderNumber = orderNumber; }
    public Double getAmount() { return amount; }
    public void setAmount(Double amount) { this.amount = amount; }
    public String getCustomerEmail() { return customerEmail; }
    public void setCustomerEmail(String customerEmail) { this.customerEmail = customerEmail; }
    public OrderStatus getStatus() { return status; }
    public void setStatus(OrderStatus status) { this.status = status; }
    public LocalDateTime getCreatedAt() { return createdAt; }
    public void setCreatedAt(LocalDateTime createdAt) { this.createdAt = createdAt; }
}
Chapter 3: Docker Deployment Configuration
3.1 Dockerfile for the Spring Boot Application
Create an optimized Dockerfile:
dockerfile
# Multi-stage build keeps the runtime image small
FROM maven:3.8.6-eclipse-temurin-17 AS builder
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn clean package -DskipTests

# The openjdk repository publishes no 17-jre-slim tag; eclipse-temurin ships a Java 17 JRE image
FROM eclipse-temurin:17-jre
# Install curl for the health check
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /app/target/*.jar app.jar
# Create a non-root user
RUN groupadd -r spring && useradd -r -g spring spring
USER spring
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/actuator/health || exit 1
EXPOSE 8080
# WORKDIR is /app, so the JAR is at /app/app.jar
ENTRYPOINT ["java", "-jar", "/app/app.jar"]
3.2 Complete Docker Compose Configuration
Create docker-compose.yml to define the full monitoring stack. The alerting rules file from section 3.4 is mounted next to prometheus.yml so that the rule_files entry can find it:
yaml
version: '3.8'
services:
  # Spring Boot application
  order-service:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=docker
      - MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=health,info,metrics,prometheus
    networks:
      - monitoring-network
    labels:
      - "prometheus.scrape=true"
      - "prometheus.port=8080"
      - "prometheus.path=/actuator/prometheus"
    depends_on:
      - prometheus

  # Prometheus
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    networks:
      - monitoring-network
    restart: unless-stopped

  # Grafana visualization
  grafana:
    image: grafana/grafana:9.3.2
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      # Demo credential only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring-network
    depends_on:
      - prometheus
    restart: unless-stopped

  # cAdvisor container monitoring
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    ports:
      - "8081:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring-network
    privileged: true
    restart: unless-stopped

  # Node Exporter host monitoring
  node-exporter:
    image: prom/node-exporter:v1.5.0
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring-network
    restart: unless-stopped

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:v0.25.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring-network
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring-network:
    driver: bridge
3.3 Prometheus Configuration
Create the prometheus/prometheus.yml configuration file:
yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: 'docker-monitoring'

# Alerting rule files
rule_files:
  - "alerting_rules.yml"

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 10s

  # Spring Boot applications
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['order-service:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __scheme__
        regex: '(.*)'
        replacement: 'http'
      # Strip the port so the instance label is just the host name
      - source_labels: [__address__]
        target_label: instance
        regex: '(.*):(.*)'
        replacement: '${1}'

  # Host metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 20s

  # Container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 20s

  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s

# Alertmanager endpoints
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
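Because Prometheus stores raw counter values at each scrape, queries such as `rate(http_server_requests_seconds_count[5m])` derive a per-second rate from successive samples. The following is a minimal sketch of that idea, assuming a simplified model that ignores the extrapolation Prometheus applies at the window edges; the class name `CounterRate` and the sample values are hypothetical.

```java
/** Simplified per-second rate over counter samples, tolerant of counter resets. */
final class CounterRate {
    private CounterRate() {}

    /**
     * @param timestampsSeconds scrape timestamps in seconds, ascending
     * @param values counter values observed at those timestamps
     */
    static double perSecond(double[] timestampsSeconds, double[] values) {
        if (timestampsSeconds.length != values.length || values.length < 2) {
            throw new IllegalArgumentException("need at least two aligned samples");
        }
        double elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        if (elapsed <= 0) {
            throw new IllegalArgumentException("timestamps must span a positive interval");
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A drop means the process restarted and the counter began again at zero
            increase += delta >= 0 ? delta : values[i];
        }
        return increase / elapsed;
    }

    public static void main(String[] args) {
        // Five scrapes, 15 s apart; the application restarts before the fourth one
        double[] t = {0, 15, 30, 45, 60};
        double[] v = {100, 130, 160, 10, 40};
        System.out.println(perSecond(t, v));
    }
}
```

In the sample data the counter grows by 30, 30, 10 (after the reset) and 30, so the sketch reports 100 increments over 60 seconds. A shorter scrape_interval gives the window more samples at the cost of more stored data.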
3.4 Alerting Rules
Create prometheus/alerting_rules.yml. The JVM memory rule reads the Micrometer metric jvm_memory_used_bytes, and the error-rate and latency rules aggregate with sum() so that the ratio and the quantile are computed across all request series rather than per series:
yaml
groups:
  - name: spring-boot-alerts
    rules:
      # JVM memory alert
      - alert: HighJVMMemoryUsage
        expr: sum by (instance) (jvm_memory_used_bytes{job="spring-boot-apps"}) / (1024 * 1024) > 512
        for: 2m
        labels:
          severity: warning
          service: order-service
        annotations:
          summary: "High JVM Memory Usage"
          description: "JVM memory usage is above 512MB for more than 2 minutes"

      # Application down alert
      - alert: ApplicationDown
        expr: up{job="spring-boot-apps"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application is down"
          description: "The application has been down for more than 1 minute"

      # High error rate alert
      - alert: HighErrorRate
        expr: sum(rate(http_server_requests_seconds_count{outcome="SERVER_ERROR"}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 3 minutes"

      # High response time alert
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket[5m]))) > 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 2 seconds"

  - name: system-alerts
    rules:
      # High CPU usage alert
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80% for more than 5 minutes"

      # High memory usage alert
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 85% for more than 5 minutes"

      # Low disk space alert
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 15%"
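The HighResponseTime rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from individual request timings. The following is a minimal sketch of that estimation, assuming linear interpolation inside the bucket that contains the target rank and a lower bound of 0 for the first bucket; the class name `HistogramQuantile` and the bucket values are hypothetical.

```java
/** Sketch of quantile estimation from cumulative histogram buckets. */
final class HistogramQuantile {
    private HistogramQuantile() {}

    /**
     * @param q quantile in [0, 1]
     * @param upperBounds ascending bucket upper bounds ("le"); the last is +Infinity
     * @param cumulativeCounts observations less than or equal to each bound
     */
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        if (upperBounds.length < 2 || upperBounds.length != cumulativeCounts.length) {
            throw new IllegalArgumentException("need at least two aligned buckets");
        }
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound
            return upperBounds[last - 1];
        }
        double lowerBound = i == 0 ? 0.0 : upperBounds[i - 1];
        double countBelow = i == 0 ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - countBelow;
        if (inBucket == 0) {
            return upperBounds[i];
        }
        // Assume observations are spread evenly inside the bucket
        return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
    }

    public static void main(String[] args) {
        double[] le = {0.5, 1.0, 2.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 80, 98, 100};
        System.out.println(estimate(0.95, le, counts));
    }
}
```

With these hypothetical buckets the 95th of 100 observations lies in the (1, 2] bucket, 15 of 18 observations in, so the estimate is about 1.83 seconds and stays below the 2 second threshold. The result is only as precise as the bucket boundaries allow.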
3.5 Alertmanager Configuration
Create alertmanager/alertmanager.yml. Addresses, credentials, and webhook URLs are placeholders. Email receivers set the subject through headers and the body through text:
yaml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourcompany.com'
  smtp_auth_username: 'alerts@yourcompany.com'
  smtp_auth_password: 'your-app-password'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  # Default receiver for alerts that match no child route
  receiver: 'web.hook'
  routes:
    - match:
        severity: 'critical'
      receiver: 'critical-alerts'
    - match:
        severity: 'warning'
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      # Placeholder; inside the container, localhost is Alertmanager itself
      - url: 'http://localhost:5001/'
  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@yourcompany.com'
        headers:
          Subject: '{{ .GroupLabels.alertname }} - CRITICAL'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/your/slack/webhook'
        channel: '#critical-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@yourcompany.com'
        headers:
          Subject: '{{ .GroupLabels.alertname }} - WARNING'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/your/slack/webhook'
        channel: '#warning-alerts'

# Suppress a warning while a critical alert with the same labels is firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
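The route block is a small decision tree: child routes are checked in order, the first one whose labels all match decides the receiver, and an alert that matches none falls back to the top-level receiver. The following is a minimal sketch of that lookup, assuming exact label matching only and no `continue` option; the class name `AlertRouter` and the factory `articleConfig` are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Sketch of first-match routing on alert labels with a fallback receiver. */
final class AlertRouter {
    /** A child route: every label in match must equal the alert's value. */
    static final class Route {
        final Map<String, String> match;
        final String receiver;

        Route(Map<String, String> match, String receiver) {
            this.match = match;
            this.receiver = receiver;
        }

        boolean matches(Map<String, String> labels) {
            return match.entrySet().stream()
                    .allMatch(e -> e.getValue().equals(labels.get(e.getKey())));
        }
    }

    private final String defaultReceiver;
    private final List<Route> routes;

    AlertRouter(String defaultReceiver, List<Route> routes) {
        this.defaultReceiver = defaultReceiver;
        this.routes = routes;
    }

    /** Returns the receiver of the first matching child route, else the default. */
    String receiverFor(Map<String, String> labels) {
        for (Route route : routes) {
            if (route.matches(labels)) {
                return route.receiver;
            }
        }
        return defaultReceiver;
    }

    /** Mirrors the severity routes and receiver names of the configuration above. */
    static AlertRouter articleConfig() {
        return new AlertRouter("web.hook", List.of(
                new Route(Map.of("severity", "critical"), "critical-alerts"),
                new Route(Map.of("severity", "warning"), "warning-alerts")));
    }

    public static void main(String[] args) {
        AlertRouter router = articleConfig();
        System.out.println(router.receiverFor(Map.of("alertname", "ApplicationDown", "severity", "critical")));
        System.out.println(router.receiverFor(Map.of("alertname", "Custom", "severity", "info")));
    }
}
```

Under this model ApplicationDown and LowDiskSpace reach critical-alerts, the warning rules reach warning-alerts, and an alert with any other severity ends up at the web.hook receiver.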
Chapter 4: Grafana Dashboard Configuration
4.1 Data Source Configuration
Create grafana/provisioning/datasources/datasource.yml:
yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      # Matches the Prometheus scrape_interval
      timeInterval: 15s
      httpMethod: POST
4.2 Dashboard Provisioning
Create grafana/provisioning/dashboards/dashboard.yml:
yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
4.3 Spring Boot Application Dashboard
Create grafana/dashboards/spring-boot-dashboard.json. A file loaded by the provisioning provider must contain the dashboard object itself; the outer "dashboard" wrapper is only used when posting to the HTTP API:
json
{
  "id": null,
  "title": "Spring Boot Application Metrics",
  "tags": ["spring-boot", "prometheus"],
  "timezone": "browser",
  "panels": [
    {
      "id": 1,
      "title": "JVM Memory Usage",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(jvm_memory_used_bytes{job='spring-boot-apps'}) / (1024 * 1024)",
          "legendFormat": "Memory Usage",
          "refId": "A"
        }
      ],
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
      "fieldConfig": {
        "defaults": {
          "unit": "mbytes",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": 512}
            ]
          }
        }
      }
    },
    {
      "id": 2,
      "title": "HTTP Requests Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_server_requests_seconds_count[5m]))",
          "legendFormat": "Requests/sec",
          "refId": "A"
        }
      ],
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
    }
  ],
  "time": {"from": "now-6h", "to": "now"},
  "timepicker": {
    "refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d"]
  }
}
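4.4 Complete Dashboard Configuration
The full JSON is long, so only the outline is given here. A complete dashboard should add the following entries to the "panels" array:
- 1. Application overview panel
- 2. JVM memory panel
- 3. GC statistics panel
- 4. HTTP request panel
- 5. Business metrics panel
- 6. System resource panel
- 7. Container resource panel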
Chapter 5: Advanced Monitoring Features
5.1 Custom Metrics Endpoint
Create a custom Actuator endpoint that exposes business metrics. To reach it at /actuator/businessmetrics, add businessmetrics to management.endpoints.web.exposure.include:
java
package com.example.monitoring.config;

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.endpoint.annotation.Endpoint;
import org.springframework.boot.actuate.endpoint.annotation.ReadOperation;
import org.springframework.stereotype.Component;

import java.util.HashMap;
import java.util.Map;

@Component
@Endpoint(id = "businessmetrics")
public class BusinessMetricsEndpoint {
    private final MeterRegistry meterRegistry;

    public BusinessMetricsEndpoint(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @ReadOperation
    public Map<String, Object> businessMetrics() {
        Map<String, Object> metrics = new HashMap<>();
        // order.created counts successful orders; order.failed counts failed attempts
        double created = meterRegistry.get("order.created").counter().count();
        double failed = meterRegistry.get("order.failed").counter().count();
        double attempts = created + failed;
        metrics.put("orders.created.total", created);
        metrics.put("orders.failed.total", failed);
        // Success rate over all attempts, reported as 100 before any order arrives
        metrics.put("orders.success.rate",
                attempts > 0 ? created / attempts * 100 : 100);
        return metrics;
    }
}
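The success rate depends on what each counter actually counts. The following is a minimal sketch of the arithmetic, assuming (as in OrderService) that order.created is incremented only after a successful save and order.failed only when processing fails, so the number of attempts is the sum of both; the class name `OrderSuccessRate` is hypothetical.

```java
/** Sketch of a success-rate calculation from two independent counters. */
final class OrderSuccessRate {
    private OrderSuccessRate() {}

    /**
     * @param created number of orders saved successfully
     * @param failed number of order attempts that failed
     * @return success percentage, or 100 when nothing has been attempted yet
     */
    static double percent(double created, double failed) {
        double attempts = created + failed;
        return attempts > 0 ? 100.0 * created / attempts : 100.0;
    }

    public static void main(String[] args) {
        // Nine successful orders and one failure
        System.out.println(percent(9, 1));
    }
}
```

With nine successes and one failure the sketch reports 90 percent. Subtracting the failures from the successes instead would count each failure twice and understate the rate.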
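5.2 Distributed Tracing Integration
Add distributed tracing support. Spring Cloud Sleuth 3.1.x targets Spring Boot 2.6 and 2.7:
xml
<!-- Add to pom.xml -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
    <version>3.1.0</version>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
    <version>3.1.0</version>
</dependency>
Configure tracing. The zipkin host name assumes a Zipkin container on the same Docker network, which the Compose file in section 3.2 does not define:
yaml
spring:
  sleuth:
    sampler:
      # 1.0 traces every request; lower it in production to reduce overhead
      probability: 1.0
  zipkin:
    base-url: http://zipkin:9411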
5.3 Performance Testing and Monitoring Verification
Create an integration test that verifies the monitoring endpoints:
java
package com.example.monitoring.test;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.http.ResponseEntity;
import org.springframework.test.context.ActiveProfiles;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ActiveProfiles("test")
public class MonitoringIntegrationTest {
    @Autowired
    private TestRestTemplate restTemplate;

    @Test
    public void testActuatorEndpoints() {
        // Health endpoint
        ResponseEntity<String> healthResponse =
                restTemplate.getForEntity("/actuator/health", String.class);
        assertThat(healthResponse.getStatusCodeValue()).isEqualTo(200);
        // Metrics endpoint
        ResponseEntity<String> metricsResponse =
                restTemplate.getForEntity("/actuator/metrics", String.class);
        assertThat(metricsResponse.getStatusCodeValue()).isEqualTo(200);
        // Prometheus endpoint
        ResponseEntity<String> prometheusResponse =
                restTemplate.getForEntity("/actuator/prometheus", String.class);
        assertThat(prometheusResponse.getStatusCodeValue()).isEqualTo(200);
    }
}
Chapter 6: Production Deployment and Optimization
6.1 Production-Grade Docker Compose
Create the production configuration docker-compose.prod.yml. It defines no images, so it is meant to be layered on top of docker-compose.yml as an override file. With replicas: 3, the fixed host port mapping 8080:8080 from the base file has to be removed or replaced by a load balancer, because only one replica can bind the host port:
yaml
version: '3.8'
services:
  order-service:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 512M
          cpus: '0.25'
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    configs:
      - source: app-config
        target: /app/config/application.yml
    secrets:
      - db-password

  prometheus:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
    volumes:
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      # Data is dropped when either limit is reached first
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'

configs:
  app-config:
    file: ./config/application-prod.yml

secrets:
  db-password:
    file: ./secrets/db_password.txt
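6.2 Persisting Monitoring Data
Back up and restore the persisted data. The container name prometheus is a placeholder; Compose usually prefixes the project name, so check the actual name with docker compose ps:
bash
# Back up the Prometheus data
docker exec prometheus tar czf - /prometheus > prometheus_backup.tar.gz
# Restore the data
cat prometheus_backup.tar.gz | docker exec -i prometheus tar xzf - -C /
6.3 Security Configuration
Add authentication:
yaml
# Grafana authentication settings
grafana:
  environment:
    - GF_AUTH_ANONYMOUS_ENABLED=false
    - GF_AUTH_BASIC_ENABLED=true
    # Placeholder; supply a real secret from the environment or a secrets store
    - GF_SECURITY_SECRET_KEY=your-secret-key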
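Chapter 7: Troubleshooting and Performance Tuning
7.1 Common Problems
7.1.1 Metrics Are Not Collected
bash
# Check that the application endpoint returns metrics
curl http://localhost:8080/actuator/prometheus
# Check the state of the Prometheus scrape targets
curl http://localhost:9090/api/v1/targets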
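The first check can also be automated. The following is a minimal sketch, assuming the application is reachable at http://localhost:8080 and that the counters from section 2.3 are exported as `order_created_total`; the class name `MetricsProbe` and the list of expected names are hypothetical. It uses only the JDK 11+ HTTP client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: fetch the Prometheus endpoint and report which expected metrics exist. */
final class MetricsProbe {
    private MetricsProbe() {}

    /** Extracts metric names from Prometheus text exposition output. */
    static Set<String> metricNames(String body) {
        Set<String> names = new TreeSet<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            int end = 0;
            // The name ends at the label block or at the space before the value
            while (end < trimmed.length() && "{ ".indexOf(trimmed.charAt(end)) < 0) {
                end++;
            }
            names.add(trimmed.substring(0, end));
        }
        return names;
    }

    /** Downloads the endpoint body, failing on any status other than 200. */
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("HTTP " + response.statusCode() + " from " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "http://localhost:8080/actuator/prometheus";
        Set<String> names = metricNames(fetch(url));
        // Assumed names; adjust to whatever the application registers
        for (String expected : new String[] {"order_created_total", "jvm_memory_used_bytes"}) {
            System.out.println(expected + (names.contains(expected) ? " present" : " MISSING"));
        }
    }
}
```

If the endpoint answers but a metric is missing, the problem is on the application side (the meter was never registered or is filtered). If the endpoint is fine but the target shows as down in Prometheus, the problem is usually the network name or the metrics_path in prometheus.yml.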
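7.1.2 Investigating Memory Leaks
Micrometer already exports JVM memory metrics such as jvm_memory_used_bytes. Common tags make it possible to compare heap growth per application and region:
java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {
    // Adds the same tags to every meter in the registry
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        // Tag values must not be null, so fall back when REGION is unset
        String region = System.getenv().getOrDefault("REGION", "unknown");
        return registry -> registry.config().commonTags(
                "application", "order-service",
                "region", region);
    }
}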
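When a leak is suspected, it can help to read heap usage directly from the JVM and look at the trend across samples. The following is a minimal sketch, assuming that a heap level that rises across every sample taken at comparable moments is treated as a hint to investigate, not as proof of a leak; the class name `HeapSnapshot` and the sample values are hypothetical. It uses the JDK's standard management beans.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Sketch: read heap usage from the JVM and flag a steadily rising trend. */
final class HeapSnapshot {
    private HeapSnapshot() {}

    /** Current heap usage in MiB as reported by the JVM's memory MXBean. */
    static double usedMiB() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() / (1024.0 * 1024.0);
    }

    /** True when every sample is higher than the one before it. */
    static boolean strictlyGrowing(double[] samples) {
        if (samples.length < 2) {
            return false;
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f MiB%n", usedMiB());
        // Hypothetical readings taken after successive full GCs
        double[] afterGc = {210.0, 245.5, 290.2, 340.8};
        System.out.println("suspicious growth: " + strictlyGrowing(afterGc));
    }
}
```

Normal heap usage rises and falls with garbage collection, so samples are only comparable when taken at similar points, for example right after a full GC. A confirmed upward trend is the cue to take a heap dump and inspect it.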
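7.2 Performance Tuning Recommendations
- Tune the scrape interval: adjust it to the application load
- Optimize query performance: precompute expensive expressions with recording rules
- Data retention policy: size it to the available storage
- Resource limits: set sensible container resource limits
Summary
With the configuration in this article we have built a production-grade monitoring system for Spring Boot applications with the following properties:
- ✅ Full coverage: application, system, and container monitoring
- ✅ Real-time alerting: multiple severity levels and notification channels
- ✅ Visualization: Grafana dashboards
- ✅ Production readiness: resource limits, persistence, and security settings
- ✅ Business integration: custom business metrics
This monitoring stack helps you detect and resolve problems early and keeps Spring Boot applications running reliably in Docker.
Possible next steps:
- Integrate log monitoring (ELK/Loki)
- Automate failure recovery
- Add machine-learning-based anomaly detection
- Build analysis and forecasting on top of the monitoring data
Continuous refinement of the monitoring stack leads to a more stable and reliable cloud-native system.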
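Scrape interval and retention together determine how much disk Prometheus needs. The following is a back-of-the-envelope sketch, assuming the rule of thumb from the Prometheus storage documentation (retention time in seconds, times ingested samples per second, times roughly 1 to 2 bytes per sample); the class name `RetentionSizing` and the series count are hypothetical.

```java
/** Sketch of a rough disk-size estimate for Prometheus retention planning. */
final class RetentionSizing {
    private RetentionSizing() {}

    /**
     * @param activeSeries number of time series being scraped
     * @param scrapeIntervalSeconds seconds between scrapes of each series
     * @param retentionDays how long samples are kept
     * @param bytesPerSample assumed average size of one compressed sample
     * @return estimated disk usage in bytes
     */
    static double estimateBytes(long activeSeries, double scrapeIntervalSeconds,
                                double retentionDays, double bytesPerSample) {
        double samplesPerSecond = activeSeries / scrapeIntervalSeconds;
        double retentionSeconds = retentionDays * 24 * 60 * 60;
        return retentionSeconds * samplesPerSecond * bytesPerSample;
    }

    public static void main(String[] args) {
        // 10,000 series scraped every 15 s, kept for 30 days at 2 bytes per sample
        double bytes = estimateBytes(10_000, 15, 30, 2);
        System.out.printf("about %.2f GB%n", bytes / 1e9);
    }
}
```

For these assumed inputs the estimate is about 3.46 GB, comfortably inside a 10 GB size limit. Halving the scrape interval doubles the estimate, which is the trade-off behind the first and third recommendations above.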