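Case 03 - Appendix D - Monitoring System

📋 Overview

This document describes the design and implementation of the enterprise monitoring and diagnostics system for Atlas Mapper. It covers performance monitoring, distributed tracing, log analysis, and alerting, which together give large enterprise applications end-to-end observability.


🏗️ Monitoring System Architecture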

Overall monitoring architecture diagram

graph TB
    subgraph "Monitoring System Architecture"
        subgraph "Data Collection Layer"
            A1[Application Metrics Collection]
            A2[JVM Metrics Collection]
            A3[Business Metrics Collection]
            A4[Trace Collection]
            A5[Log Collection]
        end
        subgraph "Data Transport Layer"
            B1[Micrometer]
            B2[Zipkin Client]
            B3[Logback Appender]
            B4[Kafka Producer]
        end
        subgraph "Data Storage Layer"
            C1[Prometheus]
            C2[Zipkin Server]
            C3[Elasticsearch]
            C4[InfluxDB]
        end
        subgraph "Data Processing Layer"
            D1[Grafana]
            D2[Kibana]
            D3[Jaeger UI]
            D4[AlertManager]
        end
        subgraph "Alert Notification Layer"
            E1[Email Notification]
            E2[SMS Notification]
            E3[DingTalk Notification]
            E4[PagerDuty]
        end
        subgraph "Operations Management Layer"
            F1[Monitoring Dashboard]
            F2[Alert Management]
            F3[Performance Analysis]
            F4[Capacity Planning]
        end
    end
    A1 --> B1
    A2 --> B1
    A3 --> B1
    A4 --> B2
    A5 --> B3
    B1 --> C1
    B2 --> C2
    B3 --> C3
    B4 --> C4
    C1 --> D1
    C2 --> D3
    C3 --> D2
    C4 --> D1
    D1 --> E1
    D2 --> E2
    D4 --> E3
    D4 --> E4
    E1 --> F1
    E2 --> F2
    E3 --> F3
    E4 --> F4

Monitoring data flow diagram

sequenceDiagram
    participant App as Application Service
    participant Collector as Metrics Collector
    participant Prometheus as Prometheus
    participant Zipkin as Zipkin
    participant ELK as ELK Stack
    participant Grafana as Grafana
    participant AlertManager as AlertManager
    participant Notification as Notification System
    App->>Collector: Send application metrics
    App->>Zipkin: Send trace data
    App->>ELK: Send log data
    Collector->>Prometheus: Store time-series metrics
    Prometheus->>Grafana: Query metric data
    Grafana->>Grafana: Render monitoring charts
    Prometheus->>AlertManager: Fire alert rules
    AlertManager->>Notification: Send alert notifications
    Zipkin->>Zipkin: Analyze trace performance
    ELK->>ELK: Analyze and search logs
    Note over App,Notification: Real-time monitoring and alerting flow

📊 Core Monitoring Components

1. Application Performance Monitoring (APM)

java
package io.github.nemoob.atlas.mapper.example.monitoring;

import io.micrometer.core.instrument.*;
import io.micrometer.core.instrument.binder.MeterBinder;
import org.springframework.stereotype.Component;
import org.springframework.beans.factory.annotation.Autowired;
import lombok.extern.slf4j.Slf4j;

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.ConcurrentHashMap;
import java.time.Duration;

/**
 * Performance metrics for Atlas Mapper.
 */
@Component
@Slf4j
public class AtlasMapperMetrics implements MeterBinder {
    
    private final MeterRegistry meterRegistry;
    
    // Core meters
    private final Counter mappingCounter;
    private final Timer mappingTimer;
    private final Gauge memoryUsageGauge;
    private final DistributionSummary batchSizeDistribution;
    
    // Business meters
    private final Counter errorCounter;
    private final Counter cacheHitCounter;
    private final Counter cacheMissCounter;
    
    // Live counters
    private final AtomicLong activeMappings = new AtomicLong(0);
    private final AtomicLong totalMappings = new AtomicLong(0);
    private final AtomicLong totalErrors = new AtomicLong(0);
    private final ConcurrentHashMap<String, AtomicLong> mappingTypeCounters = new ConcurrentHashMap<>();
    
    public AtlasMapperMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        
        // Initialize the core meters
        this.mappingCounter = Counter.builder("atlas.mapper.mappings.total")
            .description("Total number of mappings performed")
            .register(meterRegistry);
            
        this.mappingTimer = Timer.builder("atlas.mapper.mapping.duration")
            .description("Time taken for mapping operations")
            .register(meterRegistry);
            
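        // Gauge.builder(...) takes the observed object and value function up front;
        // register(...) accepts only the registry.
        this.memoryUsageGauge = Gauge.builder("atlas.mapper.memory.usage", this, AtlasMapperMetrics::getCurrentMemoryUsage)
            .description("JVM heap currently in use")
            .baseUnit("bytes")
            .register(meterRegistry);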
            
        this.batchSizeDistribution = DistributionSummary.builder("atlas.mapper.batch.size")
            .description("Distribution of batch sizes")
            .register(meterRegistry);
            
        this.errorCounter = Counter.builder("atlas.mapper.errors.total")
            .description("Total number of mapping errors")
            .register(meterRegistry);
            
        this.cacheHitCounter = Counter.builder("atlas.mapper.cache.hits")
            .description("Number of cache hits")
            .register(meterRegistry);
            
        this.cacheMissCounter = Counter.builder("atlas.mapper.cache.misses")
            .description("Number of cache misses")
            .register(meterRegistry);
    }
    
    @Override
    public void bindTo(MeterRegistry registry) {
        // Bind derived gauges; the observed object and value function go to builder(...)
        Gauge.builder("atlas.mapper.active.mappings", activeMappings, AtomicLong::get)
            .description("Number of currently active mappings")
            .register(registry);
            
        Gauge.builder("atlas.mapper.error.rate", this, AtlasMapperMetrics::getErrorRate)
            .description("Current error rate")
            .register(registry);
            
        Gauge.builder("atlas.mapper.cache.hit.rate", this, AtlasMapperMetrics::getCacheHitRate)
            .description("Cache hit rate")
            .register(registry);
    }
    
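    /**
     * Records the start of a mapping operation.
     */
    public Timer.Sample startMapping(String mappingType) {
        activeMappings.incrementAndGet();
        totalMappings.incrementAndGet();
        
        // Per-type tally
        mappingTypeCounters.computeIfAbsent(mappingType, k -> new AtomicLong(0)).incrementAndGet();
        
        // Counter.increment() accepts no tags; tags are fixed at registration, so a
        // tagged counter is looked up (or created) per type. The ".by.type" name is
        // illustrative and avoids reusing one metric name with different tag keys.
        mappingCounter.increment();
        meterRegistry.counter("atlas.mapper.mappings.by.type", "type", mappingType).increment();
        
        return Timer.start(meterRegistry);
    }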
    
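    /**
     * Records the end of a mapping operation.
     */
    public void finishMapping(Timer.Sample sample, String mappingType, boolean success) {
        activeMappings.decrementAndGet();
        
        // Note: this tagged timer shares its name with the untagged timer registered
        // in the constructor; some registries (Prometheus among them) expect every
        // meter of one name to carry the same tag keys.
        sample.stop(Timer.builder("atlas.mapper.mapping.duration")
            .tag("type", mappingType)
            .tag("result", success ? "success" : "error")
            .register(meterRegistry));
            
        if (!success) {
            totalErrors.incrementAndGet();
            // increment() accepts no tags; the per-type counter name is illustrative.
            errorCounter.increment();
            meterRegistry.counter("atlas.mapper.errors.by.type", "type", mappingType).increment();
        }
    }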
    
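    // One strongly referenced holder per operation type. Each stores the latest
    // throughput as raw double bits, so the gauge always reads the current value.
    private final ConcurrentHashMap<String, AtomicLong> throughputBits = new ConcurrentHashMap<>();

    /**
     * Records a batch operation.
     */
    public void recordBatchOperation(int batchSize, Duration duration, String operationType) {
        batchSizeDistribution.record(batchSize);
        
        Timer.builder("atlas.mapper.batch.duration")
            .tag("operation", operationType)
            .register(meterRegistry)
            .record(duration);
            
        // Throughput in items per second; guard against a zero-length duration.
        long nanos = duration.toNanos();
        double throughput = nanos > 0 ? batchSize * 1_000_000_000.0 / nanos : 0.0;
        
        // Register the gauge once per operation type, then only update its holder.
        AtomicLong holder = throughputBits.computeIfAbsent(operationType, op ->
            meterRegistry.gauge("atlas.mapper.throughput", Tags.of("operation", op),
                new AtomicLong(), bits -> Double.longBitsToDouble(bits.get())));
        holder.set(Double.doubleToLongBits(throughput));
    }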
    
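    /**
     * Records cache operations.
     */
    public void recordCacheHit(String cacheType) {
        // The untagged counter feeds getCacheHitRate(); the per-cache counter
        // (illustrative name) carries the tag, since increment() accepts no tags.
        cacheHitCounter.increment();
        meterRegistry.counter("atlas.mapper.cache.hits.by.cache", "cache", cacheType).increment();
    }
    
    public void recordCacheMiss(String cacheType) {
        cacheMissCounter.increment();
        meterRegistry.counter("atlas.mapper.cache.misses.by.cache", "cache", cacheType).increment();
    }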
    
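    // Strongly referenced holders: Micrometer gauges keep only a weak reference to
    // the observed object, so a boxed local value could be collected and read as NaN.
    private final AtomicLong usedMemoryBytes = new AtomicLong();
    private final AtomicLong totalMemoryBytes = new AtomicLong();

    /**
     * Records memory usage.
     */
    public void recordMemoryUsage(long usedMemory, long totalMemory) {
        usedMemoryBytes.set(usedMemory);
        totalMemoryBytes.set(totalMemory);
        
        // Registering an already registered gauge returns the existing one,
        // so repeated calls only refresh the holders above.
        Gauge.builder("atlas.mapper.memory.used", usedMemoryBytes, AtomicLong::get)
            .register(meterRegistry);
        Gauge.builder("atlas.mapper.memory.total", totalMemoryBytes, AtomicLong::get)
            .register(meterRegistry);
        Gauge.builder("atlas.mapper.memory.utilization", this, m -> {
                long total = m.totalMemoryBytes.get();
                return total > 0 ? (double) m.usedMemoryBytes.get() / total : 0.0;
            })
            .register(meterRegistry);
    }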
    
    /**
     * Returns the JVM heap currently in use, in bytes.
     */
    private double getCurrentMemoryUsage() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }
    
    /**
     * Returns the error rate (errors / total mappings).
     */
    private double getErrorRate() {
        long total = totalMappings.get();
        long errors = totalErrors.get();
        return total > 0 ? (double) errors / total : 0.0;
    }
    
    /**
     * Returns the cache hit rate (hits / (hits + misses)).
     */
    private double getCacheHitRate() {
        double hits = cacheHitCounter.count();
        double misses = cacheMissCounter.count();
        double total = hits + misses;
        return total > 0 ? hits / total : 0.0;
    }
    
    /**
     * Returns a detailed statistics snapshot.
     */
    public MappingStatistics getStatistics() {
        return MappingStatistics.builder()
            .activeMappings(activeMappings.get())
            .totalMappings(totalMappings.get())
            .totalErrors(totalErrors.get())
            .errorRate(getErrorRate())
            .cacheHitRate(getCacheHitRate())
            .currentMemoryUsage(getCurrentMemoryUsage())
            .mappingTypeCounters(new ConcurrentHashMap<>(mappingTypeCounters))
            .build();
    }
}

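/**
 * Mapping statistics snapshot.
 * As a public top-level class it belongs in its own file, MappingStatistics.java.
 */
@lombok.Data
@lombok.Builder
public class MappingStatistics {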
    private long activeMappings;
    private long totalMappings;
    private long totalErrors;
    private double errorRate;
    private double cacheHitRate;
    private double currentMemoryUsage;
    private ConcurrentHashMap<String, AtomicLong> mappingTypeCounters;
}
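
Throughput for a batch is the item count divided by the elapsed seconds, which is undefined when the elapsed time rounds down to zero. The following is a minimal JDK-only sketch of a guarded calculation; `ThroughputCalculator` and `itemsPerSecond` are hypothetical names for illustration and are not part of Atlas Mapper or Micrometer.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of Atlas Mapper or Micrometer. */
final class ThroughputCalculator {

    private ThroughputCalculator() {
    }

    /**
     * Returns items per second, or 0.0 when the batch is empty or the
     * duration is zero or negative (instead of Infinity or NaN).
     */
    static double itemsPerSecond(int batchSize, Duration duration) {
        long nanos = duration.toNanos();
        if (batchSize <= 0 || nanos <= 0) {
            return 0.0;
        }
        // Nanosecond resolution keeps sub-millisecond batches measurable.
        return batchSize * 1_000_000_000.0 / nanos;
    }
}
```

Working in nanoseconds rather than `toMillis()` matters for small batches: a batch that finishes in under one millisecond would otherwise report a duration of zero.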

2. Distributed Tracing

java
package io.github.nemoob.atlas.mapper.example.tracing;

import brave.Tracing;
import brave.Span;
import brave.Tracer;
import brave.propagation.TraceContext;
import org.springframework.stereotype.Component;
import org.springframework.beans.factory.annotation.Autowired;
import lombok.extern.slf4j.Slf4j;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Tracer for Atlas Mapper operations.
 */
@Component
@Slf4j
public class AtlasMapperTracer {
    
    @Autowired
    private Tracer tracer;
    
    // Spans that have been started but not yet finished, keyed by span ID
    private final Map<String, Span> activeSpans = new ConcurrentHashMap<>();
    
    /**
     * Starts a trace for a mapping operation.
     */
    public Span startMappingTrace(String operationName, String sourceType, String targetType) {
        Span span = tracer.nextSpan()
            .name("atlas-mapper-" + operationName)
            .tag("component", "atlas-mapper")
            .tag("source.type", sourceType)
            .tag("target.type", targetType)
            .tag("operation", operationName)
            .start();
            
        // Track the active span
        String spanId = span.context().spanIdString();
        activeSpans.put(spanId, span);
        
        log.debug("Started mapping trace: {} -> {}, spanId: {}", sourceType, targetType, spanId);
        return span;
    }
    
    /**
     * Starts a trace for a batch mapping operation.
     */
    public Span startBatchMappingTrace(String operationName, int batchSize, String sourceType, String targetType) {
        Span span = tracer.nextSpan()
            .name("atlas-mapper-batch-" + operationName)
            .tag("component", "atlas-mapper")
            .tag("operation.type", "batch")
            .tag("batch.size", String.valueOf(batchSize))
            .tag("source.type", sourceType)
            .tag("target.type", targetType)
            .start();
            
        String spanId = span.context().spanIdString();
        activeSpans.put(spanId, span);
        
        log.debug("Started batch mapping trace: {} items {} -> {}, spanId: {}", 
            batchSize, sourceType, targetType, spanId);
        return span;
    }
    
    /**
     * Adds mapping details to the span as tags.
     */
    public void addMappingDetails(Span span, Map<String, Object> details) {
        if (span != null) {
            details.forEach((key, value) -> {
                if (value != null) {
                    span.tag("mapping." + key, value.toString());
                }
            });
        }
    }
    
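    /**
     * Records a mapping error on the span.
     */
    public void recordMappingError(Span span, Exception error) {
        if (span != null) {
            // getMessage() may be null, and a null tag value is not accepted by
            // Brave, so fall back to the exception's string form.
            String message = error.getMessage() != null ? error.getMessage() : error.toString();
            span.tag("error", "true")
                .tag("error.type", error.getClass().getSimpleName())
                .tag("error.message", message);
                
            log.error("Mapping error recorded in trace: {}", message, error);
        }
    }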
    
    /**
     * Records performance figures on the span.
     */
    public void recordPerformanceMetrics(Span span, long duration, long memoryUsed) {
        if (span != null) {
            span.tag("performance.duration.ms", String.valueOf(duration))
                .tag("performance.memory.bytes", String.valueOf(memoryUsed));
        }
    }
    
    /**
     * Finishes a mapping trace.
     */
    public void finishMappingTrace(Span span, boolean success) {
        if (span != null) {
            span.tag("success", String.valueOf(success));
            
            String spanId = span.context().spanIdString();
            activeSpans.remove(spanId);
            
            // brave.Span is completed with finish(); it has no end() method.
            span.finish();
            log.debug("Finished mapping trace, spanId: {}, success: {}", spanId, success);
        }
    }
    
    /**
     * Creates a child span for tracing a detailed sub-operation.
     */
    public Span createChildSpan(Span parentSpan, String operationName) {
        if (parentSpan == null) {
            return null;
        }
        
        return tracer.newChild(parentSpan.context())
            .name(operationName)
            .tag("component", "atlas-mapper")
            .start();
    }
    
    /**
     * Traces a single field mapping.
     */
    public void traceFieldMapping(Span span, String fieldName, Object sourceValue, Object targetValue) {
        if (span != null) {
            Span fieldSpan = createChildSpan(span, "field-mapping");
            if (fieldSpan != null) {
                fieldSpan.tag("field.name", fieldName)
                    .tag("field.source.type", sourceValue != null ? sourceValue.getClass().getSimpleName() : "null")
                    .tag("field.target.type", targetValue != null ? targetValue.getClass().getSimpleName() : "null");
                fieldSpan.finish();
            }
        }
    }
    
    /**
     * Traces a cache operation.
     */
    public void traceCacheOperation(Span span, String operation, String cacheKey, boolean hit) {
        if (span != null) {
            Span cacheSpan = createChildSpan(span, "cache-" + operation);
            if (cacheSpan != null) {
                cacheSpan.tag("cache.operation", operation)
                    .tag("cache.key", cacheKey)
                    .tag("cache.hit", String.valueOf(hit));
                cacheSpan.finish();
            }
        }
    }
    
    /**
     * Returns the current trace context, or null if no span is in scope.
     */
    public TraceContext getCurrentTraceContext() {
        Span currentSpan = tracer.currentSpan();
        return currentSpan != null ? currentSpan.context() : null;
    }
    
    /**
     * Returns statistics about active spans.
     */
    public TracingStatistics getTracingStatistics() {
        return TracingStatistics.builder()
            .activeSpansCount(activeSpans.size())
            .currentTraceId(getCurrentTraceId())
            .build();
    }
    
    private String getCurrentTraceId() {
        TraceContext context = getCurrentTraceContext();
        return context != null ? context.traceIdString() : null;
    }
}

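/**
 * Tracing statistics snapshot.
 * As a public top-level class it belongs in its own file, TracingStatistics.java.
 */
@lombok.Data
@lombok.Builder
public class TracingStatistics {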
    private int activeSpansCount;
    private String currentTraceId;
}
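
The tracer above keeps every started span in a map until `finishMappingTrace` is called, so a caller that throws before finishing leaves an entry behind. One way to make cleanup unconditional is an `AutoCloseable` scope used with try-with-resources. The sketch below shows only that bookkeeping pattern with JDK classes; `ActiveOperationRegistry` and `Scope` are hypothetical names, and no Brave API is involved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the active-span bookkeeping; it does not use Brave. */
final class ActiveOperationRegistry {

    private final Map<Long, String> active = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    /** Registers an operation and returns a scope whose close() always unregisters it. */
    Scope start(String operationName) {
        long id = nextId.getAndIncrement();
        active.put(id, operationName);
        return new Scope(id);
    }

    int activeCount() {
        return active.size();
    }

    final class Scope implements AutoCloseable {
        private final long id;
        private boolean success = true;

        private Scope(long id) {
            this.id = id;
        }

        /** Marks the operation as failed; a real tracer would tag the span here. */
        void markFailed() {
            success = false;
        }

        boolean succeeded() {
            return success;
        }

        @Override
        public void close() {
            // Runs on normal exit and when the body throws.
            active.remove(id);
        }
    }
}
```

A caller would write `try (ActiveOperationRegistry.Scope scope = registry.start("user-to-dto")) { ... }`. In a Brave-backed version, `close()` would be the natural place to tag the outcome and finish the span.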

3. Smart Log Analyzer

java
package io.github.nemoob.atlas.mapper.example.logging;

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;
import org.springframework.stereotype.Component;
import lombok.extern.slf4j.Slf4j;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Pattern;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Smart log analyzer for Atlas Mapper.
 */
@Component
@Slf4j
public class AtlasMapperLogAnalyzer extends AppenderBase<ILoggingEvent> {
    
    // Log queue (bounded)
    private final BlockingQueue<ILoggingEvent> logQueue = new LinkedBlockingQueue<>(10000);
    
    // Counters
    private final AtomicLong totalLogs = new AtomicLong(0);
    private final AtomicLong errorLogs = new AtomicLong(0);
    private final AtomicLong warnLogs = new AtomicLong(0);
    private final AtomicLong performanceLogs = new AtomicLong(0);
    
    // Error pattern matching. Insertion-ordered so the first-match rule in
    // analyzeErrorLog is deterministic; the map is filled once in the constructor
    // and only read afterwards.
    private final Map<Pattern, String> errorPatterns = new java.util.LinkedHashMap<>();
    private final Map<String, AtomicLong> errorTypeCounters = new ConcurrentHashMap<>();
    
    // Performance thresholds
    private final long performanceThreshold = 1000; // 1 second, in milliseconds
    private final long memoryThreshold = 100 * 1024 * 1024; // 100 MB, in bytes
    
    public AtlasMapperLogAnalyzer() {
        initializeErrorPatterns();
        startLogProcessor();
    }
    
    /**
     * Initializes the error patterns.
     */
    private void initializeErrorPatterns() {
        // Case-insensitive so "Mapping failed" matches; DOTALL so ".*" spans
        // line breaks in multi-line messages.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
        errorPatterns.put(Pattern.compile(".*OutOfMemoryError.*", flags), "MEMORY_ERROR");
        errorPatterns.put(Pattern.compile(".*NullPointerException.*", flags), "NULL_POINTER");
        errorPatterns.put(Pattern.compile(".*ClassCastException.*", flags), "TYPE_CAST_ERROR");
        errorPatterns.put(Pattern.compile(".*mapping.*failed.*", flags), "MAPPING_ERROR");
        errorPatterns.put(Pattern.compile(".*timeout.*", flags), "TIMEOUT_ERROR");
        errorPatterns.put(Pattern.compile(".*connection.*refused.*", flags), "CONNECTION_ERROR");
    }
    
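    @Override
    protected void append(ILoggingEvent event) {
        totalLogs.incrementAndGet();
        
        // Capture thread-bound data (MDC, thread name, formatted message) before
        // the event is handed to another thread.
        event.prepareForDeferredProcessing();
        
        // Hand off for asynchronous processing
        if (!logQueue.offer(event)) {
            // Queue full: drop the event. Report through Logback's status API rather
            // than a logger, since logging from inside an appender feeds back into it.
            addWarn("Log queue is full, dropping log event");
        }
    }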
    
    /**
     * Starts the background log processor thread.
     */
    private void startLogProcessor() {
        Thread processor = new Thread(this::processLogs, "log-analyzer");
        processor.setDaemon(true);
        processor.start();
    }
    
    /**
     * Drains the queue and processes log events until interrupted.
     */
    private void processLogs() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                ILoggingEvent event = logQueue.take();
                analyzeLogEvent(event);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                log.error("Error processing log event", e);
            }
        }
    }
    
    /**
     * Analyzes a single log event.
     */
    private void analyzeLogEvent(ILoggingEvent event) {
        String level = event.getLevel().toString();
        String message = event.getFormattedMessage();
        String loggerName = event.getLoggerName();
        
        // Count by log level
        switch (level) {
            case "ERROR" -> {
                errorLogs.incrementAndGet();
                analyzeErrorLog(event, message);
            }
            case "WARN" -> {
                warnLogs.incrementAndGet();
                analyzeWarningLog(event, message);
            }
            case "INFO", "DEBUG" -> analyzeInfoLog(event, message);
        }
        
        // Analyze performance-related logs
        if (isPerformanceLog(message)) {
            performanceLogs.incrementAndGet();
            analyzePerformanceLog(event, message);
        }
        
        // Inspect any attached exception
        if (event.getThrowableProxy() != null) {
            analyzeException(event);
        }
    }
    
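    /**
     * Analyzes an error log entry.
     */
    private void analyzeErrorLog(ILoggingEvent event, String message) {
        // Match against the error patterns. find() is used instead of matches()
        // so a pattern also hits when it occurs on one line of a multi-line message.
        for (Map.Entry<Pattern, String> entry : errorPatterns.entrySet()) {
            if (entry.getKey().matcher(message).find()) {
                String errorType = entry.getValue();
                errorTypeCounters.computeIfAbsent(errorType, k -> new AtomicLong(0)).incrementAndGet();
                
                // Raise an alert for the first matching pattern only
                sendErrorAlert(errorType, message, event);
                break;
            }
        }
    }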
    
    /**
     * Analyzes a warning log entry.
     */
    private void analyzeWarningLog(ILoggingEvent event, String message) {
        // Check whether this is a performance warning
        if (message.contains("slow") || message.contains("timeout") || message.contains("performance")) {
            sendPerformanceWarning(message, event);
        }
    }
    
    /**
     * Analyzes an info or debug log entry.
     */
    private void analyzeInfoLog(ILoggingEvent event, String message) {
        // Extract business metrics
        extractBusinessMetrics(message);
    }
    
    /**
     * Analyzes a performance log entry.
     */
    private void analyzePerformanceLog(ILoggingEvent event, String message) {
        // Extract performance data
        PerformanceData perfData = extractPerformanceData(message);
        
        if (perfData != null) {
            // Compare against the performance thresholds
            if (perfData.getDuration() > performanceThreshold) {
                sendPerformanceAlert("SLOW_OPERATION", perfData, event);
            }
            
            if (perfData.getMemoryUsage() > memoryThreshold) {
                sendPerformanceAlert("HIGH_MEMORY_USAGE", perfData, event);
            }
        }
    }
    
    /**
     * Analyzes the exception attached to a log event.
     */
    private void analyzeException(ILoggingEvent event) {
        String exceptionClass = event.getThrowableProxy().getClassName();
        String exceptionMessage = event.getThrowableProxy().getMessage();
        
        // Count occurrences per exception class
        errorTypeCounters.computeIfAbsent(exceptionClass, k -> new AtomicLong(0)).incrementAndGet();
        
        // Check whether this is a critical exception
        if (isCriticalException(exceptionClass)) {
            sendCriticalAlert(exceptionClass, exceptionMessage, event);
        }
    }
    
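    /**
     * Decides whether a message is performance-related.
     * The keyword test is deliberately loose: "time" also matches words such as
     * "timeout" and "timestamp", so expect false positives.
     */
    private boolean isPerformanceLog(String message) {
        return message.contains("duration") || 
               message.contains("time") || 
               message.contains("performance") ||
               message.contains("memory") ||
               message.contains("throughput");
    }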
    
    // Compiled once; Pattern instances are immutable and thread-safe.
    private static final Pattern DURATION_PATTERN = Pattern.compile("duration[:\\s]+(\\d+)");
    private static final Pattern MEMORY_PATTERN = Pattern.compile("memory[:\\s]+(\\d+)");

    /**
     * Extracts performance data from a message such as "duration: 1500 memory: 2048".
     * A value that is absent from the message is reported as 0.
     */
    private PerformanceData extractPerformanceData(String message) {
        try {
            java.util.regex.Matcher durationMatcher = DURATION_PATTERN.matcher(message);
            java.util.regex.Matcher memoryMatcher = MEMORY_PATTERN.matcher(message);
            
            long duration = durationMatcher.find() ? Long.parseLong(durationMatcher.group(1)) : 0;
            long memory = memoryMatcher.find() ? Long.parseLong(memoryMatcher.group(1)) : 0;
            
            return new PerformanceData(duration, memory);
            
        } catch (Exception e) {
            // For example, a number too large for a long
            log.debug("Failed to extract performance data from: {}", message);
            return null;
        }
    }
    
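    /**
     * Extracts business metrics.
     */
    private void extractBusinessMetrics(String message) {
        // Extract business metrics such as mapping success rate and volume processed
        if (message.contains("mapping completed")) {
            // Parse and record the business metrics (left unimplemented here)
        }
    }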
    
    /**
     * Decides whether an exception class counts as critical.
     */
    private boolean isCriticalException(String exceptionClass) {
        return exceptionClass.contains("OutOfMemoryError") ||
               exceptionClass.contains("StackOverflowError") ||
               exceptionClass.contains("SecurityException");
    }
    
    /**
     * Sends an error alert.
     */
    private void sendErrorAlert(String errorType, String message, ILoggingEvent event) {
        AlertEvent alert = AlertEvent.builder()
            .type("ERROR")
            .subType(errorType)
            .message(message)
            .timestamp(event.getTimeStamp())
            .severity(AlertSeverity.HIGH)
            .build();
            
        // Forward to the alerting system
        sendAlert(alert);
    }
    
    /**
     * Sends a performance alert.
     */
    private void sendPerformanceAlert(String alertType, PerformanceData perfData, ILoggingEvent event) {
        AlertEvent alert = AlertEvent.builder()
            .type("PERFORMANCE")
            .subType(alertType)
            .message(String.format("Performance issue detected: duration=%dms, memory=%dMB", 
                perfData.getDuration(), perfData.getMemoryUsage() / 1024 / 1024))
            .timestamp(event.getTimeStamp())
            .severity(AlertSeverity.MEDIUM)
            .build();
            
        sendAlert(alert);
    }
    
    /**
     * Sends a performance warning.
     */
    private void sendPerformanceWarning(String message, ILoggingEvent event) {
        AlertEvent alert = AlertEvent.builder()
            .type("WARNING")
            .subType("PERFORMANCE_WARNING")
            .message(message)
            .timestamp(event.getTimeStamp())
            .severity(AlertSeverity.LOW)
            .build();
            
        sendAlert(alert);
    }
    
    /**
     * Sends a critical alert.
     */
    private void sendCriticalAlert(String exceptionClass, String exceptionMessage, ILoggingEvent event) {
        AlertEvent alert = AlertEvent.builder()
            .type("CRITICAL")
            .subType(exceptionClass)
            .message(exceptionMessage)
            .timestamp(event.getTimeStamp())
            .severity(AlertSeverity.CRITICAL)
            .build();
            
        sendAlert(alert);
    }
    
    /**
     * Sends an alert.
     */
    private void sendAlert(AlertEvent alert) {
        // A real implementation would forward this to the alerting system
        log.info("Alert generated: {}", alert);
    }
    
    /**
     * Returns log analysis statistics.
     */
    public LogAnalysisStatistics getStatistics() {
        return LogAnalysisStatistics.builder()
            .totalLogs(totalLogs.get())
            .errorLogs(errorLogs.get())
            .warnLogs(warnLogs.get())
            .performanceLogs(performanceLogs.get())
            .errorTypeCounters(new ConcurrentHashMap<>(errorTypeCounters))
            .queueSize(logQueue.size())
            .build();
    }
}

/**
 * Performance data (duration in milliseconds, memory usage in bytes).
 */
@lombok.Data
@lombok.AllArgsConstructor
class PerformanceData {
    private long duration;
    private long memoryUsage;
}

/**
 * Alert event.
 */
@lombok.Data
@lombok.Builder
class AlertEvent {
    private String type;
    private String subType;
    private String message;
    private long timestamp;
    private AlertSeverity severity;
}

/**
 * Alert severity, ordered from least to most severe.
 */
enum AlertSeverity {
    LOW, MEDIUM, HIGH, CRITICAL
}

/**
 * Log analysis statistics.
 */
@lombok.Data
@lombok.Builder
class LogAnalysisStatistics {
    private long totalLogs;
    private long errorLogs;
    private long warnLogs;
    private long performanceLogs;
    private Map<String, AtomicLong> errorTypeCounters;
    private int queueSize;
    
    public double getErrorRate() {
        return totalLogs > 0 ? (double) errorLogs / totalLogs : 0.0;
    }
}
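
The analyzer classifies an error message by the first pattern that matches, so both the matching mode and the rule order affect the result. The sketch below is a minimal JDK-only illustration of first-match classification with a fixed rule order, case-insensitive matching, and `find()`. `ErrorClassifier`, `classify`, and the `"UNKNOWN"` fallback are assumptions made for this example; the pattern labels are the ones used in the analyzer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical first-match classifier; rule order is the insertion order. */
final class ErrorClassifier {

    // LinkedHashMap preserves insertion order, which makes "first match wins" deterministic.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

    static {
        add("OutOfMemoryError", "MEMORY_ERROR");
        add("NullPointerException", "NULL_POINTER");
        add("ClassCastException", "TYPE_CAST_ERROR");
        add("mapping.*failed", "MAPPING_ERROR");
        add("timeout", "TIMEOUT_ERROR");
        add("connection.*refused", "CONNECTION_ERROR");
    }

    private ErrorClassifier() {
    }

    private static void add(String regex, String errorType) {
        // DOTALL lets ".*" cross line breaks in multi-line messages.
        RULES.put(Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL), errorType);
    }

    /** Returns the label of the first rule found in the message, or "UNKNOWN" (an assumed fallback). */
    static String classify(String message) {
        if (message == null) {
            return "UNKNOWN";
        }
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(message).find()) {
                return rule.getValue();
            }
        }
        return "UNKNOWN";
    }
}
```

With this ordering, a message that mentions both a timeout and a refused connection is always labeled `TIMEOUT_ERROR`, because that rule is registered first. Reordering the `add` calls is how the priority would be changed.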

🚨 Alerting System

Smart Alert Manager
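
The alert manager suppresses repeated notifications with a per-key cooldown: once an alert is sent, further alerts with the same key are ignored until the cooldown expires. The sketch below isolates that idea using only JDK classes. `CooldownTracker` and `tryAcquire` are hypothetical names, the clock is injected so the behavior can be tested, and the check-and-set runs atomically per key, which is a design choice of this sketch rather than something the manager is stated to do.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-key cooldown tracker with an injectable millisecond clock. */
final class CooldownTracker {

    private final ConcurrentHashMap<String, Long> cooldownEnds = new ConcurrentHashMap<>();
    private final LongSupplier clockMillis;

    CooldownTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /**
     * Returns true and starts a new cooldown if none is active for the key;
     * returns false while a cooldown is still running.
     */
    boolean tryAcquire(String alertKey, long cooldownMillis) {
        long now = clockMillis.getAsLong();
        boolean[] acquired = {false};
        // compute() runs atomically for the key, so two threads cannot both acquire.
        cooldownEnds.compute(alertKey, (key, end) -> {
            if (end != null && now < end) {
                return end; // still cooling down, keep the existing deadline
            }
            acquired[0] = true;
            return now + cooldownMillis;
        });
        return acquired[0];
    }
}
```

In production the clock would be `System::currentTimeMillis`, and the cooldown length would come from the alert severity.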

java
package io.github.nemoob.atlas.mapper.example.alerting;

import org.springframework.stereotype.Component;
import org.springframework.beans.factory.annotation.Autowired;
import lombok.extern.slf4j.Slf4j;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.List;
import java.util.ArrayList;

/**
 * Smart alert manager.
 */
@Component
@Slf4j
public class SmartAlertManager {
    
    @Autowired
    private List<AlertChannel> alertChannels;
    
    @Autowired
    private AlertRuleEngine ruleEngine;
    
    // Alert state tracking
    private final ConcurrentHashMap<String, AlertState> alertStates = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> alertCooldowns = new ConcurrentHashMap<>();
    
    // Alert statistics
    private final ConcurrentHashMap<String, java.util.concurrent.atomic.AtomicLong> alertCounters = 
        new ConcurrentHashMap<>();
    
    // Scheduled tasks
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    
    public SmartAlertManager() {
        // Check alert states every 60 seconds
        scheduler.scheduleAtFixedRate(this::checkAlertStates, 60, 60, TimeUnit.SECONDS);
        
        // Generate an alert report every 5 minutes
        scheduler.scheduleAtFixedRate(this::generateAlertReport, 300, 300, TimeUnit.SECONDS);
    }
    
    /**
     * Processes an alert event.
     */
    public void processAlert(AlertEvent alert) {
        try {
            String alertKey = generateAlertKey(alert);
            
            // Apply the alert rules
            if (!ruleEngine.shouldTriggerAlert(alert)) {
                log.debug("Alert suppressed by rules: {}", alertKey);
                return;
            }
            
            // Check the cooldown. Because this runs before the state update, events
            // that arrive during a cooldown are neither counted nor able to trigger
            // the severity-escalation path in shouldSendAlert.
            if (isInCooldown(alertKey)) {
                log.debug("Alert in cooldown period: {}", alertKey);
                return;
            }
            
            // Update the alert state
            AlertState currentState = alertStates.get(alertKey);
            AlertState newState = calculateNewState(currentState, alert);
            alertStates.put(alertKey, newState);
            
            // Decide whether a notification is due
            if (shouldSendAlert(currentState, newState)) {
                sendAlert(alert, newState);
                setCooldown(alertKey, alert.getSeverity());
            }
            
            // Update statistics
            updateAlertStatistics(alert);
            
        } catch (Exception e) {
            log.error("Error processing alert: {}", alert, e);
        }
    }
    
    /**
     * Sends an alert.
     */
    private void sendAlert(AlertEvent alert, AlertState state) {
        // Pick the channels that support this severity
        List<AlertChannel> channels = selectAlertChannels(alert.getSeverity());
        
        for (AlertChannel channel : channels) {
            try {
                channel.sendAlert(alert, state);
                log.info("Alert sent via {}: {}", channel.getChannelName(), alert.getType());
            } catch (Exception e) {
                log.error("Failed to send alert via {}: {}", channel.getChannelName(), e.getMessage());
            }
        }
    }
    
    /**
     * Selects the alert channels for a severity.
     */
    private List<AlertChannel> selectAlertChannels(AlertSeverity severity) {
        return alertChannels.stream()
            .filter(channel -> channel.supportsLevel(severity))
            .collect(java.util.stream.Collectors.toList());
    }
    
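    /**
     * Builds the alert key.
     * This assumes the AlertEvent used in this package exposes getSource();
     * the AlertEvent shown with the log analyzer has no such field.
     */
    private String generateAlertKey(AlertEvent alert) {
        return String.format("%s:%s:%s", alert.getType(), alert.getSubType(), 
            alert.getSource() != null ? alert.getSource() : "unknown");
    }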
    
    /**
     * Checks whether the alert key is in its cooldown period.
     */
    private boolean isInCooldown(String alertKey) {
        Long cooldownEnd = alertCooldowns.get(alertKey);
        return cooldownEnd != null && System.currentTimeMillis() < cooldownEnd;
    }
    
    /**
     * Starts the cooldown period for an alert key.
     */
    private void setCooldown(String alertKey, AlertSeverity severity) {
        long cooldownDuration = getCooldownDuration(severity);
        alertCooldowns.put(alertKey, System.currentTimeMillis() + cooldownDuration);
    }
    
    /**
     * Returns the cooldown duration in milliseconds.
     */
    private long getCooldownDuration(AlertSeverity severity) {
        return switch (severity) {
            case CRITICAL -> TimeUnit.MINUTES.toMillis(5);  // 5 minutes
            case HIGH -> TimeUnit.MINUTES.toMillis(15);     // 15 minutes
            case MEDIUM -> TimeUnit.MINUTES.toMillis(30);   // 30 minutes
            case LOW -> TimeUnit.HOURS.toMillis(1);         // 1 hour
        };
    }
    
    /**
     * Computes the new alert state.
     */
    private AlertState calculateNewState(AlertState currentState, AlertEvent alert) {
        if (currentState == null) {
            return AlertState.builder()
                .alertKey(generateAlertKey(alert))
                .firstOccurrence(alert.getTimestamp())
                .lastOccurrence(alert.getTimestamp())
                .occurrenceCount(1)
                .currentSeverity(alert.getSeverity())
                .status(AlertStatus.ACTIVE)
                .build();
        }
        
        return currentState.toBuilder()
            .lastOccurrence(alert.getTimestamp())
            .occurrenceCount(currentState.getOccurrenceCount() + 1)
            .currentSeverity(alert.getSeverity())
            .build();
    }
    
    /**
     * Decides whether a notification should be sent.
     */
    private boolean shouldSendAlert(AlertState currentState, AlertState newState) {
        if (currentState == null) {
            return true; // First occurrence
        }
        
        // Send when the severity escalates (enum order runs LOW to CRITICAL)
        if (newState.getCurrentSeverity().ordinal() > currentState.getCurrentSeverity().ordinal()) {
            return true;
        }
        
        // Send when the repeat threshold is reached
        int repeatThreshold = getRepeatThreshold(newState.getCurrentSeverity());
        return newState.getOccurrenceCount() % repeatThreshold == 0;
    }
    
    /**
     * Returns the repeat threshold for a severity.
     */
    private int getRepeatThreshold(AlertSeverity severity) {
        return switch (severity) {
            case CRITICAL -> 1;   // Send every time
            case HIGH -> 3;       // Send every 3rd occurrence
            case MEDIUM -> 5;     // Send every 5th occurrence
            case LOW -> 10;       // Send every 10th occurrence
        };
    }
    
    /**
     * Updates the per-type, per-severity alert counters.
     */
    private void updateAlertStatistics(AlertEvent alert) {
        String counterKey = alert.getType() + ":" + alert.getSeverity();
        alertCounters.computeIfAbsent(counterKey, 
            k -> new java.util.concurrent.atomic.AtomicLong(0)).incrementAndGet();
    }
    
    /**
     * Checks alert states and auto-resolves alerts that have been quiet for too long.
     */
    private void checkAlertStates() {
        long now = System.currentTimeMillis();
        long autoResolveThreshold = TimeUnit.HOURS.toMillis(2); // auto-resolve after 2 hours
        
        alertStates.entrySet().removeIf(entry -> {
            AlertState state = entry.getValue();
            if (state.getStatus() == AlertStatus.ACTIVE && 
                now - state.getLastOccurrence() > autoResolveThreshold) {
                
                // Send a recovery notification before dropping the state
                sendRecoveryNotification(state);
                return true;
            }
            return false;
        });
        
        // Purge expired cooldown entries
        alertCooldowns.entrySet().removeIf(entry -> entry.getValue() < now);
    }
    
    /**
     * Sends a recovery notification for an auto-resolved alert.
     */
    private void sendRecoveryNotification(AlertState state) {
        AlertEvent recoveryAlert = AlertEvent.builder()
            .type("RECOVERY")
            .subType(state.getAlertKey())
            .message("Alert automatically resolved")
            .timestamp(System.currentTimeMillis())
            .severity(AlertSeverity.LOW)
            .build();
            
        sendAlert(recoveryAlert, state);
    }
    
    /**
     * Generates an alert report and writes it to the log.
     */
    private void generateAlertReport() {
        AlertReport report = AlertReport.builder()
            .reportTime(System.currentTimeMillis())
            .activeAlerts(alertStates.size())
            .alertCounters(new ConcurrentHashMap<>(alertCounters))
            .topAlertTypes(getTopAlertTypes())
            .build();
            
        log.info("Alert report generated: {}", report);
    }
    
    /**
     * Returns the five most frequent alert types (keyed as type:severity).
     */
    private List<String> getTopAlertTypes() {
        return alertCounters.entrySet().stream()
            .sorted((e1, e2) -> Long.compare(e2.getValue().get(), e1.getValue().get()))
            .limit(5)
            .map(java.util.Map.Entry::getKey)
            .collect(java.util.stream.Collectors.toList());
    }
    
    /**
     * Returns alert statistics.
     */
    public AlertStatistics getAlertStatistics() {
        return AlertStatistics.builder()
            .activeAlerts(alertStates.size())
            .totalAlerts(alertCounters.values().stream()
                .mapToLong(java.util.concurrent.atomic.AtomicLong::get)
                .sum())
            .alertsByType(new ConcurrentHashMap<>(alertCounters))
            .cooldownAlerts(alertCooldowns.size())
            .build();
    }
}

/**
 * Alert state.
 */
@lombok.Data
@lombok.Builder(toBuilder = true)
class AlertState {
    private String alertKey;
    private long firstOccurrence;
    private long lastOccurrence;
    private int occurrenceCount;
    private AlertSeverity currentSeverity;
    private AlertStatus status;
}

/**
 * Alert status enum.
 */
enum AlertStatus {
    ACTIVE, RESOLVED, SUPPRESSED
}

/**
 * Alert report.
 */
@lombok.Data
@lombok.Builder
class AlertReport {
    private long reportTime;
    private int activeAlerts;
    private ConcurrentHashMap<String, java.util.concurrent.atomic.AtomicLong> alertCounters;
    private List<String> topAlertTypes;
}

/**
 * Alert statistics.
 */
@lombok.Data
@lombok.Builder
class AlertStatistics {
    private int activeAlerts;
    private long totalAlerts;
    private ConcurrentHashMap<String, java.util.concurrent.atomic.AtomicLong> alertsByType;
    private int cooldownAlerts;
}
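
The cooldown and repeat-threshold rules in the listing above can be tried out in isolation. What follows is a minimal, JDK-only sketch rather than the actual SmartAlertManager logic: `AlertThrottle`, `Severity`, and `shouldSend` are hypothetical names, the enum is assumed to be declared from least to most severe, and the way the cooldown is combined with the repeat threshold (cooldown checked first, suppressed occurrences not counted) is an assumption, since that part of the manager is not shown here. The clock is passed in as a parameter so the behavior is easy to test. The sketch is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for AlertSeverity, declared from least to most severe
// so that a larger ordinal means an escalation (an assumption of this sketch).
enum Severity { LOW, MEDIUM, HIGH, CRITICAL }

class AlertThrottle {
    // Occurrence count and last seen severity for one alert key.
    private record State(int count, Severity severity) {}

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Long> cooldownUntil = new HashMap<>();

    // Same durations as getCooldownDuration in the listing above.
    static long cooldownMillis(Severity s) {
        return switch (s) {
            case CRITICAL -> TimeUnit.MINUTES.toMillis(5);
            case HIGH -> TimeUnit.MINUTES.toMillis(15);
            case MEDIUM -> TimeUnit.MINUTES.toMillis(30);
            case LOW -> TimeUnit.HOURS.toMillis(1);
        };
    }

    // Same thresholds as getRepeatThreshold in the listing above.
    static int repeatThreshold(Severity s) {
        return switch (s) {
            case CRITICAL -> 1;
            case HIGH -> 3;
            case MEDIUM -> 5;
            case LOW -> 10;
        };
    }

    /** Returns true if a notification should go out for this occurrence. */
    boolean shouldSend(String key, Severity severity, long nowMillis) {
        Long until = cooldownUntil.get(key);
        if (until != null && nowMillis < until) {
            return false; // still cooling down: suppress and do not count
        }
        State previous = states.get(key);
        int count = (previous == null ? 0 : previous.count()) + 1;
        boolean send = previous == null                               // first occurrence
            || severity.ordinal() > previous.severity().ordinal()    // escalation
            || count % repeatThreshold(severity) == 0;               // every Nth repeat
        states.put(key, new State(count, severity));
        if (send) {
            cooldownUntil.put(key, nowMillis + cooldownMillis(severity));
        }
        return send;
    }
}
```

With this sketch, a HIGH alert is sent on its first occurrence, suppressed for the next 15 minutes, and after that sent again only on every third counted occurrence or when the severity rises to CRITICAL.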

📈 Monitoring Dashboard and Reports

Monitoring dashboard data provider

package io.github.nemoob.atlas.mapper.example.dashboard;

import org.springframework.stereotype.Service;
import org.springframework.beans.factory.annotation.Autowired;
import lombok.extern.slf4j.Slf4j;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Data provider for the monitoring dashboard.
 */
@Service
@Slf4j
public class MonitoringDashboardService {
    
    @Autowired
    private AtlasMapperMetrics mapperMetrics;
    
    @Autowired
    private SmartAlertManager alertManager;
    
    @Autowired
    private AtlasMapperLogAnalyzer logAnalyzer;
    
    /**
     * Returns a snapshot of real-time monitoring data.
     */
    public DashboardData getRealTimeData() {
        MappingStatistics mappingStats = mapperMetrics.getStatistics();
        AlertStatistics alertStats = alertManager.getAlertStatistics();
        LogAnalysisStatistics logStats = logAnalyzer.getStatistics();
        
        return DashboardData.builder()
            .timestamp(LocalDateTime.now().format(DateTimeFormatter.ISO_LOCAL_DATE_TIME))
            .mappingStatistics(mappingStats)
            .alertStatistics(alertStats)
            .logStatistics(logStats)
            .systemHealth(calculateSystemHealth(mappingStats, alertStats, logStats))
            .performanceMetrics(getPerformanceMetrics())
            .build();
    }
    
    /**
     * Computes the system health score (0 to 100) and the matching health level.
     */
    private SystemHealth calculateSystemHealth(
            MappingStatistics mappingStats,
            AlertStatistics alertStats,
            LogAnalysisStatistics logStats) {
        
        double healthScore = 100.0;
        
        // Deduct for the mapping error rate
        if (mappingStats.getErrorRate() > 0.05) { // above 5%
            healthScore -= 30;
        } else if (mappingStats.getErrorRate() > 0.01) { // above 1%
            healthScore -= 10;
        }
        
        // Deduct for the number of active alerts
        if (alertStats.getActiveAlerts() > 10) {
            healthScore -= 20;
        } else if (alertStats.getActiveAlerts() > 5) {
            healthScore -= 10;
        }
        
        // Deduct for the share of error-level log entries
        if (logStats.getErrorRate() > 0.1) { // above 10%
            healthScore -= 20;
        } else if (logStats.getErrorRate() > 0.05) { // above 5%
            healthScore -= 10;
        }
        
        // Deduct for absolute memory usage in GiB. The floating-point divisor avoids
        // integer division, which would truncate the value to whole gigabytes.
        double memoryUsage = mappingStats.getCurrentMemoryUsage() / (1024.0 * 1024 * 1024);
        if (memoryUsage > 2.0) {
            healthScore -= 15;
        } else if (memoryUsage > 1.0) {
            healthScore -= 5;
        }
        
        healthScore = Math.max(0, healthScore);
        
        HealthLevel level;
        if (healthScore >= 90) {
            level = HealthLevel.EXCELLENT;
        } else if (healthScore >= 70) {
            level = HealthLevel.GOOD;
        } else if (healthScore >= 50) {
            level = HealthLevel.WARNING;
        } else {
            level = HealthLevel.CRITICAL;
        }
        
        return SystemHealth.builder()
            .score(healthScore)
            .level(level)
            .lastUpdated(System.currentTimeMillis())
            .build();
    }
    
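    /**
     * Collects JVM-level performance metrics.
     */
    private PerformanceMetrics getPerformanceMetrics() {
        Runtime runtime = Runtime.getRuntime();
        
        return PerformanceMetrics.builder()
            .cpuUsage(getCurrentCpuUsage())
            .memoryUsage(runtime.totalMemory() - runtime.freeMemory())
            .maxMemory(runtime.maxMemory())
            // Live threads in the whole JVM; Thread.activeCount() only estimates
            // the current thread group and its subgroups
            .threadCount(java.lang.management.ManagementFactory.getThreadMXBean().getThreadCount())
            .gcCount(getGcCount())
            .gcTime(getGcTime())
            .build();
    }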
    
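    private double getCurrentCpuUsage() {
        java.lang.management.OperatingSystemMXBean osBean = 
            java.lang.management.ManagementFactory.getOperatingSystemMXBean();
        
        if (osBean instanceof com.sun.management.OperatingSystemMXBean sunOsBean) {
            // getProcessCpuLoad() returns a value in [0.0, 1.0], or a negative value
            // when the load is not available; report 0 in that case
            return Math.max(0.0, sunOsBean.getProcessCpuLoad()) * 100;
        }
        return 0.0;
    }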
    
    private long getGcCount() {
        return java.lang.management.ManagementFactory.getGarbageCollectorMXBeans()
            .stream()
            .mapToLong(java.lang.management.GarbageCollectorMXBean::getCollectionCount)
            .filter(count -> count >= 0) // -1 means the collector does not report a count
            .sum();
    }
    
    private long getGcTime() {
        return java.lang.management.ManagementFactory.getGarbageCollectorMXBeans()
            .stream()
            .mapToLong(java.lang.management.GarbageCollectorMXBean::getCollectionTime)
            .filter(time -> time >= 0) // -1 means the collector does not report a time
            .sum();
    }
    
    /**
     * Returns historical trend data, one point per hour.
     */
    public List<TrendData> getTrendData(String metric, int hours) {
        // Placeholder: a real implementation should query a time-series database
        List<TrendData> trendData = new ArrayList<>();
        
        LocalDateTime now = LocalDateTime.now();
        for (int i = hours; i >= 0; i--) {
            LocalDateTime time = now.minusHours(i);
            double value = generateMockTrendValue(metric, i);
            
            trendData.add(TrendData.builder()
                .timestamp(time.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME))
                .value(value)
                .build());
        }
        
        return trendData;
    }
    
    private double generateMockTrendValue(String metric, int hoursAgo) {
        // Generates mock trend data around a fixed base value per metric
        double base = switch (metric) {
            case "error_rate" -> 0.02;
            case "response_time" -> 150.0;
            case "throughput" -> 1000.0;
            case "memory_usage" -> 512.0;
            default -> 50.0;
        };
        
        // Add a random fluctuation of up to ±10% of the base value
        double variation = (Math.random() - 0.5) * 0.2;
        return base * (1 + variation);
    }
}

/**
 * Dashboard data.
 */
@lombok.Data
@lombok.Builder
class DashboardData {
    private String timestamp;
    private MappingStatistics mappingStatistics;
    private AlertStatistics alertStatistics;
    private LogAnalysisStatistics logStatistics;
    private SystemHealth systemHealth;
    private PerformanceMetrics performanceMetrics;
}

/**
 * System health.
 */
@lombok.Data
@lombok.Builder
class SystemHealth {
    private double score;
    private HealthLevel level;
    private long lastUpdated;
}

/**
 * Health level, with a display label and a color for the dashboard.
 */
enum HealthLevel {
    EXCELLENT("Excellent", "#00C851"),
    GOOD("Good", "#ffbb33"),
    WARNING("Warning", "#FF8800"),
    CRITICAL("Critical", "#FF4444");
    
    private final String description;
    private final String color;
    
    HealthLevel(String description, String color) {
        this.description = description;
        this.color = color;
    }
    
    public String getDescription() { return description; }
    public String getColor() { return color; }
}

/**
 * Performance metrics.
 */
@lombok.Data
@lombok.Builder
class PerformanceMetrics {
    private double cpuUsage;
    private long memoryUsage;
    private long maxMemory;
    private int threadCount;
    private long gcCount;
    private long gcTime;
    
    public double getMemoryUtilization() {
        return maxMemory > 0 ? (double) memoryUsage / maxMemory * 100 : 0.0;
    }
}

/**
 * Trend data point.
 */
@lombok.Data
@lombok.Builder
class TrendData {
    private String timestamp;
    private double value;
}
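
The comment in `getTrendData` notes that a real implementation should read from a time-series database, and the architecture above stores metrics in Prometheus. As one possible starting point, the sketch below only builds the request URI for the Prometheus HTTP API range-query endpoint (`/api/v1/query_range`, with its `query`, `start`, `end`, and `step` parameters). The class name `PrometheusRangeQuery`, the base URL, and the metric name `atlas_mapper_errors_total` are assumptions made for illustration; sending the request and parsing the JSON response are left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: builds a Prometheus range-query URI with one sample per hour.
class PrometheusRangeQuery {

    static URI build(String baseUrl, String promQl, Instant end, int hours) {
        Instant start = end.minus(Duration.ofHours(hours));
        // PromQL contains characters such as '(' and '[' that must be percent-encoded
        String query = URLEncoder.encode(promQl, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/v1/query_range"
            + "?query=" + query
            + "&start=" + start.getEpochSecond()   // Unix timestamp in seconds
            + "&end=" + end.getEpochSecond()
            + "&step=1h");                         // resolution: one point per hour
    }

    public static void main(String[] args) {
        // Base URL and metric name are placeholders, not taken from the original project
        URI uri = build("http://localhost:9090",
            "rate(atlas_mapper_errors_total[5m])", Instant.now(), 24);
        System.out.println(uri);
    }
}
```

The resulting URI can be passed to `java.net.http.HttpClient`, and each returned sample mapped to a `TrendData` entry in place of the mock values.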

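🎯 Monitoring System Summary

Core monitoring metrics

  1. Application performance metrics

    • Response time of mapping operations
    • Throughput of mapping operations
    • Error rate and success rate
    • Memory usage
  2. System resource metrics

    • CPU usage
    • Memory utilization
    • Thread pool status
    • GC performance
  3. Business metrics

    • Distribution of mapping types
    • Cache hit rate
    • Batch operation efficiency
    • User operation statistics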
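
Several of the metrics listed above (error rate, throughput, cache hit rate) are ratios derived from raw counters. The sketch below shows one way to derive them using only the JDK; `MappingCounters` and its methods are hypothetical names for illustration and are not the counters used by `AtlasMapperMetrics`.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counter holder; LongAdder keeps increments cheap under contention.
class MappingCounters {
    private final LongAdder success = new LongAdder();
    private final LongAdder failure = new LongAdder();
    private final LongAdder cacheHits = new LongAdder();
    private final LongAdder cacheMisses = new LongAdder();

    void recordMapping(boolean ok) {
        (ok ? success : failure).increment();
    }

    void recordCacheLookup(boolean hit) {
        (hit ? cacheHits : cacheMisses).increment();
    }

    /** Failed mappings divided by all mappings; 0 when nothing was recorded. */
    double errorRate() {
        long failed = failure.sum();
        long total = success.sum() + failed;
        return total == 0 ? 0.0 : (double) failed / total;
    }

    /** Cache hits divided by all cache lookups; 0 when nothing was recorded. */
    double cacheHitRate() {
        long hits = cacheHits.sum();
        long total = hits + cacheMisses.sum();
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** Mappings per second over the given observation window. */
    double throughputPerSecond(long elapsedMillis) {
        long total = success.sum() + failure.sum();
        return elapsedMillis <= 0 ? 0.0 : total * 1000.0 / elapsedMillis;
    }
}
```

Guarding the zero-total case matters in practice: a freshly started service should report a rate of 0 rather than NaN, which would otherwise propagate into dashboards and alert rules.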

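Alerting strategy

  1. Threshold alerts

    • Response time > 1 second
    • Error rate > 5%
    • Memory usage > 85%
    • CPU usage > 80%
  2. Trend alerts

    • Error rate rising continuously
    • Response time degrading continuously
    • Memory leak detection
    • Downward performance trend
  3. Anomaly alerts

    • System exceptions
    • Connection timeouts
    • Database errors
    • Third-party service failures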
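
Threshold alerts compare a single value against a limit, whereas trend alerts have to look at a series of samples. One simple way to detect a metric that keeps rising is to fit a least-squares line through the most recent samples and alert when its slope exceeds a limit. The sketch below illustrates that idea; `TrendRule`, `slope`, and `isRising` are hypothetical names, the samples are assumed to be taken at equal intervals, and the choice of linear regression is an illustration, not the detection method of the original system.

```java
// Hypothetical trend rule based on the least-squares slope of recent samples.
class TrendRule {

    /** Least-squares slope of samples taken at equal intervals (x = 0, 1, 2, ...). */
    static double slope(double[] samples) {
        int n = samples.length;
        if (n < 2) {
            return 0.0; // a slope needs at least two points
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (double v : samples) {
            meanY += v;
        }
        meanY /= n;

        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (samples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /** True when at least three samples rise faster than minSlopePerSample on average. */
    static boolean isRising(double[] samples, double minSlopePerSample) {
        return samples.length >= 3 && slope(samples) > minSlopePerSample;
    }
}
```

Compared with requiring every sample to exceed the previous one, a fitted slope tolerates a single noisy sample: a one-off spike in an otherwise flat series yields a slope near zero and does not trigger the rule.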

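Best practices

  1. Monitoring design principles

    • Cover all key metrics
    • Set alert thresholds sensibly
    • Avoid alert storms
    • Make alert messages actionable
  2. Performance optimization

    • Optimize based on monitoring data
    • Establish a performance baseline
    • Monitor and improve continuously
    • Review performance regularly
  3. Operations management

    • Set up a monitoring dashboard
    • Define an incident response process
    • Produce regular monitoring reports
    • Plan and forecast capacity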