3 Scenarios Where Kafka Loses Messages: Beware in Production
At 2 a.m., an urgent phone call woke me up. An ops colleague said anxiously: "The order system is in serious trouble! Users' payments went through, but the order statuses didn't update. We already have hundreds of complaints!"
It was the worst Kafka message-loss incident of my career. After three hours of emergency troubleshooting, we traced it to message loss caused by a misconfigured Kafka setup. That night taught me, in the most painful way, what "details decide success or failure" really means.
In this post I'll dissect the three typical scenarios in which Kafka loses messages. These are hard-won lessons from production; I hope they help you steer clear of the same pitfalls!
A Hard Lesson: Full Postmortem of a Production Incident
Let me start by replaying the incident from beginning to end:
```java
// The Producer code at the scene of the incident
@Service
public class OrderPaymentProducer {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    // ✗ Problem code: no reliability guarantees whatsoever
    public void publishPaymentEvent(PaymentEvent event) {
        try {
            String message = JSON.toJSONString(event);
            kafkaTemplate.send("payment-topic", event.getOrderId(), message);
            // Fatal mistake: returns without waiting for the send result!
            log.info("Payment event sent: {}", event.getOrderId());
        } catch (Exception e) {
            log.error("Failed to send payment event", e);
            // Even more fatal: the exception is swallowed, so the caller
            // never learns that the send failed!
        }
    }
}

// The calling code
@Service
public class PaymentService {

    @Autowired
    private OrderPaymentProducer paymentProducer;

    public void processPayment(String orderId, BigDecimal amount) {
        // Payment processing logic
        PaymentResult result = paymentGateway.charge(orderId, amount);
        if (result.isSuccess()) {
            // Payment succeeded, publish the event
            PaymentEvent event = new PaymentEvent(orderId, amount, PaymentStatus.SUCCESS);
            paymentProducer.publishPaymentEvent(event); // assumes the message was sent
            log.info("Order {} paid successfully, amount: {}", orderId, amount);
            // Continue with follow-up processing...
        }
    }
}
```
Incident analysis:
- The Kafka cluster suffered a brief bout of network jitter that night
- The Producer was configured with `acks=0`, so sends returned immediately without acknowledgment
- Large numbers of messages were lost during the jitter, and the application layer was completely unaware
- Payments succeeded, but the order-status-update messages were lost, leaving order states inconsistent
The incident cost us hundreds of thousands in orders, and worse, it damaged user trust. Ever since, I have treated Kafka's reliability configuration with the utmost care.
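For contrast, here is a minimal sketch of what a loss-aware version of that send path could look like. The topic name matches the incident code; the 10-second timeout and the exception type are illustrative choices of mine, and the sections below develop the full approach:

```java
@Service
public class OrderPaymentProducer {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    /**
     * Returns only after the broker has acknowledged the write, and
     * propagates failures to the caller instead of swallowing them.
     */
    public void publishPaymentEvent(PaymentEvent event) {
        String message = JSON.toJSONString(event);
        try {
            // Block until the send completes (or times out) so the caller
            // knows whether the event actually reached Kafka
            kafkaTemplate.send("payment-topic", event.getOrderId(), message)
                    .get(10, TimeUnit.SECONDS);
            log.info("Payment event published: {}", event.getOrderId());
        } catch (Exception e) {
            // Surface the failure -- the caller decides how to compensate
            throw new IllegalStateException(
                    "Failed to publish payment event for order " + event.getOrderId(), e);
        }
    }
}
```

With this shape, `PaymentService` either knows the event is durable in Kafka or sees an exception it must handle, instead of logging success unconditionally.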
Scenario 1: Message Loss on the Producer Side
Producer-side message loss is the most common scenario, and it usually comes down to one of the following causes:
1.1 Misconfigured acks
```java
@Configuration
public class KafkaProducerConfig {

    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // ✗ Dangerous: acks=0, the send returns immediately without waiting for any acknowledgment
        props.put(ProducerConfig.ACKS_CONFIG, "0");
        // ✗ Dangerous: retries=0, failed sends are never retried
        props.put(ProducerConfig.RETRIES_CONFIG, 0);
        return new DefaultKafkaProducerFactory<>(props);
    }
}
```
The acks parameter in detail:
```java
public class AcksConfigExplained {

    /**
     * acks=0: fire-and-forget mode
     * - The producer returns immediately after sending, without waiting for any acknowledgment
     * - Highest throughput, but the greatest risk of message loss
     * - Suitable for: log collection, monitoring metrics, and other loss-tolerant workloads
     */
    public void acksZeroExample() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "0");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // The message can vanish in transit and the producer will never know
        producer.send(new ProducerRecord<>("test-topic", "key", "message"));
    }

    /**
     * acks=1: leader-acknowledgment mode (the default before Kafka 3.0)
     * - The producer waits for the leader replica to confirm the write
     * - If the leader dies after acknowledging but before followers sync, the message is still lost
     * - A balance between throughput and durability
     */
    public void acksOneExample() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Returns once the leader acknowledges; followers may not have caught up yet
        Future<RecordMetadata> future = producer.send(
                new ProducerRecord<>("test-topic", "key", "message"));
        try {
            RecordMetadata metadata = future.get();
            log.info("Message sent: partition={}, offset={}",
                    metadata.partition(), metadata.offset());
        } catch (Exception e) {
            log.error("Send failed", e);
        }
    }

    /**
     * acks=all (-1): all-replica acknowledgment mode
     * - The producer waits for all ISR (In-Sync Replicas) to confirm the write
     * - Highest durability, at some cost in throughput
     * - Strongly recommended for production
     */
    public void acksAllExample() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Must be combined with min.insync.replicas on the broker/topic
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1); // preserve message order
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Waits for every ISR replica to acknowledge: maximum durability
        producer.send(new ProducerRecord<>("test-topic", "key", "message"),
                (metadata, exception) -> {
                    if (exception != null) {
                        log.error("Send failed", exception);
                        // Retry or compensation logic can go here
                    } else {
                        log.info("Message sent: partition={}, offset={}",
                                metadata.partition(), metadata.offset());
                    }
                });
    }
}
```
1.2 Send Timeouts and the Retry Mechanism
```java
@Component
public class ReliableKafkaProducer {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final RedisTemplate<String, String> redisTemplate;
    // Shared scheduler for async retries (creating one per retry would leak threads)
    private final ScheduledExecutorService retryScheduler =
            Executors.newSingleThreadScheduledExecutor();

    public ReliableKafkaProducer(KafkaTemplate<String, String> kafkaTemplate,
                                 RedisTemplate<String, String> redisTemplate) {
        this.kafkaTemplate = kafkaTemplate;
        this.redisTemplate = redisTemplate;
    }

    /**
     * Reliable message sending
     */
    public boolean sendMessageReliably(String topic, String key, String message) {
        return sendMessageReliably(topic, key, message, 3);
    }

    public boolean sendMessageReliably(String topic, String key, String message, int maxRetries) {
        String messageId = generateMessageId(key, message);
        // 1. Check whether the message was already sent successfully (deduplication)
        if (isMessageAlreadySent(messageId)) {
            log.info("Message already sent, skipping: messageId={}", messageId);
            return true;
        }
        // 2. Record the send attempt
        recordMessageAttempt(messageId, topic, key, message);
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                // 3. Send synchronously (so we are guaranteed to see the result)
                ListenableFuture<SendResult<String, String>> future =
                        kafkaTemplate.send(topic, key, message);
                // 4. Bound the wait with a timeout
                SendResult<String, String> result = future.get(10, TimeUnit.SECONDS);
                // 5. Success: record it
                recordMessageSuccess(messageId, result);
                log.info("Message sent: messageId={}, attempt={}, partition={}, offset={}",
                        messageId, attempt, result.getRecordMetadata().partition(),
                        result.getRecordMetadata().offset());
                return true;
            } catch (TimeoutException e) {
                log.warn("Send timed out: messageId={}, attempt={}", messageId, attempt);
                if (attempt == maxRetries) {
                    recordMessageFailure(messageId, "send timeout", e);
                    return false;
                }
            } catch (Exception e) {
                log.error("Send failed: messageId={}, attempt={}", messageId, attempt, e);
                if (isRetriableException(e) && attempt < maxRetries) {
                    // Retriable exception: back off before the next attempt
                    try {
                        Thread.sleep(1000L * attempt); // linearly increasing backoff
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                } else {
                    // Non-retriable exception, or retries exhausted
                    recordMessageFailure(messageId, e.getMessage(), e);
                    return false;
                }
            }
        }
        recordMessageFailure(messageId, "retries exhausted", null);
        return false;
    }

    /**
     * Async send with callback handling
     */
    public void sendMessageAsync(String topic, String key, String message,
                                 MessageSendCallback callback) {
        String messageId = generateMessageId(key, message);
        if (isMessageAlreadySent(messageId)) {
            callback.onSuccess(messageId);
            return;
        }
        recordMessageAttempt(messageId, topic, key, message);
        ListenableFuture<SendResult<String, String>> future =
                kafkaTemplate.send(topic, key, message);
        future.addCallback(new ListenableFutureCallback<SendResult<String, String>>() {
            @Override
            public void onSuccess(SendResult<String, String> result) {
                recordMessageSuccess(messageId, result);
                log.info("Async send succeeded: messageId={}", messageId);
                callback.onSuccess(messageId);
            }

            @Override
            public void onFailure(Throwable ex) {
                recordMessageFailure(messageId, ex.getMessage(), ex);
                log.error("Async send failed: messageId={}", messageId, ex);
                if (isRetriableException(ex)) {
                    // Retry asynchronously
                    retryMessageAsync(topic, key, message, messageId, 1, 3, callback);
                } else {
                    callback.onFailure(messageId, ex);
                }
            }
        });
    }

    private void retryMessageAsync(String topic, String key, String message,
                                   String messageId, int attempt, int maxRetries,
                                   MessageSendCallback callback) {
        if (attempt > maxRetries) {
            callback.onFailure(messageId, new RuntimeException("retries exhausted"));
            return;
        }
        // Delayed retry, with a linearly increasing delay
        retryScheduler.schedule(() -> {
            ListenableFuture<SendResult<String, String>> future =
                    kafkaTemplate.send(topic, key, message);
            future.addCallback(
                    result -> {
                        recordMessageSuccess(messageId, result);
                        callback.onSuccess(messageId);
                    },
                    ex -> retryMessageAsync(topic, key, message, messageId,
                            attempt + 1, maxRetries, callback)
            );
        }, 1000L * attempt, TimeUnit.MILLISECONDS);
    }

    // Helpers
    private String generateMessageId(String key, String message) {
        return DigestUtils.md5DigestAsHex((key + message).getBytes());
    }

    private boolean isMessageAlreadySent(String messageId) {
        return redisTemplate.hasKey("kafka:sent:" + messageId);
    }

    private void recordMessageAttempt(String messageId, String topic, String key, String message) {
        // Store in Redis for deduplication and monitoring
        Map<String, String> info = new HashMap<>();
        info.put("topic", topic);
        info.put("key", key);
        info.put("message", message);
        info.put("status", "ATTEMPTING");
        info.put("timestamp", String.valueOf(System.currentTimeMillis()));
        redisTemplate.opsForHash().putAll("kafka:attempt:" + messageId, info);
        redisTemplate.expire("kafka:attempt:" + messageId, 1, TimeUnit.HOURS);
    }

    private void recordMessageSuccess(String messageId, SendResult<String, String> result) {
        // Mark the message as successfully sent
        redisTemplate.opsForValue().set("kafka:sent:" + messageId, "SUCCESS", 24, TimeUnit.HOURS);
        // Clean up the attempt record
        redisTemplate.delete("kafka:attempt:" + messageId);
        // Record the success metric
        Metrics.counter("kafka.producer.success", "topic", result.getRecordMetadata().topic()).increment();
    }

    private void recordMessageFailure(String messageId, String reason, Throwable ex) {
        Map<String, String> failureInfo = new HashMap<>();
        failureInfo.put("reason", reason);
        failureInfo.put("timestamp", String.valueOf(System.currentTimeMillis()));
        if (ex != null) {
            failureInfo.put("exception", ex.getClass().getSimpleName());
        }
        redisTemplate.opsForHash().putAll("kafka:failed:" + messageId, failureInfo);
        redisTemplate.expire("kafka:failed:" + messageId, 24, TimeUnit.HOURS);
        // Record the failure metric
        Metrics.counter("kafka.producer.failure", "reason", reason).increment();
    }

    private boolean isRetriableException(Throwable ex) {
        return ex instanceof RetriableException ||
               ex instanceof TimeoutException ||
               ex instanceof org.apache.kafka.common.errors.TimeoutException ||
               ex instanceof org.apache.kafka.common.errors.NotEnoughReplicasException;
    }

    // Callback interface
    public interface MessageSendCallback {
        void onSuccess(String messageId);
        void onFailure(String messageId, Throwable ex);
    }
}
```
1.3 Producer Configuration Best Practices
```java
@Configuration
public class OptimalProducerConfig {

    @Bean
    public ProducerFactory<String, String> reliableProducerFactory() {
        Map<String, Object> props = new HashMap<>();
        // === Basics ===
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // === Reliability ===
        // 1. Acknowledgment: wait for all ISR replicas
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // 2. Retries: effectively unlimited (bounded by delivery.timeout.ms)
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // 3. Idempotence: prevents duplicates caused by retries
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // 4. Ordering: at most one in-flight request per connection
        //    (with idempotence enabled, values up to 5 also preserve order; 1 is the most conservative)
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
        // === Throughput ===
        // 5. Batching
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // 16KB
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10); // wait up to 10ms to accumulate a batch
        // 6. Compression: fewer bytes on the wire
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        // 7. Buffer size
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432); // 32MB
        // === Timeouts ===
        // 8. Request and delivery timeouts
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000); // 30 seconds
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000); // 2 minutes
        // 9. Idle connection timeout
        props.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 540000); // 9 minutes
        return new DefaultKafkaProducerFactory<>(props);
    }

    @Bean
    public KafkaTemplate<String, String> reliableKafkaTemplate() {
        return new KafkaTemplate<>(reliableProducerFactory());
    }
}
```
Scenario 2: Message Loss on the Broker Side
Broker-side message loss mainly happens in the following situations:
2.1 Misconfigured Replica Synchronization
```bash
# Kafka broker configuration (server.properties)

# ✗ Dangerous: not enough replicas
default.replication.factor=1   # a single replica is a single point of failure

# ✗ Dangerous: ISR minimum too low
min.insync.replicas=1          # one acknowledging replica is enough -- data can be lost

# ✓ Recommended: enough replicas
default.replication.factor=3   # 3 replicas

# ✓ Recommended: a sensible ISR minimum
min.insync.replicas=2          # at least 2 replicas must acknowledge
```
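These broker-level defaults only apply to newly created topics, and both settings can also be set per topic. As a sketch (the broker address, topic name, and partition count are placeholders of mine), creating a topic with these reliability settings via `AdminClient` might look like this:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ReliableTopicCreator {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3, plus a per-topic ISR minimum
            NewTopic topic = new NewTopic("payment-topic", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```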
How replica synchronization works:
```java
@Component
public class KafkaReplicationMonitor {

    private final AdminClient adminClient;
    private final AlertService alertService;

    public KafkaReplicationMonitor(AlertService alertService) {
        this.alertService = alertService;
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        this.adminClient = AdminClient.create(props);
    }

    /**
     * Check the replica status of a topic
     */
    public void checkReplicationStatus(String topicName) throws Exception {
        DescribeTopicsResult result = adminClient.describeTopics(Collections.singletonList(topicName));
        TopicDescription description = result.values().get(topicName).get();
        for (TopicPartitionInfo partition : description.partitions()) {
            Node leader = partition.leader();
            List<Node> replicas = partition.replicas();
            List<Node> isr = partition.isr(); // In-Sync Replicas
            log.info("Partition {}: leader={}, replicas={}, ISR={}",
                    partition.partition(),
                    leader != null ? leader.id() : "none",
                    replicas.size(),
                    isr.size());
            // Check for lagging replicas
            if (isr.size() < replicas.size()) {
                log.warn("Partition {} has lagging replicas! ISR: {}, all replicas: {}",
                        partition.partition(), isr, replicas);
            }
            // Check against the min.insync.replicas requirement
            if (isr.size() < 2) { // assuming min.insync.replicas=2
                log.error("Partition {} has too few ISR replicas -- risk of message loss!",
                        partition.partition());
            }
        }
    }

    /**
     * Monitor under-replicated partitions
     * (this reads the broker's JMX metric, so it must run inside the broker JVM
     * or connect via remote JMX instead of the platform MBean server)
     */
    @Scheduled(fixedRate = 60000) // check every minute
    public void monitorReplicaLag() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName objectName = new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            Integer underReplicatedPartitions = (Integer) server.getAttribute(objectName, "Value");
            if (underReplicatedPartitions > 0) {
                log.error("Found {} under-replicated partitions -- risk of data loss!", underReplicatedPartitions);
                // Send an alert
                alertService.sendAlert("Kafka replication problem",
                        underReplicatedPartitions + " partitions are under-replicated");
            }
            // Record the metric
            Metrics.gauge("kafka.under_replicated_partitions", underReplicatedPartitions);
        } catch (Exception e) {
            log.error("Failed to monitor replica status", e);
        }
    }
}
```
2.2 Tuning Flush-to-Disk Behavior
```bash
# Kafka broker configuration (server.properties)

# === Flush settings ===
# ✗ Risky: rely entirely on the OS page cache flush -- correlated failures
#   across replicas can lose messages
log.flush.interval.messages=9223372036854775807  # Long.MAX_VALUE, effectively never flush
log.flush.interval.ms=9223372036854775807        # effectively never flush proactively

# ✓ Recommended: a sensible flush cadence (tune per workload)
log.flush.interval.messages=10000  # flush every 10000 messages
log.flush.interval.ms=1000         # or flush every second

# ✓ Safest (and slowest): flush on every message
log.flush.interval.messages=1
log.flush.interval.ms=0

# === Log segment settings ===
log.segment.bytes=1073741824   # 1GB per segment
log.retention.hours=168        # retain 7 days
log.retention.bytes=-1         # no total size limit

# === Other reliability settings ===
unclean.leader.election.enable=false  # never elect a non-ISR replica as leader
min.insync.replicas=2                 # minimum number of in-sync replicas
```
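One point worth making explicit: `min.insync.replicas` only protects writes when the producer uses `acks=all`. If the ISR of a partition shrinks below the configured minimum, such a producer receives a `NotEnoughReplicasException` instead of a silently lost write. Here is a minimal sketch of observing this (the topic name and broker address are placeholders of mine):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;

public class MinIsrDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all ISR replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception instanceof NotEnoughReplicasException) {
                            // ISR fell below min.insync.replicas: the write was
                            // rejected, not silently dropped -- retry or alert
                            System.err.println("Rejected: ISR below min.insync.replicas");
                        } else if (exception != null) {
                            System.err.println("Send failed: " + exception);
                        }
                    });
            producer.flush();
        }
    }
}
```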
2.3 Handling Disk Failures
```java
@Component
public class KafkaDiskMonitor {

    private final MeterRegistry meterRegistry;
    private final AlertService alertService;

    public KafkaDiskMonitor(MeterRegistry meterRegistry, AlertService alertService) {
        this.meterRegistry = meterRegistry;
        this.alertService = alertService;
    }

    /**
     * Monitor disk usage of the Kafka log directories
     */
    @Scheduled(fixedRate = 30000) // check every 30 seconds
    public void monitorDiskUsage() {
        String[] logDirs = {"/var/kafka-logs", "/data/kafka-logs"}; // set to your actual log dirs
        for (String logDir : logDirs) {
            try {
                File dir = new File(logDir);
                if (!dir.exists()) {
                    continue;
                }
                long totalSpace = dir.getTotalSpace();
                long freeSpace = dir.getFreeSpace();
                long usedSpace = totalSpace - freeSpace;
                double usagePercent = (double) usedSpace / totalSpace * 100;
                // Record the metric (note: Micrometer binds a gauge id to its first
                // value source; for continuously updated values, prefer updating a
                // shared AtomicReference instead of re-registering)
                Gauge.builder("kafka.disk.usage.percent", () -> usagePercent)
                        .tag("path", logDir)
                        .register(meterRegistry);
                log.info("Disk usage: {} - used: {}% ({}/{})",
                        logDir, String.format("%.2f", usagePercent),
                        humanReadableByteCount(usedSpace),
                        humanReadableByteCount(totalSpace));
                // Alert thresholds
                if (usagePercent > 90) {
                    alertService.sendCriticalAlert("Kafka disk space critically low",
                            String.format("Directory %s is %.2f%% full", logDir, usagePercent));
                } else if (usagePercent > 80) {
                    alertService.sendWarningAlert("Kafka disk space low",
                            String.format("Directory %s is %.2f%% full", logDir, usagePercent));
                }
                // Check disk IO performance
                checkDiskIOPerformance(logDir);
            } catch (Exception e) {
                log.error("Failed to check disk usage: {}", logDir, e);
            }
        }
    }

    /**
     * Check disk IO performance
     */
    private void checkDiskIOPerformance(String logDir) {
        try {
            // Simple write test
            File testFile = new File(logDir, "kafka-disk-test.tmp");
            long startTime = System.nanoTime();
            try (FileOutputStream fos = new FileOutputStream(testFile)) {
                byte[] testData = new byte[1024 * 1024]; // 1MB of test data
                fos.write(testData);
                fos.getFD().sync(); // force a flush to disk
            }
            long writeTime = System.nanoTime() - startTime;
            double writeTimeMs = writeTime / 1_000_000.0;
            // Clean up the test file
            testFile.delete();
            // Record the IO latency metric
            Timer.builder("kafka.disk.write.latency")
                    .tag("path", logDir)
                    .register(meterRegistry)
                    .record(writeTime, TimeUnit.NANOSECONDS);
            log.debug("Disk write performance: {} - 1MB write took {}ms",
                    logDir, String.format("%.2f", writeTimeMs));
            // Alert if the write took too long
            if (writeTimeMs > 1000) { // more than 1 second
                alertService.sendWarningAlert("Kafka disk IO degraded",
                        String.format("Directory %s write latency: %.2fms", logDir, writeTimeMs));
            }
        } catch (Exception e) {
            log.error("Disk IO performance test failed: {}", logDir, e);
            alertService.sendWarningAlert("Kafka disk IO test failed",
                    String.format("Could not test disk IO for directory %s", logDir));
        }
    }

    /**
     * Human-readable byte counts
     */
    private String humanReadableByteCount(long bytes) {
        if (bytes < 1024) return bytes + " B";
        int exp = (int) (Math.log(bytes) / Math.log(1024));
        String pre = "KMGTPE".charAt(exp - 1) + "";
        return String.format("%.2f %sB", bytes / Math.pow(1024, exp), pre);
    }
}
```
Scenario 3: Message Loss on the Consumer Side
Consumer-side message loss usually happens during message processing:
3.1 The Auto-Commit Offset Trap
```java
public class ConsumerOffsetTrap {

    /**
     * ✗ The dangerous auto-commit pattern
     */
    @KafkaListener(topics = "order-topic", groupId = "order-processor")
    public void dangerousConsumer(String message) {
        try {
            // Process the message
            processOrder(message);
            // Problem: with auto-commit, the offset is committed on a timer
            // regardless of whether processing succeeded. If this throws,
            // the offset is still committed -- and the message is gone!
        } catch (Exception e) {
            log.error("Failed to process order message: {}", message, e);
            // The exception is swallowed, the offset gets auto-committed, message lost
        }
    }

    /**
     * ✓ The safe manual-commit pattern
     */
    @Component
    public class SafeOrderConsumer {

        @Autowired
        private KafkaTemplate<String, String> kafkaTemplate;

        @KafkaListener(topics = "order-topic",
                groupId = "order-processor",
                containerFactory = "manualCommitContainerFactory")
        public void safeConsumer(String message, Acknowledgment ack) {
            try {
                // Process the message
                OrderEvent event = JSON.parseObject(message, OrderEvent.class);
                processOrder(event);
                // Commit the offset only after successful processing
                ack.acknowledge();
                log.info("Order processed: {}", event.getOrderId());
            } catch (Exception e) {
                log.error("Failed to process order message, offset not committed: {}", message, e);
                // Either retry, or route to a dead-letter queue
                handleFailedMessage(message, e);
            }
        }

        private void handleFailedMessage(String message, Exception e) {
            try {
                // Route to the dead-letter queue for manual handling
                kafkaTemplate.send("order-dlq", message);
                log.info("Failed message routed to the DLQ: {}", message);
            } catch (Exception dlqException) {
                log.error("Sending to the DLQ failed as well: {}", message, dlqException);
                // As a last resort, persist to a database or file so nothing is lost
            }
        }
    }

    /**
     * Container factory configured for manual commits
     */
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> manualCommitContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        // Disable auto-commit: acknowledge manually and immediately
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        // Configure the error handler
        factory.setCommonErrorHandler(new DefaultErrorHandler(
                new FixedBackOff(1000L, 3) // 3 retries, 1 second apart
        ));
        return factory;
    }
}
```
3.2 Batch Processing and Transactional Consistency
```java
@Service
public class TransactionalOrderProcessor {

    @Autowired
    private OrderService orderService;
    @Autowired
    private PaymentService paymentService;
    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;
    // The *database* transaction manager -- the goal here is DB-level atomicity
    @Autowired
    private PlatformTransactionManager transactionManager;

    /**
     * ✗ Batch processing without transaction protection
     */
    public void unsafeBatchProcess(List<String> messages, Acknowledgment ack) {
        try {
            for (String message : messages) {
                OrderEvent event = JSON.parseObject(message, OrderEvent.class);
                // Database operations
                orderService.updateOrder(event.getOrderId(), event.getStatus());
                paymentService.processPayment(event.getPaymentInfo());
                // If this throws, earlier orders are already processed but the
                // offset is not committed -- after a restart they get reprocessed
            }
            ack.acknowledge(); // commit the offset for the whole batch
        } catch (Exception e) {
            log.error("Batch processing failed", e);
            // Offset not committed, but part of the batch may already be applied
        }
    }

    /**
     * ✓ Batch processing with transaction protection
     * (requires a batch-enabled listener container factory)
     */
    @KafkaListener(topics = "order-topic", groupId = "order-batch-processor")
    public void safeBatchProcess(@Payload List<String> messages,
                                 Acknowledgment ack) {
        TransactionTemplate transactionTemplate = new TransactionTemplate(transactionManager);
        try {
            // Process the whole batch in a single transaction
            transactionTemplate.execute(status -> {
                try {
                    for (String message : messages) {
                        OrderEvent event = JSON.parseObject(message, OrderEvent.class);
                        // All database operations share one transaction
                        orderService.updateOrder(event.getOrderId(), event.getStatus());
                        paymentService.processPayment(event.getPaymentInfo());
                    }
                    return null;
                } catch (Exception e) {
                    // Mark the transaction for rollback
                    status.setRollbackOnly();
                    throw new RuntimeException("Batch processing failed", e);
                }
            });
            // Acknowledge only after the transaction commits
            ack.acknowledge();
            log.info("Batch processed successfully: {} messages", messages.size());
        } catch (Exception e) {
            log.error("Batch transaction failed, offset not committed", e);
            // Either split the batch and retry, or route to a dead-letter queue
            handleBatchFailure(messages, e);
        }
    }

    /**
     * More advanced: handling partial failures within a batch
     */
    public void advancedBatchProcess(List<String> messages, Acknowledgment ack) {
        List<String> failedMessages = new ArrayList<>();
        int successCount = 0;
        for (String message : messages) {
            try {
                TransactionTemplate transactionTemplate = new TransactionTemplate(transactionManager);
                transactionTemplate.execute(status -> {
                    OrderEvent event = JSON.parseObject(message, OrderEvent.class);
                    orderService.updateOrder(event.getOrderId(), event.getStatus());
                    paymentService.processPayment(event.getPaymentInfo());
                    return null;
                });
                successCount++;
            } catch (Exception e) {
                log.error("Failed to process a single message: {}", message, e);
                failedMessages.add(message);
            }
        }
        // Commit the offset as long as something succeeded; the failed
        // messages are re-routed below, so they are not lost
        if (successCount > 0) {
            ack.acknowledge();
            log.info("Batch done. Succeeded: {}, failed: {}", successCount, failedMessages.size());
        }
        // Handle the failures
        if (!failedMessages.isEmpty()) {
            handleFailedMessages(failedMessages);
        }
    }

    private void handleBatchFailure(List<String> messages, Exception e) {
        // Retry one by one to isolate the poison message
        for (String message : messages) {
            try {
                OrderEvent event = JSON.parseObject(message, OrderEvent.class);
                // Process this message on its own
                processSingleOrder(event);
            } catch (Exception singleException) {
                log.error("Single-message processing failed too: {}", message, singleException);
                // Route to the dead-letter queue
                kafkaTemplate.send("order-dlq", message);
            }
        }
    }

    private void handleFailedMessages(List<String> failedMessages) {
        for (String message : failedMessages) {
            // Route to a dead-letter or retry queue
            kafkaTemplate.send("order-retry", message);
        }
    }

    private void processSingleOrder(OrderEvent event) {
        TransactionTemplate transactionTemplate = new TransactionTemplate(transactionManager);
        transactionTemplate.execute(status -> {
            orderService.updateOrder(event.getOrderId(), event.getStatus());
            paymentService.processPayment(event.getPaymentInfo());
            return null;
        });
    }
}
```
3.3 Consumer Configuration Best Practices
```java
@Configuration
public class OptimalConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> reliableConsumerFactory() {
        Map<String, Object> props = new HashMap<>();
        // === Basics ===
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // === Reliability ===
        // 1. Disable offset auto-commit
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        // 2. Sensible session timeout and heartbeat interval
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000); // 30 seconds
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000); // 10 seconds
        // 3. Sensible poll limits
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // 5 minutes
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100); // at most 100 records per poll
        // 4. Start from the earliest message (for a new consumer group)
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // === Throughput ===
        // 5. Fetch sizing
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024); // at least 1KB
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500); // wait at most 500ms
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> reliableContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(reliableConsumerFactory());
        // Manual commit mode
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        // Concurrency
        factory.setConcurrency(3); // 3 consumer threads
        // Error handling
        factory.setCommonErrorHandler(createErrorHandler());
        // Record filter: return true to discard invalid (null/blank) records
        factory.setRecordFilterStrategy(record -> {
            Object value = record.value();
            return value == null || value.toString().trim().isEmpty();
        });
        return factory;
    }

    /**
     * Build the error handler
     */
    private CommonErrorHandler createErrorHandler() {
        // Dead-letter publisher
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
                kafkaTemplate(),
                (record, exception) -> {
                    // Choose the DLQ topic based on the exception type
                    if (exception instanceof DeserializationException) {
                        return new TopicPartition("deserialization-dlq", -1);
                    } else if (exception instanceof BusinessException) {
                        return new TopicPartition("business-dlq", -1);
                    } else {
                        return new TopicPartition("general-dlq", -1);
                    }
                }
        );
        // Retry policy
        FixedBackOff backOff = new FixedBackOff(1000L, 3); // 3 retries, 1 second apart
        DefaultErrorHandler errorHandler = new DefaultErrorHandler(recoverer, backOff);
        // Exceptions that should never be retried
        errorHandler.addNotRetryableExceptions(DeserializationException.class);
        // Retry listener for visibility
        errorHandler.setRetryListeners((record, ex, deliveryAttempt) -> {
            log.warn("Retrying message - attempt: {}, message: {}, error: {}",
                    deliveryAttempt, record.value(), ex.getMessage());
        });
        return errorHandler;
    }

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        return new KafkaTemplate<>(reliableProducerFactory());
    }

    private ProducerFactory<String, String> reliableProducerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        return new DefaultKafkaProducerFactory<>(props);
    }
}
```
Monitoring and Alerting: A System for Preventing Message Loss
```java
@Component
public class KafkaMonitoringSystem {

    private final MeterRegistry meterRegistry;
    private final AdminClient adminClient;
    private final AlertService alertService;
    // Micrometer binds a gauge id to its first value source, so we keep one
    // mutable reference per gauge and update it on every monitoring cycle
    private final ConcurrentMap<String, AtomicReference<Double>> gaugeValues = new ConcurrentHashMap<>();

    public KafkaMonitoringSystem(MeterRegistry meterRegistry, AlertService alertService) {
        this.meterRegistry = meterRegistry;
        this.alertService = alertService;
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        this.adminClient = AdminClient.create(props);
    }

    /** Register-once, update-always gauge helper */
    private void setGauge(String name, Tags tags, double value) {
        gaugeValues.computeIfAbsent(name + tags,
                k -> meterRegistry.gauge(name, tags, new AtomicReference<>(0.0),
                        ref -> ref.get()))
                .set(value);
    }

    /**
     * Monitor consumer lag
     */
    @Scheduled(fixedRate = 30000)
    public void monitorConsumerLag() {
        try {
            // Enumerate all consumer groups
            ListConsumerGroupsResult groupsResult = adminClient.listConsumerGroups();
            Collection<ConsumerGroupListing> groups = groupsResult.all().get();
            for (ConsumerGroupListing group : groups) {
                String groupId = group.groupId();
                // Fetch the group's committed offsets
                ListConsumerGroupOffsetsResult offsetsResult = adminClient.listConsumerGroupOffsets(groupId);
                Map<TopicPartition, OffsetAndMetadata> offsets = offsetsResult.partitionsToOffsetAndMetadata().get();
                for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : offsets.entrySet()) {
                    TopicPartition partition = entry.getKey();
                    long consumerOffset = entry.getValue().offset();
                    // Fetch the partition's latest offset
                    long latestOffset = getPartitionLatestOffset(partition);
                    long lag = latestOffset - consumerOffset;
                    // Record the metric
                    setGauge("kafka.consumer.lag",
                            Tags.of("group", groupId, "topic", partition.topic(),
                                    "partition", String.valueOf(partition.partition())),
                            lag);
                    log.debug("Consumer lag: group={}, topic={}, partition={}, lag={}",
                            groupId, partition.topic(), partition.partition(), lag);
                    // Alert thresholds
                    if (lag > 10000) { // more than 10000 messages behind
                        alertService.sendCriticalAlert("Severe Kafka consumer lag",
                                String.format("Group %s is %d messages behind on %s:%d",
                                        groupId, lag, partition.topic(), partition.partition()));
                    } else if (lag > 1000) { // more than 1000 messages behind
                        alertService.sendWarningAlert("Kafka consumer lag",
                                String.format("Group %s is %d messages behind on %s:%d",
                                        groupId, lag, partition.topic(), partition.partition()));
                    }
                }
            }
        } catch (Exception e) {
            log.error("Failed to monitor consumer lag", e);
        }
    }

    /**
     * Monitor produce and consume TPS per topic
     */
    @Scheduled(fixedRate = 60000) // every minute
    public void monitorThroughput() {
        try {
            // Real TPS figures come from JMX or Kafka's own metrics;
            // this sketch only shows the monitoring logic
            String[] topics = {"order-topic", "payment-topic", "user-topic"};
            for (String topic : topics) {
                double produceTps = getTopicProduceTPS(topic);
                double consumeTps = getTopicConsumeTPS(topic);
                // Record the metrics
                setGauge("kafka.produce.tps", Tags.of("topic", topic), produceTps);
                setGauge("kafka.consume.tps", Tags.of("topic", topic), consumeTps);
                log.info("Topic {} TPS: produce={}, consume={}", topic, produceTps, consumeTps);
                // Check whether production and consumption are balanced
                if (produceTps > consumeTps * 1.5) { // producing much faster than consuming
                    alertService.sendWarningAlert("Kafka produce/consume imbalance",
                            String.format("Topic %s produce TPS (%.2f) far exceeds consume TPS (%.2f)",
                                    topic, produceTps, consumeTps));
                }
            }
        } catch (Exception e) {
            log.error("Failed to monitor TPS", e);
        }
    }

    /**
     * Monitor the dead-letter queues
     */
    @Scheduled(fixedRate = 300000) // every 5 minutes
    public void monitorDeadLetterQueues() {
        String[] dlqTopics = {"order-dlq", "payment-dlq", "general-dlq"};
        for (String dlqTopic : dlqTopics) {
            try {
                long messageCount = getTopicMessageCount(dlqTopic);
                // Record the DLQ depth
                setGauge("kafka.dlq.message.count", Tags.of("topic", dlqTopic), messageCount);
                if (messageCount > 0) {
                    log.warn("Dead-letter messages found: topic={}, count={}", dlqTopic, messageCount);
                    // Escalate according to volume
                    if (messageCount > 100) {
                        alertService.sendCriticalAlert("Large dead-letter backlog",
                                String.format("DLQ %s has %d messages awaiting handling", dlqTopic, messageCount));
                    } else if (messageCount > 10) {
                        alertService.sendWarningAlert("Dead-letter messages found",
                                String.format("DLQ %s has %d messages awaiting handling", dlqTopic, messageCount));
                    }
                }
            } catch (Exception e) {
                log.error("Failed to monitor DLQ: {}", dlqTopic, e);
            }
        }
    }

    /**
     * Check Kafka cluster health
     */
    @Scheduled(fixedRate = 60000)
    public void checkClusterHealth() {
        try {
            // Check broker status. Note: describeCluster() only returns brokers
            // that responded, so compare against the expected cluster size.
            DescribeClusterResult clusterResult = adminClient.describeCluster();
            Collection<Node> nodes = clusterResult.nodes().get();
            int expectedBrokers = 3; // adjust to your deployment
            int aliveBrokers = nodes.size();
            setGauge("kafka.cluster.brokers.alive", Tags.empty(), aliveBrokers);
            setGauge("kafka.cluster.brokers.total", Tags.empty(), expectedBrokers);
            if (aliveBrokers < expectedBrokers) {
                alertService.sendCriticalAlert("Kafka broker failure",
                        String.format("%d/%d brokers are unavailable",
                                expectedBrokers - aliveBrokers, expectedBrokers));
            }
            // Check the controller
            Node controller = clusterResult.controller().get();
            if (controller == null) {
                alertService.sendCriticalAlert("Kafka controller problem", "No active controller in the cluster");
            } else {
                log.debug("Current controller: {}", controller.id());
            }
        } catch (Exception e) {
            log.error("Cluster health check failed", e);
            alertService.sendCriticalAlert("Kafka cluster check failed", "Cannot reach the Kafka cluster: " + e.getMessage());
        }
    }

    // Helpers
    private long getPartitionLatestOffset(TopicPartition partition) throws Exception {
        Map<TopicPartition, OffsetSpec> requestLatestOffsets = new HashMap<>();
        requestLatestOffsets.put(partition, OffsetSpec.latest());
        ListOffsetsResult offsetsResult = adminClient.listOffsets(requestLatestOffsets);
        ListOffsetsResult.ListOffsetsResultInfo resultInfo = offsetsResult.partitionResult(partition).get();
        return resultInfo.offset();
    }

    private double getTopicProduceTPS(String topic) {
        // A real implementation reads Kafka's JMX metrics;
        // this returns dummy data
        return Math.random() * 1000;
    }

    private double getTopicConsumeTPS(String topic) {
        // A real implementation reads Kafka's JMX metrics;
        // this returns dummy data
        return Math.random() * 800;
    }

    private long getTopicMessageCount(String topic) throws Exception {
        // List all partitions of the topic
        DescribeTopicsResult topicsResult = adminClient.describeTopics(Collections.singletonList(topic));
        TopicDescription topicDescription = topicsResult.values().get(topic).get();
        long totalCount = 0;
        for (TopicPartitionInfo partitionInfo : topicDescription.partitions()) {
            TopicPartition partition = new TopicPartition(topic, partitionInfo.partition());
            // Fetch the earliest and latest offsets of the partition
            Map<TopicPartition, OffsetSpec> earliestOffsets = new HashMap<>();
            earliestOffsets.put(partition, OffsetSpec.earliest());
            Map<TopicPartition, OffsetSpec> latestOffsets = new HashMap<>();
            latestOffsets.put(partition, OffsetSpec.latest());
            long earliestOffset = adminClient.listOffsets(earliestOffsets)
                    .partitionResult(partition).get().offset();
            long latestOffset = adminClient.listOffsets(latestOffsets)
                    .partitionResult(partition).get().offset();
            totalCount += (latestOffset - earliestOffset);
        }
        return totalCount;
    }
}
```
Summary: A Complete Plan for Zero Message Loss in Kafka
From this deep dive, here are what I consider the golden lines of defense for zero message loss in Kafka:
🛡️ Three Lines of Defense
First line of defense - the Producer:
- ✅ `acks=all` + `retries=Integer.MAX_VALUE` + `enable.idempotence=true`
- ✅ Synchronous sends, or async sends with callback confirmation
- ✅ Sensible timeout and retry settings
Second line of defense - the Broker:
- ✅ `replication.factor >= 3` + `min.insync.replicas >= 2`
- ✅ `unclean.leader.election.enable=false`
- ✅ A sensible flush policy
Third line of defense - the Consumer:
- ✅ Manual offset commits (`enable.auto.commit=false`)
- ✅ Transactional processing to keep data consistent
- ✅ A dead-letter queue for messages that cannot be processed
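Pulled together, the producer and consumer sides of these defenses could look like the following sketch (the broker list is a placeholder; tune the timeouts to your workload; the broker-side settings go in `server.properties` as shown earlier):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

/** Consolidated zero-loss defaults: a starting point, not a drop-in config. */
public class ZeroLossDefaults {

    public static Properties producerDefaults() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        p.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all ISR replicas
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // retry until delivery.timeout.ms
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);   // dedupe retries broker-side
        p.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        return p;
    }

    public static Properties consumerDefaults() {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);  // commit only after processing
        p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        return p;
    }
}
```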
📊 Monitoring and Alerting
🚨 Key metrics to watch:
- Consumer lag
- Number of under-replicated partitions
- Dead-letter queue depth
- Disk usage and IO performance
- Cluster health
💡 Production Recommendations
- Configuration templates: maintain standard production configuration templates
- Layered monitoring: full coverage from infrastructure up to the application layer
- Failure drills: regularly rehearse message-loss scenarios
- Documentation: maintain solid Kafka usage guidelines and a troubleshooting runbook
A Final Word
Remember: Kafka does not lose messages for no reason. Every lost message traces back to a misconfiguration or misuse.
In my career, that 2 a.m. incident taught me one thing: details decide success or failure, and configuration decides stability. I hope this article helps you avoid the potholes I hit!
Have you run into Kafka message loss in production? How did you solve it? Share your experience and lessons in the comments!