Implementing Distributed MapReduce Computation Natively with Spring Boot

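I. Architecture Adjustments

Replacements for the core components:

1. Service registry → database registry table
2. Task queue → database task table
3. Distributed lock → database row-level locks
4. Node communication → HTTP REST endpoints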

II. Database Table Design

-- Node registry table (MySQL syntax)
CREATE TABLE compute_nodes (
    node_id VARCHAR(36) PRIMARY KEY,
    last_heartbeat TIMESTAMP,
    status ENUM('ACTIVE','DOWN')
);
-- Task shard table
CREATE TABLE task_shards (
    shard_id INT AUTO_INCREMENT PRIMARY KEY,
    data_range VARCHAR(100), -- e.g. 1-10000
    status ENUM('PENDING','PROCESSING','COMPLETED'),
    locked_by VARCHAR(36),
    locked_at TIMESTAMP
);
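
The data_range column stores a range as text such as 1-10000. As a rough sketch of how that text could be parsed and how a data set might be cut into shard ranges, the class below may help. DataRange here is a hypothetical stand-in for the type used later in processShard, and treating both ends as inclusive is an assumption, not something the article states.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type for the data_range column, e.g. "1-10000" (both ends assumed inclusive). */
final class DataRange {
    final long start;
    final long end;

    DataRange(long start, long end) {
        if (start > end) {
            throw new IllegalArgumentException("start must not exceed end: " + start + "-" + end);
        }
        this.start = start;
        this.end = end;
    }

    /** Parses the "start-end" text stored in data_range. */
    static DataRange parse(String text) {
        String[] parts = text.trim().split("-", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected start-end but got: " + text);
        }
        return new DataRange(Long.parseLong(parts[0].trim()), Long.parseLong(parts[1].trim()));
    }

    /** Splits records 1..totalRecords into consecutive ranges of at most shardSize records. */
    static List<DataRange> split(long totalRecords, long shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        List<DataRange> ranges = new ArrayList<>();
        for (long start = 1; start <= totalRecords; start += shardSize) {
            ranges.add(new DataRange(start, Math.min(start + shardSize - 1, totalRecords)));
        }
        return ranges;
    }

    @Override
    public String toString() {
        return start + "-" + end;
    }
}
```

Each element returned by split can be written to task_shards as one PENDING row, using toString() for the data_range value.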

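III. Core Implementation

1. Node self-registration

@Scheduled(fixedRate = 3000)
public void nodeRegistration() {
    // Upsert this node's heartbeat (MySQL syntax) and mark the node ACTIVE again
    jdbcTemplate.update(
        "INSERT INTO compute_nodes (node_id, last_heartbeat, status) " +
        "VALUES (?, NOW(), 'ACTIVE') " +
        "ON DUPLICATE KEY UPDATE last_heartbeat = NOW(), status = 'ACTIVE'",
        nodeId
    );

    // Remove nodes with no heartbeat in the last 10 seconds. The cutoff uses the
    // database clock, like the insert above, so clock skew between nodes does not matter.
    jdbcTemplate.update(
        "DELETE FROM compute_nodes WHERE last_heartbeat < NOW() - INTERVAL 10 SECOND"
    );
}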
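
The expiry rule applied by the cleanup query can be written as a small pure function, which makes the boundary behaviour easy to check. This is only an illustration: HeartbeatPolicy and its constant are hypothetical names, and the strict comparison follows the `<` in the DELETE statement.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper that restates the cleanup rule: expired if the last heartbeat is older than 10 s. */
final class HeartbeatPolicy {
    static final Duration EXPIRY_TIMEOUT = Duration.ofSeconds(10);

    /** True when lastHeartbeat is strictly earlier than now minus the timeout. */
    static boolean isExpired(Instant lastHeartbeat, Instant now) {
        return lastHeartbeat.isBefore(now.minus(EXPIRY_TIMEOUT));
    }
}
```

With a 3-second heartbeat, a node that sends its next heartbeat 9 seconds late is still kept, while one silent for more than 10 seconds is removed and re-registers on its next successful heartbeat.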

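2. Preemptive shard scheduling

@Scheduled(fixedDelay = 1000)
public void acquireTasks() {
    // Pick up to 5 pending shards. FOR UPDATE SKIP LOCKED (MySQL 8.0+) only holds
    // row locks inside a transaction; under auto-commit the conditional UPDATE
    // below is what guarantees that a single node claims each shard.
    List<Long> shardIds = jdbcTemplate.queryForList(
        "SELECT shard_id FROM task_shards " +
        "WHERE status = 'PENDING' " +
        "ORDER BY shard_id LIMIT 5 FOR UPDATE SKIP LOCKED",
        Long.class
    );

    shardIds.forEach(shardId -> {
        // Claim the shard only if it is still PENDING
        int updated = jdbcTemplate.update(
            "UPDATE task_shards SET status = 'PROCESSING', " +
            "locked_by = ?, locked_at = NOW() " +
            "WHERE shard_id = ? AND status = 'PENDING'",
            nodeId, shardId
        );
        if (updated > 0) processShard(shardId);
    });
}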
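
The claim step works because the UPDATE is conditional on the shard still being PENDING, so only one node can see an update count of 1. A minimal in-memory sketch of the same compare-and-set idea, with no database involved, might look like this. ShardClaimSimulation and its methods are hypothetical and only model the semantics.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory model of the conditional UPDATE used to claim a shard. */
final class ShardClaimSimulation {
    // shard_id -> status; stands in for the status column of task_shards
    private final Map<Long, String> status = new ConcurrentHashMap<>();
    // shard_id -> claiming node; stands in for the locked_by column
    private final Map<Long, String> lockedBy = new ConcurrentHashMap<>();

    void addPending(long shardId) {
        status.put(shardId, "PENDING");
    }

    /** Models UPDATE ... WHERE shard_id = ? AND status = 'PENDING'; true means one row changed. */
    boolean tryClaim(long shardId, String nodeId) {
        // replace(key, expected, new) is atomic: it succeeds for exactly one caller
        boolean claimed = status.replace(shardId, "PENDING", "PROCESSING");
        if (claimed) {
            lockedBy.put(shardId, nodeId);
        }
        return claimed;
    }

    String owner(long shardId) {
        return lockedBy.get(shardId);
    }
}
```

If two nodes call tryClaim for the same shard, the second call returns false, which corresponds to updated == 0 in acquireTasks.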

3. Distributed processing in the Map stage

public void processShard(Long shardId) {
    try {
        DataRange range = getDataRange(shardId);
        List<Record> records = fetchData(range);

        // Map step: sum the amounts per category within this shard
        Map<String, Double> partialResult = records.parallelStream()
            .collect(Collectors.groupingBy(
                Record::getCategory,
                Collectors.summingDouble(Record::getAmount)
            ));

        saveResult(shardId, partialResult);
        markShardCompleted(shardId);
    } catch (Exception e) {
        // Release the shard on failure so that it can be picked up again
        releaseShard(shardId);
    }
}
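
To see the grouping step in isolation, here is a small self-contained sketch that runs the same collector over a handful of in-memory records. SaleRecord is a hypothetical stand-in, since the article does not show its Record class.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class MapStageExample {
    /** Hypothetical record type with the two accessors the Map stage relies on. */
    static final class SaleRecord {
        private final String category;
        private final double amount;

        SaleRecord(String category, double amount) {
            this.category = category;
            this.amount = amount;
        }

        String getCategory() {
            return category;
        }

        double getAmount() {
            return amount;
        }
    }

    /** Produces one shard's partial result: total amount per category. */
    static Map<String, Double> mapShard(List<SaleRecord> records) {
        return records.stream()
            .collect(Collectors.groupingBy(
                SaleRecord::getCategory,
                Collectors.summingDouble(SaleRecord::getAmount)));
    }
}
```

Each shard yields one such map, and those maps are what the Reduce stage later combines.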

4. Aggregation in the Reduce stage

public Map<String, Double> reduceAllResults() {
    // Sum the per-shard partial results by category inside the database
    return jdbcTemplate.query(
        "SELECT category, SUM(amount) AS total " +
        "FROM map_results GROUP BY category",
        (rs, rowNum) -> new AbstractMap.SimpleEntry<String, Double>(
            rs.getString("category"),
            rs.getDouble("total")
        )).stream().collect(Collectors.toMap(
            Map.Entry::getKey, Map.Entry::getValue
        ));
}
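
The SQL above pushes the aggregation into the database. The same reduction can be expressed in plain Java by merging the per-shard maps, which may be useful for testing the logic without a database. This is a sketch only, and ReduceStageExample is a hypothetical name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReduceStageExample {
    /** Merges per-shard partial results by summing the amounts of matching categories. */
    static Map<String, Double> reduce(List<Map<String, Double>> partials) {
        Map<String, Double> totals = new HashMap<>();
        for (Map<String, Double> partial : partials) {
            // merge inserts the amount for a new category or adds it to the running total
            partial.forEach((category, amount) -> totals.merge(category, amount, Double::sum));
        }
        return totals;
    }
}
```

Because addition is associative and commutative, the order in which shards finish does not change the totals, apart from ordinary floating-point rounding.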

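IV. Key Optimizations

1. Shard lock optimization

// Optimistic locking avoids holding a database connection for a long time.
// It requires a version column on task_shards, which the table definition in
// section II does not include; expectedVersion is the value read beforehand.
public boolean tryLockShard(Long shardId, long expectedVersion) {
    return jdbcTemplate.update(
        "UPDATE task_shards SET version = version + 1 " +
        "WHERE shard_id = ? AND version = ?",
        shardId, expectedVersion) > 0;
}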

2. Result caching

@Cacheable(value = "partialResults", key = "#shardId")
public Map<String, Double> getPartialResult(Long shardId) {
    return jdbcTemplate.query(...);
}

// Configuration class that enables caching
@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public CacheManager cacheManager() {
        // In-memory cache, local to each node's JVM
        return new ConcurrentMapCacheManager();
    }
}
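
Conceptually, caching a partial result per shard id amounts to memoizing a loader in a concurrent map. The sketch below shows that idea without Spring. PartialResultCache, its loader and the load counter are hypothetical and exist only to make the behaviour observable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Memoizes per-shard partial results in memory, local to one JVM. */
final class PartialResultCache {
    private final Map<Long, Map<String, Double>> cache = new ConcurrentHashMap<>();
    private final Function<Long, Map<String, Double>> loader;
    private final AtomicInteger loads = new AtomicInteger();

    PartialResultCache(Function<Long, Map<String, Double>> loader) {
        this.loader = loader;
    }

    /** Returns the cached result, invoking the loader only on a cache miss. */
    Map<String, Double> get(long shardId) {
        return cache.computeIfAbsent(shardId, id -> {
            loads.incrementAndGet();
            return loader.apply(id);
        });
    }

    /** Drops a cached entry, for example after a shard has been reprocessed. */
    void evict(long shardId) {
        cache.remove(shardId);
    }

    int loadCount() {
        return loads.get();
    }
}
```

Since such a cache lives inside one node, other nodes do not see its entries, and an entry goes stale if the shard is recomputed without an eviction.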

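3. Distributed transaction handling

// REQUIRES_NEW only takes effect when this method is called through the Spring
// proxy, i.e. from another bean rather than via this.markShardCompleted(...)
@Transactional(propagation = Propagation.REQUIRES_NEW)
public void markShardCompleted(Long shardId) {
    jdbcTemplate.update(
        "UPDATE task_shards SET status = 'COMPLETED' " +
        "WHERE shard_id = ?", shardId);

    // Published inside the transaction: a plain @EventListener runs before the commit
    eventPublisher.publishEvent(
        new ShardCompleteEvent(shardId));
}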

V. Deployment Architecture Comparison

VI. Load Test Data

Test environment:

1,000,000 records

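VII. Suggestions for Production Use

Shard assignment strategy

// Avoid hot spots by walking the shards with a multiplicative hash stride
// (Knuth's constant 2654435761), offset by this node's hash. This is not
// jump consistent hashing in the strict sense.
public List<Long> assignShards(int totalShards) {
    return IntStream.range(0, totalShards)
        // floorMod keeps the index non-negative even if nodeHash is negative
        .mapToObj(i -> Math.floorMod(nodeHash + i * 2654435761L, (long) totalShards))
        .collect(Collectors.toList());
}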
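
If jump consistent hashing is what is wanted for shard placement, the algorithm published by Lamping and Veach is short enough to sketch here. It maps a 64-bit key to a bucket in [0, buckets) and moves only a small fraction of keys when the bucket count grows by one. JumpHash is a hypothetical class name, and how keys are derived from shard or node ids is left open.

```java
/** Jump consistent hash (Lamping and Veach): maps a key to one of `buckets` buckets. */
final class JumpHash {
    static int jumpHash(long key, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        long b = -1;
        long j = 0;
        while (j < buckets) {
            b = j;
            // Linear congruential step; long overflow wraps like unsigned 64-bit arithmetic
            key = key * 2862933555777941757L + 1;
            // The next jump target always lies beyond b, so the loop terminates
            j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
        }
        return (int) b;
    }
}
```

When the bucket count goes from n to n + 1, a key either keeps its bucket or moves to the new bucket n, which is the property that limits data movement during scale-out.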

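Dynamic shard expansion

@Scheduled(fixedRate = 60000)
public void autoReshard() {
    int currentShards = getCurrentShardCount();
    int required = calculateRequiredShards();

    if (required > currentShards) {
        // Changing AUTO_INCREMENT only moves the next generated id; it does not
        // create shards. New shards have to be inserted as PENDING rows.
        // nextDataRange(i) is a hypothetical helper returning the range of shard i.
        for (int i = currentShards; i < required; i++) {
            jdbcTemplate.update(
                "INSERT INTO task_shards (data_range, status) VALUES (?, 'PENDING')",
                nextDataRange(i));
        }
    }
}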

Result validation

public void validateResults() {
    // Repair every shard that is marked COMPLETED but has no cached result
    jdbcTemplate.query(
        "SELECT shard_id FROM task_shards WHERE status = 'COMPLETED'",
        (RowCallbackHandler) rs -> {
            Long shardId = rs.getLong(1);
            if (!resultCache.contains(shardId)) {
                repairShard(shardId);
            }
        });
}
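
The check itself is a set difference: shards marked COMPLETED minus shards that actually have a stored result. A minimal sketch of that comparison, with the hypothetical name ResultValidation, could be:

```java
import java.util.Set;
import java.util.TreeSet;

final class ResultValidation {
    /** Returns the ids of shards that are marked completed but have no result, in ascending order. */
    static Set<Long> shardsNeedingRepair(Set<Long> completedShards, Set<Long> shardsWithResults) {
        Set<Long> missing = new TreeSet<>(completedShards);
        missing.removeAll(shardsWithResults);
        return missing;
    }
}
```

Computing the difference in one pass also makes it easy to log how many shards need repair before any of them is reprocessed.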

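This approach relies entirely on native Spring Boot capabilities. By combining a relational database with scheduled tasks, it keeps the system simple while meeting basic distributed computing needs. It suits small to medium offline workloads (fewer than ten million records processed per day); for higher performance, a dedicated distributed computing framework is still the better choice.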
