spring-cloud-loadbalancer 3.1.1 bug: troubleshooting notes

Problem description

Production suddenly started throwing an index-out-of-bounds error. The log is as follows:

java.lang.IndexOutOfBoundsException: Index -3 out of bounds for length 5
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.util.Objects.checkIndex(Objects.java:372)
at java.base/java.util.ArrayList.get(ArrayList.java:459)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.getInstanceResponse(RoundRobinLoadBalancer.java:104)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.processInstanceResponse(RoundRobinLoadBalancer.java:87)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.lambda$choose$0(RoundRobinLoadBalancer.java:82)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:82)
at reactor.core.publisher.FluxDematerialize$DematerializeSubscriber.onNext(FluxDematerialize.java:98)
at reactor.core.publisher.FluxDematerialize$DematerializeSubscriber.onNext(FluxDematerialize.java:44)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.drainAsync(FluxFlattenIterable.java:421)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.drain(FluxFlattenIterable.java:686)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.onNext(FluxFlattenIterable.java:250)
at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onComplete(MonoCollectList.java:128)
at reactor.core.publisher.DrainUtils.postCompleteDrain(DrainUtils.java:132)
at reactor.core.publisher.DrainUtils.postComplete(DrainUtils.java:187)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.onComplete(FluxMaterialize.java:141)
at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2400)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.request(FluxMaterialize.java:148)
at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onSubscribe(MonoCollectList.java:79)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.onSubscribe(FluxMaterialize.java:103)
at reactor.core.publisher.FluxJust.subscribe(FluxJust.java:68)
at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:62)
at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:54)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.Mono.block(Mono.java:1706)
at org.springframework.cloud.loadbalancer.blocking.client.BlockingLoadBalancerClient.choose(BlockingLoadBalancerClient.java:155)
at org.springframework.cloud.openfeign.loadbalancer.FeignBlockingLoadBalancerClient.execute(FeignBlockingLoadBalancerClient.java:97)
	at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:119)
	at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:89)
	at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
	at org.springframework.cloud.openfeign.FeignCachingInvocationHandlerFactory$1.proceed(FeignCachingInvocationHandlerFactory.java:66)
	at org.springframework.cache.interceptor.CacheInterceptor.lambda$invoke$0(CacheInterceptor.java:54)
	at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:351)
	at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:64)
	at org.springframework.cloud.openfeign.FeignCachingInvocationHandlerFactory.lambda$create$1(FeignCachingInvocationHandlerFactory.java:53)

Problem analysis

The relevant source code is as follows:

public class RoundRobinLoadBalancer implements ReactorServiceInstanceLoadBalancer {

    final AtomicInteger position;

    public RoundRobinLoadBalancer(ObjectProvider<ServiceInstanceListSupplier> serviceInstanceListSupplierProvider,
            String serviceId) {
        this(serviceInstanceListSupplierProvider, serviceId, new Random().nextInt(1000));
    }
...
    private Response<ServiceInstance> getInstanceResponse(List<ServiceInstance> instances) {
        if (instances.isEmpty()) {
            if (log.isWarnEnabled()) {
                log.warn("No servers available for service: " + serviceId);
            }
            return new EmptyResponse();
        }
        // TODO: enforce order?
        int pos = Math.abs(this.position.incrementAndGet());
        // Line 104, where the exception is thrown
        ServiceInstance instance = instances.get(pos % instances.size());
    
        return new DefaultResponse(instance);
    }

The cause is that pos % instances.size() evaluated to a negative number: -3.

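As the code shows, position is an AtomicInteger initialized to a random number below 1000. It is well known that an int wraps around to a negative value when it overflows, but the code takes the absolute value, so how can the result still be negative?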

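The only remaining possibility is that Math.abs itself can return a negative number. At first this looks like a bug in Math.abs, but it is documented behavior for one specific input. The source of abs is very simple: if the argument is negative, it returns -a.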

    public static int abs(int a) {
        return (a < 0) ? -a : a;
    }
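
As a quick check of that expression, the sketch below applies it to an ordinary negative number and to Integer.MIN_VALUE. The class name AbsEdgeCase is made up for illustration. In two's complement int arithmetic, negating Integer.MIN_VALUE wraps back to Integer.MIN_VALUE, and the Math.abs Javadoc notes that the result is negative for that argument.

```java
class AbsEdgeCase {
    // Same expression as the JDK source shown above, copied here for illustration.
    static int abs(int a) {
        return (a < 0) ? -a : a;
    }

    public static void main(String[] args) {
        System.out.println(abs(-5));                // ordinary case: 5
        System.out.println(abs(Integer.MIN_VALUE)); // -2147483648, because -a wraps around
        System.out.println(Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE); // true
    }
}
```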

Reproducing the problem

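import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

public class AbsOverflowRepro {
    public static void main(String[] args) {
        // Same seeding as RoundRobinLoadBalancer: a random start below 1000
        AtomicInteger num = new AtomicInteger(new Random().nextInt(1000));
        int numStart = num.get();
        for (long i = 0; i < Long.MAX_VALUE; i++) {
            int pos = Math.abs(num.incrementAndGet());
            // Stop at the first negative "absolute value"
            if (pos < 0) {
                System.out.printf("numStart = %d, i = %d pos = %d pos %% 5 = %d%n", numStart, i, pos, pos % 5);
                break;
            }
        }
    }
}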

Output

numStart = 185, i = 2147483462 pos = -2147483648 pos % 5 = -3
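
The reproduction above needs roughly two billion iterations before it prints anything. A shorter variant, sketched here under the assumption that only the wrap-around itself matters, seeds the counter just below Integer.MAX_VALUE so the overflow happens within a few steps. The class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

class FastOverflowRepro {
    // Returns the first negative value of Math.abs(counter.incrementAndGet()),
    // or 0 if none is seen within maxSteps increments.
    static int firstNegativePos(int start, int maxSteps) {
        AtomicInteger num = new AtomicInteger(start);
        for (int i = 0; i < maxSteps; i++) {
            int pos = Math.abs(num.incrementAndGet());
            if (pos < 0) {
                return pos;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Seed three increments away from the wrap-around (an arbitrary choice for this sketch).
        int pos = firstNegativePos(Integer.MAX_VALUE - 2, 10);
        System.out.println("pos = " + pos + ", pos % 5 = " + (pos % 5));
    }
}
```

With this seed the third increment wraps to Integer.MIN_VALUE, which gives the same pos and the same remainder as the long-running loop.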

Root cause

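With the problem reproduced, the cause is clear. The failure occurs at loop index i = 2147483462: after the preceding 2147483462 increments, num = 185 + 2147483462 = 2147483647, and incrementing 2147483647 once more wraps around to -2147483648. The mathematical absolute value of -2147483648 is 2147483648, but as anyone familiar with int knows, its range is -2147483648 to 2147483647.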

In other words, the absolute value 2147483648 does not fit in an int and overflows, so the result is a negative number: -2147483648.

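Taking the remainder of -2147483648 divided by 5 (the size of the routing URL list for external requests in production) therefore gives -3, and instances.get(-3) throws the IndexOutOfBoundsException.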
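
The last step depends on how Java's % operator treats signs: the remainder takes the sign of the dividend, so a negative pos produces a negative index. The sketch below illustrates this with five placeholder instance names (hypothetical, standing in for the five production routes). Math.floorMod appears only for contrast and is not what the library uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemainderSign {
    // Plain remainder, the same operation as on line 104 of the affected version.
    static int remainderIndex(int pos, int size) {
        return pos % size;
    }

    public static void main(String[] args) {
        // Five placeholder names standing in for the five production instances.
        List<String> instances = new ArrayList<>(Arrays.asList("a", "b", "c", "d", "e"));
        int pos = Math.abs(Integer.MIN_VALUE); // still Integer.MIN_VALUE

        int index = remainderIndex(pos, instances.size());
        System.out.println("index = " + index); // -3
        // For contrast only: floorMod follows the sign of the divisor.
        System.out.println("floorMod = " + Math.floorMod(pos, instances.size())); // 2

        try {
            instances.get(index);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("get(" + index + ") failed: " + e.getMessage());
        }
    }
}
```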

Solution

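Since the overflow has just happened, the counter must pass through all 2^32 int values before it reaches -2147483648 again, so the next occurrence is some time away. Use that window to schedule an upgrade of spring-cloud-loadbalancer, because newer versions have fixed the problem. See the official issue and PR: https://github.com/spring-cloud/spring-cloud-commons/pull/1077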
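
How long that window lasts depends on traffic. The rough estimate below assumes one counter increment per load-balanced call to the service from a single JVM and no restart in between (a restart reseeds the counter below 1000). The rate of 1000 calls per second is a made-up figure, and the class name is hypothetical.

```java
class OverflowInterval {
    // An int counter visits all 2^32 values before returning to Integer.MIN_VALUE.
    static final long CYCLE = 1L << 32;
    static final double SECONDS_PER_DAY = 86_400.0;

    // Days until the counter next reaches Integer.MIN_VALUE at a steady call rate.
    static double daysUntilNextOverflow(double callsPerSecond) {
        return CYCLE / callsPerSecond / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double days = daysUntilNextOverflow(1000.0); // hypothetical rate
        System.out.println("Days until the next overflow: " + days);
    }
}
```

At that hypothetical rate the window is a little under 50 days, so the actual margin should be worked out from the real call volume.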

Official fix

		// Ignore the sign bit, this allows pos to loop sequentially from 0 to
		// Integer.MAX_VALUE
		int pos = this.position.incrementAndGet() & Integer.MAX_VALUE;

		ServiceInstance instance = instances.get(pos % instances.size());

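Why apply a bitwise AND with Integer.MAX_VALUE first?

  1. A bitwise AND can stand in for a remainder only when the divisor is a power of two: for non-negative X, X % 2^n = X & (2^n - 1). For a negative X in two's complement, the AND yields the non-negative value of X modulo 2^n.
  2. Integer.MAX_VALUE = 2^31 - 1 = 2147483647, so X & Integer.MAX_VALUE clears the sign bit. The result is X modulo 2^31, which always lies between 0 and Integer.MAX_VALUE.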
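
To see the mask at work across the wrap-around, the sketch below uses a stripped-down round-robin selector with the same two operations as the fixed code quoted above. It is not the library class: MaskedRoundRobin, its seed parameter, and the placeholder instance names are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class MaskedRoundRobin {
    private final AtomicInteger position;

    MaskedRoundRobin(int seed) {
        this.position = new AtomicInteger(seed);
    }

    // Clears the sign bit before taking the remainder, so the index is never negative.
    int nextIndex(int size) {
        int pos = position.incrementAndGet() & Integer.MAX_VALUE;
        return pos % size;
    }

    public static void main(String[] args) {
        List<String> instances = Arrays.asList("a", "b", "c", "d", "e");
        // Seed just below the limit so the second call crosses the overflow.
        MaskedRoundRobin rr = new MaskedRoundRobin(Integer.MAX_VALUE - 1);
        for (int i = 0; i < 4; i++) {
            System.out.println(instances.get(rr.nextIndex(instances.size())));
        }
    }
}
```

When the counter wraps to Integer.MIN_VALUE the mask turns it into 0, so the selection simply restarts from the first instance instead of throwing.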