Notes on hitting a bug in spring-cloud-loadbalancer 3.1.1

Problem description

Production suddenly started throwing an index-out-of-bounds error. The log is as follows:

java.lang.IndexOutOfBoundsException: Index -3 out of bounds for length 5
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.util.Objects.checkIndex(Objects.java:372)
at java.base/java.util.ArrayList.get(ArrayList.java:459)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.getInstanceResponse(RoundRobinLoadBalancer.java:104)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.processInstanceResponse(RoundRobinLoadBalancer.java:87)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.lambda$choose$0(RoundRobinLoadBalancer.java:82)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:82)
at reactor.core.publisher.FluxDematerialize$DematerializeSubscriber.onNext(FluxDematerialize.java:98)
at reactor.core.publisher.FluxDematerialize$DematerializeSubscriber.onNext(FluxDematerialize.java:44)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.drainAsync(FluxFlattenIterable.java:421)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.drain(FluxFlattenIterable.java:686)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.onNext(FluxFlattenIterable.java:250)
at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onComplete(MonoCollectList.java:128)
at reactor.core.publisher.DrainUtils.postCompleteDrain(DrainUtils.java:132)
at reactor.core.publisher.DrainUtils.postComplete(DrainUtils.java:187)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.onComplete(FluxMaterialize.java:141)
at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2400)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.request(FluxMaterialize.java:148)
at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onSubscribe(MonoCollectList.java:79)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.onSubscribe(FluxMaterialize.java:103)
at reactor.core.publisher.FluxJust.subscribe(FluxJust.java:68)
at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:62)
at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:54)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.Mono.block(Mono.java:1706)
at org.springframework.cloud.loadbalancer.blocking.client.BlockingLoadBalancerClient.choose(BlockingLoadBalancerClient.java:155)
at org.springframework.cloud.openfeign.loadbalancer.FeignBlockingLoadBalancerClient.execute(FeignBlockingLoadBalancerClient.java:97)
	at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:119)
	at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:89)
	at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
	at org.springframework.cloud.openfeign.FeignCachingInvocationHandlerFactory$1.proceed(FeignCachingInvocationHandlerFactory.java:66)
	at org.springframework.cache.interceptor.CacheInterceptor.lambda$invoke$0(CacheInterceptor.java:54)
	at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:351)
	at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:64)
	at org.springframework.cloud.openfeign.FeignCachingInvocationHandlerFactory.lambda$create$1(FeignCachingInvocationHandlerFactory.java:53)

Problem analysis

The relevant source code is as follows:

public class RoundRobinLoadBalancer implements ReactorServiceInstanceLoadBalancer {

    final AtomicInteger position;

    public RoundRobinLoadBalancer(ObjectProvider<ServiceInstanceListSupplier> serviceInstanceListSupplierProvider,
            String serviceId) {
        this(serviceInstanceListSupplierProvider, serviceId, new Random().nextInt(1000));
    }
...
    private Response<ServiceInstance> getInstanceResponse(List<ServiceInstance> instances) {
        if (instances.isEmpty()) {
            if (log.isWarnEnabled()) {
                log.warn("No servers available for service: " + serviceId);
            }
            return new EmptyResponse();
        }
        // TODO: enforce order?
        int pos = Math.abs(this.position.incrementAndGet());
        // This is line 104, where the exception is thrown
        ServiceInstance instance = instances.get(pos % instances.size());
    
        return new DefaultResponse(instance);
    }

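The failure happens because the index computed from pos % instances.size() came out negative: -3.

As the code shows, position is an AtomicInteger whose default initial value is a random number below 1000. Everyone knows that an int wraps around to a negative number when it overflows, but the code explicitly takes the absolute value, so how can the result still be negative?

The only possibility is that Math.abs itself returns a negative value for some input (strictly speaking this is a documented edge case rather than a bug). The source of abs is shown below. The code is trivial: for a negative argument it simply returns -a.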

    public static int abs(int a) {
        return (a < 0) ? -a : a;
    }
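
The Javadoc for Math.abs notes that when the argument is Integer.MIN_VALUE the result is that same negative value, because -a overflows in two's complement. A minimal sketch to confirm this, assuming nothing beyond the JDK (the class name AbsEdgeCase and its local abs copy are made up for illustration):

```java
// Illustrative only: AbsEdgeCase is a hypothetical name, not part of any library.
final class AbsEdgeCase {

    // Same expression as the JDK source quoted above.
    static int abs(int a) {
        return (a < 0) ? -a : a;
    }

    public static void main(String[] args) {
        int min = Integer.MIN_VALUE;            // -2147483648
        // Negating MIN_VALUE overflows back to MIN_VALUE in two's complement.
        System.out.println(-min == min);        // true
        System.out.println(abs(min));           // -2147483648
        System.out.println(Math.abs(min));      // -2147483648
        // Every other int behaves as expected.
        System.out.println(Math.abs(min + 1));  // 2147483647
    }
}
```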

Reproducing the problem

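import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

public class AbsOverflowRepro {
    public static void main(String[] args) {
        // Same seeding as RoundRobinLoadBalancer: a random start below 1000
        AtomicInteger num = new AtomicInteger(new Random().nextInt(1000));
        int numStart = num.get();
        for (long i = 0; i < Long.MAX_VALUE; i++) {
            int pos = Math.abs(num.incrementAndGet());
            if (pos < 0) {
                // Only reached once the counter has wrapped to Integer.MIN_VALUE
                System.out.printf("numStart = %d, i = %d pos = %d pos %% 5 = %d%n", numStart, i, pos, pos % 5);
                break;
            }
        }
    }
}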

Output

numStart = 185, i = 2147483462 pos = -2147483648 pos % 5 = -3

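Cause

With the problem reproduced, the cause is clear. Starting from 185, the counter reaches 185 + 2147483462 = 2147483647 after 2147483462 increments; the next increment (loop index i = 2147483462) wraps it around to -2147483648. The mathematical absolute value of -2147483648 is 2147483648, but as anyone familiar with int knows, its range is -2147483648 to 2147483647.

In other words, the absolute value 2147483648 itself overflows int, so abs returns a negative number: -2147483648.

Taking the remainder of -2147483648 divided by 5 (the size of the routing URL list for external requests in production) therefore gives -3, which triggers the IndexOutOfBoundsException.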
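
The last step depends on Java's % operator keeping the sign of the dividend. The sketch below replays the arithmetic without waiting for two billion increments; NegativeIndexDemo and buggyIndex are hypothetical names that only mirror the index expression from the 3.1.1 source, and the five-element list stands in for the production instance list.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical helper that mirrors Math.abs(position) % size.
final class NegativeIndexDemo {

    static int buggyIndex(int position, int size) {
        return Math.abs(position) % size;
    }

    public static void main(String[] args) {
        List<String> instances = Arrays.asList("a", "b", "c", "d", "e");
        int wrapped = Integer.MAX_VALUE + 1; // overflows to Integer.MIN_VALUE
        int index = buggyIndex(wrapped, instances.size());
        System.out.println(index); // -3
        try {
            instances.get(index);
        } catch (IndexOutOfBoundsException e) {
            // A negative index is rejected by the list, as in the production stack trace.
            System.out.println("caught: " + e);
        }
    }
}
```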

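Solution

Since the counter has just wrapped, the next overflow will take some time to occur. That leaves room to schedule an upgrade of spring-cloud-loadbalancer at a convenient time, because newer versions have fixed the problem. See the official issue and PR: https://github.com/spring-cloud/spring-cloud-commons/pull/1077

Official source code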

		// Ignore the sign bit, this allows pos to loop sequentially from 0 to
		// Integer.MAX_VALUE
		int pos = this.position.incrementAndGet() & Integer.MAX_VALUE;

		ServiceInstance instance = instances.get(pos % instances.size());
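
To see the masked index survive the wrap, here is a minimal sketch. MaskedRoundRobin is a hypothetical stand-in rather than the library class; only the index arithmetic is taken from the fix, and the counter is seeded just below Integer.MAX_VALUE so the overflow happens on the third call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the load balancer; only the index arithmetic matches the fix.
final class MaskedRoundRobin {
    private final AtomicInteger position;

    MaskedRoundRobin(int seed) {
        this.position = new AtomicInteger(seed);
    }

    <T> T choose(List<T> instances) {
        // Clear the sign bit so pos is always in [0, Integer.MAX_VALUE].
        int pos = position.incrementAndGet() & Integer.MAX_VALUE;
        return instances.get(pos % instances.size());
    }

    public static void main(String[] args) {
        List<String> instances = Arrays.asList("a", "b", "c", "d", "e");
        // Start two steps before the overflow instead of waiting for ~2^31 calls.
        MaskedRoundRobin lb = new MaskedRoundRobin(Integer.MAX_VALUE - 2);
        for (int i = 0; i < 5; i++) {
            System.out.print(lb.choose(instances) + " ");
        }
        System.out.println();
    }
}
```

With five instances this prints b c a b c: the rotation restarts from index 0 at the wrap, so one step of the sequence is out of order, but the index never goes negative.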

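Why apply a bitwise AND with Integer.MAX_VALUE first?

  1. A bitwise AND can stand in for the remainder operation only when the divisor is a power of two, and only for non-negative X: X % 2^n = X & (2^n - 1)
  2. Integer.MAX_VALUE = 2^31 - 1 = 2147483647

So & Integer.MAX_VALUE clears the sign bit, which amounts to taking the counter modulo 2^31, and pos always falls in the range 0 to Integer.MAX_VALUE.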
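
The identity in point 1 can be checked on small numbers. This is a rough sketch; MaskVersusModulo and maskMod are names invented for the illustration.

```java
// Illustrative only: compares % with a bit mask for a power-of-two divisor.
final class MaskVersusModulo {

    // Remainder by 2^n computed with a bit mask; n is assumed to be in [0, 30].
    static int maskMod(int x, int n) {
        return x & ((1 << n) - 1);
    }

    public static void main(String[] args) {
        // Non-negative x: both forms agree.
        System.out.println(13 % 8);          // 5
        System.out.println(maskMod(13, 3));  // 5
        // Negative x: % keeps the sign of x, the mask result is never negative.
        System.out.println(-13 % 8);         // -5
        System.out.println(maskMod(-13, 3)); // 3
        // Masking with Integer.MAX_VALUE (2^31 - 1) maps MIN_VALUE to 0.
        System.out.println(Integer.MIN_VALUE & Integer.MAX_VALUE); // 0
    }
}
```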