Notes on a bug hit in spring-cloud-loadbalancer 3.1.1

Problem description

Production suddenly started throwing an index-out-of-bounds error. The log:

java.lang.IndexOutOfBoundsException: Index -3 out of bounds for length 5
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.util.Objects.checkIndex(Objects.java:372)
at java.base/java.util.ArrayList.get(ArrayList.java:459)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.getInstanceResponse(RoundRobinLoadBalancer.java:104)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.processInstanceResponse(RoundRobinLoadBalancer.java:87)
at org.springframework.cloud.loadbalancer.core.RoundRobinLoadBalancer.lambda$choose$0(RoundRobinLoadBalancer.java:82)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:82)
at reactor.core.publisher.FluxDematerialize$DematerializeSubscriber.onNext(FluxDematerialize.java:98)
at reactor.core.publisher.FluxDematerialize$DematerializeSubscriber.onNext(FluxDematerialize.java:44)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.drainAsync(FluxFlattenIterable.java:421)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.drain(FluxFlattenIterable.java:686)
at reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber.onNext(FluxFlattenIterable.java:250)
at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onComplete(MonoCollectList.java:128)
at reactor.core.publisher.DrainUtils.postCompleteDrain(DrainUtils.java:132)
at reactor.core.publisher.DrainUtils.postComplete(DrainUtils.java:187)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.onComplete(FluxMaterialize.java:141)
at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2400)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.request(FluxMaterialize.java:148)
at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onSubscribe(MonoCollectList.java:79)
at reactor.core.publisher.FluxMaterialize$MaterializeSubscriber.onSubscribe(FluxMaterialize.java:103)
at reactor.core.publisher.FluxJust.subscribe(FluxJust.java:68)
at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:62)
at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:54)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.Mono.block(Mono.java:1706)
at org.springframework.cloud.loadbalancer.blocking.client.BlockingLoadBalancerClient.choose(BlockingLoadBalancerClient.java:155)
at org.springframework.cloud.openfeign.loadbalancer.FeignBlockingLoadBalancerClient.execute(FeignBlockingLoadBalancerClient.java:97)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:119)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:89)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
at org.springframework.cloud.openfeign.FeignCachingInvocationHandlerFactory$1.proceed(FeignCachingInvocationHandlerFactory.java:66)
at org.springframework.cache.interceptor.CacheInterceptor.lambda$invoke$0(CacheInterceptor.java:54)
at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:351)
at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:64)
at org.springframework.cloud.openfeign.FeignCachingInvocationHandlerFactory.lambda$create$1(FeignCachingInvocationHandlerFactory.java:53)

Problem analysis

The relevant source code:

public class RoundRobinLoadBalancer implements ReactorServiceInstanceLoadBalancer {

    final AtomicInteger position;

    public RoundRobinLoadBalancer(ObjectProvider<ServiceInstanceListSupplier> serviceInstanceListSupplierProvider,
            String serviceId) {
        this(serviceInstanceListSupplierProvider, serviceId, new Random().nextInt(1000));
    }
...
    private Response<ServiceInstance> getInstanceResponse(List<ServiceInstance> instances) {
        if (instances.isEmpty()) {
            if (log.isWarnEnabled()) {
                log.warn("No servers available for service: " + serviceId);
            }
            return new EmptyResponse();
        }
        // TODO: enforce order?
        int pos = Math.abs(this.position.incrementAndGet());
        // Line 104, where the exception is thrown
        ServiceInstance instance = instances.get(pos % instances.size());

        return new DefaultResponse(instance);
    }

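The immediate cause is that pos % instances.size() evaluated to a negative number: -3.

position is an AtomicInteger whose default initial value is a random number below 1000. It is well known that an int wraps around to a negative number when it overflows, but the code takes the absolute value, so how can the result still be negative?

The only remaining possibility is that Math.abs itself can return a negative number. Its source is very simple: for a negative argument it returns -a. This is not a JDK defect; the Math.abs Javadoc documents that for Integer.MIN_VALUE the result is that same negative value.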

    public static int abs(int a) {
        return (a < 0) ? -a : a;
    }
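
The edge case is easy to check in isolation. The following is a minimal sketch, with a class and helper name (AbsEdgeCase, absIsNegative) invented purely for illustration; it only shows that negating Integer.MIN_VALUE overflows back to the same value, so Math.abs is negative for exactly that one input.

```java
final class AbsEdgeCase {
    // Hypothetical helper: true when Math.abs returns a negative result for a.
    static boolean absIsNegative(int a) {
        return Math.abs(a) < 0;
    }

    public static void main(String[] args) {
        // Two's complement int has no positive counterpart for Integer.MIN_VALUE,
        // so -Integer.MIN_VALUE overflows back to Integer.MIN_VALUE.
        System.out.println(-Integer.MIN_VALUE == Integer.MIN_VALUE); // true
        System.out.println(Math.abs(Integer.MIN_VALUE));             // -2147483648
        System.out.println(absIsNegative(Integer.MIN_VALUE + 1));    // false
        System.out.println(absIsNegative(Integer.MAX_VALUE));        // false
    }
}
```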

Reproducing the problem

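import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

public class AbsOverflowRepro {
    public static void main(String[] args) {
        // Same seeding as RoundRobinLoadBalancer: a random start below 1000
        AtomicInteger num = new AtomicInteger(new Random().nextInt(1000));
        int numStart = num.get();
        for (long i = 0; i < Long.MAX_VALUE; i++) {
            int pos = Math.abs(num.incrementAndGet());
            // Only reachable when the counter has wrapped to Integer.MIN_VALUE
            if (pos < 0) {
                System.out.printf("numStart = %d, i = %d pos = %d pos %% 5 = %d%n", numStart, i, pos, pos % 5);
                break;
            }
        }
    }
}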

Output

numStart = 185, i = 2147483462 pos = -2147483648 pos % 5 = -3

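Root cause

With the reproduction in hand, the cause is clear. The failure occurs at loop index i = 2147483462: by then num has reached 185 + 2147483462 = 2147483647 (Integer.MAX_VALUE), and the next increment wraps it to -2147483648. Mathematically, the absolute value of -2147483648 is 2147483648, but anyone familiar with int knows its range is -2147483648 to 2147483647.

In other words, the absolute value 2147483648 does not fit in an int: the negation overflows and the result is again a negative number, -2147483648.

Taking the remainder of -2147483648 divided by 5 (the size of the route URL list for the outbound request in production) gives -3, and instances.get(-3) throws the IndexOutOfBoundsException.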
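
The sign of the result follows from how Java's % operator works: the remainder takes the sign of the dividend. As a rough sketch of that arithmetic (RemainderSign, buggyIndex and floorModIndex are names made up for this example), the code below reproduces the -3 and contrasts it with Math.floorMod, which is shown only as an illustrative alternative and is not what the upstream project chose.

```java
final class RemainderSign {
    // Index as computed on the 3.1.1 code path: abs first, then the % operator.
    static int buggyIndex(int position, int size) {
        return Math.abs(position) % size;
    }

    // Illustrative alternative (an assumption, not the upstream fix):
    // floorMod returns a value in [0, size) whenever size is positive.
    static int floorModIndex(int position, int size) {
        return Math.floorMod(position, size);
    }

    public static void main(String[] args) {
        int wrapped = Integer.MAX_VALUE + 1; // overflows to Integer.MIN_VALUE
        System.out.println(buggyIndex(wrapped, 5));    // -3
        System.out.println(floorModIndex(wrapped, 5)); // 2
    }
}
```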

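Solution

Because the counter has just wrapped, the next failure is a long way off: as long as the process keeps running, position must be incremented another 2^32 times (about 4.29 billion) before it reaches Integer.MIN_VALUE again. That leaves time to schedule an upgrade of spring-cloud-loadbalancer at a convenient moment, since newer versions have fixed the problem. See the upstream issue and PR: https://github.com/spring-cloud/spring-cloud-commons/pull/1077

Upstream source after the fix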

		// Ignore the sign bit, this allows pos to loop sequentially from 0 to
		// Integer.MAX_VALUE
		int pos = this.position.incrementAndGet() & Integer.MAX_VALUE;

		ServiceInstance instance = instances.get(pos % instances.size());

Why apply a bitwise AND with Integer.MAX_VALUE first?

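  1. A bitwise AND can stand in for the remainder only when the divisor is a power of two; for non-negative X: X % 2^n = X & (2^n - 1)
  2. Integer.MAX_VALUE = 2^31 - 1 = 2147483647

Masking with Integer.MAX_VALUE therefore clears the sign bit, which amounts to taking the counter modulo 2^31, so pos always lies in the range 0 to 2147483647 before pos % instances.size() is evaluated.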
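
To see the masking approach survive the overflow, here is a small sketch built around the same expression as the upstream fix. The surrounding class (MaskedRoundRobin), its constructor seed and the nextIndex method are assumptions made for this example and are not part of spring-cloud-loadbalancer.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class MaskedRoundRobin {
    private final AtomicInteger position;

    // Hypothetical constructor: lets the example start right before the overflow.
    MaskedRoundRobin(int seed) {
        this.position = new AtomicInteger(seed);
    }

    // Clearing the sign bit keeps pos in [0, Integer.MAX_VALUE],
    // so pos % size can never be negative.
    int nextIndex(int size) {
        int pos = position.incrementAndGet() & Integer.MAX_VALUE;
        return pos % size;
    }

    public static void main(String[] args) {
        // Start two steps before the wrap and walk across it.
        MaskedRoundRobin rr = new MaskedRoundRobin(Integer.MAX_VALUE - 2);
        for (int i = 0; i < 5; i++) {
            System.out.println(rr.nextIndex(5)); // prints 1, 2, 0, 1, 2
        }
    }
}
```

At the wrap the masked counter restarts from 0, so the rotation may skip some instances once per 2^31 calls; every index stays inside the list, which is what matters here.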