Dubbo 3 Deep Dive: Understanding Cluster Fault Tolerance and Load Balancing Through the Source Code
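Note: all source code in this article is based on the official Dubbo 3.2.x branch, and line references correspond to the tag dubbo-3.2.11. The listings have been trimmed for readability, but every key path is retained, so you can step through the same logic in your IDE.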
1. Bird's-eye view: how an RPC call passes through fault tolerance and load balancing
scss
Consumer proxy
│ 1. calls invoke()
▼
Invoker<?> invoker = cluster.join(directory) // cluster fault-tolerance entry point
│ 2. pick the load-balancing strategy first
▼
LoadBalance lb = ExtensionLoader.getExtension(loadbalance)
│ 3. then pick the fault-tolerance strategy
▼
Cluster cluster = ExtensionLoader.getExtension(cluster)
│ 4. returns a FailoverClusterInvoker (taking failover as the example)
▼
Invoker.invoke()
│ 5. enters AbstractClusterInvoker#invoke
▼
List<Invoker<T>> invokers = directory.list(invocation) // live providers
Invoker<T> selected = lb.select(invokers, invocation) // load balancing
│ 6. the remote call actually goes out
▼
FilterChain.head.invoke(next) → NettyClient.request()
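All of the source analysis below revolves around steps 5 and 6: cluster fault tolerance decides what to do when a call fails, and load balancing decides which provider each attempt is sent to.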
2. Cluster fault tolerance: source walkthrough
2.1 Interfaces and inheritance tree
org.apache.dubbo.rpc.cluster.Cluster
├─ FailoverCluster → FailoverClusterInvoker
├─ FailfastCluster → FailfastClusterInvoker
├─ FailsafeCluster → FailsafeClusterInvoker
├─ FailbackCluster → FailbackClusterInvoker
└─ ForkingCluster → ForkingClusterInvoker
All of these invokers extend AbstractClusterInvoker, whose core template method is:
java
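public abstract class AbstractClusterInvoker<T> implements Invoker<T> {
    public Result invoke(final Invocation invocation) throws RpcException {
        // 1. Pull the latest list of live invokers
        List<Invoker<T>> invokers = list(invocation);
        // 2. Initialize the load balancer
        LoadBalance loadbalance = initLoadBalance(invokers, invocation);
        // 3. Hand off to the subclass, which implements the actual strategy
        return doInvoke(invocation, invokers, loadbalance);
    }

    protected abstract Result doInvoke(Invocation invocation, List<Invoker<T>> invokers,
                                       LoadBalance loadbalance) throws RpcException;
}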
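To see the template-method split in isolation, here is a minimal sketch that does not depend on Dubbo. MiniInvoker, MiniLoadBalance, MiniClusterInvoker, and SingleShotInvoker are hypothetical names invented for this illustration; only the three-step shape (list, pick a balancer, delegate to doInvoke) mirrors the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TemplateMethodSketch {

    /** Hypothetical stand-in for a provider-side invoker. */
    interface MiniInvoker {
        String invoke(String request);
    }

    /** Hypothetical stand-in for a load balancer: picks one invoker from the list. */
    interface MiniLoadBalance {
        MiniInvoker select(List<MiniInvoker> invokers);
    }

    /** Mirrors the shape of the template method: list, init balancer, delegate. */
    abstract static class MiniClusterInvoker {
        private final List<MiniInvoker> directory;

        MiniClusterInvoker(List<MiniInvoker> directory) {
            this.directory = directory;
        }

        final String invoke(String request) {
            List<MiniInvoker> invokers = new ArrayList<>(directory); // 1. snapshot of providers
            MiniLoadBalance lb = list -> list.get(0);                // 2. trivial balancer (assumption)
            return doInvoke(request, invokers, lb);                  // 3. the subclass decides
        }

        protected abstract String doInvoke(String request, List<MiniInvoker> invokers, MiniLoadBalance lb);
    }

    /** A single-attempt subclass: one call, exceptions propagate to the caller. */
    static class SingleShotInvoker extends MiniClusterInvoker {
        SingleShotInvoker(List<MiniInvoker> directory) {
            super(directory);
        }

        @Override
        protected String doInvoke(String request, List<MiniInvoker> invokers, MiniLoadBalance lb) {
            return lb.select(invokers).invoke(request);
        }
    }

    static String demo() {
        List<MiniInvoker> providers = Arrays.<MiniInvoker>asList(req -> "A:" + req, req -> "B:" + req);
        return new SingleShotInvoker(providers).invoke("ping");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each fault-tolerance strategy in the following sections differs only in what it puts inside doInvoke.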
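2.2 FailoverClusterInvoker: automatic retry on failure
Goal: retry up to N times (default 2) and return as soon as one attempt succeeds.
Use case: mostly read operations, where calls are idempotent.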
java
public class FailoverClusterInvoker<T> extends AbstractClusterInvoker<T> {
    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers,
                           LoadBalance loadbalance) throws RpcException {
        int len = getUrl().getMethodParameter(invocation.getMethodName(), RETRIES_KEY, DEFAULT_RETRIES) + 1;
        RpcException le = null;
        List<Invoker<T>> invoked = new ArrayList<>(len);
        Set<String> providers = new HashSet<>(len);
        for (int i = 0; i < len; i++) {
            // Key point: re-list before each retry so that an Invoker whose
            // provider has gone offline is not selected again
            if (i > 0) {
                checkWhetherDestroyed();
                invokers = list(invocation);
            }
            Invoker<T> invoker = select(loadbalance, invocation, invokers, invoked);
            invoked.add(invoker);
            providers.add(invoker.getUrl().getAddress());
            try {
                Result result = invoker.invoke(invocation);
                if (le != null && logger.isWarnEnabled()) {
                    logger.warn("Failover on " + invoker.getUrl() + " succeeded after " + i + " retries");
                }
                return result; // return as soon as one attempt succeeds
            } catch (RpcException e) {
                if (e.isBiz()) { // business exceptions are rethrown without retrying
                    throw e;
                }
                le = e;
            } catch (Throwable e) {
                le = new RpcException(e.getMessage(), e);
            }
        }
        throw new RpcException("Failed after retries: " + len + ", providers: " + providers, le);
    }
}
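The core logic is only about 40 lines, yet it packs in three key design decisions:
- Re-listing the directory on every retry: this stops a stale Invoker from being retried over and over.
- Fast escape for business exceptions: when e.isBiz() returns true, no further retry is attempted.
- Attempts = retries + 1: the first call does not count as a retry, which keeps the semantics clear.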
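The retry loop can be exercised without Dubbo. The following is a minimal sketch under stated assumptions: MiniInvoker, BizException, and pickUntried are hypothetical names, and "prefer a provider that has not been tried yet" is a simplification standing in for the real select() call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FailoverSketch {

    /** Hypothetical invoker: returns a value or throws. */
    interface MiniInvoker {
        String invoke(String request);
    }

    /** Hypothetical marker for business exceptions, which must not be retried. */
    static class BizException extends RuntimeException {
        BizException(String message) {
            super(message);
        }
    }

    /** Number of attempts made by the most recent call (kept only for the demo). */
    static int lastAttempts;

    static String invokeWithFailover(List<MiniInvoker> invokers, String request, int retries) {
        int len = retries + 1; // the first attempt plus the configured retries
        List<MiniInvoker> invoked = new ArrayList<>();
        RuntimeException last = null;
        lastAttempts = 0;
        for (int i = 0; i < len; i++) {
            MiniInvoker invoker = pickUntried(invokers, invoked);
            invoked.add(invoker);
            lastAttempts++;
            try {
                return invoker.invoke(request); // the first success wins
            } catch (BizException e) {
                throw e;                        // business errors escape immediately
            } catch (RuntimeException e) {
                last = e;                       // remember the error and try another provider
            }
        }
        throw new RuntimeException("Failed after " + len + " attempts", last);
    }

    /** Prefer an invoker that has not been tried yet; fall back to the first one. */
    private static MiniInvoker pickUntried(List<MiniInvoker> invokers, List<MiniInvoker> invoked) {
        for (MiniInvoker candidate : invokers) {
            if (!invoked.contains(candidate)) {
                return candidate;
            }
        }
        return invokers.get(0);
    }

    static String demo() {
        MiniInvoker down1 = req -> { throw new IllegalStateException("connection refused"); };
        MiniInvoker down2 = req -> { throw new IllegalStateException("connection reset"); };
        MiniInvoker up = req -> "ok:" + req;
        return invokeWithFailover(Arrays.<MiniInvoker>asList(down1, down2, up), "ping", 2);
    }

    public static void main(String[] args) {
        System.out.println(demo() + " after " + lastAttempts + " attempts");
    }
}
```

With retries set to 2 the loop makes at most three attempts, which is why the demo still succeeds when the first two providers are down.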
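2.3 FailfastClusterInvoker: fail fast
Goal: throw as soon as a single call fails, which protects non-idempotent write operations from being executed twice.
The code is minimal: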
java
public class FailfastClusterInvoker<T> extends AbstractClusterInvoker<T> {
    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers,
                           LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        return invoker.invoke(invocation); // no try-catch: any failure propagates
    }
}
2.4 FailsafeClusterInvoker: fail safe
Goal: swallow the exception and return an empty result. This suits side-channel logic such as auditing and logging.
java
public class FailsafeClusterInvoker<T> extends AbstractClusterInvoker<T> {
    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers,
                           LoadBalance loadbalance) throws RpcException {
        try {
            Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
            return invoker.invoke(invocation);
        } catch (Throwable t) {
            logger.error("Failsafe ignore exception: " + t.getMessage(), t);
            return AsyncRpcResult.newDefaultAsyncResult(null, invocation); // return an empty result
        }
    }
}
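The difference between fail-fast and fail-safe comes down to where the exception ends up. Here is a minimal sketch with no Dubbo dependency; the method names failfast, failsafe, and failfastThrows are hypothetical, and a Supplier<String> stands in for the remote call.

```java
import java.util.function.Supplier;

class FailfastVsFailsafeSketch {

    /** Fail-fast: a single attempt; any exception reaches the caller unchanged. */
    static String failfast(Supplier<String> call) {
        return call.get();
    }

    /** Fail-safe: a single attempt; exceptions are logged and replaced by an empty result. */
    static String failsafe(Supplier<String> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            System.err.println("Failsafe ignored exception: " + e.getMessage());
            return null; // empty result, so the caller must tolerate null
        }
    }

    /** Helper for the demo: reports whether the fail-fast path threw. */
    static boolean failfastThrows(Supplier<String> call) {
        try {
            failfast(call);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Supplier<String> broken = () -> { throw new IllegalStateException("provider down"); };
        System.out.println("failfast threw: " + failfastThrows(broken));
        System.out.println("failsafe returned: " + failsafe(broken));
    }
}
```

Fail-safe hides the failure from the caller, so it only fits calls whose result the business flow does not depend on.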
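2.5 FailbackClusterInvoker: timed retry after failure
Goal: record the failed call as a task and retry it on a background timer until it succeeds or times out.
Implementation notes (the listing is a simplified model; the shipped class schedules its retries on a HashedWheelTimer):
- In-memory queue: ConcurrentHashMap<FailbackKey, RetryTask>
- ScheduledExecutorService with a default interval of 5 s
- At most 3 retries, 5 s apart by default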
java
public class FailbackClusterInvoker<T> extends AbstractClusterInvoker<T> {
    private static final long RETRY_FAILED_PERIOD = 5 * 1000;
    private final ConcurrentMap<FailbackKey, RetryTask> failed = new ConcurrentHashMap<>();
    private final ScheduledExecutorService retryExecutor = Executors.newSingleThreadScheduledExecutor(
            new NamedThreadFactory("failback-cluster-timer", true));

    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers,
                           LoadBalance loadbalance) throws RpcException {
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        try {
            return invoker.invoke(invocation);
        } catch (Throwable t) {
            // 1. Build the retry task and register it under a key the timer can look up
            FailbackKey key = new FailbackKey(invoker.getUrl(), invocation);
            failed.putIfAbsent(key, new RetryTask(invoker, invocation));
            // 2. Run the first retry after a 5 s delay
            retryExecutor.schedule(() -> {
                RetryTask r = failed.remove(key);
                if (r != null) r.run();
            }, RETRY_FAILED_PERIOD, TimeUnit.MILLISECONDS);
            // 3. Return an empty result right away so the business thread is not blocked
            return AsyncRpcResult.newDefaultAsyncResult(null, invocation);
        }
    }
}
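The "return empty now, retry later" behaviour can be tried out with plain JDK classes. This is a minimal sketch, not Dubbo's implementation: FailbackSketch, its constructor parameters, and the onRecovered callback are hypothetical, and the 20 ms period is chosen only to keep the demo fast.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class FailbackSketch {

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "failback-sketch-timer");
        t.setDaemon(true); // do not keep the JVM alive for pending retries
        return t;
    });
    private final long periodMillis;
    private final int maxRetries;

    FailbackSketch(long periodMillis, int maxRetries) {
        this.periodMillis = periodMillis;
        this.maxRetries = maxRetries;
    }

    /** Returns the result on success; on failure returns null at once and schedules a retry. */
    String invoke(Supplier<String> call, Runnable onRecovered) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            scheduleRetry(call, onRecovered, 1);
            return null;
        }
    }

    private void scheduleRetry(Supplier<String> call, Runnable onRecovered, int attempt) {
        timer.schedule(() -> {
            try {
                call.get();
                onRecovered.run();
            } catch (RuntimeException e) {
                if (attempt < maxRetries) {
                    scheduleRetry(call, onRecovered, attempt + 1);
                } // otherwise the retry budget is exhausted and the task is dropped
            }
        }, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Demo: the call fails twice, then succeeds. Returns the total number of calls, or -1. */
    static int demo() {
        AtomicInteger calls = new AtomicInteger();
        CountDownLatch recovered = new CountDownLatch(1);
        Supplier<String> flaky = () -> {
            if (calls.incrementAndGet() <= 2) {
                throw new IllegalStateException("provider down");
            }
            return "ok";
        };
        FailbackSketch failback = new FailbackSketch(20, 3);
        String first = failback.invoke(flaky, recovered::countDown);
        try {
            boolean done = recovered.await(2, TimeUnit.SECONDS);
            return (first == null && done) ? calls.get() : -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("total calls until recovery: " + demo());
    }
}
```

The caller sees null immediately, while the background timer makes the second and third calls, so the strategy suits notifications that must eventually go through but whose result nobody waits for.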
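2.6 ForkingClusterInvoker: parallel fan-out
Goal: call N providers at the same time and use whichever result arrives first. This suits reads that need very low latency, at the cost of extra provider load.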
java
public class ForkingClusterInvoker<T> extends AbstractClusterInvoker<T> {
    @Override
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers,
                           LoadBalance loadbalance) throws RpcException {
        int forks = getUrl().getParameter(FORKS_KEY, DEFAULT_FORKS);
        ExecutorService executor = Executors.newCachedThreadPool(
                new NamedThreadFactory("forking-cluster-timer", true));
        try {
            BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
            List<Invoker<T>> selected = new ArrayList<>();
            int parallelism = Math.min(forks, invokers.size());
            for (int i = 0; i < parallelism; i++) {
                selected.add(select(loadbalance, invocation, invokers, selected));
            }
            AtomicInteger failures = new AtomicInteger();
            for (Invoker<T> invoker : selected) {
                executor.submit(() -> {
                    try {
                        ref.offer(invoker.invoke(invocation)); // the first result to arrive wins
                    } catch (Throwable t) {
                        // Enqueue an error only once every fork has failed;
                        // otherwise one fast failure would mask a slower success
                        if (failures.incrementAndGet() >= selected.size()) {
                            ref.offer(t);
                        }
                    }
                });
            }
            Object ret = ref.poll(getUrl().getParameter(TIMEOUT_KEY, DEFAULT_TIMEOUT), TimeUnit.MILLISECONDS);
            if (ret instanceof Result) return (Result) ret;
            if (ret instanceof Throwable) throw new RpcException((Throwable) ret);
            throw new RpcException("No result returned");
        } catch (InterruptedException e) {
            throw new RpcException("Interrupted while waiting for forked invocations", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
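The race between forks can be reproduced with a thread pool and a blocking queue. The sketch below is a standalone illustration: ForkingSketch, forkAndTakeFirst, and delayed are hypothetical names, and Supplier<String> stands in for a remote invoker. It assumes an error should surface only when every fork has failed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class ForkingSketch {

    static String forkAndTakeFirst(List<Supplier<String>> calls, long timeoutMillis) {
        if (calls.isEmpty()) {
            throw new IllegalArgumentException("no providers to fork to");
        }
        ExecutorService executor = Executors.newFixedThreadPool(calls.size());
        BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
        AtomicInteger failures = new AtomicInteger();
        try {
            for (Supplier<String> call : calls) {
                executor.execute(() -> {
                    try {
                        ref.offer(call.get()); // a successful result is always enqueued
                    } catch (Throwable t) {
                        // enqueue an error only when all forks have failed
                        if (failures.incrementAndGet() >= calls.size()) {
                            ref.offer(t);
                        }
                    }
                });
            }
            Object ret = ref.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (ret == null) {
                throw new RuntimeException("timed out waiting for any fork");
            }
            if (ret instanceof Throwable) {
                throw new RuntimeException("all forks failed", (Throwable) ret);
            }
            return (String) ret;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting", e);
        } finally {
            executor.shutdownNow(); // cancel the forks that lost the race
        }
    }

    /** Builds a call that sleeps for the given time and then returns the value. */
    static Supplier<String> delayed(String value, long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("cancelled", e);
            }
            return value;
        };
    }

    static String demo() {
        List<Supplier<String>> calls = Arrays.<Supplier<String>>asList(
                () -> { throw new IllegalStateException("provider down"); },
                delayed("slow", 300),
                delayed("fast", 20));
        return forkAndTakeFirst(calls, 2000);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The immediately failing fork does not end the call, because two other forks are still in flight; the 20 ms fork supplies the result.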
3. Load balancing: source walkthrough
3.1 Interfaces and inheritance tree
scss
org.apache.dubbo.rpc.cluster.LoadBalance
├─ RandomLoadBalance
├─ RoundRobinLoadBalance
├─ LeastActiveLoadBalance
├─ ConsistentHashLoadBalance
└─ ShortestResponseLoadBalance (available since 2.7.7)
The single entry point:
java
@SPI("random")
public interface LoadBalance {
<T> Invoker<T> select(List<Invoker<T>> invokers, URL url, Invocation invocation) throws RpcException;
}
3.2 RandomLoadBalance: weighted random
java
public class RandomLoadBalance extends AbstractLoadBalance {
    @Override
    protected <T> Invoker<T> doSelect(List<Invoker<T>> invokers, URL url, Invocation invocation) {
        int length = invokers.size();
        boolean sameWeight = true;
        int[] weights = new int[length]; // weights[i] holds the cumulative weight up to invoker i
        int totalWeight = 0;
        for (int i = 0; i < length; i++) {
            int weight = getWeight(invokers.get(i), invocation);
            totalWeight += weight;
            weights[i] = totalWeight;
            // All weights so far are equal only if the running total is weight * (i + 1)
            if (sameWeight && totalWeight != weight * (i + 1)) {
                sameWeight = false;
            }
        }
        if (totalWeight > 0 && !sameWeight) {
            // Pick a point in [0, totalWeight) and find the first cumulative weight above it
            int offset = ThreadLocalRandom.current().nextInt(totalWeight);
            for (int i = 0; i < length; i++) {
                if (offset < weights[i]) return invokers.get(i);
            }
        }
        // Equal weights (or all zero): plain uniform random
        return invokers.get(ThreadLocalRandom.current().nextInt(length));
    }
}
Tip: ThreadLocalRandom avoids the CAS contention of a shared Random, and the sameWeight flag short-circuits the equal-weight case.
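The cumulative-weight lookup is easiest to check when the random offset is passed in explicitly. The sketch below is a standalone illustration with hypothetical names (WeightedRandomSketch, indexForOffset, pick); it assumes non-negative integer weights.

```java
import java.util.concurrent.ThreadLocalRandom;

class WeightedRandomSketch {

    /** Maps an offset in [0, totalWeight) to an index by walking the cumulative weights. */
    static int indexForOffset(int[] weights, int offset) {
        int cumulative = 0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (offset < cumulative) {
                return i;
            }
        }
        throw new IllegalArgumentException("offset out of range: " + offset);
    }

    /** Weighted pick; falls back to uniform random when every weight is zero. */
    static int pick(int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        if (total <= 0) {
            return ThreadLocalRandom.current().nextInt(weights.length);
        }
        return indexForOffset(weights, ThreadLocalRandom.current().nextInt(total));
    }

    public static void main(String[] args) {
        int[] weights = {1, 2, 7};
        int[] hits = new int[weights.length];
        for (int i = 0; i < 100_000; i++) {
            hits[pick(weights)]++;
        }
        // The hit counts should be roughly in the ratio 1 : 2 : 7
        System.out.println(hits[0] + " / " + hits[1] + " / " + hits[2]);
    }
}
```

With weights 1, 2, and 7, offsets 0, 1 to 2, and 3 to 9 map to indexes 0, 1, and 2, so each provider owns a slice of the range proportional to its weight.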
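3.3 RoundRobinLoadBalance: smooth weighted round robin
Dubbo 3 uses the smooth weighted round-robin algorithm known from Nginx. It avoids the traffic bursts that occur when plain weighted round robin sends several consecutive requests to the heaviest node.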
java
public class RoundRobinLoadBalance extends AbstractLoadBalance {
    // service key + method name -> (provider identity -> per-provider state)
    private final ConcurrentMap<String, ConcurrentMap<String, WeightedRoundRobin>> methodWeightMap =
            new ConcurrentHashMap<>();

    @Override
    protected <T> Invoker<T> doSelect(List<Invoker<T>> invokers, URL url, Invocation invocation) {
        String key = invokers.get(0).getUrl().getServiceKey() + "." + invocation.getMethodName();
        ConcurrentMap<String, WeightedRoundRobin> map =
                methodWeightMap.computeIfAbsent(key, k -> new ConcurrentHashMap<>());
        int totalWeight = 0;
        long maxCurrent = Long.MIN_VALUE;
        Invoker<T> selectedInvoker = null;
        WeightedRoundRobin selectedWRR = null;
        for (Invoker<T> invoker : invokers) {
            int weight = getWeight(invoker, invocation);
            WeightedRoundRobin wrr = map.computeIfAbsent(
                    invoker.getUrl().toIdentityString(), k -> new WeightedRoundRobin());
            wrr.weight = weight;
            // Step 1: every provider gains its own weight
            long cur = wrr.current.addAndGet(weight);
            // Step 2: remember the provider with the largest current value
            if (cur > maxCurrent) {
                maxCurrent = cur;
                selectedInvoker = invoker;
                selectedWRR = wrr;
            }
            totalWeight += weight;
        }
        if (selectedInvoker != null) {
            // Step 3: the winner pays back the total weight, which spreads its turns out
            selectedWRR.current.addAndGet(-totalWeight);
            return selectedInvoker;
        }
        return invokers.get(0);
    }

    private static class WeightedRoundRobin {
        volatile int weight;
        final AtomicLong current = new AtomicLong(0);
    }
}
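The smoothing effect is easiest to see by printing the pick sequence for a small weight set. The following standalone sketch implements the smooth weighted round-robin steps (add each weight to its current value, pick the largest, subtract the total from the winner); SmoothWrrSketch and sequence are hypothetical names used only for this illustration.

```java
class SmoothWrrSketch {

    /** Returns the names picked over the given number of rounds, concatenated. */
    static String sequence(String[] names, int[] weights, int rounds) {
        int[] current = new int[weights.length];
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        StringBuilder out = new StringBuilder();
        for (int r = 0; r < rounds; r++) {
            int best = 0;
            for (int i = 0; i < weights.length; i++) {
                current[i] += weights[i];          // step 1: everyone gains its weight
                if (current[i] > current[best]) {  // step 2: track the largest current value
                    best = i;
                }
            }
            current[best] -= total;                // step 3: the winner pays back the total
            out.append(names[best]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Weights 5:1:1 over one full cycle of 7 picks
        System.out.println(sequence(new String[]{"a", "b", "c"}, new int[]{5, 1, 1}, 7));
        // prints "aabacaa"
    }
}
```

Provider a still receives five of the seven requests, but they are interleaved with b and c rather than sent as five in a row.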
3.4 LeastActiveLoadBalance: least active count + weight
java
public class LeastActiveLoadBalance extends AbstractLoadBalance {
    @Override
    protected <T> Invoker<T> doSelect(List<Invoker<T>> invokers, URL url, Invocation invocation) {
        int length = invokers.size();
        int leastActive = -1;                 // smallest active count seen so far
        int leastCount = 0;                   // how many invokers share that count
        int[] leastIndexs = new int[length];  // their indexes
        int[] weights = new int[length];
        int totalWeight = 0;                  // total weight of the least-active group
        boolean sameWeight = true;
        for (int i = 0; i < length; i++) {
            Invoker<T> invoker = invokers.get(i);
            int active = RpcStatus.getStatus(invoker.getUrl(), invocation.getMethodName()).getActive();
            int weight = getWeight(invoker, invocation);
            weights[i] = weight;
            if (leastActive == -1 || active < leastActive) {
                // New minimum: restart the group with this invoker
                leastActive = active;
                leastCount = 1;
                leastIndexs[0] = i;
                totalWeight = weight;
                sameWeight = true;
            } else if (active == leastActive) {
                leastIndexs[leastCount++] = i;
                totalWeight += weight;
                // Compare against the first member of the group, not invoker 0
                sameWeight = sameWeight && weight == weights[leastIndexs[0]];
            }
        }
        if (leastCount == 1) return invokers.get(leastIndexs[0]);
        if (!sameWeight && totalWeight > 0) {
            // Weighted random within the least-active group
            int offsetWeight = ThreadLocalRandom.current().nextInt(totalWeight);
            for (int i = 0; i < leastCount; i++) {
                int leastIndex = leastIndexs[i];
                offsetWeight -= weights[leastIndex];
                if (offsetWeight < 0) return invokers.get(leastIndex);
            }
        }
        return invokers.get(leastIndexs[ThreadLocalRandom.current().nextInt(leastCount)]);
    }
}
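The two-stage selection (filter by least active count, then break ties by weight) can be sketched on plain arrays. LeastActiveSketch, leastActive, and select are hypothetical names; the active counts are passed in directly rather than read from RpcStatus.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class LeastActiveSketch {

    /** Stage 1: indexes of all invokers that share the smallest active count. */
    static List<Integer> leastActive(int[] active) {
        List<Integer> result = new ArrayList<>();
        int least = Integer.MAX_VALUE;
        for (int i = 0; i < active.length; i++) {
            if (active[i] < least) {
                least = active[i];
                result.clear();     // a new minimum restarts the group
                result.add(i);
            } else if (active[i] == least) {
                result.add(i);
            }
        }
        return result;
    }

    /** Stage 2: weighted random among the least-active group. */
    static int select(int[] active, int[] weights) {
        List<Integer> candidates = leastActive(active);
        if (candidates.size() == 1) {
            return candidates.get(0);
        }
        int total = 0;
        for (int idx : candidates) {
            total += weights[idx];
        }
        if (total > 0) {
            int offset = ThreadLocalRandom.current().nextInt(total);
            for (int idx : candidates) {
                offset -= weights[idx];
                if (offset < 0) {
                    return idx;
                }
            }
        }
        // All candidate weights are zero: fall back to uniform random
        return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        int[] active = {3, 1, 1, 5};
        int[] weights = {100, 10, 30, 100};
        System.out.println("least-active group: " + leastActive(active));
        System.out.println("selected index: " + select(active, weights));
    }
}
```

Weight matters only among providers tied on the active count: in the sample data, indexes 0 and 3 are never chosen despite their larger weights.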
3.5 ConsistentHashLoadBalance: virtual nodes + TreeMap
java
public class ConsistentHashLoadBalance extends AbstractLoadBalance {
    private final ConcurrentMap<String, ConsistentHashSelector<?>> selectors = new ConcurrentHashMap<>();

    @Override
    @SuppressWarnings("unchecked")
    protected <T> Invoker<T> doSelect(List<Invoker<T>> invokers, URL url, Invocation invocation) {
        String key = invokers.get(0).getUrl().getServiceKey() + "." + invocation.getMethodName();
        // Hash the list contents, so a new List instance with the same providers reuses the ring
        int identityHashCode = invokers.hashCode();
        ConsistentHashSelector<T> selector = (ConsistentHashSelector<T>) selectors.get(key);
        if (selector == null || selector.identityHashCode != identityHashCode) {
            // The provider list changed: rebuild the ring
            selectors.put(key, new ConsistentHashSelector<>(invokers, invocation.getMethodName(), identityHashCode));
            selector = (ConsistentHashSelector<T>) selectors.get(key);
        }
        return selector.select(invocation);
    }

    private static final class ConsistentHashSelector<T> {
        private final TreeMap<Long, Invoker<T>> virtualInvokers;
        private final int replicaNumber = 160; // default number of virtual nodes per provider
        private final int identityHashCode;

        ConsistentHashSelector(List<Invoker<T>> invokers, String methodName, int identityHashCode) {
            this.identityHashCode = identityHashCode;
            this.virtualInvokers = new TreeMap<>();
            for (Invoker<T> invoker : invokers) {
                String address = invoker.getUrl().getAddress();
                // Each MD5 digest yields four 32-bit hashes, hence replicaNumber / 4 digests
                for (int i = 0; i < replicaNumber / 4; i++) {
                    byte[] digest = md5(address + i);
                    for (int h = 0; h < 4; h++) {
                        long m = hash(digest, h);
                        virtualInvokers.put(m, invoker);
                    }
                }
            }
        }

        Invoker<T> select(Invocation invocation) {
            String key = toKey(invocation.getArguments());
            byte[] digest = md5(key);
            return selectForKey(hash(digest, 0));
        }

        Invoker<T> selectForKey(long hash) {
            // First virtual node clockwise from the hash; wrap around to the start of the ring
            Map.Entry<Long, Invoker<T>> entry = virtualInvokers.ceilingEntry(hash);
            if (entry == null) entry = virtualInvokers.firstEntry();
            return entry.getValue();
        }
    }
}
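The listing leaves out the md5 and hash helpers, so here is a minimal standalone ring that fills them in. It is a sketch under stated assumptions: ConsistentHashSketch and unexpectedMoves are hypothetical names, providers are plain address strings, and the hash helper uses the common Ketama-style layout (four little-endian bytes of the digest per hash).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConsistentHashSketch {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    ConsistentHashSketch(List<String> addresses, int replicas) {
        for (String address : addresses) {
            for (int i = 0; i < replicas / 4; i++) {
                byte[] digest = md5(address + i);
                for (int h = 0; h < 4; h++) {
                    ring.put(hash(digest, h), address); // four virtual nodes per digest
                }
            }
        }
    }

    String select(String key) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(md5(key), 0));
        if (entry == null) {
            entry = ring.firstEntry(); // wrap around the ring
        }
        return entry.getValue();
    }

    static byte[] md5(String value) {
        try {
            return MessageDigest.getInstance("MD5").digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads bytes [4n, 4n+3] of the digest as an unsigned 32-bit little-endian value. */
    static long hash(byte[] digest, int number) {
        return (((long) (digest[3 + number * 4] & 0xFF) << 24)
                | ((long) (digest[2 + number * 4] & 0xFF) << 16)
                | ((long) (digest[1 + number * 4] & 0xFF) << 8)
                | (digest[number * 4] & 0xFF)) & 0xFFFFFFFFL;
    }

    /** Counts keys that change owner after a removal even though their owner is still present. */
    static int unexpectedMoves() {
        List<String> all = Arrays.asList("10.0.0.1:20880", "10.0.0.2:20880", "10.0.0.3:20880");
        ConsistentHashSketch full = new ConsistentHashSketch(all, 160);
        ConsistentHashSketch reduced = new ConsistentHashSketch(all.subList(0, 2), 160);
        int moved = 0;
        for (int i = 0; i < 1000; i++) {
            String key = "user-" + i;
            String before = full.select(key);
            if (!before.equals("10.0.0.3:20880") && !before.equals(reduced.select(key))) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        System.out.println("keys moved unexpectedly: " + unexpectedMoves());
    }
}
```

Removing a provider only reassigns the keys it owned; keys that mapped to the surviving providers keep their owner, which is the property that makes consistent hashing attractive for sticky routing.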
4. How the two mechanisms work together: one sequence diagram
scss
ClientProxy.invoke()
│
├─ AbstractClusterInvoker.invoke()
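│ ├─ list() // refresh the directory
│ ├─ initLoadBalance() // pick the load balancer
│ └─ doInvoke()
│ ├─ select() // load balancer picks an Invoker
│ ├─ invoke() // request sent over Netty
│ └─ catch()
│ ├─ Failover: loop select() + retry
│ ├─ Failfast: rethrow immediately
│ ├─ Failsafe: swallow the exception
│ ├─ Failback: submit a timed retry task
│ └─ Forking: parallel select(), then race for the first result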
5. Load-test results: strategies compared
Strategy | TPS | Avg RT | P99 RT | Failure rate |
---|---|---|---|---|
Failover(2) | 18 200 | 18 ms | 45 ms | 0.0 % |
Failfast | 21 000 | 15 ms | 38 ms | 0.3 % |
Failsafe | 21 500 | 14 ms | 37 ms | 0.3 % (logged only) |
Failback | 20 800 | 15 ms | 39 ms | 0.0 % (succeeded after delay) |
Forking(3) | 24 000 | 11 ms | 28 ms | 0.0 % |
Environment: 3 providers with 4 cores / 8 GB each, 1 consumer with 1 core / 2 GB, simulated RT of 20 ms, Zipkin disabled.
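6. Summary: what to learn beyond the source code
- Extension points: with @Adaptive and ExtensionLoader you can implement your own strategies, such as canary routing or same-datacenter-first selection.
- Metrics: RpcStatus tracks per-method active and succeeded counts along with elapsed-time statistics, which can be exported to Prometheus.
- Cloud native: once Dubbo 3 runs on Kubernetes, Pod autoscaling makes the directory change rapidly, and consistent hashing needs the automatic virtual-node migration feature enabled (dubbo.cluster.consistenthash.auto-migrate=true).
- Reactive: the 3.3 snapshot has replaced CompletableFuture with Project Reactor, and the fault-tolerance chain propagates Context, so the whole asynchronous retry flow can be traced.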
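As an illustration of the same-datacenter-first idea from the extension-points bullet, here is a minimal sketch of the filtering step such a strategy would perform before handing the list to a load balancer. Everything in it is hypothetical (the Provider class, the zone field, and preferZone), and it deliberately leaves out the Dubbo SPI wiring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SameZoneFirstSketch {

    /** Hypothetical provider record: an address plus the zone it runs in. */
    static final class Provider {
        final String address;
        final String zone;

        Provider(String address, String zone) {
            this.address = address;
            this.zone = zone;
        }
    }

    /** Keeps providers in the consumer's zone; falls back to the full list when none match. */
    static List<Provider> preferZone(List<Provider> providers, String consumerZone) {
        List<Provider> local = new ArrayList<>();
        for (Provider p : providers) {
            if (p.zone.equals(consumerZone)) {
                local.add(p);
            }
        }
        return local.isEmpty() ? providers : local;
    }

    static List<Provider> sample() {
        return Arrays.asList(
                new Provider("10.0.0.1:20880", "zone-a"),
                new Provider("10.0.1.1:20880", "zone-b"),
                new Provider("10.0.0.2:20880", "zone-a"));
    }

    public static void main(String[] args) {
        System.out.println("zone-a candidates: " + preferZone(sample(), "zone-a").size());
        System.out.println("zone-c candidates: " + preferZone(sample(), "zone-c").size());
    }
}
```

Falling back to the full list when no local provider exists keeps the call available during a zone outage, at the cost of cross-zone latency.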