A Deep Dive into Replica Selection for ClickHouse Distributed Tables: How Custom Key and Sample Are Executed

Table of Contents

Preface
How a Replica Is Selected Within a Shard When Querying a Distributed Table
Shard, Replica, and Connection Pool
- Building the Pools at Cluster Startup
Basic Execution Logic of a Query
- ReadFromRemote and RemoteQueryExecutor
- How ParallelReplicasMode and PoolMode Are Decided
- Constructing the RemoteQueryExecutor
- How ConnectionPoolWithFailover Probes and Picks Replicas
- The Two IConnectionPool Layers: ConnectionPoolWithFailover and ConnectionPool
- The LoadBalance Closure
- The Failover Framework: PoolWithFailoverBase<TNestedPool>
- The TryGetEntryFunction Closure
Sample in ClickHouse
Custom Key in ClickHouse
- Parallel Replica Custom Filter on the Initiator
- Remote Replicas Adding Their Own Custom Key Filter
Related Background Knowledge

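Preface

We often use ClickHouse Distributed (Dist) tables to run distributed queries across a cluster. Given ClickHouse's notions of shard and replica, how does ClickHouse avoid reading the same data twice in a multi-shard, multi-replica deployment? In the simplest terms: is the data of one shard served by a single replica, or do all replicas contribute data?

With that question in mind, this article traces, at the code level, how a Dist table query handles data. Sample queries and Custom Key queries assign and distribute the query work differently, and both are examined in detail.

Along the way, the article also summarizes some C++ background: aggregate types, closures, and private inheritance.

How a Replica Is Selected Within a Shard When Querying a Distributed Table

As I explained in another article, a Distributed table relies on the cluster definition in Clusters.xml to learn the topology of the whole cluster. For example: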

```sql
-- Create the Distributed table
CREATE TABLE distributed_mytable AS mytable
ENGINE = Distributed(cluster_name, mydb, mytable, rand());
```

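With the Clusters.xml configuration below, a query on the Dist table reads the data of each shard only once. Part of that data may come from one replica and part from another, but to the user it looks as if it came from a single replica. The query does not return the data of every replica of every shard, which would produce duplicates: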

```xml
<!-- Cluster configuration: the servers are divided into 2 shards -->
<remote_servers>
    <my_cluster>
        <shard>  <!-- Shard 1 of the cluster -->
            <replica>
                <host>server1</host>
            </replica>
            <replica>
                <host>server2</host>
            </replica>
        </shard>
        <shard>  <!-- Shard 2 of the cluster -->
            <replica>
                <host>server3</host>
            </replica>
            <replica>
                <host>server4</host>
            </replica>
        </shard>
    </my_cluster>
</remote_servers>
```

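So the question is: in a multi-shard, multi-replica deployment, how does ClickHouse avoid reading the same data twice?

Two broad strategies come to mind:

- For each shard, pick one replica by some selection policy and read from it.
- For each shard, split the data by some policy, read from several replicas in parallel, and assemble the complete data of the shard.

ClickHouse supports both, each under its own conditions.

The basic questions we need to answer are therefore:

When a Distributed query reaches a shard, does it go on to read all replicas of that shard, or only one?
- Answer: the max_parallel_replicas setting decides whether parallel replicas are enabled.
If only one replica is queried, how is that replica chosen? And if it goes down, how does the system switch to another replica of the same shard to stay available?
- Answer: ClickHouse has its own replica selection policies, based on criteria such as whether the replica has the queried table and whether it is up to date.
As shown later, max_parallel_replicas controls whether the parallel replica strategy is used. The follow-up question is: once parallel replicas are on and several replicas of a shard read together, why are the results not duplicated?
- Answer: several replicas of a shard can take part in the read, but they do not read the same data repeatedly. The work is split by rule and each replica reads only its own part.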
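The "one replica represents the shard" answer can be pictured with a small sketch. It is only an illustration under stated assumptions: `ReplicaState` and `pickReplicaForShard` are hypothetical names that do not exist in ClickHouse, and the real selection (covered later via ConnectionPoolWithFailover) also weighs load balancing and error counts. The sketch walks the replicas in order, skips those that are unreachable or lack the table, prefers an up-to-date replica, and falls back to a stale one only if nothing better is found.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified view of one replica as seen by the initiator.
struct ReplicaState
{
    std::string host;
    bool reachable;   // could a connection be established?
    bool has_table;   // does the replica have the queried table?
    bool up_to_date;  // is the replica fresh enough for this query?
};

// Returns the host chosen to represent the shard, or std::nullopt if no replica is usable.
// Preference: the first usable up-to-date replica, otherwise the first usable stale one.
std::optional<std::string> pickReplicaForShard(const std::vector<ReplicaState> & replicas)
{
    std::optional<std::string> stale_fallback;
    for (const auto & replica : replicas)
    {
        if (!replica.reachable || !replica.has_table)
            continue; // fail over: skip this replica and try the next one
        if (replica.up_to_date)
            return replica.host;
        if (!stale_fallback)
            stale_fallback = replica.host;
    }
    return stale_fallback;
}
```

With server1 unreachable and server2 healthy, the function returns server2, which is the failover behaviour the second answer describes.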

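Below we follow the source code to answer all of these questions. The focus is the read path of a Distributed table, in particular how a replica is chosen within a shard and how parallel replica mode reads every row exactly once.

In summary, the rest of the article shows that:

Without parallel replicas, the basic semantics of a Distributed query is not "query every replica of a shard" but "each shard produces one result, and one suitable replica executes on behalf of the shard".
With parallel replicas, several replicas of a shard can take part in the read, but they do not read the same data repeatedly. The work is split by rule and each replica reads only its own part.

These two mechanisms are related to each other, and they also blend in the Sample and Custom Key mechanisms, as explained below.

First, though, we look at ConnectionPoolWithFailover, because it is the foundation for choosing a replica within a shard: one ConnectionPoolWithFailover object represents the connection pools of all replicas of a shard.

ConnectionPoolWithFailover is initialized while the ClickHouse cluster definition is being initialized.

Shard, Replica, and Connection Pool

Building the Pools at Cluster Startup

When ClickHouse starts, it collects all replicas of the same shard and wraps them in a single ConnectionPoolWithFailover. "Failover" here means failing over between the replicas of that shard.

As the code below shows, each server reads the cluster configuration at startup and loads the information of every shard in the cluster. The connection information of a shard is stored in a ConnectionPoolWithFailover object and placed into the corresponding ShardInfo: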

```cpp
void Cluster::addShard( // at startup, build the information of all replicas of one shard
    const Settings & settings,
    Addresses addresses,
    bool treat_local_as_remote,
    UInt32 current_shard_num,
    UInt32 weight,
    ShardInfoInsertPathForInternalReplication insert_paths,
    bool internal_replication)
{
    Addresses shard_local_addresses;

    ConnectionPoolPtrs all_replicas_pools;
    all_replicas_pools.reserve(addresses.size());
    // Iterate over all replicas of this shard
    for (const auto & replica : addresses)
    {
        // Build the replica pool: a connection pool dedicated to this replica instance,
        // shared by all queries above it
        auto replica_pool = ConnectionPoolFactory::instance().get(
            static_cast<unsigned>(settings.distributed_connections_pool_size),
            replica.host_name,
            replica.port,
            replica.default_database,
            replica.user,
            replica.password,
            .......);

        all_replicas_pools.emplace_back(replica_pool);
        if (replica.is_local && !treat_local_as_remote)
            shard_local_addresses.push_back(replica);
    }
    // All replicas of this shard form one ConnectionPoolWithFailover that can fail over between them.
    // Each replica is itself a pool as well: several queries may use the same replica
    // at the same time, so there are many connections to it.
    ConnectionPoolWithFailoverPtr shard_pool = std::make_shared<ConnectionPoolWithFailover>(
        all_replicas_pools,
        settings.load_balancing,
        settings.distributed_replica_error_half_life.totalSeconds(),
        settings.distributed_replica_error_cap);

    shards_info.push_back({  // a ShardInfo holds all connection information of the shard: shard_pool plus the ConnectionPool of every replica
        std::move(insert_paths),
        current_shard_num,
        weight,
        std::move(shard_local_addresses),
        std::move(shard_pool),
        std::move(all_replicas_pools),
        internal_replication
    });
}
```

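The meaning is straightforward: every replica of a shard first gets its own ConnectionPool (replica_pool). These pools are collected into ConnectionPoolPtrs all_replicas_pools (simply a vector of connection pool pointers) and then wrapped in a higher-level ConnectionPoolWithFailover. From then on, a query no longer deals with "several replicas" but with "one shard-level pool".

It can be understood as follows:

ConnectionPool: the connection pool of a single replica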

```cpp
using ConnectionPoolPtr = std::shared_ptr<IConnectionPool>;
using ConnectionPoolPtrs = std::vector<ConnectionPoolPtr>;

/** A common connection pool, without fault tolerance.
  */
class ConnectionPool : public IConnectionPool, private PoolBase<Connection>
```

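ConnectionPoolWithFailover: the selector over the replicas of one shard. This object is at the core of replica selection within a shard. Its job is not to "maintain a bunch of connections to one node". Instead it faces all replicas of a shard, decides which replica to try first, switches to another replica of the same shard when the current one fails, and returns one or several replica connections to the caller as needed. Its mechanics are covered in detail later.

The diagram below shows how ConnectionPoolWithFailover relates to the shard and to ConnectionPool: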

```text
One shard
  |
  | has one shard-level failover pool
  v
ConnectionPoolWithFailover
  |
  | holds the ConnectionPool of each replica;
  | these child pools are the nested_pools
  |
  +-----------------------------+-----------------------------+
  |                             |                             |
  v                             v                             v
ConnectionPool                  ConnectionPool                ConnectionPool
(pool of replica 0)             (pool of replica 1)           (pool of replica 2)
```

The next diagram shows how shard, replica, and ConnectionPool relate:

```text
One shard
  |
  | contains several replicas
  |
  +-----------------------------+-----------------------------+
  |                             |                             |
  v                             v                             v
replica 0                       replica 1                     replica 2
  |                             |                             |
  | own connection pool         | own connection pool         | own connection pool
  v                             v                             v
ConnectionPool                  ConnectionPool                ConnectionPool
  |                             |                             |
  | inherits                    | inherits                    | inherits
  v                             v                             v
PoolBase<Connection>            PoolBase<Connection>          PoolBase<Connection>
  |                             |                             |
  | manages the connections     | manages the connections     | manages the connections
  | to this replica             | to this replica             | to this replica
  v                             v                             v
several Connections             several Connections           several Connections
```

Now let us look at ConnectionPoolWithFailover.

As described above and shown in the diagrams, in Cluster.cpp each replica of a shard first gets a plain ConnectionPool, and these pools are then wrapped together in one ConnectionPoolWithFailover:

```cpp
class ConnectionPoolWithFailover : public IConnectionPool, private PoolWithFailoverBase<IConnectionPool>
{
public:
    ConnectionPoolWithFailover(
            ConnectionPoolPtrs nested_pools_, // the per-replica pools wrapped by ConnectionPoolWithFailover
            LoadBalancing load_balancing,  // the load balancing policy of ConnectionPoolWithFailover
            time_t decrease_error_period_ = DBMS_CONNECTION_POOL_WITH_FAILOVER_DEFAULT_DECREASE_ERROR_PERIOD,
            size_t max_error_cap = DBMS_CONNECTION_POOL_WITH_FAILOVER_MAX_ERROR_COUNT);
```

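From the class declaration and constructor we can see:

ConnectionPoolWithFailover publicly inherits the IConnectionPool interface, mainly for its get() method. The get() of a pool obviously returns one Connection from the pool. Which Connection is chosen, and by what policy, depends on the implementation of ConnectionPoolWithFailover.
It also privately inherits PoolWithFailoverBase<IConnectionPool>. PoolWithFailoverBase<TNestedPool> is a templated "failover framework over several candidate pools". It does not care whether it manages connection pools, file handle pools, or some other resource pool; it only handles ordering, trying, failing over, and filtering results across the candidate child pools. The failover skeleton of ConnectionPoolWithFailover therefore comes from PoolWithFailoverBase<TNestedPool>.
PoolWithFailoverBase<TNestedPool> is an upper-level pool built on top of several child pools, for example ConnectionPool(host1), ConnectionPool(host2), and ConnectionPool(host3), which it collects into nested_pools.

ConnectionPoolWithFailover privately inherits PoolWithFailoverBase<IConnectionPool>, so the template argument is the interface IConnectionPool. When a ConnectionPoolWithFailover is actually constructed, the objects behind those IConnectionPool pointers are ConnectionPool instances: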

```cpp
using Base = PoolWithFailoverBase<IConnectionPool>;
using ConnectionPoolPtr = std::shared_ptr<IConnectionPool>;
using ConnectionPoolPtrs = std::vector<ConnectionPoolPtr>;

ConnectionPoolWithFailover::ConnectionPoolWithFailover(
        // Static type: std::vector<std::shared_ptr<IConnectionPool>>.
        // At run time every pointer in the vector points to a ConnectionPool object.
        ConnectionPoolPtrs nested_pools_,
        LoadBalancing load_balancing,
        time_t decrease_error_period_,
        size_t max_error_cap_)
    : Base(std::move(nested_pools_), decrease_error_period_, max_error_cap_, getLogger("ConnectionPoolWithFailover"))
    , get_priority_load_balancing(load_balancing)
{
```

```cpp
template <typename TNestedPool>
class PoolWithFailoverBase : private boost::noncopyable
{
public:
    using NestedPool = TNestedPool; // compile-time type: IConnectionPool
    using NestedPoolPtr = std::shared_ptr<NestedPool>; // compile-time type: std::shared_ptr<IConnectionPool>
    using Entry = typename NestedPool::Entry; // compile-time type: IConnectionPool::Entry
    using NestedPools = std::vector<NestedPoolPtr>; // compile-time type: std::vector<std::shared_ptr<IConnectionPool>>
```

When the compiler compiles ConnectionPoolWithFailover, the template argument TNestedPool of PoolWithFailoverBase<TNestedPool> is IConnectionPool, so the type aliases inside PoolWithFailoverBase resolve at compile time to:

```cpp
using NestedPool = IConnectionPool;
using NestedPoolPtr = std::shared_ptr<IConnectionPool>;
using Entry = IConnectionPool::Entry;
using NestedPools = std::vector<std::shared_ptr<IConnectionPool>>;
```

At run time, when the ConnectionPoolWithFailover is constructed (in Cluster::addShard(), as shown earlier), these interface pointers actually point to ConnectionPool, the implementation class of IConnectionPool, because every element placed into nested_pools is a std::shared_ptr to a ConnectionPool object:

```cpp
class ConnectionPool : public IConnectionPool, private PoolBase<Connection>
{
public:
    using Entry = IConnectionPool::Entry;
    using Base = PoolBase<Connection>;
    ConnectionPool(
        unsigned max_connections_,
        const String & host_,
```

```cpp
ConnectionPoolPtrs all_replicas_pools;
all_replicas_pools.reserve(addresses.size());
// Iterate over all replicas of this shard
for (const auto & replica : addresses)
{
    auto replica_pool = ConnectionPoolFactory::instance().get(
        static_cast<unsigned>(settings.distributed_connections_pool_size),
        replica.host_name,
        ....
        );
    all_replicas_pools.emplace_back(replica_pool); // at run time replica_pool points to a ConnectionPool object
}
// All replicas of this shard form one ConnectionPoolWithFailover that can fail over between them.
// Each replica is itself a pool as well: several queries may use the same replica
// at the same time, so there are many connections to it.
ConnectionPoolWithFailoverPtr shard_pool = std::make_shared<ConnectionPoolWithFailover>(
    all_replicas_pools, // every element actually points to a ConnectionPool
    settings.load_balancing,
    settings.distributed_replica_error_half_life.totalSeconds(),
    settings.distributed_replica_error_cap);
```

The inheritance relationships are shown in the diagram below:

```text
One shard
  |
  | Seen from outside: the shard has one IConnectionPool
  v
ConnectionPoolWithFailover
  |
  +------------------------------------------------+
  |                                                |
  | public inheritance from IConnectionPool        | private inheritance from PoolWithFailoverBase<IConnectionPool>
  | externally behaves as "one connection pool"    | internally implements failover across the replica pools
  |                                                |
  v                                                v
IConnectionPool                                   PoolWithFailoverBase<IConnectionPool>
                                                   |
                                                   | nested_pools
                                                   | holds the connection pools of the shard's replicas
                                                   v
                                           vector<shared_ptr<IConnectionPool>> // compile-time type
                                                   |
                                                   +-----------------------------+
                                                   |                             | ConnectionPool is the run-time type behind IConnectionPool
                                                   v                             v
                                           ConnectionPool                  ConnectionPool // actual polymorphic type at run time
                                           (replica 0)                     (replica 1)
                                                   |                             |
                                                   | inherits                    | inherits
                                                   v                             v
                                           PoolBase<Connection>            PoolBase<Connection>
                                                   |                             |
                                                   | manages Connections         | manages Connections
                                                   v                             v
                                           Connection                      Connection
```

ConnectionPoolWithFailover therefore encapsulates the logic that chooses among the per-replica ConnectionPools of a shard. With this basic understanding in place, we can move on to how ClickHouse executes a Dist query.
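The combination of public interface inheritance and private inheritance of a generic failover base can be sketched in a few lines. This is a heavily simplified illustration, not the ClickHouse implementation: `IPool`, `ReplicaPool`, `FailoverBase`, and `ShardPool` are hypothetical stand-ins for IConnectionPool, ConnectionPool, PoolWithFailoverBase, and ConnectionPoolWithFailover, and a "connection" is reduced to a host name.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for IConnectionPool.
struct IPool
{
    virtual ~IPool() = default;
    virtual std::string get() = 0; // returns a "connection", here just a host name
};

// Hypothetical stand-in for ConnectionPool: one pool per replica.
class ReplicaPool : public IPool
{
public:
    ReplicaPool(std::string host_, bool alive_) : host(std::move(host_)), alive(alive_) {}

    std::string get() override
    {
        if (!alive)
            throw std::runtime_error("replica unavailable: " + host);
        return host;
    }

private:
    std::string host;
    bool alive;
};

// Generic failover skeleton: it knows nothing about connections, only about nested pools.
template <typename TNestedPool>
class FailoverBase
{
protected:
    using NestedPools = std::vector<std::shared_ptr<TNestedPool>>;

    explicit FailoverBase(NestedPools nested_pools_) : nested_pools(std::move(nested_pools_)) {}

    // Try the nested pools in order and return the first successful result.
    template <typename TryGet>
    auto getFirstWorking(TryGet && try_get)
    {
        for (auto & pool : nested_pools)
        {
            try { return try_get(*pool); }
            catch (const std::exception &) { /* fail over to the next nested pool */ }
        }
        throw std::runtime_error("all nested pools failed");
    }

    NestedPools nested_pools;
};

// Shard-level pool: publicly it "is a" pool, privately it reuses the failover skeleton.
class ShardPool : public IPool, private FailoverBase<IPool>
{
public:
    explicit ShardPool(std::vector<std::shared_ptr<IPool>> replicas)
        : FailoverBase<IPool>(std::move(replicas)) {}

    std::string get() override
    {
        return getFirstWorking([](IPool & pool) { return pool.get(); });
    }
};
```

Because the base is inherited privately, callers of `ShardPool` only ever see the `IPool` interface; the nested pools and the failover loop stay an implementation detail, which mirrors the split described above.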

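Basic Execution Logic of a Query

Let us set the more complex lower-level decisions aside for now and start from the simplest parameter.

In ClickHouse, the maximum number of replicas used when querying one shard is determined by max_parallel_replicas: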

```cpp
M(NonZeroUInt64, max_parallel_replicas, 1, "The maximum number of replicas of each shard used when the query is executed. For consistency (to get different parts of the same partition), this option only works for the specified sampling key. The lag of the replicas is not controlled. Should be always greater than 0", 0) \
```

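This means:

max_parallel_replicas = 1: a shard uses at most one replica, so multi-replica parallelism cannot unfold within the shard. Whichever parallel mode discussed later applies (SAMPLE_KEY, CUSTOM_KEY, or READ_TASKS), it must obey max_parallel_replicas, and only one replica of the shard provides data.
max_parallel_replicas > 1: a shard may use several replicas at the same time. Note that it may, not that it will, because the different parallel replica modes add further conditions, as explained below.

The default is max_parallel_replicas=1, which means that by default ClickHouse always uses a single replica to fetch all the data of a shard. This is also the value our production system runs with; we have never changed it. But this is only the simplest case.

Knowing the hard limit set by max_parallel_replicas gives only the most superficial understanding of how a Dist table query runs.

ReadFromRemote and RemoteQueryExecutor

When a Distributed query starts building its remote execution plan, the upper layers have already parsed the SQL, determined the query stage, split out the target shards by cluster, and prepared the remote query AST, header, connection pool, and so on for each shard. Those steps are outside the scope of this article, and we assume they have completed. At this point the code constructs a ReadFromRemote, which is dedicated to remote reads, and wraps these shards into the subsequent execution plan.

This is clearly visible in ClusterProxy::executeQuery():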

```cpp
void executeQuery(
    QueryPlan & query_plan,
    .....
    bool is_remote_function)
{
    const Settings & settings = context->getSettingsRef();
    ....
    if (!remote_shards.empty())
    {
        ...

        auto plan = std::make_unique<QueryPlan>();
        auto read_from_remote = std::make_unique<ReadFromRemote>(
            std::move(remote_shards),
            ....
            query_info.storage_limits,
            not_optimized_cluster->getName());

        read_from_remote->setStepDescription("Read from remote replica");
        plan->addStep(std::move(read_from_remote));
        ....
    }

    .....
}
```

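The upper layer has already worked out which remote shards the query needs to access, and hands them to ReadFromRemote to turn into a real, executable, physical remote read pipeline. In other words, ReadFromRemote sits at the level of query plan execution, not at the level of the connection pool of a particular replica. The replica connection pools are managed by the lower-level RemoteQueryExecutor.

When ReadFromRemote actually starts working, it iterates over every shard in initializePipeline(). For each shard it calls either addLazyPipe() or addPipe(); the core of the normal remote read path is in addPipe().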

```cpp
void ReadFromRemote::initializePipeline(QueryPipelineBuilder & pipeline, const BuildQueryPipelineSettings &)
{
    Pipes pipes;

    for (const auto & shard : shards)
    {
        if (shard.lazy)
            addLazyPipe(pipes, shard); // addLazyPipe is called once per shard
        else
            addPipe(pipes, shard); // addPipe is called once per shard
    }
```

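As the code shows, ReadFromRemote::addPipe() is called per shard. Every RemoteQueryExecutor created later is therefore not "the executor of the whole Distributed query" but "the remote executor of the current shard".

Inside ReadFromRemote::addPipe(), the code first creates a RemoteQueryExecutor for the current shard. What it passes to the constructor is not a connection to one particular replica but the failover connection pool over all candidate replicas of the shard, that is, shard.shard_info.pool: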

```cpp
void ReadFromRemote::addPipe(Pipes & pipes, const ClusterProxy::SelectStreamFactory::Shard & shard)
{
    .....
    /// Parallel Replica Custom Key mode
    if (shard.shard_filter_generator)
    {
        // Create one RemoteQueryExecutor for every replica of the shard and set PoolMode=GET_ONE
        for (size_t i = 0; i < shard.shard_info.per_replica_pools.size(); ++i)
        {
            auto remote_query_executor = std::make_shared<RemoteQueryExecutor>(
                shard.shard_info.pool,
                query_string,
                ....);
            /// The loop has already split the query per replica;
            /// this executor only has to send its filtered query to one replica.
            remote_query_executor->setPoolMode(PoolMode::GET_ONE);
        }
    }
    else
    {
        const String query_string = formattedAST(shard.query);
        // Create a single RemoteQueryExecutor for the whole shard
        auto remote_query_executor = std::make_shared<RemoteQueryExecutor>(
            shard.shard_info.pool, query_string, shard.header, context, throttler, scalars, external_tables, stage);
        remote_query_executor->setLogger(log);

        if (context->canUseTaskBasedParallelReplicas()) // READ_TASKS mode uses a single coordinator, so PoolMode is GET_ONE
        {
            remote_query_executor->setPoolMode(PoolMode::GET_ONE);
        }
        else
            remote_query_executor->setPoolMode(PoolMode::GET_MANY);

        // Set the main table. It affects replica selection: only replicas that have the table,
        // and where the table is fresh enough, are chosen.
        // Use the shard's own main_table if it is set; otherwise use the main_table passed in when
        // ReadFromRemote was constructed. For a Dist table query, main_table is the
        // ReplicatedMergeTree table behind the Dist table.
        if (!table_func_ptr)
            remote_query_executor->setMainTable(shard.main_table ? shard.main_table : main_table);

        pipes.emplace_back(
            createRemoteSourcePipe(remote_query_executor, add_agg_info, add_totals, add_extremes, async_read, async_query_sending));
        addConvertingActions(pipes.back(), output_stream->header, shard.has_missing_objects);
    }
}
```

ReadFromRemote::addPipe() does two important things:

Deciding whether Parallel Replica Custom Key mode applies

When shard.shard_filter_generator is set (that is, the initiator can add a Custom Key filter for every replica, a cluster-level Custom Key filter that is covered later), the initiator expands the query per replica inside ReadFromRemote::addPipe():
```cpp
for (size_t i = 0; i < shard.shard_info.per_replica_pools.size(); ++i)
{
    auto query = shard.query->clone();
    auto shard_filter = shard.shard_filter_generator(i + 1);
    ...
    remote_query_executor->setPoolMode(PoolMode::GET_ONE); // GET_ONE, because one `RemoteQueryExecutor` has been created for each replica
}
```
In this case a shard creates several RemoteQueryExecutor objects, one per replica:
```text
executor 0: query + filter_0
executor 1: query + filter_1
```
Each RemoteQueryExecutor already corresponds to one per-replica query, so each of them needs only a single replica connection:
```text
several executors × GET_ONE per executor
```
GET_ONE here therefore does not mean "the whole shard queries only one replica". It means that each per-replica query that has already been split out (one RemoteQueryExecutor object) is sent to only one replica, because in this mode a RemoteQueryExecutor corresponds to a single replica anyway.

As the code below shows, a RemoteQueryExecutor obtains connections to the replicas of the shard by calling ConnectionPoolWithFailover::getManyImpl(). When PoolMode=GET_ONE, ConnectionPoolWithFailover hands out at most one replica (max_entries=1):

Always remember: PoolMode is a property of the RemoteQueryExecutor, not of ConnectionPoolWithFailover::getManyImpl(). Setting max_entries to 1 when PoolMode=GET_ONE does not mean that the ConnectionPoolWithFailover as a whole uses only one replica. It means that this particular RemoteQueryExecutor takes at most one Connection through the ConnectionPoolWithFailover.
```cpp
/// The single entry point for picking replicas within a shard.
std::vector<ConnectionPoolWithFailover::TryResult> ConnectionPoolWithFailover::getManyImpl(
    const Settings & settings,
    PoolMode pool_mode
    ....
   )
{
    // Decide how many replica connections to take at most.
    if (pool_mode == PoolMode::GET_ALL)
    {
        /// Take all of them.
        min_entries = nested_pools.size();
        max_entries = nested_pools.size();
    }
    else if (pool_mode == PoolMode::GET_ONE)
    {
        /// Take only one.
        max_entries = 1;
    }
    else if (pool_mode == PoolMode::GET_MANY)
    {
        /// Take at most max_parallel_replicas.
        max_entries = settings.max_parallel_replicas;
    }
    ....
}
```
If Parallel Replica Custom Key mode does not apply, the query runs in normal mode and the PoolMode for normal mode is set.
- canUseTaskBasedParallelReplicas() decides whether the parallel replica mode based on ParallelReplicasMode::READ_TASKS can be used. If so, PoolMode is set to GET_ONE; otherwise it is set to GET_MANY:
```cpp
        if (context->canUseTaskBasedParallelReplicas()) // READ_TASKS mode uses a single coordinator, so PoolMode is GET_ONE
        {
            remote_query_executor->setPoolMode(PoolMode::GET_ONE);
        }
        else
            remote_query_executor->setPoolMode(PoolMode::GET_MANY);
```
```cpp
    bool Context::canUseTaskBasedParallelReplicas() const
    {
        const auto & settings_ref = getSettingsRef();
        return getParallelReplicasMode() == ParallelReplicasMode::READ_TASKS && settings_ref.max_parallel_replicas > 1;
    }
```
- In normal mode, the initiator does not clone the query per replica as it does in Parallel Replica Custom Key mode; it creates a single RemoteQueryExecutor per shard. The PoolMode of that executor decides how many connections this one executor may take from the replica pool of the shard.
  
  To repeat: PoolMode is a property of the RemoteQueryExecutor, so before interpreting PoolMode, first check how many RemoteQueryExecutor objects a shard creates:
  - With READ_TASKS, the RemoteQueryExecutor only needs to connect to one coordinator replica, which then coordinates the other replicas of the shard:
```text
one executor × GET_ONE
```
  - Without READ_TASKS, the RemoteQueryExecutor faces the replicas of the shard directly:
```text
one executor × GET_MANY
```
  As ConnectionPoolWithFailover::getManyImpl() shows, the upper bound for GET_MANY is max_parallel_replicas. The logic that derives max_entries from PoolMode was quoted above and is not repeated here.
  
  getParallelReplicasMode() is covered separately below.
  
  In short, a ConnectionPoolWithFailover always corresponds to the connection pool of one shard, whereas a RemoteQueryExecutor may correspond either to a whole shard or to a single replica of a shard.

Setting main_table

The main table affects replica selection. ConnectionPoolWithFailover is responsible for picking replicas, and it only picks replicas that have the table and where the table is fresh enough.

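```cpp
        // Set the main table. It affects replica selection: only replicas that have the table,
        // and where the table is fresh enough, are chosen.
        // Use the shard's own main_table if it is set; otherwise use the main_table passed in when
        // ReadFromRemote was constructed. For a Dist table query, main_table is the
        // ReplicatedMergeTree table behind the Dist table.
        if (!table_func_ptr)
            remote_query_executor->setMainTable(shard.main_table ? shard.main_table : main_table);
```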
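The executor and PoolMode combinations described in this section can be condensed into a small model. It is a sketch for illustration only: `planShard`, `ShardPlan`, and `maxEntries` are hypothetical helpers, the `PoolMode` enum is redeclared locally, and the branches simply mirror the addPipe() and getManyImpl() excerpts quoted above.

```cpp
#include <cassert>
#include <cstddef>

enum class PoolMode { GET_ONE, GET_MANY, GET_ALL };

// Upper bound on the connections a single executor may take from the shard-level pool,
// following the three branches of getManyImpl() quoted above.
std::size_t maxEntries(PoolMode mode, std::size_t replicas_in_shard, std::size_t max_parallel_replicas)
{
    switch (mode)
    {
        case PoolMode::GET_ONE: return 1;
        case PoolMode::GET_MANY: return max_parallel_replicas;
        case PoolMode::GET_ALL: return replicas_in_shard;
    }
    return 1; // unreachable for valid enum values
}

// How many executors addPipe() creates for one shard, and which PoolMode each of them gets.
struct ShardPlan
{
    std::size_t executors;
    PoolMode mode;
};

ShardPlan planShard(bool has_shard_filter_generator, bool task_based, std::size_t replicas_in_shard)
{
    if (has_shard_filter_generator)            // Parallel Replica Custom Key mode
        return {replicas_in_shard, PoolMode::GET_ONE};
    if (task_based)                            // READ_TASKS: connect to one coordinator
        return {1, PoolMode::GET_ONE};
    return {1, PoolMode::GET_MANY};            // normal mode
}
```

For a shard with 3 replicas and max_parallel_replicas = 2, the model yields 3 executors with 1 connection each in custom key mode, 1 executor with 1 connection for READ_TASKS, and 1 executor with up to 2 connections otherwise.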

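How ParallelReplicasMode and PoolMode Are Decided

As the analysis above shows, ReadFromRemote::addPipe() sets the PoolMode according to the ParallelReplicasMode, and the RemoteQueryExecutor then uses the PoolMode to decide how many Connections it needs at most.

So we need to look at how the executor determines the ParallelReplicasMode.

First, what ParallelReplicasMode means: it denotes the modes in which a query can run.

ClickHouse defines three ways of dividing the work of a distributed query, that is, three parallel replica modes. Note that this uint8_t enum does not describe a "number of connections"; it describes how the replicas divide the work when a shard uses several of them at once.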

```cpp
enum class ParallelReplicasMode : uint8_t
{
    SAMPLE_KEY,
    CUSTOM_KEY,
    READ_TASKS,
};
```

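SAMPLE_KEY

Following the sampling semantics of the table itself, the read range of a shard is statically split among several replicas. Its characteristic is "split first, then execute". Different replicas read different sub-ranges, so nothing is read twice. In essence it is static partitioning.
CUSTOM_KEY
The idea of CUSTOM_KEY is similar to SAMPLE_KEY, except that the split is no longer based on the sample key but on a key the user specifies explicitly.

That is, the data space is statically split among the replicas by a user-specified expression or column. It is also a form of static partitioning.
READ_TASKS
READ_TASKS is completely different from the other two. It does not decide in advance which replica reads which piece. Instead, a coordinator is chosen first, and the coordinator dispatches read tasks to the replicas dynamically at run time. The core of READ_TASKS is therefore dynamic scheduling rather than static partitioning. It is mutually exclusive with the static modes SAMPLE_KEY and CUSTOM_KEY: a query that runs in READ_TASKS mode cannot also run in SAMPLE_KEY or CUSTOM_KEY mode.

As for PoolMode, we said above that it is a property of the RemoteQueryExecutor. Its precise meaning is: for the current shard, how many replica connections this RemoteQueryExecutor object intends to obtain in one go. ConnectionPoolWithFailover::getManyImpl() uses PoolMode to set the minimum and maximum number of replicas a RemoteQueryExecutor needs; the details come later.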

```cpp
/// Specifies how many connections to return from ConnectionPoolWithFailover::getMany() method.
enum class PoolMode : uint8_t
{
    /// The current executor needs only one replica connection
    GET_ONE = 0,
    /// The current executor needs several replica connections; the upper bound is max_parallel_replicas
    GET_MANY,
    /// The connections of all candidate replicas of the shard are taken
    GET_ALL
};
```

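PoolMode and ParallelReplicasMode must be kept apart: ParallelReplicasMode is the "parallel execution model", while PoolMode is the "connection acquisition mode":

ParallelReplicasMode answers: if a shard uses several replicas at once, how do these replicas divide the work? It is about how work is partitioned.
PoolMode answers: how many connections should the RemoteQueryExecutor that owns this PoolMode establish? It is about the number of connections.

As seen in Context::canUseTaskBasedParallelReplicas() above, the current mode is determined by calling Context::getParallelReplicasMode(), which looks like this: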

```cpp
Context::ParallelReplicasMode Context::getParallelReplicasMode() const
{
    const auto & settings_ref = getSettingsRef();
    using enum Context::ParallelReplicasMode;
    if (!settings_ref.parallel_replicas_custom_key.value.empty())
        return CUSTOM_KEY;   // CUSTOM_KEY has the highest priority
    if (settings_ref.allow_experimental_parallel_reading_from_replicas > 0)
        return READ_TASKS; // not CUSTOM_KEY: check whether READ_TASKS is enabled
    return SAMPLE_KEY; // provisionally classified as SAMPLE_KEY
}
```

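The logic is simple:

If parallel_replicas_custom_key is set, the mode is CUSTOM_KEY. parallel_replicas_custom_key is an expression, and CUSTOM_KEY has the highest priority.
If there is no custom key but allow_experimental_parallel_reading_from_replicas is enabled, the mode is READ_TASKS.
If neither applies, the mode is SAMPLE_KEY. Returning SAMPLE_KEY here is easy to misread: it does not mean SAMPLE_KEY mode will actually be used, because that requires the table to support SAMPLE. It only means that, with no custom key and with the experimental task-based mode off, the work-division label defaults to the SAMPLE_KEY category. Whether the table is actually able to use SAMPLE_KEY is checked later, in: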
```cpp
MergeTreeDataSelectSamplingData MergeTreeDataSelectExecutor::getSampling(
    .....)
{
    const Settings & settings = context->getSettingsRef();
    /// Sampling.
    MergeTreeDataSelectSamplingData sampling;
    .... 
    if (sample_size_ratio)
    {
    ....
    auto parallel_replicas_mode = context->getParallelReplicasMode();
    // There are several replicas and the mode is SAMPLE_KEY, but the table does not support
    // sampling, so SAMPLE cannot be used. This degrades to the first replica providing all the
    // data, while every other replica reads nothing.
    if (settings.parallel_replicas_count > 1 && parallel_replicas_mode == Context::ParallelReplicasMode::SAMPLE_KEY
        && !data.supportsSampling() && settings.parallel_replica_offset > 0)
    {
        sampling.read_nothing = true;
        return sampling;
    }
    // Use sampling if the query itself is a sample query, e.g. `SELECT * FROM TB SAMPLE 0.1`,
    // or if parallel replicas run in SAMPLE_KEY mode on a table that supports sampling.
    sampling.use_sampling = relative_sample_size > 0 
    || (settings.parallel_replicas_count > 1 && parallel_replicas_mode == Context::ParallelReplicasMode::SAMPLE_KEY
        && data.supportsSampling());
```
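This means:
- If parallel replicas are used (parallel_replicas_count > 1) and the mode resolves to SAMPLE_KEY (neither CUSTOM_KEY nor the experimental task-based mode), but the table does not support SAMPLE (so the range cannot be split by sample), then every replica other than the first reads nothing and the first replica reads everything. In effect, one replica provides the data on behalf of the whole shard:
```cpp
  sampling.read_nothing = true; // the second and later replicas read nothing: the table does not support sampling, so parallel replicas cannot be applied
```
- Otherwise:
  - If relative_sample_size > 0, which usually means the query contains something like SAMPLE 0.1, sampling is certainly used:
```cpp
  sampling.use_sampling = relative_sample_size > 0 || ....
```
  - If relative_sample_size <= 0, the query is not a sample query, and further checks decide whether sampling is used. The conditions are:
    - parallel replicas are used (parallel_replicas_count > 1),
    - the mode resolves to SAMPLE_KEY (neither CUSTOM_KEY nor the experimental task-based mode),
    - and the table itself supports sampling. If all three hold, parallel sampling is used:
```cpp
  sampling.use_sampling = true;
```
  Precisely: if the table supports sampling, parallel_replicas_count > 1, and the parallel mode is SAMPLE_KEY, then even if the SQL has no SAMPLE clause, ClickHouse can use the sampling key from the table definition to split the "full read" among several replicas. The Sample semantics of ClickHouse are explained in detail later.
  Why is sampling used even when the SQL has no SAMPLE? Because ClickHouse treats the absence of SAMPLE as a special case of it, namely SAMPLE 1, as the key code below shows: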
```cpp
    MergeTreeDataSelectSamplingData MergeTreeDataSelectExecutor::getSampling(...) { 
        .....
        if (settings.parallel_replicas_count > 1)
        {
            if (relative_sample_size == RelativeSize(0)) // no SAMPLE clause: treat relative_sample_size as 1
                relative_sample_size = 1;
            relative_sample_size /= settings.parallel_replicas_count.value;
            relative_sample_offset += relative_sample_size * RelativeSize(settings.parallel_replica_offset.value);
        }
```
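  The meaning of this code: if the user did not write SAMPLE, relative_sample_size == 0, and the code first raises it to 1, that is, "the whole table". It then divides by parallel_replicas_count, so the whole table is split across the replicas and each replica is responsible only for its own segment.
  We can therefore think of it this way: the user did not write SAMPLE, but ClickHouse internally treats the full range as SAMPLE 1 and cuts that 1 into N pieces for N replicas. This unifies the semantics of queries with and without SAMPLE.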
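The arithmetic in that snippet can be worked through with exact fractions. The sketch below is an illustration under stated assumptions: `SampleRange` and `replicaRange` are hypothetical names, the sample ratio is passed as a numerator and denominator instead of ClickHouse's RelativeSize, and the initial relative_sample_offset is assumed to be 0 (no OFFSET in the SAMPLE clause).

```cpp
#include <cassert>
#include <cstdint>

// Half-open range [lower_num / den, upper_num / den) of the sampling-key space.
struct SampleRange
{
    std::uint64_t lower_num;
    std::uint64_t upper_num;
    std::uint64_t den;
};

// sample_num / sample_den is the SAMPLE ratio of the query; pass 0 / 1 when there is no SAMPLE clause.
// replica_offset is the zero-based index of the replica, as in parallel_replica_offset.
SampleRange replicaRange(
    std::uint64_t sample_num, std::uint64_t sample_den,
    std::uint64_t replicas_count, std::uint64_t replica_offset)
{
    if (sample_num == 0) // no SAMPLE clause: treat it as SAMPLE 1
    {
        sample_num = 1;
        sample_den = 1;
    }
    // relative_sample_size /= parallel_replicas_count
    const std::uint64_t den = sample_den * replicas_count;
    // relative_sample_offset += relative_sample_size * parallel_replica_offset
    const std::uint64_t lower = sample_num * replica_offset;
    return {lower, lower + sample_num, den};
}
```

With 3 replicas and no SAMPLE clause, the three replicas get [0/3, 1/3), [1/3, 2/3), and [2/3, 3/3): the ranges are disjoint and together cover the whole key space, which is why nothing is read twice and nothing is missed. With SAMPLE 1/10 and 2 replicas, the replica at offset 1 gets [1/20, 2/20).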

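Constructing the RemoteQueryExecutor

As mentioned above, a Dist table query is handed to a RemoteQueryExecutor. A RemoteQueryExecutor does not face "one replica that has already been chosen" but "all replicas of the current shard that may be chosen".

When ReadFromRemote::addPipe() constructs a RemoteQueryExecutor, the constructor mainly creates a create_connections closure. When this closure is eventually executed, it returns a std::unique_ptr<IConnections>, the set of remote connections the current shard uses for this query. The constructor itself therefore does not establish any connection. It captures "how to establish connections later" in create_connections: how to pick replicas for the current shard, how to connect, whether to check the main table, and whether to use hedged requests.

The closure is declared as: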

```cpp
std::function<std::unique_ptr<IConnections>(AsyncCallback)> create_connections;
```
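The deferred-creation pattern described above can be sketched on its own. This is a simplified illustration, not ClickHouse code: `FakeShardPool` and `FakeExecutor` are hypothetical, connections are reduced to host names, and `max_entries` stands in for the effect of the pool mode. The point is that the constructor only stores a closure; because the closure reads members through `this` when it is called, a setting changed after construction (as setPoolMode() is in addPipe()) still takes effect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shard-level pool.
struct FakeShardPool
{
    std::vector<std::string> hosts;
    int get_calls = 0;

    std::vector<std::string> getMany(std::size_t max_entries)
    {
        ++get_calls;
        const auto count = static_cast<std::ptrdiff_t>(std::min(max_entries, hosts.size()));
        return std::vector<std::string>(hosts.begin(), hosts.begin() + count);
    }
};

// Hypothetical stand-in for the executor.
class FakeExecutor
{
public:
    explicit FakeExecutor(std::shared_ptr<FakeShardPool> pool)
    {
        // Nothing is connected here; the closure only records how to connect later.
        // It captures `this`, so the executor must not be copied or moved afterwards.
        create_connections = [this, pool]() { return pool->getMany(max_entries); };
    }

    FakeExecutor(const FakeExecutor &) = delete;
    FakeExecutor & operator=(const FakeExecutor &) = delete;

    void setMaxEntries(std::size_t value) { max_entries = value; }

    // Connections are obtained only when the query actually starts.
    std::vector<std::string> connect() { return create_connections(); }

private:
    std::size_t max_entries = 1;
    std::function<std::vector<std::string>()> create_connections;
};
```

Constructing a `FakeExecutor` leaves `get_calls` at 0; only `connect()` touches the pool, and it honours whatever `setMaxEntries()` was called with in between.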

Here is how the RemoteQueryExecutor::RemoteQueryExecutor constructor creates this closure:


RemoteQueryExecutor::RemoteQueryExecutor(
    const ConnectionPoolWithFailoverPtr & pool,  // one ConnectionPoolWithFailover represents the connection pools of all replicas of a shard
    const String & query_,
    const Block & header_,
    ContextPtr context_,
    const ThrottlerPtr & throttler,
    const Scalars & scalars_,
    const Tables & external_tables_,
    QueryProcessingStage::Enum stage_,
    std::optional<Extension> extension_,
    GetPriorityForLoadBalancing::Func priority_func_)
    : RemoteQueryExecutor(query_, header_, context_, scalars_, external_tables_, stage_, extension_, priority_func_)
{
    // The closure captures this, pool, and throttler; pool is the ConnectionPoolWithFailover covering all replicas of one shard
    create_connections = [this, pool, throttler](AsyncCallback async_callback)->std::unique_ptr<IConnections>
    {
        const Settings & current_settings = context->getSettingsRef();
        auto timeouts = ConnectionTimeouts::getTCPTimeoutsWithFailover(current_settings);

#if defined(OS_LINUX) // hedged mode is supported on Linux; the connections created in that case are HedgedConnections
        if (current_settings.use_hedged_requests)
        {
            std::shared_ptr<QualifiedTableName> table_to_check = nullptr;
            if (main_table)
                table_to_check = std::make_shared<QualifiedTableName>(main_table.getQualifiedName());

            auto res = std::make_unique<HedgedConnections>(
                pool, context, timeouts, throttler, pool_mode, table_to_check, std::move(async_callback), priority_func);
            if (extension && extension->replica_info)
                res->setReplicaInfo(*extension->replica_info);
            return res;
        }
#endif

        std::vector<IConnectionPool::Entry> connection_entries;
        std::optional<bool> skip_unavailable_endpoints;
        if (extension && extension->parallel_reading_coordinator)
            skip_unavailable_endpoints = true;
        // 如果有main table，那么在检查的时候需要确认这个Replica含有这个main table
        if (main_table)
        {
            /// 读主表时，除了"拿到连接"之外，还要额外检查这个副本是否适合当前查询：
            /// - 表是否存在；
            /// - 副本是否可用；
            /// - 副本是否足够新（up-to-date），必要时可以回退到 stale replica。
            /// 因此这里不能直接调用 getMany()，而是要调用 getManyChecked()。
            ///
            /// 返回值不是裸连接，而是 TryResult：
            /// - entry: 实际拿到的连接句柄；
            /// - is_usable / is_up_to_date / delay / is_readonly: 这个副本对当前查询的状态信息。
            auto try_results = pool->getManyChecked(
                timeouts,
                current_settings,
                pool_mode,
                main_table.getQualifiedName(), // 需要检查的表，即，在查找合适的Replica的时候，需要确认这个Replica上有这张表以及是否足够新
                std::move(async_callback),
                skip_unavailable_endpoints,
                priority_func);

            /// 这里真正需要交给后续查询执行层的是连接本身，
            /// 所以把每个 TryResult 里的 entry 抽出来，放进 connection_entries。
            connection_entries.reserve(try_results.size());
            for (auto & try_result : try_results)
                connection_entries.emplace_back(std::move(try_result.entry));
        }
        else
        {
            /// 没有 main_table 时，不需要做"表存在 / 副本延迟"这类检查，
            /// 直接按照 pool_mode、load_balancing 和 failover 策略获取连接即可。
            connection_entries = pool->getMany(
                timeouts, current_settings, pool_mode, std::move(async_callback), skip_unavailable_endpoints, priority_func);
        }

        auto res = std::make_unique<MultiplexedConnections>(std::move(connection_entries), context, throttler);
        if (extension && extension->replica_info)
            res->setReplicaInfo(*extension->replica_info);
        return res;
    };
}

As shown here, the RemoteQueryExecutor::RemoteQueryExecutor constructor does one thing: it defines the create_connections closure. The closure produces either HedgedConnections or MultiplexedConnections:

If current_settings.use_hedged_requests is set and the build targets Linux, the closure returns a connection set with hedged-request semantics. It does not simply take a group of ordinary connections; several candidate connections compete, and whichever responds first is used.
```
auto res = std::make_unique<HedgedConnections>(
                pool, context, timeouts, throttler, pool_mode, table_to_check, std::move(async_callback), priority_func);
```
Without hedged requests, the closure first obtains a batch of connection_entries from ConnectionPoolWithFailover and then wraps the selected connections in MultiplexedConnections:
```
std::make_unique<MultiplexedConnections>(std::move(connection_entries), context, throttler);
```
This is an ordinary multiplexed connection set. Its connection_entries are the actual connection handles picked from the replica pools of the current shard.

In short, ConnectionPoolWithFailover selects the connections, and MultiplexedConnections holds and uses the connections that were selected. Their relationship looks like this:

RemoteQueryExecutor(定义闭包)
  |
  | create_connections 闭包执行
  v
ConnectionPoolWithFailover
  |
  | getMany() / getManyChecked()
  | 从当前 shard 的多个 replica pools 中挑选连接
  | 返回 Entry List
  v
vector<IConnectionPool::Entry>
  |
  | 每个 Entry 是从某个 replica 的 ConnectionPool 中借出的连接句柄
  | 每个 Entry 保护一条具体的 Connection，而不是到一个Replica的ConnectionPool
  | move 给 MultiplexedConnections
  v
MultiplexedConnections
  |
  | 持有这些被ConnectionPoolWithFailover选出来的Entry
  | 通过 Entry 访问具体的 Connection
  | 把 query 发送到这些 Connection 对应的 remote replicas
  v
多个 Connection
  |
  | 每个 Connection 都是到某个 remote replica 的真实连接
  v
remote replicas

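Inside the create_connections closure, the logic runs as follows:

As a closure, it explicitly captures three variables: this, pool and throttler:
```
create_connections = [this, pool, throttler](AsyncCallback async_callback)->std::unique_ptr<IConnections>
```
Consider the two captured variables this and pool:
- this is the current RemoteQueryExecutor, which gives the closure access to the members of the RemoteQueryExecutor object. That is why the closure can use context, main_table, extension, pool_mode and priority_func even though none of them appears in the capture list: they are members reached through this, and several of them were initialized from the constructor arguments;
- pool is the constructor parameter const ConnectionPoolWithFailoverPtr & pool, captured by value in the lambda. Since pool is a shared_ptr, capturing it by value has a clear purpose: it keeps the ConnectionPoolWithFailover of the current shard alive until the closure runs. Inside the closure, pool is therefore the entry point to the replica candidates of the current shard;
If main_table is set, the closure calls ConnectionPoolWithFailover::getManyChecked(). As the name suggests, it does not merely pick replica connections; it also screens the replicas for eligibility and quality while picking them: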
```
    if (main_table)
    {
        auto try_results = pool->getManyChecked(
            timeouts,
            current_settings,
            pool_mode,
            main_table.getQualifiedName(), // 需要检查的表，即，在查找合适的Replica的时候，需要确认这个Replica上有这张表以及是否足够新
            std::move(async_callback),
            skip_unavailable_endpoints,
            priority_func);

        connection_entries.reserve(try_results.size());
        for (auto & try_result : try_results)
            connection_entries.emplace_back(std::move(try_result.entry)); // 取出entry，放到connection_entries中
    }
```
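getManyChecked() does not return a plain array of connections but a std::vector<TryResult>. Besides the actual connection handle entry, a TryResult carries extra information about the state of the replica, for example:
- is_usable: whether the replica is usable
- is_up_to_date: whether the replica is fresh enough
- delay: the replication delay if it is not fresh enough
- is_readonly: whether the replica is read-only
  In other words, the output of this step is a set of connection candidates that come with a health report, not a set of bare connections.
If main_table is not set, the closure calls ConnectionPoolWithFailover::getMany() directly. As shown later, getMany() differs from getManyChecked() only slightly: essentially in whether main_table is checked on the replica.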
```
else
        {
            /// 没有 main_table 时，不需要做"表存在 / 副本延迟"这类检查，
            /// 直接按照 pool_mode、load_balancing 和 failover 策略获取连接即可。
            connection_entries = pool->getMany(
                timeouts, current_settings, pool_mode, std::move(async_callback), skip_unavailable_endpoints, priority_func);
        }
```
Unlike ConnectionPoolWithFailover::getManyChecked(), ConnectionPoolWithFailover::getMany() returns the selected entries directly, that is, a set of connection handles that have already been chosen, without the intermediate state fields that TryResult carries.

ConnectionPoolWithFailover对Replica的探测和挑选

Here are the implementations of ConnectionPoolWithFailover::getManyChecked() and ConnectionPoolWithFailover::getMany():

std::vector<ConnectionPoolWithFailover::TryResult> ConnectionPoolWithFailover::getManyChecked(
    const ConnectionTimeouts & timeouts,
    const Settings & settings,
    PoolMode pool_mode,
    const QualifiedTableName & table_to_check,
    AsyncCallback async_callback,
    std::optional<bool> skip_unavailable_endpoints,
    GetPriorityForLoadBalancing::Func priority_func)
{
    /// 这一层和 getMany() 的区别很简单：
    /// 不只是"拿连接"，还要顺手检查这张表在这个副本上是不是可用、够不够新。
    TryGetEntryFunc try_get_entry = [&](const NestedPoolPtr & pool, std::string & fail_message)
    { return tryGetEntry(pool, timeouts, fail_message, settings, &table_to_check, async_callback); };

    /// 后面的通用逻辑都交给 getManyImpl()：
    /// 它会按 pool_mode 决定拿几个副本，再按优先级顺序去试。
    return getManyImpl(settings, pool_mode, try_get_entry, skip_unavailable_endpoints, priority_func);
}

std::vector<IConnectionPool::Entry> ConnectionPoolWithFailover::getMany(
    const ConnectionTimeouts & timeouts,
    const Settings & settings,
    PoolMode pool_mode,
    AsyncCallback async_callback,
    std::optional<bool> skip_unavailable_endpoints,
    GetPriorityForLoadBalancing::Func priority_func)
{
    /// 这是最基础的"拿多个副本连接"的入口。
    /// 和 getManyChecked() 的区别是：这里只关心连接本身，不额外检查某张表在副本上的状态。
    TryGetEntryFunc try_get_entry = [&](const NestedPoolPtr & pool, std::string & fail_message)
    { return tryGetEntry(pool, timeouts, fail_message, settings, nullptr, async_callback); };

    /// 真正的挑副本、重试和 failover 都在 getManyImpl() 里做。
    std::vector<TryResult> results = getManyImpl(settings, pool_mode, try_get_entry, skip_unavailable_endpoints, priority_func);

    /// 上一层只需要连接句柄，所以这里把 TryResult 里的 entry 抽出来返回。
    std::vector<Entry> entries;
    entries.reserve(results.size());
    for (auto & result : results)
        entries.emplace_back(std::move(result.entry));
    return entries;
}

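The differences between ConnectionPoolWithFailover::getMany() and ConnectionPoolWithFailover::getManyChecked() are:

getManyChecked() passes table_to_check to tryGetEntry() through the TryGetEntryFunc callback, so the replica is checked for main_table; getMany() passes nullptr and skips that check.
getMany() extracts the Entry from each TryResult and returns the entries, where an Entry wraps the connection; getManyChecked() returns the ConnectionPoolWithFailover::TryResult objects as they are. The TryResult struct is described below.

As noted above, a TryResult is "the result object obtained after trying one replica". It is not a simple success/failure flag; it packs together everything from that attempt that matters for replica selection.

The motivation is this: if the system only wanted to know whether a replica can be reached, returning an Entry (which wraps the connection) or a bool (connected or not) would be enough. When ClickHouse picks replicas inside a shard, it cares about much more:

Was the connection established, and if so, which Connection object is it? This is held in entry.
Is the replica usable for the current query? is_usable
Is it an up-to-date replica? is_up_to_date
If not, how far behind is it? delay
Is it read-only? is_readonly

ConnectionPoolWithFailover therefore needs a struct that brings back all of this state at once. All of it is packed into TryResult: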

using TryResult = Base::TryResult;

struct TryResult
{
    TryResult() = default;
    void reset() { *this = {}; }
    // 保存了Connection信息
    Entry entry; /// use isNull() to check if connection is established
    // 是否可用
    bool is_usable = false; /// if connection is established, then can be false only with table check
                            /// if table is not present on remote peer, -> it'll be false
    // 这个replica是否是up-to-date的，我们判断up-to-date的依据是根据max_replica_delay_for_distributed_queries配置
    // 的最大的delay的时间
    bool is_up_to_date = false; /// If true, the entry is a connection to up-to-date replica
                                /// Depends on max_replica_delay_for_distributed_queries setting
    // delay的时间，无论是否是up-to-date的，其实都存在delay
    UInt32 delay = 0; /// Helps choosing the "least stale" option when all replicas are stale.
    bool is_readonly = false;   /// Table is in read-only mode, INSERT can ignore such replicas.
};
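
How these fields might drive a choice can be sketched as follows. This is a simplified illustration, not the selection logic of PoolWithFailoverBase: `ProbeResult` and `pickReplica` are hypothetical names, the Entry is replaced by a replica index, and the rule "first usable up-to-date replica, otherwise the least stale one if fallback is allowed" is an assumption based on the field comments above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified counterpart of TryResult.
struct ProbeResult
{
    std::size_t replica;   // stands in for the Entry
    bool connected;        // plays the role of !entry.isNull()
    bool is_usable;
    bool is_up_to_date;
    std::uint32_t delay;
};

// Returns the chosen result, or nullptr when no replica is acceptable.
const ProbeResult * pickReplica(const std::vector<ProbeResult> & results, bool fallback_to_stale)
{
    const ProbeResult * least_stale = nullptr;
    for (const auto & r : results)
    {
        if (!r.connected || !r.is_usable)
            continue;                       // no connection, or the table check failed
        if (r.is_up_to_date)
            return &r;                      // a fresh replica wins immediately
        if (!least_stale || r.delay < least_stale->delay)
            least_stale = &r;               // remember the smallest delay seen so far
    }
    return fallback_to_stale ? least_stale : nullptr;
}
```

A bare bool could not express the "all replicas are stale, take the least stale one" case, which is why the attempt result carries delay alongside the connection.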

The Entry type used here is declared in PoolWithFailoverBase<TNestedPool>, the base class of ConnectionPoolWithFailover. PoolWithFailoverBase<TNestedPool> is covered in detail later; it provides the subclass ConnectionPoolWithFailover with a basic failover framework. The code is:

template <typename TNestedPool>
class PoolWithFailoverBase : private boost::noncopyable
{
public:
    using NestedPool = TNestedPool;
    using NestedPoolPtr = std::shared_ptr<NestedPool>;
    using Entry = typename NestedPool::Entry; // IConnectionPool::Entry，实际上是 PoolBase<Connection>::Entry;
    using NestedPools = std::vector<NestedPoolPtr>;

As we can see, Entry is TNestedPool::Entry, where TNestedPool is a template parameter. ConnectionPoolWithFailover instantiates its base class as PoolWithFailoverBase<IConnectionPool>, which is fixed at compile time rather than at run time, so Entry is IConnectionPool::Entry:

class ConnectionPoolWithFailover : public IConnectionPool, private PoolWithFailoverBase<IConnectionPool>
{
public:
....
}

Entry is declared in IConnectionPool, and ConnectionPoolWithFailover also inherits directly from IConnectionPool. IConnectionPool is the basic abstraction of a connection pool:

/** Interface for connection pools.
  *
  * Usage (using the usual `ConnectionPool` example)
  * ConnectionPool pool(...);
  *
  *    void thread()
  *    {
  *        auto connection = pool.get();
  *        connection->sendQuery(...);
  *    }
  */
class IConnectionPool : private boost::noncopyable
{
public:
    // IConnectionPool使用PoolBase进行Pool的管理，而Pool所管理的对象是connection
    using Entry = PoolBase<Connection>::Entry;

    IConnectionPool() = default;
    IConnectionPool(String host_, UInt16 port_, Priority config_priority_)
        : host(host_), port(port_), address(host + ":" + toString(port_)), config_priority(config_priority_)
    {
    }
     /// Selects the connection to work.
    virtual Entry get(const ConnectionTimeouts & timeouts) = 0;
    /// If force_connected is false, the client must manually ensure that returned connection is good.
    virtual Entry get(const ConnectionTimeouts & timeouts, /// NOLINT
                      const Settings & settings,
                      bool force_connected = true) = 0;

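As we can see, IConnectionPool::Entry is really PoolBase<Connection>::Entry. Connection is defined in src/Client/Connection.h and is not covered here.

Note:

As mentioned above, ConnectionPoolWithFailover is the connection pool for all replicas of one shard, not for a single replica node. Yet it inherits from IConnectionPool, which, as shown above, carries host/port information. Is that a contradiction?
Not really. The host/port in IConnectionPool mainly serves the single-node connection pool. ConnectionPoolWithFailover inherits from it to reuse the pool interface IConnectionPool::get(), not because it naturally maps to one host/port. Much of the upper-layer logic needs only one thing: "an object I can call get() on to obtain a connection".

IConnectionPool hands out an Entry through IConnectionPool::get(), and the Entry wraps the Connection. Following the comment in IConnectionPool above, taking a Connection from the pool and sending a query through it goes as follows:

Get an Entry from the pool
ConnectionPoolWithFailover implements the pure virtual IConnectionPool::get(). The first step is to obtain a Connection from the pool, but get() returns a richer wrapper, an Entry object: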
```
using Entry = PoolBase<Connection>::Entry;
auto connection = pool.get();// ConnectionPoolWithFailover实现了IConnectionPool::get()
```
Get the wrapped Connection from the Entry:
Entry overloads operator->(), which returns the Connection * held by the Entry:
```
    Object * operator->() & { return &*data->data.object; }
```
In the PoolBase<Connection> instantiation, Object is Connection;
Send the query to the replica through Connection::sendQuery(). Here connection->sendQuery() is equivalent to (Entry::operator->())->sendQuery(...)
```
connection->sendQuery(...) // 基于拿到的Connection发送query
```


ConnectionPoolWithFailover
  |
  | getMany() / getManyChecked()
  | 对 nested_pools 中的每个 replica pool 执行 try_get_entry
  v
PoolWithFailoverBase<IConnectionPool>::getMany()
  |
  | 每次尝试某个 replica pool，都会得到一个 TryResult
  |
  +--------------------------------+--------------------------------+
  |                                |                                |
  v                                v                                v
TryResult                         TryResult                         TryResult
(for replica 0)                   (for replica 1)                   (for replica 2)
  |                                |                                |
  | 包含 Entry                      | 包含 Entry                      | 包含 Entry
  | 包含 is_usable                  | 包含 is_usable                  | 包含 is_usable
  | 包含 is_up_to_date              | 包含 is_up_to_date              | 包含 is_up_to_date
  | 包含 delay                      | 包含 delay                      | 包含 delay
  | 包含 is_readonly                | 包含 is_readonly                | 包含 is_readonly
  v                                v                                v
Entry                             Entry                             Entry
  |                                |                                |
  | RAII 句柄                       | RAII 句柄                       | RAII 句柄
  | Entry 存活期间保护一条 Connection | Entry 存活期间保护一条 Connection | Entry 存活期间保护一条 Connection
  v                                v                                v
Connection                        Connection                        Connection
(到 replica 0 的真实连接)           (到 replica 1 的真实连接)           (到 replica 2 的真实连接)

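PoolBase<TObject> is also a class template: a general-purpose, pool-based resource manager. The pooled resource may be a Connection or anything else; the template argument names the type that the pool manages. Entry here is an alias whose actual type is PoolBase<Connection>::Entry.

A PoolBase<Connection>::Entry stands for one object managed by the pool: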

template <typename TObject>
class PoolBase : private boost::noncopyable
{
public:
    using Object = TObject; // Connection
    using ObjectPtr = std::shared_ptr<Object>; // std::shared_ptr<Connection>
    using Ptr = std::shared_ptr<PoolBase<TObject>>;

private:

    /** The object with the flag, whether it is currently used. */
    struct PooledObject // PooledObject是对std::shared_ptr<Object>的封装，即对 std::shared_ptr<Connection>的封装
    {
        PooledObject(ObjectPtr object_, PoolBase & pool_)
            : object(object_), pool(pool_)
        {
        }

        ObjectPtr object; // 被管理的对象
        bool in_use = false;  // 是否正在使用
        std::atomic<bool> is_expired = false; // 是否已经过期
        PoolBase & pool; // 这个PooledObject所在的PoolBase
    };

    using Objects = std::vector<std::shared_ptr<PooledObject>>;  // PoolBase是PooledObject的真正拥有者，而PoolEntryHelper只是引用者

The PoolBase code shows that PoolBase is the owner of the PooledObject instances:

PoolBase<Connection>
  |
  | items: vector<shared_ptr<PooledObject>>
  | PoolBase 创建并长期保存这些 PooledObject
  |
  +------------------------------------------------+
  |                                                |
  v                                                v
PooledObject                                   PooledObject
  |
  | PooledObject 是池里的一个槽位
  | 长期存在于 PoolBase::items 中
  |
  | object: shared_ptr<Connection>
  | in_use: bool
  | is_expired: atomic<bool>
  | pool: PoolBase<Connection>&
  |
  v
Connection

Now consider the implementation of Entry. An Entry does not own a PooledObject; it only refers, through PoolEntryHelper, to a PooledObject that already exists in the pool.
PoolBase<Connection>::Entry is defined in PoolBase.h:

    class Entry
    {
    public:
        friend class PoolBase<Object>;

        Entry() = default;    /// For deferred initialization.

        /** The `Entry` object protects the resource from being used by another thread.
          * The following methods are forbidden for `rvalue`, so you can not write a similar to
          *
          * auto q = pool.get()->query("SELECT .."); // Oops, after this line Entry was destroyed
          * q.execute (); // Someone else can use this Connection
          */
        Object * operator->() && = delete;
        const Object * operator->() const && = delete;
        Object & operator*() && = delete;
        const Object & operator*() const && = delete;

        Object * operator->() &             { return &*data->data.object; }
        const Object * operator->() const & { return &*data->data.object; }
        Object & operator*() &              { return *data->data.object; }
        const Object & operator*() const &  { return *data->data.object; }

        /**
         * Expire an object to make it reallocated later.
         */
        void expire()
        {
            data->data.is_expired = true;
        }

        bool isNull() const { return data == nullptr; }

        PoolBase * getPool() const
        {
            ....       
        }

    private:
        std::shared_ptr<PoolEntryHelper> data; // 封装了一个PoolEntryHelper的智能指针，用来在构造和析构的时候管理Connection资源的in_use状态

        /**
         * 一个PoolEntryHelper封装了一个 PooledObject， 这个PooledObject里面通过in_use来代表现在是否正在使用
         * @param object
         */
        explicit Entry(PooledObject & object) : data(std::make_shared<PoolEntryHelper>(object)) {}
    };

As we can see, Entry has a single member, std::shared_ptr<PoolEntryHelper> data, and PoolEntryHelper stores a reference to a PooledObject, which is what actually holds the object. PoolEntryHelper only refers to the PooledObject; the real owner of the PooledObject is PoolBase:

    struct PoolEntryHelper
    {
        // PoolEntryHelper wraps a PooledObject and sets its in_use flag in RAII style,
        // indicating whether this Connection is currently being used
        explicit PoolEntryHelper(PooledObject & data_) : data(data_) { data.in_use = true; } // sets in_use = true on construction
        ~PoolEntryHelper()
        {
            std::lock_guard lock(data.pool.mutex);
            data.in_use = false;  // sets in_use = false on destruction
            data.pool.available.notify_one();
        }

        PooledObject & data; // a reference to the slot, not ownership
    };

The relationship between Entry, PooledObject and PoolEntryHelper is shown below. As mentioned above, PoolBase owns the PooledObject, while Entry and PoolEntryHelper only refer to it or borrow it temporarily. When a PooledObject is lent out through an Entry:

Entry
  |
  | data: shared_ptr<PoolEntryHelper>
  | Entry 是调用者拿到的临时句柄
  |
  v
PoolEntryHelper
  |
  | data: PooledObject&
  | 注意：这里是引用，不是拥有
  | 构造时设置 PooledObject.in_use = true
  | 析构时设置 PooledObject.in_use = false
  |
  v
同一个 PooledObject
  |
  | object: shared_ptr<Connection>
  |
  v
同一条 Connection
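
The borrow-and-return pattern above can be reproduced in a few lines. This is a minimal single-threaded sketch, not the PoolBase implementation: `Slot`, `SlotGuard`, `Handle` and `TinyPool` are hypothetical names, a `std::string` stands in for the Connection, there is no mutex or condition variable, and a null handle is returned when every slot is busy.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// A pool slot (counterpart of PooledObject); the pool owns it.
struct Slot
{
    std::string connection; // stands in for shared_ptr<Connection>
    bool in_use = false;
};

// RAII guard (counterpart of PoolEntryHelper): busy on construction, free on destruction.
struct SlotGuard
{
    explicit SlotGuard(Slot & slot_) : slot(slot_) { slot.in_use = true; }
    ~SlotGuard() { slot.in_use = false; }
    Slot & slot; // a reference, not ownership
};

// Counterpart of Entry: a copyable handle sharing one guard.
class Handle
{
public:
    Handle() = default;
    explicit Handle(Slot & slot) : guard(std::make_shared<SlotGuard>(slot)) {}

    // Deleted for rvalues, so `pool.get()->...` on a temporary does not compile.
    std::string * operator->() && = delete;
    std::string * operator->() & { return &guard->slot.connection; }

    bool isNull() const { return guard == nullptr; }

private:
    std::shared_ptr<SlotGuard> guard;
};

class TinyPool
{
public:
    explicit TinyPool(std::size_t size) : slots(size) {}

    // Lends out the first free slot, or returns a null Handle if all are busy.
    Handle get()
    {
        for (auto & slot : slots)
            if (!slot.in_use)
                return Handle(slot);
        return Handle();
    }

    std::size_t busy() const
    {
        std::size_t n = 0;
        for (const auto & slot : slots)
            n += slot.in_use ? 1 : 0;
        return n;
    }

private:
    std::vector<Slot> slots;
};
```

Because the guard sits behind a shared_ptr, copying a Handle does not release the slot; the slot becomes free only when the last copy is destroyed.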

两层IConnectionPool的实现： ConnectionPoolWithFailover 和 ConnectionPool

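Both ConnectionPool and ConnectionPoolWithFailover inherit from IConnectionPool, because both are resource pools. ConnectionPoolWithFailover is the shard-level pool and ConnectionPool is the replica-level pool. The two sit at different levels but share the same pool concept, and both hand out connections through IConnectionPool::get():

If the dynamic object is a ConnectionPoolWithFailover (the shard-level pool; one ConnectionPoolWithFailover object covers the connections to all replicas of one shard)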

IConnectionPool::get()
  -> ConnectionPoolWithFailover::get()
  -> 在多个 replicas 中选一个
  -> 再调用某个 replica pool 的 get()

When IConnectionPool::get() is called from outside and the implementation is the shard-level ConnectionPoolWithFailover, the whole call chain is:

IConnectionPool::get()
  |
  | 虚函数动态派发
  | 运行时对象如果是 ConnectionPoolWithFailover
  v
ConnectionPoolWithFailover::get()
  |
  | 它自己不直接拥有 Connection
  | 内部调用 getManyChecked(..., PoolMode::GET_ONE)
  v
ConnectionPoolWithFailover::getManyChecked()
  |
  | 继续调用 getManyImpl(...)
  | 这里会决定 min_entries / max_entries
  | GET_ONE 时 max_entries = 1
  v
ConnectionPoolWithFailover::getManyImpl()
  |
  | 内部调用 Base::getMany(...)
  | Base = PoolWithFailoverBase<IConnectionPool>
  v
PoolWithFailoverBase<IConnectionPool>::getMany()
  |
  | PoolWithFailoverBase 拥有 nested_pools
  | nested_pools: vector<shared_ptr<IConnectionPool>>
  | 每个 nested_pool 对应 shard 内一个 replica 的连接池
  v
shared_pool_states / PoolState
  |
  | 为每个 nested_pool 维护状态
  | 包括 error_count / slowdown_count / priority / random
  | 根据 load_balancing 和错误状态排序
  v
选择一个 nested_pool
  |
  | nested_pool 静态类型是 shared_ptr<IConnectionPool>
  | 运行时对象通常是 ConnectionPool
  v
try_get_entry(nested_pool, fail_message)
  |
  | 这是 ConnectionPoolWithFailover 传给 Base::getMany 的回调
  | 回调内部会尝试从这个 replica pool 拿一条连接
  v
nested_pool->get(...)
  |
  | 虚函数动态派发
  | 运行时对象如果是 ConnectionPool
  v
ConnectionPool::get()
  |
  | 内部调用 Base::get()
  | Base = PoolBase<Connection>
  v
PoolBase<Connection>::get()
  |
  | PoolBase 拥有 items: vector<shared_ptr<PooledObject>>
  | 从 items 中找到一个 in_use = false 的 PooledObject
  v
PooledObject
  |
  | PooledObject 是某个 replica pool 里的连接槽位
  | 内部持有 shared_ptr<Connection>
  v
构造 Entry
  |
  | Entry 内部持有 shared_ptr<PoolEntryHelper>
  v
PoolEntryHelper
  |
  | 引用同一个 PooledObject
  | 构造时设置 PooledObject.in_use = true
  | 析构时设置 PooledObject.in_use = false
  v
Entry
  |
  | 返回给 ConnectionPoolWithFailover
  | 再由 ConnectionPoolWithFailover 返回给上层调用者
  | 调用者通过 Entry::operator-> 访问连接
  v
Connection

If the dynamic object is a ConnectionPool (the replica-level pool; one ConnectionPool object covers all connections to one replica)

IConnectionPool::get()
  -> ConnectionPool::get()
  -> 从 PoolBase<Connection> 中借出一条 Connection

When IConnectionPool::get() is called from outside and the implementation is the replica-level ConnectionPool, the whole call chain is:

IConnectionPool::get()
  |
  | 虚函数动态派发
  | 运行时对象如果是 ConnectionPool
  v
ConnectionPool::get()
  |
  | 内部调用 Base::get()
  | Base = PoolBase<Connection>
  v
PoolBase<Connection>::get()
  |
  | PoolBase 拥有 items: vector<shared_ptr<PooledObject>>
  | 从 items 中找到一个 in_use = false 的 PooledObject
  v
PooledObject
  |
  | PooledObject 是池里的槽位
  | 内部持有 shared_ptr<Connection>
  v
构造 Entry
  |
  | Entry 内部持有 shared_ptr<PoolEntryHelper>
  v
PoolEntryHelper
  |
  | 引用同一个 PooledObject
  | 构造时设置 PooledObject.in_use = true
  | 析构时设置 PooledObject.in_use = false
  v
Entry
  |
  | 调用者通过 Entry::operator-> 访问连接
  v
Connection
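
The two-level design can be condensed into a small sketch. It is an illustration under stated assumptions rather than ClickHouse code: `IPool`, `ReplicaPool` and `ShardPool` are hypothetical names, a connection is reduced to the address string of the replica that served it, and failover is reduced to "try the nested pools in order".

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Common interface: callers only need "something I can call get() on".
struct IPool
{
    virtual ~IPool() = default;
    // Returns the address of the replica that served the request, or "" on failure.
    virtual std::string get() = 0;
};

// Replica level: one pool per remote host.
class ReplicaPool : public IPool
{
public:
    ReplicaPool(std::string address_, bool alive_) : address(std::move(address_)), alive(alive_) {}
    std::string get() override { return alive ? address : std::string(); }

private:
    std::string address;
    bool alive;
};

// Shard level: same interface, but get() delegates to one of the nested pools.
class ShardPool : public IPool
{
public:
    explicit ShardPool(std::vector<std::shared_ptr<IPool>> nested_) : nested(std::move(nested_)) {}

    std::string get() override
    {
        for (const auto & pool : nested)
        {
            std::string result = pool->get(); // virtual dispatch to the replica-level pool
            if (!result.empty())
                return result;                // first replica that answers wins
        }
        return std::string();                 // every replica failed
    }

private:
    std::vector<std::shared_ptr<IPool>> nested;
};
```

A caller holding an `IPool` pointer cannot tell which level it is talking to, which is the benefit of letting the shard-level pool implement the same interface.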

Both ConnectionPoolWithFailover::getManyChecked() and ConnectionPoolWithFailover::getMany() are implemented on top of ConnectionPoolWithFailover::getManyImpl():

/// 这是 shard 内挑副本的统一入口。
/// 它自己不关心"怎么检查一个副本"，那部分由 try_get_entry 决定；
/// 它只负责：
/// 1. 这次最少/最多拿几个副本；
/// 2. 按什么顺序试；
/// 3. 失败了要不要切到下一个副本。
std::vector<ConnectionPoolWithFailover::TryResult> ConnectionPoolWithFailover::getManyImpl(
    const Settings & settings,
    PoolMode pool_mode,
    const TryGetEntryFunc & try_get_entry,
    std::optional<bool> skip_unavailable_endpoints,
    GetPriorityForLoadBalancing::Func priority_func,
    bool skip_read_only_replicas)
{
    ....
    /// 没显式传的话，就沿用 skip_unavailable_shards。
    if (!skip_unavailable_endpoints.has_value())
        skip_unavailable_endpoints = settings.skip_unavailable_shards;

    /// 至少要拿到几个副本：
    /// - 允许跳过不可用 shard 时，0 个也能接受；
    /// - 否则至少得有 1 个。
    size_t min_entries = skip_unavailable_endpoints.value() ? 0 : 1;

    size_t max_tries = settings.connections_with_failover_max_tries;
    size_t max_entries;
    /// 最多拿几个副本连接。
    if (pool_mode == PoolMode::GET_ALL)
    {
        /// 全拿。
        min_entries = nested_pools.size();
        max_entries = nested_pools.size();
    }
    else if (pool_mode == PoolMode::GET_ONE)
    {
        /// 只拿一个。
        max_entries = 1;
    }
    else if (pool_mode == PoolMode::GET_MANY)
    {
        /// 最多拿 max_parallel_replicas 个。
        max_entries = settings.max_parallel_replicas;
    }
    else
    {
        throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "Unknown pool allocation mode");
    }

    if (!priority_func)
        /// 没传优先级函数，就按 load_balancing 现算一个。
        priority_func = makeGetPriorityFunc(settings);

    UInt64 max_ignored_errors = settings.distributed_replica_max_ignored_errors.value;
    bool fallback_to_stale_replicas = settings.fallback_to_stale_replicas_for_distributed_queries.value;
    /// 真正开始按顺序试副本、做 failover。
    return Base::getMany(min_entries, max_entries, max_tries, max_ignored_errors, fallback_to_stale_replicas, skip_read_only_replicas, try_get_entry, priority_func);
}

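ConnectionPoolWithFailover::getManyImpl() is a parameter-translation layer between ConnectionPoolWithFailover and PoolWithFailoverBase. It prepares the arguments from the upper layer and then calls the failover framework PoolWithFailoverBase::getMany() to obtain the ConnectionPoolWithFailover::TryResult objects for the replicas needed within the shard:

First it computes the minimum number of results, min_entries. If skipping unavailable endpoints/shards is allowed, zero replica results are acceptable; otherwise at least one replica result is required. So min_entries is not "how many we would ideally like" but the lowest threshold for success.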
```
    /// 没显式传的话，就沿用 skip_unavailable_shards。
if (!skip_unavailable_endpoints.has_value())
    skip_unavailable_endpoints = settings.skip_unavailable_shards;

size_t min_entries = skip_unavailable_endpoints.value() ? 0 : 1;
```
skip_unavailable_shards is defined as follows:
```
M(Bool, skip_unavailable_shards, false, "If true, ClickHouse silently skips unavailable shards. Shard is marked as unavailable when: 1) The shard cannot be reached due to a connection failure. 2) Shard is unresolvable through DNS. 3) Table does not exist on the shard.", 0) \
```
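Note that this setting skips shards, not replicas. It decides whether ClickHouse silently ignores a shard when the whole shard is unavailable. Once a shard is ignored, its data is missing from the result.
Next, it reads the maximum number of tries for a single replica pool: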
```
 size_t max_tries = settings.connections_with_failover_max_tries;
```
This value decides after how many failures a single replica pool is given up. It is not the total retry count of the shard-level ConnectionPoolWithFailover but the attempt limit for the connection pool of one candidate replica:
```
    M(UInt64, connections_with_failover_max_tries, 3, "The maximum number of attempts to connect to replicas.", 0) \
```

Then pool_mode decides the upper limit, max_entries:

size_t max_entries;
if (pool_mode == PoolMode::GET_ALL)
{
    min_entries = nested_pools.size();
    max_entries = nested_pools.size();
}
else if (pool_mode == PoolMode::GET_ONE)
{
    max_entries = 1;
}
else if (pool_mode == PoolMode::GET_MANY)
{
    max_entries = settings.max_parallel_replicas;
}
else
{
    throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "Unknown pool allocation mode");
}

Note that in PoolMode::GET_MANY mode, max_entries equals the user setting max_parallel_replicas, not the number of replicas in the current shard:

M(NonZeroUInt64, max_parallel_replicas, 1, "The maximum number of replicas of each shard used when the query is executed. For consistency (to get different parts of the same partition), this option only works for the specified sampling key. The lag of the replicas is not controlled. Should be always greater than 0", 0) \

max_parallel_replicas defaults to 1, so by default neither PoolMode::GET_MANY nor PoolMode::GET_ONE yields more than one replica per shard, and no parallel replicas appear.
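
The limit computation can be isolated into a small function for experimentation. This is a sketch that mirrors the branches shown above: `entryLimits` is a hypothetical name, and `PoolMode` is redeclared locally so that the block is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Local redeclaration for the sketch; the real enum lives in the ClickHouse sources.
enum class PoolMode { GET_ONE, GET_MANY, GET_ALL };

// Returns {min_entries, max_entries} for one shard.
std::pair<std::size_t, std::size_t> entryLimits(
    PoolMode mode, bool skip_unavailable, std::size_t replica_count, std::size_t max_parallel_replicas)
{
    std::size_t min_entries = skip_unavailable ? 0 : 1; // lowest threshold for success
    std::size_t max_entries = 0;
    switch (mode)
    {
        case PoolMode::GET_ALL:
            min_entries = replica_count; // every replica is required
            max_entries = replica_count;
            break;
        case PoolMode::GET_ONE:
            max_entries = 1;
            break;
        case PoolMode::GET_MANY:
            max_entries = max_parallel_replicas; // a setting, not the shard size
            break;
    }
    return {min_entries, max_entries};
}
```

For a shard with 3 replicas and the default max_parallel_replicas of 1, GET_MANY gives the same upper limit as GET_ONE, while GET_ALL overrides the minimum regardless of the skip setting.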

If no priority_func was passed in, one is computed from load_balancing. priority_func determines the order in which ConnectionPoolWithFailover tries the replicas of the shard. The callback is handed to the PoolWithFailoverBase<TNestedPool> framework, which provides only the basic failover logic; requirements specific to the caller are implemented by the caller and passed in as callbacks.
```
if (!priority_func)
 priority_func = makeGetPriorityFunc(settings);
```
Load balancing is covered in detail below.

Finally, it reads two settings related to replica health; both are passed on to PoolWithFailoverBase<TNestedPool>::getMany():

// 排序时可以"忽略"多少历史错误。这里的动机是，在本轮选择副本时，给每个 replica 一点容错额度，不要因为replica的少量历史错误就把它明显降权。
UInt64 max_ignored_errors = settings.distributed_replica_max_ignored_errors.value;
// 如果按 max_replica_delay_for_distributed_queries 判断，up-to-date 副本不够，是否允许退回到 stale replica。
bool fallback_to_stale_replicas = settings.fallback_to_stale_replicas_for_distributed_queries.value;


M(UInt64, distributed_replica_max_ignored_errors, 0, "Number of errors that will be ignored while choosing replicas", 0) \
M(Bool, fallback_to_stale_replicas_for_distributed_queries, true, "Suppose max_replica_delay_for_distributed_queries is set and all replicas for the queried table are stale. If this setting is enabled, the query will be performed anyway, otherwise the error will be reported.", 0) \

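The exact meaning of these two parameters becomes clear inside PoolWithFailoverBase<TNestedPool>::getMany().

With all arguments prepared, PoolWithFailoverBase<TNestedPool>::getMany() is called to do the work. It returns a collection of TryResult objects, one per selected replica, each carrying the connection and related state: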
```
return Base::getMany(min_entries, max_entries, max_tries, max_ignored_errors, fallback_to_stale_replicas, skip_read_only_replicas, try_get_entry, priority_func);
```
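Here the caller ConnectionPoolWithFailover passes two callbacks to PoolWithFailoverBase<TNestedPool>::getMany(). Both callback types are declared in PoolWithFailoverBase, but the caller has to implement them:
- try_get_entry tells PoolWithFailoverBase how to try to obtain a usable connection from a nested pool once that pool has been selected. PoolWithFailoverBase does not know what kind of resource a nested pool holds and never calls the connection logic directly; it only handles scheduling and failover and leaves "how to get a connection" to this callback.
- priority_func tells PoolWithFailoverBase in what order the IConnectionPool objects in its nested_pools should be tried. It turns a policy such as load_balancing into a comparable priority value, so that PoolWithFailoverBase can rank replicas by the caller's policy in addition to error counts, slow-replica state and configured priority.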
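
The division of labour between the framework and its two callbacks can be sketched as follows. This is a heavily simplified illustration, not PoolWithFailoverBase::getMany(): the names `Replica`, `TryGetEntry`, `GetPriority` and this `getMany` are hypothetical, the ordering key is reduced to (error_count, priority), and retries, slowdown counts and stale-replica handling are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical description of one nested pool.
struct Replica
{
    std::string host;
    bool reachable;
    std::size_t error_count;
};

// Supplied by the caller: how to probe one replica, and how much it is preferred.
using TryGetEntry = std::function<bool(const Replica &, std::string & fail_message)>;
using GetPriority = std::function<long(std::size_t index)>;

// The skeleton decides the order and when to move on; the callbacks decide the rest.
std::vector<std::string> getMany(
    const std::vector<Replica> & replicas, std::size_t max_entries,
    const TryGetEntry & try_get_entry, const GetPriority & get_priority)
{
    std::vector<std::size_t> order(replicas.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = i;

    // Fewer errors first; the caller's priority breaks ties (lower value is tried first).
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b)
    {
        return std::make_tuple(replicas[a].error_count, get_priority(a))
            < std::make_tuple(replicas[b].error_count, get_priority(b));
    });

    std::vector<std::string> chosen;
    for (std::size_t index : order)
    {
        if (chosen.size() >= max_entries)
            break;
        std::string fail_message;
        if (try_get_entry(replicas[index], fail_message))
            chosen.push_back(replicas[index].host); // on failure, fall through to the next replica
    }
    return chosen;
}
```

In this sketch the error count outranks the caller's priority, which reflects the idea that the priority callback influences the order only among replicas that are otherwise equally healthy.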

The sections below focus on the implementation of the underlying failover framework PoolWithFailoverBase<TNestedPool>, especially its most important method, PoolWithFailoverBase<TNestedPool>::getMany().

LoadBalance Closure

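PoolWithFailoverBase<TNestedPool> declares a GetPriorityFunc callback that lets the caller decide, to some extent, the order in which the replicas of a shard are tried. The caller implements the callback and passes it in, and the framework simply invokes it.

We say "to some extent" because, besides GetPriorityFunc, PoolWithFailoverBase<TNestedPool> also considers other factors, and considers them first, when ranking replicas. The details are covered in the discussion of PoolWithFailoverBase<TNestedPool>::PoolState.

The GetPriorityFunc that ConnectionPoolWithFailover passes in is built by ConnectionPoolWithFailover::makeGetPriorityFunc():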

    /// The client can provide this functor to affect load balancing - the index of a pool is passed to
    /// this functor. The pools with lower result value will be tried first.
    using GetPriorityFunc = std::function<Priority(size_t index)>;
    
   ConnectionPoolWithFailover::Base::GetPriorityFunc ConnectionPoolWithFailover::makeGetPriorityFunc(const Settings & settings)
        {
            // load_balancing_first_offset from the settings, taken modulo the number of nested pools;
            // it takes effect only when the FIRST_OR_RANDOM load balancing strategy is selected
            const size_t offset = settings.load_balancing_first_offset % nested_pools.size();
            const LoadBalancing load_balancing = LoadBalancing(settings.load_balancing); // 根据设置的LB的策略，构造一个LoadBalancing对象
            // 由LoadBalancing对象直接提供对应的GetPriorityFunc实现
            return get_priority_load_balancing.getPriorityFunc(load_balancing, offset, nested_pools.size());
        }

ConnectionPoolWithFailover does not write this function by hand. It reads the load_balancing strategy configured by the user and asks GetPriorityForLoadBalancing, through getPriorityFunc(), to produce the matching priority function:

M(UInt64, load_balancing_first_offset, 0, "Which replica to preferably send a query when FIRST_OR_RANDOM load balancing strategy is used.", 0) \
M(LoadBalancing, load_balancing, LoadBalancing::RANDOM, "Which replicas (among healthy replicas) to preferably send a query to (on the first attempt) for distributed processing.", 0) \

The available LoadBalancing strategies are:

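    enum class LoadBalancing : uint8_t
    {
        /// Among the replicas with the fewest errors, pick one at random.
        RANDOM = 0,
        /// First narrow down to the replicas with the fewest errors,
        /// then compare how many characters differ between the replica hostname prefix
        /// and the local hostname prefix; the closer the prefix, the higher the priority.
        NEAREST_HOSTNAME,
        /// Similar to NEAREST_HOSTNAME:
        /// the replicas with the fewest errors are considered first,
        /// but hostname closeness is measured by the Levenshtein edit distance instead of the prefix difference.
        /// The smaller the edit distance, the closer the hostname and the higher the priority.
        HOSTNAME_LEVENSHTEIN_DISTANCE,
        /// Among replicas with the same number of errors,
        /// access them in the order in which they appear in the configuration.
        IN_ORDER,
        /// Prefer the first replica;
        /// if the first replica has a higher error count, do not insist on it
        /// and pick a random one among the replicas with the fewest errors.
        FIRST_OR_RANDOM,
        /// Round robin among replicas with the same number of errors.
        ROUND_ROBIN,
    };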

Next, the implementation of GetPriorityForLoadBalancing::getPriorityFunc(), which maps each strategy to a priority function:

    using Func = std::function<Priority(size_t index)>;
    
    GetPriorityForLoadBalancing::Func
    GetPriorityForLoadBalancing::getPriorityFunc(LoadBalancing load_balance, 
    size_t offset,  // the user-configured load_balancing_first_offset
    size_t pool_size) const
    {
        std::function<Priority(size_t index)> get_priority;
        switch (load_balance)
        {
            case LoadBalancing::NEAREST_HOSTNAME:
                /// Order by the prefix distance between the replica hostname and the local hostname; smaller is better.
                get_priority = [this](size_t i) { return Priority{static_cast<Int64>(hostname_prefix_distance[i])}; };
                break;
            case LoadBalancing::HOSTNAME_LEVENSHTEIN_DISTANCE:
                /// Order by the Levenshtein edit distance between hostnames; smaller is better.
                get_priority = [this](size_t i) { return Priority{static_cast<Int64>(hostname_levenshtein_distance[i])}; };
                break;
            case LoadBalancing::IN_ORDER:
                /// Order by the index in the configuration: 0, 1, 2, ...
                get_priority = [](size_t i) { return Priority{static_cast<Int64>(i)}; };
                break;
            case LoadBalancing::RANDOM:
                /// No explicit priority; replicas with equal priority are later shuffled by a random number.
                break;
            case LoadBalancing::FIRST_OR_RANDOM:
                /// Prefer the replica at `offset`; all other replicas share a lower priority.
                get_priority = [offset](size_t i) { return i != offset ? Priority{1} : Priority{0}; };
                break;
            case LoadBalancing::ROUND_ROBIN:
                /// Every call advances last_used, so the replica after the last used one gets the highest priority.
                auto local_last_used = last_used % pool_size;
                ++last_used;
    
                // Example: pool_size = 5
                // | local_last_used | i=0 | i=1 | i=2 | i=3 | i=4 |
                // | 0               | 4   | 0   | 1   | 2   | 3   |
                // | 1               | 3   | 4   | 0   | 1   | 2   |
                // | 2               | 2   | 3   | 4   | 0   | 1   |
                // | 3               | 1   | 2   | 3   | 4   | 0   |
                // | 4               | 0   | 1   | 2   | 3   | 4   |
    
                get_priority = [pool_size, local_last_used](size_t i)
                {
                    /// The replica right after local_last_used gets the smallest priority value; the others follow in circular order.
                    size_t priority = pool_size - 1;
                    if (i < local_last_used)
                        priority = pool_size - 1 - (local_last_used - i);
                    if (i > local_last_used)
                        priority = i - local_last_used - 1;
    
                    return Priority{static_cast<Int64>(priority)};
                };
                break;
        }
        return get_priority;
    }
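The ROUND_ROBIN table in the comment above can be checked with a small standalone sketch. Everything here is an illustration under stated assumptions: `Priority` is a simplified stand-in for the ClickHouse type, and `makeRoundRobinPriority` / `makeFirstOrRandomPriority` are hypothetical helper names that take the rotation position and the offset as plain arguments instead of reading them from `last_used` and the settings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Simplified stand-in for ClickHouse's Priority: a lower value means a higher priority.
struct Priority
{
    std::int64_t value = 0;
};

using GetPriorityFunc = std::function<Priority(std::size_t)>;

// Hypothetical helper: builds the ROUND_ROBIN priority function for one call,
// given the rotation position (last_used % pool_size) captured at call time.
GetPriorityFunc makeRoundRobinPriority(std::size_t pool_size, std::size_t local_last_used)
{
    return [pool_size, local_last_used](std::size_t i)
    {
        std::size_t priority = pool_size - 1;              // i == local_last_used: tried last
        if (i < local_last_used)
            priority = pool_size - 1 - (local_last_used - i);
        if (i > local_last_used)
            priority = i - local_last_used - 1;            // the next replica gets value 0
        return Priority{static_cast<std::int64_t>(priority)};
    };
}

// Hypothetical helper: FIRST_OR_RANDOM prefers the replica at `offset`.
GetPriorityFunc makeFirstOrRandomPriority(std::size_t offset)
{
    return [offset](std::size_t i) { return i != offset ? Priority{1} : Priority{0}; };
}
```

With pool_size = 5 and local_last_used = 3, replica 4 receives value 0 and replica 3 receives value 4, which matches the fourth row of the table. Each returned function is a closure over a snapshot of the rotation position, so later calls to getPriorityFunc() do not change the order already handed out.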

So the generated get_priority function takes the index of a replica and returns its Priority. Priority is an aggregate type:

/// Common type for priority values.
/// Separate type (rather than `Int64`) is used just to avoid implicit conversion errors and to default-initialize
struct Priority
{
    Int64 value = 0; /// Note that lower value means higher priority.
    // Conversion operator: turns a Priority into an Int64.
    constexpr operator Int64() const { return value; } /// NOLINT
};

For example, with the IN_ORDER strategy the corresponding get_priority is:

case LoadBalancing::IN_ORDER:
            /// Order by the index in the configuration: 0, 1, 2, ...
            get_priority = [](size_t i) { return Priority{static_cast<Int64>(i)}; };
            break;

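In other words, given a replica index i, the lambda returns a Priority object. Because Priority is an aggregate, Priority{...} here is aggregate initialization and no constructor is called: Priority declares no constructor of its own, and although the compiler still provides an implicit default constructor, aggregate initialization does not go through it. Aggregate types, aggregate initialization and list initialization are covered in a dedicated section later.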
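As a quick check of the aggregate claim, here is a minimal sketch. It assumes a C++17 compiler (for `std::is_aggregate_v`), uses `std::int64_t` in place of ClickHouse's `Int64` alias, and `makeInOrderPriority` is a hypothetical helper that mirrors the IN_ORDER lambda.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Same shape as the Priority struct above, with std::int64_t standing in for Int64.
struct Priority
{
    std::int64_t value = 0; // lower value means higher priority
    constexpr operator std::int64_t() const { return value; }
};

// No user-declared constructors, so the type is an aggregate
// (default member initializers are allowed in aggregates since C++14).
static_assert(std::is_aggregate_v<Priority>, "Priority is expected to be an aggregate");

// Hypothetical helper mirroring the IN_ORDER lambda: Priority{...} is aggregate initialization.
constexpr Priority makeInOrderPriority(std::size_t i)
{
    return Priority{static_cast<std::int64_t>(i)};
}
```

`Priority p{3};` sets `value` directly, `Priority d{};` falls back to the default member initializer 0, and the conversion operator lets a Priority be read back as a plain integer where one is needed.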

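The basic Failover framework: PoolWithFailoverBase

PoolWithFailoverBase<TNestedPool> is a scheduler over multiple candidate pools. It knows nothing about ClickHouse replica semantics and does not decide the PoolMode either. Its job is to order the nested pools, retry on failure, fail over, and return a "good enough" set of connections.

In ClickHouse, each nested pool is the connection pool to one Replica, and one PoolWithFailoverBase<TNestedPool> object performs failover-aware selection among the Replicas of one Shard. That is why, as shown earlier, ConnectionPoolWithFailover derives from PoolWithFailoverBase<TNestedPool> with IConnectionPool as the template argument TNestedPool.

The constructor looks like this: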

template <typename TNestedPool>
class PoolWithFailoverBase : private boost::noncopyable
{
public:
    using NestedPool = TNestedPool;
    using NestedPoolPtr = std::shared_ptr<NestedPool>;
    using Entry = typename NestedPool::Entry;
    using NestedPools = std::vector<NestedPoolPtr>;
    PoolWithFailoverBase(
            NestedPools nested_pools_,
            time_t decrease_error_period_,
            size_t max_error_cap_,
            LoggerPtr log_)
        : nested_pools(std::move(nested_pools_))
        , decrease_error_period(decrease_error_period_)
        , max_error_cap(max_error_cap_)
        , shared_pool_states(nested_pools.size())
        , log(log_)
    {
        for (size_t i = 0; i < nested_pools.size(); ++i)
            shared_pool_states[i].config_priority = nested_pools[i]->getConfigPriority(); // config_priority is set by the upper layer according to its own needs
    }

When a ConnectionPoolWithFailover is constructed, its base class PoolWithFailoverBase<IConnectionPool> is constructed first. The key member is nested_pools, which holds the ConnectionPool of every Replica under this shard, so nested_pools is effectively a pool of pools:

template <typename TNestedPool>
class PoolWithFailoverBase : private boost::noncopyable
{
public:
    using NestedPool = TNestedPool;
    using NestedPoolPtr = std::shared_ptr<NestedPool>;
    using Entry = typename NestedPool::Entry; // IConnectionPool::Entry, which is actually PoolBase<Connection>::Entry
    using NestedPools = std::vector<NestedPoolPtr>;
    
    const NestedPools nested_pools; // one nested pool per Replica of this shard

The declaration of ConnectionPoolWithFailover shows that the base class is instantiated as PoolWithFailoverBase<IConnectionPool>, so the compile-time type of nested_pools is:

std::vector<std::shared_ptr<IConnectionPool>>

As shown later, ConnectionPoolWithFailover::getManyImpl() simply calls PoolWithFailoverBase<IConnectionPool>::getMany(), so let us look at that method.

getMany() is the core method of PoolWithFailoverBase. It does not fetch a connection from one particular pool. Instead, it walks the candidate nested_pools in priority order, collects enough qualified connections, and fails over automatically when an attempt fails:

template <typename TNestedPool>
std::vector<typename PoolWithFailoverBase<TNestedPool>::TryResult>
PoolWithFailoverBase<TNestedPool>::getMany(
        size_t min_entries, size_t max_entries, size_t max_tries,
        size_t max_ignored_errors,
        bool fallback_to_stale_replicas, 
        bool skip_read_only_replicas,
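        const TryGetEntryFunc & try_get_entry, // the callback that actually probes a replica
        const GetPriorityFunc & get_priority /* the callback that returns the priority */)
{
    /// Sort all candidate nested pools first.
    /// Each ShuffledPool normally corresponds to the connection pool of one replica,
    /// so shuffled_pools is the order in which the candidate replica pools of a shard are tried.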
    std::vector<ShuffledPool> shuffled_pools = getShuffledPools(max_ignored_errors, get_priority);

    /// Limit `max_tries` value by `max_error_cap` to avoid unlimited number of retries
    max_tries = std::min(max_tries, max_error_cap);

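    /// try_results corresponds one-to-one to shuffled_pools and records the outcome of trying each pool.
    /// It tracks more than whether a connection was made:
    /// - whether an entry was obtained;
    /// - whether the replica is usable (is_usable);
    /// - whether the replica is fresh enough (is_up_to_date).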
    std::vector<TryResult> try_results(shuffled_pools.size());
    size_t entries_count = 0;
    size_t usable_count = 0;
    size_t up_to_date_count = 0;
    size_t failed_pools_count = 0;

    std::string fail_messages;
    bool finished = false;
    while (!finished)
    {
        for (size_t i = 0; i < shuffled_pools.size(); ++i)
        {
            /// Stop conditions:
            /// 1. max_entries up-to-date results have been collected;
            /// 2. every pool has either produced a result or been marked as failed, so retrying is pointless.
            if (up_to_date_count >= max_entries /// Already enough good entries.
                || entries_count + failed_pools_count >= nested_pools.size()) /// No more good entries will be produced.
            {
                finished = true;
                break;
            }

            ShuffledPool & shuffled_pool = shuffled_pools[i];
            TryResult & result = try_results[i];
            if (max_tries && (shuffled_pool.error_count >= max_tries || !result.entry.isNull()))
                continue;

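            /// try_get_entry is the "single attempt" logic passed in by the upper layer:
            /// it tries to take one connection from this pool and checks eligibility along the way
            /// (for example whether the table exists, whether the replica is up to date, whether it is read-only).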
            std::string fail_message;
            result = try_get_entry(shuffled_pool.pool, fail_message);

            if (!fail_message.empty())
                fail_messages += fail_message + '\n';

            if (!result.entry.isNull()) // connection established
            {
                ++entries_count;
                if (result.is_usable) // got a connection, and this replica is usable for the current request
                {
                    ++usable_count;
                    if (result.is_up_to_date)
                        ++up_to_date_count;
                }
            }
            else
            {
                /// This single attempt on the current pool failed:
                /// 1. log it;
                /// 2. increase the error_count of this pool;
                /// 3. once max_tries is reached, treat the pool as failed for this call.
                shuffled_pool.error_count = std::min(max_error_cap, shuffled_pool.error_count + 1);

                if (shuffled_pool.error_count >= max_tries)
                {
                    ++failed_pools_count;
                }
            }
        }
    }

    if (usable_count < min_entries) // fewer usable Replicas than the required minimum
        throw DB::NetException(DB::ErrorCodes::ALL_CONNECTION_TRIES_FAILED,
                "All connection tries failed. Log: \n\n{}\n", fail_messages);
    /// Remove invalid results: not connected, not usable, and read-only replicas that must be skipped (the insert case).
    std::erase_if(try_results, [&](const TryResult & r) { return isTryResultInvalid(r, skip_read_only_replicas); });

    /// Finally sort the results by freshness:
    /// - up-to-date results come first;
    /// - stale ones are ordered by ascending delay, so the less stale replicas are kept first.
    std::stable_sort(
            try_results.begin(), try_results.end(),
            [](const TryResult & left, const TryResult & right)
            {
                // up-to-date first, then smaller delay first.
                return std::forward_as_tuple(!left.is_up_to_date, left.delay)
                    < std::forward_as_tuple(!right.is_up_to_date, right.delay);
            });

    if (fallback_to_stale_replicas)
    {
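        /// When falling back to stale replicas is allowed,
        /// returning the first max_entries results is enough, because after sorting they are the best available set.
        size_t size = std::min(try_results.size(), max_entries);
        try_results.resize(size);
    }
    else if (up_to_date_count >= min_entries)
    {
        .....
        /// When falling back to stale replicas is not allowed, keep only the up-to-date results.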
        try_results.resize(up_to_date_count);
    }
    else
        throw DB::Exception(DB::ErrorCodes::ALL_REPLICAS_ARE_STALE,
                "Could not find enough connections to up-to-date replicas. Got: {}, needed: {}", up_to_date_count, max_entries);

    return try_results;
}

The execution proceeds as follows.

First, the candidate Replica pools are sorted. The sort is driven by the information stored in each PoolState:

std::vector<ShuffledPool> shuffled_pools = getShuffledPools(max_ignored_errors, get_priority);
max_tries = std::min(max_tries, max_error_cap);

The sorting itself happens in getShuffledPools(), which makes it an important method:

template <typename TNestedPool>
std::vector<typename PoolWithFailoverBase<TNestedPool>::ShuffledPool> // returns a vector of ShuffledPool
PoolWithFailoverBase<TNestedPool>::getShuffledPools(
    size_t max_ignored_errors, const PoolWithFailoverBase::GetPriorityFunc & get_priority, bool use_slowdown_count)
{
    /// Update random numbers and error counts.
    PoolStates pool_states = updatePoolStates(max_ignored_errors);
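    if (get_priority) // a get_priority function was provided
    {
        // The priority of each pool: get_priority takes the pool index and returns its Priority (wrapping an Int64; a smaller value means a higher priority).
        // PoolState::compare shows that this priority is only one of the factors in the final ordering, and not the first one considered.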
        for (size_t i = 0; i < pool_states.size(); ++i)
            pool_states[i].priority = get_priority(i); 
    }

    /// Sort the pools into order in which they will be tried (based on respective PoolStates).
    /// Note that `error_count` and `slowdown_count` are used for ordering, but set to zero in the resulting ShuffledPool
    std::vector<ShuffledPool> shuffled_pools;
    shuffled_pools.reserve(nested_pools.size());
    for (size_t i = 0; i < nested_pools.size(); ++i)
        shuffled_pools.emplace_back(ShuffledPool{.pool = nested_pools[i], .state = &pool_states[i], .index = i});

    ::sort(
        shuffled_pools.begin(), shuffled_pools.end(),
        [use_slowdown_count](const ShuffledPool & lhs, const ShuffledPool & rhs)
        {
            return PoolState::compare(*lhs.state, *rhs.state, use_slowdown_count);
        });

    return shuffled_pools; // the ShuffledPools in the order in which they will be tried
}

As shown above, the ordering is ultimately decided by PoolState::compare(). PoolWithFailoverBase gathers every factor that matters for ordering into a PoolState object and compares two pools through their PoolStates:

template <typename TNestedPool>
struct PoolWithFailoverBase<TNestedPool>::PoolState
    {
        // How many "failure points" this pool has accumulated recently.
        UInt64 error_count = 0;
        // How many times this pool was alive but so slow that it was abandoned or switched away from.
        UInt64 slowdown_count = 0;
        // The priority configured for this pool in the <remote_servers> configuration.
        Priority config_priority{1}; 
        // The Priority computed by GetPriorityFunc.
        Priority priority{0}; 
        // When all previous sort keys are equal, a random number breaks the tie, so that the same replica is not always picked.
        UInt64 random = 0;
    
        void randomize()
        {
            random = rng();
        }
    
        static bool compare(const PoolState & lhs, const PoolState & rhs, bool use_slowdown_count)
        {
            /**
             * The state of each replica is packed into a sort key; the smaller key is tried first.
             * error_count is the most significant field, and config_priority ranks above priority.
             */
            if (use_slowdown_count)
                return std::forward_as_tuple(lhs.error_count, lhs.slowdown_count, lhs.config_priority, lhs.priority, lhs.random)
                    < std::forward_as_tuple(rhs.error_count, rhs.slowdown_count, rhs.config_priority, rhs.priority, rhs.random);
            else
                return std::forward_as_tuple(lhs.error_count, lhs.config_priority, lhs.priority, lhs.random)
                    < std::forward_as_tuple(rhs.error_count, rhs.config_priority, rhs.priority, rhs.random);
        }
    };

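The comparison is straightforward and needs no further explanation.

Note that get_priority() is a callback supplied by the upper layer. As a generic failover framework that is unaware of business logic, PoolWithFailoverBase still leaves some say over the ordering to the caller through this callback, but the resulting priority ranks below error_count, slowdown_count and config_priority. The LoadBalance side of this was covered above.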
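The effect of the lexicographic key can be seen in a small sketch. It is a simplification under stated assumptions: `ReplicaState`, `tryBefore` and `tryOrder` are hypothetical names, plain integers replace `Priority`, and slowdown_count is left out (the `use_slowdown_count == false` branch).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Simplified, hypothetical mirror of PoolState.
struct ReplicaState
{
    std::size_t index = 0;            // position of the replica in the configuration
    std::uint64_t error_count = 0;    // most significant sort field
    std::int64_t config_priority = 1; // from the cluster configuration
    std::int64_t priority = 0;        // from the GetPriorityFunc callback
    std::uint64_t random = 0;         // tie breaker
};

// Lexicographic comparison: the first differing field decides.
bool tryBefore(const ReplicaState & lhs, const ReplicaState & rhs)
{
    return std::tie(lhs.error_count, lhs.config_priority, lhs.priority, lhs.random)
        < std::tie(rhs.error_count, rhs.config_priority, rhs.priority, rhs.random);
}

// Returns the replica indexes in the order in which they would be tried.
std::vector<std::size_t> tryOrder(std::vector<ReplicaState> states)
{
    std::sort(states.begin(), states.end(), tryBefore);
    std::vector<std::size_t> order;
    for (const auto & state : states)
        order.push_back(state.index);
    return order;
}
```

A replica with one recorded error is tried after every error-free replica, no matter how favorable its config_priority or callback priority is. The callback priority only matters among replicas that are tied on error_count and config_priority.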

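Next, the nested pools are tried one by one in the sorted order. Each outcome is recorded in a TryResult, while usable_count (used at the end to check whether min_entries is satisfied) and up_to_date_count (used to check the requirement when stale replicas are not allowed) are updated along the way: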

        ShuffledPool & shuffled_pool = shuffled_pools[i];
        TryResult & result = try_results[i];
        if (max_tries && (shuffled_pool.error_count >= max_tries || !result.entry.isNull()))
            continue;

        /// try_get_entry is the "single attempt" logic passed in by the upper layer:
        /// it tries to take one connection from this pool and checks eligibility along the way
        /// (for example whether the table exists, whether the replica is up to date, whether it is read-only).
        std::string fail_message;
        result = try_get_entry(shuffled_pool.pool, fail_message);

        if (!fail_message.empty())
            fail_messages += fail_message + '\n';

        if (!result.entry.isNull()) // connection established
        {
            ++entries_count;
            if (result.is_usable) // got a connection, and this replica is usable for the current request
            {
                if (skip_read_only_replicas && result.is_readonly)
                    ProfileEvents::increment(ProfileEvents::DistributedConnectionSkipReadOnlyReplica);
                else
                {
                    ++usable_count;
                    if (result.is_up_to_date)
                        ++up_to_date_count; // count of non-stale replicas, checked later against the stale-replica policy
                }
            }
        }
        else
        {
            /// This single attempt on the current pool failed:
            /// 1. log it;
            /// 2. increase the error_count of this pool;
            /// 3. once max_tries is reached, treat the pool as failed for this call.
            shuffled_pool.error_count = std::min(max_error_cap, shuffled_pool.error_count + 1);

            if (shuffled_pool.error_count >= max_tries)
            {
                ++failed_pools_count;
            }
        }

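As this code shows, PoolWithFailoverBase does not care what kind of resource its pools hold. It only invokes try_get_entry to obtain the probe result for a Replica, and this TryGetEntryFunc closure is defined by the caller, ConnectionPoolWithFailover, on the same principle as the LoadBalance closure discussed above: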

    result = try_get_entry(shuffled_pool.pool, fail_message);

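The definition of try_get_entry that ConnectionPoolWithFailover passes to PoolWithFailoverBase<TNestedPool>::getMany() is described in detail below.

The loop exits early once enough good results (max_entries) have been collected, or once every candidate pool has either produced a result or been given up on, since further attempts would be pointless: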

    while (!finished)
    {
        for (size_t i = 0; i < shuffled_pools.size(); ++i)
        {
            /// Stop conditions:
            /// 1. max_entries up-to-date results have been collected;
            /// 2. every pool has either produced a result or been marked as failed, so retrying is pointless.
            if (up_to_date_count >= max_entries /// Already enough good entries.
                || entries_count + failed_pools_count >= nested_pools.size()) /// No more good entries will be produced.
            {
                finished = true;
                break;
            }

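After the loop, the number of usable entries is checked against the requirement. If there are too few, an exception is thrown. Otherwise some pruning still has to be done:

    if (usable_count < min_entries) // fewer usable Replicas than the required minimum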
        throw DB::NetException(DB::ErrorCodes::ALL_CONNECTION_TRIES_FAILED,
                "All connection tries failed. Log: \n\n{}\n", fail_messages);

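Then the invalid results are filtered out and the rest are re-sorted:

    /// Remove invalid results: not connected, not usable, and read-only replicas that must be skipped (the insert case).
    std::erase_if(try_results, [&](const TryResult & r) { return isTryResultInvalid(r, skip_read_only_replicas); });

    /// Finally sort the results by freshness:
    /// - up-to-date results come first;
    /// - stale ones are ordered by ascending delay, so the less stale replicas are kept first.
    std::stable_sort(
            try_results.begin(), try_results.end(),
            [](const TryResult & left, const TryResult & right)
            {
                // up-to-date first, then smaller delay first.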
                return std::forward_as_tuple(!left.is_up_to_date, left.delay)
                    < std::forward_as_tuple(!right.is_up_to_date, right.delay);
            });

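The results removed first are those where 1) entry.isNull(), 2) !is_usable, or 3) is_readonly is set while read-only replicas must be skipped. The remaining results are then sorted by is_up_to_date first and delay second.

Finally, the results are trimmed according to whether stale replicas are allowed. At this point usable_count already satisfies min_entries. But if the user forbids stale replicas with fallback_to_stale_replicas=false, every replica that is not up to date has to be discarded, so the number of up-to-date replicas must be checked against min_entries:

If it is sufficient, the replicas that are not up to date are dropped.
If it is not, an exception is thrown.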

if (fallback_to_stale_replicas) // stale replicas are allowed (the min_entries check above has already passed)
{
    /// When falling back to stale replicas is allowed,
    /// returning the first max_entries results is enough, because after sorting they are the best available set.
    size_t size = std::min(try_results.size(), max_entries);
    try_results.resize(size);
}
else if (up_to_date_count >= min_entries) // stale replicas are not allowed, and there are at least min_entries non-stale replicas
{
    if (try_results.size() < up_to_date_count)
        throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "Could not find enough connections for up-to-date results. Got: {}, needed: {}", try_results.size(), up_to_date_count);

    /// Keep only the up-to-date replicas and drop all the others.
    try_results.resize(up_to_date_count);
}
else // stale replicas are not allowed and there are fewer than min_entries non-stale replicas: report an error
    throw DB::Exception(DB::ErrorCodes::ALL_REPLICAS_ARE_STALE,
            "Could not find enough connections to up-to-date replicas. Got: {}, needed: {}", up_to_date_count, max_entries);

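fallback_to_stale_replicas is passed down from the upper-layer settings (fallback_to_stale_replicas_for_distributed_queries). Its meaning is: when a maximum allowed delay is set through max_replica_delay_for_distributed_queries and all replicas of the queried table are stale, the query is executed anyway if this setting is enabled, and an error is reported otherwise.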

M(UInt64, max_replica_delay_for_distributed_queries, 300, "If set, distributed queries of Replicated tables will choose servers with replication delay in seconds less than the specified value (not inclusive). Zero means do not take delay into account.", 0) \

M(Bool, fallback_to_stale_replicas_for_distributed_queries, true, "Suppose max_replica_delay_for_distributed_queries is set and all replicas for the queried table are stale. If this setting is enabled, the query will be performed anyway, otherwise the error will be reported.", 0) \
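The three post-loop steps (filter, sort, trim) can be put together in one standalone sketch. This is an illustration under stated assumptions, not the ClickHouse code: `TryResultSketch` and `finalizeResults` are hypothetical names, a boolean replaces the real connection entry, `std::runtime_error` replaces the ClickHouse exceptions, and the up-to-date count is recomputed and capped at max_entries here because the sketch has no probing loop to bound it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <tuple>
#include <vector>

// Hypothetical, simplified stand-in for TryResult (no real connection inside).
struct TryResultSketch
{
    bool connected = false;     // plays the role of !entry.isNull()
    bool is_usable = false;
    bool is_readonly = false;
    bool is_up_to_date = false;
    unsigned delay = 0;         // replication delay in seconds
};

std::vector<TryResultSketch> finalizeResults(
    std::vector<TryResultSketch> results,
    std::size_t min_entries,
    std::size_t max_entries,
    bool fallback_to_stale_replicas,
    bool skip_read_only_replicas)
{
    // Step 1: drop results that cannot be used at all.
    results.erase(
        std::remove_if(results.begin(), results.end(),
            [&](const TryResultSketch & r)
            { return !r.connected || !r.is_usable || (skip_read_only_replicas && r.is_readonly); }),
        results.end());

    if (results.size() < min_entries)
        throw std::runtime_error("all connection tries failed");

    // Step 2: up-to-date results first, then ascending delay.
    std::stable_sort(results.begin(), results.end(),
        [](const TryResultSketch & l, const TryResultSketch & r)
        { return std::make_tuple(!l.is_up_to_date, l.delay) < std::make_tuple(!r.is_up_to_date, r.delay); });

    const auto up_to_date_count = static_cast<std::size_t>(
        std::count_if(results.begin(), results.end(),
            [](const TryResultSketch & r) { return r.is_up_to_date; }));

    // Step 3: trim according to the stale-replica policy.
    if (fallback_to_stale_replicas)
        results.resize(std::min(results.size(), max_entries));
    else if (up_to_date_count >= min_entries)
        results.resize(std::min(up_to_date_count, max_entries));
    else
        throw std::runtime_error("all replicas are stale");

    return results;
}
```

Sorting on the pair (!is_up_to_date, delay) is what makes the trimming step a simple resize: after the sort, the first up_to_date_count elements are exactly the up-to-date replicas, and any stale ones that follow are already ordered from least to most delayed.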

TryGetEntryFunction Closure

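As mentioned above, ConnectionPoolWithFailover defines the TryGetEntryFunc try_get_entry closure itself and passes it in when calling PoolWithFailoverBase<TNestedPool>::getMany(). The framework uses this closure to attempt a connection to each Replica and then decides from the result whether the attempt succeeded, whether the replica is usable, whether it is up to date, and whether the next replica should be tried. The result is wrapped in a TryResult.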

    template <typename TNestedPool>
    std::vector<typename PoolWithFailoverBase<TNestedPool>::TryResult>
    PoolWithFailoverBase<TNestedPool>::getMany(
            size_t min_entries, size_t max_entries, size_t max_tries,
            size_t max_ignored_errors,
            bool fallback_to_stale_replicas, 
            bool skip_read_only_replicas,
            const TryGetEntryFunc & try_get_entry, // the callback that actually probes a replica
            const GetPriorityFunc & get_priority /* the callback that returns the priority */)
    {
     ....
    }

So PoolWithFailoverBase::getMany() is the scheduler and the try_get_entry callback is the executor. PoolWithFailoverBase is only a failover framework and does not care about the actual probing logic, so try_get_entry is supplied by the business layer when it calls PoolWithFailoverBase<TNestedPool>::getMany().

TryGetEntryFunc is declared in PoolWithFailoverBase as:

using TryGetEntryFunc = std::function<TryResult(const NestedPoolPtr & pool, std::string & fail_message)>;

In other words, the caller is asked to provide a function that, given a nested pool, tries once to obtain a connection from it and returns a TryResult.

ConnectionPoolWithFailover::getMany() defines its TryGetEntryFunc as follows:

   std::vector<IConnectionPool::Entry> ConnectionPoolWithFailover::getMany(
        const ConnectionTimeouts & timeouts,
        const Settings & settings,
        .....
        GetPriorityForLoadBalancing::Func priority_func)
        {
        /// The most basic entry point for obtaining connections to several replicas.
        /// Unlike getManyChecked(), it only cares about the connection itself and does not check the state of a table on the replica.
        TryGetEntryFunc try_get_entry = [&](const NestedPoolPtr & pool, std::string & fail_message)
        { return tryGetEntry(pool, timeouts, fail_message, settings, nullptr, async_callback); };
        /// Replica selection, retries and failover all happen in getManyImpl().
        std::vector<TryResult> results = getManyImpl(settings, pool_mode, 
        try_get_entry,  // pass the closure in
        skip_unavailable_endpoints, priority_func);
        ....
    }

The try_get_entry callback passed to PoolWithFailoverBase is implemented by calling ConnectionPoolWithFailover::tryGetEntry():

  
  ConnectionPoolWithFailover::TryResult // the outcome of the attempt
  ConnectionPoolWithFailover::tryGetEntry(
        const ConnectionPoolPtr & pool, // the connection pool of one particular Replica
        const ConnectionTimeouts & timeouts,
        std::string & fail_message,
        const Settings & settings,
        const QualifiedTableName * table_to_check,
        [[maybe_unused]] AsyncCallback async_callback) // empty by default
        {
            .......
            ConnectionEstablisher connection_establisher(pool, &timeouts, settings, log, table_to_check);
            TryResult result;
            connection_establisher.run(result, fail_message);
            return result; // the result of taking a connection from this replica's pool
    }

ConnectionEstablisher and ConnectionEstablisher::run() are not examined here. In short, run() uses the ConnectionPool pointing at this Replica to communicate with it and stores the outcome in a TryResult.
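The division of labor between the framework and the closure can be reduced to a few lines. The sketch below is a deliberately small model under stated assumptions: `Attempt`, `TryFunc` and `firstWorkingPool` are hypothetical names, pools are identified by index, and the loop stops at the first success instead of collecting several entries.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for TryResult: only records whether a connection was obtained.
struct Attempt
{
    bool ok = false;
};

// Same shape as TryGetEntryFunc, with a pool index instead of a NestedPoolPtr.
using TryFunc = std::function<Attempt(std::size_t pool_index, std::string & fail_message)>;

// The "framework" side: it knows how to iterate and retry, but not how to probe a pool.
// Returns the index of the first pool that yields a connection, or -1 if none does.
int firstWorkingPool(std::size_t pool_count, std::size_t max_tries,
                     const TryFunc & try_get_entry, std::string & fail_messages)
{
    for (std::size_t round = 0; round < max_tries; ++round)
    {
        for (std::size_t i = 0; i < pool_count; ++i)
        {
            std::string fail_message;
            const Attempt result = try_get_entry(i, fail_message);
            if (!fail_message.empty())
                fail_messages += fail_message + '\n';
            if (result.ok)
                return static_cast<int>(i);
        }
    }
    return -1;
}
```

The caller supplies the probing logic as a lambda that captures its own state by reference, in the same way the `[&]` closure in ConnectionPoolWithFailover::getMany() captures timeouts and settings. The framework never sees that state and only consumes the returned result and the failure message.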

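Sample in ClickHouse

In ClickHouse, SAMPLE is not simple random sampling. It is a performance optimization built on the physical storage layout.

Unlike an ordinary SQL filter, sampling has to be declared when the table is created, which means the sampling dimension must be chosen in advance.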

CREATE TABLE events (
    user_id UInt64,
    event_date Date,
    data String
) 
ENGINE = MergeTree()
ORDER BY (user_id, event_date)
SAMPLE BY user_id; -- the sampling key must be part of the primary key (ORDER BY)

Once the table is defined, a query can skip most of the data with the SAMPLE keyword:

SELECT count() FROM events SAMPLE 0.1;

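Result: the query processes only about 10% of the data, and count() returns the number of rows in that sample. ClickHouse does not correct aggregate values automatically, so the estimate for the whole table is obtained by multiplying the result by 10, or by using the _sample_factor virtual column.

So SAMPLE does not change what the SQL computes. It changes which rows are read: the query runs on 10% of the table, and the full-table figure is extrapolated from that sample.

SAMPLE in ClickHouse is not random sampling either. Ordinary sampling such as WHERE rand() < 0.1 is logical sampling: the database has to scan the whole table and compute a random value for every row, which is slow and does not reduce disk I/O.

SAMPLE 0.1, by contrast, is physical sampling:

Deterministic: because the sampling key is part of the primary key, the data on disk is ordered by it, and SAMPLE 0.1 makes the engine read only the ranges whose sampling-key value falls into the first 10% of the key space.
Low overhead: the engine locates the matching data blocks directly, and the remaining 90% of the data never reaches memory or the CPU.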
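The idea behind physical sampling can be modeled with a sorted array and a binary search. This is only a conceptual sketch under stated assumptions: it treats the sampling key as a UInt32 stored in sorted order, `sampledPrefixLength` is a hypothetical name, and it ignores parts, granules and the rest of the MergeTree machinery.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns how many leading elements of `sorted_keys` fall into the first
// 1/denominator of the UInt32 key space. Only that prefix would need to be read.
std::size_t sampledPrefixLength(const std::vector<std::uint32_t> & sorted_keys, std::uint32_t denominator)
{
    // Size of the whole key space: 2^32 possible values.
    const std::uint64_t key_space = std::uint64_t{std::numeric_limits<std::uint32_t>::max()} + 1;
    // Keys strictly below this bound belong to the sample.
    const std::uint64_t bound_exclusive = key_space / denominator;

    // Binary search instead of a full scan: this is what makes the sampling "physical".
    const auto it = std::lower_bound(
        sorted_keys.begin(), sorted_keys.end(), bound_exclusive,
        [](std::uint32_t key, std::uint64_t bound) { return key < bound; });

    return static_cast<std::size_t>(it - sorted_keys.begin());
}
```

For a denominator of 10 the bound is 429496729, so a key of exactly 429496729 is already outside the sample. The same input always selects the same rows, which is the deterministic property described above, and rows past the bound are never touched.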

How, then, is the full-table figure derived from the sample?

The virtual column _sample_factor exposes the arithmetic:

SELECT 
    count() AS real_rows_read,                         -- 1. rows actually read from the sample
    any(_sample_factor) AS factor,                     -- 2. scaling factor; for SAMPLE 0.1, _sample_factor = 10
    count() * any(_sample_factor) AS estimated_total   -- 3. estimated row count of the whole table
FROM events 
SAMPLE 0.1;

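Sampling implies error, and the error is most visible when the data is unevenly distributed. In the example above, if user_id is extremely skewed (say a handful of IDs account for 50% of all rows), naive sampling breaks down.

The problem: if those high-frequency IDs happen to fall into the sampled 10%, their rows are scaled up 10 times in the estimate and the result is distorted.

For this reason the sampling key is usually hashed at table-creation time to spread the IDs evenly and avoid clustering: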

ORDER BY (CounterID, intHash32(user_id))
SAMPLE BY intHash32(user_id);

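With intHash32, IDs that were originally clustered are mapped pseudo-randomly and evenly over the whole hash space. A high-frequency ID still has many rows, but they all land in one fixed hash range, and the sample stays statistically unbiased.

It is important to understand, however, that hashing solves the clustering of IDs but not the skew of the data behind them.

Hashing removes the systematic bias caused by clustering:

Without hashing, the data may be ordered by time or region, and a 10% sample may capture only "early users" or "users in Beijing", which cannot represent the whole. Hashing scatters these attributes and keeps the sample unbiased: in expectation, the user mix in the sampled 10% of the space matches that of the full table.

Hashing cannot remove the chance error caused by uneven data volume per ID:

Even after the IDs are scattered, the error remains if different IDs own vastly different numbers of rows (data skew).

An extreme example: suppose there are 1000 users, 999 of whom have one record each, while one "super customer" has 1 million records.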

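Scenario A: the super customer's hash falls inside the sampled 10%. We actually read 1,000,000 + about 100 rows. Estimated total: $1{,}000{,}100 \times 10 = 10{,}001{,}000$, roughly 10 million.
Scenario B: the super customer's hash falls outside the sample (in the other 90%). We actually read about 100 rows. Estimated total: $100 \times 10 = 1000$.

As the example shows, the true total is about 1 million, yet the sampled estimate is either about 10 million or 1000. This error comes from the unevenness of the data itself.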

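If the error cannot be eliminated, why use sampling at all?

The law of large numbers helps: real workloads do have large customers, but rarely just one. When thousands of medium and large customers are spread evenly over the hash space, some are sampled and some are not, their positive and negative errors cancel out, and the result stabilizes.
Metrics differ in sensitivity:
- Sums of money or row counts are heavily affected by skew. If the goal is financial reconciliation, never use sampling.
- Conversion rates or funnels are much less affected. Even if one large customer has a lot of data, the sampled percentage stays accurate as long as that customer's conversion rate is close to that of ordinary users.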

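Custom Key in ClickHouse

By default, an ordinary Distributed query reads from one replica per shard (leaving Sample aside). With parallel replicas enabled, several replicas of a shard read together, which raises a problem: all replicas hold the same data, so if each read everything the result would be duplicated. Each replica must therefore be assigned a part of the data. CUSTOM_KEY does this as follows: the user specifies an expression as the partitioning key, and ClickHouse generates filter conditions from it so that different replicas read disjoint parts of the key space.

The difference between CUSTOM_KEY and SAMPLE_KEY is that SAMPLE_KEY relies on the sampling key in the table definition, whereas CUSTOM_KEY does not depend on the DDL and is specified explicitly by the user at query time: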

SET parallel_replicas_custom_key = 'cityHash64(user_id)';

or:

SET parallel_replicas_custom_key = 'customerId';

ClickHouse then divides the work among replicas by this key. Put simply:

SAMPLE_KEY: split by the sample key in the table definition
CUSTOM_KEY: split by the custom key set by the user

Context::getParallelReplicasMode() shows that CUSTOM_KEY takes precedence: as soon as parallel_replicas_custom_key is set, the ParallelReplicasMode is CUSTOM_KEY rather than SAMPLE_KEY:

Context::ParallelReplicasMode Context::getParallelReplicasMode() const
{
    const auto & settings_ref = getSettingsRef();
    using enum Context::ParallelReplicasMode;
    if (!settings_ref.parallel_replicas_custom_key.value.empty())
        return CUSTOM_KEY;
    if (settings_ref.allow_experimental_parallel_reading_from_replicas > 0)
        return READ_TASKS;
    return SAMPLE_KEY;
}

CUSTOM_KEY works by attaching a different filter to each replica. For example, with 2 replicas and the custom key:

cityHash64(user_id)

ClickHouse can then let:

replica 0 read the rows where cityHash64(user_id) % 2 = 0
replica 1 read the rows where cityHash64(user_id) % 2 = 1

Every row then matches the filter of exactly one replica, and the replicas together cover the whole data set with no duplicates and no gaps. This is the core idea of CUSTOM_KEY.

Leaving aside for now when the custom key filter is attached, let us first look at how it is generated. This is done by getCustomKeyFilterForParallelReplica():

// Build the Filter AST
ASTPtr getCustomKeyFilterForParallelReplica(
    size_t replicas_count, // total number of replicas
    size_t replica_num,    // index (offset) of the current replica
    ASTPtr custom_key_ast, // AST of the custom key
    ParallelReplicasCustomKeyFilter filter, // filter type
    const ColumnsDescription & columns,
    const ContextPtr & context)
{
    // DEFAULT filter type: build a modulo-based filter AST
    if (filter.filter_type == ParallelReplicasCustomKeyFilterType::DEFAULT)
    {
        // first we do modulo with replica count
        auto modulo_function = makeASTFunction("positiveModulo", custom_key_ast, std::make_shared<ASTLiteral>(replicas_count));

        /// then we compare result to the current replica number (offset)
        auto equals_function = makeASTFunction("equals", std::move(modulo_function), std::make_shared<ASTLiteral>(replica_num));

        return equals_function;
    }
    // Otherwise the filter type must be RANGE, and a range filter AST is built
    chassert(filter.filter_type == ParallelReplicasCustomKeyFilterType::RANGE);
    ........
}

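The key inputs are:
parallel_replicas_count: how many replicas take part. This is an internal setting that users should not set directly;
parallel_replica_offset: the index of the current replica. This is also an internal setting that users should not set directly;
parallel_replicas_custom_key: the expression used for splitting, set by the user;
parallel_replicas_custom_key_filter_type: how the filter is generated, set by the user.

As discussed below, the Custom Key Filter may be generated either on the Initiator or on the Replica (never on both), and in either case getCustomKeyFilterForParallelReplica() produces the Filter AST.

The corresponding settings are: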

M(UInt64, parallel_replicas_count, 0, "This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the number of parallel replicas participating in query processing.", 0) \
M(UInt64, parallel_replica_offset, 0, "This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the index of the replica participating in query processing among parallel replicas.", 0) \
M(String, parallel_replicas_custom_key, "", "Custom key assigning work to replicas when parallel replicas are used.", 0) \
M(ParallelReplicasCustomKeyFilterType, parallel_replicas_custom_key_filter_type, ParallelReplicasCustomKeyFilterType::DEFAULT, "Type of filter to use with custom key for parallel replicas. default - use modulo operation on the custom key, range - use range filter on custom key using all possible values for the value type of custom key.", 0) \

For example, with:

parallel_replicas_count = 2
parallel_replica_offset = 0
parallel_replicas_custom_key = cityHash64(user_id)

the generated filter would look like:

positiveModulo(cityHash64(user_id), 2) = 0

For the other replica:

parallel_replica_offset = 1

the generated filter is:

positiveModulo(cityHash64(user_id), 2) = 1

This filter then enters the query analysis of the current replica and restricts it to reading only its own share of the data.
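Why the modulo filter yields neither duplicates nor gaps can be verified with a short sketch. The names here are hypothetical: `positiveModuloSketch` imitates the non-negative remainder that the `positiveModulo` SQL function returns, the key is a plain signed integer rather than the result of cityHash64, and `replicaReadsRow` / `ownersOfRow` exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Non-negative remainder: the result is always in [0, b), even for negative a.
std::int64_t positiveModuloSketch(std::int64_t a, std::int64_t b)
{
    const std::int64_t r = a % b;
    return r < 0 ? r + b : r;
}

// The DEFAULT filter for one replica: positiveModulo(key, replicas_count) = replica_num.
bool replicaReadsRow(std::int64_t key, std::size_t replicas_count, std::size_t replica_num)
{
    return positiveModuloSketch(key, static_cast<std::int64_t>(replicas_count))
        == static_cast<std::int64_t>(replica_num);
}

// Counts how many replicas would read a given row. The partitioning is correct
// exactly when this is 1 for every key.
std::size_t ownersOfRow(std::int64_t key, std::size_t replicas_count)
{
    std::size_t owners = 0;
    for (std::size_t replica_num = 0; replica_num < replicas_count; ++replica_num)
        if (replicaReadsRow(key, replicas_count, replica_num))
            ++owners;
    return owners;
}
```

The non-negative remainder matters for signed keys: with the plain `%` operator, a key of -7 and 3 replicas would give -1, which matches no replica number, and the row would be lost. With the positive variant it maps to replica 2.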

In ClickHouse the filter can be attached at two different stages:

On the Initiator (earlier)

The Initiator clones the query once per replica
Each clone gets a different custom key filter in advance
Each clone is sent to one replica with GET_ONE

The Initiator generates the filters and sends a different query to each replica
original query:
    SELECT ...
    FROM table
    WHERE original_condition
initiator sends to shard1-replica1:
    SELECT ...
    FROM table
    WHERE original_condition
      AND filter_0
initiator sends to shard1-replica2:
    SELECT ...
    FROM table
    WHERE original_condition
      AND filter_1
shard1-replica1:
    executes the query that already carries filter_0
shard1-replica2:
    executes the query that already carries filter_1

With two replicas and cityHash64(user_id) as the custom_key, this amounts to:

initiator sends to shard1-replica1:
    SELECT ...
    FROM table
    WHERE original_condition
      AND positiveModulo(cityHash64(user_id), 2) = 0
initiator sends to shard1-replica2:
    SELECT ...
    FROM table
    WHERE original_condition
      AND positiveModulo(cityHash64(user_id), 2) = 1

The diagram below shows how the Initiator generates a filter for each Replica from the user-specified custom_key and sends the subquery carrying that filter to the corresponding Replica of the remote Shard:

User
  |
  v
initiator ClickHouse
  |
  | 1. InterpreterSelectQuery interprets the user's original query
  |
  | 2. ClusterProxy::executeQuery builds the remote shard queries
  |
  | 3. ReadFromRemote::addPipe
  |    - the current shard has two replicas
  |    - clone the query twice
  |    - add filter_0 to the query for replica 0
  |    - add filter_1 to the query for replica 1
  |    - create one RemoteQueryExecutor per query
  |
  +------------------------------+
  |                              |
  v                              v
remote replica 0 ClickHouse      remote replica 1 ClickHouse
  |                              |
  | 4. receives the subquery     | 4. receives the subquery
  |    carrying filter_0         |    carrying filter_1
  |                              |
  | 5. InterpreterSelectQuery    | 5. InterpreterSelectQuery
  |    interprets and runs it    |    interprets and runs it
  |                              |
  | 6. reads only the rows that  | 6. reads only the rows that
  |    match filter_0            |    match filter_1
  |                              |
  +---------------+--------------+
                  |
                  v
          initiator ClickHouse
                  |
                  | 7. merges the data returned by both replicas
                  v
                User

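On the remote replica (later)

The initiator sends the same query to several replicas within a shard.
Each replica receives a different parallel_replica_offset along with it.
The remote replica builds the custom key filter in its own InterpreterSelectQuery.

From the user's point of view, the initiator sends one and the same query but attaches a different offset setting for each replica.

For example, given the original query:

sql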

    SELECT ...
    FROM table
    WHERE original_condition

the initiator sends the following to the two replicas:

text

initiator sends to shard1-replica1:
    query:
        SELECT ...
        FROM table
        WHERE original_condition
    settings:
        parallel_replicas_count = 2
        parallel_replica_offset = 0
initiator sends to shard1-replica2:
    query:
        SELECT ...
        FROM table
        WHERE original_condition
    settings:
        parallel_replicas_count = 2
        parallel_replica_offset = 1

The queries the two replicas actually execute are:

text

shard1-replica1:
    builds filter_0 itself from parallel_replica_offset = 0
    actually executes:
        SELECT ...
        FROM table
        WHERE original_condition
          AND filter_0
shard1-replica2:
    builds filter_1 itself from parallel_replica_offset = 1
    actually executes:
        SELECT ...
        FROM table
        WHERE original_condition
          AND filter_1

The following diagram shows the call flow in this mode:

text

User
  |
  v
initiator ClickHouse
  |
  | 1. InterpreterSelectQuery interprets the user's original query
  |
  | 2. ClusterProxy::executeQuery builds the remote shard query
  |
  | 3. ReadFromRemote::addPipe
  |    - the current shard has two replicas
  |    - the query is not cloned
  |    - no filter is added on the initiator
  |    - only one RemoteQueryExecutor is created
  |    - PoolMode = GET_MANY
  |
  | 4. RemoteQueryExecutor creates connections
  |    - obtains two replica connections from the shard's pool
  |
  | 5. MultiplexedConnections::sendQuery
  |    - sends the same query to both replicas
  |    - settings for replica 0:
  |        parallel_replicas_count = 2
  |        parallel_replica_offset = 0
  |    - settings for replica 1:
  |        parallel_replicas_count = 2
  |        parallel_replica_offset = 1
  |
  +----------------------------------+
  |                                  |
  v                                  v
remote replica 0 ClickHouse          remote replica 1 ClickHouse
  |                                  |
  | 6. Receives the same subquery    | 6. Receives the same subquery
  |    with offset = 0 in settings   |    with offset = 1 in settings
  |                                  |
  | 7. InterpreterSelectQuery        | 7. InterpreterSelectQuery
  |    interprets the subquery       |    interprets the subquery
  |                                  |
  | 8. Builds filter_0 itself        | 8. Builds filter_1 itself
  |    from the settings             |    from the settings
  |                                  |
  | 9. Actually executes:            | 9. Actually executes:
  |    original query AND filter_0   |    original query AND filter_1
  |                                  |
  | 10. Reads only rows whose        | 10. Reads only rows whose
  |     custom key matches filter_0  |     custom key matches filter_1
  |                                  |
  +-----------------+----------------+
                    |
                    v
            initiator ClickHouse
                    |
                    | 11. Merges the data returned by both replicas
                    v
                  User

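Next, we walk through the execution of both cases in detail.

Parallel Replica Custom Filter on the Initiator

The question in this case is whether the custom key filter may be applied at the level of the whole cluster. ClickHouse allows this only when the cluster behind the Distributed table has exactly one shard (with more than one replica in it), for example:

xml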

<remote_servers>
    <one_shard_cluster>
        <shard>
            <replica>
                <host>replica1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>replica2</host>
                <port>9000</port>
            </replica>
        </shard>
    </one_shard_cluster>
</remote_servers>

cpp

void StorageDistributed::read(
    QueryPlan & query_plan,
    const Names &,
    const StorageSnapshotPtr & storage_snapshot,
    SelectQueryInfo & query_info,
    ContextPtr local_context,
    QueryProcessingStage::Enum processed_stage,
    const size_t /*max_block_size*/,
    const size_t /*num_streams*/)
{
    ......
    // If the custom key filter can be used at the cluster level, build the corresponding filter closure
    auto shard_filter_generator = ClusterProxy::getShardFilterGeneratorForCustomKey(
        *modified_query_info.getCluster(), local_context, getInMemoryMetadataPtr()->columns);
    
    ClusterProxy::executeQuery(
        query_plan,
        ......
        shard_filter_generator, // pass in the closure that produces the per-replica filter
        is_remote_function);
    
    /// This is a bug, it is possible only when there is no shards to query, and this is handled earlier.
    if (!query_plan.isInitialized())
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Pipeline is not initialized");
}

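Let us look at ClusterProxy::getShardFilterGeneratorForCustomKey(). It is a free function in the ClusterProxy namespace (ClusterProxy is a namespace, not a class), and its job is to decide whether the custom key filter can be applied at the cluster level:

cpp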

using AdditionalShardFilterGenerator = std::function<ASTPtr(uint64_t)>;

AdditionalShardFilterGenerator
getShardFilterGeneratorForCustomKey(const Cluster & cluster, ContextPtr context, const ColumnsDescription & columns)
{
    /// Only the "one shard, several replicas" custom key parallel replicas case
    /// requires the initiator to prepare a different filter for each replica.
    if (!context->canUseParallelReplicasCustomKeyForCluster(cluster))
        return {};  // the custom key filter cannot be used at the cluster level, so the generator is empty

    const auto & settings = context->getSettingsRef();
    /// Parse the user-supplied parallel_replicas_custom_key into an AST.
    auto custom_key_ast = parseCustomKeyForTable(settings.parallel_replicas_custom_key, *context);
    if (custom_key_ast == nullptr) // on parse failure return an empty generator; the per-replica rewrite is then skipped
        return {};

    /// Return a "replica_num -> filter AST" function.
    /// ReadFromRemote::addPipe() calls it with each replica number
    /// to produce non-overlapping custom key filters for the replicas of one shard.
    return [my_custom_key_ast = std::move(custom_key_ast),
            column_description = columns,
            custom_key_type = settings.parallel_replicas_custom_key_filter_type.value,
            custom_key_range_lower = settings.parallel_replicas_custom_key_range_lower.value,
            custom_key_range_upper = settings.parallel_replicas_custom_key_range_upper.value,
            query_context = context,
            replica_count = cluster.getShardsInfo().front().per_replica_pools.size()](uint64_t replica_num) -> ASTPtr
    {
        return getCustomKeyFilterForParallelReplica(
            replica_count,
            replica_num - 1,  // replica_num is 1-based on input; getCustomKeyFilterForParallelReplica() expects a 0-based replica offset
            my_custom_key_ast,
            {custom_key_type, custom_key_range_lower, custom_key_range_upper},
            column_description,
            query_context);
    };
}

As we can see, getShardFilterGeneratorForCustomKey() first checks whether the custom key filter can be used directly at the cluster level, by calling Context::canUseParallelReplicasCustomKeyForCluster(). As already stated, the initiator applies the custom key filter itself only when the cluster has exactly one shard:

cpp

// Whether the custom key filter may be applied to the whole cluster; the key condition is cluster.getShardCount() == 1
bool Context::canUseParallelReplicasCustomKeyForCluster(const Cluster & cluster) const
{
    return canUseParallelReplicasCustomKey() && cluster.getShardCount() == 1 && cluster.getShardsInfo()[0].getAllNodeCount() > 1;
}

/**
 * True if more than one parallel replica is configured and the current parallel replicas mode is CUSTOM_KEY.
 */
bool Context::canUseParallelReplicasCustomKey() const
{
    return settings->max_parallel_replicas > 1 && getParallelReplicasMode() == Context::ParallelReplicasMode::CUSTOM_KEY;
}

The initiator then parses the custom key expression and returns a lambda. This lambda, of type AdditionalShardFilterGenerator, calls getCustomKeyFilterForParallelReplica() and, given replica_num (the replica's ordinal, not the total count), returns the filter AST. ReadFromRemote::addPipe() later invokes this closure once per replica to produce that replica's filter AST.

So StorageDistributed::read() checks whether a cluster-level custom key filter is possible and, if so, creates the corresponding shard_filter_generator. ReadFromRemote::addPipe() then clones the query for each replica and adds that replica's filter. Let us look at how ReadFromRemote::addPipe() handles the custom key filter.

As mentioned earlier, ReadFromRemote::addPipe() is called for one specific shard, not for all shards in the cluster and not for a single replica within a shard. So here it generates the filter AST for every replica in that shard, attaches it to that replica's query, and creates a RemoteQueryExecutor per replica. Each replica then receives a subquery whose filter was already prepared on the initiator:

cpp

void ReadFromRemote::addPipe(Pipes & pipes, const ClusterProxy::SelectStreamFactory::Shard & shard)
{
    bool add_agg_info = stage == QueryProcessingStage::WithMergeableState;
    ....
    /// parallel replicas custom key case
    /// The initiator splits one shard query into several per-replica queries.
    /// Each replica gets its own clone of the query with its own custom key filter, so the ranges they read do not overlap.
    if (shard.shard_filter_generator) // this shard takes the initiator-side per-replica rewrite path
    {
        // Iterate over every replica in this shard
        for (size_t i = 0; i < shard.shard_info.per_replica_pools.size(); ++i)
        {
            /// Each replica gets a different WHERE filter, so the query AST must be cloned before it is modified.
            auto query = shard.query->clone();
            auto & select_query = query->as<ASTSelectQuery &>();
            /// shard_filter_generator takes a 1-based replica number
            /// and returns the custom key filter for that replica.
            auto shard_filter = shard.shard_filter_generator(i + 1); // invokes the closure returned by `ClusterProxy::getShardFilterGeneratorForCustomKey()`
            if (shard_filter) // a filter was generated
            {
                auto where_expression = select_query.where();
                if (where_expression)
                    /// Keep the original WHERE condition and AND the replica's filter onto it.
                    shard_filter = makeASTFunction("and", where_expression, shard_filter);

                /// Rewrite the cloned query so that it reads only the custom key range assigned to this replica.
                select_query.setExpression(ASTSelectQuery::Expression::WHERE, std::move(shard_filter));
            }

            const String query_string = formattedAST(query);

            if (!priority_func_factory.has_value())
                priority_func_factory = GetPriorityForLoadBalancing(LoadBalancing::ROUND_ROBIN, randomSeed());

            /// Build the RemoteQueryExecutor for this replica
            auto remote_query_executor = std::make_shared<RemoteQueryExecutor>(
                shard.shard_info.pool,
                query_string,
                shard.header,
                context,
                throttler,
                scalars,
                external_tables,
                stage,
                std::nullopt,
                priority_func);
            remote_query_executor->setLogger(log);
            remote_query_executor->setPoolMode(PoolMode::GET_ONE); // PoolMode::GET_ONE is set here

            if (!table_func_ptr)
                remote_query_executor->setMainTable(shard.main_table ? shard.main_table : main_table);
        }
    }
    else
    {
        ....
    }

As discussed earlier, PoolMode is a property of RemoteQueryExecutor. Here every replica has its own custom key filter and its own RemoteQueryExecutor, so each executor only needs to talk to one replica. The PoolMode is therefore GET_ONE rather than GET_MANY, which means max_entries = 1 later on:

cpp

std::vector<ConnectionPoolWithFailover::TryResult> ConnectionPoolWithFailover::getManyImpl(
    const Settings & settings,
    PoolMode pool_mode,
    ....)
{
    ....
    /// How many replica connections to obtain at most.
    if (pool_mode == PoolMode::GET_ALL)
    {
        /// Take all of them.
        min_entries = nested_pools.size();
        max_entries = nested_pools.size();
    }
    else if (pool_mode == PoolMode::GET_ONE)
    {
        /// Take only one.
        max_entries = 1;
    }
    else if (pool_mode == PoolMode::GET_MANY)
    {
        /// Take at most max_parallel_replicas.
        max_entries = settings.max_parallel_replicas;
    }
    ....

Adding the Custom Key Filter on the Remote Replica

As stated above, the initiator prepares a custom key filter for every replica only when the whole cluster has exactly one shard.

If the cluster has more than one shard, the filter cannot be added on the initiator and each replica has to add it itself. In that case ReadFromRemote::addPipe() on the initiator takes the else branch:

cpp

void ReadFromRemote::addPipe(Pipes & pipes, const ClusterProxy::SelectStreamFactory::Shard & shard)
{
    bool add_agg_info = stage == QueryProcessingStage::WithMergeableState;
    bool add_totals = false;
    bool add_extremes = false;
    bool async_read = context->getSettingsRef().async_socket_for_remote;
    bool async_query_sending = context->getSettingsRef().async_query_sending_for_remote;
    if (stage == QueryProcessingStage::Complete)
    {
        add_totals = shard.query->as<ASTSelectQuery &>().group_by_with_totals;
        add_extremes = context->getSettingsRef().extremes;
    }

    scalars["_shard_num"]
        = Block{{DataTypeUInt32().createColumnConst(1, shard.shard_info.shard_num), std::make_shared<DataTypeUInt32>(), "_shard_num"}};

    if (context->canUseTaskBasedParallelReplicas()) // the current mode is READ_TASKS
    {
        if (context->getSettingsRef().cluster_for_parallel_replicas.changed)
        {

            const String cluster_for_parallel_replicas = context->getSettingsRef().cluster_for_parallel_replicas;
            if (cluster_for_parallel_replicas != cluster_name) // the configured cluster_for_parallel_replicas differs from the Distributed table's cluster
                LOG_INFO(
                    log,
                    "cluster_for_parallel_replicas has been set for the query but has no effect: {}. Distributed table cluster is "
                    "used: {}",
                    cluster_for_parallel_replicas,
                    cluster_name);
        }

        LOG_TRACE(log, "Setting `cluster_for_parallel_replicas` to {}", cluster_name);
        context->setSetting("cluster_for_parallel_replicas", cluster_name); // record the cluster used for parallel replicas
    }

    /// Custom key parallel replicas: the initiator splits one shard query into several per-replica queries.
    /// Each replica gets its own clone of the query with its own custom key filter, so the ranges they read do not overlap.
    if (shard.shard_filter_generator)
    {
        for (size_t i = 0; i < shard.shard_info.per_replica_pools.size(); ++i)
        {
            ......   
        }
    }
    else
    {
        const String query_string = formattedAST(shard.query);

        auto remote_query_executor = std::make_shared<RemoteQueryExecutor>(
            shard.shard_info.pool, query_string, shard.header, context, throttler, scalars, external_tables, stage);
        remote_query_executor->setLogger(log);

        if (context->canUseTaskBasedParallelReplicas()) // in READ_TASKS mode only one coordinator is used, so PoolMode is GET_ONE
        {
            remote_query_executor->setPoolMode(PoolMode::GET_ONE);
        }
        else
            remote_query_executor->setPoolMode(PoolMode::GET_MANY);

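        // Set the main table, because it affects replica selection: only replicas that have this table, and have it sufficiently up to date, are chosen.
        // If the shard has its own main_table, use it; otherwise use the main_table passed in when the ReadFromRemote object was constructed.
        // For a query on a Distributed table, main_table is the ReplicatedMergeTree table behind it.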
        if (!table_func_ptr)
            remote_query_executor->setMainTable(shard.main_table ? shard.main_table : main_table);

        pipes.emplace_back(
            createRemoteSourcePipe(remote_query_executor, add_agg_info, add_totals, add_extremes, async_read, async_query_sending));
        addConvertingActions(pipes.back(), output_stream->header, shard.has_missing_objects);
    }
}

As analyzed in detail above, when parallel_replicas_custom_key is set but the initiator cannot apply the custom key filter at the cluster level, the filter has to be applied on each remote replica. The PoolMode is then GET_MANY: a single RemoteQueryExecutor opens connections to several replicas in the shard and sends the same SQL to all of them.

So that the remote replicas scan their parts of the data without overlap and without gaps, MultiplexedConnections puts each replica's offset into the settings as metadata when sending the query:

cpp

void MultiplexedConnections::sendQuery(
    const ConnectionTimeouts & timeouts,
    const String & query,
    const String & query_id,
    UInt64 stage,
    ClientInfo & client_info,
    bool with_pending_data)
{
    ....
    /// Offset parallel replicas:
    /// the initiator sends the same query over several replica connections and uses
    /// parallel_replicas_count / parallel_replica_offset to tell each remote replica which share of the data is its own.
    const bool enable_offset_parallel_processing = context->canUseOffsetParallelReplicas();

    size_t num_replicas = replica_states.size();
    if (num_replicas > 1)
    {
        if (enable_offset_parallel_processing) // enabled by default; explicitly disabled only when the initiator already applied the custom key filter, so that replicas do not filter a second time
            /// Total number of replicas taking part in this parallel read
            modified_settings.parallel_replicas_count = num_replicas;

        for (size_t i = 0; i < num_replicas; ++i)
        {
            if (enable_offset_parallel_processing)
                /// Replica number of this connection, starting from 0.
                /// The remote InterpreterSelectQuery uses this offset to build the replica's custom key / sample filter.
                modified_settings.parallel_replica_offset = i;

            /// The query string is identical; the division of work is expressed through the offset in modified_settings.
            replica_states[i].connection->sendQuery(
                timeouts, query, /* query_parameters */ {}, query_id, stage, &modified_settings, &client_info, with_pending_data, {});
        }
    }
    else
    {
        /// With a single replica connection there is no need to set the parallel replica offset.
        replica_states[0].connection->sendQuery(
            timeouts, query, /* query_parameters */ {}, query_id, stage, &modified_settings, &client_info, with_pending_data, {});
    }

    sent_query = true;
}

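When sending the query, the code checks whether offset-based parallel replicas are enabled. If so, it sets parallel_replicas_count and parallel_replica_offset in the settings, which tells each remote replica which part of the data it is responsible for:

cpp

/// The initiator does not write different filters into the SQL; it passes a different parallel_replica_offset to each replica instead.
/// On receiving the query, the remote replica builds its own filter from its offset and the total replica count parallel_replicas_count.
/// If another parallel replicas mode is already in effect, such as READ_TASKS or an initiator-side per-replica SQL rewrite, offset mode must not be enabled as well, or the data could be split or filtered twice.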
bool offset_parallel_replicas_enabled = true;
    
    
bool Context::canUseOffsetParallelReplicas() const
{
    return offset_parallel_replicas_enabled && settings->max_parallel_replicas > 1
        && getParallelReplicasMode() != Context::ParallelReplicasMode::READ_TASKS;
}

As we can see, offset-based parallel replicas are used on the replica side only when all of the following hold:

offset_parallel_replicas_enabled has not been explicitly turned off. As the code above shows, it is on by default. However, if the initiator has already applied the cluster-level custom key filter, there is no need to add it again on the replica side, so offset_parallel_replicas_enabled is explicitly disabled in that case.
This happens in ClusterProxy::updateSettingsAndClientInfoForCluster(), which is reached through the following call chain:

text

StorageDistributed::read()
  -> ClusterProxy::executeQuery()
  -> updateSettingsAndClientInfoForCluster()

cpp

ContextMutablePtr updateSettingsAndClientInfoForCluster(const Cluster & cluster,
    bool is_remote_function,
    ContextPtr context,
    const Settings & settings,
    const StorageID & main_table,
    ASTPtr additional_filter_ast,
    LoggerPtr log,
    const DistributedSettings * distributed_settings)
{
    .....
    auto new_context = Context::createCopy(context);
    new_context->setSettings(new_settings);
    new_context->setClientInfo(new_client_info);
    // The initiator already applies the cluster-level parallel replica custom key, so offset parallel replicas must be disabled
    if (context->canUseParallelReplicasCustomKeyForCluster(cluster))
        new_context->disableOffsetParallelReplicas();
    
    return new_context;
}

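max_parallel_replicas must be greater than 1, which needs no further explanation.
ParallelReplicasMode must not be READ_TASKS. READ_TASKS is a separate parallel replicas mechanism. Instead of statically assigning a filter to each replica by offset, it:
- first connects to one coordinator replica
- lets the coordinator hand out read tasks to the other replicas dynamically
  So if the current mode is READ_TASKS, offset parallel replicas must not be enabled as well. Otherwise:
- the coordinator would be assigning tasks dynamically
- while the offset filter would be slicing the data statically
  and the two mechanisms would conflict.

After receiving the query, the remote replica uses the following settings in its own InterpreterSelectQuery:

parallel_replicas_count
parallel_replica_offset
parallel_replicas_custom_key

to generate its own custom key filter. The code is:

cpp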

InterpreterSelectQuery::InterpreterSelectQuery(
    const ASTPtr & query_ptr_,
    const ContextMutablePtr & context_,
    ...
{
    if (storage && context->canUseParallelReplicasCustomKey() && !joined_tables.tablesWithColumns().empty())
    {
        if (settings.parallel_replicas_count > 1) // custom key mode is active and parallel_replicas_count > 1
        {
            if (auto custom_key_ast = parseCustomKeyForTable(settings.parallel_replicas_custom_key, *context))
            {
                parallel_replicas_custom_filter_ast = getCustomKeyFilterForParallelReplica(
                    settings.parallel_replicas_count, // sent by the initiator: number of replicas
                    settings.parallel_replica_offset, // sent by the initiator: offset of this replica
                    std::move(custom_key_ast),  // AST of the custom key
                    {settings.parallel_replicas_custom_key_filter_type,
                     settings.parallel_replicas_custom_key_range_lower,
                     settings.parallel_replicas_custom_key_range_upper},
                    storage->getInMemoryMetadataPtr()->columns,
                    context);
            }
        }

        if (parallel_replicas_custom_filter_ast) // attach the generated filter AST to the query
        {
            parallel_replicas_custom_filter_info = generateFilterActions(
                table_id, parallel_replicas_custom_filter_ast, context, storage, storage_snapshot, metadata_snapshot, required_columns,
                prepared_sets);

            parallel_replicas_custom_filter_info->do_remove_column = true;
            query_info.filter_asts.push_back(parallel_replicas_custom_filter_ast);
        }
    }

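Here too, getCustomKeyFilterForParallelReplica() is called to build the AST from the user's custom key expression, the current offset and the related settings. The function was introduced above, so it is not repeated here.

CUSTOM_KEY supports two filter types.

The default is the modulo filter.

It looks like:

sql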

custom_key % parallel_replicas_count = parallel_replica_offset

It suits keys that are evenly distributed after hashing.

For example:

sql

SET parallel_replicas_custom_key = 'cityHash64(user_id)';
SET parallel_replicas_custom_key_filter_type = 'default';

The filter type is selected through parallel_replicas_custom_key_filter_type:

cpp

M(ParallelReplicasCustomKeyFilterType, parallel_replicas_custom_key_filter_type, ParallelReplicasCustomKeyFilterType::DEFAULT, "Type of filter to use with custom key for parallel replicas. default - use modulo operation on the custom key, range - use range filter on custom key using all possible values for the value type of custom key.", 0) \

The range filter

The range filter splits the key's value range. For example, if the whole range is [0, 100) and there are 2 replicas:

text

replica 0 reads [0, 50)
replica 1 reads [50, 100)

Related background

C++ aggregate types, aggregate initialization, and list initialization

When discussing Priority earlier, we ran into an aggregate type and aggregate initialization:

cpp

struct Priority
{
    Int64 value = 0; /// Note that lower value means higher priority.
    /// Conversion operator from Priority to Int64
    constexpr operator Int64() const { return value; } /// NOLINT
};

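As the code shows, Priority declares no constructor that takes arguments, yet we construct it with braces {}, assigning values in the order the members are declared. An ordinary constructor call uses parentheses (), so this is not a constructor call; it is aggregate initialization:

cpp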

get_priority = [](size_t i) { return Priority{static_cast<Int64>(i)}; };

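In the C++ standard, an aggregate is a type that meets certain conditions and can therefore be initialized directly from a list of member values. This form of initialization is called aggregate initialization.

Intuitively, a class (or struct) without complex "internal logic" (private members, user-provided constructors, virtual functions and so on) is just a simple collection of data.

The core conditions for a class to be an aggregate are:

No user-declared constructors (the exact wording varies between standard versions; C++20 forbids any user-declared constructor).
No private or protected non-static data members (all of them must be public).
No base classes (relaxed since C++17, which allows public non-virtual base classes).
No virtual functions.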

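How should we read the word "aggregate"?

The term can feel abstract, but in this context:

Physical level: the object is essentially just its members laid out in declaration order, with no hidden bookkeeping such as a vtable pointer (padding between members may still exist).
Logical level: the type does not enforce the validity of its data.
- Non-aggregate (a safe): a Person class has a constructor that wants to make sure age is at least 0 or that name is not empty. You must pass its "security check" (the constructor) to create an object.
- Aggregate (a plastic bag): Point simply holds x and y together. It does not care what x is; you can drop the data straight in.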

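Easily confused with aggregate initialization is list initialization. They look alike because both use braces { }.

List initialization is a general-purpose initialization syntax introduced in C++11, commonly known as "brace {} initialization".

It was introduced to tidy up the inconsistent initialization syntax of earlier C++ (some forms used (), some used =, some could not be used inside a class) and to offer one uniform form.

Depending on whether = is present, list initialization comes in two flavors:

Direct-list-initialization:
int x{10}; or Person p{"Alice", 18};
Copy-list-initialization:
int x = {10}; or Person p = {"Alice", 18};

Note: in the vast majority of cases the two behave the same. The main difference is that a constructor declared explicit cannot be used by copy-list-initialization (the form with =).

With list initialization, almost everything can be initialized with {}:

cpp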

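    // 1. Fundamental types
    int a{5};

    // 2. Arrays
    int arr[]{1, 2, 3};

    // 3. Aggregates (as described above)
    Point p{10, 20};

    // 4. Constructor calls
    std::string s{"Hello"};

    // 5. Containers (one of the most useful cases)
    std::vector<int> v{1, 2, 3, 4, 5};

    // 6. Dynamic allocation
    int* ptr = new int[3]{1, 2, 3};

    // 7. Return values
    return {1, "Success"}; // inside a function: the braces initialize the function's declared return type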

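List initialization is more than a change of spelling. It has two important behaviors:

Preventing narrowing conversions

This is its most valuable safety feature. With () or =, the compiler permits implicit conversions that may lose information, whereas {} makes such a conversion ill-formed, so the compiler reports an error (or, in some cases, at least a warning).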
```
int x = 3.14;   // compiles; x becomes 3 (precision is lost)
int y(3.14);    // compiles; y becomes 3

int z{3.14};    // error: the compiler rejects the narrowing conversion
```
Preferring std::initializer_list

If a class has a constructor taking std::initializer_list (as containers such as vector and list do), {} prefers that constructor whenever it is viable.

This leads to a classic pitfall:
```
std::vector<int> v1(10, 2); // ordinary constructor: 10 elements, each equal to 2
std::vector<int> v2{10, 2}; // initializer_list constructor: 2 elements, 10 and 2
```

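When we write T object{arg1, arg2...}, the compiler's decision process is roughly:

Is T an aggregate? If so, perform aggregate initialization (fill the members directly; no constructor is involved).
Does T have a std::initializer_list constructor? If so, try to pass the braced arguments to it as a list.
Otherwise, look for an ordinary constructor whose parameters match the arguments in number and type.
If no ordinary constructor matches either, compilation fails.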

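So the relationship between the two is this: "list initialization" is a syntactic form, while "aggregate initialization" is the semantics it takes on under particular conditions. List-initializing an aggregate type is, in essence, aggregate initialization.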

cpp

#include <iostream>
#include <string>

using namespace std;

// 1. Pure aggregate: no user-declared constructors
struct Point
{
    int x = 0;
    int y = 0;
};

// 2. A type with ordinary constructors
struct Person
{
    string name;
    int age;

    Person() : name("default"), age(0)
    {
        cout << "Person default constructor\n";
    }

    Person(string n, int a) : name(n), age(a)
    {
        cout << "Person normal constructor\n";
    }
};

// 3. A lightweight wrapper like Priority: no user-declared constructors, so aggregate initialization applies
struct Priority
{
    long long value = 0;

    constexpr operator long long() const
    {
        return value;
    }
};

int main()
{
    cout << "==== 1) Aggregate initialization of Point ====\n";
    Point p1{10, 20};   // Point is an aggregate, so this list initialization is aggregate initialization
    Point p2{};         // aggregate initialization with no values; members use their default initializers
    Point p3;           // default initialization; the members have default initializers, so still 0, 0

    cout << "p1 = {" << p1.x << ", " << p1.y << "}\n";
    cout << "p2 = {" << p2.x << ", " << p2.y << "}\n";
    cout << "p3 = {" << p3.x << ", " << p3.y << "}\n\n";

    cout << "==== 2) Person: default / direct / list initialization ====\n";
    Person a;                  // default constructor
    Person b("Alice", 18);     // direct initialization: calls the constructor
    Person c{"Bob", 20};       // list initialization: also resolves to the constructor

    cout << "a = {name=" << a.name << ", age=" << a.age << "}\n";
    cout << "b = {name=" << b.name << ", age=" << b.age << "}\n";
    cout << "c = {name=" << c.name << ", age=" << c.age << "}\n\n";

    cout << "==== 3) Priority: aggregate initialization + conversion operator ====\n";
    Priority pr1{5};   // aggregate initialization: value is set to 5
    Priority pr2{};    // aggregate initialization: value uses its default of 0
    Priority pr3;      // default initialization: value is also 0

    long long x = pr1; // implicitly calls operator long long()

    cout << "pr1.value = " << pr1.value << "\n";
    cout << "pr2.value = " << pr2.value << "\n";
    cout << "pr3.value = " << pr3.value << "\n";
    cout << "x = " << x << "\n\n";
    return 0;
}

The output is:

text

==== 1) Aggregate initialization of Point ====
p1 = {10, 20}
p2 = {0, 0}
p3 = {0, 0}

==== 2) Person: default / direct / list initialization ====
Person default constructor
Person normal constructor
Person normal constructor
a = {name=default, age=0}
b = {name=Alice, age=18}
c = {name=Bob, age=20}

==== 3) Priority: aggregate initialization + conversion operator ====
pr1.value = 5
pr2.value = 0
pr3.value = 0
x = 5

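We can therefore conclude:

Point and Priority declare no constructors, so they are aggregates and can be initialized member by member with { }.
Priority{5} does not call Priority(long long), because no such constructor exists. It simply initializes value to 5, which is aggregate initialization.
Person has user-provided constructors, so it is not an aggregate and aggregate initialization does not apply to it.
Person b("Alice", 18) is direct initialization and calls the constructor. Person c{"Bob", 20} is list initialization, but it also resolves to the constructor; it is not aggregate initialization.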

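Closures and Callbacks

When discussing PoolWithFailoverBase<TNestedPool> earlier, we saw that it declares two callback types that callers implement, typically as lambda closures. PoolWithFailoverBase<TNestedPool> does not care about the concrete implementation; it just invokes the two callbacks inside its own logic:

try_get_entry: tells PoolWithFailoverBase how to "try" to obtain a usable connection from a nested pool once that pool has been selected. PoolWithFailoverBase does not know what kind of resource a nested pool holds and does not call any connection logic directly. It is only responsible for scheduling and failover; "how to get a connection" is delegated to this callback.

cpp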

using TryGetEntryFunc = std::function<TryResult(const NestedPoolPtr & pool, std::string & fail_message)>;
std::vector<IConnectionPool::Entry> ConnectionPoolWithFailover::getMany(
    .....
    GetPriorityForLoadBalancing::Func priority_func)
{
    /// This is the most basic entry point for obtaining connections to several replicas.
    /// Unlike getManyChecked(), it only cares about the connection itself and does not check the state of a particular table on the replica.
    TryGetEntryFunc try_get_entry = [&](const NestedPoolPtr & pool, std::string & fail_message)
    {
        return tryGetEntry(pool, timeouts, fail_message, settings, nullptr, async_callback);
    };
    /// Replica selection, retries and failover all happen inside getManyImpl().
    std::vector<TryResult> results = getManyImpl(
        settings, pool_mode,
        try_get_entry, // pass in the closure
        skip_unavailable_endpoints, priority_func);
    ....
}

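priority_func: tells PoolWithFailoverBase what priority each IConnectionPool in its nested_pools has. It turns a policy such as load_balancing into a comparable priority value, so that PoolWithFailoverBase can order replicas by the caller's policy in addition to error counts, slow-replica state and configured priority.

cpp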

    /// The client can provide this functor to affect load balancing - the index of a pool is passed to
    /// this functor. The pools with lower result value will be tried first.
    using GetPriorityFunc = std::function<Priority(size_t index)>;
    
    ConnectionPoolWithFailover::Base::GetPriorityFunc ConnectionPoolWithFailover::makeGetPriorityFunc(const Settings & settings)
    {
        // The first offset from the user's configuration, taken modulo the number of nested pools; it takes effect when the FIRST_OR_RANDOM load balancing policy is selected
        const size_t offset = settings.load_balancing_first_offset % nested_pools.size();
        const LoadBalancing load_balancing = LoadBalancing(settings.load_balancing); // build a LoadBalancing value from the configured policy
        // get_priority_load_balancing supplies the GetPriorityFunc implementation for that policy
        return get_priority_load_balancing.getPriorityFunc(load_balancing, offset, nested_pools.size());
    }

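This is a good point to look at closures.

In C++, a closure is a powerful and practical concept. Put simply, a closure is a function object that remembers, and can access, variables from the scope in which it was defined. Since C++11, closures are mainly created through lambda expressions.

The terms "lambda" and "closure" are often used interchangeably, but strictly speaking they differ slightly:

Lambda expression: the code we write (syntactic sugar), for example [ ]( ) { }.
Closure: the temporary object produced at run time when the lambda expression is evaluated. It holds the function body together with the captured variables.

A typical C++ lambda (closure) consists of the following parts:

Capture list []: determines which outer variables the closure can "take along".
Parameter list (): like the parameters of an ordinary function.
Return type ->: can usually be omitted and is deduced by the compiler.
Function body {}: the actual logic.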

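| Capture mode | Syntax | Description |
| --- | --- | --- |
| Explicit capture by value | `[x]` | Copies x into the closure. Modifying the copy does not affect the outer variable, and the copy cannot be modified at all unless the lambda is declared mutable. |
| Explicit capture by reference | `[&x]` | The closure stores a reference to x. Modifications inside are visible in the outer variable. |
| Implicit capture by value | `[=]` | The compiler captures by value every local variable the body uses. |
| Implicit capture by reference | `[&]` | The compiler captures by reference every local variable the body uses. |
| Mixed capture | `[=, &x]` | Captures everything by value by default, but x explicitly by reference. |
| Capturing this | `[this]` | Used inside member functions; lets the lambda access the object's member variables and member functions. |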

The following code demonstrates how the different capture modes behave:

cpp

#include <iostream>
#include <vector>
#include <memory>

class CaptureDemo {
public:
    int member_val = 100;

    void test_this_capture() {
        std::cout << "\n--- 6. [this] Capture ---" << std::endl;
        // Capture the this pointer to access member variables
        auto lambda = [this]() {
            member_val += 10;
            std::cout << "Member value after lambda: " << member_val << std::endl;
        };
        lambda();
    }
};

int main() {
    int x = 10;
    int y = 20;
    int z = 30;

    std::cout << "Original Addresses & Values:" << std::endl;
    std::cout << "x: " << &x << " | val: " << x << std::endl;
    std::cout << "y: " << &y << " | val: " << y << std::endl;
    std::cout << "z: " << &z << " | val: " << z << std::endl;

    // 1. Explicit capture by value [x]
    // The lambda's copy is fixed at the point of definition, even if the outer x changes afterwards
    auto explicit_value = [x]() mutable {
        x += 5; 
        std::cout << "\n--- 1. [x] Explicit Value ---" << std::endl;
        std::cout << "Inside Lambda x addr: " << &x << " | val: " << x << std::endl;
    };

    // 2. Explicit capture by reference [&y]
    auto explicit_ref = [&y]() {
        y += 5;
        std::cout << "\n--- 2. [&y] Explicit Reference ---" << std::endl;
        std::cout << "Inside Lambda y addr: " << &y << " | val: " << y << std::endl;
    };

    // 3. Implicit capture by value [=]
    auto implicit_value = [=]() {
        std::cout << "\n--- 3. [=] Implicit Value ---" << std::endl;
        std::cout << "Inside Lambda z (copy) val: " << z << std::endl;
        // z++; // error: a by-value copy is not modifiable unless the lambda is mutable
    };

    // 4. Implicit capture by reference [&]
    auto implicit_ref = [&]() {
        z += 10;
        std::cout << "\n--- 4. [&] Implicit Reference ---" << std::endl;
        std::cout << "Inside Lambda z addr: " << &z << " | val: " << z << std::endl;
    };

    // 5. Mixed capture [=, &z]
    // z is captured by reference, everything else by value
    auto mixed_capture = [=, &z]() {
        std::cout << "\n--- 5. [=, &z] Mixed Capture ---" << std::endl;
        std::cout << "x (copy): " << x << " | z (ref): " << z << std::endl;
    };

    // Invoke all the lambdas
    explicit_value();
    explicit_ref();
    implicit_value();
    implicit_ref();
    mixed_capture();

    // 6. Capturing member variables through this
    CaptureDemo demo;
    demo.test_this_capture();

    std::cout << "\nFinal Outside Values:" << std::endl;
    std::cout << "x: " << x << " (unchanged by value capture)" << std::endl;
    std::cout << "y: " << y << " (changed by ref capture)" << std::endl;
    std::cout << "z: " << z << " (changed by ref capture)" << std::endl;

    return 0;
}

The output looks like this (the addresses vary from run to run):


Original Addresses & Values:
x: 0x7ffd3d11f4ec | val: 10
y: 0x7ffd3d11f4e8 | val: 20
z: 0x7ffd3d11f4e4 | val: 30

--- 1. [x] Explicit Value ---
Inside Lambda x addr: 0x7ffd3d11f4e0 | val: 15

--- 2. [&y] Explicit Reference ---
Inside Lambda y addr: 0x7ffd3d11f4e8 | val: 25

--- 3. [=] Implicit Value ---
Inside Lambda z (copy) val: 30

--- 4. [&] Implicit Reference ---
Inside Lambda z addr: 0x7ffd3d11f4e4 | val: 40

--- 5. [=, &z] Mixed Capture ---
x (copy): 10 | z (ref): 40

--- 6. [this] Capture ---
Member value after lambda: 110

Final Outside Values:
x: 10 (unchanged by value capture)
y: 25 (changed by ref capture)
z: 40 (changed by ref capture)

How closures work:

When we write a lambda, the compiler generates an anonymous class (a functor) behind the scenes. Take the following code:


int factor = 2;
auto multiply = [factor](int val) {
    return val * factor;
};

The compiler turns it into something like this:


class __Lambda_Unique_Name {
    int factor; // the captured variable becomes a data member
public:
    __Lambda_Unique_Name(int f) : factor(f) {}
    int operator()(int val) const {
        return val * factor;
    }
};

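Why use closures?

A. Logic defined where it is used: with standard library algorithms (such as std::sort or std::find_if), we can write the logic right at the call site instead of going off to a header file to define a separate function.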
```
std::vector<int> nums = {1, 5, 2, 8};
// Use a closure to define a custom (descending) sort order
std::sort(nums.begin(), nums.end(), [](int a, int b) {
    return a > b; 
});
```
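B. State retention: a closure can carry state, which makes it more flexible than an ordinary function. An ordinary function has to rely on global or static variables to keep state, whereas a closure achieves encapsulation by capturing local variables.
C. Asynchrony and callbacks: in modern C++ concurrent programming, closures are commonly used as callbacks. We can capture some context and run the code later, once the asynchronous operation has completed.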

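Lifetime issues (dangling references): this is the point that needs the most care when using C++ closures. Capturing by reference is cheap, but if the closure outlives the captured variable, invoking it is undefined behavior: the program may crash or silently read garbage.

Warning: if you capture by reference (`&`) in a lambda that is handed to an asynchronous task or returned from a function, make sure the captured variables are still alive when the lambda runs.

Summary: a C++ closure is the run-time object produced by a lambda expression. By packaging code together with its environment (data), it improves readability and makes a functional programming style practical.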

For example, the following code returns a closure that holds a dangling reference:


#include <iostream>
#include <functional>

std::function<void()> create_bad_closure() {
    int local_val = 100; // a local variable, stored on the stack

    // Bug: captures the local variable local_val by reference
    auto lambda = [&local_val]() {
        std::cout << "Local value is: " << local_val << std::endl;
    };

    return lambda;
} // <--- when the function returns here, local_val is destroyed!

int main() {
    auto my_closure = create_bad_closure();

    // Disaster: my_closure holds a reference to a destroyed variable (a dangling reference)
    my_closure(); // undefined behavior: may print garbage or crash with a segmentation fault

    return 0;
}

Private Inheritance

From the declaration of ConnectionPoolWithFailover we can see that it inherits publicly from IConnectionPool and privately from PoolWithFailoverBase<IConnectionPool>:


class ConnectionPoolWithFailover : public IConnectionPool, private PoolWithFailoverBase<IConnectionPool>
{
public:
    ConnectionPoolWithFailover(
            ConnectionPoolPtrs nested_pools_, // the nested connection pools wrapped by ConnectionPoolWithFailover
            LoadBalancing load_balancing,  // the load-balancing policy of ConnectionPoolWithFailover
            time_t decrease_error_period_ = DBMS_CONNECTION_POOL_WITH_FAILOVER_DEFAULT_DECREASE_ERROR_PERIOD,
            size_t max_error_cap = DBMS_CONNECTION_POOL_WITH_FAILOVER_MAX_ERROR_COUNT);

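In C++, private inheritance is a special form of inheritance. It mainly expresses "is-implemented-in-terms-of" rather than an extension of an interface.

When we write class Derived : private Base, two key things happen:

Access "downgrade": under private inheritance, all public and protected members of the base class become private in the derived class.

Public members of the base class: become private in the derived class.
Protected members of the base class: become private in the derived class.
External access: code outside the derived class cannot call any base-class member through a derived object, and cannot convert a Derived to a Base.

The table below summarizes how access changes under private inheritance: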

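Access in base class	Inheritance mode	Access in derived class	Visibility outside the derived class (e.g. in main)
public	`private` inheritance	private	not visible
protected	`private` inheritance	private	not visible
private	`private` inheritance	inaccessible	not visible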

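Semantics: not "is-a" but "implemented-in-terms-of"

Public inheritance expresses an "is-a" relationship. For example: a doctor is a person.
Private inheritance expresses "implemented-in-terms-of". For example: I use an "engine" to implement a "car", but a "car" is not an "engine".

The following code shows the basic usage of private inheritance: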


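class Engine {
public:
    void start() { /* start the engine */ }
    void injectFuel() { /* inject fuel: a low-level detail */ }
};

// Car inherits privately from Engine
class Car : private Engine {
public:
    void drive() {
        start(); // base-class public members are accessible inside Car
        // ... driving logic
    }
    // Outside code cannot start the engine directly via car.start()
};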

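Most of what private inheritance offers can also be achieved with composition (holding a member object inside the class):
- Composition: class Car { Engine e; }; is the preferred approach because coupling is lower, but Car can then only access the public members of Engine.
- Private inheritance: use it only in the following special cases:
  - You need access to protected members of the base class. This is the main advantage of private inheritance over composition: with composition, Car can only reach the public members of its Engine member.
  - You need to override a virtual function of the base class but do not want the inheritance relationship to be visible from outside.
  - Empty base optimization (EBO): if the base class has no data members, private inheritance lets it take no space in the derived class, whereas a member object takes at least 1 byte.

For example, with composition: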
```
class Engine {
protected:
    void tuneInternal() { /* internal tuning logic */ }
public:
    void start() { /* start the engine */ }
};

class Car {
private:
    Engine e;
public:
    void drive() {
        e.start();           // OK: public members are accessible
        // e.tuneInternal(); // error: Car is not derived from Engine, so protected members are off limits
    }
};
```
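If Engine is held as a member variable, Car can only use its public interface. The "black-box tools" that the base class reserves for derived classes (protected members) are out of reach for a class that merely holds a member object. With private inheritance, however, Car becomes a derived class of Engine and so gets a "pass" into Engine's internals: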
```
class Car : private Engine {
public:
    void drive() {
        start();         // OK: base-class public member
        tuneInternal();  // OK: protected members are accessible only through inheritance
    }
};
```
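The semantics here are:
- Inward (inside Car): I inherit from you, so I can use all your protected tools like a child of the family.
- Outward (main and other callers): I keep it secret from everyone. Looking at Car from outside, nothing reveals that it contains Engine's functionality.
Another practical scenario is overriding virtual functions, which is a variant of "needing access to the base class's internals". If we want to override a virtual function of the base class but do not want our class to be used as the base class, private inheritance is the natural choice.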
```
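class Timer {
public:
    void start() { /* ... */ }
protected:
    virtual void onTick() = 0; // internal hook left for derived classes to implement
};

// I want Timer's functionality, but I do not want anyone to treat MyTask as a Timer
class MyTask : private Timer {
protected:
    void onTick() override {
        // the actual periodic task logic goes here
    }
public:
    void run() {
        start(); // call the base-class method to start the timer
    }
};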
```

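In short, when facing this choice:

Composition: if I only need to call the public functions of the other class, choose composition.
Private inheritance: if I need to call protected functions of the base class, or need to override its virtual functions, but do not want to expose the "is-a" relationship, choose private inheritance.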