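StarRocks Efficient Aggregation: A Source Code Walkthrough

0. Introduction

Aggregation is one of the most common operations in data analysis, and its performance has a direct and significant impact on the analytical performance of the whole system. This article takes a close look at how aggregation is implemented in StarRocks, focusing on its execution strategies and the optimizations it applies.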

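1. Cluster-level strategy

StarRocks uses a shared-nothing, massively parallel processing (MPP) architecture. In this architecture the data of one physical table is split into shards with multiple replicas and spread across machines, so the traditional single-node aggregation approach cannot be applied directly. StarRocks therefore extends it into multiple stages. Aggregation is implemented mainly as hash aggregation, and the overall idea is to choose one-stage, two-stage, or multi-stage aggregation depending on the group by key.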

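1) If the group by columns are the distribution columns, the same key can never appear on two different nodes. Each node aggregates locally, and the results only need to be merged at the end.

2) If the group by columns are not the distribution columns, each node first runs a pre-aggregation. The pre-aggregated data is then redistributed by a common rule so that equal keys land on the same node, a final aggregation is run, and the results are merged.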
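The two-stage case can be illustrated with a minimal single-process sketch. It assumes SUM(value) as the aggregate and simulates nodes with plain vectors; the names `local_aggregate` and `shuffle_and_finalize` are made up for this example and are not StarRocks functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<std::string, int64_t>;               // (group by key, value)
using Partial = std::unordered_map<std::string, int64_t>;  // key -> partial SUM(value)

// Stage 1 (pre-aggregation): each node aggregates only the rows it stores.
Partial local_aggregate(const std::vector<Row>& rows) {
    Partial partial;
    for (const auto& row : rows) partial[row.first] += row.second;
    return partial;
}

// Redistribution + stage 2 (final aggregation): every partial result is routed
// by hash(key), so equal keys meet on one node, where the partial sums are merged.
std::vector<Partial> shuffle_and_finalize(const std::vector<Partial>& partials, size_t num_nodes) {
    std::vector<Partial> finals(num_nodes);
    std::hash<std::string> hasher;
    for (const auto& partial : partials) {
        for (const auto& kv : partial) {
            finals[hasher(kv.first) % num_nodes][kv.first] += kv.second;
        }
    }
    return finals;
}
```

After the shuffle each key is owned by exactly one node, so concatenating the per-node final results gives the complete answer without any further merging.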

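One-stage and two-stage aggregation can also be combined. For example, Count Distinct without Group By usually uses 4-stage aggregation (a combination of two two-stage aggregations), while Count Distinct with Group By usually uses 3-stage aggregation (a combination of one one-stage and one two-stage aggregation). This article focuses on the optimizations, so these combinations are not described further.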

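2. Single-node implementation and code analysis

2.1 Pipeline executor

StarRocks' executor uses the pull model, also known as the Volcano model, and builds its pipeline executor on top of it; the concept is explained in detail in the article on the PG executor. After the cluster layer (FE) has chosen a strategy and generated the query plan, the plan is sent to the single-node layer (BE), which splits it into multiple pipelines so that it can run in parallel on the node. A pipeline chains several consecutive operators together; the first operator is called the source and the last one the sink. By design, only the sink at the tail of a pipeline may be a fully materializing operator, that is, one that needs all of its input before it can produce output for the operators after it. Such an operator is therefore split into two operators, a sink and a source, which are placed in two different pipelines and exchange data through a shared aggregator. This allows materialization and the following operator to work in a pipelined fashion. Depending on the configured degree of parallelism, several instances can be created and run in parallel to improve efficiency.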
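The sink/source split can be pictured with a small sketch. Everything here (`SharedAggregator`, `AggSink`, `AggSource` and their methods) is hypothetical and far simpler than the real operators; it only shows two operators in different pipelines communicating through one shared aggregation state.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

using Chunk = std::vector<std::pair<int, int64_t>>;  // rows of (key, value)

// State shared by the two pipelines (hypothetical stand-in for the aggregator).
struct SharedAggregator {
    std::map<int, int64_t> sums;
    bool sink_finished = false;
};

// Tail of the upstream pipeline: consumes chunks and materializes SUM per key.
class AggSink {
public:
    explicit AggSink(std::shared_ptr<SharedAggregator> agg) : _agg(std::move(agg)) {}
    void push_chunk(const Chunk& chunk) {
        for (const auto& row : chunk) _agg->sums[row.first] += row.second;
    }
    void set_finishing() { _agg->sink_finished = true; }

private:
    std::shared_ptr<SharedAggregator> _agg;
};

// Head of the downstream pipeline: emits only after the sink has seen all input.
class AggSource {
public:
    explicit AggSource(std::shared_ptr<SharedAggregator> agg) : _agg(std::move(agg)) {}
    bool has_output() const { return _agg->sink_finished && !_agg->sums.empty(); }
    Chunk pull_chunk() {
        Chunk out(_agg->sums.begin(), _agg->sums.end());
        _agg->sums.clear();
        return out;
    }

private:
    std::shared_ptr<SharedAggregator> _agg;
};
```

A scheduler would keep the downstream pipeline parked while `has_output()` is false and resume it once the sink has finished.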

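2.2 Pre-aggregation

In pre-aggregation, each node aggregates locally before the data is redistributed, which reduces the amount of data to shuffle. When keys rarely repeat, however, it helps little and only adds a wasted round of aggregation. StarRocks therefore looks at the current state of the hash table and chooses between pre-aggregating and streaming the rows through. Rows that are not in the hash table are emitted in the same chunk format as the hash table results, so both can be handled uniformly.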

 if (!ht_needs_expansion ||
            _aggregator->should_expand_preagg_hash_tables(_aggregator->num_input_rows(), chunk_size, allocated_bytes,
                                                          _aggregator->hash_map_variant().size())) {
            // hash table is not full or allow to expand the hash table according reduction rate
            SCOPED_TIMER(_aggregator->agg_compute_timer());
            TRY_CATCH_BAD_ALLOC(_aggregator->build_hash_map(chunk_size));
            if (_aggregator->is_none_group_by_exprs()) {
                RETURN_IF_ERROR(_aggregator->compute_single_agg_state(chunk.get(), chunk_size));
            } else {
                RETURN_IF_ERROR(_aggregator->compute_batch_agg_states(chunk.get(), chunk_size));
            }
            TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map());
            COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_map_variant().size());
            break;
        } else {
            _auto_state = AggrAutoState::ADJUST;
            _auto_context.adjust_count = 0;
            VLOG_ROW << "auto agg: " << _auto_context.get_auto_state_string(AggrAutoState::INIT_PREAGG) << " "
                     << _auto_context.init_preagg_count << " -> " << _auto_context.get_auto_state_string(_auto_state);
        }

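Based on this check, the operator decides whether to keep pre-aggregating or to leave the work to the final aggregation. This avoids useless aggregation work and keeps the hash table size under control, which speeds up CPU access.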
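The idea behind such a check can be sketched with a simple reduction-rate heuristic. This is an assumption for illustration only: the function name, its parameters, and the single `min_reduction` threshold are hypothetical and do not reproduce the actual logic of `should_expand_preagg_hash_tables`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical heuristic: keep growing the pre-aggregation hash table only while
// it really collapses rows. reduction = input rows seen / distinct groups stored.
bool should_keep_preaggregating(size_t num_input_rows, size_t num_groups, double min_reduction) {
    if (num_groups == 0) {
        return true;  // nothing stored yet, so there is no evidence against pre-aggregation
    }
    double reduction = static_cast<double>(num_input_rows) / static_cast<double>(num_groups);
    return reduction >= min_reduction;
}
```

With 1000 input rows and 10 groups the reduction is 100x and pre-aggregation clearly pays off; with 900 groups almost every row is a new key, and streaming the rows through is cheaper.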

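2.3 Final aggregation

In StarRocks, final aggregation inserts the incoming chunk data into a hash table and computes the aggregates there. Start with the data structures: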

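1) _hash_map_variant: the hash table, a key-value structure whose key is the aggregation key and whose value is a pointer to AggData.

2) AggData: the value stored in the hash table; it holds the result state of the aggregate functions.

3) _tmp_agg_states: an array with one entry per input row, holding the address of the AggData that belongs to that row's key. This allows the aggregates to be computed column by column and avoids going back to the hash table repeatedly, which improves the cache hit rate.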
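How the three structures cooperate can be shown with a reduced model. The sketch assumes a single int32 key and a fixed `AggData` layout holding a sum and a count; `MiniAggregator` and its members are illustrative names, not the StarRocks classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

struct AggData {  // hypothetical fixed layout: state of SUM and COUNT
    int64_t sum = 0;
    int64_t count = 0;
};
using AggDataPtr = AggData*;

struct MiniAggregator {
    std::deque<AggData> storage;                       // keeps state addresses stable
    std::unordered_map<int32_t, AggDataPtr> hash_map;  // group by key -> state pointer
    std::vector<AggDataPtr> tmp_agg_states;            // one state pointer per input row

    // Pass 1: look up (or create) the state of every row's key and remember its address.
    void build_hash_map(const std::vector<int32_t>& keys) {
        tmp_agg_states.resize(keys.size());
        for (size_t i = 0; i < keys.size(); ++i) {
            auto it = hash_map.find(keys[i]);
            if (it == hash_map.end()) {
                storage.emplace_back();
                it = hash_map.emplace(keys[i], &storage.back()).first;
            }
            tmp_agg_states[i] = it->second;
        }
    }

    // Pass 2: update one aggregate at a time over the whole column, using only the
    // remembered pointers; the hash table is not touched again.
    void update(const std::vector<int64_t>& values) {
        for (size_t i = 0; i < values.size(); ++i) tmp_agg_states[i]->sum += values[i];
        for (size_t i = 0; i < values.size(); ++i) tmp_agg_states[i]->count += 1;
    }
};
```

Separating the lookup pass from the update pass is what makes the column-wise loops in pass 2 possible.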

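With the data structures in mind, the aggregation proceeds as follows:

1) Create the hash table (_hash_map_variant) and optimize it. Keys come in different types and with different characteristics, and the best hashing scheme differs for each of them. The variants are therefore generated with macros: every possible combination is instantiated at compile time, and the matching branch is selected at run time. If the key is a combination of several group by columns and memory usage reaches a certain size, the hash table is converted into a two-level hash table.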

#define APPLY_FOR_AGG_VARIANT_ALL(M) \
    M(phase1_uint8)                  \
    M(phase1_int8)                   \
    M(phase1_int16)                  \
    M(phase1_int32)                  \
    M(phase1_int64)                  \
    M(phase1_int128)                 \
    M(phase1_decimal32)              \
    M(phase1_decimal64)              \
    M(phase1_decimal128)             \
    M(phase1_date)                   \
    M(phase1_timestamp)              \
    M(phase1_string)                 \
    M(phase1_slice)                  \
    M(phase1_null_uint8)             \
    M(phase1_null_int8)              \
    M(phase1_null_int16)             \
    M(phase1_null_int32)             \
    M(phase1_null_int64)             \
    M(phase1_null_int128)            \
    M(phase1_null_decimal32)         \
    M(phase1_null_decimal64)         \
    M(phase1_null_decimal128)        \
    M(phase1_null_date)              \
    M(phase1_null_timestamp)         \
    M(phase1_null_string)            \
    M(phase1_slice_two_level)        \
    M(phase1_int32_two_level)        \
    M(phase2_uint8)                  \
    M(phase2_int8)                   \
    M(phase2_int16)                  \
    M(phase2_int32)                  \
    M(phase2_int64)                  \
    M(phase2_int128)                 \
    M(phase2_decimal32)              \
    M(phase2_decimal64)              \
    M(phase2_decimal128)             \
    M(phase2_date)                   \
    M(phase2_timestamp)              \
    M(phase2_string)                 \
    M(phase2_slice)                  \
    M(phase2_null_uint8)             \
    M(phase2_null_int8)              \
    M(phase2_null_int16)             \
    M(phase2_null_int32)             \
    M(phase2_null_int64)             \
    M(phase2_null_int128)            \
    M(phase2_null_decimal32)         \
    M(phase2_null_decimal64)         \
    M(phase2_null_decimal128)        \
    M(phase2_null_date)              \
    M(phase2_null_timestamp)         \
    M(phase2_null_string)            \
    M(phase2_slice_two_level)        \
    M(phase2_int32_two_level)        \
    M(phase1_slice_fx4)              \
    M(phase1_slice_fx8)              \
    M(phase1_slice_fx16)             \
    M(phase2_slice_fx4)              \
    M(phase2_slice_fx8)              \
    M(phase2_slice_fx16)
    
    
void AggHashSetVariant::init(RuntimeState* state, Type type, AggStatistics* agg_stat) {
    _type = type;
    switch (_type) {
#define M(NAME)                                                                                                    \
    case Type::NAME:                                                                                               \
        hash_set_with_key = std::make_unique<detail::AggHashSetVariantTypeTraits<Type::NAME>::HashSetWithKeyType>( \
                state->chunk_size());                                                                              \
        break;
        APPLY_FOR_AGG_VARIANT_ALL(M)
#undef M
    }
}
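The macro pattern used above is often called an X-macro: the list of variants is written once, and each use site supplies its own definition of M. A reduced, standalone version with three made-up variant names might look like this:

```cpp
#include <cassert>
#include <string>

// The variant list is written once; each use site defines M differently.
#define APPLY_FOR_DEMO_VARIANTS(M) \
    M(key_int32)                   \
    M(key_int64)                   \
    M(key_string)

// Use 1: generate the enumerators.
enum class Type {
#define M(NAME) NAME,
    APPLY_FOR_DEMO_VARIANTS(M)
#undef M
};

// Use 2: generate one switch case per variant for run-time dispatch.
std::string variant_name(Type type) {
    switch (type) {
#define M(NAME)      \
    case Type::NAME: \
        return #NAME;
        APPLY_FOR_DEMO_VARIANTS(M)
#undef M
    }
    return "unknown";
}
```

Adding a variant then means adding one line to the list, and every switch that uses the list picks it up automatically.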

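Next, the key columns are traversed and inserted into the hash table. If a key already exists, the row's entry in _tmp_agg_states is pointed at the existing AggData; otherwise a new AggData is allocated and the entry is pointed at it.

The hash table is phmap, which supports lazy insertion. When a key-value pair of a complex type is inserted, the part of the key used for hashing and lookup is passed separately from the function that constructs the pair. The hash is computed first to check whether the key exists; if it does, the call returns immediately, and only if it does not is the construction function invoked. This makes insertion more efficient.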
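The effect of lazy insertion can be approximated with the standard library. The sketch below uses `std::unordered_map::try_emplace` (C++17) as a stand-in, not phmap's own API: the table is probed once, and the aggregate state is allocated only when the key was missing. `LazyTable` and `AggState` are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

struct AggState {
    int64_t sum = 0;
};

struct LazyTable {
    std::unordered_map<std::string, std::unique_ptr<AggState>> map;
    int allocations = 0;  // counts how many states were really created

    AggState* find_or_create(const std::string& key) {
        // One hash computation and one probe; on a hit nothing is constructed.
        auto result = map.try_emplace(key, nullptr);
        if (result.second) {
            // Miss: only now pay for allocating the state.
            result.first->second = std::make_unique<AggState>();
            ++allocations;
        }
        return result.first->second.get();
    }
};
```

Compared with a separate `find` followed by `insert`, this avoids hashing the key twice, and compared with an eager `insert` it avoids building a state that would be thrown away on a hit.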

Memory allocation is optimized as well, using a common technique, batch allocation: HashTableKeyAllocator is used to allocate memory in batches.
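Batch allocation in general can be sketched as a bump allocator that requests memory in large blocks and hands it out in small pieces. The class below is a hypothetical illustration of the technique and says nothing about how HashTableKeyAllocator is actually written.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical bump allocator: one large allocation serves many small requests.
class BatchAllocator {
public:
    explicit BatchAllocator(size_t block_size) : _block_size(block_size) {}

    // Requires size <= block_size. Alignment handling is omitted to keep the sketch short.
    char* allocate(size_t size) {
        if (_blocks.empty() || _used + size > _block_size) {
            // Current block is exhausted: fetch a whole new block at once.
            _blocks.push_back(std::unique_ptr<char[]>(new char[_block_size]));
            _used = 0;
        }
        char* p = _blocks.back().get() + _used;
        _used += size;
        return p;
    }

    size_t num_blocks() const { return _blocks.size(); }

private:
    size_t _block_size;
    size_t _used = 0;
    std::vector<std::unique_ptr<char[]>> _blocks;
};
```

Besides saving calls into the general-purpose allocator, this places states that were created together next to each other in memory, and everything is released at once when the allocator is destroyed.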

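Beyond memory allocation there is prefetching. On a memory access the CPU loads a following stretch of contiguous memory into the cache ahead of time, so sequential accesses hit the cache and run faster. This only helps when the following accesses really are sequential. In many cases the CPU cannot predict which address will be accessed next, but the program can. In the binary search below, for example, both candidates for the next mid can be prefetched so that the next access hits a cache line.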

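// Returns the index of key in the sorted array, or -1 if it is absent.
// Build with -DDO_PREFETCH to prefetch both candidates for the next mid.
int binarySearch(const int *array, size_t number_of_elements, int key) {
    // Half-open range [low, high): avoids size_t underflow for an empty
    // array and when the search moves left of index 0.
    size_t low = 0, high = number_of_elements, mid;
    while(low < high) {
        mid = low + (high - low)/2;
#ifdef DO_PREFETCH
        // next mid if low moves up: range [mid + 1, high)
        __builtin_prefetch (&array[mid + 1 + (high - mid - 1)/2], 0, 1);
        // next mid if high moves down: range [low, mid)
        __builtin_prefetch (&array[low + (mid - low)/2], 0, 1);
#endif
        if(array[mid] < key)
            low = mid + 1;
        else if(array[mid] == key)
            return static_cast<int>(mid);
        else
            high = mid;
    }
    return -1;
}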

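The corresponding StarRocks code is shown below; the cases in which to prefetch and the threshold value were determined by testing: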

template <typename Func, bool allocate_and_compute_state, bool compute_not_founds>
    ALWAYS_NOINLINE void compute_agg_states_non_nullable(size_t chunk_size, const Columns& key_columns, MemPool* pool,
                                                         Func&& allocate_func, Buffer<AggDataPtr>* agg_states,
                                                         Filter* not_founds) {
        DCHECK(!key_columns[0]->is_nullable());
        auto column = down_cast<ColumnType*>(key_columns[0].get());
        size_t bucket_count = this->hash_map.bucket_count();
        // Assign not_founds vector when needs compute not founds.
        if constexpr (compute_not_founds) {
            DCHECK(not_founds);
            (*not_founds).assign(chunk_size, 0);
        }
        if (bucket_count < prefetch_threhold) {
            this->template compute_agg_noprefetch<Func, allocate_and_compute_state, compute_not_founds>(
                    column, agg_states, std::forward<Func>(allocate_func), not_founds);
        } else {
            this->template compute_agg_prefetch<Func, allocate_and_compute_state, compute_not_founds>(
                    column, agg_states, std::forward<Func>(allocate_func), not_founds);
        }
    }
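The code above only chooses between a prefetching and a non-prefetching path based on the bucket count; the bodies of those paths are not shown. A prefetching probe loop could look roughly like the following sketch. It assumes GCC or Clang (`__builtin_prefetch`), and `CounterTable`, its multiplicative hash, and the `distance` parameter are hypothetical. Keys that collide simply share a slot here, which a real hash table would have to resolve.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical direct-mapped counter table; the slot count must be a power of two.
struct CounterTable {
    std::vector<int64_t> slots;
    explicit CounterTable(size_t n) : slots(n, 0) {}
    size_t slot_of(uint32_t key) const { return (key * 2654435761u) & (slots.size() - 1); }
};

// Counts rows per slot. While row i is processed, the slot needed by row
// i + distance is already being fetched, hiding part of the cache-miss latency.
void count_with_prefetch(CounterTable& table, const std::vector<uint32_t>& keys, size_t distance) {
    for (size_t i = 0; i < keys.size(); ++i) {
        if (i + distance < keys.size()) {
            __builtin_prefetch(&table.slots[table.slot_of(keys[i + distance])], 1, 1);
        }
        table.slots[table.slot_of(keys[i])] += 1;
    }
}
```

For a table small enough to stay in cache the extra instructions are pure overhead, which is consistent with the StarRocks code enabling prefetch only above a bucket-count threshold.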

2) Compute the aggregates: the input data is processed and the results are stored in AggData.

Status Aggregator::compute_batch_agg_states(Chunk* chunk, size_t chunk_size) {
    SCOPED_TIMER(_agg_stat->agg_function_compute_timer);
    bool use_intermediate = _use_intermediate_as_input();
    auto& agg_expr_ctxs = use_intermediate ? _intermediate_agg_expr_ctxs : _agg_expr_ctxs;
    for (size_t i = 0; i < _agg_fn_ctxs.size(); i++) {
        // evaluate arguments at i-th agg function
        RETURN_IF_ERROR(evaluate_agg_input_column(chunk, agg_expr_ctxs[i], i));
        SCOPED_THREAD_LOCAL_STATE_ALLOCATOR_SETTER(_allocator.get());
        // batch call update or merge
        if (!_is_merge_funcs[i] && !use_intermediate) {
            // update the aggregate state from raw input data
            _agg_functions[i]->update_batch(_agg_fn_ctxs[i], chunk_size, _agg_states_offsets[i],
                                            _agg_input_raw_columns[i].data(), _tmp_agg_states.data());
        } else {
            DCHECK_GE(_agg_input_columns[i].size(), 1);
            // merge one aggregate result into another
            _agg_functions[i]->merge_batch(_agg_fn_ctxs[i], _agg_input_columns[i][0]->size(), _agg_states_offsets[i],
                                           _agg_input_columns[i][0].get(), _tmp_agg_states.data());
        }
    }
    RETURN_IF_ERROR(check_has_error());
    return Status::OK();
}
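The difference between the two calls can be made concrete with a SUM function. In this sketch each group owns one block of state memory, the function's state sits at `state_offset` inside that block, and `states[i]` is the block of row i's group (the role played by `_tmp_agg_states`). The signatures are simplified assumptions, not the real `update_batch` and `merge_batch` interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using AggDataPtr = uint8_t*;  // start of one group's state block

struct SumState {  // hypothetical state of a SUM function
    int64_t sum;
};

// Raw input -> state: row i is folded into the state of its group.
void sum_update_batch(size_t chunk_size, size_t state_offset, const int64_t* input,
                      AggDataPtr* states) {
    for (size_t i = 0; i < chunk_size; ++i) {
        reinterpret_cast<SumState*>(states[i] + state_offset)->sum += input[i];
    }
}

// Intermediate result -> state: partial sums from an earlier stage are combined.
void sum_merge_batch(size_t chunk_size, size_t state_offset, const SumState* partial,
                     AggDataPtr* states) {
    for (size_t i = 0; i < chunk_size; ++i) {
        reinterpret_cast<SumState*>(states[i] + state_offset)->sum += partial[i].sum;
    }
}
```

For SUM the two loops look nearly identical, but for a function such as AVG they differ: update adds a value and increments a count, while merge adds two sums and two counts.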

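3) Output the hash table: the data is read out and saved according to the memory allocation and layout of the in-memory hash table. In addition, data that is not in the hash table is emitted as streaming output, which keeps the chunk structure consistent so that everything can be processed uniformly.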

Status StreamAggregator::output_changes_internal(int32_t chunk_size, StreamChunkPtr* result_chunk,
                                                 ChunkPtr* intermediate_chunk, std::vector<ChunkPtr>& detail_chunks) {
    SCOPED_TIMER(_agg_stat->get_results_timer);
    RETURN_IF_ERROR(hash_map_variant().visit([&](auto& variant_value) {
        auto& hash_map_with_key = *variant_value;
        using HashMapWithKey = std::remove_reference_t<decltype(hash_map_with_key)>;
        // initialize _it_hash
        if (!_it_hash.has_value()) {
            _it_hash = _state_allocator.begin();
        }
        auto it = std::any_cast<RawHashTableIterator>(_it_hash);
        auto end = _state_allocator.end();
        const auto hash_map_size = _hash_map_variant.size();
        auto num_rows = std::min<size_t>(hash_map_size - _num_rows_processed, chunk_size);
        Columns group_by_columns = _create_group_by_columns(num_rows);
        int32_t read_index = 0;
        {
            SCOPED_TIMER(_agg_stat->iter_timer);
            hash_map_with_key.results.resize(num_rows);
            // get key/value from hashtable
            while ((it != end) & (read_index < num_rows)) {
                auto* value = it.value();
                hash_map_with_key.results[read_index] = *reinterpret_cast<typename HashMapWithKey::KeyType*>(value);
                // Reuse _tmp_agg_states to store state pointer address
                _tmp_agg_states[read_index] = value;
                ++read_index;
                it.next();
            }
        }
        { hash_map_with_key.insert_keys_to_columns(hash_map_with_key.results, group_by_columns, read_index); }
        {
            // output intermediate and detail tables' change.
            RETURN_IF_ERROR(_agg_group_state->output_changes(read_index, group_by_columns, _tmp_agg_states,
                                                             intermediate_chunk, &detail_chunks));
            // output result state table changes
            RETURN_IF_ERROR(_output_result_changes(read_index, group_by_columns, result_chunk));
        }
        // NOTE: StreamAggregate do not support output NULL keys which is different from OLAP Engine.
        _is_ht_eos = (it == end);
        _it_hash = it;
        _num_rows_returned += read_index;
        _num_rows_processed += read_index;
        return Status::OK();
    }));
    // update incremental state into state table
    RETURN_IF_ERROR(_agg_group_state->write(_state, result_chunk, intermediate_chunk, detail_chunks));
    return Status::OK();
}
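One detail worth isolating from the function above is that the iterator is stored between calls (`_it_hash`), so each call emits at most `chunk_size` groups and the next call resumes where the previous one stopped. A stripped-down sketch of that pattern, using `std::map` and a hypothetical `ChunkedOutput` class:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical resumable reader over a finished aggregation result.
class ChunkedOutput {
public:
    explicit ChunkedOutput(std::map<int, int64_t> result)
            : _result(std::move(result)), _it(_result.cbegin()) {}

    // Emits at most chunk_size groups and remembers the position for the next call.
    std::vector<std::pair<int, int64_t>> next_chunk(size_t chunk_size) {
        std::vector<std::pair<int, int64_t>> chunk;
        while (_it != _result.cend() && chunk.size() < chunk_size) {
            chunk.emplace_back(_it->first, _it->second);
            ++_it;
        }
        return chunk;
    }

    bool is_eos() const { return _it == _result.cend(); }

private:
    std::map<int, int64_t> _result;                  // declared first: _it points into it
    std::map<int, int64_t>::const_iterator _it;      // saved position between calls
};
```

Bounding each call by the chunk size keeps the output operator from holding a pipeline thread for the whole hash table at once.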

Reference: https://zhuanlan.zhihu.com/p/592058276