Compile-Time Computation in C++

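C++ is a language built around performance, and we always want our code to run faster. This post focuses on one core idea: move as much run-time work as possible to compile time. By performing computation, branch selection, and loop unrolling while the program is being compiled, we can generate highly optimized code with no unnecessary run-time overhead.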

The references I consulted while studying this topic are listed below:

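Let's open the topic with a question from StackOverflow. Suppose I have three arrays D, E, and X. The values of D and E are known at compile time, while the values of X are known only at run time. The computation is as follows: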


int comp_rt(int i, array<int, L> X) {
    int v = 0;
    if (D[i] == 0) // D[i] known at compile-time
        return 10;
    for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
        v += X[j] * (j + 1); // X[j] known at run-time
    return v;
}

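Because this computation runs many times, I want to reduce its cost, so I will try to move the checks on D[i] and E[i] to compile time. That means writing a function for which D[i] and E[i] can be determined by the compiler, and template instantiation comes to mind: make i a template parameter, and each value of i selects its own function at compile time. Inside that function D, E, and i are all compile-time constants, so the control flow can be resolved ahead of time, and only the arithmetic on X is left for run time. With D, E, and X as below, we could write the following template code by hand: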


D = [0, 1, 1, 0, 1] // Values in {0, 1}
E = [1, 0, 3, 2, 4] // Values in [0, L-1]
X = [1, 3, 5, 7, 9] // Any integer

template <int i> int comp_tpl(array<int, L> X);

template <> int comp_tpl<0>(array<int, L> X) { return 10; } // D[0] == 0
template <> int comp_tpl<1>(array<int, L> X) { return 0; } // E[1] == 0, skip loop
template <> int comp_tpl<2>(array<int, L> X) { return X[0] + 2 * X[1] + 3 * X[2]; }
template <> int comp_tpl<3>(array<int, L> X) { return 10; } // D[3] == 0
template <> int comp_tpl<4>(array<int, L> X) { return comp_tpl<2>(X) + 4 * X[3]; }

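Any index i in the range 0 to 4 goes straight to the matching full specialization of template <int i> and runs the precomputed code. The question is how to get the compiler to generate these optimized comp_tpl bodies for us. That is the subject of this post, and the tools are if constexpr, recursive templates, and constexpr. They let the compiler resolve every condition and recursion that depends on D and E at compile time, so that, with optimization enabled, the generated machine code is typically an inlined, flat, branch-free arithmetic expression that performs the same as the hand-written comp_tpl above.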

Let's think about how to turn this function into a template function that can be expanded at compile time:


int comp_rt(int i, array<int, L> X) {
    int v = 0;
    if (D[i] == 0) // D[i] known at compile-time
        return 10;
    for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
        v += X[j] * (j + 1); // X[j] known at run-time
    return v;
}

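Here is the line of thought for moving from run time to compile time. First, the function should take the form template <int I>, where I replaces the parameter i. Each call with a specific I makes the compiler instantiate the template for that value, and this is the basis for compile-time evaluation.

What should if (D[i] == 0) become at compile time? C++17 introduced if constexpr, which turns a run-time if into a compile-time branch. As noted above, the parameter i becomes the template parameter I, so it is known at compile time when the template is instantiated. D is also known at compile time, so the condition can be rewritten directly as if constexpr (D[I] == 0).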
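
A minimal C++17 sketch of this rewrite, assuming D and E are constexpr arrays holding the values used in this post. The names comp_cx, kD, kE, kL, and check_comp_cx are illustrative and do not come from the original question. The loop is deliberately left as an ordinary loop, since unrolling it is a separate step.

```cpp
#include <array>
#include <cassert>

// Illustrative compile-time tables (values taken from the example in this post).
constexpr int kL = 5;
constexpr int kD[kL] = {0, 1, 1, 0, 1};
constexpr int kE[kL] = {1, 0, 3, 2, 4};

// Hypothetical name: comp_cx. The template parameter I replaces the run-time i.
template <int I>
int comp_cx(const std::array<int, kL>& X) {
    static_assert(0 <= I && I < kL, "I must index into kD and kE");
    if constexpr (kD[I] == 0) {
        // Only this branch is instantiated when kD[I] == 0.
        return 10;
    } else {
        int v = 0;
        // kE[I] is a constant here, but the loop itself is still a normal loop.
        for (int j = 0; j < kE[I]; ++j)
            v += X[j] * (j + 1);
        return v;
    }
}

// Small self-check; call it from main() to exercise all five instantiations.
inline void check_comp_cx() {
    const std::array<int, kL> X = {1, 3, 5, 7, 9};
    assert(comp_cx<0>(X) == 10);  // kD[0] == 0
    assert(comp_cx<1>(X) == 0);   // kE[1] == 0, loop body never runs
    assert(comp_cx<2>(X) == 22);  // 1*1 + 3*2 + 5*3
    assert(comp_cx<3>(X) == 10);  // kD[3] == 0
    assert(comp_cx<4>(X) == 50);  // 22 + 7*4
}
```

Because the condition depends on the template parameter, the branch that is not taken is discarded and never instantiated for that value of I.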

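Some older code bases cannot use C++17, though. How did pre-C++17 code make this kind of compile-time decision? The answer is layered template specialization. We use a helper struct that takes the boolean result of the condition (true or false for D[I] == 0) as a second template parameter, and specialize it for true and false to dispatch between the two cases. The code looks like this: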


// Helper dispatcher: primary template (handles D[I] != 0, i.e. D_is_zero = false)
template <int I, bool D_is_zero>
struct D_Zero_Dispatcher {
    template <class Array>
    static int comp(const Array& X) {
        // ****** Logic that runs when if (D[I] == 0) is false ******

        // E_CONST[I] is a compile-time constant, but this is still an
        // ordinary loop; unrolling it at compile time is covered later.
        int v = 0;
        for (int j = 0; j < E_CONST[I]; ++j)
            v += X[j] * (j + 1);
        return v;
    }
};

// Helper dispatcher: partial specialization (handles D[I] == 0, i.e. D_is_zero = true)
template <int I>
struct D_Zero_Dispatcher<I, true> {
    template <class Array>
    static int comp(const Array& X) {
        // ****** Logic that runs when if (D[I] == 0) is true ******

        // The compiler selects this version and returns the constant directly.
        // This corresponds to comp_tpl<0> and comp_tpl<3> in the example.
        return 10;
    }
};

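The code above handles the true and false cases of D[I] == 0 separately. How do we decide whether the template is instantiated with true or false? Since this is a compile-time computation, we evaluate the condition as a constexpr bool: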


template <int I>
struct Processor {
    template <class Array>
    static int comp(const Array& X) {

        // Key step: evaluate if (D[I] == 0) at compile time.
        // D_CONST must be a constexpr array, so this condition is a compile-time constant.
        constexpr bool D_is_zero = (D_CONST[I] == 0);

        // Dispatch: pass the boolean result as a template argument.
        // Based on D_is_zero, the compiler instantiates:
        // 1. D_Zero_Dispatcher<I, true>  if D[I] == 0
        // 2. D_Zero_Dispatcher<I, false> if D[I] != 0
        return D_Zero_Dispatcher<I, D_is_zero>::comp(X);
    }
};

This replaces the run-time if with compile-time template matching. To walk through the flow once more: the caller invokes Processor<I>::comp(X); Processor<I> evaluates the condition D[I] == 0 at compile time, giving true or false; Processor<I> then calls D_Zero_Dispatcher<I, true/false>::comp(X); and the compiler instantiates the matching D_Zero_Dispatcher version, which it can inline.
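
The same pre-C++17 dispatch can also be written with overloads instead of struct specializations, an idiom usually called tag dispatch. The sketch below is an alternative technique, not the code from this post; comp_impl, comp_tag, kD, and kE are hypothetical names, and the tables repeat the example values.

```cpp
#include <array>
#include <cassert>
#include <type_traits>

// Illustrative compile-time tables (values taken from the example in this post).
constexpr int kD[] = {0, 1, 1, 0, 1};
constexpr int kE[] = {1, 0, 3, 2, 4};

// Overload chosen when D[I] == 0.
template <int I, class Array>
int comp_impl(const Array&, std::true_type) {
    return 10;
}

// Overload chosen when D[I] != 0; the loop bound kE[I] is a compile-time constant.
template <int I, class Array>
int comp_impl(const Array& X, std::false_type) {
    int v = 0;
    for (int j = 0; j < kE[I]; ++j)
        v += X[j] * (j + 1);
    return v;
}

// Hypothetical entry point: the type of the second argument encodes the
// condition, so overload resolution picks the branch at compile time.
template <int I, class Array>
int comp_tag(const Array& X) {
    return comp_impl<I>(X, std::integral_constant<bool, kD[I] == 0>{});
}

// Small self-check; call it from main().
inline void check_comp_tag() {
    const std::array<int, 5> X = {1, 3, 5, 7, 9};
    assert(comp_tag<0>(X) == 10);  // kD[0] == 0
    assert(comp_tag<2>(X) == 22);  // 1*1 + 3*2 + 5*3
}
```

Both forms rely on the same fact: a boolean computed from constexpr data becomes part of a type, and the compiler selects code based on that type.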

So far we have covered template <int I> and how to turn the run-time if into a compile-time decision. The next step is harder: how do we fully unroll the loop in the function body at compile time, so that it carries no run-time loop overhead? The loop has a compile-time bound E[I] and touches the run-time values X[j]. How should the following code change?


for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
    v += X[j] * (j + 1); // X[j] known at run-time

The ideal result is the same as the hand-written specialization shown earlier:


template <> int comp_tpl<2>(array<int, L> X) { return X[0] + 2 * X[1] + 3 * X[2]; }

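That is, we use the compile-time loop bound E[i] to generate a sequence of additions and multiplications with no for loop, leaving only the arithmetic on X[j]. The core idea is to rewrite the for loop as recursion. It is the same template technique used for D[i], but this time it accumulates a sum.

We define a new helper struct, LoopUnroller, which unrolls the loop by counting the index J down to 0: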


// Assumes E_CONST is known and X is passed in from the outer Processor
template<int CurrentJ> // CurrentJ is the current loop index j
struct LoopUnroller {
    template<class Array>
    static int calculate(const Array& X) {
        // 1. Coefficient of the current term X[j] * (j + 1)
        // Note: CurrentJ (i.e. j) is known at compile time
        constexpr int coefficient = CurrentJ + 1;

        // 2. Recurse: (sum of the earlier terms) + (current term)
        return LoopUnroller<CurrentJ - 1>::calculate(X)
               + X[CurrentJ] * coefficient;
    }
};

// Base-case specialization: ends the loop (when J < 0)
template<>
struct LoopUnroller<-1> {
    template<class Array>
    static int calculate(const Array& X) {
        return 0; // loop finished; 0 is the initial value of the sum
    }
};
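
For comparison, C++17 offers a shorter route to the same unrolling without a recursive struct: std::index_sequence combined with a fold expression. This is a different technique from LoopUnroller, shown only as a sketch; unrolled_sum, unrolled_sum_impl, and check_unrolled_sum are hypothetical names, and Limit plays the role of E[i].

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: expands to 0 + X[0]*1 + X[1]*2 + ... over the indices Js.
template <class Array, std::size_t... Js>
int unrolled_sum_impl(const Array& X, std::index_sequence<Js...>) {
    // Binary fold with initial value 0, so an empty index pack yields 0.
    return (0 + ... + (X[Js] * static_cast<int>(Js + 1)));
}

// Hypothetical entry point: generates the indices 0 .. Limit-1 at compile time.
template <std::size_t Limit, class Array>
int unrolled_sum(const Array& X) {
    return unrolled_sum_impl(X, std::make_index_sequence<Limit>{});
}

// Small self-check; call it from main().
inline void check_unrolled_sum() {
    const std::array<int, 5> X = {1, 3, 5, 7, 9};
    assert(unrolled_sum<0>(X) == 0);   // empty loop
    assert(unrolled_sum<3>(X) == 22);  // 1*1 + 3*2 + 5*3
    assert(unrolled_sum<4>(X) == 50);  // 22 + 7*4
}
```

The fold expression is expanded by the compiler into a single arithmetic expression, so there is no recursion depth to worry about and no base-case specialization to write.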

The complete compile-time version of the function is shown below:


#include <array>

// Compile-time tables. They must be declared before the templates that use them.
constexpr int N_CONST = 5;
constexpr int D_CONST[] = {0, 1, 1, 0, 1}; // D[i] selects the branch
constexpr int E_CONST[] = {1, 0, 3, 2, 4}; // E[i] is the loop bound
constexpr int L_CONST = 5; // assume L = N

using ArrayType = std::array<int, L_CONST>;

// Inner loop unroller (LoopUnroller)
// Purpose: turn for (int j = 0; j < Limit; ++j) into compile-time recursion

// Primary template: one loop iteration with j = CurrentJ
template<int CurrentJ>
struct LoopUnroller {
    template<class Array>
    static int calculate(const Array& X) {
        // Coefficient of the current term X[j] * (j + 1)
        constexpr int coefficient = CurrentJ + 1;

        // Sum of the earlier terms + the current term
        return LoopUnroller<CurrentJ - 1>::calculate(X)
               + X[CurrentJ] * coefficient;
    }
};

// Base-case specialization: termination at J = -1
template<>
struct LoopUnroller<-1> {
    template<class Array>
    static int calculate(const Array&) {
        return 0; // loop finished, contributes 0
    }
};

// Helper dispatcher (D_Zero_Dispatcher)
// Purpose: implement the branch if (D[I] == 0) at compile time

// Primary template: handles D[I] != 0, i.e. D_is_zero = false.
// It must be declared before the partial specialization below.
template <int I, bool D_is_zero>
struct D_Zero_Dispatcher {
    template <class Array>
    static int comp(const Array& X) {
        // --- Logic that runs when if (D[I] == 0) is false ---

        // Loop bound E[I]
        constexpr int loop_limit = E_CONST[I];

        // Start the unrolled loop at index loop_limit - 1.
        // When E[I] == 0 this selects LoopUnroller<-1>, which returns 0
        // (the comp_tpl<1> case), so no separate check is needed.
        return LoopUnroller<loop_limit - 1>::calculate(X);
    }
};

// Partial specialization: handles D[I] == 0, i.e. D_is_zero = true
template <int I>
struct D_Zero_Dispatcher<I, true> {
    template <class Array>
    static int comp(const Array&) {
        // Corresponds to comp_tpl<0> and comp_tpl<3>
        return 10;
    }
};

// Main struct (Processor)
// Purpose: evaluate the condition for index I and dispatch on it

template <int I>
struct Processor {
    template <class Array>
    static int comp(const Array& X) {
        // Evaluate at compile time whether D[I] is 0
        constexpr bool D_is_zero = (D_CONST[I] == 0);

        // Dispatch: pass the result to the dispatcher as a template argument
        return D_Zero_Dispatcher<I, D_is_zero>::comp(X);
    }
};

// Public entry point: the compile-time counterpart of comp_rt(i, X)
template <int I, class Array>
int comp_meta(const Array& X) {
    return Processor<I>::comp(X);
}

We solved the problem with layered template specialization (Processor -> D_Zero_Dispatcher) and a recursive template (LoopUnroller). It works, but it adds a lot of code complexity. Here is the better answer given on StackOverflow:


constexpr int d[] = { 0, 1, 1, 0, 1 };
constexpr int e[] = { 1, 0, 3, 2, 4 };

// (The Sum struct, shown further below, performs the loop unrolling)

template<int N> struct Comp {
    template<class Array>
    static int comp(const Array &x) {
        // The core logic is in this one line
        return d[N] ? Sum<int, e[N]>::comp(x) : 10;
    }
};

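Because d and e are declared constexpr, their values are fixed and known at compile time. When the compiler instantiates Comp<N>::comp(x), it uses the template argument N to look up the exact integer values of d[N] and e[N].

The ternary operator then plays the role of the dispatcher in our code above. If d[N] is 0 (for example N = 3), the condition is false and the expression evaluates to 10, which reproduces the return 10; of our hand-written comp_tpl<3>. Unlike if constexpr, the ternary operator does not discard the other operand: Sum<int, e[N]> is still instantiated and type-checked. Because the condition is a constant, though, an optimizing compiler folds it away and emits no code for the dead branch.

If d[N] is 1 (for example N = 2), the condition is true and the expression evaluates to Sum<int, e[N]>::comp(x), so after constant folding only the call into the Sum template remains.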

Next, look at the Sum struct, which performs the compile-time loop unrolling:


// Recursive case
template<class T, size_t Length> struct Sum {
    template<class Array>
    static T comp(const Array &x, T add = 0) {
        return Sum<T, Length - 1>::comp(x, add + Length * x[Length - 1]);
    }
};

// Termination condition
template<class T> struct Sum<T, 0> {
    template<class Array>
    static T comp(const Array &x, T add = 0) {
        return add;
    }
};

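The recursive case models one iteration of the for loop, with Length playing the role of the loop counter (Length corresponds to j + 1). Each call makes the compiler instantiate a new Sum with Length reduced by one, and this continues at compile time until Length reaches 0.

Because the recursive call is the last operation in the return statement, each step is a tail call. Note that every call targets a different instantiation, so this is a chain of distinct functions rather than one function calling itself. With optimization enabled, modern compilers such as GCC and Clang typically inline the whole chain into a short sequence of arithmetic instructions, removing the call overhead, although the standard does not guarantee it.

The termination condition covers the point where Length reaches 0: the compiler matches this specialization, the recursion stops, and it returns the accumulated result add, which ends the unrolled loop.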
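
To make the expansion concrete, the recursion can be checked by the compiler itself. The sketch below copies the two templates and adds constexpr so that static_assert can evaluate them; SumCx and kX are my own names, not part of the StackOverflow answer, and the cast to T is added only to keep the arithmetic in T.

```cpp
#include <cstddef>

// Recursive case: same shape as Sum, marked constexpr.
template <class T, std::size_t Length>
struct SumCx {
    template <class Array>
    static constexpr T comp(const Array& x, T add = 0) {
        return SumCx<T, Length - 1>::comp(
            x, add + static_cast<T>(Length) * x[Length - 1]);
    }
};

// Termination condition: Length == 0 returns the accumulated value.
template <class T>
struct SumCx<T, 0> {
    template <class Array>
    static constexpr T comp(const Array&, T add = 0) {
        return add;
    }
};

// Illustrative input, constexpr so the whole call is a constant expression.
constexpr int kX[] = {1, 3, 5, 7, 9};

// Trace for Length = 3:
//   SumCx<int,3>::comp(kX, 0)
//   -> SumCx<int,2>::comp(kX, 0  + 3*kX[2])   // add = 15
//   -> SumCx<int,1>::comp(kX, 15 + 2*kX[1])   // add = 21
//   -> SumCx<int,0>::comp(kX, 21 + 1*kX[0])   // add = 22
//   -> 22
static_assert(SumCx<int, 3>::comp(kX) == 22, "3*5 + 2*3 + 1*1");
static_assert(SumCx<int, 0>::comp(kX) == 0, "empty loop");
```

Note that the terms are accumulated from the highest index down, which is the reverse of the original for loop but gives the same sum.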

It is used like this:


int x[] = { 1, 3, 5, 7, 9 };
Comp<3>::comp(x);
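
Comp<N> needs N as a compile-time constant, whereas the original comp_rt received i at run time. One common way to bridge the two is a table of function pointers with one entry per instantiation. The sketch below is an assumption about how that could look, not part of the StackOverflow answer: make_table, comp_dispatch, CompFn, Input, and kCount are hypothetical names, and Sum and Comp are repeated with a fixed array type so the block is self-contained (C++17).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

constexpr int d[] = {0, 1, 1, 0, 1};
constexpr int e[] = {1, 0, 3, 2, 4};
constexpr std::size_t kCount = sizeof(d) / sizeof(d[0]);

using Input = std::array<int, 5>;

// Same logic as Sum and Comp above, with the array type fixed to Input.
template <class T, std::size_t Length>
struct Sum {
    static T comp(const Input& x, T add = 0) {
        return Sum<T, Length - 1>::comp(
            x, add + static_cast<T>(Length) * x[Length - 1]);
    }
};

template <class T>
struct Sum<T, 0> {
    static T comp(const Input&, T add = 0) { return add; }
};

template <int N>
struct Comp {
    static int comp(const Input& x) {
        return d[N] ? Sum<int, static_cast<std::size_t>(e[N])>::comp(x) : 10;
    }
};

// One function pointer per compile-time index.
using CompFn = int (*)(const Input&);

template <std::size_t... Is>
constexpr std::array<CompFn, sizeof...(Is)> make_table(std::index_sequence<Is...>) {
    return {{&Comp<static_cast<int>(Is)>::comp...}};
}

// Picks the instantiation that matches a run-time index i.
inline int comp_dispatch(std::size_t i, const Input& x) {
    static constexpr auto table = make_table(std::make_index_sequence<kCount>{});
    assert(i < kCount);  // i must be a valid index into d and e
    return table[i](x);
}
```

The price is one indirect call per lookup; in exchange, the body behind each table entry has its branch and loop bound resolved at compile time.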

Finally, I recommend reading the third and fourth articles in the reference list, which also discuss compile-time template computation in depth.