1 理论基础

在椭圆曲线密码学（ECC）中，kG也称标量乘法运算，即把椭圆曲线上的基点G与标量k进行相乘的运算，结果是椭圆曲线上的另一个点R=kG，其定义为k个G连续相加的结果。该运算是椭圆曲线密钥生成、加解密、签名及验证中的核心运算，所以围绕它产生了多种加速算法，这里仅对secp256k1中前后出现的两种典型算法进行说明。

1.1 窗口查表算法（Windowed lookup）

该算法是secp256k1库早期所使用的算法，其核心思想是将标量k（以256位标量为例）按bits位划分为256/bits个窗口，然后根据预计算查找表查找每个窗口值对应的点，再将每个窗口点相加即可得到最终结果R，其数学公式如下：

以窗口大小bits=4为例，则窗口数为256/4=64个，上述公式可以表示为：

这里k_i 为4bit二进制数，所以每个窗口包含2⁴=16个查找点（一个窗口对应点0G, 1G, 2G, ..., 15G），则包含16个窗口的查找表为64x16的二维数组，整个表的内存大小为64*16*sizeof(POINT)，假如POINT点x，y坐标都以32字节保存，则查找表所需的内存大小为64*16*(32+32)=64KB。这里如果将bits增大到16，则每个窗口包含2¹⁶=65536个查找点，窗口个数为256/16=16个，整个查找表大小为16*65536*sizeof(POINT)=64MB。

根据以上分析可知整个kG计算复杂度为O(256/bits)点加操作，例如当bits=4时，需要查找64次表再将对应结果点进行点加即可。

1.2 多梳状算法（Multi-Comb）

为了严格的对应上源码，这里也加入了盲化部分内容，即计算R=kG时不直接进行计算，而是计算R=(k - b)*G + b*G，这样是为了防止攻击者通过分析功耗、电磁辐射等侧信道信息来推测出k，从而引入随机盲化值b，对于盲化后的计算公式，b*G可以预运算获得（对应后续源码中的ge_offset，通过查表计算出(k-b)*G后要加上该值才能得到最终值kG），所以对于任意的k，其实盲化后主要计算部分是(k - b)*G。在计算该部分时，使用了有符号数字多梳状算法，该算法中定义的comb(s, P)函数是一个关键内容：

comb(s, P) = sum( (2*s[i] - 1) * 2^i * P ) for i=0..COMB_BITS-1

其中，s[i]是标量s的第i个比特值（0或1），P是椭圆曲线上点。公式中2*s[i] - 1是一个巧妙的转换：如果s[i] = 1，则2*1 - 1；如果s[i] = 0，则2*0 - 1 = -1，所以上式中对于每个比特位i，不是加上2^i*P就是减去2^i*P，这就是"有符号数字"的含义（系数是+1或-1），可以进一步将该式简化称一个更熟悉的形式：

复制代码

comb(s, P) = sum( (2*s[i] - 1) * 2^i * P )                                （1）
           = [ sum( 2*s[i]*2^i ) - sum( 2^i ) ] * P
           = [ 2 * sum(s[i]*2^i) - (2^COMB_BITS - 1) ] * P
           = [ 2*s - (2^COMB_BITS - 1) ] * P

1. 将盲化与梳状算法结合

最直接的想法是在计算(k - b)*G时，让其等于comb(s, G)，即有：

(k - b) * G = comb(s, G) = [ 2*s - (2^COMB_BITS - 1) ] * G

由此可知只要解出相应s，即可通过调用（1）公式计算出(k-b)*G，s可以通过以下公式进行求解：

s = (k - b + (2^COMB_BITS - 1)) / 2 (mod order) (2)

这个公式需要对2进行模逆运算（相当于除法），在模运算中，除法是复杂且耗时的操作，需要避免。

2. 避免模除2

为了避免昂贵的模除2操作，这里采用了一个巧妙地优化，将公式中基点G除于2，不再计算comb(s, G)，而是计算comb(d, G/2)，仍利用最一开始的公式，则有：

复制代码

comb(d, G/2) = sum( (2*d[i] - 1) * 2^i * (G/2) )                         (3)
             = [ 2*d - (2^COMB_BITS - 1) ] * (G/2)
             = [ d - (2^COMB_BITS - 1)/2 ] * G

现在，令comb(d, G/2) = (k - b) * G，则有：

(k - b) * G = [ d - (2^COMB_BITS - 1)/2 ] * G

得到：

k - b = d - (2^COMB_BITS - 1)/2

最终解出d：

d = k - b + (2^COMB_BITS - 1)/2 (mod order)　　(4)

这里(2^COMB_BITS - 1)/2 是一个常量，可以预先计算。现在，计算d只需要一次模加法和一次模减法，完全避免了模除运算。在后续的源码实现中定义偏移量scalar_offset=(2^COMB_BITS - 1)/2 - b (mod order)，则有d=k+scalar_offset，最终kG=(k-b)*G+b*G=comb(d, G/2)+ge_offset。

3. 梳状含义

在梳状算法中会将标量k拆分成多个位段，每个位段称之为块（Block），每个块中又有T个齿（Teeth），齿与齿之间又有S个间距（Spacing）。以源码中典型取值为例：

复制代码

#define COMB_BLOCKS 11
#define COMB_TEETH 6
#define COMB_SPACING 4

表明会将标量k分成11块，每个块有6个齿，齿间距为4，则每个Block中包含24bits数据，梳子每次可以选中两两间隔为4的6bits数据，在这样的结构下11个块可以包含264bits数据，完全覆盖256bits的标量k（标量k还需进行数据位填充）。

定义mask(b) = sum(2^((b*COMB_TEETH + t)*COMB_SPACING) for t=0..COMB_TEETH-1)，则当b取值从0到COMB_BLOCKS-1时，有以下对应关系：

复制代码

mask(0)  = 2^0   + 2^4   + 2^8   + 2^12  + 2^16  + 2^20,
mask(1)  = 2^24  + 2^28  + 2^32  + 2^36  + 2^40  + 2^44,
mask(2)  = 2^48  + 2^52  + 2^56  + 2^60  + 2^64  + 2^68,
...
mask(10) = 2^240 + 2^244 + 2^248 + 2^252 + 2^256 + 2^260

由此可通过这些掩码拆分比特位d[i]，具体来说，每个掩码会被使用COMB_SPACING次，且每次使用时的偏移量不同。

复制代码

d = (d & mask(0)<<0) + (d & mask(1)<<0) + ... + (d & mask(COMB_BLOCKS-1)<<0) +
    (d & mask(0)<<1) + (d & mask(1)<<1) + ... + (d & mask(COMB_BLOCKS-1)<<1) +
    ...
    (d & mask(0)<<(COMB_SPACING-1)) + ...

接下来定义：

复制代码

table(b, m) = (m - mask(b)/2) * G                             （5）

这里b=0..COMB_BLOCKS-1，m=(d & mask(b))，m是标量d在b块中对应的值，每个块对应的m可以有2^COMB_TEETH个不同取值，可以预计算m对应的点乘值m*G。

table(b, m) = (m - mask(b)/2)*G = ((d&mask(b)) - mask(b)/2)*G = (2(d&mask(b)) - mask(b))*G/2 = ((2d - 1)&mask(b))*G/2 = sum(2^i * (2*d[i] - 1) * G/2) 这里i遍历掩码mask(b)中的置位比特位，即最终有：

复制代码

table(b, m) = sum(2^i * (2*d[i] - 1) * G/2)                   （6）

例如当m=2^48 + 2^56 + 2^68时（位于块2中的值），则有table(2, m) = (2^48 - 2^52 + 2^56 - 2^60 - 2^64 + 2^68) * G/2。

结合以上定义，可以重写comb(d, G/2)：

复制代码

comb(d, G/2) = 2^0 * (table(0, d>>0 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>0 & mask(COMP_BLOCKS-1)))                               （7）
             + 2^1 * (table(0, d>>1 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>1 & mask(COMP_BLOCKS-1)))
             + 2^2 * (table(0, d>>2 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>2 & mask(COMP_BLOCKS-1)))
             + ...
             + 2^(COMB_SPACING-1) * (table(0, d>>(COMB_SPACING-1) & mask(0)) + ...)

即：

复制代码

sum(2^i * sum(table(b, d>>i & mask(b)), b=0..COMB_BLOCKS-1), i=0..COMB_SPACING-1)

以下是计算过程伪代码：

复制代码

c = infinity
for comb_off in range(COMB_SPACING - 1, -1, -1):
  for block in range(COMB_BLOCKS):
    c += table(block, (d >> comb_off) & mask(block))
  if comb_off > 0:
    c = 2*c
return c

在以上伪代码中，梳子是从高位"梳到"低位的，即依次先处理每个块的高位，再依次处理每个块的低位。

4. 折半表

之前已经有结论每个块对应的m可以有2^COMB_TEETH个不同取值，所以对于table(b, m)来说需要有2^COMB_TEETH个表项才能覆盖该b块所有可能取值，但实际上只需一半的表项即可覆盖所有可能取值，即table(b, m)实际只包含2^(COMB_TEETH-1)个表项，以下是分析过程：

由m=(d & mask(b))，令m'是m所有位反转对应的值，则有m'=m XOR mask(b)，进而有table(b, m') = table(b, m XOR mask(b)) = table(b, mask(b) - m) = (mask(b) - m - mask(b)/2)*G = -(m - mask(b)/2)*G = -table(b, m)。

即如果m'对应是m的所有位反转，那么m'对应的table表项即为m相应表项的负值，对于任意椭圆点P = (x, y)，其负值很容易求得即-P = (x, -y)，所以table(b, m)表只需保存一半的表项，另一半的位反转表项，只需对相应的表项取负即可。对于0块来说，当梳齿对应块中最低位时，有以下对应关系：

例如由m=2^0 + 2^8 + 2^16时的表项（对应点P=(x, y)），可以直接求出m'=2^4 + 2^12 + 2^20的表项（对应点-P = (x, -y)），只需要对m值对应的表项中y值取负即可得到m'值对应的表项。

2 源码详解

2.1 预计算表

函数secp256k1_ecmult_gen_compute_table用于产生第1节中的table(b, m)表项，函数源码如下：

复制代码

  1 static void secp256k1_ecmult_gen_compute_table(secp256k1_ge_storage* table, const secp256k1_ge* gen, int blocks, int teeth, int spacing) {
  2     size_t points = ((size_t)1) << (teeth - 1);                                                 // 每个块的预计算点数 = 2^(teeth-1)
  3     size_t points_total = points * blocks;                                                      // 总共预计算的点数 = 每个块的预计算点数 * 块数
  4     secp256k1_ge* prec = checked_malloc(&default_error_callback, points_total * sizeof(*prec)); // 长度为总预计算点数的仿射点数组
  5     secp256k1_gej* ds = checked_malloc(&default_error_callback, teeth * sizeof(*ds));           // 每块的"tooth"的雅可比坐标点数组
  6     secp256k1_gej* vs = checked_malloc(&default_error_callback, points_total * sizeof(*vs));    // 长度为总预计算点数的雅可比坐标点数组
  7     secp256k1_gej u;
  8     size_t vs_pos = 0;
  9     secp256k1_scalar half;
 10     secp256k1_ge halfgenAffine;
 11     secp256k1_ge_storage halfgenStorage;
 12     int block, i;
 13 
 14     VERIFY_CHECK(points_total > 0);
 15 
 16     /* u is the running power of two times gen we're working with, initially gen/2. */
 17     secp256k1_scalar_half(&half, &secp256k1_scalar_one);
 18     //print_scalar(&half, "0x", "\n");
 19     secp256k1_gej_set_infinity(&u);
 20     for (i = 255; i >= 0; --i) {
 21         /* Use a very simple multiplication ladder to avoid dependency on ecmult. */
 22         secp256k1_gej_double_var(&u, &u, NULL);
 23         if (secp256k1_scalar_get_bits_limb32(&half, i, 1)) {
 24             secp256k1_gej_add_ge_var(&u, &u, gen, NULL);
 25         }
 26     }
 27 
 28     secp256k1_ge_set_gej(&halfgenAffine, &u);
 29     secp256k1_ge_to_storage(&halfgenStorage, &halfgenAffine);
 30     //print_storage(&halfgenStorage.x, "0x", "\n");
 31     //print_storage(&halfgenStorage.y, "0x", "\n");
 32 #ifdef VERIFY
 33     {
 34         /* Verify that u*2 = gen. */
 35         secp256k1_gej double_u;
 36         secp256k1_gej_double_var(&double_u, &u, NULL);
 37         VERIFY_CHECK(secp256k1_gej_eq_ge_var(&double_u, gen));
 38     }
 39 #endif
 40 
 41     for (block = 0; block < blocks; ++block) {
 42         int tooth;
 43         /* Here u = 2^(block*teeth*spacing) * gen/2. */
 44         secp256k1_gej sum;
 45         secp256k1_gej_set_infinity(&sum);
 46         for (tooth = 0; tooth < teeth; ++tooth) {
 47             /* Here u = 2^((block*teeth + tooth)*spacing) * gen/2. */
 48             /* Make sum = sum(2^((block*teeth + t)*spacing), t=0..tooth) * gen/2. */
 49             secp256k1_gej_add_var(&sum, &sum, &u, NULL);
 50             /* Make u = 2^((block*teeth + tooth)*spacing + 1) * gen/2. */
 51             secp256k1_gej_double_var(&u, &u, NULL);
 52             /* Make ds[tooth] = u = 2^((block*teeth + tooth)*spacing + 1) * gen/2. */
 53             ds[tooth] = u;
 54             /* Make u = 2^((block*teeth + tooth + 1)*spacing) * gen/2, unless at the end. */
 55             if (block + tooth != blocks + teeth - 2) {
 56                 int bit_off;
 57                 for (bit_off = 1; bit_off < spacing; ++bit_off) {
 58                     secp256k1_gej_double_var(&u, &u, NULL);
 59                 }
 60             }
 61         }
 62         /* Now u = 2^((block*teeth + teeth)*spacing) * gen/2
 63          *       = 2^((block+1)*teeth*spacing) * gen/2       */
 64 
 65         /* Next, compute the table entries for block number block in Jacobian coordinates.
 66          * The entries will occupy vs[block*points + i] for i=0..points-1.
 67          * We start by computing the first (i=0) value corresponding to all summed
 68          * powers of two times G being negative. */
 69         secp256k1_gej_neg(&vs[vs_pos++], &sum);
 70         /*secp256k1_ge_set_gej(&halfgenAffine, &vs[vs_pos - 1]);
 71         secp256k1_ge_to_storage(&halfgenStorage, &halfgenAffine);
 72         print_storage(&halfgenStorage.x, "0x", "\n");
 73         print_storage(&halfgenStorage.y, "0x", "\n");*/
 74         /* And then teeth-1 times "double" the range of i values for which the table
 75          * is computed: in each iteration, double the table by taking an existing
 76          * table entry and adding ds[tooth]. */
 77         for (tooth = 0; tooth < teeth - 1; ++tooth) {
 78             size_t stride = ((size_t)1) << tooth;
 79             size_t index;
 80             for (index = 0; index < stride; ++index, ++vs_pos) {
 81                 secp256k1_gej_add_var(&vs[vs_pos], &vs[vs_pos - stride], &ds[tooth], NULL);
 82             }
 83         }
 84     }
 85     VERIFY_CHECK(vs_pos == points_total);
 86 
 87     /* Convert all points simultaneously from secp256k1_gej to secp256k1_ge. */
 88     secp256k1_ge_set_all_gej_var(prec, vs, points_total);
 89     /* Convert all points from secp256k1_ge to secp256k1_ge_storage output. */
 90     for (block = 0; block < blocks; ++block) {
 91         size_t index;
 92         for (index = 0; index < points; ++index) {
 93             VERIFY_CHECK(!secp256k1_ge_is_infinity(&prec[block * points + index]));
 94             secp256k1_ge_to_storage(&table[block * points + index], &prec[block * points + index]);
 95             print_storage(&table[block * points + index].x, "0x", "\n");
 96             print_storage(&table[block * points + index].y, "0x", "\n");
 97         }
 98     }
 99 
100     /* Free memory. */
101     free(vs);
102     free(ds);
103     free(prec);
104 }

正如之前分析代码中第2行给出了每个块儿需要进行预计算点的个数（即table(b, m)在固定b时的表项数），COMB_TEETH=6时，表项数是2^5=32个；第3行为b取值从0到COMB_BLOCKS-1时完整表中点的总数，以COMB_BLOCKS=11为例，点的总数是32*11=352；接下来第4~6行为对应的表项分配内存空间，其中ds用于存储在每个块中梳齿对应的"权重"，所以其大小是梳齿个数6，以0块儿为例，其取值为2*G/2，2*2^4*G/2，2*2^8*G/2，2*2^12*G/2，2*2^16*G/2，2*2^20*G/2，后续给出详细求解过程；第17~26行先求得标量1/2 mod order，再用通用的倍点加法求射影点u=G/2；第28，29行分别获取点u对应的仿射坐标及存储坐标值；第33~38行验证部分是检查2*u是否等于生成点G。

接下来，第11行处的for循环用于对11个块儿，依次求块对应的table(b, m)内容；之后第44，45行定义临时变量sum，并将其初始化为零点，第46行的for循环用于求之前所说的ds，以及公式（6）中减数部分的累加和sum；具体来说，第47，48行给出了后续u和sum的具体计算公式；接下来第49行将上轮更新过的u加到sum上（两个for循环都为第一次时，u值为G/2，加完以后sum=G/2）;之后第51行将u翻倍，例如在tooth=0时，u=2*G/2；第53~60行，先将u赋值给ds[tooth]，然后再将u翻SPACING-1倍，算上第51行翻倍u相当于共翻倍SPACING次（当SPACING=4时对应2^4），另外55行的if判断用于在最后一个块最后一次时，由于不再使用这时已无需对u再进行SPACING-1次翻倍操作，for循环完毕已经依次求出之前所说的梳齿"权重"------ds，以及公式（6）中减数部分的累加和------sum。

第69行将累加和取负（因为sum对应公式中的减数部分）后，赋值给当前块中的第一个表项vs[pos]，对应所有梳齿位都位0时表项值，接下来，第77行for循环分别计算梳齿0，1，...，teeth-2齿位对应是1时的table的表项值；第78行用于计算对应梳齿位为1时表项的个数，如第0个梳齿位为1时，只能产生2^0=1个表项000001，第1个梳齿位为1时，可以产生2^1=2个表项000010和000011，第2个梳齿位为1时，可以产生2^2=4个表项000100，000101，000110，000111，依次类推第teeth-2=4齿位对应是1时，可以产生2^4=16个表项；之后第80~82行的for循环依次产生之前折半表部分图中前半部分对应的表项值，总个数是1+1+2+4+..+16=32。

代码接下来的内容主要是坐标转换相关内容，这里不再进行详细解释，总之循环执行完毕会产生所有11个块对用的查找表。

2.2 计算k*G

接下来需要看如何利用上一小节产生的查找表进行k*G计算，仍旧先给出计算函数secp256k1_ecmult_gen源码：

复制代码

  1 static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp256k1_gej *r, const secp256k1_scalar *gn) {
  2     uint32_t comb_off;
  3     secp256k1_ge add;
  4     secp256k1_fe neg;
  5     secp256k1_ge_storage adds;
  6     secp256k1_scalar d;
  7     /* Array of uint32_t values large enough to store COMB_BITS bits. Only the bottom
  8      * 8 are ever nonzero, but having the zero padding at the end if COMB_BITS>256
  9      * avoids the need to deal with out-of-bounds reads from a scalar. */
 10     uint32_t recoded[(COMB_BITS + 31) >> 5] = {0};
 11     int first = 1, i;
 12 
 13     memset(&adds, 0, sizeof(adds));
 14 
 15     /* We want to compute R = gn*G.
 16      *
 17      * To blind the scalar used in the computation, we rewrite this to be
 18      * R = (gn - b)*G + b*G, with a blinding value b determined by the context.
 19      *
 20      * The multiplication (gn-b)*G will be performed using a signed-digit multi-comb (see Section
 21      * 3.3 of "Fast and compact elliptic-curve cryptography" by Mike Hamburg,
 22      * https://eprint.iacr.org/2012/309).
 23      *
 24      * Let comb(s, P) = sum((2*s[i]-1)*2^i*P for i=0..COMB_BITS-1), where s[i] is the i'th bit of
 25      * the binary representation of scalar s. So the s[i] values determine whether -2^i*P (s[i]=0)
 26      * or +2^i*P (s[i]=1) are added together. COMB_BITS is at least 256, so all bits of s are
 27      * covered. By manipulating:
 28      *
 29      *     comb(s, P) = sum((2*s[i]-1)*2^i*P for i=0..COMB_BITS-1)
 30      * <=> comb(s, P) = sum((2*s[i]-1)*2^i for i=0..COMB_BITS-1) * P
 31      * <=> comb(s, P) = (2*sum(s[i]*2^i for i=0..COMB_BITS-1) - sum(2^i for i=0..COMB_BITS-1)) * P
 32      * <=> comb(s, P) = (2*s - (2^COMB_BITS - 1)) * P
 33      *
 34      * If we wanted to compute (gn-b)*G as comb(s, G), it would need to hold that
 35      *
 36      *     (gn - b) * G = (2*s - (2^COMB_BITS - 1)) * G
 37      * <=> s = (gn - b + (2^COMB_BITS - 1))/2 (mod order)
 38      *
 39      * We use an alternative here that avoids the modular division by two: instead we compute
 40      * (gn-b)*G as comb(d, G/2). For that to hold it must be the case that
 41      *
 42      *     (gn - b) * G = (2*d - (2^COMB_BITS - 1)) * (G/2)
 43      * <=> d = gn - b + (2^COMB_BITS - 1)/2 (mod order)
 44      *
 45      * Adding precomputation, our final equations become:
 46      *
 47      *     ctx->scalar_offset = (2^COMB_BITS - 1)/2 - b (mod order)
 48      *     ctx->ge_offset = b*G
 49      *     d = gn + ctx->scalar_offset (mod order)
 50      *     R = comb(d, G/2) + ctx->ge_offset
 51      *
 52      * comb(d, G/2) function is then computed by summing + or - 2^(i-1)*G, for i=0..COMB_BITS-1,
 53      * depending on the value of the bits d[i] of the binary representation of scalar d.
 54      */
 55 
 56     /* Compute the scalar d = (gn + ctx->scalar_offset). */
 57     secp256k1_scalar_add(&d, &ctx->scalar_offset, gn);
 58     /* Convert to recoded array. */
 59     for (i = 0; i < 8 && i < ((COMB_BITS + 31) >> 5); ++i) {
 60         recoded[i] = secp256k1_scalar_get_bits_limb32(&d, 32 * i, 32);
 61     }
 62     secp256k1_scalar_clear(&d);
 63 
 64     /* In secp256k1_ecmult_gen_prec_table we have precomputed sums of the
 65      * (2*d[i]-1) * 2^(i-1) * G points, for various combinations of i positions.
 66      * We rewrite our equation in terms of these table entries.
 67      *
 68      * Let mask(b) = sum(2^((b*COMB_TEETH + t)*COMB_SPACING) for t=0..COMB_TEETH-1),
 69      * with b ranging from 0 to COMB_BLOCKS-1. So for example with COMB_BLOCKS=11,
 70      * COMB_TEETH=6, COMB_SPACING=4, we would have:
 71      *   mask(0)  = 2^0   + 2^4   + 2^8   + 2^12  + 2^16  + 2^20,
 72      *   mask(1)  = 2^24  + 2^28  + 2^32  + 2^36  + 2^40  + 2^44,
 73      *   mask(2)  = 2^48  + 2^52  + 2^56  + 2^60  + 2^64  + 2^68,
 74      *   ...
 75      *   mask(10) = 2^240 + 2^244 + 2^248 + 2^252 + 2^256 + 2^260
 76      *
 77      * We will split up the bits d[i] using these masks. Specifically, each mask is
 78      * used COMB_SPACING times, with different shifts:
 79      *
 80      * d = (d & mask(0)<<0) + (d & mask(1)<<0) + ... + (d & mask(COMB_BLOCKS-1)<<0) +
 81      *     (d & mask(0)<<1) + (d & mask(1)<<1) + ... + (d & mask(COMB_BLOCKS-1)<<1) +
 82      *     ...
 83      *     (d & mask(0)<<(COMB_SPACING-1)) + ...
 84      *
 85      * Now define table(b, m) = (m - mask(b)/2) * G, and we will precompute these values for
 86      * b=0..COMB_BLOCKS-1, and for all values m which (d & mask(b)) can take (so m can take on
 87      * 2^COMB_TEETH distinct values).
 88      *
 89      * If m=(d & mask(b)), then table(b, m) is the sum of 2^i * (2*d[i]-1) * G/2, with i
 90      * iterating over the set bits in mask(b). In our example, table(2, 2^48 + 2^56 + 2^68)
 91      * would equal (2^48 - 2^52 + 2^56 - 2^60 - 2^64 + 2^68) * G/2.
 92      *
 93      * With that, we can rewrite comb(d, G/2) as:
 94      *
 95      *     2^0 * (table(0, d>>0 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>0 & mask(COMP_BLOCKS-1)))
 96      *   + 2^1 * (table(0, d>>1 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>1 & mask(COMP_BLOCKS-1)))
 97      *   + 2^2 * (table(0, d>>2 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>2 & mask(COMP_BLOCKS-1)))
 98      *   + ...
 99      *   + 2^(COMB_SPACING-1) * (table(0, d>>(COMB_SPACING-1) & mask(0)) + ...)
100      *
101      * Or more generically as
102      *
103      *   sum(2^i * sum(table(b, d>>i & mask(b)), b=0..COMB_BLOCKS-1), i=0..COMB_SPACING-1)
104      *
105      * This is implemented using an outer loop that runs in reverse order over the lines of this
106      * equation, which in each iteration runs an inner loop that adds the terms of that line and
107      * then doubles the result before proceeding to the next line.
108      *
109      * In pseudocode:
110      *   c = infinity
111      *   for comb_off in range(COMB_SPACING - 1, -1, -1):
112      *     for block in range(COMB_BLOCKS):
113      *       c += table(block, (d >> comb_off) & mask(block))
114      *     if comb_off > 0:
115      *       c = 2*c
116      *   return c
117      *
118      * This computes c = comb(d, G/2), and thus finally R = c + ctx->ge_offset. Note that it would
119      * be possible to apply an initial offset instead of a final offset (moving ge_offset to take
120      * the place of infinity above), but the chosen approach allows using (in a future improvement)
121      * an incomplete addition formula for most of the multiplication.
122      *
123      * The last question is how to implement the table(b, m) function. For any value of b,
124      * m=(d & mask(b)) can only take on at most 2^COMB_TEETH possible values (the last one may have
125      * fewer as there mask(b) may exceed the curve order). So we could create COMB_BLOCK tables
126      * which contain a value for each such m value.
127      *
128      * Now note that if m=(d & mask(b)), then flipping the relevant bits of m results in negating
129      * the result of table(b, m). This is because table(b,m XOR mask(b)) = table(b, mask(b) - m) =
130      * (mask(b) - m - mask(b)/2)*G = (-m + mask(b)/2)*G = -(m - mask(b)/2)*G = -table(b, m).
131      * Because of this it suffices to only store the first half of the m values for every b. If an
132      * entry from the second half is needed, we look up its bit-flipped version instead, and negate
133      * it.
134      *
135      * secp256k1_ecmult_gen_prec_table[b][index] stores the table(b, m) entries. Index
136      * is the relevant mask(b) bits of m packed together without gaps. */
137 
138     /* Outer loop: iterate over comb_off from COMB_SPACING - 1 down to 0. */
139     comb_off = COMB_SPACING - 1;
140     while (1) {
141         uint32_t block;
142         uint32_t bit_pos = comb_off;
143         /* Inner loop: for each block, add table entries to the result. */
144         for (block = 0; block < COMB_BLOCKS; ++block) {
145             /* Gather the mask(block)-selected bits of d into bits. They're packed:
146              * bits[tooth] = d[(block*COMB_TEETH + tooth)*COMB_SPACING + comb_off]. */
147             uint32_t bits = 0, sign, abs, index, tooth;
148             /* Instead of reading individual bits here to construct the bits variable,
149              * build up the result by xoring rotated reads together. In every iteration,
150              * one additional bit is made correct, starting at the bottom. The bits
151              * above that contain junk. This reduces leakage by avoiding computations
152              * on variables that can have only a low number of possible values (e.g.,
153              * just two values when reading a single bit into a variable.) See:
154              * https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-alam.pdf
155              */
156             for (tooth = 0; tooth < COMB_TEETH; ++tooth) {
157                 /* Construct bitdata s.t. the bottom bit is the bit we'd like to read.
158                  *
159                  * We could just set bitdata = recoded[bit_pos >> 5] >> (bit_pos & 0x1f)
160                  * but this would simply discard the bits that fall off at the bottom,
161                  * and thus, for example, bitdata could still have only two values if we
162                  * happen to shift by exactly 31 positions. We use a rotation instead,
163                  * which ensures that bitdata doesn't lose entropy. This relies on the
164                  * rotation being atomic, i.e., the compiler emitting an actual rot
165                  * instruction. */
166                 uint32_t bitdata = secp256k1_rotr32(recoded[bit_pos >> 5], bit_pos & 0x1f);
167 
168                 /* Clear the bit at position tooth, but sssh, don't tell clang. */
169                 uint32_t volatile vmask = ~(1 << tooth);
170                 bits &= vmask;
171 
172                 /* Write the bit into position tooth (and junk into higher bits). */
173                 bits ^= bitdata << tooth;
174                 bit_pos += COMB_SPACING;
175             }
176 
177             /* If the top bit of bits is 1, flip them all (corresponding to looking up
178              * the negated table value), and remember to negate the result in sign. */
179             sign = (bits >> (COMB_TEETH - 1)) & 1;
180             abs = (bits ^ -sign) & (COMB_POINTS - 1);
181             VERIFY_CHECK(sign == 0 || sign == 1);
182             VERIFY_CHECK(abs < COMB_POINTS);
183 
184             /** This uses a conditional move to avoid any secret data in array indexes.
185              *   _Any_ use of secret indexes has been demonstrated to result in timing
186              *   sidechannels, even when the cache-line access patterns are uniform.
187              *  See also:
188              *   "A word of warning", CHES 2013 Rump Session, by Daniel J. Bernstein and Peter Schwabe
189              *    (https://cryptojedi.org/peter/data/chesrump-20130822.pdf) and
190              *   "Cache Attacks and Countermeasures: the Case of AES", RSA 2006,
191              *    by Dag Arne Osvik, Adi Shamir, and Eran Tromer
192              *    (https://eprint.iacr.org/2005/271.pdf)
193              */
194             for (index = 0; index < COMB_POINTS; ++index) {
195                 secp256k1_ge_storage_cmov(&adds, &secp256k1_ecmult_gen_prec_table[block][index], index == abs);
196             }
197 
198             /* Set add=adds or add=-adds, in constant time, based on sign. */
199             secp256k1_ge_from_storage(&add, &adds);
200             secp256k1_fe_negate(&neg, &add.y, 1);
201             secp256k1_fe_cmov(&add.y, &neg, sign);
202 
203             /* Add the looked up and conditionally negated value to r. */
204             if (EXPECT(first, 0)) {
205                 /* If this is the first table lookup, we can skip addition. */
206                 secp256k1_gej_set_ge(r, &add);
207                 /* Give the entry a random Z coordinate to blind intermediary results. */
208                 secp256k1_gej_rescale(r, &ctx->proj_blind);
209                 first = 0;
210             } else {
211                 secp256k1_gej_add_ge(r, r, &add);
212             }
213         }
214 
215         /* Double the result, except in the last iteration. */
216         if (comb_off-- == 0) break;
217         secp256k1_gej_double(r, r);
218     }
219 
220     /* Correct for the scalar_offset added at the start (ge_offset = b*G, while b was
221      * subtracted from the input scalar gn). */
222     secp256k1_gej_add_ge(r, r, &ctx->ge_offset);
223 
224     /* Cleanup. */
225     secp256k1_fe_clear(&neg);
226     secp256k1_ge_clear(&add);
227     secp256k1_memclear_explicit(&adds, sizeof(adds));
228     secp256k1_memclear_explicit(&recoded, sizeof(recoded));
229 }

源码第10给出一个位重编码变量recoded，用于存放标量gn（也即之前说的k）的二级制位，仍以之前的梳状结构为例，则COMB_BITS=11*6*4=264bits，则recoded长度为9，这时9*(1<<5)=288>264可以完全存储相应的重编码位；源码中的注释对应上一节的解析内容，接下来不再详细进行说明；第57行应用公式（4）对d进行求解；第59~62行从256bits的标量d中依次取出32bits并放入到recoded中，在32位实现下这里相当于把d中数据直接拷贝到recoded。

接下来是函数实现主体部分，首先第139行将comb_off初始化为COMB_SPACING-1，即从梳子从块儿的高位开始；之后140行处的while会依次处理comb_off位对应的各个梳齿取值，直到comb_off移动至0位；第142行将bit_pos初始化为com_off，梳齿对应第0块的高位；随后第144行用for循环依次处理COMB_BLOCKS个块儿；第156行使用for循环对于当前块儿的COMB_TEETH个梳齿位依次进行处理；第166行将包含bit_pos位的recoded[]进行循环右移，使得结果bitdata的最低位正好是我们需要的bit_pos位；第169行是tooth齿对应的掩码vmask，在掩码中tooth位为0其他位为1；第170行通过与掩码vmask做且操作清除目标位；接下来第173行将刚刚提取的位（bitdata最低位）写入到bits的第tooth个位置，bitdata<< tooth将包含所需位的32位数通过左移到将最低位移动到正确的位置------tooth，然后通过异或来将tooth值写入到bits（上一步已将bits的tooth位清零，零与其他值异或保留其他值）；随后第174行将bit_pos移动到下一个梳齿位。可以看出循环结束后bits会在低COMB_TEETH位依次记录下当前块的梳齿位对应的值（bits低COMB_TEETH位以外值为无效数据），即将块儿中相应值"梳"出来。

之后，第179行将"梳"出来的bits最高位做为符号位；第180行取得除符号外，其他位实际应该的取值abs，其中-sign为mask，sign=0时，对应mask=0x00000000，sign=1时，对应mask=0xFFFFFFFF，即sign=0时取bits值（与0异或取原址），sign=1时bits值每个位都取反（与1异或取反值），之后通过&(COMB_POINT-1)获取bits mask后的低5bit值；第194~196行处的for循环取值abs对应的table表项值；第199~201行根据符号位确定是否进行取负操作；第204~212行根据是否为第一次将相应的表项值赋值或者附加到r上，在第一次时还可以会用proj_blind对r初始值进行盲化操作：

由图中解释可知经过盲化操作后，点r'虽然各个坐标分量的取值都已经都已经改变，但是由射影坐标的定义可知，其实r'和r对应同样的仿射坐标，即它们表示同一个点。

当144行的for循环执行完毕，表示当前comb_off梳齿位对应的所有块儿已经处理完毕，接下来第216行将梳齿向低位移位继续进行下一个梳齿位对应块的处理，直到COMB_SPACING=4个梳齿位都处理完毕则跳出while循环；另外因为梳齿位是从高位到低位进行的，所以第217行对r进行了翻倍操作，这相当于公式（7）中每行内的移位操作。

整个while循环执行完毕后得到的值r=(gn - b)*G，最终gn*G=(gn - b)*G + b*G，所以在222行又在r上加上了ge_offset=b*G，最终得到gn*G，求解结束。

2.3 补充说明

从上面的分析可知分块数COMB_BLOCKS和梳齿数COMB_TEETH共同决定了查找表的大小，其中分块数对查找表是线性影响，而梳齿数对查找表是指数级影响（影响更大），所以在内存受限或者优先考虑内存时，需要对这两个参数的取值进行斟酌考虑。另外COMB_BLOCKS决定了查表访存次数和点加操作次数，而COMB_SPACING决定了倍点操作次数，在进行CUDA编程时都需要根据实际情况，斟酌确定相关参数的取值。

secp256k1算法详解五（kG点乘多梳状算法）

1 理论基础

1.1 窗口查表算法（Windowed lookup）

1.2 多梳状算法（Multi-Comb）

1. 将盲化与梳状算法结合

2. 避免模除2

3. 梳状含义

4. 折半表

2 源码详解

2.1 预计算表

2.2 计算k*G

2.3 补充说明