1 理论基础
在椭圆曲线密码学(ECC)中,kG也称标量乘法运算,即把椭圆曲线上的基点G与标量k进行相乘的运算,结果是椭圆曲线上的另一个点R=kG,其定义为k个G连续相加的结果。该运算是椭圆曲线密钥生成、加解密、签名及验证中的核心运算,所以围绕它产生了多种加速算法,这里仅对secp256k1中前后出现的两种典型算法进行说明。
1.1 窗口查表算法(Windowed lookup)
该算法是secp256k1库早期所使用的算法,其核心思想是将标量k(以256位标量为例)按bits位划分为256/bits个窗口,然后根据预计算查找表查找每个窗口值对应的点,再将每个窗口点相加即可得到最终结果R,其数学公式如下:

以窗口大小bits=4为例,则窗口数为256/4=64个,上述公式可以表示为:

这里ki 为4bit二进制数,所以每个窗口包含24=16个查找点(一个窗口对应点0G, 1G, 2G, ..., 15G),则包含16个窗口的查找表为64x16的二维数组,整个表的内存大小为64*16*sizeof(POINT),假如POINT点x,y坐标都以32字节保存,则查找表所需的内存大小为64*16*(32+32)=64KB。这里如果将bits增大到16,则每个窗口包含216=65536个查找点,窗口个数为256/16=16个,整个查找表大小为16*65536*sizeof(POINT)=64MB。
根据以上分析可知整个kG计算复杂度为O(256/bits)点加操作,例如当bits=4时,需要查找64次表再将对应结果点进行点加即可。
1.2 多梳状算法(Multi-Comb)
为了严格的对应上源码,这里也加入了盲化部分内容,即计算R=kG时不直接进行计算,而是计算R=(k - b)*G + b*G,这样是为了防止攻击者通过分析功耗、电磁辐射等侧信道信息来推测出k,从而引入随机盲化值b,对于盲化后的计算公式,b*G可以预运算获得(对应后续源码中的ge_offset,通过查表计算出(k-b)*G后要加上该值才能得到最终值kG),所以对于任意的k,其实盲化后主要计算部分是(k - b)*G。在计算该部分时,使用了有符号数字多梳状算法,该算法中定义的comb(s, P)函数是一个关键内容:
comb(s, P) = sum( (2*s[i] - 1) * 2^i * P ) for i=0..COMB_BITS-1
其中,s[i]是标量s的第i个比特值(0或1),P是椭圆曲线上点。公式中2*s[i] - 1是一个巧妙的转换:如果s[i] = 1,则2*1 - 1;如果s[i] = 0,则2*0 - 1 = -1,所以上式中对于每个比特位i,不是加上2^i*P就是减去2^i*P,这就是"有符号数字"的含义(系数是+1或-1),可以进一步将该式简化称一个更熟悉的形式:
comb(s, P) = sum( (2*s[i] - 1) * 2^i * P ) (1)
= [ sum( 2*s[i]*2^i ) - sum( 2^i ) ] * P
= [ 2 * sum(s[i]*2^i) - (2^COMB_BITS - 1) ] * P
= [ 2*s - (2^COMB_BITS - 1) ] * P
1. 将盲化与梳状算法结合
最直接的想法是在计算(k - b)*G时,让其等于comb(s, G),即有:
(k - b) * G = comb(s, G) = [ 2*s - (2^COMB_BITS - 1) ] * G
由此可知只要解出相应s,即可通过调用(1)公式计算出(k-b)*G,s可以通过以下公式进行求解:
s = (k - b + (2^COMB_BITS - 1)) / 2 (mod order) (2)
这个公式需要对2进行模逆运算(相当于除法),在模运算中,除法是复杂且耗时的操作,需要避免。
2. 避免模除2
为了避免昂贵的模除2操作,这里采用了一个巧妙地优化,将公式中基点G除于2,不再计算comb(s, G),而是计算comb(d, G/2),仍利用最一开始的公式,则有:
comb(d, G/2) = sum( (2*d[i] - 1) * 2^i * (G/2) ) (3)
= [ 2*d - (2^COMB_BITS - 1) ] * (G/2)
= [ d - (2^COMB_BITS - 1)/2 ] * G
现在,令comb(d, G/2) = (k - b) * G,则有:
(k - b) * G = [ d - (2^COMB_BITS - 1)/2 ] * G
得到:
k - b = d - (2^COMB_BITS - 1)/2
最终解出d:
d = k - b + (2^COMB_BITS - 1)/2 (mod order) (4)
这里(2^COMB_BITS - 1)/2 是一个常量,可以预先计算。现在,计算d只需要一次模加法和一次模减法,完全避免了模除运算。在后续的源码实现中定义偏移量scalar_offset=(2^COMB_BITS - 1)/2 - b (mod order),则有d=k+scalar_offset,最终kG=(k-b)*G+b*G=comb(d, G/2)+ge_offset。
3. 梳状含义
在梳状算法中会将标量k拆分成多个位段,每个位段称之为块(Block),每个块中又有T个齿(Teeth),齿与齿之间又有S个间距(Spacing)。以源码中典型取值为例:
#define COMB_BLOCKS 11
#define COMB_TEETH 6
#define COMB_SPACING 4
表明会将标量k分成11块,每个块有6个齿,齿间距为4,则每个Block中包含24bits数据,梳子每次可以选中两两间隔为4的6bits数据,在这样的结构下11个块可以包含264bits数据,完全覆盖256bits的标量k(标量k还需进行数据位填充)。

定义mask(b) = sum(2^((b*COMB_TEETH + t)*COMB_SPACING) for t=0..COMB_TEETH-1),则当b取值从0到COMB_BLOCKS-1时,有以下对应关系:
mask(0) = 2^0 + 2^4 + 2^8 + 2^12 + 2^16 + 2^20,
mask(1) = 2^24 + 2^28 + 2^32 + 2^36 + 2^40 + 2^44,
mask(2) = 2^48 + 2^52 + 2^56 + 2^60 + 2^64 + 2^68,
...
mask(10) = 2^240 + 2^244 + 2^248 + 2^252 + 2^256 + 2^260
由此可通过这些掩码拆分比特位d[i],具体来说,每个掩码会被使用COMB_SPACING次,且每次使用时的偏移量不同。
d = (d & mask(0)<<0) + (d & mask(1)<<0) + ... + (d & mask(COMB_BLOCKS-1)<<0) +
(d & mask(0)<<1) + (d & mask(1)<<1) + ... + (d & mask(COMB_BLOCKS-1)<<1) +
...
(d & mask(0)<<(COMB_SPACING-1)) + ...
接下来定义:
table(b, m) = (m - mask(b)/2) * G (5)
这里b=0..COMB_BLOCKS-1,m=(d & mask(b)),m是标量d在b块中对应的值,每个块对应的m可以有2^COMB_TEETH个不同取值,可以预计算m对应的点乘值m*G。
table(b, m) = (m - mask(b)/2)*G = ((d&mask(b)) - mask(b)/2)*G = (2(d&mask(b)) - mask(b))*G/2 = ((2d - 1)&mask(b))*G/2 = sum(2^i * (2*d[i] - 1) * G/2) 这里i遍历掩码mask(b)中的置位比特位,即最终有:
table(b, m) = sum(2^i * (2*d[i] - 1) * G/2) (6)
例如当m=2^48 + 2^56 + 2^68时(位于块2中的值),则有table(2, m) = (2^48 - 2^52 + 2^56 - 2^60 - 2^64 + 2^68) * G/2。
结合以上定义,可以重写comb(d, G/2):
comb(d, G/2) = 2^0 * (table(0, d>>0 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>0 & mask(COMP_BLOCKS-1))) (7)
+ 2^1 * (table(0, d>>1 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>1 & mask(COMP_BLOCKS-1)))
+ 2^2 * (table(0, d>>2 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>2 & mask(COMP_BLOCKS-1)))
+ ...
+ 2^(COMB_SPACING-1) * (table(0, d>>(COMB_SPACING-1) & mask(0)) + ...)
即:
sum(2^i * sum(table(b, d>>i & mask(b)), b=0..COMB_BLOCKS-1), i=0..COMB_SPACING-1)
以下是计算过程伪代码:
c = infinity
for comb_off in range(COMB_SPACING - 1, -1, -1):
for block in range(COMB_BLOCKS):
c += table(block, (d >> comb_off) & mask(block))
if comb_off > 0:
c = 2*c
return c
在以上伪代码中,梳子是从高位"梳到"低位的,即依次先处理每个块的高位,再依次处理每个块的低位。
4. 折半表
之前已经有结论每个块对应的m可以有2^COMB_TEETH个不同取值,所以对于table(b, m)来说需要有2^COMB_TEETH个表项才能覆盖该b块所有可能取值,但实际上只需一半的表项即可覆盖所有可能取值,即table(b, m)实际只包含2^(COMB_TEETH-1)个表项,以下是分析过程:
由m=(d & mask(b)),令m'是m所有位反转对应的值,则有m'=m XOR mask(b),进而有table(b, m') = table(b, m XOR mask(b)) = table(b, mask(b) - m) = (mask(b) - m - mask(b)/2)*G = -(m - mask(b)/2)*G = -table(b, m)。
即如果m'对应是m的所有位反转,那么m'对应的table表项即为m相应表项的负值,对于任意椭圆点P = (x, y),其负值很容易求得即-P = (x, -y),所以table(b, m)表只需保存一半的表项,另一半的位反转表项,只需对相应的表项取负即可。对于0块来说,当梳齿对应块中最低位时,有以下对应关系:

例如由m=2^0 + 2^8 + 2^16时的表项(对应点P=(x, y)),可以直接求出m'=2^4 + 2^12 + 2^20的表项(对应点-P = (x, -y)),只需要对m值对应的表项中y值取负即可得到m'值对应的表项。
2 源码详解
2.1 预计算表
函数secp256k1_ecmult_gen_compute_table用于产生第1节中的table(b, m)表项,函数源码如下:
1 static void secp256k1_ecmult_gen_compute_table(secp256k1_ge_storage* table, const secp256k1_ge* gen, int blocks, int teeth, int spacing) {
2 size_t points = ((size_t)1) << (teeth - 1); // 每个块的预计算点数 = 2^(teeth-1)
3 size_t points_total = points * blocks; // 总共预计算的点数 = 每个块的预计算点数 * 块数
4 secp256k1_ge* prec = checked_malloc(&default_error_callback, points_total * sizeof(*prec)); // 长度为总预计算点数的仿射点数组
5 secp256k1_gej* ds = checked_malloc(&default_error_callback, teeth * sizeof(*ds)); // 每块的"tooth"的雅可比坐标点数组
6 secp256k1_gej* vs = checked_malloc(&default_error_callback, points_total * sizeof(*vs)); // 长度为总预计算点数的雅可比坐标点数组
7 secp256k1_gej u;
8 size_t vs_pos = 0;
9 secp256k1_scalar half;
10 secp256k1_ge halfgenAffine;
11 secp256k1_ge_storage halfgenStorage;
12 int block, i;
13
14 VERIFY_CHECK(points_total > 0);
15
16 /* u is the running power of two times gen we're working with, initially gen/2. */
17 secp256k1_scalar_half(&half, &secp256k1_scalar_one);
18 //print_scalar(&half, "0x", "\n");
19 secp256k1_gej_set_infinity(&u);
20 for (i = 255; i >= 0; --i) {
21 /* Use a very simple multiplication ladder to avoid dependency on ecmult. */
22 secp256k1_gej_double_var(&u, &u, NULL);
23 if (secp256k1_scalar_get_bits_limb32(&half, i, 1)) {
24 secp256k1_gej_add_ge_var(&u, &u, gen, NULL);
25 }
26 }
27
28 secp256k1_ge_set_gej(&halfgenAffine, &u);
29 secp256k1_ge_to_storage(&halfgenStorage, &halfgenAffine);
30 //print_storage(&halfgenStorage.x, "0x", "\n");
31 //print_storage(&halfgenStorage.y, "0x", "\n");
32 #ifdef VERIFY
33 {
34 /* Verify that u*2 = gen. */
35 secp256k1_gej double_u;
36 secp256k1_gej_double_var(&double_u, &u, NULL);
37 VERIFY_CHECK(secp256k1_gej_eq_ge_var(&double_u, gen));
38 }
39 #endif
40
41 for (block = 0; block < blocks; ++block) {
42 int tooth;
43 /* Here u = 2^(block*teeth*spacing) * gen/2. */
44 secp256k1_gej sum;
45 secp256k1_gej_set_infinity(&sum);
46 for (tooth = 0; tooth < teeth; ++tooth) {
47 /* Here u = 2^((block*teeth + tooth)*spacing) * gen/2. */
48 /* Make sum = sum(2^((block*teeth + t)*spacing), t=0..tooth) * gen/2. */
49 secp256k1_gej_add_var(&sum, &sum, &u, NULL);
50 /* Make u = 2^((block*teeth + tooth)*spacing + 1) * gen/2. */
51 secp256k1_gej_double_var(&u, &u, NULL);
52 /* Make ds[tooth] = u = 2^((block*teeth + tooth)*spacing + 1) * gen/2. */
53 ds[tooth] = u;
54 /* Make u = 2^((block*teeth + tooth + 1)*spacing) * gen/2, unless at the end. */
55 if (block + tooth != blocks + teeth - 2) {
56 int bit_off;
57 for (bit_off = 1; bit_off < spacing; ++bit_off) {
58 secp256k1_gej_double_var(&u, &u, NULL);
59 }
60 }
61 }
62 /* Now u = 2^((block*teeth + teeth)*spacing) * gen/2
63 * = 2^((block+1)*teeth*spacing) * gen/2 */
64
65 /* Next, compute the table entries for block number block in Jacobian coordinates.
66 * The entries will occupy vs[block*points + i] for i=0..points-1.
67 * We start by computing the first (i=0) value corresponding to all summed
68 * powers of two times G being negative. */
69 secp256k1_gej_neg(&vs[vs_pos++], &sum);
70 /*secp256k1_ge_set_gej(&halfgenAffine, &vs[vs_pos - 1]);
71 secp256k1_ge_to_storage(&halfgenStorage, &halfgenAffine);
72 print_storage(&halfgenStorage.x, "0x", "\n");
73 print_storage(&halfgenStorage.y, "0x", "\n");*/
74 /* And then teeth-1 times "double" the range of i values for which the table
75 * is computed: in each iteration, double the table by taking an existing
76 * table entry and adding ds[tooth]. */
77 for (tooth = 0; tooth < teeth - 1; ++tooth) {
78 size_t stride = ((size_t)1) << tooth;
79 size_t index;
80 for (index = 0; index < stride; ++index, ++vs_pos) {
81 secp256k1_gej_add_var(&vs[vs_pos], &vs[vs_pos - stride], &ds[tooth], NULL);
82 }
83 }
84 }
85 VERIFY_CHECK(vs_pos == points_total);
86
87 /* Convert all points simultaneously from secp256k1_gej to secp256k1_ge. */
88 secp256k1_ge_set_all_gej_var(prec, vs, points_total);
89 /* Convert all points from secp256k1_ge to secp256k1_ge_storage output. */
90 for (block = 0; block < blocks; ++block) {
91 size_t index;
92 for (index = 0; index < points; ++index) {
93 VERIFY_CHECK(!secp256k1_ge_is_infinity(&prec[block * points + index]));
94 secp256k1_ge_to_storage(&table[block * points + index], &prec[block * points + index]);
95 print_storage(&table[block * points + index].x, "0x", "\n");
96 print_storage(&table[block * points + index].y, "0x", "\n");
97 }
98 }
99
100 /* Free memory. */
101 free(vs);
102 free(ds);
103 free(prec);
104 }
正如之前分析代码中第2行给出了每个块儿需要进行预计算点的个数(即table(b, m)在固定b时的表项数),COMB_TEETH=6时,表项数是2^5=32个;第3行为b取值从0到COMB_BLOCKS-1时完整表中点的总数,以COMB_BLOCKS=11为例,点的总数是32*11=352;接下来第4~6行为对应的表项分配内存空间,其中ds用于存储在每个块中梳齿对应的"权重",所以其大小是梳齿个数6,以0块儿为例,其取值为2*G/2,2*2^4*G/2,2*2^8*G/2,2*2^12*G/2,2*2^16*G/2,2*2^20*G/2,后续给出详细求解过程;第17~26行先求得标量1/2 mod order,再用通用的倍点加法求射影点u=G/2;第28,29行分别获取点u对应的仿射坐标及存储坐标值;第33~38行验证部分是检查2*u是否等于生成点G。
接下来,第11行处的for循环用于对11个块儿,依次求块对应的table(b, m)内容;之后第44,45行定义临时变量sum,并将其初始化为零点,第46行的for循环用于求之前所说的ds,以及公式(6)中减数部分的累加和sum;具体来说,第47,48行给出了后续u和sum的具体计算公式;接下来第49行将上轮更新过的u加到sum上(两个for循环都为第一次时,u值为G/2,加完以后sum=G/2);之后第51行将u翻倍,例如在tooth=0时,u=2*G/2;第53~60行,先将u赋值给ds[tooth],然后再将u翻SPACING-1倍,算上第51行翻倍u相当于共翻倍SPACING次(当SPACING=4时对应2^4),另外55行的if判断用于在最后一个块最后一次时,由于不再使用这时已无需对u再进行SPACING-1次翻倍操作,for循环完毕已经依次求出之前所说的梳齿"权重"------ds,以及公式(6)中减数部分的累加和------sum。
第69行将累加和取负(因为sum对应公式中的减数部分)后,赋值给当前块中的第一个表项vs[pos],对应所有梳齿位都位0时表项值,接下来,第77行for循环分别计算梳齿0,1,...,teeth-2齿位对应是1时的table的表项值;第78行用于计算对应梳齿位为1时表项的个数,如第0个梳齿位为1时,只能产生2^0=1个表项000001,第1个梳齿位为1时,可以产生2^1=2个表项000010和000011,第2个梳齿位为1时,可以产生2^2=4个表项000100,000101,000110,000111,依次类推第teeth-2=4齿位对应是1时,可以产生2^4=16个表项;之后第80~82行的for循环依次产生之前折半表部分图中前半部分对应的表项值,总个数是1+1+2+4+..+16=32。
代码接下来的内容主要是坐标转换相关内容,这里不再进行详细解释,总之循环执行完毕会产生所有11个块对用的查找表。
2.2 计算k*G
接下来需要看如何利用上一小节产生的查找表进行k*G计算,仍旧先给出计算函数secp256k1_ecmult_gen源码:
1 static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp256k1_gej *r, const secp256k1_scalar *gn) {
2 uint32_t comb_off;
3 secp256k1_ge add;
4 secp256k1_fe neg;
5 secp256k1_ge_storage adds;
6 secp256k1_scalar d;
7 /* Array of uint32_t values large enough to store COMB_BITS bits. Only the bottom
8 * 8 are ever nonzero, but having the zero padding at the end if COMB_BITS>256
9 * avoids the need to deal with out-of-bounds reads from a scalar. */
10 uint32_t recoded[(COMB_BITS + 31) >> 5] = {0};
11 int first = 1, i;
12
13 memset(&adds, 0, sizeof(adds));
14
15 /* We want to compute R = gn*G.
16 *
17 * To blind the scalar used in the computation, we rewrite this to be
18 * R = (gn - b)*G + b*G, with a blinding value b determined by the context.
19 *
20 * The multiplication (gn-b)*G will be performed using a signed-digit multi-comb (see Section
21 * 3.3 of "Fast and compact elliptic-curve cryptography" by Mike Hamburg,
22 * https://eprint.iacr.org/2012/309).
23 *
24 * Let comb(s, P) = sum((2*s[i]-1)*2^i*P for i=0..COMB_BITS-1), where s[i] is the i'th bit of
25 * the binary representation of scalar s. So the s[i] values determine whether -2^i*P (s[i]=0)
26 * or +2^i*P (s[i]=1) are added together. COMB_BITS is at least 256, so all bits of s are
27 * covered. By manipulating:
28 *
29 * comb(s, P) = sum((2*s[i]-1)*2^i*P for i=0..COMB_BITS-1)
30 * <=> comb(s, P) = sum((2*s[i]-1)*2^i for i=0..COMB_BITS-1) * P
31 * <=> comb(s, P) = (2*sum(s[i]*2^i for i=0..COMB_BITS-1) - sum(2^i for i=0..COMB_BITS-1)) * P
32 * <=> comb(s, P) = (2*s - (2^COMB_BITS - 1)) * P
33 *
34 * If we wanted to compute (gn-b)*G as comb(s, G), it would need to hold that
35 *
36 * (gn - b) * G = (2*s - (2^COMB_BITS - 1)) * G
37 * <=> s = (gn - b + (2^COMB_BITS - 1))/2 (mod order)
38 *
39 * We use an alternative here that avoids the modular division by two: instead we compute
40 * (gn-b)*G as comb(d, G/2). For that to hold it must be the case that
41 *
42 * (gn - b) * G = (2*d - (2^COMB_BITS - 1)) * (G/2)
43 * <=> d = gn - b + (2^COMB_BITS - 1)/2 (mod order)
44 *
45 * Adding precomputation, our final equations become:
46 *
47 * ctx->scalar_offset = (2^COMB_BITS - 1)/2 - b (mod order)
48 * ctx->ge_offset = b*G
49 * d = gn + ctx->scalar_offset (mod order)
50 * R = comb(d, G/2) + ctx->ge_offset
51 *
52 * comb(d, G/2) function is then computed by summing + or - 2^(i-1)*G, for i=0..COMB_BITS-1,
53 * depending on the value of the bits d[i] of the binary representation of scalar d.
54 */
55
56 /* Compute the scalar d = (gn + ctx->scalar_offset). */
57 secp256k1_scalar_add(&d, &ctx->scalar_offset, gn);
58 /* Convert to recoded array. */
59 for (i = 0; i < 8 && i < ((COMB_BITS + 31) >> 5); ++i) {
60 recoded[i] = secp256k1_scalar_get_bits_limb32(&d, 32 * i, 32);
61 }
62 secp256k1_scalar_clear(&d);
63
64 /* In secp256k1_ecmult_gen_prec_table we have precomputed sums of the
65 * (2*d[i]-1) * 2^(i-1) * G points, for various combinations of i positions.
66 * We rewrite our equation in terms of these table entries.
67 *
68 * Let mask(b) = sum(2^((b*COMB_TEETH + t)*COMB_SPACING) for t=0..COMB_TEETH-1),
69 * with b ranging from 0 to COMB_BLOCKS-1. So for example with COMB_BLOCKS=11,
70 * COMB_TEETH=6, COMB_SPACING=4, we would have:
71 * mask(0) = 2^0 + 2^4 + 2^8 + 2^12 + 2^16 + 2^20,
72 * mask(1) = 2^24 + 2^28 + 2^32 + 2^36 + 2^40 + 2^44,
73 * mask(2) = 2^48 + 2^52 + 2^56 + 2^60 + 2^64 + 2^68,
74 * ...
75 * mask(10) = 2^240 + 2^244 + 2^248 + 2^252 + 2^256 + 2^260
76 *
77 * We will split up the bits d[i] using these masks. Specifically, each mask is
78 * used COMB_SPACING times, with different shifts:
79 *
80 * d = (d & mask(0)<<0) + (d & mask(1)<<0) + ... + (d & mask(COMB_BLOCKS-1)<<0) +
81 * (d & mask(0)<<1) + (d & mask(1)<<1) + ... + (d & mask(COMB_BLOCKS-1)<<1) +
82 * ...
83 * (d & mask(0)<<(COMB_SPACING-1)) + ...
84 *
85 * Now define table(b, m) = (m - mask(b)/2) * G, and we will precompute these values for
86 * b=0..COMB_BLOCKS-1, and for all values m which (d & mask(b)) can take (so m can take on
87 * 2^COMB_TEETH distinct values).
88 *
89 * If m=(d & mask(b)), then table(b, m) is the sum of 2^i * (2*d[i]-1) * G/2, with i
90 * iterating over the set bits in mask(b). In our example, table(2, 2^48 + 2^56 + 2^68)
91 * would equal (2^48 - 2^52 + 2^56 - 2^60 - 2^64 + 2^68) * G/2.
92 *
93 * With that, we can rewrite comb(d, G/2) as:
94 *
95 * 2^0 * (table(0, d>>0 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>0 & mask(COMP_BLOCKS-1)))
96 * + 2^1 * (table(0, d>>1 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>1 & mask(COMP_BLOCKS-1)))
97 * + 2^2 * (table(0, d>>2 & mask(0)) + ... + table(COMB_BLOCKS-1, d>>2 & mask(COMP_BLOCKS-1)))
98 * + ...
99 * + 2^(COMB_SPACING-1) * (table(0, d>>(COMB_SPACING-1) & mask(0)) + ...)
100 *
101 * Or more generically as
102 *
103 * sum(2^i * sum(table(b, d>>i & mask(b)), b=0..COMB_BLOCKS-1), i=0..COMB_SPACING-1)
104 *
105 * This is implemented using an outer loop that runs in reverse order over the lines of this
106 * equation, which in each iteration runs an inner loop that adds the terms of that line and
107 * then doubles the result before proceeding to the next line.
108 *
109 * In pseudocode:
110 * c = infinity
111 * for comb_off in range(COMB_SPACING - 1, -1, -1):
112 * for block in range(COMB_BLOCKS):
113 * c += table(block, (d >> comb_off) & mask(block))
114 * if comb_off > 0:
115 * c = 2*c
116 * return c
117 *
118 * This computes c = comb(d, G/2), and thus finally R = c + ctx->ge_offset. Note that it would
119 * be possible to apply an initial offset instead of a final offset (moving ge_offset to take
120 * the place of infinity above), but the chosen approach allows using (in a future improvement)
121 * an incomplete addition formula for most of the multiplication.
122 *
123 * The last question is how to implement the table(b, m) function. For any value of b,
124 * m=(d & mask(b)) can only take on at most 2^COMB_TEETH possible values (the last one may have
125 * fewer as there mask(b) may exceed the curve order). So we could create COMB_BLOCK tables
126 * which contain a value for each such m value.
127 *
128 * Now note that if m=(d & mask(b)), then flipping the relevant bits of m results in negating
129 * the result of table(b, m). This is because table(b,m XOR mask(b)) = table(b, mask(b) - m) =
130 * (mask(b) - m - mask(b)/2)*G = (-m + mask(b)/2)*G = -(m - mask(b)/2)*G = -table(b, m).
131 * Because of this it suffices to only store the first half of the m values for every b. If an
132 * entry from the second half is needed, we look up its bit-flipped version instead, and negate
133 * it.
134 *
135 * secp256k1_ecmult_gen_prec_table[b][index] stores the table(b, m) entries. Index
136 * is the relevant mask(b) bits of m packed together without gaps. */
137
138 /* Outer loop: iterate over comb_off from COMB_SPACING - 1 down to 0. */
139 comb_off = COMB_SPACING - 1;
140 while (1) {
141 uint32_t block;
142 uint32_t bit_pos = comb_off;
143 /* Inner loop: for each block, add table entries to the result. */
144 for (block = 0; block < COMB_BLOCKS; ++block) {
145 /* Gather the mask(block)-selected bits of d into bits. They're packed:
146 * bits[tooth] = d[(block*COMB_TEETH + tooth)*COMB_SPACING + comb_off]. */
147 uint32_t bits = 0, sign, abs, index, tooth;
148 /* Instead of reading individual bits here to construct the bits variable,
149 * build up the result by xoring rotated reads together. In every iteration,
150 * one additional bit is made correct, starting at the bottom. The bits
151 * above that contain junk. This reduces leakage by avoiding computations
152 * on variables that can have only a low number of possible values (e.g.,
153 * just two values when reading a single bit into a variable.) See:
154 * https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-alam.pdf
155 */
156 for (tooth = 0; tooth < COMB_TEETH; ++tooth) {
157 /* Construct bitdata s.t. the bottom bit is the bit we'd like to read.
158 *
159 * We could just set bitdata = recoded[bit_pos >> 5] >> (bit_pos & 0x1f)
160 * but this would simply discard the bits that fall off at the bottom,
161 * and thus, for example, bitdata could still have only two values if we
162 * happen to shift by exactly 31 positions. We use a rotation instead,
163 * which ensures that bitdata doesn't lose entropy. This relies on the
164 * rotation being atomic, i.e., the compiler emitting an actual rot
165 * instruction. */
166 uint32_t bitdata = secp256k1_rotr32(recoded[bit_pos >> 5], bit_pos & 0x1f);
167
168 /* Clear the bit at position tooth, but sssh, don't tell clang. */
169 uint32_t volatile vmask = ~(1 << tooth);
170 bits &= vmask;
171
172 /* Write the bit into position tooth (and junk into higher bits). */
173 bits ^= bitdata << tooth;
174 bit_pos += COMB_SPACING;
175 }
176
177 /* If the top bit of bits is 1, flip them all (corresponding to looking up
178 * the negated table value), and remember to negate the result in sign. */
179 sign = (bits >> (COMB_TEETH - 1)) & 1;
180 abs = (bits ^ -sign) & (COMB_POINTS - 1);
181 VERIFY_CHECK(sign == 0 || sign == 1);
182 VERIFY_CHECK(abs < COMB_POINTS);
183
184 /** This uses a conditional move to avoid any secret data in array indexes.
185 * _Any_ use of secret indexes has been demonstrated to result in timing
186 * sidechannels, even when the cache-line access patterns are uniform.
187 * See also:
188 * "A word of warning", CHES 2013 Rump Session, by Daniel J. Bernstein and Peter Schwabe
189 * (https://cryptojedi.org/peter/data/chesrump-20130822.pdf) and
190 * "Cache Attacks and Countermeasures: the Case of AES", RSA 2006,
191 * by Dag Arne Osvik, Adi Shamir, and Eran Tromer
192 * (https://eprint.iacr.org/2005/271.pdf)
193 */
194 for (index = 0; index < COMB_POINTS; ++index) {
195 secp256k1_ge_storage_cmov(&adds, &secp256k1_ecmult_gen_prec_table[block][index], index == abs);
196 }
197
198 /* Set add=adds or add=-adds, in constant time, based on sign. */
199 secp256k1_ge_from_storage(&add, &adds);
200 secp256k1_fe_negate(&neg, &add.y, 1);
201 secp256k1_fe_cmov(&add.y, &neg, sign);
202
203 /* Add the looked up and conditionally negated value to r. */
204 if (EXPECT(first, 0)) {
205 /* If this is the first table lookup, we can skip addition. */
206 secp256k1_gej_set_ge(r, &add);
207 /* Give the entry a random Z coordinate to blind intermediary results. */
208 secp256k1_gej_rescale(r, &ctx->proj_blind);
209 first = 0;
210 } else {
211 secp256k1_gej_add_ge(r, r, &add);
212 }
213 }
214
215 /* Double the result, except in the last iteration. */
216 if (comb_off-- == 0) break;
217 secp256k1_gej_double(r, r);
218 }
219
220 /* Correct for the scalar_offset added at the start (ge_offset = b*G, while b was
221 * subtracted from the input scalar gn). */
222 secp256k1_gej_add_ge(r, r, &ctx->ge_offset);
223
224 /* Cleanup. */
225 secp256k1_fe_clear(&neg);
226 secp256k1_ge_clear(&add);
227 secp256k1_memclear_explicit(&adds, sizeof(adds));
228 secp256k1_memclear_explicit(&recoded, sizeof(recoded));
229 }
源码第10给出一个位重编码变量recoded,用于存放标量gn(也即之前说的k)的二级制位,仍以之前的梳状结构为例,则COMB_BITS=11*6*4=264bits,则recoded长度为9,这时9*(1<<5)=288>264可以完全存储相应的重编码位;源码中的注释对应上一节的解析内容,接下来不再详细进行说明;第57行应用公式(4)对d进行求解;第59~62行从256bits的标量d中依次取出32bits并放入到recoded中,在32位实现下这里相当于把d中数据直接拷贝到recoded。
接下来是函数实现主体部分,首先第139行将comb_off初始化为COMB_SPACING-1,即从梳子从块儿的高位开始;之后140行处的while会依次处理comb_off位对应的各个梳齿取值,直到comb_off移动至0位;第142行将bit_pos初始化为com_off,梳齿对应第0块的高位;随后第144行用for循环依次处理COMB_BLOCKS个块儿;第156行使用for循环对于当前块儿的COMB_TEETH个梳齿位依次进行处理;第166行将包含bit_pos位的recoded[]进行循环右移,使得结果bitdata的最低位正好是我们需要的bit_pos位;第169行是tooth齿对应的掩码vmask,在掩码中tooth位为0其他位为1;第170行通过与掩码vmask做且操作清除目标位;接下来第173行将刚刚提取的位(bitdata最低位)写入到bits的第tooth个位置,bitdata<< tooth将包含所需位的32位数通过左移到将最低位移动到正确的位置------tooth,然后通过异或来将tooth值写入到bits(上一步已将bits的tooth位清零,零与其他值异或保留其他值);随后第174行将bit_pos移动到下一个梳齿位。可以看出循环结束后bits会在低COMB_TEETH位依次记录下当前块的梳齿位对应的值(bits低COMB_TEETH位以外值为无效数据),即将块儿中相应值"梳"出来。
之后,第179行将"梳"出来的bits最高位做为符号位;第180行取得除符号外,其他位实际应该的取值abs,其中-sign为mask,sign=0时,对应mask=0x00000000,sign=1时,对应mask=0xFFFFFFFF,即sign=0时取bits值(与0异或取原址),sign=1时bits值每个位都取反(与1异或取反值),之后通过&(COMB_POINT-1)获取bits mask后的低5bit值;第194~196行处的for循环取值abs对应的table表项值;第199~201行根据符号位确定是否进行取负操作;第204~212行根据是否为第一次将相应的表项值赋值或者附加到r上,在第一次时还可以会用proj_blind对r初始值进行盲化操作:

由图中解释可知经过盲化操作后,点r'虽然各个坐标分量的取值都已经都已经改变,但是由射影坐标的定义可知,其实r'和r对应同样的仿射坐标,即它们表示同一个点。
当144行的for循环执行完毕,表示当前comb_off梳齿位对应的所有块儿已经处理完毕,接下来第216行将梳齿向低位移位继续进行下一个梳齿位对应块的处理,直到COMB_SPACING=4个梳齿位都处理完毕则跳出while循环;另外因为梳齿位是从高位到低位进行的,所以第217行对r进行了翻倍操作,这相当于公式(7)中每行内的移位操作。
整个while循环执行完毕后得到的值r=(gn - b)*G,最终gn*G=(gn - b)*G + b*G,所以在222行又在r上加上了ge_offset=b*G,最终得到gn*G,求解结束。
2.3 补充说明
从上面的分析可知分块数COMB_BLOCKS和梳齿数COMB_TEETH共同决定了查找表的大小,其中分块数对查找表是线性影响,而梳齿数对查找表是指数级影响(影响更大),所以在内存受限或者优先考虑内存时,需要对这两个参数的取值进行斟酌考虑。另外COMB_BLOCKS决定了查表访存次数和点加操作次数,而COMB_SPACING决定了倍点操作次数,在进行CUDA编程时都需要根据实际情况,斟酌确定相关参数的取值。