Background

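Click-through rate (CTR) prediction is one of the core tasks in advertising and recommender systems. In an advertising system, when a user issues an ad request, the system recalls a number of relevant candidate ads from the full ad inventory, predicts the CTR of each candidate, ranks the candidates with a ranking formula (for example, predicted CTR multiplied by bid as the ranking score), and finally shows the top-ranked candidates to the user as the winning ads.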

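In the early days, logistic regression was widely used for CTR prediction because it is interpretable and easy to implement in production. As deep learning matured, it was gradually adopted for CTR prediction, and model architectures have kept evolving. One line of work considers not only the context of a single ad request and the features of the candidate ad, but also features of the user's historical behavior, and uses deep learning to model user interest. This article collects my reading notes on that line of work; corrections are welcome.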

DIN

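In 2018, Alimama published the paper "Deep Interest Network for Click-Through Rate Prediction", which proposed DIN. DIN uses the embedding vectors of the items in a user's historical behavior as the representation of user interest. Users differ in history length and in the number of items involved, so a variable number of item embeddings must be pooled into a single embedding vector to satisfy the fixed input dimension required by a multilayer neural network. Plain pooling, however, limits how well user interest can be expressed. User interest is also diverse: a male user may have bought both a phone and running shoes, and if the candidate ad is a tablet, the interest reflected by the phone purchase is more relevant to the candidate and should receive more weight. Following this principle, DIN introduces an attention mechanism: it computes an attention score between each historical item embedding and the candidate ad, and uses these scores for weighted sum pooling of the item embeddings. This extracts the user interest relevant to the candidate ad while still producing a fixed-dimension input.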

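The paper first introduces the baseline model. It takes the embedding vectors of the items in the user's historical behavior as the representation of user interest, pools them into a single embedding vector, concatenates that vector with the embedding vectors of the other features, feeds the result into a multilayer neural network, and outputs the predicted CTR through a softmax function. The paper calls this the "Embedding & MLP" model. The baseline model is shown on the left of Figure 1.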

The baseline model uses four groups of features: user profile, user behavior, candidate ad, and context. These features are encoded as high-dimensional sparse binary features. Let the $i$-th feature be $t_i \in \mathbb{R}^{K_i}$, where $K_i$ is its dimension, that is, the cardinality of the feature values; $t_i[j]$ is the $j$-th element of $t_i$, with $t_i[j] \in \{0, 1\}$ and $\sum_{j=1}^{K_i}{t_i[j]}=k$. If $k=1$, $t_i$ is a one-hot encoding; if $k>1$, $t_i$ is a multi-hot encoding. A sample can be written as $x=[t_1^T, t_2^T, \dots, t_M^T]^T$, where $M$ is the number of features, $\sum_{i=1}^M{K_i}=K$, and $K$ is the dimension of the whole feature space. For example, the sample:
$$[\text{weekday}=\text{Friday},\ \text{gender}=\text{Female},\ \text{visited\_cate\_ids}=\{\text{Bag},\text{Book}\},\ \text{ad\_cate\_id}=\text{Book}]$$

can be encoded as:
$$\underbrace{[0,0,0,0,1,0,0]}_{\text{weekday}=\text{Friday}}\ \underbrace{[0,1]}_{\text{gender}=\text{Female}}\ \underbrace{[0,\dots,1,\dots,1,\dots,0]}_{\text{visited\_cate\_ids}=\{\text{Bag},\text{Book}\}}\ \underbrace{[0,\dots,1,\dots,0]}_{\text{ad\_cate\_id}=\text{Book}}$$

The embedding layer converts the high-dimensional sparse binary features into low-dimensional dense representations. For feature $t_i$, let $W^i=[w_1^i, \dots, w_j^i, \dots, w_{K_i}^i] \in \mathbb{R}^{D \times K_i}$ be the embedding dictionary of $t_i$, where $w_j^i \in \mathbb{R}^D$ is a $D$-dimensional embedding vector. The embedding computation is simply a table lookup:

If $t_i$ is a one-hot encoding with $t_i[j]=1$, the embedding of $t_i$ is the single vector $e_i=w_j^i$;
If $t_i$ is a multi-hot encoding with $t_i[j]=1$ for $j \in \{i_1, i_2, \dots, i_k\}$, the embedding of $t_i$ is the group of vectors $\{e_{i_1}, e_{i_2}, \dots, e_{i_k}\}=\{w_{i_1}^i, w_{i_2}^i, \dots, w_{i_k}^i\}$.

For example, if $t_i$ represents the ids of items the user has browsed, then $t_i$ is a multi-hot encoding, $t_i[j]=1$ means that item $j$ was browsed by the user, and $\{e_{i_1}, e_{i_2}, \dots, e_{i_k}\}$ are the embedding vectors of items $\{i_1, i_2, \dots, i_k\}$. Different users browse different items, so the embedding group of a multi-hot feature $t_i$ has variable length across samples, while a fully connected network only accepts fixed-length input. A pooling layer is therefore needed to turn the embedding group into a fixed-length vector:
$$e_i = \text{pooling}(e_{i_1}, e_{i_2}, \dots, e_{i_k})$$
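The table lookup and pooling steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the 4-by-2 table `W` and the helper names `lookup`, `sum_pooling`, and `avg_pooling` are hypothetical and not taken from the paper, and the table is stored row-wise (row $j$ holds $w_j$), which is the transpose of the $D \times K_i$ layout used in the text.

```python
# Hypothetical toy embedding dictionary: K_i = 4 feature values, D = 2.
# Row j holds the embedding vector w_j.
W = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

def lookup(encoded, table):
    """Return the embedding vectors of all positions j with encoded[j] == 1."""
    return [table[j] for j, bit in enumerate(encoded) if bit == 1]

def sum_pooling(vectors, dim):
    """Element-wise sum: a variable-length group becomes one dim-sized vector."""
    pooled = [0.0] * dim
    for vec in vectors:
        for d in range(dim):
            pooled[d] += vec[d]
    return pooled

def avg_pooling(vectors, dim):
    """Element-wise mean of the group (zeros if the group is empty)."""
    total = sum_pooling(vectors, dim)
    return [x / len(vectors) for x in total] if vectors else total

one_hot = [0, 0, 1, 0]      # k = 1: a single embedding vector
multi_hot = [1, 0, 0, 1]    # k = 2: a group of two embedding vectors

e_one = lookup(one_hot, W)
e_group = lookup(multi_hot, W)
e_sum = sum_pooling(e_group, 2)
e_avg = avg_pooling(e_group, 2)
```

Whatever the number of active positions in `multi_hot`, both pooled results have length `D`, which is the property the fully connected layers rely on.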

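Common pooling methods are sum pooling and average pooling. In the baseline model, the user behavior feature consists of the items the user has browsed, and each item is described by its item id, shop id, and category id. The embeddings of these three ids are concatenated to form the item embedding vector, and the embedding vectors of all browsed items are then sum-pooled into the embedding vector of the user behavior feature. Finally, the embedding vectors of all features are concatenated into the vector representation of the sample, which is fed into a multilayer fully connected network. Counting the input layer, the network has 4 layers. The input layer covers 16 kinds of features with 12-dimensional embeddings each, giving 192 inputs. The two middle layers have 200 and 80 units, each using PReLU as the activation function. The last layer is a softmax layer. The loss function is the negative log-likelihood:
$$L=-\frac{1}{N}\sum_{(x,y)\in\mathcal{S}}{\big(y\log p(x)+(1-y)\log(1-p(x))\big)}$$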

Here $\mathcal{S}$ is the training set of size $N$, $y$ is the sample label (whether the ad was clicked), and $p(x)$ is the predicted CTR output by the final softmax layer.
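As a small worked example of the loss above, the following sketch evaluates it on three made-up (prediction, label) pairs. The function name, the sample values, and the clipping constant `eps` are illustrative assumptions, not part of the paper.

```python
import math

def negative_log_likelihood(samples, eps=1e-12):
    """samples: list of (p, y) pairs, p = predicted CTR, y = 0/1 click label."""
    total = 0.0
    for p, y in samples:
        # Clip p away from 0 and 1 so that log() stays finite (an added safeguard).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(samples)

# Made-up predictions: two clicked samples and one unclicked sample.
loss = negative_log_likelihood([(0.9, 1), (0.2, 0), (0.6, 1)])
```

Confident correct predictions (such as 0.9 for a clicked sample) contribute little to the loss, while the hesitant 0.6 contributes the most.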

In the baseline model, the user behavior feature is obtained by sum pooling the embedding vectors of the browsed items, without considering how relevant each item is to the candidate item. DIN therefore introduces an attention mechanism, shown on the right of Figure 1. Compared with the baseline, DIN adds a local activation unit and uses weighted sum pooling to compute the user behavior feature given candidate ad $A$:
$$v_U(A)=f(v_A, e_1, e_2, \dots, e_H)=\sum_{j=1}^H{a(e_j, v_A)\,e_j}=\sum_{j=1}^H{w_j e_j}$$

Here $\{e_1, e_2, \dots, e_H\}$ is the group of embedding vectors of user $U$'s behavior feature (the embedding vectors of the browsed items), and $v_A$ is the embedding vector of candidate ad (item) $A$. In other words, each browsed item gets an activation unit that takes the browsed item embedding and the candidate item embedding as input and outputs the weight $w_j$ of that browsed item. Each browsed item embedding is scaled by its weight (a scalar multiplication, not a cross product), and the scaled vectors are then sum-pooled. Inside the activation unit, the outer product of the browsed item embedding and the candidate item embedding is computed, concatenated with the two embeddings themselves, and fed into a two-layer fully connected network. The first layer has 36 units with PReLU or Dice (described later) as the activation function, and the second layer is linear.

The fully connected network that consumes the concatenated features is similar to the baseline, but it uses the newly designed activation function Dice, which gives better results. Regarding the weights output by the activation unit, the paper points out that, unlike other attention mechanisms, the weights are not normalized to enforce $\sum_j{w_j}=1$. The benefit is that the intensity of the relevance between the browsed items and the candidate item is preserved. For example, if most of a male user's history is about digital products and only a small part is about menswear, then $v_U(A)$ for a tablet candidate should be larger in magnitude than $v_U(A)$ for a T-shirt candidate, reflecting that the user is more interested in the former than in the latter.
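The following sketch illustrates the activation unit and the unnormalized weighted sum pooling described above. It is a toy under stated assumptions: the embedding dimension is 2, the hidden layer has 3 units instead of the 36 mentioned in the text, the weights are random, and the names `activation_unit` and `din_pooling` are hypothetical.

```python
import random

def prelu(x, alpha=0.25):
    """PReLU with a fixed slope (in a real model the slope is learned)."""
    return x if x > 0 else alpha * x

def activation_unit(e_j, v_a, params):
    """Score one browsed-item embedding e_j against the candidate embedding v_a."""
    outer = [x * y for x in e_j for y in v_a]   # flattened outer product
    features = e_j + outer + v_a                # concatenation of the three parts
    hidden = [
        prelu(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    # Second layer is linear and returns a single, unnormalized weight.
    return sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]

def din_pooling(behaviors, v_a, params):
    """Weighted sum pooling; the weights are deliberately not softmax-normalized."""
    weights = [activation_unit(e_j, v_a, params) for e_j in behaviors]
    dim = len(v_a)
    pooled = [sum(w * e[d] for w, e in zip(weights, behaviors)) for d in range(dim)]
    return pooled, weights

rng = random.Random(0)
D, HIDDEN = 2, 3
IN_DIM = D + D * D + D
params = {
    "W1": [[rng.uniform(-0.5, 0.5) for _ in range(IN_DIM)] for _ in range(HIDDEN)],
    "b1": [0.0] * HIDDEN,
    "w2": [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)],
    "b2": 0.0,
}
behaviors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three browsed items
v_a = [1.0, 0.2]                                    # candidate ad
v_u, w = din_pooling(behaviors, v_a, params)
```

Because the weights are not normalized, both the direction and the magnitude of `v_u` depend on the candidate, which is the "intensity" argument made in the text.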

DIEN

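In 2019, Alimama published the paper "Deep Interest Evolution Network for Click-Through Rate Prediction", which proposed DIEN. DIN, described above, uses the embedding vectors of the items in the user's history directly as the representation of user interest, computes attention scores between those embeddings and the target item, and takes the attention-weighted sum to obtain the part of user interest relevant to the target item. DIEN builds on this. First, an interest extractor layer models the user behavior sequence with an RNN and treats the RNN hidden states as interest representations, turning the behavior sequence into a sequence of interest representations. Then an interest evolving layer computes attention scores between the interest representation at each step and the candidate ad, and combines the attention scores with an RNN to model how the interest relevant to the candidate ad evolves, yielding a representation of that interest evolution.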

The paper first introduces the features and the baseline model. The baseline is essentially the same as the DIN baseline and is not repeated here. The features are also essentially the same as in DIN and fall into four groups: user profile (gender, age, and so on), user behavior (ids of browsed items), candidate ad (ad id, shop id, and so on), and context (ad request time and so on). The one-hot encoded feature vectors of the four groups are denoted $\mathbf{x}_p$, $\mathbf{x}_b$, $\mathbf{x}_a$, $\mathbf{x}_c$, where $\mathbf{x}_b=[\mathbf{b}_1;\mathbf{b}_2;\dots;\mathbf{b}_T]\in\mathbb{R}^{K\times T}$, $T$ is the number of user behaviors, $K$ is the total number of items, $\mathbf{b}_t$ is the one-hot encoding of the item browsed in the user's $t$-th behavior, and $\mathbf{b}_t[j_t]=1$ means that the $t$-th behavior browsed the $j_t$-th item. Let the embedding dictionary of all items be $\mathbf{E}_{goods}=[\mathbf{m}_1;\mathbf{m}_2;\dots;\mathbf{m}_K]\in\mathbb{R}^{n_E\times K}$, whose $j$-th column $\mathbf{m}_j\in\mathbb{R}^{n_E}$ is the embedding vector of item $j$ with dimension $n_E$. The embedded user behavior feature is then $\mathbf{e}_b=[\mathbf{m}_{j_1};\mathbf{m}_{j_2};\dots;\mathbf{m}_{j_T}]$, and the embedded user profile, candidate ad, and context features are denoted $\mathbf{e}_p$, $\mathbf{e}_a$, $\mathbf{e}_c$. The loss function is again the negative log-likelihood:
$$L_{target}=-\frac{1}{N}\sum_{(\mathbf{x},y)\in\mathcal{D}}^{N}{\big(y\log p(\mathbf{x})+(1-y)\log(1-p(\mathbf{x}))\big)}$$

Here $\mathbf{x}=[\mathbf{x}_p,\mathbf{x}_a,\mathbf{x}_c,\mathbf{x}_b]\in\mathcal{D}$, $\mathcal{D}$ is the training set of size $N$, $y$ is the sample label (whether the ad was clicked), and $p(\mathbf{x})$ is the predicted CTR output by the final softmax layer.

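The DIEN network structure is shown in Figure 2. The embedding layer and the fully connected layers are the same as in DIN. The novel part is the modeling of user interest evolution, which has two components: the Interest Extractor Layer, which converts the user behavior sequence into an interest sequence, and the Interest Evolving Layer, which models the part of the interest evolution that is relevant to the candidate ad.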

In the interest extractor layer (the yellow part of Figure 2), the paper models the user behavior sequence with an RNN and uses the RNN hidden states as interest representations. The RNN processes the user behaviors $\mathbf{b}_t$ one by one. The hidden state $\mathbf{h}_t$ of the $t$-th behavior is computed from the embedding vector of the current behavior $\mathbf{e}_b[t]$ and the hidden state of the previous behavior $\mathbf{h}_{t-1}$:
$$\mathbf{h}_t=\mathcal{H}(\mathbf{e}_b[t],\mathbf{h}_{t-1}),\quad t\in[1,T]$$

Here $\mathcal{H}$ denotes the RNN architecture. The concrete choice is the Gated Recurrent Unit (GRU), which is designed to alleviate the vanishing and exploding gradients that repeated matrix multiplication causes during backpropagation in a conventional RNN. The GRU structure is shown in Figure 3. It introduces a reset gate and an update gate, two vectors whose elements lie in $(0,1)$, and uses them to control how much of the old state is carried into the new state.

The hidden state $\mathbf{h}_t$ of the $t$-th behavior is computed as follows. Let $\mathbf{i}_t=\mathbf{e}_b[t]$ be the GRU input, $\mathbf{i}_t\in\mathbb{R}^{n_E}$. The reset gate $\mathbf{r}_t$ is:
$$\mathbf{r}_t=\sigma(W^r\mathbf{i}_t+U^r\mathbf{h}_{t-1}+\mathbf{b}^r)$$

where $\sigma$ is the sigmoid function, $W^r\in\mathbb{R}^{n_H\times n_E}$, $U^r\in \mathbb{R}^{n_H\times n_H}$, and $\mathbf{b}^r\in\mathbb{R}^{n_H}$. The update gate $\mathbf{u}_t$ is:
$$\mathbf{u}_t=\sigma(W^u\mathbf{i}_t+U^u\mathbf{h}_{t-1}+\mathbf{b}^u)$$

where $\sigma$ is the sigmoid function, $W^u\in\mathbb{R}^{n_H\times n_E}$, $U^u\in \mathbb{R}^{n_H\times n_H}$, and $\mathbf{b}^u\in\mathbb{R}^{n_H}$. The candidate hidden state $\tilde{\mathbf{h}}_t$ is:
$$\tilde{\mathbf{h}}_t=\tanh(W^h\mathbf{i}_t+\mathbf{r}_t\circ U^h\mathbf{h}_{t-1}+\mathbf{b}^h)$$

where $\circ$ is the Hadamard (element-wise) product, $W^h\in\mathbb{R}^{n_H\times n_E}$, $U^h\in \mathbb{R}^{n_H\times n_H}$, and $\mathbf{b}^h\in\mathbb{R}^{n_H}$. The final hidden state $\mathbf{h}_t$ is:
$$\mathbf{h}_t=(1-\mathbf{u}_t)\circ\mathbf{h}_{t-1}+\mathbf{u}_t\circ\tilde{\mathbf{h}}_t$$
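The four GRU equations above translate almost line by line into code. The sketch below is a plain-Python illustration with hypothetical helper names (`gru_step`, `zero_params`); it uses all-zero parameters only so that the result is easy to check by hand, whereas a trained model would of course have learned weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(xs) for xs in zip(*vectors)]

def hadamard(a, b):
    return [x * y for x, y in zip(a, b)]

def gru_step(i_t, h_prev, p):
    """One GRU step: returns (h_t, u_t, h_cand) following the equations in the text."""
    r = [sigmoid(x) for x in add(matvec(p["Wr"], i_t), matvec(p["Ur"], h_prev), p["br"])]
    u = [sigmoid(x) for x in add(matvec(p["Wu"], i_t), matvec(p["Uu"], h_prev), p["bu"])]
    h_cand = [
        math.tanh(x)
        for x in add(matvec(p["Wh"], i_t), hadamard(r, matvec(p["Uh"], h_prev)), p["bh"])
    ]
    h = add(hadamard([1.0 - x for x in u], h_prev), hadamard(u, h_cand))
    return h, u, h_cand

def zero_params(n_e, n_h):
    """All-zero parameters, used here only to make the example checkable by hand."""
    p = {}
    for gate in ("r", "u", "h"):
        p["W" + gate] = [[0.0] * n_e for _ in range(n_h)]
        p["U" + gate] = [[0.0] * n_h for _ in range(n_h)]
        p["b" + gate] = [0.0] * n_h
    return p

# n_E = 3, n_H = 2. With zero parameters both gates equal 0.5 and the candidate
# state is 0, so the new hidden state is half of the previous one.
h_t, u_t, h_cand = gru_step([1.0, 2.0, 3.0], [0.4, -0.2], zero_params(3, 2))
```

Running the step over $\mathbf{e}_b[1],\dots,\mathbf{e}_b[T]$ in order, feeding each `h_t` back in as `h_prev`, produces the interest sequence used by the next layer.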

To make the hidden state $\mathbf{h}_t$ represent user interest more accurately, the paper borrows the idea of multi-task learning. Based on the assumption that the interest representation $\mathbf{h}_t$ at each step directly influences the next behavior, it uses $\mathbf{h}_t$ to predict whether the next item $\mathbf{e}_b[t+1]$ will be clicked by the user, estimating the click probability with logistic regression:
$$\sigma(\mathbf{h}_t,\mathbf{e}_b[t+1])$$

To build the training set for this auxiliary task, the paper constructs pairs of behavior sequences for $N$ users, $\{\mathbf{e}_b^i,\hat{\mathbf{e}}_b^i\}\in\mathcal{D}_{\mathcal{B}}$, $i\in 1,2,\dots,N$. For the $i$-th user, $\mathbf{e}_b^i\in\mathbb{R}^{T\times n_E}$ is the embedding sequence of the items the user actually clicked or browsed, and $\hat{\mathbf{e}}_b^i\in\mathbb{R}^{T\times n_E}$ is the embedding sequence of negative items sampled from all items after excluding those the user clicked or browsed. On this training set, the auxiliary loss is also a negative log-likelihood:
$$L_{aux}=-\frac{1}{N}\Big(\sum_{i=1}^N{\sum_t{\log\sigma(\mathbf{h}_t^i,\mathbf{e}_b^i[t+1])+\log(1-\sigma(\mathbf{h}_t^i,\hat{\mathbf{e}}_b^i[t+1]))}}\Big)$$

The loss function of the whole model is then:
$$L=L_{target}+\alpha*L_{aux}$$

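Here $\alpha$ is a hyperparameter that balances CTR prediction against the accuracy of the interest representation. The paper notes that the auxiliary task and its loss not only push the hidden states to represent user interest more accurately, but also ease vanishing gradients during RNN backpropagation and provide extra supervision for training the embedding layer.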
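The auxiliary loss and the combined loss can be sketched as follows. One assumption should be stated clearly: the text only writes $\sigma(\mathbf{h}_t,\mathbf{e}_b[t+1])$, and this sketch interprets it as the sigmoid of the inner product of the two vectors, which requires $n_H = n_E$. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def click_prob(h, e):
    """Assumed form of sigma(h, e): sigmoid of the inner product (needs n_H == n_E)."""
    return sigmoid(sum(a * b for a, b in zip(h, e)))

def auxiliary_loss(hidden, positives, negatives):
    """hidden[i][t], positives[i][t], negatives[i][t]: vectors for user i at step t."""
    total = 0.0
    for h_seq, pos_seq, neg_seq in zip(hidden, positives, negatives):
        # h_t is paired with the real next item and with a sampled negative item.
        for t in range(len(h_seq) - 1):
            total += math.log(click_prob(h_seq[t], pos_seq[t + 1]))
            total += math.log(1.0 - click_prob(h_seq[t], neg_seq[t + 1]))
    return -total / len(hidden)

def total_loss(l_target, l_aux, alpha):
    return l_target + alpha * l_aux

# Toy data: N = 2 users, T = 3 steps, 2-dimensional vectors, all-zero hidden states.
hidden = [[[0.0, 0.0]] * 3, [[0.0, 0.0]] * 3]
positives = [[[1.0, 0.0]] * 3, [[0.0, 1.0]] * 3]
negatives = [[[0.5, 0.5]] * 3, [[0.2, 0.8]] * 3]

l_aux = auxiliary_loss(hidden, positives, negatives)
loss = total_loss(0.5, l_aux, alpha=1.0)
```

With zero hidden states every probability is 0.5, so each user contributes $(T-1)\cdot 2\ln 2$ and `l_aux` equals $4\ln 2$; a useful hidden state should push this value down.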

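User interest is diverse. A male user may be interested in both digital products and menswear, and his interests in different categories evolve independently: within digital products, he may be interested in phones for a while and in earphones at another time. Therefore, after the interest extractor layer has modeled the behavior sequence with an RNN and produced hidden states as interest representations, DIEN adds the interest evolving layer (the red part of Figure 2). This layer computes attention scores between the interest representation at each step and the candidate ad, and combines these scores with an RNN to model the evolution of the interest that is relevant to the candidate ad, producing the corresponding interest evolution representation.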

The attention score between the interest representation at each step and the candidate ad is computed as:
$$a_t=\frac{\exp(\mathbf{h}_tW\mathbf{e}_a)}{\sum_{j=1}^T{\exp(\mathbf{h}_jW\mathbf{e}_a)}}$$

Here $W\in\mathbb{R}^{n_H\times n_A}$ and $n_A$ is the dimension of the candidate ad embedding $\mathbf{e}_a$. Given the attention scores, DIEN feeds the hidden state $\mathbf{h}_t$ output at each step of the interest extractor RNN into the interest evolving RNN as its input $\mathbf{i}_t'$. The interest evolving RNN is also a GRU, and the paper proposes three ways to combine the attention score with the GRU. The first is to weight the input directly by the attention score:
$$\mathbf{i}_t'=\mathbf{h}_t*a_t$$

This variant is called AIGRU (GRU with attentional input).

The second is to replace the update gate with the attention score when computing the current hidden state from the candidate hidden state and the previous hidden state:
$$\mathbf{h}_t'=(1-a_t)*\mathbf{h}_{t-1}'+a_t*\tilde{\mathbf{h}}_t'$$

This variant is called AGRU (Attention based GRU).

The update gate is a vector while the attention score is a scalar, so replacing the update gate with the attention score ignores the differences between the dimensions of the hidden state. The paper therefore proposes a third variant, which first scales the update gate by the attention score and then uses the scaled update gate to compute the current hidden state:
$$\tilde{\mathbf{u}}_t'=a_t*\mathbf{u}_t'$$
$$\mathbf{h}_t'=(1-\tilde{\mathbf{u}}_t')\circ\mathbf{h}_{t-1}'+\tilde{\mathbf{u}}_t'\circ \tilde{\mathbf{h}}_t'$$

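In other words, an interest with a small attention score, that is, one unrelated to the candidate ad, gets a small update gate value and therefore a small share of the current hidden state. This variant is called AUGRU (GRU with attentional update gate). In the paper's experiments, AUGRU also achieves the highest AUC of the three variants.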
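To make the difference between the three variants concrete, the sketch below implements the attention score and the three combination rules exactly as written in the formulas above. It assumes that the update gate and candidate state of the interest evolving GRU have already been computed; the function names and toy numbers are hypothetical.

```python
import math

def attention_scores(hidden, W, e_a):
    """a_t = exp(h_t W e_a) / sum_j exp(h_j W e_a), computed stably."""
    W_e = [sum(w * x for w, x in zip(row, e_a)) for row in W]      # W e_a, length n_H
    logits = [sum(h * v for h, v in zip(h_t, W_e)) for h_t in hidden]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [x / s for x in exps]

def aigru_input(h_t, a_t):
    """AIGRU: scale the GRU input by the attention score."""
    return [a_t * x for x in h_t]

def agru_update(a_t, h_prev, h_cand):
    """AGRU: the scalar attention score replaces the update gate."""
    return [(1.0 - a_t) * p + a_t * c for p, c in zip(h_prev, h_cand)]

def augru_update(a_t, u_t, h_prev, h_cand):
    """AUGRU: the update gate vector is scaled by the attention score."""
    u_tilde = [a_t * u for u in u_t]
    return [(1.0 - u) * p + u * c for u, p, c in zip(u_tilde, h_prev, h_cand)]

# Two interest states, identity W (n_H = n_A = 2), candidate aligned with the first.
scores = attention_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])

h_prev, h_cand, u_t = [1.0, 2.0], [3.0, 4.0], [0.9, 0.1]
kept = augru_update(0.0, u_t, h_prev, h_cand)      # irrelevant step: state unchanged
updated = augru_update(1.0, u_t, h_prev, h_cand)   # fully relevant: plain GRU update
mixed = agru_update(0.5, h_prev, h_cand)
```

With `a_t = 0` the AUGRU state is carried over untouched, and with `a_t = 1` it reduces to the ordinary GRU update, while each dimension still keeps its own gate value.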

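After the interest evolving layer produces the interest evolution representation, DIEN concatenates it with the embedding vectors of the user profile, candidate ad, and context features, feeds the result into a multilayer neural network, and outputs the predicted CTR through a softmax layer.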

DSTN

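In 2019, Alibaba's 超级汇川 team published the paper "Deep Spatio-Temporal Neural Networks for Click-Through Rate Prediction", which proposed DSTN. DSTN targets search advertising, where the user enters a query and the ad system returns several relevant ads to display. DSTN assumes that whether the current candidate ad is clicked is influenced by two kinds of information, as shown in Figure 4: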

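Temporal information. Like DIN and DIEN, DSTN assumes that the ads a user clicked in the past (clicked ads) reflect the user's positive preferences and can help predict the CTR of the candidate ad. DSTN further assumes that the ads the user saw but did not click (unclicked ads) reflect negative preferences and can help as well;
Spatial information. DSTN assumes that the other ads shown on the same page as the candidate ad (contextual ads) also affect whether the candidate ad is clicked. For example, if a male user searches for a phone and only one of the returned ads is for the brand he wants to buy, that ad is quite likely to be clicked.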

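Based on this analysis, DSTN uses the user's clicked ad sequence, unclicked ad sequence, and contextual ad sequence as features. Each sequence contains several ads, and different ads differ in relevance to the candidate ad. Similar to DIN, DSTN therefore uses interactive attention to compute an attention score between the embedding vector of each clicked, unclicked, or contextual ad and the embedding vector of the candidate ad. The embedding vectors within each sequence are then combined by an attention-weighted sum, giving a representation of each sequence that is relevant to the candidate ad. Finally, the representations of the three sequences are concatenated with the candidate ad embedding, fed into fully connected layers, and passed through a sigmoid function to output the predicted CTR of the candidate ad.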

Figure 5: DNN, DSTN-Pooling model, and DSTN-Interactive attention model

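The paper walks through the design process of DSTN in detail. The starting point takes only the candidate ad features as input and uses an "Embedding & MLP" network. The next step adds the clicked, unclicked, and contextual ad sequences and pools the embedding vectors of the ads in each sequence into a fixed-length vector, similar to the baselines in DIN and DIEN. The pooling is then replaced by self-attention (a score for each ad in a sequence computed from that ad alone) and by interactive attention (a score for each ad in a sequence computed together with the candidate ad). Only the interactive attention version of DSTN is described below.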

The DSTN network with interactive attention is shown on the right of Figure 5. Let the candidate ad embedding be $\mathbf{x}_t\in\mathbb{R}^{D_t}$, the embeddings of the $n_c$ contextual ads be $\{\mathbf{x}_{ci}\in\mathbb{R}^{D_c}\}_{i=1}^{n_c}$, the embeddings of the $n_l$ clicked ads be $\{\mathbf{x}_{lj}\in\mathbb{R}^{D_l}\}_{j=1}^{n_l}$, and the embeddings of the $n_u$ unclicked ads be $\{\mathbf{x}_{uq}\in\mathbb{R}^{D_u}\}_{q=1}^{n_u}$. The contextual ad embeddings are combined by a weighted sum using the attention scores:
$$\tilde{\mathbf{x}}_c=\sum_{i=1}^{n_c}{\alpha_{ci}(\mathbf{x}_t,\mathbf{x}_{ci})\,\mathbf{x}_{ci}}$$

The attention score is computed by a multilayer neural network with one hidden layer and ReLU activation, which can be written as:
$$\alpha_{ci}(\mathbf{x}_t,\mathbf{x}_{ci})=\exp\big(\mathbf{h}^T\text{ReLU}(\mathbf{W}_{tc}[\mathbf{x}_t,\mathbf{x}_{ci}]+\mathbf{b}_{tc1})+b_{tc2}\big)$$

Here $\mathbf{h}$, $\mathbf{W}_{tc}$, $\mathbf{b}_{tc1}$, and $b_{tc2}$ are model parameters. The weighted sums of the clicked and unclicked ad embeddings, $\tilde{\mathbf{x}}_l$ and $\tilde{\mathbf{x}}_u$, are obtained in the same way as $\tilde{\mathbf{x}}_c$. The subsequent fully connected layers and sigmoid function of DSTN are not described further.
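The interactive attention formula above can be sketched as follows. This is an illustration under assumptions: 2-dimensional embeddings, a 3-unit hidden layer, and all-zero parameters chosen only so that every score equals $\exp(0)=1$ and the result is easy to verify. The names `interactive_attention` and `weighted_context` are hypothetical.

```python
import math

def relu(x):
    return max(0.0, x)

def interactive_attention(x_t, x_ci, p):
    """alpha_ci = exp(h^T ReLU(W_tc [x_t, x_ci] + b_tc1) + b_tc2)."""
    z = x_t + x_ci   # concatenation of candidate and contextual embeddings
    hidden = [
        relu(sum(w * v for w, v in zip(row, z)) + b)
        for row, b in zip(p["W_tc"], p["b_tc1"])
    ]
    return math.exp(sum(h * v for h, v in zip(p["h"], hidden)) + p["b_tc2"])

def weighted_context(x_t, context, p, dim):
    """Attention-weighted sum of the contextual ad embeddings (zeros if none)."""
    out = [0.0] * dim
    for x_ci in context:
        alpha = interactive_attention(x_t, x_ci, p)
        for d in range(dim):
            out[d] += alpha * x_ci[d]
    return out

HIDDEN, DIM = 3, 2
zero = {
    "W_tc": [[0.0] * (2 * DIM) for _ in range(HIDDEN)],
    "b_tc1": [0.0] * HIDDEN,
    "h": [0.0] * HIDDEN,
    "b_tc2": 0.0,
}
x_t = [1.0, 0.5]
x_tilde_c = weighted_context(x_t, [[1.0, 2.0], [3.0, 4.0]], zero, DIM)
no_context = weighted_context(x_t, [], zero, DIM)
```

Note that the scores are exponentials without a normalizing denominator, so an empty contextual sequence simply yields a zero vector, which matters for the first inference pass described later.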

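Finally, consider DSTN's online inference, shown in Figure 6. Other models need only one inference pass to obtain the predicted CTR of every candidate ad. Because DSTN's input must include the contextual ads, its online inference consists of the following 4 steps: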

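1. The Ad Server calls the Model Server with the candidate ads, clicked ads, and unclicked ads, but no contextual ads;
2. The Model Server returns the predicted CTR of each candidate ad;
3. The Ad Server ranks the candidates by predicted CTR under some ranking mechanism, picks the top candidate as the contextual ad, and calls the Model Server again with the remaining candidate ads, the clicked ads, the unclicked ads, and the contextual ad;
4. The Model Server again returns the predicted CTR of each candidate ad.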

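In principle, steps 3 and 4 would be repeated until all required ads have been selected. The paper points out, however, that because online inference must have low latency, steps 3 and 4 are executed only once, and the remaining ads are selected from the CTRs returned in step 4 under the ranking mechanism.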
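The two-pass procedure can be sketched as a small function. Everything concrete here is an assumption for illustration: `predict_ctr` stands in for the call to the Model Server (the clicked and unclicked ads are assumed to be captured inside it), the ranking score is predicted CTR multiplied by bid as in the Background example, and the toy model halves the CTR of an ad whose brand matches a contextual ad.

```python
def two_pass_ranking(candidates, predict_ctr, bids, top_k):
    """Rank ads with one extra inference pass that uses the top ad as context."""
    # Pass 1 (steps 1 and 2): no contextual ad is available yet.
    first_pass = {ad: predict_ctr(ad, []) for ad in candidates}
    top_ad = max(candidates, key=lambda ad: first_pass[ad] * bids[ad])

    # Pass 2 (steps 3 and 4): the winner of pass 1 becomes the contextual ad.
    remaining = [ad for ad in candidates if ad != top_ad]
    second_pass = {ad: predict_ctr(ad, [top_ad]) for ad in remaining}
    ranked = sorted(remaining, key=lambda ad: second_pass[ad] * bids[ad], reverse=True)
    return [top_ad] + ranked[: top_k - 1]

# Hypothetical toy model: same-brand contextual ads halve the predicted CTR.
BASE_CTR = {"a": 0.10, "b": 0.08, "c": 0.06}
BRAND = {"a": "x", "b": "x", "c": "y"}
BIDS = {"a": 1.0, "b": 1.0, "c": 1.0}

def toy_predict_ctr(ad, contextual_ads):
    p = BASE_CTR[ad]
    for ctx in contextual_ads:
        if BRAND[ctx] == BRAND[ad]:
            p *= 0.5
    return p

result = two_pass_ranking(["a", "b", "c"], toy_predict_ctr, BIDS, top_k=3)
```

In this toy setup a single-pass ranking would order the ads a, b, c, but once ad "a" is fixed as context, ad "b" of the same brand drops below ad "c".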

DFN

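In 2020, Tencent published the paper "Deep Feedback Network for Recommendation", which proposed DFN. DFN targets content recommendation in WeChat and casts content recommendation as predicting the CTR of content. In this scenario users give three kinds of feedback, as shown in Figure 7: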

Implicit positive feedback: the user clicks the content;
Implicit negative feedback: the user does not click the content;
Explicit negative feedback: the user clicks the "dislike" button on the content.

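DFN models the behavior sequences of these three kinds of feedback and learns the user's positive and negative preferences from them. DFN first applies the multi-head self-attention of the Transformer to each feedback sequence to capture its relevance to the candidate content, giving one embedding vector per feedback type that is relevant to the candidate. Implicit negative feedback is abundant but noisy (a user may skip content for many reasons, not necessarily dislike), while explicit negative feedback and implicit positive feedback are scarce but more reliable. DFN therefore uses attention to score the unclicked behaviors against the explicit negative feedback embedding and against the implicit positive feedback embedding, and takes the two corresponding weighted sums of the unclicked behavior embeddings to obtain the negative-preference and positive-preference embeddings hidden in the unclicked behaviors. Finally, the three candidate-relevant feedback embeddings and the two preference embeddings from unclicked behaviors are concatenated as the representation of user feedback, combined with the other feature types through feature interaction, and passed through a sigmoid function to output the predicted CTR of the candidate content.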

Let the implicit positive feedback (click) sequence be $\{c_1,\cdots,c_{n_1}\}$, the explicit negative feedback ("dislike" click) sequence be $\{d_1,\cdots,d_{n_2}\}$, and the implicit negative feedback (unclick) sequence be $\{u_1,\cdots,u_{n_3}\}$.

The DFN network structure is shown in Figure 8. It consists of two parts: the Deep Feedback Interaction Module and the Feature Interaction Module.

The feedback interaction module, shown on the right of Figure 8, in turn consists of the Internal Feedback Interaction Component and the External Feedback Interaction Component.

The internal feedback interaction component applies the Transformer's multi-head self-attention to each feedback sequence to capture its relevance to the candidate content. Take implicit positive feedback (clicks) as an example. Let the input matrix be $\mathbf{B}_c=\{\mathbf{t},\mathbf{c}_1,\cdots,\mathbf{c}_{n_1}\}$, where $\mathbf{t}$ is the embedding vector of the candidate content and $\mathbf{c}_k$ is the embedding vector of the user's $k$-th click (the embedding of the clicked content plus the embedding of its position index). The self-attention in the Transformer is computed as:
$$\mathbf{Q}=\mathbf{W}^Q\mathbf{B}_c,\quad\mathbf{K}=\mathbf{W}^K\mathbf{B}_c,\quad\mathbf{V}=\mathbf{W}^V\mathbf{B}_c$$
$$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\Big(\frac{\mathbf{Q}^T\mathbf{K}}{\sqrt{n_h}}\Big)\mathbf{V}$$

The $i$-th head of the multi-head self-attention is computed as:
$$\text{head}_i=\text{Attention}(\mathbf{W}_i^Q\mathbf{Q},\mathbf{W}_i^K\mathbf{K},\mathbf{W}_i^V\mathbf{V})$$

The output matrix of the multi-head self-attention is:
$$\mathbf{F}_c=\text{concat}(\text{head}_1,\cdots,\text{head}_h)\cdot\mathbf{W}^O$$

$\mathbf{F}_c$ contains $n_1+1$ embedding vectors, which are average-pooled:
$$\mathbf{f}_c=\text{Average\_pooling}(\mathbf{F}_c)$$

$\mathbf{f}_c\in\mathbb{R}^{n_h}$ is the embedding vector of the implicit positive feedback (click) sequence that is relevant to the candidate content. The corresponding embedding vectors $\mathbf{f}_u$ and $\mathbf{f}_d$ for the implicit negative feedback and explicit negative feedback sequences are obtained in the same way.
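The self-attention and average pooling steps can be illustrated with a single attention head. This sketch makes two simplifying assumptions: tokens are stored as rows (so the score matrix is $QK^T$ rather than the column-oriented $Q^TK$ written above), and the multi-head projections and $\mathbf{W}^O$ are omitted. The function names and the toy matrices are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(B, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention; B holds one token per row."""
    Q, K, V = matmul(B, Wq), matmul(B, Wk), matmul(B, Wv)
    n_h = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    attn = [softmax([s / math.sqrt(n_h) for s in row]) for row in scores]
    return matmul(attn, V)

def average_pooling(F):
    """Mean over tokens: (n + 1) vectors become one n_h-dimensional vector."""
    return [sum(col) / len(F) for col in zip(*F)]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Rows: candidate content t, then two clicked items c_1 and c_2.
B_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With a zero query projection all scores are equal, so attention is uniform.
F_c = self_attention(B_c, zeros, identity, identity)
f_c = average_pooling(F_c)
```

Because the candidate embedding is placed in the same sequence as the clicked items, every output row mixes information from the candidate and from the user's history, and the pooled `f_c` has a fixed size regardless of $n_1$.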

The external feedback interaction component uses attention to score the unclicked behaviors against the explicit negative feedback embedding and against the implicit positive feedback embedding, and uses the two sets of scores to compute weighted sums of the unclicked behavior embeddings, giving the negative-preference and positive-preference embeddings hidden in the unclicked behaviors. Take the negative-preference embedding $\mathbf{f}_{ud}$ as an example. It is computed as:
$$\mathbf{f}_{ud}=\sum_{i=1}^{n_3}{\alpha_i\mathbf{u}_i},\quad\alpha_i=\frac{f(\mathbf{f}_d,\mathbf{u}_i)}{\sum_{j=1}^{n_3}{f(\mathbf{f}_d,\mathbf{u}_j)}}$$

where $f(\mathbf{a},\mathbf{b})$ is a multilayer neural network:
$$f(\mathbf{a},\mathbf{b})=\text{MLP}(\text{concat}(\mathbf{a},\mathbf{b},\mathbf{a}-\mathbf{b},\mathbf{a}\odot\mathbf{b}))$$

$\odot$ is the Hadamard (element-wise) product. The positive-preference embedding $\mathbf{f}_{uc}$ of the unclicked behaviors is obtained in the same way.
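A minimal sketch of this cross-feedback attention follows. The MLP is replaced by a hypothetical stand-in, the exponential of a single linear layer, chosen only so that the scores stay positive and the normalization by their sum is well defined; the paper's actual network is not reproduced here. The function names and toy vectors are likewise assumptions.

```python
import math

def interaction_features(a, b):
    """concat(a, b, a - b, a * b) with an element-wise product."""
    diff = [x - y for x, y in zip(a, b)]
    prod = [x * y for x, y in zip(a, b)]
    return a + b + diff + prod

def score(a, b, w):
    """Hypothetical stand-in for the MLP: exp of one linear layer (always positive)."""
    return math.exp(sum(wi * fi for wi, fi in zip(w, interaction_features(a, b))))

def cross_feedback(f_ref, unclicked, w):
    """Weighted sum of unclicked embeddings, weights normalized to sum to 1."""
    scores = [score(f_ref, u_i, w) for u_i in unclicked]
    total = sum(scores)
    alphas = [s / total for s in scores]
    dim = len(unclicked[0])
    pooled = [sum(a * u[d] for a, u in zip(alphas, unclicked)) for d in range(dim)]
    return pooled, alphas

f_d = [0.5, -0.5]                                   # explicit negative feedback embedding
unclicked = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]    # u_1, u_2, u_3
w = [0.0] * 8                                       # zero weights: all scores are equal

f_ud, alphas = cross_feedback(f_d, unclicked, w)
```

Calling `cross_feedback` with $\mathbf{f}_c$ in place of $\mathbf{f}_d$ gives the positive-preference vector $\mathbf{f}_{uc}$ in the same way.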

Finally, the three candidate-relevant feedback embeddings and the negative-preference and positive-preference embeddings from the unclicked behaviors are concatenated as the representation of user feedback:
$$\mathbf{f}_{Feed}=\{\mathbf{f}_c,\mathbf{f}_d,\mathbf{f}_u,\mathbf{f}_{uc},\mathbf{f}_{ud}\}$$

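The feature interaction module is shown on the left of Figure 8. It first performs interaction and extraction over all kinds of features using three components, Wide, FM, and Deep, a design similar to "Wide & Deep".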

The Wide component applies a simple linear mapping to the sparse features:
$$\mathbf{y}_i^{Wide}=\mathbf{w}_i^T\mathbf{x}_i+b_i$$

where $\mathbf{x}_i$ is a one-hot encoded feature.

The FM component computes pairwise interactions over the dense features $\mathbf{F}'=\{\mathbf{f}_1,\cdots,\mathbf{f}_m,\mathbf{f}_{Feed}\}$:
$$\mathbf{y}^{FM}=\sum_{i=1}^{m+5}{\sum_{j=i+1}^{m+5}{\mathbf{f}_i'\odot\mathbf{f}_j'}}$$
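The pairwise sum above can be computed directly with a double loop, or in linear time with the well-known identity $\sum_{i<j} \mathbf{f}_i\odot\mathbf{f}_j=\frac{1}{2}\big((\sum_i \mathbf{f}_i)^{2}-\sum_i \mathbf{f}_i^{2}\big)$, where the squares are element-wise. The sketch below shows both on three made-up 2-dimensional fields; the function names are hypothetical, and the text itself does not say which form an implementation would use.

```python
def fm_pairwise(fields):
    """Direct double loop over all pairs i < j of dense feature vectors."""
    dim = len(fields[0])
    y = [0.0] * dim
    for i in range(len(fields)):
        for j in range(i + 1, len(fields)):
            for d in range(dim):
                y[d] += fields[i][d] * fields[j][d]
    return y

def fm_pairwise_fast(fields):
    """Same result via 0.5 * ((sum f)^2 - sum f^2), element-wise."""
    dim = len(fields[0])
    y = []
    for d in range(dim):
        s = sum(f[d] for f in fields)
        sq = sum(f[d] * f[d] for f in fields)
        y.append(0.5 * (s * s - sq))
    return y

fields = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y_fm = fm_pairwise(fields)
y_fm_fast = fm_pairwise_fast(fields)
```

For the first dimension the pairs give 1·3 + 1·5 + 3·5 = 23, and the identity gives (81 − 35) / 2 = 23 as well.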

The Deep component uses a multilayer neural network to capture deeper relationships among the dense features:
$$\mathbf{y}^{Deep}=\mathbf{f}^{(2)}$$
$$\mathbf{f}^{(i+1)}=\text{ReLU}(\mathbf{W}^{(i)}\mathbf{f}^{(i)}+\mathbf{b}^{(i)})$$
$$\mathbf{f}^{(0)}=\text{concat}(\mathbf{f}_1,\cdots,\mathbf{f}_m,\mathbf{f}_{Feed})$$

Finally, the outputs of the three components are concatenated and passed through a sigmoid function to output the predicted CTR of the candidate content:
$$\mathbf{y}=\text{concat}(\mathbf{y}^{Wide},\mathbf{y}^{FM},\mathbf{y}^{Deep})$$
$$p(x)=\sigma(\mathbf{w}_p^T\mathbf{y})$$

Mirroring the three kinds of user feedback, the training loss also covers three kinds of samples: clicked samples, unclicked samples, and "dislike" samples:
$$L=-\frac{1}{N}\Big(\lambda_c\sum_{S_c}{\log p(x)}+\lambda_u\sum_{S_u}{\log(1-p(x))}+\lambda_d\sum_{S_d}{\log(1-p(x))}\Big)$$
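A small numeric sketch of this weighted loss is given below. The text does not define $N$ for DFN, so the sketch assumes it is the total number of samples across the three sets; the function name, the predictions, and the $\lambda$ values are made up for illustration.

```python
import math

def dfn_loss(clicked, unclicked, disliked, lam_c, lam_u, lam_d):
    """Each argument list holds predicted CTRs p(x) for one sample set."""
    # Assumption: N is the total number of samples over the three sets.
    n = len(clicked) + len(unclicked) + len(disliked)
    total = lam_c * sum(math.log(p) for p in clicked)
    total += lam_u * sum(math.log(1.0 - p) for p in unclicked)
    total += lam_d * sum(math.log(1.0 - p) for p in disliked)
    return -total / n

# One sample per set, all predicted at 0.5; "dislike" samples weighted twice as much.
loss = dfn_loss([0.5], [0.5], [0.5], lam_c=1.0, lam_u=1.0, lam_d=2.0)
```

Raising `lam_d` makes a high predicted CTR on a disliked item more costly than the same prediction on a merely unclicked item, which matches the observation that explicit negative feedback is scarcer but more reliable.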

RACP

In 2022, Alibaba published the paper "Modeling Users' Contextualized Page-wise Feedback for Click-Through Rate Prediction in E-commerce Search", which proposed RACP.

DCIN

In 2023, Meituan published the paper "Deep Context Interest Network for Click-Through Rate Prediction", which proposed DCIN.
