Background

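Advertisers usually split their budget across multiple channels (media or ad platforms) and run multi-channel campaigns. For example, a game advertiser promoting a new title may buy search, feed, splash-screen, and live-streaming inventory to get users to download, install, activate, and pay, and a user may see the ad on several of these channels in turn before finally converting. An ad exposure to a user is called a touchpoint. The exposures of one ad to one user across channels form a touchpoint sequence, called a conversion path. Advertisers typically collect impression, click, and conversion logs from every channel, build the set of cross-channel conversion paths from a global view, and use an algorithm to estimate how much each touchpoint on a path contributed to the final conversion. This analysis is called multi-touch attribution (Multi Touch Attribution, MTA). It lets advertisers quantify how much each channel matters for conversion, adjust the budget split across channels, and optimize the delivery strategy globally.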

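In the previous post, 《基于深度学习的多触点归因论文阅读笔记》 (reading notes on deep-learning-based multi-touch attribution papers), I briefly described how multi-touch attribution algorithms fall into two families: rule-based and data-driven.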

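Rule-based algorithms include last-touch attribution, first-touch attribution, and similar schemes. Their rules are simple and easy to implement, but they cannot make full use of the user, ad, and touchpoint-sequence information in a conversion path. The same channel should carry a different attribution weight under different user preferences and different conversion paths, so rule-based algorithms struggle to compute fine-grained, personalized channel weights.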
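To make the rule-based family concrete, here is a minimal sketch of last-touch and first-touch attribution. The function names and the sample path are hypothetical, and the linear (equal-credit) rule is not mentioned above; it is added only as a familiar point of contrast.

```python
def last_touch(path):
    """Give all credit to the channel of the final touchpoint."""
    return {path[-1]: 1.0}


def first_touch(path):
    """Give all credit to the channel of the first touchpoint."""
    return {path[0]: 1.0}


def linear_touch(path):
    """Equal credit per touchpoint (an extra rule, shown for contrast)."""
    weights = {}
    for channel in path:
        weights[channel] = weights.get(channel, 0.0) + 1.0 / len(path)
    return weights


# Hypothetical conversion path: channel of each touchpoint in time order.
path = ["search", "feed", "splash", "feed"]
print(last_touch(path))
print(first_touch(path))
print(linear_touch(path))
```

None of these rules looks at who the user is or what happened earlier on the path, which is exactly the limitation described above.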

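Data-driven algorithms were first proposed in the 2011 paper "Data-driven Multi-touch Attribution Models", which uses a logistic regression model to estimate the attribution weight of each touchpoint. With the rise of deep learning, many recent papers have explored deep-learning-based multi-touch attribution.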

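Deep-learning-based algorithms usually recast multi-touch attribution as conversion-rate prediction. Because the data is a touchpoint sequence, they model it with a recurrent neural network, train the model to predict conversion, and then derive attribution weights from the trained model. There are two common ways to derive the weights. One uses intermediate quantities of the conversion model as touchpoint weights: DNAMTA and DARNN, covered in the previous post, pass the hidden state of each touchpoint through an attention mechanism and use the attention scores as attribution weights. The other combines deep learning with causal inference and uses counterfactual analysis to compute a Shapley value per channel: JDMTA, also covered in the previous post, uses the conversion model to predict conversion with and without each channel, and the lift from adding the channel, its marginal expected incremental benefit (Marginal Expected Incremental Benefit), that is its Shapley value (Shapley Value), becomes the channel's attribution weight.
The deep-learning algorithms in the previous post exploit the information in conversion paths fairly thoroughly, but none of them addresses confounding bias (Confounding Bias). In multi-touch attribution, a user's static features and past behavior act as confounders (Confounder): they influence both which channel serves the next touchpoint and whether the user finally converts, and this introduces confounding bias.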

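This post covers two deep-learning-based multi-touch attribution papers that tackle confounding and thereby improve the accuracy of attribution weights. Corrections are welcome.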

What is confounding

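A classic example from causal inference illustrates confounding (Confounding) and confounders (Confounder). Statistics show that countries that eat more chocolate also win more Nobel Prizes. Can we conclude that eating more chocolate causes more Nobel Prizes? Clearly not; the conclusion contradicts common sense. The two facts may be associated because of a shared cause, such as a well-developed economy, which leads a country both to eat more chocolate and to win more Nobel Prizes, without any causal link between the two. When one cause influences two outcomes and we wrongly infer that one outcome causes the other, the problem is called confounding, and the shared cause ("a developed economy" in this example) is the confounder.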

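Multi-touch attribution in advertising has the same problem. A user's static features and past behavior affect both the ad channel and the final conversion. Suppose a male gadget enthusiast searches for a phone, clicks the ad for that phone in the search results, and buys it. We cannot conclude that he bought the phone because the search channel showed him the ad. In reality both the search-channel exposure and the purchase happened because he was already interested in the phone; he would have bought it even without the ad.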
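The phone example can be checked with a small exact calculation. The sketch below uses made-up probabilities (every number is an assumption for illustration): interest drives both ad exposure and purchase, while the ad itself has no effect. A naive comparison of exposed and unexposed users still shows a large lift; stratifying by interest removes it.

```python
# Hypothetical probabilities, chosen only for illustration.
P_INTEREST = 0.5
P_EXPOSED = {True: 0.8, False: 0.2}  # P(ad exposure | interested?)
# P(purchase | interested?, exposed?): the ad changes nothing by construction.
P_BUY = {
    (True, True): 0.5, (True, False): 0.5,
    (False, True): 0.05, (False, False): 0.05,
}


def p_interest(interested):
    return P_INTEREST if interested else 1 - P_INTEREST


def p_buy_given_exposure(exposed):
    """P(purchase | exposure), marginalizing over the hidden interest."""
    num = den = 0.0
    for interested in (True, False):
        p_e = P_EXPOSED[interested] if exposed else 1 - P_EXPOSED[interested]
        joint = p_interest(interested) * p_e
        num += joint * P_BUY[(interested, exposed)]
        den += joint
    return num / den


# Naive comparison: confounded by interest.
naive_lift = p_buy_given_exposure(True) - p_buy_given_exposure(False)
# Stratified comparison: compare within each interest group, then average.
stratified_lift = sum(
    p_interest(i) * (P_BUY[(i, True)] - P_BUY[(i, False)]) for i in (True, False)
)
print(round(naive_lift, 2), stratified_lift)
```

The naive lift is large even though the true effect is zero, which is the bias the two papers below try to remove.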

CRN: domain-adversarial training against confounding in medical treatment

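The 2020 paper "Estimating Counterfactual Treatment Outcomes over Time Through Adversarially Balanced Representations" proposes CRN (Counterfactual Recurrent Network) to handle confounding when modeling treatment sequences. In clinical practice a doctor treats a patient in stages over time: at each stage the doctor chooses a treatment based on the patient's history and observes the outcome at the next stage. The paper trains a model on large clinical datasets (for example, electronic health records) so that, at any stage, the model can predict the outcome of a treatment given the patient's history and thus help choose the best one. Patient history, treatments, and outcomes all unfold over multiple stages, so this is a typical sequence-modeling problem. CRN follows a representation-learning approach with a sequence-to-sequence architecture. The encoder uses an RNN to model past patient history, treatments, and outcomes, and applies domain-adversarial training to output an unbiased representation of the patient's history, removing the bias caused by confounders. The decoder takes that unbiased representation as its initial state and uses an RNN to predict outcomes under a future treatment sequence, updating the unbiased representation as it goes.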

Problem formulation

Let the dataset be $\mathcal{D}=\left\{\{\text{x}_t^{(i)},\text{a}_t^{(i)},\text{y}_{t+1}^{(i)}\}_{t=1}^{T^{(i)}}\cup\{\text{v}^{(i)}\}\right\}_{i=1}^N$, covering $N$ patients (lowercase letters denote observed values of the corresponding uppercase random variables). The data for patient $i$ has two parts. One is the static features $\text{V}^{(i)}\in\mathcal{V}$, such as gender. The other is a time-varying sequence of $T^{(i)}$ steps; at step $t$ it consists of the time-varying features $\text{X}_t^{(i)}\in\mathcal{X}_t$, the treatment $\text{A}_t^{(i)}\in\{A_1,\dots,A_K\}=\mathcal{A}$, and the outcome produced at the next step, $\text{Y}_{t+1}^{(i)}\in\mathcal{Y}_{t+1}$. The paper drops the patient index $i$ in the analysis that follows.

Let $\bar{\text{H}}_t=(\bar{\text{X}}_t,\bar{\text{A}}_{t-1},\text{V})$ denote the patient's history at step $t$, where $\bar{\text{X}}_t=(\text{X}_1,\dots,\text{X}_t)$ is the sequence of time-varying features up to step $t$, $\bar{\text{A}}_{t-1}$ is the sequence of treatments up to step $t-1$, and $\text{V}$ is the static features. Let $\text{Y}[\bar{\text{a}}]$ denote the potential outcome under a future treatment sequence $\bar{\text{a}}$. The goal is to predict, from the history at step $t$, the outcome after applying the treatment sequence $\bar{\text{a}}$ from step $t$ onward (step $t$ included):
$$\mathbb{E}\left(\text{Y}_{t+\tau}[\bar{\text{a}}(t,t+\tau-1)]\mid\bar{\text{H}}_t\right)$$

Here $\bar{\text{a}}(t,t+\tau-1)=[\text{a}_t,\dots,\text{a}_{t+\tau-1}]$ is the treatment sequence from step $t$ to step $t+\tau-1$, and $\text{Y}_{t+\tau}[\bar{\text{a}}(t,t+\tau-1)]$ is the outcome produced at step $t+\tau$ after applying that sequence.

Network architecture

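The CRN architecture is shown in Figure 4. It is a sequence-to-sequence model with an encoder and a decoder that share a similar structure but play different roles and are trained separately. The encoder models the past (history, treatments, outcomes) with an RNN under domain-adversarial training and outputs the unbiased history representation; the decoder starts from that representation and uses an RNN to predict outcomes under a future treatment sequence while updating the representation.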

The encoder is a standard RNN/LSTM. At step $t$ its inputs are the previous hidden state $h_{t-1}$, the previous treatment $\text{A}_{t-1}$, the static features $\text{V}$, and the time-varying features $\text{X}_t$; its output is the current hidden state $h_t$. That output is fed to a fully connected network (with ELU activation), which produces the unbiased representation of the history at step $t$, $\Phi(\bar{\text{H}}_t)$. $\Phi$ maps the history $\bar{\text{H}}_t$ into a representation space $\mathcal{R}$, and the mapped representation must satisfy:
$$P(\Phi(\bar{\text{H}}_t)\mid\text{A}_t=A_1)=\cdots=P(\Phi(\bar{\text{H}}_t)\mid\text{A}_t=A_K)$$

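In other words, the distribution of the history representation is the same whichever treatment is chosen. This breaks the association between patient history and the current treatment, and so removes the bias caused by confounders.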

To enforce this requirement, the paper treats the different treatments as different domains in the transfer-learning sense and applies domain-adversarial training (Domain Adversarial Training) from domain adaptation (Domain Adaptation). CRN defines a treatment prediction network $G_a(\Phi(\bar{\text{H}}_t);\theta_a)$ with parameters $\theta_a$ and an outcome prediction network $G_y(\Phi(\bar{\text{H}}_t);\theta_y)$ with parameters $\theta_y$. $G_a$ plays the role of the domain classifier and $G_y$ the role of the outcome predictor. $G_a$ takes the history representation and outputs a probability for each treatment through a Softmax. $G_y$ takes the history representation and the current treatment and outputs the next-step outcome through a linear function. $G_a$ solves a classification problem and uses the cross-entropy loss:
$$\mathcal{L}_{t,a}^{(i)}(\theta_r,\theta_a)=-\sum_{j=1}^{K}\mathbb{I}_{\{a_t^{(i)}=a_j\}}\log\left(G_a^j(\Phi(\bar{\text{H}}_t;\theta_r);\theta_a)\right)$$

During training, the representation is pushed to make this treatment-prediction loss larger, not smaller: if the treatment cannot be predicted from the history representation, the association between history and current treatment is gone, and with it the bias caused by confounders.
$G_y$ solves a regression problem and uses the MSE loss:
$$\mathcal{L}_{t,y}^{(i)}(\theta_r,\theta_y)=\left\lVert\text{Y}_{t+1}^{(i)}-G_y(\Phi(\bar{\text{H}}_t;\theta_r);\theta_y)\right\rVert^2$$

During training, the outcome-prediction loss should be as small as possible. Combining the losses of $G_a$ and $G_y$ gives:
$$\mathcal{L}_t(\theta_r,\theta_y,\theta_a)=\sum_{i=1}^N\left(\mathcal{L}_{t,y}^{(i)}(\theta_r,\theta_y)-\lambda\mathcal{L}_{t,a}^{(i)}(\theta_r,\theta_a)\right)$$

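Minimizing this loss with respect to the representation parameters $\theta_r$ minimizes the outcome-prediction loss while maximizing the treatment-prediction loss. The hyperparameter $\lambda$ trades off how unbiased the history representation is against how accurate the outcome prediction is.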
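As a rough illustration of how the two terms combine for one patient at one step, the sketch below evaluates the loss on hand-picked numbers. The function names and inputs are hypothetical; the predictions are passed in directly instead of coming from trained networks.

```python
import math


def treatment_cross_entropy(probs, true_index):
    """L_{t,a}: negative log-probability assigned to the observed treatment."""
    return -math.log(probs[true_index])


def outcome_mse(y_true, y_pred):
    """L_{t,y}: squared L2 distance between observed and predicted outcome."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))


def crn_step_loss(y_true, y_pred, probs, true_index, lam):
    """L_{t,y} - lambda * L_{t,a} for a single patient and step."""
    return outcome_mse(y_true, y_pred) - lam * treatment_cross_entropy(probs, true_index)


# Same outcome error; only the treatment classifier's confidence differs.
uniform = crn_step_loss([1.0], [0.5], [0.25] * 4, 2, lam=1.0)
peaked = crn_step_loss([1.0], [0.5], [0.01, 0.01, 0.97, 0.01], 2, lam=1.0)
print(uniform, peaked)
```

The combined loss is lower when the classifier is reduced to a uniform guess, so a representation that hides the treatment is rewarded.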

The forward and backward passes during training are shown in Figure 5. The forward pass computes the treatment-prediction loss $\mathcal{L}_a$ and the outcome-prediction loss $\mathcal{L}_y$. When backpropagating $\mathcal{L}_y$, the gradients for $G_y$ and for the history representation (RNN plus fully connected network), $\frac{\partial\mathcal{L}_y}{\partial\theta_y}$ and $\frac{\partial\mathcal{L}_y}{\partial\theta_r}$, are computed and used to update the parameters as usual. When backpropagating $\mathcal{L}_a$, the gradients for $G_a$ and for the history representation, $\frac{\partial\mathcal{L}_a}{\partial\theta_a}$ and $\frac{\partial\mathcal{L}_a}{\partial\theta_r}$, are computed. $G_a$ is updated with $\lambda\frac{\partial\mathcal{L}_a}{\partial\theta_a}$, so that it predicts the treatment as accurately as it can. The history representation, however, goes through a gradient reversal layer (Gradient Reversal Layer, GRL) and is updated with $-\lambda\frac{\partial\mathcal{L}_a}{\partial\theta_r}$, so that the representation makes treatment prediction as wrong as possible.
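The gradient reversal step can be traced by hand on a toy scalar model. This is only a sketch under stated assumptions: the one-weight "encoder" `w_r`, one-weight "treatment head" `w_a`, squared-error loss, and all numbers are hypothetical stand-ins, not the CRN networks.

```python
class GradientReversal:
    """Identity in the forward pass; scales the gradient by -lam on the way back."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output


# Toy model: r = w_r * x, score = w_a * GRL(r), L_a = 0.5 * (score - target)^2
x, target = 2.0, 1.0
w_r, w_a, lam, lr = 0.5, 1.5, 1.0, 0.1
grl = GradientReversal(lam)

r = w_r * x
score = w_a * grl.forward(r)
loss_before = 0.5 * (score - target) ** 2

d_score = score - target
grad_w_a = lam * d_score * r             # head: ordinary gradient, scaled by lam
grad_r = grl.backward(d_score * w_a)     # reversed before it reaches the encoder
grad_w_r = grad_r * x

w_a_new = w_a - lr * grad_w_a            # descent step for the head
w_r_new = w_r - lr * grad_w_r            # same rule, but the sign was flipped

loss_after_head = 0.5 * (w_a_new * r - target) ** 2
loss_after_encoder = 0.5 * (w_a * (w_r_new * x) - target) ** 2
print(loss_before, loss_after_head, loss_after_encoder)
```

Updating only the head lowers the treatment loss, while updating only the encoder raises it, which is the adversarial behavior described above.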

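The CRN decoder is structured much like the encoder, with two differences:

The unbiased history representation output by the encoder is used as the decoder's initial state.
At inference time, several candidate treatment sequences $\bar{\text{a}}(t,t+\tau-1)$ can be fed to the decoder, which predicts the outcome at each step and feeds each predicted outcome back as input to the next step in an autoregressive manner; the best treatment plan is then chosen from the predicted outcomes.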

Model training

The pseudocode for CRN training is shown in Figure 6.

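Training can be roughly divided into four steps:

(1) Train the encoder in batches (each batch holds data from several patients); the forward and backward passes are described in the architecture section above.
(2) For every patient and every step, use the encoder to compute the unbiased history representation.
(3) Split each patient's history into subsequences of at most $\tau_{\text{max}}$ steps.
(4) Train the decoder in batches (each batch holds several history subsequences from several patients). The procedure mirrors encoder training, except that for the first node of a subsequence, the unbiased history representation of the node just before it is used as the initial hidden state.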
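The splitting in step (3) could look like the sketch below. It assumes a sliding window, one subsequence per possible starting step, which is one plausible reading of the description and is not confirmed by the text; `split_history` and the sample labels are hypothetical.

```python
def split_history(steps, tau_max):
    """Return (init_index, window) pairs.

    init_index is the step whose encoder representation would seed the decoder;
    window holds the up-to-tau_max steps that follow it.
    """
    pairs = []
    for t in range(len(steps) - 1):
        window = steps[t + 1 : t + 1 + tau_max]
        pairs.append((t, window))
    return pairs


# Hypothetical history of five steps for one patient.
history = ["s0", "s1", "s2", "s3", "s4"]
pairs = split_history(history, tau_max=2)
for init_index, window in pairs:
    print(init_index, window)
```

Each pair carries the index needed for step (4): the representation computed in step (2) at `init_index` becomes the decoder's initial hidden state for that window.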

CAMTA: borrowing CRN to handle confounding from user history in MTA

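The 2020 paper "CAMTA: Causal Attention Model for Multi-touch Attribution" proposes CAMTA. Its basic approach is unchanged: recast multi-touch attribution as conversion-rate prediction, model the touchpoint sequence with a recurrent neural network, train the model to predict conversion, and then compute attribution weights from the trained model, giving per-user attribution. Its novelty is to borrow the idea of CRN and upgrade the ordinary recurrent network to a causal recurrent network (Causal Recurrent Network), addressing confounding in multi-touch attribution for advertising.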

The paper uses Figure 7 to illustrate the confounding problem. The user contexts $\mathbf{x}_1$ and $\mathbf{x}_2$ at times $\mathbf{T}_1$ and $\mathbf{T}_2$ act as confounders: they affect both whether the user converts at those times and which channel is chosen at those times. Channel choice is therefore not random, which is the selection bias (Selection Bias) problem. In the ideal unbiased setting, users would be distributed randomly across channels at every touchpoint.

Problem formulation

Let the dataset be $\mathcal{D}=\{u^n,\{\mathbf{x}_t^n,\mathbf{c}_t,z_{t+1}^n\}_{t=1}^{T^n},y^n\}_{n=1}^N$, containing the conversion paths of $N$ users. The conversion path of user $u^n$ has $T^n$ touchpoints. The $t$-th touchpoint consists of the channel $\mathbf{c}_t$, the context feature vector $\mathbf{x}_t^n$ (covering user, ad, and context information), and the binary click result $z_{t+1}^n$ for that touchpoint (0 for no click, 1 for click). $\mathbf{c}_t$ is the one-hot encoding of the channel, $\mathbf{c}_t=[c_t(1),\dots,c_t(k),\dots,c_t(K)]$: if the $t$-th touchpoint is on channel $k$ then $c_t(k)=1$ and all other entries are 0. The path also carries the binary result $y^n$ of whether the user finally converted (0 for no conversion, 1 for conversion).
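One way to hold a record of this dataset in code is sketched below. The class and field names are hypothetical, and the channel index is zero-based here while the text counts channels from 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Touchpoint:
    channel: List[int]    # one-hot c_t over K channels
    context: List[float]  # context feature vector x_t^n
    clicked: int          # z_{t+1}^n: 1 if this touchpoint was clicked


@dataclass
class ConversionPath:
    user_id: str                   # u^n
    touchpoints: List[Touchpoint]  # length T^n
    converted: int                 # y^n: 1 if the path ended in a conversion


def one_hot(k, num_channels):
    """One-hot vector with a 1 at zero-based position k."""
    return [1 if i == k else 0 for i in range(num_channels)]


# Hypothetical path with two touchpoints over K = 4 channels.
path = ConversionPath(
    user_id="u1",
    touchpoints=[
        Touchpoint(one_hot(2, 4), [0.1, 0.7], clicked=1),
        Touchpoint(one_hot(0, 4), [0.3, 0.2], clicked=0),
    ],
    converted=1,
)
print(path.touchpoints[0].channel)
```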

Network architecture

The overall CAMTA architecture is shown in Figure 8.

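Causal recurrent network

CRN uses domain-adversarial training to output an unbiased representation of patient history and remove the bias caused by confounders. CAMTA borrows this and uses domain-adversarial training to output an unbiased representation of user history.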

CAMTA first models the user's touchpoint sequence with a recurrent neural network to extract a hidden-state representation of the user's history. The representation is computed as:
$$\mathbf{s}_t^n=f_{t,s}(\mathbf{x}_t^n,\mathbf{x}_{t-1}^n,\dots,\mathbf{x}_1^n,\mathbf{c}_{t-1},\dots,\mathbf{c}_1,z_t^n,z_{t-1}^n,\dots,z_2^n)$$

Here $\mathbf{s}_t^n\in\mathbb{R}^L$ is the hidden-state representation of the user's history at touchpoint $t$. It is computed by $f_{t,s}$ from the context sequence up to and including touchpoint $t$, $[\mathbf{x}_1^n,\dots,\mathbf{x}_{t-1}^n,\mathbf{x}_t^n]$; the channel sequence before touchpoint $t$, $[\mathbf{c}_1,\dots,\mathbf{c}_{t-1}]$; and the click results of the touchpoints before touchpoint $t$, $[z_2^n,\dots,z_{t-1}^n,z_t^n]$ (recall that $z_{t}^n$ records the click on touchpoint $t-1$).
As an important user feature, $\mathbf{s}_t^n$ feeds into the ad-serving decision at touchpoint $t$ and so influences the choice of channel $\mathbf{c}_t$, which causes selection bias. The paper therefore learns, on top of $\mathbf{s}_t^n$, an unbiased representation of the user's history $\mathbf{r}_t^n\in\mathbb{R}^M$. Following CRN, it maps $\mathbf{s}_t^n$ to $\mathbf{r}_t^n$ with a linear mapping matrix $\Phi_t$:
$$\Phi_t:\mathbf{s}_t^n\rightarrow\mathbf{r}_t^n,\quad\Phi_t\in\mathbb{R}^{M\times L}$$

$\mathbf{r}_t^n$ is then fed to two classifiers for domain-adversarial training, as detailed in the next section.

Min-max loss

The domain-adversarial training closely follows CRN. The paper introduces two classifiers that take the unbiased history representation $\mathbf{r}_t^n$ at touchpoint $t$ and predict, respectively, the channel chosen and whether the user clicks at touchpoint $t$. The channel classifier $\mathcal{C}_{t,c}$ is a two-layer MLP whose last layer outputs a $K$-dimensional vector, passed through a Softmax to obtain the predicted probability of each channel. The click classifier $\mathcal{C}_{t,z}$ is also a two-layer MLP whose last layer outputs a scalar, passed through a Sigmoid to obtain the predicted click probability $\hat{z}_{t+1}^n$.

Because $\mathbf{r}_t^n$ should be unbiased, the channel classifier $\mathcal{C}_{t,c}$, given $\mathbf{r}_t^n$, should assign equal probability to every channel, so its loss $\mathcal{L}_{t,c}$ has to be maximized:
$$\mathcal{L}_{t,c}(\Phi_t)=-\sum_{k=1}^K c_t(k)\log\left(\mathcal{C}_{t,c}(\mathbf{r}_t^n)\right)$$

At the same time, $\mathbf{r}_t^n$ should still predict accurately whether the ad is clicked, so the loss of the click classifier $\mathcal{L}_{t,z}$ has to be minimized:
$$\mathcal{L}_{t,z}(\Phi_t)=-\sum_{k=1}^K z_{t+1}\log(\hat{z}_{t+1}^n)$$

Here $\hat{z}_{t+1}^n=\mathcal{C}_{t,z}([\mathbf{r}_t,\mathbf{c}_t])$. Putting the two together, the loss to minimize is:
$$\mathcal{L}_r(\Phi,\lambda)=\sum_t\left(\mathcal{L}_{t,z}(\mathbf{r}_t^n)-\lambda\mathcal{L}_{t,c}(\mathbf{r}_t^n)\right)$$

Here $\lambda$ is a hyperparameter and $\Phi=[\Phi_1,\dots,\Phi_{T^n}]$. Domain-adversarial training thus yields a linear mapping matrix for each touchpoint, which turns that touchpoint's hidden-state representation of user history into the corresponding unbiased representation.
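A small numeric sketch of the min-max loss for one path follows. The classifier outputs are supplied by hand, not produced by MLPs, and all names and numbers are hypothetical. One assumption to flag: the click loss as written sums over $k$ without any $k$-dependent term, so the sketch uses a single term per touchpoint.

```python
import math


def channel_loss(channel_one_hot, channel_probs):
    """L_{t,c}: cross-entropy between the observed channel and predicted probabilities."""
    return -sum(c * math.log(p) for c, p in zip(channel_one_hot, channel_probs))


def click_loss(clicked, click_prob):
    """L_{t,z} as written in the text: only the positive-label term -z * log(z_hat)."""
    return -clicked * math.log(click_prob)


def minmax_loss(touchpoints, lam):
    """Sum over touchpoints of L_{t,z} - lam * L_{t,c}."""
    return sum(
        click_loss(z, z_hat) - lam * channel_loss(c, probs)
        for c, probs, z, z_hat in touchpoints
    )


# (one-hot channel, predicted channel probs, click label, predicted click prob)
touchpoints = [
    ([1, 0], [0.5, 0.5], 1, 0.8),
    ([0, 1], [0.5, 0.5], 0, 0.3),
]
loss = minmax_loss(touchpoints, lam=0.5)
print(loss)
```

With uniform channel probabilities the adversarial term is at its most negative for two channels, which is the state the representation is trained toward.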

Attention mechanism

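Domain-adversarial training does two things. It yields an unbiased representation of user history, addressing selection bias, and by adding per-touchpoint click prediction as an auxiliary task for the final conversion prediction, it also eases the sparsity of conversion data.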

On top of domain-adversarial training, the paper uses an attention mechanism to compute each touchpoint's attribution weight for the final conversion. The predicted click probability at touchpoint $t$, $\hat{z}_{t+1}^n$, is passed through a one-layer MLP to obtain the click hidden-state representation of that touchpoint, $\mathbf{v}_t^n$:
$$\mathbf{v}_t^n=\tanh(\mathbf{W}_v\hat{z}_{t+1}^n+\mathbf{b}_v)$$

The dot product of $\mathbf{v}_t^n$ and a context vector $\mathbf{u}$ is then computed and normalized with a Softmax to give the attribution weight:
$$a_t=\frac{\exp((\mathbf{v}_t^n)^T\mathbf{u})}{\sum_t\exp((\mathbf{v}_t^n)^T\mathbf{u})}$$

The context vector $\mathbf{u}$ encodes domain knowledge about touchpoint attribution weights and can be learned as a model parameter.
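The attention weights are a plain softmax over dot products, which the following sketch computes for hand-picked vectors. The values of `v` and `u` are hypothetical, and the max-subtraction is a standard numerical-stability step that does not change the result.

```python
import math


def attention_weights(click_states, context_vector):
    """Softmax over the dot products (v_t)^T u, one weight per touchpoint."""
    scores = [
        sum(v_i * u_i for v_i, u_i in zip(v, context_vector)) for v in click_states
    ]
    top = max(scores)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical click hidden states for three touchpoints and a context vector.
v = [[0.2, -0.1], [0.9, 0.4], [0.2, -0.1]]
u = [1.0, 0.5]
a = attention_weights(v, u)
print(a)
```

Touchpoints whose hidden state aligns better with `u` receive more credit, and the weights over a path always sum to 1.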

Conversion-rate prediction

The click hidden-state representations $\mathbf{v}_t^n$ are summed, weighted by the attribution weights, to give the hidden-state representation of the user's history $\mathbf{h}^n$:
$$\mathbf{h}^n=\sum_t a_t\mathbf{v}_t^n$$

$\mathbf{h}^n$ is then passed through a one-layer MLP and a Sigmoid to produce the final predicted conversion rate:
$$\hat{y}^n=\text{sigmoid}(\mathbf{W}_y^T\mathbf{h}^n+\mathbf{b}_y)$$

Conversion prediction uses the cross-entropy loss:
$$\mathcal{L}_y(\mathbf{W}_y,\mathbf{b}_y,\mathbf{W}_v,\mathbf{b}_v,\Phi,\mathbf{u})=-y^n\log(\hat{y}^n)$$

$\mathbf{W}_y,\mathbf{b}_y,\mathbf{W}_v,\mathbf{b}_v,\Phi,\mathbf{u}$ in the formula above are all trainable parameters.
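The conversion head can be traced with a few lines of arithmetic. In this sketch the attribution weights are given directly, and every value (`click_states`, `weights`, `w_y`, `b_y`) is a hypothetical number chosen so the result is easy to check by hand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_conversion(click_states, weights, w_y, b_y):
    """h = sum_t a_t * v_t, then y_hat = sigmoid(w_y . h + b_y)."""
    dim = len(w_y)
    h = [sum(a * v[d] for a, v in zip(weights, click_states)) for d in range(dim)]
    return sigmoid(sum(w * h_d for w, h_d in zip(w_y, h)) + b_y)


def conversion_loss(y, y_hat):
    """L_y as written in the text: -y * log(y_hat)."""
    return -y * math.log(y_hat)


click_states = [[1.0, 0.0], [0.0, 1.0]]  # v_1, v_2
weights = [0.25, 0.75]                    # a_1, a_2
y_hat = predict_conversion(click_states, weights, w_y=[2.0, -2.0], b_y=1.0)
loss = conversion_loss(1, y_hat)
print(y_hat, loss)
```

Here the pooled state is `[0.25, 0.75]`, the logit is zero, and the predicted conversion rate is one half.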

Overall structure

To summarize, the overall CAMTA architecture is shown in Figure 8 and has three parts.

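The first part is the causal recurrent network, outlined with a pink dashed line. It models the user's touchpoint sequence with a recurrent neural network to obtain the hidden-state representation $\mathbf{s}_t^n$ of each touchpoint, uses domain-adversarial training to obtain the unbiased representation $\mathbf{r}_t^n$ of each touchpoint, and finally predicts the click-through rate of each touchpoint from its unbiased representation with the click classifier.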

The second part is the attention mechanism, outlined with a blue dashed line. From each touchpoint's predicted click-through rate it learns the click hidden-state representation $\mathbf{v}_t^n$, computes the dot product of $\mathbf{v}_t^n$ with a trainable context vector $\mathbf{u}$, and normalizes the dot products with a Softmax to obtain the attribution weight of each touchpoint.

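The third part is conversion-rate prediction, outlined with a gray dashed line. It sums the click hidden-state representations weighted by the attribution weights to obtain the hidden-state representation of the user's history $\mathbf{h}^n$, and passes it through a one-layer MLP and a Sigmoid to produce the final predicted conversion rate.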

The loss of the whole network is the sum of the min-max loss from domain-adversarial training and the cross-entropy loss of the final conversion prediction:
$$\mathcal{L}(\lambda,\beta)=\mathcal{L}_r(\Phi,\lambda)+\beta\mathcal{L}_y(\mathbf{W}_y,\mathbf{b}_y,\mathbf{W}_v,\mathbf{b}_v,\Phi,\mathbf{u})$$

Experiments

Dataset

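The paper uses a dataset provided by Criteo that is widely used in multi-touch attribution research: the Criteo Attribution Modeling for Bidding Dataset. Criteo is a major provider of personalized retargeting ads on desktop and mobile. The dataset contains 16 million ad impressions (touchpoints) across 675 campaigns; every impression record carries a unique user identifier and a unique conversion identifier, and a user may convert once or several times. The paper first preprocesses the dataset: it randomly picks 10 of the 675 campaigns to serve as ad channels, keeps the impressions belonging to these 10 campaigns, and splits them by user into conversion paths, each with at most 20 touchpoints and at most one final conversion. Statistics before and after preprocessing are shown in Figure 9. After preprocessing, the dataset contains 46,000 conversion paths from 44,000 users and 82,000 touchpoints; just over 2,500 paths end in a conversion, and 27,000 touchpoints were clicked. The processed dataset is split 6:2:2 into training, validation, and test sets.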

Evaluation metrics

There are two groups of metrics. The first group measures how accurate the conversion and click predictions are. It includes the conversion-prediction loss:
$$\mathbf{LL}_{conv}=-\sum_{n=1}^N y^n\log(\hat{y}^n)$$

the click-prediction loss:
$$\mathbf{LL}_{click}=-\sum_{n=1}^N\sum_{t=1}^{T^n}z_t^n\log(\hat{z}_t^n)$$

and the AUC of the conversion prediction.

The second group measures how accurate the attribution weights are. The paper adopts the evaluation procedure from DARNN and computes the ROI of each channel as:
$$\text{ROI}_k=\frac{\sum_{\forall y^n=1}\sum_t a_t^n\,\mathbb{I}(c_t(k)=1)\,V(y^n)}{\sum_n\sum_t\text{cost}_t^n(k)}$$

In the numerator, $a_t^n$ is the attribution weight of touchpoint $t$ on user $n$'s conversion path, the indicator $\mathbb{I}(c_t(k)=1)$ keeps only the touchpoints on channel $k$, and $V(y^n)$ is the revenue of the conversion. Each conversion's revenue is thus split according to the attribution weights, and the shares assigned to channel $k$ over all converted paths add up to the channel's total revenue. In the denominator, $\text{cost}_t^n(k)$ is the ad cost of touchpoint $t$ on user $n$'s path on channel $k$; summing gives the channel's total cost.
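The ROI formula translates into a short accumulation loop. The sketch below assumes a hypothetical record layout (a dict per path with `(channel, weight, cost)` tuples) and made-up numbers; channels with zero spend are given ROI 0, which is a choice of this sketch and not something the text specifies.

```python
def channel_roi(paths, num_channels):
    """Attributed revenue divided by total cost, per channel."""
    revenue = [0.0] * num_channels
    cost = [0.0] * num_channels
    for path in paths:
        for channel, weight, spend in path["touchpoints"]:
            cost[channel] += spend                 # every touchpoint costs money
            if path["converted"]:                  # only converted paths earn revenue
                revenue[channel] += weight * path["value"]
    return [r / c if c > 0 else 0.0 for r, c in zip(revenue, cost)]


# Hypothetical data: one converted path worth 10.0 and one unconverted path.
paths = [
    {"converted": 1, "value": 10.0, "touchpoints": [(0, 0.25, 1.0), (1, 0.75, 2.0)]},
    {"converted": 0, "value": 0.0, "touchpoints": [(0, 0.5, 1.0), (0, 0.5, 2.0)]},
]
rois = channel_roi(paths, num_channels=2)
print(rois)
```

Channel 0 pays for touchpoints on both paths but is credited with only a quarter of the single conversion, so its ROI is far lower than channel 1's.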

Each channel's ROI is then used as its weight when splitting the total budget $B$:
$$b_k=\frac{\text{ROI}_{c_k}}{\sum_{v=1}^K\text{ROI}_{c_v}}\times B$$

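The higher a channel's ROI, the larger its budget. The historical conversion paths are then replayed in time order, starting with an empty blacklist of conversion paths. For each touchpoint whose conversion path is not on the blacklist, check whether its channel would go over budget, that is, whether the channel's remaining budget can cover the touchpoint's ad cost. If it cannot, add the conversion path to the blacklist. If it can, deduct the ad cost from the channel's budget and add it to the total spend, and increment the total conversion count if the touchpoint led to a conversion. After the replay, the total spend and total conversions, together with the CPA and CVR derived from them, indicate how accurate the attribution weights are.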
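The budget split and replay procedure can be sketched as follows. The event layout `(path_id, channel, cost, converted)` and the sample numbers are hypothetical; the logic follows the steps described above.

```python
def allocate_budgets(rois, total_budget):
    """Split the total budget in proportion to each channel's ROI."""
    total = sum(rois)
    return [total_budget * r / total for r in rois]


def replay(events, budgets):
    """Replay time-ordered events and return (total spend, total conversions)."""
    remaining = list(budgets)
    blacklist = set()
    spend, conversions = 0.0, 0
    for path_id, channel, cost, converted in events:
        if path_id in blacklist:
            continue                      # path already cut off
        if remaining[channel] < cost:
            blacklist.add(path_id)        # channel cannot afford this touchpoint
            continue
        remaining[channel] -= cost
        spend += cost
        conversions += converted          # 1 only if this touchpoint converted
    return spend, conversions


budgets = allocate_budgets([1.0, 3.0], total_budget=4.0)
events = [
    ("p1", 0, 1.0, 0),
    ("p2", 0, 1.0, 0),  # channel 0 is exhausted here, so p2 is blacklisted
    ("p1", 1, 2.0, 1),
    ("p2", 1, 1.0, 1),  # skipped because p2 is on the blacklist
]
spend, conversions = replay(events, budgets)
cpa = spend / conversions if conversions else float("inf")
print(budgets, spend, conversions, cpa)
```

Better attribution weights should steer budget toward channels that keep converting paths alive, giving more conversions for the same spend.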

Baselines

The paper uses five baseline models. Besides a basic logistic regression model, they include DNAMTA and DARNN, both covered in the previous post.

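Results

Figure 10 shows the conversion-prediction loss, click-prediction loss, and AUC of each model. CAMTA has the lowest loss and the highest AUC. "CAMTA ($\lambda=0$)" sets $\lambda=0$, which turns off domain-adversarial training and leaves the selection bias caused by confounders in place; it performs slightly worse than CAMTA, which shows the value of using the causal recurrent network to add domain-adversarial training, remove selection bias, and obtain unbiased representations when modeling user history. In addition, the best-performing models, DARNN and CAMTA, both predict conversions and clicks jointly, whereas the other baselines only use clicks as model input. This suggests that predicting clicks eases the sparsity of conversion data and improves prediction quality.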

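To evaluate the attribution weights, the paper scales the original budget by 1, 0.8, 0.6, and 0.2 to obtain new budgets and, for each of the four budgets, replays the historical data using the procedure from the "Evaluation metrics" section. The results are shown in Figure 11. In most settings CAMTA has the lowest CPA (cost per conversion), the highest CVR (conversion rate), and the most conversions.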

CausalMTA: handling confounding from multiple types of confounders in MTA

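The 2021 paper "CausalMTA: Eliminating the User Confounding Bias for Causal Multi-touch Attribution" proposes CausalMTA. Compared with CAMTA, it splits the user-preference confounder into invariant static attributes and changing dynamic features. For static attributes, it uses a variational recurrent auto-encoder as a channel-sequence generative model to obtain their unbiased distribution, then reweights each conversion path using that distribution and inverse probability weighting, removing the selection bias caused by static attributes. For dynamic features it resembles CAMTA: following CRN, it uses a recurrent neural network with domain-adversarial training to produce an unbiased representation of user history, removing the selection bias caused by dynamic features and yielding an unbiased conversion-prediction model. Finally, based on that model, it uses counterfactual analysis to compute each channel's Shapley value as its attribution weight: for each channel, the model predicts conversion with and without the channel, and the lift from adding the channel, its marginal expected incremental benefit, that is its Shapley value, is used as the channel's attribution weight.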

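Figure 12 shows the static attributes and dynamic features within user preference. Static attributes are those that do not change along the touchpoint sequence, such as age, gender, and occupation. Dynamic features are the touchpoint sequence formed by the ads the user has already seen.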

Problem formulation

Let $\mathcal{D}$ denote the ad touchpoint dataset, containing $N$ conversion paths from $U$ users. Each conversion path is a triple $(\mathbf{u}^i,\mathbf{J}^i,y^i)$, where $\mathbf{u}^i$ is the user's static attributes, $\mathbf{J}^i$ is the touchpoint sequence, and $y^i$ is the binary flag for whether a conversion finally occurred (0 for no conversion, 1 for conversion). Each touchpoint sequence consists of several touchpoints, $\{\mathbf{p}_t^i\}_{t=1}^{T^i}$, and each touchpoint has two parts, $\mathbf{p}_t^i=(\mathbf{c}_t^i,\mathbf{f}_t^i)$. $\mathbf{c}_t^i$ is the channel, one of $K$ channels, $\mathbf{c}_t^i\in\{\mathbf{c}_1,\dots,\mathbf{c}_k,\dots,\mathbf{c}_K\}$, and $\mathbf{f}_t^i$ is the user's dynamic features (the ads the user has seen before the current touchpoint). The goal of multi-touch attribution is to model conversion paths and compute the attribution weight of each channel.

Approach

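The overall CausalMTA model is shown in Figure 13 and has three parts. The first part is journey reweighting (Journey Reweighting): for static attributes, it uses a variational recurrent auto-encoder as a channel-sequence generative model to obtain their unbiased distribution, then reweights each conversion path using that distribution and inverse probability weighting, removing the selection bias caused by static attributes. The second part is causal conversion prediction (Causal Conversion Prediction): for dynamic features, following CRN, it uses a recurrent neural network with domain-adversarial training to produce an unbiased representation of user history, removing the selection bias caused by dynamic features and yielding an unbiased conversion-prediction model. The third part is attribution (Attribution): based on the conversion-prediction model, it uses counterfactual analysis to compute each channel's Shapley value as its attribution weight. For each channel, the model predicts conversion with and without the channel; the resulting lift is the channel's marginal expected incremental benefit, that is its Shapley value, and is used as its attribution weight.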
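The with/without comparison in the attribution step can be sketched as below. Several things here are assumptions for illustration: `toy_predict` is a hypothetical stand-in for the trained conversion model, clipping negative lifts to zero and normalizing the lifts to sum to 1 are choices of this sketch, and the code computes a single leave-one-channel-out difference per channel, whereas a full Shapley value would average marginal contributions over many channel subsets.

```python
def remove_channel(path, channel):
    """Counterfactual path with every touchpoint of one channel dropped."""
    return [c for c in path if c != channel]


def channel_attribution(path, predict):
    """Normalized conversion-rate lift from including each channel."""
    base = predict(path)
    lifts = {}
    for channel in set(path):
        without = predict(remove_channel(path, channel))
        lifts[channel] = max(base - without, 0.0)  # clip negative lift (assumption)
    total = sum(lifts.values())
    return {c: (lift / total if total > 0 else 0.0) for c, lift in lifts.items()}


def toy_predict(path):
    """Hypothetical conversion model: additive per-channel effects plus a base rate."""
    effect = {"search": 0.25, "feed": 0.125}
    return 0.0625 + sum(effect.get(c, 0.0) for c in set(path))


path = ["search", "feed", "search"]
weights = channel_attribution(path, toy_predict)
print(weights)
```

With this toy predictor, removing search costs twice as much predicted conversion as removing feed, so search receives two thirds of the credit.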

Journey reweighting

Channel-sequence generation

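The variational recurrent auto-encoder (Variational Recurrent Auto-Encoder, VRAE), like the earlier auto-encoder (AutoEncoder, AE) and variational auto-encoder (Variational AutoEncoder, VAE), is a generative model. The paper uses a variational recurrent auto-encoder as the channel-sequence generative model to obtain the generation probability of a channel sequence.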

For an introduction to auto-encoders and variational auto-encoders, see the Zhihu article 《AutoEncoder (AE) 和 Variational AutoEncoder (VAE) 的详细介绍和对比》.

The structure of an auto-encoder is shown in Figure 14. The input $x$ is compressed by the encoder $e_\theta$ into a low-dimensional latent vector $z$, and the decoder $d_\phi$ then reconstructs the output $\hat{x}$ from $z$. The training objective is for the output to reproduce the input as closely as possible, minimizing the difference between them:
$$\text{loss}=\lVert x-\hat{x}\rVert_2=\lVert x-d_\phi(z)\rVert_2=\lVert x-d_\phi(e_\theta(x))\rVert_2$$

The structure of a variational auto-encoder is shown in Figure 15. As with the auto-encoder, the input $x$ is compressed by the encoder $e_\theta$ and reconstructed by the decoder $d_\phi$ into the output $\hat{x}$. The difference is that the encoder does not output fixed values for the dimensions of the low-dimensional latent vector. It outputs a probability distribution over them, assumed to be normal, $z\sim\mathcal{N}(\mu_x,\sigma_x^2)$, so what the encoder actually outputs is the mean and standard deviation of that distribution. Given the distribution, the latent vector $z$ is sampled with the reparameterization trick:
$$z=\mu_x+\sigma_x\epsilon,\quad\epsilon\sim\mathcal{N}(0,\mathbf{I})$$

From there the procedure matches the auto-encoder: the decoder $d_\phi$ reconstructs the output $\hat{x}$ from the latent vector $z$.

The loss of a variational auto-encoder has two parts. The first is the reconstruction loss, the same as the auto-encoder loss, which minimizes the difference between output and input:
$$\text{reconstruction loss}=\lVert x-\hat{x}\rVert_2=\lVert x-d_\phi(z)\rVert_2=\lVert x-d_\phi(\mu_x+\sigma_x\epsilon)\rVert_2$$

The second is the similarity loss, the KL divergence between the distribution output by the encoder and the standard normal distribution:
$$\text{similarity loss}=\text{KL Divergence}=D_{\text{KL}}\left(\mathcal{N}(\mu_x,\sigma_x^2)\,\|\,\mathcal{N}(0,\mathbf{I})\right)$$

The total loss is the sum of the two parts:
$$\text{loss}=\text{reconstruction loss}+\text{similarity loss}$$
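For a diagonal Gaussian the KL term has the well-known closed form $\frac{1}{2}\sum_d(\mu_d^2+\sigma_d^2-1-\log\sigma_d^2)$, so the total loss can be evaluated without sampling the KL part. The sketch below applies it to hand-picked vectors; the function names and numbers are hypothetical, and the reconstruction is passed in directly instead of coming from a decoder.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def reconstruction_loss(x, x_hat):
    """Euclidean norm ||x - x_hat||_2, as in the text."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


def vae_loss(x, x_hat, mu, sigma):
    return reconstruction_loss(x, x_hat) + kl_to_standard_normal(mu, sigma)


# Encoder output equal to the prior: the similarity loss vanishes.
print(vae_loss([3.0, 4.0], [0.0, 0.0], mu=[0.0, 0.0], sigma=[1.0, 1.0]))
# Shifting one mean away from 0 adds a KL penalty.
print(kl_to_standard_normal([1.0, 0.0], [1.0, 1.0]))
```

The KL term pulls every encoded distribution toward the standard normal, which is what makes sampling from the prior meaningful at generation time.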

The variational recurrent auto-encoder (Variational Recurrent AutoEncoder, VRAE) was proposed in the 2014 paper "Variational Recurrent Auto-Encoders" and combines the variational auto-encoder with an RNN. The CausalMTA paper uses it as the channel-sequence generative model, shown in the upper dashed box of Figure 16. The channel sequence $\mathbf{C}=\{\mathbf{c}_t\}_{t=1}^T$ of a conversion path $(\mathbf{u},\mathbf{J},y)$ is the input to the encoder, which models it with an RNN/LSTM. The LSTM outputs a hidden state for each channel in the sequence:
$$\{\mathbf{h}_t\}_{t=1}^T=\text{LSTM}_{\text{enc}}(\mathbf{C},\mathbf{h}_0)$$

其中， <math xmlns="http://www.w3.org/1998/Math/MathML"> h 0 \mathbf{h}_0 </math>h0为初始隐状态，LSTM的结构不在此处详述，可以参考《动手学深度学习-LSTM部分》。

The last channel in the sequence yields the hidden state $\mathbf{h}_T$ from the LSTM. A linear map of $\mathbf{h}_T$ gives the mean $\mu_z$ of the latent variable $\mathbf{z}$:
$$\mu_z=W_\mu\mathbf{h}_T+\mathbf{b}_\mu$$

A second linear map of $\mathbf{h}_T$, followed by exponentiation, gives the standard deviation $\sigma_z$ of the latent variable $\mathbf{z}$:
$$\log(\sigma_z)=W_\sigma\mathbf{h}_T+\mathbf{b}_\sigma$$

As in the VAE, the reparameterization trick is used to sample the low-dimensional latent vector $\mathbf{z}$:
$$\mathbf{z}=\mu_z+\sigma_z\epsilon,\quad\epsilon\sim\mathcal{N}(0,\mathbf{I})$$

The VRAE decoder also uses an RNN/LSTM and outputs the reconstructed channel sequence $\mathbf{C}_{\text{out}}=\{\mathbf{c}_t'\}_{t=1}^T$. The latent vector $\mathbf{z}$ is passed through a linear map and a tanh function to obtain the initial hidden state $\mathbf{h}_0'$ of the decoder's recurrent network:
$$\mathbf{h}_0'=\tanh(W_z^T\mathbf{z}+\mathbf{b}_z)$$

Sequence modeling in the decoder is autoregressive in style: the hidden state $\mathbf{h}_t'$ at step $t$ is produced by the LSTM from the hidden state $\mathbf{h}_{t-1}'$ and the channel $\mathbf{c}_{t-1}'$ of step $t-1$:
$$\{\mathbf{h}_t'\}_{t=1}^T=\text{LSTM}_{\text{dec}}(\mathbf{C}_{\text{out}},\mathbf{h}_0')$$

The channel $\mathbf{c}_t'$ at step $t$ is then obtained from the hidden state $\mathbf{h}_t'$ of the same step through a linear map and a softmax function:
$$\mathbf{c}_t'=\text{softmax}(W_O\mathbf{h}_t'+\mathbf{b}_O)$$

The loss function resembles that of the VAE and also has two parts:
$$\mathcal{L}_w=\alpha\sum_{i=1}^N\sum_{t=1}^{T_i}CE(\mathbf{c}_t,\mathbf{c}_t')+\beta D_{KL}(q_\phi(\mathbf{z})\,\|\,p_\theta(\mathbf{z}))$$

The first part is the cross-entropy loss between $\mathbf{c}_t$ and $\mathbf{c}_t'$, and the second part is the KL divergence between the posterior and the prior of the latent variable $\mathbf{z}$. Here $p_\theta(\mathbf{z})$ is the prior, assumed to be the standard normal distribution $\mathcal{N}(0,\mathbf{I})$; $q_\phi(\mathbf{z}|\mathbf{c}^i)$ is the posterior, meaning that for the channel sequence $\mathbf{c}^i$ of a conversion path in the dataset, the latent variable follows the normal distribution $\mathcal{N}(\mu^i,(\sigma^i)^2)$; $\alpha$ and $\beta$ are hyperparameters that control the weights of the two parts.
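
For a single conversion path, the two terms of $\mathcal{L}_w$ can be sketched in plain Python as below. This is only an illustration under assumptions that are not stated in the paper: channels are given as integer indices, `recon_probs` holds the decoder's softmax outputs per step, the encoder head outputs `log_sigma`, and the KL term is counted once per path. The names `path_loss`, `recon_probs`, and `log_sigma` are hypothetical.

```python
import math


def cross_entropy(true_channel, probs):
    """CE between a one-hot channel (given as an index) and a softmax output."""
    return -math.log(probs[true_channel])


def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), parameterized by log(sigma)."""
    return 0.5 * sum(math.exp(2.0 * ls) + m * m - 1.0 - 2.0 * ls
                     for m, ls in zip(mu, log_sigma))


def path_loss(channels, recon_probs, mu, log_sigma, alpha=1.0, beta=1.0):
    """alpha * (sum of per-step CE) + beta * KL for one conversion path."""
    ce = sum(cross_entropy(c, p) for c, p in zip(channels, recon_probs))
    return alpha * ce + beta * kl_to_standard_normal(mu, log_sigma)


# Toy path with two touchpoints over two channels (made-up numbers).
channels = [0, 1]
recon_probs = [[0.5, 0.5], [0.25, 0.75]]
print(path_loss(channels, recon_probs, mu=[0.0], log_sigma=[0.0]))
```

Summing `path_loss` over all paths in the dataset gives the total under the same assumptions; raising `beta` pulls the posterior closer to the prior at the cost of reconstruction accuracy.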

Conversion Path Reweighting

With enough training data, if the influence of users' static attributes is ignored, the distribution of channel sequences tends toward random. Under this setting the VRAE tends to generate unbiased channel sequences. Channel sequences that resemble randomized delivery should receive higher weights than samples (conversion paths) shaped by user preference. The weights can be obtained by inverse probability weighting, and the learned sample weights should satisfy:
$$W_T(\mathbf{u},\mathbf{C})=p(\mathbf{C})/p(\mathbf{C}|\mathbf{u})$$

Using the latent-variable distribution of each sample's channel sequence already obtained from the VRAE, the weight of each sample can be computed as:
$$w^i=W_T(\mathbf{u}^i,\mathbf{c}^i)=\left\{\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{c}^i)}\left[\frac{1}{W_z(\mathbf{u}^i,\mathbf{z})}\right]\right\}^{-1}$$

$W_z(\mathbf{u}^i,\mathbf{z})$ is estimated with a domain classifier. Specifically, user static attributes paired with the channel-sequence latent variable from the VRAE serve as positive samples, $\{(\mathbf{u}^i,\mathbf{z})\}_{1\le i\le N}$ with $\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{c}^i)$; user static attributes paired with a latent variable sampled from the standard normal prior serve as negative samples, $\{(\mathbf{u}^i,\mathbf{z})\}_{1\le i\le N}$ with $\mathbf{z}\sim p_\theta(\mathbf{z})$. The domain classifier is trained on these samples. It has a conventional Embedding+MLP structure, given below: the user static attributes are embedded, the embedding vector is concatenated with the latent variable, the result is fed to a multilayer perceptron, and the output of the last layer goes through a sigmoid function to give the probability of the positive class:
$$\begin{aligned} \mathbf{e}_u&=\text{Embedding}(\mathbf{u}) \\ \mathbf{x}&=\text{concat}(\mathbf{e}_u,\mathbf{z}) \\ p_{\theta_d}(L|\mathbf{u},\mathbf{z})&=\text{sigmoid}(\text{MLP}(\mathbf{x})) \end{aligned}$$

Once the classifier is trained, $W_z(\mathbf{u},\mathbf{z})$ can be computed as:
$$W_z(\mathbf{u},\mathbf{z})=\frac{p(\mathbf{u},\mathbf{z}|L=0)}{p(\mathbf{u},\mathbf{z}|L=1)}=\frac{p(L=0|\mathbf{u},\mathbf{z})}{p(L=1|\mathbf{u},\mathbf{z})}$$
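
The second equality is the usual density-ratio trick and follows from Bayes' rule. The step below is a short worked derivation, not taken from the paper; it assumes the classifier sees equally many positive and negative samples ($N$ each, as in the sample sets above), so that $p(L=0)=p(L=1)$:

$$\begin{aligned} p(\mathbf{u},\mathbf{z}|L=l)&=\frac{p(L=l|\mathbf{u},\mathbf{z})\,p(\mathbf{u},\mathbf{z})}{p(L=l)},\quad l\in\{0,1\} \\ \frac{p(\mathbf{u},\mathbf{z}|L=0)}{p(\mathbf{u},\mathbf{z}|L=1)}&=\frac{p(L=0|\mathbf{u},\mathbf{z})}{p(L=1|\mathbf{u},\mathbf{z})}\cdot\frac{p(L=1)}{p(L=0)}=\frac{p(L=0|\mathbf{u},\mathbf{z})}{p(L=1|\mathbf{u},\mathbf{z})} \end{aligned}$$

If the two classes were not balanced, the factor $p(L=1)/p(L=0)$ would have to be kept as a constant correction.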

Finally, combining the formula above with the formula for $w^i$ yields the weight of each sample (conversion path), $\{w^i\}_{i=1}^N$, which is used to reweight the samples.
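
Given the classifier's positive-class probabilities for several latent samples of one path, the weight computation can be sketched as follows. This is a minimal sketch under stated assumptions: the expectation over $\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{c}^i)$ is approximated by a Monte Carlo mean, and the `eps` clipping is an added numerical safeguard that the paper does not mention. The function names are hypothetical.

```python
def density_ratio(p_positive, eps=1e-6):
    """W_z(u, z) = p(L=0 | u, z) / p(L=1 | u, z) from the classifier output."""
    p = min(max(p_positive, eps), 1.0 - eps)  # keep the ratio finite
    return (1.0 - p) / p


def path_weight(p_positive_samples, eps=1e-6):
    """w^i = { E_z[1 / W_z(u^i, z)] }^{-1}, with E_z estimated by a sample mean.

    p_positive_samples: classifier outputs for (u^i, z_k), z_k ~ q(z | c^i).
    """
    inverse_ratios = [1.0 / density_ratio(p, eps) for p in p_positive_samples]
    return len(inverse_ratios) / sum(inverse_ratios)


print(path_weight([0.5, 0.5]))       # classifier cannot tell the path from random
print(path_weight([0.8, 0.7, 0.9]))  # path looks strongly user-specific
```

A path whose latent variable the classifier confidently recognizes as belonging to its user receives a weight below 1, while a path that looks like a draw from the prior receives a weight near or above 1, which matches the intent of up-weighting randomized-looking sequences.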

Causal Conversion Prediction

Causal conversion prediction is similar to CAMTA and likewise borrows from CRN: a recurrent neural network combined with domain adversarial training produces an unbiased representation of the user history. The channel sequence $\mathbf{C}_+=\{\mathbf{c}_0,\mathbf{c}_t\}_{t=1}^{T-1}$, the user dynamic feature sequence $\mathbf{F}=\{\mathbf{f}_t\}_{t=1}^T$, and the user static attributes $\mathbf{u}$ are passed through an embedding layer to obtain their embedding vectors. Then, step by step, the channel embedding and the user dynamic feature embedding of each step are concatenated and fed to an LSTM, which outputs the user-history representation at each step, $\{\mathbf{out}_t\}_{t=1}^T$, as follows:
$$\begin{aligned} \mathbf{e}_u,\mathbf{e}_{C_+},\mathbf{e}_F&=\text{Embedding}(\mathbf{u},\mathbf{C}_+,\mathbf{F})\\ \mathbf{v}_{in}&=\text{concat}(\mathbf{e}_{C_+},\mathbf{e}_F)\\ \{\mathbf{out}_t\}_{t=1}^T&=\text{LSTM}_{\text{pred}}(\mathbf{v}_{in},\mathbf{h}_0) \end{aligned}$$

Here the input at step $t$ consists of the embedding vectors of $\mathbf{c}_{t-1}$ and $\mathbf{f}_t$ together with the preceding hidden state, and the output is $\mathbf{out}_t$.

To obtain an unbiased representation of the user history, the paper follows CRN and further passes $\{\mathbf{out}_t\}_{t=1}^T$ through a gradient reversal layer (GRL) and a channel prediction classifier (MLP+Softmax) that outputs the probability of each channel:
$$\begin{aligned} \{\mathbf{v}_t^{rev}\}_{t=1}^T&=\text{MLP}(\text{GRL}(\{\mathbf{out}_t\}_{t=1}^T))\\ \{\mathbf{c}_t^{rev}\}_{t=1}^T&=\text{softmax}(\{\mathbf{v}_t^{rev}\}_{t=1}^T) \end{aligned}$$

During training, $\{\mathbf{out}_t\}_{t=1}^T$ should carry as little information as possible for predicting the channel. The gradient of the channel-prediction classification loss is therefore reversed by the gradient reversal layer before it reaches the layers that produce the representation: the classifier itself still learns to predict the channel, while the upstream LSTM is pushed toward representations from which the channel cannot be predicted.
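
The behavior of a gradient reversal layer can be shown with a framework-free sketch. This is a toy illustration, assuming gradients are passed around as plain lists; the scaling factor `lam` comes from the general domain-adversarial training recipe and is an assumption here, since the notes above do not say whether CausalMTA scales the reversed gradient.

```python
class GradientReversal:
    """Identity in the forward pass; flips (and scales) gradients going back."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical strength of the reversal

    def forward(self, x):
        # The classifier sees the representation unchanged.
        return list(x)

    def backward(self, grad_output):
        # The layers before the GRL receive the negated gradient, so a step
        # that lowers the classifier loss raises it from their point of view.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
out_t = [0.3, -1.2]
print(grl.forward(out_t))         # same values as out_t
print(grl.backward([2.0, -4.0]))  # gradient handed to the upstream LSTM
```

In an autograd framework the same idea is usually written as a custom function whose backward pass returns the negated incoming gradient.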

The paper also introduces an attention mechanism. It computes attention scores between each per-step unbiased user-history representation in $\{\mathbf{out}_t\}_{t=1}^T$ and $\mathbf{out}_T$, then takes the attention-weighted sum of the per-step representations to obtain the final unbiased user-history representation $\mathbf{v}_{attn}$. This $\mathbf{v}_{attn}$ is passed through the conversion prediction classifier (MLP+Softmax), which outputs the probability that a conversion eventually occurs:
$$\begin{aligned} \mathbf{v}_{attn}&=\text{Attention}(\mathbf{out}_T,\{\mathbf{out}_t\}_{t=1}^T)\\ \mathbf{v}_{pred}&=\text{softmax}(\text{MLP}(\mathbf{v}_{attn})) \end{aligned}$$
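
The attention pooling step can be sketched as below. The notes do not specify the scoring function, so this sketch assumes plain dot-product scores between the query $\mathbf{out}_T$ and each $\mathbf{out}_t$; the paper may use a different (for example learned) score. `attention_pool` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, states):
    """Weighted sum of `states`, weighted by softmax(dot(query, state))."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    pooled = [sum(w * state[d] for w, state in zip(weights, states))
              for d in range(len(query))]
    return pooled, weights


# Toy per-step representations out_1..out_3; the last one is the query out_T.
outs = [[0.1, 0.0], [0.4, 0.2], [1.0, 0.5]]
v_attn, weights = attention_pool(outs[-1], outs)
print(weights)
print(v_attn)
```

The returned `weights` sum to 1 over the steps of the path, and `v_attn` has the same dimension as each per-step representation.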

The overall model loss is the weighted sum of the channel-prediction classification loss and the conversion-prediction classification loss:
$$\mathcal{L}_p=\gamma\sum_{i=1}^N\sum_{t=1}^{T_i}CE(\mathbf{c}_t^{rev},\mathbf{c}_t)+\delta\sum_{i=1}^N w^i\cdot CE(\mathbf{v}_{pred}^i,y^i)$$

Here $\gamma$ and $\delta$ are hyperparameters, and $w^i$ is the sample weight computed in the "Conversion Path Reweighting" part.

Attribution Weight Calculation

Based on the conversion-rate prediction model, the paper uses counterfactual analysis to compute each channel's Shapley value as its attribution weight. For each channel, the model predicts the conversion rate with and without that channel; the conversion-rate lift from introducing the channel is its marginal expected incremental benefit to conversion, that is, its Shapley value, which serves as the channel's attribution weight.

Let $\mathbf{J}^i\backslash\{\mathbf{p}_t^i\}$ denote the counterfactual touchpoint sequence obtained by removing touchpoint $\mathbf{p}_t^i$ from conversion path $\mathbf{J}^i$, let $\mathcal{S}$ range over the touchpoint sub-sequences of $\mathbf{J}^i\backslash\{\mathbf{p}_t^i\}$, and let $p(\mathbf{J}^i)$ denote the predicted conversion rate of conversion path $\mathbf{J}^i$. The Shapley value $SV_t^i$ of channel $\mathbf{c}_t^i$ is then:
$$SV_t^i=\sum_{\mathcal{S}\subseteq\mathbf{J}^i\backslash\{\mathbf{p}_t^i\}}\frac{|\mathcal{S}|!\,(|\mathbf{J}^i|-|\mathcal{S}|-1)!}{|\mathbf{J}^i|!}\left[p(\mathcal{S}\cup\{\mathbf{p}_t^i\})-p(\mathcal{S})\right]$$

That is, for every sub-sequence of $\mathbf{J}^i\backslash\{\mathbf{p}_t^i\}$, compute the gain in predicted conversion rate from adding channel $\mathbf{c}_t^i$ to it, then sum these gains over all sub-sequences, each weighted by the combinatorial coefficient above, to obtain the channel's Shapley value.

The paper further normalizes the Shapley values:
$$\mathbf{a}_t^i=\frac{\sigma(SV_t^i)}{\sum_{t'=1}^{T_i}\sigma(SV_{t'}^i)}$$

Here $\sigma(x)=\max(0,x)$, and $\mathbf{a}_t^i$ is the conversion attribution weight of the $t$-th ad touchpoint in conversion path $\mathbf{J}^i$.
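
The Shapley computation and the normalization can be sketched in plain Python. This is an illustrative sketch, not the paper's code: it enumerates every subset exactly (exponential in path length, so only practical for short paths), and it replaces the trained conversion model with a hypothetical `toy_predict` whose lifts are made-up numbers.

```python
from itertools import combinations
from math import factorial


def shapley_values(path, predict):
    """Exact Shapley value of every touchpoint position in `path`.

    predict: maps a tuple of touchpoints (original order kept) to a
    predicted conversion rate.
    """
    n = len(path)
    values = []
    for t in range(n):
        others = [k for k in range(n) if k != t]
        sv = 0.0
        for size in range(n):  # subset sizes 0 .. n-1
            coeff = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = tuple(path[k] for k in subset)
                with_t = tuple(path[k] for k in sorted(subset + (t,)))
                sv += coeff * (predict(with_t) - predict(without))
        values.append(sv)
    return values


def normalize(values):
    """a_t = max(0, SV_t) / sum over t' of max(0, SV_t')."""
    clipped = [max(0.0, v) for v in values]
    total = sum(clipped)
    return [c / total for c in clipped] if total > 0 else [0.0] * len(values)


LIFT = {"search": 0.2, "feed": 0.1, "splash": -0.05}  # made-up lifts


def toy_predict(touchpoints):
    """Hypothetical additive stand-in for the conversion model."""
    return 0.1 + sum(LIFT[c] for c in touchpoints)


path = ["search", "feed", "splash"]
sv = shapley_values(path, toy_predict)
print(sv)
print(normalize(sv))
```

With an additive predictor each touchpoint's Shapley value equals its own lift, which makes the toy case easy to check by hand; a touchpoint with a negative value is clipped to zero and receives no attribution.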

Overall Procedure

To summarize, the pseudocode of the overall CausalMTA procedure is shown in Figure 17. It consists of four steps:

1. Train the channel-sequence generation model. For each conversion path in dataset $\mathcal{D}$, feed its channel sequence through the VRAE to generate the latent variable and a new channel sequence, compute the value and gradient of the loss $\mathcal{L}_w$, and update the autoencoder parameters by gradient descent.
2. Compute the conversion path weights. For each conversion path in dataset $\mathcal{D}$, build the classifier's positive and negative samples from the autoencoder's latent variable and from samples of the standard normal distribution, compute the classifier loss and its gradient, update the classifier parameters by gradient descent, and then compute the weight of each conversion path from the classifier's predictions.
3. Train the conversion prediction model. For each conversion path in dataset $\mathcal{D}$, compute the value and gradient of the conversion prediction loss $\mathcal{L}_p$ and update the model parameters by gradient descent.
4. Compute the attribution weights. For each conversion path in dataset $\mathcal{D}$, compute the Shapley value $SV_t^i$ of each touchpoint channel $\mathbf{c}_t^i$ on the path, then normalize to obtain that touchpoint's conversion attribution weight.

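Experimental Analysis

Like CAMTA, the paper uses the Criteo dataset, which is widely used in multi-touch attribution, and processes it in the same way. Statistics of the dataset before and after processing are shown in Figure 9. The paper also uses a synthetic dataset and an online advertising dataset from Alibaba.

The baselines include CAMTA as well as DNAMTA, DARNN, and JDMTA, which were introduced in the previous reading notes. The evaluation metrics and methods are also the same as CAMTA's, split into an evaluation of conversion-rate prediction and an evaluation of attribution-weight calculation.

The conversion-rate prediction results are shown in Figure 19: CausalMTA achieves the highest AUC and performs best.

The attribution-weight results are shown in Figure 20: in most cases CausalMTA has the lowest CPA (cost per conversion), the highest CVR (conversion rate), and the largest number of conversions.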
