Document Intelligence: OCR + RocketQA + LayoutXLM

This post starts with LayoutLMv2. My notes on the paper follow:

First, a term to know: visually-rich document understanding tasks, abbreviated VrDU.

Second, "the text fields of interest", which is analogous to the region of interest (RoI) in image recognition. An AI assistant explains the term as follows:

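In document intelligence, a system that processes and analyzes documents must automatically identify and extract the key information in them. This information usually appears as text fields, which are called the "text fields of interest". These fields matter because they carry the core content of the document and support fast retrieval, classification, summarization, and similar goals.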

The authors note that recent VrDU work mainly follows two directions:

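The first direction is usually built on shallow fusion between textual and visual/layout/style information. These approaches use pre-trained NLP and CV models separately and combine the information from multiple modalities for supervised learning. Although they perform well, the domain knowledge of one document type cannot easily be transferred to another, so these models often need to be re-trained once the document type changes.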

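The second direction relies on deep fusion among the textual, visual, and layout information of a large number of unlabeled documents from different domains.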

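The pre-trained model absorbs cross-modal knowledge from different document types, so the local invariance among these layouts and styles is preserved. In addition, when the model needs to be transferred to another domain with different document formats, a few labeled samples are enough to fine-tune the generic model.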

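In the pre-training stage, LayoutLMv2 integrates visual information and uses a Transformer to learn the cross-modal interaction between visual and textual information.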

For the pre-training strategies, we use two new training objectives for LayoutLMv2 in addition to the masked visual-language modeling.

The first is the proposed text-image alignment strategy, which aligns the textlines and the corresponding image regions.

The second is the text-image matching strategy, where the model learns whether the document image and textual content are correlated.

Section 2 of the paper, which describes the model, first introduces three parts: Text Embedding, Visual Embedding, and Layout Embedding.

1. Text Embedding:

Besides tokenizing and encoding the text, start and end markers ([CLS] and [SEP]) are added. A 1D positional embedding is used, together with a segment $s_i \in \{[A], [B]\}$.

Here, the segment embedding is used to distinguish different text segments.

Note that the maximum sequence length is set to L:

Extra [PAD] tokens are appended to the end so that the final sequence's length is exactly the maximum sequence length L.
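To make the text side concrete, here is a minimal sketch of building the input sequence: wrap the tokens with [CLS]/[SEP], pad to L, and produce the 1D positions and segment labels that the three text embeddings would look up. The function name `build_text_inputs`, the attention mask, and the choice to give [PAD] tokens the same segment as the text are my own assumptions for illustration, not details taken from the paper.

```python
def build_text_inputs(tokens, max_len, segment="A"):
    """Wrap tokens with [CLS]/[SEP], pad to max_len, and return lookup keys.

    Hypothetical helper: names and the padding-segment choice are assumptions.
    """
    body = tokens[: max_len - 2]              # leave room for [CLS] and [SEP]
    seq = ["[CLS]"] + body + ["[SEP]"]
    real_len = len(seq)
    seq = seq + ["[PAD]"] * (max_len - real_len)
    return {
        "tokens": seq,                        # input to the token embedding
        "positions": list(range(max_len)),    # input to the 1D positional embedding
        "segments": [segment] * max_len,      # input to the segment embedding
        "attention_mask": [1] * real_len + [0] * (max_len - real_len),
    }
```

For example, `build_text_inputs(["invoice", "total"], 6)` yields a six-token sequence ending in two [PAD] tokens, so the final length is exactly L = 6.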

2. Visual Embedding:

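After the ResNeXt-FPN backbone, the W × H feature map is flattened into the visual token sequence VisTokEmb(I) of length W × H. A linear layer then projects each visual token embedding to the same dimension as the text embeddings.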

As with text, a 1D positional embedding is used: the 1D positional embedding is shared with the text embedding layer.

Likewise, for the segment embedding, we attach all visual tokens to the visual segment [C].
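The visual embedding steps above (flatten, project, add the shared 1D position and the [C] segment) can be sketched with NumPy. This is only an illustration under assumed shapes: `feature_map` stands in for the backbone output, and `proj`, `pos_emb_1d`, and `seg_emb_c` are hypothetical parameter arrays, not the real model weights.

```python
import numpy as np

def visual_embeddings(feature_map, proj, pos_emb_1d, seg_emb_c):
    """Turn a (C, H, W) feature map into W*H visual token embeddings.

    proj:       (C, d) linear projection to the text embedding size d
    pos_emb_1d: (>= W*H, d) 1D positional table, shared with the text side
    seg_emb_c:  (d,) embedding of the visual segment [C]
    """
    c, h, w = feature_map.shape
    tokens = feature_map.reshape(c, h * w).T      # (W*H, C), row-major order
    projected = tokens @ proj                     # (W*H, d)
    return projected + pos_emb_1d[: h * w] + seg_emb_c
```

With a 7 × 7 feature map this gives 49 visual tokens, each with the same dimension as a text token, so the two sequences can later be concatenated.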

3. Layout Embedding:

The layout embedding layer is for embedding the spatial layout information represented by axis-aligned token bounding boxes from the OCR results, in which box width and height together with corner coordinates are identified.
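As a rough sketch of what goes into the layout embedding, the helper below turns one OCR box into the six integers (corner coordinates plus width and height) that would index the embedding tables. Normalizing to a 0..1000 grid is my assumption about the preprocessing, and `layout_features` is a hypothetical name.

```python
def layout_features(box, page_width, page_height, scale=1000):
    """Normalize a pixel box (x0, y0, x1, y1) to integers in [0, scale].

    Returns the x-axis features (x_min, x_max, width) and the
    y-axis features (y_min, y_max, height) separately.
    """
    x0, y0, x1, y1 = box
    nx0 = int(scale * x0 / page_width)
    nx1 = int(scale * x1 / page_width)
    ny0 = int(scale * y0 / page_height)
    ny1 = int(scale * y1 / page_height)
    return {"x": (nx0, nx1, nx1 - nx0), "y": (ny0, ny1, ny1 - ny0)}
```

Keeping the x and y features separate mirrors the idea of looking each axis up in its own embedding table and then combining the results into a single layout vector.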

4. Multi-modal Encoder with Spatial-Aware Self-Attention Mechanism

The encoder concatenates visual embeddings $\{v_0, ..., v_{WH-1}\}$ and text embeddings $\{t_0, ..., t_{L-1}\}$ to a unified sequence,

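and fuses spatial information by adding the layout embeddings to get the i-th ($0 \le i < WH + L$) first layer input:

$$x_i^{(0)} = X_i + l_i$$

where $X$ is the concatenated sequence of visual and text embeddings and $l_i$ is the layout embedding of the i-th element.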

Then, to introduce relative rather than absolute positions, a bias term b is added inside the Transformer attention mechanism, before the softmax:

we model the semantic relative position and spatial relative position as bias terms to prevent adding too many parameters.

Let $b^{(1D)}$, $b^{(2D_x)}$ and $b^{(2D_y)}$ denote the learnable 1D and 2D relative position biases respectively.

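Assuming $(x_i, y_i)$ anchors the top left corner coordinates of the i-th bounding box, we obtain the spatial-aware attention score:

$$\alpha'_{ij} = \alpha_{ij} + b^{(1D)}_{j-i} + b^{(2D_x)}_{x_j - x_i} + b^{(2D_y)}_{y_j - y_i}$$

where $\alpha_{ij}$ is the original scaled dot-product attention score between positions i and j.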
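A minimal NumPy sketch of adding the relative position biases to the attention scores of a single head, before the softmax. The bias tables here are indexed by the relative offset clipped to `[-max_rel, max_rel]`; that clipping is a simplification I chose for illustration and may differ from how the official implementation buckets offsets. All names are hypothetical.

```python
import numpy as np

def spatial_aware_scores(q, k, pos_1d, xs, ys, b_1d, b_2dx, b_2dy, max_rel):
    """Pre-softmax scores with 1D and 2D relative position biases (one head).

    q, k:              (n, d_head) query and key matrices
    pos_1d, xs, ys:    (n,) integer sequence index and top-left box coordinates
    b_1d, b_2dx, b_2dy: (2 * max_rel + 1,) learnable bias tables
    """
    d_head = q.shape[-1]
    alpha = q @ k.T / np.sqrt(d_head)             # plain scaled dot-product score

    def rel_index(v):
        rel = v[None, :] - v[:, None]             # rel[i, j] = v[j] - v[i]
        return np.clip(rel, -max_rel, max_rel) + max_rel

    return (alpha
            + b_1d[rel_index(pos_1d)]
            + b_2dx[rel_index(xs)]
            + b_2dy[rel_index(ys)])

def softmax_rows(scores):
    """Row-wise softmax, turning biased scores into attention weights."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)
```

Because the biases only shift the scores, setting all three tables to zero recovers ordinary self-attention, which makes the extra cost of this mechanism easy to see: three small lookup tables per head.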

About the bias terms here:

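In deep learning and computer vision, bias terms are usually learned and optimized together with the model's other parameters (such as weights), but they do not correspond directly to continuous features or positions in the input data.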

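Instead, bias terms are part of the model parameters and adjust the output of an activation function or the scores of an attention mechanism, giving the model extra flexibility.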

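In tasks that involve spatial position information (such as object detection in images, or positional encoding in natural language processing), we may want to integrate the spatial positions into the model in some way.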

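Spatial positions vary continuously (for example, pixel coordinates in an image), whereas a learnable bias table has only a finite number of entries, so we need a way to map continuous positions, or their relative offsets, onto a discrete set of parameters.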
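One simple way to map a continuous offset onto a finite table is uniform binning with clipping at both ends. The sketch below is only an illustration of that idea, with made-up parameter names; it is not claimed to be the bucketing scheme LayoutLMv2 actually uses.

```python
def offset_to_bucket(delta, bucket_width, num_buckets):
    """Map a signed offset to an index in [0, num_buckets).

    Offsets are grouped into uniform bins of size bucket_width; anything
    beyond the covered range falls into the first or last bucket.
    Assumes num_buckets is even.
    """
    half = num_buckets // 2
    idx = int(delta // bucket_width)          # floor division handles negatives
    idx = max(-half, min(half - 1, idx))      # clip to the covered range
    return idx + half
```

For instance, with `bucket_width=10` and `num_buckets=8`, an offset of 25 lands in bucket 6, and any offset of 40 or more shares bucket 7, so far-apart boxes all reuse the same bias value.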

The biases are different among attention heads but shared in all encoder layers.

The biases are different among attention heads:

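This means the bias terms differ in each attention head. In a model based on multi-head attention, several sets of attention weights are computed in parallel, and each set is called a "head". Since each head may attend to different parts or features of the input, giving each head its own bias terms helps the model capture and separate these different kinds of information.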

but shared in all encoder layers:

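Although each attention head has its own bias terms, those bias terms are shared across all encoder layers. In models like the Transformer, the encoder usually consists of several stacked layers, each containing an attention mechanism and other components. The sentence means that, whichever encoder layer we look at, the bias terms of a given attention head are the same. This design reduces the number of model parameters and may encourage information flow and consistency across layers.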

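In short, in a multi-head attention model each head has its own bias terms, but those terms are shared across all encoder layers of the model. The design balances expressive power against parameter efficiency.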
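To see what sharing across layers saves, here is a small back-of-the-envelope count. The three tables are the 1D, 2D-x, and 2D-y biases; the head, bucket, and layer counts used in the example are hypothetical values, not figures from the paper.

```python
def bias_param_count(num_heads, num_buckets, num_layers, shared_across_layers=True):
    """Count relative-position bias parameters.

    Each of the three bias tables holds one value per (head, bucket) pair.
    """
    tables = 3                                   # b^(1D), b^(2D_x), b^(2D_y)
    per_layer = tables * num_heads * num_buckets
    return per_layer if shared_across_layers else per_layer * num_layers
```

With 12 heads, 64 buckets, and 12 layers, sharing gives 2,304 bias parameters, versus 27,648 if every layer had its own tables.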

5. Masked Visual-Language Modeling

We randomly mask some text tokens and ask the model to recover the masked tokens.

Meanwhile, the layout information remains unchanged, which means the model knows each masked token's location on the page.

The output representations of masked tokens from the encoder are fed into a classifier over the whole vocabulary, driven by a cross-entropy loss.

That is, the encoder's output for each masked token is classified over the whole vocabulary, and training is driven by the cross-entropy loss.

To avoid visual clue leakage, we mask image regions corresponding to masked tokens on the raw page image input before feeding it into the visual encoder.
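A minimal sketch of the masking step: tokens are replaced by [MASK] while their boxes stay untouched, and the boxes of masked tokens are collected so the same regions can be blanked on the page image. The 15% default rate and the function name `mask_for_mvlm` are assumptions borrowed from common BERT-style practice, not values confirmed in these notes.

```python
import random

def mask_for_mvlm(tokens, boxes, mask_prob=0.15, seed=0):
    """Mask text tokens but keep layout; report image regions to blank out.

    Returns (masked_tokens, boxes, labels, regions_to_blank), where labels
    holds the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked_tokens, labels, regions_to_blank = [], [], []
    for tok, box in zip(tokens, boxes):
        if rng.random() < mask_prob:
            masked_tokens.append("[MASK]")
            labels.append(tok)                # target for the vocabulary classifier
            regions_to_blank.append(box)      # avoid visual clue leakage
        else:
            masked_tokens.append(tok)
            labels.append(None)
    return masked_tokens, list(boxes), labels, regions_to_blank
```

The returned boxes are identical to the input, which is the point: the model still knows where each masked token sits on the page, but can read it neither from the text nor from the image.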

6. Text-Image Alignment:

In the TIA task, some token lines are randomly selected, and their image regions are covered on the document image.

Note that it is the token lines that are selected; their corresponding image regions are then covered.

During pre-training, a classification layer is built above the encoder outputs.

This layer predicts a label for each text token depending on whether it is covered, i.e., [Covered] or [Not Covered], and computes the binary cross-entropy loss.
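The TIA labeling and loss can be sketched as follows: pick lines at random, label every token by whether its line was covered, and score per-token predictions with binary cross-entropy. The cover probability and both function names are hypothetical choices for illustration.

```python
import math
import random

def tia_labels(token_line_ids, num_lines, cover_prob=0.15, seed=0):
    """Pick text lines to cover and label each token 1 (covered) or 0."""
    rng = random.Random(seed)
    covered_lines = {i for i in range(num_lines) if rng.random() < cover_prob}
    labels = [1 if line in covered_lines else 0 for line in token_line_ids]
    return covered_lines, labels

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over per-token covered probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)         # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)
```

Here `token_line_ids` gives the line index of each token, so covering a line flips the label of all its tokens at once, matching the line-level selection described above.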

7. Text-Image Matching

We feed the output representation at [CLS] into a classifier to predict whether the image and text are from the same document page.

Regular inputs are positive samples.
To construct a negative sample, an image is either replaced by a page image from another document or dropped.

During training, the classifier receives inputs from both positive and negative samples and learns to tell the two cases apart.

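Specifically, the classifier tries to extract enough information from the output representation at [CLS] to decide whether the image and text match. By minimizing a classification loss (such as cross-entropy), it gradually learns features that separate positive from negative samples.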

To prevent the model from cheating by finding task features, we perform the same masking and covering operations to images in negative samples.

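About cheating: the model might separate positive and negative samples not by learning real features or regularities, but through shortcuts or unintended cues (for example, relying only on some irrelevant feature of the image). For instance, if only positive samples contained masked and covered regions, the mere presence of those regions would give away the label.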
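A small sketch of building TIM examples: a regular page is a positive sample, and a negative sample either swaps in the image of another page or drops the image. The probabilities, the dictionary layout, and the name `make_tim_example` are hypothetical; the image is represented by an opaque value.

```python
import random

def make_tim_example(page, other_pages, negative_prob=0.5, drop_prob=0.5, seed=0):
    """Return (tokens, image, label); label 1 means text and image match.

    page and other_pages entries are dicts with 'tokens' and 'image' keys.
    """
    rng = random.Random(seed)
    if rng.random() >= negative_prob:
        return page["tokens"], page["image"], 1           # regular input
    if rng.random() < drop_prob:
        return page["tokens"], None, 0                    # image dropped
    return page["tokens"], rng.choice(other_pages)["image"], 0  # image replaced
```

Whatever the label, the same masking and covering operations would then be applied to the returned image, so that the presence of masked regions cannot serve as a shortcut.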