Semantic Similarity Loss Functions for Different Input Types (SentenceTransformer Losses)

Contents

Losses by input type
Input type: [(anchor, positive/negative, label 1/0), ...]; label 1 means small distance, label 0 means large distance
- ContrastiveLoss
- OnlineContrastiveLoss
Input type: [(sentence1, label1), (sentence2, label2), ...]; sentences with the same label should be close
Input type: [(sentence1, sentence2, score), ...]; regress the sentence-pair score (between 0 and 1)
- CosineSimilarityLoss (similarity regression)
- CoSENTLoss (similarity regression and ranking)
Input type: [(sentence1, sentence2, label), ...]; multi-class classification of sentence pairs
- SoftmaxLoss
Input type: [(anchor, positive, negative), ...]; triplets
- TripletLoss
- MultipleNegativesRankingLoss / InfoNCELoss
- CachedMultipleNegativesRankingLoss
Input type: [(anchor, positive), ...]; positive pairs only
Input type: [sentence1, sentence2, ...]; no labels

Losses by input type

Choose a loss that fits both the task and the format of the data; see here for details.

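Input type: [(anchor, positive/negative, label 1/0), ...]; label 1 means small distance, label 0 means large distance

ContrastiveLoss

For a pair of samples A and B:

A positive pair (label 1) should end up as close together as possible.
A negative pair (label 0) should end up far apart. Only negative pairs whose distance is below margin are penalised; once the distance reaches margin, the pair contributes no loss.

distance_metric defaults to cosine distance and margin defaults to 0.5. With d the distance between the two embeddings and y the label, the per-pair loss is 0.5 * (y * d^2 + (1 - y) * max(margin - d, 0)^2).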


def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    assert len(reps) == 2
    rep_anchor, rep_other = reps
    distances = self.distance_metric(rep_anchor, rep_other)
    losses = 0.5 * (
        labels.float() * distances.pow(2) + (1 - labels).float() * F.relu(self.margin - distances).pow(2)
    )
    return losses.mean() if self.size_average else losses.sum()
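
To make the two branches of the formula concrete, here is a minimal sketch in plain Python rather than PyTorch. The function name `contrastive_loss` and the three distances are made up for illustration; only the formula itself comes from the excerpt above.

```python
def contrastive_loss(distances, labels, margin=0.5):
    """Mean contrastive loss; label 1 = similar pair, label 0 = dissimilar pair."""
    losses = []
    for d, y in zip(distances, labels):
        positive_term = y * d ** 2                           # pulls similar pairs together
        negative_term = (1 - y) * max(margin - d, 0.0) ** 2  # pushes dissimilar pairs out to the margin
        losses.append(0.5 * (positive_term + negative_term))
    return sum(losses) / len(losses)


# Hypothetical cosine distances for three pairs
distances = [0.2, 0.1, 0.8]
labels = [1, 0, 0]
loss = contrastive_loss(distances, labels)
```

The third pair is a negative whose distance (0.8) already exceeds the margin, so it adds nothing; the second pair is a negative at distance 0.1 and dominates the loss.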

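OnlineContrastiveLoss

Largely the same as ContrastiveLoss, except that the loss is computed only on the hard pairs within each batch. It usually performs better than plain ContrastiveLoss.

Selection: keep the negative pairs whose distance is smaller than the largest positive-pair distance, and the positive pairs whose distance is larger than the smallest negative-pair distance. Easy pairs are left out, and a selected negative pair whose distance already exceeds margin adds nothing to the loss.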


def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor, size_average=False) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]

    distance_matrix = self.distance_metric(embeddings[0], embeddings[1])
    negs = distance_matrix[labels == 0]
    poss = distance_matrix[labels == 1]

    # select hard positive and hard negative pairs
    negative_pairs = negs[negs < (poss.max() if len(poss) > 1 else negs.mean())]
    positive_pairs = poss[poss > (negs.min() if len(negs) > 1 else poss.mean())]

    positive_loss = positive_pairs.pow(2).sum()
    negative_loss = F.relu(self.margin - negative_pairs).pow(2).sum()
    loss = positive_loss + negative_loss
    return loss
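
The hard-pair selection can be traced by hand on a tiny batch. The sketch below is plain Python with made-up distances; `online_contrastive_loss` is an illustrative name, and it assumes the batch contains at least one positive and one negative pair.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Return (hard positives, hard negatives, summed loss) for one batch."""
    poss = [d for d, y in zip(distances, labels) if y == 1]
    negs = [d for d, y in zip(distances, labels) if y == 0]

    # Same thresholds as the excerpt: fall back to the mean when the
    # other class has at most one pair in the batch.
    neg_threshold = max(poss) if len(poss) > 1 else sum(negs) / len(negs)
    pos_threshold = min(negs) if len(negs) > 1 else sum(poss) / len(poss)

    hard_negs = [d for d in negs if d < neg_threshold]
    hard_poss = [d for d in poss if d > pos_threshold]

    positive_loss = sum(d ** 2 for d in hard_poss)
    negative_loss = sum(max(margin - d, 0.0) ** 2 for d in hard_negs)
    return hard_poss, hard_negs, positive_loss + negative_loss


# Hypothetical batch: two positive pairs followed by two negative pairs
distances = [0.1, 0.6, 0.3, 0.9]
labels = [1, 1, 0, 0]
hard_poss, hard_negs, loss = online_contrastive_loss(distances, labels)
```

Here only the positive at distance 0.6 and the negative at distance 0.3 are selected; the other two pairs are already on the correct side of each other and are skipped.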

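Input type: [(sentence1, label1), (sentence2, label2), ...]; sentences with the same label should be close

BatchAllTripletLoss

What the loss measures:

Sentences in a batch that share a label belong to the same class and should be close together.
Sentences in a batch with different labels belong to different classes and should be far apart.
For any anchor, its distance to a sample with the same label should be smaller than its distance to a sample with a different label.

For example, given four samples [(a, label1), (b, label1), (c, label2), (d, label2)], pairwise_dist is [[aa, ab, ac, ad], ..., [da, db, dc, dd]]. With a as the anchor, ab is a positive-pair distance and ac is a negative-pair distance, so one term of the loss is ab - ac + margin.

The larger the positive-pair distance and the smaller the negative-pair distance, the larger the loss. Triplets whose negative is already farther away than the positive by more than margin, i.e. ab - ac + margin < 0, are ignored: they are easy to separate and contribute little to training.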


def batch_all_triplet_loss(self, labels: Tensor, embeddings: Tensor) -> Tensor:
    # Get the pairwise distance matrix
    pairwise_dist = self.distance_metric(embeddings)
    anchor_positive_dist = pairwise_dist.unsqueeze(2)
    anchor_negative_dist = pairwise_dist.unsqueeze(1)

    # Compute a 3D tensor of size (batch_size, batch_size, batch_size)
    # triplet_loss[i, j, k] will contain the triplet loss of anchor=i, positive=j, negative=k
    # Uses broadcasting where the 1st argument has shape (batch_size, batch_size, 1)
    # and the 2nd (batch_size, 1, batch_size)
    triplet_loss = anchor_positive_dist - anchor_negative_dist + self.triplet_margin

    # Put to zero the invalid triplets
    # (where label(a) != label(p) or label(n) == label(a) or a == p)
    mask = BatchHardTripletLoss.get_triplet_mask(labels)
    triplet_loss = mask.float() * triplet_loss

    # Remove negative losses (i.e. the easy triplets)
    triplet_loss[triplet_loss < 0] = 0

    # Count number of positive triplets (where triplet_loss > 0)
    valid_triplets = triplet_loss[triplet_loss > 1e-16]
    num_positive_triplets = valid_triplets.size(0)
    # num_valid_triplets = mask.sum()
    # fraction_positive_triplets = num_positive_triplets / (num_valid_triplets.float() + 1e-16)

    # Get final mean triplet loss over the positive valid triplets
    triplet_loss = triplet_loss.sum() / (num_positive_triplets + 1e-16)

    return triplet_loss
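
The four-sample example can be enumerated explicitly. This is a minimal sketch with loops instead of broadcasting; the 1-D positions and the margin of 2.0 are assumptions chosen so the arithmetic is exact, and `batch_all_triplet_loss` here is a standalone illustrative function, not the library method.

```python
def batch_all_triplet_loss(pairwise_dist, labels, margin):
    """Mean loss over the triplets that still violate the margin, plus their count."""
    n = len(labels)
    active = []
    for a in range(n):
        for p in range(n):
            for k in range(n):
                # Valid triplet: a != p, label(a) == label(p), label(a) != label(k)
                if a == p or labels[a] != labels[p] or labels[a] == labels[k]:
                    continue
                loss = pairwise_dist[a][p] - pairwise_dist[a][k] + margin
                if loss > 0:  # easy triplets (loss <= 0) are dropped
                    active.append(loss)
    mean_loss = sum(active) / len(active) if active else 0.0
    return mean_loss, len(active)


# Samples a, b, c, d placed on a line (hypothetical 1-D embeddings)
positions = [0.0, 1.0, 3.0, 4.0]
labels = ["label1", "label1", "label2", "label2"]
pairwise_dist = [[abs(x - y) for y in positions] for x in positions]
mean_loss, num_active = batch_all_triplet_loss(pairwise_dist, labels, margin=2.0)
```

Of the eight valid triplets, only (b, a, c) and (c, d, b) still violate the margin, and the mean is taken over those two alone.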

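BatchHardSoftMarginTripletLoss

For every anchor in the batch, even its farthest same-label sample (the hardest positive) should be closer than its nearest different-label sample (the hardest negative).

A soft margin is used: loss = log(1 + exp(d(a, p) - d(a, n))). The loss is smooth everywhere, so no margin has to be tuned. Its slope with respect to d(a, p) - d(a, n) is 0.5 when the two distances are equal and approaches 1 as the positive moves farther away than the negative; when the positive-pair distance is much smaller than the negative-pair distance, the loss tends to 0.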


def batch_hard_triplet_soft_margin_loss(self, labels: Tensor, embeddings: Tensor) -> Tensor:
    # Get the pairwise distance matrix
    pairwise_dist = self.distance_metric(embeddings)

    # For each anchor, get the hardest positive
    # First, we need to get a mask for every valid positive (they should have same label)
    mask_anchor_positive = BatchHardTripletLoss.get_anchor_positive_triplet_mask(labels).float()

    # We put to 0 any element where (a, p) is not valid (valid if a != p and label(a) == label(p))
    anchor_positive_dist = mask_anchor_positive * pairwise_dist

    # shape (batch_size, 1)
    hardest_positive_dist, _ = anchor_positive_dist.max(1, keepdim=True)

    # For each anchor, get the hardest negative
    # First, we need to get a mask for every valid negative (they should have different labels)
    mask_anchor_negative = BatchHardTripletLoss.get_anchor_negative_triplet_mask(labels).float()

    # We add the maximum value in each row to the invalid negatives (label(a) == label(n))
    max_anchor_negative_dist, _ = pairwise_dist.max(1, keepdim=True)
    anchor_negative_dist = pairwise_dist + max_anchor_negative_dist * (1.0 - mask_anchor_negative)

    # shape (batch_size, 1), because keepdim=True
    hardest_negative_dist, _ = anchor_negative_dist.min(1, keepdim=True)

    # Combine biggest d(a, p) and smallest d(a, n) into final triplet loss with soft margin
    # tl = hardest_positive_dist - hardest_negative_dist + margin
    # tl[tl < 0] = 0
    tl = torch.log1p(torch.exp(hardest_positive_dist - hardest_negative_dist))
    triplet_loss = tl.mean()

    return triplet_loss
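
The shape of the soft-margin term is easiest to see by evaluating it at a few points. The sketch below uses only the standard library; the function names and sample distances are illustrative.

```python
import math


def soft_margin(d_ap, d_an):
    """log(1 + exp(d_ap - d_an)), the soft-margin triplet term."""
    return math.log1p(math.exp(d_ap - d_an))


def soft_margin_slope(d_ap, d_an):
    """Derivative of the term with respect to (d_ap - d_an): the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-(d_ap - d_an)))


easy = soft_margin(0.0, 5.0)   # positive much closer than negative
tied = soft_margin(1.0, 1.0)   # both distances equal
hard = soft_margin(5.0, 0.0)   # positive much farther than negative
```

The easy case is close to zero, the tied case equals log(2), and the hard case grows roughly linearly with the distance gap.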

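BatchHardTripletLoss

Unlike BatchHardSoftMarginTripletLoss, the margin is set by hand: loss = d(a, p) - d(a, n) + margin, followed by loss[loss < 0] = 0, so triplets whose negative is farther away than the positive by more than margin are ignored.

Input type: [(sentence1, sentence2, score), ...]; regress the sentence-pair score (between 0 and 1)

CosineSimilarityLoss (similarity regression)

Computes the cosine similarity of each sentence pair and compares it with the label score. cos_score_transformation defaults to the identity (no-op), and loss_fct defaults to MSE.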
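
A minimal sketch of batch-hard mining with a hard margin, written with plain loops. The 1-D positions and the margin of 2.0 are assumptions for illustration, and the sketch assumes every anchor has at least one sample with a different label in the batch.

```python
def batch_hard_triplet_loss(pairwise_dist, labels, margin):
    """Mean hinge loss using only the hardest positive and hardest negative per anchor."""
    n = len(labels)
    losses = []
    for a in range(n):
        pos = [pairwise_dist[a][j] for j in range(n) if j != a and labels[j] == labels[a]]
        neg = [pairwise_dist[a][j] for j in range(n) if labels[j] != labels[a]]
        hardest_pos = max(pos, default=0.0)  # farthest same-label sample
        hardest_neg = min(neg)               # nearest different-label sample
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return sum(losses) / n


# Hypothetical 1-D embeddings for four samples in two classes
positions = [0.0, 1.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
pairwise_dist = [[abs(x - y) for y in positions] for x in positions]
loss = batch_hard_triplet_loss(pairwise_dist, labels, margin=2.0)
```

The two inner samples (at 1.0 and 3.0) each have a negative only 2.0 away and contribute a loss of 1.0; the two outer samples already satisfy the margin.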


def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    output = self.cos_score_transformation(torch.cosine_similarity(embeddings[0], embeddings[1]))
    return self.loss_fct(output, labels.float().view(-1))
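
As a small worked example of the default configuration (identity transformation plus MSE), the following plain-Python sketch scores two pairs of toy 2-D embeddings. The vectors and gold scores are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and the gold score."""
    preds = [cosine_similarity(u, v) for u, v in pairs]
    return sum((p - s) ** 2 for p, s in zip(preds, gold_scores)) / len(gold_scores)


# Hypothetical embeddings: one identical pair, one orthogonal pair
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
gold_scores = [0.9, 0.2]
loss = cosine_similarity_loss(pairs, gold_scores)
```

The predicted similarities are 1.0 and 0.0, so the squared errors are 0.01 and 0.04.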

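CoSENTLoss (similarity regression and ranking)

Cosine Sentence Loss. For the underlying idea, see the 科学空间 post "CoSENT（一）：比Sentence-BERT更有效的句向量方案" (CoSENT (1): a more effective sentence-embedding scheme than Sentence-BERT).

Loss: for sentence pairs (i, j) and (k, l), if label[i,j] < label[k,l], the predicted similarities should satisfy scores[i,j] < scores[k,l]. The loss is loss = log(1 + sum of exp(scale * (s[i,j] - s[k,l]))), where the sum runs over all such combinations, so it is small only when (i, j) scores lower than (k, l).

Similarity measure: the cosine similarity score, where 1 means similar and 0 means dissimilar. Note that this is a similarity, not a distance. After training, the cosine similarity between two embeddings reflects their semantic similarity, which suits sentence-similarity regression and ranking tasks.

For example, take three sentence pairs in a batch, numbered 1, 2 and 3, with gold labels (0.1, 0.7, 0.9). The combinations (1, 2), (1, 3) and (2, 3) enter the loss. If the predicted scores are (0.3, 0.4, 0.2), the score differences are (-0.1, 0.1, 0.2); every positive difference marks a wrongly ordered combination and increases the loss.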


def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]

    scores = self.similarity_fct(embeddings[0], embeddings[1])
    scores = scores * self.scale
    scores = scores[:, None] - scores[None, :]

    # label matrix indicating which pairs are relevant
    labels = labels[:, None] < labels[None, :]
    labels = labels.float()

    # mask out irrelevant pairs so they are negligible after exp()
    scores = scores - (1 - labels) * 1e12

    # append a zero as e^0 = 1
    scores = torch.cat((torch.zeros(1).to(scores.device), scores.view(-1)), dim=0)
    loss = torch.logsumexp(scores, dim=0)

    return loss
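
The three-pair example can be checked numerically. The sketch below reimplements the loss with plain loops and a hand-written log-sum-exp; `cosent_loss` is an illustrative name, and scale=20.0 is assumed to match the library default.

```python
import math


def cosent_loss(scores, labels, scale=20.0):
    """log(1 + sum exp(scale * (s_a - s_b))) over all (a, b) with label_a < label_b."""
    terms = [0.0]  # the leading zero contributes e^0 = 1 inside the log
    for a in range(len(scores)):
        for b in range(len(scores)):
            if labels[a] < labels[b]:
                terms.append(scale * (scores[a] - scores[b]))
    m = max(terms)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))


labels = [0.1, 0.7, 0.9]
badly_ordered = cosent_loss([0.3, 0.4, 0.2], labels)   # scores from the example in the text
well_ordered = cosent_loss([0.1, 0.5, 0.9], labels)    # hypothetical scores in the right order
```

With the scores from the text, the scaled differences are -2, 2 and 4, and the two positive ones dominate the loss. Scores that respect the label order give a loss close to zero.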

Input type: [(sentence1, sentence2, label), ...]; multi-class classification of sentence pairs

SoftmaxLoss

A siamese network for multi-class classification of sentence pairs.


from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": [
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "Children smiling and waving at camera",
    ],
    "sentence2": [
        "A person is training his horse for a competition.",
        "A person is at a diner, ordering an omelette.",
        "A person is outdoors, on a horse.",
        "There are children present.",
    ],
    "label": [1, 2, 0, 0],
})
loss = losses.SoftmaxLoss(model, model.get_sentence_embedding_dimension(), num_labels=3)


def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor | tuple[Tensor, Tensor]:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    rep_a, rep_b = reps

    vectors_concat = []
    if self.concatenation_sent_rep:
        vectors_concat.append(rep_a)
        vectors_concat.append(rep_b)

    if self.concatenation_sent_difference:
        vectors_concat.append(torch.abs(rep_a - rep_b))

    if self.concatenation_sent_multiplication:
        vectors_concat.append(rep_a * rep_b)

    features = torch.cat(vectors_concat, 1)

    output = self.classifier(features)

    if labels is not None:
        loss = self.loss_fct(output, labels.view(-1))
        return loss
    else:
        return reps, output
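
The feature construction in the excerpt can be illustrated without any model. This sketch assumes the two sentence representations and their absolute difference are concatenated and the element-wise product is left out; the linear classifier is a hypothetical all-zero weight matrix, so the numbers are illustrative only.

```python
import math


def build_features(u, v):
    """Concatenate u, v and |u - v| into one feature vector of length 3 * dim."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    return u + v + diff


def cross_entropy(logits, label):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label]


def softmax_loss(u, v, weights, bias, label):
    features = build_features(u, v)
    logits = [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(weights, bias)]
    return cross_entropy(logits, label)


# Hypothetical 2-D sentence embeddings and an untrained 3-class classifier
u, v = [1.0, 0.0], [0.0, 1.0]
features = build_features(u, v)
weights = [[0.0] * len(features) for _ in range(3)]
bias = [0.0, 0.0, 0.0]
loss = softmax_loss(u, v, weights, bias, label=1)
```

With all-zero weights every class gets the same logit, so the loss equals log(3), the value expected from a classifier that has not learned anything yet.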

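Input type: [(anchor, positive, negative), ...]; triplets

TripletLoss

The anchor-negative distance should exceed the anchor-positive distance by at least margin; in other words, triplets with dis(anchor, neg) - dis(anchor, pos) < margin are penalised. distance_metric defaults to Euclidean distance and margin defaults to 5.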


def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]

    rep_anchor, rep_pos, rep_neg = reps
    distance_pos = self.distance_metric(rep_anchor, rep_pos)
    distance_neg = self.distance_metric(rep_anchor, rep_neg)

    losses = F.relu(distance_pos - distance_neg + self.triplet_margin)
    return losses.mean()
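
A quick numeric check of the hinge, using Euclidean distance and a margin of 5 as described above. The 2-D points are made up so that the distances come out as whole numbers.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0) for one triplet."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]                                     # distance 5 from the anchor
satisfied = triplet_loss(anchor, positive, [6.0, 8.0])    # negative at distance 10
violated = triplet_loss(anchor, positive, [0.0, 7.0])     # negative at distance 7
```

The first negative is exactly margin farther away than the positive, so the loss is zero; the second is only 2 farther, which leaves a loss of 3.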

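MultipleNegativesRankingLoss / InfoNCELoss

Each anchor comes with one positive and multiple negatives. The loss computes the similarity between the anchor and every positive and negative candidate, then treats the problem as softmax multi-class classification: raise the anchor-positive similarity and lower the anchor-negative similarities.

This is equivalent to the InfoNCE loss, which applies temperature scaling to the scores before the softmax. In MultipleNegativesRankingLoss the scale parameter plays that role (the scores are multiplied by scale, i.e. scale = 1 / temperature); with scale=1 it is plain cross-entropy over the raw similarity scores.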


def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    # Compute the embeddings and distribute them to anchor and candidates (positive and optionally negatives)
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    anchors = embeddings[0]  # (batch_size, embedding_dim)
    candidates = torch.cat(embeddings[1:])  # (batch_size * (1 + num_negatives), embedding_dim)

    # For every anchor, we compute the similarity to all other candidates (positives and negatives),
    # also from other anchors. This gives us a lot of in-batch negatives.
    scores = self.similarity_fct(anchors, candidates) * self.scale
    # (batch_size, batch_size * (1 + num_negatives))

    # anchor[i] should be most similar to candidates[i], as that is the paired positive,
    # so the label for anchor[i] is i
    range_labels = torch.arange(0, scores.size(0), device=scores.device)

    return self.cross_entropy_loss(scores, range_labels)
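
The in-batch classification can be sketched with plain lists. The code below assumes cosine similarity and scale=20.0; the toy embeddings and the name `mnr_loss` are illustrative. As in the excerpt, candidates are the positives followed by any explicit negatives, and the correct class for anchor i is candidate i.

```python
import math


def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i must pick candidate i among all candidates."""
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


# Hypothetical 2-D embeddings
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
hard_negatives = [[0.9, 0.5], [0.5, 0.9]]

aligned = mnr_loss(anchors, positives)
swapped = mnr_loss(anchors, positives[::-1])          # positives paired with the wrong anchors
with_negatives = mnr_loss(anchors, positives, hard_negatives)
```

Correctly paired positives give a loss near zero, swapping them makes it large, and adding hard negatives raises the loss because more candidates compete with the true positive.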

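CachedMultipleNegativesRankingLoss

A memory-optimised version of MultipleNegativesRankingLoss: the batch is split into several mini-batches and gradients are cached, which avoids OOM at large batch sizes.

Input type: [(anchor, positive), ...]; positive pairs only

MultipleNegativesRankingLoss can be used here: the positives of the other pairs in the batch serve as negatives for each anchor, followed by softmax classification. The more samples in a batch, the harder the classification and the better the expected result.

Input type: [sentence1, sentence2, ...]; no labels

No labels are provided.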
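
One way to see why a larger batch makes the classification harder: if a model scores every candidate equally, the cross-entropy equals the logarithm of the number of candidates. The sketch below only illustrates this baseline and is not part of the library.

```python
import math


def uniform_baseline_loss(batch_size):
    """Cross-entropy when every candidate in the batch receives the same score."""
    logits = [0.0] * batch_size
    log_norm = math.log(sum(math.exp(x) for x in logits))
    return log_norm - logits[0]


small_batch = uniform_baseline_loss(8)
large_batch = uniform_baseline_loss(64)
```

The baseline grows from log(8) to log(64), so the model has more to learn from each large batch than from a small one.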

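ContrastiveTensionLossInBatchNegatives can be used here: each sentence is encoded twice, and the objective pulls the two encodings of the same sentence together while pushing the encodings of different sentences apart. In Contrastive Tension the two encodings come from two copies of the model that start from the same weights; obtaining the second view purely from dropout noise in a single network is the SimCSE setup.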