Adaptive Layers

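Embedding models are typically encoder models with several layers, for example 12 (all-mpnet-base-v2) or 6 (all-MiniLM-L6-v2). To produce an embedding, the input has to pass through every one of these layers. The 2D Matryoshka Sentence Embeddings (2DMSE) preprint revisits this by proposing a way to train embedding models that still perform well when only some of the layers are used. This gives faster inference at a relatively small cost in performance.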

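Note

The 2DMSE preprint was later updated and renamed to ESE: Espresso Sentence Embeddings. The Sentence Transformers implementations of Adaptive Layers and Matryoshka2d (Adaptive Layers + Matryoshka Embeddings) are based on the original preprint, and we welcome contributions that implement the updated ESE paper.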

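Use Cases

The 2DMSE paper reports that using only a few layers of a larger model trained with Adaptive Layers and Matryoshka Representation Learning can outperform a smaller model trained as a standard embedding model.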

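Results

Let's look at the performance we can expect from an Adaptive Layer embedding model compared with a regular embedding model. For this experiment, I trained two models:

tomaarsen/mpnet-base-nli-adaptive-layer: trained by running adaptive_layer_nli.py with microsoft/mpnet-base.
tomaarsen/mpnet-base-nli: a near-identical model, trained with MultipleNegativesRankingLoss alone rather than with AdaptiveLayerLoss on top of MultipleNegativesRankingLoss. It also uses microsoft/mpnet-base as the base model.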

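Both models were trained on the AllNLI dataset, which is a concatenation of the SNLI and MultiNLI datasets. I evaluated them on the STSBenchmark test set with varying numbers of layers. The results are plotted in the following figure: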

[Figure: adaptive_layer_results]

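The first plot shows that the Adaptive Layer model retains noticeably higher performance as the number of layers is reduced. The second plot makes this explicit: 80% of the performance is preserved even when the model is reduced to a single layer.

Finally, the third plot shows the speedup I measured on GPU and CPU devices in my tests. As you can see, removing half of the layers gives roughly a 2x speedup at a cost of about 15% of the performance on STSB (~86 -> ~75 Spearman correlation). When more layers are removed, the speedup grows further on CPU, and a 5x to 10x speedup at a 20% loss in performance is quite feasible.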

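Training

Training with Adaptive Layer support is quite elementary: rather than applying a loss function only to the last layer, we also apply the same loss function to the pooled embeddings of the earlier layers. In addition, we use a KL-divergence loss that pushes the embeddings of the non-final layers to match those of the final layer. This can be seen as a form of knowledge distillation in which the last layer acts as the teacher and the earlier layers act as the students.

For example, the 12-layer microsoft/mpnet-base will now be trained so that the model produces meaningful embeddings after each of its 12 layers.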

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, AdaptiveLayerLoss

model = SentenceTransformer("microsoft/mpnet-base")

base_loss = CoSENTLoss(model=model)
loss = AdaptiveLayerLoss(model=model, loss=base_loss)

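Reference: AdaptiveLayerLoss

Note that training with AdaptiveLayerLoss is not notably slower than training without it.

This can also be combined with MatryoshkaLoss, so that the resulting model can be reduced both in its number of layers and in the size of its output dimensions. See Matryoshka Embeddings for more information on reducing output dimensions. In Sentence Transformers, the combination of these two losses is called Matryoshka2dLoss, and a shorthand is provided to simplify training.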
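The KL-divergence term described above can be illustrated with a small, library-independent sketch. It is not the AdaptiveLayerLoss implementation: the function names, the softmax over embedding values, the direction of the divergence, and the temperature parameter are all assumptions made for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Turn a list of floats into a probability distribution."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for two discrete distributions."""
    return sum(
        t * math.log(t / max(s, 1e-12))  # guard against a zero student probability
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def layer_matching_penalty(final_embedding, earlier_embedding, temperature=1.0):
    """Hypothetical helper: how far an earlier layer's pooled embedding
    (student) is from the final layer's pooled embedding (teacher)."""
    teacher = softmax(final_embedding, temperature)
    student = softmax(earlier_embedding, temperature)
    return kl_divergence(teacher, student)
```

The penalty is zero when an earlier layer already reproduces the final-layer embedding and grows as the two drift apart, which is the signal that pulls the earlier layers toward the last one during training.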

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, Matryoshka2dLoss

model = SentenceTransformer("microsoft/mpnet-base")

base_loss = CoSENTLoss(model=model)
loss = Matryoshka2dLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

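Reference: Matryoshka2dLoss

Inference

After a model has been trained with the Adaptive Layer loss, you can truncate its layers to the number you want. Note that this requires some manipulation of the model itself, and because every model is structured slightly differently, the steps vary from model to model.

First, we load the model and access the underlying transformers model: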
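Reducing the output dimension, as mentioned above, amounts to keeping the first values of each embedding. The following is a minimal sketch in plain Python, assuming embeddings are given as lists of floats; `truncate_embedding` is a hypothetical helper and not part of Sentence Transformers.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 1 <= dim <= len(embedding):
        raise ValueError(f"dim must be between 1 and {len(embedding)}, got {dim}")
    shortened = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in shortened))
    if norm == 0.0:
        return list(shortened)  # an all-zero vector cannot be normalized
    return [x / norm for x in shortened]
```

Re-normalizing after truncation is optional for cosine similarity, but it keeps dot products directly comparable across different values of `dim`.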

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tomaarsen/mpnet-base-nli-adaptive-layer")

# We can access the underlying model with `model.transformers_model`
print(model.transformers_model)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): MPNetOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (relative_attention_bias): Embedding(32, 12)
  )
  (pooler): MPNetPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

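This output will differ from model to model. We look for the repeated layers in the encoder. For this MPNet model, they are stored under model.transformers_model.encoder.layer. We can then slice the model to keep only the first few layers, which speeds it up: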

new_num_layers = 3
model.transformers_model.encoder.layer = model.transformers_model.encoder.layer[:new_num_layers]
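Because the location of the repeated layers differs between architectures, the slicing step can be wrapped in a small helper that takes the attribute path as a parameter. This is only a sketch: `truncate_layers` and its `layer_path` argument are hypothetical names, and the default path matches the MPNet structure printed above. For other models, read the path off the printed module tree.

```python
def truncate_layers(transformers_model, num_layers, layer_path="encoder.layer"):
    """Keep only the first `num_layers` blocks stored at `layer_path`."""
    *parent_names, attr_name = layer_path.split(".")
    parent = transformers_model
    for name in parent_names:  # walk down to the object that owns the layer list
        parent = getattr(parent, name)
    layers = getattr(parent, attr_name)
    if not 1 <= num_layers <= len(layers):
        raise ValueError(
            f"num_layers must be between 1 and {len(layers)}, got {num_layers}"
        )
    # Slicing keeps the container type (a sliced ModuleList is still a ModuleList)
    setattr(parent, attr_name, layers[:num_layers])
    return transformers_model
```

With the model loaded as above, `truncate_layers(model.transformers_model, 3)` would have the same effect as the manual slice. The range check is there to catch a request for more layers than the model has, which plain slicing would silently ignore.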

Then we can run inference with SentenceTransformer.encode.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tomaarsen/mpnet-base-nli-adaptive-layer")
new_num_layers = 3
model.transformers_model.encoder.layer = model.transformers_model.encoder.layer[:new_num_layers]

embeddings = model.encode(
    [
        "The weather is so nice!",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
)
# Similarity of the first sentence with the other two
similarities = model.similarity(embeddings[0], embeddings[1:])
# => tensor([[0.7761, 0.1655]])
# compared to tensor([[ 0.7547, -0.0162]]) for the full model

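As you can see, even with only 3 layers the similarity between the related sentences is much higher than the similarity between the unrelated ones. Feel free to copy this script locally, change new_num_layers, and observe how the similarities differ.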

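Code Examples

See the following scripts for examples of how to apply AdaptiveLayerLoss in practice:

adaptive_layer_nli.py: This example uses MultipleNegativesRankingLoss with AdaptiveLayerLoss to train a strong embedding model on Natural Language Inference (NLI) data. It is an adaptation of the NLI documentation.
adaptive_layer_sts.py: This example uses CoSENTLoss with AdaptiveLayerLoss to train an embedding model on the training split of the STSBenchmark dataset. It is an adaptation of the STS documentation.

And the following scripts for examples of how to apply Matryoshka2dLoss:

2d_matryoshka_nli.py: This example uses MultipleNegativesRankingLoss with Matryoshka2dLoss to train a strong embedding model on Natural Language Inference (NLI) data. It is an adaptation of the NLI documentation.
2d_matryoshka_sts.py: This example uses CoSENTLoss with Matryoshka2dLoss to train an embedding model on the training split of the STSBenchmark dataset. It is an adaptation of the STS documentation.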