Migration Guide

Migrating from v4.x to v5.x

Sentence Transformers v5 introduces SparseEncoder embedding models (see Sparse Encoder Usage for more details on them), along with an extensive training suite for them, including SparseEncoderTrainer and SparseEncoderTrainingArguments. Unlike v3 (which updated training for SentenceTransformer) and v4 (which updated training for CrossEncoder), this update does not deprecate any training methods.

Migration of model.encode

We have introduced two new methods, encode_query() and encode_document(), which are recommended over encode() when working on Information Retrieval tasks. These methods are specialized versions of encode() that differ in two ways:

  1. If no prompt_name or prompt is provided, it uses a predefined "query" or "document" prompt, respectively, if available in the model's prompts dictionary.

  2. It sets the task to "query" or "document", respectively. If the model has a Router module, that task type is used to route the input to the corresponding submodule.

The same methods are also available for SparseEncoder models.
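For example, a minimal sketch with a sparse model; the SPLADE-style checkpoint name is an assumption for illustration, and any model loadable as a SparseEncoder behaves the same way:

from sentence_transformers import SparseEncoder

# Assumed checkpoint for illustration; substitute any SparseEncoder-compatible model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# encode_query / encode_document apply the prompting and routing described above,
# exactly as they do for dense SentenceTransformer models
query_embedding = model.encode_query("What is the capital of France?")
document_embedding = model.encode_document("Paris is the capital of France.")

# The embeddings are sparse, vocabulary-sized vectors
print(model.similarity(query_embedding, document_embedding))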

encode_query and encode_document

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query = "What is the capital of France?"
document = "Paris is the capital of France."

# Use the prompt with the name "query" for the query
query_embedding = model.encode(query, prompt_name="query")
document_embedding = model.encode(document)

print(query_embedding.shape, document_embedding.shape)
# => (1024,) (1024,)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query = "What is the capital of France?"
document = "Paris is the capital of France."

# The new encode_query and encode_document methods call encode,
# but with the prompt name set to "query" or "document" if the
# model has prompts saved, and the task set to "query" or "document",
# if the model has a Router module.
query_embedding = model.encode_query(query)
document_embedding = model.encode_document(document)

print(query_embedding.shape, document_embedding.shape)
# => (1024,) (1024,)

We have also deprecated the encode_multi_process() method, which was used to encode large datasets in parallel using multiple processes. It has been superseded by the encode() method together with the device, pool, and chunk_size parameters. Provide a list of devices to the device parameter to use multiple processes, or a single device to use a single process. The pool parameter can be used to pass a multi-process pool that is reused across calls, and the chunk_size parameter controls the size of the chunks that are sent to each process in parallel.

encode_multi_process Deprecation -> encode

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    texts = ["The weather is so nice!", "It's so sunny outside.", ...]

    pool = model.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
    embeddings = model.encode_multi_process(texts, pool, chunk_size=512)
    model.stop_multi_process_pool(pool)

    print(embeddings.shape)
    # => (4000, 768)

if __name__ == "__main__":
    main()
from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    texts = ["The weather is so nice!", "It's so sunny outside.", ...]

    embeddings = model.encode(texts, device=["cpu", "cpu", "cpu", "cpu"], chunk_size=512)

    print(embeddings.shape)
    # => (4000, 768)

if __name__ == "__main__":
    main()
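The pool parameter mentioned above can be reused across multiple encode() calls. A minimal sketch, assuming the pool is still created with start_multi_process_pool as in v4.x:

from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    texts = ["The weather is so nice!", "It's so sunny outside.", ...]

    # Create the multi-process pool once and reuse it for several encode() calls
    pool = model.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
    first_embeddings = model.encode(texts, pool=pool, chunk_size=512)
    second_embeddings = model.encode(texts, pool=pool, chunk_size=512)
    model.stop_multi_process_pool(pool)

if __name__ == "__main__":
    main()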

The `truncate_dim` parameter allows you to reduce the dimensionality of embeddings by truncating them. This is useful for optimizing storage and retrieval while preserving most of the semantic information. Research shows that the first dimensions of Transformer embeddings typically contain most of the important information.

Adding truncate_dim to encode

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer

# To truncate embeddings to a specific dimension,
# you had to specify the dimension when loading
model = SentenceTransformer(
   "mixedbread-ai/mxbai-embed-large-v1",
   truncate_dim=384,
)
sentences = ["This is an example sentence", "Each sentence is converted"]

embeddings = model.encode(sentences)
print(embeddings.shape)
# => (2, 384)
from sentence_transformers import SentenceTransformer

# Now you can either specify the dimension when loading the model...
model = SentenceTransformer(
   "mixedbread-ai/mxbai-embed-large-v1",
   truncate_dim=384,
)
sentences = ["This is an example sentence", "Each sentence is converted"]

# ... or you can specify it when encoding
embeddings = model.encode(sentences, truncate_dim=256)
print(embeddings.shape)
# => (2, 256)

# The encode parameter has priority, but otherwise the model truncate_dim is used
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (2, 384)
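The truncated embeddings can still be compared directly. A minimal sketch continuing the snippet above, using the model's built-in similarity function:

# Compare the two truncated embeddings with the model's similarity function
embeddings = model.encode(sentences, truncate_dim=256)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# => torch.Size([2, 2])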

Migrating from Asym to Router

The `Asym` module has been renamed and updated into the new Router module, which offers the same functionality but with a more consistent API and additional features. The new Router module allows for more flexible routing of different tasks, such as query and document embeddings, and is recommended when working with asymmetric models that require different handling of inputs for different tasks, notably queries and documents.

The encode_query() and encode_document() methods automatically set the `task` parameter, which the Router module uses to route the inputs to the query or document submodule, respectively.

Asym -> Router

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer, models

# Load a Sentence Transformer model and add an asymmetric router
# for different query and document post-processing
model = SentenceTransformer("microsoft/mpnet-base")
dim = model.get_sentence_embedding_dimension()
asym_model = models.Asym({
    'sts': [models.Dense(dim, dim)],
    'classification': [models.Dense(dim, dim)]
})
model.add_module("asym", asym_model)
from sentence_transformers import SentenceTransformer, models

# Load a Sentence Transformer model and add a router
# for different query and document post-processing
model = SentenceTransformer("microsoft/mpnet-base")
dim = model.get_sentence_embedding_dimension()
router_model = models.Router({
    'sts': [models.Dense(dim, dim)],
    'classification': [models.Dense(dim, dim)]
})
model.add_module("router", router_model)
Asym -> Router for queries and documents

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Asym, Normalize

# Use a regular SentenceTransformer for the document embeddings,
# and a static embedding model for the query embeddings
document_embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query_embedder = SentenceTransformer("static-retrieval-mrl-en-v1")
asym = Asym({
    "query": list(query_embedder.children()),
    "document": list(document_embedder.children()),
})
normalize = Normalize()

# Create an asymmetric model with different encoders for queries and documents
model = SentenceTransformer(
    modules=[asym, normalize],
)

# ... requires more training to align the vector spaces

# Use the query & document routes
query_embedding = model.encode({"query": "What is the capital of France?"})
document_embedding = model.encode({"document": "Paris is the capital of France."})
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Router, Normalize

# Use a regular SentenceTransformer for the document embeddings,
# and a static embedding model for the query embeddings
document_embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query_embedder = SentenceTransformer("static-retrieval-mrl-en-v1")
router = Router.for_query_document(
    query_modules=list(query_embedder.children()),
    document_modules=list(document_embedder.children()),
)
normalize = Normalize()

# Create an asymmetric model with different encoders for queries and documents
model = SentenceTransformer(
    modules=[router, normalize],
)

# ... requires more training to align the vector spaces

# Use the query & document routes
query_embedding = model.encode_query("What is the capital of France?")
document_embedding = model.encode_document("Paris is the capital of France.")
Asym inference -> Router inference

v4.x

v5.x (recommended)

...

# Use the query & document routes as keys in dictionaries
query_embedding = model.encode([{"query": "What is the capital of France?"}])
document_embedding = model.encode([
    {"document": "Paris is the capital of France."},
    {"document": "Berlin is the capital of Germany."},
])
class_embedding = model.encode(
    [{"classification": "S&P500 is down 2.1% today."}],
)
...

# Use the query & document routes with encode_query/encode_document
query_embedding = model.encode_query(["What is the capital of France?"])
document_embedding = model.encode_document([
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
])

# When using routes other than "query" and "document", you can use the `task` parameter
# on model.encode
class_embedding = model.encode(
    ["S&P500 is down 2.1% today."],
    task="classification"  # or any other task defined in the model Router
)
Asym training -> Router training

v4.x

v5.x (recommended)

...

# Prepare a training dataset for an Asym model with "query" and "document" keys
train_dataset = Dataset.from_dict({
    "query": [
        "is toprol xl the same as metoprolol?",
        "are eyes always the same size?",
    ],
    "answer": [
        "Metoprolol succinate is also known by the brand name Toprol XL.",
        "The eyes are always the same size from birth to death.",
    ],
})

# This mapper turns normal texts into a dictionary mapping Asym keys to the text
def mapper(sample):
    return {
        "question": {"query": sample["question"]},
        "answer": {"document": sample["answer"]},
    }

train_dataset = train_dataset.map(mapper)
print(train_dataset[0])
"""
{
    "question": {"query": "is toprol xl the same as metoprolol?"},
    "answer": {"document": "Metoprolol succinate is also known by the ..."}
}
"""

trainer = SentenceTransformerTrainer(  # Or SparseEncoderTrainer
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    ...
)
...

# Prepare a training dataset for a Router model with "query" and "document" keys
train_dataset = Dataset.from_dict({
    "query": [
        "is toprol xl the same as metoprolol?",
        "are eyes always the same size?",
    ],
    "answer": [
        "Metoprolol succinate is also known by the brand name Toprol XL.",
        "The eyes are always the same size from birth to death.",
    ],
})
print(train_dataset[0])
"""
{
    "question": "is toprol xl the same as metoprolol?",
    "answer": "Metoprolol succinate is also known by the brand name Toprol XL."
}
"""

args = SentenceTransformerTrainingArguments(  # Or SparseEncoderTrainingArguments
    # Map dataset columns to the Router keys
    router_mapping={
        "question": "query",
        "answer": "document",
    }
)

trainer = SentenceTransformerTrainer(  # Or SparseEncoderTrainer
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)

Migration for advanced usage

Module and InputModule convenience superclasses

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer
import torch

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Custom code here

model = SentenceTransformer(modules=[MyModule()])
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Module, InputModule

# The new Module and InputModule superclasses provide convenience methods
# like 'load', 'load_file_path', 'load_dir_path', 'load_torch_weights',
# 'save_config', 'save_torch_weights', 'get_config_dict'
# InputModule is meant to be used as the first module, and requires the
# 'tokenize' method to be implemented
class MyModule(Module):
    def __init__(self):
        super().__init__()
        # Custom initialization code here

model = SentenceTransformer(modules=[MyModule()])
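A module that actually transforms the embeddings still follows the usual convention of receiving and returning a features dictionary in forward. A minimal sketch; the ScaleEmbeddings module, its name, and its behaviour are purely illustrative:

import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Module

class ScaleEmbeddings(Module):
    # Hypothetical module that rescales the pooled sentence embedding
    def __init__(self, scale: float = 1.0):
        super().__init__()
        self.scale = scale

    def forward(self, features: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Modules receive and return a dictionary of features; the pooled
        # embedding is stored under the "sentence_embedding" key
        features["sentence_embedding"] = features["sentence_embedding"] * self.scale
        return features

# Append the custom module to an existing model's module list
base = SentenceTransformer("all-MiniLM-L6-v2")
model = SentenceTransformer(modules=[*base, ScaleEmbeddings(scale=2.0)])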
Custom batch samplers via a class or function

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

class CustomSentenceTransformerTrainer(SentenceTransformerTrainer):
    # Custom batch samplers require subclassing the Trainer
    def get_batch_sampler(
        self,
        dataset,
        batch_size,
        drop_last,
        valid_label_columns=None,
        generator=None,
        seed=0,
    ):
        # Custom batch sampler logic here
        return ...

...

trainer = CustomSentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)
trainer.train()
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.sampler import DefaultBatchSampler
import torch

class CustomBatchSampler(DefaultBatchSampler):
    def __init__(
        self,
        dataset: Dataset,
        batch_size: int,
        drop_last: bool,
        valid_label_columns: list[str] | None = None,
        generator: torch.Generator | None = None,
        seed: int = 0,
    ):
        super().__init__(dataset, batch_size, drop_last, valid_label_columns, generator, seed)
        # Custom batch sampler logic here

args = SentenceTransformerTrainingArguments(
    # Other training arguments
    batch_sampler=CustomBatchSampler,  # Use the custom batch sampler class
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)
trainer.train()

# Or, use a function to initialize the batch sampler
def custom_batch_sampler(
    dataset: Dataset,
    batch_size: int,
    drop_last: bool,
    valid_label_columns: list[str] | None = None,
    generator: torch.Generator | None = None,
    seed: int = 0,
):
    # Custom batch sampler logic here
    return ...

args = SentenceTransformerTrainingArguments(
    # Other training arguments
    batch_sampler=custom_batch_sampler,  # Use the custom batch sampler function
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)
trainer.train()
Custom multi-dataset batch samplers via a class or function

v4.x

v5.x (recommended)

import torch
from torch.utils.data import BatchSampler, ConcatDataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

class CustomSentenceTransformerTrainer(SentenceTransformerTrainer):
    def get_multi_dataset_batch_sampler(
        self,
        dataset: ConcatDataset,
        batch_samplers: list[BatchSampler],
        generator: torch.Generator | None = None,
        seed: int | None = 0,
    ):
        # Custom multi-dataset batch sampler logic here
        return ...

...

trainer = CustomSentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)
trainer.train()
from torch.utils.data import BatchSampler, ConcatDataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.sampler import MultiDatasetDefaultBatchSampler
import torch

class CustomMultiDatasetBatchSampler(MultiDatasetDefaultBatchSampler):
    def __init__(
        self,
        dataset: ConcatDataset,
        batch_samplers: list[BatchSampler],
        generator: torch.Generator | None = None,
        seed: int = 0,
    ):
        super().__init__(dataset, batch_samplers=batch_samplers, generator=generator, seed=seed)
        # Custom multi-dataset batch sampler logic here

args = SentenceTransformerTrainingArguments(
    # Other training arguments
    multi_dataset_batch_sampler=CustomMultiDatasetBatchSampler,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)
trainer.train()

# Or, use a function to initialize the batch sampler
def custom_batch_sampler(
    dataset: ConcatDataset,
    batch_samplers: list[BatchSampler],
    generator: torch.Generator | None = None,
    seed: int = 0,
):
    # Custom multi-dataset batch sampler logic here
    return ...

args = SentenceTransformerTrainingArguments(
    # Other training arguments
    multi_dataset_batch_sampler=custom_batch_sampler,  # Use the custom batch sampler function
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)
trainer.train()
Custom learning rate for sections of the model

v4.x

v5.x (recommended)

# A bunch of hacky code to set different learning rates
# for different sections of the model
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

# Custom learning rate for each section of the model,
# mapping regular expressions of parameter names to learning rates
# Matching is done with 'search', not just 'match' or 'fullmatch'
learning_rate_mapping = {
    "SparseStaticEmbedding": 1e-4,
    "linear_.*": 1e-5,
}

args = SentenceTransformerTrainingArguments(
    ...,
    learning_rate=1e-5,  # Default learning rate
    learning_rate_mapping=learning_rate_mapping,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    ...
)
trainer.train()
Training with a composite loss

v4.x

v5.x (recommended)

class CustomLoss(torch.nn.Module):
    def __init__(self, model, ...):
        super().__init__()
        # Custom loss initialization code here

    def forward(self, features, labels):
        loss_component_one = self.compute_loss_one(features, labels)
        loss_component_two = self.compute_loss_two(features, labels)

        loss = loss_component_one * alpha + loss_component_two * beta
        return loss

loss = CustomLoss(model, ...)
class CustomLoss(torch.nn.Module):
    def __init__(self, model, ...):
        super().__init__()
        # Custom loss initialization code here

    def forward(self, features, labels):
        loss_component_one = self.compute_loss_one(features, labels)
        loss_component_two = self.compute_loss_two(features, labels)

        # You can now return a dictionary of loss components.
        # The trainer considers the full loss as the sum of all
        # components, but each component will also be logged separately.
        return {
            "loss_one": loss_component_one,
            "loss_two": loss_component_two,
        }

loss = CustomLoss(model, ...)
Accessing the underlying Transformer model

v4.x

v5.x (recommended)

from sentence_transformers import SentenceTransformer

# Sometimes, for one reason or another, you need to access the underlying
# Transformer directly. This was previously commonly done by accessing
# the first module, often 'Transformer', and then accessing the
# `auto_model` attribute.
model = SentenceTransformer("all-MiniLM-L6-v2")
print(model[0].auto_model)
# BertModel(
#   (embeddings): BertEmbeddings(
# ...
from sentence_transformers import SentenceTransformer

# Now, you can just use the `transformers_model` attribute on the model itself
# even if your model has non-standard modules.
model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.transformers_model)
# BertModel(
#   (embeddings): BertEmbeddings(
# ...

Migrating from v3.x to v4.x

Sentence Transformers v4 reworks the training of CrossEncoder reranker/pair classification models, replacing CrossEncoder.fit with CrossEncoderTrainer and CrossEncoderTrainingArguments. As with v3 and SentenceTransformer models, this update **soft-deprecates** CrossEncoder.fit, meaning that it still works, but it is recommended to switch to the new v4.x training format. Behind the scenes, this method now uses the new trainer.

Warning

If you do not have code that uses CrossEncoder.fit, then you will not have to make any changes to your code when updating from v3.x to v4.x.

If you do, your code still works, but it is recommended to switch to the new v4.x training format, as it allows for more training arguments and functionality. See the Training Overview for more details.

Old and new training flow

v3.x

v4.x (recommended)

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")

# 2. Define your train examples. You need more than just two examples...
train_examples = [
    InputExample(texts=["What are pandas?", "The giant panda ..."], label=1),
    InputExample(texts=["What's a panda?", "Mount Vesuvius is a ..."], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Finetune the model
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
from datasets import load_dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")

# 2. Load a dataset to finetune on, convert to required format
dataset = load_dataset("sentence-transformers/hotpotqa", "triplet", split="train")

def triplet_to_labeled_pair(batch):
    anchors = batch["anchor"]
    positives = batch["positive"]
    negatives = batch["negative"]
    return {
        "sentence_A": anchors * 2,
        "sentence_B": positives + negatives,
        "labels": [1] * len(positives) + [0] * len(negatives),
    }

dataset = dataset.map(triplet_to_labeled_pair, batched=True, remove_columns=dataset.column_names)
train_dataset = dataset.select(range(10_000))
eval_dataset = dataset.select(range(10_000, 11_000))

# 3. Define a loss function
loss = BinaryCrossEntropyLoss(model)

# 4. Create a trainer & train
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

# 5. Save the trained model
model.save_pretrained("models/mpnet-base-hotpotqa")
# model.push_to_hub("mpnet-base-hotpotqa")

Migration of parameters for `CrossEncoder` initialization and methods

v3.x

v4.x (recommended)

CrossEncoder(model_name=...)

Renamed to CrossEncoder(model_name_or_path=...)

CrossEncoder(automodel_args=...)

Renamed to CrossEncoder(model_kwargs=...)

CrossEncoder(tokenizer_args=...)

Renamed to CrossEncoder(tokenizer_kwargs=...)

CrossEncoder(config_args=...)

Renamed to CrossEncoder(config_kwargs=...)

CrossEncoder(cache_dir=...)

Renamed to CrossEncoder(cache_folder=...)

CrossEncoder(default_activation_function=...)

Renamed to CrossEncoder(activation_fn=...)

CrossEncoder(classifier_dropout=...)

Use CrossEncoder(config_kwargs={"classifier_dropout": ...}) instead

CrossEncoder.predict(activation_fct=...)

Renamed to CrossEncoder.predict(activation_fn=...)

CrossEncoder.rank(activation_fct=...)

Renamed to CrossEncoder.rank(activation_fn=...)

CrossEncoder.predict(num_workers=...)

Fully deprecated; no longer has any effect.

CrossEncoder.rank(num_workers=...)

Fully deprecated; no longer has any effect.

Note

The old keyword arguments still work, but they emit a warning recommending that you use the new names instead.
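Put together, a minimal sketch of loading and calling a CrossEncoder with the renamed arguments; the checkpoint name is only an example:

import torch
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L-6-v2",  # formerly model_name
    activation_fn=torch.nn.Sigmoid(),  # formerly default_activation_function
)

scores = model.predict([
    ("How many people live in Berlin?", "Berlin has about 3.7 million inhabitants."),
    ("How many people live in Berlin?", "Paris is the capital of France."),
])
print(scores)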

Migration of specific `CrossEncoder.fit` parameters

CrossEncoder.fit(train_dataloader)

v3.x

v4.x (recommended)

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")

# 2. Define your train examples. You need more than just two examples...
train_examples = [
    InputExample(texts=["What are pandas?", "The giant panda ..."], label=1),
    InputExample(texts=["What's a panda?", "Mount Vesuvius is a ..."], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Finetune the model
model.fit(train_dataloader=train_dataloader)
from datasets import Dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Define a training dataset
train_examples = [
    {
        "sentence_1": "A person on a horse jumps over a broken down airplane.",
        "sentence_2": "A person is outdoors, on a horse.",
        "label": 1,
    },
    {
        "sentence_1": "Children smiling and waving at camera",
        "sentence_2": "The kids are frowning",
        "label": 0,
    },
]
train_dataset = Dataset.from_list(train_examples)

# Define a loss function
loss = BinaryCrossEntropyLoss(model)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(loss_fct)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    loss_fct=torch.nn.MSELoss(),
)
from sentence_transformers.cross_encoder.losses import MSELoss

...

# Prepare the loss function
# See all valid losses in https://sbert.net.cn/docs/cross_encoder/loss_overview.html
loss = MSELoss(model)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(evaluator)

v3.x

v4.x (recommended)

...

# Load an evaluator
evaluator = CrossEncoderNanoBEIREvaluator()

# Finetune with an evaluator
model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
)
# Load an evaluator
evaluator = CrossEncoderNanoBEIREvaluator()

# Finetune with an evaluator
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
CrossEncoder.fit(epochs)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    num_train_epochs=1,
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(activation_fct)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    activation_fct=torch.nn.Sigmoid(),
)
...

# Prepare the loss function
loss = MSELoss(model, activation_fn=torch.nn.Sigmoid())

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(scheduler)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    scheduler="WarmupLinear",
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    # See https://hugging-face.cn/docs/transformers/main_classes/optimizer_schedules#transformers.SchedulerType
    lr_scheduler_type="linear"
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(warmup_steps)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    warmup_steps=1000,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    warmup_steps=1000,
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(optimizer_class, optimizer_params)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    optimizer_class=torch.optim.AdamW,
    optimizer_params={"eps": 1e-7},
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    # See https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py
    optim="adamw_torch",
    optim_args={"eps": 1e-7},
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(weight_decay)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    weight_decay=0.02,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    weight_decay=0.02,
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(evaluation_steps)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    evaluation_steps=1000,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    eval_strategy="steps",
    eval_steps=1000,
)

# Finetune the model
# Note: You need an eval_dataset and/or evaluator to evaluate
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
CrossEncoder.fit(output_path, save_best_model)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    output_path="my/path",
    save_best_model=True,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    load_best_model_at_end=True,
    metric_for_best_model="hotpotqa_ndcg@10", # E.g. `evaluator.primary_metric`
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

# Save the best model at my output path
model.save_pretrained("my/path")
CrossEncoder.fit(max_grad_norm)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    max_grad_norm=1,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    max_grad_norm=1,
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(use_amp)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    use_amp=True,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    fp16=True,
    bf16=False, # If your GPU supports it, you can also use bf16 instead
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
CrossEncoder.fit(callback)

v3.x

v4.x (recommended)

...

def printer_callback(score, epoch, steps):
    print(f"Score: {score:.4f} at epoch {epoch:d}, step {steps:d}")

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    callback=printer_callback,
)
from transformers import TrainerCallback

...

class PrinterCallback(TrainerCallback):
    # Subclass any method from https://hugging-face.cn/docs/transformers/main_classes/callback#transformers.TrainerCallback
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        print(f"Metrics: {metrics} at epoch {state.epoch:d}, step {state.global_step:d}")

printer_callback = PrinterCallback()

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
    callbacks=[printer_callback],
)
trainer.train()
CrossEncoder.fit(show_progress_bar)

v3.x

v4.x (recommended)

...

# Finetune the model
model.fit(
    train_dataloader=train_dataloader,
    show_progress_bar=True,
)
...

# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
    disable_tqdm=False,
)

# Finetune the model
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

Note

The old CrossEncoder.fit method still works; it has merely been soft-deprecated. It now uses the new CrossEncoderTrainer behind the scenes.

Migration of CrossEncoder evaluators

v3.x

v4.x (recommended)

CEBinaryAccuracyEvaluator

Use CrossEncoderClassificationEvaluator instead, an umbrella evaluator with the same inputs and outputs.

CEBinaryClassificationEvaluator

Use CrossEncoderClassificationEvaluator instead, an umbrella evaluator with the same inputs and outputs.

CECorrelationEvaluator

Use CrossEncoderCorrelationEvaluator instead; this evaluator has been renamed.

CEF1Evaluator

Use CrossEncoderClassificationEvaluator instead, an umbrella evaluator with the same inputs and outputs.

CESoftmaxAccuracyEvaluator

Use CrossEncoderClassificationEvaluator instead, an umbrella evaluator with the same inputs and outputs.

CERerankingEvaluator

Use CrossEncoderRerankingEvaluator instead; this evaluator has been renamed.

Note

The old evaluators still work; they simply warn you to update to the new evaluators.
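A minimal sketch of the umbrella evaluator; the pairs and labels are illustrative, and the import path is assumed to live under sentence_transformers.cross_encoder.evaluation. Per the table above, it takes the same inputs as the old CEBinaryClassificationEvaluator:

from sentence_transformers.cross_encoder.evaluation import CrossEncoderClassificationEvaluator

# 'model' is assumed to be a CrossEncoder loaded elsewhere
sentence_pairs = [
    ("What are pandas?", "The giant panda is a bear native to China."),
    ("What are pandas?", "Mount Vesuvius is a volcano in Italy."),
]
labels = [1, 0]

evaluator = CrossEncoderClassificationEvaluator(sentence_pairs, labels, name="panda-dev")
results = evaluator(model)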

Migrating from v2.x to v3.x

Sentence Transformers v3 reworks the training of SentenceTransformer embedding models, replacing SentenceTransformer.fit with SentenceTransformerTrainer and SentenceTransformerTrainingArguments. This update **soft-deprecates** SentenceTransformer.fit, meaning that it still works, but it is recommended to switch to the new v3.x training format. Behind the scenes, this method now uses the new trainer.

Warning

If you do not have code that uses SentenceTransformer.fit, then you will not have to make any changes to your code when updating from v2.x to v3.x.

If you do, your code still works, but it is recommended to switch to the new v3.x training format, as it allows for more training arguments and functionality. See the Training Overview for more details.

Old and new training flow

v2.x

v3.x (recommended)

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 1. Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer("microsoft/mpnet-base")

# 2. Define your train examples. You need more than just two examples...
train_examples = [
    InputExample(texts=[
        "A person on a horse jumps over a broken down airplane.",
        "A person is outdoors, on a horse.",
        "A person is at a diner, ordering an omelette.",
    ]),
    InputExample(texts=[
        "Children smiling and waving at camera",
        "There are children present",
        "The kids are frowning",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Define a loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

# 4. Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)

# 5. Save the trained model
model.save_pretrained("models/mpnet-base-all-nli")
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# 1. Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer("microsoft/mpnet-base")

# 2. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/all-nli", "triplet")
train_dataset = dataset["train"].select(range(10_000))
eval_dataset = dataset["dev"].select(range(1_000))

# 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)

# 4. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

# 5. Save the trained model
model.save_pretrained("models/mpnet-base-all-nli")
# model.push_to_hub("mpnet-base-all-nli")

Migration of specific `SentenceTransformer.fit` parameters

SentenceTransformer.fit(train_objectives)

v2.x

v3.x (recommended)

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Define a training dataloader
train_examples = [
    InputExample(texts=[
        "A person on a horse jumps over a broken down airplane.",
        "A person is outdoors, on a horse.",
        "A person is at a diner, ordering an omelette.",
    ]),
    InputExample(texts=[
        "Children smiling and waving at camera",
        "There are children present",
        "The kids are frowning",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Define a loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

# Finetune the model
model.fit(train_objectives=[(train_dataloader, train_loss)])
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Define a training dataset
train_examples = [
    {
        "anchor": "A person on a horse jumps over a broken down airplane.",
        "positive": "A person is outdoors, on a horse.",
        "negative": "A person is at a diner, ordering an omelette.",
    },
    {
        "anchor": "Children smiling and waving at camera",
        "positive": "There are children present",
        "negative": "The kids are frowning",
    },
]
train_dataset = Dataset.from_list(train_examples)

# Define a loss function
loss = MultipleNegativesRankingLoss(model)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(evaluator)

v2.x

v3.x (recommended)

...

# Load an evaluator
evaluator = NanoBEIREvaluator()

# Finetune with an evaluator
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
)
# Load an evaluator
evaluator = NanoBEIREvaluator()

# Finetune with an evaluator
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
SentenceTransformer.fit(epochs)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    num_train_epochs=1,
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(steps_per_epoch)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    steps_per_epoch=1000,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    max_steps=1000, # Note: max_steps is across all epochs, not per epoch
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(scheduler)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    scheduler="WarmupLinear",
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    # See https://hugging-face.cn/docs/transformers/main_classes/optimizer_schedules#transformers.SchedulerType
    lr_scheduler_type="linear"
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(warmup_steps)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    warmup_steps=1000,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    warmup_steps=1000,
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(optimizer_class, optimizer_params)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    optimizer_class=torch.optim.AdamW,
    optimizer_params={"eps": 1e-7},
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    # See https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py
    optim="adamw_torch",
    optim_args={"eps": 1e-7},
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(weight_decay)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    weight_decay=0.02,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    weight_decay=0.02,
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(evaluation_steps)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    evaluation_steps=1000,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    eval_strategy="steps",
    eval_steps=1000,
)

# Finetune the model
# Note: You need an eval_dataset and/or evaluator to evaluate
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
SentenceTransformer.fit(output_path, save_best_model)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    output_path="my/path",
    save_best_model=True,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    load_best_model_at_end=True,
    metric_for_best_model="all_nli_cosine_accuracy", # E.g. `evaluator.primary_metric`
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

# Save the best model at my output path
model.save_pretrained("my/path")
SentenceTransformer.fit(max_grad_norm)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    max_grad_norm=1,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    max_grad_norm=1,
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(use_amp)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    use_amp=True,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    fp16=True,
    bf16=False, # If your GPU supports it, you can also use bf16 instead
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(callback)

v2.x

v3.x (recommended)

...

def printer_callback(score, epoch, steps):
    print(f"Score: {score:.4f} at epoch {epoch:d}, step {steps:d}")

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    callback=printer_callback,
)
from transformers import TrainerCallback

...

class PrinterCallback(TrainerCallback):
    # Subclass any method from https://hugging-face.cn/docs/transformers/main_classes/callback#transformers.TrainerCallback
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        print(f"Metrics: {metrics} at epoch {state.epoch:d}, step {state.global_step:d}")

printer_callback = PrinterCallback()

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
    callbacks=[printer_callback],
)
trainer.train()
SentenceTransformer.fit(show_progress_bar)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    show_progress_bar=True,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    disable_tqdm=False,
)

# Finetune the model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
SentenceTransformer.fit(checkpoint_path, checkpoint_save_steps, checkpoint_save_total_limit)

v2.x

v3.x (recommended)

...

# Finetune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    checkpoint_path="checkpoints",
    checkpoint_save_steps=5000,
    checkpoint_save_total_limit=2,
)
...

# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
    eval_strategy="steps",
    eval_steps=5000,
    save_strategy="steps",
    save_steps=5000,
    save_total_limit=2,
)

# Finetune the model
# Note: You need an eval_dataset and/or evaluator to checkpoint
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

Migration of custom Datasets and DataLoaders used in `SentenceTransformer.fit`

v2.x

v3.x (recommended)

ParallelSentencesDataset

Manually create a Dataset and add a label column for the embeddings. Alternatively, consider loading one of our pre-provided Parallel Sentences Datasets.

SentenceLabelDataset

Load or create a Dataset and use SentenceTransformerTrainingArguments(batch_sampler=BatchSamplers.GROUP_BY_LABEL) (uses the GroupByLabelBatchSampler). Recommended for the BatchTripletLosses.

DenoisingAutoEncoderDataset

Manually add a column with noisy text to a Dataset containing texts, e.g. with Dataset.map (see the sketch after this table).

NoDuplicatesDataLoader

Load or create a Dataset and use SentenceTransformerTrainingArguments(batch_sampler=BatchSamplers.NO_DUPLICATES) (uses the NoDuplicatesBatchSampler). Recommended for MultipleNegativesRankingLoss.
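A minimal sketch of both replacements; the word-dropping noise function, the column names, and the output directory are illustrative only:

import random
from datasets import Dataset
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

def add_noise(sample):
    # Illustrative noise: randomly drop roughly 40% of the words
    words = sample["text"].split()
    kept = [word for word in words if random.random() > 0.4] or words
    return {"noisy": " ".join(kept)}

# Replacement for DenoisingAutoEncoderDataset: add the noisy column with Dataset.map
dataset = Dataset.from_dict({"text": ["A person on a horse jumps over a broken down airplane."]})
dataset = dataset.map(add_noise)
print(dataset[0])

# Replacement for NoDuplicatesDataLoader: select the batch sampler via the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="models/my-model",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)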