Migration Guide
Migrating from v4.x to v5.x
Sentence Transformers v5 introduces SparseEncoder embedding models (see Sparse Encoder Usage for more details on them), together with an extensive training suite for them, including SparseEncoderTrainer and SparseEncoderTrainingArguments. Unlike v3 (which updated SentenceTransformer) and v4 (which updated CrossEncoder), this update does not deprecate any training methods.
Migration of model.encode
We have introduced two new methods, encode_query() and encode_document(), which are recommended instead of the encode() method when working on information retrieval tasks. They are specialized versions of encode() that differ in two ways:
- If no prompt_name or prompt is provided, it uses a predefined "query" prompt from the model's prompts dictionary, if available.
- It sets the task to "query". If the model has a Router module, the "query" task type is used to route the input through the appropriate sub-module.
The same methods also work for SparseEncoder models (see the sketch after the table below).
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query = "What is the capital of France?"
document = "Paris is the capital of France."
# Use the prompt with the name "query" for the query
query_embedding = model.encode(query, prompt_name="query")
document_embedding = model.encode(document)
print(query_embedding.shape, document_embedding.shape)
# => (1, 768) (1, 768)
|
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query = "What is the capital of France?"
document = "Paris is the capital of France."
# The new encode_query and encode_document methods call encode,
# but with the prompt name set to "query" or "document" if the
# model has prompts saved, and the task set to "query" or "document",
# if the model has a Router module.
query_embedding = model.encode_query(query)
document_embedding = model.encode_document(document)
print(query_embedding.shape, document_embedding.shape)
# => (1, 768) (1, 768)
|
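The same pattern carries over to sparse models; below is a minimal sketch (the SparseEncoder model name is just an illustrative choice, not one prescribed by this guide):

```python
from sentence_transformers import SparseEncoder

# Minimal sketch: encode_query / encode_document also exist on SparseEncoder.
# The model below is only an example of a pretrained sparse encoder.
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

query_embedding = model.encode_query("What is the capital of France?")
document_embedding = model.encode_document("Paris is the capital of France.")

# Sparse embeddings are vocabulary-sized, mostly-zero vectors; they can still be
# scored against each other with the model's similarity helper.
print(model.similarity(query_embedding, document_embedding))
```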
We have also deprecated the encode_multi_process() method, which was used to encode large datasets in parallel with multiple processes. It has been superseded by the encode() method together with the device, pool, and chunk_size arguments. Pass a list of devices to the device argument to use multiple processes, or a single device to use a single process. The pool argument can be used to pass a multi-process pool that is reused across calls (see the sketch after the table below), while the chunk_size argument controls the size of the chunks sent to each process in parallel.
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer
def main():
model = SentenceTransformer("all-mpnet-base-v2")
texts = ["The weather is so nice!", "It's so sunny outside.", ...]
pool = model.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
embeddings = model.encode_multi_process(texts, pool, chunk_size=512)
model.stop_multi_process_pool(pool)
print(embeddings.shape)
# => (4000, 768)
if __name__ == "__main__":
main()
|
from sentence_transformers import SentenceTransformer
def main():
model = SentenceTransformer("all-mpnet-base-v2")
texts = ["The weather is so nice!", "It's so sunny outside.", ...]
embeddings = model.encode(texts, device=["cpu", "cpu", "cpu", "cpu"], chunk_size=512)
print(embeddings.shape)
# => (4000, 768)
if __name__ == "__main__":
main()
|
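If you want to reuse one pool across several encode() calls, a sketch along the following lines should work; it combines the new pool argument with the start/stop helpers already shown in the v4.x snippet above:

```python
from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    # Start one multi-process pool and reuse it across multiple encode calls
    pool = model.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
    for texts in [["The weather is so nice!"], ["It's so sunny outside."]]:
        embeddings = model.encode(texts, pool=pool, chunk_size=512)
        print(embeddings.shape)
    # Stop the pool once all encode calls are done
    model.stop_multi_process_pool(pool)

if __name__ == "__main__":
    main()
```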
The `truncate_dim` parameter lets you reduce the dimensionality of embeddings by truncating them. This is useful for optimizing storage and retrieval while preserving most of the semantic information. Research suggests that the first dimensions of Transformer embeddings typically carry most of the important information.
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer
# To truncate embeddings to a specific dimension,
# you had to specify the dimension when loading
model = SentenceTransformer(
"mixedbread-ai/mxbai-embed-large-v1",
truncate_dim=384,
)
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (2, 384)
|
from sentence_transformers import SentenceTransformer
# Now you can either specify the dimension when loading the model...
model = SentenceTransformer(
"mixedbread-ai/mxbai-embed-large-v1",
truncate_dim=384,
)
sentences = ["This is an example sentence", "Each sentence is converted"]
# ... or you can specify it when encoding
embeddings = model.encode(sentences, truncate_dim=256)
print(embeddings.shape)
# => (2, 256)
# The encode parameter has priority, but otherwise the model truncate_dim is used
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (2, 384)
|
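As a small follow-up sketch (reusing the example model from above), truncated embeddings can still be compared as usual, for instance with the built-in similarity helper:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=384)

query_embedding = model.encode_query("What is the capital of France?")
document_embedding = model.encode_document("Paris is the capital of France.")

# The 384-dimensional truncated embeddings can be scored like full-size ones
print(model.similarity(query_embedding, document_embedding))
# => a 1x1 similarity matrix (torch tensor)
```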
Migrating from Asym to Router
The `Asym` module has been renamed to the new Router module and updated: it offers the same functionality with a more consistent API and additional features. The new Router module allows more flexible routing of different tasks, such as query and document embeddings, and is recommended for asymmetric models whose inputs must be processed differently depending on the task, notably queries versus documents.
The encode_query() and encode_document() methods automatically set the `task` argument, which the Router module uses to route inputs to the query or document sub-modules, respectively.
Asym -> Router
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer, models
# Load a Sentence Transformer model and add an asymmetric router
# for different task-specific post-processing (here 'sts' vs 'classification')
model = SentenceTransformer("microsoft/mpnet-base")
dim = model.get_sentence_embedding_dimension()
asym_model = models.Asym({
'sts': [models.Dense(dim, dim)],
'classification': [models.Dense(dim, dim)]
})
model.add_module("asym", asym_model)
|
from sentence_transformers import SentenceTransformer, models
# Load a Sentence Transformer model and add a router
# for different task-specific post-processing (here 'sts' vs 'classification')
model = SentenceTransformer("microsoft/mpnet-base")
dim = model.get_sentence_embedding_dimension()
router_model = models.Router({
'sts': [models.Dense(dim, dim)],
'classification': [models.Dense(dim, dim)]
})
model.add_module("router", router_model)
|
Asym -> Router for queries and documents
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Asym, Normalize
# Use a regular SentenceTransformer for the document embeddings,
# and a static embedding model for the query embeddings
document_embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query_embedder = SentenceTransformer("static-retrieval-mrl-en-v1")
asym = Asym({
"query": list(query_embedder.children()),
"document": list(document_embedder.children()),
})
normalize = Normalize()
# Create an asymmetric model with different encoders for queries and documents
model = SentenceTransformer(
modules=[asym, normalize],
)
# ... requires more training to align the vector spaces
# Use the query & document routes
query_embedding = model.encode({"query": "What is the capital of France?"})
document_embedding = model.encode({"document": "Paris is the capital of France."})
|
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Router, Normalize
# Use a regular SentenceTransformer for the document embeddings,
# and a static embedding model for the query embeddings
document_embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query_embedder = SentenceTransformer("static-retrieval-mrl-en-v1")
router = Router.for_query_document(
query_modules=list(query_embedder.children()),
document_modules=list(document_embedder.children()),
)
normalize = Normalize()
# Create an asymmetric model with different encoders for queries and documents
model = SentenceTransformer(
modules=[router, normalize],
)
# ... requires more training to align the vector spaces
# Use the query & document routes
query_embedding = model.encode_query("What is the capital of France?")
document_embedding = model.encode_document("Paris is the capital of France.")
|
Asym inference -> Router inference
v4.x |
v5.x (recommended) |
---|---|
...
# Use the query & document routes as keys in dictionaries
query_embedding = model.encode([{"query": "What is the capital of France?"}])
document_embedding = model.encode([
{"document": "Paris is the capital of France."},
{"document": "Berlin is the capital of Germany."},
])
class_embedding = model.encode(
[{"classification": "S&P500 is down 2.1% today."}],
)
|
...
# Use the query & document routes with encode_query/encode_document
query_embedding = model.encode_query(["What is the capital of France?"])
document_embedding = model.encode_document([
"Paris is the capital of France.",
"Berlin is the capital of Germany.",
])
# When using routes other than "query" and "document", you can use the `task` parameter
# on model.encode
class_embedding = model.encode(
["S&P500 is down 2.1% today."],
task="classification" # or any other task defined in the model Router
)
|
Asym training -> Router training
v4.x |
v5.x (recommended) |
---|---|
...
# Prepare a training dataset for an Asym model with "query" and "document" keys
train_dataset = Dataset.from_dict({
"query": [
"is toprol xl the same as metoprolol?",
"are eyes always the same size?",
],
"answer": [
"Metoprolol succinate is also known by the brand name Toprol XL.",
"The eyes are always the same size from birth to death.",
],
})
# This mapper turns normal texts into a dictionary mapping Asym keys to the text
def mapper(sample):
return {
"question": {"query": sample["question"]},
"answer": {"document": sample["answer"]},
}
train_dataset = train_dataset.map(mapper)
print(train_dataset[0])
"""
{
"question": {"query": "is toprol xl the same as metoprolol?"},
"answer": {"document": "Metoprolol succinate is also known by the ..."}
}
"""
trainer = SentenceTransformerTrainer( # Or SparseEncoderTrainer
model=model,
args=training_args,
train_dataset=train_dataset,
...
)
|
...
# Prepare a training dataset with "question" and "answer" columns;
# the Router keys are assigned via router_mapping in the training arguments below
train_dataset = Dataset.from_dict({
"question": [
"is toprol xl the same as metoprolol?",
"are eyes always the same size?",
],
"answer": [
"Metoprolol succinate is also known by the brand name Toprol XL.",
"The eyes are always the same size from birth to death.",
],
})
print(train_dataset[0])
"""
{
"question": "is toprol xl the same as metoprolol?",
"answer": "Metoprolol succinate is also known by the brand name Toprol XL."
}
"""
args = SentenceTransformerTrainingArguments( # Or SparseEncoderTrainingArguments
# Map dataset columns to the Router keys
router_mapping={
"question": "query",
"answer": "document",
}
)
trainer = SentenceTransformerTrainer( # Or SparseEncoderTrainer
model=model,
args=args,
train_dataset=train_dataset,
...
)
|
Migration of advanced usage
Module and InputModule convenience superclasses
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer
import torch
class MyModule(torch.nn.Module):
def __init__(self):
super().__init__()
# Custom code here
model = SentenceTransformer(modules=[MyModule()])
|
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Module, InputModule
# The new Module and InputModule superclasses provide convenience methods
# like 'load', 'load_file_path', 'load_dir_path', 'load_torch_weights',
# 'save_config', 'save_torch_weights', 'get_config_dict'
# InputModule is meant to be used as the first module; it requires the
# 'tokenize' method to be implemented
class MyModule(Module):
def __init__(self):
super().__init__()
# Custom initialization code here
model = SentenceTransformer(modules=[MyModule()])
|
Custom batch samplers via classes or functions
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
class CustomSentenceTransformerTrainer(SentenceTransformerTrainer):
# Custom batch samplers require subclassing the Trainer
def get_batch_sampler(
self,
dataset,
batch_size,
drop_last,
valid_label_columns=None,
generator=None,
seed=0,
):
# Custom batch sampler logic here
return ...
...
trainer = CustomSentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
...
)
trainer.train()
|
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.sampler import DefaultBatchSampler
import torch
class CustomBatchSampler(DefaultBatchSampler):
def __init__(
self,
dataset: Dataset,
batch_size: int,
drop_last: bool,
valid_label_columns: list[str] | None = None,
generator: torch.Generator | None = None,
seed: int = 0,
):
super().__init__(dataset, batch_size, drop_last, valid_label_columns, generator, seed)
# Custom batch sampler logic here
args = SentenceTransformerTrainingArguments(
# Other training arguments
batch_sampler=CustomBatchSampler, # Use the custom batch sampler class
)
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
...
)
trainer.train()
# Or, use a function to initialize the batch sampler
def custom_batch_sampler(
dataset: Dataset,
batch_size: int,
drop_last: bool,
valid_label_columns: list[str] | None = None,
generator: torch.Generator | None = None,
seed: int = 0,
):
# Custom batch sampler logic here
return ...
args = SentenceTransformerTrainingArguments(
# Other training arguments
batch_sampler=custom_batch_sampler, # Use the custom batch sampler function
)
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
...
)
trainer.train()
|
Custom multi-dataset batch samplers via classes or functions
v4.x |
v5.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
class CustomSentenceTransformerTrainer(SentenceTransformerTrainer):
def get_multi_dataset_batch_sampler(
self,
dataset: ConcatDataset,
batch_samplers: list[BatchSampler],
generator: torch.Generator | None = None,
seed: int | None = 0,
):
# Custom multi-dataset batch sampler logic here
return ...
...
trainer = CustomSentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
...
)
trainer.train()
|
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.sampler import MultiDatasetDefaultBatchSampler
import torch
class CustomMultiDatasetBatchSampler(MultiDatasetDefaultBatchSampler):
def __init__(
self,
dataset: ConcatDataset,
batch_samplers: list[BatchSampler],
generator: torch.Generator | None = None,
seed: int = 0,
):
super().__init__(dataset, batch_samplers=batch_samplers, generator=generator, seed=seed)
# Custom multi-dataset batch sampler logic here
args = SentenceTransformerTrainingArguments(
# Other training arguments
multi_dataset_batch_sampler=CustomMultiDatasetBatchSampler,
)
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
...
)
trainer.train()
# Or, use a function to initialize the batch sampler
def custom_batch_sampler(
dataset: ConcatDataset,
batch_samplers: list[BatchSampler],
generator: torch.Generator | None = None,
seed: int = 0,
):
# Custom multi-dataset batch sampler logic here
return ...
args = SentenceTransformerTrainingArguments(
# Other training arguments
multi_dataset_batch_sampler=custom_batch_sampler, # Use the custom batch sampler function
)
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
...
)
trainer.train()
|
Custom learning rates for model sections
v4.x |
v5.x (recommended) |
---|---|
# A bunch of hacky code to set different learning rates
# for different sections of the model
|
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
# Custom learning rate for each section of the model,
# mapping regular expressions of parameter names to learning rates
# Matching is done with 'search', not just 'match' or 'fullmatch'
learning_rate_mapping = {
"SparseStaticEmbedding": 1e-4,
"linear_.*": 1e-5,
}
args = SentenceTransformerTrainingArguments(
...,
learning_rate=1e-5, # Default learning rate
learning_rate_mapping=learning_rate_mapping,
)
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
...
)
trainer.train()
|
Training with composite losses
v4.x |
v5.x (recommended) |
---|---|
class CustomLoss(torch.nn.Module):
def __init__(self, model, ...):
super().__init__()
# Custom loss initialization code here
def forward(self, features, labels):
loss_component_one = self.compute_loss_one(features, labels)
loss_component_two = self.compute_loss_two(features, labels)
loss = loss_component_one * alpha + loss_component_two * beta
return loss
loss = CustomLoss(model, ...)
|
class CustomLoss(torch.nn.Module):
def __init__(self, model, ...):
super().__init__()
# Custom loss initialization code here
def forward(self, features, labels):
loss_component_one = self.compute_loss_one(features, labels)
loss_component_two = self.compute_loss_two(features, labels)
# You can now return a dictionary of loss components.
# The trainer considers the full loss as the sum of all
# components, but each component will also be logged separately.
return {
"loss_one": loss_component_one,
"loss_two": loss_component_two,
}
loss = CustomLoss(model, ...)
|
Migrating from v3.x to v4.x
Sentence Transformers v4 refactors the training of CrossEncoder reranker / pair classification models, replacing CrossEncoder.fit with CrossEncoderTrainer and CrossEncoderTrainingArguments. Like v3 did for SentenceTransformer models, this update **soft-deprecates** CrossEncoder.fit: it still works, but it is recommended to switch to the new v4.x training format. Under the hood, the method now uses the new trainer.
Warning
If you do not have code that uses CrossEncoder.fit, updating from v3.x to v4.x requires no code changes.
If you do, your code will still run, but switching to the new v4.x training format is recommended, as it enables many more training arguments and features. See the Training Overview for more details.
v3.x |
v4.x (recommended) |
---|---|
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader
# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")
# 2. Define your train examples. You need more than just two examples...
train_examples = [
InputExample(texts=["What are pandas?", "The giant panda ..."], label=1),
InputExample(texts=["What's a panda?", "Mount Vesuvius is a ..."], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# 3. Finetune the model
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
|
from datasets import load_dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")
# 2. Load a dataset to finetune on, convert to required format
dataset = load_dataset("sentence-transformers/hotpotqa", "triplet", split="train")
def triplet_to_labeled_pair(batch):
anchors = batch["anchor"]
positives = batch["positive"]
negatives = batch["negative"]
return {
"sentence_A": anchors * 2,
"sentence_B": positives + negatives,
"labels": [1] * len(positives) + [0] * len(negatives),
}
dataset = dataset.map(triplet_to_labeled_pair, batched=True, remove_columns=dataset.column_names)
train_dataset = dataset.select(range(10_000))
eval_dataset = dataset.select(range(10_000, 11_000))
# 3. Define a loss function
loss = BinaryCrossEntropyLoss(model)
# 4. Create a trainer & train
trainer = CrossEncoderTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
)
trainer.train()
# 5. Save the trained model
model.save_pretrained("models/mpnet-base-hotpotqa")
# model.push_to_hub("mpnet-base-hotpotqa")
|
Migration of `CrossEncoder` initialization and method parameters
v3.x |
v4.x (recommended) |
---|---|
CrossEncoder(model_name=...) |
Renamed to CrossEncoder(model_name_or_path=...) |
CrossEncoder(automodel_args=...) |
Renamed to CrossEncoder(model_kwargs=...) |
CrossEncoder(tokenizer_args=...) |
Renamed to CrossEncoder(tokenizer_kwargs=...) |
CrossEncoder(config_args=...) |
Renamed to CrossEncoder(config_kwargs=...) |
CrossEncoder(cache_dir=...) |
Renamed to CrossEncoder(cache_folder=...) |
CrossEncoder(default_activation_function=...) |
Renamed to CrossEncoder(activation_fn=...) |
CrossEncoder(classifier_dropout=...) |
Use CrossEncoder(config_kwargs={"classifier_dropout": ...}) instead |
CrossEncoder.predict(activation_fct=...) |
Renamed to CrossEncoder.predict(activation_fn=...) |
CrossEncoder.rank(activation_fct=...) |
Renamed to CrossEncoder.rank(activation_fn=...) |
CrossEncoder.predict(num_workers=...) |
Fully deprecated; no longer has any effect. |
CrossEncoder.rank(num_workers=...) |
Fully deprecated; no longer has any effect. |
Note
The old keyword arguments still work, but they emit a warning recommending that you switch to the new names.
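As a hedged before/after sketch of two of these renames (the model name here is only illustrative):

```python
import torch
from sentence_transformers import CrossEncoder

# v3.x style: still runs in v4.x, but emits a deprecation warning
model = CrossEncoder(
    "cross-encoder/stsb-roberta-base",
    default_activation_function=torch.nn.Sigmoid(),
    cache_dir="path/to/cache",
)

# v4.x style with the renamed keyword arguments
model = CrossEncoder(
    "cross-encoder/stsb-roberta-base",
    activation_fn=torch.nn.Sigmoid(),
    cache_folder="path/to/cache",
)
```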
Migration of parameters specific to `CrossEncoder.fit`
CrossEncoder.fit(train_dataloader)
v3.x |
v4.x (recommended) |
---|---|
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader
# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")
# 2. Define your train examples. You need more than just two examples...
train_examples = [
InputExample(texts=["What are pandas?", "The giant panda ..."], label=1),
InputExample(texts=["What's a panda?", "Mount Vesuvius is a ..."], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# 3. Finetune the model
model.fit(train_dataloader=train_dataloader)
|
from datasets import Dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
# Define a training dataset
train_examples = [
{
"sentence_1": "A person on a horse jumps over a broken down airplane.",
"sentence_2": "A person is outdoors, on a horse.",
"label": 1,
},
{
"sentence_1": "Children smiling and waving at camera",
"sentence_2": "The kids are frowning",
"label": 0,
},
]
train_dataset = Dataset.from_list(train_examples)
# Define a loss function
loss = BinaryCrossEntropyLoss(model)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(loss_fct)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
loss_fct=torch.nn.MSELoss(),
)
|
from sentence_transformers.cross_encoder.losses import MSELoss
...
# Prepare the loss function
# See all valid losses in https://sbert.net.cn/docs/cross_encoder/loss_overview.html
loss = MSELoss(model)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(evaluator)
v3.x |
v4.x (recommended) |
---|---|
...
# Load an evaluator
evaluator = CrossEncoderNanoBEIREvaluator()
# Finetune with an evaluator
model.fit(
train_dataloader=train_dataloader,
evaluator=evaluator,
)
|
# Load an evaluator
evaluator = CrossEncoderNanoBEIREvaluator()
# Finetune with an evaluator
trainer = CrossEncoderTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=evaluator,
)
trainer.train()
|
CrossEncoder.fit(epochs)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
epochs=1,
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
num_train_epochs=1,
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(activation_fct)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
activation_fct=torch.nn.Sigmoid(),
)
|
...
# Prepare the loss function
loss = MSELoss(model, activation_fn=torch.nn.Sigmoid())
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(scheduler)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
scheduler="WarmupLinear",
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
# See https://hugging-face.cn/docs/transformers/main_classes/optimizer_schedules#transformers.SchedulerType
lr_scheduler_type="linear"
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(warmup_steps)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
warmup_steps=1000,
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
warmup_steps=1000,
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(optimizer_class, optimizer_params)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
optimizer_class=torch.optim.AdamW,
optimizer_params={"eps": 1e-7},
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
# See https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py
optim="adamw_torch",
optim_args={"eps": 1e-7},
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(weight_decay)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
weight_decay=0.02,
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
weight_decay=0.02,
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(evaluation_steps)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
evaluator=evaluator,
evaluation_steps=1000,
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
eval_strategy="steps",
eval_steps=1000,
)
# Finetune the model
# Note: You need an eval_dataset and/or evaluator to evaluate
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=evaluator,
)
trainer.train()
|
CrossEncoder.fit(output_path, save_best_model)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
evaluator=evaluator,
output_path="my/path",
save_best_model=True,
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
load_best_model_at_end=True,
metric_for_best_model="hotpotqa_ndcg@10", # E.g. `evaluator.primary_metric`
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
# Save the best model at my output path
model.save_pretrained("my/path")
|
CrossEncoder.fit(max_grad_norm)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
max_grad_norm=1,
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
max_grad_norm=1,
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(use_amp)
v3.x |
v4.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
use_amp=True,
)
|
...
# Prepare the Training Arguments
args = CrossEncoderTrainingArguments(
fp16=True,
bf16=False, # If your GPU supports it, you can also use bf16 instead
)
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
CrossEncoder.fit(callback)
v3.x |
v4.x (recommended) |
---|---|
...
def printer_callback(score, epoch, steps):
print(f"Score: {score:.4f} at epoch {epoch:d}, step {steps:d}")
# Finetune the model
model.fit(
train_dataloader=train_dataloader,
callback=printer_callback,
)
|
from transformers import TrainerCallback
...
class PrinterCallback(TrainerCallback):
# Subclass any method from https://hugging-face.cn/docs/transformers/main_classes/callback#transformers.TrainerCallback
def on_evaluate(self, args, state, control, metrics=None, **kwargs):
print(f"Metrics: {metrics} at epoch {state.epoch:d}, step {state.global_step:d}")
printer_callback = PrinterCallback()
# Finetune the model
trainer = CrossEncoderTrainer(
model=model,
train_dataset=train_dataset,
loss=loss,
callbacks=[printer_callback],
)
trainer.train()
|
Note
The old CrossEncoder.fit method still works; it has merely been soft-deprecated. Under the hood, it now uses the new CrossEncoderTrainer.
Migration of CrossEncoder evaluators
v3.x |
v4.x (recommended) |
---|---|
CEBinaryAccuracyEvaluator |
Use CrossEncoderClassificationEvaluator instead |
CEBinaryClassificationEvaluator |
Use CrossEncoderClassificationEvaluator instead |
CECorrelationEvaluator |
Use CrossEncoderCorrelationEvaluator instead |
CEF1Evaluator |
Use CrossEncoderClassificationEvaluator instead |
CESoftmaxAccuracyEvaluator |
Use CrossEncoderClassificationEvaluator instead |
CERerankingEvaluator |
Renamed to CrossEncoderRerankingEvaluator |
Note
The old evaluators still work; they will simply warn you to update to the new evaluators.
Migrating from v2.x to v3.x
Sentence Transformers v3 refactors the training of SentenceTransformer embedding models, replacing SentenceTransformer.fit with SentenceTransformerTrainer and SentenceTransformerTrainingArguments. This update **soft-deprecates** SentenceTransformer.fit: it still works, but it is recommended to switch to the new v3.x training format. Under the hood, the method now uses the new trainer.
Warning
If you do not have code that uses SentenceTransformer.fit, updating from v2.x to v3.x requires no code changes.
If you do, your code will still run, but switching to the new v3.x training format is recommended, as it enables many more training arguments and features. See the Training Overview for more details.
v2.x |
v3.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# 1. Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer("microsoft/mpnet-base")
# 2. Define your train examples. You need more than just two examples...
train_examples = [
InputExample(texts=[
"A person on a horse jumps over a broken down airplane.",
"A person is outdoors, on a horse.",
"A person is at a diner, ordering an omelette.",
]),
InputExample(texts=[
"Children smiling and waving at camera",
"There are children present",
"The kids are frowning",
]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# 3. Define a loss function
train_loss = losses.MultipleNegativesRankingLoss(model)
# 4. Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
warmup_steps=100,
)
# 5. Save the trained model
model.save_pretrained("models/mpnet-base-all-nli")
|
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
# 1. Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer("microsoft/mpnet-base")
# 2. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/all-nli", "triplet")
train_dataset = dataset["train"].select(range(10_000))
eval_dataset = dataset["dev"].select(range(1_000))
# 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)
# 4. Create a trainer & train
trainer = SentenceTransformerTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
)
trainer.train()
# 5. Save the trained model
model.save_pretrained("models/mpnet-base-all-nli")
# model.push_to_hub("mpnet-base-all-nli")
|
Migration of parameters specific to `SentenceTransformer.fit`
SentenceTransformer.fit(train_objectives)
v2.x |
v3.x (recommended) |
---|---|
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Define a training dataloader
train_examples = [
InputExample(texts=[
"A person on a horse jumps over a broken down airplane.",
"A person is outdoors, on a horse.",
"A person is at a diner, ordering an omelette.",
]),
InputExample(texts=[
"Children smiling and waving at camera",
"There are children present",
"The kids are frowning",
]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Define a loss function
train_loss = losses.MultipleNegativesRankingLoss(model)
# Finetune the model
model.fit(train_objectives=[(train_dataloader, train_loss)])
|
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
# Define a training dataset
train_examples = [
{
"anchor": "A person on a horse jumps over a broken down airplane.",
"positive": "A person is outdoors, on a horse.",
"negative": "A person is at a diner, ordering an omelette.",
},
{
"anchor": "Children smiling and waving at camera",
"positive": "There are children present",
"negative": "The kids are frowning",
},
]
train_dataset = Dataset.from_list(train_examples)
# Define a loss function
loss = MultipleNegativesRankingLoss(model)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(evaluator)
v2.x |
v3.x (recommended) |
---|---|
...
# Load an evaluator
evaluator = NanoBEIREvaluator()
# Finetune with an evaluator
model.fit(
train_objectives=[(train_dataloader, train_loss)],
evaluator=evaluator,
)
|
# Load an evaluator
evaluator = NanoBEIREvaluator()
# Finetune with an evaluator
trainer = SentenceTransformerTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=evaluator,
)
trainer.train()
|
SentenceTransformer.fit(epochs)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
num_train_epochs=1,
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(steps_per_epoch)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
steps_per_epoch=1000,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
max_steps=1000, # Note: max_steps is across all epochs, not per epoch
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(scheduler)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
scheduler="WarmupLinear",
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
# See https://hugging-face.cn/docs/transformers/main_classes/optimizer_schedules#transformers.SchedulerType
lr_scheduler_type="linear"
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(warmup_steps)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
warmup_steps=1000,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
warmup_steps=1000,
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(optimizer_class, optimizer_params)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
optimizer_class=torch.optim.AdamW,
optimizer_params={"eps": 1e-7},
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
# See https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py
optim="adamw_torch",
optim_args={"eps": 1e-7},
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(weight_decay)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
weight_decay=0.02,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
weight_decay=0.02,
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(evaluation_steps)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
evaluator=evaluator,
evaluation_steps=1000,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
eval_strategy="steps",
eval_steps=1000,
)
# Finetune the model
# Note: You need an eval_dataset and/or evaluator to evaluate
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=evaluator,
)
trainer.train()
|
SentenceTransformer.fit(output_path, save_best_model)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
evaluator=evaluator,
output_path="my/path",
save_best_model=True,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
load_best_model_at_end=True,
metric_for_best_model="all_nli_cosine_accuracy", # E.g. `evaluator.primary_metric`
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
# Save the best model at my output path
model.save_pretrained("my/path")
|
SentenceTransformer.fit(max_grad_norm)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
max_grad_norm=1,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
max_grad_norm=1,
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(use_amp)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
use_amp=True,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
fp16=True,
bf16=False, # If your GPU supports it, you can also use bf16 instead
)
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
|
SentenceTransformer.fit(callback)
v2.x |
v3.x (recommended) |
---|---|
...
def printer_callback(score, epoch, steps):
print(f"Score: {score:.4f} at epoch {epoch:d}, step {steps:d}")
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
callback=printer_callback,
)
|
from transformers import TrainerCallback
...
class PrinterCallback(TrainerCallback):
# Subclass any method from https://hugging-face.cn/docs/transformers/main_classes/callback#transformers.TrainerCallback
def on_evaluate(self, args, state, control, metrics=None, **kwargs):
print(f"Metrics: {metrics} at epoch {state.epoch:d}, step {state.global_step:d}")
printer_callback = PrinterCallback()
# Finetune the model
trainer = SentenceTransformerTrainer(
model=model,
train_dataset=train_dataset,
loss=loss,
callbacks=[printer_callback],
)
trainer.train()
|
SentenceTransformer.fit(checkpoint_path, checkpoint_save_steps, checkpoint_save_total_limit)
v2.x |
v3.x (recommended) |
---|---|
...
# Finetune the model
model.fit(
train_objectives=[(train_dataloader, train_loss)],
checkpoint_path="checkpoints",
checkpoint_save_steps=5000,
checkpoint_save_total_limit=2,
)
|
...
# Prepare the Training Arguments
args = SentenceTransformerTrainingArguments(
eval_strategy="steps",
eval_steps=5000,
save_strategy="steps",
save_steps=5000,
save_total_limit=2,
)
# Finetune the model
# Note: You need an eval_dataset and/or evaluator to checkpoint
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
)
trainer.train()
|
Migration of custom Datasets and DataLoaders used in `SentenceTransformer.fit`
v2.x |
v3.x (recommended) |
---|---|
ParallelSentencesDataset |
Manually creating a Dataset and adding a label column with the embeddings |
SentenceLabelDataset |
Loading or creating a Dataset and using SentenceTransformerTrainingArguments(batch_sampler=BatchSamplers.GROUP_BY_LABEL) |
DenoisingAutoEncoderDataset |
Manually adding a column with noisy text to a Dataset with texts |
NoDuplicatesDataLoader |
Loading or creating a Dataset and using SentenceTransformerTrainingArguments(batch_sampler=BatchSamplers.NO_DUPLICATES) |
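For the sampler-related rows above, a short sketch of the v3.x replacement: pick the corresponding batch sampler via the training arguments (the output_dir below is just a placeholder):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Replaces NoDuplicatesDataLoader; use BatchSamplers.GROUP_BY_LABEL instead
# as the replacement for SentenceLabelDataset.
args = SentenceTransformerTrainingArguments(
    output_dir="models/my-finetuned-model",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```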