Note

Sentence Transformers v5.2 was recently released, introducing multi-process processing for CrossEncoder, a multilingual NanoBEIR evaluator, similarity score outputs in mine_hard_negatives, support for Transformers v5, and more. Read the v5.2 release notes for more information.

Note

Sentence Transformers is moving from UKP Lab to 🤗 Hugging Face. This formalizes the existing maintenance structure, as Hugging Face has been maintaining the project for the past two years. The project's roadmap, support, and commitment to the community remain unchanged. Read the full announcement for more details!

Sentence Transformers Documentation

Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art embedding and reranker models. It can be used to compute embeddings using Sentence Transformer models (quickstart), to calculate similarity scores using Cross-Encoder (a.k.a. reranker) models (quickstart), or to generate sparse embeddings using Sparse Encoder models (quickstart). This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.
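For a taste of the semantic search use case, a minimal sketch might look as follows (the model name and texts are purely illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode one query and a small corpus into dense vectors
query_embedding = model.encode("How big is London?")
corpus_embeddings = model.encode([
    "London has 9,787,426 inhabitants at the 2011 census.",
    "London is known for its financial district.",
    "The United Kingdom is the fourth largest exporter of goods in the world.",
])

# Score the corpus against the query (cosine similarity by default);
# the highest-scoring entry is the best match
scores = model.similarity(query_embedding, corpus_embeddings)
print(scores.argmax())  # index of the best-matching passage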

Over 10,000 pre-trained Sentence Transformers models are available for immediate use on 🤗 Hugging Face, including many of the state-of-the-art models from the Massive Text Embeddings Benchmark (MTEB) leaderboard. Additionally, it is easy to train or finetune your own embedding models, reranker models, or sparse encoder models using Sentence Transformers, enabling you to create custom models for your specific use cases; a minimal sketch follows.
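As a rough sketch, finetuning an embedding model with the current training API can be as short as the following (the base model, dataset, and loss here are illustrative choices, not recommendations):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# 1. Load a base model and a small (anchor, positive) pair dataset
model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataset = load_dataset("sentence-transformers/all-nli", "pair")["train"].select(range(1000))

# 2. In-batch negatives loss: pulls each anchor towards its positive,
#    away from the other positives in the batch
loss = losses.MultipleNegativesRankingLoss(model)

# 3. Train, then save the finetuned model
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save("models/all-MiniLM-L6-v2-nli")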

Sentence Transformers was created by UKP Lab and is currently maintained by 🤗 Hugging Face. Don't hesitate to open an issue on the Sentence Transformers repository if something is broken or if you have further questions.

Usage

See also

See the Quickstart for more quick information on how to use Sentence Transformers.

Using Sentence Transformer models is straightforward:

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
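Note that model.similarity uses the model's configured similarity function, which is cosine similarity by default. If your use case calls for a different metric, it can be overridden at load time; a small sketch (the "dot" choice is only an example):

from sentence_transformers import SentenceTransformer

# Use dot-product instead of the default cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name="dot")
print(model.similarity_fn_name)
# => dot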
Computing similarity scores for pairs of texts with a Cross Encoder (a.k.a. reranker) model is just as simple:

from sentence_transformers import CrossEncoder

# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# The texts for which to predict similarity scores
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]

# 2a. Either predict scores for pairs of texts
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [8.607139 5.506266 6.352977]

# 2b. Or rank a list of passages for a query
ranks = model.rank(query, passages, return_documents=True)

print("Query:", query)
for rank in ranks:
    print(f"- #{rank['corpus_id']} ({rank['score']:.2f}): {rank['text']}")
"""
Query: How many people live in Berlin?
- #0 (8.61): Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
- #2 (6.35): In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
- #1 (5.51): Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.
"""
And generating sparse embeddings with a Sparse Encoder model:

from sentence_transformers import SparseEncoder

# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[   35.629,     9.154,     0.098],
#         [    9.154,    27.478,     0.019],
#         [    0.098,     0.019,    29.553]])

# 4. Check sparsity stats
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}")
# Sparsity: 99.84%
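Because each dimension of a sparse embedding corresponds to a vocabulary token, the representations are directly interpretable. A small sketch of inspecting the strongest tokens of the first sentence, assuming encode returned a torch sparse tensor (as in recent versions) and that the underlying Hugging Face tokenizer is available as model.tokenizer:

import torch

# Densify the first embedding and pick its highest-weighted dimensions
first = embeddings[0].to_dense() if embeddings.is_sparse else embeddings[0]
weights, token_ids = torch.topk(first, k=5)

# Map the vocabulary indices back to tokens
tokens = model.tokenizer.convert_ids_to_tokens(token_ids.tolist())
print(list(zip(tokens, weights.tolist())))
# e.g. [('weather', 2.3), ('lovely', 1.9), ...] (illustrative output)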

What's Next?

Consider reading one of the following sections to answer the related questions:

Citation

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}

If you use one of our multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
  title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2020",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/2004.09813",
}

If you use the code for data augmentation, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:

@inproceedings{thakur-2020-AugSBERT,
  title = "Augmented {SBERT}: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
  author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes  and Gurevych, Iryna",
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month = jun,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2021.naacl-main.28",
  pages = "296--310",
}