Note
Sentence Transformers v5.1 has just been released, bringing the ONNX and OpenVINO backends to SparseEncoder models. Read Sparse Encoder > Usage > Speeding up Inference to learn more about the speedups you can expect, or the v5.1 release notes for details on other changes.
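For example, opting into the ONNX backend is a one-line change when loading the model. A minimal sketch, assuming the optional ONNX dependencies are installed (e.g. pip install sentence-transformers[onnx]) and reusing the SPLADE model from the usage example below:
from sentence_transformers import SparseEncoder
# Load the model with the ONNX backend instead of the default PyTorch backend
model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")
embeddings = model.encode(["The weather is lovely today."])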
Note
Sentence Transformers v5.0 was recently released, introducing SparseEncoder models: a new type of model for efficient neural lexical search and hybrid retrieval. Read Sparse Encoder > Usage to learn how to use them, or the v5.0 release notes for details on other changes.
SentenceTransformers Documentation
Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art embedding and reranker models. It can be used to compute embeddings using Sentence Transformer models (Quickstart), to calculate similarity scores using Cross-Encoder (a.k.a. reranker) models (Quickstart), or to generate sparse embeddings using Sparse Encoder models (Quickstart). This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.
More than 10,000 pretrained Sentence Transformers models are available for immediate use on 🤗 Hugging Face, including many of the state-of-the-art models from the Massive Text Embeddings Benchmark (MTEB) leaderboard. In addition, training or fine-tuning your own embedding, reranker, or sparse encoder models with Sentence Transformers is straightforward, enabling you to create custom models for your specific use cases, as sketched below.
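As a rough illustration, here is a minimal fine-tuning sketch using the SentenceTransformerTrainer API; the base model, dataset, and loss below are illustrative choices rather than recommendations:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
# 1. Load a base model to fine-tune (illustrative choice)
model = SentenceTransformer("microsoft/mpnet-base")
# 2. Load a dataset of (anchor, positive) text pairs (illustrative choice)
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")
# 3. Choose a loss that matches the data format
loss = MultipleNegativesRankingLoss(model)
# 4. Create a trainer and train
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
See the Training Overview sections for the full set of options (training arguments, evaluators, multi-dataset training, and so on).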
Sentence Transformers was created by UKPLab and is maintained by 🤗 Hugging Face. Don't hesitate to open an issue on the Sentence Transformers repository if something is broken or if you have further questions.
Usage
See also
See the Quickstart for more quick information on how to use Sentence Transformers.
Using Sentence Transformer models is straightforward:
from sentence_transformers import SentenceTransformer
# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
# The sentences to encode
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
# [0.6660, 1.0000, 0.1411],
# [0.1046, 0.1411, 1.0000]])
from sentence_transformers import CrossEncoder
# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
# The texts for which to predict similarity scores
query = "How many people live in Berlin?"
passages = [
"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
"Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.",
"In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]
# 2a. Either predict similarity scores for pairs of texts
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [8.607139 5.506266 6.352977]
# 2b. Or rank a list of passages for a query
ranks = model.rank(query, passages, return_documents=True)
print("Query:", query)
for rank in ranks:
print(f"- #{rank['corpus_id']} ({rank['score']:.2f}): {rank['text']}")
"""
Query: How many people live in Berlin?
- #0 (8.61): Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
- #2 (6.35): In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
- #1 (5.51): Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.
"""
from sentence_transformers import SparseEncoder
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# The sentences to encode
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions
# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 35.629, 9.154, 0.098],
# [ 9.154, 27.478, 0.019],
# [ 0.098, 0.019, 29.553]])
# 4. Check sparsity stats
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}")
# Sparsity: 99.84%
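Each sparse dimension corresponds to a vocabulary token, so the embeddings can also be inspected directly. A short sketch, assuming the SparseEncoder decode helper that maps the non-zero dimensions back to (token, weight) pairs:
# 5. (Optional) Inspect which vocabulary tokens drive the first embedding
decoded = model.decode(embeddings[0], top_k=5)
print(decoded)
# e.g. [("weather", 2.6), ("lovely", 2.3), ...] (illustrative output)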
What Next?
Consider reading one of the following sections to answer the related questions:
- Embedding Models
How do I use Sentence Transformer models? Sentence Transformers > Usage
Which Sentence Transformer models can I use? Sentence Transformers > Pretrained Models
How do I make Sentence Transformer models faster? Sentence Transformers > Usage > Speeding up Inference
How do I train/finetune a Sentence Transformer model? Sentence Transformers > Training Overview
- Reranker Models
How do I use Cross Encoder models? Cross Encoder > Usage
Which Cross Encoder models can I use? Cross Encoder > Pretrained Models
How do I make Cross Encoder models faster? Cross Encoder > Usage > Speeding up Inference
How do I train/finetune a Cross Encoder model? Cross Encoder > Training Overview
- Sparse Encoder Models
How do I use Sparse Encoder models? Sparse Encoder > Usage
Which Sparse Encoder models can I use? Sparse Encoder > Pretrained Models
How do I make Sparse Encoder models faster? Sparse Encoder > Usage > Speeding up Inference
How do I train/finetune a Sparse Encoder model? Sparse Encoder > Training Overview
How do I integrate Sparse Encoder models with search engines? Sparse Encoder > Vector Database Integration
Citing
If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
If you use the code for data augmentation, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:
@inproceedings{thakur-2020-AugSBERT,
    title = "Augmented {SBERT}: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
    author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.28",
    pages = "296--310",
}