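Pretrained Models

We provide a variety of pretrained Sentence Transformers models through the Sentence Transformers Hugging Face organization. In addition, more than 6,000 community Sentence Transformers models have been publicly released on the Hugging Face Hub. All models can be found here:

Original models: Sentence Transformers Hugging Face organization.
Community models: All Sentence Transformer models on Hugging Face.

Each of these models can be easily downloaded and used as follows: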

from sentence_transformers import SentenceTransformer

# Load https://hugging-face.cn/sentence-transformers/all-mpnet-base-v2
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode([
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
])
similarities = model.similarity(embeddings, embeddings)

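Note

Consider the Massive Textual Embedding Benchmark leaderboard as a source of strong Sentence Transformer models. Keep in mind:

Model size: it is recommended to filter out large models that may not be usable without substantial hardware.
Experimentation is key: models that perform well on the leaderboard do not necessarily perform well on your task, so it is crucial to try several promising models.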
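
As a rough aid for the model-size point above, the memory needed just to hold a model's weights can be estimated from its parameter count. This is a back-of-the-envelope sketch: the helper name `estimate_weight_memory_mb` and the 110-million-parameter figure are illustrative assumptions, not values taken from the leaderboard, and activations and batch data are ignored.

```python
def estimate_weight_memory_mb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate memory for the weights alone (float32 uses 4 bytes per value)."""
    return num_parameters * bytes_per_parameter / 1024**2


# Hypothetical 110M-parameter model, stored as float32 and as float16.
fp32_mb = estimate_weight_memory_mb(110_000_000)
fp16_mb = estimate_weight_memory_mb(110_000_000, bytes_per_parameter=2)
print(f"float32: {fp32_mb:.0f} MB, float16: {fp16_mb:.0f} MB")
```

Actual memory use during inference is higher, since it also depends on batch size and sequence length.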

Tip

Read Sentence Transformer > Usage > Speeding up Inference for tips on how to speed up model inference by 2x to 3x.

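Original Models

The following table provides an overview of a selection of our models. They have been extensively evaluated for their quality in embedding sentences (Performance Sentence Embeddings) and in embedding search queries and paragraphs (Performance Semantic Search).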

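The all-* models were trained on all available training data (more than 1 billion training pairs) and are designed as general-purpose models. The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality.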

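Semantic Search Models

The following models have been trained specifically for semantic search: given a question or search query, these models can find relevant text passages. For more details, see Usage > Semantic Search.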

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")

query_embedding = model.encode("How big is London")
passage_embeddings = model.encode([
    "London is known for its financial district",
    "London has 9,787,426 inhabitants at the 2011 census",
    "The United Kingdom is the fourth largest exporter of goods in the world",
])

similarity = model.similarity(query_embedding, passage_embeddings)
# => tensor([[0.4659, 0.6142, 0.2697]])
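
To turn similarity scores like these into a search result, sort the passages by score. The sketch below uses plain Python and hard-codes the scores from the example output so that it runs without downloading a model; the `rank_passages` helper is a hypothetical name introduced for illustration.

```python
passages = [
    "London is known for its financial district",
    "London has 9,787,426 inhabitants at the 2011 census",
    "The United Kingdom is the fourth largest exporter of goods in the world",
]
# Scores copied from the example output (one score per passage).
scores = [0.4659, 0.6142, 0.2697]


def rank_passages(passages, scores, top_k=2):
    """Return the top_k (score, passage) pairs, highest score first."""
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]


for score, passage in rank_passages(passages, scores):
    print(f"{score:.4f}  {passage}")
```

With a real model, `similarity[0].tolist()` would supply the list of scores.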

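Multi-QA Models

The following models have been trained on 215M question-answer pairs from various sources and domains, including StackExchange, Yahoo Answers, Google and Bing search queries, and many more. These models perform well across many search tasks and domains.

These models were tuned to be used with dot-product similarity scores:

Model	Performance Semantic Search (6 Datasets)	Queries per second (GPU / CPU)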
multi-qa-mpnet-base-dot-v1	57.60	4,000 / 170
multi-qa-distilbert-dot-v1	52.51	7,000 / 350
multi-qa-MiniLM-L6-dot-v1	49.19	18,000 / 750

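These models produce normalized vectors of length 1, which can be used with dot-product, cosine similarity, and Euclidean distance as the similarity function:

Model	Performance Semantic Search (6 Datasets)	Queries per second (GPU / CPU)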
multi-qa-mpnet-base-cos-v1	57.46	4,000 / 170
multi-qa-distilbert-cos-v1	52.83	7,000 / 350
multi-qa-MiniLM-L6-cos-v1	51.83	18,000 / 750
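
For unit-length vectors the three similarity functions rank candidates identically, since cosine similarity equals the dot product and the squared Euclidean distance is 2 - 2 * dot. A minimal check in plain Python, using made-up 2-dimensional vectors rather than real model output:

```python
import math


def normalize(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors (assumed for illustration), normalized to length 1.
query = normalize([1.0, 2.0])
candidates = [normalize(v) for v in ([2.0, 1.0], [1.0, 3.0], [-1.0, 0.5])]

# Rank candidate indices by each measure (higher similarity or lower distance first).
by_dot = sorted(range(3), key=lambda i: -dot(query, candidates[i]))
by_cosine = sorted(range(3), key=lambda i: -cosine(query, candidates[i]))
by_distance = sorted(range(3), key=lambda i: euclidean(query, candidates[i]))
print(by_dot, by_cosine, by_distance)  # the three orderings agree
```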

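MSMARCO Passage Models

The following models have been trained on the MSMARCO Passage Ranking Dataset, which contains 500k real queries from Bing search together with the relevant passages from various web sources. Given the diversity of the MSMARCO dataset, these models also perform well on other domains.

These models were tuned to be used with dot-product similarity scores:

Model	MSMARCO MRR@10 dev set	Performance Semantic Search (6 Datasets)	Queries per second (GPU / CPU)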
msmarco-bert-base-dot-v5	38.08	52.11	4,000 / 170
msmarco-distilbert-dot-v5	37.25	49.47	7,000 / 350
msmarco-distilbert-base-tas-b	34.43	49.25	7,000 / 350

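These models produce normalized vectors of length 1, which can be used with dot-product, cosine similarity, and Euclidean distance as the similarity function:

Model	MSMARCO MRR@10 dev set	Performance Semantic Search (6 Datasets)	Queries per second (GPU / CPU)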
msmarco-distilbert-cos-v5	33.79	44.98	7,000 / 350
msmarco-MiniLM-L12-cos-v5	32.75	43.89	11,000 / 400
msmarco-MiniLM-L6-cos-v5	32.27	42.16	18,000 / 750
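
MRR@10 is the mean reciprocal rank of the first relevant passage within the top 10 results, with a query contributing 0 when no relevant passage appears there. The table values appear to be scaled by 100. A minimal sketch with invented rankings (the passage ids and the `mrr_at_k` helper are illustrative assumptions):

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant item within the top k.

    rankings: one ranked list of passage ids per query.
    relevant: one set of relevant passage ids per query.
    """
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for position, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in relevant_ids:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


# Invented example: a hit at rank 1, a hit at rank 4, and no hit at all.
rankings = [["p1", "p2"], ["p9", "p8", "p7", "p3"], ["p5", "p6"]]
relevant = [{"p1"}, {"p3"}, {"p0"}]
print(mrr_at_k(rankings, relevant))
```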

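MSMARCO Models - More details

Multilingual Models

The following models generate similar embeddings for the same text in different languages. You do not need to specify the input language. Details are in our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. We used the following 50+ languages: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.

Semantic Similarity Models

These models find semantically similar sentences within one language or across languages:

distiluse-base-multilingual-cased-v1: Multilingual knowledge-distilled version of the multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
distiluse-base-multilingual-cased-v2: Multilingual knowledge-distilled version of the multilingual Universal Sentence Encoder. This version supports 50+ languages, but performs slightly worse than the v1 model.
paraphrase-multilingual-MiniLM-L12-v2 - Multilingual version of paraphrase-MiniLM-L12-v2, trained on parallel data for 50+ languages.
paraphrase-multilingual-mpnet-base-v2 - Multilingual version of paraphrase-mpnet-base-v2, trained on parallel data for 50+ languages.

Bitext Mining

Bitext mining describes the process of finding translated sentence pairs in two languages. If this is your use case, the following model gives the best performance:

LaBSE - LaBSE model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed here, LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other.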
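
One simple way to picture bitext mining is mutual nearest-neighbor matching: a source sentence and a target sentence form a pair when each is the other's closest match. The sketch below uses made-up 2-dimensional vectors in place of real embeddings, and the `mine_pairs` helper is hypothetical; practical pipelines often use more robust scoring than this.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mine_pairs(source_vectors, target_vectors):
    """Return (source_index, target_index) pairs that are mutual best matches."""

    def best(vector, pool):
        return max(range(len(pool)), key=lambda j: cosine(vector, pool[j]))

    pairs = []
    for i, source in enumerate(source_vectors):
        j = best(source, target_vectors)
        if best(target_vectors[j], source_vectors) == i:
            pairs.append((i, j))
    return pairs


# Toy "embeddings": the first two sentences on each side are translations of
# each other (in swapped order); the third on each side has no counterpart.
english = [[1.0, 0.1], [0.1, 1.0], [0.8, 0.6]]
german = [[0.2, 0.9], [0.9, 0.2], [-1.0, 0.3]]
print(mine_pairs(english, german))
```

Requiring the match to hold in both directions filters out sentences that merely have a nearest neighbor without being a translation of it.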

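Extending a model to new languages is easy by following Training Examples > Multilingual Models.

Image & Text Models

The following models can embed images and text into a joint vector space. See Usage > Image Search for more details on how to use them for text-to-image search, image-to-image search, image clustering, and zero-shot image classification.

The following models are available, listed with their Top 1 accuracy on the zero-shot ImageNet validation dataset.

Model	Top 1 Performance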
clip-ViT-L-14	75.4
clip-ViT-B-16	68.1
clip-ViT-B-32	63.3
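
Zero-shot classification in a joint vector space works by embedding one text prompt per class and assigning each image to the class whose text embedding is most similar; Top 1 accuracy is the fraction of images whose best match is the true class. The sketch below uses invented 2-dimensional vectors instead of real CLIP embeddings, and `top1_accuracy` is a hypothetical helper.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top1_accuracy(image_vectors, label_vectors, true_labels):
    """Fraction of images whose most similar label vector is the correct one."""
    correct = 0
    for image, truth in zip(image_vectors, true_labels):
        predicted = max(
            range(len(label_vectors)), key=lambda j: cosine(image, label_vectors[j])
        )
        correct += int(predicted == truth)
    return correct / len(image_vectors)


# Invented vectors: two class prompts (e.g. "a photo of a dog", "a photo of a cat")
# and four images with known class indices.
label_vectors = [[1.0, 0.0], [0.0, 1.0]]
image_vectors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
true_labels = [0, 1, 1, 0]
print(top1_accuracy(image_vectors, label_vectors, true_labels))
```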

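We further provide this multilingual text-image model:

clip-ViT-B-32-multilingual-v1 - Multilingual text encoder for the clip-ViT-B-32 model, created using Multilingual Knowledge Distillation. This model can encode text in 50+ languages so that it matches the image vectors from the clip-ViT-B-32 model.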

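INSTRUCTOR Models

Some INSTRUCTOR models, such as hkunlp/instructor-large, are natively supported in Sentence Transformers. These models are special because they are trained with instructions in mind. Notably, the primary difference between normal Sentence Transformer models and Instructor models is that the latter do not include the instructions themselves in the pooling step.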
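
The effect of leaving the instruction out of pooling can be illustrated with mean pooling over toy token vectors. This is a conceptual sketch, not the library's implementation; the `mean_pool` helper and the token values are invented for illustration.

```python
def mean_pool(token_vectors, prompt_length=0, include_prompt=True):
    """Average token vectors, optionally skipping the first prompt_length tokens."""
    start = 0 if include_prompt else prompt_length
    kept = token_vectors[start:]
    dimension = len(kept[0])
    return [sum(vector[d] for vector in kept) / len(kept) for d in range(dimension)]


# Two instruction tokens followed by two sentence tokens (made-up values).
tokens = [[4.0, 0.0], [4.0, 0.0], [0.0, 2.0], [0.0, 4.0]]

with_prompt = mean_pool(tokens, prompt_length=2, include_prompt=True)
without_prompt = mean_pool(tokens, prompt_length=2, include_prompt=False)
print(with_prompt)     # averaged over all four tokens
print(without_prompt)  # averaged over the two sentence tokens only
```

In the second case the instruction still influences the sentence token vectors through the model's attention, but it no longer contributes directly to the pooled embedding.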

The following models work out of the box:

You can use these models as follows:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("hkunlp/instructor-large")
embeddings = model.encode(
    [
        "Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity",
        "Comparison of Atmospheric Neutrino Flux Calculations at Low Energies",
        "Fermion Bags in the Massive Gross-Neveu Model",
        "QCD corrections to Associated t-tbar-H production at the Tevatron",
    ],
    prompt="Represent the Medicine sentence for clustering: ",
)
print(embeddings.shape)
# => (4, 768)

For example, for information retrieval:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("hkunlp/instructor-large")
query = "where is the food stored in a yam plant"
query_instruction = (
    "Represent the Wikipedia question for retrieving supporting documents: "
)
corpus = [
    'Yams are perennial herbaceous vines native to Africa, Asia, and the Americas and cultivated for the consumption of their starchy tubers in many temperate and tropical regions. The tubers themselves, also called "yams", come in a variety of forms owing to numerous cultivars and related species.',
    "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans\u2014and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession",
    "Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.",
]
corpus_instruction = "Represent the Wikipedia document for retrieval: "

query_embedding = model.encode(query, prompt=query_instruction)
corpus_embeddings = model.encode(corpus, prompt=corpus_instruction)
similarities = cos_sim(query_embedding, corpus_embeddings)
print(similarities)
# => tensor([[0.8835, 0.7037, 0.6970]])

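All other Instructor models either 1) will not load, because they refer to InstructorEmbedding in their modules.json, or 2) require calling model.set_pooling_include_prompt(include_prompt=False) after loading.

Scientific Similarity Models

SPECTER is a model trained on scientific citations and can be used to estimate the similarity of two publications. We can use it to find similar papers.

allenai-specter - Semantic Search Python Example / Semantic Search Colab Example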
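
A sketch of preparing input for allenai-specter, assuming each paper is encoded as its title and abstract joined by a [SEP] token; check the linked examples for the exact input format. The `build_paper_text` helper and the paper records are hypothetical.

```python
def build_paper_text(paper, separator="[SEP]"):
    """Join title and abstract into a single string (assumed input format)."""
    return paper["title"] + separator + paper["abstract"]


# Hypothetical paper records for illustration.
papers = [
    {"title": "A Study of Sentence Embeddings", "abstract": "We compare pooling strategies."},
    {"title": "Citation Graphs at Scale", "abstract": "We analyze citation networks."},
]
paper_texts = [build_paper_text(paper) for paper in papers]
print(paper_texts[0])

# With the library installed, the texts could then be embedded and compared, e.g.:
#   model = SentenceTransformer("allenai-specter")
#   embeddings = model.encode(paper_texts)
#   similarities = model.similarity(embeddings, embeddings)
```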