MSMARCO Models (Version 2)

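MS MARCO is a large-scale information retrieval corpus created from real user search queries submitted to the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant to the search query.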

The training data consists of over 500k examples, while the complete corpus consists of over 8.8 million passages.

Usage

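from sentence_transformers import SentenceTransformer, util

# Load the pretrained MS MARCO model
model = SentenceTransformer("msmarco-distilroberta-base-v2")

# Encode the query and a candidate passage into dense vectors
query_embedding = model.encode("How big is London")
passage_embedding = model.encode("London has 9,787,426 inhabitants at the 2011 census")

# Cosine similarity between the two embeddings (higher means more relevant)
print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))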

For more details on usage, see Applications - Information Retrieval.
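To make the retrieval step concrete, here is a minimal sketch of ranking several passages against one query by cosine similarity. It uses plain Python and toy 3-dimensional vectors in place of real `model.encode(...)` outputs, so the vectors and the helper names `cos_sim` and `rank_passages` are illustrative assumptions, not part of the library.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (index, score) pairs for the top_k passages, best first."""
    scored = [(i, cos_sim(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [0.9, 0.1, 1.1],  # nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 1.0],  # partially aligned with the query
]

ranking = rank_passages(query_vec, passage_vecs, top_k=2)
for index, score in ranking:
    print(f"passage {index}: {score:.4f}")
```

With real embeddings the same idea applies: encode the query once, encode all passages, and keep the passages with the highest similarity scores.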

Performance

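Performance is evaluated on TREC-DL 2019, a query-passage retrieval task in which multiple passages have been annotated for their relevance to a given query. We additionally evaluate on the MS Marco Passage Retrieval dataset.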

As a baseline, we show the results for lexical search with BM25 using Elasticsearch.

Approach	NDCG@10 (TREC DL 19 Reranking)	MRR@10 (MS Marco Dev)
BM25 (Elasticsearch)	45.46	17.29
msmarco-distilroberta-base-v2	65.65	28.55
msmarco-roberta-base-v2	67.18	29.17
msmarco-distilbert-base-v2	68.35	30.77
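For readers unfamiliar with the two metrics in the table, the following is a rough sketch of how MRR@10 and NDCG@10 are commonly computed for a single query. It is not the evaluation script behind these numbers. It assumes a linear gain (relevance divided by log2(rank + 1)) and, as a simplification, derives the ideal ordering from the ranked list itself rather than from all judged passages. The function names are hypothetical.

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain (an assumption)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_relevance, k=10):
    """DCG normalized by the DCG of the best possible ordering."""
    # Simplification: the ideal ordering uses only the labels supplied here.
    ideal = sorted(ranked_relevance, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevance, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of the retrieved passages, in ranked order (toy data).
ranked = [0, 2, 0, 1]
mrr = mrr_at_k(ranked)
ndcg = ndcg_at_k(ranked)

# Benchmark scores are averages over all queries.
per_query = [[0, 2, 0, 1], [1, 0], [0, 0, 0]]
mean_mrr = sum(mrr_at_k(r) for r in per_query) / len(per_query)

print(f"MRR@10 = {mrr:.4f}, NDCG@10 = {ndcg:.4f}, mean MRR@10 = {mean_mrr:.4f}")
```

The values in the table appear to be these per-query scores averaged and scaled by 100.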

Version History

Version 1