MS MARCO

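MS MARCO Passage Ranking is a large dataset for training information retrieval models. It contains about 500,000 real search queries from the Bing search engine, together with the relevant text passages that answer them. This page shows how to train Cross Encoder models on this dataset so that they can be used to search text passages given a query (keywords, a phrase, or a question).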

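If you are interested in how to use these models, see Application - Retrieve & Re-Rank. Pre-trained models are available, so you can use them directly without training your own. For more information, see Pre-trained Cross Encoders for MS MARCO.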

Cross Encoder

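A Cross Encoder takes a query and a possibly relevant passage and returns a score indicating how relevant the passage is to the query. Often, a torch.nn.Sigmoid is applied to the raw output prediction to map it to a value between 0 and 1.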
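
As a minimal sketch of that mapping, the snippet below applies the logistic function in plain Python to two raw scores (the values printed by the inference example on this page). The helper name sigmoid is introduced here for illustration only; torch.nn.Sigmoid computes the same function element-wise on tensors.

```python
import math


def sigmoid(logit: float) -> float:
    """Logistic function in a numerically stable form (illustrative helper)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


# Raw scores for a relevant and an irrelevant passage
raw_scores = [8.607138, -4.3200774]
probabilities = [sigmoid(score) for score in raw_scores]
print(probabilities)
```

The ordering of the passages is unchanged by this step, since the sigmoid is monotonic; it only makes the scores easier to compare against a fixed threshold.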

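CrossEncoder models are often used for re-ranking: given a list of possibly relevant passages for a query, e.g. retrieved by a SentenceTransformer model, BM25, or Elasticsearch, the Cross Encoder re-orders the list so that the most relevant passages are at the top of the result list.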
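
The re-ranking step can be sketched roughly as follows. The names rerank and toy_overlap_score are hypothetical and not part of sentence-transformers; toy_overlap_score is only a token-overlap stand-in so that the example runs without downloading a model. In practice, the predict method of a CrossEncoder, which accepts a list of (query, passage) pairs, would presumably take its place.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[List[Tuple[str, str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every (query, passage) pair and return the top_k passages, best first."""
    pairs = [(query, passage) for passage in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda item: item[0], reverse=True)
    return ranked[:top_k]


def toy_overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in for a Cross Encoder: counts shared lowercase tokens."""
    scores = []
    for query, passage in pairs:
        query_tokens = set(query.lower().split())
        passage_tokens = set(passage.lower().split())
        scores.append(float(len(query_tokens & passage_tokens)))
    return scores


# Candidates as they might come back from a first-stage retriever (made-up examples)
candidates = [
    "Berlin is well known for its museums.",
    "How many people live in the urban area of Berlin? About 4.1 million.",
    "Paris is the capital of France.",
]
top = rerank("How many people live in Berlin?", candidates, toy_overlap_score, top_k=2)
for score, passage in top:
    print(f"{score:.1f}\t{passage}")
```

Because the scorer sees the query and the passage together, it has to be run once per candidate, which is why this step is usually applied only to a short list from a cheaper retriever.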

Training Scripts

We provide several training scripts, each using a different loss function, for training a CrossEncoder on MS MARCO.

In all scripts, the model is evaluated on subsets of MS MARCO, NFCorpus, and NQ via the CrossEncoderNanoBEIREvaluator.

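Out of these training scripts, I suspect that training_ms_marco_lambda_preprocessed.py, training_ms_marco_lambda_hard_neg.py, or training_ms_marco_bce_preprocessed.py produces the strongest model, as anecdotally LambdaLoss and BinaryCrossEntropyLoss are both quite strong. Among the learning-to-rank losses, it seems that LambdaLoss > PListMLELoss > ListNetLoss > RankNetLoss > ListMLELoss, but your mileage may vary.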
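
To give a feel for the pointwise objective, here is a minimal sketch of binary cross-entropy computed on a raw logit. It assumes that BinaryCrossEntropyLoss corresponds to the standard binary cross-entropy on logits; the function name bce_with_logits and the example values are illustrative, and the listwise ranking losses named above are not covered by this sketch.

```python
import math


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy on a raw logit, in the numerically stable form
    max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Hypothetical (logit, label) pairs: label 1.0 = relevant, 0.0 = not relevant
examples = [(8.6, 1.0), (-4.3, 0.0), (-4.3, 1.0)]
losses = [bce_with_logits(logit, label) for logit, label in examples]
for (logit, label), loss in zip(examples, losses):
    print(f"logit={logit:+.1f} label={label:.0f} loss={loss:.4f}")
```

The first two pairs are scored on the correct side and incur almost no loss, while the third (a relevant passage given a strongly negative logit) is penalized heavily.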

Additionally, you can also train with distillation. For more details, see Cross Encoder > Training Examples > Distillation.

Inference

You can perform inference with any of the pre-trained CrossEncoder models for MS MARCO, like so:

from sentence_transformers import CrossEncoder

# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# 2. Predict scores for (query, passage) pairs
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)

# 3. Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
"""
Query: How many people live in Berlin?
8.92    The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61    Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24    An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60    In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35    In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42    Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45    In 2015, the total labour force in Berlin was 1.85 million.
0.33    Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24   The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32   Berlin is well known for its museums.
"""