MS MARCO

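MS MARCO Passage Ranking is a large dataset for training information retrieval models. It contains about 500k real search queries from the Bing search engine, together with relevant text passages that answer them. This page shows how to **train** Cross Encoder models on this dataset so that they can be used to search text passages given a query (keywords, a phrase, or a question).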

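If you are interested in how to use these models, see Application - Retrieve & Re-Rank. There are **pre-trained models** available that you can use directly without training your own. For more information, see Pretrained Cross-Encoders for MS MARCO.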

Cross Encoder

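A Cross Encoder takes a query and a possibly relevant passage and returns a score denoting how relevant the passage is to the given query. Often, a torch.nn.Sigmoid is applied to the raw output prediction, mapping it to a value between 0 and 1.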
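To make the effect of the sigmoid concrete, here is a minimal sketch in plain Python that applies it to the two raw scores shown in the inference example on this page. The helper name `sigmoid` is introduced only for illustration; in practice the activation would typically be applied by the model or by torch.nn.Sigmoid rather than by hand.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw relevance logit to the open interval (0, 1)."""
    # Numerically stable form: avoids exp() overflow for large negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Raw scores taken from the inference example on this page.
raw_scores = [8.607138, -4.3200774]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, prob in zip(raw_scores, probabilities):
    print(f"raw={raw:+.4f} -> sigmoid={prob:.4f}")
```

The sigmoid is monotonic, so it changes the scale of the scores but not the order in which passages are ranked.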

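CrossEncoder models are often used for **re-ranking**: given a list of possibly relevant passages for a query (for example, retrieved from a SentenceTransformer model / BM25 / Elasticsearch), the cross-encoder re-ranks the list so that the most relevant passages are at the top of the result list.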
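The re-ranking step itself is simple: score every (query, candidate) pair and sort by score. The sketch below shows only that mechanic in plain Python. `toy_overlap_score` is a hypothetical stand-in for a real Cross Encoder (it just counts shared words) and `rerank` is an illustrative helper, not part of any library.

```python
import re
from typing import Callable, List, Tuple


def toy_overlap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a Cross Encoder: fraction of query words found in the passage."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    passage_tokens = set(re.findall(r"\w+", passage.lower()))
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every (query, candidate) pair and return the top_k by descending score."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Candidates as they might come back from a first-stage retriever, in arbitrary order.
candidates = [
    "Berlin is well known for its museums.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
]
reranked = rerank("How many people live in Berlin?", candidates, toy_overlap_score, top_k=2)
for score, passage in reranked:
    print(f"{score:.2f}\t{passage}")
```

With a real model, `score_fn` would be replaced by the Cross Encoder's prediction for the pair. Word overlap is used here only so that the sketch runs without downloading a model.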

Training Scripts

We provide several training scripts with different loss functions to train a CrossEncoder on MS MARCO.

In all scripts, the model is evaluated on subsets of MS MARCO, NFCorpus, and NQ via the CrossEncoderNanoBEIREvaluator.
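As a rough sketch of what that evaluation could look like outside a training script, the function below builds the evaluator for the three subsets and runs it on a model. The wrapper name `evaluate_on_nanobeir` is hypothetical. The dataset name strings ("msmarco", "nfcorpus", "nq") and the use of default evaluator settings are assumptions about the library at the time of writing, so check them against your installed version.

```python
def evaluate_on_nanobeir(model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2") -> dict:
    """Evaluate a CrossEncoder on the MS MARCO, NFCorpus and NQ NanoBEIR subsets."""
    # Imported lazily so that defining this function does not require the library
    # (or a model download) until it is actually called.
    from sentence_transformers import CrossEncoder
    from sentence_transformers.cross_encoder.evaluation import CrossEncoderNanoBEIREvaluator

    model = CrossEncoder(model_name)
    # Assumed dataset identifiers for the three subsets named above.
    evaluator = CrossEncoderNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"])
    # Calling the evaluator runs the re-ranking evaluation and returns a dict of metrics.
    return evaluator(model)
```

Calling `evaluate_on_nanobeir()` downloads the model and the evaluation subsets on first use, so it needs network access and may take a while on CPU.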

Out of these training scripts, I suspect that training_ms_marco_lambda_preprocessed.py, training_ms_marco_lambda_hard_neg.py, or training_ms_marco_bce_preprocessed.py produces the strongest model, as anecdotally LambdaLoss and BinaryCrossEntropyLoss are both quite strong. Among the learning-to-rank losses, it seems that LambdaLoss > PListMLELoss > ListNetLoss > RankNetLoss > ListMLELoss, but your mileage may vary.
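To illustrate how a pointwise loss differs from a learning-to-rank loss, here is a simplified sketch in plain Python of binary cross-entropy on logits and of a RankNet-style pairwise loss. These are textbook formulations written for illustration, not the library's BinaryCrossEntropyLoss, RankNetLoss, or LambdaLoss implementations, and the function names are hypothetical. LambdaLoss additionally weights pairs by their effect on a ranking metric, which is not shown here.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_with_logits(logits: List[float], labels: List[float]) -> float:
    """Pointwise loss: each (query, passage) logit is compared with its own 0/1 label."""
    # For logit z and label y, BCE equals softplus(z) - y * z.
    losses = [softplus(z) - y * z for z, y in zip(logits, labels)]
    return sum(losses) / len(losses)


def ranknet_pairwise(logits: List[float], labels: List[float]) -> float:
    """Pairwise loss: each more-relevant passage should score above each less-relevant one."""
    losses = [
        softplus(-(logits[i] - logits[j]))
        for i in range(len(logits))
        for j in range(len(logits))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


# One query with three passages; only the first passage is relevant.
labels = [1.0, 0.0, 0.0]
good_logits = [8.6, -4.3, -4.2]   # relevant passage scored highest
bad_logits = [-4.3, 8.6, -4.2]    # an irrelevant passage scored highest

for name, logits in [("good", good_logits), ("bad", bad_logits)]:
    print(name, bce_with_logits(logits, labels), ranknet_pairwise(logits, labels))
```

Both losses are small for the well-ordered scores and large for the badly ordered ones. The difference is what they look at: the pointwise loss judges each score against its label in isolation, while the pairwise loss only cares about score differences within the same query.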

Additionally, you can also train with distillation. For more details, see Cross Encoder > Training Examples > Distillation.

Inference

You can perform inference with any of the pre-trained CrossEncoder models for MS MARCO like so:

from sentence_transformers import CrossEncoder

# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# 2. Predict scores for a pair of sentences
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)

# 3. Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
"""
Query: How many people live in Berlin?
8.92    The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61    Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24    An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60    In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35    In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42    Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45    In 2015, the total labour force in Berlin was 1.85 million.
0.33    Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24   The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32   Berlin is well known for its museums.
"""