MS MARCO
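MS MARCO Passage Ranking is a large dataset for training information retrieval models. It consists of about 500k real search queries from the Bing search engine, together with the relevant text passages that answer each query. This page shows how to train Cross Encoder models on this dataset so that they can be used to search text passages given queries (keywords, phrases, or questions).
If you are interested in how to use these models, see Application - Retrieve & Re-Rank. Pre-trained models are available, so you can use them directly without training your own. For more information, see Pretrained Cross-Encoders for MS MARCO.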
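Cross Encoder
A Cross Encoder takes both a query and a possibly relevant passage as input and returns a score denoting how relevant the passage is for the given query. Often, a torch.nn.Sigmoid is applied to the raw output prediction, mapping it to a value between 0 and 1.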
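To make the score mapping concrete, here is a minimal pure-Python sketch of the sigmoid function. It performs the same elementwise computation that torch.nn.Sigmoid applies to tensors. The helper name `sigmoid` and the sample scores are assumptions made for illustration, not part of the library or actual model output.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to the open interval (0, 1)."""
    # Numerically stable form: never call exp() on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Illustrative raw scores (made up for this sketch, not model output).
raw_scores = [8.6, 0.0, -4.3]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, prob in zip(raw_scores, probabilities):
    print(f"{raw:+.1f} -> {prob:.4f}")
```

Because the sigmoid is monotonic, applying it changes the scale of the scores but never their ranking order.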
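CrossEncoder models are often used for re-ranking: given a list of passages that may be relevant to a query (for example, retrieved by a SentenceTransformer model, BM25, or Elasticsearch), the Cross Encoder re-orders the list so that the most relevant passages appear at the top of the result list.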
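The re-ranking step itself is simple once a scoring function exists. The sketch below shows the control flow under stated assumptions: `rerank` and `toy_overlap_score` are hypothetical helpers written for this example, and the word-overlap scorer is only a stand-in for a real Cross Encoder, which would score each (query, passage) pair with a neural network.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score every (query, passage) pair and sort by descending score."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that occur in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


# Candidates as they might come back from a first-stage retriever.
candidates = [
    "berlin is known for its museums",
    "about 3.5 million people live in berlin",
    "paris is the capital of france",
]
reranked = rerank("how many people live in berlin", candidates, toy_overlap_score)

for score, passage in reranked:
    print(f"{score:.2f}\t{passage}")
```

The scorer is called once per candidate, which is why Cross Encoders are usually applied to a short candidate list rather than to a whole corpus.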
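Training Scripts
We provide several training scripts, each using a different loss function, to train a CrossEncoder on the MS MARCO dataset.
In all scripts, the model is evaluated with CrossEncoderNanoBEIREvaluator on subsets of MS MARCO, NFCorpus, and NQ.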
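Re-ranking evaluations of this kind are usually summarized with ranking metrics such as NDCG@10. The sketch below assumes that metric and computes it in plain Python with linear gain. It is not the evaluator's own code, and the function names are hypothetical. It also assumes every relevant passage appears somewhere in the ranked list, so the ideal ordering can be derived from the list itself.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# Relevance labels of passages in the order the model ranked them.
perfect = ndcg_at_k([1, 0, 0], k=10)    # relevant passage ranked first
one_lower = ndcg_at_k([0, 1, 0], k=10)  # relevant passage ranked second
print(perfect, one_lower)
```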
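- This example also uses BinaryCrossEntropyLoss, but here the dataset is preprocessed into (query, answer) pairs, each with a label of 1 or 0, inside the training script itself.
- This example uses CachedMultipleNegativesRankingLoss. The script preprocesses the dataset into the (query, answer, negative_1, negative_2, negative_3, negative_4, negative_5) format.
- This example uses ListNetLoss. The script preprocesses the dataset into the (query, [doc1, doc2, ..., docN]) format, with labels given as [score1, score2, ..., scoreN].
- This example uses LambdaLoss with the NDCGLoss2PPScheme loss scheme. The script preprocesses the dataset into the (query, [doc1, doc2, ..., docN]) format, with labels given as [score1, score2, ..., scoreN].
- training_ms_marco_lambda_preprocessed.py: This example uses LambdaLoss with the NDCGLoss2PPScheme loss scheme on a preprocessed MS MARCO dataset.
- training_ms_marco_lambda_hard_neg.py: This example extends the one above by increasing the size of the training dataset with hard negatives mined via mine_hard_negatives().
- This example uses ListMLELoss. The script preprocesses the dataset into the (query, [doc1, doc2, ..., docN]) format, with labels given as [score1, score2, ..., scoreN].
- training_ms_marco_plistmle.py: This example uses PListMLELoss with the default PListMLELambdaWeight position weighting. The script preprocesses the dataset into the (query, [doc1, doc2, ..., docN]) format, with labels given as [score1, score2, ..., scoreN].
- This example uses RankNetLoss. The script preprocesses the dataset into the (query, [doc1, doc2, ..., docN]) format, with labels given as [score1, score2, ..., scoreN].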
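The three data layouts named in the list above can be illustrated with a small sketch. The raw record structure (`query`, `positive`, `negatives`) and the helper names are hypothetical and chosen only for this example. The output column names follow the formats described above, and the real scripts may build these formats differently.

```python
from typing import Dict, List

# Hypothetical raw records: one query, one relevant passage, some non-relevant ones.
records = [
    {"query": "q1", "positive": "p1", "negatives": ["n1", "n2"]},
    {"query": "q2", "positive": "p2", "negatives": ["n3"]},
]


def to_pairs(records: List[Dict]) -> List[Dict]:
    """(query, answer) rows with a binary label, as used for pointwise training."""
    rows = []
    for rec in records:
        rows.append({"query": rec["query"], "answer": rec["positive"], "label": 1})
        for neg in rec["negatives"]:
            rows.append({"query": rec["query"], "answer": neg, "label": 0})
    return rows


def to_tuples(records: List[Dict], num_negatives: int) -> List[Dict]:
    """(query, answer, negative_1, ..., negative_n) rows; skips short records."""
    rows = []
    for rec in records:
        if len(rec["negatives"]) < num_negatives:
            continue
        row = {"query": rec["query"], "answer": rec["positive"]}
        for i, neg in enumerate(rec["negatives"][:num_negatives], start=1):
            row[f"negative_{i}"] = neg
        rows.append(row)
    return rows


def to_listwise(records: List[Dict]) -> List[Dict]:
    """(query, [doc1, ..., docN]) rows with one label per document."""
    rows = []
    for rec in records:
        docs = [rec["positive"]] + list(rec["negatives"])
        labels = [1] + [0] * len(rec["negatives"])
        rows.append({"query": rec["query"], "docs": docs, "labels": labels})
    return rows


pairs = to_pairs(records)
tuples = to_tuples(records, num_negatives=2)
listwise = to_listwise(records)
print(len(pairs), len(tuples), len(listwise))
```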
Of these training scripts, I suspect that training_ms_marco_lambda_preprocessed.py, training_ms_marco_lambda_hard_neg.py, or training_ms_marco_bce_preprocessed.py produces the strongest model, as anecdotally LambdaLoss and BinaryCrossEntropyLoss are both quite strong. Among the learning-to-rank losses, the ordering appears to be LambdaLoss > PListMLELoss > ListNetLoss > RankNetLoss > ListMLELoss, but your mileage may vary.
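To give a feel for how listwise and pairwise learning-to-rank objectives differ, the sketch below implements textbook versions of two of them in plain Python: a top-one ListNet loss (cross-entropy between the softmax of the labels and the softmax of the scores) and a RankNet loss (logistic loss on every pair whose labels differ). These are assumed, simplified formulations for illustration. The library's ListNetLoss and RankNetLoss classes may differ in details such as activation functions, weighting, and padding.

```python
import math
from typing import List


def log_softmax(values: List[float]) -> List[float]:
    """Log of the softmax, computed with the usual max-shift for stability."""
    m = max(values)
    log_sum = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum for v in values]


def listnet_loss(scores: List[float], labels: List[float]) -> float:
    """Top-one ListNet: cross-entropy between label and score distributions."""
    log_p_scores = log_softmax(scores)
    p_labels = [math.exp(v) for v in log_softmax([float(l) for l in labels])]
    return -sum(p * lp for p, lp in zip(p_labels, log_p_scores))


def ranknet_loss(scores: List[float], labels: List[float]) -> float:
    """RankNet: mean logistic loss over pairs where doc i should outrank doc j."""
    terms = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                diff = scores[i] - scores[j]
                # Sketch only: exp() can overflow for extremely negative diffs.
                terms.append(math.log1p(math.exp(-diff)))
    return sum(terms) / len(terms) if terms else 0.0


labels = [1, 0, 0]
good_scores = [5.0, 0.0, -1.0]  # relevant document scored highest
bad_scores = [-1.0, 0.0, 5.0]   # relevant document scored lowest

print(listnet_loss(good_scores, labels), listnet_loss(bad_scores, labels))
print(ranknet_loss(good_scores, labels), ranknet_loss(bad_scores, labels))
```

Both losses are lower when the relevant document receives the highest score. The difference is that ListNet compares whole score distributions per query, while RankNet only looks at one pair of documents at a time.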
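Additionally, you can train with distillation. See Cross Encoder > Training Examples > Distillation for details.
Inference
You can perform inference with any of the pre-trained CrossEncoder models for MS MARCO as follows: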
from sentence_transformers import CrossEncoder
# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
# 2. Predict scores for a pair of sentences
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)
# 3. Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)
# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
"""
Query: How many people live in Berlin?
8.92 The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61 Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24 An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60 In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35 In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42 Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45 In 2015, the total labour force in Berlin was 1.85 million.
0.33 Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24 The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32 Berlin is well known for its museums.
"""