MS MARCO
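MS MARCO Passage Ranking is a large dataset for training information retrieval models. It contains about 500k real search queries from the Bing search engine, together with relevant text passages that answer them. This page shows how to **train** Cross Encoder models on this dataset so that they can be used to search text passages given a query (keywords, a phrase, or a question).
If you are interested in how to use these models, see Application - Retrieve & Re-Rank. **Pre-trained models** are available that you can use directly without training your own. For more information, see Pretrained Cross-Encoders for MS MARCO.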
Cross Encoder
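A Cross Encoder takes a query and a possibly relevant passage and returns a score denoting how relevant the passage is to the given query. Often, torch.nn.Sigmoid is applied to the raw output prediction, mapping it to a value between 0 and 1.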
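To make the effect of the sigmoid concrete, here is a minimal sketch in plain Python. It applies the same function that torch.nn.Sigmoid computes elementwise on tensors; the raw scores are illustrative values chosen for this sketch, not model output.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to the open interval (0, 1)."""
    # Numerically stable form: never call exp() on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Illustrative raw scores of the kind a Cross Encoder returns before
# any activation is applied (assumed values for this sketch).
raw_scores = [8.61, -4.32, 0.0]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, prob in zip(raw_scores, probabilities):
    print(f"{raw:6.2f} -> {prob:.4f}")
```

Because the sigmoid is monotonic, it changes the scale of the scores but not the order of the passages.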
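CrossEncoder models are often used for **re-ranking**: given a list of possibly relevant passages for a query (for example, retrieved with a SentenceTransformer model, BM25, or Elasticsearch), the cross-encoder re-ranks this list so that the most relevant passages are at the top of the result list.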
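The re-ranking step can be sketched without loading any model. In the sketch below, toy_overlap_score is a hypothetical stand-in for a Cross Encoder (it only counts shared words), and rerank is an illustrative helper, not a library function. The point is the control flow: score every (query, passage) pair, then sort by score in descending order.

```python
import re
from typing import Callable


def toy_overlap_score(query: str, passage: str) -> float:
    """Hypothetical scorer: fraction of query words that occur in the passage.

    A real Cross Encoder would replace this function.
    """
    query_words = set(re.findall(r"\w+", query.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(
    query: str,
    passages: list[str],
    score_fn: Callable[[str, str], float],
) -> list[dict]:
    """Score each candidate passage and return them sorted, best first."""
    scored = [
        {"corpus_id": idx, "score": score_fn(query, passage)}
        for idx, passage in enumerate(passages)
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)


query = "How many people live in Berlin?"
# Candidates as a first-stage retriever might return them (made-up examples).
passages = [
    "Berlin is well known for its museums.",
    "About 3.5 million people live in Berlin.",
    "Paris is the capital of France.",
]

ranked = rerank(query, passages, toy_overlap_score)
for item in ranked:
    print(f"{item['score']:.2f}\t{passages[item['corpus_id']]}")
```

Each result keeps the index of the passage in the candidate list, so the original text can be looked up after sorting.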
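Training Scripts
We provide several training scripts with various loss functions to train a CrossEncoder on MS MARCO.
In all scripts, the model is evaluated on subsets of MS MARCO, NFCorpus, and NQ via the CrossEncoderNanoBEIREvaluator.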
- training_ms_marco_bce_preprocessed.py: This example uses BinaryCrossEntropyLoss on a pre-processed MS MARCO dataset.
- This example also uses BinaryCrossEntropyLoss, but here the dataset pre-processing into (query, answer) pairs with a label of 1 or 0 is done in the training script.
- This example uses CachedMultipleNegativesRankingLoss. The script pre-processes the dataset into (query, answer, negative_1, negative_2, negative_3, negative_4, negative_5).
- This example uses ListNetLoss. The script pre-processes the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- This example uses LambdaLoss with the NDCGLoss2PPScheme loss scheme. The script pre-processes the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- training_ms_marco_lambda_preprocessed.py: This example uses LambdaLoss with the NDCGLoss2PPScheme loss scheme on a pre-processed MS MARCO dataset.
- training_ms_marco_lambda_hard_neg.py: This example extends the one above by increasing the size of the training dataset with hard negatives mined via mine_hard_negatives().
- This example uses ListMLELoss. The script pre-processes the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- training_ms_marco_plistmle.py: This example uses PListMLELoss with the default PListMLELambdaWeight position weighting. The script pre-processes the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- This example uses RankNetLoss. The script pre-processes the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
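The three data layouts mentioned in the list can be illustrated with plain Python dictionaries. This is only a sketch of the shapes, not the pre-processing code from the training scripts: the helper names are hypothetical, the column name "docs" is an assumption, and using binary 1/0 values as listwise labels is an assumption as well (the scripts may use other scores).

```python
def to_pairs(query: str, positive: str, negatives: list[str]) -> list[dict]:
    """(query, answer) rows with a label of 1 or 0 (pairwise layout)."""
    rows = [{"query": query, "answer": positive, "label": 1}]
    rows += [{"query": query, "answer": neg, "label": 0} for neg in negatives]
    return rows


def to_multi_negative(
    query: str, positive: str, negatives: list[str], num_negatives: int = 5
) -> dict:
    """One row of the form (query, answer, negative_1, ..., negative_N)."""
    if len(negatives) < num_negatives:
        raise ValueError(f"need at least {num_negatives} negatives")
    row = {"query": query, "answer": positive}
    for i, neg in enumerate(negatives[:num_negatives], start=1):
        row[f"negative_{i}"] = neg
    return row


def to_listwise(query: str, positive: str, negatives: list[str]) -> dict:
    """One row of the form (query, [doc1, ..., docN]) with one label per doc."""
    docs = [positive, *negatives]
    labels = [1] + [0] * len(negatives)  # assumption: binary relevance labels
    return {"query": query, "docs": docs, "labels": labels}


# Made-up example data for illustration only.
query = "How many people live in Berlin?"
positive = "About 3.5 million people live in Berlin."
negatives = [f"Unrelated passage number {i}." for i in range(1, 6)]

pairs = to_pairs(query, positive, negatives)
multi = to_multi_negative(query, positive, negatives)
listwise = to_listwise(query, positive, negatives)

print(len(pairs), "pairwise rows")
print(list(multi.keys()))
print(len(listwise["docs"]), "docs,", len(listwise["labels"]), "labels")
```

The pairwise layout yields one row per (query, passage) pair, while the other two layouts keep all passages for a query in a single row.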
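Out of these training scripts, I suspect that training_ms_marco_lambda_preprocessed.py, training_ms_marco_lambda_hard_neg.py, or training_ms_marco_bce_preprocessed.py produces the strongest model, as LambdaLoss and BinaryCrossEntropyLoss are both anecdotally quite strong. Among the learning-to-rank losses, the ordering appears to be LambdaLoss > PListMLELoss > ListNetLoss > RankNetLoss > ListMLELoss, but your mileage may vary.
Additionally, you can also train with distillation. For more details, see Cross Encoder > Training Examples > Distillation.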
Inference
You can perform inference with any of the pre-trained CrossEncoder models for MS MARCO as follows:
from sentence_transformers import CrossEncoder
# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
# 2. Predict scores for a pair of sentences
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)
# 3. Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)
# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
"""
Query: How many people live in Berlin?
8.92 The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61 Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24 An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60 In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35 In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42 Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45 In 2015, the total labour force in Berlin was 1.85 million.
0.33 Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24 The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32 Berlin is well known for its museums.
"""