MS MARCO
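MS MARCO Passage Ranking is a large dataset for training information retrieval models. It consists of about 500k real search queries from the Bing search engine, together with the relevant text passages that answer each query. This page shows how to train Cross Encoder models on this dataset so that they can be used to search text passages given a query (keywords, a phrase, or a question).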
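If you are interested in how to use these models, see Application - Retrieve & Re-Rank. Pre-trained models are available that you can use directly without training your own. For more information, see Pretrained Cross-Encoders for MS MARCO.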
Cross Encoder
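A Cross Encoder accepts both a query and a possibly relevant passage, and returns a score denoting how relevant the passage is to the given query. Often, a torch.nn.Sigmoid is applied to the raw output prediction, mapping it to a value between 0 and 1.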
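To make the effect of the sigmoid concrete, here is a minimal, dependency-free sketch. It uses `math` from the standard library instead of `torch.nn.Sigmoid`, and the raw scores are illustrative values of the kind a model might return, not measured results.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw relevance score (logit) to the open interval (0, 1)."""
    # Numerically stable form: never call exp() on a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Illustrative raw scores (assumption): one relevant and one irrelevant passage.
raw_scores = [8.607138, -4.3200774]
probabilities = [sigmoid(score) for score in raw_scores]

for raw, prob in zip(raw_scores, probabilities):
    print(f"{raw:+.2f} -> {prob:.4f}")
```

The ordering of the passages is unchanged by the sigmoid, since it is monotonic; only the scale of the scores changes.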
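CrossEncoder models are often used for re-ranking: given a list of possibly relevant passages for a query, for example retrieved from a SentenceTransformer model / BM25 / Elasticsearch, the cross-encoder re-ranks this list so that the most relevant passages are at the top of the result list.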
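The re-ranking step itself can be sketched in a few lines. In the sketch below, `toy_score` is a hypothetical stand-in for a cross-encoder (it only counts shared words) so that the example runs without downloading a model; `rerank` and the candidate passages are likewise illustrative names and data, not part of the library.

```python
import re
from typing import Callable


def rerank(query: str, candidates: list[str], score_fn: Callable[[str, str], float]) -> list[dict]:
    """Score every (query, passage) pair and sort by descending score."""
    results = [
        {"corpus_id": idx, "score": score_fn(query, passage)}
        for idx, passage in enumerate(candidates)
    ]
    return sorted(results, key=lambda result: result["score"], reverse=True)


def toy_score(query: str, passage: str) -> float:
    """Hypothetical scorer: fraction of query words that also occur in the passage."""
    query_words = set(re.findall(r"\w+", query.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    return len(query_words & passage_words) / len(query_words)


# Candidates in the order a first-stage retriever might have returned them.
candidates = [
    "Berlin is well known for its museums.",
    "About 3.5 million people live in Berlin.",
    "Paris is the capital of France.",
]
ranked = rerank("How many people live in Berlin?", candidates, toy_score)
for result in ranked:
    print(f"{result['score']:.2f}\t{candidates[result['corpus_id']]}")
```

In practice, the scoring function would be a trained cross-encoder, which is far more accurate than word overlap but also more expensive, which is why it is applied only to a short candidate list.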
Training Scripts
We provide several training scripts with various loss functions to train a CrossEncoder on MS MARCO.
In all scripts, the model is evaluated on subsets of MS MARCO, NFCorpus, and NQ via the CrossEncoderNanoBEIREvaluator.
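This page does not state which metrics the evaluator reports. As an assumption for illustration only, re-ranking quality is commonly summarized with NDCG@k, which rewards placing relevant passages near the top. The sketch below is a textbook version with linear gain, not the evaluator's implementation, and it computes the ideal ranking from the candidate list alone.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    # Assumption: every relevant passage is already in the candidate list.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels of the passages, listed in the order the model ranked them.
perfect_ranking = [1, 0, 0]
poor_ranking = [0, 0, 1]
print(ndcg_at_k(perfect_ranking, k=10))
print(ndcg_at_k(poor_ranking, k=10))
```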
- training_ms_marco_bce_preprocessed.py: This example uses BinaryCrossEntropyLoss on a pre-processed MS MARCO dataset.
- This example also uses BinaryCrossEntropyLoss, but here the dataset preprocessing into (query, answer) pairs with a label of 1 or 0 is done in the training script.
- This example uses CachedMultipleNegativesRankingLoss. The script preprocesses the dataset into (query, answer, negative_1, negative_2, negative_3, negative_4, negative_5).
- This example uses ListNetLoss. The script preprocesses the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- This example uses LambdaLoss with the NDCGLoss2PPScheme loss scheme. The script preprocesses the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- training_ms_marco_lambda_preprocessed.py: This example uses LambdaLoss with the NDCGLoss2PPScheme loss scheme on a pre-processed MS MARCO dataset.
- training_ms_marco_lambda_hard_neg.py: This example extends the example above by mining hard negatives with mine_hard_negatives(), which increases the size of the training dataset.
- This example uses ListMLELoss. The script preprocesses the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- training_ms_marco_plistmle.py: This example uses PListMLELoss with the default PListMLELambdaWeight position weighting. The script preprocesses the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
- This example uses RankNetLoss. The script preprocesses the dataset into (query, [doc1, doc2, ..., docN]) with labels as [score1, score2, ..., scoreN].
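The three input formats named in the list above can be illustrated with plain Python. This is a sketch under assumptions: the shape of `raw_example`, the helper names, and the `docs` key are hypothetical, and the actual scripts may structure their preprocessing differently.

```python
# Hypothetical raw record: one query, one relevant passage, some irrelevant ones.
raw_example = {
    "query": "How many people live in Berlin?",
    "positive": "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "negatives": [
        "Berlin is well known for its museums.",
        "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    ],
}


def to_labeled_pairs(example: dict) -> list[dict]:
    """(query, answer) rows with label 1 for the positive and 0 for each negative."""
    rows = [{"query": example["query"], "answer": example["positive"], "label": 1}]
    rows += [
        {"query": example["query"], "answer": negative, "label": 0}
        for negative in example["negatives"]
    ]
    return rows


def to_negative_columns(example: dict, num_negatives: int) -> dict:
    """A single (query, answer, negative_1, ..., negative_n) row."""
    row = {"query": example["query"], "answer": example["positive"]}
    for i, negative in enumerate(example["negatives"][:num_negatives], start=1):
        row[f"negative_{i}"] = negative
    return row


def to_listwise(example: dict) -> dict:
    """A (query, [doc1, ..., docN]) row with labels [score1, ..., scoreN]."""
    docs = [example["positive"]] + example["negatives"]
    labels = [1] + [0] * len(example["negatives"])
    return {"query": example["query"], "docs": docs, "labels": labels}


pairs = to_labeled_pairs(raw_example)
negatives_row = to_negative_columns(raw_example, num_negatives=2)
listwise_row = to_listwise(raw_example)
print(len(pairs), list(negatives_row), listwise_row["labels"])
```

The pairwise format yields one training row per passage, while the other two formats keep all passages for a query in a single row.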
Out of these training scripts, I suspect that training_ms_marco_lambda_preprocessed.py, training_ms_marco_lambda_hard_neg.py, or training_ms_marco_bce_preprocessed.py produces the strongest model, as anecdotally LambdaLoss and BinaryCrossEntropyLoss are quite strong. Among the learning-to-rank losses, it seems that LambdaLoss > PListMLELoss > ListNetLoss > RankNetLoss > ListMLELoss, but your mileage may vary.
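For intuition about how two of these learning-to-rank losses differ, here is a minimal sketch of their textbook formulations: ListNet (top-one cross-entropy over the whole list) and RankNet (logistic loss over ordered pairs). These are not the library's implementations, which may normalize, weight, or batch differently; the function names and toy scores are assumptions.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Softmax with the maximum subtracted for numerical stability."""
    highest = max(values)
    exps = [math.exp(value - highest) for value in values]
    total = sum(exps)
    return [value / total for value in exps]


def listnet_loss(scores: list[float], labels: list[float]) -> float:
    """Listwise: cross-entropy between softmax(labels) and softmax(scores)."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def ranknet_loss(scores: list[float], labels: list[float]) -> float:
    """Pairwise: mean logistic loss over pairs where doc i is labeled above doc j."""
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


labels = [1, 0, 0]              # the first document is the relevant one
good_scores = [5.0, 0.0, -1.0]  # relevant document scored highest
bad_scores = [-1.0, 0.0, 5.0]   # relevant document scored lowest

print(listnet_loss(good_scores, labels), listnet_loss(bad_scores, labels))
print(ranknet_loss(good_scores, labels), ranknet_loss(bad_scores, labels))
```

Both losses are lower when the relevant document is scored highest; they differ in whether the list is treated as a whole or as a set of pairs.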
Additionally, you can also train with Distillation. See Cross Encoder > Training Examples > Distillation for more details.
Inference
You can perform inference with any of the pre-trained CrossEncoder models for MS MARCO like so:
from sentence_transformers import CrossEncoder
# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
# 2. Predict scores for a pair of sentences
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)
# 3. Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)
# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
"""
Query: How many people live in Berlin?
8.92 The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61 Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24 An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60 In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35 In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42 Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45 In 2015, the total labour force in Berlin was 1.85 million.
0.33 Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24 The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32 Berlin is well known for its museums.
"""