Pretrained Models

Several sparse encoder models have been publicly released on the Hugging Face Hub.

These models integrate seamlessly with a simple interface:

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")
# Run inference
queries = ["what causes aging fast"]
documents = [
    "UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again â\x80\x93 single words and multiple bullets.",
    "Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly â\x80\x94 or who experiences a sudden decline â\x80\x94 should see his or her doctor.",
    "Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 30522] [3, 30522]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[11.3768, 10.8296,  4.3457]])
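The similarity scores above are plain dot products between sparse vectors: only vocabulary dimensions that are active in both the query and the document contribute. A minimal sketch with NumPy, using toy 5-dimensional vectors rather than the real 30,522-dimensional SPLADE outputs:

```python
import numpy as np

# Toy sparse vectors over a tiny vocabulary (real SPLADE vectors have
# 30,522 dimensions, one per BERT vocabulary token, and are mostly zeros).
query = np.array([0.0, 1.2, 0.0, 0.8, 0.0])
doc_a = np.array([0.0, 0.9, 0.5, 1.1, 0.0])  # shares two active tokens with the query
doc_b = np.array([0.7, 0.0, 0.3, 0.0, 0.2])  # shares no active tokens with the query

# Dot-product similarity: only dimensions active in BOTH vectors contribute.
print(query @ doc_a)  # 1.2*0.9 + 0.8*1.1 = 1.96
print(query @ doc_b)  # 0.0 -- no overlapping active tokens
```

This is why sparse retrieval can be served efficiently from an inverted index: a document only needs to be scored if it shares at least one active token with the query.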

Core SPLADE Models

MS MARCO Passage Retrieval is the gold-standard dataset: it contains real user queries from the Bing search engine together with expert-annotated relevant passages. Models trained on this benchmark have proven highly effective as embedding models for production search systems. The reported scores reflect evaluation on this dataset; they are a good indicator, but should not be the only factor you consider.
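For reference, MRR@10 (the MS MARCO metric in the table below) averages, over all queries, the reciprocal rank of the first relevant passage within the top 10 results, counting 0 when no relevant passage appears. A minimal sketch:

```python
def mrr_at_10(first_relevant_ranks):
    """first_relevant_ranks: for each query, the 1-based rank of the first
    relevant passage in the results, or None if none was retrieved."""
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= 10:  # hits beyond rank 10 score 0
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

# Three queries: first relevant hit at rank 1, at rank 4, and not in the top 10.
print(mrr_at_10([1, 4, None]))  # (1/1 + 1/4 + 0) / 3 ≈ 0.4167
```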

BEIR (Benchmarking Information Retrieval) provides a heterogeneous benchmark for evaluating information retrieval models; in our case it covers 13 different datasets. The average nDCG@10 score represents the mean performance across all 13 datasets.
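nDCG@10 rewards placing highly relevant results near the top of the ranking, normalized against the ideal ordering. A minimal sketch of the standard formulation (graded relevance, logarithmic position discount):

```python
import math

def dcg_at_10(relevances):
    # Discounted cumulative gain over the top 10 results:
    # each result's relevance grade is discounted by log2(position + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:10]))

def ndcg_at_10(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg_at_10(sorted(relevances, reverse=True))
    return dcg_at_10(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the top results as returned by some retriever;
# the grade-1 item at position 4 should ideally have come before the 0.
print(round(ndcg_at_10([3, 2, 0, 1]), 4))  # → 0.9854
```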

Note that all of the numbers below are taken from different papers. These models represent the backbone of sparse neural retrieval.

| Model Name | MS MARCO MRR@10 | BEIR-13 avg. nDCG@10 | Parameters |
|---|---|---|---|
| opensearch-project/opensearch-neural-sparse-encoding-v2-distill | n/a | 52.8 | 67M |
| opensearch-project/opensearch-neural-sparse-encoding-v1 | n/a | 52.4 | 133M |
| naver/splade-v3 | 40.2 | 51.7 | 109M |
| ibm-granite/granite-embedding-30m-sparse | n/a | 50.8 | 30M |
| naver/splade-cocondenser-selfdistil | 37.6 | 50.7 | 109M |
| naver/splade_v2_distil | 36.8 | 50.6 | 67M |
| naver/splade-cocondenser-ensembledistil | 38.0 | 50.5 | 109M |
| naver/splade-v3-distilbert | 38.7 | 50.0 | 67M |
| prithivida/Splade_PP_en_v2 | 37.8 | 49.4 | 109M |
| naver/splade-v3-lexical | 40.0 | 49.1 | 109M |
| prithivida/Splade_PP_en_v1 | 37.2 | 48.7 | 109M |
| naver/splade_v2_max | 34.0 | 46.4 | 67M |
| rasyosef/splade-mini | 34.1 | 44.5 | 11M |
| rasyosef/splade-tiny | 30.9 | 40.6 | 4M |
| BM25 (baseline) | 18.4 | 45.6 | n/a |

Inference-free SPLADE Models

Inference-free SPLADE uses the traditional SPLADE architecture on the document side and a SparseStaticEmbedding module on the query side, which simply returns a precomputed score for each token in the query. With these models we therefore lose query expansion, but query inference becomes near-instant, which is very valuable when optimizing for speed.
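Conceptually, the query side reduces to a per-token score lookup. The toy illustration below (made-up scores, whitespace tokenization standing in for the model's real tokenizer) sketches the SparseStaticEmbedding idea, not the library's actual implementation:

```python
# Each vocabulary token has one precomputed weight, so "encoding" a query
# is just tokenization plus a table lookup -- no transformer forward pass.
precomputed_scores = {  # hypothetical weights; in practice these are learned
    "what": 0.2, "causes": 1.1, "aging": 1.8, "fast": 0.9,
}

def encode_query_static(query: str) -> dict[str, float]:
    # Whitespace tokenization stands in for the model's real subword tokenizer.
    return {tok: precomputed_scores.get(tok, 0.0) for tok in query.lower().split()}

print(encode_query_static("what causes aging fast"))
# {'what': 0.2, 'causes': 1.1, 'aging': 1.8, 'fast': 0.9}
```

Because no query expansion happens, only the literal query tokens get nonzero weights; the document side still uses the full SPLADE encoder, so expansion survives where it is cheap (at indexing time).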

| Model Name | BEIR-13 avg. nDCG@10 | Parameters |
|---|---|---|
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte | 54.6 | 137M |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill | 51.7 | 67M |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill | 50.4 | 67M |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini | 49.7 | 23M |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v1 | 49.0 | 133M |
| naver/splade-v3-doc | 47.0 | 109M |

Model Collections

These are the model collections available on the Hugging Face Hub.