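Semantic Search
Semantic search refers to search techniques that go beyond traditional keyword-based search. Rather than relying only on exact keyword matches, semantic search aims to understand the meaning and context of both the query and the documents being searched. This yields more relevant and accurate results, even when the documents contain none of the exact query keywords.
Sparse embeddings are representations in which most values are zero and only a few dimensions hold non-zero (also called "active") values. This contrasts with dense embeddings, in which all dimensions usually have non-zero values. Traditional sparse embedding solutions are usually lexical, meaning they rely on exact matches of terms or phrases. Modern sparse encoder models such as SPLADE, however, can produce embeddings that capture semantic meaning while remaining sparse.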
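To make the contrast between sparse and dense storage concrete, here is a minimal sketch in plain Python. The vocabulary size, dimension indices, and weights are made up for illustration and do not come from any real model; `to_dense` and `sparsity` are hypothetical helpers.

```python
# Toy illustration of dense vs. sparse storage; all values are invented.
vocab_size = 30_522  # assumed BERT-style vocabulary size, for illustration only

# A sparse embedding stored as {dimension index: weight}; every other dimension is 0.
sparse_embedding = {2171: 1.8, 4518: 0.9, 15756: 0.3}


def to_dense(sparse: dict[int, float], size: int) -> list[float]:
    """Expand a sparse {index: weight} mapping into a full-length list."""
    dense = [0.0] * size
    for index, weight in sparse.items():
        dense[index] = weight
    return dense


def sparsity(dense: list[float]) -> float:
    """Fraction of dimensions that are exactly zero."""
    return sum(1 for value in dense if value == 0.0) / len(dense)


dense_embedding = to_dense(sparse_embedding, vocab_size)
active_dims = len(sparse_embedding)
print(f"Stored values: sparse={active_dims}, dense={len(dense_embedding)}")
print(f"Sparsity: {sparsity(dense_embedding):.4%}")
```

The sparse form only needs to store the three active dimensions, while the dense form carries every zero explicitly.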
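These embeddings allow for very efficient semantic search, as long as the search solution takes full advantage of the fact that the vast majority of dimensions are 0. This page shows an example of how to perform semantic search manually, and how to integrate SparseEncoder models with popular vector databases and search systems.
If you are not familiar with semantic search, see Sentence Transformers > Semantic Search for a broader explanation that uses dense embedding models.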
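Manual Search
Performing semantic search manually with a sparse encoder is straightforward and takes only a few steps:
1. Load the SparseEncoder model: load a pretrained sparse encoder model from the Hugging Face Hub or a local directory.
2. Encode the corpus: use the model to encode a collection of documents (the corpus) into sparse embeddings.
3. Encode the query: use the same model to encode the user query into a sparse embedding.
4. Compute similarity: compute the similarity between the query embedding and the corpus embeddings with a suitable similarity function (e.g. cosine similarity or dot product).
5. Retrieve results: sort the results by similarity score and return the most relevant documents.
6. Analyze results: optionally, inspect the results to see which tokens contribute most to the similarity score.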
from sentence_transformers import SparseEncoder, util
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 2. Encode a corpus of texts using the SparseEncoder model
corpus = [
    "Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.",
    "Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.",
    "Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains.",
    "Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments.",
    "The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy.",
    "SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond.",
    "Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities.",
    "Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time.",
    "Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground.",
]
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = model.encode_document(corpus, convert_to_tensor=True)
# 3. Encode the user queries using the same SparseEncoder model
queries = [
    "How do artificial neural networks work?",
    "What technology is used for modern space exploration?",
    "How can we address climate change challenges?",
]
query_embeddings = model.encode_query(queries, convert_to_tensor=True)
# 4. Use the similarity function to compute the similarity scores between the query and corpus embeddings
top_k = min(5, len(corpus))  # Find at most 5 sentences of the corpus for each query sentence
results = util.semantic_search(query_embeddings, corpus_embeddings, top_k=top_k, score_function=model.similarity)
# 5. Sort the results and print the top 5 most similar sentences for each query
for query_id, query in enumerate(queries):
    pointwise_scores = model.intersection(query_embeddings[query_id], corpus_embeddings)
    print(f"Query: {query}")
    for res in results[query_id]:
        corpus_id, score = res.values()
        sentence = corpus[corpus_id]
        pointwise_score = model.decode(pointwise_scores[corpus_id], top_k=10)
        token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in pointwise_score])
        print(f"Score: {score:.4f} - Sentence: {sentence} - Top influential tokens: {token_scores}")
    print("")
Example output:
"""
Query: How do artificial neural networks work?
Score: 16.9053 - Sentence: Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains. - Top influential tokens: ("neural", 5.71), ("networks", 3.24), ("network", 2.93), ("brain", 2.10), ("computer", 0.50), ("##uron", 0.32), ("artificial", 0.27), ("technology", 0.27), ("communication", 0.27), ("connection", 0.21)
Score: 13.6119 - Sentence: Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. - Top influential tokens: ("artificial", 3.71), ("neural", 3.15), ("networks", 1.78), ("brain", 1.22), ("network", 1.12), ("ai", 1.07), ("machine", 0.39), ("robot", 0.20), ("technology", 0.20), ("algorithm", 0.18)
Score: 2.7373 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("machine", 0.78), ("computer", 0.50), ("technology", 0.32), ("artificial", 0.22), ("robot", 0.21), ("ai", 0.20), ("process", 0.16), ("theory", 0.11), ("technique", 0.11), ("fuzzy", 0.06)
Score: 2.1430 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("technology", 0.42), ("function", 0.41), ("mechanism", 0.21), ("sensor", 0.21), ("device", 0.18), ("process", 0.18), ("generator", 0.13), ("detection", 0.10), ("technique", 0.10), ("tracking", 0.05)
Score: 2.0195 - Sentence: Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments. - Top influential tokens: ("robot", 0.67), ("function", 0.34), ("technology", 0.29), ("device", 0.23), ("experiment", 0.20), ("machine", 0.10), ("artificial", 0.08), ("design", 0.04), ("useful", 0.03), ("they", 0.02)
Query: What technology is used for modern space exploration?
Score: 10.4748 - Sentence: SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond. - Top influential tokens: ("space", 4.40), ("technology", 1.15), ("nasa", 1.06), ("mars", 0.63), ("exploration", 0.52), ("spacecraft", 0.44), ("robot", 0.32), ("rocket", 0.28), ("astronomy", 0.27), ("travel", 0.26)
Score: 9.3818 - Sentence: The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy. - Top influential tokens: ("space", 3.89), ("nasa", 1.09), ("astronomy", 0.93), ("discovery", 0.48), ("instrument", 0.47), ("technology", 0.35), ("device", 0.26), ("spacecraft", 0.25), ("invented", 0.22), ("equipment", 0.22)
Score: 8.5147 - Sentence: Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments. - Top influential tokens: ("technology", 1.39), ("mars", 0.79), ("exploration", 0.78), ("robot", 0.67), ("used", 0.66), ("nasa", 0.52), ("spacecraft", 0.44), ("device", 0.39), ("explore", 0.38), ("travel", 0.25)
Score: 7.6993 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("technology", 1.99), ("tech", 1.76), ("technologies", 1.74), ("equipment", 0.32), ("device", 0.31), ("technological", 0.28), ("mining", 0.22), ("sensor", 0.19), ("tool", 0.18), ("software", 0.11)
Score: 2.5526 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("technology", 1.52), ("machine", 0.27), ("robot", 0.21), ("computer", 0.18), ("engineering", 0.12), ("technique", 0.11), ("science", 0.05), ("technological", 0.05), ("techniques", 0.02), ("innovation", 0.01)
Query: How can we address climate change challenges?
Score: 9.5587 - Sentence: Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities. - Top influential tokens: ("climate", 3.21), ("warming", 2.87), ("weather", 1.58), ("change", 0.46), ("global", 0.41), ("environmental", 0.39), ("storm", 0.19), ("pollution", 0.15), ("environment", 0.11), ("adaptation", 0.08)
Score: 1.3191 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("warming", 0.39), ("pollution", 0.34), ("environmental", 0.15), ("goal", 0.12), ("strategy", 0.07), ("monitoring", 0.07), ("protection", 0.06), ("greenhouse", 0.05), ("safety", 0.02), ("escape", 0.01)
Score: 1.0774 - Sentence: Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time. - Top influential tokens: ("conservation", 0.39), ("sustainability", 0.18), ("environmental", 0.18), ("sustainable", 0.13), ("agriculture", 0.13), ("alternative", 0.07), ("recycling", 0.00)
Score: 0.2401 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("strategy", 0.10), ("success", 0.06), ("foster", 0.04), ("engineering", 0.03), ("innovation", 0.00), ("research", 0.00)
Score: 0.1516 - Sentence: Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. - Top influential tokens: ("strategy", 0.09), ("foster", 0.04), ("research", 0.01), ("approach", 0.01), ("engineering", 0.01)
"""
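Vector Database Search
Alternatively, several vector databases and search engines can be used to perform semantic search with sparse encoders. These systems are designed to handle large-scale vector data efficiently and to provide fast retrieval of relevant documents. They can use the sparsity of the embeddings to optimize storage and search operations.
The overall structure is similar to manual search, but the vector database handles indexing and retrieval of the documents. The steps are roughly as follows:
1. Encode the corpus: load your data and encode the documents with a pretrained sparse encoder.
2. Index: index the documents and their sparse embeddings in the vector database.
3. Encode the query: the user query is encoded with the same sparse encoder.
4. Retrieve: the vector database performs a similarity search to find the most relevant documents.
5. Results: the search results are returned with their similarity scores and document contents.
The advantages of sparse vectors for search are:
- Efficiency: sparse vectors, in which most values are zero, can be stored and searched more efficiently than dense vectors.
- Interpretability: the non-zero dimensions of a sparse embedding usually correspond to specific tokens, so you can see which tokens contribute to the similarity score.
- Exact matching: sparse vectors can preserve exact term-matching signals that may be lost in dense embeddings.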
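The efficiency point can be illustrated with a toy inverted index, one common way to exploit sparsity: only tokens that are non-zero in the query are looked up, so documents that share no active token with the query are never touched. This is a simplified sketch in plain Python with invented weights. `TinySparseIndex` is a hypothetical class, and it does not reflect how Qdrant, OpenSearch, Elasticsearch, or Seismic are implemented internally.

```python
import heapq
from collections import defaultdict


class TinySparseIndex:
    """Hypothetical in-memory inverted index over {token: weight} embeddings."""

    def __init__(self) -> None:
        # token -> list of (corpus_id, weight) postings
        self.postings: dict[str, list[tuple[int, float]]] = defaultdict(list)

    def add(self, corpus_id: int, embedding: dict[str, float]) -> None:
        for token, weight in embedding.items():
            if weight != 0.0:  # zeros are simply never stored
                self.postings[token].append((corpus_id, weight))

    def search(self, query: dict[str, float], top_k: int = 5) -> list[dict]:
        scores: dict[int, float] = defaultdict(float)
        for token, query_weight in query.items():
            # Only postings of tokens active in the query are visited.
            for corpus_id, doc_weight in self.postings.get(token, []):
                scores[corpus_id] += query_weight * doc_weight  # dot product, term by term
        best = heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])
        return [{"corpus_id": corpus_id, "score": score} for corpus_id, score in best]


index = TinySparseIndex()
index.add(0, {"neural": 2.0, "network": 1.5, "brain": 1.0})
index.add(1, {"mars": 2.5, "rover": 2.0, "robot": 1.0})
index.add(2, {"climate": 2.0, "warming": 1.5})

hits = index.search({"neural": 1.0, "robot": 0.5}, top_k=2)
print(hits)
```

Document 2 shares no token with the query, so it is never scored. The returned entries use the same `corpus_id` and `score` keys as the search results printed in the scripts on this page.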
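Qdrant Integration
This example shows how to set up Qdrant for sparse vector search: how to efficiently encode and index documents with a sparse encoder, how to formulate search queries with sparse vectors, and how to run an interactive query loop. See semantic_search_qdrant.py or the following:
Prerequisites:
- Qdrant running locally (or otherwise accessible); see the Qdrant quickstart for more details.
- The Qdrant Python client must be installed: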
pip install qdrant-client
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
    corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
# Initially, we don't have a qdrant index yet
corpus_index = None
while True:
    # 5. Encode the queries using the full precision
    start_time = time.time()
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    print(f"Encoding time: {time.time() - start_time:.6f} seconds")
    # 6. Perform semantic search using qdrant
    results, search_time, corpus_index = semantic_search_qdrant(
        query_embeddings,
        corpus_index=corpus_index,
        corpus_embeddings=corpus_embeddings if corpus_index is None else None,
        top_k=5,
        output_index=True,
    )
    # 7. Output the results
    print(f"Search time: {search_time:.6f} seconds")
    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
        print("")
    # 8. Prompt for more queries
    queries = [input("Please enter a question: ")]
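OpenSearch Integration
This example shows how to set up OpenSearch for sparse vector search: how to efficiently encode and index documents with a sparse encoder, how to formulate search queries with sparse vectors, and how to run an interactive query loop. See semantic_search_opensearch.py or the following:
Prerequisites:
- OpenSearch running locally (or otherwise accessible); see Running OpenSearch locally for more details.
- In addition, the OpenSearch Python client must be installed: https://docs.opensearch.org.cn/docs/latest/clients/python-low-level/, e.g.: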
pip install opensearch-py
This script was created for opensearch v2.15.0+.
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import MLMTransformer, SparseStaticEmbedding, SpladePooling
from sentence_transformers.sparse_encoder.search_engines import semantic_search_opensearch
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
print(f"Finish loading data. Corpus size: {len(corpus)}")
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill"
doc_encoder = MLMTransformer(model_id)
router = Router.for_query_document(
    query_modules=[
        SparseStaticEmbedding.from_json(
            model_id,
            tokenizer=doc_encoder.tokenizer,
            frozen=True,
        ),
    ],
    document_modules=[
        doc_encoder,
        SpladePooling("max", activation_function="log1p_relu"),
    ],
)
sparse_model = SparseEncoder(modules=[router], similarity_fn_name="dot")
print("Start encoding corpus...")
start_time = time.time()
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
    corpus, convert_to_sparse_tensor=True, batch_size=32, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
    # 5. Encode the queries using inference-free mode
    start_time = time.time()
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    query_embeddings_decoded = sparse_model.decode(query_embeddings)
    print(f"Query encoding time: {time.time() - start_time:.6f} seconds")
    # 6. Perform semantic search using OpenSearch
    results, search_time, corpus_index = semantic_search_opensearch(
        query_embeddings_decoded,
        corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
        corpus_index=corpus_index,
        top_k=5,
        output_index=True,
    )
    # 7. Output the results
    print(f"Search time: {search_time:.6f} seconds")
    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
        print("")
    # 8. Prompt for more queries
    queries = [input("Please enter a question: ")]
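The script above passes decoded embeddings to the search helper rather than raw tensors. As a rough picture of what decoding amounts to, the sketch below turns a vector over a tiny made-up vocabulary into (token, weight) pairs, mirroring the token/value pairs that `model.decode` yields in the manual search example. The vocabulary, the weights, and the `decode_toy` helper are all hypothetical; the real method works on the model's own tokenizer vocabulary.

```python
from typing import Optional

# Hypothetical five-token vocabulary; a real model has tens of thousands of entries.
vocab = ["climate", "mars", "neural", "network", "warming"]


def decode_toy(embedding: list[float], top_k: Optional[int] = None) -> list[tuple[str, float]]:
    """Keep only non-zero dimensions and map each index to its token, highest weight first."""
    pairs = [(vocab[i], weight) for i, weight in enumerate(embedding) if weight != 0.0]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs if top_k is None else pairs[:top_k]


embedding = [0.0, 0.0, 2.1, 1.4, 0.0]
decoded = decode_toy(embedding)
print(decoded)
```

A token-to-weight form like this is a natural fit for text search engines, which index terms rather than vector positions.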
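Elasticsearch Integration
This example shows how to set up Elasticsearch for sparse vector search: how to efficiently encode and index documents with a sparse encoder, how to formulate search queries with sparse vectors, and how to run an interactive query loop. See semantic_search_elasticsearch.py or the following:
Prerequisites:
- Elasticsearch running locally (or otherwise accessible); see Running Elasticsearch locally for more details.
- The Elasticsearch Python client must be installed: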
pip install elasticsearch
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_elasticsearch
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
print("Start encoding corpus...")
start_time = time.time()
corpus_embeddings = sparse_model.encode_document(
    corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
    # 5. Encode the queries using the full precision
    start_time = time.time()
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    query_embeddings_decoded = sparse_model.decode(query_embeddings)
    print(f"Encoding time: {time.time() - start_time:.6f} seconds")
    # 6. Perform semantic search using Elasticsearch
    results, search_time, corpus_index = semantic_search_elasticsearch(
        query_embeddings_decoded,
        corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
        corpus_index=corpus_index,
        top_k=5,
        output_index=True,
    )
    # 7. Output the results
    print(f"Search time: {search_time:.6f} seconds")
    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
        print("")
    # 8. Prompt for more queries
    queries = [input("Please enter a question: ")]
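Seismic Integration
This example shows how to use Seismic for highly performant sparse vector search. It does not require running a separate client; instead, it performs the search directly in memory. The Seismic library was introduced in Bruch et al. (2024), where it is shown to be an order of magnitude faster than the common inverted file (IVF) approach. For more information on building a Seismic index, see the Seismic guide. See semantic_search_seismic.py or the following:
Prerequisites:
- The Seismic Python package must be installed: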
pip install pyseismic-lsr
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_seismic
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
print("Start encoding corpus...")
start_time = time.time()
corpus_embeddings = sparse_model.encode_document(
    corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
    # 5. Encode the queries using the full precision
    start_time = time.time()
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    query_embeddings_decoded = sparse_model.decode(query_embeddings)
    print(f"Encoding time: {time.time() - start_time:.6f} seconds")
    # 6. Perform semantic search using Seismic
    results, search_time, corpus_index = semantic_search_seismic(
        query_embeddings_decoded,
        corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
        corpus_index=corpus_index,
        top_k=5,
        output_index=True,
    )
    # 7. Output the results
    print(f"Search time: {search_time:.6f} seconds")
    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
        print("")
    # 8. Prompt for more queries
    queries = [input("Please enter a question: ")]
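SPLADE-index Integration
This example shows how to use splade-index for very fast sparse vector search. It is powered by SciPy sparse matrices and built on top of bm25s, an excellent fast BM25 implementation. It does not require running a separate client; instead, it performs the search directly in memory. See semantic_search_splade_index.py or the following:
Prerequisites:
- The SPLADE-index Python package must be installed: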
pip install splade-index
import time
from datasets import load_dataset
from splade_index import SPLADE
from sentence_transformers import SparseEncoder
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("rasyosef/splade-tiny")
# 4. Encode the corpus & create the index
print("Start encoding corpus and creating index...")
start_time = time.time()
corpus_index = SPLADE()
corpus_index.index(model=sparse_model, documents=corpus, batch_size=16, show_progress=True)
print(f"Encoded corpus and created index in {time.time() - start_time:.6f} seconds")
while True:
    # 5. Encode the queries and search the index in a single call
    start_time = time.time()
    all_doc_ids, all_documents, all_scores = corpus_index.retrieve(queries, k=5)
    print(f"Encoding & Search time: {time.time() - start_time:.6f} seconds")
    # 6. Output the results
    for query, doc_ids, documents, scores in zip(queries, all_doc_ids, all_documents, all_scores):
        print(f"Query: {query}")
        for doc_id, document, score in zip(doc_ids, documents, scores):
            print(f"(Score: {score:.4f}) {document}, corpus_id: {doc_id}")
        print("")
    # 7. Prompt for more queries
    queries = [input("Please enter a question: ")]