语义搜索
语义搜索是指超越传统基于关键词搜索的技术。语义搜索不再仅仅依赖关键词的精确匹配,而是旨在理解查询和被搜索文档的含义和上下文。这使得即使关键词不完全匹配,也能获得更相关和准确的搜索结果。
稀疏嵌入是一种表示形式,其中大多数值均为零,只有少数维度包含非零(也称为活跃)值。这与密集嵌入形成对比,密集嵌入中所有维度通常都具有非零值。传统的稀疏嵌入解决方案通常基于词汇,这意味着它们依赖于术语或短语的精确匹配。然而,SPLADE 等现代稀疏编码器模型能够生成在捕获语义含义的同时保持稀疏性的嵌入。
只要搜索解决方案充分利用稀疏嵌入维度绝大多数为 0 的事实,这些嵌入就能实现极其高效的语义搜索。本页展示了一个示例,演示如何手动执行语义搜索,以及如何将 SparseEncoder 模型与流行的向量数据库/搜索系统集成。
如果您不熟悉语义搜索,请参阅Sentence Transformers > 语义搜索,以获取使用密集嵌入模型的更广泛解释。
手动搜索
使用稀疏编码器手动执行语义搜索非常简单,只需几个步骤:
加载 SparseEncoder 模型:从 Hugging Face Hub 或本地目录加载预训练的稀疏编码器模型。
编码语料库:使用模型将文档集(语料库)编码为稀疏嵌入。
编码查询:使用相同的模型将用户查询编码为稀疏嵌入。
计算相似度:使用合适的相似度函数(例如,余弦相似度、点积)计算查询嵌入与语料库嵌入之间的相似度。
检索结果:根据相似度分数对结果进行排序,并返回最相关的文档。
分析结果:可选地,分析结果以了解哪些标记对相似度分数贡献最大。
from sentence_transformers import SparseEncoder, util
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 2. Encode a corpus of texts using the SparseEncoder model
corpus = [
"Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.",
"Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.",
"Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains.",
"Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments.",
"The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy.",
"SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond.",
"Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities.",
"Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time.",
"Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground.",
]
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = model.encode_document(corpus, convert_to_tensor=True)
# 3. Encode the user queries using the same SparseEncoder model
queries = [
"How do artificial neural networks work?",
"What technology is used for modern space exploration?",
"How can we address climate change challenges?",
]
query_embeddings = model.encode_query(queries, convert_to_tensor=True)
# 4. Use the similarity function to compute the similarity scores between the query and corpus embeddings
top_k = min(5, len(corpus)) # Find at most 5 sentences of the corpus for each query sentence
results = util.semantic_search(query_embeddings, corpus_embeddings, top_k=top_k, score_function=model.similarity)
# 5. Sort the results and print the top 5 most similar sentences for each query
for query_id, query in enumerate(queries):
pointwise_scores = model.intersection(query_embeddings[query_id], corpus_embeddings)
print(f"Query: {query}")
for res in results[query_id]:
corpus_id, score = res.values()
sentence = corpus[corpus_id]
pointwise_score = model.decode(pointwise_scores[corpus_id], top_k=10)
token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in pointwise_score])
print(f"Score: {score:.4f} - Sentence: {sentence} - Top influential tokens: {token_scores}")
print("")
切换以查看结果
"""
Query: How do artificial neural networks work?
Score: 16.9053 - Sentence: Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains. - Top influential tokens: ("neural", 5.71), ("networks", 3.24), ("network", 2.93), ("brain", 2.10), ("computer", 0.50), ("##uron", 0.32), ("artificial", 0.27), ("technology", 0.27), ("communication", 0.27), ("connection", 0.21)
Score: 13.6119 - Sentence: Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. - Top influential tokens: ("artificial", 3.71), ("neural", 3.15), ("networks", 1.78), ("brain", 1.22), ("network", 1.12), ("ai", 1.07), ("machine", 0.39), ("robot", 0.20), ("technology", 0.20), ("algorithm", 0.18)
Score: 2.7373 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("machine", 0.78), ("computer", 0.50), ("technology", 0.32), ("artificial", 0.22), ("robot", 0.21), ("ai", 0.20), ("process", 0.16), ("theory", 0.11), ("technique", 0.11), ("fuzzy", 0.06)
Score: 2.1430 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("technology", 0.42), ("function", 0.41), ("mechanism", 0.21), ("sensor", 0.21), ("device", 0.18), ("process", 0.18), ("generator", 0.13), ("detection", 0.10), ("technique", 0.10), ("tracking", 0.05)
Score: 2.0195 - Sentence: Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments. - Top influential tokens: ("robot", 0.67), ("function", 0.34), ("technology", 0.29), ("device", 0.23), ("experiment", 0.20), ("machine", 0.10), ("artificial", 0.08), ("design", 0.04), ("useful", 0.03), ("they", 0.02)
Query: What technology is used for modern space exploration?
Score: 10.4748 - Sentence: SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond. - Top influential tokens: ("space", 4.40), ("technology", 1.15), ("nasa", 1.06), ("mars", 0.63), ("exploration", 0.52), ("spacecraft", 0.44), ("robot", 0.32), ("rocket", 0.28), ("astronomy", 0.27), ("travel", 0.26)
Score: 9.3818 - Sentence: The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy. - Top influential tokens: ("space", 3.89), ("nasa", 1.09), ("astronomy", 0.93), ("discovery", 0.48), ("instrument", 0.47), ("technology", 0.35), ("device", 0.26), ("spacecraft", 0.25), ("invented", 0.22), ("equipment", 0.22)
Score: 8.5147 - Sentence: Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments. - Top influential tokens: ("technology", 1.39), ("mars", 0.79), ("exploration", 0.78), ("robot", 0.67), ("used", 0.66), ("nasa", 0.52), ("spacecraft", 0.44), ("device", 0.39), ("explore", 0.38), ("travel", 0.25)
Score: 7.6993 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("technology", 1.99), ("tech", 1.76), ("technologies", 1.74), ("equipment", 0.32), ("device", 0.31), ("technological", 0.28), ("mining", 0.22), ("sensor", 0.19), ("tool", 0.18), ("software", 0.11)
Score: 2.5526 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("technology", 1.52), ("machine", 0.27), ("robot", 0.21), ("computer", 0.18), ("engineering", 0.12), ("technique", 0.11), ("science", 0.05), ("technological", 0.05), ("techniques", 0.02), ("innovation", 0.01)
Query: How can we address climate change challenges?
Score: 9.5587 - Sentence: Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities. - Top influential tokens: ("climate", 3.21), ("warming", 2.87), ("weather", 1.58), ("change", 0.46), ("global", 0.41), ("environmental", 0.39), ("storm", 0.19), ("pollution", 0.15), ("environment", 0.11), ("adaptation", 0.08)
Score: 1.3191 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("warming", 0.39), ("pollution", 0.34), ("environmental", 0.15), ("goal", 0.12), ("strategy", 0.07), ("monitoring", 0.07), ("protection", 0.06), ("greenhouse", 0.05), ("safety", 0.02), ("escape", 0.01)
Score: 1.0774 - Sentence: Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time. - Top influential tokens: ("conservation", 0.39), ("sustainability", 0.18), ("environmental", 0.18), ("sustainable", 0.13), ("agriculture", 0.13), ("alternative", 0.07), ("recycling", 0.00)
Score: 0.2401 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("strategy", 0.10), ("success", 0.06), ("foster", 0.04), ("engineering", 0.03), ("innovation", 0.00), ("research", 0.00)
Score: 0.1516 - Sentence: Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. - Top influential tokens: ("strategy", 0.09), ("foster", 0.04), ("research", 0.01), ("approach", 0.01), ("engineering", 0.01)
"""
向量数据库搜索
此外,一些向量数据库和搜索引擎也可用于使用稀疏编码器执行语义搜索。这些系统旨在高效处理大规模向量数据并快速检索相关文档。它们可以利用嵌入的稀疏性来优化存储和搜索操作。
整体结构与手动搜索类似,但向量数据库负责处理文档的索引和检索。步骤大致如下:
编码语料库:加载您的数据并使用预训练的稀疏编码器对文档进行编码。
索引:文档及其稀疏嵌入在向量数据库中进行索引。
编码查询:用户查询使用相同的稀疏编码器进行编码。
检索:向量数据库执行相似度搜索以查找最相关的文档。
结果:搜索结果连同其相似度分数和文档内容一并返回。
稀疏向量用于搜索的优点是:
效率:稀疏向量(其中大多数值为零)的存储和搜索效率高于密集向量。
可解释性:稀疏嵌入中的非零维度通常对应于特定标记,使您能够理解哪些标记对相似度分数有贡献。
精确匹配:稀疏向量可以保留在密集嵌入中可能丢失的精确术语匹配信号。
Qdrant 集成
本示例演示如何设置 Qdrant 进行稀疏向量搜索,展示了如何使用稀疏编码器高效编码和索引文档、使用稀疏向量构建搜索查询,并提供交互式查询界面。请参阅 semantic_search_qdrant.py 或以下内容:
先决条件:
Qdrant 在本地运行(或可访问),更多详细信息请参阅 Qdrant 快速入门。
已安装 Python Qdrant 客户端
pip install qdrant-client
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
# Initially, we don't have a qdrant index yet
corpus_index = None
while True:
# 5. Encode the queries using the full precision
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
print(f"Encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using qdrant
results, search_time, corpus_index = semantic_search_qdrant(
query_embeddings,
corpus_index=corpus_index,
corpus_embeddings=corpus_embeddings if corpus_index is None else None,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]
OpenSearch 集成
本示例演示如何设置 OpenSearch 进行稀疏向量搜索,展示了如何使用稀疏编码器高效编码和索引文档、使用稀疏向量构建搜索查询,并提供交互式查询界面。请参阅 semantic_search_opensearch.py 或以下内容:
先决条件:
OpenSearch 在本地运行(或可访问),更多详细信息请参阅 本地 OpenSearch。
此外,您需要安装 Python OpenSearch 客户端:https://docs.opensearch.org/docs/latest/clients/python-low-level/,例如:
pip install opensearch-py
此脚本是为
opensearch
v2.15.0+ 创建的。
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import MLMTransformer, SparseStaticEmbedding, SpladePooling
from sentence_transformers.sparse_encoder.search_engines import semantic_search_opensearch
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
print(f"Finish loading data. Corpus size: {len(corpus)}")
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill"
doc_encoder = MLMTransformer(model_id)
router = Router.for_query_document(
query_modules=[
SparseStaticEmbedding.from_json(
model_id,
tokenizer=doc_encoder.tokenizer,
frozen=True,
),
],
document_modules=[
doc_encoder,
SpladePooling("max", activation_function="log1p_relu"),
],
)
sparse_model = SparseEncoder(modules=[router], similarity_fn_name="dot")
print("Start encoding corpus...")
start_time = time.time()
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=32, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
# 5. Encode the queries using inference-free mode
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
query_embeddings_decoded = sparse_model.decode(query_embeddings)
print(f"Query encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using OpenSearch
results, search_time, corpus_index = semantic_search_opensearch(
query_embeddings_decoded,
corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
corpus_index=corpus_index,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]
Seismic 集成
本示例演示如何使用 Seismic 进行性能极高的稀疏向量搜索。它不需要运行单独的客户端,而是直接在内存中执行搜索。Seismic 库在 Bruch 等人 (2024) 的论文中提出,其中显示其性能比常见的倒排文件(IVF)方法高出一个数量级。有关构建 Seismic 索引的更多信息,请查阅 Seismic 指南。请参阅 semantic_search_seismic.py 或以下内容:
先决条件:
已安装 Seismic Python 包
pip install pyseismic-lsr
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_seismic
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
print("Start encoding corpus...")
start_time = time.time()
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
# 5. Encode the queries using the full precision
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
query_embeddings_decoded = sparse_model.decode(query_embeddings)
print(f"Encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using Seismic
results, search_time, corpus_index = semantic_search_seismic(
query_embeddings_decoded,
corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
corpus_index=corpus_index,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]
Elasticsearch 集成
本示例演示如何设置 Elasticsearch 进行稀疏向量搜索,展示了如何使用稀疏编码器高效编码和索引文档、使用稀疏向量构建搜索查询,并提供交互式查询界面。请参阅 semantic_search_elasticsearch.py 或以下内容:
先决条件:
Elasticsearch 在本地运行(或可访问),更多详细信息请参阅 本地 Elasticsearch。
已安装 Python Elasticsearch 客户端
pip install elasticsearch
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_elasticsearch
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
print("Start encoding corpus...")
start_time = time.time()
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
# 5. Encode the queries using the full precision
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
query_embeddings_decoded = sparse_model.decode(query_embeddings)
print(f"Encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using Elasticsearch
results, search_time, corpus_index = semantic_search_elasticsearch(
query_embeddings_decoded,
corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
corpus_index=corpus_index,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]