Computing Sparse Embeddings
Once you have installed Sentence Transformers, you can easily use sparse encoder models:
from sentence_transformers import SparseEncoder
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions
# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 35.629, 9.154, 0.098],
# [ 9.154, 27.478, 0.019],
# [ 0.098, 0.019, 29.553]])
# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}") # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")
Note
Even though we talk about sentence embeddings, sparse encoders can be used for shorter phrases as well as for longer texts with multiple sentences. See Input Sequence Length for notes on embeddings for longer texts.
Initializing a Sparse Encoder Model
The first step is to load a pretrained sparse encoder model. You can use any of the pretrained models or a local model. See also SparseEncoder for information on parameters.
from sentence_transformers import SparseEncoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# Alternatively, you can pass a path to a local model directory:
model = SparseEncoder("output/models/sparse-distilbert-nq-finetuned")
The model is automatically placed on the best-performing available device, e.g. cuda or mps if available. You can also specify the device explicitly:
model = SparseEncoder("naver/splade-cocondenser-ensembledistil", device="cuda")
Computing Embeddings
The method to compute embeddings is SparseEncoder.encode.
Input Sequence Length
For transformer models like BERT, RoBERTa, DistilBERT, etc., the runtime and memory requirements grow quadratically with the input length. This limits transformers to inputs of a certain length. A common value for BERT-based models is 512 tokens, which corresponds to roughly 300-400 words (for English).
Each model has a maximum sequence length, model.max_seq_length, which is the maximum number of tokens that can be processed. Longer texts are truncated to the first model.max_seq_length tokens:
from sentence_transformers import SparseEncoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
print("Max Sequence Length:", model.max_seq_length)
# => Max Sequence Length: 256
# Change the length to 200
model.max_seq_length = 200
print("Max Sequence Length:", model.max_seq_length)
# => Max Sequence Length: 200
Note
You cannot increase the length beyond what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for longer texts may not be as good.
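The truncation described above amounts to simply cutting the token sequence off at max_seq_length. A minimal sketch in plain Python, using a hypothetical list of token ids rather than the model's real tokenizer:

```python
def truncate(token_ids, max_seq_length):
    # Keep only the first max_seq_length tokens; everything after is discarded
    return token_ids[:max_seq_length]

token_ids = list(range(300))  # a hypothetical 300-token input
truncated = truncate(token_ids, 256)
print(len(truncated))  # 256
```

Any information in the discarded tail never reaches the model, which is why long documents are often split into smaller passages before encoding.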
Controlling Sparsity
For sparse models, you can control the maximum number of active dimensions (non-zero values) in the output embeddings with the max_active_dims parameter. This is particularly useful for reducing memory usage and storage requirements, and for balancing the trade-off between accuracy and retrieval latency.
You can specify max_active_dims either when initializing the model or during encoding:
from sentence_transformers import SparseEncoder
# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# Embed a list of sentences
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]
# Generate embeddings
embeddings = model.encode(sentences)
# Print embedding dimensionality and sparsity
print(f"Embedding dim: {model.get_sentence_embedding_dimension()}")
stats = model.sparsity(embeddings)
print(f"Embedding sparsity: {stats}")
print(f"Average non-zero dimensions: {stats['active_dims']:.2f}")
print(f"Sparsity percentage: {stats['sparsity_ratio']:.2%}")
"""
Embedding dim: 30522
Embedding sparsity: {'active_dims': 56.333335876464844, 'sparsity_ratio': 0.9981543366792325}
Average non-zero dimensions: 56.33
Sparsity percentage: 99.82%
"""
# Example of using max_active_dims during encoding to limit the active dimensions
embeddings_limited = model.encode(sentences, max_active_dims=32)
stats_limited = model.sparsity(embeddings_limited)
print(f"Limited embedding sparsity: {stats_limited}")
print(f"Average non-zero dimensions: {stats_limited['active_dims']:.2f}")
print(f"Sparsity percentage: {stats_limited['sparsity_ratio']:.2%}")
"""
Limited embedding sparsity: {'active_dims': 32.0, 'sparsity_ratio': 0.9989515759124565}
Average non-zero dimensions: 32.00
Sparsity percentage: 99.90%
"""
When you set max_active_dims, the model keeps only the top-k dimensions with the highest values and sets all other values to zero. This ensures your embeddings retain the most important semantic information while maintaining controlled sparsity.
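The top-k retention can be sketched without the library: given a vector of scores, keep the k largest values and zero out the rest. A simplified illustration in plain Python, not the library's actual implementation:

```python
def keep_top_k(values, k):
    # Indices of the k largest values (ties broken arbitrarily)
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    keep = set(top)
    # Zero out every dimension that is not among the top-k
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

vec = [0.1, 2.5, 0.0, 1.7, 0.3, 4.2]
print(keep_top_k(vec, 2))  # [0.0, 2.5, 0.0, 0.0, 0.0, 4.2]
```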
Note
Setting max_active_dims too low may degrade the quality of your search results. The optimal value depends on your specific use case and dataset.
One of the key benefits of controlling sparsity with max_active_dims is reduced memory usage. Here is an example showing the memory savings:
def get_sparse_embedding_memory_size(tensor):
    # For sparse tensors, only count non-zero elements
    return (tensor._values().element_size() * tensor._values().nelement() +
            tensor._indices().element_size() * tensor._indices().nelement())
print(f"Original embeddings memory: {get_sparse_embedding_memory_size(embeddings) / 1024:.2f} KB")
print(f"Embeddings with max_active_dims=32 memory: {get_sparse_embedding_memory_size(embeddings_limited) / 1024:.2f} KB")
"""
Original embeddings memory: 3.32 KB
Embeddings with max_active_dims=32 memory: 1.88 KB
"""
As the example shows, limiting the active dimensions to 32 reduced memory usage by roughly 43%. This efficiency becomes even more significant when working with large document collections, but has to be weighed against a potential loss in quality of the embedding representations. Note that every evaluator class accepts a max_active_dims parameter that can be set during evaluation to control the number of active dimensions, so you can easily compare the performance of different settings.
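The figures above are consistent with the storage cost of a sparse COO tensor: each non-zero entry stores one float32 value (4 bytes) plus two int64 indices, one for the row and one for the column (8 bytes each), i.e. about 20 bytes per non-zero. A back-of-the-envelope check, assuming these dtypes (PyTorch's defaults for sparse COO tensors):

```python
BYTES_PER_NNZ = 4 + 2 * 8  # float32 value + int64 row and column indices

def sparse_memory_kb(num_embeddings, active_dims):
    # Total non-zeros across all embeddings, times the per-entry cost
    nnz = num_embeddings * active_dims
    return nnz * BYTES_PER_NNZ / 1024

print(f"{sparse_memory_kb(3, 32):.2f} KB")  # 1.88 KB, matching the limited case above
```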
Interpretability with SPLADE Models
A key advantage of SPLADE models is interpretability. You can easily visualize which tokens contribute most to an embedding, giving insight into what the model considers important in a text.
from sentence_transformers import SparseEncoder
# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# Embed a list of sentences
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]
# Generate embeddings
embeddings = model.encode(sentences)
# Visualize top tokens for each text
top_k = 10
token_weights = model.decode(embeddings, top_k=top_k)
print(f"\nTop tokens {top_k} for each text:")
# decode returns, for each input, a list of (token, weight) tuples
for i, sentence in enumerate(sentences):
    token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in token_weights[i]])
    print(f"{i}: {sentence} -> Top tokens: {token_scores}")
"""
Top tokens 10 for each text:
0: This framework generates embeddings for each input sentence -> Top tokens: ("framework", 2.19), ("##bed", 2.12), ("input", 1.99), ("each", 1.60), ("em", 1.58), ("sentence", 1.49), ("generate", 1.42), ("##ding", 1.33), ("sentences", 1.10), ("create", 0.93)
1: Sentences are passed as a list of string. -> Top tokens: ("string", 2.72), ("pass", 2.24), ("sentences", 2.15), ("passed", 2.07), ("sentence", 1.90), ("strings", 1.86), ("list", 1.84), ("lists", 1.49), ("as", 1.18), ("passing", 0.73)
2: The quick brown fox jumps over the lazy dog. -> Top tokens: ("lazy", 2.18), ("fox", 1.67), ("brown", 1.56), ("over", 1.52), ("dog", 1.50), ("quick", 1.49), ("jump", 1.39), ("dogs", 1.25), ("foxes", 0.99), ("jumping", 0.84)
"""
This interpretability helps in understanding why certain documents match or don't match in search applications, and provides transparency into the model's behavior.
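Conceptually, the decoding step is simple: map each non-zero dimension of the sparse vector back to its vocabulary token and sort by weight. A simplified sketch with a toy vocabulary, not the library's implementation:

```python
def decode_sparse(weights, vocab, top_k=3):
    # Collect (token, weight) pairs for the non-zero dimensions
    pairs = [(vocab[i], w) for i, w in enumerate(weights) if w != 0.0]
    # Highest-weight tokens first, truncated to top_k
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:top_k]

vocab = ["the", "fox", "dog", "lazy", "jump"]
weights = [0.0, 1.67, 1.50, 2.18, 1.39]
print(decode_sparse(weights, vocab))  # [('lazy', 2.18), ('fox', 1.67), ('dog', 1.5)]
```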
Multi-Process / Multi-GPU Encoding
You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). This usually helps significantly for large datasets, but for small datasets the overhead of starting multiple processes can be considerable.
You can use SparseEncoder.encode() (or SparseEncoder.encode_query() or SparseEncoder.encode_document()) in combination with either of the following:
Via the device parameter, which can be set to e.g. "cuda:0" or "cpu" for single-process computation, or to a list of devices for multi-process or multi-GPU computation, e.g. ["cuda:0", "cuda:1"] or ["cpu", "cpu", "cpu", "cpu"]:
from sentence_transformers import SparseEncoder

def main():
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
    inputs = [
        "The weather is lovely today.",
        "It's so sunny outside!",
    ]

    # Encode with multiple GPUs
    embeddings = model.encode(
        inputs,
        device=["cuda:0", "cuda:1"],  # or ["cpu", "cpu", "cpu", "cpu"]
    )

if __name__ == "__main__":
    main()
Via the pool parameter, after calling SparseEncoder.start_multi_process_pool() with a list of devices, e.g. ["cuda:0", "cuda:1"] or ["cpu", "cpu", "cpu", "cpu"]. The benefit is that the pool can be reused across multiple calls to SparseEncoder.encode(), which is considerably more efficient than starting a new pool for each call:
from sentence_transformers import SparseEncoder

def main():
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
    inputs = [
        "The weather is lovely today.",
        "It's so sunny outside!",
    ]

    # Start a multi-process pool with multiple GPUs
    pool = model.start_multi_process_pool(target_devices=["cuda:0", "cuda:1"])

    # Encode with multiple GPUs
    embeddings = model.encode(inputs, pool=pool)

    # Don't forget to stop the pool after usage
    model.stop_multi_process_pool(pool)

if __name__ == "__main__":
    main()
Additionally, you can use the chunk_size parameter to control the size of the chunks sent to each process. This differs from the batch_size parameter. For example, with chunk_size=1000 and batch_size=32, the input texts are split into chunks of 1000 texts, each chunk is sent to a process and embedded in batches of 32 texts at a time. This can help with memory management and performance, especially for large datasets.
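The relationship between chunk_size and batch_size can be sketched in plain Python: inputs are first split into chunks (one chunk per worker call), and each worker then iterates over its chunk in batches. This illustrates only the scheduling, not the library's internals:

```python
def split(items, size):
    # Yield consecutive slices of at most `size` items
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = [f"text {i}" for i in range(2500)]
chunk_size, batch_size = 1000, 32

for chunk_id, chunk in enumerate(split(texts, chunk_size)):
    # Each chunk would be sent to one worker process...
    num_batches = sum(1 for _ in split(chunk, batch_size))
    # ...which embeds it in batches of `batch_size` texts
    print(f"chunk {chunk_id}: {len(chunk)} texts in {num_batches} batches")
# chunk 0: 1000 texts in 32 batches
# chunk 1: 1000 texts in 32 batches
# chunk 2: 500 texts in 16 batches
```

A larger chunk_size amortizes inter-process communication over more texts, while batch_size alone bounds how many texts are embedded on the device at once.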