用法
稀疏编码器模型的特点
计算稀疏向量表示,其中大多数维度为零
由于嵌入的稀疏性,为大规模检索系统提供效率优势
通常比密集嵌入更具可解释性,非零维度对应于特定标记
与密集嵌入互补,支持结合两种方法优势的混合搜索系统
安装 Sentence Transformers 后,您可以轻松使用稀疏编码器模型
from sentence_transformers import SparseEncoder
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# The sentences to encode
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions
# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 35.629, 9.154, 0.098],
# [ 9.154, 27.478, 0.019],
# [ 0.098, 0.019, 29.553]])
# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}") # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")