Quickstart
Sentence Transformer
Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:
Given a text or image, they compute a fixed-size vector representation (embedding).
Computing embeddings is usually efficient, and computing embedding similarities is very fast.
Applicable to a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
Often used as the first step in a two-step retrieval process, where a Cross Encoder (a.k.a. reranker) model reranks the top-k results of the bi-encoder.
Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:
from sentence_transformers import SentenceTransformer
# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
With SentenceTransformer("all-MiniLM-L6-v2") we pick which Sentence Transformer model to load. In this example, we load all-MiniLM-L6-v2, a MiniLM model fine-tuned on a large dataset of over 1 billion training pairs. Using SentenceTransformer.similarity(), we compute the similarity between all pairs of sentences. As expected, the similarity between the first two sentences (0.6660) is higher than the similarity between the first and the third sentence (0.1046) or the second and the third sentence (0.1411).
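By default, model.similarity() uses the similarity function the model was configured with (cosine similarity for all-MiniLM-L6-v2). As a brief sketch, the function can also be chosen at load time via the similarity_fn_name argument; the "dot" value used here assumes the library's standard similarity function names ("cosine", "dot", "euclidean", "manhattan"):
from sentence_transformers import SentenceTransformer
# Load the model with dot-product similarity instead of the default cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name="dot")
embeddings = model.encode(["The weather is lovely today.", "It's so sunny outside!"])
# model.similarity() now returns dot-product scores
print(model.similarity(embeddings, embeddings))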
Fine-tuning Sentence Transformer models is easy and requires only a few lines of code. For more information, see the Training Overview section.
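To give a feel for what those few lines look like, here is a minimal training sketch (hedged: it assumes the sentence-transformers/all-nli dataset from the Hugging Face Hub and default training arguments; see the Training Overview for the full recipe):
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
# 1. Load a model to fine-tune
model = SentenceTransformer("all-MiniLM-L6-v2")
# 2. Load a dataset of (anchor, positive) pairs; a small slice keeps the sketch fast
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train[:1000]")
# 3. Choose a loss function that matches the dataset format
loss = MultipleNegativesRankingLoss(model)
# 4. Train with default arguments
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()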
Tip
Read Sentence Transformer > Usage > Speeding up Inference for tips on how to make the model's inference 2-3x faster.
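For instance, two of the approaches described there look roughly like this (a sketch, assuming a recent library version; the ONNX backend additionally requires pip install sentence-transformers[onnx]):
from sentence_transformers import SentenceTransformer
# Option A: run the model through the ONNX backend
model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
# Option B: load the PyTorch model in float16 instead (assumes a CUDA GPU):
# import torch
# model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda", model_kwargs={"torch_dtype": torch.float16})
embeddings = model.encode(["The weather is lovely today."])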
Cross Encoder
Characteristics of Cross Encoder (a.k.a. reranker) models:
Given a pair of texts, they compute a similarity score.
Generally provide superior performance compared to Sentence Transformer (a.k.a. bi-encoder) models.
Often slower than a Sentence Transformer model, as they require computation for each pair of texts rather than for each individual text.
Due to the previous two characteristics, Cross Encoders are often used to rerank the top-k results of a Sentence Transformer model (a combined sketch follows this section's example).
The usage of Cross Encoder (a.k.a. reranker) models is similar to that of Sentence Transformers:
from sentence_transformers.cross_encoder import CrossEncoder
# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")
# We want to compute the similarity between the query sentence...
query = "A man is eating pasta."
# ... and all sentences in the corpus
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
# 2. We rank all sentences in the corpus for the query
ranks = model.rank(query, corpus)
# Print the scores
print("Query: ", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
"""
Query: A man is eating pasta.
0.67 A man is eating food.
0.34 A man is eating a piece of bread.
0.08 A man is riding a horse.
0.07 A man is riding a white horse on an enclosed ground.
0.01 The girl is carrying a baby.
0.01 Two men pushed carts through the woods.
0.01 A monkey is playing drums.
0.01 A woman is playing violin.
0.01 A cheetah is running behind its prey.
"""
# 3. Alternatively, you can also manually compute the score between two sentences
import numpy as np
sentence_combinations = [[query, sentence] for sentence in corpus]
scores = model.predict(sentence_combinations)
# Sort the scores in decreasing order to get the corpus indices
ranked_indices = np.argsort(scores)[::-1]
print("Scores:", scores)
print("Indices:", ranked_indices)
"""
Scores: [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378, 0.00536814, 0.06676237, 0.00534825, 0.00516717]
Indices: [0 1 3 6 2 5 7 4 8]
"""
With CrossEncoder("cross-encoder/stsb-distilroberta-base") we pick which Cross Encoder model to load. In this example, we load cross-encoder/stsb-distilroberta-base, a DistilRoBERTa model fine-tuned on the STS Benchmark dataset.
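To make the two-step retrieval process mentioned above concrete, here is a minimal retrieve-then-rerank sketch combining both model types from this Quickstart (the choice of top-4 candidates is arbitrary, for illustration only):
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base")
query = "A man is eating pasta."
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
]
# Step 1: fast bi-encoder retrieval of the top-k candidates
corpus_embeddings = bi_encoder.encode(corpus)
query_embedding = bi_encoder.encode([query])
similarities = bi_encoder.similarity(query_embedding, corpus_embeddings)[0]
top_k_indices = similarities.argsort(descending=True)[:4]
# Step 2: slower, more accurate cross-encoder reranking of just those candidates
pairs = [[query, corpus[int(i)]] for i in top_k_indices]
scores = cross_encoder.predict(pairs)
for score, i in sorted(zip(scores, top_k_indices), key=lambda x: x[0], reverse=True):
    print(f"{score:.2f}\t{corpus[int(i)]}")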
Sparse Encoder
Characteristics of Sparse Encoder models:
They compute sparse vector representations, where most dimensions are zero.
Offer efficiency benefits for large-scale retrieval systems, thanks to the sparsity of the embeddings.
Often more interpretable than dense embeddings, since non-zero dimensions correspond to specific tokens.
Complementary to dense embeddings, enabling hybrid search systems that combine the strengths of both approaches.
The usage of Sparse Encoder models is similar to that of Sentence Transformers:
from sentence_transformers import SparseEncoder
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions
# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[35.629,  9.154,  0.098],
#         [ 9.154, 27.478,  0.019],
#         [ 0.098,  0.019, 29.553]])
# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}") # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")
With SparseEncoder("naver/splade-cocondenser-ensembledistil") we load a pretrained SPLADE model that produces sparse embeddings. SPLADE (SParse Lexical AnD Expansion) models use an MLM (masked language modeling) head to create sparse representations, making them particularly effective for information retrieval tasks.
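Because the non-zero dimensions correspond to vocabulary tokens, the embeddings can be inspected directly. A short sketch, assuming the decode method behaves as in recent library versions (returning (token, weight) pairs):
from sentence_transformers import SparseEncoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
embeddings = model.encode(["The weather is lovely today."])
# Map the non-zero dimensions of the first embedding back to vocabulary tokens
decoded = model.decode(embeddings[0], top_k=10)
print(decoded)  # list of (token, weight) pairs, strongest tokens first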
Next Steps
Consider reading one of the following sections next: