NVIDIA Builds a New-Generation Embedding Model on Mistral 7B: NV-Embed-v2

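We present NV-Embed-v2, a generalist embedding model that ranks No. 1 on the Massive Text Embedding Benchmark (MTEB) as of August 30, 2024, with a score of 72.31 across 56 text embedding tasks. It also ranks No. 1 in the retrieval sub-category (a score of 62.65 across 15 tasks), which is essential to the development of RAG technology.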

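NV-Embed-v2 introduces several new designs. The LLM attends to latent vectors to produce a better pooled embedding output, and a two-stage instruction-tuning method improves accuracy on both retrieval and non-retrieval tasks. In addition, NV-Embed-v2 uses a novel hard-negative mining method that takes the positive relevance score into account so that false negatives are removed more reliably.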
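To make the idea of positive-aware hard-negative mining more concrete, here is a minimal sketch. It is not the authors' implementation: the function name `mine_hard_negatives`, the `max_ratio` threshold of 0.95, and the toy scores are all assumptions made for illustration. The sketch drops any candidate whose relevance score comes too close to the positive's score (a likely false negative) and keeps the highest-scoring survivors as hard negatives.

```python
from typing import List, Tuple


def mine_hard_negatives(
    positive_score: float,
    candidates: List[Tuple[str, float]],
    max_ratio: float = 0.95,  # hypothetical threshold, not taken from the source
    top_k: int = 2,
) -> List[str]:
    """Return up to top_k negatives scoring below max_ratio * positive_score."""
    ceiling = positive_score * max_ratio
    # Candidates at or above the ceiling look too relevant and are
    # treated as likely false negatives, so they are discarded.
    kept = [(doc, score) for doc, score in candidates if score < ceiling]
    # Among the survivors, the highest-scoring ones are the hardest negatives.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:top_k]]


# Toy relevance scores (illustrative values only).
candidates = [("doc_a", 0.91), ("doc_b", 0.70), ("doc_c", 0.55), ("doc_d", 0.20)]
hard_negatives = mine_hard_negatives(positive_score=0.90, candidates=candidates)
print(hard_negatives)  # ['doc_b', 'doc_c']
```

Here `doc_a` scores higher than the labelled positive, so it is filtered out rather than used as a negative.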

For more technical details, refer to our paper: NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.

Model Details

  • Base decoder-only LLM: Mistral-7B-v0.1
  • Pooling type: Latent-Attention
  • Embedding dimension: 4096
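As a rough illustration of what latent-attention pooling means, the sketch below uses NumPy with tiny dimensions. It is a simplified, single-head version based on our reading of the paper, not the model's actual code: the token hidden states act as queries, a trainable latent array acts as both keys and values, and the attended outputs are mean-pooled into one vector. The function name, the random inputs, and the scaled dot-product are assumptions, and the MLP that the paper places after the attention step is omitted.

```python
import numpy as np


def latent_attention_pool(hidden_states: np.ndarray, latents: np.ndarray) -> np.ndarray:
    """Pool token states into one vector by attending to a latent array.

    hidden_states: (seq_len, d) token states, used as queries.
    latents:       (num_latents, d) latent array, used as keys and values.
    """
    d = hidden_states.shape[-1]
    # Scaled dot-product attention logits (the scaling is an assumption here).
    logits = hidden_states @ latents.T / np.sqrt(d)   # (seq_len, num_latents)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    attended = weights @ latents                      # (seq_len, d)
    # Mean pooling over the sequence gives the final embedding.
    return attended.mean(axis=0)                      # (d,)


rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 8))    # 5 tokens, hidden size 8 (toy sizes)
latents = rng.standard_normal((4, 8))   # 4 latent vectors (toy size)
emb = latent_attention_pool(hidden, latents)
print(emb.shape)  # (8,)
```

In the real model the hidden size is 4096, which is why the output embedding has 4096 dimensions.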

How to Use

Required Packages

If you run into issues, try installing the following Python packages:

pip uninstall -y transformer-engine
pip install torch==2.2.0
pip install transformers==4.42.4
pip install flash-attn==2.2.0
pip install sentence-transformers==2.7.0
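
Before debugging further, it can help to confirm which versions are actually installed. The following is a small sketch using only the standard library; the helper name `check_versions` is hypothetical, and the pins are simply the versions listed above. It strips local build suffixes such as `+cu121` before comparing, since PyTorch wheels often carry them.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip commands above.
PINNED = {
    "torch": "2.2.0",
    "transformers": "4.42.4",
    "flash-attn": "2.2.0",
    "sentence-transformers": "2.7.0",
}


def check_versions(pinned):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu121" when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_versions(PINNED).items():
    status = "ok" if ok else "mismatch"
    print(f"{name}: installed={installed} pinned={PINNED[name]} -> {status}")
```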

Below are examples of how to encode queries and passages with HuggingFace Transformers and Sentence-Transformers.

HuggingFace Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True)

# get the embeddings
max_length = 32768
query_embeddings = model.encode(queries, instruction=query_prefix, max_length=max_length)
passage_embeddings = model.encode(passages, instruction=passage_prefix, max_length=max_length)

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

# get the embeddings with DataLoader (splitting the datasets into multiple mini-batches)
# batch_size=2
# query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length, num_workers=32, return_numpy=True)
# passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length, num_workers=32, return_numpy=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())
# [[87.42693328857422, 0.46283677220344543], [0.965264618396759, 86.03721618652344]]
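
Because the embeddings are L2-normalized, each entry of the score matrix is a cosine similarity scaled by 100, with one row per query and one column per passage. The short sketch below shows one way to turn such a matrix into a best-passage index per query. It reuses the scores printed above as plain Python lists, and `top_passage` is a hypothetical helper, not part of the model's API.

```python
# Scores copied from the example output above (rows: queries, columns: passages).
scores = [
    [87.42693328857422, 0.46283677220344543],
    [0.965264618396759, 86.03721618652344],
]


def top_passage(score_rows):
    """Return, for each query row, the index of the highest-scoring passage."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


best = top_passage(scores)
print(best)  # [0, 1]
```

Each query is matched to its own passage: the judo question to passage 0 and the radiology question to passage 1.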

Sentence-Transformers

import torch
from sentence_transformers import SentenceTransformer

# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = SentenceTransformer('nvidia/NV-Embed-v2', trust_remote_code=True)
model.max_seq_length = 32768
model.tokenizer.padding_side="right"

def add_eos(input_examples):
  input_examples = [input_example + model.tokenizer.eos_token for input_example in input_examples]
  return input_examples

# get the embeddings
batch_size = 2
query_embeddings = model.encode(add_eos(queries), batch_size=batch_size, prompt=query_prefix, normalize_embeddings=True)
passage_embeddings = model.encode(add_eos(passages), batch_size=batch_size, normalize_embeddings=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())

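Instruction Templates for MTEB Benchmarks

For the MTEB retrieval, STS, and summarization sub-tasks, use the instruction prefix templates in instructions.json. For classification, clustering, and reranking, use the instructions provided in Table 7 of the NV-Embed paper.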

instructions.json

{
    "ClimateFEVER":
        {
            "query": "Given a claim about climate change, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "HotpotQA":
        {
            "query": "Given a multi-hop question, retrieve documents that can help answer the question",
            "corpus": ""
        },
    "FEVER":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "MSMARCO":
        {
            "query": "Given a web search query, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "DBPedia":
        {
            "query": "Given a query, retrieve relevant entity descriptions from DBPedia",
            "corpus": ""
        },
    "NQ":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "QuoraRetrieval":
        {
            "query": "Given a question, retrieve questions that are semantically equivalent to the given question",
            "corpus": "Given a question, retrieve questions that are semantically equivalent to the given question"
        },
    "SCIDOCS":
        {
            "query": "Given a scientific paper title, retrieve paper abstracts that are cited by the given paper",
            "corpus": ""
        },
    "TRECCOVID":
        {
            "query": "Given a query on COVID-19, retrieve documents that answer the query",
            "corpus": ""
        },
    "Touche2020":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "SciFact":
        {
            "query": "Given a scientific claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "NFCorpus":
        {
            "query": "Given a question, retrieve relevant documents that answer the question",
            "corpus": ""
        },
    "ArguAna":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "FiQA2018":
        {
            "query": "Given a financial question, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "STS":
        {
            "text": "Retrieve semantically similar text"
        },
    "SUMM":
        {
            "text": "Given a news summary, retrieve other semantically similar summaries"
        }
}
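
One possible way to turn these templates into the prefix strings used in the encoding examples is sketched below. The helper `build_prefix` is hypothetical, and the JSON is inlined (three entries only) so the snippet runs on its own. Applying the `"Instruct: ...\nQuery: "` format to non-empty `corpus` entries and to the `text` entries of STS and SUMM is an assumption, not something the source states.

```python
import json

# A subset of instructions.json, inlined to keep the sketch self-contained.
INSTRUCTIONS_JSON = """
{
    "NQ": {
        "query": "Given a question, retrieve passages that answer the question",
        "corpus": ""
    },
    "QuoraRetrieval": {
        "query": "Given a question, retrieve questions that are semantically equivalent to the given question",
        "corpus": "Given a question, retrieve questions that are semantically equivalent to the given question"
    },
    "STS": {
        "text": "Retrieve semantically similar text"
    }
}
"""


def build_prefix(instructions, task, role="query"):
    """Build the prefix for a task; role is "query" or "corpus"."""
    entry = instructions[task]
    # STS and SUMM entries carry a single "text" instruction instead.
    instruction = entry.get(role, entry.get("text", ""))
    if not instruction:
        return ""  # empty instruction means no prefix
    return "Instruct: " + instruction + "\nQuery: "


instructions = json.loads(INSTRUCTIONS_JSON)
print(repr(build_prefix(instructions, "NQ", "query")))
print(repr(build_prefix(instructions, "NQ", "corpus")))
```

The resulting string can be passed as `instruction=` (Transformers example) or `prompt=` (Sentence-Transformers example).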

How to Enable Multi-GPU (note: this applies to the HuggingFace Transformers usage)

from transformers import AutoModel
from torch.nn import DataParallel

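# trust_remote_code=True lets transformers load the model's custom code, as in the encoding example above
embedding_model = AutoModel.from_pretrained("nvidia/NV-Embed-v2", trust_remote_code=True)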
for module_key, module in embedding_model._modules.items():
    embedding_model._modules[module_key] = DataParallel(module)