英伟达基于Mistral 7B开发新一代Embedding模型——NV-Embed-v2

我们介绍的 NV-Embed-v2 是一种通用嵌入模型,它在大规模文本嵌入基准(MTEB 基准)(截至 2024 年 8 月 30 日)的 56 项文本嵌入任务中以 72.31 的高分排名第一。此外,它还在检索子类别中排名第一(在 15 项任务中获得 62.65 分),这对 RAG 技术的发展至关重要。

NV-Embed-v2 采用了多项新设计,包括让 LLM 关注潜在向量,以获得更好的池化嵌入输出,并展示了一种两阶段指令调整方法,以提高检索和非检索任务的准确性。此外,NV-Embed-v2 还采用了一种新颖的硬阴性挖掘方法,该方法考虑了正相关性得分,能更好地去除假阴性。

有关更多技术细节,请参阅我们的论文: NV-Embed:将 LLM 训练为通用嵌入模型的改进技术。

型号详情

  • 仅用于解码器的基本 LLM:Mistral-7B-v0.1
  • 池类型: Latent-Attention
  • 嵌入尺寸: 4096

如何使用

所需软件包

如果遇到问题,请尝试安装以下 python 软件包

bash 复制代码
pip uninstall -y transformer-engine
pip install torch==2.2.0
pip install transformers==4.42.4
pip install flash-attn==2.2.0
pip install sentence-transformers==2.7.0

以下是如何使用 Huggingface-transformer 和 Sentence-transformer 对查询和段落进行编码的示例。

HuggingFace Transformers

python 复制代码
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Each query needs to be accompanied by an corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True)

# get the embeddings
max_length = 32768
query_embeddings = model.encode(queries, instruction=query_prefix, max_length=max_length)
passage_embeddings = model.encode(passages, instruction=passage_prefix, max_length=max_length)

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

# get the embeddings with DataLoader (spliting the datasets into multiple mini-batches)
# batch_size=2
# query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length, num_workers=32, return_numpy=True)
# passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length, num_workers=32, return_numpy=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())
# [[87.42693328857422, 0.46283677220344543], [0.965264618396759, 86.03721618652344]]

Sentence-Transformers

python 复制代码
import torch
from sentence_transformers import SentenceTransformer

# Each query needs to be accompanied by an corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = SentenceTransformer('nvidia/NV-Embed-v2', trust_remote_code=True)
model.max_seq_length = 32768
model.tokenizer.padding_side="right"

def add_eos(input_examples):
  input_examples = [input_example + model.tokenizer.eos_token for input_example in input_examples]
  return input_examples

# get the embeddings
batch_size = 2
query_embeddings = model.encode(add_eos(queries), batch_size=batch_size, prompt=query_prefix, normalize_embeddings=True)
passage_embeddings = model.encode(add_eos(passages), batch_size=batch_size, normalize_embeddings=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())

MTEB 基准的指令模板

对于检索、STS 和摘要的 MTEB 子任务,请使用 instructions.json 中的指令前缀模板。 对于分类、聚类和重排,请使用 NV-Embed 论文表 7 中提供的说明。 7 中提供的说明。

instructions.json

javascript 复制代码
{
    "ClimateFEVER":
            {
                "query": "Given a claim about climate change, retrieve documents that support or refute the claim",
                "corpus": ""
            },
    "HotpotQA":
        {
            "query": "Given a multi-hop question, retrieve documents that can help answer the question",
            "corpus": ""
        },
    "FEVER":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "MSMARCO":
        {
            "query": "Given a web search query, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "DBPedia":
        {
            "query": "Given a query, retrieve relevant entity descriptions from DBPedia",
            "corpus": ""
        },
    "NQ":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "QuoraRetrieval":
        {
            "query": "Given a question, retrieve questions that are semantically equivalent to the given question",
            "corpus": "Given a question, retrieve questions that are semantically equivalent to the given question"
        },
    "SCIDOCS":
        {
            "query": "Given a scientific paper title, retrieve paper abstracts that are cited by the given paper",
            "corpus": ""
        },
    "TRECCOVID":
        {
            "query": "Given a query on COVID-19, retrieve documents that answer the query",
            "corpus": ""
        },
    "Touche2020":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "SciFact":
        {
            "query": "Given a scientific claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "NFCorpus":
        {
            "query": "Given a question, retrieve relevant documents that answer the question",
            "corpus": ""
        },
    "ArguAna":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "FiQA2018":
        {
            "query": "Given a financial question, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "STS":
        {
            "text": "Retrieve semantically similar text"
        },
    "SUMM":
        {
            "text": "Given a news summary, retrieve other semantically similar summaries"
        }
}

如何启用多 GPU(注意,这是 HuggingFace Transformers的情况)

python 复制代码
from transformers import AutoModel
from torch.nn import DataParallel

embedding_model = AutoModel.from_pretrained("nvidia/NV-Embed-v2")
for module_key, module in embedding_model._modules.items():
    embedding_model._modules[module_key] = DataParallel(module)
相关推荐
WWZZ202520 分钟前
快速上手大模型:深度学习7(实践:卷积层)
人工智能·深度学习·算法·机器人·大模型·卷积神经网络·具身智能
简佐义的博客44 分钟前
Genome Biol. IF 9.4 Q1 | ATAC-seq 数据分析实用指南,根据本文就可以构建ATAC生信分析流程了
人工智能
老蒋新思维1 小时前
陈修超入局:解锁 AI 与 IP 融合的创新增长密码
网络·人工智能·网络协议·tcp/ip·企业管理·知识付费·创客匠人
San30.2 小时前
从代码规范到 AI Agent:现代前端开发的智能化演进
javascript·人工智能·代码规范
DO_Community2 小时前
基于AI Agent模板:快速生成 SQL 测试数据
人工智能·python·sql·ai·llm·ai编程
HeteroCat2 小时前
关于No Chatbot的思考
人工智能
咚咚王者2 小时前
人工智能之数据分析 numpy:第一章 学习链路
人工智能·数据分析·numpy
中杯可乐多加冰2 小时前
数据分析案例详解:基于smardaten实现智慧交通运营指标数据分析展示
人工智能·低代码·数据分析·交通物流·智慧交通·无代码·大屏端
算家计算2 小时前
对标ChatGPT!千问App正式上线:AI应用终局之战正在打响
人工智能·资讯