NVIDIA builds a new-generation embedding model on Mistral 7B: NV-Embed-v2

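We present NV-Embed-v2, a generalist embedding model that ranks No. 1 on the Massive Text Embedding Benchmark (MTEB) as of August 30, 2024, with a score of 72.31 across 56 text embedding tasks. It also ranks No. 1 in the retrieval sub-category (a score of 62.65 across 15 tasks), which is essential to the development of RAG technology.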

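NV-Embed-v2 introduces several new designs, including having the LLM attend to latent vectors to obtain better pooled embedding output, and demonstrates a two-stage instruction-tuning method that improves accuracy on both retrieval and non-retrieval tasks. In addition, NV-Embed-v2 incorporates a novel hard-negative mining method that takes the positive relevance score into account to better remove false negatives.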
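
The positive-aware filtering idea can be sketched as follows. This is a minimal illustration, not NVIDIA's implementation: the function name `mine_hard_negatives`, the 0.95 ratio, and the toy scores are all assumptions chosen for the example. Candidates whose relevance score comes too close to (or exceeds) the score of the positive passage are treated as likely false negatives and dropped; the highest-scoring remaining candidates are kept as hard negatives.

```python
from typing import Dict, List


def mine_hard_negatives(
    positive_score: float,
    candidate_scores: Dict[str, float],
    max_ratio: float = 0.95,  # assumed threshold, not taken from the source
    top_k: int = 2,
) -> List[str]:
    """Return up to top_k candidate ids to use as hard negatives.

    A candidate is kept only if its score is below max_ratio * positive_score,
    so passages that score about as high as the positive (likely false
    negatives) are discarded.
    """
    ceiling = positive_score * max_ratio
    kept = [(pid, s) for pid, s in candidate_scores.items() if s < ceiling]
    # Hardest negatives first: highest remaining score.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in kept[:top_k]]


# Toy relevance scores for one query (hypothetical values).
candidates = {"p1": 0.91, "p2": 0.78, "p3": 0.40, "p4": 0.83}
hard_negatives = mine_hard_negatives(positive_score=0.90, candidate_scores=candidates)
print(hard_negatives)  # ['p4', 'p2']
```

Here `p1` is discarded because it scores higher than the positive, which suggests it is actually relevant rather than a true negative.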

For more technical details, see our paper: NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.

Model details

  • Base decoder-only LLM: Mistral-7B-v0.1
  • Pooling type: Latent-Attention
  • Embedding dimension: 4096
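
To give a rough idea of what Latent-Attention pooling does, here is a toy sketch in plain Python. It assumes a simplified single-head form in which token hidden states act as queries over a small set of latent vectors (used as both keys and values), followed by mean pooling over tokens. It leaves out the MLP and multi-head details described in the paper, and every name and number below is illustrative rather than the model's actual code.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def latent_attention_pool(hidden_states: List[Vector], latents: List[Vector]) -> Vector:
    """Cross-attend each token to the latents, then mean-pool over tokens."""
    dim = len(latents[0])
    attended = []
    for h in hidden_states:
        # Attention weights of this token over the latent vectors.
        logits = [sum(a * b for a, b in zip(h, z)) / math.sqrt(dim) for z in latents]
        weights = softmax(logits)
        # Weighted sum of the latent vectors (latents act as keys and values).
        attended.append([sum(w * z[j] for w, z in zip(weights, latents)) for j in range(dim)])
    # Mean pooling over the token axis gives one embedding per sequence.
    n = len(attended)
    return [sum(tok[j] for tok in attended) / n for j in range(dim)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, hidden size 2 (toy values)
latents = [[1.0, 0.0], [0.0, 1.0]]             # 2 latent vectors (toy values)
embedding = latent_attention_pool(tokens, latents)
print(len(embedding))  # 2
```

In the real model the pooled vector has 4096 dimensions, and the latent vectors are learned during training rather than fixed by hand.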

How to use

Required packages

If you run into issues, try installing the following Python packages:

bash
pip uninstall -y transformer-engine
pip install torch==2.2.0
pip install transformers==4.42.4
pip install flash-attn==2.2.0
pip install sentence-transformers==2.7.0
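
If you are unsure which versions are currently installed, a small check like the one below can help. It is a convenience sketch rather than part of the official instructions: the `PINNED` dictionary simply mirrors the versions listed above, and the helper names `installed_version` and `check_versions` are made up for this example.

```python
from importlib import metadata
from typing import Dict, Optional

PINNED = {
    "torch": "2.2.0",
    "transformers": "4.42.4",
    "flash-attn": "2.2.0",
    "sentence-transformers": "2.7.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(pinned: Dict[str, str]) -> Dict[str, str]:
    """Map each package to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for package, wanted in pinned.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif found.split("+")[0] == wanted:  # ignore local tags such as +cu121
            report[package] = "ok"
        else:
            report[package] = f"mismatch (found {found})"
    return report


for package, status in check_versions(PINNED).items():
    print(f"{package}: {status}")
```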

The examples below show how to encode queries and passages with HuggingFace Transformers and with Sentence-Transformers.

HuggingFace Transformers

python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True)

# get the embeddings
max_length = 32768
query_embeddings = model.encode(queries, instruction=query_prefix, max_length=max_length)
passage_embeddings = model.encode(passages, instruction=passage_prefix, max_length=max_length)

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

# get the embeddings with DataLoader (splitting the datasets into multiple mini-batches)
# batch_size=2
# query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length, num_workers=32, return_numpy=True)
# passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length, num_workers=32, return_numpy=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())
# [[87.42693328857422, 0.46283677220344543], [0.965264618396759, 86.03721618652344]]
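
The score matrix above is the cosine similarity between each query and each passage, scaled by 100: after L2 normalization, a dot product equals the cosine. The dependency-free sketch below reproduces that computation on tiny made-up vectors (the real embeddings have 4096 dimensions) and then picks the best passage per query. The helper names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def score_matrix(queries: List[Vector], passages: List[Vector]) -> List[List[float]]:
    """Cosine similarity * 100 for every (query, passage) pair."""
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in passages]
    return [[100 * sum(a * b for a, b in zip(qv, pv)) for pv in p] for qv in q]


def best_passage(scores: List[List[float]]) -> List[int]:
    """Index of the highest-scoring passage for each query."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy 3-dimensional "embeddings" (hypothetical values).
toy_queries = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
toy_passages = [[6.0, 8.0, 0.0], [0.0, 1.0, 1.0]]
toy_scores = score_matrix(toy_queries, toy_passages)
print(best_passage(toy_scores))  # [0, 1]

# The same ranking step applied to the scores printed in the example above.
reported = [[87.42693328857422, 0.46283677220344543],
            [0.965264618396759, 86.03721618652344]]
print(best_passage(reported))  # [0, 1]
```

Each query matches the passage written for it by a wide margin, which is why the diagonal entries of the reported matrix are close to 87 while the off-diagonal entries are below 1.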

Sentence-Transformers

python
import torch
from sentence_transformers import SentenceTransformer

# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = SentenceTransformer('nvidia/NV-Embed-v2', trust_remote_code=True)
model.max_seq_length = 32768
model.tokenizer.padding_side="right"

def add_eos(input_examples):
  input_examples = [input_example + model.tokenizer.eos_token for input_example in input_examples]
  return input_examples

# get the embeddings
batch_size = 2
query_embeddings = model.encode(add_eos(queries), batch_size=batch_size, prompt=query_prefix, normalize_embeddings=True)
passage_embeddings = model.encode(add_eos(passages), batch_size=batch_size, normalize_embeddings=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())

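Instruction templates for MTEB benchmarks

For the MTEB sub-tasks of retrieval, STS, and summarization, use the instruction prefix templates in instructions.json. For classification, clustering, and reranking, use the instructions provided in Table 7 of the NV-Embed paper.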

instructions.json

json
{
    "ClimateFEVER":
        {
            "query": "Given a claim about climate change, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "HotpotQA":
        {
            "query": "Given a multi-hop question, retrieve documents that can help answer the question",
            "corpus": ""
        },
    "FEVER":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "MSMARCO":
        {
            "query": "Given a web search query, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "DBPedia":
        {
            "query": "Given a query, retrieve relevant entity descriptions from DBPedia",
            "corpus": ""
        },
    "NQ":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "QuoraRetrieval":
        {
            "query": "Given a question, retrieve questions that are semantically equivalent to the given question",
            "corpus": "Given a question, retrieve questions that are semantically equivalent to the given question"
        },
    "SCIDOCS":
        {
            "query": "Given a scientific paper title, retrieve paper abstracts that are cited by the given paper",
            "corpus": ""
        },
    "TRECCOVID":
        {
            "query": "Given a query on COVID-19, retrieve documents that answer the query",
            "corpus": ""
        },
    "Touche2020":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "SciFact":
        {
            "query": "Given a scientific claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "NFCorpus":
        {
            "query": "Given a question, retrieve relevant documents that answer the question",
            "corpus": ""
        },
    "ArguAna":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "FiQA2018":
        {
            "query": "Given a financial question, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "STS":
        {
            "text": "Retrieve semantically similar text"
        },
    "SUMM":
        {
            "text": "Given a news summary, retrieve other semantically similar summaries"
        }
}
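
One way to turn these templates into the prefixes used in the encoding examples is sketched below. The `Instruct: ...\nQuery: ` format comes from the example code; treating an empty instruction as an empty prefix, applying the same format to a non-empty `corpus` instruction, and the helper name `build_prefix` are assumptions of this sketch. A small inline excerpt stands in for the full file.

```python
import json

# Excerpt of instructions.json, inlined so the example is self-contained.
INSTRUCTIONS_JSON = """
{
    "NQ": {
        "query": "Given a question, retrieve passages that answer the question",
        "corpus": ""
    },
    "QuoraRetrieval": {
        "query": "Given a question, retrieve questions that are semantically equivalent to the given question",
        "corpus": "Given a question, retrieve questions that are semantically equivalent to the given question"
    },
    "STS": {
        "text": "Retrieve semantically similar text"
    }
}
"""


def build_prefix(instructions: dict, task: str, role: str) -> str:
    """Return the instruction prefix for a task and role ('query', 'corpus', 'text')."""
    instruction = instructions[task].get(role, "")
    if not instruction:
        return ""  # no instruction: encode the text as-is
    return "Instruct: " + instruction + "\nQuery: "


instructions = json.loads(INSTRUCTIONS_JSON)
print(repr(build_prefix(instructions, "NQ", "query")))
print(repr(build_prefix(instructions, "NQ", "corpus")))  # ''
```

Note that QuoraRetrieval is the only retrieval task listed whose `corpus` side also carries an instruction, since both sides of that task are questions.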

How to enable multi-GPU (note: this applies to the HuggingFace Transformers case)

python
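from transformers import AutoModel
from torch.nn import DataParallel

# trust_remote_code=True is passed here as in the loading examples above, since the model ships custom code
embedding_model = AutoModel.from_pretrained("nvidia/NV-Embed-v2", trust_remote_code=True)
# Wrap each top-level submodule in DataParallel so its input batch is split across the available GPUs
for module_key, module in embedding_model._modules.items():
    embedding_model._modules[module_key] = DataParallel(module)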