How to Create Vector Embeddings in Node.js

Original article: How to Create Vector Embeddings in Node.js

Author: Phil Nash

Translator: 倔强青铜三

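Preface

Hi everyone, I'm 倔强青铜三, a software engineer who enjoys sharing IT knowledge and encouraging technical exchange and innovation. You're welcome to follow my WeChat official account: 倔强青铜三. Likes, bookmarks, and follows are all appreciated!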

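When you build a retrieval-augmented generation (RAG) application, the first job is preparing your data. You need to split your unstructured data into chunks, turn those chunks into vector embeddings, and finally store the embeddings in a vector database.

There are several ways to create vector embeddings in JavaScript. In this post, we'll look at four ways to generate vector embeddings in Node.js: locally, via an API, via a framework, and with Astra DB's Vectorize.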

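Local vector embeddings

There are many open-source models available on HuggingFace that can be used to create vector embeddings. Transformers.js is a module that lets you use machine learning models in JavaScript, both in the browser and in Node.js. It uses the ONNX runtime to do so, which means it works with models that have published ONNX weights, and plenty of those can be used to create vector embeddings.

You can install the module with: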

npm install @xenova/transformers

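The package can perform many tasks, but feature extraction is the one you need to generate vector embeddings.

A popular local model for vector embeddings is all-MiniLM-L6-v2. It was trained as a general-purpose model and produces a 384-dimensional vector from a chunk of text.

To use it, import the pipeline function from Transformers.js and create an extractor that performs "feature-extraction" using the model you provide. You can then pass a chunk of text to the extractor, and it returns a tensor object that you can convert into a plain JavaScript array.

All together, it looks like this: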

import { pipeline } from "@xenova/transformers";  
  
const extractor = await pipeline(  
  "feature-extraction",  
  "Xenova/all-MiniLM-L6-v2"  
);  
  
const response = await extractor(  
  ["A robot may not injure a human being or, through inaction, allow a human being to come to harm."],  
  { pooling: "mean", normalize: true }  
);  
  
console.log(Array.from(response.data));  
// => [-0.004044221248477697,  0.026746056973934174,   0.0071970801800489426, ... ]
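
The pooling: "mean" and normalize: true options deserve a word. As a rough sketch of the idea (not the Transformers.js implementation), mean pooling averages the per-token vectors into a single vector, and normalization scales that vector to unit length. The helper names meanPool and l2Normalize and the tiny three-dimensional inputs are made up for illustration, and real implementations typically also account for padding tokens.

```javascript
// Average a list of equal-length token vectors into a single vector.
function meanPool(tokenVectors) {
  const dims = tokenVectors[0].length;
  const sum = new Array(dims).fill(0);
  for (const vector of tokenVectors) {
    for (let i = 0; i < dims; i++) sum[i] += vector[i];
  }
  return sum.map((value) => value / tokenVectors.length);
}

// Scale a vector so that its Euclidean length is 1.
function l2Normalize(vector) {
  const length = Math.hypot(...vector);
  return length === 0 ? vector.slice() : vector.map((value) => value / length);
}

// Hypothetical per-token vectors; a real model would emit 384 dimensions each.
const tokens = [
  [1, 0, 2],
  [3, 4, 2],
];
const pooled = meanPool(tokens); // [2, 2, 2]
const embedding = l2Normalize(pooled);
console.log(embedding);
```

Unit-length vectors are convenient because the dot product of two of them equals their cosine similarity.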

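You can embed several texts at once by passing an array of texts to the extractor. You can then call the tolist method on the result, which returns a list of arrays, one vector per text.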

const response = await extractor(  
  [  
    "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",  
    "A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.",  
    "A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.",  
  ],  
  { pooling: "mean", normalize: true }  
);  
  
console.log(response.tolist());  
// [  
//   [ -0.006129210349172354,  0.016346964985132217,   0.009711502119898796, ...],  
//   [-0.053930871188640594,  -0.002175076398998499,   0.032391052693128586, ...],  
//   [-0.05358131229877472,  0.021030642092227936, 0.0010665050940588117, ...]  
// ]
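
When there are many texts to embed, one option is to send them to the extractor in batches rather than all at once. The following is a small sketch of that idea; toBatches and the batch size of 2 are illustrative choices, not part of Transformers.js.

```javascript
// Split an array into consecutive batches of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const laws = ["First Law", "Second Law", "Third Law"];
const batches = toBatches(laws, 2);
console.log(batches);
```

Each batch could then be passed to the extractor shown above, with the arrays returned by tolist concatenated in order.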

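There are many models you can use to create vector embeddings from text (huggingface.co/models?pipe...), and because you're running them locally, you can try them out to see which works best for your data. Pay attention to the length of text these models can handle. For example, the all-MiniLM-L6-v2 model does not give good results beyond 128 tokens and can handle at most 256 tokens, so it is suited to sentences or short paragraphs. If your source text is larger than that, you will need to split it into appropriately sized chunks.

Local embedding models like this are useful when you're experimenting on your own machine, or when you have the right hardware to run them efficiently in deployment. They're an easy way to get familiar with different models and see how things work without signing up for a range of API services.

That said, there are many useful vector embedding models available as APIs, so let's look at those next.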
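
As a minimal sketch of that splitting step, the function below chunks text by whitespace-separated words, with some overlap between neighboring chunks. Word counts are only a rough stand-in for the model's tokens (a tokenizer usually yields more tokens than words), and chunkByWords, maxWords, and overlap are hypothetical names and values chosen for illustration.

```javascript
// Split text into chunks of at most `maxWords` words, where each chunk
// repeats the last `overlap` words of the previous one.
function chunkByWords(text, maxWords = 100, overlap = 20) {
  if (maxWords < 1 || overlap < 0 || overlap >= maxWords) {
    throw new RangeError("require maxWords >= 1 and 0 <= overlap < maxWords");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = maxWords - overlap;
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
    // Stop once the current chunk has reached the end of the text.
    if (start + maxWords >= words.length) break;
  }
  return chunks;
}

const text =
  "A robot may not injure a human being or, through inaction, allow a human being to come to harm.";
const chunks = chunkByWords(text, 8, 2);
console.log(chunks);
```

The overlap is a design choice: repeating a few words at each boundary reduces the chance that a sentence split across two chunks loses its context in both.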

API

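Many services offer embedding models as an API. These include LLM providers such as OpenAI, Google, or Cohere, as well as specialist providers such as Voyage AI or Jina. Most providers have general-purpose embedding models, but some offer models trained on specific kinds of data, such as Voyage AI's models optimized for finance, law, and code.

These providers expose HTTP APIs, often with an npm package that makes them easy to call. You typically need to get an API key from the service, and then you can generate embeddings by sending your text to the API.

For example, you can use Google's text embedding models through the Gemini API like this: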

import { GoogleGenerativeAI } from "@google/generative-ai";

// Read the API key from the environment rather than hard-coding it.
const genAI = new GoogleGenerativeAI(process.env.API_KEY);
const model = genAI.getGenerativeModel({ model: "text-embedding-004" });
const text = "A robot may not injure a human being or, through inaction, allow a human being to come to harm.";

const result = await model.embedContent(text);
console.log(result.embedding.values);
// => [0.04574034, 0.038084425, -0.00916391, ...]
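
Provider packages are largely wrappers around HTTP requests. As an illustration, this sketch calls OpenAI's embeddings endpoint directly with fetch. It assumes Node.js 18 or later (for the built-in fetch), an OPENAI_API_KEY environment variable, and the text-embedding-3-small model; embedTexts and its fetchImpl parameter are names invented here so the request logic can be exercised without a network call.

```javascript
// Request embeddings for an array of texts and return one vector per text.
async function embedTexts(
  texts,
  { apiKey, model = "text-embedding-3-small", fetchImpl = fetch } = {}
) {
  const response = await fetchImpl("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, input: texts }),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // The response holds one object per input; sort by index to keep input order.
  return payload.data
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}

// Usage (this performs a real, billable API call):
// const vectors = await embedTexts(
//   ["A robot may not injure a human being."],
//   { apiKey: process.env.OPENAI_API_KEY }
// );
```

Injecting fetchImpl is only a testing convenience; in normal use the default global fetch is picked up automatically.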

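However, every API is different. While the request to create embeddings is usually fairly simple, you may have to learn a new approach for each API you want to call, unless, of course, you use one of the available frameworks intended to simplify this.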

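Frameworks

There are many projects, such as LangChain or LlamaIndex, that create abstractions over the various parts of the GenAI toolchain, including embeddings.

Both LangChain and LlamaIndex let you generate embeddings via APIs or local models, all through the same interface. For example, here's how to create the same embedding as above using the Gemini API together with LangChain: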

import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";

const embeddings = new GoogleGenerativeAIEmbeddings({
  apiKey: process.env.API_KEY,
  model: "text-embedding-004",
});
const text = "A robot may not injure a human being or, through inaction, allow a human being to come to harm.";

// embedQuery resolves to a plain array of numbers.
const embedding = await embeddings.embedQuery(text);
console.log(embedding);
// => [0.04574034, 0.038084425, -0.00916391, ...]

For comparison, here's how to use an OpenAI embedding model through LangChain:

import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.API_KEY,
  model: "text-embedding-3-large",
});
const text = "A robot may not injure a human being or, through inaction, allow a human being to come to harm.";

// Same call as before; only the class and the model name changed.
const embedding = await embeddings.embedQuery(text);
console.log(embedding);
// => [0.009445431, -0.0073068426, -0.00814802, ...]

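Apart from the name of the import and, sometimes, the options, all the embedding models share a consistent interface, which makes them easy to swap.

If you're using LangChain to build your entire pipeline, these embedding interfaces work well with the vector database interfaces. You can hand an embedding model to the database integration, and LangChain will take care of generating embeddings when you insert documents or perform vector searches. For example, here's how to use Google's embeddings and store documents in Astra DB via LangChain: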
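
To make the benefit of a shared interface concrete, here is a toy sketch. ToyEmbeddings is a made-up class that mimics the embedQuery and embedDocuments method names used by LangChain's embedding classes; its "embedding" is just a set of character counts, not a real model. The point is that calling code such as the hypothetical embedAll depends only on the interface, so a real embeddings object could be passed in instead.

```javascript
// A stand-in "model": counts vowels, consonants, and other characters.
class ToyEmbeddings {
  async embedQuery(text) {
    const counts = [0, 0, 0];
    for (const char of text.toLowerCase()) {
      if ("aeiou".includes(char)) counts[0] += 1;
      else if (char >= "a" && char <= "z") counts[1] += 1;
      else counts[2] += 1;
    }
    return counts;
  }

  async embedDocuments(texts) {
    return Promise.all(texts.map((text) => this.embedQuery(text)));
  }
}

// Works with any object exposing embedDocuments, whatever model sits behind it.
async function embedAll(embeddings, texts) {
  return embeddings.embedDocuments(texts);
}

embedAll(new ToyEmbeddings(), ["A robot", "must obey"]).then((vectors) => {
  console.log(vectors);
});
```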

import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";
import { AstraDBVectorStore } from "@langchain/community/vectorstores/astradb";

const embeddings = new GoogleGenerativeAIEmbeddings({
  apiKey: process.env.API_KEY,
  model: "text-embedding-004",
});

const vectorStore = await AstraDBVectorStore.fromDocuments(
  documents, // the list of document objects to put in the store
  embeddings, // the embedding model
  astraConfig, // the configuration for connecting to Astra DB
);

Once you've provided the embedding model to the database object, you can also use it to perform vector searches.

// similaritySearch returns a promise, so await it to get the matching documents.
const results = await vectorStore.similaritySearch("Are robots allowed to protect themselves?");

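LlamaIndex lets you create embedding models, and vector stores that use them, in a similar way. Take a look at the LlamaIndex documentation on RAG.

As an aside, the lists of models that LangChain and LlamaIndex integrate with are good examples of popular embedding models.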

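Directly in the database

So far, the methods above have mostly involved creating a vector embedding separately from storing it in a vector database. When you want to store those vectors in a vector database like Astra DB, it looks roughly like this: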

import { DataAPIClient } from "@datastax/astra-db-ts";

const client = new DataAPIClient(process.env.ASTRA_DB_APPLICATION_TOKEN);
const db = client.db(process.env.ASTRA_DB_API_ENDPOINT);
const collection = db.collection(process.env.ASTRA_DB_COLLECTION);

await collection.insertOne({
  text: "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
  // The vector is truncated here; supply the full embedding in real code.
  $vector: [0.04574034, 0.038084425, -0.00916391, ...]
});

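This assumes you've already created a vector-enabled collection with the correct number of dimensions for the model you're using.

You can also use a vector to search the documents in the collection, like this: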
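
A dimension mismatch is an easy mistake to make when switching models, so it can help to check a vector's length before inserting it. This is a small illustrative guard; assertDimensions is a hypothetical helper, and the expected size of 384 simply matches the all-MiniLM-L6-v2 model mentioned earlier.

```javascript
// Throw if `vector` is not an array of exactly `expected` finite numbers.
function assertDimensions(vector, expected) {
  if (!Array.isArray(vector) || vector.length !== expected) {
    const actual = Array.isArray(vector) ? vector.length : typeof vector;
    throw new RangeError(`Expected a vector of ${expected} dimensions, got ${actual}`);
  }
  if (!vector.every((value) => Number.isFinite(value))) {
    throw new TypeError("Vector must contain only finite numbers");
  }
  return vector;
}

// Placeholder vector standing in for a real embedding.
const vector = new Array(384).fill(0.01);
assertDimensions(vector, 384); // passes silently
```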

const cursor = collection.find({}, {
  sort: { $vector: [0.04574034, 0.038084425, -0.00916391, ...] },
  limit: 5,
});

const results = await cursor.toArray();
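
Conceptually, sorting by $vector ranks documents by how similar their stored vectors are to the query vector. The brute-force sketch below shows that idea in memory, using cosine similarity as the metric. It is not how Astra DB implements search, and cosineSimilarity, topK, and the two-dimensional sample vectors are invented for illustration.

```javascript
// Cosine similarity of two equal-length, non-zero vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k documents whose vectors are most similar to the query vector.
function topK(queryVector, documents, k) {
  return documents
    .map((doc) => ({ ...doc, similarity: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}

// Toy documents with made-up two-dimensional vectors.
const documents = [
  { text: "First Law", vector: [1, 0] },
  { text: "Second Law", vector: [0, 1] },
  { text: "Third Law", vector: [1, 1] },
];
const nearest = topK([0.9, 0.1], documents, 2);
console.log(nearest.map((doc) => doc.text));
```

Comparing the query against every document scales linearly with collection size, which is one reason vector databases rely on indexes instead.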

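In this case, you have to create your vectors first and then use them to store or search against the database. Even with a framework, this process still happens; it's just abstracted away.

With Astra DB, you can have the database generate the embeddings for you, either as you insert a document into a collection or as you perform a vector search against a collection.

This is called Astra DB vectorize; here's how it works.

First, set up an embedding provider integration. There's a built-in integration that offers the NVIDIA NV-Embed-QA model, or you can choose one of the other providers and configure it with your own API key.

Then, when you set up a collection, you can choose which embedding provider to use and set the correct number of dimensions.

Now, when you add a document to this collection, you can supply the content under the special key $vectorize, and a vector embedding will be created for it.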

await collection.insertOne({  
  $vectorize: "A robot may not injure a human being or, through inaction, allow a human being to come to harm."  
});

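When you want to perform a vector search against the collection, you can sort by the special field $vectorize, and Astra DB will handle creating the vector embedding and then performing the search.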

const cursor = collection.find({}, {
  sort: { $vectorize: "Are robots allowed to protect themselves?" },
  limit: 5,
});
const results = await cursor.toArray();

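This has several advantages:

It's robust, because Astra DB handles the interaction with the embedding provider
It can be quicker than making two separate API calls to create the embedding and then store it
It's less code for you to write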

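Choose the method that works best for your application

There are many models, providers, and methods you can use to turn text into vector embeddings. Creating vector embeddings from your content is a vital part of the RAG pipeline, and it does take some experimentation to get it right for your data.

You can choose to host your own models, call APIs, use a framework, or let Astra DB handle creating vector embeddings for you. And if you want to avoid writing code altogether, you could use Langflow's drag-and-drop interface to create your RAG pipeline.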
