[NLP and Large Models] Embedding Models and Vector Databases in LlamaIndex

(1) Definition and Role of Embedding Models

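An embedding model converts high-dimensional, sparse data into low-dimensional, dense vectors, so that the vectors express the semantic information of the original data in a mathematical space. Its main roles are: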

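Dimensionality reduction: an embedding model maps text, images, or other data into a continuous vector space, usually with far fewer dimensions than the raw representation. For example, a passage drawing on a large vocabulary can be represented as a single fixed-length vector.
Capturing semantic relationships: in the resulting vector space, similar or related concepts lie close together, while unrelated or dissimilar ones lie farther apart. The embeddings therefore capture not only the meaning of individual words or data points but also the semantic relationships between them. Similarity between embedding vectors can be computed in several ways, such as the dot product or cosine similarity. LlamaIndex uses cosine similarity by default when comparing embeddings.
Supporting downstream tasks: these vector representations can feed many NLP and machine learning tasks, such as text classification, sentiment analysis, question answering, and recommendation. Because the embeddings already encode much of the semantics, downstream models can work with numeric vectors instead of handling the complexity of raw human language directly.
Context-aware representations: more advanced embedding models (such as BERT and its variants) consider not only a word's own meaning but also its meaning in the surrounding context. This greatly improves the handling of polysemous words and captures sentence-level semantics better.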
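To make the difference between the dot product and cosine similarity concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and their names (cat, kitten, car) are made up for illustration; real embedding models output hundreds of dimensions, and this is not how LlamaIndex implements the comparison internally.

```python
import math

def dot(a, b):
    """Dot product: grows with both the alignment and the length of the vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity: the dot product after normalizing both vectors to length 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy "embeddings" (assumed values, not produced by any model)
cat = [0.9, 0.1, 0.0]
kitten = [1.8, 0.2, 0.0]  # same direction as cat, twice the length
car = [0.0, 0.2, 0.9]     # points in a mostly different direction

print("cat vs kitten:", dot(cat, kitten), cosine_similarity(cat, kitten))
print("cat vs car:   ", dot(cat, car), cosine_similarity(cat, car))
```

Cosine similarity ignores vector length, so cat and kitten count as maximally similar even though their magnitudes differ, whereas the dot product is also affected by magnitude.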

The following introduces HuggingFaceEmbedding, one of the most commonly used embedding classes in LlamaIndex. Install the library before using it:

pip install llama-index-embeddings-huggingface

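# Import the HuggingFaceEmbedding class, used to load a local embedding model
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Initialize a HuggingFaceEmbedding instance
# model_name points to a Chinese BGE model (v1.5) already downloaded locally
# The model converts text into a vector representation (an embedding)
embed_model = HuggingFaceEmbedding(
    model_name="/root/workspace/llm_models/bge_small_zh_v1.5"
)

# Encode the text "Hello World!" with the embedding model to get its vector
# Note: although this is a Chinese model, it can also encode English text
embeddings = embed_model.get_text_embedding("Hello World!")

# Print the vector length (dimension); bge-small-zh-v1.5 should give 512,
# while other BGE variants output 384, 768, or 1024 dimensions
print(len(embeddings))

# Print the first 5 values to inspect part of the vector
print(embeddings[:5])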

(2) The ChromaDB Vector Database

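Chroma (ChromaDB) is an open-source vector database; LlamaIndex integrates with it through the ChromaVectorStore class. Text is converted to vectors by an embedding model, and the vectors are stored in the database so that similarity search is efficient. ChromaDB is particularly well suited to scenarios that need fast retrieval of the documents most similar to a query.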
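As a rough illustration of what "store vectors, then search by similarity" means, the sketch below implements a tiny in-memory store in plain Python. The class ToyVectorStore, its methods, and the two-dimensional vectors are all hypothetical and exist only for this example; they are not part of ChromaDB or LlamaIndex.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class ToyVectorStore:
    """Hypothetical in-memory store that keeps (id, vector, text) rows."""

    def __init__(self):
        self._rows = []

    def add(self, doc_id, vector, text):
        self._rows.append((doc_id, vector, text))

    def query(self, vector, top_k=1):
        # Brute force: score every stored vector, then keep the best top_k
        scored = [(_cosine(vector, v), doc_id, text) for doc_id, v, text in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]

store = ToyVectorStore()
store.add("doc-1", [1.0, 0.0], "about cats")
store.add("doc-2", [0.0, 1.0], "about cars")
store.add("doc-3", [0.7, 0.7], "cats in cars")

# Assumed query vector, pointing mostly toward the "cats" direction
results = store.query([0.9, 0.1], top_k=2)
for score, doc_id, text in results:
    print(doc_id, round(score, 3), text)
```

This toy version scans every row on each query. A real vector database typically avoids that linear scan by using an approximate nearest-neighbor index, which is what makes search over large collections fast.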

pip install chromadb
pip install llama-index-vector-stores-chroma

Ephemeral storage (kept in memory)

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# ChromaVectorStore needs a ChromaDB collection to store and manage vector data.
# You can connect to an existing ChromaDB instance or create a new ephemeral one.
# Create an ephemeral (in-memory) client; its data is lost when the process exits
chroma_client = chromadb.EphemeralClient()

# Create a collection
collection_name = "example_collection"
chroma_collection = chroma_client.create_collection(collection_name)

# Passing the collection to ChromaVectorStore ties the vector store to that collection
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Once initialized, the ChromaVectorStore can be used to add, query, and retrieve vector data.

Persistent storage (saved on disk)

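import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Initialize a persistent ChromaDB client; data is saved under the "./chroma_db" directory
db = chromadb.PersistentClient(path="./chroma_db")

# Get or create a collection named "quickstart" to hold the vector data
chroma_collection = db.get_or_create_collection("quickstart")

# Wrap the collection in the VectorStore interface that LlamaIndex uses
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Create a StorageContext that uses this vector_store; it controls how the index stores and loads data
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build a VectorStoreIndex from the documents and the storage_context.
# `documents` is assumed to be a list of Document objects loaded earlier,
# and `embed_model` is the embedding model created in section (1).
# The documents are embedded with embed_model and the results are written to ChromaDB on disk.
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)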

Loading from disk (restoring the index)

# Initialize a new ChromaDB client pointing at the same path
db2 = chromadb.PersistentClient(path="./chroma_db")

# Get the collection that was created earlier
chroma_collection = db2.get_or_create_collection("quickstart")

# Wrap it in a LlamaIndex VectorStore again
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Rebuild the index from the existing vector_store, without needing the original documents
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)

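Querying the index: under the hood, the query engine embeds the question, searches the vector store for the most relevant document chunks, and then uses the retrieved chunks to generate a natural-language answer.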

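# Wrap the index in a query engine.
# Answer generation uses the LLM from Settings.llm; unless another LLM is
# configured, LlamaIndex defaults to OpenAI, which requires an API key.
query_engine = index.as_query_engine()

# Run a natural-language query
response = query_engine.query("What did the author do growing up?")

# Show the response
print(response)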
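The embed, retrieve, and generate steps that the query engine performs can be sketched in plain Python. Everything here is a simplified stand-in: toy_embed (word counts instead of a neural embedding), retrieve, build_prompt, and the two sample chunks are hypothetical names and data chosen for illustration, not LlamaIndex internals. The final step, sending the prompt to an LLM, is left out.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical embedding: lowercase word counts (a real engine calls the embed model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, top_k=1):
    """Step 1 and 2: embed the question, then rank the chunks by similarity to it."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_chunks):
    """Step 3: combine the retrieved chunks and the question into one prompt for an LLM."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Made-up sample chunks standing in for the contents of the vector store
chunks = [
    "The author wrote short stories and learned programming while growing up.",
    "Chroma stores embeddings on disk when a persistent client is used.",
]
question = "What did the author do growing up?"

top_chunks = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In the real pipeline the ranking happens inside the vector store, and the assembled prompt is passed to the configured LLM, whose output becomes the response.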