Learning AI Agent Programming - Day 5 - LlamaIndex - Building an Index from Nodes and Storing It

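Turning Documents into Nodes is slow and compute-intensive, so it makes sense to persist the nodes. The next run can then load them straight from storage instead of repeating the files -> documents -> nodes pipeline.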

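So the first task is to store the nodes. Before doing that, one concept needs to be clear: the index. In LlamaIndex, an index is not quite what the word means in everyday life. An everyday index is like a book's table of contents: you look up a topic and get a page number. A LlamaIndex index is closer to the book itself. It is a container that you put nodes into, and it then builds the "table of contents" from those nodes. So think of the index as a container (the book), the nodes as its pages, and the lookup structure it generates automatically as the table of contents.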
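The container idea can be made concrete with a small toy. This is only a conceptual sketch in plain Python, not how LlamaIndex implements its indexes: `ToyIndex`, `add` and `find` are hypothetical names, and the "table of contents" here is a simple word lookup.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical container: holds nodes and keeps a word lookup up to date."""

    def __init__(self):
        self.nodes = []                 # the "pages" of the book
        self.lookup = defaultdict(set)  # the "table of contents": word -> node positions

    def add(self, text: str) -> int:
        """Store a node and update the lookup; returns the node's position."""
        position = len(self.nodes)
        self.nodes.append(text)
        for word in text.lower().split():
            self.lookup[word].add(position)
        return position

    def find(self, word: str) -> list[str]:
        """Return every stored node that contains the given word."""
        return [self.nodes[i] for i in sorted(self.lookup.get(word.lower(), ()))]


index = ToyIndex()
index.add("PostgreSQL supports B-tree indexes")
index.add("Chroma stores embedding vectors")
print(index.find("chroma"))  # prints ['Chroma stores embedding vectors']
```

The caller only adds nodes; the lookup structure is maintained by the container, which is the role the index plays in the analogy above.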

Commonly used index types:


General-purpose indexes:
- SummaryIndex
- DocumentSummaryIndex
- TreeIndex

Vector indexes:
- VectorStoreIndex

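Next, the index has to be stored. There are two main ways to do this: persist the index to files on disk, or save it into a vector database. The former is the usual choice for general-purpose indexes, while the latter is only available for VectorStoreIndex.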
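For the file-based route, a minimal sketch might look like the following. It assumes `index` is an already built LlamaIndex index, and `./storage` is a hypothetical directory name. `persist_index` and `load_index` are illustrative helper names, not part of the library; the calls they wrap (`storage_context.persist`, `StorageContext.from_defaults`, `load_index_from_storage`) come from `llama_index.core`.

```python
def persist_index(index, persist_dir: str = "./storage") -> None:
    """Write the index's stores to files under persist_dir."""
    index.storage_context.persist(persist_dir=persist_dir)


def load_index(persist_dir: str = "./storage"):
    """Rebuild the index from files previously written by persist_index."""
    # Local import: only needed when an index is actually loaded.
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
```

If several indexes were persisted into the same directory, `load_index_from_storage` needs an `index_id` argument to know which one to load.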

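In practice, general-purpose indexes usually do not need to be persisted at all. They tend to be built for a single query task and then thrown away. With SummaryIndex, for example, you convert one specific document into nodes, build a SummaryIndex from them, and use it together with an LLM to answer a question about that document.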

Here we use chromadb, a lightweight "toy" vector database, to build a demo. The code is as follows:

Import the required libraries:


from datetime import datetime
import os

import chromadb
from dotenv import load_dotenv
from llama_index.core import Document, SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.schema import BaseNode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.dashscope import DashScope
from llama_index.vector_stores.chroma import ChromaVectorStore

Function implementation:


def save_to_vector_db(nodes: list[BaseNode], path: str, collection_name: str) -> None:
    """
    save nodes to vector db
    :param nodes: nodes
    :param path: path
    :param collection_name: collection name
    :return:
    """
    # Define the embedding model
    embed_model = HuggingFaceEmbedding(
        # English-focused model; for Chinese text consider e.g. BAAI/bge-small-zh-v1.5
        model_name="BAAI/bge-small-en-v1.5"
    )
    # client - create a persistent Chroma client backed by files under `path`
    chroma_client = chromadb.PersistentClient(path=path)
    # collection - the collection that will hold the index
    collection = chroma_client.get_or_create_collection(collection_name)
    # vector store - tells LlamaIndex where the vectors are stored
    vector_store = ChromaVectorStore(chroma_collection=collection)
    # storage context - bundles the storage backends used by the index
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # save to db - building the index embeds the nodes and writes them to Chroma
    VectorStoreIndex(
        nodes,
        embed_model=embed_model,  # note: the embed_model must be specified here
        storage_context=storage_context,
        show_progress=True
    )

Why does VectorStoreIndex need an embed_model?

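The embed_model converts the indexed content into vectors. A vector database stores vectors, so raw text has to be embedded before it can be written there. The embed_model also serves a second purpose, similarity search (vector search): the user's query is embedded with the same model and compared against the stored vectors to find the relevant nodes.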
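Both roles can be illustrated with a deliberately crude stand-in. In the sketch below, `toy_embed` is a hypothetical word-count "embedding" over a fixed vocabulary; a real embed_model such as the HuggingFace one above produces dense learned vectors instead, but the flow (embed nodes once, embed the query the same way, compare vectors) has the same shape.

```python
import math
from collections import Counter


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Hypothetical stand-in for an embed_model: count each vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["index", "vector", "backup", "restore"]
node_texts = ["create a vector index", "backup and restore the database"]

# "Indexing": every node is embedded once and the vectors are stored.
stored = [(text, toy_embed(text, vocabulary)) for text in node_texts]

# "Querying": the query is embedded the same way and compared to the stored vectors.
query_vector = toy_embed("how to restore a backup", vocabulary)
best_text, _ = max(stored, key=lambda item: cosine_similarity(query_vector, item[1]))
print(best_text)  # prints: backup and restore the database
```

This also shows why the model used at query time has to match the one used when storing: vectors produced by different models are not comparable.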

An embed_model is usually shared across the whole application, so it is typically set once on Settings rather than passed as a parameter to every class that needs it:


Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

With that in place, the embed_model argument can be dropped when instantiating VectorStoreIndex:


VectorStoreIndex(
    nodes,
    storage_context=storage_context,
    show_progress=True
)

Putting all of the previous code together for a test run:


from datetime import datetime
import os

import chromadb
from dotenv import load_dotenv
from llama_index.core import Document, SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.schema import BaseNode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.dashscope import DashScope
from llama_index.vector_stores.chroma import ChromaVectorStore

from build_documents.directory import PDFReader


def load_documents(path: str) -> list[Document]:
    """
    load documents from path

    :param path: path that contains documents
    :return: documents
    """
    parser = PDFReader()
    return SimpleDirectoryReader(
        path,
        file_extractor={".pdf": parser}
    ).load_data()


def transform_to_nodes(documents: list[Document]) -> list[BaseNode]:
    """
    transform documents to nodes

    :param documents: documents
    :return: nodes
    """
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cpu")
    splitter = SemanticSplitterNodeParser(
        embed_model=embed_model,
        buffer_size=1,  # adjust to the characteristics of the documents
        breakpoint_percentile_threshold=85,  # e.g. 85 or 95
    )
    return splitter.get_nodes_from_documents(documents)


def create_llm():
    return DashScope(
        api_base=os.environ.get("OPENAI_API_BASE"),
        api_key=os.environ.get("OPENAI_API_KEY"),
        model=os.environ.get('OPENAI_MODEL_NAME'),
        is_chat_model=True,
        is_function_calling_model=True,
        enable_thinking=False,
        temperature=0.8
    )


def save_to_vector_db(nodes: list[BaseNode], path: str, collection_name: str) -> None:
    """
    save nodes to vector db
    :param nodes: nodes
    :param path: path
    :param collection_name: collection name
    :return:
    """
    # client
    chroma_client = chromadb.PersistentClient(path=path)
    # collection
    collection = chroma_client.get_or_create_collection(collection_name)
    # vector store
    vector_store = ChromaVectorStore(chroma_collection=collection)
    # storage context
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # save to db
    VectorStoreIndex(
        nodes,
        storage_context=storage_context,
        show_progress=True
    )


if __name__ == "__main__":
    load_dotenv()

    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-small-en-v1.5"
    )

    print(f'{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}: start loading documents')
    documents = load_documents("./files/")
    print(f'{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}: complete loading documents')

    print(f'{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}: start transforming documents to nodes')
    nodes = transform_to_nodes(documents)
    print(f'{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}: complete transforming documents to nodes')

    print(f'{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}: start saving nodes to db')
    save_to_vector_db(nodes, path="./data/db", collection_name="postgresql-docs")
    print(f'{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}: complete saving nodes to db')
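
The point of storing the nodes was to skip the files -> documents -> nodes path on later runs. A possible sketch of that later run is shown below. `open_collection` and `load_index_if_present` are hypothetical helper names; the sketch assumes the collection was filled by save_to_vector_db above and relies on `collection.count()` from chromadb and `VectorStoreIndex.from_vector_store` from LlamaIndex.

```python
def open_collection(path: str, collection_name: str):
    """Open (or create) the Chroma collection that save_to_vector_db wrote to."""
    import chromadb

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(collection_name)


def load_index_if_present(collection, embed_model=None):
    """Return a VectorStoreIndex over the stored vectors, or None if nothing was saved yet."""
    if collection.count() == 0:
        return None  # nothing stored: run load -> transform -> save first

    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    vector_store = ChromaVectorStore(chroma_collection=collection)
    # No nodes are passed: the index is attached to the vectors already in Chroma.
    # embed_model=None falls back to Settings.embed_model.
    return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
```

With the values from the script above, this would be called as load_index_if_present(open_collection("./data/db", "postgresql-docs")). The embedding model in effect at that point should be the same one that was used when saving.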