前言

书接上文：【从零开始学习 RAG 】01：LLamaIndex 基本概念 - 掘金 (juejin.cn)

以下实践将完全使用本地模型进行。我所运行 LLM 的平台是 LM Studio，它能提供 OpenAI API 相同的网络请求接口，所以说如果读者希望使用 OpenAI API 进行测试，代码只需要轻微修改即可

接下来，我将构建一个检索增强生成（Retrieval-Augmented Generation, RAG）的流程，并利用 LlamaIndex 对该流程进行评估。文档内容涵盖以下三大部分：

理解检索增强生成（RAG）。
利用 LlamaIndex 构建 RAG 流程。
利用 LlamaIndex 对 RAG 进行评估。

基础知识准备

检索增强生成 (Retrieval-Augmented Generation, RAG)

大语言模型 (Large Language Models, LLM) 在庞大的数据集上训练，这些数据集往往不包含您个人的具体数据。检索增强生成技术 (RAG) 通过在生成过程中动态地结合用户数据，来弥补这一缺陷。重点在于，不是修改大语言模型的训练数据集，而是让模型能够实时接入并利用这些用户数据，从而提供更加定制化且与上下文相关的回答。

在 RAG 系统中，首先要做的是加载用户数据并为查询"建立索引"。当用户发起查询时，系统会在索引中过滤，找出与查询最相关的上下文。接着，这些相关的上下文和用户问题将一起提交给大语言模型，模型便据此提供相应的答案。

无论您打算构建的是聊天机器人还是自动应答代理，都需要掌握 RAG 技术，以便能够有效地将数据整合到您的应用中。

RAG 的关键阶段

RAG 包括五个关键阶段，这些都是构建任何大型应用程序不可或缺的一部分：

Loading加载： 指的是将数据从其所在位置 --- 如文本文件、PDF、其他网站、数据库或 API --- 导入到您的处理流程中。LlamaHub 提供了数百种连接器，供您选择。
Indexing索引： 创建一个支持数据查询的结构。对于大语言模型而言，这通常涉及创建向量嵌入（即数据的数值化语义表示），以及其他多种元数据策略，以简化并提高寻找相关上下文数据的准确性。
Storing存储： 数据索引后，会希望存储该索引以及所有相关元数据，以省去重复索引的步骤。
Querying查询： 在任何已确定的索引策略下，您都可以利用大语言模型和 LlamaIndex 数据结构来执行多种查询方式，包括子查询、多步查询和混合查询策略。
Evaluation评估： 评估是流程中一个至关重要的步骤，它用于检查流程相较于其他策略的有效性或进行调整时的表现如何。评估为查询回应的准确性、一致性和速度提供了客观的量度标准。

代码实践：构建一个简单RAG系统，并且对其质量评估

以下代码为 Jupyter Notebook 格式，源代码可以从文末获取

先安装基本的依赖

py 复制代码

pip install llama-index

py 复制代码

# `nest_asyncio` 模块允许异步函数在一个已经启动的异步循环内部进行嵌套执行。
# 这样做的必要性在于，Jupyter Notebook 这一工具天生就是在一个异步循环的环境下操作的。
# 通过使用 `nest_asyncio`，我们能够顺利地在这一现有的异步循环中添加并运行更多的异步函数，而不会引起冲突。
import nest_asyncio

nest_asyncio.apply()

from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.evaluation import generate_question_context_pairs
from llama_index.evaluation import RetrieverEvaluator
from llama_index.llms import OpenAI
from llama_index.embeddings import resolve_embed_model

import os
import pandas as pd

1. 加载本地数据并构建索引

py 复制代码

# load data from data directory
documents = SimpleDirectoryReader("data").load_data()

# bge-m3 embedding model
# https://huggingface.co/BAAI/bge-base-en-v1.5/tree/main
embed_model = resolve_embed_model("local:BAAI/bge-base-en-v1.5")

# Load LM Studio LLM model
llm = OpenAI(api_base="http://localhost:1234/v1", api_key="not-needed")

# Index the data
service_context = ServiceContext.from_defaults(
    embed_model=embed_model, llm=llm,
)

# Transform data to Nodes struct
node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=128)
nodes = node_parser.get_nodes_from_documents(documents)

# vetorize
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

使用 LLamaIndex 构建 query_engine，方便向量查询

py 复制代码

query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("What did the author do growing up?")
print(response_vector.response)

输出结果：

py 复制代码

"The author grew up in New Hampshire and spent most of his time reading science fiction and painting. He attended a boarding school in Massachusetts for high school, where he continued to paint and read. After graduating from college with a degree in computer science, he worked at various software companies before starting his own company, Viaweb, which was sold to Yahoo in 1996. He then moved to California and tried to focus on painting, but found it difficult due to lack of energy and motivation. He eventually returned to New York and resumed painting, this time with more success.\n### Explanation:\nThe author's childhood was marked by a love for reading science fiction and painting. He attended a boarding school in Massachusetts, where he continued to pursue these interests. After college, he worked in the software industry before starting his own company, Viaweb, which was sold to Yahoo in 1996. Following the sale of Viaweb, the author moved to California with the intention of focusing on painting, but found it difficult due to a lack of energy and motivation. He eventually returned to New York and resumed painting with more success."

我们已经搭建了一个简单的 RAG 系统，并且现在我们需要对它的性能进行评价。可以通过运用 LlamaIndex 提供的核心评估模块对这个 RAG 系统或查询引擎进行评估。下面，我们探究如何使用这些工具定量衡量检索增强生成系统的质量。

2. 质量评估

评价工作应当成为衡量 RAG 应用表现的重要指标。这关乎系统针对不同数据源和多样的查询是否能够给出准确答案。

起初，单独审查每一个查询和相应的响应有助于系统调优，但随着特殊情况和故障数量的增加，这种方式可能行不通。相比之下，建立一整套综合性评价指标或者自动化评估系统则更为高效。这类工具能够洞察系统整体性能并识别哪些领域需要进一步关注。

RAG 系统的评估主要聚焦于两个核心方面：

检索评估： 这是对系统检索出的信息的准确性与相关性进行评价的过程。
响应评估： 这是基于检索结果对系统生成回答的质量和恰当性进行测量的过程。

生成 "问题-上下文" 对：

在评价 RAG 系统时，关键在于有能力提出既能获取正确上下文，又能相应生成适当回答的问题。LlamaIndex 提供了一个 generate_question_context_pairs 模块，这个模块专门设计用来构建评价 RAG 系统的问题和上下文对，涵盖了检索评估和响应评估两大方面。如需了解更多关于问题生成的信息，请查阅文档。

py 复制代码

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=20
)

检索评估：

在做好了开展检索评估的准备后，利用已生成的评估数据集来运行 RetrieverEvaluator。

首先，我们需要建立一个 Retriever 实例，随后定义两个函数：get_eval_results 负责在数据集上执行检索操作，display_results 用于展现评估结果:

py 复制代码

retriever = vector_index.as_retriever(similarity_top_k=2)

定义检索评估器：

我们采用 命中率 (Hit Rate) 和 平均倒数排名 (Mean Reciprocal Rank, MRR) 这两项指标来对检索器进行评估。

命中率：

命中率衡量的是正确答案出现在检索结果前k个文档中的比例。换句话说，就是我们的系统在最开始的几次猜测中得到正确结果的频次。

平均倒数排名（MRR）：

MRR 通过分析最相关文档在检索结果里的排名来计算每个查询的准确性。更具体地说，它是所有查询的相关文档排名倒数的平均值。例如，若最相关的文档排在第一位，其倒数排名为 1；排在第二位时，为 1/2；以此类推。

我们来通过这些指标来了解我们的检索器的表现:

py 复制代码

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

结果分析：

结果显示 HitRate 比 MRR 的数值更大。 MRR 的表现不如命中率意味着排名靠前的结果并不总是最匹配的。为了提升 MRR，可能需要引入重新排序器（rerankers），这些工具用于优化检索到的文档顺序。若想深入理解重新排序器如何精细调优检索指标，请参考我们在博客文章中的全面讲解。

3. 质量评估工具：

忠实度评估器（FaithfulnessEvaluator）：这个工具用来衡量查询引擎的响应是否与其它已知的信息源相符合，能有效判断响应中是否包含了凭空捏造的内容。
相关度评估器（Relevancy Evaluator）：该评估器主要测量查询结果及其关联信息是否与用户的查询要求相匹配。

下面我将测试这两个工具，先准备好数据：

py 复制代码

# Get the list of queries from the above created dataset
queries = list(qa_dataset.queries.values())

3.1 忠实度评估器

我们先来看 Faithfulness 评估器：

py 复制代码

vector_index = VectorStoreIndex(nodes, service_context = service_context)
query_engine = vector_index.as_query_engine()

创建一个 FaithfulnessEvaluator 实例:

py 复制代码

from llama_index.evaluation import FaithfulnessEvaluator
faithfulness_raven13b = FaithfulnessEvaluator(service_context=service_context)

任意评估一个问题是否相关:

py 复制代码

eval_query = queries[3]
eval_query

使用实例：

py 复制代码

response_vector = query_engine.query(eval_query)

# Compute faithfulness evaluation
eval_result = faithfulness_raven13b.evaluate_response(response=response_vector)

# You can check passing parameter in eval_result if it passed the evaluation.
eval_result.passing

3.2 Relevancy 相关度评估器

相关度评估器（Relevancy Evaluator）非常适用于判断响应内容和提供的信息源（检索到的背景资料）是否对查询进行了准确的匹配。此工具能够帮助我们确认响应内容是否确实解答了用户的问题。

py 复制代码

from llama_index.evaluation import RelevancyEvaluator

relevancy_raven13b = RelevancyEvaluator(service_context=service_context)

任意选择一个问题进行评估:

py 复制代码

# Pick a query
query = queries[3]
query

py 复制代码

# Generate response.
# response_vector has response and source nodes (retrieved context)
response_vector = query_engine.query(query)

# Relevancy evaluation
eval_result = relevancy_raven13b.evaluate_response(
    query=query, response=response_vector
)

py 复制代码

# You can check passing parameter in eval_result if it passed the evaluation.
eval_result.passing

# You can get the feedback for the evaluation.
eval_result.feedback

3.3 Batch Evaluator 批次评估器：

在我们独立完成了忠实度和相关度的评估之后，LlamaIndex 提供了 BatchEvalRunner 工具，可以批次地进行多个评估的计算。比如同时进行忠实度和相关度的评估：

py 复制代码

from llama_index.evaluation import BatchEvalRunner

# Let's pick top 10 queries to do evaluation
batch_eval_queries = queries[:10]

# Initiate BatchEvalRunner to compute FaithFulness and Relevancy Evaluation.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

py 复制代码

# Let's get faithfulness score
faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
faithfulness_score

py 复制代码

# Let's get relevancy score
relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
relevancy_score

结果分析

忠实度得分为 0.8，这意味着生成的答案存在不实之处，还可以继续优化。
相关度得分为 0.8，则表明生成的答案与检索到的背景信息和问题并非总是紧密相关。

总结

在上述研究中，我们研究了如何利用 LlamaIndex 构建和评估一个 RAG（检索增强型生成模型）流程，并特别关注如何对流程中的检索系统和生成的响应进行评价。

此外，LlamaIndex 还提供了许多其他的评价工具，你可以通过此链接了解更多相关细节和进阶使用方法。

项目源代码：llamaIndex_learning/02-SimpleEvaluation/Evaluate_RAG_with_LlamaIndex_Locally.ipynb at master · HildaM/llamaIndex_learning (github.com)

【从零开始学习 RAG 】02：评估 RAG 的召回数据质量

前言

基础知识准备

检索增强生成 (Retrieval-Augmented Generation, RAG)

RAG 的关键阶段

代码实践：构建一个简单RAG系统，并且对其质量评估

1. 加载本地数据并构建索引

2. 质量评估

生成 "问题-上下文" 对：

检索评估：

定义检索评估器：

结果分析：

3. 质量评估工具：

3.1 忠实度评估器

3.2 Relevancy 相关度评估器

3.3 Batch Evaluator 批次评估器：

结果分析

总结