ragas官方文档中文版(十六)

智能体或工具使用

智能体或工具使用工作流可以在多个维度上进行评估。以下是一些可用于评估智能体或工具在给定任务中性能的指标。

主题遵循度

部署在现实世界应用中的人工智能系统在与用户交互时被期望遵循感兴趣的领域,但大语言模型有时可能会忽略这一限制而回答一般性查询。主题遵循度指标评估人工智能在交互期间保持在预定义领域内的能力。此指标在对话式人工智能系统中特别重要,其中人工智能被期望仅为与预定义领域相关的查询提供帮助。

TopicAdherence 需要一个预定义的主题集合,人工智能系统被期望遵循该集合,该集合通过 reference_topics 与 user_input 一起提供。该指标可以计算主题遵循度的精确率、召回率和 F1 分数,定义为

P r e c i s i o n = ∣ Q u e r i e s t h a t a n s w e r e d a n d a r e a d h e r e s t o a n y p r e s e n t r e f e r e n c e t o p i c s ∣ ∣ Q u e r i e s t h a t a r e a n s w e r e d a n d a r e a d h e r e s t o a n y p r e s e n t r e f e r e n c e t o p i c s ∣ + ∣ Q u e r i e s t h a t a r e a n s w e r e d a n d d o n o t a d h e r e s t o a n y p r e s e n t r e f e r e n c e t o p i c s ∣ Precision=\frac{|Queries\ that\ answered\ and\ are\ adheres\ to\ any\ present\ reference\ topics|}{|Queries\ that\ are\ answered\ and\ are\ adheres\ to\ any\ present\ reference\ topics| + |Queries\ that\ are\ answered\ and\ do\ not\ adheres\ to\ any\ present\ reference\ topics|} Precision=∣Queries that are answered and are adheres to any present reference topics∣+∣Queries that are answered and do not adheres to any present reference topics∣∣Queries that answered and are adheres to any present reference topics∣

P e c a l l = ∣ Q u e r i e s t h a t a n s w e r e d a n d a r e a d h e r e s t o a n y p r e s e n t r e f e r e n c e t o p i c s ∣ ∣ Q u e r i e s t h a t a r e a n s w e r e d a n d a r e a d h e r e s t o a n y p r e s e n t r e f e r e n c e t o p i c s ∣ + ∣ Q u e r i e s t h a t a r e r e f u s e d a n d s h o u l d h a v e b e e n a n s w e r e d ∣ Pecall=\frac{|Queries\ that\ answered\ and\ are\ adheres\ to\ any\ present\ reference\ topics|}{|Queries\ that\ are\ answered\ and\ are\ adheres\ to\ any\ present\ reference\ topics| + |Queries\ that\ are\ refused\ and\ should\ have\ been\ answered|} Pecall=∣Queries that are answered and are adheres to any present reference topics∣+∣Queries that are refused and should have been answered∣∣Queries that answered and are adheres to any present reference topics∣

示例

python 复制代码
import asyncio
from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import TopicAdherence
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall


async def evaluate_topic_adherence():
    # Setup LLM
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o-mini", client=client)

    user_input = [
        HumanMessage(
            content="Can you provide me with details about Einstein's theory of relativity?"
        ),
        AIMessage(
            content="Sure, let me retrieve the relevant information for you.",
            tool_calls=[
                ToolCall(
                    name="document_search",
                    args={"query": "Einstein's theory of relativity"},
                )
            ],
        ),
        ToolMessage(
            content="Found relevant documents: 1. Relativity: The Special and the General Theory, 2. General Theory of Relativity by A. Einstein."
        ),
        AIMessage(
            content="I found some documents on Einstein's theory of relativity. Which one would you like to know more about: 'Relativity: The Special and the General Theory' or 'General Theory of Relativity by A. Einstein'?"
        ),
        HumanMessage(content="Tell me about the 'General Theory of Relativity'."),
        AIMessage(
            content="Got it! Let me fetch more details from 'General Theory of Relativity by A. Einstein'.",
            tool_calls=[
                ToolCall(
                    name="document_retrieve",
                    args={"document": "General Theory of Relativity by A. Einstein"},
                )
            ],
        ),
        ToolMessage(
            content="The document discusses how gravity affects the fabric of spacetime, describing the relationship between mass and spacetime curvature."
        ),
        AIMessage(
            content="The 'General Theory of Relativity' explains how gravity affects the fabric of spacetime and the relationship between mass and spacetime curvature. Would you like more details or a specific explanation?"
        ),
        HumanMessage(
            content="No, that's perfect. By the way, do you know any good recipes for a chocolate cake?"
        ),
        AIMessage(
            content="Sure! Let me find a simple and delicious recipe for a chocolate cake.",
            tool_calls=[
                ToolCall(name="recipe_search", args={"query": "chocolate cake recipe"})
            ],
        ),
        ToolMessage(
            content="Here's a popular recipe for a chocolate cake: Ingredients include flour, sugar, cocoa powder, eggs, milk, and butter. Instructions: Mix dry ingredients, add wet ingredients, and bake at 350°F for 30-35 minutes."
        ),
        AIMessage(
            content="I found a great recipe for chocolate cake! Would you like the full details, or is that summary enough?"
        ),
    ]

    # Evaluate with precision mode
    metric = TopicAdherence(llm=llm, mode="precision")
    result = await metric.ascore(
        user_input=user_input,
        reference_topics=["science"],
    )
    print(f"Topic Adherence (precision): {result.value}")


if __name__ == "__main__":
    asyncio.run(evaluate_topic_adherence())

输出:

text 复制代码
Topic Adherence (precision): 0.6666666666444444

要将模式更改为召回率,请将 mode 参数设置为 recall

python 复制代码
metric = TopicAdherence(llm=llm, mode="recall")

输出:

text 复制代码
0.99999999995

旧版 API(已弃用)

弃用通知

来自 ragas.metrics 的旧版 TopicAdherenceScore 已弃用,并将在 v1.0 中移除。请迁移到 ragas.metrics.collections.TopicAdherence ,它提供相同的功能但使用现代化的 API。

旧版 API 仍然可以使用,但需要 MultiTurnSample :

python 复制代码
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import TopicAdherenceScore  # Legacy import
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

sample = MultiTurnSample(
    user_input=[...],  # conversation messages
    reference_topics=["science"],
)
scorer = TopicAdherenceScore(llm=evaluator_llm, mode="precision")
score = await scorer.multi_turn_ascore(sample)

工具调用准确性

ToolCallAccuracy 衡量 LLM 智能体调用工具与预期工具调用相比的准确程度。它评估工具调用序列和参数准确性。此指标对于验证智能体在多步骤工作流中使用正确参数调用正确工具特别有用。

该指标需要 user_input (对话消息)和 reference_tool_calls (预期工具调用)。它返回 0 到 1 之间的分数,分数越高表示性能越好。

关键特性

两种评估模式:

  1. 严格顺序(默认) :工具调用必须按顺序精确匹配
  • 用途:顺序重要的顺序工作流
  • 示例:必须先搜索再过滤结果
  1. 灵活顺序 :工具调用可以是任意顺序
  • 用途:顺序无关的并行操作
  • 示例:同时获取多个城市的天气

评分:

  • 评估序列对齐(正确顺序中的正确工具)
  • 评估参数准确性(每个工具的正确参数)
  • 最终分数 =(参数准确性)×(序列对齐?1 : 0)

示例:基本用法

python 复制代码
import asyncio
from ragas.metrics.collections import ToolCallAccuracy
from ragas.messages import AIMessage, HumanMessage, ToolCall

async def evaluate_tool_call_accuracy():
    # Define the conversation with tool calls
    user_input = [
        HumanMessage(content="What's the weather like in New York right now?"),
        AIMessage(
            content="The current temperature in New York is 75°F and it's partly cloudy.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
        ),
        HumanMessage(content="Can you translate that to Celsius?"),
        AIMessage(
            content="Let me convert that to Celsius for you.",
            tool_calls=[
                ToolCall(
                    name="temperature_conversion", args={"temperature_fahrenheit": 75}
                )
            ],
        ),
    ]

    # Define expected tool calls
    reference_tool_calls = [
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75}),
    ]

    # Evaluate
    metric = ToolCallAccuracy()
    result = await metric.ascore(
        user_input=user_input,
        reference_tool_calls=reference_tool_calls,
    )
    print(f"Tool Call Accuracy: {result.value}")

if __name__ == "__main__":
    asyncio.run(evaluate_tool_call_accuracy())

输出:

text 复制代码
Tool Call Accuracy: 1.0

示例:灵活顺序模式

对于工具调用可以并行发生的场景:

python 复制代码
# Enable flexible order mode
metric = ToolCallAccuracy(strict_order=False)

user_input = [
    HumanMessage(content="Get weather for Paris and London"),
    AIMessage(
        content="Fetching weather data...",
        tool_calls=[
            ToolCall(name="weather_check", args={"location": "London"}),
            ToolCall(name="weather_check", args={"location": "Paris"}),
        ],
    ),
]

reference_tool_calls = [
    ToolCall(name="weather_check", args={"location": "Paris"}),
    ToolCall(name="weather_check", args={"location": "London"}),
]

result = await metric.ascore(
    user_input=user_input,
    reference_tool_calls=reference_tool_calls,
)
print(f"Score: {result.value}")  # 1.0 (order doesn't matter)

评分示例

完美匹配 :

python 复制代码
# All tools called correctly with correct arguments
Expected: [weather_check(location="Paris"), translate(text="hello")]
Got:      [weather_check(location="Paris"), translate(text="hello")]
Score: 1.0

部分参数匹配 :

python 复制代码
# Some arguments incorrect
Expected: [search(query="python", limit=10, sort="date")]
Got:      [search(query="python", limit=10, sort="relevance")]
Score: 0.66 (2 out of 3 arguments match)

错误顺序(严格模式) :

python 复制代码
# Correct tools but wrong sequence
Expected: [search(...), filter(...)]
Got:      [filter(...), search(...)]
Score: 0.0 (sequence not aligned)

用例

  • 智能体验证 :测试智能体是否正确使用工具
  • 回归测试 :确保工具调用在更改后不会退化
  • 多步骤工作流 :验证复杂的顺序操作
  • 工具选择 :验证智能体从众多选项中选择正确的工具

何时使用不同的指标

指标 使用场景
ToolCallAccuracy 您关心精确的工具序列和参数
ToolCallF1 您需要工具调用的精确率/召回率指标
AgentGoalAccuracy 您关心结果,而不是使用的特定工具

示例 :对于"帮我预订去巴黎的航班",如果您只关心预订是否成功(而不是调用了哪些中间工具),请改用 AgentGoalAccuracyWithReference 。

旧版 API(已弃用)

弃用通知

来自 ragas.metrics 的旧版 ToolCallAccuracy 已弃用,并将在 v1.0 中移除。请迁移到 ragas.metrics.collections.ToolCallAccuracy ,它提供相同的功能但使用现代化的 API。

旧版 API 仍然可以使用,但需要 MultiTurnSample :

python 复制代码
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy  # Legacy import

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather in New York?"),
        AIMessage(
            content="Checking weather...",
            tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
        ),
    ],
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
    ],
)

scorer = ToolCallAccuracy()
score = await scorer.multi_turn_ascore(sample)

旧版还支持自定义参数比较指标:

python 复制代码
from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

metric = ToolCallAccuracy()
metric.arg_comparison_metric = NonLLMStringSimilarity()

工具调用 F1

ToolCallF1 是一个基于智能体工具调用的精确率和召回率返回 F1 分数的指标,将其与一组预期调用( reference_tool_calls )进行比较。虽然 ToolCallAccuracy 基于精确顺序和内容匹配提供二进制分数,但 ToolCallF1 通过提供更宽松的评估来补充它,这对于入门和迭代很有用。它有助于量化智能体与预期行为的接近程度,即使它调用过多或过少。

公式

ToolCallF1 基于经典的信息检索(IR)指标。它使用无序匹配:工具调用的顺序不影响结果,只考虑工具名称和参数的存在性和正确性。

它与主题遵循度有何不同?

虽然 ToolCallF1 和 TopicAdherenceScore 都使用精确率、召回率和 F1 分数,但它们评估的是不同方面:

指标 评估内容 基于
ToolCallF1 工具执行的正确性 结构化工具调用对象
TopicAdherenceScore 对话是否保持在主题上 领域主题的比较

当您想跟踪智能体是否正确执行工具时,使用 ToolCallF1 。当评估内容或意图是否保持在允许的主题范围内时,使用 TopicAdherenceScore 。

示例:基本用法

python 复制代码
import asyncio
from ragas.metrics.collections import ToolCallF1
from ragas.messages import HumanMessage, AIMessage, ToolCall

async def evaluate_tool_call_f1():
    # Define the conversation with tool calls
    user_input = [
        HumanMessage(content="What's the weather like in Paris today?"),
        AIMessage(
            content="Let me check that for you.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "Paris"})],
        ),
        HumanMessage(content="And the UV index?"),
        AIMessage(
            content="Sure, here's the UV index for Paris.",
            tool_calls=[ToolCall(name="uv_index_lookup", args={"location": "Paris"})],
        ),
    ]

    # Define expected tool calls
    reference_tool_calls = [
        ToolCall(name="weather_check", args={"location": "Paris"}),
        ToolCall(name="uv_index_lookup", args={"location": "Paris"}),
    ]

    # Evaluate
    metric = ToolCallF1()
    result = await metric.ascore(
        user_input=user_input,
        reference_tool_calls=reference_tool_calls,
    )
    print(f"Tool Call F1: {result.value}")

if __name__ == "__main__":
    asyncio.run(evaluate_tool_call_f1())

输出:

text 复制代码
Tool Call F1: 1.0

示例:额外调用的工具

当智能体进行了参考中没有的额外工具调用时:

python 复制代码
user_input = [
    HumanMessage(content="What's the weather like in Paris today?"),
    AIMessage(
        content="Let me check that for you.",
        tool_calls=[ToolCall(name="weather_check", args={"location": "Paris"})],
    ),
    HumanMessage(content="And the UV index?"),
    AIMessage(
        content="Sure, here's the UV index and air quality for Paris.",
        tool_calls=[
            ToolCall(name="uv_index_lookup", args={"location": "Paris"}),
            ToolCall(name="air_quality", args={"location": "Paris"}),  # extra call
        ],
    ),
]

reference_tool_calls = [
    ToolCall(name="weather_check", args={"location": "Paris"}),
    ToolCall(name="uv_index_lookup", args={"location": "Paris"}),
]

result = await metric.ascore(
    user_input=user_input,
    reference_tool_calls=reference_tool_calls,
)
print(f"F1 Score: {result.value}")

输出:

text 复制代码
F1 Score: 0.67

在此情况下:

  • TP = 2(weather_check、uv_index_lookup)
  • FP = 1(air_quality)
  • FN = 0
  • 精确率 = 2/3 = 0.67,召回率 = 2/2 = 1.0,F1 = 0.67

评分示例

完美匹配 :

python 复制代码
# All tools called correctly
Reference: [weather_check(location="Paris"), uv_index_lookup(location="Paris")]
Got:       [weather_check(location="Paris"), uv_index_lookup(location="Paris")]
F1 Score: 1.0

缺失的工具调用 :

python 复制代码
# One expected tool not called
Reference: [weather_check(...), uv_index_lookup(...)]
Got:       [weather_check(...)]
F1 Score: 0.67 (TP=1, FP=0, FN=1)

错误的参数 :

python 复制代码
# Tool name matches but args differ
Reference: [weather_check(location="Paris")]
Got:       [weather_check(location="London")]
F1 Score: 0.0 (no match, arguments must be exact)

旧版 API(已弃用)

弃用通知

来自 ragas.metrics 的旧版 ToolCallF1 已弃用,并将在 v1.0 中移除。请迁移到 ragas.metrics.collections.ToolCallF1 ,它提供相同的功能但使用现代化的 API。

旧版 API 仍然可以使用,但需要 MultiTurnSample :

python 复制代码
from ragas.metrics import ToolCallF1  # Legacy import
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather like in Paris today?"),
        AIMessage(
            content="Let me check that for you.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "Paris"})],
        ),
    ],
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "Paris"}),
    ],
)

scorer = ToolCallF1()
score = await scorer.multi_turn_ascore(sample)

智能体目标准确性

智能体目标准确性是一个可用于评估 LLM 在识别和实现用户目标方面性能的指标。这是一个二进制指标,其中 1 表示人工智能已实现目标,0 表示人工智能未实现目标。

使用参考

AgentGoalAccuracyWithReference 通过将工作流的最终状态与提供的参考结果进行比较来评估智能体是否实现了用户目标。参考代表预期/理想结果。

python 复制代码
import asyncio
from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import AgentGoalAccuracyWithReference
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage


async def evaluate_agent_goal_accuracy_with_reference():
    # Setup LLM
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o-mini", client=client)

    user_input = [
        HumanMessage(
            content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"
        ),
        AIMessage(
            content="Sure, let me find the best options for you.",
            tool_calls=[
                ToolCall(
                    name="restaurant_search",
                    args={"cuisine": "Chinese", "time": "8:00pm"},
                )
            ],
        ),
        ToolMessage(
            content="Found a few options: 1. Golden Dragon, 2. Jade Palace"
        ),
        AIMessage(
            content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"
        ),
        HumanMessage(content="Let's go with Golden Dragon."),
        AIMessage(
            content="Great choice! I'll book a table for 8:00pm at Golden Dragon.",
            tool_calls=[
                ToolCall(
                    name="restaurant_book",
                    args={"name": "Golden Dragon", "time": "8:00pm"},
                )
            ],
        ),
        ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
        AIMessage(
            content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"
        ),
        HumanMessage(content="thanks"),
    ]

    metric = AgentGoalAccuracyWithReference(llm=llm)
    result = await metric.ascore(
        user_input=user_input,
        reference="Table booked at one of the chinese restaurants at 8 pm",
    )
    print(f"Agent Goal Accuracy: {result.value}")


if __name__ == "__main__":
    asyncio.run(evaluate_agent_goal_accuracy_with_reference())

输出:

text 复制代码
Agent Goal Accuracy: 1.0

无参考

AgentGoalAccuracyWithoutReference 在不需要参考的情况下评估智能体是否实现了用户目标。该指标从对话中推断用户的预期目标和已实现的结果,然后进行比较。

python 复制代码
import asyncio
from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import AgentGoalAccuracyWithoutReference
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage


async def evaluate_agent_goal_accuracy_without_reference():
    # Setup LLM
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o-mini", client=client)

    user_input = [
        HumanMessage(
            content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"
        ),
        AIMessage(
            content="Sure, let me find the best options for you.",
            tool_calls=[
                ToolCall(
                    name="restaurant_search",
                    args={"cuisine": "Chinese", "time": "8:00pm"},
                )
            ],
        ),
        ToolMessage(
            content="Found a few options: 1. Golden Dragon, 2. Jade Palace"
        ),
        AIMessage(
            content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"
        ),
        HumanMessage(content="Let's go with Golden Dragon."),
        AIMessage(
            content="Great choice! I'll book a table for 8:00pm at Golden Dragon.",
            tool_calls=[
                ToolCall(
                    name="restaurant_book",
                    args={"name": "Golden Dragon", "time": "8:00pm"},
                )
            ],
        ),
        ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
        AIMessage(
            content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"
        ),
        HumanMessage(content="thanks"),
    ]

    metric = AgentGoalAccuracyWithoutReference(llm=llm)
    result = await metric.ascore(user_input=user_input)
    print(f"Agent Goal Accuracy: {result.value}")


if __name__ == "__main__":
    asyncio.run(evaluate_agent_goal_accuracy_without_reference())

输出:

text 复制代码
Agent Goal Accuracy: 1.0

旧版 API(已弃用)

弃用通知

来自 ragas.metrics 的旧版 AgentGoalAccuracyWithReference 和 AgentGoalAccuracyWithoutReference 已弃用,并将在 v1.0 中移除。请迁移到 ragas.metrics.collections ,它提供相同的功能但使用现代化的 API。

旧版 API 仍然可以使用,但需要 MultiTurnSample :

python 复制代码
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import AgentGoalAccuracyWithReference  # Legacy import
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

sample = MultiTurnSample(
    user_input=[...],  # conversation messages
    reference="Table booked at one of the chinese restaurants at 8 pm",
)
scorer = AgentGoalAccuracyWithReference(llm=evaluator_llm)
score = await scorer.multi_turn_ascore(sample)
相关推荐
三块可乐两块冰1 小时前
rag学习5
linux·前端·python
DXM05211 小时前
第11期| 遥感图像分类模型:ResNet_DenseNet原理+实战训练
人工智能·python·深度学习·机器学习·分类·数据挖掘·ageo
SilentSamsara1 小时前
模型部署实战:FastAPI + ONNX + Docker 的推理服务化
人工智能·pytorch·python·深度学习·机器学习·fastapi
聆风吟º1 小时前
Python基础数据类型(一):数字类型
开发语言·python·float·int·bool·数字类型
时代文章1 小时前
AI 基础知识体系
人工智能·ai
Tisfy1 小时前
LeetCode 3838.带权单词映射:求和、取模、拼接(附python一行版)
python·算法·leetcode·字符串·题解·模拟·取模
NaclarbCSDN1 小时前
我写了一个命令行书签管理器,然后抛弃了浏览器书签栏
linux·git·python·github
Z-D-K2 小时前
S-44的周末”旅行“-周日
人工智能·ai·aigc·交互·agi
沪漂阿龙2 小时前
LangChain 系列之Tools:让大模型真正连接业务系统
人工智能·python·langchain