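Agent Series (16): Toolchain Design, Five Principles for Getting LLMs to Use Tools Correctly

Tool docs are written for the LLM, not for humans

Have you ever written a tool docstring like this: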

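# Assumption: lc_tool is LangChain's tool decorator, e.g.
# from langchain_core.tools import tool as lc_tool
@lc_tool
def get_data(query: str) -> str:
    """Get data."""
    ...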

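This is poor documentation for a human and worse for an LLM: the model cannot tell what the tool does, when to call it, or what arguments to pass.

Tool design has three core dimensions: description quality (does the LLM pick your tool), error handling (does a failure crash the run), and granularity (how easy the arguments are to extract). This article lets the experimental results do the talking.

Demo 1: Description Quality, and When It Actually Affects Tool Selection

Compare two versions of the same weather tool: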

# Version A: vague
@lc_tool
def weather_vague(city: str) -> str:
    """Get data."""
    ...

# Version B: precise
@lc_tool
def weather_precise(city: str) -> str:
    """Get current weather for a city.

    Returns temperature (Celsius) and condition (sunny / cloudy / rainy / unknown).
    Use this whenever the user asks about weather, temperature, or sky conditions
    for a specific city. Pass the city name as a plain string, e.g. 'Beijing'.
    """
    ...

Five weather queries compare the tool call rate of the two versions:

Query                                            Vague      Precise
------------------------------------------------ ---------- ----------
What's the weather in Beijing today?             ✓ called   ✓ called
Is it raining in Shanghai right now?             ✓ called   ✓ called
What temperature should I expect in Shenzhen?    ✓ called   ✓ called
Should I bring an umbrella to Beijing?           ✓ called   ✓ called
How's the sky in Shanghai?                       ✓ called   ✓ called

Tool call rate: Vague 5/5, Precise 5/5

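Both score 5/5.

This result is counterintuitive, and it rests on an important precondition: when an Agent has only one tool, the LLM has no alternative and will call it no matter how bad the description is. The gap in description quality only shows up when the LLM has to choose among several tools, which is the norm in production systems.

Suppose an Agent has 10 tools attached and the user asks "check the weather in Beijing for me". The LLM has to work out which of 10 docstrings is the best match. In that setting, a precisely described tool is far more likely to be picked than a vague one.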
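
To make the multi-tool situation concrete, here is a minimal sketch of the "menu" a model effectively chooses from. It uses plain Python functions and `inspect.getdoc` rather than any agent framework; the three tools and the `tool_menu` helper are hypothetical names made up for illustration.

```python
import inspect


def get_data(query: str) -> str:
    """Get data."""
    return "stub"


def get_weather(city: str) -> str:
    """Get current weather for a city.

    Use this whenever the user asks about weather, temperature, or sky conditions.
    """
    return "stub"


def get_product_info(name: str) -> str:
    """Get the monthly price and description of a product by name."""
    return "stub"


def tool_menu(tools) -> str:
    """Render one line per tool: its name plus the first line of its docstring."""
    lines = []
    for fn in tools:
        doc = inspect.getdoc(fn) or "(no description)"
        lines.append(f"{fn.__name__}: {doc.splitlines()[0]}")
    return "\n".join(lines)


print(tool_menu([get_data, get_weather, get_product_info]))
```

Read as a menu entry, `get_data: Get data.` carries no signal about weather at all, which is the intuition behind expecting the vague version to lose once it has competitors.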

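The golden format for a description:

"""<One sentence stating what the tool does>

Returns: <format and meaning of the return value>
When to use: <what kinds of user question should trigger this tool>
Args: <parameter name + an example of the expected format>
"""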

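As an illustration, the description template filled in for a hypothetical product-lookup tool might look like the sketch below. The function body, the contents of `MOCK_PRODUCTS`, and the product name are assumptions for the example, not the article's actual code.

```python
import json

# Hypothetical catalogue, for illustration only.
MOCK_PRODUCTS = {"pro plan": {"name": "Pro Plan", "monthly_price_usd": 299}}


def get_product_info(name: str) -> str:
    """Look up a product's monthly price by product name.

    Returns: a JSON string with `name` and `monthly_price_usd`, or an
        "Error: ..." string if the product is unknown.
    When to use: whenever the user asks about a product's price or plan details.
    Args: name is the product name as a plain string, e.g. 'Pro Plan'.
    """
    product = MOCK_PRODUCTS.get(name.strip().lower())
    if product is None:
        return f"Error: product '{name}' not found. Available: {sorted(MOCK_PRODUCTS)}."
    return json.dumps(product)
```
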
Demo 2: Error Handling, raise or return?

Two versions of the tool with identical logic but different failure behavior:

# Raises an exception ← dangerous
@lc_tool
def weather_raises(city: str) -> str:
    """Get current weather for a city."""
    if city.lower() not in MOCK_WEATHER:
        raise ValueError(f"City '{city}' not found in database.")
    ...

# Returns an error string ← safe
@lc_tool
def weather_returns_error(city: str) -> str:
    """Get current weather for a city. Returns error message if city not found."""
    data = MOCK_WEATHER.get(city.lower())
    if data is None:
        return (f"City '{city}' not found. "
                f"Available cities: {list(MOCK_WEATHER.keys())}. "
                f"Please ask the user to confirm the city name.")
    ...

Three test cases:

Known city (Beijing): both versions behave the same and return the weather normally.

Unknown city (Atlantis):

raises : [CRASHED] ValueError: City 'Atlantis' not found in database.
returns: I'm sorry, but I couldn't find the weather information for Atlantis.
         Please make sure the city name is correct...

weather_raises crashes outright and the whole Agent run terminates. With weather_returns_error, the LLM reads the error string and composes a friendly reply.

Misspelled city (Shanghia):

raises : The current weather in Shanghai is cloudy with a temperature of 22°C.
returns: The current weather in Shanghai is 22°C with a cloudy condition.

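Both are correct, because the LLM silently corrected "Shanghia" to "Shanghai" before calling the tool, so the tool received a valid city name. The LLM has some ability to self-heal inputs, but it should not be relied on.

Conclusion: a tool should only return strings and never raise. An exception escapes the Agent's control flow, so the LLM gets no chance to handle it. An error string, by contrast, can be read and understood by the LLM, which then decides what to do next (retry, tell the user, or switch to another tool).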
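
One way to enforce this rule without rewriting every tool is a small wrapper that converts exceptions into error strings. The following is a sketch in plain Python: `returns_errors` is a hypothetical helper name, the mock data is invented, and in a real agent the wrapper would sit underneath whatever tool decorator the framework provides.

```python
import functools

# Hypothetical data, for illustration only.
MOCK_WEATHER = {"beijing": "sunny, 25°C", "shanghai": "cloudy, 22°C"}


def returns_errors(fn):
    """Wrap a tool so any exception becomes an 'Error: ...' string instead of propagating."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # deliberately broad: the LLM should see every failure
            return f"Error: {type(exc).__name__}: {exc}"
    return wrapper


@returns_errors
def weather_raises(city: str) -> str:
    """Get current weather for a city."""
    if city.lower() not in MOCK_WEATHER:
        raise ValueError(f"City '{city}' not found in database.")
    return MOCK_WEATHER[city.lower()]


print(weather_raises("Atlantis"))
```

With the wrapper in place, the Atlantis case comes back as a string the model can read rather than an exception that ends the run. A hand-written message that lists the available cities, as in weather_returns_error, still gives the model more to act on than a generic wrapper can.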

Demo 3: Granularity, Fat Tool vs Fine-Grained Tools

Fat tool: one tool handles everything and takes free text as input.

@lc_tool
def omnibus_lookup(query: str) -> str:
    """Look up weather, product info, or evaluate math. Pass the full user question."""
    q = query.lower()
    for city in MOCK_WEATHER:
        if city in q:
            return json.dumps(MOCK_WEATHER[city])
    for name in MOCK_PRODUCTS:
        if name in q:
            return json.dumps(MOCK_PRODUCTS[name])
    # try math...

Fine-grained tools: three separate tools, each with precisely typed parameters.
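
The article does not show the fine-grained versions, so the following is a sketch of what they might look like. The tool names `get_weather`, `get_product_info`, and `calculator` come from the test output; the bodies, the mock data, and the AST-based arithmetic evaluator are assumptions made for illustration.

```python
import ast
import json
import operator

# Hypothetical mock data; the article's real MOCK_WEATHER / MOCK_PRODUCTS are not shown.
MOCK_WEATHER = {
    "beijing": {"temp_c": 25, "condition": "sunny"},
    "shanghai": {"temp_c": 22, "condition": "cloudy"},
}
MOCK_PRODUCTS = {"pro plan": {"name": "Pro Plan", "monthly_price_usd": 299}}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def get_weather(city: str) -> str:
    """Get current weather for a city, e.g. 'Beijing'. Returns a JSON string."""
    data = MOCK_WEATHER.get(city.strip().lower())
    if data is None:
        return f"Error: city '{city}' not found. Available: {sorted(MOCK_WEATHER)}."
    return json.dumps(data)


def get_product_info(name: str) -> str:
    """Get product details by name, e.g. 'Pro Plan'. Returns a JSON string."""
    data = MOCK_PRODUCTS.get(name.strip().lower())
    if data is None:
        return f"Error: product '{name}' not found. Available: {sorted(MOCK_PRODUCTS)}."
    return json.dumps(data)


def _eval(node):
    """Evaluate a parsed expression that contains only numbers and + - * /."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")


def calculator(expression: str) -> str:
    """Evaluate plain arithmetic, e.g. '299 * 12'. Returns the result as a string."""
    try:
        return str(_eval(ast.parse(expression, mode="eval").body))
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return (f"Error: could not evaluate {expression!r} ({exc}). "
                "Pass plain arithmetic such as '299 * 12'.")
```

Each function takes one clearly named argument, so each can be timed, logged, and unit-tested on its own.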

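Comparison across four test cases:

Single-step queries (weather, product): both setups complete the task, with no obvious difference.

Multi-step, weather plus computing the difference: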

Fat  tools=['omnibus_lookup', 'omnibus_lookup']
     → The temperature difference is 3 degrees Celsius.

Fine tools=['get_weather', 'get_weather', 'calculator']
     → The difference is 3°C.  (3 separate calls)

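The fat tool was called twice, once per city, and cannot compute the difference itself. The fine-grained setup called the weather tool twice and calculator once, with a clear division of labor.

Multi-step, product price plus annual cost: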

Fat  tools=['omnibus_lookup']   (called only once!)
     → The monthly price is $299. The annual cost is $3588.

Fine tools=['get_product_info', 'calculator']   (two calls)
     → The monthly price is $299. The annual cost is $3588.

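This is the most interesting result: the fat tool got the right answer with a single call. The omnibus tool found the price, 299, and the LLM then did the 299×12 = 3588 arithmetic in its head while writing the answer, without ever triggering the math logic inside the tool.

So a fat tool is not always worse, but its execution path is opaque, which makes it hard to trace, test, and maintain.

When to go fine-grained, and when merging is acceptable: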

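Choose fine-grained when:
  - Different queries trigger different tools
  - Tool parameters have clear structured types (city: str, amount: float)
  - You need observability (per-tool timing and argument logging)

Merging is acceptable when:
  - The two operations always occur together and are never used alone
  - The merged parameters are still structured (not free text)
  - Example: get_weather_and_unit(city, unit: Literal["C","F"])

Never merge when the merged parameter degrades to free text (query: str). That pushes the burden of argument extraction onto text parsing inside the tool, which is more brittle than letting the LLM extract structured arguments.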
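
A sketch of the acceptable kind of merge, using the `get_weather_and_unit` signature named above. The body and the mock data are assumptions; the point is only that both arguments stay structured after merging.

```python
import json
from typing import Literal

# Hypothetical data, for illustration only.
MOCK_WEATHER = {"shanghai": {"temp_c": 22, "condition": "cloudy"}}


def get_weather_and_unit(city: str, unit: Literal["C", "F"] = "C") -> str:
    """Get current weather for a city in the requested temperature unit.

    Pass the city name as plain text, e.g. 'Shanghai', and unit as 'C' or 'F'.
    Returns a JSON string, or an "Error: ..." string on bad input.
    """
    data = MOCK_WEATHER.get(city.strip().lower())
    if data is None:
        return f"Error: city '{city}' not found. Available: {sorted(MOCK_WEATHER)}."
    if unit not in ("C", "F"):
        # Type hints are not enforced at run time, so validate explicitly.
        return f"Error: unit must be 'C' or 'F', got {unit!r}."
    temp = data["temp_c"] if unit == "C" else round(data["temp_c"] * 9 / 5 + 32, 1)
    return json.dumps({"temperature": temp, "unit": unit, "condition": data["condition"]})
```

Compare this with `omnibus_lookup(query: str)`: here the model fills two typed slots, and the tool never has to parse a sentence.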

Five Golden Rules of Tool Design

Principle         Wrong                           Right
──────────────────────────────────────────────────────────────────────
Description       "Get data."                     What + When + How + param format
Error handling    raise ValueError(...)           return "Error: city not found"
Granularity       omnibus(query: str)             get_weather(city: str)
Param naming      lookup(q: str)                  get_weather(city: str)
Return format     raw dict / None                 JSON string or error string

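The golden rule: design tools for the LLM, not for humans.

The LLM relies on three pieces of information to decide how to call a tool:

docstring: decides whether the tool gets picked
Parameter names and types: decide what values get passed
Return value: decides what happens next

Get these three right and the tool tends to be used correctly.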
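
These signals can be inspected directly from a function. The sketch below uses only the standard library (`inspect` and `typing.get_type_hints`) to collect them; `describe_tool` is a hypothetical helper, and real frameworks build their own, richer schemas. Note that it captures the declared return type only, while the string actually returned is what the model reads at run time.

```python
import inspect
from typing import get_type_hints


def get_weather(city: str) -> str:
    """Get current weather for a city. Pass the city name, e.g. 'Beijing'."""
    return '{"temp_c": 22, "condition": "cloudy"}'


def _type_name(tp) -> str:
    """Readable name for a type hint, falling back to its string form."""
    return getattr(tp, "__name__", str(tp))


def describe_tool(fn) -> dict:
    """Collect the description, parameter names/types, and declared return type."""
    hints = get_type_hints(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {
            name: _type_name(hints.get(name))
            for name in inspect.signature(fn).parameters
        },
        "returns": _type_name(hints.get("return")),
    }


print(describe_tool(get_weather))
```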

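Design Checklist

Docstring

Open with one sentence stating what the tool does (start with a verb)
State the return format (JSON / plain text / error string)
State when to use it ("when the user asks about...")
Give argument examples (e.g. 'Beijing', e.g. '299 * 12')

Error handling

Tools only return; they never raise
Error messages include guidance on what to do next ("city not found. Available: [...]")
Distinguish "data does not exist" from "malformed input" and give a different hint for each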
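
The last item, separating a missing record from a malformed argument, can be sketched as follows. The validation regex and the mock data are assumptions chosen for the example; a real tool would apply whatever format rules fit its own arguments.

```python
import re

# Hypothetical data, for illustration only.
MOCK_WEATHER = {"beijing": "sunny, 25°C", "shanghai": "cloudy, 22°C"}


def get_weather(city: str) -> str:
    """Get current weather for a city, e.g. 'Beijing'."""
    cleaned = city.strip()
    # Input-format problem: the argument is not a plausible city name at all.
    if not cleaned or not re.fullmatch(r"[A-Za-z][A-Za-z .'-]*", cleaned):
        return (f"Error: invalid city argument {city!r}. "
                "Pass only the city name as plain text, e.g. 'Beijing'.")
    # Data problem: a well-formed name, but nothing is stored for it.
    if cleaned.lower() not in MOCK_WEATHER:
        return (f"Error: city '{cleaned}' not found. "
                f"Available: {sorted(MOCK_WEATHER)}. "
                "Ask the user to confirm the city name.")
    return MOCK_WEATHER[cleaned.lower()]
```

The first message tells the model to fix how it calls the tool; the second tells it to go back to the user. Collapsing both into one generic error would hide which recovery step applies.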

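Granularity

Parameters are structured types (str with clear semantics, int, float), not free text
One tool does one thing: if the description needs "and" or "or", consider splitting it
Tools are mutually exclusive: different queries trigger different tools

Return format

Success: a JSON string (easy for the LLM to parse fields)
Failure: the format "Error: <reason>. <suggested action>"
Never return None or an empty string: the LLM does not know what to do with an empty value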
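
A small normalizer can apply the return-format rules in one place. This is a sketch under the assumption that tools may internally produce dicts, lists, `None`, or strings; `to_tool_result` is a hypothetical helper name.

```python
import json


def to_tool_result(value) -> str:
    """Normalize a raw tool result into a non-empty string the LLM can act on."""
    if value is None or value == "":
        return ("Error: the tool produced no data. "
                "Tell the user nothing was found, or retry with different arguments.")
    if isinstance(value, str):
        return value
    # Dicts, lists, and numbers become JSON; unknown objects fall back to str().
    return json.dumps(value, ensure_ascii=False, default=str)
```

A tool body can then end with `return to_tool_result(result)`, so that a raw dict or a `None` never reaches the model.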

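Summary

Five key takeaways:

Description quality matters when tools compete: with a single tool the LLM has no alternative; with several, the tool with the better docstring is markedly more likely to be picked
Tools return strings and never raise: raise crashes the Agent, while returning an error string gives the LLM a chance to recover
The LLM can self-heal inputs, but do not rely on it: "Shanghia" was auto-corrected to "Shanghai", but that is not a reliable line of defense
Fat tools are not always worse, but they are hard to trace: in testing, the omnibus tool finished some tasks in a single call, at the cost of an opaque execution path
Parameter types determine argument quality: city: str (clear semantics) > q: str (free text). The clearer the parameter, the more accurately the LLM extracts its value

Next up: advanced Agent context engineering, or how to precisely control the information passed to the LLM: system prompt optimization, few-shot example selection, and dynamic context injection.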
