SkillRL Complete Training Guide - Full SkillBank Dynamic Update Workflow
Table of Contents
- Training Workflow Overview
- Stage 1: Memory Data Generation (Initial Skill Bank Creation)
- Stage 2: SFT (Supervised Fine-Tuning)
- Stage 3: RL Training (Skill Retrieval and Dynamic Updates)
- Small-Scale Training Configuration (1.5B Model)
- Validation Methods
- Environment Configuration and API Setup
- Common Issues and Solutions
- Complete Example Commands
1. Training Workflow Overview
SkillRL training consists of three main stages, each with well-defined inputs, outputs, and steps:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  Stage 1: Memory Data Generation                            │
│  Base/teacher model → generate memory data → initial skill bank │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Stage 2: SFT (Supervised Fine-Tuning)                      │
│  Base model + SFT data → SFT fine-tuning → SFT model        │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Stage 3: RL Training (Skill Retrieval and Dynamic Updates) │
│  SFT model + skill bank → GRPO training → dynamic updates → final model │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Core Features
- Memory Data Generation: use a strong teacher model to generate the initial skill bank
- SFT: supervised fine-tuning via LLaMA-Factory or a custom trainer
- RL Training: GRPO with skill retrieval and dynamic skill-bank updates
- SkillBank: hierarchical skill bank (General + Task-Specific + Common Mistakes)
When the Teacher Model Is Used
The teacher model (e.g., Qwen3.6-Flash in qwen_8b.py) is used at three points:
1. Memory Data Generation (trajectory construction)
Role: generate task trajectories with reasoning
- Input: a task description (e.g., "Find me a blue running shoe under $50")
- Output: a complete execution trajectory with per-step action, observation, and reasoning
- Required: yes (when training from scratch)
Example:
Task: Find me a blue running shoe under $50
Teacher-generated trajectory:
Step 0: Action=search[blue running shoe], Reasoning=Encode core constraints
Step 1: Action=click[Nike Blue Running Shoes], Reasoning=Inspect product details
Step 2: Action=click[Size 10], Reasoning=Select correct size
Step 3: Action=click[Buy Now], Reasoning=Complete purchase within budget
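For reference, here is a minimal sketch of how such a trajectory could be requested from the teacher model over the OpenAI-compatible Qwen endpoint. The prompt wording and the expected JSON schema are illustrative assumptions, not the project's actual generation script:
python
import json
import os
from openai import OpenAI

# Teacher client (same endpoint as used throughout this guide)
client = OpenAI(
    api_key=os.environ.get("QWEN_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

task = "Find me a blue running shoe under $50"
prompt = (
    "You are an expert WebShop agent. Solve the task below and return ONLY a "
    "JSON list of steps, each with 'action' and 'reasoning' fields.\n"
    f"Task: {task}"
)
response = client.chat.completions.create(
    model="qwen3.6-flash",
    messages=[{"role": "user", "content": prompt}],
)
# A real pipeline should parse defensively; the model may wrap the JSON in prose.
trajectory = json.loads(response.choices[0].message.content)
for step in trajectory:
    print(step["action"], "|", step["reasoning"])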
2. Skill Generation (extracting skills from trajectories)
Role: analyze successful and failed trajectories and extract reusable behavior patterns
- Input: multiple memory records (trajectories plus reasoning)
- Output: a hierarchical skill bank (general_skills + task_specific_skills + common_mistakes)
- Required: yes (an initial skill bank is needed before RL training can start)
Example:
After analyzing 100 successful trajectories, the teacher model extracts:
General Skills: 15 (principles that apply across all tasks)
Task-Specific Skills: 30 (grouped by product category)
Common Mistakes: 12 (frequent errors and how to avoid them)
3. Dynamic Skill Update (continuous improvement during training)
Role: analyze failure cases from RL training and generate new skills to close capability gaps
- Input: failed validation trajectories (tasks the current model cannot solve)
- Output: new skills (skill_id prefixed with dyn_, e.g., dyn_001, dyn_002)
- Required: no (dynamic updates can be disabled in favor of a static skill bank)
- Trigger: whenever a task type's success rate drops below the threshold (default 0.4); see the sketch after this example
Example:
Step 0 validation:
- apparel success rate: 35% (below the 40% threshold)
- 12 failed trajectories collected
After analysis, the teacher model generates:
dyn_001: Verify Price Before Product Selection
dyn_002: Confirm Size Availability Early
dyn_003: Filter by Category First
The skill bank grows from 15 to 18 skills
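A minimal sketch of the trigger logic, assuming per-task-type success rates collected at validation (variable names here are illustrative; the real implementation is in verl/trainer/ppo/ray_trainer.py, Section 3.5):
python
# Illustrative trigger check: which task types fall below the update threshold?
UPDATE_THRESHOLD = 0.4  # default dynamic-update threshold

success_rates = {"apparel": 0.35, "footwear": 0.75, "electronics": 0.50}

low_success_tasks = [t for t, r in success_rates.items() if r < UPDATE_THRESHOLD]
if low_success_tasks:
    print(f"[SkillUpdate] Low success tasks: {low_success_tasks}, triggering skill update...")
else:
    print(f"[SkillUpdate] All task success rates above {UPDATE_THRESHOLD}")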
Summary: Teacher Model vs. Distillation
| Stage | Teacher model's role | Distillation involved? |
|---|---|---|
| Trajectory construction | Generate full task trajectories (action + reasoning) | No, direct generation |
| Skill extraction | Analyze trajectories, extract behavior patterns | No, direct extraction |
| Dynamic update | Analyze failure cases, generate new skills | No, direct generation |
| SFT training | Not involved | No, supervised learning |
| RL training | Called only during dynamic updates | Partially (skills come from the teacher, the policy from RL) |
Key points:
- The teacher model never participates in model parameter updates
- The teacher model is only used to generate training data (trajectories, skills)
- Model improvement comes from:
  - SFT: learning action patterns from trajectories
  - RL: optimizing the policy via reward signals
  - Skill injection: prior knowledge that accelerates learning
1.1 Teacher Model Configuration and Usage
Supported Teacher Model Types
SkillRL supports two teacher model configurations:
Option A: Qwen API (recommended, best cost/performance)
Advantages:
- Low cost (more than 10x cheaper than Azure OpenAI)
- Fast response times
- Generous quota
Setup:
- Set the environment variable
bash
export QWEN_API_KEY="your_qwen_api_key"
- Modify the OpenAIClient class in skill_generation/webshop.py (lines 46-54)
python
from openai import OpenAI  # switch to the standard OpenAI client
class OpenAIClient:
    def __init__(self, max_new_tokens: int = 4096, model: str = "qwen3.6-flash"):
        self.max_new_tokens = max_new_tokens
        self.model = model
        self.client = OpenAI(  # use the OpenAI client
            api_key=os.environ.get("QWEN_API_KEY"),  # read from the environment
            base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        )
- Modify the SkillUpdater class in agent_system/memory/skill_updater.py (lines 34-38)
python
from openai import OpenAI  # switch to the standard OpenAI client
# Replace AzureOpenAI in the __init__ method
self.client = OpenAI(
    api_key=os.environ.get("QWEN_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
self.model = "qwen3.6-flash"  # or "qwen2.5-7b-instruct"
Option B: Azure OpenAI o3 (default configuration)
Environment variables:
bash
export AZURE_OPENAI_API_KEY="your_azure_api_key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2025-01-01-preview"
Notes:
- Requires an Azure account and API key
- Higher cost, limited quota
- Suitable for small experiments (<100 calls)
Testing the Teacher Model Connection
bash
# Test the Qwen API
python -c "
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ.get('QWEN_API_KEY'),
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)
response = client.chat.completions.create(
    model='qwen3.6-flash',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
print(response.choices[0].message.content)
"
# Test Azure OpenAI
python -c "
import os
from openai import AzureOpenAI
client = AzureOpenAI(
    api_key=os.environ.get('AZURE_OPENAI_API_KEY'),
    azure_endpoint=os.environ.get('AZURE_OPENAI_ENDPOINT'),
    api_version=os.environ.get('AZURE_OPENAI_API_VERSION'),
)
response = client.chat.completions.create(
    model='o3',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
print(response.choices[0].message.content)
"
1.2 Purpose
Use a teacher model (e.g., GPT-4, Claude-3, or Qwen3.6-Flash) to analyze task trajectories and build a hierarchical skill bank.
1.3 Required Data
Memory Data Format
Memory data contains successful and failed task trajectories, in the following format:
json
{
  "memory_id": "mem_webshop_19499503",
  "contextual_description": "WebShop task to purchase a Men's Apparel item with Color, Size, Material, Fit, Sleeve Style, and Price constraints. Solved by searching with detailed terms, selecting size and color options, and buying.",
  "tags": {
    "environment": "Webshop",
    "outcome": "Success"  // must be "Success" or "Failure"
  },
  "content": {
    "task_meta": {
      "original_goal": "Find me machine wash men's dress shirts with cotton spandex, classic fit, short sleeve with color: melon berry, and size: large, and price lower than 50.00 dollars."
    },
    "refined_trajectory": {
      "refined_trajectory": [
        {
          "step_index": 0,
          "action": "search[men's dress shirts cotton spandex classic fit short sleeves [Color_Constraint] [Size_Constraint] [Price_Constraint] or less]",
          "critical_observation": "Search results page shows multiple apparel items including at least one men's short-sleeve shirt candidate within desired price constraint.",
          "reasoning": "Formulate a search query that encodes all known attribute constraints to surface candidate apparel items that may satisfy the goal."
        },
        {
          "step_index": 1,
          "action": "click[apparel_item]",
          "critical_observation": "Product detail page for a men's short-sleeve shirt is opened, exposing selectable size and color options and a price within the required constraint.",
          "reasoning": "Open a promising apparel item from the search results to inspect and configure its attributes against the goal constraints."
        }
      ]
    },
    "strategic_guidelines": {
      "planning_pattern": "search -> click_product -> set_size -> set_color -> purchase",
      "mistakes_to_avoid": [
        "Don't buy without checking price",
        "Don't skip size selection"
      ]
    }
  }
}
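Before running skill generation, it helps to sanity-check that every record follows this schema. A minimal validator sketch (field names taken from the format above; the checks themselves are assumptions about what "well-formed" means):
python
import json

VALID_OUTCOMES = {"Success", "Failure"}

def validate_memory(mem: dict) -> list:
    """Return a list of schema problems for one memory record."""
    problems = []
    if mem.get("tags", {}).get("outcome") not in VALID_OUTCOMES:
        problems.append("tags.outcome must be 'Success' or 'Failure'")
    if not mem.get("content", {}).get("task_meta", {}).get("original_goal"):
        problems.append("missing content.task_meta.original_goal")
    traj = mem.get("content", {}).get("refined_trajectory", {}).get("refined_trajectory", [])
    if not traj:
        problems.append("empty refined_trajectory")
    return problems

with open("memory_data/webshop/generated_memories_webshop_100.json") as f:
    memories = json.load(f)
for mem in memories:
    for problem in validate_memory(mem):
        print(f"{mem.get('memory_id', '?')}: {problem}")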
Small-Scale Data Preparation Guidelines
Goal: train a 1.5B model for fast, small-scale validation
Suggested data volumes:
- Memory Data: 20-50 trajectories (15-20 successes + 5-10 failures)
- SFT Data: 100-200 samples (for quick validation)
- RL training data: 16-32 training samples, 16-32 validation samples
Notes:
- Success-rate requirement: the memory data success rate should be between 60-80%
  - Too low (<50%): poor trajectory quality, unreliable extracted skills
  - Too high (>90%): not enough failures to extract common mistakes
- Task diversity: cover different product types and constraint combinations
python
# Check product-type diversity across memories
from collections import Counter
product_types = [get_product_type(m['content']['task_meta']['original_goal']) for m in memories]
print("Product type distribution:", Counter(product_types))
# Example expected output:
# apparel: 12, footwear: 8, electronics: 5, home_decor: 3
- Constraint coverage: cover different constraint types
  - Price constraints (under 50, between 20-100)
  - Color constraints (blue, red, black)
  - Size constraints (S, M, L, XL)
  - Material constraints (cotton, leather, synthetic)
Memory Data File Locations
- ALFWorld: memory_data/alfworld/generated_memories_alfworld_total.json
- WebShop: memory_data/webshop/generated_memories_webshop_100.json, generated_memories_webshop_101-200.json
- Search: memory_data/search/generated_memories_search.json
1.4 Execution Steps
Step 1: Configure the Teacher Model API
Option A: Qwen (recommended, best cost/performance)
Based on the configuration in qwen_8b.py (note: qwen_8b.py hard-codes an API key; do not use it as-is, read the key from an environment variable instead):
python
import os
from openai import OpenAI
# Initialize the Qwen client (set the API key via an environment variable)
client = OpenAI(
    api_key=os.environ.get("QWEN_API_KEY"),  # read from the environment, never hard-code
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
# Issue a request
completion = client.chat.completions.create(
    model="qwen3.6-flash",  # or "qwen2.5-7b-instruct"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "your task..."}
    ],
    stream=False,
)
Option B: Azure OpenAI
bash
export AZURE_OPENAI_API_KEY="your_api_key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2025-01-01-preview"
Step 2: Generate the Skill Bank
ALFWorld skill bank:
bash
python skill_generation/alfworld.py \
  --memory_path memory_data/alfworld/generated_memories_alfworld_total.json \
  --output_path memory_data/alfworld/claude_style_skills.json
WebShop skill bank:
bash
python skill_generation/webshop.py \
  --memory_path memory_data/webshop/generated_memories_webshop_100.json \
  --output_path memory_data/webshop/claude_style_skills.json
Search skill bank:
bash
python skill_generation/search.py \
  --memory_path memory_data/search/generated_memories_search.json \
  --output_path memory_data/search/claude_style_skills.json
Step 3: Modify the skill_generation Scripts to Use Qwen
To use Qwen as the teacher model, modify skill_generation/webshop.py:
python
# Original code (Azure OpenAI)
# client = AzureOpenAI(api_key="", azure_endpoint="", api_version="")
# Replace with Qwen
import os
from openai import OpenAI
class QwenClient:
    def __init__(self):
        self.client = OpenAI(
            api_key=os.environ.get("QWEN_API_KEY"),
            base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        )
        self.model = "qwen3.6-flash"  # or another Qwen model
    def generate_response(self, messages: list) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=4096,
        )
        return response.choices[0].message.content
Step 4: Verify the Generated Skill Bank
bash
# Count the skills
python -c "
import json
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)
print(f'General Skills: {len(skills[\"general_skills\"])}')
print(f'Task-Specific Skills: {sum(len(v) for v in skills[\"task_specific_skills\"].values())}')
print(f'Common Mistakes: {len(skills[\"common_mistakes\"])}')
"
Expected output:
General Skills: 15
Task-Specific Skills: 30
Common Mistakes: 12
1.5 Skill Bank Format in Detail
json
{
  "general_skills": [
    {
      "skill_id": "gen_001",
      "title": "Prioritize Core Keywords",
      "principle": "Include product type, 1-2 key functional attributes, and any hard constraints (price, size, color) in the search query.",
      "when_to_apply": "Before issuing first search or when refining an over-specific query."
    }
  ],
  "task_specific_skills": {
    "apparel": [...],
    "footwear": [...],
    "electronics": [...],
    "home_decor": [...],
    "accessories": [...],
    "beauty_health": [...],
    "other": [...]
  },
  "common_mistakes": [
    {
      "mistake_id": "err_001",
      "description": "Repeating the same action after it fails.",
      "why_it_happens": "Agent does not track action history.",
      "how_to_avoid": "Check the admissible actions list and try an alternative."
    }
  ]
}
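A quick loader sketch that asserts the three top-level layers exist and prints per-category counts (the expected keys come from the format above):
python
import json

def summarize_skill_bank(path: str) -> None:
    """Load a skill bank, check the expected top-level keys, and print counts."""
    with open(path) as f:
        skills = json.load(f)
    for key in ("general_skills", "task_specific_skills", "common_mistakes"):
        assert key in skills, f"missing top-level key: {key}"
    print(f"general: {len(skills['general_skills'])}")
    for category, lst in skills["task_specific_skills"].items():
        print(f"task-specific/{category}: {len(lst)}")
    print(f"mistakes: {len(skills['common_mistakes'])}")

summarize_skill_bank("memory_data/webshop/claude_style_skills.json")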
1.6 Validation: Inspecting the Generated Skills
Method 1: Manual Skill Quality Review
python
import json
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)
print("="*60)
print("General skill quality check")
print("="*60)
for i, skill in enumerate(skills['general_skills'][:5], 1):
    print(f"\n[{skill.get('skill_id', f'gen_{i:03d}')}] {skill.get('title', 'N/A')}")
    print(f"  Principle: {skill.get('principle', 'N/A')[:100]}")
    print(f"  When to apply: {skill.get('when_to_apply', 'N/A')[:100]}")
    # Evaluation criteria (word-count threshold matches the quality check in Section 5.2)
    principle = skill.get('principle', '')
    if len(principle.split()) <= 15:
        print("  ✓ Concise (≤ 15 words)")
    else:
        print("  ✗ Too long (> 15 words)")
    if 'search' in principle.lower() or 'click' in principle.lower():
        print("  ✓ Actionable (contains an explicit action)")
    else:
        print("  ⚠ Abstract (no explicit action)")
print("\n" + "="*60)
print("Task-specific skill distribution")
print("="*60)
for category, skill_list in skills['task_specific_skills'].items():
    print(f"{category:15s}: {len(skill_list):2d} skills")
print("\n" + "="*60)
print("Common mistake examples")
print("="*60)
for mistake in skills['common_mistakes'][:3]:
    print(f"\n[{mistake.get('mistake_id', 'N/A')}]")
    print(f"  Description: {mistake.get('description', 'N/A')}")
    print(f"  Why: {mistake.get('why_it_happens', 'N/A')}")
    print(f"  How to avoid: {mistake.get('how_to_avoid', 'N/A')}")
Method 2: Testing Skill Retrieval
python
from agent_system.memory.skills_only_memory import SkillsOnlyMemory
# Load the skill bank
memory = SkillsOnlyMemory(
    'memory_data/webshop/claude_style_skills.json',
    retrieval_mode='template'
)
# Try different task types
test_tasks = [
    "Find a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
    "Buy a red dress for summer",
]
print("="*60)
print("Skill retrieval test")
print("="*60)
for task in test_tasks:
    print(f"\nTask: {task}")
    # Retrieve skills
    retrieved = memory.retrieve(task, top_k=6)
    print(f"  Detected type: {retrieved['task_type']}")
    print(f"  General skills: {len(retrieved['general_skills'])}")
    print(f"  Task-specific skills: {len(retrieved['task_specific_skills'])}")
    # Check whether the retrieved skills look relevant
    keywords_to_match = ['price', 'size', 'color', 'constraint']
    relevant_count = 0
    for skill in retrieved['general_skills'][:3]:
        skill_text = skill.get('principle', '').lower()
        if any(kw in skill_text for kw in keywords_to_match):
            relevant_count += 1
    print(f"  Relevance: {relevant_count}/3 skills match the keywords")
Method 3: Checking Skill Coverage
python
import json
from collections import Counter
with open('memory_data/webshop/generated_memories_webshop_100.json', 'r') as f:
    memories = json.load(f)
# Extract all task goals
tasks = [m['content']['task_meta']['original_goal'] for m in memories]
# Count keyword occurrences
keywords = [
    'price', 'cost', 'dollar', 'cheap', 'expensive',
    'size', 'small', 'medium', 'large', 'xl',
    'color', 'red', 'blue', 'black', 'white',
    'search', 'click', 'buy', 'purchase'
]
keyword_coverage = Counter()
for task in tasks:
    task_lower = task.lower()
    for kw in keywords:
        if kw in task_lower:
            keyword_coverage[kw] += 1
print("="*60)
print("Task keyword coverage")
print("="*60)
# Group by category
price_kws = [k for k in ['price', 'cost', 'dollar', 'cheap', 'expensive'] if k in keyword_coverage]
size_kws = [k for k in ['size', 'small', 'medium', 'large', 'xl'] if k in keyword_coverage]
color_kws = [k for k in ['color', 'red', 'blue', 'black', 'white'] if k in keyword_coverage]
action_kws = [k for k in ['search', 'click', 'buy', 'purchase'] if k in keyword_coverage]
print(f"\nPrice keywords: {', '.join(price_kws)} - {sum(keyword_coverage[k] for k in price_kws)} occurrences")
print(f"Size keywords: {', '.join(size_kws)} - {sum(keyword_coverage[k] for k in size_kws)} occurrences")
print(f"Color keywords: {', '.join(color_kws)} - {sum(keyword_coverage[k] for k in color_kws)} occurrences")
print(f"Action keywords: {', '.join(action_kws)} - {sum(keyword_coverage[k] for k in action_kws)} occurrences")
# Compare against the skill bank
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)
skill_keywords = []
for skill in skills['general_skills'] + skills.get('common_mistakes', []):
    skill_text = (skill.get('principle', '') + ' ' + skill.get('description', '') + ' ' + skill.get('how_to_avoid', '')).lower()
    skill_keywords.extend([kw for kw in keywords if kw in skill_text])
skill_kw_coverage = Counter(skill_keywords)
print(f"\nSkill bank keyword coverage:")
print(f"  General + mistake skills: {len(skills['general_skills']) + len(skills['common_mistakes'])}")
print(f"  Distinct keyword types: {len(set(skill_keywords))}")
Method 4: Visualizing the Skill Distribution
python
import json
import matplotlib.pyplot as plt
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)
# Word-count distribution of skill principles
general_lengths = [len(s['principle'].split()) for s in skills['general_skills']]
task_lengths = {cat: [len(s['principle'].split()) for s in skills_list]
                for cat, skills_list in skills['task_specific_skills'].items()}
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# General skills
axes[0].hist(general_lengths, bins=range(0, 25, 5), edgecolor='black')
axes[0].set_title('General skill length distribution', fontsize=14)
axes[0].set_xlabel('Words per skill principle', fontsize=12)
axes[0].set_ylabel('Number of skills', fontsize=12)
# Task-specific skills
data_for_box = [task_lengths.get(cat, []) for cat in skills['task_specific_skills'].keys()]
axes[1].boxplot(data_for_box, labels=list(skills['task_specific_skills'].keys()))
axes[1].set_title('Task-specific skill length distribution', fontsize=14)
axes[1].set_ylabel('Words per skill principle', fontsize=12)
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('skill_distribution.png', dpi=150)
print("Skill distribution plot saved to skill_distribution.png")
Stage 2: SFT (Supervised Fine-Tuning)
2.1 Purpose
Use supervised fine-tuning to teach the model basic task competence and instruction following.
2.2 Small-Scale Data Preparation
SFT Data Format
python
# Example: build a small SFT dataset
import pandas as pd
data = [
    {
        "prompt": [{
            "role": "user",
            "content": "Find me a blue running shoe under $50"
        }],
        "response": "I'll help you find a blue running shoe under $50. Let me search for available options...",
    },
    # more samples...
]
# Small scale: 100-200 samples
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
# Validation set: 20-50 samples
val_data = [...]  # same format
val_df = pd.DataFrame(val_data)
val_df.to_parquet("test.parquet")
Data Quality Checks
python
# Check SFT data quality
def check_sft_quality(data):
    print("="*60)
    print("SFT data quality check")
    print("="*60)
    total = len(data)
    print(f"\nTotal samples: {total}")
    # Prompt length distribution
    prompt_lengths = [len(str(item['prompt'][0]['content'])) for item in data]
    print(f"Prompt length: min={min(prompt_lengths)}, max={max(prompt_lengths)}, mean={sum(prompt_lengths)/len(prompt_lengths):.1f}")
    # Response length distribution
    response_lengths = [len(str(item['response'])) for item in data]
    print(f"Response length: min={min(response_lengths)}, max={max(response_lengths)}, mean={sum(response_lengths)/len(response_lengths):.1f}")
    # Keyword coverage
    response_texts = ' '.join([str(item['response']) for item in data]).lower()
    keywords = ['search', 'click', 'buy', 'price', 'size', 'color']
    for kw in keywords:
        count = response_texts.count(kw)
        print(f"Keyword '{kw}': {count} occurrences")
    # Anomaly detection
    empty_responses = sum(1 for item in data if not item['response'].strip())
    if empty_responses > 0:
        print(f"\n⚠ Warning: {empty_responses} empty responses")
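Usage sketch, assuming the parquet files written above (pandas round-trips the nested prompt lists well enough for these checks):
python
import pandas as pd

# Load the SFT training set and run the checks above
df = pd.read_parquet("train.parquet")
check_sft_quality(df.to_dict(orient="records"))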
2.3 Training Configuration
Using LLaMA-Factory (recommended)
Note: LLaMA-Factory is an external tool and is not part of this code base. The project README mentions using LLaMA-Factory for SFT, but the in-repo SFT examples all use verl's built-in fsdp_sft_trainer. The following is reference usage; adjust parameters against the latest LLaMA-Factory documentation.
bash
# Install LLaMA-Factory
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch]"
# Configure training (1.5B model) -- YAML config file approach (recommended)
# 1. Register the dataset (add to data/dataset_info.json):
#    "webshop_sft": {
#      "file_name": "your_sft_data.json",
#      "columns": {"prompt": "prompt", "response": "response"}
#    }
# 2. Create a training config (examples/train_lora/qwen_sft.yaml)
# 3. Launch training:
llamafactory-cli train examples/train_lora/qwen_sft.yaml
# Or pass arguments on the command line (flags may change between versions; check the official docs):
llamafactory-cli train \
  --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
  --stage sft \
  --dataset webshop_sft \
  --template qwen \
  --finetuning_type lora \
  --lora_rank 64 \
  --lora_alpha 64 \
  --output_dir ./checkpoint/webshop_sft \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 5e-5 \
  --max_steps 1000 \
  --save_steps 100
Using verl's Built-In SFT Trainer
Create a training script at examples/sft/webshop/run_qwen_15b_small.sh:
bash
#!/bin/bash
set -x
nproc_per_node=1
save_path=./checkpoint/webshop_sft_qwen15b_small
torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
  -m verl.trainer.fsdp_sft_trainer \
  data.train_files=$HOME/data/verl-agent/text/train.parquet \
  data.val_files=$HOME/data/verl-agent/text/test.parquet \
  data.prompt_key=prompt \
  data.response_key=response \
  optim.lr=5e-5 \
  data.train_batch_size=8 \
  data.micro_batch_size_per_gpu=2 \
  data.max_length=2048 \
  model.partial_pretrain=Qwen/Qwen2.5-1.5B-Instruct \
  trainer.default_local_dir=$save_path \
  trainer.project_name=webshop-sft-small \
  trainer.experiment_name=qwen-1.5b-webshop-sft-small \
  trainer.total_epochs=2 \
  trainer.logger='[console]' \
  model.enable_gradient_checkpointing=True \
  model.lora_rank=64 \
  model.lora_alpha=64 \
  model.target_modules=all-linear \
  use_remove_padding=true
Parameter note: verl's SFT trainer uses data.max_length to cap the total sequence length (not max_prompt_length/max_response_length), and data.micro_batch_size_per_gpu also controls the validation batch size (there is no separate val_batch_size parameter). See verl/trainer/config/sft_trainer.yaml.
Run training:
bash
bash examples/sft/webshop/run_qwen_15b_small.sh
2.4 Key Parameters for Small-Scale Training
| Parameter | Recommended value (small scale) | Description |
|---|---|---|
| model.partial_pretrain | Qwen/Qwen2.5-1.5B-Instruct | Base model path |
| optim.lr | 5e-5 | Learning rate (a 1.5B model tolerates slightly higher) |
| data.train_batch_size | 8 | Training batch size (can be larger for small models) |
| data.micro_batch_size_per_gpu | 2 | Micro batch size per GPU (also used for validation) |
| data.max_length | 2048 | Max sequence length (prompt + response combined) |
| model.lora_rank | 64-128 | LoRA rank (64-128 balances quality and memory) |
| trainer.total_epochs | 2-3 | Epochs (avoid too many on small data) |
2.5 Validating the SFT Result
Method 1: Basic Inference Test
python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')
tokenizer = AutoTokenizer.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')
# Inference test
test_prompts = [
    "Find me a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
]
print("="*60)
print("SFT model inference test")
print("="*60)
for prompt in test_prompts:
    print(f"\nInput: {prompt}")
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Output: {response[:200]}..." if len(response) > 200 else f"Output: {response}")
    # Check keywords
    response_lower = response.lower()
    action_keywords = ['search', 'click', 'buy']
    found_actions = [kw for kw in action_keywords if kw in response_lower]
    print(f"Action keywords found: {found_actions}")
Method 2: Evaluation in the WebShop Environment
python
import re
from agent_system.environments.env_package.webshop import WebshopEnv
# Load the SFT model
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')
model = AutoModelForCausalLM.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')

# Simplified action parsers (illustrative; a real parser should be more robust)
def extract_search_action(text):
    m = re.search(r'search\[[^\]]*\]', text)
    return m.group(0) if m else 'think'

def extract_click_action(text):
    m = re.search(r'click\[[^\]]*\]', text)
    return m.group(0) if m else 'think'

# Create the environment
env = WebshopEnv(use_small=True)  # use the small dataset
print("="*60)
print("SFT model environment evaluation")
print("="*60)
test_goals = [
    "Find a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
]
success_count = 0
total_reward = 0
for i, goal in enumerate(test_goals, 1):
    print(f"\n{'='*40}")
    print(f"Test {i}/{len(test_goals)}: {goal}")
    print('='*40)
    obs = env.reset(goal)
    done = False
    steps = 0
    task_reward = 0
    trajectory = []
    while not done and steps < 15:
        # Generate an action
        inputs = tokenizer(obs, return_tensors='pt')
        outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.4)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Parse the action (simplified)
        if 'search[' in response:
            action = extract_search_action(response)
        elif 'click[' in response:
            action = extract_click_action(response)
        elif 'buy' in response:
            action = 'buy now'
        else:
            action = 'think'  # fallback action
        # Execute it
        obs, reward, done, info = env.step(action)
        trajectory.append({
            'step': steps,
            'action': action,
            'observation': obs[:100] if len(obs) > 100 else obs,
            'reward': reward
        })
        steps += 1
        task_reward += reward
    total_reward += task_reward
    if info.get('success', False):
        success_count += 1
        print(f"✓ Success! Task reward: {task_reward:.2f}")
    else:
        print(f"✗ Failed. Reason: {info.get('fail_reason', 'unknown')}")
print(f"\n{'='*60}")
print(f"Result: {success_count}/{len(test_goals)} succeeded")
print(f"Average reward: {total_reward/len(test_goals):.2f}")
print('='*60)
Stage 3: RL Training (Skill Retrieval and Dynamic Updates)
3.1 Training Flow
┌─────────────────────────────────────────────────────────────┐
│ RL training main loop                                       │
│                                                             │
│ ┌──────────────────────────────────────────────────┐        │
│ │ 3.2 Data preprocessing                           │        │
│ │ Prepare train/val data (parquet format)          │        │
│ └──────────────────────────────────────────────────┘        │
│                      ↓                                      │
│ ┌──────────────────────────────────────────────────┐        │
│ │ 3.3 Skill retrieval and injection                │        │
│ │ Retrieve skills by task type, inject into prompt │        │
│ └──────────────────────────────────────────────────┘        │
│                      ↓                                      │
│ ┌──────────────────────────────────────────────────┐        │
│ │ 3.4 GRPO training loop                           │        │
│ │ Rollout → compute rewards → policy update        │        │
│ └──────────────────────────────────────────────────┘        │
│                      ↓                                      │
│ ┌──────────────────────────────────────────────────┐        │
│ │ 3.5 Validation and dynamic updates               │        │
│ │ Validate → analyze failures → new skills → update bank │  │
│ └──────────────────────────────────────────────────┘        │
│                      ↓                                      │
│ ┌──────────────────────────────────────────────────┐        │
│ │ 3.6 Save model and skill bank                    │        │
│ │ Save checkpoints and the updated skill bank      │        │
│ └──────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────┘
3.2 Data Preprocessing
Preparing Training Data
Use examples/data_preprocess/prepare.py:
bash
# Small-scale data
train_data_size=16
val_data_size=16
# Prepare the data
python3 -m examples.data_preprocess.prepare \
  --mode 'text' \
  --train_data_size $train_data_size \
  --val_data_size $val_data_size
This produces:
- $HOME/data/verl-agent/text/train.parquet (training set)
- $HOME/data/verl-agent/text/test.parquet (validation set)
Data format:
python
{
    "prompt": [{"role": "user", "content": "Find me a blue running shoe under $50"}],
    "ability": "agent",
    "extra_info": {"split": "train", "index": 0}
}
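A quick way to confirm the generated files match this format is to load one with pandas (a minimal sketch; the column names follow the format above):
python
import os
import pandas as pd

# Inspect the generated training set
df = pd.read_parquet(os.path.expanduser("~/data/verl-agent/text/train.parquet"))
print(df.columns.tolist())
print(df.iloc[0]["prompt"])      # chat-format prompt
print(df.iloc[0]["extra_info"])  # split/index metadata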
3.3 Skill Retrieval and Injection
Retrieval Modes
Note: the actual examples/grpo_trainer/run_webshop_skills.sh script does not set retrieval_mode explicitly, so the code default template is used. To use embedding mode, add the extra config parameters shown below.
Mode 1: Template (default, zero latency)
Characteristics:
- Matches the task type via keywords
- Returns all task-specific skills for that type plus the top_k general skills
- No GPU required
Configuration:
bash
+env.skills_only_memory.retrieval_mode=template
+env.skills_only_memory.top_k=6
Retrieval logic (skills_only_memory.py, lines 115-177):
python
def _detect_task_type(self, task_description: str) -> str:
    goal = task_description.lower()
    # WebShop category detection
    if any(kw in goal for kw in ['shirt', 'dress', 'jacket', 'pant']):
        return 'apparel'
    elif any(kw in goal for kw in ['shoe', 'boot', 'sneaker']):
        return 'footwear'
    # ...more categories
Mode 2: Embedding (precise matching, low latency)
Characteristics:
- Encodes the task description with Qwen3-Embedding-0.6B
- Computes semantic similarity against all skills
- Returns the top-K most relevant skills
- Requires a GPU for embedding computation
Configuration:
bash
+env.skills_only_memory.retrieval_mode=embedding
+env.skills_only_memory.embedding_model_path=Qwen/Qwen3-Embedding-0.6B
+env.skills_only_memory.top_k=6
+env.skills_only_memory.task_specific_top_k=5
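For intuition, here is an illustrative embedding-retrieval sketch (not the project's implementation): rank skills by cosine similarity between the task text and each skill principle. It assumes sentence-transformers can load the embedding model:
python
import json
from sentence_transformers import SentenceTransformer, util

with open("memory_data/webshop/claude_style_skills.json") as f:
    skills = json.load(f)

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

task = "Find a blue running shoe under $50"
skill_texts = [s["principle"] for s in skills["general_skills"]]

# Encode the task and all skill principles, then rank by cosine similarity
task_emb = model.encode(task, convert_to_tensor=True)
skill_embs = model.encode(skill_texts, convert_to_tensor=True)
scores = util.cos_sim(task_emb, skill_embs)[0]

top_k = 6
for idx in scores.argsort(descending=True)[:top_k]:
    print(f"{scores[idx]:.3f}  {skill_texts[int(idx)]}")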
3.4 Small-Scale RL Training Configuration
Launching Training
bash
# Set the model path
export MODEL_PATH=./checkpoint/webshop_sft/global_step_XXX
# Run RL training
bash examples/grpo_trainer/run_webshop_skills.sh
Example Configuration for Small-Scale Training
Note: the following is a recommended 1-GPU small-scale configuration, annotated with section comments for readability; strip the comment lines and bare "\" separators before running, since bash does not allow comments inside a line-continued command. The actual examples/grpo_trainer/run_webshop_skills.sh is a full 8-GPU configuration (n_gpus_per_node=8, tensor_model_parallel_size=4, total_epochs=150); adjust for your hardware.
bash
python3 -m verl.trainer.main_ppo \
  # ============ Algorithm ============
  algorithm.adv_estimator=grpo \
  algorithm.use_kl_in_reward=False \
  \
  # ============ Data ============
  data.train_files=$HOME/data/verl-agent/text/train.parquet \
  data.val_files=$HOME/data/verl-agent/text/test.parquet \
  data.train_batch_size=16 \
  data.val_batch_size=16 \
  data.max_prompt_length=4096 \
  data.max_response_length=512 \
  data.filter_overlong_prompts=True \
  data.truncation='left' \
  data.return_raw_chat=True \
  \
  # ============ Model ============
  actor_rollout_ref.model.path=$MODEL_PATH \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.actor.use_kl_loss=True \
  actor_rollout_ref.actor.kl_loss_coef=0.01 \
  actor_rollout_ref.actor.kl_loss_type=low_var_kl \
  actor_rollout_ref.model.enable_gradient_checkpointing=True \
  actor_rollout_ref.model.use_remove_padding=True \
  \
  # ============ Distributed training (1.5B model) ============
  actor_rollout_ref.actor.fsdp_config.param_offload=True \
  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
  actor_rollout_ref.actor.ppo_mini_batch_size=16 \
  actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
  \
  # ============ Rollout ============
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
  actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
  actor_rollout_ref.rollout.max_num_batched_tokens=4096 \
  actor_rollout_ref.rollout.max_num_seqs=64 \
  actor_rollout_ref.rollout.val_kwargs.temperature=0.4 \
  actor_rollout_ref.rollout.val_kwargs.do_sample=True \
  \
  # ============ Environment ============
  env.env_name=Webshop \
  env.seed=0 \
  env.max_steps=15 \
  env.rollout.n=4 \
  env.resources_per_worker.num_cpus=0.1 \
  \
  # ============ Skill bank ============
  +env.use_skills_only_memory=True \
  +env.skills_only_memory.skills_json_path=$HOME/verl-agent/memory_data/webshop/claude_style_skills.json \
  +env.skills_only_memory.retrieval_mode=template \
  +env.skills_only_memory.top_k=4 \
  \
  # ============ Dynamic updates ============
  +env.skills_only_memory.enable_dynamic_update=True \
  +env.skills_only_memory.update_threshold=0.5 \
  +env.skills_only_memory.max_new_skills=2 \
  \
  # ============ Trainer ============
  trainer.n_gpus_per_node=1 \
  trainer.nnodes=1 \
  trainer.project_name='verl_agent_webshop' \
  trainer.experiment_name='grpo_qwen1.5b_small_dynamic' \
  trainer.total_epochs=50 \
  trainer.save_freq=10 \
  trainer.test_freq=5 \
  trainer.val_before_train=True \
  trainer.logger='[console]' \
  trainer.default_local_dir=./checkpoint/webshop_rl_qwen15b_small
Key Parameters for Small-Scale Training
| Parameter | Recommended value (small scale) | Description |
|---|---|---|
| Algorithm | | |
| algorithm.adv_estimator | grpo | Use the GRPO algorithm |
| env.rollout.n | 4 | GRPO group size (4-8 recommended for small models) |
| Learning rate | | |
| actor_rollout_ref.actor.optim.lr | 1e-6 | Policy learning rate |
| actor_rollout_ref.actor.kl_loss_coef | 0.01 | KL-divergence loss coefficient |
| Data | | |
| data.train_batch_size | 16 | Number of training samples |
| data.val_batch_size | 16 | Number of validation samples |
| data.max_prompt_length | 4096 | Max prompt length (leave room for injected skills) |
| data.max_response_length | 512 | Max response length |
| Skill bank | | |
| skills_json_path | skill bank path | Required |
| retrieval_mode | template | Retrieval mode (template recommended for small models) |
| top_k | 4 | Number of general skills (reduce to 4 for small data) |
| enable_dynamic_update | True | Enable dynamic updates |
| update_threshold | 0.5 | Update threshold (can be raised for small data) |
| max_new_skills | 2 | Max new skills per update |
3.5 Validation and Dynamic Updates
Dynamic Update Mechanism
Dynamic updates are triggered during validation (every test_freq epochs):
python
# Location: verl/trainer/ppo/ray_trainer.py (around lines 837-918)
def _update_skills_from_validation(
    self,
    sample_inputs,    # validation inputs
    sample_outputs,   # validation outputs
    sample_scores,    # validation scores
    success_rate,     # success rate per task type
):
    """
    Dynamically update the skill bank based on validation results.
    """
    # Step 1: decide whether an update is needed
    threshold = self.config.env.skills_only_memory.update_threshold
    needs_update = False
    low_success_tasks = []
    for task_key, rate in success_rate.items():
        if rate < threshold:
            needs_update = True
            task_type = task_key.replace('_success_rate', '')
            low_success_tasks.append(task_type)
    if not needs_update:
        print(f"[SkillUpdate] All task success rates above {threshold}")
        return
    # Step 2: collect failed trajectories
    failed_trajectories = self._collect_failed_trajectories(
        sample_inputs, sample_outputs, sample_scores
    )
    # Step 3: initialize the SkillUpdater (uses the teacher model)
    from agent_system.memory.skill_updater import SkillUpdater
    skill_updater = SkillUpdater(
        max_new_skills_per_update=self.config.env.skills_only_memory.max_new_skills
    )
    # Step 4: analyze failures and generate new skills
    new_skills = skill_updater.analyze_failures(
        failed_trajectories=failed_trajectories,
        current_skills=self.envs.retrieval_memory.skills,
    )
    # Step 5: add the new skills to the training environment
    if new_skills:
        self.envs.retrieval_memory.add_skills(new_skills, category='general')
    # Step 6: save the updated skill bank
    save_path = os.path.join(
        self.config.trainer.default_local_dir,
        f'updated_skills_step{self.global_steps}.json'
    )
    self.envs.retrieval_memory.save_skills(save_path)
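For intuition, a sketch of what the failure-analysis call inside SkillUpdater might look like. The prompt wording and truncation limits are illustrative assumptions; the real prompt lives in agent_system/memory/skill_updater.py:
python
import json
import os
from openai import OpenAI

def analyze_failures_sketch(failed_trajectories, current_skills, max_new_skills=2):
    """Ask the teacher model for up to max_new_skills new skills as a JSON list."""
    client = OpenAI(
        api_key=os.environ.get("QWEN_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    prompt = (
        f"Existing skills:\n{json.dumps(current_skills)[:2000]}\n\n"
        f"Failed trajectories:\n{json.dumps(failed_trajectories)[:4000]}\n\n"
        f"Propose at most {max_new_skills} NEW skills as a JSON list. Each skill "
        "needs skill_id (prefixed 'dyn_'), title, principle, and when_to_apply, "
        "and must address the failures without duplicating existing skills."
    )
    response = client.chat.completions.create(
        model="qwen3.6-flash",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)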
3.6 Training Output and Monitoring
Sample Training Log
[SkillsOnlyMemory] Loaded skills: 15 general, 30 task-specific, 12 mistakes | retrieval_mode=template
[Step 0] Starting validation...
[Validation] apparel: 2/4 (50.0%)
[Validation] footwear: 3/4 (75.0%)
[Validation] electronics: 2/4 (50.0%)
[Validation] Average success rate: 0.58
[Step 0] epoch=0/50, reward=5.2, success_rate=0.58
[Step 0] policy_loss=0.345, kl_penalty=0.015, entropy=1.45
...
[Step 10] Validation: apparel=0.60, footwear=0.75, electronics=0.55
[Step 10] epoch=10/50, reward=7.8, success_rate=0.63
[SkillUpdate] Low success tasks: ['electronics'], triggering skill update...
[SkillUpdate] Analyzing 8 failed trajectories with o3...
[SkillUpdater] Generated 2 new skills: dyn_001, dyn_002
[SkillsOnlyMemory] Added skill: dyn_001 - Verify Technical Specs Before Purchase
[SkillsOnlyMemory] Added skill: dyn_002 - Check Price Range First
[SkillUpdate] Saved updated skill bank to ./checkpoint/updated_skills_step_10.json
[Step 11] epoch=11/50, reward=8.5, success_rate=0.70
Output File Layout
checkpoint/webshop_rl_qwen15b_small/
├── actor/                          # trained policy model
│   ├── model.safetensors
│   ├── config.json
│   └── adapter_config.json         # LoRA adapter config
├── ref/                            # reference model (not updated)
│   └── model.safetensors
├── updated_skills_step_0.json      # skill bank after the step-0 dynamic update
├── updated_skills_step_10.json     # skill bank updated at step 10
├── updated_skills_step_20.json     # skill bank updated at step 20
└── ...                             # more checkpoints
Validation Methods
This section covers how to inspect model quality and skill quality during training and data construction.
4.1 Inspecting Model Quality
Method 1: Live Monitoring During Training (recommended, most direct)
The training log reports the key metrics in real time, so you can see model performance immediately:
Key metrics:
- reward: average reward (higher is better)
- success_rate: task success rate (between 0 and 1)
- policy_loss: policy loss (should trend downward)
- kl_penalty: KL-divergence penalty (keeps the policy near the reference model)
- entropy: output entropy (maintains diversity)
How to read them:
python
# Sample training log
[Step 0] epoch=0/50, reward=5.2, success_rate=0.58, policy_loss=0.345, kl_penalty=0.015
[Step 10] epoch=10/50, reward=7.8, success_rate=0.63, policy_loss=0.251, kl_penalty=0.012
[Step 20] epoch=20/50, reward=9.5, success_rate=0.71, policy_loss=0.189, kl_penalty=0.009
# Reading:
# 1. reward rises from 5.2 to 9.5 → the model is improving
# 2. success_rate rises from 58% to 71% → task completion improves markedly
# 3. policy_loss falls → the model is learning an effective policy
# 4. kl_penalty falls → the policy stays a reasonable distance from the reference model
Live monitoring script:
python
# Training log parsing script
import re
import matplotlib.pyplot as plt
log_file = "checkpoint/webshop_rl_qwen15b_small/logs/training.log"
# Extract key metrics
metrics = []
with open(log_file, 'r') as f:
    for line in f:
        # Pull out reward and success_rate
        reward_match = re.search(r'reward=([\d.]+)', line)
        sr_match = re.search(r'success_rate=([\d.]+)', line)
        policy_match = re.search(r'policy_loss=([\d.]+)', line)
        if reward_match and sr_match:
            step = re.search(r'\[Step (\d+)\]', line)
            if step:
                metrics.append({
                    'step': int(step.group(1)),
                    'reward': float(reward_match.group(1)),
                    'success_rate': float(sr_match.group(1)),
                    'policy_loss': float(policy_match.group(1)) if policy_match else 0
                })
# Plot the curves
fig, axes = plt.subplots(3, 1, figsize=(14, 12))
steps = [m['step'] for m in metrics]
rewards = [m['reward'] for m in metrics]
success_rates = [m['success_rate'] for m in metrics]
policy_losses = [m['policy_loss'] for m in metrics]
# Reward curve
axes[0].plot(steps, rewards, marker='o', linewidth=2, color='blue')
axes[0].set_xlabel('Training Steps', fontsize=12)
axes[0].set_ylabel('Average Reward', fontsize=12)
axes[0].set_title('Reward Progress (↑ better)', fontsize=14)
axes[0].grid(True)
# Success rate curve
axes[1].plot(steps, success_rates, marker='s', linewidth=2, color='green')
axes[1].set_xlabel('Training Steps', fontsize=12)
axes[1].set_ylabel('Success Rate', fontsize=12)
axes[1].set_title('Success Rate Progress (↑ better)', fontsize=14)
axes[1].grid(True)
axes[1].axhline(y=0.5, color='r', linestyle='--', label='Random Baseline')
axes[1].legend()
# Policy loss curve
axes[2].plot(steps, policy_losses, marker='^', linewidth=2, color='red')
axes[2].set_xlabel('Training Steps', fontsize=12)
axes[2].set_ylabel('Policy Loss', fontsize=12)
axes[2].set_title('Policy Loss Progress (↓ better)', fontsize=14)
axes[2].grid(True)
plt.tight_layout()
plt.savefig('training_curves.png', dpi=150)
print("Training curves saved to training_curves.png")
Method 2: Comparing Models Across Training Stages (post-training evaluation)
Compare checkpoints from different stages to observe the capability gain:
python
from transformers import AutoModelForCausalLM, AutoTokenizer
from agent_system.environments.env_package.webshop import WebshopEnv
# Load models from different stages
models_to_test = {
    "SFT model": "./checkpoint/webshop_sft/global_step_XXX",
    "RL-Step10": "./checkpoint/webshop_rl_qwen15b_small/global_step_10",
    "RL-Step30": "./checkpoint/webshop_rl_qwen15b_small/global_step_30",
    "RL-Step50": "./checkpoint/webshop_rl_qwen15b_small/global_step_50",
}
# Load the skill bank
from agent_system.memory.skills_only_memory import SkillsOnlyMemory
skill_memory = SkillsOnlyMemory(
    'memory_data/webshop/claude_style_skills.json',
    retrieval_mode='template'
)
# Test task set
test_tasks = [
    "Find a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
    "Buy a red dress for summer",
    "Find black leather boots under $100",
]
results = {}
for model_name, model_path in models_to_test.items():
    print(f"\n{'='*60}")
    print(f"Testing model: {model_name}")
    print('='*60)
    # Load the model
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Create the environment
    env = WebshopEnv(use_small=True)
    # Evaluate
    success_count = 0
    total_steps = 0
    total_reward = 0
    for task in test_tasks:
        obs = env.reset(task)
        done = False
        steps = 0
        task_reward = 0
        while not done and steps < 15:
            # Retrieve and inject skills
            skills = skill_memory.retrieve(task, top_k=4)
            skill_text = skill_memory.format_for_prompt(skills)
            # Build the augmented prompt
            enhanced_prompt = f"{skill_text}\n\nTask: {task}\nObservation: {obs}"
            # Generate an action
            inputs = tokenizer(enhanced_prompt, return_tensors='pt')
            outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.4)
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)
            # Parse the action (simplified parser, as in Section 2.5 Method 2)
            action = parse_action(response)
            # Execute it
            obs, reward, done, info = env.step(action)
            task_reward += reward
            steps += 1
        if info.get('success', False):
            success_count += 1
        total_steps += steps
        total_reward += task_reward
    # Record the results
    results[model_name] = {
        'success_rate': success_count / len(test_tasks),
        'avg_steps': total_steps / len(test_tasks),
        'avg_reward': total_reward / len(test_tasks)
    }
    print(f"Success rate: {success_count}/{len(test_tasks)} ({success_count/len(test_tasks)*100:.1f}%)")
    print(f"Average steps: {total_steps/len(test_tasks):.1f}")
    print(f"Average reward: {total_reward/len(test_tasks):.2f}")
# Compare the results
print(f"\n{'='*60}")
print("Model capability comparison")
print('='*60)
print(f"{'Model':20s} {'Success':10s} {'Avg steps':12s} {'Avg reward':12s}")
print('-'*60)
for model_name, metrics in results.items():
    print(f"{model_name:20s} {metrics['success_rate']*100:6.1f}% {metrics['avg_steps']:10.1f} {metrics['avg_reward']:10.2f}")
Method 3: Ablation (what the skills contribute)
python
# Compare: with skills vs. without skills
# (skill_memory and parse_action as defined above)
model = AutoModelForCausalLM.from_pretrained("./checkpoint/webshop_rl_qwen15b_small/global_step_50")
tokenizer = AutoTokenizer.from_pretrained("./checkpoint/webshop_rl_qwen15b_small/global_step_50")
# Test configurations
configs = {
    "with skill bank": {"use_skills": True, "top_k": 4},
    "without skill bank": {"use_skills": False, "top_k": 0},
    "top-2 skills": {"use_skills": True, "top_k": 2},
}
test_tasks = [ ... ]  # same as above
for config_name, config in configs.items():
    print(f"\n{'='*60}")
    print(f"Configuration: {config_name}")
    print('='*60)
    env = WebshopEnv(use_small=True)
    success_count = 0
    for task in test_tasks:
        obs = env.reset(task)
        done = False
        steps = 0
        while not done and steps < 15:
            # Inject skills depending on the configuration
            if config["use_skills"]:
                skills = skill_memory.retrieve(task, top_k=config["top_k"])
                skill_text = skill_memory.format_for_prompt(skills)
                prompt = f"{skill_text}\n\nTask: {task}\nObservation: {obs}"
            else:
                prompt = f"Task: {task}\nObservation: {obs}"
            # Generate an action
            inputs = tokenizer(prompt, return_tensors='pt')
            outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.4)
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)
            action = parse_action(response)
            # Execute it
            obs, reward, done, info = env.step(action)
            steps += 1
        if info.get('success', False):
            success_count += 1
    print(f"Success rate: {success_count}/{len(test_tasks)} ({success_count/len(test_tasks)*100:.1f}%)")
4.2 Inspecting the Generated Skills
Method 1: Skill Bank Evolution Analysis
python
# Analyze how the skill bank evolves across checkpoints
import json
# Read the skill bank at several checkpoints
skill_files = [
    "memory_data/webshop/claude_style_skills.json",
    "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_10.json",
    "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_20.json",
    "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_30.json",
]
skill_evolution = []
for i, skill_file in enumerate(skill_files):
    with open(skill_file, 'r') as f:
        skills = json.load(f)
    general_skills = skills.get('general_skills', [])
    skill_ids = [s['skill_id'] for s in general_skills]
    skill_evolution.append({
        'step': i * 10,
        'skill_count': len(general_skills),
        'skill_ids': skill_ids
    })
# Print the evolution
print("="*80)
print("Skill bank evolution")
print("="*80)
for evolution in skill_evolution:
    print(f"\nStep {evolution['step']:3d}: {evolution['skill_count']} general skills")
    print(f"  Skill IDs: {', '.join(evolution['skill_ids'][:10])}" if len(evolution['skill_ids']) > 10 else f"  Skill IDs: {', '.join(evolution['skill_ids'])}")
# Count the newly added skills
initial_skills = set(skill_evolution[0]['skill_ids'])
final_skills = set(skill_evolution[-1]['skill_ids'])
new_skills = final_skills - initial_skills
print(f"\n{'='*80}")
print("New skill analysis")
print('='*80)
print(f"Initial skills: {len(initial_skills)}")
print(f"Final skills: {len(final_skills)}")
print(f"New skills: {len(new_skills)}")
print(f"\nNew skill list:")
for skill_id in sorted(new_skills):
    print(f"  - {skill_id}")
Method 2: Skill Effectiveness Evaluation
python
# Check whether dynamically added skills actually improved performance:
# read the training log and compare success rates around each update.
import re
log_file = "checkpoint/webshop_rl_qwen15b_small/logs/training.log"
with open(log_file, 'r') as f:
    log_content = f.read()
# Find the dynamic update points
update_matches = re.findall(r'\[SkillUpdate\] Saved updated skill bank to .*updated_skills_step(\d+)\.json', log_content)
skill_updates = [int(m) for m in update_matches]
# Extract success rates before and after each update
success_rate_pattern = r'\[Step (\d+)\].*?success_rate=([\d.]+)'
all_matches = re.findall(success_rate_pattern, log_content)
improvements = []
for i, update_step in enumerate(skill_updates):
    # Success rate just before the update
    before_metrics = [m for m in all_matches if int(m[0]) <= update_step]
    before_sr = float(before_metrics[-1][1]) if before_metrics else 0.0
    # Success rate just after the update
    after_metrics = [m for m in all_matches if int(m[0]) > update_step]
    after_sr = float(after_metrics[0][1]) if after_metrics else 0.0
    improvement = after_sr - before_sr
    improvements.append({
        'update_step': update_step,
        'before_sr': before_sr,
        'after_sr': after_sr,
        'improvement': improvement
    })
# Print the results
print("="*80)
print("Dynamic update effectiveness")
print("="*80)
print(f"{'Update step':12s} {'SR before':15s} {'SR after':15s} {'Delta':10s}")
print('-'*80)
for imp in improvements:
    print(f"{imp['update_step']:12d} {imp['before_sr']*100:6.1f}% {imp['after_sr']*100:6.1f}% {imp['improvement']*100:+6.1f}%")
total_improvement = sum(imp['improvement'] for imp in improvements)
avg_improvement = total_improvement / len(improvements) if improvements else 0
print(f"\nTotal improvement: {total_improvement*100:+.1f}%")
print(f"Average improvement per update: {avg_improvement*100:+.1f}%")
Method 3: Skill Similarity Analysis
python
# Check whether newly added skills duplicate existing ones
import json
from difflib import SequenceMatcher

def skill_similarity(skill1, skill2):
    """Compute the textual similarity between two skills."""
    text1 = skill1['title'] + ' ' + skill1['principle']
    text2 = skill2['title'] + ' ' + skill2['principle']
    return SequenceMatcher(None, text1, text2).ratio()

# Read the initial and final skill banks
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    initial_skills = json.load(f)
final_skill_file = "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_50.json"
with open(final_skill_file, 'r') as f:
    final_skills = json.load(f)
# Identify the new skills
initial_ids = set(s['skill_id'] for s in initial_skills['general_skills'])
final_ids = set(s['skill_id'] for s in final_skills['general_skills'])
new_skill_ids = final_ids - initial_ids
new_skills = [s for s in final_skills['general_skills'] if s['skill_id'] in new_skill_ids]
# Compare against the initial skills
print("="*80)
print("Similarity of new skills to existing skills")
print("="*80)
for new_skill in new_skills:
    max_sim = 0.0
    most_similar_id = None
    for old_skill in initial_skills['general_skills']:
        sim = skill_similarity(new_skill, old_skill)
        if sim > max_sim:
            max_sim = sim
            most_similar_id = old_skill['skill_id']
    print(f"\n[{new_skill['skill_id']}] {new_skill['title']}")
    print(f"  Principle: {new_skill['principle'][:60]}...")
    print(f"  Most similar existing skill: {most_similar_id} (similarity: {max_sim*100:.1f}%)")
    if max_sim > 0.8:
        print("  ⚠ Warning: highly similar to an existing skill")
    elif max_sim > 0.5:
        print("  ℹ Info: partially similar to an existing skill")
    else:
        print("  ✓ Genuinely new skill")
Small-Scale Training Configuration (1.5B Model)
5.1 Data Preparation Notes
Memory Data
Target: 20-50 trajectories
Notes:
1. Success-rate control: keep the success rate at 60-80%
python
success_count = sum(1 for m in memories if m['tags']['outcome'] == 'Success')
total = len(memories)
success_rate = success_count / total
if success_rate < 0.6:
    print("Warning: success rate too low (<60%)")
elif success_rate > 0.9:
    print("Warning: success rate too high (>90%), not enough failure samples")
2. Task-type balance: make sure every product type is represented
python
from collections import Counter
types = [detect_type(m) for m in memories]
type_dist = Counter(types)
print("Product type distribution:")
for t, count in type_dist.items():
    print(f"  {t}: {count} ({count/len(types)*100:.1f}%)")
# Check the balance
min_count = min(type_dist.values())
max_count = max(type_dist.values())
if max_count / min_count > 3:
    print("⚠ Warning: unbalanced type distribution")
3. Constraint diversity: cover price, size, color, and other constraints
python
constraint_types = set()
for m in memories:
    goal = m['content']['task_meta']['original_goal'].lower()
    if 'price' in goal or '$' in goal:
        constraint_types.add('price')
    if 'size' in goal or any(s in goal for s in ['s', 'm', 'l', 'xl']):
        constraint_types.add('size')
    if 'color' in goal or any(c in goal for c in ['red', 'blue', 'black', 'white']):
        constraint_types.add('color')
print(f"Constraint types covered: {', '.join(sorted(constraint_types))}")
if len(constraint_types) < 3:
    print("⚠ Warning: constraint types not diverse enough")
SFT Data
Target: 100-200 samples
Notes:
1. Prompt quality: prompts should be clear and consistently formatted
python
# Check the prompt format
for item in sft_data[:10]:
    prompt = item['prompt'][0]['content']
    # Length check
    if len(prompt) > 500:
        print(f"⚠ Prompt too long: {len(prompt)} characters")
    # Format check
    if '?' not in prompt and 'Find' not in prompt and 'Purchase' not in prompt:
        print(f"⚠ Non-standard prompt format: {prompt[:50]}...")
2. Response completeness: responses should contain the full decision process
python
# Check response quality
for item in sft_data[:10]:
    response = item['response']
    response_lower = response.lower()
    # Check action keywords
    action_keywords = ['search', 'click', 'buy', 'select', 'choose']
    found_actions = [kw for kw in action_keywords if kw in response_lower]
    if len(found_actions) < 2:
        print(f"⚠ Response actions incomplete: {response[:100]}...")
    # Check for reasoning
    if 'because' not in response_lower and 'since' not in response_lower:
        print(f"ℹ Missing reasoning connectives: {response[:100]}...")
3. Data consistency: prompts and responses should correspond
python
# Check prompt-response pairing
mismatched = 0
for item in sft_data:
    prompt = item['prompt'][0]['content'].lower()
    response = item['response'].lower()
    # Extract constraints from the prompt
    prompt_constraints = []
    if '$' in prompt:
        prompt_constraints.append('price')
    if 'size' in prompt or 's ' in prompt or 'm ' in prompt:
        prompt_constraints.append('size')
    # Check whether the response addresses those constraints
    response_actions = [kw for kw in prompt_constraints if kw in response]
    if len(response_actions) < len(prompt_constraints):
        mismatched += 1
        print(f"⚠ Constraint not addressed: {item['prompt'][0]['content'][:50]}...")
if mismatched > 0:
    print(f"\n⚠ Total {mismatched} mismatched prompt-response pairs")
5.2 Skill Bank Construction Notes
When generating the skill bank:
1. Teacher model choice
   - Recommended: Qwen3.6-Flash (cost-effective, fast)
   - Alternative: Qwen2.5-7B-Instruct (higher quality, slightly slower)
   - Not recommended: Azure OpenAI (expensive, tight quota)
2. Skill count control
python
# Suggested skill counts for small-scale training (ranges, expressed as tuples)
general_skills = (10, 15)        # general skills
task_specific_skills = (20, 30)  # task-specific skills (total)
common_mistakes = (8, 12)        # common mistakes
print("Skill bank size targets:")
print(f"  General skills: {general_skills[0]}-{general_skills[1]}")
print(f"  Task-specific skills: {task_specific_skills[0]}-{task_specific_skills[1]}")
print(f"  Common mistakes: {common_mistakes[0]}-{common_mistakes[1]}")
3. Skill quality checks
python
# Check the quality of generated skills
def check_skill_quality(skill):
    issues = []
    # 1. Length check
    principle = skill.get('principle', '')
    words = principle.split()
    if len(words) > 15:
        issues.append("principle too long")
    elif len(words) < 3:
        issues.append("principle too short")
    # 2. Actionability check
    action_verbs = ['search', 'click', 'select', 'verify', 'check', 'choose', 'buy', 'purchase']
    if not any(v in principle.lower() for v in action_verbs):
        issues.append("no explicit action")
    # 3. Specificity check
    vague_words = ['maybe', 'possibly', 'sometimes', 'often', 'usually']
    if any(w in principle.lower() for w in vague_words):
        issues.append("contains vague wording")
    # 4. Redundancy check
    redundancy_patterns = ['search and search', 'check and check', 'verify and verify']
    if any(p in principle.lower() for p in redundancy_patterns):
        issues.append("redundant actions")
    return issues

# Check all skills
all_issues = []
for skill in skills['general_skills']:
    issues = check_skill_quality(skill)
    if issues:
        all_issues.append((skill['skill_id'], issues))
print(f"Skills with quality issues: {len(all_issues)}")
for skill_id, issues in all_issues[:5]:
    print(f"  [{skill_id}]: {', '.join(issues)}")
5.3 RL Training Notes
Adjustments for small-scale runs:
1. GRPO group size
bash
# Small models
env.rollout.n=4   # saves GPU memory
# Larger models
env.rollout.n=8   # improves stability
2. Learning rate
bash
# Small-scale data (low overfitting risk)
actor_rollout_ref.actor.optim.lr=2e-6   # can be slightly higher
# Small model (1.5B)
actor_rollout_ref.actor.kl_loss_coef=0.005   # relax the KL constraint
3. Memory optimization
bash
# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing=True
# Enable parameter offload
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
# Reduce the batch size
data.train_batch_size=16 → data.train_batch_size=8
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
4. Epoch budget
bash
# Do not over-train on small data
trainer.total_epochs=30-50   # prevents overfitting
# Validation frequency
trainer.test_freq=5   # validate often and adjust early
Environment Configuration and API Setup
6.1 Installing Dependencies
bash
# Clone the repository
git clone https://github.com/aiming-lab/SkillRL.git
cd SkillRL
# Install base dependencies
pip install -r requirements.txt
pip install vllm==0.11.0
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
pip install -e .
# Install the OpenAI client (for the teacher model)
pip install openai
6.2 Environment Setup
WebShop
bash
cd agent_system/environments/env_package/webshop
./setup.sh -d small  # small dataset (quick experiments)
# or
./setup.sh -d all    # full dataset (real training)
ALFWorld
bash
pip install alfworld
pip install gymnasium==0.29.1
pip install stable-baselines3==2.6.0
alfworld-download -f  # download game files and detectors
Search
bash
cd agent_system/environments/env_package/search/third_party
pip install -e .
pip install gym==0.26.2
6.3 API Configuration
Qwen API (recommended, best cost/performance)
bash
# Set the Qwen API key (for the teacher model)
export QWEN_API_KEY="your_qwen_api_key"
# If using Azure OpenAI
export AZURE_OPENAI_API_KEY="your_azure_api_key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2025-01-01-preview"
Testing the API Connection
bash
# Test the Qwen API
python -c "
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ.get('QWEN_API_KEY'),
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)
response = client.chat.completions.create(
    model='qwen3.6-flash',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
print(response.choices[0].message.content)
"
Common Issues and Solutions
7.1 Memory Data Generation
Issue 1: API calls fail
Error: Connection timeout / Authentication failed
Fix:
bash
# Check network connectivity
ping dashscope.aliyuncs.com
# Verify the API key
echo $QWEN_API_KEY  # should print your key
# Check your quota
# Log into the Alibaba Cloud console to review API usage
Issue 2: Wrong number of generated skills
Symptom: expected 15 general skills, got only 10
Fix:
python
# Check the input memory data format
import json
with open('memory_data/webshop/generated_memories_webshop_100.json') as f:
    memories = json.load(f)
print(f"Total memories: {len(memories)}")
print(f"Success: {sum(1 for m in memories if m['tags']['outcome'] == 'Success')}")
print(f"Failure: {sum(1 for m in memories if m['tags']['outcome'] == 'Failure')}")
# Inspect the generation script's log
# The skill generation script prints the output of each stage
7.2 SFT Training
Issue 1: Out of GPU memory (OOM)
Error: CUDA out of memory
Fix:
bash
# Reduce the batch size
data.train_batch_size=8 → data.train_batch_size=4
# Reduce max_length (total prompt+response length in the SFT trainer)
data.max_length=2048 → data.max_length=1024
# Reduce the LoRA rank (fewer parameters, lower memory)
model.lora_rank=64 → model.lora_rank=32
# Enable offload
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
Issue 2: Training does not converge
Symptom: loss stays flat or oscillates
Fix:
bash
# Lower the learning rate
actor_rollout_ref.actor.optim.lr=1e-6 → actor_rollout_ref.actor.optim.lr=5e-7
# Raise the KL-divergence coefficient
actor_rollout_ref.actor.kl_loss_coef=0.01 → actor_rollout_ref.actor.kl_loss_coef=0.05
# Adjust the sampling temperature
actor_rollout_ref.rollout.val_kwargs.temperature=0.4 → actor_rollout_ref.rollout.val_kwargs.temperature=0.2
# Check data quality
# Make sure the training data is well-formed and sufficiently diverse
7.3 RL Training
Issue 1: Skill retrieval fails
Error: Task type not detected
Fix:
bash
# Check skills_json_path
+env.skills_only_memory.skills_json_path=<correct skill bank path>
# Check the task description format
# Make sure the description contains an explicit product-type keyword
# Extend _detect_task_type (skills_only_memory.py, line 115)
# Add more keywords to cover your task types
Issue 2: Dynamic updates never trigger
Symptom: the log shows [SkillUpdate] All task success rates above threshold
Fix:
bash
# Lower update_threshold
+env.skills_only_memory.update_threshold=0.4 → +env.skills_only_memory.update_threshold=0.2
# Check task-type detection
# Make sure different task types are recognized correctly
# Check failed-trajectory collection
# Make sure the validation set contains enough failures
Issue 3: Teacher model API calls fail
Error: [SkillUpdater] Error calling o3: API rate limit exceeded
Fix:
bash
# Use Qwen instead of o3 (cheaper, higher quota)
# Modify skill_updater.py to use the Qwen client
# Lower max_new_skills
+env.skills_only_memory.max_new_skills=3 → +env.skills_only_memory.max_new_skills=1
# Increase test_freq
trainer.test_freq=5 → trainer.test_freq=10
Complete Example Commands
Example 1: Full small-scale WebShop training pipeline (template mode + dynamic updates)
bash
# ============ Step 1: environment setup ============
export QWEN_API_KEY="your_qwen_api_key"
export MODEL_PATH=./checkpoint/webshop_sft/global_step_XXX
# ============ Step 2: data preparation ============
# Small-scale data
python3 -m examples.data_preprocess.prepare \
  --mode 'text' \
  --train_data_size 16 \
  --val_data_size 16
# ============ Step 3: RL training ============
bash examples/grpo_trainer/run_webshop_skills.sh
# Or run the command directly (the contents of run_webshop_skills.sh)
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_files=$HOME/data/verl-agent/text/train.parquet \
  data.val_files=$HOME/data/verl-agent/text/test.parquet \
  data.train_batch_size=16 \
  data.val_batch_size=16 \
  data.max_prompt_length=4096 \
  data.max_response_length=512 \
  actor_rollout_ref.model.path=$MODEL_PATH \
  actor_rollout_ref.actor.optim.lr=2e-6 \
  actor_rollout_ref.actor.use_kl_loss=True \
  actor_rollout_ref.actor.kl_loss_coef=0.005 \
  actor_rollout_ref.actor.fsdp_config.param_offload=True \
  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
  env.env_name=Webshop \
  env.rollout.n=4 \
  +env.use_skills_only_memory=True \
  +env.skills_only_memory.skills_json_path=$HOME/verl-agent/memory_data/webshop/claude_style_skills.json \
  +env.skills_only_memory.retrieval_mode=template \
  +env.skills_only_memory.top_k=4 \
  +env.skills_only_memory.enable_dynamic_update=True \
  +env.skills_only_memory.update_threshold=0.5 \
  +env.skills_only_memory.max_new_skills=2 \
  trainer.total_epochs=50 \
  trainer.save_freq=10 \
  trainer.test_freq=5 \
  trainer.val_before_train=True \
  trainer.logger='[console]' \
  trainer.default_local_dir=./checkpoint/webshop_rl_qwen15b_small
Example 2: Quick experiment (dynamic updates disabled)
bash
python3 -m verl.trainer.main_ppo \
  # ... same configuration as above ...
  \
  # Disable dynamic updates and use a static skill bank
  +env.skills_only_memory.enable_dynamic_update=False \
  \
  # Fewer training epochs
  trainer.total_epochs=30 \
  trainer.test_freq=10 \
  \
  # ... remaining configuration ...
Summary
SkillRL Training Pipeline
| Stage | Input | Operation | Output |
|---|---|---|---|
| Memory Data Generation | Task trajectory data | Teacher model analyzes trajectories | Hierarchical skill bank |
| SFT | Base model + SFT data | Supervised fine-tuning | SFT model |
| RL | SFT model + skill bank | GRPO training + dynamic updates | Final model + updated skill bank |
Teacher Model Roles
| Stage | Role | Distillation? |
|---|---|---|
| Trajectory construction | Generate full task trajectories (action + observation + reasoning) | Direct generation, no distillation |
| Skill extraction | Extract reusable behavior patterns from trajectories | Direct analysis, no distillation |
| Dynamic update | Analyze RL failure cases, generate new skills | Direct generation, no distillation |
| Model training | Not involved | - |
Key points:
- The teacher model only generates training data (trajectories, skills); it never takes part in parameter updates
- Model improvement comes from:
  - SFT: learning action patterns from trajectories
  - RL: optimizing the policy via reward signals
  - Skill injection: prior knowledge that accelerates learning
- Dynamic skill-bank updates happen during RL training, with the teacher model analyzing failure cases and contributing new skills
Key Technical Features
1. Hierarchical skill bank
   - General Skills: principles that apply across all tasks
   - Task-Specific Skills: skills for a particular task type
   - Common Mistakes: frequent errors and how to avoid them
2. Two retrieval modes
   - Template: keyword matching, zero latency
   - Embedding: semantic similarity, more precise
3. Dynamic update mechanism
   - Automatically analyzes validation failures
   - Uses the teacher model to generate new skills
   - Updates the training environment's skill bank in place
4. Small-scale training recommendations
   - Data: 20-50 memories, 100-200 SFT samples, 16-32 RL samples
   - Skill bank: 10-15 general skills
   - Epochs: 30-50
   - Validation: every 5 epochs
Key File Locations
- skill_generation/*.py - skill generation scripts
- agent_system/memory/skills_only_memory.py - skill retrieval system
- agent_system/memory/skill_updater.py - dynamic skill updates
- verl/trainer/ppo/ray_trainer.py - RL training main loop (includes the dynamic-update logic)
- verl/trainer/main_ppo.py - RL training entry point
- verl/trainer/fsdp_sft_trainer.py - SFT trainer (built into verl)
- qwen_8b.py - Qwen API test script (reference only; it hard-codes an API key, do not use as-is)
Usage Recommendations
- Small models (1.5B): use template mode (zero latency)
- Larger models (7B+): embedding mode works (more precise)
- Limited GPU memory: reduce batch sizes, enable offload
- Quick experiments: disable dynamic updates and use a static skill bank
- Small-scale training: limit data volume and epochs to avoid overfitting
Generated: 2026-05-13
Code base version: SkillRL-main
Purpose: complete training guide - full SkillBank dynamic update workflow plus small-scale training notes