[SkillRL] Automatic Skill Learning: Training Methods

A complete SkillRL training guide covering the full SkillBank dynamic-update pipeline

Contents

  1. Training Pipeline Overview
  2. Stage 1: Memory Data Generation (initial skill bank creation)
  3. Stage 2: SFT (supervised fine-tuning)
  4. Stage 3: RL Training (skill retrieval and dynamic updates)
  5. Small-Scale Training Configuration (1.5B model)
  6. Validation Methods
  7. Environment and API Setup
  8. Common Problems and Solutions
  9. Complete Example Commands

1. Training Pipeline Overview

SkillRL training proceeds in three stages, each with well-defined inputs, outputs, and steps:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  Stage 1: Memory Data Generation                            │
│  base/teacher model → generate memory data → initial skill bank  │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Stage 2: SFT (supervised fine-tuning)                      │
│  base model + SFT data → SFT fine-tuning → SFT model        │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Stage 3: RL Training (skill retrieval and dynamic updates) │
│  SFT model + skill bank → GRPO training → dynamic updates → final model  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Core characteristics

  • Memory Data Generation: a strong teacher model generates the initial skill bank
  • SFT: supervised fine-tuning with LLaMA-Factory or a custom trainer
  • RL training: GRPO, with skill retrieval and dynamic skill-bank updates
  • SkillBank: a hierarchical skill bank (General + Task-Specific + Common Mistakes)

When the teacher model is used

The teacher model (e.g. Qwen3.6-Flash in qwen_8b.py) is used in three stages:

1. Memory Data Generation (trajectory construction)

Role: generate task trajectories and the accompanying reasoning

  • Input: a task description (e.g. "Find me a blue running shoe under $50")
  • Output: a complete execution trajectory with the action, observation, and reasoning for each step
  • Required: yes (when training from scratch)

Example

Task: Find me a blue running shoe under $50
Teacher-generated trajectory:
Step 0: Action=search[blue running shoe], Reasoning=Encode core constraints
Step 1: Action=click[Nike Blue Running Shoes], Reasoning=Inspect product details
Step 2: Action=click[Size 10], Reasoning=Select correct size
Step 3: Action=click[Buy Now], Reasoning=Complete purchase within budget
2. Skill Generation (extracting skills from trajectories)

Role: analyze successful and failed trajectories and extract reusable behavior patterns

  • Input: multiple Memory Data records (trajectories plus reasoning)
  • Output: a hierarchical skill bank (general_skills + task_specific_skills + common_mistakes)
  • Required: yes (an initial skill bank is needed before RL training can start)

Example

After analyzing 100 successful trajectories, the teacher model extracts:
General Skills: 15 (principles shared across all tasks)
Task-Specific Skills: 30 (grouped by product category)
Common Mistakes: 12 (frequent errors and how to avoid them)
3. Dynamic Skill Update (continuous optimization during training)

Role: analyze failure cases during RL training and generate new skills to close capability gaps

  • Input: failed validation trajectories (tasks the current model cannot solve)
  • Output: new skills (skill_id prefixed with dyn_, e.g. dyn_001, dyn_002)
  • Required: no (dynamic updates can be disabled to keep a static skill bank)
  • Trigger: whenever a task category's success rate drops below a threshold (default 0.4)

Example

Step 0 validation:
- apparel success rate: 35% (below the 40% threshold)
- 12 failed trajectories collected

The teacher model then generates:
dyn_001: Verify Price Before Product Selection
dyn_002: Confirm Size Availability Early
dyn_003: Filter by Category First

The skill bank grows from 15 to 18 skills
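The trigger rule above (update whenever a category's validation success rate falls below the threshold) can be sketched in a few lines; `find_low_success_tasks` is an illustrative helper, not the project's actual API:

```python
def find_low_success_tasks(success_rates: dict, threshold: float = 0.4) -> list:
    """Return the task categories whose validation success rate fell below threshold."""
    return [task for task, rate in success_rates.items() if rate < threshold]

# Matches the example above: only apparel (35%) is below the 0.4 default
rates = {"apparel": 0.35, "footwear": 0.75, "electronics": 0.50}
print(find_low_success_tasks(rates))  # → ['apparel']
```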

Summary: teacher model vs. training distillation

| Stage | Teacher model's role | Distillation? |
|---|---|---|
| Trajectory construction | Generates complete task trajectories (action + reasoning) | No, direct generation |
| Skill extraction | Analyzes trajectories and extracts behavior patterns | No, direct extraction |
| Dynamic update | Analyzes failure cases and generates new skills | No, direct generation |
| SFT training | Not involved | No, supervised learning |
| RL training | Called only during dynamic updates | Partially (skills come from the teacher; the policy comes from RL) |

Key points

  • The teacher model never participates in model parameter updates
  • The teacher model only produces training data (trajectories, skills)
  • Capability gains come from:
    • SFT: learning the action patterns in trajectories
    • RL: optimizing the policy with reward signals
    • Skill injection: prior knowledge that accelerates learning

1.1 Teacher Model Configuration and Usage

Supported teacher model types

SkillRL supports two teacher model configurations:

Option A: Qwen API (recommended, best cost/performance)

Advantages

  • Low cost (more than 10x cheaper than Azure OpenAI)
  • Fast responses
  • Generous quota

Setup

  1. Set the environment variable
bash
export QWEN_API_KEY="your_qwen_api_key"
  2. Modify the OpenAIClient class in skill_generation/webshop.py (lines 46-54)
python
from openai import OpenAI  # switch to the standard OpenAI client

class OpenAIClient:
    def __init__(self, max_new_tokens: int = 4096, model: str = "qwen3.6-flash"):
        self.max_new_tokens = max_new_tokens
        self.model = model
        self.client = OpenAI(  # use the OpenAI client
            api_key=os.environ.get("QWEN_API_KEY"),  # read from the environment
            base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        )
  3. Modify the SkillUpdater class in agent_system/memory/skill_updater.py (lines 34-38)
python
from openai import OpenAI  # switch to the standard OpenAI client

# replace AzureOpenAI inside __init__
self.client = OpenAI(
    api_key=os.environ.get("QWEN_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
self.model = "qwen3.6-flash"  # or "qwen2.5-7b-instruct"
Option B: Azure OpenAI o3 (default configuration)

Environment variables

bash
export AZURE_OPENAI_API_KEY="your_azure_api_key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2025-01-01-preview"

Notes

  • Requires an Azure account and API key
  • Higher cost, limited quota
  • Suitable for small experiments (<100 calls)
Testing the teacher model connection
bash
# Test the Qwen API
python -c "
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('QWEN_API_KEY'),
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)

response = client.chat.completions.create(
    model='qwen3.6-flash',
    messages=[{'role': 'user', 'content': 'Hello'}],
)

print(response.choices[0].message.content)
"

# Test Azure OpenAI
python -c "
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ.get('AZURE_OPENAI_API_KEY'),
    azure_endpoint=os.environ.get('AZURE_OPENAI_ENDPOINT'),
    api_version=os.environ.get('AZURE_OPENAI_API_VERSION'),
)

response = client.chat.completions.create(
    model='o3',
    messages=[{'role': 'user', 'content': 'Hello'}],
)

print(response.choices[0].message.content)
"

1.2 Purpose and Required Data

Use a teacher model (such as GPT-4, Claude-3, or Qwen3.6-Flash) to analyze task trajectories and generate a hierarchical skill bank. This stage consumes the memory data described below.

Memory Data format

Memory data contains successful and failed task trajectories in the following format:

json
{
  "memory_id": "mem_webshop_19499503",
  "contextual_description": "WebShop task to purchase a Men's Apparel item with Color, Size, Material, Fit, Sleeve Style, and Price constraints. Solved by searching with detailed terms, selecting size and color options, and buying.",
  "tags": {
    "environment": "Webshop",
    "outcome": "Success"  // must be "Success" or "Failure"
  },
  "content": {
    "task_meta": {
      "original_goal": "Find me machine wash men's dress shirts with cotton spandex, classic fit, short sleeve with color: melon berry, and size: large, and price lower than 50.00 dollars."
    },
    "refined_trajectory": {
      "refined_trajectory": [
        {
          "step_index": 0,
          "action": "search[men's dress shirts cotton spandex classic fit short sleeves [Color_Constraint] [Size_Constraint] [Price_Constraint] or less]",
          "critical_observation": "Search results page shows multiple apparel items including at least one men's short-sleeve shirt candidate within desired price constraint.",
          "reasoning": "Formulate a search query that encodes all known attribute constraints to surface candidate apparel items that may satisfy the goal."
        },
        {
          "step_index": 1,
          "action": "click[apparel_item]",
          "critical_observation": "Product detail page for a men's short-sleeve shirt is opened, exposing selectable size and color options and a price within the required constraint.",
          "reasoning": "Open a promising apparel item from the search results to inspect and configure its attributes against the goal constraints."
        }
      ]
    },
    "strategic_guidelines": {
      "planning_pattern": "search -> click_product -> set_size -> set_color -> purchase",
      "mistakes_to_avoid": [
        "Don't buy without checking price",
        "Don't skip size selection"
      ]
    }
  }
}
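A small sanity check over this schema can catch malformed records before skill generation. This is a hedged sketch against the fields shown above, not a utility shipped with the project:

```python
def validate_memory(record: dict) -> list:
    """Return a list of schema problems found in one memory record."""
    problems = []
    if record.get("tags", {}).get("outcome") not in ("Success", "Failure"):
        problems.append("tags.outcome must be 'Success' or 'Failure'")
    content = record.get("content", {})
    if not content.get("task_meta", {}).get("original_goal"):
        problems.append("missing content.task_meta.original_goal")
    steps = content.get("refined_trajectory", {}).get("refined_trajectory", [])
    for i, step in enumerate(steps):
        for key in ("step_index", "action", "reasoning"):
            if key not in step:
                problems.append(f"step {i}: missing '{key}'")
    return problems

good = {"tags": {"outcome": "Success"},
        "content": {"task_meta": {"original_goal": "Find a blue shoe"},
                    "refined_trajectory": {"refined_trajectory": [
                        {"step_index": 0, "action": "search[shoe]", "reasoning": "start"}]}}}
print(validate_memory(good))  # → []
```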
Recommendations for small-scale training data

Goal: train a 1.5B model for quick small-scale validation

Suggested data sizes

  • Memory Data: 20-50 trajectories (15-20 successes + 5-10 failures)
  • SFT Data: 100-200 samples (for quick validation)
  • RL training data: 16-32 training samples, 16-32 validation samples

Notes

  1. Success rate: the Memory Data success rate should be between 60% and 80%

    • Too low (<50%): poor trajectory quality, so extracted skills are unreliable
    • Too high (>90%): too few failures, so common mistakes cannot be extracted
  2. Task diversity: cover different product types and constraint combinations

    python
    # Check diversity (get_product_type: project helper mapping a goal to a category)
    from collections import Counter
    
    product_types = [get_product_type(m['content']['task_meta']['original_goal'])
                      for m in memories]
    print("Product type distribution:", Counter(product_types))
    
    # Expected output, for example:
    # apparel: 12, footwear: 8, electronics: 5, home_decor: 3
  3. Constraint coverage: cover the different constraint types

    • Price (under 50, between 20-100)
    • Color (blue, red, black)
    • Size (S, M, L, XL)
    • Material (cotton, leather, synthetic)
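Constraint coverage can also be checked mechanically by parsing the goal strings. The regexes below are illustrative heuristics and only cover the constraint phrasings listed above:

```python
import re

def extract_constraints(goal: str) -> dict:
    """Heuristically pull price/color/size constraints from a WebShop goal string."""
    g = goal.lower()
    constraints = {}
    m = re.search(r'(?:under|lower than|less than)\s*\$?\s*(\d+(?:\.\d+)?)', g)
    if m:
        constraints['max_price'] = float(m.group(1))
    for color in ('blue', 'red', 'black', 'white'):
        if color in g:
            constraints['color'] = color
            break
    m = re.search(r'size:?\s*([a-z0-9]+)\b', g)
    if m:
        constraints['size'] = m.group(1)
    return constraints

print(extract_constraints("Find me a blue running shoe under $50"))
# → {'max_price': 50.0, 'color': 'blue'}
```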
Memory data file locations
  • ALFWorld : memory_data/alfworld/generated_memories_alfworld_total.json
  • WebShop : memory_data/webshop/generated_memories_webshop_100.json, generated_memories_webshop_101-200.json
  • Search : memory_data/search/generated_memories_search.json
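The 60-80% success-rate band can be checked directly on any of the memory files above. The snippet uses a small inline sample so it runs standalone, with the real file load left as a comment:

```python
import json
from collections import Counter

def success_fraction(memories: list) -> float:
    """Fraction of memories tagged Success (target band: 0.6-0.8)."""
    counts = Counter(m["tags"]["outcome"] for m in memories)
    total = sum(counts.values())
    return counts.get("Success", 0) / total if total else 0.0

# Real usage:
# with open("memory_data/webshop/generated_memories_webshop_100.json") as f:
#     memories = json.load(f)
memories = ([{"tags": {"outcome": "Success"}}] * 7
            + [{"tags": {"outcome": "Failure"}}] * 3)
rate = success_fraction(memories)
print(f"success rate: {rate:.0%}, in target band: {0.6 <= rate <= 0.8}")
# → success rate: 70%, in target band: True
```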

1.3 Steps

Step 1: configure the teacher model API

Option A: Qwen (recommended, best cost/performance)

Based on the configuration in qwen_8b.py (note: qwen_8b.py hard-codes an API key; do not copy that pattern — read the key from an environment variable instead):
python
from openai import OpenAI

# Initialize the Qwen client (set the API key via an environment variable)
client = OpenAI(
    api_key=os.environ.get("QWEN_API_KEY"),  # read from env, avoid hard-coding
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Send a request
completion = client.chat.completions.create(
    model="qwen3.6-flash",  # or "qwen2.5-7b-instruct"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "your task..."}
    ],
    stream=False,
)

Option B: Azure OpenAI

bash
export AZURE_OPENAI_API_KEY="your_api_key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2025-01-01-preview"
Step 2: generate the skill bank

ALFWorld skill bank

bash
python skill_generation/alfworld.py \
    --memory_path memory_data/alfworld/generated_memories_alfworld_total.json \
    --output_path memory_data/alfworld/claude_style_skills.json

WebShop skill bank

bash
python skill_generation/webshop.py \
    --memory_path memory_data/webshop/generated_memories_webshop_100.json \
    --output_path memory_data/webshop/claude_style_skills.json

Search skill bank

bash
python skill_generation/search.py \
    --memory_path memory_data/search/generated_memories_search.json \
    --output_path memory_data/search/claude_style_skills.json
Step 3: switch the skill_generation scripts to the Qwen teacher model

To use Qwen as the teacher model, modify skill_generation/webshop.py:

python
# Original code (Azure OpenAI)
# client = AzureOpenAI(api_key="", azure_endpoint="", api_version="")

# Replace with Qwen
from openai import OpenAI

class QwenClient:
    def __init__(self):
        self.client = OpenAI(
            api_key=os.environ.get("QWEN_API_KEY"),
            base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        )
        self.model = "qwen3.6-flash"  # or another Qwen model

    def generate_response(self, messages: list) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=4096,
        )
        return response.choices[0].message.content
Step 4: verify the generated skill bank
bash
# Count the skills
python -c "
import json
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)
    print(f'General Skills: {len(skills[\"general_skills\"])}')
    print(f'Task-Specific Skills: {sum(len(v) for v in skills[\"task_specific_skills\"].values())}')
    print(f'Common Mistakes: {len(skills[\"common_mistakes\"])}')
"

Expected output:

General Skills: 15
Task-Specific Skills: 30
Common Mistakes: 12

1.4 Skill Bank Format

json
{
  "general_skills": [
    {
      "skill_id": "gen_001",
      "title": "Prioritize Core Keywords",
      "principle": "Include product type, 1-2 key functional attributes, and any hard constraints (price, size, color) in the search query.",
      "when_to_apply": "Before issuing first search or when refining an over-specific query."
    }
  ],
  "task_specific_skills": {
    "apparel": [...],
    "footwear": [...],
    "electronics": [...],
    "home_decor": [...],
    "accessories": [...],
    "beauty_health": [...],
    "other": [...]
  },
  "common_mistakes": [
    {
      "mistake_id": "err_001",
      "description": "Repeating the same action after it fails.",
      "why_it_happens": "Agent does not track action history.",
      "how_to_avoid": "Check the admissible actions list and try an alternative."
    }
  ]
}
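Dynamic updates append dyn_-prefixed entries into this same structure. Below is a minimal sketch of such an append; the helper name and the when_to_apply wording are illustrative, not the project's actual add_skills API:

```python
def add_general_skill(skills: dict, title: str, principle: str) -> str:
    """Append a dyn_-prefixed skill to general_skills and return its id."""
    n = sum(1 for s in skills["general_skills"]
            if s["skill_id"].startswith("dyn_"))
    skill_id = f"dyn_{n + 1:03d}"
    skills["general_skills"].append({
        "skill_id": skill_id,
        "title": title,
        "principle": principle,
        "when_to_apply": "When the failure pattern this skill addresses recurs.",
    })
    return skill_id

bank = {"general_skills": [], "task_specific_skills": {}, "common_mistakes": []}
print(add_general_skill(bank, "Verify Price Before Product Selection",
                        "Check the listed price against the budget before clicking."))
# → dyn_001
```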

1.5 Verification: Inspecting the Generated Skills

Method 1: manual quality check
python
import json

with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)

print("="*60)
print("General skill quality check")
print("="*60)

for skill in skills['general_skills'][:5]:
    print(f"\n[{skill.get('skill_id', 'N/A')}] {skill.get('title', 'N/A')}")
    print(f"  Principle: {skill.get('principle', 'N/A')[:100]}")
    print(f"  When to apply: {skill.get('when_to_apply', 'N/A')[:100]}")

    # Evaluation dimensions
    principle = skill.get('principle', '')
    if len(principle.split()) <= 30:
        print("  ✓ Concise (≤ 30 words)")
    else:
        print("  ✗ Too long (> 30 words)")

    if 'search' in principle.lower() or 'click' in principle.lower():
        print("  ✓ Actionable (mentions a concrete action)")
    else:
        print("  ⚠ Abstract (no concrete action)")

print("\n" + "="*60)
print("Task-specific skill distribution")
print("="*60)

for category, skill_list in skills['task_specific_skills'].items():
    print(f"{category:15s}: {len(skill_list):2d} skills")

print("\n" + "="*60)
print("Common mistake examples")
print("="*60)

for mistake in skills['common_mistakes'][:3]:
    print(f"\n[{mistake.get('mistake_id', 'N/A')}]")
    print(f"  Description: {mistake.get('description', 'N/A')}")
    print(f"  Why it happens: {mistake.get('why_it_happens', 'N/A')}")
    print(f"  How to avoid: {mistake.get('how_to_avoid', 'N/A')}")
Method 2: test skill retrieval
python
from agent_system.memory.skills_only_memory import SkillsOnlyMemory

# Load the skill bank
memory = SkillsOnlyMemory(
    'memory_data/webshop/claude_style_skills.json',
    retrieval_mode='template'
)

# Try different task types
test_tasks = [
    "Find a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
    "Buy a red dress for summer",
]

print("="*60)
print("Skill retrieval test")
print("="*60)

for task in test_tasks:
    print(f"\nTask: {task}")

    # Retrieve skills
    retrieved = memory.retrieve(task, top_k=6)

    print(f"  Detected type: {retrieved['task_type']}")
    print(f"  General skills: {len(retrieved['general_skills'])}")
    print(f"  Task-specific skills: {len(retrieved['task_specific_skills'])}")

    # Check whether the retrieved skills look relevant
    keywords_to_match = ['price', 'size', 'color', 'constraint']

    relevant_count = 0
    for skill in retrieved['general_skills'][:3]:
        skill_text = skill.get('principle', '').lower()
        if any(kw in skill_text for kw in keywords_to_match):
            relevant_count += 1

    print(f"  Relevance: {relevant_count}/3 skills match the keywords")
Method 3: check skill coverage
python
import json
from collections import Counter

with open('memory_data/webshop/generated_memories_webshop_100.json', 'r') as f:
    memories = json.load(f)

# Collect all task goals
tasks = [m['content']['task_meta']['original_goal'] for m in memories]

# Count keyword occurrences
keywords = [
    'price', 'cost', 'dollar', 'cheap', 'expensive',
    'size', 'small', 'medium', 'large', 'xl',
    'color', 'red', 'blue', 'black', 'white',
    'search', 'click', 'buy', 'purchase'
]

keyword_coverage = Counter()
for task in tasks:
    task_lower = task.lower()
    for kw in keywords:
        if kw in task_lower:
            keyword_coverage[kw] += 1

print("="*60)
print("Task keyword coverage")
print("="*60)

# Group by category
price_kws = [k for k in ['price', 'cost', 'dollar', 'cheap', 'expensive'] if k in keyword_coverage]
size_kws = [k for k in ['size', 'small', 'medium', 'large', 'xl'] if k in keyword_coverage]
color_kws = [k for k in ['color', 'red', 'blue', 'black', 'white'] if k in keyword_coverage]
action_kws = [k for k in ['search', 'click', 'buy', 'purchase'] if k in keyword_coverage]

print(f"\nPrice keywords: {', '.join(price_kws)} - {sum(keyword_coverage[k] for k in price_kws)} occurrences")
print(f"Size keywords: {', '.join(size_kws)} - {sum(keyword_coverage[k] for k in size_kws)} occurrences")
print(f"Color keywords: {', '.join(color_kws)} - {sum(keyword_coverage[k] for k in color_kws)} occurrences")
print(f"Action keywords: {', '.join(action_kws)} - {sum(keyword_coverage[k] for k in action_kws)} occurrences")

# Compare against the skill bank
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)

skill_keywords = []
for skill in skills['general_skills'] + skills.get('common_mistakes', []):
    skill_text = (skill.get('principle', '') + ' ' + skill.get('description', '') + ' ' + skill.get('how_to_avoid', '')).lower()
    skill_keywords.extend([kw for kw in keywords if kw in skill_text])

skill_kw_coverage = Counter(skill_keywords)
print(f"\nSkill bank keyword coverage:")
print(f"  General + mistake skills: {len(skills['general_skills']) + len(skills['common_mistakes'])}")
print(f"  Distinct keyword types: {len(set(skill_keywords))}")
Method 4: visualize the skill distribution
python
import json
import matplotlib.pyplot as plt

with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    skills = json.load(f)

# Length distribution of skill principles
general_lengths = [len(s['principle'].split()) for s in skills['general_skills']]
task_lengths = {cat: [len(s['principle'].split()) for s in skills_list]
                for cat, skills_list in skills['task_specific_skills'].items()}

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# General skills
axes[0].hist(general_lengths, bins=range(0, 25, 5), edgecolor='black')
axes[0].set_title('General skill length distribution', fontsize=14)
axes[0].set_xlabel('Words per principle', fontsize=12)
axes[0].set_ylabel('Number of skills', fontsize=12)

# Task-specific skills
data_for_box = [task_lengths.get(cat, []) for cat in skills['task_specific_skills'].keys()]
axes[1].boxplot(data_for_box, labels=list(skills['task_specific_skills'].keys()))
axes[1].set_title('Task-specific skill length distribution', fontsize=14)
axes[1].set_ylabel('Words per principle', fontsize=12)
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('skill_distribution.png', dpi=150)
print("Saved skill distribution plot to skill_distribution.png")

Stage 2: SFT (Supervised Fine-Tuning)

2.1 Purpose

Use supervised fine-tuning to give the model basic task-solving and instruction-following ability.

2.2 Small-Scale Data Preparation

SFT data format
python
# Example: build a small SFT dataset
import pandas as pd

data = [
    {
        "prompt": [{
            "role": "user",
            "content": "Find me a blue running shoe under $50"
        }],
        "response": "I'll help you find a blue running shoe under $50. Let me search for available options...",
    },
    # more samples...
]

# Small scale: 100-200 samples
df = pd.DataFrame(data)
df.to_parquet("train.parquet")

# Validation set: 20-50 samples
val_data = [...]  # same format
val_df = pd.DataFrame(val_data)
val_df.to_parquet("test.parquet")
Data quality check
python
import json

# Check SFT data quality
def check_sft_quality(data):
    print("="*60)
    print("SFT data quality check")
    print("="*60)

    total = len(data)
    print(f"\nTotal samples: {total}")

    # Prompt length distribution
    prompt_lengths = [len(str(item['prompt'][0]['content'])) for item in data]
    print(f"Prompt length: min={min(prompt_lengths)}, max={max(prompt_lengths)}, mean={sum(prompt_lengths)/len(prompt_lengths):.1f}")

    # Response length distribution
    response_lengths = [len(str(item['response'])) for item in data]
    print(f"Response length: min={min(response_lengths)}, max={max(response_lengths)}, mean={sum(response_lengths)/len(response_lengths):.1f}")

    # Keyword coverage
    response_texts = ' '.join([str(item['response']) for item in data]).lower()
    keywords = ['search', 'click', 'buy', 'price', 'size', 'color']
    for kw in keywords:
        count = response_texts.count(kw)
        print(f"Keyword '{kw}': {count} occurrences")

    # Anomaly detection
    empty_responses = sum(1 for item in data if not item['response'].strip())
    if empty_responses > 0:
        print(f"\n⚠ Warning: {empty_responses} empty responses")
2.3 Training Configuration

Using LLaMA-Factory (recommended)

Note: LLaMA-Factory is an external tool and is not part of this repository. The project README mentions using LLaMA-Factory for SFT, but the SFT examples in the codebase use verl's own fsdp_sft_trainer. The following is reference usage; adjust parameters against the latest LLaMA-Factory documentation.

bash
# Install LLaMA-Factory
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch]"

# Training setup (1.5B model) ------ YAML config file approach (recommended)
# 1. Register the dataset (add to data/dataset_info.json):
#    "webshop_sft": {
#      "file_name": "your_sft_data.json",
#      "columns": {"prompt": "prompt", "response": "response"}
#    }
# 2. Create a training config (examples/train_lora/qwen_sft.yaml)
# 3. Launch training:
llamafactory-cli train examples/train_lora/qwen_sft.yaml

# Or pass parameters on the command line (flags may change between versions; check the official docs):
llamafactory-cli train \
    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
    --stage sft \
    --dataset webshop_sft \
    --template qwen \
    --finetuning_type lora \
    --lora_rank 64 \
    --lora_alpha 64 \
    --output_dir ./checkpoint/webshop_sft \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --max_steps 1000 \
    --save_steps 100
Using verl's built-in SFT trainer

Create a training script, examples/sft/webshop/run_qwen_15b_small.sh:

bash 复制代码
#!/bin/bash
set -x

nproc_per_node=1
save_path=./checkpoint/webshop_sft_qwen15b_small

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=$HOME/data/verl-agent/text/train.parquet \
    data.val_files=$HOME/data/verl-agent/text/test.parquet \
    data.prompt_key=prompt \
    data.response_key=response \
    optim.lr=5e-5 \
    data.train_batch_size=8 \
    data.micro_batch_size_per_gpu=2 \
    data.max_length=2048 \
    model.partial_pretrain=Qwen/Qwen2.5-1.5B-Instruct \
    trainer.default_local_dir=$save_path \
    trainer.project_name=webshop-sft-small \
    trainer.experiment_name=qwen-1.5b-webshop-sft-small \
    trainer.total_epochs=2 \
    trainer.logger='[console]' \
    model.enable_gradient_checkpointing=True \
    model.lora_rank=64 \
    model.lora_alpha=64 \
    model.target_modules=all-linear \
    use_remove_padding=true

Parameter notes: the verl SFT trainer uses data.max_length to cap total sequence length (not max_prompt_length / max_response_length), and data.micro_batch_size_per_gpu also controls the validation batch size (there is no separate val_batch_size parameter). See verl/trainer/config/sft_trainer.yaml.

Run the training:

bash
bash examples/sft/webshop/run_qwen_15b_small.sh

2.4 Key Parameters for Small-Scale Training

| Parameter | Small-scale value | Notes |
|---|---|---|
| model.partial_pretrain | Qwen/Qwen2.5-1.5B-Instruct | Base model path |
| optim.lr | 5e-5 | Learning rate (a 1.5B model tolerates a slightly higher LR) |
| data.train_batch_size | 8 | Training batch size (can be larger for a small model) |
| data.micro_batch_size_per_gpu | 2 | Micro batch size per GPU (also used for validation) |
| data.max_length | 2048 | Max sequence length (prompt + response combined) |
| model.lora_rank | 64-128 | LoRA rank (64-128 balances quality and memory) |
| trainer.total_epochs | 2-3 | Epochs (keep low for small datasets) |

2.5 Validating the SFT Result

Method 1: basic inference test
python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')
tokenizer = AutoTokenizer.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')

# Inference test
test_prompts = [
    "Find me a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
]

print("="*60)
print("SFT model inference test")
print("="*60)

for prompt in test_prompts:
    print(f"\nInput: {prompt}")

    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"Output: {response[:200]}..." if len(response) > 200 else f"Output: {response}")

    # Check for action keywords
    response_lower = response.lower()
    action_keywords = ['search', 'click', 'buy']
    found_actions = [kw for kw in action_keywords if kw in response_lower]
    print(f"Action keywords found: {found_actions}")
Method 2: evaluate in the WebShop environment
python
from agent_system.environments.env_package.webshop import WebshopEnv

# Load the SFT model
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')
model = AutoModelForCausalLM.from_pretrained('./checkpoint/webshop_sft/global_step_XXX')

# Create the environment
env = WebshopEnv(use_small=True)  # use the small dataset

print("="*60)
print("SFT model environment evaluation")
print("="*60)

test_goals = [
    "Find a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
]

success_count = 0
total_reward = 0

for i, goal in enumerate(test_goals, 1):
    print(f"\n{'='*40}")
    print(f"Test {i}/{len(test_goals)}: {goal}")
    print('='*40)

    obs = env.reset(goal)
    done = False
    steps = 0
    info = {}
    trajectory = []

    while not done and steps < 15:
        # Generate an action
        inputs = tokenizer(obs, return_tensors='pt')
        outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.4)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Parse the action (simplified; extract_search_action and
        # extract_click_action are helper stubs not shown here)
        if 'search[' in response:
            action = extract_search_action(response)
        elif 'click[' in response:
            action = extract_click_action(response)
        elif 'buy' in response:
            action = 'buy now'
        else:
            action = 'think'  # default action

        # Execute the action
        obs, reward, done, info = env.step(action)
        trajectory.append({
            'step': steps,
            'action': action,
            'observation': obs[:100] if len(obs) > 100 else obs,
            'reward': reward
        })

        steps += 1
        total_reward += reward

    if info.get('success', False):
        success_count += 1
        print(f"✓ Success! Total reward: {total_reward:.2f}")
    else:
        print(f"✗ Failed. Reason: {info.get('fail_reason', 'unknown')}")

print(f"\n{'='*60}")
print(f"Result: {success_count}/{len(test_goals)} succeeded")
print(f"Average reward: {total_reward/len(test_goals):.2f}")
print('='*60)

Stage 3: RL Training (Skill Retrieval and Dynamic Updates)

3.1 Training Pipeline

┌─────────────────────────────────────────────────────────────┐
│  RL training main loop                                      │
│                                                             │
│  ┌──────────────────────────────────────────────────┐       │
│  │ 3.1 Data preprocessing                           │       │
│  │  Prepare train/validation data (parquet)         │       │
│  └──────────────────────────────────────────────────┘       │
│                       ↓                                     │
│  ┌──────────────────────────────────────────────────┐       │
│  │ 3.2 Skill retrieval and injection                │       │
│  │  Retrieve skills by task type, inject into prompt│       │
│  └──────────────────────────────────────────────────┘       │
│                       ↓                                     │
│  ┌──────────────────────────────────────────────────┐       │
│  │ 3.3 GRPO training loop                           │       │
│  │  Rollout → compute rewards → policy update       │       │
│  └──────────────────────────────────────────────────┘       │
│                       ↓                                     │
│  ┌──────────────────────────────────────────────────┐       │
│  │ 3.4 Validation and dynamic update                │       │
│  │  Validate → analyze failures → new skills → update bank  │
│  └──────────────────────────────────────────────────┘       │
│                       ↓                                     │
│  ┌──────────────────────────────────────────────────┐       │
│  │ 3.5 Save model and skill bank                    │       │
│  │  Save checkpoints and the updated skill bank     │       │
│  └──────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────┘

3.2 Data Preprocessing

Preparing training data

Use examples/data_preprocess/prepare.py:

bash
# Small-scale data
train_data_size=16
val_data_size=16

# Prepare the data
python3 -m examples.data_preprocess.prepare \
    --mode 'text' \
    --train_data_size $train_data_size \
    --val_data_size $val_data_size

This produces:

  • $HOME/data/verl-agent/text/train.parquet (training set)
  • $HOME/data/verl-agent/text/test.parquet (validation set)

Data format:

python
{
    "prompt": [{"role": "user", "content": "Find me a blue running shoe under $50"}],
    "ability": "agent",
    "extra_info": {"split": "train", "index": 0}
}
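Rows in that format can also be assembled by hand, e.g. when substituting your own goals for the prepared sets (the goal strings here are placeholders):

```python
import pandas as pd

goals = ["Find me a blue running shoe under $50",
         "Purchase a men's cotton shirt with size L"]

records = [
    {"prompt": [{"role": "user", "content": g}],
     "ability": "agent",
     "extra_info": {"split": "train", "index": i}}
    for i, g in enumerate(goals)
]
df = pd.DataFrame(records)
# df.to_parquet("train.parquet")  # writing parquet requires pyarrow or fastparquet
print(df.columns.tolist())  # → ['prompt', 'ability', 'extra_info']
```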

3.3 Skill Retrieval and Injection

Retrieval modes

Note: the actual examples/grpo_trainer/run_webshop_skills.sh script does not set retrieval_mode explicitly, so the code default template is used. To use embedding mode, add the extra config parameters.

Mode 1: template (default, zero latency)

Characteristics:

  • Matches the task type by keywords
  • Returns all task-specific skills for that type plus the top_k general skills
  • No GPU required

Configuration:

bash
+env.skills_only_memory.retrieval_mode=template
+env.skills_only_memory.top_k=6

Retrieval logic (skills_only_memory.py, lines 115-177):

python
def _detect_task_type(self, task_description: str) -> str:
    goal = task_description.lower()

    # WebShop category detection
    if any(kw in goal for kw in ['shirt', 'dress', 'jacket', 'pant']):
        return 'apparel'
    elif any(kw in goal for kw in ['shoe', 'boot', 'sneaker']):
        return 'footwear'
    # ...more categories

Mode 2: embedding (precise matching, low latency)

Characteristics:

  • Encodes the task description with Qwen3-Embedding-0.6B
  • Computes semantic similarity against every skill
  • Returns the top-K most relevant skills
  • Requires a GPU for the embedding computation

Configuration:

bash
+env.skills_only_memory.retrieval_mode=embedding
+env.skills_only_memory.embedding_model_path=Qwen/Qwen3-Embedding-0.6B
+env.skills_only_memory.top_k=6
+env.skills_only_memory.task_specific_top_k=5
3.4 Small-Scale RL Training Configuration

Launch command
bash
# Set the model path
export MODEL_PATH=./checkpoint/webshop_sft/global_step_XXX

# Run RL training
bash examples/grpo_trainer/run_webshop_skills.sh
Small-scale configuration example

Note: the following is a recommended single-GPU, small-scale configuration. The actual examples/grpo_trainer/run_webshop_skills.sh is a full 8-GPU configuration (n_gpus_per_node=8, tensor_model_parallel_size=4, total_epochs=150); adjust for your hardware. The section-header comment lines below are for readability only; remove them before running, since bash does not allow comment lines inside a line-continued command.

bash
python3 -m verl.trainer.main_ppo \
    # ============ Algorithm ============
    algorithm.adv_estimator=grpo \
    algorithm.use_kl_in_reward=False \
    \
    # ============ Data ============
    data.train_files=$HOME/data/verl-agent/text/train.parquet \
    data.val_files=$HOME/data/verl-agent/text/test.parquet \
    data.train_batch_size=16 \
    data.val_batch_size=16 \
    data.max_prompt_length=4096 \
    data.max_response_length=512 \
    data.filter_overlong_prompts=True \
    data.truncation='left' \
    data.return_raw_chat=True \
    \
    # ============ Model ============
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=True \
    \
    # ============ Distributed training (1.5B model) ============
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    \
    # ============ Rollout ============
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.max_num_batched_tokens=4096 \
    actor_rollout_ref.rollout.max_num_seqs=64 \
    actor_rollout_ref.rollout.val_kwargs.temperature=0.4 \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    \
    # ============ Environment ============
    env.env_name=Webshop \
    env.seed=0 \
    env.max_steps=15 \
    env.rollout.n=4 \
    env.resources_per_worker.num_cpus=0.1 \
    \
    # ============ Skill bank ============
    +env.use_skills_only_memory=True \
    +env.skills_only_memory.skills_json_path=$HOME/verl-agent/memory_data/webshop/claude_style_skills.json \
    +env.skills_only_memory.retrieval_mode=template \
    +env.skills_only_memory.top_k=4 \
    \
    # ============ Dynamic update ============
    +env.skills_only_memory.enable_dynamic_update=True \
    +env.skills_only_memory.update_threshold=0.5 \
    +env.skills_only_memory.max_new_skills=2 \
    \
    # ============ Trainer ============
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.project_name='verl_agent_webshop' \
    trainer.experiment_name='grpo_qwen1.5b_small_dynamic' \
    trainer.total_epochs=50 \
    trainer.save_freq=10 \
    trainer.test_freq=5 \
    trainer.val_before_train=True \
    trainer.logger='[console]' \
    trainer.default_local_dir=./checkpoint/webshop_rl_qwen15b_small
Key parameters for small-scale training

| Parameter | Small-scale value | Notes |
|---|---|---|
| **Algorithm** | | |
| algorithm.adv_estimator | grpo | Use the GRPO algorithm |
| env.rollout.n | 4 | GRPO group size (4-8 recommended for small models) |
| **Learning rates** | | |
| actor_rollout_ref.actor.optim.lr | 1e-6 | Policy learning rate |
| actor_rollout_ref.actor.kl_loss_coef | 0.01 | KL loss coefficient |
| **Data** | | |
| data.train_batch_size | 16 | Number of training samples |
| data.val_batch_size | 16 | Number of validation samples |
| data.max_prompt_length | 4096 | Max prompt length (leaves room for skill injection) |
| data.max_response_length | 512 | Max response length |
| **Skill bank** | | |
| skills_json_path | path to the skill bank | Required |
| retrieval_mode | template | Retrieval mode (template recommended for small models) |
| top_k | 4 | Number of general skills (reduce to 4 for small data) |
| enable_dynamic_update | True | Enable dynamic updates |
| update_threshold | 0.5 | Update threshold (can be raised for small data) |
| max_new_skills | 2 | Max new skills per update |

3.5 Validation and Dynamic Updates

The dynamic-update mechanism

Dynamic updates are triggered during validation (which runs every test_freq epochs):

python
# Location: verl/trainer/ppo/ray_trainer.py (around lines 837-918)

def _update_skills_from_validation(
    self,
    sample_inputs,      # validation inputs
    sample_outputs,     # validation outputs
    sample_scores,      # validation scores
    success_rate,       # success rate per task type
):
    """
    Dynamically update the skill bank based on validation results.
    """

    # Step 1: check whether an update is needed
    threshold = self.config.env.skills_only_memory.update_threshold
    needs_update = False
    low_success_tasks = []

    for task_key, rate in success_rate.items():
        if rate < threshold:
            needs_update = True
            task_type = task_key.replace('_success_rate', '')
            low_success_tasks.append(task_type)

    if not needs_update:
        print(f"[SkillUpdate] All task success rates above {threshold}")
        return

    # Step 2: collect failed trajectories
    failed_trajectories = self._collect_failed_trajectories(
        sample_inputs, sample_outputs, sample_scores
    )

    # Step 3: initialize the SkillUpdater (uses the teacher model)
    from agent_system.memory.skill_updater import SkillUpdater
    skill_updater = SkillUpdater(
        max_new_skills_per_update=self.config.env.skills_only_memory.max_new_skills
    )

    # Step 4: analyze failures and generate new skills
    new_skills = skill_updater.analyze_failures(
        failed_trajectories=failed_trajectories,
        current_skills=self.envs.retrieval_memory.skills,
    )

    # Step 5: add the new skills to the training environment
    if new_skills:
        self.envs.retrieval_memory.add_skills(new_skills, category='general')

        # Step 6: save the updated skill bank
        save_path = os.path.join(
            self.config.trainer.default_local_dir,
            f'updated_skills_step_{self.global_steps}.json'
        )
        self.envs.retrieval_memory.save_skills(save_path)
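`_collect_failed_trajectories` above is repo code whose exact output schema isn't shown here. As a rough sketch of what such a filter typically does, pairing up validation samples and keeping the low-scoring ones (the dict keys below are illustrative assumptions, not the repo's actual schema):

```python
def collect_failed_trajectories(sample_inputs, sample_outputs, sample_scores,
                                success_threshold=1.0):
    """Pair up validation samples and keep those that scored below the threshold."""
    failed = []
    for task, trajectory, score in zip(sample_inputs, sample_outputs, sample_scores):
        if score < success_threshold:
            # Keys are illustrative; adapt to whatever SkillUpdater expects
            failed.append({'task': task, 'trajectory': trajectory, 'score': score})
    return failed
```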

3.6 Training Output and Monitoring

Example training log
[SkillsOnlyMemory] Loaded skills: 15 general, 30 task-specific, 12 mistakes | retrieval_mode=template

[Step 0] Starting validation...
[Validation] apparel: 2/4 (50.0%)
[Validation] footwear: 3/4 (75.0%)
[Validation] electronics: 2/4 (50.0%)
[Validation] Average success rate: 0.58

[Step 0] epoch=0/50, reward=5.2, success_rate=0.58
[Step 0] policy_loss=0.345, kl_penalty=0.015, entropy=1.45

...

[Step 10] Validation: apparel=0.60, footwear=0.75, electronics=0.55
[Step 10] epoch=10/50, reward=7.8, success_rate=0.63

[SkillUpdate] Low success tasks: ['electronics'], triggering skill update...
[SkillUpdate] Analyzing 8 failed trajectories with o3...
[SkillUpdater] Generated 2 new skills: dyn_001, dyn_002
[SkillsOnlyMemory] Added skill: dyn_001 - Verify Technical Specs Before Purchase
[SkillsOnlyMemory] Added skill: dyn_002 - Check Price Range First
[SkillUpdate] Saved updated skill bank to ./checkpoint/updated_skills_step_10.json

[Step 11] epoch=11/50, reward=8.5, success_rate=0.70
Output file structure

checkpoint/webshop_rl_qwen15b_small/
├── actor/                              # trained policy model
│   ├── model.safetensors
│   ├── config.json
│   └── adapter_config.json             # LoRA adapter config
├── ref/                                # reference model (not updated)
│   └── model.safetensors
├── updated_skills_step_0.json          # skill bank updated at step 0
├── updated_skills_step_10.json         # skill bank updated at step 10
├── updated_skills_step_20.json         # skill bank updated at step 20
└── ...                                 # more checkpoints
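When resuming or evaluating, you often want the most recent of these skill-bank snapshots. A small helper that picks the highest-numbered `updated_skills_step_*.json` (the filename pattern follows the tree above):

```python
import glob
import os
import re

def latest_skill_bank(ckpt_dir):
    """Return the updated_skills_step_*.json with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in glob.glob(os.path.join(ckpt_dir, 'updated_skills_step_*.json')):
        m = re.search(r'updated_skills_step_(\d+)\.json$', path)
        if m and int(m.group(1)) > best_step:
            best_step, best_path = int(m.group(1)), path
    return best_path
```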

Evaluating Training Results

This section explains how to inspect model performance and skill effectiveness during training and during data construction.

4.1 Inspecting Model Performance

Method 1: Real-time monitoring during training (recommended, most direct)

The training log reports key metrics in real time, so you can see how the model is doing immediately:

Key metrics

  • reward: average reward (higher is better)
  • success_rate: task success rate (between 0 and 1)
  • policy_loss: policy loss (should decrease over time)
  • kl_penalty: KL-divergence penalty (keeps the policy close to the reference model)
  • entropy: output entropy (maintains diversity)

How to read them

python
# Example training log
[Step 0] epoch=0/50, reward=5.2, success_rate=0.58, policy_loss=0.345, kl_penalty=0.015
[Step 10] epoch=10/50, reward=7.8, success_rate=0.63, policy_loss=0.251, kl_penalty=0.012
[Step 20] epoch=20/50, reward=9.5, success_rate=0.71, policy_loss=0.189, kl_penalty=0.009

# Interpretation:
# 1. reward rises from 5.2 to 9.5 → the model's capability is improving
# 2. success_rate rises from 58% to 71% → task completion improves markedly
# 3. policy_loss decreases → the model is learning an effective policy
# 4. kl_penalty decreases → the model stays a reasonable distance from the reference model

Real-time monitoring script

python
# Training-log parsing script
import re
import matplotlib.pyplot as plt

log_file = "checkpoint/webshop_rl_qwen15b_small/logs/training.log"

# Extract key metrics
metrics = []
with open(log_file, 'r') as f:
    for line in f:
        # Extract reward and success_rate
        reward_match = re.search(r'reward=([\d.]+)', line)
        sr_match = re.search(r'success_rate=([\d.]+)', line)
        policy_match = re.search(r'policy_loss=([\d.]+)', line)

        if reward_match and sr_match:
            step = re.search(r'\[Step (\d+)\]', line)
            if step:
                metrics.append({
                    'step': int(step.group(1)),
                    'reward': float(reward_match.group(1)),
                    'success_rate': float(sr_match.group(1)),
                    'policy_loss': float(policy_match.group(1)) if policy_match else 0
                })

# Plot the curves
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

steps = [m['step'] for m in metrics]
rewards = [m['reward'] for m in metrics]
success_rates = [m['success_rate'] for m in metrics]
policy_losses = [m['policy_loss'] for m in metrics]

# Reward curve
axes[0].plot(steps, rewards, marker='o', linewidth=2, color='blue')
axes[0].set_xlabel('Training Steps', fontsize=12)
axes[0].set_ylabel('Average Reward', fontsize=12)
axes[0].set_title('Reward Progress (↑ better)', fontsize=14)
axes[0].grid(True)

# Success-rate curve
axes[1].plot(steps, success_rates, marker='s', linewidth=2, color='green')
axes[1].set_xlabel('Training Steps', fontsize=12)
axes[1].set_ylabel('Success Rate', fontsize=12)
axes[1].set_title('Success Rate Progress (↑ better)', fontsize=14)
axes[1].grid(True)
axes[1].axhline(y=0.5, color='r', linestyle='--', label='Random Baseline')
axes[1].legend()

# Policy-loss curve
axes[2].plot(steps, policy_losses, marker='^', linewidth=2, color='red')
axes[2].set_xlabel('Training Steps', fontsize=12)
axes[2].set_ylabel('Policy Loss', fontsize=12)
axes[2].set_title('Policy Loss Progress (↓ better)', fontsize=14)
axes[2].grid(True)

plt.tight_layout()
plt.savefig('training_curves.png', dpi=150)
print("Training curves saved to training_curves.png")
Method 2: Comparing models across training stages (post-training evaluation)

Compare models from different training stages to see how capability improves:

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from agent_system.environments.env_package.webshop import WebshopEnv

# Load models from different training stages
models_to_test = {
    "SFT model": "./checkpoint/webshop_sft/global_step_XXX",
    "RL-Step10": "./checkpoint/webshop_rl_qwen15b_small/global_step_10",
    "RL-Step30": "./checkpoint/webshop_rl_qwen15b_small/global_step_30",
    "RL-Step50": "./checkpoint/webshop_rl_qwen15b_small/global_step_50",
}

# Load the skill bank
from agent_system.memory.skills_only_memory import SkillsOnlyMemory
skill_memory = SkillsOnlyMemory(
    'memory_data/webshop/claude_style_skills.json',
    retrieval_mode='template'
)

# Test task set
test_tasks = [
    "Find a blue running shoe under $50",
    "Purchase a men's cotton shirt with size L",
    "Get a laptop under $800 with 16GB RAM",
    "Buy a red dress for summer",
    "Find black leather boots under $100",
]

results = {}

for model_name, model_path in models_to_test.items():
    print(f"\n{'='*60}")
    print(f"Testing model: {model_name}")
    print('='*60)

    # Load the model
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Create the environment
    env = WebshopEnv(use_small=True)

    # Run the tests
    success_count = 0
    total_steps = 0
    total_reward = 0

    for task in test_tasks:
        obs = env.reset(task)
        done = False
        steps = 0
        task_reward = 0

        while not done and steps < 15:
            # Retrieve and inject skills
            skills = skill_memory.retrieve(task, top_k=4)
            skill_text = skill_memory.format_for_prompt(skills)

            # Build the skill-augmented prompt
            enhanced_prompt = f"{skill_text}\n\nTask: {task}\nObservation: {obs}"

            # Generate an action
            inputs = tokenizer(enhanced_prompt, return_tensors='pt')
            outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.4)
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Parse the action (parse_action is a user-supplied parser for your response format)
            action = parse_action(response)

            # Execute
            obs, reward, done, info = env.step(action)
            task_reward += reward
            steps += 1

        if info.get('success', False):
            success_count += 1

        total_steps += steps
        total_reward += task_reward

    # Record results
    results[model_name] = {
        'success_rate': success_count / len(test_tasks),
        'avg_steps': total_steps / len(test_tasks),
        'avg_reward': total_reward / len(test_tasks)
    }

    print(f"Success rate: {success_count}/{len(test_tasks)} ({success_count/len(test_tasks)*100:.1f}%)")
    print(f"Avg steps: {total_steps/len(test_tasks):.1f}")
    print(f"Avg reward: {total_reward/len(test_tasks):.2f}")

# Compare results
print(f"\n{'='*60}")
print("Model comparison")
print('='*60)

print(f"{'Model':20s} {'Success':10s} {'Avg steps':12s} {'Avg reward':12s}")
print('-'*60)

for model_name, metrics in results.items():
    print(f"{model_name:20s} {metrics['success_rate']*100:6.1f}%       {metrics['avg_steps']:10.1f}      {metrics['avg_reward']:10.2f}")
Method 3: Ablation (the effect of skills)
python
# Compare: with skills vs. without skills

# Load the model
model = AutoModelForCausalLM.from_pretrained("./checkpoint/webshop_rl_qwen15b_small/global_step_50")
tokenizer = AutoTokenizer.from_pretrained("./checkpoint/webshop_rl_qwen15b_small/global_step_50")

# Test configurations
configs = {
    "with skill bank": {"use_skills": True, "top_k": 4},
    "without skill bank": {"use_skills": False, "top_k": 0},
    "top-2 skills": {"use_skills": True, "top_k": 2},
}

test_tasks = [ ... ]  # same as above

for config_name, config in configs.items():
    print(f"\n{'='*60}")
    print(f"Config: {config_name}")
    print('='*60)

    env = WebshopEnv(use_small=True)
    success_count = 0

    for task in test_tasks:
        obs = env.reset(task)
        done = False
        steps = 0

        while not done and steps < 15:
            # Inject skills or not, depending on the configuration
            if config["use_skills"]:
                skills = skill_memory.retrieve(task, top_k=config["top_k"])
                skill_text = skill_memory.format_for_prompt(skills)
                prompt = f"{skill_text}\n\nTask: {task}\nObservation: {obs}"
            else:
                prompt = f"Task: {task}\nObservation: {obs}"

            # Generate an action
            inputs = tokenizer(prompt, return_tensors='pt')
            outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.4)
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)
            action = parse_action(response)

            # Execute
            obs, reward, done, info = env.step(action)
            steps += 1

        if info.get('success', False):
            success_count += 1

    print(f"Success rate: {success_count}/{len(test_tasks)} ({success_count/len(test_tasks)*100:.1f}%)")

4.2 Inspecting the Constructed Skills

Method 1: Skill-bank evolution analysis
python
# Track how the skill bank evolves across checkpoints

import json

# Read the skill bank from several checkpoints
skill_files = [
    "memory_data/webshop/claude_style_skills.json",
    "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_10.json",
    "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_20.json",
    "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_30.json",
]

skill_evolution = []

for i, skill_file in enumerate(skill_files):
    with open(skill_file, 'r') as f:
        skills = json.load(f)

    general_skills = skills.get('general_skills', [])
    skill_ids = [s['skill_id'] for s in general_skills]
    skill_evolution.append({
        'step': i * 10,
        'skill_count': len(general_skills),
        'skill_ids': skill_ids
    })

# Print the evolution of the skill bank
print("="*80)
print("Skill bank evolution")
print("="*80)

for evolution in skill_evolution:
    print(f"\nStep {evolution['step']:3d}: {evolution['skill_count']} general skills")
    ids = evolution['skill_ids']
    shown = ', '.join(ids[:10]) + (', ...' if len(ids) > 10 else '')
    print(f"  Skill IDs: {shown}")

# Count newly added skills
initial_skills = set(skill_evolution[0]['skill_ids'])
final_skills = set(skill_evolution[-1]['skill_ids'])
new_skills = final_skills - initial_skills

print(f"\n{'='*80}")
print("New-skill analysis")
print('='*80)
print(f"Initial skills: {len(initial_skills)}")
print(f"Final skills: {len(final_skills)}")
print(f"New skills: {len(new_skills)}")
print(f"\nNew skill list:")
for skill_id in sorted(new_skills):
    print(f"  - {skill_id}")
Method 2: Skill effectiveness evaluation
python
# Evaluate whether dynamically added skills actually improved performance

# Read the training log and find the performance change around each skill update
import re

log_file = "checkpoint/webshop_rl_qwen15b_small/logs/training.log"

with open(log_file, 'r') as f:
    log_content = f.read()

# Find the dynamic update points
update_matches = re.findall(r'\[SkillUpdate\] Saved updated skill bank to .*updated_skills_step_(\d+)\.json', log_content)
skill_updates = [int(m) for m in update_matches]

# Extract the success rate before and after each update
success_rate_pattern = r'\[Step (\d+)\].*?success_rate=([\d.]+)'
all_matches = re.findall(success_rate_pattern, log_content)

improvements = []

for update_step in skill_updates:
    # Last success rate at or before the update
    before_metrics = [m for m in all_matches if int(m[0]) <= update_step]
    before_sr = float(before_metrics[-1][1]) if before_metrics else 0.0

    # First success rate after the update
    after_metrics = [m for m in all_matches if int(m[0]) > update_step]
    after_sr = float(after_metrics[0][1]) if after_metrics else 0.0

    improvements.append({
        'update_step': update_step,
        'before_sr': before_sr,
        'after_sr': after_sr,
        'improvement': after_sr - before_sr
    })

# Print the results
print("="*80)
print("Dynamic update effectiveness")
print("="*80)
print(f"{'Update step':12s} {'SR before':15s} {'SR after':15s} {'Delta':10s}")
print('-'*80)

for imp in improvements:
    print(f"{imp['update_step']:12d}      {imp['before_sr']*100:6.1f}%         {imp['after_sr']*100:6.1f}%        {imp['improvement']*100:+6.1f}%")

total_improvement = sum(imp['improvement'] for imp in improvements)
avg_improvement = total_improvement / len(improvements) if improvements else 0

print(f"\nTotal improvement: {total_improvement*100:+.1f}%")
print(f"Average per update: {avg_improvement*100:+.1f}%")
Method 3: Skill similarity analysis
python
# Check whether newly added skills duplicate existing ones

import json
from difflib import SequenceMatcher

def skill_similarity(skill1, skill2):
    """Compute a text-similarity ratio between two skills."""
    text1 = skill1['title'] + ' ' + skill1['principle']
    text2 = skill2['title'] + ' ' + skill2['principle']
    return SequenceMatcher(None, text1, text2).ratio()

# Read the initial and final skill banks
with open('memory_data/webshop/claude_style_skills.json', 'r') as f:
    initial_skills = json.load(f)

final_skill_file = "checkpoint/webshop_rl_qwen15b_small/updated_skills_step_50.json"
with open(final_skill_file, 'r') as f:
    final_skills = json.load(f)

# Find the newly added skills
initial_ids = set(s['skill_id'] for s in initial_skills['general_skills'])
final_ids = set(s['skill_id'] for s in final_skills['general_skills'])
new_skill_ids = final_ids - initial_ids

new_skills = [s for s in final_skills['general_skills'] if s['skill_id'] in new_skill_ids]

# Compute similarity against the initial skills
print("="*80)
print("Similarity between new and existing skills")
print("="*80)

for new_skill in new_skills:
    max_sim = 0.0
    most_similar_id = None

    for old_skill in initial_skills['general_skills']:
        sim = skill_similarity(new_skill, old_skill)
        if sim > max_sim:
            max_sim = sim
            most_similar_id = old_skill['skill_id']

    print(f"\n[{new_skill['skill_id']}] {new_skill['title']}")
    print(f"  Principle: {new_skill['principle'][:60]}...")
    print(f"  Most similar skill: {most_similar_id} (similarity: {max_sim*100:.1f}%)")

    if max_sim > 0.8:
        print("  ⚠ Warning: highly similar to an existing skill")
    elif max_sim > 0.5:
        print("  ℹ Info: partially similar to an existing skill")
    else:
        print("  ✓ Genuinely new skill")

Small-Scale Training Configuration (1.5B Model)

5.1 Data Preparation Notes

Memory Data preparation

Target size: 20-50 trajectories

Things to watch

  1. Success-rate control: aim for a 60-80% success rate

    python
    success_count = sum(1 for m in memories if m['tags']['outcome'] == 'Success')
    total = len(memories)
    success_rate = success_count / total
    
    if success_rate < 0.6:
        print("Warning: success rate too low (<60%)")
    elif success_rate > 0.9:
        print("Warning: success rate too high (>90%), too few failure samples")
  2. Task-type balance: make sure every product type is represented

    python
    from collections import Counter
    
    # detect_type is a user-supplied classifier mapping a memory to a product type
    types = [detect_type(m) for m in memories]
    type_dist = Counter(types)
    
    print("Product type distribution:")
    for t, count in type_dist.items():
        print(f"  {t}: {count} ({count/len(types)*100:.1f}%)")
    
    # Check balance
    min_count = min(type_dist.values())
    max_count = max(type_dist.values())
    if max_count / min_count > 3:
        print("⚠ Warning: unbalanced type distribution")
  3. Constraint diversity: cover price, size, color, and other constraints

    python
    constraint_types = set()
    
    for m in memories:
        goal = m['content']['task_meta']['original_goal'].lower()
        goal_words = goal.split()
        if 'price' in goal or '$' in goal:
            constraint_types.add('price')
        # Match size letters as whole words so that e.g. 'shoes' doesn't count as 's'
        if 'size' in goal or any(s in goal_words for s in ['s', 'm', 'l', 'xl']):
            constraint_types.add('size')
        if 'color' in goal or any(c in goal for c in ['red', 'blue', 'black', 'white']):
            constraint_types.add('color')
    
    print(f"Constraint type coverage: {', '.join(sorted(constraint_types))}")
    
    if len(constraint_types) < 3:
        print("⚠ Warning: not enough constraint variety")
SFT data preparation

Target size: 100-200 samples

Things to watch

  1. Prompt quality: prompts should be clear and consistently formatted

    python
    # Check the prompt format
    for item in sft_data[:10]:
        prompt = item['prompt'][0]['content']
    
        # Length check
        if len(prompt) > 500:
            print(f"⚠ Prompt too long: {len(prompt)} characters")
    
        # Format check
        if '?' not in prompt and 'Find' not in prompt and 'Purchase' not in prompt:
            print(f"⚠ Non-standard prompt format: {prompt[:50]}...")
  2. Response completeness: responses should contain the full decision process

    python
    # Check response quality
    for item in sft_data[:10]:
        response = item['response']
        response_lower = response.lower()
    
        # Check for action keywords
        action_keywords = ['search', 'click', 'buy', 'select', 'choose']
        found_actions = [kw for kw in action_keywords if kw in response_lower]
    
        if len(found_actions) < 2:
            print(f"⚠ Incomplete actions in response: {response[:100]}...")
    
        # Check for a reasoning trace
        if 'because' not in response_lower and 'since' not in response_lower:
            print(f"ℹ Missing reasoning connectives: {response[:100]}...")
  3. Data consistency: each prompt and its response should match

    python
    # Check prompt-response pairing
    mismatched = 0
    for item in sft_data:
        prompt = item['prompt'][0]['content'].lower()
        response = item['response'].lower()
    
        # Extract constraints mentioned in the prompt
        prompt_constraints = []
        if '$' in prompt:
            prompt_constraints.append('price')
        # Match size letters as whole words rather than raw substrings
        if 'size' in prompt or any(s in prompt.split() for s in ['s', 'm', 'l', 'xl']):
            prompt_constraints.append('size')
    
        # Crude keyword heuristic: does the response mention these constraints?
        response_actions = [kw for kw in prompt_constraints if kw in response]
    
        if len(response_actions) < len(prompt_constraints):
            mismatched += 1
            print(f"⚠ Constraint not addressed: {item['prompt'][0]['content'][:50]}...")
    
    if mismatched > 0:
        print(f"\n⚠ {mismatched} prompt-response pairs mismatched in total")

5.2 Skill-Bank Construction Notes

When generating the skill bank
  1. Teacher model choice

    • Recommended: Qwen3.6-Flash (good cost-performance, fast)
    • Alternative: Qwen2.5-7B-Instruct (higher quality, slightly slower)
    • Not recommended: Azure OpenAI (expensive, tight quota limits)
  2. Skill count control

    python
    # Suggested skill counts for small-scale training
    general_skills = 12        # general skills (10-15 recommended)
    task_specific_skills = 25  # task-specific skills in total (20-30 recommended)
    common_mistakes = 10       # common mistakes (8-12 recommended)
    
    print("Skill bank size:")
    print(f"  General skills: {general_skills}")
    print(f"  Task-specific skills: {task_specific_skills}")
    print(f"  Common mistakes: {common_mistakes}")
    print(f"  Total: {general_skills + task_specific_skills + common_mistakes}")
  3. Skill quality checks

    python
    # Check the quality of generated skills
    def check_skill_quality(skill):
        issues = []
    
        # 1. Length check
        principle = skill.get('principle', '')
        words = principle.split()
        if len(words) > 15:
            issues.append("principle too long")
        elif len(words) < 3:
            issues.append("principle too short")
    
        # 2. Actionability check
        action_verbs = ['search', 'click', 'select', 'verify', 'check', 'choose', 'buy', 'purchase']
        if not any(v in principle.lower() for v in action_verbs):
            issues.append("no clear action")
    
        # 3. Specificity check
        vague_words = ['maybe', 'possibly', 'sometimes', 'often', 'usually']
        if any(w in principle.lower() for w in vague_words):
            issues.append("contains vague wording")
    
        # 4. Redundancy check
        redundancy_patterns = ['search and search', 'check and check', 'verify and verify']
        if any(p in principle.lower() for p in redundancy_patterns):
            issues.append("contains redundant actions")
    
        return issues
    
    # Check all skills
    all_issues = []
    for skill in skills['general_skills']:
        issues = check_skill_quality(skill)
        if issues:
            all_issues.append((skill['skill_id'], issues))
    
    print(f"Skills with quality issues: {len(all_issues)}")
    for skill_id, issues in all_issues[:5]:
        print(f"  [{skill_id}]: {', '.join(issues)}")

5.3 RL Training Notes

Config adjustments for small-scale training
  1. GRPO group size

    bash
    # Small models
    env.rollout.n=4    # saves GPU memory
    # Larger models
    env.rollout.n=8    # improves stability
  2. Learning rate

    bash
    # Small-scale data (lower overfitting risk per step)
    actor_rollout_ref.actor.optim.lr=2e-6  # can be slightly higher
    
    # Small model (1.5B)
    actor_rollout_ref.actor.kl_loss_coef=0.005  # relax the KL constraint
  3. GPU memory optimization

    bash
    # Enable gradient checkpointing
    actor_rollout_ref.model.enable_gradient_checkpointing=True
    
    # Enable parameter offload
    actor_rollout_ref.actor.fsdp_config.param_offload=True
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
    
    # Reduce the batch size
    data.train_batch_size=16 → data.train_batch_size=8
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
  4. Number of training epochs

    bash
    # Don't over-train on a small dataset
    trainer.total_epochs=30-50  # prevents overfitting
    
    # Validation frequency
    trainer.test_freq=5  # validate often so you can adjust early
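Since skill injection lengthens every prompt, it is worth checking that the augmented prompts still fit within data.max_prompt_length=4096. A small budget-check sketch; pass in a token-counting function (e.g. `lambda p: len(tokenizer(p)['input_ids'])` with your HF tokenizer), since the counting function itself is an assumption here:

```python
def check_prompt_budget(prompts, count_tokens, max_prompt_length=4096, margin=256):
    """Flag prompts whose token count exceeds max_prompt_length minus a safety margin."""
    budget = max_prompt_length - margin
    over = [(i, n) for i, n in ((i, count_tokens(p)) for i, p in enumerate(prompts))
            if n > budget]
    for i, n in over:
        print(f"Prompt {i}: {n} tokens exceeds budget {budget}")
    return over
```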

Environment Setup and API Configuration

6.1 Installing Dependencies

bash
# Clone the repository
git clone https://github.com/aiming-lab/SkillRL.git
cd SkillRL

# Install base dependencies
pip install -r requirements.txt
pip install vllm==0.11.0
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
pip install -e .

# Install the OpenAI client (for the teacher model)
pip install openai

6.2 Environment Setup

WebShop environment
bash
cd agent_system/environments/env_package/webshop
./setup.sh -d small  # small dataset (quick experiments)
# or
./setup.sh -d all    # full dataset (full training runs)
ALFWorld environment
bash
pip install alfworld
pip install gymnasium==0.29.1
pip install stable-baselines3==2.6.0
alfworld-download -f  # download game files and detectors
Search environment
bash
cd agent_system/environments/env_package/search/third_party
pip install -e .
pip install gym==0.26.2

6.3 API Configuration

Using the Qwen API (recommended; good cost-performance)
bash
# Set the Qwen API key (for the teacher model)
export QWEN_API_KEY="your_qwen_api_key"

# If using Azure OpenAI
export AZURE_OPENAI_API_KEY="your_azure_api_key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2025-01-01-preview"
Testing the API connection
bash
# Test the Qwen API
python -c "
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('QWEN_API_KEY'),
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)

response = client.chat.completions.create(
    model='qwen3.6-flash',
    messages=[{'role': 'user', 'content': 'Hello'}],
)

print(response.choices[0].message.content)
"

Common Problems and Solutions

7.1 Memory Data Generation Stage

Problem 1: API calls fail

Error: Connection timeout / Authentication failed

Solution

bash
# Check network connectivity
ping dashscope.aliyuncs.com

# Verify the API key
echo $QWEN_API_KEY  # should print your key

# Check your quota
# Log in to the Alibaba Cloud console to check API usage
Problem 2: Wrong number of generated skills

Symptom: expected 15 general skills, got only 10

Solution

python
# Check the input memory data format
import json
with open('memory_data/webshop/generated_memories_webshop_100.json') as f:
    memories = json.load(f)
    print(f"Total memories: {len(memories)}")
    print(f"Success: {sum(1 for m in memories if m['tags']['outcome'] == 'Success')}")
    print(f"Failure: {sum(1 for m in memories if m['tags']['outcome'] == 'Failure')}")

# Check the generation script's logs
# The skill-generation script prints the output of each stage

7.2 SFT Training Stage

Problem 1: Out of GPU memory (OOM)

Error: CUDA out of memory

Solution

bash
# Reduce the batch size
data.train_batch_size=8 → data.train_batch_size=4

# Reduce max_length (total prompt+response length in the SFT trainer)
data.max_length=2048 → data.max_length=1024

# Reduce the LoRA rank (fewer parameters, less GPU memory)
model.lora_rank=64 → model.lora_rank=32

# Enable offload
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
Problem 2: Training does not converge

Symptom: the loss does not decrease, or oscillates

Solution

bash
# Lower the learning rate
actor_rollout_ref.actor.optim.lr=1e-6 → actor_rollout_ref.actor.optim.lr=5e-7

# Increase the KL-divergence coefficient
actor_rollout_ref.actor.kl_loss_coef=0.01 → actor_rollout_ref.actor.kl_loss_coef=0.05

# Adjust the sampling temperature
actor_rollout_ref.rollout.val_kwargs.temperature=0.4 → actor_rollout_ref.rollout.val_kwargs.temperature=0.2

# Check data quality
# Make sure the training data is correctly formatted and sufficiently diverse

7.3 RL Training Stage

Problem 1: Skill retrieval fails

Error: Task type not detected

Solution

bash
# Check skills_json_path
+env.skills_only_memory.skills_json_path=<correct path to your skill bank>

# Check the task description format
# Make sure task descriptions contain clear product-type keywords

# Adjust the _detect_task_type function (skills_only_memory.py, line 115)
# Add more keywords to cover your task types
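A keyword-based detector along the lines of `_detect_task_type` might look like the sketch below; the keyword lists and the `'general'` fallback are illustrative assumptions, not the repo's actual mapping, so extend them to match your tasks:

```python
# Illustrative keyword map; extend to cover your task types
TASK_TYPE_KEYWORDS = {
    'footwear': ['shoe', 'boot', 'sneaker', 'sandal'],
    'apparel': ['shirt', 'dress', 'jacket', 'pants'],
    'electronics': ['laptop', 'phone', 'headphone', 'monitor'],
}

def detect_task_type(task_description, default='general'):
    """Return the first task type whose keywords appear in the description."""
    text = task_description.lower()
    for task_type, keywords in TASK_TYPE_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return task_type
    return default
```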
Problem 2: Dynamic updates never trigger

Symptom: the log shows [SkillUpdate] All task success rates above threshold

Solution

bash
# Lower update_threshold
+env.skills_only_memory.update_threshold=0.4 → +env.skills_only_memory.update_threshold=0.2

# Check task-type detection
# Make sure different task types are correctly identified

# Check failed-trajectory collection
# Make sure the validation set contains enough failure samples
Problem 3: Teacher-model API calls fail

Error: [SkillUpdater] Error calling o3: API rate limit exceeded

Solution

bash
# Use Qwen instead of o3 (cheaper, higher quota)
# Modify skill_updater.py to use the Qwen client

# Reduce max_new_skills
+env.skills_only_memory.max_new_skills=3 → +env.skills_only_memory.max_new_skills=1

# Increase test_freq
trainer.test_freq=5 → trainer.test_freq=10
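Besides lowering the call rate, transient rate-limit errors are usually handled with retries. A generic exponential-backoff wrapper you could put around the teacher-model call (the broad `except Exception` is for illustration; in practice catch the client's specific rate-limit error):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff plus jitter on transient API errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # narrow to the client's RateLimitError in practice
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            print(f"API call failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```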

Complete Example Commands

Example 1: Full small-scale WebShop training run (template mode + dynamic updates)

bash
# ============ Step 1: environment setup ============
export QWEN_API_KEY="your_qwen_api_key"
export MODEL_PATH=./checkpoint/webshop_sft/global_step_XXX

# ============ Step 2: data preparation ============
# Small-scale data
python3 -m examples.data_preprocess.prepare \
    --mode 'text' \
    --train_data_size 16 \
    --val_data_size 16

# ============ Step 3: RL training ============
bash examples/grpo_trainer/run_webshop_skills.sh

# Or run directly (the contents of run_webshop_skills.sh)
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/verl-agent/text/train.parquet \
    data.val_files=$HOME/data/verl-agent/text/test.parquet \
    data.train_batch_size=16 \
    data.val_batch_size=16 \
    data.max_prompt_length=4096 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=2e-6 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.005 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    env.env_name=Webshop \
    env.rollout.n=4 \
    +env.use_skills_only_memory=True \
    +env.skills_only_memory.skills_json_path=$HOME/verl-agent/memory_data/webshop/claude_style_skills.json \
    +env.skills_only_memory.retrieval_mode=template \
    +env.skills_only_memory.top_k=4 \
    +env.skills_only_memory.enable_dynamic_update=True \
    +env.skills_only_memory.update_threshold=0.5 \
    +env.skills_only_memory.max_new_skills=2 \
    trainer.total_epochs=50 \
    trainer.save_freq=10 \
    trainer.test_freq=5 \
    trainer.val_before_train=True \
    trainer.logger='[console]' \
    trainer.default_local_dir=./checkpoint/webshop_rl_qwen15b_small

Example 2: Quick experiment (dynamic updates disabled)

bash
python3 -m verl.trainer.main_ppo \
    # ... other settings as above ...
    \
    # Disable dynamic updates; use the static skill bank
    +env.skills_only_memory.enable_dynamic_update=False \
    \
    # Fewer training epochs
    trainer.total_epochs=30 \
    trainer.test_freq=10 \
    \
    # ... other settings ...

Summary

The complete SkillRL training pipeline

Stage  Input  Operation  Output
Memory Data Generation  task trajectory data  teacher model analyzes trajectories  hierarchical skill bank
SFT  base model + SFT data  supervised fine-tuning  SFT model
RL  SFT model + skill bank  GRPO training + dynamic updates  final model + updated skill bank

What the teacher model does

Stage  Role  Distillation?
Trajectory construction  generates full task trajectories (action + observation + reasoning)  direct generation, no distillation
Skill extraction  extracts reusable behavior patterns from trajectories  direct analysis, no distillation
Dynamic updates  analyzes RL failure cases and generates new skills  direct generation, no distillation
Model training  not involved  -

Key takeaways

  • The teacher model only generates training data (trajectories, skills); it never takes part in parameter updates
  • Model improvement comes from:
    • SFT stage: learning the action patterns in the trajectories
    • RL stage: optimizing the policy with reward signals
    • Skill injection: prior knowledge that speeds up learning
  • Dynamic skill-bank updates happen during RL training: the teacher model analyzes failure cases and adds new skills

Key technical features

  1. Hierarchical skill bank

    • General Skills: principles that apply across all tasks
    • Task-Specific Skills: skills for specific task types
    • Common Mistakes: frequent errors and how to avoid them
  2. Dual retrieval modes

    • Template mode: keyword matching, zero latency
    • Embedding mode: semantic similarity, more precise matching
  3. Dynamic update mechanism

    • Automatically analyzes validation failures
    • Uses the teacher model to generate new skills
    • Updates the training environment's skill bank on the fly
  4. Small-scale training recommendations

    • Data sizes: 20-50 Memory trajectories, 100-200 SFT samples, 16-32 RL samples
    • Skill bank: 10-15 general skills
    • Epochs: 30-50
    • Validation: every 5 epochs
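The template retrieval mode described above (keyword matching, zero latency) can be sketched as a simple overlap score between the task text and per-skill keywords. The `keywords` field and the overlap-count scoring rule are assumptions for illustration, not the repo's actual template logic:

```python
def template_retrieve(task, skills, top_k=4):
    """Rank skills by keyword overlap with the task description; return the top_k."""
    words = set(task.lower().split())

    def score(skill):
        keywords = set(k.lower() for k in skill.get('keywords', []))
        return len(words & keywords)

    ranked = sorted(skills, key=score, reverse=True)
    # Drop skills with no overlap at all
    return [s for s in ranked[:top_k] if score(s) > 0]
```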

Key file locations

  • skill_generation/*.py - skill generation scripts
  • agent_system/memory/skills_only_memory.py - skill retrieval system
  • agent_system/memory/skill_updater.py - dynamic skill updates
  • verl/trainer/ppo/ray_trainer.py - main RL training loop (includes the dynamic-update logic)
  • verl/trainer/main_ppo.py - RL training entry point
  • verl/trainer/fsdp_sft_trainer.py - SFT trainer (built into verl)
  • qwen_8b.py - Qwen API test script (for reference only; it hard-codes an API key, so do not use it as-is)

Usage recommendations

  • Small models (1.5B): use template mode (zero latency)
  • Larger models (7B+): embedding mode works well (more precise)
  • Limited GPU memory: reduce the batch size and enable offload
  • Quick experiments: disable dynamic updates and use the static skill bank
  • Small-scale training: limit the data size and epoch count to avoid overfitting

Generated: 2026-05-13
Codebase version: SkillRL-main
Purpose: complete training guide - full SkillBank dynamic-update pipeline + small-scale training notes
