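Filtering System-Tag Noise out of AI Coding JSONL

Audience: developers who want to understand the details of the Claude Code data format, or anyone who has run into problems parsing its JSONL. Estimated reading time: 6 minutes.

The problem

In a Claude Code JSONL file, each line is one JSON object, but not every line is useful conversation content. A large amount of system-generated noise is mixed in, and if it is not filtered out, the notes generated from the file will be of poor quality.

What's in the JSONL

In a typical Claude Code JSONL file you may see these message types: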

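Type	Useful?	Description
`user`	✅ Useful	User input
`assistant`	✅ Useful	AI reply
`system`	❌ Noise	System messages (api_error, compact_boundary, etc.)
`file-history-snapshot`	❌ Noise	File history snapshot
`progress` / `agent_progress`	❌ Noise	Progress information
`tool_use`	❌ Noise	Streaming fragment of a tool call
`tool_result`	❌ Noise	Streaming fragment of a tool result
`thinking`	❌ Noise	Streaming fragment of the thinking process
`text`	❌ Noise	Streaming text fragment
Lines without a uuid	❌ Noise	Intermediate streaming state

Only complete `user` and `assistant` messages (those that have a uuid) are useful.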
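
Before filtering, it can help to see how much of a file is noise. The following is a minimal sketch, not ChatCrystal's code: `RawLine`, `parseJsonl`, and `countByType` are hypothetical names, and the sample lines are invented for illustration. It parses JSONL text line by line, skips blank or malformed lines, and counts lines per `type`.

```typescript
// Hypothetical minimal shape; real Claude Code lines carry more fields.
interface RawLine {
  type?: string;
  uuid?: string;
}

// Parse JSONL text into objects, skipping blank or malformed lines.
function parseJsonl(text: string): RawLine[] {
  const rows: RawLine[] = [];
  for (const raw of text.split('\n')) {
    const line = raw.trim();
    if (line === '') continue;
    try {
      const parsed: unknown = JSON.parse(line);
      // Keep only plain JSON objects (not arrays, numbers, strings, null).
      if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
        rows.push(parsed as RawLine);
      }
    } catch {
      // A truncated or partially written line: ignore it.
    }
  }
  return rows;
}

// Count lines per type; lines with no type are grouped under '(none)'.
function countByType(rows: RawLine[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    const key = row.type ?? '(none)';
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Invented sample data, for illustration only.
const sample = [
  '{"type":"user","uuid":"u1"}',
  '{"type":"progress"}',
  '{"type":"assistant","uuid":"a1"}',
  '{"type":"progress"}',
  'not json',
].join('\n');

const counts = countByType(parseJsonl(sample));
console.log(counts); // { user: 1, progress: 2, assistant: 1 }
```

Swallowing parse errors is a deliberate choice in this sketch: a session file that is still being written may end in a partial line, and one bad line should not abort the whole import.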

ChatCrystal's filtering logic


const SKIP_TYPES = new Set([
  'file-history-snapshot',
  'last-prompt',
  'progress',
  'agent_progress',
  'hook_progress',
  'queue-operation',
  'message',      // streaming delta
  'tool_use',     // streaming tool delta
  'tool_result',  // streaming tool result delta
  'thinking',     // streaming thinking delta
  'text',         // streaming text delta
  'tool_reference',
]);

function isRelevantMessage(line: RawMessage): boolean {
  if (!line.uuid) return false;           // drop streaming fragments
  if (line.type && SKIP_TYPES.has(line.type)) return false;
  if (line.type === 'system') return false;
  return ['user', 'assistant'].includes(line.type ?? '');
}

Decision flow:


Has a uuid?
  → No: skip (streaming fragment)
  → Yes: is the type in SKIP_TYPES?
    → Yes: skip
    → No: is it system?
      → Yes: skip
      → No: is it user or assistant?
        → Yes: keep
        → No: skip
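
To see the decision flow in action, here is a self-contained sketch that runs the same checks over a few invented sample lines. The `RawMessage` interface is a hypothetical minimal type (only the two fields the filter reads), and `isRelevantCompact` is a hypothetical helper added purely for comparison. Because the last step is an allowlist, the outcome is determined by the uuid check and the allowlist alone; the `SKIP_TYPES` and `system` checks act as early exits that also document the known noise types.

```typescript
// Hypothetical minimal type: only the two fields the filter reads.
interface RawMessage {
  type?: string;
  uuid?: string;
}

const SKIP_TYPES = new Set<string>([
  'file-history-snapshot', 'last-prompt', 'progress', 'agent_progress',
  'hook_progress', 'queue-operation', 'message', 'tool_use',
  'tool_result', 'thinking', 'text', 'tool_reference',
]);

// Same checks, in the same order, as the decision flow above.
function isRelevantMessage(line: RawMessage): boolean {
  if (!line.uuid) return false;
  if (line.type && SKIP_TYPES.has(line.type)) return false;
  if (line.type === 'system') return false;
  return line.type === 'user' || line.type === 'assistant';
}

// Hypothetical compact form: uuid check plus allowlist only.
function isRelevantCompact(line: RawMessage): boolean {
  return Boolean(line.uuid) && (line.type === 'user' || line.type === 'assistant');
}

// Invented sample lines covering each branch of the flow.
const samples: RawMessage[] = [
  { type: 'user', uuid: 'u1' },      // kept
  { type: 'assistant', uuid: 'a1' }, // kept
  { type: 'assistant' },             // dropped: no uuid
  { type: 'system', uuid: 's1' },    // dropped: system
  { type: 'progress', uuid: 'p1' },  // dropped: in SKIP_TYPES
  { type: 'summary', uuid: 'x1' },   // dropped: not user/assistant
  { uuid: 'n1' },                    // dropped: no type
];

const kept = samples.filter(isRelevantMessage).map((m) => m.uuid);
console.log(kept); // [ 'u1', 'a1' ]

// Both forms agree on every sample.
console.log(samples.every((s) => isRelevantMessage(s) === isRelevantCompact(s))); // true
```

Keeping the explicit deny-list still has value: it records which noise types have been observed, and it would matter if the final allowlist were ever loosened.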

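Cleaning message content

Even the user and assistant messages that are kept can have system tags mixed into their content. ChatCrystal strips these automatically:

Tags that get stripped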


<system-reminder>system reminder content</system-reminder>
<command-name>/help</command-name>
<command-message>run the command</command-message>
<command-args>--flag value</command-args>
<local-command-stdout>command output</local-command-stdout>
<local-command-caveat>caveats</local-command-caveat>

Cleanup code


function sanitizeContent(text: string): string {
  let result = text;
  result = result.replace(/<system-reminder>[\s\S]*?<\/system-reminder>/g, '');
  result = result.replace(/<command-name>[^<]*<\/command-name>/g, '');
  result = result.replace(/<command-message>[^<]*<\/command-message>/g, '');
  result = result.replace(/<command-args>[^<]*<\/command-args>/g, '');
  result = result.replace(/<local-command-stdout>[^<]*<\/local-command-stdout>/g, '');
  result = result.replace(/<local-command-caveat>[\s\S]*?<\/local-command-caveat>/g, '');
  return result.trim();
}
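
One detail worth knowing about the patterns above: `[^<]*` only matches when the tag body contains no `<` character, so a `<local-command-stdout>` block whose output includes `<` (for example `a < b`) would be left in place. The following is a table-driven variant, offered as a sketch rather than as ChatCrystal's implementation; `SYSTEM_TAGS`, `SYSTEM_TAG_PATTERN`, and `stripSystemTags` are hypothetical names. It builds a single regex from the tag list and uses a lazy `[\s\S]*?` body for every tag.

```typescript
// Tag names taken from the list of stripped tags above.
const SYSTEM_TAGS = [
  'system-reminder',
  'command-name',
  'command-message',
  'command-args',
  'local-command-stdout',
  'local-command-caveat',
];

// One pattern for all tags: <tag>...</tag>, lazy, spanning newlines.
// The backreference \1 forces the closing tag to match the opening one.
const SYSTEM_TAG_PATTERN = new RegExp(
  `<(${SYSTEM_TAGS.join('|')})>[\\s\\S]*?</\\1>`,
  'g',
);

// Remove every listed tag together with its body, then trim whitespace.
function stripSystemTags(text: string): string {
  return text.replace(SYSTEM_TAG_PATTERN, '').trim();
}

// Invented input, for illustration only.
const noisy =
  '<system-reminder>be brief\nplease</system-reminder>' +
  'Fix the login bug' +
  '<local-command-stdout>a < b</local-command-stdout>';

console.log(stripSystemTags(noisy)); // Fix the login bug
```

The trade-off is that a lazy body match removes everything up to the first matching closing tag, which could over-delete if a user message legitimately quotes one of these tags; adding a new tag, on the other hand, becomes a one-line change to the list.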

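What happens without filtering

Without noise filtering, the LLM receives a lot of irrelevant content when it generates a summary:

System instructions inside <system-reminder> interfere with the summary's judgment
Streaming delta fragments produce duplicated content
file-history-snapshot turns the summary into a file change log
tool_use / tool_result fragments fill the summary with tool-call details

The resulting note may end up titled "system-reminder: you are Claude..." instead of "Fix the login bug".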

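Why Claude Code stores this noise

This noise is useful to Claude Code itself:

Streaming deltas: display the AI reply in real time
system-reminder: maintain conversation context
tool_use/result: record the tool-call chain
file-history-snapshot: support undo

But for the goal of extracting knowledge from conversations, all of it is noise.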

Project: github.com/ZengLiangYi...