🧐 AI 批量检查数千份技术文档，如何实现高效文档纠错？

前言

前几天公司售后向我反馈，有客户发现产品的文档站中有一个字段的拼写错误了。由于我们的文档站是支持多语言的，所以有时候一些外语的拼写可能顺序错误了，但是靠人来 review 没法很很直观的看出来，我当时先是把问题快速处理了。

修复了问题后，我想到这样的问题可能不止这一处，所以我打算把这些问题全局的给过一下，但因为我们是做基础软件的公司，因此一些技术文档和使用说明也属于我们产品的一部分，而且文档已经积累了很多，我们还会同时维护很多个版本的文档在我们的文档站中，导致文档加起来可能有上千份 markdown 文件，人工检查肯定是不实际的，我就想到通过 AI 来进行文案检查，修复一些基础性的问题，例如错别字，语法错误，技术概念错误等。这种工作交给 AI 肯定是很合适的。

有了想法，我就开始设计检查的方式了。

实现 AI 批量纠错

通常我们检查文案可能会把整篇文章直接丢给某个 AI ，让它检查其中的问题，最终响应给我们修改后的原文，或者是指出一些错误的点，我们人工进行修改。那么 AI 在这个过程中可能参与到 "检查" 或 "纠错" 的环节。

因为文档的数量比较多，所以我只能选择在 AI 发现问题后自己进行纠错，但是如果让 AI 全量的返回纠错后的整篇文章，那 token 的开销就太大了，于是我想到可以指定一个格式，让 AI 检查后，通过 JSON 的方式以固定的结构指出文章中错误的行数和对应的字符位置，并且给出纠错后的内容，然后我再通过程序去对文章中的内容进行处理。

按照这种方式，我实现了第一版代码，然后调用 Deepseek 进行检查，我发现这种实现有问题，AI 是可以正确的检查到错误，但是响应的错误位置总是不准确，我思考了一下可能是以下几种原因：

Windows (\r\n)、Linux (\n)、旧版 Mac (\r) 的换行符不同，导致字符位置计算偏差。
Unicode字符：中文、Emoji等占多个字节（如 你好 在UTF-8占6字节），但AI可能按字符数计算位置。
空格/缩进差异，AI返回的列号可能忽略行首空格，而实际文件可能有Tab/空格混用。
一行有几百个字符时，AI 可能无法精准定位到某个位置

由于这种方式不准确，我就思考有没有更好的方式，让 AI 能准确的响应错误位置，并且 Token 的输出量还是不太高呢？

既然指定某一行的某一个字符位置这种方式不准确，那么我就改为直接指定某一行，然后给我那一整行的纠错结果，这样虽然会比原本的输出量更长，但是相对来说会更加准确。

除了调整 prompt，我还优化了输入的结构，我先用程序把文章用换行符分割成数组，然后在前面拼接上了行号，例如：

ts 复制代码

// 原始文本
这里
是
原始的
文本

// 处理后
1: 这里
2: 是
3: 原始的
4: 文本

通过这种方式，最终响应的行号确实准确了很多，由于我的文本数量很多，为了最大程度避免修复错误，我还增加了一个兜底机制，就是让 AI 将需要纠错的那一张原始的内容也响应给我，然后我自己与真正的原始内容进行对比，如果完全一致我才会应用纠错后的内容，避免有时 AI 忘记把前面的行号给删除掉。

代码实现

最终我实现了一个 TS 的脚本，大家可以根据自己的业务进行修改：

ts 复制代码

#!/usr/bin/env node

const fs = require('fs');
const path = require('path');

class AIDocChecker {
  constructor(apiKey, targetDir = '.') {
    this.apiKey = apiKey;
    this.targetDir = targetDir;
    this.baseUrl = 'https://ark.cn-beijing.volces.com/api/v3/chat/completions';
    this.model = 'doubao-seed-1-6-250615';
    this.maxConcurrency = 10;
  }

  async findMarkdownFiles(dir) {
    const files = [];

    const walk = (currentDir) => {
      try {
        const items = fs.readdirSync(currentDir);

        for (const item of items) {
          try {
            const fullPath = path.join(currentDir, item);
            const stat = fs.statSync(fullPath);

            if (stat.isDirectory()) {
              walk(fullPath);
            } else if (path.extname(item) === '.md') {
              files.push(fullPath);
            }
          } catch (itemError) {
            // 跳过无法访问的文件或目录
            console.warn(`跳过无法访问的项目: ${path.join(currentDir, item)}`);
          }
        }
      } catch (dirError) {
        // 跳过无法访问的目录
        console.warn(`跳过无法访问的目录: ${currentDir}`);
      }
    };

    walk(dir);
    return files;
  }

  async callAI(content) {
    const prompt = `检查以下文档的错误。只标记明显错误：错别字、重复词语、错误标点、语法错误、技术概念错误。

返回JSON格式：
- 无错误: {"hasErrors": false}
- 有错误: {"hasErrors": true, "errors": [{"line": 行号, "originalLine": "原始行内容 （不带行号前缀）", "correctedLine": "修正后行内容 （不带行号前缀）", "type": "错误类型"}]}

注意：只返回需要修改的完整行，确保originalLine完全匹配文档中的原始行。

文档内容：
${content}`;

    try {
      const response = await fetch(this.baseUrl, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${this.apiKey}`
        },
        body: JSON.stringify({
          model: this.model,
          thinking: {
            type: "disabled"
          },
          messages: [
            {
              role: 'user',
              content: prompt
            },
            {
              role: 'assistant',
              content: '{'
            }
          ]
        })
      });

      if (!response.ok) {
        throw new Error(`API请求失败: ${response.status} ${response.statusText}`);
      }

      const data = await response.json();

      if (!data.choices || !data.choices[0] || !data.choices[0].message) {
        throw new Error('API返回数据格式不正确');
      }

      const aiResponse = data.choices[0].message.content;

      if (!aiResponse) {
        throw new Error('AI返回空响应');
      }

      console.log('AI原始响应:', aiResponse);

      try {
        // 尝试直接解析完整响应
        const parsed = JSON.parse('{' + aiResponse);
        return {
          hasErrors: parsed.hasErrors || false,
          errors: parsed.errors || []
        };
      } catch (parseError) {
        console.error('JSON解析失败:', parseError, aiResponse);

        // 尝试提取JSON内容
        try {
          const jsonMatch = aiResponse.match(/\{[\s\S]*\}/);
          if (jsonMatch) {
            const parsed = JSON.parse(jsonMatch[0]);
            return {
              hasErrors: parsed.hasErrors || false,
              errors: parsed.errors || []
            };
          }
        } catch (repairError) {
          console.error('JSON修复失败:', repairError);
        }

        // 如果都失败了，返回默认值
        return {
          hasErrors: false,
          errors: [],
          rawResponse: aiResponse
        };
      }
    } catch (error) {
      console.error('API调用失败:', error);
      return {
        hasErrors: false,
        errors: []
      };
    }
  }

  async checkFile(filePath) {
    try {
      const content = fs.readFileSync(filePath, 'utf8');
      const contentWithLineNumbers = content.split('\n').map((line, index) => `${index + 1}: ${line}`).join('\n');
      console.log(`正在检查文件: ${filePath}`);

      const result = await this.callAI(contentWithLineNumbers);

      if (result && result.hasErrors && result.errors && result.errors.length > 0) {
        console.log(`发现错误: ${filePath}`);
        console.log('错误详情:');
        result.errors.forEach((error, index) => {
          console.log(`  ${index + 1}. 第${error.line}行`);
          console.log(`     类型: ${error.type}`);
          console.log(`     原文: "${error.originalLine}"`);
          console.log(`     建议: "${error.correctedLine}"`);
          console.log('');
        });

        return {
          filePath,
          hasErrors: true,
          errors: result.errors
        };
      } else {
        console.log(`✓ 文件检查通过: ${filePath}`);
        return {
          filePath,
          hasErrors: false,
          errors: []
        };
      }
    } catch (error) {
      console.error(`检查文件失败 ${filePath}:`, error);
      return {
        filePath,
        hasErrors: false,
        errors: [],
        error: error.message
      };
    }
  }

  async applyFixes(filePath, errors) {
    try {
      const content = fs.readFileSync(filePath, 'utf8');
      const lines = content.split('\n');

      // 按行号倒序排序，从后往前修复，避免位置偏移问题
      const sortedErrors = errors.sort((a, b) => b.line - a.line);

      for (const error of sortedErrors) {
        const lineIndex = error.line - 1; // 转换为0索引
        if (lineIndex >= 0 && lineIndex < lines.length) {
          const currentLine = lines[lineIndex];

          // 验证原始行是否匹配
          if (currentLine === error.originalLine) {
            lines[lineIndex] = error.correctedLine;
            console.log(`    ✓ 已修复第${error.line}行: "${error.originalLine}" → "${error.correctedLine}"`);
          } else {
            console.log(`    ⚠ 跳过修复第${error.line}行 (行内容不匹配):`);
            console.log(`       期望: "${error.originalLine}"`);
            console.log(`       实际: "${currentLine}"`);
          }
        } else {
          console.log(`    ⚠ 跳过修复第${error.line}行 (行号超出范围)`);
        }
      }

      const fixedContent = lines.join('\n');
      fs.writeFileSync(filePath, fixedContent);
      console.log(`✓ 文件已保存: ${filePath}`);
      return true;
    } catch (error) {
      console.error(`修复文件失败 ${filePath}:`, error);
      return false;
    }
  }

  async run(options = {}) {
    const { fix = false, include = [], exclude = [] } = options;

    try {
      const files = await this.findMarkdownFiles(this.targetDir);

      let filteredFiles = files;

      // 应用包含过滤器
      if (include.length > 0) {
        filteredFiles = filteredFiles.filter(file =>
          include.some(pattern => file.includes(pattern))
        );
      }

      // 应用排除过滤器
      if (exclude.length > 0) {
        filteredFiles = filteredFiles.filter(file =>
          !exclude.some(pattern => file.includes(pattern))
        );
      }

      console.log(`找到 ${filteredFiles.length} 个Markdown文件`);

      const results = await this.processFilesWithConcurrency(filteredFiles, fix);

      // 生成报告
      const errorsFound = results.filter(r => r.hasErrors);
      console.log(`\n检查完成! 共检查 ${results.length} 个文件, 发现 ${errorsFound.length} 个文件有错误`);

      if (errorsFound.length > 0) {
        console.log('\n有错误的文件:');
        errorsFound.forEach(result => {
          console.log(`- ${result.filePath} (${result.errors.length} 个错误)`);
        });

        if (fix) {
          console.log('\n已自动修复所有错误文件');
        } else {
          console.log('\n使用 --fix 参数可自动修复错误');
        }
      }

      return results;
    } catch (error) {
      console.error('运行失败:', error);
      return [];
    }
  }

  async processFilesWithConcurrency(files, fix = false) {
    const results = [];
    const semaphore = new Array(this.maxConcurrency).fill(null);
    let index = 0;

    const processFile = async (filePath) => {
      try {
        const result = await this.checkFile(filePath);

        if (fix && result.hasErrors && result.errors && result.errors.length > 0) {
          await this.applyFixes(filePath, result.errors);
        }

        return result;
      } catch (error) {
        console.error(`处理文件失败 ${filePath}:`, error);
        return {
          filePath,
          hasErrors: false,
          errors: [],
          error: error.message
        };
      }
    };

    // 创建批次处理函数
    const processBatch = async () => {
      const promises = [];

      for (let i = 0; i < this.maxConcurrency && index < files.length; i++) {
        const filePath = files[index++];
        promises.push(processFile(filePath));
      }

      if (promises.length > 0) {
        const batchResults = await Promise.all(promises);
        results.push(...batchResults);

        // 如果还有文件要处理，继续下一批
        if (index < files.length) {
          await processBatch();
        }
      }
    };

    await processBatch();
    return results;
  }
}

// 命令行接口
if (require.main === module) {
  const args = process.argv.slice(2);

  const options = {
    fix: false,
    include: [],
    exclude: [],
    dir: '.'
  };

  // 解析命令行参数
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];

    if (arg === '--fix') {
      options.fix = true;
    } else if (arg === '--include') {
      options.include = args[++i]?.split(',') || [];
    } else if (arg === '--exclude') {
      options.exclude = args[++i]?.split(',') || [];
    } else if (arg === '--dir') {
      options.dir = args[++i] || '.';
    } else if (arg === '--help') {
      console.log(`
AI文档检查工具

用法: node ai-doc-checker.js [选项]

选项:
  --fix                   自动修复发现的错误
  --include <patterns>    只检查包含指定模式的文件 (逗号分隔)
  --exclude <patterns>    排除包含指定模式的文件 (逗号分隔)
  --dir <directory>       指定要检查的目录 (默认: .)
  --help                  显示帮助信息

环境变量:
  ARK_API_KEY            API密钥 (必需)

示例:
  node ai-doc-checker.js --dir ./docs
  node ai-doc-checker.js --fix --include "getting-started,api"
  node ai-doc-checker.js --exclude "node_modules,backup"
      `);
      process.exit(0);
    }
  }

  const apiKey = process.env.ARK_API_KEY;

  if (!apiKey) {
    console.error('错误: 请设置环境变量 ARK_API_KEY');
    process.exit(1);
  }

  const checker = new AIDocChecker(apiKey, options.dir);
  checker.run(options).catch(error => {
    console.error('程序执行失败:', error);
    process.exit(1);
  });
}

module.exports = AIDocChecker;

这个工具的核心功能是自动检查 Markdown 文件中的各类错误，包括错别字、重复词、标点符号问题、语法错误以及技术概念错误。它会递归扫描你指定的目录（默认是当前文件夹），找到所有 Markdown 文件进行检查。

AI 使用了火山引擎的豆包模型来分析文档内容。检查完成后，它会返回一个结构化的 JSON 结果，清楚地告诉你哪些地方需要修改。比如发现第五行有个错别字，就会标注出原始内容和建议修改后的内容。doubao-seed-1.6 性价比还是蛮高的，我实测效果比 deepseek v3 更好，就是免费额度只有 50w 了，一下子就花完了😭

如果你确定要应用这些修改，只需要在运行命令时加上 --fix 参数，工具就会自动帮你修正这些错误。这里有个很实用的设计：它是从文件末尾开始往前修改的，这样可以避免修改前面内容导致后面行号错乱的问题。

为了提高检查效率，工具默认会同时检查10个文件，这个数量可以通过参数调整。在处理大量文件时，它会自动分批处理，既保证了速度又不会占用太多内存。

使用起来很简单，基本命令是 node ai-doc-checker.js。如果你想检查特定目录，可以用 --dir 参数指定路径。通过 --include 和 --exclude 参数，你可以灵活控制要检查哪些文件，比如只检查包含"api"的文件，或者排除"draft"文件夹。

这个工具特别适合用来检查技术文档、博客文章等内容，能帮你节省大量人工校对的时间。它处理文件时也很谨慎，遇到无权限访问的目录会自动跳过，不会因为个别文件问题导致整个检查过程中断。

使用方式，使用前记得先指定环境变量 export ARK_API_KEY=xxx，默认的 url 就是火山引擎的，大家也可以自行更换一下，但是要注意 body 要更改，这个结构不是每个平台都支持的。

使用方式：

bash 复制代码

# 基本检查（当前目录）
node ai-doc-checker.js

# 检查指定目录
node ai-doc-checker.js --dir ./docs

# 自动修复错误
node ai-doc-checker.js --fix

# 只检查包含"api"的文件
node ai-doc-checker.js --include api

# 排除"draft"文件夹
node ai-doc-checker.js --exclude draft

可优化点

在实现的过程中我想到一些可以优化的点：

每次全量检查所有文件，耗时且浪费 API 调用。
- 记录文件的 lastModified 时间或哈希值，跳过未修改的文件
- 支持 --since <timestamp> 只检查指定时间后变动的文件
- 缓存 AI 返回结果到本地 .aidoc-cache，避免重复分析相同内
--fix 直接全自动修复，无法人工确认。
- 增加 --interactive 模式，逐个显示建议修改并询问是否应用
  text 复制代码
```
发现错误：第12行 - "这理有一个错别字"
建议修改："这里"
是否应用？ (y/n/q) 
```
在输入文档时，根据自己的文档结构把一些无关紧要的内容先过滤掉，比如 markdown 的一些 metadata，或者文末的参考链接之类的，这样除了降低开销，还能减少上下文，让纠错效果更好

总结

通过上面这个脚本，我用了十分钟不到把我们文档站所有文件扫了一遍，修复了上百个小点，开销几百万个 token，相比于这纠错效率，价格真的很便宜了，而且如果不需要响应纠错理由之类的，开销还会更低。

现在很多 AI 的价格已经很亲民了，一些在很多场景中都可以用 AI 来做一些繁琐冗杂工作的兜底，后续我还会分析如何在 Github 中提交 PR 的时候自动检查文案问题，并在评论中提示用户如何优化。

如果文章对你有帮助，欢迎点赞~ respect！