Part 3: Building a Codebase from Scratch: Code Chunking in Practice with tree-sitter

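Limitations of traditional chunking methods

Problems with fixed-length chunking

In document-based retrieval, fixed-length chunking can lose information or leave chunks semantically incomplete. In a long text, for example, a key passage may be split across several chunks, which makes retrieval results less accurate.

In code-based retrieval, fixed-length chunking can cut straight through function or class boundaries. A long function or class ends up spread over several chunks, none of which is meaningful on its own, so retrieval results become inaccurate. A better approach is to chunk along the structure of the code, which keeps both the semantics and the logic of each unit intact.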
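To make the boundary problem concrete, here is a minimal sketch of naive fixed-length chunking applied to two tiny functions. The helper name `fixedLengthChunks`, the sample source, and the chunk size of 30 characters are all assumptions chosen for illustration, not part of the project code.

```typescript
// Naive fixed-length chunking: cut the text every `size` characters,
// with no regard for syntax.
function fixedLengthChunks(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical sample input: two complete functions separated by a blank line
const addSource = "function add(a, b) {\n  return a + b;\n}";
const subSource = "function sub(a, b) {\n  return a - b;\n}";
const sample = `${addSource}\n\n${subSource}`;

const pieces = fixedLengthChunks(sample, 30);

// Count how many functions survive intact inside a single chunk
const intact = [addSource, subSource].filter((fn) =>
  pieces.some((piece) => piece.includes(fn))
);

console.log(`chunks: ${pieces.length}, intact functions: ${intact.length}`);
// prints "chunks: 3, intact functions: 0"
```

Each function is 38 characters long, so with a 30-character window neither one can fit in a single chunk. A larger window may happen to keep a function whole, but only by coincidence of where the cuts fall.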

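Loss of semantic context

Because each fixed-length chunk holds only a fragment of the text, the context around it is lost, and the model cannot see how the chunks relate to one another. Retrieval becomes less accurate as a result. Chunking along the structure of the code avoids this problem as well.

Shortcomings of paragraph chunking

Paragraph chunking can also lose information or leave chunks semantically incomplete. In code, for instance, a blank line inside a function body would split that function across two chunks.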
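Paragraph chunking usually means splitting on blank lines, and code often contains blank lines inside a single function. The sketch below shows the effect. `paragraphChunks` and the sample function are hypothetical and exist only for this illustration.

```typescript
// Paragraph-style chunking: split wherever a blank line occurs.
function paragraphChunks(text: string): string[] {
  return text
    .split(/\n\s*\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Hypothetical function whose body uses a blank line for readability
const source = [
  "function total(items) {",
  "  let sum = 0;",
  "",
  "  for (const item of items) {",
  "    sum += item.price;",
  "  }",
  "  return sum;",
  "}",
].join("\n");

const paragraphs = paragraphChunks(source);

// The single function is split at its internal blank line
console.log(paragraphs.length); // prints 2
```

The second chunk starts at the `for` loop and no longer carries the function signature, so on its own it says nothing about which function it belongs to.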

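Syntax-level chunking

1. AST-based chunking strategy

Preserves semantic integrity: chunks follow the logical structure of the code and keep meaningful units such as classes and methods whole, without breaking the code structure.
Captures relationships better: embedding models are trained on code snippets, so complete units help them place similar code close together in the latent space and reflect its internal logic.
Helps code retrieval: complete methods, classes, and similar blocks can be retrieved and matched accurately against a user query, for example pinpointing a method's implementation.
Provides context: information such as references within the codebase can be collected and added to the embedding input or the LLM prompt, helping the model understand the question fully and answer accurately.
Works across languages: syntax differs between languages, but an abstract syntax tree (AST) can be built for each of them, so the same chunking approach applies uniformly, which improves generality and consistency.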
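As a rough illustration of the "provides context" point, the sketch below attaches location metadata to a complete code unit before it is handed to an embedding model or an LLM prompt. The `CodeChunk` shape, the `toEmbeddingText` helper, and the header format are assumptions made for this example, not a fixed format required by any model.

```typescript
// Hypothetical chunk shape, modelled on what an AST-based chunker could emit
interface CodeChunk {
  filePath: string;
  signature: string; // for example the function or method name
  startLine: number;
  endLine: number;
  code: string;
}

// Build the text to embed: a one-line header with location metadata,
// followed by the complete code unit.
function toEmbeddingText(chunk: CodeChunk): string {
  const header = `// file: ${chunk.filePath} | symbol: ${chunk.signature} | lines: ${chunk.startLine}-${chunk.endLine}`;
  return `${header}\n${chunk.code}`;
}

// Hypothetical chunk used only for this demonstration
const example: CodeChunk = {
  filePath: "src/math.ts",
  signature: "add",
  startLine: 1,
  endLine: 3,
  code: "function add(a, b) {\n  return a + b;\n}",
};

console.log(toEmbeddingText(example));
```

Because the chunk is a whole function, the header and the body describe the same unit, which is not guaranteed when chunks are cut at arbitrary offsets.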

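2. Parsing code with tree-sitter

tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update that tree as the source file is edited. It aims to be:

General: general enough to parse any programming language.
Fast: fast enough to parse on every keystroke.
Robust: robust enough to provide useful results even in the presence of syntax errors.
Dependency-free: the runtime library is written in pure C and does not depend on any other complex environment.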

3. Example: chunking code with tree-sitter

Create a new file src/treesitter.ts:

import * as fs from "fs";
import * as path from "path";
import { Parser, Language, Tree, Query } from "web-tree-sitter";

// Grammar .wasm file for each supported file extension.
// Add more languages here as needed.
const GRAMMAR_PATHS = new Map<string, string>([
  ["js", "./pkg/tree-sitter-javascript.wasm"],
  ["ts", "./pkg/tree-sitter-typescript.wasm"],
]);

/**
 * Loads the tree-sitter grammar that matches a file extension and returns a
 * parser configured for that language.
 * @param ext - File extension (without the dot) used to pick the grammar.
 * @returns The configured parser together with the loaded language.
 * @throws If no grammar is configured for the extension.
 */
async function loadParser(ext: string) {
  const wasmPath = GRAMMAR_PATHS.get(ext);
  if (!wasmPath) {
    throw new Error(`Unsupported language: ${ext}`);
  }
  // The WebAssembly runtime must be initialized before a parser is created
  await Parser.init();
  const parser = new Parser();
  // Load the grammar and attach it to the parser
  const language = await Language.load(wasmPath);
  parser.setLanguage(language);
  return { parser, language };
}

/**
 * Checks whether a file should be ignored.
 * @param filePath - Path of the file, written the way the traversal builds it.
 * @returns true if the path is listed in .ignorefiles, otherwise false.
 */
function checkIgnoreFile(filePath: string): boolean {
  // Without an ignore list, nothing is ignored
  if (!fs.existsSync(".ignorefiles")) {
    return false;
  }
  // One entry per line; trimming tolerates CRLF endings and blank lines
  const ignored = fs
    .readFileSync(".ignorefiles", "utf8")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return ignored.includes(filePath);
}

/**
 * Splits the code files under a directory into chunks.
 */
class CodeSnapped {
  /**
   * Returns the names of all function declarations in a syntax tree.
   * @param tree - Syntax tree produced by the parser.
   * @param language - Language the tree was parsed with.
   * @returns The function names in document order.
   */
  getFunctionNames(tree: Tree, language: Language): string[] {
    // Capture the identifier in the `name` field of every function declaration
    const query = new Query(
      language,
      `(function_declaration name: (identifier) @function.name)`
    );
    return query.captures(tree.rootNode).map((capture) => capture.node.text);
  }

  /**
   * Walks every file under a directory and parses each one into chunks.
   * @param dirPath - Directory to walk.
   * @returns An array describing every chunk that was found.
   */
  async chunkCode(dirPath: string) {
    const chunks: any[] = [];

    /**
     * Recursively processes a path.
     * @param currentPath - File or directory to process.
     */
    const traverseDirectory = async (currentPath: string) => {
      const stats = fs.statSync(currentPath);
      if (stats.isDirectory()) {
        // Recurse into every entry of the directory
        for (const file of fs.readdirSync(currentPath)) {
          await traverseDirectory(path.join(currentPath, file));
        }
      } else if (stats.isFile()) {
        if (checkIgnoreFile(currentPath)) {
          return;
        }
        const fileExtension = path.extname(currentPath).slice(1).toLowerCase();
        // Skip files without a configured grammar instead of aborting the walk
        if (!GRAMMAR_PATHS.has(fileExtension)) {
          return;
        }
        const { parser } = await loadParser(fileExtension);
        const sourceCode = fs.readFileSync(currentPath, "utf8");
        // Parse inside the try block so that parser errors are caught
        let tree: Tree | null;
        try {
          tree = parser.parse(sourceCode);
        } catch (error) {
          console.error(`Error parsing ${currentPath}:`, error);
          return;
        }
        if (!tree) {
          console.error(`Failed to parse ${currentPath}`);
          return;
        }
        // Emit one chunk per top-level function declaration
        for (const child of tree.rootNode.children) {
          if (child?.type === "function_declaration") {
            chunks.push({
              filePath: currentPath,
              startPosition: child.startPosition,
              endPosition: child.endPosition,
              code: child.text,
              // Read the name from the node itself; matching names by substring
              // would mislabel a function that calls another function
              signature: child.childForFieldName("name")?.text ?? "",
            });
          }
        }
      }
    };
    // Start the walk at the requested directory
    await traverseDirectory(dirPath);
    return chunks;
  }
}

// Export the CodeSnapped class
export default CodeSnapped;

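Walkthrough:

1. The loadParser function:

This function asynchronously loads the tree-sitter grammar module for a given file extension and sets it as the parser's language.
It takes a file extension, ext, chooses the path of the matching grammar module, and loads it.
Finally, it returns the parser configured for that language, together with the loaded language.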
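Since loadParser is called once per file, a directory walk repeats the grammar loading for every file of the same language. One possible refinement is to cache the result per extension. The sketch below assumes a parser can be reused across files of the same language. `memoizeAsync` and `fakeLoadParser` are hypothetical names, and the stand-in loader replaces the real loadParser so the example runs without any grammar files.

```typescript
// Generic memoizer for async loaders keyed by a string
function memoizeAsync<T>(
  load: (key: string) => Promise<T>
): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (key) => {
    let pending = cache.get(key);
    if (!pending) {
      pending = load(key);
      // Drop failed loads from the cache so a later call can retry
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return pending;
  };
}

// Stand-in for loadParser (assumption): counts how often it is invoked
let loadCount = 0;
const fakeLoadParser = async (ext: string) => {
  loadCount += 1;
  return { ext };
};

const getParser = memoizeAsync(fakeLoadParser);

async function demo(): Promise<number> {
  await getParser("ts");
  await getParser("ts"); // served from the cache
  await getParser("js");
  return loadCount;
}

demo().then((count) => console.log(`loader calls: ${count}`));
// prints "loader calls: 2"
```

Caching the promise rather than the resolved value means that two concurrent requests for the same extension share a single load.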

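2. The checkIgnoreFile function:

This function checks whether a file should be ignored.
It takes a file path, filePath, reads the ignore list from the .ignorefiles file, and checks whether the given path appears in that list. The comparison is an exact string match, so each entry must be written the same way the traversal builds its paths.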
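Exact string matching means an entry such as node_modules does not cover the files inside that directory. A looser matching rule could look like the sketch below. The helper names `parseIgnoreList` and `isIgnored`, the treatment of lines starting with "#" as comments, and the matching rule itself are all assumptions for illustration, not the behaviour of checkIgnoreFile.

```typescript
import * as path from "path";

// Parse the contents of an ignore list: one entry per line.
// Blank lines and lines starting with "#" are skipped (assumed convention).
function parseIgnoreList(content: string): string[] {
  return content
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}

// An entry matches when it equals the normalized path, or when it names
// one of the path's segments (a directory or the file's base name).
function isIgnored(filePath: string, entries: string[]): boolean {
  const normalized = path.normalize(filePath);
  const segments = normalized.split(path.sep);
  return entries.some((entry) => {
    // Normalize the entry and drop any trailing separator
    const cleaned = path.normalize(entry).replace(/[\\/]+$/, "");
    return cleaned === normalized || segments.includes(cleaned);
  });
}

// Hypothetical ignore list content
const entries = parseIgnoreList("# dependencies\nnode_modules\n\nsrc/generated.ts\r\n");

console.log(isIgnored("node_modules/pkg/index.js", entries)); // true
console.log(isIgnored("./src/generated.ts", entries)); // true
console.log(isIgnored("src/treesitter.ts", entries)); // false
```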

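3. The CodeSnapped class:

This class splits the code files under a directory into chunks.
Its chunkCode method takes a directory path, dirPath, walks every file under that directory, and parses each file into chunks.
During the walk it checks whether each file should be ignored and skips the file if so.
For each remaining file, it determines the file extension and loads the matching parser.
It then reads the file content and parses it to produce a syntax tree.
Finally, it records every top-level function_declaration node as a chunk containing the file path, the start and end positions, the source text, and the function name.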
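The loop in chunkCode looks only at direct children of the root node and only at function_declaration nodes, so exported functions (wrapped in an export_statement) and class methods are not picked up. A recursive walk is one way to cover them, sketched below. `NodeLike`, `collectChunks`, and the hand-built mock tree are assumptions that keep the example runnable without grammar files. The node type names are the ones used by the tree-sitter JavaScript grammar.

```typescript
// Minimal structural view of a syntax node; real web-tree-sitter nodes
// should satisfy it, but here it is fed with a hand-built mock tree.
interface NodeLike {
  type: string;
  text: string;
  children: (NodeLike | null)[];
}

interface Chunk {
  type: string;
  code: string;
}

const FUNCTION_TYPES = new Set(["function_declaration", "method_definition"]);

// Depth-first walk that emits functions, methods, and classes as chunks.
function collectChunks(node: NodeLike, out: Chunk[] = []): Chunk[] {
  if (FUNCTION_TYPES.has(node.type)) {
    out.push({ type: node.type, code: node.text });
    return out; // do not descend into function bodies
  }
  if (node.type === "class_declaration") {
    out.push({ type: node.type, code: node.text });
  }
  for (const child of node.children) {
    if (child) collectChunks(child, out);
  }
  return out;
}

// Helper for building the mock tree
const n = (type: string, text: string, children: NodeLike[] = []): NodeLike => ({
  type,
  text,
  children,
});

const method = n("method_definition", "area() { return 1; }");
const mockRoot = n("program", "", [
  n("export_statement", "export function f() {}", [
    n("function_declaration", "function f() {}"),
  ]),
  n("class_declaration", "class Shape { area() { return 1; } }", [
    n("class_body", "{ area() { return 1; } }", [method]),
  ]),
]);

console.log(collectChunks(mockRoot).map((chunk) => chunk.type));
```

In this sketch a class is emitted both as a whole and method by method, so the chunks overlap. Whether that is desirable depends on how the chunks are indexed and retrieved.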

4. Write a test file: create a new file src/treesitter.test.ts:

import CodeSnapped from "./treesitter";

describe("CodeChunker", () => {
  let codeSnapped: CodeSnapped;

  beforeEach(() => {
    codeSnapped = new CodeSnapped();
  });

  test("chunkCode smoke test", async () => {
    // Chunk the project's own src directory
    const fileDir = "./src";
    const result = await codeSnapped.chunkCode(fileDir);
    expect(Array.isArray(result)).toBe(true);
    expect(result.length).toBeGreaterThan(0);
    console.log(result);
  });
});

Run the test:

npm run test

Test result:

 PASS  src/treesitter.test.ts
  CodeChunker
    ✓ chunkCode smoke test (206 ms)

tree-sitter documentation: tree-sitter.github.io/tree-sitter...
