How to Design the Content-Asset Closed Loop of a Knowledge Base System

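[1. Document Upload](#1-document-upload)

Upload success does not mean the knowledge is usable

[2. Document Processing (process)](#2-document-processing-process)

[3. Online Editing](#3-online-editing)

[Rebuilding the Index (reindex)](#rebuilding-the-index-reindex)

Extending toward a more complex version system

[4. Closing the Loop on Knowledge Base Content](#4-closing-the-loop-on-knowledge-base-content)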

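This week I worked on the document pipeline of a knowledge base. At first it is easy to underestimate the feature: the user uploads a file, the backend stores it, the database gets one row, and the API returns success. That approach is fine for a cloud drive or an attachment system, but it is not enough for a knowledge base.

What a knowledge base actually cares about is not whether the file was stored, but:

- whether the file can be parsed into body text
- whether the body text can be split into chunks
- whether the chunks can enter the retrieval pipeline
- whether the index follows along after a user edits the document
- whether, when something goes wrong, the system can explain which step the document is stuck at

So this week my design work revolved around one question:

From the moment a file is uploaded until it becomes a knowledge asset that can be searched, edited, and reindexed, which states and boundaries should the path in between be split into?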

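1. Document Upload

Upload success does not mean the knowledge is usable

We often assume by default that upload success = document usable.

In a knowledge base system, these two states have to be separated.

The upload stage is only responsible for taking in the original asset. It does roughly the following: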

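| Step | Purpose | Result |
| --- | --- | --- |
| Validate the knowledge base | Check whether the current knowledge base allows writes | Keeps synced and read-only knowledge bases from being written to |
| Validate the file | Check type, size, duplicate name, and quota | Keeps dirty data out of the system |
| Save the file | Persist the original file to disk | Produces `storage_path` |
| Compute the hash | Fingerprint the file content | Produces `file_hash` |
| Write the document table | Record document metadata | Creates a `documents` row |
| Update the quota | Record the user's storage usage | Keeps resource usage under control |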
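
The "save the file" and "compute the hash" steps can share a single pass over the upload stream. The article does not show its `saveFile` implementation, so the following is only a minimal sketch under assumptions: `saveAndHash` and `demo` are hypothetical names, and SHA-256 is an assumed choice of hash.

```go
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "strings"
)

// saveAndHash streams src to dstPath and returns the hex SHA-256 of the
// bytes written. Hypothetical helper, not the article's saveFile.
func saveAndHash(src io.Reader, dstPath string) (string, error) {
    if err := os.MkdirAll(filepath.Dir(dstPath), 0o755); err != nil {
        return "", err
    }
    dst, err := os.Create(dstPath)
    if err != nil {
        return "", err
    }
    hasher := sha256.New()
    // MultiWriter sends every block to both the file and the hasher,
    // so the upload is read only once.
    if _, err := io.Copy(io.MultiWriter(dst, hasher), src); err != nil {
        dst.Close()
        os.Remove(dstPath) // do not leave a half-written file behind
        return "", err
    }
    if err := dst.Close(); err != nil {
        os.Remove(dstPath)
        return "", err
    }
    return hex.EncodeToString(hasher.Sum(nil)), nil
}

// demo saves a small in-memory "upload" into a temp directory.
func demo() (string, error) {
    dir, err := os.MkdirTemp("", "kb-upload-")
    if err != nil {
        return "", err
    }
    defer os.RemoveAll(dir)
    return saveAndHash(strings.NewReader("hello"), filepath.Join(dir, "user-1", "hello.md"))
}

func main() {
    hash, err := demo()
    if err != nil {
        fmt.Println("save failed:", err)
        return
    }
    fmt.Println("file_hash:", hash)
}
```

Hashing while writing avoids reading a large file a second time, and removing the partial file on failure keeps the storage directory consistent with the `documents` table.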

In the code, the status of a document right after upload is not `ready` but `uploaded`:

```go
doc := entity.Document{
    UserID:          userID,
    KnowledgeBaseID: kb.ID,
    Title:           strings.TrimSuffix(fileHeader.Filename, filepath.Ext(fileHeader.Filename)),
    FileName:        filepath.Base(fileHeader.Filename),
    FileType:        documentFileType(fileHeader.Filename),
    FileSize:        fileHeader.Size,
    StoragePath:     storagePath,
    FileHash:        fileHash,
    SourceType:      documentSourceUpload,
    Status:          documentStatusUploaded,
}
```

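Parsing, chunking, and vectorizing directly inside the upload endpoint looks more convenient for the caller, but it causes several problems later:

| Problem | Impact |
| --- | --- |
| The upload request takes longer | Large files slow the request down |
| Parse failures are hard to express | The file is saved, but the knowledge is not usable |
| Moving to async jobs later is difficult | The endpoint's semantics would have to be overturned |
| The state is unclear | Users cannot tell whether the document was uploaded successfully or processed successfully |

So in the end I positioned the upload endpoint as:

The document has entered the system, but there is no guarantee yet that it can be retrieved.

The upload endpoint does not parse or vectorize. It only validates the file, saves it, writes the main document table, and updates the quota.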

```go
func (s *documentService) Upload(
    ctx context.Context,
    userID, kbID string,
    fileHeader *multipart.FileHeader,
) (dto.DocumentResponse, error) {
    if fileHeader == nil {
        return dto.DocumentResponse{}, apperrors.New(apperrors.CodeBadRequest, "uploaded file must not be empty")
    }

    // Reject synced or read-only knowledge bases before touching storage.
    kb, err := s.findWritableKnowledgeBase(ctx, userID, kbID)
    if err != nil {
        return dto.DocumentResponse{}, err
    }

    // Type, size, duplicate-name, and quota checks.
    if err := s.validateFile(ctx, userID, kb.ID, fileHeader); err != nil {
        return dto.DocumentResponse{}, err
    }

    storagePath, fileHash, err := s.saveFile(userID, kb.ID, fileHeader)
    if err != nil {
        return dto.DocumentResponse{}, err
    }

    // The document starts as "uploaded", not "ready".
    doc := entity.Document{
        UserID:          userID,
        KnowledgeBaseID: kb.ID,
        Title:           strings.TrimSuffix(fileHeader.Filename, filepath.Ext(fileHeader.Filename)),
        FileName:        filepath.Base(fileHeader.Filename),
        FileType:        documentFileType(fileHeader.Filename),
        FileSize:        fileHeader.Size,
        StoragePath:     storagePath,
        FileHash:        fileHash,
        SourceType:      documentSourceUpload,
        Status:          documentStatusUploaded,
    }

    if err := s.documentRepo.Create(ctx, &doc); err != nil {
        // Best-effort cleanup so a failed insert does not leave an orphan file.
        _ = os.Remove(storagePath)
        return dto.DocumentResponse{}, err
    }

    // Note: if the quota update fails, the file and the documents row
    // already exist; this excerpt does not show a compensating cleanup.
    if err := s.storageQuotaRepo.AddUsedStorage(ctx, userID, defaultMaxStorageBytes, fileHeader.Size); err != nil {
        return dto.DocumentResponse{}, err
    }

    return documentResponse(doc), nil
}
```

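In other words, a successful upload is only the first step of an asset entering the system. It does not mean the knowledge is usable yet.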
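
One way to keep "uploaded" and "usable" from blurring together is to make the allowed status transitions explicit. The sketch below is illustrative only: the article names the `uploaded` and `ready` statuses, while `processing`, `failed`, and the helpers `CanTransition` and `IsSearchable` are assumptions introduced here.

```go
package main

import "fmt"

// DocStatus is a document lifecycle status.
type DocStatus string

const (
    StatusUploaded   DocStatus = "uploaded"
    StatusProcessing DocStatus = "processing" // assumed intermediate state
    StatusReady      DocStatus = "ready"
    StatusFailed     DocStatus = "failed" // assumed failure state
)

// allowed lists, for each status, the statuses it may move to.
var allowed = map[DocStatus][]DocStatus{
    StatusUploaded:   {StatusProcessing},
    StatusProcessing: {StatusReady, StatusFailed},
    StatusReady:      {StatusProcessing}, // online edit or reindex
    StatusFailed:     {StatusProcessing}, // retry
}

// CanTransition reports whether moving from one status to another is allowed.
func CanTransition(from, to DocStatus) bool {
    for _, next := range allowed[from] {
        if next == to {
            return true
        }
    }
    return false
}

// IsSearchable reports whether retrieval may use the document.
func IsSearchable(s DocStatus) bool {
    return s == StatusReady
}

func main() {
    fmt.Println(CanTransition(StatusUploaded, StatusProcessing)) // true
    fmt.Println(CanTransition(StatusUploaded, StatusReady))      // false
    fmt.Println(IsSearchable(StatusUploaded))                    // false
}
```

With a table like this, an upload can never jump straight to a searchable state, and a failed document has a defined way back into processing.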

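2. Document Processing (process)

After upload there has to be a separate processing action, which is process.

Its boundary is defined like this:

process reads the original file, that is, the content that documents.storage_path points to.

It is meant for files that have just been uploaded.

The flow is roughly: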

```
documents.storage_path
    -> read the original file
    -> parse the body text
    -> create the first document_versions row
    -> split into document_chunks
    -> document becomes ready
```

In the code, processDocumentContent starts by reading the original file:

```go
func (s *documentService) processDocumentContent(
    ctx context.Context,
    doc entity.Document,
    jobID string,
) error {
    // Uploadable does not imply parseable: bail out on unsupported types.
    if !s.chunkService.SupportsFileType(doc.FileType) {
        return apperrors.New(apperrors.CodeDocumentStatusInvalid, "automatic parsing is not yet supported for this file type")
    }

    // The source of truth at this stage is the original file on disk.
    contentBytes, err := os.ReadFile(doc.StoragePath)
    if err != nil {
        return apperrors.NewWithErr(apperrors.CodeInternalError, "failed to read the document file", err)
    }

    content := s.chunkService.NormalizeContent(string(contentBytes), doc.FileType)
    if content == "" {
        return apperrors.New(apperrors.CodeDocumentStatusInvalid, "document body is empty and cannot be processed")
    }

    // First processing always creates the initial version.
    version := entity.DocumentVersion{
        ID:            uuid.NewString(),
        UserID:        doc.UserID,
        DocumentID:    doc.ID,
        VersionNo:     documentVersionInitialNo,
        Content:       content,
        ContentHash:   hashText(content),
        ChangeSummary: "initial version generated by first parse",
    }

    // Chunks are built from the version content and tied to version.ID.
    chunks, err := s.chunkService.BuildChunks(
        ctx,
        doc,
        version.ID,
        s.chunkService.SplitContent(content),
    )
    if err != nil {
        return err
    }

    // Persist version, chunks, job result, and the ready status together.
    finishedAt := time.Now()
    return s.documentRepo.SaveProcessResult(
        ctx,
        doc,
        jobID,
        &version,
        chunks,
        documentStatusReady,
        documentJobStatusSuccess,
        finishedAt,
    )
}
```

It then creates the first version:

```go
version := entity.DocumentVersion{
    ID:            uuid.NewString(),
    UserID:        doc.UserID,
    DocumentID:    doc.ID,
    VersionNo:     documentVersionInitialNo,
    Content:       content,
    ContentHash:   hashText(content),
    ChangeSummary: "initial version generated by first parse",
}
```
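
The article does not show how `SplitContent` cuts the body text. As a rough illustration of the chunking step, the sketch below packs paragraphs into chunks of at most `maxRunes` characters and hard-splits any paragraph that is too long on its own. `splitContent` and its size limit are hypothetical; a real splitter would likely also handle headings, overlap, and token counts.

```go
package main

import (
    "fmt"
    "strings"
)

// splitContent packs blank-line-separated paragraphs into chunks of at most
// maxRunes runes. Counting runes (not bytes) keeps CJK text from being cut
// in the middle of a character.
func splitContent(content string, maxRunes int) []string {
    if maxRunes <= 0 {
        return nil
    }
    var chunks []string
    var current []rune
    flush := func() {
        if len(current) > 0 {
            chunks = append(chunks, string(current))
            current = nil
        }
    }
    for _, para := range strings.Split(content, "\n\n") {
        p := []rune(strings.TrimSpace(para))
        if len(p) == 0 {
            continue
        }
        // A paragraph longer than the limit is cut into fixed-size pieces.
        if len(p) > maxRunes {
            flush()
            for len(p) > maxRunes {
                chunks = append(chunks, string(p[:maxRunes]))
                p = p[maxRunes:]
            }
            current = append([]rune(nil), p...)
            continue
        }
        // Start a new chunk if adding this paragraph would exceed the limit.
        if len(current) > 0 && len(current)+1+len(p) > maxRunes {
            flush()
        }
        if len(current) > 0 {
            current = append(current, '\n')
        }
        current = append(current, p...)
    }
    flush()
    return chunks
}

func main() {
    for i, c := range splitContent("aaa\n\nbbb\n\nccccccc", 7) {
        fmt.Printf("chunk %d: %q\n", i, c)
    }
}
```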

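The requirements document may say that PDF, Word, Excel, PPT, images, and other formats are supported, but the first iteration only handles plain-text documents. Automatic parsing at this stage does not cover that many types.

So here the file types accepted for upload and the file types supported by automatic parsing are treated separately:

| Capability | Meaning | Current trade-off |
| --- | --- | --- |
| Upload allowed | The system can store the file | Can be fairly broad |
| Automatic parsing | The system can turn the file into body text | Kept narrow for now |
| Retrievable | The body text has been chunked and stored | Processing must have succeeded |

In other words, being able to upload a file does not mean it can be parsed automatically.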
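
The split between "can upload" and "can parse" can be expressed as two separate allow-lists. The sketch below is only an illustration: the concrete extensions in `uploadable` and `parseable`, and the names `Capability`, `fileType`, and `capabilityFor`, are assumptions and not the article's actual configuration.

```go
package main

import (
    "fmt"
    "path/filepath"
    "strings"
)

// Assumed allow-lists: uploads are broad, automatic parsing is narrow.
var uploadable = map[string]bool{
    "md": true, "txt": true, "pdf": true, "docx": true,
    "xlsx": true, "pptx": true, "png": true,
}

var parseable = map[string]bool{"md": true, "txt": true}

// Capability describes what the system can do with a given file.
type Capability struct {
    CanUpload    bool
    CanAutoParse bool
}

// fileType returns the lower-cased extension without the leading dot.
func fileType(name string) string {
    return strings.ToLower(strings.TrimPrefix(filepath.Ext(name), "."))
}

// capabilityFor looks up both lists; a file that cannot be uploaded
// can never be parsed.
func capabilityFor(name string) Capability {
    ft := fileType(name)
    return Capability{
        CanUpload:    uploadable[ft],
        CanAutoParse: uploadable[ft] && parseable[ft],
    }
}

func main() {
    for _, name := range []string{"notes.md", "report.PDF", "tool.exe"} {
        fmt.Printf("%s: %+v\n", name, capabilityFor(name))
    }
}
```

A PDF in this sketch is stored but stays in the uploaded state until a parser for it exists, which matches the "upload can be broad, parsing stays narrow" trade-off above.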

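3. Online Editing

Once online editing is supported, the questions get finer. The user edits the document's body text, but the originally uploaded file is still there. Where should the edited content live? The first idea that comes to mind is usually to overwrite the original file.

For example, the user uploads xxx.md, and after an online edit the file at storage_path is rewritten in place. That is simple in the short term, but it causes a lot of trouble later.

| Approach | Advantage | Problem |
| --- | --- | --- |
| Overwrite the original file | Simple to implement | The original file is lost and versions cannot be traced |
| Modify only the chunks | Retrieval changes immediately | The source of the body text is lost and chunks cannot be regenerated |
| Add a version record | Clear provenance, traceable | One more table and one more state to maintain |

The final choice is the third one: an online edit creates a new document_versions row.

What the version table expresses is roughly: the body text of a given document at a given point in time.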

```go
type DocumentVersion struct {
    ID            string
    UserID        string
    DocumentID    string
    VersionNo     int
    Content       string
    ContentHash   string
    ChangeSummary string
    CreatedAt     time.Time
}
```

An online edit does not modify documents directly. It creates a new version:

```go
func (s *documentService) CreateVersion(
    ctx context.Context,
    userID, documentID string,
    req requestdto.CreateDocumentVersionRequest,
) (dto.DocumentProcessingJobResponse, error) {
    doc, err := s.findEditableDocument(ctx, userID, documentID)
    if err != nil {
        return dto.DocumentProcessingJobResponse{}, err
    }

    content := strings.TrimSpace(req.Content)
    if content == "" {
        return dto.DocumentProcessingJobResponse{}, apperrors.New(apperrors.CodeBadRequest, "document body must not be empty")
    }

    // The edited body becomes a new version; the original file is untouched.
    // VersionNo is not assigned in this excerpt.
    version := entity.DocumentVersion{
        ID:            uuid.NewString(),
        UserID:        doc.UserID,
        DocumentID:    doc.ID,
        Content:       content,
        ContentHash:   hashText(content),
        ChangeSummary: strings.TrimSpace(req.ChangeSummary),
    }

    // Build the job record and the new chunks from the edited content.
    job, chunks, finishedAt, err := s.buildReindexPayload(ctx, doc, version.ID, content)
    if err != nil {
        return dto.DocumentProcessingJobResponse{}, err
    }

    // Save the version and replace the chunks in one repository call.
    if err := s.documentVersionRepo.SaveVersionAndReindex(
        ctx,
        doc,
        &job,
        &version,
        chunks,
        documentStatusReady,
        documentJobStatusSuccess,
        finishedAt,
    ); err != nil {
        return dto.DocumentProcessingJobResponse{}, err
    }

    // The flow is synchronous, so start and finish share one timestamp.
    job.Status = documentJobStatusSuccess
    job.StartedAt = &finishedAt
    job.FinishedAt = &finishedAt

    return documentProcessingJobResponse(job), nil
}
```
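
The excerpt above does not show how the next version number is chosen or how `hashText` is implemented. One possible approach, sketched here under assumptions, is to derive the number from the latest version and to use the content hash to detect a save that changed nothing. `Version`, `nextVersion`, and the SHA-256 based `hashText` below are hypothetical stand-ins, not the article's code.

```go
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// Version is a trimmed-down stand-in for a document_versions row.
type Version struct {
    No          int
    Content     string
    ContentHash string
}

// hashText is an assumed implementation: hex-encoded SHA-256 of the text.
func hashText(s string) string {
    sum := sha256.Sum256([]byte(s))
    return hex.EncodeToString(sum[:])
}

// nextVersion returns the version that should be current after a save and
// whether a new row needs to be written. Identical content reuses the
// latest version instead of adding a duplicate.
func nextVersion(latest *Version, content string) (Version, bool) {
    h := hashText(content)
    if latest == nil {
        return Version{No: 1, Content: content, ContentHash: h}, true
    }
    if latest.ContentHash == h {
        return *latest, false
    }
    return Version{No: latest.No + 1, Content: content, ContentHash: h}, true
}

func main() {
    v1, created := nextVersion(nil, "first draft")
    fmt.Println(v1.No, created)

    same, created := nextVersion(&v1, "first draft")
    fmt.Println(same.No, created)

    v2, created := nextVersion(&v1, "second draft")
    fmt.Println(v2.No, created)
}
```

In a real service the number would have to be assigned inside the same transaction that inserts the row, otherwise two concurrent edits could pick the same number.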

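With this split, the responsibilities of the core tables become clear:

| Table | Responsible for | Not responsible for |
| --- | --- | --- |
| `documents` | Document asset metadata | Storing body-text history |
| `document_versions` | Body-text versions | Expressing retrieval chunks |
| `document_chunks` | Retrievable chunks | Acting as the source of the body text |
| `document_processing_jobs` | Processing job records | Storing business content |

The benefit of this design is that troubleshooting stays orderly.

For example, when a user says "I clearly edited the document, so why does Q&A still return the old content?", we can follow the chain: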

```
was the latest version created
    -> were the chunks replaced
    -> did the job succeed
    -> is the document status ready
```

instead of guessing back and forth among the original file, the body text, and the index.
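
The checks in that chain can be walked in order and stopped at the first one that fails. The following is a small sketch of that idea, assuming each chunk records the ID of the version it was built from. `Snapshot` and `diagnose` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// Snapshot gathers the facts needed to explain one document's state.
type Snapshot struct {
    LatestVersionID string // empty if no version exists
    ChunkVersionID  string // version the current chunks were built from
    LastJobStatus   string // e.g. "success" or "failed"
    DocStatus       string // e.g. "uploaded" or "ready"
}

// diagnose walks the chain version -> chunks -> job -> status and returns
// the first step that looks wrong.
func diagnose(s Snapshot) string {
    switch {
    case s.LatestVersionID == "":
        return "no version was created"
    case s.ChunkVersionID != s.LatestVersionID:
        return "chunks were built from an older version"
    case s.LastJobStatus != "success":
        return "the last processing job did not succeed"
    case s.DocStatus != "ready":
        return "the document status is not ready"
    default:
        return "content pipeline is consistent; check retrieval and answering"
    }
}

func main() {
    stale := Snapshot{
        LatestVersionID: "v2",
        ChunkVersionID:  "v1",
        LastJobStatus:   "success",
        DocStatus:       "ready",
    }
    fmt.Println(diagnose(stale))
}
```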

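Rebuilding the Index (reindex)

Once there is online editing, the question of rebuilding the index follows naturally.

Here I made an explicit trade-off:

Saving a new version may trigger a reindex, but a manual reindex never creates a new version.

The reason is that versions and indexes are not the same thing.

| Action | Creates a version? | Replaces chunks? | When it applies |
| --- | --- | --- | --- |
| `process` | Yes, the initial version | Yes | First processing of an uploaded file |
| `CreateVersion` | Yes, a new version | Yes | The user edits the body text online |
| `Reindex` | No | Yes | Content unchanged, only the index is rebuilt |

If the backend merely re-vectorizes a document whose content has not changed, no extra version should appear. Otherwise the version list fills up with records whose content is identical, which confuses users.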

In the code, Reindex reads the latest version:

```go
func (s *documentService) Reindex(
    ctx context.Context,
    userID, documentID string,
) (dto.DocumentProcessingJobResponse, error) {
    doc, err := s.findEditableDocument(ctx, userID, documentID)
    if err != nil {
        return dto.DocumentProcessingJobResponse{}, err
    }

    // Reindex works from the latest version, never from the original file.
    version, ok, err := s.documentVersionRepo.FindLatestByDocument(ctx, userID, documentID)
    if err != nil {
        return dto.DocumentProcessingJobResponse{}, err
    }
    if !ok {
        return dto.DocumentProcessingJobResponse{}, apperrors.New(
            apperrors.CodeDocumentStatusInvalid,
            "no document version exists, cannot re-vectorize",
        )
    }

    job, chunks, finishedAt, err := s.buildReindexPayload(ctx, doc, version.ID, version.Content)
    if err != nil {
        return dto.DocumentProcessingJobResponse{}, err
    }

    // Only the job record and the chunks change; no version is created.
    if err := s.documentVersionRepo.ReindexVersion(
        ctx,
        doc,
        &job,
        version,
        chunks,
        documentStatusReady,
        documentJobStatusSuccess,
        finishedAt,
    ); err != nil {
        return dto.DocumentProcessingJobResponse{}, err
    }

    job.Status = documentJobStatusSuccess
    job.StartedAt = &finishedAt
    job.FinishedAt = &finishedAt

    return documentProcessingJobResponse(job), nil
}
```

It then regenerates the chunks from the latest version's body text:

```go
job, chunks, finishedAt, err := s.buildReindexPayload(ctx, doc, version.ID, version.Content)
if err != nil {
    return dto.DocumentProcessingJobResponse{}, err
}
```

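This boundary looks small, but it matters a lot for long-term maintenance:

- Version history answers: when did the content change
- Job history answers: when did the system process it
- Chunks answer: what content is retrieval using right now

If these three things are mixed together, the system becomes harder to explain the further it grows.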
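
To make the difference between the three histories concrete, here is a toy in-memory model. It is only a sketch: `Store` and its methods are hypothetical, and `strings.Fields` stands in for the real chunk splitter.

```go
package main

import (
    "errors"
    "fmt"
    "strings"
)

// Store keeps the three histories apart.
type Store struct {
    Versions []string // body text of each version, oldest first
    Jobs     []string // one entry per processing action
    Chunks   []string // what retrieval currently sees
}

// rebuild regenerates chunks from the latest version and logs a job.
func (s *Store) rebuild(action string) {
    latest := s.Versions[len(s.Versions)-1]
    s.Chunks = strings.Fields(latest) // stand-in for the real splitter
    s.Jobs = append(s.Jobs, action)
}

// CreateVersion records new content, then rebuilds the index.
func (s *Store) CreateVersion(content string) {
    s.Versions = append(s.Versions, content)
    s.rebuild("create_version")
}

// Reindex rebuilds the index only; the version history is untouched.
func (s *Store) Reindex() error {
    if len(s.Versions) == 0 {
        return errors.New("no version to reindex")
    }
    s.rebuild("reindex")
    return nil
}

func main() {
    s := &Store{}
    s.CreateVersion("alpha beta")
    if err := s.Reindex(); err != nil {
        fmt.Println("reindex failed:", err)
        return
    }
    fmt.Println(len(s.Versions), len(s.Jobs), len(s.Chunks))
}
```

After one edit and one manual reindex, the model holds one version, two jobs, and the chunks of the latest version, which is exactly the separation described above.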

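What really tends to go wrong in this pipeline is not how a particular endpoint is written, but an inconsistent source of truth.

The source of truth actually differs from stage to stage: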

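| Stage | Source of truth | Notes |
| --- | --- | --- |
| Just uploaded | Original file | `documents.storage_path` |
| After first processing | Initial body-text version | `document_versions.content` |
| After an online edit | Latest body-text version | Latest `document_versions.content` |
| During Q&A retrieval | Document chunks | `document_chunks.content` |
| Manual reindex | Latest body-text version | Chunks are regenerated from the latest version |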

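If this relationship is not thought through, the code easily turns into:

- endpoint A reads the original file
- endpoint B reads the version body text
- endpoint C modifies chunks directly
- endpoint D guesses the state from the document table

Everything may run in the short term, but as soon as something breaks it is very hard to track down.

So I now prefer to draw the knowledge asset pipeline like this: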

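```
upload file
  -> documents stores document metadata
  -> process reads the original file
  -> document_versions stores body-text versions
  -> document_chunks stores retrievable chunks
  -> online edit creates a new version
  -> reindex rebuilds chunks from the latest version
```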

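Summarized as a table:

| Layer | Data | Role |
| --- | --- | --- |
| File layer | Original uploaded file | Preserves the asset the user submitted |
| Document layer | `documents` | Manages document status, source, size, and path |
| Version layer | `document_versions` | Manages body-text history |
| Index layer | `document_chunks` | Supports retrieval and Q&A |
| Job layer | `document_processing_jobs` | Records processing actions and failure reasons |

This is a bit more work than cramming every field into one table, but the system becomes much easier to explain afterwards.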

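Extending toward a more complex version system

Once online editing is in place, there are many more capabilities that could be added:

| Capability | Done now? | Reason |
| --- | --- | --- |
| Rollback to a historical version | Not for now | The main pipeline comes first |
| Draft versions | Not for now | There is no multi-user collaborative editing scenario yet |
| Version diff | Not for now | Useful to display, but not required to close the loop |
| Edit locks | Not for now | Concurrent editing is not handled yet |
| Async job queue | Possible later | The synchronous flow should be clear first |
| Deep PDF/Word parsing | Can be added later | Lightweight text processing is supported first |

The trade-off here is: build the minimal closed loop first.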

```
upload
  -> process
  -> create version
  -> create chunks
  -> online edit
  -> reindex
```

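As long as this pipeline is clear, there is a place to attach async queues, multi-source sync, complex permissions, and version rollback later. If the minimal loop is not solid and you start with permissions, rollback, drafts, and third-party sync, it is easy to end up touching every feature a little while no pipeline is complete. Getting the smallest loop working first and extending it afterwards is what is usually called iterative product thinking.

4. Closing the Loop on Knowledge Base Content

In the end, the knowledge base system can be broken down into the following pipeline: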

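```
upload or sync content
  -> save the original asset
  -> parse the body text
  -> create a content version
  -> split into chunks
  -> build the vector index
  -> consumed by user Q&A
  -> record citations and feedback
  -> generate governance tasks
  -> edit the content version
  -> rebuild the index
```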

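Mapped onto the engineering structure:

| Stage | Core table or module | Key actions |
| --- | --- | --- |
| Content production | `documents` | Upload, sync, record the source |
| Content governance | `document_versions` | Editing, version management |
| Content retrieval | `document_chunks` | Chunking, vectorization, recall |
| Content consumption | Q&A service | Return answers and citations |
| Content feedback | Feedback records | Collect questions and ratings |
| Content iteration | reindex jobs | Update versions and indexes |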
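
The article does not describe how feedback turns into governance tasks. As one possible illustration of that step, the sketch below counts negative feedback per cited document and flags documents that reach a threshold. `Feedback`, `GovernanceTask`, `buildGovernanceTasks`, and the threshold are all hypothetical.

```go
package main

import (
    "fmt"
    "sort"
)

// Feedback is one user rating tied to the document an answer cited.
type Feedback struct {
    DocumentID string
    Helpful    bool
}

// GovernanceTask asks an editor to review one document.
type GovernanceTask struct {
    DocumentID    string
    NegativeCount int
}

// buildGovernanceTasks flags every document whose negative feedback count
// reaches the threshold. Results are sorted by document ID so the output
// is stable.
func buildGovernanceTasks(feedback []Feedback, threshold int) []GovernanceTask {
    counts := map[string]int{}
    for _, f := range feedback {
        if !f.Helpful {
            counts[f.DocumentID]++
        }
    }
    var tasks []GovernanceTask
    for id, n := range counts {
        if n >= threshold {
            tasks = append(tasks, GovernanceTask{DocumentID: id, NegativeCount: n})
        }
    }
    sort.Slice(tasks, func(i, j int) bool {
        return tasks[i].DocumentID < tasks[j].DocumentID
    })
    return tasks
}

func main() {
    feedback := []Feedback{
        {DocumentID: "doc-a", Helpful: false},
        {DocumentID: "doc-a", Helpful: false},
        {DocumentID: "doc-b", Helpful: true},
        {DocumentID: "doc-b", Helpful: false},
    }
    for _, t := range buildGovernanceTasks(feedback, 2) {
        fmt.Printf("review %s (%d negative ratings)\n", t.DocumentID, t.NegativeCount)
    }
}
```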

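The value of a knowledge base system lies not in how many files it holds, but in whether its content can keep being used, verified, and corrected. RAG turns documents from storage objects into reasoning context, and that requires the system to preserve content provenance, version relationships, retrieval structure, and user feedback.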