1.4 Schema in RAG

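What Is a Schema?

In computer science, a Schema is a formal definition of how data is structured and organized and of the constraints it must satisfy. In databases, a Schema defines table structures, field types, relationships, and so on. In document processing, a Schema can be understood as a description of the structure of a document and its metadata: it specifies which metadata fields are required, along with their types, formats, and permitted values.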

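Why Does RAG Need a Unified Metadata Schema?

Consistency: documents from different sources and of different types carry the same metadata fields when stored and retrieved, so they can be processed uniformly.
Predictability: the system knows which metadata every document chunk contains and can filter, sort, and aggregate on those fields.
Data quality: defined field types and constraints keep invalid or malformed data out of the system.
Maintainability: as the system grows, a unified Schema makes it easier to add new document types or fields.
Query optimization: a vector database that indexes metadata can use the field types declared in the Schema to filter efficiently.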

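How Do You Define a Metadata Schema?

A complete metadata Schema should specify the following for each field:

Field Name: the metadata key.
Field Type: string, integer, float, date, boolean, and so on.
Required: whether the field must be present.
Default Value: the value to use when the field is absent.
Description: the meaning and purpose of the field.
Constraints: string length, numeric range, enumerated values, and so on.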
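
The six components listed above map naturally onto a small data structure. The following is a minimal sketch rather than an established API: `FieldSpec`, its attribute names, and the two example entries are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One entry of a metadata Schema (hypothetical helper, not a library class)."""
    name: str                  # field name: the metadata key
    field_type: type           # field type, e.g. str, int, float, bool
    required: bool = False     # whether the field must be present
    default: Any = None        # value to use when an optional field is absent
    description: str = ""      # meaning and purpose of the field
    # Constraint: a function that returns True when a value is acceptable
    constraint: Optional[Callable[[Any], bool]] = None


# Two illustrative entries
title_spec = FieldSpec(
    "title", str, required=True,
    description="Document title",
    constraint=lambda value: len(value) <= 200,
)
author_spec = FieldSpec(
    "author", str, default="unknown",
    description="Document author",
    constraint=lambda value: len(value) <= 100,
)
```

Keeping the constraint as a plain callable is one possible design; a declarative form (for example a maximum length or a set of allowed values) would be easier to serialize and document.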

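An Example Metadata Schema for RAG

Suppose we have an enterprise knowledge base containing several kinds of documents (PDF reports, Word documents, web pages, and so on). We can define the following unified metadata Schema: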

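| Field | Type | Required | Default | Description | Constraints |
|-------------------|-------|----|-------|----------------|-------------------------------------|
| source | string | yes | none | Full path or URL of the document source | max length 500 |
| document_id | string | yes | none | Unique identifier of the document | UUID format |
| title | string | yes | none | Document title | max length 200 |
| author | string | no | unknown | Document author | max length 100 |
| created_date | date | no | none | Document creation date | ISO 8601 format |
| last_modified | date | no | none | Date the document was last modified | ISO 8601 format |
| document_type | string | yes | none | Document type | enum: pdf, word, excel, ppt, html, txt |
| page_number | integer | no | none | Page number (starting at 1) | greater than or equal to 1 |
| chunk_id | integer | yes | none | Sequential index of the chunk within the document | starts at 0 |
| chunk_start_index | integer | no | none | Start character index of the chunk in the original document | greater than or equal to 0 |
| chunk_end_index | integer | no | none | End character index of the chunk in the original document | greater than or equal to 0 |
| language | string | no | zh-CN | Document language | follows BCP 47 |
| tags | list of strings | no | empty list | Document tags | each tag at most 50 characters |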
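
To make the table concrete, here is a minimal sketch of a validator for a subset of the fields above. It is an illustration under stated assumptions, not a definitive implementation: `SCHEMA`, `validate_metadata`, the UUID regular expression, and the error message wording are all hypothetical names and choices introduced here.

```python
import re
from typing import Any, Dict, List

UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
DOCUMENT_TYPES = {"pdf", "word", "excel", "ppt", "html", "txt"}

# field name -> (type, required, constraint); covers only part of the table
SCHEMA = {
    "source": (str, True, lambda v: len(v) <= 500),
    "document_id": (str, True, lambda v: UUID_RE.match(v) is not None),
    "title": (str, True, lambda v: len(v) <= 200),
    "document_type": (str, True, lambda v: v in DOCUMENT_TYPES),
    "page_number": (int, False, lambda v: v >= 1),
    "chunk_id": (int, True, lambda v: v >= 0),
}


def validate_metadata(metadata: Dict[str, Any]) -> List[str]:
    """Return a list of violations; an empty list means the metadata conforms."""
    errors = []
    for name, (expected_type, required, constraint) in SCHEMA.items():
        if name not in metadata:
            if required:
                errors.append(f"{name}: missing required field")
            continue
        value = metadata[name]
        # bool is a subclass of int in Python, so reject it for integer fields
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(f"{name}: expected {expected_type.__name__}")
        elif not constraint(value):
            errors.append(f"{name}: constraint violated")
    return errors


good = {
    "source": "reports/q1.pdf",
    "document_id": "123e4567-e89b-12d3-a456-426614174000",
    "title": "Q1 Report",
    "document_type": "pdf",
    "page_number": 3,
    "chunk_id": 0,
}
bad = {k: v for k, v in good.items() if k != "title"}
bad.update({"document_type": "mp3", "page_number": 0})
```

Calling `validate_metadata(good)` yields an empty list, while `validate_metadata(bad)` reports the missing title, the unknown document type, and the out-of-range page number.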

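Enforcing the Schema in Code

After loading and chunking documents, we can normalize the metadata of every chunk so that it conforms to the Schema. The following function is one example: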

```python
import uuid
from datetime import datetime
from typing import Any, Dict, Optional


def _to_iso8601(value: Any) -> Optional[str]:
    """Convert a datetime or ISO-style string to an ISO 8601 string; return None otherwise."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, str) and value:
        try:
            # Simplified parsing: only ISO-style input is handled; real data may need more formats
            return datetime.fromisoformat(value.replace('Z', '+00:00')).isoformat()
        except ValueError:
            return None
    return None


def normalize_metadata(chunk, original_doc_metadata: Dict[str, Any], chunk_id: int, start_index: int, end_index: int) -> Dict[str, Any]:
    """
    Normalize metadata according to the Schema.

    Args:
        chunk: the text chunk object (unused here; kept so callers can pass it)
        original_doc_metadata: metadata of the original document
        chunk_id: index of the chunk
        start_index: start index of the chunk in the original document
        end_index: end index of the chunk in the original document

    Returns:
        The normalized metadata dictionary
    """
    # Read values from the original metadata, falling back to placeholders.
    # The Schema marks source, title and document_type as required, so a
    # stricter pipeline could raise an error here instead of substituting them.
    source = original_doc_metadata.get('source', 'unknown')
    title = original_doc_metadata.get('title', 'untitled')
    author = original_doc_metadata.get('author', 'unknown')
    document_type = original_doc_metadata.get('document_type', 'unknown')
    language = original_doc_metadata.get('language', 'zh-CN')
    tags = original_doc_metadata.get('tags', [])

    # Make sure tags is a list
    if not isinstance(tags, list):
        tags = [tags] if tags else []

    # Reuse an existing document_id; otherwise derive a deterministic UUID from
    # the source so that all chunks of one document share the same ID
    document_id = original_doc_metadata.get('document_id') or str(
        uuid.uuid5(uuid.NAMESPACE_URL, str(source))
    )

    # Build the normalized metadata
    normalized_metadata = {
        'source': str(source)[:500],  # enforce max length
        'document_id': document_id,
        'title': str(title)[:200],
        'author': str(author)[:100],
        'document_type': document_type,
        'chunk_id': chunk_id,
        'chunk_start_index': start_index,
        'chunk_end_index': end_index,
        'language': language,
        'tags': [str(tag)[:50] for tag in tags[:10]],  # cap tag count and tag length
    }

    # Optional fields with no default are added only when a valid value exists,
    # rather than inventing one (such as the current time)
    for key in ('created_date', 'last_modified'):
        iso_value = _to_iso8601(original_doc_metadata.get(key))
        if iso_value:
            normalized_metadata[key] = iso_value

    # Assumes the loader reports a 0-based 'page'; convert it to a 1-based page_number
    raw_page = original_doc_metadata.get('page')
    if raw_page is not None:
        try:
            normalized_metadata['page_number'] = max(int(raw_page) + 1, 1)
        except (TypeError, ValueError):
            pass  # leave page_number out if the value is not numeric

    return normalized_metadata
```

Applying the Schema During Chunking

During chunking, we can call this function to normalize the metadata of each chunk. For example:

```python
# Newer LangChain releases also expose this class from the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True  # adds start_index to each chunk's metadata
)

def split_and_normalize(documents):
    all_chunks = []
    for doc in documents:
        # Split the document into chunks
        chunks = text_splitter.split_documents([doc])

        # Normalize every chunk
        for i, chunk in enumerate(chunks):
            # start_index comes from the splitter; end_index is exclusive
            start_index = chunk.metadata.get('start_index', 0)
            end_index = start_index + len(chunk.page_content)

            # Normalize the metadata
            normalized_metadata = normalize_metadata(
                chunk=chunk,
                original_doc_metadata=doc.metadata,
                chunk_id=i,
                start_index=start_index,
                end_index=end_index
            )

            # Replace the chunk's metadata with the normalized version
            chunk.metadata = normalized_metadata
            all_chunks.append(chunk)

    return all_chunks

# Usage example
# documents = loader.load()
# normalized_chunks = split_and_normalize(documents)
```
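
To clarify what `chunk_start_index` and `chunk_end_index` mean, the sketch below uses a simplified fixed-size splitter written in plain Python. It is not how `RecursiveCharacterTextSplitter` splits text; `split_with_offsets` is a hypothetical helper that only illustrates the index convention used above, where the end index is exclusive and equals the start index plus the chunk length.

```python
from typing import List, Tuple


def split_with_offsets(text: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, int, str]]:
    """Fixed-size character splitter that records (start, end, chunk) per piece."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        pieces.append((start, start + len(chunk), chunk))
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return pieces


text = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_offsets(text, chunk_size=10, chunk_overlap=2)
for start, end, chunk in pieces:
    # Slicing the original text with the stored indices recovers the chunk
    assert text[start:end] == chunk
```

Storing both indices makes it possible to map a retrieved chunk back to its exact position in the source document, for example to highlight a citation.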

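Defining the Schema in the Vector Database

When storing document chunks in a vector database, the Schema should also guide how the collection is set up. Some databases let you declare field types on the collection; ChromaDB, used in the example below, does not enforce a metadata Schema, so conformance has to be ensured in code before the data is added: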

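```python
import chromadb

chroma_client = chromadb.PersistentClient(path="./chroma_db")

# ChromaDB does not enforce a metadata Schema, so field types cannot be declared here.
# The collection-level metadata below only selects cosine distance for the HNSW index.
# get_or_create_collection avoids an error when the collection already exists on disk.
collection = chroma_client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},
)

# When adding documents, make sure the metadata conforms to the Schema.
# Depending on the ChromaDB version, non-scalar values such as the tags list
# may be rejected and need to be converted to scalars first.
# These positional IDs are unique only within this batch.
collection.add(
    documents=[chunk.page_content for chunk in normalized_chunks],
    metadatas=[chunk.metadata for chunk in normalized_chunks],
    ids=[f"chunk_{i}" for i in range(len(normalized_chunks))]
)
```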
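
Because the database does not enforce the Schema, a last conversion step before insertion can be useful. The sketch below assumes the target store accepts only scalar metadata values (str, int, float, bool), which has been the case for ChromaDB in a number of releases but should be checked against the installed version. `to_scalar_metadata` and its `list_separator` parameter are hypothetical names.

```python
from typing import Any, Dict, Union

Scalar = Union[str, int, float, bool]


def to_scalar_metadata(metadata: Dict[str, Any], list_separator: str = ",") -> Dict[str, Scalar]:
    """Flatten metadata so that every value is a str, int, float or bool."""
    flat: Dict[str, Scalar] = {}
    for key, value in metadata.items():
        if value is None:
            continue  # drop missing optional fields instead of storing None
        if isinstance(value, (list, tuple)):
            # Join list items into one string, e.g. ["a", "b"] becomes "a,b"
            flat[key] = list_separator.join(str(item) for item in value)
        elif isinstance(value, (str, int, float, bool)):
            flat[key] = value
        else:
            flat[key] = str(value)  # fallback for other types such as datetime
    return flat


example = to_scalar_metadata(
    {"title": "Q1 Report", "tags": ["finance", "2024"], "page_number": 3, "author": None}
)
```

One trade-off of joining tags into a single string is that exact-match filtering on an individual tag is no longer possible; storing one boolean field per tag is an alternative when the tag vocabulary is small.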

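Using the Schema at Query Time

At query time, we can filter on metadata fields, for example to search only documents of a certain type or by a certain author: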

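```python
# Filter at query time; ChromaDB expects multiple conditions to be combined with $and
results = collection.query(
    query_texts=["your query here"],
    n_results=10,
    where={"$and": [{"document_type": "pdf"}, {"author": "张三"}]}  # filter conditions
)

# Range queries are also possible, again combining both bounds with $and
results = collection.query(
    query_texts=["your query here"],
    n_results=10,
    where={"$and": [{"page_number": {"$gte": 10}}, {"page_number": {"$lte": 20}}]}  # pages 10 through 20
)
```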
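
ChromaDB's `where` filter expects a single top-level key, so several conditions have to be wrapped in `$and`. The helpers below are a minimal sketch of how such filters could be assembled from simple parts; `build_where` and `in_range` are hypothetical names, and only the `$and`, `$gte`, and `$lte` operators are taken from ChromaDB.

```python
from typing import Any, Dict, List, Optional


def in_range(field: str, low: Any, high: Any) -> List[Dict[str, Any]]:
    """Express low <= field <= high as two single-operator conditions."""
    return [{field: {"$gte": low}}, {field: {"$lte": high}}]


def build_where(conditions: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Combine single-field conditions into one filter dictionary."""
    if not conditions:
        return None  # no filter at all
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no $and wrapper
    return {"$and": conditions}


where = build_where([{"document_type": "pdf"}] + in_range("page_number", 10, 20))
```

The resulting dictionary can be passed as the `where` argument of `collection.query`. Building filters from Schema field names in one place also makes it easier to catch typos in field names early.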

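Summary

A unified metadata Schema is a cornerstone of a robust RAG system. It keeps data consistent and improves both maintainability and query capability. In real projects, the Schema should be adapted to business requirements and enforced strictly at every stage of document processing.

With a clearly defined Schema, we can:

Standardize the data ingestion pipeline
Improve retrieval accuracy
Support complex filtering and sorting
Make system monitoring and debugging easier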