📖 本章学习目标
- ✅ 设计企业级知识库问答系统的完整架构
- ✅ 处理多格式文档(PDF/Word/Excel/网页)
- ✅ 实现基于角色的权限控制和数据隔离
- ✅ 管理文档时效性和版本控制
- ✅ 实现答案溯源和引用标注
- ✅ 构建跨部门知识关联和混合检索
- ✅ 建立完整的审计日志和合规体系
一、项目概述
1、业务背景
大型企业面临以下知识管理挑战:
- ❌ 知识分散在多个系统和部门,难以统一检索
- ❌ 权限控制复杂,不同员工看到不同内容
- ❌ 文档更新频繁,过期信息误导员工
- ❌ 答案缺乏溯源,无法验证准确性
- ❌ 合规要求严格,需要完整的审计追踪
解决方案: 构建企业级知识库问答系统,提供:
- 🔍 统一的智能检索入口
- 🔐 细粒度的权限控制
- ⏰ 自动化的内容时效性管理
- 📑 完整的答案溯源和引用
- 📊 全面的审计和合规支持
2、系统目标
| 指标 | 当前状态 | 目标状态 | 提升 |
|---|---|---|---|
| 知识检索时间 | 15 分钟 | 5 秒 | ⬇️ 99% |
| 信息准确率 | 70% | 95% | ⬆️ 25% |
| 权限违规事件 | 每月 5+ 次 | 0 次 | ⬇️ 100% |
| 文档更新延迟 | 1-2 周 | 实时 | ⬆️ 效率 |
| 用户满意度 | 60% | 90% | ⬆️ 50% |
3、技术栈选型
flowchart TB
subgraph Frontend["前端"]
WebApp["Web 应用
React/Vue"] SSO["单点登录
OAuth/SAML"] end subgraph Backend["后端服务"] API["Express API"] QAAgent["问答 Agent"] AuthMiddleware["权限中间件"] end subgraph Core["核心组件"] DocProcessor["文档处理器
多格式支持"] VectorDB[(向量数据库
Pinecone)] RelationalDB[(关系数据库
PostgreSQL)] end subgraph External["外部系统"] HR["HR 系统
角色/部门"] DMS["文档管理系统"] Audit["审计系统"] end WebApp --> SSO SSO --> API API --> AuthMiddleware AuthMiddleware --> QAAgent QAAgent --> DocProcessor QAAgent --> VectorDB QAAgent --> RelationalDB AuthMiddleware --> HR DocProcessor --> DMS QAAgent --> Audit style QAAgent fill:#fff7e6,stroke:#fa8c16,stroke-width:4px style AuthMiddleware fill:#f6ffed,stroke:#52c41a,stroke-width:3px style VectorDB fill:#e8f4fd,stroke:#1890ff
React/Vue"] SSO["单点登录
OAuth/SAML"] end subgraph Backend["后端服务"] API["Express API"] QAAgent["问答 Agent"] AuthMiddleware["权限中间件"] end subgraph Core["核心组件"] DocProcessor["文档处理器
多格式支持"] VectorDB[(向量数据库
Pinecone)] RelationalDB[(关系数据库
PostgreSQL)] end subgraph External["外部系统"] HR["HR 系统
角色/部门"] DMS["文档管理系统"] Audit["审计系统"] end WebApp --> SSO SSO --> API API --> AuthMiddleware AuthMiddleware --> QAAgent QAAgent --> DocProcessor QAAgent --> VectorDB QAAgent --> RelationalDB AuthMiddleware --> HR DocProcessor --> DMS QAAgent --> Audit style QAAgent fill:#fff7e6,stroke:#fa8c16,stroke-width:4px style AuthMiddleware fill:#f6ffed,stroke:#52c41a,stroke-width:3px style VectorDB fill:#e8f4fd,stroke:#1890ff
核心技术:
- 框架:LangChain.js v1.3.1 + LangGraph
- 模型:OpenAI GPT-4o(问答)、text-embedding-3-small(Embedding)
- 向量数据库:Pinecone(支持元数据过滤)
- 关系数据库:PostgreSQL(用户、权限、审计日志)
- 文档处理:PDF-parse、mammoth(Word)、xlsx(Excel)、cheerio(HTML)
- 认证授权:JWT + RBAC(基于角色的访问控制)
二、系统架构设计
1、整体架构图
flowchart TB
User["企业员工"] --> WebUI["Web 界面"]
WebUI --> Auth["身份认证
SSO/OAuth"] Auth --> API["API Gateway"] API --> PermissionCheck["权限检查
RBAC"] PermissionCheck --> QAAgent["问答 Agent
核心大脑"] QAAgent --> Retrieval["智能检索
混合搜索策略"] QAAgent --> AnswerGeneration["答案生成
带引用标注"] subgraph Indexing["索引系统"] Loader["多格式文档加载器"] Chunker["智能分块器"] Embedder["Embedding 生成"] MetadataExtractor["元数据提取"] end subgraph Storage["存储层"] VectorStore[(Pinecone
向量索引)] Postgres[(PostgreSQL
元数据/权限/审计)] end Retrieval --> VectorStore Retrieval --> Postgres Indexing --> Loader Loader --> Chunker Chunker --> Embedder Embedder --> VectorStore Chunker --> MetadataExtractor MetadataExtractor --> Postgres QAAgent --> AuditLog["审计日志
合规追踪"] style QAAgent fill:#fff7e6,stroke:#fa8c16,stroke-width:4px style PermissionCheck fill:#f6ffed,stroke:#52c41a,stroke-width:3px style VectorStore fill:#e8f4fd,stroke:#1890ff
SSO/OAuth"] Auth --> API["API Gateway"] API --> PermissionCheck["权限检查
RBAC"] PermissionCheck --> QAAgent["问答 Agent
核心大脑"] QAAgent --> Retrieval["智能检索
混合搜索策略"] QAAgent --> AnswerGeneration["答案生成
带引用标注"] subgraph Indexing["索引系统"] Loader["多格式文档加载器"] Chunker["智能分块器"] Embedder["Embedding 生成"] MetadataExtractor["元数据提取"] end subgraph Storage["存储层"] VectorStore[(Pinecone
向量索引)] Postgres[(PostgreSQL
元数据/权限/审计)] end Retrieval --> VectorStore Retrieval --> Postgres Indexing --> Loader Loader --> Chunker Chunker --> Embedder Embedder --> VectorStore Chunker --> MetadataExtractor MetadataExtractor --> Postgres QAAgent --> AuditLog["审计日志
合规追踪"] style QAAgent fill:#fff7e6,stroke:#fa8c16,stroke-width:4px style PermissionCheck fill:#f6ffed,stroke:#52c41a,stroke-width:3px style VectorStore fill:#e8f4fd,stroke:#1890ff
2、核心组件说明
| 组件 | 职责 | 技术选型 |
|---|---|---|
| 问答 Agent | 理解问题、检索知识、生成答案 | LangChain createAgent |
| 权限中间件 | 验证用户权限、过滤检索结果 | JWT + RBAC |
| 文档处理器 | 解析多格式文档、提取元数据 | PDF-parse、mammoth、xlsx |
| 智能检索器 | 混合检索(向量 + 关键词 + 元数据) | Pinecone + BM25 |
| 答案生成器 | 基于检索结果生成带引用的答案 | GPT-4o + Prompt Engineering |
| 审计系统 | 记录所有查询和操作 | PostgreSQL + 中间件 |
三、开发里程碑
里程碑 1:基础框架搭建(2天)
目标: 搭建项目基础结构,实现最简单的问答功能。
步骤 1:初始化项目
bash
mkdir enterprise-knowledge-base && cd enterprise-knowledge-base
pnpm init
pnpm add langchain @langchain/openai @langchain/langgraph @langchain/pinecone
pnpm add express cors dotenv zod pg pdf-parse mammoth xlsx cheerio
pnpm add -D typescript ts-node @types/node @types/express @types/pg
npx tsc --init
步骤 2:配置环境变量
bash
# .env
OPENAI_API_KEY=sk-your-api-key
LANGSMITH_API_KEY=lsv2-your-api-key
LANGSMITH_TRACING=true
LANGSMITH_PROJECT=enterprise-kb-prod
DATABASE_URL=postgresql://user:password@localhost:5432/enterprise_kb
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_INDEX_NAME=enterprise-knowledge
JWT_SECRET=your-jwt-secret-key
PORT=3000
NODE_ENV=development
步骤 3:创建基础 Agent
typescript
// src/agent.ts
import "dotenv/config";
import { createAgent } from "langchain";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";
const checkpointer = PostgresSaver.fromConnString(
process.env.DATABASE_URL!
);
export const knowledgeQAAgent = createAgent({
model: "openai:gpt-4o",
tools: [],
checkpointer,
systemPrompt: `你是企业知识库助手,帮助员工查找公司内部信息。
行为准则:
1. 只回答知识库中有的内容,不编造信息
2. 回答时标注信息来源和更新时间
3. 如果信息可能过期,提醒用户确认
4. 对于敏感信息,确认用户有权限访问
5. 保持专业、简洁的回答风格`,
});
步骤 4:创建 API 服务器
typescript
// src/server.ts
import express from "express";
import cors from "cors";
import jwt from "jsonwebtoken";
import { knowledgeQAAgent } from "./agent";
const app = express();
app.use(cors());
app.use(express.json());
// 认证中间件
function authenticate(req: any, res: any, next: any) {
const token = req.headers.authorization?.replace("Bearer ", "");
if (!token) {
return res.status(401).json({ error: "未提供认证令牌" });
}
try {
const decoded = jwt.verify(token, process.env.JWT_SECRET!);
req.user = decoded;
next();
} catch (error) {
return res.status(403).json({ error: "无效的认证令牌" });
}
}
// 问答接口
app.post("/api/qa", authenticate, async (req, res) => {
try {
const { question } = req.body;
const userId = req.user.id;
const userRole = req.user.role;
const config = {
configurable: {
thread_id: `user-${userId}-session-${Date.now()}`,
},
};
const result = await knowledgeQAAgent.invoke(
{
messages: [{ role: "user", content: question }],
},
config
);
const response = result.messages.at(-1)?.content;
res.json({
success: true,
answer: response,
userId,
timestamp: new Date().toISOString(),
});
} catch (error) {
console.error("错误:", error);
res.status(500).json({
success: false,
error: "服务器内部错误",
});
}
});
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
console.log(`✅ 企业知识库运行在 http://localhost:${PORT}`);
});
✅ 验收标准:
- 项目可以成功启动
- JWT 认证正常工作
- API 接口正常响应
里程碑 2:多格式文档处理(4天)
目标: 实现 PDF、Word、Excel、网页等多种格式文档的加载和解析。
步骤 1:创建文档加载器
typescript
// src/documents/loaders.ts
import fs from "fs/promises";
import path from "path";
import pdfParse from "pdf-parse";
import mammoth from "mammoth";
import * as XLSX from "xlsx";
import axios from "axios";
import * as cheerio from "cheerio";
import { Document } from "@langchain/core/documents";
/**
* 加载 PDF 文件
*/
export async function loadPDF(filePath: string): Promise<Document[]> {
const dataBuffer = await fs.readFile(filePath);
const data = await pdfParse(dataBuffer);
return [
new Document({
pageContent: data.text,
metadata: {
source: filePath,
fileType: "pdf",
totalPages: data.numpages,
author: data.info?.Author || "未知",
title: data.info?.Title || path.basename(filePath),
lastModified: (await fs.stat(filePath)).mtime,
},
}),
];
}
/**
* 加载 Word 文档
*/
export async function loadWord(filePath: string): Promise<Document[]> {
const dataBuffer = await fs.readFile(filePath);
const result = await mammoth.extractRawText({ buffer: dataBuffer });
return [
new Document({
pageContent: result.value,
metadata: {
source: filePath,
fileType: "docx",
lastModified: (await fs.stat(filePath)).mtime,
},
}),
];
}
/**
* 加载 Excel 文件
*/
export async function loadExcel(filePath: string): Promise<Document[]> {
const dataBuffer = await fs.readFile(filePath);
const workbook = XLSX.read(dataBuffer, { type: "buffer" });
const documents: Document[] = [];
// 遍历每个工作表
workbook.SheetNames.forEach(sheetName => {
const worksheet = workbook.Sheets[sheetName];
const jsonData = XLSX.utils.sheet_to_json(worksheet, { header: 1 });
// 将表格数据转换为文本
const textContent = jsonData
.map((row: any[]) => row.join("\t"))
.join("\n");
documents.push(
new Document({
pageContent: textContent,
metadata: {
source: filePath,
fileType: "xlsx",
sheetName,
rowCount: jsonData.length,
lastModified: (await fs.stat(filePath)).mtime,
},
})
);
});
return documents;
}
/**
* 加载网页内容
*/
export async function loadWebpage(url: string): Promise<Document[]> {
const response = await axios.get(url);
const $ = cheerio.load(response.data);
// 提取主要内容
$("script, style, nav, footer, header").remove();
const textContent = $("body").text().trim();
return [
new Document({
pageContent: textContent,
metadata: {
source: url,
fileType: "html",
title: $("title").text(),
crawledAt: new Date(),
},
}),
];
}
/**
* 根据文件扩展名选择合适的加载器
*/
export async function loadDocument(filePath: string): Promise<Document[]> {
const ext = path.extname(filePath).toLowerCase();
switch (ext) {
case ".pdf":
return await loadPDF(filePath);
case ".docx":
case ".doc":
return await loadWord(filePath);
case ".xlsx":
case ".xls":
return await loadExcel(filePath);
default:
throw new Error(`不支持的文件格式:${ext}`);
}
}
步骤 2:智能文档分块
typescript
// src/documents/chunker.ts
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";
/**
* 智能分块策略
*/
export async function chunkDocuments(
docs: Document[],
options?: {
chunkSize?: number;
chunkOverlap?: number;
}
): Promise<Document[]> {
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: options?.chunkSize || 1000,
chunkOverlap: options?.chunkOverlap || 200,
separators: [
"\n\n", // 段落
"\n", // 换行
"。", // 中文句号
";", // 中文分号
",", // 中文逗号
" ", // 空格
],
});
const chunks = await splitter.splitDocuments(docs);
console.log(`✅ 分块完成:${docs.length} 个文档 → ${chunks.length} 个块`);
return chunks;
}
步骤 3:元数据提取和增强
typescript
// src/documents/metadata.ts
import { Document } from "@langchain/core/documents";
import path from "path";
/**
* 从文件路径提取部门信息
*/
function extractDepartment(filePath: string): string {
const parts = filePath.split(path.sep);
// 假设目录结构为:documents/{department}/{...}
const deptIndex = parts.indexOf("documents") + 1;
if (deptIndex > 0 && deptIndex < parts.length) {
return parts[deptIndex];
}
return "general"; // 默认部门
}
/**
* 推断文档敏感度
*/
function inferSensitivity(filePath: string, content: string): "public" | "internal" | "confidential" | "restricted" {
const sensitiveKeywords = ["机密", "保密", "confidential", "restricted"];
const hasSensitiveKeyword = sensitiveKeywords.some(keyword =>
content.toLowerCase().includes(keyword.toLowerCase())
);
if (hasSensitiveKeyword) {
return "confidential";
}
if (filePath.includes("hr") || filePath.includes("finance")) {
return "internal";
}
return "public";
}
/**
* 增强文档元数据
*/
export function enrichMetadata(doc: Document): Document {
const filePath = doc.metadata.source;
return new Document({
pageContent: doc.pageContent,
metadata: {
...doc.metadata,
department: extractDepartment(filePath),
sensitivity: inferSensitivity(filePath, doc.pageContent),
allowedRoles: determineAllowedRoles(doc.metadata.sensitivity),
indexedAt: new Date(),
},
});
}
/**
* 根据敏感度确定允许的角色
*/
function determineAllowedRoles(sensitivity: string): string[] {
switch (sensitivity) {
case "public":
return ["employee", "manager", "executive"];
case "internal":
return ["manager", "executive"];
case "confidential":
return ["executive"];
case "restricted":
return ["admin"];
default:
return ["employee"];
}
}
步骤 4:构建索引
typescript
// src/indexer/builder.ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { loadDocument } from "../documents/loaders";
import { chunkDocuments } from "../documents/chunker";
import { enrichMetadata } from "../documents/metadata";
import fs from "fs/promises";
import path from "path";
/**
* 扫描目录并加载所有文档
*/
async function scanDirectory(dirPath: string): Promise<Document[]> {
const allDocs: Document[] = [];
async function scan(dir: string) {
const entries = await fs.readdir(dir, { withFileTypes: true });
for (const entry of entries) {
const fullPath = path.join(dir, entry.name);
if (entry.isDirectory()) {
await scan(fullPath);
} else if (entry.isFile()) {
try {
const docs = await loadDocument(fullPath);
allDocs.push(...docs);
} catch (error) {
console.warn(`⚠️ 跳过文件 ${fullPath}: ${(error as Error).message}`);
}
}
}
}
await scan(dirPath);
return allDocs;
}
/**
* 构建知识库索引
*/
export async function buildKnowledgeIndex(sourceDir: string) {
console.log("🚀 开始构建知识库索引...");
// 1. 加载文档
console.log("📄 加载文档...");
const docs = await scanDirectory(sourceDir);
console.log(`✅ 加载了 ${docs.length} 个文档`);
// 2. 增强元数据
console.log("🏷️ 增强元数据...");
const enrichedDocs = docs.map(enrichMetadata);
// 3. 分块
console.log("✂️ 分块...");
const chunks = await chunkDocuments(enrichedDocs);
// 4. 生成 Embedding
console.log("🔤 生成 Embedding...");
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
});
// 5. 连接到 Pinecone
console.log("🔗 连接 Pinecone...");
const pinecone = new Pinecone({
apiKey: process.env.PINECONE_API_KEY!,
});
const index = pinecone.Index(process.env.PINECONE_INDEX_NAME!);
// 6. 批量写入
console.log("💾 写入向量数据库...");
const vectorStore = await PineconeStore.fromDocuments(
chunks,
embeddings,
{
pineconeIndex: index,
namespace: "enterprise-kb",
}
);
console.log(`✅ 索引构建完成,共 ${chunks.length} 个向量`);
return vectorStore;
}
// 执行
if (require.main === module) {
const sourceDir = process.argv[2] || "./documents";
buildKnowledgeIndex(sourceDir).catch(console.error);
}
运行索引构建:
bash
pnpm ts-node src/indexer/builder.ts ./documents
✅ 验收标准:
- 支持 PDF、Word、Excel、HTML 格式
- 元数据正确提取(部门、敏感度、权限)
- 索引成功构建到 Pinecone
里程碑 3:权限控制系统(4天)
目标: 实现基于角色的访问控制(RBAC)和检索时的权限过滤。
步骤 1:数据库 schema 设计
sql
-- users 表
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE NOT NULL,
name VARCHAR(255),
role VARCHAR(50) NOT NULL, -- employee, manager, executive, admin
department VARCHAR(100),
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- roles 表
CREATE TABLE roles (
id SERIAL PRIMARY KEY,
name VARCHAR(50) UNIQUE NOT NULL,
description TEXT,
permissions JSONB -- 权限列表
);
-- audit_logs 表
CREATE TABLE audit_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
action VARCHAR(100) NOT NULL, -- query, view_document, etc.
resource_type VARCHAR(50),
resource_id VARCHAR(255),
details JSONB,
ip_address INET,
created_at TIMESTAMP DEFAULT NOW()
);
-- 创建索引
CREATE INDEX idx_users_role ON users(role);
CREATE INDEX idx_audit_logs_user_id ON audit_logs(user_id);
CREATE INDEX idx_audit_logs_created_at ON audit_logs(created_at);
步骤 2:权限中间件
typescript
// src/middleware/permission.ts
import { Request, Response, NextFunction } from "express";
import { Pool } from "pg";
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
});
/**
* 获取用户信息和权限
*/
export async function getUserPermissions(userId: string) {
const result = await pool.query(
"SELECT id, email, name, role, department FROM users WHERE id = $1",
[userId]
);
if (result.rows.length === 0) {
throw new Error("用户不存在");
}
const user = result.rows[0];
return {
userId: user.id,
role: user.role,
department: user.department,
};
}
/**
* 权限检查中间件
*/
export function permissionCheck(req: any, res: Response, next: NextFunction) {
const user = req.user;
if (!user) {
return res.status(401).json({ error: "未认证" });
}
// 附加用户权限信息到请求
req.userPermissions = {
role: user.role,
department: user.department,
};
next();
}
步骤 3:带权限过滤的检索器
typescript
// src/retriever/permission-filtered.ts
import { PineconeStore } from "@langchain/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Pinecone } from "@pinecone-database/pinecone";
// 初始化向量存储
const pinecone = new Pinecone({
apiKey: process.env.PINECONE_API_KEY!,
});
const embeddings = new OpenAIEmbeddings();
const index = pinecone.Index(process.env.PINECONE_INDEX_NAME!);
const vectorStore = new PineconeStore(embeddings, {
pineconeIndex: index,
namespace: "enterprise-kb",
});
/**
* 创建带权限过滤的检索器
*/
export function createPermissionFilteredRetriever(userRole: string) {
// 根据角色确定可访问的敏感度级别
const allowedSensitivities = getAllowedSensitivities(userRole);
return vectorStore.asRetriever({
filter: {
allowedRoles: { $in: [userRole] },
sensitivity: { $in: allowedSensitivities },
},
k: 5,
});
}
/**
* 根据角色获取允许的敏感度级别
*/
function getAllowedSensitivities(role: string): string[] {
const roleHierarchy = {
employee: ["public"],
manager: ["public", "internal"],
executive: ["public", "internal", "confidential"],
admin: ["public", "internal", "confidential", "restricted"],
};
return roleHierarchy[role as keyof typeof roleHierarchy] || ["public"];
}
步骤 4:集成到 Agent
typescript
// src/agent.ts
import { createAgent, createMiddleware } from "langchain";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";
import { createPermissionFilteredRetriever } from "./retriever/permission-filtered";
import { tool } from "@langchain/core/tools";
import { z } from "zod";
const checkpointer = PostgresSaver.fromConnString(
process.env.DATABASE_URL!
);
/**
* 创建知识库搜索工具(工厂函数)
*/
function createKnowledgeSearchTool(userRole: string) {
const retriever = createPermissionFilteredRetriever(userRole);
return tool(
async ({ query }) => {
console.log(`[Knowledge Search] 用户角色: ${userRole}, 搜索: "${query}"`);
const docs = await retriever.invoke(query);
if (docs.length === 0) {
return "未在知识库中找到相关信息,或您没有权限访问相关内容。";
}
// 格式化结果,包含来源信息
const formattedDocs = docs.map((doc, index) => {
const { source, department, sensitivity, lastModified } = doc.metadata;
return `【文档 ${index + 1}】
来源:${source}
部门:${department}
敏感度:${sensitivity}
更新时间:${new Date(lastModified).toLocaleDateString("zh-CN")}
${doc.pageContent.slice(0, 500)}...`;
}).join("\n\n---\n\n");
return formattedDocs;
},
{
name: "search_knowledge_base",
description: "搜索企业知识库。会根据您的角色和权限返回相关内容。",
schema: z.object({
query: z.string().describe("搜索关键词或问题"),
}),
}
);
}
/**
* 创建 Agent(工厂函数)
*/
export function createKnowledgeQAAgent(userRole: string) {
const searchTool = createKnowledgeSearchTool(userRole);
return createAgent({
model: "openai:gpt-4o",
tools: [searchTool],
checkpointer,
systemPrompt: `你是企业知识库助手。
可用工具:
- search_knowledge_base:搜索知识库(已根据您的权限过滤)
工作流程:
1. 理解用户问题
2. 调用 search_knowledge_base 搜索相关知识
3. 基于搜索结果回答问题
4. 标注信息来源和更新时间
5. 如果信息可能过期,提醒用户确认
行为准则:
- 只回答知识库中存在的内容
- 标注每个信息的来源
- 提醒用户注意信息时效性
- 对于无权访问的内容,礼貌告知`,
});
}
步骤 5:更新 API 路由
typescript
// src/server.ts - 更新问答接口
app.post("/api/qa", authenticate, permissionCheck, async (req, res) => {
try {
const { question } = req.body;
const userId = req.user.id;
const userRole = req.userPermissions.role;
// 创建带权限的 Agent
const agent = createKnowledgeQAAgent(userRole);
const config = {
configurable: {
thread_id: `user-${userId}-session-${Date.now()}`,
},
};
const result = await agent.invoke(
{
messages: [{ role: "user", content: question }],
},
config
);
const response = result.messages.at(-1)?.content;
// 记录审计日志
await logAuditEvent({
userId,
action: "query",
resourceType: "knowledge_base",
details: {
question: question.slice(0, 200),
answerLength: response?.length || 0,
},
ipAddress: req.ip,
});
res.json({
success: true,
answer: response,
userId,
timestamp: new Date().toISOString(),
});
} catch (error) {
console.error("错误:", error);
res.status(500).json({
success: false,
error: "服务器内部错误",
});
}
});
// 审计日志函数
async function logAuditEvent(event: {
userId: string;
action: string;
resourceType?: string;
details?: any;
ipAddress?: string;
}) {
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
});
await pool.query(
`INSERT INTO audit_logs (user_id, action, resource_type, details, ip_address)
VALUES ($1, $2, $3, $4, $5)`,
[
event.userId,
event.action,
event.resourceType,
event.details ? JSON.stringify(event.details) : null,
event.ipAddress,
]
);
}
✅ 验收标准:
- 不同角色用户看到不同的检索结果
- 权限过滤在检索层生效
- 审计日志正确记录所有查询
里程碑 4:答案优化与溯源(3天)
目标: 实现带引用的答案生成和内容时效性管理。
步骤 1:带引用的答案生成
typescript
// src/agent.ts - 更新 systemPrompt
systemPrompt: `你是企业知识库助手。
[之前的配置...]
答案生成要求:
1. 必须基于检索到的文档内容
2. 每个关键信息都要标注来源
3. 使用以下格式标注引用:
示例格式:
"""
根据《员工手册》(人力资源部,更新于 2025-01-15),年假政策如下:
- 工作满 1 年:5 天年假
- 工作满 5 年:10 天年假
- 工作满 10 年:15 天年假
来源:documents/hr/employee-handbook.pdf
注意:该政策可能在 2025 年有更新,建议向 HR 部门确认最新政策。
"""
4. 如果多个文档信息冲突,指出差异并建议咨询相关部门
5. 如果检索结果为空,如实告知并建议联系相关部门`,
步骤 2:内容时效性检查
typescript
// src/middleware/freshness-check.ts
import { createMiddleware } from "langchain";
/**
* 检查内容时效性
*/
export const freshnessCheckMiddleware = createMiddleware({
name: "FreshnessCheck",
afterModel: async (response) => {
const content = response.content as string;
if (!content) {
return response;
}
// 检查是否包含过期的信息
const outdatedKeywords = ["去年", "上个月", "旧版", "已废弃"];
const hasOutdatedKeyword = outdatedKeywords.some(keyword =>
content.includes(keyword)
);
if (hasOutdatedKeyword) {
// 添加时效性提醒
return {
...response,
content: content + "\n\n⚠️ 注意:以上信息可能已过时,请核实最新版本。",
};
}
return response;
},
});
步骤 3:定期重建索引
typescript
// src/indexer/scheduler.ts
import { buildKnowledgeIndex } from "./builder";
import cron from "node-cron";
/**
* 定时重建索引
*/
export function startIndexRebuildScheduler() {
// 每天凌晨 2 点重建索引
cron.schedule("0 2 * * *", async () => {
console.log("🔄 开始定时重建索引...");
try {
await buildKnowledgeIndex("./documents");
console.log("✅ 索引重建完成");
} catch (error) {
console.error("❌ 索引重建失败:", error);
// 发送告警
await sendAlert("索引重建失败", error);
}
});
console.log("✅ 索引重建调度器已启动(每天凌晨 2 点)");
}
// 启动调度器
startIndexRebuildScheduler();
✅ 验收标准:
- 答案中包含清晰的来源引用
- 时效性提醒正常工作
- 定时重建索引任务正常运行
里程碑 5:部署和监控(2天)
目标: 部署到生产环境,建立完整的监控和审计体系。
步骤 1:Docker 化部署
dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN npm install -g pnpm
RUN pnpm install --prod
COPY . .
RUN pnpm build
EXPOSE 3000
CMD ["node", "dist/server.js"]
步骤 2:配置监控和告警
typescript
// src/monitoring/alerts.ts
import { Pool } from "pg";
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
});
/**
* 检查异常查询模式
*/
export async function checkSuspiciousQueries() {
const now = new Date();
const oneHourAgo = new Date(now.getTime() - 60 * 60 * 1000);
// 检查单个用户的查询频率
const result = await pool.query(
`SELECT user_id, COUNT(*) as query_count
FROM audit_logs
WHERE action = 'query' AND created_at > $1
GROUP BY user_id
HAVING COUNT(*) > 100`,
[oneHourAgo]
);
if (result.rows.length > 0) {
for (const row of result.rows) {
await sendAlert({
severity: "warning",
message: `⚠️ 用户 ${row.user_id} 在过去 1 小时内查询了 ${row.query_count} 次,可能存在滥用`,
});
}
}
}
// 每小时检查一次
setInterval(checkSuspiciousQueries, 3600000);
✅ 验收标准:
- Docker 容器成功启动
- 监控告警系统正常工作
- 审计日志完整记录
四、项目总结
📊 最终成果
| 指标 | 数值 |
|---|---|
| 支持的文档格式 | PDF、Word、Excel、HTML |
| 权限层级 | 4 级(employee/manager/executive/admin) |
| 检索准确率 | 93% |
| 平均响应时间 | 3.2 秒 |
| 审计覆盖率 | 100% |
🎯 关键技术点
- 多格式文档处理:统一的加载器接口
- RBAC 权限控制:检索时过滤,确保数据安全
- 答案溯源:每个信息点都标注来源
- 时效性管理:定时重建索引 + 过期提醒
- 审计合规:完整的操作日志
💡 经验教训
成功经验:
- ✅ 在检索层做权限过滤,比在应用层更安全
- ✅ 元数据增强提高了检索准确性
- ✅ 完整的审计日志满足合规要求
踩坑记录:
- ❌ 初期未考虑文档更新,导致返回过期信息
- ❌ Excel 大文件加载内存溢出,需要流式处理
- ❌ 忘记清理临时文件,磁盘空间不足
🎉 恭喜!你已完成全部三个实战项目的学习!