Chapter 15: Capstone Project 1: An Intelligent Document Processing System
Preface
Hi everyone, I'm 鲫小鱼, a front-end engineer who doesn't write front-end code. I enjoy sharing knowledge beyond the front end and helping pixel-pushers break out of the pixel-pushing bubble. Follow me on WeChat: 《鲫小鱼不正经》. Likes, bookmarks, and follows are all appreciated!
🎯 Learning Objectives
By the end of this chapter, you will:
- Build a complete intelligent document processing system covering upload, parsing, indexing, retrieval, and Q&A
- Master multimodal document processing: PDF, Word, Excel, images, audio, and more
- Implement a LangChain.js-based RAG system with support for complex queries and citation tracing
- Integrate LangGraph workflows to automate the document processing pipeline
- Build a Next.js full-stack application with a modern user interface
- Learn the architecture and best practices of enterprise-grade document processing systems
📋 Project Overview
System features
🔍 Intelligent document parsing
- Multiple formats: PDF, Word, Excel, PPT, images, audio
- OCR text recognition and table extraction
- Automatic document classification and tag generation
- Document structure analysis and metadata extraction
🧠 Intelligent Q&A
- RAG-based question answering over documents
- Multi-turn conversations with context memory
- Citation tracing and confidence scoring
- Complex query understanding and intent recognition
⚡ Automated workflows
- LangGraph-driven document processing pipeline
- Batch document processing and quality checks
- Error handling and retry mechanisms
- Real-time progress monitoring
🎨 Modern UI
- Next.js 14 + TypeScript full-stack application
- Responsive design with mobile support
- Real-time progress and status updates
- Intuitive document management and search
🏗️ System Architecture
Overall architecture diagram
text
┌─────────────────────────────────────────────────────────────┐
│                   Frontend layer (Next.js)                   │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐         │
│  │  Document   │   │ Intelligent │   │  Document   │         │
│  │  upload UI  │   │   Q&A UI    │   │ management  │         │
│  └─────────────┘   └─────────────┘   └─────────────┘         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│            API gateway layer (Next.js API Routes)            │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐         │
│  │  Document   │   │     Q&A     │   │    User     │         │
│  │ processing  │   │   service   │   │ management  │         │
│  │     API     │   │     API     │   │     API     │         │
│  └─────────────┘   └─────────────┘   └─────────────┘         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Business service layer                    │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐         │
│  │  Document   │   │     RAG     │   │  LangGraph  │         │
│  │   parsing   │   │  retrieval  │   │  workflows  │         │
│  └─────────────┘   └─────────────┘   └─────────────┘         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Data storage layer                      │
│  ┌─────────────┐   ┌──────────────┐   ┌─────────────┐        │
│  │  Vector DB  │   │  Relational  │   │   Object    │        │
│  │  (Chroma)   │   │ (PostgreSQL) │   │store (MinIO)│        │
│  └─────────────┘   └──────────────┘   └─────────────┘        │
└─────────────────────────────────────────────────────────────┘
Technology Stack
Frontend
- Next.js 14 (App Router)
- TypeScript
- Tailwind CSS
- React Hook Form
- React Query
Backend
- Node.js + TypeScript
- LangChain.js
- LangGraph
- Express.js
AI model services
- OpenAI GPT-4 / GPT-3.5-turbo
- OpenAI Embeddings
- Whisper (speech-to-text)
- Tesseract.js (OCR)
Data storage
- Chroma (vector database)
- PostgreSQL (relational database)
- MinIO (object storage)
- Redis (cache)
Deployment & operations
- Docker + Docker Compose
- Nginx (reverse proxy)
- PM2 (process management)
🚀 Project Setup
Environment preparation
bash
# 1. Create the project directory
mkdir intelligent-document-system
cd intelligent-document-system
# 2. Initialize the Next.js project
npx create-next-app@latest . --typescript --tailwind --eslint --app --src-dir --import-alias "@/*"
# 3. Install core dependencies
npm install @langchain/core @langchain/community @langchain/openai @langchain/langgraph
npm install @langchain/textsplitters chromadb
npm install multer formidable
npm install tesseract.js pdf-parse mammoth xlsx
npm install @prisma/client
npm install ioredis
npm install zod react-hook-form @hookform/resolvers
npm install @tanstack/react-query
npm install react-dropzone lucide-react clsx tailwind-merge
# 4. Install development dependencies
npm install -D prisma @types/node @types/multer tsx nodemon
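Several services below read configuration from environment variables (OPENAI_API_KEY, DATABASE_URL, CHROMA_URL, REDIS_URL). Since zod is already a dependency, a small validation module can fail fast on missing configuration. This is a minimal sketch with an assumed file path and variable set:
typescript
// src/lib/env.ts (hypothetical helper; variable names taken from later sections)
import { z } from 'zod';

const envSchema = z.object({
  OPENAI_API_KEY: z.string().min(1, 'OPENAI_API_KEY is required'),
  DATABASE_URL: z.string().url(),
  CHROMA_URL: z.string().url().default('http://localhost:8000'),
  REDIS_URL: z.string().default('redis://localhost:6379'),
});

// Throws at startup with a readable error if anything is missing or malformed.
export const env = envSchema.parse(process.env);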
Project structure
text
intelligent-document-system/
├── src/
│   ├── app/                 # Next.js App Router
│   │   ├── api/             # API routes
│   │   │   ├── documents/   # Document processing API
│   │   │   ├── chat/        # Q&A API
│   │   │   └── upload/      # File upload API
│   │   ├── documents/       # Document management pages
│   │   ├── chat/            # Q&A pages
│   │   └── layout.tsx       # Root layout
│   ├── components/          # Reusable components
│   │   ├── ui/              # Base UI components
│   │   ├── document/        # Document-related components
│   │   └── chat/            # Chat-related components
│   ├── lib/                 # Utilities
│   │   ├── ai/              # AI helpers
│   │   ├── db/              # Database helpers
│   │   ├── storage/         # Storage helpers
│   │   └── utils.ts         # Shared utilities
│   ├── services/            # Business services
│   │   ├── document/        # Document processing service
│   │   ├── rag/             # RAG service
│   │   └── workflow/        # Workflow service
│   └── types/               # TypeScript type definitions
├── prisma/                  # Database schema
├── docker/                  # Docker configuration
├── docs/                    # Project documentation
└── package.json
📄 Database Design
Prisma schema
prisma
// prisma/schema.prisma
generator client {
provider = "prisma-client-js"
}
datasource db {
provider = "postgresql"
url = env("DATABASE_URL")
}
model User {
id String @id @default(cuid())
email String @unique
name String?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
documents Document[]
chats Chat[]
@@map("users")
}
model Document {
id String @id @default(cuid())
title String
filename String
fileType String
fileSize Int
filePath String
status DocumentStatus @default(PROCESSING)
metadata Json?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
userId String
user User @relation(fields: [userId], references: [id])
chunks DocumentChunk[]
chats Chat[]
@@map("documents")
}
model DocumentChunk {
id String @id @default(cuid())
content String
metadata Json?
embedding Float[]
chunkIndex Int
createdAt DateTime @default(now())
documentId String
document Document @relation(fields: [documentId], references: [id])
citations Citation[]
@@map("document_chunks")
}
model Chat {
id String @id @default(cuid())
title String?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
userId String
user User @relation(fields: [userId], references: [id])
documentId String?
document Document? @relation(fields: [documentId], references: [id])
messages Message[]
@@map("chats")
}
model Message {
id String @id @default(cuid())
content String
role MessageRole
metadata Json?
createdAt DateTime @default(now())
chatId String
chat Chat @relation(fields: [chatId], references: [id])
citations Citation[]
@@map("messages")
}
model Citation {
id String @id @default(cuid())
content String
source String
pageNumber Int?
confidence Float
chunkId String?
chunk DocumentChunk? @relation(fields: [chunkId], references: [id])
messageId String?
message Message? @relation(fields: [messageId], references: [id])
@@map("citations")
}
enum DocumentStatus {
PROCESSING
COMPLETED
FAILED
DELETED
}
enum MessageRole {
USER
ASSISTANT
SYSTEM
}
Database initialization
bash
# Generate the Prisma client
npx prisma generate
# Create and apply the initial migration
npx prisma migrate dev --name init
# Inspect the database in Prisma Studio
npx prisma studio
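The API routes later in this chapter import `prisma` from `@/lib/db`, which is not shown elsewhere. A minimal sketch of that module, using the common Next.js singleton pattern so hot reloads in development don't exhaust database connections (the file path is an assumption):
typescript
// src/lib/db.ts
import { PrismaClient } from '@prisma/client';

// Reuse a single PrismaClient instance across hot reloads in development.
const globalForPrisma = globalThis as unknown as { prisma?: PrismaClient };

export const prisma = globalForPrisma.prisma ?? new PrismaClient();

if (process.env.NODE_ENV !== 'production') {
  globalForPrisma.prisma = prisma;
}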
🔧 Core Services
Document parsing service
typescript
// src/services/document/parser.ts
import { Document } from '@langchain/core/documents';
import pdf from 'pdf-parse';
import mammoth from 'mammoth';
import Tesseract from 'tesseract.js';
import * as XLSX from 'xlsx';
export interface DocumentMetadata {
title: string;
author?: string;
pages?: number;
wordCount?: number;
language?: string;
createdAt: Date;
}
export class DocumentParser {
async parsePDF(buffer: Buffer): Promise<{ content: string; metadata: DocumentMetadata }> {
try {
const data = await pdf(buffer);
return {
content: data.text,
metadata: {
title: data.info?.Title || 'Untitled',
author: data.info?.Author,
pages: data.numpages,
wordCount: data.text.split(/\s+/).length,
createdAt: new Date()
}
};
} catch (error) {
throw new Error(`PDF 解析失败: ${error.message}`);
}
}
async parseWord(buffer: Buffer): Promise<{ content: string; metadata: DocumentMetadata }> {
try {
const result = await mammoth.extractRawText({ buffer });
return {
content: result.value,
metadata: {
title: 'Word Document',
wordCount: result.value.split(/\s+/).length,
createdAt: new Date()
}
};
} catch (error) {
throw new Error(`Word 文档解析失败: ${error.message}`);
}
}
async parseExcel(buffer: Buffer): Promise<{ content: string; metadata: DocumentMetadata }> {
try {
const workbook = XLSX.read(buffer);
let content = '';
workbook.SheetNames.forEach(sheetName => {
const worksheet = workbook.Sheets[sheetName];
const sheetData = XLSX.utils.sheet_to_csv(worksheet);
content += `Sheet: ${sheetName}\n${sheetData}\n\n`;
});
return {
content,
metadata: {
title: workbook.Props?.Title || 'Excel Document',
createdAt: new Date()
}
};
} catch (error) {
throw new Error(`Excel 文档解析失败: ${error.message}`);
}
}
async parseImage(buffer: Buffer): Promise<{ content: string; metadata: DocumentMetadata }> {
try {
const { data: { text } } = await Tesseract.recognize(buffer, 'chi_sim+eng');
return {
content: text,
metadata: {
title: 'Image Document',
wordCount: text.split(/\s+/).length,
createdAt: new Date()
}
};
} catch (error) {
throw new Error(`图片 OCR 识别失败: ${error.message}`);
}
}
async parseDocument(buffer: Buffer, mimeType: string): Promise<{ content: string; metadata: DocumentMetadata }> {
switch (mimeType) {
case 'application/pdf':
return this.parsePDF(buffer);
case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
return this.parseWord(buffer);
case 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
return this.parseExcel(buffer);
case 'image/jpeg':
case 'image/png':
case 'image/gif':
return this.parseImage(buffer);
default:
throw new Error(`不支持的文件类型: ${mimeType}`);
}
}
}
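A quick usage sketch of the parser: read a local file and dispatch on its MIME type (the sample path here is hypothetical):
typescript
import { readFile } from 'fs/promises';
import { DocumentParser } from '@/services/document/parser';

async function demo() {
  const parser = new DocumentParser();
  const buffer = await readFile('./samples/report.pdf'); // hypothetical sample file
  const { content, metadata } = await parser.parseDocument(buffer, 'application/pdf');
  console.log(metadata.title, metadata.pages, content.slice(0, 200));
}

demo().catch(console.error);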
Document chunking service
typescript
// src/services/document/chunker.ts
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { Document } from '@langchain/core/documents';
export interface ChunkMetadata {
chunkIndex: number;
source: string;
pageNumber?: number;
section?: string;
}
export class DocumentChunker {
private textSplitter: RecursiveCharacterTextSplitter;
constructor() {
this.textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
separators: ['\n\n', '\n', '。', '!', '?', ';', '.', '!', '?', ';', ' ', '']
});
}
async chunkDocument(content: string, metadata: { source: string } & Record<string, any>): Promise<Document[]> {
const documents = await this.textSplitter.createDocuments(
[content],
[metadata],
{
chunkHeader: `文档: ${metadata.source}\n\n`,
appendChunkOverlapHeader: true
}
);
return documents.map((doc, index) => ({
...doc,
metadata: {
...doc.metadata,
chunkIndex: index,
totalChunks: documents.length
}
}));
}
async chunkByPages(content: string, metadata: { source: string } & Record<string, any>): Promise<Document[]> {
const pages = content.split(/\f/); // split on form-feed page breaks
const documents: Document[] = [];
let chunkIndex = 0;
for (let i = 0; i < pages.length; i++) {
const pageContent = pages[i].trim();
if (!pageContent) continue;
const pageDocs = await this.textSplitter.createDocuments(
[pageContent],
[{ ...metadata, pageNumber: i + 1 }],
{
chunkHeader: `第 ${i + 1} 页\n\n`,
appendChunkOverlapHeader: true
}
);
documents.push(...pageDocs.map(doc => ({
...doc,
metadata: {
...doc.metadata,
chunkIndex: chunkIndex++,
totalChunks: documents.length + pageDocs.length
}
})));
}
return documents;
}
}
RAG retrieval service
typescript
// src/services/rag/retriever.ts
import { Chroma } from '@langchain/community/vectorstores/chroma';
import { OpenAIEmbeddings } from '@langchain/openai';
import { Document } from '@langchain/core/documents';
export interface RetrievalResult {
documents: Document[];
scores: number[];
totalResults: number;
}
export class RAGRetriever {
private vectorStore: Chroma;
private embeddings: OpenAIEmbeddings;
constructor() {
this.embeddings = new OpenAIEmbeddings({
openAIApiKey: process.env.OPENAI_API_KEY,
modelName: 'text-embedding-3-small'
});
this.vectorStore = new Chroma(this.embeddings, {
url: process.env.CHROMA_URL || 'http://localhost:8000',
collectionName: 'documents'
});
}
async addDocuments(documents: Document[]): Promise<void> {
await this.vectorStore.addDocuments(documents);
}
async similaritySearch(query: string, k: number = 5): Promise<RetrievalResult> {
const results = await this.vectorStore.similaritySearchWithScore(query, k);
return {
documents: results.map(([doc]) => doc),
scores: results.map(([, score]) => score),
totalResults: results.length
};
}
async similaritySearchWithFilter(
query: string,
filter: Record<string, any>,
k: number = 5
): Promise<RetrievalResult> {
const results = await this.vectorStore.similaritySearchWithScore(query, k, filter);
return {
documents: results.map(([doc]) => doc),
scores: results.map(([, score]) => score),
totalResults: results.length
};
}
async maxMarginalRelevanceSearch(
query: string,
k: number = 5,
fetchK: number = 20,
lambdaMult: number = 0.5
): Promise<RetrievalResult> {
const results = await this.vectorStore.maxMarginalRelevanceSearch(
query,
{ k, fetchK, lambdaMult }
);
return {
documents: results,
scores: [], // MMR does not return scores
totalResults: results.length
};
}
}
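Indexing and querying with this retriever looks roughly like the following (illustrative only; it assumes a Chroma instance reachable at CHROMA_URL and uses made-up sample content):
typescript
import { Document } from '@langchain/core/documents';
import { RAGRetriever } from '@/services/rag/retriever';

async function example() {
  const retriever = new RAGRetriever();

  // Index a couple of chunks together with their source metadata.
  await retriever.addDocuments([
    new Document({ pageContent: 'Revenue in 2023 was 120 million RMB.', metadata: { source: 'report.pdf', pageNumber: 3 } }),
    new Document({ pageContent: 'The company was founded in 2015.', metadata: { source: 'report.pdf', pageNumber: 1 } }),
  ]);

  // Ask a question and inspect the top matches with their scores.
  const { documents, scores } = await retriever.similaritySearch('When was the company founded?', 2);
  documents.forEach((doc, i) => console.log(scores[i], doc.pageContent));
}

example().catch(console.error);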
LangGraph workflow
typescript
// src/services/workflow/document-processing.ts
import { StateGraph, START, END } from '@langchain/langgraph';
import { DocumentParser } from '../document/parser';
import { DocumentChunker } from '../document/chunker';
import { RAGRetriever } from '../rag/retriever';
export interface DocumentProcessingState {
documentId: string;
buffer: Buffer;
mimeType: string;
status: 'parsing' | 'chunking' | 'embedding' | 'indexing' | 'completed' | 'failed';
content?: string;
metadata?: any;
chunks?: any[];
error?: string;
progress: number;
}
export class DocumentProcessingWorkflow {
private parser: DocumentParser;
private chunker: DocumentChunker;
private retriever: RAGRetriever;
constructor() {
this.parser = new DocumentParser();
this.chunker = new DocumentChunker();
this.retriever = new RAGRetriever();
}
async buildGraph() {
// Each channel keeps the most recent value written by a node (last-write-wins).
const workflow = new StateGraph<DocumentProcessingState>({
channels: {
documentId: { value: (_x, y) => y, default: () => '' },
buffer: { value: (_x, y) => y, default: () => Buffer.alloc(0) },
mimeType: { value: (_x, y) => y, default: () => '' },
status: { value: (_x, y) => y, default: () => 'parsing' },
content: { value: (_x, y) => y, default: () => '' },
metadata: { value: (_x, y) => y, default: () => ({}) },
chunks: { value: (_x, y) => y, default: () => [] },
error: { value: (_x, y) => y, default: () => '' },
progress: { value: (_x, y) => y, default: () => 0 }
}
});
// Register the nodes
workflow.addNode('parse', this.parseDocument.bind(this));
workflow.addNode('chunk', this.chunkDocument.bind(this));
workflow.addNode('embed', this.embedChunks.bind(this));
workflow.addNode('index', this.indexChunks.bind(this));
workflow.addNode('complete', this.completeProcessing.bind(this));
workflow.addNode('fail', this.handleFailure.bind(this));
// Wire up the edges (START marks the graph entry point)
workflow.addEdge(START, 'parse');
workflow.addConditionalEdges('parse', this.checkParseResult.bind(this));
workflow.addConditionalEdges('chunk', this.checkChunkResult.bind(this));
workflow.addConditionalEdges('embed', this.checkEmbedResult.bind(this));
workflow.addEdge('index', 'complete');
workflow.addEdge('complete', END);
workflow.addEdge('fail', END);
return workflow.compile();
}
private async parseDocument(state: DocumentProcessingState): Promise<Partial<DocumentProcessingState>> {
try {
const result = await this.parser.parseDocument(state.buffer, state.mimeType);
return {
status: 'chunking',
content: result.content,
metadata: result.metadata,
progress: 25
};
} catch (error) {
return {
status: 'failed',
error: error.message,
progress: 0
};
}
}
private async chunkDocument(state: DocumentProcessingState): Promise<Partial<DocumentProcessingState>> {
try {
const chunks = await this.chunker.chunkDocument(
state.content!,
{ source: state.documentId }
);
return {
status: 'embedding',
chunks: chunks,
progress: 50
};
} catch (error) {
return {
status: 'failed',
error: error.message,
progress: 25
};
}
}
private async embedChunks(state: DocumentProcessingState): Promise<Partial<DocumentProcessingState>> {
try {
// The embedding call goes here; see the persistence sketch after this class
return {
status: 'indexing',
progress: 75
};
} catch (error) {
return {
status: 'failed',
error: error.message,
progress: 50
};
}
}
private async indexChunks(state: DocumentProcessingState): Promise<Partial<DocumentProcessingState>> {
try {
await this.retriever.addDocuments(state.chunks!);
return {
status: 'completed',
progress: 100
};
} catch (error) {
return {
status: 'failed',
error: error.message,
progress: 75
};
}
}
private async completeProcessing(state: DocumentProcessingState): Promise<Partial<DocumentProcessingState>> {
return {
status: 'completed',
progress: 100
};
}
private async handleFailure(state: DocumentProcessingState): Promise<Partial<DocumentProcessingState>> {
return {
status: 'failed',
progress: 0
};
}
private checkParseResult(state: DocumentProcessingState): string {
return state.status === 'failed' ? 'fail' : 'chunk';
}
private checkChunkResult(state: DocumentProcessingState): string {
return state.status === 'failed' ? 'fail' : 'embed';
}
private checkEmbedResult(state: DocumentProcessingState): string {
return state.status === 'failed' ? 'fail' : 'index';
}
}
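The `embedChunks` node above is left as a stub. One possible implementation, sketched below, embeds all chunk texts in a single batched call and persists them alongside their vectors; it assumes the `prisma` helper from `src/lib/db` and the `DocumentChunk` model defined earlier, and the file path is made up:
typescript
// src/services/workflow/embed-chunks.ts (hypothetical helper)
import { OpenAIEmbeddings } from '@langchain/openai';
import type { Document } from '@langchain/core/documents';
import { prisma } from '@/lib/db';

export async function embedAndPersistChunks(documentId: string, chunks: Document[]) {
  const embeddings = new OpenAIEmbeddings({
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: 'text-embedding-3-small',
  });

  // One batched embedding request for all chunk texts.
  const vectors = await embeddings.embedDocuments(chunks.map((c) => c.pageContent));

  // Store each chunk with its vector so citations can be traced back later.
  await prisma.$transaction(
    chunks.map((chunk, i) =>
      prisma.documentChunk.create({
        data: {
          content: chunk.pageContent,
          metadata: chunk.metadata,
          embedding: vectors[i],
          chunkIndex: i,
          documentId,
        },
      })
    )
  );
}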
🌐 Next.js API Routes
Document upload API
typescript
// src/app/api/documents/upload/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { writeFile, mkdir } from 'fs/promises';
import { join } from 'path';
import { DocumentProcessingWorkflow } from '@/services/workflow/document-processing';
import { prisma } from '@/lib/db';
export async function POST(request: NextRequest) {
try {
const formData = await request.formData();
const file = formData.get('file') as File;
const userId = formData.get('userId') as string;
if (!file || !userId) {
return NextResponse.json(
{ error: '缺少必要参数' },
{ status: 400 }
);
}
// Validate the file type
const allowedTypes = [
'application/pdf',
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'image/jpeg',
'image/png',
'image/gif'
];
if (!allowedTypes.includes(file.type)) {
return NextResponse.json(
{ error: '不支持的文件类型' },
{ status: 400 }
);
}
// Enforce the 10 MB size limit
if (file.size > 10 * 1024 * 1024) {
return NextResponse.json(
{ error: '文件大小超过限制' },
{ status: 400 }
);
}
// Create the document record
const document = await prisma.document.create({
data: {
title: file.name,
filename: file.name,
fileType: file.type,
fileSize: file.size,
filePath: '', // updated after the file is written
status: 'PROCESSING',
userId: userId
}
});
// Ensure the upload directory exists
const uploadDir = join(process.cwd(), 'uploads', userId);
await mkdir(uploadDir, { recursive: true });
// Write the file to disk
const buffer = Buffer.from(await file.arrayBuffer());
const filePath = join(uploadDir, `${document.id}_${file.name}`);
await writeFile(filePath, buffer);
// Update the stored file path
await prisma.document.update({
where: { id: document.id },
data: { filePath }
});
// Build the document processing workflow
const workflow = new DocumentProcessingWorkflow();
const graph = await workflow.buildGraph();
// Process the document asynchronously (fire-and-forget)
graph.invoke({
documentId: document.id,
buffer,
mimeType: file.type,
status: 'parsing',
progress: 0
}).catch(error => {
console.error('文档处理失败:', error);
prisma.document.update({
where: { id: document.id },
data: { status: 'FAILED' }
});
});
return NextResponse.json({
success: true,
document: {
id: document.id,
title: document.title,
status: document.status
}
});
} catch (error) {
console.error('上传失败:', error);
return NextResponse.json(
{ error: '上传失败' },
{ status: 500 }
);
}
}
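Because the upload route processes documents asynchronously, the frontend needs a way to poll processing status and progress. A possible sketch of such an endpoint follows; the route path and response shape are assumptions, not part of the original project layout:
typescript
// src/app/api/documents/[id]/route.ts (hypothetical status endpoint)
import { NextRequest, NextResponse } from 'next/server';
import { prisma } from '@/lib/db';

export async function GET(
  _request: NextRequest,
  { params }: { params: { id: string } }
) {
  // Return only the fields the UI needs to render a progress indicator.
  const document = await prisma.document.findUnique({
    where: { id: params.id },
    select: { id: true, title: true, status: true, updatedAt: true },
  });

  if (!document) {
    return NextResponse.json({ error: 'Document not found' }, { status: 404 });
  }

  return NextResponse.json({ success: true, document });
}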
Q&A API
typescript
// src/app/api/chat/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { RAGRetriever } from '@/services/rag/retriever';
import { prisma } from '@/lib/db';
export async function POST(request: NextRequest) {
try {
const { message, chatId, documentId } = await request.json();
if (!message || !chatId) {
return NextResponse.json(
{ error: '缺少必要参数' },
{ status: 400 }
);
}
// Initialize the model and the retriever
const model = new ChatOpenAI({
openAIApiKey: process.env.OPENAI_API_KEY,
modelName: 'gpt-3.5-turbo',
temperature: 0.7
});
const retriever = new RAGRetriever();
// Retrieve relevant document chunks
const retrievalResult = await retriever.similaritySearch(message, 5);
// Build the context
const context = retrievalResult.documents
.map((doc, index) => `片段 ${index + 1}:\n${doc.pageContent}`)
.join('\n\n');
// Create the prompt template
const promptTemplate = PromptTemplate.fromTemplate(`
你是一个专业的文档问答助手。请基于以下文档内容回答用户问题。
文档内容:
{context}
用户问题:{question}
请提供准确、详细的回答,并在回答中引用相关的文档片段。如果文档中没有相关信息,请明确说明。
回答:
`);
// Compose the chain
const chain = promptTemplate.pipe(model).pipe(new StringOutputParser());
// Generate the answer
const answer = await chain.invoke({
context,
question: message
});
// Persist the messages
const userMessage = await prisma.message.create({
data: {
content: message,
role: 'USER',
chatId: chatId
}
});
const assistantMessage = await prisma.message.create({
data: {
content: answer,
role: 'ASSISTANT',
chatId: chatId,
metadata: {
retrievalResults: retrievalResult.documents.map(doc => ({
content: doc.pageContent,
metadata: doc.metadata
}))
}
}
});
// Create citation records
const citations = await Promise.all(
retrievalResult.documents.map(async (doc, index) => {
return prisma.citation.create({
data: {
content: doc.pageContent,
source: doc.metadata.source || 'Unknown',
pageNumber: doc.metadata.pageNumber,
confidence: 1 - retrievalResult.scores[index], // convert distance to a rough confidence score
messageId: assistantMessage.id,
chunkId: doc.metadata.chunkId
}
});
})
);
return NextResponse.json({
success: true,
message: {
id: assistantMessage.id,
content: answer,
role: 'ASSISTANT',
citations: citations.map(c => ({
id: c.id,
content: c.content,
source: c.source,
pageNumber: c.pageNumber,
confidence: c.confidence
}))
}
});
} catch (error) {
console.error('问答失败:', error);
return NextResponse.json(
{ error: '问答失败' },
{ status: 500 }
);
}
}
Streaming Q&A API
typescript
// src/app/api/chat/stream/route.ts
import { NextRequest } from 'next/server';
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { RAGRetriever } from '@/services/rag/retriever';
export const runtime = 'edge';
export async function POST(request: NextRequest) {
try {
const { message, chatId } = await request.json();
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
// Send the start signal
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: 'start' })}\n\n`)
);
// Retrieve relevant documents
const retriever = new RAGRetriever();
const retrievalResult = await retriever.similaritySearch(message, 5);
// Send the retrieval summary
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({
type: 'retrieval',
data: {
documents: retrievalResult.documents.length,
scores: retrievalResult.scores
}
})}\n\n`)
);
// Build the context
const context = retrievalResult.documents
.map((doc, index) => `片段 ${index + 1}:\n${doc.pageContent}`)
.join('\n\n');
// Initialize the streaming model
const model = new ChatOpenAI({
openAIApiKey: process.env.OPENAI_API_KEY,
modelName: 'gpt-3.5-turbo',
temperature: 0.7,
streaming: true
});
const promptTemplate = PromptTemplate.fromTemplate(`
你是一个专业的文档问答助手。请基于以下文档内容回答用户问题。
文档内容:
{context}
用户问题:{question}
请提供准确、详细的回答,并在回答中引用相关的文档片段。
`);
const chain = promptTemplate.pipe(model);
// Stream the answer
const stream = await chain.stream({
context,
question: message
});
let fullAnswer = '';
for await (const chunk of stream) {
const content = chunk.content;
fullAnswer += content;
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({
type: 'chunk',
data: { content }
})}\n\n`)
);
}
// Send the completion signal
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({
type: 'complete',
data: {
answer: fullAnswer,
citations: retrievalResult.documents.map(doc => ({
content: doc.pageContent,
source: doc.metadata.source,
pageNumber: doc.metadata.pageNumber
}))
}
})}\n\n`)
);
} catch (error) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({
type: 'error',
data: { message: error.message }
})}\n\n`)
);
} finally {
controller.close();
}
}
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
},
});
} catch (error) {
return new Response(
JSON.stringify({ error: '流式问答失败' }),
{ status: 500 }
);
}
}
🎨 Frontend Implementation
Document upload component
tsx
// src/components/document/UploadForm.tsx
'use client';
import { useState, useCallback } from 'react';
import { useDropzone } from 'react-dropzone';
import { Upload, File, AlertCircle, CheckCircle } from 'lucide-react';
interface UploadFormProps {
userId: string;
onUploadSuccess: (document: any) => void;
}
export default function UploadForm({ userId, onUploadSuccess }: UploadFormProps) {
const [uploading, setUploading] = useState(false);
const [uploadStatus, setUploadStatus] = useState<'idle' | 'success' | 'error'>('idle');
const [errorMessage, setErrorMessage] = useState('');
const onDrop = useCallback(async (acceptedFiles: File[]) => {
const file = acceptedFiles[0];
if (!file) return;
setUploading(true);
setUploadStatus('idle');
setErrorMessage('');
try {
const formData = new FormData();
formData.append('file', file);
formData.append('userId', userId);
const response = await fetch('/api/documents/upload', {
method: 'POST',
body: formData,
});
const result = await response.json();
if (result.success) {
setUploadStatus('success');
onUploadSuccess(result.document);
} else {
setUploadStatus('error');
setErrorMessage(result.error || '上传失败');
}
} catch (error) {
setUploadStatus('error');
setErrorMessage('网络错误,请重试');
} finally {
setUploading(false);
}
}, [userId, onUploadSuccess]);
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop,
accept: {
'application/pdf': ['.pdf'],
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
'image/jpeg': ['.jpg', '.jpeg'],
'image/png': ['.png'],
'image/gif': ['.gif'],
},
maxFiles: 1,
maxSize: 10 * 1024 * 1024, // 10MB
});
return (
<div className="w-full max-w-2xl mx-auto">
<div
{...getRootProps()}
className={`
border-2 border-dashed rounded-lg p-8 text-center cursor-pointer transition-colors
${isDragActive ? 'border-blue-500 bg-blue-50' : 'border-gray-300'}
${uploading ? 'opacity-50 cursor-not-allowed' : ''}
`}
>
<input {...getInputProps()} disabled={uploading} />
<div className="flex flex-col items-center space-y-4">
{uploading ? (
<div className="animate-spin rounded-full h-12 w-12 border-b-2 border-blue-500"></div>
) : (
<Upload className="h-12 w-12 text-gray-400" />
)}
<div>
<p className="text-lg font-medium text-gray-900">
{uploading ? '正在上传...' : isDragActive ? '释放文件以上传' : '拖拽文件到这里或点击选择'}
</p>
<p className="text-sm text-gray-500 mt-2">
支持 PDF、Word、Excel、图片格式,最大 10MB
</p>
</div>
{uploadStatus === 'success' && (
<div className="flex items-center space-x-2 text-green-600">
<CheckCircle className="h-5 w-5" />
<span>上传成功!</span>
</div>
)}
{uploadStatus === 'error' && (
<div className="flex items-center space-x-2 text-red-600">
<AlertCircle className="h-5 w-5" />
<span>{errorMessage}</span>
</div>
)}
</div>
</div>
</div>
);
}
Q&A chat component
tsx
// src/components/chat/ChatInterface.tsx
'use client';
import { useState, useRef, useEffect } from 'react';
import { Send, Bot, User, Loader2 } from 'lucide-react';
interface Message {
id: string;
content: string;
role: 'USER' | 'ASSISTANT';
citations?: Citation[];
timestamp: Date;
}
interface Citation {
id: string;
content: string;
source: string;
pageNumber?: number;
confidence: number;
}
interface ChatInterfaceProps {
chatId: string;
documentId?: string;
}
export default function ChatInterface({ chatId, documentId }: ChatInterfaceProps) {
const [messages, setMessages] = useState<Message[]>([]);
const [input, setInput] = useState('');
const [loading, setLoading] = useState(false);
const [streaming, setStreaming] = useState(false);
const messagesEndRef = useRef<HTMLDivElement>(null);
const scrollToBottom = () => {
messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
};
useEffect(() => {
scrollToBottom();
}, [messages]);
const handleSubmit = async (e: React.FormEvent) => {
e.preventDefault();
if (!input.trim() || loading) return;
const userMessage: Message = {
id: Date.now().toString(),
content: input,
role: 'USER',
timestamp: new Date()
};
setMessages(prev => [...prev, userMessage]);
setInput('');
setLoading(true);
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: input,
chatId,
documentId
}),
});
const result = await response.json();
if (result.success) {
const assistantMessage: Message = {
id: result.message.id,
content: result.message.content,
role: 'ASSISTANT',
citations: result.message.citations,
timestamp: new Date()
};
setMessages(prev => [...prev, assistantMessage]);
} else {
throw new Error(result.error);
}
} catch (error) {
console.error('发送消息失败:', error);
// surface an error toast here if desired
} finally {
setLoading(false);
}
};
const handleStreamSubmit = async (e: React.FormEvent) => {
e.preventDefault();
if (!input.trim() || streaming) return;
const userMessage: Message = {
id: Date.now().toString(),
content: input,
role: 'USER',
timestamp: new Date()
};
setMessages(prev => [...prev, userMessage]);
setInput('');
setStreaming(true);
try {
const response = await fetch('/api/chat/stream', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: input,
chatId
}),
});
if (!response.body) {
throw new Error('No response body');
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
let assistantMessage: Message = {
id: Date.now().toString(),
content: '',
role: 'ASSISTANT',
timestamp: new Date()
};
setMessages(prev => [...prev, assistantMessage]);
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value);
const lines = chunk.split('\n');
for (const line of lines) {
if (line.startsWith('data: ')) {
try {
const data = JSON.parse(line.slice(6));
if (data.type === 'chunk') {
assistantMessage.content += data.data.content;
setMessages(prev =>
prev.map(msg =>
msg.id === assistantMessage.id
? { ...msg, content: assistantMessage.content }
: msg
)
);
} else if (data.type === 'complete') {
assistantMessage.content = data.data.answer;
assistantMessage.citations = data.data.citations;
setMessages(prev =>
prev.map(msg =>
msg.id === assistantMessage.id
? { ...msg, content: assistantMessage.content, citations: assistantMessage.citations }
: msg
)
);
}
} catch (e) {
console.error('解析流数据失败:', e);
}
}
}
}
} catch (error) {
console.error('流式对话失败:', error);
} finally {
setStreaming(false);
}
};
return (
<div className="flex flex-col h-full max-w-4xl mx-auto">
{/* Message list */}
<div className="flex-1 overflow-y-auto p-4 space-y-4">
{messages.map((message) => (
<div
key={message.id}
className={`flex ${message.role === 'USER' ? 'justify-end' : 'justify-start'}`}
>
<div
className={`flex items-start space-x-2 max-w-3xl ${
message.role === 'USER' ? 'flex-row-reverse space-x-reverse' : ''
}`}
>
<div
className={`flex-shrink-0 w-8 h-8 rounded-full flex items-center justify-center ${
message.role === 'USER'
? 'bg-blue-500 text-white'
: 'bg-gray-200 text-gray-600'
}`}
>
{message.role === 'USER' ? <User className="w-4 h-4" /> : <Bot className="w-4 h-4" />}
</div>
<div
className={`px-4 py-2 rounded-lg ${
message.role === 'USER'
? 'bg-blue-500 text-white'
: 'bg-gray-100 text-gray-900'
}`}
>
<div className="whitespace-pre-wrap">{message.content}</div>
{/* Citations */}
{message.citations && message.citations.length > 0 && (
<div className="mt-3 pt-3 border-t border-gray-200">
<div className="text-sm font-medium text-gray-600 mb-2">参考资料:</div>
<div className="space-y-2">
{message.citations.map((citation, index) => (
<div key={citation.id} className="text-sm bg-gray-50 p-2 rounded">
<div className="font-medium">{citation.source}</div>
{citation.pageNumber && (
<div className="text-gray-500">第 {citation.pageNumber} 页</div>
)}
<div className="text-gray-700 mt-1">{citation.content}</div>
</div>
))}
</div>
</div>
)}
</div>
</div>
</div>
))}
{(loading || streaming) && (
<div className="flex justify-start">
<div className="flex items-start space-x-2">
<div className="flex-shrink-0 w-8 h-8 rounded-full bg-gray-200 text-gray-600 flex items-center justify-center">
<Bot className="w-4 h-4" />
</div>
<div className="px-4 py-2 rounded-lg bg-gray-100">
<div className="flex items-center space-x-2">
<Loader2 className="w-4 h-4 animate-spin" />
<span>正在思考...</span>
</div>
</div>
</div>
</div>
)}
<div ref={messagesEndRef} />
</div>
{/* Input */}
<form onSubmit={streaming ? handleStreamSubmit : handleSubmit} className="p-4 border-t">
<div className="flex space-x-2">
<input
type="text"
value={input}
onChange={(e) => setInput(e.target.value)}
placeholder="输入您的问题..."
className="flex-1 px-4 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500"
disabled={loading || streaming}
/>
<button
type="submit"
disabled={loading || streaming || !input.trim()}
className="px-4 py-2 bg-blue-500 text-white rounded-lg hover:bg-blue-600 disabled:opacity-50 disabled:cursor-not-allowed flex items-center space-x-2"
>
<Send className="w-4 h-4" />
<span>发送</span>
</button>
</div>
</form>
</div>
);
}
🐳 Docker Deployment
Docker Compose configuration
yaml
# docker-compose.yml
version: '3.8'
services:
app:
build: .
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://postgres:password@postgres:5432/documents
- OPENAI_API_KEY=${OPENAI_API_KEY}
- CHROMA_URL=http://chroma:8000
- REDIS_URL=redis://redis:6379
depends_on:
- postgres
- chroma
- redis
volumes:
- ./uploads:/app/uploads
postgres:
image: postgres:15
environment:
- POSTGRES_DB=documents
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=password
volumes:
- postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
chroma:
image: chromadb/chroma:latest
ports:
- "8000:8000"
volumes:
- chroma_data:/chroma/chroma
environment:
- CHROMA_SERVER_HOST=0.0.0.0
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
nginx:
image: nginx:alpine
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
depends_on:
- app
volumes:
postgres_data:
chroma_data:
redis_data:
Dockerfile
dockerfile
# Dockerfile
FROM node:18-alpine AS base
# Install dependencies only when needed
FROM base AS deps
RUN apk add --no-cache libc6-compat
WORKDIR /app
# Install dependencies based on the preferred package manager
COPY package.json package-lock.json* ./
RUN npm ci
# Rebuild the source code only when needed
FROM base AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
# Generate Prisma client
RUN npx prisma generate
# Build the application
RUN npm run build
# Production image, copy all the files and run next
FROM base AS runner
WORKDIR /app
ENV NODE_ENV production
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
COPY --from=builder /app/public ./public
# Set the correct permission for prerender cache
RUN mkdir .next
RUN chown nextjs:nodejs .next
# Automatically leverage output traces to reduce image size
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./package.json
COPY --from=builder /app/prisma ./prisma
USER nextjs
EXPOSE 3000
ENV PORT 3000
CMD ["node", "server.js"]
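The runner stage copies `.next/standalone`, which Next.js only emits when standalone output is enabled. If the project doesn't already configure it, something like the following is needed in the project root:
javascript
// next.config.mjs
/** @type {import('next').NextConfig} */
const nextConfig = {
  // Produce the self-contained server bundle that the Dockerfile copies.
  output: 'standalone',
};

export default nextConfig;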
📊 Performance Optimization & Monitoring
Caching strategy
typescript
// src/lib/cache.ts
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
export class CacheService {
async get<T>(key: string): Promise<T | null> {
try {
const value = await redis.get(key);
return value ? JSON.parse(value) : null;
} catch (error) {
console.error('缓存读取失败:', error);
return null;
}
}
async set(key: string, value: any, ttl: number = 3600): Promise<void> {
try {
await redis.setex(key, ttl, JSON.stringify(value));
} catch (error) {
console.error('缓存写入失败:', error);
}
}
async del(key: string): Promise<void> {
try {
await redis.del(key);
} catch (error) {
console.error('缓存删除失败:', error);
}
}
// Retrieval result cache
async cacheRetrievalResult(query: string, result: any): Promise<void> {
const key = `retrieval:${Buffer.from(query).toString('base64')}`;
await this.set(key, result, 1800); // 30 minutes
}
async getCachedRetrievalResult(query: string): Promise<any> {
const key = `retrieval:${Buffer.from(query).toString('base64')}`;
return await this.get(key);
}
}
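Wiring the cache in front of the retriever might look like this; the helper function is illustrative and not part of the original services:
typescript
import { CacheService } from '@/lib/cache';
import { RAGRetriever } from '@/services/rag/retriever';

const cache = new CacheService();
const retriever = new RAGRetriever();

// Check the cache first; fall back to the vector store and cache the result.
export async function cachedSimilaritySearch(query: string, k = 5) {
  const cached = await cache.getCachedRetrievalResult(query);
  if (cached) return cached;

  const result = await retriever.similaritySearch(query, k);
  await cache.cacheRetrievalResult(query, result);
  return result;
}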
Performance monitoring
typescript
// src/lib/monitoring.ts
export class PerformanceMonitor {
private static instance: PerformanceMonitor;
private metrics: Map<string, number[]> = new Map();
static getInstance(): PerformanceMonitor {
if (!PerformanceMonitor.instance) {
PerformanceMonitor.instance = new PerformanceMonitor();
}
return PerformanceMonitor.instance;
}
startTimer(name: string): () => void {
const start = Date.now();
return () => {
const duration = Date.now() - start;
this.recordMetric(name, duration);
};
}
recordMetric(name: string, value: number): void {
if (!this.metrics.has(name)) {
this.metrics.set(name, []);
}
this.metrics.get(name)!.push(value);
}
getMetrics(name: string): { avg: number; p95: number; count: number } {
const values = this.metrics.get(name) || [];
if (values.length === 0) {
return { avg: 0, p95: 0, count: 0 };
}
const sorted = [...values].sort((a, b) => a - b); // copy before sorting so the recorded order is preserved
const avg = values.reduce((sum, val) => sum + val, 0) / values.length;
const p95Index = Math.floor(sorted.length * 0.95);
const p95 = sorted[p95Index];
return { avg, p95, count: values.length };
}
getAllMetrics(): Record<string, any> {
const result: Record<string, any> = {};
for (const [name] of this.metrics) {
result[name] = this.getMetrics(name);
}
return result;
}
}
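A usage sketch: wrap a retrieval call with the timer and expose the aggregated metrics (the function names here are illustrative):
typescript
import { PerformanceMonitor } from '@/lib/monitoring';
import { RAGRetriever } from '@/services/rag/retriever';

const monitor = PerformanceMonitor.getInstance();

export async function timedSearch(retriever: RAGRetriever, query: string) {
  const stop = monitor.startTimer('rag.similaritySearch');
  try {
    return await retriever.similaritySearch(query);
  } finally {
    stop(); // record the duration even if the search throws
  }
}

// Aggregated stats could then be returned from a metrics endpoint.
export function metricsSnapshot() {
  return monitor.getAllMetrics();
}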
🧪 Testing Strategy
Unit test example
typescript
// src/services/document/__tests__/parser.test.ts
import { DocumentParser } from '../parser';
import fs from 'fs';
import path from 'path';
describe('DocumentParser', () => {
let parser: DocumentParser;
beforeEach(() => {
parser = new DocumentParser();
});
describe('parsePDF', () => {
it('should parse PDF file correctly', async () => {
const pdfPath = path.join(__dirname, '../fixtures/sample.pdf');
const buffer = fs.readFileSync(pdfPath);
const result = await parser.parsePDF(buffer);
expect(result.content).toBeDefined();
expect(result.content.length).toBeGreaterThan(0);
expect(result.metadata.title).toBeDefined();
expect(result.metadata.pages).toBeGreaterThan(0);
});
});
describe('parseWord', () => {
it('should parse Word document correctly', async () => {
const docxPath = path.join(__dirname, '../fixtures/sample.docx');
const buffer = fs.readFileSync(docxPath);
const result = await parser.parseWord(buffer);
expect(result.content).toBeDefined();
expect(result.content.length).toBeGreaterThan(0);
expect(result.metadata.wordCount).toBeGreaterThan(0);
});
});
});
Integration test example
typescript
// src/app/api/__tests__/chat.test.ts
import { POST } from '../chat/route';
import { NextRequest } from 'next/server';
describe('/api/chat', () => {
it('should handle chat request correctly', async () => {
const request = new NextRequest('http://localhost:3000/api/chat', {
method: 'POST',
body: JSON.stringify({
message: 'What is the main topic of the document?',
chatId: 'test-chat-id'
}),
headers: {
'Content-Type': 'application/json'
}
});
const response = await POST(request);
const data = await response.json();
expect(response.status).toBe(200);
expect(data.success).toBe(true);
expect(data.message.content).toBeDefined();
expect(data.message.role).toBe('ASSISTANT');
});
});
📚 Chapter Summary
In this chapter we completed:
✅ System architecture design
- Designed a complete enterprise-grade document processing architecture
- Implemented multimodal document parsing and intelligent Q&A
- Integrated LangGraph workflows for automated processing
✅ Technical implementation
- Built a RAG retrieval system with LangChain.js
- Implemented parsing for multiple document formats
- Built a Next.js full-stack application with a modern UI
✅ Engineering practices
- Configured containerized deployment with Docker
- Implemented performance optimization and caching strategies
- Established a testing strategy with unit and integration tests
✅ Production readiness
- Implemented error handling and retry mechanisms
- Added performance monitoring and logging
- Provided a complete deployment and operations setup
🎯 Next Chapter Preview
In the next chapter, "Capstone Project 2: An AI-Powered Code Assistant", we will:
- Build an intelligent code analysis and generation system
- Implement code review, refactoring suggestions, and automated fixes
- Support multiple programming languages and frameworks
- Develop code quality assessment and optimization features
Thanks for reading! Follow me on WeChat: 《鲫小鱼不正经》. Likes, bookmarks, and follows are all appreciated!