Spring AI 2.0 RAG 完整实现:Document ETL、Vector Store 与检索增强
前言:构建生产级文档问答系统
检索增强生成(RAG)是当前企业 AI 应用最热门的场景之一。Spring AI 2.0 提供了一套完整的 RAG 实现框架,从文档摄取(ETL)到向量存储,再到检索增强,全链路覆盖。
作为架构师,我在多个大型企业项目中实施过 RAG 系统,深知生产环境的挑战:文档格式多样、处理流程复杂、检索精度要求高、性能优化困难等。Spring AI 2.0 的 RAG 框架正好解决了这些问题。
本文将带你从零开始构建一个生产级的文档问答系统,涵盖 Document ETL、Vector Store 和 QuestionAnswerAdvisor 的完整实现。
一、Document Ingestion ETL 框架
1.1 ETL 流程概述
Spring AI 2.0 的 ETL 框架采用标准化的三阶段处理:
[DocumentReader] → [DocumentTransformer] → [DocumentWriter]
     文档读取              文档转换             文档写入
核心接口定义:
java
// DocumentReader:文档读取接口
public interface DocumentReader {
List<Document> read();
}
// DocumentTransformer:文档转换接口
public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
}
// DocumentWriter:文档写入接口
public interface DocumentWriter extends Consumer<List<Document>> {
}
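为了直观理解三个接口如何串联,下面给出一个自包含的最小示意(其中 Document 等类型为简化版,并非 Spring AI 的实际定义,仅用于演示数据流):

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class EtlPipelineDemo {
    // 简化版 Document,仅保留内容字段
    record Document(String content) {}

    interface DocumentReader { List<Document> read(); }
    interface DocumentTransformer extends Function<List<Document>, List<Document>> {}
    interface DocumentWriter extends Consumer<List<Document>> {}

    // 读取 → 转换 → 写入,与上文三阶段一一对应
    public static List<Document> run(DocumentReader reader,
                                     DocumentTransformer transformer,
                                     DocumentWriter writer) {
        List<Document> docs = transformer.apply(reader.read());
        writer.accept(docs);
        return docs;
    }

    public static void main(String[] args) {
        DocumentReader reader = () -> List.of(new Document("  Hello RAG  "));
        DocumentTransformer trim = docs -> docs.stream()
                .map(d -> new Document(d.content().trim())).toList();
        DocumentWriter writer = docs -> System.out.println("写入 " + docs.size() + " 个文档");
        run(reader, trim, writer);
    }
}
```

真实的 Spring AI 流水线只是把这里的简化类型换成框架提供的实现(如 TikaDocumentReader、TokenTextSplitter、VectorStore)。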
1.2 DocumentReader 实现详解
PDF 文档读取
Spring AI 2.0 提供了三种 PDF 读取方式,满足不同场景需求。
PagePdfDocumentReader:按页分割
java
@Service
public class PdfDocumentService {
public List<Document> loadPdfByPage(String filePath) {
PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(
filePath,
PdfDocumentReaderConfig.builder()
.withPageTopMargin(0)
.withPageExtractedTextFormatter(
ExtractedTextFormatter.builder()
.withNumberOfTopTextLinesToDelete(0)
.build()
)
.withPagesPerDocument(1) // 每页一个 Document
.build()
);
return pdfReader.read();
}
}
ParagraphPdfDocumentReader:按段落分割
java
@Service
public class PdfDocumentService {
public List<Document> loadPdfByParagraph(String filePath) {
ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
filePath,
PdfDocumentReaderConfig.builder()
.withPageTopMargin(0)
.withPageExtractedTextFormatter(
ExtractedTextFormatter.builder()
.withNumberOfTopTextLinesToDelete(0)
.build()
)
// 无需按页切分:ParagraphPdfDocumentReader 依赖 PDF 目录(TOC)按段落切分
.build()
);
return pdfReader.read();
}
}
使用场景对比:
| Reader | 适用场景 | 优点 | 缺点 |
|---|---|---|---|
| PagePdfDocumentReader | 需要精确定位页码 | 保留页面结构 | 可能割裂逻辑段落 |
| ParagraphPdfDocumentReader | 需要段落级检索 | 逻辑完整 | 依赖 PDF 目录结构 |
Apache Tika 通用文档读取
TikaDocumentReader 支持多种格式:PDF、DOC/DOCX、PPT/PPTX、HTML 等。
java
@Service
public class TikaDocumentService {
// 读取 Word 文档
public List<Document> loadWordDocument(Resource resource) {
TikaDocumentReader reader = new TikaDocumentReader(resource);
return reader.read();
}
// 读取 PowerPoint
public List<Document> loadPowerPoint(Resource resource) {
TikaDocumentReader reader = new TikaDocumentReader(resource);
return reader.read();
}
// 读取 HTML
public List<Document> loadHtml(Resource resource) {
TikaDocumentReader reader = new TikaDocumentReader(resource);
return reader.read();
}
// 批量读取多种格式
public List<Document> loadMultipleDocuments(List<Resource> resources) {
List<Document> allDocuments = new ArrayList<>();
for (Resource resource : resources) {
TikaDocumentReader reader = new TikaDocumentReader(resource);
allDocuments.addAll(reader.read());
}
return allDocuments;
}
}
支持的格式列表:
- 文档类:PDF, DOC, DOCX, RTF, ODT
- 演示类:PPT, PPTX, ODP
- 网页类:HTML, XHTML
- 其他:TXT, XML, JSON
自定义 DocumentReader
java
public class CustomDocumentReader implements DocumentReader {
private final Resource resource;
public CustomDocumentReader(Resource resource) {
this.resource = resource;
}
@Override
public List<Document> read() {
try (InputStream is = resource.getInputStream()) {
// 自定义解析逻辑
String content = parseContent(is);
Map<String, Object> metadata = extractMetadata(resource);
Document document = new Document(content, metadata);
return List.of(document);
} catch (IOException e) {
throw new DocumentReaderException("Failed to read document", e);
}
}
private String parseContent(InputStream is) {
// 实现自定义解析逻辑
return "";
}
private Map<String, Object> extractMetadata(Resource resource) {
Map<String, Object> metadata = new HashMap<>();
metadata.put("filename", resource.getFilename());
metadata.put("created_at", Instant.now());
return metadata;
}
}
1.3 DocumentTransformer 实现详解
TextSplitter:文本分割
文本分割是 RAG 的关键步骤,直接影响检索精度。
TokenTextSplitter:基于 Token 分割
java
@Service
public class DocumentTransformerService {
private final TextSplitter textSplitter;
public DocumentTransformerService() {
// 创建 Token 分割器
// 构造参数依次为:defaultChunkSize、minChunkSizeChars、
// minChunkLengthToEmbed、maxNumChunks、keepSeparator
this.textSplitter = new TokenTextSplitter(
1000, // defaultChunkSize:每个 chunk 的目标 token 数
200, // minChunkSizeChars:chunk 的最小字符数
5, // minChunkLengthToEmbed:低于该长度的 chunk 不做嵌入
10000, // maxNumChunks:最多生成的 chunk 数
true // keepSeparator:是否保留换行等分隔符
);
}
public List<Document> splitDocuments(List<Document> documents) {
return textSplitter.apply(documents);
}
}
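TokenTextSplitter 本身没有显式的 overlap 参数;如果需要相邻 chunk 重叠(共享边界内容,减少句子被窗口切断的影响),可以自己实现一个字符级滑动窗口。以下是与 Spring AI API 无关的示意:

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSplitter {
    /**
     * 按固定窗口大小切分文本,相邻窗口重叠 overlap 个字符,
     * 使被边界切断的内容在下一个 chunk 中完整出现。
     */
    public static List<String> split(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap 必须小于 chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // 每次前进的步长
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // 到达文本末尾
        }
        return chunks;
    }
}
```

例如 chunkSize=4、overlap=2 时,"abcdefghij" 会被切为 "abcd"、"cdef"、"efgh"、"ghij",相邻 chunk 共享 2 个字符。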
自定义分割策略
java
@Service
public class CustomSplitterService {
// 按句子分割
public List<Document> splitBySentences(Document document) {
String content = document.getContent();
String[] sentences = content.split("[.!?]+\\s+");
List<Document> chunks = new ArrayList<>();
for (int i = 0; i < sentences.length; i++) {
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("chunk_index", i);
metadata.put("total_chunks", sentences.length);
chunks.add(new Document(sentences[i].trim(), metadata));
}
return chunks;
}
// 按段落分割
public List<Document> splitByParagraphs(Document document) {
String content = document.getContent();
String[] paragraphs = content.split("\\n\\n+");
List<Document> chunks = new ArrayList<>();
for (int i = 0; i < paragraphs.length; i++) {
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("chunk_index", i);
metadata.put("total_chunks", paragraphs.length);
chunks.add(new Document(paragraphs[i].trim(), metadata));
}
return chunks;
}
// 智能分割(保持语义完整)
public List<Document> smartSplit(Document document) {
String content = document.getContent();
List<Document> chunks = new ArrayList<>();
// 使用正则匹配句子边界
Pattern pattern = Pattern.compile("[.!?]+\\s+");
Matcher matcher = pattern.matcher(content);
int start = 0;
int chunkIndex = 0;
StringBuilder currentChunk = new StringBuilder();
while (matcher.find()) {
int end = matcher.end();
String sentence = content.substring(start, end);
currentChunk.append(sentence);
// 如果 chunk 达到合适大小,保存它
if (currentChunk.length() >= 500) {
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("chunk_index", chunkIndex++);
chunks.add(new Document(currentChunk.toString(), metadata));
currentChunk = new StringBuilder();
}
start = end;
}
// 添加最后一个 chunk
if (currentChunk.length() > 0) {
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("chunk_index", chunkIndex);
chunks.add(new Document(currentChunk.toString(), metadata));
}
return chunks;
}
}
ContentFormatTransformer:内容格式化
java
@Service
public class ContentFormatTransformerService {
// 统一换行符
public Document normalizeLineBreaks(Document document) {
String content = document.getContent()
.replaceAll("\\r\\n", "\n") // Windows 换行符
.replaceAll("\\r", "\n"); // 旧 Mac 换行符
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("normalized", true);
return new Document(content, metadata);
}
// 移除多余空白
public Document removeExtraWhitespace(Document document) {
String content = document.getContent()
.replaceAll("\\s+", " ") // 多个空白替换为一个空格
.trim();
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("whitespace_removed", true);
return new Document(content, metadata);
}
// 提取并保留代码块
public Document preserveCodeBlocks(Document document) {
String content = document.getContent();
// 保护代码块
Map<String, String> codeBlocks = new HashMap<>();
Pattern codePattern = Pattern.compile("```(.*?)```", Pattern.DOTALL);
Matcher codeMatcher = codePattern.matcher(content);
int index = 0;
StringBuilder processedContent = new StringBuilder();
int lastEnd = 0;
while (codeMatcher.find()) {
String code = codeMatcher.group();
String placeholder = String.format("__CODE_BLOCK_%d__", index++);
codeBlocks.put(placeholder, code);
processedContent.append(content.substring(lastEnd, codeMatcher.start()));
processedContent.append(placeholder);
lastEnd = codeMatcher.end();
}
processedContent.append(content.substring(lastEnd));
// 处理非代码部分
String formattedContent = processedContent.toString()
.replaceAll("\\s+", " ");
// 恢复代码块
for (Map.Entry<String, String> entry : codeBlocks.entrySet()) {
formattedContent = formattedContent.replace(
entry.getKey(),
entry.getValue()
);
}
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("code_preserved", true);
return new Document(formattedContent, metadata);
}
}
KeywordMetadataEnricher:关键词提取
java
@Service
public class KeywordEnricherService {
// 提取关键词(简单实现)
public List<Document> enrichWithKeywords(List<Document> documents) {
List<Document> enriched = new ArrayList<>();
for (Document document : documents) {
String content = document.getContent();
Set<String> keywords = extractKeywords(content);
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("keywords", keywords);
enriched.add(new Document(content, metadata));
}
return enriched;
}
private Set<String> extractKeywords(String content) {
// 简单的关键词提取:提取出现频率高的词
Map<String, Integer> wordFrequency = new HashMap<>();
// 分词(简单实现)
String[] words = content.toLowerCase()
.split("[\\s.,!?;:\"'()\\[\\]{}]+");
// 停用词列表
Set<String> stopWords = Set.of(
"the", "is", "at", "which", "on", "and", "a", "an",
"of", "in", "to", "for", "with", "by", "as"
);
// 统计词频
for (String word : words) {
if (word.length() > 3 && !stopWords.contains(word)) {
wordFrequency.merge(word, 1, Integer::sum);
}
}
// 返回前 10 个高频词
return wordFrequency.entrySet().stream()
.sorted((e1, e2) -> e2.getValue().compareTo(e1.getValue()))
.limit(10)
.map(Map.Entry::getKey)
.collect(Collectors.toSet());
}
}
SummaryMetadataEnricher:摘要生成
java
@Service
public class SummaryEnricherService {
private final ChatClient chatClient;
public SummaryEnricherService(ChatClient chatClient) {
this.chatClient = chatClient;
}
// 批量生成摘要
public List<Document> enrichWithSummary(List<Document> documents) {
List<Document> enriched = new ArrayList<>();
for (Document document : documents) {
String summary = generateSummary(document.getContent());
Map<String, Object> metadata = new HashMap<>(document.getMetadata());
metadata.put("summary", summary);
enriched.add(new Document(document.getContent(), metadata));
}
return enriched;
}
private String generateSummary(String content) {
// 限制内容长度
String truncatedContent = content.length() > 2000
? content.substring(0, 2000)
: content;
return chatClient.prompt()
.user("""
请为以下内容生成一个简洁的摘要(不超过 100 字):
%s
""".formatted(truncatedContent))
.call()
.content();
}
}
1.4 完整的 ETL 流程
java
@Slf4j
@Service
public class DocumentIngestionService {
private final List<DocumentReader> readers;
private final List<DocumentTransformer> transformers;
private final DocumentWriter writer;
public DocumentIngestionService(
List<DocumentReader> readers,
List<DocumentTransformer> transformers,
DocumentWriter writer
) {
this.readers = readers;
this.transformers = transformers;
this.writer = writer;
}
// 执行完整 ETL 流程
public void ingestDocuments(List<Resource> resources) {
log.info("开始文档摄取流程,文档数量:{}", resources.size());
// 1. 读取阶段
List<Document> documents = new ArrayList<>();
for (Resource resource : resources) {
List<Document> docs = readDocument(resource);
documents.addAll(docs);
log.info("读取文档:{},共 {} 个文档", resource.getFilename(), docs.size());
}
// 2. 转换阶段
for (DocumentTransformer transformer : transformers) {
documents = transformer.apply(documents);
log.info("应用转换器:{},当前文档数:{}",
transformer.getClass().getSimpleName(), documents.size());
}
// 3. 写入阶段
writer.accept(documents);
log.info("文档摄取完成,共写入 {} 个文档", documents.size());
}
private List<Document> readDocument(Resource resource) {
for (DocumentReader reader : readers) {
try {
List<Document> docs = reader.read();
if (!docs.isEmpty()) {
return docs;
}
} catch (Exception e) {
log.debug("读取器 {} 无法处理文档 {}",
reader.getClass().getSimpleName(), resource.getFilename());
}
}
throw new DocumentReaderException("无法读取文档:" + resource.getFilename());
}
}
二、Vector Store 可移植 API 与 SQL-like 元数据过滤
2.1 Vector Store 接口
java
// 简化示意,实际接口还包含若干重载与默认方法
public interface VectorStore extends DocumentWriter {
// 添加文档
void add(List<Document> documents);
// 相似度搜索
List<Document> similaritySearch(SearchRequest request);
// 按 ID 删除文档
void delete(List<String> idList);
}
2.2 Vector Store 配置
Redis Vector Store
java
@Configuration
public class RedisVectorConfig { // 注意不要与 Spring AI 的 RedisVectorStoreConfig 同名
@Bean
public RedisVectorStore redisVectorStore(
RedisConnectionFactory connectionFactory,
EmbeddingModel embeddingModel
) {
RedisVectorStoreConfig config = RedisVectorStoreConfig.builder()
.withIndexName("ai-documents")
.withPrefix("doc:")
.withInitializeSchema(true)
.build();
return new RedisVectorStore(connectionFactory, embeddingModel, config);
}
}
配置文件:
yaml
spring:
data:
redis:
host: localhost
port: 6379
password:
database: 0
ai:
vectorstore:
redis:
index: ai-documents
prefix: doc:
initialize-schema: true
PostgreSQL PGVector
java
@Configuration
public class PgVectorConfig { // 注意不要与 Spring AI 的 PgVectorStoreConfig 同名
@Bean
public PgVectorStore pgVectorStore(
JdbcTemplate jdbcTemplate,
EmbeddingModel embeddingModel
) {
PgVectorStoreConfig config = PgVectorStoreConfig.builder()
.withTableName("document_embeddings")
.withVectorDimension(1536)
.withInitializeSchema(true)
.build();
return new PgVectorStore(jdbcTemplate, embeddingModel, config);
}
}
2.3 元数据过滤
Spring AI 2.0 提供了 SQL-like 的元数据过滤 API。
java
@Service
@RequiredArgsConstructor
public class VectorSearchService {
private final VectorStore vectorStore;
// 基础相似度搜索
public List<Document> search(String query) {
SearchRequest request = SearchRequest.query(query)
.withTopK(5)
.withSimilarityThreshold(0.7);
return vectorStore.similaritySearch(request);
}
// 带元数据过滤的搜索
public List<Document> searchWithFilter(
String query,
String category,
String year
) {
SearchRequest request = SearchRequest.query(query)
.withTopK(10)
.withSimilarityThreshold(0.6)
.withFilterExpression("category == '%s' && year == '%s'"
.formatted(category, year));
return vectorStore.similaritySearch(request);
}
// 使用 FilterExpressionBuilder
public List<Document> searchWithBuilder(
String query,
Map<String, Object> filters
) {
FilterExpressionBuilder b = new FilterExpressionBuilder();
// 构建复杂过滤条件:category 相等,且 year 落在 [minYear, maxYear] 区间
Filter.Expression expression = b.and(
b.eq("category", filters.get("category")),
b.and(
b.gte("year", filters.get("minYear")),
b.lte("year", filters.get("maxYear"))
)
).build();
SearchRequest request = SearchRequest.query(query)
.withTopK(10)
.withFilterExpression(expression);
return vectorStore.similaritySearch(request);
}
}
过滤表达式语法:
java
// 相等
"category == 'tech'"
// 不等
"category != 'tech'"
// 大于
"year > 2020"
// 大于等于
"year >= 2020"
// 小于
"year < 2025"
// 小于等于
"year <= 2025"
// 与
"category == 'tech' && year >= 2020"
// 或
"category == 'tech' || category == 'ai'"
// 非空
"author != null"
// 包含
"tags in ['java', 'spring']"
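这些表达式最终会被 Spring AI 翻译成底层向量库的原生过滤条件(如 Redis 的查询过滤、PGVector 的 WHERE 子句)。为说明匹配语义,下面用纯 Java 谓词示意 `category == 'tech' && year >= 2020` 的含义(仅为演示,实际过滤在向量库内部完成):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class FilterSemanticsDemo {
    // 把一条文档的元数据示意为 Map,谓词即过滤表达式的匹配语义
    public static Predicate<Map<String, Object>> techSince2020() {
        return meta -> "tech".equals(meta.get("category"))
                && meta.get("year") instanceof Integer year
                && year >= 2020;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = List.of(
                Map.of("category", "tech", "year", 2023),
                Map.of("category", "tech", "year", 2019),
                Map.of("category", "life", "year", 2024)
        );
        long matched = docs.stream().filter(techSince2020()).count();
        System.out.println("匹配文档数:" + matched); // 只有第一条同时满足两个条件
    }
}
```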
2.4 批量处理与性能优化
java
@Slf4j
@Service
@RequiredArgsConstructor
public class BatchVectorService {
private final VectorStore vectorStore;
private final EmbeddingModel embeddingModel;
// 批量添加文档
// 方法体内已通过 CompletableFuture.runAsync 异步执行,无需再叠加 @Async
public CompletableFuture<Void> addDocumentsBatch(
List<Document> documents,
int batchSize
) {
return CompletableFuture.runAsync(() -> {
int total = documents.size();
for (int i = 0; i < total; i += batchSize) {
int end = Math.min(i + batchSize, total);
List<Document> batch = documents.subList(i, end);
vectorStore.add(batch);
log.info("批量添加文档进度:{}/{}", end, total);
// 避免过快调用
try {
Thread.sleep(100);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
});
}
// 并行生成嵌入
public List<List<Float>> generateEmbeddingsParallel(
List<String> texts,
int concurrency
) {
ExecutorService executor = Executors.newFixedThreadPool(concurrency);
try {
List<Future<List<Float>>> futures = texts.stream()
.map(text -> executor.submit(() ->
embeddingModel.embed(text)
))
.toList();
List<List<Float>> embeddings = new ArrayList<>();
for (Future<List<Float>> future : futures) {
embeddings.add(future.get());
}
return embeddings;
} catch (InterruptedException | ExecutionException e) {
throw new RuntimeException("生成嵌入失败", e);
} finally {
executor.shutdown();
}
}
}
三、QuestionAnswerAdvisor 自包含模板机制
3.1 QuestionAnswerAdvisor 基本使用
java
@Service
@RequiredArgsConstructor
public class RAGChatService {
private final ChatClient chatClient;
private final VectorStore vectorStore;
// 基础 RAG 问答
public String askQuestion(String question) {
return chatClient.prompt()
.user(question)
.advisors(new QuestionAnswerAdvisor(
vectorStore,
SearchRequest.defaults()
))
.call()
.content();
}
// 自定义搜索参数
public String askQuestionWithParams(
String question,
int topK,
double threshold
) {
return chatClient.prompt()
.user(question)
.advisors(new QuestionAnswerAdvisor(
vectorStore,
SearchRequest.builder()
.withTopK(topK)
.withSimilarityThreshold(threshold)
.build()
))
.call()
.content();
}
}
3.2 自包含模板机制
QuestionAnswerAdvisor 使用自包含模板,自动填充上下文。
默认模板:
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
{question_answer_context}
Question:
{query}
Helpful Answer:
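这个模板的填充过程可以用一段纯字符串替换来示意:Advisor 先做向量检索,把命中文档拼接为 context,替换 {question_answer_context} 与 {query} 占位符后再调用模型(以下为简化示意,并非 QuestionAnswerAdvisor 源码):

```java
import java.util.List;

public class TemplateFillDemo {
    static final String TEMPLATE = """
            Use the following pieces of context to answer the question at the end.
            If you don't know the answer, just say that you don't know, don't try to make up an answer.
            Context:
            {question_answer_context}
            Question:
            {query}
            Helpful Answer:
            """;

    /** 将检索到的文档片段与用户问题填入模板,得到增强后的提示词 */
    public static String fill(String query, List<String> retrievedChunks) {
        String context = String.join("\n---\n", retrievedChunks);
        return TEMPLATE
                .replace("{question_answer_context}", context)
                .replace("{query}", query);
    }
}
```

填充后的完整提示词才是真正发给 ChatModel 的内容,这也是"自包含模板"的含义:检索与提示词组装都封装在 Advisor 内部。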
自定义模板:
java
@Service
@RequiredArgsConstructor
public class CustomTemplateRAGService {
private final ChatClient chatClient;
private final VectorStore vectorStore;
// 使用自定义模板
public String askWithCustomTemplate(String question) {
String customTemplate = """
你是一个专业的技术顾问。
基于以下知识库内容回答用户问题:
知识库内容:
{question_answer_context}
用户问题:
{query}
要求:
1. 基于知识库内容回答
2. 引用具体的文档段落
3. 如果知识库中没有相关信息,明确说明
""";
return chatClient.prompt()
.user(question)
.advisors(QuestionAnswerAdvisor.builder()
.vectorStore(vectorStore)
.searchRequest(SearchRequest.defaults())
.userTextAdvise(customTemplate) // 自定义模板
.build())
.call()
.content();
}
}
3.3 获取检索到的文档
java
@Service
@RequiredArgsConstructor
public class RAGDebugService {
private final ChatClient chatClient;
private final VectorStore vectorStore;
public RAGResult askWithDebug(String question) {
// QuestionAnswerAdvisor 会把检索到的文档放入响应上下文,
// 通过 chatClientResponse() 读取(方法名以实际版本的 API 为准)
ChatClientResponse response = chatClient.prompt()
.user(question)
.advisors(QuestionAnswerAdvisor.builder()
.vectorStore(vectorStore)
.searchRequest(SearchRequest.defaults())
.build())
.call()
.chatClientResponse();
@SuppressWarnings("unchecked")
List<Document> retrievedDocs = (List<Document>) response.context()
.get(QuestionAnswerAdvisor.RETRIEVED_DOCUMENTS);
String answer = response.chatResponse().getResult().getOutput().getText();
return new RAGResult(answer, retrievedDocs);
}
public record RAGResult(
String answer,
List<Document> retrievedDocuments
) {}
}
四、EmbeddingModel 批量处理与性能优化
4.1 EmbeddingModel 接口
java
// 简化示意,实际接口基于 EmbeddingRequest/EmbeddingResponse
public interface EmbeddingModel {
// 单条文本嵌入
default List<Float> embed(String text) {
return embed(List.of(text)).get(0);
}
// 批量文本嵌入
List<List<Float>> embed(List<String> texts);
// 文档嵌入(命名与 embed(List<String>) 区分,避免泛型擦除导致的签名冲突)
default List<List<Float>> embedDocuments(List<Document> documents) {
List<String> texts = documents.stream()
.map(Document::getContent)
.toList();
return embed(texts);
}
}
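向量检索的"相似度"通常指嵌入向量之间的余弦相似度,withSimilarityThreshold 设置的就是该分数的下限。下面是一个自包含的计算示意:

```java
public class CosineSimilarity {
    /** 余弦相似度:两个向量夹角的余弦,范围 [-1, 1],越接近 1 越相似 */
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("向量维度必须一致");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];      // 点积
            normA += a[i] * a[i];    // 向量 a 的模平方
            normB += b[i] * b[i];    // 向量 b 的模平方
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

例如阈值设为 0.7 时,只有与查询向量余弦相似度不低于 0.7 的文档才会被返回。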
4.2 批量处理策略
java
@Slf4j
@Service
@RequiredArgsConstructor
public class EmbeddingOptimizationService {
private final EmbeddingModel embeddingModel;
private static final int OPTIMAL_BATCH_SIZE = 100;
// 智能批量嵌入
public List<List<Float>> embedOptimally(List<String> texts) {
int total = texts.size();
List<List<Float>> allEmbeddings = new ArrayList<>();
// 分批处理
for (int i = 0; i < total; i += OPTIMAL_BATCH_SIZE) {
int end = Math.min(i + OPTIMAL_BATCH_SIZE, total);
List<String> batch = texts.subList(i, end);
List<List<Float>> embeddings = embeddingModel.embed(batch);
allEmbeddings.addAll(embeddings);
log.info("生成嵌入进度:{}/{}", end, total);
// 短暂暂停,避免限流
if (end < total) {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
}
return allEmbeddings;
}
// 并行批量嵌入
public List<List<Float>> embedParallel(
List<String> texts,
int concurrency
) {
int batchSize = (int) Math.ceil((double) texts.size() / concurrency);
List<List<String>> batches = new ArrayList<>();
for (int i = 0; i < texts.size(); i += batchSize) {
int end = Math.min(i + batchSize, texts.size());
batches.add(texts.subList(i, end));
}
ExecutorService executor = Executors.newFixedThreadPool(concurrency);
try {
List<Future<List<List<Float>>>> futures = batches.stream()
.map(batch -> executor.submit(() ->
embeddingModel.embed(batch)
))
.toList();
List<List<Float>> allEmbeddings = new ArrayList<>();
for (Future<List<List<Float>>> future : futures) {
allEmbeddings.addAll(future.get());
}
return allEmbeddings;
} catch (InterruptedException | ExecutionException e) {
throw new RuntimeException("并行嵌入失败", e);
} finally {
executor.shutdown();
}
}
}
五、完整 RAG 系统实现
5.1 系统架构
┌─────────────────┐
│   文档上传接口    │
└────────┬────────┘
         │
         ↓
┌─────────────────────────────────┐
│  Document Ingestion Pipeline    │
│  - Reader                       │
│  - Transformer                  │
│  - Writer                       │
└────────┬────────────────────────┘
         │
         ↓
┌─────────────────────────────────┐
│  Vector Store (Redis)           │
└────────┬────────────────────────┘
         │
         ↓
┌─────────────────────────────────┐
│  QuestionAnswerAdvisor          │
│  - 检索相关文档                  │
│  - 构建上下文                    │
│  - 增强提示词                    │
└────────┬────────────────────────┘
         │
         ↓
┌─────────────────────────────────┐
│  ChatClient                     │
│  - 发送增强后的请求              │
│  - 返回答案                      │
└─────────────────────────────────┘
5.2 完整代码实现
配置类:
java
@Configuration
public class RAGConfig {
@Bean
public ChatClient chatClient(ChatModel chatModel) {
return ChatClient.builder(chatModel).build();
}
@Bean
public RedisVectorStore vectorStore(
RedisConnectionFactory connectionFactory,
EmbeddingModel embeddingModel
) {
return new RedisVectorStore(
connectionFactory,
embeddingModel,
RedisVectorStoreConfig.builder()
.withIndexName("ai-documents")
.withInitializeSchema(true)
.build()
);
}
@Bean
public DocumentTransformer documentTransformer(ChatModel chatModel) {
TextSplitter textSplitter = new TokenTextSplitter();
// KeywordMetadataEnricher 需要 ChatModel 抽取关键词,这里每个文档提取 5 个
KeywordMetadataEnricher keywordEnricher = new KeywordMetadataEnricher(chatModel, 5);
return documents -> {
documents = textSplitter.apply(documents);
documents = keywordEnricher.apply(documents);
return documents;
};
}
}
服务类:
java
@Slf4j
@Service
@RequiredArgsConstructor
public class RAGService {
private final ChatClient chatClient;
private final VectorStore vectorStore;
private final DocumentTransformer documentTransformer;
// 文档摄取
public void ingestDocuments(List<Resource> resources) {
List<Document> allDocuments = new ArrayList<>();
for (Resource resource : resources) {
List<Document> docs = readDocument(resource);
allDocuments.addAll(docs);
}
// 转换文档
List<Document> transformedDocs = documentTransformer.apply(allDocuments);
// 添加到向量存储
vectorStore.add(transformedDocs);
log.info("成功摄取 {} 个文档", transformedDocs.size());
}
// 问答
public String ask(String question) {
return chatClient.prompt()
.user(question)
.advisors(new QuestionAnswerAdvisor(
vectorStore,
SearchRequest.defaults()
.withTopK(5)
.withSimilarityThreshold(0.7)
))
.call()
.content();
}
private List<Document> readDocument(Resource resource) {
String filename = resource.getFilename();
// 防御:文件名可能为空或没有扩展名
int dot = (filename == null) ? -1 : filename.lastIndexOf('.');
String extension = (dot < 0) ? "" : filename.substring(dot + 1);
return switch (extension.toLowerCase()) {
case "pdf" -> new PagePdfDocumentReader(resource).read();
case "doc", "docx" -> new TikaDocumentReader(resource).read();
case "txt" -> List.of(new Document(
readTextResource(resource),
Map.of("filename", filename)
));
default -> throw new UnsupportedOperationException(
"不支持的文件类型:" + extension
);
};
}
private String readTextResource(Resource resource) {
try {
return new String(resource.getContentAsByteArray(), StandardCharsets.UTF_8);
} catch (IOException e) {
throw new RuntimeException("读取文本文件失败", e);
}
}
}
控制器:
java
@RestController
@RequestMapping("/api/rag")
@RequiredArgsConstructor
public class RAGController {
private final RAGService ragService;
@PostMapping("/ingest")
public ResponseEntity<String> ingest(
@RequestParam("file") MultipartFile file
) {
try {
Resource resource = file.getResource();
ragService.ingestDocuments(List.of(resource));
return ResponseEntity.ok("文档摄取成功");
} catch (Exception e) {
return ResponseEntity
.status(HttpStatus.INTERNAL_SERVER_ERROR)
.body("文档摄取失败:" + e.getMessage());
}
}
@PostMapping("/ask")
public ResponseEntity<String> ask(@RequestBody String question) {
String answer = ragService.ask(question);
return ResponseEntity.ok(answer);
}
}
总结
Spring AI 2.0 提供了一套完整的 RAG 实现框架,从文档摄取到向量存储,再到检索增强,全链路覆盖。通过合理使用 Document ETL 框架、Vector Store API 和 QuestionAnswerAdvisor,我们可以构建出生产级的文档问答系统。
作为一名架构师,我建议在实际项目中:
- 选择合适的 DocumentReader:根据文档类型和结构选择
- 优化文本分割策略:平衡粒度和语义完整性
- 充分利用元数据过滤:提高检索精度
- 自定义 RAG 模板:适应不同业务场景
- 性能优化:批量处理、并行化、缓存
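以最后一条中的"缓存"为例,嵌入结果的记忆化可以用一个很小的包装类实现(示意:以文本为 key 的内存缓存,生产环境可替换为 Redis 等外部缓存):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class EmbeddingCache {
    private final Map<String, List<Float>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<Float>> embedder;

    public EmbeddingCache(Function<String, List<Float>> embedder) {
        this.embedder = embedder;
    }

    /** 相同文本只计算一次嵌入,重复请求直接命中缓存,减少模型调用与延迟 */
    public List<Float> embed(String text) {
        return cache.computeIfAbsent(text, embedder);
    }

    public int size() {
        return cache.size();
    }
}
```

把真实的 EmbeddingModel::embed 作为 embedder 传入即可;对重复摄取同一批文档的场景,命中率往往很高。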
RAG 系统的质量直接影响用户体验,Spring AI 2.0 为我们提供了坚实的基础,让我们能够专注于业务逻辑而非底层实现。
参考资料:
- Spring AI ETL Pipeline 文档:https://docs.spring.io/spring-ai/reference/api/etl-pipeline.html
- Spring AI Vector Store 文档:https://docs.spring.io/spring-ai/reference/api/vectordbs.html
- Apache Tika 文档:https://tika.apache.org/2.9.0/formats.html