# Spring AI (GA): Getting Started with ETL for RAG
About This Tutorial
Note: this tutorial is based on the official GA release of May 20, 2025 and provides:
- Quick-start tutorials for the core feature modules
- Source-level walkthroughs of the core feature modules
- Quick-start tutorials + source-level walkthroughs of the Spring AI Alibaba enhancements
Versions: JDK 21 + Spring Boot 3.4.5 + Spring AI 1.0.0 + Spring AI Alibaba 1.0.0.2
The chapters will be published in sequence. This one is the ETL Pipeline quick-start section of Chapter 6 (Improving Answer Quality with RAG).
The code is open-sourced at: github.com/GTyingzi/sp...

For earlier WeChat-post walkthroughs, see:
Chapter 1
Spring AI (GA) chat: quick start + auto-configuration source walkthrough
Chapter 2
Spring AI (GA): SQLite, MySQL, and Redis message storage quick start
Chapter 3
Chapter 5
Spring AI (GA): quick start with in-memory, Redis, and ES vector stores
Spring AI (GA): vector database theory and source walkthrough + Redis and ES integration source
Chapter 6
For a better reading experience, you can purchase access to the Feishu docs of this Spring AI tutorial (currently 49.9 CNY; the price will rise gradually as the content grows).
Note: the M6 quick-start + source-walkthrough Feishu docs are already available for free.
Getting Started with the RAG ETL Pipeline

> [!TIP]
> The Extract, Transform, and Load (ETL) framework is the data-processing pipeline of Chapter 6 (Improving Answer Quality with RAG): it moves raw data sources into vector storage, ensuring the data is in the best possible format for retrieval by AI models.

The hands-on code is under rag/rag-etl-pipeline at github.com/GTyingzi/sp...
For the source walkthrough, see "ETL Pipeline Source Analysis".
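Before diving into the code, the three stages can be sketched as a plain-Java contract. The types below are hand-rolled stand-ins that only mirror the shape of Spring AI's DocumentReader (a Supplier), DocumentTransformer (a Function), and DocumentWriter (a Consumer); they are not the library's own classes.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Minimal ETL sketch with hypothetical types (not Spring AI's own classes):
// a reader supplies documents, a transformer maps them, a writer consumes them.
public class EtlPipelineSketch {

    record Doc(String text) {}

    static List<String> pipeline() {
        // Extract: a reader would normally parse a file; hard-coded here.
        Supplier<List<Doc>> reader = () -> List.of(new Doc("Spring AI ETL Demo"));

        // Transform: e.g. normalize text before embedding.
        Function<List<Doc>, List<Doc>> transformer =
                docs -> docs.stream().map(d -> new Doc(d.text().toLowerCase())).toList();

        List<Doc> transformed = transformer.apply(reader.get());

        // Load: a writer would normally push the documents into a vector store.
        Consumer<List<Doc>> writer = docs -> docs.forEach(d -> System.out.println(d.text()));
        writer.accept(transformed);

        return transformed.stream().map(Doc::text).toList();
    }

    public static void main(String[] args) {
        pipeline(); // prints "spring ai etl demo"
    }
}
```

The rest of this article fills in each stage with Spring AI's real readers, transformers, and writers.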

pom file
```xml
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-autoconfigure-model-openai</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-commons</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-rag</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-jsoup-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-markdown-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>
</dependencies>
```
application.yml
```yaml
server:
  port: 8080

spring:
  application:
    name: rag-etl-pipeline
  ai:
    openai:
      api-key: ${DASHSCOPE_API_KEY}
      base-url: https://dashscope.aliyuncs.com/compatible-mode
      chat:
        options:
          model: qwen-max
      embedding:
        options:
          model: text-embedding-v1
```
Extracting Documents
Constant
```java
package com.spring.ai.tutorial.rag.model;

public class Constant {

    public static final String PREFIX = "classpath:data/";

    public static final String TEXTFILEPATH = PREFIX + "text.txt";
    public static final String JSONFILEPATH = PREFIX + "text.json";
    public static final String MARKDOWNFILEPATH = PREFIX + "text.md";
    public static final String PDFFILEPATH = PREFIX + "google-ai-agents-whitepaper.pdf";
    public static final String HTMLFILEPATH = PREFIX + "spring-ai.html";
}
```

ReaderController
```java
package com.spring.ai.tutorial.rag.controller;

import com.spring.ai.tutorial.rag.model.Constant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.JsonReader;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.reader.jsoup.JsoupDocumentReader;
import org.springframework.ai.reader.markdown.MarkdownDocumentReader;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/reader")
public class ReaderController {

    private static final Logger logger = LoggerFactory.getLogger(ReaderController.class);

    @GetMapping("/text")
    public List<Document> readText() {
        logger.info("start read text file");
        Resource resource = new DefaultResourceLoader().getResource(Constant.TEXTFILEPATH);
        TextReader textReader = new TextReader(resource); // for plain-text data
        return textReader.read();
    }

    @GetMapping("/json")
    public List<Document> readJson() {
        logger.info("start read json file");
        Resource resource = new DefaultResourceLoader().getResource(Constant.JSONFILEPATH);
        JsonReader jsonReader = new JsonReader(resource); // accepts JSON files only
        return jsonReader.read();
    }

    @GetMapping("/pdf-page")
    public List<Document> readPdfPage() {
        logger.info("start read pdf file by page");
        Resource resource = new DefaultResourceLoader().getResource(Constant.PDFFILEPATH);
        PagePdfDocumentReader pagePdfDocumentReader = new PagePdfDocumentReader(resource); // accepts PDF files only
        return pagePdfDocumentReader.read();
    }

    @GetMapping("/pdf-paragraph")
    public List<Document> readPdfParagraph() {
        logger.info("start read pdf file by paragraph");
        Resource resource = new DefaultResourceLoader().getResource(Constant.PDFFILEPATH);
        ParagraphPdfDocumentReader paragraphPdfDocumentReader = new ParagraphPdfDocumentReader(resource); // for PDFs with a table of contents
        return paragraphPdfDocumentReader.read();
    }

    @GetMapping("/markdown")
    public List<Document> readMarkdown() {
        logger.info("start read markdown file");
        MarkdownDocumentReader markdownDocumentReader = new MarkdownDocumentReader(Constant.MARKDOWNFILEPATH); // accepts Markdown files only
        return markdownDocumentReader.read();
    }

    @GetMapping("/html")
    public List<Document> readHtml() {
        logger.info("start read html file");
        Resource resource = new DefaultResourceLoader().getResource(Constant.HTMLFILEPATH);
        JsoupDocumentReader jsoupDocumentReader = new JsoupDocumentReader(resource); // accepts HTML files only
        return jsoupDocumentReader.read();
    }

    @GetMapping("/tika")
    public List<Document> readTika() {
        logger.info("start read file with Tika");
        Resource resource = new DefaultResourceLoader().getResource(Constant.HTMLFILEPATH);
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(resource); // handles many document formats
        return tikaDocumentReader.read();
    }
}
```
Results
- Reading a text file
- Reading a JSON file
- Reading a PDF file by page
- Reading a PDF file with a table of contents by paragraph
- Reading a Markdown file
- Reading an HTML file
- Reading an arbitrary document format with Tika
Transforming Documents
TransformerController
```java
package com.spring.ai.tutorial.rag.controller;

import com.spring.ai.tutorial.rag.model.Constant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.model.transformer.KeywordMetadataEnricher;
import org.springframework.ai.model.transformer.SummaryMetadataEnricher;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.transformer.ContentFormatTransformer;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/transformer")
public class TransformerController {

    private static final Logger logger = LoggerFactory.getLogger(TransformerController.class);

    private final List<Document> documents;
    private final ChatModel chatModel;

    public TransformerController(ChatModel chatModel) {
        logger.info("start read pdf file by page");
        Resource resource = new DefaultResourceLoader().getResource(Constant.PDFFILEPATH);
        PagePdfDocumentReader pagePdfDocumentReader = new PagePdfDocumentReader(resource); // accepts PDF files only
        this.documents = pagePdfDocumentReader.read();
        this.chatModel = chatModel;
    }

    @GetMapping("/token-text-splitter")
    public List<Document> tokenTextSplitter() {
        logger.info("start token text splitter");
        TokenTextSplitter tokenTextSplitter = TokenTextSplitter.builder()
                // target number of tokens per chunk
                .withChunkSize(800)
                // minimum number of characters per chunk
                .withMinChunkSizeChars(350)
                // discard chunks shorter than this length
                .withMinChunkLengthToEmbed(5)
                // maximum number of chunks generated from a text
                .withMaxNumChunks(10000)
                // whether to keep separators in the chunks
                .withKeepSeparator(true)
                .build();
        return tokenTextSplitter.split(this.documents);
    }

    @GetMapping("/content-format-transformer")
    public List<Document> contentFormatTransformer() {
        logger.info("start content format transformer");
        DefaultContentFormatter defaultContentFormatter = DefaultContentFormatter.defaultConfig();
        ContentFormatTransformer contentFormatTransformer = new ContentFormatTransformer(defaultContentFormatter);
        return contentFormatTransformer.apply(this.documents);
    }

    @GetMapping("/keyword-metadata-enricher")
    public List<Document> keywordMetadataEnricher() {
        logger.info("start keyword metadata enricher");
        KeywordMetadataEnricher keywordMetadataEnricher = new KeywordMetadataEnricher(this.chatModel, 3);
        return keywordMetadataEnricher.apply(this.documents);
    }

    @GetMapping("/summary-metadata-enricher")
    public List<Document> summaryMetadataEnricher() {
        logger.info("start summary metadata enricher");
        List<SummaryMetadataEnricher.SummaryType> summaryTypes = List.of(
                SummaryMetadataEnricher.SummaryType.NEXT,
                SummaryMetadataEnricher.SummaryType.CURRENT,
                SummaryMetadataEnricher.SummaryType.PREVIOUS);
        SummaryMetadataEnricher summaryMetadataEnricher = new SummaryMetadataEnricher(this.chatModel, summaryTypes);
        return summaryMetadataEnricher.apply(this.documents);
    }
}
```
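To make the chunk-size and minimum-length parameters concrete, here is a toy, character-based splitter. It is a hand-rolled illustration of the idea only; the real TokenTextSplitter counts tokens rather than characters and applies further rules (separator handling, minimum chunk characters).

```java
import java.util.ArrayList;
import java.util.List;

// Toy character-based splitter (illustration only, not TokenTextSplitter):
// cut the text into fixed-size chunks and drop fragments too short to embed.
public class SimpleSplitter {

    static List<String> split(String text, int chunkSize, int minChunkLengthToEmbed) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            String chunk = text.substring(start, Math.min(start + chunkSize, text.length()));
            // Discard trailing fragments shorter than the embedding threshold.
            if (chunk.length() >= minChunkLengthToEmbed) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 25 characters with chunkSize 10 -> chunks of 10, 10, and 5 characters
        System.out.println(split("a".repeat(25), 10, 5));
    }
}
```

With minChunkLengthToEmbed set to 5, a trailing 3-character fragment would be dropped instead of embedded, which is exactly why the production splitter exposes that knob.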
Results
- TokenTextSplitter chunking
- DefaultContentFormatter formatting
- KeywordMetadataEnricher keyword extraction
- SummaryMetadataEnricher summary extraction

Writing Documents Out
WriterController
```java
package com.spring.ai.tutorial.rag.controller;

import com.spring.ai.tutorial.rag.model.Constant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.SimpleVectorStore;
import org.springframework.ai.writer.FileDocumentWriter;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/writer")
public class WriterController {

    private static final Logger logger = LoggerFactory.getLogger(WriterController.class);

    private final List<Document> documents;
    private final SimpleVectorStore simpleVectorStore;

    public WriterController(EmbeddingModel embeddingModel) {
        logger.info("start read pdf file by page");
        Resource resource = new DefaultResourceLoader().getResource(Constant.PDFFILEPATH);
        PagePdfDocumentReader pagePdfDocumentReader = new PagePdfDocumentReader(resource); // accepts PDF files only
        this.documents = pagePdfDocumentReader.read();
        this.simpleVectorStore = SimpleVectorStore
                .builder(embeddingModel).build();
    }

    @GetMapping("/file")
    public void writeFile() {
        logger.info("Writing file...");
        String fileName = "output.txt";
        FileDocumentWriter fileDocumentWriter = new FileDocumentWriter(fileName, true); // true: include document markers
        fileDocumentWriter.accept(this.documents);
    }

    @GetMapping("/vector")
    public void writeVector() {
        logger.info("Writing vector...");
        simpleVectorStore.add(documents);
    }

    @GetMapping("/search")
    public List<Document> search() {
        logger.info("start search data");
        return simpleVectorStore.similaritySearch(SearchRequest
                .builder()
                .query("Spring")
                .topK(2)
                .build());
    }
}
```
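The similaritySearch call embeds the query and ranks the stored vectors by similarity, keeping the top K. The snippet below illustrates that ranking with plain cosine similarity over hand-made 2-D vectors; it is a sketch of the idea only, not SimpleVectorStore's actual implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy top-K search: rank stored vectors by cosine similarity to the query
// (hand-made 2-D vectors stand in for real embeddings).
public class TopKSearch {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static List<String> topK(Map<String, double[]> store, double[] query, int k) {
        return store.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> cosine(e.getValue(), query)).reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, double[]> store = Map.of(
                "spring-doc", new double[]{1.0, 0.1},
                "ai-doc", new double[]{0.0, 1.0},
                "mixed-doc", new double[]{0.7, 0.7});
        // The query vector points along the "spring" axis, so spring-doc ranks first.
        System.out.println(topK(store, new double[]{1.0, 0.0}, 2)); // [spring-doc, mixed-doc]
    }
}
```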
Results
- Writing documents out to a text file
- Writing into the vector store
- Searching the vector store

Study Group
Hi, I'm Yingzi (影子). I've worked at 🐻, a new-energy company, and 老铁, and I'm now an AI R&D engineer and a Committer of the Spring AI Alibaba open-source community. I recently set up a study group: one person goes fast, but a group goes far. I also maintain a long-running set of Feishu docs with systematic interview material for backend and big data; DM me to get it for free.
