SpringAI (GA): A Source-Code Walkthrough of ETL for RAG

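About this tutorial

Note: this tutorial is based on the official GA release of May 20, 2025, and covers the following:

  1. Quick-start guides for the core modules
  2. Source-level walkthroughs of the core modules
  3. Quick-start guides and source-level walkthroughs for the Spring AI Alibaba enhancements

Versions: JDK 21 + Spring Boot 3.4.5 + Spring AI 1.0.0 + Spring AI Alibaba 1.0.0.2

The chapters listed below will be completed over time. This article belongs to Chapter 6 (improving answer quality with RAG) and walks through the source code of the ETL pipeline.

The code is open source: github.com/GTyingzi/sp...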

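Earlier articles in this series (published as WeChat posts):

Chapter 1

SpringAI (GA) chat: quick start + auto-configuration source walkthrough

SpringAI (GA): the ChatClient call chain explained

Chapter 2

SpringAI Advisor: quick start + source walkthrough

SpringAI (GA): quick start with SQLite, MySQL, and Redis message storage

Chapter 3

SpringAI (GA): Tool integration, a quick start

Chapter 5

SpringAI (GA): in-memory, Redis, and Elasticsearch vector stores, a quick start

SpringAI (GA): vector store theory and source walkthrough + Redis and Elasticsearch integration source

Chapter 6

SpringAI (GA): RAG quick start + modular walkthrough

SpringAI (GA): quick start with ETL for RAG

For a better reading experience, paid access to the latest Spring AI tutorial on Feishu Docs is available. The current price is 49.9 and will rise gradually as the content grows.

Note: the Feishu Docs version of the M6 quick-start tutorials and source walkthroughs is available for free.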

ETL Pipeline Source Analysis

DocumentReader (the interface for reading document data)

java
package org.springframework.ai.document;

import java.util.List;
import java.util.function.Supplier;

public interface DocumentReader extends Supplier<List<Document>> {
    default List<Document> read() {
        // read() is a readable alias for Supplier.get().
        return get();
    }
}
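
Because DocumentReader extends Supplier<List<Document>>, every reader can be passed wherever a Supplier is expected, and read() simply delegates to get(). The sketch below mirrors that shape with stand-in types so that it runs on the JDK alone. Doc, Reader, and fixedReader are hypothetical names used only for illustration; they are not Spring AI classes.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class ReaderPatternSketch {

    // Stand-in for Spring AI's Document: text plus metadata (hypothetical type).
    record Doc(String text, Map<String, Object> metadata) {}

    // Same shape as DocumentReader: a Supplier whose read() delegates to get().
    interface Reader extends Supplier<List<Doc>> {
        default List<Doc> read() {
            return get();
        }
    }

    // A reader only has to implement get(), so a lambda is enough.
    static Reader fixedReader(String text) {
        return () -> List.of(new Doc(text, Map.of("source", "in-memory")));
    }

    public static void main(String[] args) {
        Reader reader = fixedReader("hello");
        // read() and get() go through the same code path.
        System.out.println(reader.read().equals(reader.get()));
    }
}
```

Each concrete reader discussed in this article follows the same pattern: all of the format-specific work lives in get().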

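TextReader

Reads text content from a resource and converts it into a Document object.

  • Resource resource: the resource to read
  • Map<String, Object> customMetadata: metadata attached to the resulting Document
  • Charset charset: the character set used when reading the text; defaults to UTF-8

Methods

| Method | Description |
|-----------------------|------------------------------|
| TextReader | Constructs a reader from a resource URL or a Resource object |
| setCharset | Sets the character set used when reading the text; defaults to UTF-8 |
| getCharset | Returns the character set currently in use |
| getCustomMetadata | Returns the custom metadata |
| get | Reads the text and returns a list of Document objects |
| getResourceIdentifier | Returns an identifier for the resource (file name, URI, URL, or description) |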

java
package org.springframework.ai.reader;

import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.StreamUtils;

public class TextReader implements DocumentReader {
    public static final String CHARSET_METADATA = "charset";
    public static final String SOURCE_METADATA = "source";
    private final Resource resource;
    private final Map<String, Object> customMetadata;
    private Charset charset;

    public TextReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public TextReader(Resource resource) {
        this.customMetadata = new HashMap<>();
        this.charset = StandardCharsets.UTF_8;
        Objects.requireNonNull(resource, "The Spring Resource must not be null");
        this.resource = resource;
    }

    public Charset getCharset() {
        return this.charset;
    }

    public void setCharset(Charset charset) {
        Objects.requireNonNull(charset, "The charset must not be null");
        this.charset = charset;
    }

    public Map<String, Object> getCustomMetadata() {
        return this.customMetadata;
    }

    public List<Document> get() {
        try {
            String document = StreamUtils.copyToString(this.resource.getInputStream(), this.charset);
            this.customMetadata.put("charset", this.charset.name());
            this.customMetadata.put("source", this.resource.getFilename());
            // Overwrites the value above with the resolved identifier (file name, URI, URL, or description).
            this.customMetadata.put("source", this.getResourceIdentifier(this.resource));
            return List.of(new Document(document, this.customMetadata));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    protected String getResourceIdentifier(Resource resource) {
        String filename = resource.getFilename();
        if (filename != null && !filename.isEmpty()) {
            return filename;
        } else {
            try {
                URI uri = resource.getURI();
                if (uri != null) {
                    return uri.toString();
                }
            } catch (IOException ignored) {
                // No URI available; fall through and try the URL.
            }

            try {
                URL url = resource.getURL();
                if (url != null) {
                    return url.toString();
                }
            } catch (IOException ignored) {
                // No URL available; fall through and use the description.
            }

            return resource.getDescription();
        }
    }
}
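
The core of TextReader.get() is small: decode the whole stream with the configured charset, then record the charset and the source in the metadata. A minimal sketch of that step, assuming a plain InputStream in place of a Spring Resource; Doc and read are hypothetical stand-ins, not the library's API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class TextReadSketch {

    // Stand-in for Document (hypothetical type).
    record Doc(String text, Map<String, Object> metadata) {}

    // Decode the whole stream with the given charset and record "charset" and "source".
    static Doc read(InputStream in, Charset charset, String source) {
        try {
            String text = new String(in.readAllBytes(), charset);
            Map<String, Object> metadata = new HashMap<>();
            metadata.put("charset", charset.name());
            metadata.put("source", source);
            return new Doc(text, metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9 ETL".getBytes(StandardCharsets.UTF_8);
        Doc doc = read(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8, "greeting.txt");
        System.out.println(doc.text() + " " + doc.metadata());
    }
}
```

Reading the same bytes with a different charset would silently produce different text, which is why the charset is stored alongside the content.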

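JsonReader

Reads data from a JSON resource and converts it into Document objects.

  • Resource resource: the JSON resource to read
  • JsonMetadataGenerator jsonMetadataGenerator: generates the metadata attached to each JSON item
  • ObjectMapper objectMapper: parses the JSON data
  • List<String> jsonKeysToUse: the JSON fields to extract as document content; if none are given, or none are present in an item, the whole JSON object is used

Methods

| Method | Description |
|------------|-----------------------|
| JsonReader | Constructs a reader from a Resource object and, optionally, the field names to extract |
| get | Reads the JSON resource and returns a list of Document objects; a top-level array yields one Document per element |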

java
package org.springframework.ai.reader;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.core.io.Resource;

public class JsonReader implements DocumentReader {
    private final Resource resource;
    private final JsonMetadataGenerator jsonMetadataGenerator;
    private final ObjectMapper objectMapper;
    private final List<String> jsonKeysToUse;

    public JsonReader(Resource resource) {
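        // Delegate with no keys; the original this(resource) would call itself recursively and not compile.
        this(resource, new String[0]);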
    }

    public JsonReader(Resource resource, String... jsonKeysToUse) {
        this(resource, new EmptyJsonMetadataGenerator(), jsonKeysToUse);
    }

    public JsonReader(Resource resource, JsonMetadataGenerator jsonMetadataGenerator, String... jsonKeysToUse) {
        this.objectMapper = new ObjectMapper();
        Objects.requireNonNull(jsonKeysToUse, "keys must not be null");
        Objects.requireNonNull(jsonMetadataGenerator, "jsonMetadataGenerator must not be null");
        Objects.requireNonNull(resource, "The Spring Resource must not be null");
        this.resource = resource;
        this.jsonMetadataGenerator = jsonMetadataGenerator;
        this.jsonKeysToUse = List.of(jsonKeysToUse);
    }

    public List<Document> get() {
        try {
            JsonNode rootNode = this.objectMapper.readTree(this.resource.getInputStream());
            return rootNode.isArray() ? StreamSupport.stream(rootNode.spliterator(), true).map((jsonNode) -> this.parseJsonNode(jsonNode, this.objectMapper)).toList() : Collections.singletonList(this.parseJsonNode(rootNode, this.objectMapper));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private Document parseJsonNode(JsonNode jsonNode, ObjectMapper objectMapper) {
        Map<String, Object> item = objectMapper.convertValue(jsonNode, new TypeReference<Map<String, Object>>() {
        });
        StringBuilder sb = new StringBuilder();
        // Append a "key: value" line for every requested key that is present in this item.
        this.jsonKeysToUse.stream()
            .filter(item::containsKey)
            .forEach((key) -> sb.append(key).append(": ").append(item.get(key)).append(System.lineSeparator()));
        Map<String, Object> metadata = this.jsonMetadataGenerator.generate(item);
        // Fall back to the whole item when none of the requested keys matched.
        String content = sb.isEmpty() ? item.toString() : sb.toString();
        return new Document(content, metadata);
    }

    protected List<Document> get(JsonNode rootNode) {
        return rootNode.isArray() ? StreamSupport.stream(rootNode.spliterator(), true).map((jsonNode) -> this.parseJsonNode(jsonNode, this.objectMapper)).toList() : Collections.singletonList(this.parseJsonNode(rootNode, this.objectMapper));
    }

    public List<Document> get(String pointer) {
        try {
            JsonNode rootNode = this.objectMapper.readTree(this.resource.getInputStream());
            JsonNode targetNode = rootNode.at(pointer);
            if (targetNode.isMissingNode()) {
                throw new IllegalArgumentException("Invalid JSON Pointer: " + pointer);
            } else {
                return this.get(targetNode);
            }
        } catch (IOException e) {
            throw new RuntimeException("Error reading JSON resource", e);
        }
    }
}
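
The content-building rule in parseJsonNode can be tried in isolation. The sketch below applies the same logic to a plain Map, so Jackson is not required; JsonContentSketch and content are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class JsonContentSketch {

    // Build "key: value" lines for the selected keys; fall back to the whole item.
    static String content(Map<String, Object> item, List<String> keysToUse) {
        StringBuilder sb = new StringBuilder();
        keysToUse.stream()
            .filter(item::containsKey)
            .forEach(key -> sb.append(key).append(": ").append(item.get(key)).append(System.lineSeparator()));
        return sb.isEmpty() ? item.toString() : sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("id", 1);
        item.put("title", "ETL");
        item.put("body", "Extract, transform, load");
        // Only the selected fields end up in the document text.
        System.out.print(content(item, List.of("title", "body")));
        // With no keys selected, the whole item is used.
        System.out.println(content(item, List.of()));
    }
}
```

Keys that are requested but absent from an item are skipped rather than reported, so a typo in a key name quietly changes the document content.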

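JsoupDocumentReader

Extracts text content from an HTML document and converts it into Document objects.

Fields:

  • Resource htmlResource: the HTML resource to read
  • JsoupDocumentReaderConfig config: controls how the HTML is read, including the charset, the CSS selector, whether to extract all elements, and whether to group by element

Methods

| Method | Description |
|---------------------|-----------------------------|
| JsoupDocumentReader | Constructs a reader from a resource URL or Resource object, optionally with an HTML parsing configuration |
| get | Reads the HTML resource and returns a list of Document objects |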

java
package org.springframework.ai.reader.jsoup;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.jsoup.config.JsoupDocumentReaderConfig;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;

public class JsoupDocumentReader implements DocumentReader {
    private final Resource htmlResource;
    private final JsoupDocumentReaderConfig config;

    public JsoupDocumentReader(String htmlResource) {
        this((new DefaultResourceLoader()).getResource(htmlResource));
    }

    public JsoupDocumentReader(Resource htmlResource) {
        this(htmlResource, JsoupDocumentReaderConfig.defaultConfig());
    }

    public JsoupDocumentReader(String htmlResource, JsoupDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(htmlResource), config);
    }

    public JsoupDocumentReader(Resource htmlResource, JsoupDocumentReaderConfig config) {
        this.htmlResource = htmlResource;
        this.config = config;
    }

    public List<Document> get() {
        try (InputStream inputStream = this.htmlResource.getInputStream()) {
            org.jsoup.nodes.Document doc = Jsoup.parse(inputStream, this.config.charset, "");
            List<Document> documents = new ArrayList<>();
            if (this.config.allElements) {
                String allText = doc.body().text();
                Document document = new Document(allText);
                this.addMetadata(doc, document);
                documents.add(document);
            } else if (this.config.groupByElement) {
                for(Element element : doc.select(this.config.selector)) {
                    String elementText = element.text();
                    Document document = new Document(elementText);
                    this.addMetadata(doc, document);
                    documents.add(document);
                }
            } else {
                Elements elements = doc.select(this.config.selector);
                String text = elements.stream().map(Element::text).collect(Collectors.joining(this.config.separator));
                Document document = new Document(text);
                this.addMetadata(doc, document);
                documents.add(document);
            }

            return documents;
        } catch (IOException e) {
            throw new RuntimeException("Failed to read HTML resource: " + String.valueOf(this.htmlResource), e);
        }
    }

    private void addMetadata(org.jsoup.nodes.Document jsoupDoc, Document springDoc) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("title", jsoupDoc.title());

        for(String metaTag : this.config.metadataTags) {
            String value = jsoupDoc.select("meta[name=" + metaTag + "]").attr("content");
            if (!value.isEmpty()) {
                metadata.put(metaTag, value);
            }
        }

        if (this.config.includeLinkUrls) {
            Elements links = jsoupDoc.select("a[href]");
            List<String> linkUrls = links.stream().map((link) -> link.attr("abs:href")).toList();
            metadata.put("linkUrls", linkUrls);
        }

        metadata.putAll(this.config.additionalMetadata);
        springDoc.getMetadata().putAll(metadata);
    }
}
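
The get() method above has three mutually exclusive modes, checked in a fixed order: allElements first, then groupByElement, then the default join. A minimal sketch of that branching, assuming the body text and the texts of the selected elements have already been extracted (so jsoup is not needed); HtmlModesSketch and extract are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

final class HtmlModesSketch {

    // Mirrors the three branches of get(): the full body text, one text per matched
    // element, or all matched elements joined with the separator.
    static List<String> extract(String bodyText, List<String> selectedTexts,
                                boolean allElements, boolean groupByElement, String separator) {
        List<String> documents = new ArrayList<>();
        if (allElements) {
            documents.add(bodyText);
        } else if (groupByElement) {
            documents.addAll(selectedTexts);
        } else {
            documents.add(String.join(separator, selectedTexts));
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of("First paragraph", "Second paragraph");
        // groupByElement: one entry per selected element.
        System.out.println(extract("Title First paragraph Second paragraph", paragraphs, false, true, "\n"));
    }
}
```

Note that allElements wins when both flags are set, because it is checked first.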
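
JsoupDocumentReaderConfig

Configuration class that controls the behavior of JsoupDocumentReader.

  • String charset: character encoding used when reading the HTML document; defaults to "UTF-8"
  • String selector: CSS selector for the HTML elements to extract; defaults to "body"
  • String separator: separator used when joining the text of multiple elements; defaults to "\n"
  • boolean allElements: whether to extract the text of the whole <body> as a single Document; defaults to false
  • boolean groupByElement: whether to create one Document per element matched by the selector; defaults to false
  • boolean includeLinkUrls: whether to include the link URLs found in the HTML document in the metadata; defaults to false
  • List<String> metadataTags: the names of the <meta> tags to extract as metadata; defaults to "description" and "keywords"
  • Map<String, Object> additionalMetadata: extra metadata added to every generated Document

java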
package org.springframework.ai.reader.jsoup.config;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.util.Assert;

public final class JsoupDocumentReaderConfig {
    public final String charset;
    public final String selector;
    public final String separator;
    public final boolean allElements;
    public final boolean groupByElement;
    public final boolean includeLinkUrls;
    public final List<String> metadataTags;
    public final Map<String, Object> additionalMetadata;

    private JsoupDocumentReaderConfig(Builder builder) {
        this.charset = builder.charset;
        this.selector = builder.selector;
        this.separator = builder.separator;
        this.allElements = builder.allElements;
        this.includeLinkUrls = builder.includeLinkUrls;
        this.metadataTags = builder.metadataTags;
        this.groupByElement = builder.groupByElement;
        this.additionalMetadata = builder.additionalMetadata;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static JsoupDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static final class Builder {
        private String charset = "UTF-8";
        private String selector = "body";
        private String separator = "\n";
        private boolean allElements = false;
        private boolean includeLinkUrls = false;
        private List<String> metadataTags = new ArrayList<>(List.of("description", "keywords"));
        private boolean groupByElement = false;
        private Map<String, Object> additionalMetadata = new HashMap<>();

        private Builder() {
        }

        public Builder charset(String charset) {
            this.charset = charset;
            return this;
        }

        public Builder selector(String selector) {
            this.selector = selector;
            return this;
        }

        public Builder separator(String separator) {
            this.separator = separator;
            return this;
        }

        public Builder allElements(boolean allElements) {
            this.allElements = allElements;
            return this;
        }

        public Builder groupByElement(boolean groupByElement) {
            this.groupByElement = groupByElement;
            return this;
        }

        public Builder includeLinkUrls(boolean includeLinkUrls) {
            this.includeLinkUrls = includeLinkUrls;
            return this;
        }

        public Builder metadataTag(String metadataTag) {
            this.metadataTags.add(metadataTag);
            return this;
        }

        public Builder metadataTags(List<String> metadataTags) {
            this.metadataTags = new ArrayList(metadataTags);
            return this;
        }

        public Builder additionalMetadata(String key, Object value) {
            Assert.notNull(key, "key must not be null");
            Assert.notNull(value, "value must not be null");
            this.additionalMetadata.put(key, value);
            return this;
        }

        public Builder additionalMetadata(Map<String, Object> additionalMetadata) {
            Assert.notNull(additionalMetadata, "additionalMetadata must not be null");
            this.additionalMetadata = additionalMetadata;
            return this;
        }

        public JsoupDocumentReaderConfig build() {
            return new JsoupDocumentReaderConfig(this);
        }
    }
}

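MarkdownDocumentReader

Reads a Markdown file and converts its content into Document objects. It parses the Markdown with the CommonMark library, groups headings, paragraphs, code blocks, and similar content into Document objects, and generates the related metadata.

  • Resource markdownResource: the Markdown resource to read
  • MarkdownDocumentReaderConfig config: controls how the Markdown is read, including whether a horizontal rule acts as a document separator and whether code blocks and blockquotes are included
  • Parser parser: the CommonMark parser that turns the Markdown text into a node tree

DocumentVisitor is a static nested class that extends CommonMark's AbstractVisitor. It walks the nodes of the Markdown syntax tree and, following the configuration, groups their content into structured Document objects:

  1. Traverses the parsed node tree and groups content according to the configuration (for example, whether to split on horizontal rules and whether to include code blocks and blockquotes)
  2. Recognizes headings, paragraphs, code blocks, blockquotes, and other node types, extracts their text and metadata, and builds Documents
  3. Adds metadata such as category, title, and language for the different content types (headings, code blocks, blockquotes) to support later AI processing

Methods

| Method | Description |
|------------------------|---------------------------------|
| MarkdownDocumentReader | Constructs a reader from a resource URL or Resource object and a Markdown parsing configuration |
| get | Reads the Markdown resource and returns a list of Document objects |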

java
package org.springframework.ai.reader.markdown;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.commonmark.node.AbstractVisitor;
import org.commonmark.node.BlockQuote;
import org.commonmark.node.Code;
import org.commonmark.node.FencedCodeBlock;
import org.commonmark.node.HardLineBreak;
import org.commonmark.node.Heading;
import org.commonmark.node.ListItem;
import org.commonmark.node.Node;
import org.commonmark.node.SoftLineBreak;
import org.commonmark.node.Text;
import org.commonmark.node.ThematicBreak;
import org.commonmark.parser.Parser;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.markdown.config.MarkdownDocumentReaderConfig;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;

public class MarkdownDocumentReader implements DocumentReader {
    private final Resource markdownResource;
    private final MarkdownDocumentReaderConfig config;
    private final Parser parser;

    public MarkdownDocumentReader(String markdownResource) {
        this((new DefaultResourceLoader()).getResource(markdownResource), MarkdownDocumentReaderConfig.defaultConfig());
    }

    public MarkdownDocumentReader(String markdownResource, MarkdownDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(markdownResource), config);
    }

    public MarkdownDocumentReader(Resource markdownResource, MarkdownDocumentReaderConfig config) {
        this.markdownResource = markdownResource;
        this.config = config;
        this.parser = Parser.builder().build();
    }

    public List<Document> get() {
        try (InputStream input = this.markdownResource.getInputStream()) {
            Node node = this.parser.parseReader(new InputStreamReader(input));
            DocumentVisitor documentVisitor = new DocumentVisitor(this.config);
            node.accept(documentVisitor);
            return documentVisitor.getDocuments();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static class DocumentVisitor extends AbstractVisitor {
        private final List<Document> documents = new ArrayList<>();
        private final List<String> currentParagraphs = new ArrayList<>();
        private final MarkdownDocumentReaderConfig config;
        private Document.Builder currentDocumentBuilder;

        DocumentVisitor(MarkdownDocumentReaderConfig config) {
            this.config = config;
        }

        public void visit(org.commonmark.node.Document document) {
            this.currentDocumentBuilder = Document.builder();
            super.visit(document);
        }

        public void visit(Heading heading) {
            this.buildAndFlush();
            super.visit(heading);
        }

        public void visit(ThematicBreak thematicBreak) {
            if (this.config.horizontalRuleCreateDocument) {
                this.buildAndFlush();
            }

            super.visit(thematicBreak);
        }

        public void visit(SoftLineBreak softLineBreak) {
            this.translateLineBreakToSpace();
            super.visit(softLineBreak);
        }

        public void visit(HardLineBreak hardLineBreak) {
            this.translateLineBreakToSpace();
            super.visit(hardLineBreak);
        }

        public void visit(ListItem listItem) {
            this.translateLineBreakToSpace();
            super.visit(listItem);
        }

        public void visit(BlockQuote blockQuote) {
            if (!this.config.includeBlockquote) {
                this.buildAndFlush();
            }

            this.translateLineBreakToSpace();
            this.currentDocumentBuilder.metadata("category", "blockquote");
            super.visit(blockQuote);
        }

        public void visit(Code code) {
            this.currentParagraphs.add(code.getLiteral());
            this.currentDocumentBuilder.metadata("category", "code_inline");
            super.visit(code);
        }

        public void visit(FencedCodeBlock fencedCodeBlock) {
            if (!this.config.includeCodeBlock) {
                this.buildAndFlush();
            }

            this.translateLineBreakToSpace();
            this.currentParagraphs.add(fencedCodeBlock.getLiteral());
            this.currentDocumentBuilder.metadata("category", "code_block");
            this.currentDocumentBuilder.metadata("lang", fencedCodeBlock.getInfo());
            this.buildAndFlush();
            super.visit(fencedCodeBlock);
        }

        public void visit(Text text) {
            // Heading text becomes metadata (category and title) instead of document content.
            if (text.getParent() instanceof Heading heading) {
                this.currentDocumentBuilder.metadata("category", "header_%d".formatted(heading.getLevel())).metadata("title", text.getLiteral());
            } else {
                this.currentParagraphs.add(text.getLiteral());
            }

            super.visit(text);
        }

        public List<Document> getDocuments() {
            this.buildAndFlush();
            return this.documents;
        }

        private void buildAndFlush() {
            if (!this.currentParagraphs.isEmpty()) {
                String content = String.join("", this.currentParagraphs);
                Document.Builder builder = this.currentDocumentBuilder.text(content);
                this.config.additionalMetadata.forEach(builder::metadata);
                Document document = builder.build();
                this.documents.add(document);
                this.currentParagraphs.clear();
            }

            this.currentDocumentBuilder = Document.builder();
        }

        private void translateLineBreakToSpace() {
            if (!this.currentParagraphs.isEmpty()) {
                this.currentParagraphs.add(" ");
            }

        }
    }
}
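
The grouping behavior of DocumentVisitor comes down to one rule: a heading (and, if configured, a horizontal rule) flushes the pending paragraphs into a Document and starts a fresh builder. The sketch below models that rule over a flattened list of events instead of real CommonMark nodes; Event, Section, and group are hypothetical names, and code blocks and blockquotes are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class MarkdownGroupingSketch {

    enum Kind { HEADING, TEXT, RULE }

    // Flattened stand-in for CommonMark nodes (hypothetical type).
    record Event(Kind kind, String text) {}

    // One output document: the heading it belongs to (may be null) and its text.
    record Section(String title, String content) {}

    static List<Section> group(List<Event> events, boolean horizontalRuleCreateDocument) {
        List<Section> sections = new ArrayList<>();
        List<String> paragraphs = new ArrayList<>();
        String title = null;
        for (Event event : events) {
            boolean flush = event.kind() == Kind.HEADING
                    || (event.kind() == Kind.RULE && horizontalRuleCreateDocument);
            if (flush) {
                // Same idea as buildAndFlush(): emit pending text, then reset the builder state.
                if (!paragraphs.isEmpty()) {
                    sections.add(new Section(title, String.join("", paragraphs)));
                    paragraphs.clear();
                }
                title = null;
            }
            if (event.kind() == Kind.HEADING) {
                title = event.text(); // heading text becomes metadata, not content
            } else if (event.kind() == Kind.TEXT) {
                paragraphs.add(event.text());
            }
        }
        // Final flush, as getDocuments() does.
        if (!paragraphs.isEmpty()) {
            sections.add(new Section(title, String.join("", paragraphs)));
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(Kind.HEADING, "Intro"), new Event(Kind.TEXT, "a"),
                new Event(Kind.RULE, ""), new Event(Kind.TEXT, "b"));
        System.out.println(group(events, true));
    }
}
```

One consequence visible in the sketch: text that follows a horizontal-rule split carries no title, because the builder was reset and no new heading has been seen yet.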
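
MarkdownDocumentReaderConfig

Configures the behavior of MarkdownDocumentReader.

  • boolean horizontalRuleCreateDocument: whether text separated by a horizontal rule starts a new Document
  • boolean includeCodeBlock: whether code blocks stay in the surrounding paragraph document or are split into a separate document
  • boolean includeBlockquote: whether blockquotes stay in the surrounding paragraph document or are split into a separate document
  • Map<String, Object> additionalMetadata: extra metadata added to every Document

java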
package org.springframework.ai.reader.markdown.config;

import java.util.HashMap;
import java.util.Map;
import org.springframework.util.Assert;

public class MarkdownDocumentReaderConfig {
    public final boolean horizontalRuleCreateDocument;
    public final boolean includeCodeBlock;
    public final boolean includeBlockquote;
    public final Map<String, Object> additionalMetadata;

    public MarkdownDocumentReaderConfig(Builder builder) {
        this.horizontalRuleCreateDocument = builder.horizontalRuleCreateDocument;
        this.includeCodeBlock = builder.includeCodeBlock;
        this.includeBlockquote = builder.includeBlockquote;
        this.additionalMetadata = builder.additionalMetadata;
    }

    public static MarkdownDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static Builder builder() {
        return new Builder();
    }

    public static final class Builder {
        private boolean horizontalRuleCreateDocument = false;
        private boolean includeCodeBlock = false;
        private boolean includeBlockquote = false;
        private Map<String, Object> additionalMetadata = new HashMap<>();

        private Builder() {
        }

        public Builder withHorizontalRuleCreateDocument(boolean horizontalRuleCreateDocument) {
            this.horizontalRuleCreateDocument = horizontalRuleCreateDocument;
            return this;
        }

        public Builder withIncludeCodeBlock(boolean includeCodeBlock) {
            this.includeCodeBlock = includeCodeBlock;
            return this;
        }

        public Builder withIncludeBlockquote(boolean includeBlockquote) {
            this.includeBlockquote = includeBlockquote;
            return this;
        }

        public Builder withAdditionalMetadata(String key, Object value) {
            Assert.notNull(key, "key must not be null");
            Assert.notNull(value, "value must not be null");
            this.additionalMetadata.put(key, value);
            return this;
        }

        public Builder withAdditionalMetadata(Map<String, Object> additionalMetadata) {
            Assert.notNull(additionalMetadata, "additionalMetadata must not be null");
            this.additionalMetadata = additionalMetadata;
            return this;
        }

        public MarkdownDocumentReaderConfig build() {
            return new MarkdownDocumentReaderConfig(this);
        }
    }
}

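PagePdfDocumentReader

Parses a PDF file into multiple Documents grouped by page. Each Document can contain one or more pages, and both the grouping and the page cropping are configurable.

  • PDDocument document: the PDF document to read
  • String resourceFileName: the name of the PDF file
  • PdfDocumentReaderConfig config: controls how the PDF is read, including the number of pages per Document and the page margins

Methods

| Method | Description |
|-----------------------|----------------------------|
| PagePdfDocumentReader | Constructs a reader from a resource URL or Resource object, optionally with a PDF parsing configuration |
| get | Reads the PDF and returns a list of Document objects |
| toDocument | Wraps the text of the given pages and its metadata in a Document |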

java
package org.springframework.ai.reader.pdf;

import java.awt.Rectangle;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

public class PagePdfDocumentReader implements DocumentReader {
    public static final String METADATA_START_PAGE_NUMBER = "page_number";
    public static final String METADATA_END_PAGE_NUMBER = "end_page_number";
    public static final String METADATA_FILE_NAME = "file_name";
    private static final String PDF_PAGE_REGION = "pdfPageRegion";
    protected final PDDocument document;
    private final Logger logger;
    protected String resourceFileName;
    private PdfDocumentReaderConfig config;

    public PagePdfDocumentReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public PagePdfDocumentReader(Resource pdfResource) {
        this(pdfResource, PdfDocumentReaderConfig.defaultConfig());
    }

    public PagePdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), config);
    }

    public PagePdfDocumentReader(Resource pdfResource, PdfDocumentReaderConfig config) {
        this.logger = LoggerFactory.getLogger(this.getClass());

        try {
            PDFParser pdfParser = new PDFParser(new RandomAccessReadBuffer(pdfResource.getInputStream()));
            this.document = pdfParser.parse();
            this.resourceFileName = pdfResource.getFilename();
            this.config = config;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> get() {
        List<Document> readDocuments = new ArrayList<>();

        try {
            PDFLayoutTextStripperByArea pdfTextStripper = new PDFLayoutTextStripperByArea();
            int pageNumber = 0;
            int pagesPerDocument = 0;
            int startPageNumber = pageNumber;
            List<String> pageTextGroupList = new ArrayList<>();
            int totalPages = this.document.getDocumentCatalog().getPages().getCount();
            int logFrequency = totalPages > 10 ? totalPages / 10 : 1;
            int counter = 0;
            PDPage lastPage = (PDPage)this.document.getDocumentCatalog().getPages().iterator().next();

            for(PDPage page : this.document.getDocumentCatalog().getPages()) {
                lastPage = page;
                if (counter % logFrequency == 0 && counter / logFrequency < 10) {
                    this.logger.info("Processing PDF page: {}", counter + 1);
                }

                ++counter;
                ++pagesPerDocument;
                if (this.config.pagesPerDocument != 0 && pagesPerDocument >= this.config.pagesPerDocument) {
                    pagesPerDocument = 0;
                    String aggregatedPageTextGroup = (String)pageTextGroupList.stream().collect(Collectors.joining());
                    if (StringUtils.hasText(aggregatedPageTextGroup)) {
                        readDocuments.add(this.toDocument(page, aggregatedPageTextGroup, startPageNumber, pageNumber));
                    }

                    pageTextGroupList.clear();
                    startPageNumber = pageNumber + 1;
                }

                int x0 = (int)page.getMediaBox().getLowerLeftX();
                int xW = (int)page.getMediaBox().getWidth();
                int y0 = (int)page.getMediaBox().getLowerLeftY() + this.config.pageTopMargin;
                int yW = (int)page.getMediaBox().getHeight() - (this.config.pageTopMargin + this.config.pageBottomMargin);
                pdfTextStripper.addRegion("pdfPageRegion", new Rectangle(x0, y0, xW, yW));
                pdfTextStripper.extractRegions(page);
                String pageText = pdfTextStripper.getTextForRegion("pdfPageRegion");
                if (StringUtils.hasText(pageText)) {
                    pageText = this.config.pageExtractedTextFormatter.format(pageText, pageNumber);
                    pageTextGroupList.add(pageText);
                }

                ++pageNumber;
                pdfTextStripper.removeRegion("pdfPageRegion");
            }

            if (!CollectionUtils.isEmpty(pageTextGroupList)) {
                readDocuments.add(this.toDocument(lastPage, (String)pageTextGroupList.stream().collect(Collectors.joining()), startPageNumber, pageNumber));
            }

            this.logger.info("Processing {} pages", totalPages);
            return readDocuments;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    protected Document toDocument(PDPage page, String docText, int startPageNumber, int endPageNumber) {
        Document doc = new Document(docText);
        doc.getMetadata().put("page_number", startPageNumber);
        // The end page is recorded only when the group spans more than one page number.
        if (startPageNumber != endPageNumber) {
            doc.getMetadata().put("end_page_number", endPageNumber);
        }

        doc.getMetadata().put("file_name", this.resourceFileName);
        return doc;
    }
}
PdfDocumentReaderConfig

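Configuration class for the PDF document readers. It controls how a PDF is parsed and how its pages are grouped into Documents.

  • int ALL_PAGES: constant with value 0; used as pagesPerDocument, it merges all pages into a single Document
  • boolean reversedParagraphPosition: whether paragraph positions are measured from the opposite edge of the page (the paragraph reader then uses page height minus position); default false
  • int pagesPerDocument: number of pages per Document; 0 merges all pages; default 1
  • int pageTopMargin: amount cropped from the top of each page, in the units of the page's MediaBox (PDF user-space units rather than screen pixels); default 0
  • int pageBottomMargin: amount cropped from the bottom of each page, in the same units; default 0
  • ExtractedTextFormatter pageExtractedTextFormatter: formatter applied to the extracted text, so the handling of each page's text can be customized; defaults to ExtractedTextFormatter.defaults()

java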
package org.springframework.ai.reader.pdf.config;

import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.util.Assert;

public final class PdfDocumentReaderConfig {
    public static final int ALL_PAGES = 0;
    public final boolean reversedParagraphPosition;
    public final int pagesPerDocument;
    public final int pageTopMargin;
    public final int pageBottomMargin;
    public final ExtractedTextFormatter pageExtractedTextFormatter;

    private PdfDocumentReaderConfig(Builder builder) {
        this.pagesPerDocument = builder.pagesPerDocument;
        this.pageBottomMargin = builder.pageBottomMargin;
        this.pageTopMargin = builder.pageTopMargin;
        this.pageExtractedTextFormatter = builder.pageExtractedTextFormatter;
        this.reversedParagraphPosition = builder.reversedParagraphPosition;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static PdfDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static final class Builder {
        private int pagesPerDocument = 1;
        private int pageTopMargin = 0;
        private int pageBottomMargin = 0;
        private ExtractedTextFormatter pageExtractedTextFormatter = ExtractedTextFormatter.defaults();
        private boolean reversedParagraphPosition = false;

        private Builder() {
        }

        public Builder withPageExtractedTextFormatter(ExtractedTextFormatter pageExtractedTextFormatter) {
            Assert.notNull(pageExtractedTextFormatter, "PageExtractedTextFormatter must not be null.");
            this.pageExtractedTextFormatter = pageExtractedTextFormatter;
            return this;
        }

        public Builder withPagesPerDocument(int pagesPerDocument) {
            Assert.isTrue(pagesPerDocument >= 0, "Page count must be a positive value.");
            this.pagesPerDocument = pagesPerDocument;
            return this;
        }

        public Builder withPageTopMargin(int topMargin) {
            Assert.isTrue(topMargin >= 0, "Page margins must be a positive value.");
            this.pageTopMargin = topMargin;
            return this;
        }

        public Builder withPageBottomMargin(int bottomMargin) {
            Assert.isTrue(bottomMargin >= 0, "Page margins must be a positive value.");
            this.pageBottomMargin = bottomMargin;
            return this;
        }

        public Builder withReversedParagraphPosition(boolean reversedParagraphPosition) {
            this.reversedParagraphPosition = reversedParagraphPosition;
            return this;
        }

        public PdfDocumentReaderConfig build() {
            return new PdfDocumentReaderConfig(this);
        }
    }
}
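
The grouping and cropping rules described above can be illustrated without PDFBox. The following is a minimal sketch, not the reader's actual loop: `PageGroupingSketch`, `PageRange`, and `Region` are hypothetical names, and 1-based page numbers are an assumption of the sketch. The crop arithmetic mirrors the `x0/y0/xW/yW` computation in the reader listing (top margin added to the lower-left Y, both margins subtracted from the height).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mirrors the semantics of pagesPerDocument and the margins.
final class PageGroupingSketch {

    /** Inclusive, 1-based page range covered by one output Document (assumed numbering). */
    record PageRange(int startPage, int endPage) { }

    /** Crop rectangle (x, y, width, height) in page units. */
    record Region(int x, int y, int width, int height) { }

    static final int ALL_PAGES = 0;

    /** Splits totalPages into ranges of pagesPerDocument pages; 0 means one range for everything. */
    static List<PageRange> group(int totalPages, int pagesPerDocument) {
        if (totalPages < 0 || pagesPerDocument < 0) {
            throw new IllegalArgumentException("arguments must not be negative");
        }
        List<PageRange> ranges = new ArrayList<>();
        if (totalPages == 0) {
            return ranges;
        }
        if (pagesPerDocument == ALL_PAGES) {
            ranges.add(new PageRange(1, totalPages));
            return ranges;
        }
        for (int start = 1; start <= totalPages; start += pagesPerDocument) {
            int end = Math.min(start + pagesPerDocument - 1, totalPages);
            ranges.add(new PageRange(start, end));
        }
        return ranges;
    }

    /** Same arithmetic as the listing: shift y by the top margin, shrink height by both margins. */
    static Region cropRegion(int lowerLeftX, int lowerLeftY, int pageWidth, int pageHeight,
            int topMargin, int bottomMargin) {
        int y = lowerLeftY + topMargin;
        int height = pageHeight - (topMargin + bottomMargin);
        return new Region(lowerLeftX, y, pageWidth, height);
    }
}
```

With 5 pages and `pagesPerDocument = 2`, `group` returns the ranges 1-2, 3-4, and 5-5; with `ALL_PAGES` it returns the single range 1-5.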

ParagraphPdfDocumentReader

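Parses a PDF into multiple Documents by paragraph, based on the PDF's outline (table of contents / structure information). Each Document usually corresponds to one paragraph.

  • PDDocument document: the PDF document to read
  • String resourceFileName: file name of the PDF resource
  • PdfDocumentReaderConfig config: configures the PDF reading behavior, including pages per Document and page margins
  • ParagraphManager paragraphTextExtractor: parses the PDF and extracts the paragraph information

| Method | Description |
|----------------------------|----------------------------|
| ParagraphPdfDocumentReader | Constructors; accept a resource URL or a Resource object, optionally with a PdfDocumentReaderConfig |
| get | Reads a PDF that has an outline and returns a list of Documents |
| toDocument | Wraps the text between two paragraphs, together with metadata, in a Document |
| addMetadata | Adds metadata (title, page numbers, level, file name) to a Document |
| getTextBetweenParagraphs | Extracts the text between two paragraphs |

java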
package org.springframework.ai.reader.pdf;

import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.pdf.config.ParagraphManager;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

public class ParagraphPdfDocumentReader implements DocumentReader {
    private static final String METADATA_START_PAGE = "page_number";
    private static final String METADATA_END_PAGE = "end_page_number";
    private static final String METADATA_TITLE = "title";
    private static final String METADATA_LEVEL = "level";
    private static final String METADATA_FILE_NAME = "file_name";
    protected final PDDocument document;
    private final Logger logger;
    private final ParagraphManager paragraphTextExtractor;
    protected String resourceFileName;
    private PdfDocumentReaderConfig config;

    public ParagraphPdfDocumentReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public ParagraphPdfDocumentReader(Resource pdfResource) {
        this(pdfResource, PdfDocumentReaderConfig.defaultConfig());
    }

    public ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), config);
    }

    public ParagraphPdfDocumentReader(Resource pdfResource, PdfDocumentReaderConfig config) {
        this.logger = LoggerFactory.getLogger(this.getClass());

        try {
            PDFParser pdfParser = new PDFParser(new RandomAccessReadBuffer(pdfResource.getInputStream()));
            this.document = pdfParser.parse();
            this.config = config;
            this.paragraphTextExtractor = new ParagraphManager(this.document);
            this.resourceFileName = pdfResource.getFilename();
        } catch (IllegalArgumentException iae) {
            throw iae;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> get() {
        List<ParagraphManager.Paragraph> paragraphs = this.paragraphTextExtractor.flatten();
        List<Document> documents = new ArrayList(paragraphs.size());
        if (!CollectionUtils.isEmpty(paragraphs)) {
            this.logger.info("Start processing paragraphs from PDF");
            Iterator<ParagraphManager.Paragraph> itr = paragraphs.iterator();
            ParagraphManager.Paragraph current = (ParagraphManager.Paragraph)itr.next();
            ParagraphManager.Paragraph next;
            if (!itr.hasNext()) {
                documents.add(this.toDocument(current, current));
            } else {
                for(; itr.hasNext(); current = next) {
                    next = (ParagraphManager.Paragraph)itr.next();
                    Document document = this.toDocument(current, next);
                    if (document != null && StringUtils.hasText(document.getText())) {
                        documents.add(this.toDocument(current, next));
                    }
                }
            }
        }

        this.logger.info("End processing paragraphs from PDF");
        return documents;
    }

    protected Document toDocument(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to) {
        String docText = this.getTextBetweenParagraphs(from, to);
        if (!StringUtils.hasText(docText)) {
            return null;
        } else {
            Document document = new Document(docText);
            this.addMetadata(from, to, document);
            return document;
        }
    }

    protected void addMetadata(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to, Document document) {
        document.getMetadata().put("title", from.title());
        document.getMetadata().put("page_number", from.startPageNumber());
        document.getMetadata().put("end_page_number", to.startPageNumber());
        document.getMetadata().put("level", from.level());
        document.getMetadata().put("file_name", this.resourceFileName);
    }

    public String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph) {
        int startPage = fromParagraph.startPageNumber() - 1;
        int endPage = toParagraph.startPageNumber() - 1;

        try {
            StringBuilder sb = new StringBuilder();
            PDFLayoutTextStripperByArea pdfTextStripper = new PDFLayoutTextStripperByArea();
            pdfTextStripper.setSortByPosition(true);

            for(int pageNumber = startPage; pageNumber <= endPage; ++pageNumber) {
                PDPage page = this.document.getPage(pageNumber);
                int fromPosition = fromParagraph.position();
                int toPosition = toParagraph.position();
                if (this.config.reversedParagraphPosition) {
                    fromPosition = (int)(page.getMediaBox().getHeight() - (float)fromPosition);
                    toPosition = (int)(page.getMediaBox().getHeight() - (float)toPosition);
                }

                int x0 = (int)page.getMediaBox().getLowerLeftX();
                int xW = (int)page.getMediaBox().getWidth();
                int y0 = (int)page.getMediaBox().getLowerLeftY();
                int yW = (int)page.getMediaBox().getHeight();
                if (pageNumber == startPage) {
                    y0 = fromPosition;
                    yW = (int)page.getMediaBox().getHeight() - fromPosition;
                }

                if (pageNumber == endPage) {
                    yW = toPosition - y0;
                }

                if (y0 + yW == (int)page.getMediaBox().getHeight()) {
                    yW -= this.config.pageBottomMargin;
                }

                if (y0 == 0) {
                    y0 += this.config.pageTopMargin;
                    yW -= this.config.pageTopMargin;
                }

                pdfTextStripper.addRegion("pdfPageRegion", new Rectangle(x0, y0, xW, yW));
                pdfTextStripper.extractRegions(page);
                String text = pdfTextStripper.getTextForRegion("pdfPageRegion");
                if (StringUtils.hasText(text)) {
                    sb.append(text);
                }

                pdfTextStripper.removeRegion("pdfPageRegion");
            }

            String text = sb.toString();
            if (StringUtils.hasText(text)) {
                text = this.config.pageExtractedTextFormatter.format(text, startPage);
            }

            return text;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
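
The way `get()` walks the flattened paragraphs can be isolated as a small pairing routine. This is a sketch under stated assumptions: `ParagraphPairingSketch` and `Span` are hypothetical names, and paragraph titles stand in for `ParagraphManager.Paragraph` objects. It reproduces only the iteration pattern of the listing: a single paragraph is paired with itself, otherwise each paragraph is paired with its successor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the (current, next) iteration in get().
final class ParagraphPairingSketch {

    /** One extraction window: text is taken from the start of 'from' up to the start of 'to'. */
    record Span(String from, String to) { }

    static List<Span> pair(List<String> titles) {
        List<Span> spans = new ArrayList<>();
        if (titles.isEmpty()) {
            return spans;
        }
        if (titles.size() == 1) {
            // Mirrors toDocument(current, current) for a single outline entry.
            spans.add(new Span(titles.get(0), titles.get(0)));
            return spans;
        }
        for (int i = 0; i + 1 < titles.size(); i++) {
            // Mirrors toDocument(current, next) for consecutive entries.
            spans.add(new Span(titles.get(i), titles.get(i + 1)));
        }
        return spans;
    }
}
```

As far as the listing shows, when there is more than one outline entry the final entry only ever serves as the end boundary of the previous window, so three entries produce two windows.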
ParagraphManager

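Manages the paragraph structure of a PDF document. It parses the PDF outline (TOC / bookmarks) into a paragraph tree and can flatten that tree into a list of paragraphs, which simplifies later text extraction and grouping.

  • Paragraph rootParagraph: root node of the paragraph tree; it holds the full paragraph hierarchy
  • PDDocument document: the PDFBox PDDocument being processed

| Method | Description |
|----------------------|----------------------|
| ParagraphManager | Constructor; takes a PDF document and parses its outline into a paragraph tree |
| flatten | Flattens the paragraph tree into a list of Paragraph objects for sequential traversal |
| getParagraphsByLevel | Returns the paragraphs at a given level, optionally including inter-level text |
| Paragraph | Static nested record holding paragraph metadata (title, level, start and end page numbers, position, children, and so on) |
| generateParagraphs | Core recursive method of ParagraphManager; walks the outline tree (TOC / bookmarks), converts each entry (PDOutlineItem) into a Paragraph, and builds the complete paragraph tree (chapter hierarchy) |

java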
package org.springframework.ai.reader.pdf.config;

import java.io.IOException;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageXYZDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
import org.springframework.util.Assert;
import org.springframework.util.CollectionUtils;

public class ParagraphManager {
    private final Paragraph rootParagraph;
    private final PDDocument document;

    public ParagraphManager(PDDocument document) {
        Assert.notNull(document, "PDDocument must not be null");
        Assert.notNull(document.getDocumentCatalog().getDocumentOutline(), "Document outline (e.g. TOC) is null. Make sure the PDF document has a table of contents (TOC). If not, consider the PagePdfDocumentReader or the TikaDocumentReader instead.");

        try {
            this.document = document;
            this.rootParagraph = this.generateParagraphs(new Paragraph((Paragraph)null, "root", -1, 1, this.document.getNumberOfPages(), 0), this.document.getDocumentCatalog().getDocumentOutline(), 0);
            this.printParagraph(this.rootParagraph, System.out);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Paragraph> flatten() {
        List<Paragraph> paragraphs = new ArrayList();

        for(Paragraph child : this.rootParagraph.children()) {
            this.flatten(child, paragraphs);
        }

        return paragraphs;
    }

    private void flatten(Paragraph current, List<Paragraph> paragraphs) {
        paragraphs.add(current);

        for(Paragraph child : current.children()) {
            this.flatten(child, paragraphs);
        }

    }

    private void printParagraph(Paragraph paragraph, PrintStream printStream) {
        printStream.println(paragraph);

        for(Paragraph childParagraph : paragraph.children()) {
            this.printParagraph(childParagraph, printStream);
        }

    }

    protected Paragraph generateParagraphs(Paragraph parentParagraph, PDOutlineNode bookmark, Integer level) throws IOException {
        for(PDOutlineItem current = bookmark.getFirstChild(); current != null; current = current.getNextSibling()) {
            int pageNumber = this.getPageNumber(current);
            int nextSiblingNumber = this.getPageNumber(current.getNextSibling());
            if (nextSiblingNumber < 0) {
                nextSiblingNumber = this.getPageNumber(current.getLastChild());
            }

            int paragraphPosition = current.getDestination() instanceof PDPageXYZDestination ? ((PDPageXYZDestination)current.getDestination()).getTop() : 0;
            Paragraph currentParagraph = new Paragraph(parentParagraph, current.getTitle(), level, pageNumber, nextSiblingNumber, paragraphPosition);
            parentParagraph.children().add(currentParagraph);
            this.generateParagraphs(currentParagraph, current, level + 1);
        }

        return parentParagraph;
    }

    private int getPageNumber(PDOutlineItem current) throws IOException {
        if (current == null) {
            return -1;
        } else {
            PDPage currentPage = current.findDestinationPage(this.document);
            PDPageTree pages = this.document.getDocumentCatalog().getPages();

            for(int i = 0; i < pages.getCount(); ++i) {
                PDPage page = pages.get(i);
                if (page.equals(currentPage)) {
                    return i + 1;
                }
            }

            return -1;
        }
    }

    public List<Paragraph> getParagraphsByLevel(Paragraph paragraph, int level, boolean interLevelText) {
        List<Paragraph> resultList = new ArrayList();
        if (paragraph.level() < level) {
            if (!CollectionUtils.isEmpty(paragraph.children())) {
                if (interLevelText) {
                    Paragraph interLevelParagraph = new Paragraph(paragraph.parent(), paragraph.title(), paragraph.level(), paragraph.startPageNumber(), ((Paragraph)paragraph.children().get(0)).startPageNumber(), paragraph.position());
                    resultList.add(interLevelParagraph);
                }

                for(Paragraph child : paragraph.children()) {
                    resultList.addAll(this.getParagraphsByLevel(child, level, interLevelText));
                }
            }
        } else if (paragraph.level() == level) {
            resultList.add(paragraph);
        }

        return resultList;
    }

    public static record Paragraph(Paragraph parent, String title, int level, int startPageNumber, int endPageNumber, int position, List<Paragraph> children) {
        public Paragraph(Paragraph parent, String title, int level, int startPageNumber, int endPageNumber, int position) {
            this(parent, title, level, startPageNumber, endPageNumber, position, new ArrayList());
        }

        public String toString() {
            String indent = this.level < 0 ? "" : (new String(new char[this.level * 2])).replace('\u0000', ' ');
            return indent + " " + this.level + ") " + this.title + " [" + this.startPageNumber + "," + this.endPageNumber + "], children = " + this.children.size() + ", pos = " + this.position;
        }
    }
}
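
The `flatten` method is a pre-order traversal that skips the synthetic root. A minimal sketch of that traversal follows; `OutlineFlattenSketch` and `Node` are hypothetical names, and `Node` keeps only the title, level, and children of the real `Paragraph` record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, reduced model of the paragraph tree and its flattening.
final class OutlineFlattenSketch {

    /** Simplified outline node; the real Paragraph record also carries pages and position. */
    record Node(String title, int level, List<Node> children) {
        Node(String title, int level) {
            this(title, level, new ArrayList<>());
        }
    }

    /** Returns the titles in pre-order, excluding the root itself. */
    static List<String> flattenTitles(Node root) {
        List<String> titles = new ArrayList<>();
        for (Node child : root.children()) {
            collect(child, titles);
        }
        return titles;
    }

    private static void collect(Node current, List<String> titles) {
        titles.add(current.title());          // visit the node first
        for (Node child : current.children()) {
            collect(child, titles);           // then its children, in order
        }
    }
}
```

Because parents are emitted before their children, the flattened list follows the reading order of the table of contents.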

TikaDocumentReader

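Extracts text from many document formats (PDF, DOC/DOCX, PPT/PPTX, HTML, and others) and wraps it in a Document object. It is built on the Apache Tika library, which supports a wide range of formats.

  • AutoDetectParser parser: parser that auto-detects the document type and extracts its text
  • ContentHandler handler: handler that manages content extraction
  • Metadata metadata: holds the metadata read from the document
  • ParseContext context: context carrying information about the parsing process
  • Resource resource: resource object pointing to the document
  • ExtractedTextFormatter textFormatter: formats the extracted text

| Method | Description |
|--------------------|---------------------------|
| TikaDocumentReader | Constructors; accept a resource URL or a Resource object, optionally with a ContentHandler and a text formatter |
| get | Reads any supported document format and returns a list of Documents (a single Document in the listing below) |

java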
package org.springframework.ai.reader.tika;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Objects;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.StringUtils;
import org.xml.sax.ContentHandler;

public class TikaDocumentReader implements DocumentReader {
    public static final String METADATA_SOURCE = "source";
    private final AutoDetectParser parser;
    private final ContentHandler handler;
    private final Metadata metadata;
    private final ParseContext context;
    private final Resource resource;
    private final ExtractedTextFormatter textFormatter;

    public TikaDocumentReader(String resourceUrl) {
        this(resourceUrl, ExtractedTextFormatter.defaults());
    }

    public TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), textFormatter);
    }

    public TikaDocumentReader(Resource resource) {
        this(resource, ExtractedTextFormatter.defaults());
    }

    public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
        this(resource, new BodyContentHandler(-1), textFormatter);
    }

    public TikaDocumentReader(Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter) {
        this.parser = new AutoDetectParser();
        this.handler = contentHandler;
        this.metadata = new Metadata();
        this.context = new ParseContext();
        this.resource = resource;
        this.textFormatter = textFormatter;
    }

    public List<Document> get() {
        try (InputStream stream = this.resource.getInputStream()) {
            this.parser.parse(stream, this.handler, this.metadata, this.context);
            return List.of(this.toDocument(this.handler.toString()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private Document toDocument(String docText) {
        docText = (String)Objects.requireNonNullElse(docText, "");
        docText = this.textFormatter.format(docText);
        Document doc = new Document(docText);
        doc.getMetadata().put("source", this.resourceName());
        return doc;
    }

    private String resourceName() {
        try {
            String resourceName = this.resource.getFilename();
            if (!StringUtils.hasText(resourceName)) {
                resourceName = this.resource.getURI().toString();
            }

            return resourceName;
        } catch (IOException e) {
            return String.format("Invalid source URI: %s", e.getMessage());
        }
    }
}

DocumentTransformer (interface for transforming document data)

java
package org.springframework.ai.document;

import java.util.List;
import java.util.function.Function;

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
    default List<Document> transform(List<Document> transform) {
        return (List)this.apply(transform);
    }
}
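
Because the interface extends `Function<List<Document>, List<Document>>`, transformers can be chained with the standard `andThen` method. The sketch below shows the idea with plain strings standing in for `Document`; `TransformerChainSketch`, `TextTransformer`, `TRIM`, and `DROP_BLANK` are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in: List<String> plays the role of List<Document>.
final class TransformerChainSketch {

    /** Same shape as DocumentTransformer: a Function with a convenience default method. */
    interface TextTransformer extends Function<List<String>, List<String>> {
        default List<String> transform(List<String> docs) {
            return apply(docs);
        }
    }

    /** Trims every entry. */
    static final TextTransformer TRIM = docs -> docs.stream().map(String::trim).toList();

    /** Drops entries that contain only whitespace. */
    static final TextTransformer DROP_BLANK = docs -> docs.stream().filter(s -> !s.isBlank()).toList();

    /** Runs both steps in order using Function.andThen. */
    static List<String> run(List<String> docs) {
        return TRIM.andThen(DROP_BLANK).apply(docs);
    }
}
```

Note that `andThen` returns a plain `Function`, so the composed pipeline is invoked with `apply` rather than `transform`.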

TextSplitter

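Splits long text Documents into multiple smaller text chunks. It provides the common framework for concrete splitting strategies (by length, by sentence, by paragraph, and so on).

  • boolean copyContentFormatter: whether each chunk inherits the ContentFormatter of the Document it was split from; default true

| Method | Description |
|-------------------------|-----------------------------|
| apply | Splits every Document in the input list and returns the resulting list of Documents |
| split | Splits a list of Documents or a single Document and returns the resulting list; delegates to apply |
| setCopyContentFormatter | Controls whether chunks inherit the content formatter |
| isCopyContentFormatter | Returns the current value of copyContentFormatter |

java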
package org.springframework.ai.transformer.splitter;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

public abstract class TextSplitter implements DocumentTransformer {
    private static final Logger logger = LoggerFactory.getLogger(TextSplitter.class);
    private boolean copyContentFormatter = true;

    public List<Document> apply(List<Document> documents) {
        return this.doSplitDocuments(documents);
    }

    public List<Document> split(List<Document> documents) {
        return this.apply(documents);
    }

    public List<Document> split(Document document) {
        return this.apply(List.of(document));
    }

    public boolean isCopyContentFormatter() {
        return this.copyContentFormatter;
    }

    public void setCopyContentFormatter(boolean copyContentFormatter) {
        this.copyContentFormatter = copyContentFormatter;
    }

    private List<Document> doSplitDocuments(List<Document> documents) {
        List<String> texts = new ArrayList();
        List<Map<String, Object>> metadataList = new ArrayList();
        List<ContentFormatter> formatters = new ArrayList();

        for(Document doc : documents) {
            texts.add(doc.getText());
            metadataList.add(doc.getMetadata());
            formatters.add(doc.getContentFormatter());
        }

        return this.createDocuments(texts, formatters, metadataList);
    }

    private List<Document> createDocuments(List<String> texts, List<ContentFormatter> formatters, List<Map<String, Object>> metadataList) {
        List<Document> documents = new ArrayList();

        for(int i = 0; i < texts.size(); ++i) {
            String text = (String)texts.get(i);
            Map<String, Object> metadata = (Map)metadataList.get(i);
            List<String> chunks = this.splitText(text);
            if (chunks.size() > 1) {
                logger.info("Splitting up document into " + chunks.size() + " chunks.");
            }

            for(String chunk : chunks) {
                Map<String, Object> metadataCopy = (Map)metadata.entrySet().stream().filter((e) -> e.getKey() != null && e.getValue() != null).collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
                Document newDoc = new Document(chunk, metadataCopy);
                if (this.copyContentFormatter) {
                    newDoc.setContentFormatter((ContentFormatter)formatters.get(i));
                }

                documents.add(newDoc);
            }
        }

        return documents;
    }

    protected abstract List<String> splitText(String text);
}
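
TextSplitter follows the template-method pattern: the base class handles iteration and metadata copying, and subclasses only implement `splitText`. The sketch below reduces that structure to plain Java. `SplitterTemplateSketch`, `Doc`, `Splitter`, and `LineSplitter` are hypothetical names, and content formatters are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reduction of TextSplitter: the base class owns the loop, subclasses own the cut.
final class SplitterTemplateSketch {

    /** Minimal stand-in for Document: text plus metadata. */
    record Doc(String text, Map<String, Object> metadata) { }

    abstract static class Splitter {

        /** Splits every document and gives each chunk its own filtered copy of the metadata. */
        List<Doc> apply(List<Doc> docs) {
            List<Doc> out = new ArrayList<>();
            for (Doc doc : docs) {
                for (String chunk : splitText(doc.text())) {
                    Map<String, Object> copy = new HashMap<>();
                    // Like the listing, entries with a null key or value are skipped.
                    doc.metadata().forEach((key, value) -> {
                        if (key != null && value != null) {
                            copy.put(key, value);
                        }
                    });
                    out.add(new Doc(chunk, copy));
                }
            }
            return out;
        }

        /** The only strategy-specific step. */
        protected abstract List<String> splitText(String text);
    }

    /** Example strategy: one chunk per non-blank line. */
    static final class LineSplitter extends Splitter {
        @Override
        protected List<String> splitText(String text) {
            List<String> lines = new ArrayList<>();
            for (String line : text.split("\n")) {
                if (!line.isBlank()) {
                    lines.add(line);
                }
            }
            return lines;
        }
    }
}
```

Giving every chunk its own metadata map means a later change to one chunk's metadata does not leak into its siblings.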
TokenTextSplitter

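Splits text into chunks of a given size measured in tokens. It is built on the jtokkit library and suits cases where text must be handled at token granularity, such as preparing LLM input.

  • int chunkSize: target number of tokens per chunk; default 800
  • int minChunkSizeChars: minimum number of characters per chunk; a chunk is cut back to its last punctuation mark only when that mark lies beyond this index; default 350
  • int minChunkLengthToEmbed: chunks whose length does not exceed this value are discarded; default 5
  • int maxNumChunks: maximum number of chunks generated from one text; default 10000
  • boolean keepSeparator: whether to keep separators such as line breaks; default true
  • EncodingRegistry registry: registry used to obtain the encoding
  • Encoding encoding: encoder used to encode text into tokens and decode tokens back into text

| Method | Description |
|-----------|--------------------------------------------|
| splitText | Implements TextSplitter.splitText; splits the text by token count and returns the list of chunk strings |
| doSplit | Core chunking logic; cuts the text by token length |

java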
package org.springframework.ai.transformer.splitter;

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.IntArrayList;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import org.springframework.util.Assert;

public class TokenTextSplitter extends TextSplitter {
    private static final int DEFAULT_CHUNK_SIZE = 800;
    private static final int MIN_CHUNK_SIZE_CHARS = 350;
    private static final int MIN_CHUNK_LENGTH_TO_EMBED = 5;
    private static final int MAX_NUM_CHUNKS = 10000;
    private static final boolean KEEP_SEPARATOR = true;
    private final EncodingRegistry registry;
    private final Encoding encoding;
    private final int chunkSize;
    private final int minChunkSizeChars;
    private final int minChunkLengthToEmbed;
    private final int maxNumChunks;
    private final boolean keepSeparator;

    public TokenTextSplitter() {
        this(800, 350, 5, 10000, true);
    }

    public TokenTextSplitter(boolean keepSeparator) {
        this(800, 350, 5, 10000, keepSeparator);
    }

    public TokenTextSplitter(int chunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator) {
        this.registry = Encodings.newLazyEncodingRegistry();
        this.encoding = this.registry.getEncoding(EncodingType.CL100K_BASE);
        this.chunkSize = chunkSize;
        this.minChunkSizeChars = minChunkSizeChars;
        this.minChunkLengthToEmbed = minChunkLengthToEmbed;
        this.maxNumChunks = maxNumChunks;
        this.keepSeparator = keepSeparator;
    }

    public static Builder builder() {
        return new Builder();
    }

    protected List<String> splitText(String text) {
        return this.doSplit(text, this.chunkSize);
    }

    protected List<String> doSplit(String text, int chunkSize) {
        if (text != null && !text.trim().isEmpty()) {
            List<Integer> tokens = this.getEncodedTokens(text);
            List<String> chunks = new ArrayList();
            int numchunks = 0;

            while(!tokens.isEmpty() && numchunks < this.maxNumChunks) {
                List<Integer> chunk = tokens.subList(0, Math.min(chunkSize, tokens.size()));
                String chunkText = this.decodeTokens(chunk);
                if (chunkText.trim().isEmpty()) {
                    tokens = tokens.subList(chunk.size(), tokens.size());
                } else {
                    int lastPunctuation = Math.max(chunkText.lastIndexOf(46), Math.max(chunkText.lastIndexOf(63), Math.max(chunkText.lastIndexOf(33), chunkText.lastIndexOf(10))));
                    if (lastPunctuation != -1 && lastPunctuation > this.minChunkSizeChars) {
                        chunkText = chunkText.substring(0, lastPunctuation + 1);
                    }

                    String chunkTextToAppend = this.keepSeparator ? chunkText.trim() : chunkText.replace(System.lineSeparator(), " ").trim();
                    if (chunkTextToAppend.length() > this.minChunkLengthToEmbed) {
                        chunks.add(chunkTextToAppend);
                    }

                    tokens = tokens.subList(this.getEncodedTokens(chunkText).size(), tokens.size());
                    ++numchunks;
                }
            }

            if (!tokens.isEmpty()) {
                String remainingtext = this.decodeTokens(tokens).replace(System.lineSeparator(), " ").trim();
                if (remainingtext.length() > this.minChunkLengthToEmbed) {
                    chunks.add(remainingtext);
                }
            }

            return chunks;
        } else {
            return new ArrayList();
        }
    }

    private List<Integer> getEncodedTokens(String text) {
        Assert.notNull(text, "Text must not be null");
        return this.encoding.encode(text).boxed();
    }

    private String decodeTokens(List<Integer> tokens) {
        Assert.notNull(tokens, "Tokens must not be null");
        IntArrayList tokensIntArray = new IntArrayList(tokens.size());
        Objects.requireNonNull(tokensIntArray);
        tokens.forEach(tokensIntArray::add);
        return this.encoding.decode(tokensIntArray);
    }

    public static final class Builder {
        private int chunkSize = 800;
        private int minChunkSizeChars = 350;
        private int minChunkLengthToEmbed = 5;
        private int maxNumChunks = 10000;
        private boolean keepSeparator = true;

        private Builder() {
        }

        public Builder withChunkSize(int chunkSize) {
            this.chunkSize = chunkSize;
            return this;
        }

        public Builder withMinChunkSizeChars(int minChunkSizeChars) {
            this.minChunkSizeChars = minChunkSizeChars;
            return this;
        }

        public Builder withMinChunkLengthToEmbed(int minChunkLengthToEmbed) {
            this.minChunkLengthToEmbed = minChunkLengthToEmbed;
            return this;
        }

        public Builder withMaxNumChunks(int maxNumChunks) {
            this.maxNumChunks = maxNumChunks;
            return this;
        }

        public Builder withKeepSeparator(boolean keepSeparator) {
            this.keepSeparator = keepSeparator;
            return this;
        }

        public TokenTextSplitter build() {
            return new TokenTextSplitter(this.chunkSize, this.minChunkSizeChars, this.minChunkLengthToEmbed, this.maxNumChunks, this.keepSeparator);
        }
    }
}
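
The control flow of `doSplit` is easier to follow without the tokenizer. The sketch below keeps the same steps but, as a deliberately crude assumption, treats every character as one token instead of using jtokkit; `TokenChunkingSketch` is a hypothetical name. The character codes 46, 63, 33, and 10 in the decompiled listing correspond to '.', '?', '!', and '\n'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical re-implementation of the doSplit loop with one "token" per character.
final class TokenChunkingSketch {

    static List<String> split(String text, int chunkSize, int minChunkSizeChars,
            int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        if (text == null || text.trim().isEmpty()) {
            return chunks;
        }
        int pos = 0;       // index of the first unconsumed character
        int numChunks = 0;
        while (pos < text.length() && numChunks < maxNumChunks) {
            int end = Math.min(pos + chunkSize, text.length());
            String chunkText = text.substring(pos, end);
            if (chunkText.trim().isEmpty()) {
                pos = end; // skip a whitespace-only window without counting it
                continue;
            }
            // Prefer to end the chunk at the last sentence boundary, if it is far enough in.
            int lastPunctuation = Math.max(
                    Math.max(chunkText.lastIndexOf('.'), chunkText.lastIndexOf('?')),
                    Math.max(chunkText.lastIndexOf('!'), chunkText.lastIndexOf('\n')));
            if (lastPunctuation != -1 && lastPunctuation > minChunkSizeChars) {
                chunkText = chunkText.substring(0, lastPunctuation + 1);
            }
            String toAppend = keepSeparator
                    ? chunkText.trim()
                    : chunkText.replace(System.lineSeparator(), " ").trim();
            if (toAppend.length() > minChunkLengthToEmbed) {
                chunks.add(toAppend);
            }
            pos += chunkText.length(); // consume exactly what was emitted or discarded
            numChunks++;
        }
        // Whatever is left after maxNumChunks iterations becomes one final chunk.
        if (pos < text.length()) {
            String remaining = text.substring(pos).replace(System.lineSeparator(), " ").trim();
            if (remaining.length() > minChunkLengthToEmbed) {
                chunks.add(remaining);
            }
        }
        return chunks;
    }
}
```

For example, `split("Hello world. This is a test! More text here", 20, 5, 3, 100, true)` yields the three chunks "Hello world.", "This is a test!", and "More text here": the first two windows are cut back to their last punctuation mark, and the text after the cut is carried into the next window.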

ContentFormatTransformer

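Applies a content formatter to every Document in a list.

  • boolean disableTemplateRewrite: whether to disable the content formatter's template rewriting
  • ContentFormatter contentFormatter: the formatter instance used to format document content

java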
package org.springframework.ai.transformer;

import java.util.ArrayList;
import java.util.List;
import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

public class ContentFormatTransformer implements DocumentTransformer {
    private final boolean disableTemplateRewrite;
    private final ContentFormatter contentFormatter;

    public ContentFormatTransformer(ContentFormatter contentFormatter) {
        this(contentFormatter, false);
    }

    public ContentFormatTransformer(ContentFormatter contentFormatter, boolean disableTemplateRewrite) {
        this.contentFormatter = contentFormatter;
        this.disableTemplateRewrite = disableTemplateRewrite;
    }

    public List<Document> apply(List<Document> documents) {
        if (this.contentFormatter != null) {
            documents.forEach(this::processDocument);
        }

        return documents;
    }

    private void processDocument(Document document) {
        ContentFormatter var4 = document.getContentFormatter();
        if (var4 instanceof DefaultContentFormatter docFormatter) {
            var4 = this.contentFormatter;
            if (var4 instanceof DefaultContentFormatter toUpdateFormatter) {
                this.updateFormatter(document, docFormatter, toUpdateFormatter);
                return;
            }
        }

        this.overrideFormatter(document);
    }

    private void updateFormatter(Document document, DefaultContentFormatter docFormatter, DefaultContentFormatter toUpdateFormatter) {
        List<String> updatedEmbedExcludeKeys = new ArrayList(docFormatter.getExcludedEmbedMetadataKeys());
        updatedEmbedExcludeKeys.addAll(toUpdateFormatter.getExcludedEmbedMetadataKeys());
        List<String> updatedInterfaceExcludeKeys = new ArrayList(docFormatter.getExcludedInferenceMetadataKeys());
        updatedInterfaceExcludeKeys.addAll(toUpdateFormatter.getExcludedInferenceMetadataKeys());
        DefaultContentFormatter.Builder builder = DefaultContentFormatter.builder().withExcludedEmbedMetadataKeys(updatedEmbedExcludeKeys).withExcludedInferenceMetadataKeys(updatedInterfaceExcludeKeys).withMetadataTemplate(docFormatter.getMetadataTemplate()).withMetadataSeparator(docFormatter.getMetadataSeparator());
        if (!this.disableTemplateRewrite) {
            builder.withTextTemplate(docFormatter.getTextTemplate());
        }

        document.setContentFormatter(builder.build());
    }

    private void overrideFormatter(Document document) {
        document.setContentFormatter(this.contentFormatter);
    }
}
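
The merge performed by updateFormatter can be sketched in plain Java without any Spring AI dependency. The record `FormatterSketch` and its `merge` method are hypothetical stand-ins, not library API; they only mirror the rule visible in the listing above: the exclusion lists of both formatters are concatenated, and the text template is copied from the document's own formatter unless template rewriting is disabled, in which case the default template is kept.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DefaultContentFormatter, reduced to the fields the merge touches.
record FormatterSketch(String textTemplate, List<String> excludedEmbedKeys, List<String> excludedInferenceKeys) {

    // Same value as the default text template of DefaultContentFormatter.
    static final String DEFAULT_TEXT_TEMPLATE = "{metadata_string}\n\n{content}";

    // Mirrors updateFormatter: document keys first, then the transformer's keys.
    static FormatterSketch merge(FormatterSketch docFormatter, FormatterSketch toUpdate,
            boolean disableTemplateRewrite) {
        List<String> embedKeys = new ArrayList<>(docFormatter.excludedEmbedKeys());
        embedKeys.addAll(toUpdate.excludedEmbedKeys());
        List<String> inferenceKeys = new ArrayList<>(docFormatter.excludedInferenceKeys());
        inferenceKeys.addAll(toUpdate.excludedInferenceKeys());
        // The transformer's own text template is never consulted in this branch.
        String template = disableTemplateRewrite ? DEFAULT_TEXT_TEMPLATE : docFormatter.textTemplate();
        return new FormatterSketch(template, embedKeys, inferenceKeys);
    }

    public static void main(String[] args) {
        FormatterSketch onDocument = new FormatterSketch("{content}", List.of("page"), List.of());
        FormatterSketch fromTransformer = new FormatterSketch(DEFAULT_TEXT_TEMPLATE, List.of("source"), List.of("source"));
        System.out.println(merge(onDocument, fromTransformer, false));
    }
}
```

Note that the lists are appended rather than de-duplicated, so a key excluded by both formatters appears twice in the merged list; this is harmless for filtering.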

ContentFormatter (formatting interface)

java
public interface ContentFormatter {

    String format(Document document, MetadataMode mode);

}
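
DefaultContentFormatter

Formats the content and metadata of a Document, using templates and configuration to control how the document is rendered.

  • String metadataTemplate: template for a single metadata entry, containing the {key} and {value} placeholders
  • String metadataSeparator: separator placed between metadata entries
  • String textTemplate: template for the document text, containing the {content} and {metadata_string} placeholders
  • List<String> excludedInferenceMetadataKeys: metadata keys excluded in inference mode
  • List<String> excludedEmbedMetadataKeys: metadata keys excluded in embed mode
java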
package org.springframework.ai.document;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.springframework.util.Assert;

public final class DefaultContentFormatter implements ContentFormatter {
    private static final String TEMPLATE_CONTENT_PLACEHOLDER = "{content}";
    private static final String TEMPLATE_METADATA_STRING_PLACEHOLDER = "{metadata_string}";
    private static final String TEMPLATE_VALUE_PLACEHOLDER = "{value}";
    private static final String TEMPLATE_KEY_PLACEHOLDER = "{key}";
    private static final String DEFAULT_METADATA_TEMPLATE = String.format("%s: %s", "{key}", "{value}");
    private static final String DEFAULT_METADATA_SEPARATOR = System.lineSeparator();
    private static final String DEFAULT_TEXT_TEMPLATE = String.format("%s\n\n%s", "{metadata_string}", "{content}");
    private final String metadataTemplate;
    private final String metadataSeparator;
    private final String textTemplate;
    private final List<String> excludedInferenceMetadataKeys;
    private final List<String> excludedEmbedMetadataKeys;

    private DefaultContentFormatter(Builder builder) {
        this.metadataTemplate = builder.metadataTemplate;
        this.metadataSeparator = builder.metadataSeparator;
        this.textTemplate = builder.textTemplate;
        this.excludedInferenceMetadataKeys = builder.excludedInferenceMetadataKeys;
        this.excludedEmbedMetadataKeys = builder.excludedEmbedMetadataKeys;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static DefaultContentFormatter defaultConfig() {
        return builder().build();
    }

    public String format(Document document, MetadataMode metadataMode) {
        Map<String, Object> metadata = this.metadataFilter(document.getMetadata(), metadataMode);
        String metadataText = (String)metadata.entrySet().stream().map((metadataEntry) -> this.metadataTemplate.replace("{key}", (CharSequence)metadataEntry.getKey()).replace("{value}", metadataEntry.getValue().toString())).collect(Collectors.joining(this.metadataSeparator));
        return this.textTemplate.replace("{metadata_string}", metadataText).replace("{content}", document.getText());
    }

    protected Map<String, Object> metadataFilter(Map<String, Object> metadata, MetadataMode metadataMode) {
        if (metadataMode == MetadataMode.ALL) {
            return new HashMap(metadata);
        } else if (metadataMode == MetadataMode.NONE) {
            return new HashMap(Collections.emptyMap());
        } else {
            Set<String> usableMetadataKeys = new HashSet(metadata.keySet());
            if (metadataMode == MetadataMode.INFERENCE) {
                usableMetadataKeys.removeAll(this.excludedInferenceMetadataKeys);
            } else if (metadataMode == MetadataMode.EMBED) {
                usableMetadataKeys.removeAll(this.excludedEmbedMetadataKeys);
            }

            return new HashMap((Map)metadata.entrySet().stream().filter((e) -> usableMetadataKeys.contains(e.getKey())).collect(Collectors.toMap((e) -> (String)e.getKey(), (e) -> e.getValue())));
        }
    }

    public String getMetadataTemplate() {
        return this.metadataTemplate;
    }

    public String getMetadataSeparator() {
        return this.metadataSeparator;
    }

    public String getTextTemplate() {
        return this.textTemplate;
    }

    public List<String> getExcludedInferenceMetadataKeys() {
        return Collections.unmodifiableList(this.excludedInferenceMetadataKeys);
    }

    public List<String> getExcludedEmbedMetadataKeys() {
        return Collections.unmodifiableList(this.excludedEmbedMetadataKeys);
    }

    public static final class Builder {
        private String metadataTemplate;
        private String metadataSeparator;
        private String textTemplate;
        private List<String> excludedInferenceMetadataKeys;
        private List<String> excludedEmbedMetadataKeys;

        private Builder() {
            this.metadataTemplate = DefaultContentFormatter.DEFAULT_METADATA_TEMPLATE;
            this.metadataSeparator = DefaultContentFormatter.DEFAULT_METADATA_SEPARATOR;
            this.textTemplate = DefaultContentFormatter.DEFAULT_TEXT_TEMPLATE;
            this.excludedInferenceMetadataKeys = new ArrayList();
            this.excludedEmbedMetadataKeys = new ArrayList();
        }

        public Builder from(DefaultContentFormatter fromFormatter) {
            this.withExcludedEmbedMetadataKeys(fromFormatter.getExcludedEmbedMetadataKeys()).withExcludedInferenceMetadataKeys(fromFormatter.getExcludedInferenceMetadataKeys()).withMetadataSeparator(fromFormatter.getMetadataSeparator()).withMetadataTemplate(fromFormatter.getMetadataTemplate()).withTextTemplate(fromFormatter.getTextTemplate());
            return this;
        }

        public Builder withMetadataTemplate(String metadataTemplate) {
            Assert.hasText(metadataTemplate, "Metadata Template must not be empty");
            this.metadataTemplate = metadataTemplate;
            return this;
        }

        public Builder withMetadataSeparator(String metadataSeparator) {
            Assert.notNull(metadataSeparator, "Metadata separator must not be empty");
            this.metadataSeparator = metadataSeparator;
            return this;
        }

        public Builder withTextTemplate(String textTemplate) {
            Assert.hasText(textTemplate, "Document's text template must not be empty");
            this.textTemplate = textTemplate;
            return this;
        }

        public Builder withExcludedInferenceMetadataKeys(List<String> excludedInferenceMetadataKeys) {
            Assert.notNull(excludedInferenceMetadataKeys, "Excluded inference metadata keys must not be null");
            this.excludedInferenceMetadataKeys = excludedInferenceMetadataKeys;
            return this;
        }

        public Builder withExcludedInferenceMetadataKeys(String... keys) {
            Assert.notNull(keys, "Excluded inference metadata keys must not be null");
            this.excludedInferenceMetadataKeys.addAll(Arrays.asList(keys));
            return this;
        }

        public Builder withExcludedEmbedMetadataKeys(List<String> excludedEmbedMetadataKeys) {
            Assert.notNull(excludedEmbedMetadataKeys, "Excluded Embed metadata keys must not be null");
            this.excludedEmbedMetadataKeys = excludedEmbedMetadataKeys;
            return this;
        }

        public Builder withExcludedEmbedMetadataKeys(String... keys) {
            Assert.notNull(keys, "Excluded Embed metadata keys must not be null");
            this.excludedEmbedMetadataKeys.addAll(Arrays.asList(keys));
            return this;
        }

        public DefaultContentFormatter build() {
            return new DefaultContentFormatter(this);
        }
    }
}
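
To see what format and metadataFilter produce together, the following is a simplified stand-alone sketch rather than the library class. The names `FormatSketch`, `Mode`, `filter`, and `format` are hypothetical. Two deliberate simplifications are assumed: a LinkedHashMap keeps the metadata in insertion order (the real class uses HashMap, so entry order is not guaranteed), and the separator is fixed to "\n" instead of System.lineSeparator().

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class FormatSketch {

    // Hypothetical mirror of MetadataMode.
    enum Mode { ALL, EMBED, INFERENCE, NONE }

    // Drops metadata entries according to the mode, following the branches of metadataFilter.
    static Map<String, Object> filter(Map<String, Object> metadata, Mode mode,
            List<String> excludedEmbed, List<String> excludedInference) {
        Map<String, Object> kept = new LinkedHashMap<>();
        if (mode == Mode.NONE) {
            return kept; // NONE keeps nothing
        }
        for (Map.Entry<String, Object> entry : metadata.entrySet()) {
            boolean excluded = (mode == Mode.EMBED && excludedEmbed.contains(entry.getKey()))
                    || (mode == Mode.INFERENCE && excludedInference.contains(entry.getKey()));
            if (!excluded) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    // Renders each entry with the default metadata template, then fills the default text template.
    static String format(String text, Map<String, Object> metadata) {
        String metadataText = metadata.entrySet().stream()
                .map(e -> "{key}: {value}".replace("{key}", e.getKey()).replace("{value}", e.getValue().toString()))
                .collect(Collectors.joining("\n"));
        return "{metadata_string}\n\n{content}".replace("{metadata_string}", metadataText).replace("{content}", text);
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("source", "a.txt");
        metadata.put("page", 1);
        System.out.println(format("Hello", filter(metadata, Mode.EMBED, List.of("page"), List.of("source"))));
    }
}
```

With the default text template, even MetadataMode.NONE leaves the two leading line breaks in front of the content, because only the {metadata_string} placeholder is emptied.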

KeywordMetadataEnricher

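Extracts keywords from each document and adds them to the document as metadata. The keywords are generated by calling a ChatModel and stored in the document's metadata.

  • ChatModel chatModel: interacts with the LLM to generate keywords
  • int keywordCount: the number of keywords to extract
java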
package org.springframework.ai.model.transformer;

import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.util.Assert;

public class KeywordMetadataEnricher implements DocumentTransformer {
    public static final String CONTEXT_STR_PLACEHOLDER = "context_str";
    public static final String KEYWORDS_TEMPLATE = "{context_str}. Give %s unique keywords for this\ndocument. Format as comma separated. Keywords:";
    private static final String EXCERPT_KEYWORDS_METADATA_KEY = "excerpt_keywords";
    private final ChatModel chatModel;
    private final int keywordCount;

    public KeywordMetadataEnricher(ChatModel chatModel, int keywordCount) {
        Assert.notNull(chatModel, "ChatModel must not be null");
        Assert.isTrue(keywordCount >= 1, "Document count must be >= 1");
        this.chatModel = chatModel;
        this.keywordCount = keywordCount;
    }

    public List<Document> apply(List<Document> documents) {
        for(Document document : documents) {
            PromptTemplate template = new PromptTemplate(String.format("{context_str}. Give %s unique keywords for this\ndocument. Format as comma separated. Keywords:", this.keywordCount));
            Prompt prompt = template.create(Map.of("context_str", document.getText()));
            String keywords = this.chatModel.call(prompt).getResult().getOutput().getText();
            document.getMetadata().putAll(Map.of("excerpt_keywords", keywords));
        }

        return documents;
    }
}
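
The two-step prompt construction in apply (first the keyword count via String.format, then the document text via the {context_str} placeholder) can be sketched without a real model. Everything here is an assumption made for illustration: `KeywordEnricherSketch`, `buildPrompt`, and `enrich` are hypothetical names, the ChatModel is replaced by a plain `Function<String, String>`, and the placeholder is filled with String.replace rather than PromptTemplate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class KeywordEnricherSketch {

    static final String KEYWORDS_TEMPLATE =
            "{context_str}. Give %s unique keywords for this\ndocument. Format as comma separated. Keywords:";

    // Step 1 fills %s with the keyword count; step 2 fills {context_str} with the document text.
    // Formatting first means a '%' inside the document text cannot break String.format.
    static String buildPrompt(String documentText, int keywordCount) {
        return String.format(KEYWORDS_TEMPLATE, keywordCount).replace("{context_str}", documentText);
    }

    // Returns a copy of the metadata with the model's answer stored under "excerpt_keywords".
    static Map<String, Object> enrich(String documentText, Map<String, Object> metadata, int keywordCount,
            Function<String, String> model) {
        Map<String, Object> enriched = new HashMap<>(metadata);
        enriched.put("excerpt_keywords", model.apply(buildPrompt(documentText, keywordCount)));
        return enriched;
    }

    public static void main(String[] args) {
        // Stub model: a real ChatModel call would go here.
        Function<String, String> stubModel = prompt -> "spring, etl, rag";
        System.out.println(enrich("Spring AI ETL pipeline", Map.of("source", "a.txt"), 3, stubModel));
    }
}
```

The model's reply is stored as one raw string, so any splitting on commas is left to the consumer of the metadata.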

SummaryMetadataEnricher

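Generates a summary for each document and adds it to the document as metadata. It can attach the summary of the current document, the previous document, and the next document, each stored under its own metadata key.

  • ChatModel chatModel: interacts with the LLM to generate summaries
  • List<SummaryType> summaryTypes: the summary types to attach (previous, current, next)
  • MetadataMode metadataMode: the metadata mode, which controls how document content is formatted
    • ALL: includes all metadata (such as author, page number, title) when formatting content; suits scenarios that need rich context
    • EMBED: includes only metadata relevant to vector embedding, that is, everything except the keys in excludedEmbedMetadataKeys; typically used when producing embeddings for a vector store, to reduce noise from irrelevant fields
    • INFERENCE: includes only metadata relevant to inference, that is, everything except the keys in excludedInferenceMetadataKeys; suits inference and question-answering scenarios
    • NONE: outputs the plain text only, without any metadata; suits scenarios that care only about the body text
  • String summaryTemplate: the template used to generate the summary

java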
package org.springframework.ai.model.transformer;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.document.MetadataMode;
import org.springframework.util.Assert;
import org.springframework.util.CollectionUtils;

public class SummaryMetadataEnricher implements DocumentTransformer {
    public static final String DEFAULT_SUMMARY_EXTRACT_TEMPLATE = "Here is the content of the section:\n{context_str}\n\nSummarize the key topics and entities of the section.\n\nSummary:";
    private static final String SECTION_SUMMARY_METADATA_KEY = "section_summary";
    private static final String NEXT_SECTION_SUMMARY_METADATA_KEY = "next_section_summary";
    private static final String PREV_SECTION_SUMMARY_METADATA_KEY = "prev_section_summary";
    private static final String CONTEXT_STR_PLACEHOLDER = "context_str";
    private final ChatModel chatModel;
    private final List<SummaryType> summaryTypes;
    private final MetadataMode metadataMode;
    private final String summaryTemplate;

    public SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes) {
        this(chatModel, summaryTypes, "Here is the content of the section:\n{context_str}\n\nSummarize the key topics and entities of the section.\n\nSummary:", MetadataMode.ALL);
    }

    public SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes, String summaryTemplate, MetadataMode metadataMode) {
        Assert.notNull(chatModel, "ChatModel must not be null");
        Assert.hasText(summaryTemplate, "Summary template must not be empty");
        this.chatModel = chatModel;
        this.summaryTypes = CollectionUtils.isEmpty(summaryTypes) ? List.of(SummaryMetadataEnricher.SummaryType.CURRENT) : summaryTypes;
        this.metadataMode = metadataMode;
        this.summaryTemplate = summaryTemplate;
    }

    public List<Document> apply(List<Document> documents) {
        List<String> documentSummaries = new ArrayList();

        for(Document document : documents) {
            String documentContext = document.getFormattedContent(this.metadataMode);
            Prompt prompt = (new PromptTemplate(this.summaryTemplate)).create(Map.of("context_str", documentContext));
            documentSummaries.add(this.chatModel.call(prompt).getResult().getOutput().getText());
        }

        for(int i = 0; i < documentSummaries.size(); ++i) {
            Map<String, Object> summaryMetadata = this.getSummaryMetadata(i, documentSummaries);
            ((Document)documents.get(i)).getMetadata().putAll(summaryMetadata);
        }

        return documents;
    }

    private Map<String, Object> getSummaryMetadata(int i, List<String> documentSummaries) {
        Map<String, Object> summaryMetadata = new HashMap();
        if (i > 0 && this.summaryTypes.contains(SummaryMetadataEnricher.SummaryType.PREVIOUS)) {
            summaryMetadata.put("prev_section_summary", documentSummaries.get(i - 1));
        }

        if (i < documentSummaries.size() - 1 && this.summaryTypes.contains(SummaryMetadataEnricher.SummaryType.NEXT)) {
            summaryMetadata.put("next_section_summary", documentSummaries.get(i + 1));
        }

        if (this.summaryTypes.contains(SummaryMetadataEnricher.SummaryType.CURRENT)) {
            summaryMetadata.put("section_summary", documentSummaries.get(i));
        }
        }

        return summaryMetadata;
    }

    public static enum SummaryType {
        PREVIOUS,
        CURRENT,
        NEXT;
    }
}
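
The second loop of apply, which distributes already generated summaries to neighbouring documents, is easy to get wrong at the list boundaries. The following stand-alone sketch reproduces only that indexing logic; `SummaryNeighborsSketch` and `assign` are hypothetical names, the summaries are passed in as plain strings instead of being produced by a ChatModel, and a LinkedHashMap is used only to keep the printed order stable.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SummaryNeighborsSketch {

    // Hypothetical mirror of SummaryMetadataEnricher.SummaryType.
    enum SummaryType { PREVIOUS, CURRENT, NEXT }

    // For document i, attach summary i-1, i+1 and i depending on the requested types.
    static List<Map<String, String>> assign(List<String> summaries, Set<SummaryType> types) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < summaries.size(); i++) {
            Map<String, String> metadata = new LinkedHashMap<>();
            if (i > 0 && types.contains(SummaryType.PREVIOUS)) {
                metadata.put("prev_section_summary", summaries.get(i - 1)); // the first document has no predecessor
            }
            if (i < summaries.size() - 1 && types.contains(SummaryType.NEXT)) {
                metadata.put("next_section_summary", summaries.get(i + 1)); // the last document has no successor
            }
            if (types.contains(SummaryType.CURRENT)) {
                metadata.put("section_summary", summaries.get(i));
            }
            result.add(metadata);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> summaries = List.of("intro", "setup", "usage");
        assign(summaries, EnumSet.allOf(SummaryType.class)).forEach(System.out::println);
    }
}
```

Because every document is summarised once in the first loop and the results are reused, requesting PREVIOUS and NEXT adds no extra model calls.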

DocumentWriter (document-writing interface)

java
package org.springframework.ai.document;

import java.util.List;
import java.util.function.Consumer;

public interface DocumentWriter extends Consumer<List<Document>> {
    default void write(List<Document> documents) {
        this.accept(documents);
    }
}
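
Because the reader is a Supplier and the writer is a Consumer, an ETL run reduces to ordinary function composition. The sketch below shows that shape with plain strings in place of Document objects; `EtlShapeSketch`, `LineWriter`, and `run` are hypothetical names, and the transformer is assumed here to be a `Function<List<String>, List<String>>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

final class EtlShapeSketch {

    // Same shape as DocumentWriter: a Consumer with a readable default alias.
    interface LineWriter extends Consumer<List<String>> {
        default void write(List<String> docs) {
            accept(docs);
        }
    }

    // Extract -> transform -> load, expressed with the three functional interfaces.
    static List<String> run() {
        Supplier<List<String>> reader = () -> List.of("alpha", "beta");
        Function<List<String>, List<String>> transformer =
                docs -> docs.stream().map(String::toUpperCase).toList();
        List<String> sink = new ArrayList<>();
        LineWriter writer = sink::addAll; // in-memory "store"
        writer.write(transformer.apply(reader.get()));
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```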

FileDocumentWriter

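Writes the content of a list of Document objects to a specified file, with support for appending, document markers, and metadata formatting.

  • String fileName: the name of the file to write to
  • boolean withDocumentMarkers: whether to include document markers (document index and page numbers) in the file
  • MetadataMode metadataMode: the metadata mode, which controls how document content is formatted
  • boolean append: whether to append to the end of the file instead of overwriting it

| Method | Description |
|--------------------|-------------------------|
| FileDocumentWriter | Constructs a writer from the file name, document-marker flag, metadata mode, and append flag |
| accept | Writes document content to the file, with optional document markers and metadata formatting |

java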
package org.springframework.ai.writer;

import java.io.FileWriter;
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentWriter;
import org.springframework.ai.document.MetadataMode;
import org.springframework.util.Assert;

public class FileDocumentWriter implements DocumentWriter {
    public static final String METADATA_START_PAGE_NUMBER = "page_number";
    public static final String METADATA_END_PAGE_NUMBER = "end_page_number";
    private final String fileName;
    private final boolean withDocumentMarkers;
    private final MetadataMode metadataMode;
    private final boolean append;

    public FileDocumentWriter(String fileName) {
        this(fileName, false, MetadataMode.NONE, false);
    }

    public FileDocumentWriter(String fileName, boolean withDocumentMarkers) {
        this(fileName, withDocumentMarkers, MetadataMode.NONE, false);
    }

    public FileDocumentWriter(String fileName, boolean withDocumentMarkers, MetadataMode metadataMode, boolean append) {
        Assert.hasText(fileName, "File name must have a text.");
        Assert.notNull(metadataMode, "MetadataMode must not be null.");
        this.fileName = fileName;
        this.withDocumentMarkers = withDocumentMarkers;
        this.metadataMode = metadataMode;
        this.append = append;
    }

    public void accept(List<Document> docs) {
        try {
            try (FileWriter writer = new FileWriter(this.fileName, this.append)) {
                int index = 0;

                for(Document doc : docs) {
                    if (this.withDocumentMarkers) {
                        writer.write(String.format("%n### Doc: %s, pages:[%s,%s]\n", index, doc.getMetadata().get("page_number"), doc.getMetadata().get("end_page_number")));
                    }

                    writer.write(doc.getFormattedContent(this.metadataMode));
                    ++index;
                }
            }

        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
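
The effect of the withDocumentMarkers and append flags can be checked with a small stand-alone sketch that writes to a temporary file. It is a simplification, not the library class: `FileWriterSketch`, `write`, and `demo` are hypothetical names, documents are plain strings, the marker omits the page numbers, and "\n" replaces the platform-dependent %n.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class FileWriterSketch {

    // Writes each text in order; with markers on, a "### Doc: <index>" line precedes it.
    static void write(Path file, List<String> texts, boolean withMarkers, boolean append) {
        try (FileWriter writer = new FileWriter(file.toFile(), append)) {
            int index = 0;
            for (String text : texts) {
                if (withMarkers) {
                    writer.write("\n### Doc: " + index + "\n");
                }
                writer.write(text); // no separator is added after the content
                index++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Overwrites with two marked documents, then appends a third one without a marker.
    static String demo() {
        try {
            Path file = Files.createTempFile("docs", ".txt");
            write(file, List.of("first", "second"), true, false);
            write(file, List.of("third"), false, true);
            String content = Files.readString(file);
            Files.delete(file);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

As in the listing above, nothing is written between consecutive documents when markers are off, so document texts run together unless they already end with a line break.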

VectorStore

VectorStore extends the DocumentWriter interface; for details, see "Vector Databases".
