SpringAI (GA): ETL Source Code Walkthrough for RAG

Tutorial Notes

Note: this tutorial is based on the official GA release of May 20, 2025, and provides the following:

  1. Quick-start tutorials for the core feature modules
  2. Source-level walkthroughs of the core feature modules
  3. Quick-start tutorial + source-level walkthrough of the Spring AI Alibaba enhancements

Versions: JDK 21 + Spring Boot 3.4.5 + Spring AI 1.0.0 + Spring AI Alibaba 1.0.0.2

The chapters below will be published one after another. This article belongs to Chapter 6 (improving answer quality with RAG) and walks through the ETL pipeline source code.

The sample code is open source: github.com/GTyingzi/sp...

Earlier walkthroughs published as WeChat posts:

Chapter 1

SpringAI (GA) Chat: Quick Start + Auto-Configuration Source Code Walkthrough

SpringAI (GA): ChatClient Call Chain Walkthrough

Chapter 2

Spring AI Advisors: Quick Start + Source Code Walkthrough

SpringAI (GA): Quick Start for SQLite, MySQL, and Redis Message Storage

Chapter 3

SpringAI (GA): Tool Integration - Quick Start

Chapter 5

SpringAI (GA): In-Memory, Redis, and ES Vector Stores - Quick Start

SpringAI (GA): Vector Store Theory and Source Code + Redis/ES Integration Source Code

Chapter 6

SpringAI (GA): RAG Quick Start + Modular Walkthrough

SpringAI (GA): ETL Quick Start for RAG

For a better reading experience, you can purchase access to the latest Spring AI tutorial on Feishu Docs, currently ¥49.9; the price will rise gradually as the content is completed.

Note: the Feishu Docs with the M6 quick-start tutorials + source code walkthroughs are already available for free.

ETL Pipeline Source Code Analysis

DocumentReader (interface for reading document data)

java
package org.springframework.ai.document;

import java.util.List;
import java.util.function.Supplier;

public interface DocumentReader extends Supplier<List<Document>> {
    default List<Document> read() {
        return (List)this.get();
    }
}

TextReader

Reads text content from a resource and converts it into Document objects.

  • Resource resource: the resource to read
  • Map<String, Object> customMetadata: metadata attached to the produced Document objects
  • Charset charset: character set used when reading the text; defaults to UTF-8

Methods

| Method | Description |
|---|---|
| TextReader | Constructs the reader from a resource URL or a Resource object |
| setCharset | Sets the charset used when reading text; defaults to UTF-8 |
| getCharset | Returns the charset currently in use |
| getCustomMetadata | Returns the custom metadata map |
| get | Reads the text and returns a list of Documents |
| getResourceIdentifier | Returns a unique identifier for the resource (filename, URI, URL, or description) |

java
package org.springframework.ai.reader;

import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.StreamUtils;

public class TextReader implements DocumentReader {
    public static final String CHARSET_METADATA = "charset";
    public static final String SOURCE_METADATA = "source";
    private final Resource resource;
    private final Map<String, Object> customMetadata;
    private Charset charset;

    public TextReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public TextReader(Resource resource) {
        this.customMetadata = new HashMap();
        this.charset = StandardCharsets.UTF_8;
        Objects.requireNonNull(resource, "The Spring Resource must not be null");
        this.resource = resource;
    }

    public Charset getCharset() {
        return this.charset;
    }

    public void setCharset(Charset charset) {
        Objects.requireNonNull(charset, "The charset must not be null");
        this.charset = charset;
    }

    public Map<String, Object> getCustomMetadata() {
        return this.customMetadata;
    }

    public List<Document> get() {
        try {
            String document = StreamUtils.copyToString(this.resource.getInputStream(), this.charset);
            this.customMetadata.put("charset", this.charset.name());
            this.customMetadata.put("source", this.resource.getFilename());
            this.customMetadata.put("source", this.getResourceIdentifier(this.resource));
            return List.of(new Document(document, this.customMetadata));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    protected String getResourceIdentifier(Resource resource) {
        String filename = resource.getFilename();
        if (filename != null && !filename.isEmpty()) {
            return filename;
        } else {
            try {
                URI uri = resource.getURI();
                if (uri != null) {
                    return uri.toString();
                }
            } catch (IOException var5) {
            }

            try {
                URL url = resource.getURL();
                if (url != null) {
                    return url.toString();
                }
            } catch (IOException var4) {
            }

            return resource.getDescription();
        }
    }
}
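
Before moving on to the next reader, here is a minimal usage sketch (not part of the Spring AI source above); the classpath location rag/terms-of-service.txt is a hypothetical file you would replace with your own resource:

java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.TextReader;

public class TextReaderExample {

    public static void main(String[] args) {
        // Hypothetical classpath resource; point this at a real text file in your project
        TextReader reader = new TextReader("classpath:rag/terms-of-service.txt");
        reader.setCharset(StandardCharsets.UTF_8);
        // Entries added here are copied into the metadata of every produced Document
        reader.getCustomMetadata().put("source-type", "text");

        List<Document> documents = reader.get();
        documents.forEach(doc -> System.out.println(doc.getMetadata() + " -> " + doc.getText()));
    }
}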

JsonReader

Reads data from a JSON resource and converts it into Document objects.

  • Resource resource: the JSON resource to read
  • JsonMetadataGenerator jsonMetadataGenerator: generates metadata for the JSON data
  • ObjectMapper objectMapper: parses the JSON
  • List<String> jsonKeysToUse: the JSON fields to extract as the document content; if none are given, the whole JSON object is used

Methods

| Method | Description |
|---|---|
| JsonReader | Constructs the reader from a Resource plus the field names to extract |
| get | Reads the JSON file and returns a list of Documents |

java
package org.springframework.ai.reader;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.core.io.Resource;

public class JsonReader implements DocumentReader {
    private final Resource resource;
    private final JsonMetadataGenerator jsonMetadataGenerator;
    private final ObjectMapper objectMapper;
    private final List<String> jsonKeysToUse;

    public JsonReader(Resource resource) {
        this(resource, new String[0]);
    }

    public JsonReader(Resource resource, String... jsonKeysToUse) {
        this(resource, new EmptyJsonMetadataGenerator(), jsonKeysToUse);
    }

    public JsonReader(Resource resource, JsonMetadataGenerator jsonMetadataGenerator, String... jsonKeysToUse) {
        this.objectMapper = new ObjectMapper();
        Objects.requireNonNull(jsonKeysToUse, "keys must not be null");
        Objects.requireNonNull(jsonMetadataGenerator, "jsonMetadataGenerator must not be null");
        Objects.requireNonNull(resource, "The Spring Resource must not be null");
        this.resource = resource;
        this.jsonMetadataGenerator = jsonMetadataGenerator;
        this.jsonKeysToUse = List.of(jsonKeysToUse);
    }

    public List<Document> get() {
        try {
            JsonNode rootNode = this.objectMapper.readTree(this.resource.getInputStream());
            return rootNode.isArray() ? StreamSupport.stream(rootNode.spliterator(), true).map((jsonNode) -> this.parseJsonNode(jsonNode, this.objectMapper)).toList() : Collections.singletonList(this.parseJsonNode(rootNode, this.objectMapper));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private Document parseJsonNode(JsonNode jsonNode, ObjectMapper objectMapper) {
        Map<String, Object> item = (Map)objectMapper.convertValue(jsonNode, new TypeReference<Map<String, Object>>() {
        });
        StringBuilder sb = new StringBuilder();
        Stream var10000 = this.jsonKeysToUse.stream();
        Objects.requireNonNull(item);
        var10000.filter(item::containsKey).forEach((key) -> sb.append(key).append(": ").append(item.get(key)).append(System.lineSeparator()));
        Map<String, Object> metadata = this.jsonMetadataGenerator.generate(item);
        String content = sb.isEmpty() ? item.toString() : sb.toString();
        return new Document(content, metadata);
    }

    protected List<Document> get(JsonNode rootNode) {
        return rootNode.isArray() ? StreamSupport.stream(rootNode.spliterator(), true).map((jsonNode) -> this.parseJsonNode(jsonNode, this.objectMapper)).toList() : Collections.singletonList(this.parseJsonNode(rootNode, this.objectMapper));
    }

    public List<Document> get(String pointer) {
        try {
            JsonNode rootNode = this.objectMapper.readTree(this.resource.getInputStream());
            JsonNode targetNode = rootNode.at(pointer);
            if (targetNode.isMissingNode()) {
                throw new IllegalArgumentException("Invalid JSON Pointer: " + pointer);
            } else {
                return this.get(targetNode);
            }
        } catch (IOException e) {
            throw new RuntimeException("Error reading JSON resource", e);
        }
    }
}
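
A minimal usage sketch (not part of the source above); the JSON file name, the field names, and the /store/items pointer are hypothetical and should be adapted to your own data:

java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.JsonReader;
import org.springframework.core.io.ClassPathResource;

public class JsonReaderExample {

    public static void main(String[] args) {
        // Hypothetical JSON file; only the "name" and "description" fields end up in the Document text
        JsonReader reader = new JsonReader(new ClassPathResource("rag/products.json"), "name", "description");
        List<Document> documents = reader.get();

        // The JSON Pointer overload reads just one sub-tree of the file
        List<Document> items = reader.get("/store/items");

        System.out.println(documents.size() + " documents, " + items.size() + " item documents");
    }
}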

JsoupDocumentReader

Extracts text content from an HTML document and converts it into Document objects.

Fields:

  • Resource htmlResource: the HTML resource to read
  • JsoupDocumentReaderConfig config: controls how the HTML is read, including charset, CSS selector, whether to extract all elements, whether to group by element, etc.

Methods

| Method | Description |
|---|---|
| JsoupDocumentReader | Constructs the reader from a resource URL or Resource object plus an HTML parsing config |
| get | Reads the HTML file and returns a list of Documents |

java
package org.springframework.ai.reader.jsoup;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.jsoup.config.JsoupDocumentReaderConfig;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;

public class JsoupDocumentReader implements DocumentReader {
    private final Resource htmlResource;
    private final JsoupDocumentReaderConfig config;

    public JsoupDocumentReader(String htmlResource) {
        this((new DefaultResourceLoader()).getResource(htmlResource));
    }

    public JsoupDocumentReader(Resource htmlResource) {
        this(htmlResource, JsoupDocumentReaderConfig.defaultConfig());
    }

    public JsoupDocumentReader(String htmlResource, JsoupDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(htmlResource), config);
    }

    public JsoupDocumentReader(Resource htmlResource, JsoupDocumentReaderConfig config) {
        this.htmlResource = htmlResource;
        this.config = config;
    }

    public List<Document> get() {
        try (InputStream inputStream = this.htmlResource.getInputStream()) {
            org.jsoup.nodes.Document doc = Jsoup.parse(inputStream, this.config.charset, "");
            List<Document> documents = new ArrayList();
            if (this.config.allElements) {
                String allText = doc.body().text();
                Document document = new Document(allText);
                this.addMetadata(doc, document);
                documents.add(document);
            } else if (this.config.groupByElement) {
                for(Element element : doc.select(this.config.selector)) {
                    String elementText = element.text();
                    Document document = new Document(elementText);
                    this.addMetadata(doc, document);
                    documents.add(document);
                }
            } else {
                Elements elements = doc.select(this.config.selector);
                String text = (String)elements.stream().map(Element::text).collect(Collectors.joining(this.config.separator));
                Document document = new Document(text);
                this.addMetadata(doc, document);
                documents.add(document);
            }

            return documents;
        } catch (IOException e) {
            throw new RuntimeException("Failed to read HTML resource: " + String.valueOf(this.htmlResource), e);
        }
    }

    private void addMetadata(org.jsoup.nodes.Document jsoupDoc, Document springDoc) {
        Map<String, Object> metadata = new HashMap();
        metadata.put("title", jsoupDoc.title());

        for(String metaTag : this.config.metadataTags) {
            String value = jsoupDoc.select("meta[name=" + metaTag + "]").attr("content");
            if (!value.isEmpty()) {
                metadata.put(metaTag, value);
            }
        }

        if (this.config.includeLinkUrls) {
            Elements links = jsoupDoc.select("a[href]");
            List<String> linkUrls = links.stream().map((link) -> link.attr("abs:href")).toList();
            metadata.put("linkUrls", linkUrls);
        }

        metadata.putAll(this.config.additionalMetadata);
        springDoc.getMetadata().putAll(metadata);
    }
}

JsoupDocumentReaderConfig

Configuration class that controls the behavior of JsoupDocumentReader.

  • String charset: character encoding used when reading the HTML document; defaults to "UTF-8"
  • String selector: CSS selector used to pick HTML elements; defaults to "body"
  • String separator: separator inserted between the text of multiple matched elements; defaults to "\n"
  • boolean allElements: whether to extract the text of all elements of the document into a single Document; defaults to false
  • boolean groupByElement: whether to extract text per element and create one Document per matched element; defaults to false
  • boolean includeLinkUrls: whether to add the document's link URLs to the metadata; defaults to false
  • List<String> metadataTags: which <meta> tags of the HTML document to extract as metadata; defaults to "description" and "keywords"
  • Map<String, Object> additionalMetadata: extra metadata added to every produced Document

java
package org.springframework.ai.reader.jsoup.config;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.util.Assert;

public final class JsoupDocumentReaderConfig {
    public final String charset;
    public final String selector;
    public final String separator;
    public final boolean allElements;
    public final boolean groupByElement;
    public final boolean includeLinkUrls;
    public final List<String> metadataTags;
    public final Map<String, Object> additionalMetadata;

    private JsoupDocumentReaderConfig(Builder builder) {
        this.charset = builder.charset;
        this.selector = builder.selector;
        this.separator = builder.separator;
        this.allElements = builder.allElements;
        this.includeLinkUrls = builder.includeLinkUrls;
        this.metadataTags = builder.metadataTags;
        this.groupByElement = builder.groupByElement;
        this.additionalMetadata = builder.additionalMetadata;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static JsoupDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static final class Builder {
        private String charset = "UTF-8";
        private String selector = "body";
        private String separator = "\n";
        private boolean allElements = false;
        private boolean includeLinkUrls = false;
        private List<String> metadataTags = new ArrayList(List.of("description", "keywords"));
        private boolean groupByElement = false;
        private Map<String, Object> additionalMetadata = new HashMap();

        private Builder() {
        }

        public Builder charset(String charset) {
            this.charset = charset;
            return this;
        }

        public Builder selector(String selector) {
            this.selector = selector;
            return this;
        }

        public Builder separator(String separator) {
            this.separator = separator;
            return this;
        }

        public Builder allElements(boolean allElements) {
            this.allElements = allElements;
            return this;
        }

        public Builder groupByElement(boolean groupByElement) {
            this.groupByElement = groupByElement;
            return this;
        }

        public Builder includeLinkUrls(boolean includeLinkUrls) {
            this.includeLinkUrls = includeLinkUrls;
            return this;
        }

        public Builder metadataTag(String metadataTag) {
            this.metadataTags.add(metadataTag);
            return this;
        }

        public Builder metadataTags(List<String> metadataTags) {
            this.metadataTags = new ArrayList(metadataTags);
            return this;
        }

        public Builder additionalMetadata(String key, Object value) {
            Assert.notNull(key, "key must not be null");
            Assert.notNull(value, "value must not be null");
            this.additionalMetadata.put(key, value);
            return this;
        }

        public Builder additionalMetadata(Map<String, Object> additionalMetadata) {
            Assert.notNull(additionalMetadata, "additionalMetadata must not be null");
            this.additionalMetadata = additionalMetadata;
            return this;
        }

        public JsoupDocumentReaderConfig build() {
            return new JsoupDocumentReaderConfig(this);
        }
    }
}
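
A usage sketch combining the reader with its config builder (the HTML file path, selector, and metadata values are hypothetical):

java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.jsoup.JsoupDocumentReader;
import org.springframework.ai.reader.jsoup.config.JsoupDocumentReaderConfig;

public class JsoupReaderExample {

    public static void main(String[] args) {
        JsoupDocumentReaderConfig config = JsoupDocumentReaderConfig.builder()
                .selector("article p")          // only paragraphs inside <article>
                .groupByElement(true)           // one Document per matched element
                .includeLinkUrls(true)          // collect link URLs into the metadata
                .additionalMetadata("source", "docs-site")
                .build();

        // Hypothetical HTML file on the classpath
        JsoupDocumentReader reader = new JsoupDocumentReader("classpath:rag/page.html", config);
        List<Document> documents = reader.get();
        documents.forEach(doc -> System.out.println(doc.getMetadata().get("title") + ": " + doc.getText()));
    }
}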

MarkdownDocumentReader

Reads content from a Markdown file and converts it into Document objects. It parses the Markdown with the CommonMark library and can group headings, paragraphs, code blocks, and other content into Document objects with the corresponding metadata.

  • Resource markdownResource: the Markdown resource to read
  • MarkdownDocumentReaderConfig config: controls reading behavior, e.g. whether horizontal rules act as document separators and whether code blocks or blockquotes are kept inline
  • Parser parser: the CommonMark parser that turns Markdown text into a node tree

DocumentVisitor is a static inner class extending CommonMark's AbstractVisitor. It walks the parsed Markdown syntax tree and, following the configuration, groups and extracts the content into structured Document objects:

  1. Traverses the parsed node tree and groups content according to the configuration (split on horizontal rules, include code blocks/blockquotes or not, etc.)
  2. Recognizes headings, paragraphs, code blocks, and blockquotes, extracts their text and metadata, and builds Documents
  3. Adds metadata such as category, title, and language for the different content types, which makes downstream AI processing easier

| Method | Description |
|---|---|
| MarkdownDocumentReader | Constructs the reader from a resource URL or Resource object plus a Markdown parsing config |
| get | Reads the Markdown file and returns a list of Documents |

java
package org.springframework.ai.reader.markdown;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.commonmark.node.AbstractVisitor;
import org.commonmark.node.BlockQuote;
import org.commonmark.node.Code;
import org.commonmark.node.FencedCodeBlock;
import org.commonmark.node.HardLineBreak;
import org.commonmark.node.Heading;
import org.commonmark.node.ListItem;
import org.commonmark.node.Node;
import org.commonmark.node.SoftLineBreak;
import org.commonmark.node.Text;
import org.commonmark.node.ThematicBreak;
import org.commonmark.parser.Parser;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.markdown.config.MarkdownDocumentReaderConfig;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;

public class MarkdownDocumentReader implements DocumentReader {
    private final Resource markdownResource;
    private final MarkdownDocumentReaderConfig config;
    private final Parser parser;

    public MarkdownDocumentReader(String markdownResource) {
        this((new DefaultResourceLoader()).getResource(markdownResource), MarkdownDocumentReaderConfig.defaultConfig());
    }

    public MarkdownDocumentReader(String markdownResource, MarkdownDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(markdownResource), config);
    }

    public MarkdownDocumentReader(Resource markdownResource, MarkdownDocumentReaderConfig config) {
        this.markdownResource = markdownResource;
        this.config = config;
        this.parser = Parser.builder().build();
    }

    public List<Document> get() {
        try (InputStream input = this.markdownResource.getInputStream()) {
            Node node = this.parser.parseReader(new InputStreamReader(input));
            DocumentVisitor documentVisitor = new DocumentVisitor(this.config);
            node.accept(documentVisitor);
            return documentVisitor.getDocuments();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static class DocumentVisitor extends AbstractVisitor {
        private final List<Document> documents = new ArrayList();
        private final List<String> currentParagraphs = new ArrayList();
        private final MarkdownDocumentReaderConfig config;
        private Document.Builder currentDocumentBuilder;

        DocumentVisitor(MarkdownDocumentReaderConfig config) {
            this.config = config;
        }

        public void visit(org.commonmark.node.Document document) {
            this.currentDocumentBuilder = Document.builder();
            super.visit(document);
        }

        public void visit(Heading heading) {
            this.buildAndFlush();
            super.visit(heading);
        }

        public void visit(ThematicBreak thematicBreak) {
            if (this.config.horizontalRuleCreateDocument) {
                this.buildAndFlush();
            }

            super.visit(thematicBreak);
        }

        public void visit(SoftLineBreak softLineBreak) {
            this.translateLineBreakToSpace();
            super.visit(softLineBreak);
        }

        public void visit(HardLineBreak hardLineBreak) {
            this.translateLineBreakToSpace();
            super.visit(hardLineBreak);
        }

        public void visit(ListItem listItem) {
            this.translateLineBreakToSpace();
            super.visit(listItem);
        }

        public void visit(BlockQuote blockQuote) {
            if (!this.config.includeBlockquote) {
                this.buildAndFlush();
            }

            this.translateLineBreakToSpace();
            this.currentDocumentBuilder.metadata("category", "blockquote");
            super.visit(blockQuote);
        }

        public void visit(Code code) {
            this.currentParagraphs.add(code.getLiteral());
            this.currentDocumentBuilder.metadata("category", "code_inline");
            super.visit(code);
        }

        public void visit(FencedCodeBlock fencedCodeBlock) {
            if (!this.config.includeCodeBlock) {
                this.buildAndFlush();
            }

            this.translateLineBreakToSpace();
            this.currentParagraphs.add(fencedCodeBlock.getLiteral());
            this.currentDocumentBuilder.metadata("category", "code_block");
            this.currentDocumentBuilder.metadata("lang", fencedCodeBlock.getInfo());
            this.buildAndFlush();
            super.visit(fencedCodeBlock);
        }

        public void visit(Text text) {
            Node var3 = text.getParent();
            if (var3 instanceof Heading heading) {
                this.currentDocumentBuilder.metadata("category", "header_%d".formatted(heading.getLevel())).metadata("title", text.getLiteral());
            } else {
                this.currentParagraphs.add(text.getLiteral());
            }

            super.visit(text);
        }

        public List<Document> getDocuments() {
            this.buildAndFlush();
            return this.documents;
        }

        private void buildAndFlush() {
            if (!this.currentParagraphs.isEmpty()) {
                String content = String.join("", this.currentParagraphs);
                Document.Builder builder = this.currentDocumentBuilder.text(content);
                Map var10000 = this.config.additionalMetadata;
                Objects.requireNonNull(builder);
                var10000.forEach(builder::metadata);
                Document document = builder.build();
                this.documents.add(document);
                this.currentParagraphs.clear();
            }

            this.currentDocumentBuilder = Document.builder();
        }

        private void translateLineBreakToSpace() {
            if (!this.currentParagraphs.isEmpty()) {
                this.currentParagraphs.add(" ");
            }

        }
    }
}

MarkdownDocumentReaderConfig

Configures the behavior of MarkdownDocumentReader.

  • boolean horizontalRuleCreateDocument: whether text separated by horizontal rules is turned into a new Document
  • boolean includeCodeBlock: whether code blocks stay inside the surrounding paragraph Document or become separate Documents
  • boolean includeBlockquote: whether blockquotes stay inside the surrounding paragraph Document or become separate Documents
  • Map<String, Object> additionalMetadata: extra metadata added to every produced Document

java
package org.springframework.ai.reader.markdown.config;

import java.util.HashMap;
import java.util.Map;
import org.springframework.util.Assert;

public class MarkdownDocumentReaderConfig {
    public final boolean horizontalRuleCreateDocument;
    public final boolean includeCodeBlock;
    public final boolean includeBlockquote;
    public final Map<String, Object> additionalMetadata;

    public MarkdownDocumentReaderConfig(Builder builder) {
        this.horizontalRuleCreateDocument = builder.horizontalRuleCreateDocument;
        this.includeCodeBlock = builder.includeCodeBlock;
        this.includeBlockquote = builder.includeBlockquote;
        this.additionalMetadata = builder.additionalMetadata;
    }

    public static MarkdownDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static Builder builder() {
        return new Builder();
    }

    public static final class Builder {
        private boolean horizontalRuleCreateDocument = false;
        private boolean includeCodeBlock = false;
        private boolean includeBlockquote = false;
        private Map<String, Object> additionalMetadata = new HashMap();

        private Builder() {
        }

        public Builder withHorizontalRuleCreateDocument(boolean horizontalRuleCreateDocument) {
            this.horizontalRuleCreateDocument = horizontalRuleCreateDocument;
            return this;
        }

        public Builder withIncludeCodeBlock(boolean includeCodeBlock) {
            this.includeCodeBlock = includeCodeBlock;
            return this;
        }

        public Builder withIncludeBlockquote(boolean includeBlockquote) {
            this.includeBlockquote = includeBlockquote;
            return this;
        }

        public Builder withAdditionalMetadata(String key, Object value) {
            Assert.notNull(key, "key must not be null");
            Assert.notNull(value, "value must not be null");
            this.additionalMetadata.put(key, value);
            return this;
        }

        public Builder withAdditionalMetadata(Map<String, Object> additionalMetadata) {
            Assert.notNull(additionalMetadata, "additionalMetadata must not be null");
            this.additionalMetadata = additionalMetadata;
            return this;
        }

        public MarkdownDocumentReaderConfig build() {
            return new MarkdownDocumentReaderConfig(this);
        }
    }
}
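
A usage sketch for the Markdown reader (the file name and metadata values are hypothetical):

java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.markdown.MarkdownDocumentReader;
import org.springframework.ai.reader.markdown.config.MarkdownDocumentReaderConfig;

public class MarkdownReaderExample {

    public static void main(String[] args) {
        MarkdownDocumentReaderConfig config = MarkdownDocumentReaderConfig.builder()
                .withHorizontalRuleCreateDocument(true)   // start a new Document at every horizontal rule
                .withIncludeCodeBlock(false)              // fenced code blocks become separate Documents
                .withIncludeBlockquote(false)             // blockquotes become separate Documents
                .withAdditionalMetadata("filename", "guide.md")
                .build();

        // Hypothetical markdown file on the classpath
        MarkdownDocumentReader reader = new MarkdownDocumentReader("classpath:rag/guide.md", config);
        List<Document> documents = reader.get();
        documents.forEach(doc -> System.out.println(doc.getMetadata().get("category") + " -> " + doc.getText()));
    }
}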

PagePdfDocumentReader

Parses a PDF file into multiple Documents grouped by page; each Document can contain one or more pages, and custom grouping and page-margin trimming are supported.

  • PDDocument document: the PDF document object to read
  • String resourceFileName: the name of the PDF file
  • PdfDocumentReaderConfig config: controls reading behavior, including the number of pages per Document and the page margins

| Method | Description |
|---|---|
| PagePdfDocumentReader | Constructs the reader from a resource URL or Resource object plus a PDF parsing config |
| get | Reads the PDF and returns a list of Documents |
| toDocument | Wraps the text of the given page range and its metadata into a Document |

java
package org.springframework.ai.reader.pdf;

import java.awt.Rectangle;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

public class PagePdfDocumentReader implements DocumentReader {
    public static final String METADATA_START_PAGE_NUMBER = "page_number";
    public static final String METADATA_END_PAGE_NUMBER = "end_page_number";
    public static final String METADATA_FILE_NAME = "file_name";
    private static final String PDF_PAGE_REGION = "pdfPageRegion";
    protected final PDDocument document;
    private final Logger logger;
    protected String resourceFileName;
    private PdfDocumentReaderConfig config;

    public PagePdfDocumentReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public PagePdfDocumentReader(Resource pdfResource) {
        this(pdfResource, PdfDocumentReaderConfig.defaultConfig());
    }

    public PagePdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), config);
    }

    public PagePdfDocumentReader(Resource pdfResource, PdfDocumentReaderConfig config) {
        this.logger = LoggerFactory.getLogger(this.getClass());

        try {
            PDFParser pdfParser = new PDFParser(new RandomAccessReadBuffer(pdfResource.getInputStream()));
            this.document = pdfParser.parse();
            this.resourceFileName = pdfResource.getFilename();
            this.config = config;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> get() {
        List<Document> readDocuments = new ArrayList();

        try {
            PDFLayoutTextStripperByArea pdfTextStripper = new PDFLayoutTextStripperByArea();
            int pageNumber = 0;
            int pagesPerDocument = 0;
            int startPageNumber = pageNumber;
            List<String> pageTextGroupList = new ArrayList();
            int totalPages = this.document.getDocumentCatalog().getPages().getCount();
            int logFrequency = totalPages > 10 ? totalPages / 10 : 1;
            int counter = 0;
            PDPage lastPage = (PDPage)this.document.getDocumentCatalog().getPages().iterator().next();

            for(PDPage page : this.document.getDocumentCatalog().getPages()) {
                lastPage = page;
                if (counter % logFrequency == 0 && counter / logFrequency < 10) {
                    this.logger.info("Processing PDF page: {}", counter + 1);
                }

                ++counter;
                ++pagesPerDocument;
                if (this.config.pagesPerDocument != 0 && pagesPerDocument >= this.config.pagesPerDocument) {
                    pagesPerDocument = 0;
                    String aggregatedPageTextGroup = (String)pageTextGroupList.stream().collect(Collectors.joining());
                    if (StringUtils.hasText(aggregatedPageTextGroup)) {
                        readDocuments.add(this.toDocument(page, aggregatedPageTextGroup, startPageNumber, pageNumber));
                    }

                    pageTextGroupList.clear();
                    startPageNumber = pageNumber + 1;
                }

                int x0 = (int)page.getMediaBox().getLowerLeftX();
                int xW = (int)page.getMediaBox().getWidth();
                int y0 = (int)page.getMediaBox().getLowerLeftY() + this.config.pageTopMargin;
                int yW = (int)page.getMediaBox().getHeight() - (this.config.pageTopMargin + this.config.pageBottomMargin);
                pdfTextStripper.addRegion("pdfPageRegion", new Rectangle(x0, y0, xW, yW));
                pdfTextStripper.extractRegions(page);
                String pageText = pdfTextStripper.getTextForRegion("pdfPageRegion");
                if (StringUtils.hasText(pageText)) {
                    pageText = this.config.pageExtractedTextFormatter.format(pageText, pageNumber);
                    pageTextGroupList.add(pageText);
                }

                ++pageNumber;
                pdfTextStripper.removeRegion("pdfPageRegion");
            }

            if (!CollectionUtils.isEmpty(pageTextGroupList)) {
                readDocuments.add(this.toDocument(lastPage, (String)pageTextGroupList.stream().collect(Collectors.joining()), startPageNumber, pageNumber));
            }

            this.logger.info("Processing {} pages", totalPages);
            return readDocuments;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    protected Document toDocument(PDPage page, String docText, int startPageNumber, int endPageNumber) {
        Document doc = new Document(docText);
        doc.getMetadata().put("page_number", startPageNumber);
        if (startPageNumber != endPageNumber) {
            doc.getMetadata().put("end_page_number", endPageNumber);
        }

        doc.getMetadata().put("file_name", this.resourceFileName);
        return doc;
    }
}

PdfDocumentReaderConfig

Configuration class for the PDF document readers, controlling how PDFs are parsed and grouped.

  • int ALL_PAGES: constant with value 0; when used as pagesPerDocument it merges all pages into a single Document
  • boolean reversedParagraphPosition: whether the paragraph position within each page is reversed; defaults to false
  • int pagesPerDocument: number of pages per Document; 0 merges all pages; defaults to 1
  • int pageTopMargin: number of pixels trimmed from the top of each page; defaults to 0
  • int pageBottomMargin: number of pixels trimmed from the bottom of each page; defaults to 0
  • ExtractedTextFormatter pageExtractedTextFormatter: formatter applied to the extracted text of each page, allowing custom per-page post-processing

java
package org.springframework.ai.reader.pdf.config;

import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.util.Assert;

public final class PdfDocumentReaderConfig {
    public static final int ALL_PAGES = 0;
    public final boolean reversedParagraphPosition;
    public final int pagesPerDocument;
    public final int pageTopMargin;
    public final int pageBottomMargin;
    public final ExtractedTextFormatter pageExtractedTextFormatter;

    private PdfDocumentReaderConfig(Builder builder) {
        this.pagesPerDocument = builder.pagesPerDocument;
        this.pageBottomMargin = builder.pageBottomMargin;
        this.pageTopMargin = builder.pageTopMargin;
        this.pageExtractedTextFormatter = builder.pageExtractedTextFormatter;
        this.reversedParagraphPosition = builder.reversedParagraphPosition;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static PdfDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static final class Builder {
        private int pagesPerDocument = 1;
        private int pageTopMargin = 0;
        private int pageBottomMargin = 0;
        private ExtractedTextFormatter pageExtractedTextFormatter = ExtractedTextFormatter.defaults();
        private boolean reversedParagraphPosition = false;

        private Builder() {
        }

        public Builder withPageExtractedTextFormatter(ExtractedTextFormatter pageExtractedTextFormatter) {
            Assert.notNull(pageExtractedTextFormatter, "PageExtractedTextFormatter must not be null.");
            this.pageExtractedTextFormatter = pageExtractedTextFormatter;
            return this;
        }

        public Builder withPagesPerDocument(int pagesPerDocument) {
            Assert.isTrue(pagesPerDocument >= 0, "Page count must be a positive value.");
            this.pagesPerDocument = pagesPerDocument;
            return this;
        }

        public Builder withPageTopMargin(int topMargin) {
            Assert.isTrue(topMargin >= 0, "Page margins must be a positive value.");
            this.pageTopMargin = topMargin;
            return this;
        }

        public Builder withPageBottomMargin(int bottomMargin) {
            Assert.isTrue(bottomMargin >= 0, "Page margins must be a positive value.");
            this.pageBottomMargin = bottomMargin;
            return this;
        }

        public Builder withReversedParagraphPosition(boolean reversedParagraphPosition) {
            this.reversedParagraphPosition = reversedParagraphPosition;
            return this;
        }

        public PdfDocumentReaderConfig build() {
            return new PdfDocumentReaderConfig(this);
        }
    }
}
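
A usage sketch for page-based PDF reading (the PDF path is hypothetical):

java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;

public class PagePdfReaderExample {

    public static void main(String[] args) {
        PdfDocumentReaderConfig config = PdfDocumentReaderConfig.builder()
                .withPagesPerDocument(1)          // one Document per page; 0 (ALL_PAGES) would merge all pages
                .withPageTopMargin(0)
                .withPageBottomMargin(0)
                .withPageExtractedTextFormatter(ExtractedTextFormatter.defaults())
                .build();

        // Hypothetical PDF on the classpath
        PagePdfDocumentReader reader = new PagePdfDocumentReader("classpath:rag/sample.pdf", config);
        List<Document> documents = reader.get();
        System.out.println("Documents produced: " + documents.size());
    }
}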

ParagraphPdfDocumentReader

Parses a PDF file into multiple Documents by paragraph, based on the document's table of contents (outline); each Document typically corresponds to one paragraph/section.

  • PDDocument document: the PDF document object to read
  • String resourceFileName: the name of the PDF file
  • PdfDocumentReaderConfig config: controls reading behavior, including the number of pages per Document and the page margins
  • ParagraphManager paragraphTextExtractor: parses the PDF and extracts the paragraph structure

| Method | Description |
|---|---|
| ParagraphPdfDocumentReader | Constructs the reader from a resource URL or Resource object plus a PDF parsing config |
| get | Reads a PDF that has a table of contents and returns a list of Documents |
| toDocument | Wraps the text of one paragraph range and its metadata into a Document |
| addMetadata | Adds metadata to a Document |
| getTextBetweenParagraphs | Extracts the text between two paragraphs |

java
package org.springframework.ai.reader.pdf;

import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.pdf.config.ParagraphManager;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

public class ParagraphPdfDocumentReader implements DocumentReader {
    private static final String METADATA_START_PAGE = "page_number";
    private static final String METADATA_END_PAGE = "end_page_number";
    private static final String METADATA_TITLE = "title";
    private static final String METADATA_LEVEL = "level";
    private static final String METADATA_FILE_NAME = "file_name";
    protected final PDDocument document;
    private final Logger logger;
    private final ParagraphManager paragraphTextExtractor;
    protected String resourceFileName;
    private PdfDocumentReaderConfig config;

    public ParagraphPdfDocumentReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public ParagraphPdfDocumentReader(Resource pdfResource) {
        this(pdfResource, PdfDocumentReaderConfig.defaultConfig());
    }

    public ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), config);
    }

    public ParagraphPdfDocumentReader(Resource pdfResource, PdfDocumentReaderConfig config) {
        this.logger = LoggerFactory.getLogger(this.getClass());

        try {
            PDFParser pdfParser = new PDFParser(new RandomAccessReadBuffer(pdfResource.getInputStream()));
            this.document = pdfParser.parse();
            this.config = config;
            this.paragraphTextExtractor = new ParagraphManager(this.document);
            this.resourceFileName = pdfResource.getFilename();
        } catch (IllegalArgumentException iae) {
            throw iae;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> get() {
        List<ParagraphManager.Paragraph> paragraphs = this.paragraphTextExtractor.flatten();
        List<Document> documents = new ArrayList(paragraphs.size());
        if (!CollectionUtils.isEmpty(paragraphs)) {
            this.logger.info("Start processing paragraphs from PDF");
            Iterator<ParagraphManager.Paragraph> itr = paragraphs.iterator();
            ParagraphManager.Paragraph current = (ParagraphManager.Paragraph)itr.next();
            ParagraphManager.Paragraph next;
            if (!itr.hasNext()) {
                documents.add(this.toDocument(current, current));
            } else {
                for(; itr.hasNext(); current = next) {
                    next = (ParagraphManager.Paragraph)itr.next();
                    Document document = this.toDocument(current, next);
                    if (document != null && StringUtils.hasText(document.getText())) {
                        documents.add(this.toDocument(current, next));
                    }
                }
            }
        }

        this.logger.info("End processing paragraphs from PDF");
        return documents;
    }

    protected Document toDocument(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to) {
        String docText = this.getTextBetweenParagraphs(from, to);
        if (!StringUtils.hasText(docText)) {
            return null;
        } else {
            Document document = new Document(docText);
            this.addMetadata(from, to, document);
            return document;
        }
    }

    protected void addMetadata(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to, Document document) {
        document.getMetadata().put("title", from.title());
        document.getMetadata().put("page_number", from.startPageNumber());
        document.getMetadata().put("end_page_number", to.startPageNumber());
        document.getMetadata().put("level", from.level());
        document.getMetadata().put("file_name", this.resourceFileName);
    }

    public String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph) {
        int startPage = fromParagraph.startPageNumber() - 1;
        int endPage = toParagraph.startPageNumber() - 1;

        try {
            StringBuilder sb = new StringBuilder();
            PDFLayoutTextStripperByArea pdfTextStripper = new PDFLayoutTextStripperByArea();
            pdfTextStripper.setSortByPosition(true);

            for(int pageNumber = startPage; pageNumber <= endPage; ++pageNumber) {
                PDPage page = this.document.getPage(pageNumber);
                int fromPosition = fromParagraph.position();
                int toPosition = toParagraph.position();
                if (this.config.reversedParagraphPosition) {
                    fromPosition = (int)(page.getMediaBox().getHeight() - (float)fromPosition);
                    toPosition = (int)(page.getMediaBox().getHeight() - (float)toPosition);
                }

                int x0 = (int)page.getMediaBox().getLowerLeftX();
                int xW = (int)page.getMediaBox().getWidth();
                int y0 = (int)page.getMediaBox().getLowerLeftY();
                int yW = (int)page.getMediaBox().getHeight();
                if (pageNumber == startPage) {
                    y0 = fromPosition;
                    yW = (int)page.getMediaBox().getHeight() - fromPosition;
                }

                if (pageNumber == endPage) {
                    yW = toPosition - y0;
                }

                if (y0 + yW == (int)page.getMediaBox().getHeight()) {
                    yW -= this.config.pageBottomMargin;
                }

                if (y0 == 0) {
                    y0 += this.config.pageTopMargin;
                    yW -= this.config.pageTopMargin;
                }

                pdfTextStripper.addRegion("pdfPageRegion", new Rectangle(x0, y0, xW, yW));
                pdfTextStripper.extractRegions(page);
                String text = pdfTextStripper.getTextForRegion("pdfPageRegion");
                if (StringUtils.hasText(text)) {
                    sb.append(text);
                }

                pdfTextStripper.removeRegion("pdfPageRegion");
            }

            String text = sb.toString();
            if (StringUtils.hasText(text)) {
                text = this.config.pageExtractedTextFormatter.format(text, startPage);
            }

            return text;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

ParagraphManager

Manages the paragraph structure of a PDF document. It builds a paragraph tree by parsing the PDF outline (TOC/bookmarks) and can flatten it into a list of paragraphs for later content extraction and grouping.

  • Paragraph rootParagraph: root node of the paragraph tree, of type Paragraph, holding the full paragraph hierarchy
  • PDDocument document: the PDFBox PDDocument currently being processed

| Method | Description |
|---|---|
| ParagraphManager | Takes the PDF document and builds the paragraph tree from its outline |
| flatten | Flattens the paragraph tree into a Paragraph list for sequential traversal |
| getParagraphsByLevel | Returns the paragraphs at a given level, optionally including inter-level paragraphs |
| Paragraph | Static inner record holding paragraph metadata (title, level, start/end page, position, children) |
| generateParagraphs | Core recursive method that walks the PDF outline (TOC/bookmark) tree, turns each PDOutlineItem into a Paragraph, and builds the complete paragraph tree (chapter hierarchy) |

java
package org.springframework.ai.reader.pdf.config;

import java.io.IOException;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageXYZDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
import org.springframework.util.Assert;
import org.springframework.util.CollectionUtils;

public class ParagraphManager {
    private final Paragraph rootParagraph;
    private final PDDocument document;

    public ParagraphManager(PDDocument document) {
        Assert.notNull(document, "PDDocument must not be null");
        Assert.notNull(document.getDocumentCatalog().getDocumentOutline(), "Document outline (e.g. TOC) is null. Make sure the PDF document has a table of contents (TOC). If not, consider the PagePdfDocumentReader or the TikaDocumentReader instead.");

        try {
            this.document = document;
            this.rootParagraph = this.generateParagraphs(new Paragraph((Paragraph)null, "root", -1, 1, this.document.getNumberOfPages(), 0), this.document.getDocumentCatalog().getDocumentOutline(), 0);
            this.printParagraph(this.rootParagraph, System.out);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Paragraph> flatten() {
        List<Paragraph> paragraphs = new ArrayList();

        for(Paragraph child : this.rootParagraph.children()) {
            this.flatten(child, paragraphs);
        }

        return paragraphs;
    }

    private void flatten(Paragraph current, List<Paragraph> paragraphs) {
        paragraphs.add(current);

        for(Paragraph child : current.children()) {
            this.flatten(child, paragraphs);
        }

    }

    private void printParagraph(Paragraph paragraph, PrintStream printStream) {
        printStream.println(paragraph);

        for(Paragraph childParagraph : paragraph.children()) {
            this.printParagraph(childParagraph, printStream);
        }

    }

    protected Paragraph generateParagraphs(Paragraph parentParagraph, PDOutlineNode bookmark, Integer level) throws IOException {
        for(PDOutlineItem current = bookmark.getFirstChild(); current != null; current = current.getNextSibling()) {
            int pageNumber = this.getPageNumber(current);
            int nextSiblingNumber = this.getPageNumber(current.getNextSibling());
            if (nextSiblingNumber < 0) {
                nextSiblingNumber = this.getPageNumber(current.getLastChild());
            }

            int paragraphPosition = current.getDestination() instanceof PDPageXYZDestination ? ((PDPageXYZDestination)current.getDestination()).getTop() : 0;
            Paragraph currentParagraph = new Paragraph(parentParagraph, current.getTitle(), level, pageNumber, nextSiblingNumber, paragraphPosition);
            parentParagraph.children().add(currentParagraph);
            this.generateParagraphs(currentParagraph, current, level + 1);
        }

        return parentParagraph;
    }

    private int getPageNumber(PDOutlineItem current) throws IOException {
        if (current == null) {
            return -1;
        } else {
            PDPage currentPage = current.findDestinationPage(this.document);
            PDPageTree pages = this.document.getDocumentCatalog().getPages();

            for(int i = 0; i < pages.getCount(); ++i) {
                PDPage page = pages.get(i);
                if (page.equals(currentPage)) {
                    return i + 1;
                }
            }

            return -1;
        }
    }

    public List<Paragraph> getParagraphsByLevel(Paragraph paragraph, int level, boolean interLevelText) {
        List<Paragraph> resultList = new ArrayList();
        if (paragraph.level() < level) {
            if (!CollectionUtils.isEmpty(paragraph.children())) {
                if (interLevelText) {
                    Paragraph interLevelParagraph = new Paragraph(paragraph.parent(), paragraph.title(), paragraph.level(), paragraph.startPageNumber(), ((Paragraph)paragraph.children().get(0)).startPageNumber(), paragraph.position());
                    resultList.add(interLevelParagraph);
                }

                for(Paragraph child : paragraph.children()) {
                    resultList.addAll(this.getParagraphsByLevel(child, level, interLevelText));
                }
            }
        } else if (paragraph.level() == level) {
            resultList.add(paragraph);
        }

        return resultList;
    }

    public static record Paragraph(Paragraph parent, String title, int level, int startPageNumber, int endPageNumber, int position, List<Paragraph> children) {
        public Paragraph(Paragraph parent, String title, int level, int startPageNumber, int endPageNumber, int position) {
            this(parent, title, level, startPageNumber, endPageNumber, position, new ArrayList());
        }

        public String toString() {
            String indent = this.level < 0 ? "" : (new String(new char[this.level * 2])).replace('\u0000', ' ');
            return indent + " " + this.level + ") " + this.title + " [" + this.startPageNumber + "," + this.endPageNumber + "], children = " + this.children.size() + ", pos = " + this.position;
        }
    }
}
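
A usage sketch for the paragraph-based reader; note it only works for PDFs that carry an outline (TOC/bookmarks), and the file path is hypothetical:

java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;

public class ParagraphPdfReaderExample {

    public static void main(String[] args) {
        ParagraphPdfDocumentReader reader = new ParagraphPdfDocumentReader(
                "classpath:rag/spring-reference.pdf",   // hypothetical PDF with a table of contents
                PdfDocumentReaderConfig.builder()
                        .withPageTopMargin(0)
                        .withPageBottomMargin(0)
                        .build());

        List<Document> documents = reader.get();
        documents.forEach(doc -> System.out.println(
                doc.getMetadata().get("title") + " (level " + doc.getMetadata().get("level") + ")"));
    }
}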

TikaDocumentReader

Extracts text from many document formats (PDF, DOC/DOCX, PPT/PPTX, HTML, etc.) and wraps it in Document objects. It is built on Apache Tika and therefore supports a wide range of formats.

  • AutoDetectParser parser: parser that auto-detects the document type and extracts its text
  • ContentHandler handler: handler that collects the extracted content
  • Metadata metadata: metadata associated with the document being read
  • ParseContext context: context information for the parsing process
  • Resource resource: the resource pointing to the document
  • ExtractedTextFormatter textFormatter: formats the extracted text

| Method | Description |
|---|---|
| TikaDocumentReader | Constructs the reader from a resource URL or Resource object and a text formatter |
| get | Reads from any supported document format and returns a list of Documents |

java
package org.springframework.ai.reader.tika;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Objects;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.StringUtils;
import org.xml.sax.ContentHandler;

public class TikaDocumentReader implements DocumentReader {
    public static final String METADATA_SOURCE = "source";
    private final AutoDetectParser parser;
    private final ContentHandler handler;
    private final Metadata metadata;
    private final ParseContext context;
    private final Resource resource;
    private final ExtractedTextFormatter textFormatter;

    public TikaDocumentReader(String resourceUrl) {
        this(resourceUrl, ExtractedTextFormatter.defaults());
    }

    public TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), textFormatter);
    }

    public TikaDocumentReader(Resource resource) {
        this(resource, ExtractedTextFormatter.defaults());
    }

    public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
        this(resource, new BodyContentHandler(-1), textFormatter);
    }

    public TikaDocumentReader(Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter) {
        this.parser = new AutoDetectParser();
        this.handler = contentHandler;
        this.metadata = new Metadata();
        this.context = new ParseContext();
        this.resource = resource;
        this.textFormatter = textFormatter;
    }

    public List<Document> get() {
        try (InputStream stream = this.resource.getInputStream()) {
            this.parser.parse(stream, this.handler, this.metadata, this.context);
            return List.of(this.toDocument(this.handler.toString()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private Document toDocument(String docText) {
        docText = (String)Objects.requireNonNullElse(docText, "");
        docText = this.textFormatter.format(docText);
        Document doc = new Document(docText);
        doc.getMetadata().put("source", this.resourceName());
        return doc;
    }

    private String resourceName() {
        try {
            String resourceName = this.resource.getFilename();
            if (!StringUtils.hasText(resourceName)) {
                resourceName = this.resource.getURI().toString();
            }

            return resourceName;
        } catch (IOException e) {
            return String.format("Invalid source URI: %s", e.getMessage());
        }
    }
}
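
A usage sketch for Tika-based reading (the Word document path is hypothetical):

java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;

public class TikaReaderExample {

    public static void main(String[] args) {
        // Hypothetical Word document; Tika auto-detects the format (PDF, DOCX, PPTX, HTML, ...)
        TikaDocumentReader reader = new TikaDocumentReader(
                "classpath:rag/contract.docx", ExtractedTextFormatter.defaults());

        List<Document> documents = reader.get();
        documents.forEach(doc ->
                System.out.println(doc.getMetadata().get("source") + " -> " + doc.getText().length() + " chars"));
    }
}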

DocumentTransformer (interface for transforming document data)

java
package org.springframework.ai.document;

import java.util.List;
import java.util.function.Function;

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
    default List<Document> transform(List<Document> transform) {
        return (List)this.apply(transform);
    }
}

TextSplitter

Splits long text Documents into multiple smaller text chunks. It provides the common framework for concrete splitting strategies (by length, by sentence, by paragraph, etc.).

  • boolean copyContentFormatter: whether the parent document's content formatter is copied to the split child documents

| Method | Description |
|---|---|
| apply | Splits the input document list and returns the resulting documents |
| split | Splits a document or a list of documents and returns the resulting documents |
| setCopyContentFormatter | Controls whether the content formatter is inherited |
| isCopyContentFormatter | Returns the current value of copyContentFormatter |

java
package org.springframework.ai.transformer.splitter;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

public abstract class TextSplitter implements DocumentTransformer {
    private static final Logger logger = LoggerFactory.getLogger(TextSplitter.class);
    private boolean copyContentFormatter = true;

    public List<Document> apply(List<Document> documents) {
        return this.doSplitDocuments(documents);
    }

    public List<Document> split(List<Document> documents) {
        return this.apply(documents);
    }

    public List<Document> split(Document document) {
        return this.apply(List.of(document));
    }

    public boolean isCopyContentFormatter() {
        return this.copyContentFormatter;
    }

    public void setCopyContentFormatter(boolean copyContentFormatter) {
        this.copyContentFormatter = copyContentFormatter;
    }

    private List<Document> doSplitDocuments(List<Document> documents) {
        List<String> texts = new ArrayList();
        List<Map<String, Object>> metadataList = new ArrayList();
        List<ContentFormatter> formatters = new ArrayList();

        for(Document doc : documents) {
            texts.add(doc.getText());
            metadataList.add(doc.getMetadata());
            formatters.add(doc.getContentFormatter());
        }

        return this.createDocuments(texts, formatters, metadataList);
    }

    private List<Document> createDocuments(List<String> texts, List<ContentFormatter> formatters, List<Map<String, Object>> metadataList) {
        List<Document> documents = new ArrayList();

        for(int i = 0; i < texts.size(); ++i) {
            String text = (String)texts.get(i);
            Map<String, Object> metadata = (Map)metadataList.get(i);
            List<String> chunks = this.splitText(text);
            if (chunks.size() > 1) {
                logger.info("Splitting up document into " + chunks.size() + " chunks.");
            }

            for(String chunk : chunks) {
                Map<String, Object> metadataCopy = (Map)metadata.entrySet().stream().filter((e) -> e.getKey() != null && e.getValue() != null).collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
                Document newDoc = new Document(chunk, metadataCopy);
                if (this.copyContentFormatter) {
                    newDoc.setContentFormatter((ContentFormatter)formatters.get(i));
                }

                documents.add(newDoc);
            }
        }

        return documents;
    }

    protected abstract List<String> splitText(String text);
}
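
To show what the framework leaves to subclasses, here is a toy splitter (not part of Spring AI) that cuts on blank lines; only splitText has to be implemented, while the base class takes care of metadata copying and Document re-assembly:

java
import java.util.Arrays;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TextSplitter;

// Toy example: split on blank lines; everything else is inherited from TextSplitter
public class ParagraphTextSplitter extends TextSplitter {

    @Override
    protected List<String> splitText(String text) {
        return Arrays.stream(text.split("\\n\\s*\\n"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .toList();
    }

    public static void main(String[] args) {
        Document doc = new Document("First paragraph.\n\nSecond paragraph.");
        List<Document> chunks = new ParagraphTextSplitter().split(doc);
        System.out.println(chunks.size()); // prints 2
    }
}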
TokenTextSplitter

用于将文本按 token 拆分为指定大小块,基于 jtokit 库实现,适用于需要按 token 粒度处理文本的场景,如 LLM 的输入处理。

  • int chunkSize: target number of tokens per chunk, default 800
  • int minChunkSizeChars: minimum chunk size in characters, default 350 (a chunk is only cut back to the last punctuation mark if that mark lies beyond this position)
  • int minChunkLengthToEmbed: chunks shorter than this are discarded, default 5
  • int maxNumChunks: maximum number of chunks generated from one text, default 10000
  • boolean keepSeparator: whether to keep separators such as line breaks, default true
  • EncodingRegistry registry: registry used to obtain the encoding
  • Encoding encoding: the tokenizer encoding used to encode and decode tokens

|-----------|------------------------------------------------------------------------------------|
| Method    | Description                                                                          |
| splitText | Implements TextSplitter: splits the text into token-based chunks and returns them   |
| doSplit   | Core chunking logic that cuts the text by token length                              |

java
package org.springframework.ai.transformer.splitter;

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.IntArrayList;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import org.springframework.util.Assert;

public class TokenTextSplitter extends TextSplitter {
    private static final int DEFAULT_CHUNK_SIZE = 800;
    private static final int MIN_CHUNK_SIZE_CHARS = 350;
    private static final int MIN_CHUNK_LENGTH_TO_EMBED = 5;
    private static final int MAX_NUM_CHUNKS = 10000;
    private static final boolean KEEP_SEPARATOR = true;
    private final EncodingRegistry registry;
    private final Encoding encoding;
    private final int chunkSize;
    private final int minChunkSizeChars;
    private final int minChunkLengthToEmbed;
    private final int maxNumChunks;
    private final boolean keepSeparator;

    public TokenTextSplitter() {
        this(800, 350, 5, 10000, true);
    }

    public TokenTextSplitter(boolean keepSeparator) {
        this(800, 350, 5, 10000, keepSeparator);
    }

    public TokenTextSplitter(int chunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator) {
        this.registry = Encodings.newLazyEncodingRegistry();
        this.encoding = this.registry.getEncoding(EncodingType.CL100K_BASE);
        this.chunkSize = chunkSize;
        this.minChunkSizeChars = minChunkSizeChars;
        this.minChunkLengthToEmbed = minChunkLengthToEmbed;
        this.maxNumChunks = maxNumChunks;
        this.keepSeparator = keepSeparator;
    }

    public static Builder builder() {
        return new Builder();
    }

    protected List<String> splitText(String text) {
        return this.doSplit(text, this.chunkSize);
    }

    protected List<String> doSplit(String text, int chunkSize) {
        if (text != null && !text.trim().isEmpty()) {
            List<Integer> tokens = this.getEncodedTokens(text);
            List<String> chunks = new ArrayList();
            int numchunks = 0;

            while(!tokens.isEmpty() && numchunks < this.maxNumChunks) {
                List<Integer> chunk = tokens.subList(0, Math.min(chunkSize, tokens.size()));
                String chunkText = this.decodeTokens(chunk);
                if (chunkText.trim().isEmpty()) {
                    tokens = tokens.subList(chunk.size(), tokens.size());
                } else {
                    int lastPunctuation = Math.max(chunkText.lastIndexOf(46), Math.max(chunkText.lastIndexOf(63), Math.max(chunkText.lastIndexOf(33), chunkText.lastIndexOf(10))));
                    if (lastPunctuation != -1 && lastPunctuation > this.minChunkSizeChars) {
                        chunkText = chunkText.substring(0, lastPunctuation + 1);
                    }

                    String chunkTextToAppend = this.keepSeparator ? chunkText.trim() : chunkText.replace(System.lineSeparator(), " ").trim();
                    if (chunkTextToAppend.length() > this.minChunkLengthToEmbed) {
                        chunks.add(chunkTextToAppend);
                    }

                    tokens = tokens.subList(this.getEncodedTokens(chunkText).size(), tokens.size());
                    ++numchunks;
                }
            }

            if (!tokens.isEmpty()) {
                String remainingtext = this.decodeTokens(tokens).replace(System.lineSeparator(), " ").trim();
                if (remainingtext.length() > this.minChunkLengthToEmbed) {
                    chunks.add(remainingtext);
                }
            }

            return chunks;
        } else {
            return new ArrayList();
        }
    }

    private List<Integer> getEncodedTokens(String text) {
        Assert.notNull(text, "Text must not be null");
        return this.encoding.encode(text).boxed();
    }

    private String decodeTokens(List<Integer> tokens) {
        Assert.notNull(tokens, "Tokens must not be null");
        IntArrayList tokensIntArray = new IntArrayList(tokens.size());
        Objects.requireNonNull(tokensIntArray);
        tokens.forEach(tokensIntArray::add);
        return this.encoding.decode(tokensIntArray);
    }

    public static final class Builder {
        private int chunkSize = 800;
        private int minChunkSizeChars = 350;
        private int minChunkLengthToEmbed = 5;
        private int maxNumChunks = 10000;
        private boolean keepSeparator = true;

        private Builder() {
        }

        public Builder withChunkSize(int chunkSize) {
            this.chunkSize = chunkSize;
            return this;
        }

        public Builder withMinChunkSizeChars(int minChunkSizeChars) {
            this.minChunkSizeChars = minChunkSizeChars;
            return this;
        }

        public Builder withMinChunkLengthToEmbed(int minChunkLengthToEmbed) {
            this.minChunkLengthToEmbed = minChunkLengthToEmbed;
            return this;
        }

        public Builder withMaxNumChunks(int maxNumChunks) {
            this.maxNumChunks = maxNumChunks;
            return this;
        }

        public Builder withKeepSeparator(boolean keepSeparator) {
            this.keepSeparator = keepSeparator;
            return this;
        }

        public TokenTextSplitter build() {
            return new TokenTextSplitter(this.chunkSize, this.minChunkSizeChars, this.minChunkLengthToEmbed, this.maxNumChunks, this.keepSeparator);
        }
    }
}
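
A minimal usage sketch (the parameter values are illustrative, not recommendations): build a splitter via the Builder and feed it the documents produced by a DocumentReader.

java
package com.example.etl; // hypothetical package

import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

public class TokenSplitExample {

    // "documents" would typically come from a DocumentReader earlier in the pipeline.
    public List<Document> chunk(List<Document> documents) {
        TokenTextSplitter splitter = TokenTextSplitter.builder()
                .withChunkSize(200)           // target tokens per chunk (default 800)
                .withMinChunkSizeChars(100)   // only cut back to punctuation beyond this many chars
                .withMinChunkLengthToEmbed(5) // discard chunks shorter than 5 characters
                .withKeepSeparator(true)      // keep line separators inside chunks
                .build();
        return splitter.split(documents);
    }
}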

ContentFormatTransformer

Applies a content formatter to each document in a Document list in order to format the document's content.

  • boolean disableTemplateRewrite: whether to disable rewriting of the content formatter's text template
  • ContentFormatter contentFormatter: the formatter instance applied to document content
java
package org.springframework.ai.transformer;

import java.util.ArrayList;
import java.util.List;
import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

public class ContentFormatTransformer implements DocumentTransformer {
    private final boolean disableTemplateRewrite;
    private final ContentFormatter contentFormatter;

    public ContentFormatTransformer(ContentFormatter contentFormatter) {
        this(contentFormatter, false);
    }

    public ContentFormatTransformer(ContentFormatter contentFormatter, boolean disableTemplateRewrite) {
        this.contentFormatter = contentFormatter;
        this.disableTemplateRewrite = disableTemplateRewrite;
    }

    public List<Document> apply(List<Document> documents) {
        if (this.contentFormatter != null) {
            documents.forEach(this::processDocument);
        }

        return documents;
    }

    private void processDocument(Document document) {
        // If both the document's formatter and the configured formatter are
        // DefaultContentFormatter instances, merge them; otherwise override.
        if (document.getContentFormatter() instanceof DefaultContentFormatter docFormatter
                && this.contentFormatter instanceof DefaultContentFormatter toUpdateFormatter) {
            this.updateFormatter(document, docFormatter, toUpdateFormatter);
            return;
        }

        this.overrideFormatter(document);
    }

    private void updateFormatter(Document document, DefaultContentFormatter docFormatter, DefaultContentFormatter toUpdateFormatter) {
        List<String> updatedEmbedExcludeKeys = new ArrayList(docFormatter.getExcludedEmbedMetadataKeys());
        updatedEmbedExcludeKeys.addAll(toUpdateFormatter.getExcludedEmbedMetadataKeys());
        List<String> updatedInterfaceExcludeKeys = new ArrayList(docFormatter.getExcludedInferenceMetadataKeys());
        updatedInterfaceExcludeKeys.addAll(toUpdateFormatter.getExcludedInferenceMetadataKeys());
        DefaultContentFormatter.Builder builder = DefaultContentFormatter.builder().withExcludedEmbedMetadataKeys(updatedEmbedExcludeKeys).withExcludedInferenceMetadataKeys(updatedInterfaceExcludeKeys).withMetadataTemplate(docFormatter.getMetadataTemplate()).withMetadataSeparator(docFormatter.getMetadataSeparator());
        if (!this.disableTemplateRewrite) {
            builder.withTextTemplate(docFormatter.getTextTemplate());
        }

        document.setContentFormatter(builder.build());
    }

    private void overrideFormatter(Document document) {
        document.setContentFormatter(this.contentFormatter);
    }
}
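
A hedged sketch of typical usage: because the document's default formatter and the one passed in below are both DefaultContentFormatter instances, the excluded-key lists are merged (updateFormatter) rather than the formatter being replaced outright. DefaultContentFormatter itself is described in the next section.

java
package com.example.etl; // hypothetical package

import java.util.List;
import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.ContentFormatTransformer;

public class FormatterMergeExample {

    public List<Document> hideSourceFromEmbedding(List<Document> documents) {
        DefaultContentFormatter formatter = DefaultContentFormatter.builder()
                .withExcludedEmbedMetadataKeys("source") // drop "source" from EMBED-mode output
                .build();
        // Merges the excluded keys into each document's existing DefaultContentFormatter.
        return new ContentFormatTransformer(formatter).apply(documents);
    }
}
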
ContentFormatter (content formatting interface)
java
package org.springframework.ai.document;

public interface ContentFormatter {

    String format(Document document, MetadataMode mode);

}
DefaultContentFormatter

Formats a Document's content and metadata; templates and configuration control how the document is rendered.

  • String metadataTemplate: template for rendering metadata entries, containing the {key} and {value} placeholders
  • String metadataSeparator: separator placed between metadata entries
  • String textTemplate: template for rendering the document text, containing the {content} and {metadata_string} placeholders
  • List<String> excludedInferenceMetadataKeys: metadata keys excluded in INFERENCE mode
  • List<String> excludedEmbedMetadataKeys: metadata keys excluded in EMBED mode
java
package org.springframework.ai.document;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.springframework.util.Assert;

public final class DefaultContentFormatter implements ContentFormatter {
    private static final String TEMPLATE_CONTENT_PLACEHOLDER = "{content}";
    private static final String TEMPLATE_METADATA_STRING_PLACEHOLDER = "{metadata_string}";
    private static final String TEMPLATE_VALUE_PLACEHOLDER = "{value}";
    private static final String TEMPLATE_KEY_PLACEHOLDER = "{key}";
    private static final String DEFAULT_METADATA_TEMPLATE = String.format("%s: %s", "{key}", "{value}");
    private static final String DEFAULT_METADATA_SEPARATOR = System.lineSeparator();
    private static final String DEFAULT_TEXT_TEMPLATE = String.format("%s\n\n%s", "{metadata_string}", "{content}");
    private final String metadataTemplate;
    private final String metadataSeparator;
    private final String textTemplate;
    private final List<String> excludedInferenceMetadataKeys;
    private final List<String> excludedEmbedMetadataKeys;

    private DefaultContentFormatter(Builder builder) {
        this.metadataTemplate = builder.metadataTemplate;
        this.metadataSeparator = builder.metadataSeparator;
        this.textTemplate = builder.textTemplate;
        this.excludedInferenceMetadataKeys = builder.excludedInferenceMetadataKeys;
        this.excludedEmbedMetadataKeys = builder.excludedEmbedMetadataKeys;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static DefaultContentFormatter defaultConfig() {
        return builder().build();
    }

    public String format(Document document, MetadataMode metadataMode) {
        Map<String, Object> metadata = this.metadataFilter(document.getMetadata(), metadataMode);
        String metadataText = (String)metadata.entrySet().stream().map((metadataEntry) -> this.metadataTemplate.replace("{key}", (CharSequence)metadataEntry.getKey()).replace("{value}", metadataEntry.getValue().toString())).collect(Collectors.joining(this.metadataSeparator));
        return this.textTemplate.replace("{metadata_string}", metadataText).replace("{content}", document.getText());
    }

    protected Map<String, Object> metadataFilter(Map<String, Object> metadata, MetadataMode metadataMode) {
        if (metadataMode == MetadataMode.ALL) {
            return new HashMap(metadata);
        } else if (metadataMode == MetadataMode.NONE) {
            return new HashMap(Collections.emptyMap());
        } else {
            Set<String> usableMetadataKeys = new HashSet(metadata.keySet());
            if (metadataMode == MetadataMode.INFERENCE) {
                usableMetadataKeys.removeAll(this.excludedInferenceMetadataKeys);
            } else if (metadataMode == MetadataMode.EMBED) {
                usableMetadataKeys.removeAll(this.excludedEmbedMetadataKeys);
            }

            return new HashMap((Map)metadata.entrySet().stream().filter((e) -> usableMetadataKeys.contains(e.getKey())).collect(Collectors.toMap((e) -> (String)e.getKey(), (e) -> e.getValue())));
        }
    }

    public String getMetadataTemplate() {
        return this.metadataTemplate;
    }

    public String getMetadataSeparator() {
        return this.metadataSeparator;
    }

    public String getTextTemplate() {
        return this.textTemplate;
    }

    public List<String> getExcludedInferenceMetadataKeys() {
        return Collections.unmodifiableList(this.excludedInferenceMetadataKeys);
    }

    public List<String> getExcludedEmbedMetadataKeys() {
        return Collections.unmodifiableList(this.excludedEmbedMetadataKeys);
    }

    public static final class Builder {
        private String metadataTemplate;
        private String metadataSeparator;
        private String textTemplate;
        private List<String> excludedInferenceMetadataKeys;
        private List<String> excludedEmbedMetadataKeys;

        private Builder() {
            this.metadataTemplate = DefaultContentFormatter.DEFAULT_METADATA_TEMPLATE;
            this.metadataSeparator = DefaultContentFormatter.DEFAULT_METADATA_SEPARATOR;
            this.textTemplate = DefaultContentFormatter.DEFAULT_TEXT_TEMPLATE;
            this.excludedInferenceMetadataKeys = new ArrayList();
            this.excludedEmbedMetadataKeys = new ArrayList();
        }

        public Builder from(DefaultContentFormatter fromFormatter) {
            this.withExcludedEmbedMetadataKeys(fromFormatter.getExcludedEmbedMetadataKeys()).withExcludedInferenceMetadataKeys(fromFormatter.getExcludedInferenceMetadataKeys()).withMetadataSeparator(fromFormatter.getMetadataSeparator()).withMetadataTemplate(fromFormatter.getMetadataTemplate()).withTextTemplate(fromFormatter.getTextTemplate());
            return this;
        }

        public Builder withMetadataTemplate(String metadataTemplate) {
            Assert.hasText(metadataTemplate, "Metadata Template must not be empty");
            this.metadataTemplate = metadataTemplate;
            return this;
        }

        public Builder withMetadataSeparator(String metadataSeparator) {
            Assert.notNull(metadataSeparator, "Metadata separator must not be empty");
            this.metadataSeparator = metadataSeparator;
            return this;
        }

        public Builder withTextTemplate(String textTemplate) {
            Assert.hasText(textTemplate, "Document's text template must not be empty");
            this.textTemplate = textTemplate;
            return this;
        }

        public Builder withExcludedInferenceMetadataKeys(List<String> excludedInferenceMetadataKeys) {
            Assert.notNull(excludedInferenceMetadataKeys, "Excluded inference metadata keys must not be null");
            this.excludedInferenceMetadataKeys = excludedInferenceMetadataKeys;
            return this;
        }

        public Builder withExcludedInferenceMetadataKeys(String... keys) {
            Assert.notNull(keys, "Excluded inference metadata keys must not be null");
            this.excludedInferenceMetadataKeys.addAll(Arrays.asList(keys));
            return this;
        }

        public Builder withExcludedEmbedMetadataKeys(List<String> excludedEmbedMetadataKeys) {
            Assert.notNull(excludedEmbedMetadataKeys, "Excluded Embed metadata keys must not be null");
            this.excludedEmbedMetadataKeys = excludedEmbedMetadataKeys;
            return this;
        }

        public Builder withExcludedEmbedMetadataKeys(String... keys) {
            Assert.notNull(keys, "Excluded Embed metadata keys must not be null");
            this.excludedEmbedMetadataKeys.addAll(Arrays.asList(keys));
            return this;
        }

        public DefaultContentFormatter build() {
            return new DefaultContentFormatter(this);
        }
    }
}
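
A small sketch of the formatter in isolation (the document and metadata values are illustrative): with the default templates, INFERENCE mode renders the remaining metadata via {key}: {value}, joins entries with a line separator, and prepends the result to the text.

java
package com.example.etl; // hypothetical package

import java.util.Map;
import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.MetadataMode;

public class ContentFormatExample {

    public String render() {
        Document doc = new Document("Spring AI ETL pipeline.",
                Map.of("source", "etl.md", "page_number", 1));

        DefaultContentFormatter formatter = DefaultContentFormatter.builder()
                .withExcludedInferenceMetadataKeys("page_number") // hide page_number in INFERENCE mode
                .build();

        // Expected output with the default templates:
        // source: etl.md
        //
        // Spring AI ETL pipeline.
        return formatter.format(doc, MetadataMode.INFERENCE);
    }
}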

KeywordMetadataEnricher

Extracts keywords from each document and adds them to the document as metadata: a ChatModel is called to generate the keywords, which are then stored under the excerpt_keywords metadata key.

  • ChatModel chatModel: the chat model used to generate the keywords
  • int keywordCount: number of keywords to extract per document
java
package org.springframework.ai.model.transformer;

import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.util.Assert;

public class KeywordMetadataEnricher implements DocumentTransformer {
    public static final String CONTEXT_STR_PLACEHOLDER = "context_str";
    public static final String KEYWORDS_TEMPLATE = "{context_str}. Give %s unique keywords for this\ndocument. Format as comma separated. Keywords:";
    private static final String EXCERPT_KEYWORDS_METADATA_KEY = "excerpt_keywords";
    private final ChatModel chatModel;
    private final int keywordCount;

    public KeywordMetadataEnricher(ChatModel chatModel, int keywordCount) {
        Assert.notNull(chatModel, "ChatModel must not be null");
        Assert.isTrue(keywordCount >= 1, "Document count must be >= 1");
        this.chatModel = chatModel;
        this.keywordCount = keywordCount;
    }

    public List<Document> apply(List<Document> documents) {
        for(Document document : documents) {
            PromptTemplate template = new PromptTemplate(String.format(KEYWORDS_TEMPLATE, this.keywordCount));
            Prompt prompt = template.create(Map.of(CONTEXT_STR_PLACEHOLDER, document.getText()));
            String keywords = this.chatModel.call(prompt).getResult().getOutput().getText();
            document.getMetadata().putAll(Map.of(EXCERPT_KEYWORDS_METADATA_KEY, keywords));
        }

        return documents;
    }
}
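
A hedged usage sketch; chatModel is assumed to be an injected ChatModel bean (OpenAI, DashScope, ...). Each document ends up with an excerpt_keywords metadata entry containing the comma-separated keywords.

java
package com.example.etl; // hypothetical package

import java.util.List;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.document.Document;
import org.springframework.ai.model.transformer.KeywordMetadataEnricher;

public class KeywordEnrichExample {

    // chatModel is assumed to be injected from the Spring context.
    public List<Document> enrich(ChatModel chatModel, List<Document> documents) {
        KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5); // 5 keywords per document
        return enricher.apply(documents);
    }
}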

SummaryMetadataEnricher

Extracts summaries from documents and adds them as metadata. It can generate summaries for the current document, the previous document and the next document, and stores them in the document's metadata.

  • ChatModel chatModel: the chat model used to generate the summaries

  • List<SummaryType> summaryTypes: which summaries to extract (CURRENT, PREVIOUS, NEXT)

  • MetadataMode metadataMode: controls how the document content is formatted before summarization

    • ALL: include all metadata (author, page number, title, ...) when formatting the content; suited to scenarios that need rich context
    • EMBED: include only metadata relevant to vector embedding; typically used for vector-store retrieval, so that only embedding-relevant metadata is emitted and noise is reduced
    • INFERENCE: include only metadata relevant to inference; suited to reasoning and Q&A scenarios, filtering out unrelated entries
    • NONE: output the plain text only, without any metadata
  • String summaryTemplate: the prompt template used to generate the summary

java
package org.springframework.ai.model.transformer;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.document.MetadataMode;
import org.springframework.util.Assert;
import org.springframework.util.CollectionUtils;

public class SummaryMetadataEnricher implements DocumentTransformer {
    public static final String DEFAULT_SUMMARY_EXTRACT_TEMPLATE = "Here is the content of the section:\n{context_str}\n\nSummarize the key topics and entities of the section.\n\nSummary:";
    private static final String SECTION_SUMMARY_METADATA_KEY = "section_summary";
    private static final String NEXT_SECTION_SUMMARY_METADATA_KEY = "next_section_summary";
    private static final String PREV_SECTION_SUMMARY_METADATA_KEY = "prev_section_summary";
    private static final String CONTEXT_STR_PLACEHOLDER = "context_str";
    private final ChatModel chatModel;
    private final List<SummaryType> summaryTypes;
    private final MetadataMode metadataMode;
    private final String summaryTemplate;

    public SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes) {
        this(chatModel, summaryTypes, DEFAULT_SUMMARY_EXTRACT_TEMPLATE, MetadataMode.ALL);
    }

    public SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes, String summaryTemplate, MetadataMode metadataMode) {
        Assert.notNull(chatModel, "ChatModel must not be null");
        Assert.hasText(summaryTemplate, "Summary template must not be empty");
        this.chatModel = chatModel;
        this.summaryTypes = CollectionUtils.isEmpty(summaryTypes) ? List.of(SummaryMetadataEnricher.SummaryType.CURRENT) : summaryTypes;
        this.metadataMode = metadataMode;
        this.summaryTemplate = summaryTemplate;
    }

    public List<Document> apply(List<Document> documents) {
        List<String> documentSummaries = new ArrayList();

        for(Document document : documents) {
            String documentContext = document.getFormattedContent(this.metadataMode);
            Prompt prompt = (new PromptTemplate(this.summaryTemplate)).create(Map.of(CONTEXT_STR_PLACEHOLDER, documentContext));
            documentSummaries.add(this.chatModel.call(prompt).getResult().getOutput().getText());
        }

        for(int i = 0; i < documentSummaries.size(); ++i) {
            Map<String, Object> summaryMetadata = this.getSummaryMetadata(i, documentSummaries);
            ((Document)documents.get(i)).getMetadata().putAll(summaryMetadata);
        }

        return documents;
    }

    private Map<String, Object> getSummaryMetadata(int i, List<String> documentSummaries) {
        Map<String, Object> summaryMetadata = new HashMap();
        if (i > 0 && this.summaryTypes.contains(SummaryMetadataEnricher.SummaryType.PREVIOUS)) {
            summaryMetadata.put("prevsectionsummary", documentSummaries.get(i - 1));
        }

        if (i < documentSummaries.size() - 1 && this.summaryTypes.contains(SummaryMetadataEnricher.SummaryType.NEXT)) {
            summaryMetadata.put("nextsectionsummary", documentSummaries.get(i + 1));
        }

        if (this.summaryTypes.contains(SummaryMetadataEnricher.SummaryType.CURRENT)) {
            summaryMetadata.put("sectionsummary", documentSummaries.get(i));
        }

        return summaryMetadata;
    }

    public static enum SummaryType {
        PREVIOUS,
        CURRENT,
        NEXT;
    }
}
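
A hedged usage sketch along the same lines: request summaries for the current, previous and next sections, so each document gains section_summary and, where neighbours exist, prev_section_summary / next_section_summary metadata.

java
package com.example.etl; // hypothetical package

import java.util.List;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.document.Document;
import org.springframework.ai.model.transformer.SummaryMetadataEnricher;
import org.springframework.ai.model.transformer.SummaryMetadataEnricher.SummaryType;

public class SummaryEnrichExample {

    // chatModel is assumed to be injected from the Spring context.
    public List<Document> enrich(ChatModel chatModel, List<Document> documents) {
        SummaryMetadataEnricher enricher = new SummaryMetadataEnricher(chatModel,
                List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));
        return enricher.apply(documents);
    }
}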

DocumentWriter (document writing interface)

java
package org.springframework.ai.document;

import java.util.List;
import java.util.function.Consumer;

public interface DocumentWriter extends Consumer<List<Document>> {
    default void write(List<Document> documents) {
        this.accept(documents);
    }
}

FileDocumentWriter

Writes the content of a list of Document objects to a given file. Supports appending, per-document marker lines, and metadata formatting.

  • String fileName: name of the file to write to
  • boolean withDocumentMarkers: whether to emit document markers (document index and page range) into the file
  • MetadataMode metadataMode: controls how the document content is formatted (which metadata is included)
  • boolean append: whether to append to the end of the file instead of overwriting it

|--------------------|-----------------------------------------------------------------------------------------|
| Method             | Description                                                                             |
| FileDocumentWriter | Constructs the writer from file name, document markers, metadata mode and append flag   |
| accept             | Writes the document contents to the file, with optional markers and metadata formatting |

java
package org.springframework.ai.writer;

import java.io.FileWriter;
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentWriter;
import org.springframework.ai.document.MetadataMode;
import org.springframework.util.Assert;

public class FileDocumentWriter implements DocumentWriter {
    public static final String METADATA_START_PAGE_NUMBER = "page_number";
    public static final String METADATA_END_PAGE_NUMBER = "end_page_number";
    private final String fileName;
    private final boolean withDocumentMarkers;
    private final MetadataMode metadataMode;
    private final boolean append;

    public FileDocumentWriter(String fileName) {
        this(fileName, false, MetadataMode.NONE, false);
    }

    public FileDocumentWriter(String fileName, boolean withDocumentMarkers) {
        this(fileName, withDocumentMarkers, MetadataMode.NONE, false);
    }

    public FileDocumentWriter(String fileName, boolean withDocumentMarkers, MetadataMode metadataMode, boolean append) {
        Assert.hasText(fileName, "File name must have a text.");
        Assert.notNull(metadataMode, "MetadataMode must not be null.");
        this.fileName = fileName;
        this.withDocumentMarkers = withDocumentMarkers;
        this.metadataMode = metadataMode;
        this.append = append;
    }

    public void accept(List<Document> docs) {
        try {
            try (FileWriter writer = new FileWriter(this.fileName, this.append)) {
                int index = 0;

                for(Document doc : docs) {
                    if (this.withDocumentMarkers) {
                        writer.write(String.format("%n### Doc: %s, pages:[%s,%s]\n", index, doc.getMetadata().get(METADATA_START_PAGE_NUMBER), doc.getMetadata().get(METADATA_END_PAGE_NUMBER)));
                    }

                    writer.write(doc.getFormattedContent(this.metadataMode));
                    ++index;
                }
            }

        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
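
A minimal usage sketch (the file name is illustrative): write all chunks to one file with per-document markers, render all metadata, and append rather than overwrite.

java
package com.example.etl; // hypothetical package

import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.MetadataMode;
import org.springframework.ai.writer.FileDocumentWriter;

public class FileWriteExample {

    public void dump(List<Document> documents) {
        FileDocumentWriter writer =
                new FileDocumentWriter("etl-output.txt", true, MetadataMode.ALL, true);
        writer.write(documents);
    }
}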

VectorStore

VectorStore extends the DocumentWriter interface, so a vector store can serve directly as the Load step of the ETL pipeline; see the Vector Databases chapter for details.
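
Putting the pieces together, a hedged end-to-end ETL sketch (the resource path and the injected VectorStore bean are assumptions): a DocumentReader extracts, a TextSplitter transforms, and the VectorStore loads.

java
package com.example.etl; // hypothetical package

import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;

public class EtlPipelineExample {

    // vectorStore is assumed to be an injected VectorStore bean (in-memory, Redis, ES, ...).
    public void ingest(VectorStore vectorStore) {
        List<Document> documents = new TextReader("classpath:docs/faq.txt").read(); // Extract
        List<Document> chunks = new TokenTextSplitter().split(documents);           // Transform
        vectorStore.write(chunks);                                                  // Load
    }
}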
