A Look at Spring AI Alibaba's MarkdownDocumentParser

This article takes a look at the MarkdownDocumentParser in Spring AI Alibaba.

MarkdownDocumentParser

community/document-parsers/spring-ai-alibaba-starter-document-parser-markdown/src/main/java/com/alibaba/cloud/ai/parser/markdown/MarkdownDocumentParser.java

public class MarkdownDocumentParser implements DocumentParser {

	/**
	 * Configuration to a parsing process.
	 */
	private final MarkdownDocumentParserConfig config;

	/**
	 * Markdown parser.
	 */
	private final Parser parser;

	public MarkdownDocumentParser() {
		this(MarkdownDocumentParserConfig.defaultConfig());
	}

	/**
	 * Create a new {@link MarkdownDocumentParser} instance.
	 *
	 */
	public MarkdownDocumentParser(MarkdownDocumentParserConfig config) {
		this.config = config;
		this.parser = Parser.builder().build();
	}

	@Override
	public List<Document> parse(InputStream inputStream) {
		try (var input = inputStream) {
			Node node = this.parser.parseReader(new InputStreamReader(input));

			DocumentVisitor documentVisitor = new DocumentVisitor(this.config);
			node.accept(documentVisitor);

			return documentVisitor.getDocuments();
		}
		catch (IOException e) {
			throw new RuntimeException(e);
		}
	}

	//......
}	

MarkdownDocumentParser uses org.commonmark.parser.Parser to parse the InputStream into a Node tree, then walks that tree with a DocumentVisitor, which turns it into a list of Documents.
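
The `node.accept(documentVisitor)` call relies on the visitor pattern: each node calls back the visitor overload for its own type, and the default behaviour is to descend into the children. Below is a minimal, dependency-free sketch of that dispatch. The `MiniNode`, `MiniVisitor`, and `TraceVisitor` types are hypothetical stand-ins written for this illustration; they are not the commonmark classes and only approximate how `AbstractVisitor` behaves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for a parsed Markdown tree; not the commonmark classes.
abstract class MiniNode {
    private final List<MiniNode> children = new ArrayList<>();

    MiniNode add(MiniNode child) {
        children.add(child);
        return this;
    }

    List<MiniNode> children() {
        return children;
    }

    // Each concrete node calls back the visitor overload for its own type.
    abstract void accept(MiniVisitor visitor);
}

class MiniDocument extends MiniNode {
    @Override
    void accept(MiniVisitor visitor) {
        visitor.visit(this);
    }
}

class MiniHeading extends MiniNode {
    final int level;

    MiniHeading(int level) {
        this.level = level;
    }

    @Override
    void accept(MiniVisitor visitor) {
        visitor.visit(this);
    }
}

class MiniText extends MiniNode {
    final String literal;

    MiniText(String literal) {
        this.literal = literal;
    }

    @Override
    void accept(MiniVisitor visitor) {
        visitor.visit(this);
    }
}

// Default behaviour for every node type: just descend into the children.
abstract class MiniVisitor {
    void visit(MiniDocument document) { visitChildren(document); }
    void visit(MiniHeading heading) { visitChildren(heading); }
    void visit(MiniText text) { visitChildren(text); }

    protected void visitChildren(MiniNode parent) {
        for (MiniNode child : parent.children()) {
            child.accept(this);
        }
    }
}

// Overrides only the node types it cares about, then delegates to super to keep walking.
class TraceVisitor extends MiniVisitor {
    final List<String> trace = new ArrayList<>();

    @Override
    void visit(MiniHeading heading) {
        trace.add("heading:" + heading.level);
        super.visit(heading);
    }

    @Override
    void visit(MiniText text) {
        trace.add("text:" + text.literal);
        super.visit(text);
    }
}

class VisitorSketch {
    static List<String> walk() {
        MiniNode root = new MiniDocument()
            .add(new MiniHeading(1).add(new MiniText("Title")))
            .add(new MiniText("Body"));
        TraceVisitor visitor = new TraceVisitor();
        root.accept(visitor);
        return visitor.trace;
    }

    public static void main(String[] args) {
        System.out.println(walk());
    }
}
```

The heading is recorded before its text child because the override does its own work first and only then calls `super.visit(...)`, which is the same order the visitor in the real parser follows.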

DocumentVisitor

	static class DocumentVisitor extends AbstractVisitor {

		private final List<Document> documents = new ArrayList<>();

		private final List<String> currentParagraphs = new ArrayList<>();

		private final MarkdownDocumentParserConfig config;

		private Document.Builder currentDocumentBuilder;

		DocumentVisitor(MarkdownDocumentParserConfig config) {
			this.config = config;
		}

		/**
		 * Visits the document node and initializes the current document builder.
		 */
		@Override
		public void visit(org.commonmark.node.Document document) {
			this.currentDocumentBuilder = Document.builder();
			super.visit(document);
		}

		@Override
		public void visit(Heading heading) {
			buildAndFlush();
			super.visit(heading);
		}

		@Override
		public void visit(ThematicBreak thematicBreak) {
			if (this.config.horizontalRuleCreateDocument) {
				buildAndFlush();
			}
			super.visit(thematicBreak);
		}

		@Override
		public void visit(SoftLineBreak softLineBreak) {
			translateLineBreakToSpace();
			super.visit(softLineBreak);
		}

		@Override
		public void visit(HardLineBreak hardLineBreak) {
			translateLineBreakToSpace();
			super.visit(hardLineBreak);
		}

		@Override
		public void visit(ListItem listItem) {
			translateLineBreakToSpace();
			super.visit(listItem);
		}

		@Override
		public void visit(BlockQuote blockQuote) {
			if (!this.config.includeBlockquote) {
				buildAndFlush();
			}

			translateLineBreakToSpace();
			this.currentDocumentBuilder.metadata("category", "blockquote");
			super.visit(blockQuote);
		}

		@Override
		public void visit(Code code) {
			this.currentParagraphs.add(code.getLiteral());
			this.currentDocumentBuilder.metadata("category", "code_inline");
			super.visit(code);
		}

		@Override
		public void visit(FencedCodeBlock fencedCodeBlock) {
			if (!this.config.includeCodeBlock) {
				buildAndFlush();
			}

			translateLineBreakToSpace();
			this.currentParagraphs.add(fencedCodeBlock.getLiteral());
			this.currentDocumentBuilder.metadata("category", "code_block");
			this.currentDocumentBuilder.metadata("lang", fencedCodeBlock.getInfo());

			buildAndFlush();

			super.visit(fencedCodeBlock);
		}

		@Override
		public void visit(Text text) {
			if (text.getParent() instanceof Heading heading) {
				this.currentDocumentBuilder.metadata("category", "header_%d".formatted(heading.getLevel()))
					.metadata("title", text.getLiteral());
			}
			else {
				this.currentParagraphs.add(text.getLiteral());
			}

			super.visit(text);
		}

		public List<Document> getDocuments() {
			buildAndFlush();

			return this.documents;
		}

		private void buildAndFlush() {
			if (!this.currentParagraphs.isEmpty()) {
				String content = String.join("", this.currentParagraphs);

				Document.Builder builder = this.currentDocumentBuilder.text(content);

				this.config.additionalMetadata.forEach(builder::metadata);

				Document document = builder.build();

				this.documents.add(document);

				this.currentParagraphs.clear();
			}
			this.currentDocumentBuilder = Document.builder();
		}

		private void translateLineBreakToSpace() {
			if (!this.currentParagraphs.isEmpty()) {
				this.currentParagraphs.add(" ");
			}
		}

	}

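DocumentVisitor extends AbstractVisitor. Its visit methods append content to currentParagraphs and set the matching metadata on the current builder. buildAndFlush then builds a Document from the buffered text, provided the buffer is not empty, and on every call resets currentDocumentBuilder to a fresh Document.builder().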
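
To make the buffer-and-flush behaviour easier to follow, here is a simplified, dependency-free model of it. `FlushSketch` and its `Chunk` record are hypothetical names used only for this illustration (`Chunk` stands in for Spring AI's Document), and the `heading` method folds together what visit(Heading) and the visit(Text) call for the heading's text child do in the code above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the visitor's buffering; this is not the library code.
class FlushSketch {

    // Hypothetical stand-in for a Document: text plus metadata.
    record Chunk(String text, Map<String, Object> metadata) {}

    private final List<Chunk> chunks = new ArrayList<>();
    private final List<String> currentParagraphs = new ArrayList<>();
    private Map<String, Object> currentMetadata = new LinkedHashMap<>();

    // A heading first flushes whatever was buffered, then tags the next chunk.
    void heading(int level, String title) {
        buildAndFlush();
        currentMetadata.put("category", "header_%d".formatted(level));
        currentMetadata.put("title", title);
    }

    // Body text is only buffered; nothing is emitted yet.
    void text(String literal) {
        currentParagraphs.add(literal);
    }

    // Flush once more at the end so the trailing section is not lost.
    List<Chunk> finish() {
        buildAndFlush();
        return chunks;
    }

    private void buildAndFlush() {
        if (!currentParagraphs.isEmpty()) {
            chunks.add(new Chunk(String.join("", currentParagraphs), Map.copyOf(currentMetadata)));
            currentParagraphs.clear();
        }
        // The reset runs on every call, so metadata set by a heading
        // that has no body text is dropped by the next flush.
        currentMetadata = new LinkedHashMap<>();
    }

    static List<Chunk> demo() {
        FlushSketch sketch = new FlushSketch();
        sketch.heading(1, "Intro");   // no body text follows, so no chunk is produced for it
        sketch.heading(2, "Details");
        sketch.text("Body text.");
        return sketch.finish();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this model a heading that is immediately followed by another heading produces no chunk, because the flush finds an empty buffer and the reset discards the heading's metadata.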

Example

class MarkdownDocumentParserTest {

	@Test
	void testOnlyHeadersWithParagraphs() throws IOException {
		MarkdownDocumentParser reader = new MarkdownDocumentParser();

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/only-headers.md").getInputStream());

		assertThat(documents).hasSize(4)
			.extracting(Document::getMetadata, Document::getText)
			.containsOnly(tuple(Map.of("category", "header_1", "title", "Header 1a"),
					"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue."),
					tuple(Map.of("category", "header_1", "title", "Header 1b"),
							"Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Etiam lobortis risus libero, sed sollicitudin risus cursus in. Morbi enim metus, ornare vel lacinia eget, venenatis vel nibh."),
					tuple(Map.of("category", "header_2", "title", "Header 2b"),
							"Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero."),
					tuple(Map.of("category", "header_2", "title", "Header 2c"),
							"Ut rhoncus nec justo a porttitor. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit."));
	}

	@Test
	void testWithFormatting() throws IOException {
		MarkdownDocumentParser reader = new MarkdownDocumentParser();

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/with-formatting.md").getInputStream());

		assertThat(documents).hasSize(2)
			.extracting(Document::getMetadata, Document::getText)
			.containsOnly(tuple(Map.of("category", "header_1", "title", "This is a fancy header name"),
					"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tincidunt velit non bibendum gravida. Cras accumsan tincidunt ornare. Donec hendrerit consequat tellus blandit accumsan. Aenean aliquam metus at arcu elementum dignissim."),
					tuple(Map.of("category", "header_3", "title", "Header 3"),
							"Aenean eu leo eu nibh tristique posuere quis quis massa."));
	}

	@Test
	void testDocumentDividedViaHorizontalRules() throws IOException {
		MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
			.withHorizontalRuleCreateDocument(true)
			.build();

		MarkdownDocumentParser reader = new MarkdownDocumentParser(config);

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/horizontal-rules.md").getInputStream());

		assertThat(documents).hasSize(7)
			.extracting(Document::getMetadata, Document::getText)
			.containsOnly(tuple(Map.of(),
					"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tincidunt velit non bibendum gravida."),
					tuple(Map.of(),
							"Cras accumsan tincidunt ornare. Donec hendrerit consequat tellus blandit accumsan. Aenean aliquam metus at arcu elementum dignissim."),
					tuple(Map.of(),
							"Nullam nisi dui, egestas nec sem nec, interdum lobortis enim. Pellentesque odio orci, faucibus eu luctus nec, venenatis et magna."),
					tuple(Map.of(),
							"Vestibulum nec eros non felis fermentum posuere eget ac risus. Curabitur et fringilla massa. Cras facilisis nec nisl sit amet sagittis."),
					tuple(Map.of(),
							"Aenean eu leo eu nibh tristique posuere quis quis massa. Nullam lacinia luctus sem ut vehicula."),
					tuple(Map.of(),
							"Aenean quis vulputate mi. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Nam tincidunt nunc a tortor tincidunt, nec lobortis diam rhoncus."),
					tuple(Map.of(), "Nulla facilisi. Phasellus eget tellus sed nibh ornare interdum eu eu mi."));
	}

	@Test
	void testDocumentNotDividedViaHorizontalRulesWhenIsDisabled() throws IOException {

		MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
			.withHorizontalRuleCreateDocument(false)
			.build();
		MarkdownDocumentParser reader = new MarkdownDocumentParser(config);

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/horizontal-rules.md").getInputStream());

		assertThat(documents).hasSize(1);

		Document documentsFirst = documents.get(0);
		assertThat(documentsFirst.getMetadata()).isEmpty();
		assertThat(documentsFirst.getText()).startsWith("Lorem ipsum dolor sit amet, consectetur adipiscing elit")
			.endsWith("Phasellus eget tellus sed nibh ornare interdum eu eu mi.");
	}

	@Test
	void testSimpleMarkdownDocumentWithHardAndSoftLineBreaks() throws IOException {

		MarkdownDocumentParser reader = new MarkdownDocumentParser();

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/simple.md").getInputStream());

		assertThat(documents).hasSize(1);

		Document documentsFirst = documents.get(0);
		assertThat(documentsFirst.getMetadata()).isEmpty();
		assertThat(documentsFirst.getText()).isEqualTo(
				"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tincidunt velit non bibendum gravida. Cras accumsan tincidunt ornare. Donec hendrerit consequat tellus blandit accumsan. Aenean aliquam metus at arcu elementum dignissim.Nullam nisi dui, egestas nec sem nec, interdum lobortis enim. Pellentesque odio orci, faucibus eu luctus nec, venenatis et magna. Vestibulum nec eros non felis fermentum posuere eget ac risus.Aenean eu leo eu nibh tristique posuere quis quis massa. Nullam lacinia luctus sem ut vehicula.");
	}

	@Test
	void testCode() throws IOException {
		MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
			.withHorizontalRuleCreateDocument(true)
			.build();

		MarkdownDocumentParser reader = new MarkdownDocumentParser(config);

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/code.md").getInputStream());

		assertThat(documents).satisfiesExactly(document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of());
			assertThat(document.getText()).isEqualTo("This is a Java sample application:");
		}, document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "java", "category", "code_block"));
			assertThat(document.getText()).startsWith("package com.example.demo;")
				.contains("SpringApplication.run(DemoApplication.class, args);");
		}, document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of("category", "code_inline"));
			assertThat(document.getText()).isEqualTo(
					"Markdown also provides the possibility to use inline code formatting throughout the entire sentence.");
		}, document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of());
			assertThat(document.getText())
				.isEqualTo("Another possibility is to set block code without specific highlighting:");
		}, document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "", "category", "code_block"));
			assertThat(document.getText()).isEqualTo("./mvnw spring-javaformat:apply\n");
		});
	}

	@Test
	void testCodeWhenCodeBlockShouldNotBeSeparatedDocument() throws IOException {
		MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
			.withHorizontalRuleCreateDocument(true)
			.withIncludeCodeBlock(true)
			.build();

		MarkdownDocumentParser reader = new MarkdownDocumentParser(config);

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/code.md").getInputStream());

		assertThat(documents).satisfiesExactly(document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "java", "category", "code_block"));
			assertThat(document.getText()).startsWith("This is a Java sample application: package com.example.demo")
				.contains("SpringApplication.run(DemoApplication.class, args);");
		}, document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of("category", "code_inline"));
			assertThat(document.getText()).isEqualTo(
					"Markdown also provides the possibility to use inline code formatting throughout the entire sentence.");
		}, document -> {
			assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "", "category", "code_block"));
			assertThat(document.getText()).isEqualTo(
					"Another possibility is to set block code without specific highlighting: ./mvnw spring-javaformat:apply\n");
		});
	}

	@Test
	void testBlockquote() throws IOException {

		MarkdownDocumentParser reader = new MarkdownDocumentParser();

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/blockquote.md").getInputStream());

		assertThat(documents).hasSize(2)
			.extracting(Document::getMetadata, Document::getText)
			.containsOnly(tuple(Map.of(),
					"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue."),
					tuple(Map.of("category", "blockquote"),
							"Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero. Ut rhoncus nec justo a porttitor. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit."));
	}

	@Test
	void testBlockquoteWhenBlockquoteShouldNotBeSeparatedDocument() throws IOException {
		MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
			.withIncludeBlockquote(true)
			.build();

		MarkdownDocumentParser reader = new MarkdownDocumentParser(config);

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/blockquote.md").getInputStream());

		assertThat(documents).hasSize(1);

		Document documentsFirst = documents.get(0);
		assertThat(documentsFirst.getMetadata()).isEqualTo(Map.of("category", "blockquote"));
		assertThat(documentsFirst.getText()).isEqualTo(
				"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue. Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero. Ut rhoncus nec justo a porttitor. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit.");
	}

	@Test
	void testLists() throws IOException {

		MarkdownDocumentParser reader = new MarkdownDocumentParser();

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/lists.md").getInputStream());

		assertThat(documents).hasSize(2)
			.extracting(Document::getMetadata, Document::getText)
			.containsOnly(tuple(Map.of("category", "header_2", "title", "Ordered list"),
					"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue. Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero. Ut rhoncus nec justo a porttitor."),
					tuple(Map.of("category", "header_2", "title", "Unordered list"),
							"Aenean eu leo eu nibh tristique posuere quis quis massa. Aenean imperdiet libero dui, nec malesuada dui maximus vel. Vestibulum sed dui condimentum, cursus libero in, dapibus tortor. Etiam facilisis enim in egestas dictum."));
	}

	@Test
	void testWithAdditionalMetadata() throws IOException {
		MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
			.withAdditionalMetadata("service", "some-service-name")
			.withAdditionalMetadata("env", "prod")
			.build();

		MarkdownDocumentParser reader = new MarkdownDocumentParser(config);

		List<Document> documents = reader
			.parse(new DefaultResourceLoader().getResource("classpath:/simple.md").getInputStream());

		assertThat(documents).hasSize(1);

		Document documentsFirst = documents.get(0);
		assertThat(documentsFirst.getMetadata()).isEqualTo(Map.of("service", "some-service-name", "env", "prod"));
		assertThat(documentsFirst.getText()).startsWith("Lorem ipsum dolor sit amet, consectetur adipiscing elit.");
	}

}
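
The expected string in testSimpleMarkdownDocumentWithHardAndSoftLineBreaks contains "dignissim.Nullam" with no space between the two sentences. That follows from the visitor shown earlier: line breaks add a space to the buffer, a paragraph boundary adds nothing, and the pieces are joined with an empty separator. The sketch below models just that rule; `JoinSketch` is a hypothetical name used for illustration and the input strings are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Models only the text buffer and its join rule; this is not the library code.
class JoinSketch {

    private final List<String> buffer = new ArrayList<>();

    void text(String literal) {
        buffer.add(literal);
    }

    // Same rule as translateLineBreakToSpace: add a space only if something is already buffered.
    void lineBreak() {
        if (!buffer.isEmpty()) {
            buffer.add(" ");
        }
    }

    // Pieces are concatenated with an empty separator, as in buildAndFlush.
    String joined() {
        return String.join("", buffer);
    }

    static String demo() {
        JoinSketch sketch = new JoinSketch();
        sketch.lineBreak();              // ignored: the buffer is still empty
        sketch.text("Line one.");
        sketch.lineBreak();              // soft or hard break inside a paragraph
        sketch.text("Line two.");
        // A new paragraph starts here; no event adds a separator.
        sketch.text("Next paragraph.");
        return sketch.joined();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If downstream chunking or embedding depends on sentence boundaries, it may be worth checking whether this missing separator between paragraphs matters for your data.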

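Summary

The spring-ai-alibaba-starter-document-parser-markdown module of Spring AI Alibaba provides MarkdownDocumentParser, which parses a Markdown file into a list of Documents. It starts a new Document at every heading and, depending on MarkdownDocumentParserConfig, at horizontal rules, fenced code blocks, and blockquotes. Any additionalMetadata entries are copied onto every Document.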

