A Look at Spring AI Alibaba's SentenceSplitter

This article takes a look at the SentenceSplitter in Spring AI Alibaba.

SentenceSplitter

spring-ai-alibaba-core/src/main/java/com/alibaba/cloud/ai/transformer/splitter/SentenceSplitter.java

public class SentenceSplitter extends TextSplitter {

	private final EncodingRegistry registry = Encodings.newLazyEncodingRegistry();

	private final Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);

	private static final int DEFAULT_CHUNK_SIZE = 1024;

	private final SentenceModel sentenceModel;

	private final int chunkSize;

	public SentenceSplitter() {
		this(DEFAULT_CHUNK_SIZE);
	}

	public SentenceSplitter(int chunkSize) {
		this.chunkSize = chunkSize;
		this.sentenceModel = getSentenceModel();
	}

	@Override
	protected List<String> splitText(String text) {
		SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentenceModel);
		String[] texts = sentenceDetector.sentDetect(text);
		if (texts == null || texts.length == 0) {
			return Collections.emptyList();
		}

		List<String> chunks = new ArrayList<>();
		StringBuilder chunk = new StringBuilder();
		for (int i = 0; i < texts.length; i++) {
			int currentChunkSize = getEncodedTokens(chunk.toString()).size();
			int textTokenSize = getEncodedTokens(texts[i]).size();
			if (currentChunkSize + textTokenSize > chunkSize) {
				chunks.add(chunk.toString());
				chunk = new StringBuilder(texts[i]);
			}
			else {
				chunk.append(texts[i]);
			}

			if (i == texts.length - 1) {
				chunks.add(chunk.toString());
			}
		}

		return chunks;
	}

	private SentenceModel getSentenceModel() {
		try (InputStream is = getClass().getResourceAsStream("/opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
			if (is == null) {
				throw new RuntimeException("sentence model is invalid");
			}

			return new SentenceModel(is);
		}
		catch (IOException e) {
			throw new RuntimeException(e);
		}
	}

	private List<Integer> getEncodedTokens(String text) {
		Assert.notNull(text, "Text must not be null");
		return this.encoding.encode(text).boxed();
	}

}

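SentenceSplitter extends TextSplitter. Its constructor calls getSentenceModel() to load the SentenceModel from the classpath resource /opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin. The splitText method creates a SentenceDetectorME and uses its sentDetect method to split the text into sentences, then packs consecutive sentences into chunks according to chunkSize: a sentence is appended to the current chunk as long as the combined token count (measured with the CL100K_BASE encoding) does not exceed chunkSize; otherwise the current chunk is emitted and a new chunk starts with that sentence. Note that the loop only merges sentences. A single sentence longer than chunkSize is not split any further.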
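The packing loop in splitText can be tried out without OpenNLP or jtokkit. The following is a minimal sketch that mirrors the control flow of the loop quoted above. The class name GreedySentencePacker, the pack method, and the tokenCounter parameter are hypothetical names introduced here for illustration; tokenCounter stands in for getEncodedTokens(...).size(), and the demo uses character length as a crude substitute for a real token count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

class GreedySentencePacker {

	// Packs already-detected sentences into chunks, following the same control flow
	// as the quoted splitText loop. tokenCounter is an assumed stand-in for the
	// jtokkit-based token count used by the real class.
	static List<String> pack(List<String> sentences, int chunkSize, ToIntFunction<String> tokenCounter) {
		List<String> chunks = new ArrayList<>();
		if (sentences.isEmpty()) {
			return chunks;
		}
		StringBuilder chunk = new StringBuilder();
		for (int i = 0; i < sentences.size(); i++) {
			String sentence = sentences.get(i);
			int currentSize = tokenCounter.applyAsInt(chunk.toString());
			int sentenceSize = tokenCounter.applyAsInt(sentence);
			if (currentSize + sentenceSize > chunkSize) {
				// Budget exceeded: emit the current chunk and start a new one.
				chunks.add(chunk.toString());
				chunk = new StringBuilder(sentence);
			}
			else {
				// Still within budget: append with no separator, as the quoted loop does.
				chunk.append(sentence);
			}
			// The last sentence always flushes whatever is pending.
			if (i == sentences.size() - 1) {
				chunks.add(chunk.toString());
			}
		}
		return chunks;
	}

	public static void main(String[] args) {
		// Character length is only an illustrative substitute for token counting.
		List<String> sentences = List.of("Aaaa.", "Bbbb.", "Cccc.");
		System.out.println(pack(sentences, 10, String::length)); // prints [Aaaa.Bbbb., Cccc.]
	}
}
```

Two behaviours of this control flow are worth noticing. Sentences are concatenated without any separator, and if the very first sentence already exceeds chunkSize, the still-empty chunk is emitted first: pack(List.of("Too long sentence."), 5, String::length) returns an empty string followed by the sentence. Whether such an empty chunk survives in the real class depends on how TextSplitter handles the returned list, which is not shown here.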

Example

spring-ai-alibaba-core/src/test/java/com/alibaba/cloud/ai/transformer/splitter/SentenceSplitterTests.java

class SentenceSplitterTests {

	private SentenceSplitter splitter;

	private static final int CUSTOM_CHUNK_SIZE = 100;

	@BeforeEach
	void setUp() {
		// Initialize with default chunk size
		splitter = new SentenceSplitter();
	}

	/**
	 * Test default constructor. Verifies that splitter can be created with default chunk
	 * size.
	 */
	@Test
	void testDefaultConstructor() {
		SentenceSplitter defaultSplitter = new SentenceSplitter();
		assertThat(defaultSplitter).isNotNull();
	}

	/**
	 * Test constructor with custom chunk size. Verifies that splitter can be created with
	 * specified chunk size.
	 */
	@Test
	void testCustomChunkSizeConstructor() {
		SentenceSplitter customSplitter = new SentenceSplitter(CUSTOM_CHUNK_SIZE);
		assertThat(customSplitter).isNotNull();
	}

	/**
	 * Test splitting simple sentences. Verifies basic sentence splitting functionality.
	 */
	@Test
	void testSplitSimpleSentences() {
		String text = "This is a test. This is another test. And this is a third test.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));

		assertThat(documents).isNotNull();
		assertThat(documents).hasSize(1);
		assertThat(documents.get(0).getText()).contains("This is a test", "This is another test",
				"And this is a third test");
	}

	/**
	 * Test splitting empty text. Verifies handling of empty input.
	 */
	@Test
	void testSplitEmptyText() {
		Document doc = new Document("");
		List<Document> documents = splitter.apply(Collections.singletonList(doc));
		assertThat(documents).isEmpty();
	}

	/**
	 * Test splitting text with special characters. Verifies handling of text with various
	 * punctuation and special characters.
	 */
	@Test
	void testSplitTextWithSpecialCharacters() {
		String text = "Hello, world! How are you? I'm doing great... This is a test; with various punctuation.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));

		assertThat(documents).isNotNull();
		assertThat(documents).hasSize(1);
		assertThat(documents.get(0).getText()).contains("Hello, world", "How are you", "I'm doing great",
				"This is a test");
	}

	/**
	 * Test splitting long text. Verifies handling of text that exceeds default chunk
	 * size.
	 */
	@Test
	void testSplitLongText() {
		// Generate a very long text that will exceed the default chunk size (1024
		// tokens)
		StringBuilder longText = new StringBuilder();
		String longSentence = "This is a very long sentence with many words that will contribute to the total token count and eventually force the text to be split into multiple chunks because it exceeds the default chunk size limit of 1024 tokens. ";
		// Repeat the sentence enough times to ensure we exceed the chunk size
		for (int i = 0; i < 50; i++) {
			longText.append(longSentence);
		}
		Document doc = new Document(longText.toString());

		List<Document> documents = splitter.apply(Collections.singletonList(doc));

		// Verify that the text was split into multiple documents
		assertThat(documents).isNotNull();
		assertThat(documents).hasSizeGreaterThan(1);
		// Verify that each document contains part of the original text
		documents.forEach(document -> assertThat(document.getText()).contains("This is a very long sentence"));
	}

	/**
	 * Test splitting text with multiple line breaks. Verifies handling of text with
	 * various types of line breaks.
	 */
	@Test
	void testSplitTextWithLineBreaks() {
		String text = "First sentence.\nSecond sentence.\r\nThird sentence.\rFourth sentence.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));

		assertThat(documents).isNotNull();
		assertThat(documents.get(0).getText()).contains("First sentence", "Second sentence", "Third sentence",
				"Fourth sentence");
	}

	/**
	 * Test splitting text with single character sentences. Verifies handling of very
	 * short sentences.
	 */
	@Test
	void testSplitSingleCharacterSentences() {
		String text = "A. B. C. D.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));

		assertThat(documents).isNotNull();
		assertThat(documents).hasSize(1);
		assertThat(documents.get(0).getText()).contains("A", "B", "C", "D");
	}

	/**
	 * Test splitting multiple documents. Verifies handling of multiple input documents.
	 */
	@Test
	void testSplitMultipleDocuments() {
		List<Document> inputDocs = new ArrayList<>();
		inputDocs.add(new Document("First document. With multiple sentences."));
		inputDocs.add(new Document("Second document. Also with multiple sentences."));

		List<Document> documents = splitter.apply(inputDocs);
		assertThat(documents).isNotNull();
		assertThat(documents).hasSizeGreaterThan(1);
	}

}

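Summary

Spring AI Alibaba provides SentenceSplitter, which uses OpenNLP's SentenceDetectorME to split text into sentences and then packs those sentences into chunks of at most chunkSize tokens (1024 by default). Its constructor loads the SentenceModel from /opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin, which, judging by the file name, is an English model.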
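To experiment with the sentence-detection step without the OpenNLP model on the classpath, the JDK's java.text.BreakIterator can serve as a rough stand-in. This is a different, rule-based technique and not what SentenceSplitter uses, so its sentence boundaries may differ from those of SentenceDetectorME. The class name BreakIteratorSentences and the detect method are hypothetical names used only for this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {

	// Splits text into trimmed sentences using the JDK's rule-based BreakIterator.
	// This is a stand-in for OpenNLP's SentenceDetectorME, not an equivalent.
	static List<String> detect(String text) {
		List<String> sentences = new ArrayList<>();
		BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
		it.setText(text);
		int start = it.first();
		for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
			String sentence = text.substring(start, end).trim();
			// Skip segments that contain only whitespace.
			if (!sentence.isEmpty()) {
				sentences.add(sentence);
			}
		}
		return sentences;
	}

	public static void main(String[] args) {
		System.out.println(detect("This is a test. This is another test. And this is a third test."));
	}
}
```

For the sample text from testSplitSimpleSentences this yields three sentences, and an empty string yields an empty list.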
