To read a docx file with Apache POI, first add the Maven dependencies:
xml
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.0.0</version>
</dependency>
You can then iterate over the document's body elements to get their content (paragraphs and tables), as follows:
java
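import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFPicture;
import org.apache.poi.xwpf.usermodel.XWPFTable;

// try-with-resources closes the stream and the document even if reading fails
try (FileInputStream fis = new FileInputStream("xxxx.docx");
     XWPFDocument document = new XWPFDocument(fis)) {
    // Iterate over all body elements of the document (paragraphs and tables)
    List<IBodyElement> bodyElements = document.getBodyElements();
    for (IBodyElement bodyElement : bodyElements) {
        if (bodyElement instanceof XWPFParagraph) {
            XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
            System.out.println(paragraph.getStyleID() + ":" + paragraph.getText());
        } else if (bodyElement instanceof XWPFTable) {
            System.out.println(((XWPFTable) bodyElement).getText());
        } else if (bodyElement instanceof XWPFPicture) {
            // Never reached: XWPFPicture is not an IBodyElement, so pictures are not body elements
            System.out.println(Arrays.toString(((XWPFPicture) bodyElement).getPictureData().getData()));
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}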
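It turns out that this traversal never reaches the pictures, because they are stored inside the runs of a paragraph rather than as body elements. The getAllPictures method does return every picture, but it loses the order between paragraphs and pictures. After some exploration, I corrected the traversal: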
java
for (IBodyElement element : bodyElements) {
    if (element instanceof XWPFParagraph) {
        XWPFParagraph paragraph = (XWPFParagraph) element;
        String text = paragraph.getText();
        if (text != null && !text.isEmpty()) {
            // Handle a heading or body-text paragraph
        } else {
            // No text: walk the runs in order and collect their embedded pictures.
            // Note: pictures in a paragraph that also contains text are skipped by this branch.
            paragraph.getIRuns().forEach(run -> {
                if (run instanceof XWPFRun) {
                    XWPFRun xWPFRun = (XWPFRun) run;
                    for (XWPFPicture picture : xWPFRun.getEmbeddedPictures()) {
                        XWPFPictureData pictureData = picture.getPictureData();
                        // Assumes PNG; the MIME type is hard-coded here
                        String base64Image = "<img src='data:image/png;base64,"
                                + Base64.getEncoder().encodeToString(pictureData.getData()) + "'/>";
                    }
                }
            });
        }
    } else if (element instanceof XWPFTable) {
        // Handle a table
        XWPFTable table = (XWPFTable) element;
        String text = table.getText();
    }
}
With this traversal, the content of the docx file can be read in document order.
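One detail worth tightening: the corrected loop hard-codes `image/png` in the data URI, while a docx file may just as well embed JPEG or GIF images. Below is a minimal sketch of a helper that guesses the MIME type from the leading magic bytes before building the `<img>` tag. The class name `ImageDataUri` and its methods are hypothetical and not part of Apache POI; the helper uses only the JDK, and falling back to `image/png` for unrecognized data is an assumption that mirrors the original code.

```java
import java.util.Base64;

// Hypothetical helper (not part of Apache POI): builds an <img> tag with a data URI
// whose MIME type is guessed from the image's leading magic bytes.
final class ImageDataUri {
    private ImageDataUri() {}

    /** Guesses the MIME type; falls back to image/png (assumption, as in the loop above). */
    static String sniffMimeType(byte[] data) {
        if (startsWith(data, 0x89, 'P', 'N', 'G')) return "image/png";
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) return "image/jpeg";
        if (startsWith(data, 'G', 'I', 'F', '8')) return "image/gif";
        return "image/png";
    }

    /** Encodes the raw image bytes as a Base64 data URI inside an <img> tag. */
    static String toImgTag(byte[] data) {
        String base64 = Base64.getEncoder().encodeToString(data);
        return "<img src='data:" + sniffMimeType(data) + ";base64," + base64 + "'/>";
    }

    // Compares the first bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Inside the picture loop, the string concatenation could then be replaced by `ImageDataUri.toImgTag(pictureData.getData())`.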