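I was given a task: read the contents of a .doc file, parse them, process them, and return the result.
Reading .doc:
Most of the articles I found do essentially this: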
java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class WordImportUtils {

    /** Reads a legacy .doc file and returns its full text, or "" if reading fails. */
    public static String readWordFile(String path) {
        File file = new File(path);
        // try-with-resources closes the stream even when parsing throws
        try (FileInputStream fileInputStream = new FileInputStream(file)) {
            HWPFDocument document = new HWPFDocument(fileInputStream);
            WordExtractor extractor = new WordExtractor(document);
            return extractor.getText(); // the extractor offers several other getters as well
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        String path = "D:\\tmp\\invest.doc";
        String content = readWordFile(path);
        System.out.println(content);
    }
}
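- Notes on a few of the extractor's getter methods:
- getText(): returns a String containing the full text
- getMainTextboxText(): returns a String[] with the contents of the text boxes
- getParagraphText(): returns a String[] with one entry per paragraph
- Images are dropped and tables keep only their text, with all formatting lost. Other layout characters (line breaks, carriage returns, etc.) are preserved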
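Because line breaks and carriage returns survive extraction, the String returned by getText() usually needs a little cleanup before further processing. Here is a minimal sketch of that step, using only the JDK. The class ExtractedTextCleaner and its method toParagraphs are names I made up for illustration; they are not part of Apache POI.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Apache POI. */
final class ExtractedTextCleaner {

    /** Splits extracted text on any line-break style and drops blank lines. */
    static List<String> toParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        if (text == null) {
            return paragraphs;
        }
        // \R matches \r\n, \r and \n (available since Java 8)
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by extractor.getText()
        String sample = "Title\r\n\r\nFirst paragraph\r\nSecond paragraph\n";
        System.out.println(toParagraphs(sample));
    }
}
```

Whether blank lines should be dropped or kept as separators depends on what the later parsing needs, so treat the filtering rule as an assumption.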
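When I pasted the WordImportUtils code into my project, these two imports
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
failed to resolve.
I checked the dependencies the project already had: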
java
implementation 'org.apache.poi:poi:3.16'
implementation 'org.apache.poi:poi-ooxml:3.16'
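Looking more closely, neither of those artifacts actually contains the two classes, so I searched for WordExtractor directly.
It turned out the poi-scratchpad dependency was missing (look up the version that matches your other POI artifacts):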
java
implementation 'org.apache.poi:poi-scratchpad:3.16'
With that added, the imports resolve and the code works.
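A quick way to confirm this kind of missing-dependency problem at runtime is to ask the class loader whether the classes can be found. The sketch below uses only the JDK. ClasspathCheck is a hypothetical helper name, and the two class names it checks are the ones from the failing imports.

```java
/** Hypothetical diagnostic helper for illustration. */
final class ClasspathCheck {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The classes that need poi-scratchpad on the classpath
        String[] required = {
            "org.apache.poi.hwpf.HWPFDocument",
            "org.apache.poi.hwpf.extractor.WordExtractor"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running it without poi-scratchpad on the classpath should report both classes as missing, which points at the dependency rather than at the code.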
What about .docx or PDF files?
Reading .docx (and PDF)
java
import org.apache.commons.lang.StringUtils;
import org.apache.pdfbox.io.RandomAccessBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class WordImportUtils {

    public static void main(String[] args) throws Exception {
        String path = "D:\\tmp\\invest2.docx";
        String docx = readWord(path);
        System.out.println(docx);
    }

    /** Reads a .doc, .docx or .pdf file, choosing the parser from the file extension. */
    public static String readWord(String path) throws Exception {
        File file = new File(path);
        String suffix = StringUtils.substringAfterLast(path, ".");
        // close the stream once the text has been extracted
        try (FileInputStream fileInputStream = new FileInputStream(file)) {
            return readWord(fileInputStream, suffix);
        }
    }

    /** Returns the extracted text, or null if the suffix is not supported. */
    public static String readWord(InputStream inputStream, String suffix) throws Exception {
        // compare case-insensitively so that "DOC" or "Pdf" are accepted too
        if ("doc".equalsIgnoreCase(suffix)) {
            HWPFDocument document = new HWPFDocument(inputStream);
            WordExtractor extractor = new WordExtractor(document);
            return extractor.getText();
        } else if ("docx".equalsIgnoreCase(suffix)) {
            OPCPackage opcPackage = OPCPackage.open(inputStream);
            POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
            return extractor.getText();
        } else if ("pdf".equalsIgnoreCase(suffix)) {
            RandomAccessBuffer rab = new RandomAccessBuffer(inputStream);
            PDFParser pdfParser = new PDFParser(rab);
            pdfParser.parse();
            // PDDocument holds resources, so close it when done
            try (PDDocument document = pdfParser.getPDDocument()) {
                // total number of pages
                int pages = document.getNumberOfPages();
                PDFTextStripper stripper = new PDFTextStripper();
                // emit text sorted by its position on the page
                stripper.setSortByPosition(true);
                stripper.setStartPage(1);
                stripper.setEndPage(pages);
                return stripper.getText(document);
            }
        } else {
            return null;
        }
    }
}
Dependencies for PDF:
implementation 'org.apache.pdfbox:fontbox:2.0.22'
implementation 'org.apache.pdfbox:pdfbox:2.0.22'
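The dispatch above trusts the file extension, which breaks when a file is misnamed. One possible safeguard is to look at the first bytes of the stream instead: OLE2 files (which .doc is) start with D0 CF 11 E0 A1 B1 1A E1, ZIP files (which .docx is) start with "PK" 03 04, and PDFs start with "%PDF". The sketch below uses only the JDK. FormatSniffer, sniff and peek are hypothetical names, and the match is coarse: any OLE2 file (for example .xls) reports "doc", and any ZIP (for example .xlsx) reports "docx".

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format from a file's leading bytes. */
final class FormatSniffer {

    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04}; // "PK" 03 04
    private static final int[] PDF = {0x25, 0x50, 0x44, 0x46}; // "%PDF"

    /** Returns "doc", "docx", "pdf", or null when no signature matches. */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) return "doc";  // any OLE2 container
        if (startsWith(head, ZIP)) return "docx";  // any ZIP container
        if (startsWith(head, PDF)) return "pdf";
        return null;
    }

    /** Reads up to n leading bytes and rewinds; the stream must support mark/reset. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(n);
        byte[] head = new byte[n];
        int read = 0;
        while (read < n) {
            int count = in.read(head, read, n - read);
            if (count < 0) break; // end of stream
            read += count;
        }
        in.reset();
        return Arrays.copyOf(head, read);
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A caller could wrap the incoming stream in a BufferedInputStream, call FormatSniffer.sniff(FormatSniffer.peek(stream, 8)), and pass the result as the suffix to readWord. Falling back to the extension when the result is null is another reasonable choice.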
Reading an uploaded file
controller
java
@ApiOperation("Read a Word file")
@RequestMapping(value = "/extractWordFile", method = RequestMethod.POST)
public ResultVO extractWordFile(@RequestPart("file") MultipartFile file) throws Exception {
    Map<String, Object> data = importDocService.extractWordFile(file);
    return ResultVO.createSuccess(data);
}
service
java
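import org.apache.commons.lang.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import java.util.HashMap;
import java.util.Map;

@Service
public class ImportDocService {
    // the logger must be created for this class, not for another service
    private static final Logger logger = LoggerFactory.getLogger(ImportDocService.class);

    public Map<String, Object> extractWordFile(MultipartFile file) throws Exception {
        String fileSuffix = StringUtils.substringAfterLast(file.getOriginalFilename(), ".");
        String data = WordImportUtils.readWord(file.getInputStream(), fileSuffix);
        // extract the sub-heading and the content from data according to the actual requirements
        String subFirstFix = "";
        String content = "";
        Map<String, Object> result = new HashMap<>();
        result.put("subHead", subFirstFix);
        result.put("content", content);
        return result;
    }
}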
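The service leaves the actual parsing of the sub-heading and the content open, since it depends on the document layout. Purely as an illustration, here is a sketch under one assumed rule: the first non-blank line is the sub-heading and everything after it is the content. SubHeadSplitter and that rule are my own assumptions, so swap in whatever your documents really look like.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parsing rule for illustration only. */
final class SubHeadSplitter {

    /** Treats the first non-blank line as the sub-heading and the rest as the content. */
    static Map<String, Object> split(String data) {
        String subHead = "";
        String content = "";
        if (data != null) {
            // trim() also removes leading and trailing line breaks
            String[] parts = data.trim().split("\\R", 2);
            subHead = parts[0].trim();
            if (parts.length > 1) {
                content = parts[1].trim();
            }
        }
        Map<String, Object> result = new HashMap<>();
        result.put("subHead", subHead);
        result.put("content", content);
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by WordImportUtils.readWord(...)
        System.out.println(split("\r\nQuarterly review\r\n\r\nBody line 1\r\nBody line 2\r\n"));
    }
}
```

The returned map uses the same "subHead" and "content" keys as the service, so the result could be returned directly once the rule fits the real documents.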
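Summary:
The key to reading .doc files is pulling in poi-scratchpad correctly. This post only covers text extraction; I'll add handling for other kinds of content later if the need arises.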