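In the previous post I explained how to integrate MiniMax to build an FAQ. A knowledge base is not limited to FAQs, though: it can also answer queries against help documents, internal training materials, and so on. Most of this material is stored as Word files with a heading hierarchy (Heading 1, Heading 1-1, etc.), and that hierarchy is not preserved in a form the AI can use, so the files need to be converted to Markdown first. Help documents have a second problem: similar sections are easy to confuse. If document A has a "Contacts" section and document B has one too, a query for contacts can return the wrong one. The fix is to put a prefix in front of every heading, for example "Document A: Contacts" and "Document B: Contacts", and I wanted the conversion program to handle that as well.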
Implementation
java
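// The original post omits imports. FileUtil is assumed to be Hutool's, StringUtils
// Apache Commons Lang 3's, and @Test JUnit 4's; adjust to your own project.
import cn.hutool.core.io.FileUtil;
import org.apache.commons.lang3.StringUtils;
import org.apache.poi.xwpf.usermodel.*;
import org.junit.Test;

import java.io.File;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MDTest {
    // Matches heading style names such as "heading 1" and captures the level.
    private static final Pattern HEADING = Pattern.compile("(?i)^heading\\s*(\\d{1,2})$");

    @Test
    public void testMD() throws Exception {
        // Prefix added to every heading so same-named sections in different documents stay distinct.
        String preHeader = "Document A: ";
        if (preHeader == null) preHeader = "";
        String path = "d:/333.docx";
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes both the stream and the document.
        try (FileInputStream is = new FileInputStream(new File(path));
             XWPFDocument document = new XWPFDocument(is)) {
            for (IBodyElement e : document.getBodyElements()) {
                if (e instanceof XWPFParagraph) {
                    appendParagraphText(sb, (XWPFParagraph) e, document, preHeader);
                } else if (e instanceof XWPFTable) {
                    appendTableText(sb, (XWPFTable) e);
                } else if (e instanceof XWPFSDT) {
                    sb.append(((XWPFSDT) e).getContent().getText());
                }
                sb.append('\n');
            }
        }
        // Write UTF-8 explicitly; the platform default charset can garble non-ASCII text.
        FileUtil.writeBytes(sb.toString().getBytes(StandardCharsets.UTF_8), "d:/aaa.md");
    }

    // Returns the heading level of the paragraph, or 0 if it is not a heading.
    private static int getHeadingLevel(XWPFDocument document, XWPFParagraph paragraph) {
        Matcher m = HEADING.matcher(getStyleName(document, paragraph).trim());
        return m.matches() ? Integer.parseInt(m.group(1)) : 0;
    }

    // Resolves the paragraph's style ID to its style name; returns "" when none is available.
    private static String getStyleName(XWPFDocument document, XWPFParagraph paragraph) {
        String styleId = paragraph.getStyleID();
        XWPFStyles styles = document.getStyles();
        if (StringUtils.isNotEmpty(styleId) && styles != null) {
            XWPFStyle style = styles.getStyle(styleId);
            if (style != null && style.getName() != null) {
                return style.getName();
            }
        }
        return "";
    }

    // Writes each table row as one line of text, with cells separated by tabs.
    private static void appendTableText(StringBuilder text, XWPFTable table) {
        for (XWPFTableRow row : table.getRows()) {
            List<ICell> cells = row.getTableICells();
            for (int i = 0; i < cells.size(); i++) {
                ICell cell = cells.get(i);
                if (cell instanceof XWPFTableCell) {
                    text.append(((XWPFTableCell) cell).getTextRecursively());
                } else if (cell instanceof XWPFSDTCell) {
                    text.append(((XWPFSDTCell) cell).getContent().getText());
                }
                if (i < cells.size() - 1) {
                    text.append("\t");
                }
            }
            text.append('\n');
        }
    }

    // Emits "#... prefix" for heading paragraphs, then the paragraph text itself.
    private static void appendParagraphText(StringBuilder text, XWPFParagraph paragraph, XWPFDocument document, String preHeader) {
        int level = getHeadingLevel(document, paragraph);
        if (level > 0) {
            appendHeader(text, level, preHeader);
        }
        for (XWPFRun run : paragraph.getRuns()) {
            text.append(run);
        }
    }

    private static void appendHeader(StringBuilder text, int level, String preHeader) {
        // Markdown defines only six heading levels, so deeper Word headings are clamped to 6.
        int hashes = Math.min(level, 6);
        for (int i = 0; i < hashes; i++) {
            text.append("#");
        }
        text.append(" ").append(preHeader);
    }
}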
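How the code works
We read the Word document with POI and walk through its body elements. For each paragraph, the getStyleName method returns the name of the paragraph's style: "heading 1" means Heading 1, "heading 2" means Heading 2, and so on. In Markdown, a level-1 heading is written as "# Title" and a level-2 heading as "## Title", so the number at the end of the style name tells us how many "#" characters to emit, followed by the prefix and the heading text. Everything else (body paragraphs, tables, content controls) is written out as plain text.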
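The mapping from style name to Markdown heading can be tried out without a Word file. The following is a minimal sketch that does not depend on POI: the class name `HeadingSketch` and the methods `headingLevel` and `toMarkdown` are made up for illustration, and it assumes style names look like "heading N", as described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of POI or of the converter above.
class HeadingSketch {
    // Matches style names such as "heading 1" or "Heading 2" and captures the level.
    private static final Pattern HEADING = Pattern.compile("(?i)^heading\\s*(\\d{1,2})$");

    /** Returns the heading level for a style name, or 0 when it is not a heading style. */
    static int headingLevel(String styleName) {
        if (styleName == null) return 0;
        Matcher m = HEADING.matcher(styleName.trim());
        return m.matches() ? Integer.parseInt(m.group(1)) : 0;
    }

    /** Renders one paragraph as a Markdown line; non-heading text passes through unchanged. */
    static String toMarkdown(String styleName, String prefix, String text) {
        int level = headingLevel(styleName);
        if (level == 0) return text;
        int hashes = Math.min(level, 6); // Markdown defines only six heading levels
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hashes; i++) sb.append('#');
        return sb.append(' ').append(prefix).append(text).toString();
    }

    public static void main(String[] args) {
        System.out.println(toMarkdown("heading 1", "Document A: ", "Contacts"));        // # Document A: Contacts
        System.out.println(toMarkdown("heading 2", "Document A: ", "Support hotline")); // ## Document A: Support hotline
        System.out.println(toMarkdown("Normal", "Document A: ", "Call the front desk.")); // Call the front desk.
    }
}
```

Matching the whole style name, rather than only stripping everything before the trailing digits, means a style called just "heading" or a custom style with an unrelated name is treated as body text instead of causing a number-parsing error.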
Result
Word version
Markdown version