word格式原理与编号解析

文章目录

开始

相信很多朋友有出来word的需求,比如Word转PDF,Word转Markdown等。虽然现在AI已经非常强了,但是使用AI转了之后我们很多时候还是需要去校验一下文字对不对。

怎么出来这类需求呢?总不能肉眼一个一个去看吧,有时候几十、几百甚至几千上万个文件,大脑都能整得宕机。

这个时候,我们就可以使用poi工具来处理。

但实际上Word格式非常复杂,这也让poi的接口非常复杂,很难全部记忆。

有什么好的方法能处理这个问题呢?

有,就是理解Word格式。

问题引入

我们先来看一个实际问题,我们有一批pdf,是通过Word转换来,因为是合同性质的资料,我们必须确保它一个字都不能变。

这其中一个很重要的问题就是编号,Word编号是单独处理的,不能简单处理。

java 复制代码
public static String readDocxText(String filePath) throws IOException {
    StringBuilder textBuilder = new StringBuilder();
    try (XWPFDocument document = new XWPFDocument(new FileInputStream(filePath))) {
        for (XWPFParagraph paragraph : document.getParagraphs()) {
            String paragraphText = paragraph.getText();
            if (!paragraphText.trim().isEmpty()) {
                textBuilder.append(paragraphText).append("\n");
            }
        }
    }
    return textBuilder.toString();
}

@Test
public void printText() throws IOException {
    String text = readDocxText("C:\\Users\\trayv\\Downloads\\bak\\title.docx");
    System.out.println(text);
}

怎么办呢?

docx文件格式

我们首先来了解一下docx的文件格式,否则看到下面这类代码,指定蒙圈。

java 复制代码
XWPFNum xwpfNum = numbering.getNum(numID);
numStart = xwpfNum.getCTNum().getLvlOverrideArray()[0].getStartOverride().getVal();
CTDecimalNumber abstractNumId = xwpfNum.getCTNum().getAbstractNumId();
XWPFAbstractNum abstractNum = numbering.getAbstractNum(abstractNumId.getVal());
CTAbstractNum ctAbstractNum = abstractNum.getCTAbstractNum();
CTLvl ctLvl = ctAbstractNum.getLvlArray(0);
format = ctLvl.getNumFmt().getVal().toString();
lvlText = ctLvl.getLvlText().getVal();

不清楚docx格式的朋友,现在可能想问,这都是什么啊?

没关系,我们来看一下docx的格式就就会清晰很多。

docx其实和jar、epub这类文件一样本质上是一个zip文件。

其中核心在word文件夹下:

我们来看一下document的内容:

现在,知道.getPPr().getNumPr()是啥了吧,其实就是获取标签对应的对象。

以后,我们使用poi的时候,就可以对着xml文件去看有没有对应api就可以了。

就这么简单吗?

肯定不是,其实非常复杂,有些虽然它自己没有编号,但是它的style可能会有对应的编号。

然后我们需要在numbering.xml文件中找对应的信息:

看着头大吧,头大就对了,其实实际远比这复杂,还有很多关联关系。

还好poi已经为我们处理了大部分问题,接下来我们来看实际应用。

应用实例

这里只是一个简化版,没有处理多个ilvl、重编号、全部的编号类型等情况,不过基本够用了,如果实际情况有出入,相信有前面的知识,也能知道怎么去处理。

java 复制代码
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * 编号枚举:
 * org.openxmlformats.schemas.wordprocessingml.x2006.main.STNumberFormat.Enum
 * chineseCountingThousand: 一
 * japaneseCounting: 一
 */
public class WordPdfCompareHelper {

    public static void generateCompareText(String pdfPathStr, String docxPathStr, String outDirStr) throws IOException {
        String pdfText = readPdfText(pdfPathStr);
        String docxText = readDocxText(docxPathStr);

        Path pdfTxtPath = Paths.get(outDirStr, "pdf", getFileName(pdfPathStr) + ".txt");
        File parent = pdfTxtPath.getParent().toFile();
        if (!parent.exists()) {
            if (!parent.mkdirs()) {
                return;
            }
        }
        Path docxTxtPath = Paths.get(outDirStr, "docx", getFileName(docxPathStr) + ".txt");
        parent = docxTxtPath.getParent().toFile();
        if (!parent.exists()) {
            if (!parent.mkdirs()) {
                return;
            }
        }
        Files.writeString(pdfTxtPath, pdfText);
        Files.writeString(docxTxtPath, docxText);
    }

    private static String getFileName(String pathStr) {
        File file = new File(pathStr);
        String fileName = file.getName();
        int index = fileName.lastIndexOf(".");
        if (index == -1) {
            return fileName;
        } else {
            return fileName.substring(0, index);
        }
    }

    public static String readDocxText(String filePathStr) throws IOException {
//        return getWordNumText(filePathStr, getDefaultPredicate(), getDefaultContentFunction());
        return getWordNumText(filePathStr, null, null);
    }

    /**
     *
     * @param filePathStr      文件路径
     * @param contentFilter    内容过滤器,例如跳过空行
     * @param contentConverter 内容转换器,例如需要把全角替换为半角等
     * @return 实际文件内容
     * @throws IOException IO 异常信息
     */
    public static String getWordNumText(
            String filePathStr,
            Predicate<String> contentFilter,
            Function<String, String> contentConverter) throws IOException {
        try (FileInputStream fis = new FileInputStream(filePathStr);
             XWPFDocument doc = new XWPFDocument(fis)
        ) {
            XWPFNumbering numbering = doc.getNumbering();
            XWPFStyles styles = doc.getStyles();

            StringBuilder sb = new StringBuilder();

            Map<String, Integer> numMap = new HashMap<>();
            for (XWPFParagraph paragraph : doc.getParagraphs()) {
                String paragraphText = paragraph.getText();
                if (contentFilter != null) {
                    if (!contentFilter.test(paragraphText)) {
                        continue;
                    }
                }
                String numText = "";
                BigInteger numID = paragraph.getNumID();
                BigInteger numStart;
                String format;
                String lvlText;
                if (numID != null) {
                    BigInteger level = paragraph.getNumIlvl();
                    if (level != null) {
                        format = paragraph.getNumFmt();
                        lvlText = paragraph.getNumLevelText();
                        numStart = paragraph.getNumStartOverride();
                    } else {
//                        w:num
                        XWPFNum xwpfNum = numbering.getNum(numID);
                        numStart = xwpfNum.getCTNum().getLvlOverrideArray()[0].getStartOverride().getVal();
                        CTDecimalNumber abstractNumId = xwpfNum.getCTNum().getAbstractNumId();
                        XWPFAbstractNum abstractNum = numbering.getAbstractNum(abstractNumId.getVal());
                        CTAbstractNum ctAbstractNum = abstractNum.getCTAbstractNum();
                        CTLvl ctLvl = ctAbstractNum.getLvlArray(0);
                        format = ctLvl.getNumFmt().getVal().toString();
                        lvlText = ctLvl.getLvlText().getVal();
                    }
                    System.out.println("numId:" + numID + " level:" + level + " format:" + format + " getNumLevelText:"
                            + lvlText + " numStart:" + numStart + " text:" + paragraphText);
                    String key = numID.toString();
                    Integer num = numMap.get(key);
                    if (num == null) {
                        num = 1;
                    } else {
                        num = num + 1;
                    }
                    numMap.put(key, num);
                    numText = formatNumText(format, num);
                    numText = lvlText.replaceFirst("%1", numText);
                } else {
                    String styleID = paragraph.getStyleID();
                    XWPFStyle style = styles.getStyle(styleID);
                    if (style != null) {
                        // styles.xml w:style w:pPr w:numPr
                        CTNumPr numPr = style.getCTStyle().getPPr().getNumPr();
                        if (numPr != null) {
                            CTDecimalNumber numId = numPr.getNumId();
                            BigInteger ilvl = numPr.getIlvl().getVal();

                            XWPFNum xwpfNum = numbering.getNum(numId.getVal());

                            CTNum ctNum = xwpfNum.getCTNum();

                            CTDecimalNumber abstractNumId = ctNum.getAbstractNumId();
                            XWPFAbstractNum abstractNum = numbering.getAbstractNum(abstractNumId.getVal());

                            CTLvl ctLvl = abstractNum.getCTAbstractNum().getLvlArray(0);
                            String formatName = ctLvl.getNumFmt().getVal().toString();
                            String numFormat = ctLvl.getLvlText().getVal();
                            System.out.println("style--->" + styleID +
                                    " numId:" + numId.getVal() +
                                    " numPr.getIlvl().getVal():" + ilvl +
                                    " getNumFmt:" + formatName +
                                    " ctLvl.getLvlText().getVal():" + numFormat +
                                    " numStart:" + ctLvl.getStart().getVal() +
                                    " text:" + paragraphText);
                            String key = numId.toString();
                            Integer num = numMap.get(key);
                            if (num == null) {
                                num = 1;
                            } else {
                                num = num + 1;
                            }
                            numMap.put(key, num);
                            numText = formatNumText(formatName, num);
                            numText = numFormat.replaceFirst("%1", numText);
                        }
                    }
                    if (contentConverter != null) {
                        paragraphText = contentConverter.apply(paragraphText);
                    }
                }
                sb.append(numText).append(paragraphText).append("\n");
            }
            return reLine(sb.toString());
        }
    }

    public static String formatNumText(String textFormat, int num) {
        if (textFormat.equals("chineseCountingThousand")
                || textFormat.equals("japaneseCounting")
                || textFormat.equals("chineseCounting")
        ) {
            return convertToChinese(num);
        }
        return String.valueOf(num);
    }

    public static String simpleNumText(XWPFParagraph paragraph, Map<String, Integer> numMap) {
        BigInteger numID = paragraph.getNumID();
        String text = paragraph.getText();
        if (numID == null) {
            return text;
        }
        String getNumLevelText = paragraph.getNumLevelText();
        String key = numID.toString();
        Integer num = numMap.get(key);
        if (num == null) {
            num = 1;
        } else {
            num = num + 1;
        }
        numMap.put(key, num);
        String numText = convertToChinese(num);
        numText = getNumLevelText.replaceFirst("%1", numText);
        return numText + text;
    }

    private static Function<String, String> getDefaultContentFunction() {
        return (in) -> {
            in = in.replaceAll("\\s+", "");
            // \R 会匹配 \n, \r, \r\n 等所有换行符
//            in = in.replaceAll("\\R", "");
            in = in.replaceAll(".", ".");
            return in;
        };
    }

    private static Predicate<String> getDefaultPredicate() {
        return (in) -> !in.isEmpty();
    }

    private static String reLine(String content) {
        // 因为转 pdf 之后行段落会变,所以使用新规则重新分行
//        String[] lines = content.split("[。!]");
        String[] lines = content.split("[。!\n]");
        List<String> list = Arrays.stream(lines).filter(getDefaultPredicate())
                .map(getDefaultContentFunction()).toList();
        return String.join("\n", list);
    }

    /**
     * 读取 PDF全部文本
     *
     * @param filePath PDF 文件路径
     * @return 提取的文本内容
     * @throws IOException 读取异常
     */
    public static String readPdfText(String filePath) throws IOException {
        try (PDDocument document = Loader.loadPDF(new File(filePath))) {
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
            }
            PDFTextStripper textStripper = new PDFTextStripper();
            textStripper.setLineSeparator("\n");
            textStripper.setSortByPosition(true);
            return reLine(textStripper.getText(document));
        }
    }

    private static final String[] NUMBERS = {"零", "一", "二", "三", "四", "五", "六", "七", "八", "九"};
    private static final String[] UNITS = {"", "十", "百", "千"};

    /**
     * 阿拉伯数字转中文数字(仅支持0-9999)
     *
     * @param number 待转换数字(0-9999)
     * @return 符合中文习惯的中文数字字符串
     */
    public static String convertToChinese(int number) {
        if (number < 0 || number > 9999) {
            throw new IllegalArgumentException("仅支持转换0-9999之间的数字");
        }

        if (number == 0) {
            return NUMBERS[0];
        }

        char[] numChars = String.valueOf(number).toCharArray();
        int len = numChars.length;
        StringBuilder sb = new StringBuilder();
        boolean needZero = false;

        for (int i = 0; i < len; i++) {
            int digit = numChars[i] - '0';
            int pos = len - 1 - i;
            if (digit == 0) {
                if (pos != 0 && !needZero) {
                    needZero = true;
                }
            } else {
                if (needZero) {
                    sb.append(NUMBERS[0]);
                    needZero = false;
                }
                sb.append(NUMBERS[digit]).append(UNITS[pos]);
            }
        }

        String result = sb.toString();
        if (result.startsWith("一十")) {
            result = result.substring(1);
        }
        return result;
    }
}
java 复制代码
@Test
void generateCompareText() throws IOException {
    String pdfPathStr = "C:\\Users\\trayv\\Downloads\\bak\\title.pdf";
    String docxPathStr = "C:\\Users\\trayv\\Downloads\\bak\\title.docx";
    String outDir = "D:\\tmp\\compare";
    WordPdfCompareHelper.generateCompareText(pdfPathStr,docxPathStr,outDir);
}

处理之后,我们可以通过:WinMerge比较

pom.xml

xml 复制代码
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>vip.meet</groupId>
    <artifactId>poi-learn</artifactId>
    <version>1.0.0</version>

    <properties>
        <java.version>17</java.version>
        <poi.version>5.5.1</poi.version>
        <pdfbox.version>3.0.6</pdfbox.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>fontbox</artifactId>
            <version>${pdfbox.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>${pdfbox.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox-tools</artifactId>
            <version>${pdfbox.version}</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>33.5.0-jre</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-collections4</artifactId>
            <version>4.5.0</version>
        </dependency>
        <dependency>
            <groupId>org.dom4j</groupId>
            <artifactId>dom4j</artifactId>
            <version>2.1.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.20.0</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba.fastjson2</groupId>
            <artifactId>fastjson2</artifactId>
            <version>2.0.60</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.21.2</version>
        </dependency>
        <dependency>
            <groupId>com.belerweb</groupId>
            <artifactId>pinyin4j</artifactId>
            <version>2.5.1</version>
        </dependency>
        <dependency>
            <groupId>org.yaml</groupId>
            <artifactId>snakeyaml</artifactId>
            <version>2.5</version>
        </dependency>

        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.4.8</version>
        </dependency>

        <!-- 对应版本:https://projectlombok.org/changelog-->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.38</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-engine</artifactId>
            <version>5.9.2</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.12.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
相关推荐
huluang2 小时前
高性能Word文档批注处理器的设计与实现
开发语言·c#·word
gc_22994 小时前
学习C#调用OpenXml操作word文档的基本用法(12:读取文档字体表)
word·openxml·字体表
weixin_462446231 天前
【原创实践】Python 将 Markdown 文件转换为 Word(docx)完整实现
开发语言·python·word
_修铁路的1 天前
【Poi-tl】 Word模板填充导出
java·word·poi-tl
gc_22992 天前
学习C#调用OpenXml操作word文档的基本用法(11:操作文档缩略图)
word·缩略图·openxml
无敌的黑星星2 天前
office 批量word转pdf
pdf·word·vba
CodeCraft Studio2 天前
国产化Word处理控件Spire.Doc教程:使用C# 编程方式批量转换Word为RTF
开发语言·c#·word·spire.doc·word文档转换·word开发组件·word api库
console.log('npc')2 天前
vue3文件上传弹窗,图片pdf,word,结合预览kkview
前端·javascript·vue.js·pdf·word
程序员柒叔2 天前
Dify 工作流组件开发指南
大模型·word·workflow·工作流·dify