Counting Words in a Word Document with Java

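Counting words is a common requirement when working with Word documents. Whether you are calculating author fees, keeping an article within a length limit, or estimating how much information a text carries, you need a reliable way to get the total number of words in a document. This article shows how to count the words in a Word document in Java with the Spire.Doc for Java library.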

1. Environment Setup

Before writing any code, add the Spire.Doc for Java library to your Java project.

If the project uses Maven for dependency management, add the following to pom.xml:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>14.4.9</version>
    </dependency>
</dependencies>

For projects that do not use Maven, download the JAR file manually and add it to the project's classpath.

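2. The Basic Approach to Word Counting

Content in a Word document is organized hierarchically: a document contains sections (Section), each section contains paragraphs (Paragraph), and each paragraph contains text ranges (TextRange) or other document elements. To count words, you traverse all the text in the document and then tokenize it on separators such as spaces and punctuation.

Note that word boundaries are determined differently for Chinese and English. English text is usually split on spaces or punctuation; in Chinese text, each character is usually treated as one unit. This article focuses on the counting rules for English words.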
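To make the difference concrete, here is a minimal sketch that uses only the JDK. The class name BoundaryDemo and the method englishTokens are illustrative names, not part of Spire.Doc. It applies a whitespace-and-punctuation split to an English sentence and to a Chinese one. In Java, \s and \p{Punct} match only ASCII whitespace and punctuation by default, and Chinese text has no spaces between words, so the Chinese sentence comes back as a single token.

```java
import java.util.ArrayList;
import java.util.List;

class BoundaryDemo {
    // Splits on ASCII whitespace and punctuation and drops empty tokens.
    static List<String> englishTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[\\s\\p{Punct}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Five tokens: Hello, world, Java, is, fun
        System.out.println(englishTokens("Hello, world! Java is fun."));
        // One token: the sentence contains no ASCII separators
        System.out.println(englishTokens("今天天气很好"));
    }
}
```

This is why a rule designed for English cannot simply be reused for Chinese text; counting characters is the usual substitute.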

3. A Basic Counting Example

The following example loads a Word document, extracts all of its text, and counts the words.

import com.spire.doc.Document;

public class WordCountExample {
    public static void main(String[] args) {
        // Load the Word document
        Document doc = new Document();
        doc.loadFromFile("SampleDocument.docx");

        // Get the text content of the document
        String text = doc.getText();

        // Count the words
        int wordCount = countWords(text);

        System.out.println("Word count: " + wordCount);

        doc.close();
    }

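    /**
     * Counts the words in a text (English rules).
     * @param text the input text
     * @return the number of words
     */
    private static int countWords(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        int count = 0;
        // Split on whitespace and punctuation. split() returns an empty first
        // token when the text starts with a separator, so skip empty tokens.
        for (String word : text.trim().split("[\\s\\p{Punct}]+")) {
            if (!word.isEmpty()) {
                count++;
            }
        }
        return count;
    }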
}

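In the code above, doc.getText() returns the text of the document as a single string, including text in paragraphs and table cells. Whether header and footer text is included is worth checking against a sample document with the library version you use. The text is then split with the regular expression [\\s\\p{Punct}]+ (written here as a Java string literal), which treats any run of whitespace or ASCII punctuation as a separator, and the resulting tokens are counted.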
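One detail of String.split is worth keeping in mind when you adapt this regular expression: if the text begins with a separator (an opening quotation mark, for example), split() returns an empty string as its first element, so taking the array length directly overcounts by one. The sketch below uses only the JDK, and its class and method names are illustrative. It contrasts the bare array length with a count that skips empty tokens.

```java
class SplitPitfallDemo {
    // Counts by array length; a leading separator adds an empty first token.
    static int naiveCount(String text) {
        return text.trim().split("[\\s\\p{Punct}]+").length;
    }

    // Counts only non-empty tokens.
    static int filteredCount(String text) {
        int count = 0;
        for (String token : text.trim().split("[\\s\\p{Punct}]+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "\"Hello,\" she said.";
        System.out.println(naiveCount(text));    // 4: "", Hello, she, said
        System.out.println(filteredCount(text)); // 3: Hello, she, said
    }
}
```

Trailing empty strings are dropped by split() itself, so only the leading one needs handling.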

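4. Counting Paragraph by Paragraph

Some scenarios call for a separate word count for each paragraph. The following example iterates over every paragraph in the document and prints the word count of each one.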

import com.spire.doc.Document;
import com.spire.doc.documents.Paragraph;

public class ParagraphWordCountExample {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.loadFromFile("SampleDocument.docx");

        int totalWordCount = 0;
        int paragraphIndex = 1;

        // Iterate over all paragraphs in the document
        for (Object obj : doc.getSections()) {
            com.spire.doc.Section section = (com.spire.doc.Section) obj;
            for (Object paraObj : section.getParagraphs()) {
                Paragraph paragraph = (Paragraph) paraObj;
                String text = paragraph.getText();
                int wordCount = countWords(text);
                totalWordCount += wordCount;
                System.out.println("Paragraph " + paragraphIndex + " word count: " + wordCount);
                paragraphIndex++;
            }
        }

        System.out.println("Total word count: " + totalWordCount);
        doc.close();
    }

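    private static int countWords(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        int count = 0;
        // Skip the empty first token that split() returns for a leading separator
        for (String word : text.trim().split("[\\s\\p{Punct}]+")) {
            if (!word.isEmpty()) {
                count++;
            }
        }
        return count;
    }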
}

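5. Handling Text in Tables

Tables in a Word document also contain text that needs to be counted. If you only iterate over paragraphs, the text inside tables is missed. The following example counts words in both paragraphs and tables.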

import com.spire.doc.Document;
import com.spire.doc.Table;
import com.spire.doc.documents.Paragraph;

public class FullWordCountExample {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.loadFromFile("DocumentWithTable.docx");

        int totalWordCount = 0;

        for (Object obj : doc.getSections()) {
            com.spire.doc.Section section = (com.spire.doc.Section) obj;

            // Count words in the section's body paragraphs
            for (Object paraObj : section.getParagraphs()) {
                Paragraph paragraph = (Paragraph) paraObj;
                totalWordCount += countWords(paragraph.getText());
            }

            // Count words in the section's tables (tables are accessed per section)
            for (Object tableObj : section.getTables()) {
                Table table = (Table) tableObj;
                for (Object rowObj : table.getRows()) {
                    com.spire.doc.TableRow row = (com.spire.doc.TableRow) rowObj;
                    for (Object cellObj : row.getCells()) {
                        com.spire.doc.TableCell cell = (com.spire.doc.TableCell) cellObj;
                        for (Object paraObj : cell.getParagraphs()) {
                            Paragraph paragraph = (Paragraph) paraObj;
                            totalWordCount += countWords(paragraph.getText());
                        }
                    }
                }
            }
        }

        System.out.println("Total word count (including tables): " + totalWordCount);
        doc.close();
    }

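    private static int countWords(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        int count = 0;
        // Skip the empty first token that split() returns for a leading separator
        for (String word : text.trim().split("[\\s\\p{Punct}]+")) {
            if (!word.isEmpty()) {
                count++;
            }
        }
        return count;
    }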
}

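6. Counting Words in a Specific Section

For documents with multiple sections (a book divided into chapters, for example), you may need the word count of a single section. The following example counts the words in the first section.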

import com.spire.doc.Document;
import com.spire.doc.Section;

public class SectionWordCountExample {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.loadFromFile("MultiSectionDocument.docx");

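        // Get the first section
        Section firstSection = doc.getSections().get(0);

        // Sum the word counts of the section's body paragraphs
        // (text inside tables is not included here)
        int wordCount = 0;
        for (Object paraObj : firstSection.getParagraphs()) {
            com.spire.doc.documents.Paragraph paragraph = (com.spire.doc.documents.Paragraph) paraObj;
            wordCount += countWords(paragraph.getText());
        }

        System.out.println("Word count of the first section: " + wordCount);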
        
        doc.close();
    }

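    private static int countWords(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        int count = 0;
        // Skip the empty first token that split() returns for a leading separator
        for (String word : text.trim().split("[\\s\\p{Punct}]+")) {
            if (!word.isEmpty()) {
                count++;
            }
        }
        return count;
    }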
}

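7. Handling Chinese and Mixed-Language Text

For Chinese text, or documents that mix Chinese and English, the counting rules need to be adjusted. Chinese is usually measured in characters rather than words. The following example counts Chinese characters and English words separately.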

import com.spire.doc.Document;

public class MixedLanguageCountExample {
    
    public static void main(String[] args) {
        Document doc = new Document();
        doc.loadFromFile("MixedLanguageDocument.docx");
        
        String text = doc.getText();
        
        int englishWordCount = countEnglishWords(text);
        int chineseCharCount = countChineseCharacters(text);
        
        System.out.println("English word count: " + englishWordCount);
        System.out.println("Chinese character count: " + chineseCharCount);
        
        doc.close();
    }
    
    private static int countEnglishWords(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        // Split on anything that is not an ASCII letter; each remaining run of letters is a word
        String[] words = text.split("[^a-zA-Z]+");
        int count = 0;
        for (String w : words) {
            if (!w.isEmpty()) {
                count++;
            }
        }
        return count;
    }
    
    private static int countChineseCharacters(String text) {
        if (text == null) {
            return 0;
        }
        int count = 0;
        for (char c : text.toCharArray()) {
            // Check for a Chinese character (CJK Unified Ideographs block, U+4E00 to U+9FFF)
            if (c >= 0x4E00 && c <= 0x9FFF) {
                count++;
            }
        }
        return count;
    }
}
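The range check above covers the basic CJK Unified Ideographs block only. If a document may contain rarer characters from the extension blocks, one alternative is to classify by Unicode script instead of by a fixed range. The following is a minimal sketch using only the JDK. The class name HanCounter is illustrative, and this technique is a substitute for the range check rather than something the original example uses. It iterates over code points, so characters outside the Basic Multilingual Plane are counted once instead of being seen as two surrogate chars.

```java
class HanCounter {
    // Counts code points whose Unicode script is Han.
    static int countHan(String text) {
        if (text == null) {
            return 0;
        }
        return (int) text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    public static void main(String[] args) {
        // Latin letters and the full-width comma are not Han characters
        System.out.println(countHan("Java 很好用，OK"));
        // U+20000 lies in CJK Extension B and is written as a surrogate pair
        System.out.println(countHan("\uD840\uDC00"));
    }
}
```

Note that the Han script also includes a few non-ideograph symbols such as the iteration mark 々, so the two methods can differ slightly on real text.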

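8. Caveats

A few issues are worth keeping in mind when counting words in documents:

Special formatting characters: The text returned by getText() may contain Word control characters such as page breaks and paragraph marks. They are not words themselves, but any that the regular expression does not treat as a separator stays attached to a neighboring token. The \s class covers line breaks, tabs, and form feeds; other control characters may need to be added to the separator set.
Hyphens and contractions: Hyphenated words (such as state-of-the-art) and contractions (such as don't) are split into several words by the simple rule used here, because hyphens and apostrophes count as punctuation. For more precise results, consider a more sophisticated natural language processing library.
Numbers and symbols: With the current regular expression, digits are not separators, so a number such as 2024 counts as one word, while a decimal such as 3.14 is split at the period and counts as two. To exclude numbers, adjust the regular expression.
Performance: For very long documents (hundreds of pages or more), getText() returns a large string and may use a lot of memory. Counting paragraph by paragraph or element by element is more memory-efficient.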
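The hyphen, contraction, and number caveats can be addressed without an NLP library by matching words instead of splitting on separators. The sketch below shows one possible rule set under stated assumptions, not a definitive tokenizer: a word is a run of ASCII letters, optionally joined by single hyphens or apostrophes, and a flag decides whether digits are allowed. The class name TokenRules and the includeNumbers parameter are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenRules {
    // Letters, optionally joined by a hyphen, a straight apostrophe, or a curly apostrophe (U+2019)
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z]+(?:['\u2019-][A-Za-z]+)*");
    // Same rule, but digits may appear in a token as well
    private static final Pattern WORD_OR_NUMBER =
            Pattern.compile("[A-Za-z0-9]+(?:['\u2019-][A-Za-z0-9]+)*");

    static int count(String text, boolean includeNumbers) {
        if (text == null) {
            return 0;
        }
        Matcher matcher = (includeNumbers ? WORD_OR_NUMBER : WORD).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "It's", "state-of-the-art", and "don't" each count once: 7 words in total
        System.out.println(count("It's a state-of-the-art design, don't you think?", false));
        // "2024" is skipped without the flag (2) and counted with it (3)
        System.out.println(count("Released in 2024.", false));
        System.out.println(count("Released in 2024.", true));
    }
}
```

Matching has the side benefit that leading or trailing punctuation never produces empty tokens. Tokens that mix digits and letters, such as 3rd, still need a decision of their own.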

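Summary

As the examples show, counting the words in a Word document with Java comes down to three steps: load the document, extract the text, and tokenize and count. Depending on your requirements, you can count the whole document, count paragraph by paragraph, include table content, or count Chinese characters and English words separately in mixed-language documents. These approaches cover most everyday word-count scenarios for Word documents. Keep in mind that the precision of the tokenization rules directly determines the accuracy of the result; for cases that demand high precision, such as formal publishing or fee calculation, adjust the rules to the characteristics of the actual text. For batch processing of multiple documents or scheduled counting jobs, the logic above can be wrapped in reusable utility methods and integrated into a larger application.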
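As a rough illustration of such a wrapper, the sketch below separates text extraction from counting so that the counting logic can be reused and tested without a document library. All names here (BatchWordCounter, countAll, extractText) are hypothetical. In real use, the extractText function would load each file and return its text, for example with Document.loadFromFile and doc.getText() as in the basic example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class BatchWordCounter {
    // Counts non-empty tokens after splitting on whitespace and ASCII punctuation.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        int count = 0;
        for (String word : text.split("[\\s\\p{Punct}]+")) {
            if (!word.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Maps each path to its word count, preserving the input order.
    // extractText is supplied by the caller and turns a path into plain text.
    static Map<String, Integer> countAll(List<String> paths,
                                         Function<String, String> extractText) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String path : paths) {
            result.put(path, countWords(extractText.apply(path)));
        }
        return result;
    }
}
```

Passing the extractor as a parameter keeps the utility independent of any one library and makes it easy to substitute fixed strings in unit tests.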