Batch-Converting Word Files to HTML: A Python Implementation Guide

Overview
Features
Implementation
Code walkthrough
- 1. The docx_to_html function
- 2. The batch_convert_to_html function
- 3. Main program

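Overview

In day-to-day work we sometimes need to convert a large number of Word documents (.docx) to HTML files, for example to display their content on a web page or to support online reading. This article shares a practical Python tool that batch-converts every Word file in a folder to HTML while preserving formatting such as paragraph indentation, bold, and italics.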

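Features

Converts a single Word file to HTML.

Batch-processes the Word files in a folder.

Preserves paragraph formatting (left and right indents, first-line indent).

Supports bold, italic, and underlined text.

Converts the text content of tables in Word documents.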

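Implementation

The complete implementation is shown below. It depends on the python-docx package (pip install python-docx), which provides the docx module.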

```python
import os
from html import escape

from docx import Document  # provided by the python-docx package


def _to_px(length):
    """Convert a python-docx Length to whole CSS pixels (1 pt = 96/72 px)."""
    return round(length.pt * 96 / 72)


def docx_to_html(docx_path):
    """Convert one Word file to an HTML string, keeping paragraphs, indents, and basic text styles."""
    document = Document(docx_path)
    parts = ["<html><head><meta charset='utf-8'></head><body>"]

    for paragraph in document.paragraphs:
        # Indents set directly on the paragraph; each is None when not set.
        fmt = paragraph.paragraph_format

        # Translate the indents into inline CSS declarations.
        styles = []
        if fmt.left_indent:
            styles.append(f"margin-left: {_to_px(fmt.left_indent)}px;")
        if fmt.right_indent:
            styles.append(f"margin-right: {_to_px(fmt.right_indent)}px;")
        if fmt.first_line_indent:
            styles.append(f"text-indent: {_to_px(fmt.first_line_indent)}px;")

        style_attribute = f" style='{' '.join(styles)}'" if styles else ""

        # Wrap each run in tags for bold, italic, and underline.
        # These properties are None when inherited from a style, so only
        # formatting applied directly to the run is detected.
        content = ""
        for run in paragraph.runs:
            run_text = escape(run.text)
            if run.bold:
                run_text = f"<b>{run_text}</b>"
            if run.italic:
                run_text = f"<i>{run_text}</i>"
            if run.underline:
                run_text = f"<u>{run_text}</u>"
            content += run_text

        # Wrap the runs in a paragraph element.
        parts.append(f"<p{style_attribute}>{content}</p>")

    # Tables are handled after all paragraphs, so they end up at the end of the page.
    for table in document.tables:
        parts.append("<table border='1' style='border-collapse: collapse; width: 100%;'>")
        for row in table.rows:
            cells = "".join(f"<td>{escape(cell.text)}</td>" for cell in row.cells)
            parts.append(f"<tr>{cells}</tr>")
        parts.append("</table>")

    parts.append("</body></html>")
    return "".join(parts)


def batch_convert_to_html(input_folder, output_folder):
    """Convert every .docx file in input_folder to an HTML file in output_folder."""
    os.makedirs(output_folder, exist_ok=True)

    for filename in os.listdir(input_folder):
        # Match the extension case-insensitively and skip the "~$" lock files
        # that Word creates while a document is open.
        if not filename.lower().endswith(".docx") or filename.startswith("~$"):
            continue

        input_path = os.path.join(input_folder, filename)
        output_path = os.path.join(output_folder, f"{os.path.splitext(filename)[0]}.html")

        try:
            html_content = docx_to_html(input_path)
            with open(output_path, "w", encoding="utf-8") as html_file:
                html_file.write(html_content)
            print(f"Converted: {filename} -> {output_path}")
        except Exception as e:
            # Report the failure and continue with the next file.
            print(f"Failed to convert: {filename}, error: {e}")


if __name__ == "__main__":
    # Set the input and output folder paths.
    input_folder = r"input_path"  # replace with the folder that holds the Word documents
    output_folder = r"output_path"  # replace with the folder that receives the HTML files

    # Run the batch conversion.
    batch_convert_to_html(input_folder, output_folder)
```

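Code walkthrough

1. The docx_to_html function

Purpose: convert a single Word file to HTML.
Paragraph styles:
-- Read the paragraph's left, right, and first-line indents from paragraph_format.
-- Emit them as inline CSS (margin-left, margin-right, text-indent).
Text styles:
-- Check each run's bold, italic, and underline properties and wrap the text in the matching HTML tags.
Tables:
-- Convert each Word table to a bordered HTML table. Tables are processed after the paragraph loop, so they appear at the end of the HTML rather than at their original position in the document.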
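The three steps above can be tried without a Word file by isolating them as pure functions. The following is a minimal sketch, not the article's code: the helper names indent_style, run_to_html, and table_to_html are hypothetical, and they take plain numbers and strings instead of python-docx objects.

```python
from html import escape


def indent_style(left_pt=None, right_pt=None, first_line_pt=None):
    """Build an inline style attribute from indents given in points.

    Returns an empty string when no indent is set. CSS defines 1 pt as
    1/72 in and 1 px as 1/96 in, hence the 96/72 factor.
    """
    declarations = []
    pairs = (
        ("margin-left", left_pt),
        ("margin-right", right_pt),
        ("text-indent", first_line_pt),
    )
    for prop, value in pairs:
        if value:  # None and 0 both produce no declaration
            declarations.append(f"{prop}: {round(value * 96 / 72)}px;")
    if not declarations:
        return ""
    return " style='" + " ".join(declarations) + "'"


def run_to_html(text, bold=False, italic=False, underline=False):
    """Escape the run text, then wrap it in <b>, <i>, <u> (innermost first)."""
    markup = escape(text)
    if bold:
        markup = f"<b>{markup}</b>"
    if italic:
        markup = f"<i>{markup}</i>"
    if underline:
        markup = f"<u>{markup}</u>"
    return markup


def table_to_html(rows):
    """Render rows (a list of lists of cell strings) as a bordered HTML table."""
    parts = ["<table border='1' style='border-collapse: collapse; width: 100%;'>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "".join(parts)


print(indent_style(left_pt=36, first_line_pt=-18))
print(run_to_html("a < b", bold=True, underline=True))
```

Escaping before wrapping matters: the text is made safe first, so the tags added afterwards are the only markup in the result. A negative first-line indent, as in the example call, corresponds to a hanging indent.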

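2. The batch_convert_to_html function

Purpose: batch-process the Word files in a folder.
Creates the output folder if it does not exist.
Iterates over the .docx files in the input folder and calls docx_to_html on each one.
Writes each generated HTML file to the output folder. A failure on one file is caught and reported, so the remaining files are still converted.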
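The file discovery and output naming described above can also be sketched with pathlib. This is an illustrative variant under stated assumptions: find_docx_files and output_path_for are hypothetical helper names, and skipping the "~$" lock files that Word creates next to an open document is an extra precaution that the walkthrough does not describe.

```python
from pathlib import Path


def find_docx_files(input_folder):
    """Return the .docx files directly inside input_folder, sorted by name."""
    return sorted(
        path
        for path in Path(input_folder).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".docx"  # also matches .DOCX
        and not path.name.startswith("~$")  # skip Word lock files
    )


def output_path_for(docx_path, output_folder):
    """Map e.g. report.docx to <output_folder>/report.html."""
    return Path(output_folder) / f"{Path(docx_path).stem}.html"


print(output_path_for("input/report.docx", "output"))
```

Sorting the result makes the processing order, and therefore the log output, reproducible across runs.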

3. Main program

Sets the input and output folder paths.
Calls batch_convert_to_html to run the batch conversion.
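Instead of editing the two paths in the script, the folders could be taken from the command line. The sketch below is one possible approach and is not part of the original tool: build_parser and main are hypothetical names, and the converter is passed in as a parameter so that the example stays self-contained.

```python
import argparse
from pathlib import Path


def build_parser():
    """Create a parser that takes the input and output folders as positionals."""
    parser = argparse.ArgumentParser(description="Batch-convert .docx files to HTML.")
    parser.add_argument("input_folder", type=Path, help="folder that holds the Word documents")
    parser.add_argument("output_folder", type=Path, help="folder that receives the HTML files")
    return parser


def main(convert, argv=None):
    """Parse argv (sys.argv[1:] when None) and hand both folders to convert."""
    args = build_parser().parse_args(argv)
    convert(str(args.input_folder), str(args.output_folder))
    return 0


# In the article's script this would be wired up roughly as:
#     raise SystemExit(main(batch_convert_to_html))
```

The script could then be run as python convert.py input_path output_path, where convert.py is a placeholder file name.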