Python Office Automation Tool | Convert All Your Documents to PDF in One Step

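Preface

In everyday work and study, we often need to convert files in many formats (Word, Excel, PPT, TXT, HTML, and images) to PDF for archiving, printing, or sharing. Doing this by hand is slow and error-prone.

This article provides a complete Python script that batch-converts common office document and image formats automatically, with built-in error handling and logging so that each run is traceable and easy to maintain.

Once it is set up, the tool can save a good deal of repetitive work.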

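Chapter 1: Why Batch-Convert Documents to PDF?

PDF is a cross-platform document format with the following advantages:

It can be viewed without the software that created it;
Its layout stays the same across devices;
It supports security features such as encryption and signatures;
It is convenient for archiving, printing, and sharing.

With a large number of documents, however, converting them one by one is tedious. This is where an automated batch-conversion script written in Python pays off.

It saves time and reduces manual mistakes, which makes it a useful part of a digital office workflow.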

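Chapter 2: Supported File Types and How the Conversion Works

The script currently supports converting the following file types:

Microsoft Office:
- Word (.doc / .docx)
- Excel (.xls / .xlsx)
- PowerPoint (.ppt / .pptx)
Text documents:
- TXT (plain text)
- HTML (web content; .html / .htm)
Image files:
- JPG/JPEG, PNG, BMP, GIF, TIFF/TIF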
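
The extension list above maps naturally onto a lookup table. The following is a minimal sketch of that idea, not the script's actual code: `HANDLERS` and `pick_handler` are hypothetical names, and the table values are plain labels standing in for real conversion functions.

```python
import os

# Hypothetical lookup table: file extension -> label of the converter family.
HANDLERS = {
    **dict.fromkeys(("doc", "docx"), "word"),
    **dict.fromkeys(("xls", "xlsx"), "excel"),
    **dict.fromkeys(("ppt", "pptx"), "powerpoint"),
    "txt": "text",
    **dict.fromkeys(("html", "htm"), "html"),
    **dict.fromkeys(("jpg", "jpeg", "png", "bmp", "gif", "tiff", "tif"), "image"),
}


def pick_handler(filename):
    """Return the converter label for a file name, or None if unsupported."""
    # splitext returns "" for names without a dot, so "txt" is not mistaken for a .txt file
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return HANDLERS.get(ext)


if __name__ == "__main__":
    for name in ("Report.DOCX", "scan.tif", "notes.md"):
        print(name, "->", pick_handler(name))
```

Adding a new format then only requires one more table entry plus its handler.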

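📌 Technical notes:

win32com drives Microsoft Office to convert Word, Excel, and PowerPoint files;

FPDF handles TXT, and pdfkit (a wrapper around wkhtmltopdf) handles HTML;

PIL (Pillow) handles image files;

Logging uses the standard-library logging module;

Errors are caught with try-except and logged with full details.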

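Chapter 3: Installing Dependencies (Windows)

Because some modules work only on Windows (win32com, for example), the script is intended for Windows and requires Microsoft Office to be installed beforehand.

🔧 Install the packages with:

pip install pywin32 pillow fpdf pdfkit

You also need to install wkhtmltopdf to support HTML-to-PDF conversion:

Download: wkhtmltopdf.org/downloads.h...

After installing it, add its bin directory to the system PATH environment variable, for example:

C:\Program Files\wkhtmltopdf\bin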

Chapter 4: The Python Script That Converts All Documents to PDF

import os
import sys
import logging
from win32com import client as win32_client
from PIL import Image
from fpdf import FPDF
import pdfkit
import traceback

# Configure logging: write to a file and echo to the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('conversion.log', encoding='utf-8'),
        logging.StreamHandler(sys.stdout)
    ]
)

class DocumentConverter:
    def __init__(self, input_dir, output_dir):
        self.input_dir = input_dir
        self.output_dir = output_dir
        self.supported_extensions = {
            'doc': self._convert_word,
            'docx': self._convert_word,
            'xls': self._convert_excel,
            'xlsx': self._convert_excel,
            'ppt': self._convert_powerpoint,
            'pptx': self._convert_powerpoint,
            'txt': self._convert_txt,
            'html': self._convert_html,
            'htm': self._convert_html,
            'jpg': self._convert_image,
            'jpeg': self._convert_image,
            'png': self._convert_image,
            'bmp': self._convert_image,
            'gif': self._convert_image,
            'tiff': self._convert_image,
            'tif': self._convert_image
        }

    def _ensure_output_dir(self):
        """Make sure the output directory exists."""
        os.makedirs(self.output_dir, exist_ok=True)

    def _get_files(self):
        """Yield (path, extension) for every supported file under the input directory."""
        for root, _, files in os.walk(self.input_dir):
            for file in files:
                # splitext avoids treating an extension-less name such as "txt" as a .txt file
                ext = os.path.splitext(file)[1].lstrip('.').lower()
                if ext in self.supported_extensions:
                    yield os.path.join(root, file), ext

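    def _convert_word(self, file_path):
        """Convert a Word document to PDF."""
        word = None
        try:
            word = win32_client.Dispatch("Word.Application")
            # Office resolves relative paths against its own working directory, so pass absolute paths
            doc = word.Documents.Open(os.path.abspath(file_path))
            output_file = os.path.abspath(self._output_path(file_path, 'pdf'))
            doc.ExportAsFixedFormat(
                OutputFileName=output_file,
                ExportFormat=17,  # wdExportFormatPDF
                OpenAfterExport=False,
                OptimizeFor=0,  # wdExportOptimizeForPrint
                CreateBookmarks=1  # wdExportCreateHeadingBookmarks
            )
            doc.Close(False)  # close without saving changes
            logging.info(f"✅ Converted: {file_path}")
        except Exception as e:
            logging.error(f"❌ Word conversion failed: {file_path} | Error: {str(e)}\n{traceback.format_exc()}")
        finally:
            # Quit even after a failure so no Word process is left running
            if word is not None:
                word.Quit()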

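    def _convert_excel(self, file_path):
        """Convert an Excel workbook to PDF."""
        excel = None
        try:
            excel = win32_client.Dispatch("Excel.Application")
            # Office resolves relative paths against its own working directory, so pass absolute paths
            wb = excel.Workbooks.Open(os.path.abspath(file_path))
            output_file = os.path.abspath(self._output_path(file_path, 'pdf'))
            wb.ExportAsFixedFormat(
                Type=0,  # xlTypePDF
                Filename=output_file,
                Quality=0  # xlQualityStandard
            )
            wb.Close(False)  # close without saving changes
            logging.info(f"✅ Converted: {file_path}")
        except Exception as e:
            logging.error(f"❌ Excel conversion failed: {file_path} | Error: {str(e)}\n{traceback.format_exc()}")
        finally:
            # Quit even after a failure so no Excel process is left running
            if excel is not None:
                excel.Quit()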

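    def _convert_powerpoint(self, file_path):
        """Convert a PowerPoint presentation to PDF."""
        powerpoint = None
        try:
            powerpoint = win32_client.Dispatch("PowerPoint.Application")
            # Absolute paths are required; WithWindow=False opens the file without showing a window
            presentation = powerpoint.Presentations.Open(os.path.abspath(file_path), WithWindow=False)
            output_file = os.path.abspath(self._output_path(file_path, 'pdf'))
            presentation.SaveAs(output_file, 32)  # 32 = ppSaveAsPDF
            presentation.Close()
            logging.info(f"✅ Converted: {file_path}")
        except Exception as e:
            logging.error(f"❌ PPT conversion failed: {file_path} | Error: {str(e)}\n{traceback.format_exc()}")
        finally:
            # Quit even after a failure so no PowerPoint process is left running
            if powerpoint is not None:
                powerpoint.Quit()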

    def _convert_txt(self, file_path):
        """Convert a TXT file to PDF."""
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            pdf = FPDF()
            pdf.add_page()
            pdf.set_auto_page_break(auto=True, margin=15)
            # Arial is a built-in Latin-1 font: other scripts (e.g. Chinese) need a Unicode
            # TTF registered with add_font, and cell() does not wrap long lines
            pdf.set_font("Arial", size=12)
            for line in content.split('\n'):
                pdf.cell(0, 10, txt=line, ln=1)
            output_file = self._output_path(file_path, 'pdf')
            pdf.output(output_file)
            logging.info(f"✅ Converted: {file_path}")
        except Exception as e:
            logging.error(f"❌ TXT conversion failed: {file_path} | Error: {str(e)}\n{traceback.format_exc()}")

    def _convert_html(self, file_path):
        """Convert an HTML file to PDF (requires wkhtmltopdf on PATH)."""
        try:
            output_file = self._output_path(file_path, 'pdf')
            pdfkit.from_file(file_path, output_file)
            logging.info(f"✅ Converted: {file_path}")
        except Exception as e:
            logging.error(f"❌ HTML conversion failed: {file_path} | Error: {str(e)}\n{traceback.format_exc()}")

    def _convert_image(self, file_path):
        """Convert an image file to a single-page PDF."""
        try:
            image = Image.open(file_path)
            # PDF output does not support every mode (e.g. RGBA), so normalize to RGB
            if image.mode != "RGB":
                image = image.convert("RGB")
            output_file = self._output_path(file_path, 'pdf')
            # Writes one page; for multi-frame GIF/TIFF files only the first frame is saved
            image.save(output_file, "PDF")
            logging.info(f"✅ Converted: {file_path}")
        except Exception as e:
            logging.error(f"❌ Image conversion failed: {file_path} | Error: {str(e)}\n{traceback.format_exc()}")

    def _output_path(self, file_path, new_ext):
        """Build the output path: same base name, new extension, inside the output directory."""
        filename = os.path.basename(file_path)
        name = os.path.splitext(filename)[0]
        return os.path.join(self.output_dir, f"{name}.{new_ext}")

    def convert_all(self):
        """Run the batch conversion."""
        self._ensure_output_dir()
        count = 0
        for file_path, ext in self._get_files():
            logging.info(f"🔄 Converting: {file_path}")
            self.supported_extensions[ext](file_path)
            count += 1
        # Each handler logs its own failures, so this counts attempts, not successes
        logging.info(f"📊 Processed {count} files")

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description="Batch-convert documents to PDF")
    parser.add_argument("--input", required=True, help="Path of the source folder")
    parser.add_argument("--output", required=True, help="Path of the output folder")
    args = parser.parse_args()

    converter = DocumentConverter(args.input, args.output)
    converter.convert_all()

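Chapter 5: How to Use the Script

✅ Usage:

Save the script above as batch_convert_to_pdf.py
Run this command in a terminal:

python batch_convert_to_pdf.py --input D:\Documents --output D:\ConvertedPDFs

📁 Input and output:

The input path should contain the documents to convert; subfolders are scanned as well;
The output folder is created automatically if it does not exist;
Each result keeps the original file name with a .pdf extension, and all results are written directly into the output folder.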
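
Because subfolders are scanned but every PDF lands in one output folder, two sources that share a base name (say a\report.docx and b\report.xlsx) map to the same report.pdf, and the later one overwrites the earlier one. One possible way to avoid this is sketched below; `unique_pdf_path` is a hypothetical helper that is not part of the script, and appending a counter is only one of several reasonable naming schemes.

```python
import os


def unique_pdf_path(output_dir, source_path, taken):
    """Return a .pdf path in output_dir that has not been handed out yet.

    `taken` is a set of paths already used during this run; the chosen
    path is added to it before returning.
    """
    stem = os.path.splitext(os.path.basename(source_path))[0]
    candidate = os.path.join(output_dir, f"{stem}.pdf")
    counter = 1
    while candidate in taken:
        # Name clash: try report_1.pdf, report_2.pdf, ...
        candidate = os.path.join(output_dir, f"{stem}_{counter}.pdf")
        counter += 1
    taken.add(candidate)
    return candidate


if __name__ == "__main__":
    used = set()
    for src in ("a/report.docx", "b/report.xlsx", "c/summary.txt"):
        print(src, "->", unique_pdf_path("out", src, used))
```

The set only tracks names from the current run, so PDFs from an earlier run are still overwritten, which matches the script's existing behavior.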

⚠️ Notes:

Windows only;
Microsoft Office must be installed;
Running as administrator is recommended;
You can adjust the log level, fonts, page settings, and so on as needed.

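Chapter 6: Script Highlights and Practical Use Cases

✨ Highlights:

Detects the format from the file extension and picks the matching converter;
Scans nested folders recursively;
Logs every conversion, which makes problems easier to track down;
The log shows which files have already been processed, so an interrupted run can be traced and picked up again by hand;
New formats can be added easily.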

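📌 Example use cases:

Students collecting course materials as PDFs;
Companies archiving contracts and reports in one place;
Developers generating documentation sets automatically;
Content creators packaging their work as PDFs;
Libraries and archives digitizing original documents.

With a conversion tool like this, you can focus on more important tasks.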

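Chapter 7: Common Problems and Solutions

❗ Problem 1: "ModuleNotFoundError"

This means a dependency is not installed. Make sure the following packages are installed:

pip install pywin32 pillow fpdf pdfkit

Also confirm that wkhtmltopdf has been added to the system PATH.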
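
A quick way to see which dependency is missing is to ask the interpreter directly. This is a small sketch using the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the list holds the import names used by the script rather than the pip package names.

```python
import importlib.util

# Import names used by the script (the pip package names differ:
# pywin32 provides win32com, pillow provides PIL).
REQUIRED_MODULES = ["win32com", "PIL", "fpdf", "pdfkit"]


def missing_modules(names):
    """Return the module names that cannot be found in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Running it with the same interpreter that runs the conversion script also catches the case where packages were installed into a different Python environment.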

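❗ Problem 2: Word/Excel/PPT conversion fails

The installed Office version may be incompatible, or its COM objects may not be registered correctly.

✅ Fixes:

Restart Office or the system;
Run the script as administrator;
Update Office to the latest version.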
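
An Office process that is left running after a failed conversion can also interfere with later runs. The sketch below shows one way to guarantee that `Quit()` is always called; `office_app` is a hypothetical helper, and `FakeApp` is a stand-in so the example runs without Office. In real use the factory would be something like `lambda: win32_client.Dispatch("Word.Application")`.

```python
from contextlib import contextmanager


@contextmanager
def office_app(factory):
    """Create an application object and always call its Quit() afterwards."""
    app = factory()
    try:
        yield app
    finally:
        # Runs on success and on error alike
        app.Quit()


class FakeApp:
    """Stand-in for a COM application object, used only for this demo."""

    def __init__(self):
        self.quit_called = False

    def Quit(self):
        self.quit_called = True


if __name__ == "__main__":
    fake = FakeApp()
    try:
        with office_app(lambda: fake) as app:
            raise RuntimeError("simulated conversion error")
    except RuntimeError:
        pass
    print("Quit called:", fake.quit_called)
```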

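❗ Problem 3: Garbled text or missing styles in HTML conversion

For complex HTML, pdfkit may not reproduce the page styling faithfully.

✅ Solutions:

If the text is garbled, set the encoding option (for example UTF-8) in the pdfkit options;
If local CSS or images are missing, allow local file access through the wkhtmltopdf option enable-local-file-access (use with caution);
Adjust the pdfkit options for JavaScript, for example adding a delay so scripts finish before rendering;
If layout fidelity is critical, export the PDF with a browser extension instead.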
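
As an illustration of adjusting the pdfkit configuration, the sketch below builds an options dictionary and passes it to `pdfkit.from_file`. The helper names `build_pdfkit_options` and `html_to_pdf` are hypothetical, the values are illustrative, and the keys are wkhtmltopdf command-line option names, so their availability depends on the installed wkhtmltopdf version.

```python
def build_pdfkit_options(encoding="UTF-8", js_delay_ms=1000, allow_local_files=False):
    """Build an options dict for pdfkit (keys are wkhtmltopdf option names)."""
    options = {
        "encoding": encoding,                  # how wkhtmltopdf decodes the input
        "javascript-delay": str(js_delay_ms),  # wait for scripts before rendering
    }
    if allow_local_files:
        # Flag without a value; lets the page load local CSS and images.
        options["enable-local-file-access"] = None
    return options


def html_to_pdf(html_path, pdf_path):
    """Convert one HTML file; needs pdfkit and wkhtmltopdf installed."""
    import pdfkit  # imported here so the options helper works without it

    pdfkit.from_file(html_path, pdf_path, options=build_pdfkit_options())
```

Keeping the options in one function makes it easier to try different settings on a problematic page without touching the conversion call.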

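Summary

This Python script batch-converts Word, Excel, PPT, TXT, HTML, and image files to PDF, with error handling and logging built in.

Whether you are a student, teacher, administrator, or developer, it can save you a lot of time and let you focus on more valuable work.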