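In office automation and data processing, converting between Word documents and JSON is an increasingly common need. This article shows how to convert .docx files to JSON and back in Python, and provides a complete working script.
1. Overview and use cases
Converting between Word documents and JSON is useful in several situations:
- Content extraction and analysis: turn the content of a Word document into structured JSON that is easy to process and analyze
- Automated report generation: fill JSON data into a predefined Word template
- Format conversion: use JSON as an intermediate step between Word and other formats such as Markdown or HTML
- Content management systems: support version control and structured storage of document content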
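2. Core libraries
The conversion relies on two Python libraries:
- python-docx: the mainstream library for reading and writing Word .docx files
- json: the Python standard-library module for JSON data
Compared with alternatives such as Simplify-Docx or the FastMCP framework, using python-docx directly gives more flexibility and control, which suits cases that need fine-grained handling of document styles.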
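3. Implementation walkthrough
3.1 Extracting JSON from a Word document
The docx_to_json function converts a Word document into structured JSON data. Its core logic is:
python
from docx import Document
from docx.enum.style import WD_STYLE_TYPE


def docx_to_json(docx_path):
    document = Document(docx_path)
    doc_data = {
        "paragraphs": [],
        "styles": [],
        "tables": []
    }
    # Extract paragraph-style definitions
    for style in document.styles:
        if style.type == WD_STYLE_TYPE.PARAGRAPH:
            style_info = {}
            # Record only attributes that are actually set
            if style.name:
                style_info["name"] = style.name
            if style.font.name:
                style_info["font_name"] = style.font.name
            # More style attributes are extracted in the full listing...
            if style_info:
                doc_data["styles"].append(style_info)
    # Paragraph and table extraction are omitted in this excerpt
    return doc_data
This approach extracts style information along with the text, so the JSON records how the original document was formatted. Only attributes that are explicitly set are recorded; values that python-docx reports as None (inherited) are skipped. In the full listing, json_to_docx applies styles by name and does not rebuild style definitions from the "styles" list, so the restored document relies on the styles available in python-docx's default template.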
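To make the target structure concrete, here is a hand-written sketch of the kind of dict docx_to_json returns for a one-paragraph document. The key names follow the article's code; the values (font names, sizes, the "CENTER (1)" alignment string) are illustrative assumptions, not output captured from a real file.

```python
import json

# Hand-written illustration of the structure; all values are made up.
sample = {
    "styles": [
        {"name": "Normal", "type": "paragraph", "font_name": "Calibri", "font_size": 11.0}
    ],
    "paragraphs": [
        {
            "text": "Hello, world",
            "style": "Normal",
            "paragraph_format": {"alignment": "CENTER (1)"},
            "runs": [
                {"text": "Hello, ", "bold": True},
                {"text": "world", "italic": True, "font_size": 12.0},
            ],
        }
    ],
    "tables": [],
}

# ensure_ascii=False keeps non-ASCII text (for example Chinese) readable in the file
text = json.dumps(sample, ensure_ascii=False, indent=2)
restored = json.loads(text)
print(restored["paragraphs"][0]["runs"][0])
```

Because the structure contains only dicts, lists, strings, numbers, and booleans, it survives a dump/load cycle unchanged.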
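3.2 Restoring a Word document from JSON
The json_to_docx function performs the reverse conversion. Its key points are:
python
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH


def json_to_docx(json_data, output_path):
    document = Document()
    # Rebuild paragraphs and their text styling
    for para_data in json_data.get("paragraphs", []):
        paragraph = document.add_paragraph()
        try:
            paragraph.style = para_data.get("style", "Normal")
        except (KeyError, ValueError):
            # Unknown style name: keep the default paragraph style
            pass
        # Set the alignment (docx_to_json stores it under "paragraph_format")
        alignment_str = para_data.get("paragraph_format", {}).get("alignment")
        if alignment_str:
            if "CENTER" in alignment_str:
                paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
            # Other alignments are handled in the full listing...
        # Rebuild the text runs and their styling
        for run_data in para_data.get("runs", []):
            run = paragraph.add_run(run_data.get("text", ""))
            # Bold, italic, underline, and so on
            run.bold = run_data.get("bold", False)
            run.italic = run_data.get("italic", False)
            # More style settings in the full listing...
    document.save(output_path)
This implementation fills in defaults for attributes missing from the JSON (for example, False for bold and italic), so the generated document has predictable, readable formatting. Note that in python-docx an explicit False turns the property off for that run rather than inheriting it from the paragraph style.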
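Matching alignment names by substring is order-sensitive: "JUSTIFY" in "JUSTIFY_MED (5)" is true, so a JUSTIFY check placed first wins. The sketch below extracts the name as a whole token instead. The helper name alignment_name is hypothetical, and the stored formats shown ("CENTER (1)" and a dotted enum form) are assumptions about what str() of the alignment value produces.

```python
import re

# Alignment names handled by the article's json_to_docx
KNOWN_ALIGNMENTS = ("LEFT", "CENTER", "RIGHT", "JUSTIFY", "DISTRIBUTE", "JUSTIFY_MED")


def alignment_name(stored):
    """Return the alignment member name found in a stored string, or None.

    Assumes the member name appears as a whole upper-case token, e.g.
    "CENTER (1)" or "WD_PARAGRAPH_ALIGNMENT.CENTER".
    """
    for token in re.findall(r"[A-Z_]+", stored):
        if token in KNOWN_ALIGNMENTS:
            return token
    return None


for stored in ("CENTER (1)", "JUSTIFY_MED (5)", "WD_PARAGRAPH_ALIGNMENT.RIGHT"):
    print(stored, "->", alignment_name(stored))
```

The returned name could then be resolved with getattr(WD_ALIGN_PARAGRAPH, name) before assigning it to paragraph.alignment.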
3.3 Table handling
The code also handles Word tables:
python
# Excerpt from json_to_docx: `document` and `json_data` are defined there
for table_data in json_data.get("tables", []):
    if table_data.get("rows"):
        # Size the table from the data
        first_row = table_data["rows"][0]
        num_rows = len(table_data["rows"])
        num_cols = len(first_row["cells"]) if first_row.get("cells") else 1
        table = document.add_table(rows=num_rows, cols=num_cols)
        # Fill in the cell contents
        for i, row_data in enumerate(table_data["rows"]):
            row = table.rows[i]
            for j, cell_data in enumerate(row_data.get("cells", [])):
                if j < len(row.cells):
                    cell = row.cells[j]
                    cell.text = cell_data.get("text", "")
The table is created dynamically: the row count comes from the number of rows in the JSON and the column count from the first row. Cells beyond the first row's column count are skipped.
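Sizing the table from the first row works when every row has the same number of cells. For hand-edited JSON with ragged rows, one alternative is to size by the widest row and pad the shorter ones. This is a sketch of that variation, not the article's method; table_shape and padded_rows are hypothetical helper names.

```python
def table_shape(table_data):
    """Return (num_rows, num_cols), sizing columns by the widest row."""
    rows = table_data.get("rows", [])
    num_cols = max((len(row.get("cells", [])) for row in rows), default=0)
    return len(rows), num_cols


def padded_rows(table_data):
    """Yield each row's cell texts, padded with "" to the full width."""
    _, num_cols = table_shape(table_data)
    for row in table_data.get("rows", []):
        texts = [cell.get("text", "") for cell in row.get("cells", [])]
        yield texts + [""] * (num_cols - len(texts))


# Second row has one more cell than the first
ragged = {
    "rows": [
        {"cells": [{"text": "Name"}, {"text": "Qty"}]},
        {"cells": [{"text": "Apple"}, {"text": "3"}, {"text": "extra"}]},
    ]
}
print(table_shape(ragged))
print(list(padded_rows(ragged)))
```

The computed shape can be passed to document.add_table(rows=..., cols=...), and the padded texts written cell by cell.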
4. Usage
4.1 Setup
First install the required dependency:
bash
pip install python-docx
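One detail that often causes confusion: the package is installed as python-docx but imported as docx. A quick way to confirm that the environment is ready, sketched with only the standard library (is_installed is a hypothetical helper name):

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# python-docx is installed under that name but imported as "docx"
for name in ("docx", "json"):
    print(name, "available" if is_installed(name) else "missing")
```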
4.2 Basic usage
The examples assume the full script is saved as docx_to_json_converter.py.
- Convert a Word document to JSON:
python
import json

from docx_to_json_converter import docx_to_json

# Convert the Word document to a JSON-serializable dict
json_data = docx_to_json("示例文档.docx")
# Save it as a JSON file
with open("文档数据.json", "w", encoding="utf-8") as f:
    json.dump(json_data, f, ensure_ascii=False, indent=2)
- Restore a Word document from JSON:
python
import json

from docx_to_json_converter import json_to_docx

# Load the JSON data
with open("文档数据.json", "r", encoding="utf-8") as f:
    json_data = json.load(f)
# Convert it back to a Word document
json_to_docx(json_data, "还原的文档.docx")
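Once a document is in JSON form, simple content analysis needs nothing beyond the standard library. The sketch below computes a few statistics from the structure that docx_to_json produces; summarize and plain_text are hypothetical helper names, and the sample data is made up.

```python
from collections import Counter


def summarize(json_data):
    """Return simple statistics for a converted document."""
    paragraphs = json_data.get("paragraphs", [])
    return {
        "paragraph_count": len(paragraphs),
        "table_count": len(json_data.get("tables", [])),
        "character_count": sum(len(p.get("text", "")) for p in paragraphs),
        "styles_used": dict(Counter(p.get("style", "Normal") for p in paragraphs)),
    }


def plain_text(json_data):
    """Join the paragraph texts into one string, one paragraph per line."""
    return "\n".join(p.get("text", "") for p in json_data.get("paragraphs", []))


# Made-up data in the shape produced by docx_to_json
doc = {
    "paragraphs": [
        {"text": "Quarterly report", "style": "Heading 1"},
        {"text": "Sales grew.", "style": "Normal"},
        {"text": "Costs fell.", "style": "Normal"},
    ],
    "tables": [{"rows": []}],
}
print(summarize(doc))
print(plain_text(doc))
```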
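4.3 Advanced usage
The script also provides an interactive command-line interface. Run it directly to choose the conversion direction:
bash
python docx_to_json_converter.py
Enter 1 or 2 at the prompt to choose the operation, then enter the file path to run the conversion.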
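For scripted or batch use, a non-interactive interface can be more convenient than prompts. The following is a minimal sketch using argparse; the subcommand names (to-json, to-docx) and the -o flag are assumptions for illustration and are not part of the article's script. The default output names mirror the ones the interactive script generates.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser for: <direction> <input> [-o OUTPUT]."""
    parser = argparse.ArgumentParser(description="Convert between .docx and .json")
    parser.add_argument("direction", choices=["to-json", "to-docx"])
    parser.add_argument("input", type=Path, help="source file")
    parser.add_argument("-o", "--output", type=Path, help="destination file")
    return parser


def resolve_output(args):
    """Use --output if given, otherwise derive a name from the input path."""
    if args.output is not None:
        return args.output
    if args.direction == "to-json":
        return args.input.with_suffix(".json")
    return args.input.with_name(args.input.stem + "_restored.docx")


# Parse an explicit argument list here; a real script would call parse_args()
args = build_parser().parse_args(["to-json", "report.docx"])
print(resolve_output(args))
```

A real entry point would then call docx_to_json or json_to_docx depending on args.direction.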
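5. Extensions and advanced techniques
5.1 Reusing style templates
In practice, a template-reuse mechanism can improve efficiency:
python
from docx_to_json_converter import docx_to_json


# Build a style template from an existing document
def create_style_template(docx_path):
    json_data = docx_to_json(docx_path)
    # Keep only the style definitions, plus some descriptive metadata
    template = {
        "styles": json_data["styles"],
        "metadata": {"created_time": "2023-01-01", "type": "report"}
    }
    return template
This approach is especially suited to generating standardized documents in bulk, such as reports and contracts.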
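A template is only reusable if it is stored somewhere and cheap to load repeatedly. The sketch below saves a template as a JSON file and caches loads with functools.lru_cache. The function names save_template and load_template are hypothetical, and the style entry is made-up sample data.

```python
import json
import tempfile
from datetime import date
from functools import lru_cache
from pathlib import Path


def save_template(styles, path, doc_type="report"):
    """Write a style template with basic metadata to a JSON file."""
    template = {
        "styles": styles,
        "metadata": {"created_time": date.today().isoformat(), "type": doc_type},
    }
    Path(path).write_text(
        json.dumps(template, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return template


@lru_cache(maxsize=32)
def load_template(path):
    """Load a template once per path; repeated calls reuse the cached dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "report_template.json")
    save_template([{"name": "Normal", "font_name": "Calibri"}], path)
    first = load_template(path)
    second = load_template(path)  # served from the cache, the file is not re-read
    print(first is second)
```

Two caveats of this design: the cached dict is shared between callers, so it should be treated as read-only, and edits to the file are not seen until load_template.cache_clear() is called.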
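5.2 Integrating with LangChain
The tool can be combined with AI frameworks such as LangChain for intelligent document processing:
python
# Recent LangChain releases ship loaders in langchain_community;
# older releases used `from langchain.document_loaders import Docx2txtLoader`.
# The loader also needs the docx2txt package.
from langchain_community.document_loaders import Docx2txtLoader

# Load the generated Word document
loader = Docx2txtLoader("还原的文档.docx")
documents = loader.load()
# Continue with text analysis, question answering, or other AI processing
This combination opens up further possibilities for document processing, such as automatic summarization and content classification.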
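6. Performance tips
- Large files: for large Word documents, process the content in chunks to avoid running out of memory
- Caching: cache frequently used style templates to speed up conversion
- Batch processing: when converting many documents, process them in parallel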
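The batch-processing tip can be sketched with concurrent.futures. This is a minimal illustration under stated assumptions: convert_many is a hypothetical helper, and fake_convert stands in for docx_to_json so the example runs without any .docx files.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_many(paths, convert, max_workers=4):
    """Run convert(path) for each path in a worker pool.

    Returns (results, errors), two dicts keyed by path.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one document fails
                errors[path] = exc
    return results, errors


def fake_convert(path):
    # Stand-in for docx_to_json so the sketch runs without real documents
    if not path.endswith(".docx"):
        raise ValueError(f"not a .docx file: {path}")
    return {"paragraphs": [], "styles": [], "tables": [], "source": path}


results, errors = convert_many(["a.docx", "b.docx", "notes.txt"], fake_convert)
print(sorted(results), sorted(errors))
```

Threads keep the sketch simple. Because document parsing is largely CPU-bound, concurrent.futures.ProcessPoolExecutor, which offers the same submit interface, may scale better; it requires convert to be a picklable top-level function and the calling code to sit under an `if __name__ == "__main__":` guard.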
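7. Summary
The two-way Word and JSON conversion described in this article has the following strengths:
- Coverage: it converts the core elements of a Word document: text, styles, and tables
- Flexibility: it can be used as an API or from the command line, to suit different needs
- Practicality: the code can be used directly and is easy to extend
A tool like this is valuable for document automation, content management systems, and data migration. Integrating other tools (such as pandoc or OCR) can extend its document-processing capabilities further.
I hope this article helps with your document-processing work. Questions and suggestions are welcome.
The complete code is listed below. Copy it as-is or adapt it to your needs.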
python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Docx to JSON and JSON to Docx converter.

Extracts the content and style attributes of a .docx file into a JSON
object, and rebuilds a .docx file from such a JSON object.
"""
import json
import os

from docx import Document
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor


def _paragraph_format_to_dict(pf):
    """Collect the explicitly set attributes of a ParagraphFormat object."""
    data = {}
    if pf.alignment is not None:
        data["alignment"] = str(pf.alignment)
    # Lengths are stored in points; unset (None) and zero values are skipped
    for attr in ("left_indent", "right_indent", "first_line_indent",
                 "space_before", "space_after"):
        value = getattr(pf, attr)
        if value:
            data[attr] = value.pt
    # Cap line_spacing to avoid overflow: multiples such as 1.5 pass, while
    # exact spacing (a Length holding a large EMU value) is filtered out
    if pf.line_spacing and pf.line_spacing <= 100:
        data["line_spacing"] = pf.line_spacing
    for attr in ("keep_with_next", "keep_together",
                 "page_break_before", "widow_control"):
        value = getattr(pf, attr)
        if value is not None:
            data[attr] = value
    return data


def _paragraph_to_dict(para):
    """Collect the text, style name, and explicit formatting of a paragraph."""
    info = {}
    if para.text:
        info["text"] = para.text
    if para.style and para.style.name:
        info["style"] = para.style.name
    paragraph_format = _paragraph_format_to_dict(para.paragraph_format)
    if paragraph_format:
        info["paragraph_format"] = paragraph_format
    return info


def _run_to_dict(run):
    """Collect the text and explicitly set character formatting of a run."""
    info = {}
    if run.text:
        info["text"] = run.text
    for attr in ("bold", "italic", "underline"):
        value = getattr(run, attr)
        if value is not None:
            info[attr] = value
    font = run.font
    if font.name:
        info["font_name"] = font.name
    if font.size:
        info["font_size"] = font.size.pt
    if font.color.rgb:
        info["color"] = str(font.color.rgb)
    if font.highlight_color:
        info["highlight_color"] = str(font.highlight_color)
    for attr in ("strike", "superscript", "subscript", "all_caps", "small_caps"):
        value = getattr(font, attr)
        if value is not None:
            info[attr] = value
    return info


def docx_to_json(docx_path):
    """
    Convert a .docx file to a JSON-serializable dict.
    Style attributes whose value is None are omitted.
    """
    document = Document(docx_path)
    # Container for everything extracted from the document
    doc_data = {
        "paragraphs": [],
        "styles": [],
        "tables": []
    }
    # Paragraph-style definitions
    for style in document.styles:
        if style.type != WD_STYLE_TYPE.PARAGRAPH:
            continue
        style_info = {}
        if style.name:
            style_info["name"] = style.name
        style_info["type"] = "paragraph"
        font = style.font
        if font.name:
            style_info["font_name"] = font.name
        if font.size:
            style_info["font_size"] = font.size.pt
        for attr in ("bold", "italic", "underline"):
            value = getattr(font, attr)
            if value is not None:
                style_info[attr] = value
        if font.color.rgb:
            style_info["color"] = str(font.color.rgb)
        paragraph_format = _paragraph_format_to_dict(style.paragraph_format)
        if paragraph_format:
            style_info["paragraph_format"] = paragraph_format
        doc_data["styles"].append(style_info)
    # Body paragraphs and their runs
    for para in document.paragraphs:
        para_info = _paragraph_to_dict(para)
        runs_list = [info for info in map(_run_to_dict, para.runs) if info]
        if runs_list:
            para_info["runs"] = runs_list
        # Skip paragraphs that yielded nothing at all
        if para_info:
            doc_data["paragraphs"].append(para_info)
    # Tables
    for table in document.tables:
        table_info = {"rows": []}
        if table.style:
            table_info["style"] = table.style.name
        for row in table.rows:
            row_info = {"cells": []}
            for cell in row.cells:
                cell_info = {}
                if cell.text:
                    cell_info["text"] = cell.text
                paragraphs_list = [
                    info for info in map(_paragraph_to_dict, cell.paragraphs) if info
                ]
                if paragraphs_list:
                    cell_info["paragraphs"] = paragraphs_list
                # Always keep the cell so column positions are preserved
                row_info["cells"].append(cell_info)
            table_info["rows"].append(row_info)
        if table_info["rows"]:
            doc_data["tables"].append(table_info)
    return doc_data


# Longer names come first: "JUSTIFY" is a substring of "JUSTIFY_MED"
_ALIGNMENTS = (
    ("JUSTIFY_MED", WD_ALIGN_PARAGRAPH.JUSTIFY_MED),
    ("DISTRIBUTE", WD_ALIGN_PARAGRAPH.DISTRIBUTE),
    ("JUSTIFY", WD_ALIGN_PARAGRAPH.JUSTIFY),
    ("CENTER", WD_ALIGN_PARAGRAPH.CENTER),
    ("RIGHT", WD_ALIGN_PARAGRAPH.RIGHT),
    ("LEFT", WD_ALIGN_PARAGRAPH.LEFT),
)


def _apply_paragraph_format(paragraph, data):
    """Apply a stored "paragraph_format" dict to a paragraph."""
    # The alignment was stored with str(); recover the enum member by name
    alignment_str = data.get("alignment")
    if alignment_str:
        for name, member in _ALIGNMENTS:
            if name in alignment_str:
                paragraph.alignment = member
                break
    pf = paragraph.paragraph_format
    # Indents and spacing are stored in points
    for attr in ("left_indent", "right_indent", "first_line_indent",
                 "space_before", "space_after"):
        if attr in data:
            setattr(pf, attr, Pt(data[attr]))
    # Cap line_spacing to avoid overflow (same limit as on extraction)
    if "line_spacing" in data and data["line_spacing"] <= 100:
        pf.line_spacing = data["line_spacing"]
    for attr in ("keep_with_next", "keep_together",
                 "page_break_before", "widow_control"):
        if attr in data:
            setattr(pf, attr, data[attr])


def _add_run(paragraph, run_data):
    """Add one run to the paragraph and apply its stored character formatting."""
    run = paragraph.add_run(run_data.get("text", ""))
    # Toggle properties default to False when absent; an explicit False
    # overrides the paragraph style instead of inheriting from it
    run.bold = run_data.get("bold", False)
    run.italic = run_data.get("italic", False)
    run.underline = run_data.get("underline", False)
    font = run.font
    for attr in ("strike", "superscript", "subscript", "all_caps", "small_caps"):
        setattr(font, attr, run_data.get(attr, False))
    # Font size defaults to 12 pt
    font.size = Pt(run_data.get("font_size") or 12)
    # Font name is left unset (document default) when absent
    font_name = run_data.get("font_name")
    if font_name:
        font.name = font_name
        # Also set the East Asian font so CJK text uses the same typeface
        run._element.rPr.rFonts.set(qn("w:eastAsia"), font_name)
    # Font color is left at the default when absent or malformed
    color = run_data.get("color")
    if color and color != "None":
        try:
            font.color.rgb = RGBColor.from_string(color)
        except (TypeError, ValueError):
            pass
    # Highlight color is not restored: that would require mapping the stored
    # string to the corresponding WD_COLOR_INDEX value


def json_to_docx(json_data, output_path):
    """
    Build a .docx file from JSON data.
    Style attributes missing from the JSON fall back to defaults.
    """
    document = Document()
    # Paragraphs
    for para_data in json_data.get("paragraphs", []):
        paragraph = document.add_paragraph()
        try:
            paragraph.style = para_data.get("style", "Normal")
        except (KeyError, ValueError):
            # Style not present in the default template: keep the default
            pass
        _apply_paragraph_format(paragraph, para_data.get("paragraph_format", {}))
        runs_data = para_data.get("runs", [])
        if runs_data:
            for run_data in runs_data:
                _add_run(paragraph, run_data)
        else:
            # No run data: add the paragraph text as a single 12 pt run
            run = paragraph.add_run(para_data.get("text", ""))
            run.font.size = Pt(12)
    # Tables
    for table_data in json_data.get("tables", []):
        if not table_data.get("rows"):
            continue
        # Row count comes from the data, column count from the first row
        first_row = table_data["rows"][0]
        num_rows = len(table_data["rows"])
        num_cols = len(first_row["cells"]) if first_row.get("cells") else 1
        table = document.add_table(rows=num_rows, cols=num_cols)
        table_style = table_data.get("style")
        if table_style:
            try:
                table.style = table_style
            except (KeyError, ValueError):
                # Style not present in the default template: keep the default
                pass
        # Fill in the cell contents
        for i, row_data in enumerate(table_data["rows"]):
            row = table.rows[i]
            for j, cell_data in enumerate(row_data.get("cells", [])):
                if j >= len(row.cells):
                    break
                cell = row.cells[j]
                cell.text = cell_data.get("text", "")
                for k, para_data in enumerate(cell_data.get("paragraphs", [])):
                    # Reuse the cell's initial paragraph for the first entry so
                    # no empty paragraph is left at the top of the cell
                    para = cell.paragraphs[0] if k == 0 else cell.add_paragraph()
                    para.text = para_data.get("text", "")
                    para_style = para_data.get("style")
                    if para_style:
                        try:
                            para.style = para_style
                        except (KeyError, ValueError):
                            pass
                    _apply_paragraph_format(para, para_data.get("paragraph_format", {}))
    # Save the document
    document.save(output_path)


def main():
    """Interactive entry point that demonstrates both conversions."""
    print("Docx Converter")
    print("1. Convert docx to json")
    print("2. Convert json to docx")
    choice = input("Choose an operation (1 or 2): ")
    if choice == "1":
        docx_path = input("Enter the path to the docx file: ")
        if not os.path.exists(docx_path):
            print("File not found!")
            return
        json_data = docx_to_json(docx_path)
        # Swap the extension with splitext so the input file is never
        # overwritten when its name does not end in lowercase ".docx"
        json_path = os.path.splitext(docx_path)[0] + ".json"
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2)
        print(f"Done! JSON file saved as: {json_path}")
    elif choice == "2":
        json_path = input("Enter the path to the json file: ")
        if not os.path.exists(json_path):
            print("File not found!")
            return
        with open(json_path, "r", encoding="utf-8") as f:
            json_data = json.load(f)
        output_path = os.path.splitext(json_path)[0] + "_restored.docx"
        json_to_docx(json_data, output_path)
        print(f"Done! Docx file saved as: {output_path}")
    else:
        print("Invalid choice!")


if __name__ == "__main__":
    main()