A Complete Guide to Working with Word Documents in Python: From Basics to Advanced

诸神缄默不语 - index of my personal CSDN blog posts

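For the tutorial video and a look at the output the code generates, see: www.bilibili.com/video/BV1Ab...

If you need the original Jupyter notebook and the images and documents used as examples, feel free to contact me.

Working with Word documents is a common and practical task in Python. This article shows how to create, modify, and process Word documents with several mainstream Python libraries, covering the full workflow from basic operations to advanced features.

Required libraries and installation

Before starting, install the following Python libraries:

python-docx: creates and modifies Word documents
docxtpl: fills Word documents from templates
docxcompose: merges multiple Word documents
lxml: XML processing library

Install them with pip:

pip install python-docx docxtpl docxcompose lxml

Or with uv:

uv add python-docx docxtpl docxcompose lxml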

1. Basic operations

1.1 Creating and saving a document

Creating a document with python-docx takes only a few lines:

from docx import Document  
doc = Document()  
doc.add_paragraph("Python-docx是一个用于创建")  
doc.save("文件1.docx")

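1.2 Setting a Chinese font

The default font handles Chinese poorly, so the East Asian font has to be set separately:

from docx import Document
from docx.oxml.ns import qn

def set_chinese_font(run, zh_font_name="宋体", en_font_name="Times New Roman"):
    # Set the Latin font first; this also creates the rPr/rFonts elements
    run.font.name = en_font_name
    # Then set the East Asian font that is used for Chinese characters
    run._element.rPr.rFonts.set(qn("w:eastAsia"), zh_font_name)

doc = Document()
paragraph = doc.add_paragraph()
run = paragraph.add_run('这是一段设置了中文字体的文本。')
set_chinese_font(run)
doc.save("文件1.docx")

Note: the file must not be open in another program (for example Word) when you save it; otherwise a PermissionError is raised.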
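The note above can be turned into a small defensive helper. This is only a sketch: `save_with_fallback` and the `_new` suffix are names invented for this example, and it assumes the PermissionError comes from the target file being open elsewhere.

```python
from pathlib import Path


def save_with_fallback(doc, path):
    """Save `doc` to `path`; if the file is locked, save under a new name instead."""
    target = Path(path)
    try:
        doc.save(str(target))
        return target
    except PermissionError:
        # The target is probably open in Word; write a sibling file instead.
        fallback = target.with_name(f"{target.stem}_new{target.suffix}")
        doc.save(str(fallback))
        return fallback
```

Calling `save_with_fallback(doc, "文件1.docx")` returns the path that was actually written, so the caller can report it.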

1.3 Opening an existing document

doc = Document('example.docx')

Notes:

The file must be a standard .docx file; legacy .doc files are not supported
The file must not be in the Strict Open XML format
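Both restrictions can be checked before handing a file to python-docx. The sketch below uses only the standard library and is a heuristic, not an official validation: it relies on legacy .doc files starting with the OLE compound-file signature and on Strict Open XML documents using the `purl.oclc.org/ooxml` namespace. The function name `classify_word_file` is hypothetical.

```python
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # signature of legacy .doc files
STRICT_NS = b"http://purl.oclc.org/ooxml/wordprocessingml/main"


def classify_word_file(path):
    """Return 'doc', 'strict-docx', 'docx', or 'unknown' for the file at `path`."""
    with open(path, "rb") as f:
        if f.read(8) == OLE_MAGIC:
            return "doc"
    if not zipfile.is_zipfile(path):
        return "unknown"
    with zipfile.ZipFile(path) as z:
        if "word/document.xml" not in z.namelist():
            return "unknown"
        xml = z.read("word/document.xml")
    # Strict documents declare the purl.oclc.org namespace in the main part
    return "strict-docx" if STRICT_NS in xml else "docx"
```

Only files classified as `'docx'` are expected to open cleanly with `Document(path)`; the other cases usually need to be re-saved from Word first.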

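1.4 Iterating over document content

# Iterate over paragraphs (first three only)
for para in doc.paragraphs[:3]:
    print(para)       # the Paragraph object itself
    print(para.text)  # its plain text
    print()
# Iterate over every cell of every table
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)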

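2. Document formatting

2.1 Headings

doc.add_heading("1.1 Transformer整体工作流程", 2)
doc.add_heading("Transformer整体架构", 3)

Note: the document must contain the corresponding heading style (for example "Heading 2" for level 2); otherwise an error is raised.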

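2.2 Working with paragraphs

Adding a paragraph

text = """Transformer 模型由编码器（Encoder）和解码器（Decoder）组成。..."""
paragraph1 = doc.add_paragraph(text)

First-line indent

Indent the first line by 2 characters:

from docx.oxml.ns import qn

paragraph_format = paragraph1.paragraph_format
paragraph_format.first_line_indent = 0  # creates the w:ind element
# w:firstLineChars is measured in hundredths of a character, so 200 means 2 characters
paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), '200')

Indent the first line by a fixed distance:

from docx.shared import Pt

paragraph1.paragraph_format.first_line_indent = Pt(10)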

Paragraph alignment

from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
paragraph1.alignment = WD_PARAGRAPH_ALIGNMENT.LEFT

Deleting a paragraph

# Remove the paragraph's underlying XML element from its parent
p = paragraph1._element
p.getparent().remove(p)

Handling line breaks

# Split the text on newlines and add each piece as its own paragraph
for one_paragraph_text in text.split("\n"):
    temp_paragraph = doc.add_paragraph(one_paragraph_text)
    paragraph_format = temp_paragraph.paragraph_format
    paragraph_format.first_line_indent = 0
    paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), "200")

Common paragraph formatting

from docx.shared import Pt
para_format = temp_paragraph.paragraph_format
para_format.line_spacing = Pt(18)  # line spacing (fixed value)
para_format.space_before = Pt(3)  # space before the paragraph
para_format.space_after = Pt(0)  # space after the paragraph
para_format.right_indent = Pt(20)  # right indent
para_format.left_indent = Pt(0)  # left indent

2.3 Character formatting

from docx.shared import RGBColor, Pt
# Bold text
temp_paragraph.add_run('加粗文本').bold = True
# Red, italic text with several other attributes
run = temp_paragraph.add_run('红色斜体文本')
run.font.color.rgb = RGBColor(255, 0, 0)  # red
run.font.size = Pt(14)  # font size 14 pt
run.bold = True  # bold
run.italic = True  # italic
run.underline = True  # underline
# Subscript and superscript
run2 = temp_paragraph.add_run("1")
run2.font.subscript = True  # subscript
run3 = temp_paragraph.add_run("2")
run3.font.superscript = True  # superscript

2.4 Working with tables

Creating a table

table = doc.add_table(rows=4, cols=5)
table.style = "Grid Table 1 Light"  # apply a predefined style

Filling cells

# Option 1: address a cell directly
cell = table.cell(0, 1)
cell.text = "parrot, possibly dead"
# Option 2: get the cells through a row
row = table.rows[1]
cells = row.cells
cells[0].text = "Foo bar to you."
cells[1].text = "And a hearty foo bar to you too sir!"

Listing the available table styles

from docx.enum.style import WD_STYLE_TYPE
styles = doc.styles
for s in styles:
    if s.type == WD_STYLE_TYPE.TABLE:
        print(s.name)

Adding and deleting rows

# Add a row
row = table.add_row()
# Delete a row by removing its <w:tr> element from the table
def remove_row(table, row):
    tbl = table._tbl
    tr = row._tr
    tbl.remove(tr)
row = table.rows[len(table.rows) - 1]  # the last row
remove_row(table, row)

Filling data in bulk

# Option 1: append one row per record
items = (
    (7, "1024", "Plush kittens"),
    (3, "2042", "Furbees"),
    (1, "1288", "French Poodle Collars, Deluxe"),
)
for item in items:
    cells = table.add_row().cells
    cells[0].text = str(item[0])
    cells[1].text = item[1]
    cells[2].text = item[2]
# Option 2: fill every existing cell
for row in table.rows:
    for cell in row.cells:
        cell.text = "数据单元"

Merging cells

table.cell(0, 0).merge(table.cell(1, 1))  # merge across rows and columns

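Table formatting

from copy import deepcopy
from docx.enum.table import WD_ALIGN_VERTICAL
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Cm, Pt

# Let column widths adjust automatically
table.autofit = True
# Set a row height
table.rows[0].height = Cm(0.93)
# Change the font size of the table style (affects every table using this style)
table.style.font.size = Pt(15)
# Set cell alignment (horizontal and vertical)
cell = table.cell(0, 0)
cell.paragraphs[0].paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
cell.vertical_alignment = WD_ALIGN_VERTICAL.CENTER
# Copy a table: deep-copy its XML element and insert it after a new paragraph
tbl_copy = deepcopy(doc.tables[0]._tbl)
para1 = doc.add_paragraph()
para1._p.addnext(tbl_copy)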

2.5 Working with images

Inserting images

import base64
from io import BytesIO
from docx.shared import Inches

# Plain insertion
doc.add_picture('图片1.png')
doc.add_picture('图片2.png', width=Inches(2.5), height=Inches(2))
# Insert from a base64-encoded string
with open("图片2base64.txt") as f:
    picture2_base64 = f.read()
img2_buf = base64.b64decode(picture2_base64)
doc.add_picture(BytesIO(img2_buf))
# Place two pictures side by side by adding them to the same run
run = doc.add_paragraph().add_run()
run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))
run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))

2.6 Page breaks

doc.add_page_break()

2.7 Managing styles

from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

# Modify an existing style
doc.styles["Normal"].font.size = Pt(14)
doc.styles['Normal'].font.name = 'Arial'
doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), '楷体')
# Create a custom paragraph style
UserStyle1 = doc.styles.add_style('UserStyle1', WD_STYLE_TYPE.PARAGRAPH)
UserStyle1.font.size = Pt(40)
UserStyle1.font.color.rgb = RGBColor(0xff, 0xde, 0x00)
UserStyle1.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
UserStyle1.font.name = '微软雅黑'
UserStyle1._element.rPr.rFonts.set(qn('w:eastAsia'), '微软雅黑')
# Use the custom style
doc.add_paragraph('自定义段落样式', style=UserStyle1)

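3. Template filling with docxtpl

docxtpl turns a Word document into a template so that data can be filled in automatically.

3.1 Creating a template

First create a Word template that contains placeholders wrapped in double curly braces, such as {{ title }}.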

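3.2 Filling the template

from docxtpl import DocxTemplate, InlineImage, RichText
tpl = DocxTemplate("docxexample.docx")
text = """Transformer 模型由编码器（Encoder）和解码器（Decoder）组成..."""
picture1 = InlineImage(tpl, image_descriptor="图片1.png")
# The context below also references picture2; the file name 图片2.png is an assumption
picture2 = InlineImage(tpl, image_descriptor="图片2.png")
# Prepare the data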
paragraphs1 = [  
    "步骤1：输入表示（Input Representation）",  
    "步骤2：编码器处理（Encoder Processing）",  
    "步骤3：解码器处理（Decoder Processing）",  
]  
paragraphs2 = [  
    {"step": 1, "text": "输入向量（词嵌入+位置编码）进入编码器层。"},  
    {"step": 2, "text": "自注意力子层。"},  
    {"step": 3, "text": "前馈网络子层。"},  
]  
table = [  
    {"character": "并行计算", "description": "编码器可并行处理整个序列（与RNN不同）"},  
    {"character": "自注意力", "description": "每个词直接关联所有词，捕获长距离依赖"},  
    {"character": "位置编码", "description": "为无顺序的注意力机制注入位置信息"},  
]  
alerts = [  
    {  
        "date": "2015-03-10",  
        "desc": RichText("Very critical alert", color="FF0000", bold=True),  
        "type": "CRITICAL",  
        "bg": "FF0000",  
    },  
    # ... other entries
]
# Render the template
context = {  
    "title": "Transformer",  
    "text_body": text,  
    "picture1": picture1,  
    "picture2": picture2,  
    "paragraphs1": paragraphs1,  
    "paragraphs2": paragraphs2,  
    "runs": paragraphs1,  
    "display_paragraph": True,  
    "table1": table,  
    "table2": table,  
    "alerts": alerts,  
}  
tpl.render(context)  
tpl.save("文件3.docx")

4. Advanced features

4.1 Advanced table operations

Setting cell borders

from docx.oxml import OxmlElement  
from docx.oxml.ns import qn  
def set_cell_border(cell, **kwargs):  
    tc = cell._tc  
    tcPr = tc.get_or_add_tcPr()  
    tcBorders = tcPr.first_child_found_in("w:tcBorders")  
    if tcBorders is None:  
        tcBorders = OxmlElement("w:tcBorders")  
        tcPr.append(tcBorders)  
      
    for edge in ("left", "top", "right", "bottom", "insideH", "insideV"):  
        edge_data = kwargs.get(edge)  
        if edge_data:  
            tag = "w:{}".format(edge)  
            element = tcBorders.find(qn(tag))  
            if element is None:  
                element = OxmlElement(tag)  
                tcBorders.append(element)  
              
            for key in ["sz", "val", "color", "space", "shadow"]:  
                if key in edge_data:  
                    element.set(qn("w:{}".format(key)), str(edge_data[key]))  
# Usage example (colors are 6-digit hex values without a leading "#")
set_cell_border(
    table.cell(0, 0),
    top={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    bottom={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    left={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    right={"sz": 4, "val": "single", "color": "000000", "space": "0"},
)
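The `sz` value of a border is measured in eighths of a point, so the value 4 used above corresponds to a 0.5 pt line. A small conversion sketch; both helper names are invented for this example, and the returned dict simply follows the format that `set_cell_border` expects:

```python
def border_sz_from_pt(width_pt):
    """Convert a border width in points to the integer w:sz value (1/8 pt units)."""
    return round(width_pt * 8)


def thin_border(width_pt=0.5, color="000000"):
    """Build one edge definition in the dict format used by set_cell_border."""
    return {
        "sz": border_sz_from_pt(width_pt),
        "val": "single",
        "color": color,
        "space": "0",
    }
```

With these helpers a call could look like `set_cell_border(cell, top=thin_border(), bottom=thin_border(1.5))`.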

4.2 Hyperlinks

def add_hyperlink(paragraph, url, text):  
    part = paragraph.part  
    r_id = part.relate_to(  
        url,  
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink",  
        is_external=True,  
    )  
    hyperlink = OxmlElement("w:hyperlink")  
    hyperlink.set(qn("r:id"), r_id)  
      
    run = OxmlElement("w:r")  
    run_text = OxmlElement("w:t")  
    run_text.text = text  
    run.append(run_text)  
    hyperlink.append(run)  
      
    paragraph._p.append(hyperlink)  
p = doc.add_paragraph("点击访问： ")  
add_hyperlink(p, "https://www.baidu.com", "示例链接")

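4.3 Advanced image operations

Extracting images from a document

import os
import zipfile
from xml.etree.ElementTree import fromstring
from docx.opc.constants import RELATIONSHIP_TYPE as RT  # RT.IMAGE is the image relationship type

def extract_images(docx_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists
    with zipfile.ZipFile(docx_path) as z: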
        try:  
            doc_rels = z.read('word/_rels/document.xml.rels').decode('utf-8')  
        except KeyError:  
            return []  
          
        root = fromstring(doc_rels)  
        rels = []  
        for child in root:  
            if 'Type' in child.attrib and child.attrib['Type'] == RT.IMAGE:  
                rels.append((child.attrib['Id'], child.attrib['Target']))  
          
        images = []  
        for rel_id, target in rels:  
            try:  
                image_data = z.read('word/' + target)  
                image_name = target.split('/')[-1]  
                with open(f"{output_dir}/{image_name}", 'wb') as f:  
                    f.write(image_data)  
                images.append(image_name)  
            except KeyError:  
                continue  
        return images  
print(extract_images("Transformer原理纯享版.docx", "pictures"))

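Inserting a floating image

from docx import Document
from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# Insert a floating image placed "behind text"
# Changing behindDoc="1" to "0" places it "in front of text" instead
# refer to docx.oxml.shape.CT_Inline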
class CT_Anchor(BaseOxmlElement):  
    """  
    ``<w:anchor>`` element, container for a floating image.  
    """  
    extent = OneAndOnlyOne('wp:extent')  
    docPr = OneAndOnlyOne('wp:docPr')  
    graphic = OneAndOnlyOne('a:graphic')  
    @classmethod  
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):  
        """  
        Return a new ``<wp:anchor>`` element populated with the values passed  
        as parameters.  
        """  
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))  
        anchor.extent.cx = cx  
        anchor.extent.cy = cy  
        anchor.docPr.id = shape_id  
        anchor.docPr.name = 'Picture %d' % shape_id  
        anchor.graphic.graphicData.uri = (  
            'http://schemas.openxmlformats.org/drawingml/2006/picture'  
        )  
        anchor.graphic.graphicData._insert_pic(pic)  
        return anchor  
    @classmethod  
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):  
        """  
        Return a new `wp:anchor` element containing the `pic:pic` element  
        specified by the argument values.  
        """  
        pic_id = 0  # Word doesn't seem to use this, but does not omit it  
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)  
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)  
        anchor.graphic.graphicData._insert_pic(pic)  
        return anchor  
    @classmethod  
    def _anchor_xml(cls, pos_x, pos_y):  
        return (  
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'  
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'  
            '           %s>\n'  
            '  <wp:simplePos x="0" y="0"/>\n'  
            '  <wp:positionH relativeFrom="page">\n'  
            '    <wp:posOffset>%d</wp:posOffset>\n'  
            '  </wp:positionH>\n'  
            '  <wp:positionV relativeFrom="page">\n'  
            '    <wp:posOffset>%d</wp:posOffset>\n'  
            '  </wp:positionV>\n'                      
            '  <wp:extent cx="914400" cy="914400"/>\n'  
            '  <wp:wrapNone/>\n'  
            '  <wp:docPr id="666" name="unnamed"/>\n'  
            '  <wp:cNvGraphicFramePr>\n'  
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'  
            '  </wp:cNvGraphicFramePr>\n'  
            '  <a:graphic>\n'  
            '    <a:graphicData uri="URI not set"/>\n'  
            '  </a:graphic>\n'  
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )  
        )  
# refer to docx.parts.story.BaseStoryPart.new_pic_inline  
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):  
    """Return a newly-created `w:anchor` element.  
    The element contains the image specified by *image_descriptor* and is scaled  
    based on the values of *width* and *height*.  
    """  
    rId, image = part.get_or_add_image(image_descriptor)  
    cx, cy = image.scaled_dimensions(width, height)  
    shape_id, filename = part.next_id, image.filename      
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)  
# refer to docx.text.run.add_picture  
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):  
    """Add float picture at fixed position `pos_x` and `pos_y` to the top-left point of page.  
    """  
    run = p.add_run()  
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)  
    run._r.add_drawing(anchor)  
# refer to docx.oxml.__init__.py  
register_element_cls('wp:anchor', CT_Anchor)  
document = Document()  
# add a floating picture  
p = document.add_paragraph()  
add_float_picture(p, '图片1.png')  
# add text  
p.add_run('Hello World '*50)  
document.save('文件2.docx')  
# https://www.cnblogs.com/dancesir/p/17788854.html

4.4 Columns

# Split the first section into 2 columns
section = doc.sections[0]
sectPr = section._sectPr
cols = sectPr.xpath('./w:cols')[0]
cols.set(qn('w:num'), '2')

4.5 Headers and footers

# Regular header
doc = Document('Transformer原理纯享版.docx')
doc.sections[0].header.paragraphs[0].text = "这是第1节页眉"
# Different headers for odd and even pages
doc.settings.odd_and_even_pages_header_footer = True
doc.sections[0].even_page_header.paragraphs[0].text = "这是偶数页页眉"
doc.sections[0].header.paragraphs[0].text = "这是奇数页页眉"
# Different header on the first page
doc.sections[0].different_first_page_header_footer = True
doc.sections[0].first_page_header.paragraphs[0].text = "这是首页页眉"

4.6 Table of contents

from lxml import etree
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

# Insert a TOC field (python-docx does not compute the field contents)
paragraph = doc.paragraphs[0].insert_paragraph_before()
run = paragraph.add_run()
fld_begin = OxmlElement('w:fldChar')
fld_begin.set(qn('w:fldCharType'), 'begin')
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve')
instrText.text = r'TOC \o "1-3" \h \z \u'
fld_separate = OxmlElement('w:fldChar')
fld_separate.set(qn('w:fldCharType'), 'separate')
placeholder = OxmlElement('w:t')  # shown until the field is updated in Word
placeholder.text = "Right-click to update field."
fld_end = OxmlElement('w:fldChar')
fld_end.set(qn('w:fldCharType'), 'end')
r_element = run._r
# The placeholder text sits between the "separate" and "end" markers
for child in (fld_begin, instrText, fld_separate, placeholder, fld_end):
    r_element.append(child)
# Ask Word to update all fields (including the TOC) when the document is opened
w_ns = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
update_fields = etree.SubElement(doc.settings.element, w_ns + "updateFields")
update_fields.set(w_ns + "val", "true")

4.7 Merging documents

from docx import Document
from docxcompose.composer import Composer

master = Document("文件1.docx")
composer = Composer(master)
doc1 = Document("文件2.docx")
composer.append(doc1)
doc2 = Document("文件3.docx")
composer.append(doc2)
composer.save("combined.docx")

Note: when documents are merged, the appended documents follow the formatting of the first (master) document.

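Summary

This article walked through the full workflow for working with Word documents in Python, including:

Using python-docx for basic document creation, editing, and formatting
Using docxtpl for template-based automated data filling
Using docxcompose to merge multiple Word documents
Advanced features such as setting cell borders, inserting hyperlinks, extracting images, and configuring headers and footers

These techniques apply to automated report generation, batch document processing, contract template filling, and similar scenarios, and can save a considerable amount of manual work.

References

1. My earlier blog posts

I have now substantially reworked the whole tutorial, so going forward only this article will be kept up to date.

2. Official documentation

python-docx documentation: python-docx.readthedocs.io/
docxtpl documentation: docxtpl.readthedocs.io/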
