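A Complete Guide to Processing Word Documents with Python: From Basics to Advanced

诸神缄默不语 - personal CSDN blog post index

For the tutorial video and a demo of what the code generates, see: www.bilibili.com/video/BV1Ab...

If you need the original Jupyter notebook or the images and documents used as examples, feel free to contact me.

Processing Word documents in Python is a common, practical task. This article shows how to create, modify, and process Word documents with several mainstream Python libraries, covering the full workflow from basic operations to advanced features.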

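Required libraries and installation

Before starting, install the following Python libraries:

  • python-docx: creates and modifies Word documents
  • docxtpl: fills Word documents from templates
  • docxcompose: merges multiple Word documents
  • lxml: XML processing library

Install them with pip:

plain
pip install python-docx docxtpl docxcompose lxml

Or with uv:

plain
uv add python-docx docxtpl docxcompose lxml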
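To confirm the installation worked, a small check like the one below can help. It is only a sketch: the helper name `installed_versions` is my own, and it relies on the standard-library `importlib.metadata` module (Python 3.8+). Note that the distribution name and the import name can differ: python-docx is imported as `docx`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    required = ["python-docx", "docxtpl", "docxcompose", "lxml"]
    for name, ver in installed_versions(required).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```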

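1. Basic operations

1.1 Creating and saving a document

Creating a document with python-docx is straightforward:

python
from docx import Document

doc = Document()  # start from the built-in default template
doc.add_paragraph("Python-docx是一个用于创建")
doc.save("文件1.docx")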

1.2 Setting a Chinese font

The default font handles Chinese text poorly, so the East Asian font has to be set separately:

python
from docx import Document
from docx.oxml.ns import qn

def set_chinese_font(run, zh_font_name="宋体", en_font_name="Times New Roman"):
    # font.name sets the Latin font; the East Asian font is set through w:eastAsia
    run.font.name = en_font_name
    run._element.rPr.rFonts.set(qn("w:eastAsia"), zh_font_name)

doc = Document()
paragraph = doc.add_paragraph()
run = paragraph.add_run('这是一段设置了中文字体的文本。')
set_chinese_font(run)
doc.save("文件1.docx")

Note: the target file must not be open in another program (for example Word) when you save; otherwise a PermissionError is raised.
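One way to cope with a locked file is to fall back to a different file name instead of crashing. The helper below is a hypothetical sketch (`save_with_fallback` is not part of python-docx); it only assumes the object has a `save(path)` method, as a python-docx `Document` does.

```python
from datetime import datetime
from pathlib import Path


def save_with_fallback(doc, path):
    """Save `doc` to `path`; if that file is locked, save under a timestamped name instead."""
    target = Path(path)
    try:
        doc.save(str(target))
        return target
    except PermissionError:
        # The file is probably open elsewhere: write a sibling file rather than failing
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        fallback = target.with_name(f"{target.stem}-{stamp}{target.suffix}")
        doc.save(str(fallback))
        return fallback
```

The function returns the path that was actually written, so the caller can tell whether the fallback was used.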

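1.3 Opening an existing document

python
doc = Document('example.docx')

Notes

  • The file must be a standard .docx file; legacy .doc files are not supported
  • The file must not be saved in the Strict Open XML format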
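Both conditions can be checked before handing a file to python-docx. The following is a rough sketch using only the standard library; the function name and return values are my own, and the Strict check is a heuristic that assumes the Strict namespace URI is declared near the start of `word/document.xml`.

```python
import zipfile

# Namespace used by Strict Open XML word-processing documents
STRICT_NS = "http://purl.oclc.org/ooxml/wordprocessingml/main"


def check_docx(path_or_file):
    """Return "ok", "not-docx", or "strict" for the given path or file-like object."""
    if not zipfile.is_zipfile(path_or_file):
        return "not-docx"  # e.g. a legacy binary .doc file
    with zipfile.ZipFile(path_or_file) as z:
        if "word/document.xml" not in z.namelist():
            return "not-docx"  # a zip archive, but not a Word document
        head = z.read("word/document.xml")[:4096].decode("utf-8", errors="ignore")
    return "strict" if STRICT_NS in head else "ok"
```

A file reported as "strict" can usually be fixed by re-saving it in Word as a regular "Word Document (*.docx)".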

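1.4 Iterating over document content

python
# Iterate over the first three paragraphs
for para in doc.paragraphs[:3]:
    print(para)       # the Paragraph object itself
    print(para.text)  # its plain text
    print()
# Iterate over every cell of every table
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)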

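2. Document formatting

2.1 Headings

python
doc.add_heading("1.1 Overall Transformer workflow", 2)
doc.add_heading("Overall Transformer architecture", 3)

Note: the document must contain the corresponding heading styles (for example "Heading 2" for level 2); otherwise an error is raised.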

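2.2 Working with paragraphs

Adding a paragraph

python
text = """The Transformer model consists of an encoder and a decoder. ..."""
paragraph1 = doc.add_paragraph(text)

First-line indent

Indent the first line by 2 characters:

python
paragraph_format = paragraph1.paragraph_format
paragraph_format.first_line_indent = 0  # creates the w:ind element
paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), '200')  # 200 = 2 characters

Indent the first line by a fixed distance:

python
from docx.shared import Pt

paragraph_format.first_line_indent = Pt(10)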

Paragraph alignment

python
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

paragraph1.alignment = WD_PARAGRAPH_ALIGNMENT.LEFT

Deleting a paragraph

python
# python-docx has no public delete method, so remove the underlying XML element
p = paragraph1._element
p.getparent().remove(p)

Handling line breaks

python
# Split the text on newlines and add each piece as its own paragraph
for one_paragraph_text in text.split("\n"):
    temp_paragraph = doc.add_paragraph(one_paragraph_text)
    paragraph_format = temp_paragraph.paragraph_format
    paragraph_format.first_line_indent = 0
    paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), "200")

Common paragraph formatting

python
from docx.shared import Pt

para_format = temp_paragraph.paragraph_format
para_format.line_spacing = Pt(18)  # line spacing (fixed value)
para_format.space_before = Pt(3)  # space before the paragraph
para_format.space_after = Pt(0)  # space after the paragraph
para_format.right_indent = Pt(20)  # right indent
para_format.left_indent = Pt(0)  # left indent

2.3 Character formatting

python
from docx.shared import RGBColor, Pt

# Bold text
temp_paragraph.add_run('Bold text').bold = True
# Red, bold, italic, underlined text
run = temp_paragraph.add_run('Red italic text')
run.font.color.rgb = RGBColor(255, 0, 0)  # red
run.font.size = Pt(14)  # 14 pt
run.bold = True  # bold
run.italic = True  # italic
run.underline = True  # underline
# Subscript and superscript
run2 = temp_paragraph.add_run("1")
run2.font.subscript = True  # subscript
run3 = temp_paragraph.add_run("2")
run3.font.superscript = True  # superscript

2.4 Working with tables

Creating a table

python
table = doc.add_table(rows=4, cols=5)
table.style = "Grid Table 1 Light"  # apply a predefined style; it must exist in the document

Filling cells

python
# Option 1: address a cell directly
cell = table.cell(0, 1)
cell.text = "parrot, possibly dead"
# Option 2: get the cells through a row
row = table.rows[1]
cells = row.cells
cells[0].text = "Foo bar to you."
cells[1].text = "And a hearty foo bar to you too sir!"

Listing the available table styles

python
from docx.enum.style import WD_STYLE_TYPE

styles = doc.styles
for s in styles:
    if s.type == WD_STYLE_TYPE.TABLE:
        print(s.name)

Adding and deleting rows

python
# Add a row
row = table.add_row()
# Delete a row by removing its XML element from the table
def remove_row(table, row):
    tbl = table._tbl
    tr = row._tr
    tbl.remove(tr)

row = table.rows[len(table.rows) - 1]  # the last row
remove_row(table, row)

Filling in data in bulk

python
# Option 1: append one row per record
items = (
    (7, "1024", "Plush kittens"),
    (3, "2042", "Furbees"),
    (1, "1288", "French Poodle Collars, Deluxe"),
)
for item in items:
    cells = table.add_row().cells
    cells[0].text = str(item[0])
    cells[1].text = item[1]
    cells[2].text = item[2]
# Option 2: fill every existing cell
for row in table.rows:
    for cell in row.cells:
        cell.text = "data cell"

Merging cells

python
table.cell(0, 0).merge(table.cell(1, 1))  # merge across rows and columns

Table formatting

python
from copy import deepcopy
from docx.enum.table import WD_ALIGN_VERTICAL
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Cm, Pt

# Let the table width adjust automatically
table.autofit = True
# Set a row height
table.rows[0].height = Cm(0.93)
# Change the table font size
table.style.font.size = Pt(15)
# Set cell alignment (horizontal on the paragraph, vertical on the cell)
cell = table.cell(0, 0)
cell.paragraphs[0].paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
cell.vertical_alignment = WD_ALIGN_VERTICAL.CENTER
# Copy a table and place the copy after a new paragraph
table_copy = deepcopy(doc.tables[0])
para1 = doc.add_paragraph()
para1._p.addnext(table_copy._element)

2.5 Working with pictures

Inserting pictures

python
import base64
from io import BytesIO
from docx.shared import Inches

# Plain insertion
doc.add_picture('图片1.png')
doc.add_picture('图片2.png', width=Inches(2.5), height=Inches(2))
# Insert from base64-encoded data
with open("图片2base64.txt") as f:
    picture2_base64 = f.read()
img2_buf = base64.b64decode(picture2_base64)
doc.add_picture(BytesIO(img2_buf))
# Place two pictures side by side (both in the same run)
run = doc.add_paragraph().add_run()
run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))
run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))

2.6 Page breaks

python
doc.add_page_break()

2.7 Managing styles

python
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

# Modify an existing style
doc.styles["Normal"].font.size = Pt(14)
doc.styles['Normal'].font.name = 'Arial'
doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), '楷体')
# Create a custom paragraph style
UserStyle1 = doc.styles.add_style('UserStyle1', WD_STYLE_TYPE.PARAGRAPH)
UserStyle1.font.size = Pt(40)
UserStyle1.font.color.rgb = RGBColor(0xff, 0xde, 0x00)
UserStyle1.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
UserStyle1.font.name = '微软雅黑'
UserStyle1._element.rPr.rFonts.set(qn('w:eastAsia'), '微软雅黑')
# Use the custom style
doc.add_paragraph('Custom paragraph style', style=UserStyle1)

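3. Filling templates with docxtpl

docxtpl turns a Word document into a template so that data can be filled in automatically.

3.1 Creating a template

First create a Word template that contains placeholders; each placeholder is wrapped in double curly braces, {{ }}.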

3.2 Filling the template

python
from docxtpl import DocxTemplate, InlineImage, RichText

tpl = DocxTemplate("docxexample.docx")
text = """The Transformer model consists of an encoder and a decoder..."""
picture1 = InlineImage(tpl, image_descriptor="图片1.png")
# Assumption: the second picture reuses 图片2.png from the earlier section
picture2 = InlineImage(tpl, image_descriptor="图片2.png")
# Prepare the data
paragraphs1 = [
    "Step 1: Input Representation",
    "Step 2: Encoder Processing",
    "Step 3: Decoder Processing",
]
paragraphs2 = [
    {"step": 1, "text": "The input vectors (word embeddings + positional encoding) enter the encoder layer."},
    {"step": 2, "text": "Self-attention sublayer."},
    {"step": 3, "text": "Feed-forward network sublayer."},
]
table = [
    {"character": "Parallel computation", "description": "The encoder processes the whole sequence in parallel (unlike an RNN)"},
    {"character": "Self-attention", "description": "Every word attends directly to all words, capturing long-range dependencies"},
    {"character": "Positional encoding", "description": "Injects position information into the order-agnostic attention mechanism"},
]
alerts = [
    {
        "date": "2015-03-10",
        "desc": RichText("Very critical alert", color="FF0000", bold=True),
        "type": "CRITICAL",
        "bg": "FF0000",
    },
    # ... other data
]
# Render the template: the keys must match the placeholder names in the template
context = {
    "title": "Transformer",
    "text_body": text,
    "picture1": picture1,
    "picture2": picture2,
    "paragraphs1": paragraphs1,
    "paragraphs2": paragraphs2,
    "runs": paragraphs1,
    "display_paragraph": True,
    "table1": table,
    "table2": table,
    "alerts": alerts,
}
tpl.render(context)
tpl.save("文件3.docx")

4. Advanced features

4.1 Advanced table operations

Setting cell borders

python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def set_cell_border(cell, **kwargs):
    """Set borders on one cell; each keyword is an edge name mapped to its attributes."""
    tc = cell._tc
    tcPr = tc.get_or_add_tcPr()
    # Reuse the existing w:tcBorders element or create it
    tcBorders = tcPr.first_child_found_in("w:tcBorders")
    if tcBorders is None:
        tcBorders = OxmlElement("w:tcBorders")
        tcPr.append(tcBorders)

    for edge in ("left", "top", "right", "bottom", "insideH", "insideV"):
        edge_data = kwargs.get(edge)
        if edge_data:
            tag = "w:{}".format(edge)
            element = tcBorders.find(qn(tag))
            if element is None:
                element = OxmlElement(tag)
                tcBorders.append(element)

            for key in ["sz", "val", "color", "space", "shadow"]:
                if key in edge_data:
                    element.set(qn("w:{}".format(key)), str(edge_data[key]))

# Usage example (sz is in eighths of a point; color is a hex RGB value without "#")
set_cell_border(
    table.cell(0, 0),
    top={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    bottom={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    left={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    right={"sz": 4, "val": "single", "color": "000000", "space": "0"},
)

4.2 Hyperlinks

python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def add_hyperlink(paragraph, url, text):
    # Register the URL as an external relationship of the document part
    part = paragraph.part
    r_id = part.relate_to(
        url,
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink",
        is_external=True,
    )
    # Build <w:hyperlink r:id="..."><w:r><w:t>text</w:t></w:r></w:hyperlink>
    hyperlink = OxmlElement("w:hyperlink")
    hyperlink.set(qn("r:id"), r_id)

    run = OxmlElement("w:r")
    run_text = OxmlElement("w:t")
    run_text.text = text
    run.append(run_text)
    hyperlink.append(run)

    paragraph._p.append(hyperlink)

p = doc.add_paragraph("Click to visit: ")
add_hyperlink(p, "https://www.baidu.com", "Example link")

4.3 Advanced picture operations

Extracting the pictures from a document

python
import os
import zipfile
from xml.etree.ElementTree import fromstring
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def extract_images(docx_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)  # make sure the output folder exists
    with zipfile.ZipFile(docx_path) as z:
        try:
            doc_rels = z.read('word/_rels/document.xml.rels').decode('utf-8')
        except KeyError:
            return []  # the document has no relationships file

        # Collect the relationships that point to images
        root = fromstring(doc_rels)
        rels = []
        for child in root:
            if 'Type' in child.attrib and child.attrib['Type'] == RT.IMAGE:
                rels.append((child.attrib['Id'], child.attrib['Target']))

        # Copy each image out of the archive
        images = []
        for rel_id, target in rels:
            try:
                image_data = z.read('word/' + target)
                image_name = target.split('/')[-1]
                with open(f"{output_dir}/{image_name}", 'wb') as f:
                    f.write(image_data)
                images.append(image_name)
            except KeyError:
                continue
        return images

print(extract_images("Transformer原理纯享版.docx", "pictures"))

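Inserting a floating picture

python
# Imports needed by the helpers below (python-docx internals; module paths may differ between versions)
from docx import Document
from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# Insert a floating picture placed "behind text"
# Changing behindDoc="1" to "0" places it "in front of text" instead
# refer to docx.oxml.shape.CT_Inline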
class CT_Anchor(BaseOxmlElement):  
    """  
    ``<wp:anchor>`` element, container for a floating image.
    """  
    extent = OneAndOnlyOne('wp:extent')  
    docPr = OneAndOnlyOne('wp:docPr')  
    graphic = OneAndOnlyOne('a:graphic')  
    @classmethod  
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):  
        """  
        Return a new ``<wp:anchor>`` element populated with the values passed  
        as parameters.  
        """  
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))  
        anchor.extent.cx = cx  
        anchor.extent.cy = cy  
        anchor.docPr.id = shape_id  
        anchor.docPr.name = 'Picture %d' % shape_id  
        anchor.graphic.graphicData.uri = (  
            'http://schemas.openxmlformats.org/drawingml/2006/picture'  
        )  
        anchor.graphic.graphicData._insert_pic(pic)  
        return anchor  
    @classmethod  
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):  
        """  
        Return a new `wp:anchor` element containing the `pic:pic` element  
        specified by the argument values.  
        """  
        pic_id = 0  # Word doesn't seem to use this, but does not omit it  
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)  
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)  
        anchor.graphic.graphicData._insert_pic(pic)  
        return anchor  
    @classmethod  
    def _anchor_xml(cls, pos_x, pos_y):  
        return (  
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'  
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'  
            '           %s>\n'  
            '  <wp:simplePos x="0" y="0"/>\n'  
            '  <wp:positionH relativeFrom="page">\n'  
            '    <wp:posOffset>%d</wp:posOffset>\n'  
            '  </wp:positionH>\n'  
            '  <wp:positionV relativeFrom="page">\n'  
            '    <wp:posOffset>%d</wp:posOffset>\n'  
            '  </wp:positionV>\n'                      
            '  <wp:extent cx="914400" cy="914400"/>\n'  
            '  <wp:wrapNone/>\n'  
            '  <wp:docPr id="666" name="unnamed"/>\n'  
            '  <wp:cNvGraphicFramePr>\n'  
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'  
            '  </wp:cNvGraphicFramePr>\n'  
            '  <a:graphic>\n'  
            '    <a:graphicData uri="URI not set"/>\n'  
            '  </a:graphic>\n'  
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )  
        )  
# refer to docx.parts.story.BaseStoryPart.new_pic_inline  
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):  
    """Return a newly-created `wp:anchor` element.
    The element contains the image specified by *image_descriptor* and is scaled  
    based on the values of *width* and *height*.  
    """  
    rId, image = part.get_or_add_image(image_descriptor)  
    cx, cy = image.scaled_dimensions(width, height)  
    shape_id, filename = part.next_id, image.filename      
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)  
# refer to docx.text.run.add_picture  
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):  
    """Add float picture at fixed position `pos_x` and `pos_y` to the top-left point of page.  
    """  
    run = p.add_run()  
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)  
    run._r.add_drawing(anchor)  
# refer to docx.oxml.__init__.py  
register_element_cls('wp:anchor', CT_Anchor)  
document = Document()  
# add a floating picture  
p = document.add_paragraph()  
add_float_picture(p, '图片1.png')  
# add text  
p.add_run('Hello World '*50)  
document.save('文件2.docx')  
# https://www.cnblogs.com/dancesir/p/17788854.html  

4.4 Columns

python
# Lay out the first section in 2 columns
section = doc.sections[0]
sectPr = section._sectPr
cols = sectPr.xpath('./w:cols')[0]
cols.set(qn('w:num'), '2')

4.5 Headers and footers

python
# Plain header
doc = Document('Transformer原理纯享版.docx')
doc.sections[0].header.paragraphs[0].text = "This is the header of section 1"
# Different headers for odd and even pages
doc.settings.odd_and_even_pages_header_footer = True
doc.sections[0].even_page_header.paragraphs[0].text = "This is the even-page header"
doc.sections[0].header.paragraphs[0].text = "This is the odd-page header"
# Different header for the first page
doc.sections[0].different_first_page_header_footer = True
doc.sections[0].first_page_header.paragraphs[0].text = "This is the first-page header"

4.6 Table of contents

python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from lxml import etree

# Insert a TOC field (its entries are not generated until the field is updated in Word)
paragraph = doc.paragraphs[0].insert_paragraph_before()
run = paragraph.add_run()
fld_begin = OxmlElement('w:fldChar')
fld_begin.set(qn('w:fldCharType'), 'begin')
instr_text = OxmlElement('w:instrText')
instr_text.set(qn('xml:space'), 'preserve')
instr_text.text = r'TOC \o "1-3" \h \z \u'
fld_separate = OxmlElement('w:fldChar')
fld_separate.set(qn('w:fldCharType'), 'separate')
placeholder = OxmlElement('w:t')
placeholder.text = "Right-click to update field."
fld_end = OxmlElement('w:fldChar')
fld_end.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fld_begin)
r_element.append(instr_text)
r_element.append(fld_separate)
r_element.append(placeholder)  # placeholder text goes in the run, not inside w:fldChar
r_element.append(fld_end)
# Ask Word to update fields (including the TOC) when the document is opened
w_ns = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
update_fields = etree.SubElement(doc.settings.element, w_ns + "updateFields")
update_fields.set(w_ns + "val", "true")

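4.7 Merging documents

python
from docx import Document
from docxcompose.composer import Composer

master = Document("文件1.docx")  # the first document acts as the master
composer = Composer(master)
doc1 = Document("文件2.docx")
composer.append(doc1)
doc2 = Document("文件3.docx")
composer.append(doc2)
composer.save("combined.docx")

Note: when documents are merged, the appended documents take on the formatting of the first document.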

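Summary

This article covered the full workflow for processing Word documents with Python, including:

  • Using python-docx for basic document creation, editing, and formatting
  • Using docxtpl for template-based automatic data filling
  • Using docxcompose to merge multiple Word documents
  • Advanced features such as setting cell borders, inserting hyperlinks, extracting pictures, and setting headers and footers

These techniques apply to automated report generation, batch document processing, contract template filling, and similar scenarios, and can save a great deal of manual work.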

