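For a video walkthrough and a demo of the generated output, see: www.bilibili.com/video/BV1Ab...
If you need the original Jupyter notebook or the sample images and documents used in the examples, feel free to contact me.
Working with Word documents is a common, practical task in Python. This article shows how to create, modify, and process Word documents with several mainstream Python libraries, covering the full workflow from basic operations to advanced features.
Required libraries and installation
Before starting, install the following Python libraries:
- python-docx: creates and modifies Word documents
- docxtpl: fills Word documents from templates
- docxcompose: merges multiple Word documents
- lxml: XML processing library
Install them with pip: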
plain
pip install python-docx docxtpl docxcompose lxml
Or with uv:
plain
uv add python-docx docxtpl docxcompose lxml
1. Basic operations
1.1 Creating and saving a document
Creating a document with python-docx takes only a few lines:
python
from docx import Document
doc = Document()
doc.add_paragraph("Python-docx是一个用于创建")
doc.save("文件1.docx")
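1.2 Setting a Chinese font
The default font renders Chinese poorly, so the East Asian font has to be set separately:
python
from docx import Document
from docx.oxml.ns import qn

def set_chinese_font(run, zh_font_name="宋体", en_font_name="Times New Roman"):
    # Setting font.name creates the w:rFonts element and sets the Latin font
    run.font.name = en_font_name
    # The East Asian font is a separate attribute, set here on the XML element
    run._element.rPr.rFonts.set(qn("w:eastAsia"), zh_font_name)

doc = Document()
paragraph = doc.add_paragraph()
run = paragraph.add_run('这是一段设置了中文字体的文本。')
set_chinese_font(run)
doc.save("文件1.docx")
Note: the target file must not be open in another program (such as Word) when you save; otherwise a PermissionError is raised.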
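One way to handle the locked-file case is to catch the error and save under a different name. The sketch below is only an illustration: the helper name `save_with_fallback` and the `_copy` suffix are choices made for this example, not part of python-docx.

```python
from pathlib import Path


def save_with_fallback(doc, path):
    """Save `doc` to `path`; if that file is locked, save under a new name instead."""
    target = Path(path)
    try:
        doc.save(str(target))
        return target
    except PermissionError:
        # Hypothetical naming scheme: "report.docx" -> "report_copy.docx"
        fallback = target.with_name(f"{target.stem}_copy{target.suffix}")
        doc.save(str(fallback))
        return fallback
```

Calling `save_with_fallback(doc, "文件1.docx")` returns the path that was actually written, so the caller can tell the user where the file ended up.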
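1.3 Opening an existing document
python
doc = Document('example.docx')
Notes:
- The file must be a standard .docx file; the legacy .doc format is not supported
- The file must not be saved in the Strict Open XML format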
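Both conditions can be checked before handing a file to python-docx. The sketch below uses only the standard library. It assumes that Strict Open XML files declare the `purl.oclc.org` main namespace in `word/document.xml`, and the function name `check_docx` and its return values are made up for this example.

```python
import zipfile

# Assumption: Strict files use this main namespace, while regular
# (Transitional) files use schemas.openxmlformats.org instead.
STRICT_NS = b"http://purl.oclc.org/ooxml/wordprocessingml/main"


def check_docx(path_or_file):
    """Return "ok", "not-docx", or "strict" for the given path or file object."""
    if not zipfile.is_zipfile(path_or_file):
        return "not-docx"  # e.g. a legacy binary .doc file
    with zipfile.ZipFile(path_or_file) as z:
        if "word/document.xml" not in z.namelist():
            return "not-docx"
        xml = z.read("word/document.xml")
    return "strict" if STRICT_NS in xml else "ok"
```

A file reported as "strict" can usually be re-saved from Word as a regular Word Document (.docx) before processing.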
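1.4 Iterating over document content
python
# Iterate over the first three paragraphs
for para in doc.paragraphs[:3]:
    print(para)       # the Paragraph object itself
    print(para.text)  # its plain text
    print()
# Iterate over every cell of every table
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)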
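2. Document formatting
2.1 Headings
python
doc.add_heading("1.1 Transformer整体工作流程", 2)  # level-2 heading
doc.add_heading("Transformer整体架构", 3)          # level-3 heading
Note: the document must already contain the matching heading styles (for example "Heading 2"); otherwise an error is raised.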
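2.2 Paragraphs
Adding a paragraph
python
text = """Transformer 模型由编码器(Encoder)和解码器(Decoder)组成。..."""
paragraph1 = doc.add_paragraph(text)
First-line indent
Indent the first line by two characters:
python
from docx.oxml.ns import qn
paragraph_format = paragraph1.paragraph_format
paragraph_format.first_line_indent = 0  # creates the w:ind element that is edited next
# w:firstLineChars is measured in hundredths of a character, so '200' means 2 characters
paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), '200')
Indent the first line by a fixed distance:
python
from docx.shared import Pt
paragraph_format.first_line_indent = Pt(10)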
Paragraph alignment
python
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
paragraph1.alignment = WD_PARAGRAPH_ALIGNMENT.LEFT
Deleting a paragraph
python
# python-docx has no delete method, so remove the underlying XML element from its parent
p = paragraph1._element
p.getparent().remove(p)
Handling line breaks
python
# Split the text on newlines and add each piece as its own paragraph
for one_paragraph_text in text.split("\n"):
    temp_paragraph = doc.add_paragraph(one_paragraph_text)
    paragraph_format = temp_paragraph.paragraph_format
    paragraph_format.first_line_indent = 0
    paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), "200")
Common paragraph formats
python
from docx.shared import Pt
para_format = temp_paragraph.paragraph_format
para_format.line_spacing = Pt(18)  # line spacing (fixed value)
para_format.space_before = Pt(3)   # space before the paragraph
para_format.space_after = Pt(0)    # space after the paragraph
para_format.right_indent = Pt(20)  # right indent
para_format.left_indent = Pt(0)    # left indent
2.3 Character formatting
python
from docx.shared import RGBColor, Pt
# Bold text
temp_paragraph.add_run('加粗文本').bold = True
# Red, italic text
run = temp_paragraph.add_run('红色斜体文本')
run.font.color.rgb = RGBColor(255, 0, 0)  # red
run.font.size = Pt(14)                    # 14 pt font size
run.bold = True                           # bold
run.italic = True                         # italic
run.underline = True                      # underline
# Subscript and superscript
run2 = temp_paragraph.add_run("1")
run2.font.subscript = True    # subscript
run3 = temp_paragraph.add_run("2")
run3.font.superscript = True  # superscript
2.4 Tables
Creating a table
python
table = doc.add_table(rows=4, cols=5)
table.style = "Grid Table 1 Light"  # apply a predefined style; it must exist in the document
Filling cells
python
# Option 1: address a cell directly by (row, column)
cell = table.cell(0, 1)
cell.text = "parrot, possibly dead"
# Option 2: get the cells through a row
row = table.rows[1]
cells = row.cells
cells[0].text = "Foo bar to you."
cells[1].text = "And a hearty foo bar to you too sir!"
Listing the available table styles
python
from docx.enum.style import WD_STYLE_TYPE
styles = doc.styles
for s in styles:
    if s.type == WD_STYLE_TYPE.TABLE:
        print(s.name)
Adding and deleting rows
python
# Append a row
row = table.add_row()
# Delete a row (python-docx has no built-in method, so remove the w:tr element directly)
def remove_row(table, row):
    tbl = table._tbl
    tr = row._tr
    tbl.remove(tr)
row = table.rows[len(table.rows) - 1]  # the last row
remove_row(table, row)
Filling a table in bulk
python
# Option 1: append one row per record
items = (
    (7, "1024", "Plush kittens"),
    (3, "2042", "Furbees"),
    (1, "1288", "French Poodle Collars, Deluxe"),
)
for item in items:
    cells = table.add_row().cells
    cells[0].text = str(item[0])
    cells[1].text = item[1]
    cells[2].text = item[2]
# Option 2: fill every existing cell
for row in table.rows:
    for cell in row.cells:
        cell.text = "数据单元"
Merging cells
python
table.cell(0, 0).merge(table.cell(1, 1))  # merge the rectangle spanning rows 0-1 and columns 0-1
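Formatting a table
python
from copy import deepcopy
from docx.enum.table import WD_ALIGN_VERTICAL
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Cm, Pt
# Let column widths adjust automatically
table.autofit = True
# Set a row height
table.rows[0].height = Cm(0.93)
# Change the font size of the table style (affects every table that uses this style)
table.style.font.size = Pt(15)
# Align cell content horizontally and vertically
cell = table.cell(0, 0)
cell.paragraphs[0].paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
cell.vertical_alignment = WD_ALIGN_VERTICAL.CENTER
# Copy a table: deep-copy only its XML element, then insert the copy after a new paragraph
table_copy = deepcopy(doc.tables[0]._tbl)
para1 = doc.add_paragraph()
para1._p.addnext(table_copy)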
2.5 Images
Inserting images
python
import base64
from io import BytesIO
from docx.shared import Inches
# Plain insertion from a file
doc.add_picture('图片1.png')
doc.add_picture('图片2.png', width=Inches(2.5), height=Inches(2))
# Insertion from a base64 string
with open("图片2base64.txt") as f:
    picture2_base64 = f.read()
img2_buf = base64.b64decode(picture2_base64)
doc.add_picture(BytesIO(img2_buf))
# Two images side by side: add both to the same run
run = doc.add_paragraph().add_run()
run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))
run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))
2.6 Page breaks
python
doc.add_page_break()
2.7 Managing styles
python
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor
# Modify an existing style
doc.styles["Normal"].font.size = Pt(14)
doc.styles['Normal'].font.name = 'Arial'
doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), '楷体')
# Create a custom paragraph style
UserStyle1 = doc.styles.add_style('UserStyle1', WD_STYLE_TYPE.PARAGRAPH)
UserStyle1.font.size = Pt(40)
UserStyle1.font.color.rgb = RGBColor(0xff, 0xde, 0x00)
UserStyle1.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
UserStyle1.font.name = '微软雅黑'
UserStyle1._element.rPr.rFonts.set(qn('w:eastAsia'), '微软雅黑')
# Use the custom style
doc.add_paragraph('自定义段落样式', style=UserStyle1)
3. Filling templates with docxtpl
docxtpl turns a Word document into a template so that data can be filled in automatically.
3.1 Creating a template
First create a Word template that contains placeholders. Each placeholder is wrapped in double curly braces, for example {{ title }}.
3.2 Filling the template
python
from docxtpl import DocxTemplate, InlineImage, RichText
tpl = DocxTemplate("docxexample.docx")
text = """Transformer 模型由编码器(Encoder)和解码器(Decoder)组成..."""
picture1 = InlineImage(tpl, image_descriptor="图片1.png")
# picture2 was not defined in the original snippet; 图片2.png from section 2.5 is assumed here
picture2 = InlineImage(tpl, image_descriptor="图片2.png")
# Prepare the data
paragraphs1 = [
    "步骤1:输入表示(Input Representation)",
    "步骤2:编码器处理(Encoder Processing)",
    "步骤3:解码器处理(Decoder Processing)",
]
paragraphs2 = [
    {"step": 1, "text": "输入向量(词嵌入+位置编码)进入编码器层。"},
    {"step": 2, "text": "自注意力子层。"},
    {"step": 3, "text": "前馈网络子层。"},
]
table = [
    {"character": "并行计算", "description": "编码器可并行处理整个序列(与RNN不同)"},
    {"character": "自注意力", "description": "每个词直接关联所有词,捕获长距离依赖"},
    {"character": "位置编码", "description": "为无顺序的注意力机制注入位置信息"},
]
alerts = [
    {
        "date": "2015-03-10",
        "desc": RichText("Very critical alert", color="FF0000", bold=True),
        "type": "CRITICAL",
        "bg": "FF0000",
    },
    # ... more entries
]
# Render the template: each key must match a placeholder name used in the template
context = {
    "title": "Transformer",
    "text_body": text,
    "picture1": picture1,
    "picture2": picture2,
    "paragraphs1": paragraphs1,
    "paragraphs2": paragraphs2,
    "runs": paragraphs1,
    "display_paragraph": True,
    "table1": table,
    "table2": table,
    "alerts": alerts,
}
tpl.render(context)
tpl.save("文件3.docx")
4. Advanced features
4.1 Advanced table operations
Setting cell borders
python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
def set_cell_border(cell, **kwargs):
    """Set cell borders; each keyword is an edge name mapped to a dict of attributes."""
    tc = cell._tc
    tcPr = tc.get_or_add_tcPr()
    # Reuse the existing w:tcBorders element, or create it
    tcBorders = tcPr.first_child_found_in("w:tcBorders")
    if tcBorders is None:
        tcBorders = OxmlElement("w:tcBorders")
        tcPr.append(tcBorders)
    for edge in ("left", "top", "right", "bottom", "insideH", "insideV"):
        edge_data = kwargs.get(edge)
        if edge_data:
            tag = "w:{}".format(edge)
            element = tcBorders.find(qn(tag))
            if element is None:
                element = OxmlElement(tag)
                tcBorders.append(element)
            for key in ["sz", "val", "color", "space", "shadow"]:
                if key in edge_data:
                    element.set(qn("w:{}".format(key)), str(edge_data[key]))
# Usage example: sz is in eighths of a point; color is hex RGB without a leading '#'
set_cell_border(
    table.cell(0, 0),
    top={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    bottom={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    left={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    right={"sz": 4, "val": "single", "color": "000000", "space": "0"},
)
4.2 Hyperlinks
python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
def add_hyperlink(paragraph, url, text):
    # Register the URL as an external relationship and get its relationship id
    part = paragraph.part
    r_id = part.relate_to(
        url,
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink",
        is_external=True,
    )
    # Build a w:hyperlink element that wraps a single run containing the text
    hyperlink = OxmlElement("w:hyperlink")
    hyperlink.set(qn("r:id"), r_id)
    run = OxmlElement("w:r")
    run_text = OxmlElement("w:t")
    run_text.text = text
    run.append(run_text)
    hyperlink.append(run)
    paragraph._p.append(hyperlink)
p = doc.add_paragraph("点击访问: ")
add_hyperlink(p, "https://www.baidu.com", "示例链接")
4.3 Advanced image operations
Extracting the images from a document
python
import os
import zipfile
from xml.etree.ElementTree import fromstring
from docx.opc.constants import RELATIONSHIP_TYPE as RT
def extract_images(docx_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)  # make sure the output folder exists
    with zipfile.ZipFile(docx_path) as z:
        # The relationships file maps relationship ids to the parts they point to
        try:
            doc_rels = z.read('word/_rels/document.xml.rels').decode('utf-8')
        except KeyError:
            return []
        root = fromstring(doc_rels)
        # Keep only the relationships whose type is "image"
        rels = []
        for child in root:
            if 'Type' in child.attrib and child.attrib['Type'] == RT.IMAGE:
                rels.append((child.attrib['Id'], child.attrib['Target']))
        images = []
        for rel_id, target in rels:
            try:
                image_data = z.read('word/' + target)
                image_name = target.split('/')[-1]
                with open(f"{output_dir}/{image_name}", 'wb') as f:
                    f.write(image_data)
                images.append(image_name)
            except KeyError:
                continue
        return images
print(extract_images("Transformer原理纯享版.docx", "pictures"))
Inserting a floating image
python
# Insert a floating image placed "behind text".
# Changing behindDoc="1" to "0" places it "in front of text" instead.
from docx import Document
from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne
# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<wp:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')
    @classmethod
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor
    @classmethod
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor
    @classmethod
    def _anchor_xml(cls, pos_x, pos_y):
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            ' behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            ' %s>\n'
            ' <wp:simplePos x="0" y="0"/>\n'
            ' <wp:positionH relativeFrom="page">\n'
            ' <wp:posOffset>%d</wp:posOffset>\n'
            ' </wp:positionH>\n'
            ' <wp:positionV relativeFrom="page">\n'
            ' <wp:posOffset>%d</wp:posOffset>\n'
            ' </wp:positionV>\n'
            ' <wp:extent cx="914400" cy="914400"/>\n'
            ' <wp:wrapNone/>\n'
            ' <wp:docPr id="666" name="unnamed"/>\n'
            ' <wp:cNvGraphicFramePr>\n'
            ' <a:graphicFrameLocks noChangeAspect="1"/>\n'
            ' </wp:cNvGraphicFramePr>\n'
            ' <a:graphic>\n'
            ' <a:graphicData uri="URI not set"/>\n'
            ' </a:graphic>\n'
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )
        )
# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `wp:anchor` element.
    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)
# refer to docx.text.run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add a floating picture at the fixed position `pos_x`, `pos_y`, measured from the top-left corner of the page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)
# refer to docx.oxml.__init__.py
register_element_cls('wp:anchor', CT_Anchor)
document = Document()
# add a floating picture
p = document.add_paragraph()
add_float_picture(p, '图片1.png')
# add text
p.add_run('Hello World '*50)
document.save('文件2.docx')
# Source: https://www.cnblogs.com/dancesir/p/17788854.html
4.4 Columns
python
from docx.oxml.ns import qn
# Lay out the first section in 2 columns
section = doc.sections[0]
sectPr = section._sectPr
cols = sectPr.xpath('./w:cols')[0]  # assumes the section already has a w:cols element
cols.set(qn('w:num'), '2')
4.5 Headers and footers
python
# Plain header
doc = Document('Transformer原理纯享版.docx')
doc.sections[0].header.paragraphs[0].text = "这是第1节页眉"
# Different headers for odd and even pages
doc.settings.odd_and_even_pages_header_footer = True
doc.sections[0].even_page_header.paragraphs[0].text = "这是偶数页页眉"
doc.sections[0].header.paragraphs[0].text = "这是奇数页页眉"
# Different header for the first page
doc.sections[0].different_first_page_header_footer = True
doc.sections[0].first_page_header.paragraphs[0].text = "这是首页页眉"
4.6 Table of contents
python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from lxml import etree
# Insert a TOC field at the top of the document (the field is not updated by python-docx)
paragraph = doc.paragraphs[0].insert_paragraph_before()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar')
fldChar.set(qn('w:fldCharType'), 'begin')
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve')
instrText.text = r'TOC \o "1-3" \h \z \u'  # heading levels 1-3, entries as hyperlinks
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
placeholder = OxmlElement('w:t')  # text shown until the field is updated
placeholder.text = "Right-click to update field."
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
# A field is: begin, instruction, separate, current result, end
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(placeholder)
r_element.append(fldChar4)
# Update the TOC automatically: ask Word to refresh fields when the document is opened
name_space = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
update_name_space = "%supdateFields" % name_space
val_name_space = "%sval" % name_space
element_update_field_obj = etree.SubElement(doc.settings.element, update_name_space)
element_update_field_obj.set(val_name_space, "true")
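4.7 Merging documents
python
from docx import Document
from docxcompose.composer import Composer
master = Document("文件1.docx")  # the first document acts as the master
composer = Composer(master)
doc1 = Document("文件2.docx")
composer.append(doc1)
doc2 = Document("文件3.docx")
composer.append(doc2)
composer.save("combined.docx")
Note: when documents are merged, the appended documents follow the formatting of the first document.
Summary
This article covered the full workflow for handling Word documents in Python, including:
- Basic document creation, editing, and formatting with python-docx
- Automated, template-based data filling with docxtpl
- Merging multiple Word documents with docxcompose
- Advanced features such as cell borders, hyperlinks, image extraction, and headers and footers
These techniques apply to automated report generation, batch document processing, contract template filling, and similar scenarios, and can substantially improve efficiency.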
