python-docx error: KeyError: "There is no item named 'NULL' in the archive"

    • Cause
    • Solutions
      • Manually delete the invalid references in the docx file
      • Delete the `Relationship` with `Target="../NULL"` and repackage as docx
      • Ignore `Relationship` elements with `Target="../NULL"` when parsing the docx
    • References

The code that triggers the error:
import docx

doc = docx.Document('DE862197-1772524393022.docx')

Parsing the .docx file with python-docx raises the following error:

Traceback (most recent call last):
  File "D:\project\python\p1\t1.py", line 3, in <module>
    doc = docx.Document('DE862197-1772524393022.docx')
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\project\python\p1\.venv\Lib\site-packages\docx\api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
                                         ^^^^^^^^^^^^^^^^^^
  File "D:\project\python\p1\.venv\Lib\site-packages\docx\opc\package.py", line 126, in open
    pkg_reader = PackageReader.from_file(pkg_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\project\python\p1\.venv\Lib\site-packages\docx\opc\pkgreader.py", line 25, in from_file
    sparts = PackageReader._load_serialized_parts(phys_reader, pkg_srels, content_types)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\project\python\p1\.venv\Lib\site-packages\docx\opc\pkgreader.py", line 51, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
                                          ^^^^^^^^^^^
  File "D:\project\python\p1\.venv\Lib\site-packages\docx\opc\pkgreader.py", line 82, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
                                          ^^^^^^^^^^^
  File "D:\project\python\p1\.venv\Lib\site-packages\docx\opc\pkgreader.py", line 79, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\project\python\p1\.venv\Lib\site-packages\docx\opc\phys_pkg.py", line 83, in blob_for
    return self._zipf.read(pack_uri.membername)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\uv\python\cpython-3.12.12-windows-x86_64-none\Lib\zipfile\__init__.py", line 1584, in read
    with self.open(name, "r", pwd) as fp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\uv\python\cpython-3.12.12-windows-x86_64-none\Lib\zipfile\__init__.py", line 1621, in open
    zinfo = self.getinfo(name)
            ^^^^^^^^^^^^^^^^^^
  File "D:\software\uv\python\cpython-3.12.12-windows-x86_64-none\Lib\zipfile\__init__.py", line 1549, in getinfo
    raise KeyError(
KeyError: "There is no item named 'NULL' in the archive"

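Cause

After unzipping the docx file, I found that in DE862197-1772524393022.docx!\word\_rels\document.xml.rels the `Relationship` with Id="rId4" has Target="../NULL". python-docx follows every relationship target to load the corresponding part from the zip archive. As far as I can tell, "../NULL" resolved against the word/ folder becomes the member name "NULL", which does not exist in the archive, so zipfile raises the KeyError shown at the bottom of the traceback.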
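To confirm this without unzipping by hand, the relationship files can be scanned directly. The following is a minimal sketch using only the standard library. The function name `find_null_relationships` is my own and is not part of python-docx. It assumes the broken targets are exactly "NULL" or "../NULL", the same two values handled later in this post.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by every *.rels part in an OPC package (docx, xlsx, pptx).
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
NULL_TARGETS = ("NULL", "../NULL")  # assumption: the only broken forms we look for


def find_null_relationships(docx_path):
    """Return (rels member, Id, Target) for every relationship pointing at NULL.

    Hypothetical helper for diagnosis only; it does not modify the file.
    """
    hits = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(RELS_NS + "Relationship"):
                target = rel.get("Target", "")
                if target in NULL_TARGETS:
                    hits.append((name, rel.get("Id"), target))
    return hits


# Usage (file name taken from the example in this post):
#     for member, rel_id, target in find_null_relationships("a.docx"):
#         print(member, rel_id, target)
```

An empty result means no relationship in the package uses one of these two targets, in which case the KeyError probably has a different cause.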

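Solutions

Manually delete the invalid references in the docx file

  1. Open the file a.docx.

  2. Press Ctrl + F to open the navigation pane on the left.

  3. Click the magnifying glass to the right of the search box and choose Graphics.

  4. Use the up and down arrows (to the right of the red box in the screenshot) to step through the images one by one.

  5. Delete or replace any image that looks like the broken one shown in the screenshot.

  6. As the two screenshots show, some images are so small that they only display a red X (screenshot 1), and some display nothing at all (screenshot 2). Enlarge these images manually to check them.

  7. You must use the graphics search to check every image one by one. Otherwise it is easy to miss small images or images whose size is 0.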

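Delete the `Relationship` with Target="../NULL" and repackage as docx

The Microsoft article "The image part with relationship rID8 was not found in Microsoft Word" describes the general procedure, but it did not work well when I tried it. Here is what I did instead:

  1. Rename a.docx to a.zip.
  2. Extract a.zip into a folder named a.
  3. Open a/word/_rels/document.xml.rels in Notepad, search for "null", and delete <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../NULL"/>
  4. Compress all the files inside the a/ folder into a-new.zip (do not include the a folder itself).
  5. Rename a-new.zip to a-new.docx.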
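The manual steps above can also be scripted. The sketch below copies every member of the archive into a new file and drops the NULL relationships from each .rels part along the way. The names `strip_null_rels` and `repack_without_null_rels` are hypothetical helpers of mine, not part of python-docx. It assumes the broken targets are exactly "NULL" or "../NULL", and it re-serializes the .rels XML, so the declaration and attribute quoting may differ slightly from the original.

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
NULL_TARGETS = {"NULL", "../NULL"}  # assumption: the only broken forms we remove


def strip_null_rels(rels_xml: bytes) -> bytes:
    """Return the .rels XML with Relationship elements targeting NULL removed."""
    # Keep the relationships namespace as the default (unprefixed) namespace.
    ET.register_namespace("", RELS_NS)
    root = ET.fromstring(rels_xml)
    for rel in list(root):  # copy the child list because we mutate root
        if rel.get("Target") in NULL_TARGETS:
            root.remove(rel)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def repack_without_null_rels(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path to dst_path with NULL relationships removed."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(".rels"):
                data = strip_null_rels(data)
            # Member names are preserved, so no extra top-level folder is added.
            dst.writestr(info.filename, data)


# Usage (file names taken from the steps above):
#     repack_without_null_rels("a.docx", "a-new.docx")
```

Like the manual procedure, this only removes the relationship entries. The document body may still refer to the removed relationship Id, so it is worth opening the result in Word afterwards to check the affected images.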

Ignore `Relationship` elements with Target="../NULL" when parsing the docx

This solution comes from a comment on Open Word docx file with "The image part with relationship rID8 was not found" error, it always fails · Issue #1105 · python-openxml/python-docx. It replaces the implementation of the _SerializedRelationships.load_from_xml method:

import docx

from docx.opc.pkgreader import _SerializedRelationships, _SerializedRelationship
from docx.opc.oxml import parse_xml


def load_from_xml_v2(baseURI, rels_item_xml):
    """
    Return |_SerializedRelationships| instance loaded with the
    relationships contained in *rels_item_xml*. Returns an empty
    collection if *rels_item_xml* is |None|.
    """
    srels = _SerializedRelationships()
    if rels_item_xml is not None:
        rels_elm = parse_xml(rels_item_xml)
        for rel_elm in rels_elm.Relationship_lst:
            if rel_elm.target_ref in ('../NULL', 'NULL'):
                continue
            srels._srels.append(_SerializedRelationship(baseURI, rel_elm))
    return srels


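# Monkey-patch python-docx so that NULL relationships are skipped when a package is loaded.
# Wrapping in staticmethod keeps the function from being bound if it is ever called on an instance.
_SerializedRelationships.load_from_xml = staticmethod(load_from_xml_v2)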

doc = docx.Document('a.docx')

Compared with the original implementation, the new v2 implementation adds a single check that skips any `Relationship` whose Target is "../NULL" or "NULL".

References

  1. KeyError: "There is no item named 'word/NULL' in the archive" · Issue #797 · python-openxml/python-docx
  2. fix: accommodate NULL relationship (by skipping) · Issue #678 · python-openxml/python-docx
  3. KeyError: "There is no item named 'word/NULL' in the archive" | YG 的零碎笔记
  4. The image part with relationship rID8 was not found in Microsoft Word - Microsoft 365 Apps | Microsoft Learn