Reading pdf, docx, doc, ppt, and pptx files and converting them to txt

Table of contents

1. Approach
2. Implementation
3. Known issues
- 3.1 Problems parsing .doc files and how to solve them
- 3.2 Problems parsing .ppt files and how to solve them
4. Reading images from a pdf

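1. Approach

- Put the zip file in the same folder as the initialization script.
- Extract the zip file into a new folder.
- Walk that new folder, descending into every subfolder encountered.
- Read every pdf, docx, doc, ppt, and pptx file, convert each to text, and merge everything into one new txt file.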
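
The steps above can be sketched roughly as follows. This is a minimal sketch, not the post's actual driver code: the names `collect_text`, `merge_to_txt`, and the `parsers` mapping (file extension to a function that returns a list of strings) are assumptions made for illustration.

```python
import os
import zipfile


def collect_text(root_dir, parsers):
    """Walk root_dir recursively and parse every file that has a parser.

    `parsers` is a hypothetical mapping such as {".pdf": parse_pdf}, where
    each value takes a file path and returns a list of strings.
    """
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subfolders in a stable order
        for name in sorted(filenames):
            ext = os.path.splitext(name)[1].lower()
            parser = parsers.get(ext)
            if parser is None:
                continue  # skip file types we cannot convert
            chunks.extend(parser(os.path.join(dirpath, name)))
    return chunks


def merge_to_txt(zip_path, work_dir, out_path, parsers):
    """Extract the zip into work_dir, parse its files, write one merged txt."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(work_dir)
    chunks = collect_text(work_dir, parsers)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(chunks))
    return out_path
```

Keeping the extension-to-parser mapping outside the walker means new formats can be added without touching the traversal code.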

2. Implementation

Install the dependencies:

pip install python-pptx
pip install pdfplumber
pip install python-docx
apt-get install antiword
apt-get install libreoffice

Reading pptx
Returns the text as a list of strings, e.g. ["first paragraph", "second paragraph"].

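import os
from pptx import Presentation

def _parse_pptx(self):
    # Return early instead of letting Presentation() fail on a missing file.
    if not os.path.isfile(self.file_path):
        print(f"[WARNING] can't find {self.file_path}")
        return []
    txt_list = []
    prs = Presentation(self.file_path)
    for slide in prs.slides:
        for shape in slide.shapes:
            # Shapes such as pictures carry no text frame.
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                # Join the runs so each list item is one whole paragraph.
                text = "".join(run.text for run in paragraph.runs)
                if text:
                    txt_list.append(text)
    return txt_list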

Reading docx
Returns the text as a list of strings, e.g. ["first paragraph", "second paragraph"].

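import os
from docx import Document

def _parse_docx(self):
    # Return early instead of letting Document() fail on a missing file.
    if not os.path.isfile(self.file_path):
        print(f"[WARNING] can't find {self.file_path}")
        return []
    txt_list = []
    doc = Document(self.file_path)
    # doc.paragraphs covers body paragraphs only; text inside tables is not included.
    for paragraph in doc.paragraphs:
        txt_list.append(paragraph.text + '\n')
    return txt_list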
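
The known issues further down mention that struck-through text in a docx is extracted too. One possible way around it, sketched here under assumptions, is to rebuild each paragraph from its runs and skip runs whose font is marked as strikethrough (python-docx exposes this as `run.font.strike`). The function names below are hypothetical, and runs that inherit strikethrough from a style (where `strike` is `None`) are kept.

```python
def paragraph_text_without_strike(paragraph):
    """Join the text of all runs that are not explicitly struck through."""
    return "".join(run.text for run in paragraph.runs if not run.font.strike)


def parse_docx_skip_strike(file_path):
    """Return one string per body paragraph, omitting struck-through runs."""
    from docx import Document  # python-docx, imported lazily

    doc = Document(file_path)
    return [paragraph_text_without_strike(p) for p in doc.paragraphs]
```

Whether dropping such text is correct depends on the documents: sometimes a strikethrough marks content that is still meaningful to a reader.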

Reading pdf
Returns the text as a list of strings, one entry per page, e.g. ["text of page 1", "text of page 2"].

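import os
import pdfplumber

def _parse_pdf(self):
    # Return early instead of letting pdfplumber fail on a missing file.
    if not os.path.isfile(self.file_path):
        print(f"[WARNING] can't find {self.file_path}")
        return []
    txt_list = []
    with pdfplumber.open(self.file_path) as pdf:
        for page in pdf.pages:
            # extract_text() may yield nothing for image-only pages; store "" in that case.
            txt_list.append(page.extract_text() or "")
    return txt_list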

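3. Known issues

- How to handle .doc and .ppt files.
- Struck-through text in docx files is extracted as well; source documents should avoid strikethrough and simply delete the text instead.
- Chinese file names come out garbled when the zip is extracted.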
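
For the garbled file names, a common cause is that the archive stores names in a legacy encoding such as GBK without setting the zip UTF-8 flag, in which case Python's `zipfile` decodes them as cp437. The sketch below re-decodes such names before extracting. It assumes the archive was created on a Chinese-locale system (hence the `gbk` default); `fix_zip_name` and `extract_zip_fixing_names` are hypothetical helper names.

```python
import os
import shutil
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def fix_zip_name(name, flag_bits, legacy_encoding="gbk"):
    """Undo the cp437 decoding zipfile applies to names without the UTF-8 flag."""
    if flag_bits & UTF8_FLAG:
        return name
    try:
        return name.encode("cp437").decode(legacy_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # not recoverable with this encoding; keep the name as read


def extract_zip_fixing_names(zip_path, dest_dir, legacy_encoding="gbk"):
    """Extract zip_path into dest_dir using the repaired file names."""
    root = os.path.realpath(dest_dir)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            name = fix_zip_name(info.filename, info.flag_bits, legacy_encoding)
            target = os.path.realpath(os.path.join(root, name))
            # Refuse entries that would escape dest_dir (e.g. "../x").
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f"unsafe path in archive: {name}")
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

If the archives come from another locale, pass a different `legacy_encoding`; names that cannot be re-decoded are left exactly as `zipfile` read them.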

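3.1 Problems parsing .doc files and how to solve them

The first attempt was to install pypiwin32, which provides Windows-only bindings and does not install on Linux:

pip install pypiwin32

The install fails with: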

INFO: pip is looking at multiple versions of pypiwin32 to determine which version is compatible with other requirements. This could take a while.
  Using cached pypiwin32-219.zip (4.8 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [7 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-5ufsbmju/pypiwin32_d42420d9531d46289d9e8ad3dd35073f/setup.py", line 121
          print "Building pywin32", pywin32_version
                ^
      SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Building pywin32", pywin32_version)?
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

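Solution

The method below works, but when Python parses .doc files this way, stray line breaks show up in the middle of the text. Where possible, supply the input in the current docx format instead.

On Ubuntu, .doc files can be read with antiword, a tool that converts Microsoft Word documents to text. Install antiword and call it from Python through the subprocess module.

Here is an example.

First, make sure antiword is installed: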

sudo apt-get install antiword

Then use the following Python code to read a .doc file and convert it to text:

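import subprocess

def doc_to_text(doc_file):
    # Run antiword and capture everything it writes to stdout.
    process = subprocess.Popen(['antiword', doc_file], stdout=subprocess.PIPE)
    output, _ = process.communicate()
    # A non-zero exit status means antiword could not convert the file.
    if process.returncode != 0:
        raise RuntimeError(f"antiword failed on {doc_file}")
    # Replace undecodable bytes instead of raising on a stray character.
    return output.decode('utf-8', errors='replace')

# Example usage
doc_file = 'example.doc'  # replace with the name of your .doc file
text = doc_to_text(doc_file)

# Save the text to a txt file
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(text)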

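The code above defines a function named doc_to_text that takes the path of a .doc file as its argument. The function calls antiword through subprocess.Popen, captures its output, and decodes it into a UTF-8 string.

In the example usage, replace 'example.doc' with the name of your .doc file. Running the code converts the content of the .doc file to text and saves it to a text file named output.txt.

Note that antiword may not be able to handle .doc files that contain complex formatting, tables, and the like. For those cases, a more capable tool or library may be needed.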
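
The stray line breaks mentioned in this section can be reduced with a post-processing step. The sketch below assumes that paragraphs in the antiword output are separated by blank lines and that any single line break inside a paragraph is only a wrap. `unwrap_paragraphs` is a hypothetical helper, and the rule "add a space only between two ASCII characters" is a heuristic so that Chinese text is joined without spaces.

```python
import re


def unwrap_paragraphs(text):
    """Rejoin wrapped lines; return one string per blank-line-separated paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for para in paragraphs:
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            # Latin words need a space at the join; CJK characters do not.
            sep = " " if joined[-1].isascii() and line[0].isascii() else ""
            joined += sep + line
        result.append(joined)
    return result
```

For example, `unwrap_paragraphs(doc_to_text('example.doc'))` would return a list in the same shape as the other parsers. antiword also has a `-w` line-width option that may avoid the wrapping at the source; check `man antiword` before relying on it.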

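3.2 Problems parsing .ppt files and how to solve them

Parsing .ppt files directly turned out to be difficult, so the ppt is first converted to pdf, and the pdf is then converted to txt.

Environment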

sudo apt install libreoffice

Code

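import os
import subprocess

# Methods of the parser class (the class definition itself is not shown).
@staticmethod
def convert_ppt_to_pdf(ppt_file, output_folder="/home/gykj/thomascai/data/Archive1024/new_Archive.zip"):
    output_file = os.path.join(output_folder, os.path.splitext(os.path.basename(ppt_file))[0] + ".pdf")
    # Pass the arguments as a list (no shell) so paths containing spaces are handled correctly.
    cmd = ["soffice", "--headless", "--convert-to", "pdf", "--outdir", output_folder, ppt_file]
    # check=True raises CalledProcessError if soffice exits with a non-zero status.
    subprocess.run(cmd, check=True)
    print(f"Successfully converted {ppt_file} to {output_file}")
    return output_file

def _parse_ppt(self):
    # Convert to pdf, point the parser at the new file, then reuse the pdf parser.
    output_file = self.convert_ppt_to_pdf(self.file_path)
    self._set_file_path(output_file)
    # Return the parsed text so callers get a list, like the other parsers.
    return self._parse_pdf()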
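
The methods in this post all take `self` and read `self.file_path`, but the enclosing class is never shown. A minimal skeleton might look like the one below. Everything except `file_path`, `_set_file_path`, and the `_parse_*` naming pattern is an assumption: the class name `FileParser` and the `parse` dispatcher are hypothetical.

```python
import os


class FileParser:
    """Hypothetical wrapper; the post's _parse_* methods would be defined on it."""

    def __init__(self, file_path):
        self.file_path = file_path

    def _set_file_path(self, file_path):
        # Used by _parse_ppt to switch to the converted pdf.
        self.file_path = file_path

    def parse(self):
        """Dispatch to _parse_<extension>, e.g. _parse_pdf for 'a.pdf'."""
        ext = os.path.splitext(self.file_path)[1].lower().lstrip(".")
        handler = getattr(self, "_parse_" + ext, None)
        if handler is None:
            print(f"[WARNING] unsupported file type: {self.file_path}")
            return []
        # Treat a parser that returns None as "no text".
        return handler() or []
```

With this shape, `FileParser(path).parse()` returns a list of strings for any supported extension and an empty list otherwise.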

4. Reading images from a pdf

pip install PyMuPDF

Because content flows across page boundaries, the pdf is first converted to markdown.
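
The post stops before showing any image code. As a rough illustration only, and not the author's method, embedded images can be pulled out with PyMuPDF's `page.get_images` and `doc.extract_image`. The function names and the output file naming scheme below are assumptions.

```python
import os


def image_filename(page_number, image_number, ext):
    """Build a sortable file name such as 'page001_img02.png'."""
    return f"page{page_number:03d}_img{image_number:02d}.{ext}"


def extract_pdf_images(pdf_path, out_dir):
    """Save every embedded image of the pdf into out_dir; return the saved paths."""
    import fitz  # PyMuPDF, imported lazily

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    doc = fitz.open(pdf_path)
    try:
        for page_number, page in enumerate(doc, start=1):
            images = page.get_images(full=True)
            for image_number, img in enumerate(images, start=1):
                xref = img[0]  # first tuple item is the image's cross-reference number
                info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                name = image_filename(page_number, image_number, info["ext"])
                target = os.path.join(out_dir, name)
                with open(target, "wb") as f:
                    f.write(info["image"])
                saved.append(target)
    finally:
        doc.close()
    return saved
```

An image reused on several pages is saved once per page here; deduplicating by `xref` would avoid that.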

(To be continued.)