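Converting PDF to TXT

Converting a PDF document to a TXT document

First, install PyPDF2 in the Python 3 virtual environment. Listing sys.path shows the environment's site-packages directory, which is then passed to pip3 through the --target option.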

    Python 3.6.8 (default, Jun 20 2023, 11:53:23)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.path
    ['', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '/home/clusteruser/env3/lib64/python3.6/site-packages', '/home/clusteruser/env3/lib64/python3.6/site-packages/setuptools-58.0.4-py3.6.egg', '/home/clusteruser/env3/lib64/python3.6/site-packages/selenium-3.141.0-py3.6.egg', '/home/clusteruser/env3/lib64/python3.6/site-packages/urllib3-1.26.6-py3.6.egg', '/home/clusteruser/env3/lib/python3.6/site-packages', '/home/clusteruser/env3/lib/python3.6/site-packages/setuptools-58.0.4-py3.6.egg', '/home/clusteruser/env3/lib/python3.6/site-packages/selenium-3.141.0-py3.6.egg', '/home/clusteruser/env3/lib/python3.6/site-packages/urllib3-1.26.6-py3.6.egg']
    >>> quit();

    (env3) [clusteruser@node0xc7 pdf-txt]$ pip3 install --target='/home/clusteruser/env3/lib64/python3.6/site-packages' PyPDF2
    Collecting PyPDF2
      Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
         |████████████████████████████████| 232 kB 407 kB/s
    Collecting typing_extensions>=3.10.0.0
      Downloading typing_extensions-4.1.1-py3-none-any.whl (26 kB)
    Collecting dataclasses
      Downloading dataclasses-0.8-py3-none-any.whl (19 kB)
    Installing collected packages: typing-extensions, dataclasses, PyPDF2
    Successfully installed PyPDF2-3.0.1 dataclasses-0.8 typing-extensions-4.1.1
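
Because the package was installed with --target, it can be worth confirming that the interpreter actually finds it on sys.path. The following is a minimal sketch using only the standard library; the helper name is_importable is made up for illustration and is not part of PyPDF2 or pip.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "PyPDF2" is the import name of the package installed above.
    print("PyPDF2 importable:", is_importable("PyPDF2"))
```

If this reports False inside the virtual environment, the directory given to --target is probably not one of the entries listed by sys.path.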

***************************************************************************************

The complete code

    (env3) [clusteruser@node0xc7 pdf-txt]$ cat pdf-text.py
    import PyPDF2


    def pdf_to_text(pdf_path, txt_path):
        # Read every page of the PDF and collect the extracted text.
        with open(pdf_path, 'rb') as pdf_file:
            reader = PyPDF2.PdfReader(pdf_file)
            text = ''
            for page in reader.pages:
                text += page.extract_text()
        # Write the collected text to the output file as UTF-8.
        with open(txt_path, 'w', encoding='utf-8') as txt_file:
            txt_file.write(text)


    # Call the function to perform the conversion
    pdf_to_text('input.pdf', 'output.txt')
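
The script above concatenates the pages with nothing between them, so the last line of one page runs straight into the first line of the next. A possible variation is sketched below. It assumes that a page without a text layer (a scanned image, for example) may yield an empty result, and treats such pages as blank. The names join_page_texts and pdf_to_text_by_page are hypothetical and not part of the original script or of PyPDF2.

```python
def join_page_texts(page_texts, separator="\n"):
    """Join per-page strings, treating None or empty entries as blank pages."""
    return separator.join(text or "" for text in page_texts)


def pdf_to_text_by_page(pdf_path, txt_path):
    """Hypothetical variant of pdf_to_text that separates pages with a newline."""
    # Imported here so join_page_texts stays usable without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        page_texts = [page.extract_text() for page in reader.pages]
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(join_page_texts(page_texts))
```

Calling pdf_to_text_by_page('input.pdf', 'output.txt') would then take the place of the pdf_to_text call. Text that exists only as an image is not recovered either way, since that would need OCR.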

Run the code

python3 pdf-text.py
