Installation: pip install pdfplumber -i https://pypi.tuna.tsinghua.edu.cn/simple/
Extracting text content
python
from pdfplumber import open as op


def read_pdf(pdf_path):
    with op(pdf_path) as pdf:
        for page in pdf.pages:
            # All text on the page, including text inside tables
            text = page.extract_text()
            print(text)
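Pages without a text layer (scanned pages, for example) may yield no text at all. The following is a minimal sketch of collecting text across pages while tolerating that case. `collect_text` and `FakePage` are hypothetical names used only for illustration, and the sketch assumes nothing more than that each page object has an `extract_text()` method that may return `None` or an empty string.

```python
def collect_text(pages):
    """Join the text of every page, treating missing text as empty.

    `pages` is any iterable of objects with an extract_text() method,
    such as `pdf.pages` (hypothetical helper, not part of pdfplumber).
    """
    parts = []
    for page in pages:
        # extract_text() may give None or "" for a page without a text layer
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


class FakePage:
    """Stand-in for a page object so the sketch runs without a PDF file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


combined = collect_text([FakePage("first page"), FakePage(None), FakePage("third page")])
print(repr(combined))
```

With a real file, the same helper could be called as `collect_text(pdf.pages)` inside the `with` block.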
Extracting table content
The extracted table data can be written to an Excel file or processed in other ways.
python
# `page` is a page object from pdf.pages, as in read_pdf above
for table in page.extract_tables():
    for row in table:
        print(row)
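As one way to hand the rows to a spreadsheet, the sketch below serializes a table to CSV text, which Excel can open. It uses only the standard library rather than an Excel writer. `table_to_csv` and the sample data are hypothetical, and the sketch assumes each table is a list of rows whose cells are strings or `None`.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Missing cells are assumed to arrive as None; write them as empty strings
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Made-up sample in the list-of-rows shape assumed above
sample_table = [["name", "qty"], ["apple", "3"], ["pear", None]]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

Writing `csv_text` to a file with a `.csv` extension would be enough for a quick import. A dedicated Excel library is an alternative when formatting matters.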
Extracting image information
python
for img in page.images:
    print(img)
Text box data
python
for wds in page.extract_words():  # position and content of each text box
    print(wds)
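Because each entry carries coordinates as well as text, the words can be regrouped by position. The sketch below groups words into lines by their vertical position. `group_words_into_lines`, its `tolerance` parameter, and the sample data are hypothetical. It assumes each word is a dict with at least `text`, `x0`, and `top` keys.

```python
def group_words_into_lines(words, tolerance=3):
    """Group word dicts into text lines by their `top` coordinate."""
    lines = []  # each entry: [reference_top, [word, ...]]
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= tolerance:
            # Close enough vertically: same line as the previous word
            lines[-1][1].append(word)
        else:
            lines.append([word["top"], [word]])
    # Order words left to right within each line and join their text
    return [
        " ".join(w["text"] for w in sorted(ws, key=lambda w: w["x0"]))
        for _, ws in lines
    ]


# Made-up sample in the shape assumed above
sample_words = [
    {"text": "world", "x0": 50.0, "x1": 80.0, "top": 10.5, "bottom": 20.5},
    {"text": "Hello", "x0": 10.0, "x1": 40.0, "top": 10.0, "bottom": 20.0},
    {"text": "Second", "x0": 10.0, "x1": 45.0, "top": 30.0, "bottom": 40.0},
]
lines = group_words_into_lines(sample_words)
print(lines)
```

The tolerance is in the same units as the coordinates, so it may need adjusting for documents with very small or very large type.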
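Converting a page to an image
The resolution parameter sets the rendering resolution in pixels per inch (DPI); the higher the value, the sharper the image.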
python
import os  # splitext keeps dots in directory names intact, unlike split(".")
page.to_image(resolution=500).save(f'{os.path.splitext(pdf_path)[0]}-{page.page_number}.png')
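To get a feel for how the resolution value affects output size, the sketch below estimates the pixel dimensions of a rendered page. It assumes the resolution is interpreted as pixels per inch and that page dimensions are in PDF points (72 per inch). `rendered_size` is a hypothetical helper, and the actual renderer may round differently.

```python
def rendered_size(width_pt, height_pt, resolution):
    """Approximate pixel size of a page rendered at `resolution` pixels per inch."""
    # PDF page dimensions are measured in points, 72 points per inch
    width_px = round(width_pt * resolution / 72)
    height_px = round(height_pt * resolution / 72)
    return width_px, height_px


# US Letter page: 612 x 792 points
size_at_72 = rendered_size(612, 792, 72)
size_at_500 = rendered_size(612, 792, 500)
print(size_at_72)
print(size_at_500)
```

At resolution=500, a Letter-sized page comes out at roughly 4250 x 5500 pixels, so memory use and file size grow quickly as the value rises.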
Converting to CSV, JSON, or dictionary data
python
print(page.to_csv())  # convert to CSV data
print(page.to_json(indent=4))  # convert to JSON data
print(page.to_dict())  # convert to dictionary data