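In testing, pdfplumber gave somewhat better extraction results. Also, with both of these libraries, PDFs longer than about 20 pages could not be extracted at all; the result came back empty.
1. pdfplumber
Installation: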
pip install pdfplumber -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
Code:
import pdfplumber

# Open the PDF and print the extracted text of each page in turn.
with pdfplumber.open(r"C:\Users\loong\Downloads\数字人研究报告.pdf") as pdf:
    num_pages = len(pdf.pages)
    print(num_pages)
    for page_num in range(num_pages):
        page = pdf.pages[page_num]
        text = page.extract_text()
        print(text)
Original content:
Extraction result:
2. PyPDF2
Installation:
pip install PyPDF2
Code:
import PyPDF2
from tqdm import tqdm  # progress bar; install separately with: pip install tqdm

pdftext = ""
with open(r"C:\Users\loong\Desktop\杰创\大模型\杰创智能.pdf", "rb") as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    # Append the extracted text of every page to one string.
    for page in tqdm(pdfReader.pages):
        pdftext += page.extract_text()
print(pdftext)
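
When the result comes back empty, it can help to see which pages actually returned no text. The following is a minimal sketch, not part of either library: the helper names collect_page_text and report_empty_pages are made up for illustration. It assumes only that each page object has an extract_text() method, which is true of both pdf.pages in pdfplumber and pdfReader.pages in PyPDF2. It does not explain why long PDFs come back empty; it only narrows down where the text goes missing.

```python
from typing import Iterable, List, Tuple


def collect_page_text(pages: Iterable) -> Tuple[str, List[int]]:
    """Join the text of all pages and record which pages yielded nothing.

    `pages` is any iterable of objects with an extract_text() method,
    for example pdf.pages (pdfplumber) or pdfReader.pages (PyPDF2).
    Hypothetical helper, written for illustration only.
    """
    parts = []
    empty_pages = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # treat None as empty text
        if not text.strip():
            empty_pages.append(number)  # 1-based page number
        parts.append(text)
    return "\n".join(parts), empty_pages


def report_empty_pages(path: str) -> None:
    """Print a short summary for one PDF, using pdfplumber to open it."""
    import pdfplumber  # imported here so collect_page_text has no dependency

    with pdfplumber.open(path) as pdf:
        text, empty_pages = collect_page_text(pdf.pages)
    print(f"{len(text)} characters extracted")
    print("pages with no text:", empty_pages)
```

Calling report_empty_pages on one of the PDFs above prints the total character count and the list of page numbers that produced no text. If every page is listed, one possibility worth checking is that the PDF contains scanned images without a text layer, in which case neither library has any text to extract.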