Using the pdfplumber library to process PDF files and extract text, images, author, and other information

To use the `pdfplumber` library to extract content from PDF files in Python, follow these steps with example code:

1. Install pdfplumber

First, install the library using pip:

bash
pip install pdfplumber

2. Basic Text Extraction

Extract all text from a PDF file:

python
import pdfplumber

# Replace with your PDF file path (use raw string or double backslashes on Windows)
pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

# Extract all text from PDF
with pdfplumber.open(pdf_path) as pdf:
    all_text = ''
    for page in pdf.pages:
        # extract_text() may return None for a page without text in some versions
        all_text += (page.extract_text() or '') + '\n'

print("Extracted Text:")
print(all_text)

3. Extract Text from Specific Pages

Extract text from page 2 (pdf.pages is 0-indexed, so page 2 is index 1):

python
import pdfplumber

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    # Extract text from page 2 (index 1)
    page = pdf.pages[1]
    page_text = page.extract_text()
    
    print("Text from Page 2:")
    print(page_text)
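
Indexing pdf.pages with a page number beyond the end of the document raises an IndexError. A minimal sketch of a guard that converts a 1-based page number into an index; `page_index` is a hypothetical helper name used here for illustration, not part of pdfplumber:

```python
def page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number to a 0-based index, validating the range.

    Hypothetical helper for illustration; not a pdfplumber API.
    """
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page {page_number} is out of range (document has {page_count} pages)"
        )
    return page_number - 1


# Usage with pdfplumber, assuming `pdf` was opened with pdfplumber.open():
#     page = pdf.pages[page_index(2, len(pdf.pages))]
```

Raising a ValueError with the page count in the message tends to be easier to debug than a bare IndexError, and it also rejects 0 and negative numbers, which plain list indexing would silently accept.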

4. Extract Tables (pdfplumber's Key Feature)

pdfplumber excels at extracting structured tables. Example with table extraction:

python
import pdfplumber
import pandas as pd  # Used below to convert each table to a DataFrame

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        # Extract tables from current page
        tables = page.extract_tables()
        
        if tables:
            print(f"\n--- Tables from Page {page_num} ---")
            for table_idx, table in enumerate(tables):
                print(f"\nTable {table_idx + 1}:")
                
                # Print raw table data
                for row in table:
                    print(row)
                
                # Optional: Convert to pandas DataFrame for better manipulation
                df = pd.DataFrame(table[1:], columns=table[0])  # Assume first row is header
                print(f"\nDataFrame for Table {table_idx + 1}:")
                print(df)
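
extract_tables() returns each table as a list of rows, and cells that pdfplumber could not fill come back as None. A small sketch of tidying such a table before building a DataFrame; `clean_table` is a hypothetical helper, and the sample rows are made up for illustration:

```python
from typing import List, Optional


def clean_table(table: List[List[Optional[str]]]) -> List[List[str]]:
    """Replace None cells with '', trim whitespace, and drop fully empty rows."""
    cleaned = []
    for row in table:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):  # keep the row only if at least one cell has text
            cleaned.append(cells)
    return cleaned


# Made-up sample in the same shape that page.extract_tables() yields
raw = [["Name", "Qty"], ["Apple ", None], [None, None], ["Pear", " 3"]]
rows = clean_table(raw)
print(rows)
```

The cleaned rows can then be passed to pd.DataFrame(rows[1:], columns=rows[0]) in the same way as the raw table above.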

5. Extract PDF Metadata

Get document information like author, title, creation date:

python
import pdfplumber

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    metadata = pdf.metadata
    print("PDF Metadata:")
    for key, value in metadata.items():
        print(f"{key}: {value}")
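
Date entries in the metadata, such as the creation date, usually arrive as raw PDF date strings like D:20230115093000+08'00' rather than as datetime objects. A sketch of a parser, assuming the value follows the D:YYYYMMDDHHmmSS convention from the PDF specification; `parse_pdf_date` is a hypothetical helper, and files written by some tools deviate from the convention, in which case it returns None:

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY, then optional MM DD HH mm SS, then an optional Z or +HH'mm' offset
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string; return None if it cannot be interpreted."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tzinfo = None  # stays naive when the string carries no offset
    if sign in ("Z", "z"):
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:  # e.g. month 13 or an offset outside the allowed range
        return None


created = parse_pdf_date("D:20230115093000+08'00'")
print(created.isoformat())
```

With pdfplumber this could be applied as parse_pdf_date(metadata.get("CreationDate", "")), assuming the file sets that key at all.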

6. Extract Images and Save Them to Disk (Advanced)

Extract images embedded in the PDF:

python
import pdfplumber

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        images = page.images  # list of dicts, one per image placed on the page
        if images:
            print(f"\n--- Images from Page {page_num} ---")
            for img_idx, img in enumerate(images, 1):
                # Image dicts carry x0/top/x1/bottom rather than a 'bbox' key
                bbox = (img['x0'], img['top'], img['x1'], img['bottom'])
                print(f"Image {img_idx}:")
                print(f"  Coordinates: {bbox}")
                print(f"  Width: {img['width']}, Height: {img['height']}")
                # Note: this renders the image's region of the page to a PNG
                # (via Pillow) instead of dumping the raw embedded stream
                try:
                    region = page.crop(bbox).to_image(resolution=150)
                    region.save(f"page{page_num}_img{img_idx}.png")
                except ValueError:
                    print("  Skipped: image box extends beyond the page")
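
Image coordinates sometimes extend slightly past the page boundary, and by default page.crop() rejects a box that is not fully inside the page. A sketch of clamping a box to the page before cropping; `clamp_bbox` is a hypothetical helper, boxes are (x0, top, x1, bottom) tuples as pdfplumber uses them, and the numbers are made up:

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> BBox:
    """Shrink `bbox` so that it lies inside `page_bbox`."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


# Made-up numbers: an image overhanging a 612 x 792 pt page on two sides
clamped = clamp_bbox((-5.0, 100.0, 620.0, 300.0), (0.0, 0.0, 612.0, 792.0))
print(clamped)
```

With pdfplumber, the page's own box is available as page.bbox, so a crop could be written as page.crop(clamp_bbox(box, page.bbox)).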

Notes:

  • On Windows, use raw strings (r'path') or double backslashes ('c:\\Users\\...') for file paths.
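  • pdfplumber's table extraction uses its own line-based approach (its documentation credits Anssi Nurminen's master's thesis and Tabula as inspiration, not camelot) and can be customized by passing table_settings to extract_tables() (e.g., table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"}).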
  • For scanned PDFs, you'll need OCR tools like Tesseract (pdfplumber alone won't work for scanned text).
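
One rough way to tell whether a page probably needs OCR is to check that it has no extractable characters but does contain at least one image. A heuristic sketch, assuming a hypothetical helper `looks_scanned` that only reads the chars and images attributes pdfplumber pages expose; the stand-in objects below exist only for demonstration:

```python
from types import SimpleNamespace


def looks_scanned(page) -> bool:
    """Heuristic: no text characters but at least one image suggests a scan."""
    return len(page.chars) == 0 and len(page.images) > 0


# Stand-in objects for demonstration; with pdfplumber, pass real pages instead
text_page = SimpleNamespace(chars=[{"text": "a"}], images=[])
scan_page = SimpleNamespace(chars=[], images=[{"width": 612, "height": 792}])
print(looks_scanned(text_page), looks_scanned(scan_page))
```

This is only a hint: a blank page has neither characters nor images, and a scanned page that already carries an OCR text layer will report characters.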