To use the `pdfplumber` library to extract content from PDF files in Python, follow these steps with example code:
1. Install pdfplumber
First, install the library using pip:
bash
pip install pdfplumber
2. Basic Text Extraction
Extract all text from a PDF file:
python
import pdfplumber

# Replace with your PDF file path (use a raw string or double backslashes on Windows)
pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

# Extract all text from the PDF, page by page
with pdfplumber.open(pdf_path) as pdf:
    all_text = ''
    for page in pdf.pages:
        # extract_text() may return None or '' for a page with no text layer
        all_text += (page.extract_text() or '') + '\n'

print("Extracted Text:")
print(all_text)
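Pages without a text layer may yield no text at all, so it can help to join the per-page results defensively. The following is a minimal sketch; `join_page_texts` is a hypothetical helper name (not part of pdfplumber) that accepts whatever `extract_text()` returned for each page:

```python
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]], separator: str = "\n") -> str:
    """Join per-page text, treating None (no extractable text) as an empty page."""
    return separator.join(text or "" for text in page_texts)


# Hypothetical usage with pdfplumber (not executed here):
#   with pdfplumber.open(pdf_path) as pdf:
#       all_text = join_page_texts(page.extract_text() for page in pdf.pages)
print(repr(join_page_texts(["first page", None, "third page"])))
```

Building the string with `join` also avoids repeated concatenation when a document has many pages.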
3. Extract Text from Specific Pages
Extract text from page 2. `pdf.pages` is a 0-indexed list, so page 1 is index 0 and page 2 is index 1:
python
import pdfplumber

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    # Extract text from page 2 (index 1)
    page = pdf.pages[1]
    page_text = page.extract_text()

print("Text from Page 2:")
print(page_text)
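Because the page list is 0-indexed, an off-by-one mistake is easy to make, and a negative index silently wraps around to the end of the list. A small sketch of a guard, assuming a hypothetical helper named `page_index` that converts a human-readable page number into a list index:

```python
def page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number to a 0-based index, validating the range."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


# Hypothetical usage with pdfplumber (not executed here):
#   with pdfplumber.open(pdf_path) as pdf:
#       page = pdf.pages[page_index(2, len(pdf.pages))]
print(page_index(2, 5))
```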
4. Extract Tables (pdfplumber's Key Feature)
pdfplumber excels at extracting structured tables. Example with table extraction:
python
import pdfplumber
import pandas as pd  # Optional, for better table handling

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        # Extract tables from the current page (each table is a list of rows)
        tables = page.extract_tables()
        if tables:
            print(f"\n--- Tables from Page {page_num} ---")
            for table_idx, table in enumerate(tables, 1):
                print(f"\nTable {table_idx}:")
                # Print raw table data
                for row in table:
                    print(row)
                # Optional: convert to a pandas DataFrame for easier manipulation
                df = pd.DataFrame(table[1:], columns=table[0])  # Assumes the first row is the header
                print(f"\nDataFrame for Table {table_idx}:")
                print(df)
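Extracted tables are plain lists of rows, and header cells can be empty (`None`) or repeated, which makes the "first row is the header" assumption fragile. The sketch below shows one possible cleanup without pandas; `table_to_records` and the `column_N` placeholder naming are illustrative choices, not pdfplumber features:

```python
from typing import Dict, List, Optional, Sequence


def table_to_records(table: Sequence[Sequence[Optional[str]]]) -> List[Dict[str, str]]:
    """Turn a list-of-rows table into dicts keyed by a cleaned-up header row."""
    if not table:
        return []
    header: List[str] = []
    seen: Dict[str, int] = {}
    for position, cell in enumerate(table[0]):
        # Empty header cells get a positional placeholder name
        name = (cell or "").strip() or f"column_{position + 1}"
        # Repeated header names get a numeric suffix so no column is overwritten
        seen[name] = seen.get(name, 0) + 1
        if seen[name] > 1:
            name = f"{name}_{seen[name]}"
        header.append(name)
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


sample = [["Name", None, "Name"], ["Ann", "1", None]]
print(table_to_records(sample))
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` if pandas is available.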
5. Extract PDF Metadata
Get document information like author, title, creation date:
python
import pdfplumber

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    metadata = pdf.metadata

print("PDF Metadata:")
for key, value in metadata.items():
    print(f"{key}: {value}")
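Date fields in PDF metadata are usually stored as strings in the PDF date syntax, for example `D:20230115103000+08'00'`, rather than as datetime objects. A minimal sketch of a parser follows, assuming the value uses that syntax; `parse_pdf_date` is a hypothetical helper, and real files sometimes contain malformed dates, in which case it returns `None`:

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' / -HH'mm' offset;
# every field after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string into a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_hour, tz_minute = match.groups()
    tzinfo = None
    if sign in ("Z", "z"):
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_hour or 0), minutes=int(tz_minute or 0))
        tzinfo = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(int(year), int(month or 1), int(day or 1),
                        int(hour or 0), int(minute or 0), int(second or 0),
                        tzinfo=tzinfo)
    except ValueError:
        # Out-of-range fields such as month 13
        return None


print(parse_pdf_date("D:20230115103000+08'00'"))
```

With pdfplumber this could be applied to a value such as `metadata.get("CreationDate")`, when that key is present.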
6. Extract Images (Advanced)
Locate the images embedded in the PDF and save each one's page region as a PNG:
python
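import pdfplumber

pdf_path = r'c:\Users\czliu\Documents\python\example.pdf'

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        images = page.images  # List of dicts describing each embedded image
        if images:
            print(f"\n--- Images from Page {page_num} ---")
        for img_idx, img in enumerate(images, 1):
            # Image dicts carry x0/top/x1/bottom, measured from the page's top-left corner
            bbox = (img['x0'], img['top'], img['x1'], img['bottom'])
            print(f"Image {img_idx}:")
            print(f"  Coordinates: {bbox}")
            print(f"  Width: {img['width']}, Height: {img['height']}")
            # Render the image's region of the page and save it as a PNG.
            # This re-renders the area rather than dumping the original image bytes,
            # and crop() can raise ValueError if the bbox extends beyond the page.
            page.crop(bbox).to_image(resolution=150).save(f"page{page_num}_img{img_idx}.png")
# Note: to_image() needs pdfplumber's rendering dependencies, which vary by version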
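An image's reported position can extend slightly past the page edge, and cropping to such a box can fail. One way to guard against that is to clip the box to the page bounds first. This is a sketch only; `clamp_bbox` is a hypothetical helper, and it assumes boxes are `(x0, top, x1, bottom)` tuples like pdfplumber's `page.bbox`:

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> Optional[BBox]:
    """Clip bbox to page_bbox; return None if none of it lies on the page."""
    x0 = max(bbox[0], page_bbox[0])
    top = max(bbox[1], page_bbox[1])
    x1 = min(bbox[2], page_bbox[2])
    bottom = min(bbox[3], page_bbox[3])
    if x0 >= x1 or top >= bottom:
        return None
    return (x0, top, x1, bottom)


# Hypothetical usage with pdfplumber (not executed here):
#   safe = clamp_bbox((img['x0'], img['top'], img['x1'], img['bottom']), page.bbox)
#   if safe is not None:
#       page.crop(safe).to_image(resolution=150).save("image.png")
print(clamp_bbox((-5, 10, 700, 100), (0, 0, 612, 792)))
```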
Notes:
- On Windows, use raw strings (`r'path'`) or double backslashes (`'c:\\Users\\...'`) for file paths.
- pdfplumber uses its own table-finding logic (it is not built on Camelot), which can be customized by passing `table_settings` to `extract_tables()`, e.g., `table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"}`.
- For scanned PDFs, you'll need an OCR tool such as Tesseract; pdfplumber alone cannot read text that exists only as an image.
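As an illustration of the `table_settings` note, the sketch below tries ruling-line detection first and falls back to text alignment when no tables are found. The function name `extract_tables_with_fallback` and the fallback order are assumptions made for this example, not a pdfplumber recommendation; `page` is expected to be a pdfplumber page object:

```python
from typing import Any, List

LINE_SETTINGS = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
TEXT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_tables_with_fallback(page: Any) -> List[list]:
    """Try ruling-line detection first, then fall back to text alignment."""
    tables = page.extract_tables(table_settings=LINE_SETTINGS)
    if tables:
        return tables
    return page.extract_tables(table_settings=TEXT_SETTINGS)


# Hypothetical usage with pdfplumber (not executed here):
#   with pdfplumber.open(pdf_path) as pdf:
#       tables = extract_tables_with_fallback(pdf.pages[0])
```

The `"lines"` strategy suits tables with drawn borders, while `"text"` infers columns and rows from how words line up, which can help with borderless tables but may also produce false positives.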