Python NLP in Practice: Batch PDF Text Extraction and Topic Modeling
1. Batch PDF Text Extraction
Core tools:
- PyPDF2: basic PDF text extraction
- pdfplumber: enhanced text/table extraction (recommended)
```python
import os
import pdfplumber

def extract_pdf_texts(folder_path):
    """Batch-extract text from every PDF in a folder."""
    all_texts = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            with pdfplumber.open(os.path.join(folder_path, filename)) as pdf:
                text = ""
                for page in pdf.pages:
                    # extract_text() returns None for pages with no text layer
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"
                all_texts.append(text)
    return all_texts

# Usage example
pdf_texts = extract_pdf_texts("/path/to/pdf_folder")
```
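For comparison, a minimal PyPDF2 version of the same idea, assuming PyPDF2 >= 3.0 (which exposes the PdfReader class); note that PyPDF2 has no table-extraction support:

```python
# Minimal PyPDF2 alternative (assumes PyPDF2 >= 3.0; no table extraction)
from PyPDF2 import PdfReader

def extract_with_pypdf2(path):
    reader = PdfReader(path)
    # extract_text() may return None on image-only pages
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```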
2. Text Preprocessing Pipeline
```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    """Clean and normalize raw text into a token list."""
    # 1. Lowercase & strip non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    # 2. Tokenize
    words = nltk.word_tokenize(text)
    # 3. Drop stopwords and very short tokens
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words and len(w) > 2]
    # 4. Lemmatize
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in words]

# Preprocess all documents
processed_docs = [preprocess_text(text) for text in pdf_texts]
```
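A quick sanity check on one sentence shows what the pipeline keeps; the commented output is what the steps above should produce (exact lemmas can vary with your NLTK data version):

```python
sample = "Neural networks achieved 95% accuracy on the test set."
print(preprocess_text(sample))
# Expected: ['neural', 'network', 'achieved', 'accuracy', 'test', 'set']
```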
3. Topic Modeling (LDA Implementation)
```python
from gensim import corpora, models

# 1. Build the dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# 2. Train the LDA model
lda_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,    # number of topics
    passes=10,       # training passes
    random_state=42
)

# 3. Print the topics
def print_topics(model):
    for idx, topic in model.print_topics(-1):
        print(f"Topic {idx}: {topic}")

print_topics(lda_model)
```
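print_topics only lists keyword weights. For an interactive view of the topics, pyLDAvis is a common companion; a minimal sketch, assuming pyLDAvis >= 3.0 (where the gensim helper lives in pyLDAvis.gensim_models):

```python
# Interactive topic visualization (assumes: pip install pyLDAvis)
import pyLDAvis
import pyLDAvis.gensim_models

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")  # open in a browser
```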
4. Tips for Improving Results
- Choosing the number of topics (compute coherence scores, then pick the best value as sketched below):

```python
# Use coherence scores to select the optimal number of topics
from gensim.models import CoherenceModel

coherence_scores = []
for num_topics in range(3, 10):
    model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    coherencemodel = CoherenceModel(model, texts=processed_docs,
                                    dictionary=dictionary, coherence='c_v')
    coherence_scores.append(coherencemodel.get_coherence())
```
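The loop above only collects the scores; a small continuation (hypothetical, matching the range used above) picks the winning topic count:

```python
# Select the topic count with the highest c_v coherence
candidates = list(range(3, 10))
best_k = candidates[coherence_scores.index(max(coherence_scores))]
print(f"Best number of topics: {best_k}")
```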
- TF-IDF weighting:

```python
from gensim.models import TfidfModel

tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda_model_tfidf = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=5)
```
5. Complete Workflow Example
```python
# 1. Extract PDF text
pdf_texts = extract_pdf_texts("research_papers")

# 2. Preprocess
processed_docs = [preprocess_text(text) for text in pdf_texts]

# 3. Train the tuned model
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda_model = models.LdaModel(
    corpus=corpus_tfidf,
    id2word=dictionary,
    num_topics=5,
    passes=15,
    alpha='auto'
)

# 4. Print the topic keywords
print_topics(lda_model)
```
Sample output:

```
Topic 0: 0.025*"data" + 0.018*"learning" + 0.012*"model" + 0.009*"algorithm" + ...
Topic 1: 0.031*"network" + 0.022*"neural" + 0.015*"deep" + 0.011*"layer" + ...
```
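If the corpus will be re-analyzed later, persisting the trained artifacts avoids retraining; gensim dictionaries and models have built-in save/load (the file names below are arbitrary placeholders):

```python
# Persist the dictionary and model (file names are placeholders)
dictionary.save("papers.dict")
lda_model.save("papers_lda.model")

# Later, reload without retraining
dictionary = corpora.Dictionary.load("papers.dict")
lda_model = models.LdaModel.load("papers_lda.model")
```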
Key Considerations
- PDF extraction quality:
  - Scanned PDFs must first be processed with an OCR tool such as Tesseract; see the sketch below
  - For table-heavy documents, use pdfplumber's per-page page.extract_table() method
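A minimal OCR fallback sketch, assuming the pdf2image and pytesseract packages are installed along with the Tesseract and Poppler system binaries:

```python
# OCR fallback for scanned PDFs (assumes pdf2image + pytesseract are installed,
# plus the Tesseract and Poppler system binaries)
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path):
    """Render each page to an image, then OCR it with Tesseract."""
    pages = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(img) for img in pages)
```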
- NLP preprocessing:
  - Add domain-specific stopwords (e.g., "figure" and "table" in academic papers)
  - Preserve technical terms (e.g., keep nouns selected via POS tagging); a sketch follows
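A sketch of both ideas; the extra stopwords are illustrative, and the noun filter needs NLTK's tagger data:

```python
# Extend the stopword list and keep only noun tokens
# (requires nltk.download('averaged_perceptron_tagger'))
import nltk
from nltk.corpus import stopwords

domain_stops = {"figure", "table", "section", "result"}  # illustrative additions
stop_words = set(stopwords.words('english')) | domain_stops

def keep_nouns(words):
    """Keep tokens tagged NN/NNS/NNP/NNPS."""
    return [w for w, tag in nltk.pos_tag(words) if tag.startswith('NN')]
```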
- Model tuning:
  - Choose the number of topics by maximizing the coherence score $$\text{coherence score} = \frac{1}{N}\sum_{i=1}^{N} \text{score}(t_i)$$, i.e., the average score over the $N$ topics $t_i$
  - Hyperparameters: alpha (document-topic density) and eta (topic-word density); see the sketch below
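Rather than hand-picking these priors, gensim can also learn them from the data; a short sketch:

```python
# Let gensim learn asymmetric priors instead of fixing them by hand
lda_tuned = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    alpha='auto',  # learned document-topic density
    eta='auto'     # learned topic-word density
)
```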
- Alternative approach:

```python
# Modern topic modeling with BERTopic
from bertopic import BERTopic

topic_model = BERTopic(language="english")
topics, _ = topic_model.fit_transform([" ".join(doc) for doc in processed_docs])
```
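For a quick look at the result, BERTopic exposes a summary table (a pandas DataFrame of topic sizes and labels). Note that BERTopic's embedding step generally works better on raw text than on heavily preprocessed tokens, so passing the unprocessed pdf_texts is also worth trying:

```python
# Topic sizes and representative labels
print(topic_model.get_topic_info())
```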