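Chapter 1: What Is BeautifulSoup4
1.1 A Brief Introduction to BeautifulSoup4
BeautifulSoup4 (BS4 for short) is one of the most popular Python libraries for parsing HTML and XML documents.
- Automatic encoding handling: incoming documents are converted to Unicode and output is encoded as UTF-8 by default, so manual encoding work is rarely needed
- Fault tolerance: it copes gracefully with malformed or badly structured HTML
- Intuitive API: CSS selector support (similar to jQuery's selector syntax) keeps the learning curve low
- Multiple parser backends: html.parser, lxml, and html5lib are all supported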
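The fault tolerance mentioned above can be seen in a minimal sketch. The markup below is an invented example in which `<p>` and `<b>` are never closed; the built-in `html.parser` backend is assumed so that nothing beyond `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: <p> and <b> are never closed.
broken_html = "<div><p>Hello <b>world</div>"

# html.parser ships with Python, so no extra parser package is required.
soup = BeautifulSoup(broken_html, "html.parser")

# The unclosed tags are still usable as normal Tag objects.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()
print(bold_text)
print(paragraph_text)
```

Other parsers may repair the same markup into a slightly different tree, so it is worth checking the result with the parser you actually plan to use.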
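1.2 Where BeautifulSoup4 Is Used
- Competitor monitoring: collecting competitors' product prices and stock levels
- Market intelligence: extracting trend data from news sites and industry reports
- Social media analysis: collecting user comments and interaction data from platforms such as Weibo and Zhihu
- Financial data collection: gathering stock quotes and financial news
- E-commerce analysis: collecting product details and user reviews for market analysis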
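Chapter 2: Environment Setup and Installation
2.1 Basic Installation
bash
# Install the BeautifulSoup4 core library
pip install beautifulsoup4
# Install the recommended parser (lxml is the fastest)
pip install lxml
# Install html5lib (most fault-tolerant, optional)
pip install html5lib
# Install an HTTP client to use alongside it
pip install requests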
2.2 Verifying the Installation
python
import bs4
import lxml
import html5lib
import requests

print(f"BeautifulSoup4 version: {bs4.__version__}")
print(f"lxml version: {lxml.__version__}")
print(f"html5lib version: {html5lib.__version__}")
print(f"requests version: {requests.__version__}")
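Chapter 3: Choosing a Parser and Comparing Performance
3.1 The Three Main Parsers
| Parser | Install command | Pros | Cons | Best for |
|---|---|---|---|---|
| html.parser | None needed | Part of the standard library, no extra dependency | Moderate speed, moderate fault tolerance | Simple tasks, quick prototypes |
| lxml | pip install lxml | Fastest, feature-rich | Depends on a C extension | Recommended for production and large-scale processing |
| html5lib | pip install html5lib | Most fault-tolerant, closest to how a browser parses | Slowest | Complex or non-standard HTML |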
3.2 Benchmarking the Parsers
python
import time
from bs4 import BeautifulSoup
import requests

# Page used for the test
url = "https://baidu.com"
html_content = requests.get(url, timeout=10).text

# Benchmark helper: average parse time in milliseconds
def test_parser(parser_name, iterations=100):
    start_time = time.perf_counter()
    for _ in range(iterations):
        BeautifulSoup(html_content, parser_name)
    end_time = time.perf_counter()
    return (end_time - start_time) / iterations * 1000  # milliseconds

print("Parser comparison (average time per parse):")
print(f"html.parser: {test_parser('html.parser'):.2f}ms")
print(f"lxml: {test_parser('lxml'):.2f}ms")
print(f"html5lib: {test_parser('html5lib'):.2f}ms")
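Performance conclusions (exact ratios depend on the document and the machine):
- lxml is roughly 1.5-3x faster than html.parser
- html5lib is roughly 2-10x slower than lxml, but the most fault-tolerant
- Recommendation: use lxml in production and html5lib for development and debugging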
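Because lxml and html5lib are optional installs, a script can check which parser is available before building the soup. The sketch below is one possible approach; `pick_parser` is a hypothetical helper name, not part of BeautifulSoup, and it falls back to the built-in `html.parser`.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser from `preferred`, else 'html.parser'."""
    for name in preferred:
        # find_spec() returns None when the top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
soup = BeautifulSoup("<p>parser check</p>", parser_name)
text = soup.p.get_text()
print(parser_name, text)
```

Naming the parser explicitly, rather than letting BeautifulSoup choose, keeps results consistent across machines with different packages installed.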
Chapter 4: Basic Usage
4.1 Creating a BeautifulSoup Object
python
from bs4 import BeautifulSoup

# Option 1: create from a string
html_doc = """
<html>
<head><title>测试页面</title></head>
<body>
<h1>欢迎使用 BeautifulSoup4</h1>
<p class="content">这是第一个段落</p>
<p class="content">这是第二个段落</p>
</body>
</html>
"""
# Use the lxml parser (recommended)
soup = BeautifulSoup(html_doc, 'lxml')

# Option 2: create from a file (example.html must exist)
with open('example.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'lxml')

# Option 3: create from a URL (together with requests)
import requests
response = requests.get('https://example.com', timeout=10)
soup = BeautifulSoup(response.text, 'lxml')
4.2 The Four Core Object Types
BeautifulSoup turns a document into a tree made up of four kinds of Python objects:
python
from bs4 import BeautifulSoup, Comment

html = """
<html>
<head><title>测试</title></head>
<body>
<!-- 这是一个注释 -->
<p class="Class">Hello World</p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

# 1. Tag - a tag object
tag = soup.p
print(type(tag))     # <class 'bs4.element.Tag'>
print(tag.name)      # 'p'
print(tag['class'])  # ['Class'] (class is multi-valued, so a list is returned)
print(tag.attrs)     # {'class': ['Class']} (all attributes)

# 2. NavigableString - the text inside a tag
text = tag.string
print(type(text))    # <class 'bs4.element.NavigableString'>
print(text)          # 'Hello World'

# 3. BeautifulSoup - the whole document
print(type(soup))    # <class 'bs4.BeautifulSoup'>

# 4. Comment - a special kind of NavigableString
comment = soup.find(string=lambda text: isinstance(text, Comment))
print(type(comment))  # <class 'bs4.element.Comment'>
print(comment)        # ' 这是一个注释 '
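4.3 Traversing the Document Tree
python
# Child nodes
print(soup.body.contents)        # a list of direct children
print(list(soup.body.children))  # .children is an iterator; list() materializes it
# Descendant nodes
for descendant in soup.body.descendants:
    print(descendant)
# Parent nodes
print(soup.p.parent)                               # the <body> tag
print([parent.name for parent in soup.p.parents])  # names of all ancestors (.parents is a generator)
# Sibling nodes (whitespace between tags is itself a text node)
print(repr(soup.p.next_sibling))      # next sibling
print(repr(soup.p.previous_sibling))  # previous sibling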
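In pretty-printed HTML, adjacent tags are usually separated by whitespace text nodes, so `.next_sibling` often returns a newline rather than the next tag. The sketch below, using an invented two-paragraph snippet and the built-in `html.parser`, contrasts it with `find_next_sibling()`, which skips to the next matching tag.

```python
from bs4 import BeautifulSoup

html = """<body>
<p id="first">One</p>
<p id="second">Two</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p", id="first")

# The raw sibling is the newline text node between the two <p> tags.
raw_sibling = first.next_sibling

# find_next_sibling() ignores text nodes and returns the next matching Tag.
tag_sibling = first.find_next_sibling("p")

# .parents is a generator; collect the ancestor names into a list.
ancestor_names = [parent.name for parent in first.parents]

print(repr(raw_sibling))
print(tag_sibling["id"])
print(ancestor_names)
```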
Chapter 5: Searching the Document Tree (Core)
5.1 The find() and find_all() Methods
5.1.1 Basic Usage
python
from bs4 import BeautifulSoup

html = """
<div class="container">
<h1>标题</h1>
<p class="text" id="p1">段落1</p>
<p class="text" id="p2">段落2</p>
<a href="https://example.com">链接</a>
<span>其他内容</span>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

# find() - returns the first matching element (or None)
first_p = soup.find('p')
print(first_p)  # <p class="text" id="p1">段落1</p>

# find_all() - returns a list of all matching elements
all_p = soup.find_all('p')
print(all_p)       # [<p class="text" id="p1">段落1</p>, <p class="text" id="p2">段落2</p>]
print(len(all_p))  # 2

# Use the limit parameter to cap the number of results
limited_p = soup.find_all('p', limit=1)
print(len(limited_p))  # 1
5.1.2 Searching by Attribute
python
# Search by class (class is a Python keyword, so use class_)
by_class = soup.find_all('p', class_='text')
print(len(by_class))  # 2
# Search by id
by_id = soup.find(id='p1')
print(by_id.text)  # '段落1'
# Combine several attributes
by_attrs = soup.find_all('p', attrs={'class': 'text', 'id': 'p2'})
print(by_attrs[0].text)  # '段落2'
# Match attribute values against a pattern (regular expression)
import re
by_pattern = soup.find_all(id=re.compile('^p'))
print(len(by_pattern))  # 2
5.1.3 Advanced Search Techniques
python
# Search by exact text
by_text = soup.find_all(string='段落1')
print(by_text)  # ['段落1']
# Search by text pattern
by_text_pattern = soup.find_all(string=re.compile('段落'))
print(len(by_text_pattern))  # 2
# Use a function as the filter
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
result = soup.find_all(has_class_but_no_id)
print(result)  # [<div class="container">...</div>] (both <p> tags have an id, so they do not match)
# Control recursion
non_recursive = soup.find_all('p', recursive=False)
print(len(non_recursive))  # 0 (no <p> is a direct child of soup)
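5.2 CSS Selectors (the select() Method)
python
# Basic selectors
soup.select('p')        # all <p> tags
soup.select('.text')    # elements with class="text"
soup.select('#p1')      # the element with id="p1"
soup.select('div p')    # all <p> tags anywhere under a <div>
soup.select('div > p')  # <p> tags that are direct children of a <div>
# Combined selectors
soup.select('p.text')   # <p> tags with class="text"
soup.select('p#p1')     # the <p> tag with id="p1"
soup.select('p, span')  # <p> or <span> tags
# Attribute selectors
soup.select('[href]')                        # elements that have an href attribute
soup.select('[href="https://example.com"]')  # exact href match
soup.select('[class~="text"]')               # class list contains "text"
# Pseudo-class selectors
soup.select('p:first-child')     # <p> tags that are the first child of their parent ([] here: <h1> comes first)
soup.select('p:last-child')      # <p> tags that are the last child of their parent ([] here: <span> comes last)
soup.select('p:nth-of-type(2)')  # the second <p> among its siblings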
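The difference between `:first-child` and `:first-of-type` is easy to miss: the former requires the element to be the very first child of its parent, while the latter only looks at siblings of the same tag name. A small sketch on an invented snippet (using `html.parser`, and the `select_one()` shortcut for a single match):

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
<h1>Title</h1>
<p class="text" id="p1">Paragraph 1</p>
<p class="text" id="p2">Paragraph 2</p>
<span>Other</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# No <p> is the first child of the div, because <h1> comes before them.
first_child = soup.select("p:first-child")

# :first-of-type only compares against sibling <p> tags.
first_of_type = soup.select("p:first-of-type")

# select_one() returns a single Tag (or None) instead of a list.
second_p = soup.select_one("p:nth-of-type(2)")

print(len(first_child))
print([tag["id"] for tag in first_of_type])
print(second_p["id"])
```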
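5.3 find() vs find_all() vs select()
| Method | Returns | Syntax | Best for |
|---|---|---|---|
| find() | A single Tag object, or None | Tag name plus keyword/attrs filters | Only the first match is needed |
| find_all() | A list of Tag objects | Tag name plus keyword/attrs filters | All matches are needed, with flexible filters |
| select() | A list of Tag objects | CSS selector syntax | You know CSS, or the nesting is complex |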
Chapter 6: Cleaning and Processing Data
6.1 Extracting Text
python
# Basic text extraction
print(soup.p.string)      # '段落1'
print(soup.p.get_text())  # '段落1'
# Extract all text (including child tags); the newlines between tags are kept
print(soup.get_text())
# Strip whitespace while extracting
print(soup.get_text(strip=True))  # '标题段落1段落2链接其他内容'
# Extract text from several elements
texts = [p.get_text() for p in soup.find_all('p')]
print(texts)  # ['段落1', '段落2']
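With `strip=True` alone, the text of neighbouring tags is glued together. Passing a `separator` to `get_text()`, or iterating over `stripped_strings`, keeps the pieces apart. A minimal sketch on an invented snippet:

```python
from bs4 import BeautifulSoup

html = "<div>\n<h1>Title</h1>\n<p>First <b>bold</b> paragraph</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# Join the stripped text fragments with a single space.
joined = soup.get_text(separator=" ", strip=True)

# stripped_strings yields each non-empty fragment with whitespace removed.
pieces = list(soup.stripped_strings)

print(joined)
print(pieces)
```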
6.2 Extracting Attribute Values
python
# Get a single attribute
link = soup.find('a')
print(link['href'])      # 'https://example.com'
print(link.get('href'))  # 'https://example.com'
# Get all attributes
print(link.attrs)  # {'href': 'https://example.com'}
# Safe access (returns None when the attribute is missing)
print(link.get('target'))            # None
print(link.get('target', '_blank'))  # '_blank' (default value)
# Collect every link
links = soup.find_all('a')
hrefs = [link.get('href') for link in links]
print(hrefs)  # ['https://example.com']
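6.3 Handling Special Content
python
from bs4 import BeautifulSoup, Comment

# HTML entities
html_with_entities = "<p>Price: &yen;100</p>"
soup = BeautifulSoup(html_with_entities, 'lxml')
print(soup.p.string)  # 'Price: ¥100' (decoded automatically)

# Comments
html_with_comment = "<p><!-- 这是注释 -->正文</p>"
soup = BeautifulSoup(html_with_comment, 'lxml')
comment = soup.find(string=lambda text: isinstance(text, Comment))
print(comment)  # ' 这是注释 '

# Script and style content (find() returns None when the tag is absent, so guard first)
script_tag = soup.find('script')
style_tag = soup.find('style')
script_content = script_tag.string if script_tag else None
style_content = style_tag.string if style_tag else None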
Chapter 7: Modifying the Document Tree
7.1 Modifying Tags and Attributes
python
html = "<p class='old'>原始内容</p>"
soup = BeautifulSoup(html, 'lxml')
# Rename the tag
soup.p.name = 'div'
print(soup.div)  # <div class="old">原始内容</div>
# Modify attributes
soup.div['class'] = 'new'
soup.div['id'] = 'modified'
print(soup.div)  # <div class="new" id="modified">原始内容</div>
# Delete an attribute
del soup.div['class']
print(soup.div)  # <div id="modified">原始内容</div>
7.2 Modifying Text
python
# Change the text
soup.div.string = '新内容'
print(soup.div)  # <div id="modified">新内容</div>
# Replace the text with a new tag
new_tag = soup.new_tag('span')
new_tag['class'] = 'highlight'  # set class by item assignment; new_tag() does not translate class_ the way find_all() does
new_tag.string = '高亮内容'
soup.div.string.replace_with(new_tag)
print(soup.div)  # <div id="modified"><span class="highlight">高亮内容</span></div>
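7.3 Adding and Removing Elements
python
# append() - add to the end
soup.div.append('追加的文本')
# insert() - insert at a given position
new_p = soup.new_tag('p')
new_p.string = '插入的段落'
soup.div.insert(0, new_p)  # insert as the first child
# insert_before() / insert_after()
soup.div.insert_before(soup.new_tag('hr'))
soup.div.insert_after(soup.new_tag('hr'))
# unwrap() - remove the tag but keep its contents
soup.span.unwrap()
# clear() - remove all contents of the tag
soup.div.clear()
# extract() - remove the tag from the tree and return it
removed = soup.div.extract()
# decompose() - remove and destroy the tag, returning nothing
# (applied to an <hr> here because the div has already been extracted)
soup.hr.decompose()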
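The removal methods differ mainly in what survives afterwards. The sketch below runs `unwrap()`, `extract()`, and `decompose()` on a fresh, invented snippet so that each call has an existing tag to act on.

```python
from bs4 import BeautifulSoup

html = "<div><p>keep</p><span>text <b>bold</b></span><i>gone</i></div>"
soup = BeautifulSoup(html, "html.parser")

# unwrap(): the <b> tag disappears, its text stays inside <span>.
soup.b.unwrap()

# extract(): the <p> tag leaves the tree but is returned for later use.
removed = soup.p.extract()

# decompose(): the <i> tag is removed and destroyed.
soup.i.decompose()

remaining = str(soup.div)
print(removed.get_text())
print(remaining)
```

Calling any of these on `soup.some_tag` when the tag is absent raises `AttributeError`, because the lookup returns `None`, so order matters when several removals are chained.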
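Chapter 8: Case Studies
Every website is built differently, so the examples below only illustrate a general pattern; the selectors and URLs must be adapted to the actual site.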
8.1 Case 1: Scraping E-commerce Product Data
python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from urllib.parse import urljoin
import time


class ECommerceScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        self.products = []

    def fetch_page(self, url):
        """Fetch the page content."""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return None

    def parse_product(self, product_div):
        """Parse a single product."""
        try:
            # Product name
            name_tag = product_div.find('h2', class_='product-name')
            name = name_tag.get_text(strip=True) if name_tag else 'N/A'
            # Price
            price_tag = product_div.find('span', class_='price')
            price = price_tag.get_text(strip=True) if price_tag else 'N/A'
            # Rating
            rating_tag = product_div.find('div', class_='rating')
            rating = rating_tag.get('data-score', 'N/A') if rating_tag else 'N/A'
            # Review count
            review_tag = product_div.find('span', class_='review-count')
            reviews = review_tag.get_text(strip=True) if review_tag else '0'
            # Product link (urljoin resolves relative paths correctly)
            link_tag = product_div.find('a', class_='product-link')
            link = link_tag.get('href', '') if link_tag else ''
            if link:
                link = urljoin(self.base_url, link)
            return {
                'name': name,
                'price': price,
                'rating': rating,
                'reviews': reviews,
                'link': link,
                'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            }
        except Exception as e:
            print(f"Failed to parse product: {e}")
            return None

    def scrape_category(self, category_url, max_pages=5):
        """Scrape a category listing page by page."""
        for page in range(1, max_pages + 1):
            url = f"{category_url}?page={page}"
            print(f"Scraping: {url}")
            html = self.fetch_page(url)
            if not html:
                break
            soup = BeautifulSoup(html, 'lxml')
            # Find every product container
            product_divs = soup.find_all('div', class_='product-item')
            if not product_divs:
                print("No products found; this is probably the last page")
                break
            for product_div in product_divs:
                product = self.parse_product(product_div)
                if product:
                    self.products.append(product)
            print(f"Page {page}: found {len(product_divs)} products")
            time.sleep(1)  # avoid sending requests too quickly

    def save_to_excel(self, filename='products.xlsx'):
        """Save to Excel (pandas needs the openpyxl package for .xlsx files)."""
        df = pd.DataFrame(self.products)
        df.to_excel(filename, index=False)
        print(f"Data saved to {filename}")
        print(f"{len(self.products)} products scraped in total")

    def analyze_data(self):
        """Basic analysis."""
        if not self.products:
            print("No data to analyze")
            return
        df = pd.DataFrame(self.products)
        # Price analysis ('N/A' and other non-numeric values become NaN)
        df['price_clean'] = pd.to_numeric(
            df['price'].str.replace(r'[^\d.]', '', regex=True), errors='coerce'
        )
        print("\nPrice statistics:")
        print(f"Average price: ¥{df['price_clean'].mean():.2f}")
        print(f"Highest price: ¥{df['price_clean'].max():.2f}")
        print(f"Lowest price: ¥{df['price_clean'].min():.2f}")
        # Rating analysis
        df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
        print("\nRating statistics:")
        print(f"Average rating: {df['rating'].mean():.2f}")
        print(f"Share rated 4.0+: {(df['rating'] >= 4.0).sum() / len(df) * 100:.1f}%")


# Usage example
if __name__ == "__main__":
    scraper = ECommerceScraper('https://example-ecommerce.com')
    scraper.scrape_category('https://example-ecommerce.com/category/electronics', max_pages=3)
    scraper.save_to_excel()
    scraper.analyze_data()
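Two small cleaning steps in this case study are worth isolating: turning a scraped `href` into an absolute URL, and turning a price string into a number. The sketch below uses only the standard library; `absolute_link` and `parse_price` are hypothetical helper names, and the sample URLs and prices are invented.

```python
import re
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve `href` against the page it was found on."""
    return urljoin(page_url, href)


def parse_price(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


page = "https://example-ecommerce.com/category/electronics?page=2"
print(absolute_link(page, "/item/42"))
print(absolute_link(page, "https://cdn.example.com/item/7"))
print(parse_price("¥1,299.00"))
print(parse_price("N/A"))
```

Plain string concatenation of a base URL and an `href` breaks for relative paths without a leading slash and for links that are already absolute, which is why `urljoin` is the safer choice.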
8.2 Case 2: Monitoring a News Site
python
import requests
from bs4 import BeautifulSoup
import sqlite3
from datetime import datetime
from urllib.parse import urljoin
import hashlib


class NewsMonitor:
    def __init__(self, db_path='news.db'):
        self.db_path = db_path
        self.init_database()

    def init_database(self):
        """Create the table if it does not exist."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS news (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                url TEXT UNIQUE NOT NULL,
                summary TEXT,
                publish_time TEXT,
                source TEXT,
                content_hash TEXT,
                scraped_at TEXT,
                is_new INTEGER DEFAULT 1
            )
        ''')
        conn.commit()
        conn.close()

    def get_article_hash(self, content):
        """Hash the content (used for de-duplication)."""
        return hashlib.md5(content.encode('utf-8')).hexdigest()

    def is_article_exists(self, content_hash):
        """Check whether the article is already stored."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('SELECT id FROM news WHERE content_hash = ?', (content_hash,))
        result = cursor.fetchone()
        conn.close()
        return result is not None

    def save_article(self, article):
        """Save an article; returns True if a new row was inserted."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        try:
            cursor.execute('''
                INSERT OR IGNORE INTO news
                (title, url, summary, publish_time, source, content_hash, scraped_at)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            ''', (
                article['title'],
                article['url'],
                article['summary'],
                article['publish_time'],
                article['source'],
                article['content_hash'],
                article['scraped_at']
            ))
            conn.commit()
            return cursor.rowcount > 0
        except sqlite3.IntegrityError:
            return False
        finally:
            conn.close()

    def scrape_news(self, url, source_name):
        """Scrape a news listing page."""
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            response.encoding = 'utf-8'
            soup = BeautifulSoup(response.text, 'lxml')
            # Find the news items (adjust to the actual site structure)
            news_items = soup.find_all('div', class_='news-item')
            new_articles = 0
            for item in news_items:
                try:
                    # Title
                    title_tag = item.find('h3', class_='news-title')
                    title = title_tag.get_text(strip=True) if title_tag else ''
                    # Link (urljoin resolves relative paths against the listing URL)
                    link_tag = item.find('a')
                    article_url = link_tag.get('href', '') if link_tag else ''
                    if article_url:
                        article_url = urljoin(url, article_url)
                    # Summary
                    summary_tag = item.find('p', class_='summary')
                    summary = summary_tag.get_text(strip=True) if summary_tag else ''
                    # Publish time
                    time_tag = item.find('span', class_='publish-time')
                    publish_time = time_tag.get_text(strip=True) if time_tag else ''
                    # Content hash
                    content_hash = self.get_article_hash(title + summary)
                    # Skip articles that are already stored
                    if self.is_article_exists(content_hash):
                        continue
                    # Save the article
                    article = {
                        'title': title,
                        'url': article_url,
                        'summary': summary,
                        'publish_time': publish_time,
                        'source': source_name,
                        'content_hash': content_hash,
                        'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                    }
                    if self.save_article(article):
                        new_articles += 1
                        print(f"✓ {title}")
                except Exception as e:
                    print(f"Failed to parse article: {e}")
                    continue
            print(f"\nThis run: {len(news_items)} articles seen, {new_articles} new")
            return new_articles
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return 0

    def get_latest_news(self, limit=10):
        """Return the most recently scraped articles."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('''
            SELECT title, url, publish_time, source
            FROM news
            ORDER BY scraped_at DESC
            LIMIT ?
        ''', (limit,))
        results = cursor.fetchall()
        conn.close()
        return results


# Usage example
if __name__ == "__main__":
    monitor = NewsMonitor()
    # Monitor several news sources
    news_sources = [
        ('https://news.example.com/politics', 'Politics'),
        ('https://news.example.com/tech', 'Technology'),
        ('https://news.example.com/business', 'Business')
    ]
    for url, source in news_sources:
        print(f"\nScraping {source}...")
        monitor.scrape_news(url, source)
    # Show the latest news
    print("\n=== Latest news ===")
    latest = monitor.get_latest_news(5)
    for title, url, publish_time, source in latest:
        print(f"[{source}] {publish_time} - {title}")
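The de-duplication in this case study rests on two things: a content hash and a `UNIQUE` column combined with `INSERT OR IGNORE`. The sketch below reproduces that idea on an in-memory SQLite database with a simplified, hypothetical three-column table, so it can be run without touching `news.db`.

```python
import hashlib
import sqlite3

# In-memory database with a reduced version of the news table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (url TEXT UNIQUE NOT NULL, title TEXT, content_hash TEXT)"
)


def save(conn, url, title, summary):
    """Insert the article; return True only if a new row was written."""
    content_hash = hashlib.md5((title + summary).encode("utf-8")).hexdigest()
    cursor = conn.execute(
        "INSERT OR IGNORE INTO news (url, title, content_hash) VALUES (?, ?, ?)",
        (url, title, content_hash),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cursor.rowcount > 0


first = save(conn, "https://news.example.com/a1", "Headline", "Summary text")
second = save(conn, "https://news.example.com/a1", "Headline", "Summary text")
row_count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(first, second, row_count)
conn.close()
```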
8.3 Case 3: Monitoring Stock Prices
python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import matplotlib.pyplot as plt


class StockMonitor:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        self.price_history = {}

    def get_stock_price(self, stock_code):
        """Fetch the current price of one stock."""
        url = f"https://finance.example.com/quote/{stock_code}"
        try:
            response = requests.get(url, headers=self.headers, timeout=5)
            soup = BeautifulSoup(response.text, 'lxml')
            # Adjust the selectors to the actual site structure
            price_tag = soup.select_one('.stock-price .current')
            price = float(price_tag.get_text(strip=True).replace(',', '')) if price_tag else None
            change_tag = soup.select_one('.stock-price .change')
            change = change_tag.get_text(strip=True) if change_tag else '0.00'
            return {
                'code': stock_code,
                'price': price,
                'change': change,
                'time': datetime.now().strftime('%H:%M:%S')
            }
        except Exception as e:
            print(f"Failed to fetch price for {stock_code}: {e}")
            return None

    def monitor_stocks(self, stock_codes, duration_minutes=5, interval_seconds=10):
        """Monitor several stocks for a fixed duration."""
        end_time = time.time() + duration_minutes * 60
        print(f"Monitoring {len(stock_codes)} stocks for {duration_minutes} minutes...")
        while time.time() < end_time:
            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print(f"\n[{timestamp}]")
            for code in stock_codes:
                data = self.get_stock_price(code)
                if data and data['price'] is not None:
                    print(f"{code}: ¥{data['price']:.2f} ({data['change']})")
                    # Record the price history
                    if code not in self.price_history:
                        self.price_history[code] = []
                    self.price_history[code].append({
                        'time': data['time'],
                        'price': data['price']
                    })
            time.sleep(interval_seconds)

    def plot_price_trend(self):
        """Plot the price trend."""
        if not self.price_history:
            print("No data to plot")
            return
        plt.figure(figsize=(12, 6))
        for code, history in self.price_history.items():
            times = [h['time'] for h in history]
            prices = [h['price'] for h in history]
            plt.plot(times, prices, marker='o', label=code)
        plt.xlabel('Time')
        plt.ylabel('Price (¥)')
        plt.title('Stock Price Trend')
        plt.legend()
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.grid(True, alpha=0.3)
        plt.savefig('stock_trend.png')
        print("Price trend chart saved as stock_trend.png")

    def generate_report(self):
        """Generate a monitoring report."""
        report = []
        report.append("=" * 50)
        report.append("STOCK MONITORING REPORT")
        report.append("=" * 50)
        report.append(f"Report Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        report.append(f"Total Stocks Monitored: {len(self.price_history)}")
        report.append("")
        for code, history in self.price_history.items():
            if len(history) < 2:
                continue
            first_price = history[0]['price']
            last_price = history[-1]['price']
            change = last_price - first_price
            change_percent = (change / first_price) * 100
            report.append(f"Stock: {code}")
            report.append(f" Start Price: ¥{first_price:.2f}")
            report.append(f" End Price: ¥{last_price:.2f}")
            report.append(f" Change: ¥{change:+.2f} ({change_percent:+.2f}%)")
            report.append(f" High: ¥{max(h['price'] for h in history):.2f}")
            report.append(f" Low: ¥{min(h['price'] for h in history):.2f}")
            report.append("")
        report_text = "\n".join(report)
        with open('stock_report.txt', 'w', encoding='utf-8') as f:
            f.write(report_text)
        print("Monitoring report saved as stock_report.txt")
        print(report_text)


# Usage example
if __name__ == "__main__":
    monitor = StockMonitor()
    # Stocks to monitor
    stocks = ['AAPL', 'GOOGL', 'MSFT', 'TSLA', 'AMZN']
    # Start monitoring (5 minutes, refreshing every 10 seconds)
    monitor.monitor_stocks(stocks, duration_minutes=5, interval_seconds=10)
    # Generate the chart
    monitor.plot_price_trend()
    # Generate the text report
    monitor.generate_report()
Chapter 9: Optimization Techniques
9.1 Code Optimization
python
from bs4 import BeautifulSoup
import time

# Anti-pattern: repeating the same lookup
def bad_practice(html):
    soup = BeautifulSoup(html, 'lxml')
    # Walks the document tree again on every iteration
    for i in range(100):
        title = soup.find('h1')
        print(title.text)

# Better: cache the lookup result
def good_practice(html):
    soup = BeautifulSoup(html, 'lxml')
    # Look up once and reuse the result
    title = soup.find('h1')
    title_text = title.text if title else ''
    for i in range(100):
        print(title_text)

# Timing comparison
html = "<html><h1>Test</h1></html>"
start = time.time()
bad_practice(html)
print(f"Anti-pattern took: {time.time() - start:.4f}s")
start = time.time()
good_practice(html)
print(f"Cached version took: {time.time() - start:.4f}s")
9.2 Memory Optimization
python
from bs4 import BeautifulSoup
import gc

def process_large_html(html_chunks):
    """Process many documents while keeping memory use low."""
    results = []
    for i, chunk in enumerate(html_chunks):
        # Parse one chunk at a time
        soup = BeautifulSoup(chunk, 'lxml')
        # Extract only the data we need
        data = extract_data(soup)
        results.append(data)
        # Drop the large object explicitly
        del soup
        # Trigger garbage collection periodically
        if i % 10 == 0:
            gc.collect()
    return results

def extract_data(soup):
    """Extract the fields we need (each lookup is done only once)."""
    title_tag = soup.find('h1')
    content_tag = soup.find('div', class_='content')
    return {
        'title': title_tag.text if title_tag else '',
        'content': content_tag.text if content_tag else ''
    }
9.3 Error Handling and Retries
python
import requests
from bs4 import BeautifulSoup
import time
from functools import wraps

def retry_on_failure(max_retries=3, delay=2):
    """Retry decorator."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt < max_retries - 1:
                        print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
                        time.sleep(delay * (attempt + 1))  # linear backoff: the wait grows with each attempt
                    else:
                        print(f"All {max_retries} attempts failed")
                        raise
        return wrapper
    return decorator

class RobustScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    @retry_on_failure(max_retries=3, delay=2)
    def fetch_page(self, url):
        """Fetch a page, retrying on failure."""
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        return response.text

    def parse_with_fallback(self, html, parsers=None):
        """Parse with a fallback chain of parsers."""
        if parsers is None:
            parsers = ['lxml', 'html.parser', 'html5lib']
        for parser in parsers:
            try:
                soup = BeautifulSoup(html, parser)
                # Check that parsing produced a usable tree
                if soup.find('body'):
                    print(f"✓ Parsed successfully with {parser}")
                    return soup
            except Exception as e:
                print(f"✗ {parser} failed: {e}")
        raise ValueError("All parsers failed")
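For comparison, a true exponential backoff doubles the wait after every failure and usually adds a little random jitter so that many clients do not retry in lockstep. The sketch below is one way to write it; `retry_with_backoff` and `flaky` are hypothetical names, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import random
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: propagate the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed (stands in for an unreliable request)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the delays instead of sleeping
result = retry_with_backoff(flaky, max_retries=3, base_delay=1.0, sleep=waits.append)
print(result, calls["count"], len(waits))
```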
9.4 Dealing with Anti-Scraping Measures
python
import requests
from bs4 import BeautifulSoup
import random
import time

class AntiAntiCrawler:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
        ]
        self.proxies = [
            # A proxy pool can be configured here
        ]

    def get_random_headers(self):
        """Build randomized request headers."""
        return {
            'User-Agent': random.choice(self.user_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        }

    def scrape_with_delay(self, urls, min_delay=1, max_delay=3):
        """Scrape with a random delay between requests."""
        results = []
        for url in urls:
            headers = self.get_random_headers()
            try:
                response = requests.get(url, headers=headers, timeout=10)
                soup = BeautifulSoup(response.text, 'lxml')
                results.append(soup)
                # Random delay to mimic human browsing
                delay = random.uniform(min_delay, max_delay)
                time.sleep(delay)
            except Exception as e:
                print(f"Failed to scrape {url}: {e}")
                continue
        return results
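Chapter 10: Common Problems and Solutions
10.1 Encoding Problems
python
import requests
from bs4 import BeautifulSoup

# Problem: Chinese text comes out garbled
# Fix: set the encoding explicitly
url = 'https://example.com'  # placeholder URL
response = requests.get(url, timeout=10)
response.encoding = 'utf-8'  # explicit encoding
soup = BeautifulSoup(response.text, 'lxml')
# Or detect the encoding automatically
from charset_normalizer import detect
encoding = detect(response.content)['encoding']
response.encoding = encoding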
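Another option is to hand BeautifulSoup the raw bytes (with requests that would be `response.content` rather than `response.text`) and name the encoding through the `from_encoding` argument. The sketch below simulates a GBK-encoded response with an invented byte string, so it runs without any network access.

```python
from bs4 import BeautifulSoup

# Simulated response body: a page encoded as GBK rather than UTF-8.
raw_bytes = "<p>中文内容</p>".encode("gbk")

# from_encoding tells BeautifulSoup how to decode the bytes.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

text = soup.p.string
print(text)
# original_encoding reports which encoding was used for decoding.
print(soup.original_encoding)
```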
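10.2 Dynamically Loaded Content
python
import time
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

# Problem: content rendered by JavaScript is missing from the HTML
# Solution 1: use Selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
time.sleep(3)  # wait for the page to load
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
driver.quit()

# Solution 2: look for the API endpoint
# Many sites load their data from a backend API; calling it directly is more efficient
api_url = "https://example.com/api/data"
response = requests.get(api_url, headers=headers, timeout=10)
data = response.json()  # JSON data, no HTML parsing needed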
10.3 Complex Selectors
python
# Problem: deep nesting makes the selector long and fragile
# Fix: search step by step
# Not recommended: one very long selector
result = soup.select('div.container > div.row > div.col-md-8 > ul.list > li.item > a.link')
# Recommended: search step by step, checking each level
container = soup.find('div', class_='container')
if container:
    row = container.find('div', class_='row')
    if row:
        col = row.find('div', class_='col-md-8')
        if col:
            items = col.find_all('li', class_='item')
            for item in items:
                link = item.find('a', class_='link')
                if link:
                    print(link.get('href'))
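10.4 Performance Bottlenecks
python
# Problem: processing large amounts of data is slow
# Fixes:
# 1. Use a faster parser
soup = BeautifulSoup(html, 'lxml')  # instead of 'html.parser'
# 2. Narrow the search scope
# Slow: search the whole document
all_divs = soup.find_all('div')
# Better: locate the container first, then search inside it
container = soup.find('div', id='main-content')
if container:
    all_divs = container.find_all('div')
# 3. CSS selectors are often more concise for nested conditions;
#    their speed relative to find_all() varies, so measure before switching
items = soup.select('div.item')  # equivalent to soup.find_all('div', class_='item')
# 4. Avoid repeating a lookup inside a loop
# ❌
for i in range(1000):
    title = soup.find('h1').text
# ✅
title = soup.find('h1').text
for i in range(1000):
    print(title)  # reuse the cached value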
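Narrowing the scope can also happen at parse time. BeautifulSoup's `SoupStrainer` class, passed through the `parse_only` argument, keeps only the matching parts of the document in the tree. The sketch below keeps just the `<a>` tags of an invented page; note that `parse_only` is not supported by the html5lib parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<div id="main-content"><p>Body text</p><a href="/item/1">Item 1</a></div>
</body></html>
"""

# Build a tree that contains only <a> tags; everything else is skipped.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
paragraph = soup.find("p")  # None: <p> was never added to the tree

print(hrefs)
print(paragraph)
```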
Chapter 11: A Complete Project Template
python
"""
Enterprise-style scraping framework
Features:
- Config-driven setup
- Automatic retries
- Data validation
- Logging
- Exception handling
- Data export
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
import logging
from datetime import datetime
import json
import os
from typing import List, Dict, Optional
import time

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)


class EnterpriseScraper:
    """Base class for scrapers."""

    def __init__(self, config: Dict):
        self.config = config
        self.base_url = config.get('base_url', '')
        self.headers = config.get('headers', {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        self.timeout = config.get('timeout', 10)
        self.max_retries = config.get('max_retries', 3)
        self.delay = config.get('delay', 1)
        self.session = requests.Session()
        self.session.headers.update(self.headers)
        self.data = []

    def fetch(self, url: str, params: Optional[Dict] = None) -> Optional[str]:
        """Fetch a page (with retries)."""
        for attempt in range(self.max_retries):
            try:
                logging.info(f"Requesting: {url}")
                response = self.session.get(url, params=params, timeout=self.timeout)
                response.raise_for_status()
                # Sanity-check the response body
                if not response.text or len(response.text) < 100:
                    logging.warning(f"Response body too short: {len(response.text)}")
                    raise ValueError("Unexpected response body")
                return response.text
            except Exception as e:
                logging.error(f"Request failed (attempt {attempt + 1}/{self.max_retries}): {e}")
                if attempt < self.max_retries - 1:
                    time.sleep(self.delay * (attempt + 1))
                else:
                    logging.error(f"All retries failed: {url}")
                    return None
        return None

    def parse(self, html: str) -> List[Dict]:
        """Parse a page (implemented by subclasses)."""
        raise NotImplementedError("Subclasses must implement parse()")

    def validate(self, item: Dict) -> bool:
        """Validate an item (subclasses may override)."""
        return True

    def scrape(self, urls: List[str]) -> List[Dict]:
        """Scrape several URLs."""
        all_data = []
        for i, url in enumerate(urls, 1):
            logging.info(f"Processing {i}/{len(urls)}: {url}")
            html = self.fetch(url)
            if not html:
                continue
            try:
                items = self.parse(html)
                logging.info(f"Parsed {len(items)} items")
                # Validate and filter
                valid_items = [item for item in items if self.validate(item)]
                logging.info(f"{len(valid_items)} items passed validation")
                all_data.extend(valid_items)
                # Pause between requests to avoid being blocked
                if i < len(urls):
                    time.sleep(self.delay)
            except Exception as e:
                logging.error(f"Parsing failed: {e}")
                continue
        self.data = all_data
        return all_data

    def save_to_csv(self, filename: str = 'output.csv'):
        """Save as CSV."""
        if not self.data:
            logging.warning("No data to save")
            return
        df = pd.DataFrame(self.data)
        df.to_csv(filename, index=False, encoding='utf-8-sig')
        logging.info(f"Data saved to {filename} ({len(self.data)} records)")

    def save_to_excel(self, filename: str = 'output.xlsx'):
        """Save as Excel."""
        if not self.data:
            logging.warning("No data to save")
            return
        df = pd.DataFrame(self.data)
        df.to_excel(filename, index=False)
        logging.info(f"Data saved to {filename} ({len(self.data)} records)")

    def save_to_json(self, filename: str = 'output.json'):
        """Save as JSON."""
        if not self.data:
            logging.warning("No data to save")
            return
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.data, f, ensure_ascii=False, indent=2)
        logging.info(f"Data saved to {filename} ({len(self.data)} records)")

    def generate_report(self):
        """Generate a summary report."""
        if not self.data:
            logging.warning("No data to analyze")
            return
        df = pd.DataFrame(self.data)
        report = []
        report.append("=" * 60)
        report.append("SCRAPING REPORT")
        report.append("=" * 60)
        report.append(f"Report Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        report.append(f"Total Records: {len(self.data)}")
        report.append(f"Fields: {', '.join(df.columns)}")
        report.append("")
        # Per-field statistics
        for col in df.columns:
            report.append(f"Field: {col}")
            report.append(f" Non-null: {df[col].notnull().sum()}/{len(df)}")
            report.append(f" Unique: {df[col].nunique()}")
            # Extra statistics for numeric fields
            if df[col].dtype in ['int64', 'float64']:
                report.append(f" Mean: {df[col].mean():.2f}")
                report.append(f" Min: {df[col].min():.2f}")
                report.append(f" Max: {df[col].max():.2f}")
            report.append("")
        report_text = "\n".join(report)
        logging.info(report_text)
        # Save the report
        with open('scraping_report.txt', 'w', encoding='utf-8') as f:
            f.write(report_text)
        logging.info("Report saved to scraping_report.txt")


# Usage example
if __name__ == "__main__":
    # Configuration
    config = {
        'base_url': 'https://example.com',
        'headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
        },
        'timeout': 10,
        'max_retries': 3,
        'delay': 2
    }
    # Create a scraper instance (subclass EnterpriseScraper and implement parse and validate)
    # scraper = MyCustomScraper(config)
    # urls = ['https://example.com/page1', 'https://example.com/page2']
    # scraper.scrape(urls)
    # scraper.save_to_excel()
    # scraper.generate_report()
    logging.info("Framework loaded; subclass EnterpriseScraper to build a concrete scraper")
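Summary
Key Points
- Parser choice: prefer lxml in production; use html5lib for development and debugging
- Search methods: find/find_all suit complex conditions; select suits nested structures
- Performance: cache lookup results, narrow the search scope, and choose suitable selectors
- Error handling: implement retries, fallback strategies, and thorough logging
- Anti-scraping measures: randomize request headers, add delays, and use a proxy pool
Further Learning
- Deepen your HTML/CSS knowledge: learn more advanced selector syntax
- Learn XPath: use it with lxml to handle more complex document structures
- Learn Selenium: handle pages rendered dynamically by JavaScript
- Learn the Scrapy framework: build more capable scraping systems
- Improve data storage: learn databases such as MongoDB and Elasticsearch
- Distributed scraping: build distributed systems with Celery and Redis
Legal and Ethical Reminders
- Follow the site's robots.txt rules
- Respect the site's terms of use
- Keep the request rate reasonable to avoid putting load on the server
- Use the data only for lawful analysis
- Pay attention to data privacy and copyright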
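The robots.txt reminder can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses an invented rule set from a list of lines so it runs offline; in a real scraper you would instead call `set_url()` with the site's robots.txt address followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content (assumption); a real file is fetched from the site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch() reports whether the given user agent may request the URL.
allowed_public = parser.can_fetch("*", "https://example.com/products/1")
allowed_private = parser.can_fetch("*", "https://example.com/private/report")

print(allowed_public)
print(allowed_private)
```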
Appendix: Resources