Article Overview
In the data era, web crawlers are a key way to collect public web data. This article works through the Python ecosystem, starting with basic static-page scraping and moving on to dynamic pages, high-concurrency frameworks, and anti-bot countermeasures. It targets beginner to intermediate developers and includes complete, runnable code samples.
Whether you are doing data analysis, monitoring, or a research project, you should find a workable recipe here. All code was tested on Python 3.11+.
Tech stack keywords: httpx, Playwright, Scrapy, parsel, curl_cffi
1. Environment Setup and Core Dependencies
First, create a virtual environment and install the core libraries:
python -m venv spider_env
source spider_env/bin/activate # Windows: spider_env\Scripts\activate
pip install httpx parsel beautifulsoup4 lxml playwright scrapy scrapy-playwright curl-cffi
# Install browser binaries
playwright install chromium
playwright install firefox  # optional fallback
Why these libraries?
- httpx: a modern HTTP client with async and HTTP/2 support; a natural successor to requests.
- parsel: Scrapy's official selector library; lightweight, fast HTML parsing with CSS and XPath.
- Playwright: cross-browser automation with headless mode; the go-to choice for JS-rendered pages.
- Scrapy + scrapy-playwright: the standard stack for production-grade crawlers.
2. Static Pages: Getting Started with httpx + parsel
Static pages need no JS rendering and are the fastest to scrape. The following example fetches Douban's latest books:
# static_demo.py
import asyncio

import httpx
from parsel import Selector

async def crawl_douban_books():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    }
    async with httpx.AsyncClient(
        http2=True,
        headers=headers,
        timeout=10.0,
        limits=httpx.Limits(max_keepalive_connections=5)
    ) as client:
        resp = await client.get('https://book.douban.com/latest')
        resp.raise_for_status()
        selector = Selector(resp.text)
        books = selector.css('.grid-view li .title a::text').getall()
        print('Douban latest books (top 10):')
        for i, book in enumerate(books[:10], 1):
            print(f'{i:2d}. {book.strip()}')

if __name__ == '__main__':
    asyncio.run(crawl_douban_books())
Sample output:
Douban latest books (top 10):
 1. 《xxxxxx》
 2. 《xxxxxx》
...
Optimization notes:
- http2=True enables HTTP/2 multiplexing, which can speed up many requests sharing a single connection.
- parsel is built on lxml, is generally faster than BeautifulSoup, and lets you mix XPath and CSS selectors.
3. Dynamic JS Pages: Headless-Browser Scraping with Playwright
For React/Vue/Angular pages, use Playwright to drive a real browser:
# playwright_demo.py
import asyncio
import json

from playwright.async_api import async_playwright

async def crawl_js_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
            locale='zh-CN'
        )
        # Mask common automation fingerprints
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
        """)
        page = await context.new_page()
        await page.goto('https://book.douban.com/latest', wait_until='networkidle')
        # Wait for dynamically loaded elements
        await page.wait_for_selector('.grid-view li', timeout=10000)
        # Extract data
        books = await page.eval_on_selector_all('.title a', 'els => els.map(el => el.textContent.trim())')
        print(json.dumps(books[:5], ensure_ascii=False, indent=2))
        await browser.close()

if __name__ == '__main__':
    asyncio.run(crawl_js_page())
Key techniques:
- wait_until='networkidle': waits until the network goes quiet, so JS-driven content has a chance to finish loading.
- add_init_script: hides the webdriver property, which defeats only the simplest bot checks.
- Screenshots for debugging: await page.screenshot(path='debug.png').
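The new_context() options above can also be randomized per session so repeated crawls don't share one fingerprint. A small sketch; the UA pool and viewport list are illustrative, not exhaustive:

```python
import random

# Hypothetical pools; in practice keep them consistent with each other
# (a macOS UA should not ship with a Windows-typical viewport, etc.).
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
]
VIEWPORTS = [(1920, 1080), (1366, 768), (1536, 864)]

def random_context_options(rng=random):
    """Build kwargs for browser.new_context() with a randomized viewport/UA."""
    width, height = rng.choice(VIEWPORTS)
    return {
        'viewport': {'width': width, 'height': height},
        'user_agent': rng.choice(UA_POOL),
        'locale': 'zh-CN',
    }

print(random_context_options())
```

Usage: `context = await browser.new_context(**random_context_options())`.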
4. Production Crawlers: The Scrapy + Playwright Framework
Single scripts are fine for prototypes; for production, use Scrapy. Create a project:
scrapy startproject bookspider
cd bookspider
pip install scrapy-playwright
4.1 Core settings.py configuration
# settings.py
BOT_NAME = 'bookspider'
SPIDER_MODULES = ['bookspider.spiders']
NEWSPIDER_MODULE = 'bookspider.spiders'

# Crawl rate control
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
ROBOTSTXT_OBEY = False  # set True to honor robots.txt (see section 7)

# scrapy-playwright plugs in via download handlers (not a middleware)
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,
    'args': ['--no-sandbox', '--disable-blink-features=AutomationControlled']
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30000.0

# Output feeds
FEEDS = {
    'books.json': {'format': 'json'},
}
4.2 spiders/book_spider.py
# bookspider/spiders/book_spider.py
import time

import scrapy
from scrapy_playwright.page import PageMethod

class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book.douban.com/latest']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    'playwright': True,
                    'playwright_page_methods': [
                        PageMethod('wait_for_selector', '.grid-view li', timeout=10000)
                    ]
                }
            )

    def parse(self, response):
        for book in response.css('.grid-view li .title a::text').getall()[:20]:
            yield {
                'title': book.strip(),
                'url': response.url,
                'crawl_time': time.time()
            }
Run it with:
scrapy crawl book
(With FEEDS configured, this already writes books.json; you can also pass -o books.json explicitly.)
Sample output (books.json):
[
  {"title": "Title 1", "url": "...", "crawl_time": 1737740000.0},
  ...
]
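The spider yields raw dicts, and duplicate titles across pages are common. A sketch of the dedup logic for an item pipeline; the class name is hypothetical, and in a real project you would register it under ITEM_PIPELINES and raise scrapy.exceptions.DropItem rather than returning None:

```python
class DedupTitlesPipeline:
    """Drop items whose title has already been seen (standalone sketch)."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider=None):
        title = item.get('title')
        if title in self.seen:
            return None  # with Scrapy: raise DropItem(f'duplicate: {title}')
        self.seen.add(title)
        return item

pipeline = DedupTitlesPipeline()
print(pipeline.process_item({'title': 'Book One'}))  # passes through
print(pipeline.process_item({'title': 'Book One'}))  # duplicate, dropped
```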
5. Anti-Bot Strategies and Advanced Evasion
5.1 Common anti-bot checks and countermeasures
| Anti-bot check | Detection signal | Python countermeasure |
|---|---|---|
| UA validation | simple string matching | pool of real UAs + curl_cffi |
| JS fingerprinting | navigator object probes | Playwright stealth plugins |
| TLS/JA3 | HTTPS handshake fingerprint | curl_cffi (impersonate='chrome') |
| IP rate limits | many requests from one IP | proxy pool (residential IPs) + random delays |
| Behavioral analysis | no mouse movement/scrolling | Playwright mouse/keyboard simulation |
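The "random delays" countermeasure from the table can be as simple as jittering the pause between requests so the crawl loses its machine-like cadence. A sketch; the parameter defaults are illustrative:

```python
import random

def jittered_delay(base=1.0, jitter=0.5, rng=random):
    """Random delay in [base - jitter, base + jitter] seconds."""
    return base + rng.uniform(-jitter, jitter)

# In an async crawler, sleep this long before each request:
#     await asyncio.sleep(jittered_delay())
print(jittered_delay())
```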
5.2 TLS impersonation with curl_cffi
# tls_demo.py
from curl_cffi import requests

resp = requests.get(
    'https://httpbin.org/headers',  # echo endpoint for testing
    impersonate='chrome124',  # mimic Chrome 124's TLS/JA3 fingerprint
)
print(resp.json())
Install with: pip install curl_cffi
Note that impersonate also sends the matching browser headers, so you normally don't need to set a User-Agent yourself; for async code, curl_cffi provides an AsyncSession.
6. High Concurrency and Data Storage
- Async concurrency: httpx with asyncio.Semaphore for rate limiting.
- Distributed crawling: Scrapy with Redis/RabbitMQ queues.
- Storage: MongoDB/ClickHouse for structured data, or Parquet for large datasets.
Example of bounded concurrency:

import asyncio
import httpx

sem = asyncio.Semaphore(10)  # at most 10 requests in flight

async def bounded_fetch(client: httpx.AsyncClient, url: str) -> str:
    async with sem:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.text
7. Project Template and Caveats
GitHub template: search for "python-scrapy-playwright-template" and fork one to start from.
Legal compliance:
- Only scrape public data and respect robots.txt.
- Throttle your crawl rate; an aggressive crawler is indistinguishable from a DoS attack.
- For commercial use, review robots.txt and the site's terms of service.
Performance monitoring:
- Logs: logging plus an ELK stack.
- Metrics: Prometheus plus Grafana.
FAQ:
- Q: Playwright memory usage is high? A: Reuse a persistent context instead of launching fresh browsers.
- Q: IP getting banned? A: Build a proxy pool; retry failed requests and rotate proxies.
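The retry-and-rotate answer can be sketched as exponential backoff; proxy rotation is omitted here, and fetch is any callable you supply (e.g. wrapping httpx or curl_cffi) that raises on failure:

```python
import random
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    """Call fetch(url); on failure sleep backoff * 2**attempt plus jitter
    and retry, re-raising the error after the last attempt. In a real
    crawler, switch to the next proxy inside the except block."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt + random.uniform(0, backoff))
```

Doubling the wait on each failure gives a struggling target time to recover, and the jitter keeps many workers from retrying in lockstep.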
Conclusion
Python's scraping ecosystem is mature: you can start simply with httpx and scale all the way to production deployments with Scrapy + Playwright. Practice is what matters; begin with straightforward sites such as Douban or Zhihu, then work up to e-commerce and social platforms.