"Scrapy到底该怎么学?"今天,我将用这篇万字长文,带你从零开始掌握Scrapy框架的核心用法,并分享我在实际项目中的实战经验!建议收藏⭐!
1. Introduction to Scrapy: Why Choose It?
1.1 Scrapy vs Requests+BeautifulSoup
A question many beginners ask: "I already know Requests + BeautifulSoup, why should I learn Scrapy?"
| Aspect | Requests + BS4 | Scrapy |
| --- | --- | --- |
| Performance | Synchronous requests, slow | Asynchronous I/O, high throughput |
| Extensibility | Everything implemented by hand | Built-in middleware and pipeline system |
| Feature completeness | Basic fetching and parsing only | Built-in deduplication, request scheduling, and error handling |
| Best suited for | Small-scale data collection | Production-grade crawling projects |
👉 Bottom line: for small projects, Requests is enough; for production-grade crawlers, Scrapy is the better choice.
1.2 Scrapy Core Architecture
(Scrapy architecture at a glance, best read alongside the official diagram: the Spider yields Requests, the Engine hands them to the Scheduler for queuing, the Downloader fetches each one and returns a Response, and the Items the Spider extracts flow into the Item Pipeline.)
2. Hands-On: Building Your First Scrapy Spider
2.1 Environment Setup
```bash
# A virtual environment is recommended
python -m venv scrapy_env
source scrapy_env/bin/activate   # Linux/macOS
scrapy_env\Scripts\activate      # Windows
pip install scrapy
```
2.2 Create the Project
```bash
scrapy startproject book_crawler
cd book_crawler
scrapy genspider books books.toscrape.com
```
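These commands generate a scrapy.cfg at the top level plus a book_crawler/ package containing items.py, middlewares.py, pipelines.py, settings.py, and a spiders/ directory where the generated books.py lives; that spider file is what we edit next.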
2.3 Write the Spider
```python
# spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        urls = ['http://books.toscrape.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract the book information from each product card
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                'rating': book.css('p.star-rating::attr(class)').get().split()[-1],
            }
        # Pagination: follow the next-page link if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
2.4 Run the Spider
```bash
scrapy crawl books -o books.csv
```
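Scrapy picks the export format from the file extension, so books.json or books.jsonl work just as well; since Scrapy 2.1 there is also a capital -O flag that overwrites the output file instead of appending to it.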
3. Advanced Scrapy Techniques (for Production Use)
3.1 Beating Anti-Bot Measures: Random User-Agent + Proxy IPs
```python
# middlewares.py
import random

from fake_useragent import UserAgent


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Give every outgoing request a freshly randomized User-Agent
        request.headers['User-Agent'] = UserAgent().random


class ProxyMiddleware:
    PROXY_LIST = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ]

    def process_request(self, request, spider):
        # Route the request through a randomly chosen proxy
        proxy = random.choice(self.PROXY_LIST)
        request.meta['proxy'] = proxy
```
Enable them in settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    'book_crawler.middlewares.RandomUserAgentMiddleware': 543,
    'book_crawler.middlewares.ProxyMiddleware': 544,
}
```
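Note that RandomUserAgentMiddleware relies on the third-party fake-useragent package (pip install fake-useragent), and the two proxy URLs above are placeholders you would replace with your own proxy pool.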
3.2 Storing Data: MySQL + a Pipeline
```python
# pipelines.py
import pymysql


class MySQLPipeline:
    def __init__(self):
        # Open one connection for the lifetime of the spider
        self.conn = pymysql.connect(
            host='localhost',
            user='root',
            password='123456',
            db='scrapy_data',
            charset='utf8mb4'
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert each scraped item as a row in the books table
        sql = """
            INSERT INTO books(title, price, rating)
            VALUES (%s, %s, %s)
        """
        self.cursor.execute(sql, (
            item['title'],
            item['price'],
            item['rating'],
        ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```
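The pipeline only runs once it is registered in settings.py; the priority value (300 here) is just a conventional midpoint:

```python
# settings.py
ITEM_PIPELINES = {
    'book_crawler.pipelines.MySQLPipeline': 300,
}
```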
4. Frequently Asked Questions
Q1: How do I crawl JavaScript-rendered pages?
Option 1: Scrapy + Splash
```python
# Run Splash first: docker run -p 8050:8050 scrapinghub/splash
yield scrapy.Request(
    url,
    self.parse,
    meta={'splash': {'args': {'wait': 2.5}}}
)
```
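For the splash meta key to take effect, the scrapy-splash package also has to be wired into settings.py. A minimal sketch, with the middleware names and priority numbers taken from the scrapy-splash README:

```python
# settings.py -- minimal scrapy-splash configuration
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```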
Option 2: Scrapy + Playwright (recommended)
```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
```
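Two more pieces are needed per the scrapy-playwright docs: Scrapy must run on the asyncio Twisted reactor, and only requests that opt in via the playwright meta key are actually rendered by the browser:

```python
# settings.py -- scrapy-playwright requires the asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

In the spider, a rendered request then looks like yield scrapy.Request(url, meta={"playwright": True}).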
Q2: How do I build a distributed crawler?
Use scrapy-redis:
```python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@localhost:6379/0'
```
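With these settings every worker process shares one Redis-backed request queue and dedupe filter. A common companion pattern is a spider that pulls its start URLs from Redis; here is a minimal sketch, assuming you stick with the books.toscrape.com example (names are illustrative):

```python
# spiders/books_redis.py -- minimal scrapy-redis spider sketch
from scrapy_redis.spiders import RedisSpider


class BooksRedisSpider(RedisSpider):
    name = "books_redis"
    # Each worker blocks on this Redis list; seed it with e.g.
    #   redis-cli lpush books_redis:start_urls http://books.toscrape.com/
    redis_key = "books_redis:start_urls"

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {'title': book.css('h3 a::attr(title)').get()}
```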
5. Performance Tuning Tips
- Concurrency control:

```python
# settings.py
CONCURRENT_REQUESTS = 32   # default is 16
DOWNLOAD_DELAY = 0.25      # small delay to avoid getting banned
```

- Request caching:

```python
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400   # cache responses for one day
```

- Auto-throttling:

```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
```