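Abstract: This article is a hands-on guide to the Scrapling web scraping library, covering the path from environment setup to building an e-commerce price monitoring system. It first introduces Scrapling's installation and core concepts (simple, smart, asynchronous), then uses code examples to walk through sending requests, extracting data with CSS/XPath selectors, rendering dynamic pages, countering anti-scraping measures, cleaning data and exporting it in structured form, handling exceptions, and fetching pages concurrently with asyncio. Finally, it combines these techniques into an e-commerce price monitoring system with scraper, parser, storage, and notification modules, plus log output formatted with the Rich library, so that readers can get started with Scrapling quickly and apply it to real projects.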
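① Environment Setup: Get Started with Scrapling in One Minute
Scrapling is a lightweight, high-performance Python web scraping library known for its concise API and smart selectors. Compared with the heavier Scrapy framework and the slower BeautifulSoup parser, Scrapling keeps code short while providing native async support and a jQuery-like selector syntax.
1.1 Requirements
- Python 3.8 or later
- The pip package manager
- Operating system: Windows, macOS, or Linux
1.2 Quick Install
Open a terminal and run the following command:
bash
pip install scrapling
If you need dynamic page rendering, also install Playwright (the quotes keep shells such as zsh from expanding the brackets):
bash
pip install "scrapling[playwright]"
playwright install
1.3 Verifying the Installation
After installing, run the following in a Python interactive session to confirm that the package imports correctly:
python
import scrapling
print(scrapling.__version__)
If a version number is printed (for example 0.1.0), the installation succeeded.
1.4 First Scraper: Fetching a Page in 3 Lines
python
from scrapling import Fetcher
fetcher = Fetcher()
page = fetcher.get("https://example.com")
print(page.text[:200])  # print the first 200 characters of the page
That is all it takes: three lines (import, create, fetch) complete your first page fetch, and the fourth only prints the result. Fetcher is Scrapling's core request class; it handles connection pooling, request headers, and encoding automatically.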
② Core Concepts: Scrapling's Architecture and Selector Philosophy
The architecture diagram below shows how Scrapling's four core components work together:
[Diagram: Scrapling's four core components. Nodes: Fetcher (request manager), Page (response page), Navigator (smart selector engine), Adaptor (dynamic content adapter), structured data, network request analysis. Edge labels: send HTTP request, parse HTML, dynamic rendering, CSS / XPath query, Playwright rendering, intercept AJAX.]
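To use Scrapling well, you first need to understand its core design ideas. Its design philosophy can be summed up in three words: simple, smart, asynchronous.
2.1 Core Component Architecture
Scrapling's architecture consists of the following core components:
| Component | Role | Analogy |
|---|---|---|
| Fetcher | Sends HTTP requests and manages sessions | The browser's address bar |
| Page | Stores the response content and provides parsing methods | The page rendered by the browser |
| Navigator | Smart selector engine with CSS/XPath support | The browser's developer tools |
| Adaptor | Adapter for rendering dynamic content | The browser's JavaScript engine |
2.2 Selector Philosophy: Working with HTML the jQuery Way
Scrapling's biggest highlight is its selector design. If you have used jQuery, it will feel very familiar:
python
# Traditional approach (BeautifulSoup)
soup.find("div", class_="content").find_all("a")
# Scrapling approach (jQuery-like)
page.css("div.content a")
Scrapling's selector engine, Navigator, supports:
- CSS selectors: page.css("div.product-card > h2.title")
- XPath selectors: page.xpath("//div[@class='product']//h2")
- Smart matching: tolerates malformed HTML automatically
- Chained calls: .css().xpath().re() can be applied in sequence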
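2.3 Native Async Support
Scrapling supports asynchronous operation natively, with no extra configuration:
python
import asyncio
from scrapling import AsyncFetcher
async def fetch_page():
    fetcher = AsyncFetcher()
    page = await fetcher.get("https://example.com")
    return page
result = asyncio.run(fetch_page())
③ Basics in Practice: Sending Requests and Parsing Your First Page
The flowchart below shows the life cycle of a complete scraping request: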
[Flowchart: Start → create a Fetcher instance → send a GET request → status code 200? → yes: parse the HTML page → extract data with CSS/XPath → output structured results → End; no: handle the error / retry.]
With the core concepts in place, let's write a first complete scraper that fetches a real web page.
3.1 Sending a GET Request
python
from scrapling import Fetcher
# Create a fetcher instance (sessions are managed automatically)
fetcher = Fetcher()
# Send a GET request
response = fetcher.get("https://httpbin.org/get")
# Inspect the response
print(f"Status code: {response.status}")
print(f"Headers: {dict(response.headers)}")
print(f"Body: {response.text[:300]}")
3.2 Requests with Parameters
python
# Option 1: put the parameters directly in the URL
response = fetcher.get("https://httpbin.org/get?name=scrapling&version=1.0")
# Option 2: pass a params dict (recommended)
params = {"name": "scrapling", "version": "1.0"}
response = fetcher.get("https://httpbin.org/get", params=params)
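The reason a params dict is preferable to hand-built URLs is escaping: spaces, ampersands, and non-ASCII text must be percent-encoded, and string concatenation does not do that for you. The sketch below illustrates the idea with the standard library only; build_url is a hypothetical helper written for this illustration, not part of Scrapling.

```python
from urllib.parse import urlencode, urlsplit

def build_url(base, params):
    """Append URL-encoded query parameters to a base URL (illustrative helper)."""
    query = urlencode(params)
    if not query:
        return base
    # Use "&" when the base URL already carries a query string, "?" otherwise.
    separator = "&" if urlsplit(base).query else "?"
    return f"{base}{separator}{query}"

url = build_url("https://httpbin.org/get", {"name": "scrapling", "version": "1.0"})
print(url)
# Characters with special meaning in URLs are escaped automatically.
print(build_url("https://httpbin.org/get", {"q": "a b&c"}))
```

For simple values both styles produce the same URL; the difference shows up as soon as a value contains a space or an ampersand.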
3.3 Parsing an HTML Page
python
# Fetch a real page (a quotes demo site)
page = fetcher.get("https://quotes.toscrape.com/")
# Extract every quote
quotes = page.css("div.quote")
for quote in quotes:
    text = quote.css("span.text::text").get()
    author = quote.css("small.author::text").get()
    print(f"{text} -- {author}")
3.4 Handling Response Encoding
Scrapling detects the encoding automatically, but if the text comes out garbled you can set it by hand:
python
page = fetcher.get("https://example.com")
page.encoding = "utf-8"  # set the encoding manually
print(page.text)
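Garbled text usually means the bytes were decoded with the wrong codec. When you have access to the raw bytes, one common approach is to try the declared encoding first and then fall back through a short list of likely candidates. The sketch below uses only the standard library; decode_body and its fallback order are assumptions made for this illustration, not Scrapling behaviour.

```python
def decode_body(raw, declared=None, fallbacks=("utf-8", "gb18030")):
    """Decode raw bytes, returning (text, encoding_used).

    Tries the declared encoding first, then each fallback in order.
    Unknown codec names (LookupError) are skipped like failed decodes.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep going, but mark undecodable bytes visibly.
    return raw.decode("utf-8", errors="replace"), "utf-8"

gbk_bytes = "价格".encode("gbk")
text, used = decode_body(gbk_bytes)
print(text, used)
```

Order matters: strict codecs such as UTF-8 should come before permissive ones, because a permissive codec will happily "succeed" on bytes it does not actually describe.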
④ Smart Extraction: Advanced CSS Selectors and XPath
The selector engine is Scrapling's main strength. This section covers advanced selector usage.
4.1 Advanced CSS Selectors
python
# Attribute selectors
page.css("a[href^='https']")         # links whose href starts with https
page.css("img[alt$='logo']")         # images whose alt ends with logo
page.css("div[data-id*='product']")  # divs whose data-id contains product
# Pseudo-class selectors
page.css("li:first-child")    # li elements that are the first child of their parent
page.css("tr:nth-child(2n)")  # even-numbered rows
page.css("div:not(.hidden)")  # divs that do not have the hidden class
# Text and attribute extraction
page.css("h1::text")       # text content
page.css("a::attr(href)")  # attribute value
page.css("div::html")      # inner HTML
4.2 XPath Selectors in Practice
python
# Basic path query
page.xpath("//div[@class='content']//h2")
# Conditional filtering
page.xpath("//a[contains(@href, 'download')]")
page.xpath("//div[position() < 3]")  # divs that are the first or second div under their parent
# Text matching
page.xpath("//p[text()='Hello World']")
page.xpath("//span[starts-with(text(), 'Price:')]")
4.3 Fault-Tolerant Matching
Scrapling's Navigator repairs non-standard HTML automatically:
python
# Extraction still works even when tags are left unclosed
html = "<div><p>Hello<span>World</div>"
page = fetcher.from_html(html)
result = page.css("div p span::text").get()
print(result)  # Output: World
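To see why tolerant parsing is possible at all, it helps to look at a lenient tokenizer. The sketch below feeds the same malformed snippet to Python's built-in html.parser, which reports tags and text as it meets them and never raises on a missing close tag. This is not Scrapling's engine, and SpanText is a hypothetical class written for this illustration; unlike a real HTML parser it does not rebuild a repaired tree, it only tracks whether a span is currently open.

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collect text that appears while at least one <span> is open."""

    def __init__(self):
        super().__init__()
        self.open_spans = 0  # number of <span> tags opened and not yet closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_spans += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans -= 1

    def handle_data(self, data):
        if self.open_spans:
            self.chunks.append(data)

parser = SpanText()
parser.feed("<div><p>Hello<span>World</div>")  # <p> and <span> are never closed
parser.close()
span_text = "".join(parser.chunks)
print(span_text)
```

A full tree-building parser goes one step further and decides where the missing close tags belong, which is what makes selectors such as div p span usable on broken markup.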
4.4 Chained Calls and Data Extraction
python
# A more complex case: extracting a product list
page = fetcher.get("https://books.toscrape.com/")
books = page.css("article.product_pod")
for book in books:
    title = book.css("h3 a::attr(title)").get()
    price = book.css("p.price_color::text").get()
    rating = book.css("p.star-rating::attr(class)").get()
    print(f"{title} | {price} | Rating: {rating}")
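Note that the rating extracted above is a class attribute, not a number: on books.toscrape.com it looks like "star-rating Three". A small mapping turns it into something you can sort or average. rating_from_class is a hypothetical helper written for this illustration.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_from_class(class_attr):
    """Map a class string such as 'star-rating Three' to an int, or None if absent."""
    for token in (class_attr or "").split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None

print(rating_from_class("star-rating Three"))
```

Returning None instead of raising keeps one oddly formatted product from aborting the whole loop.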
⑤ Dynamic Content: Handling JavaScript-Rendered Pages
The sequence diagram below shows how Scrapling processes a dynamic page from start to finish:
[Sequence diagram. Participants: user code, Adaptor, Playwright browser, target website. Messages in order: adaptor.get(url); start or reuse a browser instance; send HTTP request; return HTML + JS; execute JavaScript rendering; wait for elements to load (wait_for); return the rendered Page; return a parseable page; page.css() extracts data. Optional block for simulated user interaction: page.click() / page.fill(); perform the interaction; return the updated page.]
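Modern websites load much of their content dynamically with JavaScript, and a plain HTTP request cannot retrieve that data. Scrapling addresses this with its Adaptor component.
5.1 Installing the Browser Driver
bash
pip install "scrapling[playwright]"
playwright install chromium
5.2 Rendering a Dynamic Page with Adaptor
python
from scrapling import Adaptor
# Create the adapter (the browser instance is managed automatically)
adaptor = Adaptor()
# Render the dynamic page
page = adaptor.get("https://quotes.toscrape.com/js/")
# Wait until a specific element has loaded
page.wait_for("div.quote", timeout=10)
# Extract the dynamically loaded content
quotes = page.css("div.quote")
print(f"Found {len(quotes)} quotes")
5.3 Simulating User Interaction
python
# Click a button to load more content
page.click("button#load-more")
page.wait_for("div.quote:nth-child(11)", timeout=5)
# Fill in the search box and submit
page.fill("input#search", "Python")
page.click("button#search-btn")
page.wait_for("div.result", timeout=5)
# Scroll to the bottom of the page (triggers lazy loading)
page.scroll_to_bottom()
page.wait(2)  # wait 2 seconds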
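Waiting for a condition is generally more reliable than a fixed pause such as the 2-second wait above, because it returns as soon as the content is ready and fails explicitly when it never arrives. Conceptually, a wait-for call is a polling loop with a deadline. The sketch below shows that loop in plain Python; wait_until is a hypothetical helper for illustration and says nothing about how Scrapling or Playwright implement waiting internally.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.25,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or the timeout elapses.

    Returns True on success and False on timeout. clock and sleep are
    injectable so the loop can be tested without real delays.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Usage: the condition becomes true on the third check.
checks = {"count": 0}

def content_ready():
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_until(content_ready, timeout=5, interval=0.01))
```

The predicate is always checked at least once, so a zero timeout still succeeds when the condition already holds.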
5.4 Intercepting Network Requests
python
# Intercept and inspect AJAX requests
def on_request(request):
    if "api" in request.url:
        print(f"Intercepted API request: {request.url}")
adaptor.on("request", on_request)
page = adaptor.get("https://example.com")
⑥ Countering Anti-Scraping Measures: Request Disguise and Proxy Rotation
In real-world scraping, anti-bot mechanisms are a challenge you have to deal with. Scrapling provides several tools for this.
6.1 Disguising Request Headers
python
from scrapling import Fetcher
# Custom headers that mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "https://www.google.com/",
}
fetcher = Fetcher(headers=headers)
page = fetcher.get("https://httpbin.org/headers")
6.2 Proxy Rotation
python
import random
# A pool of proxies to choose from
proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
# Placeholder targets; replace with the pages you want to fetch
target_urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in target_urls:
    proxy = random.choice(proxies)
    page = fetcher.get(url, proxies={"http": proxy, "https": proxy})
    print(f"Fetched {url} via proxy {proxy}")
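Random choice is simple, but it can pick the same proxy several times in a row and keeps using proxies that have stopped working. A common alternative is round-robin rotation that skips proxies after repeated failures. The sketch below is one way to do that with the standard library; ProxyRotator and its max_failures threshold are assumptions made for this illustration.

```python
from itertools import cycle

class ProxyRotator:
    """Hand out proxies in round-robin order, skipping ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        unique = list(dict.fromkeys(proxies))  # drop duplicates, keep order
        if not unique:
            raise ValueError("at least one proxy is required")
        self._failures = {proxy: 0 for proxy in unique}
        self._max_failures = max_failures
        self._cycle = cycle(unique)

    def next_proxy(self):
        # One full lap over the pool is enough to find a healthy proxy.
        for _ in range(len(self._failures)):
            proxy = next(self._cycle)
            if self._failures[proxy] < self._max_failures:
                return proxy
        raise RuntimeError("all proxies have exceeded the failure limit")

    def report_failure(self, proxy):
        self._failures[proxy] += 1

rotator = ProxyRotator([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])
print(rotator.next_proxy())
print(rotator.next_proxy())
```

In a fetch loop you would call next_proxy() before each request and report_failure() whenever the request through that proxy raises.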
6.3 Throttling Request Frequency
python
import time
from scrapling import Fetcher
fetcher = Fetcher()
urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    page = fetcher.get(url)
    print(f"Fetched: {url}")
    time.sleep(2)  # pause 2 seconds between requests
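A perfectly regular 2-second interval is itself an easy pattern to spot, so the fixed pause is often replaced with a randomized one. The sketch below draws each delay from a range around a base value; jittered_delay and its default parameters are assumptions made for this illustration.

```python
import random

def jittered_delay(base=2.0, jitter=0.5, rng=random):
    """Return a delay drawn uniformly from [base*(1-jitter), base*(1+jitter)].

    rng can be a seeded random.Random instance for reproducible runs.
    """
    if not 0 <= jitter < 1:
        raise ValueError("jitter must be in the range [0, 1)")
    return base * rng.uniform(1 - jitter, 1 + jitter)

# With the defaults, every delay falls between 1.0 and 3.0 seconds.
print(round(jittered_delay(), 2))
```

In the loop above, time.sleep(jittered_delay()) would take the place of time.sleep(2).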
6.4 Cookie and Session Management
python
# Sessions are managed automatically
fetcher = Fetcher()
# Log in first
login_data = {"username": "test", "password": "123456"}
fetcher.post("https://example.com/login", data=login_data)
# Later requests carry the login state automatically
page = fetcher.get("https://example.com/dashboard")
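⑦ Data Cleaning: From Raw HTML to Structured Data
Raw scraped data usually contains a lot of noise and needs to be cleaned and structured.
7.1 Text Cleaning
python
import re
def clean_text(text):
    if not text:
        return ""
    # Remove HTML tags first, so the gaps they leave are collapsed below
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Remove special characters (keeps word characters, CJK, and basic punctuation)
    text = re.sub(r'[^\w\s\u4e00-\u9fff.,!?;:()]', '', text)
    return text.strip()
raw_text = page.css("div.content::text").get()
cleaned = clean_text(raw_text)
print(cleaned)
7.2 Data Formatting
python
import re
from datetime import datetime
# Price formatting
def parse_price(price_str):
    # Strip currency symbols, thousands separators, and whitespace
    price = re.sub(r'[^\d.]', '', price_str or '')
    return float(price) if price else None
# Date formatting
def parse_date(date_str):
    formats = ["%Y-%m-%d", "%Y/%m/%d", "%B %d, %Y"]
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return None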
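Stripping every non-digit character works for clean inputs, but it merges unrelated numbers ("12.50 (was 15.00)" becomes "12.5015.00") and float arithmetic can introduce rounding noise in money values. A stricter variant matches the first number explicitly and returns a Decimal. The sketch below assumes the comma is a thousands separator and the dot is the decimal point; parse_price_safe is a hypothetical name used for this illustration.

```python
import re
from decimal import Decimal

# First run of digits, optionally with comma separators and a decimal part.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price_safe(price_str):
    """Return the first price found in price_str as a Decimal, or None."""
    if not price_str:
        return None
    match = _PRICE_RE.search(price_str)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))

print(parse_price_safe("¥1,299.00"))
print(parse_price_safe("Sold out"))
```

Sites that use the comma as the decimal mark (for example "1.299,00") would need a locale-specific rule instead.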
7.3 Exporting Structured Data
python
import csv
import json
# Save as CSV
def save_to_csv(data, filename):
    if not data:
        return  # nothing to write; data[0] would raise IndexError
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
# Save as JSON
def save_to_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
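Taking the column names from the first record works only when every record has the same keys; csv.DictWriter raises ValueError when a later row contains a key that is not in fieldnames. Scraped records are often uneven (a missing rating here, an extra stock field there), so one option is to build the header from the union of all keys. The sketch below writes to any text stream; rows_to_csv is a hypothetical helper for this illustration.

```python
import csv
import io

def rows_to_csv(rows, stream):
    """Write dict rows to stream, using the union of all keys as the header.

    Columns keep first-seen order; missing values are written as empty cells.
    """
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(stream, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames

buffer = io.StringIO()
rows_to_csv(
    [{"name": "A", "price": 1.5}, {"name": "B", "stock": "yes"}],
    buffer,
)
print(buffer.getvalue())
```

To write to disk, pass a file opened with newline='' and encoding='utf-8', as in the save_to_csv function above.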
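⑧ Exception Handling: Common Errors and Debugging Tips
Exception handling is essential in scraper development. This section summarizes common Scrapling errors and how to deal with them.
8.1 Network Request Errors
python
from scrapling import Fetcher
from scrapling.exceptions import RequestError, TimeoutError
fetcher = Fetcher()
try:
    page = fetcher.get("https://example.com", timeout=10)
except TimeoutError:
    print("Request timed out; check the network or increase the timeout")
except RequestError as e:
    print(f"Request failed: {e.status_code} - {e.reason}")
except Exception as e:
    print(f"Unexpected error: {e}")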
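Printing the error tells you what went wrong, but transient failures such as timeouts are usually worth retrying a few times with growing pauses. The sketch below is a generic retry helper with exponential backoff, written against plain callables so it does not depend on any particular fetch library; retry, its parameters, and the doubling schedule are assumptions made for this illustration.

```python
import time

def retry(func, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the pause after each failure.

    Only exceptions listed in retry_on trigger a retry; the last failure
    is re-raised so the caller still sees the original error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...

# Usage with a stand-in for a flaky request: fails twice, then succeeds.
state = {"calls": 0}

def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(retry(flaky_request, base_delay=0.01))
```

Passing a narrow retry_on tuple (for example only timeout and connection errors) avoids retrying failures that will never succeed, such as a 404.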
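8.2 Selector Match Failures
python
# Safe extraction: use get() instead of indexing directly
title = page.css("h1.title::text").get()
if title:
    print(f"Title: {title}")
else:
    print("Title element not found")
# Fall back to a default value
price = page.css("span.price::text").get(default="price unknown")
8.3 Debugging Tips
python
# Print the page HTML for inspection
print(page.html[:1000])
# Save the page locally to view it in a browser
with open("debug.html", "w", encoding="utf-8") as f:
    f.write(page.html)
# Verify selectors with the browser's developer tools
# In the browser console, try: document.querySelectorAll("div.quote")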
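⑨ Performance Boost: Async Concurrency and Coroutine-Based Fetching
The comparison diagram below shows the core difference between synchronous and asynchronous fetching: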
[Diagram: synchronous vs asynchronous fetching. Synchronous mode (serial): requests 1 to 4 run one after another, each taking 2 s, for a total of 8 s. Asynchronous mode (concurrent): requests 1 to 4 run at the same time and their results are collected once all have finished, for a total of 2 s.]
When you need to fetch many pages, most of the time is spent waiting on the network, so running requests concurrently is the main way to cut the total time.
9.1 Async Basics
python
import asyncio
from scrapling import AsyncFetcher
async def fetch_single(url):
    fetcher = AsyncFetcher()
    page = await fetcher.get(url)
    return page
# Run the async task
result = asyncio.run(fetch_single("https://example.com"))
9.2 Fetching Multiple Pages Concurrently
python
async def fetch_all(urls):
    fetcher = AsyncFetcher()
    tasks = [fetcher.get(url) for url in urls]
    pages = await asyncio.gather(*tasks)
    return pages
# Fetch 10 pages at the same time
urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
pages = asyncio.run(fetch_all(urls))
print(f"Fetched {len(pages)} pages")
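One caveat with asyncio.gather as used above: by default the first exception propagates to the caller and the results of the other requests are lost. Passing return_exceptions=True makes gather return exceptions as ordinary values, so successes and failures can be separated afterwards. The sketch below demonstrates this with a simulated fetch (fake_fetch stands in for a real network call and is an assumption made for this illustration).

```python
import asyncio

async def fake_fetch(url):
    """Stand-in for a network request: fails for URLs containing 'bad'."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"<html>{url}</html>"

async def fetch_all_safe(urls, fetch=fake_fetch):
    """Fetch all URLs concurrently; return (pages, errors) keyed by URL."""
    results = await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
    pages, errors = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            errors[url] = result
        else:
            pages[url] = result
    return pages, errors

if __name__ == "__main__":
    ok, failed = asyncio.run(fetch_all_safe(["https://a.test/1", "https://a.test/bad"]))
    print(f"{len(ok)} succeeded, {len(failed)} failed")
```

gather preserves input order, which is why zipping the results with the URL list is safe.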
9.3 Limiting Concurrency with a Semaphore
python
async def fetch_with_limit(urls, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)
    fetcher = AsyncFetcher()
    async def fetch_one(url):
        # At most max_concurrent coroutines can be inside this block at once
        async with semaphore:
            return await fetcher.get(url)
    tasks = [fetch_one(url) for url in urls]
    return await asyncio.gather(*tasks)
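To check that the semaphore really caps the number of in-flight requests, you can run the same pattern against a simulated request and record the peak concurrency. The sketch below replaces the network call with asyncio.sleep; crawl and its latency parameter are assumptions made for this illustration.

```python
import asyncio

async def crawl(urls, max_concurrent=5, latency=0.01):
    """Run simulated requests under a semaphore; return (results, peak_in_flight)."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fetch_one(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(latency)  # stands in for network I/O
            in_flight -= 1
            return url

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return results, peak

if __name__ == "__main__":
    done, peak = asyncio.run(crawl([f"page-{i}" for i in range(20)], max_concurrent=5))
    print(f"{len(done)} requests finished, peak concurrency {peak}")
```

All 20 coroutines are created immediately, but only max_concurrent of them can be past the semaphore at any moment; the rest wait their turn.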
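9.4 Performance Comparison
python
import asyncio
import time
from scrapling import Fetcher
# Reuses urls and fetch_all from section 9.2
# Synchronous version
start = time.perf_counter()
fetcher = Fetcher()
for url in urls[:10]:
    fetcher.get(url)
print(f"Sync elapsed: {time.perf_counter() - start:.2f}s")
# Asynchronous version
start = time.perf_counter()
asyncio.run(fetch_all(urls[:10]))
print(f"Async elapsed: {time.perf_counter() - start:.2f}s")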
⑩ Putting It All Together: Building an E-commerce Price Monitoring System
Now let's combine everything covered so far into a complete e-commerce price monitoring system.
10.1 Project Structure
price_monitor/
├── main.py # main entry point
├── scraper.py # scraper module
├── parser.py # data parsing module
├── storage.py # data storage module
├── notifier.py # notification module
└── config.py # configuration
10.2 Scraper Module
python
# scraper.py
import time
from scrapling import Fetcher, Adaptor
class ProductScraper:
    def __init__(self):
        self.fetcher = Fetcher()
        self.adaptor = Adaptor()
    def fetch_product(self, url):
        """Fetch a single product page."""
        try:
            page = self.adaptor.get(url, timeout=15)
            page.wait_for("div.product-info", timeout=10)
            return page
        except Exception as e:
            print(f"Fetch failed: {url} - {e}")
            return None
    def fetch_batch(self, urls, delay=2):
        """Fetch several product pages, pausing between requests."""
        products = []
        for url in urls:
            page = self.fetch_product(url)
            if page:
                products.append(page)
            time.sleep(delay)
        return products
10.3 Data Parsing Module
python
# parser.py
import time
class ProductParser:
    def parse(self, page):
        """Parse a product page into a dict."""
        product = {
            "name": page.css("h1.product-name::text").get(),
            "price": self._parse_price(page),
            "rating": page.css("span.rating::attr(data-score)").get(),
            "reviews": page.css("span.review-count::text").get(),
            "stock": page.css("span.stock-status::text").get(),
            "url": page.url,
            "timestamp": time.time()
        }
        return product
    def _parse_price(self, page):
        price_str = page.css("span.price::text").get()
        if price_str:
            return float(price_str.replace("¥", "").replace(",", ""))
        return None
10.4 Price Monitor Main Program
python
# main.py
import time
from scraper import ProductScraper
from parser import ProductParser
from storage import DataStorage
from notifier import PriceNotifier
class PriceMonitor:
    def __init__(self, config):
        self.scraper = ProductScraper()
        self.parser = ProductParser()
        self.storage = DataStorage(config["db_path"])
        self.notifier = PriceNotifier(config["email"])
        self.target_price = config["target_price"]
    def check_price(self, url):
        """Check the price of a single product."""
        page = self.scraper.fetch_product(url)
        if not page:
            return
        product = self.parser.parse(page)
        self.storage.save(product)
        price = product["price"]
        if price is None:
            print(f"Could not parse a price for {url}")
        elif price <= self.target_price:
            self.notifier.send_alert(product)
            print(f"🎉 Price drop alert! {product['name']} now costs ¥{price}")
        else:
            print(f"Current price: ¥{price}")
    def run(self, urls, interval=3600):
        """Start the monitoring loop."""
        print(f"Starting price monitor, check interval: {interval} s")
        while True:
            for url in urls:
                self.check_price(url)
            time.sleep(interval)
if __name__ == "__main__":
    config = {
        "target_price": 299,
        "db_path": "prices.db",
        "email": "user@example.com"
    }
    monitor = PriceMonitor(config)
    monitor.run([
        "https://example.com/product/1",
        "https://example.com/product/2"
    ])
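main.py imports DataStorage from storage.py, but that module is not shown in this article. The sketch below is one possible minimal version built on the standard sqlite3 module. The table name, the column set, and the lowest_price helper are all assumptions made for illustration; only the DataStorage(db_path) constructor and the save(product) call come from main.py.

```python
# storage.py (illustrative sketch, not the article's original module)
import sqlite3
import time

class DataStorage:
    """Persist product snapshots in a SQLite table named prices (assumed schema)."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS prices (
                   url TEXT NOT NULL,
                   name TEXT,
                   price REAL,
                   timestamp REAL NOT NULL
               )"""
        )
        self.conn.commit()

    def save(self, product):
        """Insert one snapshot; the connection context manager commits it."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO prices (url, name, price, timestamp) VALUES (?, ?, ?, ?)",
                (
                    product["url"],
                    product.get("name"),
                    product.get("price"),
                    product.get("timestamp", time.time()),
                ),
            )

    def lowest_price(self, url):
        """Return the lowest recorded price for url, or None if there is none."""
        row = self.conn.execute(
            "SELECT MIN(price) FROM prices WHERE url = ?", (url,)
        ).fetchone()
        return row[0]
```

Keeping every snapshot rather than only the latest price makes it possible to compare against the historical low later, for example to alert only on a new minimum.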
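10.5 Running and Extending
bash
# Install dependencies
pip install "scrapling[playwright]"
playwright install chromium
# Run the monitor
python main.py
10.5.1 Sample Output
Once the price monitor is running, the terminal shows a live log similar to the following, covering the full fetch, parse, store, and alert flow:
text
$ python main.py
[2026-06-02 08:00:01] 🚀 Starting price monitor, check interval: 3600 s
[2026-06-02 08:00:01] 📡 Fetching: https://example.com/product/1
[2026-06-02 08:00:03] ✅ Page loaded (status code: 200)
[2026-06-02 08:00:03] 🔍 Parsing product info...
[2026-06-02 08:00:03] 💾 Saved to database: prices.db
[2026-06-02 08:00:03] 📊 Current price: ¥359.00 (target: ¥299.00)
[2026-06-02 08:00:03] ⏳ Waiting 2 s...
[2026-06-02 08:00:05] 📡 Fetching: https://example.com/product/2
[2026-06-02 08:00:07] ✅ Page loaded (status code: 200)
[2026-06-02 08:00:07] 🔍 Parsing product info...
[2026-06-02 08:00:07] 💾 Saved to database: prices.db
[2026-06-02 08:00:07] 📊 Current price: ¥285.00 (target: ¥299.00)
[2026-06-02 08:00:07] 🎉 Price drop alert! Wireless Bluetooth Earbuds Pro now costs ¥285.00
[2026-06-02 08:00:07] 📧 Email notification sent to user@example.com
[2026-06-02 08:00:07] 💤 Sleeping 3600 s until the next round of checks...
The log makes the system's workflow easy to follow:
- Start monitoring: print the configured check interval
- Fetch products: visit each product page in turn
- Parse data: extract the name, price, rating, and other fields
- Store records: persist the data to a SQLite database
- Compare prices: check the current price against the target price
- Trigger alerts: send an email notification when the price is at or below the target
- Wait and repeat: sleep for the configured interval, then start the next round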
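10.5.2 Prettifying Log Output with Rich
To make the monitor's logs easier to read in the terminal, you can use Python's rich library to add color, progress bars, panels, and other visual elements.
First, install rich:
bash
pip install rich
Then create a styled logging module, rich_logger.py: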
python
# rich_logger.py
from rich.console import Console
from rich.table import Table
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn
from rich.panel import Panel
from datetime import datetime
import time
console = Console()
class RichMonitorLogger:
    """Monitoring log printer styled with the Rich library."""
    @staticmethod
    def print_startup(interval: int, urls: list):
        """Print startup info as a table wrapped in a panel."""
        table = Table(title="📋 Monitor Configuration", show_header=True, header_style="bold cyan")
        table.add_column("Parameter", style="dim", width=20)
        table.add_column("Value")
        table.add_row("Check interval", f"{interval} seconds")
        table.add_row("Products monitored", str(len(urls)))
        table.add_row("Target price", "¥299.00")  # hard-coded for this demo
        console.print(Panel(table, title="🚀 Price Monitor Started", border_style="green"))
    @staticmethod
    def print_fetch_start(url: str):
        """Print a 'fetch started' line."""
        timestamp = datetime.now().strftime("%H:%M:%S")
        console.print(f"[dim]{timestamp}[/dim] 📡 [bold yellow]Fetching:[/bold yellow] {url}")
    @staticmethod
    def print_fetch_success(status: int):
        """Print a 'fetch succeeded' line; green for 200, yellow otherwise."""
        color = "green" if status == 200 else "yellow"
        console.print(f" ✅ [bold {color}]Page loaded[/bold {color}] (status code: {status})")
    @staticmethod
    def print_parsing():
        """Show a spinner while parsing."""
        with Progress(
            SpinnerColumn(),
            TextColumn("[progress.description]{task.description}"),
            transient=True,
        ) as progress:
            progress.add_task(description="[cyan]🔍 Parsing product info...", total=None)
            time.sleep(0.5)  # simulate parsing time
    @staticmethod
    def print_price_info(name: str, price: float, target: float):
        """Print the price, colored by whether it is at or below the target."""
        timestamp = datetime.now().strftime("%H:%M:%S")
        if price <= target:
            console.print(f"[dim]{timestamp}[/dim] 🎉 [bold red]Price drop alert![/bold red] {name} [bold green]¥{price:.2f}[/bold green]")
        else:
            console.print(f"[dim]{timestamp}[/dim] 📊 [bold]{name}[/bold] current price: [yellow]¥{price:.2f}[/yellow] (target price: [cyan]¥{target:.2f}[/cyan])")
    @staticmethod
    def print_save():
        """Print a 'saved' line."""
        console.print(" 💾 [green]Saved to database[/green]: [bold]prices.db[/bold]")
    @staticmethod
    def print_notification(email: str):
        """Print a 'notification sent' line."""
        console.print(f" 📧 [magenta]Email notification sent to[/magenta] [underline]{email}[/underline]")
    @staticmethod
    def print_sleep(seconds: int):
        """Show a progress bar while sleeping between rounds."""
        with Progress(
            TextColumn("[progress.description]{task.description}"),
            BarColumn(),
            TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
        ) as progress:
            task = progress.add_task(f"[blue]💤 Sleeping {seconds} seconds until the next check round...", total=seconds)
            for _ in range(seconds):
                time.sleep(1)
                progress.update(task, advance=1)
    @staticmethod
    def print_summary(stats: dict):
        """Print summary statistics for the current round."""
        table = Table(title="📊 Round Summary", show_header=True, header_style="bold cyan")
        table.add_column("Metric", style="dim", width=20)
        table.add_column("Value")
        table.add_row("Products checked", str(stats.get("checked", 0)))
        table.add_row("Succeeded", f"[green]{stats.get('success', 0)}[/green]")
        table.add_row("Failed", f"[red]{stats.get('failed', 0)}[/red]")
        table.add_row("Price drops", f"[bold red]{stats.get('discounted', 0)}[/bold red]")
        console.print(Panel(table, border_style="cyan"))
# Usage example: in main.py, replace the original print statements with these calls
if __name__ == "__main__":
    logger = RichMonitorLogger()
    # Simulate one monitoring round
    logger.print_startup(interval=3600, urls=["https://example.com/product/1", "https://example.com/product/2"])
    logger.print_fetch_start("https://example.com/product/1")
    logger.print_fetch_success(200)
    logger.print_parsing()
    logger.print_save()
    logger.print_price_info("Wireless Bluetooth Earbuds Pro", 359.00, 299.00)
    logger.print_fetch_start("https://example.com/product/2")
    logger.print_fetch_success(200)
    logger.print_parsing()
    logger.print_save()
    logger.print_price_info("Wireless Bluetooth Earbuds Pro", 285.00, 299.00)
    logger.print_notification("user@example.com")
    logger.print_summary({"checked": 2, "success": 2, "failed": 0, "discounted": 1})
The output looks roughly like this (rendered in color in a terminal):
text
┌─────────────────────────────────────────────┐
│  🚀 Price Monitor Started                   │
│                                             │
│  📋 Monitor Configuration                   │
│  ┌────────────────────┬────────────────┐    │
│  │ Parameter          │ Value          │    │
│  ├────────────────────┼────────────────┤    │
│  │ Check interval     │ 3600 seconds   │    │
│  │ Products monitored │ 2              │    │
│  │ Target price       │ ¥299.00        │    │
│  └────────────────────┴────────────────┘    │
└─────────────────────────────────────────────┘
23:28:40 📡 Fetching: https://example.com/product/1
 ✅ Page loaded (status code: 200)
 🔍 Parsing product info...
 💾 Saved to database: prices.db
23:28:41 📊 Wireless Bluetooth Earbuds Pro current price: ¥359.00 (target price: ¥299.00)
23:28:41 📡 Fetching: https://example.com/product/2
 ✅ Page loaded (status code: 200)
 🔍 Parsing product info...
 💾 Saved to database: prices.db
23:28:42 🎉 Price drop alert! Wireless Bluetooth Earbuds Pro ¥285.00
 📧 Email notification sent to user@example.com
┌─────────────────────────────────────────────┐
│  📊 Round Summary                           │
│  ┌────────────────────┬────────────────┐    │
│  │ Metric             │ Value          │    │
│  ├────────────────────┼────────────────┤    │
│  │ Products checked   │ 2              │    │
│  │ Succeeded          │ 2              │    │
│  │ Failed             │ 0              │    │
│  │ Price drops        │ 1              │    │
│  └────────────────────┴────────────────┘    │
└─────────────────────────────────────────────┘
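With rich, you get:
- Color coding: green for success, red for failure, yellow for prices, bold red for price drops
- Panels and tables: configuration and summary statistics are wrapped in panels for a clear structure
- Progress bar: a bar shown during the sleep interval signals that the system is still running
- Spinner: a spinner animation is displayed while data is being parsed
- Timestamps: fetch and price lines begin with a timestamp, which makes events easy to trace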
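`print_summary` expects a dict with the keys `checked`, `success`, `failed`, and `discounted`, but the demo passes hand-written numbers. A minimal sketch of how such a dict might be computed from one round of results follows. `collect_stats` is a hypothetical helper, and it assumes each result is either a product dict with a `"price"` key or `None` when fetching or parsing failed.

```python
def collect_stats(products, target_price):
    """Summarise one monitoring round.

    `products` holds one entry per URL: a dict with a "price" key on success,
    or None when fetching or parsing failed (an assumption of this sketch).
    """
    stats = {"checked": len(products), "success": 0, "failed": 0, "discounted": 0}
    for product in products:
        if product is None or product.get("price") is None:
            stats["failed"] += 1
            continue
        stats["success"] += 1
        if product["price"] <= target_price:  # same rule as the alert check
            stats["discounted"] += 1
    return stats


round_results = [
    {"name": "Product A", "price": 359.0},
    {"name": "Product B", "price": 285.0},
    None,  # a failed fetch
]
print(collect_stats(round_results, target_price=299))
# → {'checked': 3, 'success': 2, 'failed': 1, 'discounted': 1}
```

The returned dict can be passed straight to `RichMonitorLogger.print_summary`.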
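The price monitor is straightforward to extend:
- Add support for more e-commerce platforms
- Integrate WeCom (企业微信) or DingTalk notifications
- Use Redis caching to improve performance
- Deploy to a cloud server for 24/7 monitoring
Across these 10 chapters you have worked through Scrapling from first steps to a practical project. Each step, from environment setup to building a complete monitoring system, comes with runnable code examples. Now open your editor and start your first Scrapling crawler project!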
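As an illustration of the chat-notification extension, here is a rough sketch of a webhook notifier built only on the standard library. It is not part of the original system. `build_text_payload` and `send_webhook` are hypothetical names, and the payload shape is assumed to follow the plain-text message format used by group-robot webhooks such as those of WeCom and DingTalk. Check the platform's documentation before relying on it, since some robots also require keywords or request signing.

```python
import json
import urllib.request


def build_text_payload(product):
    """Build the JSON body for a plain-text webhook message (assumed format)."""
    content = f"Price drop alert! {product['name']} is now ¥{product['price']:.2f}"
    return {"msgtype": "text", "text": {"content": content}}


def send_webhook(webhook_url, product, timeout=10):
    """POST the alert to a chat-robot webhook and return the HTTP status code."""
    body = json.dumps(build_text_payload(product)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A notifier class could call `send_webhook(url, product)` alongside, or instead of, the email alert. The webhook URL itself comes from the chat platform's robot settings.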