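Using Scrapy: From Data Mining to Monitoring and Automated Testing

Scrapy is a fast, high-level web crawling and web scraping framework, released under the BSD license, for crawling websites and extracting structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.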

Installing Scrapy

pip install scrapy

Spider example

Save the following example code to a file named quotes_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/tag/humor/",
    ]

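    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css("div.quote"):
            yield {
                "author": quote.xpath("span/small/text()").get(),
                "text": quote.css("span.text::text").get(),
            }

        # Follow the "Next" pagination link, if there is one, and parse that page the same way.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)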

Running the spider

scrapy runspider quotes_spider.py -o quotes.jsonl

The output looks like this:

scrapy runspider quotes_spider.py -o quotes.jsonl
2024-05-01 22:10:19 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: scrapybot)
2024-05-01 22:10:19 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.11.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.10.13 (main, Nov  9 2023, 03:04:43) [Clang 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386, pyOpenSSL 24.1.0 (OpenSSL 1.1.1t-freebsd  7 Feb 2023), cryptography 42.0.5, Platform FreeBSD-13.2-RELEASE-p10-amd64-64bit-ELF
2024-05-01 22:10:19 [scrapy.addons] INFO: Enabled addons:
[]
2024-05-01 22:10:19 [py.warnings] WARNING: /usr/home/skywalk/py310/lib/python3.10/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-05-01 22:10:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.pollreactor.PollReactor
2024-05-01 22:10:19 [scrapy.extensions.telnet] INFO: Telnet Password: 18295d3f4c994eee
2024-05-01 22:10:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-05-01 22:10:19 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2024-05-01 22:10:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-05-01 22:10:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-05-01 22:10:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-05-01 22:10:20 [scrapy.core.engine] INFO: Spider opened

When the run finishes, quotes.jsonl contains the scraped quotes in JSON Lines format, one object per line, each with the quote text and its author:

{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
{"author": "Jim Henson", "text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d"}

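Because every line of a JSON Lines file is a standalone JSON object, the result can be read back with the standard library alone. The snippet below is a minimal sketch; the helper names parse_jsonl, load_jsonl and quotes_per_author are made up for illustration and are not part of Scrapy.

```python
import json
from collections import Counter


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_jsonl(path):
    """Read a JSON Lines file such as quotes.jsonl into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return parse_jsonl(f)


def quotes_per_author(items):
    """Count how many scraped quotes each author has."""
    return Counter(item["author"] for item in items)
```

After a crawl, quotes_per_author(load_jsonl("quotes.jsonl")) gives a per-author count. Note that json.loads turns the \u201c and \u201d escapes back into real curly quotation marks.
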
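Monitoring

Log monitoring: Scrapy ships with a capable logging system. Reading the logs lets you monitor how a spider run is going, and analyzing them can also reveal the state of the site being crawled, for example through the HTTP status codes it returns.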
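
As a rough illustration of log analysis, the sketch below counts log levels and HTTP status codes in a Scrapy log using only the standard library. It assumes Scrapy's default log line format (as in the output shown earlier) and the engine's "Crawled (status) <METHOD url>" messages at DEBUG level. The sample lines and the name summarize_log are hypothetical and exist only for this example.

```python
import re
from collections import Counter

# Prefix of a default-format Scrapy log line, e.g.
# "2024-05-01 22:10:20 [scrapy.core.engine] INFO: Spider opened"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<logger>[^\]]+)\] (?P<level>[A-Z]+): (?P<message>.*)$"
)
# Engine message logged for each downloaded response.
CRAWLED = re.compile(r"^Crawled \((?P<status>\d{3})\) <(?P<method>[A-Z]+) (?P<url>\S+)>")

# Hypothetical sample lines, written by hand for this sketch.
SAMPLE_LOG = """\
2024-05-01 22:10:20 [scrapy.core.engine] INFO: Spider opened
2024-05-01 22:10:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/tag/humor/> (referer: None)
2024-05-01 22:10:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/missing/> (referer: None)
2024-05-01 22:10:22 [py.warnings] WARNING: example warning
['a continuation line without a timestamp']
"""


def summarize_log(lines):
    """Return (level counts, HTTP status counts) for an iterable of log lines."""
    levels = Counter()
    statuses = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # continuation line (middleware list, warning text, traceback)
        levels[match["level"]] += 1
        crawled = CRAWLED.match(match["message"])
        if crawled:
            statuses[int(crawled["status"])] += 1
    return levels, statuses


levels, statuses = summarize_log(SAMPLE_LOG.splitlines())
print(dict(levels))
print(dict(statuses))
```

A growing share of 4xx or 5xx statuses, or a rising count of ERROR lines, is the kind of signal that points to a problem on the crawled site or in the spider. To analyze a real run, the log can be written to a file with Scrapy's --logfile option and passed to summarize_log line by line.
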

Automated testing

You can write tests for a Spider by simulating HTTP responses and verifying that the Spider parses the data correctly.

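P.S. Scapy, not to be confused with Scrapy, is a packet manipulation tool. To learn about it, see this article: "Understanding the network model by playing with the Python scapy module -- Get your hands dirty!" on Zhihu.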
