1. Scrapling Introduction
In the AI era, a traditional scraper has to chase every site change: XPath selectors break whenever the page structure shifts, and anti-bot controls add further friction. Scrapling's main appeal is that it learns a site's structure and can relocate elements after the page is updated; it also claims to bypass WAFs and JA3 fingerprint detection.
Project repository: https://github.com/D4Vinci/Scrapling
2. First Impressions
2.1 Installing dependencies
```
certifi
orjson
w3lib
typing_extensions
lxml
cssselect
curl_cffi
playwright
browserforge
patchright
msgspec
anyio
```
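In practice these don't have to be installed one by one; pip resolves them all as dependencies of the `scrapling` package itself. A sketch assuming a standard pip environment (the browser-setup command comes from the project README, so verify it against your installed version):

```shell
# installs scrapling together with the dependency list above
pip install scrapling

# post-install step from the README: downloads the browsers used by
# the stealthy/browser-based fetchers (playwright/patchright)
scrapling install
```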
2.2 Demo test
```python
# from scrapling.fetchers import StealthyFetcher
#
# page = StealthyFetcher.fetch('https://example.com', headless=True)
# products = page.css('.product', auto_save=True)  # save element features on the first crawl
#
# # Later the site gets redesigned -- no problem, adaptive=True relocates the elements!
# products = page.css('.product', adaptive=True)
# print(products)
import os
import ssl

import certifi
from scrapling.spiders import Spider, Response

# Force Python's SSL stack to use certifi's CA bundle
os.environ['SSL_CERT_FILE'] = certifi.where()
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()  # for the requests library
# If you use http.client or urllib, you also need this:
ssl._create_default_https_context = ssl.create_default_context


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}


print(f"certifi path: {certifi.where()}")
print(f"certificate file exists: {os.path.exists(certifi.where())}")
print(f"SSL default paths: {ssl.get_default_verify_paths()}")

MySpider().start()
```
The output is as follows:
```bash
D:\ProgramData\Anaconda3\envs\scrapling\python.exe P:\code\Scrapling-main\tests\demo\demo.py
certifi path: D:\ProgramData\Anaconda3\envs\scrapling\Lib\site-packages\certifi\cacert.pem
certificate file exists: True
SSL default paths: DefaultVerifyPaths(cafile='D:\\ProgramData\\Anaconda3\\envs\\scrapling\\Lib\\site-packages\\certifi\\cacert.pem', capath=None, openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='C:\\Program Files\\Common Files\\ssl\\cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='C:\\Program Files\\Common Files\\ssl\\certs')
[2026-03-09 23:32:09]:(demo) INFO: Spider initialized
[2026-03-09 23:32:09]:(demo) DEBUG: Starting spider
[2026-03-09 23:32:10]:(demo) WARNING: Attempt 1 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.. Retrying in 1 seconds...
[2026-03-09 23:32:11]:(demo) WARNING: Attempt 2 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.. Retrying in 1 seconds...
[2026-03-09 23:32:13]:(demo) ERROR: Failed after 3 attempts: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
[2026-03-09 23:32:13]:(demo) DEBUG: Spider idle
[2026-03-09 23:32:13]:(demo) DEBUG: Spider closed
[2026-03-09 23:32:13]:(demo) INFO: {
  "items_scraped": 0,
  "items_dropped": 0,
  "elapsed_seconds": 3.64,
  "download_delay": 0.0,
  "concurrent_requests": 4,
  "concurrent_requests_per_domain": 0,
  "requests_count": 0,
  "requests_per_second": 0.0,
  "sessions_requests_count": {},
  "failed_requests_count": 1,
  "offsite_requests_count": 0,
  "blocked_requests_count": 0,
  "response_status_count": {},
  "response_bytes": 0,
  "domains_response_bytes": {},
  "proxies": [],
  "custom_stats": {},
  "log_count": {
    "debug": 3,
    "info": 1,
    "warning": 2,
    "error": 1,
    "critical": 0
  }
}
Process finished with exit code 0
```
I also tried to resolve the reported SSL problem by upgrading certifi:

```bash
pip install --upgrade certifi
```
It still fails with the same error; I'll continue testing the new features tomorrow:

```
Attempt 1 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate
```
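For reference, curl error 60 means libcurl could not find a CA bundle to validate the server certificate. The demo sets `SSL_CERT_FILE` and `REQUESTS_CA_BUNDLE`, which cover Python's `ssl` module and `requests`, but Scrapling's fetching layer sits on curl_cffi, and libcurl does not read `REQUESTS_CA_BUNDLE`. A hedged next step is to also export `CURL_CA_BUNDLE`, which curl-family clients conventionally consult (whether the curl_cffi build in use honours it still needs verifying), before the spider starts:

```python
import os

import certifi

# Point every CA-bundle knob this process knows about at certifi's PEM file.
# CURL_CA_BUNDLE targets curl-family clients; SSL_CERT_FILE targets OpenSSL.
ca_bundle = certifi.where()
os.environ['CURL_CA_BUNDLE'] = ca_bundle
os.environ['SSL_CERT_FILE'] = ca_bundle

# Sanity check: certifi ships the bundle as a real file on disk
assert os.path.exists(ca_bundle)
print(f"CA bundle in use: {ca_bundle}")
```

Setting both variables costs nothing and narrows the problem: if the error persists, the CA path is being ignored by curl_cffi rather than missing, and the fix has to happen at the session level instead of the environment.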