Scrapling AI crawler: first impressions

1. Introduction to scrapling

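In the AI era, a traditional crawler has to be reworked whenever the target site changes: XPath selectors need to be rewritten, and anti-bot (risk control) measures need to be handled. What makes scrapling attractive is that it can learn a site's structure and relocate elements after the page is updated. It is also said to be able to get past WAFs, JA3 fingerprinting, and similar checks.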

Project URL: https://github.com/D4Vinci/Scrapling
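To make the "relocate elements after an update" idea concrete, here is a toy sketch in plain Python. It is not Scrapling's algorithm; the functions `fingerprint`, `similarity`, and `relocate`, the dict-based element model, and the equal weighting are all assumptions made up for illustration. The idea is only: remember a few features of an element, then after a redesign pick the candidate that looks most like it.

```python
from difflib import SequenceMatcher


def fingerprint(element):
    """Keep only the features we want to remember about an element (hypothetical model)."""
    return {
        "tag": element["tag"],
        "classes": frozenset(element.get("classes", ())),
        "text": element.get("text", ""),
    }


def similarity(saved, candidate):
    """Score in [0, 1]: equal-weight average of tag match, class overlap, and text similarity."""
    tag_score = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    union = saved["classes"] | candidate["classes"]
    class_score = len(saved["classes"] & candidate["classes"]) / len(union) if union else 1.0
    text_score = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    return (tag_score + class_score + text_score) / 3


def relocate(saved, elements, threshold=0.5):
    """Return the element most similar to the saved fingerprint, or None if nothing is close enough."""
    best = max(elements, key=lambda e: similarity(saved, fingerprint(e)), default=None)
    if best is None or similarity(saved, fingerprint(best)) < threshold:
        return None
    return best


# First crawl: the product lives in <div class="product">
old_page = [{"tag": "div", "classes": ["product"], "text": "Blue Widget"}]
saved = fingerprint(old_page[0])

# After a redesign the class name changed, so the '.product' selector no longer matches
new_page = [
    {"tag": "nav", "classes": ["menu"], "text": "Home"},
    {"tag": "div", "classes": ["item-card"], "text": "Blue Widget"},
]
found = relocate(saved, new_page)
print(found)
```

In this sketch the renamed element is still found because its tag and text are unchanged. A real implementation would presumably also look at attributes, position in the tree, and neighbouring elements.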

2. First try

2.1 Installing dependencies

certifi
orjson
w3lib
typing_extensions
lxml
cssselect
curl_cffi
playwright
browserforge
patchright
msgspec
anyio
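Before running the demo it can help to confirm which of these packages are actually present in the active environment. The following is a minimal sketch using only the standard library; it assumes the names listed above are also the distribution names on PyPI, and `installed_versions` is a helper name introduced here for illustration.

```python
from importlib import metadata

# Names copied from the dependency list above (assumed to be the PyPI distribution names)
REQUIRED = [
    "certifi", "orjson", "w3lib", "typing_extensions", "lxml", "cssselect",
    "curl_cffi", "playwright", "browserforge", "patchright", "msgspec", "anyio",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Anything reported as not installed can then be installed with pip before continuing.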

2.2 Demo test

# from scrapling.fetchers import StealthyFetcher
#
# page = StealthyFetcher.fetch('https://example.com', headless=True)
# products = page.css('.product', auto_save=True)  # save the element fingerprints on the first crawl
#
# # If the site is redesigned later, adaptive=True relocates the elements automatically
# products = page.css('.product', adaptive=True)
# print(products)
import os
import ssl

import certifi
from scrapling.spiders import Spider, Response

# Point Python's SSL module at certifi's CA bundle
os.environ['SSL_CERT_FILE'] = certifi.where()
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()  # for the requests library
# Needed as well if you use http.client or urllib
ssl._create_default_https_context = ssl.create_default_context


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}


print(f"certifi path: {certifi.where()}")
# Note: this prints the path again; os.path.exists(certifi.where()) would be the actual existence check
print(f"Does the certificate file exist: {certifi.where()}")
print(f"SSL default paths: {ssl.get_default_verify_paths()}")

MySpider().start()

The output is as follows

D:\ProgramData\Anaconda3\envs\scrapling\python.exe P:\code\Scrapling-main\tests\demo\demo.py 
certifi path: D:\ProgramData\Anaconda3\envs\scrapling\Lib\site-packages\certifi\cacert.pem
Does the certificate file exist: D:\ProgramData\Anaconda3\envs\scrapling\Lib\site-packages\certifi\cacert.pem
SSL default paths: DefaultVerifyPaths(cafile='D:\\ProgramData\\Anaconda3\\envs\\scrapling\\Lib\\site-packages\\certifi\\cacert.pem', capath=None, openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='C:\\Program Files\\Common Files\\ssl\\cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='C:\\Program Files\\Common Files\\ssl\\certs')
[2026-03-09 23:32:09]:(demo) INFO: Spider initialized
[2026-03-09 23:32:09]:(demo) DEBUG: Starting spider
[2026-03-09 23:32:10]:(demo) WARNING: Attempt 1 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.. Retrying in 1 seconds...
[2026-03-09 23:32:11]:(demo) WARNING: Attempt 2 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.. Retrying in 1 seconds...
[2026-03-09 23:32:13]:(demo) ERROR: Failed after 3 attempts: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
[2026-03-09 23:32:13]:(demo) DEBUG: Spider idle
[2026-03-09 23:32:13]:(demo) DEBUG: Spider closed
[2026-03-09 23:32:13]:(demo) INFO: {
    "items_scraped": 0,
    "items_dropped": 0,
    "elapsed_seconds": 3.64,
    "download_delay": 0.0,
    "concurrent_requests": 4,
    "concurrent_requests_per_domain": 0,
    "requests_count": 0,
    "requests_per_second": 0.0,
    "sessions_requests_count": {},
    "failed_requests_count": 1,
    "offsite_requests_count": 0,
    "blocked_requests_count": 0,
    "response_status_count": {},
    "response_bytes": 0,
    "domains_response_bytes": {},
    "proxies": [],
    "custom_stats": {},
    "log_count": {
        "debug": 3,
        "info": 1,
        "warning": 2,
        "error": 1,
        "critical": 0
    }
}

Process finished with exit code 0

I also tried to fix the reported SSL problem by upgrading certifi:

pip install --upgrade certifi

The same error is still reported, so I will continue testing the new features tomorrow:

Attempt 1 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate
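One way to narrow down this kind of error is to check that the CA bundle the environment variables point at really exists and can be parsed. The sketch below uses only Python's standard `ssl` module; `ca_bundle_report` is a helper name introduced here. Note the assumption: it only shows what Python's own SSL stack sees. The failing request goes through libcurl (via curl_cffi), which may resolve its CA bundle differently, so a clean report here does not prove the curl side is configured correctly.

```python
import os
import ssl


def ca_bundle_report(path):
    """Report whether a CA bundle file exists and how many CA certificates Python can load from it."""
    report = {"path": path, "exists": os.path.isfile(path), "ca_certs": 0}
    if report["exists"]:
        # A bare client context, so only certificates from this file are counted
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        try:
            ctx.load_verify_locations(cafile=path)
            report["ca_certs"] = ctx.cert_store_stats()["x509_ca"]
        except (ssl.SSLError, OSError):
            # Unreadable or not valid PEM: leave the count at 0
            pass
    return report


# Inspect the two variables that the demo script sets
for name in ("SSL_CERT_FILE", "REQUESTS_CA_BUNDLE"):
    value = os.environ.get(name)
    print(name, "->", ca_bundle_report(value) if value else "not set")

# Also inspect the file Python itself would use by default, if any
default_cafile = ssl.get_default_verify_paths().cafile
print("default cafile ->", ca_bundle_report(default_cafile) if default_cafile else "none")
```

If the bundle exists and loads a plausible number of certificates, the next suspects are outside Python's configuration, for example a proxy or security product that re-signs TLS traffic with a certificate that is not in the bundle. That is a guess to verify, not a confirmed cause.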
