Scrapling AI Crawler: First Impressions

1. Introduction to Scrapling

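In the AI era, traditional crawlers still have to be reworked whenever a site changes: XPath selectors must be rewritten and new anti-bot measures dealt with. What makes Scrapling appealing is that it can learn a page's structure and relocate elements after the page is updated. It reportedly can also get past WAFs, JA3 fingerprinting, and similar defenses.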

Project URL: https://github.com/D4Vinci/Scrapling
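
To make the idea of relocating an element after a redesign concrete, here is a minimal conceptual sketch using only the standard library. It is not Scrapling's implementation: the fingerprint fields (tag, attributes, text), the equal weighting in `similarity`, and the helper names `fingerprints` and `relocate` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class FingerprintParser(HTMLParser):
    """Record a small fingerprint (tag, attributes, text) for each element."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append(
            {"tag": tag, "attrs": {k: v or "" for k, v in attrs}, "text": ""}
        )

    def handle_data(self, data):
        # Rough heuristic: attach text to the most recently opened element.
        if self.elements and data.strip():
            self.elements[-1]["text"] += data.strip()


def fingerprints(html):
    """Return the fingerprints of all elements found in an HTML string."""
    parser = FingerprintParser()
    parser.feed(html)
    parser.close()
    return parser.elements


def _attr_string(element):
    return " ".join(f"{k}={v}" for k, v in sorted(element["attrs"].items()))


def similarity(saved, candidate):
    """Average of tag equality, attribute similarity, and text similarity."""
    tag_score = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    attr_score = SequenceMatcher(
        None, _attr_string(saved), _attr_string(candidate)
    ).ratio()
    text_score = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    return (tag_score + attr_score + text_score) / 3


def relocate(saved, html):
    """Pick the element in the new HTML that best matches a saved fingerprint."""
    return max(fingerprints(html), key=lambda el: similarity(saved, el), default=None)


# First crawl: remember what the product title looked like.
old_html = '<div><h2 class="product-title" id="p1">Blue Widget</h2><p class="price">$10</p></div>'
saved = fingerprints(old_html)[1]

# After a redesign the class and wrapper changed, but the element is still recognizable.
new_html = '<section><h2 class="item-title" data-id="p1">Blue Widget</h2><span class="cost">$10</span></section>'
found = relocate(saved, new_html)
print(found)
```

A real library would presumably also compare position in the tree, parents, and siblings. The sketch only shows why a saved fingerprint can survive a renamed class.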

2. First Try

2.1 Installing dependencies

certifi
orjson
w3lib
typing_extensions
lxml
cssselect
curl_cffi
playwright
browserforge
patchright
msgspec
anyio
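
Before running the demo it can help to confirm that every package in the list above is actually installed in the active environment. The following is a small sketch using the standard library's `importlib.metadata`. The helper name `installed_versions` is hypothetical, and the names in `REQUIRED` are simply copied from the list above.

```python
from importlib import metadata

# Distribution names copied from the dependency list above.
REQUIRED = [
    "certifi", "orjson", "w3lib", "typing_extensions", "lxml", "cssselect",
    "curl_cffi", "playwright", "browserforge", "patchright", "msgspec", "anyio",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```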

2.2 Demo test

# from scrapling.fetchers import StealthyFetcher
#
# page = StealthyFetcher.fetch('https://example.com', headless=True)
# products = page.css('.product', auto_save=True)  # save the element fingerprints on the first crawl
#
# # If the site is redesigned later, adaptive=True relocates the elements automatically
# products = page.css('.product', adaptive=True)
# print(products)
import os
import ssl

import certifi
from scrapling.spiders import Spider, Response

# Force Python's SSL module to use certifi's CA bundle
os.environ['SSL_CERT_FILE'] = certifi.where()
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()  # for the requests library
# Also needed if you use http.client or urllib
ssl._create_default_https_context = ssl.create_default_context


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}


print(f"certifi path: {certifi.where()}")
# Note: this prints the path again rather than testing it;
# os.path.exists(certifi.where()) would be the actual existence check.
print(f"Certificate file exists: {certifi.where()}")
print(f"SSL default paths: {ssl.get_default_verify_paths()}")

MySpider().start()

The output is as follows:

D:\ProgramData\Anaconda3\envs\scrapling\python.exe P:\code\Scrapling-main\tests\demo\demo.py 
certifi path: D:\ProgramData\Anaconda3\envs\scrapling\Lib\site-packages\certifi\cacert.pem
Certificate file exists: D:\ProgramData\Anaconda3\envs\scrapling\Lib\site-packages\certifi\cacert.pem
SSL default paths: DefaultVerifyPaths(cafile='D:\\ProgramData\\Anaconda3\\envs\\scrapling\\Lib\\site-packages\\certifi\\cacert.pem', capath=None, openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='C:\\Program Files\\Common Files\\ssl\\cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='C:\\Program Files\\Common Files\\ssl\\certs')
[2026-03-09 23:32:09]:(demo) INFO: Spider initialized
[2026-03-09 23:32:09]:(demo) DEBUG: Starting spider
[2026-03-09 23:32:10]:(demo) WARNING: Attempt 1 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.. Retrying in 1 seconds...
[2026-03-09 23:32:11]:(demo) WARNING: Attempt 2 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.. Retrying in 1 seconds...
[2026-03-09 23:32:13]:(demo) ERROR: Failed after 3 attempts: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
[2026-03-09 23:32:13]:(demo) DEBUG: Spider idle
[2026-03-09 23:32:13]:(demo) DEBUG: Spider closed
[2026-03-09 23:32:13]:(demo) INFO: {
    "items_scraped": 0,
    "items_dropped": 0,
    "elapsed_seconds": 3.64,
    "download_delay": 0.0,
    "concurrent_requests": 4,
    "concurrent_requests_per_domain": 0,
    "requests_count": 0,
    "requests_per_second": 0.0,
    "sessions_requests_count": {},
    "failed_requests_count": 1,
    "offsite_requests_count": 0,
    "blocked_requests_count": 0,
    "response_status_count": {},
    "response_bytes": 0,
    "domains_response_bytes": {},
    "proxies": [],
    "custom_stats": {},
    "log_count": {
        "debug": 3,
        "info": 1,
        "warning": 2,
        "error": 1,
        "critical": 0
    }
}

Process finished with exit code 0

I tried to fix the reported SSL error by upgrading certifi:

pip install --upgrade certifi

The same error is still reported. I will continue testing the newer features tomorrow.

Attempt 1 failed: Failed to perform, curl: (60) SSL certificate problem: unable to get local issuer certificate
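
One way to narrow this down is to test the same host with Python's own `ssl` module and certifi's bundle. The error above is raised by curl, so the environment variables set in the demo may not influence it; that is an assumption and has not been verified here. The sketch below is a standard-library diagnostic only, and the helper name `check_tls` is hypothetical. If it verifies the certificate while curl still fails, the CA bundle itself is probably fine. If it fails too, something on the network path (for example a TLS-inspecting proxy or antivirus) may be re-signing the traffic.

```python
import socket
import ssl
from typing import Optional, Tuple


def check_tls(
    host: str,
    port: int = 443,
    cafile: Optional[str] = None,
    timeout: float = 5.0,
) -> Tuple[bool, str]:
    """Try a verified TLS handshake and report whether the chain validates."""
    try:
        # cafile=None falls back to the system default trust store.
        context = ssl.create_default_context(cafile=cafile)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                # The issuer is a tuple of RDNs, each a tuple of (key, value) pairs.
                issuer = dict(rdn[0] for rdn in tls.getpeercert()["issuer"])
                org = issuer.get("organizationName", "unknown")
                return True, f"certificate verified, issuer: {org}"
    except ssl.SSLCertVerificationError as exc:
        return False, f"certificate verification failed: {exc.verify_message}"
    except OSError as exc:
        # Covers DNS failures, refused connections, timeouts, and a missing cafile.
        return False, f"connection failed: {exc}"
```

For the demo above, a call such as `check_tls("example.com", cafile=certifi.where())` returns a pair like `(ok, detail)`. An unexpected issuer name in `detail` would hint at interception.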
