㊙️本期内容已收录至专栏《Python爬虫实战》,持续完善知识体系与项目实战,建议先订阅收藏,后续查阅更方便~持续更新中!
㊗️爬虫难度指数:⭐
🚫声明:数据仅供个人学习数据分析使用,严禁用于商业比价系统或倒卖数据等,一切后果皆由使用者本人承担。公开榜单数据一般允许访问,但请务必遵守"君子协议"。

全文目录:
- [🌟 开篇语](#🌟 开篇语)
- [📂 摘要(Abstract)](#📂 摘要(Abstract))
- [1️⃣ 背景与需求(Why):为什么采集壁纸数据?](#1️⃣ 背景与需求(Why):为什么采集壁纸数据?)
- [2️⃣ 合规与注意事项(必读)](#2️⃣ 合规与注意事项(必读))
  - [Unsplash API 使用条款](#Unsplash API 使用条款)
  - [Wallpaperflare robots.txt 解读](#Wallpaperflare robots.txt 解读)
  - [图片版权注意事项](#图片版权注意事项)
  - [采集频率建议](#采集频率建议)
- [3️⃣ 技术选型与整体流程(What/How)](#3️⃣ 技术选型与整体流程(What/How))
- [4️⃣ 环境准备与依赖安装(可复现)](#4️⃣ 环境准备与依赖安装(可复现))
  - [Python版本要求](#Python版本要求)
  - [核心依赖安装](#核心依赖安装)
  - [完整 requirements.txt](#完整 requirements.txt)
  - [项目目录结构](#项目目录结构)
- [5️⃣ 配置文件设计(Config)](#5️⃣ 配置文件设计(Config))
- [6️⃣ 核心实现:Unsplash API客户端(Fetcher)](#6️⃣ 核心实现:Unsplash API客户端(Fetcher))
- [7️⃣ 核心实现:Wallpaperflare爬虫(Spider)](#7️⃣ 核心实现:Wallpaperflare爬虫(Spider))
- [8️⃣ 核心实现:图片下载器(Downloader)](#8️⃣ 核心实现:图片下载器(Downloader))
- [9️⃣ 核心实现:数据解析器(Parser)](#9️⃣ 核心实现:数据解析器(Parser))
- [🔟 核心实现:存储层(Storage)](#🔟 核心实现:存储层(Storage))
- [1️⃣1️⃣ 工具类实现](#1️⃣1️⃣ 工具类实现)
- [1️⃣2️⃣ 核心实现:图片分析器(Analyzer)](#1️⃣2️⃣ 核心实现:图片分析器(Analyzer))
- [1️⃣3️⃣ 主程序入口(Main)](#1️⃣3️⃣ 主程序入口(Main))
- [1️⃣4️⃣ 使用示例与最佳实践](#1️⃣4️⃣ 使用示例与最佳实践)
  - [🚀 快速开始](#🚀 快速开始)
    - [1. 环境配置](#1. 环境配置)
    - [2. 基础使用](#2. 基础使用)
    - [3. 进阶用法](#3. 进阶用法)
  - [💡 最佳实践](#💡 最佳实践)
    - [1. 性能优化](#1. 性能优化)
    - [2. 数据质量控制](#2. 数据质量控制)
    - [3. 合规性保障](#3. 合规性保障)
  - [📊 数据分析示例](#📊 数据分析示例)
  - [🐛 常见问题](#🐛 常见问题)
- [1️⃣5️⃣ 总结与展望](#1️⃣5️⃣ 总结与展望)
  - [✅ 已实现功能](#✅ 已实现功能)
  - [🚀 可扩展方向](#🚀 可扩展方向)
  - [📚 学习收获](#📚 学习收获)
  - [🎓 推荐资源](#🎓 推荐资源)
- [🌟 文末](#🌟 文末)
  - [📌 专栏持续更新中|建议收藏 + 订阅](#📌 专栏持续更新中|建议收藏 + 订阅)
  - [✅ 互动征集](#✅ 互动征集)
🌟 开篇语
哈喽,各位小伙伴们你们好呀~我是【喵手】。
运营社区: C站 / 掘金 / 腾讯云 / 阿里云 / 华为云 / 51CTO
欢迎大家常来逛逛,一起学习,一起进步~🌟
我长期专注 Python 爬虫工程化实战,主理专栏 👉 《Python爬虫实战》:从采集策略到反爬对抗,从数据清洗到分布式调度,持续输出可复用的方法论与可落地案例。内容主打一个"能跑、能用、能扩展",让数据价值真正做到:抓得到、洗得净、用得上。
📌 专栏食用指南(建议收藏)
- ✅ 入门基础:环境搭建 / 请求与解析 / 数据落库
- ✅ 进阶提升:登录鉴权 / 动态渲染 / 反爬对抗
- ✅ 工程实战:异步并发 / 分布式调度 / 监控与容错
- ✅ 项目落地:数据治理 / 可视化分析 / 场景化应用
📣 专栏推广时间 :如果你想系统学爬虫,而不是碎片化东拼西凑,欢迎订阅/关注专栏《Python爬虫实战》
订阅后更新会优先推送,按目录学习更高效~
📂 摘要(Abstract)
本文将深入讲解如何构建一个完整的高清壁纸采集系统,通过 Unsplash官方API + Wallpaperflare网站爬虫 双引擎方案,最终产出包含标题、作者、分辨率、图片链接、下载地址等完整信息的结构化数据集(SQLite + CSV + 本地图片库)。
读完本文你将获得:
- 掌握图片平台的两种数据获取方式:官方API(稳定高效)vs 网页爬虫(灵活全面)
- 学会设计高性能图片下载系统(断点续传、并发下载、去重优化)
- 了解图片采集的版权问题与合规边界,避免法律纠纷
- 获取可直接运行的全套工程化示例代码,包含完善的异常处理和性能优化
1️⃣ 背景与需求(Why):为什么采集壁纸数据?
高清壁纸采集在以下场景中具有实际价值:
- 设计师/创作者:建立个人素材库,快速找到灵感来源
- 壁纸应用开发者:为APP/网站提供丰富的壁纸资源
- 数据分析师:分析图片流行趋势,研究配色偏好
- 机器学习工程师:构建图像分类、风格迁移等模型的训练集
- 摄影爱好者:学习优秀摄影作品的构图和用光技巧
为什么选择Unsplash和Wallpaperflare?
Unsplash:
- ✅ 提供官方API,稳定可靠
- ✅ 图片质量极高(专业摄影师上传)
- ✅ 完全免费使用(需遵守许可协议)
- ✅ 元数据丰富(作者、相机参数、位置等)
Wallpaperflare:
- ✅ 分类详细(4K、超宽屏、手机壁纸等)
- ✅ 数量庞大(数百万张)
- ✅ 支持多分辨率下载
- ✅ 更新频繁
目标字段清单
| 字段名 | 类型 | 说明 | 示例值 |
|---|---|---|---|
| image_id | str | 图片唯一标识 | "unsplash_abc123" |
| title | str | 图片标题/描述 | "Mountain landscape at sunset" |
| author | str | 作者/摄影师 | "John Doe" |
| author_url | str | 作者主页 | "https://unsplash.com/@john" |
| width | int | 原始宽度 | 5472 |
| height | int | 原始高度 | 3648 |
| resolution | str | 分辨率标签 | "5K" / "4K" / "1080P" |
| aspect_ratio | str | 宽高比 | "16:9" / "21:9" |
| file_size | int | 文件大小(字节) | 3456789 |
| format | str | 图片格式 | "jpg" / "png" |
| color | str | 主色调 | "#3a5f7d" |
| tags | list | 标签列表 | ["nature", "mountain", "sunset"] |
| category | str | 分类 | "Nature" / "Abstract" |
| url_raw | str | 原图链接 | "https://images.unsplash.com/photo..." |
| url_full | str | 完整尺寸 | "https://images.unsplash.com/..." |
| url_regular | str | 常规尺寸(1080P) | "https://images.unsplash.com/..." |
| url_small | str | 小图(400x) | "https://images.unsplash.com/..." |
| url_thumb | str | 缩略图(200x) | "https://images.unsplash.com/..." |
| download_url | str | 直接下载链接 | "https://..." |
| local_path | str | 本地存储路径 | "data/images/abc123.jpg" |
| likes | int | 点赞数 | 1234 |
| downloads | int | 下载次数 | 5678 |
| views | int | 浏览量 | 12345 |
| created_at | str | 上传时间 | "2025-01-20 14:30:00" |
| camera | str | 相机型号 | "Canon EOS R5" |
| focal_length | str | 焦距 | "24mm" |
| aperture | str | 光圈 | "f/2.8" |
| iso | int | ISO | 100 |
| location | str | 拍摄地点 | "Yosemite, USA" |
| crawl_time | str | 采集时间 | "2025-01-27 15:30:00" |
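为了更直观地理解上表的字段设计,下面给出一条标准化后的示例记录(字段值均为虚构,仅作结构示意):
python
# 一条标准化壁纸记录的示例(字段值均为虚构)
wallpaper_record = {
    'image_id': 'unsplash_abc123',
    'title': 'Mountain landscape at sunset',
    'author': 'John Doe',
    'author_url': 'https://unsplash.com/@john',
    'width': 5472,
    'height': 3648,
    'resolution': '5K',
    'aspect_ratio': '3:2',
    'color': '#3a5f7d',
    'tags': ['nature', 'mountain', 'sunset'],
    'category': 'Nature',
    'url_regular': 'https://images.unsplash.com/photo-xxx?w=1080',
    'local_path': 'data/images/unsplash/unsplash_abc123.jpg',
    'likes': 1234,
    'crawl_time': '2025-01-27 15:30:00',
}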
2️⃣ 合规与注意事项(必读)
Unsplash API 使用条款
API使用规则:
- ✅ 免费tier:每小时50次请求
- ✅ 需要申请API Key(简单注册即可)
- ✅ 必须在使用图片处标注来源和作者
- ❌ 禁止下载后用于付费壁纸APP(需商业授权)
- ❌ 禁止移除水印或作者信息
许可协议(Unsplash License):
- ✅ 可用于商业和非商业用途
- ✅ 无需联系摄影师获得许可
- ✅ 可以修改图片
- ❌ 不能将图片重新销售或分发(如单独出售壁纸包)
- ❌ 不能用于创建竞品网站
官方文档: https://unsplash.com/documentation
Wallpaperflare robots.txt 解读
查看 `wallpaperflare.com/robots.txt`:
text
User-agent: *
Disallow: /search
Disallow: /ajax/
Allow: /
解读:
- ✅ 允许抓取壁纸详情页
- ✅ 允许抓取分类列表
- ❌ 禁止抓取搜索结果页(/search)
- ❌ 禁止抓取ajax接口
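在代码层面,也可以用标准库自动校验robots协议,避免误抓禁区。下面是一个基于 urllib.robotparser 的最小示意(结果以站点当前的robots.txt为准):
python
# 最小示意:校验某个URL是否允许抓取
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.wallpaperflare.com/robots.txt')
rp.read()  # 下载并解析robots.txt

# 与上面的解读一致:详情页允许,搜索页禁止
print(rp.can_fetch('*', 'https://www.wallpaperflare.com/wallpaper/12345'))       # 预期 True
print(rp.can_fetch('*', 'https://www.wallpaperflare.com/search?wallpaper=cat'))  # 预期 False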
图片版权注意事项
⚠️ 重要警告:
- 不是所有壁纸网站的图片都可以随意使用
- 即使网站允许下载,也不代表可以商用
- 必须查看每张图片的具体许可协议
- 建议优先使用有明确开源协议的平台(如Unsplash、Pexels)
安全做法:
- 只采集元数据,不批量下载图片
- 下载图片仅用于个人学习
- 商业用途必须购买授权或使用CC0许可的图片
- 始终标注图片来源和作者
采集频率建议
| 平台 | 建议频率 | 并发数 | 备注 |
|---|---|---|---|
| Unsplash API | 1次/秒 | 单线程 | 受API限额控制 |
| Wallpaperflare | 1次/3秒 | 最多3并发 | 需控制频率避免封IP |
3️⃣ 技术选型与整体流程(What/How)
方案对比
| 维度 | Unsplash API | Wallpaperflare爬虫 | Pexels API |
|---|---|---|---|
| 稳定性 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 图片质量 | 极高 | 高 | 高 |
| 元数据 | 非常丰富 | 基础 | 丰富 |
| 分类 | 较少 | 非常详细 | 中等 |
| 数量 | 300万+ | 1000万+ | 300万+ |
| 合规性 | 官方支持 | 需遵守robots | 官方支持 |
| 学习价值 | API调用 | 反爬技巧 | API调用 |
本文选择:Unsplash API 为主 + Wallpaperflare 爬虫为辅
理由:
- Unsplash API稳定可靠,适合长期稳定运行
- Wallpaperflare补充更多分类和4K资源
- 两者结合可以覆盖绝大多数需求
数据流转架构
text
┌──────────────────────────────────────────────────────────┐
│ Main Controller │
│ (main.py - 主控制器) │
└────────────────────────┬─────────────────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Unsplash API │ │Wallpaperflare│ │ Downloader │
│ Fetcher │ │ Spider │ │ Engine │
│ │ │ │ │ │
│ - API调用 │ │ - HTML解析 │ │ - 并发下载 │
│ - 分页遍历 │ │ - 翻页处理 │ │ - 断点续传 │
│ - 速率限制 │ │ - 反爬处理 │ │ - 去重检查 │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌──────────────────┐
│ Parser Layer │
│ (parser.py) │
│ │
│ - 元数据提取 │
│ - 数据清洗 │
│ - 格式标准化 │
└────────┬─────────┘
▼
┌──────────────────┐
│ Storage Layer │
│ (storage.py) │
│ │
│ - SQLite存储 │
│ - CSV/Excel导出 │
│ - 图片管理 │
└────────┬─────────┘
▼
┌──────────────────┐
│ Analyzer Layer │
│ (analyzer.py) │
│ │
│ - 统计分析 │
│ - 色彩分析 │
│ - 相似度检测 │
└──────────────────┘
为什么这样设计?
- 双引擎架构:API和爬虫互补,一个失效另一个可继续
- 下载器独立:图片下载逻辑复杂,单独成模块便于优化
- 分层清晰:每层职责单一,易于测试和维护
- 可扩展:后续可轻松添加Pexels、Pixabay等其他源
4️⃣ 环境准备与依赖安装(可复现)
Python版本要求
text
Python >= 3.8 # 需要支持async/await和类型注解
核心依赖安装
bash
# HTTP请求和异步
pip install requests aiohttp aiofiles
# HTML解析
pip install lxml beautifulsoup4
# 图片处理
pip install Pillow
# 数据处理
pip install pandas openpyxl
# 进度条和日志
pip install tqdm loguru
# API限流
pip install ratelimit
# 可选:图像分析
pip install scikit-image scikit-learn opencv-python
完整 requirements.txt
text
# HTTP 和异步
requests==2.31.0
aiohttp==3.9.1
aiofiles==23.2.1
# 解析
lxml==5.1.0
beautifulsoup4==4.12.3
# 图片处理
Pillow==10.2.0
# 数据处理
pandas==2.1.4
openpyxl==3.1.2
# 工具
tqdm==4.66.1
loguru==0.7.2
ratelimit==2.2.1
# 图像分析(可选)
scikit-image==0.22.0
scikit-learn==1.4.0  # analyzer 中 KMeans 聚类的依赖
opencv-python==4.9.0.80
# 哈希(去重)
imagehash==4.3.1
项目目录结构
text
wallpaper_scraper/
├── config/
│ ├── __init__.py
│ └── settings.py # 配置(API Key、路径等)
├── core/
│ ├── __init__.py
│ ├── unsplash_fetcher.py # Unsplash API客户端
│ ├── wallpaper_spider.py # Wallpaperflare爬虫
│ ├── downloader.py # 图片下载器
│ ├── parser.py # 数据解析与清洗
│ ├── storage.py # 存储层
│ └── analyzer.py # 图片分析
├── utils/
│ ├── __init__.py
│ ├── logger.py # 日志工具
│ ├── retry.py # 重试装饰器
│ ├── hasher.py # 图片去重(感知哈希)
│ └── validator.py # 数据校验
├── data/
│ ├── wallpapers.db # SQLite数据库
│ ├── wallpapers.csv # CSV导出
│ ├── wallpapers.xlsx # Excel报表
│ └── images/ # 下载的图片
│ ├── unsplash/
│ ├── wallpaperflare/
│ └── thumbnails/ # 缩略图
├── logs/
│ └── scraper_{date}.log
├── tests/
│ ├── test_fetcher.py
│ └── test_downloader.py
├── main.py # 主入口
├── requirements.txt
└── README.md
5️⃣ 配置文件设计(Config)
config/settings.py
python
"""
全局配置文件
包含API密钥、路径、下载参数等
"""
import os
from pathlib import Path
# ============== 项目路径 ==============
BASE_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = BASE_DIR / 'data'
IMAGE_DIR = DATA_DIR / 'images'
UNSPLASH_DIR = IMAGE_DIR / 'unsplash'
WALLPAPER_DIR = IMAGE_DIR / 'wallpaperflare'
THUMBNAIL_DIR = IMAGE_DIR / 'thumbnails'
LOG_DIR = BASE_DIR / 'logs'
# 创建目录
for directory in [DATA_DIR, IMAGE_DIR, UNSPLASH_DIR, WALLPAPER_DIR, THUMBNAIL_DIR, LOG_DIR]:
directory.mkdir(parents=True, exist_ok=True)
# ============== Unsplash API配置 ==============
# 注册地址: https://unsplash.com/developers
UNSPLASH_ACCESS_KEY = os.getenv('UNSPLASH_ACCESS_KEY', 'YOUR_ACCESS_KEY_HERE')
UNSPLASH_SECRET_KEY = os.getenv('UNSPLASH_SECRET_KEY', '')
UNSPLASH_API_BASE = "https://api.unsplash.com"
UNSPLASH_API_VERSION = "v1"
# API端点
UNSPLASH_ENDPOINTS = {
'photos': f"{UNSPLASH_API_BASE}/photos",
'search': f"{UNSPLASH_API_BASE}/search/photos",
'collections': f"{UNSPLASH_API_BASE}/collections",
'random': f"{UNSPLASH_API_BASE}/photos/random",
}
# API限流(免费版:50次/小时)
UNSPLASH_RATE_LIMIT = 50
UNSPLASH_RATE_PERIOD = 3600 # 秒
# ============== Wallpaperflare配置 ==============
WALLPAPERFLARE_BASE = "https://www.wallpaperflare.com"
# 分类URL
WALLPAPERFLARE_CATEGORIES = {
'nature': f"{WALLPAPERFLARE_BASE}/nature-wallpaper",
'abstract': f"{WALLPAPERFLARE_BASE}/abstract-wallpaper",
'animals': f"{WALLPAPERFLARE_BASE}/animals-wallpaper",
'anime': f"{WALLPAPERFLARE_BASE}/anime-wallpaper",
'cars': f"{WALLPAPERFLARE_BASE}/cars-wallpaper",
'city': f"{WALLPAPERFLARE_BASE}/city-wallpaper",
'space': f"{WALLPAPERFLARE_BASE}/space-wallpaper",
}
# ============== HTTP请求配置 ==============
HEADERS = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
}
# 超时配置
TIMEOUT = 30
CONNECT_TIMEOUT = 10
READ_TIMEOUT = 60 # 下载大图片需要更长时间
# 重试配置
MAX_RETRIES = 3
RETRY_DELAY = 2
RETRY_BACKOFF = 2
# 请求间隔
REQUEST_INTERVAL = 2
REQUEST_INTERVAL_RANDOM = 1
# ============== 下载配置 ==============
# 下载模式
DOWNLOAD_MODE = "metadata_only" # "metadata_only" | "with_images" | "thumbnails_only"
# 并发下载
DOWNLOAD_CONCURRENT = 5 # 同时下载5张图片
DOWNLOAD_CHUNK_SIZE = 8192 # 8KB per chunk
# 图片质量选择
UNSPLASH_QUALITY = "regular" # "raw" | "full" | "regular" | "small" | "thumb"
WALLPAPER_RESOLUTION = "1920x1080" # 优先下载的分辨率
# 文件大小限制
MAX_FILE_SIZE = 20 * 1024 * 1024 # 20MB
MIN_FILE_SIZE = 10 * 1024 # 10KB
# 支持的格式
ALLOWED_FORMATS = ['jpg', 'jpeg', 'png', 'webp']
# ============== 采集配置==============
# Unsplash
UNSPLASH_PER_PAGE = 30 # 每页数量(最大30)
UNSPLASH_MAX_PAGES = 10 # 最多采集页数
UNSPLASH_ORDER_BY = "latest" # "latest" | "oldest" | "popular"
# Wallpaperflare
WALLPAPER_PER_PAGE = 24 # 每页数量
WALLPAPER_MAX_PAGES = 5 # 最多采集页数
# 去重配置
ENABLE_DEDUPLICATION = True # 启用去重
DEDUP_METHOD = "phash" # "phash" | "md5" | "url"
PHASH_THRESHOLD = 5 # 感知哈希相似度阈值(越小越严格)
# ============== 存储配置 ==============
DB_PATH = DATA_DIR / 'wallpapers.db'
CSV_PATH = DATA_DIR / 'wallpapers.csv'
EXCEL_PATH = DATA_DIR / 'wallpapers.xlsx'
JSON_PATH = DATA_DIR / 'wallpapers.json'
# 导出开关
EXPORT_CSV = True
EXPORT_EXCEL = True
EXPORT_JSON = False
# 数据库备份
DB_BACKUP = True
BACKUP_INTERVAL = 100 # 每采集100张备份一次
# ============== 日志配置 ==============
LOG_LEVEL = "INFO"
LOG_FORMAT = "<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan> - <level>{message}</level>"
LOG_ROTATION = "1 day"
LOG_RETENTION = "7 days"
# ============== 图片处理配置 ==============
# 缩略图生成
GENERATE_THUMBNAILS = True
THUMBNAIL_SIZE = (400, 300) # (width, height)
THUMBNAIL_QUALITY = 85
# 色彩分析
EXTRACT_DOMINANT_COLORS = True # 提取主色调
COLOR_CLUSTERS = 5 # 提取前N个颜色
# ============== 代理配置 ==============
USE_PROXY = False
PROXY_POOL = []
# ============== 调试配置 ==============
DEBUG = False
DRY_RUN = False # 干跑模式(不实际下载)
SAVE_HTML = False # 保存HTML用于调试
6️⃣ 核心实现:Unsplash API客户端(Fetcher)
core/unsplash_fetcher.py
python
"""
Unsplash API 客户端
负责与Unsplash官方API交互
"""
import time
import requests
from typing import Optional, List, Dict
from ratelimit import limits, sleep_and_retry
from config.settings import (
UNSPLASH_ACCESS_KEY, UNSPLASH_ENDPOINTS,
UNSPLASH_PER_PAGE, UNSPLASH_RATE_LIMIT, UNSPLASH_RATE_PERIOD,
TIMEOUT, MAX_RETRIES
)
from utils.logger import log
from utils.retry import retry
class UnsplashFetcher:
"""
Unsplash API 客户端
功能:
1. 获取照片列表
2. 搜索照片
3. 获取随机照片
4. 获取照片详情
5. 自动限流
"""
def __init__(self, access_key: str = UNSPLASH_ACCESS_KEY):
"""
初始化客户端
参数:
access_key: Unsplash API密钥
"""
if not access_key or access_key == 'YOUR_ACCESS_KEY_HERE':
raise ValueError(
"❌ 请先设置Unsplash API Key!\n"
"1. 访问 https://unsplash.com/developers\n"
"2. 创建应用获取Access Key\n"
"3. 在config/settings.py中设置UNSPLASH_ACCESS_KEY"
)
self.access_key = access_key
self.session = self._create_session()
self.request_count = 0
log.info("Unsplash API客户端初始化完成")
def _create_session(self) -> requests.Session:
"""创建Session"""
session = requests.Session()
session.headers.update({
'Authorization': f'Client-ID {self.access_key}',
'Accept-Version': 'v1',
})
return session
@sleep_and_retry
@limits(calls=UNSPLASH_RATE_LIMIT, period=UNSPLASH_RATE_PERIOD)
@retry(max_attempts=MAX_RETRIES, exceptions=(requests.RequestException,))
def _request(self, url: str, params: Dict = None) -> Optional[Dict]:
"""
发送API请求(带自动限流和重试)
参数:
url: API端点
params: 请求参数
返回:
JSON响应
"""
self.request_count += 1
try:
log.debug(f"[请求 #{self.request_count}] {url}")
response = self.session.get(
url,
params=params,
timeout=TIMEOUT
)
# 检查限流
rate_limit = response.headers.get('X-Ratelimit-Remaining')
if rate_limit:
log.debug(f"API剩余配额: {rate_limit}")
if int(rate_limit) < 5:
log.warning("⚠️ API配额即将用尽!")
response.raise_for_status()
return response.json()
except requests.HTTPError as e:
if e.response.status_code == 401:
log.error("❌ API认证失败,请检查Access Key")
elif e.response.status_code == 403:
log.error("❌ API权限不足或配额用尽")
elif e.response.status_code == 404:
log.error("❌ API端点不存在")
raise
except requests.Timeout:
log.error("⏱️ 请求超时")
raise
except Exception as e:
log.exception(f"❌ 请求失败: {e}")
raise
def get_photos(
self,
page: int = 1,
per_page: int = UNSPLASH_PER_PAGE,
order_by: str = 'latest'
) -> List[Dict]:
"""
获取照片列表
参数:
page: 页码(从1开始)
per_page: 每页数量(1-30)
order_by: 排序方式(latest/oldest/popular)
返回:
照片列表
"""
params = {
'page': page,
'per_page': min(per_page, 30), # API限制最多30
'order_by': order_by
}
photos = self._request(UNSPLASH_ENDPOINTS['photos'], params)
if photos:
log.success(f"✓ 获取第{page}页,共{len(photos)}张照片")
return photos or []
def search_photos(
self,
query: str,
page: int = 1,
per_page: int = UNSPLASH_PER_PAGE,
orientation: str = None,
color: str = None
) -> Dict:
"""
搜索照片
参数:
query: 搜索关键词
page: 页码
per_page: 每页数量
orientation: 方向(landscape/portrait/squarish)
color: 颜色过滤(black/white/yellow等)
返回:
搜索结果字典 {'total': xxx, 'results': []}
"""
params = {
'query': query,
'page': page,
'per_page': per_page,
}
if orientation:
params['orientation'] = orientation
if color:
params['color'] = color
result = self._request(UNSPLASH_ENDPOINTS['search'], params)
if result:
total = result.get('total', 0)
photos = result.get('results', [])
log.success(f"✓ 搜索'{query}'找到{total}张,返回{len(photos)}张")
return result or {'total': 0, 'results': []}
def get_random_photos(
self,
count: int = 1,
collections: str = None,
topics: str = None,
query: str = None
) -> List[Dict]:
"""
获取随机照片
参数:
count: 数量(1-30)
collections: 集合ID(逗号分隔)
topics: 主题ID(逗号分隔)
query: 搜索关键词
返回:
随机照片列表
"""
params = {'count': min(count, 30)}
if collections:
params['collections'] = collections
if topics:
params['topics'] = topics
if query:
params['query'] = query
        photos = self._request(UNSPLASH_ENDPOINTS['random'], params)
        # random端点返回单张时是dict,多张时是list
        if isinstance(photos, dict):
            photos = [photos]
        photos = photos or []
        log.success(f"✓ 获取{len(photos)}张随机照片")
        return photos
def get_photo_detail(self, photo_id: str) -> Optional[Dict]:
"""
获取照片详情
参数:
photo_id: 照片ID
返回:
照片详情
"""
url = f"{UNSPLASH_ENDPOINTS['photos']}/{photo_id}"
detail = self._request(url)
if detail:
log.debug(f"✓ 获取照片详情: {photo_id}")
return detail
def trigger_download(self, download_location: str) -> bool:
"""
触发下载统计(Unsplash要求)
说明:
根据Unsplash API规范,每次下载图片后需要调用download端点
以便Unsplash统计下载次数,这是使用API的必要条件
参数:
download_location: 照片数据中的download_location字段
返回:
是否成功
"""
try:
response = self.session.get(download_location)
response.raise_for_status()
log.debug("✓ 下载统计已触发")
return True
except Exception as e:
log.warning(f"下载统计失败: {e}")
return False
def close(self):
"""关闭Session"""
if self.session:
self.session.close()
log.info("Unsplash API客户端已关闭")
7️⃣ 核心实现:Wallpaperflare爬虫(Spider)
core/wallpaper_spider.py
python
"""
Wallpaperflare 爬虫
负责从Wallpaperflare网站抓取壁纸信息
"""
import re
import time
import random
from typing import List, Dict, Optional
from lxml import etree
import requests
from config.settings import (
WALLPAPERFLARE_BASE, WALLPAPERFLARE_CATEGORIES,
HEADERS, TIMEOUT, REQUEST_INTERVAL, REQUEST_INTERVAL_RANDOM
)
from utils.logger import log
from utils.retry import retry
class WallpaperflareSpider:
"""
Wallpaperflare 爬虫
功能:
1. 抓取分类列表
2. 解析壁纸详情
3. 提取下载链接
4. 反爬处理
"""
def __init__(self):
"""初始化爬虫"""
self.session = self._create_session()
self.request_count = 0
log.info("Wallpaperflare爬虫初始化完成")
def _create_session(self) -> requests.Session:
"""创建Session"""
session = requests.Session()
session.headers.update(HEADERS)
return session
def _sleep(self):
"""请求间隔(带随机抖动)"""
jitter = random.uniform(0, REQUEST_INTERVAL_RANDOM)
sleep_time = REQUEST_INTERVAL + jitter
time.sleep(sleep_time)
@retry(max_attempts=3, exceptions=(requests.RequestException,))
def _fetch_page(self, url: str) -> Optional[str]:
"""
获取页面HTML
参数:
url: 页面URL
返回:
HTML字符串
"""
self.request_count += 1
log.debug(f"[请求 #{self.request_count}] {url}")
try:
if self.request_count > 1:
self._sleep()
response = self.session.get(url, timeout=TIMEOUT)
response.raise_for_status()
# 检查是否被拦截
if 'cloudflare' in response.text.lower() or len(response.text) < 1000:
log.warning("⚠️ 可能触发反爬,HTML内容异常")
return None
log.success(f"✓ 获取页面成功,大小: {len(response.text)} 字节")
return response.text
except requests.HTTPError as e:
if e.response.status_code == 403:
log.error("403 Forbidden - 触发反爬,建议降低频率或使用代理")
elif e.response.status_code == 429:
log.error("429 Too Many Requests - 触发限流")
raise
except Exception as e:
log.exception(f"获取页面失败: {e}")
raise
def parse_category_page(self, html: str, page: int = 1) -> List[Dict]:
"""
解析分类列表页
参数:
html: 页面HTML
page: 页码
返回:
壁纸数据列表
"""
if not html:
return []
try:
tree = etree.HTML(html)
# Wallpaperflare的HTML结构(2025年1月):
# <ul class="gallery">
# <li>
# <figure>
# <a href="/wallpaper/xxx">
# <img data-src="thumbnail.jpg">
# </a>
# <figcaption>
# <span class="res">1920x1080</span>
# </figcaption>
# </figure>
# </li>
# </ul>
items = tree.xpath('//ul[@class="gallery"]/li')
if not items:
log.warning(f"第{page}页未找到壁纸节点")
return []
log.info(f"第{page}页匹配到 {len(items)} 个壁纸节点")
wallpapers = []
for idx, item in enumerate(items, 1):
try:
wallpaper = self._extract_wallpaper_data(item, page, idx)
if wallpaper:
wallpapers.append(wallpaper)
except Exception as e:
log.error(f"第{page}页第{idx}个壁纸解析失败: {e}")
return wallpapers
except Exception as e:
log.exception(f"页面解析失败: {e}")
return []
def _extract_wallpaper_data(self, item, page: int, idx: int) -> Optional[Dict]:
"""
从单个节点提取壁纸数据
参数:
item: lxml Element对象
page: 页码
idx: 序号
返回:
壁纸数据字典
"""
data = {}
# 1. 提取详情页链接
link = item.xpath('.//a[@href]/@href')
if link:
detail_url = link[0]
if not detail_url.startswith('http'):
detail_url = WALLPAPERFLARE_BASE + detail_url
data['detail_url'] = detail_url
# 从URL提取ID
id_match = re.search(r'/wallpaper/(\d+)', detail_url)
if id_match:
data['image_id'] = f"wf_{id_match.group(1)}"
else:
return None
# 2. 提取缩略图
img = item.xpath('.//img/@data-src | .//img/@src')
if img:
thumb_url = img[0]
if thumb_url.startswith('//'):
thumb_url = 'https:' + thumb_url
data['url_thumb'] = thumb_url
# 3. 提取分辨率
resolution = item.xpath('.//span[@class="res"]/text()')
if resolution:
res_text = resolution[0].strip()
# 格式: "1920x1080" 或 "1920 x 1080"
res_match = re.search(r'(\d+)\s*x\s*(\d+)', res_text, re.I)
if res_match:
data['width'] = int(res_match.group(1))
data['height'] = int(res_match.group(2))
data['resolution'] = res_text.replace(' ', '')
# 4. 提取标题(有些页面有)
title = item.xpath('.//figcaption//text()')
if title:
data['title'] = ' '.join([t.strip() for t in title if t.strip()])
# 5. 设置默认值
data.setdefault('title', f"Wallpaper {data.get('image_id', '')}")
data.setdefault('width', 0)
data.setdefault('height', 0)
data['source'] = 'wallpaperflare'
data['page'] = page
data['index'] = idx
from datetime import datetime
data['crawl_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
return data
def parse_detail_page(self, html: str, wallpaper_id: str) -> Optional[Dict]:
"""
解析壁纸详情页(获取高清下载链接)
参数:
html: 详情页HTML
wallpaper_id: 壁纸ID
返回:
详情数据(包含下载链接)
"""
if not html:
return None
try:
tree = etree.HTML(html)
detail = {}
# 1. 提取标题
title = tree.xpath('//h1[@itemprop="name"]/text()')
if title:
detail['title'] = title[0].strip()
# 2. 提取标签
tags = tree.xpath('//a[@rel="tag"]/text()')
if tags:
detail['tags'] = [tag.strip() for tag in tags]
# 3. 提取多分辨率下载链接
# Wallpaperflare提供多个分辨率下载
download_section = tree.xpath('//div[@class="download-options"]//a')
download_urls = {}
for link in download_section:
res_text = link.xpath('./text()')
href = link.xpath('./@href')
if res_text and href:
res = res_text[0].strip()
url = href[0]
if url.startswith('//'):
url = 'https:' + url
download_urls[res] = url
detail['download_urls'] = download_urls
# 4. 提取原图链接(通常是data-src属性)
raw_img = tree.xpath('//img[@id="wallpaper"]/@src | //img[@id="wallpaper"]/@data-src')
if raw_img:
detail['url_raw'] = raw_img[0] if raw_img[0].startswith('http') else 'https:' + raw_img[0]
log.debug(f"✓ 详情页解析成功: {wallpaper_id}")
return detail
except Exception as e:
log.exception(f"详情页解析失败: {e}")
return None
def fetch_category(
self,
category: str,
max_pages: int = 5
) -> List[Dict]:
"""
采集指定分类的壁纸
参数:
category: 分类名称(nature/abstract等)
max_pages: 最多采集页数
返回:
壁纸列表
"""
if category not in WALLPAPERFLARE_CATEGORIES:
log.error(f"未知分类: {category}")
return []
base_url = WALLPAPERFLARE_CATEGORIES[category]
log.info(f"开始采集分类: {category}")
all_wallpapers = []
for page in range(1, max_pages + 1):
# Wallpaperflare的分页URL格式
url = f"{base_url}/page/{page}" if page > 1 else base_url
html = self._fetch_page(url)
if not html:
log.warning(f"第{page}页获取失败,停止采集")
break
wallpapers = self.parse_category_page(html, page)
all_wallpapers.extend(wallpapers)
log.success(f"✓ 第{page}页采集完成,获取{len(wallpapers)}张壁纸")
# 如果没有数据了,停止翻页
if not wallpapers:
log.info("没有更多数据,停止采集")
break
log.success(f"✓ 分类'{category}'采集完成,共{len(all_wallpapers)}张壁纸")
return all_wallpapers
def close(self):
"""关闭Session"""
if self.session:
self.session.close()
log.info("Wallpaperflare爬虫已关闭")
8️⃣ 核心实现:图片下载器(Downloader)
core/downloader.py
python
"""
图片下载器
支持并发下载、断点续传、去重检查
"""
import os
import hashlib
from pathlib import Path
from typing import Optional, Dict
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from tqdm import tqdm
from config.settings import (
UNSPLASH_DIR, WALLPAPER_DIR, THUMBNAIL_DIR,
DOWNLOAD_CONCURRENT, DOWNLOAD_CHUNK_SIZE,
MAX_FILE_SIZE, MIN_FILE_SIZE, ALLOWED_FORMATS,
TIMEOUT, GENERATE_THUMBNAILS, THUMBNAIL_SIZE, THUMBNAIL_QUALITY
)
from utils.logger import log
from utils.hasher import ImageHasher
class ImageDownloader:
"""
图片下载器
功能:
1. 并发下载
2. 断点续传
3. 文件去重(MD5/感知哈希)
4. 缩略图生成
5. 进度显示
"""
def __init__(self):
"""初始化下载器"""
self.session = requests.Session()
self.downloaded_count = 0
self.failed_count = 0
self.skipped_count = 0
self.hasher = ImageHasher()
log.info("图片下载器初始化完成")
def download_image(
self,
url: str,
save_dir: Path,
filename: str = None,
source: str = 'unknown'
) -> Optional[str]:
"""
下载单张图片
参数:
url: 图片URL
save_dir: 保存目录
filename: 文件名(不含扩展名)
source: 来源(unsplash/wallpaperflare)
返回:
保存路径,失败返回None
"""
try:
# 1. 确定文件名
if not filename:
filename = hashlib.md5(url.encode()).hexdigest()
# 2. 确定扩展名
ext = self._get_extension(url)
if ext not in ALLOWED_FORMATS:
log.warning(f"不支持的格式: {ext}")
return None
filepath = save_dir / f"{filename}.{ext}"
# 3. 检查文件是否已存在
if filepath.exists():
log.debug(f"文件已存在,跳过: {filepath.name}")
self.skipped_count += 1
return str(filepath)
# 4. 下载文件
response = self.session.get(
url,
stream=True,
timeout=TIMEOUT
)
response.raise_for_status()
            # 5. 检查文件大小(部分服务器不返回Content-Length,此时跳过该检查)
            content_length = int(response.headers.get('Content-Length', 0))
            if content_length > MAX_FILE_SIZE:
                log.warning(f"文件过大 ({content_length/1024/1024:.2f}MB),跳过")
                self.skipped_count += 1
                return None
            if 0 < content_length < MIN_FILE_SIZE:
                log.warning(f"文件过小 ({content_length/1024:.2f}KB),可能是错误页面")
                self.skipped_count += 1
                return None
# 6. 写入文件
with open(filepath, 'wb') as f:
for chunk in response.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
if chunk:
f.write(chunk)
self.downloaded_count += 1
log.success(f"✓ 下载成功: {filepath.name} ({content_length/1024:.1f}KB)")
# 7. 生成缩略图
if GENERATE_THUMBNAILS:
self._generate_thumbnail(filepath)
return str(filepath)
except requests.HTTPError as e:
log.error(f"下载失败 (HTTP {e.response.status_code}): {url}")
self.failed_count += 1
return None
except Exception as e:
log.error(f"下载失败: {e}")
self.failed_count += 1
return None
def download_batch(
self,
download_tasks: list,
max_workers: int = DOWNLOAD_CONCURRENT
) -> Dict:
"""
批量并发下载
参数:
download_tasks: 下载任务列表 [{'url': xxx, 'save_dir': xxx, ...}, ...]
max_workers: 最大并发数
返回:
统计字典
"""
log.info(f"开始批量下载,任务数: {len(download_tasks)},并发数: {max_workers}")
results = []
with ThreadPoolExecutor(max_workers=max_workers) as executor:
# 提交任务
future_to_task = {
executor.submit(
self.download_image,
task['url'],
task['save_dir'],
task.get('filename'),
task.get('source', 'unknown')
): task
for task in download_tasks
}
# 收集结果(带进度条)
with tqdm(total=len(download_tasks), desc="下载进度") as pbar:
for future in as_completed(future_to_task):
task = future_to_task[future]
try:
result = future.result()
results.append(result)
except Exception as e:
log.error(f"任务失败: {e}")
results.append(None)
finally:
pbar.update(1)
# 统计
stats = {
'total': len(download_tasks),
'success': self.downloaded_count,
'failed': self.failed_count,
'skipped': self.skipped_count,
'results': results
}
log.info(f"批量下载完成: 成功{stats['success']}, 失败{stats['failed']}, 跳过{stats['skipped']}")
return stats
def _get_extension(self, url: str) -> str:
"""从URL提取文件扩展名"""
# 优先从URL路径提取
ext = url.split('?')[0].split('.')[-1].lower()
# 验证扩展名
if ext in ALLOWED_FORMATS:
return ext
# 默认jpg
return 'jpg'
def _generate_thumbnail(self, image_path: Path) -> Optional[Path]:
"""
生成缩略图
参数:
image_path: 原图路径
返回:
缩略图路径
"""
try:
from PIL import Image
thumb_path = THUMBNAIL_DIR / f"thumb_{image_path.name}"
with Image.open(image_path) as img:
# 保持宽高比缩放
img.thumbnail(THUMBNAIL_SIZE, Image.Resampling.LANCZOS)
img.save(thumb_path, quality=THUMBNAIL_QUALITY, optimize=True)
log.debug(f"✓ 缩略图生成: {thumb_path.name}")
return thumb_path
except Exception as e:
log.warning(f"缩略图生成失败: {e}")
return None
def check_duplicate(self, image_path: str, threshold: int = 5) -> Optional[str]:
"""
检查图片是否重复(感知哈希)
参数:
image_path: 图片路径
threshold: 相似度阈值(越小越严格)
返回:
重复图片路径,无重复返回None
"""
return self.hasher.find_similar(image_path, threshold)
def close(self):
"""关闭下载器"""
if self.session:
self.session.close()
log.info("图片下载器已关闭")
9️⃣ 核心实现:数据解析器(Parser)
core/parser.py
python
"""
数据解析器
负责解析、清洗、标准化原始数据
"""
import re
from typing import Dict, Optional, List
from datetime import datetime
from urllib.parse import urlparse
from utils.logger import log
class WallpaperParser:
"""
壁纸数据解析器
功能:
1. 标准化Unsplash数据
2. 标准化Wallpaperflare数据
3. 数据清洗与验证
4. 字段补全
"""
@staticmethod
def parse_unsplash_photo(photo: Dict) -> Dict:
"""
解析Unsplash照片数据
参数:
photo: Unsplash API返回的原始数据
返回:
标准化后的数据字典
"""
try:
data = {
# 基础信息
'image_id': f"unsplash_{photo['id']}",
'title': photo.get('description') or photo.get('alt_description') or f"Photo by {photo['user']['name']}",
'author': photo['user']['name'],
'author_url': photo['user']['links']['html'],
'source': 'unsplash',
# 尺寸信息
'width': photo['width'],
'height': photo['height'],
'resolution': WallpaperParser._calc_resolution_label(photo['width'], photo['height']),
'aspect_ratio': WallpaperParser._calc_aspect_ratio(photo['width'], photo['height']),
# 图片链接
'url_raw': photo['urls']['raw'],
'url_full': photo['urls']['full'],
'url_regular': photo['urls']['regular'],
'url_small': photo['urls']['small'],
'url_thumb': photo['urls']['thumb'],
'download_url': photo['links']['download'],
# 颜色
'color': photo.get('color', '#000000'),
# 统计数据
'likes': photo.get('likes', 0),
'downloads': photo.get('downloads', 0),
'views': photo.get('views', 0),
# 时间
'created_at': WallpaperParser._parse_datetime(photo.get('created_at')),
'crawl_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
}
# EXIF信息(相机参数)
exif = photo.get('exif', {})
if exif:
data.update({
'camera': exif.get('name'),
'focal_length': exif.get('focal_length'),
'aperture': exif.get('aperture'),
'iso': exif.get('iso'),
})
# 位置信息
location = photo.get('location', {})
if location:
parts = [
location.get('city'),
location.get('country')
]
data['location'] = ', '.join([p for p in parts if p])
# 标签
tags = photo.get('tags', [])
if tags:
data['tags'] = [tag['title'] for tag in tags if 'title' in tag]
# 分类(从标签推断)
if tags:
data['category'] = WallpaperParser._infer_category(data.get('tags', []))
log.debug(f"✓ Unsplash数据解析成功: {data['image_id']}")
return data
except KeyError as e:
log.error(f"Unsplash数据缺少必要字段: {e}")
return {}
except Exception as e:
log.exception(f"Unsplash数据解析失败: {e}")
return {}
@staticmethod
def parse_wallpaperflare_item(item: Dict) -> Dict:
"""
解析Wallpaperflare数据
参数:
item: 爬虫采集的原始数据
返回:
标准化后的数据字典
"""
try:
data = {
# 基础信息
'image_id': item.get('image_id', ''),
'title': item.get('title', 'Untitled Wallpaper'),
'source': 'wallpaperflare',
# 尺寸
'width': item.get('width', 0),
'height': item.get('height', 0),
'resolution': item.get('resolution', ''),
'aspect_ratio': WallpaperParser._calc_aspect_ratio(
item.get('width', 0),
item.get('height', 0)
),
# 链接
'url_thumb': item.get('url_thumb', ''),
'url_raw': item.get('url_raw', ''),
'detail_url': item.get('detail_url', ''),
# 标签和分类
'tags': item.get('tags', []),
'category': item.get('category', ''),
# 时间
'crawl_time': item.get('crawl_time', datetime.now().strftime('%Y-%m-%d %H:%M:%S')),
}
# 多分辨率下载链接
download_urls = item.get('download_urls', {})
if download_urls:
data['download_urls'] = download_urls
# 选择最高分辨率作为主下载链接
if '3840x2160' in download_urls:
data['download_url'] = download_urls['3840x2160']
elif '2560x1440' in download_urls:
data['download_url'] = download_urls['2560x1440']
elif '1920x1080' in download_urls:
data['download_url'] = download_urls['1920x1080']
else:
data['download_url'] = list(download_urls.values())[0]
# 如果没有分类,从标签推断
if not data['category'] and data['tags']:
data['category'] = WallpaperParser._infer_category(data['tags'])
log.debug(f"✓ Wallpaperflare数据解析成功: {data['image_id']}")
return data
except Exception as e:
log.exception(f"Wallpaperflare数据解析失败: {e}")
return {}
@staticmethod
def _calc_resolution_label(width: int, height: int) -> str:
"""
计算分辨率标签
参数:
width: 宽度
height: 高度
返回:
分辨率标签 (8K/5K/4K/2K/1080P等)
"""
pixels = width * height
if pixels >= 7680 * 4320: # 33M
return '8K'
elif pixels >= 5120 * 2880: # 14.7M
return '5K'
elif pixels >= 3840 * 2160: # 8.3M
return '4K'
elif pixels >= 2560 * 1440: # 3.7M
return '2K/QHD'
elif pixels >= 1920 * 1080: # 2M
return '1080P/FHD'
elif pixels >= 1280 * 720: # 0.9M
return '720P/HD'
else:
return f'{width}x{height}'
@staticmethod
def _calc_aspect_ratio(width: int, height: int) -> str:
"""
计算宽高比
参数:
width: 宽度
height: 高度
返回:
宽高比标签 (16:9/21:9等)
"""
if width == 0 or height == 0:
return 'Unknown'
from math import gcd
# 计算最大公约数
divisor = gcd(width, height)
ratio_w = width // divisor
ratio_h = height // divisor
# 常见宽高比映射
ratio_map = {
(16, 9): '16:9',
(21, 9): '21:9',
(32, 9): '32:9',
(4, 3): '4:3',
(3, 2): '3:2',
(16, 10): '16:10',
(1, 1): '1:1',
(9, 16): '9:16', # 竖屏
}
return ratio_map.get((ratio_w, ratio_h), f'{ratio_w}:{ratio_h}')
@staticmethod
def _parse_datetime(dt_str: Optional[str]) -> Optional[str]:
"""
解析日期时间字符串
参数:
dt_str: ISO 8601格式的时间字符串
返回:
标准格式时间 "YYYY-MM-DD HH:MM:SS"
"""
if not dt_str:
return None
try:
# 处理ISO 8601格式: 2025-01-20T14:30:00Z
            dt = datetime.fromisoformat(dt_str.replace('Z', '+00:00'))
return dt.strftime('%Y-%m-%d %H:%M:%S')
except Exception as e:
log.warning(f"日期解析失败: {dt_str}, {e}")
return dt_str
@staticmethod
def _infer_category(tags: List[str]) -> str:
"""
从标签推断分类
参数:
tags: 标签列表
返回:
分类名称
"""
# 分类关键词映射
category_keywords = {
'Nature': ['nature', 'landscape', 'mountain', 'forest', 'ocean', 'beach', 'sunset', 'sunrise'],
'Abstract': ['abstract', 'pattern', 'texture', 'geometric', 'gradient'],
'Animals': ['animal', 'cat', 'dog', 'bird', 'wildlife'],
'City': ['city', 'urban', 'building', 'architecture', 'street'],
'Space': ['space', 'galaxy', 'planet', 'star', 'nebula', 'astronomy'],
'Technology': ['technology', 'computer', 'code', 'digital', 'circuit'],
'Art': ['art', 'painting', 'illustration', 'artwork'],
'Minimal': ['minimal', 'minimalist', 'simple', 'clean'],
'Dark': ['dark', 'black', 'night', 'moody'],
'Colorful': ['colorful', 'vibrant', 'rainbow', 'neon'],
}
# 标签转小写
tags_lower = [tag.lower() for tag in tags]
# 匹配分类
for category, keywords in category_keywords.items():
if any(keyword in tag for tag in tags_lower for keyword in keywords):
return category
return 'Other'
@staticmethod
def validate_data(data: Dict) -> bool:
"""
验证数据完整性
参数:
data: 待验证的数据字典
返回:
是否有效
"""
# 必需字段
required_fields = ['image_id', 'title', 'source']
for field in required_fields:
if not data.get(field):
log.warning(f"数据缺少必需字段: {field}")
return False
# 尺寸校验
if data.get('width', 0) <= 0 or data.get('height', 0) <= 0:
log.warning(f"无效的图片尺寸: {data.get('width')}x{data.get('height')}")
return False
# URL校验
url_fields = ['url_raw', 'url_full', 'url_regular', 'download_url']
has_valid_url = any(
data.get(field) and WallpaperParser._is_valid_url(data[field])
for field in url_fields
)
if not has_valid_url:
log.warning("数据缺少有效的图片URL")
return False
return True
@staticmethod
def _is_valid_url(url: str) -> bool:
"""
验证URL格式
参数:
url: 待验证的URL
返回:
是否有效
"""
try:
result = urlparse(url)
return all([result.scheme, result.netloc])
except Exception:
return False
@staticmethod
def clean_text(text: str) -> str:
"""
清洗文本
参数:
text: 原始文本
返回:
清洗后的文本
"""
if not text:
return ''
# 移除多余空白
text = re.sub(r'\s+', ' ', text)
# 移除特殊字符(保留基本标点)
text = re.sub(r'[^\w\s\-.,!?;:()\[\]\'\"]+', '', text)
return text.strip()
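下面用一条虚构的Unsplash返回数据,演示解析与校验的配合方式:
python
# 最小示意:解析一条(虚构的)Unsplash API返回数据并校验
from core.parser import WallpaperParser

raw_photo = {
    'id': 'abc123', 'width': 3840, 'height': 2160, 'color': '#3a5f7d',
    'likes': 10, 'created_at': '2025-01-20T14:30:00Z',
    'user': {'name': 'John Doe', 'links': {'html': 'https://unsplash.com/@john'}},
    'urls': {'raw': 'https://images.unsplash.com/raw',
             'full': 'https://images.unsplash.com/full',
             'regular': 'https://images.unsplash.com/regular',
             'small': 'https://images.unsplash.com/small',
             'thumb': 'https://images.unsplash.com/thumb'},
    'links': {'download': 'https://unsplash.com/photos/abc123/download'},
}
parsed = WallpaperParser.parse_unsplash_photo(raw_photo)
print(parsed['resolution'], parsed['aspect_ratio'])  # 4K 16:9
print(WallpaperParser.validate_data(parsed))         # True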
🔟 核心实现:存储层(Storage)
core/storage.py
python
"""
存储层
负责将数据保存到SQLite、CSV、Excel等格式
"""
import sqlite3
import json
from pathlib import Path
from typing import List, Dict, Optional
import pandas as pd
from datetime import datetime
from config.settings import (
DB_PATH, CSV_PATH, EXCEL_PATH, JSON_PATH,
EXPORT_CSV, EXPORT_EXCEL, EXPORT_JSON,
DB_BACKUP, BACKUP_INTERVAL
)
from utils.logger import log
class WallpaperStorage:
"""
壁纸数据存储器
功能:
1. SQLite数据库管理
2. CSV导出
3. Excel导出
4. JSON导出
5. 数据查询与统计
"""
def __init__(self, db_path: Path = DB_PATH):
"""
初始化存储器
参数:
db_path: 数据库路径
"""
self.db_path = db_path
self.conn = None
self.cursor = None
self.save_count = 0
self._init_database()
log.info(f"存储器初始化完成: {db_path}")
def _init_database(self):
"""初始化数据库表结构"""
self.conn = sqlite3.connect(self.db_path)
self.cursor = self.conn.cursor()
# 创建壁纸表
self.cursor.execute('''
CREATE TABLE IF NOT EXISTS wallpapers (
id INTEGER PRIMARY KEY AUTOINCREMENT,
image_id TEXT UNIQUE NOT NULL,
title TEXT,
author TEXT,
author_url TEXT,
source TEXT NOT NULL,
width INTEGER,
height INTEGER,
resolution TEXT,
aspect_ratio TEXT,
file_size INTEGER,
format TEXT,
color TEXT,
                    tags TEXT,
                    category TEXT,
                    url_raw TEXT,
                    detail_url TEXT,
url_full TEXT,
url_regular TEXT,
url_small TEXT,
url_thumb TEXT,
download_url TEXT,
local_path TEXT,
likes INTEGER DEFAULT 0,
downloads INTEGER DEFAULT 0,
views INTEGER DEFAULT 0,
created_at TEXT,
camera TEXT,
focal_length TEXT,
aperture TEXT,
iso INTEGER,
location TEXT,
crawl_time TEXT,
last_updated TEXT
)
''')
# 创建索引
self.cursor.execute('CREATE INDEX IF NOT EXISTS idx_image_id ON wallpapers(image_id)')
self.cursor.execute('CREATE INDEX IF NOT EXISTS idx_source ON wallpapers(source)')
self.cursor.execute('CREATE INDEX IF NOT EXISTS idx_category ON wallpapers(category)')
self.cursor.execute('CREATE INDEX IF NOT EXISTS idx_resolution ON wallpapers(resolution)')
self.conn.commit()
log.debug("数据库表结构初始化完成")
def save_wallpaper(self, data: Dict) -> bool:
"""
保存单张壁纸数据
参数:
data: 壁纸数据字典
返回:
是否成功
"""
        try:
            data = dict(data)  # 拷贝一份,避免修改调用方的字典
            # 列表/字典字段(如tags、download_urls)统一转JSON字符串
            for key, value in list(data.items()):
                if isinstance(value, (list, dict)):
                    data[key] = json.dumps(value, ensure_ascii=False)
            # 更新时间
            data['last_updated'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            # 只保留表中存在的列,避免INSERT因多余字段报错
            if not hasattr(self, '_columns'):
                self.cursor.execute('PRAGMA table_info(wallpapers)')
                self._columns = {row[1] for row in self.cursor.fetchall()}
            data = {k: v for k, v in data.items() if k in self._columns}
            # 准备字段
            fields = list(data.keys())
            placeholders = ','.join(['?' for _ in fields])
            field_names = ','.join(fields)
            values = [data[f] for f in fields]
# 插入或更新
self.cursor.execute(f'''
INSERT OR REPLACE INTO wallpapers ({field_names})
VALUES ({placeholders})
''', values)
self.save_count += 1
# 定期提交
if self.save_count % 10 == 0:
self.conn.commit()
log.debug(f"已保存 {self.save_count} 条数据")
# 定期备份
if DB_BACKUP and self.save_count % BACKUP_INTERVAL == 0:
self._backup_database()
return True
except sqlite3.IntegrityError as e:
log.warning(f"数据已存在: {data.get('image_id')}")
return False
except Exception as e:
log.error(f"保存失败: {e}")
return False
def save_batch(self, data_list: List[Dict]) -> int:
"""
批量保存数据
参数:
data_list: 数据列表
返回:
成功保存的数量
"""
success_count = 0
for data in data_list:
if self.save_wallpaper(data):
success_count += 1
self.conn.commit()
log.success(f"✓ 批量保存完成: {success_count}/{len(data_list)}")
return success_count
def get_wallpaper(self, image_id: str) -> Optional[Dict]:
"""
根据ID获取壁纸
参数:
image_id: 图片ID
返回:
壁纸数据字典
"""
self.cursor.execute(
'SELECT * FROM wallpapers WHERE image_id = ?',
(image_id,)
)
row = self.cursor.fetchone()
if row:
columns = [desc[0] for desc in self.cursor.description]
return dict(zip(columns, row))
return None
def query_wallpapers(
self,
category: str = None,
resolution: str = None,
source: str = None,
limit: int = 100
) -> List[Dict]:
"""
条件查询壁纸
参数:
category: 分类
resolution: 分辨率
source: 来源
limit: 返回数量限制
返回:
壁纸列表
"""
query = 'SELECT * FROM wallpapers WHERE 1=1'
params = []
if category:
query += ' AND category = ?'
params.append(category)
if resolution:
query += ' AND resolution = ?'
params.append(resolution)
if source:
query += ' AND source = ?'
params.append(source)
query += f' LIMIT {limit}'
self.cursor.execute(query, params)
rows = self.cursor.fetchall()
columns = [desc[0] for desc in self.cursor.description]
return [dict(zip(columns, row)) for row in rows]
def get_statistics(self) -> Dict:
"""
获取统计信息
返回:
统计字典
"""
stats = {}
# 总数
self.cursor.execute('SELECT COUNT(*) FROM wallpapers')
stats['total'] = self.cursor.fetchone()[0]
# 按来源统计
self.cursor.execute('''
SELECT source, COUNT(*) as count
FROM wallpapers
GROUP BY source
''')
stats['by_source'] = dict(self.cursor.fetchall())
# 按分类统计
self.cursor.execute('''
SELECT category, COUNT(*) as count
FROM wallpapers
GROUP BY category
ORDER BY count DESC
LIMIT 10
''')
stats['by_category'] = dict(self.cursor.fetchall())
# 按分辨率统计
self.cursor.execute('''
SELECT resolution, COUNT(*) as count
FROM wallpapers
GROUP BY resolution
ORDER BY count DESC
LIMIT 10
''')
stats['by_resolution'] = dict(self.cursor.fetchall())
# 平均尺寸
self.cursor.execute('SELECT AVG(width), AVG(height) FROM wallpapers')
avg_width, avg_height = self.cursor.fetchone()
stats['avg_size'] = f"{int(avg_width)}x{int(avg_height)}" if avg_width else "N/A"
return stats
def export_csv(self, output_path: Path = CSV_PATH):
"""导出为CSV"""
if not EXPORT_CSV:
return
try:
df = pd.read_sql_query('SELECT * FROM wallpapers', self.conn)
df.to_csv(output_path, index=False, encoding='utf-8-sig')
log.success(f"✓ CSV导出成功: {output_path}")
except Exception as e:
log.error(f"CSV导出失败: {e}")
def export_excel(self, output_path: Path = EXCEL_PATH):
"""导出为Excel"""
if not EXPORT_EXCEL:
return
try:
df = pd.read_sql_query('SELECT * FROM wallpapers', self.conn)
with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
# 主数据表
df.to_excel(writer, sheet_name='Wallpapers', index=False)
# 统计表
stats = self.get_statistics()
stats_df = pd.DataFrame([stats])
stats_df.to_excel(writer, sheet_name='Statistics', index=False)
log.success(f"✓ Excel导出成功: {output_path}")
except Exception as e:
log.error(f"Excel导出失败: {e}")
def export_json(self, output_path: Path = JSON_PATH):
"""导出为JSON"""
if not EXPORT_JSON:
return
try:
self.cursor.execute('SELECT * FROM wallpapers')
rows = self.cursor.fetchall()
columns = [desc[0] for desc in self.cursor.description]
data = [dict(zip(columns, row)) for row in rows]
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
log.success(f"✓ JSON导出成功: {output_path}")
except Exception as e:
log.error(f"JSON导出失败: {e}")
def _backup_database(self):
"""备份数据库"""
try:
backup_path = self.db_path.parent / f"wallpapers_backup_{datetime.now().strftime('%Y%m%d_%H%M%S')}.db"
with sqlite3.connect(backup_path) as backup_conn:
self.conn.backup(backup_conn)
log.info(f"✓ 数据库备份: {backup_path.name}")
except Exception as e:
log.warning(f"数据库备份失败: {e}")
def close(self):
"""关闭数据库连接"""
if self.conn:
self.conn.commit()
self.conn.close()
log.info("数据库连接已关闭")
1️⃣1️⃣ 工具类实现
utils/logger.py
python
"""
日志工具
基于loguru实现彩色日志输出
"""
import sys
from pathlib import Path
from loguru import logger
from datetime import datetime
from config.settings import LOG_DIR, LOG_LEVEL, LOG_FORMAT, LOG_ROTATION, LOG_RETENTION
# 移除默认handler
logger.remove()
# 添加控制台输出(彩色)
logger.add(
sys.stderr,
format=LOG_FORMAT,
level=LOG_LEVEL,
colorize=True
)
# 添加文件输出
log_file = LOG_DIR / f"scraper_{datetime.now().strftime('%Y%m%d')}.log"
logger.add(
log_file,
format=LOG_FORMAT,
level=LOG_LEVEL,
rotation=LOG_ROTATION,
retention=LOG_RETENTION,
encoding='utf-8'
)
# 导出logger实例
log = logger
utils/retry.py
python
"""
重试装饰器
用于自动重试失败的函数调用
"""
import time
import functools
from typing import Tuple, Type
from utils.logger import log
def retry(
max_attempts: int = 3,
delay: float = 1.0,
backoff: float = 2.0,
exceptions: Tuple[Type[Exception], ...] = (Exception,)
):
"""
重试装饰器
参数:
max_attempts: 最大重试次数
delay: 初始延迟(秒)
backoff: 延迟倍增系数
exceptions: 需要重试的异常类型
示例:
@retry(max_attempts=3, delay=2, backoff=2)
def fetch_data():
...
"""
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
_delay = delay
for attempt in range(1, max_attempts + 1):
try:
return func(*args, **kwargs)
except exceptions as e:
if attempt == max_attempts:
log.error(f"❌ 重试{max_attempts}次后仍失败: {func.__name__}")
raise
log.warning(f"⚠️ {func.__name__} 失败 (尝试 {attempt}/{max_attempts}), {_delay:.1f}秒后重试...")
time.sleep(_delay)
_delay *= backoff
return wrapper
return decorator
utils/hasher.py
python
"""
图片哈希工具
用于图片去重和相似度检测
"""
import hashlib
from pathlib import Path
from typing import Optional, Dict
import imagehash
from PIL import Image
from utils.logger import log
class ImageHasher:
"""
图片哈希器
功能:
1. MD5哈希(精确去重)
2. 感知哈希(相似度检测)
3. 哈希数据库管理
"""
def __init__(self):
"""初始化哈希器"""
self.md5_cache: Dict[str, str] = {} # {md5: filepath}
self.phash_cache: Dict[str, str] = {} # {phash: filepath}
log.debug("图片哈希器初始化完成")
def calc_md5(self, image_path: str) -> str:
"""
计算文件MD5
参数:
image_path: 图片路径
返回:
MD5值(32位小写)
"""
try:
with open(image_path, 'rb') as f:
md5_hash = hashlib.md5()
for chunk in iter(lambda: f.read(8192), b''):
md5_hash.update(chunk)
return md5_hash.hexdigest()
except Exception as e:
log.error(f"MD5计算失败: {e}")
return ''
def calc_phash(self, image_path: str, hash_size: int = 8) -> str:
"""
计算感知哈希
参数:
image_path: 图片路径
hash_size: 哈希大小(越大越精确)
返回:
感知哈希值(十六进制字符串)
"""
try:
with Image.open(image_path) as img:
phash = imagehash.phash(img, hash_size=hash_size)
return str(phash)
except Exception as e:
log.error(f"感知哈希计算失败: {e}")
return ''
def find_duplicate_md5(self, image_path: str) -> Optional[str]:
"""
查找MD5重复
参数:
image_path: 图片路径
返回:
重复图片路径,无重复返回None
"""
md5 = self.calc_md5(image_path)
if not md5:
return None
if md5 in self.md5_cache:
return self.md5_cache[md5]
self.md5_cache[md5] = image_path
return None
def find_similar(self, image_path: str, threshold: int = 5) -> Optional[str]:
"""
查找相似图片(感知哈希)
参数:
image_path: 图片路径
threshold: 相似度阈值(汉明距离,越小越严格)
返回:
相似图片路径,无相似返回None
"""
phash = self.calc_phash(image_path)
if not phash:
return None
current_hash = imagehash.hex_to_hash(phash)
for cached_phash, cached_path in self.phash_cache.items():
cached_hash = imagehash.hex_to_hash(cached_phash)
distance = current_hash - cached_hash
if distance <= threshold:
log.info(f"发现相似图片 (距离={distance}): {cached_path}")
return cached_path
self.phash_cache[phash] = image_path
return None
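哈希工具的最小使用示意(对本地目录做近似去重,路径仅为示例):
python
# 最小示意:扫描目录,报告感知哈希相似的图片
from pathlib import Path
from utils.hasher import ImageHasher

hasher = ImageHasher()
for img_path in Path('data/images/unsplash').glob('*.jpg'):
    dup = hasher.find_similar(str(img_path), threshold=5)
    if dup:
        print(f"{img_path.name} 与 {dup} 近似,可考虑二选一保留")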
1️⃣2️⃣ 核心实现:图片分析器(Analyzer)
core/analyzer.py
python
"""
图片分析器
提供色彩分析、统计分析等功能
"""
import numpy as np
from typing import Any, List, Dict, Tuple
from pathlib import Path
from collections import Counter
from PIL import Image
import cv2
from utils.logger import log
from config.settings import COLOR_CLUSTERS
class WallpaperAnalyzer:
"""
壁纸分析器
功能:
1. 提取主色调
2. 色彩分布统计
3. 图片质量评估
4. 相似度分析
"""
@staticmethod
def extract_dominant_colors(
image_path: str,
n_colors: int = COLOR_CLUSTERS
) -> List[Tuple[str, float]]:
"""
提取主色调
参数:
image_path: 图片路径
n_colors: 提取颜色数量
返回:
[(hex_color, percentage), ...] 颜色及占比列表
"""
try:
# 加载图片
img = cv2.imread(image_path)
if img is None:
log.error(f"无法读取图片: {image_path}")
return []
# 转换BGR到RGB
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# 缩小图片加速处理
img = cv2.resize(img, (150, 150))
# 展平像素
pixels = img.reshape(-1, 3)
# K-means聚类
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=n_colors, random_state=42, n_init=10)
kmeans.fit(pixels)
# 统计每个簇的像素数
labels = kmeans.labels_
counts = Counter(labels)
total = len(labels)
# 获取颜色及占比
colors = []
for i in range(n_colors):
rgb = kmeans.cluster_centers_[i].astype(int)
hex_color = '#{:02x}{:02x}{:02x}'.format(*rgb)
percentage = counts[i] / total
colors.append((hex_color, percentage))
# 按占比排序
colors.sort(key=lambda x: x[1], reverse=True)
log.debug(f"✓ 主色调提取成功: {len(colors)}种颜色")
return colors
except Exception as e:
log.error(f"主色调提取失败: {e}")
return []
@staticmethod
def analyze_brightness(image_path: str) -> Dict[str, float]:
"""
分析图片亮度
参数:
image_path: 图片路径
返回:
亮度统计字典
"""
try:
with Image.open(image_path) as img:
# 转灰度
gray = img.convert('L')
pixels = np.array(gray)
return {
'mean': float(np.mean(pixels)),
'median': float(np.median(pixels)),
'std': float(np.std(pixels)),
'min': int(np.min(pixels)),
'max': int(np.max(pixels))
}
except Exception as e:
log.error(f"亮度分析失败: {e}")
return {}
@staticmethod
def detect_faces(image_path: str) -> int:
"""
检测人脸数量
参数:
image_path: 图片路径
返回:
人脸数量
"""
try:
# 加载人脸检测器
face_cascade = cv2.CascadeClassifier(
cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)
img = cv2.imread(image_path)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(
gray,
scaleFactor=1.1,
minNeighbors=5,
minSize=(30, 30)
)
count = len(faces)
log.debug(f"检测到 {count} 张人脸")
return count
except Exception as e:
log.warning(f"人脸检测失败: {e}")
return 0
@staticmethod
    def assess_quality(image_path: str) -> Dict[str, Any]:
"""
评估图片质量
参数:
image_path: 图片路径
返回:
质量评估字典
"""
try:
with Image.open(image_path) as img:
width, height = img.size
# 基础指标
assessment = {
'width': width,
'height': height,
'megapixels': round(width * height / 1_000_000, 2),
'aspect_ratio': round(width / height, 2),
}
# 分辨率评分
pixels = width * height
if pixels >= 8_000_000: # 8MP+
assessment['resolution_score'] = 'Excellent'
elif pixels >= 2_000_000: # 2MP+
assessment['resolution_score'] = 'Good'
elif pixels >= 1_000_000: # 1MP+
assessment['resolution_score'] = 'Fair'
else:
assessment['resolution_score'] = 'Poor'
# 文件大小
file_size = Path(image_path).stat().st_size
assessment['file_size_mb'] = round(file_size / 1024 / 1024, 2)
# 压缩比
raw_size = width * height * 3 # RGB
assessment['compression_ratio'] = round(raw_size / file_size, 2)
return assessment
except Exception as e:
log.error(f"质量评估失败: {e}")
return {}
@staticmethod
def generate_report(storage) -> Dict:
"""
生成分析报告
参数:
storage: WallpaperStorage实例
返回:
分析报告字典
"""
stats = storage.get_statistics()
report = {
'summary': {
'total_wallpapers': stats.get('total', 0),
'sources': list(stats.get('by_source', {}).keys()),
'categories': len(stats.get('by_category', {})),
},
'source_breakdown': stats.get('by_source', {}),
'category_breakdown': stats.get('by_category', {}),
'resolution_breakdown': stats.get('by_resolution', {}),
'average_size': stats.get('avg_size', 'N/A'),
}
# 热门分类 TOP5
categories = stats.get('by_category', {})
if categories:
top_categories = sorted(
categories.items(),
key=lambda x: x[1],
reverse=True
)[:5]
report['top_categories'] = dict(top_categories)
return report
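分析器的最小使用示意(需要本地已有图片文件,路径仅为示例):
python
# 最小示意:分析一张本地图片的主色调与亮度
from core.analyzer import WallpaperAnalyzer

image_path = 'data/images/unsplash/unsplash_abc123.jpg'  # 示例路径
for hex_color, pct in WallpaperAnalyzer.extract_dominant_colors(image_path, n_colors=3):
    print(f"{hex_color}: {pct:.1%}")
print(WallpaperAnalyzer.analyze_brightness(image_path))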
1️⃣3️⃣ 主程序入口(Main)
main.py
python
"""
壁纸采集系统 - 主程序
"""
import sys
from pathlib import Path
from typing import List, Dict
from tqdm import tqdm
# 添加项目路径
sys.path.append(str(Path(__file__).parent))
from config.settings import *
from core.unsplash_fetcher import UnsplashFetcher
from core.wallpaper_spider import WallpaperflareSpider
from core.downloader import ImageDownloader
from core.parser import WallpaperParser
from core.storage import WallpaperStorage
from core.analyzer import WallpaperAnalyzer
from utils.logger import log
class WallpaperScraper:
"""
壁纸采集系统主控制器
功能:
1. 统一调度各个模块
2. 流程控制
3. 异常处理
4. 进度显示
"""
def __init__(self):
"""初始化采集器"""
log.info("="*60)
log.info("🎨 壁纸采集系统启动")
log.info("="*60)
# 初始化各模块
self.unsplash = None
self.wallpaper_spider = None
self.downloader = None
self.storage = None
self.parser = WallpaperParser()
self.analyzer = WallpaperAnalyzer()
self._init_modules()
def _init_modules(self):
"""初始化各个模块"""
try:
# 存储层(必需)
self.storage = WallpaperStorage()
# Unsplash(可选)
try:
self.unsplash = UnsplashFetcher()
log.success("✓ Unsplash API 已启用")
except ValueError as e:
log.warning(f"⚠️ Unsplash API 未配置: {e}")
# Wallpaperflare爬虫
self.wallpaper_spider = WallpaperflareSpider()
log.success("✓ Wallpaperflare爬虫 已启用")
# 下载器(仅在需要时启用)
if DOWNLOAD_MODE != "metadata_only":
self.downloader = ImageDownloader()
log.success(f"✓ 图片下载器 已启用 (模式: {DOWNLOAD_MODE})")
else:
log.info("ℹ️ 仅采集元数据,不下载图片")
except Exception as e:
log.exception(f"模块初始化失败: {e}")
sys.exit(1)
def fetch_unsplash(
self,
max_pages: int = UNSPLASH_MAX_PAGES,
order_by: str = UNSPLASH_ORDER_BY
) -> int:
"""
采集Unsplash数据
参数:
max_pages: 最多采集页数
order_by: 排序方式
返回:
成功采集数量
"""
if not self.unsplash:
log.warning("Unsplash API未配置,跳过")
return 0
log.info(f"🔍 开始采集Unsplash (最多{max_pages}页)")
collected = []
try:
for page in tqdm(range(1, max_pages + 1), desc="Unsplash进度"):
photos = self.unsplash.get_photos(
page=page,
per_page=UNSPLASH_PER_PAGE,
order_by=order_by
)
if not photos:
log.info("没有更多数据")
break
# 解析数据
for photo in photos:
parsed = self.parser.parse_unsplash_photo(photo)
if parsed and self.parser.validate_data(parsed):
collected.append(parsed)
log.info(f"第{page}页完成,已采集 {len(collected)} 张")
# 保存到数据库
saved = self.storage.save_batch(collected)
# 下载图片(可选)
if self.downloader and DOWNLOAD_MODE == "with_images":
self._download_images(collected, UNSPLASH_DIR)
log.success(f"✓ Unsplash采集完成: {saved} 张")
return saved
except Exception as e:
log.exception(f"Unsplash采集失败: {e}")
return 0
finally:
if self.unsplash:
self.unsplash.close()
def fetch_wallpaperflare(
self,
categories: List[str] = None,
max_pages_per_category: int = WALLPAPER_MAX_PAGES
) -> int:
"""
采集Wallpaperflare数据
参数:
categories: 分类列表(None表示全部)
max_pages_per_category: 每个分类最多采集页数
返回:
成功采集数量
"""
if not self.wallpaper_spider:
log.warning("Wallpaperflare爬虫未启用,跳过")
return 0
# 默认采集所有分类
if not categories:
categories = list(WALLPAPERFLARE_CATEGORIES.keys())
log.info(f"🕷️ 开始采集Wallpaperflare ({len(categories)}个分类)")
total_collected = []
try:
for category in categories:
log.info(f"📂 采集分类: {category}")
wallpapers = self.wallpaper_spider.fetch_category(
category=category,
max_pages=max_pages_per_category
)
# 解析数据
parsed_list = []
for wp in wallpapers:
parsed = self.parser.parse_wallpaperflare_item(wp)
if parsed and self.parser.validate_data(parsed):
parsed_list.append(parsed)
# 保存
saved = self.storage.save_batch(parsed_list)
total_collected.extend(parsed_list)
log.success(f"✓ 分类'{category}'完成: {saved} 张")
# 下载图片(可选)
if self.downloader and DOWNLOAD_MODE == "with_images":
self._download_images(total_collected, WALLPAPER_DIR)
log.success(f"✓ Wallpaperflare采集完成: {len(total_collected)} 张")
return len(total_collected)
except Exception as e:
log.exception(f"Wallpaperflare采集失败: {e}")
return 0
finally:
if self.wallpaper_spider:
self.wallpaper_spider.close()
def _download_images(self, data_list: List[Dict], save_dir: Path):
"""
下载图片
参数:
data_list: 数据列表
save_dir: 保存目录
"""
log.info(f"📥 开始下载图片到 {save_dir}")
# 准备下载任务
tasks = []
for data in data_list:
# 选择下载URL
url = (
data.get('download_url') or
data.get('url_regular') or
data.get('url_full') or
data.get('url_raw')
)
if url:
tasks.append({
'url': url,
'save_dir': save_dir,
'filename': data['image_id'],
'source': data['source']
})
if not tasks:
log.warning("没有可下载的图片")
return
# 批量下载
stats = self.downloader.download_batch(tasks)
log.success(
f"✓ 图片下载完成: "
f"成功{stats['success']}, "
f"失败{stats['failed']}, "
f"跳过{stats['skipped']}"
)
def search_unsplash(self, query: str, max_results: int = 100) -> int:
"""
搜索Unsplash
参数:
query: 搜索关键词
max_results: 最多返回数量
返回:
采集数量
"""
if not self.unsplash:
log.warning("Unsplash API未配置")
return 0
log.info(f"🔍 搜索Unsplash: '{query}'")
collected = []
page = 1
per_page = UNSPLASH_PER_PAGE
try:
while len(collected) < max_results:
result = self.unsplash.search_photos(
query=query,
page=page,
per_page=per_page
)
photos = result.get('results', [])
if not photos:
break
for photo in photos:
parsed = self.parser.parse_unsplash_photo(photo)
if parsed and self.parser.validate_data(parsed):
collected.append(parsed)
if len(collected) >= max_results:
break
page += 1
saved = self.storage.save_batch(collected)
log.success(f"✓ 搜索完成: {saved} 张")
return saved
except Exception as e:
log.exception(f"搜索失败: {e}")
return 0
def export_data(self):
"""导出数据到各种格式"""
log.info("📤 开始导出数据...")
self.storage.export_csv()
self.storage.export_excel()
self.storage.export_json()
log.success("✓ 数据导出完成")
def generate_report(self):
"""生成分析报告"""
log.info("📊 生成分析报告...")
report = self.analyzer.generate_report(self.storage)
# 打印报告
log.info("="*60)
log.info("📈 采集统计报告")
log.info("="*60)
summary = report['summary']
log.info(f"总计: {summary['total_wallpapers']} 张壁纸")
log.info(f"来源: {', '.join(summary['sources'])}")
log.info(f"分类数: {summary['categories']}")
log.info(f"平均尺寸: {report['average_size']}")
if report.get('top_categories'):
log.info("\n🏆 热门分类 TOP5:")
for cat, count in report['top_categories'].items():
log.info(f" {cat}: {count} 张")
log.info("="*60)
return report
def close(self):
"""关闭所有资源"""
log.info("🔚 关闭采集器...")
if self.unsplash:
self.unsplash.close()
if self.wallpaper_spider:
self.wallpaper_spider.close()
if self.downloader:
self.downloader.close()
if self.storage:
self.storage.close()
log.success("✓ 采集器已关闭")
def main():
"""主函数"""
scraper = None
try:
# 创建采集器
scraper = WallpaperScraper()
# ========== 方式1: 采集Unsplash ==========
# scraper.fetch_unsplash(max_pages=5)
# ========== 方式2: 采集Wallpaperflare ==========
scraper.fetch_wallpaperflare(
categories=['nature', 'abstract'],
max_pages_per_category=3
)
# ========== 方式3: 搜索关键词 ==========
# scraper.search_unsplash('mountain sunset', max_results=50)
# ========== 导出数据 ==========
scraper.export_data()
# ========== 生成报告 ==========
scraper.generate_report()
log.success("🎉 采集任务完成!")
except KeyboardInterrupt:
log.warning("\n⚠️ 用户中断")
except Exception as e:
log.exception(f"❌ 程序异常: {e}")
finally:
if scraper:
scraper.close()
if __name__ == '__main__':
main()
1️⃣4️⃣ 使用示例与最佳实践
🚀 快速开始
1. 环境配置
bash
# 克隆项目
git clone https://github.com/yourname/wallpaper-scraper.git
cd wallpaper-scraper
# 安装依赖
pip install -r requirements.txt
# 配置API Key(settings.py 会优先读取环境变量)
export UNSPLASH_ACCESS_KEY="your_access_key_here"
# 也可以直接编辑 config/settings.py 中的 UNSPLASH_ACCESS_KEY
2. 基础使用
python
from main import WallpaperScraper
# 创建采集器
scraper = WallpaperScraper()
# 采集Unsplash(5页)
scraper.fetch_unsplash(max_pages=5)
# 采集Wallpaperflare(nature分类,3页)
scraper.fetch_wallpaperflare(
categories=['nature'],
max_pages_per_category=3
)
# 导出数据
scraper.export_data()
# 关闭
scraper.close()
3. 进阶用法
python
# ========== 场景1: 搜索特定主题 ==========
scraper.search_unsplash('tokyo night', max_results=100)
# ========== 场景2: 批量采集多个分类 ==========
categories = ['nature', 'abstract', 'city', 'space']
scraper.fetch_wallpaperflare(categories, max_pages_per_category=5)
# ========== 场景3: 只采集4K壁纸 ==========
# 修改parser.py中的过滤条件
def validate_data(data: Dict) -> bool:
if data.get('width', 0) < 3840: # 4K最低宽度
return False
return True
# ========== 场景4: 定时任务(每天采集) ==========
import schedule  # 需先 pip install schedule
import time
def daily_task():
scraper = WallpaperScraper()
scraper.fetch_unsplash(max_pages=10)
scraper.fetch_wallpaperflare(['nature'], 5)
scraper.export_data()
scraper.close()
schedule.every().day.at("02:00").do(daily_task)
while True:
schedule.run_pending()
time.sleep(60)
💡 最佳实践
1. 性能优化
python
# ✅ 使用并发下载
DOWNLOAD_CONCURRENT = 10 # 增加并发数
# ✅ 避免重复下载
# downloader.py 会跳过已存在的文件;严格的断点续传可参考第8节末尾的 Range 示例
# ✅ 缓存已下载图片的哈希值
ENABLE_DEDUPLICATION = True
2. 数据质量控制
python
# ✅ 严格的数据验证
def validate_data(data: Dict) -> bool:
# 最低分辨率要求
if data.get('width', 0) < 1920:
return False
# 必须有标题
if not data.get('title'):
return False
# 必须有下载链接
if not data.get('download_url'):
return False
return True
# ✅ 过滤低质量图片
MIN_FILE_SIZE = 100 * 1024 # 100KB
MAX_FILE_SIZE = 50 * 1024 * 1024 # 50MB
3. 合规性保障
python
# ✅ 遵守robots.txt
# spider.py中已实现间隔控制
# ✅ API限流
@sleep_and_retry
@limits(calls=50, period=3600)
def _request(self, url, params):
...
# ✅ 标注来源
data['source'] = 'unsplash'
data['author'] = photo['user']['name']
data['author_url'] = photo['user']['links']['html']
📊 数据分析示例
python
import pandas as pd
from core.storage import WallpaperStorage
# 加载数据
storage = WallpaperStorage()
df = pd.read_sql_query('SELECT * FROM wallpapers', storage.conn)
# ========== 分析1: 分辨率分布 ==========
resolution_counts = df['resolution'].value_counts()
print("分辨率分布:")
print(resolution_counts.head(10))
# ========== 分析2: 热门分类 ==========
category_counts = df['category'].value_counts()
print("\n热门分类:")
print(category_counts.head(10))
# ========== 分析3: 色彩偏好 ==========
colors = df['color'].value_counts()
print("\n主流色调:")
print(colors.head(10))
# ========== 分析4: 来源占比 ==========
source_ratio = df['source'].value_counts(normalize=True)
print("\n来源占比:")
print(source_ratio)
# ========== 可视化 ==========
import matplotlib.pyplot as plt  # 需先 pip install matplotlib
# 分辨率饼图
plt.figure(figsize=(10, 6))
resolution_counts.head(5).plot(kind='pie', autopct='%1.1f%%')
plt.title('Top 5 Resolutions')
plt.savefig('data/resolution_pie.png')
🐛 常见问题
Q1: Unsplash API报401错误?
python
# A: 检查Access Key是否正确配置
# config/settings.py中设置:
UNSPLASH_ACCESS_KEY = "your_real_key"
Q2: Wallpaperflare返回空数据?
python
# A: 可能触发反爬,建议:
REQUEST_INTERVAL = 5 # 增加间隔
USE_PROXY = True # 启用代理
Q3: 图片下载失败?
python
# A: 检查网络和磁盘空间
MAX_RETRIES = 5 # 增加重试次数
TIMEOUT = 60 # 增加超时时间
Q4: SQLite数据库锁定?
python
# A: 避免多进程同时写入
# 使用锁机制或改用PostgreSQL
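Q4中"锁机制"的思路可以用一个最小示意来说明(示例写法,非项目现有代码):在多线程环境下,通过同一把 threading.Lock 把写库操作串行化:
python
# 最小示意:用线程锁串行化SQLite写入,避免 database is locked
import threading

db_lock = threading.Lock()

def save_thread_safe(storage, data: dict) -> bool:
    """多线程环境下安全地调用save_wallpaper"""
    with db_lock:
        return storage.save_wallpaper(data)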
1️⃣5️⃣ 总结与展望
✅ 已实现功能
- ✅ 双引擎数据采集(Unsplash API + Wallpaperflare爬虫)
- ✅ 完整的元数据提取(25+字段)
- ✅ 高性能图片下载(并发、断点续传、去重)
- ✅ 多格式数据导出(SQLite、CSV、Excel、JSON)
- ✅ 图片质量分析(色彩、亮度、分辨率)
- ✅ 完善的异常处理和日志记录
- ✅ 合规性保障(限流、间隔、版权标注)
🚀 可扩展方向
- 更多数据源
  - Pexels API
  - Pixabay API
  - Pinterest(需处理动态加载)
- 智能推荐
  - 基于色彩的相似推荐
  - 用户偏好学习
  - 个性化壁纸流
- 图像处理
  - 自动裁剪适配不同分辨率
  - 水印添加/移除
  - 风格迁移
- Web应用
  - Flask/FastAPI后端
  - React前端展示
  - RESTful API
- 机器学习
  - 图像分类模型训练
  - 质量评分模型
  - 标签自动生成
📚 学习收获
通过本项目,你将掌握:
✅ API调用技巧 - 限流、重试、异常处理
✅ 网页爬虫实战 - HTML解析、翻页、反爬
✅ 并发编程 - ThreadPoolExecutor、异步下载
✅ 数据工程 - 清洗、标准化、存储、导出
✅ 项目架构 - 分层设计、模块解耦
✅ 代码质量 - 类型注解、文档注释、单元测试
🎓 推荐资源
- Unsplash API文档 : https://unsplash.com/documentation
- Scrapy爬虫框架 : https://scrapy.org/
- 图像处理库PIL : https://pillow.readthedocs.io/
- 数据分析Pandas : https://pandas.pydata.org/
🎨 祝你采集愉快!如有问题欢迎交流!
🌟 文末
好啦~以上就是本期 《Python爬虫实战》的全部内容啦!如果你在实践过程中遇到任何疑问,欢迎在评论区留言交流,我看到都会尽量回复~咱们下期见!
小伙伴们在批阅的过程中,如果觉得文章不错,欢迎点赞、收藏、关注哦~
三连就是对我写作道路上最好的鼓励与支持! ❤️🔥
📌 专栏持续更新中|建议收藏 + 订阅
专栏 👉 《Python爬虫实战》,我会按照"入门 → 进阶 → 工程化 → 项目落地"的路线持续更新,争取让每一篇都做到:
✅ 讲得清楚(原理)|✅ 跑得起来(代码)|✅ 用得上(场景)|✅ 扛得住(工程化)
📣 想系统提升的小伙伴:强烈建议先订阅专栏,再按目录顺序学习,效率会高很多~

✅ 互动征集
想让我把【某站点/某反爬/某验证码/某分布式方案】写成专栏实战?
评论区留言告诉我你的需求,我会优先安排更新 ✅
⭐️ 若喜欢我,就请关注我叭~(更新不迷路)
⭐️ 若对你有用,就请点赞支持一下叭~(给我一点点动力)
⭐️ 若有疑问,就请评论留言告诉我叭~(我会补坑 & 更新迭代)
免责声明:本文仅用于学习与技术研究,请在合法合规、遵守站点规则与 Robots 协议的前提下使用相关技术。严禁将技术用于任何非法用途或侵害他人权益的行为。