Customizing Scrapy's ImagesPipeline and FilesPipeline

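Scrapy, a powerful crawling framework in the Python ecosystem, ships with two media pipelines, ImagesPipeline and FilesPipeline, dedicated to downloading images and files. The default configuration covers basic cases, but real projects often need custom storage paths, file-type filtering, or download error handling. This article explains how to customize both pipelines.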

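I. Core Concepts and Basic Usage

1. What the pipelines do

ImagesPipeline: specialized for image downloads. It validates images (for example, rejecting files that Pillow cannot open), generates thumbnails, and converts downloaded images to a common format.
FilesPipeline: a general-purpose download pipeline for files of any format (PDF, video, archives, and so on). ImagesPipeline is a subclass of FilesPipeline, so the two share most of their logic and differ mainly in what they process.

2. Basic configuration (using ImagesPipeline as an example)

First install the dependencies (image processing requires Pillow):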

pip install scrapy pillow

Then set the basic options in settings.py:

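# Enable the image pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Root directory for downloaded images (absolute or relative path)
IMAGES_STORE = './downloaded_images'
# Note: Scrapy has no IMAGES_ALLOWED_EXTENSIONS setting; to restrict
# extensions, filter the URLs in the spider or in get_media_requests
# Download timeout in seconds (a global setting; there is no image-specific one)
DOWNLOAD_TIMEOUT = 15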

The spider must yield an Item with an image_urls field (FilesPipeline uses file_urls):

import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # required: list of image URLs to download
    images = scrapy.Field()      # filled in after download (path, URL, checksum, ...)
    title = scrapy.Field()       # custom field, e.g. the image title

class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://example.com/image-page']

    def parse(self, response):
        item = ImageItem()
        # Extract the image URLs from the page and make them absolute
        item['image_urls'] = [
            response.urljoin(src)
            for src in response.css('img::attr(src)').getall()
        ]
        item['title'] = response.css('h1::text').get()
        yield item

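II. Core Customization Scenarios and Implementations

The default pipelines have limits: the storage path is fixed (derived from a SHA-1 hash of the URL), files cannot be named dynamically from Item fields, and there is no custom error handling. We address these by subclassing ImagesPipeline/FilesPipeline and overriding their core methods.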
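
To make the default layout concrete, here is a minimal sketch of how the stock ImagesPipeline derives a path: it hashes the request URL with SHA-1 and stores the image as full/<hash>.jpg. The function name default_image_path is hypothetical; it only mirrors that behavior and is not Scrapy's own code.

```python
import hashlib

def default_image_path(url: str) -> str:
    """Mimic the default ImagesPipeline layout: full/<sha1 of URL>.jpg."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{image_guid}.jpg"

path = default_image_path("https://example.com/cat.png")
print(path)  # a 40-character hex name, always with a .jpg extension
```

The path depends only on the URL, which is why neither the Item's title nor the original file name shows up in it.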

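Scenario 1: Custom storage paths and file names

Requirement: store images as "spider name / title / title + extension" rather than under the default hash-based path.

Implementation (custom ImagesPipeline):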

# pipelines.py
import os
from urllib.parse import urlparse

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class CustomImagesPipeline(ImagesPipeline):
    # Override to customize download requests (add headers, rewrite URLs, etc.)
    def get_media_requests(self, item, info):
        # Create one request per image URL and carry the Item fields along in meta
        for image_url in item['image_urls']:
            yield scrapy.Request(
                image_url,
                meta={
                    'title': item.get('title') or 'untitled',  # title, with a fallback
                    'spider_name': info.spider.name  # spider name
                }
            )

    # Override to customize the storage path and file name
    def file_path(self, request, response=None, info=None, *, item=None):
        # 1. Take the file extension from the URL path
        url_path = urlparse(request.url).path
        file_ext = os.path.splitext(url_path)[-1]
        # Fall back to .jpg when the URL has no extension
        if not file_ext:
            file_ext = '.jpg'

        # 2. Read the custom fields from meta
        spider_name = request.meta['spider_name']
        title = request.meta['title'].replace('/', '_').replace('\\', '_')  # strip path separators

        # 3. Build the path: spider name/title/title + extension
        # Simplified: images that share a title and extension overwrite each other;
        # append an index or a hash of the URL to keep them apart
        file_name = f"{title}{file_ext}"
        return os.path.join(spider_name, title, file_name)

    # Override to post-process the Item once its downloads have finished
    def item_completed(self, results, item, info):
        # results format: [(success, {'url': '', 'path': '', 'checksum': ''}), ...]
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            # No image downloaded successfully, so drop the Item
            raise DropItem(f"Item {item.get('title')} has no valid images")
        # Store the downloaded paths on the Item
        item['images'] = image_paths
        return item
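
Because one Item usually carries several image URLs under a single title, a title-only file name can collide. The sketch below shows one possible way to build a collision-resistant name: sanitize the title, then append a short hash of the URL. The helper safe_file_name, the 80-character cap, and the 8-character digest are illustrative assumptions, not part of Scrapy.

```python
import hashlib
import os
import re
from urllib.parse import urlparse

# Characters that are invalid in file names on Windows or POSIX systems
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')

def safe_file_name(title, url, default_ext=".jpg"):
    """Build '<clean title>_<short hash><ext>' from an item title and image URL."""
    clean = _ILLEGAL.sub("_", title or "untitled").strip(" .") or "untitled"
    ext = os.path.splitext(urlparse(url).path)[1] or default_ext
    # A short hash of the URL keeps images with the same title apart
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{clean[:80]}_{digest}{ext}"

name_a = safe_file_name("Cats/Dogs: 2024?", "https://example.com/img/1.png")
name_b = safe_file_name("Cats/Dogs: 2024?", "https://example.com/img/2.png")
print(name_a)
print(name_b)
```

Inside file_path, the last line could then return os.path.join(spider_name, title, safe_file_name(title, request.url)), so that two images from the same Item no longer share a path.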

The FilesPipeline equivalent follows the same pattern (subclass FilesPipeline instead):

import os
from urllib.parse import urlparse

import scrapy
from scrapy.pipelines.files import FilesPipeline

class CustomFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:
            yield scrapy.Request(
                file_url,
                meta={
                    'file_name': item['file_name'],
                    'category': item['category']
                }
            )

    def file_path(self, request, response=None, info=None, *, item=None):
        category = request.meta['category']
        file_name = request.meta['file_name']
        url_path = urlparse(request.url).path
        file_ext = os.path.splitext(url_path)[-1]
        return os.path.join(category, f"{file_name}{file_ext}")
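
Download URLs for generic files often have no extension at all (for example /download?id=7), in which case the path above ends up without a suffix. As a rough sketch, one could fall back to the Content-Type header and then to a default. The helper pick_extension and the ".bin" default are assumptions made for illustration.

```python
import mimetypes
import os
from urllib.parse import urlparse

def pick_extension(url, content_type=None, default=".bin"):
    """Pick an extension from the URL path, then the Content-Type, then a default."""
    ext = os.path.splitext(urlparse(url).path)[1]
    if ext:
        return ext.lower()
    if content_type:
        # Drop parameters such as '; charset=utf-8' before the lookup
        mime = content_type.split(";")[0].strip().lower()
        guessed = mimetypes.guess_extension(mime)
        if guessed:
            return guessed
    return default

print(pick_extension("https://example.com/a/report.PDF"))
print(pick_extension("https://example.com/download?id=7", "application/pdf; charset=binary"))
print(pick_extension("https://example.com/download"))
```

One caveat: file_path may be called with response=None, so the Content-Type is not always available there, and the path should come out the same whether or not a response is passed. Carrying a known extension in the Item or in request.meta is the more predictable option.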

Enable the custom pipelines in settings.py:

ITEM_PIPELINES = {
    # Use the custom pipeline in place of the default one
    'your_project_name.pipelines.CustomImagesPipeline': 1,
    # 'your_project_name.pipelines.CustomFilesPipeline': 2,  # same for the files pipeline
}
IMAGES_STORE = './custom_images'  # custom storage root
# FilesPipeline requires FILES_STORE
# FILES_STORE = './custom_files'

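Scenario 2: Filtering invalid files / custom download conditions

Requirement: keep only images larger than 10 KB and discard files that are too small (often placeholders or truncated downloads).

Extend the custom pipeline with a file-size check: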

# Replace item_completed in CustomImagesPipeline with this version
def item_completed(self, results, item, info):
    valid_images = []
    for ok, result in results:
        if not ok:
            continue
        # Absolute path of the file (self.store.basedir exists only for local filesystem storage)
        file_path = os.path.join(self.store.basedir, result['path'])
        # Check the file size (10 KB = 10 * 1024 bytes)
        if os.path.getsize(file_path) < 10 * 1024:
            os.remove(file_path)  # delete the small file
            continue
        valid_images.append(result['path'])

    if not valid_images:
        raise DropItem(f"Item {item['title']} has no sufficiently large images")
    item['images'] = valid_images
    return item
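
The size check itself does not depend on Scrapy, so it can be tried in isolation. The following sketch applies the same 10 KB rule to two throwaway files; the helper name drop_small_files and the dummy file contents are assumptions made for the demonstration.

```python
import os
import tempfile

MIN_BYTES = 10 * 1024  # 10 KB threshold, same as in the pipeline above

def drop_small_files(paths, min_bytes=MIN_BYTES):
    """Delete files smaller than min_bytes and return the paths that remain."""
    kept = []
    for path in paths:
        if os.path.getsize(path) < min_bytes:
            os.remove(path)
        else:
            kept.append(path)
    return kept

# Try it on two throwaway files: one of 1 KB, one of 20 KB
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.jpg")
    large = os.path.join(tmp, "large.jpg")
    with open(small, "wb") as f:
        f.write(b"\0" * 1024)
    with open(large, "wb") as f:
        f.write(b"\0" * 20 * 1024)
    kept = drop_small_files([small, large])
    small_exists = os.path.exists(small)

print([os.path.basename(p) for p in kept])
```

Note that a byte-size threshold only catches small files; it says nothing about whether a large file is a valid image. For dimension-based filtering, ImagesPipeline also offers the IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT settings.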

Scenario 3: Generating image thumbnails

ImagesPipeline has built-in thumbnail generation, which only needs to be configured in settings.py:

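# Enable thumbnails
IMAGES_THUMBS = {
    'small': (50, 50),    # small: fits within 50x50 (aspect ratio preserved)
    'big': (200, 200),    # big: fits within 200x200
}
# Thumbnails are stored under thumbs/<size name>/ inside IMAGES_STORE by default; override thumb_path to change this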
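
The sizes in IMAGES_THUMBS are bounding boxes, not exact output dimensions. The sketch below approximates how an image is scaled to fit such a box while keeping its aspect ratio. The function fit_within is hypothetical, and Pillow's own rounding may differ by a pixel in edge cases.

```python
def fit_within(size, box):
    """Scale (width, height) down to fit inside box, keeping the aspect ratio."""
    width, height = size
    max_w, max_h = box
    # Never enlarge: thumbnail generation only shrinks images
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within((400, 200), (50, 50)))    # (50, 25)
print(fit_within((300, 600), (200, 200)))  # (100, 200)
print(fit_within((40, 30), (50, 50)))      # (40, 30)
```

So a 400x200 source with the 'small' setting yields a 50x25 thumbnail rather than a distorted 50x50 one.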

III. Common Problems and Caveats

Relative URLs: some image URLs are relative and must be resolved into absolute URLs, either in the spider or in get_media_requests:

from urllib.parse import urljoin

def get_media_requests(self, item, info):
    base_url = 'https://example.com'
    for image_url in item['image_urls']:
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        image_url = urljoin(base_url, image_url)
        yield scrapy.Request(image_url, meta={'title': item['title']})
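
Plain string concatenation breaks on several common URL forms, so it helps to see how the standard library's urljoin treats them. This is a small self-contained check; the page URL and image paths are made-up examples.

```python
from urllib.parse import urljoin

page_url = "https://example.com/gallery/page.html"

resolved = [
    urljoin(page_url, "/img/a.jpg"),                   # root-relative
    urljoin(page_url, "img/b.jpg"),                    # relative to the page's directory
    urljoin(page_url, "//cdn.example.com/c.jpg"),      # protocol-relative
    urljoin(page_url, "https://other.example/d.jpg"),  # already absolute
]
for url in resolved:
    print(url)
# https://example.com/img/a.jpg
# https://example.com/gallery/img/b.jpg
# https://cdn.example.com/c.jpg
# https://other.example/d.jpg
```

Resolving against the URL of the page the image was found on (in a spider, response.urljoin does this) is more reliable than resolving against a hard-coded domain.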

Anti-scraping measures: download requests may need a User-Agent, cookies, and similar headers, which can be set in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...'
# Or pass headers to each Request in get_media_requests
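
For the per-request variant, a small helper can assemble the headers. This is only a sketch: build_media_headers is a hypothetical function, and the choice of a Referer header (some image hosts reject requests that lack one) is an assumption about the target site.

```python
def build_media_headers(page_url, user_agent=None):
    """Return headers for an image request; Referer is the page the image came from."""
    headers = {"Referer": page_url, "Accept": "image/*,*/*;q=0.8"}
    if user_agent:
        # Only override the globally configured USER_AGENT when asked to
        headers["User-Agent"] = user_agent
    return headers

headers = build_media_headers("https://example.com/image-page")
print(headers)
```

In get_media_requests the result would be passed as scrapy.Request(image_url, headers=headers), with the page URL carried on the Item in a field of your choosing.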

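Error handling: failed downloads (network timeouts, missing files, and so on) arrive in item_completed as (False, Failure) entries in results, where they can be logged. Pipelines have no self.logger attribute, so use a module-level logger:

import logging

logger = logging.getLogger(__name__)

def item_completed(self, results, item, info):
    for ok, value in results:
        if not ok:
            # value is a twisted Failure describing why the download failed
            logger.error("Download failed for item %r: %s", item.get('title'), value.value)
    item['images'] = [value['path'] for ok, value in results if ok]
    return item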
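
The shape of results is easiest to see with sample data. The sketch below separates successful entries from failed ones; split_results is a hypothetical helper, and the sample values are illustrative only (a real failure entry is a twisted Failure, not a plain Exception).

```python
def split_results(results):
    """Separate media pipeline results into downloaded file infos and failures."""
    downloaded = [value for ok, value in results if ok]
    failures = [value for ok, value in results if not ok]
    return downloaded, failures

# Illustrative data only, standing in for what a pipeline would receive
sample_results = [
    (True, {"url": "https://example.com/a.jpg", "path": "full/a.jpg", "checksum": "0" * 32}),
    (False, Exception("download timed out")),
]
downloaded, failures = split_results(sample_results)
print([info["path"] for info in downloaded])  # ['full/a.jpg']
print(len(failures))                          # 1
```

Keeping both lists makes it possible to store the successful paths on the Item while still logging or counting the failures.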

Concurrency control: to avoid an IP ban from downloading too fast, limit the download concurrency:

# settings.py
CONCURRENT_REQUESTS = 16  # global concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # per-domain concurrency
DOWNLOAD_DELAY = 1  # wait 1 second between requests to the same site

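IV. ImagesPipeline vs FilesPipeline: Key Differences

Feature	ImagesPipeline	FilesPipeline
Handles	Images only (JPG/PNG/GIF, etc.)	Any file (PDF, zip, video, etc.)
Dependencies	Requires Pillow	No extra dependencies
Special features	Image validation, thumbnail generation	None; general-purpose file download
Key settings	IMAGES_STORE, IMAGES_THUMBS	FILES_STORE, FILES_EXPIRES
Default Item fields	image_urls, images	file_urls, files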

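Summary

Customizing ImagesPipeline and FilesPipeline comes down to subclassing and overriding three methods: get_media_requests (customize requests), file_path (customize paths), and item_completed (process results).
When customizing paths, strip characters that are illegal in file names to avoid path errors; when downloading, add error handling and file validity checks.
ImagesPipeline suits image downloads (thumbnails, image validation), FilesPipeline suits general file downloads, and most customization logic carries over between the two.

Customizing these two pipelines covers most file and image download needs in a crawler while staying within Scrapy's conventions, which keeps the code maintainable and stable.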