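Downloading PNG Images from Baidu Image Search with a Python Crawler

In scenarios such as image recognition and training-dataset construction, we often need to download image material from the internet in bulk. Baidu Images (百度图片) is one of the most commonly used sources for Chinese-language search. This article shows how to build a stable, extensible Baidu image crawler in Python that downloads images and saves them in PNG format.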

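I. Project Goals

The goals of this project are:

Search Baidu Images automatically for a given keyword
Download every available image and convert it to PNG
Support paged downloads and avoid saving duplicate images
Run reliably, with error handling
Automatically filter out invalid links and non-image content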

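II. Key Dependencies

The main Python libraries used are:

requests: sends HTTP requests
Pillow (PIL): detects and converts image formats
hashlib: computes the content hashes used as unique file names
io: handles in-memory image streams
json, os, time, and other standard-library modules

requests and Pillow are third-party packages (pip install requests Pillow); the others ship with Python.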

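III. Crawler Design

1. Endpoint Analysis

The script queries Baidu's newer image-search JSON endpoint: https://image.baidu.com/search/acjson

Pagination is driven by the pn parameter (index of the first result on the page) and the gsm parameter (a page verification field). Each page returns 30 results by default, and the script also sets rn=30 explicitly.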
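As a rough illustration of the paging scheme, the sketch below computes the pn offset for each page and builds a reduced parameter set. The helper name build_page_params is hypothetical, the parameter list is trimmed from the full script, and the gsm formula simply mirrors what the script does; it is not a documented rule.

```python
import time

PAGE_SIZE = 30  # results requested per page (the `rn` parameter)


def build_page_params(query, page, page_size=PAGE_SIZE):
    """Return a reduced query-parameter dict for one zero-based result page."""
    return {
        'tn': 'resultjson_com',
        'word': query,
        'queryWord': query,
        'pn': page * page_size,            # index of the first result on this page
        'rn': page_size,                   # number of results requested
        'gsm': hex(50 + page)[2:],         # mirrors the script; treated as opaque here
        'time': int(time.time() * 1000),   # millisecond timestamp
    }


if __name__ == '__main__':
    for page in range(3):
        params = build_page_params('风景', page)
        print(page, params['pn'], params['gsm'])
```

With three pages the offsets are 0, 30, and 60, which is where the estimate of about 90 results for max_pages=3 comes from.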

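2. Core Flow

The complete flow is:

Build the request headers and parameters to mimic a browser visit
Fetch the JSON response for the search results
Extract a usable URL for each image (the script tries hoverURL, thumbURL, and middleURL in that order, then falls back to objURL)
Check that the URL is well formed and request the image
Check whether the data is PNG; if not, convert it with Pillow
Compute the MD5 of the image content to avoid saving duplicates
Save everything as PNG in the target local directory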
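The URL-extraction step can be sketched as a small standalone helper. The name pick_image_url is hypothetical, and the behavior differs slightly from the full script: this version keeps trying the remaining fields when one holds a non-http value, whereas the script skips the record. The objURL handling reuses the script's src= heuristic and is an assumption about that field's shape.

```python
from urllib.parse import unquote

PREVIEW_FIELDS = ('hoverURL', 'thumbURL', 'middleURL')


def pick_image_url(item):
    """Return the first usable http(s) URL from a result record, or None."""
    if not isinstance(item, dict):
        return None
    candidates = [item.get(field) for field in PREVIEW_FIELDS]
    obj_url = item.get('objURL')
    if obj_url:
        # Same heuristic as the script: take the value after the last 'src='
        candidates.append(unquote(obj_url).split('src=')[-1].split('&')[0])
    for url in candidates:
        if isinstance(url, str) and url.startswith(('http://', 'https://')):
            return url
    return None
```

Keeping this logic in one function makes it easy to change the field priority later without touching the download loop.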

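3. File Deduplication

The script uses hashlib.md5 to hash the binary image content and uses the digest as the file name, so byte-identical images are saved only once even when they come from different URLs. For converted images, the hash is computed over the converted PNG bytes.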
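A minimal sketch of this content-addressed saving, assuming a hypothetical helper named save_unique that is not part of the original script. It only catches byte-identical content; the same picture re-encoded at a different quality or size produces a different hash and is saved again.

```python
import hashlib
import os


def save_unique(content, save_dir, ext='.png'):
    """Write bytes to <save_dir>/<md5><ext>; return (path, was_written)."""
    os.makedirs(save_dir, exist_ok=True)
    digest = hashlib.md5(content).hexdigest()
    path = os.path.join(save_dir, digest + ext)
    if os.path.exists(path):
        return path, False  # identical content was saved earlier
    with open(path, 'wb') as f:
        f.write(content)
    return path, True
```

Because the file name is derived from the content, the check also works across separate runs that share the same output directory.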

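IV. Error and Edge-Case Handling

To keep the crawler running reliably, the code includes the following safeguards:

Request timeouts, so a single image cannot stall the run
Content-type checks that skip non-image responses
On JSON parse failure, the raw response is saved to error.json
URL validity checks that skip malformed links
Pillow conversion errors are caught, and images that cannot be parsed are skipped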
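Two of these safeguards can be sketched as pure functions that are easy to test in isolation. The names parse_search_response and looks_like_image are hypothetical. The Content-Type check here is deliberately stricter than the substring test in the full script, so treat it as one possible variant rather than the script's exact behavior.

```python
import json


def parse_search_response(text):
    """Parse the response body; return a dict, or None if it is not usable JSON."""
    cleaned = text.strip().replace("\\'", "'")  # same cleanup as the script
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def looks_like_image(content_type):
    """Return True if a Content-Type header value names an image media type."""
    media_type = content_type.split(';')[0].strip().lower()
    return media_type.startswith('image/')
```

Returning None instead of raising lets the caller decide what to do with a bad page, for example saving the raw text for inspection and moving on.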

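V. Conversion Logic in Detail

If a downloaded image is not PNG, it is converted automatically with Pillow:

Images in P or LA mode are converted to RGBA
Images in CMYK mode are converted to RGB
All images are saved as .png, which suits uses such as image training and compression optimization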
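The mode rules above can be sketched as a pair of helpers. The names png_target_mode and to_png_bytes are hypothetical, and the sketch assumes Pillow is installed when a real conversion is needed. It mirrors the rules listed here and does not handle every mode Pillow can open.

```python
import io

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first 8 bytes of every PNG file


def png_target_mode(mode):
    """Return the mode to convert to before saving as PNG, or None to keep it."""
    if mode in ('P', 'LA'):
        return 'RGBA'
    if mode == 'CMYK':
        return 'RGB'
    return None


def to_png_bytes(raw):
    """Return PNG-encoded bytes for raw image data; PNG input is returned unchanged."""
    if raw.startswith(PNG_SIGNATURE):
        return raw
    from PIL import Image  # imported lazily so the helpers above work without Pillow
    image = Image.open(io.BytesIO(raw))
    target = png_target_mode(image.mode)
    if target:
        image = image.convert(target)
    buffer = io.BytesIO()
    image.save(buffer, format='PNG')
    return buffer.getvalue()
```

Separating the mode decision from the encoding step keeps the rule table testable without any image data.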

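VI. Sample Run

With the keyword "风景" (scenery) and max_pages=3, the script fetches roughly 90 images, converts them, and saves them as PNG files. After each page it waits 5 seconds to reduce the risk of being blocked.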

VII. Full Script

The script is shown below:

import requests
import os
import time
import hashlib
import json
import io
from urllib.parse import unquote
from PIL import Image  # requires Pillow: pip install Pillow


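def download_baidu_png_images(query, max_pages=3, save_dir='./data'):
    """
    Download Baidu image search results and save them as PNG
    (non-PNG images are converted automatically).
    """
    # Create the output directory
    os.makedirs(save_dir, exist_ok=True)

    # Browser-like request headers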
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.46',
        'Referer': 'https://image.baidu.com/',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8'
    }

    # Search endpoint (newer JSON interface)
    base_url = 'https://image.baidu.com/search/acjson'
    downloaded = 0  # count of successfully saved images

    for page in range(max_pages):
        try:
            # Per-page parameters
            pn = page * 30
            gsm = hex(50 + page)[2:]  # gsm value derived from the page number

            params = {
                'tn': 'resultjson_com',
                'ipn': 'rj',
                'ct': 201326592,
                'is': '',
                'fp': 'result',
                'queryWord': query,
                'cl': 2,
                'lm': -1,
                'ie': 'utf-8',
                'oe': 'utf-8',
                'adpicid': '',
                'st': -1,
                'z': '',
                'ic': '',
                'hd': '',
                'latest': '',
                'copyright': '',
                'word': query,
                's': '',
                'se': '',
                'tab': '',
                'width': '',
                'height': '',
                'face': 0,
                'istype': 2,
                'qc': '',
                'nc': 1,
                'fr': '',
                'expermode': '',
                'pn': pn,
                'rn': 30,
                'gsm': gsm,
                'time': int(time.time() * 1000)
            }

            response = requests.get(base_url, headers=headers, params=params, timeout=15)
            response.encoding = 'utf-8'
            print(f"Page {page + 1} response status: {response.status_code}")

            try:
                data = json.loads(response.text.strip().replace("\\'", "'"))
            except json.JSONDecodeError as e:
                print(f"JSON parsing failed: {e}")
                with open('error.json', 'w', encoding='utf-8') as f:
                    f.write(response.text)
                print("Saved the failing response to error.json")
                continue

            # Debug: print the number of records
            print(f"Received {len(data.get('data', []))} records")

            for index, item in enumerate(data.get('data', [])):
                if not isinstance(item, dict):
                    continue

                # Try the URL fields in order: hoverURL, thumbURL, middleURL
                img_url = item.get('hoverURL') or item.get('thumbURL') or item.get('middleURL')
                if not img_url:
                    # Fall back to objURL, which may be obfuscated
                    if 'objURL' in item:
                        try:
                            img_url = unquote(item['objURL']).split('src=')[-1].split('&')[0]
                        except Exception:
                            continue

                if not img_url:
                    print(f"Record {index}: no usable URL found")
                    continue

                # Check the URL scheme
                if not img_url.startswith(('http://', 'https://')):
                    print(f"Invalid URL format: {img_url[:50]}...")
                    continue

                try:
                    # Separate connect (10 s) and read (20 s) timeouts for the image download
                    img_res = requests.get(img_url, headers=headers, timeout=(10, 20))
                    if img_res.status_code != 200:
                        print(f"Download failed: HTTP {img_res.status_code} - {img_url[:50]}...")
                        continue

                    # Skip responses whose Content-Type is not an image
                    content_type = img_res.headers.get('Content-Type', '')
                    if 'image' not in content_type:
                        print(f"Not an image: {content_type} - {img_url[:50]}...")
                        continue

                    # Check for the 8-byte PNG signature
                    is_png = img_res.content.startswith(b'\x89PNG\r\n\x1a\n')

                    if is_png:
                        # Save PNG data as-is
                        file_hash = hashlib.md5(img_res.content).hexdigest()
                        filename = os.path.join(save_dir, f"{file_hash}.png")

                        if not os.path.exists(filename):
                            with open(filename, 'wb') as f:
                                f.write(img_res.content)
                            downloaded += 1
                            print(f"Saved PNG: {filename}")
                        else:
                            print(f"PNG file already exists: {filename}")
                    else:
                        # Convert other formats to PNG
                        try:
                            image = Image.open(io.BytesIO(img_res.content))
                            original_format = image.format  # convert() returns a copy whose format is None

                            # Normalize image modes before saving
                            if image.mode in ('P', 'LA'):
                                image = image.convert('RGBA')
                            elif image.mode == 'CMYK':
                                image = image.convert('RGB')

                            # Encode the converted image as PNG in memory
                            img_byte_arr = io.BytesIO()
                            image.save(img_byte_arr, format='PNG')
                            converted_content = img_byte_arr.getvalue()

                            # Hash the PNG bytes and save
                            file_hash = hashlib.md5(converted_content).hexdigest()
                            filename = os.path.join(save_dir, f"{file_hash}.png")

                            if not os.path.exists(filename):
                                with open(filename, 'wb') as f:
                                    f.write(converted_content)
                                downloaded += 1
                                print(f"Converted and saved PNG: {filename} (original format: {original_format})")
                            else:
                                print(f"PNG file already exists: {filename}")

                        except Exception as convert_error:
                            print(f"Format conversion failed: {img_url[:50]}... error: {convert_error}")
                            continue

                except Exception as e:
                    print(f"Download failed: {img_url[:50]}... error: {e}")
                    continue

            print(f"Page {page + 1} done, {downloaded} images saved so far, waiting 5 seconds...")
            time.sleep(5)

        except Exception as e:
            print(f"Page processing error: {e}")
            continue

    print(f"\nTotal PNG images saved: {downloaded}")


if __name__ == '__main__':
    download_baidu_png_images(
        query="风景",  # search keyword
        max_pages=3,  # number of result pages to fetch (about 30 results each)
        save_dir='./data'  # output directory
    )

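VIII. Conclusion

A Baidu image crawler is not complicated in itself, but it has to cope with practical issues such as endpoint changes, image format handling, and anti-crawling measures. With a sensible structure and thorough error handling, we can build a stable, efficient image-fetching tool that provides a data foundation for deep-learning and computer-vision projects.

The current framework can also be extended into a modular design, a multithreaded version, or an API service.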