Python爬虫第15节-2025今日头条街拍美图抓取实战

一、项目背景与概述

二、环境准备与工具配置

[2.1 开发环境要求](#2.1 开发环境要求)

[2.2 辅助工具配置](#2.2 辅助工具配置)

三、详细抓取流程解析

[3.1 页面加载机制分析](#3.1 页面加载机制分析)

[3.2 关键请求识别技巧](#3.2 关键请求识别技巧)

[3.3 参数规律深度分析](#3.3 参数规律深度分析)

四、爬虫代码实现

五、实现关键

六、法律与道德规范

一、项目概述

在当今互联网时代，数据采集与分析已成为重要的技术能力。本教程将以今日头条街拍美图为例，详细介绍如何通过Python实现Ajax数据抓取的完整流程。

今日头条作为国内领先的内容平台，其图片内容资源丰富，特别是街拍类图片深受用户喜爱。通过分析其Ajax接口，我们可以学习到现代Web应用的数据加载方式，掌握处理动态内容的爬虫开发技巧。

二、环境准备与工具配置

2.1 开发环境要求

Python环境：建议使用Python 3.6及以上版本

可通过官网下载安装：https://www.python.org/downloads/
安装后验证：`python --version`

必需库安装：

bash 复制代码

   pip install requests urllib3 hashlib multiprocessing

推荐开发工具：

IDE：PyCharm、VS Code
浏览器：Chrome（用于开发者工具分析）
调试工具：Postman（用于接口测试）

2.2 辅助工具配置

Chrome开发者工具：

快捷键F12或右键"检查"打开
重点使用Network面板和Elements面板
建议开启"Preserve log"选项保留请求记录

代理设置（可选）：

略

三、详细抓取流程解析

3.1 页面加载机制分析

现代Web应用普遍采用前后端分离架构，理解其数据加载原理至关重要：

传统网页：服务端渲染，HTML包含所有内容
现代SPA：初始HTML为空壳，通过Ajax动态加载数据

下面以今日头条为例分析：
首次加载基础框架

在今日头条首页（ https://www.toutiao.com/），直接输入街拍搜索，然后如下点击图片选项，就会出现大把的美女图片。

这时按ctrl+s即可保存所有已加载的网页内容，保存下来的网页可以直接搜到jpeg的图片链接。但是我们现在要查看接口请求，我们按F12打开开发者工具，重新加载页面，查看所有的网络请求。

首先，查看发起第一个网络请求，请求的URL是 ( https://so.toutiao.com/search?dvpf=pc\&source=search_subtab_switch\&keyword=街拍\&pd=atlas\&action_type=search_subtab_switch\&page_num=0\&search_id=\&from=gallery\&cur_tab_title=gallery )。然后打开Preview选项卡，查看响应体内容。要是页面上的内容是依据第一个请求的结果渲染出来的。

由于查看他的请求头内容Headers没有明确的 X-Requested-With 请求头，看不出是ajax请求。

通过XHR请求获取实际内容

然而我点击Fetch/XHR进行过滤ajax请求，并快速向下滚动页面就会发现下面某接口：

https://so.toutiao.com/search?dvpf=pc\&source=search_subtab_switch\&keyword=街拍\&pd=atlas\&action_type=search_subtab_switch\&page_num=1\&search_id=20250416152239B787556EC505238FF4F8\&from=gallery\&cur_tab_title=gallery\&rawJSON=1

其中请求头内容的sec-fetch-mode: cors 和其他请求头（如 accept: */*）以及请求的行为模式（如通过查询参数发送数据），这些都是AJAX 请求特征。

然后点击Preview查看响应数据的数据结构

看得出为了防止连续被爬取图片，参数page_num和search_id是有限制关系的，即使page_num有规律，但是search_id却另外计算获取。

3.2 关键请求识别技巧

在开发者工具的Network面板中：

过滤XHR请求：快速定位Ajax调用
Preview功能：直观查看JSON数据结构
Headers分析：

Request URL：接口地址
Request Method：GET/POST
Query Parameters：请求参数

时序分析：通过Waterfall了解加载顺序

3.3 参数规律深度分析

从提供的 URL，我们可以解析出一系列的查询参数（query parameters），每个参数都有其特定的作用：

`dvpf=pc`: 这个参数可能指示请求是从个人计算机（PC）发出的，可能用于区分不同设备类型的请求，如移动端或平板电脑。
`source=search_subtab_switch`: 这个参数可能表示请求的来源，这里是"search_subtab_switch"，可能是指用户在搜索结果的子标签页之间进行切换时触发的请求。
`keyword=%E8%A1%97%E6%8B%8D`: 这是经过URL编码的搜索关键词，解码后是"街拍"。这是用户在搜索框中输入的关键词。
`pd=atlas`: 这个参数的具体含义不是很清晰，它可能指的是请求的特定部分或页面（Page Division）。
`action_type=search_subtab_switch`: 类似于 `source` 参数，这个参数可能指明了用户执行的动作类型，这里是在搜索子标签页之间切换。
`page_num=1`: 这个参数指示请求的是搜索结果的第几页，这里是第2页，从0开始算。
`search_id=20250416152239B787556EC505238FF4F8`: 这是一个唯一的搜索会话标识符，用于追踪用户的搜索会话。
`from=gallery`: 这个参数可能指示用户是从图库部分发起搜索的。
`cur_tab_title=gallery`: 这个参数可能表示当前活动的标签页标题是图库。
`rawJSON=1`: 这个参数指示服务器返回的数据格式应该是原始的 JSON 格式，而不是经过任何服务器端处理的数据。

每个参数都是键值对的形式，通过 `&` 符号连接。服务器会解析这些参数，以便根据用户的请求提供定制化的内容。例如，服务器可以根据 `keyword` 参数提供搜索关键词的相关结果，`page_num` 参数用来分页显示结果，而 `search_id` 用来保持搜索会话的连续性。

四、爬虫代码实现

模拟请求的时候需要设置必要的请求头参数，包括cookie等，不然请求拿不到数据。因为search_id暂时还没找到解决办法，只能获取某一页数据。由于search_id会变，所以下面拿了最新的模拟请求：

python 复制代码

import requests
import re
import os
from hashlib import md5
import json
from urllib.parse import urlencode
import time
import gzip
import io

def get_page(page_num):
    base_url = 'https://so.toutiao.com/search?'
    params = {
        'dvpf': 'pc',
        'source': 'search_subtab_switch',
        'keyword': '街拍',
        'pd': 'atlas',
        'action_type': 'search_subtab_switch',
        'page_num': page_num,
        'search_id': '202504161626333A15E72F0B54BF7C39AB',
        'from': 'gallery',
        'cur_tab_title': 'gallery',
        'rawJSON': '1'
    }
    
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': 'tt_webid=7493748220343043611; _tea_utm_cache_4916=undefined; _S_DPR=1; _S_IPAD=0; s_v_web_id=verify_m9jdj7kh_AAv9h1Yy_tQ7d_4YCM_Bi1J_NMRjG0VhwXwy; notRedShot=1; _ga=GA1.1.1977312639.1744775745; n_mh=cLfs9nL5b4PcW0-N8UbVRswdJYzGe1l4yne0DeQP69A; passport_auth_status=7434e2c0910f588f9110b55667f76765%2C39b0d143549b4c6761ea2d3398afbe9a; passport_auth_status_ss=7434e2c0910f588f9110b55667f76765%2C39b0d143549b4c6761ea2d3398afbe9a; sid_guard=4337463570ade0a3133b7705c077800d%7C1744776132%7C5184002%7CSun%2C+15-Jun-2025+04%3A02%3A14+GMT; uid_tt=47365c817295ff0854b31381946d71b2; uid_tt_ss=47365c817295ff0854b31381946d71b2; sid_tt=4337463570ade0a3133b7705c077800d; sessionid=4337463570ade0a3133b7705c077800d; sessionid_ss=4337463570ade0a3133b7705c077800d; _S_WIN_WH=1920_396; __ac_nonce=067ff6999008e39e36643; __ac_signature=_02B4Z6wo00f01lt23uAAAIDD6btFq9NvMj5bVtpAAPEx9d; __ac_referer=https://so.toutiao.com/search?dvpf=pc&source=input&keyword=%E8%A1%97%E6%8B%8D',
        'Host': 'so.toutiao.com',
        'Referer': 'https://so.toutiao.com/search?dvpf=pc&source=search_subtab_switch&keyword=%E8%A1%97%E6%8B%8D&pd=atlas&action_type=search_subtab_switch&page_num=0&search_id=&from=gallery&cur_tab_title=gallery',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"'
    }
    
    url = base_url + urlencode(params)
    try:
        # 直接请求数据
        response = requests.get(url, headers=headers, stream=True)
        
        if response.status_code == 200:
            try:
                # 尝试直接解析响应内容
                content = response.text
                try:
                    json_data = json.loads(content)
                    return json_data
                except json.JSONDecodeError:
                    print(f"Failed to decode JSON. Response content: {content[:200]}...")
                    # 如果解析失败，尝试解码原始内容
                    raw_content = response.content.decode('utf-8', errors='ignore')
                    print(f"Raw content: {raw_content[:200]}...")
                    return None
                    
            except Exception as e:
                print(f"Error processing response: {str(e)}")
                print(f"Response headers: {response.headers}")
                return None
        else:
            print(f"Request failed with status code: {response.status_code}")
            return None
    except requests.ConnectionError as e:
        print('Error occurred while fetching the page:', e)
        return None

def parse_images(json_data):
    if not json_data or 'rawData' not in json_data:
        return
    
    # 解析返回的JSON数据
    data = json_data['rawData'].get('data', [])
    for item in data:
        if item.get('text') and (item.get('img_url') or item.get('original_image_url')):
            title = item.get('text', '')
            # 优先使用img_url
            image_url = item.get('img_url', '')
            if not image_url:
                image_url = item.get('original_image_url', '')
            yield {
                'title': title,
                'image': image_url,
                'width': item.get('width', 0),
                'height': item.get('height', 0)
            }

def save_image(item):
    if not item.get('title') or not item.get('image'):
        return
    
    # 清理文件名中的非法字符
    title = re.sub(r'[\\/:*?"<>|]', '_', item.get('title'))
    img_path = 'img' + os.path.sep + title
    
    if not os.path.exists(img_path):
        os.makedirs(img_path)
    
    max_retries = 3
    retry_count = 0
    
    original_url = item.get('image')
    
    # 构建可能的URL列表
    image_urls = []
    if 'pstatp.com' in original_url or 'ttcdn-tos' in original_url:
        # 移除协议头以便替换域名
        url_without_protocol = original_url.split('://')[-1]
        
        # 替换域名
        domains = [
            'p3-search.byteimg.com',
            'p1-search.byteimg.com',
            'p6-search.byteimg.com',
            'p3-tt.byteimg.com',
            'p1-tt.byteimg.com',
            'p6-tt.byteimg.com'
        ]
        
        # 替换图片格式和尺寸
        formats = [
            '~640x640.jpeg',
            '~480x480.jpeg',
            '~noop.image',
            '~tplv-tt-cs0.jpeg'
        ]
        
        base_url = url_without_protocol.split('~')[0]
        for domain in domains:
            for fmt in formats:
                url = f'https://{domain}/{base_url.split("/", 1)[1]}{fmt}'
                if url not in image_urls:
                    image_urls.append(url)
    
    # 始终添加原始URL作为最后的选项
    if original_url not in image_urls:
        image_urls.append(original_url)
    
    # 请求头
    img_headers = {
        'Accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Referer': 'https://so.toutiao.com/',
        'Origin': 'https://so.toutiao.com',
        'Sec-Fetch-Dest': 'image',
        'Sec-Fetch-Mode': 'no-cors',
        'Sec-Fetch-Site': 'cross-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Cookie': 'tt_webid=7493748220343043611; _tea_utm_cache_4916=undefined; _S_DPR=1; _S_IPAD=0; s_v_web_id=verify_m9jdj7kh_AAv9h1Yy_tQ7d_4YCM_Bi1J_NMRjG0VhwXwy'
    }
    
    success = False
    for image_url in image_urls:
        if retry_count >= max_retries:
            break
            
        try:
            # 使用stream=True来分块下载
            response = requests.get(image_url, headers=img_headers, stream=True, timeout=10)
            if response.status_code == 200:
                try:
                    # 使用响应内容的前8192字节来计算MD5
                    first_chunk = next(response.iter_content(chunk_size=8192))
                    file_name = md5(first_chunk).hexdigest()
                    file_path = img_path + os.path.sep + f'{file_name}.jpg'
                    
                    if not os.path.exists(file_path):
                        # 分块写入文件，先写入第一块
                        with open(file_path, 'wb') as f:
                            f.write(first_chunk)
                            # 继续写入剩余的块
                            for chunk in response.iter_content(chunk_size=8192):
                                if chunk:
                                    f.write(chunk)
                        print('Downloaded image path is %s' % file_path)
                        print(f'Image size: {item.get("width")}x{item.get("height")}')
                    else:
                        print('Already Downloaded', file_path)
                    success = True
                    break
                except Exception as e:
                    print(f'Error processing image data: {str(e)}')
                    continue
            else:
                print(f"Failed to download image, status code: {response.status_code}")
                print(f"Image URL: {image_url}")
                
        except Exception as e:
            print('Error downloading image:', str(e))
            print(f"Image URL: {image_url}")
        
        retry_count += 1
        if not success and retry_count < max_retries:
            print(f"Retrying... ({retry_count}/{max_retries})")
            time.sleep(2)
    
    if not success:
        print(f"Failed to download image after trying all URL variations")

def main(page):
    json_data = get_page(page)
    if json_data:
        for item in parse_images(json_data):
            save_image(item)
    else:
        print(f"Failed to get data for page {page}")

if __name__ == '__main__':
    # 爬取第1页
    print('Starting page 1...')
    main(1)
    print('Page 1 completed')

爬取结果：

五、实现关键

请求头模拟

python 复制代码

headers = {
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Referer': 'https://so.toutiao.com/',
    'Origin': 'https://so.toutiao.com',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24"...',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"'
}

模拟真实浏览器的请求头
添加 Referer 和 Origin 防盗链
设置正确的 Sec-Fetch-* 系列头
使用最新的 Chrome User-Agent

Cookie 处理

python 复制代码

'Cookie': 'tt_webid=7493748220343043611; _tea_utm_cache_4916=undefined; s_v_web_id=verify_m9jdj7kh_AAv9h1Yy_tQ7d_4YCM_Bi1J_NMRjG0VhwXwy; ...'

使用有效的会话 Cookie
包含必要的认证信息
保持登录状态

多域名策略

python 复制代码

domains = [
    'p3-search.byteimg.com',
    'p1-search.byteimg.com',
    'p6-search.byteimg.com',
    'p3-tt.byteimg.com',
    'p1-tt.byteimg.com',
    'p6-tt.byteimg.com'
]

尝试多个 CDN 域名
在域名被封禁时自动切换
使用不同的图片服务器

URL 格式变换

python 复制代码

formats = [
    '~640x640.jpeg',
    '~480x480.jpeg',
    '~noop.image',
    '~tplv-tt-cs0.jpeg'
]

尝试不同的图片格式和尺寸
使用多种图片后缀
适应不同的图片处理参数

请求延时和重试

python 复制代码

time.sleep(2)  # 增加等待时间
max_retries = 3  # 最大重试次数

添加请求间隔，避免频率过高
实现重试机制
错误时优雅退出

分块下载

python 复制代码

response = requests.get(url, headers=headers, stream=True)
for chunk in response.iter_content(chunk_size=8192):
    if chunk:
        f.write(chunk)

使用流式下载
分块处理大文件
避免内存占用过大

错误处理

python 复制代码

try:
    # 下载尝试
except Exception as e:
    print('Error downloading image:', str(e))
    print(f"Image URL: {image_url}")

详细的错误信息记录
异常捕获和处理
失败时的优雅降级

URL 优先级策略

python 复制代码

# 优先使用img_url
image_url = item.get('img_url', '')
if not image_url:
    image_url = item.get('original_image_url', '')

优先使用缩略图 URL
备选原始图片 URL
多重 URL 备份

请求超时设置

python 复制代码

response = requests.get(image_url, headers=img_headers, stream=True, timeout=10)

设置合理的超时时间
避免请求挂起
提高程序稳定性

文件处理优化

python 复制代码

file_name = md5(first_chunk).hexdigest()
if not os.path.exists(file_path):
    # 写入文件

使用 MD5 避免重复下载
检查文件是否存在
文件名清理和规范化

由于反爬虫的限制，上面的关键步骤提高了爬虫的成功率和稳定性，同时也避免了对目标服务器造成过大压力。

后续优化：后续可以尝试分页爬取

六、法律与道德规范

robots.txt检查：

访问 `http://www.toutiao.com/robots.txt\`
遵守爬虫协议规定

合理使用原则：

控制请求频率
不进行商业性使用
尊重版权信息

数据使用建议：

仅用于学习研究
不存储敏感信息
及时删除原始数据