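Web Scraping in Practice: Fetching Taobao Product Details

Taobao is one of China's largest e-commerce platforms, and its anti-scraping measures are strict. Fetching product details takes a combination of HTTP requests, data parsing, and anti-scraping countermeasures. This walkthrough covers environment setup, the implementation, and how to handle those countermeasures.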

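I. Preparation and Environment Setup

1. Tools and Libraries

Python environment (3.8+ recommended)
Main libraries:
- requests: sends HTTP requests to fetch page content
- BeautifulSoup/lxml: parses HTML
- json: handles JSON data
- re: extracts specific information with regular expressions
- selenium/Playwright: handles dynamically loaded content
- fake-useragent: generates random User-Agent strings
Supporting tools:
- Chrome and a matching WebDriver version
- Fiddler/Charles: captures and inspects network traffic
- Postman: tests API endpoints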

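2. Anatomy of a Taobao Product Link

Taobao product links usually look like:
https://item.taobao.com/item.htm?id=ITEM_ID

Tmall product pages use a different host: https://detail.tmall.com/item.htm?id=ITEM_ID

The key parameter is id, the product's unique identifier.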
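
Since both link forms carry the product ID in the id query parameter, it can be pulled out with the standard library. The following is a minimal sketch; the helper name extract_item_id is hypothetical and not part of the original code.

```python
from urllib.parse import urlparse, parse_qs

def extract_item_id(url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    return values[0] if values else None

# Works for both hosts and ignores any other query parameters
print(extract_item_id("https://item.taobao.com/item.htm?id=678901234567"))
print(extract_item_id("https://detail.tmall.com/item.htm?spm=a1&id=567890123456"))
```

Parsing the query string this way avoids a hand-written regular expression and keeps working when extra tracking parameters are present.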

II. A Basic Scraper (with requests)

1. Basic Request Skeleton

python

import requests
from fake_useragent import UserAgent
import time
import random
import re
import json

# Random User-Agent generator
ua = UserAgent()

def get_taobao_item_detail(item_id):
    """Fetch the raw HTML of a product detail page."""
    try:
        # Build the request URL
        url = f"https://detail.tmall.com/item.htm?id={item_id}"

        # Request headers (a key anti-scraping measure).
        # Header values must be latin-1 encodable, so percent-encode
        # any non-ASCII search keyword before putting it in the Referer.
        headers = {
            "User-Agent": ua.random,
            "Referer": "https://search.tmall.com/search?q=keyword",  # placeholder search keyword
            "Accept": "text/html,application/xhtml+xml,application/xml",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "Cookie": "YOUR_COOKIE_STRING"  # Important: a logged-in Cookie exposes more data
        }

        # Send the request, with a random 1-3 second delay to avoid hammering the server
        time.sleep(random.uniform(1, 3))
        response = requests.get(url, headers=headers, timeout=10)

        # Check the response status
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response.text
        else:
            print(f"Request failed, status code: {response.status_code}")
            return None
    except Exception as e:
        print(f"Request error: {e}")
        return None

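2. Parsing the Data (Extracting the Core Fields)

Product data on a Taobao detail page is usually embedded in the HTML as JSON and can be pulled out with a regular expression: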

python

def parse_item_detail(html):
    """Parse the detail-page HTML and extract the key fields."""
    if not html:
        return {}

    try:
        # Extract the embedded JSON. The variable name and the field names below
        # depend on the page version, so adjust them to the page actually received.
        match = re.search(r'g_page_config = (\{.*?\});', html, re.S)
        if match:
            config_json = json.loads(match.group(1))
            item_info = config_json.get('itemInfo', {})
            item = item_info.get('item', {})

            # Core fields (dictionary keys are kept in Chinese, as in the saved output)
            result = {
                "商品ID": item.get('id'),                # item ID
                "商品标题": item.get('title'),            # title
                "商品价格": item.get('price'),            # current price
                "原价": item.get('originalPrice'),       # original price
                "销量": item.get('sales'),               # sales volume
                "库存": item.get('stock'),               # stock
                "商品图片": item.get('image'),            # main image
                "详情页URL": item.get('detailUrl'),       # detail page URL
                "店铺名称": config_json.get('shopInfo', {}).get('name'),  # shop name
                "店铺ID": config_json.get('shopInfo', {}).get('id'),      # shop ID
                "评价数": config_json.get('commentInfo', {}).get('commentCount')  # review count
            }
            return result
        else:
            # Fallback: parse the HTML directly when no embedded JSON is found
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(html, 'lxml')

            def text_of(tag, cls):
                # Return the node's text, or None when the node is missing
                node = soup.find(tag, class_=cls)
                return node.get_text(strip=True) if node else None

            return {
                "商品标题": text_of('h1', 'tb-main-title'),
                "商品价格": text_of('em', 'tb-rmb-num'),
                "销量": text_of('div', 'tb-sell-count')
            }
    except Exception as e:
        print(f"Parse error: {e}")
        return {}
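
A non-greedy regular expression stops at the first `};` it meets, which can fall inside a JSON string value and truncate the match. As an alternative sketch, the standard library's json.JSONDecoder.raw_decode can decode exactly one JSON value starting at a given offset. The helper name extract_embedded_json and the sample HTML are illustrative assumptions, not taken from a real Taobao page.

```python
import json

def extract_embedded_json(html, marker):
    """Decode the JSON object that follows `marker` in `html`; return None on failure."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start + len(marker))
    if brace == -1:
        return None
    try:
        # raw_decode parses one complete JSON value and ignores whatever follows it
        obj, _end = json.JSONDecoder().raw_decode(html, brace)
        return obj
    except json.JSONDecodeError:
        return None

# Made-up sample: the title contains "};", which would cut a non-greedy regex match short
sample_html = '<script>g_page_config = {"itemInfo": {"item": {"title": "demo; with };"}}};</script>'
config = extract_embedded_json(sample_html, "g_page_config = ")
print(config["itemInfo"]["item"]["title"])
```

The decoder tracks nesting and string escapes itself, so the caller only has to locate where the object begins.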

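III. Handling Anti-Scraping Measures (the Hard Part)

Taobao's anti-scraping measures include:

Browser fingerprinting
Cookie validity checks
Slider CAPTCHAs
Dynamically loaded data
Request rate limiting

1. A More Robust Approach: Simulating a Browser with Selenium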

python

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_detail_with_selenium(item_id):
    """Fetch the product detail page through a Selenium-driven browser."""
    chrome_options = Options()
    # Optional: headless mode (no visible browser window)
    # chrome_options.add_argument('--headless')
    chrome_options.add_argument(f'user-agent={ua.random}')  # reuses `ua` from the requests example
    chrome_options.add_argument('--disable-blink-features=AutomationControlled')  # reduce webdriver detection
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # Start the browser
    driver = webdriver.Chrome(options=chrome_options)
    # Hide navigator.webdriver before any page script runs
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })

    try:
        url = f"https://detail.tmall.com/item.htm?id={item_id}"
        driver.get(url)

        # Wait for the page to load (dynamic content may need longer)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'J_DetailMeta'))
        )

        # Scroll to the bottom to trigger lazy-loaded content (such as detail images)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the newly triggered content time to load

        # Grab the page source
        html = driver.page_source
        return html
    except Exception as e:
        print(f"Selenium request error: {e}")
        return None
    finally:
        driver.quit()

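2. Anti-Scraping Optimizations

Cookie management:
- A logged-in Cookie (obtained through QR-code login) gives access to more data
- Use requests.Session() to keep Cookies across requests
Request rate control:
- Random delays (random.uniform(2, 5))
- A cap on requests per minute (for example, at most 20)
IP proxies:
- Use a proxy IP pool (for example, 阿布云 or 快代理)
- Example (using a proxy with requests):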

python

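# Placeholders: replace PROXY_HOST and PORT with a real proxy address
proxies = {
    "http": "http://PROXY_HOST:PORT",
    "https": "http://PROXY_HOST:PORT"  # most HTTP proxies are reached over http://, even for HTTPS traffic
}
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)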

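CAPTCHA handling:
- Complex CAPTCHAs need manual intervention or a CAPTCHA-solving service (for example, 超级鹰)
- Selenium can simulate a human dragging the slider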
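
The per-minute cap mentioned under request rate control can be sketched as a sliding-window limiter. This is one possible design under stated assumptions, not the only way to throttle: the class name RateLimiter and its parameters are hypothetical, and the clock and sleep functions are injectable only to make the class easy to test.

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls=20, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of the calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Window is full: sleep until the oldest call expires, then re-check
            self._sleep(self.period - (now - self._calls[0]))

# Usage: call limiter.wait() right before each request
limiter = RateLimiter(max_calls=20, period=60.0)
for _ in range(3):
    limiter.wait()  # returns immediately while fewer than 20 calls are in the window
```

Combining this cap with the random delay keeps both the average rate and the burst pattern under control.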

IV. A Complete End-to-End Example

python

# 1. Define the list of item IDs
item_ids = ["678901234567", "567890123456"]  # replace with real item IDs

# 2. Fetch the details for each item
all_items = []
for item_id in item_ids:
    print(f"Fetching item ID: {item_id}")

    # Try requests first and fall back to Selenium on failure
    html = get_taobao_item_detail(item_id)
    if not html:
        print("requests failed, trying Selenium...")
        html = get_detail_with_selenium(item_id)

    # Parse the data
    item_data = parse_item_detail(html)
    if item_data:
        all_items.append(item_data)
        print(f"Fetched: {item_data.get('商品标题')}")
    else:
        print(f"Parsing failed, item ID: {item_id}")

    # Pause between items to avoid frequent requests
    time.sleep(random.uniform(3, 6))

# 3. Save the data (here, as a JSON file)
if all_items:
    with open(f"taobao_items_{time.strftime('%Y%m%d')}.json", 'w', encoding='utf-8') as f:
        json.dump(all_items, f, ensure_ascii=False, indent=2)
    print(f"Saved {len(all_items)} item records")

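V. Legal and Compliance Notes

Use scrapers responsibly:
- Follow Taobao's User Agreement and its crawler rules (such as robots.txt)
- Limit request frequency so the platform's servers are not put under load
Restrictions on data use:
- Do not use scraped data for commercial profit or illegal purposes
- Personal information (such as user reviews) must be anonymized
Copyright:
- Product images, descriptions, and similar content are protected by copyright and must not be republished without permission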

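VI. Going Further

Distributed crawling: build a distributed crawling system with Scrapy + Redis
Real-time monitoring: fetch prices and stock levels on a schedule to track changes
Data visualization: chart product data such as price trends and sales distribution
API wrapper: expose the scraper as an API service that business systems can call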
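
For the monitoring direction, the core step is comparing two scheduled snapshots. The sketch below assumes each snapshot is a dictionary keyed by item ID; the function name diff_snapshots, the field names price and stock, and the sample values are all hypothetical.

```python
def diff_snapshots(previous, current, fields=("price", "stock")):
    """Compare two {item_id: {field: value}} snapshots and list the changed fields."""
    changes = []
    for item_id, new in current.items():
        old = previous.get(item_id)
        if old is None:
            continue  # first time this item is seen; nothing to compare against
        for field in fields:
            if old.get(field) != new.get(field):
                changes.append((item_id, field, old.get(field), new.get(field)))
    return changes

# Made-up snapshots from two consecutive runs
yesterday = {"678901234567": {"price": "199.00", "stock": 35}}
today = {"678901234567": {"price": "179.00", "stock": 35}}
for item_id, field, old, new in diff_snapshots(yesterday, today):
    print(f"{item_id}: {field} changed from {old} to {new}")
```

A scheduler (cron, for instance) could run the scraper, load the previous JSON file, and pass both snapshots to this function.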

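The approach above can fetch Taobao product details, but the anti-scraping measures keep changing (page structure updates, stricter detection), so the code needs ongoing maintenance to keep working.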