1. Technology Choice: Why Python Instead of Java
| Dimension | Python | Java |
|---|---|---|
| Development speed | The script is the service: `python crawler.py` runs immediately | Requires Gradle/Maven setup |
| Dynamic pages | Playwright waits for JS out of the box and renders much faster than Selenium | HtmlUnit is full of pitfalls; Selenium is heavyweight |
| Data science | Pandas + Jupyter for instant analysis, seamless handoff to ML | Extra data-format conversion required |
| Ops cost | Native serverless support (Lambda / cloud functions) | Must package a jar or container image |
In one sentence: "Python is fast for prototyping, Java is stable in production; use Python for the research phase, and consider a Java rewrite only if QPS goes through the roof after launch."
2. Architecture at a Glance (3 Minutes)
```
┌-----------------------------------┐
| JD product detail page HTML/JSON  |
└----------------┬------------------┘
                 │ 1. Random UA + residential proxy pool
                 ▼
┌-----------------------------------┐
| Parsing layer (Playwright)        |
| auto-wait / retry / circuit break |
└----------------┬------------------┘
                 │ 2. Field cleaning
                 ▼
┌-----------------------------------┐
| Storage layer (CSV / SQLite)      |
| incremental / version control     |
└----------------┬------------------┘
                 │ 3. Monitoring & alerts
                 ▼
        Feishu group + Grafana
```
3. Pre-Development Setup (5 Minutes)
- Environment: Python 3.11 + Poetry + Playwright
- One-time dependency install:

  ```bash
  pip install poetry
  poetry init --python="^3.11" --no-interaction
  poetry add playwright pandas loguru tenacity fake-useragent aiofiles
  playwright install chromium  # downloads the browser binary
  ```

- Target fields & CSS selectors:

| Field | Selector |
|---|---|
| Title | `div.sku-name` |
| Price | `span.price-now` |
| Image | `img#spec-img` |
| Comment count | `div.comment-count` |
| Shop | `div.shop-name` |
| Stock | `div.stock-txt` |
4. MVP: Up and Running in 150 Lines
A single-file script: async concurrency across 10 SKUs, automatic retries on 429, results written straight to `jd_detail.csv`.
```python
import asyncio, csv, random, re
from pathlib import Path
from playwright.async_api import async_playwright
from loguru import logger
from fake_useragent import UserAgent
import pandas as pd

CONCURRENCY = 10
RETRY = 3
TIMEOUT = 35_000
RESULT = "jd_detail.csv"
HEADERS = ["sku", "title", "price", "image", "comment", "shop", "stock", "scrape_time"]

async def scrape_one(page, sku: str) -> dict:
    url = f"https://item.jd.com/{sku}.html"
    logger.info("🚀 Scraping {}", sku)
    for attempt in range(1, RETRY + 1):
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=TIMEOUT)
            await page.wait_for_selector("div.sku-name", timeout=8000)
            break
        except Exception as e:
            logger.warning("⚠️ {} attempt {} failed: {}", sku, attempt, e)
            await asyncio.sleep(random.uniform(2, 4))
    else:
        logger.error("❌ {} exceeded max retries", sku)
        return {"sku": sku, "title": "N/A"}
    # Field extraction
    title = await page.locator("div.sku-name").first.text_content() or "N/A"
    price = await page.locator("span.price-now").first.text_content() or "N/A"
    image = await page.locator("img#spec-img").first.get_attribute("src") or ""
    if image and not image.startswith("http"):
        image = "https:" + image
    comment = await page.locator("div.comment-count").first.text_content() or "0"
    comment = re.sub(r"\D", "", comment)
    shop = await page.locator("div.shop-name").first.text_content() or "N/A"
    stock = await page.locator("div.stock-txt").first.text_content() or "N/A"
    return {
        "sku": sku,
        "title": title.strip(),
        "price": price.strip(),
        "image": image,
        "comment": comment,
        "shop": shop.strip(),
        "stock": stock.strip(),
        "scrape_time": pd.Timestamp.now(tz="UTC").isoformat(),
    }

async def worker(queue: asyncio.Queue, playwright):
    browser = await playwright.chromium.launch(headless=True)
    ua = UserAgent()
    context = await browser.new_context(user_agent=ua.random, locale="zh-CN")
    page = await context.new_page()
    while True:
        sku = await queue.get()
        if sku is None:  # sentinel: shut this worker down
            queue.task_done()
            break
        data = await scrape_one(page, sku)
        await save_one(data)
        queue.task_done()
    await context.close()
    await browser.close()

async def save_one(data):
    file_exists = Path(RESULT).exists()
    with open(RESULT, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=HEADERS)
        if not file_exists:
            writer.writeheader()
        writer.writerow(data)
    logger.info("💾 Saved {}", data["sku"])

async def main():
    skus = ["100035288046", "100012043978", "100012043979"]  # swap in a list of thousands
    queue = asyncio.Queue()
    for s in skus:
        await queue.put(s)
    async with async_playwright() as p:
        tasks = [asyncio.create_task(worker(queue, p)) for _ in range(CONCURRENCY)]
        await queue.join()
        for _ in tasks:
            await queue.put(None)
        await asyncio.gather(*tasks)
    logger.success(">>> All done, results in {}", RESULT)

if __name__ == "__main__":
    asyncio.run(main())
```
Sample run:

```
2025-10-22 08:12:10 | 🚀 Scraping 100035288046
2025-10-22 08:12:13 | 💾 Saved 100035288046
...
2025-10-22 08:12:25 | >>> All done, results in jd_detail.csv
```
CSV preview:

| sku | title | price | image | comment | shop | stock | scrape_time |
|---|---|---|---|---|---|---|---|
| 100035288046 | Apple iPhone 15 128GB Blue | 5999.00 | https://img10.360buyimg.com/... | 50000 | JD self-operated flagship store | In stock | 2025-10-22T08:12:13 |
5. Three Anti-Bot Essentials to Keep Your Crawler Alive
- Residential proxy pool
  Paid options: BrightData, Oxylabs, IPRoyal (SOCKS5 supported).
  At the code level, just rotate proxies via `browser.new_context(proxy={"server": "http://user:pass@ip:port"})`.
- Browser fingerprint randomization
  - Randomize the UA, viewport, timezone, and WebGL vendor on every Playwright launch
  - Hide the WebDriver flag: `navigator.webdriver = undefined`
  - Block images/CSS for speed: `page.route("**/*.{png,jpg,css}", lambda route: route.abort())`
- Rate limiting + retries
  - At most 1 request per second per IP; random sleep of 2–5 s
  - On HTTP 429, exponential backoff 1s→2s→4s→8s, at most 5 attempts
  - A `tenacity` decorator gives you the retry in one line:

    ```python
    from tenacity import retry, wait_exponential, stop_after_attempt

    @retry(wait=wait_exponential(multiplier=1, min=1, max=60), stop=stop_after_attempt(5))
    async def scrape_one(page, sku):
        ...
    ```
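The fingerprint-randomization idea above can be sketched as a small helper that builds randomized kwargs for Playwright's `browser.new_context()`. The UA/viewport/timezone pools below are illustrative samples, not a vetted fingerprint set:

```python
import random

# Illustrative pools; in practice keep these larger and up to date
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]
VIEWPORTS = [{"width": 1920, "height": 1080}, {"width": 1366, "height": 768}]
TIMEZONES = ["Asia/Shanghai", "Asia/Chongqing"]

def random_context_kwargs() -> dict:
    """Build randomized keyword arguments for browser.new_context()."""
    return {
        "user_agent": random.choice(UA_POOL),
        "viewport": random.choice(VIEWPORTS),
        "timezone_id": random.choice(TIMEZONES),
        "locale": "zh-CN",
    }
```

Each worker can then call `context = await browser.new_context(**random_context_kwargs())` so every browser context presents a different surface.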
6. Feeding the Data to the Business: 4 Real Scenarios
- Product selection
  Run 1,000 SKUs on a 06:00 daily schedule, use Pandas to compute "yesterday's Top 10 price drops", and push the list to a Feishu group so ops can act as soon as they get to work.
- Dynamic pricing
  Feed the scraped JD prices and follow-seller counts into your own pricing algorithm to auto-adjust ERP prices and stay competitive on price.
- Stock alerts
  Monitor the `stock` field; the moment it reads "only 2 left", send an email immediately: time to ramp up ads and grab the traffic!
- Review sentiment analysis
  Collect `comment` together with `reviewText`, score sentiment with SnowNLP / TextBlob, find the keywords in 1–2 star reviews, and use them to improve the product manual.
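The "yesterday's Top 10 price drops" step in the first scenario can be sketched with Pandas. The column names and toy snapshots are assumptions; in practice both frames would come from two days of `jd_detail.csv`:

```python
import pandas as pd

# Two daily price snapshots (toy data standing in for two CSV dumps)
yesterday = pd.DataFrame({"sku": ["A", "B", "C"], "price": [100.0, 200.0, 300.0]})
today = pd.DataFrame({"sku": ["A", "B", "C"], "price": [90.0, 210.0, 240.0]})

# Join on SKU, compute the drop, keep only actual decreases
merged = yesterday.merge(today, on="sku", suffixes=("_old", "_new"))
merged["drop"] = merged["price_old"] - merged["price_new"]
top_drops = (merged[merged["drop"] > 0]
             .sort_values("drop", ascending=False)
             .head(10))
print(top_drops[["sku", "drop"]])
```

The resulting frame can be serialized straight into a Feishu webhook payload for the morning push.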
7. One-Command Docker Deployment
```dockerfile
FROM mcr.microsoft.com/playwright/python:v1.42.0-focal
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY crawler.py .
CMD ["python", "crawler.py"]
```

Since the dependencies above are managed with Poetry, generate `requirements.txt` first with `poetry export -f requirements.txt -o requirements.txt` (on newer Poetry versions this may require the poetry-plugin-export plugin).
Build & run:

```bash
docker build -t jd-py .
docker run --rm -v $(pwd)/data:/app/data jd-py
```
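One way to schedule the daily 06:00 run mentioned in the business scenarios is a host crontab entry; the host paths and log location here are assumptions:

```
# Run the containerized crawler every day at 06:00, appending logs
0 6 * * * docker run --rm -v /srv/jd/data:/app/data jd-py >> /srv/jd/cron.log 2>&1
```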
8. A Closing "Stay Out of Jail" Reminder
JD data falls under China's Cybersecurity Law and Anti-Unfair Competition Law. Be sure to:
- Scrape only pages that are publicly visible and require no login;
- Respect robots.txt (JD allows `/item/`, but keep your request rate reasonable);
- Use the data for internal business analysis only; never republish it, resell it, or expose it as a public API;
- In production, prefer the official API (open.jd.com).
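The robots.txt point can be enforced in code with the standard library. The rules below are illustrative, not JD's actual robots.txt, which you would fetch from the site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; in practice fetch and parse the site's real robots.txt
rules = """
User-agent: *
Disallow: /cart/
Allow: /item/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://item.jd.com/item/100035288046.html"))  # True
print(parser.can_fetch("*", "https://www.jd.com/cart/checkout"))            # False
```

Calling `can_fetch()` before enqueueing each URL is a cheap guard that keeps the crawler within the site's stated policy.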