1. Technology Choice: Why Python Instead of Java
| Dimension | Python | Java |
|---|---|---|
| Development speed | The script is the service: `python crawler.py` runs immediately | Requires Gradle/Maven setup |
| Dynamic pages | Playwright waits for JS out of the box and renders much faster than Selenium | HtmlUnit is full of pitfalls; Selenium is heavyweight |
| Data science | Pandas + Jupyter for instant analysis, seamless handoff to ML | Extra data-format conversion required |
| Ops cost | Native serverless support (Lambda / cloud functions) | Must package a jar or container image |
In one sentence: "Python is fast for prototyping, Java is stable in production; use Python for the research phase, and consider a Java rewrite only if QPS goes through the roof after launch."
2. Architecture at a Glance (3 Minutes)
```
┌-----------------------------------┐
| JD product detail page HTML/JSON  |
└----------------┬------------------┘
                 │ 1. Random UA + residential proxy pool
                 ▼
┌-----------------------------------┐
| Parsing layer (Playwright)        |
| auto-wait / retry / circuit break |
└----------------┬------------------┘
                 │ 2. Field cleaning
                 ▼
┌-----------------------------------┐
| Storage layer (CSV / SQLite)      |
| incremental / version control     |
└----------------┬------------------┘
                 │ 3. Monitoring & alerts
                 ▼
        Feishu group + Grafana
```
3. Pre-Development Setup (5 Minutes)
- Environment: Python 3.11 + Poetry + Playwright
- One-time dependency install:

  ```bash
  pip install poetry
  poetry init --python="^3.11" --no-interaction
  poetry add playwright pandas loguru tenacity fake-useragent aiofiles
  playwright install chromium  # downloads the browser binary
  ```

- Target fields & CSS selectors:

| Field | Selector |
|---|---|
| Title | `div.sku-name` |
| Price | `span.price-now` |
| Image | `img#spec-img` |
| Comment count | `div.comment-count` |
| Shop | `div.shop-name` |
| Stock | `div.stock-txt` |
4. MVP: Up and Running in 150 Lines
A single-file script: async concurrency across 10 SKUs, automatic retries on 429, results written straight to `jd_detail.csv`.
```python
import asyncio, csv, random, re
from pathlib import Path
from playwright.async_api import async_playwright
from loguru import logger
from fake_useragent import UserAgent
import pandas as pd

CONCURRENCY = 10
RETRY = 3
TIMEOUT = 35_000
RESULT = "jd_detail.csv"
HEADERS = ["sku", "title", "price", "image", "comment", "shop", "stock", "scrape_time"]

async def scrape_one(page, sku: str) -> dict:
    url = f"https://item.jd.com/{sku}.html"
    logger.info("🚀 Scraping {}", sku)
    for attempt in range(1, RETRY + 1):
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=TIMEOUT)
            await page.wait_for_selector("div.sku-name", timeout=8000)
            break
        except Exception as e:
            logger.warning("⚠️ {} attempt {} failed: {}", sku, attempt, e)
            await asyncio.sleep(random.uniform(2, 4))
    else:
        logger.error("❌ {} exceeded max retries", sku)
        return {"sku": sku, "title": "N/A"}
    # Field extraction
    title = await page.locator("div.sku-name").first.text_content() or "N/A"
    price = await page.locator("span.price-now").first.text_content() or "N/A"
    image = await page.locator("img#spec-img").first.get_attribute("src") or ""
    if image and not image.startswith("http"):
        image = "https:" + image
    comment = await page.locator("div.comment-count").first.text_content() or "0"
    comment = re.sub(r"\D", "", comment)
    shop = await page.locator("div.shop-name").first.text_content() or "N/A"
    stock = await page.locator("div.stock-txt").first.text_content() or "N/A"
    return {
        "sku": sku,
        "title": title.strip(),
        "price": price.strip(),
        "image": image,
        "comment": comment,
        "shop": shop.strip(),
        "stock": stock.strip(),
        "scrape_time": pd.Timestamp.now(tz="UTC").isoformat(),
    }

async def worker(queue: asyncio.Queue, playwright):
    browser = await playwright.chromium.launch(headless=True)
    ua = UserAgent()
    context = await browser.new_context(user_agent=ua.random, locale="zh-CN")
    page = await context.new_page()
    while True:
        sku = await queue.get()
        if sku is None:  # sentinel: shut this worker down
            queue.task_done()
            break
        data = await scrape_one(page, sku)
        await save_one(data)
        queue.task_done()
    await context.close()
    await browser.close()

async def save_one(data):
    file_exists = Path(RESULT).exists()
    with open(RESULT, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=HEADERS)
        if not file_exists:
            writer.writeheader()
        writer.writerow(data)
    logger.info("💾 Saved {}", data["sku"])

async def main():
    skus = ["100035288046", "100012043978", "100012043979"]  # swap in a list of thousands
    queue = asyncio.Queue()
    for s in skus:
        await queue.put(s)
    async with async_playwright() as p:
        tasks = [asyncio.create_task(worker(queue, p)) for _ in range(CONCURRENCY)]
        await queue.join()
        for _ in tasks:
            await queue.put(None)
        await asyncio.gather(*tasks)
    logger.success(">>> All done, results in {}", RESULT)

if __name__ == "__main__":
    asyncio.run(main())
```
Sample run:

```
2025-10-22 08:12:10 | 🚀 Scraping 100035288046
2025-10-22 08:12:13 | 💾 Saved 100035288046
...
2025-10-22 08:12:25 | >>> All done, results in jd_detail.csv
```
CSV preview:

| sku | title | price | image | comment | shop | stock | scrape_time |
|---|---|---|---|---|---|---|---|
| 100035288046 | Apple iPhone 15 128GB Blue | 5999.00 | https://img10.360buyimg.com/... | 50000 | JD self-operated flagship store | In stock | 2025-10-22T08:12:13 |
5. Three Anti-Bot Essentials to Keep Your Crawler Alive
- Residential proxy pool
  Paid options: BrightData, Oxylabs, IPRoyal (SOCKS5 supported).
  At the code level, just rotate proxies via `browser.new_context(proxy={"server": "http://user:pass@ip:port"})`.
- Browser fingerprint randomization
  - Randomize the UA, viewport, timezone, and WebGL vendor on every Playwright launch
  - Hide the WebDriver flag: `navigator.webdriver = undefined`
  - Block images/CSS for speed: `page.route("**/*.{png,jpg,css}", lambda route: route.abort())`
- Rate limiting + retries
  - At most 1 request per second per IP; random sleep of 2–5 s
  - On HTTP 429, exponential backoff 1s→2s→4s→8s, at most 5 attempts
  - A `tenacity` decorator gives you the retry in one line:

    ```python
    from tenacity import retry, wait_exponential, stop_after_attempt

    @retry(wait=wait_exponential(multiplier=1, min=1, max=60), stop=stop_after_attempt(5))
    async def scrape_one(page, sku):
        ...
    ```
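The fingerprint-randomization idea above can be sketched as a small helper that builds randomized kwargs for Playwright's `browser.new_context()`. The UA/viewport/timezone pools below are illustrative samples, not a vetted fingerprint set:

```python
import random

# Illustrative pools; in practice keep these larger and up to date
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]
VIEWPORTS = [{"width": 1920, "height": 1080}, {"width": 1366, "height": 768}]
TIMEZONES = ["Asia/Shanghai", "Asia/Chongqing"]

def random_context_kwargs() -> dict:
    """Build randomized keyword arguments for browser.new_context()."""
    return {
        "user_agent": random.choice(UA_POOL),
        "viewport": random.choice(VIEWPORTS),
        "timezone_id": random.choice(TIMEZONES),
        "locale": "zh-CN",
    }
```

Each worker can then call `context = await browser.new_context(**random_context_kwargs())` so every browser context presents a different surface.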
6. Feeding the Data to the Business: 4 Real Scenarios
- Product selection
  Run 1,000 SKUs on a 06:00 daily schedule, use Pandas to compute "yesterday's Top 10 price drops", and push the list to a Feishu group so ops can act as soon as they get to work.
- Dynamic pricing
  Feed the scraped JD prices and follow-seller counts into your own pricing algorithm to auto-adjust ERP prices and stay competitive on price.
- Stock alerts
  Monitor the `stock` field; the moment it reads "only 2 left", send an email immediately: time to ramp up ads and grab the traffic!
- Review sentiment analysis
  Collect `comment` together with `reviewText`, score sentiment with SnowNLP / TextBlob, find the keywords in 1–2 star reviews, and use them to improve the product manual.
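The "yesterday's Top 10 price drops" step in the first scenario can be sketched with Pandas. The column names and toy snapshots are assumptions; in practice both frames would come from two days of `jd_detail.csv`:

```python
import pandas as pd

# Two daily price snapshots (toy data standing in for two CSV dumps)
yesterday = pd.DataFrame({"sku": ["A", "B", "C"], "price": [100.0, 200.0, 300.0]})
today = pd.DataFrame({"sku": ["A", "B", "C"], "price": [90.0, 210.0, 240.0]})

# Join on SKU, compute the drop, keep only actual decreases
merged = yesterday.merge(today, on="sku", suffixes=("_old", "_new"))
merged["drop"] = merged["price_old"] - merged["price_new"]
top_drops = (merged[merged["drop"] > 0]
             .sort_values("drop", ascending=False)
             .head(10))
print(top_drops[["sku", "drop"]])
```

The resulting frame can be serialized straight into a Feishu webhook payload for the morning push.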
7. One-Command Docker Deployment
```dockerfile
FROM mcr.microsoft.com/playwright/python:v1.42.0-focal
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY crawler.py .
CMD ["python", "crawler.py"]
```

Since the dependencies above are managed with Poetry, generate `requirements.txt` first with `poetry export -f requirements.txt -o requirements.txt` (on newer Poetry versions this may require the poetry-plugin-export plugin).
Build & run:

```bash
docker build -t jd-py .
docker run --rm -v $(pwd)/data:/app/data jd-py
```
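One way to schedule the daily 06:00 run mentioned in the business scenarios is a host crontab entry; the host paths and log location here are assumptions:

```
# Run the containerized crawler every day at 06:00, appending logs
0 6 * * * docker run --rm -v /srv/jd/data:/app/data jd-py >> /srv/jd/cron.log 2>&1
```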
8. A Closing "Stay Out of Jail" Reminder
JD data falls under China's Cybersecurity Law and Anti-Unfair Competition Law. Be sure to:
- Scrape only pages that are publicly visible and require no login;
- Respect robots.txt (JD allows `/item/`, but keep your request rate reasonable);
- Use the data for internal business analysis only; never republish it, resell it, or expose it as a public API;
- In production, prefer the official API (open.jd.com).
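The robots.txt point can be enforced in code with the standard library. The rules below are illustrative, not JD's actual robots.txt, which you would fetch from the site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; in practice fetch and parse the site's real robots.txt
rules = """
User-agent: *
Disallow: /cart/
Allow: /item/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://item.jd.com/item/100035288046.html"))  # True
print(parser.can_fetch("*", "https://www.jd.com/cart/checkout"))            # False
```

Calling `can_fetch()` before enqueueing each URL is a cheap guard that keeps the crawler within the site's stated policy.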