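Complete product data for a JD.com (京东) shop is a core data source for e-commerce operations, competitor analysis, and supply-chain optimization. Obtaining it depends on several linked layers of interfaces: the shop basic-information interface, the product category interface, the full SKU interface, and the supplementary product-detail interface. For shop product interfaces JD applies a three-part risk-control system: shop-level permission checks, tiered request-frequency limits, and throttling of returned fields. Traditional single-interface crawling therefore tends to return incomplete data or get the interface blocked. This article proposes an end-to-end scheme of layer-penetrating collection, linked data completion, and asset-oriented restructuring, aimed at obtaining a shop's full product data and raising its value.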
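I. Core Interface Mechanisms and Tiered Risk Controls
JD shop product data is spread across several related interfaces, and each interface corresponds to a different shop-level permission (public shop home page data, category page product data, SKU detail data). The core characteristics and risk-control logic are as follows:
1. Multi-level interface chain and core parameters
Fetching every product in a shop requires the full chain "shop information check → category retrieval → product collection per category → SKU detail completion". The core interface and parameters of each step are: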
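| Chain step | Core interface | Core parameters | Data scope | Risk-control characteristics |
|---|---|---|---|---|
| Shop information check | Open platform: jingdong.shop.read.getShopInfo; Web: shop.jd.com/shopHome/index.action | Open platform: vendorId (shop ID), shopId; Web: callback (dynamic parameter) | Shop name, type, main category, rating | Public data, no permission restriction, lenient request frequency |
| Category retrieval | Web: shop.jd.com/json/promotion/queryCategoryTree.json | shopId, _ (timestamp) | Custom shop categories, category IDs, product counts | Requires a shop-visit cookie; available without a login session |
| Product collection per category | Web: shop.jd.com/json/promotion/queryCategoryProductList.json | shopId, categoryId, pageNo, pageSize, callback | Product ID, name, price, sales, and image for products in the category | At most 50 pages per category; more than 100 pages per IP per day triggers throttling |
| SKU detail completion | Open platform: jingdong.item.read.get; Web: item.jd.com/xxx.html (parse the page) | Open platform: skuId, fields; Web: skuId, uuid | SKU specifications, stock, promotions, merchant services | Open platform requires authorization; Web requires simulating user browsing behavior |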
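The two numeric limits in the table (50 pages per category, 100 pages per IP per day) can be turned into a simple budget check before any request is sent. The sketch below is only an illustration: `PageBudget` and `pages_to_fetch` are hypothetical helpers, not part of the scheme's components, and the limits are taken from the table as stated values rather than guarantees. Daily reset is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Limits quoted in the table; treat them as observed values, not guarantees.
MAX_PAGES_PER_CATEGORY = 50
MAX_PAGES_PER_IP_PER_DAY = 100


class PageBudget:
    """Hypothetical helper: counts list pages fetched through each proxy."""

    def __init__(self, proxies: List[str], daily_limit: int = MAX_PAGES_PER_IP_PER_DAY):
        self.proxies = proxies
        self.daily_limit = daily_limit
        self.used: Dict[str, int] = defaultdict(int)

    def acquire(self) -> Optional[str]:
        """Return a proxy that still has budget and charge it one page, else None."""
        for proxy in self.proxies:
            if self.used[proxy] < self.daily_limit:
                self.used[proxy] += 1
                return proxy
        return None


def pages_to_fetch(product_count: int, page_size: int = 20) -> int:
    """Pages needed for one category, capped at the per-category ceiling."""
    needed = -(-product_count // page_size)  # ceiling division
    return min(needed, MAX_PAGES_PER_CATEGORY)


budget = PageBudget(["proxy-a", "proxy-b"], daily_limit=2)
print([budget.acquire() for _ in range(5)])
print(pages_to_fetch(45), pages_to_fetch(5000))
```

When `acquire` returns `None`, every proxy has used up its budget, so the crawl should pause rather than keep requesting.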
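2. Key breakthrough points
- Shop-level permission penetration: JD shops display products by category level, and different categories map to different interface parameters. Traditional approaches tend to miss products in multi-level categories, so the category tree must be traversed recursively and products collected at every level.
- Web dynamic parameter handling: the shop product interfaces carry a dynamic callback parameter (random string plus timestamp) and a _ parameter (timestamp). Hard-coding them tends to produce a 403 response, so they must be generated at request time.
- Multi-interface data completion: the category page interface returns only basic product information. SKU specifications, stock, and other details must be filled in from the open platform or Web detail interfaces, which requires mapping product IDs to SKUs.
- Pagination limit handling: a single category exposes at most 50 pages of products. Shops with many categories need multithreaded asynchronous collection for throughput, combined with an IP pool and cookie rotation to stay under the per-IP limit.
- Product data asset restructuring: the scattered basic information, category information, specification details, and promotion data are merged into structured assets that support multi-dimensional retrieval and value analysis.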
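The ID-to-SKU mapping mentioned in the third point can be illustrated with a small merge step. This is a minimal sketch under assumptions: `merge_by_sku` is a hypothetical helper, the field names (`sku_id`, `spec_info`, `stock_info`, `brand_info`, `complete_status`) follow those used in this article's code, and the `"missing"` status value is invented here for illustration.

```python
from typing import Dict, List


def merge_by_sku(base_products: List[Dict], details: Dict[str, Dict]) -> List[Dict]:
    """Join category-page records with detail records keyed by SKU ID."""
    merged = []
    for product in base_products:
        sku_id = str(product.get("sku_id", ""))
        record = dict(product)  # copy so the input list is left untouched
        detail = details.get(sku_id)
        if detail is None:
            # No detail was fetched for this SKU: keep empty placeholders
            record.update({"spec_info": [], "stock_info": {}, "brand_info": "",
                           "complete_status": "missing"})
        else:
            record.update(detail)
            record["complete_status"] = "success"
        merged.append(record)
    return merged


base = [{"sku_id": "1001", "product_name": "Phone case"},
        {"sku_id": "1002", "product_name": "Charger"}]
details = {"1001": {"spec_info": [], "stock_info": {"stock": 12}, "brand_info": "Acme"}}
for row in merge_by_sku(base, details):
    print(row["sku_id"], row["complete_status"])
```

Keying the detail records by SKU ID keeps the merge independent of the order in which detail requests complete.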

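II. Implementation of the Technical Scheme
The scheme consists of four components: a dynamic parameter generator, a recursive shop category scraper, a multi-source product data completer, and a product data asset reconstructor. Together they cover the full chain from category traversal to product asset restructuring.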
1. Dynamic parameter generator (core breakthrough)
This component handles the dynamic-parameter validation of JD's shop Web interfaces by generating the callback, _, and related parameters in the expected format, so that requests are accepted:

```python
import hashlib
import random
import time
from typing import Dict


class JdShopDynamicParamGenerator:
    def __init__(self):
        self.callback_prefix = "jQuery"  # fixed prefix of the Web callback
        self.callback_length = 16        # length of the random part of the callback

    def generate_callback(self) -> str:
        """Generate the dynamic callback parameter (format: jQuery + random string + _ + timestamp)."""
        random_str = ''.join(random.choices("0123456789abcdefghijklmnopqrstuvwxyz", k=self.callback_length))
        timestamp = str(int(time.time() * 1000))
        return f"{self.callback_prefix}{random_str}_{timestamp}"

    def generate_timestamp_param(self) -> str:
        """Generate the _ parameter (millisecond timestamp, required by the Web interfaces)."""
        return str(int(time.time() * 1000))

    def generate_uuid(self) -> str:
        """Generate a Web device identifier uuid (simulates a real device to lower risk-control scoring)."""
        return hashlib.md5(f"{time.time()}{random.random()}".encode()).hexdigest()

    def build_category_product_params(self, shop_id: str, category_id: str,
                                      page_no: int = 1, page_size: int = 20) -> Dict:
        """Build the full parameter set for the category product list interface."""
        return {
            "shopId": shop_id,
            "categoryId": category_id,
            "pageNo": page_no,
            "pageSize": page_size,
            "callback": self.generate_callback(),
            "_": self.generate_timestamp_param()
        }

    def build_category_tree_params(self, shop_id: str) -> Dict:
        """Build the parameters for the category tree interface."""
        return {
            "shopId": shop_id,
            "callback": self.generate_callback(),
            "_": self.generate_timestamp_param()
        }
```
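To see what these parameters look like on the wire, the following standalone sketch rebuilds the callback format described above and encodes a query string. `make_callback` is a hypothetical stand-in for `generate_callback`, and the shop and category IDs are placeholder values.

```python
import random
import re
import string
import time
from urllib.parse import urlencode


def make_callback(length: int = 16) -> str:
    """Hypothetical stand-in: 'jQuery' + random [0-9a-z] string + '_' + millisecond timestamp."""
    alphabet = string.digits + string.ascii_lowercase
    random_str = "".join(random.choices(alphabet, k=length))
    return f"jQuery{random_str}_{int(time.time() * 1000)}"


params = {
    "shopId": "1000123456",   # placeholder shop ID
    "categoryId": "42",       # placeholder category ID
    "pageNo": 1,
    "pageSize": 20,
    "callback": make_callback(),
    "_": str(int(time.time() * 1000)),
}
query = urlencode(params)
print(query)

# Sanity check that the callback has the documented shape
assert re.fullmatch(r"jQuery[0-9a-z]{16}_\d{13}", params["callback"])
```

Because both `callback` and `_` depend on the current time, the parameters should be rebuilt for every request instead of being cached.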
2. Recursive shop category scraper
This component traverses the shop's category tree recursively and collects category information at every level, which is the basis for collecting all products afterwards:

```python
import json
import random
import time
from typing import Dict, List, Optional

import requests
from fake_useragent import UserAgent

# Assumes the generator from section 1 is saved as JdShopDynamicParamGenerator.py
from JdShopDynamicParamGenerator import JdShopDynamicParamGenerator


class JdShopCategoryScraper:
    def __init__(self, shop_id: str, cookie: Optional[str] = None, proxy: Optional[str] = None):
        self.shop_id = shop_id  # target shop ID
        self.cookie = cookie    # shop-visit cookie (optional, improves compatibility)
        self.proxy = proxy
        self.param_generator = JdShopDynamicParamGenerator()
        self.session = self._init_session()
        # Interface addresses
        self.category_tree_url = "https://shop.jd.com/json/promotion/queryCategoryTree.json"
        self.category_product_url = "https://shop.jd.com/json/promotion/queryCategoryProductList.json"

    def _init_session(self) -> requests.Session:
        """Initialize the session (simulates a real user browsing the shop)."""
        session = requests.Session()
        session.headers.update({
            "User-Agent": UserAgent().random,
            "Accept": "text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Referer": f"https://shop.jd.com/{self.shop_id}.html",
            "X-Requested-With": "XMLHttpRequest"  # marks the request as AJAX; key header
        })
        if self.cookie:
            session.headers["Cookie"] = self.cookie
        if self.proxy:
            session.proxies = {"http": self.proxy, "https": self.proxy}
        return session

    def _parse_callback_data(self, raw_data: str) -> Dict:
        """Parse the JSON wrapped in the callback (format: jQueryxxx_xxx(json);)."""
        try:
            # Cut out the JSON part between the outermost parentheses
            start_idx = raw_data.find("(") + 1
            end_idx = raw_data.rfind(")")
            json_data = raw_data[start_idx:end_idx]
            return json.loads(json_data)
        except Exception as e:
            print(f"Failed to parse callback data: {e}")
            return {}

    def get_category_tree(self) -> List[Dict]:
        """Fetch the full category tree of the shop (including nested categories)."""
        category_tree = []
        params = self.param_generator.build_category_tree_params(self.shop_id)
        response = self.session.get(self.category_tree_url, params=params, timeout=15)
        raw_data = response.text
        parsed_data = self._parse_callback_data(raw_data)
        # Walk the category tree recursively
        self._recursive_parse_category(parsed_data.get("data", []), category_tree)
        return category_tree

    def _recursive_parse_category(self, raw_categories: List[Dict], result: List[Dict], parent_id: str = "0"):
        """Parse nested categories recursively."""
        for category in raw_categories:
            category_info = {
                "category_id": str(category.get("categoryId", "")),
                "category_name": category.get("categoryName", ""),
                "parent_id": parent_id,
                "product_count": category.get("productCount", 0),
                "is_leaf": category.get("isLeaf", True)  # leaf category (no children)
            }
            result.append(category_info)
            # Recurse into child categories, if any
            if not category_info["is_leaf"] and "children" in category:
                self._recursive_parse_category(category["children"], result, category_info["category_id"])

    def get_leaf_categories(self) -> List[Dict]:
        """Return all leaf categories (only leaf categories contain product data)."""
        full_category_tree = self.get_category_tree()
        return [cat for cat in full_category_tree if cat["is_leaf"] and cat["product_count"] > 0]

    def get_products_by_category(self, category_id: str, max_pages: int = 50) -> List[Dict]:
        """Collect all products in one category (paginated, at most 50 pages)."""
        products = []
        for page_no in range(1, max_pages + 1):
            params = self.param_generator.build_category_product_params(
                self.shop_id, category_id, page_no, 20
            )
            # Throttle requests to avoid triggering risk control
            time.sleep(random.uniform(1.5, 2.5))
            response = self.session.get(self.category_product_url, params=params, timeout=15)
            raw_data = response.text
            parsed_data = self._parse_callback_data(raw_data)
            product_list = parsed_data.get("data", {}).get("productList", [])
            if not product_list:
                print(f"Category {category_id} page {page_no} has no products, stopping")
                break
            # Structure the product data
            structured_products = self._structurize_product(product_list)
            products.extend(structured_products)
            print(f"Category {category_id} page {page_no}: collected {len(structured_products)} products")
        return products

    def _structurize_product(self, raw_products: List[Dict]) -> List[Dict]:
        """Structure the product data returned by the category interface."""
        result = []
        for product in raw_products:
            result.append({
                "product_id": str(product.get("productId", "")),
                "sku_id": str(product.get("skuId", "")),
                "product_name": product.get("productName", ""),
                "price": product.get("jdPrice", ""),
                "original_price": product.get("marketPrice", ""),
                "sales_count": product.get("saleCount", 0),        # sales
                "comment_count": product.get("commentCount", 0),   # number of reviews
                "main_img_url": product.get("imgUrl", ""),
                "is_promotion": product.get("isPromotion", False),  # on promotion or not
                "promotion_tag": product.get("promotionTag", "")    # promotion tag
            })
        return result

    def get_all_products_by_shop(self, max_pages_per_category: int = 50) -> Dict:
        """Collect all products under every category of the shop (core method)."""
        # 1. Get all leaf categories
        leaf_categories = self.get_leaf_categories()
        if not leaf_categories:
            return {"error": "No valid shop categories found", "shop_id": self.shop_id}
        print(f"Shop {self.shop_id} has {len(leaf_categories)} valid product categories")
        # 2. Collect products category by category
        all_products = []
        category_product_count = {}
        for category in leaf_categories:
            category_id = category["category_id"]
            category_name = category["category_name"]
            print(f"\nCollecting category: {category_name} ({category_id})")
            category_products = self.get_products_by_category(category_id, max_pages_per_category)
            all_products.extend(category_products)
            category_product_count[category_name] = len(category_products)
        # 3. Assemble the result
        return {
            "shop_id": self.shop_id,
            "total_product_count": len(all_products),
            "category_product_distribution": category_product_count,  # products per category
            "products": all_products,
            "crawl_time": time.strftime("%Y-%m-%d %H:%M:%S")
        }
```
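The callback unwrapping and the recursive traversal can be tried offline on a made-up payload. The sketch below is a simplified variant, not the scraper itself: `unwrap_jsonp` and `flatten` are hypothetical helpers, the sample response is invented, and a node is treated as a leaf when it has no `children` instead of relying on an `isLeaf` field.

```python
import json
from typing import Dict, List


def unwrap_jsonp(raw: str) -> Dict:
    """Strip a 'callbackName(...);' wrapper and decode the JSON inside."""
    start, end = raw.find("("), raw.rfind(")")
    if start == -1 or end <= start:
        return {}
    try:
        return json.loads(raw[start + 1:end])
    except ValueError:
        return {}


def flatten(nodes: List[Dict], parent_id: str = "0") -> List[Dict]:
    """Depth-first flattening of a nested category tree."""
    flat = []
    for node in nodes:
        category_id = str(node.get("categoryId", ""))
        children = node.get("children") or []
        flat.append({"category_id": category_id, "parent_id": parent_id,
                     "is_leaf": not children})
        flat.extend(flatten(children, category_id))
    return flat


# Invented sample response in the jQueryxxx_xxx(json); format
SAMPLE = ('jQueryabc_1700000000000({"data": [{"categoryId": 1, "categoryName": "Phones", '
          '"children": [{"categoryId": 11, "categoryName": "Cases", "productCount": 30}]}, '
          '{"categoryId": 2, "categoryName": "Audio", "productCount": 12}]});')

for row in flatten(unwrap_jsonp(SAMPLE).get("data", [])):
    print(row)
```

Returning an empty dict on malformed input lets the caller fall back to `.get("data", [])` without extra error handling.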
3. Multi-source product data completer
This component combines the open platform and Web interfaces to fill in the SKU specifications, stock, promotion details, and other data missing from the category listings, improving data completeness:

```python
import json
import random
import re
import time
from typing import Dict, List, Optional

import requests

# Reuses the Zeus signature generator implemented earlier
from JdZeusSignGenerator import JdZeusSignGenerator


class JdProductDataCompleter:
    def __init__(self, zeus_app_key: Optional[str] = None, zeus_app_secret: Optional[str] = None,
                 access_token: Optional[str] = None, proxy: Optional[str] = None):
        self.zeus_app_key = zeus_app_key
        self.zeus_app_secret = zeus_app_secret
        self.access_token = access_token  # open platform authorization token
        self.proxy = proxy
        self.sign_generator = JdZeusSignGenerator(zeus_app_key, zeus_app_secret) if zeus_app_key else None
        self.session = self._init_session()
        # Interface addresses
        self.zeus_api_url = "https://api.jd.com/routerjson"
        self.web_sku_detail_url = "https://item.jd.com/{sku_id}.html"

    def _init_session(self) -> requests.Session:
        """Initialize the request session."""
        session = requests.Session()
        session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
        })
        if self.proxy:
            session.proxies = {"http": self.proxy, "https": self.proxy}
        return session

    def _fetch_sku_detail_by_zeus(self, sku_id: str) -> Dict:
        """Complete SKU details through the open platform (preferred, low risk-control exposure)."""
        if not self.sign_generator:
            return {"error": "Open platform parameters not configured; cannot call the Zeus interface"}
        params = {
            "method": "jingdong.item.read.get",
            "skuId": sku_id,
            "fields": "sku_id,spec,stock,warehouse,brand_name,seller_id,shop_name,payment_terms"
        }
        if self.access_token:
            params["access_token"] = self.access_token
        # Generate the Zeus signature
        sign, timestamp, nonce = self.sign_generator.generate_sign(params)
        params.update({"sign": sign, "timestamp": timestamp, "nonce": nonce})
        response = self.session.post(self.zeus_api_url, data=params, timeout=15)
        return self._structurize_zeus_sku_detail(response.json())

    def _fetch_sku_detail_by_web(self, sku_id: str) -> Dict:
        """Complete SKU details from the Web product page (used without open platform access)."""
        url = self.web_sku_detail_url.format(sku_id=sku_id)
        time.sleep(random.uniform(2, 3))
        response = self.session.get(url, timeout=15)
        return self._parse_web_sku_detail(response.text, sku_id)

    def _structurize_zeus_sku_detail(self, raw_data: Dict) -> Dict:
        """Structure the SKU detail data returned by the open platform."""
        result = {"spec_info": [], "stock_info": {}, "brand_info": ""}
        if "error_response" in raw_data:
            result["error"] = raw_data["error_response"]["msg"]
            return result
        raw_detail = raw_data.get("result", {})
        # Specification information
        spec_list = raw_detail.get("spec", [])
        result["spec_info"] = [{"spec_id": s.get("specId"), "spec_name": s.get("specName")} for s in spec_list]
        # Stock information
        result["stock_info"] = {
            "stock": raw_detail.get("stock", 0),
            "warehouse_name": raw_detail.get("warehouse", {}).get("warehouseName", "")
        }
        # Brand information
        result["brand_info"] = raw_detail.get("brand_name", "")
        return result

    def _parse_web_sku_detail(self, html: str, sku_id: str) -> Dict:
        """Parse the Web product page for SKU details (regex extraction of core data)."""
        result = {"spec_info": [], "stock_info": {}, "brand_info": ""}
        try:
            # Specification information (matches the "var specList = ..." form)
            spec_match = re.search(r'var specList = (\[.*?\]);', html, re.DOTALL)
            if spec_match:
                spec_list = json.loads(spec_match.group(1))
                result["spec_info"] = [{"spec_id": s.get("specId"), "spec_name": s.get("itemDesc")} for s in spec_list]
            # Stock information (matches the "var stock = ..." form)
            stock_match = re.search(r'var stock = (\d+);', html)
            if stock_match:
                result["stock_info"]["stock"] = int(stock_match.group(1))
            # The source text is cut off at the brand extraction: the brand regex and
            # the rest of this class, including the complete_product_data method
            # called by main(), are missing.
        except ValueError as e:
            result["error"] = str(e)
        return result
```

4. Product data asset reconstructor

```python
import json
import time
from collections import defaultdict
from typing import Dict, List


class JdProductAssetReconstructor:
    def __init__(self, shop_product_data: Dict, completed_products: List[Dict]):
        self.shop_product_data = shop_product_data    # base shop product data
        self.completed_products = completed_products  # products after detail completion
        self.asset_report = {}

    def stat_category_distribution(self) -> Dict:
        """Category distribution: product counts and share of sales."""
        category_dist = self.shop_product_data["category_product_distribution"]
        total_sales = sum(product["sales_count"] for product in self.completed_products)
        category_sales = defaultdict(int)
        for product in self.completed_products:
            # Simplified: fuzzy-match the category name against the product name;
            # in practice the association can come from the interface
            for category_name in category_dist.keys():
                if category_name in product["product_name"]:
                    category_sales[category_name] += product["sales_count"]
                    break
        # Share of sales per category
        category_sales_ratio = {
            cat: f"{(sales / total_sales) * 100:.1f}%" if total_sales > 0 else "0.0%"
            for cat, sales in category_sales.items()
        }
        return {
            "category_product_count": category_dist,
            "category_sales_count": dict(category_sales),
            "category_sales_ratio": category_sales_ratio
        }

    def sort_products_by_sales(self, top_n: int = 10) -> List[Dict]:
        """Sort by sales and return the shop's top N best sellers."""
        sorted_products = sorted(
            self.completed_products, key=lambda x: x["sales_count"], reverse=True
        )[:top_n]
        # Structure the best-seller records (names longer than 30 characters are truncated)
        return [
            {
                "rank": idx + 1,
                "product_name": (product["product_name"][:30] + "...")
                if len(product["product_name"]) > 30 else product["product_name"],
                "sku_id": product["sku_id"],
                "price": product["price"],
                "sales_count": product["sales_count"],
                "comment_count": product["comment_count"],
                "brand_info": product["brand_info"]
            }
            for idx, product in enumerate(sorted_products)
        ]

    def assess_product_risk(self) -> Dict:
        """Product risk assessment (low stock, no brand, low sales)."""
        risk_products = {
            "low_stock_products": [],  # stock below 50
            "no_brand_products": [],   # no brand
            "low_sales_products": []   # sales below 10
        }
        for product in self.completed_products:
            # Stock risk (a stock of 0 is skipped, since 0 is also the default when stock is unknown)
            stock = product["stock_info"].get("stock", 0)
            if stock < 50 and stock != 0:
                risk_products["low_stock_products"].append(product["sku_id"])
            # No-brand risk
            if not product["brand_info"]:
                risk_products["no_brand_products"].append(product["sku_id"])
            # Low-sales risk
            if product["sales_count"] < 10:
                risk_products["low_sales_products"].append(product["sku_id"])
        # Share of products in each risk group
        total_product = len(self.completed_products)
        risk_ratio = {
            "low_stock_ratio": f"{(len(risk_products['low_stock_products']) / total_product) * 100:.1f}%" if total_product > 0 else "0.0%",
            "no_brand_ratio": f"{(len(risk_products['no_brand_products']) / total_product) * 100:.1f}%" if total_product > 0 else "0.0%",
            "low_sales_ratio": f"{(len(risk_products['low_sales_products']) / total_product) * 100:.1f}%" if total_product > 0 else "0.0%"
        }
        return {
            "risk_products": risk_products,
            "risk_ratio": risk_ratio,
            # A product that falls into several groups is counted once per group
            "total_risk_product_count": sum(len(v) for v in risk_products.values())
        }

    def generate_asset_report(self) -> Dict:
        """Generate the shop product data asset report."""
        # 1. Base statistics
        category_dist = self.stat_category_distribution()
        top_sales_products = self.sort_products_by_sales(10)
        product_risk = self.assess_product_risk()
        complete_rate = (
            len([p for p in self.completed_products if p["complete_status"] == "success"])
            / len(self.completed_products) * 100
        ) if self.completed_products else 0
        # 2. Build the asset report
        self.asset_report = {
            "shop_summary": {
                "shop_id": self.shop_product_data["shop_id"],
                "total_product_count": self.shop_product_data["total_product_count"],
                "complete_rate": f"{complete_rate:.1f}%",
                "crawl_time": self.shop_product_data["crawl_time"],
                "report_time": time.strftime("%Y-%m-%d %H:%M:%S")
            },
            "category_distribution": category_dist,
            "top_sales_products": top_sales_products,
            "product_risk_assessment": product_risk,
            "data_field_explain": {
                "category_distribution": "Product counts and sales distribution by category",
                "top_sales_products": "Top 10 best-selling products of the shop",
                "product_risk_assessment": "Stock, brand, and sales risk assessment"
            }
        }
        return self.asset_report

    def export_asset_report(self, save_path: str):
        """Export the asset report as JSON."""
        with open(save_path, "w", encoding="utf-8") as f:
            json.dump(self.asset_report, f, ensure_ascii=False, indent=2)
        print(f"Shop product asset report exported to: {save_path}")

    def visualize_asset_summary(self):
        """Print the key information of the asset report."""
        summary = self.asset_report["shop_summary"]
        category_dist = self.asset_report["category_distribution"]
        risk_assessment = self.asset_report["product_risk_assessment"]
        print("\n=== JD shop product data asset summary ===")
        print(f"Shop ID: {summary['shop_id']}")
        print(f"Total products: {summary['total_product_count']} | Completion rate: {summary['complete_rate']}")
        print(f"Crawl time: {summary['crawl_time']} | Report time: {summary['report_time']}")
        print("\n1. Category distribution")
        for cat, count in category_dist["category_product_count"].items():
            print(f"  {cat}: {count} products, sales share {category_dist['category_sales_ratio'].get(cat, '0.0%')}")
        print("\n2. Top 5 best sellers")
        for product in self.asset_report["top_sales_products"][:5]:
            print(f"  No. {product['rank']}: {product['product_name']}")
            print(f"    SKU: {product['sku_id']} | Price: {product['price']} | Sales: {product['sales_count']}")
        print("\n3. Product risk assessment")
        print(f"  Low-stock products (<50): {len(risk_assessment['risk_products']['low_stock_products'])} (share {risk_assessment['risk_ratio']['low_stock_ratio']})")
        print(f"  No-brand products: {len(risk_assessment['risk_products']['no_brand_products'])} (share {risk_assessment['risk_ratio']['no_brand_ratio']})")
        print(f"  Low-sales products (<10): {len(risk_assessment['risk_products']['low_sales_products'])} (share {risk_assessment['risk_ratio']['low_sales_ratio']})")
        print(f"  Total risk products: {risk_assessment['total_risk_product_count']}")
```

The entry point wires the components together:

```python
def main():
    # Configuration (replace with real values)
    SHOP_ID = "1000123456"  # target shop ID (from the shop URL, e.g. shop.jd.com/1000123456.html)
    JD_COOKIE = "user-key=xxx; 3rdcookie=xxx; other_cookie=xxx"  # shop-visit cookie
    PROXY = "http://127.0.0.1:7890"  # optional, high-anonymity proxy
    ZEUS_APP_KEY = "your JD Zeus APP_KEY"  # optional, open platform parameter
    ZEUS_APP_SECRET = "your JD Zeus APP_SECRET"  # optional
    ACCESS_TOKEN = "your open platform access token"  # optional
    MAX_PAGES_PER_CATEGORY = 10  # maximum pages to collect per category
    ASSET_REPORT_SAVE_PATH = "./jd_shop_product_asset_report.json"

    # 1. Initialize the category and product scraper
    category_scraper = JdShopCategoryScraper(
        shop_id=SHOP_ID, cookie=JD_COOKIE, proxy=PROXY
    )
    # 2. Collect base product data under every category
    print("Collecting base product data for the whole shop...")
    shop_product_data = category_scraper.get_all_products_by_shop(MAX_PAGES_PER_CATEGORY)
    if "error" in shop_product_data:
        print(f"Product collection failed: {shop_product_data['error']}")
        return
    print(f"Base product data collected: {shop_product_data['total_product_count']} products")
    # 3. Initialize the product data completer
    completer = JdProductDataCompleter(
        zeus_app_key=ZEUS_APP_KEY, zeus_app_secret=ZEUS_APP_SECRET,
        access_token=ACCESS_TOKEN, proxy=PROXY
    )
    # 4. Complete the SKU detail data
    print("\nCompleting SKU detail data...")
    completed_products = completer.complete_product_data(shop_product_data["products"])
    # 5. Initialize the asset reconstructor
    reconstructor = JdProductAssetReconstructor(shop_product_data, completed_products)
    # 6. Generate the asset report
    asset_report = reconstructor.generate_asset_report()
    # 7. Print the key results
    reconstructor.visualize_asset_summary()
    # 8. Export the asset report
    reconstructor.export_asset_report(ASSET_REPORT_SAVE_PATH)


if __name__ == "__main__":
    main()
```

- Layer-penetrating full collection
- Dynamic parameter generation to pass risk control
- Multi-source data linkage and completion
- Product asset restructuring
- High compatibility and extensibility
- Strict request-frequency control
- Compliant cookie use
- Data usage norms
- Interface permission compliance
- Anti-crawling adaptation and maintenance
- Data masking and privacy protection
- Multithreaded asynchronous collection
- Competitor shop comparison analysis
- Product price trend monitoring
- Upgraded visual reports
- AI-assisted analysis
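Multithreaded collection, named among the extension directions, could be sketched with a bounded thread pool. This is only an illustration under assumptions: `crawl_categories` is a hypothetical helper, and `fake_fetch` stands in for a real per-category fetch such as `get_products_by_category`, so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def crawl_categories(category_ids: List[str],
                     fetch: Callable[[str], List[Dict]],
                     max_workers: int = 4) -> Dict[str, List[Dict]]:
    """Run one fetch per category on a bounded pool and keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of category_ids
        results = list(pool.map(fetch, category_ids))
    return dict(zip(category_ids, results))


def fake_fetch(category_id: str) -> List[Dict]:
    """Stand-in for a network call: returns two dummy products per category."""
    return [{"sku_id": f"{category_id}-{i}"} for i in range(2)]


products = crawl_categories(["101", "102", "103"], fake_fetch, max_workers=2)
for category_id, items in products.items():
    print(category_id, [p["sku_id"] for p in items])
```

A small `max_workers` keeps the combined request rate low. In a real crawl each worker would probably need its own session and proxy, since sharing one `requests.Session` across threads is commonly advised against.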