This article takes a deep dive into a complete approach to integrating the Pangolin Scrape API with Python, covering environment setup, authentication, error handling, and concurrency optimization, and provides full implementations of two production-grade projects: a Best Sellers ranking monitor and a competitor price tracker.
## Table of Contents

- Technical Background and Solution Selection
- Development Environment Setup
- API Client Architecture
- Error Handling and Retry Strategy
- [Hands-On Project 1: Best Sellers Ranking Monitor](#hands-on-project-1-best-sellers-ranking-monitor)
- Hands-On Project 2: Competitor Price Tracker
- Performance Optimization: Concurrency and Caching
- Production Deployment Recommendations
## Technical Background and Solution Selection

### Building Your Own Scraper vs. API Integration

In e-commerce data collection, the choice of technical approach directly affects a project's success rate and maintenance cost. Let's compare the two options from a technical standpoint.

**Self-built scraper**

Pros:
- Full control over the collection logic
- No API call costs
- Highly customizable

Cons:
- You must maintain countermeasures against anti-bot systems
- Parsing code has to be updated promptly whenever page structures change
- Managing a proxy IP pool is complex
- Concurrency control and rate limiting must be implemented yourself
- Data quality depends on parsing accuracy

**API integration**

Pros:
- Works out of the box and integrates quickly
- Maintained by a dedicated team, with high stability
- Standardized data structures
- Supports large-scale concurrency
- Minute-level data freshness

Cons:
- Billed per call
- Dependent on a third-party service
- Limited room for customization

### Technical Highlights of the Pangolin Scrape API

The Pangolin API has the following core strengths:

- High concurrency: a single account can collect more than ten million pages per day
- Minute-level freshness: real-time collection, with latency held to minutes
- High accuracy: a 98% capture rate for Sponsored ad placements, and a measured average capture rate above 90% on the US marketplace
- Full field coverage: includes deep fields such as product description and customer says
- Multi-marketplace support: covers Amazon's major marketplaces worldwide
## Development Environment Setup

### Choosing a Python Version

Python 3.8+ is recommended, mainly for the following reasons:

- Type hints are supported, which improve maintainability
- Performance and syntax improvements, such as the dictionary merge operator `|` (added in 3.9; see the sketch below)
- Standard library enhancements, such as `functools.cached_property`
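A quick illustration of both features:

```python
from functools import cached_property

# Dictionary merge operator (Python 3.9+): the right-hand side wins on key conflicts
defaults = {'marketplace': 'US', 'parse': 'true'}
overrides = {'marketplace': 'UK'}
print(defaults | overrides)  # {'marketplace': 'UK', 'parse': 'true'}


class Catalog:
    @cached_property
    def categories(self):
        # Computed on first access, then cached on the instance (Python 3.8+)
        print('loading categories...')
        return ['kitchen', 'home', 'electronics']


c = Catalog()
c.categories  # prints 'loading categories...'
c.categories  # served from the cached value; no print
```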
### Installing Dependencies

Create a requirements.txt file (openpyxl is included because the Excel exports below rely on it):

```txt
requests>=2.28.0
pandas>=1.5.0
openpyxl>=3.0.0
python-dotenv>=0.21.0
schedule>=1.1.0
retry>=0.9.2
```

Install the dependencies:

```bash
pip install -r requirements.txt
```
### Project Layout

```text
pangolin-api-project/
├── .env                      # Environment variables
├── requirements.txt          # Dependency list
├── config/
│   └── settings.py           # Configuration management
├── src/
│   ├── __init__.py
│   ├── client.py             # API client
│   ├── exceptions.py         # Custom exceptions
│   ├── monitors/
│   │   ├── __init__.py
│   │   ├── bestseller.py     # Best Sellers monitor
│   │   └── price.py          # Price tracker
│   └── utils/
│       ├── __init__.py
│       ├── cache.py          # Caching utilities
│       ├── concurrent.py     # Concurrent fetching
│       ├── retry.py          # Retry decorator
│       └── logger.py         # Logging utilities
├── tests/
│   └── test_client.py        # Unit tests
└── main.py                   # Entry point
```
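A minimal `.env` for local development; the variable names match what `src/client.py` reads below, and the values here are placeholders:

```txt
PANGOLIN_API_KEY=your_api_key_here
PANGOLIN_BASE_URL=https://api.pangolinfo.com/scrape
```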
## API Client Architecture

### Basic Client Implementation

```python
# src/client.py
import os
from typing import Any, Dict, Optional

import requests
from dotenv import load_dotenv

from .exceptions import AuthenticationError, PangolinAPIError, RateLimitError


class PangolinClient:
    """
    Base Pangolin API client.

    Provides a unified calling interface that encapsulates authentication,
    request dispatch, and error handling.
    """

    def __init__(self, api_key: Optional[str] = None, base_url: Optional[str] = None):
        """
        Initialize the API client.

        Args:
            api_key: API key; read from the environment if not provided.
            base_url: API base URL; read from the environment if not provided.
        """
        load_dotenv()
        self.api_key = api_key or os.getenv('PANGOLIN_API_KEY')
        self.base_url = base_url or os.getenv(
            'PANGOLIN_BASE_URL', 'https://api.pangolinfo.com/scrape'
        )
        if not self.api_key:
            raise ValueError(
                "API key not configured; set the PANGOLIN_API_KEY environment variable"
            )
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'PangolinPythonClient/1.0'
        })

    def _make_request(
        self,
        endpoint: str,
        params: Dict[str, Any],
        method: str = 'GET',
        timeout: int = 30
    ) -> Optional[Dict]:
        """
        Generic request helper.

        Args:
            endpoint: API endpoint.
            params: Request parameters.
            method: HTTP method.
            timeout: Timeout in seconds.

        Returns:
            The JSON payload of the response, or None on failure.

        Raises:
            AuthenticationError: Authentication failed.
            RateLimitError: Request rate limit exceeded.
            PangolinAPIError: Any other API error.
        """
        params['api_key'] = self.api_key
        url = f"{self.base_url}/{endpoint}" if endpoint else self.base_url
        try:
            response = self.session.request(
                method=method,
                url=url,
                params=params if method == 'GET' else None,
                json=params if method == 'POST' else None,
                timeout=timeout
            )
            # Handle the response by status code
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 401:
                raise AuthenticationError("API key is invalid or expired")
            elif response.status_code == 429:
                raise RateLimitError("Rate limit exceeded; retry later")
            else:
                response.raise_for_status()
        except requests.exceptions.Timeout:
            raise PangolinAPIError(f"Request timed out ({timeout}s)")
        except requests.exceptions.ConnectionError:
            raise PangolinAPIError("Network connection failed")
        except requests.exceptions.RequestException as e:
            raise PangolinAPIError(f"Request failed: {e}")
        return None

    def get_product_details(
        self,
        asin: str,
        marketplace: str = 'US',
        parse: bool = True
    ) -> Optional[Dict]:
        """
        Fetch product detail data.

        Args:
            asin: Amazon product ASIN.
            marketplace: Marketplace code (US/UK/DE/FR/IT/ES/CA/JP, etc.).
            parse: Whether to return parsed, structured data.

        Returns:
            Product detail dictionary.
        """
        params = {
            'type': 'product',
            'asin': asin,
            'marketplace': marketplace,
            'parse': str(parse).lower()
        }
        return self._make_request('', params)

    def get_bestsellers(
        self,
        category: str,
        marketplace: str = 'US',
        page: int = 1
    ) -> Optional[Dict]:
        """
        Fetch Best Sellers ranking data.

        Args:
            category: Category URL or category ID.
            marketplace: Marketplace code.
            page: Page number.

        Returns:
            Ranking data dictionary.
        """
        params = {
            'type': 'bestsellers',
            'category': category,
            'marketplace': marketplace,
            'page': page,
            'parse': 'true'
        }
        return self._make_request('', params)

    def get_search_results(
        self,
        keyword: str,
        marketplace: str = 'US',
        page: int = 1
    ) -> Optional[Dict]:
        """
        Fetch search results page data.

        Args:
            keyword: Search keyword.
            marketplace: Marketplace code.
            page: Page number.

        Returns:
            Search results dictionary.
        """
        params = {
            'type': 'search',
            'keyword': keyword,
            'marketplace': marketplace,
            'page': page,
            'parse': 'true'
        }
        return self._make_request('', params)
```
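A quick smoke test of the client; the ASIN is a placeholder, and it assumes `PANGOLIN_API_KEY` is set in `.env` or the shell:

```python
from src.client import PangolinClient
from src.exceptions import PangolinAPIError

client = PangolinClient()
try:
    product = client.get_product_details('B0EXAMPLE1')  # hypothetical ASIN
    if product:
        print(product.get('title'), product.get('price'))
except PangolinAPIError as e:
    print(f"API call failed: {e}")
```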
### Custom Exception Classes

```python
# src/exceptions.py
class PangolinAPIError(Exception):
    """Base exception for API errors."""
    pass


class AuthenticationError(PangolinAPIError):
    """Authentication failure."""
    pass


class RateLimitError(PangolinAPIError):
    """Request rate limit exceeded."""
    pass


class InvalidParameterError(PangolinAPIError):
    """Invalid parameter."""
    pass


class DataParseError(PangolinAPIError):
    """Data parsing error."""
    pass
```
## Error Handling and Retry Strategy

### Automatic Retry with a Decorator

```python
# src/utils/retry.py
import functools
import time
from typing import Callable, Tuple, Type

from ..exceptions import PangolinAPIError


def retry_on_error(
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    exceptions: Tuple[Type[Exception], ...] = (PangolinAPIError,)
):
    """
    Automatic retry decorator with exponential backoff.

    Args:
        max_retries: Maximum number of attempts.
        backoff_factor: Backoff factor (exponential).
        exceptions: Exception types that trigger a retry.
    """
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise
                    wait_time = backoff_factor ** attempt
                    print(f"Retry {attempt + 1}, waiting {wait_time}s...")
                    time.sleep(wait_time)
            return None
        return wrapper
    return decorator
```

### Enhanced API Client

```python
# src/client.py (continued)
from .utils.retry import retry_on_error


class EnhancedPangolinClient(PangolinClient):
    """Enhanced API client with automatic retries."""

    @retry_on_error(max_retries=3, backoff_factor=2.0)
    def get_product_details(self, asin: str, marketplace: str = 'US') -> Optional[Dict]:
        """Product details with automatic retry."""
        return super().get_product_details(asin, marketplace)

    @retry_on_error(max_retries=3, backoff_factor=2.0)
    def get_bestsellers(self, category: str, marketplace: str = 'US',
                        page: int = 1) -> Optional[Dict]:
        """Best Sellers data with automatic retry."""
        return super().get_bestsellers(category, marketplace, page)
```
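A quick way to sanity-check the decorator without touching the network (a toy example, not part of the project code):

```python
from src.exceptions import PangolinAPIError
from src.utils.retry import retry_on_error

calls = {'n': 0}

@retry_on_error(max_retries=3, backoff_factor=0.1)  # small factor keeps the demo fast
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise PangolinAPIError("transient failure")
    return 'ok'

print(flaky())     # fails twice, retries with backoff, then returns 'ok'
print(calls['n'])  # 3
```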
## Hands-On Project 1: Best Sellers Ranking Monitor

### Core Features

- Collect ranking data for specified categories on a schedule
- Compare against historical data to identify newly listed products
- Track rank movement trends
- Generate Excel reports
- Send alerts on anomalies

### Full Implementation

```python
# src/monitors/bestseller.py
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional

import pandas as pd

from ..client import PangolinClient


class BestSellersMonitor:
    """
    Best Sellers ranking monitor.

    Features:
    - Scheduled ranking collection
    - Comparison against historical data
    - Rank movement tracking
    - Report generation
    """

    def __init__(self, client: PangolinClient, data_dir: str = './data'):
        """
        Initialize the monitor.

        Args:
            client: Pangolin API client instance.
            data_dir: Directory for data storage.
        """
        self.client = client
        self.data_dir = Path(data_dir)
        self.data_dir.mkdir(parents=True, exist_ok=True)
        self.history_file = self.data_dir / 'bestsellers_history.json'
        self.history = self._load_history()

    def _load_history(self) -> Dict:
        """Load historical data."""
        if self.history_file.exists():
            with open(self.history_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        return {}

    def _save_history(self):
        """Persist historical data."""
        with open(self.history_file, 'w', encoding='utf-8') as f:
            json.dump(self.history, f, ensure_ascii=False, indent=2)

    def monitor_category(
        self,
        category: str,
        marketplace: str = 'US',
        max_pages: int = 1
    ) -> Dict:
        """
        Monitor the ranking of a given category.

        Args:
            category: Category identifier.
            marketplace: Marketplace code.
            max_pages: Number of pages to collect.

        Returns:
            A dictionary with the monitoring result.
        """
        print(f"[{datetime.now()}] Collecting {marketplace}/{category} ranking...")
        all_products = []
        # Collect multiple pages
        for page in range(1, max_pages + 1):
            data = self.client.get_bestsellers(category, marketplace, page)
            if not data or 'products' not in data:
                print(f"Failed to collect page {page}")
                break
            all_products.extend(data['products'])
            print(f"Page {page} done, {len(data['products'])} products fetched")
        if not all_products:
            return {'success': False, 'message': 'Data collection failed'}
        # Build the current ranking snapshot
        timestamp = datetime.now().isoformat()
        current_ranking = {}
        for product in all_products:
            asin = product.get('asin')
            if not asin:
                continue
            current_ranking[asin] = {
                'rank': product.get('rank'),
                'title': product.get('title'),
                'price': product.get('price', {}).get('value'),
                'currency': product.get('price', {}).get('currency'),
                'rating': product.get('rating'),
                'reviews_count': product.get('reviews_count'),
                'timestamp': timestamp
            }
        # Analyze changes
        category_key = f"{marketplace}_{category}"
        changes = {}
        if category_key in self.history:
            changes = self._analyze_changes(category_key, current_ranking)
        # Update the history
        self.history[category_key] = current_ranking
        self._save_history()
        print(f"Collection complete, {len(current_ranking)} products in total")
        return {
            'success': True,
            'total_products': len(current_ranking),
            'changes': changes,
            'timestamp': timestamp
        }

    def _analyze_changes(
        self,
        category_key: str,
        current_ranking: Dict
    ) -> Dict:
        """
        Analyze ranking changes.

        Args:
            category_key: Category key.
            current_ranking: Current ranking snapshot.

        Returns:
            The change analysis result.
        """
        previous = self.history[category_key]
        # Identify newly listed products
        new_products = []
        current_asins = set(current_ranking.keys())
        previous_asins = set(previous.keys())
        for asin in current_asins - previous_asins:
            product = current_ranking[asin]
            new_products.append({
                'asin': asin,
                'title': product['title'],
                'rank': product['rank'],
                'price': product['price']
            })
        # Identify rank movements
        rank_changes = []
        for asin in current_asins & previous_asins:
            old_rank = previous[asin]['rank']
            new_rank = current_ranking[asin]['rank']
            if old_rank and new_rank:
                change = old_rank - new_rank
                if abs(change) >= 10:  # rank moved by 10 or more positions
                    rank_changes.append({
                        'asin': asin,
                        'title': current_ranking[asin]['title'],
                        'old_rank': old_rank,
                        'new_rank': new_rank,
                        'change': change
                    })
        # Identify products that dropped off the list
        removed_products = []
        for asin in previous_asins - current_asins:
            product = previous[asin]
            removed_products.append({
                'asin': asin,
                'title': product['title'],
                'last_rank': product['rank']
            })
        # Print the analysis
        if new_products:
            print(f"\n{len(new_products)} newly listed products:")
            for p in new_products[:5]:  # show the first 5 only
                print(f" - [{p['rank']}] {p['title'][:50]}...")
        if rank_changes:
            print(f"\nProducts with significant rank movement ({len(rank_changes)}):")
            for p in sorted(rank_changes, key=lambda x: abs(x['change']), reverse=True)[:5]:
                direction = "↑" if p['change'] > 0 else "↓"
                print(f" {direction} [{p['old_rank']}→{p['new_rank']}] {p['title'][:50]}...")
        return {
            'new_products': new_products,
            'rank_changes': rank_changes,
            'removed_products': removed_products
        }

    def export_to_excel(
        self,
        category_key: str,
        filename: Optional[str] = None
    ):
        """
        Export ranking data to Excel.

        Args:
            category_key: Category key (format: marketplace_category).
            filename: Output filename.
        """
        if category_key not in self.history:
            print(f"No history found for {category_key}")
            return
        data = self.history[category_key]
        df = pd.DataFrame.from_dict(data, orient='index')
        df = df.sort_values('rank')
        if not filename:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = self.data_dir / f'bestsellers_{category_key}_{timestamp}.xlsx'
        df.to_excel(filename, index_label='ASIN')
        print(f"Data exported to: {filename}")
```

### Usage Example

```python
# main.py
import time

import schedule

from src.client import PangolinClient
from src.monitors.bestseller import BestSellersMonitor


def main():
    # Initialize the client and the monitor
    client = PangolinClient()
    monitor = BestSellersMonitor(client)

    # Define the monitoring job
    def daily_monitor():
        # Monitor several categories
        categories = ['kitchen', 'home', 'electronics']
        for category in categories:
            result = monitor.monitor_category(category, 'US', max_pages=2)
            if result['success']:
                monitor.export_to_excel(f"US_{category}")

    # Schedule the job for 09:00 every day
    schedule.every().day.at("09:00").do(daily_monitor)
    # Run once immediately
    daily_monitor()
    # Keep the process alive
    while True:
        schedule.run_pending()
        time.sleep(60)


if __name__ == '__main__':
    main()
```
## Hands-On Project 2: Competitor Price Tracker

### System Architecture

```text
Price tracker
├── Collection layer: fetch product prices on a schedule
├── Storage layer: SQLite database
├── Analysis engine: price-change detection
└── Alerting: anomaly notifications
```
### Full Implementation

```python
# src/monitors/price.py
import sqlite3
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

import pandas as pd

from ..client import PangolinClient


class PriceTracker:
    """
    Competitor price tracker.

    Features:
    - Scheduled price collection
    - Historical price storage
    - Price-change analysis
    - Anomaly alerts
    """

    def __init__(self, client: PangolinClient, db_path: str = './data/price_history.db'):
        """
        Initialize the price tracker.

        Args:
            client: Pangolin API client instance.
            db_path: Database file path.
        """
        self.client = client
        self.db_path = Path(db_path)
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        self._init_database()

    def _init_database(self):
        """Initialize the SQLite database."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS price_records (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                asin TEXT NOT NULL,
                marketplace TEXT NOT NULL,
                price REAL,
                currency TEXT,
                availability TEXT,
                in_stock BOOLEAN,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_asin_time
            ON price_records(asin, timestamp)
        ''')
        conn.commit()
        conn.close()

    def track_product(
        self,
        asin: str,
        marketplace: str = 'US'
    ) -> bool:
        """
        Track the price of a single product.

        Args:
            asin: Product ASIN.
            marketplace: Marketplace code.

        Returns:
            Whether the record was stored successfully.
        """
        product_data = self.client.get_product_details(asin, marketplace)
        if not product_data:
            return False
        price_info = product_data.get('price', {})
        availability = product_data.get('availability', '')
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('''
            INSERT INTO price_records
            (asin, marketplace, price, currency, availability, in_stock)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            asin,
            marketplace,
            price_info.get('value'),
            price_info.get('currency'),
            availability,
            'in stock' in availability.lower() if availability else False
        ))
        conn.commit()
        conn.close()
        return True

    def track_multiple(
        self,
        asin_list: List[str],
        marketplace: str = 'US',
        delay: float = 1.0
    ) -> Dict:
        """
        Track multiple products in a batch.

        Args:
            asin_list: List of ASINs.
            marketplace: Marketplace code.
            delay: Delay between requests, in seconds.

        Returns:
            Tracking statistics.
        """
        success_count = 0
        failed_asins = []
        for asin in asin_list:
            if self.track_product(asin, marketplace):
                success_count += 1
                print(f"✓ {asin} price recorded")
            else:
                failed_asins.append(asin)
                print(f"✗ {asin} collection failed")
            time.sleep(delay)
        result = {
            'total': len(asin_list),
            'success': success_count,
            'failed': len(failed_asins),
            'failed_asins': failed_asins
        }
        print(f"\nDone: {success_count}/{len(asin_list)} products")
        return result

    def get_price_history(
        self,
        asin: str,
        days: int = 30
    ) -> pd.DataFrame:
        """
        Fetch a product's price history.

        Args:
            asin: Product ASIN.
            days: Number of days to query.

        Returns:
            Price history as a DataFrame.
        """
        conn = sqlite3.connect(self.db_path)
        query = '''
            SELECT timestamp, price, currency, availability, in_stock
            FROM price_records
            WHERE asin = ?
              AND timestamp >= datetime('now', ?)
            ORDER BY timestamp
        '''
        df = pd.read_sql_query(query, conn, params=(asin, f'-{days} days'))
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        conn.close()
        return df

    def detect_price_changes(
        self,
        asin: str,
        threshold: float = 0.05
    ) -> Optional[Dict]:
        """
        Detect abnormal price changes.

        Args:
            asin: Product ASIN.
            threshold: Change threshold (as a fraction).

        Returns:
            Change details, or None if nothing significant changed.
        """
        df = self.get_price_history(asin, days=7)
        if len(df) < 2:
            return None
        latest = df.iloc[-1]
        previous = df.iloc[-2]
        if pd.notna(latest['price']) and pd.notna(previous['price']):
            change_rate = (latest['price'] - previous['price']) / previous['price']
            if abs(change_rate) >= threshold:
                return {
                    'asin': asin,
                    'previous_price': previous['price'],
                    'current_price': latest['price'],
                    'change_rate': change_rate,
                    'change_amount': latest['price'] - previous['price'],
                    'alert_type': 'price_drop' if change_rate < 0 else 'price_increase',
                    'timestamp': latest['timestamp']
                }
        return None

    def generate_report(
        self,
        asin_list: List[str],
        threshold: float = 0.05
    ) -> List[Dict]:
        """
        Generate a price monitoring report.

        Args:
            asin_list: List of ASINs.
            threshold: Change threshold.

        Returns:
            List of price-change alerts.
        """
        alerts = []
        for asin in asin_list:
            alert = self.detect_price_changes(asin, threshold)
            if alert:
                alerts.append(alert)
        if alerts:
            print("\n=== Price Alerts ===")
            for alert in alerts:
                change_pct = alert['change_rate'] * 100
                symbol = "↓" if alert['alert_type'] == 'price_drop' else "↑"
                print(f"{symbol} {alert['asin']}: "
                      f"${alert['previous_price']:.2f} → ${alert['current_price']:.2f} "
                      f"({change_pct:+.1f}%)")
        else:
            print("No significant price changes detected")
        return alerts

    def export_price_trends(
        self,
        asin: str,
        filename: Optional[str] = None
    ):
        """
        Export price trend data.

        Args:
            asin: Product ASIN.
            filename: Output filename.
        """
        df = self.get_price_history(asin, days=30)
        if df.empty:
            print(f"No price history found for {asin}")
            return
        if not filename:
            filename = f'price_trend_{asin}_{datetime.now().strftime("%Y%m%d")}.xlsx'
        # Compute statistics
        df['price_change'] = df['price'].diff()
        df['price_change_pct'] = df['price'].pct_change() * 100
        with pd.ExcelWriter(filename, engine='openpyxl') as writer:
            df.to_excel(writer, sheet_name='Price History', index=False)
            # Add a summary sheet
            summary = pd.DataFrame({
                'Metric': ['Max price', 'Min price', 'Average price',
                           'Current price', 'Price volatility'],
                'Value': [
                    df['price'].max(),
                    df['price'].min(),
                    df['price'].mean(),
                    df['price'].iloc[-1],
                    df['price'].std()
                ]
            })
            summary.to_excel(writer, sheet_name='Summary', index=False)
        print(f"Price trends exported to: {filename}")
```
## Performance Optimization: Concurrency and Caching

### Concurrent Fetching

```python
# src/utils/concurrent.py
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


class ConcurrentFetcher:
    """Concurrent data fetcher."""

    def __init__(self, max_workers: int = 5, rate_limit: float = 0.2):
        """
        Initialize the fetcher.

        Args:
            max_workers: Maximum number of concurrent workers.
            rate_limit: Minimum interval between submitted requests, in seconds.
        """
        self.max_workers = max_workers
        self.rate_limit = rate_limit

    def fetch_batch(
        self,
        fetch_func: Callable,
        items: List,
        **kwargs
    ) -> Dict:
        """
        Fetch a batch of items concurrently.

        Args:
            fetch_func: Fetch function called once per item.
            items: Items to fetch.
            **kwargs: Extra arguments forwarded to the fetch function.

        Returns:
            A dictionary with the batch results.
        """
        results = {}
        failed = []
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit tasks, pacing submissions so the request rate stays bounded
            future_to_item = {}
            for item in items:
                future_to_item[executor.submit(fetch_func, item, **kwargs)] = item
                time.sleep(self.rate_limit)
            # Collect results as they complete
            for future in as_completed(future_to_item):
                item = future_to_item[future]
                try:
                    data = future.result()
                    if data:
                        results[item] = data
                        print(f"✓ {item}")
                    else:
                        failed.append(item)
                        print(f"✗ {item} - no data")
                except Exception as e:
                    failed.append(item)
                    print(f"✗ {item} - {e}")
        return {
            'success': results,
            'failed': failed,
            'total': len(items),
            'success_count': len(results)
        }
```
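Pairing the fetcher with the retry-enabled client (the ASINs are placeholders):

```python
from src.client import EnhancedPangolinClient
from src.utils.concurrent import ConcurrentFetcher

client = EnhancedPangolinClient()
fetcher = ConcurrentFetcher(max_workers=5, rate_limit=0.2)

asins = ['B0EXAMPLE1', 'B0EXAMPLE2', 'B0EXAMPLE3']  # hypothetical ASINs
batch = fetcher.fetch_batch(client.get_product_details, asins, marketplace='US')
print(f"{batch['success_count']}/{batch['total']} succeeded, failed: {batch['failed']}")
```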
### Caching

```python
# src/utils/cache.py
import pickle
from datetime import datetime, timedelta
from pathlib import Path
from typing import Any, Dict, Optional

from ..client import PangolinClient


class DataCache:
    """Data cache manager with an in-memory layer backed by pickle files."""

    def __init__(self, cache_dir: str = './cache', default_ttl: int = 3600):
        """
        Initialize the cache manager.

        Args:
            cache_dir: Cache directory.
            default_ttl: Default time-to-live, in seconds.
        """
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.default_ttl = default_ttl
        self.memory_cache = {}

    def get(self, key: str) -> Optional[Any]:
        """
        Read from the cache.

        Args:
            key: Cache key.

        Returns:
            The cached data, or None if missing or expired.
        """
        # Check the memory cache first
        if key in self.memory_cache:
            data, expire_time = self.memory_cache[key]
            if datetime.now() < expire_time:
                return data
            else:
                del self.memory_cache[key]
        # Fall back to the file cache
        cache_file = self.cache_dir / f"{key}.pkl"
        if cache_file.exists():
            try:
                with open(cache_file, 'rb') as f:
                    data, expire_time = pickle.load(f)
                if datetime.now() < expire_time:
                    # Promote to the memory cache
                    self.memory_cache[key] = (data, expire_time)
                    return data
                else:
                    cache_file.unlink()
            except Exception as e:
                print(f"Cache read failed: {e}")
        return None

    def set(self, key: str, data: Any, ttl: Optional[int] = None):
        """
        Write to the cache.

        Args:
            key: Cache key.
            data: Data to cache.
            ttl: Time-to-live in seconds; None uses the default.
        """
        ttl = ttl or self.default_ttl
        expire_time = datetime.now() + timedelta(seconds=ttl)
        # Write to the memory cache
        self.memory_cache[key] = (data, expire_time)
        # Write to the file cache
        cache_file = self.cache_dir / f"{key}.pkl"
        try:
            with open(cache_file, 'wb') as f:
                pickle.dump((data, expire_time), f)
        except Exception as e:
            print(f"Cache write failed: {e}")

    def clear(self, key: Optional[str] = None):
        """
        Clear cache entries.

        Args:
            key: Cache key; None clears everything.
        """
        if key:
            # Clear one entry
            if key in self.memory_cache:
                del self.memory_cache[key]
            cache_file = self.cache_dir / f"{key}.pkl"
            if cache_file.exists():
                cache_file.unlink()
        else:
            # Clear all entries
            self.memory_cache.clear()
            for cache_file in self.cache_dir.glob('*.pkl'):
                cache_file.unlink()


class CachedPangolinClient(PangolinClient):
    """API client with a caching layer."""

    def __init__(self, cache_ttl: int = 3600, **kwargs):
        super().__init__(**kwargs)
        self.cache = DataCache(default_ttl=cache_ttl)

    def get_product_details(
        self,
        asin: str,
        marketplace: str = 'US',
        use_cache: bool = True
    ) -> Optional[Dict]:
        """Fetch product details, served from cache when possible."""
        cache_key = f"product_{marketplace}_{asin}"
        if use_cache:
            cached_data = self.cache.get(cache_key)
            if cached_data is not None:
                print(f"Cache hit: {asin}")
                return cached_data
        data = super().get_product_details(asin, marketplace)
        if data:
            self.cache.set(cache_key, data)
        return data
```
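Within the TTL, repeated lookups are served locally; pass `use_cache=False` when freshness matters more than quota (the ASIN is a placeholder):

```python
from src.utils.cache import CachedPangolinClient

client = CachedPangolinClient(cache_ttl=1800)  # 30-minute TTL
first = client.get_product_details('B0EXAMPLE1')                   # live API call
second = client.get_product_details('B0EXAMPLE1')                  # cache hit
fresh = client.get_product_details('B0EXAMPLE1', use_cache=False)  # forces a live call
```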
## Production Deployment Recommendations

### Docker Deployment

```dockerfile
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```
### Logging Configuration

```python
# src/utils/logger.py
import logging
from pathlib import Path


def setup_logger(name: str, log_file: str = 'app.log', level=logging.INFO):
    """Configure the logging system."""
    # Create the log directory
    log_dir = Path('logs')
    log_dir.mkdir(exist_ok=True)
    # Create the logger
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Avoid attaching duplicate handlers if called more than once
    if logger.handlers:
        return logger
    # File handler
    fh = logging.FileHandler(log_dir / log_file, encoding='utf-8')
    fh.setLevel(level)
    # Console handler
    ch = logging.StreamHandler()
    ch.setLevel(level)
    # Formatter
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    logger.addHandler(fh)
    logger.addHandler(ch)
    return logger
```
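Typical usage inside any module:

```python
from src.utils.logger import setup_logger

logger = setup_logger(__name__)
logger.info("Monitor started")
logger.warning("Approaching the API rate limit")
```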
## Summary

This article walked through a complete approach to integrating the Pangolin Scrape API with Python, covering:

- Architecture: a modular client design that is easy to extend and maintain
- Error handling: thorough exception handling plus an automatic retry mechanism
- Hands-on projects: two production-grade applications, a Best Sellers ranking monitor and a price tracker
- Performance: concurrent fetching and a caching layer
- Deployment: Docker packaging and logging configuration

By integrating via an API, developers can quickly build a stable, reliable data collection system and focus on business logic and data analysis instead of spending effort maintaining scraper infrastructure.

About the author: a senior Python developer focused on e-commerce data collection and analysis.

Copyright notice: this is an original technical article; please credit the source when republishing.