Calling the Pangolin API from Python: From Environment Setup to Production-Grade Applications

This article walks through a complete approach to integrating the Pangolin Scrape API with Python, covering environment configuration, authentication, error handling, and concurrency optimization, and provides full code for two production-grade projects: a Best Sellers monitor and a price tracker.

Technical Background and Solution Selection

Building Your Own Crawler vs. API Integration: A Technical Decision

In e-commerce data collection, the choice of technical approach directly affects a project's success rate and maintenance costs. Let's compare the two options from a technical perspective:

Building a Crawler Yourself

Advantages:
- Full control over the collection logic
- No API call costs
- Highly customizable

Disadvantages:
- You must maintain anti-bot countermeasures
- Parsing code must be updated promptly when page structures change
- Proxy IP pool management is complex
- Concurrency control and rate limiting must be implemented yourself
- Data quality depends on parsing accuracy

API Integration

Advantages:
- Works out of the box, quick to integrate
- Maintained by a professional team, high stability
- Standardized data structures
- Supports large-scale concurrency
- Minute-level data freshness

Disadvantages:
- Billed per call
- Dependence on a third-party service
- Limited customization

Technical Highlights of the Pangolin Scrape API

The Pangolin API offers the following core technical advantages:

  1. High concurrency: a single account supports collection at the scale of tens of millions of pages per day
  2. Minute-level freshness: real-time collection with latency kept to minutes
  3. High accuracy: a 98% capture rate for Sponsored ad slots, with a measured average capture rate above 90% on the US marketplace
  4. Full field coverage: includes deep fields such as product description and customer says
  5. Multi-marketplace support: covers Amazon's major marketplaces worldwide

Development Environment Setup

Choosing a Python Version

Python 3.9+ is recommended, mainly for:

  • Type hints support, which improves code maintainability
  • Performance and syntax improvements (such as the dictionary merge operator |, added in 3.9)
  • Standard library enhancements (such as functools.cached_property, added in 3.8)
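As a quick illustration of the two features above (the names here are illustrative, not from the Pangolin codebase):

```python
from functools import cached_property

# Dictionary merge operator (Python 3.9+): the right-hand side wins on key conflicts
defaults = {"marketplace": "US", "parse": True}
overrides = {"parse": False}
params = defaults | overrides  # {'marketplace': 'US', 'parse': False}

# cached_property (Python 3.8+): computed once, then cached on the instance
class Report:
    def __init__(self, prices):
        self.prices = prices

    @cached_property
    def average(self):
        print("computing...")  # runs only on first access
        return sum(self.prices) / len(self.prices)

r = Report([10.0, 20.0])
print(params)
print(r.average)  # prints "computing...", then 15.0
print(r.average)  # cached; "computing..." does not run again
```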

Installing Dependencies

Create a requirements.txt file (openpyxl is required by the Excel export code later in this article):

requests>=2.28.0
pandas>=1.5.0
openpyxl>=3.0.0
python-dotenv>=0.21.0
schedule>=1.1.0
retry>=0.9.2

Install the dependencies:

pip install -r requirements.txt

Project Structure

pangolin-api-project/
├── .env                    # environment variables
├── requirements.txt        # dependency list
├── config/
│   └── settings.py        # configuration management
├── src/
│   ├── __init__.py
│   ├── client.py          # API client
│   ├── exceptions.py      # custom exceptions
│   ├── monitors/
│   │   ├── bestseller.py  # Best Sellers monitoring
│   │   └── price.py       # price tracking
│   └── utils/
│       ├── cache.py       # caching utilities
│       └── logger.py      # logging utilities
├── tests/
│   └── test_client.py     # unit tests
└── main.py                # entry point
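A minimal .env for local development might look like this (both values are placeholders; the base URL shown is the default assumed by the client code below, so substitute your actual endpoint):

```ini
# .env -- never commit this file to version control
PANGOLIN_API_KEY=your-api-key-here
PANGOLIN_BASE_URL=https://api.pangolinfo.com/scrape
```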

API Client Architecture

Basic Client Implementation

# src/client.py
import os
import requests
from dotenv import load_dotenv
from typing import Dict, Optional, Any, List
from .exceptions import PangolinAPIError, AuthenticationError

class PangolinClient:
    """
    Base Pangolin API client.
    
    Provides a unified calling interface and encapsulates authentication,
    request handling, and error handling.
    """
    
    def __init__(self, api_key: Optional[str] = None, base_url: Optional[str] = None):
        """
        Initialize the API client.
        
        Args:
            api_key: API key; read from the environment if not provided
            base_url: API base URL; read from the environment if not provided
        """
        load_dotenv()
        self.api_key = api_key or os.getenv('PANGOLIN_API_KEY')
        self.base_url = base_url or os.getenv('PANGOLIN_BASE_URL', 
                                               'https://api.pangolinfo.com/scrape')
        
        if not self.api_key:
            raise ValueError("API key not configured; set the PANGOLIN_API_KEY environment variable")
        
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'PangolinPythonClient/1.0'
        })
    
    def _make_request(
        self, 
        endpoint: str, 
        params: Dict[str, Any],
        method: str = 'GET',
        timeout: int = 30
    ) -> Optional[Dict]:
        """
        Shared method for issuing API requests.
        
        Args:
            endpoint: API endpoint
            params: request parameters
            method: HTTP method
            timeout: timeout in seconds
            
        Returns:
            The response JSON on success, None on failure
            
        Raises:
            AuthenticationError: authentication failed
            PangolinAPIError: API call error
        """
        params['api_key'] = self.api_key
        url = f"{self.base_url}/{endpoint}" if endpoint else self.base_url
        
        try:
            response = self.session.request(
                method=method,
                url=url,
                params=params if method == 'GET' else None,
                json=params if method == 'POST' else None,
                timeout=timeout
            )
            
            # Dispatch on the status code
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 401:
                raise AuthenticationError("API key is invalid or has expired")
            elif response.status_code == 429:
                raise PangolinAPIError("Rate limit exceeded; try again later")
            else:
                response.raise_for_status()
                
        except requests.exceptions.Timeout:
            raise PangolinAPIError(f"Request timed out ({timeout} s)")
        except requests.exceptions.ConnectionError:
            raise PangolinAPIError("Network connection failed")
        except requests.exceptions.RequestException as e:
            raise PangolinAPIError(f"Request failed: {str(e)}")
        
        return None
    
    def get_product_details(
        self, 
        asin: str, 
        marketplace: str = 'US',
        parse: bool = True
    ) -> Optional[Dict]:
        """
        Fetch product detail data.
        
        Args:
            asin: Amazon product ASIN
            marketplace: marketplace code (US/UK/DE/FR/IT/ES/CA/JP, etc.)
            parse: whether to return parsed, structured data
            
        Returns:
            Product detail dictionary
        """
        params = {
            'type': 'product',
            'asin': asin,
            'marketplace': marketplace,
            'parse': str(parse).lower()
        }
        
        return self._make_request('', params)
    
    def get_bestsellers(
        self,
        category: str,
        marketplace: str = 'US',
        page: int = 1
    ) -> Optional[Dict]:
        """
        Fetch Best Sellers ranking data.
        
        Args:
            category: category URL or category ID
            marketplace: marketplace code
            page: page number
            
        Returns:
            Ranking data dictionary
        """
        params = {
            'type': 'bestsellers',
            'category': category,
            'marketplace': marketplace,
            'page': page,
            'parse': 'true'
        }
        
        return self._make_request('', params)
    
    def get_search_results(
        self,
        keyword: str,
        marketplace: str = 'US',
        page: int = 1
    ) -> Optional[Dict]:
        """
        Fetch search results page data.
        
        Args:
            keyword: search keyword
            marketplace: marketplace code
            page: page number
            
        Returns:
            Search results dictionary
        """
        params = {
            'type': 'search',
            'keyword': keyword,
            'marketplace': marketplace,
            'page': page,
            'parse': 'true'
        }
        
        return self._make_request('', params)

Custom Exception Classes

# src/exceptions.py

class PangolinAPIError(Exception):
    """Base exception for API calls"""
    pass

class AuthenticationError(PangolinAPIError):
    """Authentication failure"""
    pass

class RateLimitError(PangolinAPIError):
    """Request rate limit exceeded"""
    pass

class InvalidParameterError(PangolinAPIError):
    """Invalid parameter"""
    pass

class DataParseError(PangolinAPIError):
    """Data parsing error"""
    pass
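Because every specific error subclasses PangolinAPIError, one broad except clause catches them all while still allowing narrower handling first; a quick standalone check (the two classes are repeated inline so the snippet runs on its own):

```python
class PangolinAPIError(Exception):
    """Base exception for API calls"""

class RateLimitError(PangolinAPIError):
    """Request rate limit exceeded"""

def classify(exc: Exception) -> str:
    # Order matters: handle the subclass before the base class
    try:
        raise exc
    except RateLimitError:
        return "back off and retry"
    except PangolinAPIError:
        return "generic API failure"

print(classify(RateLimitError()))    # back off and retry
print(classify(PangolinAPIError()))  # generic API failure
```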

Error Handling and Retry Strategy

Automatic Retries with a Decorator

# src/utils/retry.py
import time
import functools
from typing import Callable, Type, Tuple
from ..exceptions import PangolinAPIError

def retry_on_error(
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    exceptions: Tuple[Type[Exception], ...] = (PangolinAPIError,)
):
    """
    Automatic retry decorator.
    
    Args:
        max_retries: maximum number of attempts
        backoff_factor: backoff factor (exponential backoff)
        exceptions: exception types that trigger a retry
    """
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise
                    
                    wait_time = backoff_factor ** attempt
                    print(f"Retry {attempt + 1}, waiting {wait_time} s...")
                    time.sleep(wait_time)
            
            return None
        return wrapper
    return decorator
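A quick self-contained check of the retry behavior (the decorator's control flow is copied inline, with the exponential-backoff sleep elided so the demo runs instantly):

```python
import functools

def retry_on_error(max_retries=3, exceptions=(Exception,)):
    # Same control flow as src/utils/retry.py; the backoff sleep is elided here
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise
        return wrapper
    return decorator

calls = {"n": 0}

@retry_on_error(max_retries=3)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(flaky())     # succeeds on the third attempt -> ok
print(calls["n"])  # 3
```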

Enhanced API Client

from typing import Dict, Optional
from .client import PangolinClient
from .utils.retry import retry_on_error

class EnhancedPangolinClient(PangolinClient):
    """Enhanced client with automatic retries"""
    
    @retry_on_error(max_retries=3, backoff_factor=2.0)
    def get_product_details(self, asin: str, marketplace: str = 'US',
                            parse: bool = True) -> Optional[Dict]:
        """Product details with automatic retries"""
        return super().get_product_details(asin, marketplace, parse)
    
    @retry_on_error(max_retries=3, backoff_factor=2.0)
    def get_bestsellers(self, category: str, marketplace: str = 'US', 
                        page: int = 1) -> Optional[Dict]:
        """Best Sellers data with automatic retries"""
        return super().get_bestsellers(category, marketplace, page)

Hands-On Project 1: A Best Sellers Monitoring System

Core Feature Design

  1. Collect ranking data for specified categories on a schedule
  2. Compare against historical data to identify newly ranked products
  3. Track rank-change trends
  4. Generate Excel reports
  5. Send alerts on anomalies

Complete Implementation

# src/monitors/bestseller.py
import pandas as pd
from datetime import datetime
from pathlib import Path
import json
from typing import Dict, List, Optional
from ..client import PangolinClient

class BestSellersMonitor:
    """
    Best Sellers monitoring system.
    
    Features:
    - scheduled ranking collection
    - comparison against historical data
    - rank-change tracking
    - report generation
    """
    
    def __init__(self, client: PangolinClient, data_dir: str = './data'):
        """
        Initialize the monitor.
        
        Args:
            client: Pangolin API client instance
            data_dir: data storage directory
        """
        self.client = client
        self.data_dir = Path(data_dir)
        self.data_dir.mkdir(parents=True, exist_ok=True)
        
        self.history_file = self.data_dir / 'bestsellers_history.json'
        self.history = self._load_history()
    
    def _load_history(self) -> Dict:
        """Load historical data"""
        if self.history_file.exists():
            with open(self.history_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        return {}
    
    def _save_history(self):
        """Persist historical data"""
        with open(self.history_file, 'w', encoding='utf-8') as f:
            json.dump(self.history, f, ensure_ascii=False, indent=2)
    
    def monitor_category(
        self, 
        category: str, 
        marketplace: str = 'US',
        max_pages: int = 1
    ) -> Dict:
        """
        Monitor the ranking for a given category.
        
        Args:
            category: category identifier
            marketplace: marketplace code
            max_pages: number of pages to collect
            
        Returns:
            Monitoring result dictionary
        """
        print(f"[{datetime.now()}] Collecting {marketplace}/{category} ranking...")
        
        all_products = []
        
        # Collect multiple pages
        for page in range(1, max_pages + 1):
            data = self.client.get_bestsellers(category, marketplace, page)
            if not data or 'products' not in data:
                print(f"Failed to collect page {page}")
                break
            
            all_products.extend(data['products'])
            print(f"Page {page} done, {len(data['products'])} products fetched")
        
        if not all_products:
            return {'success': False, 'message': 'Data collection failed'}
        
        # Build the current ranking snapshot
        timestamp = datetime.now().isoformat()
        current_ranking = {}
        
        for product in all_products:
            asin = product.get('asin')
            if not asin:
                continue
                
            price_info = product.get('price') or {}  # guard against a null price field
            current_ranking[asin] = {
                'rank': product.get('rank'),
                'title': product.get('title'),
                'price': price_info.get('value'),
                'currency': price_info.get('currency'),
                'rating': product.get('rating'),
                'reviews_count': product.get('reviews_count'),
                'timestamp': timestamp
            }
        
        # Analyze changes
        category_key = f"{marketplace}_{category}"
        changes = {}
        
        if category_key in self.history:
            changes = self._analyze_changes(category_key, current_ranking)
        
        # Update history
        self.history[category_key] = current_ranking
        self._save_history()
        
        print(f"Collection finished, {len(current_ranking)} products in total")
        
        return {
            'success': True,
            'total_products': len(current_ranking),
            'changes': changes,
            'timestamp': timestamp
        }
    
    def _analyze_changes(
        self, 
        category_key: str, 
        current_ranking: Dict
    ) -> Dict:
        """
        Analyze ranking changes.
        
        Args:
            category_key: category key
            current_ranking: current ranking snapshot
            
        Returns:
            Change analysis results
        """
        previous = self.history[category_key]
        
        # Newly ranked products
        new_products = []
        current_asins = set(current_ranking.keys())
        previous_asins = set(previous.keys())
        
        for asin in current_asins - previous_asins:
            product = current_ranking[asin]
            new_products.append({
                'asin': asin,
                'title': product['title'],
                'rank': product['rank'],
                'price': product['price']
            })
        
        # Rank movements
        rank_changes = []
        for asin in current_asins & previous_asins:
            old_rank = previous[asin]['rank']
            new_rank = current_ranking[asin]['rank']
            
            if old_rank and new_rank:
                change = old_rank - new_rank
                
                if abs(change) >= 10:  # moved more than 10 positions
                    rank_changes.append({
                        'asin': asin,
                        'title': current_ranking[asin]['title'],
                        'old_rank': old_rank,
                        'new_rank': new_rank,
                        'change': change
                    })
        
        # Products that dropped off the list
        removed_products = []
        for asin in previous_asins - current_asins:
            product = previous[asin]
            removed_products.append({
                'asin': asin,
                'title': product['title'],
                'last_rank': product['rank']
            })
        
        # Print the analysis
        if new_products:
            print(f"\n{len(new_products)} newly ranked products:")
            for p in new_products[:5]:  # show the first 5 only
                print(f"  - [{p['rank']}] {p['title'][:50]}...")
        
        if rank_changes:
            print(f"\nProducts with significant rank movement ({len(rank_changes)}):")
            for p in sorted(rank_changes, key=lambda x: abs(x['change']), reverse=True)[:5]:
                direction = "↑" if p['change'] > 0 else "↓"
                print(f"  {direction} [{p['old_rank']}→{p['new_rank']}] {p['title'][:50]}...")
        
        return {
            'new_products': new_products,
            'rank_changes': rank_changes,
            'removed_products': removed_products
        }
    
    def export_to_excel(
        self, 
        category_key: str, 
        filename: Optional[str] = None
    ):
        """
        Export ranking data to Excel.
        
        Args:
            category_key: category key (format: marketplace_category)
            filename: output file name
        """
        if category_key not in self.history:
            print(f"No history found for {category_key}")
            return
        
        data = self.history[category_key]
        df = pd.DataFrame.from_dict(data, orient='index')
        df = df.sort_values('rank')
        
        if not filename:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = self.data_dir / f'bestsellers_{category_key}_{timestamp}.xlsx'
        
        df.to_excel(filename, index_label='ASIN')
        print(f"Data exported to: {filename}")
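The new/removed/moved detection in _analyze_changes boils down to set operations on the ASIN keys; in isolation (with made-up ASINs):

```python
previous = {"B001": 1, "B002": 2, "B003": 3}   # ASIN -> rank, last run
current  = {"B002": 1, "B003": 5, "B004": 2}   # ASIN -> rank, this run

new_asins     = set(current) - set(previous)   # newly ranked
removed_asins = set(previous) - set(current)   # dropped off the list
moved = {a: previous[a] - current[a]           # positive = moved up
         for a in set(current) & set(previous)}

print(sorted(new_asins))      # ['B004']
print(sorted(removed_asins))  # ['B001']
print(moved)                  # B002 moved up 1, B003 moved down 2
```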

Usage Example

# main.py
from src.client import PangolinClient
from src.monitors.bestseller import BestSellersMonitor
import schedule
import time

def main():
    # Initialize the client and monitor
    client = PangolinClient()
    monitor = BestSellersMonitor(client)
    
    # Define the monitoring job
    def daily_monitor():
        # Monitor several categories
        categories = ['kitchen', 'home', 'electronics']
        
        for category in categories:
            result = monitor.monitor_category(category, 'US', max_pages=2)
            if result['success']:
                monitor.export_to_excel(f"US_{category}")
    
    # Schedule the job for 09:00 every day
    schedule.every().day.at("09:00").do(daily_monitor)
    
    # Run once immediately
    daily_monitor()
    
    # Keep the process alive
    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == '__main__':
    main()

Hands-On Project 2: A Competitor Price Tracking System

System Architecture

Price tracking system
├── Collection layer: fetch product prices on a schedule
├── Storage layer: SQLite database
├── Analysis engine: price-change detection
└── Alerting: anomaly notifications

Complete Implementation

# src/monitors/price.py
import sqlite3
import pandas as pd
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Dict, Optional
from ..client import PangolinClient

class PriceTracker:
    """
    Competitor price tracking system.
    
    Features:
    - scheduled price collection
    - historical price storage
    - price-change analysis
    - anomaly alerts
    """
    
    def __init__(self, client: PangolinClient, db_path: str = './data/price_history.db'):
        """
        Initialize the price tracker.
        
        Args:
            client: Pangolin API client instance
            db_path: database file path
        """
        self.client = client
        self.db_path = Path(db_path)
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        
        self._init_database()
    
    def _init_database(self):
        """Initialize the SQLite database"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS price_records (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                asin TEXT NOT NULL,
                marketplace TEXT NOT NULL,
                price REAL,
                currency TEXT,
                availability TEXT,
                in_stock BOOLEAN,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        
        cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_asin_time 
            ON price_records(asin, timestamp)
        ''')
        
        conn.commit()
        conn.close()
    
    def track_product(
        self, 
        asin: str, 
        marketplace: str = 'US'
    ) -> bool:
        """
        Track the price of a single product.
        
        Args:
            asin: product ASIN
            marketplace: marketplace code
            
        Returns:
            True on success
        """
        product_data = self.client.get_product_details(asin, marketplace)
        if not product_data:
            return False
        
        price_info = product_data.get('price') or {}
        availability = product_data.get('availability', '')
        
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute('''
            INSERT INTO price_records 
            (asin, marketplace, price, currency, availability, in_stock)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            asin,
            marketplace,
            price_info.get('value'),
            price_info.get('currency'),
            availability,
            'in stock' in availability.lower() if availability else False
        ))
        
        conn.commit()
        conn.close()
        
        return True
    
    def track_multiple(
        self, 
        asin_list: List[str], 
        marketplace: str = 'US',
        delay: float = 1.0
    ) -> Dict:
        """
        Track multiple products in a batch.
        
        Args:
            asin_list: list of ASINs
            marketplace: marketplace code
            delay: delay between requests in seconds
            
        Returns:
            Tracking statistics
        """
        import time
        
        success_count = 0
        failed_asins = []
        
        for asin in asin_list:
            if self.track_product(asin, marketplace):
                success_count += 1
                print(f"✓ {asin} price recorded")
            else:
                failed_asins.append(asin)
                print(f"✗ {asin} collection failed")
            
            time.sleep(delay)
        
        result = {
            'total': len(asin_list),
            'success': success_count,
            'failed': len(failed_asins),
            'failed_asins': failed_asins
        }
        
        print(f"\nDone: {success_count}/{len(asin_list)} products")
        return result
    
    def get_price_history(
        self, 
        asin: str, 
        days: int = 30
    ) -> pd.DataFrame:
        """
        Fetch a product's price history.
        
        Args:
            asin: product ASIN
            days: number of days to query
            
        Returns:
            Price history DataFrame
        """
        conn = sqlite3.connect(self.db_path)
        
        # Bind the interval as a parameter rather than formatting it into the
        # SQL string, which would invite injection if `days` were user-supplied
        query = '''
            SELECT timestamp, price, currency, availability, in_stock
            FROM price_records
            WHERE asin = ?
            AND timestamp >= datetime('now', ?)
            ORDER BY timestamp
        '''
        
        df = pd.read_sql_query(query, conn, params=(asin, f'-{days} days'))
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        
        conn.close()
        
        return df
    
    def detect_price_changes(
        self, 
        asin: str, 
        threshold: float = 0.05
    ) -> Optional[Dict]:
        """
        Detect abnormal price changes.
        
        Args:
            asin: product ASIN
            threshold: change threshold (as a fraction)
            
        Returns:
            Change details, or None if there is no significant change
        """
        df = self.get_price_history(asin, days=7)
        
        if len(df) < 2:
            return None
        
        latest = df.iloc[-1]
        previous = df.iloc[-2]
        
        if pd.notna(latest['price']) and pd.notna(previous['price']):
            change_rate = (latest['price'] - previous['price']) / previous['price']
            
            if abs(change_rate) >= threshold:
                return {
                    'asin': asin,
                    'previous_price': previous['price'],
                    'current_price': latest['price'],
                    'change_rate': change_rate,
                    'change_amount': latest['price'] - previous['price'],
                    'alert_type': 'price_drop' if change_rate < 0 else 'price_increase',
                    'timestamp': latest['timestamp']
                }
        
        return None
    
    def generate_report(
        self, 
        asin_list: List[str],
        threshold: float = 0.05
    ) -> List[Dict]:
        """
        Generate a price-monitoring report.
        
        Args:
            asin_list: list of ASINs
            threshold: change threshold
            
        Returns:
            List of price-change alerts
        """
        alerts = []
        
        for asin in asin_list:
            alert = self.detect_price_changes(asin, threshold)
            if alert:
                alerts.append(alert)
        
        if alerts:
            print("\n=== Price alerts ===")
            for alert in alerts:
                change_pct = alert['change_rate'] * 100
                symbol = "↓" if alert['alert_type'] == 'price_drop' else "↑"
                print(f"{symbol} {alert['asin']}: "
                      f"${alert['previous_price']:.2f} → ${alert['current_price']:.2f} "
                      f"({change_pct:+.1f}%)")
        else:
            print("No significant price changes detected")
        
        return alerts
    
    def export_price_trends(
        self, 
        asin: str, 
        filename: Optional[str] = None
    ):
        """
        Export price trend data.
        
        Args:
            asin: product ASIN
            filename: output file name
        """
        df = self.get_price_history(asin, days=30)
        
        if df.empty:
            print(f"No price history found for {asin}")
            return
        
        if not filename:
            filename = f'price_trend_{asin}_{datetime.now().strftime("%Y%m%d")}.xlsx'
        
        # Compute statistics
        df['price_change'] = df['price'].diff()
        df['price_change_pct'] = df['price'].pct_change() * 100
        
        with pd.ExcelWriter(filename, engine='openpyxl') as writer:
            df.to_excel(writer, sheet_name='Price History', index=False)
            
            # Add a summary sheet
            summary = pd.DataFrame({
                'Metric': ['Max price', 'Min price', 'Mean price', 'Current price', 'Price std dev'],
                'Value': [
                    df['price'].max(),
                    df['price'].min(),
                    df['price'].mean(),
                    df['price'].iloc[-1],
                    df['price'].std()
                ]
            })
            summary.to_excel(writer, sheet_name='Summary', index=False)
        
        print(f"Price trends exported to: {filename}")
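Stripped of the storage layer, detect_price_changes is just a threshold on relative change; a minimal standalone version of that rule:

```python
def price_alert(previous: float, current: float, threshold: float = 0.05):
    """Return an alert dict when |relative change| >= threshold, else None."""
    change_rate = (current - previous) / previous
    if abs(change_rate) < threshold:
        return None
    return {
        "change_rate": change_rate,
        "alert_type": "price_drop" if change_rate < 0 else "price_increase",
    }

print(price_alert(20.00, 18.00))  # 10% drop -> alert dict
print(price_alert(20.00, 19.50))  # 2.5% change -> None
```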

Performance Optimization: Concurrency and Caching

Concurrent Collection

# src/utils/concurrent.py
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict, Callable
import time

class ConcurrentFetcher:
    """Concurrent data fetcher"""
    
    def __init__(self, max_workers: int = 5, rate_limit: float = 0.2):
        """
        Initialize the fetcher.
        
        Args:
            max_workers: maximum number of concurrent workers
            rate_limit: pause between processed results, in seconds
        """
        self.max_workers = max_workers
        self.rate_limit = rate_limit
    
    def fetch_batch(
        self, 
        fetch_func: Callable,
        items: List,
        **kwargs
    ) -> Dict:
        """
        Fetch a batch of items concurrently.
        
        Args:
            fetch_func: the fetch function to call per item
            items: items to fetch
            **kwargs: extra arguments passed to fetch_func
            
        Returns:
            Fetch results dictionary
        """
        results = {}
        failed = []
        
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit the tasks
            future_to_item = {
                executor.submit(fetch_func, item, **kwargs): item
                for item in items
            }
            
            # Collect the results
            for future in as_completed(future_to_item):
                item = future_to_item[future]
                try:
                    data = future.result()
                    if data:
                        results[item] = data
                        print(f"✓ {item}")
                    else:
                        failed.append(item)
                        print(f"✗ {item} - no data")
                except Exception as e:
                    failed.append(item)
                    print(f"✗ {item} - {str(e)}")
                
                # Pace result handling; note that all tasks are submitted up
                # front, so to throttle the requests themselves, sleep inside
                # fetch_func instead
                time.sleep(self.rate_limit)
        
        return {
            'success': results,
            'failed': failed,
            'total': len(items),
            'success_count': len(results)
        }
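Usage follows the same pattern as the class above; shown here with a stand-in fetch function (hypothetical ASINs, no network access) so it runs on its own. With the real client, you would pass something like client.get_product_details instead:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(asin, marketplace="US"):
    # Stand-in for client.get_product_details: returns None for one ASIN
    if asin == "B000BAD":
        return None
    return {"asin": asin, "marketplace": marketplace}

items = ["B001AAA", "B002BBB", "B000BAD"]
results, failed = {}, []

with ThreadPoolExecutor(max_workers=2) as executor:
    future_to_item = {executor.submit(fake_fetch, a): a for a in items}
    for future in as_completed(future_to_item):
        item = future_to_item[future]
        data = future.result()
        if data:
            results[item] = data
        else:
            failed.append(item)

print(sorted(results))  # ['B001AAA', 'B002BBB']
print(failed)           # ['B000BAD']
```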

Caching

# src/utils/cache.py
from datetime import datetime, timedelta
from typing import Dict, Any, Optional
import pickle
from pathlib import Path

class DataCache:
    """Data cache manager"""
    
    def __init__(self, cache_dir: str = './cache', default_ttl: int = 3600):
        """
        Initialize the cache manager.
        
        Args:
            cache_dir: cache directory
            default_ttl: default time-to-live in seconds
        """
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.default_ttl = default_ttl
        self.memory_cache = {}
    
    def get(self, key: str) -> Optional[Any]:
        """
        Fetch cached data.
        
        Args:
            key: cache key
            
        Returns:
            The cached data, or None if missing or expired
        """
        # Check the in-memory cache first
        if key in self.memory_cache:
            data, expire_time = self.memory_cache[key]
            if datetime.now() < expire_time:
                return data
            else:
                del self.memory_cache[key]
        
        # Fall back to the file cache
        cache_file = self.cache_dir / f"{key}.pkl"
        if cache_file.exists():
            try:
                with open(cache_file, 'rb') as f:
                    data, expire_time = pickle.load(f)
                
                if datetime.now() < expire_time:
                    # Promote to the in-memory cache
                    self.memory_cache[key] = (data, expire_time)
                    return data
                else:
                    cache_file.unlink()
            except Exception as e:
                print(f"Cache read failed: {e}")
        
        return None
    
    def set(self, key: str, data: Any, ttl: Optional[int] = None):
        """
        Store data in the cache.
        
        Args:
            key: cache key
            data: data to cache
            ttl: time-to-live in seconds; None uses the default
        """
        ttl = ttl or self.default_ttl
        expire_time = datetime.now() + timedelta(seconds=ttl)
        
        # Save to the in-memory cache
        self.memory_cache[key] = (data, expire_time)
        
        # Save to the file cache
        cache_file = self.cache_dir / f"{key}.pkl"
        try:
            with open(cache_file, 'wb') as f:
                pickle.dump((data, expire_time), f)
        except Exception as e:
            print(f"Cache write failed: {e}")
    
    def clear(self, key: Optional[str] = None):
        """
        Clear the cache.
        
        Args:
            key: cache key; None clears everything
        """
        if key:
            # Clear one entry
            if key in self.memory_cache:
                del self.memory_cache[key]
            
            cache_file = self.cache_dir / f"{key}.pkl"
            if cache_file.exists():
                cache_file.unlink()
        else:
            # Clear all entries
            self.memory_cache.clear()
            for cache_file in self.cache_dir.glob('*.pkl'):
                cache_file.unlink()


# Note: requires `from ..client import PangolinClient` at the top of the module
class CachedPangolinClient(PangolinClient):
    """API client with caching"""
    
    def __init__(self, cache_ttl: int = 3600, **kwargs):
        super().__init__(**kwargs)
        self.cache = DataCache(default_ttl=cache_ttl)
    
    def get_product_details(
        self, 
        asin: str, 
        marketplace: str = 'US',
        use_cache: bool = True
    ) -> Optional[Dict]:
        """Product details with caching"""
        cache_key = f"product_{marketplace}_{asin}"
        
        if use_cache:
            cached_data = self.cache.get(cache_key)
            if cached_data:
                print(f"Served from cache: {asin}")
                return cached_data
        
        data = super().get_product_details(asin, marketplace)
        
        if data:
            self.cache.set(cache_key, data)
        
        return data
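The cache's core contract (a value expires after its TTL) can be checked in isolation with a stripped-down in-memory version of DataCache (same (data, expire_time) tuples, no pickle layer):

```python
import time
from datetime import datetime, timedelta

class TinyCache:
    # In-memory subset of DataCache for demonstration
    def __init__(self):
        self._store = {}

    def set(self, key, data, ttl):
        self._store[key] = (data, datetime.now() + timedelta(seconds=ttl))

    def get(self, key):
        if key not in self._store:
            return None
        data, expire_time = self._store[key]
        if datetime.now() < expire_time:
            return data
        del self._store[key]  # expired: evict on read
        return None

cache = TinyCache()
cache.set("product_US_B001", {"price": 19.99}, ttl=0.2)
print(cache.get("product_US_B001"))  # {'price': 19.99}
time.sleep(0.3)
print(cache.get("product_US_B001"))  # None -- expired
```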

Production Deployment Notes

Deploying with Docker

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "main.py"]

Logging Configuration

# src/utils/logger.py
import logging
from pathlib import Path

def setup_logger(name: str, log_file: str = 'app.log', level=logging.INFO):
    """Configure the logging system"""
    
    # Create the log directory
    log_dir = Path('logs')
    log_dir.mkdir(exist_ok=True)
    
    # Create the logger
    logger = logging.getLogger(name)
    logger.setLevel(level)
    
    # Avoid attaching duplicate handlers if setup_logger is called twice
    if logger.handlers:
        return logger
    
    # File handler
    fh = logging.FileHandler(log_dir / log_file, encoding='utf-8')
    fh.setLevel(level)
    
    # Console handler
    ch = logging.StreamHandler()
    ch.setLevel(level)
    
    # Formatter
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    
    logger.addHandler(fh)
    logger.addHandler(ch)
    
    return logger

Summary

This article covered a complete approach to integrating the Pangolin Scrape API with Python:

  1. Architecture: a modular client design that is easy to extend and maintain
  2. Error handling: thorough exception handling with automatic retries
  3. Hands-on projects: two production-grade applications, a Best Sellers monitor and a price tracker
  4. Performance: concurrent collection and caching strategies
  5. Deployment: Docker packaging and logging configuration

With API integration, developers can quickly build a stable, reliable data-collection system and focus on business logic and data analysis instead of maintaining crawler infrastructure.


About the author: a senior Python developer focused on e-commerce data collection and analytics.

Copyright notice: this is an original technical article; please credit the source when republishing.
