Python Scraping in Practice: Travel Data Collection - A Complete Price-Monitoring Solution for Ctrip & Qunar Hotels and Flights (with CSV Export + SQLite Persistence)!

㊙️ This installment is part of the column "Python爬虫实战" (Python Scraping in Practice), which keeps expanding both the knowledge base and the hands-on projects. Subscribe and bookmark it first so later lookups are easier. Continuously updated!

㊗️ Scraping difficulty: ⭐⭐

🚫 Disclaimer: the data and code here are for learning and exchange only. Commercial use, reselling data, or violating the target sites' terms of service is strictly forbidden; all consequences are borne by the user. Publicly listed data is generally accessible, but please honor the "gentleman's agreement" (robots.txt). Technology is neutral; responsibility lies with people.


🌟 Opening Remarks

Hello everyone~ I'm Miaoshou (喵手).

Communities I'm active on: CSDN / Juejin / Tencent Cloud / Alibaba Cloud / Huawei Cloud / 51CTO

Feel free to drop by often; let's learn and improve together~🌟

I focus long-term on production-grade Python scraping and run the column "Python爬虫实战" (Python Scraping in Practice): from collection strategy to anti-bot countermeasures, and from data cleaning to distributed scheduling, I keep publishing reusable methodology and deployable case studies. The content aims to be runnable, usable, and extensible, so that data value is truly captured, cleaned, and put to use.

📌 Column Guide (worth bookmarking)

  • ✅ Fundamentals: environment setup / requests & parsing / persisting data
  • ✅ Intermediate: login & auth / dynamic rendering / anti-bot countermeasures
  • ✅ Engineering: async concurrency / distributed scheduling / monitoring & fault tolerance
  • ✅ Projects: data governance / visual analytics / applied scenarios

📣 Column plug: if you want to learn scraping systematically rather than piecing fragments together, subscribe to / follow the column 👉《Python爬虫实战》👈

💕 Subscribers get updates pushed first, and learning in table-of-contents order is more efficient 💯~

1️⃣ Abstract

Project goal: build an intelligent travel-price monitoring system that uses Python scrapers to collect hotel and flight prices from Ctrip and Qunar in bulk, enabling cross-platform, cross-date price comparison so users can find the best travel deals.

Core value

  • 📊 Price-trend analysis: track historical price movements of hotels/flights and predict the best time to book
  • 💰 Cross-platform comparison: surface price gaps for the same flight/hotel on Ctrip vs. Qunar
  • 🔔 Price alerts: automatic notification when a price falls below a threshold, so you catch flash deals
  • 📈 Data visualization: generate price curves that make pricing patterns visible at a glance

Technical highlights

  • Reverse-engineering Ctrip's signed API parameters
  • Handling Qunar's slider captcha (smart retry strategy)
  • Coping with dynamic pricing (real-time lookups, caching)
  • Distributed crawling (avoiding single-IP rate limits)

What you will learn

  1. Anti-bot strategies on travel sites and how to respond
  2. Storing and analyzing time-series data
  3. Cleaning and normalizing price data
  4. Polite Scraping: a low-frequency, low-impact crawl strategy

2️⃣ Background and Requirements Analysis (Why)

Why monitor travel prices?

As a programmer who travels frequently, I book dozens of flights and hotels every year. Over time I noticed some "hidden" pricing patterns:

Flight-price secrets

  1. The early-bird trap: earlier is not always cheaper; the sweet spot is usually 21-45 days before departure
  2. Time-of-day swings: for the same flight, the 6 a.m. and 8 p.m. fares can differ by ¥500
  3. Platform spread: Ctrip is often ¥50-100 more expensive than Qunar (Ctrip adds a service fee)
  4. Dynamic pricing: airlines reprice based on remaining seats; a fare can change 3 times within an hour

Hotel-price patterns

  1. The membership trap: a Ctrip Gold member rate can exceed Qunar's ordinary-user rate
  2. Block bookings: OTA platforms pre-book room blocks and sell them up to 30% below the official site
  3. Cancellation policy: non-refundable rooms run about 20% cheaper, at higher risk
  4. Holiday premiums: Friday rates run roughly 40% above Monday's, and holidays can double them

A real example: the money I saved

Actual savings from this monitoring system in 2023:

| Scenario | Original price | Final price | Saved | Notes |
| --- | --- | --- | --- | --- |
| Beijing → Shanghai flight | ¥1580 | ¥980 | ¥600 | monitoring caught a discounted fare |
| 5-star hotel in Sanya | ¥1200/night | ¥680/night | ¥2600 (5 nights) | found a Qunar flash sale |
| Chengdu round trip | ¥2400 | ¥1850 | ¥550 | Ctrip vs. Qunar comparison |
| Annual total | - | - | ¥8300 | accumulated over 30 trips |

Project requirements

Functional requirements

  1. Hotel monitoring

    • Filter by city, check-in date, and star rating
    • Collect hotel name, location, price, room type, and rating
    • Compare Ctrip vs. Qunar prices
    • Track historical prices and plot trend charts
  2. Flight monitoring

    • Specify departure city, destination, and date
    • Collect flight number, departure/arrival times, aircraft type, and price
    • Compare prices across time slots and carriers
    • Watch for discounted fares and mileage-award tickets
  3. Alerting

    • Email/WeChat notification when a price falls below a set threshold
    • Immediate alert on an unusually low price (historical minimum)
    • Scheduled reports (weekly price summary)

Non-functional requirements

  1. Compliance: strictly honor robots.txt; request rate below 1 request per 5 seconds
  2. Stability: handle anti-bot measures, captchas, and IP bans
  3. Extensibility: easy to add other platforms such as Fliggy and Tuniu
  4. Maintainability: modular design with thorough logging

3️⃣ Compliance Statement (Legal & Ethical)

⚠️ Important legal notes

Allowed (legal and ethical):

  • ✅ Scraping publicly displayed prices for personal comparison
  • ✅ Studying pricing patterns for academic purposes
  • ✅ Low-frequency crawling (one request per item every 5-10 seconds)
  • ✅ Respecting robots.txt and the sites' terms of service

Forbidden (illegal or unethical):

  • ❌ Scraping personal data (names, phone numbers, order history)
  • ❌ Packaging and selling the data to third parties
  • ❌ High-frequency crawling that stresses the servers (effectively a DoS attack)
  • ❌ Circumventing paywalls to obtain member-only prices
  • ❌ Using scraped data for malicious arbitrage (e.g., ticket scalping)

Robots.txt analysis

Ctrip robots.txt (simplified)

User-agent: *
Disallow: /user/
Disallow: /member/
Disallow: /order/
Allow: /hotel/
Allow: /flights/
Crawl-delay: 5

Qunar robots.txt

User-agent: *
Disallow: /user/
Disallow: /booking/
Allow: /hotel/
Crawl-delay: 3

Reading the rules

  • Both sites allow crawling hotel and flight listing pages, but disallow user, member, and order paths
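These rules can be checked programmatically before each crawl. A minimal sketch using the standard library's urllib.robotparser against the simplified Ctrip rules quoted above (parsed from a string here rather than fetched over the network):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse the simplified Ctrip rules shown above, line by line.
rp.parse("""User-agent: *
Disallow: /user/
Disallow: /member/
Disallow: /order/
Allow: /hotel/
Allow: /flights/
Crawl-delay: 5
""".splitlines())

# Listing pages are allowed, order pages are not, and Crawl-delay is exposed too.
print(rp.can_fetch("*", "https://hotels.ctrip.com/hotel/beijing1/p1"))  # True
print(rp.can_fetch("*", "https://hotels.ctrip.com/order/12345"))        # False
print(rp.crawl_delay("*"))                                              # 5
```

In production you would call `rp.set_url(".../robots.txt")` and `rp.read()` instead, so the live rules are honored rather than a pasted snapshot.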

Our compliance commitments

  1. Rate limiting

    # 5-10 second gap between requests
    DELAY_RANGE = (5, 10)
    
    # at most 100 records scraped per hour
    HOURLY_LIMIT = 100
  2. Declared User-Agent

    headers = {
        'User-Agent': 'PriceMonitorBot/1.0 (+https://github.com/yourname/travel-scraper; contact@email.com)'
    }
  3. Data usage scope

    • Personal travel decisions only
    • No data services offered to third parties
    • No interference with the sites' normal operation
  4. Terms-of-service adherence

    • Ctrip: no large-scale automated harvesting
    • Qunar: no bypassing the captcha protections (publicly visible data only)
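The two limits above (random inter-request delay plus an hourly cap) can be combined into one small helper. A hedged sketch, not the project's final `anti_spider.py`; the injectable `now`/`sleep` parameters are an assumption added here to make the limiter testable:

```python
import random
import time
from collections import deque


class PoliteRateLimiter:
    """Enforce a random per-request delay plus a rolling hourly request cap."""

    def __init__(self, delay_range=(5, 10), hourly_limit=100):
        self.delay_range = delay_range
        self.hourly_limit = hourly_limit
        self.timestamps = deque()  # request times within the last hour

    def wait(self, now=None, sleep=time.sleep):
        """Block until it is polite to send the next request."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the one-hour window.
        while self.timestamps and now - self.timestamps[0] > 3600:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.hourly_limit:
            # Cap reached: sleep until the oldest request leaves the window.
            sleep(3600 - (now - self.timestamps[0]))
        else:
            # Normal case: random 5-10 s polite delay.
            sleep(random.uniform(*self.delay_range))
        self.timestamps.append(now)


limiter = PoliteRateLimiter(delay_range=(0, 0), hourly_limit=2)
slept = []
limiter.wait(now=0, sleep=slept.append)
limiter.wait(now=1, sleep=slept.append)
limiter.wait(now=2, sleep=slept.append)  # third request inside the hour: long sleep
print(slept[2])  # 3598 seconds, i.e. until the first request falls out of the window
```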

4️⃣ Technology Choices and Architecture (What/How)

Site analysis

Ctrip

Listing pages

  • URL pattern: https://hotels.ctrip.com/hotel/beijing1/p{page}

  • Data source: server-rendered HTML with embedded JSON

  • Anti-bot measures:

    • Request-signature validation (an encrypted token must be sent)
    • Cookie fingerprinting
    • Slider captcha triggered by frequent requests
    • IP throttling (more than ~20 requests per minute from one IP gets banned)

Detail pages

  • URL: https://hotels.ctrip.com/hotel/{hotel_id}.html
  • Price API: https://m.ctrip.com/restapi/soa2/xxx/json (requires a signature)
  • Dynamic pricing: the price may change on every refresh

Flight pages

  • URL: https://flights.ctrip.com/booking/{dep}-{arr}-day{date}.html
  • Data loading: asynchronous AJAX requests
  • API format: JSON (payload must be decoded)

Qunar

Listing pages

  • URL: https://hotel.qunar.com/city/beijing/

  • Rendering: React SPA (single-page application)

  • Anti-bot measures:

    • Strict Referer checks
    • Obfuscated JS around API calls
    • Slider captcha (low trigger threshold)
    • WebSocket-based session monitoring

Characteristics

  • Requires Selenium to drive a real browser
  • Slow page loads (heavy ad scripts)
  • Some data is visible only after login

Technology stack

Scraping layer:
requests==2.31.0           # HTTP requests (Ctrip)
selenium==4.16.0           # browser automation (Qunar)
selenium-stealth==1.0.6    # anti-detection
undetected-chromedriver==3.5.4  # hardened ChromeDriver
beautifulsoup4==4.12.2     # HTML parsing
lxml==5.1.0                # faster XML/HTML parsing

Data processing:
pandas==2.1.4              # data analysis
numpy==1.26.2              # numerical computing
arrow==1.3.0               # date handling

Storage:
sqlite3 (built-in)         # lightweight database
redis==5.0.1               # caching (optional)

Visualization:
matplotlib==3.8.2          # basic plotting
plotly==5.18.0             # interactive charts
seaborn==0.13.0            # statistical charts

Task scheduling:
apscheduler==3.10.4        # scheduled jobs
celery==5.3.4              # distributed tasks (optional)

Notification:
yagmail==0.15.293          # email alerts
requests                   # WeCom (enterprise WeChat) webhook

System architecture diagram

┌─────────────────────────────────────────────────────────┐
│                    用户层                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │Web界面    │  │移动端     │  │告警系统   │              │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘              │
└───────┼─────────────┼─────────────┼────────────────────┘
        │             │             │
┌───────┼─────────────┼─────────────┼────────────────────┐
│       ▼             ▼             ▼      应用层          │
│  ┌─────────────────────────────────────┐               │
│  │        API服务 (Flask/FastAPI)       │               │
│  └──────────────┬──────────────────────┘               │
│                 │                                        │
│  ┌──────────────┴──────────────────────┐               │
│  │        任务调度器 (APScheduler)       │               │
│  │  ┌──────┐  ┌──────┐  ┌──────┐      │               │
│  │  │定时任务│  │触发任务│  │监控任务│      │               │
│  │  └──────┘  └──────┘  └──────┘      │               │
│  └──────────────┬──────────────────────┘               │
└─────────────────┼─────────────────────────────────────┘
                  │
┌─────────────────┼─────────────────────────────────────┐
│                 ▼           爬虫层                       │
│  ┌──────────────────────────────────────┐             │
│  │         爬虫管理器 (Scraper Manager)   │             │
│  │  ┌────────────┐    ┌────────────┐    │             │
│  │  │携程爬虫     │    │去哪儿爬虫   │    │             │
│  │  │(Requests)  │    │(Selenium)  │    │             │
│  │  └─────┬──────┘    └─────┬──────┘    │             │
│  └────────┼─────────────────┼───────────┘             │
│           │                 │                          │
│  ┌────────┼─────────────────┼───────────┐             │
│  │        ▼                 ▼            │             │
│  │  ┌──────────┐      ┌──────────┐      │             │
│  │  │酒店解析器 │      │机票解析器 │      │             │
│  │  └──────────┘      └──────────┘      │             │
│  │  ┌────────────────────────────┐      │             │
│  │  │     反爬对抗模块             │      │             │
│  │  │ - Cookie池                  │      │             │
│  │  │ - User-Agent轮换            │      │             │
│  │  │ - 代理IP池                  │      │             │
│  │  │ - 验证码识别                │      │             │
│  │  └────────────────────────────┘      │             │
│  └───────────────────────────────────────┘             │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────┼──────────────────────────────────┐
│                      ▼         数据层                     │
│  ┌───────────────────────────────────────┐              │
│  │         数据清洗模块 (Cleaner)         │              │
│  │  - 价格标准化                          │              │
│  │  - 日期格式统一                        │              │
│  │  - 异常值过滤                          │              │
│  └───────────────┬───────────────────────┘              │
│                  │                                       │
│  ┌───────────────▼───────────────────────┐              │
│  │         存储管理器 (Storage)           │              │
│  │  ┌─────────┐  ┌─────────┐  ┌────────┐│              │
│  │  │SQLite   │  │Redis    │  │CSV/Excel││              │
│  │  │(主存储)  │  │(缓存)   │  │(导出)   ││              │
│  │  └─────────┘  └─────────┘  └────────┘│              │
│  └────────────────────────────────────────┘             │
└──────────────────────────────────────────────────────────┘

Data flow

User input → task scheduling → scraper run → parsing → cleaning → storage →
price comparison → trend analysis → alert evaluation → notification → user
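The "storage" step above is what the title's CSV export + SQLite persistence refers to; the full implementation lives in `data/storage.py` later in the series. As a minimal hedged sketch (table and column names here are illustrative assumptions, not the final schema):

```python
import csv
import sqlite3


def init_db(path=":memory:"):
    """Open the database and create the price table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS hotel_prices (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            platform   TEXT NOT NULL,
            hotel_name TEXT NOT NULL,
            city       TEXT,
            price      REAL,
            scraped_at TEXT DEFAULT (datetime('now'))
        )
    """)
    return conn


def save_price(conn, record):
    """Insert one scraped record (a dict shaped like the parsers' output)."""
    conn.execute(
        "INSERT INTO hotel_prices (platform, hotel_name, city, price) VALUES (?, ?, ?, ?)",
        (record["platform"], record["hotel_name"], record["city"], record["price"]),
    )
    conn.commit()


def export_csv(conn, path):
    """Dump the whole table to CSV for spreadsheet analysis (BOM for Excel)."""
    rows = conn.execute(
        "SELECT platform, hotel_name, city, price, scraped_at FROM hotel_prices"
    ).fetchall()
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["platform", "hotel_name", "city", "price", "scraped_at"])
        writer.writerows(rows)


conn = init_db()
save_price(conn, {"platform": "ctrip", "hotel_name": "Demo Hotel", "city": "北京", "price": 680.0})
export_csv(conn, "prices.csv")
print(conn.execute("SELECT platform, price FROM hotel_prices").fetchone())  # ('ctrip', 680.0)
```

Appending one row per scrape run (instead of overwriting) is what makes the later price-trend charts possible.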

5️⃣ Environment Setup

Project layout

travel_price_monitor/
│
├── config.py                   # configuration
├── main.py                     # entry point
├── requirements.txt            # dependency list
│
├── scrapers/                   # scraper modules
│   ├── __init__.py
│   ├── base_scraper.py         # scraper base class
│   ├── ctrip_scraper.py        # Ctrip scraper
│   ├── qunar_scraper.py        # Qunar scraper
│   └── anti_spider.py          # anti-bot helpers
│
├── parsers/                    # parsers
│   ├── __init__.py
│   ├── hotel_parser.py         # hotel data parsing
│   └── flight_parser.py        # flight data parsing
│
├── data/                       # data handling
│   ├── __init__.py
│   ├── cleaner.py              # data cleaning
│   ├── storage.py              # storage management
│   └── analyzer.py             # analysis
│
├── scheduler/                  # task scheduling
│   ├── __init__.py
│   └── monitor_scheduler.py    # scheduled jobs
│
├── notification/               # notifications
│   ├── __init__.py
│   └── notifier.py             # email/WeChat notifier
│
├── database/                   # database files
│   └── travel.db
│
├── logs/                       # log directory
│   └── monitor.log
│
└── output/                     # output files
    ├── reports/                # reports
    └── charts/                 # charts

Installing dependencies

# create a virtual environment
python -m venv travel_env
source travel_env/bin/activate  # Windows: travel_env\Scripts\activate

# install dependencies
pip install -r requirements.txt

requirements.txt

# scraping core
requests==2.31.0
selenium==4.16.0
selenium-stealth==1.0.6
undetected-chromedriver==3.5.4
beautifulsoup4==4.12.2
lxml==5.1.0
fake-useragent==1.4.0
webdriver-manager==4.0.1

# data processing
pandas==2.1.4
numpy==1.26.2
arrow==1.3.0

# visualization
matplotlib==3.8.2
plotly==5.18.0
seaborn==0.13.0

# task scheduling
apscheduler==3.10.4

# notification
yagmail==0.15.293

# utilities
python-dotenv==1.0.0

6️⃣ Configuration File Design (config.py)

"""
Travel price monitor configuration.
All tunable parameters live in one place.
"""

import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# ========== Base paths ==========
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
DATABASE_DIR = os.path.join(BASE_DIR, 'database')
LOG_DIR = os.path.join(BASE_DIR, 'logs')
OUTPUT_DIR = os.path.join(BASE_DIR, 'output')
CHART_DIR = os.path.join(OUTPUT_DIR, 'charts')
REPORT_DIR = os.path.join(OUTPUT_DIR, 'reports')

# Make sure the directories exist
for directory in [DATABASE_DIR, LOG_DIR, OUTPUT_DIR, CHART_DIR, REPORT_DIR]:
    os.makedirs(directory, exist_ok=True)

# ========== Database ==========
DATABASE_PATH = os.path.join(DATABASE_DIR, 'travel.db')

# ========== Common scraper settings ==========
# Request delay (seconds): polite-crawling principle
DELAY_RANGE = (5, 10)  # random 5-10 s delay

# Timeouts
TIMEOUT = 20  # HTTP request timeout (seconds)

# Retry count
RETRY_TIMES = 3

# User-Agent pool
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15'
]

# ========== Ctrip settings ==========
CTRIP_CONFIG = {
    'base_url': 'https://hotels.ctrip.com',
    'api_url': 'https://m.ctrip.com/restapi/soa2',
    'headers': {
        'User-Agent': USER_AGENTS[0],
        'Referer': 'https://hotels.ctrip.com/',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive'
    },
    # Cookie must be grabbed manually (log in via the browser and copy it)
    'cookie': os.getenv('CTRIP_COOKIE', ''),
    
    # city-name → city-code mapping (Chinese keys match user input)
    'city_codes': {
        '北京': '1',
        '上海': '2',
        '广州': '32',
        '深圳': '30',
        '成都': '28',
        '杭州': '14',
        '西安': '10',
        '重庆': '4'
    }
}

# ========== Qunar settings ==========
QUNAR_CONFIG = {
    'base_url': 'https://hotel.qunar.com',
    'headers': {
        'User-Agent': USER_AGENTS[0],
        'Referer': 'https://hotel.qunar.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    },
    
    # Selenium options
    'chrome_options': [
        '--headless',  # headless mode
        '--disable-gpu',
        '--no-sandbox',
        '--disable-dev-shm-usage',
        '--disable-blink-features=AutomationControlled',
        '--window-size=1920,1080',
        f'--user-agent={USER_AGENTS[0]}'
    ]
}

# ========== Monitoring tasks ==========
MONITOR_TASKS = {
    # hotel watch list
    'hotels': [
        {
            'name': 'Beijing Guomao hotel watch',
            'platform': 'ctrip',  # 'ctrip' or 'qunar'
            'city': '北京',
            'checkin_date': '2024-03-01',
            'checkout_date': '2024-03-03',
            'keywords': '国贸',  # hotel-name keyword (kept in Chinese: it is sent to the site)
            'min_star': 4,  # minimum star rating
            'alert_threshold': 800,  # alert threshold (CNY per night)
            'enabled': True
        },
        {
            'name': 'Shanghai Pudong Airport hotels',
            'platform': 'qunar',
            'city': '上海',
            'checkin_date': '2024-03-15',
            'checkout_date': '2024-03-16',
            'keywords': '浦东机场',
            'min_star': 3,
            'alert_threshold': 500,
            'enabled': True
        }
    ],
    
    # flight watch list
    'flights': [
        {
            'name': 'Beijing → Shanghai flights',
            'platform': 'ctrip',
            'departure': '北京',
            'arrival': '上海',
            'dep_date': '2024-03-20',
            'cabin_class': 'Y',  # Y = economy, C = business
            'alert_threshold': 1000,  # alert threshold (CNY)
            'enabled': True
        }
    ]
}

# ========== Scheduling ==========
SCHEDULE_CONFIG = {
    # main monitoring job: runs daily at 08:00 and 20:00
    'main_monitor': {
        'trigger': 'cron',
        'hour': '8,20',
        'minute': '0'
    },
    
    # price refresh interval (hours)
    'update_interval': 12,
    
    # weekly report generation time
    'weekly_report': {
        'trigger': 'cron',
        'day_of_week': 'sun',
        'hour': '22',
        'minute': '0'
    }
}

# ========== Alerting ==========
ALERT_CONFIG = {
    'enabled': True,
    'method': 'email',  # email, wechat, both
    
    # email settings
    'email': {
        'smtp_server': 'smtp.qq.com',
        'smtp_port': 465,
        'sender': os.getenv('EMAIL_SENDER', 'your_email@qq.com'),
        'password': os.getenv('EMAIL_PASSWORD', 'your_smtp_password'),
        'receivers': ['receiver@example.com']
    },
    
    # WeCom (enterprise WeChat) settings
    'wechat': {
        'webhook_url': os.getenv('WECHAT_WEBHOOK', '')
    },
    
    # alert rules
    'rules': {
        'price_drop_rate': 0.15,  # alert when the price drops more than 15%
        'historical_low': True,   # alert on a historical low
        'suppress_hours': 6       # do not re-alert on the same item within 6 hours
    }
}

# ========== Data cleaning ==========
CLEAN_CONFIG = {
    # plausible price ranges (CNY); values outside are treated as outliers
    'hotel_price_range': (50, 20000),  # hotel, per night
    'flight_price_range': (100, 50000),  # flight, one way
    
    # defaults for missing fields (values stay in Chinese to match site output)
    'fill_values': {
        'star': '未评级',
        'breakfast': '未知',
        'cancellation': '未知'
    }
}

# ========== Logging ==========
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {
            'format': '%(asctime)s - [%(levelname)s] - %(name)s - %(message)s',
            'datefmt': '%Y-%m-%d %H:%M:%S'
        },
        'simple': {
            'format': '[%(levelname)s] %(message)s'
        }
    },
    'handlers': {
        'console': {
            'level': 'INFO',
            'class': 'logging.StreamHandler',
            'formatter': 'simple'
        },
        'file': {
            'level': 'DEBUG',
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': os.path.join(LOG_DIR, 'monitor.log'),
            'maxBytes': 10485760,  # 10 MB
            'backupCount': 5,
            'formatter': 'standard',
            'encoding': 'utf-8'
        }
    },
    'loggers': {
        '': {
            'handlers': ['console', 'file'],
            'level': 'DEBUG',
            'propagate': False
        }
    }
}

# ========== Proxy settings (optional) ==========
PROXY_CONFIG = {
    'enabled': False,  # enable proxy rotation
    'proxy_pool': [
        # 'http://proxy1.com:8080',
        # 'http://proxy2.com:8080'
    ]
}

7️⃣ Ctrip Scraper Implementation

Ctrip hotel scraper (scrapers/ctrip_scraper.py)

"""
Ctrip scraper.
Key techniques:
1. Handling the signed API
2. Cookie management
3. Pagination
4. Live price queries
"""

import requests
import json
import time
import random
import logging
import hashlib
import re
from typing import List, Dict, Optional
from bs4 import BeautifulSoup
from config import *

logger = logging.getLogger(__name__)

class CtripScraper:
    """
    Ctrip scraper.
    
    Core features:
    1. Hotel search
    2. Price lookup
    3. Flight search
    """
    
    def __init__(self):
        """Initialize the Ctrip scraper."""
        self.base_url = CTRIP_CONFIG['base_url']
        self.api_url = CTRIP_CONFIG['api_url']
        self.session = requests.Session()
        self._setup_session()
        
        logger.info("Ctrip scraper initialized")
    
    def _setup_session(self):
        """Configure the shared Session."""
        self.session.headers.update(CTRIP_CONFIG['headers'])
        
        # Attach cookies (copied manually from the browser)
        if CTRIP_CONFIG['cookie']:
            cookie_dict = self._parse_cookie_string(CTRIP_CONFIG['cookie'])
            for key, value in cookie_dict.items():
                self.session.cookies.set(key, value)
            
            logger.debug("✓ cookies configured")
    
    def _parse_cookie_string(self, cookie_str: str) -> Dict:
        """Parse a raw Cookie header string into a dict."""
        cookie_dict = {}
        for item in cookie_str.split(';'):
            if '=' in item:
                key, value = item.strip().split('=', 1)
                cookie_dict[key] = value
        return cookie_dict
    
    def _random_delay(self):
        """Sleep for a random polite delay."""
        delay = random.uniform(*DELAY_RANGE)
        logger.debug(f"sleeping {delay:.2f} s...")
        time.sleep(delay)
    
    def search_hotels(self, city: str, checkin_date: str, 
                     checkout_date: str, keywords: str = "", 
                     min_star: int = 0, max_results: int = 50) -> List[Dict]:
        """
        Search hotels.
        
        Args:
            city: city name
            checkin_date: check-in date (YYYY-MM-DD)
            checkout_date: check-out date
            keywords: hotel-name keyword
            min_star: minimum star rating
            max_results: maximum number of results
            
        Returns:
            list of hotel dicts
            
        Ctrip hotel URL patterns:
        https://hotels.ctrip.com/hotel/beijing1/p{page}
        https://hotels.ctrip.com/hotel/beijing1/k{keywords}
        """
        logger.info("=" * 60)
        logger.info(f"Ctrip hotel search: {city} | {checkin_date} → {checkout_date}")
        if keywords:
            logger.info(f"keyword: {keywords}")
        logger.info("=" * 60)
        
        # Look up the city code
        city_code = CTRIP_CONFIG['city_codes'].get(city)
        if not city_code:
            logger.error(f"unsupported city: {city}")
            return []
        
        all_hotels = []
        page = 1
        
        while len(all_hotels) < max_results:
            logger.info(f"\n>>> fetching page {page}")
            
            # Build the URL
            if keywords:
                # keyword-search URL
                url = f"{self.base_url}/hotel/{city.lower()}{city_code}/k{keywords}/p{page}"
            else:
                # plain listing URL
                url = f"{self.base_url}/hotel/{city.lower()}{city_code}/p{page}"
            
            # Append the date parameters
            url += f"?checkin={checkin_date}&checkout={checkout_date}"
            
            logger.debug(f"request URL: {url}")
            
            # Fetch the page
            html = self._fetch_page(url)
            
            if not html:
                logger.warning(f"failed to fetch page {page}")
                break
            
            # Parse the hotel list
            hotels = self._parse_hotel_list(html, city)
            
            if not hotels:
                logger.warning(f"no hotel data parsed on page {page}")
                break
            
            # Filter by star rating
            if min_star > 0:
                hotels = [h for h in hotels if h.get('star_num', 0) >= min_star]
            
            logger.info(f"✓ page {page}: {len(hotels)} hotels (after filtering)")
            
            all_hotels.extend(hotels)
            
            # Stop once we have enough results
            if len(all_hotels) >= max_results:
                all_hotels = all_hotels[:max_results]
                break
            
            # Is there a next page?
            if not self._has_next_page(html):
                logger.info("reached the last page")
                break
            
            page += 1
            self._random_delay()
        
        logger.info("\n" + "=" * 60)
        logger.info(f"search complete: {len(all_hotels)} hotels")
        logger.info("=" * 60 + "\n")
        
        # Optionally enrich with live prices
        # all_hotels = self._enrich_with_real_prices(all_hotels, checkin_date, checkout_date)
        
        return all_hotels
    
    def _fetch_page(self, url: str, retries: int = RETRY_TIMES) -> Optional[str]:
        """
        Fetch a page, with retries.
        
        Args:
            url: target URL
            retries: number of attempts
            
        Returns:
            HTML text, or None on failure
        """
        for attempt in range(1, retries + 1):
            try:
                response = self.session.get(
                    url,
                    timeout=TIMEOUT,
                    allow_redirects=True
                )
                
                if response.status_code == 200:
                    # Did we get bounced to a captcha? ('验证' is the site's Chinese label)
                    if 'verify' in response.url or '验证' in response.text:
                        logger.error("captcha triggered; manual intervention needed")
                        return None
                    
                    logger.debug(f"✓ page fetched ({len(response.text)} bytes)")
                    return response.text
                
                elif response.status_code == 429:
                    wait_time = 2 ** attempt * 5
                    logger.warning(f"429 rate-limited; waiting {wait_time} s")
                    time.sleep(wait_time)
                
                else:
                    logger.warning(f"unexpected status code: {response.status_code}")
                    
            except requests.Timeout:
                logger.error(f"request timed out (attempt {attempt}/{retries})")
                time.sleep(5)
                
            except Exception as e:
                logger.error(f"request failed: {e}")
        
        logger.error(f"✗ still failing after {retries} attempts: {url}")
        return None
    
    def _parse_hotel_list(self, html: str, city: str) -> List[Dict]:
        """
        Parse the hotel list.
        
        Args:
            html: page HTML
            city: city name
            
        Returns:
            list of hotel dicts
            
        Ctrip HTML structure (2024 layout):
        <ul class="hotel_list">
            <li class="hotel_item">
                <div class="hotel_item_name">
                    <a href="/hotel/456789.html">北京国贸大酒店</a>
                </div>
                <div class="hotel_item_htladdress">
                    <span>朝阳区建国门外大街1号</span>
                </div>
                <div class="hotel_price">
                    <span class="J_price_lowList">¥<dfn>1280</dfn></span>
                </div>
                <div class="hotel_item_level">
                    <span class="star_level">五星级/豪华型</span>
                </div>
            </li>
        </ul>
        """
        soup = BeautifulSoup(html, 'lxml')
        hotels = []
        
        # Locate the hotel-list container
        hotel_items = soup.select('ul.hotel_list li.hotel_item')
        
        if not hotel_items:
            # Try a fallback selector
            hotel_items = soup.select('div.hotel_new_list div.hotel_item')
        
        logger.debug(f"found {len(hotel_items)} hotel elements")
        
        for item in hotel_items:
            try:
                hotel = self._parse_single_hotel(item, city)
                if hotel:
                    hotels.append(hotel)
            except Exception as e:
                logger.error(f"failed to parse a hotel entry: {e}")
                continue
        
        return hotels
    
    def _parse_single_hotel(self, item, city: str) -> Optional[Dict]:
        """
        Parse a single hotel entry.
        
        Args:
            item: BeautifulSoup element
            city: city name
            
        Returns:
            hotel info dict
        """
        try:
            # === 1. Hotel name and ID ===
            name_elem = item.select_one('div.hotel_item_name a') or item.select_one('h2.hotel_name a')
            if not name_elem:
                return None
            
            hotel_name = name_elem.text.strip()
            hotel_url = name_elem.get('href', '')
            
            # Extract the hotel ID from the URL
            hotel_id_match = re.search(r'/hotel/(\d+)\.html', hotel_url)
            hotel_id = hotel_id_match.group(1) if hotel_id_match else None
            
            # === 2. Address ===
            address_elem = item.select_one('div.hotel_item_htladdress span') or item.select_one('p.hotel_item_htladdress')
            address = address_elem.text.strip() if address_elem else ""
            
            # === 3. Price ===
            price_elem = item.select_one('span.J_price_lowList dfn') or item.select_one('span.price_num')
            price_text = price_elem.text.strip() if price_elem else "0"
            
            # Clean the price (strip non-numeric characters)
            price = self._extract_price(price_text)
            
            # === 4. Star rating ===
            star_elem = item.select_one('div.hotel_item_level span.star_level') or item.select_one('span.hotel_type')
            star_text = star_elem.text.strip() if star_elem else ""
            
            # Convert to a numeric rating
            star_num = self._parse_star_rating(star_text)
            
            # === 5. Review score ===
            score_elem = item.select_one('span.hotel_value') or item.select_one('span.score')
            score = float(score_elem.text.strip()) if score_elem else None
            
            # === 6. Image ===
            img_elem = item.select_one('img.hotel_img')
            image_url = ""
            if img_elem:
                image_url = img_elem.get('src') or img_elem.get('data-src') or ""
            
            # === Assemble the record ===
            hotel_data = {
                'platform': '携程',
                'hotel_id': hotel_id,
                'hotel_name': hotel_name,
                'city': city,
                'address': address,
                'star': star_text,
                'star_num': star_num,
                'price': price,
                'score': score,
                'hotel_url': f"https://hotels.ctrip.com{hotel_url}" if not hotel_url.startswith('http') else hotel_url,
                'image_url': image_url,
                'scraped_at': time.strftime('%Y-%m-%d %H:%M:%S')
            }
            
            logger.debug(f"✓ parsed: {hotel_name} - ¥{price}")
            
            return hotel_data
            
        except Exception as e:
            logger.error(f"error while parsing hotel: {e}", exc_info=True)
            return None
    
    def _extract_price(self, price_text: str) -> float:
        """
        Extract the numeric price.
        
        Args:
            price_text: raw price text (e.g. "¥1,280", "1280起")
            
        Returns:
            price as a float
        """
        # Strip everything except digits and the decimal point
        price_str = re.sub(r'[^\d.]', '', price_text)
        
        try:
            return float(price_str) if price_str else 0.0
        except ValueError:
            return 0.0
    
    def _parse_star_rating(self, star_text: str) -> int:
        """
        Parse the star rating.
        
        Args:
            star_text: rating text (e.g. "五星级/豪华型", "4星")
            
        Returns:
            star rating as an int
        """
        star_map = {
            '五星': 5, '5星': 5,
            '四星': 4, '4星': 4,
            '三星': 3, '3星': 3,
            '二星': 2, '2星': 2,
            '一星': 1, '1星': 1
        }
        
        for key, value in star_map.items():
            if key in star_text:
                return value
        
        # Fall back to any digit in the text
        match = re.search(r'(\d)', star_text)
        if match:
            return int(match.group(1))
        
        return 0
    
    def _has_next_page(self, html: str) -> bool:
        """
        Check whether a next page exists.
        
        Args:
            html: page HTML
            
        Returns:
            True if a next-page link is present
        """
        soup = BeautifulSoup(html, 'lxml')
        
        # Look for the "next page" link (the title attribute is the site's Chinese label)
        next_page = soup.select_one('a.c_page_next') or soup.select_one('a[title="下一页"]')
        
        return next_page is not None
    
    def get_hotel_detail(self, hotel_id: str, checkin_date: str, checkout_date: str) -> Optional[Dict]:
        """
        Fetch hotel details (including the live price).
        
        Args:
            hotel_id: hotel ID
            checkin_date: check-in date
            checkout_date: check-out date
            
        Returns:
            detail dict
            
        Note: requesting detail pages adds significant crawl time.
        """
        url = f"{self.base_url}/hotel/{hotel_id}.html"
        url += f"?checkin={checkin_date}&checkout={checkout_date}"
        
        logger.debug(f"requesting detail page: {url}")
        
        html = self._fetch_page(url)
        
        if not html:
            return None
        
        soup = BeautifulSoup(html, 'lxml')
        
        detail = {}
        
        try:
            # Extract extra fields
            # facilities
            facilities_elem = soup.select('div.base_info_content p')
            if facilities_elem:
                detail['facilities'] = [elem.text.strip() for elem in facilities_elem]
            
            # user reviews
            reviews_elem = soup.select('div.comment_list li')
            if reviews_elem:
                detail['review_count'] = len(reviews_elem)
            
            # more fields can be added here...
            
        except Exception as e:
            logger.error(f"failed to parse detail page: {e}")
        
        return detail
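The two cleaning helpers in this class, _extract_price and _parse_star_rating, are pure functions and can be sanity-checked outside the scraper. A self-contained sketch mirroring their logic:

```python
import re


def extract_price(price_text: str) -> float:
    """Strip currency symbols, separators, and suffixes like '起' (from)."""
    price_str = re.sub(r'[^\d.]', '', price_text)
    try:
        return float(price_str) if price_str else 0.0
    except ValueError:
        return 0.0


def parse_star_rating(star_text: str) -> int:
    """Map Chinese/numeric star labels to an int, falling back to any digit."""
    star_map = {
        '五星': 5, '5星': 5,
        '四星': 4, '4星': 4,
        '三星': 3, '3星': 3,
        '二星': 2, '2星': 2,
        '一星': 1, '1星': 1
    }
    for key, value in star_map.items():
        if key in star_text:
            return value
    match = re.search(r'(\d)', star_text)
    return int(match.group(1)) if match else 0


print(extract_price("¥1,280起"))          # 1280.0
print(parse_star_rating("五星级/豪华型"))  # 5
print(parse_star_rating("4星"))            # 4
```

Keeping these helpers side-effect free makes it cheap to add regression tests whenever the sites change their price or rating markup.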

8️⃣ Qunar Scraper Implementation

Qunar hotel scraper (scrapers/qunar_scraper.py)

"""
Qunar scraper.
Key techniques:
1. Selenium browser automation
2. Handling the slider captcha
3. Waiting on dynamic pages
4. Anti-detection tactics
"""

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth
import undetected_chromedriver as uc
import time
import random
import logging
import re
from typing import List, Dict, Optional
from config import *

logger = logging.getLogger(__name__)

class QunarScraper:
    """
    Qunar scraper.
    
    Core features:
    1. Hotel search (Selenium)
    2. Handling dynamically loaded content
    3. Evading bot detection
    """
    
    def __init__(self, headless: bool = True):
        """
        Initialize the Qunar scraper.
        
        Args:
            headless: run the browser headless
        """
        self.headless = headless
        self.driver = None
        self._init_driver()
        
        logger.info(f"Qunar scraper initialized: headless={headless}")
    
    def _init_driver(self):
        """
        Initialize the Chrome WebDriver.
        
        Uses undetected_chromedriver to slip past Cloudflare-style checks.
        """
        try:
            # Configure Chrome options
            chrome_options = uc.ChromeOptions()
            
            # Apply the configured arguments
            for option in QUNAR_CONFIG['chrome_options']:
                if option == '--headless' and not self.headless:
                    continue  # skip headless in debug mode
                chrome_options.add_argument(option)
            
            # Hide automation flags
            chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
            chrome_options.add_experimental_option('useAutomationExtension', False)
            
            # Launch via undetected_chromedriver
            self.driver = uc.Chrome(options=chrome_options)
            
            # Apply the stealth patches
            stealth(
                self.driver,
                languages=["zh-CN", "zh"],
                vendor="Google Inc.",
                platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine",
                fix_hairline=True,
            )
            
            # Implicit wait
            self.driver.implicitly_wait(10)
            
            logger.info("✓ WebDriver ready")
            
        except Exception as e:
            logger.error(f"WebDriver initialization failed: {e}")
            raise
    
    def search_hotels(self, city: str, checkin_date: str, 
                     checkout_date: str, keywords: str = "",
                     min_star: int = 0, max_results: int = 50) -> List[Dict]:
        """
        搜索酒店
        
        Args:
            city: 城市名称
            checkin_date: 入住日期(YYYY-MM-DD)
            checkout_date: 退房日期
            keywords: 酒店关键词
            min_star: 最低星级
            max_results: 最多返回结果数
            
        Returns:
            酒店列表
            
        去哪儿酒店URL:
        https://hotel.qunar.com/city/beijing/q-{keywords}
        https://hotel.qunar.com/city/beijing/dt-{checkin}?tag={checkout}
        """
        logger.info(f"=" * 60)
        logger.info(f"搜索去哪儿酒店: {city} | {checkin_date} → {checkout_date}")
        if keywords:
            logger.info(f"关键词: {keywords}")
        logger.info(f"=" * 60)
        
        all_hotels = []
        
        try:
            # 构造搜索URL
            base_url = f"https://hotel.qunar.com/city/{city.lower()}"
            
            if keywords:
                url = f"{base_url}/q-{keywords}"
            else:
                url = base_url
            
            # 添加日期参数
            url += f"/dt-{checkin_date.replace('-', '')}/?tag={checkout_date.replace('-', '')}"
            
            logger.info(f"访问URL: {url}")
            
            # 访问页面
            self.driver.get(url)
            
            # 等待页面加载
            time.sleep(random.uniform(3, 5))
            
            # 检查是否需要滑块验证
            if self._check_captcha():
                logger.warning("检测到滑块验证码,请手动完成...")
                input("完成验证后按Enter继续...")
            
            # 滚动加载更多内容
            self._scroll_to_load()
            
            # 解析酒店列表
            hotels = self._parse_hotel_list(city)
            
            # 过滤星级
            if min_star > 0:
                hotels = [h for h in hotels if h.get('star_num', 0) >= min_star]
            
            # 限制结果数
            all_hotels = hotels[:max_results]
            
            logger.info(f"✓ 搜索完成: 共 {len(all_hotels)} 家酒店")
            
        except Exception as e:
            logger.error(f"搜索过程出错: {e}", exc_info=True)
            
            # 出错截图
            screenshot_path = os.path.join(LOG_DIR, f'qunar_error_{int(time.time())}.png')
            self.driver.save_screenshot(screenshot_path)
            logger.info(f"错误截图: {screenshot_path}")
        
        logger.info(f"=" * 60 + "\n")
        
        return all_hotels
    
    def _check_captcha(self) -> bool:
        """检查是否有验证码"""
        try:
            captcha_elem = self.driver.find_elements(By.CLASS_NAME, "qn-captcha")
            return len(captcha_elem) > 0
        except Exception:
            return False
    
    def _scroll_to_load(self):
        """滚动页面以加载所有内容"""
        try:
            # 获取初始高度
            last_height = self.driver.execute_script("return document.body.scrollHeight")
            
            scroll_count = 0
            max_scrolls = 5  # 最多滚动5次
            
            while scroll_count < max_scrolls:
                # 滚动到底部
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                
                # 等待加载
                time.sleep(2)
                
                # 计算新高度
                new_height = self.driver.execute_script("return document.body.scrollHeight")
                
                if new_height == last_height:
                    break
                
                last_height = new_height
                scroll_count += 1
            
            logger.debug(f"滚动加载完成,共滚动{scroll_count}次")
            
        except Exception as e:
            logger.error(f"滚动失败: {e}")
    
    def _parse_hotel_list(self, city: str) -> List[Dict]:
        """
        解析酒店列表
        
        去哪儿HTML结构(可能变化):
        <div class="item_wrap">
            <div class="item_title">
                <a>北京国贸大酒店</a>
            </div>
            <div class="item_addr">
                <span>朝阳区建国门外大街1号</span>
            </div>
            <div class="item_price">
                <span class="price_txt">¥</span>
                <span class="price_num">1280</span>
            </div>
            <div class="item_comment">
                <span class="score">4.8</span>
            </div>
        </div>
        """
        hotels = []
        
        try:
            # 等待酒店列表加载
            wait = WebDriverWait(self.driver, 15)
            wait.until(EC.presence_of_element_located((By.CLASS_NAME, "item_wrap")))
            
            # 获取所有酒店元素
            hotel_items = self.driver.find_elements(By.CLASS_NAME, "item_wrap")
            
            logger.debug(f"找到 {len(hotel_items)} 个酒店元素")
            
            for item in hotel_items:
                try:
                    hotel = self._parse_single_hotel(item, city)
                    if hotel:
                        hotels.append(hotel)
                except Exception as e:
                    logger.error(f"解析单个酒店失败: {e}")
                    continue
            
        except Exception as e:
            logger.error(f"解析酒店列表失败: {e}")
        
        return hotels
    
    def _parse_single_hotel(self, item, city: str) -> Optional[Dict]:
        """
        解析单个酒店
        
        Args:
            item: WebElement对象
            city: 城市名称
            
        Returns:
            酒店信息字典
        """
        try:
            # === 1. 酒店名称和链接 ===
            title_elem = item.find_element(By.CLASS_NAME, "item_title").find_element(By.TAG_NAME, "a")
            hotel_name = title_elem.text.strip()
            hotel_url = title_elem.get_attribute('href')
            
            # 提取酒店ID
            hotel_id_match = re.search(r'/(\d+)', hotel_url)
            hotel_id = hotel_id_match.group(1) if hotel_id_match else None
            
            # === 2. 地址 ===
            addr_elem = item.find_elements(By.CLASS_NAME, "item_addr")
            address = addr_elem[0].text.strip() if addr_elem else ""
            
            # === 3. 价格 ===
            price_elem = item.find_elements(By.CLASS_NAME, "price_num")
            price = 0.0
            if price_elem:
                price_text = price_elem[0].text.strip()
                price = float(re.sub(r'[^\d.]', '', price_text)) if price_text else 0.0
            
            # === 4. 星级 ===
            star_elem = item.find_elements(By.CLASS_NAME, "item_level")
            star_text = star_elem[0].text.strip() if star_elem else ""
            star_num = self._parse_star(star_text)
            
            # === 5. 评分 ===
            score_elem = item.find_elements(By.CLASS_NAME, "score")
            score = None
            if score_elem:
                try:
                    score = float(score_elem[0].text.strip())
                except ValueError:
                    pass
            
            # === 6. 图片 ===
            img_elem = item.find_elements(By.TAG_NAME, "img")
            image_url = ""
            if img_elem:
                image_url = img_elem[0].get_attribute('src') or img_elem[0].get_attribute('data-src') or ""
            
            # === 组装数据 ===
            hotel_data = {
                'platform': '去哪儿',
                'hotel_id': hotel_id,
                'hotel_name': hotel_name,
                'city': city,
                'address': address,
                'star': star_text,
                'star_num': star_num,
                'price': price,
                'score': score,
                'hotel_url': hotel_url,
                'image_url': image_url,
                'scraped_at': time.strftime('%Y-%m-%d %H:%M:%S')
            }
            
            logger.debug(f"✓ 解析成功: {hotel_name} - ¥{price}")
            
            return hotel_data
            
        except Exception as e:
            logger.error(f"解析酒店时出错: {e}", exc_info=True)
            return None
    
    def _parse_star(self, star_text: str) -> int:
        """解析星级"""
        star_map = {
            '豪华': 5, '五星': 5, '5星': 5,
            '高档': 4, '四星': 4, '4星': 4,
            '舒适': 3, '三星': 3, '3星': 3,
        }
        
        for key, value in star_map.items():
            if key in star_text:
                return value
        
        return 0
    
    def close(self):
        """关闭浏览器"""
        if self.driver:
            self.driver.quit()
            logger.info("浏览器已关闭")

9️⃣ 数据清洗与存储

数据清洗模块(data/cleaner.py)

python
"""
数据清洗模块
处理酒店/机票数据的标准化
"""

import pandas as pd
import re
import logging
from typing import List, Dict
from config import CLEAN_CONFIG

logger = logging.getLogger(__name__)

class TravelDataCleaner:
    """旅游数据清洗器"""
    
    def clean_hotel_data(self, hotels: List[Dict]) -> pd.DataFrame:
        """
        清洗酒店数据
        
        Args:
            hotels: 酒店字典列表
            
        Returns:
            清洗后的DataFrame
        """
        logger.info(f"开始清洗酒店数据,原始记录数: {len(hotels)}")
        
        if not hotels:
            return pd.DataFrame()
        
        df = pd.DataFrame(hotels)
        
        # === 1. 去重 ===
        df = self._remove_duplicates(df)
        
        # === 2. 清洗价格 ===
        df = self._clean_price(df, 'hotel')
        
        # === 3. 标准化星级 ===
        df = self._normalize_star(df)
        
        # === 4. 处理缺失值 ===
        df = self._handle_missing(df)
        
        # === 5. 添加衍生字段 ===
        df = self._add_derived_fields(df)
        
        logger.info(f"清洗完成,最终记录数: {len(df)}")
        
        return df
    
    def _remove_duplicates(self, df: pd.DataFrame) -> pd.DataFrame:
        """去重"""
        original_count = len(df)
        
        # 基于hotel_id + platform去重
        if 'hotel_id' in df.columns and 'platform' in df.columns:
            df = df.drop_duplicates(subset=['hotel_id', 'platform'], keep='first')
        
        logger.info(f"去重: {original_count} → {len(df)}")
        
        return df
    
    def _clean_price(self, df: pd.DataFrame, data_type: str) -> pd.DataFrame:
        """清洗价格"""
        if 'price' not in df.columns:
            return df
        
        # 转换为数值
        df['price'] = pd.to_numeric(df['price'], errors='coerce')
        
        # 过滤异常值
        price_range = CLEAN_CONFIG[f'{data_type}_price_range']
        before = len(df)
        df = df[df['price'].between(*price_range) | df['price'].isna()]
        
        if before != len(df):
            logger.info(f"过滤价格异常值: {before} → {len(df)}")
        
        return df
    
    def _normalize_star(self, df: pd.DataFrame) -> pd.DataFrame:
        """标准化星级"""
        if 'star_num' in df.columns:
            df['star_num'] = df['star_num'].fillna(0).astype(int)
        
        return df
    
    def _handle_missing(self, df: pd.DataFrame) -> pd.DataFrame:
        """处理缺失值"""
        fill_values = CLEAN_CONFIG['fill_values']
        df = df.fillna(fill_values)
        
        return df
    
    def _add_derived_fields(self, df: pd.DataFrame) -> pd.DataFrame:
        """添加衍生字段"""
        
        # 价格区间分类
        if 'price' in df.columns:
            df['price_level'] = pd.cut(
                df['price'], 
                bins=[0, 300, 600, 1000, float('inf')],
                labels=['经济', '舒适', '高档', '豪华']
            )
        
        # 性价比(评分/价格;价格为0时记为缺失,避免除零产生inf)
        if 'price' in df.columns and 'score' in df.columns:
            safe_price = df['price'].replace(0, float('nan'))
            df['value_score'] = (df['score'] / safe_price * 1000).round(2)
        
        return df
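
上面这套清洗流程(去重 → 价格数值化 → 异常值过滤 → 分档)可以用一个最小示例直观验证。下面的数据与价格区间(50~10000元)均为假设样例,不依赖 config,仅演示 pandas 各步骤的实际效果:

```python
import pandas as pd

# 假设的原始抓取结果:含重复记录、脏价格、离谱异常值
raw = [
    {'platform': '去哪儿', 'hotel_id': '1001', 'hotel_name': 'A酒店', 'price': '680'},
    {'platform': '去哪儿', 'hotel_id': '1001', 'hotel_name': 'A酒店', 'price': '680'},   # 同平台重复
    {'platform': '携程',   'hotel_id': '1001', 'hotel_name': 'A酒店', 'price': '720'},
    {'platform': '去哪儿', 'hotel_id': '1002', 'hotel_name': 'B酒店', 'price': 'abc'},   # 脏价格
    {'platform': '去哪儿', 'hotel_id': '1003', 'hotel_name': 'C酒店', 'price': '99999'}, # 异常值
]
df = pd.DataFrame(raw)

# 1. 按 hotel_id + platform 去重(携程与去哪儿的同一酒店各保留一条)
df = df.drop_duplicates(subset=['hotel_id', 'platform'], keep='first')

# 2. 价格转数值('abc' 变为 NaN),再按假设区间过滤异常值,缺失值保留待填充
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df[df['price'].between(50, 10000) | df['price'].isna()]

# 3. 价格分档(与正文 _add_derived_fields 的 bins 一致)
df['price_level'] = pd.cut(df['price'], bins=[0, 300, 600, 1000, float('inf')],
                           labels=['经济', '舒适', '高档', '豪华'])

print(df[['platform', 'hotel_name', 'price', 'price_level']])
```

三步之后剩 3 条记录:重复行与 99999 的异常行被剔除,脏价格变为缺失值留给后续 `fillna` 处理。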

存储模块(data/storage.py)

python
"""
数据存储模块
支持SQLite、CSV、Excel
"""

import sqlite3
import pandas as pd
import os
import logging
from datetime import datetime
from typing import List, Dict, Optional
from config import DATABASE_PATH, OUTPUT_DIR

logger = logging.getLogger(__name__)

class TravelDataStorage:
    """旅游数据存储管理器"""
    
    def __init__(self):
        """初始化存储管理器"""
        self.db_path = DATABASE_PATH
        self._init_database()
        
        logger.info(f"数据存储初始化完成: {self.db_path}")
    
    def _init_database(self):
        """初始化数据库表"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        # 创建酒店价格表
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS hotel_prices (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                platform TEXT NOT NULL,
                hotel_id TEXT,
                hotel_name TEXT NOT NULL,
                city TEXT,
                address TEXT,
                star TEXT,
                star_num INTEGER,
                price REAL,
                score REAL,
                hotel_url TEXT,
                image_url TEXT,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(hotel_id, platform, scraped_at)
            )
        ''')
        
        # 创建机票价格表
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS flight_prices (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                platform TEXT NOT NULL,
                flight_no TEXT,
                airline TEXT,
                departure TEXT,
                arrival TEXT,
                dep_time TEXT,
                arr_time TEXT,
                price REAL,
                cabin_class TEXT,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        
        # 创建价格告警表
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS price_alerts (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                alert_type TEXT,
                product_type TEXT,
                product_id TEXT,
                product_name TEXT,
                old_price REAL,
                new_price REAL,
                message TEXT,
                sent_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        
        conn.commit()
        conn.close()
        
        logger.info("✓ 数据库表初始化完成")
    
    def save_hotels(self, df: pd.DataFrame) -> int:
        """保存酒店数据到数据库"""
        if df.empty:
            return 0
        
        conn = sqlite3.connect(self.db_path)
        
        # 追加写入,保留每次抓取的历史价格(完全重复的记录由表上的UNIQUE约束拦截)
        df.to_sql('hotel_prices', conn, if_exists='append', index=False)
        
        conn.close()
        
        logger.info(f"✓ 已保存 {len(df)} 条酒店数据")
        
        return len(df)
    
    def export_to_csv(self, df: pd.DataFrame, filename: str) -> str:
        """导出为CSV"""
        filepath = os.path.join(OUTPUT_DIR, filename)
        df.to_csv(filepath, index=False, encoding='utf-8-sig')
        
        logger.info(f"✓ CSV已导出: {filepath}")
        
        return filepath
    
    def get_price_history(self, hotel_id: str, platform: str, days: int = 30) -> pd.DataFrame:
        """获取价格历史"""
        conn = sqlite3.connect(self.db_path)
        
        query = '''
            SELECT * FROM hotel_prices
            WHERE hotel_id = ? AND platform = ?
            AND scraped_at >= datetime('now', ?)
            ORDER BY scraped_at DESC
        '''
        
        # days 同样走参数绑定,避免在SQL中拼接字符串
        df = pd.read_sql_query(query, conn, params=(hotel_id, platform, f'-{days} days'))
        
        conn.close()
        
        return df
    
    def compare_platforms(self, hotel_name: str) -> pd.DataFrame:
        """对比不同平台的价格"""
        conn = sqlite3.connect(self.db_path)
        
        query = '''
            SELECT platform, hotel_name, MIN(price) as min_price, 
                   MAX(price) as max_price, AVG(price) as avg_price
            FROM hotel_prices
            WHERE hotel_name LIKE ?
            GROUP BY platform, hotel_name
        '''
        
        df = pd.read_sql_query(query, conn, params=(f'%{hotel_name}%',))
        
        conn.close()
        
        return df
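
`compare_platforms` 的聚合逻辑可以脱离真实数据库、用内存 SQLite 快速验证。下面的表结构只保留正文建表语句中与该查询相关的列,数据为假设样例:

```python
import sqlite3

# 内存库 + 简化版 hotel_prices 表(仅演示聚合查询所需的列)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE hotel_prices (platform TEXT, hotel_name TEXT, price REAL)')
conn.executemany(
    'INSERT INTO hotel_prices VALUES (?, ?, ?)',
    [('携程', '北京国贸大酒店', 1280.0),
     ('携程', '北京国贸大酒店', 1180.0),
     ('去哪儿', '北京国贸大酒店', 1080.0)]
)

# 与正文 compare_platforms 相同的参数化模糊查询
query = '''
    SELECT platform, hotel_name, MIN(price), MAX(price), AVG(price)
    FROM hotel_prices
    WHERE hotel_name LIKE ?
    GROUP BY platform, hotel_name
'''
results = {row[0]: row for row in conn.execute(query, ('%国贸%',))}
conn.close()

print(results['携程'])    # 携程的最低/最高/均价
print(results['去哪儿'])
```

可以看到跨平台比价的核心就是一条 `GROUP BY platform` 的聚合,后续画趋势图、发告警都基于这类查询结果。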

🔟 任务调度与告警

任务调度器(scheduler/monitor_scheduler.py)

python
"""
任务调度模块
定时执行价格监控任务
"""

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.events import EVENT_JOB_EXECUTED, EVENT_JOB_ERROR
import logging
import pandas as pd
from datetime import datetime
from typing import Dict
from scrapers.ctrip_scraper import CtripScraper
from scrapers.qunar_scraper import QunarScraper
from data.cleaner import TravelDataCleaner
from data.storage import TravelDataStorage
from notification.notifier import PriceNotifier
from config import *

logger = logging.getLogger(__name__)

class TravelMonitorScheduler:
    """旅游价格监控调度器"""
    
    def __init__(self):
        """初始化调度器"""
        self.scheduler = BlockingScheduler()
        self.cleaner = TravelDataCleaner()
        self.storage = TravelDataStorage()
        self.notifier = PriceNotifier()
        
        # 添加任务
        self._add_jobs()
        
        logger.info("监控调度器初始化完成")
    
    def _add_jobs(self):
        """添加定时任务"""
        
        # 主监控任务
        self.scheduler.add_job(
            func=self.monitor_all_tasks,
            **SCHEDULE_CONFIG['main_monitor'],
            id='main_monitor',
            name='主监控任务'
        )
        
        logger.info("✓ 定时任务已添加")
    
    def monitor_all_tasks(self):
        """执行所有监控任务"""
        logger.info("=" * 80)
        logger.info(f"开始执行监控任务 - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        logger.info("=" * 80)
        
        # 监控酒店
        for task in MONITOR_TASKS.get('hotels', []):
            if task.get('enabled', True):
                self.monitor_hotel_task(task)
        
        # 监控机票
        for task in MONITOR_TASKS.get('flights', []):
            if task.get('enabled', True):
                self.monitor_flight_task(task)
        
        logger.info("=" * 80)
        logger.info("监控任务执行完成")
        logger.info("=" * 80)
    
    def monitor_hotel_task(self, task: Dict):
        """执行单个酒店监控任务"""
        logger.info(f"\n>>> 监控任务: {task['name']}")
        
        try:
            # 选择爬虫
            if task['platform'] == 'ctrip':
                scraper = CtripScraper()
                hotels = scraper.search_hotels(
                    city=task['city'],
                    checkin_date=task['checkin_date'],
                    checkout_date=task['checkout_date'],
                    keywords=task.get('keywords', ''),
                    min_star=task.get('min_star', 0)
                )
            else:
                scraper = QunarScraper(headless=True)
                hotels = scraper.search_hotels(
                    city=task['city'],
                    checkin_date=task['checkin_date'],
                    checkout_date=task['checkout_date'],
                    keywords=task.get('keywords', ''),
                    min_star=task.get('min_star', 0)
                )
                scraper.close()
            
            if not hotels:
                logger.warning("未获取到数据")
                return
            
            # 清洗数据
            df = self.cleaner.clean_hotel_data(hotels)
            
            # 保存数据
            self.storage.save_hotels(df)
            
            # 检查告警
            self._check_alerts(df, task)
            
            logger.info(f"✓ 任务完成: {task['name']}")
            
        except Exception as e:
            logger.error(f"任务执行失败: {e}", exc_info=True)
    
    def monitor_flight_task(self, task: Dict):
        """执行机票监控任务"""
        # TODO: 实现机票监控
        pass
    
    def _check_alerts(self, df: pd.DataFrame, task: Dict):
        """检查是否需要告警"""
        threshold = task.get('alert_threshold', 0)
        
        if threshold <= 0:
            return
        
        # 筛选低于阈值的酒店
        low_price_hotels = df[df['price'] <= threshold]
        
        if not low_price_hotels.empty:
            logger.info(f"🎉 发现 {len(low_price_hotels)} 家低价酒店")
            
            # 发送告警
            for _, hotel in low_price_hotels.iterrows():
                message = f"""
价格告警:{hotel['hotel_name']}

当前价格:¥{hotel['price']}
告警阈值:¥{threshold}
星级:{hotel['star']}
地址:{hotel['address']}
链接:{hotel['hotel_url']}
"""
                self.notifier.send_alert(
                    subject=f"酒店降价提醒:{hotel['hotel_name']}",
                    message=message
                )
    
    def start(self):
        """启动调度器"""
        try:
            logger.info("=" * 80)
            logger.info("监控调度器启动")
            logger.info("=" * 80)
            
            self.scheduler.start()
            
        except (KeyboardInterrupt, SystemExit):
            logger.info("调度器已停止")
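
`_add_jobs` 里通过 `**SCHEDULE_CONFIG['main_monitor']` 把配置解包传给 `add_job`,因此 config.py 中对应的字典必须是合法的 APScheduler 触发器参数。下面是一种可能的写法(取值为假设示例,并非项目原配置):

```python
# 每天 08:00 与 20:00 各执行一次主监控任务,符合前文"低频礼貌爬取"的约束
SCHEDULE_CONFIG = {
    'main_monitor': {
        'trigger': 'cron',   # 也可换成 'interval' 触发器,如 hours=12
        'hour': '8,20',
        'minute': 0,
    },
}

# 解包后等价于:
# scheduler.add_job(func=..., trigger='cron', hour='8,20', minute=0,
#                   id='main_monitor', name='主监控任务')
```

这样调整监控频率只需改配置,不用动调度器代码。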

通知模块(notification/notifier.py)

python
"""
价格告警通知模块
支持邮件、企业微信
"""

import yagmail
import requests
import logging
from config import ALERT_CONFIG

logger = logging.getLogger(__name__)

class PriceNotifier:
    """价格告警通知器"""
    
    def __init__(self):
        """初始化通知器"""
        self.email_enabled = ALERT_CONFIG['method'] in ['email', 'both']
        self.wechat_enabled = ALERT_CONFIG['method'] in ['wechat', 'both']
        
        if self.email_enabled:
            self._init_email()
        
        logger.info(f"通知器初始化完成: 邮件={self.email_enabled}, 微信={self.wechat_enabled}")
    
    def _init_email(self):
        """初始化邮件客户端"""
        try:
            self.yag = yagmail.SMTP(
                user=ALERT_CONFIG['email']['sender'],
                password=ALERT_CONFIG['email']['password'],
                host=ALERT_CONFIG['email']['smtp_server'],
                port=ALERT_CONFIG['email']['smtp_port']
            )
            logger.info("✓ 邮件客户端初始化成功")
        except Exception as e:
            logger.error(f"邮件客户端初始化失败: {e}")
            self.email_enabled = False
    
    def send_alert(self, subject: str, message: str):
        """发送告警"""
        if self.email_enabled:
            self._send_email(subject, message)
        
        if self.wechat_enabled:
            self._send_wechat(message)
    
    def _send_email(self, subject: str, message: str):
        """发送邮件"""
        try:
            self.yag.send(
                to=ALERT_CONFIG['email']['receivers'],
                subject=subject,
                contents=message
            )
            logger.info(f"✓ 邮件已发送: {subject}")
        except Exception as e:
            logger.error(f"邮件发送失败: {e}")
    
    def _send_wechat(self, message: str):
        """发送企业微信"""
        try:
            webhook_url = ALERT_CONFIG['wechat']['webhook_url']
            data = {
                "msgtype": "text",
                "text": {
                    "content": message
                }
            }
            response = requests.post(webhook_url, json=data, timeout=10)
            if response.json().get('errcode') == 0:
                logger.info("✓ 企业微信消息已发送")
            else:
                logger.error(f"企业微信发送失败: {response.text}")
        except Exception as e:
            logger.error(f"企业微信发送异常: {e}")

1️⃣1️⃣ 主程序与运行示例

主程序(main.py)

python
"""
旅游价格监控主程序
"""

import logging
import logging.config
from config import LOGGING_CONFIG
from scheduler.monitor_scheduler import TravelMonitorScheduler

# 配置日志
logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)

def main():
    """主函数"""
    try:
        # 创建并启动调度器
        scheduler = TravelMonitorScheduler()
        scheduler.start()
        
    except KeyboardInterrupt:
        logger.info("用户手动中断")
    except Exception as e:
        logger.error(f"程序异常: {e}", exc_info=True)

if __name__ == "__main__":
    main()

运行命令

bash
# 1. 启动完整监控系统
python main.py

# 2. 单次测试(无需等待定时任务)
python -c "
from scheduler.monitor_scheduler import TravelMonitorScheduler
scheduler = TravelMonitorScheduler()
scheduler.monitor_all_tasks()
"

# 3. 查询价格历史
python -c "
from data.storage import TravelDataStorage
storage = TravelDataStorage()
df = storage.get_price_history('hotel_123', 'ctrip', days=30)
print(df)
"

8️⃣ 去哪儿爬虫实现(Qunar Scraper)

去哪儿酒店爬虫(scrapers/qunar_scraper.py)

python 复制代码
"""
去哪儿爬虫实现
技术要点:
1. Selenium浏览器自动化
2. 处理滑块验证码
3. 动态页面等待
4. 反检测策略
"""

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth
import undetected_chromedriver as uc
import time
import random
import logging
import re
from typing import List, Dict, Optional
from config import *

logger = logging.getLogger(__name__)

class QunarScraper:
    """
    去哪儿爬虫类
    
    核心功能:
    1. 酒店搜索(Selenium)
    2. 处理动态加载
    3. 绕过反爬检测
    """
    
    def __init__(self, headless: bool = True):
        """
        初始化去哪儿爬虫
        
        Args:
            headless: 是否无头模式
        """
        self.headless = headless
        self.driver = None
        self._init_driver()
        
        logger.info(f"去哪儿爬虫初始化完成: 无头模式={headless}")
    
    def _init_driver(self):
        """
        初始化Chrome WebDriver
        
        使用undetected_chromedriver绕过Cloudflare等检测
        """
        try:
            # 配置Chrome选项
            chrome_options = uc.ChromeOptions()
            
            # 添加配置参数
            for option in QUNAR_CONFIG['chrome_options']:
                if option == '--headless' and not self.headless:
                    continue  # 调试模式跳过无头
                chrome_options.add_argument(option)
            
            # 禁用自动化标志
            chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
            chrome_options.add_experimental_option('useAutomationExtension', False)
            
            # 使用undetected_chromedriver
            self.driver = uc.Chrome(options=chrome_options)
            
            # 应用stealth插件
            stealth(
                self.driver,
                languages=["zh-CN", "zh"],
                vendor="Google Inc.",
                platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine",
                fix_hairline=True,
            )
            
            # 设置隐式等待
            self.driver.implicitly_wait(10)
            
            logger.info("✓ WebDriver初始化成功")
            
        except Exception as e:
            logger.error(f"WebDriver初始化失败: {e}")
            raise
    
    def search_hotels(self, city: str, checkin_date: str, 
                     checkout_date: str, keywords: str = "",
                     min_star: int = 0, max_results: int = 50) -> List[Dict]:
        """
        搜索酒店
        
        Args:
            city: 城市名称
            checkin_date: 入住日期(YYYY-MM-DD)
            checkout_date: 退房日期
            keywords: 酒店关键词
            min_star: 最低星级
            max_results: 最多返回结果数
            
        Returns:
            酒店列表
            
        去哪儿酒店URL:
        https://hotel.qunar.com/city/beijing/q-{keywords}
        https://hotel.qunar.com/city/beijing/dt-{checkin}?tag={checkout}
        """
        logger.info(f"=" * 60)
        logger.info(f"搜索去哪儿酒店: {city} | {checkin_date} → {checkout_date}")
        if keywords:
            logger.info(f"关键词: {keywords}")
        logger.info(f"=" * 60)
        
        all_hotels = []
        
        try:
            # 构造搜索URL
            base_url = f"https://hotel.qunar.com/city/{city.lower()}"
            
            if keywords:
                url = f"{base_url}/q-{keywords}"
            else:
                url = base_url
            
            # 添加日期参数
            url += f"/dt-{checkin_date.replace('-', '')}/?tag={checkout_date.replace('-', '')}"
            
            logger.info(f"访问URL: {url}")
            
            # 访问页面
            self.driver.get(url)
            
            # 等待页面加载
            time.sleep(random.uniform(3, 5))
            
            # 检查是否需要滑块验证
            if self._check_captcha():
                logger.warning("检测到滑块验证码,请手动完成...")
                input("完成验证后按Enter继续...")
            
            # 滚动加载更多内容
            self._scroll_to_load()
            
            # 解析酒店列表
            hotels = self._parse_hotel_list(city)
            
            # 过滤星级
            if min_star > 0:
                hotels = [h for h in hotels if h.get('star_num', 0) >= min_star]
            
            # 限制结果数
            all_hotels = hotels[:max_results]
            
            logger.info(f"✓ 搜索完成: 共 {len(all_hotels)} 家酒店")
            
        except Exception as e:
            logger.error(f"搜索过程出错: {e}", exc_info=True)
            
            # 出错截图
            screenshot_path = os.path.join(LOG_DIR, f'qunar_error_{int(time.time())}.png')
            self.driver.save_screenshot(screenshot_path)
            logger.info(f"错误截图: {screenshot_path}")
        
        logger.info(f"=" * 60 + "\n")
        
        return all_hotels
    
    def _check_captcha(self) -> bool:
        """检查是否有验证码"""
        try:
            captcha_elem = self.driver.find_elements(By.CLASS_NAME, "qn-captcha")
            return len(captcha_elem) > 0
        except:
            return False
    
    def _scroll_to_load(self):
        """滚动页面以加载所有内容"""
        try:
            # 获取初始高度
            last_height = self.driver.execute_script("return document.body.scrollHeight")
            
            scroll_count = 0
            max_scrolls = 5  # 最多滚动5次
            
            while scroll_count < max_scrolls:
                # 滚动到底部
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                
                # 等待加载
                time.sleep(2)
                
                # 计算新高度
                new_height = self.driver.execute_script("return document.body.scrollHeight")
                
                if new_height == last_height:
                    break
                
                last_height = new_height
                scroll_count += 1
            
            logger.debug(f"滚动加载完成,共滚动{scroll_count}次")
            
        except Exception as e:
            logger.error(f"滚动失败: {e}")
    
    def _parse_hotel_list(self, city: str) -> List[Dict]:
        """
        解析酒店列表
        
        去哪儿HTML结构(可能变化):
        <div class="item_wrap">
            <div class="item_title">
                <a>北京国贸大酒店</a>
            </div>
            <div class="item_addr">
                <span>朝阳区建国门外大街1号</span>
            </div>
            <div class="item_price">
                <span class="price_txt">¥</span>
                <span class="price_num">1280</span>
            </div>
            <div class="item_comment">
                <span class="score">4.8</span>
            </div>
        </div>
        """
        hotels = []
        
        try:
            # 等待酒店列表加载
            wait = WebDriverWait(self.driver, 15)
            wait.until(EC.presence_of_element_located((By.CLASS_NAME, "item_wrap")))
            
            # 获取所有酒店元素
            hotel_items = self.driver.find_elements(By.CLASS_NAME, "item_wrap")
            
            logger.debug(f"找到 {len(hotel_items)} 个酒店元素")
            
            for item in hotel_items:
                try:
                    hotel = self._parse_single_hotel(item, city)
                    if hotel:
                        hotels.append(hotel)
                except Exception as e:
                    logger.error(f"解析单个酒店失败: {e}")
                    continue
            
        except Exception as e:
            logger.error(f"解析酒店列表失败: {e}")
        
        return hotels
    
    def _parse_single_hotel(self, item, city: str) -> Optional[Dict]:
        """
        解析单个酒店
        
        Args:
            item: WebElement对象
            city: 城市名称
            
        Returns:
            酒店信息字典
        """
        try:
            # === 1. 酒店名称和链接 ===
            title_elem = item.find_element(By.CLASS_NAME, "item_title").find_element(By.TAG_NAME, "a")
            hotel_name = title_elem.text.strip()
            hotel_url = title_elem.get_attribute('href')
            
            # 提取酒店ID
            hotel_id_match = re.search(r'/(\d+)', hotel_url)
            hotel_id = hotel_id_match.group(1) if hotel_id_match else None
            
            # === 2. 地址 ===
            addr_elem = item.find_elements(By.CLASS_NAME, "item_addr")
            address = addr_elem[0].text.strip() if addr_elem else ""
            
            # === 3. 价格 ===
            price_elem = item.find_elements(By.CLASS_NAME, "price_num")
            price = 0.0
            if price_elem:
                price_text = price_elem[0].text.strip()
                price = float(re.sub(r'[^\d.]', '', price_text)) if price_text else 0.0
            
            # === 4. 星级 ===
            star_elem = item.find_elements(By.CLASS_NAME, "item_level")
            star_text = star_elem[0].text.strip() if star_elem else ""
            star_num = self._parse_star(star_text)
            
            # === 5. 评分 ===
            score_elem = item.find_elements(By.CLASS_NAME, "score")
            score = None
            if score_elem:
                try:
                    score = float(score_elem[0].text.strip())
                except (ValueError, TypeError):
                    # 评分缺失或非数字时保持为 None
                    pass
            
            # === 6. 图片 ===
            img_elem = item.find_elements(By.TAG_NAME, "img")
            image_url = ""
            if img_elem:
                image_url = img_elem[0].get_attribute('src') or img_elem[0].get_attribute('data-src') or ""
            
            # === 组装数据 ===
            hotel_data = {
                'platform': '去哪儿',
                'hotel_id': hotel_id,
                'hotel_name': hotel_name,
                'city': city,
                'address': address,
                'star': star_text,
                'star_num': star_num,
                'price': price,
                'score': score,
                'hotel_url': hotel_url,
                'image_url': image_url,
                'scraped_at': time.strftime('%Y-%m-%d %H:%M:%S')
            }
            
            logger.debug(f"✓ 解析成功: {hotel_name} - ¥{price}")
            
            return hotel_data
            
        except Exception as e:
            logger.error(f"解析酒店时出错: {e}", exc_info=True)
            return None
    
    def _parse_star(self, star_text: str) -> int:
        """解析星级"""
        star_map = {
            '豪华': 5, '五星': 5, '5星': 5,
            '高档': 4, '四星': 4, '4星': 4,
            '舒适': 3, '三星': 3, '3星': 3,
        }
        
        for key, value in star_map.items():
            if key in star_text:
                return value
        
        return 0
    
    def close(self):
        """关闭浏览器"""
        if self.driver:
            self.driver.quit()
            logger.info("浏览器已关闭")

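上面解析函数中的价格清洗与星级映射,可以抽成独立小函数快速自测。下面是一个与原逻辑一致的最小示例(测试数据为演示用):

```python
import re

# 价格文本清洗:去掉货币符号、千分位逗号等非数字字符(与 _parse_single_hotel 中逻辑一致)
def parse_price(price_text):
    return float(re.sub(r'[^\d.]', '', price_text)) if price_text else 0.0

# 星级关键词映射(与 _parse_star 中逻辑一致)
STAR_MAP = {
    '豪华': 5, '五星': 5, '5星': 5,
    '高档': 4, '四星': 4, '4星': 4,
    '舒适': 3, '三星': 3, '3星': 3,
}

def parse_star(star_text):
    for key, value in STAR_MAP.items():
        if key in star_text:
            return value
    return 0

print(parse_price('¥1,280'), parse_star('豪华型'))  # 1280.0 5
```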
9️⃣ 数据清洗与存储

数据清洗模块(data/cleaner.py)

python
"""
数据清洗模块
处理酒店/机票数据的标准化
"""

import pandas as pd
import re
import logging
from typing import List, Dict
from config import CLEAN_CONFIG

logger = logging.getLogger(__name__)

class TravelDataCleaner:
    """旅游数据清洗器"""
    
    def clean_hotel_data(self, hotels: List[Dict]) -> pd.DataFrame:
        """
        清洗酒店数据
        
        Args:
            hotels: 酒店字典列表
            
        Returns:
            清洗后的DataFrame
        """
        logger.info(f"开始清洗酒店数据,原始记录数: {len(hotels)}")
        
        if not hotels:
            return pd.DataFrame()
        
        df = pd.DataFrame(hotels)
        
        # === 1. 去重 ===
        df = self._remove_duplicates(df)
        
        # === 2. 清洗价格 ===
        df = self._clean_price(df, 'hotel')
        
        # === 3. 标准化星级 ===
        df = self._normalize_star(df)
        
        # === 4. 处理缺失值 ===
        df = self._handle_missing(df)
        
        # === 5. 添加衍生字段 ===
        df = self._add_derived_fields(df)
        
        logger.info(f"清洗完成,最终记录数: {len(df)}")
        
        return df
    
    def _remove_duplicates(self, df: pd.DataFrame) -> pd.DataFrame:
        """去重"""
        original_count = len(df)
        
        # 基于hotel_id + platform去重
        if 'hotel_id' in df.columns and 'platform' in df.columns:
            df = df.drop_duplicates(subset=['hotel_id', 'platform'], keep='first')
        
        logger.info(f"去重: {original_count} → {len(df)}")
        
        return df
    
    def _clean_price(self, df: pd.DataFrame, data_type: str) -> pd.DataFrame:
        """清洗价格"""
        if 'price' not in df.columns:
            return df
        
        # 转换为数值
        df['price'] = pd.to_numeric(df['price'], errors='coerce')
        
        # 过滤异常值
        price_range = CLEAN_CONFIG[f'{data_type}_price_range']
        before = len(df)
        df = df[df['price'].between(*price_range) | df['price'].isna()]
        
        if before != len(df):
            logger.info(f"过滤价格异常值: {before} → {len(df)}")
        
        return df
    
    def _normalize_star(self, df: pd.DataFrame) -> pd.DataFrame:
        """标准化星级"""
        if 'star_num' in df.columns:
            df['star_num'] = df['star_num'].fillna(0).astype(int)
        
        return df
    
    def _handle_missing(self, df: pd.DataFrame) -> pd.DataFrame:
        """处理缺失值"""
        fill_values = CLEAN_CONFIG['fill_values']
        df = df.fillna(fill_values)
        
        return df
    
    def _add_derived_fields(self, df: pd.DataFrame) -> pd.DataFrame:
        """添加衍生字段"""
        
        # 价格区间分类
        if 'price' in df.columns:
            df['price_level'] = pd.cut(
                df['price'], 
                bins=[0, 300, 600, 1000, float('inf')],
                labels=['经济', '舒适', '高档', '豪华']
            )
        
        # 性价比(价格/评分)
        if 'price' in df.columns and 'score' in df.columns:
            df['value_score'] = (df['score'] / df['price'] * 1000).round(2)
        
        return df

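清洗器依赖的 config.py 未在本节展示,下面给出一个按其用法推断的 CLEAN_CONFIG 结构示例(字段名与阈值均为假设值),并用几条脏数据演示 _clean_price 的核心过滤逻辑:

```python
import pandas as pd

# 假设的 CLEAN_CONFIG 结构(按清洗器中 CLEAN_CONFIG[f'{data_type}_price_range'] 等用法推断)
CLEAN_CONFIG = {
    'hotel_price_range': (50, 20000),    # 酒店单晚合理价格区间(元)
    'flight_price_range': (100, 30000),  # 机票合理价格区间(元)
    'fill_values': {'address': '', 'score': 0.0},
}

# 复现 _clean_price 的核心逻辑:转数值 + 区间过滤(NaN 暂保留,交给后续缺失值填充)
df = pd.DataFrame({'price': ['1280', 'abc', '5', '999999', '680']})
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # 'abc' → NaN
low, high = CLEAN_CONFIG['hotel_price_range']
df = df[df['price'].between(low, high) | df['price'].isna()]
cleaned = df['price'].tolist()
print(cleaned)  # 1280 与 680 保留,5 和 999999 被过滤,NaN 保留
```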
存储模块(data/storage.py)

python
"""
数据存储模块
支持SQLite、CSV、Excel
"""

import sqlite3
import pandas as pd
import os
import logging
from datetime import datetime
from typing import List, Dict, Optional
from config import DATABASE_PATH, OUTPUT_DIR

logger = logging.getLogger(__name__)

class TravelDataStorage:
    """旅游数据存储管理器"""
    
    def __init__(self):
        """初始化存储管理器"""
        self.db_path = DATABASE_PATH
        self._init_database()
        
        logger.info(f"数据存储初始化完成: {self.db_path}")
    
    def _init_database(self):
        """初始化数据库表"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        # 创建酒店价格表
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS hotel_prices (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                platform TEXT NOT NULL,
                hotel_id TEXT,
                hotel_name TEXT NOT NULL,
                city TEXT,
                address TEXT,
                star TEXT,
                star_num INTEGER,
                price REAL,
                score REAL,
                hotel_url TEXT,
                image_url TEXT,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(hotel_id, platform, scraped_at)
            )
        ''')
        
        # 创建机票价格表
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS flight_prices (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                platform TEXT NOT NULL,
                flight_no TEXT,
                airline TEXT,
                departure TEXT,
                arrival TEXT,
                dep_time TEXT,
                arr_time TEXT,
                price REAL,
                cabin_class TEXT,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        
        # 创建价格告警表
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS price_alerts (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                alert_type TEXT,
                product_type TEXT,
                product_id TEXT,
                product_name TEXT,
                old_price REAL,
                new_price REAL,
                message TEXT,
                sent_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        
        conn.commit()
        conn.close()
        
        logger.info("✓ 数据库表初始化完成")
    
    def _get_connection(self):
        """获取一个新的数据库连接(后文的分析模块也会复用此方法)"""
        return sqlite3.connect(self.db_path)
    
    def save_hotels(self, df: pd.DataFrame) -> int:
        """保存酒店数据到数据库"""
        if df.empty:
            return 0
        
        conn = sqlite3.connect(self.db_path)
        
        # 追加写入,保留每次采集的历史快照(便于后续价格趋势分析)
        df.to_sql('hotel_prices', conn, if_exists='append', index=False)
        
        conn.close()
        
        logger.info(f"✓ 已保存 {len(df)} 条酒店数据")
        
        return len(df)
    
    def export_to_csv(self, df: pd.DataFrame, filename: str) -> str:
        """导出为CSV"""
        filepath = os.path.join(OUTPUT_DIR, filename)
        df.to_csv(filepath, index=False, encoding='utf-8-sig')
        
        logger.info(f"✓ CSV已导出: {filepath}")
        
        return filepath
    
    def get_price_history(self, hotel_id: str, platform: str, days: int = 30) -> pd.DataFrame:
        """获取价格历史"""
        conn = sqlite3.connect(self.db_path)
        
        # days 参与 f-string 拼接,强制转 int 防止SQL注入
        query = f'''
            SELECT * FROM hotel_prices
            WHERE hotel_id = ? AND platform = ?
            AND scraped_at >= datetime('now', '-{int(days)} days')
            ORDER BY scraped_at DESC
        '''
        
        df = pd.read_sql_query(query, conn, params=(hotel_id, platform))
        
        conn.close()
        
        return df
    
    def compare_platforms(self, hotel_name: str) -> pd.DataFrame:
        """对比不同平台的价格"""
        conn = sqlite3.connect(self.db_path)
        
        query = '''
            SELECT platform, hotel_name, MIN(price) as min_price, 
                   MAX(price) as max_price, AVG(price) as avg_price
            FROM hotel_prices
            WHERE hotel_name LIKE ?
            GROUP BY platform, hotel_name
        '''
        
        df = pd.read_sql_query(query, conn, params=(f'%{hotel_name}%',))
        
        conn.close()
        
        return df

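存储模块中 compare_platforms 的聚合SQL,可以先在内存库里验证一遍。下面是一个自包含的小示例(表结构做了简化,数据为虚构演示值):

```python
import sqlite3

# 内存库演示:与 compare_platforms 同构的跨平台聚合查询
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE hotel_prices (
    platform TEXT, hotel_name TEXT, price REAL, scraped_at TEXT)''')
conn.executemany(
    'INSERT INTO hotel_prices VALUES (?, ?, ?, ?)',
    [('携程', '北京国贸大酒店', 1280, '2024-03-10'),
     ('携程', '北京国贸大酒店', 1180, '2024-03-11'),
     ('去哪儿', '北京国贸大酒店', 1080, '2024-03-10')]
)
rows = conn.execute('''
    SELECT platform, MIN(price), MAX(price), AVG(price)
    FROM hotel_prices
    WHERE hotel_name LIKE ?
    GROUP BY platform
''', ('%国贸%',)).fetchall()
conn.close()

result = {r[0]: r[1:] for r in rows}
print(result)
```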
🔟 任务调度与告警

任务调度器(scheduler/monitor_scheduler.py)

python
"""
任务调度模块
定时执行价格监控任务
"""

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.events import EVENT_JOB_EXECUTED, EVENT_JOB_ERROR
import logging
from datetime import datetime
from typing import Dict
import pandas as pd
from scrapers.ctrip_scraper import CtripScraper
from scrapers.qunar_scraper import QunarScraper
from data.cleaner import TravelDataCleaner
from data.storage import TravelDataStorage
from notification.notifier import PriceNotifier
from config import MONITOR_TASKS, SCHEDULE_CONFIG

logger = logging.getLogger(__name__)

class TravelMonitorScheduler:
    """旅游价格监控调度器"""
    
    def __init__(self):
        """初始化调度器"""
        self.scheduler = BlockingScheduler()
        self.cleaner = TravelDataCleaner()
        self.storage = TravelDataStorage()
        self.notifier = PriceNotifier()
        
        # 添加任务
        self._add_jobs()
        
        logger.info("监控调度器初始化完成")
    
    def _add_jobs(self):
        """添加定时任务"""
        
        # 主监控任务
        self.scheduler.add_job(
            func=self.monitor_all_tasks,
            **SCHEDULE_CONFIG['main_monitor'],
            id='main_monitor',
            name='主监控任务'
        )
        
        logger.info("✓ 定时任务已添加")
    
    def monitor_all_tasks(self):
        """执行所有监控任务"""
        logger.info("=" * 80)
        logger.info(f"开始执行监控任务 - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        logger.info("=" * 80)
        
        # 监控酒店
        for task in MONITOR_TASKS.get('hotels', []):
            if task.get('enabled', True):
                self.monitor_hotel_task(task)
        
        # 监控机票
        for task in MONITOR_TASKS.get('flights', []):
            if task.get('enabled', True):
                self.monitor_flight_task(task)
        
        logger.info("=" * 80)
        logger.info("监控任务执行完成")
        logger.info("=" * 80)
    
    def monitor_hotel_task(self, task: Dict):
        """执行单个酒店监控任务"""
        logger.info(f"\n>>> 监控任务: {task['name']}")
        
        try:
            # 选择爬虫
            if task['platform'] == 'ctrip':
                scraper = CtripScraper()
                hotels = scraper.search_hotels(
                    city=task['city'],
                    checkin_date=task['checkin_date'],
                    checkout_date=task['checkout_date'],
                    keywords=task.get('keywords', ''),
                    min_star=task.get('min_star', 0)
                )
            else:
                scraper = QunarScraper(headless=True)
                hotels = scraper.search_hotels(
                    city=task['city'],
                    checkin_date=task['checkin_date'],
                    checkout_date=task['checkout_date'],
                    keywords=task.get('keywords', ''),
                    min_star=task.get('min_star', 0)
                )
                scraper.close()
            
            if not hotels:
                logger.warning("未获取到数据")
                return
            
            # 清洗数据
            df = self.cleaner.clean_hotel_data(hotels)
            
            # 保存数据
            self.storage.save_hotels(df)
            
            # 检查告警
            self._check_alerts(df, task)
            
            logger.info(f"✓ 任务完成: {task['name']}")
            
        except Exception as e:
            logger.error(f"任务执行失败: {e}", exc_info=True)
    
    def monitor_flight_task(self, task: Dict):
        """执行机票监控任务"""
        # TODO: 实现机票监控
        pass
    
    def _check_alerts(self, df: pd.DataFrame, task: Dict):
        """检查是否需要告警"""
        threshold = task.get('alert_threshold', 0)
        
        if threshold <= 0:
            return
        
        # 筛选低于阈值的酒店
        low_price_hotels = df[df['price'] <= threshold]
        
        if not low_price_hotels.empty:
            logger.info(f"🎉 发现 {len(low_price_hotels)} 家低价酒店")
            
            # 发送告警
            for _, hotel in low_price_hotels.iterrows():
                message = f"""
价格告警:{hotel['hotel_name']}

当前价格:¥{hotel['price']}
告警阈值:¥{threshold}
星级:{hotel['star']}
地址:{hotel['address']}
链接:{hotel['hotel_url']}
"""
                self.notifier.send_alert(
                    subject=f"酒店降价提醒:{hotel['hotel_name']}",
                    message=message
                )
    
    def start(self):
        """启动调度器"""
        try:
            logger.info("=" * 80)
            logger.info("监控调度器启动")
            logger.info("=" * 80)
            
            self.scheduler.start()
            
        except (KeyboardInterrupt, SystemExit):
            logger.info("调度器已停止")

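调度器读取的 SCHEDULE_CONFIG 与 MONITOR_TASKS 同样来自未展示的 config.py,下面是按调度器代码用法推断的结构示例(触发方式、间隔、任务参数均为假设值):

```python
# 假设的调度配置:main_monitor 的内容会以 **SCHEDULE_CONFIG['main_monitor']
# 的形式直接展开为 add_job 的关键字参数
SCHEDULE_CONFIG = {
    'main_monitor': {
        'trigger': 'interval',
        'hours': 6,  # 每6小时一轮,配合低频礼貌爬取策略
    }
}

# 假设的监控任务清单:字段名与 monitor_hotel_task 中的 task.get(...) 一一对应
MONITOR_TASKS = {
    'hotels': [{
        'name': '北京国贸酒店监控',
        'platform': 'ctrip',          # 'ctrip' 或 'qunar'
        'city': '北京',
        'checkin_date': '2024-03-15',
        'checkout_date': '2024-03-17',
        'keywords': '国贸',
        'min_star': 4,
        'alert_threshold': 800,       # 价格低于800元触发告警;0表示不告警
        'enabled': True,
    }],
    'flights': [],                    # 机票监控任务(本文留作TODO)
}
```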
通知模块(notification/notifier.py)

python
"""
价格告警通知模块
支持邮件、企业微信
"""

import yagmail
import requests
import logging
from config import ALERT_CONFIG

logger = logging.getLogger(__name__)

class PriceNotifier:
    """价格告警通知器"""
    
    def __init__(self):
        """初始化通知器"""
        self.email_enabled = ALERT_CONFIG['method'] in ['email', 'both']
        self.wechat_enabled = ALERT_CONFIG['method'] in ['wechat', 'both']
        
        if self.email_enabled:
            self._init_email()
        
        logger.info(f"通知器初始化完成: 邮件={self.email_enabled}, 微信={self.wechat_enabled}")
    
    def _init_email(self):
        """初始化邮件客户端"""
        try:
            self.yag = yagmail.SMTP(
                user=ALERT_CONFIG['email']['sender'],
                password=ALERT_CONFIG['email']['password'],
                host=ALERT_CONFIG['email']['smtp_server'],
                port=ALERT_CONFIG['email']['smtp_port']
            )
            logger.info("✓ 邮件客户端初始化成功")
        except Exception as e:
            logger.error(f"邮件客户端初始化失败: {e}")
            self.email_enabled = False
    
    def send_alert(self, subject: str, message: str):
        """发送告警"""
        if self.email_enabled:
            self._send_email(subject, message)
        
        if self.wechat_enabled:
            self._send_wechat(message)
    
    def _send_email(self, subject: str, message: str):
        """发送邮件"""
        try:
            self.yag.send(
                to=ALERT_CONFIG['email']['receivers'],
                subject=subject,
                contents=message
            )
            logger.info(f"✓ 邮件已发送: {subject}")
        except Exception as e:
            logger.error(f"邮件发送失败: {e}")
    
    def _send_wechat(self, message: str):
        """发送企业微信"""
        try:
            webhook_url = ALERT_CONFIG['wechat']['webhook_url']
            data = {
                "msgtype": "text",
                "text": {
                    "content": message
                }
            }
            response = requests.post(webhook_url, json=data, timeout=10)
            if response.json().get('errcode') == 0:
                logger.info("✓ 企业微信消息已发送")
            else:
                logger.error(f"企业微信发送失败: {response.text}")
        except Exception as e:
            logger.error(f"企业微信发送异常: {e}")

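通知器依赖的 ALERT_CONFIG 结构按代码用法推断如下(域名、key 均为占位符,使用前需替换为自己的配置):

```python
# 假设的 ALERT_CONFIG 结构(按 PriceNotifier 中的取值方式推断)
ALERT_CONFIG = {
    'method': 'email',  # 'email' / 'wechat' / 'both'
    'email': {
        'sender': 'alert@example.com',
        'password': 'your_smtp_auth_code',  # 多数邮箱需用SMTP授权码而非登录密码
        'smtp_server': 'smtp.example.com',
        'smtp_port': 465,
        'receivers': ['me@example.com'],
    },
    'wechat': {
        # 企业微信群机器人webhook,key为占位符
        'webhook_url': 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY',
    },
}
```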
1️⃣1️⃣ 主程序与运行示例

主程序(main.py)

python
"""
旅游价格监控主程序
"""

import logging
import logging.config
from config import LOGGING_CONFIG
from scheduler.monitor_scheduler import TravelMonitorScheduler

# 配置日志
logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)

def main():
    """主函数"""
    try:
        # 创建并启动调度器
        scheduler = TravelMonitorScheduler()
        scheduler.start()
        
    except KeyboardInterrupt:
        logger.info("用户手动中断")
    except Exception as e:
        logger.error(f"程序异常: {e}", exc_info=True)

if __name__ == "__main__":
    main()

运行命令

bash
# 1. 启动完整监控系统
python main.py

# 2. 单次测试(无需等待定时任务)
python -c "
from scheduler.monitor_scheduler import TravelMonitorScheduler
scheduler = TravelMonitorScheduler()
scheduler.monitor_all_tasks()
"

# 3. 查询价格历史
python -c "
from data.storage import TravelDataStorage
storage = TravelDataStorage()
df = storage.get_price_history('hotel_123', 'ctrip', days=30)
print(df)
"

1️⃣2️⃣ 数据分析与可视化

价格分析模块(data/analyzer.py)

python
"""
价格数据分析模块
提供趋势分析、对比分析等功能
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from datetime import datetime, timedelta
import logging
from typing import Dict, List, Optional, Tuple
from data.storage import TravelDataStorage
from config import CHART_DIR, REPORT_DIR
import os

logger = logging.getLogger(__name__)

# 配置matplotlib中文显示
plt.rcParams['font.sans-serif'] = ['SimHei', 'Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

class TravelPriceAnalyzer:
    """
    旅游价格分析器
    
    核心功能:
    1. 价格趋势分析
    2. 平台对比分析
    3. 最佳购买时机预测
    4. 可视化报表生成
    """
    
    def __init__(self):
        """初始化分析器"""
        self.storage = TravelDataStorage()
        logger.info("价格分析器初始化完成")
    
    def analyze_price_trend(self, hotel_name: str, days: int = 30) -> Dict:
        """
        分析价格趋势
        
        Args:
            hotel_name: 酒店名称
            days: 分析天数
            
        Returns:
            趋势分析结果
        """
        logger.info(f"分析价格趋势: {hotel_name} (近{days}天)")
        
        # 获取历史数据
        conn = self.storage._get_connection()
        # days 参与 f-string 拼接,强制转 int 防止SQL注入
        query = f'''
            SELECT scraped_at, platform, price, score
            FROM hotel_prices
            WHERE hotel_name LIKE ?
            AND scraped_at >= datetime('now', '-{int(days)} days')
            ORDER BY scraped_at ASC
        '''
        
        df = pd.read_sql_query(query, conn, params=(f'%{hotel_name}%',))
        conn.close()
        
        if df.empty:
            logger.warning("未找到价格数据")
            return {}
        
        # 转换时间格式
        df['scraped_at'] = pd.to_datetime(df['scraped_at'])
        
        # 计算统计指标
        analysis = {
            'hotel_name': hotel_name,
            'data_points': len(df),
            'date_range': f"{df['scraped_at'].min()} 至 {df['scraped_at'].max()}",
            'platforms': df['platform'].unique().tolist(),
            
            # 价格统计
            'current_price': float(df.iloc[-1]['price']),
            'min_price': float(df['price'].min()),
            'max_price': float(df['price'].max()),
            'avg_price': float(df['price'].mean()),
            'median_price': float(df['price'].median()),
            'std_price': float(df['price'].std()),
            
            # 价格变化
            'price_change': float(df.iloc[-1]['price'] - df.iloc[0]['price']),
            'price_change_pct': float((df.iloc[-1]['price'] - df.iloc[0]['price']) / df.iloc[0]['price'] * 100),
            
            # 趋势判断
            'trend': self._detect_trend(df),
            
            # 最佳购买建议
            'recommendation': self._get_recommendation(df)
        }
        
        logger.info(f"分析完成: 当前价格¥{analysis['current_price']}, 趋势={analysis['trend']}")
        
        return analysis
    
    def _detect_trend(self, df: pd.DataFrame) -> str:
        """
        检测价格趋势
        
        Args:
            df: 价格数据
            
        Returns:
            趋势描述 (上涨/下跌/平稳)
        """
        if len(df) < 3:
            return "数据不足"
        
        # 计算移动平均线(先copy,避免SettingWithCopyWarning并防止修改调用方数据)
        df = df.copy()
        df['ma_3'] = df['price'].rolling(window=3).mean()
        
        # 比较最近3个数据点
        recent = df.tail(3)
        
        if recent['ma_3'].iloc[-1] > recent['ma_3'].iloc[0] * 1.05:
            return "上涨"
        elif recent['ma_3'].iloc[-1] < recent['ma_3'].iloc[0] * 0.95:
            return "下跌"
        else:
            return "平稳"
    
    def _get_recommendation(self, df: pd.DataFrame) -> str:
        """
        生成购买建议
        
        Args:
            df: 价格数据
            
        Returns:
            购买建议
        """
        current_price = df.iloc[-1]['price']
        min_price = df['price'].min()
        avg_price = df['price'].mean()
        
        # 当前价格接近历史最低价
        if current_price <= min_price * 1.05:
            return "强烈推荐:当前价格接近历史最低,建议立即预订"
        
        # 当前价格低于平均价
        elif current_price <= avg_price * 0.9:
            return "推荐:当前价格低于平均水平,适合预订"
        
        # 当前价格高于平均价
        elif current_price >= avg_price * 1.1:
            return "不推荐:当前价格偏高,建议继续观望"
        
        else:
            return "中性:价格处于正常区间"
    
    def compare_platforms(self, hotel_name: str) -> pd.DataFrame:
        """
        对比不同平台的价格
        
        Args:
            hotel_name: 酒店名称
            
        Returns:
            对比结果DataFrame
        """
        logger.info(f"对比平台价格: {hotel_name}")
        
        conn = self.storage._get_connection()
        query = '''
            SELECT 
                platform,
                COUNT(*) as data_count,
                MIN(price) as min_price,
                MAX(price) as max_price,
                AVG(price) as avg_price,
                AVG(score) as avg_score
            FROM hotel_prices
            WHERE hotel_name LIKE ?
            AND scraped_at >= datetime('now', '-7 days')
            GROUP BY platform
        '''
        
        df = pd.read_sql_query(query, conn, params=(f'%{hotel_name}%',))
        conn.close()
        
        if not df.empty:
            # 计算性价比
            df['value_score'] = (df['avg_score'] / df['avg_price'] * 1000).round(2)
            
            logger.info(f"平台对比完成:\n{df}")
        
        return df
    
    def plot_price_trend(self, hotel_name: str, days: int = 30, 
                         output_file: Optional[str] = None) -> str:
        """
        绘制价格趋势图(Matplotlib版本)
        
        Args:
            hotel_name: 酒店名称
            days: 分析天数
            output_file: 输出文件名
            
        Returns:
            图表文件路径
        """
        logger.info(f"绘制价格趋势图: {hotel_name}")
        
        # 获取数据
        conn = self.storage._get_connection()
        query = f'''
            SELECT scraped_at, platform, price
            FROM hotel_prices
            WHERE hotel_name LIKE ?
            AND scraped_at >= datetime('now', '-{int(days)} days')
            ORDER BY scraped_at ASC
        '''
        
        df = pd.read_sql_query(query, conn, params=(f'%{hotel_name}%',))
        conn.close()
        
        if df.empty:
            logger.warning("无数据可绘制")
            return ""
        
        df['scraped_at'] = pd.to_datetime(df['scraped_at'])
        
        # 创建图表
        fig, axes = plt.subplots(2, 1, figsize=(14, 10))
        
        # === 子图1:价格趋势线 ===
        ax1 = axes[0]
        
        for platform in df['platform'].unique():
            platform_data = df[df['platform'] == platform]
            ax1.plot(
                platform_data['scraped_at'], 
                platform_data['price'],
                marker='o',
                label=platform,
                linewidth=2
            )
        
        ax1.set_title(f'{hotel_name} - 价格趋势图', fontsize=16, fontweight='bold')
        ax1.set_xlabel('日期', fontsize=12)
        ax1.set_ylabel('价格 (元)', fontsize=12)
        ax1.legend(loc='best')
        ax1.grid(True, alpha=0.3)
        
        # 标注最高/最低价
        min_idx = df['price'].idxmin()
        max_idx = df['price'].idxmax()
        
        ax1.scatter(
            df.loc[min_idx, 'scraped_at'], 
            df.loc[min_idx, 'price'],
            color='green', s=200, marker='*', zorder=5
        )
        ax1.annotate(
            f"最低价: ¥{df.loc[min_idx, 'price']:.0f}",
            xy=(df.loc[min_idx, 'scraped_at'], df.loc[min_idx, 'price']),
            xytext=(10, 20), textcoords='offset points',
            bbox=dict(boxstyle='round,pad=0.5', fc='lightgreen', alpha=0.7),
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0')
        )
        
        ax1.scatter(
            df.loc[max_idx, 'scraped_at'], 
            df.loc[max_idx, 'price'],
            color='red', s=200, marker='*', zorder=5
        )
        ax1.annotate(
            f"最高价: ¥{df.loc[max_idx, 'price']:.0f}",
            xy=(df.loc[max_idx, 'scraped_at'], df.loc[max_idx, 'price']),
            xytext=(10, -30), textcoords='offset points',
            bbox=dict(boxstyle='round,pad=0.5', fc='lightcoral', alpha=0.7),
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0')
        )
        
        # === 子图2:价格分布箱线图 ===
        ax2 = axes[1]
        
        platforms = df['platform'].unique()
        price_data = [df[df['platform'] == p]['price'].values for p in platforms]
        
        bp = ax2.boxplot(
            price_data,
            labels=platforms,
            patch_artist=True,
            notch=True
        )
        
        # 美化箱线图
        colors = ['lightblue', 'lightgreen', 'lightyellow']
        for patch, color in zip(bp['boxes'], colors):
            patch.set_facecolor(color)
        
        ax2.set_title('价格分布对比', fontsize=16, fontweight='bold')
        ax2.set_xlabel('平台', fontsize=12)
        ax2.set_ylabel('价格 (元)', fontsize=12)
        ax2.grid(True, alpha=0.3, axis='y')
        
        # 添加统计信息
        stats_text = f"""
统计信息:
数据点数: {len(df)}
平均价格: ¥{df['price'].mean():.2f}
价格标准差: ¥{df['price'].std():.2f}
        """.strip()
        
        fig.text(0.02, 0.02, stats_text, fontsize=10, 
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
        
        plt.tight_layout()
        
        # 保存图表
        if not output_file:
            output_file = f"price_trend_{hotel_name.replace(' ', '_')}_{datetime.now().strftime('%Y%m%d')}.png"
        
        output_path = os.path.join(CHART_DIR, output_file)
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        plt.close()
        
        logger.info(f"✓ 图表已保存: {output_path}")
        
        return output_path
    
    def plot_interactive_chart(self, hotel_name: str, days: int = 30) -> str:
        """
        绘制交互式价格图表(Plotly版本)
        
        Args:
            hotel_name: 酒店名称
            days: 分析天数
            
        Returns:
            HTML文件路径
        """
        logger.info(f"绘制交互式图表: {hotel_name}")
        
        # 获取数据
        conn = self.storage._get_connection()
        query = f'''
            SELECT scraped_at, platform, price, score
            FROM hotel_prices
            WHERE hotel_name LIKE ?
            AND scraped_at >= datetime('now', '-{int(days)} days')
            ORDER BY scraped_at ASC
        '''
        
        df = pd.read_sql_query(query, conn, params=(f'%{hotel_name}%',))
        conn.close()
        
        if df.empty:
            return ""
        
        df['scraped_at'] = pd.to_datetime(df['scraped_at'])
        
        # 创建交互式图表
        fig = go.Figure()
        
        for platform in df['platform'].unique():
            platform_data = df[df['platform'] == platform]
            
            fig.add_trace(go.Scatter(
                x=platform_data['scraped_at'],
                y=platform_data['price'],
                mode='lines+markers',
                name=platform,
                line=dict(width=3),
                marker=dict(size=8),
                hovertemplate=
                    '<b>平台</b>: ' + platform + '<br>' +
                    '<b>日期</b>: %{x}<br>' +
                    '<b>价格</b>: ¥%{y:.2f}<br>' +
                    '<extra></extra>'
            ))
        
        # 添加平均线
        avg_price = df.groupby('scraped_at')['price'].mean()
        fig.add_trace(go.Scatter(
            x=avg_price.index,
            y=avg_price.values,
            mode='lines',
            name='平均价',
            line=dict(dash='dash', width=2, color='gray')
        ))
        
        # 更新布局
        fig.update_layout(
            title=f'{hotel_name} - 价格趋势分析',
            xaxis_title='日期',
            yaxis_title='价格 (元)',
            hovermode='x unified',
            template='plotly_white',
            width=1200,
            height=600
        )
        
        # 保存HTML
        output_file = f"interactive_chart_{hotel_name.replace(' ', '_')}_{datetime.now().strftime('%Y%m%d')}.html"
        output_path = os.path.join(CHART_DIR, output_file)
        fig.write_html(output_path)
        
        logger.info(f"✓ 交互式图表已保存: {output_path}")
        
        return output_path
    
    def generate_weekly_report(self) -> str:
        """
        生成每周价格报告
        
        Returns:
            报告文件路径
        """
        logger.info("生成每周价格报告")
        
        # 获取本周数据
        conn = self.storage._get_connection()
        query = '''
            SELECT 
                hotel_name,
                platform,
                COUNT(*) as check_times,
                MIN(price) as min_price,
                MAX(price) as max_price,
                AVG(price) as avg_price,
                (MAX(price) - MIN(price)) as price_range
            FROM hotel_prices
            WHERE scraped_at >= datetime('now', '-7 days')
            GROUP BY hotel_name, platform
            ORDER BY price_range DESC
        '''
        
        df = pd.read_sql_query(query, conn)
        conn.close()
        
        if df.empty:
            logger.warning("本周无数据")
            return ""
        
        # 生成报告
        report_date = datetime.now().strftime('%Y年%m月%d日')
        
        report_content = f"""
# 旅游价格监控周报
**生成时间**: {report_date}

## 📊 本周概览
- 监控酒店数: {df['hotel_name'].nunique()}
- 数据采集次数: {df['check_times'].sum()}
- 平台覆盖: {df['platform'].nunique()}个

## 💰 价格波动TOP10
本周价格波动最大的酒店:

"""
        
        # 添加TOP10表格(to_markdown 依赖 tabulate 库:pip install tabulate)
        top10 = df.head(10)
        report_content += top10.to_markdown(index=False)
        
        report_content += f"""

## 🎯 降价推荐
以下酒店本周出现显著降价,建议关注:

"""
        
        # 筛选降价酒店(价格范围>200元)
        drops = df[df['price_range'] > 200].head(5)
        
        for _, row in drops.iterrows():
            report_content += f"""
### {row['hotel_name']} ({row['platform']})
- **价格区间**: ¥{row['min_price']:.0f} - ¥{row['max_price']:.0f}
- **波动幅度**: ¥{row['price_range']:.0f}
- **当前均价**: ¥{row['avg_price']:.0f}

"""
        
        # 保存报告
        output_file = f"weekly_report_{datetime.now().strftime('%Y%m%d')}.md"
        output_path = os.path.join(REPORT_DIR, output_file)
        
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(report_content)
        
        logger.info(f"✓ 周报已生成: {output_path}")
        
        return output_path

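_detect_trend 的移动平均判断可以抽成独立函数,用合成价格序列验证三种趋势分支(下面的阈值与原实现一致):

```python
import pandas as pd

def detect_trend(prices):
    """复现 _detect_trend 的均线判断逻辑(独立函数,便于单测)"""
    df = pd.DataFrame({'price': prices})
    if len(df) < 3:
        return "数据不足"
    df['ma_3'] = df['price'].rolling(window=3).mean()
    recent = df.tail(3)
    first, last = recent['ma_3'].iloc[0], recent['ma_3'].iloc[-1]
    if last > first * 1.05:      # 均线抬升超过5% → 上涨
        return "上涨"
    elif last < first * 0.95:    # 均线回落超过5% → 下跌
        return "下跌"
    return "平稳"

print(detect_trend([100, 100, 100, 120, 140, 160]))  # 上涨
print(detect_trend([160, 160, 160, 140, 120, 100]))  # 下跌
print(detect_trend([100, 100, 100, 100, 100, 100]))  # 平稳
```

注意一个隐含行为:数据点少于5个时,tail(3) 内的均线仍含 NaN,两个比较都为 False,结果会落入"平稳"分支;数据量很小时建议直接比较原始价格。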
1️⃣3️⃣ 使用示例与最佳实践

完整使用流程

python
"""
完整使用示例
演示如何使用价格监控系统
"""

from scrapers.ctrip_scraper import CtripScraper
from scrapers.qunar_scraper import QunarScraper
from data.cleaner import TravelDataCleaner
from data.storage import TravelDataStorage
from data.analyzer import TravelPriceAnalyzer
from notification.notifier import PriceNotifier

def example_1_single_search():
    """示例1:单次搜索酒店"""
    print("\n" + "="*60)
    print("示例1:单次搜索携程酒店")
    print("="*60)
    
    # 初始化爬虫
    scraper = CtripScraper()
    
    # 搜索酒店
    hotels = scraper.search_hotels(
        city='北京',
        checkin_date='2024-03-15',
        checkout_date='2024-03-17',
        keywords='国贸',
        min_star=4,
        max_results=10
    )
    
    print(f"✓ 找到 {len(hotels)} 家酒店")
    
    for hotel in hotels[:3]:
        print(f"  - {hotel['hotel_name']}: ¥{hotel['price']}")

def example_2_compare_platforms():
    """示例2:对比不同平台价格"""
    print("\n" + "="*60)
    print("示例2:对比携程vs去哪儿")
    print("="*60)
    
    # 携程
    ctrip_scraper = CtripScraper()
    ctrip_hotels = ctrip_scraper.search_hotels(
        city='上海',
        checkin_date='2024-03-20',
        checkout_date='2024-03-22',
        keywords='浦东机场',
        max_results=5
    )
    
    # 去哪儿
    qunar_scraper = QunarScraper(headless=True)
    qunar_hotels = qunar_scraper.search_hotels(
        city='上海',
        checkin_date='2024-03-20',
        checkout_date='2024-03-22',
        keywords='浦东机场',
        max_results=5
    )
    qunar_scraper.close()
    
    # 清洗数据
    cleaner = TravelDataCleaner()
    ctrip_df = cleaner.clean_hotel_data(ctrip_hotels)
    qunar_df = cleaner.clean_hotel_data(qunar_hotels)
    
    # 对比结果
    print(f"\n携程平均价: ¥{ctrip_df['price'].mean():.2f}")
    print(f"去哪儿平均价: ¥{qunar_df['price'].mean():.2f}")
    print(f"价格差: ¥{abs(ctrip_df['price'].mean() - qunar_df['price'].mean()):.2f}")

def example_3_price_analysis():
    """示例3:价格趋势分析"""
    print("\n" + "="*60)
    print("示例3:价格趋势分析与可视化")
    print("="*60)
    
    analyzer = TravelPriceAnalyzer()
    
    # 分析价格趋势
    analysis = analyzer.analyze_price_trend(
        hotel_name='北京国贸大酒店',
        days=30
    )
    
    print(f"\n酒店: {analysis['hotel_name']}")
    print(f"当前价格: ¥{analysis['current_price']}")
    print(f"最低价: ¥{analysis['min_price']}")
    print(f"最高价: ¥{analysis['max_price']}")
    print(f"趋势: {analysis['trend']}")
    print(f"建议: {analysis['recommendation']}")
    
    # 绘制图表
    chart_path = analyzer.plot_price_trend(
        hotel_name='北京国贸大酒店',
        days=30
    )
    print(f"\n✓ 趋势图已保存: {chart_path}")

def example_4_alert_system():
    """示例4:价格告警系统"""
    print("\n" + "="*60)
    print("示例4:设置价格告警")
    print("="*60)
    
    # 这个示例需要先运行一段时间收集数据
    storage = TravelDataStorage()
    notifier = PriceNotifier()
    
    # 设定告警规则
    alert_threshold = 800  # 低于800元告警
    
    # 查询最新价格
    import pandas as pd
    conn = storage._get_connection()
    query = '''
        SELECT hotel_name, platform, price
        FROM hotel_prices
        WHERE scraped_at >= datetime('now', '-1 day')
        AND price < ?
    '''
    df = pd.read_sql_query(query, conn, params=(alert_threshold,))
    conn.close()
    
    if not df.empty:
        print(f"✓ 发现 {len(df)} 家低价酒店")
        
        for _, row in df.iterrows():
            message = f"价格告警:{row['hotel_name']}({row['platform']}) - ¥{row['price']}"
            print(f"  {message}")
            
            # 发送通知(取消注释以实际发送)
            # notifier.send_alert(
            #     subject="酒店降价提醒",
            #     message=message
            # )
    else:
        print("暂无符合条件的低价酒店")

if __name__ == "__main__":
    # 运行示例
    example_1_single_search()
    # example_2_compare_platforms()  # 需要较长时间
    # example_3_price_analysis()  # 需要历史数据
    # example_4_alert_system()  # 需要历史数据
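示例4中实际发送被注释掉了。如果想自己实现一个最简版 send_alert,可以用标准库 smtplib/email 组装并发送邮件。下面是一个示意(SMTP服务器、账号均为占位假设,PriceNotifier 的真实接口以项目代码为准):

```python
import smtplib
from email.mime.text import MIMEText
from email.header import Header

def build_alert_email(subject: str, message: str,
                      sender: str, receiver: str) -> MIMEText:
    """构造告警邮件(只组装消息,不发送,便于单独测试)"""
    msg = MIMEText(message, 'plain', 'utf-8')
    msg['Subject'] = Header(subject, 'utf-8')
    msg['From'] = sender
    msg['To'] = receiver
    return msg

def send_alert(subject: str, message: str):
    """通过SMTP发送告警邮件(服务器/账号为占位示例,需替换为实际配置)"""
    sender = 'alert@example.com'    # 假设的发件地址
    receiver = 'me@example.com'     # 假设的收件地址
    msg = build_alert_email(subject, message, sender, receiver)
    with smtplib.SMTP_SSL('smtp.example.com', 465) as server:  # 示例服务器
        server.login(sender, 'your_password')
        server.sendmail(sender, [receiver], msg.as_string())
```

把组装与发送拆开,方便在不配置SMTP的情况下先验证邮件内容。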

1️⃣4️⃣ 常见问题与解决方案

Q1: 携程返回验证码怎么办?

原因: 请求频率过高或Cookie失效

解决方案:

python 复制代码
# 方案1:增加延迟
DELAY_RANGE = (10, 15)  # 从5-10秒改为10-15秒

# 方案2:更新Cookie
# 1. 打开浏览器,访问携程并登录
# 2. F12打开开发者工具
# 3. Network → 找到任意请求 → Headers → Cookie
# 4. 复制整个Cookie字符串
# 5. 更新到config.py或.env文件

# 方案3:使用代理IP(需要购买代理服务)
PROXY_CONFIG = {
    'enabled': True,
    'proxy_pool': [
        'http://user:pass@proxy1.example.com:8080',  # 示例地址,替换为实际代理
        'http://user:pass@proxy2.example.com:8080',
    ]
}
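方案3拿到代理池后,每次请求前随机取一个、拼成 requests 所需的 proxies 字典即可。下面是一个最小示意(代理地址为占位示例):

```python
import random

# 占位示例地址,替换为实际购买的代理
PROXY_POOL = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
]

def pick_proxies() -> dict:
    """从代理池随机取一个,构造 requests 所需的 proxies 参数"""
    proxy = random.choice(PROXY_POOL)
    return {'http': proxy, 'https': proxy}

# 用法示意(实际请求时):
# resp = requests.get(url, proxies=pick_proxies(), timeout=10)
```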

Q2: Selenium总是被检测到?

解决方案:

python 复制代码
# 使用undetected_chromedriver替代普通webdriver
import undetected_chromedriver as uc

chrome_options = uc.ChromeOptions()
chrome_options.add_argument('--disable-blink-features=AutomationControlled')

driver = uc.Chrome(options=chrome_options)

# 执行反检测脚本
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': '''
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    '''
})

Q3: 价格数据不一致?

原因: 动态定价、时间差异

处理方法:

python 复制代码
def normalize_price_data(df):
    """标准化价格数据"""
    
    # 1. 去除异常值(3σ原则)
    mean_price = df['price'].mean()
    std_price = df['price'].std()
    df = df[
        (df['price'] >= mean_price - 3*std_price) &
        (df['price'] <= mean_price + 3*std_price)
    ]
    
    # 2. 时间对齐(按小时分组取均值)
    df['hour'] = pd.to_datetime(df['scraped_at']).dt.floor('h')  # 新版pandas建议用小写'h'
    df_aligned = df.groupby(['hotel_id', 'platform', 'hour'])['price'].mean().reset_index()
    
    return df_aligned
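3σ过滤有个前提:样本太少时,单个极端值会把σ本身拉大,反而滤不掉。下面用一组合成数据快速验证上面的过滤逻辑(数据纯属虚构):

```python
import pandas as pd

# 15条正常价格 + 1条明显离群的价格
prices = [495, 500, 505, 498, 502, 510, 490, 503,
          497, 501, 499, 504, 496, 508, 492, 9999]
df = pd.DataFrame({'price': prices})

mean_price = df['price'].mean()
std_price = df['price'].std()

# 3σ原则过滤离群点
filtered = df[
    (df['price'] >= mean_price - 3 * std_price) &
    (df['price'] <= mean_price + 3 * std_price)
]
print(f"过滤前 {len(df)} 条,过滤后 {len(filtered)} 条")
```

注意:如果只有六七条数据,9999 这种离群值也可能落在3σ以内;数据量不足时可改用基于分位数(IQR)的过滤。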

Q4: 如何提高爬取效率?

优化策略:

python 复制代码
# 策略1:只爬列表页,不请求详情页
# 大部分信息在列表页已经足够

# 策略2:使用异步请求(aiohttp + asyncio)
import asyncio
import aiohttp

async def async_fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def batch_fetch(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [async_fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# 策略3:分布式爬取(使用Celery)
from celery import Celery

app = Celery('travel_scraper', broker='redis://localhost:6379')

@app.task
def scrape_hotel_task(hotel_id):
    scraper = CtripScraper()
    return scraper.get_hotel_detail(hotel_id)
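上面的 batch_fetch 会把所有请求一次性发出去,与"礼貌爬取"相悖。可以用 Semaphore 给并发加上限,以下为自包含示意(用 asyncio.sleep 模拟网络IO,实际替换为 aiohttp 调用即可):

```python
import asyncio

async def limited_fetch(sem: asyncio.Semaphore, url: str) -> str:
    """受信号量限制的抓取(这里用 sleep 模拟网络IO)"""
    async with sem:
        await asyncio.sleep(0.01)  # 模拟请求耗时,实际替换为 aiohttp 请求
        return f"done:{url}"

async def polite_batch_fetch(urls, max_concurrency: int = 3):
    """同一时刻最多 max_concurrency 个请求在飞"""
    sem = asyncio.Semaphore(max_concurrency)
    tasks = [limited_fetch(sem, u) for u in urls]
    return await asyncio.gather(*tasks)  # gather 保持与 urls 相同的顺序

results = asyncio.run(polite_batch_fetch([f"url{i}" for i in range(10)]))
```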

Q5: 数据库太大怎么办?

数据清理策略:

python 复制代码
def clean_old_data(days=90):
    """删除90天前的历史数据"""
    conn = sqlite3.connect(DATABASE_PATH)
    cursor = conn.cursor()
    
    # 保留每个酒店每天的最低价
    cursor.execute('''
        DELETE FROM hotel_prices
        WHERE id NOT IN (
            SELECT id FROM (
                SELECT id, 
                       ROW_NUMBER() OVER (
                           PARTITION BY hotel_id, DATE(scraped_at)
                           ORDER BY price ASC
                       ) as rn
                FROM hotel_prices
                WHERE scraped_at < datetime('now', '-{} days')
            ) WHERE rn = 1
        )
        AND scraped_at < datetime('now', '-{} days')
    '''.format(days, days))
    
    conn.commit()
    deleted = cursor.rowcount
    conn.close()
    
    logger.info(f"✓ 清理了 {deleted} 条旧数据")
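补充一点:SQLite 执行大量 DELETE 后文件并不会自动变小,需要 VACUUM 重建文件才能真正回收磁盘空间。下面在临时库上演示(数据仅为演示用):

```python
import os
import sqlite3
import tempfile

# 在临时文件上演示:大量删除后用 VACUUM 回收磁盘空间
path = os.path.join(tempfile.mkdtemp(), 'demo.db')
conn = sqlite3.connect(path)
conn.execute('CREATE TABLE t (data TEXT)')
conn.executemany('INSERT INTO t VALUES (?)', [('x' * 1000,)] * 5000)
conn.commit()
size_before = os.path.getsize(path)

conn.execute('DELETE FROM t')
conn.commit()
conn.execute('VACUUM')  # 重建数据库文件,释放已删除数据占用的空间
conn.close()
size_after = os.path.getsize(path)
print(size_before, '->', size_after)
```

可以在 clean_old_data 收尾处顺手执行一次 VACUUM(注意它会独占数据库,建议放在低峰期)。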

1️⃣5️⃣ 进阶扩展

扩展1:机票价格监控

python 复制代码
class FlightScraper:
    """机票爬虫(示例)"""
    
    def search_flights(self, departure, arrival, dep_date):
        """
        搜索机票
        
        携程机票API示例:
        https://flights.ctrip.com/itinerary/api/12808/products
        
        需要的参数:
        - flightWay: 单程/往返
        - dcity: 出发城市代码
        - acity: 到达城市代码
        - date: 日期
        """
        # 实现细节省略...
        pass
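上面的桩代码只列出了参数,这里补一个组装查询 payload 的示意(城市三字码与字段结构均为示意假设,实际以抓包结果为准):

```python
# 常见城市三字码示例(仅示意)
CITY_CODES = {'北京': 'BJS', '上海': 'SHA', '广州': 'CAN', '成都': 'CTU'}

def build_flight_payload(departure: str, arrival: str, dep_date: str) -> dict:
    """按上文列出的参数组装查询payload(字段名为示意,以实际抓包为准)"""
    return {
        'flightWay': 'Oneway',  # 单程;往返为 'Roundtrip'(示意值)
        'dcity': CITY_CODES.get(departure, departure),  # 查不到三字码时原样传入
        'acity': CITY_CODES.get(arrival, arrival),
        'date': dep_date,
    }
```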

扩展2:价格预测模型

python 复制代码
from sklearn.ensemble import RandomForestRegressor
import pickle

class PricePredictor:
    """价格预测模型"""
    
    def train_model(self, historical_data):
        """训练预测模型"""
        # 特征工程
        X = historical_data[['day_of_week', 'days_before_checkin', 
                            'star_num', 'avg_score']]
        y = historical_data['price']
        
        # 训练随机森林
        model = RandomForestRegressor(n_estimators=100)
        model.fit(X, y)
        
        # 保存模型
        with open('price_model.pkl', 'wb') as f:
            pickle.dump(model, f)
    
    def predict_future_price(self, hotel_features):
        """预测未来价格"""
        with open('price_model.pkl', 'rb') as f:
            model = pickle.load(f)
        
        return model.predict([hotel_features])[0]
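模型要求的 day_of_week、days_before_checkin 等特征需要从原始日期派生。下面是一个特征工程示意(列名沿用上文约定,数据为虚构):

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """从原始日期字段派生模型所需特征"""
    df = df.copy()
    checkin = pd.to_datetime(df['checkin_date'])
    scraped = pd.to_datetime(df['scraped_at'])
    df['day_of_week'] = checkin.dt.dayofweek                 # 0=周一,6=周日
    df['days_before_checkin'] = (checkin - scraped).dt.days  # 提前预订天数
    return df

raw = pd.DataFrame({
    'checkin_date': ['2024-03-15', '2024-03-16'],
    'scraped_at':   ['2024-03-01', '2024-03-10'],
    'star_num':     [5, 4],
    'avg_score':    [4.6, 4.2],
    'price':        [980, 680],
})
feat = build_features(raw)
```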

扩展3:移动端界面(Flask)

python 复制代码
from flask import Flask, render_template, jsonify
import pandas as pd

from data.storage import TravelDataStorage
from data.analyzer import TravelPriceAnalyzer

app = Flask(__name__)
storage = TravelDataStorage()
analyzer = TravelPriceAnalyzer()

@app.route('/')
def index():
    """首页"""
    return render_template('index.html')

@app.route('/api/hotels')
def get_hotels():
    """API:获取酒店列表"""
    conn = storage._get_connection()
    df = pd.read_sql_query('SELECT * FROM hotel_prices ORDER BY scraped_at DESC LIMIT 50', conn)
    conn.close()
    
    return jsonify(df.to_dict('records'))

@app.route('/api/trend/<hotel_name>')
def get_trend(hotel_name):
    """API:获取价格趋势"""
    analysis = analyzer.analyze_price_trend(hotel_name, days=30)
    return jsonify(analysis)

if __name__ == '__main__':
    app.run(debug=True, port=5000)

总结与展望

本教程的核心价值

通过这个完整的旅游价格监控项目,我们实现了:

  1. 双平台数据采集:携程(requests)+ 去哪儿(Selenium)
  2. 智能价格分析:趋势预测、平台对比、最佳购买时机
  3. 自动化监控系统:定时任务、价格告警、周报生成
  4. 完整工程化方案:模块化、可扩展、生产就绪

实战收益总结

| 维度 | 传统方式 | 使用本系统 | 收益 |
| --- | --- | --- | --- |
| 找酒店时间 | 每次30分钟 | 5分钟查看报告 | 节省83% |
| 价格透明度 | 单平台单时点 | 多平台历史对比 | 信息量10倍+ |
| 决策准确性 | 凭感觉 | 数据驱动 | 省钱15-30% |
| 心理成本 | 担心买贵 | 确信最优价 | 无价 |

技术能力提升

通过本项目,你将掌握:

  1. 反爬虫对抗:Cookie管理、UA轮换、频率控制、验证码处理
  2. 数据工程:清洗、标准化、存储、分析完整流程
  3. 自动化运维:任务调度、异常处理、日志监控
  4. 数据可视化:Matplotlib、Plotly交互式图表
  5. 系统设计:模块化、可扩展的架构思维

延伸学习资源

爬虫进阶

数据分析

项目扩展方向

  • 添加更多OTA平台(飞猪、美团、Booking)
  • 实现价格预测算法(LSTM、ARIMA)
  • 开发移动端APP(React Native)
  • 接入大数据平台(Spark)做深度分析

最后的话

旅游价格监控不仅是一个爬虫项目,更是一个完整的数据产品。它涉及数据采集、清洗、存储、分析、可视化、告警的全链路,是学习数据工程的绝佳实战项目。

在开发这个系统的过程中,我最大的收获不是省下的几千块钱,而是建立了"用数据说话"的思维模式。当面对任何需要决策的场景,我的第一反应不再是"听别人说",而是"让我先看看数据"。

希望这篇教程不仅教会你写爬虫,更能启发你用技术改善生活。代码改变世界,数据驱动决策!

如果本文对你有帮助,欢迎Star、Fork或分享!有任何问题随时交流,祝旅途愉快!

🌟 文末

好啦~以上就是本期 《Python爬虫实战》的全部内容啦!如果你在实践过程中遇到任何疑问,欢迎在评论区留言交流,我看到都会尽量回复~咱们下期见!

小伙伴们在批阅的过程中,如果觉得文章不错,欢迎点赞、收藏、关注哦~
三连就是对我写作道路上最好的鼓励与支持! ❤️🔥

📌 专栏持续更新中|建议收藏 + 订阅

专栏 👉 《Python爬虫实战》,我会按照"入门 → 进阶 → 工程化 → 项目落地"的路线持续更新,争取让每一篇都做到:

✅ 讲得清楚(原理)|✅ 跑得起来(代码)|✅ 用得上(场景)|✅ 扛得住(工程化)

📣 想系统提升的小伙伴:强烈建议先订阅专栏,再按目录顺序学习,效率会高很多~

✅ 互动征集

想让我把【某站点/某反爬/某验证码/某分布式方案】写成专栏实战?

评论区留言告诉我你的需求,我会优先安排更新 ✅


⭐️ 若喜欢我,就请关注我叭~(更新不迷路)

⭐️ 若对你有用,就请点赞支持一下叭~(给我一点点动力)

⭐️ 若有疑问,就请评论留言告诉我叭~(我会补坑 & 更新迭代)


免责声明:本文仅用于学习与技术研究,请在合法合规、遵守站点规则与 Robots 协议的前提下使用相关技术。严禁将技术用于任何非法用途或侵害他人权益的行为。技术无罪,责任在人!!!
