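Python (Chapter 11: Dynamic Web Scraping and Anti-Scraping Mechanisms)

Week 10 Study Plan: Dynamic Web Scraping and Anti-Scraping Mechanisms

Goal: Learn a tool for scraping dynamic web pages (Selenium) and how to handle common anti-scraping measures.

Overview

Dynamic web scraping: Selenium basics.
Anti-scraping mechanisms: recognizing and handling them (request headers, delays, proxies).
Practice task: scrape a dynamic e-commerce page (for example, product prices on JD.com).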

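Part 1: Dynamic Web Scraping with Selenium

1. Introduction to Selenium

Purpose: Selenium drives a real browser, so content that a page builds with JavaScript is loaded before you read it.
Installation:
- pip install selenium
- Download a browser driver (such as ChromeDriver) whose version matches your installed Chrome. Selenium 4.6 and later can also resolve a matching driver automatically through Selenium Manager.
- Put the driver on your PATH, or specify its path in code.

2. Basic Usage

Example: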

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to ChromeDriver (replace with your own path)
service = Service(executable_path="path/to/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("https://www.example.com")
print(driver.title)  # print the page title
driver.quit()  # close the browser
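
The example above opens a visible browser window. For unattended scraping it is common to run Chrome headless instead. The following is a minimal sketch, assuming Selenium 4.6 or later (so that webdriver.Chrome() can find a driver without an explicit Service path) and Chrome 109 or later for the --headless=new switch. The function names build_options and fetch_title are made up for this illustration.

```python
def build_options(headless=True):
    """Return ChromeOptions for a scraping session."""
    # Imported inside the function so this sketch loads even where
    # Selenium is not installed yet.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        # Run Chrome without a visible window.
        options.add_argument("--headless=new")
    # A fixed window size keeps page layout predictable in headless mode.
    options.add_argument("--window-size=1280,800")
    return options


def fetch_title(url, headless=True):
    """Open url in Chrome and return the page title (hypothetical helper)."""
    from selenium import webdriver

    driver = webdriver.Chrome(options=build_options(headless))
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Calling fetch_title("https://www.example.com") should behave like the basic example, just without a window appearing.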

3. Locating Elements

Use find_element to get the first matching element, or find_elements to get a list of all matches.
Common locator strategies:
- By.ID
- By.CLASS_NAME
- By.CSS_SELECTOR
Example:

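from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a new browser session; the previous example already called quit().
# With no Service argument, Selenium looks for the driver on PATH
# (or resolves it through Selenium Manager in 4.6+).
driver = webdriver.Chrome()

# Assumes a local test page is being served at this address.
driver.get("http://localhost:8000/test_shop.html")
products = driver.find_elements(By.CLASS_NAME, "product")
for product in products:
    name = product.find_element(By.CLASS_NAME, "name").text
    price = product.find_element(By.CLASS_NAME, "price").text
    print(f"{name} - {price}")
driver.quit()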
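
On a dynamic page, find_elements can return an empty list simply because the script that renders the products has not finished yet. Selenium's explicit waits address this by polling until a condition holds. The sketch below assumes the same "product" class name as the example above; wait_for_products is a hypothetical helper name.

```python
def wait_for_products(driver, timeout=10):
    """Block until at least one .product element exists, then return them all."""
    # Imported inside the function so this sketch loads even where
    # Selenium is not installed yet.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.CLASS_NAME, "product")
    # until() re-checks the condition periodically and raises
    # TimeoutException if it is still unmet after `timeout` seconds.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(locator)
    )
```

With this helper, products = wait_for_products(driver) could stand in for the direct find_elements call once the page has been requested.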

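Part 2: Anti-Scraping Mechanisms

1. Common Anti-Scraping Techniques

User-Agent checks: the site inspects request headers and rejects clients that do not look like a browser.
IP rate limiting: an address that sends requests too frequently gets blocked.
JavaScript checks: data is loaded dynamically by scripts, so it is missing from the raw HTML.
CAPTCHAs: require human intervention.

2. Countermeasures

Spoofing request headers: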

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124"}
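
A headers dictionary has no effect until it is attached to a request. The sketch below shows one way to do that with requests, plus the equivalent Chrome switch for Selenium sessions. The fuller User-Agent string, the Accept-Language value, and the helper names fetch_html and chrome_user_agent_argument are assumptions added for illustration.

```python
# A browser-like header set; the values are illustrative, not required.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/91.0.4472.124 Safari/537.36"
    ),
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}


def fetch_html(url, timeout=10):
    """Fetch a page with browser-like headers and return its HTML text."""
    import requests  # third-party: pip install requests

    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.text


def chrome_user_agent_argument(user_agent=HEADERS["User-Agent"]):
    """Build the Chrome command-line switch that overrides the User-Agent."""
    return f"--user-agent={user_agent}"
```

For Selenium, the returned string can be passed to add_argument on a ChromeOptions object before the driver is created.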

Delaying requests:

import time
time.sleep(2)  # pause 2 seconds between requests
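
A perfectly regular two-second interval is itself a recognizable pattern, so a common refinement is to add random jitter and to back off after failures. This is a minimal sketch of that idea; polite_sleep, backoff_delays, and their default values are hypothetical choices, not settings recommended by any particular site.

```python
import random
import time


def polite_sleep(base=2.0, jitter=1.0):
    """Sleep for `base` plus a random extra of up to `jitter` seconds.

    Returns the delay that was actually used.
    """
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


def backoff_delays(retries=4, first=1.0, factor=2.0, cap=30.0):
    """Return the wait before each retry, growing by `factor` up to `cap`."""
    return [min(cap, first * factor ** attempt) for attempt in range(retries)]


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0]
```

Calling polite_sleep() between page loads and stepping through backoff_delays() after a failed request keeps the request rate low without a fixed rhythm.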

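Proxy IPs:

Use free proxies or a proxy service (such as proxyscrape.com).

import requests
# Replace PROXY_IP:PORT; most proxies are reached over http:// even for HTTPS targets.
proxies = {"http": "http://PROXY_IP:PORT", "https": "http://PROXY_IP:PORT"}
response = requests.get("https://www.example.com", proxies=proxies, timeout=10)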
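
A single proxy is still a single IP address, so scrapers usually rotate through several. The sketch below shows one simple round-robin rotation with retry. The addresses come from the documentation-only range 203.0.113.0/24 and are placeholders, and proxy_cycle and fetch_via_pool are hypothetical helper names. The actual HTTP call is passed in as a function so the rotation logic stays independent of the HTTP library.

```python
import itertools

# Placeholder proxy addresses; replace with proxies you are allowed to use.
PROXY_POOL = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:3128"]


def proxy_cycle(pool):
    """Yield requests-style proxies dicts, rotating through `pool` forever."""
    for address in itertools.cycle(pool):
        proxy_url = f"http://{address}"
        yield {"http": proxy_url, "https": proxy_url}


def fetch_via_pool(url, pool, fetch, attempts=3):
    """Call fetch(url, proxies) with successive proxies until one succeeds."""
    last_error = None
    for _, proxies in zip(range(attempts), proxy_cycle(pool)):
        try:
            return fetch(url, proxies)
        except Exception as error:  # a real scraper would catch narrower errors
            last_error = error
    raise RuntimeError(f"all {attempts} proxy attempts failed") from last_error
```

With requests, the fetch argument could be lambda url, proxies: requests.get(url, proxies=proxies, timeout=10).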

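Practice Task: Scraping a Dynamic E-Commerce Page

Goal

Use Selenium to scrape product names and prices from the dynamically loaded section of the JD.com home page (https://www.jd.com/) and save them to a file.

Code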

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import json
from datetime import datetime
import time

class JDPriceTracker:
    def __init__(self, filename="jd_prices.json"):
        self.prices = {}
        self.filename = filename
        self.load_prices()

    def fetch_prices(self, url):
        """Scrape product prices from the JD.com home page."""
        # Start a fresh browser for every fetch. The session is closed in the
        # finally block below, so a driver created once in __init__ would be
        # unusable the second time this method is called.
        service = Service(executable_path="path/to/chromedriver")  # replace with your path
        driver = webdriver.Chrome(service=service)

        try:
            driver.get(url)
            time.sleep(3)  # wait for the page to load

            # Locate product elements (JD renders dynamic products inside .gl-i-wrap)
            products = driver.find_elements(By.CLASS_NAME, "gl-i-wrap")
            current_prices = {}
            for product in products[:5]:  # only take the first 5
                try:
                    name = product.find_element(By.CSS_SELECTOR, ".p-name a").get_attribute("title")
                    price = product.find_element(By.CSS_SELECTOR, ".p-price i").text
                    current_prices[name] = float(price)
                    print(f"Fetched: {name} - ¥{price}")
                except Exception as e:
                    print(f"Failed to parse a product: {e}")
            self.update_prices(current_prices)
        except Exception as e:
            print(f"Scrape failed: {e}")
        finally:
            driver.quit()

    def update_prices(self, current_prices):
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        for name, price in current_prices.items():
            if name not in self.prices:
                self.prices[name] = [{"time": now, "price": price}]
            elif self.prices[name][-1]["price"] != price:
                self.prices[name].append({"time": now, "price": price})
        self.save_prices()

    def view_prices(self):
        if not self.prices:
            print("No records yet!")
        else:
            print("\nPrice history:")
            for name, history in self.prices.items():
                print(f"{name}:")
                for entry in history:
                    print(f"  {entry['time']} - ¥{entry['price']}")

    def save_prices(self):
        with open(self.filename, "w", encoding="utf-8") as f:
            json.dump(self.prices, f, ensure_ascii=False, indent=2)

    def load_prices(self):
        try:
            with open(self.filename, "r", encoding="utf-8") as f:
                self.prices = json.load(f)
        except FileNotFoundError:
            self.prices = {}

def main():
    tracker = JDPriceTracker()
    url = "https://www.jd.com/"
    
    while True:
        print("\n=== JD Price Monitor ===")
        print("1. Fetch current prices")
        print("2. View records")
        print("3. Exit")

        choice = input("Choose an option (1-3): ")

        if choice == "1":
            tracker.fetch_prices(url)

        elif choice == "2":
            tracker.view_prices()

        elif choice == "3":
            print("Thanks for using the tracker!")
            break

        else:
            print("Invalid choice, please enter 1-3!")

if __name__ == "__main__":
    main()

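Code Walkthrough

Selenium setup:
- Service specifies the ChromeDriver path.
- time.sleep(3) gives the page's dynamic content time to load. A fixed pause is simple but can be too short on a slow connection.
Locating elements:
- .gl-i-wrap is the container for each JD product.
- .p-name a supplies the title and .p-price i supplies the price. JD may change these class names, so confirm them in the browser's developer tools before running.
Error handling:
- A parse failure on a single product is caught so the program does not crash.
- finally ensures the browser is closed.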
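
One frequent cause of a per-product parse failure is the float(price) call: it raises ValueError when the price text is empty (not rendered yet) or carries a currency symbol or thousands separator. The sketch below normalizes such text before conversion. The helper name parse_price and the sample inputs are assumptions for illustration, not formats confirmed from JD's pages.

```python
import re


def parse_price(text):
    """Convert scraped price text such as '¥7,999.00' to a float.

    Returns None when the text contains no number.
    """
    cleaned = text.replace(",", "").strip()
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


print(parse_price("7999.00"))    # 7999.0
print(parse_price("¥7,999.00"))  # 7999.0
print(parse_price(""))           # None
```

Returning None instead of raising lets the caller decide whether to skip the product or retry after a longer wait.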

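Hands-On Practice

Set up the environment:
- Install selenium: pip install selenium.
- Download ChromeDriver and replace the path in the code.
Run the program:
- Run the code and choose 1 to scrape the JD.com home page.
- Choose 2 to view the records.
- Choose 3 to exit.
Check the results:
- Open jd_prices.json and confirm the data.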

Expected Output

Fetching (example only; actual results change with JD's listings):

Fetched: Apple iPhone 14 Pro Max - ¥7999.0
Fetched: 华为 Mate 60 Pro - ¥6499.0
...
Viewing records:

Price history:
Apple iPhone 14 Pro Max:
  2025-02-25 22:00:00 - ¥7999.0
华为 Mate 60 Pro:
  2025-02-25 22:00:00 - ¥6499.0

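Challenges

Multi-page scraping: scroll the page to load more products.
Proxy support: add a proxy pool to cope with IP rate limiting.
Price-change alerts: print a notification when a price changes.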
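
As a starting point for the first and third challenges, here is a rough sketch under stated assumptions. scroll_until_stable assumes the page loads more items when scrolled to the bottom and only needs a driver object with execute_script. price_changes assumes the history layout used by jd_prices.json ({name: [{"time": ..., "price": ...}, ...]}). Both function names are hypothetical.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds


def price_changes(history, current_prices):
    """Return {name: (old_price, new_price)} for items whose price changed."""
    changes = {}
    for name, price in current_prices.items():
        records = history.get(name)
        if records and records[-1]["price"] != price:
            changes[name] = (records[-1]["price"], price)
    return changes
```

A tracker could call price_changes(self.prices, current_prices) before saving and print one line per entry in the result.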
