[Web Scraping] 7.2. Scraping JavaScript-Rendered Pages: Selenium in Practice


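The target site is https://spa2.scrape.center. All of its content is rendered through Ajax, and inspecting the XHR requests shows that the request URLs carry a token parameter. So instead of calling the API directly, we use Selenium, a browser automation tool, to scrape the JavaScript-rendered pages.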
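Since the listing pages follow the pattern `/page/{page}`, the URLs to visit can be generated up front, and the browser can run without a visible window. The following is only a minimal sketch under stated assumptions: Selenium 4 is installed and a matching Chrome driver is available, and `build_page_urls` and `make_headless_browser` are hypothetical helper names that do not appear in the script below. The default of 10 pages mirrors the `range(1, 11)` used in the script.

```python
from typing import List

BASE_URL = "https://spa2.scrape.center/page/{page}"


def build_page_urls(first: int = 1, last: int = 10) -> List[str]:
    """Return the listing-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(page=page) for page in range(first, last + 1)]


def make_headless_browser():
    """Start Chrome without a visible window (hypothetical helper)."""
    # Imported lazily so build_page_urls works even without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Assumption: a recent Chrome build; older builds use plain "--headless".
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    for page_url in build_page_urls(1, 3):
        print(page_url)
```

Headless mode is optional. While debugging selectors, running with a visible window makes it easier to see what the page looks like when a wait times out.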

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
import logging
from selenium.webdriver.support import expected_conditions
import re
import json
from os import makedirs
from os.path import exists

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# Base URL of the listing pages
url = "https://spa2.scrape.center/page/{page}"
# Start the browser
browser = webdriver.Chrome()
# Explicit wait with a 10-second timeout
wait = WebDriverWait(browser, 10)
# Detail-page URLs collected from the listing pages
book_url = []

# Output directory
RESULTS_DIR = 'results'
makedirs(RESULTS_DIR, exist_ok=True)
# Generic scraper exception
class ScraperError(Exception):
    pass

# Collect the detail-link elements on one listing page
def PageDetail(URL):
    browser.get(URL)
    try:
        all_element = wait.until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, ".el-card .name")))
        return all_element
    except TimeoutException:
        logging.warning("Timed out on %s while waiting for the detail links", URL)
        # Return an empty list so the caller can still iterate over the result
        return []

# Visit each detail page and extract its fields
def GetDetail(book_list):
    for book in book_list:
        # The try block sits inside the loop so that one timeout skips a single page, not all remaining ones
        try:
            browser.get(book)
            URL = browser.current_url
            book_name = wait.until(expected_conditions.presence_of_element_located((By.CLASS_NAME, "m-b-sm"))).text
            categories = [elements.text for elements in wait.until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, ".categories button span")))]
            # data-v-f7128f80 is a Vue scoped-style attribute and may change when the site is rebuilt
            content = wait.until(expected_conditions.presence_of_element_located((By.CSS_SELECTOR, ".item .drama p[data-v-f7128f80]"))).text
            detail = {
                "URL": URL,
                "book_name": book_name,
                "categories": categories,
                "content": content
            }

            SaveDetail(detail)
        except TimeoutException:
            logging.warning("Timed out on %s while waiting for the book detail", browser.current_url)

# Save one record as a JSON file
def SaveDetail(detail):
    # Replace characters that are not allowed in file names
    cleaned_name = re.sub(r'[\\/:*?"<>|]', '_', detail.get("book_name"))
    detail["book_name"] = cleaned_name
    data_path = f'{RESULTS_DIR}/{cleaned_name}.json'
    logging.info("Saving book %s...", cleaned_name)
    try:
        # The with block closes the file even if json.dump fails
        with open(data_path, 'w', encoding='utf-8') as f:
            json.dump(detail, f, ensure_ascii=False, indent=2)
        logging.info("Saved book %s", cleaned_name)
    except OSError:
        logging.exception("Failed to save book %s", cleaned_name)

# Entry point
def main():
    try:
        for page in range(1, 11):
            for each_page in PageDetail(url.format(page=page)):
                book_url.append(each_page.get_attribute("href"))
        GetDetail(book_url)
    except ScraperError:
        logging.error("An unexpected error occurred")
    finally:
        # quit() ends the whole WebDriver session; close() only closes the current window
        browser.quit()

if __name__ == "__main__":
    main()
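
The saving step is the one part of the script that can be tried out without a browser. Below is a minimal sketch of it in isolation, assuming hypothetical helper names `safe_filename` and `save_detail` (they are not part of the script above) and a made-up sample record. It writes into a temporary directory so that nothing is left behind.

```python
import json
import re
import tempfile
from pathlib import Path

# Characters that are not allowed in Windows file names.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(name: str) -> str:
    """Replace characters that are invalid in file names with underscores."""
    return INVALID_CHARS.sub("_", name).strip()


def save_detail(detail: dict, results_dir: Path) -> Path:
    """Write one record to <results_dir>/<sanitized book_name>.json."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{safe_filename(detail['book_name'])}.json"
    # The context manager closes the file even if json.dump raises.
    with path.open("w", encoding="utf-8") as f:
        json.dump(detail, f, ensure_ascii=False, indent=2)
    return path


if __name__ == "__main__":
    # Made-up record for illustration only; not real scraped data.
    sample = {
        "URL": "https://example.invalid/detail/1",
        "book_name": "A/B: C?",
        "categories": ["demo"],
        "content": "placeholder text",
    }
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_detail(sample, Path(tmp))
        print(saved.name)
```

Keeping the file name cleanup in its own function makes it easy to check on a few awkward titles before running a full crawl.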