[Web Scraping] 7.2. Scraping JavaScript-Rendered Pages: Selenium in Practice

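The target site is https://spa2.scrape.center. All of its content is rendered through Ajax, and inspecting the XHR requests shows that the request URL carries a token parameter, so we use Selenium, a browser automation tool, to scrape the JavaScript-rendered pages.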

import json
import logging
import re
from os import makedirs

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# URL template for the list pages
url = "https://spa2.scrape.center/page/{page}"
# Start the browser through Selenium
browser = webdriver.Chrome()
# Explicit wait with a 10-second timeout
wait = WebDriverWait(browser, 10)
# Detail-page URLs collected from the list pages
book_url = list()

# Output directory, created if it does not exist yet
RESULTS_DIR = 'results'
makedirs(RESULTS_DIR, exist_ok=True)

# Custom exception for scraper-specific errors
class ScraperError(Exception):
    pass

# Collect the link elements of every book on one list page
def PageDetail(URL):
    browser.get(URL)
    try:
        all_element = wait.until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, ".el-card .name")))
        return all_element
    except TimeoutException:
        logging.error("Timed out in %s while finding the href", URL)
        # Return an empty list so the caller can still iterate over the result
        return []

# Scrape the details of each book
def GetDetail(book_list):
    for book in book_list:
        # Handle timeouts per book so that one slow page does not abort the rest
        try:
            browser.get(book)
            URL = browser.current_url
            book_name = wait.until(expected_conditions.presence_of_element_located((By.CLASS_NAME, "m-b-sm"))).text
            categories = [elements.text for elements in wait.until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, ".categories button span")))]
            # The data-v-* attribute is generated by the Vue build and may change,
            # so it is left out of the selector
            content = wait.until(expected_conditions.presence_of_element_located((By.CSS_SELECTOR, ".item .drama p"))).text
            detail = {
                "URL": URL,
                "book_name": book_name,
                "categories": categories,
                "content": content
            }

            SaveDetail(detail)
        except TimeoutException:
            logging.error("Timed out in %s while finding the book detail", browser.current_url)

# Save one record as a JSON file
def SaveDetail(detail):
    # Replace characters that are not allowed in file names
    cleaned_name = re.sub(r'[\\/:*?"<>|]', '_', detail.get("book_name"))
    detail["book_name"] = cleaned_name
    data_path = f'{RESULTS_DIR}/{cleaned_name}.json'
    logging.info("Saving Book %s...", cleaned_name)
    try:
        # The with block closes the file even if json.dump fails
        with open(data_path, 'w', encoding='utf-8') as f:
            json.dump(detail, f, ensure_ascii=False, indent=2)
        logging.info("Saving Book %s over", cleaned_name)
    except OSError:
        logging.error("An error occurred while saving the details of %s", cleaned_name)

# Entry point
def main():
    try:
        for page in range(1, 11):
            # "or []" guards against PageDetail returning nothing after a timeout
            for each_page in PageDetail(url.format(page=page)) or []:
                book_url.append(each_page.get_attribute("href"))
        GetDetail(book_url)
    except Exception:
        logging.exception("An unexpected error occurred")
    finally:
        # quit() ends the WebDriver session and closes every window
        browser.quit()

if __name__ == "__main__":
    main()
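
The saving step can be tried on its own, without starting a browser. The following is a minimal sketch under stated assumptions: `sanitize_filename` and `save_detail_to` are hypothetical helpers that are not part of the script above, the target directory is passed in as a parameter, and the sample record (including its URL) is made up. It cleans the title in the same way, writes the JSON file into a temporary folder, and reads it back.

```python
import json
import re
import tempfile
from pathlib import Path


def sanitize_filename(name):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', name)


def save_detail_to(detail, directory):
    """Write one record to <directory>/<sanitized book_name>.json and return the path.

    Hypothetical helper for illustration only; the directory is a parameter
    so the function can be run against a temporary folder.
    """
    path = Path(directory) / f"{sanitize_filename(detail['book_name'])}.json"
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        json.dump(detail, f, ensure_ascii=False, indent=2)
    return path


# Made-up sample record, shaped like the dictionaries built in GetDetail.
sample = {
    "URL": "https://spa2.scrape.center/detail/example",  # hypothetical URL
    "book_name": "A/B: Test?",
    "categories": ["剧情", "爱情"],
    "content": "示例简介",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_detail_to(sample, tmp_dir)
    saved_name = saved_path.name
    with open(saved_path, encoding='utf-8') as f:
        loaded = json.load(f)
```

Passing the directory in, rather than reading a module-level constant, is only a convenience for this sketch: it lets the round trip run in a throwaway folder and leaves the real `results` directory untouched.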