【爬虫】7.2. JavaScript动态渲染界面爬取-Selenium实战

JavaScript动态渲染界面爬取-Selenium实战

爬取的网页为:https://spa2.scrape.center,里面的内容都是通过Ajax渲染出来的,在分析xhr时候发现url里面有token参数,所有我们使用selenium自动化工具来爬取JavaScript渲染的界面。

python 复制代码
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
import logging
from selenium.webdriver.support import expected_conditions
import re
import json
from os import makedirs
from os.path import exists

# 配置日志
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# 基本url
url = "https://spa2.scrape.center/page/{page}"
# selenium初始化
browser = webdriver.Chrome()
# 显式等待初始化
wait = WebDriverWait(browser, 10)
book_url = list()

# 目录设置
RESULTS_DIR = 'results'
exists(RESULTS_DIR) or makedirs(RESULTS_DIR)
# 任意异常
class ScraperError(Exception):
    pass

# 获取书本URL
def PageDetail(URL):
    browser.get(URL)
    try:
        all_element = wait.until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, ".el-card .name")))
        return all_element
    except TimeoutException:
        logging.info("Time error happen in %s while finding the href", URL)

# 获取书本信息
def GetDetail(book_list):
    try:
        for book in book_list:
            browser.get(book)
            URL = browser.current_url
            book_name = wait.until(expected_conditions.presence_of_element_located((By.CLASS_NAME, "m-b-sm"))).text
            categories = [elements.text for elements in wait.until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, ".categories button span")))]
            content = wait.until(expected_conditions.presence_of_element_located((By.CSS_SELECTOR, ".item .drama p[data-v-f7128f80]"))).text
            detail = {
                "URL": URL,
                "book_name": book_name,
                "categories": categories,
                "content": content
            }

            SaveDetail(detail)
    except TimeoutException:
        logging.info("Time error happen in %s while finding the book detail", browser.current_url)

# JSON文件保存
def SaveDetail(detail):
    cleaned_name = re.sub(r'[\/:*?"<>|]', '_', detail.get("book_name"))
    detail["book_name"] = cleaned_name
    data_path = f'{RESULTS_DIR}/{cleaned_name}.json'
    logging.info("Saving Book %s...", cleaned_name)
    try:
        json.dump(detail, open(data_path, 'w', encoding='utf-8'),
                  ensure_ascii=False, indent=2)
        logging.info("Saving Book %s over", cleaned_name)
    except ScraperError as e:
        logging.info("Some error happen in %s while saving the book detail", cleaned_name)

# 主函数
def main():
    try:
        for page in range(1, 11):
            for each_page in PageDetail(url.format(page= page)):
                book_url.append(each_page.get_attribute("href"))
        GetDetail(book_url)
    except ScraperError as e:
        logging.info("An abnormal position has occurred")
    finally:
        browser.close()

if __name__ == "__main__":
    main()
相关推荐
失落的多巴胺1 小时前
使用deepseek制作“喝什么奶茶”随机抽签小网页
javascript·css·css3·html5
DataGear1 小时前
如何在DataGear 5.4.1 中快速制作SQL服务端分页的数据表格看板
javascript·数据库·sql·信息可视化·数据分析·echarts·数据可视化
影子信息1 小时前
vue 前端动态导入文件 import.meta.glob
前端·javascript·vue.js
样子20181 小时前
Vue3 之dialog弹框简单制作
前端·javascript·vue.js·前端框架·ecmascript
kevin_水滴石穿1 小时前
Vue 中报错 TypeError: crypto$2.getRandomValues is not a function
前端·javascript·vue.js
翻滚吧键盘1 小时前
vue文本插值
javascript·vue.js·ecmascript
泡泡以安2 小时前
安卓高版本HTTPS抓包:终极解决方案
爬虫·https·安卓逆向·安卓抓包
海的诗篇_3 小时前
前端开发面试题总结-原生小程序部分
前端·javascript·面试·小程序·vue·html
q567315234 小时前
Java Selenium反爬虫技术方案
java·爬虫·selenium
黄瓜沾糖吃5 小时前
大佬们指点一下倒计时有什么问题吗?
前端·javascript