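Web Scraping: Batch-Downloading Standard Text Files from Dynamically Rendered Pages with Python and Selenium

For work, I recently needed to collect a range of information security standards documents. These documents are usually published on the official website, where they are available for the public to download and consult.

On these pages, the standard text files are typically offered as Word (.doc/.docx) or PDF downloads. Because there are so many files, clicking through and downloading them one by one is slow and makes it easy to miss some, so I decided to automate the batch download with a scraper script.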

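I. Workflow planning and challenge analysis

Download workflow:

List page: the link below lists the titles of all requests-for-comment notices together with their detail-page links. 全国信息安全标准化技术委员会 (National Information Security Standardization Technical Committee): https://www.tc260.org.cn/front/bzzqyjList.html?start=0&length=10
Detail page: clicking a notice opens its detail page, which contains the download link for the "标准文本" (standard text) file.
File download: clicking the download link retrieves the file in Word or PDF format.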
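
The list URL above carries `start` and `length` query parameters. Here is a minimal sketch of generating the URLs for several list pages, assuming `start` is a zero-based record offset and `length` is the page size. The site's actual paging behaviour is not confirmed here, and `build_list_urls` is a hypothetical helper introduced only for illustration.

```python
from urllib.parse import urlencode

# Base list URL taken from the article; the paging semantics are an assumption.
LIST_PAGE = "https://www.tc260.org.cn/front/bzzqyjList.html"

def build_list_urls(pages, length=10):
    """Return one list-page URL per page, assuming `start` is a record offset."""
    return [
        f"{LIST_PAGE}?{urlencode({'start': page * length, 'length': length})}"
        for page in range(pages)
    ]

for url in build_list_urls(3):
    print(url)
```

If the assumption holds, each generated URL could be fetched in turn to collect more than the first ten notices.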

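Key points:

Dynamic rendering: the download links for the standard text files are not written directly into the static HTML; they are generated dynamically by the page's JavaScript.
Multiple file formats: the files come in several formats (.doc, .docx, .pdf), all of which need to be handled.
File naming: each downloaded file should be named after the core standard name extracted from the notice title, giving consistent file names that are easy to manage.
Stability and polite crawling: requests must be spaced out at a reasonable interval to avoid being blocked for requesting too fast.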

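II. Technology choices

Requests + BeautifulSoup: fetch the static HTML of the list page and parse out the notice titles and detail-page links.
Selenium: drive a browser so that the detail page's JavaScript is fully rendered before the file download links are read.
webdriver-manager: manage the Chrome driver automatically, which simplifies environment setup.
Python standard library: file operations, regular expressions for cleaning file names, and so on.

III. Implementation

1. Fetching the list page

Request the list page with requests, parse the HTML with BeautifulSoup, and locate every link and title containing "征求意见稿征求意见的通知" (notice soliciting comments on a draft for comment) to build the list of pages to crawl.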

resp = requests.get(LIST_URL, headers=HEADERS)
soup = BeautifulSoup(resp.text, "html.parser")
# Filter the matching <a> tags to get each title and detail-page link

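2. Rendering the detail page with Selenium

The file links on the detail page are generated by JavaScript, so a plain requests call cannot retrieve them. Selenium is used to drive a browser and open the detail page:

Start a headless Chrome browser
Load the detail-page URL
Wait a few seconds for the JavaScript to finish running
Read the fully rendered HTML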

driver.get(detail_url)
time.sleep(5)  # wait for the JavaScript to render
html = driver.page_source
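
A fixed five-second sleep either wastes time or is too short on a slow connection. Selenium ships `WebDriverWait` for polling until a condition holds; the sketch below shows the same polling idea in a library-independent form so that it can be run without a browser. `wait_until` is a hypothetical helper, and the CSS selector in the comment is an assumption about the page, not something taken from the site.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# With a real Selenium driver the condition might look like this (not run here,
# and the selector is only a guess at what the rendered links look like):
#   links = wait_until(lambda: driver.find_elements("css selector", "a[href$='.pdf']"))

# Stand-in demo: the "page" becomes ready on the third check.
checks = {"count": 0}

def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(page_ready, timeout=2, poll=0.01)
print(f"ready after {checks['count']} checks")
```

Polling returns as soon as the links appear and fails loudly when they never do, instead of silently parsing a half-rendered page.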

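3. Parsing the file download links

Parse the rendered HTML with BeautifulSoup, extract every .doc, .docx, and .pdf link together with its file name, and keep only the files related to "标准文本" (standard text).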

files = []
for a in soup.find_all("a", href=True):
    if a['href'].endswith(('.doc', '.docx', '.pdf')) and "标准文本" in a.text:
        # Record the file name and download link
        files.append((a.get_text(strip=True), a['href']))
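
A plain `endswith` check misses links with an upper-case extension or a trailing query string, and relative links still need to be made absolute. The sketch below handles both with the standard library, assuming the links may be relative and may carry query parameters; `file_extension`, `resolve_file_link`, and the example.org URLs are hypothetical names used only for illustration.

```python
import os
from urllib.parse import urljoin, urlparse

ALLOWED_EXTS = {".doc", ".docx", ".pdf"}

def file_extension(url):
    """Return the lowercased extension of the URL path, ignoring query and fragment."""
    return os.path.splitext(urlparse(url).path)[1].lower()

def resolve_file_link(page_url, href):
    """Return the absolute URL if href points at an allowed file type, else None."""
    absolute = urljoin(page_url, href)
    return absolute if file_extension(absolute) in ALLOWED_EXTS else None

page = "https://example.org/front/detail.html?id=1"  # hypothetical detail page
print(resolve_file_link(page, "/upload/standard.DOCX?v=2"))
print(resolve_file_link(page, "notice.html"))
```

`urljoin` resolves the link against the page it was found on, which also covers relative paths that do not start with a slash.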

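4. Normalizing file names

From the full notice title, use a regular expression to extract the standard name (the text inside 《》) and the "征求意见稿" (draft for comment) keyword, build a consistent file name, and replace characters that are illegal in file names.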

def simplify_title(full_title):
    match = re.search(r'《([^》]+)》', full_title)
    if match:
        name = f"《{match.group(1)}》征求意见稿"
    else:
        name = full_title
    return re.sub(r'[\\/*?:"<>|]', "_", name)
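
Because the file name is derived from the notice title alone, two files attached to the same notice with the same extension would end up with the same name, and the second would overwrite the first. One possible guard is sketched below, assuming a numeric suffix is acceptable; `unique_name` and the sample title are hypothetical and not part of the original script.

```python
import os

def unique_name(filename, taken):
    """Return filename, or filename with a _2, _3, ... suffix if already in `taken`."""
    stem, ext = os.path.splitext(filename)
    candidate, n = filename, 1
    while candidate in taken:
        n += 1
        candidate = f"{stem}_{n}{ext}"
    taken.add(candidate)  # remember the name so later calls avoid it
    return candidate

taken = set()
print(unique_name("《示例标准》征求意见稿.pdf", taken))  # first use keeps the name
print(unique_name("《示例标准》征求意见稿.pdf", taken))  # second use gets a _2 suffix
```

The `taken` set could be seeded with `os.listdir` of the download directory so that earlier runs are respected as well.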

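5. Downloading files and logging

Download each file with requests and save it to the target directory.
Print every log message and also write it to a log file so that the run is easy to trace.
After each file is downloaded, wait 5 seconds to reduce the load on the server.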
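
The pause-between-downloads idea can be combined with a simple retry. The sketch below assumes a fixed delay and a small retry count are good enough; `download_all` is a hypothetical helper, and the `fetch` and `sleep` parameters are injected so that the loop can be exercised without network access or real waiting.

```python
import time

def download_all(items, fetch, delay=5, retries=2, sleep=time.sleep):
    """Call fetch(url, path) for each (url, path) pair, pausing between files.

    Returns the list of URLs that still failed after all retries.
    """
    failed = []
    for url, path in items:
        for attempt in range(retries + 1):
            try:
                fetch(url, path)
                break
            except Exception as exc:
                print(f"attempt {attempt + 1} failed for {url}: {exc}")
                sleep(delay)  # back off before retrying
        else:
            failed.append(url)  # every attempt raised
        sleep(delay)  # polite pause before the next file
    return failed

# Stand-in demo with a fake fetch function and no real waiting.
def fake_fetch(url, path):
    if "missing" in url:
        raise IOError("not found")

items = [("https://example.org/a.pdf", "a.pdf"), ("https://example.org/missing.pdf", "m.pdf")]
print(download_all(items, fake_fetch, delay=0, retries=1, sleep=lambda s: None))
```

In a real run, `fetch` would be a function that downloads with requests and writes the file, and the defaults for `delay` and `sleep` would apply.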

IV. Complete code and results

import os
import time
import re
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

BASE_URL = "put the site base URL here"
LIST_URL = "put the list page URL here"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

LOG_FILE = os.path.join("downloads", "download_log.txt")

def log_print(msg):
    print(msg)
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(msg + "\n")

def sanitize_filename(name):
    return re.sub(r'[\\/*?:"<>|]', "_", name)

def simplify_title(full_title):
    match = re.search(r'《([^》]+)》', full_title)
    if not match:
        return sanitize_filename(full_title)
    standard_name = f"《{match.group(1)}》"
    if "征求意见稿" in full_title:
        return standard_name + "征求意见稿"
    else:
        return standard_name

def get_list_page():
    resp = requests.get(LIST_URL, headers=HEADERS)
    resp.raise_for_status()
    return resp.text

def parse_list_page(html):
    soup = BeautifulSoup(html, "html.parser")
    notices = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        href = a['href']
        if "征求意见稿征求意见的通知" in text and href.startswith("/front/bzzqyjDetail.html"):
            notices.append({
                "title": text,
                "detail_url": BASE_URL + href
            })
    return notices

def fetch_detail_page_selenium(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get(url)
        time.sleep(5)
        html = driver.page_source
    finally:
        driver.quit()
    return html

def parse_detail_files(html):
    soup = BeautifulSoup(html, "html.parser")
    files = []
    for a in soup.find_all("a", href=True):
        href = a['href']
        if href.endswith((".doc", ".docx", ".pdf")):
            file_name = a.get_text(strip=True)
            file_url = href if href.startswith("http") else BASE_URL + href
            files.append((file_name, file_url))
    return files

def download_file(url, filename):
    log_print(f"Downloading: {filename}  URL: {url}")
    # A timeout keeps one stalled request from hanging the whole run
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    with open(filename, "wb") as f:
        f.write(resp.content)
    log_print(f"Download finished: {filename}")

def main():
    os.makedirs("downloads", exist_ok=True)
    # Start a fresh log file for this run
    with open(LOG_FILE, "w", encoding="utf-8") as f:
        f.write("Download log\n\n")

    list_html = get_list_page()
    notices = parse_list_page(list_html)
    log_print(f"Found {len(notices)} notices")

    for notice in notices:
        log_print(f"Processing notice: {notice['title']}")
        detail_html = fetch_detail_page_selenium(notice['detail_url'])
        files = parse_detail_files(detail_html)

        # Keep only links whose text mentions "标准文本" (standard text)
        std_files = [f for f in files if "标准文本" in f[0]]

        if not std_files:
            log_print("No standard text file found, skipping")
            continue

        for file_name, file_url in std_files:
            simple_name = simplify_title(notice['title'])
            ext = os.path.splitext(file_url)[1]
            safe_name = sanitize_filename(simple_name) + ext
            save_path = os.path.join("downloads", safe_name)
            try:
                download_file(file_url, save_path)
                time.sleep(5)  # pause between downloads to go easy on the server
            except Exception as e:
                log_print(f"Download failed: {file_name}, error: {e}")

if __name__ == "__main__":
    main()

Running the script produces the following result.