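Python: Web Scraping Basics (Scraping "Dream of the Red Chamber")

Novel Scraper Project Documentation

A Python scraper project for downloading novels from the Shicimingju site (www.shicimingju.com). It uses "Dream of the Red Chamber" (《红楼梦》) as the example and shows how to scrape a complete novel.

Features

Scrapes the names of all chapters
Collects the URL of each chapter
Downloads each chapter and saves it to its own text file
Creates the output directory automatically
Includes basic error handling and a delay between requests

Requirements

Python 3.x
Dependencies:
- requests
- beautifulsoup4
- logging (part of the Python standard library, no installation needed)

Installing dependencies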

pip install requests beautifulsoup4

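Project structure

The project consists of three core functions:

extract_chapter_names(source): extracts all chapter names
extract_list_url(source, domain): extracts the URLs of all chapters
extract_chapter_content(url_list, chapter_names, headers, folder_name): downloads and saves the chapter contents

Implementation steps

1. Basic setup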

import os
import requests
import logging
import time
from bs4 import BeautifulSoup

# Set the request headers (a browser-like User-Agent)
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36"
}
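
The header can also be attached to a requests.Session once, so that it does not have to be passed on every call. The following is a minimal sketch of one possible setup, not the project's own code: build_session, the example User-Agent string, and the log format are all illustrative choices.

```python
import logging

import requests


def build_session(user_agent):
    """Return a requests.Session that sends `user_agent` on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


# Hypothetical logging setup: timestamp plus level, INFO and above.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

session = build_session("Mozilla/5.0 (example scraper)")
logging.info("User-Agent in use: %s", session.headers["User-Agent"])
```

With a session configured this way, session.get(url) sends the header automatically, and the headers argument becomes optional.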

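2. Extracting chapter names

The extract_chapter_names function does the following:

Finds the div element that contains the chapter list
Extracts the names of all chapters
Returns the list of chapter names

extract_chapter_names in detail

Overview

extract_chapter_names extracts the names of all chapters from the parsed table-of-contents page. It takes a BeautifulSoup object as its only argument and returns a list of chapter names.

Function definition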

def extract_chapter_names(source):
    page_list = source.find("div", class_="ContL")
    list_chapter = page_list.find("div", "list")
    a_list = list_chapter.findAll("a")
    chapter_names = []
    for a in a_list:
        print(a.get_text())
        chapter_names.append(a.get_text())
    return chapter_names

Step-by-step walkthrough

1. Locate the main content area

page_list = source.find("div", class_="ContL")

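find() returns the first div element whose class is "ContL"
This div contains the novel's full chapter list
class_="ContL" specifies the CSS class to match (the trailing underscore is needed because class is a reserved word in Python)

2. Locate the chapter list area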

list_chapter = page_list.find("div", "list")

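Searches inside the main content area for the div whose class is "list" (a string passed as the second positional argument of find() is treated as a CSS class)
This div directly contains the links to all chapters

3. Collect all chapter links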

a_list = list_chapter.findAll("a")

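findAll() returns every <a> tag (findAll is the legacy alias of find_all(), which is the preferred name in current BeautifulSoup releases)
Each <a> tag is one chapter link
The result is a list of all matching elements

4. Extract the chapter names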

chapter_names = []
for a in a_list:
    print(a.get_text())
    chapter_names.append(a.get_text())

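Creates an empty list, chapter_names, to hold the names
Iterates over every <a> tag
Calls get_text() to read the link text, which is the chapter name
Prints each name for debugging and progress feedback
Appends the name to the list

5. Return the result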

return chapter_names

Returns the list containing all chapter names

Sample output

The returned list looks like this:

[
    "第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀",
    "第二回 贾夫人仙逝扬州城 冷子兴演说荣国府",
    # ... more chapter names
]

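Notes

The function depends on the site's specific HTML structure; if the site is redesigned, the code may need updating
Make sure the source argument is a valid BeautifulSoup object
The page must be decoded as UTF-8, otherwise the chapter names may come out garbled
The printed output helps with monitoring progress and debugging

Possible improvements

Add error handling for missing elements (find() returns None when nothing matches, and the next call then raises AttributeError)
Clean the data by stripping unnecessary whitespace
Extract the chapter number
Replace the plain print output with a progress bar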
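
The first two improvements can be sketched as follows. This is only an illustration under stated assumptions: extract_chapter_names_safe is a hypothetical name, and sample_html is made-up markup that mimics the structure described above rather than the real page.

```python
import re

from bs4 import BeautifulSoup


def extract_chapter_names_safe(source):
    """Return cleaned chapter names, or an empty list if the layout is missing."""
    page_list = source.find("div", class_="ContL")
    if page_list is None:
        return []
    list_chapter = page_list.find("div", class_="list")
    if list_chapter is None:
        return []
    names = []
    for a in list_chapter.find_all("a"):
        # Collapse runs of whitespace (including \xa0) into single spaces.
        name = re.sub(r"\s+", " ", a.get_text()).strip()
        if name:
            names.append(name)
    return names


# Made-up HTML for illustration only.
sample_html = """
<div class="ContL"><div class="list">
  <a href="/book/hongloumeng/1.html">  第一回\xa0甄士隐梦幻识通灵  </a>
  <a href="/book/hongloumeng/2.html">第二回 贾夫人仙逝扬州城</a>
</div></div>
"""
names = extract_chapter_names_safe(BeautifulSoup(sample_html, "html.parser"))
empty = extract_chapter_names_safe(BeautifulSoup("<p>no list</p>", "html.parser"))
print(names)
print(empty)
```

Returning an empty list instead of raising keeps the caller simple; raising a descriptive exception would be an equally reasonable choice when a missing chapter list should stop the run.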

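3. Extracting chapter URLs

The extract_list_url function does the following:

Finds the link elements of all chapters
Builds the complete URL for each one
Returns the list of URLs

extract_list_url in detail

Overview

extract_list_url extracts the URLs of all chapters from the parsed page. It takes two arguments:

source: a BeautifulSoup object holding the parsed page
domain: the site's domain, used to build complete URLs

Function definition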

def extract_list_url(source, domain):
    page_list = source.find("div", class_="ContL")
    list_chapter = page_list.find("div", "list")
    a_list = list_chapter.findAll("a")
    href_list = []
    for a in a_list:
        href = a.get("href")
        everyUrl = domain + href
        href_list.append(everyUrl)
    return href_list

Step-by-step walkthrough

1. Locate the main content area

page_list = source.find("div", class_="ContL")

find() returns the first div element whose class is "ContL"
This div is the container that holds all chapter links
The lookup is the same one used in extract_chapter_names

2. Locate the chapter list area

list_chapter = page_list.find("div", "list")

Searches inside the main content area for the div whose class is "list"
This div holds the link information for all chapters

3. Collect all link elements

a_list = list_chapter.findAll("a")

findAll() returns every <a> tag
Each <a> tag carries the chapter's relative URL

4. Extract and build the URLs

href_list = []
for a in a_list:
    href = a.get("href")
    everyUrl = domain + href
    href_list.append(everyUrl)

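Creates an empty list, href_list, to hold the complete URLs
Iterates over every <a> tag
Calls get("href") to read the link's relative path
Concatenates the domain and the relative path into a complete URL
Appends the complete URL to the list

5. Return the result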

return href_list

Returns the list containing the complete URLs of all chapters

Sample output

The returned list looks like this:

[
    "https://www.shicimingju.com/book/hongloumeng/1.html",
    "https://www.shicimingju.com/book/hongloumeng/2.html",
    # ... more chapter URLs
]

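URL structure

A complete URL is made up of:

Domain: https://www.shicimingju.com
Relative path: /book/hongloumeng/<chapter number>.html
Combined: https://www.shicimingju.com/book/hongloumeng/<chapter number>.html

Notes

Make sure the domain argument has no trailing slash; the relative path already starts with "/", so a trailing slash would produce "//" in the combined URL
The function depends on the site's specific HTML structure and must be updated if the site is redesigned
Make sure the source argument is a valid BeautifulSoup object
Check that the combined path is correct when building URLs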
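
The trailing-slash issue can be avoided by letting the standard library resolve the path instead of concatenating strings. The sketch below swaps in urllib.parse.urljoin, which the original function does not use; the variable names are illustrative.

```python
from urllib.parse import urljoin

domain_with_slash = "https://www.shicimingju.com/"
href = "/book/hongloumeng/1.html"

# Plain concatenation keeps both slashes.
concatenated = domain_with_slash + href

# urljoin resolves the path against the base, with or without a trailing slash.
joined_a = urljoin(domain_with_slash, href)
joined_b = urljoin("https://www.shicimingju.com", href)

print(concatenated)
print(joined_a)
print(joined_b)
```

Both urljoin calls produce the same URL, so the caller no longer has to care how the domain string was written.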

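Possible improvements

Verify that each URL is reachable
Add error handling
Check the URL format
Fetch in parallel to improve throughput
Deduplicate URLs
Normalize URLs
Add logging to make debugging and monitoring easier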
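
Format checking and deduplication need only the standard library. The following is a small sketch under the assumption that a "valid" URL simply means an http or https scheme plus a host; is_http_url and clean_url_list are hypothetical helper names.

```python
from urllib.parse import urlparse


def is_http_url(url):
    """Return True if `url` has an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_url_list(urls):
    """Drop malformed URLs and duplicates while keeping the original order."""
    valid = [u for u in urls if is_http_url(u)]
    # dict preserves insertion order, so only the first occurrence is kept.
    return list(dict.fromkeys(valid))


urls = [
    "https://www.shicimingju.com/book/hongloumeng/1.html",
    "https://www.shicimingju.com/book/hongloumeng/2.html",
    "https://www.shicimingju.com/book/hongloumeng/1.html",  # duplicate
    "javascript:void(0)",  # not a page URL
]
cleaned = clean_url_list(urls)
print(cleaned)
```

Keeping the order matters here because the URL list is later matched to the chapter names by index; if URLs are filtered, the names list has to be filtered in the same way.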

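4. Downloading chapter content

The extract_chapter_content function does the following:

Creates the folder for the novel
Iterates over all chapter URLs
Downloads and parses each chapter
Saves each chapter to its own text file
Waits 5 seconds between requests to avoid hitting the site too often

extract_chapter_content in detail

Overview

extract_chapter_content is the core of the scraper: it downloads and saves the novel's text. It takes four arguments:

url_list: the list of all chapter URLs
chapter_names: the matching list of chapter names
headers: the HTTP request headers
folder_name: the name of the folder in which to save the novel

Function definition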

def extract_chapter_content(url_list, chapter_names, headers, folder_name):
    # Note: `session` is not a parameter; it is the module-level requests
    # session created in the __main__ block of the complete script.
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)

    for i, url in enumerate(url_list):
        response = session.get(url, headers=headers)
        if response.status_code != 200:
            logging.error(f"Request failed, status code: {response.status_code}")
        else:
            response.encoding = "utf-8"
            soup = BeautifulSoup(response.text, "html.parser")
            # Drill down to the div that holds the chapter text
            a = soup.find("div", class_="ContL")
            b = a.find("div", class_="contbox textinfor")
            c = b.find("div", class_="text p_pad")
            p_list = c.findAll("p")
            seen_texts = set()
            chapter_content = []
            for p in p_list:
                text = p.getText().replace('\xa0', ' ')  # replace non-breaking spaces
                if text not in seen_texts:
                    print(text)
                    chapter_content.append(text)
                    seen_texts.add(text)

            chapter_name = chapter_names[i]
            file_path = os.path.join(folder_name, f"{chapter_name}.txt")
            with open(file_path, "w+", encoding="utf-8") as file:
                file.write("\n".join(chapter_content))
            print(f"Finished chapter: {chapter_name}")
        time.sleep(5)  # wait between requests, even after a failure
        print(f"Processed {i + 1} of {len(url_list)} chapters")

Step-by-step walkthrough

1. Create the output directory

if not os.path.exists(folder_name):
    os.makedirs(folder_name)

Checks whether the output directory exists
Creates it if it does not
os.makedirs() can create nested directories in one call

2. Iterate over the chapter URLs

for i, url in enumerate(url_list):

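enumerate() yields the index and the URL together
The index is used to look up the matching chapter name
The URL is used to download the chapter

3. Send the HTTP request

response = session.get(url, headers=headers)
if response.status_code != 200:
    logging.error(f"Request failed, status code: {response.status_code}")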

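Sends the GET request through session, the module-level requests session created in the script's main block
Checks the response status code
Logs an error if the request failed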
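
The request step above only handles non-200 responses; a timeout or a dropped connection would still raise an exception and stop the loop. A possible wrapper is sketched below. fetch_html and the 10-second timeout are assumptions made for illustration, not part of the original script.

```python
import logging

import requests


def fetch_html(session, url, headers, timeout=10):
    """Return the page text, or None if the request fails."""
    try:
        response = session.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        logging.error("Request for %s failed: %s", url, exc)
        return None
    response.encoding = "utf-8"
    return response.text
```

A caller would then write something like `html = fetch_html(session, url, headers)` and skip the chapter when the result is None, which keeps one bad chapter from aborting the whole run.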

4. Parse the chapter content

response.encoding = "utf-8"
soup = BeautifulSoup(response.text, "html.parser")
a = soup.find("div", class_="ContL")
b = a.find("div", class_="contbox textinfor")
c = b.find("div", class_="text p_pad")
p_list = c.findAll("p")

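Sets the character encoding explicitly
Parses the HTML with BeautifulSoup
Drills down level by level to the div that holds the text
Collects all paragraph elements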
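
The chained find() calls fail with AttributeError as soon as one level is missing. As an alternative sketch (not the author's approach), a single CSS selector via select() reaches the same paragraphs and simply returns an empty list when the structure is absent. extract_paragraphs and sample_html are illustrative; the sample markup is made up to mirror the class names used above.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(soup):
    """Return the text of each <p> in the chapter body, or [] if not found."""
    # Descendant selector: a <p> inside div.text.p_pad inside div.ContL.
    p_list = soup.select("div.ContL div.text.p_pad p")
    return [p.get_text() for p in p_list]


# Made-up HTML for illustration only.
sample_html = """
<div class="ContL"><div class="contbox textinfor">
  <div class="text p_pad"><p>第一段</p><p>第二段</p></div>
</div></div>
"""
paragraphs = extract_paragraphs(BeautifulSoup(sample_html, "html.parser"))
missing = extract_paragraphs(BeautifulSoup("<div></div>", "html.parser"))
print(paragraphs)
print(missing)
```

The selector skips the intermediate "contbox textinfor" level because a descendant combinator matches at any depth, and it matches the two classes "text" and "p_pad" regardless of their order in the attribute.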

5. Deduplicate and clean the content

seen_texts = set()
chapter_content = []
for p in p_list:
    text = p.getText().replace('\xa0', ' ')
    if text not in seen_texts:
        print(text)
        chapter_content.append(text)
        seen_texts.add(text)

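Uses a set to drop repeated paragraphs (any paragraph whose text already appeared in the chapter is skipped, including legitimate repeats)
Replaces non-breaking spaces (\xa0) with ordinary spaces
Keeps the cleaned text
Prints the text for debugging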
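
Because the set-based check also removes a line that legitimately occurs twice in a chapter, a gentler variant is to drop only consecutive duplicates. The sketch below is an alternative under that assumption, not the original logic; clean_paragraphs is a hypothetical name, and unlike the original it also strips leading and trailing whitespace.

```python
def clean_paragraphs(texts):
    """Normalize non-breaking spaces and drop consecutive duplicates only."""
    cleaned = []
    for raw in texts:
        text = raw.replace("\xa0", " ").strip()
        if not text:
            continue  # skip empty paragraphs
        if cleaned and cleaned[-1] == text:
            continue  # skip a paragraph that repeats the previous one
        cleaned.append(text)
    return cleaned


sample = ["\xa0\xa0甲", "\xa0\xa0甲", "乙", "", "甲"]
result = clean_paragraphs(sample)
print(result)
```

In the sample, the immediate repeat of "甲" is dropped while its later reappearance is kept.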

6. Save the chapter content

chapter_name = chapter_names[i]
file_path = os.path.join(folder_name, f"{chapter_name}.txt")
with open(file_path, "w+", encoding="utf-8") as file:
    file.write("\n".join(chapter_content))

Looks up the matching chapter name
Builds the output file path
Writes the file in UTF-8
Joins the paragraphs with newlines
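
The chapter name is used directly as the file name, which breaks if a name ever contains a character such as "/" or ":". The following sketch shows one way to guard against that; safe_filename and save_chapter are hypothetical helpers, and the set of replaced characters is an assumption based on what common file systems reject.

```python
import os
import re
import tempfile


def safe_filename(name, replacement="_"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    return cleaned or "untitled"


def save_chapter(folder_name, chapter_name, paragraphs):
    """Write one chapter to a .txt file in folder_name and return its path."""
    os.makedirs(folder_name, exist_ok=True)
    file_path = os.path.join(folder_name, safe_filename(chapter_name) + ".txt")
    with open(file_path, "w", encoding="utf-8") as file:
        file.write("\n".join(paragraphs))
    return file_path


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = save_chapter(tmp, "第一回: 甄士隐/梦幻", ["第一段", "第二段"])
    saved_name = os.path.basename(path)
    with open(path, encoding="utf-8") as file:
        saved_text = file.read()
print(saved_name)
```

os.makedirs(..., exist_ok=True) also replaces the separate os.path.exists() check used in the original function.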

7. Delay

time.sleep(5)
print(f"Processed {i + 1} of {len(url_list)} chapters")

Waits 5 seconds so that requests are not sent too quickly
Prints progress information
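
The fixed 5-second pause could be made configurable, optionally with a small random component so that requests are not perfectly periodic. This is a sketch of that idea only; polite_sleep and its parameters are illustrative names, not part of the original script.

```python
import random
import time


def polite_sleep(base_delay=5.0, jitter=1.0):
    """Sleep for base_delay plus a random extra of up to `jitter` seconds."""
    delay = base_delay + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


# Short values so the demonstration finishes quickly.
waited = polite_sleep(base_delay=0.01, jitter=0.01)
print(f"waited {waited:.3f} seconds")
```

In the download loop, `polite_sleep()` would take the place of `time.sleep(5)`.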

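Saved file format

Each chapter is saved as its own text file:

File name: <chapter name>.txt
Encoding: UTF-8
Content format: paragraphs separated by newlines

Notes

Make sure the network connection is stable
Respect request rate limits
Keep enough free disk space
Handle file encoding correctly
Pay attention to error handling and logging

Possible improvements

Support resuming an interrupted download
Download with multiple threads
Add a retry mechanism
Show a progress bar
Make the delay configurable
Validate the downloaded content
Support other output formats (such as EPUB or PDF)
Back up the downloaded content
Recover from errors automatically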
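
For the retry mechanism, requests can delegate retries to urllib3 through a transport adapter. The sketch below is one possible configuration; build_retrying_session, the retry count, the backoff factor, and the list of status codes are all assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total=3, backoff_factor=1.0):
    """Return a Session that retries idempotent requests such as GET."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_retrying_session()
retry_config = session.get_adapter("https://www.shicimingju.com").max_retries
print(retry_config.total)
```

The session returned here can be used exactly like the one in the main script, so no change to the download loop is needed.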

Usage

Make sure all dependencies are installed
Run the script:

python your_script.py

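Notes

Follow the target site's crawling rules (for example its robots.txt and terms of use)
The code waits 5 seconds between requests to avoid putting load on the target site
Consider adding more error handling and logging
Make sure there is enough storage space for the novel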
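
One concrete way to check a site's crawling rules is the standard library's urllib.robotparser. The sketch below parses a made-up rule set in memory so that it runs without network access; the rules, the "MyScraper" user agent, and the example.com URLs are invented and say nothing about what the real site allows.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawler would load the site's own
# robots.txt, for example with set_url() followed by read().
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(example_rules)

allowed = parser.can_fetch("MyScraper", "https://example.com/book/1.html")
blocked = parser.can_fetch("MyScraper", "https://example.com/private/page.html")
print(allowed, blocked)
```

A scraper could call can_fetch() before each request and skip any URL for which it returns False.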

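Output

The program creates a "红楼梦" folder in the current directory containing one text file per chapter. Each file is named after its chapter and holds the full text of that chapter.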
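
Since every chapter ends up in a predictably named file, the output folder itself can serve as a record of progress. The sketch below shows how a rerun might skip chapters that are already saved; pending_chapters is a hypothetical helper, and treating an empty file as "not done" is an assumption.

```python
import os
import tempfile


def pending_chapters(url_list, chapter_names, folder_name):
    """Return (url, name) pairs whose output file is missing or empty."""
    pending = []
    for url, name in zip(url_list, chapter_names):
        file_path = os.path.join(folder_name, f"{name}.txt")
        if not os.path.exists(file_path) or os.path.getsize(file_path) == 0:
            pending.append((url, name))
    return pending


# Demonstration with a temporary folder in which chapter 1 is already saved.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "第一回.txt"), "w", encoding="utf-8") as file:
        file.write("已保存的内容")
    todo = pending_chapters(
        ["https://example.com/1.html", "https://example.com/2.html"],
        ["第一回", "第二回"],
        tmp,
    )
print(todo)
```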

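Possible improvements

Add command-line arguments so that other novels can be scraped
Support resuming an interrupted download
Add more thorough error handling
Support multithreaded downloads
Show a progress bar
Support exporting to other formats (such as PDF or EPUB)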
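
Command-line support could be built with argparse from the standard library. In the sketch below every option name (--url, --folder, --delay) and default value is hypothetical; the script as written takes no arguments.

```python
import argparse


def build_parser():
    """Build a command-line parser; all option names here are hypothetical."""
    parser = argparse.ArgumentParser(description="Download a novel chapter by chapter.")
    parser.add_argument("--url", required=True, help="table-of-contents page of the novel")
    parser.add_argument("--folder", default="./novel", help="output folder")
    parser.add_argument("--delay", type=float, default=5.0, help="seconds to wait between requests")
    return parser


# Parsing an explicit list instead of sys.argv keeps the example self-contained.
args = build_parser().parse_args(
    ["--url", "https://www.shicimingju.com/book/hongloumeng.html", "--folder", "./红楼梦"]
)
print(args.url, args.folder, args.delay)
```

In a real script, calling parse_args() with no argument reads the options from the command line, and the parsed values would replace the hard-coded url and folder_name in the main block.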

Complete code

"""The complete code is shown below."""
import os
import requests
import logging
import time
from bs4 import BeautifulSoup


def extract_chapter_names(source):
    page_list = source.find("div", class_="ContL")
    list_chapter = page_list.find("div", "list")
    a_list = list_chapter.findAll("a")
    chapter_names = []
    for a in a_list:
        print(a.get_text())
        chapter_names.append(a.get_text())
    return chapter_names  # return the list of chapter names


def extract_list_url(source, domain):
    # Collect the URL of every chapter
    page_list = source.find("div", class_="ContL")
    list_chapter = page_list.find("div", "list")
    a_list = list_chapter.findAll("a")
    href_list = []
    for a in a_list:
        href = a.get("href")
        everyUrl = domain + href
        href_list.append(everyUrl)
    return href_list  # return the URL of each chapter


# Download and save the content of every chapter of one novel
def extract_chapter_content(url_list, chapter_names, headers, folder_name):
    # Note: `session` is the module-level requests session created in the
    # __main__ block below; it is not passed in as a parameter.
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)

    for i, url in enumerate(url_list):
        response = session.get(url, headers=headers)
        if response.status_code != 200:
            logging.error(f"Request failed, status code: {response.status_code}")
        else:
            response.encoding = "utf-8"
            soup = BeautifulSoup(response.text, "html.parser")
            a = soup.find("div", class_="ContL")
            b = a.find("div", class_="contbox textinfor")
            c = b.find("div", class_="text p_pad")
            p_list = c.findAll("p")
            seen_texts = set()
            chapter_content = []
            for p in p_list:
                text = p.getText().replace('\xa0', ' ')  # replace non-breaking spaces
                if text not in seen_texts:
                    print(text)
                    chapter_content.append(text)
                    seen_texts.add(text)

            chapter_name = chapter_names[i]
            file_path = os.path.join(folder_name, f"{chapter_name}.txt")
            with open(file_path, "w+", encoding="utf-8") as file:
                file.write("\n".join(chapter_content))
            print(f"Finished chapter: {chapter_name}")
        time.sleep(5)  # wait between requests, even after a failure
        print(f"Processed {i + 1} of {len(url_list)} chapters")
    print("Finished scraping the whole novel")


if __name__ == '__main__':
    url = "https://www.shicimingju.com/book/hongloumeng.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/131.0.0.0 Mobile Safari/537.36"
    }
    # Module-level session; extract_chapter_content() uses it as a global
    session = requests.Session()
    page_source = session.get(url, headers=headers)

    # Check whether the request succeeded
    if page_source.status_code != 200:
        logging.error(f"Request failed, status code: {page_source.status_code}")
    else:
        page_source.encoding = "utf-8"
        domain = "https://www.shicimingju.com"
        soup = BeautifulSoup(page_source.text, "html.parser")
        chapter_names = extract_chapter_names(soup)
        list_url = extract_list_url(soup, domain)
        extract_chapter_content(list_url, chapter_names, headers, folder_name="./红楼梦")

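Additional notes

The download can be sped up with multiple threads or with coroutines. Multithreading is already more than fast enough for this job; crawling too fast is not a good idea, since it puts extra load on the target site.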
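
A multithreaded version might look like the sketch below, which uses concurrent.futures from the standard library with a deliberately small pool. fetch_chapter is a stand-in that returns a fake string instead of making a request, and fetch_all and max_workers=3 are illustrative choices rather than part of the original script.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chapter(url):
    """Stand-in for the real download; returns a fake page for the URL."""
    return f"content of {url}"


def fetch_all(urls, max_workers=3):
    """Fetch all URLs with a small thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as `urls`.
        return list(pool.map(fetch_chapter, urls))


urls = [f"https://example.com/book/{n}.html" for n in range(1, 6)]
pages = fetch_all(urls)
print(len(pages))
```

Because the results come back in input order, they can still be paired with the chapter names by index. Keeping max_workers low, and keeping a delay inside the real fetch function, stays in line with the advice above about not crawling too fast.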