Python Web Scraping in Practice: Paginated Crawling + Detail-Page Collection + CSV Storage

The complete code is at the end.

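This post uses the BooksToScrape site to build a small but complete scraping project that covers:

Paginated crawling
Collecting detail-page links
Parsing detail-page data
Saving data to CSV
Joining URLs correctly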

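The site itself is fairly simple, which makes it good practice for beginners.

No multithreading is used; I haven't learned it yet.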

Project goal

Target site:

https://books.toscrape.com/

Data to collect:

Title
UPC
Price (excl. tax)
Availability
Image link
Product description

Everything is saved to a CSV file at the end.

Workflow design

Crawl the listing pages
    ↓
Collect every book URL
    ↓
Visit each detail page
    ↓
Parse the data
    ↓
Save to CSV

Implementing pagination

The site's pagination markup:

<li class="next">
    <a href="catalogue/page-2.html">next</a>
</li>

The core idea, in pseudocode:

current_url = start_url

while True:
    request the current page
    collect the book links
    look for the next-page link
    if there is none:
        break
    move on to the next page

The key part of the code:

next_li = soup.find("li", class_="next")

if next_li is None:
    break

next_href = next_li.find("a")["href"]

current_url = urljoin(current_url, next_href)
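To make the loop concrete, here is a minimal sketch that follows the next link across a few in-memory pages instead of the live site. The `FAKE_SITE` dictionary and the `fetch_html` and `crawl_page_urls` helpers are hypothetical names made up for this illustration; in a real run `fetch_html` would presumably wrap `requests.get`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the live site: URL -> HTML of that page.
FAKE_SITE = {
    "https://books.toscrape.com/":
        '<ul><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>',
    "https://books.toscrape.com/catalogue/page-2.html":
        '<ul><li class="next"><a href="page-3.html">next</a></li></ul>',
    "https://books.toscrape.com/catalogue/page-3.html":
        "<ul></ul>",  # last page: no <li class="next">
}


def fetch_html(url):
    """Return the HTML for url (here read from the in-memory dictionary)."""
    return FAKE_SITE[url]


def crawl_page_urls(start_url):
    """Follow the "next" link from start_url and return every page URL visited."""
    visited = []
    current_url = start_url
    while True:
        visited.append(current_url)
        soup = BeautifulSoup(fetch_html(current_url), "html.parser")
        next_li = soup.find("li", class_="next")
        if next_li is None:  # no next page: stop
            break
        next_href = next_li.find("a")["href"]
        # Resolve the href against the page it was found on.
        current_url = urljoin(current_url, next_href)
    return visited


pages = crawl_page_urls("https://books.toscrape.com/")
print(pages)
```

Passing the fetch step through a small helper like this also makes the loop easy to try out without touching the network.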

Why urljoin matters

Plain string concatenation does not work here:

current_url = current_url + next_href

With plain concatenation, the URL after the second page becomes:

https://books.toscrape.com/catalogue/page-2.htmlpage-3.html

That is clearly not a valid address.

The correct approach:

from urllib.parse import urljoin

current_url = urljoin(current_url, next_href)

This works whether the href is:

page-3.html

or:

catalogue/page-2.html

In both cases the href is resolved correctly against the current page.
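A quick way to see the difference is to run both approaches side by side. This is only a small sketch; the variable names `home`, `page_2`, `page_3` and `broken` are made up for the comparison.

```python
from urllib.parse import urljoin

home = "https://books.toscrape.com/"

# The home page links to the second page with a path that includes "catalogue/".
page_2 = urljoin(home, "catalogue/page-2.html")

# From the second page on, the href is just the file name.
page_3 = urljoin(page_2, "page-3.html")

# Naive concatenation glues the file name onto the previous file name.
broken = page_2 + "page-3.html"

print(page_2)  # https://books.toscrape.com/catalogue/page-2.html
print(page_3)  # https://books.toscrape.com/catalogue/page-3.html
print(broken)  # https://books.toscrape.com/catalogue/page-2.htmlpage-3.html
```

urljoin replaces the last path segment of the base URL with the relative href, which is the same rule a browser follows when you click a link.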

Collecting detail-page links

Page structure:

<div class="image_container">
    <a href="...">

Extraction:

book_hrefs = soup.find_all(
    "div",
    class_="image_container"
)

Building the detail-page URL:

book_url = urljoin(
    current_url,
    book_href.find("a")["href"]
)

Storing it:

book_urls.append(book_url)
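Put together, the three fragments above could look roughly like the sketch below. It runs against a made-up HTML fragment rather than the live site; `SAMPLE_HTML`, its hrefs and the `extract_book_urls` helper are assumptions for illustration only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up listing fragment with two books (hypothetical hrefs).
SAMPLE_HTML = """
<div class="image_container"><a href="book-one_1/index.html"><img src="a.jpg"></a></div>
<div class="image_container"><a href="book-two_2/index.html"><img src="b.jpg"></a></div>
"""


def extract_book_urls(html, page_url):
    """Return the absolute detail-page URLs found in one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for container in soup.find_all("div", class_="image_container"):
        link = container.find("a")
        if link is None or not link.get("href"):
            continue  # skip containers without a usable link
        urls.append(urljoin(page_url, link["href"]))
    return urls


book_urls = extract_book_urls(
    SAMPLE_HTML, "https://books.toscrape.com/catalogue/page-2.html"
)
print(book_urls)
```

Returning a list from a function, instead of appending to a global, is a design choice of this sketch and not something the original script does.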

Parsing the detail page

Fetch and parse each detail page:

res = requests.get(book_url, headers=headers, timeout=10)

soup = BeautifulSoup(
    res.text,
    "html.parser"
)

Getting the title:

title = soup.find(
    "div",
    class_="col-sm-6 product_main"
).find("h1").text.strip()

Parsing the product table

The detail page contains a table:

<table>

Its rows include, for example:

UPC
Price (excl. tax)
Availability

How to handle it:

table = soup.find("table")

rows = table.find_all("tr")

Build a dictionary from the rows:

book_info = {}

for row in rows:

    key = row.find("th").text.strip()

    value = row.find("td").text.strip()

    book_info[key] = value

This makes looking up values very convenient:

book_info["UPC"]
book_info["Availability"]
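As a self-contained illustration of the table-to-dictionary step, the sketch below parses a made-up table shaped like the one on the detail page. `SAMPLE_TABLE`, its values and the `parse_product_table` helper are hypothetical; the extra None checks are an addition of this sketch.

```python
from bs4 import BeautifulSoup

# Made-up table fragment shaped like the detail page's product table.
SAMPLE_TABLE = """
<table>
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Price (excl. tax)</th><td>10.00</td></tr>
  <tr><th>Availability</th><td>In stock (5 available)</td></tr>
</table>
"""


def parse_product_table(html):
    """Turn each <tr> with a <th> and a <td> into one dictionary entry."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return {}  # no table on the page
    info = {}
    for row in table.find_all("tr"):
        th, td = row.find("th"), row.find("td")
        if th is None or td is None:
            continue  # skip rows that are not key/value pairs
        info[th.get_text(strip=True)] = td.get_text(strip=True)
    return info


book_info = parse_product_table(SAMPLE_TABLE)
print(book_info["UPC"])
print(book_info.get("Tax", "n/a"))  # .get avoids a KeyError for a missing row
```

Using `book_info.get(key, default)` instead of `book_info[key]` is one way to keep the scraper running if a row is ever absent.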

Handling the image URL

The image link is a relative path:

<img src="../../media/cache/...">

It has to be converted to an absolute URL:

img_src = soup.find("div", class_="item active").find("img")["src"]

img_url = urljoin(
    book_url,
    img_src
)

The final result:

https://books.toscrape.com/media/...
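The `../../` prefix is resolved relative to the detail page's own path. A minimal sketch, assuming a detail-page URL and an image src shaped like the ones on the site (both values below are made up):

```python
from urllib.parse import urljoin

# Hypothetical detail-page URL and image src, for illustration only.
book_url = "https://books.toscrape.com/catalogue/some-book_1/index.html"
img_src = "../../media/cache/ab/cd/cover.jpg"

# Each "../" climbs one directory: first out of some-book_1/, then out of catalogue/.
img_url = urljoin(book_url, img_src)
print(img_url)  # https://books.toscrape.com/media/cache/ab/cd/cover.jpg
```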

Parsing the product description

Page structure:

<div id="product_description">

Getting the description:

description = soup.find(
    "div",
    id="product_description"
).find_next_sibling("p").text.strip()

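The description block may be missing, in which case soup.find returns None and the chained call above raises an AttributeError, so a guard is needed: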

desc_tag = soup.find(
    "div",
    id="product_description"
)

if desc_tag:
    description = desc_tag.find_next_sibling(
        "p"
    ).text.strip()
else:
    description = ""

Saving to CSV

Create the file and a writer:

with open(
    "books.csv",
    "w",
    newline="",
    encoding="utf-8-sig"
) as file:
    writer = csv.writer(file)

Write the header row:

writer.writerow([
    "Title",
    "UPC",
    "Price",
    "Availability",
    "Cover",
    "Description"
])

Write the data:

writer.writerow([
    title,
    book_info["UPC"],
    book_info["Price (excl. tax)"],
    book_info["Availability"],
    img_url,
    description
])
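The two open() arguments deserve a note: newline="" lets the csv module control line endings itself (without it, extra blank rows can appear on Windows), and utf-8-sig writes a byte order mark so that Excel recognizes the file as UTF-8. The sketch below writes to an in-memory buffer instead of a file and reads the rows back; the header and row values are made up.

```python
import csv
import io

# Made-up row; the description contains quotes, a comma and a line break.
header = ["Title", "UPC", "Price", "Availability", "Cover", "Description"]
row = [
    "Some Book",
    "abc123",
    "10.00",
    "In stock",
    "https://books.toscrape.com/media/cover.jpg",
    'A "quoted" line, with a comma\nand a second line',
]

buffer = io.StringIO(newline="")  # stands in for open(..., newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)

# Read the text back to confirm the row survives the round trip.
buffer.seek(0)
rows_read = list(csv.reader(buffer))
print(len(rows_read))       # 2
print(rows_read[1] == row)  # True
```

csv.writer quotes fields that contain commas, quotes or line breaks, so long descriptions do not need any manual escaping.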

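Problems encountered while debugging

1. Last-page data lost

Cause:

# check for a next page first
# then collect the data

The loop breaks on the last page before that page's books are collected.

Fix:

# collect the data first
# then check for a next page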
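The effect of the ordering is easy to reproduce without any network access. The sketch below simulates a three-page site with a plain dictionary; `PAGES`, `crawl_check_first` and `crawl_collect_first` are hypothetical names used only for this demonstration.

```python
# Hypothetical three-page site: each page lists its items and its next page.
PAGES = {
    "page-1": {"items": ["a", "b"], "next": "page-2"},
    "page-2": {"items": ["c", "d"], "next": "page-3"},
    "page-3": {"items": ["e"], "next": None},  # last page
}


def crawl_check_first(start):
    """Buggy order: test for a next page before collecting."""
    collected, current = [], start
    while True:
        page = PAGES[current]
        if page["next"] is None:
            break  # leaves before the last page is collected
        collected.extend(page["items"])
        current = page["next"]
    return collected


def crawl_collect_first(start):
    """Fixed order: collect, then test for a next page."""
    collected, current = [], start
    while True:
        page = PAGES[current]
        collected.extend(page["items"])
        if page["next"] is None:
            break
        current = page["next"]
    return collected


print(crawl_check_first("page-1"))    # ['a', 'b', 'c', 'd']
print(crawl_collect_first("page-1"))  # ['a', 'b', 'c', 'd', 'e']
```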

2. Wrong URL concatenation

The mistake:

current_url + next_href

which produces:

page-2.htmlpage-3.html

Fix:

urljoin(current_url, next_href)

Complete code

'''
Collect every book link,
then request each detail page and save the data.
'''

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive"
}

book_urls = []

def get_book_list():
    current_url = url
    num = 0

    while True:
        res = requests.get(current_url, headers=headers, timeout=10)
        soup = BeautifulSoup(res.text, "html.parser")

        # Each book on a listing page sits inside a div.image_container
        book_hrefs = soup.find_all("div", class_="image_container")

        for book_href in book_hrefs:
            num += 1
            print(f"Collected book link #{num}")
            book_url = urljoin(current_url, book_href.find("a")["href"])
            book_urls.append(book_url)

        # Collect the current page's books first, then check for a next page
        next_li = soup.find("li", class_="next")

        if next_li is None:
            print("Finished collecting links")
            break

        next_href = next_li.find("a")["href"]
        current_url = urljoin(current_url, next_href)

def save_books():
    num = 0
    with open("book_1.csv", "w", newline="", encoding="utf-8-sig") as file:
        writer = csv.writer(file)

        writer.writerow(["Title", "UPC", "Price (excl. tax)", "Availability", "Cover", "Description"])

        for book_url in book_urls:
            res = requests.get(book_url, headers=headers, timeout=10)
            # Parse the raw bytes so BeautifulSoup takes the charset from the page itself;
            # if requests guesses the wrong encoding, res.text can mangle the pound sign in prices.
            soup = BeautifulSoup(res.content, "html.parser")

            title = soup.find("div", class_="col-sm-6 product_main").find("h1").text.strip()
            table = soup.find("table")
            rows = table.find_all("tr")

            # Turn the product table into a {header: value} dictionary
            book_info = {}
            for row in rows:
                key = row.find("th").text.strip()
                value = row.find("td").text.strip()
                book_info[key] = value

            img_src = soup.find("div", class_="item active").find("img")["src"]
            img_url = urljoin(book_url, img_src)

            # The description block may be missing, so guard against None
            desc_tag = soup.find("div", id="product_description")
            desc_p = desc_tag.find_next_sibling("p") if desc_tag else None
            description = desc_p.text.strip() if desc_p else ""

            writer.writerow([
                title,
                book_info["UPC"],
                book_info["Price (excl. tax)"],
                book_info["Availability"],
                img_url,
                description,
            ])
            num += 1
            print(f"Saved record #{num}")


if __name__ == "__main__":
    get_book_list()
    save_books()