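Python Web Scraping in Practice: Pagination + Detail Pages + CSV Storage
The complete code is at the end.
This post uses the BooksToScrape site to build a small but complete scraper that covers:
- Crawling paginated listings
- Collecting detail-page links
- Parsing detail-page data
- Saving the data to CSV
- Joining relative URLs
The site itself is fairly simple, which makes it good practice for beginners.
I didn't use multithreading because I haven't learned it yet.
Project goals
Target site: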
text
https://books.toscrape.com/
Data to collect:
- Title
- UPC
- Price (excl. tax)
- Availability
- Image URL
- Product description
Everything ends up in a CSV file.
Workflow design
text
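Crawl the paginated listing
↓
Collect every book URL
↓
Visit each detail page
↓
Parse the data
↓
Save to CSV
Implementing pagination
Pagination markup on the site:
html
<li class="next">
    <a href="catalogue/page-2.html">next</a>
</li>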
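The core idea:
python
current_url = start_url
while True:
    # 1. Request the current page
    # 2. Collect the book links on it
    # 3. Look for the "next" link
    if next_li is None:  # no next page
        break
    # 4. Move on to the next page
Key part of the code: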
python
next_li = soup.find("li", class_="next")
if next_li is None:
    break
next_href = next_li.find("a")["href"]
current_url = urljoin(current_url, next_href)
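The loop can be tried offline before pointing it at the real site. Below is a minimal sketch, assuming a hypothetical `fetch` callable that returns a page's HTML for a given URL (in the real script that role is played by `requests.get(...).text`). The function name `collect_book_urls`, the `example.com` URLs, and the two fake pages are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_book_urls(start_url, fetch):
    """Follow "next" links from start_url and return every book URL found.

    `fetch` is a hypothetical callable (url -> HTML string), so the loop
    can be exercised without touching the network.
    """
    book_urls = []
    current_url = start_url
    while True:
        soup = BeautifulSoup(fetch(current_url), "html.parser")
        # Collect this page's links first ...
        for box in soup.find_all("div", class_="image_container"):
            book_urls.append(urljoin(current_url, box.find("a")["href"]))
        # ... and only then decide whether there is another page.
        next_li = soup.find("li", class_="next")
        if next_li is None:
            break
        current_url = urljoin(current_url, next_li.find("a")["href"])
    return book_urls


# Two made-up pages that mimic the site's markup.
FAKE_PAGES = {
    "https://example.com/": (
        '<div class="image_container"><a href="catalogue/a/index.html"></a></div>'
        '<li class="next"><a href="catalogue/page-2.html">next</a></li>'
    ),
    "https://example.com/catalogue/page-2.html": (
        '<div class="image_container"><a href="b/index.html"></a></div>'
    ),
}

urls = collect_book_urls("https://example.com/", FAKE_PAGES.__getitem__)
print(urls)
```

Passing the fetcher in as a parameter is only a convenience for testing; the control flow is the same as in the snippet above.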
Why urljoin matters
Plain string concatenation does not work here:
python
current_url = current_url + next_href
From page 2 onward, concatenation produces:
text
https://books.toscrape.com/catalogue/page-2.htmlpage-3.html
That is clearly a broken address.
The correct approach:
python
from urllib.parse import urljoin
current_url = urljoin(current_url, next_href)
This way, whether the href is:
text
page-3.html
or:
text
catalogue/page-2.html
it is resolved correctly against the current page URL.
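The difference is easy to see side by side. The sketch below uses only the standard library and the two href values shown above; the variable names are illustrative.

```python
from urllib.parse import urljoin

home = "https://books.toscrape.com/"

# The home page links to page 2 with a path that includes "catalogue/".
page2 = urljoin(home, "catalogue/page-2.html")

# Page 2 links to page 3 with a bare file name.
page3 = urljoin(page2, "page-3.html")

# Naive concatenation for the same step.
broken = page2 + "page-3.html"

print(page2)   # https://books.toscrape.com/catalogue/page-2.html
print(page3)   # https://books.toscrape.com/catalogue/page-3.html
print(broken)  # https://books.toscrape.com/catalogue/page-2.htmlpage-3.html
```

`urljoin` replaces the last path segment of the base URL with the relative href, which is how a browser resolves the same link.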
Collecting detail-page links
Page structure:
html
<div class="image_container">
<a href="...">
Extract the containers:
python
book_hrefs = soup.find_all(
    "div",
    class_="image_container"
)
Build the detail-page URL:
python
book_url = urljoin(
    current_url,
    book_href.find("a")["href"]
)
Store it:
python
book_urls.append(book_url)
Parsing the detail page
Once on a detail page:
python
res = requests.get(book_url)
soup = BeautifulSoup(
    res.text,
    "html.parser"
)
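One thing worth checking at this step is the text encoding. The illustration below rests on an assumption: that the server's `Content-Type` header carries no charset, in which case `requests` falls back to ISO-8859-1 when building `res.text`. Decoding UTF-8 bytes that way garbles the pound sign in prices. The price is a sample value.

```python
# UTF-8 bytes for a price string, as a UTF-8 page would deliver them.
raw = "£51.77".encode("utf-8")

wrong = raw.decode("iso-8859-1")  # the assumed fallback decoding
right = raw.decode("utf-8")       # the intended decoding

print(wrong)  # Â£51.77
print(right)  # £51.77
```

If prices come out like the first line, two common remedies are setting `res.encoding = "utf-8"` before reading `res.text`, or passing `res.content` (raw bytes) to `BeautifulSoup` so it can detect the encoding itself.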
Get the title:
python
title = soup.find(
    "div",
    class_="col-sm-6 product_main"
).find("h1").text.strip()
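Passing a multi-class string to `class_` matches the attribute value as an exact string, so it stops matching if the classes ever appear in a different order. A CSS selector is one possible alternative. The sketch below uses a made-up fragment with the classes reversed to show the difference; it is not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment with the two classes in the opposite order.
html = '<div class="product_main col-sm-6"><h1> Sample Book </h1></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: finds nothing here.
exact = soup.find("div", class_="col-sm-6 product_main")

# CSS selector: matches regardless of class order.
by_selector = soup.select_one("div.product_main h1")
title = by_selector.get_text(strip=True)

print(exact)  # None
print(title)  # Sample Book
```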
Parsing the product information table
The detail page contains a table:
html
<table>
Its rows include, for example:
text
UPC
Price (excl. tax)
Availability
Approach:
python
table = soup.find("table")
rows = table.find_all("tr")
Build a dictionary:
python
book_info = {}
for row in rows:
    key = row.find("th").text.strip()
    value = row.find("td").text.strip()
    book_info[key] = value
This makes lookups straightforward:
python
book_info["UPC"]
book_info["Availability"]
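The same idea can be wrapped in a small helper that tolerates a missing table or malformed rows. This is a sketch, not the code used in the final script: `table_to_dict` is a hypothetical name, and the table values are placeholders rather than real site data.

```python
from bs4 import BeautifulSoup

# Made-up table; the values are placeholders, not real site data.
html = """
<table>
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Price (excl. tax)</th><td>£10.00</td></tr>
  <tr><th>Availability</th><td>In stock (5 available)</td></tr>
</table>
"""


def table_to_dict(soup):
    """Map each row's <th> text to its <td> text."""
    info = {}
    table = soup.find("table")
    if table is None:
        return info
    for row in table.find_all("tr"):
        th, td = row.find("th"), row.find("td")
        if th is not None and td is not None:  # skip malformed rows
            info[th.get_text(strip=True)] = td.get_text(strip=True)
    return info


book_info = table_to_dict(BeautifulSoup(html, "html.parser"))
print(book_info["UPC"])
print(book_info.get("Tax", "n/a"))  # .get() avoids a KeyError for absent keys
```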
Handling the image URL
The image link is a relative path:
html
<img src="../../media/cache/...">
It must be converted to an absolute URL:
python
img_url = urljoin(
    book_url,
    img_src
)
Which gives:
text
https://books.toscrape.com/media/...
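Here is how the `../../` segments resolve. The detail-page URL and image path below are hypothetical values shaped like the site's; only `urljoin` itself is real.

```python
from urllib.parse import urljoin

# Hypothetical detail-page URL and image src, shaped like the site's.
book_url = "https://books.toscrape.com/catalogue/some-book_1/index.html"
img_src = "../../media/cache/ab/cd/cover.jpg"

# Each "../" climbs one directory up from the page's own folder:
# /catalogue/some-book_1/ -> /catalogue/ -> /
img_url = urljoin(book_url, img_src)
print(img_url)  # https://books.toscrape.com/media/cache/ab/cd/cover.jpg
```

The base must be the detail page's own URL (`book_url`), since that is the page the `src` attribute was read from.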
Parsing the product description
Page structure:
html
<div id="product_description">
Get the description:
python
description = soup.find(
    "div",
    id="product_description"
).find_next_sibling("p").text.strip()
The block may be missing on some pages, so check for it first:
python
desc_tag = soup.find(
    "div",
    id="product_description"
)
if desc_tag:
    description = desc_tag.find_next_sibling(
        "p"
    ).text.strip()
else:
    description = ""
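The guard can go one step further and also cover the case where the `div` exists but no `<p>` follows it. A minimal sketch, with `get_description` as a hypothetical helper name and both HTML fragments made up for the demonstration:

```python
from bs4 import BeautifulSoup


def get_description(soup):
    """Return the description text, or "" when any piece is missing."""
    desc_tag = soup.find("div", id="product_description")
    if desc_tag is None:
        return ""
    desc_p = desc_tag.find_next_sibling("p")
    return desc_p.get_text(strip=True) if desc_p is not None else ""


with_desc = BeautifulSoup(
    '<div id="product_description"><h2>Product Description</h2></div>'
    "<p> A made-up description. </p>",
    "html.parser",
)
without_desc = BeautifulSoup("<h1>No description here</h1>", "html.parser")

found = get_description(with_desc)
missing = get_description(without_desc)
print(repr(found))    # 'A made-up description.'
print(repr(missing))  # ''
```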
Saving to CSV
Create the file and a writer:
python
with open(
    "books.csv",
    "w",
    newline="",
    encoding="utf-8-sig"
) as file:
    writer = csv.writer(file)
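The `utf-8-sig` codec writes a byte order mark (BOM) at the start of the file, which helps spreadsheet programs such as Excel recognize the file as UTF-8 and display characters like `£` correctly. The sketch below writes to a temporary file and inspects the first three bytes; the file name and row values are made up.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price"])
    writer.writerow(["Sample Book", "£10.00"])

# Read the raw bytes back to see the BOM.
with open(path, "rb") as file:
    first_bytes = file.read(3)

print(first_bytes)  # b'\xef\xbb\xbf'
```

When reading such a file back in Python, opening it with `encoding="utf-8-sig"` strips the BOM again.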
Write the header row:
python
writer.writerow([
    "Title",
    "UPC",
    "Price",
    "Availability",
    "Cover",
    "Description"
])
Write a data row:
python
writer.writerow([
    title,
    book_info["UPC"],
    book_info["Price (excl. tax)"],
    book_info["Availability"],
    img_url,
    description
])
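Descriptions often contain commas, quotes, or line breaks. `csv.writer` quotes such fields automatically, so there is no need to sanitize them by hand. The sketch below checks this with an in-memory buffer instead of a file; every value in the row is made up.

```python
import csv
import io

row = [
    "Sample Book",
    "abc123",
    "£10.00",
    "In stock",
    "https://example.com/cover.jpg",
    'A "quoted" line, with a comma\nand a second line',
]

# newline="" mirrors the open(..., newline="") call used for real files.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)

# Read the row back and compare it with what was written.
buffer.seek(0)
read_back = next(csv.reader(buffer))
print(read_back == row)  # True
```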
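Problems encountered while debugging
1. The last page's data was lost
Cause:
python
# check for "next" first
# then collect the data
On the last page there is no "next" link, so the loop exited before that page's books were collected.
Fix:
python
# collect first
# then check for "next"
2. Broken URL concatenation
The mistake:
python
current_url + next_href
which produced:
text
page-2.htmlpage-3.html
Fix:
python
urljoin()
Complete code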
python
'''
Collect every book link,
then request each detail page and save the data.
'''
import csv

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://books.toscrape.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive"
}
book_urls = []


def get_book_list():
    current_url = url
    num = 0
    while True:
        res = requests.get(current_url, headers=headers, timeout=10)
        soup = BeautifulSoup(res.text, "html.parser")
        book_hrefs = soup.find_all("div", class_="image_container")
        for book_href in book_hrefs:
            num += 1
            print(f"Collected book link #{num}")
            book_url = urljoin(current_url, book_href.find("a")["href"])
            book_urls.append(book_url)
        # Collect the current page's books first, then check for a next page
        next_li = soup.find("li", class_="next")
        if next_li is None:
            print("Finished collecting links")
            break
        next_href = next_li.find("a")["href"]
        current_url = urljoin(current_url, next_href)


def save_books():
    num = 0
    with open("book_1.csv", "w", newline="", encoding="utf-8-sig") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "UPC", "Price (excl. tax)", "Availability", "Cover", "Description"])
        for book_url in book_urls:
            res = requests.get(book_url, headers=headers, timeout=10)
            # Assumption: the pages are UTF-8; setting this explicitly keeps
            # the pound sign in prices from coming out garbled.
            res.encoding = "utf-8"
            soup = BeautifulSoup(res.text, "html.parser")
            title = soup.find("div", class_="col-sm-6 product_main").find("h1").text.strip()
            # Turn the product table into a {header: value} dictionary
            table = soup.find("table")
            rows = table.find_all("tr")
            book_info = {}
            for row in rows:
                key = row.find("th").text.strip()
                value = row.find("td").text.strip()
                book_info[key] = value
            img_src = soup.find("div", class_="item active").find("img")["src"]
            img_url = urljoin(book_url, img_src)
            # Guard against pages that have no description block
            desc_tag = soup.find("div", id="product_description")
            desc_p = desc_tag.find_next_sibling("p") if desc_tag else None
            description = desc_p.text.strip() if desc_p else ""
            writer.writerow([title,
                             book_info["UPC"],
                             book_info["Price (excl. tax)"],
                             book_info["Availability"],
                             img_url,
                             description
                             ])
            num += 1
            print(f"Saved record #{num}")


if __name__ == "__main__":
    get_book_list()
    save_books()