Python Web Scraping in Practice: From BeautifulSoup to XPath

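The complete project code is at the end of the article.

This is a scraper for a book catalogue site, built on requests + lxml + XPath + ThreadPoolExecutor.

The BeautifulSoup version is here: https://blog.csdn.net/2301_76809965/article/details/161854164

The goals are simple:

Collect the links to every catalogue page
Visit each page and extract the book details
Save the results to a CSV file
Use multiple threads to speed up the crawl

The project is small, but writing it noticeably improved my understanding of XPath, and I hit quite a few pitfalls along the way.

This article records the implementation approach, what I learned about XPath, and the common problems.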

Project structure

The practice site:

https://books.toscrape.com/

Data to collect:

Title
Price
Stock status
Image link

Everything ends up in a CSV file.
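
As a rough illustration of the target output, here is a minimal sketch of writing those four fields with the standard csv module. The sample row and the write_books helper are made up for this example, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical sample row in the order: title, price, stock status, image URL.
rows = [
    ["A Light in the Attic", "£51.77", "In stock",
     "https://books.toscrape.com/media/cache/xx.jpg"],
]

def write_books(stream, rows):
    """Write a header row followed by one row per book."""
    writer = csv.writer(stream)
    writer.writerow(["title", "price", "stock", "image"])
    writer.writerows(rows)

# StringIO stands in for a real file so the sketch runs without touching disk.
# For a real file: open("books.csv", "w", newline="", encoding="utf-8-sig")
buffer = io.StringIO(newline="")
write_books(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The utf-8-sig encoding mentioned in the comment is what the full script at the end uses; it adds a byte order mark so that spreadsheet programs are more likely to detect UTF-8.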

The core flow:

Collect all catalogue page links
        ↓
Visit the pages with multiple threads
        ↓
Parse each book's details
        ↓
Save the data to CSV

XPath in the project

Getting all book nodes

First, find the node that wraps each book:

<article class="product_pod">

XPath:

articles = tree.xpath("//article[@class='product_pod']")

At this point

articles

is a list of element nodes, one per book on the page.

Then iterate over it:

for article in articles:
    ...

In each iteration

article

is the node for a single book.
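
To make this concrete, here is a small self-contained sketch that runs the same query against an inline HTML string instead of the live site. The markup is a trimmed-down, hypothetical copy of the listing page, not the real thing.

```python
from lxml import etree

# Hypothetical, simplified markup; the real page carries more elements and attributes.
html = """
<html><body>
  <article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>
  <article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
  <article class="other"><h3><a title="Not a book">x</a></h3></article>
</body></html>
"""

tree = etree.HTML(html)

# A node-set query returns a list, even when it matches one node or none.
articles = tree.xpath("//article[@class='product_pod']")
print(type(articles).__name__, len(articles))

titles = []
for article in articles:
    # Each article is an element, so further queries can be relative to that one book.
    titles.append(article.xpath("string(h3/a/@title)"))
print(titles)
```

Note that [@class='product_pod'] only matches when the class attribute is exactly that string. If an element carried several classes, contains(@class, 'product_pod') would be one way to loosen the match.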

Getting the title

HTML:

<h3>
    <a title="A Light in the Attic">...</a>
</h3>

XPath (either form works):

title = article.xpath("h3/a/@title")[0]
title = article.xpath("string(.//h3/a/@title)")
# string() returns a str directly, so no index is needed

Both lines produce the same result:

A Light in the Attic
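
The two forms behave differently when the node is missing, which the snippet above does not show. The following sketch uses a made-up fragment in which the second article has no link, and the two helper functions are hypothetical names for illustration.

```python
from lxml import etree

# Hypothetical fragment: the second article has no <a> inside its <h3>.
html = """
<div>
  <article><h3><a title="A Light in the Attic">A Light in the ...</a></h3></article>
  <article><h3></h3></article>
</div>
"""
articles = etree.HTML(html).xpath("//article")

def title_by_index(article):
    # Raises IndexError when the attribute is missing, because the list is empty.
    return article.xpath("h3/a/@title")[0]

def title_by_string(article):
    # string() converts an empty node-set to "", so this never raises.
    return article.xpath("string(h3/a/@title)")

print(title_by_index(articles[0]))
print(repr(title_by_string(articles[1])))

try:
    title_by_index(articles[1])
except IndexError:
    print("no title found; [0] raised IndexError")
```

Which behavior is preferable depends on the job: string() keeps a long crawl running with an empty field, while [0] fails loudly and makes a changed page layout easier to notice.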

Getting the price

HTML:

<p class="price_color">£51.77</p>

XPath:

price = article.xpath(".//p[@class='price_color']/text()")[0]

Result:

£51.77
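
The extracted price is still a string with a currency symbol. If a numeric value is needed later (sorting, filtering), one possible approach is sketched below. parse_price is a hypothetical helper that is not part of the original project, and it assumes a plain number with a dot as the decimal separator.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Return the numeric part of a price string such as '£51.77' as a Decimal.

    Hypothetical helper; assumes a dot decimal separator and no thousands separator.
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())

print(parse_price("£51.77"))
# The number still parses if the currency symbol was mis-decoded.
print(parse_price("Â£51.77"))
```

Decimal is used here instead of float so that 51.77 stays exactly 51.77.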

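Getting the stock status

HTML:

<p class="instock availability">
    <i></i>
    In stock
</p>

XPath:

stock = article.xpath(
    "normalize-space(.//p[@class='instock availability'])"
)

Result:

In stock

The key here is

normalize-space()

which strips leading and trailing whitespace (including newlines) and collapses internal runs of whitespace into a single space.

That is much more robust than

text()[0]

which returns only the first text node.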
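
The difference is easier to see side by side. This sketch runs the three variants on an inline copy of the markup above; the exact whitespace inside the text nodes depends on the source formatting, so the output is printed rather than listed here.

```python
from lxml import etree

html = """
<article>
  <p class="instock availability">
    <i></i>
    In stock
  </p>
</article>
"""
article = etree.HTML(html).xpath("//article")[0]
path = ".//p[@class='instock availability']"

text_nodes = article.xpath(path + "/text()")        # list of raw text nodes
raw_string = article.xpath(f"string({path})")       # all text joined, whitespace kept
clean = article.xpath(f"normalize-space({path})")   # trimmed and collapsed

print(text_nodes)
print(repr(raw_string))
print(repr(clean))
```

Only the normalize-space() result is ready to store as-is; the other two need extra cleanup in Python.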

Getting the image URL

HTML:

<img src="../media/cache/xx.jpg">

XPath:

img_src = article.xpath(".//img/@src")[0]

This gives a relative path:

../media/cache/xx.jpg

It has to be resolved against the URL of the page it came from:

from urllib.parse import urljoin

img_url = urljoin(page_url, img_src)

Result:

https://books.toscrape.com/media/cache/xx.jpg
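
A short sketch of how urljoin resolves different kinds of relative links. The page URL and the /static/main.css path are assumptions chosen for illustration; what matters is that the base is the URL of the page where the link was found.

```python
from urllib.parse import urljoin

# Assumed page URL for illustration.
page_url = "https://books.toscrape.com/catalogue/page-2.html"

# "../" climbs out of /catalogue/ before descending into /media/.
img_url = urljoin(page_url, "../media/cache/xx.jpg")
# A bare file name replaces the last path segment of the base.
next_url = urljoin(page_url, "page-3.html")
# A leading "/" resolves from the site root (hypothetical path).
root_relative = urljoin(page_url, "/static/main.css")

print(img_url)        # https://books.toscrape.com/media/cache/xx.jpg
print(next_url)       # https://books.toscrape.com/catalogue/page-3.html
print(root_relative)  # https://books.toscrape.com/static/main.css
```

Passing the site root instead of the current page URL as the base can give a different result for the same relative link, so it is safer to carry the page URL along with the parsed data.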

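What I learned about XPath

My strongest impression when I first picked up XPath:

It is not hard to understand, but it takes a lot of practice.

There is not much syntax to learn:

//
.
@
text()
contains()
normalize-space()

The hard parts are:

Reading the page structure
Knowing which node you are currently on
Writing the shortest XPath that still works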

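Relative paths matter more than absolute paths

Many beginners like paths such as

/html/body/div/div/div/section/div/article

The problem:

They break as soon as the page layout changes even slightly.

A better choice is

//article[@class='product_pod']

or

.//img/@src

These locate nodes by tag and attribute rather than by their exact position in the tree.

They are much more stable.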
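
The fragility of absolute paths can be reproduced with two made-up versions of the same page, where the second one only adds a wrapper div. Both page strings and the two path variables are hypothetical.

```python
from lxml import etree

# Two hypothetical versions of the same page; v2 adds one wrapper <div>.
page_v1 = ("<html><body><div class='page'>"
           "<article class='product_pod'>A</article>"
           "</div></body></html>")
page_v2 = ("<html><body><div class='wrapper'><div class='page'>"
           "<article class='product_pod'>A</article>"
           "</div></div></body></html>")

absolute = "/html/body/div/article"
by_attribute = "//article[@class='product_pod']"

counts = {}
for name, page in [("v1", page_v1), ("v2", page_v2)]:
    tree = etree.HTML(page)
    counts[name] = (len(tree.xpath(absolute)), len(tree.xpath(by_attribute)))

print(counts)  # {'v1': (1, 1), 'v2': (0, 1)}
```

The absolute path silently returns nothing on v2, while the attribute-based query still finds the book.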

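Learn to work from the current node

For example, inside

for article in articles:
    ...

the variable

article

is already the node for one whole book.

Get the title:

article.xpath("h3/a/@title")

Get the image:

article.xpath(".//img/@src")

The leading

.

means the current node. A path with no leading slash, such as h3/a/@title, is also evaluated relative to the current node.

This is one of the most important ideas in XPath.

Without a good grasp of relative paths, complex pages become very painful later on.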

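text() is not always reliable

For example:

<p>
    <i></i>
    In stock
</p>

Running

text()

on this <p> returns something like

['\n', ' In stock\n']

The result often contains:

Spaces
Newlines
Several separate text nodes

As a result

text()[0]

returns a whitespace-only string instead of the text you wanted.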
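
If the text nodes have already been collected as a list, they can also be cleaned up on the Python side. The sketch below is one way to do that; clean_text is a hypothetical helper, and its result is similar to normalize-space() but not identical, because str.split() treats more characters as whitespace than XPath does.

```python
def clean_text(text_nodes):
    """Join text nodes and collapse whitespace runs into single spaces."""
    return " ".join("".join(text_nodes).split())

# Sample list shaped like the text() output shown above.
nodes = ["\n    ", "\n    In stock\n"]

print(repr(nodes[0]))           # '\n    '
print(repr(clean_text(nodes)))  # 'In stock'
```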

Pitfalls I ran into

Pitfall 1: forgetting [0]

xpath() returns a list for node and attribute queries.

For example

title = article.xpath("h3/a/@title")

gives

['A Light in the Attic']

not

A Light in the Attic

You need

title = article.xpath("h3/a/@title")[0]
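
Indexing with [0] fails on an empty list, so a tiny wrapper can make the intent explicit. first is a hypothetical helper, not something lxml provides.

```python
def first(results, default=None):
    """Return the first item of an xpath() result list, or default if it is empty.

    Hypothetical helper, not part of the original project or of lxml.
    """
    return results[0] if results else default

print(first(["A Light in the Attic"]))  # A Light in the Attic
print(repr(first([], default="")))      # ''
```

Usage would look like first(article.xpath("h3/a/@title"), default=""), which keeps the list-returning query but avoids an IndexError when nothing matches.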

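Pitfall 2: writing an absolute path where a relative one is needed

Wrong:

article.xpath("//img/@src")

A path starting with // searches from the document root, even when it is called on article.

So with [0], every book ends up with the first image on the page.

Correct:

article.xpath(".//img/@src")

This searches only inside the current book node.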
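
The effect is easy to reproduce on a two-book fragment. The markup and file names below are made up for the demonstration.

```python
from lxml import etree

html = """
<html><body>
  <article class="product_pod"><img src="a.jpg"></article>
  <article class="product_pod"><img src="b.jpg"></article>
</body></html>
"""
articles = etree.HTML(html).xpath("//article[@class='product_pod']")

# "//" ignores the element it is called on and searches the whole document.
wrong = [a.xpath("//img/@src")[0] for a in articles]
# ".//" restricts the search to descendants of each article.
right = [a.xpath(".//img/@src")[0] for a in articles]

print(wrong)  # ['a.jpg', 'a.jpg']
print(right)  # ['a.jpg', 'b.jpg']
```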

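Pitfall 3: text() does not return the real content

For example

stock = article.xpath(
    ".//p[@class='instock availability']/text()"
)

returns

['\n', ' In stock\n']

Wrap the path in

normalize-space()

or

string()

instead. Note that string() keeps the surrounding whitespace, so it still needs a .strip().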

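Pitfall 4: relative links cannot be requested directly

The extracted value is

../media/cache/xx.jpg

Requesting it as-is fails because it is not a complete URL (no scheme, no host).

It must first be converted with

urljoin()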

Pitfall 5: page encoding

On some sites

response.text

comes out garbled.

Fix:

response.encoding = response.apparent_encoding

or pass the raw bytes

response.content

to lxml and let it detect the encoding.
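
The garbling can be reproduced without any network access. As far as I know, requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, so a UTF-8 page gets decoded with the wrong table. The sketch below simulates that with plain bytes; it makes no claim about what any particular server actually sends.

```python
# The bytes a server would send for a UTF-8 page containing this price.
raw = "£51.77".encode("utf-8")

# Decoding with the wrong charset produces the typical garbled prefix.
garbled = raw.decode("iso-8859-1")
correct = raw.decode("utf-8")

print(garbled)  # Â£51.77
print(correct)  # £51.77

# If text was already mis-decoded this way, the round trip recovers it.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # £51.77
```

A stray Â in front of a currency symbol is a strong hint that this is what happened.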

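What multithreading buys you

Later in the project I added

ThreadPoolExecutor

Code:

with ThreadPoolExecutor(10) as pool:
    # list() consumes the results, so an exception raised in a worker surfaces here
    results = list(pool.map(parse_page, page_urls))

The effect is very noticeable.

Single-threaded:

Request pages one at a time
Parse pages one at a time

Multithreaded:

Request several pages at the same time

For a scraper, which is a

network I/O-bound task

the gain from concurrency is far larger than anything from optimizing the computation.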
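
The speedup can be observed offline by replacing the network request with a short sleep. fake_fetch is a stand-in invented for this sketch, and the timings it prints are not measurements of the real site.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(page):
    """Stand-in for a network request: wait briefly, then return a result."""
    time.sleep(0.1)
    return f"page-{page}"

pages = list(range(1, 11))

start = time.perf_counter()
sequential = [fake_fetch(p) for p in pages]
sequential_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() yields results in input order, regardless of completion order.
    threaded = list(pool.map(fake_fetch, pages))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

Because the threads spend their time waiting rather than computing, ten waits overlap and the total is close to a single wait. Against a real server, it is worth keeping max_workers modest so the site is not flooded with requests.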

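Reflections after learning XPath

When I first saw an XPath expression like

//div[@class='xxx']/a/text()

it looked complicated.

After writing scrapers for a few dozen pages, I realized that

XPath is really answering one question:

Where in the current HTML tree is the data I want?

Once you can read the element tree in the browser's developer tools, XPath gets easier and easier.

Locate the node
↓
Narrow the scope
↓
Find the target element
↓
Extract the attribute or text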

Complete code

import csv
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from lxml import etree

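url = "https://books.toscrape.com/"

# Collect the URLs of all catalogue pages by following the "next" link.
# To parallelize this step as well, build the page URLs in a loop
# instead of following the next links one by one.
def get_page_list():

    urls_list = [url]
    current_url = url
    num = 0
    while True:
        num += 1
        print(f"Fetching page list: page {num}")
        res = requests.get(current_url, timeout=10)
        res.raise_for_status()

        # Pass bytes so lxml reads the charset from the page itself.
        tree = etree.HTML(res.content)

        next_li = tree.xpath(".//li[@class='next']")

        # No "next" button means this is the last page.
        if len(next_li) == 0:
            break

        next_href = next_li[0].xpath("string(.//a/@href)")

        next_url = urljoin(current_url, next_href)

        current_url = next_url

        urls_list.append(current_url)

    return urls_list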


# Extract every book on one catalogue page.
def get_books(url):
    print(f"Scraping {url}")
    book_info = []
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    # res.content avoids a garbled currency symbol when res.text guesses the wrong encoding.
    tree = etree.HTML(res.content)

    articles = tree.xpath("//article[@class='product_pod']")

    for article in articles:
        title = article.xpath("string(.//h3/a/@title)")
        price = article.xpath("string(.//p[@class='price_color'])").strip()
        instock = article.xpath("string(.//p[@class='instock availability'])").strip()
        img_src = article.xpath("string(.//img/@src)")
        # Resolve the relative src against the page it was found on.
        img_url = urljoin(url, img_src)

        book_info.append([title, price, instock, img_url])

    return book_info


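# Download one cover image.
def save_image(url, i):

    print(f"Downloading image {i}")

    makedir = "download"
    os.makedirs(makedir, exist_ok=True)

    res = requests.get(url, timeout=20)
    res.raise_for_status()

    # Keep the extension from the URL (the covers are .jpg); fall back to .jpg.
    ext = os.path.splitext(url)[1] or ".jpg"
    filename = os.path.join(makedir, f"image_{i}{ext}")

    with open(filename, "wb") as f:
        f.write(res.content)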


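if __name__ == "__main__":

    urls_list = get_page_list()
    all_books = []

    with open("books.csv", "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "price", "stock", "cover"])

        with ThreadPoolExecutor(max_workers=10) as pool:
            # map() returns the per-page lists in the same order as urls_list.
            books = pool.map(get_books, urls_list)
            for book in books:
                all_books.extend(book)

            writer.writerows(all_books)

            # Queue the image downloads; leaving the with block waits for them.
            # Errors inside save_image go unreported because the futures are never checked.
            for i, book in enumerate(all_books):
                img_url = book[3]
                pool.submit(save_image, img_url, i)
        print("Done")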
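
The comment above get_page_list suggests building the page URLs in a loop so that even the first step needs no sequential requests. A minimal sketch of that idea follows. The URL pattern and the page count of 50 are assumptions about the site, and build_page_urls is a hypothetical name; verify both against the live site before relying on them.

```python
# Assumption: the catalogue has 50 pages, each reachable at catalogue/page-N.html.
BASE = "https://books.toscrape.com/catalogue/page-{}.html"
TOTAL_PAGES = 50

def build_page_urls(total_pages=TOTAL_PAGES):
    """Return the catalogue page URLs without making any requests."""
    return [BASE.format(n) for n in range(1, total_pages + 1)]

page_urls = build_page_urls()
print(page_urls[0])    # https://books.toscrape.com/catalogue/page-1.html
print(page_urls[-1])   # https://books.toscrape.com/catalogue/page-50.html
print(len(page_urls))  # 50
```

The trade-off is that a hard-coded count goes stale if the catalogue grows or shrinks, whereas following the next link always matches the site.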
