Python Web Scraping for Beginners (10): Scrapy Selectors in Detail

Introduction
1. CSS Selectors
- 1.1 Basic Usage
- 1.2 Extracting Attributes and Text
- 1.3 Nested Selection
2. XPath Selectors
- 2.1 Basic Usage
- 2.2 Extracting Attributes and Text
- 2.3 Conditional Filtering
3. Regular Expression Selectors
- 3.1 Basic Usage
- 3.2 Combining with Scrapy Selectors
4. PyQuery Selectors
- 4.1 Basic Usage
- 4.2 Extracting Attributes and Text
- 4.3 Nested Selection
Summary

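Introduction

Welcome to the "Python Web Scraping for Beginners" series. In a web crawler, selectors are the main tool for parsing pages and extracting content, and each kind of selector has its own strengths and use cases.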

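This article covers four common approaches: CSS selectors, XPath selectors, regular expressions, and PyQuery. Each comes with concrete examples of how to use it in Scrapy to extract data. The examples use URLs from https://jsonplaceholder.typicode.com. Those endpoints return JSON rather than HTML, so the class names in the CSS, XPath, and PyQuery examples (such as .username or div.comment-body) stand in for the markup of a real HTML page.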

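1. CSS Selectors

CSS selectors pick HTML elements using the same pattern syntax that CSS style rules use. They are concise and easy to read, which makes them one of the most commonly used selector types in Scrapy.

1.1 Basic Usage

In Scrapy, CSS selectors are applied with the response.css() method, which returns a list of matching selectors.

Commonly used patterns include tag selectors (h2), class selectors (.username), ID selectors (#main), and descendant selectors (div.comment-body p).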

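1.2 Extracting Attributes and Text

CSS selectors can extract an element's attributes and text content as well as select the element itself.

Use ::attr(attribute_name) to extract an attribute and ::text to extract text. Both pseudo-elements are Scrapy extensions, not part of standard CSS.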

import scrapy

class CSSSelectorSpider(scrapy.Spider):
    name = "css_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/users']

    def parse(self, response):
        # Extract the user names
        names = response.css('.username::text').getall()
        for name in names:
            yield {'name': name}

        # Extract the users' email addresses
        emails = response.css('.email::text').getall()
        for email in emails:
            yield {'email': email}

1.3 Nested Selection

CSS selectors can also be nested to select child elements starting from a parent element.

import scrapy

class CSSSelectorSpider(scrapy.Spider):
    name = "css_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/comments']

    def parse(self, response):
        # Select the text of every <p> inside a div with class "comment-body"
        comments = response.css('div.comment-body p::text').getall()
        for comment in comments:
            yield {'comment': comment}

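2. XPath Selectors

XPath selectors use path expressions to select nodes in an XML or HTML document.

XPath is powerful and supports complex node selection and conditional filtering.

2.1 Basic Usage

In Scrapy, XPath selectors are applied with the response.xpath() method.

The following example shows a typical XPath expression: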

import scrapy

class XPathSelectorSpider(scrapy.Spider):
    name = "xpath_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/posts']

    def parse(self, response):
        # Select all post titles
        titles = response.xpath('//h2/text()').getall()
        for title in titles:
            yield {'title': title}

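2.2 Extracting Attributes and Text

XPath selectors can likewise extract an element's attributes and text content.

Use @attribute_name to extract an attribute and text() to extract text.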

import scrapy

class XPathSelectorSpider(scrapy.Spider):
    name = "xpath_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/users']

    def parse(self, response):
        # Extract the user names
        names = response.xpath('//div[@class="username"]/text()').getall()
        for name in names:
            yield {'name': name}

        # Extract the users' email addresses
        emails = response.xpath('//div[@class="email"]/text()').getall()
        for email in emails:
            yield {'email': email}

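2.3 Conditional Filtering

XPath selectors support complex conditional filtering: predicates in square brackets can filter elements by attribute value or by text content.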

import scrapy

class XPathSelectorSpider(scrapy.Spider):
    name = "xpath_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/comments']

    def parse(self, response):
        # Select elements with a specific attribute value
        special_elements = response.xpath('//div[@data-type="example"]').getall()
        for element in special_elements:
            yield {'special_element': element}

        # Select elements that contain specific text
        specific_elements = response.xpath('//p[contains(text(), "specific text")]/text()').getall()
        for element in specific_elements:
            yield {'specific_element': element}

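3. Regular Expression Selectors

Regular expressions extract text by pattern matching.

They are well suited to text whose structure is irregular.

3.1 Basic Usage

In Scrapy, you can write regular expressions with Python's re module and apply them to response.text. response.body holds raw bytes, so matching against it requires a bytes pattern such as rb'...'.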

import scrapy
import re

class RegexSelectorSpider(scrapy.Spider):
    name = "regex_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/posts']

    def parse(self, response):
        # Extract all post IDs
        ids = re.findall(r'"id": (\d+)', response.text)
        for id_ in ids:
            yield {'id': id_}
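The same pattern can be tried without a network request. The sketch below uses a hypothetical excerpt shaped like pretty-printed JSON (SAMPLE_TEXT is an assumption, not captured output) and relaxes the pattern with \s* so it also tolerates a missing space after the colon. Because such text is valid JSON, the standard json module is shown next to it as a more robust option.

```python
import json
import re

# Hypothetical excerpt; the field names follow the spider's pattern.
SAMPLE_TEXT = """[
  {
    "userId": 1,
    "id": 1,
    "title": "first title"
  },
  {
    "userId": 1,
    "id": 2,
    "title": "second title"
  }
]"""

# \s* accepts both '"id": 1' and '"id":1'. "userId" is not matched,
# because the pattern requires a quote directly before "id".
ids = [int(match) for match in re.findall(r'"id":\s*(\d+)', SAMPLE_TEXT)]

# When the text is valid JSON, parsing it does not depend on formatting.
parsed_ids = [post["id"] for post in json.loads(SAMPLE_TEXT)]

print(ids, parsed_ids)
```

Regular expressions remain the better fit when the text is not well-formed JSON or HTML.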

3.2 Combining with Scrapy Selectors

Regular expressions can be combined with Scrapy's CSS or XPath selectors to extract specific content from the selected elements.

import scrapy
import re

class RegexSelectorSpider(scrapy.Spider):
    name = "regex_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/comments']

    def parse(self, response):
        # Select specific elements with a CSS selector, then extract content with a regular expression
        elements = response.css('div.comment-body').getall()
        for element in elements:
            emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', element)
            for email in emails:
                yield {'email': email}

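4. PyQuery Selectors

PyQuery is a jQuery-like Python library that offers a concise syntax for working with HTML documents.

4.1 Basic Usage

PyQuery is a separate third-party package (pip install pyquery), not part of Scrapy. Inside a Scrapy callback, you can pass response.text to it and query the result.

The following example shows a typical PyQuery selector: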

import scrapy
from pyquery import PyQuery as pq

class PyQuerySelectorSpider(scrapy.Spider):
    name = "pyquery_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/posts']

    def parse(self, response):
        # Select all post titles with PyQuery
        doc = pq(response.text)
        titles = doc('h2').items()
        for title in titles:
            yield {'title': title.text()}

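4.2 Extracting Attributes and Text

PyQuery makes it easy to extract an element's attributes and text content: .attr() reads an attribute and .text() returns the text.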

import scrapy
from pyquery import PyQuery as pq

class PyQuerySelectorSpider(scrapy.Spider):
    name = "pyquery_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/users']

    def parse(self, response):
        # Extract the usernames with PyQuery
        doc = pq(response.text)
        usernames = doc('.username').items()
        for username in usernames:
            yield {'username': username.text()}

        # Extract the email addresses
        emails = doc('.email').items()
        for email in emails:
            yield {'email': email.text()}

4.3 Nested Selection

PyQuery also supports nested selection, selecting child elements starting from a parent element.

import scrapy
from pyquery import PyQuery as pq

class PyQuerySelectorSpider(scrapy.Spider):
    name = "pyquery_selector_spider"
    start_urls = ['https://jsonplaceholder.typicode.com/comments']

    def parse(self, response):
        # Select all comments with PyQuery
        doc = pq(response.text)
        comments = doc('div.comment-body p').items()
        for comment in comments:
            yield {'comment': comment.text()}

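Summary

Each kind of selector has its own strengths and use cases. Once you are comfortable with CSS selectors, XPath selectors, regular expressions, and PyQuery, you can extract data from web pages effectively.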