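The Complete Guide to Using XPath in Web Scraping

📖 1. Introduction and Background

1.1 What Is XPath

XPath (XML Path Language) is a query language for locating nodes in XML and HTML documents. It uses path expressions to select a single node or a set of nodes.

Key features:

🎯 Precise targeting: locate elements by path, attribute, text, and other criteria
🌲 Tree traversal: navigate the document's tree structure
💪 Expressive: supports logical conditions and function calls
🔄 W3C standard: works across platforms and programming languages

1.2 Why Scrapers Need XPath

A scraper has to extract data from HTML. The main options are:

Method	Strengths	Weaknesses	Best for
Regular expressions	Flexible, fast	Hard to maintain, error-prone	Simple text extraction
BeautifulSoup	Easy to use, tolerant of bad markup	Relatively slow	Learning, messy pages
XPath	Powerful, precise, fast	Steep learning curve	Structured data extraction (recommended)
CSS selectors	Concise, familiar to front-end developers	Less expressive	Simple lookups

Core advantages of XPath:

✅ Can walk upward to a parent node (classic CSS selectors have no parent axis)

✅ Supports rich text matching and logical operators

✅ Performs well in lxml and Scrapy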

1.3 Where XPath Fits in a Scraper

HTTP request → HTML response → [XPath parsing] → structured data → storage/analysis
                                      ↓
                               DOM tree structure
                                      ↓
                             XPath path expression
                                      ↓
                               target data nodes

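🎯 2. Core Concepts and Terminology

2.1 Node Types

An HTML document is a tree of nodes of the following types:

html

<div id="content" class="main">
    <h1>Title</h1>
    <p>Paragraph text</p>
</div>

Node type	Description	XPath notation	Example
Element node	An HTML tag	Tag name	`div`, `h1`, `p`
Attribute node	An attribute of a tag	`@name`	`@id`, `@class`
Text node	Text inside a tag	`text()`	"Title", "Paragraph text"
Root node	The whole document	`/`	Starting point of the document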
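
As a quick check of how these node types come back in Python, here is a minimal sketch using lxml (the library this guide uses later). The sample markup and the variable names are illustrative assumptions.

```python
from lxml import etree

# Illustrative markup; etree.HTML() wraps the fragment in <html><body>
html = '<div id="content" class="main"><h1>Title</h1><p>Paragraph text</p></div>'
root = etree.HTML(html)

elements = root.xpath('//div')        # element nodes -> list of Element objects
attrs = root.xpath('//div/@id')       # attribute nodes -> list of strings
texts = root.xpath('//div/*/text()')  # text nodes -> list of strings

print(elements[0].tag)  # div
print(attrs)            # ['content']
print(texts)            # ['Title', 'Paragraph text']
```

Element results can be queried further with `xpath()`, while attribute and text results are plain strings.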

2.2 Document Tree Structure

                    root node (/)
                       |
                    <html>
                    /    \
               <head>    <body>
                            |
                         <div>
                         /   \
                      <h1>   <p>
                       |      |
                   "Title"  "Text"

Relationship terms:

Parent: <div> is the parent of <h1>
Children: <h1> and <p> are children of <div>
Siblings: <h1> and <p> are siblings of each other
Ancestor: <html> is an ancestor of <h1>
Descendant: <h1> is a descendant of <html>

2.3 Path Expression Basics

XPath selects nodes with path expressions:

xpath

/html/body/div/h1          # absolute path
//h1                        # any h1 in the document, at any depth
//div[@class="main"]/h1    # path with a condition
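📝 3. XPath Syntax Basics

3.1 Path Types

🔹 Absolute paths (start at the root node)

xpath

/html/body/div/ul/li[1]/a

Characteristics:

✅ Exact and unambiguous
❌ Brittle (breaks as soon as the page structure changes)
❌ Verbose

Best for: pages with a simple, fixed structure

🔹 Relative paths (start from any node)

xpath

//div[@class="content"]//a

Characteristics:

✅ Flexible and concise
✅ Tolerant of layout changes
⚠️ May match more than one node

Best for: the default choice in scraper development

Strictly speaking, an expression that begins with `//` is still evaluated from the document root and simply searches every depth. A truly relative path begins with `.` or a node name and is evaluated against the current node, for example `.//span`.

3.2 Node Selection Syntax

Expression	Description	Example	Result
`nodename`	Selects elements with that tag name	`//div`	All div elements
`/`	Selects from the root node	`/html`	The html element under the root
`//`	Selects matching nodes at any depth	`//a`	All a elements in the document
`.`	Current node	`.//span`	span elements under the current node
`..`	Parent node	`../div`	div children of the parent node
`@`	Selects an attribute	`//a/@href`	The href attribute of every a element

3.3 Predicates: Filtering by Condition

A predicate is written inside [] and filters the selected nodes:

🔹 Selecting by index

xpath

//ul/li[1]          # the first li of each ul (indexes start at 1)
//ul/li[last()]     # the last li of each ul
//ul/li[position()<3]  # the first two li of each ul

⚠️ Note: XPath indexes start at 1, not 0!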
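
A point that often surprises people is that an index predicate applies per parent, not to the whole result. The sketch below, with made-up markup, contrasts `//ul/li[1]` with the parenthesized form `(//ul/li)[1]`.

```python
from lxml import etree

# Illustrative markup with two lists
html = """
<div>
  <ul><li>a1</li><li>a2</li><li>a3</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
"""
root = etree.HTML(html)

first_in_each = root.xpath('//ul/li[1]/text()')            # first li of every ul
first_overall = root.xpath('(//ul/li)[1]/text()')          # first li in the whole document
last_in_each = root.xpath('//ul/li[last()]/text()')        # last li of every ul
first_two = root.xpath('//ul[1]/li[position() < 3]/text()')  # first two li of the first ul

print(first_in_each)  # ['a1', 'b1']
print(first_overall)  # ['a1']
print(last_in_each)   # ['a3', 'b2']
print(first_two)      # ['a1', 'a2']
```

Wrapping the path in parentheses makes the index count over the combined node set instead of over each parent's children.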

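🔹 Matching attributes

xpath

//div[@id="main"]                  # id equals main
//a[@class="link"]                 # class equals link
//img[@src]                        # has a src attribute
//input[@type="text"]              # type equals text

🔹 Matching text

xpath

//h1[text()="Title"]                 # text is exactly "Title"
//div[contains(text(), "keyword")]   # text contains "keyword"
//span[starts-with(text(), "prefix")] # text starts with "prefix"

Note: in XPath 1.0, which is what lxml implements, `contains(text(), ...)` and `starts-with(text(), ...)` only inspect the first text node of the element. Use `contains(., ...)` to test the element's full text, including text inside child tags.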
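
To make the text-matching behaviour concrete, here is a small sketch with invented markup. It compares `contains(text(), ...)`, which in XPath 1.0 looks only at an element's first text node, with two alternatives that look at more of the text.

```python
from lxml import etree

# The first <p> has mixed content: text, a child tag, then more text
html = '<div><p>Price: <b>42</b> USD (sale)</p><p>sale ends soon</p></div>'
root = etree.HTML(html)

# Only the first text node of each <p> is tested ("Price: " for the first one)
first_node_only = root.xpath('//p[contains(text(), "sale")]')

# "." is the full string value of the element, child tags included
full_text = root.xpath('//p[contains(., "sale")]')

# Tests every direct text node of the element individually
any_text_node = root.xpath('//p[text()[contains(., "sale")]]')

print(len(first_node_only))  # 1
print(len(full_text))        # 2
print(len(any_text_node))    # 2
```

When an element contains inline tags such as `<b>` or `<span>`, `contains(., ...)` is usually the safer choice.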

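3.4 Wildcards

Wildcard	Description	Example	Matches
`*`	Any element node	`/html/body/*`	All child elements of body
`@*`	Any attribute	`//div[@*]`	div elements that have at least one attribute
`node()`	Any node of any type	`//div/node()`	All child nodes of div (elements, text, comments)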

⚡ 4. Quick Reference: Common Expressions

4.1 Basic Selection

xpath

# All div elements
//div

# The div whose id is content
//div[@id="content"]

# The div whose class is exactly article
//div[@class="article"]

# Several attribute conditions at once
//input[@type="text" and @name="username"]

# div or span
//div | //span

4.2 Extracting Attributes

xpath

# The href attribute
//a/@href

# The src attribute
//img/@src

# The data-id attribute
//div/@data-id

# The alt attribute
//img/@alt

4.3 Extracting Text

xpath

# Direct text content of the tag
//h1/text()

# Text of the tag and all of its descendants
//div//text()

# Whitespace-normalized text of the FIRST match (returns a string, not a list)
normalize-space(//p/text())

# Elements containing specific text
//div[contains(text(), "keyword")]

4.4 Hierarchy

xpath

# Children (direct child nodes)
//div/p

# Descendants (any depth)
//div//p

# Parent
//a[@class="link"]/..

# Siblings
//h1/following-sibling::p           # siblings after h1
//h1/preceding-sibling::div         # siblings before h1

# Ancestors
//span/ancestor::div                # every ancestor div
//span/ancestor::div[1]             # the nearest ancestor div
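
The axes above can be tried out on a small tree. This is a sketch with made-up markup and class names; it shows the parent step, `following-sibling`, and the difference between `ancestor::div` and `ancestor::div[1]`.

```python
from lxml import etree

# Illustrative product card
html = """
<div class="card">
  <h2>Laptop</h2>
  <p>Spec A</p>
  <p>Spec B</p>
  <div class="footer"><span>In stock</span></div>
</div>
"""
root = etree.HTML(html)

parent_class = root.xpath('//h2/../@class')                # parent of h2
specs = root.xpath('//h2/following-sibling::p/text()')     # p siblings after h2
all_ancestors = root.xpath('//span/ancestor::div/@class')  # results come back in document order
nearest = root.xpath('//span/ancestor::div[1]/@class')     # [1] on a reverse axis = closest

print(parent_class)   # ['card']
print(specs)          # ['Spec A', 'Spec B']
print(all_ancestors)  # ['card', 'footer']
print(nearest)        # ['footer']
```

On reverse axes such as `ancestor` and `preceding-sibling`, position 1 is the node closest to the starting point.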

4.5 Combining Conditions

xpath

# and
//div[@class="main" and @id="content"]

# or
//div[@class="main" or @class="sidebar"]

# not
//div[not(@class)]                  # has no class attribute

# several conditions
//a[@href and contains(text(), "details")]

4.6 Common Functions

xpath

# contains(): substring test
//div[contains(@class, "item")]     # class contains "item"
//p[contains(text(), "keyword")]    # text contains "keyword"

# starts-with(): prefix test
//a[starts-with(@href, "https")]    # href starts with https
//div[starts-with(@id, "post-")]    # id starts with post-

# normalize-space(): trim and collapse whitespace
//p[normalize-space(text())="Title"]  # ignores leading and trailing whitespace

# string-length(): length of a string
//input[@name and string-length(@name) > 0]

# substring(): part of a string (positions start at 1)
//div[substring(@id, 1, 4) = "post"]

🐍 5. Using XPath in Python (lxml)

5.1 Installation

bash

# Install lxml
pip install lxml

# Install requests (used to fetch pages)
pip install requests

5.2 Basic Usage

python

from lxml import etree
import requests

# Option 1: parse an HTML string
html_str = """
<html>
    <body>
        <div id="content">
            <h1>Title</h1>
            <p class="desc">Description</p>
            <a href="https://example.com">Link</a>
        </div>
    </body>
</html>
"""

# etree.HTML() parses the markup and returns the root element of the tree
tree = etree.HTML(html_str)

# Extract data with XPath
title = tree.xpath('//h1/text()')           # ['Title']
desc = tree.xpath('//p[@class="desc"]/text()')  # ['Description']
link = tree.xpath('//a/@href')              # ['https://example.com']

print(title[0])    # prints: Title

# Option 2: fetch a page and parse it
# (stored in a separate variable so the results above are not overwritten)
response = requests.get('https://example.com', timeout=10)
remote_tree = etree.HTML(response.text)

5.3 Full Example: Scraping a News List

python

from lxml import etree
import requests

def crawl_news():
    """
    Example: scrape headlines and links from a news site
    """
    url = "https://news.example.com"

    # Send the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)

    # Parse the HTML (passing bytes lets lxml detect the encoding)
    tree = etree.HTML(response.content)

    # Select the news entries
    # Assumes each entry sits in <div class="news-item">
    news_items = tree.xpath('//div[@class="news-item"]')

    news_list = []
    for item in news_items:
        # Title (relative to the current node)
        title = item.xpath('.//h2/a/text()')
        # Link
        link = item.xpath('.//h2/a/@href')
        # Publication time (named so it does not shadow the time module)
        published = item.xpath('.//span[@class="time"]/text()')

        if title and link:
            news_list.append({
                'title': title[0].strip(),
                'link': link[0],
                'time': published[0].strip() if published else ''
            })

    return news_list

# Usage
if __name__ == '__main__':
    news = crawl_news()
    for item in news:
        print(f"Title: {item['title']}")
        print(f"Link: {item['link']}")
        print(f"Time: {item['time']}")
        print("-" * 50)
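
The `href` values extracted this way are often relative. With plain lxml there is no `response.urljoin()`, so one option is the standard library's `urljoin`. The sketch below assumes a hypothetical page URL and markup.

```python
from urllib.parse import urljoin
from lxml import etree

# Hypothetical URL of the page the HTML came from
base_url = "https://news.example.com/list/"

# Illustrative markup with a root-relative and a page-relative link
html = (
    '<div class="news-item"><h2><a href="/a/1.html">First</a></h2></div>'
    '<div class="news-item"><h2><a href="2.html">Second</a></h2></div>'
)
root = etree.HTML(html)

hrefs = root.xpath('//div[@class="news-item"]//a/@href')
links = [urljoin(base_url, href) for href in hrefs]

print(links)
# ['https://news.example.com/a/1.html', 'https://news.example.com/list/2.html']
```

Passing the final response URL (after redirects) as the base keeps the joined links correct.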

5.4 Common Patterns

python

# xpath() with a path expression returns a list
result = tree.xpath('//div')
print(type(result))  # <class 'list'>

# Element paths yield Element objects, which support further xpath() calls
divs = tree.xpath('//div[@class="main"]')
for div in divs:
    # Search below the current div only
    title = div.xpath('.//h1/text()')
    print(title)

# Take the first match (raises IndexError when nothing matched)
first_div = tree.xpath('//div[@class="main"]')[0]

# lxml has no counterpart to Scrapy's get()/extract_first(); a small helper fills the gap
def get_first(xpath_result, default=''):
    """Return the first result, or default when the list is empty."""
    return xpath_result[0] if xpath_result else default

title = get_first(tree.xpath('//h1/text()'), 'Default title')
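
The return type of `xpath()` depends on the expression, which matters when writing helpers like `get_first`. A minimal sketch with invented markup:

```python
from lxml import etree

root = etree.HTML('<div><a href="/x">one</a><a href="/y">two</a></div>')

nodes = root.xpath('//a')                          # list of Element objects
hrefs = root.xpath('//a/@href')                    # list of strings
count = root.xpath('count(//a)')                   # float
has_x = root.xpath('boolean(//a[@href="/x"])')     # bool
first_text = root.xpath('string(//a)')             # string value of the first match

print(type(nodes).__name__, len(nodes))  # list 2
print(hrefs)                             # ['/x', '/y']
print(count)                             # 2.0
print(has_x)                             # True
print(first_text)                        # one
```

Only path expressions return lists; functions such as `count()`, `boolean()`, `string()`, and `normalize-space()` return a single value.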

🕷️ 6. XPath in the Scrapy Framework

6.1 The Scrapy Selector Object

Scrapy provides a Selector object that wraps both XPath and CSS selectors:

python

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['https://news.example.com']

    def parse(self, response):
        # The response object has its own xpath() method

        # All news titles
        titles = response.xpath('//h2[@class="title"]/text()').getall()

        # The first title
        first_title = response.xpath('//h2[@class="title"]/text()').get()

        # Links
        links = response.xpath('//a[@class="link"]/@href').getall()

        # Iterate over the news entries
        for news in response.xpath('//div[@class="news-item"]'):
            yield {
                'title': news.xpath('.//h2/text()').get(),
                'link': news.xpath('.//a/@href').get(),
                'summary': news.xpath('.//p[@class="summary"]/text()').get()
            }

6.2 XPath Methods in Scrapy

Method	Description	Returns	Example
`.xpath()`	Returns a list of selectors	`SelectorList`	`response.xpath('//div')`
`.get()`	First result	`str` or `None`	`response.xpath('//h1/text()').get()`
`.getall()`	All results	`list`	`response.xpath('//a/text()').getall()`
`.re()`	Extract with a regular expression	`list`	`response.xpath('//p/text()').re(r'\d+')`
`.re_first()`	First regular-expression match	`str` or `None`	`response.xpath('//p/text()').re_first(r'\d+')`

6.3 Full Scrapy Spider Example

python

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        """Parse a book listing page."""
        # Select every book on the page
        books = response.xpath('//article[@class="product_pod"]')

        for book in books:
            yield {
                # Title
                'title': book.xpath('.//h3/a/@title').get(),
                # Price with the currency symbol removed
                'price': book.xpath('.//p[@class="price_color"]/text()').re_first(r'[\d.]+'),
                # Rating, encoded in the class attribute
                'rating': book.xpath('.//p[contains(@class, "star-rating")]/@class').re_first(r'star-rating (\w+)'),
                # Link (joined into an absolute URL)
                'link': response.urljoin(book.xpath('.//h3/a/@href').get()),
                # Stock status; normalize-space() trims the whitespace around the text
                'stock': book.xpath('normalize-space(.//p[contains(@class, "availability")])').get()
            }

        # Follow the next page
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

6.4 Debugging with Scrapy Shell

bash

# Start Scrapy Shell
scrapy shell "https://news.example.com"

# Test XPath expressions inside the shell
>>> response.xpath('//h1/text()').get()
'News headline'

>>> response.xpath('//div[@class="content"]//p/text()').getall()
['Paragraph 1', 'Paragraph 2', 'Paragraph 3']

# Count the matched elements
>>> len(response.xpath('//a'))
120

# Open the downloaded page in a browser with view()
>>> view(response)

💼 7. Worked Examples

7.1 Case 1: Scraping an E-commerce Product List

python

from lxml import etree
import requests

def first_text(node, xpath, default=''):
    """Return the first stripped string matched by xpath, or default."""
    result = node.xpath(xpath)
    return result[0].strip() if result else default

def crawl_products(url):
    """
    Scrape product data from an e-commerce listing page.
    Fields: name, price, rating, sales
    """
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    tree = etree.HTML(response.text)

    # Select the product entries
    products = tree.xpath('//div[@class="product-item"]')

    product_list = []
    for product in products:
        item = {
            # Product name
            'name': first_text(product, './/h3[@class="title"]/a/text()'),

            # Price (currency symbol removed)
            'price': first_text(product, './/span[@class="price"]/text()').replace('¥', '').strip(),

            # Rating (may be missing)
            'rating': first_text(product, './/span[@class="rating"]/text()', 'N/A'),

            # Sales count ('已售' is the page's "sold" label)
            'sales': first_text(product, './/span[@class="sales"]/text()').replace('已售', '').strip(),

            # Product link
            'link': first_text(product, './/h3[@class="title"]/a/@href'),

            # Shop name
            'shop': first_text(product, './/span[@class="shop-name"]/text()')
        }
        product_list.append(item)

    return product_list

7.2 Case 2: Extracting Table Data

python

from lxml import etree

def parse_table(html):
    """
    Parse an HTML table.
    Example table structure:
    <table>
        <thead>
            <tr><th>Name</th><th>Age</th></tr>
        </thead>
        <tbody>
            <tr><td>Zhang San</td><td>25</td></tr>
            <tr><td>Li Si</td><td>30</td></tr>
        </tbody>
    </table>
    """
    tree = etree.HTML(html)

    # Header cells
    headers = tree.xpath('//table/thead/tr/th/text()')

    # All body rows
    rows = tree.xpath('//table/tbody/tr')

    data = []
    for row in rows:
        # All cells of the current row
        cells = row.xpath('./td/text()')
        # Combine headers and cells into a dict
        row_data = dict(zip(headers, cells))
        data.append(row_data)

    return data

# Result:
# [
#     {'Name': 'Zhang San', 'Age': '25'},
#     {'Name': 'Li Si', 'Age': '30'}
# ]
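
One caveat with the paths above: browsers add `<tbody>` to the DOM even when the page source lacks it, while lxml's HTML parser keeps the source as written. A path copied from the browser can therefore match nothing. The sketch below uses illustrative markup without `<thead>`/`<tbody>` and a path that works either way.

```python
from lxml import etree

# Illustrative table whose source has no <thead> or <tbody>
html = (
    "<table>"
    "<tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Zhang San</td><td>25</td></tr>"
    "</table>"
)
root = etree.HTML(html)

# The browser-style path finds nothing because lxml did not insert <tbody>
browser_style = root.xpath('//table/tbody/tr')

# Tolerant paths: any tr under the table, split by cell type
headers = root.xpath('//table//tr/th/text()')
rows = root.xpath('//table//tr[td]')
data = [dict(zip(headers, row.xpath('./td/text()'))) for row in rows]

print(browser_style)  # []
print(data)           # [{'Name': 'Zhang San', 'Age': '25'}]
```

Using `//table//tr[td]` selects data rows whether or not the source wraps them in `<tbody>`.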

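7.3 Case 3: Elements with Multiple Classes

html

<!-- Problem: the class attribute holds several values -->
<div class="item product featured">Product</div>

python

# ❌ Wrong: compares the whole attribute string, so it matches nothing here
tree.xpath('//div[@class="item"]')

# ✅ Option 1: contains() (simple, but a substring test that also hits "item-list" or "subitem")
tree.xpath('//div[contains(@class, "item")]')

# ✅ Option 2: match "item" as a whole class token (most accurate)
tree.xpath('//div[contains(concat(" ", normalize-space(@class), " "), " item ")]')

# ✅ Most common in practice: contains() with a class name that is distinctive enough
tree.xpath('//div[contains(@class, "product")]')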
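
The difference between the substring test and the whole-token test shows up as soon as class names share a prefix. A short sketch with invented class names:

```python
from lxml import etree

# Illustrative markup: only the first div really has the class "item"
html = (
    '<div class="item product">A</div>'
    '<div class="item-list">B</div>'
    '<div class="subitem">C</div>'
)
root = etree.HTML(html)

# Substring test: matches any class attribute that contains the letters "item"
loose = root.xpath('//div[contains(@class, "item")]/text()')

# Token test: pad the attribute with spaces and look for " item "
token_expr = '//div[contains(concat(" ", normalize-space(@class), " "), " item ")]/text()'
strict = root.xpath(token_expr)

print(loose)   # ['A', 'B', 'C']
print(strict)  # ['A']
```

The token form behaves like the CSS selector `div.item`, at the cost of a longer expression.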

7.4 Case 4: Selecting Every Element After the Nth

python

from lxml import etree

# Goal: select every li after the third one
html = """
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
    <li>Item 5</li>
</ul>
"""

tree = etree.HTML(html)

# Option 1: position()
items = tree.xpath('//ul/li[position() > 3]/text()')
# Result: ['Item 4', 'Item 5']

# Option 2: following-sibling
items = tree.xpath('//ul/li[3]/following-sibling::li/text()')
# Result: ['Item 4', 'Item 5']

7.5 Case 5: Scraping Paginated Data

python

import scrapy

class PaginationSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['https://example.com/page/1']

    def parse(self, response):
        # Extract the items on the current page
        items = response.xpath('//div[@class="item"]')
        for item in items:
            yield {
                'title': item.xpath('.//h2/text()').get(),
                'content': item.xpath('.//p/text()').get()
            }

        # Option 1: follow the "next page" link
        next_page = response.xpath('//a[@class="next-page"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
            return  # do not also schedule the same page through Option 2

        # Option 2 (fallback when there is no next link): build the URL from page numbers
        current_page = response.xpath('//span[@class="current"]/text()').get()
        total_pages = response.xpath('//span[@class="total"]/text()').get()

        if current_page and total_pages:
            current = int(current_page)
            total = int(total_pages)

            if current < total:
                next_url = f'https://example.com/page/{current + 1}'
                yield scrapy.Request(next_url, callback=self.parse)

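🔧 8. Troubleshooting and Debugging

8.1 Elements Not Found

Problem 1: the XPath looks right but returns an empty list

python

result = tree.xpath('//div[@class="content"]')
print(result)  # []

Possible causes:

The page loads its content with JavaScript (XPath only sees the HTML the server returned)
The XPath expression is wrong
The response was decoded with the wrong encoding

Solutions:

python

# Option 1: check whether the content is loaded dynamically
# If it is, render the page with Selenium or Playwright first
from selenium import webdriver
from lxml import etree

url = 'https://example.com'  # placeholder URL
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()
tree = etree.HTML(html)

# Option 2: print the raw HTML and inspect it
print(response.text[:1000])  # first 1000 characters

# Option 3: verify the XPath in the browser's developer tools
# In the Chrome console, run:
# $x('//div[@class="content"]')

Problem 2: the XPath copied from Chrome does not work

xpath

# XPath copied from Chrome (absolute path, not recommended)
/html/body/div[1]/div[2]/div[3]/ul/li[1]/a

Problem: it breaks as soon as the page structure changes slightly

Solution: switch to a `//` path anchored on attributes

xpath

# Improved (more stable)
//ul[@class="menu"]/li/a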

8.2 Handling Whitespace

python

# Problem: the extracted text carries whitespace
text = tree.xpath('//p/text()')[0]
print(repr(text))  # '  \n\tText content\n  '

# Solution 1: Python's strip()
text = tree.xpath('//p/text()')[0].strip()

# Solution 2: XPath's normalize-space()
text = tree.xpath('normalize-space(//p/text())')
# Note: normalize-space() returns a single string (built from the first match), not a list
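8.3 Namespaces

Some XML documents use namespaces:

xml

<root xmlns:ns="http://example.com/ns">
    <ns:item>Data</ns:item>
</root>

python

from lxml import etree

xml_str = '<root xmlns:ns="http://example.com/ns"><ns:item>Data</ns:item></root>'
tree = etree.fromstring(xml_str.encode())

# ❌ Fails: lxml raises XPathEvalError because the prefix "ns" is undefined
# tree.xpath('//ns:item/text()')

# ✅ Option 1 (standard approach): pass a prefix-to-URI mapping
namespaces = {'ns': 'http://example.com/ns'}
result = tree.xpath('//ns:item/text()', namespaces=namespaces)

# ✅ Option 2 (handy for quick scripts): strip the namespaces, then query without prefixes
# This modifies the tree, so the prefixed query from Option 1 no longer matches afterwards
for elem in tree.iter():
    # Comments and processing instructions have a non-string tag
    if isinstance(elem.tag, str) and '}' in elem.tag:
        elem.tag = elem.tag.split('}', 1)[1]

result = tree.xpath('//item/text()')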
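
A related trap is the default namespace, declared as `xmlns="..."` without a prefix: elements look unprefixed in the source, yet an unprefixed XPath does not match them. The sketch below uses a tiny Atom-style document as an illustration; the prefix name `atom` is an arbitrary choice.

```python
from lxml import etree

# Illustrative document with a default (unprefixed) namespace
xml = b'<feed xmlns="http://www.w3.org/2005/Atom"><entry><title>Post</title></entry></feed>'
root = etree.fromstring(xml)

# Unprefixed names in XPath mean "no namespace", so nothing matches
plain = root.xpath('//title/text()')

# Bind any prefix of your choosing to the namespace URI
ns = {'atom': 'http://www.w3.org/2005/Atom'}
with_prefix = root.xpath('//atom:title/text()', namespaces=ns)

# Alternative without a mapping: compare local names only
by_local_name = root.xpath('//*[local-name()="title"]/text()')

print(plain)          # []
print(with_prefix)    # ['Post']
print(by_local_name)  # ['Post']
```

The prefix used in the expression does not have to match the document; only the URI matters.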

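8.4 Attributes with Special Characters in Their Names

html

<div data-item-id="12345">Content</div>

python

# ✅ Hyphens are legal in attribute names, so no quoting or escaping is needed
item_id = tree.xpath('//div/@data-item-id')[0]

# ⚠️ The hyphen only matters in arithmetic: put spaces around a minus sign,
# because `@a-1` is read as an attribute named "a-1", not as "@a minus 1"
# tree.xpath('//div[@data-item-id - 1 = 12344]')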

8.5 Recommended Debugging Tools

🔹 Chrome DevTools

javascript

// Test an XPath in the Chrome console
$x('//div[@class="content"]')  // returns an array of matching elements

// Test a CSS selector
$$('div.content')

🔹 XPath Helper (Chrome extension)

Tests XPath expressions live
Highlights the matched elements
Shows the number of matches

🔹 Scrapy Shell

bash

scrapy shell "https://example.com"

# Test XPath expressions
>>> response.xpath('//h1/text()').get()
>>> len(response.xpath('//div'))

🔹 Online XPath testers (see section 11.2)

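8.6 Common Errors at a Glance

Error message	Cause	Fix
`SyntaxError: Invalid predicate`	XPath syntax error	Check that brackets and quotes are balanced
`IndexError: list index out of range`	Nothing matched, so `[0]` fails	Use Scrapy's `.get()`, or check that the list is non-empty first
`AttributeError: 'NoneType' object has no attribute`	The element does not exist and a lookup returned `None`	Guard with `if`, or use `.get(default=...)` in Scrapy
`lxml.etree.XPathEvalError`	Invalid XPath expression	Check the syntax; validate it in an online tester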
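
One way to keep a long crawl alive when an expression is malformed is to catch lxml's XPath exceptions in a single place. This is a sketch, not a prescribed pattern; `try_xpath` is a hypothetical helper name.

```python
from lxml import etree

root = etree.HTML('<div><h1>Title</h1></div>')

def try_xpath(node, expr):
    """Evaluate expr on node; log and return [] if the expression is invalid.

    Hypothetical helper: etree.XPathError is the base class of lxml's
    XPath exceptions, including XPathEvalError.
    """
    try:
        return node.xpath(expr)
    except etree.XPathError as exc:
        print(f"bad XPath {expr!r}: {exc}")
        return []

ok = try_xpath(root, '//h1/text()')
bad = try_xpath(root, '//h1[')      # unbalanced bracket
empty = try_xpath(root, '//h2/text()')

print(ok)     # ['Title']
print(bad)    # []
print(empty)  # []
```

A valid expression that matches nothing returns an empty list without raising, so the `IndexError` case still needs an emptiness check before indexing.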

⚖️ 9. XPath vs. CSS Selectors

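9.1 Syntax Comparison

Goal	XPath	CSS selector
All div elements	`//div`	`div`
id=main	`//div[@id="main"]`	`div#main`
class=item	`//div[@class="item"]` (exact match on the whole attribute)	`div.item` (matches one class among several)
Child elements	`//div/p`	`div > p`
Descendant elements	`//div//p`	`div p`
First of its kind	`//div[1]`	`div:first-of-type`
Last of its kind	`//div[last()]`	`div:last-of-type`
Attribute value	`//a/@href`	Not in standard CSS (Scrapy adds `a::attr(href)`)
Text	`//p/text()`	Not in standard CSS (Scrapy adds `p::text`)
Parent element	`//a/..`	❌ Not supported
Text contains	`//p[contains(text(), "keyword")]`	❌ Not in standard CSS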

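9.2 Performance Comparison

python

import timeit
from lxml import etree

html = '<div>' + '<p>text</p>' * 10000 + '</div>'
tree = etree.HTML(html)

# Time the XPath query (100 runs for a steadier number)
xpath_time = timeit.timeit(lambda: tree.xpath('//p'), number=100)
print(f"XPath: {xpath_time:.4f}s")

# Time the CSS selector (requires: pip install cssselect)
css_time = timeit.timeit(lambda: tree.cssselect('p'), number=100)
print(f"CSS: {css_time:.4f}s")

# Typical outcome: the two are close, with XPath slightly ahead

Conclusion: lxml translates CSS selectors into XPath before running them, so CSS carries a small translation cost and XPath tends to be marginally faster. The difference rarely matters in practice.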

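9.3 Which One to Choose

Scenario	Recommendation	Reason
Walking up to a parent node	XPath	CSS has no equivalent
Matching on text	XPath	Standard CSS has no equivalent
Simple tag or class selection	CSS	More concise syntax
Complex logical conditions	XPath	More expressive
Scrapy projects	XPath	Both are supported; CSS is translated to XPath internally
Developers with a front-end background	CSS	More familiar

Personal recommendations:

🎯 Default to XPath for scraping projects (powerful and flexible)
🎯 Use CSS for simple cases (faster to write)
🎯 Mix the two freely, depending on the task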

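🚀 10. Advanced Techniques and Best Practices

10.1 Performance

🔹 Avoid overusing `//`

python

# ❌ Inefficient (every // triggers another full descendant search)
tree.xpath('//div//p//a//span')

# ✅ Efficient (locate the div first, then search below it)
tree.xpath('//div[@class="content"]//span')

🔹 Use precise paths

python

# ❌ Vague (matches every div)
tree.xpath('//div/text()')

# ✅ Precise (limits the scope)
tree.xpath('//div[@class="article"]//p/text()')

🔹 Cache frequently used nodes

python

# ❌ Searches the whole document on every iteration
for i in range(1, 101):  # XPath positions start at 1
    title = tree.xpath('//div[@class="item"][{}]//h2/text()'.format(i))

# ✅ Select all items once, then iterate
items = tree.xpath('//div[@class="item"]')
for item in items:
    title = item.xpath('.//h2/text()')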
10.2 Handling Complex Scenarios

🔹 Extracting dynamically generated content

python

# Scenario: the content is rendered by JavaScript
# Option 1: Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from lxml import etree

url = 'https://example.com'  # placeholder URL

driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait explicitly until the dynamic element is present
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
    )
    html = driver.page_source
finally:
    driver.quit()

tree = etree.HTML(html)
result = tree.xpath('//div[@class="dynamic-content"]/text()')

# Option 2: Playwright (more modern)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector('.dynamic-content')  # wait for the element to load

    html = page.content()
    tree = etree.HTML(html)
    result = tree.xpath('//div[@class="dynamic-content"]/text()')

    browser.close()

🔹 Coping with anti-scraping measures (dynamic class names)

python

# Scenario: the site appends a random suffix to class names
# <div class="item_a3b2c1">Product</div>

# Option 1: starts-with()
tree.xpath('//div[starts-with(@class, "item_")]')

# Option 2: contains()
tree.xpath('//div[contains(@class, "item")]')

# Option 3: locate by content
tree.xpath('//div[contains(text(), "Product")]')

# Option 4: locate by position in the hierarchy
tree.xpath('//section[@id="products"]//div')

🔹 Extracting JSON embedded in HTML

python

# Scenario: the data is embedded in a <script> tag
# <script type="application/json">{"name": "Product"}</script>

import json

json_str = tree.xpath('//script[@type="application/json"]/text()')[0]
data = json.loads(json_str)
print(data['name'])  # Product
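
Real pages can contain several such script blocks, and some of them may not be valid JSON. The following sketch, with invented markup, collects only the blocks that parse and skips the rest.

```python
import json
from lxml import etree

# Illustrative page: one valid JSON block and one broken one
html = """
<html><head>
<script type="application/json">{"name": "Widget", "price": 9.5}</script>
<script type="application/json">{broken json</script>
</head></html>
"""
root = etree.HTML(html)

payloads = []
for raw in root.xpath('//script[@type="application/json"]/text()'):
    try:
        payloads.append(json.loads(raw))
    except json.JSONDecodeError:
        # Skip blocks that are not valid JSON instead of aborting the crawl
        continue

print(payloads)  # [{'name': 'Widget', 'price': 9.5}]
```

Iterating over the result also avoids the `IndexError` that `[0]` raises when the page has no matching script tag.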

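10.3 Best-Practice Checklist

✅ DO:

Prefer `//` paths anchored on attributes (//div) over absolute paths (/html/body/div)
Locate elements by id, class, and other attributes first
Use contains() for elements with several classes (or the whole-token form when class names overlap)
Use normalize-space() to clean up whitespace in text
Select the list of items first, then iterate over it (better performance)
Test expressions in Scrapy Shell or the browser
Give extracted values a default (prevents crashes)

❌ DON'T:

Rely on the absolute paths that Chrome generates
Chain `//` more than necessary (performance cost)
Assume an element always exists (check first)
Search the whole document again inside a loop
Hard-code positional indexes ([1], [2]) in XPath expressions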

10.4 Code Template

python

from lxml import etree
import requests

class Crawler:
    """Base crawler class (template for XPath best practices)"""

    def __init__(self, url):
        self.url = url
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_tree(self):
        """Fetch the page and return the parsed tree"""
        response = requests.get(self.url, headers=self.headers, timeout=10)
        response.raise_for_status()
        # Pass bytes so lxml detects the page encoding instead of assuming UTF-8
        return etree.HTML(response.content)

    def safe_extract(self, tree, xpath, index=0, default=''):
        """Safe extraction (with a default value)"""
        result = tree.xpath(xpath)
        if result and len(result) > index:
            return result[index].strip() if isinstance(result[index], str) else result[index]
        return default

    def extract_list(self, tree, xpath):
        """Extract a list of stripped strings"""
        result = tree.xpath(xpath)
        return [item.strip() for item in result if isinstance(item, str)]

    def parse(self):
        """Parsing logic (implemented by subclasses)"""
        raise NotImplementedError

# Usage example
class NewsCrawler(Crawler):
    def parse(self):
        tree = self.get_tree()

        # Title (safe)
        title = self.safe_extract(tree, '//h1[@class="title"]/text()', default='Untitled')

        # List of tags
        tags = self.extract_list(tree, '//div[@class="tags"]//a/text()')

        return {
            'title': title,
            'tags': tags
        }

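📚 11. Learning Resources and Tools

11.1 Official Documentation

Resource	Link	Notes
W3C XPath specification	https://www.w3.org/TR/xpath/	The official standard
lxml documentation	https://lxml.de/xpathxslt.html	The Python lxml library
Scrapy selectors documentation	https://docs.scrapy.org/en/latest/topics/selectors.html	The Scrapy framework

11.2 Online Tools

Tool	Purpose	Link
XPath Tester	Test XPath online	https://www.freeformatter.com/xpath-tester.html
XPath Helper	Chrome extension	Search the Chrome Web Store
Scrapy Shell	Command-line testing	`scrapy shell URL`

11.3 Recommended Tutorials

MDN Web Docs - XPath
- Suited to front-end developers
- Includes detailed examples
Scrapy official tutorial
- Hands-on scraping
- XPath applied in real projects
《Python 网络爬虫从入门到实践》 (Python Web Scraping: From Beginner to Practice)
- A systematic introduction to scraping
- Includes a chapter on XPath

11.4 Suggested Practice Projects

Scrape the Douban Movie Top 250
- Practice extracting list data
- Practice handling pagination
Scrape job listings from a recruitment site
- Practice parsing complex structures
- Practice data cleaning
Scrape product reviews from an e-commerce site
- Practice handling dynamic content
- Practice coping with anti-scraping measures

11.5 Where to Go Next

Handling JavaScript-rendered pages
- Selenium WebDriver
- Playwright
- Pyppeteer
Countering anti-scraping measures
- IP proxy pools
- User-Agent rotation
- Cookie management
- CAPTCHA recognition
Distributed crawling
- Scrapy-Redis
- Celery task queues
- Message queues (RabbitMQ/Kafka)
Data storage
- MongoDB (document database)
- MySQL/PostgreSQL (relational databases)
- Elasticsearch (search engine)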

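📝 Summary

Key Points

What XPath is: a query language for locating XML/HTML nodes
Why use XPath: powerful, flexible, and fast (compared with BeautifulSoup)
Core syntax:
- / and // (paths)
- @ (attributes)
- [] (predicates/conditions)
- text() (text)
Common uses:
- Attribute extraction: //a/@href
- Text extraction: //h1/text()
- Conditional filtering: //div[@class="main"]
- Fuzzy matching: contains(), starts-with()
Python in practice:
- lxml: tree.xpath()
- Scrapy: response.xpath().get()
Best practices:
- Avoid absolute paths
- Locate by attributes first
- Extract safely (prevent crashes)
- Optimize performance (fewer full-document searches)

Suggested Learning Path

Week 1: basic syntax → path expressions → predicates → functions
Week 2: Python lxml → a small hands-on project (scraping a list page)
Week 3: Scrapy → scraping a complete site → data storage
Week 4: advanced techniques → anti-scraping countermeasures → performance tuning

Quick Reference Card

xpath

# Common expressions
//tag                          # every element with that tag
//tag[@attr="value"]           # attribute equals
//tag[contains(@attr, "value")] # attribute contains
//tag/text()                   # extract text
//tag/@attr                    # extract an attribute
//tag[1]                       # first one (indexes start at 1)
//tag[last()]                  # last one
//tag/..                       # parent node
.//tag                         # below the current node
//tag1 | //tag2                # union (either)
//tag[cond1 and cond2]         # both conditions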
