Python Web Scraping, Part 16: XPath

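XPath is a language for locating information in XML documents. Although XML and HTML differ in syntax, XPath works equally well for parsing HTML, especially with a library such as lxml. It offers a powerful way to locate and extract elements and attributes from XML/HTML documents.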

XPath Basics

An XPath expression is made up of path expressions that specify locations in a document. Some basic XPath syntax:

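Root node:
- / at the start of an expression begins an absolute path from the document root; between steps it separates one step from the next.
- // selects matching nodes at any depth: from the root when it leads the expression, or among the descendants of the current node when it appears between steps.
Element selection:
- elementName selects the child elements of the current node that have that name.
- @attributeName selects the named attribute.
Path operations:
- child::name (usually shortened to just name) selects direct children of the current node.
- .. moves to the parent node.
- . refers to the current node.
Position functions:
- last() returns the size of the current node set, which is also the position of its last node.
- position() returns the position of the current node within the node set being evaluated, counting from 1.
Filters (predicates):
- [condition] keeps only the nodes that satisfy the condition, e.g. [contains(text(), 'keyword')].
- [1] selects the first node.
- [last()] selects the last node.
- [position() mod 2 = 1] selects the nodes at odd positions. XPath has no odd keyword; a bare name inside a predicate is read as a child element test.
Axes:
- ancestor::* selects all ancestors of the current node.
- following-sibling::* selects all siblings after the current node.
- preceding-sibling::* selects all siblings before the current node.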
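
To make the syntax above concrete, here is a minimal sketch that runs several of these expressions with lxml. The sample document and the variable names are made up for illustration only.

```python
from lxml import etree

# Hypothetical sample document, used only to exercise the syntax listed above.
doc = etree.fromstring(
    "<root>"
    "<item id='a'>one</item>"
    "<item id='b'>two</item>"
    "<item id='c'>three</item>"
    "<note>keyword inside</note>"
    "</root>"
)

first = doc.xpath("/root/item[1]/text()")                  # first <item>
last = doc.xpath("/root/item[last()]/text()")              # last <item>
odd = doc.xpath("/root/item[position() mod 2 = 1]/text()")  # odd positions
ids = doc.xpath("//item/@id")                              # attribute values
matches = doc.xpath("//*[contains(text(), 'keyword')]")    # text filter
after_b = doc.xpath("//item[@id='b']/following-sibling::*")  # sibling axis
parent = doc.xpath("//item[@id='b']/..")                   # parent step

print(first, last, odd, ids)
print([e.tag for e in matches], [e.tag for e in after_b], parent[0].tag)
```

Text and attribute queries return lists of strings, while element queries return lists of element objects, which is why the last line prints tag names.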

Using Python and the lxml Library

Suppose you have the following HTML document:

<div id="container">
    <h1>Title</h1>
    <div class="content">
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
    </div>
    <div class="sidebar">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </div>
</div>

Parse it and extract data with lxml:

from lxml import etree

html = '''
<div id="container">
    <h1>Title</h1>
    <div class="content">
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
    </div>
    <div class="sidebar">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </div>
</div>
'''

root = etree.fromstring(html)

# Get the title
title = root.xpath('//h1/text()')
print("Title:", title[0])

# Get all paragraphs
paragraphs = root.xpath('//div[@class="content"]/p/text()')
print("Paragraphs:", paragraphs)

# Get the list items
items = root.xpath('//div[@class="sidebar"]/ul/li/text()')
print("Items:", items)
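
The example above works because the sample markup happens to be well-formed, and etree.fromstring uses a strict XML parser. Real pages often contain unclosed tags. The sketch below, using a made-up broken fragment, shows how the lenient etree.HTML parser copes where the XML parser fails.

```python
from lxml import etree

# Hypothetical fragment with an unclosed <p> and <br>, as often seen on real pages.
broken = '<div class="content"><p>Paragraph 1<br><p>Paragraph 2</div>'

try:
    etree.fromstring(broken)   # strict XML parser rejects the markup
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

tree = etree.HTML(broken)      # lenient HTML parser repairs the markup
paragraphs = tree.xpath('//div[@class="content"]/p/text()')
print("Parsed as XML:", xml_ok)
print("Paragraphs:", paragraphs)
```

For pages fetched from the web, etree.HTML is usually the safer default.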

Using the Scrapy Framework

Scrapy is a web crawling framework with built-in support for XPath and CSS selectors. Here is how XPath is used inside a Scrapy project:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Get the title
        title = response.xpath('//h1/text()').get()
        yield {'title': title}

        # Get all paragraphs
        paragraphs = response.xpath('//div[@class="content"]/p/text()').getall()
        yield {'paragraphs': paragraphs}

        # Get the list items
        items = response.xpath('//div[@class="sidebar"]/ul/li/text()').getall()
        yield {'items': items}

XPath vs. CSS Selectors

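XPath offers more powerful querying, but CSS selectors are usually more intuitive and readable for HTML. XPath is the better fit for complex queries, particularly ones that move across levels of the tree or filter nodes by a condition. For simply structured documents, CSS selectors are often sufficient and make for more concise code.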

In practice, choose XPath or CSS selectors according to your needs and the structure of the document. Most modern Python scraping libraries support both.
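
As an illustration of the kind of query where XPath tends to be the better fit, the sketch below filters by text content, walks up to an ancestor, and selects a parent by its children. Such queries are awkward or unavailable in many CSS selector implementations. The markup is a condensed copy of the sample document used earlier.

```python
from lxml import etree

html = '''
<div id="container">
    <div class="content"><p>Paragraph 1</p><p>Paragraph 2</p></div>
    <div class="sidebar"><ul><li>Item 1</li><li>Item 2</li></ul></div>
</div>
'''
tree = etree.HTML(html)

# Filter by text content: list items whose text contains "2".
second = tree.xpath('//li[contains(text(), "2")]/text()')

# Walk upward: class of the nearest <div> ancestor of any <ul>.
owner = tree.xpath('//ul/ancestor::div[1]/@class')

# Select a parent by its children: class of each <div> that directly contains a <p>.
with_p = tree.xpath('//div[p]/@class')

print(second, owner, with_p)
```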

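Beyond the basics, a practical scraper should also add error handling, cope with more complex HTML structures, extract nested data, and make additional requests to deal with dynamically loaded content. The sections below show how to do this with Python and lxml.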

Error Handling and Exception Management

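When scraping with XPath, plan for the errors that can occur, such as network problems, invalid XPath expressions, and expected elements that are missing. Here is an example with error handling: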

from lxml import etree
import requests

def fetch_html(url):
    try:
        # A timeout keeps the call from hanging forever on a stalled server
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Request error: {e}")
        return None

def parse_html(html):
    if html is None:
        print("Failed to fetch HTML")
        return

    try:
        tree = etree.HTML(html)
        title = tree.xpath('//h1/text()')
        if title:
            print("Title:", title[0])
        else:
            print("Title not found")

        paragraphs = tree.xpath('//div[@class="content"]/p/text()')
        if paragraphs:
            print("Paragraphs:", paragraphs)
        else:
            print("No paragraphs found")

        items = tree.xpath('//div[@class="sidebar"]/ul/li/text()')
        if items:
            print("Items:", items)
        else:
            print("No items found")

    except etree.XPathEvalError as e:
        print(f"XPath evaluation error: {e}")

def main():
    url = "http://example.com"
    html = fetch_html(url)
    parse_html(html)

if __name__ == "__main__":
    main()
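
lxml raises different exception classes depending on how an expression is used, and they share the base class etree.XPathError. The sketch below wraps evaluation in a small helper that catches that base class. The helper name safe_xpath is hypothetical and is not part of lxml.

```python
from lxml import etree

def safe_xpath(tree, expression):
    """Hypothetical helper: evaluate an XPath expression, returning [] on any XPath error."""
    try:
        return tree.xpath(expression)
    except etree.XPathError as exc:  # base class of XPathEvalError and XPathSyntaxError
        print(f"Bad XPath {expression!r}: {exc}")
        return []

tree = etree.HTML("<html><body><h1>Title</h1></body></html>")
good = safe_xpath(tree, "//h1/text()")
bad = safe_xpath(tree, "//h1[")      # unbalanced predicate, an invalid expression
print(good, bad)
```

Returning an empty list keeps the calling code on the same path as a query that simply matched nothing.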

Handling More Complex HTML Structures

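A page may contain nested elements or many similar elements. XPath lets you handle these cases with more elaborate expressions. For example, if each list item carries extra information, you can select the items first and then run relative queries against each one: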

items_with_details = tree.xpath('//div[@class="sidebar"]/ul/li')
for item in items_with_details:
    item_text = item.xpath('./text()')
    item_link = item.xpath('.//a/@href')
    print("Item:", item_text, "Link:", item_link)
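
The snippet above relies on a tree variable created earlier. A self-contained sketch of the same idea follows, assuming a made-up sidebar in which only some items contain a link. It collects one record per item and tolerates missing links.

```python
from lxml import etree

# Hypothetical markup: the first item has a link, the second does not.
html = '''
<div class="sidebar">
  <ul>
    <li>Item 1 <a href="/one">details</a></li>
    <li>Item 2</li>
  </ul>
</div>
'''
tree = etree.HTML(html)

records = []
for li in tree.xpath('//div[@class="sidebar"]/ul/li'):
    # normalize-space() trims the item's first text node into a clean string.
    label = li.xpath('normalize-space(./text())')
    # The leading "." keeps the query relative to this <li>.
    hrefs = li.xpath('.//a/@href')
    records.append({"label": label, "link": hrefs[0] if hrefs else None})

print(records)
```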

Handling Dynamically Loaded Content

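If a site loads content dynamically with JavaScript, a single request may not return all of the data. In that case you can use Selenium or the Requests-HTML library to simulate a browser. Here is an example using Requests-HTML: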

from lxml import etree
from requests_html import HTMLSession

session = HTMLSession()

def fetch_and_render(url):
    r = session.get(url)
    r.html.render(sleep=1)  # Wait for JavaScript to execute
    return r.html.raw_html.decode('utf-8')

def main():
    url = "http://example.com"
    html = fetch_and_render(url)
    tree = etree.HTML(html)
    # Now you can use XPath on the rendered HTML
    ...

if __name__ == "__main__":
    main()

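Note that a tool such as Selenium can significantly increase the resource consumption and running time of your scraper, because it drives a full browser environment. Rendering with Requests-HTML has a similar cost, since render() runs the page in a headless Chromium.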

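With these extensions, your XPath code becomes more robust and can handle more complex and dynamic page structures. When developing a scraper, always follow the site's robots.txt rules and respect its terms of use, and avoid putting the server under load with excessive requests.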

Next, we can bring in some best practices:

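Modularity: split the code into several functions to improve readability and maintainability.
Parameterization: have functions accept parameters so they are easy to reuse and configure.
Logging: record key steps and potential errors for debugging and monitoring.
Concurrency: process multiple URLs with threads or processes to improve throughput.
Retries: automatically retry failed requests when the network is unstable.
Data storage: save the extracted data to a file or a database.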

Here is a code example that applies these practices:

import logging
import os
import requests
from lxml import etree
from time import sleep
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_html(url, max_retries=3, delay=1, timeout=10):
    """Fetch HTML from a given URL with retry mechanism."""
    for attempt in range(max_retries):
        try:
            # A timeout keeps a stalled server from blocking the worker thread forever
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            logging.error(f"Error fetching URL: {url}, attempt {attempt + 1}/{max_retries}. Error: {e}")
            if attempt < max_retries - 1:
                sleep(delay * 2 ** attempt)  # Exponential backoff: delay, 2*delay, 4*delay, ...
    return None

def parse_html(html, xpath_expression):
    """Parse HTML using provided XPath expression."""
    if html is None:
        logging.error("Failed to fetch HTML")
        return None
    try:
        tree = etree.HTML(html)
        result = tree.xpath(xpath_expression)
        return result
    except etree.XPathEvalError as e:
        logging.error(f"XPath evaluation error: {e}")
        return None

def save_data(data, filename):
    """Save data to a file."""
    # An explicit encoding avoids platform-dependent defaults for non-ASCII text
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(str(data))

def process_url(url, xpath_expression, output_filename):
    """Process a single URL by fetching, parsing, and saving data."""
    logging.info(f"Processing URL: {url}")
    html = fetch_html(url)
    data = parse_html(html, xpath_expression)
    if data:
        save_data(data, output_filename)
        logging.info(f"Data saved to {output_filename}")

def main(urls, xpath_expression, output_dir):
    """Main function to process multiple URLs concurrently."""
    # Create the output directory up front so save_data does not fail on a missing path
    os.makedirs(output_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = []
        for url in urls:
            output_filename = f"{output_dir}/data_{url.split('/')[-1]}.txt"
            future = executor.submit(process_url, url, xpath_expression, output_filename)
            futures.append(future)

        for future in as_completed(futures):
            future.result()  # Re-raises any exception from the worker thread

if __name__ == "__main__":
    urls = ["http://example1.com", "http://example2.com"]
    xpath_expression = '//div[@class="content"]/p/text()'  # Example XPath expression
    output_dir = "./output"
    main(urls, xpath_expression, output_dir)

This example defines the following key functions:

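fetch_html: fetches the HTML from a URL, with a retry mechanism.
parse_html: parses the HTML using the supplied XPath expression.
save_data: saves the data to a file.
process_url: handles a single URL, which covers fetching the HTML, parsing the data, and saving it.
main: the entry point, which uses a thread pool to process multiple URLs in parallel.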

This structure makes it easy to extend the scraper, for example by adding more URLs or XPath expressions, while keeping the code clear and maintainable.
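
One spot worth hardening as you add URLs is the output filename. The example derives it from the last path segment of the URL, which is empty for URLs ending in a slash and identical for pages that share a segment. The following is one possible sketch using only the standard library. The helper name url_to_path and the naming scheme are assumptions, not part of the example above.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlparse

def url_to_path(url, output_dir):
    """Hypothetical helper: map a URL to a distinct, filesystem-safe output path."""
    parsed = urlparse(url)
    # Readable part: host plus path, with unsafe characters collapsed to "_".
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", parsed.netloc + parsed.path).strip("_")
    # A short hash of the full URL keeps names distinct when slugs coincide.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return Path(output_dir) / f"data_{slug}_{digest}.txt"

a = url_to_path("http://example.com/", "./output")
b = url_to_path("http://example.com/?page=2", "./output")
print(a.name)
print(b.name)
```

Before writing, the directory can be created with a.parent.mkdir(parents=True, exist_ok=True).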
