Python Web Scraping for Beginners (6): Using the urllib Library

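Preface
[1. urllib Overview](#1. urllib Overview)
[2. The urllib.request Module](#2. The urllib.request Module)
- [2.1 Sending a GET Request](#2.1 Sending a GET Request)
- [2.2 Sending a POST Request](#2.2 Sending a POST Request)
- [2.3 Adding Headers](#2.3 Adding Headers)
- [2.4 Handling Exceptions](#2.4 Handling Exceptions)
[3. The urllib.error Module](#3. The urllib.error Module)
[4. The urllib.parse Module](#4. The urllib.parse Module)
- [4.1 Parsing URLs](#4.1 Parsing URLs)
- [4.2 URL Encoding and Decoding](#4.2 URL Encoding and Decoding)
- [4.3 Joining URLs](#4.3 Joining URLs)
[5. The urllib.robotparser Module](#5. The urllib.robotparser Module)
[6. Hands-on Example: Scraping the Douban Movie Top 250](#6. Hands-on Example: Scraping the Douban Movie Top 250)
[7. urllib vs requests](#7. urllib vs requests)
[8. Things to Keep in Mind](#8. Things to Keep in Mind)
Summary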

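Preface

Welcome to the sixth article in the "Python Web Scraping for Beginners" series. Today we look at urllib, the package in the Python standard library for working with URLs.

urllib is Python's built-in HTTP request library, so it can be used without installing anything. It provides functions and classes for sending requests, handling responses, parsing URLs, and more. Many people now prefer the requests library, but urllib is still worth learning: it underpins a good deal of other Python networking code, it is available wherever Python is installed, and in some special cases it may be the better choice.

In this article I cover urllib's four main modules in detail (request, error, parse, and robotparser) and show how each is used through working code examples.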

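1. urllib Overview

urllib is the collection of URL-handling modules in the Python standard library. It does not need to be installed with pip.

It contains several modules:

urllib.request: opens and reads URLs

urllib.error: contains the exceptions raised by urllib.request

urllib.parse: parses URLs

urllib.robotparser: parses robots.txt files

Together these modules cover network requests and URL handling. The following sections go through the main features and usage of each one.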

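2. The urllib.request Module

urllib.request is the most frequently used module in urllib. It provides functions and classes for opening URLs (mostly HTTP).

We can use it to send GET and POST requests the way a browser would.

2.1 Sending a GET Request

Sending a GET request with urllib.request is straightforward. We call the urlopen() function: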

import urllib.request
import gzip
import io

url = 'https://www.python.org/'
response = urllib.request.urlopen(url)

# Read the Content-Encoding response header
content_encoding = response.headers.get('Content-Encoding')

# Read the raw response body (bytes)
data = response.read()

# Decompress the body if the server sent it gzip-compressed
if content_encoding == 'gzip':
    buf = io.BytesIO(data)
    with gzip.GzipFile(fileobj=buf) as f:
        data = f.read()

print(data.decode('utf-8'))

This code opens the Python home page and prints the HTML of the page.
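The decompression step can also be factored into a small helper and checked without any network access. The sketch below is only an illustration: `gunzip_if_needed` is a hypothetical name, not part of urllib, and it assumes the body is either gzip-compressed or not compressed at all. It uses `gzip.decompress()`, a shorter alternative to wrapping the bytes in a `GzipFile`.

```python
import gzip

def gunzip_if_needed(data: bytes, content_encoding) -> bytes:
    """Return the body bytes, decompressing them when Content-Encoding is gzip."""
    if content_encoding == 'gzip':
        return gzip.decompress(data)
    return data

# Simulate a compressed and an uncompressed response body
html = '<h1>Hello, urllib</h1>'
compressed = gzip.compress(html.encode('utf-8'))

print(gunzip_if_needed(compressed, 'gzip').decode('utf-8'))
print(gunzip_if_needed(html.encode('utf-8'), None).decode('utf-8'))
```

Both calls print the same HTML, so the rest of the program does not need to care whether the server compressed the response.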

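2.2 Sending a POST Request

Sending a POST request takes a little more work: the form data has to be URL-encoded and converted to bytes, and it is usually passed through a Request object: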

import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
# URL-encode the form fields, then convert the string to bytes
data = urllib.parse.urlencode({'name': 'John', 'age': 25}).encode('utf-8')
# Build the request; the body goes in the data argument
req = urllib.request.Request(url, data=data, method='POST')
response = urllib.request.urlopen(req)

print(response.read().decode('utf-8'))

This code sends a POST request to httpbin.org with two form parameters, name and age.
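A Request object can be inspected before it is sent, which is a convenient way to see what urllib is about to do. The following is a minimal sketch under a few assumptions: the JSON variant and its `Content-Type` header are my own illustration and do not appear in the example above, and nothing here touches the network because `urlopen()` is never called.

```python
import json
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'

# Form-encoded body: supplying data makes the request a POST by default
form_body = urllib.parse.urlencode({'name': 'John', 'age': 25}).encode('utf-8')
form_req = urllib.request.Request(url, data=form_body)

# JSON body (illustrative): the Content-Type header must be set explicitly
json_body = json.dumps({'name': 'John', 'age': 25}).encode('utf-8')
json_req = urllib.request.Request(
    url,
    data=json_body,
    headers={'Content-Type': 'application/json'},
    method='POST',
)

# Inspect the requests locally; nothing is sent until urlopen() is called
print(form_req.get_method())                # POST
print(form_req.data)                        # b'name=John&age=25'
print(json_req.get_header('Content-type'))  # application/json
```

Note the spelling `Content-type` in the last line: Request stores header names in capitalized form, so lookups have to use that form.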

2.3 Adding Headers

Real crawlers often need to send extra headers so that their requests look like those of a browser.

Headers can be supplied when the Request object is created:

import urllib.request
import gzip
import io

url = 'https://www.python.org/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)

# Read the Content-Encoding response header
content_encoding = response.headers.get('Content-Encoding')

# Read the raw response body (bytes)
data = response.read()

# Decompress the body if it was gzip-compressed
if content_encoding == 'gzip':
    buf = io.BytesIO(data)
    with gzip.GzipFile(fileobj=buf) as f:
        data = f.read()

# Try to decode the body as UTF-8
try:
    print(data.decode('utf-8'))
except UnicodeDecodeError:
    print("Cannot decode data with 'utf-8' encoding.")

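Headers can also be attached after the Request has been created, and the result can be checked locally before anything is sent. This is a small sketch with assumed values: the `my-crawler/0.1` User-Agent string and the `Accept-Language` header are placeholders chosen for illustration.

```python
import urllib.request

url = 'https://www.python.org/'
req = urllib.request.Request(url, headers={'User-Agent': 'my-crawler/0.1'})

# Headers can also be added after the Request has been created
req.add_header('Accept-Language', 'en-US,en;q=0.8')

# Request stores header names in capitalized form ("User-agent")
print(req.header_items())
print(req.has_header('User-agent'))
print(req.get_header('Accept-language'))
```

Printing `header_items()` before calling `urlopen()` is a quick way to confirm that the request carries the headers you expect.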

2.4 Handling Exceptions

Network requests can fail in many ways. We can handle these failures with a try-except statement:

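import urllib.request
import urllib.error
import gzip
import io

try:
    # Send the request and get the response
    response = urllib.request.urlopen('https://www.python.org/')

    # Read the Content-Encoding response header
    content_encoding = response.headers.get('Content-Encoding')

    # Read the raw response body (bytes)
    data = response.read()

    # Decompress the body if it was gzip-compressed
    if content_encoding == 'gzip':
        buf = io.BytesIO(data)
        with gzip.GzipFile(fileobj=buf) as f:
            data = f.read()

    # Try to decode the body as UTF-8
    print(data.decode('utf-8'))
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(f'HTTPError: {e.code}, {e.reason}')
except urllib.error.URLError as e:
    print(f'URLError: {e.reason}')
except UnicodeDecodeError:
    print("Cannot decode data with 'utf-8' encoding.")

This code handles both HTTPError and URLError, which are defined in the urllib.error module. Because HTTPError is a subclass of URLError, its except clause has to come first; otherwise the URLError clause would catch HTTP errors as well and the HTTPError clause would never run.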

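3. The urllib.error Module

The urllib.error module defines the exception classes that urllib.request may raise. The two main ones are:

URLError: the base class for exceptions raised by urllib.request. Its reason attribute describes what went wrong.
HTTPError: a subclass of URLError for errors on HTTP and HTTPS URLs. In addition to reason, it carries the HTTP status code in its code attribute.

The example in section 2.4 already shows how to catch and handle these exceptions.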
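The relationship between the two classes can be checked directly. The sketch below builds the exceptions by hand so that it runs offline; `describe_error` is a hypothetical helper written for this illustration, and the URL and messages are made-up values.

```python
import urllib.error

def describe_error(exc: Exception) -> str:
    """Summarise an exception raised by urllib.request (hypothetical helper)."""
    # Check the subclass first: every HTTPError is also a URLError
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, urllib.error.URLError):
        return f'Network error: {exc.reason}'
    return f'Unexpected error: {exc!r}'

# Build the exceptions by hand so the example runs without a network
not_found = urllib.error.HTTPError(
    url='https://example.com/missing', code=404, msg='Not Found', hdrs=None, fp=None
)
unreachable = urllib.error.URLError('connection refused')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe_error(not_found))    # HTTP error 404: Not Found
print(describe_error(unreachable))  # Network error: connection refused
```

If the two isinstance checks were swapped, the 404 would be reported as a generic network error, which is the same mistake as putting `except URLError` before `except HTTPError`.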

4. The urllib.parse Module

The urllib.parse module provides utility functions for working with URLs, such as parsing, quoting, splitting, and joining.

4.1 Parsing URLs

from urllib.parse import urlparse

url = 'https://www.python.org/doc/?page=1#introduction'
parsed = urlparse(url)

print(parsed)
print(f'Scheme: {parsed.scheme}')
print(f'Netloc: {parsed.netloc}')
print(f'Path: {parsed.path}')
print(f'Params: {parsed.params}')
print(f'Query: {parsed.query}')
print(f'Fragment: {parsed.fragment}')

This code parses the URL and prints each of its components.
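The query component returned by urlparse() is still a raw string. As a follow-up sketch, `parse_qs()` from the same module turns it into a dictionary. The repeated `tag` parameter in the URL below is an assumption added for illustration; it is not part of the example above.

```python
from urllib.parse import urlparse, parse_qs

url = 'https://www.python.org/doc/?page=1&tag=a&tag=b#introduction'
parsed = urlparse(url)

# parse_qs maps every key to a list, because a key may appear more than once
params = parse_qs(parsed.query)
print(params)                # {'page': ['1'], 'tag': ['a', 'b']}

# Values are strings; convert them explicitly when a number is needed
page = int(params['page'][0])
print(page)                  # 1
```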

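4.2 URL Encoding and Decoding

When working with URLs we often need to encode and decode parameters:

from urllib.parse import urlencode, unquote_plus

params = {'name': 'John Doe', 'age': 30, 'city': 'New York'}
encoded = urlencode(params)
print(f'Encoded: {encoded}')  # name=John+Doe&age=30&city=New+York

# urlencode() writes spaces as '+', so decode with unquote_plus()
decoded = unquote_plus(encoded)
print(f'Decoded: {decoded}')  # name=John Doe&age=30&city=New York

urlencode() turns a dictionary into a URL-encoded query string. Because it encodes spaces as +, the matching decoder is unquote_plus(); plain unquote() would leave the + signs in place.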
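For single values rather than whole dictionaries, urllib.parse also offers `quote()` and `quote_plus()`. The sketch below shows how the two differ on a string that contains non-ASCII characters and a space; the keyword itself is just a sample value.

```python
from urllib.parse import quote, quote_plus, unquote

keyword = '电影 top'

# quote() suits path segments: a space becomes %20
path_part = quote(keyword)
# quote_plus() suits query strings: a space becomes +
query_part = quote_plus(keyword)

print(path_part)                      # %E7%94%B5%E5%BD%B1%20top
print(query_part)                     # %E7%94%B5%E5%BD%B1+top
print(unquote(path_part) == keyword)  # True
```

Non-ASCII characters are encoded as UTF-8 bytes by default, which is what most sites expect.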

4.3 Joining URLs

from urllib.parse import urljoin

base_url = 'https://www.python.org/doc/'
relative_url = 'tutorial/index.html'
full_url = urljoin(base_url, relative_url)

print(full_url)

urljoin() combines a base URL and a relative URL into a complete URL, following the same rules a browser uses to resolve links on a page.
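The result of urljoin() depends on small details such as a trailing slash on the base URL, which is a common source of bugs when following links in a crawler. A few cases worth knowing, sketched with URLs chosen purely for illustration:

```python
from urllib.parse import urljoin

# Base ends with a slash: the relative path is appended
print(urljoin('https://www.python.org/doc/', 'tutorial/index.html'))
# https://www.python.org/doc/tutorial/index.html

# Base has no trailing slash: the last segment ("doc") is replaced
print(urljoin('https://www.python.org/doc', 'tutorial/index.html'))
# https://www.python.org/tutorial/index.html

# A leading slash resolves from the site root
print(urljoin('https://www.python.org/doc/', '/about/'))
# https://www.python.org/about/

# An absolute URL replaces the base entirely
print(urljoin('https://www.python.org/doc/', 'https://docs.python.org/3/'))
# https://docs.python.org/3/
```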

5. The urllib.robotparser Module

The urllib.robotparser module provides the RobotFileParser class for parsing robots.txt files.

robots.txt is the file a website uses to tell crawlers which pages they may fetch and which they may not.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')
rp.read()

print(rp.can_fetch('*', 'https://www.python.org/'))
print(rp.can_fetch('*', 'https://www.python.org/admin/'))

This code reads the robots.txt file of the Python website and then checks whether the given URLs may be crawled.
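RobotFileParser can also be fed rules directly through its `parse()` method, which makes it easy to experiment without fetching anything. The rules, the `example.com` URLs, and the `my-crawler` user agent below are all made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied as a list of lines
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('my-crawler', 'https://example.com/public/page'))   # True
print(rp.can_fetch('my-crawler', 'https://example.com/private/page'))  # False
print(rp.crawl_delay('my-crawler'))                                    # 2
```

`crawl_delay()` returns None when the file does not specify a delay, so a crawler should fall back to its own default in that case.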

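6. Hands-on Example: Scraping the Douban Movie Top 250

Now let's put what we have learned into practice and write a real crawler that collects information from the Douban Movie Top 250.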

import time
import urllib.request
import urllib.error
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def get_movie_info(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    try:
        req = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(req)
        html = response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print(f"The server couldn't fulfill the request. Error code: {e.code}")
        return
    except urllib.error.URLError as e:
        print(f'Failed to reach the server. Reason: {e.reason}')
        return

    soup = BeautifulSoup(html, 'html.parser')
    movie_list = soup.find('ol', class_='grid_view')
    if movie_list is None:
        # The page layout is not what this script expects
        print('Movie list not found in the page.')
        return

    for movie_li in movie_list.find_all('li'):
        rank = movie_li.find('em').string
        title = movie_li.find('span', class_='title').string
        rating = movie_li.find('span', class_='rating_num').string

        # Not every movie has a one-line quote
        quote_tag = movie_li.find('span', class_='inq')
        quote = quote_tag.string if quote_tag else "N/A"

        print(f"Rank: {rank}")
        print(f"Title: {title}")
        print(f"Rating: {rating}")
        print(f"Quote: {quote}")
        print('-' * 50)

# Crawl the first 5 pages
for i in range(5):
    url = f'https://movie.douban.com/top250?start={i*25}'
    get_movie_info(url)
    time.sleep(1)  # pause between pages to avoid hammering the site

This crawler fetches the first 5 pages of the Douban Movie Top 250, 25 movies per page, 125 movies in total. It sends requests with urllib.request, parses the HTML with BeautifulSoup (a third-party package), and handles the exceptions that may occur.
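The page URLs follow a simple pattern: the start parameter grows by 25 per page. As a small sketch, they can be generated with urlencode() instead of string formatting, which keeps the query string correctly encoded if more parameters are added later. `build_page_urls` is a hypothetical helper, and the page size of 25 is taken from the description above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'

def build_page_urls(pages: int, page_size: int = 25) -> list:
    """Return the list-page URLs for the first `pages` pages."""
    urls = []
    for i in range(pages):
        query = urlencode({'start': i * page_size})
        urls.append(f'{BASE_URL}?{query}')
    return urls

# 250 movies at 25 per page means 10 pages in total
urls = build_page_urls(10)
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0
print(urls[-1])    # https://movie.douban.com/top250?start=225
```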

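7. urllib vs requests

Although urllib is part of the standard library, many developers prefer the requests library in practice.

The reasons are:

Ease of use: the requests API is more approachable, more intuitive, and simpler to use.

Features: requests takes care of many things that have to be done by hand with urllib, such as keeping a session and handling cookies.

Exception handling: exceptions in requests are more intuitive and more consistent.

Even so, urllib has its own advantages as a standard library:

No installation: urllib ships with Python and needs no extra install step.
Low-level control: urllib exposes more low-level operations, which can be an advantage in some special cases.

In most cases, if your project is allowed to use third-party packages, requests is probably the better choice. urllib is still worth learning, because it is a foundation of network programming in Python and can be more useful in some special situations.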

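8. Things to Keep in Mind

There are several important points to keep in mind when crawling with urllib:

Respect robots.txt: parse the site's robots.txt with urllib.robotparser and follow its crawling rules.

Set a suitable User-Agent: add an appropriate User-Agent to the headers, since some sites block requests that are identified as coming from a crawler.

Control the crawl rate: add a reasonable delay between requests so that you do not put too much load on the target site.

Handle exceptions: deal properly with the network errors and HTTP errors that can occur.

Decode responses: decode the response body correctly and be prepared for different character encodings.

URL encoding: when building URLs, make sure the parameters are properly URL-encoded.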
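On the decoding point: the examples in this article assume UTF-8, but a server can declare another charset in its Content-Type header. The sketch below reads that charset and falls back to UTF-8. `decode_with_charset` is a hypothetical helper, and an `email.message.Message` object stands in for `response.headers` (which is a subclass of it) so that the example runs offline.

```python
from email.message import Message

def decode_with_charset(body: bytes, headers: Message, default: str = 'utf-8') -> str:
    """Decode a response body using the charset declared in Content-Type."""
    # get_content_charset() returns None when no charset is declared
    charset = headers.get_content_charset() or default
    # errors='replace' keeps going if a few bytes are invalid
    return body.decode(charset, errors='replace')

# Stand-in for response.headers so the example needs no network
headers = Message()
headers['Content-Type'] = 'text/html; charset=GBK'

body = '豆瓣电影'.encode('gbk')
print(decode_with_charset(body, headers))                # 豆瓣电影
print(decode_with_charset(b'plain ascii', Message()))    # plain ascii
```

With a real response, the call would be `decode_with_charset(response.read(), response.headers)`. A page may also declare its encoding only in an HTML meta tag, which this sketch does not handle.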

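Summary

In this article we learned how to use urllib from the Python standard library, including sending GET and POST requests, handling exceptions, parsing and building URLs, and parsing robots.txt files, and we applied all of it in a real crawler.
Although requests is more popular in day-to-day development, urllib is still worth mastering. It is a foundation of network programming in Python and may be the better fit in some special cases.
I hope this article has given you a deeper understanding of urllib that you can put to use in your own crawling projects. Whatever tool you use, follow the ethical norms of web crawling and respect the rules of the sites you visit and the rights of their other users.

If you have any questions or ideas, feel free to get in touch.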