Python - 爬虫-网页抓取数据-库requests

requests库是一个功能强大的HTTP库，用于发送各种HTTP请求，如GET、POST、PUT、DELETE等。

requests官网：Requests: HTTP for Humans™ --- Requests 2.32.3 documentation

使用requests可以模拟浏览器的请求，比起之前用的urllib，requests模块的api更加便捷（本质就是封装了urllib3）

requests库可以通过pip安装：

python 复制代码

pip install requests

主要功能和特性

‌API简洁‌：requests的API设计非常简洁，使得发送请求和获取响应变得非常容易。
‌支持多种请求方法‌：支持GET、POST、PUT、DELETE等多种HTTP请求方法。
‌会话管理‌：支持会话管理，可以保持连接，方便进行多次请求。
‌支持Cookies和Session‌：可以方便地处理Cookies和会话，实现模拟登录等功能。
‌异常处理‌：提供异常处理机制，方便处理网络请求中的错误。
‌支持自动编码‌：自动处理URL编码和响应内容的编码，使得使用更加方便。
‌支持代理和认证‌：可以设置代理服务器，支持HTTP认证。
‌响应内容处理‌：提供丰富的响应内容处理功能，如获取文本、二进制内容、JSON等。

高级功能和使用技巧

‌设置超时 ‌：可以通过设置timeout参数来控制请求的超时时间。
‌使用代理 ‌：通过proxies参数设置代理服务器。
‌自定义Headers ‌：通过headers参数可以自定义请求头，模拟浏览器请求。
‌处理JSON数据 ‌：使用json参数可以直接发送JSON数据，并自动处理响应的JSON数据。
‌异常处理 ‌：使用try-except结构捕获并处理可能出现的异常，如requests.exceptions.RequestException。
‌使用Cookies ‌：通过cookies参数可以设置或获取Cookies，实现会话管理

requests库7个主要方法

方法	说明
requsts.requst()	构造一个请求，最基本的方法，是下面方法的支撑
requsts.get()	获取网页，对应HTTP中的GET方法
requsts.post()	向网页提交信息，对应HTTP中的POST方法
requsts.head()	获取html网页的头信息，对应HTTP中的HEAD方法
requsts.put()	向html提交put方法，对应HTTP中的PUT方法
requsts.patch()	向html网页提交局部请求修改的的请求，对应HTTP中的PATCH方法
requsts.delete()	向html提交删除请求，对应HTTP中的DELETE方法

requests内置的状态字符

python 复制代码

# -*- coding: utf-8 -*-
 
from .structures import LookupDict
 
_codes = {
 
    # Informational.
    100: ('continue',),
    101: ('switching_protocols',),
    102: ('processing',),
    103: ('checkpoint',),
    122: ('uri_too_long', 'request_uri_too_long'),
    200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', ''),
    201: ('created',),
    202: ('accepted',),
    203: ('non_authoritative_info', 'non_authoritative_information'),
    204: ('no_content',),
    205: ('reset_content', 'reset'),
    206: ('partial_content', 'partial'),
    207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
    208: ('already_reported',),
    226: ('im_used',),
 
    # Redirection.
    300: ('multiple_choices',),
    301: ('moved_permanently', 'moved', '\\o-'),
    302: ('found',),
    303: ('see_other', 'other'),
    304: ('not_modified',),
    305: ('use_proxy',),
    306: ('switch_proxy',),
    307: ('temporary_redirect', 'temporary_moved', 'temporary'),
    308: ('permanent_redirect',
          'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0
 
    # Client Error.
    400: ('bad_request', 'bad'),
    401: ('unauthorized',),
    402: ('payment_required', 'payment'),
    403: ('forbidden',),
    404: ('not_found', '-o-'),
    405: ('method_not_allowed', 'not_allowed'),
    406: ('not_acceptable',),
    407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
    408: ('request_timeout', 'timeout'),
    409: ('conflict',),
    410: ('gone',),
    411: ('length_required',),
    412: ('precondition_failed', 'precondition'),
    413: ('request_entity_too_large',),
    414: ('request_uri_too_large',),
    415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
    416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
    417: ('expectation_failed',),
    418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
    421: ('misdirected_request',),
    422: ('unprocessable_entity', 'unprocessable'),
    423: ('locked',),
    424: ('failed_dependency', 'dependency'),
    425: ('unordered_collection', 'unordered'),
    426: ('upgrade_required', 'upgrade'),
    428: ('precondition_required', 'precondition'),
    429: ('too_many_requests', 'too_many'),
    431: ('header_fields_too_large', 'fields_too_large'),
    444: ('no_response', 'none'),
    449: ('retry_with', 'retry'),
    450: ('blocked_by_windows_parental_controls', 'parental_controls'),
    451: ('unavailable_for_legal_reasons', 'legal_reasons'),
    499: ('client_closed_request',),
 
    # Server Error.
    500: ('internal_server_error', 'server_error', '/o\\', ''),
    501: ('not_implemented',),
    502: ('bad_gateway',),
    503: ('service_unavailable', 'unavailable'),
    504: ('gateway_timeout',),
    505: ('http_version_not_supported', 'http_version'),
    506: ('variant_also_negotiates',),
    507: ('insufficient_storage',),
    509: ('bandwidth_limit_exceeded', 'bandwidth'),
    510: ('not_extended',),
    511: ('network_authentication_required', 'network_auth', 'network_authentication'),
}
 
codes = LookupDict(name='status_codes')
 
for code, titles in _codes.items():
    for title in titles:
        setattr(codes, title, code)
        if not title.startswith('\\'):
            setattr(codes, title.upper(), code)

json和data的区别：

使用json

当请求参数是JSON的时候使用json的参数当请求参数是json，但是要使用data的参数，那么请求参数要进行序列化的处理

使用data

当请求参数是表单的时候使用data当请求参数是JSON格式的时候，那么请求参数要进行序列化的处理

requests库的异常

异常	说明
requests.ConnectionError	网络连接异常，如DNS查询失败，拒绝连接等
requests.HTTPError	HTTP错误异常
requests.URLRequired	URL缺失异常
requests.TooManyRedirects	超过最大重定向次数，产生重定向异常
requests.ConnectTimeout	连接远程服务器超时异常
requests.Timeout	请求URL超时，产生超时异常

一、基于requests之GET请求

1、基本请求

python 复制代码

# GET请求
r = requests.get('http://httpbin.org/get')
print(r.status_code,r.reason) # 200 OK
print(r.text)
# {
#   "args": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.32.3",
#     "X-Amzn-Trace-Id": "Root=1-67a700f6-138938e6136cffb223f72ef3"
#   },
#   "origin": "58.241.18.10",
#   "url": "http://httpbin.org/get"
# }

2、带参数的GET请求->params

python 复制代码

# GET带参数请求
r = requests.get('http://httpbin.org/get',params={'name':'zhangsan','age':18})
print(r.json()) # {'args': {'age': '18', 'name': 'zhangsan'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.32.3', 'X-Amzn-Trace-Id': 'Root=1-67a700f7-7cf1df80781d94dd57a99d44'}, 'origin': '58.241.18.10', 'url': 'http://httpbin.org/get?name=zhangsan&age=18'}

3、带参数的GET请求->headers

通常我们在发送请求时都需要带上请求头，请求头是将自身伪装成浏览器的关键，常见的有用的请求头如下

python 复制代码

ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.62'
headers = {
    'User-Agent':ua
}
r = requests.get('http://httpbin.org/headers',headers=headers)
print(r.json()) # {'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.62', 'X-Amzn-Trace-Id': 'Root=1-67a70344-0928ab01777e0ca91dd4e6db'}}

4、带参数的GET请求->cookies

登录github，然后从浏览器中获取cookies，以后就可以直接拿着cookie登录了，无需输入用户名密码

python 复制代码

cookies = dict(userid='111111', token='abcd1234')
r = requests.get('http://httpbin.org/cookies',cookies=cookies)
print(r.json()) # {'cookies': {'token': 'abcd1234', 'userid': '111111'}}

二、基于POST请求

python 复制代码

r = requests.post('http://httpbin.org/post',data={'name':'zhangsan','age':18})
print(r.json()) # {'args': {}, 'data': '', 'files': {}, 'form': {'age': '18', 'name': 'zhangsan'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '20', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.32.3', 'X-Amzn-Trace-Id': 'Root=1-67a702b9-7bc957120cdadae371c32908'}, 'json': None, 'origin': '58.241.18.10', 'url': 'http://httpbin.org/post'}

1、GET请求

HTTP默认的请求方法就是GET

1、没有请求体

2、数据必须在1K之内

3、GET请求数据会暴露在浏览器的地址栏中

GET请求常用的操作：

1、在浏览器的地址栏中直接给出URL，那么就一定是GET请求

2、点击页面上的超链接也一定是GET请求

3、提交表单时，表单默认使用GET请求，但可以设置为POST

2、POST请求

1、数据不会出现在地址栏中

2、数据的大小没有上限

3、有请求体

4、请求体中如果存在中文，会使用URL编码！

注：

！！！requests.post()用法与requests.get()完全一致，特殊的是requests.post()有一个data参数，用来存放请求体数据

python 复制代码

import requests

'''
requests使用
'''

print('--------GET请求--------')
# GET请求
r = requests.get('http://httpbin.org/get')
print(r.status_code,r.reason) # 200 OK
print(r.text)
# {
#   "args": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.32.3",
#     "X-Amzn-Trace-Id": "Root=1-67a700f6-138938e6136cffb223f72ef3"
#   },
#   "origin": "58.241.18.10",
#   "url": "http://httpbin.org/get"
# }
# GET带参数请求
r = requests.get('http://httpbin.org/get',params={'name':'zhangsan','age':18})
print(r.json()) # {'args': {'age': '18', 'name': 'zhangsan'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.32.3', 'X-Amzn-Trace-Id': 'Root=1-67a700f7-7cf1df80781d94dd57a99d44'}, 'origin': '58.241.18.10', 'url': 'http://httpbin.org/get?name=zhangsan&age=18'}


print('--------POST请求--------')
r = requests.post('http://httpbin.org/post',data={'name':'zhangsan','age':18})
print(r.json()) # {'args': {}, 'data': '', 'files': {}, 'form': {'age': '18', 'name': 'zhangsan'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '20', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.32.3', 'X-Amzn-Trace-Id': 'Root=1-67a702b9-7bc957120cdadae371c32908'}, 'json': None, 'origin': '58.241.18.10', 'url': 'http://httpbin.org/post'}


print('--------自定义headers请求--------')
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.62'
headers = {
    'User-Agent':ua
}
r = requests.get('http://httpbin.org/headers',headers=headers)
print(r.json()) # {'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.62', 'X-Amzn-Trace-Id': 'Root=1-67a70344-0928ab01777e0ca91dd4e6db'}}


print('--------带cookies请求--------')
cookies = dict(userid='111111', token='abcd1234')
r = requests.get('http://httpbin.org/cookies',cookies=cookies)
print(r.json()) # {'cookies': {'token': 'abcd1234', 'userid': '111111'}}


print('--------Basic-auth认证请求--------')
r = requests.get('http://httpbin.org/basic-auth/admin/admin123',auth=('admin','admin123'))
print(r.json()) # {'authenticated': True, 'user': 'admin'}


print('--------抛出状态码异常--------')
bad_r = requests.get('http://httpbin.org/status/404')
print(bad_r.status_code) # 404
# bad_r.raise_for_status() # 抛出异常 requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: http://httpbin.org/status/404


print('--------requests.Session请求对象--------')
# 创建一个Session对象
s = requests.Session()
# Session对象会保存服务器返回的set-cookies头信息里面的内容
s.get('http://httpbin.org/cookies/set/userid/111111')
s.get('http://httpbin.org/cookies/set/token/abcd1234')
# 下次请求将本地所有cookies信息自动添加到请求头信息里面
r = s.get('http://httpbin.org/cookies')
print(r.json()) # {'cookies': {'token': 'abcd1234', 'userid': '111111'}}


print('--------requests使用代理--------')
print('不使用代理：', requests.get('http://httpbin.org/ip').json())
proxies = {
    'http':'http://localhost:8080'
}
# print('使用代理：', requests.get('http://httpbin.org/ip',proxies=proxies).json())


print('--------请求超时--------')
r = requests.get('http://httpbin.org/delay/4',timeout=5)
print(r.text)

三、 curl、wget、urllib 和 requests的优缺点及区别

1、curl

优点

支持多种协议（HTTP, HTTPS, FTP等）
可以处理复杂的HTTP请求，包括认证、cookies等
命令行工具，适合脚本和自动化任务

缺点

不是Python原生库，需要通过子进程调用
对于Python开发者来说不够直观

适用场景

需要跨语言或跨平台使用时

2、wget

优点

简单易用，适合下载整个网站或文件
支持递归下载，可以下载整个站点

缺点

主要用于文件下载，功能相对单一
不支持发送自定义HTTP头或复杂请求
同样不是Python原生库，需通过命令行调用

适用场景

批量下载静态资源或整个网站

3、urllib

优点

Python标准库的一部分，无需额外安装
支持基本的HTTP/HTTPS请求，包括GET、POST
可以处理URL编码、重定向等

缺点

API设计较为老旧，代码可读性较差
处理复杂请求时代码量较大

适用场景

简单的HTTP请求或对库大小敏感的环境

4、requests

优点

API简洁友好，易于上手
支持会话保持、Cookies、文件上传等功能
自动处理重定向、SSL验证等
社区活跃，文档丰富

缺点

非标准库，需单独安装

适用场景

绝大多数Web爬虫和API交互任务

总结来说，在Python中开发爬虫时，推荐使用requests库，因为它提供了更简洁的API和更好的用户体验。对于简单的HTTP请求，也可以考虑使用urllib。而curl和wget更适合命令行操作或特定需求场景。