Python Web Scraping Basics: Using the requests Library

The requests library is built on top of urllib and is considerably cleaner and more elegant, which makes it a popular choice when writing crawlers. Note that this article focuses on implementation; detailed explanations of individual fields and parameters can be found in other posts in this column (to be updated as circumstances allow). Without further ado, let's begin.

python
import requests

requests can issue the following kinds of network requests:

python
requests.get()
requests.post()
requests.head()
requests.put()
requests.options()
requests.patch()
requests.delete()
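
Each helper corresponds to the HTTP verb of the same name. A minimal, hedged sketch against httpbin.org (an echo service assumed reachable here) illustrates a few of them:

python
import requests

# Each helper sends the HTTP verb of the same name
r = requests.post('https://httpbin.org/post', data={'k': 'v'})  # POST with form data
r = requests.put('https://httpbin.org/put', data={'k': 'v'})    # PUT
r = requests.delete('https://httpbin.org/delete')               # DELETE
print(r.status_code)  # 200 if the service is reachable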

1. Basic attributes of the Response object

A basic request looks like the following; once it runs, the content of the page is fetched and stored in the response object.

python
response = requests.get('https://httpbin.org/get')
python
# Body as text (str), decoded with response.encoding
print(response.text)
# Body parsed as JSON; raises json.decoder.JSONDecodeError if the body is not JSON
print(response.json())
# Body as raw bytes, used to fetch images and video streams
print(response.content)
# Status code
print(response.status_code)
# Response headers
print(response.headers)
# Final URL of the request
print(response.url)
# Cookies
print(response.cookies)
# Encoding used to decode response.text
print(response.encoding)
# Redirect history
print(response.history)
# Elapsed time of the request
print(response.elapsed)
# status_code < 400 --> True
print(response.ok)
# Whether this response is a redirect
print(response.is_redirect)
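
As a hedged sketch of how these attributes combine in practice (again assuming httpbin.org is reachable), a common pattern is to check response.ok before decoding:

python
import requests

response = requests.get('https://httpbin.org/get')
if response.ok:  # status_code < 400
    # apparent_encoding guesses the charset from the body itself
    response.encoding = response.apparent_encoding
    print(response.text[:200])  # first 200 characters of the page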

requests ships with a built-in status-code registry, but the names are all English descriptions. In my view it is enough to memorize the numeric values of the common status codes, since the registry itself resolves to plain integers anyway.

python
print(type(requests.codes.ok))  # <class 'int'> -->200
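
A hedged sketch of using the registry, along with the related raise_for_status() helper, for status checks:

python
import requests

response = requests.get('https://httpbin.org/get')
if response.status_code == requests.codes.ok:  # requests.codes.ok == 200
    print('request succeeded')
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes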

Common status codes:

python
# Informational
100:('continue',),
101:('switching_protocols',),
102:('processing',),
103:('checkpoint',),
122:('uri_too_long','request_uri_too_long'),

# Success
200:('ok','okay','all_ok','all_okay','all_good','\\o/','✓'),
201:('created',),
202:('accepted',),
203:('non_authoritative_info','non_authoritative_information'),
204:('no_content',),
205:('reset_content','reset'),
206:('partial_content','partial'),
207:('multi_status','multiple_status','multi_stati','multiple_stati'),
208:('already_reported',),
226:('im_used',),

# Redirection
300:('multiple_choices',),
301:('moved_permanently','moved','\\o-'),
302:('found',),
303:('see_other','other'),
304:('not_modified',),
305:('use_proxy',),
306:('switch_proxy',),
307:('temporary_redirect','temporary_moved','temporary'),
308:('permanent_redirect',),

# Client errors
400:('bad_request','bad'),
401:('unauthorized',),
402:('payment_required','payment'),
403:('forbidden',),
404:('not_found','-o-'),
405:('method_not_allowed','not_allowed'),
406:('not_acceptable',),
407:('proxy_authentication_required','proxy_auth','proxy_authentication'),
408:('request_timeout','timeout'),
409:('conflict',),
410:('gone',),
411:('length_required',),
412:('precondition_failed','precondition'),
413:('request_entity_too_large',),
414:('request_uri_too_large',),
415:('unsupported_media_type','unsupported_media','media_type'),
416:('request_range_not_satisfiable','requested_range','range_not_satisfiable'),
417:('expectation_failed',),
418:('im_a_teapot','teapot','i_am_a_teapot'),
421:('misdirected_request',),
422:('unprocessable_entity','unprocessable'),
423:('locked',),
424:('failed_dependency','dependency'),
425:('unordered_collection','unordered'),
426:('upgrade_required','upgrade'),
428:('precondition_required','precondition'),
429:('too_many_requests','too_many'),
431:('header_fields_too_large','fields_too_large'),
444:('no_response','none'),
449:('retry_with','retry'),
450:('blocked_by_windows_parental_controls','parental_controls'),
451:('unavailable_for_legal_reasons','legal_reasons'),
499:('client_closed_request',),

# Server errors
500:('internal_server_error','server_error','/o\\','✗'),
501:('not_implemented',),
502:('bad_gateway',),
503:('service_unavailable','unavailable'),
504:('gateway_timeout',),
505:('http_version_not_supported','http_version'),
506:('variant_also_negotiates',),
507:('insufficient_storage',),
509:('bandwidth_limit_exceeded','bandwidth'),
510:('not_extended',),
511:('network_authentication_required','network_auth','network_authentication')

2. requests.get()

The parameters of the get() method are as follows:

1) params: query parameters appended to the URL; may be passed as a dict, a list of tuples, or bytes.

python
params = {'name': 'li', 'age': 20}
response = requests.get('https://httpbin.org/get', params=params)

The requested URL thus changes from https://httpbin.org/get to https://httpbin.org/get?name=li&age=20
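
A hedged sketch of the list-of-tuples form, which is useful when a key repeats (the parameter values here are made up for illustration):

python
import requests

# a list of tuples allows repeated keys, which a dict cannot express
params = [('tag', 'python'), ('tag', 'requests')]
response = requests.get('https://httpbin.org/get', params=params)
print(response.url)  # https://httpbin.org/get?tag=python&tag=requests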

2) headers: the request headers, passed in as a dict; they can carry many fields. Two of the most frequently used are introduced below.

python
User_Agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
cookies = "A=123; B=321; C__1=666=; D_A2=888=="
headers = {'User-Agent': User_Agent,
           'Cookie': cookies}  # the standard header name is "Cookie"
response = requests.get('https://httpbin.org/get', headers=headers)

User-Agent: the browser identification string. Servers inspect the User-Agent of a request to judge where it came from; if the field is missing, or indicates that the request did not come from any known browser, the request is assumed to come from a crawler (the default User-Agent of a requests request is python-requests/{version}), and most servers will then refuse access. This is the simplest anti-crawling mechanism, and also the easiest to defeat. Servers sometimes also flag the same browser identity making frequent visits as a crawler; this rule can be sidestepped by rotating through multiple User-Agent strings at random, as in the sketch below.

python
import random
import re

lines = []
with open('userAgents.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # strip the surrounding double quotes, then any whitespace
        cleaned_line = re.sub(r'^"|"$', '', line)
        lines.append(cleaned_line.strip())
User_Agent = random.choice(lines)

The userAgents.txt file stores a variety of browser identification strings; its contents look like this:

text
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5"
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7"
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14"
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1"
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7"
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)"
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)"
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0"
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2"
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre"
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )"
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"
"Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a"
"Mozilla/2.02E (Win95; U)"
"Mozilla/3.01Gold (Win95; I)"
"Mozilla/4.8 [en] (Windows NT 5.1; U)"
"Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)"
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0"
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3"
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2"
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1"
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2"
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3"
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2"
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17"
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13"
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2"
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"

Cookie: data that certain websites store on the user's local machine (usually encrypted) in order to identify the user, kept temporarily or permanently by the client. It is the main thing a crawler relies on to carry out session-based operations.

python
cookies="A=123; B=321; C__1=666=; D_A2=888== "

The above is a cookie string. As for why it looks so contrived: protecting cookie confidentiality starts with me.

"Cookie劫持是指攻击者通过某种手段获取用户的Cookie信息,进而冒充用户身份进行非法操作的行为"

Notice the structure of a cookie string: "key=value; key=value...". Key-value pairs are separated by "; ", and within each pair the key and value are split by the first "=" that occurs. In other words, a key never contains "=", while a value may. This rule lets us break the cookie string down into key-value pairs, i.e. a dict:

python
# split each pair on the first "=" only, since values may themselves contain "="
cookies_dic = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookies.split("; ")}

Cookies can also be read from the response object, or passed in a separate cookies parameter alongside headers; that parameter accepts a dict or a RequestsCookieJar, so other representations need converting first.
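
A hedged sketch of passing the parsed dict back, plus the explicit jar conversion (it assumes cookies_dic from above; httpbin echoes whatever cookies it receives):

python
import requests
from requests.cookies import cookiejar_from_dict

# a plain dict works directly as the cookies parameter
response = requests.get('https://httpbin.org/cookies', cookies=cookies_dic)
print(response.json())
# an explicit RequestsCookieJar, if an API requires one
jar = cookiejar_from_dict(cookies_dic)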

3) proxies: IP proxies. Much like with browser identification, a server that sees the same IP visiting too frequently may flag it as a crawler. Using IP proxies is the most cost-effective way to deal with this kind of anti-crawling measure, though it carries some cost (free proxies exist but are unstable; reasonably stable paid proxies are not expensive and often come with a free quota).

python
proxies = {
    'http': 'http://10.10.10.10:66',   # placeholder proxy address
    'https': 'https://10.10.10.10:66'  # placeholder proxy address
}
response = requests.get('https://httpbin.org/get', proxies=proxies)

Paid proxies usually require user authentication, in a format like http://user:password@host:port, where the first half is the username and password and the second half is the proxy server address and port (HTTP Basic Auth).

python
# SOCKS5 proxies require: pip install requests[socks]
proxies = {
    'http': 'http://user:password@10.10.10.66:8080/',   # for HTTP requests
    'https': 'http://user:password@10.10.10.66:8080/',  # for HTTPS requests
}
# Note: proxy dict keys are the *target* URL scheme, so to route traffic
# through SOCKS5, use the socks5:// scheme in the values for both keys:
# proxies = {'http': 'socks5://user:password@10.10.10.66:1080/',
#            'https': 'socks5://user:password@10.10.10.66:1080/'}

For HTTP and HTTPS proxies, most proxy servers expect the connection to be made with the http:// scheme, even when the target URL is HTTPS. That is, requests talks to the proxy itself over HTTP, but the user data of an HTTPS request is still encrypted in transit. requests also allows HTTP and HTTPS requests to share the same proxy address, as long as that proxy supports encrypted HTTPS traffic.

4) timeout: timeout settings

python
timeout = 5
response = requests.get('https://httpbin.org/get', timeout=timeout)

timeout=5 means the request times out after 5 seconds: if no result comes back within 5 seconds, a Timeout exception is raised. Note that this covers both connecting and reading. To set the connect and read timeouts separately, pass a tuple, e.g. timeout=(5, 30) for a 5-second connect timeout and a 30-second read timeout. If timeout is not set (the default is None), requests waits indefinitely until the server responds or an error occurs.
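
A hedged sketch of the tuple form together with exception handling (httpbin's /delay endpoint, assumed available, stalls the response on purpose):

python
import requests

try:
    # 5 s to connect, 3 s to read; /delay/10 stalls for 10 s, so this should time out
    response = requests.get('https://httpbin.org/delay/10', timeout=(5, 3))
except requests.exceptions.Timeout:
    print('request timed out')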

5) verify: SSL certificate verification

python
verify = False
response = requests.get('https://httpbin.org/get', verify=verify)

Sometimes when crawling certain sites, the browser shows a certificate warning of the kind you can click past via the "Advanced" option. In a crawler, the same situation raises an error complaining that SSL certificate verification failed. To avoid this, set verify to False so certificate verification is skipped. As a rule, though, don't disable verification unless you actually run into this problem, since doing so introduces security risks.

Even with certificate verification disabled, a warning is still emitted. You can either suppress it or capture it through the logging module.

python
# Suppress the warning
import requests
import urllib3  # the library underlying requests; importing it directly is the modern idiom
urllib3.disable_warnings()
response = requests.get('https://httpbin.org/get', verify=False)

# Capture the warning via the logging module
import logging
import requests
logging.captureWarnings(True)
response = requests.get('https://httpbin.org/get', verify=False)

You can also supply a local certificate as a client-side certificate via the cert parameter, either as a single file (containing the private key and certificate) or as a tuple of the two file paths. Note that the private key must be unencrypted.

python
cert = ('/path/server.crt', '/path/key')
response = requests.get('https://httpbin.org/get', cert=cert)

6) auth: authentication

python
auth = ('user', 'password')
response = requests.get('https://httpbin.org/get', auth=auth)
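
The tuple is shorthand for HTTP Basic Auth. A hedged sketch of the explicit form, against httpbin's basic-auth endpoint (assumed available):

python
import requests
from requests.auth import HTTPBasicAuth

response = requests.get('https://httpbin.org/basic-auth/user/password',
                        auth=HTTPBasicAuth('user', 'password'))
print(response.status_code)  # 200 when the credentials match the URL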

Putting it all together, a typical get() call looks like this:

python
response = requests.get('https://httpbin.org/get',
                        headers=headers, params=params, verify=verify,
                        timeout=timeout, proxies=proxies, auth=auth)

3. requests.post()

On top of the get() parameters, post() has two additional commonly used parameters for submitting data to the server. The data parameter submits form data, typically as a dict; the files parameter submits file data. data may also be a list of tuples, bytes, or a file-like object.

python
data = {
    'name': 'li',
    'age': 20
}
# files should be opened in binary mode for upload
files = {'file': open('userAgents.txt', 'rb')}
response = requests.post('https://httpbin.org/post', data=data, files=files)

There are of course other parameters, such as:

json: pass a JSON-serializable object

allow_redirects: whether to follow redirects (bool)

stream: streaming request, mainly for consuming streaming APIs

These will come up in later updates; a brief sketch follows.
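
A hedged sketch of json and stream (the httpbin endpoints are assumed available; the chunk size is arbitrary):

python
import requests

# json= serializes the dict and sets Content-Type: application/json
response = requests.post('https://httpbin.org/post', json={'name': 'li', 'age': 20})
print(response.json()['json'])  # httpbin echoes the parsed JSON body

# stream=True defers downloading the body until it is iterated
with requests.get('https://httpbin.org/bytes/1024', stream=True) as r:
    for chunk in r.iter_content(chunk_size=256):
        pass  # process each 256-byte chunk here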

4. requests.session()

Every get() or post() call is a separate, independent session. To maintain a session across requests, use a Session object.

(You could also keep a session going by carrying the same cookies on every request, but that is somewhat tedious.)

python
session = requests.session()
# the first request sets a cookie, which is stored on the session
response = session.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# subsequent requests through the same session carry it automatically
response = session.get('http://httpbin.org/cookies')
print(response.text)  # {"cookies": {"sessioncookie": "123456789"}}

Using session() is very simple; it is just one extra call in the chain. After the first request to a site, the user's cookies are stored on the session object, and all subsequent requests made through it stay within the same session by default.
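
Sessions can also carry default headers (and proxies, auth, and so on) that apply to every request made through them; a hedged sketch with a made-up User-Agent string:

python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'my-crawler/0.1'})  # hypothetical UA, for illustration
response = session.get('https://httpbin.org/headers')
print(response.json()['headers']['User-Agent'])  # the session default was sent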

5. Advanced usage of requests

The PreparedRequest object is an advanced feature of the requests library. It lets you configure and customize a request in detail before it is sent, which is convenient for building request queues in large-scale jobs.

python
from requests import Request, Session
url = 'https://httpbin.org/get'
s = Session()
req = Request('GET', url)
prepped = s.prepare_request(req)
r = s.send(prepped)
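
One benefit of this two-step flow is that the prepared request can be inspected or tweaked before sending; a hedged sketch with a made-up debug header:

python
# modify the prepared request before it goes out
prepped.headers['X-Debug'] = '1'  # hypothetical header, for illustration only
r = s.send(prepped, timeout=5)
print(r.status_code)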