Python Data Analysis, Web Scraping Basics: requests Explained

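1. Basic usage of requests

1.1 Introduction to requests

requests is a widely used third-party Python library for sending HTTP requests, and it greatly simplifies interacting with web services. Its documentation has jokingly billed it as "the only Non-GMO HTTP library for Python, safe for human consumption."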

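1.2 Installing requests

The -i option points pip at the Tsinghua University PyPI mirror; omit it to use the default index.

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests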
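
A quick way to confirm the installation worked is to import the library and print its version. This is only a sanity check, and the variable name installed_version is chosen here for illustration.

```python
import requests

# __version__ is a plain string such as "2.x.y"
installed_version = requests.__version__
print("requests version:", installed_version)
```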

1.3 Basic syntax

import requests

url = 'http://www.baidu.com'
response = requests.get(url)
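
In practice a request can fail or hang, so it may help to wrap the call. The sketch below is one possible wrapper, not part of the original tutorial: the helper name fetch and the 10-second timeout are assumptions. It uses a URL without a scheme to trigger an error, which requests rejects before any network traffic happens.

```python
import requests


def fetch(url, timeout=10):
    """Return the response for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes
        response.raise_for_status()
        return response
    except requests.RequestException as exc:
        # RequestException is the base class for the errors requests raises
        print(f"request failed: {exc}")
        return None


# A URL without a scheme is rejected while the request is being prepared
result = fetch("not-a-url")
print(result)
```

With a real URL such as the one above, fetch returns the same Response object that requests.get would.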

1.4 Type and attributes of the response

(1) One type:

print(type(response)) # <class 'requests.models.Response'>

(2) Six attributes:

# Set the encoding used to decode the response body
response.encoding = 'utf-8'
# Page source as a string
print(response.text)
# Final URL of the response
print(response.url)
# Response body as raw bytes
print(response.content)
# HTTP status code
print(response.status_code)
# Response headers
print(response.headers)
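
The relationship between content, encoding and text can be illustrated without any network access. In the sketch below, plain bytes stand in for response.content (an assumption made for illustration). When a text response declares no charset, requests falls back to ISO-8859-1, which is why Chinese pages can come out garbled until response.encoding is set.

```python
# Stand-in for response.content: the UTF-8 bytes a server might send
raw_bytes = "郑州天气".encode("utf-8")

# Decoding with the wrong codec produces mojibake instead of raising
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the codec the page really uses restores the text,
# which is what response.encoding = 'utf-8' achieves for response.text
decoded = raw_bytes.decode("utf-8")

print(repr(garbled))
print(decoded)
```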

2. GET requests with requests

This example fetches the Baidu search results page for "郑州" (Zhengzhou). The process is much the same as with urllib, so if you understand urllib, GET requests in requests should pose no difficulty.

import requests

url = 'https://www.baidu.com/s?'
headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
data = {
    "wd": "郑州"
}
# url: resource path; params: query parameters; kwargs: extra keyword arguments such as headers
response = requests.get(url=url, params=data, headers=headers)
content = response.text
print(content)

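Differences from a GET request with urllib:

1. Parameters are passed with params

2. Parameters do not need to be urlencoded

3. No custom request object is needed

4. The ? in the resource path can be omitted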
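
Points 2 and 4 can be checked offline by preparing a request without sending it. This is a minimal sketch: requests.Request(...).prepare() builds the final URL, and the names with_mark, without_mark and manual_url are chosen here for illustration.

```python
import requests
from urllib.parse import quote

params = {"wd": "郑州"}

# prepare() builds the final request without sending it
with_mark = requests.Request("GET", "https://www.baidu.com/s?", params=params).prepare()
without_mark = requests.Request("GET", "https://www.baidu.com/s", params=params).prepare()

# What you would have to assemble by hand with urllib
manual_url = "https://www.baidu.com/s?wd=" + quote("郑州")

print(with_mark.url)
print(without_mark.url)
print(manual_url)
```

All three URLs should be identical: requests percent-encodes the parameters and supplies the ? itself.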

3. POST requests with requests

As in the earlier urllib article, we use the Baidu Translate POST request as the example:

import json

import requests

url = "https://fanyi.baidu.com/sug"
headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    # Paste the cookie string copied from your own browser session, on a single line
    "cookie": "<your cookie string>"
}
data = {
    "kw": "eye"
}
response = requests.post(url=url, headers=headers, data=data)
content = response.text
# The body is JSON, so parse it into Python objects
content = json.loads(content)
print(content)

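Differences from a POST request with urllib:

1. The POST request needs no manual encoding or decoding

2. POST parameters are passed with data

3. No custom request object is needed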
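
To see what "no manual encoding" means, the request can be prepared without sending it and its body inspected. This is a small sketch; the variable name prepared is chosen here for illustration.

```python
import requests

# Prepare the POST request without sending it, to inspect what would be sent
prepared = requests.Request(
    "POST",
    "https://fanyi.baidu.com/sug",
    data={"kw": "eye"},
).prepare()

print(prepared.body)                     # form-encoded request body
print(prepared.headers["Content-Type"])  # set automatically for dict data
```

On the response side, response.json() can be used in place of json.loads(response.text) when the body is JSON.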

4. Proxies

import requests

url = "http://www.baidu.com/s?"
headers = {
    # "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    # "cookie": "<your cookie string>"
}
data = {
    "wd": "ip"
}
# Proxy to route the request through, keyed by URL scheme
proxy = {
    "http": "23.247.137.142:80"
}
response = requests.get(url=url, params=data, headers=headers, proxies=proxy)
content = response.text
# Save the page so the IP address it reports can be checked in a browser
with open("ip.html", "w", encoding="utf-8") as file:
    file.write(content)
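
One detail worth checking is that the keys of the proxies dict are URL schemes, so an "http" entry is not applied to https:// URLs. The sketch below uses select_proxy, a helper in requests.utils, to show which proxy would be picked for a given URL. No request is sent; the proxy address is the one from the example above and may no longer be live.

```python
from requests.utils import select_proxy

# Same shape as the proxies dict above: keys are URL schemes
proxies = {"http": "http://23.247.137.142:80"}

# Ask which proxy would be used for each URL
http_choice = select_proxy("http://www.baidu.com/s", proxies)
https_choice = select_proxy("https://www.baidu.com/s", proxies)

print(http_choice)   # the proxy is chosen for an http:// URL
print(https_choice)  # no "https" key, so no proxy is chosen
```

To proxy HTTPS traffic as well, add an "https" key to the dict.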

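5. Cookie login

We use the personal home page of gushiwen.cn (a classical Chinese poetry site) as the example; its login form includes a captcha.

First, open the login page, enter any password and submit, then open the browser's developer tools and find the login request. Its payload contains many fields, including:

__VIEWSTATE: MnTNH2SbI9isHX8zdfu1NvmByZXoSVf8Vxj5QIeJ5C8EmgWhaBFQRNjQYMe47E+qOO+ss1LSDNdjYeNRy/bdvD7wktgbMm73Cku21k7NhLMYo79CC54kuz//cZ9kSLKKFvkpppzOssnyET3GX789uH1DMUM=
__VIEWSTATEGENERATOR: C93BE1AE

These two values are not fixed; they are variables, and so is code (the captcha). Handling these three variables is the hard part of this example.

Difficulty (1): __VIEWSTATE and __VIEWSTATEGENERATOR

Back on the login page, inspecting the source shows that both values are present in it, in input elements of type hidden, which we call hidden fields.

Get the login page source: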

import requests
url = "https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx"
headers = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
content = response.text

The value attributes of __VIEWSTATE and __VIEWSTATEGENERATOR can be parsed with BeautifulSoup or, as here, with XPath:

from lxml import etree

tree = etree.HTML(content)
# xpath() returns a list, so take the first match
viewstate = tree.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
viewstategenerator = tree.xpath('//input[@name="__VIEWSTATEGENERATOR"]/@value')[0]
print(viewstate)
print(viewstategenerator)
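
If neither lxml nor BeautifulSoup is available, the same hidden fields can be collected with the standard library. The sketch below is an alternative to the XPath approach, not the method used in this article: the class name HiddenInputParser and the HTML fragment are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and "name" in attributes:
            self.hidden_fields[attributes["name"]] = attributes.get("value", "")


# Made-up fragment standing in for the login page source
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123==" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="C93BE1AE" />
  <input type="text" name="email" />
</form>
"""

parser = HiddenInputParser()
parser.feed(sample_html)
print(parser.hidden_fields)
```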

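Difficulty (2): the captcha code (getting the captcha image)

code = tree.xpath('//img[@id="imgCode"]/@src')[0]
code_url = "https://so.gushiwen.cn" + code

Once we have the captcha image URL, we can download the image, look at it, and type the captcha into the console. (pytesseract could also be used to recognize it.)

import urllib.request

urllib.request.urlretrieve(url=code_url, filename="code.jpg")
code_name = input("Enter the captcha: ")

This approach clearly has a problem: the image download and the later login are unrelated requests, so by the time we submit, the captcha we typed is an old one that no longer matches what the server expects. We can instead use session() from the requests library: requests made through the object it returns share cookies, so the captcha request and the login request belong to the same session.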

session = requests.session()
response_code = session.get(code_url)
# Use the binary content here, because we are downloading an image
content_code = response_code.content
# "wb" mode writes binary data to the file
with open("code.jpg", "wb") as f:
    f.write(content_code)
code_name = input("Enter the captcha: ")
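
The effect of a session can be seen offline by preparing requests without sending them. In this sketch the cookie name and value are made up for illustration and stand in for whatever the server would set when the captcha is fetched.

```python
import requests

url = "https://www.gushiwen.cn/user/login.aspx"

session = requests.Session()
# Pretend an earlier response (e.g. the captcha request) set this cookie;
# the name and value are hypothetical
session.cookies.set("demo_session_id", "abc123")

# Prepared through the session: the stored cookie is attached
via_session = session.prepare_request(requests.Request("GET", url))
# Prepared on its own: nothing is attached
standalone = requests.Request("GET", url).prepare()

print(via_session.headers.get("Cookie"))
print(standalone.headers.get("Cookie"))
```

This is why the login POST has to go through the same session object as the captcha download.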

Capture the request sent by the login button:

url_post = "https://www.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fwww.gushiwen.cn%2fuser%2fcollect.aspx"
data_post = {
    "__VIEWSTATE": viewstate,
    "__VIEWSTATEGENERATOR": viewstategenerator,
    "from": "http://www.gushiwen.cn/user/collect.aspx",
    # Replace with your own account and password
    "email": "<your account>",
    "pwd": "<your password>",
    "code": code_name,
    "denglu": "登录"
}
response_post = session.post(url=url_post, headers=headers, data=data_post)
content_post = response_post.text
# Save the returned page to check whether the login succeeded
with open("古诗文.html", "w", encoding="utf-8") as f:
    f.write(content_post)
The complete code is as follows:

import requests
from lxml import etree

url = "https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx"
headers = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
content = response.text

# Extract the hidden fields and the captcha image path from the login page
tree = etree.HTML(content)
viewstate = tree.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
viewstategenerator = tree.xpath('//input[@name="__VIEWSTATEGENERATOR"]/@value')[0]
code = tree.xpath('//img[@id="imgCode"]/@src')[0]
code_url = "https://so.gushiwen.cn" + code

# Download the captcha through a session so the login shares its cookies
session = requests.session()
response_code = session.get(code_url)
content_code = response_code.content  # binary content, because this is an image
with open("code.jpg", "wb") as f:  # "wb" mode writes binary data to the file
    f.write(content_code)
code_name = input("Enter the captcha: ")

url_post = "https://www.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fwww.gushiwen.cn%2fuser%2fcollect.aspx"
data_post = {
    "__VIEWSTATE": viewstate,
    "__VIEWSTATEGENERATOR": viewstategenerator,
    "from": "http://www.gushiwen.cn/user/collect.aspx",
    # Replace with your own account and password
    "email": "<your account>",
    "pwd": "<your password>",
    "code": code_name,
    "denglu": "登录"
}
response_post = session.post(url=url_post, headers=headers, data=data_post)
content_post = response_post.text
with open("古诗文.html", "w", encoding="utf-8") as f:
    f.write(content_post)