Earlier, we covered the urllib and requests libraries (see the previous posts on urllib and on requests), which are enough to scrape most websites. Some sites, however, remain out of reach, because they force clients to connect over HTTP/2.0, while urllib and requests support only HTTP/1.1. What can we do in that case? Simply switch to a request library that does support HTTP/2.0. The two most widely used options today are hyper and httpx. Of the two, httpx is more convenient and more powerful, and it supports almost everything requests can do. So let's take a detailed look at the httpx library!
The httpx Library
1. Example
Let's start with an example: https://spa16.scrape.center/ is a site that only accepts HTTP/2.0. Requesting it with the requests library fails. Don't believe it? Let's try:
```python
import requests

url = 'https://spa16.scrape.center/'
response = requests.get(url)
print(response.text)
```
The result is as follows:
```
Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\util\util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\Python知识\01.py", line 3, in <module>
    response = requests.get(url)
               ^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
```
As you can see, a wall of errors. Painful! A RemoteDisconnected error is raised: this is what happens when requests tries to access a site that only speaks HTTP/2.0.
2. Installation
httpx is a third-party library that must be installed first, and it requires Python 3.6 or later. Note that HTTP/2.0 support ships as an optional extra, so install that extra as well if you need it. The commands are as follows:

```shell
pip install httpx            # basic installation, HTTP/1.1 only
pip install "httpx[http2]"   # with HTTP/2.0 support
```
3. Basic Usage
httpx and requests have many similar APIs. Let's first look at the most basic GET request:
```python
import httpx

response = httpx.get('https://www.httpbin.org/get')
print(response.status_code)
print(response.headers)
print(response.text)
```
Here we request the familiar test site directly with httpx's get method, whose usage is identical to that of requests. We assign the return value to the variable response, then print its status_code, headers, and text attributes. The result is as follows:
```
200
Headers({'date': 'Thu, 22 Feb 2024 03:37:08 GMT', 'content-type': 'application/json', 'content-length': '311', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.httpbin.org",
    "User-Agent": "python-httpx/0.27.0",
    "X-Amzn-Trace-Id": "Root=1-65d6c164-3cdcebd7381e5873457a6866"
  },
  "origin": "111.72.54.67",
  "url": "https://www.httpbin.org/get"
}
```
The output contains three parts: the status_code attribute is the status code, 200; the headers attribute is the response headers, a Headers object that behaves much like a dictionary; and the text attribute is the response body. Notice that the User-Agent in it is python-httpx/0.27.0, which shows the request was made with httpx. Let's change the User-Agent and request again; the code becomes:
```python
import httpx

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
response = httpx.get('https://www.httpbin.org/get', headers=headers)
print(response.text)
```
Here we define a different User-Agent, assign it to the headers variable, and pass it via the headers parameter. The result is as follows:
```
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-65d6c2c6-0fef57aa6cc8a6fa4f0ea62e"
  },
  "origin": "111.72.54.67",
  "url": "https://www.httpbin.org/get"
}
```
The User-Agent change took effect! Next, let's try requesting that HTTP/2.0-only site with httpx and see what happens:
```python
import httpx

response = httpx.get('https://spa16.scrape.center/')
print(response.text)
```
The result is as follows:
```
Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 69, in map_httpcore_exceptions
    yield
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 233, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection_pool.py", line 216, in handle_request
    raise exc from None
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection_pool.py", line 196, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection.py", line 101, in handle_request
    return self._connection.handle_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 143, in handle_request
    raise exc
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 113, in handle_request
    ) = self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 186, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 238, in _receive_event
    raise RemoteProtocolError(msg)
httpcore.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\Python知识\01.py", line 2, in <module>
    response = httpx.get('https://spa16.scrape.center/')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_api.py", line 198, in get
    return request(
           ^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_api.py", line 106, in request
    return client.request(
           ^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 827, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 1015, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 232, in handle_request
    with map_httpcore_exceptions():
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.
```
Wait, why is it failing? Wasn't httpx supposed to support HTTP/2.0? It does, but not by default: httpx still speaks HTTP/1.1 unless HTTP/2.0 is explicitly enabled. The code looks like this:
```python
import httpx

client = httpx.Client(http2=True)
response = client.get('https://spa16.scrape.center/')
print(response.text)
```
Here we create a Client object, assign it to the variable client, and explicitly set the http2 parameter to True, which enables HTTP/2.0 support. With that, the HTML source is fetched successfully, confirming that this example site can only be accessed over HTTP/2.0. As mentioned earlier, httpx and requests share many similar APIs. The above was a GET request; POST, PUT, and DELETE requests work analogously:
```python
import httpx

r = httpx.get('https://www.httpbin.org/get', params={'name': 'germey'})
r = httpx.post('https://www.httpbin.org/post', data={'name': 'germey'})
r = httpx.put('https://www.httpbin.org/put')
r = httpx.delete('https://www.httpbin.org/delete')
r = httpx.patch('https://www.httpbin.org/patch')
```
Based on the Response object obtained, the following attributes and methods give you the content you want:

- status_code: the status code.
- text: the response body as text.
- content: the response body as bytes; use this when the target is binary data such as an image.
- headers: the response headers, a Headers object whose values can be read just like entries in a dictionary.
- json: a method that parses the text result into a JSON object.
Beyond these, many of httpx's other basic usages also closely mirror requests, so we won't repeat them here; see the official documentation: https://www.python-httpx.org/quickstart/
4. The Client Object
Some of httpx's basic APIs closely resemble those in requests, but others do not. One example is httpx's Client object, which is best understood by analogy with the Session object in requests. Let's look at how to use it. The officially recommended approach is the with-as statement:
```python
import httpx

with httpx.Client() as client:
    response = client.get('https://www.httpbin.org/get')
    print(response)
```
The result is as follows:
```
<Response [200 OK]>
```
This usage is equivalent to:
```python
import httpx

client = httpx.Client()
try:
    response = client.get('https://www.httpbin.org/get')
finally:
    client.close()
```
Both forms produce the same result; the only difference is that the second requires us to call the close method explicitly at the end to shut down the Client object. Also, when creating a Client object you can specify certain parameters, such as headers, and every request issued through that object will then carry this configuration by default. For example:
```python
import httpx

url = 'http://www.httpbin.org/headers'
headers = {'User-Agent': 'my-app/0.0.1'}
with httpx.Client(headers=headers) as client:
    r = client.get(url)
    print(r.json()['headers']['User-Agent'])
```
Here we declare a headers variable containing a User-Agent entry, pass it via the headers parameter to initialize a Client object assigned to client, then use that client to request the test site and print the User-Agent from the returned result:
```
my-app/0.0.1
```
As you can see, the headers setting took effect!
That concludes this introduction to the basics of httpx. To summarize: httpx is a library very similar to requests, with the added benefit of HTTP/2.0 support.