Basic Scraping Libraries: A Detailed Guide to httpx

In earlier posts we covered the urllib and requests libraries (see the posts on urllib and on requests), which are enough to scrape most websites. Some sites, however, remain out of reach, because they only accept HTTP/2.0 connections, while urllib and requests only speak HTTP/1.1. What do we do then? Simply switch to a request library that supports HTTP/2.0. The two most widely used are hyper and httpx; httpx is the more convenient and more powerful of the two, and it supports almost everything requests can do. So let's take a detailed look at the httpx library!

Table of Contents

The httpx Library

1. Example

2. Installation

3. Basic Usage

4. The Client Object


The httpx Library

1. Example

Let's look at an example. https://spa16.scrape.center/ is a site that enforces HTTP/2.0, so it cannot be fetched with requests. Don't believe it? Let's try:

python
import requests
url = 'https://spa16.scrape.center/'
response = requests.get(url)
print(response.text)

The result is as follows:

text
Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\util\util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\Python知识\01.py", line 3, in <module>
    response = requests.get(url)
               ^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\requests\adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

As you can see, a wall of errors. Painful! A RemoteDisconnected error is raised, which confirms that requests cannot talk to a site that requires HTTP/2.0.

2. Installation

httpx is a third-party library that needs to be installed first; it requires Python 3.6 or later. The install command is:

shell
pip install httpx

3. Basic Usage

httpx shares many of its APIs with requests. Let's start with the most basic GET request:

python
import httpx
response = httpx.get('https://www.httpbin.org/get')
print(response.status_code)
print(response.headers)
print(response.text)

Here we request the test site with httpx's get method, used exactly as in requests: assign the return value to a response variable, then print its status_code, headers, and text attributes. The running result is as follows:

text
200
Headers({'date': 'Thu, 22 Feb 2024 03:37:08 GMT', 'content-type': 'application/json', 'content-length': '311', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "www.httpbin.org", 
    "User-Agent": "python-httpx/0.27.0", 
    "X-Amzn-Trace-Id": "Root=1-65d6c164-3cdcebd7381e5873457a6866"
  }, 
  "origin": "111.72.54.67", 
  "url": "https://www.httpbin.org/get"
}

The output contains three parts: the status_code attribute is the status code, 200; the headers attribute is the response headers, a dict-like Headers object; and the text attribute is the response body. Note that the User-Agent is python-httpx/0.27.0, which shows the request was made with httpx. Now let's change the User-Agent and request again, rewriting the code as follows:

python
import httpx
headers = {
    'User-Agent': 'Mozilla/5.0(Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36(KHTML, like Gecko)Chrome/90.0.4430.93 Safari/537.36'
}
response = httpx.get('https://www.httpbin.org/get',headers=headers)
print(response.text)

Here we set a different User-Agent, assigned it to a headers variable, and passed it to the headers parameter. The running result is as follows:

text
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Mozilla/5.0(Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36(KHTML, like Gecko)Chrome/90.0.4430.93 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-65d6c2c6-0fef57aa6cc8a6fa4f0ea62e"
  }, 
  "origin": "111.72.54.67", 
  "url": "https://www.httpbin.org/get"
}

The User-Agent change took effect! Next, let's try requesting the HTTP/2.0-only site with httpx and see what happens:

python
import httpx
response = httpx.get('https://spa16.scrape.center/')
print(response.text)

The result is as follows:

text
Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 69, in map_httpcore_exceptions
    yield
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 233, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection_pool.py", line 216, in handle_request
    raise exc from None
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection_pool.py", line 196, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\connection.py", line 101, in handle_request
    return self._connection.handle_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 143, in handle_request
    raise exc
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 113, in handle_request
    ) = self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 186, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpcore\_sync\http11.py", line 238, in _receive_event
    raise RemoteProtocolError(msg)
httpcore.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\爬虫学习\Python知识\01.py", line 2, in <module>
    response = httpx.get('https://spa16.scrape.center/')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_api.py", line 198, in get
    return request(
           ^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_api.py", line 106, in request
    return client.request(
           ^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 827, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_client.py", line 1015, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 232, in handle_request
    with map_httpcore_exceptions():
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python311\Lib\contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "C:\Users\Lenovo\Desktop\爬虫学习\venv\Lib\site-packages\httpx\_transports\default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

Why the error? Didn't we say httpx supports HTTP/2.0? In fact, httpx does not enable HTTP/2.0 by default; it still speaks HTTP/1.1 unless you opt in explicitly (which also requires the optional h2 dependency, installed with pip install "httpx[http2]"). The code is as follows:

python
import httpx
client = httpx.Client(http2=True)
response = client.get('https://spa16.scrape.center/')
print(response.text)

Here we create a Client object, assign it to the client variable, and explicitly set the http2 parameter to True, which enables HTTP/2.0 support; the HTML is then fetched successfully, confirming that this demo site can only be accessed over HTTP/2.0. As mentioned earlier, httpx and requests share many APIs. The above was a GET request; POST, PUT, and DELETE requests are made in a similar way:

python
import httpx
r = httpx.get('https://www.httpbin.org/get', params={'name': 'germey'})
r = httpx.post('https://www.httpbin.org/post', data={'name': 'germey'})
r = httpx.put('https://www.httpbin.org/put')
r = httpx.delete('https://www.httpbin.org/delete')
r = httpx.patch('https://www.httpbin.org/patch')

From the resulting Response object, the following attributes and methods retrieve the content you need.

  • status_code: the status code.
  • text: the response body as text.
  • content: the response body as bytes; use this when the target is binary data, such as an image.
  • headers: the response headers, a Headers object whose values can be read like entries in a dict.
  • json(): a method that parses the response body and returns it as a JSON object.

Beyond these, other basic httpx usage is also extremely similar to requests, so we won't repeat it here; see the official quickstart: https://www.python-httpx.org/quickstart/

4. The Client Object

Some httpx APIs closely mirror those of requests, but others differ. For instance, httpx has a Client object, which is best understood as the counterpart of requests' Session object. Let's look at how to use it. The officially recommended style is the with-as statement:

python
import httpx
with httpx.Client() as client:
    response = client.get('https://www.httpbin.org/get')
    print(response)

The running result is as follows:

text
<Response [200 OK]>

This usage is equivalent to:

python
import httpx
client = httpx.Client()
try:
    response = client.get('https://www.httpbin.org/get')
finally:
    client.close()

Both forms produce the same result; the second simply requires an explicit close call on the Client at the end. When creating a Client you can also pass parameters such as headers, which are then applied by default to every request made through that object:

python
import httpx
url = 'http://www.httpbin.org/headers'
headers = {'User-Agent': 'my-app/0.0.1'}
with httpx.Client(headers=headers) as client:
    r = client.get(url)
    print(r.json()['headers']['User-Agent'])

Here we declared a headers variable containing a User-Agent, passed it to the headers parameter when initializing a Client assigned to client, then requested the test site through client and printed the User-Agent from the returned result:

text
my-app/0.0.1

As you can see, the headers setting took effect!

That concludes this introduction to httpx's basic usage. To sum up: httpx is a library very similar to requests, with the added benefit of HTTP/2.0 support.
