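In my earlier distributed crawlers, the spider always got the links it consumes from URLs pushed onto a queue. That raises a question: what if the request has to be a POST? I looked at the scrapy-redis source, where spiders.py handles it like this.
1. scrapy-redis source code analysis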
python
def make_request_from_data(self, data):
    """Returns a `Request` instance for data coming from Redis.

    Overriding this function to support the `json` requested `data` that contains
    `url` ,`meta` and other optional parameters. `meta` is a nested json which contains sub-data.

    Along with:
    After accessing the data, sending the FormRequest with `url`, `meta` and addition `formdata`, `method`

    For example:

    .. code:: json

        {
            "url": "https://example.com",
            "meta": {
                "job-id":"123xsd",
                "start-date":"dd/mm/yy",
            },
            "url_cookie_key":"fertxsas",
            "method":"POST",
        }

    If `url` is empty, return `[]`. So you should verify the `url` in the data.
    If `method` is empty, the request object will set method to 'GET', optional.
    If `meta` is empty, the request object will set `meta` to an empty dictionary, optional.

    This json supported data can be accessed from 'scrapy.spider' through response.
    'request.url', 'request.meta', 'request.cookies', 'request.method'

    Parameters
    ----------
    data : bytes
        Message from redis.

    """
    formatted_data = bytes_to_str(data, self.redis_encoding)

    if is_dict(formatted_data):
        parameter = json.loads(formatted_data)
    else:
        self.logger.warning(
            f"{TextColor.WARNING}WARNING: String request is deprecated, please use JSON data format. "
            f"Detail information, please check https://github.com/rmax/scrapy-redis#features{TextColor.ENDC}"
        )
        return FormRequest(formatted_data, dont_filter=True)

    if parameter.get("url", None) is None:
        self.logger.warning(
            f"{TextColor.WARNING}The data from Redis has no url key in push data{TextColor.ENDC}"
        )
        return []

    url = parameter.pop("url")
    method = parameter.pop("method").upper() if "method" in parameter else "GET"
    metadata = parameter.pop("meta") if "meta" in parameter else {}

    return FormRequest(
        url, dont_filter=True, method=method, formdata=parameter, meta=metadata
    )
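Source: https://github.com/rmax/scrapy-redis
So scrapy-redis can already handle POST requests: `method` is read from the JSON message, and whatever keys remain after `url`, `method` and `meta` are popped are passed to FormRequest as `formdata`.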
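For the producer side, a minimal sketch of building such a JSON message might look like the following. The helper `build_task`, the endpoint, the form fields and the key name `myspider:start_urls` are all assumptions made for illustration; the key simply follows scrapy-redis's default `<spider name>:start_urls` pattern.

```python
import json


def build_task(url, method="GET", meta=None, **form_fields):
    """Serialize one task in the JSON shape read by make_request_from_data.

    Keys other than url/method/meta stay at the top level, because the
    method quoted above passes whatever remains as `formdata`.
    """
    payload = {"url": url, "method": method.upper(), "meta": meta or {}}
    # Form values are sent as strings, so normalize them here.
    payload.update({key: str(value) for key, value in form_fields.items()})
    return json.dumps(payload)


task = build_task(
    "https://example.com/search",  # hypothetical endpoint
    method="post",
    meta={"job-id": "123xsd"},
    keyword="scrapy",              # hypothetical form fields
    page=1,
)
print(task)

# Pushing the task needs a running Redis server and the redis-py package:
# import redis
# redis.Redis().lpush("myspider:start_urls", task)  # hypothetical key name
```

Keeping the form fields at the top level of the JSON object mirrors how the quoted method consumes the message, so no change to the spider should be needed.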
2. scrapy-rabbitmq-scheduler source code analysis
Repository:
https://github.com/aox-lei/scrapy-rabbitmq-scheduler
python
class RabbitSpider(scrapy.Spider):
    def _make_request(self, mframe, hframe, body):
        try:
            request = request_from_dict(pickle.loads(body), self)
        except Exception as e:
            body = body.decode()
            request = scrapy.Request(body, callback=self.parse, dont_filter=True)
        return request
As shown, RabbitSpider subclasses scrapy.Spider and builds each request in _make_request. When we publish a POST request, request_from_dict(pickle.loads(body), self) fails and the spider reports:
bash
builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
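The error is not raised by pickle.loads itself. A pickle stream (protocol 2 and later) begins with the byte 0x80, which is not a valid UTF-8 start byte, so the message means that pickled binary data was decoded as UTF-8 text. That is what the except branch does: when request_from_dict raises for any reason, body.decode() runs on the still-pickled bytes and fails with UnicodeDecodeError.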
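The failure can be reproduced without Scrapy or RabbitMQ. The sketch below pickles a small task dict and then decodes the result as UTF-8, the way the except branch does; `decode_error_for` is a hypothetical helper written for this demonstration.

```python
import pickle


def decode_error_for(body: bytes) -> str:
    """Return the error message raised when `body` is decoded as UTF-8."""
    try:
        body.decode()
    except UnicodeDecodeError as exc:
        return str(exc)
    return ""


# A pickled task, like the message body a producer would publish.
pickled = pickle.dumps({"url": "https://example.com", "method": "POST"})

print(pickled[:1])                # first byte of the pickle stream
print(decode_error_for(pickled))  # message raised by bytes.decode()
```

A plain URL string sent as bytes decodes without error, which is why the fallback only works for messages that were never pickled.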
python
def _make_request(self, mframe, hframe, body):
    try:
        # Deserialize the message body
        data = pickle.loads(body)
        # Read the URL and the other request parameters
        url = data.get('url')
        method = data.get('method', 'GET').upper()  # defaults to GET; POST must be set explicitly
        headers = data.get('headers', {})
        cookies = data.get('cookies', {})
        body_data = data.get('body')  # e.g. the encoded form data of a POST request
        callback_str = data.get('callback')  # callback name (a string)
        errback_str = data.get('errback')  # errback name (a string)
        meta = data.get('meta', {})
        # Resolve the names to bound methods of this spider with getattr
        callback = getattr(self, callback_str, None) if callback_str else None
        errback = getattr(self, errback_str, None) if errback_str else None
        if callback is None:
            # Fall back to self.parse so that `request` is always assigned
            self.logger.warning(f"Callback '{callback_str}' not found, using self.parse.")
            callback = self.parse
        if method == 'POST':
            # FormRequest is meant for POST requests that carry form data
            request = scrapy.FormRequest(
                url=url,
                method='POST',
                headers=headers,
                cookies=cookies,
                body=body_data,  # request body
                callback=callback,
                errback=errback,
                meta=meta,
                dont_filter=True
            )
        else:
            # Everything else is handled as a plain (GET) request
            request = scrapy.Request(
                url=url,
                headers=headers,
                cookies=cookies,
                callback=callback,
                errback=errback,
                meta=meta,
                dont_filter=True
            )
    except Exception:
        # Not a pickled dict: treat the body as a plain URL string
        body = body.decode()
        request = scrapy.Request(body, callback=self.parse, dont_filter=True)
    return request
The callback read from the message is a string, not a function, so the matching method has to be looked up on the spider.
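The round trip can be sketched in plain Python, assuming the producer pickles a dict whose `callback` field holds a method name. `DemoSpider`, `build_message`, `resolve_callback` and the endpoint are hypothetical names for this illustration and are not part of either library.

```python
import pickle
from urllib.parse import urlencode


class DemoSpider:
    """Stand-in for a spider; only the callback lookup matters here."""

    def parse(self, response):
        return "parse"

    def parse_login(self, response):
        return "parse_login"


def build_message(url, callback, form=None):
    """Producer side: pickle a plain dict; the callback travels as a name."""
    task = {"url": url, "callback": callback}
    if form is not None:
        task["method"] = "POST"
        task["body"] = urlencode(form)  # form-encoded request body
    return pickle.dumps(task)


def resolve_callback(spider, name):
    """Consumer side: map the name back to a bound method, else use parse."""
    method = getattr(spider, name, None) if name else None
    return method if callable(method) else spider.parse


spider = DemoSpider()
message = build_message(
    "https://example.com/login",  # hypothetical endpoint
    "parse_login",
    form={"user": "demo", "password": "secret"},
)
data = pickle.loads(message)
callback = resolve_callback(spider, data["callback"])
print(data["method"], data["body"], callback.__name__)
```

Sending the name instead of the function keeps the message independent of the spider's code: only a string is pickled, and each consumer binds it to its own method.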
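Note: scrapy-rabbitmq-scheduler is no longer maintained and does not work with current Scrapy releases. The updated code above has been pushed to GitHub: https://github.com/tieyongjie/scrapy-rabbitmq-task
It can be installed directly with pip: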
bash
pip install scrapy-rabbitmq-task