I plan to write this crawler series in three parts:
✅ Part 1: how to scrape car model (series) information;
✅ Part 2: how to scrape car reviews;
✅ Part 3: how to attach sentiment-analysis results to the reviews, and a simple way to insert the data into MySQL;
The stack is the Scrapy framework, the BERT language model, and a MySQL database.
1 Creating the Project
Start with Scrapy's usual boilerplate and create the project from the command line:
```bash
scrapy startproject car_spider
```
Enter the directory and generate the spider:
```bash
cd car_spider
scrapy genspider car dongchedi.com
```
Scrapy now scaffolds the project skeleton automatically; just open it in your IDE.
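For reference, the generated spiders/car.py starts out roughly like this (the basic genspider template; exact contents vary slightly between Scrapy versions):
```python
# car_spider/spiders/car.py, roughly as generated by `scrapy genspider car dongchedi.com`
import scrapy


class CarSpider(scrapy.Spider):
    name = 'car'
    allowed_domains = ['dongchedi.com']
    start_urls = ['http://dongchedi.com/']

    def parse(self, response):
        pass
```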
2 Scraping Car Series with Pagination
The crawl strategy is to first fetch every car series from the series-list API, then request each car's details from the detail API.
The series-list endpoint is a POST request that takes pagination parameters, so it can be handled like this:
```python
import json

import scrapy

from car_spider.items import DongchediItem  # items.py from the project created in step 1


class CarSpider(scrapy.Spider):
    name = 'car'
    allowed_domains = ['dongchedi.com']
    start_url = 'https://www.dongchedi.com/motor/pc/car/brand/select_series_v2?aid=1839&app_name=auto_web_pc'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
    }

    def __init__(self, *args, **kwargs):
        super(CarSpider, self).__init__(*args, **kwargs)
        self.item_count = 0      # item counter
        self.limit = 20          # page size
        self.current_page = 1    # current page number

    def start_requests(self):
        """Send the first POST request."""
        formdata = {
            'page': str(self.current_page),
            'limit': str(self.limit),
        }
        yield scrapy.FormRequest(
            url=self.start_url,
            formdata=formdata,
            callback=self.parse
        )

    def parse(self, response):
        """Parse one page of the series list and queue detail requests."""
        data = json.loads(response.body)
        series_list = data['data']['series']
        series_count = data['data']['series_count']    # total number of series
        total_pages = (series_count // self.limit) + 1  # number of pages to request

        # Queue a detail request for every series on the current page
        for car in series_list:
            car_id = car['concern_id']
            detail_url = f'https://www.dongchedi.com/motor/car_page/m/v1/series_all_json/?series_id={car_id}&city_name=武汉&show_city_price=1&m_station_dealer_price_v=1'
            yield scrapy.Request(url=detail_url, callback=self.parse_detail)

        # If there are more pages, request the next one
        if self.current_page < total_pages:
            self.current_page += 1
            formdata = {
                'page': str(self.current_page),
                'limit': str(self.limit),
            }
            yield scrapy.FormRequest(
                url=self.start_url,
                formdata=formdata,
                callback=self.parse
            )
```
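One small detail: `(series_count // self.limit) + 1` requests one extra (empty) page whenever series_count is an exact multiple of limit. If you want the exact page count, a ceiling division avoids that; a minor variation, not strictly required:
```python
import math

total_pages = math.ceil(series_count / self.limit)  # exact page count, no trailing empty request
```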
3 Scraping Detailed Car Information
Once we have car['concern_id'], parse_detail uses it to process the detail response and hands the extracted data to the Item and pipeline:
```python
    def parse_detail(self, response):
        """Parse detailed information for one car series."""
        data = json.loads(response.body)
        series_all = data.get('data', {})
        cover_img = series_all.get('cover_url')
        brand_name = series_all.get('brand_name')
        series_id = series_all.get('series_id')
        online_models = series_all.get('online', [])

        for model in online_models:
            model_info = model['info']
            try:
                series_name = model_info['name']
                car_name = model_info['series_name']
                price_info = model_info['price_info']
                dealer_price = price_info.get('official_price', 'N/A')
                car_id = model_info.get('car_id')
                owner_price_summary = model_info.get('owner_price_summary', {})
                naked_price_avg = owner_price_summary.get('naked_price_avg', 'N/A')

                # Build the Item
                item = DongchediItem()
                item['name'] = f"{car_name}-{series_name}"
                item['dealer_price'] = dealer_price
                item['naked_price_avg'] = naked_price_avg
                item['brand_name'] = brand_name
                item['cover_img'] = cover_img
                item['series_id'] = series_id
                item['car_id'] = car_id

                self.item_count += 1  # increment the counter

                # Yield the item so the pipeline can save it to a database or file
                yield item
            except KeyError as e:
                # Skip models whose payload is missing an expected field
                self.logger.warning(f'Missing field {e} in model info, skipping')
```
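If fields from the list response also need to end up on the item (for example attributes that only the first endpoint returns), one option on reasonably recent Scrapy versions is to forward them with the detail request via cb_kwargs; a minimal sketch:
```python
# In parse(): forward the whole list entry to the detail callback
yield scrapy.Request(
    url=detail_url,
    callback=self.parse_detail,
    cb_kwargs={'list_info': car},
)

# parse_detail then receives it as an extra keyword argument
def parse_detail(self, response, list_info=None):
    ...
```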
4 Reporting Statistics When the Crawl Finishes
When the spider closes, log how many items were scraped in total; the message is printed in purple here:
```python
    def closed(self, reason):
        """Called when the spider finishes."""
        purple_text = f"\033[95mScraped {self.item_count} items in total\033[0m"
        self.logger.info(purple_text)
```
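Scrapy also keeps its own count of scraped items, so as an alternative to the manual item_count you could read the built-in stats collector in closed(); a sketch:
```python
    def closed(self, reason):
        """Called when the spider finishes."""
        # item_scraped_count is maintained by Scrapy's stats collector
        scraped = self.crawler.stats.get_value('item_scraped_count', 0)
        self.logger.info(f"\033[95mScraped {scraped} items in total\033[0m")
```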
5 Items and Pipeline
This endpoint actually exposes far more information than we use; the Item here is kept simple, and the pipeline logic is also stripped down to just printing the item:
items.py
```python
import scrapy


class DongchediItem(scrapy.Item):
    name = scrapy.Field()
    dealer_price = scrapy.Field()
    naked_price_avg = scrapy.Field()
    brand_name = scrapy.Field()
    cover_img = scrapy.Field()
    series_id = scrapy.Field()
    car_id = scrapy.Field()
```
pipelines.py (remember to enable it in settings.py; see the snippet after this block)
```python
class DongchediPipeline:
    def process_item(self, item, spider):
        # Persistence logic (database or file) can go here
        print(item)
        return item
```
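Enabling the pipeline in settings.py looks like this (assuming the project module is car_spider, as created in step 1; the number sets the pipeline's order, lower runs first):
```python
# settings.py
ITEM_PIPELINES = {
    'car_spider.pipelines.DongchediPipeline': 300,
}
```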
6 Running Results
After a full run, the spider scraped data for 12,565 cars.
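For reference, a run like this is started from the project directory; the -o option is optional and simply dumps the yielded items to a feed file (the file name here is only an example):
```bash
scrapy crawl car -o cars.json
```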