Project: scraping the Douban Movie Top250
Requirements:
1. Use Scrapy to crawl the Douban Movie Top250 data (rank, title, director, actors, URL, score).
2. Crawl all ten pages of the list (pagination; see the URL sketch below).
3. Save the scraped data to a database.
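For orientation, the Top250 list shows 25 movies per page and pages via a start query parameter, so the ten listing pages follow a simple pattern. A small sketch (the exact query string is an assumption based on how the site's "next" links are formatted):

base = 'https://movie.douban.com/top250?start={}&filter='
# Ten pages, 25 movies each: start = 0, 25, 50, ..., 225
pages = [base.format(n * 25) for n in range(10)]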
Analysis:
1. In the browser's developer tools, find the request that actually returns the list data and copy its request URL.
Preparation in settings.py: enable the item pipeline, disable robots.txt compliance, and spoof a browser User-Agent:
ITEM_PIPELINES = { 'doubanbook.pipelines.DoubanbookPipeline': 300, }
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
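The article only shows the parse method below, not the spider class that wraps it. A minimal sketch of that wrapper (the class name and allowed_domains are assumptions; the spider name 'read' matches the crawl command used at the end of the article):

import scrapy

class DoubanSpider(scrapy.Spider):
    # Spider name 'read' is taken from the crawl command at the end of this article
    name = 'read'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # The parsing and pagination code from steps 2 and 3 goes here
        pass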
2. Parse the data (e.g. the title) in the spider's parse() method:
def parse(self, response):
    # Select each movie entry with XPath; the exact expressions depend on the page structure
    for box in response.xpath('//ol[@class="grid_view"]/li'):
        # Create a fresh item container per entry (re-using one item across
        # iterations would carry fields such as Actor over to the next movie)
        item = DoubanbookItem()
        item['Rank'] = box.xpath('.//div[@class="pic"]/em/text()').extract()[0]
        item['Name'] = box.xpath('.//div[@class="info"]/div[1]/a/span[1]/text()').extract()[0].strip().replace("\n", "").replace(" ", "")
        # The first text line under div.bd holds the director and the leading actors
        s = box.xpath('.//div[@class="bd"]/p/text()').extract()[0].strip().replace(" ", "")
        item['Author'] = s.split()[0]
        # Some entries list no actors; default to an empty string so the later DB insert never fails
        if len(s.split()) > 1:
            item['Actor'] = s.split()[1]
        else:
            item['Actor'] = ''
        item['Score'] = box.xpath('.//div[@class="star"]/span[2]/text()').extract()[0].strip()
        # extract() returns a list; take the first match so a plain string is stored
        item['Url'] = box.xpath('.//div[@class="pic"]/a/@href').extract()[0]
        yield item
In items.py:
import scrapy

class DoubanbookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Rank = scrapy.Field()
    Name = scrapy.Field()
    Author = scrapy.Field()
    Actor = scrapy.Field()
    Score = scrapy.Field()
    Url = scrapy.Field()
3. Get the next-page link (added at the end of parse()):
# Get the URL of the next page
next_url = response.xpath('//span[@class="next"]/link/@href').extract()
if next_url:
    # Join it with the base address to form the full next-page URL
    page = 'https://movie.douban.com/top250' + next_url[0]
    yield scrapy.Request(page, callback=self.parse)
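As a side note (not part of the original code), response.urljoin can build the same absolute URL without hard-coding the base address, since the extracted href is relative (e.g. ?start=25&filter=). A sketch using the same XPath as above:

next_url = response.xpath('//span[@class="next"]/link/@href').extract()
if next_url:
    # urljoin resolves the relative href against the URL of the page just parsed
    yield scrapy.Request(response.urljoin(next_url[0]), callback=self.parse)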
4. Save the items to the database (pipelines.py):
import pymysql

class DoubanbookPipeline(object):
    def __init__(self):
        # Runs once when the crawl starts: open the database connection
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            passwd='wx990826',
            db='douban',
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Called for every item yielded by the spider
        sqli = "insert into movie(ranks,title,author,actor,score,url) values(%s,%s,%s,%s,%s,%s)"
        self.cur.execute(sqli, (
            item['Rank'], item['Name'], item['Author'], item['Actor'], item['Score'], item['Url']))
        self.conn.commit()
        return item
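The pipeline assumes the douban database and a movie table already exist; the article never shows the schema. A one-off setup sketch matching the insert statement above (the column types are assumptions, only the column names are fixed by the pipeline):

import pymysql

# Create the target table before the first crawl (column types are guesses)
conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='wx990826', db='douban')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS movie (
        id INT AUTO_INCREMENT PRIMARY KEY,
        ranks INT,
        title VARCHAR(255),
        author VARCHAR(255),
        actor VARCHAR(255),
        score VARCHAR(16),
        url VARCHAR(255)
    )
""")
conn.commit()
conn.close()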
Run the project:
from scrapy import cmdline

# Equivalent to running "scrapy crawl read --nolog" in the project directory
cmdline.execute(['scrapy', 'crawl', 'read', '--nolog'])
Run result: