【头歌】Scrapy Crawler (2): Scraping Data from Popular Websites

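Level 1: Scraping the 猫眼 (Maoyan) Movies TOP100 Board

Task: scrape the information for all 100 movies on the 猫眼 TOP100 board and save it to a local MySQL database.

Background

To complete this level, you need to know:

MySQL basics (assumed to be familiar already);

What the settings in Scrapy's settings.py file mean;

Scraping content that spans multiple pages (pagination);

Data matching in detail.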
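
For the settings.py point in the list above, a minimal sketch of the entries this level is likely to need follows. The pipeline path `maoyan.pipelines.MaoyanPipeline` follows from the file layout used in this post; the values chosen for `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` are assumptions for a local practice server, not settings taken from the exercise.

```python
# settings.py (excerpt): a sketch, not the exercise's official configuration

BOT_NAME = 'maoyan'
SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'

# Assumption: the local practice server has no robots.txt rules to honour
ROBOTSTXT_OBEY = False

# Assumption: a short pause between requests; 0 would also work locally
DOWNLOAD_DELAY = 0.5

# Items yielded by the spider only reach the pipelines listed here.
# The number (0-1000) sets the order: lower values run first.
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
```

If `ITEM_PIPELINES` is left commented out, the spider still runs but `process_item` is never called, so nothing is written to MySQL.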

step1/maoyan/maoyan/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MaoyanItem(scrapy.Item):

    #********** Begin **********#
    name = scrapy.Field()         # movie title
    starts = scrapy.Field()       # leading actors (field name kept as in the exercise)
    releasetime = scrapy.Field()  # release date
    score = scrapy.Field()        # rating
    #********** End **********#

step1/maoyan/maoyan/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class MaoyanPipeline(object):
    def process_item(self, item, spider):
        #********** Begin **********#
        # 1. Connect to the database
        connection = pymysql.connect(
            host='localhost',   # local database server
            port=3306,          # database port
            user='root',        # your MySQL user name
            passwd='123123',    # your MySQL password
            db='mydb',          # database name
            charset='utf8',     # connection character set
        )
        # 2. Create the table if needed, insert the row, close the connection, then return the item
        try:
            with connection.cursor() as cursor:
                sql1 = 'Create Table If Not Exists mymovies(name varchar(50) CHARACTER SET utf8 NOT NULL,starts text CHARACTER SET utf8 NOT NULL,releasetime varchar(50) CHARACTER SET utf8 DEFAULT NULL,score varchar(20) CHARACTER SET utf8 NOT NULL,PRIMARY KEY(name))'
                # Insert one movie; the driver fills in and escapes the %s placeholders
                sql2 = 'Insert into mymovies values (%s, %s, %s, %s)'
                cursor.execute(sql1)
                cursor.execute(sql2, (item['name'], item['starts'], item['releasetime'], item['score']))
            # Commit the inserted row
            connection.commit()
        finally:
            # Always close the connection, even if a statement failed
            connection.close()
        # Returning outside finally lets database errors surface instead of being swallowed
        return item
        #********** End **********#
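
The pipeline above opens and closes a database connection for every item. Scrapy pipelines can also define `open_spider` and `close_spider`, which run once per crawl, so one connection can serve all 100 items. The sketch below shows that shape. To keep it runnable without a MySQL server it uses the standard-library `sqlite3` module as a stand-in for `pymysql` (an assumption for illustration only); the class name `MoviesPipelineSketch` and the sample movie are made up.

```python
import sqlite3


class MoviesPipelineSketch:
    """Keeps one connection for the whole crawl instead of one per item."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        # Assumption: in-memory sqlite3 stands in for pymysql.connect(...).
        self.connection = sqlite3.connect(':memory:')
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS mymovies('
            'name TEXT PRIMARY KEY, starts TEXT, releasetime TEXT, score TEXT)'
        )

    def process_item(self, item, spider):
        # Values travel separately from the SQL text, so a quote inside a
        # title cannot break the statement. sqlite3 writes placeholders as ?;
        # pymysql writes them as %s in the same position.
        self.connection.execute(
            'INSERT INTO mymovies VALUES (?, ?, ?, ?)',
            (item['name'], item['starts'], item['releasetime'], item['score']),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.connection.close()


# Usage with a made-up item whose title contains a single quote
pipeline = MoviesPipelineSketch()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {'name': "Director's Cut", 'starts': 'Some Actor',
     'releasetime': '2000-01-01', 'score': '9.0'},
    spider=None,
)
rows = pipeline.connection.execute('SELECT name, score FROM mymovies').fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

The same title would produce invalid SQL if it were pasted into the statement with Python string formatting, which is the main reason to pass values as parameters.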

step1/maoyan/maoyan/spiders/movies.py

# -*- coding: utf-8 -*-
import scrapy
from maoyan.items import MaoyanItem

class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['127.0.0.1']
    offset = 0
    url = "http://127.0.0.1:8080/board/4?offset="
    #********** Begin **********#
    # 1. Build the start URL from the offset so that later pages can be requested
    start_urls = [url + str(offset)]

    # 2. Define the parse() callback
    def parse(self, response):
        movies = response.xpath("//div[@class='board-item-content']")
        for each in movies:
            # Create a fresh item per movie so yielded items never share state
            item = MaoyanItem()
            # Movie title
            item['name'] = each.xpath(".//div/p/a/text()").extract()[0]
            # Leading actors
            item['starts'] = each.xpath(".//div[1]/p/text()").extract()[0]
            # Release date
            item['releasetime'] = each.xpath(".//div[1]/p[3]/text()").extract()[0]
            # Rating: the integer and fractional parts sit in two separate <i> tags
            score1 = each.xpath(".//div[2]/p/i[1]/text()").extract()[0]
            score2 = each.xpath(".//div[2]/p/i[2]/text()").extract()[0]
            item['score'] = score1 + score2
            yield item

        # 3. After a page is parsed, advance offset by 10 and request the next page
        if self.offset < 90:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

    #********** End **********#
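
The spider above pages through the board by adding 10 to `offset` until it reaches 90. The arithmetic can be checked on its own with a short sketch. The helper name `page_urls` and its parameters are hypothetical; the base URL is the one used in the spider.

```python
BASE_URL = "http://127.0.0.1:8080/board/4?offset="
PAGE_SIZE = 10   # movies shown per page
TOTAL = 100      # size of the TOP100 board


def page_urls(base_url=BASE_URL, page_size=PAGE_SIZE, total=TOTAL):
    """Return the URL of every board page, in order."""
    return [base_url + str(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # http://127.0.0.1:8080/board/4?offset=0
print(urls[-1])    # http://127.0.0.1:8080/board/4?offset=90
```

Since all ten URLs are known in advance, a list like this could also be assigned to `start_urls` directly, as an alternative to chaining requests from `parse()`. Chaining remains the better fit when the number of pages is only discovered while crawling.
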
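
Level 2: Scraping Novels from the First Page of a Novel Site's Fantasy Category

Task: scrape 3 novels from the target page and save them to a local MySQL database. The target page is the first page of the fantasy (玄幻) category on 全书网.

To complete this level, you need to know:

XPath matching: looping over the content of repeated tags;

Handling multiple item classes;

Scraping data from second-level pages.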
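
Looping over repeated tags usually takes two steps: one query that selects every repeated node, then a relative query run against each node inside the loop. The sketch below shows the pattern with the standard-library `xml.etree.ElementTree`, used here as a stand-in for Scrapy selectors so that it runs without a crawl (an assumption for illustration). The markup is made up and not copied from the target site.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics a chapter list
PAGE = """
<html><body>
  <div class="dirconone">
    <li><a href="/1.html">Chapter 1</a></li>
    <li><a href="/2.html">Chapter 2</a></li>
    <li><a href="/3.html">Chapter 3</a></li>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Step 1: a single query selects every repeated <li> node
rows = root.findall(".//div[@class='dirconone']/li")

# Step 2: inside the loop, query relative to the current node only
titles = [row.findtext("a") for row in rows]
print(titles)   # ['Chapter 1', 'Chapter 2', 'Chapter 3']
```

With Scrapy selectors the relative step is written with a leading dot, as in `each.xpath('.//a/text()')`. Without the dot, `//a` searches the whole document again and every iteration returns the same first match.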

step2/NovelProject/NovelProject/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
# Holds the summary information for each novel
class NovelprojectItem(scrapy.Item):
    #********** Begin **********#
    name = scrapy.Field()
    author = scrapy.Field()
    state = scrapy.Field()
    description = scrapy.Field()
    
    
    #********** End **********#

# Holds one chapter title of a novel
class NovelprojectItem2(scrapy.Item):
    #********** Begin **********#
    tablename = scrapy.Field()           # used to name the chapter table
    title = scrapy.Field()
    
    #********** End **********#

step2/NovelProject/NovelProject/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from NovelProject.items import NovelprojectItem, NovelprojectItem2


class NovelprojectPipeline(object):
    def process_item(self, item, spider):

        #********** Begin **********#

        # 1. Connect to the local database mydb
        connection = pymysql.connect(
            host='localhost',    # local database server
            port=3306,           # port number
            user='root',         # your MySQL user name
            passwd='123123',     # your MySQL password
            db='mydb',           # database name
            charset='utf8',      # connection character set
        )

        try:
            with connection.cursor() as cursor:
                # 2. Handle items of type NovelprojectItem (novel information)
                if isinstance(item, NovelprojectItem):
                    sql1 = 'Create Table If Not Exists novel(name varchar(20) CHARACTER SET utf8 NOT NULL,author varchar(10) CHARACTER SET utf8,state varchar(20) CHARACTER SET utf8,description text CHARACTER SET utf8,PRIMARY KEY (name))'
                    # The driver fills in and escapes the %s placeholders
                    sql2 = 'Insert into novel values (%s, %s, %s, %s)'
                    cursor.execute(sql1)
                    cursor.execute(sql2, (item['name'], item['author'], item['state'], item['description']))

                # 3. Handle items of type NovelprojectItem2 (chapter titles)
                elif isinstance(item, NovelprojectItem2):
                    # A table name cannot be passed as a query parameter, so quote it
                    # with backticks and double any backtick it contains
                    tablename = item['tablename'].replace('`', '``')
                    sql3 = 'Create Table If Not Exists `%s`(title varchar(20) CHARACTER SET utf8 NOT NULL,PRIMARY KEY (title))' % tablename
                    sql4 = 'Insert into `%s` values (%%s)' % tablename
                    cursor.execute(sql3)
                    cursor.execute(sql4, (item['title'],))
            # Commit the inserted row
            connection.commit()
        finally:
            # Always close the connection, whichever branch ran
            connection.close()
        # Return the item so that any later pipeline can process it
        return item

        #********** End **********#
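
The chapter tables above are named after text scraped from the page, and a table name cannot be supplied as a query parameter the way a value can. A small helper can quote it instead. The sketch below is one way to do that under MySQL's rule that a backtick inside a backtick-quoted identifier is written twice; the function names `quote_identifier` and `chapter_table_sql` are hypothetical.

```python
def quote_identifier(name):
    """Quote a string for use as a MySQL table or column name."""
    if not name:
        raise ValueError('identifier must not be empty')
    # Inside backticks, a literal backtick is written as two backticks
    return '`' + name.replace('`', '``') + '`'


def chapter_table_sql(tablename):
    """Build the CREATE TABLE statement for one novel's chapter table."""
    return (
        'CREATE TABLE IF NOT EXISTS %s('
        'title VARCHAR(20) CHARACTER SET utf8 NOT NULL, PRIMARY KEY (title))'
        % quote_identifier(tablename)
    )


print(quote_identifier('my novel'))
print(chapter_table_sql('my novel'))
```

Quoting also lets names that contain spaces or that collide with reserved words work as table names, which unquoted scraped titles may not.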

step2/NovelProject/NovelProject/spiders/novel.py

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.http import Request
from NovelProject.items import NovelprojectItem
from NovelProject.items import NovelprojectItem2

class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['127.0.0.1']
    start_urls = ['http://127.0.0.1:8000/list/1_1.html']   # first page of the fantasy/magic category on 全书网

    #********** Begin **********#
    # 1. Collect each book's URL from its "马上阅读" (Read now) link
    def parse(self, response):
        book_urls = response.xpath('//li/a[@class="l mr10"]/@href').extract()
        three_book_urls = book_urls[0:3]  # keep only the first 3 books
        for book_url in three_book_urls:
            yield Request(book_url, callback=self.parse_read)

    # 2. On the novel's summary page, extract its information and yield it to the pipeline,
    #    then follow the "开始阅读" (Start reading) link to the chapter list
    def parse_read(self, response):
        item = NovelprojectItem()
        # Novel title
        name = response.xpath('//div[@class="b-info"]/h1/text()').extract_first()
        # Novel description
        description = response.xpath('//div[@class="infoDetail"]/div/text()').extract_first()
        # Serialization status (ongoing or finished)
        state = response.xpath('//div[@class="bookDetail"]/dl[1]/dd/text()').extract_first()
        # Author name
        author = response.xpath('//div[@class="bookDetail"]/dl[2]/dd/text()').extract_first()
        item['name'] = name
        item['description'] = description
        item['state'] = state
        item['author'] = author
        yield item
        # Get the URL behind the "Start reading" button and go to the chapter list
        read_url = response.xpath('//a[@class="reader"]/@href').extract()[0]
        yield Request(read_url, callback=self.parse_info)

    # 3. On the chapter-list page, yield one item per chapter title
    def parse_info(self, response):
        tablename = response.xpath('//div[@class="main-index"]/a[3]/text()').extract_first()
        titles = response.xpath('//div[@class="clearfix dirconone"]/li')
        for each in titles:
            # Create a fresh item per chapter so yielded items never share state
            item = NovelprojectItem2()
            item['tablename'] = tablename
            item['title'] = each.xpath('.//a/text()').extract_first()
            yield item
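
The spider above passes each extracted `href` straight to `Request`, which only works when the page contains absolute URLs. If a page used relative links instead, each one would first have to be resolved against the URL of the page it came from. The sketch below shows that resolution with the standard library; the three `href` values are hypothetical examples, not links taken from the target site.

```python
from urllib.parse import urljoin

page_url = 'http://127.0.0.1:8000/list/1_1.html'

# Hypothetical href values: root-relative, page-relative, and absolute
hrefs = ['/book/1.html', 'book/2.html', 'http://127.0.0.1:8000/book/3.html']

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
# http://127.0.0.1:8000/book/1.html
# http://127.0.0.1:8000/list/book/2.html
# http://127.0.0.1:8000/book/3.html
```

Inside a callback, Scrapy's `response.urljoin(href)` resolves a link against the current page in essentially the same way, and an absolute URL passes through unchanged, so calling it is harmless when the links are already absolute.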

The test run prints the number of tables in your mydb database; there should be 4 tables. The expected output is:
