Create the CrawlSpider spider file:
bash
scrapy genspider -t crawl <spider_name> <domain_to_crawl>
scrapy genspider -t crawl read https://www.dushu.com/book/1206.html
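LinkExtractor is the link extractor. It tells the spider which links to pull out of each crawled page, and every extracted link is automatically turned into a Request object.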
python
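from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# ScarpyReadbook41Item is defined in the project's items.py; import it from there.


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1206_1.html"]

    # The LinkExtractor decides which links are taken from each crawled page;
    # every extracted link is automatically turned into a Request.
    rules = (
        Rule(LinkExtractor(allow=r"/book/1206_\d+\.html"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        # Each cover <img> carries the book title in @alt and the image URL in @data-original.
        name_list = response.xpath('//div[@class="book-info"]//img/@alt')
        src_list = response.xpath('//div[@class="book-info"]//img/@data-original')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            src = src_list[i].extract()
            book = ScarpyReadbook41Item(name=name, src=src)
            yield book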
Enable the pipeline and write the items to a file.
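Enabling a pipeline in Scrapy is done through the `ITEM_PIPELINES` setting in the project's `settings.py`. The fragment below is only a sketch: the package name `scarpy_readbook_41` is an assumption guessed from the class names in these notes, so replace it with the actual project package.

```python
# settings.py (fragment)
# "scarpy_readbook_41" is an assumed package name, not taken from the notes.
ITEM_PIPELINES = {
    "scarpy_readbook_41.pipelines.ScarpyReadbook41Pipeline": 300,
}
```

The number is the pipeline's priority: when several pipelines are enabled, lower values run first.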
python
class ScarpyReadbook41Pipeline:
    def open_spider(self, spider):
        self.fp = open('books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
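One thing to note about the pipeline above: `str(item)` writes the item's Python representation, so `books.json` will not be parseable as JSON. A possible alternative, shown here only as a sketch, writes one JSON object per line. The class name `JsonLinesBookPipeline`, the `path` parameter, and the `books.jsonl` file name are all hypothetical, and the code assumes each item can be converted with `dict()`.

```python
import json


class JsonLinesBookPipeline:
    """Hypothetical variant of the pipeline above: one JSON object per line."""

    def __init__(self, path="books.jsonl"):
        # The output path is an assumption; the original pipeline writes to books.json.
        self.path = path

    def open_spider(self, spider):
        # Open the file once when the spider starts.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item instances and for plain dicts.
        # ensure_ascii=False keeps Chinese titles readable in the output file.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + "\n")
        return item

    def close_spider(self, spider):
        # Close the file once when the spider finishes.
        self.fp.close()
```

Each line of the resulting file can then be loaded back with `json.loads`.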
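After running the spider, the first page's data is missing.
The start URL in start_urls needs the _1 suffix, otherwise the first page is not read. The plain /book/1206.html URL does not match the rule's allow pattern, which only accepts /book/1206_<number>.html.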
python
start_urls = ["https://www.dushu.com/book/1206_1.html"]
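To see which URLs the rule's pattern accepts, it can be checked with the standard `re` module. This is only a sketch: it assumes that `LinkExtractor` applies `allow` as a regular-expression search against the URL, and the third URL is a made-up page number for illustration.

```python
import re

# Same pattern as the allow= argument in the spider's rule.
ALLOW = re.compile(r"/book/1206_\d+\.html")


def matches_rule(url: str) -> bool:
    """Return True if the pattern is found anywhere in the URL (search, not full match)."""
    return ALLOW.search(url) is not None


urls = [
    "https://www.dushu.com/book/1206.html",     # no _N suffix
    "https://www.dushu.com/book/1206_1.html",   # first page with suffix
    "https://www.dushu.com/book/1206_13.html",  # hypothetical later page
]
for url in urls:
    print(url, matches_rule(url))
```

Only the two URLs with the `_N` suffix pass the check, which is consistent with the first page being skipped when the start URL lacks `_1`.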