Installing Scrapy
See the CSDN blog post "Python实现一个简单的爬虫程序(爬取图片)" (a simple Python image crawler).
Creating the spider project
Create the Scrapy project:
bash
scrapy startproject test_spider
Create the spider file:
bash
>cd test_spider\test_spider\spiders
>scrapy genspider doubanSpider movie.douban.com
Writing the spider
Analyze the URL:
bash
https://movie.douban.com/top250?start=25&filter=
Here start=25 is the pagination offset. There are 10 pages with 25 movie records each, so start takes the values 0, 25, 50, ..., 225.
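The pagination rule can be checked by generating the offsets directly. This is a minimal sketch; `page_urls`, `BASE_URL`, and the `total` and `page_size` parameters are illustrative names that do not appear in the spider itself.

```python
# Hypothetical helper: build the list-page URLs from the pagination rule.
BASE_URL = "https://movie.douban.com/top250?start={}&filter="


def page_urls(total=250, page_size=25):
    """Return one URL per page; start takes 0, 25, ..., total - page_size."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # first page, start=0
print(urls[-1])    # last page, start=225
```

Note that the stop value of `range` has to exceed the last offset (225) for all 10 pages to be produced.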
python
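# Inside start_requests(): request every page of the Top 250 list.
# start = 0, 25, ..., 225: 10 pages of 25 movies each.
for i in range(0, 250, 25):
    req = "https://movie.douban.com/top250?start={}&filter=".format(i)
    yield scrapy.Request(url=req, meta={'url': req}, headers=self.headers, callback=self.parse)
    # Crude throttle; note that time.sleep() blocks Scrapy's event loop.
    time.sleep(2)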
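Because `time.sleep` pauses the whole Scrapy process, Scrapy's own throttling settings are a possible alternative. The fragment below is a sketch of what could be added to the project's `settings.py` (the file generated by `scrapy startproject`). The values are assumptions chosen to mirror the 2-second pause, not settings taken from the original project.

```python
# settings.py (sketch): let Scrapy space out requests instead of calling time.sleep().
DOWNLOAD_DELAY = 2                  # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # at most one request in flight per domain
```

With these in place, the `time.sleep(2)` call in the request loop would no longer be needed.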
Extract the movie URL, Chinese title, and foreign-language title:
python
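# Inside parse(): parse the page and walk through the movie entries.
html_data = response.body
sp = BeautifulSoup(html_data, 'html.parser')
items = sp.find(class_='grid_view').find_all('li')  # one <li> per movie
for one in items:
    # URL of the movie's detail page
    link = one.find(class_='info').find(class_='hd').find('a')['href']
    print(link)
    # The first .title element is the Chinese title; a second one, if present, is the foreign title.
    titles = one.find_all(class_='title')
    # Commas become spaces so a title cannot break the comma-separated row.
    title_zh = titles[0].text.strip().replace(',', ' ')
    title_en = ''
    if len(titles) > 1:
        # Drop the leading "/" separator and the whitespace left around it.
        title_en = titles[1].text.strip().replace(',', ' ').lstrip('/').strip()
    print(title_zh, title_en)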
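The title clean-up can be tried without any network access. The sketch below applies the same string operations to raw element texts. `clean_titles` and the sample strings are illustrative assumptions; in particular, the foreign-title text is assumed to begin with a "/" separator padded by non-breaking spaces.

```python
def clean_titles(raw_titles):
    """Return (title_zh, title_en) from the text of the .title elements."""
    title_zh = raw_titles[0].strip().replace(',', ' ')
    title_en = ''
    if len(raw_titles) > 1:
        # Remove the leading "/" separator, then any whitespace left behind it.
        title_en = raw_titles[1].strip().replace(',', ' ').lstrip('/').strip()
    return title_zh, title_en


# Made-up sample inputs; \xa0 is a non-breaking space.
print(clean_titles(['中文名', '\xa0/\xa0Foreign Title,Part 2']))
print(clean_titles(['中文名']))
```

The final `.strip()` matters because `lstrip('/')` removes only the slash and leaves the padding that followed it.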
Extract the director and cast information:
python
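# Still inside the loop over movie entries: the first <p> under .bd carries the director and cast text.
bd = one.find(class_='info').find(class_='bd')
# Remove line breaks and replace commas so the text fits in one CSV field.
p1 = bd.find_all('p')[0].text.strip().replace('\n', '').replace('\r', '').replace(',', ' ')
print(p1)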
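Removing only the line-break characters may leave long runs of spaces inside the director and cast text. One option is to collapse all whitespace with a regular expression. This is a sketch under that assumption; `normalize_info` is a hypothetical helper and the sample string is made up.

```python
import re


def normalize_info(text):
    """Turn commas into spaces, then collapse every run of whitespace into one space."""
    # In Python 3, \s also matches non-breaking spaces in str patterns.
    return re.sub(r'\s+', ' ', text.replace(',', ' ')).strip()


# Made-up sample with indentation, line breaks, and non-breaking spaces.
sample = "\n      导演: A\xa0\xa0\xa0主演: B, C\n      1994\xa0/\xa0Drama\n   "
print(normalize_info(sample))
```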
Extract the rating information:
python
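# Still inside the loop over movie entries: the <span> elements under .star hold the rating data.
spans = bd.find(class_='star').find_all('span')
score = spans[1].text  # rating score
num = spans[3].text.replace('人评价', '')  # strip the "人评价" (people rated) suffix, leaving the count
print(score, num)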
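If the score and the number of ratings are needed as numbers rather than strings, they can be converted after extraction. A minimal sketch, assuming the count text has the form "N人评价"; `parse_rating` is a hypothetical helper and the sample values are made up.

```python
import re


def parse_rating(score_text, count_text):
    """Convert the score text and the 'N人评价' text to (float, int)."""
    score = float(score_text.strip())
    match = re.search(r'\d+', count_text)
    # Fall back to 0 when no digits are found.
    count = int(match.group()) if match else 0
    return score, count


print(parse_rating('9.0', '12345人评价'))  # made-up values
```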
Write to a CSV file:
python
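# Once, before crawling: write the header row
# (URL, Chinese title, foreign title, director, score, number of ratings).
with open('movies.csv', 'a+', encoding='utf-8') as f:
    f.write('网址,中文名,外文名,导演,评分,评价人数\n')
# For every movie, inside the parse loop: append one row.
with open('movies.csv', 'a+', encoding='utf-8') as f:
    f.write('{},{},{},{},{},{}\n'.format(link, title_zh, title_en, p1, score, num))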
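Because the file is opened in append mode, every run adds another header line. An alternative is Python's standard `csv` module, which also quotes fields containing commas, so the commas would not have to be replaced beforehand. The sketch below is one way to do that; `append_row`, `HEADER`, the demo path, and the sample rows are all illustrative assumptions.

```python
import csv
import os
import tempfile

HEADER = ['网址', '中文名', '外文名', '导演', '评分', '评价人数']


def append_row(path, row):
    """Append one row, writing the header first only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings.
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(HEADER)
        writer.writerow(row)


# Demo with made-up rows written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'movies_demo.csv')
append_row(demo_path, ['https://example.com/1', 'A, B', '', 'x', '9.0', '100'])
append_row(demo_path, ['https://example.com/2', 'C', 'D', 'y', '8.5', '50'])

with open(demo_path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(len(rows))  # 3: one header row plus two data rows
```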
Complete code:
python
import scrapy
from bs4 import BeautifulSoup
import time


class DoubanspiderSpider(scrapy.Spider):
    name = "doubanSpider"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com"]
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36', 'Cookie': ''}

    def start_requests(self):
        # Write the CSV header (append mode, so each run adds a new header line).
        with open('movies.csv', 'a+', encoding='utf-8') as f:
            f.write('网址,中文名,外文名,导演,评分,评价人数\n')
        # start = 0, 25, ..., 225: 10 pages of 25 movies each.
        for i in range(0, 250, 25):
            req = "https://movie.douban.com/top250?start={}&filter=".format(i)
            yield scrapy.Request(url=req, meta={'url': req}, headers=self.headers, callback=self.parse)
            # Crude throttle; note that time.sleep() blocks Scrapy's event loop.
            time.sleep(2)

    def parse(self, response):
        print("========= parse ==============")
        html_data = response.body
        sp = BeautifulSoup(html_data, 'html.parser')
        items = sp.find(class_='grid_view').find_all('li')  # one <li> per movie
        for one in items:
            # URL of the movie's detail page
            link = one.find(class_='info').find(class_='hd').find('a')['href']
            print(link)
            # Chinese title first; foreign title second, if present.
            titles = one.find_all(class_='title')
            title_zh = titles[0].text.strip().replace(',', ' ')
            title_en = ''
            if len(titles) > 1:
                # Drop the leading "/" separator and the whitespace left around it.
                title_en = titles[1].text.strip().replace(',', ' ').lstrip('/').strip()
            print(title_zh, title_en)
            # Director and cast text
            bd = one.find(class_='info').find(class_='bd')
            p1 = bd.find_all('p')[0].text.strip().replace('\n', '').replace('\r', '').replace(',', ' ')
            print(p1)
            # Rating score and number of ratings
            spans = bd.find(class_='star').find_all('span')
            score = spans[1].text
            num = spans[3].text.replace('人评价', '')
            print(score, num)
            # Append one row per movie.
            with open('movies.csv', 'a+', encoding='utf-8') as f:
                f.write('{},{},{},{},{},{}\n'.format(link, title_zh, title_en, p1, score, num))
Run the spider:
scrapy crawl doubanSpider
The generated CSV file looks like this: