Table of Contents
- [1. Link extraction](#1. Link extraction)
- [2. Site-wide crawling with CrawlSpider](#2. Site-wide crawling with CrawlSpider)
- [3. Basic Redis usage](#3. Basic Redis usage)
To add a crawl delay, uncomment DOWNLOAD_DELAY = 3 in the project's settings.py.
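A minimal sketch of the relevant part of settings.py is shown here. DOWNLOAD_DELAY = 3 comes from the note above; RANDOMIZE_DOWNLOAD_DELAY is a further Scrapy setting included as an assumption about what you might want, not something the original project is known to set.

```python
# settings.py (excerpt)

# Wait about 3 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 3

# Assumption: keep Scrapy's default jitter, which waits a random
# 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval.
RANDOMIZE_DOWNLOAD_DELAY = True
```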
1. Link extraction
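python
# Import the link extractor
from scrapy.linkextractors import LinkExtractor

def parse(self, resp, **kwargs):
    # Only extract links from the anchors inside the listing <ul>
    le = LinkExtractor(restrict_xpaths=('//ul[@class="viewlist_ul"]/li/a',))
    links = le.extract_links(resp)

'''
LinkExtractor constructor parameters
def __init__(
    self,
    allow=(),  # regular expressions; matching URLs are extracted
    deny=(),  # regular expressions; matching URLs are not extracted
    allow_domains=(),  # domains from which links may be extracted
    deny_domains=(),
    restrict_xpaths=(),  # extract links only from regions selected by these XPaths
    tags=("a", "area"),
    attrs=("href",),
    canonicalize=False,
    unique=True,
    process_value=None,
    deny_extensions=None,
    restrict_css=(),
    strip=True,
    restrict_text=None,
):
'''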
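To make the allow, deny, and allow_domains parameters more concrete, here is a simplified standard-library sketch of the filtering they describe. It is not Scrapy's implementation: the helper name link_allowed and the sample URLs are hypothetical, and it only assumes that the patterns are applied with a regex search and that deny takes precedence over allow.

```python
import re
from urllib.parse import urlparse


def link_allowed(url, allow=(), deny=(), allow_domains=()):
    """Simplified stand-in for the allow / deny / allow_domains checks."""
    host = urlparse(url).hostname or ""
    # allow_domains: the host must equal a listed domain or be its subdomain.
    if allow_domains and not any(
        host == d or host.endswith("." + d) for d in allow_domains
    ):
        return False
    # deny wins: any matching pattern excludes the URL.
    if any(re.search(p, url) for p in deny):
        return False
    # An empty allow tuple means every remaining URL is accepted.
    return not allow or any(re.search(p, url) for p in allow)


# Hypothetical URLs used only for illustration.
urls = [
    "https://www.che168.com/dealer/1.html",
    "https://www.che168.com/login",
    "https://example.org/dealer/2.html",
]
kept = [
    u for u in urls
    if link_allowed(u, allow=(r"/dealer/",), deny=(r"/login",),
                    allow_domains=("che168.com",))
]
```

Only the first URL survives: the second matches a deny pattern and the third is outside the allowed domain.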
Using dont_filter
python
# Method of a downloader middleware class
def process_response(self, request, response, spider):
    # Called with the response returned from the downloader.
    # Must either:
    # - return a Response object
    # - return a Request object
    # - or raise IgnoreRequest
    if response.status != 200:
        # The proxy may be unusable: set dont_filter so the retried
        # request is not dropped as a duplicate, then send it again.
        request.dont_filter = True
        return request
    return response
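Why the flag matters can be illustrated with a toy duplicate filter. This is a sketch under simplifying assumptions, not Scrapy's scheduler: FakeRequest and SeenFilter are hypothetical names, and the fingerprint is just a hash of the URL. A request that was already seen is normally rejected, so a retry only gets through when dont_filter is set.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request object."""
    url: str
    dont_filter: bool = False


class SeenFilter:
    """Toy duplicate filter keyed on a hash of the URL."""

    def __init__(self):
        self._seen = set()

    def should_schedule(self, request):
        # Requests flagged dont_filter skip the duplicate check entirely.
        if request.dont_filter:
            return True
        fingerprint = hashlib.sha1(request.url.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


dupefilter = SeenFilter()
first = dupefilter.should_schedule(FakeRequest("https://www.che168.com/china/list/"))
retry = dupefilter.should_schedule(FakeRequest("https://www.che168.com/china/list/"))
forced = dupefilter.should_schedule(
    FakeRequest("https://www.che168.com/china/list/", dont_filter=True)
)
```

Here first is accepted, the plain retry is rejected as a duplicate, and the forced retry is accepted again.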
2. Site-wide crawling with CrawlSpider
python
# Create a spider from Scrapy's crawl template (run this in a shell):
# scrapy genspider -t crawl pachpng baidu.com
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ErshouqicheSpider(CrawlSpider):
    name = "ershouqiche"
    allowed_domains = ["che168.com", "autohome.com.cn"]
    start_urls = ["https://www.che168.com/china/list/"]
    rules = (
        # Detail-page links: pass them to parse_item, do not follow further
        Rule(LinkExtractor(restrict_xpaths=('//ul[@class="viewlist_ul"]/li/a',)), callback="parse_item", follow=False),
        # Pagination links: follow them to reach the remaining list pages
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="page fn-clear"]/a',)), follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        title = response.xpath("//h3[@class='car-brand-name']/text()").extract_first()
        price = response.xpath("//span[@id='overlayPrice']/text()").extract_first()
        #item["domain_id"] = response.xpath('//input[@id="sid"]/@value').get()
        #item["name"] = response.xpath('//div[@id="name"]').get()
        #item["description"] = response.xpath('//div[@id="description"]').get()
        print(title, price)
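The interplay of the two rules can be sketched without Scrapy. The following is a toy model, not how CrawlSpider is implemented: SITE is a hypothetical in-memory site, "detail" links play the role of the first rule (callback, follow=False), and "page" links play the role of the second rule (no callback, follow=True).

```python
from collections import deque

# Hypothetical in-memory site: list page -> links each rule would extract.
SITE = {
    "/list/1": {"detail": ["/car/a", "/car/b"], "page": ["/list/2"]},
    "/list/2": {"detail": ["/car/c"], "page": ["/list/1"]},
}


def crawl(start_url):
    """Return the detail URLs that would reach the callback, in order."""
    seen = {start_url}
    queue = deque([start_url])
    parsed = []
    while queue:
        url = queue.popleft()
        links = SITE.get(url, {})
        # Rule 1: detail links go to the callback and are not followed.
        for link in links.get("detail", []):
            if link not in seen:
                seen.add(link)
                parsed.append(link)
        # Rule 2: pagination links have no callback but are followed.
        for link in links.get("page", []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return parsed


result = crawl("/list/1")
```

Every detail page is visited exactly once, even though the two list pages link to each other.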
3. Basic Redis usage
Download: http://redis.cn/download.html
Redis commands
shell
# Install Redis as a Windows service
redis-server.exe --service-install redis.windows.conf --loglevel verbose
# Uninstall the service
redis-server --service-uninstall
# Start the service
redis-server --service-start
# Stop the service
redis-server --service-stop
Configuring Redis
Logging in to Redis
Installing RDM, a Redis GUI tool
https://blog.csdn.net/qq_39715000/article/details/120724800
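Common Redis data types
Redis has five commonly used data types: string, hash, list, set, and sorted set.
Increment
Redis operations
Hash
List
Set
Redis tutorials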
https://www.runoob.com/redis/redis-sorted-sets.html
Using Redis from Python
pip install redis
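Once the package is installed, the operations listed above (increment, hash, list, set) might look like this with the redis-py client. This is a sketch, assuming a Redis server on localhost at the default port 6379; the key names and values are made up for illustration.

```python
def connect(host="localhost", port=6379):
    """Return a client; assumes a Redis server is listening on host:port."""
    import redis  # installed with: pip install redis

    # decode_responses=True makes the client return str instead of bytes.
    return redis.Redis(host=host, port=port, decode_responses=True)


def demo(r):
    """Exercise string, increment, hash, list, and set commands on client r."""
    # Start from a clean slate for the hypothetical keys used below.
    r.delete("visits", "car:1", "queue", "seen")

    # String plus increment
    r.set("visits", 0)
    r.incr("visits")
    r.incr("visits", 5)

    # Hash: one key holding several fields
    r.hset("car:1", mapping={"title": "demo", "price": "9.8"})

    # List: ordered, duplicates allowed
    r.rpush("queue", "url-1", "url-2")

    # Set: unordered, duplicates collapsed
    r.sadd("seen", "url-1", "url-1", "url-2")

    return {
        "visits": r.get("visits"),
        "car": r.hgetall("car:1"),
        "queue": r.lrange("queue", 0, -1),
        "seen": r.smembers("seen"),
    }
```

With a server running, `print(demo(connect()))` would show the stored values.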