An Introduction to the settings.py File in the Scrapy Module

Purpose

In the Scrapy framework, the settings.py file plays a very important role: it configures and controls the behavior, performance, and features of the entire Scrapy crawler project.
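
Everything in settings.py applies project-wide, but Scrapy also lets you override individual settings at narrower scopes: per spider via the custom_settings class attribute, or per run with the -s command-line flag. A minimal sketch of both (the spider name and URL below are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                       # placeholder spider name
    start_urls = ["https://example.com/"]  # placeholder URL

    # Overrides settings.py for this spider only
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS": 8,
    }

    def parse(self, response):
        # Settings are also readable at runtime via self.settings
        self.logger.info("Effective delay: %s",
                         self.settings.getfloat("DOWNLOAD_DELAY"))

The same override works from the shell, e.g. scrapy crawl example -s DOWNLOAD_DELAY=5; command-line settings take precedence over both custom_settings and settings.py.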

An annotated settings.py

# Scrapy settings for spidername project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# Name of the crawler (bot)
BOT_NAME = "spidername"

# Modules containing the spider code
SPIDER_MODULES = ["spidername.spiders"]
NEWSPIDER_MODULE = "spidername.spiders"
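
# Both values are generated by `scrapy startproject`: SPIDER_MODULES lists the
# modules Scrapy scans for spider classes, and NEWSPIDER_MODULE is where
# `scrapy genspider` places newly created spiders.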

# Set the user agent, used to impersonate a browser or a specific crawler identity
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

# Configure whether the crawler obeys robots.txt rules
ROBOTSTXT_OBEY = False

# Configure the maximum number of concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64

# Configure a delay for requests to the same website (default: 0), to avoid
# putting excessive load on the target server
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
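
# Note: unless RANDOMIZE_DOWNLOAD_DELAY is set to False, Scrapy randomizes the
# actual wait to 0.5-1.5 times DOWNLOAD_DELAY so request timing looks less mechanical.
# RANDOMIZE_DOWNLOAD_DELAY = True   # the default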

# Maximum concurrent requests per domain
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Maximum concurrent requests per IP address
CONCURRENT_REQUESTS_PER_IP = 16
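
# If CONCURRENT_REQUESTS_PER_IP is non-zero, it takes precedence over
# CONCURRENT_REQUESTS_PER_DOMAIN, and DOWNLOAD_DELAY is then enforced
# per IP rather than per domain.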

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
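
# With cookies disabled, the CookiesMiddleware is turned off entirely: it
# neither stores Set-Cookie headers nor sends cookies passed via
# Request(cookies=...). Set a raw Cookie header manually if one is required.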

# Disable the Telnet console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'https://www.xxx.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
}

# Enable or disable spider middlewares, and set their order
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    "spidername.middlewares.SpidernameSpiderMiddleware": 543,
}
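
# Lower numbers sit closer to the engine, higher numbers closer to the spider:
# process_spider_input() runs in increasing order of these values, and
# process_spider_output() in decreasing order.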

# Enable or disable downloader middlewares, and set their order
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    "spidername.middlewares.SpidernameDownloaderMiddleware": 543,
}
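
# Ordering works the same way as for spider middlewares (lower = closer to
# the engine). Mapping a built-in middleware to None disables it, e.g.:
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
# }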

# Enable or disable Scrapy extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    "scrapy.extensions.telnet.TelnetConsole": None,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "spidername.pipelines.spidernamePipeline": 300,
}
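
# Pipeline order values conventionally range from 0 to 1000; items pass
# through the pipelines in increasing order of these values.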

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True

# The initial download delay (seconds)
AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay (seconds) to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received (AutoThrottle debug mode):
AUTOTHROTTLE_DEBUG = False
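
# How AutoThrottle behaves with the values above: it adjusts the delay based
# on observed response latency, aiming at AUTOTHROTTLE_TARGET_CONCURRENCY
# requests in flight per server. The effective delay never drops below
# DOWNLOAD_DELAY and never exceeds AUTOTHROTTLE_MAX_DELAY.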


# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True

# Cache expiration time in seconds (0 means cached responses never expire)
HTTPCACHE_EXPIRATION_SECS = 0

# Directory where the cache is stored
HTTPCACHE_DIR = "httpcache"

# Don't cache responses with these HTTP status codes
HTTPCACHE_IGNORE_HTTP_CODES = []

# Storage backend used for the HTTP cache
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
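
# "2.7" opts into the request-fingerprinting algorithm introduced in
# Scrapy 2.7, and the asyncio reactor lets spiders and other components use
# `async def` coroutines alongside Twisted.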

# Encoding to use for exported feeds
FEED_EXPORT_ENCODING = "utf-8"
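
Outside a spider, the project settings can also be loaded programmatically, which is handy for scripts that launch crawls without the scrapy CLI. A minimal sketch using Scrapy's public API (the spider name "example" is a placeholder):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() reads the settings module referenced by
# SCRAPY_SETTINGS_MODULE (i.e. this settings.py when run inside the project)
settings = get_project_settings()
print(settings.get("BOT_NAME"), settings.getint("CONCURRENT_REQUESTS"))

process = CrawlerProcess(settings)
process.crawl("example")  # placeholder: a spider name found via SPIDER_MODULES
process.start()           # blocks until the crawl finishes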