Web Crawler Application: Features and Requirements Design

1. Overview

This is a web crawler application with a PyQt5-based graphical user interface (GUI).

The application lets the user enter a URL, choose an output directory, and crawl the content of the specified site. It provides a user-friendly interface for controlling the crawl and displays progress and status information.

2. Main Features

2.1 URL Input and History

  • Feature: Lets the user enter a URL and keeps a history of recently used URLs.
  • Design rationale: Improves the user experience by making frequently used URLs easy to reuse.
  • Relevant code:
```python
class MainWindow(QMainWindow):
    def __init__(self):
        # ...
        self.url_combo = QComboBox()
        self.url_combo.setEditable(True)
        self.url_combo.setInsertPolicy(QComboBox.InsertAtTop)
        # Add to history when editing finishes rather than on every keystroke,
        # so partially typed URLs are not stored as separate entries.
        self.url_combo.lineEdit().editingFinished.connect(self.on_url_changed)
        # ...

    def on_url_changed(self):
        text = self.url_combo.currentText()
        if text:
            if text not in [self.url_combo.itemText(i) for i in range(self.url_combo.count())]:
                self.url_combo.insertItem(0, text)
            self.save_url_to_history()

    def save_url_to_history(self):
        urls = [self.url_combo.itemText(i) for i in range(self.url_combo.count())]
        urls = urls[:10]  # Only keep the 10 most recent URLs
        self.settings.setValue("url_history", urls)
        self.settings.sync()

    def load_url_history(self):
        urls = self.settings.value("url_history", [])
        if isinstance(urls, str):  # QSettings may return a single value as a str
            urls = [urls]
        self.url_combo.addItems(urls)
```
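
The history bookkeeping above can be factored into a pure function so it can be unit-tested without Qt. This is only an illustrative sketch; the name `update_history` does not appear in the application, and unlike the widget code it also moves a re-entered URL back to the front:

```python
def update_history(history, new_url, max_items=10):
    """Insert new_url at the front, drop duplicates, keep at most max_items."""
    if not new_url:
        return list(history)
    # Remove any earlier occurrence so the URL moves to the top.
    trimmed = [u for u in history if u != new_url]
    return [new_url] + trimmed[:max_items - 1]
```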

2.2 Output Directory Selection

  • Feature: Lets the user choose where crawled content is saved.
  • Design rationale: A standard file dialog makes directory selection intuitive.
  • Relevant code:
```python
def select_output_directory(self):
    directory = QFileDialog.getExistingDirectory(self, "Select Output Directory")
    if directory:
        self.output_input.setText(directory)
```

2.3 Crawler Control

  • Feature: Starts and controls the crawl.
  • Design rationale: Runs the crawler in a separate thread to keep the GUI responsive while providing progress and status updates.
  • Relevant code:
```python
def start_crawling(self):
    url = self.url_combo.currentText()
    output_dir = self.output_input.text()

    if not url or not output_dir:
        self.status_label.setText("Please enter URL and select output directory")
        return

    self.crawler_thread = CrawlerThread(url, output_dir, self.browser_settings)
    self.crawler_thread.progress_signal.connect(self.update_progress)
    self.crawler_thread.status_signal.connect(self.update_status)
    self.crawler_thread.finished_signal.connect(self.crawling_finished)
    self.crawler_thread.start()

    self.crawl_button.setEnabled(False)
```

2.4 Browser Settings

  • Feature: Lets the user customize the crawler's browser-related settings.
  • Design rationale: Gives users the flexibility to tune crawler behavior to their needs.
  • Relevant code:
```python
class BrowserSettingsDialog(QDialog):
    def __init__(self, parent=None):
        # ...
        self.ua_input = QLineEdit()
        self.timeout_input = QLineEdit()
        self.robots_checkbox = QCheckBox("Follow robots.txt")
        self.retries_input = QComboBox()
        # ...

    def get_settings(self):
        return {
            "user_agent": self.ua_input.text(),
            "timeout": int(self.timeout_input.text() or 30),
            "follow_robots": self.robots_checkbox.isChecked(),
            "max_retries": int(self.retries_input.currentText())
        }
```

2.5 Progress and Status Display

  • Feature: Shows crawl progress and status information in real time.
  • Design rationale: Clear feedback keeps the user informed about what the crawler is doing.
  • Relevant code:
```python
def update_progress(self, value):
    self.progress_bar.setValue(value)

def update_status(self, status):
    self.status_label.setText(status)
```

3. robots.txt Compliance

3.1 Description

The crawler can check a site's robots.txt file and honor its rules. The feature is integrated into the browser settings, where the user can enable or disable it.

3.2 Design Rationale

  • Respect for site owners: Honoring robots.txt respects the access rules that site administrators set for their content.
  • Legal compliance: In many jurisdictions, crawlers are expected to follow the robots.txt protocol.
  • Configurability: Users can enable or disable the feature as needed.

3.3 Implementation Details

  1. Add the option to the browser settings dialog:
```python
class BrowserSettingsDialog(QDialog):
    def __init__(self, parent=None):
        # ...
        self.robots_checkbox = QCheckBox("Follow robots.txt")
        layout.addWidget(self.robots_checkbox)
        # ...

    def get_settings(self):
        return {
            # ...
            "follow_robots": self.robots_checkbox.isChecked(),
            # ...
        }
```
  2. Implement the check in the crawler logic:
```python
class CrawlerThread(QThread):
    # ...
    def crawl_directory(self, url: str, base_path: str) -> None:
        # ...
        # Check robots.txt if enabled
        if self.browser_settings['follow_robots']:
            rp = robotparser.RobotFileParser()
            rp.set_url(urljoin(url, "/robots.txt"))
            rp.read()
            if not rp.can_fetch(self.browser_settings['user_agent'], url):
                logging.warning(f"Skipping {url} as per robots.txt")
                self.status_signal.emit(f"Skipping {url} as per robots.txt")
                return
        # ...
```
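
`urllib.robotparser` can also be fed rules from memory via `parse()`, which makes the compliance check easy to verify without any network access. The rules and URLs below are made up for illustration:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyCrawler/1.0", "http://example.com/private/secret.html"))  # False
print(rp.can_fetch("MyCrawler/1.0", "http://example.com/docs/index.html"))      # True
```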

3.4 User Interface Interaction

Users can choose whether to honor robots.txt in the "Browser Settings" dialog:

  1. Click the "Browser Settings" button in the main window.
  2. Check or uncheck the "Follow robots.txt" option in the dialog.
  3. Click "OK" to save the settings.

3.5 Benefits

  1. Compliance: Keeps crawler behavior within ethical and legal norms.
  2. Lower ban risk: Honoring robots.txt reduces the chance of the crawler being blocked by a site.
  3. Flexibility: Users can enable or disable the feature to suit their needs.

3.6 TODO

  1. Cache robots.txt: Cache the contents of robots.txt files that have already been read, to improve efficiency.
  2. Richer robots.txt parsing: Support more advanced directives, such as crawl-rate limits.
  3. Visualize robots.txt rules: Display the current site's robots.txt rules in the UI to help users understand crawl restrictions.
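
For TODO item 1, one possible shape of such a cache is sketched below: one parsed robots.txt per scheme-and-host. `RobotsCache` and the injected `fetch` callable are hypothetical names chosen here so the cache can be exercised without touching the network; they are not part of the current code:

```python
from urllib import robotparser
from urllib.parse import urljoin, urlparse

class RobotsCache:
    """Cache one parsed robots.txt per scheme://host (hypothetical sketch)."""

    def __init__(self, fetch):
        # fetch(robots_url) -> str; injected so tests can avoid the network
        self._fetch = fetch
        self._parsers = {}

    def can_fetch(self, user_agent, url):
        parsed = urlparse(url)
        host = f"{parsed.scheme}://{parsed.netloc}"
        if host not in self._parsers:
            rp = robotparser.RobotFileParser()
            rp.parse(self._fetch(urljoin(host, "/robots.txt")).splitlines())
            self._parsers[host] = rp
        return self._parsers[host].can_fetch(user_agent, url)
```

Because the parser is stored per host, repeated `can_fetch` calls against the same site would trigger only a single robots.txt download.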

4. Technical Implementation

4.1 Multithreaded Crawler

The crawler runs in a QThread so that the GUI stays responsive.

```python
class CrawlerThread(QThread):
    progress_signal = pyqtSignal(int)
    status_signal = pyqtSignal(str)
    finished_signal = pyqtSignal()

    def __init__(self, url: str, base_path: str, browser_settings: dict):
        # ...

    def run(self):
        self.crawl_directory(self.url, self.base_path)
        self.finished_signal.emit()

    def crawl_directory(self, url: str, base_path: str) -> None:
        # Crawling logic
        # ...
```
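
The crawl logic relies on `urljoin` to resolve relative links against the current page and on `urlparse().path` to derive the local save path. Their behavior for a couple of illustrative URLs:

```python
from urllib.parse import urljoin, urlparse

base = "http://example.com/files/"
print(urljoin(base, "docs/a.txt"))      # http://example.com/files/docs/a.txt
print(urljoin(base, "/other/b.txt"))    # http://example.com/other/b.txt

# The local path used by crawl_directory: URL path with the leading '/' stripped
full_url = "http://example.com/files/docs/a.txt"
print(urlparse(full_url).path.lstrip('/'))  # files/docs/a.txt
```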

4.2 Settings Persistence

QSettings is used to save and load the URL history and browser settings.

```python
def save_browser_settings(self):
    for key, value in self.browser_settings.items():
        self.settings.setValue(f"browser_settings/{key}", value)
    self.settings.sync()

def load_browser_settings(self):
    return {
        "user_agent": self.settings.value("browser_settings/user_agent", ""),
        "timeout": int(self.settings.value("browser_settings/timeout", 30)),
        "follow_robots": self.settings.value("browser_settings/follow_robots", True, type=bool),
        "max_retries": int(self.settings.value("browser_settings/max_retries", 3))
    }
```
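
Depending on the storage backend (e.g. INI files), `QSettings.value` may hand back strings rather than typed values, which is why `load_browser_settings` converts explicitly. The conversions can be factored into a plain helper so they are testable without Qt; `normalize_settings` is our own name for this sketch, not part of the application:

```python
def normalize_settings(raw):
    """Coerce possibly-string settings values into typed browser settings."""
    def to_bool(v):
        if isinstance(v, str):
            return v.lower() in ("true", "1", "yes")
        return bool(v)

    return {
        "user_agent": str(raw.get("user_agent", "")),
        "timeout": int(raw.get("timeout", 30)),
        "follow_robots": to_bool(raw.get("follow_robots", True)),
        "max_retries": int(raw.get("max_retries", 3)),
    }
```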

5. Future Extensions

The code includes a placeholder for future extensions:

```python
self.extension_placeholder = QLabel("Future extensions will be added here")
layout.addWidget(self.extension_placeholder)
```

This leaves room for features to be added later, such as:

  • Advanced filtering options
  • Crawl depth control
  • Exporting results in different formats, format tuning, etc.
  • Crawl scheduling
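
As a sketch of the "crawl depth control" idea, a breadth-first traversal with a depth cap might look as follows. `fetch_links` stands in for the real page download and link extraction so the sketch runs without a network; none of these names exist in the current code:

```python
from collections import deque

def crawl_with_depth(start_url, fetch_links, max_depth):
    """BFS from start_url, following links at most max_depth hops away."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # don't expand links beyond the depth cap
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```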

6. Summary

This web crawler application provides a user-friendly interface for crawling website content (primarily batch-downloading files from FTP-style directory listings). It combines a multithreaded crawler, customizable browser settings, and persistent storage into a flexible, feature-rich tool. Clear progress and status displays make the crawl easy to monitor, and room is left for future extensions.

Complete code:

```python
import sys
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin, urlparse
import logging
from typing import Optional, Set, List, Tuple
from urllib import robotparser
from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget, QVBoxLayout, QHBoxLayout,
                             QPushButton, QLineEdit, QTextEdit, QProgressBar, QLabel,
                             QFileDialog, QComboBox, QDialog, QDialogButtonBox, QCheckBox)
from PyQt5.QtGui import QIntValidator
from PyQt5.QtCore import QThread, pyqtSignal, Qt, QSettings


logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


class CrawlerThread(QThread):
    progress_signal = pyqtSignal(int)
    status_signal = pyqtSignal(str)
    finished_signal = pyqtSignal()

    def __init__(self, url: str, base_path: str, browser_settings: dict):
        super().__init__()
        self.url = url
        self.base_path = base_path
        self.browser_settings = browser_settings
        self.session = requests.Session()
        self.visited = set()

        # Apply browser settings
        self.session.headers.update({'User-Agent': self.browser_settings['user_agent']})
        self.session.mount('https://',
                           requests.adapters.HTTPAdapter(max_retries=self.browser_settings['max_retries']))
        self.session.mount('http://',
                           requests.adapters.HTTPAdapter(max_retries=self.browser_settings['max_retries']))

    def run(self):
        self.crawl_directory(self.url, self.base_path)
        self.finished_signal.emit()

    def crawl_directory(self, url: str, base_path: str) -> None:
        if url in self.visited:
            logging.warning(f"Skipping already visited URL: {url}")
            return
        self.visited.add(url)

        # Check robots.txt before fetching, if enabled
        if self.browser_settings['follow_robots']:
            rp = robotparser.RobotFileParser()
            rp.set_url(urljoin(url, "/robots.txt"))
            rp.read()
            if not rp.can_fetch(self.browser_settings['user_agent'], url):
                logging.warning(f"Skipping {url} as per robots.txt")
                self.status_signal.emit(f"Skipping {url} as per robots.txt")
                return

        try:
            response = self.session.get(url, timeout=self.browser_settings['timeout'])
            response.raise_for_status()
        except requests.RequestException as e:
            logging.error(f"Failed to fetch {url}: {e}")
            self.status_signal.emit(f"Error: Failed to fetch {url}")
            return

        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.find_all('a', href=True)

        for link in links:
            href = link['href']
            full_url = urljoin(url, href)
            if full_url.startswith(url) and full_url not in self.visited:
                file_name = os.path.join(base_path, urlparse(full_url).path.lstrip('/'))
                os.makedirs(os.path.dirname(file_name), exist_ok=True)

                try:
                    # Download through the configured session (User-Agent, retries,
                    # timeout) and write bytes so binary files are not corrupted.
                    file_response = self.session.get(full_url,
                                                     timeout=self.browser_settings['timeout'])
                    file_response.raise_for_status()
                    with open(file_name, 'wb') as f:
                        f.write(file_response.content)
                    self.status_signal.emit(f"Saved: {full_url}")
                except Exception as e:
                    logging.error(f"Failed to save {full_url}: {e}")
                    self.status_signal.emit(f"Error: Failed to save {full_url}")

                self.crawl_directory(full_url, base_path)

        self.progress_signal.emit(len(self.visited))


class BrowserSettingsDialog(QDialog):
    def __init__(self, parent=None):
        super().__init__(parent)
        self.setWindowTitle("Browser Settings")
        self.setModal(True)

        layout = QVBoxLayout(self)

        # User-Agent
        ua_layout = QHBoxLayout()
        ua_layout.addWidget(QLabel("User-Agent:"))
        self.ua_input = QLineEdit()
        ua_layout.addWidget(self.ua_input)
        layout.addLayout(ua_layout)

        # Timeout
        timeout_layout = QHBoxLayout()
        timeout_layout.addWidget(QLabel("Timeout (seconds):"))
        self.timeout_input = QLineEdit()
        self.timeout_input.setValidator(QIntValidator(1, 60))
        timeout_layout.addWidget(self.timeout_input)
        layout.addLayout(timeout_layout)

        # Follow robots.txt
        self.robots_checkbox = QCheckBox("Follow robots.txt")
        layout.addWidget(self.robots_checkbox)

        # Max retries
        retries_layout = QHBoxLayout()
        retries_layout.addWidget(QLabel("Max retries:"))
        self.retries_input = QComboBox()
        self.retries_input.addItems([str(i) for i in range(6)])
        retries_layout.addWidget(self.retries_input)
        layout.addLayout(retries_layout)

        # Buttons
        self.button_box = QDialogButtonBox(QDialogButtonBox.Ok | QDialogButtonBox.Cancel)
        self.button_box.accepted.connect(self.accept)
        self.button_box.rejected.connect(self.reject)
        layout.addWidget(self.button_box)

    def get_settings(self):
        return {
            "user_agent": self.ua_input.text(),
            "timeout": int(self.timeout_input.text() or 30),
            "follow_robots": self.robots_checkbox.isChecked(),
            "max_retries": int(self.retries_input.currentText())
        }

    def set_settings(self, settings):
        self.ua_input.setText(settings.get("user_agent", ""))
        self.timeout_input.setText(str(settings.get("timeout", 30)))
        self.robots_checkbox.setChecked(settings.get("follow_robots", True))
        self.retries_input.setCurrentText(str(settings.get("max_retries", 3)))


class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Web Crawler")
        self.setGeometry(100, 100, 800, 600)

        self.settings = QSettings("YourCompany", "WebCrawler")

        central_widget = QWidget()
        self.setCentralWidget(central_widget)

        layout = QVBoxLayout()
        central_widget.setLayout(layout)

        # URL input with history
        url_layout = QHBoxLayout()
        self.url_combo = QComboBox()
        self.url_combo.setEditable(True)
        self.url_combo.setInsertPolicy(QComboBox.InsertAtTop)
        # Add to history when editing finishes rather than on every keystroke,
        # so partially typed URLs are not stored as separate entries.
        self.url_combo.lineEdit().editingFinished.connect(self.on_url_changed)
        url_layout.addWidget(QLabel("URL:"))
        url_layout.addWidget(self.url_combo)
        layout.addLayout(url_layout)

        # Output directory selection
        output_layout = QHBoxLayout()
        self.output_input = QLineEdit()
        self.output_button = QPushButton("Select")
        self.output_button.clicked.connect(self.select_output_directory)
        output_layout.addWidget(QLabel("Output:"))
        output_layout.addWidget(self.output_input)
        output_layout.addWidget(self.output_button)
        layout.addLayout(output_layout)

        # Crawl button
        self.crawl_button = QPushButton("Start Crawling")
        self.crawl_button.clicked.connect(self.start_crawling)
        layout.addWidget(self.crawl_button)

        # Settings button
        self.settings_button = QPushButton("Browser Settings")
        self.settings_button.clicked.connect(self.open_browser_settings)
        layout.addWidget(self.settings_button)

        # Progress bar
        self.progress_bar = QProgressBar()
        layout.addWidget(self.progress_bar)

        # Status label
        self.status_label = QLabel()
        layout.addWidget(self.status_label)

        # Load saved URL history
        self.load_url_history()

        # Load browser settings
        self.browser_settings = self.load_browser_settings()

        # Placeholder for future extensions
        self.extension_placeholder = QLabel("Future extensions will be added here")
        layout.addWidget(self.extension_placeholder)

    def on_url_changed(self):
        text = self.url_combo.currentText()
        if text:
            if text not in [self.url_combo.itemText(i) for i in range(self.url_combo.count())]:
                self.url_combo.insertItem(0, text)
            self.save_url_to_history()

    def select_output_directory(self):
        directory = QFileDialog.getExistingDirectory(self, "Select Output Directory")
        if directory:
            self.output_input.setText(directory)

    def start_crawling(self):
        url = self.url_combo.currentText()
        output_dir = self.output_input.text()

        if not url or not output_dir:
            self.status_label.setText("Please enter URL and select output directory")
            return

        # Ensure current URL is at the top of the list
        self.url_combo.removeItem(self.url_combo.findText(url))
        self.url_combo.insertItem(0, url)
        self.url_combo.setCurrentIndex(0)

        # Save URL history
        self.save_url_to_history()

        self.crawler_thread = CrawlerThread(url, output_dir, self.browser_settings)
        self.crawler_thread.progress_signal.connect(self.update_progress)
        self.crawler_thread.status_signal.connect(self.update_status)
        self.crawler_thread.finished_signal.connect(self.crawling_finished)
        self.crawler_thread.start()

        self.crawl_button.setEnabled(False)

    def update_progress(self, value):
        self.progress_bar.setValue(value)

    def update_status(self, status):
        self.status_label.setText(status)

    def crawling_finished(self):
        self.crawl_button.setEnabled(True)
        self.status_label.setText("Crawling finished")

    def save_url_to_history(self):
        urls = [self.url_combo.itemText(i) for i in range(self.url_combo.count())]
        urls = urls[:10]  # Only keep the 10 most recent URLs
        self.settings.setValue("url_history", urls)
        self.settings.sync()

    def load_url_history(self):
        urls = self.settings.value("url_history", [])
        if isinstance(urls, str):  # QSettings may return a single value as a str
            urls = [urls]
        self.url_combo.addItems(urls)

    def open_browser_settings(self):
        dialog = BrowserSettingsDialog(self)
        dialog.set_settings(self.browser_settings)
        if dialog.exec_() == QDialog.Accepted:
            self.browser_settings = dialog.get_settings()
            self.save_browser_settings()

    def save_browser_settings(self):
        for key, value in self.browser_settings.items():
            self.settings.setValue(f"browser_settings/{key}", value)
        self.settings.sync()

    def load_browser_settings(self):
        return {
            "user_agent": self.settings.value("browser_settings/user_agent", ""),
            "timeout": int(self.settings.value("browser_settings/timeout", 30)),
            "follow_robots": self.settings.value("browser_settings/follow_robots", True, type=bool),
            "max_retries": int(self.settings.value("browser_settings/max_retries", 3))
        }


if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = MainWindow()
    window.show()
    sys.exit(app.exec_())
```