Building a Simple Web Crawler from Scratch (with Requests and BeautifulSoup)

Table of Contents

  • Building a Simple Web Crawler from Scratch (with Requests and BeautifulSoup)
    • 1. Introduction: Why Web Crawlers Matter and Where They Are Used
      • 1.1 What Is a Web Crawler?
      • 1.2 Application Areas of Web Crawlers
      • 1.3 Why Use Python for Crawler Development?
    • 2. Environment Setup and Basic Concepts
      • 2.1 Installing the Required Python Libraries
      • 2.2 HTTP Protocol Basics
    • 3. Basic Crawler Architecture
      • 3.1 Crawler System Architecture
      • 3.2 Core Class Design
    • 4. A Complete Crawler Implementation
      • 4.1 Basic Crawler Implementation
      • 4.2 Advanced Crawler Features
    • 5. Crawler Ethics and Best Practices
      • 5.1 Respecting robots.txt
      • 5.2 Performance Optimization and the Math Behind It
    • 6. Practical Application Cases
      • 6.1 A News Site Crawler
    • 7. Summary
      • 7.1 Key Takeaways for Crawler Development
        • ✅ Core functionality
        • ✅ Advanced features
        • ✅ Practical applications
      • 7.2 Mathematical Principles Revisited
      • 7.3 Best-Practice Recommendations
      • 7.4 Directions for Further Learning


1. Introduction: Why Web Crawlers Matter and Where They Are Used

1.1 What Is a Web Crawler?

A web crawler (also called a spider) is a program that automatically browses the internet and collects information. By systematically visiting pages, extracting data, and following links, it automates the gathering and processing of web resources.

Industry traffic studies regularly estimate that automated programs account for a large share, often around half, of all web traffic, and crawlers play a central role in search engines, data mining, market research, and many other fields.

1.2 Application Areas of Web Crawlers

```python
# Example application scenarios for crawlers
applications = {
    "Search engines": "The core technology behind Google, Baidu, and other search engines",
    "Price monitoring": "Price comparison and trend analysis on e-commerce platforms",
    "Public-opinion analysis": "Sentiment analysis of social media and news sites",
    "Academic research": "Collecting scientific literature and data",
    "Competitive intelligence": "Tracking competitors' activities",
    "Content aggregation": "Aggregating news and blog content"
}
```

1.3 Why Use Python for Crawler Development?

Python has clear advantages for web crawler development (see the minimal sketch after this list):

  • Rich library ecosystem: Requests, BeautifulSoup, Scrapy, and more
  • Concise syntax: fast prototyping and easy code maintenance
  • Strong data-processing support: Pandas, NumPy, and other libraries
  • Good async support: aiohttp and asyncio improve crawl throughput
  • Active community: plenty of learning resources and ready-made solutions
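
To show how little code these libraries need, here is a minimal sketch of the fetch-and-parse cycle that the rest of this article builds on. The target URL is an assumption of this sketch; books.toscrape.com is a practice site intended for scraping exercises.

```python
# Minimal fetch-and-parse sketch, assuming requests and beautifulsoup4 are installed.
# books.toscrape.com is a practice site explicitly intended for scraping exercises.
import requests
from bs4 import BeautifulSoup

response = requests.get("http://books.toscrape.com/", timeout=10)
response.raise_for_status()                          # fail loudly on non-2xx responses

soup = BeautifulSoup(response.text, "html.parser")   # html.parser avoids the lxml dependency
print(soup.title.get_text(strip=True))               # page title
for link in soup.find_all("a", href=True)[:5]:       # first five links on the page
    print(link["href"])
```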

2. Environment Setup and Basic Concepts

2.1 Installing the Required Python Libraries

```python
#!/usr/bin/env python3
"""
Web爬虫开发环境配置和依赖检查
"""

import sys
import subprocess
import importlib
from typing import Any, Dict, List, Tuple

class EnvironmentSetup:
    """
    环境配置和依赖管理类
    """
    
    def __init__(self):
        self.required_packages = {
            'requests': 'requests',
            'beautifulsoup4': 'bs4',
            'lxml': 'lxml',
            'pandas': 'pandas',
            'fake-useragent': 'fake_useragent',
            'urllib3': 'urllib3',
            'selenium': 'selenium'
        }
        
        # asyncio ships with the standard library, so it is not listed as an installable package
        self.optional_packages = {
            'aiohttp': 'aiohttp',
            'scrapy': 'scrapy',
            'pyquery': 'pyquery'
        }
    
    def check_python_version(self) -> Tuple[bool, str]:
        """
        检查Python版本
        
        Returns:
            Tuple[bool, str]: (是否满足要求, 版本信息)
        """
        version = sys.version_info
        version_str = f"{version.major}.{version.minor}.{version.micro}"
        
        if version.major == 3 and version.minor >= 7:
            return True, version_str
        else:
            return False, version_str
    
    def check_package_installed(self, package_name: str, import_name: str) -> bool:
        """
        检查包是否已安装
        
        Args:
            package_name: 包名称
            import_name: 导入名称
            
        Returns:
            bool: 是否已安装
        """
        try:
            importlib.import_module(import_name)
            return True
        except ImportError:
            return False
    
    def install_package(self, package_name: str) -> bool:
        """
        安装Python包
        
        Args:
            package_name: 包名称
            
        Returns:
            bool: 安装是否成功
        """
        try:
            subprocess.check_call([
                sys.executable, "-m", "pip", "install", package_name
            ])
            return True
        except subprocess.CalledProcessError as e:
            print(f"安装 {package_name} 失败: {e}")
            return False
    
    def check_environment(self) -> Dict[str, Any]:
        """
        全面检查开发环境
        
        Returns:
            Dict: 环境检查结果
        """
        print("=" * 60)
        print("Web爬虫开发环境检查")
        print("=" * 60)
        
        # 检查Python版本
        py_ok, py_version = self.check_python_version()
        print(f"Python版本: {py_version} {'✅' if py_ok else '❌'}")
        
        if not py_ok:
            print("需要Python 3.7或更高版本")
            return {"status": "failed", "reason": "python_version"}
        
        # 检查必需包
        missing_required = []
        installed_required = []
        
        for pkg, import_name in self.required_packages.items():
            if self.check_package_installed(pkg, import_name):
                print(f"✅ {pkg} 已安装")
                installed_required.append(pkg)
            else:
                print(f"❌ {pkg} 未安装")
                missing_required.append(pkg)
        
        # 检查可选包
        missing_optional = []
        installed_optional = []
        
        for pkg, import_name in self.optional_packages.items():
            if self.check_package_installed(pkg, import_name):
                print(f"✅ {pkg} 已安装 (可选)")
                installed_optional.append(pkg)
            else:
                print(f"⚠️  {pkg} 未安装 (可选)")
                missing_optional.append(pkg)
        
        return {
            "status": "success" if not missing_required else "failed",
            "python_version": py_version,
            "python_ok": py_ok,
            "required_installed": installed_required,
            "required_missing": missing_required,
            "optional_installed": installed_optional,
            "optional_missing": missing_optional
        }
    
    def setup_environment(self) -> bool:
        """
        自动设置开发环境
        
        Returns:
            bool: 设置是否成功
        """
        result = self.check_environment()
        
        if result["status"] == "success":
            print("\n✅ 环境配置完成!")
            return True
        
        print(f"\n缺少 {len(result['required_missing'])} 个必需包")
        
        # 自动安装缺失的包
        success_count = 0
        for package in result['required_missing']:
            print(f"正在安装 {package}...")
            if self.install_package(package):
                success_count += 1
                print(f"✅ {package} 安装成功")
            else:
                print(f"❌ {package} 安装失败")
        
        # 验证安装结果
        final_check = self.check_environment()
        return final_check["status"] == "success"

def main():
    """主函数"""
    env_setup = EnvironmentSetup()
    
    print("Web爬虫开发环境配置工具")
    print("此工具将检查并安装必要的依赖包")
    print()
    
    if env_setup.setup_environment():
        print("\n🎉 环境配置成功! 可以开始开发Web爬虫了。")
        
        # 显示下一步建议
        print("\n下一步建议:")
        print("1. 学习基本的HTTP协议和HTML知识")
        print("2. 了解robots.txt和爬虫道德规范")
        print("3. 开始编写简单的爬虫脚本")
        
    else:
        print("\n❌ 环境配置失败,请手动安装缺失的包")
        print("可以使用: pip install 包名")

if __name__ == "__main__":
    main()
```

2.2 HTTP Protocol Basics

Before writing a crawler, it is essential to understand the basics of the HTTP protocol:

```python
"""
HTTP协议基础概念演示
"""

class HTTPBasics:
    """
    HTTP协议基础概念
    """
    
    def explain_http_methods(self):
        """解释HTTP方法"""
        methods = {
            "GET": "请求资源,不应产生副作用",
            "POST": "提交数据,可能改变服务器状态", 
            "HEAD": "只获取响应头信息",
            "PUT": "更新资源",
            "DELETE": "删除资源"
        }
        
        print("HTTP方法说明:")
        for method, description in methods.items():
            print(f"  {method}: {description}")
    
    def common_status_codes(self):
        """常见HTTP状态码"""
        status_codes = {
            200: "OK - 请求成功",
            301: "Moved Permanently - 永久重定向",
            302: "Found - 临时重定向", 
            404: "Not Found - 资源不存在",
            403: "Forbidden - 禁止访问",
            500: "Internal Server Error - 服务器错误",
            503: "Service Unavailable - 服务不可用"
        }
        
        print("\n常见HTTP状态码:")
        for code, meaning in status_codes.items():
            print(f"  {code}: {meaning}")
    
    def important_headers(self):
        """重要的HTTP头部"""
        headers = {
            "User-Agent": "客户端标识",
            "Referer": "来源页面", 
            "Cookie": "会话信息",
            "Content-Type": "请求体类型",
            "Accept": "可接受的响应类型",
            "Authorization": "认证信息"
        }
        
        print("\n重要HTTP头部:")
        for header, purpose in headers.items():
            print(f"  {header}: {purpose}")

# 运行演示
if __name__ == "__main__":
    http_basics = HTTPBasics()
    http_basics.explain_http_methods()
    http_basics.common_status_codes() 
    http_basics.important_headers()
```

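To see these concepts in practice, the short sketch below issues a real GET request and inspects the status code and headers. It assumes network access and uses httpbin.org, a public echo service, purely as a neutral test endpoint.

```python
# Inspecting HTTP status codes and headers with requests.
# httpbin.org is used here only as a neutral test endpoint (an assumption of this sketch).
import requests

response = requests.get(
    "https://httpbin.org/get",
    headers={"User-Agent": "simple-crawler-demo/0.1"},  # identify the client explicitly
    timeout=10,
)

print("Status code:", response.status_code)                      # e.g. 200
print("Content-Type:", response.headers.get("Content-Type"))     # response header
print("Sent User-Agent:", response.request.headers["User-Agent"])  # request header actually sent
```
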
3. Basic Crawler Architecture

3.1 Crawler System Architecture

(Architecture diagram: a scheduler in the control layer drives the core components, namely a URL manager, a page downloader, a page parser, and a data processor, which hand their results to a data store in the storage layer.)

3.2 Core Class Design

```python
#!/usr/bin/env python3
"""
Web爬虫核心架构设计
"""

import time
import random
from abc import ABC, abstractmethod
from typing import List, Dict, Optional, Set
from urllib.parse import urljoin, urlparse
from collections import deque

class URLManager:
    """
    URL管理器 - 负责管理待爬取和已爬取的URL
    """
    
    def __init__(self):
        self.to_crawl: deque = deque()  # 待爬取URL队列
        self.crawled: Set[str] = set()  # 已爬取URL集合
        self.failed: Set[str] = set()   # 爬取失败的URL
    
    def add_url(self, url: str) -> None:
        """
        添加URL到待爬取队列
        
        Args:
            url: 要添加的URL
        """
        if (url not in self.to_crawl and 
            url not in self.crawled and 
            url not in self.failed):
            self.to_crawl.append(url)
    
    def add_urls(self, urls: List[str]) -> None:
        """
        批量添加URL
        
        Args:
            urls: URL列表
        """
        for url in urls:
            self.add_url(url)
    
    def get_url(self) -> Optional[str]:
        """
        获取下一个要爬取的URL
        
        Returns:
            str: URL地址,如果没有则返回None
        """
        if self.has_next():
            url = self.to_crawl.popleft()
            self.crawled.add(url)
            return url
        return None
    
    def has_next(self) -> bool:
        """检查是否还有待爬取的URL"""
        return len(self.to_crawl) > 0
    
    def mark_failed(self, url: str) -> None:
        """
        标记URL为爬取失败
        
        Args:
            url: 失败的URL
        """
        if url in self.crawled:
            self.crawled.remove(url)
        self.failed.add(url)
    
    def get_stats(self) -> Dict[str, int]:
        """
        获取统计信息
        
        Returns:
            Dict: 统计信息
        """
        return {
            'to_crawl': len(self.to_crawl),
            'crawled': len(self.crawled),
            'failed': len(self.failed),
            'total': len(self.to_crawl) + len(self.crawled) + len(self.failed)
        }

class WebDownloader:
    """
    网页下载器 - 负责下载网页内容
    """
    
    def __init__(self, delay: float = 1.0, timeout: int = 10):
        """
        初始化下载器
        
        Args:
            delay: 请求延迟(秒)
            timeout: 请求超时时间(秒)
        """
        self.delay = delay
        self.timeout = timeout
        self.last_request_time = 0
        
    def download(self, url: str, **kwargs) -> Optional[str]:
        """
        下载网页内容
        
        Args:
            url: 要下载的URL
            **kwargs: 其他请求参数
            
        Returns:
            str: 网页内容,失败返回None
        """
        import requests
        from fake_useragent import UserAgent
        
        # 遵守爬虫礼仪,添加延迟
        self._respect_delay()
        
        try:
            # 设置请求头
            headers = kwargs.get('headers', {})
            if 'User-Agent' not in headers:
                ua = UserAgent()
                headers['User-Agent'] = ua.random
            
            # 发送请求
            response = requests.get(
                url, 
                timeout=self.timeout,
                headers=headers,
                **kwargs
            )
            
            # 检查响应状态
            response.raise_for_status()
            
            # 更新最后请求时间
            self.last_request_time = time.time()
            
            return response.text
            
        except requests.exceptions.RequestException as e:
            print(f"下载失败 {url}: {e}")
            return None
    
    def _respect_delay(self) -> None:
        """遵守请求延迟"""
        if self.delay > 0 and self.last_request_time > 0:
            elapsed = time.time() - self.last_request_time
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)

class HTMLParser:
    """
    HTML解析器 - 负责解析网页内容并提取数据
    """
    
    def __init__(self):
        pass
    
    def parse(self, html: str, base_url: str = "") -> Dict:
        """
        解析HTML内容
        
        Args:
            html: HTML内容
            base_url: 基础URL用于解析相对链接
            
        Returns:
            Dict: 解析结果
        """
        from bs4 import BeautifulSoup
        
        if not html:
            return {'links': [], 'data': {}}
        
        soup = BeautifulSoup(html, 'lxml')
        
        # 提取所有链接
        links = self._extract_links(soup, base_url)
        
        # 提取页面数据
        data = self._extract_data(soup)
        
        return {
            'links': links,
            'data': data,
            'title': self._get_title(soup),
            'meta_description': self._get_meta_description(soup)
        }
    
    def _extract_links(self, soup, base_url: str) -> List[str]:
        """
        提取页面中的所有链接
        
        Args:
            soup: BeautifulSoup对象
            base_url: 基础URL
            
        Returns:
            List[str]: 链接列表
        """
        links = []
        
        for link in soup.find_all('a', href=True):
            href = link['href']
            
            # 处理相对链接
            if base_url and not href.startswith(('http://', 'https://')):
                href = urljoin(base_url, href)
            
            # 过滤无效链接
            if self._is_valid_url(href):
                links.append(href)
        
        return list(set(links))  # 去重
    
    def _extract_data(self, soup) -> Dict:
        """
        提取页面数据
        
        Args:
            soup: BeautifulSoup对象
            
        Returns:
            Dict: 提取的数据
        """
        data = {}
        
        # 提取所有文本内容
        texts = soup.stripped_strings
        data['text_content'] = ' '.join(list(texts)[:1000])  # 限制长度
        
        # 提取图片
        data['images'] = [
            img['src'] for img in soup.find_all('img', src=True)
            if self._is_valid_url(img['src'])
        ]
        
        # 提取元数据
        data['meta_keywords'] = self._get_meta_keywords(soup)
        
        return data
    
    def _get_title(self, soup) -> str:
        """获取页面标题"""
        title_tag = soup.find('title')
        return title_tag.get_text().strip() if title_tag else ""
    
    def _get_meta_description(self, soup) -> str:
        """获取meta描述"""
        meta_desc = soup.find('meta', attrs={'name': 'description'})
        return meta_desc.get('content', '') if meta_desc else ""
    
    def _get_meta_keywords(self, soup) -> str:
        """获取meta关键词"""
        meta_keywords = soup.find('meta', attrs={'name': 'keywords'})
        return meta_keywords.get('content', '') if meta_keywords else ""
    
    def _is_valid_url(self, url: str) -> bool:
        """
        检查URL是否有效
        
        Args:
            url: 要检查的URL
            
        Returns:
            bool: 是否有效
        """
        if not url or url.startswith(('javascript:', 'mailto:', 'tel:')):
            return False
        
        parsed = urlparse(url)
        return bool(parsed.netloc and parsed.scheme in ['http', 'https'])

class DataStorage:
    """
    数据存储器 - 负责存储爬取的数据
    """
    
    def __init__(self, storage_type: str = 'file'):
        """
        初始化存储器
        
        Args:
            storage_type: 存储类型 ('file', 'csv', 'json')
        """
        self.storage_type = storage_type
    
    def save(self, data: Dict, filename: str = None) -> bool:
        """
        保存数据
        
        Args:
            data: 要保存的数据
            filename: 文件名
            
        Returns:
            bool: 保存是否成功
        """
        try:
            if self.storage_type == 'file':
                return self._save_to_file(data, filename)
            elif self.storage_type == 'csv':
                return self._save_to_csv(data, filename)
            elif self.storage_type == 'json':
                return self._save_to_json(data, filename)
            else:
                print(f"不支持的存储类型: {self.storage_type}")
                return False
        except Exception as e:
            print(f"保存数据失败: {e}")
            return False
    
    def _save_to_file(self, data: Dict, filename: str) -> bool:
        """保存到文本文件"""
        if not filename:
            filename = f"crawled_data_{int(time.time())}.txt"
        
        with open(filename, 'w', encoding='utf-8') as f:
            for key, value in data.items():
                f.write(f"=== {key} ===\n")
                if isinstance(value, (list, tuple)):
                    for item in value:
                        f.write(f"{item}\n")
                else:
                    f.write(f"{value}\n")
                f.write("\n")
        
        print(f"数据已保存到: {filename}")
        return True
    
    def _save_to_csv(self, data: Dict, filename: str) -> bool:
        """保存到CSV文件"""
        import csv
        
        if not filename:
            filename = f"crawled_data_{int(time.time())}.csv"
        
        # 这里需要根据具体数据结构实现
        # 简化实现,只保存主要数据
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['Field', 'Value'])
            for key, value in data.items():
                if isinstance(value, (list, tuple)):
                    writer.writerow([key, '; '.join(str(v) for v in value)])
                else:
                    writer.writerow([key, value])
        
        print(f"数据已保存到CSV: {filename}")
        return True
    
    def _save_to_json(self, data: Dict, filename: str) -> bool:
        """保存到JSON文件"""
        import json
        
        if not filename:
            filename = f"crawled_data_{int(time.time())}.json"
        
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        
        print(f"数据已保存到JSON: {filename}")
        return True
```

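The classes above define the individual components but not the loop that drives them. As a rough sketch of how they could be wired together (the module name basic_architecture is an assumption; adjust it to wherever you saved the classes):

```python
# Sketch of a crawl loop tying the section 3.2 components together.
# "basic_architecture" is an assumed module name for the classes defined above.
from basic_architecture import URLManager, WebDownloader, HTMLParser, DataStorage

def run_crawl(seed_url: str, max_pages: int = 10) -> None:
    urls = URLManager()
    downloader = WebDownloader(delay=1.0)
    parser = HTMLParser()
    storage = DataStorage(storage_type='json')

    urls.add_url(seed_url)
    pages_done = 0

    while urls.has_next() and pages_done < max_pages:
        url = urls.get_url()
        html = downloader.download(url)
        if html is None:
            urls.mark_failed(url)        # record the failure and move on
            continue

        parsed = parser.parse(html, base_url=url)
        urls.add_urls(parsed['links'])   # feed newly discovered links back into the queue
        storage.save(
            {'url': url, 'title': parsed['title'], **parsed['data']},
            filename=f"page_{pages_done}.json",
        )
        pages_done += 1

    print(urls.get_stats())

if __name__ == "__main__":
    run_crawl("http://books.toscrape.com/", max_pages=5)
```
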
4. A Complete Crawler Implementation

4.1 Basic Crawler Implementation

```python
#!/usr/bin/env python3
"""
基础Web爬虫完整实现
使用Requests和BeautifulSoup
"""

import time
import re
import json
from typing import Dict, List, Optional, Set
from urllib.parse import urljoin, urlparse
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

@dataclass
class CrawlResult:
    """爬取结果数据类"""
    url: str
    title: str = ""
    content: str = ""
    links: List[str] = None
    images: List[str] = None
    status_code: int = 0
    error: str = ""
    
    def __post_init__(self):
        if self.links is None:
            self.links = []
        if self.images is None:
            self.images = []

class SimpleWebCrawler:
    """
    简单Web爬虫类
    """
    
    def __init__(self, 
                 delay: float = 1.0,
                 timeout: int = 10,
                 max_pages: int = 100,
                 user_agent: str = None):
        """
        初始化爬虫
        
        Args:
            delay: 请求间隔(秒)
            timeout: 请求超时(秒)
            max_pages: 最大爬取页面数
            user_agent: 用户代理字符串
        """
        self.delay = delay
        self.timeout = timeout
        self.max_pages = max_pages
        self.user_agent = user_agent or UserAgent().random
        
        self.visited_urls: Set[str] = set()
        self.results: List[CrawlResult] = []
        self.session = requests.Session()
        
        # 设置会话头信息
        self.session.headers.update({
            'User-Agent': self.user_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
        })
    
    def crawl(self, start_urls: List[str], 
              allowed_domains: List[str] = None) -> List[CrawlResult]:
        """
        开始爬取
        
        Args:
            start_urls: 起始URL列表
            allowed_domains: 允许的域名列表
            
        Returns:
            List[CrawlResult]: 爬取结果列表
        """
        print(f"开始爬取,起始URL: {start_urls}")
        print(f"最大页面数: {self.max_pages}")
        print(f"请求延迟: {self.delay}秒")
        
        # 初始化URL队列
        url_queue = []
        for url in start_urls:
            if self._is_allowed_domain(url, allowed_domains):
                url_queue.append(url)
                self.visited_urls.add(url)
        
        page_count = 0
        
        while url_queue and page_count < self.max_pages:
            current_url = url_queue.pop(0)
            print(f"爬取 [{page_count + 1}/{self.max_pages}]: {current_url}")
            
            # 爬取当前页面
            result = self._crawl_page(current_url)
            self.results.append(result)
            
            if result.links:
                # 添加新链接到队列
                for link in result.links:
                    if (link not in self.visited_urls and 
                        self._is_allowed_domain(link, allowed_domains) and
                        len(url_queue) < (self.max_pages - page_count)):
                        
                        url_queue.append(link)
                        self.visited_urls.add(link)
            
            page_count += 1
            
            # 遵守爬虫礼仪
            if self.delay > 0 and page_count < self.max_pages:
                time.sleep(self.delay)
        
        print(f"爬取完成! 共爬取 {len(self.results)} 个页面")
        return self.results
    
    def _crawl_page(self, url: str) -> CrawlResult:
        """
        爬取单个页面
        
        Args:
            url: 要爬取的URL
            
        Returns:
            CrawlResult: 爬取结果
        """
        result = CrawlResult(url=url)
        
        try:
            # 发送HTTP请求
            response = self.session.get(url, timeout=self.timeout)
            result.status_code = response.status_code
            
            # 检查响应状态
            if response.status_code != 200:
                result.error = f"HTTP {response.status_code}"
                return result
            
            # 解析HTML内容
            soup = BeautifulSoup(response.content, 'lxml')
            
            # 提取标题
            title_tag = soup.find('title')
            result.title = title_tag.get_text().strip() if title_tag else ""
            
            # 提取正文内容(简化版)
            result.content = self._extract_content(soup)
            
            # 提取所有链接
            result.links = self._extract_links(soup, url)
            
            # 提取图片
            result.images = self._extract_images(soup, url)
            
        except requests.exceptions.RequestException as e:
            result.error = f"请求错误: {e}"
        except Exception as e:
            result.error = f"解析错误: {e}"
        
        return result
    
    def _extract_content(self, soup) -> str:
        """
        提取页面主要内容
        
        Args:
            soup: BeautifulSoup对象
            
        Returns:
            str: 提取的文本内容
        """
        # 移除脚本和样式标签
        for script in soup(["script", "style"]):
            script.decompose()
        
        # 获取文本内容
        text = soup.get_text()
        
        # 清理文本
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = ' '.join(chunk for chunk in chunks if chunk)
        
        # 限制长度
        return text[:5000] if len(text) > 5000 else text
    
    def _extract_links(self, soup, base_url: str) -> List[str]:
        """
        提取页面中的所有链接
        
        Args:
            soup: BeautifulSoup对象
            base_url: 基础URL
            
        Returns:
            List[str]: 链接列表
        """
        links = []
        
        for link in soup.find_all('a', href=True):
            href = link['href']
            
            # 处理相对链接
            full_url = urljoin(base_url, href)
            
            # 过滤无效链接
            if self._is_valid_url(full_url):
                links.append(full_url)
        
        return list(set(links))  # 去重
    
    def _extract_images(self, soup, base_url: str) -> List[str]:
        """
        提取页面中的图片
        
        Args:
            soup: BeautifulSoup对象
            base_url: 基础URL
            
        Returns:
            List[str]: 图片URL列表
        """
        images = []
        
        for img in soup.find_all('img', src=True):
            src = img['src']
            
            # 处理相对链接
            full_url = urljoin(base_url, src)
            
            if self._is_valid_url(full_url):
                images.append(full_url)
        
        return list(set(images))
    
    def _is_valid_url(self, url: str) -> bool:
        """
        检查URL是否有效
        
        Args:
            url: 要检查的URL
            
        Returns:
            bool: 是否有效
        """
        if not url:
            return False
        
        # 过滤掉非HTTP链接和常见无效模式
        invalid_patterns = [
            'javascript:', 'mailto:', 'tel:', '#',
            'callto:', 'fax:', 'sms:'
        ]
        
        if any(url.startswith(pattern) for pattern in invalid_patterns):
            return False
        
        parsed = urlparse(url)
        return bool(parsed.netloc and parsed.scheme in ['http', 'https'])
    
    def _is_allowed_domain(self, url: str, allowed_domains: List[str]) -> bool:
        """
        检查域名是否在允许列表中
        
        Args:
            url: 要检查的URL
            allowed_domains: 允许的域名列表
            
        Returns:
            bool: 是否允许
        """
        if not allowed_domains:
            return True
        
        domain = urlparse(url).netloc
        return any(allowed in domain for allowed in allowed_domains)
    
    def save_results(self, filename: str = None, format: str = 'json') -> bool:
        """
        保存爬取结果
        
        Args:
            filename: 文件名
            format: 保存格式 ('json', 'csv', 'txt')
            
        Returns:
            bool: 保存是否成功
        """
        if not filename:
            timestamp = int(time.time())
            filename = f"crawl_results_{timestamp}.{format}"
        
        try:
            if format == 'json':
                return self._save_as_json(filename)
            elif format == 'csv':
                return self._save_as_csv(filename)
            elif format == 'txt':
                return self._save_as_text(filename)
            else:
                print(f"不支持的格式: {format}")
                return False
        except Exception as e:
            print(f"保存结果失败: {e}")
            return False
    
    def _save_as_json(self, filename: str) -> bool:
        """保存为JSON格式"""
        data = []
        for result in self.results:
            data.append({
                'url': result.url,
                'title': result.title,
                'content_length': len(result.content),
                'links_count': len(result.links),
                'images_count': len(result.images),
                'status_code': result.status_code,
                'error': result.error,
                'content_preview': result.content[:200] + '...' if len(result.content) > 200 else result.content
            })
        
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        
        print(f"结果已保存为JSON: {filename}")
        return True
    
    def _save_as_csv(self, filename: str) -> bool:
        """保存为CSV格式"""
        import csv
        
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([
                'URL', 'Title', 'Content Length', 'Links Count', 
                'Images Count', 'Status Code', 'Error'
            ])
            
            for result in self.results:
                writer.writerow([
                    result.url,
                    result.title,
                    len(result.content),
                    len(result.links),
                    len(result.images),
                    result.status_code,
                    result.error
                ])
        
        print(f"结果已保存为CSV: {filename}")
        return True
    
    def _save_as_text(self, filename: str) -> bool:
        """保存为文本格式"""
        with open(filename, 'w', encoding='utf-8') as f:
            for i, result in enumerate(self.results, 1):
                f.write(f"=== 页面 {i} ===\n")
                f.write(f"URL: {result.url}\n")
                f.write(f"标题: {result.title}\n")
                f.write(f"状态码: {result.status_code}\n")
                f.write(f"错误: {result.error}\n")
                f.write(f"链接数量: {len(result.links)}\n")
                f.write(f"图片数量: {len(result.images)}\n")
                f.write(f"内容预览: {result.content[:300]}...\n")
                f.write("\n" + "="*50 + "\n\n")
        
        print(f"结果已保存为文本: {filename}")
        return True
    
    def get_statistics(self) -> Dict:
        """
        获取爬取统计信息
        
        Returns:
            Dict: 统计信息
        """
        total_pages = len(self.results)
        successful_pages = len([r for r in self.results if r.status_code == 200])
        failed_pages = total_pages - successful_pages
        
        total_links = sum(len(r.links) for r in self.results)
        total_images = sum(len(r.images) for r in self.results)
        total_content = sum(len(r.content) for r in self.results)
        
        return {
            'total_pages': total_pages,
            'successful_pages': successful_pages,
            'failed_pages': failed_pages,
            'total_links_found': total_links,
            'total_images_found': total_images,
            'total_content_length': total_content,
            'average_content_length': total_content // total_pages if total_pages > 0 else 0
        }

# 使用示例和演示
def demo_crawler():
    """演示爬虫使用方法"""
    print("简单Web爬虫演示")
    print("=" * 50)
    
    # 创建爬虫实例
    crawler = SimpleWebCrawler(
        delay=2.0,        # 2秒延迟
        timeout=10,       # 10秒超时
        max_pages=5,      # 最多爬取5个页面
    )
    
    # 起始URL(使用示例网站)
    start_urls = [
        'http://books.toscrape.com/',  # 示例网站,适合爬虫练习
    ]
    
    # 开始爬取
    results = crawler.crawl(
        start_urls=start_urls,
        allowed_domains=['books.toscrape.com']
    )
    
    # 显示统计信息
    stats = crawler.get_statistics()
    print("\n爬取统计:")
    for key, value in stats.items():
        print(f"  {key}: {value}")
    
    # 保存结果
    crawler.save_results(format='json')
    crawler.save_results(format='csv')
    
    # 显示前几个结果
    print("\n前3个页面的结果:")
    for i, result in enumerate(results[:3], 1):
        print(f"\n{i}. {result.url}")
        print(f"   标题: {result.title}")
        print(f"   状态: {result.status_code}")
        print(f"   链接数: {len(result.links)}")
        print(f"   图片数: {len(result.images)}")
        print(f"   内容长度: {len(result.content)} 字符")

if __name__ == "__main__":
    demo_crawler()
```

4.2 Advanced Crawler Features

```python
#!/usr/bin/env python3
"""
Advanced web crawler features:
proxies, concurrency, and JavaScript rendering
"""

import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict, Optional

import aiohttp
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# The basic crawler from section 4.1 is assumed to be saved as simple_crawler.py
# (the module name is illustrative; adjust it to your own file layout)
from simple_crawler import SimpleWebCrawler, CrawlResult

class AdvancedWebCrawler(SimpleWebCrawler):
    """
    高级Web爬虫类
    继承基础爬虫并添加高级功能
    """
    
    def __init__(self, 
                 delay: float = 1.0,
                 timeout: int = 10,
                 max_pages: int = 100,
                 user_agent: str = None,
                 use_proxy: bool = False,
                 max_workers: int = 5):
        """
        初始化高级爬虫
        
        Args:
            delay: 请求间隔
            timeout: 请求超时
            max_pages: 最大页面数
            user_agent: 用户代理
            use_proxy: 是否使用代理
            max_workers: 最大工作线程数
        """
        super().__init__(delay, timeout, max_pages, user_agent)
        self.use_proxy = use_proxy
        self.max_workers = max_workers
        self.proxies = self._load_proxies() if use_proxy else []
    
    def concurrent_crawl(self, start_urls: List[str], 
                        allowed_domains: List[str] = None) -> List[CrawlResult]:
        """
        并发爬取
        
        Args:
            start_urls: 起始URL列表
            allowed_domains: 允许的域名列表
            
        Returns:
            List[CrawlResult]: 爬取结果列表
        """
        print(f"开始并发爬取,工作线程数: {self.max_workers}")
        
        # 使用线程池进行并发爬取
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # 提交任务
            future_to_url = {
                executor.submit(self._crawl_page, url): url 
                for url in start_urls[:self.max_pages]
            }
            
            # 收集结果
            for future in as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    result = future.result()
                    self.results.append(result)
                    self.visited_urls.add(url)
                    
                    print(f"完成: {url} (状态: {result.status_code})")
                    
                except Exception as e:
                    print(f"爬取失败 {url}: {e}")
                    error_result = CrawlResult(url=url, error=str(e))
                    self.results.append(error_result)
        
        print(f"并发爬取完成! 共爬取 {len(self.results)} 个页面")
        return self.results
    
    async def async_crawl(self, urls: List[str]) -> List[CrawlResult]:
        """
        异步爬取(适用于大量URL)
        
        Args:
            urls: URL列表
            
        Returns:
            List[CrawlResult]: 爬取结果列表
        """
        print("开始异步爬取...")
        
        async with aiohttp.ClientSession() as session:
            tasks = []
            for url in urls[:self.max_pages]:
                task = self._async_crawl_page(session, url)
                tasks.append(task)
            
            results = await asyncio.gather(*tasks, return_exceptions=True)
            
            # 处理结果
            for result in results:
                if isinstance(result, CrawlResult):
                    self.results.append(result)
                elif isinstance(result, Exception):
                    print(f"异步爬取错误: {result}")
        
        return self.results
    
    async def _async_crawl_page(self, session, url: str) -> CrawlResult:
        """
        异步爬取单个页面
        
        Args:
            session: aiohttp会话
            url: 要爬取的URL
            
        Returns:
            CrawlResult: 爬取结果
        """
        result = CrawlResult(url=url)
        
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=self.timeout)) as response:
                result.status_code = response.status
                
                if response.status == 200:
                    html = await response.text()
                    
                    # 使用BeautifulSoup解析
                    soup = BeautifulSoup(html, 'lxml')
                    
                    # 提取数据
                    title_tag = soup.find('title')
                    result.title = title_tag.get_text().strip() if title_tag else ""
                    result.content = self._extract_content(soup)
                    result.links = self._extract_links(soup, url)
                    result.images = self._extract_images(soup, url)
                else:
                    result.error = f"HTTP {response.status}"
        
        except Exception as e:
            result.error = f"异步请求错误: {e}"
        
        return result
    
    def crawl_with_selenium(self, url: str) -> CrawlResult:
        """
        使用Selenium爬取JavaScript渲染的页面
        
        Args:
            url: 要爬取的URL
            
        Returns:
            CrawlResult: 爬取结果
        """
        print(f"使用Selenium爬取: {url}")
        
        result = CrawlResult(url=url)
        
        # 配置Chrome选项
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # 无头模式
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument(f'--user-agent={self.user_agent}')
        
        driver = None
        try:
            # 启动浏览器
            driver = webdriver.Chrome(options=chrome_options)
            driver.get(url)
            
            # 等待页面加载
            driver.implicitly_wait(10)
            
            # 获取页面源码
            html = driver.page_source
            
            # 解析内容
            soup = BeautifulSoup(html, 'lxml')
            
            title_tag = soup.find('title')
            result.title = title_tag.get_text().strip() if title_tag else ""
            result.content = self._extract_content(soup)
            result.links = self._extract_links(soup, url)
            result.images = self._extract_images(soup, url)
            result.status_code = 200
            
            print(f"Selenium爬取成功: {url}")
            
        except Exception as e:
            result.error = f"Selenium错误: {e}"
            print(f"Selenium爬取失败 {url}: {e}")
        
        finally:
            if driver:
                driver.quit()
        
        return result
    
    def _load_proxies(self) -> List[str]:
        """
        加载代理列表
        
        Returns:
            List[str]: 代理列表
        """
        # 这里可以从文件或API加载代理
        # 简化实现,返回空列表
        print("代理功能需要自行实现代理源")
        return []
    
    def rotate_user_agent(self):
        """轮换用户代理"""
        ua = UserAgent()
        new_ua = ua.random
        self.session.headers.update({'User-Agent': new_ua})
        self.user_agent = new_ua
        print(f"用户代理已更换: {new_ua}")

# 高级功能演示
def demo_advanced_features():
    """演示高级爬虫功能"""
    print("高级Web爬虫功能演示")
    print("=" * 50)
    
    # 创建高级爬虫实例
    advanced_crawler = AdvancedWebCrawler(
        delay=1.0,
        timeout=10,
        max_pages=3,
        max_workers=2
    )
    
    # 测试URL列表
    test_urls = [
        'http://books.toscrape.com/',
        'http://quotes.toscrape.com/',
        'http://quotes.toscrape.com/js/'
    ]
    
    print("1. 测试并发爬取:")
    results = advanced_crawler.concurrent_crawl(test_urls)
    
    print("\n2. 测试用户代理轮换:")
    advanced_crawler.rotate_user_agent()
    
    print("\n3. 获取统计信息:")
    stats = advanced_crawler.get_statistics()
    for key, value in stats.items():
        print(f"  {key}: {value}")
    
    # 保存结果
    advanced_crawler.save_results("advanced_crawl_results.json")

if __name__ == "__main__":
    demo_advanced_features()
```

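The `_load_proxies` hook above is deliberately left empty. If you do have a proxy source, requests accepts a `proxies` mapping on each call; the sketch below only shows the shape of such a request (the proxy address is a placeholder, not a working endpoint):

```python
# How a proxy would plug into the requests-based downloader.
# The proxy URL below is a placeholder; supply your own proxy source.
import requests

proxies = {
    "http": "http://127.0.0.1:8080",   # placeholder proxy address
    "https": "http://127.0.0.1:8080",
}

response = requests.get(
    "http://books.toscrape.com/",
    proxies=proxies,
    timeout=10,
)
print(response.status_code)
```
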
5. Crawler Ethics and Best Practices

5.1 Respecting robots.txt

```python
#!/usr/bin/env python3
"""
爬虫伦理和robots.txt处理
"""

import time
import urllib.robotparser
from typing import Optional
from urllib.parse import urlparse

class EthicalCrawler:
    """
    遵守伦理的爬虫类
    """
    
    def __init__(self, crawl_delay: float = 1.0):
        self.crawl_delay = crawl_delay
        self.robot_parsers = {}
        self.last_access_time = {}
    
    def can_fetch(self, url: str, user_agent: str = '*') -> bool:
        """
        检查是否允许爬取该URL
        
        Args:
            url: 要检查的URL
            user_agent: 用户代理字符串
            
        Returns:
            bool: 是否允许爬取
        """
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
        
        # 获取或创建robots.txt解析器
        if base_url not in self.robot_parsers:
            self.robot_parsers[base_url] = self._create_robot_parser(base_url)
        
        rp = self.robot_parsers[base_url]
        
        if rp:
            return rp.can_fetch(user_agent, url)
        
        return True  # 如果没有robots.txt,默认允许
    
    def _create_robot_parser(self, base_url: str) -> Optional[urllib.robotparser.RobotFileParser]:
        """
        创建robots.txt解析器
        
        Args:
            base_url: 基础URL
            
        Returns:
            RobotFileParser: 解析器实例
        """
        rp = urllib.robotparser.RobotFileParser()
        robots_url = f"{base_url}/robots.txt"
        
        try:
            rp.set_url(robots_url)
            rp.read()
            print(f"已加载robots.txt: {robots_url}")
            return rp
        except Exception as e:
            print(f"无法加载robots.txt {robots_url}: {e}")
            return None
    
    def get_crawl_delay(self, url: str, user_agent: str = '*') -> float:
        """
        获取建议的爬取延迟
        
        Args:
            url: URL
            user_agent: 用户代理
            
        Returns:
            float: 建议的延迟时间(秒)
        """
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
        
        if base_url in self.robot_parsers and self.robot_parsers[base_url]:
            rp = self.robot_parsers[base_url]
            delay = rp.crawl_delay(user_agent)
            if delay:
                return delay
        
        return self.crawl_delay
    
    def respect_delay(self, url: str) -> None:
        """
        遵守爬取延迟
        
        Args:
            url: 当前爬取的URL
        """
        parsed_url = urlparse(url)
        domain = parsed_url.netloc
        
        current_time = time.time()
        
        if domain in self.last_access_time:
            last_time = self.last_access_time[domain]
            elapsed = current_time - last_time
            delay = self.get_crawl_delay(url)
            
            if elapsed < delay:
                sleep_time = delay - elapsed
                print(f"遵守延迟: 等待 {sleep_time:.2f} 秒")
                time.sleep(sleep_time)
        
        self.last_access_time[domain] = time.time()

# 使用示例
def check_robots_example():
    """检查robots.txt示例"""
    crawler = EthicalCrawler()
    
    test_urls = [
        'https://www.google.com/search?q=python',
        'https://www.baidu.com/s?wd=python',
        'https://httpbin.org/'
    ]
    
    for url in test_urls:
        can_fetch = crawler.can_fetch(url)
        delay = crawler.get_crawl_delay(url)
        
        print(f"URL: {url}")
        print(f"  允许爬取: {'是' if can_fetch else '否'}")
        print(f"  建议延迟: {delay}秒")
        print()

if __name__ == "__main__":
    check_robots_example()
```

5.2 Performance Optimization and the Math Behind It

```python
#!/usr/bin/env python3
"""
爬虫性能优化和数学原理
"""

import time
import math
from typing import List, Dict
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    """性能指标"""
    total_urls: int
    successful_requests: int
    failed_requests: int
    total_time: float
    data_size: int

class CrawlerOptimizer:
    """
    爬虫性能优化器
    """
    
    def __init__(self):
        self.metrics = []
    
    def calculate_optimal_threads(self, 
                                 target_throughput: int,
                                 avg_response_time: float) -> int:
        """
        计算最优线程数
        
        使用Little定律: L = λ * W
        其中 L 是平均并发请求数,λ 是到达率,W 是平均响应时间
        
        Args:
            target_throughput: 目标吞吐量(请求/秒)
            avg_response_time: 平均响应时间(秒)
            
        Returns:
            int: 最优线程数
        """
        # 根据Little定律,需要的并发数
        optimal_concurrency = target_throughput * avg_response_time
        
        # 考虑系统开销,增加20%缓冲
        optimal_threads = math.ceil(optimal_concurrency * 1.2)
        
        print(f"目标吞吐量: {target_throughput} 请求/秒")
        print(f"平均响应时间: {avg_response_time:.2f} 秒")
        print(f"计算出的最优线程数: {optimal_threads}")
        
        return max(1, optimal_threads)  # 至少1个线程
    
    def estimate_crawl_time(self, 
                           total_urls: int,
                           requests_per_second: float,
                           success_rate: float = 0.95) -> float:
        """
        估算总爬取时间
        
        Args:
            total_urls: 总URL数量
            requests_per_second: 每秒请求数
            success_rate: 成功率
            
        Returns:
            float: 估算时间(秒)
        """
        # 考虑重试的有效URL数
        effective_urls = total_urls / success_rate
        
        # 总时间 = 有效URL数 / 每秒请求数
        total_time = effective_urls / requests_per_second
        
        # 转换为更友好的格式
        hours = total_time / 3600
        minutes = (total_time % 3600) / 60
        
        print(f"总URL数: {total_urls}")
        print(f"预估成功率: {success_rate * 100}%")
        print(f"有效请求数: {effective_urls:.0f}")
        print(f"预估总时间: {total_time:.0f} 秒 ({hours:.1f} 小时)")
        
        return total_time
    
    def analyze_bottleneck(self, metrics: PerformanceMetrics) -> Dict[str, float]:
        """
        分析性能瓶颈
        
        Args:
            metrics: 性能指标
            
        Returns:
            Dict: 瓶颈分析结果
        """
        total_requests = metrics.successful_requests + metrics.failed_requests
        
        if total_requests == 0:
            return {}
        
        # 计算各种比率
        success_rate = metrics.successful_requests / total_requests
        requests_per_second = total_requests / metrics.total_time
        data_rate = metrics.data_size / metrics.total_time  # bytes/sec
        
        # 分析瓶颈
        bottleneck_analysis = {}
        
        if success_rate < 0.8:
            bottleneck_analysis['主要瓶颈'] = '请求成功率低'
            bottleneck_analysis['建议'] = '检查网络连接或目标网站限制'
        
        if requests_per_second < 1:
            bottleneck_analysis['主要瓶颈'] = '请求速度慢'
            bottleneck_analysis['建议'] = '增加并发数或优化网络连接'
        
        if data_rate < 1024:  # 小于1KB/秒
            bottleneck_analysis['主要瓶颈'] = '数据传输慢'
            bottleneck_analysis['建议'] = '检查网络带宽或压缩数据'
        
        print("性能分析结果:")
        print(f"  成功率: {success_rate:.2%}")
        print(f"  请求速度: {requests_per_second:.2f} 请求/秒")
        print(f"  数据速率: {data_rate/1024:.2f} KB/秒")
        
        if bottleneck_analysis:
            print(f"  主要瓶颈: {bottleneck_analysis['主要瓶颈']}")
            print(f"  建议: {bottleneck_analysis['建议']}")
        else:
            print("  性能良好,无明显瓶颈")
        
        return {
            'success_rate': success_rate,
            'requests_per_second': requests_per_second,
            'data_rate': data_rate,
            'bottleneck': bottleneck_analysis
        }

# 性能优化演示
def performance_demo():
    """性能优化演示"""
    optimizer = CrawlerOptimizer()
    
    print("爬虫性能优化分析")
    print("=" * 50)
    
    # 计算最优线程数
    print("1. 最优线程数计算:")
    optimal_threads = optimizer.calculate_optimal_threads(
        target_throughput=10,  # 10请求/秒
        avg_response_time=0.5   # 0.5秒/请求
    )
    
    print(f"\n2. 爬取时间估算:")
    crawl_time = optimizer.estimate_crawl_time(
        total_urls=1000,
        requests_per_second=5,
        success_rate=0.9
    )
    
    print(f"\n3. 性能瓶颈分析:")
    sample_metrics = PerformanceMetrics(
        total_urls=100,
        successful_requests=85,
        failed_requests=15,
        total_time=30.0,  # 30秒
        data_size=500000  # 500KB
    )
    
    analysis = optimizer.analyze_bottleneck(sample_metrics)

if __name__ == "__main__":
    performance_demo()
```

6. Practical Application Cases

6.1 A News Site Crawler

```python
#!/usr/bin/env python3
"""
News site crawler case study
"""

import time
from typing import Dict, List
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# The basic crawler from section 4.1 is assumed to be saved as simple_crawler.py
# (the module name is illustrative; adjust it to your own file layout)
from simple_crawler import SimpleWebCrawler

class NewsCrawler(SimpleWebCrawler):
    """
    新闻网站专用爬虫
    """
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.news_data = []
    
    def extract_news(self, soup, url: str) -> Dict:
        """
        提取新闻内容
        
        Args:
            soup: BeautifulSoup对象
            url: 新闻URL
            
        Returns:
            Dict: 新闻数据
        """
        news = {
            'url': url,
            'title': '',
            'publish_date': '',
            'author': '',
            'content': '',
            'tags': [],
            'summary': ''
        }
        
        # 提取标题(多种可能的选择器)
        title_selectors = [
            'h1',
            '.article-title',
            '.news-title', 
            '.title',
            'header h1'
        ]
        
        for selector in title_selectors:
            title_tag = soup.select_one(selector)
            if title_tag:
                news['title'] = title_tag.get_text().strip()
                break
        
        # 提取发布时间
        date_selectors = [
            '.publish-date',
            '.article-date',
            '.date',
            'time'
        ]
        
        for selector in date_selectors:
            date_tag = soup.select_one(selector)
            if date_tag:
                news['publish_date'] = date_tag.get_text().strip()
                break
        
        # 提取作者
        author_selectors = [
            '.author',
            '.article-author',
            '.byline'
        ]
        
        for selector in author_selectors:
            author_tag = soup.select_one(selector)
            if author_tag:
                news['author'] = author_tag.get_text().strip()
                break
        
        # 提取正文内容
        content_selectors = [
            '.article-content',
            '.news-content',
            '.content',
            'article'
        ]
        
        for selector in content_selectors:
            content_tag = soup.select_one(selector)
            if content_tag:
                # 移除不需要的元素
                for element in content_tag.select('script, style, nav, footer, aside'):
                    element.decompose()
                
                news['content'] = content_tag.get_text().strip()
                break
        
        # 生成摘要
        if news['content']:
            news['summary'] = news['content'][:200] + '...'
        
        return news
    
    def crawl_news_site(self, start_url: str, max_news: int = 20) -> List[Dict]:
        """
        爬取新闻网站
        
        Args:
            start_url: 起始URL
            max_news: 最大新闻数量
            
        Returns:
            List[Dict]: 新闻数据列表
        """
        print(f"开始爬取新闻网站: {start_url}")
        
        # Fetch the news list page directly: BeautifulSoup needs the raw HTML,
        # while CrawlResult.content only stores the extracted plain text
        try:
            list_response = self.session.get(start_url, timeout=self.timeout)
        except requests.exceptions.RequestException as e:
            print(f"无法访问新闻列表页 {start_url}: {e}")
            return []

        if list_response.status_code != 200:
            print(f"无法访问新闻列表页: {start_url}")
            return []

        # Parse the list page and extract article links
        soup = BeautifulSoup(list_response.content, 'lxml')
        news_links = self._extract_news_links(soup, start_url)
        
        print(f"找到 {len(news_links)} 个新闻链接")
        
        # 爬取新闻详情页
        news_count = 0
        for news_url in news_links[:max_news]:
            print(f"爬取新闻 {news_count + 1}/{min(len(news_links), max_news)}: {news_url}")
            
            # Fetch the article page itself; again request the raw HTML rather
            # than reusing the extracted text stored in CrawlResult.content
            try:
                news_response = self.session.get(news_url, timeout=self.timeout)
            except requests.exceptions.RequestException as e:
                print(f"请求失败 {news_url}: {e}")
                continue

            if news_response.status_code == 200:
                soup = BeautifulSoup(news_response.content, 'lxml')
                news_data = self.extract_news(soup, news_url)
                self.news_data.append(news_data)
                news_count += 1
            
            # 遵守延迟
            if self.delay > 0 and news_count < max_news:
                time.sleep(self.delay)
        
        print(f"新闻爬取完成! 共爬取 {len(self.news_data)} 篇新闻")
        return self.news_data
    
    def _extract_news_links(self, soup, base_url: str) -> List[str]:
        """
        提取新闻链接
        
        Args:
            soup: BeautifulSoup对象
            base_url: 基础URL
            
        Returns:
            List[str]: 新闻链接列表
        """
        news_links = []
        
        # 常见的新闻链接选择器
        link_selectors = [
            'a[href*="article"]',
            'a[href*="news"]', 
            '.news-list a',
            '.article-list a',
            '.news-item a'
        ]
        
        for selector in link_selectors:
            links = soup.select(selector)
            for link in links:
                href = link.get('href')
                if href:
                    full_url = urljoin(base_url, href)
                    if self._is_valid_url(full_url):
                        news_links.append(full_url)
        
        return list(set(news_links))  # 去重
    
    def save_news(self, filename: str = None) -> bool:
        """
        保存新闻数据
        
        Args:
            filename: 文件名
            
        Returns:
            bool: 保存是否成功
        """
        if not filename:
            timestamp = int(time.time())
            filename = f"news_data_{timestamp}.json"
        
        try:
            import json
            
            # 准备保存的数据
            save_data = {
                'crawl_time': time.strftime('%Y-%m-%d %H:%M:%S'),
                'total_news': len(self.news_data),
                'news': self.news_data
            }
            
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(save_data, f, ensure_ascii=False, indent=2)
            
            print(f"新闻数据已保存到: {filename}")
            return True
            
        except Exception as e:
            print(f"保存新闻数据失败: {e}")
            return False

# 新闻爬虫演示
def news_crawler_demo():
    """新闻爬虫演示"""
    print("新闻网站爬虫演示")
    print("=" * 50)
    
    # 创建新闻爬虫
    news_crawler = NewsCrawler(
        delay=2.0,
        max_pages=10
    )
    
    # 注意:在实际使用时,请确保遵守目标网站的robots.txt和使用条款
    # 这里使用示例网站进行演示
    example_news_site = "http://quotes.toscrape.com/"  # 替换为实际的新闻网站
    
    # 爬取新闻
    news_data = news_crawler.crawl_news_site(
        start_url=example_news_site,
        max_news=5
    )
    
    # 显示结果
    print(f"\n爬取的新闻摘要:")
    for i, news in enumerate(news_data, 1):
        print(f"\n{i}. {news.get('title', '无标题')}")
        print(f"   链接: {news['url']}")
        print(f"   摘要: {news.get('summary', '无内容')}")
    
    # 保存结果
    news_crawler.save_news()

if __name__ == "__main__":
    news_crawler_demo()
```

7. Summary

7.1 Key Takeaways for Crawler Development

In this article we built a complete web-crawler system covering:

✅ Core functionality
  • URL management: queueing and de-duplication
  • Page downloading: HTTP requests and error handling
  • Content parsing: data extraction with BeautifulSoup
  • Data storage: output in multiple formats
✅ Advanced features
  • Concurrent crawling: multithreading and async support
  • JavaScript rendering: Selenium integration
  • Ethical crawling: robots.txt handling and request delays
  • Performance optimization: bottleneck analysis and parameter tuning
✅ Practical applications
  • News crawler: structured data extraction
  • E-commerce monitoring: scraping prices and product information
  • Content aggregation: collecting information from multiple sources

7.2 Mathematical Principles Revisited

In the performance-optimization section we relied on an important mathematical principle:

Little's Law

$$L = \lambda \times W$$

where:

  • $L$ = average number of concurrent requests
  • $\lambda$ = request arrival rate (requests per second)
  • $W$ = average response time (seconds)

Throughput calculation

$$\text{Throughput} = \frac{\text{successful requests}}{\text{total time}}$$
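
As a quick worked example, using the same illustrative numbers as `calculate_optimal_threads` in section 5.2: at a target arrival rate of 10 requests per second and an average response time of 0.5 seconds,

$$L = \lambda \times W = 10 \times 0.5 = 5$$

so roughly 5 requests are in flight at any moment, and with the 20% buffer applied in the optimizer that rounds up to 6 worker threads.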

7.3 Best-Practice Recommendations

  1. Comply with laws and regulations: respect robots.txt and each site's terms of use
  2. Throttle your crawl rate: avoid putting pressure on the target site
  3. Handle failures gracefully: solid error handling plus a retry mechanism (a minimal retry sketch follows this list)
  4. Guard data quality: cleaning and validation pipelines
  5. Monitor and log: track the crawler's health in real time
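
Expanding on point 3, one common shape for "error handling plus retries" is a small exponential-backoff loop. The sketch below is a standalone illustration; the retry count, delays, and target URL are illustrative choices rather than values used elsewhere in this article.

```python
# Minimal GET-with-retries sketch using exponential backoff between attempts.
# max_retries, base_delay, and the URL are illustrative values, not recommendations.
import time
from typing import Optional

import requests

def fetch_with_retries(url: str, max_retries: int = 3, base_delay: float = 1.0) -> Optional[str]:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()          # treat HTTP error codes as failures too
            return response.text
        except requests.exceptions.RequestException as exc:
            wait = base_delay * (2 ** attempt)   # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait:.0f}s")
            time.sleep(wait)
    return None                                  # caller decides what to do after giving up

if __name__ == "__main__":
    html = fetch_with_retries("http://books.toscrape.com/")
    print("fetched" if html else "gave up")
```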

7.4 Directions for Further Learning

  • Distributed crawling: the Scrapy-Redis framework
  • Dealing with anti-bot measures: IP proxies, CAPTCHA handling
  • Big-data processing: integration with Hadoop and Spark
  • Machine learning: intelligent content extraction and classification

A note on the code in this article: everything has received basic testing, but before using it in production make sure to:

  1. Respect the target site's robots.txt and terms of use
  2. Configure sensible request delays and concurrency limits
  3. Handle the full range of network and parsing errors
  4. Keep dependencies and security patches up to date
  5. Monitor the crawler's performance and resource usage

Important reminder: when developing and running crawlers, always follow:

  • Applicable laws and intellectual-property protections
  • Each website's terms of use and service agreements
  • Data-privacy and data-protection principles
  • General norms of network ethics

With what you have learned here, you can go on to explore more sophisticated crawler applications and build more capable data-collection systems.
