Crawler Tutorial, Chapter 6: DNS Resolution Optimization and Proxy Pool Architecture

Contents

  • 6.1 Introduction: Why DNS and Proxies Matter for Crawlers
    • 6.1.1 How DNS Resolution Affects Crawler Performance
    • 6.1.2 Why a Proxy Pool Is Necessary
    • 6.1.3 Learning Objectives for This Chapter
  • 6.2 The DNS Resolution Process in Depth
    • 6.2.1 The Complete DNS Query Flow
    • 6.2.2 Recursive vs. Iterative Queries
    • 6.2.3 Local Caches and System DNS
    • 6.2.4 DNS Record Types
  • 6.3 DNS Caching in Depth
    • 6.3.1 What TTL Means and What It Does
    • 6.3.2 Designing a Cache Strategy
    • 6.3.3 Handling Cache Invalidation
    • 6.3.4 Multi-Level Cache Architecture
  • 6.4 Implementing DNS over HTTPS/TLS
    • 6.4.1 How DoH Works
    • 6.4.2 How DoT Works
    • 6.4.3 Security Advantages of DoH/DoT
    • 6.4.4 A DoH Client Implementation
  • 6.5 Proxy Pool Architecture Design
    • 6.5.1 Data Structure Design for the Proxy Pool
    • 6.5.2 Proxy Types (HTTP/HTTPS/SOCKS4/SOCKS5)
    • 6.5.3 Proxy Pool Interface Design
    • 6.5.4 Proxy Pool Architecture Diagram
  • 6.6 Proxy Health Checking
    • 6.6.1 Design Principles for Health Checks
    • 6.6.2 Implementing Health Checks
    • 6.6.3 Controlling Health Check Frequency
    • 6.6.4 Criteria for Assessing Health
  • 6.7 Implementing Load-Balancing Algorithms
    • 6.7.1 Round Robin
    • 6.7.2 Random
    • 6.7.3 Weighted
    • 6.7.4 Least Connections
    • 6.7.5 Algorithm Performance Comparison
  • 6.8 Toolchain: Using DNS and Proxy Tools
    • 6.8.1 DNS Queries with dnspython
    • 6.8.2 DNS Queries via DoH Services
    • 6.8.3 Configuring Proxies with httpx/aiohttp
    • 6.8.4 Managing Proxy Pool Data with Redis
    • 6.8.5 SOCKS Proxy Support with aiohttp-socks
  • 6.9 Code Walkthrough: Complete Implementations
    • 6.9.1 A Custom DNS Resolver (with Caching and DoH)
    • 6.9.2 A Complete Proxy Pool Manager Class
    • 6.9.3 Proxy Health Check Implementation
    • 6.9.4 Proxy Rotation Middleware
    • 6.9.5 Proxy Pool Monitoring Dashboard
  • 6.10 Hands-On: Building a Highly Available Proxy Pool
    • 6.10.1 Step 1: Design the Pool's Data Structures and Interfaces
    • 6.10.2 Step 2: Implement Adding, Removing, and Querying Proxies
    • 6.10.3 Step 3: Implement Scheduled Health Checks
    • 6.10.4 Step 4: Implement Load Balancing (Comparing Several Algorithms)
    • 6.10.5 Step 5: Implement the Proxy Rotation Middleware
    • 6.10.6 Step 6: Integrate with an Async Crawler Framework
    • 6.10.7 Step 7: Complete Hands-On Code
  • 6.11 Common Pitfalls and Troubleshooting
    • 6.11.1 Overlong DNS Cache Lifetimes Hide IP Changes
    • 6.11.2 Overly Frequent Health Checks Get You Banned by the Proxy Provider
    • 6.11.3 SOCKS5 Proxies Need Special Handling for UDP Traffic
    • 6.11.4 Proxy Pool Exhaustion Causes Request Failures
    • 6.11.5 A Poorly Chosen Load-Balancing Algorithm Hurts Performance
  • 6.12 Summary

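6.1 Introduction: Why DNS and Proxies Matter for Crawlers

DNS resolution and proxy usage are two critical links in crawler development. DNS resolution latency adds directly to request latency, while the quality of the proxy pool determines how stable the crawler is and how well it resists detection. Understanding DNS resolution and building an efficient proxy pool are therefore foundations of a high-performance crawler.

6.1.1 How DNS Resolution Affects Crawler Performance

Measuring the cost of DNS resolution:

python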
import socket
import time

def test_dns_resolution(hostname: str, iterations: int = 100):
    """Measure DNS resolution latency through the system resolver."""
    total_time = 0.0

    for _ in range(iterations):
        start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        try:
            socket.gethostbyname(hostname)
        except socket.gaierror as e:
            print(f"DNS resolution failed: {e}")
        total_time += time.perf_counter() - start

    avg_time = total_time / iterations
    print(f"Average DNS resolution time: {avg_time*1000:.2f}ms")
    print(f"Total time for {iterations} requests: {total_time:.2f}s")

    # If every request performs its own DNS lookup, this cost is paid on each one
    estimated_total = avg_time * iterations
    print(f"Estimated total time without cache: {estimated_total:.2f}s")

# Test (the OS resolver may cache answers, so later lookups can be faster than the first)
test_dns_resolution("www.example.com", 100)
# Sample output (illustrative; actual numbers depend on the network and resolver):
# Average DNS resolution time: 50.23ms
# Total time for 100 requests: 5.02s
# Estimated total time without cache: 5.02s

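Performance problems caused by DNS resolution:

  1. Latency accumulates

    • An uncached DNS query typically takes around 20-100 ms
    • Across many requests, DNS latency adds up significantly
    • Without a cache, every request waits for DNS resolution
  2. Network dependency

    • DNS queries depend on network connectivity
    • On an unstable network, DNS queries may time out
    • This affects the stability of the whole crawler
  3. DNS server limits

    • Public DNS servers may enforce rate limits
    • Frequent queries may be throttled
    • Smart retries and fallback are needed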
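
The retry-and-fallback idea from the last point can be sketched roughly as follows. This is a minimal illustration, not a production resolver: `resolve_with_fallback` and its parameters are hypothetical names, and each resolver is assumed to be any callable that takes a hostname and returns a list of IP strings (for example a wrapper around `socket.gethostbyname`).

```python
import time
from typing import Callable, List, Optional, Sequence

Resolver = Callable[[str], List[str]]

def resolve_with_fallback(
    hostname: str,
    resolvers: Sequence[Resolver],
    retries: int = 2,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Try each resolver in order; retry transient errors with exponential backoff."""
    last_error: Optional[Exception] = None
    for resolver in resolvers:
        for attempt in range(retries + 1):
            try:
                ips = resolver(hostname)
            except OSError as exc:  # socket.gaierror is a subclass of OSError
                last_error = exc
                if attempt < retries:
                    sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
                continue
            if ips:
                return ips
            break  # empty answer: retrying is unlikely to help, fall back to the next resolver
    raise LookupError(f"all resolvers failed for {hostname!r}") from last_error
```

Passing `sleep` as a parameter keeps the backoff testable; in an async crawler the same structure would use `await asyncio.sleep(...)` instead.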

The effect of DNS caching:

python
import socket
import time
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_dns_resolve(hostname: str) -> str:
    """DNS resolution with an in-process cache.

    Note: lru_cache never expires entries, so this ignores the record's TTL.
    It only illustrates the speedup; section 6.3 builds a TTL-aware cache.
    """
    return socket.gethostbyname(hostname)

def test_cached_dns(hostname: str, iterations: int = 100):
    """Compare the first (uncached) lookup with the cached lookups that follow."""
    # First lookup (cache miss)
    start = time.perf_counter()
    cached_dns_resolve(hostname)
    first_time = time.perf_counter() - start

    # Subsequent lookups (cache hits)
    start = time.perf_counter()
    for _ in range(iterations - 1):
        cached_dns_resolve(hostname)
    cached_time = time.perf_counter() - start

    avg_cached_time = cached_time / (iterations - 1)
    print(f"First resolution (no cache): {first_time*1000:.2f}ms")
    print(f"Average cached resolution: {avg_cached_time*1000:.6f}ms")
    print(f"Speedup: {first_time/avg_cached_time:.0f}x")

# Test
test_cached_dns("www.example.com", 100)
# Sample output (illustrative):
# First resolution (no cache): 45.23ms
# Average cached resolution: 0.000123ms
# Speedup: 367724x

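6.1.2 Why a Proxy Pool Is Necessary

Why do we need a proxy pool?

  1. IP bans

    • Hitting the same site too often can get your IP banned
    • Proxies let you rotate IPs and avoid bans
    • This makes the crawler more stable
  2. Geographic restrictions

    • Some sites restrict access by region
    • A proxy located in the right region can get around the restriction
    • This enables data collection worldwide
  3. Request rate control

    • A single IP can only send requests at a limited rate
    • Multiple proxies spread the requests out
    • This raises overall crawl throughput

Challenges of running a proxy pool:

python
# Problem 1: proxy quality varies widely
proxies = [
    "http://proxy1.com:8080",  # fast and stable
    "http://proxy2.com:8080",  # slow and unstable
    "http://proxy3.com:8080",  # already dead
]

# Problem 2: proxies need health checks
# Problem 3: proxies need load balancing
# Problem 4: proxies need automatic rotation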

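6.1.3 Learning Objectives for This Chapter

After working through this chapter, you will:

  1. Understand DNS resolution in depth

    • The complete DNS query flow
    • Cache design and optimization
    • Implementing and using DoH/DoT
  2. Know how to design a proxy pool architecture

    • Data structure design for the proxy pool
    • Health checking
    • Load-balancing algorithms
  3. Implement a complete proxy pool system

    • Adding, removing, and querying proxies
    • Automatic health checks
    • Smart load balancing
    • Monitoring and statistics
  4. Integrate it into a crawler framework

    • Proxy rotation middleware
    • Integration with an async crawler framework
    • Performance optimization and tuning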

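6.2 The DNS Resolution Process in Depth

DNS (Domain Name System) is the "phone book" of the internet: it translates domain names into IP addresses. Understanding how resolution works is essential for optimizing crawler performance.

6.2.1 The Complete DNS Query Flow

The complete DNS query flow (sequence diagram; participants: application, local cache, system DNS resolver, root DNS server, TLD DNS server, authoritative DNS server):

  1. The application queries the local cache.
    • Cache hit: the cached IP is returned.
    • Cache miss: continue with step 2.
  2. The query goes to the system DNS resolver.
  3. The resolver queries a root DNS server, which returns the address of the .com TLD servers.
  4. The resolver queries the TLD DNS server, which returns the address of the authoritative server for example.com.
  5. The resolver queries the authoritative DNS server, which returns the IP address of www.example.com.
  6. The cache is updated and the IP address is returned to the application.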

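DNS query steps in detail:

  1. Local lookup

    • Check the application cache
    • Check the system DNS cache
    • Check the hosts file
  2. Recursive query

    • The client sends the query to the system's configured DNS resolver
    • That resolver takes responsibility for the whole lookup
  3. Iterative queries

    • The resolver starts at the root DNS servers
    • It follows referrals level by level down to the authoritative DNS server
    • It obtains the final IP address

DNS queries in Python:

python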
import socket
import time
from typing import List, Optional

import dns.resolver

def dns_query_system(hostname: str) -> Optional[str]:
    """Resolve via the system resolver; returns None on failure."""
    start = time.perf_counter()
    try:
        ip = socket.gethostbyname(hostname)
        elapsed = time.perf_counter() - start
        print(f"System DNS: {hostname} -> {ip} ({elapsed*1000:.2f}ms)")
        return ip
    except socket.gaierror as e:
        print(f"DNS resolution failed: {e}")
        return None

def dns_query_dnspython(hostname: str) -> List[str]:
    """Resolve A records with the dnspython library; returns [] on failure."""
    start = time.perf_counter()
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        ips = [str(answer) for answer in answers]
        elapsed = time.perf_counter() - start
        print(f"dnspython DNS: {hostname} -> {ips} ({elapsed*1000:.2f}ms)")
        return ips
    except Exception as e:
        print(f"DNS resolution failed: {e}")
        return []

# Usage example
dns_query_system("www.example.com")
dns_query_dnspython("www.example.com")

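6.2.2 Recursive vs. Iterative Queries

Recursive query:

  • The client sends a recursive query to a DNS server
  • The DNS server is responsible for completing the whole lookup
  • The client simply waits for the final result

Iterative query:

  • The DNS server answers with a referral: the address of the next server to ask
  • The querier has to carry out the follow-up queries itself
  • Typically used by recursive resolvers when talking to root, TLD, and authoritative servers

Comparing the two query types:

python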
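import socket

class DNSQueryType:
    """DNS query types"""
    RECURSIVE = "recursive"  # recursive query
    ITERATIVE = "iterative"  # iterative query

def recursive_query(hostname: str) -> str:
    """Recursive query (from the client's point of view)."""
    # The client sends one query and waits for the final result
    return socket.gethostbyname(hostname)

def iterative_query(hostname: str) -> str:
    """Iterative query (outline of a manual implementation)."""
    # 1. Query a root DNS server
    # 2. Receive the TLD server address
    # 3. Query the TLD server
    # 4. Receive the authoritative server address
    # 5. Query the authoritative server
    # 6. Receive the IP address
    # (A real implementation is fairly involved; only the concept is shown here)
    raise NotImplementedError("conceptual outline only")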
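
To make the referral-following loop concrete, here is a toy model of iterative resolution that runs entirely in memory. Everything in it is hypothetical: `TOY_SERVERS` stands in for the root, TLD, and authoritative servers, and no real DNS traffic is sent. A real implementation would send DNS messages (for example with dnspython) and handle glue records, CNAMEs, timeouts, and multiple name servers.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory "servers": each maps a zone suffix either to a
# referral (the next server to ask) or to a final answer.
TOY_SERVERS: Dict[str, Dict[str, Tuple[str, str]]] = {
    "root": {"com.": ("referral", "tld-com")},
    "tld-com": {"example.com.": ("referral", "auth-example")},
    "auth-example": {"www.example.com.": ("answer", "192.0.2.1")},
}

def ask(server: str, qname: str) -> Tuple[str, str]:
    """Return the record for the longest suffix of qname that this toy server knows."""
    zone = TOY_SERVERS[server]
    matches = [s for s in zone if qname == s or qname.endswith("." + s)]
    if not matches:
        raise LookupError(f"{server} knows nothing about {qname}")
    return zone[max(matches, key=len)]

def iterative_resolve(hostname: str, start: str = "root", max_steps: int = 10) -> Tuple[str, List[str]]:
    """Follow referrals from the root until some server returns an answer."""
    qname = hostname.rstrip(".") + "."
    server = start
    path = [server]
    for _ in range(max_steps):
        kind, value = ask(server, qname)
        if kind == "answer":
            return value, path
        server = value  # referral: ask the next server down the hierarchy
        path.append(server)
    raise LookupError("too many referrals")

ip, path = iterative_resolve("www.example.com")
print(ip, path)
```

The returned `path` shows the same root, TLD, authoritative sequence described in the numbered steps.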

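6.2.3 Local Caches and System DNS

Local lookup layers:

python
class DNSCacheLevel:
    """Local DNS lookup layers"""
    APPLICATION = "application"  # application cache
    SYSTEM = "system"            # system DNS cache
    HOSTS_FILE = "hosts_file"    # hosts file (static overrides rather than a true cache)

# 1. Application cache (fastest)
app_cache = {}

# 2. System DNS cache (managed by the operating system)
# Windows: ipconfig /displaydns
# Linux: resolvectl statistics (systemd-resolve --statistics on older systemd)
# macOS: dscacheutil -q host -a name <hostname>

# 3. hosts file
# Windows: C:\Windows\System32\drivers\etc\hosts
# Linux/macOS: /etc/hosts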
Inspecting the system DNS cache:

bash
# Windows
ipconfig /displaydns

# Linux (systemd-resolved; older releases use: systemd-resolve --statistics)
resolvectl statistics

# Linux (nscd)
nscd -g

# macOS
dscacheutil -q host -a name www.example.com
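
Since the hosts file is consulted before any DNS server is asked, a crawler that uses its own resolver may want to honor it as well. The sketch below parses hosts-file text into a name-to-IP mapping. `parse_hosts` is a hypothetical helper, and it assumes that the first entry for a name wins, which is how resolvers commonly treat duplicates.

```python
from typing import Dict

def parse_hosts(text: str) -> Dict[str, str]:
    """Parse hosts-file content into {hostname: ip}; the first entry for a name wins."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name.lower(), ip)  # hostnames are case-insensitive
    return mapping

sample = """
127.0.0.1   localhost
# internal services
192.0.2.10  api.internal API   # trailing comment
"""
print(parse_hosts(sample))
```

To apply it to the real file, read `/etc/hosts` (or the Windows path listed above) and pass its contents to the function.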

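6.2.4 DNS Record Types

Common DNS record types:

| Record type | Description | Example |
|---|---|---|
| A | IPv4 address record | www.example.com -> 192.0.2.1 |
| AAAA | IPv6 address record | www.example.com -> 2001:db8::1 |
| CNAME | Alias (canonical name) record | www.example.com -> example.com |
| MX | Mail exchange record | example.com -> mail.example.com |
| TXT | Text record | Used for SPF, DKIM, etc. |
| NS | Name server record | example.com -> ns1.example.com |

Querying different DNS record types:

python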
import dns.exception
import dns.resolver

def query_dns_records(hostname: str) -> dict:
    """Query several record types; a type with no usable answer maps to None."""
    records = {}

    for rtype in ('A', 'AAAA', 'CNAME', 'MX'):
        try:
            answers = dns.resolver.resolve(hostname, rtype)
        except dns.exception.DNSException:
            # Covers NXDOMAIN, NoAnswer, timeouts, and unreachable name servers
            records[rtype] = None
            continue

        if rtype == 'MX':
            # MX records carry a preference value and a mail exchanger name
            records[rtype] = [(str(r.preference), str(r.exchange)) for r in answers]
        else:
            records[rtype] = [str(r) for r in answers]

    return records

# Usage example
records = query_dns_records("example.com")
print(records)

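6.3 DNS Caching in Depth

Caching is the key mechanism for speeding up DNS resolution. A well-designed cache strategy can improve crawler performance substantially.

6.3.1 What TTL Means and What It Does

What TTL (Time To Live) means:

  • TTL is the lifetime of a DNS record
  • It states how long, in seconds, the record may be kept in a cache
  • Once the TTL has elapsed, the cached record should be discarded

What TTL does:

  1. Balances performance against accuracy

    • Short TTL: more accurate, but more queries
    • Long TTL: fewer queries, but a stale IP may be used
  2. Controls how often caches refresh

    • Domain operators tune the TTL to control how quickly caches pick up changes
    • Records pointing at dynamic IPs usually carry a short TTL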
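
One common way to handle this trade-off in a crawler is to clamp the record's TTL into a range you are comfortable with. The sketch below assumes a hypothetical floor of 30 seconds and ceiling of one hour; both values are illustrative and should be tuned to how quickly the target sites change IPs.

```python
import time

def effective_ttl(record_ttl: int, min_ttl: int = 30, max_ttl: int = 3600) -> int:
    """Clamp the server-provided TTL into [min_ttl, max_ttl] seconds."""
    return max(min_ttl, min(record_ttl, max_ttl))

def expires_at(record_ttl: int, now: float) -> float:
    """Absolute expiry timestamp for a record cached at time `now`."""
    return now + effective_ttl(record_ttl)

print(effective_ttl(5))       # very short TTL is raised to the floor
print(effective_ttl(300))     # ordinary TTL is kept as-is
print(effective_ttl(86400))   # very long TTL is capped at the ceiling
print(expires_at(300, now=time.time()))
```

The floor avoids re-querying on every request for records with very short TTLs, and the ceiling bounds how long a stale IP can survive in the cache.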

Reading the TTL of a DNS record:

python
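from typing import Optional

import dns.resolver

def get_dns_ttl(hostname: str) -> Optional[int]:
    """Return the TTL of the A record set, or None if the lookup fails."""
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        # The answer object returned by dnspython exposes the TTL on its rrset
        return answers.rrset.ttl
    except Exception as e:
        print(f"Failed to get TTL: {e}")
        return None

# Usage example
ttl = get_dns_ttl("www.example.com")
if ttl is not None:  # guard: ttl/60 would fail on a failed lookup
    print(f"TTL: {ttl} seconds ({ttl/60:.1f} minutes)")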

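6.3.2 Designing a Cache Strategy

Key elements of a cache strategy:

  1. Cache storage structure

    • Use a dictionary mapping domain names to IPs
    • Record the time each entry was cached
    • Record the TTL value
  2. Expiry checks

    • Check on every lookup whether the entry has expired
    • If it has, query again and update the cache
  3. Cache size limit

    • Cap the number of entries
    • Evict with an LRU (least recently used) policy

A complete DNS cache implementation:

python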
import time
from collections import OrderedDict
from typing import Optional, List

class DNSCache:
    """DNS缓存实现"""
    
    def __init__(self, max_size: int = 1000, default_ttl: int = 300):
        self.cache = OrderedDict()  # {hostname: (ips, timestamp, ttl)}
        self.max_size = max_size
        self.default_ttl = default_ttl
    
    def get(self, hostname: str) -> Optional[List[str]]:
        """获取缓存的DNS记录"""
        if hostname not in self.cache:
            return None
        
        ips, timestamp, ttl = self.cache[hostname]
        
        # 检查是否过期
        age = time.time() - timestamp
        if age > ttl:
            # 缓存过期,删除
            del self.cache[hostname]
            return None
        
        # 更新访问顺序(LRU)
        self.cache.move_to_end(hostname)
        return ips
    
    def set(self, hostname: str, ips: List[str], ttl: Optional[int] = None):
        """设置DNS缓存"""
        if ttl is None:
            ttl = self.default_ttl
        
        # 如果缓存已满,删除最旧的条目
        if len(self.cache) >= self.max_size and hostname not in self.cache:
            self.cache.popitem(last=False)  # 删除最旧的
        
        self.cache[hostname] = (ips, time.time(), ttl)
        self.cache.move_to_end(hostname)  # 更新访问顺序
    
    def clear(self):
        """清空缓存"""
        self.cache.clear()
    
    def remove(self, hostname: str):
        """删除特定域名的缓存"""
        if hostname in self.cache:
            del self.cache[hostname]
    
    def cleanup_expired(self):
        """清理过期的缓存条目"""
        current_time = time.time()
        expired_hostnames = []
        
        for hostname, (ips, timestamp, ttl) in self.cache.items():
            if current_time - timestamp > ttl:
                expired_hostnames.append(hostname)
        
        for hostname in expired_hostnames:
            del self.cache[hostname]
        
        return len(expired_hostnames)
    
    def stats(self) -> dict:
        """获取缓存统计信息"""
        current_time = time.time()
        valid_count = 0
        expired_count = 0
        
        for hostname, (ips, timestamp, ttl) in self.cache.items():
            if current_time - timestamp > ttl:
                expired_count += 1
            else:
                valid_count += 1
        
        return {
            'total': len(self.cache),
            'valid': valid_count,
            'expired': expired_count,
            'max_size': self.max_size,
        }

# 使用示例
cache = DNSCache(max_size=100, default_ttl=300)

# 设置缓存
cache.set("www.example.com", ["192.0.2.1"], ttl=300)

# 获取缓存
ips = cache.get("www.example.com")
print(f"Cached IPs: {ips}")

# 清理过期缓存
expired_count = cache.cleanup_expired()
print(f"Cleaned up {expired_count} expired entries")

# 查看统计
stats = cache.stats()
print(f"Cache stats: {stats}")

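6.3.3 Handling Cache Invalidation

When a cache entry becomes invalid:

  1. TTL expiry

    • The record has outlived its TTL
    • It must be queried again
  2. Active invalidation

    • An IP change has been detected
    • The cache is cleared manually
  3. Error-driven invalidation

    • DNS queries fail
    • The possibly wrong cache entry is dropped

A smarter invalidation strategy:

python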
class SmartDNSCache(DNSCache):
    """智能DNS缓存(支持失效处理)"""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failure_count = {}  # {hostname: failure_count}
        self.max_failures = 3
    
    def mark_failure(self, hostname: str):
        """标记DNS查询失败"""
        self.failure_count[hostname] = self.failure_count.get(hostname, 0) + 1
        
        # 如果失败次数过多,清除缓存
        if self.failure_count[hostname] >= self.max_failures:
            self.remove(hostname)
            del self.failure_count[hostname]
    
    def mark_success(self, hostname: str):
        """标记DNS查询成功"""
        if hostname in self.failure_count:
            del self.failure_count[hostname]
    
    def get_with_fallback(self, hostname: str, resolver_func) -> Optional[List[str]]:
        """获取缓存,如果失效则重新查询"""
        # 先尝试从缓存获取
        ips = self.get(hostname)
        if ips:
            return ips
        
        # 缓存未命中或过期,重新查询
        try:
            ips = resolver_func(hostname)
            if ips:
                self.set(hostname, ips)
                self.mark_success(hostname)
            return ips
        except Exception as e:
            self.mark_failure(hostname)
            raise

# 使用示例
cache = SmartDNSCache()

def resolve_dns(hostname: str) -> List[str]:
    """DNS解析函数"""
    import socket
    ip = socket.gethostbyname(hostname)
    return [ip]

# 使用缓存和回退
ips = cache.get_with_fallback("www.example.com", resolve_dns)
print(f"Resolved IPs: {ips}")

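6.3.4 Multi-Level Cache Architecture

Multi-level cache design (flowchart):

  Application → L1: application cache → hit? → (on miss) L2: system DNS cache → hit? → (on miss) L3: DNS server query

Multi-level cache implementation:

python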
class MultiLevelDNSCache:
    """多级DNS缓存"""
    
    def __init__(self):
        self.l1_cache = DNSCache(max_size=100, default_ttl=60)   # L1: small application cache (short TTL)
        self.l2_cache = DNSCache(max_size=1000, default_ttl=300) # L2: larger in-process tier standing in for the system cache (long TTL)
    
    def get(self, hostname: str) -> Optional[List[str]]:
        """多级缓存查询"""
        # L1缓存查询
        ips = self.l1_cache.get(hostname)
        if ips:
            return ips
        
        # L2缓存查询
        ips = self.l2_cache.get(hostname)
        if ips:
            # 更新L1缓存
            self.l1_cache.set(hostname, ips)
            return ips
        
        return None
    
    def set(self, hostname: str, ips: List[str], ttl: Optional[int] = None):
        """设置多级缓存"""
        self.l1_cache.set(hostname, ips, ttl)
        self.l2_cache.set(hostname, ips, ttl)
    
    def clear_all(self):
        """清空所有缓存"""
        self.l1_cache.clear()
        self.l2_cache.clear()

# 使用示例
multi_cache = MultiLevelDNSCache()
multi_cache.set("www.example.com", ["192.0.2.1"])
ips = multi_cache.get("www.example.com")
print(f"Resolved from cache: {ips}")

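6.4 Implementing DNS over HTTPS/TLS

DNS over HTTPS (DoH) and DNS over TLS (DoT) are encrypted DNS transports that offer better privacy and security than plaintext DNS.

6.4.1 How DoH Works

How DoH works:

  1. HTTPS requests

    • DNS queries are sent as HTTPS requests, typically over HTTP/2
    • In the standard form (RFC 8484) the DNS message travels in binary wire format (application/dns-message), either as the body of a POST or base64url-encoded in the dns parameter of a GET
    • TLS encrypts the exchange
  2. JSON format

    • Some providers, including Cloudflare and Google, additionally offer a JSON API (name and type query parameters, JSON response)
    • The JSON API is a provider extension rather than part of RFC 8484, so support varies between services
  3. Common endpoints

    • Cloudflare: https://cloudflare-dns.com/dns-query
    • Google: https://dns.google/resolve
    • Quad9: https://dns.quad9.net/dns-query (RFC 8484 wire format; JSON support on this path is not guaranteed)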
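
For reference, the RFC 8484 GET form can be illustrated without any third-party library. The sketch below hand-builds a minimal DNS query in wire format and base64url-encodes it into the `dns` parameter. `build_dns_query` and `doh_get_url` are hypothetical helper names; only plain ASCII hostnames are handled, and the request itself is not sent.

```python
import base64
import struct

def build_dns_query(hostname: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query message (qtype 1 = A, 28 = AAAA)."""
    # Header: ID=0 (RFC 8484 suggests 0 for HTTP cache friendliness),
    # flags=0x0100 (recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question

def doh_get_url(endpoint: str, hostname: str, qtype: int = 1) -> str:
    """Return the RFC 8484 GET URL: base64url without padding in the dns parameter."""
    wire = build_dns_query(hostname, qtype)
    encoded = base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"

print(doh_get_url("https://cloudflare-dns.com/dns-query", "www.example.com"))
# → https://cloudflare-dns.com/dns-query?dns=AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
```

The encoded value matches the www.example.com example given in RFC 8484. A client would send this URL with the header `Accept: application/dns-message` and parse the binary response, for instance with dnspython's `dns.message.from_wire`.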

A DoH query using the JSON API:

python
import asyncio
from typing import List

import aiohttp

async def doh_query_json(hostname: str, doh_server: str = "https://cloudflare-dns.com/dns-query") -> List[str]:
    """Query A records over DoH using the JSON API."""
    params = {
        'name': hostname,
        'type': 'A',  # A record
    }
    headers = {
        'Accept': 'application/dns-json',
    }
    # aiohttp expects a ClientTimeout object rather than a bare number
    timeout = aiohttp.ClientTimeout(total=5)

    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(doh_server, params=params, headers=headers) as resp:
                if resp.status != 200:
                    print(f"DoH query failed with status {resp.status}")
                    return []
                # content_type=None skips aiohttp's JSON content-type check,
                # because the server may reply with application/dns-json
                data = await resp.json(content_type=None)
        except (aiohttp.ClientError, asyncio.TimeoutError, ValueError) as e:
            print(f"DoH query error: {e}")
            return []

    # Keep only A records (type 1); the answer chain may also contain CNAMEs
    return [a['data'] for a in data.get('Answer', []) if a.get('type') == 1]

# Usage example
async def main():
    ips = await doh_query_json("www.example.com")
    print(f"Resolved IPs: {ips}")

# asyncio.run(main())

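6.4.2 How DoT Works

How DoT works:

  1. TLS connection

    • A TLS connection is established on TCP port 853
    • The standard DNS protocol runs inside it (encrypted in transit)
  2. DNS over TLS

    • Uses the standard DNS message format
    • Messages are encrypted by TLS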
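
Inside the TLS stream, DoT frames messages the same way DNS over TCP does: each message is preceded by a two-byte, big-endian length field. The sketch below shows only that framing step. The helper names are hypothetical, and libraries such as dnspython handle this internally.

```python
import struct
from typing import List, Tuple

def frame_dns_message(wire: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte big-endian length."""
    if len(wire) > 0xFFFF:
        raise ValueError("DNS message too large for a 2-byte length prefix")
    return struct.pack("!H", len(wire)) + wire

def unframe_dns_messages(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Split a received byte stream into complete messages plus leftover bytes."""
    messages = []
    offset = 0
    while len(buffer) - offset >= 2:
        (length,) = struct.unpack_from("!H", buffer, offset)
        if len(buffer) - offset - 2 < length:
            break  # the rest of this message has not arrived yet
        messages.append(buffer[offset + 2 : offset + 2 + length])
        offset += 2 + length
    return messages, buffer[offset:]

stream = frame_dns_message(b"abc") + frame_dns_message(b"de")
print(unframe_dns_messages(stream))
# → ([b'abc', b'de'], b'')
```

Returning the leftover bytes matters because TLS delivers a byte stream, so one read may contain half a message or several messages at once.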

A DoT client (requires dnspython):

python
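import ssl
from typing import List

import dns.message
import dns.query
import dns.rdatatype

def dot_query(
    hostname: str,
    dot_server: str = "1.1.1.1",
    port: int = 853,
    server_hostname: str = "one.one.one.one",
    timeout: float = 5.0,
) -> List[str]:
    """Query A records over DoT.

    server_hostname is the name checked against the server certificate;
    "one.one.one.one" matches Cloudflare's 1.1.1.1 and must be changed
    when another dot_server is used.
    """
    # Build the DNS query message
    query = dns.message.make_query(hostname, dns.rdatatype.A)

    # TLS context with certificate and hostname verification enabled
    context = ssl.create_default_context()

    # Send the DoT query
    try:
        response = dns.query.tls(
            query,
            dot_server,
            timeout=timeout,
            port=port,
            ssl_context=context,
            server_hostname=server_hostname,
        )

        # Extract A records from the answer section
        ips = []
        for answer in response.answer:
            for rdata in answer:
                if rdata.rdtype == dns.rdatatype.A:
                    ips.append(str(rdata))

        return ips
    except Exception as e:
        print(f"DoT query error: {e}")
        return []

# Usage example
ips = dot_query("www.example.com")
print(f"Resolved IPs: {ips}")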

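6.4.3 Security Advantages of DoH/DoT

Security advantages:

  1. Encrypted transport

    • DNS query data is encrypted, which prevents eavesdropping
    • Query privacy is protected
  2. Protection against DNS hijacking

    • HTTPS/TLS guards the path between client and resolver against man-in-the-middle tampering
    • The server certificate is verified
  3. Avoiding DNS poisoning

    • Queries go to a trusted DoH/DoT server
    • Poisoned answers from the local network's DNS are bypassed

Comparison with traditional DNS:

python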
import asyncio
import socket
import time

def compare_dns_methods(hostname: str):
    """Compare DNS lookup methods (uses doh_query_json and dot_query defined earlier)."""
    
    methods = {
        'System DNS': lambda h: [socket.gethostbyname(h)],
        'DoH (Cloudflare)': lambda h: asyncio.run(doh_query_json(h)),
        'DoT (Cloudflare)': lambda h: dot_query(h, "1.1.1.1"),
    }
    
    results = {}
    for method_name, method_func in methods.items():
        try:
            start = time.time()
            ips = method_func(hostname)
            elapsed = time.time() - start
            results[method_name] = {
                'ips': ips,
                'time': elapsed * 1000,
                'success': True,
            }
        except Exception as e:
            results[method_name] = {
                'ips': None,
                'time': None,
                'success': False,
                'error': str(e),
            }
    
    return results

# 使用示例
results = compare_dns_methods("www.example.com")
for method, result in results.items():
    print(f"{method}: {result}")

6.4.4 A DoH Client Implementation

A complete DoH client:

python
import aiohttp
import asyncio
import time
from typing import List, Optional, Dict

class DoHClient:
    """DNS over HTTPS客户端"""
    
    def __init__(
        self,
        doh_servers: List[str] = None,
        timeout: float = 5.0,
        cache: Optional[DNSCache] = None,
    ):
        self.doh_servers = doh_servers or [
            "https://cloudflare-dns.com/dns-query",
            "https://dns.google/resolve",
            "https://dns.quad9.net/dns-query",
        ]
        self.timeout = timeout
        self.cache = cache
        self.session = None
    
    async def _get_session(self):
        """获取aiohttp会话(延迟创建)"""
        if self.session is None:
            self.session = aiohttp.ClientSession()
        return self.session
    
    async def query(
        self,
        hostname: str,
        record_type: str = 'A',
        use_cache: bool = True,
    ) -> List[str]:
        """查询DNS记录"""
        # Check the cache (keyed by name and record type so that A and AAAA answers do not collide)
        cache_key = f"{hostname}:{record_type}"
        if use_cache and self.cache:
            cached_ips = self.cache.get(cache_key)
            if cached_ips:
                return cached_ips
        
        # Try each DoH server in turn
        last_error = None
        for doh_server in self.doh_servers:
            try:
                ips = await self._query_server(hostname, record_type, doh_server)
                if ips:
                    # Update the cache
                    if use_cache and self.cache:
                        self.cache.set(cache_key, ips)
                    return ips
            except Exception as e:
                last_error = e
                continue
        
        # 所有服务器都失败
        if last_error:
            raise last_error
        return []
    
    async def _query_server(
        self,
        hostname: str,
        record_type: str,
        doh_server: str,
    ) -> List[str]:
        """查询单个DoH服务器"""
        session = await self._get_session()
        params = {
            'name': hostname,
            'type': record_type,
        }
        headers = {
            'Accept': 'application/dns-json',
        }
        
        async with session.get(
            doh_server,
            params=params,
            headers=headers,
            timeout=aiohttp.ClientTimeout(total=self.timeout),
        ) as resp:
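            if resp.status != 200:
                raise RuntimeError(f"DoH server returned status {resp.status}")
            
            # content_type=None skips aiohttp's JSON content-type check,
            # because some servers reply with application/dns-json
            data = await resp.json(content_type=None)
            
            # Keep only answers of the requested type (1 = A, 28 = AAAA);
            # CNAME entries in the answer chain are skipped
            wanted_type = {'A': 1, 'AAAA': 28}.get(record_type.upper(), 1)
            ips = []
            for answer in data.get('Answer', []):
                if answer.get('type') == wanted_type:
                    ips.append(answer['data'])
            
            return ips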
    
    async def close(self):
        """关闭客户端"""
        if self.session:
            await self.session.close()
            self.session = None

# 使用示例
async def main():
    cache = DNSCache()
    doh_client = DoHClient(cache=cache)
    
    # 查询DNS
    ips = await doh_client.query("www.example.com")
    print(f"Resolved IPs: {ips}")
    
    # 再次查询(使用缓存)
    ips = await doh_client.query("www.example.com")
    print(f"Cached IPs: {ips}")
    
    await doh_client.close()

# asyncio.run(main())

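6.5 Proxy Pool Architecture Design

The proxy pool is a core component of a crawler system: it manages, schedules, and maintains proxy resources.

6.5.1 Data Structure Design for the Proxy Pool

The core data structure for a single proxy:

python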
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import time

class ProxyType(Enum):
    """代理类型"""
    HTTP = "http"
    HTTPS = "https"
    SOCKS4 = "socks4"
    SOCKS5 = "socks5"

@dataclass
class Proxy:
    """代理对象"""
    host: str
    port: int
    proxy_type: ProxyType
    username: Optional[str] = None
    password: Optional[str] = None
    
    # 状态信息
    is_active: bool = True
    success_count: int = 0
    failure_count: int = 0
    last_check_time: Optional[float] = None
    last_success_time: Optional[float] = None
    response_time: float = 0.0  # 平均响应时间(秒)
    
    # 元数据
    location: Optional[str] = None  # 地理位置
    provider: Optional[str] = None   # 代理提供商
    
    def __str__(self) -> str:
        """代理字符串表示"""
        if self.username and self.password:
            return f"{self.proxy_type.value}://{self.username}:{self.password}@{self.host}:{self.port}"
        else:
            return f"{self.proxy_type.value}://{self.host}:{self.port}"
    
    def to_dict(self) -> dict:
        """转换为字典"""
        return {
            'host': self.host,
            'port': self.port,
            'type': self.proxy_type.value,
            'username': self.username,
            'password': self.password,
            'is_active': self.is_active,
            'success_count': self.success_count,
            'failure_count': self.failure_count,
            'last_check_time': self.last_check_time,
            'last_success_time': self.last_success_time,
            'response_time': self.response_time,
            'location': self.location,
            'provider': self.provider,
        }
    
    @property
    def success_rate(self) -> float:
        """成功率"""
        total = self.success_count + self.failure_count
        if total == 0:
            return 0.0
        return self.success_count / total
    
    @property
    def url(self) -> str:
        """代理URL"""
        return str(self)

# 使用示例
proxy = Proxy(
    host="proxy.example.com",
    port=8080,
    proxy_type=ProxyType.HTTP,
    username="user",
    password="pass",
)
print(f"Proxy URL: {proxy.url}")
print(f"Success rate: {proxy.success_rate:.2%}")

The data structure for the pool itself:

python
from collections import deque
from typing import Dict, List, Set
import threading

class ProxyPool:
    """代理池数据结构"""
    
    def __init__(self):
        # 主存储:{proxy_id: Proxy}
        self.proxies: Dict[str, Proxy] = {}
        
        # 活跃代理集合(快速查找)
        self.active_proxies: Set[str] = set()
        
        # 代理队列(用于轮询)
        self.proxy_queue: deque = deque()
        
        # 按类型分组
        self.proxies_by_type: Dict[ProxyType, List[str]] = {
            ProxyType.HTTP: [],
            ProxyType.HTTPS: [],
            ProxyType.SOCKS4: [],
            ProxyType.SOCKS5: [],
        }
        
        # 线程安全锁
        self.lock = threading.Lock()
    
    def add_proxy(self, proxy: Proxy, proxy_id: Optional[str] = None) -> str:
        """添加代理"""
        if proxy_id is None:
            proxy_id = f"{proxy.host}:{proxy.port}"
        
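        with self.lock:
            self.proxies[proxy_id] = proxy
            if proxy.is_active:
                self.active_proxies.add(proxy_id)
                # Avoid duplicate queue entries when the same proxy is added again
                if proxy_id not in self.proxy_queue:
                    self.proxy_queue.append(proxy_id)
            
            # Group by type (at most one entry per proxy_id)
            if proxy_id not in self.proxies_by_type[proxy.proxy_type]:
                self.proxies_by_type[proxy.proxy_type].append(proxy_id)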
        
        return proxy_id
    
    def remove_proxy(self, proxy_id: str):
        """删除代理"""
        with self.lock:
            if proxy_id in self.proxies:
                proxy = self.proxies[proxy_id]
                del self.proxies[proxy_id]
                self.active_proxies.discard(proxy_id)
                
                # 从队列中移除
                if proxy_id in self.proxy_queue:
                    self.proxy_queue.remove(proxy_id)
                
                # 从类型分组中移除
                if proxy_id in self.proxies_by_type[proxy.proxy_type]:
                    self.proxies_by_type[proxy.proxy_type].remove(proxy_id)
    
    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """获取代理"""
        return self.proxies.get(proxy_id)
    
    def get_active_proxies(self) -> List[Proxy]:
        """获取所有活跃代理"""
        with self.lock:
            return [self.proxies[pid] for pid in self.active_proxies if pid in self.proxies]
    
    def get_proxies_by_type(self, proxy_type: ProxyType) -> List[Proxy]:
        """按类型获取代理"""
        with self.lock:
            return [self.proxies[pid] for pid in self.proxies_by_type[proxy_type] if pid in self.proxies]

# 使用示例
pool = ProxyPool()

proxy1 = Proxy("proxy1.com", 8080, ProxyType.HTTP)
proxy2 = Proxy("proxy2.com", 8080, ProxyType.SOCKS5)

pool.add_proxy(proxy1)
pool.add_proxy(proxy2)

active = pool.get_active_proxies()
print(f"Active proxies: {len(active)}")

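6.5.2 Proxy Types (HTTP/HTTPS/SOCKS4/SOCKS5)

Proxy type comparison:

| Proxy type | Protocol layer | Supports HTTPS | Supports UDP | Authentication | Use case |
|---|---|---|---|---|---|
| HTTP | Application layer | Yes (via CONNECT tunnel) | No | Basic | Simple HTTP requests |
| HTTPS | Application layer | Yes | No | Basic | HTTPS requests |
| SOCKS4 | Session layer (above TCP) | Yes (relays any TCP) | No | None (user ID only) | Simple TCP connections |
| SOCKS5 | Session layer (above TCP/UDP) | Yes (relays any TCP) | Yes | Multiple methods | Complex network scenarios |

Characteristics of each type:

  1. HTTP/HTTPS proxies

    • Work at the application layer
    • Understand HTTP natively; an "HTTPS proxy" is one whose own client-to-proxy connection is TLS-encrypted
    • HTTPS targets are reached by opening a tunnel with the CONNECT method
  2. SOCKS4 proxies

    • Sit between the application and the transport layer
    • Support TCP connections
    • Do not support UDP or IPv6
  3. SOCKS5 proxies

    • Sit between the application and the transport layer
    • Support TCP and UDP
    • Support IPv4 and IPv6
    • Support multiple authentication methods

Proxy URL format:

python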
from urllib.parse import quote

def format_proxy_url(proxy: Proxy) -> str:
    """Format a proxy URL, percent-encoding the credentials."""
    # ProxyType values are already the URL schemes: http, https, socks4, socks5
    scheme = proxy.proxy_type.value

    if proxy.username and proxy.password:
        # Encode characters such as "@" or ":" so they cannot break the URL
        user = quote(proxy.username, safe="")
        password = quote(proxy.password, safe="")
        return f"{scheme}://{user}:{password}@{proxy.host}:{proxy.port}"
    return f"{scheme}://{proxy.host}:{proxy.port}"

# Usage example
proxy = Proxy("proxy.com", 8080, ProxyType.SOCKS5, "user", "pass")
url = format_proxy_url(proxy)
print(f"Proxy URL: {url}")
# Output: Proxy URL: socks5://user:pass@proxy.com:8080
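
Proxy lists from providers usually arrive as URL strings, so the reverse direction is useful too. The sketch below parses a proxy URL into its parts with the standard library. `parse_proxy_url` is a hypothetical helper that returns a plain dict (so the example stays self-contained) and assumes the port is always given explicitly.

```python
from typing import Dict, Optional, Union
from urllib.parse import unquote, urlsplit

SUPPORTED_SCHEMES = {"http", "https", "socks4", "socks5"}

def parse_proxy_url(url: str) -> Dict[str, Union[str, int, None]]:
    """Split a proxy URL into type, host, port, and optional credentials."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {scheme!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError("proxy URL must include host and port")
    username: Optional[str] = unquote(parts.username) if parts.username else None
    password: Optional[str] = unquote(parts.password) if parts.password else None
    return {
        "type": scheme,
        "host": parts.hostname,
        "port": parts.port,
        "username": username,  # percent-decoded, e.g. "%40" becomes "@"
        "password": password,
    }

print(parse_proxy_url("socks5://user:p%40ss@proxy.example.com:1080"))
```

The resulting fields map directly onto the `Proxy` dataclass shown earlier (`ProxyType(result["type"])` for the type).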

6.5.3 Proxy Pool Interface Design

The core interface of the proxy pool:

python
from abc import ABC, abstractmethod
from typing import Optional, List

class IProxyPool(ABC):
    """代理池接口"""
    
    @abstractmethod
    def add_proxy(self, proxy: Proxy) -> str:
        """添加代理"""
        pass
    
    @abstractmethod
    def remove_proxy(self, proxy_id: str):
        """删除代理"""
        pass
    
    @abstractmethod
    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """获取代理"""
        pass
    
    @abstractmethod
    def get_next_proxy(self, strategy: str = "round_robin") -> Optional[Proxy]:
        """获取下一个代理(根据策略)"""
        pass
    
    @abstractmethod
    def mark_success(self, proxy_id: str, response_time: float):
        """标记代理成功"""
        pass
    
    @abstractmethod
    def mark_failure(self, proxy_id: str):
        """标记代理失败"""
        pass
    
    @abstractmethod
    def get_stats(self) -> dict:
        """获取统计信息"""
        pass
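
As an illustration of what `mark_success` and `mark_failure` might do internally, the sketch below keeps a smoothed response time and deactivates a proxy after repeated failures. It is one possible design, not the chapter's implementation: `ProxyStats`, the exponential moving average with `alpha=0.3`, and the threshold of three consecutive failures are all assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class ProxyStats:
    """Hypothetical per-proxy counters, kept separate to stay self-contained."""
    success_count: int = 0
    failure_count: int = 0
    consecutive_failures: int = 0
    response_time: float = 0.0  # smoothed response time in seconds
    is_active: bool = True

def record_success(stats: ProxyStats, response_time: float, alpha: float = 0.3) -> None:
    """Update counters and fold the new sample into an exponential moving average."""
    if stats.success_count == 0:
        stats.response_time = response_time  # first sample seeds the average
    else:
        stats.response_time = alpha * response_time + (1 - alpha) * stats.response_time
    stats.success_count += 1
    stats.consecutive_failures = 0

def record_failure(stats: ProxyStats, max_consecutive_failures: int = 3) -> None:
    """Count the failure and deactivate the proxy after too many in a row."""
    stats.failure_count += 1
    stats.consecutive_failures += 1
    if stats.consecutive_failures >= max_consecutive_failures:
        stats.is_active = False

stats = ProxyStats()
record_success(stats, 1.0)
record_success(stats, 2.0, alpha=0.5)
print(stats.response_time)
```

An exponential moving average reacts to recent slowdowns without storing every sample, and counting consecutive rather than total failures keeps a long-lived proxy from being penalized for old errors.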

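6.5.4 Proxy Pool Architecture Diagram

Overall architecture of the proxy pool (diagram):

  • Crawler application
    • Proxy pool manager
      • Proxy storage: active proxy queue, proxy metadata
      • Health checker: scheduled check tasks, health status assessment
      • Load balancer: round robin, random, weighted, least connections
  • Monitoring dashboard: statistics, real-time status
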

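6.6 Proxy Health Checking

Health checking is the key mechanism for keeping the quality of a proxy pool high.

6.6.1 Design Principles for Health Checks

Design principles:

  1. Non-intrusive

    • Health checks should not interfere with normal use
    • Run them as independent check tasks
  2. Frequency control

    • Avoid checking too often
    • Otherwise the proxy server may ban you
  3. Multi-dimensional assessment

    • Response time
    • Success rate
    • Availability
  4. Asynchronous execution

    • Health checks should run asynchronously
    • They must not block the main flow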
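6.6.2 Implementing Health Checks

Health check methods:

  1. HTTP request test

    • Send an HTTP request to a test URL
    • Check the response status code
    • Measure the response time
  2. Connection test

    • Test the TCP connection
    • Measure how long the connection takes to establish
  3. Real request test

    • Send real requests through the proxy
    • Record the success rate and response time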
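
The connection test (method 2) is the cheapest of the three and can be sketched with asyncio alone. The function below only verifies that the proxy's port accepts TCP connections and measures the connect time. It does not prove that the proxy actually forwards traffic, so it is best treated as a quick pre-filter. `tcp_connect_check` and its result keys are hypothetical names.

```python
import asyncio
import time
from typing import Any, Dict

async def tcp_connect_check(host: str, port: int, timeout: float = 3.0) -> Dict[str, Any]:
    """Try to open a TCP connection and report reachability and connect time."""
    start = time.perf_counter()
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except (OSError, asyncio.TimeoutError) as exc:
        # Refused, unreachable, DNS failure, or timed out
        return {"reachable": False, "connect_time": None, "error": type(exc).__name__}
    elapsed = time.perf_counter() - start
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        pass  # the peer may already have dropped the connection
    return {"reachable": True, "connect_time": elapsed, "error": None}

# Usage (hypothetical proxy address):
# result = asyncio.run(tcp_connect_check("proxy.example.com", 8080))
# print(result)
```

Proxies that fail this check can be skipped before the more expensive HTTP request test is attempted.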

A complete health check implementation:

python
import asyncio
import aiohttp
import time
from typing import List, Optional

class ProxyHealthChecker:
    """代理健康检查器"""
    
    def __init__(
        self,
        test_url: str = "http://httpbin.org/ip",
        timeout: float = 5.0,
        max_concurrent: int = 10,
    ):
        self.test_url = test_url
        self.timeout = timeout
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def check_proxy(self, proxy: Proxy) -> dict:
        """检查单个代理"""
        async with self.semaphore:
            start_time = time.time()
            result = {
                'proxy_id': f"{proxy.host}:{proxy.port}",
                'success': False,
                'response_time': None,
                'error': None,
            }
            
            try:
                # Build the proxy URL
                proxy_url = proxy.url
                
                # Create the connector (SOCKS proxies need aiohttp-socks)
                connector = None
                if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                    try:
                        from aiohttp_socks import ProxyConnector
                        connector = ProxyConnector.from_url(proxy_url)
                    except ImportError:
                        result['error'] = "aiohttp-socks not installed"
                        return result
                else:
                    connector = aiohttp.TCPConnector()
                
                # Send the test request
                timeout = aiohttp.ClientTimeout(total=self.timeout)
                async with aiohttp.ClientSession(connector=connector) as session:
                    async with session.get(
                        self.test_url,
                        proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                        timeout=timeout,
                    ) as resp:
                        if resp.status == 200:
                            result['success'] = True
                            result['response_time'] = time.time() - start_time
                        else:
                            result['error'] = f"HTTP {resp.status}"
                
                if connector:
                    await connector.close()
                    
            except asyncio.TimeoutError:
                result['error'] = "Timeout"
            except Exception as e:
                result['error'] = str(e)
            
            return result
    
    async def check_proxies(self, proxies: List[Proxy]) -> List[dict]:
        """Check a batch of proxies concurrently."""
        tasks = [self.check_proxy(proxy) for proxy in proxies]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Convert raised exceptions into failed results
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        
        return processed_results

# Usage example
async def main():
    checker = ProxyHealthChecker()
    
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    result = await checker.check_proxy(proxy)
    print(f"Health check result: {result}")

# asyncio.run(main())

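6.6.3 Controlling Health Check Frequency

Frequency control strategies:

  1. Time-interval based

    • Check at a fixed interval
    • For example, once every 5 minutes
  2. Usage based

    • Check heavily used proxies more often
    • Check rarely used proxies less often
  3. Failure-rate based

    • Check proxies with a high failure rate more often
    • Check stable proxies less often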

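The usage-based strategy (strategy 2) can be sketched as a small pure function. The formula, the `busy_threshold` parameter, and the idea of counting requests over the last hour are all assumptions made for illustration; the chapter does not prescribe them.

```python
def usage_based_interval(
    requests_last_hour: int,
    base_interval: float = 300.0,   # 5 minutes
    min_interval: float = 60.0,     # 1 minute
    max_interval: float = 3600.0,   # 1 hour
    busy_threshold: int = 100,      # hypothetical: usage that maps to base_interval
) -> float:
    """Scale the check interval inversely with recent usage, clamped to bounds."""
    if requests_last_hour <= 0:
        # Idle proxies are checked as rarely as allowed
        return max_interval
    interval = base_interval * busy_threshold / requests_last_hour
    return max(min_interval, min(max_interval, interval))


print(usage_based_interval(100))   # a proxy at the threshold gets the base interval
print(usage_based_interval(1000))  # a very busy proxy is clamped to the minimum
```

Clamping keeps a burst of traffic from scheduling checks so often that the proxy provider sees them as abuse.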
Adaptive frequency control:

python
class SmartHealthChecker(ProxyHealthChecker):
    """Health checker with adaptive check frequency."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.check_intervals = {}  # {proxy_id: next_check_time}
        self.base_interval = 300  # base interval (5 minutes)
        self.min_interval = 60    # minimum interval (1 minute)
        self.max_interval = 3600  # maximum interval (1 hour)

    def get_check_interval(self, proxy: Proxy) -> float:
        """Compute the check interval for a proxy."""
        # Adjust the interval based on the success rate
        success_rate = proxy.success_rate

        if success_rate < 0.5:
            # Low success rate: check frequently
            interval = self.min_interval
        elif success_rate < 0.8:
            # Medium success rate
            interval = self.base_interval
        else:
            # High success rate: check less often
            interval = self.max_interval

        return interval

    def should_check(self, proxy: Proxy) -> bool:
        """Decide whether the proxy is due for a check."""
        proxy_id = f"{proxy.host}:{proxy.port}"
        current_time = time.time()

        if proxy_id not in self.check_intervals:
            return True

        next_check_time = self.check_intervals[proxy_id]
        return current_time >= next_check_time

    def update_check_time(self, proxy: Proxy):
        """Schedule the next check time."""
        proxy_id = f"{proxy.host}:{proxy.port}"
        interval = self.get_check_interval(proxy)
        self.check_intervals[proxy_id] = time.time() + interval

    async def check_if_needed(self, proxy: Proxy) -> Optional[dict]:
        """Check the proxy only if it is due; return None otherwise."""
        if not self.should_check(proxy):
            return None

        result = await self.check_proxy(proxy)
        self.update_check_time(proxy)
        return result

# Usage example
async def main():
    checker = SmartHealthChecker()

    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    proxy.success_count = 10
    proxy.failure_count = 2

    # check_if_needed already calls should_check and returns None when not due
    result = await checker.check_if_needed(proxy)
    print(f"Check result: {result}")

# asyncio.run(main())

6.6.4 Health Status Evaluation Criteria

Evaluating health status:

python
import time

class ProxyHealthEvaluator:
    """Evaluates the health status of a proxy."""

    @staticmethod
    def evaluate(proxy: Proxy) -> str:
        """Evaluate the health status of a proxy."""
        # Compute a health score (0-100)
        score = 0

        # Success rate (up to 40 points)
        success_rate = proxy.success_rate
        score += success_rate * 40

        # Response time (up to 30 points)
        if proxy.response_time > 0:
            if proxy.response_time < 1.0:
                score += 30
            elif proxy.response_time < 3.0:
                score += 20
            elif proxy.response_time < 5.0:
                score += 10

        # Active status (20 points)
        if proxy.is_active:
            score += 20

        # Time since the last success (up to 10 points)
        if proxy.last_success_time:
            time_since_success = time.time() - proxy.last_success_time
            if time_since_success < 300:  # succeeded within 5 minutes
                score += 10
            elif time_since_success < 1800:  # succeeded within 30 minutes
                score += 5

        # Map the score to a health level
        if score >= 80:
            return "excellent"
        elif score >= 60:
            return "good"
        elif score >= 40:
            return "fair"
        else:
            return "poor"

    @staticmethod
    def should_remove(proxy: Proxy) -> bool:
        """Decide whether the proxy should be removed."""
        # Failure rate too high
        if proxy.failure_count > 10 and proxy.success_rate < 0.2:
            return True

        # No success for a long time
        if proxy.last_success_time:
            time_since_success = time.time() - proxy.last_success_time
            if time_since_success > 3600:  # no success for 1 hour
                return True

        return False

# Usage example
evaluator = ProxyHealthEvaluator()

proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
proxy.success_count = 8
proxy.failure_count = 2
proxy.response_time = 0.5
proxy.is_active = True
proxy.last_success_time = time.time()

health = evaluator.evaluate(proxy)
print(f"Proxy health: {health}")

should_remove = evaluator.should_remove(proxy)
print(f"Should remove: {should_remove}")

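6.7 Load Balancing Algorithms

The load balancing algorithm decides how a proxy is picked from the pool. Different algorithms suit different scenarios.

6.7.1 Round Robin

How round robin works:

  • Proxies are selected one after another in order
  • Requests are distributed fairly
  • Simple to implement

Implementation:

python
import threading
from typing import Optional

class RoundRobinBalancer:
    """Round-robin load balancer."""

    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
        self.current_index = 0
        self.lock = threading.Lock()

    def get_next_proxy(self) -> Optional[Proxy]:
        """Get the next proxy."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None

        with self.lock:
            # The active list can shrink between calls, so wrap the index first
            self.current_index %= len(active_proxies)
            proxy = active_proxies[self.current_index]
            self.current_index = (self.current_index + 1) % len(active_proxies)
            return proxy

# Usage example
pool = ProxyPool()
# ... add proxies ...
balancer = RoundRobinBalancer(pool)

for _ in range(10):
    proxy = balancer.get_next_proxy()
    if proxy:  # None when the pool has no active proxies
        print(f"Selected proxy: {proxy.host}")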

6.7.2 Random

How random selection works:

  • A proxy is picked at random
  • Avoids hot-spot proxies
  • Suits pools where proxy quality is similar

Implementation:

python
import random

class RandomBalancer:
    """Random load balancer."""

    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool

    def get_next_proxy(self) -> Optional[Proxy]:
        """Get a random proxy."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        return random.choice(active_proxies)

6.7.3 Weighted

How weighted selection works:

  • Each proxy gets a weight based on its quality
  • Better proxies are more likely to be selected
  • Suits pools where proxy quality varies widely

Implementation:

python
import random

class WeightedBalancer:
    """Weighted load balancer."""

    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool

    def calculate_weight(self, proxy: Proxy) -> float:
        """Compute the weight of a proxy."""
        # Based on success rate and response time;
        # proxies with no measured response time are treated as 5 seconds
        success_rate = proxy.success_rate
        response_time = proxy.response_time if proxy.response_time > 0 else 5.0

        # weight = success_rate * (1 / response_time) * 100
        weight = success_rate * (1.0 / response_time) * 100
        return max(weight, 0.1)  # minimum weight of 0.1

    def get_next_proxy(self) -> Optional[Proxy]:
        """Get a proxy by weighted random selection."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None

        # Compute the weights
        weights = [self.calculate_weight(p) for p in active_proxies]
        total_weight = sum(weights)

        if total_weight == 0:
            return random.choice(active_proxies)

        # Weighted random selection over the cumulative weights
        r = random.uniform(0, total_weight)
        cumulative = 0
        for proxy, weight in zip(active_proxies, weights):
            cumulative += weight
            if r <= cumulative:
                return proxy

        return active_proxies[-1]  # fallback for floating-point rounding

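The cumulative-weight loop above can also be expressed with the standard library's `random.choices`, which performs the same weighted draw. A minimal sketch, where `pick_weighted` is a hypothetical helper and the optional `rng` argument exists only to make the draw reproducible:

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def pick_weighted(
    items: Sequence[T],
    weights: Sequence[float],
    rng: Optional[random.Random] = None,
) -> Optional[T]:
    """Pick one item with probability proportional to its weight."""
    if not items:
        return None
    source = rng if rng is not None else random
    # random.choices raises ValueError if all weights are zero,
    # so callers should keep a positive minimum weight
    return source.choices(items, weights=weights, k=1)[0]


# A proxy with weight 9 should be chosen about 90% of the time
rng = random.Random(42)
draws = [pick_weighted(["fast", "slow"], [9.0, 1.0], rng) for _ in range(10_000)]
print(draws.count("fast") / len(draws))
```

Passing a seeded `random.Random` instance is useful in tests; production code can omit it and fall back to the module-level generator.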
6.7.4 Least Connections

How least connections works:

  • The proxy with the fewest current connections is selected
  • Balances the load across proxies
  • Suits long-lived connections

Implementation:

python
import threading

class LeastConnectionsBalancer:
    """Least-connections load balancer."""

    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
        self.connection_count = {}  # {proxy_id: count}
        self.lock = threading.Lock()

    def get_next_proxy(self) -> Optional[Proxy]:
        """Get the proxy with the fewest connections."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None

        # Find the proxy with the fewest connections
        min_connections = float('inf')
        selected_proxy = None

        with self.lock:
            for proxy in active_proxies:
                proxy_id = f"{proxy.host}:{proxy.port}"
                count = self.connection_count.get(proxy_id, 0)
                if count < min_connections:
                    min_connections = count
                    selected_proxy = proxy

        return selected_proxy

    def increment_connections(self, proxy: Proxy):
        """Increase the connection count (call when a request starts)."""
        proxy_id = f"{proxy.host}:{proxy.port}"
        with self.lock:
            self.connection_count[proxy_id] = self.connection_count.get(proxy_id, 0) + 1

    def decrement_connections(self, proxy: Proxy):
        """Decrease the connection count (call when a request finishes)."""
        proxy_id = f"{proxy.host}:{proxy.port}"
        with self.lock:
            count = self.connection_count.get(proxy_id, 0)
            if count > 0:
                self.connection_count[proxy_id] = count - 1

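Least connections only works if every request increments the count when it starts and decrements it when it finishes, including on errors. One way to guarantee the pairing is a context manager. The sketch below is an assumption-laden illustration: `ConnectionTracker` is a hypothetical helper that works on proxy ID strings rather than the chapter's `Proxy` objects.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, List


class ConnectionTracker:
    """Hypothetical helper: counts in-flight requests per proxy ID."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}
        self._lock = threading.Lock()

    def count(self, proxy_id: str) -> int:
        with self._lock:
            return self._counts.get(proxy_id, 0)

    @contextmanager
    def acquire(self, proxy_ids: List[str]) -> Iterator[str]:
        """Select the least-loaded proxy and hold one connection slot on it."""
        if not proxy_ids:
            raise LookupError("no proxies available")
        with self._lock:
            # Selection and increment happen under one lock, so two callers
            # cannot both pick the same "least loaded" proxy by accident
            proxy_id = min(proxy_ids, key=lambda pid: self._counts.get(pid, 0))
            self._counts[proxy_id] = self._counts.get(proxy_id, 0) + 1
        try:
            yield proxy_id
        finally:
            # Always release the slot, even if the request raised
            with self._lock:
                self._counts[proxy_id] -= 1


tracker = ConnectionTracker()
with tracker.acquire(["a.example:8080", "b.example:8080"]) as proxy_id:
    print(f"sending request via {proxy_id}")
```

Doing the selection and the increment in the same critical section also closes the gap between `get_next_proxy` and `increment_connections` that exists when they are separate calls.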
6.7.5 Comparing Algorithm Performance

Performance comparison test:

python
import time

def compare_balancers(proxy_pool: ProxyPool, iterations: int = 1000):
    """Compare the load balancing algorithms."""
    balancers = {
        'Round Robin': RoundRobinBalancer(proxy_pool),
        'Random': RandomBalancer(proxy_pool),
        'Weighted': WeightedBalancer(proxy_pool),
        'Least Connections': LeastConnectionsBalancer(proxy_pool),
    }

    results = {}
    for name, balancer in balancers.items():
        start = time.perf_counter()
        selection_count = {}

        for _ in range(iterations):
            proxy = balancer.get_next_proxy()
            if proxy:
                proxy_id = f"{proxy.host}:{proxy.port}"
                selection_count[proxy_id] = selection_count.get(proxy_id, 0) + 1
                # Least Connections only rotates if connections are reported;
                # without this it would return the same proxy every time
                if hasattr(balancer, 'increment_connections'):
                    balancer.increment_connections(proxy)

        elapsed = time.perf_counter() - start

        # Measure how even the selection distribution is (standard deviation)
        counts = list(selection_count.values())
        if counts:
            mean = sum(counts) / len(counts)
            variance = sum((x - mean) ** 2 for x in counts) / len(counts)
            std_dev = variance ** 0.5
        else:
            std_dev = 0

        results[name] = {
            'time': elapsed * 1000,  # milliseconds
            'std_dev': std_dev,
            'distribution': selection_count,
        }

    return results

# Usage example
# results = compare_balancers(proxy_pool, 1000)
# for name, result in results.items():
#     print(f"{name}: {result['time']:.2f}ms, std_dev: {result['std_dev']:.2f}")

6.8 Toolchain: Using DNS and Proxy Tools

6.8.1 DNS Queries with dnspython

Install dnspython:

bash
pip install dnspython

Basic usage:

python
from typing import List

import dns.resolver

def query_dns(hostname: str, record_type: str = 'A') -> List[str]:
    """Query DNS with dnspython."""
    try:
        answers = dns.resolver.resolve(hostname, record_type)
        return [str(answer) for answer in answers]
    except Exception as e:
        print(f"DNS query failed: {e}")
        return []

# Usage example
ips = query_dns("www.example.com")
print(f"IPs: {ips}")

6.8.2 DNS Queries over a DoH Service

Using Cloudflare DoH:

python
async def query_doh_cloudflare(hostname: str) -> List[str]:
    """Query through Cloudflare DoH."""
    doh_url = "https://cloudflare-dns.com/dns-query"
    return await doh_query_json(hostname, doh_url)

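The parsing step behind a JSON DoH query can be sketched separately from the network call. This assumes the JSON response format used by the Cloudflare and Google resolvers (a `Status` code plus an `Answer` list whose entries carry `type` and `data`); `parse_doh_json` is a hypothetical helper, not the chapter's `doh_query_json`. Note that Cloudflare serves this format only when the request carries an `accept: application/dns-json` header.

```python
import json
from typing import List

A_RECORD_TYPE = 1  # numeric DNS type code for A records


def parse_doh_json(body: str) -> List[str]:
    """Extract IPv4 addresses from a DoH JSON response body."""
    payload = json.loads(body)
    if payload.get("Status") != 0:  # 0 means NOERROR
        return []
    return [
        answer["data"]
        for answer in payload.get("Answer", [])
        # Skip CNAME and other records that appear in the answer chain
        if answer.get("type") == A_RECORD_TYPE
    ]


sample = json.dumps({
    "Status": 0,
    "Answer": [
        {"name": "www.example.com", "type": 5, "TTL": 300, "data": "example.com."},
        {"name": "example.com", "type": 1, "TTL": 300, "data": "192.0.2.10"},
    ],
})
print(parse_doh_json(sample))
```

Each answer also carries a `TTL` field, which is the natural value to feed into the DNS cache described in section 6.3.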
6.8.3 Configuring Proxies in httpx and aiohttp

Configuring a proxy in httpx:

python
import httpx

# HTTP proxy (httpx 0.26+ takes `proxy=`; older releases used `proxies=`)
proxy_url = "http://proxy.example.com:8080"
http_client = httpx.Client(proxy=proxy_url)

# SOCKS5 proxy (requires httpx[socks])
proxy_url = "socks5://proxy.example.com:1080"
socks_client = httpx.Client(proxy=proxy_url)

# Send a request through the proxy
response = http_client.get("https://httpbin.org/ip")

Configuring a proxy in aiohttp:

python
import asyncio
import aiohttp

async def main():
    # HTTP proxy
    proxy_url = "http://proxy.example.com:8080"
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.org/ip", proxy=proxy_url) as resp:
            print(await resp.text())

# asyncio.run(main())

6.8.4 Managing Proxy Pool Data with Redis

Redis-backed proxy pool storage:

python
import redis
import json
from typing import List, Optional

class RedisProxyPool:
    """Redis-backed proxy pool."""

    def __init__(self, redis_host: str = "localhost", redis_port: int = 6379):
        self.redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)
        self.proxy_key_prefix = "proxy:"
        self.active_set_key = "proxies:active"

    def add_proxy(self, proxy: Proxy, proxy_id: Optional[str] = None) -> str:
        """Add a proxy to Redis."""
        if proxy_id is None:
            proxy_id = f"{proxy.host}:{proxy.port}"

        key = f"{self.proxy_key_prefix}{proxy_id}"
        # redis-py rejects None and bool values, so drop None fields
        # and store everything else as strings
        mapping = {k: str(v) for k, v in proxy.to_dict().items() if v is not None}
        self.redis_client.hset(key, mapping=mapping)

        if proxy.is_active:
            self.redis_client.sadd(self.active_set_key, proxy_id)

        return proxy_id

    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """Load a proxy from Redis."""
        key = f"{self.proxy_key_prefix}{proxy_id}"
        data = self.redis_client.hgetall(key)

        if not data:
            return None

        return Proxy(
            host=data['host'],
            port=int(data['port']),
            proxy_type=ProxyType(data['type']),
            username=data.get('username'),
            password=data.get('password'),
            is_active=data.get('is_active', 'True') == 'True',
            success_count=int(data.get('success_count', 0)),
            failure_count=int(data.get('failure_count', 0)),
        )

    def get_active_proxies(self) -> List[str]:
        """Return the IDs of all active proxies."""
        return list(self.redis_client.smembers(self.active_set_key))

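Redis hashes store flat string values, so proxy fields have to be converted on the way in and parsed on the way out. The round trip can be sketched without a Redis server. The field names below mirror the ones read in `get_proxy`; the two helper functions are hypothetical and exist only for illustration.

```python
from typing import Any, Dict


def to_redis_mapping(fields: Dict[str, Any]) -> Dict[str, str]:
    """Flatten a dict into string values, dropping fields that are None."""
    return {key: str(value) for key, value in fields.items() if value is not None}


def from_redis_mapping(data: Dict[str, str]) -> Dict[str, Any]:
    """Parse the string values read back from a Redis hash."""
    return {
        "host": data["host"],
        "port": int(data["port"]),
        "username": data.get("username"),  # missing key comes back as None
        "is_active": data.get("is_active", "True") == "True",
        "success_count": int(data.get("success_count", 0)),
    }


stored = to_redis_mapping({
    "host": "proxy.example.com",
    "port": 8080,
    "username": None,
    "is_active": False,
    "success_count": 3,
})
print(stored)
print(from_redis_mapping(stored))
```

Dropping `None` fields on write and using `.get()` on read keeps the two directions symmetric: an absent hash field and an unset attribute mean the same thing.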
6.8.5 SOCKS Proxy Support with aiohttp-socks

Install aiohttp-socks:

bash
pip install aiohttp-socks

Using a SOCKS proxy:

python
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def main():
    # SOCKS5 proxy
    proxy_url = "socks5://proxy.example.com:1080"
    connector = ProxyConnector.from_url(proxy_url)

    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://httpbin.org/ip") as resp:
            print(await resp.text())

# asyncio.run(main())

6.9 Code Reference: Complete Implementations

6.9.1 Custom DNS Resolver (with Caching and DoH)

A complete DNS resolver:

python
import asyncio
import socket
import aiohttp
import time
from typing import Optional, List

class AdvancedDNSResolver:
    """Advanced DNS resolver with caching and DoH support."""

    def __init__(
        self,
        use_doh: bool = True,
        doh_servers: Optional[List[str]] = None,
        cache: Optional[DNSCache] = None,
        fallback_to_system: bool = True,
    ):
        self.use_doh = use_doh
        self.doh_servers = doh_servers or [
            "https://cloudflare-dns.com/dns-query",
            "https://dns.google/resolve",
        ]
        self.cache = cache or DNSCache()
        self.fallback_to_system = fallback_to_system
        self.doh_client = DoHClient(doh_servers=self.doh_servers, cache=self.cache) if use_doh else None

    async def resolve(self, hostname: str) -> List[str]:
        """Resolve a hostname to a list of IP addresses."""
        # Check the cache first
        cached_ips = self.cache.get(hostname)
        if cached_ips:
            return cached_ips

        # Query over DoH
        if self.use_doh and self.doh_client:
            try:
                ips = await self.doh_client.query(hostname)
                if ips:
                    return ips
            except Exception as e:
                print(f"DoH query failed: {e}")

        # Fall back to the system resolver
        if self.fallback_to_system:
            try:
                # loop.getaddrinfo runs in an executor, so the event loop is
                # not blocked the way a direct socket.gethostbyname call would
                loop = asyncio.get_running_loop()
                infos = await loop.getaddrinfo(
                    hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
                )
                ips = sorted({info[4][0] for info in infos})
                self.cache.set(hostname, ips)
                return ips
            except Exception as e:
                print(f"System DNS failed: {e}")

        return []

    async def close(self):
        """Close the resolver."""
        if self.doh_client:
            await self.doh_client.close()

# Usage example
async def main():
    resolver = AdvancedDNSResolver(use_doh=True)
    ips = await resolver.resolve("www.example.com")
    print(f"Resolved IPs: {ips}")
    await resolver.close()

# asyncio.run(main())

6.9.2 Complete Proxy Pool Manager Class

A complete proxy pool manager:

python
import asyncio
import threading
import time
from typing import Optional, List, Dict
from collections import deque

class ProxyPoolManager:
    """Proxy pool manager (full version)."""

    def __init__(
        self,
        health_checker: Optional[ProxyHealthChecker] = None,
        balancer_type: str = "round_robin",
        health_check_interval: float = 300.0,
    ):
        self.pool = ProxyPool()
        self.health_checker = health_checker or ProxyHealthChecker()
        self.balancer_type = balancer_type
        self.health_check_interval = health_check_interval

        # Load balancer
        self.balancer = self._create_balancer()

        # Health check task
        self.health_check_task = None
        self.running = False

        # Statistics
        self.stats = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'proxy_rotations': 0,
        }

    def _create_balancer(self):
        """Create the load balancer (falls back to round robin)."""
        if self.balancer_type == "round_robin":
            return RoundRobinBalancer(self.pool)
        elif self.balancer_type == "random":
            return RandomBalancer(self.pool)
        elif self.balancer_type == "weighted":
            return WeightedBalancer(self.pool)
        elif self.balancer_type == "least_connections":
            return LeastConnectionsBalancer(self.pool)
        else:
            return RoundRobinBalancer(self.pool)

    def add_proxy(self, proxy: Proxy) -> str:
        """Add a proxy."""
        return self.pool.add_proxy(proxy)

    def remove_proxy(self, proxy_id: str):
        """Remove a proxy."""
        self.pool.remove_proxy(proxy_id)

    def get_next_proxy(self) -> Optional[Proxy]:
        """Get the next proxy."""
        proxy = self.balancer.get_next_proxy()
        if proxy:
            self.stats['proxy_rotations'] += 1
        return proxy

    def mark_success(self, proxy: Proxy, response_time: float):
        """Record a successful request."""
        proxy.success_count += 1
        proxy.last_success_time = time.time()
        proxy.last_check_time = time.time()

        # Update the running average over successful requests only;
        # failed requests contribute no response time
        proxy.response_time = (
            (proxy.response_time * (proxy.success_count - 1) + response_time)
            / proxy.success_count
        )

        self.stats['successful_requests'] += 1
        self.stats['total_requests'] += 1

    def mark_failure(self, proxy: Proxy):
        """Record a failed request."""
        proxy.failure_count += 1
        proxy.last_check_time = time.time()

        # Deactivate the proxy if its failure rate is too high
        if proxy.success_rate < 0.2 and proxy.failure_count > 10:
            proxy.is_active = False
            self.pool.active_proxies.discard(f"{proxy.host}:{proxy.port}")

        self.stats['failed_requests'] += 1
        self.stats['total_requests'] += 1

    async def health_check_loop(self):
        """Health check loop."""
        while self.running:
            try:
                # Get all active proxies
                active_proxies = self.pool.get_active_proxies()

                if active_proxies:
                    # Check them as a batch
                    results = await self.health_checker.check_proxies(active_proxies)

                    # Update proxy state
                    for proxy, result in zip(active_proxies, results):
                        proxy_id = f"{proxy.host}:{proxy.port}"
                        if result['success']:
                            proxy.is_active = True
                            proxy.response_time = result.get('response_time', proxy.response_time)
                            self.pool.active_proxies.add(proxy_id)
                        else:
                            # Count the failed check, then deactivate the
                            # proxy once it has failed too many times
                            proxy.failure_count += 1
                            if proxy.failure_count > 5:
                                proxy.is_active = False
                                self.pool.active_proxies.discard(proxy_id)

                await asyncio.sleep(self.health_check_interval)
            except Exception as e:
                print(f"Health check error: {e}")
                await asyncio.sleep(60)
    
    def start(self):
        """Start the manager (must be called from a running event loop)."""
        self.running = True
        self.health_check_task = asyncio.create_task(self.health_check_loop())

    def stop(self):
        """Stop the manager."""
        self.running = False
        if self.health_check_task:
            self.health_check_task.cancel()

    def get_stats(self) -> dict:
        """Return pool statistics."""
        active_count = len(self.pool.active_proxies)
        total_count = len(self.pool.proxies)

        return {
            **self.stats,
            'total_proxies': total_count,
            'active_proxies': active_count,
            'inactive_proxies': total_count - active_count,
            'success_rate': (
                self.stats['successful_requests'] / self.stats['total_requests']
                if self.stats['total_requests'] > 0 else 0.0
            ),
        }

# Usage example
async def main():
    manager = ProxyPoolManager(
        balancer_type="weighted",
        health_check_interval=300.0,
    )

    # Add proxies
    proxy1 = Proxy("proxy1.com", 8080, ProxyType.HTTP)
    proxy2 = Proxy("proxy2.com", 8080, ProxyType.SOCKS5)
    manager.add_proxy(proxy1)
    manager.add_proxy(proxy2)

    # Start the manager
    manager.start()

    # Use a proxy
    proxy = manager.get_next_proxy()
    if proxy:
        print(f"Using proxy: {proxy.url}")
        # Simulate a successful request
        manager.mark_success(proxy, 0.5)

    # Inspect the statistics
    stats = manager.get_stats()
    print(f"Stats: {stats}")

    # Stop the manager
    manager.stop()

# asyncio.run(main())

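`mark_success` keeps a running average of response time instead of storing every sample. The same update can be written in incremental form, which makes the arithmetic easier to verify. A minimal sketch, where `update_mean` is a hypothetical helper and `sample_count` is the number of samples seen before the new one:

```python
def update_mean(current_mean: float, sample_count: int, new_sample: float) -> float:
    """Return the mean after adding new_sample to sample_count earlier samples."""
    # Equivalent to (current_mean * sample_count + new_sample) / (sample_count + 1)
    return current_mean + (new_sample - current_mean) / (sample_count + 1)


mean = 0.0
for count, sample in enumerate([0.2, 0.4, 0.6]):
    mean = update_mean(mean, count, sample)
print(round(mean, 6))
```

The count passed in should cover only the requests that produced a response time; mixing in failed requests would drag the average toward zero.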
6.9.3 Proxy Health Check Implementation

A complete health check implementation:

python
import asyncio
import aiohttp
import time
from typing import List, Dict, Optional

class ComprehensiveHealthChecker:
    """Comprehensive health checker."""
    
    def __init__(
        self,
        test_urls: Optional[List[str]] = None,
        timeout: float = 5.0,
        max_concurrent: int = 10,
        retry_times: int = 2,
    ):
        self.test_urls = test_urls or [
            "http://httpbin.org/ip",
            "http://httpbin.org/get",
        ]
        self.timeout = timeout
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.retry_times = retry_times
    
    async def check_proxy_comprehensive(self, proxy: Proxy) -> Dict:
        """Run a comprehensive health check on one proxy."""
        async with self.semaphore:
            results = {
                'proxy_id': f"{proxy.host}:{proxy.port}",
                'success': False,
                'response_times': [],
                'test_results': [],
                'error': None,
            }
            
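            # Check against each test URL, retrying on failure
            for test_url in self.test_urls:
                for attempt in range(self.retry_times):
                    try:
                        result = await self._test_proxy(proxy, test_url)
                        results['test_results'].append(result)
                        
                        if result['success']:
                            results['response_times'].append(result['response_time'])
                            results['success'] = True
                            break
                        # _test_proxy reports failures in its result instead of
                        # raising, so keep the most recent failure reason here
                        results['error'] = result['error'] or f"HTTP {result['status_code']}"
                    except Exception as e:
                        if attempt == self.retry_times - 1:
                            results['error'] = str(e)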
            
            # Compute the average response time
            if results['response_times']:
                results['avg_response_time'] = sum(results['response_times']) / len(results['response_times'])
            else:
                results['avg_response_time'] = None
            
            # Compute the success rate across all attempts
            results['success_rate'] = (
                len([r for r in results['test_results'] if r['success']]) / len(results['test_results'])
                if results['test_results'] else 0.0
            )
            
            return results
    
    async def _test_proxy(self, proxy: Proxy, test_url: str) -> Dict:
        """Test the proxy against a single URL."""
        start_time = time.time()
        result = {
            'url': test_url,
            'success': False,
            'response_time': None,
            'status_code': None,
            'error': None,
        }
        
        try:
            proxy_url = proxy.url
            
            # Create the connector
            connector = None
            if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                try:
                    from aiohttp_socks import ProxyConnector
                    connector = ProxyConnector.from_url(proxy_url)
                except ImportError:
                    result['error'] = "aiohttp-socks not installed"
                    return result
            else:
                connector = aiohttp.TCPConnector()
            
            # Send the request
            timeout = aiohttp.ClientTimeout(total=self.timeout)
            async with aiohttp.ClientSession(connector=connector) as session:
                async with session.get(
                    test_url,
                    proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                    timeout=timeout,
                ) as resp:
                    result['status_code'] = resp.status
                    result['success'] = resp.status == 200
                    result['response_time'] = time.time() - start_time
            
            if connector:
                await connector.close()
                
        except asyncio.TimeoutError:
            result['error'] = "Timeout"
            result['response_time'] = self.timeout
        except Exception as e:
            result['error'] = str(e)
            result['response_time'] = time.time() - start_time
        
        return result
    
    async def check_proxies_batch(self, proxies: List[Proxy]) -> List[Dict]:
        """Run health checks on a batch of proxies."""
        tasks = [self.check_proxy_comprehensive(proxy) for proxy in proxies]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        
        return processed_results

# Usage example
async def main():
    checker = ComprehensiveHealthChecker()
    
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    result = await checker.check_proxy_comprehensive(proxy)
    print(f"Health check result: {result}")

# asyncio.run(main())

6.9.4 Proxy Rotation Middleware

Proxy rotation middleware:

python
import asyncio
import time
from contextlib import asynccontextmanager
from typing import AsyncIterator, Optional

import aiohttp

class ProxyRotateMiddleware:
    """Proxy rotation middleware."""

    def __init__(self, proxy_pool_manager: ProxyPoolManager):
        self.proxy_pool_manager = proxy_pool_manager
        self.current_proxy: Optional[Proxy] = None
        self.proxy_usage_count = {}  # {proxy_id: count}
        self.max_usage_per_proxy = 100  # maximum uses per proxy before rotating

    def get_proxy_for_request(self) -> Optional[Proxy]:
        """Get the proxy to use for a request."""
        # Reuse the current proxy while it is still within its usage limit
        if self.current_proxy:
            proxy_id = f"{self.current_proxy.host}:{self.current_proxy.port}"
            usage_count = self.proxy_usage_count.get(proxy_id, 0)

            if usage_count < self.max_usage_per_proxy and self.current_proxy.is_active:
                self.proxy_usage_count[proxy_id] = usage_count + 1
                return self.current_proxy

        # Otherwise get a new proxy
        self.current_proxy = self.proxy_pool_manager.get_next_proxy()
        if self.current_proxy:
            proxy_id = f"{self.current_proxy.host}:{self.current_proxy.port}"
            self.proxy_usage_count[proxy_id] = 1

        return self.current_proxy

    @asynccontextmanager
    async def request_with_proxy(
        self,
        url: str,
        method: str = "GET",
        **kwargs
    ) -> AsyncIterator[aiohttp.ClientResponse]:
        """Send a request through a proxy and yield the open response."""
        proxy = self.get_proxy_for_request()
        if not proxy:
            raise RuntimeError("No available proxy")

        proxy_url = proxy.url
        start_time = time.time()

        try:
            # Create the connector (SOCKS proxies need aiohttp-socks)
            connector = None
            if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                from aiohttp_socks import ProxyConnector
                connector = ProxyConnector.from_url(proxy_url)

            # Send the request
            async with aiohttp.ClientSession(connector=connector) as session:
                async with session.request(
                    method,
                    url,
                    proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                    **kwargs
                ) as resp:
                    response_time = time.time() - start_time

                    # Record the outcome for this proxy
                    if resp.status == 200:
                        self.proxy_pool_manager.mark_success(proxy, response_time)
                    else:
                        self.proxy_pool_manager.mark_failure(proxy)

                    # Yield while the session is still open so the caller
                    # can read the body; returning resp would hand back a
                    # response whose connection is already closed
                    yield resp
        except Exception:
            # Errors raised here or inside the caller's block count as a failure
            self.proxy_pool_manager.mark_failure(proxy)
            raise

    def rotate_proxy(self):
        """Force a rotation to a new proxy."""
        self.current_proxy = None

# Usage example
async def main():
    manager = ProxyPoolManager()
    # ... add proxies ...

    middleware = ProxyRotateMiddleware(manager)

    # Send a request through the middleware
    try:
        async with middleware.request_with_proxy("https://httpbin.org/ip") as resp:
            print(await resp.text())
    except Exception as e:
        print(f"Request failed: {e}")

# asyncio.run(main())

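A common companion to rotation is retrying a failed request on a different proxy. The sketch below shows the control flow only. It assumes a caller-supplied `fetch` coroutine that performs the request through a given proxy; `fetch_with_rotation` and the proxy strings are hypothetical, and the network call is replaced by a stub so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def fetch_with_rotation(
    fetch: Callable[[str], Awaitable[str]],
    proxies: List[str],
    max_attempts: int = 3,
) -> str:
    """Try proxies in order, moving to the next one after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(min(max_attempts, len(proxies))):
        proxy = proxies[attempt]
        try:
            return await fetch(proxy)
        except Exception as exc:
            # Remember the failure and rotate to the next proxy
            last_error = exc
    raise RuntimeError("all proxy attempts failed") from last_error


async def stub_fetch(proxy: str) -> str:
    """Stand-in for a real request: proxies named 'bad*' always fail."""
    if proxy.startswith("bad"):
        raise ConnectionError(f"{proxy} unreachable")
    return f"ok via {proxy}"


print(asyncio.run(fetch_with_rotation(stub_fetch, ["bad1", "bad2", "good"])))
```

Capping the attempts at `max_attempts` keeps one bad target URL from burning through the whole pool and getting every proxy marked as failed.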
6.9.5 Proxy Pool Monitoring Panel

Monitoring panel implementation:

python
from flask import Flask, render_template_string, jsonify
import threading
import time

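# HTML template for the monitoring dashboard
# (the Flask app itself is created inside ProxyPoolMonitor below)
MONITOR_TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
    <title>Proxy Pool Monitoring Dashboard</title>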
    <style>
        body { font-family: Arial, sans-serif; margin: 20px; }
        .stats { display: flex; gap: 20px; margin-bottom: 20px; }
        .stat-card { background: #f0f0f0; padding: 15px; border-radius: 5px; }
        .stat-card h3 { margin: 0 0 10px 0; }
        .stat-card .value { font-size: 24px; font-weight: bold; }
        table { width: 100%; border-collapse: collapse; }
        th, td { padding: 10px; text-align: left; border-bottom: 1px solid #ddd; }
        th { background-color: #4CAF50; color: white; }
        .status-active { color: green; }
        .status-inactive { color: red; }
    </style>
    <script>
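        function refresh() {
            fetch('/api/stats')
                .then(response => response.json())
                .then(data => {
                    document.getElementById('stats').innerHTML = generateStatsHTML(data);
                    document.getElementById('proxies').innerHTML = generateProxiesHTML(data.proxies);
                });
        }
        // Load once when the page is ready, then refresh every 5 seconds
        document.addEventListener('DOMContentLoaded', refresh);
        setInterval(refresh, 5000);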
        
        function generateStatsHTML(data) {
            return `
                <div class="stat-card">
                    <h3>Total proxies</h3>
                    <div class="value">${data.total_proxies}</div>
                </div>
                <div class="stat-card">
                    <h3>Active proxies</h3>
                    <div class="value">${data.active_proxies}</div>
                </div>
                <div class="stat-card">
                    <h3>Success rate</h3>
                    <div class="value">${(data.success_rate * 100).toFixed(2)}%</div>
                </div>
                <div class="stat-card">
                    <h3>Total requests</h3>
                    <div class="value">${data.total_requests}</div>
                </div>
            `;
        }
        
        function generateProxiesHTML(proxies) {
            let html = '<table><tr><th>Proxy</th><th>Type</th><th>Status</th><th>Success rate</th><th>Response time</th><th>Uses</th></tr>';
            proxies.forEach(proxy => {
                html += `
                    <tr>
                        <td>${proxy.host}:${proxy.port}</td>
                        <td>${proxy.type}</td>
                        <td class="${proxy.is_active ? 'status-active' : 'status-inactive'}">
                            ${proxy.is_active ? 'Active' : 'Inactive'}
                        </td>
                        <td>${(proxy.success_rate * 100).toFixed(2)}%</td>
                        <td>${proxy.response_time != null ? proxy.response_time.toFixed(3) + 's' : '-'}</td>
                        <td>${proxy.success_count + proxy.failure_count}</td>
                    </tr>
                `;
            });
            html += '</table>';
            return html;
        }
    </script>
</head>
<body>
    <h1>Proxy Pool Monitoring Dashboard</h1>
    <div class="stats" id="stats"></div>
    <h2>Proxy list</h2>
    <div id="proxies"></div>
</body>
</html>
"""

class ProxyPoolMonitor:
    """Proxy pool monitor."""
    
    def __init__(self, proxy_pool_manager: ProxyPoolManager, port: int = 5000):
        self.proxy_pool_manager = proxy_pool_manager
        self.port = port
        self.app = Flask(__name__)
        self._setup_routes()
    
    def _setup_routes(self):
        """Register the routes."""
        @self.app.route('/')
        def index():
            return render_template_string(MONITOR_TEMPLATE)
        
        @self.app.route('/api/stats')
        def api_stats():
            stats = self.proxy_pool_manager.get_stats()
            # List every proxy, not only active ones, so the status column is meaningful
            proxies = list(self.proxy_pool_manager.pool.proxies.values())
            
            proxy_data = []
            for proxy in proxies:
                proxy_data.append({
                    'host': proxy.host,
                    'port': proxy.port,
                    'type': proxy.proxy_type.value,
                    'is_active': proxy.is_active,
                    'success_rate': proxy.success_rate,
                    'response_time': proxy.response_time,
                    'success_count': proxy.success_count,
                    'failure_count': proxy.failure_count,
                })
            
            return jsonify({
                **stats,
                'proxies': proxy_data,
            })
    
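    def run(self, debug: bool = False, host: str = '127.0.0.1'):
        """Run the dashboard (blocking call).

        Binds to localhost by default because the dashboard has no
        authentication; pass host='0.0.0.0' only on a trusted network.
        """
        self.app.run(host=host, port=self.port, debug=debug)

# Usage example
# monitor = ProxyPoolMonitor(proxy_pool_manager, port=5000)
# monitor.run()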
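
The dashboard script reads `total_proxies`, `active_proxies`, `success_rate`, and `total_requests` from `/api/stats`, but `get_stats()` itself is not shown in this section. As a rough sketch of a payload with that shape, assuming a hypothetical `ProxyRecord` dataclass standing in for the chapter's `Proxy` class and a hypothetical `build_stats` helper:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProxyRecord:
    """Hypothetical stand-in for the chapter's Proxy class."""
    host: str
    port: int
    is_active: bool = True
    success_count: int = 0
    failure_count: int = 0


def build_stats(proxies: List[ProxyRecord]) -> Dict[str, float]:
    """Aggregate per-proxy counters into the fields the dashboard displays."""
    total_requests = sum(p.success_count + p.failure_count for p in proxies)
    total_success = sum(p.success_count for p in proxies)
    return {
        "total_proxies": len(proxies),
        "active_proxies": sum(1 for p in proxies if p.is_active),
        # Guard against division by zero before any request has been made
        "success_rate": total_success / total_requests if total_requests else 0.0,
        "total_requests": total_requests,
    }


records = [
    ProxyRecord("10.0.0.1", 8080, success_count=8, failure_count=2),
    ProxyRecord("10.0.0.2", 8080, is_active=False, success_count=1, failure_count=9),
]
print(build_stats(records))
```

Keeping `success_rate` as a fraction between 0 and 1 matches the template, which multiplies it by 100 before display.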

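6.10 Hands-on: building a highly available proxy pool system

This section walks step by step through building a complete, highly available proxy pool system.

6.10.1 Step 1: Design the proxy pool data structures and interfaces

Data structure design:

python
# Reuse the Proxy and ProxyPool classes defined earlier
# Design points:
# 1. Store proxies in a dict (fast lookup)
# 2. Keep active proxies in a set (fast filtering)
# 3. Use a queue for round-robin (fair distribution)
# 4. Thread safety (guard shared state with a lock)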
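
The four design points can be sketched roughly as follows. This is a simplified stand-in, not the `ProxyPool` class defined earlier: `SimpleProxyPool` and its method names are hypothetical, and proxies are reduced to plain URL strings.

```python
import threading
from collections import deque
from typing import Deque, Dict, Optional, Set


class SimpleProxyPool:
    """Hypothetical minimal pool: dict + set + queue, guarded by a lock."""

    def __init__(self) -> None:
        self._proxies: Dict[str, str] = {}  # proxy_id -> proxy URL (fast lookup)
        self._active: Set[str] = set()      # IDs of active proxies (fast filtering)
        self._queue: Deque[str] = deque()   # rotation order for round-robin
        self._lock = threading.Lock()       # protects all three structures

    def add(self, proxy_id: str, url: str) -> None:
        with self._lock:
            if proxy_id not in self._proxies:
                self._queue.append(proxy_id)
            self._proxies[proxy_id] = url
            self._active.add(proxy_id)

    def deactivate(self, proxy_id: str) -> None:
        with self._lock:
            self._active.discard(proxy_id)

    def next_proxy(self) -> Optional[str]:
        """Return the next active proxy URL in round-robin order, or None."""
        with self._lock:
            for _ in range(len(self._queue)):
                proxy_id = self._queue[0]
                self._queue.rotate(-1)  # move the head to the back
                if proxy_id in self._active:
                    return self._proxies[proxy_id]
            return None


pool = SimpleProxyPool()
pool.add("a", "http://10.0.0.1:8080")
pool.add("b", "http://10.0.0.2:8080")
pool.add("c", "http://10.0.0.3:8080")
pool.deactivate("b")
print([pool.next_proxy() for _ in range(4)])
```

Inactive proxies stay in the queue and are skipped at selection time, so reactivating one only requires adding its ID back to the set.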

Interface design:

python
from abc import ABC, abstractmethod
from typing import Optional

class IProxyPoolManager(ABC):
    """Proxy pool manager interface."""
    
    @abstractmethod
    def add_proxy(self, proxy: Proxy) -> str:
        """Add a proxy and return its ID."""
        pass
    
    @abstractmethod
    def remove_proxy(self, proxy_id: str):
        """Remove a proxy."""
        pass
    
    @abstractmethod
    def get_next_proxy(self) -> Optional[Proxy]:
        """Return the next proxy, or None if none is available."""
        pass
    
    @abstractmethod
    def mark_success(self, proxy: Proxy, response_time: float):
        """Record a successful request."""
        pass
    
    @abstractmethod
    def mark_failure(self, proxy: Proxy):
        """Record a failed request."""
        pass

6.10.2 Step 2: Implement adding, removing, and querying proxies

Full implementation:

python
class ProxyPoolManagerV2(ProxyPoolManager):
    """Proxy pool manager V2 (enhanced)."""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxy_metadata = {}  # {proxy_id: metadata}
    
    def add_proxy_from_string(self, proxy_string: str, **metadata) -> Optional[str]:
        """Add a proxy from a URL string; return its ID, or None on failure."""
        # Expected format: http://user:pass@host:port or socks5://host:port
        try:
            from urllib.parse import urlparse
            
            parsed = urlparse(proxy_string)
            proxy_type_map = {
                'http': ProxyType.HTTP,
                'https': ProxyType.HTTPS,
                'socks4': ProxyType.SOCKS4,
                'socks5': ProxyType.SOCKS5,
            }
            
            proxy_type = proxy_type_map.get(parsed.scheme)
            if not proxy_type:
                raise ValueError(f"Unsupported proxy type: {parsed.scheme}")
            if not parsed.hostname:
                raise ValueError("Missing host in proxy string")
            
            proxy = Proxy(
                host=parsed.hostname,
                # Fall back to a conventional port when none is given
                port=parsed.port or (1080 if 'socks' in parsed.scheme else 8080),
                proxy_type=proxy_type,
                username=parsed.username,
                password=parsed.password,
                **metadata
            )
            
            proxy_id = self.add_proxy(proxy)
            self.proxy_metadata[proxy_id] = metadata
            return proxy_id
        except Exception as e:
            print(f"Failed to add proxy from string: {e}")
            return None
    
    def add_proxies_from_file(self, file_path: str):
        """Bulk-add proxies from a file (one proxy URL per line; lines starting with # are skipped)."""
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#'):
                    self.add_proxy_from_string(line)
    
    def export_proxies(self, file_path: str):
        """Export all proxies to a file, one URL per line."""
        with open(file_path, 'w') as f:
            for proxy in self.pool.proxies.values():
                f.write(f"{proxy.url}\n")
    
    def search_proxies(self, **filters) -> List[Proxy]:
        """Search proxies by type, location, min_success_rate, and/or max_response_time."""
        results = []
        for proxy in self.pool.proxies.values():
            match = True
            
            if 'type' in filters and proxy.proxy_type != filters['type']:
                match = False
            if 'location' in filters and proxy.location != filters['location']:
                match = False
            if 'min_success_rate' in filters and proxy.success_rate < filters['min_success_rate']:
                match = False
            if 'max_response_time' in filters and proxy.response_time > filters['max_response_time']:
                match = False
            
            if match:
                results.append(proxy)
        
        return results

# Usage example
manager = ProxyPoolManagerV2()

# Add from a string
manager.add_proxy_from_string("http://user:pass@proxy.com:8080", location="US")

# Bulk-add from a file
manager.add_proxies_from_file("proxies.txt")

# Search proxies
us_proxies = manager.search_proxies(location="US", min_success_rate=0.8)

6.10.3 Step 3: Implement scheduled health checks

Scheduled health checks:

python
class ScheduledHealthChecker:
    """Scheduled health checker."""
    
    def __init__(
        self,
        proxy_pool_manager: ProxyPoolManager,
        interval: float = 300.0,
        batch_size: int = 10,
    ):
        self.proxy_pool_manager = proxy_pool_manager
        self.interval = interval
        self.batch_size = batch_size
        self.health_checker = ComprehensiveHealthChecker()
        self.running = False
        self.task = None
    
    async def health_check_loop(self):
        """Health check loop."""
        while self.running:
            try:
                # Fetch the proxies to check
                active_proxies = self.proxy_pool_manager.pool.get_active_proxies()
                
                # Check them in batches
                for i in range(0, len(active_proxies), self.batch_size):
                    batch = active_proxies[i:i+self.batch_size]
                    results = await self.health_checker.check_proxies_batch(batch)
                    
                    # Update each proxy's state
                    for proxy, result in zip(batch, results):
                        if result['success']:
                            self.proxy_pool_manager.mark_success(
                                proxy,
                                result.get('avg_response_time', 0.0)
                            )
                        else:
                            self.proxy_pool_manager.mark_failure(proxy)
                
                await asyncio.sleep(self.interval)
            except Exception as e:
                print(f"Health check error: {e}")
                await asyncio.sleep(60)
    
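    def start(self):
        """Start health checks (must be called from a running event loop)."""
        self.running = True
        self.task = asyncio.create_task(self.health_check_loop())
    
    def stop(self):
        """Stop health checks."""
        self.running = False
        if self.task:
            self.task.cancel()

# Usage example: asyncio.create_task needs a running event loop,
# so start the checker from inside a coroutine.
async def main():
    health_checker = ScheduledHealthChecker(proxy_pool_manager, interval=300.0)
    health_checker.start()
    # ... run the crawler ...
    health_checker.stop()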

6.10.4 Step 4: Implement load balancing (comparing several algorithms)

Algorithm comparison test:

python
def test_load_balancing_algorithms(proxy_pool_manager: ProxyPoolManager):
    """Compare the load-balancing algorithms."""
    algorithms = {
        'round_robin': RoundRobinBalancer(proxy_pool_manager.pool),
        'random': RandomBalancer(proxy_pool_manager.pool),
        'weighted': WeightedBalancer(proxy_pool_manager.pool),
        'least_connections': LeastConnectionsBalancer(proxy_pool_manager.pool),
    }
    
    iterations = 1000
    results = {}
    
    for algo_name, balancer in algorithms.items():
        selection_distribution = {}
        # perf_counter is a monotonic, high-resolution clock suited to timing
        start_time = time.perf_counter()
        
        for _ in range(iterations):
            proxy = balancer.get_next_proxy()
            if proxy:
                proxy_id = f"{proxy.host}:{proxy.port}"
                selection_distribution[proxy_id] = selection_distribution.get(proxy_id, 0) + 1
        
        elapsed = time.perf_counter() - start_time
        
        # Measure how evenly the selections are spread
        # (proxies that were never selected do not appear in the counts)
        counts = list(selection_distribution.values())
        if counts:
            mean = sum(counts) / len(counts)
            variance = sum((x - mean) ** 2 for x in counts) / len(counts)
            std_dev = variance ** 0.5
            cv = std_dev / mean if mean > 0 else 0  # coefficient of variation
        else:
            cv = 0
        
        results[algo_name] = {
            'time': elapsed * 1000,
            'distribution': selection_distribution,
            'coefficient_of_variation': cv,
        }
    
    return results

# Usage example
# results = test_load_balancing_algorithms(proxy_pool_manager)
# for algo, result in results.items():
#     print(f"{algo}: {result['time']:.2f}ms, CV: {result['coefficient_of_variation']:.3f}")
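
To make the uniformity metric concrete: the coefficient of variation is the standard deviation of the per-proxy selection counts divided by their mean, so 0 means a perfectly even spread. The sketch below also counts proxies that were never selected as zeros, which the test function above leaves out. The function name `selection_cv` and the sample numbers are illustrative assumptions.

```python
from typing import Dict, Iterable


def selection_cv(distribution: Dict[str, int], all_proxy_ids: Iterable[str]) -> float:
    """Coefficient of variation of selection counts across all known proxies."""
    # Proxies that were never selected contribute a count of 0
    counts = [distribution.get(pid, 0) for pid in all_proxy_ids]
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance ** 0.5 / mean


ids = ["p1", "p2", "p3", "p4"]
even = {"p1": 250, "p2": 250, "p3": 250, "p4": 250}
skewed = {"p1": 700, "p2": 100, "p3": 100, "p4": 100}
starved = {"p1": 500, "p2": 500}  # p3 and p4 were never picked

print(f"even:    {selection_cv(even, ids):.3f}")
print(f"skewed:  {selection_cv(skewed, ids):.3f}")
print(f"starved: {selection_cv(starved, ids):.3f}")
```

Without the zero-filling step, the `starved` case would report a CV of 0 even though half the pool is never used.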

6.10.5 Step 5: Implement the proxy rotation middleware

Integrating with the HTTP client:

python
class ProxyAwareHTTPClient:
    """HTTP client with proxy support."""
    
    def __init__(self, proxy_pool_manager: ProxyPoolManager):
        self.proxy_pool_manager = proxy_pool_manager
        self.middleware = ProxyRotateMiddleware(proxy_pool_manager)
    
    async def get(self, url: str, **kwargs) -> aiohttp.ClientResponse:
        """Send a GET request."""
        return await self.middleware.request_with_proxy(url, "GET", **kwargs)
    
    async def post(self, url: str, **kwargs) -> aiohttp.ClientResponse:
        """Send a POST request."""
        return await self.middleware.request_with_proxy(url, "POST", **kwargs)
    
    async def request(self, method: str, url: str, **kwargs) -> aiohttp.ClientResponse:
        """Send a request with an arbitrary HTTP method."""
        return await self.middleware.request_with_proxy(url, method, **kwargs)

# Usage example
async def main():
    client = ProxyAwareHTTPClient(proxy_pool_manager)
    
    try:
        async with await client.get("https://httpbin.org/ip") as resp:
            print(await resp.text())
    except Exception as e:
        print(f"Request failed: {e}")

# asyncio.run(main())

6.10.6 Step 6: Integrate into an async crawler framework

Crawler framework integration:

python
import asyncio
from typing import List, Callable

class AsyncCrawlerWithProxy:
    """Async crawler with proxy support."""
    
    def __init__(
        self,
        proxy_pool_manager: ProxyPoolManager,
        max_concurrent: int = 10,
    ):
        self.proxy_pool_manager = proxy_pool_manager
        self.client = ProxyAwareHTTPClient(proxy_pool_manager)
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def crawl_url(self, url: str) -> dict:
        """Crawl a single URL."""
        async with self.semaphore:
            try:
                async with await self.client.get(url) as resp:
                    text = await resp.text()
                    return {
                        'url': url,
                        'status': resp.status,
                        'content': text,
                        'success': True,
                    }
            except Exception as e:
                return {
                    'url': url,
                    'status': None,
                    'content': None,
                    'success': False,
                    'error': str(e),
                }
    
    async def crawl_urls(self, urls: List[str]) -> List[dict]:
        """Crawl a batch of URLs concurrently."""
        tasks = [self.crawl_url(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        
        return processed_results

# Usage example
async def main():
    crawler = AsyncCrawlerWithProxy(proxy_pool_manager, max_concurrent=10)
    
    urls = [
        "https://httpbin.org/ip",
        "https://httpbin.org/get",
        # ... more URLs
    ]
    
    results = await crawler.crawl_urls(urls)
    
    success_count = sum(1 for r in results if r.get('success'))
    print(f"Success: {success_count}/{len(results)}")

# asyncio.run(main())

6.10.7 Step 7: The complete working code

The complete proxy pool system:

python
import asyncio
import aiohttp
import time
import logging
from typing import List, Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CompleteProxyPoolSystem:
    """Complete proxy pool system."""
    
    def __init__(
        self,
        proxy_sources: Optional[List[str]] = None,
        health_check_interval: float = 300.0,
        balancer_type: str = "weighted",
    ):
        # Initialize the components
        self.proxy_pool_manager = ProxyPoolManagerV2(
            balancer_type=balancer_type,
            health_check_interval=health_check_interval,
        )
        
        self.health_checker = ScheduledHealthChecker(
            self.proxy_pool_manager,
            interval=health_check_interval,
        )
        
        self.crawler = AsyncCrawlerWithProxy(
            self.proxy_pool_manager,
            max_concurrent=10,
        )
        
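        # Load proxies
        # Keep references to pending tasks so they are not garbage-collected mid-run
        self._load_tasks = []
        if proxy_sources:
            for source in proxy_sources:
                if source.startswith(('http://', 'https://')):
                    # Load from a URL (requires a running event loop)
                    self._load_tasks.append(
                        asyncio.create_task(self.load_proxies_from_url(source))
                    )
                else:
                    # Load from a file
                    self.proxy_pool_manager.add_proxies_from_file(source)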
    
    async def load_proxies_from_url(self, url: str):
        """Load a proxy list from a URL (one proxy per line)."""
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    if resp.status == 200:
                        text = await resp.text()
                        for line in text.split('\n'):
                            line = line.strip()
                            if line:
                                self.proxy_pool_manager.add_proxy_from_string(line)
        except Exception as e:
            logger.error(f"Failed to load proxies from URL: {e}")
    
    def start(self):
        """Start the system."""
        logger.info("Starting proxy pool system...")
        self.proxy_pool_manager.start()
        self.health_checker.start()
        logger.info("Proxy pool system started")
    
    def stop(self):
        """Stop the system."""
        logger.info("Stopping proxy pool system...")
        self.proxy_pool_manager.stop()
        self.health_checker.stop()
        logger.info("Proxy pool system stopped")
    
    async def crawl(self, urls: List[str]) -> List[dict]:
        """Crawl URLs through the proxy pool."""
        return await self.crawler.crawl_urls(urls)
    
    def get_stats(self) -> dict:
        """Return system statistics."""
        return self.proxy_pool_manager.get_stats()

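# Usage example
async def main():
    # Create the system
    system = CompleteProxyPoolSystem(
        proxy_sources=["proxies.txt"],
        health_check_interval=300.0,
        balancer_type="weighted",
    )
    
    # Start the system
    system.start()
    
    # Wait for proxies to load and the first health check to run
    await asyncio.sleep(10)
    
    # Crawl the URLs
    urls = [
        "https://httpbin.org/ip",
        "https://httpbin.org/get",
    ]
    
    results = await system.crawl(urls)
    success_count = sum(1 for r in results if r.get('success'))
    logger.info(f"Success: {success_count}/{len(results)}")
    
    # Inspect the statistics
    stats = system.get_stats()
    logger.info(f"System stats: {stats}")
    
    # Stop the system
    system.stop()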

if __name__ == "__main__":
    asyncio.run(main())

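6.11 Common pitfalls and troubleshooting

6.11.1 An overlong DNS cache TTL hides IP changes

Problem:

python
# Bad: the TTL is too long
cache = DNSCache(default_ttl=86400)  # 24 hours (too long!)

# If the server's IP changes, the cache can keep serving the stale address for up to 24 hours

Solution:

python
import dns.exception
import dns.resolver

# Good: a reasonable TTL
cache = DNSCache(default_ttl=300)  # 5 minutes

# Or derive the TTL from the DNS record itself
def get_dns_ttl(hostname: str) -> int:
    """Return the TTL of the hostname's A record."""
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        return answers.rrset.ttl
    except dns.exception.DNSException:
        return 300  # default: 5 minutes

# Use the dynamic TTL (`ips` is the list of addresses resolved for the hostname)
ttl = get_dns_ttl("www.example.com")
cache.set("www.example.com", ips, ttl=ttl)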
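
Record TTLs can be very short or very long, so a cache that follows them blindly may either re-resolve constantly or hold stale addresses for a long time. One common approach is to clamp the TTL to a range. The sketch below assumes bounds of 60 and 600 seconds and a hypothetical helper named `clamp_ttl`; neither comes from the chapter's `DNSCache`.

```python
def clamp_ttl(record_ttl: int, min_ttl: int = 60, max_ttl: int = 600) -> int:
    """Clamp a DNS record TTL (in seconds) into the range [min_ttl, max_ttl]."""
    if min_ttl > max_ttl:
        raise ValueError("min_ttl must not exceed max_ttl")
    return max(min_ttl, min(record_ttl, max_ttl))


for ttl in (5, 300, 86400):
    print(ttl, "->", clamp_ttl(ttl))
```

The clamped value can then be passed as the `ttl` argument when storing an entry in the cache.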

6.11.2 Overly frequent health checks can get you banned by the proxy provider

Problem:

python
# Bad: checks run too often
health_checker = ScheduledHealthChecker(
    proxy_pool_manager,
    interval=10.0,  # every 10 seconds (too frequent!)
)

# The proxy server may flag this as abnormal behavior and ban you

Solution:

python
# Good: a reasonable check frequency
health_checker = ScheduledHealthChecker(
    proxy_pool_manager,
    interval=300.0,  # every 5 minutes
    batch_size=5,     # check only 5 proxies per batch
)

# Or use adaptive frequency control
smart_checker = SmartHealthChecker(
    base_interval=300.0,
    min_interval=60.0,
    max_interval=3600.0,
)
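
As a rough illustration of what adaptive frequency control can mean, the sketch below shortens the interval as the recent failure rate rises and lengthens it when the pool is stable, always staying within the min/max bounds. The function `next_interval` and its scaling rule are assumptions made for this example, not the logic of `SmartHealthChecker`.

```python
def next_interval(
    failure_rate: float,
    base_interval: float = 300.0,
    min_interval: float = 60.0,
    max_interval: float = 3600.0,
) -> float:
    """Pick the next health-check interval from the recent failure rate (0 to 1)."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    # Assumed rule: 0% failures doubles the base interval, 25% keeps it,
    # and every further 25% halves it again.
    interval = base_interval * 2 ** (1 - 4 * failure_rate)
    # Never check more often than min_interval or less often than max_interval
    return max(min_interval, min(interval, max_interval))


for rate in (0.0, 0.25, 0.5, 1.0):
    print(f"failure rate {rate:.2f} -> {next_interval(rate):.0f}s")
```

The lower bound is what protects against the ban risk described above: however unhealthy the pool looks, checks never run more often than `min_interval`.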

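6.11.3 SOCKS5 proxies need special handling (including UDP traffic)

Problem:

python
# Bad: pointing a plain HTTP client at a SOCKS5 proxy.
# aiohttp's `proxy=` argument only understands HTTP proxies, so SOCKS5 needs a
# dedicated connector. Some UDP traffic may also fail to work through the proxy.

Solution:

python
# Good: use a dedicated SOCKS connector
import aiohttp
from aiohttp_socks import ProxyConnector

proxy_url = "socks5://proxy.example.com:1080"

async def fetch_via_socks5():
    # Create the connector inside the coroutine so it binds to the running loop
    connector = ProxyConnector.from_url(proxy_url)
    async with aiohttp.ClientSession(connector=connector) as session:
        # Requests on this session now go through the SOCKS5 proxy
        async with session.get("https://httpbin.org/ip") as resp:
            print(await resp.text())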

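6.11.4 Requests fail when the proxy pool is exhausted

Problem:

python
# Bad: no availability check
proxy = proxy_pool_manager.get_next_proxy()
# If every proxy is unavailable, proxy is None and the code that uses it fails

Solution:

python
# Good: check availability and degrade gracefully
# (this snippet runs inside a coroutine)
proxy = proxy_pool_manager.get_next_proxy()
if not proxy:
    # Option 1: fall back to a direct connection without a proxy
    logger.warning("No available proxy, using direct connection")
    # Option 2: wait for proxies to recover, then try again
    await asyncio.sleep(10)
    proxy = proxy_pool_manager.get_next_proxy()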
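
The wait-and-retry option can be wrapped in a small helper so that a request never waits indefinitely. A minimal sketch, assuming a hypothetical helper `acquire_proxy` that accepts any zero-argument callable such as `proxy_pool_manager.get_next_proxy`; the retry count and delay are illustrative values:

```python
import asyncio
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


async def acquire_proxy(
    get_next_proxy: Callable[[], Optional[T]],
    retries: int = 3,
    delay: float = 10.0,
) -> Optional[T]:
    """Try to get a proxy, waiting `delay` seconds between attempts.

    Returns None if no proxy is available after `retries` extra attempts,
    leaving the caller to choose between a direct connection and giving up.
    """
    for attempt in range(retries + 1):
        proxy = get_next_proxy()
        if proxy is not None:
            return proxy
        if attempt < retries:
            await asyncio.sleep(delay)
    return None


async def demo() -> None:
    # Simulated pool: empty on the first two calls, then it recovers
    answers = iter([None, None, "http://10.0.0.1:8080"])
    proxy = await acquire_proxy(lambda: next(answers), retries=3, delay=0.01)
    print(proxy)


asyncio.run(demo())
```

Returning `None` instead of raising keeps the fallback decision in the caller, which matches the two options listed in the snippet above.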

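6.11.5 A poorly chosen load-balancing algorithm degrades performance

Problem:

python
# Bad: round-robin when proxy quality varies widely
# Low-quality proxies are used just as often as good ones, dragging down overall performance
balancer = RoundRobinBalancer(proxy_pool)

Solution:

python
# Good: pick the algorithm to match the scenario
# Proxy quality varies widely: weighted
# Proxy quality is similar: round-robin or random
# Long-lived connections: least connections

# proxy_quality_varies is a flag you set from your own measurements
if proxy_quality_varies:
    balancer = WeightedBalancer(proxy_pool)
else:
    balancer = RoundRobinBalancer(proxy_pool)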
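
One way to decide whether quality "varies widely" is to look at the spread of per-proxy success rates. The sketch below uses the coefficient of variation with an arbitrary threshold of 0.2; both the helper name `quality_varies` and the threshold are assumptions to tune against your own pool.

```python
from statistics import mean, pstdev
from typing import Sequence


def quality_varies(success_rates: Sequence[float], threshold: float = 0.2) -> bool:
    """Return True when success rates are spread widely relative to their mean."""
    if len(success_rates) < 2:
        return False  # nothing to compare against
    avg = mean(success_rates)
    if avg == 0:
        return False  # every proxy is failing; weighting cannot help
    return pstdev(success_rates) / avg > threshold


print(quality_varies([0.90, 0.92, 0.88]))  # similar quality
print(quality_varies([0.95, 0.30, 0.60]))  # very uneven quality
```

The result can feed the `if` statement above, for example by passing in the `success_rate` of each proxy in the pool.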

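6.12 Summary

This chapter covered DNS resolution optimization and proxy pool architecture from design through implementation. The main points are recapped below.

Key points

  1. DNS resolution

    • The full DNS query flow (recursive and iterative)
    • DNS caching and TTL management
    • Implementing and using DoH/DoT
  2. Proxy pool architecture

    • Data structure design for the proxy pool
    • Choosing and using proxy types
    • Designing the health check mechanism
  3. Load-balancing algorithms

    • Round-robin, random, weighted, and least connections
    • Where each algorithm fits
    • Comparing and choosing algorithms by performance
  4. Practical skills

    • Building a complete proxy pool system
    • Integrating it into a crawler framework
    • Monitoring and statistics

Best practices

  1. DNS optimization

    • Use a reasonable TTL (5 to 10 minutes)
    • Implement multi-level caching
    • Use DoH for better security
  2. Proxy pool management

    • Run health checks regularly (every 5 to 10 minutes)
    • Implement adaptive check frequency
    • Choose the load-balancing algorithm to fit the scenario
  3. Performance

    • Run health checks asynchronously
    • Process proxies in batches
    • Warm up proxies before use
  4. Monitoring and operations

    • Provide a monitoring dashboard
    • Keep detailed logs
    • Set up alerting

Where to go next

  1. Further study

    • DNS protocol details
    • Proxy protocol implementations
    • Distributed proxy pools
  2. Projects

    • Build a large-scale proxy pool
    • Automate proxy acquisition
    • Develop a proxy quality scoring system

With this chapter complete, you have covered the core techniques of DNS resolution optimization and proxy pool architecture, and can apply them to build high-performance, highly available crawlers.


End of chapter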
