[Web Scraping Tutorial] Chapter 6: DNS Resolution Optimization and Proxy Pool Architecture

Contents

  • 6.1 Introduction: Why DNS and Proxies Matter for Crawlers
    • 6.1.1 How DNS Resolution Affects Crawler Performance
    • 6.1.2 Why a Proxy Pool Is Necessary
    • 6.1.3 Learning Goals for This Chapter
  • 6.2 DNS Resolution Flow in Depth
    • 6.2.1 The Complete DNS Query Flow
    • 6.2.2 Recursive vs. Iterative Queries
    • 6.2.3 Local Caches and System DNS
    • 6.2.4 DNS Record Types Explained
  • 6.3 DNS Caching in Depth
    • 6.3.1 What TTL Means and Does
    • 6.3.2 Cache Strategy Design
    • 6.3.3 Handling Cache Invalidation
    • 6.3.4 Multi-Level Cache Architecture
  • 6.4 Implementing DNS over HTTPS/TLS
    • 6.4.1 How DoH Works
    • 6.4.2 How DoT Works
    • 6.4.3 Security Advantages of DoH/DoT
    • 6.4.4 A DoH Client Implementation
  • 6.5 Proxy Pool Architecture Design
    • 6.5.1 Proxy Pool Data Structures
    • 6.5.2 Proxy Types Explained (HTTP/HTTPS/SOCKS4/SOCKS5)
    • 6.5.3 Proxy Pool Interface Design
    • 6.5.4 Proxy Pool Architecture Diagram
  • 6.6 Proxy Health Checking
    • 6.6.1 Health Check Design Principles
    • 6.6.2 Health Check Implementation Methods
    • 6.6.3 Controlling Health Check Frequency
    • 6.6.4 Health Evaluation Criteria
  • 6.7 Load Balancing Algorithms
    • 6.7.1 Round Robin
    • 6.7.2 Random
    • 6.7.3 Weighted
    • 6.7.4 Least Connections
    • 6.7.5 Algorithm Performance Comparison
  • 6.8 Toolchain: DNS and Proxy Tools
    • 6.8.1 DNS Queries with dnspython
    • 6.8.2 DNS Queries via DoH Services
    • 6.8.3 Configuring Proxies with httpx/aiohttp
    • 6.8.4 Managing Proxy Pool Data with Redis
    • 6.8.5 SOCKS Proxy Support with aiohttp-socks
  • 6.9 Code Walkthrough: Complete Implementations
    • 6.9.1 A Custom DNS Resolver (with Caching and DoH)
    • 6.9.2 The Complete Proxy Pool Manager Class
    • 6.9.3 Proxy Health Check Code
    • 6.9.4 Proxy Rotation Middleware
    • 6.9.5 Proxy Pool Monitoring Dashboard
  • 6.10 Hands-On: Building a Highly Available Proxy Pool System
    • 6.10.1 Step 1: Design the Pool's Data Structures and Interfaces
    • 6.10.2 Step 2: Implement Add/Remove/Query for Proxies
    • 6.10.3 Step 3: Implement Scheduled Health Checks
    • 6.10.4 Step 4: Implement Load Balancing (Comparing Algorithms)
    • 6.10.5 Step 5: Implement Proxy Rotation Middleware
    • 6.10.6 Step 6: Integrate with an Async Crawler Framework
    • 6.10.7 Step 7: Full Working Code
  • 6.11 Common Pitfalls and Troubleshooting
    • 6.11.1 Overlong DNS Caching Hides IP Changes
    • 6.11.2 Overly Frequent Health Checks Get You Banned by Proxy Providers
    • 6.11.3 SOCKS5 Proxies Need Special Handling for UDP Traffic
    • 6.11.4 Proxy Pool Exhaustion Causes Request Failures
    • 6.11.5 A Poorly Chosen Balancing Algorithm Degrades Performance
  • 6.12 Summary

6.1 Introduction: Why DNS and Proxies Matter for Crawlers

In crawler development, DNS resolution and proxy usage are two critical pieces. DNS resolution speed directly affects request latency, while the quality of the proxy pool determines the crawler's stability and its ability to evade detection. Understanding how DNS resolution works and building an efficient proxy pool are foundations of a high-performance crawler.

6.1.1 How DNS Resolution Affects Crawler Performance

Measuring the performance impact of DNS resolution:

python
import time
import socket

def test_dns_resolution(hostname: str, iterations: int = 100):
    """Measure DNS resolution performance."""
    total_time = 0
    
    for _ in range(iterations):
        start = time.time()
        try:
            socket.gethostbyname(hostname)
        except Exception as e:
            print(f"DNS resolution failed: {e}")
        elapsed = time.time() - start
        total_time += elapsed
    
    avg_time = total_time / iterations
    print(f"Average DNS resolution time: {avg_time*1000:.2f}ms")
    print(f"Total time for {iterations} requests: {total_time:.2f}s")
    
    # If every request performed a fresh DNS lookup, the total wait adds up quickly
    estimated_total = avg_time * iterations
    print(f"Estimated total time without cache: {estimated_total:.2f}s")

# Test
test_dns_resolution("www.example.com", 100)
# Sample output:
# Average DNS resolution time: 50.23ms
# Total time for 100 requests: 5.02s
# Estimated total time without cache: 5.02s

Performance problems with DNS resolution:

  1. Accumulated latency

    • Each DNS query typically takes 20-100ms
    • Across many requests, DNS latency adds up significantly
    • Without a cache, every request waits for DNS resolution
  2. Network dependency

    • DNS queries depend on network connectivity
    • On an unstable network, DNS queries may time out
    • This affects the stability of the whole crawler
  3. DNS server limits

    • Public DNS servers may rate-limit clients
    • Frequent queries can get throttled
    • You need smart retries and fallback (see the sketch below)
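
A minimal retry-and-fallback sketch using dnspython. The nameserver IPs (Cloudflare's 1.1.1.1, Google's 8.8.8.8) are just illustrative defaults; tune the retry count and timeout to your environment:

python
import dns.resolver

def resolve_with_fallback(hostname: str, nameservers=("1.1.1.1", "8.8.8.8"), retries: int = 2):
    """Try each nameserver in turn, retrying a few times, before giving up."""
    last_error = None
    for server in nameservers:
        resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
        resolver.nameservers = [server]
        resolver.lifetime = 3.0  # per-query time budget in seconds
        for _ in range(retries):
            try:
                answers = resolver.resolve(hostname, "A")
                return [str(a) for a in answers]
            except Exception as e:
                last_error = e  # remember the failure and fall through
    if last_error is None:
        last_error = RuntimeError("no nameservers configured")
    raise last_error

# ips = resolve_with_fallback("www.example.com")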

The effect of DNS caching:

python
import socket
import time
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_dns_resolve(hostname: str) -> str:
    """DNS resolution with caching."""
    return socket.gethostbyname(hostname)

def test_cached_dns(hostname: str, iterations: int = 100):
    """Measure cached DNS resolution performance."""
    # First resolution (cache miss)
    start = time.time()
    cached_dns_resolve(hostname)
    first_time = time.time() - start
    
    # Subsequent resolutions (cache hits)
    start = time.time()
    for _ in range(iterations - 1):
        cached_dns_resolve(hostname)
    cached_time = time.time() - start
    
    avg_cached_time = cached_time / (iterations - 1)
    print(f"First resolution (no cache): {first_time*1000:.2f}ms")
    print(f"Average cached resolution: {avg_cached_time*1000:.6f}ms")
    print(f"Speedup: {first_time/avg_cached_time:.0f}x")

# Test
test_cached_dns("www.example.com", 100)
# Sample output:
# First resolution (no cache): 45.23ms
# Average cached resolution: 0.000123ms
# Speedup: 367723x

6.1.2 Why a Proxy Pool Is Necessary

Why do we need a proxy pool?

  1. IP bans

    • Hitting the same site frequently gets your IP banned
    • Proxies let you rotate IPs and avoid bans
    • This improves crawler stability
  2. Geographic restrictions

    • Some sites restrict access by region
    • A proxy in the right region can bypass the restriction
    • This enables worldwide data collection
  3. Request rate control

    • A single IP can only sustain a limited request rate
    • Multiple proxies spread the requests out
    • This raises overall crawl throughput

The challenges of running a proxy pool:

python
# Problem 1: proxy quality varies wildly
proxies = [
    "http://proxy1.com:8080",  # fast and stable
    "http://proxy2.com:8080",  # slow and flaky
    "http://proxy3.com:8080",  # already dead
]

# Problem 2: proxies need health checks
# Problem 3: proxies need load balancing
# Problem 4: proxies need automatic rotation
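
Before any of that, raw proxy lists usually need basic validation. A small sketch using the standard library's urlparse; the sample URLs are placeholders:

python
from urllib.parse import urlparse

def parse_proxy_url(url: str):
    """Split 'scheme://host:port' into parts; a quick sanity check for raw proxy lists."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https", "socks4", "socks5"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname or not parsed.port:
        raise ValueError(f"missing host/port in {url!r}")
    return parsed.scheme, parsed.hostname, parsed.port

for raw in ["http://proxy1.com:8080", "socks5://proxy2.com:1080"]:
    print(parse_proxy_url(raw))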

6.1.3 Learning Goals for This Chapter

By the end of this chapter you will:

  1. Understand DNS resolution in depth

    • The complete DNS query flow
    • Cache design and optimization
    • Implementing and using DoH/DoT
  2. Master proxy pool architecture design

    • Proxy pool data structures
    • Health check mechanisms
    • Load balancing algorithms
  3. Implement a complete proxy pool system

    • Adding, removing, and querying proxies
    • Automatic health checks
    • Smart load balancing
    • Monitoring and statistics
  4. Integrate it into a crawler framework

    • Proxy rotation middleware
    • Integration with an async crawler framework
    • Performance tuning

6.2 DNS Resolution Flow in Depth

DNS (Domain Name System) is the internet's "phone book": it maps domain names to IP addresses. Understanding the resolution flow is essential for optimizing crawler performance.

6.2.1 The Complete DNS Query Flow

The complete DNS query flow (sequence diagram, summarized): the application first checks the local cache and returns the cached IP immediately on a hit. On a miss it asks the system DNS resolver, which queries a root DNS server (returning the .com TLD server address), then the TLD DNS server (returning example.com's authoritative server address), then the authoritative DNS server (returning the IP for www.example.com); the resolver updates the cache and returns the IP address.

The query, step by step:

  1. Local cache lookup

    • Check the application-level cache
    • Check the system DNS cache
    • Check the hosts file
  2. Recursive query

    • Send the query to the system's DNS resolver
    • The resolver takes responsibility for the full lookup
  3. Iterative queries

    • Start from a root DNS server
    • Work down level by level to the authoritative DNS server
    • Obtain the final IP address

Demonstrating DNS queries in Python:

python
import socket
import dns.resolver
import time

def dns_query_system(hostname: str) -> str:
    """Resolve via the system DNS."""
    start = time.time()
    try:
        ip = socket.gethostbyname(hostname)
        elapsed = time.time() - start
        print(f"System DNS: {hostname} -> {ip} ({elapsed*1000:.2f}ms)")
        return ip
    except socket.gaierror as e:
        print(f"DNS resolution failed: {e}")
        return None

def dns_query_dnspython(hostname: str) -> list:
    """Resolve via the dnspython library."""
    start = time.time()
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        ips = [str(answer) for answer in answers]
        elapsed = time.time() - start
        print(f"dnspython DNS: {hostname} -> {ips} ({elapsed*1000:.2f}ms)")
        return ips
    except Exception as e:
        print(f"DNS resolution failed: {e}")
        return []

# Usage
dns_query_system("www.example.com")
dns_query_dnspython("www.example.com")

6.2.2 Recursive vs. Iterative Queries

Recursive query:

  • The client sends a recursive query to a DNS server
  • The DNS server carries out the entire lookup
  • The client just waits for the final answer

Iterative query:

  • The DNS server returns the address of the next server to ask
  • The client performs the follow-up queries itself
  • Typically used between DNS servers

Comparing the two query types:

python
import socket

class DNSQueryType:
    """DNS query types."""
    RECURSIVE = "recursive"
    ITERATIVE = "iterative"

def recursive_query(hostname: str) -> str:
    """Recursive query (from the client's point of view)."""
    # The client sends one recursive query and waits for the final answer
    return socket.gethostbyname(hostname)

def iterative_query(hostname: str) -> str:
    """Iterative query (done by hand)."""
    # 1. Query a root DNS server
    # 2. Get the TLD server address
    # 3. Query the TLD server
    # 4. Get the authoritative server address
    # 5. Query the authoritative server
    # 6. Get the IP address
    # (a real implementation is more involved; this only sketches the idea)
    pass
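
For the curious, here is a minimal working version of that manual iteration using dnspython's low-level API. It starts at one root server (198.41.0.4 is a.root-servers.net) and follows glue records from the additional section; CNAME chains, TCP fallback, and missing-glue cases are deliberately ignored to keep the sketch short:

python
import dns.message
import dns.query
import dns.rdatatype

def manual_iterative_query(hostname: str) -> list:
    """Follow referrals from a root server down to an answer (simplified)."""
    nameserver = "198.41.0.4"  # a.root-servers.net
    for _ in range(10):  # bound the number of referrals we will follow
        query = dns.message.make_query(hostname, dns.rdatatype.A)
        response = dns.query.udp(query, nameserver, timeout=3)
        # A populated answer section means we reached the authoritative server
        ips = [str(rdata)
               for rrset in response.answer
               for rdata in rrset
               if rdata.rdtype == dns.rdatatype.A]
        if ips:
            return ips
        # Otherwise follow the referral: pick any glue A record
        next_server = next(
            (str(rdata)
             for rrset in response.additional
             for rdata in rrset
             if rdata.rdtype == dns.rdatatype.A),
            None,
        )
        if next_server is None:
            break  # no glue; a full resolver would now resolve the NS name itself
        nameserver = next_server
    return []

# print(manual_iterative_query("www.example.com"))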

6.2.3 Local Caches and System DNS

The layers of local caching:

python
class DNSCacheLevel:
    """DNS cache layers."""
    APPLICATION = "application"  # application-level cache
    SYSTEM = "system"            # system DNS cache
    HOSTS_FILE = "hosts_file"    # hosts file

# 1. Application cache (fastest)
app_cache = {}

# 2. System DNS cache (managed by the OS)
# Windows: ipconfig /displaydns
# Linux: systemd-resolve --statistics
# macOS: dscacheutil -q host

# 3. hosts file
# Windows: C:\Windows\System32\drivers\etc\hosts
# Linux/macOS: /etc/hosts

Inspecting the system DNS cache:

bash
# Windows
ipconfig /displaydns

# Linux (systemd-resolved; newer systems use `resolvectl statistics`)
systemd-resolve --statistics

# Linux (nscd)
nscd -g

# macOS
dscacheutil -q host -a name www.example.com

6.2.4 DNS Record Types Explained

Common DNS record types:

| Record type | Description | Example |
| --- | --- | --- |
| A | IPv4 address record | www.example.com -> 192.0.2.1 |
| AAAA | IPv6 address record | www.example.com -> 2001:db8::1 |
| CNAME | Alias record | www -> example.com |
| MX | Mail exchange record | example.com -> mail.example.com |
| TXT | Text record | used for SPF, DKIM, etc. |
| NS | Name server record | example.com -> ns1.example.com |

Querying different DNS record types:

python
import dns.resolver

def query_dns_records(hostname: str):
    """Query several DNS record types."""
    records = {}
    
    # A record (IPv4)
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        records['A'] = [str(answer) for answer in answers]
    except Exception:
        records['A'] = None
    
    # AAAA record (IPv6)
    try:
        answers = dns.resolver.resolve(hostname, 'AAAA')
        records['AAAA'] = [str(answer) for answer in answers]
    except Exception:
        records['AAAA'] = None
    
    # CNAME record
    try:
        answers = dns.resolver.resolve(hostname, 'CNAME')
        records['CNAME'] = [str(answer) for answer in answers]
    except Exception:
        records['CNAME'] = None
    
    # MX record
    try:
        answers = dns.resolver.resolve(hostname, 'MX')
        records['MX'] = [(str(answer.preference), str(answer.exchange)) for answer in answers]
    except Exception:
        records['MX'] = None
    
    return records

# Usage
records = query_dns_records("example.com")
print(records)

6.3 DNS Caching in Depth

DNS caching is the key mechanism for speeding up resolution. A well-designed cache strategy can substantially improve crawler performance.

6.3.1 What TTL Means and Does

What TTL (Time To Live) means:

  • TTL is the lifetime of a DNS record
  • It is how long (in seconds) the record may be served from cache
  • Once the TTL elapses, the cached record should be discarded

What TTL is for:

  1. Balancing performance against freshness

    • Short TTL: fresher answers, but more frequent queries
    • Long TTL: fewer queries, but a stale IP may be used
  2. Controlling how often caches refresh

    • Operators tune the TTL to control cache refresh rates
    • Dynamic IPs usually get a short TTL

Reading a DNS record's TTL:

python
import dns.resolver

def get_dns_ttl(hostname: str) -> int:
    """Get the TTL of a DNS record."""
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        # dnspython exposes the TTL on the answer's rrset
        ttl = answers.rrset.ttl
        return ttl
    except Exception as e:
        print(f"Failed to get TTL: {e}")
        return None

# Usage
ttl = get_dns_ttl("www.example.com")
if ttl is not None:
    print(f"TTL: {ttl} seconds ({ttl/60:.1f} minutes)")

6.3.2 Cache Strategy Design

The key ingredients of a cache strategy:

  1. Cache storage structure

    • A dict mapping hostnames to IPs
    • A timestamp for when each entry was cached
    • The TTL of each entry
  2. Expiry checks

    • On every lookup, check whether the entry has expired
    • If it has, re-resolve and refresh the cache
  3. A size limit

    • Cap the number of cache entries
    • Evict with an LRU (least recently used) policy

A complete DNS cache implementation:

python
import time
from collections import OrderedDict
from typing import Optional, List

class DNSCache:
    """A DNS cache."""
    
    def __init__(self, max_size: int = 1000, default_ttl: int = 300):
        self.cache = OrderedDict()  # {hostname: (ips, timestamp, ttl)}
        self.max_size = max_size
        self.default_ttl = default_ttl
    
    def get(self, hostname: str) -> Optional[List[str]]:
        """Return the cached record, or None."""
        if hostname not in self.cache:
            return None
        
        ips, timestamp, ttl = self.cache[hostname]
        
        # Has the entry expired?
        age = time.time() - timestamp
        if age > ttl:
            # Expired; drop it
            del self.cache[hostname]
            return None
        
        # Refresh the access order (LRU)
        self.cache.move_to_end(hostname)
        return ips
    
    def set(self, hostname: str, ips: List[str], ttl: Optional[int] = None):
        """Store a record in the cache."""
        if ttl is None:
            ttl = self.default_ttl
        
        # If the cache is full, evict the oldest entry
        if len(self.cache) >= self.max_size and hostname not in self.cache:
            self.cache.popitem(last=False)  # drop the least recently used
        
        self.cache[hostname] = (ips, time.time(), ttl)
        self.cache.move_to_end(hostname)  # refresh the access order
    
    def clear(self):
        """Empty the cache."""
        self.cache.clear()
    
    def remove(self, hostname: str):
        """Remove one hostname's entry."""
        if hostname in self.cache:
            del self.cache[hostname]
    
    def cleanup_expired(self):
        """Purge expired entries."""
        current_time = time.time()
        expired_hostnames = []
        
        for hostname, (ips, timestamp, ttl) in self.cache.items():
            if current_time - timestamp > ttl:
                expired_hostnames.append(hostname)
        
        for hostname in expired_hostnames:
            del self.cache[hostname]
        
        return len(expired_hostnames)
    
    def stats(self) -> dict:
        """Return cache statistics."""
        current_time = time.time()
        valid_count = 0
        expired_count = 0
        
        for hostname, (ips, timestamp, ttl) in self.cache.items():
            if current_time - timestamp > ttl:
                expired_count += 1
            else:
                valid_count += 1
        
        return {
            'total': len(self.cache),
            'valid': valid_count,
            'expired': expired_count,
            'max_size': self.max_size,
        }

# Usage
cache = DNSCache(max_size=100, default_ttl=300)

# Store an entry
cache.set("www.example.com", ["192.0.2.1"], ttl=300)

# Look it up
ips = cache.get("www.example.com")
print(f"Cached IPs: {ips}")

# Purge expired entries
expired_count = cache.cleanup_expired()
print(f"Cleaned up {expired_count} expired entries")

# Inspect statistics
stats = cache.stats()
print(f"Cache stats: {stats}")

6.3.3 Handling Cache Invalidation

When cache entries become invalid:

  1. TTL expiry

    • The record has outlived its TTL
    • It must be re-resolved
  2. Deliberate invalidation

    • An IP change was detected
    • The cache is cleared by hand
  3. Invalidation on error

    • A DNS query failed
    • Clear the possibly-wrong cached entry

A smarter invalidation strategy:

python
class SmartDNSCache(DNSCache):
    """A DNS cache with failure-aware invalidation."""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failure_count = {}  # {hostname: failure_count}
        self.max_failures = 3
    
    def mark_failure(self, hostname: str):
        """Record a failed DNS query."""
        self.failure_count[hostname] = self.failure_count.get(hostname, 0) + 1
        
        # Too many failures: drop the cached entry
        if self.failure_count[hostname] >= self.max_failures:
            self.remove(hostname)
            del self.failure_count[hostname]
    
    def mark_success(self, hostname: str):
        """Record a successful DNS query."""
        if hostname in self.failure_count:
            del self.failure_count[hostname]
    
    def get_with_fallback(self, hostname: str, resolver_func) -> Optional[List[str]]:
        """Serve from cache, re-resolving on a miss."""
        # Try the cache first
        ips = self.get(hostname)
        if ips:
            return ips
        
        # Cache miss or expired entry: resolve again
        try:
            ips = resolver_func(hostname)
            if ips:
                self.set(hostname, ips)
                self.mark_success(hostname)
            return ips
        except Exception:
            self.mark_failure(hostname)
            raise

# Usage
cache = SmartDNSCache()

def resolve_dns(hostname: str) -> List[str]:
    """A plain DNS resolver function."""
    import socket
    ip = socket.gethostbyname(hostname)
    return [ip]

# Cached lookup with fallback
ips = cache.get_with_fallback("www.example.com", resolve_dns)
print(f"Resolved IPs: {ips}")

6.3.4 Multi-Level Cache Architecture

The multi-level design (flowchart, summarized): the application first checks the L1 application cache; on a miss it falls through to the L2 system DNS cache; only if that also misses does it query a DNS server (L3).

A multi-level cache implementation:

python
class MultiLevelDNSCache:
    """A two-level DNS cache."""
    
    def __init__(self):
        self.l1_cache = DNSCache(max_size=100, default_ttl=60)   # application cache (short TTL)
        self.l2_cache = DNSCache(max_size=1000, default_ttl=300) # second-level cache (long TTL)
    
    def get(self, hostname: str) -> Optional[List[str]]:
        """Look up through both levels."""
        # L1 lookup
        ips = self.l1_cache.get(hostname)
        if ips:
            return ips
        
        # L2 lookup
        ips = self.l2_cache.get(hostname)
        if ips:
            # Promote into L1
            self.l1_cache.set(hostname, ips)
            return ips
        
        return None
    
    def set(self, hostname: str, ips: List[str], ttl: Optional[int] = None):
        """Store into both levels."""
        self.l1_cache.set(hostname, ips, ttl)
        self.l2_cache.set(hostname, ips, ttl)
    
    def clear_all(self):
        """Empty both levels."""
        self.l1_cache.clear()
        self.l2_cache.clear()

# Usage
multi_cache = MultiLevelDNSCache()
multi_cache.set("www.example.com", ["192.0.2.1"])
ips = multi_cache.get("www.example.com")
print(f"Resolved from cache: {ips}")

6.4 Implementing DNS over HTTPS/TLS

DNS over HTTPS (DoH) and DNS over TLS (DoT) are encrypted DNS transports that offer better privacy and security.

6.4.1 How DoH Works

How DoH operates:

  1. HTTPS transport

    • DNS queries travel over HTTPS (typically HTTP/2)
    • RFC 8484 carries the binary DNS message as a GET parameter or POST body
    • The whole exchange is encrypted in transit
  2. JSON API variant

    • Many DoH providers also expose a JSON API
    • Both the request parameters and the response are JSON
  3. Well-known endpoints

    • Cloudflare: https://cloudflare-dns.com/dns-query
    • Google: https://dns.google/resolve
    • Quad9: https://dns.quad9.net/dns-query

A DoH query (JSON variant):

python
import aiohttp
from typing import List

async def doh_query_json(hostname: str, doh_server: str = "https://cloudflare-dns.com/dns-query") -> List[str]:
    """DNS query over DoH (JSON API)."""
    params = {
        'name': hostname,
        'type': 'A',  # A record
    }
    headers = {
        'Accept': 'application/dns-json',
    }
    
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(
                doh_server,
                params=params,
                headers=headers,
                timeout=aiohttp.ClientTimeout(total=5),
            ) as resp:
                if resp.status == 200:
                    # content_type=None: DoH servers answer with application/dns-json,
                    # which aiohttp's json() would otherwise reject
                    data = await resp.json(content_type=None)
                    
                    # Parse the response
                    ips = []
                    if 'Answer' in data:
                        for answer in data['Answer']:
                            if answer.get('type') == 1:  # A record
                                ips.append(answer['data'])
                    
                    return ips
                else:
                    print(f"DoH query failed with status {resp.status}")
                    return []
        except Exception as e:
            print(f"DoH query error: {e}")
            return []

# Usage
import asyncio

async def main():
    ips = await doh_query_json("www.example.com")
    print(f"Resolved IPs: {ips}")

# asyncio.run(main())
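
The JSON endpoints above are provider-specific conveniences; RFC 8484 proper exchanges binary DNS messages over HTTPS. dnspython wraps this in dns.query.https (it needs dnspython's DoH dependency installed: httpx in recent versions, requests in older ones). A minimal sketch:

python
import dns.message
import dns.query
import dns.rdatatype

def doh_query_wireformat(hostname: str, endpoint: str = "https://cloudflare-dns.com/dns-query") -> list:
    """RFC 8484 wire-format DoH query via dnspython."""
    query = dns.message.make_query(hostname, dns.rdatatype.A)
    response = dns.query.https(query, endpoint)
    return [str(rdata)
            for rrset in response.answer
            for rdata in rrset
            if rdata.rdtype == dns.rdatatype.A]

# print(doh_query_wireformat("www.example.com"))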

6.4.2 How DoT Works

How DoT operates:

  1. TLS connection

    • A TLS connection is made to TCP port 853
    • Standard DNS messages travel over it, encrypted
  2. DNS over TLS

    • The wire format is ordinary DNS
    • Only the transport is wrapped in TLS

A DoT client (requires dnspython):

python
import ssl
from typing import List

import dns.message
import dns.query
import dns.rdatatype

def dot_query(hostname: str, dot_server: str = "1.1.1.1", port: int = 853) -> List[str]:
    """DNS query over DoT."""
    # Build the DNS query message
    query = dns.message.make_query(hostname, dns.rdatatype.A)
    
    # Create a TLS context
    context = ssl.create_default_context()
    
    # Send the DoT query; server_hostname is needed for certificate
    # verification when we supply our own context
    try:
        response = dns.query.tls(query, dot_server, port=port,
                                 ssl_context=context, server_hostname=dot_server)
        
        # Parse the response
        ips = []
        for answer in response.answer:
            for rdata in answer:
                if rdata.rdtype == dns.rdatatype.A:
                    ips.append(str(rdata))
        
        return ips
    except Exception as e:
        print(f"DoT query error: {e}")
        return []

# Usage
ips = dot_query("www.example.com")
print(f"Resolved IPs: {ips}")

6.4.3 Security Advantages of DoH/DoT

Security advantages:

  1. Encrypted transport

    • DNS queries are encrypted against eavesdropping
    • Query privacy is preserved
  2. Protection against DNS hijacking

    • HTTPS/TLS defeats man-in-the-middle tampering
    • The server certificate is verified
  3. Evading DNS poisoning

    • Queries go to a trusted DoH/DoT server
    • Local DNS poisoning is bypassed

Comparing against plain DNS:

python
import asyncio
import socket
import time

def compare_dns_methods(hostname: str):
    """Compare the different DNS query methods."""
    methods = {
        'System DNS': lambda h: [socket.gethostbyname(h)],
        'DoH (Cloudflare)': lambda h: asyncio.run(doh_query_json(h)),
        'DoT (Cloudflare)': lambda h: dot_query(h, "1.1.1.1"),
    }
    
    results = {}
    for method_name, method_func in methods.items():
        try:
            start = time.time()
            ips = method_func(hostname)
            elapsed = time.time() - start
            results[method_name] = {
                'ips': ips,
                'time': elapsed * 1000,
                'success': True,
            }
        except Exception as e:
            results[method_name] = {
                'ips': None,
                'time': None,
                'success': False,
                'error': str(e),
            }
    
    return results

# Usage
results = compare_dns_methods("www.example.com")
for method, result in results.items():
    print(f"{method}: {result}")

6.4.4 A DoH Client Implementation

A complete DoH client:

python
import aiohttp
import asyncio
import time
from typing import List, Optional, Dict

class DoHClient:
    """A DNS over HTTPS client."""
    
    def __init__(
        self,
        doh_servers: Optional[List[str]] = None,
        timeout: float = 5.0,
        cache: Optional[DNSCache] = None,
    ):
        self.doh_servers = doh_servers or [
            "https://cloudflare-dns.com/dns-query",
            "https://dns.google/resolve",
            "https://dns.quad9.net/dns-query",
        ]
        self.timeout = timeout
        self.cache = cache
        self.session = None
    
    async def _get_session(self):
        """Create the aiohttp session lazily."""
        if self.session is None:
            self.session = aiohttp.ClientSession()
        return self.session
    
    async def query(
        self,
        hostname: str,
        record_type: str = 'A',
        use_cache: bool = True,
    ) -> List[str]:
        """Query a DNS record."""
        # Check the cache first
        if use_cache and self.cache:
            cached_ips = self.cache.get(hostname)
            if cached_ips:
                return cached_ips
        
        # Try each DoH server in turn
        last_error = None
        for doh_server in self.doh_servers:
            try:
                ips = await self._query_server(hostname, record_type, doh_server)
                if ips:
                    # Update the cache
                    if use_cache and self.cache:
                        self.cache.set(hostname, ips)
                    return ips
            except Exception as e:
                last_error = e
                continue
        
        # Every server failed
        if last_error:
            raise last_error
        return []
    
    async def _query_server(
        self,
        hostname: str,
        record_type: str,
        doh_server: str,
    ) -> List[str]:
        """Query one DoH server."""
        session = await self._get_session()
        params = {
            'name': hostname,
            'type': record_type,
        }
        headers = {
            'Accept': 'application/dns-json',
        }
        
        async with session.get(
            doh_server,
            params=params,
            headers=headers,
            timeout=aiohttp.ClientTimeout(total=self.timeout),
        ) as resp:
            if resp.status != 200:
                raise Exception(f"DoH server returned status {resp.status}")
            
            # content_type=None: DoH servers answer with application/dns-json,
            # which aiohttp's json() would otherwise reject
            data = await resp.json(content_type=None)
            
            # Parse the response
            ips = []
            if 'Answer' in data:
                for answer in data['Answer']:
                    if answer.get('type') == 1:  # A record
                        ips.append(answer['data'])
            
            return ips
    
    async def close(self):
        """Shut the client down."""
        if self.session:
            await self.session.close()
            self.session = None

# Usage
async def main():
    cache = DNSCache()
    doh_client = DoHClient(cache=cache)
    
    # Resolve
    ips = await doh_client.query("www.example.com")
    print(f"Resolved IPs: {ips}")
    
    # Resolve again (served from cache)
    ips = await doh_client.query("www.example.com")
    print(f"Cached IPs: {ips}")
    
    await doh_client.close()

# asyncio.run(main())

6.5 Proxy Pool Architecture Design

The proxy pool is a core component of a crawler system; it manages, schedules, and maintains proxy resources.

6.5.1 Proxy Pool Data Structures

The core data structure, a proxy object:

python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import time

class ProxyType(Enum):
    """Proxy types."""
    HTTP = "http"
    HTTPS = "https"
    SOCKS4 = "socks4"
    SOCKS5 = "socks5"

@dataclass
class Proxy:
    """A proxy."""
    host: str
    port: int
    proxy_type: ProxyType
    username: Optional[str] = None
    password: Optional[str] = None
    
    # State
    is_active: bool = True
    success_count: int = 0
    failure_count: int = 0
    last_check_time: Optional[float] = None
    last_success_time: Optional[float] = None
    response_time: float = 0.0  # average response time (seconds)
    
    # Metadata
    location: Optional[str] = None  # geographic location
    provider: Optional[str] = None  # proxy provider
    
    def __str__(self) -> str:
        """The proxy as a URL string."""
        if self.username and self.password:
            return f"{self.proxy_type.value}://{self.username}:{self.password}@{self.host}:{self.port}"
        else:
            return f"{self.proxy_type.value}://{self.host}:{self.port}"
    
    def to_dict(self) -> dict:
        """Convert to a dict."""
        return {
            'host': self.host,
            'port': self.port,
            'type': self.proxy_type.value,
            'username': self.username,
            'password': self.password,
            'is_active': self.is_active,
            'success_count': self.success_count,
            'failure_count': self.failure_count,
            'last_check_time': self.last_check_time,
            'last_success_time': self.last_success_time,
            'response_time': self.response_time,
            'location': self.location,
            'provider': self.provider,
        }
    
    @property
    def success_rate(self) -> float:
        """Fraction of successful requests."""
        total = self.success_count + self.failure_count
        if total == 0:
            return 0.0
        return self.success_count / total
    
    @property
    def url(self) -> str:
        """The proxy URL."""
        return str(self)

# Usage
proxy = Proxy(
    host="proxy.example.com",
    port=8080,
    proxy_type=ProxyType.HTTP,
    username="user",
    password="pass",
)
print(f"Proxy URL: {proxy.url}")
print(f"Success rate: {proxy.success_rate:.2%}")

The pool structure itself:

python
from collections import deque
from typing import Dict, List, Optional, Set
import threading

class ProxyPool:
    """The proxy pool data structure."""
    
    def __init__(self):
        # Primary store: {proxy_id: Proxy}
        self.proxies: Dict[str, Proxy] = {}
        
        # Set of active proxies (fast membership checks)
        self.active_proxies: Set[str] = set()
        
        # Queue of proxies (for round-robin)
        self.proxy_queue: deque = deque()
        
        # Grouped by type
        self.proxies_by_type: Dict[ProxyType, List[str]] = {
            ProxyType.HTTP: [],
            ProxyType.HTTPS: [],
            ProxyType.SOCKS4: [],
            ProxyType.SOCKS5: [],
        }
        
        # Lock for thread safety
        self.lock = threading.Lock()
    
    def add_proxy(self, proxy: Proxy, proxy_id: Optional[str] = None) -> str:
        """Add a proxy."""
        if proxy_id is None:
            proxy_id = f"{proxy.host}:{proxy.port}"
        
        with self.lock:
            self.proxies[proxy_id] = proxy
            if proxy.is_active:
                self.active_proxies.add(proxy_id)
                self.proxy_queue.append(proxy_id)
            
            # Index by type
            self.proxies_by_type[proxy.proxy_type].append(proxy_id)
        
        return proxy_id
    
    def remove_proxy(self, proxy_id: str):
        """Remove a proxy."""
        with self.lock:
            if proxy_id in self.proxies:
                proxy = self.proxies[proxy_id]
                del self.proxies[proxy_id]
                self.active_proxies.discard(proxy_id)
                
                # Drop it from the queue
                if proxy_id in self.proxy_queue:
                    self.proxy_queue.remove(proxy_id)
                
                # Drop it from the type index
                if proxy_id in self.proxies_by_type[proxy.proxy_type]:
                    self.proxies_by_type[proxy.proxy_type].remove(proxy_id)
    
    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """Fetch a proxy by id."""
        return self.proxies.get(proxy_id)
    
    def get_active_proxies(self) -> List[Proxy]:
        """Return all active proxies."""
        with self.lock:
            return [self.proxies[pid] for pid in self.active_proxies if pid in self.proxies]
    
    def get_proxies_by_type(self, proxy_type: ProxyType) -> List[Proxy]:
        """Return proxies of one type."""
        with self.lock:
            return [self.proxies[pid] for pid in self.proxies_by_type[proxy_type] if pid in self.proxies]

# Usage
pool = ProxyPool()

proxy1 = Proxy("proxy1.com", 8080, ProxyType.HTTP)
proxy2 = Proxy("proxy2.com", 8080, ProxyType.SOCKS5)

pool.add_proxy(proxy1)
pool.add_proxy(proxy2)

active = pool.get_active_proxies()
print(f"Active proxies: {len(active)}")

6.5.2 Proxy Types Explained (HTTP/HTTPS/SOCKS4/SOCKS5)

Proxy type comparison:

| Proxy type | Layer | HTTPS support | UDP support | Authentication | Typical use |
| --- | --- | --- | --- | --- | --- |
| HTTP | Application | no (plain HTTP) | no | Basic | simple HTTP requests |
| HTTPS | Application | yes (CONNECT tunnel) | no | Basic | HTTPS requests |
| SOCKS4 | Transport | yes (any TCP) | no | none | simple TCP connections |
| SOCKS5 | Transport | yes (any TCP) | yes | several | complex network scenarios |

What distinguishes the proxy types (a raw CONNECT example follows this list):

  1. HTTP/HTTPS proxies

    • Operate at the application layer
    • Speak the HTTP protocol
    • Need the CONNECT method to tunnel other traffic
  2. SOCKS4 proxies

    • Operate at the transport layer
    • Support TCP connections
    • No UDP or IPv6 support
  3. SOCKS5 proxies

    • Operate at the transport layer
    • Support both TCP and UDP
    • Support IPv4 and IPv6
    • Support several authentication methods
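
To make CONNECT tunneling concrete, here is a minimal raw-socket sketch of what an HTTP client library does under the hood when it sends HTTPS traffic through an HTTP proxy (the proxy host and target below are placeholders, and proxy authentication is omitted):

python
import socket

def http_connect_tunnel(proxy_host: str, proxy_port: int,
                        target_host: str, target_port: int = 443) -> socket.socket:
    """Open a CONNECT tunnel through an HTTP proxy and return the raw socket."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=5)
    request = (
        f"CONNECT {target_host}:{target_port} HTTP/1.1\r\n"
        f"Host: {target_host}:{target_port}\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))
    # The proxy answers with a status line; 200 means the tunnel is open
    # (a robust client would read until the blank line ending the headers)
    reply = sock.recv(4096).decode("ascii", errors="replace")
    status_line = reply.split("\r\n", 1)[0]
    if " 200" not in status_line:
        sock.close()
        raise ConnectionError(f"CONNECT failed: {status_line}")
    return sock  # the caller would now wrap this socket in TLS toward target_host

# tunnel = http_connect_tunnel("proxy.example.com", 8080, "www.example.com")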

Proxy URL formats:

python
def format_proxy_url(proxy: Proxy) -> str:
    """Format a proxy as a URL."""
    if proxy.proxy_type == ProxyType.HTTP:
        scheme = "http"
    elif proxy.proxy_type == ProxyType.HTTPS:
        scheme = "https"
    elif proxy.proxy_type == ProxyType.SOCKS4:
        scheme = "socks4"
    elif proxy.proxy_type == ProxyType.SOCKS5:
        scheme = "socks5"
    else:
        raise ValueError(f"Unknown proxy type: {proxy.proxy_type}")
    
    if proxy.username and proxy.password:
        return f"{scheme}://{proxy.username}:{proxy.password}@{proxy.host}:{proxy.port}"
    else:
        return f"{scheme}://{proxy.host}:{proxy.port}"

# Usage
proxy = Proxy("proxy.com", 8080, ProxyType.SOCKS5, "user", "pass")
url = format_proxy_url(proxy)
print(f"Proxy URL: {url}")
# Output: socks5://user:pass@proxy.com:8080

6.5.3 Proxy Pool Interface Design

The pool's core interface:

python
from abc import ABC, abstractmethod
from typing import Optional, List

class IProxyPool(ABC):
    """Proxy pool interface."""
    
    @abstractmethod
    def add_proxy(self, proxy: Proxy) -> str:
        """Add a proxy."""
        pass
    
    @abstractmethod
    def remove_proxy(self, proxy_id: str):
        """Remove a proxy."""
        pass
    
    @abstractmethod
    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """Fetch a proxy by id."""
        pass
    
    @abstractmethod
    def get_next_proxy(self, strategy: str = "round_robin") -> Optional[Proxy]:
        """Pick the next proxy according to a strategy."""
        pass
    
    @abstractmethod
    def mark_success(self, proxy_id: str, response_time: float):
        """Record a successful use of a proxy."""
        pass
    
    @abstractmethod
    def mark_failure(self, proxy_id: str):
        """Record a failed use of a proxy."""
        pass
    
    @abstractmethod
    def get_stats(self) -> dict:
        """Return pool statistics."""
        pass

6.5.4 Proxy Pool Architecture Diagram

The overall architecture (diagram, summarized): the crawler application talks to a proxy pool manager, which sits on top of three parts. Proxy storage holds the active proxy queue and per-proxy metadata; the health checker runs a scheduled check task and evaluates health state; the load balancer offers round-robin, random, weighted, and least-connections algorithms. A monitoring dashboard exposes statistics and real-time status.

6.6 Proxy Health Checking

Health checking is the key mechanism for keeping the pool's quality up.

6.6.1 Health Check Design Principles

Design principles:

  1. Non-intrusive

    • Checks must not interfere with normal use
    • Run them as an independent task
  2. Rate-controlled

    • Avoid checking too often
    • Don't get the checker banned by the proxy provider
  3. Multi-dimensional

    • Response time
    • Success rate
    • Availability
  4. Asynchronous

    • Checks should run asynchronously
    • They must not block the main flow

6.6.2 Health Check Implementation Methods

Ways to check health:

  1. HTTP request probe

    • Send an HTTP request to a test URL
    • Check the status code
    • Measure the response time
  2. Connection probe

    • Attempt a TCP connection
    • Measure connection setup time
  3. Real-request probe

    • Send genuine requests through the proxy
    • Record success rate and response time

A complete health checker:

python
import asyncio
import aiohttp
import time
from typing import List, Optional

class ProxyHealthChecker:
    """Proxy health checker."""
    
    def __init__(
        self,
        test_url: str = "http://httpbin.org/ip",
        timeout: float = 5.0,
        max_concurrent: int = 10,
    ):
        self.test_url = test_url
        self.timeout = timeout
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def check_proxy(self, proxy: Proxy) -> dict:
        """Check a single proxy."""
        async with self.semaphore:
            start_time = time.time()
            result = {
                'proxy_id': f"{proxy.host}:{proxy.port}",
                'success': False,
                'response_time': None,
                'error': None,
            }
            
            try:
                # Build the proxy URL
                proxy_url = proxy.url
                
                # Build a connector (SOCKS needs aiohttp-socks)
                connector = None
                if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                    try:
                        from aiohttp_socks import ProxyConnector
                        connector = ProxyConnector.from_url(proxy_url)
                    except ImportError:
                        result['error'] = "aiohttp-socks not installed"
                        return result
                else:
                    connector = aiohttp.TCPConnector()
                
                # Send the probe request; note that aiohttp's proxy= argument
                # only accepts http:// proxy URLs
                timeout = aiohttp.ClientTimeout(total=self.timeout)
                async with aiohttp.ClientSession(connector=connector) as session:
                    async with session.get(
                        self.test_url,
                        proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                        timeout=timeout,
                    ) as resp:
                        if resp.status == 200:
                            result['success'] = True
                            result['response_time'] = time.time() - start_time
                        else:
                            result['error'] = f"HTTP {resp.status}"
                
            except asyncio.TimeoutError:
                result['error'] = "Timeout"
            except Exception as e:
                result['error'] = str(e)
            
            return result
    
    async def check_proxies(self, proxies: List[Proxy]) -> List[dict]:
        """Check proxies in bulk."""
        tasks = [self.check_proxy(proxy) for proxy in proxies]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Normalize exceptions into result dicts
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        
        return processed_results

# Usage
async def main():
    checker = ProxyHealthChecker()
    
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    result = await checker.check_proxy(proxy)
    print(f"Health check result: {result}")

# asyncio.run(main())

6.6.3 Controlling Health Check Frequency

Frequency control strategies:

  1. Time-interval based

    • Check at a fixed interval
    • For example, every 5 minutes
  2. Usage based

    • Heavily used proxies get checked more often
    • Rarely used proxies get checked less
  3. Failure-rate based

    • Proxies with high failure rates get checked more often
    • Stable proxies get checked less

Adaptive frequency control:

python
import time
from typing import Optional

class SmartHealthChecker(ProxyHealthChecker):
    """Health checker with adaptive frequency control."""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.check_intervals = {}  # {proxy_id: next_check_time}
        self.base_interval = 300  # base interval (5 minutes)
        self.min_interval = 60    # minimum interval (1 minute)
        self.max_interval = 3600  # maximum interval (1 hour)
    
    def get_check_interval(self, proxy: Proxy) -> float:
        """Compute the check interval for one proxy."""
        # Scale the interval by success rate
        success_rate = proxy.success_rate
        
        if success_rate < 0.5:
            # Low success rate: check often
            interval = self.min_interval
        elif success_rate < 0.8:
            # Middling success rate
            interval = self.base_interval
        else:
            # High success rate: check rarely
            interval = self.max_interval
        
        return interval
    
    def should_check(self, proxy: Proxy) -> bool:
        """Is this proxy due for a check?"""
        proxy_id = f"{proxy.host}:{proxy.port}"
        current_time = time.time()
        
        if proxy_id not in self.check_intervals:
            return True
        
        next_check_time = self.check_intervals[proxy_id]
        return current_time >= next_check_time
    
    def update_check_time(self, proxy: Proxy):
        """Schedule the next check."""
        proxy_id = f"{proxy.host}:{proxy.port}"
        interval = self.get_check_interval(proxy)
        self.check_intervals[proxy_id] = time.time() + interval
    
    async def check_if_needed(self, proxy: Proxy) -> Optional[dict]:
        """Check the proxy only if it is due."""
        if not self.should_check(proxy):
            return None
        
        result = await self.check_proxy(proxy)
        self.update_check_time(proxy)
        return result

# Usage
async def main():
    checker = SmartHealthChecker()
    
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    proxy.success_count = 10
    proxy.failure_count = 2
    
    if checker.should_check(proxy):
        result = await checker.check_if_needed(proxy)
        print(f"Check result: {result}")

# asyncio.run(main())

6.6.4 Health Evaluation Criteria

Scoring a proxy's health:

python
import time

class ProxyHealthEvaluator:
    """Evaluates proxy health."""
    
    @staticmethod
    def evaluate(proxy: Proxy) -> str:
        """Grade a proxy's health."""
        # Compute a health score (0-100)
        score = 0
        
        # Success rate (up to 40 points)
        success_rate = proxy.success_rate
        score += success_rate * 40
        
        # Response time (up to 30 points)
        if proxy.response_time > 0:
            if proxy.response_time < 1.0:
                score += 30
            elif proxy.response_time < 3.0:
                score += 20
            elif proxy.response_time < 5.0:
                score += 10
        
        # Active flag (20 points)
        if proxy.is_active:
            score += 20
        
        # Recency of last success (up to 10 points)
        if proxy.last_success_time:
            time_since_success = time.time() - proxy.last_success_time
            if time_since_success < 300:  # succeeded within 5 minutes
                score += 10
            elif time_since_success < 1800:  # succeeded within 30 minutes
                score += 5
        
        # Map the score to a grade
        if score >= 80:
            return "excellent"
        elif score >= 60:
            return "good"
        elif score >= 40:
            return "fair"
        else:
            return "poor"
    
    @staticmethod
    def should_remove(proxy: Proxy) -> bool:
        """Decide whether a proxy should be dropped from the pool."""
        # Failure rate too high
        if proxy.failure_count > 10 and proxy.success_rate < 0.2:
            return True
        
        # No success for too long
        if proxy.last_success_time:
            time_since_success = time.time() - proxy.last_success_time
            if time_since_success > 3600:  # no success for an hour
                return True
        
        return False

# Usage
evaluator = ProxyHealthEvaluator()

proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
proxy.success_count = 8
proxy.failure_count = 2
proxy.response_time = 0.5
proxy.is_active = True
proxy.last_success_time = time.time()

health = evaluator.evaluate(proxy)
print(f"Proxy health: {health}")

should_remove = evaluator.should_remove(proxy)
print(f"Should remove: {should_remove}")

6.7 Load Balancing Algorithms

The load balancing algorithm decides which proxy to hand out next; different algorithms suit different workloads.

6.7.1 Round Robin

How round robin works:

  • Pick proxies in order, one after another
  • Requests are shared out evenly
  • Trivial to implement

Implementation:

python
import threading
from typing import Optional

class RoundRobinBalancer:
    """Round-robin load balancer."""
    
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
        self.current_index = 0
        self.lock = threading.Lock()
    
    def get_next_proxy(self) -> Optional[Proxy]:
        """Return the next proxy in order."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        
        with self.lock:
            # The active list may have shrunk since last time; clamp the index
            self.current_index %= len(active_proxies)
            proxy = active_proxies[self.current_index]
            self.current_index = (self.current_index + 1) % len(active_proxies)
            return proxy

# Usage
pool = ProxyPool()
# ... add proxies ...
balancer = RoundRobinBalancer(pool)

for _ in range(10):
    proxy = balancer.get_next_proxy()
    if proxy:
        print(f"Selected proxy: {proxy.host}")

6.7.2 Random

How random selection works:

  • Pick a proxy at random
  • Avoids hammering one hot proxy
  • Fits pools where proxy quality is roughly uniform

Implementation:

python
import random
from typing import Optional

class RandomBalancer:
    """Random load balancer."""
    
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
    
    def get_next_proxy(self) -> Optional[Proxy]:
        """Return a random proxy."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        return random.choice(active_proxies)

6.7.3 Weighted

How weighted selection works:

  • Assign each proxy a weight based on its quality
  • Better proxies are picked with higher probability
  • Fits pools where proxy quality varies a lot

Implementation:

python
import random
from typing import Optional

class WeightedBalancer:
    """Weighted load balancer."""
    
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
    
    def calculate_weight(self, proxy: Proxy) -> float:
        """Compute a proxy's weight."""
        # Based on success rate and response time
        success_rate = proxy.success_rate
        response_time = proxy.response_time if proxy.response_time > 0 else 5.0
        
        # weight = success_rate * (1 / response_time) * 100
        weight = success_rate * (1.0 / response_time) * 100
        return max(weight, 0.1)  # floor the weight at 0.1
    
    def get_next_proxy(self) -> Optional[Proxy]:
        """Return a weighted-random proxy."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        
        # Compute the weights
        weights = [self.calculate_weight(p) for p in active_proxies]
        total_weight = sum(weights)
        
        if total_weight == 0:
            return random.choice(active_proxies)
        
        # Weighted random selection
        r = random.uniform(0, total_weight)
        cumulative = 0
        for proxy, weight in zip(active_proxies, weights):
            cumulative += weight
            if r <= cumulative:
                return proxy
        
        return active_proxies[-1]  # fallback
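
As a design note, the cumulative-sum loop above is exactly what the standard library's random.choices does when given a weights sequence, so the selection step can also be written as:

python
import random

def pick_weighted(active_proxies, weights):
    """Weighted random pick via random.choices (k=1 returns a one-element list)."""
    if not active_proxies:
        return None
    return random.choices(active_proxies, weights=weights, k=1)[0]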

6.7.4 Least Connections

How least-connections works:

  • Pick the proxy with the fewest connections in flight
  • Balances load across proxies
  • Fits workloads with long-lived connections

Implementation:

python
import threading
from typing import Optional

class LeastConnectionsBalancer:
    """Least-connections load balancer."""
    
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
        self.connection_count = {}  # {proxy_id: count}
        self.lock = threading.Lock()
    
    def get_next_proxy(self) -> Optional[Proxy]:
        """Return the proxy with the fewest open connections."""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        
        # Find the least-loaded proxy
        min_connections = float('inf')
        selected_proxy = None
        
        with self.lock:
            for proxy in active_proxies:
                proxy_id = f"{proxy.host}:{proxy.port}"
                count = self.connection_count.get(proxy_id, 0)
                if count < min_connections:
                    min_connections = count
                    selected_proxy = proxy
        
        return selected_proxy
    
    def increment_connections(self, proxy: Proxy):
        """Record a new connection."""
        proxy_id = f"{proxy.host}:{proxy.port}"
        with self.lock:
            self.connection_count[proxy_id] = self.connection_count.get(proxy_id, 0) + 1
    
    def decrement_connections(self, proxy: Proxy):
        """Record a finished connection."""
        proxy_id = f"{proxy.host}:{proxy.port}"
        with self.lock:
            count = self.connection_count.get(proxy_id, 0)
            if count > 0:
                self.connection_count[proxy_id] = count - 1

6.7.5 Algorithm Performance Comparison

A comparison harness:

python
import time

def compare_balancers(proxy_pool: ProxyPool, iterations: int = 1000):
    """Compare the load balancing algorithms."""
    balancers = {
        'Round Robin': RoundRobinBalancer(proxy_pool),
        'Random': RandomBalancer(proxy_pool),
        'Weighted': WeightedBalancer(proxy_pool),
        'Least Connections': LeastConnectionsBalancer(proxy_pool),
    }
    
    results = {}
    for name, balancer in balancers.items():
        start = time.time()
        selection_count = {}
        
        for _ in range(iterations):
            proxy = balancer.get_next_proxy()
            if proxy:
                proxy_id = f"{proxy.host}:{proxy.port}"
                selection_count[proxy_id] = selection_count.get(proxy_id, 0) + 1
        
        elapsed = time.time() - start
        
        # Measure how even the selection distribution is (standard deviation)
        counts = list(selection_count.values())
        if counts:
            mean = sum(counts) / len(counts)
            variance = sum((x - mean) ** 2 for x in counts) / len(counts)
            std_dev = variance ** 0.5
        else:
            std_dev = 0
        
        results[name] = {
            'time': elapsed * 1000,  # milliseconds
            'std_dev': std_dev,
            'distribution': selection_count,
        }
    
    return results

# Usage
# results = compare_balancers(proxy_pool, 1000)
# for name, result in results.items():
#     print(f"{name}: {result['time']:.2f}ms, std_dev: {result['std_dev']:.2f}")

6.8 Toolchain: DNS and Proxy Tools

6.8.1 DNS Queries with dnspython

Install dnspython:

bash
pip install dnspython

Basic usage:

python
import dns.resolver
from typing import List

def query_dns(hostname: str, record_type: str = 'A') -> List[str]:
    """Query DNS via dnspython."""
    try:
        answers = dns.resolver.resolve(hostname, record_type)
        return [str(answer) for answer in answers]
    except Exception as e:
        print(f"DNS query failed: {e}")
        return []

# Usage
ips = query_dns("www.example.com")
print(f"IPs: {ips}")

6.8.2 DNS Queries via DoH Services

Using Cloudflare's DoH endpoint:

python
from typing import List

async def query_doh_cloudflare(hostname: str) -> List[str]:
    """Query via Cloudflare DoH (reuses doh_query_json from 6.4.1)."""
    doh_url = "https://cloudflare-dns.com/dns-query"
    return await doh_query_json(hostname, doh_url)
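
Called from synchronous code, this coroutine needs an event loop, for example:

python
import asyncio

async def main():
    ips = await query_doh_cloudflare("www.example.com")
    print(f"Resolved IPs: {ips}")

asyncio.run(main())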

6.8.3 Configuring Proxies with httpx/aiohttp

Proxies with httpx:

python
import httpx

# HTTP proxy
proxy_url = "http://proxy.example.com:8080"
client = httpx.Client(proxies=proxy_url)  # newer httpx versions (0.26+) rename this argument to `proxy`

# SOCKS5 proxy (requires httpx[socks])
proxy_url = "socks5://proxy.example.com:1080"
client = httpx.Client(proxies=proxy_url)

# Send a request through the proxy
response = client.get("https://httpbin.org/ip")

Proxies with aiohttp:

python
import aiohttp
import asyncio

async def main():
    # HTTP proxy (per-request `proxy` argument)
    proxy_url = "http://proxy.example.com:8080"
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.org/ip", proxy=proxy_url) as resp:
            print(await resp.text())

# asyncio.run(main())

6.8.4 Managing Proxy Pool Data with Redis

A Redis-backed proxy pool:

python
import redis
from typing import List, Optional

class RedisProxyPool:
    """A Redis-backed proxy pool."""
    
    def __init__(self, redis_host: str = "localhost", redis_port: int = 6379):
        self.redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)
        self.proxy_key_prefix = "proxy:"
        self.active_set_key = "proxies:active"
    
    def add_proxy(self, proxy: Proxy, proxy_id: Optional[str] = None) -> str:
        """Store a proxy in Redis."""
        if proxy_id is None:
            proxy_id = f"{proxy.host}:{proxy.port}"
        
        key = f"{self.proxy_key_prefix}{proxy_id}"
        # redis-py rejects None and bool values, so drop empty fields and stringify the rest
        mapping = {k: str(v) for k, v in proxy.to_dict().items() if v is not None}
        self.redis_client.hset(key, mapping=mapping)
        
        if proxy.is_active:
            self.redis_client.sadd(self.active_set_key, proxy_id)
        
        return proxy_id
    
    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """Load a proxy from Redis."""
        key = f"{self.proxy_key_prefix}{proxy_id}"
        data = self.redis_client.hgetall(key)
        
        if not data:
            return None
        
        return Proxy(
            host=data['host'],
            port=int(data['port']),
            proxy_type=ProxyType(data['type']),
            username=data.get('username'),
            password=data.get('password'),
            is_active=data.get('is_active', 'True') == 'True',
            success_count=int(data.get('success_count', 0)),
            failure_count=int(data.get('failure_count', 0)),
        )
    
    def get_active_proxies(self) -> List[str]:
        """Return all active proxy ids."""
        return list(self.redis_client.smembers(self.active_set_key))
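
A quick usage sketch (assumes a Redis server on localhost:6379; the proxy host is a placeholder):

python
pool = RedisProxyPool()
pid = pool.add_proxy(Proxy("proxy.example.com", 8080, ProxyType.HTTP))
print(pool.get_proxy(pid))        # prints the proxy URL via Proxy.__str__
print(pool.get_active_proxies())  # ['proxy.example.com:8080']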

6.8.5 SOCKS Proxy Support with aiohttp-socks

Install aiohttp-socks:

bash
pip install aiohttp-socks

Using a SOCKS proxy:

python
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def main():
    # SOCKS5 proxy
    proxy_url = "socks5://proxy.example.com:1080"
    connector = ProxyConnector.from_url(proxy_url)
    
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://httpbin.org/ip") as resp:
            print(await resp.text())

# asyncio.run(main())

6.9 Code Walkthrough: Complete Implementations

6.9.1 A Custom DNS Resolver (with Caching and DoH)

The complete DNS resolver:

python
import asyncio
import socket
from typing import List, Optional

class AdvancedDNSResolver:
    """An advanced DNS resolver with caching and DoH."""
    
    def __init__(
        self,
        use_doh: bool = True,
        doh_servers: Optional[List[str]] = None,
        cache: Optional[DNSCache] = None,
        fallback_to_system: bool = True,
    ):
        self.use_doh = use_doh
        self.doh_servers = doh_servers or [
            "https://cloudflare-dns.com/dns-query",
            "https://dns.google/resolve",
        ]
        self.cache = cache or DNSCache()
        self.fallback_to_system = fallback_to_system
        self.doh_client = DoHClient(doh_servers=self.doh_servers, cache=self.cache) if use_doh else None
    
    async def resolve(self, hostname: str) -> List[str]:
        """Resolve a hostname."""
        # Check the cache
        cached_ips = self.cache.get(hostname)
        if cached_ips:
            return cached_ips
        
        # Try DoH
        if self.use_doh and self.doh_client:
            try:
                ips = await self.doh_client.query(hostname)
                if ips:
                    return ips
            except Exception as e:
                print(f"DoH query failed: {e}")
        
        # Fall back to system DNS
        if self.fallback_to_system:
            try:
                ip = socket.gethostbyname(hostname)
                ips = [ip]
                self.cache.set(hostname, ips)
                return ips
            except Exception as e:
                print(f"System DNS failed: {e}")
        
        return []
    
    async def close(self):
        """Shut the resolver down."""
        if self.doh_client:
            await self.doh_client.close()

# Usage
async def main():
    resolver = AdvancedDNSResolver(use_doh=True)
    ips = await resolver.resolve("www.example.com")
    print(f"Resolved IPs: {ips}")
    await resolver.close()

# asyncio.run(main())

6.9.2 The Complete Proxy Pool Manager Class

The full proxy pool manager:

python
import asyncio
import time
from typing import Optional

class ProxyPoolManager:
    """Proxy pool manager (full version)."""
    
    def __init__(
        self,
        health_checker: Optional[ProxyHealthChecker] = None,
        balancer_type: str = "round_robin",
        health_check_interval: float = 300.0,
    ):
        self.pool = ProxyPool()
        self.health_checker = health_checker or ProxyHealthChecker()
        self.balancer_type = balancer_type
        self.health_check_interval = health_check_interval
        
        # Load balancer
        self.balancer = self._create_balancer()
        
        # Health check task
        self.health_check_task = None
        self.running = False
        
        # Statistics
        self.stats = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'proxy_rotations': 0,
        }
    
    def _create_balancer(self):
        """Build the configured load balancer."""
        if self.balancer_type == "round_robin":
            return RoundRobinBalancer(self.pool)
        elif self.balancer_type == "random":
            return RandomBalancer(self.pool)
        elif self.balancer_type == "weighted":
            return WeightedBalancer(self.pool)
        elif self.balancer_type == "least_connections":
            return LeastConnectionsBalancer(self.pool)
        else:
            return RoundRobinBalancer(self.pool)
    
    def add_proxy(self, proxy: Proxy) -> str:
        """Add a proxy."""
        return self.pool.add_proxy(proxy)
    
    def remove_proxy(self, proxy_id: str):
        """Remove a proxy."""
        self.pool.remove_proxy(proxy_id)
    
    def get_next_proxy(self) -> Optional[Proxy]:
        """Pick the next proxy."""
        proxy = self.balancer.get_next_proxy()
        if proxy:
            self.stats['proxy_rotations'] += 1
        return proxy
    
    def mark_success(self, proxy: Proxy, response_time: float):
        """Record a successful request through a proxy."""
        proxy.success_count += 1
        proxy.last_success_time = time.time()
        proxy.last_check_time = time.time()
        
        # Update the running average response time
        total_requests = proxy.success_count + proxy.failure_count
        proxy.response_time = (
            (proxy.response_time * (total_requests - 1) + response_time) / total_requests
        )
        
        self.stats['successful_requests'] += 1
        self.stats['total_requests'] += 1
    
    def mark_failure(self, proxy: Proxy):
        """Record a failed request through a proxy."""
        proxy.failure_count += 1
        proxy.last_check_time = time.time()
        
        # Deactivate proxies whose failure rate is too high
        if proxy.success_rate < 0.2 and proxy.failure_count > 10:
            proxy.is_active = False
            self.pool.active_proxies.discard(f"{proxy.host}:{proxy.port}")
        
        self.stats['failed_requests'] += 1
        self.stats['total_requests'] += 1
    
    async def health_check_loop(self):
        """The periodic health check loop."""
        while self.running:
            try:
                # Collect the active proxies
                active_proxies = self.pool.get_active_proxies()
                
                if active_proxies:
                    # Check them in bulk
                    results = await self.health_checker.check_proxies(active_proxies)
                    
                    # Apply the results
                    for proxy, result in zip(active_proxies, results):
                        proxy_id = f"{proxy.host}:{proxy.port}"
                        if result['success']:
                            proxy.is_active = True
                            proxy.response_time = result.get('response_time', proxy.response_time)
                            self.pool.active_proxies.add(proxy_id)
                        else:
                            # Deactivate after repeated failures
                            if proxy.failure_count > 5:
                                proxy.is_active = False
                                self.pool.active_proxies.discard(proxy_id)
                
                await asyncio.sleep(self.health_check_interval)
            except Exception as e:
                print(f"Health check error: {e}")
                await asyncio.sleep(60)
    
    def start(self):
        """Start the manager (must be called with a running event loop)."""
        self.running = True
        self.health_check_task = asyncio.create_task(self.health_check_loop())
    
    def stop(self):
        """Stop the manager."""
        self.running = False
        if self.health_check_task:
            self.health_check_task.cancel()
    
    def get_stats(self) -> dict:
        """Return statistics."""
        active_count = len(self.pool.active_proxies)
        total_count = len(self.pool.proxies)
        
        return {
            **self.stats,
            'total_proxies': total_count,
            'active_proxies': active_count,
            'inactive_proxies': total_count - active_count,
            'success_rate': (
                self.stats['successful_requests'] / self.stats['total_requests']
                if self.stats['total_requests'] > 0 else 0.0
            ),
        }

# Usage
async def main():
    manager = ProxyPoolManager(
        balancer_type="weighted",
        health_check_interval=300.0,
    )
    
    # Add proxies
    proxy1 = Proxy("proxy1.com", 8080, ProxyType.HTTP)
    proxy2 = Proxy("proxy2.com", 8080, ProxyType.SOCKS5)
    manager.add_proxy(proxy1)
    manager.add_proxy(proxy2)
    
    # Start the manager
    manager.start()
    
    # Use a proxy
    proxy = manager.get_next_proxy()
    if proxy:
        print(f"Using proxy: {proxy.url}")
        # Simulate a successful request
        manager.mark_success(proxy, 0.5)
    
    # Inspect statistics
    stats = manager.get_stats()
    print(f"Stats: {stats}")
    
    # Stop the manager
    manager.stop()

# asyncio.run(main())

6.9.3 代理健康检查的实现代码

完整的健康检查实现:

python 复制代码
import asyncio
import aiohttp
import time
from typing import List, Dict, Optional

class ComprehensiveHealthChecker:
    """综合健康检查器"""
    
    def __init__(
        self,
        test_urls: List[str] = None,
        timeout: float = 5.0,
        max_concurrent: int = 10,
        retry_times: int = 2,
    ):
        self.test_urls = test_urls or [
            "http://httpbin.org/ip",
            "http://httpbin.org/get",
        ]
        self.timeout = timeout
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.retry_times = retry_times
    
    async def check_proxy_comprehensive(self, proxy: Proxy) -> Dict:
        """综合健康检查"""
        async with self.semaphore:
            results = {
                'proxy_id': f"{proxy.host}:{proxy.port}",
                'success': False,
                'response_times': [],
                'test_results': [],
                'error': None,
            }
            
            # 对每个测试URL进行检查
            for test_url in self.test_urls:
                for attempt in range(self.retry_times):
                    try:
                        result = await self._test_proxy(proxy, test_url)
                        results['test_results'].append(result)
                        
                        if result['success']:
                            results['response_times'].append(result['response_time'])
                            results['success'] = True
                            break
                    except Exception as e:
                        if attempt == self.retry_times - 1:
                            results['error'] = str(e)
            
            # 计算平均响应时间
            if results['response_times']:
                results['avg_response_time'] = sum(results['response_times']) / len(results['response_times'])
            else:
                results['avg_response_time'] = None
            
            # 计算成功率
            results['success_rate'] = (
                len([r for r in results['test_results'] if r['success']]) / len(results['test_results'])
                if results['test_results'] else 0.0
            )
            
            return results
    
    async def _test_proxy(self, proxy: Proxy, test_url: str) -> Dict:
        """测试单个URL"""
        start_time = time.time()
        result = {
            'url': test_url,
            'success': False,
            'response_time': None,
            'status_code': None,
            'error': None,
        }
        
        try:
            proxy_url = proxy.url
            
            # 创建连接器
            connector = None
            if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                try:
                    from aiohttp_socks import ProxyConnector
                    connector = ProxyConnector.from_url(proxy_url)
                except ImportError:
                    result['error'] = "aiohttp-socks not installed"
                    return result
            else:
                connector = aiohttp.TCPConnector()
            
            # 发送请求
            timeout = aiohttp.ClientTimeout(total=self.timeout)
            async with aiohttp.ClientSession(connector=connector) as session:
                async with session.get(
                    test_url,
                    proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                    timeout=timeout,
                ) as resp:
                    result['status_code'] = resp.status
                    result['success'] = resp.status == 200
                    result['response_time'] = time.time() - start_time
            
            # 无需手动关闭 connector:session 退出上下文时会关闭其拥有的连接器
                
        except asyncio.TimeoutError:
            result['error'] = "Timeout"
            result['response_time'] = self.timeout
        except Exception as e:
            result['error'] = str(e)
            result['response_time'] = time.time() - start_time
        
        return result
    
    async def check_proxies_batch(self, proxies: List[Proxy]) -> List[Dict]:
        """批量健康检查"""
        tasks = [self.check_proxy_comprehensive(proxy) for proxy in proxies]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        
        return processed_results

# 使用示例
async def main():
    checker = ComprehensiveHealthChecker()
    
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    result = await checker.check_proxy_comprehensive(proxy)
    print(f"Health check result: {result}")

# asyncio.run(main())

6.9.4 代理轮换中间件的实现

代理轮换中间件:

python 复制代码
import asyncio
import aiohttp
import time
from typing import Optional

class ProxyRotateMiddleware:
    """代理轮换中间件"""
    
    def __init__(self, proxy_pool_manager: ProxyPoolManager):
        self.proxy_pool_manager = proxy_pool_manager
        self.current_proxy: Optional[Proxy] = None
        self.proxy_usage_count = {}  # {proxy_id: count}
        self.max_usage_per_proxy = 100  # 每个代理最多使用次数
    
    def get_proxy_for_request(self) -> Optional[Proxy]:
        """为请求获取代理"""
        # 检查当前代理是否还能使用
        if self.current_proxy:
            proxy_id = f"{self.current_proxy.host}:{self.current_proxy.port}"
            usage_count = self.proxy_usage_count.get(proxy_id, 0)
            
            if usage_count < self.max_usage_per_proxy and self.current_proxy.is_active:
                self.proxy_usage_count[proxy_id] = usage_count + 1
                return self.current_proxy
        
        # 获取新代理
        self.current_proxy = self.proxy_pool_manager.get_next_proxy()
        if self.current_proxy:
            proxy_id = f"{self.current_proxy.host}:{self.current_proxy.port}"
            self.proxy_usage_count[proxy_id] = 1
        
        return self.current_proxy
    
    async def request_with_proxy(
        self,
        url: str,
        method: str = "GET",
        **kwargs
    ) -> aiohttp.ClientResponse:
        """使用代理发送请求"""
        proxy = self.get_proxy_for_request()
        if not proxy:
            raise Exception("No available proxy")
        
        proxy_url = proxy.url
        start_time = time.time()
        
        try:
            # 创建连接器
            connector = None
            if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                from aiohttp_socks import ProxyConnector
                connector = ProxyConnector.from_url(proxy_url)
            
            # 发送请求
            async with aiohttp.ClientSession(connector=connector) as session:
                async with session.request(
                    method,
                    url,
                    proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                    **kwargs
                ) as resp:
                    response_time = time.time() - start_time
                    
                    # 标记代理状态
                    if resp.status == 200:
                        self.proxy_pool_manager.mark_success(proxy, response_time)
                    else:
                        self.proxy_pool_manager.mark_failure(proxy)
                    
                    # 预读响应体:退出session上下文后连接会关闭,
                    # 提前缓存body才能让调用方继续使用 resp.text()/resp.json()
                    await resp.read()
                    return resp
        except Exception as e:
            # 标记代理失败
            self.proxy_pool_manager.mark_failure(proxy)
            raise
    
    def rotate_proxy(self):
        """强制轮换代理"""
        self.current_proxy = None

# 使用示例
async def main():
    manager = ProxyPoolManager()
    # ... 添加代理 ...
    
    middleware = ProxyRotateMiddleware(manager)
    
    # 使用中间件发送请求
    try:
        resp = await middleware.request_with_proxy("https://httpbin.org/ip")
        print(await resp.text())
    except Exception as e:
        print(f"Request failed: {e}")

# asyncio.run(main())

6.9.5 代理池监控面板代码

监控面板实现:

python 复制代码
from flask import Flask, render_template_string, jsonify

# 监控面板HTML模板
MONITOR_TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
    <title>代理池监控面板</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 20px; }
        .stats { display: flex; gap: 20px; margin-bottom: 20px; }
        .stat-card { background: #f0f0f0; padding: 15px; border-radius: 5px; }
        .stat-card h3 { margin: 0 0 10px 0; }
        .stat-card .value { font-size: 24px; font-weight: bold; }
        table { width: 100%; border-collapse: collapse; }
        th, td { padding: 10px; text-align: left; border-bottom: 1px solid #ddd; }
        th { background-color: #4CAF50; color: white; }
        .status-active { color: green; }
        .status-inactive { color: red; }
    </style>
    <script>
        setInterval(function() {
            fetch('/api/stats')
                .then(response => response.json())
                .then(data => {
                    document.getElementById('stats').innerHTML = generateStatsHTML(data);
                    document.getElementById('proxies').innerHTML = generateProxiesHTML(data.proxies);
                });
        }, 5000);
        
        function generateStatsHTML(data) {
            return `
                <div class="stat-card">
                    <h3>总代理数</h3>
                    <div class="value">${data.total_proxies}</div>
                </div>
                <div class="stat-card">
                    <h3>活跃代理</h3>
                    <div class="value">${data.active_proxies}</div>
                </div>
                <div class="stat-card">
                    <h3>成功率</h3>
                    <div class="value">${(data.success_rate * 100).toFixed(2)}%</div>
                </div>
                <div class="stat-card">
                    <h3>总请求数</h3>
                    <div class="value">${data.total_requests}</div>
                </div>
            `;
        }
        
        function generateProxiesHTML(proxies) {
            let html = '<table><tr><th>代理</th><th>类型</th><th>状态</th><th>成功率</th><th>响应时间</th><th>使用次数</th></tr>';
            proxies.forEach(proxy => {
                html += `
                    <tr>
                        <td>${proxy.host}:${proxy.port}</td>
                        <td>${proxy.type}</td>
                        <td class="${proxy.is_active ? 'status-active' : 'status-inactive'}">
                            ${proxy.is_active ? '活跃' : '非活跃'}
                        </td>
                        <td>${(proxy.success_rate * 100).toFixed(2)}%</td>
                        <td>${proxy.response_time.toFixed(3)}s</td>
                        <td>${proxy.success_count + proxy.failure_count}</td>
                    </tr>
                `;
            });
            html += '</table>';
            return html;
        }
    </script>
</head>
<body>
    <h1>代理池监控面板</h1>
    <div class="stats" id="stats"></div>
    <h2>代理列表</h2>
    <div id="proxies"></div>
</body>
</html>
"""

class ProxyPoolMonitor:
    """代理池监控器"""
    
    def __init__(self, proxy_pool_manager: ProxyPoolManager, port: int = 5000):
        self.proxy_pool_manager = proxy_pool_manager
        self.port = port
        self.app = Flask(__name__)
        self._setup_routes()
    
    def _setup_routes(self):
        """设置路由"""
        @self.app.route('/')
        def index():
            return render_template_string(MONITOR_TEMPLATE)
        
        @self.app.route('/api/stats')
        def api_stats():
            stats = self.proxy_pool_manager.get_stats()
            proxies = self.proxy_pool_manager.pool.get_active_proxies()
            
            proxy_data = []
            for proxy in proxies:
                proxy_data.append({
                    'host': proxy.host,
                    'port': proxy.port,
                    'type': proxy.proxy_type.value,
                    'is_active': proxy.is_active,
                    'success_rate': proxy.success_rate,
                    'response_time': proxy.response_time,
                    'success_count': proxy.success_count,
                    'failure_count': proxy.failure_count,
                })
            
            return jsonify({
                **stats,
                'proxies': proxy_data,
            })
    
    def run(self, debug: bool = False):
        """运行监控面板"""
        self.app.run(host='0.0.0.0', port=self.port, debug=debug)

# 使用示例
# monitor = ProxyPoolMonitor(proxy_pool_manager, port=5000)
# monitor.run()

6.10 实战演练:构建高可用代理池系统

本节将一步步演示如何构建一个完整的高可用代理池系统。

6.10.1 步骤1:设计代理池的数据结构和接口

数据结构设计:

python 复制代码
# 使用之前定义的Proxy和ProxyPool类
# 设计要点:
# 1. 使用字典存储代理(快速查找)
# 2. 使用集合存储活跃代理(快速过滤)
# 3. 使用队列实现轮询(公平分配)
# 4. 线程安全(使用锁保护)

接口设计:

python 复制代码
from abc import ABC, abstractmethod
from typing import Optional

class IProxyPoolManager(ABC):
    """代理池管理器接口"""
    
    @abstractmethod
    def add_proxy(self, proxy: Proxy) -> str:
        """添加代理"""
        pass
    
    @abstractmethod
    def remove_proxy(self, proxy_id: str):
        """删除代理"""
        pass
    
    @abstractmethod
    def get_next_proxy(self) -> Optional[Proxy]:
        """获取下一个代理"""
        pass
    
    @abstractmethod
    def mark_success(self, proxy: Proxy, response_time: float):
        """标记成功"""
        pass
    
    @abstractmethod
    def mark_failure(self, proxy: Proxy):
        """标记失败"""
        pass

6.10.2 步骤2:实现代理添加、删除、查询功能

完整实现:

python 复制代码
class ProxyPoolManagerV2(ProxyPoolManager):
    """代理池管理器V2(增强版)"""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxy_metadata = {}  # {proxy_id: metadata}
    
    def add_proxy_from_string(self, proxy_string: str, **metadata) -> str:
        """从字符串添加代理"""
        # 解析代理字符串
        # 格式: http://user:pass@host:port 或 socks5://host:port
        try:
            from urllib.parse import urlparse
            
            parsed = urlparse(proxy_string)
            proxy_type_map = {
                'http': ProxyType.HTTP,
                'https': ProxyType.HTTPS,
                'socks4': ProxyType.SOCKS4,
                'socks5': ProxyType.SOCKS5,
            }
            
            proxy_type = proxy_type_map.get(parsed.scheme)
            if not proxy_type:
                raise ValueError(f"Unsupported proxy type: {parsed.scheme}")
            
            proxy = Proxy(
                host=parsed.hostname,
                port=parsed.port or (1080 if 'socks' in parsed.scheme else 8080),
                proxy_type=proxy_type,
                username=parsed.username,
                password=parsed.password,
                **metadata
            )
            
            proxy_id = self.add_proxy(proxy)
            self.proxy_metadata[proxy_id] = metadata
            return proxy_id
        except Exception as e:
            print(f"Failed to add proxy from string: {e}")
            return None
    
    def add_proxies_from_file(self, file_path: str):
        """从文件批量添加代理"""
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#'):
                    self.add_proxy_from_string(line)
    
    def export_proxies(self, file_path: str):
        """导出代理到文件"""
        with open(file_path, 'w') as f:
            for proxy_id, proxy in self.pool.proxies.items():
                f.write(f"{proxy.url}\n")
    
    def search_proxies(self, **filters) -> List[Proxy]:
        """搜索代理"""
        results = []
        for proxy in self.pool.proxies.values():
            match = True
            
            if 'type' in filters and proxy.proxy_type != filters['type']:
                match = False
            if 'location' in filters and proxy.location != filters['location']:
                match = False
            if 'min_success_rate' in filters and proxy.success_rate < filters['min_success_rate']:
                match = False
            if 'max_response_time' in filters and proxy.response_time > filters['max_response_time']:
                match = False
            
            if match:
                results.append(proxy)
        
        return results

# 使用示例
manager = ProxyPoolManagerV2()

# 从字符串添加
manager.add_proxy_from_string("http://user:pass@proxy.com:8080", location="US")

# 从文件批量添加
manager.add_proxies_from_file("proxies.txt")

# 搜索代理
us_proxies = manager.search_proxies(location="US", min_success_rate=0.8)

6.10.3 步骤3:实现定时健康检查机制

定时健康检查:

python 复制代码
class ScheduledHealthChecker:
    """定时健康检查器"""
    
    def __init__(
        self,
        proxy_pool_manager: ProxyPoolManager,
        interval: float = 300.0,
        batch_size: int = 10,
    ):
        self.proxy_pool_manager = proxy_pool_manager
        self.interval = interval
        self.batch_size = batch_size
        self.health_checker = ComprehensiveHealthChecker()
        self.running = False
        self.task = None
    
    async def health_check_loop(self):
        """健康检查循环"""
        while self.running:
            try:
                # 获取需要检查的代理
                active_proxies = self.proxy_pool_manager.pool.get_active_proxies()
                
                # 分批检查
                for i in range(0, len(active_proxies), self.batch_size):
                    batch = active_proxies[i:i+self.batch_size]
                    results = await self.health_checker.check_proxies_batch(batch)
                    
                    # 更新代理状态
                    for proxy, result in zip(batch, results):
                        if result['success']:
                            self.proxy_pool_manager.mark_success(
                                proxy,
                                result.get('avg_response_time') or 0.0  # 键存在但值为None时同样回退到0.0
                            )
                        else:
                            self.proxy_pool_manager.mark_failure(proxy)
                
                await asyncio.sleep(self.interval)
            except Exception as e:
                print(f"Health check error: {e}")
                await asyncio.sleep(60)
    
    def start(self):
        """启动健康检查"""
        self.running = True
        self.task = asyncio.create_task(self.health_check_loop())
    
    def stop(self):
        """停止健康检查"""
        self.running = False
        if self.task:
            self.task.cancel()

# 使用示例
# 注意:start() 内部调用 asyncio.create_task,必须在已运行的事件循环(协程)中执行
health_checker = ScheduledHealthChecker(proxy_pool_manager, interval=300.0)
health_checker.start()

6.10.4 步骤4:实现负载均衡算法(多种算法对比)

算法对比测试:

python 复制代码
import time

def test_load_balancing_algorithms(proxy_pool_manager: ProxyPoolManager):
    """测试负载均衡算法"""
    algorithms = {
        'round_robin': RoundRobinBalancer(proxy_pool_manager.pool),
        'random': RandomBalancer(proxy_pool_manager.pool),
        'weighted': WeightedBalancer(proxy_pool_manager.pool),
        'least_connections': LeastConnectionsBalancer(proxy_pool_manager.pool),
    }
    
    iterations = 1000
    results = {}
    
    for algo_name, balancer in algorithms.items():
        selection_distribution = {}
        start_time = time.time()
        
        for _ in range(iterations):
            proxy = balancer.get_next_proxy()
            if proxy:
                proxy_id = f"{proxy.host}:{proxy.port}"
                selection_distribution[proxy_id] = selection_distribution.get(proxy_id, 0) + 1
        
        elapsed = time.time() - start_time
        
        # 计算分布均匀度
        counts = list(selection_distribution.values())
        if counts:
            mean = sum(counts) / len(counts)
            variance = sum((x - mean) ** 2 for x in counts) / len(counts)
            std_dev = variance ** 0.5
            cv = std_dev / mean if mean > 0 else 0  # 变异系数
        else:
            cv = 0
        
        results[algo_name] = {
            'time': elapsed * 1000,
            'distribution': selection_distribution,
            'coefficient_of_variation': cv,
        }
    
    return results

# 使用示例
# results = test_load_balancing_algorithms(proxy_pool_manager)
# for algo, result in results.items():
#     print(f"{algo}: {result['time']:.2f}ms, CV: {result['coefficient_of_variation']:.3f}")

6.10.5 步骤5:实现代理轮换中间件

集成到HTTP客户端:

python 复制代码
class ProxyAwareHTTPClient:
    """支持代理的HTTP客户端"""
    
    def __init__(self, proxy_pool_manager: ProxyPoolManager):
        self.proxy_pool_manager = proxy_pool_manager
        self.middleware = ProxyRotateMiddleware(proxy_pool_manager)
    
    async def get(self, url: str, **kwargs) -> aiohttp.ClientResponse:
        """GET请求"""
        return await self.middleware.request_with_proxy(url, "GET", **kwargs)
    
    async def post(self, url: str, **kwargs) -> aiohttp.ClientResponse:
        """POST请求"""
        return await self.middleware.request_with_proxy(url, "POST", **kwargs)
    
    async def request(self, method: str, url: str, **kwargs) -> aiohttp.ClientResponse:
        """通用请求方法"""
        return await self.middleware.request_with_proxy(url, method, **kwargs)

# 使用示例
async def main():
    client = ProxyAwareHTTPClient(proxy_pool_manager)
    
    try:
        async with await client.get("https://httpbin.org/ip") as resp:
            print(await resp.text())
    except Exception as e:
        print(f"Request failed: {e}")

# asyncio.run(main())

6.10.6 步骤6:集成到异步爬虫框架中

爬虫框架集成:

python 复制代码
import asyncio
from typing import List, Callable

class AsyncCrawlerWithProxy:
    """支持代理的异步爬虫"""
    
    def __init__(
        self,
        proxy_pool_manager: ProxyPoolManager,
        max_concurrent: int = 10,
    ):
        self.proxy_pool_manager = proxy_pool_manager
        self.client = ProxyAwareHTTPClient(proxy_pool_manager)
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def crawl_url(self, url: str) -> dict:
        """爬取单个URL"""
        async with self.semaphore:
            try:
                async with await self.client.get(url) as resp:
                    text = await resp.text()
                    return {
                        'url': url,
                        'status': resp.status,
                        'content': text,
                        'success': True,
                    }
            except Exception as e:
                return {
                    'url': url,
                    'status': None,
                    'content': None,
                    'success': False,
                    'error': str(e),
                }
    
    async def crawl_urls(self, urls: List[str]) -> List[dict]:
        """批量爬取URL"""
        tasks = [self.crawl_url(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        
        return processed_results

# 使用示例
async def main():
    crawler = AsyncCrawlerWithProxy(proxy_pool_manager, max_concurrent=10)
    
    urls = [
        "https://httpbin.org/ip",
        "https://httpbin.org/get",
        # ... 更多URL
    ]
    
    results = await crawler.crawl_urls(urls)
    
    success_count = sum(1 for r in results if r.get('success'))
    print(f"Success: {success_count}/{len(results)}")

# asyncio.run(main())

6.10.7 步骤7:完整实战代码

完整的代理池系统:

python 复制代码
import asyncio
import aiohttp
import time
import logging
from typing import List, Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CompleteProxyPoolSystem:
    """完整的代理池系统"""
    
    def __init__(
        self,
        proxy_sources: Optional[List[str]] = None,
        health_check_interval: float = 300.0,
        balancer_type: str = "weighted",
    ):
        # 初始化组件
        self.proxy_pool_manager = ProxyPoolManagerV2(
            balancer_type=balancer_type,
            health_check_interval=health_check_interval,
        )
        
        self.health_checker = ScheduledHealthChecker(
            self.proxy_pool_manager,
            interval=health_check_interval,
        )
        
        self.crawler = AsyncCrawlerWithProxy(
            self.proxy_pool_manager,
            max_concurrent=10,
        )
        
        # 加载代理
        if proxy_sources:
            for source in proxy_sources:
                if source.startswith('http'):
                    # 从URL异步加载(注意:要求本对象在已运行的事件循环中构造)
                    asyncio.create_task(self.load_proxies_from_url(source))
                else:
                    # 从文件加载
                    self.proxy_pool_manager.add_proxies_from_file(source)
    
    async def load_proxies_from_url(self, url: str):
        """从URL加载代理列表"""
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    if resp.status == 200:
                        text = await resp.text()
                        for line in text.split('\n'):
                            line = line.strip()
                            if line:
                                self.proxy_pool_manager.add_proxy_from_string(line)
        except Exception as e:
            logger.error(f"Failed to load proxies from URL: {e}")
    
    def start(self):
        """启动系统"""
        logger.info("Starting proxy pool system...")
        self.proxy_pool_manager.start()
        self.health_checker.start()
        logger.info("Proxy pool system started")
    
    def stop(self):
        """停止系统"""
        logger.info("Stopping proxy pool system...")
        self.proxy_pool_manager.stop()
        self.health_checker.stop()
        logger.info("Proxy pool system stopped")
    
    async def crawl(self, urls: List[str]) -> List[dict]:
        """使用代理池爬取URL"""
        return await self.crawler.crawl_urls(urls)
    
    def get_stats(self) -> dict:
        """获取系统统计"""
        return self.proxy_pool_manager.get_stats()

# 使用示例
async def main():
    # 创建系统
    system = CompleteProxyPoolSystem(
        proxy_sources=["proxies.txt"],
        health_check_interval=300.0,
        balancer_type="weighted",
    )
    
    # 启动系统
    system.start()
    
    # 等待代理加载和健康检查
    await asyncio.sleep(10)
    
    # 爬取URL
    urls = [
        "https://httpbin.org/ip",
        "https://httpbin.org/get",
    ]
    
    results = await system.crawl(urls)
    
    # 查看统计
    stats = system.get_stats()
    logger.info(f"System stats: {stats}")
    
    # 停止系统
    system.stop()

if __name__ == "__main__":
    asyncio.run(main())

6.11 常见坑点与排错

6.11.1 DNS缓存时间过长导致IP变更无法感知

问题描述:

python 复制代码
# 错误示例:TTL设置过长
cache = DNSCache(default_ttl=86400)  # 24小时(太长!)

# 如果服务器IP变更,24小时内无法感知

解决方案:

python 复制代码
# 正确示例:合理的TTL设置
cache = DNSCache(default_ttl=300)  # 5分钟

# 或者根据DNS记录的TTL动态设置
import dns.resolver  # pip install dnspython

def get_dns_ttl(hostname: str) -> int:
    """获取DNS记录的TTL"""
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        return answers.rrset.ttl
    except Exception:
        return 300  # 查询失败时默认5分钟

# 使用动态TTL
ttl = get_dns_ttl("www.example.com")
cache.set("www.example.com", ips, ttl=ttl)

6.11.2 代理健康检查频率过高会被代理商封禁

问题描述:

python 复制代码
# 错误示例:检查频率过高
health_checker = ScheduledHealthChecker(
    proxy_pool_manager,
    interval=10.0,  # 10秒检查一次(太频繁!)
)

# 可能被代理服务器识别为异常行为并封禁

解决方案:

python 复制代码
# 正确示例:合理的检查频率
health_checker = ScheduledHealthChecker(
    proxy_pool_manager,
    interval=300.0,  # 5分钟检查一次
    batch_size=5,     # 每次只检查5个代理
)

# 或者使用智能频率控制
smart_checker = SmartHealthChecker(
    base_interval=300.0,
    min_interval=60.0,
    max_interval=3600.0,
)

6.11.3 SOCKS5代理需要特殊处理UDP流量

问题描述:

python 复制代码
# 错误示例:把SOCKS5代理直接传给aiohttp的proxy参数
# aiohttp原生只支持HTTP代理,proxy="socks5://..."会直接报错;
# 此外SOCKS5的UDP转发(UDP ASSOCIATE)需要客户端专门支持,普通HTTP库只走TCP

解决方案:

python 复制代码
# 正确示例:使用专门的SOCKS连接器
from aiohttp_socks import ProxyConnector

proxy_url = "socks5://proxy.example.com:1080"
connector = ProxyConnector.from_url(proxy_url)

async with aiohttp.ClientSession(connector=connector) as session:
    # 现在可以正常使用SOCKS5代理
    async with session.get("https://httpbin.org/ip") as resp:
        print(await resp.text())

6.11.4 代理池资源耗尽导致请求失败

问题描述:

python 复制代码
# 错误示例:没有检查代理可用性
proxy = proxy_pool_manager.get_next_proxy()
# 如果所有代理都不可用,proxy为None,会导致错误

解决方案:

python 复制代码
# 正确示例:检查代理可用性并实现降级
proxy = proxy_pool_manager.get_next_proxy()
if not proxy:
    # 降级:不使用代理
    logger.warning("No available proxy, using direct connection")
    # 或者等待代理恢复
    await asyncio.sleep(10)
    proxy = proxy_pool_manager.get_next_proxy()

6.11.5 负载均衡算法选择不当导致性能下降

问题描述:

python 复制代码
# 错误示例:代理质量差异大时使用轮询
# 质量差的代理会被频繁使用,影响整体性能
balancer = RoundRobinBalancer(proxy_pool)

解决方案:

python 复制代码
# 正确示例:根据场景选择算法
# 代理质量差异大:使用加权算法
# 代理质量相近:使用轮询或随机算法
# 长时间连接:使用最少连接数算法

if proxy_quality_varies:
    balancer = WeightedBalancer(proxy_pool)
else:
    balancer = RoundRobinBalancer(proxy_pool)

6.12 总结

本章深入讲解了DNS解析优化和代理池架构的完整实现。通过本章学习,你应该能够:

核心知识点回顾

  1. DNS解析机制

    • DNS查询的完整流程(递归/迭代)
    • DNS缓存机制和TTL管理
    • DoH/DoT的实现和使用
  2. 代理池架构

    • 代理池的数据结构设计
    • 代理类型的选择和使用
    • 健康检查机制的设计
  3. 负载均衡算法

    • 轮询、随机、加权、最少连接数
    • 不同算法的适用场景
    • 算法性能对比和选择
  4. 实战能力

    • 构建完整的代理池系统
    • 集成到爬虫框架
    • 监控和统计

最佳实践建议

  1. DNS优化

    • 使用合理的TTL(5-10分钟)
    • 实现多级缓存
    • 使用DoH提高安全性
  2. 代理池管理

    • 定期健康检查(5-10分钟)
    • 实现智能频率控制
    • 根据场景选择负载均衡算法
  3. 性能优化

    • 使用异步健康检查
    • 批量处理代理
    • 实现代理预热机制
  4. 监控和运维

    • 实现监控面板
    • 记录详细日志
    • 设置告警机制

下一步学习方向

  1. 深入学习

    • DNS协议细节
    • 代理协议实现
    • 分布式代理池
  2. 实战项目

    • 构建大规模代理池
    • 实现代理自动获取
    • 开发代理质量评估系统

通过本章的学习,你已经掌握了DNS解析优化和代理池架构的核心技术,能够构建高性能、高可用的爬虫系统。


本章完
