第6章:DNS解析优化与代理池架构
目录
- [6.1 引言:DNS和代理在爬虫中的重要性](#6.1 引言:DNS和代理在爬虫中的重要性)
- [6.1.1 DNS解析对爬虫性能的影响](#6.1.1 DNS解析对爬虫性能的影响)
- [6.1.2 代理池的必要性](#6.1.2 代理池的必要性)
- [6.1.3 本章学习目标](#6.1.3 本章学习目标)
- [6.2 DNS解析流程深度解析](#6.2 DNS解析流程深度解析)
- [6.2.1 DNS查询的完整流程](#6.2.1 DNS查询的完整流程)
- [6.2.2 递归查询 vs 迭代查询](#6.2.2 递归查询 vs 迭代查询)
- [6.2.3 本地缓存和系统DNS](#6.2.3 本地缓存和系统DNS)
- [6.2.4 DNS记录类型详解](#6.2.4 DNS记录类型详解)
- [6.3 DNS缓存机制深度解析](#6.3 DNS缓存机制深度解析)
- [6.3.1 TTL的含义和作用](#6.3.1 TTL的含义和作用)
- [6.3.2 缓存策略设计](#6.3.2 缓存策略设计)
- [6.3.3 缓存失效处理](#6.3.3 缓存失效处理)
- [6.3.4 多级缓存架构](#6.3.4 多级缓存架构)
- [6.4 DNS over HTTPS/TLS实现](#6.4 DNS over HTTPS/TLS实现)
- [6.4.1 DoH协议原理](#6.4.1 DoH协议原理)
- [6.4.2 DoT协议原理](#6.4.2 DoT协议原理)
- [6.4.3 DoH/DoT的安全优势](#6.4.3 DoH/DoT的安全优势)
- [6.4.4 DoH客户端实现](#6.4.4 DoH客户端实现)
- [6.5 代理池架构设计](#6.5 代理池架构设计)
- [6.5.1 代理池的数据结构设计](#6.5.1 代理池的数据结构设计)
- [6.5.2 代理类型详解(HTTP/HTTPS/SOCKS4/SOCKS5)](#6.5.2 代理类型详解(HTTP/HTTPS/SOCKS4/SOCKS5))
- [6.5.3 代理池的接口设计](#6.5.3 代理池的接口设计)
- [6.5.4 代理池架构图](#6.5.4 代理池架构图)
- [6.6 代理健康检查机制](#6.6 代理健康检查机制)
- [6.6.1 健康检查的设计原则](#6.6.1 健康检查的设计原则)
- [6.6.2 健康检查的实现方法](#6.6.2 健康检查的实现方法)
- [6.6.3 健康检查的频率控制](#6.6.3 健康检查的频率控制)
- [6.6.4 健康状态的评估标准](#6.6.4 健康状态的评估标准)
- [6.7 负载均衡算法实现](#6.7 负载均衡算法实现)
- [6.7.1 轮询算法(Round Robin)](#6.7.1 轮询算法(Round Robin))
- [6.7.2 随机算法(Random)](#6.7.2 随机算法(Random))
- [6.7.3 加权算法(Weighted)](#6.7.3 加权算法(Weighted))
- [6.7.4 最少连接数算法(Least Connections)](#6.7.4 最少连接数算法(Least Connections))
- [6.7.5 算法性能对比](#6.7.5 算法性能对比)
- [6.8 工具链:DNS和代理工具使用](#6.8 工具链:DNS和代理工具使用)
- [6.8.1 使用dnspython进行DNS查询](#6.8.1 使用dnspython进行DNS查询)
- [6.8.2 使用DoH服务进行DNS查询](#6.8.2 使用DoH服务进行DNS查询)
- [6.8.3 使用httpx/aiohttp配置代理](#6.8.3 使用httpx/aiohttp配置代理)
- [6.8.4 使用Redis管理代理池数据](#6.8.4 使用Redis管理代理池数据)
- [6.8.5 使用aiohttp-socks支持SOCKS代理](#6.8.5 使用aiohttp-socks支持SOCKS代理)
- [6.9 代码对照:完整实现](#6.9 代码对照:完整实现)
- [6.9.1 自定义DNS解析器实现(支持缓存和DoH)](#6.9.1 自定义DNS解析器实现(支持缓存和DoH))
- [6.9.2 代理池管理器类的完整实现](#6.9.2 代理池管理器类的完整实现)
- [6.9.3 代理健康检查的实现代码](#6.9.3 代理健康检查的实现代码)
- [6.9.4 代理轮换中间件的实现](#6.9.4 代理轮换中间件的实现)
- [6.9.5 代理池监控面板代码](#6.9.5 代理池监控面板代码)
- [6.10 实战演练:构建高可用代理池系统](#6.10 实战演练:构建高可用代理池系统)
- [6.10.1 步骤1:设计代理池的数据结构和接口](#6.10.1 步骤1:设计代理池的数据结构和接口)
- [6.10.2 步骤2:实现代理添加、删除、查询功能](#6.10.2 步骤2:实现代理添加、删除、查询功能)
- [6.10.3 步骤3:实现定时健康检查机制](#6.10.3 步骤3:实现定时健康检查机制)
- [6.10.4 步骤4:实现负载均衡算法(多种算法对比)](#6.10.4 步骤4:实现负载均衡算法(多种算法对比))
- [6.10.5 步骤5:实现代理轮换中间件](#6.10.5 步骤5:实现代理轮换中间件)
- [6.10.6 步骤6:集成到异步爬虫框架中](#6.10.6 步骤6:集成到异步爬虫框架中)
- [6.10.7 步骤7:完整实战代码](#6.10.7 步骤7:完整实战代码)
- [6.11 常见坑点与排错](#6.11 常见坑点与排错)
- [6.11.1 DNS缓存时间过长导致IP变更无法感知](#6.11.1 DNS缓存时间过长导致IP变更无法感知)
- [6.11.2 代理健康检查频率过高会被代理商封禁](#6.11.2 代理健康检查频率过高会被代理商封禁)
- [6.11.3 SOCKS5代理需要特殊处理UDP流量](#6.11.3 SOCKS5代理需要特殊处理UDP流量)
- [6.11.4 代理池资源耗尽导致请求失败](#6.11.4 代理池资源耗尽导致请求失败)
- [6.11.5 负载均衡算法选择不当导致性能下降](#6.11.5 负载均衡算法选择不当导致性能下降)
- [6.12 总结](#6.12 总结)
6.1 引言:DNS和代理在爬虫中的重要性
在爬虫开发中,DNS解析和代理使用是两个关键环节。DNS解析的速度直接影响请求的响应时间,而代理池的质量决定了爬虫的稳定性和反检测能力。理解DNS解析机制和构建高效的代理池系统,是构建高性能爬虫的基础。
6.1.1 DNS解析对爬虫性能的影响
DNS解析的性能影响:
```python
import time
import socket

def test_dns_resolution(hostname: str, iterations: int = 100):
    """测试DNS解析性能"""
    total_time = 0
    for _ in range(iterations):
        start = time.time()
        try:
            socket.gethostbyname(hostname)
        except Exception as e:
            print(f"DNS resolution failed: {e}")
        elapsed = time.time() - start
        total_time += elapsed
    avg_time = total_time / iterations
    print(f"Average DNS resolution time: {avg_time*1000:.2f}ms")
    print(f"Total time for {iterations} requests: {total_time:.2f}s")
    # 如果每次请求都进行DNS解析,总耗时会非常长
    estimated_total = avg_time * iterations
    print(f"Estimated total time without cache: {estimated_total:.2f}s")

# 测试
test_dns_resolution("www.example.com", 100)
# 输出示例:
# Average DNS resolution time: 50.23ms
# Total time for 100 requests: 5.02s
# Estimated total time without cache: 5.02s
```
DNS解析的性能问题:
- **延迟累积**:
  - 每次DNS查询通常需要20-100ms
  - 大量请求时,DNS延迟会显著累积
  - 没有缓存时,每个请求都要等待DNS解析
- **网络依赖**:
  - DNS查询依赖网络连接
  - 网络不稳定时,DNS查询可能超时
  - 影响整体爬虫的稳定性
- **DNS服务器限制**:
  - 公共DNS服务器可能有速率限制
  - 频繁查询可能被限流
  - 需要实现智能重试和降级(见下面的示例)
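下面是"重试和降级"思路的一个最小示意:依次尝试多个公共DNS服务器,任一成功即返回。服务器列表和超时参数都是示意值,可按实际情况调整:

```python
import dns.resolver

def resolve_with_fallback(hostname: str, nameservers=("8.8.8.8", "1.1.1.1", "223.5.5.5")) -> list:
    """依次尝试多个DNS服务器,实现简单的降级策略(示意实现)"""
    last_error = None
    for ns in nameservers:
        resolver = dns.resolver.Resolver()
        resolver.nameservers = [ns]   # 只使用当前这一个服务器
        resolver.lifetime = 3.0       # 单个服务器最多等待3秒
        try:
            return [str(r) for r in resolver.resolve(hostname, "A")]
        except Exception as e:
            last_error = e            # 失败则降级到下一个服务器
    raise last_error
```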
DNS缓存的效果:
```python
import time
import socket
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_dns_resolve(hostname: str) -> str:
    """带缓存的DNS解析"""
    return socket.gethostbyname(hostname)

def test_cached_dns(hostname: str, iterations: int = 100):
    """测试缓存DNS解析性能"""
    # 第一次解析(无缓存)
    start = time.time()
    cached_dns_resolve(hostname)
    first_time = time.time() - start
    # 后续解析(有缓存)
    start = time.time()
    for _ in range(iterations - 1):
        cached_dns_resolve(hostname)
    cached_time = time.time() - start
    avg_cached_time = cached_time / (iterations - 1)
    print(f"First resolution (no cache): {first_time*1000:.2f}ms")
    print(f"Average cached resolution: {avg_cached_time*1000:.6f}ms")
    print(f"Speedup: {first_time/avg_cached_time:.0f}x")

# 测试
test_cached_dns("www.example.com", 100)
# 输出示例:
# First resolution (no cache): 45.23ms
# Average cached resolution: 0.000123ms
# Speedup: 367723x
```
6.1.2 代理池的必要性
为什么需要代理池?
- **IP封禁问题**:
  - 频繁请求同一网站会被封IP
  - 使用代理可以轮换IP,避免封禁
  - 提高爬虫的稳定性
- **地理位置限制**:
  - 某些网站有地理位置限制
  - 使用对应地区的代理可以绕过限制
  - 实现全球数据采集
- **请求频率控制**:
  - 单个IP的请求频率有限
  - 使用多个代理可以分散请求
  - 提高整体爬取速度
代理池的挑战:
```python
# 问题1:代理质量参差不齐
proxies = [
    "http://proxy1.com:8080",  # 速度快,稳定
    "http://proxy2.com:8080",  # 速度慢,不稳定
    "http://proxy3.com:8080",  # 已失效
]
# 问题2:代理需要健康检查
# 问题3:代理需要负载均衡
# 问题4:代理需要自动轮换
```
6.1.3 本章学习目标
通过本章学习,你将:
- **深入理解DNS解析机制**:
  - DNS查询的完整流程
  - 缓存机制的设计和优化
  - DoH/DoT的实现和使用
- **掌握代理池架构设计**:
  - 代理池的数据结构设计
  - 健康检查机制
  - 负载均衡算法
- **实现完整的代理池系统**:
  - 代理的添加、删除、查询
  - 自动健康检查
  - 智能负载均衡
  - 监控和统计
- **集成到爬虫框架**:
  - 代理轮换中间件
  - 与异步爬虫框架集成
  - 性能优化和调优
6.2 DNS解析流程深度解析
DNS(Domain Name System)是互联网的"电话簿",将域名转换为IP地址。理解DNS解析流程对于优化爬虫性能至关重要。
6.2.1 DNS查询的完整流程
DNS查询的完整流程:
```mermaid
sequenceDiagram
    participant App as 应用程序
    participant Cache as 本地缓存
    participant Resolver as 系统DNS解析器
    participant Root as 根DNS服务器
    participant TLD as TLD DNS服务器
    participant Auth as 权威DNS服务器
    App->>Cache: 1. 查询本地缓存
    alt 缓存命中
        Cache-->>App: 返回缓存的IP
    else 缓存未命中
        App->>Resolver: 2. 查询系统DNS解析器
        Resolver->>Root: 3. 查询根DNS服务器
        Root-->>Resolver: 返回.com的TLD服务器地址
        Resolver->>TLD: 4. 查询TLD DNS服务器
        TLD-->>Resolver: 返回example.com的权威服务器地址
        Resolver->>Auth: 5. 查询权威DNS服务器
        Auth-->>Resolver: 返回www.example.com的IP地址
        Resolver->>Cache: 6. 更新缓存
        Cache-->>App: 返回IP地址
    end
```
DNS查询的详细步骤:
- **本地缓存查询**:
  - 检查应用程序缓存
  - 检查系统DNS缓存
  - 检查hosts文件
- **递归查询**:
  - 向系统DNS解析器发送查询
  - DNS解析器负责完整的查询过程
- **迭代查询**:
  - 从根DNS服务器开始
  - 逐级查询到权威DNS服务器
  - 获取最终的IP地址
Python代码演示DNS查询:
```python
import socket
import dns.resolver
import time

def dns_query_system(hostname: str) -> str:
    """使用系统DNS解析"""
    start = time.time()
    try:
        ip = socket.gethostbyname(hostname)
        elapsed = time.time() - start
        print(f"System DNS: {hostname} -> {ip} ({elapsed*1000:.2f}ms)")
        return ip
    except socket.gaierror as e:
        print(f"DNS resolution failed: {e}")
        return None

def dns_query_dnspython(hostname: str) -> list:
    """使用dnspython库解析"""
    start = time.time()
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        ips = [str(answer) for answer in answers]
        elapsed = time.time() - start
        print(f"dnspython DNS: {hostname} -> {ips} ({elapsed*1000:.2f}ms)")
        return ips
    except Exception as e:
        print(f"DNS resolution failed: {e}")
        return []

# 使用示例
dns_query_system("www.example.com")
dns_query_dnspython("www.example.com")
```
6.2.2 递归查询 vs 迭代查询
递归查询(Recursive Query):
- 客户端向DNS服务器发送递归查询
- DNS服务器负责完成整个查询过程
- 客户端只需等待最终结果
迭代查询(Iterative Query):
- DNS服务器返回下一个应该查询的服务器地址
- 客户端需要自己完成后续查询
- 通常用于DNS服务器之间的查询
查询类型对比:
```python
import socket

class DNSQueryType:
    """DNS查询类型"""
    RECURSIVE = "recursive"  # 递归查询
    ITERATIVE = "iterative"  # 迭代查询

def recursive_query(hostname: str) -> str:
    """递归查询(客户端视角)"""
    # 客户端发送递归查询,等待最终结果
    return socket.gethostbyname(hostname)

def iterative_query(hostname: str) -> str:
    """迭代查询(手动实现)"""
    # 1. 查询根DNS服务器
    # 2. 获取TLD服务器地址
    # 3. 查询TLD服务器
    # 4. 获取权威服务器地址
    # 5. 查询权威服务器
    # 6. 获取IP地址
    # (实际实现较复杂,这里仅演示概念,下面给出一个简化的可运行版本)
    pass
```
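借助dnspython可以写出一个简化的迭代查询示意:从根服务器出发,利用响应additional区中的胶水记录逐级下钻。以下实现做了大量简化(忽略CNAME、无胶水记录、TCP回退等情况),根服务器IP以a.root-servers.net的198.41.0.4为例:

```python
import dns.message
import dns.query
import dns.rdatatype

def iterative_resolve(hostname: str, start_server: str = "198.41.0.4") -> str:
    """简化的迭代DNS查询示意:从根服务器逐级查到A记录"""
    server = start_server
    for _ in range(10):  # 限制下钻深度,避免死循环
        query = dns.message.make_query(hostname, dns.rdatatype.A)
        response = dns.query.udp(query, server, timeout=5)
        # answer区出现A记录,说明已到达权威服务器
        for rrset in response.answer:
            for rdata in rrset:
                if rdata.rdtype == dns.rdatatype.A:
                    return str(rdata)
        # 否则从additional区的胶水记录中取下一级服务器的IP
        next_server = None
        for rrset in response.additional:
            for rdata in rrset:
                if rdata.rdtype == dns.rdatatype.A:
                    next_server = str(rdata)
                    break
            if next_server:
                break
        if next_server is None:
            raise RuntimeError("未找到胶水记录,简化实现无法继续下钻")
        server = next_server
    raise RuntimeError("查询层级过深")

# print(iterative_resolve("www.example.com"))
```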
6.2.3 本地缓存和系统DNS
本地缓存层次:
```python
class DNSCacheLevel:
    """DNS缓存层次"""
    APPLICATION = "application"  # 应用程序缓存
    SYSTEM = "system"            # 系统DNS缓存
    HOSTS_FILE = "hosts_file"    # hosts文件

# 1. 应用程序缓存(最快)
app_cache = {}

# 2. 系统DNS缓存(操作系统管理)
# Windows: ipconfig /displaydns
# Linux: systemd-resolve --statistics
# macOS: dscacheutil -q host

# 3. hosts文件
# Windows: C:\Windows\System32\drivers\etc\hosts
# Linux/macOS: /etc/hosts
```
系统DNS缓存查看:
```bash
# Windows
ipconfig /displaydns

# Linux (systemd-resolved)
systemd-resolve --statistics

# Linux (nscd)
nscd -g

# macOS
dscacheutil -q host -a name www.example.com
```
6.2.4 DNS记录类型详解
常见的DNS记录类型:
| 记录类型 | 说明 | 示例 |
|---|---|---|
| A | IPv4地址记录 | www.example.com -> 192.0.2.1 |
| AAAA | IPv6地址记录 | www.example.com -> 2001:db8::1 |
| CNAME | 别名记录 | www -> example.com |
| MX | 邮件交换记录 | example.com -> mail.example.com |
| TXT | 文本记录 | 用于SPF、DKIM等 |
| NS | 名称服务器记录 | example.com -> ns1.example.com |
查询不同类型的DNS记录:
```python
import dns.resolver

def query_dns_records(hostname: str):
    """查询各种DNS记录"""
    records = {}
    # A记录(IPv4)
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        records['A'] = [str(answer) for answer in answers]
    except Exception:
        records['A'] = None
    # AAAA记录(IPv6)
    try:
        answers = dns.resolver.resolve(hostname, 'AAAA')
        records['AAAA'] = [str(answer) for answer in answers]
    except Exception:
        records['AAAA'] = None
    # CNAME记录
    try:
        answers = dns.resolver.resolve(hostname, 'CNAME')
        records['CNAME'] = [str(answer) for answer in answers]
    except Exception:
        records['CNAME'] = None
    # MX记录
    try:
        answers = dns.resolver.resolve(hostname, 'MX')
        records['MX'] = [(str(answer.preference), str(answer.exchange)) for answer in answers]
    except Exception:
        records['MX'] = None
    return records

# 使用示例
records = query_dns_records("example.com")
print(records)
```
6.3 DNS缓存机制深度解析
DNS缓存是提高DNS解析性能的关键机制。合理设计缓存策略可以大幅提升爬虫性能。
6.3.1 TTL的含义和作用
TTL(Time To Live)的含义:
- TTL是DNS记录的生存时间
- 表示DNS记录在缓存中的有效期(秒)
- 超过TTL后,缓存记录应该被丢弃
TTL的作用:
- **平衡性能和准确性**:
  - TTL短:更准确,但查询频繁
  - TTL长:查询少,但可能使用过期IP
- **控制缓存更新频率**:
  - 服务器可以通过调整TTL控制缓存更新
  - 动态IP通常设置较短的TTL
查看DNS记录的TTL:
```python
import dns.resolver

def get_dns_ttl(hostname: str) -> int:
    """获取DNS记录的TTL"""
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        # dnspython返回的answer对象包含TTL
        return answers.rrset.ttl
    except Exception as e:
        print(f"Failed to get TTL: {e}")
        return None

# 使用示例
ttl = get_dns_ttl("www.example.com")
if ttl is not None:  # 查询失败时返回None,避免类型错误
    print(f"TTL: {ttl} seconds ({ttl/60:.1f} minutes)")
```
6.3.2 缓存策略设计
缓存策略的关键要素:
- **缓存存储结构**:
  - 使用字典存储域名到IP的映射
  - 记录缓存时间戳
  - 记录TTL值
- **缓存过期检查**:
  - 每次查询时检查缓存是否过期
  - 过期则重新查询并更新缓存
- **缓存大小限制**:
  - 限制缓存条目数量
  - 使用LRU(最近最少使用)策略淘汰
完整的DNS缓存实现:
```python
import time
from collections import OrderedDict
from typing import Optional, List

class DNSCache:
    """DNS缓存实现"""
    def __init__(self, max_size: int = 1000, default_ttl: int = 300):
        self.cache = OrderedDict()  # {hostname: (ips, timestamp, ttl)}
        self.max_size = max_size
        self.default_ttl = default_ttl

    def get(self, hostname: str) -> Optional[List[str]]:
        """获取缓存的DNS记录"""
        if hostname not in self.cache:
            return None
        ips, timestamp, ttl = self.cache[hostname]
        # 检查是否过期
        age = time.time() - timestamp
        if age > ttl:
            # 缓存过期,删除
            del self.cache[hostname]
            return None
        # 更新访问顺序(LRU)
        self.cache.move_to_end(hostname)
        return ips

    def set(self, hostname: str, ips: List[str], ttl: Optional[int] = None):
        """设置DNS缓存"""
        if ttl is None:
            ttl = self.default_ttl
        # 如果缓存已满,删除最旧的条目
        if len(self.cache) >= self.max_size and hostname not in self.cache:
            self.cache.popitem(last=False)  # 删除最旧的
        self.cache[hostname] = (ips, time.time(), ttl)
        self.cache.move_to_end(hostname)  # 更新访问顺序

    def clear(self):
        """清空缓存"""
        self.cache.clear()

    def remove(self, hostname: str):
        """删除特定域名的缓存"""
        if hostname in self.cache:
            del self.cache[hostname]

    def cleanup_expired(self):
        """清理过期的缓存条目"""
        current_time = time.time()
        expired_hostnames = []
        for hostname, (ips, timestamp, ttl) in self.cache.items():
            if current_time - timestamp > ttl:
                expired_hostnames.append(hostname)
        for hostname in expired_hostnames:
            del self.cache[hostname]
        return len(expired_hostnames)

    def stats(self) -> dict:
        """获取缓存统计信息"""
        current_time = time.time()
        valid_count = 0
        expired_count = 0
        for hostname, (ips, timestamp, ttl) in self.cache.items():
            if current_time - timestamp > ttl:
                expired_count += 1
            else:
                valid_count += 1
        return {
            'total': len(self.cache),
            'valid': valid_count,
            'expired': expired_count,
            'max_size': self.max_size,
        }

# 使用示例
cache = DNSCache(max_size=100, default_ttl=300)
# 设置缓存
cache.set("www.example.com", ["192.0.2.1"], ttl=300)
# 获取缓存
ips = cache.get("www.example.com")
print(f"Cached IPs: {ips}")
# 清理过期缓存
expired_count = cache.cleanup_expired()
print(f"Cleaned up {expired_count} expired entries")
# 查看统计
stats = cache.stats()
print(f"Cache stats: {stats}")
```
6.3.3 缓存失效处理
缓存失效的场景:
- **TTL过期**:
  - 记录超过TTL时间
  - 需要重新查询
- **主动失效**:
  - 检测到IP变更
  - 手动清除缓存
- **错误失效**:
  - DNS查询失败
  - 清除可能错误的缓存
智能缓存失效策略:
```python
from typing import Optional, List

class SmartDNSCache(DNSCache):
    """智能DNS缓存(支持失效处理)"""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failure_count = {}  # {hostname: failure_count}
        self.max_failures = 3

    def mark_failure(self, hostname: str):
        """标记DNS查询失败"""
        self.failure_count[hostname] = self.failure_count.get(hostname, 0) + 1
        # 如果失败次数过多,清除缓存
        if self.failure_count[hostname] >= self.max_failures:
            self.remove(hostname)
            del self.failure_count[hostname]

    def mark_success(self, hostname: str):
        """标记DNS查询成功"""
        if hostname in self.failure_count:
            del self.failure_count[hostname]

    def get_with_fallback(self, hostname: str, resolver_func) -> Optional[List[str]]:
        """获取缓存,如果失效则重新查询"""
        # 先尝试从缓存获取
        ips = self.get(hostname)
        if ips:
            return ips
        # 缓存未命中或过期,重新查询
        try:
            ips = resolver_func(hostname)
            if ips:
                self.set(hostname, ips)
                self.mark_success(hostname)
            return ips
        except Exception:
            self.mark_failure(hostname)
            raise

# 使用示例
cache = SmartDNSCache()

def resolve_dns(hostname: str) -> List[str]:
    """DNS解析函数"""
    import socket
    ip = socket.gethostbyname(hostname)
    return [ip]

# 使用缓存和回退
ips = cache.get_with_fallback("www.example.com", resolve_dns)
print(f"Resolved IPs: {ips}")
```
6.3.4 多级缓存架构
多级缓存的设计:
是
否
是
否
应用程序
L1: 应用缓存
缓存命中?
L2: 系统DNS缓存
缓存命中?
L3: DNS服务器查询
多级缓存实现:
```python
from typing import Optional, List

class MultiLevelDNSCache:
    """多级DNS缓存"""
    def __init__(self):
        self.l1_cache = DNSCache(max_size=100, default_ttl=60)    # 应用缓存(短TTL)
        self.l2_cache = DNSCache(max_size=1000, default_ttl=300)  # 系统缓存(长TTL)

    def get(self, hostname: str) -> Optional[List[str]]:
        """多级缓存查询"""
        # L1缓存查询
        ips = self.l1_cache.get(hostname)
        if ips:
            return ips
        # L2缓存查询
        ips = self.l2_cache.get(hostname)
        if ips:
            # 更新L1缓存
            self.l1_cache.set(hostname, ips)
            return ips
        return None

    def set(self, hostname: str, ips: List[str], ttl: Optional[int] = None):
        """设置多级缓存"""
        self.l1_cache.set(hostname, ips, ttl)
        self.l2_cache.set(hostname, ips, ttl)

    def clear_all(self):
        """清空所有缓存"""
        self.l1_cache.clear()
        self.l2_cache.clear()

# 使用示例
multi_cache = MultiLevelDNSCache()
multi_cache.set("www.example.com", ["192.0.2.1"])
ips = multi_cache.get("www.example.com")
print(f"Resolved from cache: {ips}")
```
6.4 DNS over HTTPS/TLS实现
DNS over HTTPS (DoH) 和 DNS over TLS (DoT) 是加密的DNS查询协议,提供更好的隐私和安全性。
6.4.1 DoH协议原理
DoH的工作原理:
- **HTTPS请求**:
  - 通过HTTPS(通常是HTTP/2)发送DNS查询
  - 查询数据放在请求体或URL参数中
  - 全程加密传输
- **JSON格式**:
  - Cloudflare、Google等公共DoH服务在标准二进制格式(RFC 8484)之外,还提供JSON格式接口
  - 请求参数和响应都是JSON,便于调试和解析
- **常用端点**:
  - Cloudflare: `https://cloudflare-dns.com/dns-query`
  - Google: `https://dns.google/resolve`
  - Quad9: `https://dns.quad9.net/dns-query`
DoH查询格式:
```python
import asyncio
import aiohttp
from typing import List

async def doh_query_json(hostname: str, doh_server: str = "https://cloudflare-dns.com/dns-query") -> List[str]:
    """使用DoH进行DNS查询(JSON格式)"""
    params = {
        'name': hostname,
        'type': 'A',  # A记录
    }
    headers = {
        'Accept': 'application/dns-json',
    }
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(doh_server, params=params, headers=headers) as resp:
                if resp.status == 200:
                    # 部分DoH服务返回application/dns-json,需放宽Content-Type检查
                    data = await resp.json(content_type=None)
                    # 解析响应
                    ips = []
                    if 'Answer' in data:
                        for answer in data['Answer']:
                            if answer.get('type') == 1:  # type=1表示A记录
                                ips.append(answer['data'])
                    return ips
                else:
                    print(f"DoH query failed with status {resp.status}")
                    return []
        except Exception as e:
            print(f"DoH query error: {e}")
            return []

# 使用示例
async def main():
    ips = await doh_query_json("www.example.com")
    print(f"Resolved IPs: {ips}")

# asyncio.run(main())
```
6.4.2 DoT协议原理
DoT的工作原理:
- **TLS连接**:
  - 在TCP 853端口建立TLS连接
  - 使用标准DNS协议(加密传输)
- **DNS over TLS**:
  - 使用标准的DNS消息格式
  - 通过TLS加密传输
DoT客户端实现(需要dnspython支持):
```python
import ssl
import dns.query
import dns.message
import dns.rdatatype
from typing import List

def dot_query(hostname: str, dot_server: str = "1.1.1.1", port: int = 853,
              server_hostname: str = "one.one.one.one") -> List[str]:
    """使用DoT进行DNS查询"""
    # 创建DNS查询消息
    query = dns.message.make_query(hostname, dns.rdatatype.A)
    # 创建TLS上下文(开启证书校验时必须提供server_hostname)
    context = ssl.create_default_context()
    # 发送DoT查询
    try:
        response = dns.query.tls(query, dot_server, port=port,
                                 ssl_context=context, server_hostname=server_hostname)
        # 解析响应
        ips = []
        for answer in response.answer:
            for rdata in answer:
                if rdata.rdtype == dns.rdatatype.A:
                    ips.append(str(rdata))
        return ips
    except Exception as e:
        print(f"DoT query error: {e}")
        return []

# 使用示例(1.1.1.1的证书对应主机名one.one.one.one)
ips = dot_query("www.example.com")
print(f"Resolved IPs: {ips}")
```
6.4.3 DoH/DoT的安全优势
安全优势:
- **加密传输**:
  - DNS查询数据加密,防止窃听
  - 保护查询隐私
- **防止DNS劫持**:
  - 使用HTTPS/TLS,防止中间人攻击
  - 验证服务器证书
- **绕过DNS污染**:
  - 使用可信的DoH/DoT服务器
  - 避免本地DNS污染
对比传统DNS:
```python
import time
import socket
import asyncio

def compare_dns_methods(hostname: str):
    """对比不同DNS查询方法"""
    methods = {
        'System DNS': lambda h: [socket.gethostbyname(h)],
        'DoH (Cloudflare)': lambda h: asyncio.run(doh_query_json(h)),
        'DoT (Cloudflare)': lambda h: dot_query(h, "1.1.1.1"),
    }
    results = {}
    for method_name, method_func in methods.items():
        try:
            start = time.time()
            ips = method_func(hostname)
            elapsed = time.time() - start
            results[method_name] = {
                'ips': ips,
                'time': elapsed * 1000,
                'success': True,
            }
        except Exception as e:
            results[method_name] = {
                'ips': None,
                'time': None,
                'success': False,
                'error': str(e),
            }
    return results

# 使用示例(需先定义前文的doh_query_json和dot_query)
results = compare_dns_methods("www.example.com")
for method, result in results.items():
    print(f"{method}: {result}")
```
6.4.4 DoH客户端实现
完整的DoH客户端:
```python
import asyncio
import aiohttp
from typing import List, Optional

class DoHClient:
    """DNS over HTTPS客户端"""
    def __init__(
        self,
        doh_servers: List[str] = None,
        timeout: float = 5.0,
        cache: Optional[DNSCache] = None,
    ):
        self.doh_servers = doh_servers or [
            "https://cloudflare-dns.com/dns-query",
            "https://dns.google/resolve",
            "https://dns.quad9.net/dns-query",
        ]
        self.timeout = timeout
        self.cache = cache
        self.session = None

    async def _get_session(self):
        """获取aiohttp会话(延迟创建)"""
        if self.session is None:
            self.session = aiohttp.ClientSession()
        return self.session

    async def query(
        self,
        hostname: str,
        record_type: str = 'A',
        use_cache: bool = True,
    ) -> List[str]:
        """查询DNS记录"""
        # 检查缓存
        if use_cache and self.cache:
            cached_ips = self.cache.get(hostname)
            if cached_ips:
                return cached_ips
        # 尝试多个DoH服务器
        last_error = None
        for doh_server in self.doh_servers:
            try:
                ips = await self._query_server(hostname, record_type, doh_server)
                if ips:
                    # 更新缓存
                    if use_cache and self.cache:
                        self.cache.set(hostname, ips)
                    return ips
            except Exception as e:
                last_error = e
                continue
        # 所有服务器都失败
        if last_error:
            raise last_error
        return []

    async def _query_server(
        self,
        hostname: str,
        record_type: str,
        doh_server: str,
    ) -> List[str]:
        """查询单个DoH服务器"""
        session = await self._get_session()
        params = {
            'name': hostname,
            'type': record_type,
        }
        headers = {
            'Accept': 'application/dns-json',
        }
        async with session.get(
            doh_server,
            params=params,
            headers=headers,
            timeout=aiohttp.ClientTimeout(total=self.timeout),
        ) as resp:
            if resp.status != 200:
                raise Exception(f"DoH server returned status {resp.status}")
            # 部分DoH服务返回application/dns-json,需放宽Content-Type检查
            data = await resp.json(content_type=None)
            # 解析响应
            ips = []
            if 'Answer' in data:
                for answer in data['Answer']:
                    if answer.get('type') == 1:  # A记录
                        ips.append(answer['data'])
            return ips

    async def close(self):
        """关闭客户端"""
        if self.session:
            await self.session.close()
            self.session = None

# 使用示例
async def main():
    cache = DNSCache()
    doh_client = DoHClient(cache=cache)
    # 查询DNS
    ips = await doh_client.query("www.example.com")
    print(f"Resolved IPs: {ips}")
    # 再次查询(使用缓存)
    ips = await doh_client.query("www.example.com")
    print(f"Cached IPs: {ips}")
    await doh_client.close()

# asyncio.run(main())
```
6.5 代理池架构设计
代理池是爬虫系统的核心组件,负责管理、调度和维护代理资源。
6.5.1 代理池的数据结构设计
代理池的核心数据结构:
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ProxyType(Enum):
    """代理类型"""
    HTTP = "http"
    HTTPS = "https"
    SOCKS4 = "socks4"
    SOCKS5 = "socks5"

@dataclass
class Proxy:
    """代理对象"""
    host: str
    port: int
    proxy_type: ProxyType
    username: Optional[str] = None
    password: Optional[str] = None
    # 状态信息
    is_active: bool = True
    success_count: int = 0
    failure_count: int = 0
    last_check_time: Optional[float] = None
    last_success_time: Optional[float] = None
    response_time: float = 0.0  # 平均响应时间(秒)
    # 元数据
    location: Optional[str] = None  # 地理位置
    provider: Optional[str] = None  # 代理提供商

    def __str__(self) -> str:
        """代理字符串表示"""
        if self.username and self.password:
            return f"{self.proxy_type.value}://{self.username}:{self.password}@{self.host}:{self.port}"
        else:
            return f"{self.proxy_type.value}://{self.host}:{self.port}"

    def to_dict(self) -> dict:
        """转换为字典"""
        return {
            'host': self.host,
            'port': self.port,
            'type': self.proxy_type.value,
            'username': self.username,
            'password': self.password,
            'is_active': self.is_active,
            'success_count': self.success_count,
            'failure_count': self.failure_count,
            'last_check_time': self.last_check_time,
            'last_success_time': self.last_success_time,
            'response_time': self.response_time,
            'location': self.location,
            'provider': self.provider,
        }

    @property
    def success_rate(self) -> float:
        """成功率"""
        total = self.success_count + self.failure_count
        if total == 0:
            return 0.0
        return self.success_count / total

    @property
    def url(self) -> str:
        """代理URL"""
        return str(self)

# 使用示例
proxy = Proxy(
    host="proxy.example.com",
    port=8080,
    proxy_type=ProxyType.HTTP,
    username="user",
    password="pass",
)
print(f"Proxy URL: {proxy.url}")
print(f"Success rate: {proxy.success_rate:.2%}")
```
代理池的数据结构:
```python
import threading
from collections import deque
from typing import Dict, List, Optional, Set

class ProxyPool:
    """代理池数据结构"""
    def __init__(self):
        # 主存储:{proxy_id: Proxy}
        self.proxies: Dict[str, Proxy] = {}
        # 活跃代理集合(快速查找)
        self.active_proxies: Set[str] = set()
        # 代理队列(用于轮询)
        self.proxy_queue: deque = deque()
        # 按类型分组
        self.proxies_by_type: Dict[ProxyType, List[str]] = {
            ProxyType.HTTP: [],
            ProxyType.HTTPS: [],
            ProxyType.SOCKS4: [],
            ProxyType.SOCKS5: [],
        }
        # 线程安全锁
        self.lock = threading.Lock()

    def add_proxy(self, proxy: Proxy, proxy_id: Optional[str] = None) -> str:
        """添加代理"""
        if proxy_id is None:
            proxy_id = f"{proxy.host}:{proxy.port}"
        with self.lock:
            self.proxies[proxy_id] = proxy
            if proxy.is_active:
                self.active_proxies.add(proxy_id)
            self.proxy_queue.append(proxy_id)
            # 按类型分组
            self.proxies_by_type[proxy.proxy_type].append(proxy_id)
        return proxy_id

    def remove_proxy(self, proxy_id: str):
        """删除代理"""
        with self.lock:
            if proxy_id in self.proxies:
                proxy = self.proxies[proxy_id]
                del self.proxies[proxy_id]
                self.active_proxies.discard(proxy_id)
                # 从队列中移除
                if proxy_id in self.proxy_queue:
                    self.proxy_queue.remove(proxy_id)
                # 从类型分组中移除
                if proxy_id in self.proxies_by_type[proxy.proxy_type]:
                    self.proxies_by_type[proxy.proxy_type].remove(proxy_id)

    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """获取代理"""
        return self.proxies.get(proxy_id)

    def get_active_proxies(self) -> List[Proxy]:
        """获取所有活跃代理"""
        with self.lock:
            return [self.proxies[pid] for pid in self.active_proxies if pid in self.proxies]

    def get_proxies_by_type(self, proxy_type: ProxyType) -> List[Proxy]:
        """按类型获取代理"""
        with self.lock:
            return [self.proxies[pid] for pid in self.proxies_by_type[proxy_type] if pid in self.proxies]

# 使用示例
pool = ProxyPool()
proxy1 = Proxy("proxy1.com", 8080, ProxyType.HTTP)
proxy2 = Proxy("proxy2.com", 8080, ProxyType.SOCKS5)
pool.add_proxy(proxy1)
pool.add_proxy(proxy2)
active = pool.get_active_proxies()
print(f"Active proxies: {len(active)}")
```
6.5.2 代理类型详解(HTTP/HTTPS/SOCKS4/SOCKS5)
代理类型对比:
| 代理类型 | 协议层 | 支持HTTPS | 支持UDP | 认证方式 | 使用场景 |
|---|---|---|---|---|---|
| HTTP | 应用层 | 是 | 否 | Basic | 简单HTTP请求 |
| HTTPS | 应用层 | 是 | 否 | Basic | HTTPS请求 |
| SOCKS4 | 传输层 | 是 | 否 | 无 | 简单TCP连接 |
| SOCKS5 | 传输层 | 是 | 是 | 多种 | 复杂网络场景 |
代理类型的特点:
- **HTTP/HTTPS代理**:
  - 工作在应用层
  - 代理协议本身基于HTTP
  - 访问HTTPS站点时通过CONNECT方法建立隧道(见下面的示例)
- **SOCKS4代理**:
  - 工作在传输层
  - 支持TCP连接
  - 不支持UDP和IPv6
- **SOCKS5代理**:
  - 工作在传输层
  - 支持TCP和UDP
  - 支持IPv4和IPv6
  - 支持多种认证方式
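下面用原始socket演示HTTP代理的CONNECT隧道过程:先向代理请求建立到目标的TCP隧道,隧道打通后再在这条连接上完成TLS握手。其中的代理地址127.0.0.1:8080是假设值,需替换为真实可用的代理:

```python
import socket
import ssl

proxy_host, proxy_port = "127.0.0.1", 8080       # 假设的HTTP代理地址
target_host, target_port = "www.example.com", 443

sock = socket.create_connection((proxy_host, proxy_port), timeout=10)
# 1. 向代理发送CONNECT请求,要求建立到目标站点的TCP隧道
connect_req = (
    f"CONNECT {target_host}:{target_port} HTTP/1.1\r\n"
    f"Host: {target_host}:{target_port}\r\n\r\n"
)
sock.sendall(connect_req.encode())
reply = sock.recv(4096).decode(errors="replace")
if "200" not in reply.split("\r\n")[0]:
    raise RuntimeError(f"CONNECT failed: {reply.splitlines()[0]}")
# 2. 隧道建立后,在同一条TCP连接上与目标站点完成TLS握手
context = ssl.create_default_context()
tls_sock = context.wrap_socket(sock, server_hostname=target_host)
tls_sock.sendall(b"GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: close\r\n\r\n")
print(tls_sock.recv(1024).decode(errors="replace"))
tls_sock.close()
```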
代理URL格式:
```python
def format_proxy_url(proxy: Proxy) -> str:
    """格式化代理URL"""
    if proxy.proxy_type == ProxyType.HTTP:
        scheme = "http"
    elif proxy.proxy_type == ProxyType.HTTPS:
        scheme = "https"
    elif proxy.proxy_type == ProxyType.SOCKS4:
        scheme = "socks4"
    elif proxy.proxy_type == ProxyType.SOCKS5:
        scheme = "socks5"
    else:
        raise ValueError(f"Unknown proxy type: {proxy.proxy_type}")
    if proxy.username and proxy.password:
        return f"{scheme}://{proxy.username}:{proxy.password}@{proxy.host}:{proxy.port}"
    else:
        return f"{scheme}://{proxy.host}:{proxy.port}"

# 使用示例
proxy = Proxy("proxy.com", 8080, ProxyType.SOCKS5, "user", "pass")
url = format_proxy_url(proxy)
print(f"Proxy URL: {url}")
# 输出: socks5://user:pass@proxy.com:8080
```
6.5.3 代理池的接口设计
代理池的核心接口:
```python
from abc import ABC, abstractmethod
from typing import Optional

class IProxyPool(ABC):
    """代理池接口"""

    @abstractmethod
    def add_proxy(self, proxy: Proxy) -> str:
        """添加代理"""
        pass

    @abstractmethod
    def remove_proxy(self, proxy_id: str):
        """删除代理"""
        pass

    @abstractmethod
    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """获取代理"""
        pass

    @abstractmethod
    def get_next_proxy(self, strategy: str = "round_robin") -> Optional[Proxy]:
        """获取下一个代理(根据策略)"""
        pass

    @abstractmethod
    def mark_success(self, proxy_id: str, response_time: float):
        """标记代理成功"""
        pass

    @abstractmethod
    def mark_failure(self, proxy_id: str):
        """标记代理失败"""
        pass

    @abstractmethod
    def get_stats(self) -> dict:
        """获取统计信息"""
        pass
```
6.5.4 代理池架构图
代理池的完整架构:
```mermaid
flowchart TD
    Crawler[爬虫应用] --> Manager[代理池管理器]
    Manager --> Storage[代理存储]
    Manager --> Checker[健康检查器]
    Manager --> Balancer[负载均衡器]
    Manager --> Monitor[监控面板]
    Storage --> Queue[活跃代理队列]
    Storage --> Meta[代理元数据]
    Checker --> Timer[定时检查任务]
    Checker --> Eval[健康状态评估]
    Balancer --> RR[轮询算法]
    Balancer --> Rand[随机算法]
    Balancer --> W[加权算法]
    Balancer --> LC[最少连接数]
    Monitor --> Stats[统计信息]
    Monitor --> Live[实时状态]
```
6.6 代理健康检查机制
代理健康检查是确保代理池质量的关键机制。
6.6.1 健康检查的设计原则
设计原则:
- **非侵入性**:
  - 健康检查不应该影响正常使用
  - 使用独立的检查任务
- **频率控制**:
  - 避免过于频繁的检查
  - 防止被代理服务器封禁
- **多维度评估**:
  - 响应时间
  - 成功率
  - 可用性
- **异步执行**:
  - 健康检查应该是异步的
  - 不阻塞主流程(见下面的最小骨架)
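下面是体现"异步执行、不阻塞主流程"原则的最小骨架:健康检查作为独立的后台协程周期运行,主流程照常抓取。`check_fn`是任意异步检查函数,属于示意参数:

```python
import asyncio

async def periodic_check(check_fn, interval: float = 300.0):
    """后台周期性执行检查任务的最小骨架(示意实现)"""
    while True:
        try:
            await check_fn()
        except Exception as e:
            # 单次检查失败不应中断整个循环
            print(f"health check failed: {e}")
        await asyncio.sleep(interval)

# 在爬虫主协程中作为后台任务启动,不阻塞抓取逻辑:
# task = asyncio.create_task(periodic_check(my_check_fn))
```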
6.6.2 健康检查的实现方法
健康检查方法:
- **HTTP请求测试**:
  - 发送HTTP请求到测试URL
  - 检查响应状态码
  - 测量响应时间
- **连接测试**:
  - 测试TCP连接
  - 检查连接建立时间
- **实际请求测试**:
  - 使用代理发送真实请求
  - 记录成功率和响应时间
完整的健康检查实现:
```python
import asyncio
import aiohttp
import time
from typing import List

class ProxyHealthChecker:
    """代理健康检查器"""
    def __init__(
        self,
        test_url: str = "http://httpbin.org/ip",
        timeout: float = 5.0,
        max_concurrent: int = 10,
    ):
        self.test_url = test_url
        self.timeout = timeout
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def check_proxy(self, proxy: Proxy) -> dict:
        """检查单个代理"""
        async with self.semaphore:
            start_time = time.time()
            result = {
                'proxy_id': f"{proxy.host}:{proxy.port}",
                'success': False,
                'response_time': None,
                'error': None,
            }
            try:
                # 构建代理URL
                proxy_url = proxy.url
                # 创建连接器(SOCKS代理需要aiohttp-socks)
                connector = None
                if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                    try:
                        from aiohttp_socks import ProxyConnector
                        connector = ProxyConnector.from_url(proxy_url)
                    except ImportError:
                        result['error'] = "aiohttp-socks not installed"
                        return result
                else:
                    connector = aiohttp.TCPConnector()
                # 发送测试请求(会话关闭时会一并关闭连接器)
                timeout = aiohttp.ClientTimeout(total=self.timeout)
                async with aiohttp.ClientSession(connector=connector) as session:
                    async with session.get(
                        self.test_url,
                        proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                        timeout=timeout,
                    ) as resp:
                        if resp.status == 200:
                            result['success'] = True
                            result['response_time'] = time.time() - start_time
                        else:
                            result['error'] = f"HTTP {resp.status}"
            except asyncio.TimeoutError:
                result['error'] = "Timeout"
            except Exception as e:
                result['error'] = str(e)
            return result

    async def check_proxies(self, proxies: List[Proxy]) -> List[dict]:
        """批量检查代理"""
        tasks = [self.check_proxy(proxy) for proxy in proxies]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # 处理异常结果
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        return processed_results

# 使用示例
async def main():
    checker = ProxyHealthChecker()
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    result = await checker.check_proxy(proxy)
    print(f"Health check result: {result}")

# asyncio.run(main())
```
6.6.3 健康检查的频率控制
频率控制策略:
- **基于时间间隔**:
  - 固定时间间隔检查
  - 例如:每5分钟检查一次
- **基于使用频率**:
  - 使用频繁的代理检查更频繁
  - 使用少的代理检查较少
- **基于失败率**:
  - 失败率高的代理检查更频繁
  - 稳定的代理检查较少
智能频率控制:
```python
import time
from typing import Optional

class SmartHealthChecker(ProxyHealthChecker):
    """智能健康检查器(频率控制)"""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.check_intervals = {}  # {proxy_id: next_check_time}
        self.base_interval = 300   # 基础间隔(5分钟)
        self.min_interval = 60     # 最小间隔(1分钟)
        self.max_interval = 3600   # 最大间隔(1小时)

    def get_check_interval(self, proxy: Proxy) -> float:
        """计算检查间隔"""
        # 基于成功率调整间隔
        success_rate = proxy.success_rate
        if success_rate < 0.5:
            # 成功率低,频繁检查
            interval = self.min_interval
        elif success_rate < 0.8:
            # 成功率中等
            interval = self.base_interval
        else:
            # 成功率高,减少检查
            interval = self.max_interval
        return interval

    def should_check(self, proxy: Proxy) -> bool:
        """判断是否应该检查"""
        proxy_id = f"{proxy.host}:{proxy.port}"
        current_time = time.time()
        if proxy_id not in self.check_intervals:
            return True
        next_check_time = self.check_intervals[proxy_id]
        return current_time >= next_check_time

    def update_check_time(self, proxy: Proxy):
        """更新下次检查时间"""
        proxy_id = f"{proxy.host}:{proxy.port}"
        interval = self.get_check_interval(proxy)
        self.check_intervals[proxy_id] = time.time() + interval

    async def check_if_needed(self, proxy: Proxy) -> Optional[dict]:
        """如果需要则检查代理"""
        if not self.should_check(proxy):
            return None
        result = await self.check_proxy(proxy)
        self.update_check_time(proxy)
        return result

# 使用示例(await需在协程中执行)
async def main():
    checker = SmartHealthChecker()
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    proxy.success_count = 10
    proxy.failure_count = 2
    if checker.should_check(proxy):
        result = await checker.check_if_needed(proxy)
        print(f"Check result: {result}")

# asyncio.run(main())
```
6.6.4 健康状态的评估标准
健康状态评估:
```python
import time

class ProxyHealthEvaluator:
    """代理健康状态评估器"""

    @staticmethod
    def evaluate(proxy: Proxy) -> str:
        """评估代理健康状态"""
        # 计算健康分数(0-100)
        score = 0
        # 成功率(40分)
        success_rate = proxy.success_rate
        score += success_rate * 40
        # 响应时间(30分)
        if proxy.response_time > 0:
            if proxy.response_time < 1.0:
                score += 30
            elif proxy.response_time < 3.0:
                score += 20
            elif proxy.response_time < 5.0:
                score += 10
        # 活跃状态(20分)
        if proxy.is_active:
            score += 20
        # 最近成功时间(10分)
        if proxy.last_success_time:
            time_since_success = time.time() - proxy.last_success_time
            if time_since_success < 300:     # 5分钟内成功
                score += 10
            elif time_since_success < 1800:  # 30分钟内成功
                score += 5
        # 确定健康等级
        if score >= 80:
            return "excellent"
        elif score >= 60:
            return "good"
        elif score >= 40:
            return "fair"
        else:
            return "poor"

    @staticmethod
    def should_remove(proxy: Proxy) -> bool:
        """判断是否应该移除代理"""
        # 失败率过高
        if proxy.failure_count > 10 and proxy.success_rate < 0.2:
            return True
        # 长时间未成功
        if proxy.last_success_time:
            time_since_success = time.time() - proxy.last_success_time
            if time_since_success > 3600:  # 1小时未成功
                return True
        return False

# 使用示例
evaluator = ProxyHealthEvaluator()
proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
proxy.success_count = 8
proxy.failure_count = 2
proxy.response_time = 0.5
proxy.is_active = True
proxy.last_success_time = time.time()
health = evaluator.evaluate(proxy)
print(f"Proxy health: {health}")
should_remove = evaluator.should_remove(proxy)
print(f"Should remove: {should_remove}")
```
6.7 负载均衡算法实现
负载均衡算法决定如何从代理池中选择代理,不同的算法适用于不同的场景。
6.7.1 轮询算法(Round Robin)
轮询算法原理:
- 按顺序依次选择代理
- 公平分配请求
- 实现简单
实现:
```python
import threading
from typing import Optional

class RoundRobinBalancer:
    """轮询负载均衡器"""
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
        self.current_index = 0
        self.lock = threading.Lock()

    def get_next_proxy(self) -> Optional[Proxy]:
        """获取下一个代理"""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        with self.lock:
            # 活跃代理数量可能变化,先取模防止索引越界
            self.current_index %= len(active_proxies)
            proxy = active_proxies[self.current_index]
            self.current_index = (self.current_index + 1) % len(active_proxies)
        return proxy

# 使用示例
pool = ProxyPool()
# ... 添加代理 ...
balancer = RoundRobinBalancer(pool)
for _ in range(10):
    proxy = balancer.get_next_proxy()
    if proxy:
        print(f"Selected proxy: {proxy.host}")
```
6.7.2 随机算法(Random)
随机算法原理:
- 随机选择代理
- 避免热点代理
- 适合代理质量相近的场景
实现:
```python
import random
from typing import Optional

class RandomBalancer:
    """随机负载均衡器"""
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool

    def get_next_proxy(self) -> Optional[Proxy]:
        """获取随机代理"""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        return random.choice(active_proxies)
```
6.7.3 加权算法(Weighted)
加权算法原理:
- 根据代理质量分配权重
- 质量好的代理被选中的概率更高
- 适合代理质量差异大的场景
实现:
```python
import random
from typing import Optional

class WeightedBalancer:
    """加权负载均衡器"""
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool

    def calculate_weight(self, proxy: Proxy) -> float:
        """计算代理权重"""
        # 基于成功率和响应时间
        success_rate = proxy.success_rate
        response_time = proxy.response_time if proxy.response_time > 0 else 5.0
        # 权重 = 成功率 * (1 / 响应时间) * 100
        weight = success_rate * (1.0 / response_time) * 100
        return max(weight, 0.1)  # 最小权重0.1

    def get_next_proxy(self) -> Optional[Proxy]:
        """获取加权随机代理"""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        # 计算权重
        weights = [self.calculate_weight(p) for p in active_proxies]
        total_weight = sum(weights)
        if total_weight == 0:
            return random.choice(active_proxies)
        # 加权随机选择
        r = random.uniform(0, total_weight)
        cumulative = 0
        for proxy, weight in zip(active_proxies, weights):
            cumulative += weight
            if r <= cumulative:
                return proxy
        return active_proxies[-1]  # 兜底
```
6.7.4 最少连接数算法(Least Connections)
最少连接数算法原理:
- 选择当前连接数最少的代理
- 平衡代理负载
- 适合长时间连接的场景
实现:
```python
import threading
from typing import Optional

class LeastConnectionsBalancer:
    """最少连接数负载均衡器"""
    def __init__(self, proxy_pool: ProxyPool):
        self.proxy_pool = proxy_pool
        self.connection_count = {}  # {proxy_id: count}
        self.lock = threading.Lock()

    def get_next_proxy(self) -> Optional[Proxy]:
        """获取连接数最少的代理"""
        active_proxies = self.proxy_pool.get_active_proxies()
        if not active_proxies:
            return None
        # 找到连接数最少的代理
        min_connections = float('inf')
        selected_proxy = None
        with self.lock:
            for proxy in active_proxies:
                proxy_id = f"{proxy.host}:{proxy.port}"
                count = self.connection_count.get(proxy_id, 0)
                if count < min_connections:
                    min_connections = count
                    selected_proxy = proxy
        return selected_proxy

    def increment_connections(self, proxy: Proxy):
        """增加连接数"""
        proxy_id = f"{proxy.host}:{proxy.port}"
        with self.lock:
            self.connection_count[proxy_id] = self.connection_count.get(proxy_id, 0) + 1

    def decrement_connections(self, proxy: Proxy):
        """减少连接数"""
        proxy_id = f"{proxy.host}:{proxy.port}"
        with self.lock:
            count = self.connection_count.get(proxy_id, 0)
            if count > 0:
                self.connection_count[proxy_id] = count - 1
```
6.7.5 算法性能对比
性能对比测试:
```python
import time

def compare_balancers(proxy_pool: ProxyPool, iterations: int = 1000):
    """对比不同负载均衡算法"""
    balancers = {
        'Round Robin': RoundRobinBalancer(proxy_pool),
        'Random': RandomBalancer(proxy_pool),
        'Weighted': WeightedBalancer(proxy_pool),
        'Least Connections': LeastConnectionsBalancer(proxy_pool),
    }
    results = {}
    for name, balancer in balancers.items():
        start = time.time()
        selection_count = {}
        for _ in range(iterations):
            proxy = balancer.get_next_proxy()
            if proxy:
                proxy_id = f"{proxy.host}:{proxy.port}"
                selection_count[proxy_id] = selection_count.get(proxy_id, 0) + 1
        elapsed = time.time() - start
        # 计算选择分布的均匀度(标准差)
        counts = list(selection_count.values())
        if counts:
            mean = sum(counts) / len(counts)
            variance = sum((x - mean) ** 2 for x in counts) / len(counts)
            std_dev = variance ** 0.5
        else:
            std_dev = 0
        results[name] = {
            'time': elapsed * 1000,  # 毫秒
            'std_dev': std_dev,
            'distribution': selection_count,
        }
    return results

# 使用示例
# results = compare_balancers(proxy_pool, 1000)
# for name, result in results.items():
#     print(f"{name}: {result['time']:.2f}ms, std_dev: {result['std_dev']:.2f}")
```
6.8 工具链:DNS和代理工具使用
6.8.1 使用dnspython进行DNS查询
安装dnspython:
```bash
pip install dnspython
```
基本使用:
```python
import dns.resolver
from typing import List

def query_dns(hostname: str, record_type: str = 'A') -> List[str]:
    """使用dnspython查询DNS"""
    try:
        answers = dns.resolver.resolve(hostname, record_type)
        return [str(answer) for answer in answers]
    except Exception as e:
        print(f"DNS query failed: {e}")
        return []

# 使用示例
ips = query_dns("www.example.com")
print(f"IPs: {ips}")
```
6.8.2 使用DoH服务进行DNS查询
使用Cloudflare DoH:
```python
from typing import List

async def query_doh_cloudflare(hostname: str) -> List[str]:
    """使用Cloudflare DoH查询(doh_query_json见6.4.1)"""
    doh_url = "https://cloudflare-dns.com/dns-query"
    return await doh_query_json(hostname, doh_url)
```
6.8.3 使用httpx/aiohttp配置代理
httpx配置代理:
```python
import httpx

# HTTP代理
proxy_url = "http://proxy.example.com:8080"
client = httpx.Client(proxies=proxy_url)  # 注意:较新版本的httpx将该参数改名为proxy

# SOCKS5代理(需要安装httpx[socks])
proxy_url = "socks5://proxy.example.com:1080"
client = httpx.Client(proxies=proxy_url)

# 使用代理发送请求
response = client.get("https://httpbin.org/ip")
```
aiohttp配置代理:
```python
import asyncio
import aiohttp

# HTTP代理
proxy_url = "http://proxy.example.com:8080"

async def fetch_via_proxy():
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.org/ip", proxy=proxy_url) as resp:
            print(await resp.text())

# asyncio.run(fetch_via_proxy())
```
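若代理需要用户名/密码认证,aiohttp可以通过`proxy_auth`参数传入凭据。示例中的账号为占位值:

```python
import asyncio
import aiohttp

async def fetch_with_auth_proxy():
    proxy_url = "http://proxy.example.com:8080"
    auth = aiohttp.BasicAuth("user", "pass")  # 占位的代理账号
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.org/ip",
                               proxy=proxy_url, proxy_auth=auth) as resp:
            print(await resp.text())

# asyncio.run(fetch_with_auth_proxy())
```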
6.8.4 使用Redis管理代理池数据
Redis代理池存储:
```python
import redis
from typing import List, Optional

class RedisProxyPool:
    """基于Redis的代理池"""
    def __init__(self, redis_host: str = "localhost", redis_port: int = 6379):
        self.redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)
        self.proxy_key_prefix = "proxy:"
        self.active_set_key = "proxies:active"

    def add_proxy(self, proxy: Proxy, proxy_id: Optional[str] = None) -> str:
        """添加代理到Redis"""
        if proxy_id is None:
            proxy_id = f"{proxy.host}:{proxy.port}"
        key = f"{self.proxy_key_prefix}{proxy_id}"
        # Redis哈希不能存储None,统一转为字符串并过滤空字段
        mapping = {k: str(v) for k, v in proxy.to_dict().items() if v is not None}
        self.redis_client.hset(key, mapping=mapping)
        if proxy.is_active:
            self.redis_client.sadd(self.active_set_key, proxy_id)
        return proxy_id

    def get_proxy(self, proxy_id: str) -> Optional[Proxy]:
        """从Redis获取代理"""
        key = f"{self.proxy_key_prefix}{proxy_id}"
        data = self.redis_client.hgetall(key)
        if not data:
            return None
        return Proxy(
            host=data['host'],
            port=int(data['port']),
            proxy_type=ProxyType(data['type']),
            username=data.get('username'),
            password=data.get('password'),
            is_active=data.get('is_active', 'True') == 'True',
            success_count=int(data.get('success_count', 0)),
            failure_count=int(data.get('failure_count', 0)),
        )

    def get_active_proxies(self) -> List[str]:
        """获取所有活跃代理ID"""
        return list(self.redis_client.smembers(self.active_set_key))
```
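一个最小使用示例(假设本地已运行Redis,Proxy和ProxyType按前文定义):

```python
pool = RedisProxyPool(redis_host="localhost", redis_port=6379)
proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
proxy_id = pool.add_proxy(proxy)

loaded = pool.get_proxy(proxy_id)  # 从Redis读回并重建Proxy对象
print(loaded.url if loaded else "not found")
print(pool.get_active_proxies())
```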
6.8.5 使用aiohttp-socks支持SOCKS代理
安装aiohttp-socks:
```bash
pip install aiohttp-socks
```
使用SOCKS代理:
```python
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

# SOCKS5代理
proxy_url = "socks5://proxy.example.com:1080"

async def fetch_via_socks():
    # 连接器在协程内创建,确保绑定到当前事件循环
    connector = ProxyConnector.from_url(proxy_url)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://httpbin.org/ip") as resp:
            print(await resp.text())

# asyncio.run(fetch_via_socks())
```
6.9 代码对照:完整实现
6.9.1 自定义DNS解析器实现(支持缓存和DoH)
完整的DNS解析器:
```python
import socket
from typing import Optional, List

class AdvancedDNSResolver:
    """高级DNS解析器(支持缓存和DoH)"""
    def __init__(
        self,
        use_doh: bool = True,
        doh_servers: List[str] = None,
        cache: Optional[DNSCache] = None,
        fallback_to_system: bool = True,
    ):
        self.use_doh = use_doh
        self.doh_servers = doh_servers or [
            "https://cloudflare-dns.com/dns-query",
            "https://dns.google/resolve",
        ]
        self.cache = cache or DNSCache()
        self.fallback_to_system = fallback_to_system
        self.doh_client = DoHClient(doh_servers=self.doh_servers, cache=self.cache) if use_doh else None

    async def resolve(self, hostname: str) -> List[str]:
        """解析域名"""
        # 检查缓存
        cached_ips = self.cache.get(hostname)
        if cached_ips:
            return cached_ips
        # 使用DoH查询
        if self.use_doh and self.doh_client:
            try:
                ips = await self.doh_client.query(hostname)
                if ips:
                    return ips
            except Exception as e:
                print(f"DoH query failed: {e}")
        # 回退到系统DNS
        if self.fallback_to_system:
            try:
                ip = socket.gethostbyname(hostname)
                ips = [ip]
                self.cache.set(hostname, ips)
                return ips
            except Exception as e:
                print(f"System DNS failed: {e}")
        return []

    async def close(self):
        """关闭解析器"""
        if self.doh_client:
            await self.doh_client.close()

# 使用示例
async def main():
    resolver = AdvancedDNSResolver(use_doh=True)
    ips = await resolver.resolve("www.example.com")
    print(f"Resolved IPs: {ips}")
    await resolver.close()

# asyncio.run(main())
```
6.9.2 代理池管理器类的完整实现
完整的代理池管理器:
```python
import asyncio
import time
from typing import Optional

class ProxyPoolManager:
    """代理池管理器(完整版)"""
    def __init__(
        self,
        health_checker: Optional[ProxyHealthChecker] = None,
        balancer_type: str = "round_robin",
        health_check_interval: float = 300.0,
    ):
        self.pool = ProxyPool()
        self.health_checker = health_checker or ProxyHealthChecker()
        self.balancer_type = balancer_type
        self.health_check_interval = health_check_interval
        # 负载均衡器
        self.balancer = self._create_balancer()
        # 健康检查任务
        self.health_check_task = None
        self.running = False
        # 统计信息
        self.stats = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'proxy_rotations': 0,
        }

    def _create_balancer(self):
        """创建负载均衡器"""
        if self.balancer_type == "round_robin":
            return RoundRobinBalancer(self.pool)
        elif self.balancer_type == "random":
            return RandomBalancer(self.pool)
        elif self.balancer_type == "weighted":
            return WeightedBalancer(self.pool)
        elif self.balancer_type == "least_connections":
            return LeastConnectionsBalancer(self.pool)
        else:
            return RoundRobinBalancer(self.pool)

    def add_proxy(self, proxy: Proxy) -> str:
        """添加代理"""
        return self.pool.add_proxy(proxy)

    def remove_proxy(self, proxy_id: str):
        """删除代理"""
        self.pool.remove_proxy(proxy_id)

    def get_next_proxy(self) -> Optional[Proxy]:
        """获取下一个代理"""
        proxy = self.balancer.get_next_proxy()
        if proxy:
            self.stats['proxy_rotations'] += 1
        return proxy

    def mark_success(self, proxy: Proxy, response_time: float):
        """标记代理成功"""
        proxy.success_count += 1
        proxy.last_success_time = time.time()
        proxy.last_check_time = time.time()
        # 更新平均响应时间
        total_requests = proxy.success_count + proxy.failure_count
        proxy.response_time = (
            (proxy.response_time * (total_requests - 1) + response_time) / total_requests
        )
        self.stats['successful_requests'] += 1
        self.stats['total_requests'] += 1

    def mark_failure(self, proxy: Proxy):
        """标记代理失败"""
        proxy.failure_count += 1
        proxy.last_check_time = time.time()
        # 如果失败率过高,标记为非活跃
        if proxy.success_rate < 0.2 and proxy.failure_count > 10:
            proxy.is_active = False
            self.pool.active_proxies.discard(f"{proxy.host}:{proxy.port}")
        self.stats['failed_requests'] += 1
        self.stats['total_requests'] += 1

    async def health_check_loop(self):
        """健康检查循环"""
        while self.running:
            try:
                # 获取所有活跃代理
                active_proxies = self.pool.get_active_proxies()
                if active_proxies:
                    # 批量健康检查
                    results = await self.health_checker.check_proxies(active_proxies)
                    # 更新代理状态
                    for proxy, result in zip(active_proxies, results):
                        proxy_id = f"{proxy.host}:{proxy.port}"
                        if result['success']:
                            proxy.is_active = True
                            proxy.response_time = result.get('response_time', proxy.response_time)
                            self.pool.active_proxies.add(proxy_id)
                        else:
                            # 失败次数过多则标记为非活跃
                            if proxy.failure_count > 5:
                                proxy.is_active = False
                                self.pool.active_proxies.discard(proxy_id)
                await asyncio.sleep(self.health_check_interval)
            except Exception as e:
                print(f"Health check error: {e}")
                await asyncio.sleep(60)

    def start(self):
        """启动代理池管理器(需在事件循环中调用)"""
        self.running = True
        self.health_check_task = asyncio.create_task(self.health_check_loop())

    def stop(self):
        """停止代理池管理器"""
        self.running = False
        if self.health_check_task:
            self.health_check_task.cancel()

    def get_stats(self) -> dict:
        """获取统计信息"""
        active_count = len(self.pool.active_proxies)
        total_count = len(self.pool.proxies)
        return {
            **self.stats,
            'total_proxies': total_count,
            'active_proxies': active_count,
            'inactive_proxies': total_count - active_count,
            'success_rate': (
                self.stats['successful_requests'] / self.stats['total_requests']
                if self.stats['total_requests'] > 0 else 0.0
            ),
        }

# 使用示例
async def main():
    manager = ProxyPoolManager(
        balancer_type="weighted",
        health_check_interval=300.0,
    )
    # 添加代理
    proxy1 = Proxy("proxy1.com", 8080, ProxyType.HTTP)
    proxy2 = Proxy("proxy2.com", 8080, ProxyType.SOCKS5)
    manager.add_proxy(proxy1)
    manager.add_proxy(proxy2)
    # 启动管理器
    manager.start()
    # 使用代理
    proxy = manager.get_next_proxy()
    if proxy:
        print(f"Using proxy: {proxy.url}")
        # 模拟请求成功
        manager.mark_success(proxy, 0.5)
    # 查看统计
    stats = manager.get_stats()
    print(f"Stats: {stats}")
    # 停止管理器
    manager.stop()

# asyncio.run(main())
```
6.9.3 代理健康检查的实现代码
完整的健康检查实现:
```python
import asyncio
import aiohttp
import time
from typing import List, Dict

class ComprehensiveHealthChecker:
    """综合健康检查器"""
    def __init__(
        self,
        test_urls: List[str] = None,
        timeout: float = 5.0,
        max_concurrent: int = 10,
        retry_times: int = 2,
    ):
        self.test_urls = test_urls or [
            "http://httpbin.org/ip",
            "http://httpbin.org/get",
        ]
        self.timeout = timeout
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.retry_times = retry_times

    async def check_proxy_comprehensive(self, proxy: Proxy) -> Dict:
        """综合健康检查"""
        async with self.semaphore:
            results = {
                'proxy_id': f"{proxy.host}:{proxy.port}",
                'success': False,
                'response_times': [],
                'test_results': [],
                'error': None,
            }
            # 对每个测试URL进行检查
            for test_url in self.test_urls:
                for attempt in range(self.retry_times):
                    try:
                        result = await self._test_proxy(proxy, test_url)
                        results['test_results'].append(result)
                        if result['success']:
                            results['response_times'].append(result['response_time'])
                            results['success'] = True
                            break
                    except Exception as e:
                        if attempt == self.retry_times - 1:
                            results['error'] = str(e)
            # 计算平均响应时间
            if results['response_times']:
                results['avg_response_time'] = sum(results['response_times']) / len(results['response_times'])
            else:
                results['avg_response_time'] = None
            # 计算成功率
            results['success_rate'] = (
                len([r for r in results['test_results'] if r['success']]) / len(results['test_results'])
                if results['test_results'] else 0.0
            )
            return results

    async def _test_proxy(self, proxy: Proxy, test_url: str) -> Dict:
        """测试单个URL"""
        start_time = time.time()
        result = {
            'url': test_url,
            'success': False,
            'response_time': None,
            'status_code': None,
            'error': None,
        }
        try:
            proxy_url = proxy.url
            # 创建连接器(SOCKS代理需要aiohttp-socks)
            connector = None
            if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                try:
                    from aiohttp_socks import ProxyConnector
                    connector = ProxyConnector.from_url(proxy_url)
                except ImportError:
                    result['error'] = "aiohttp-socks not installed"
                    return result
            else:
                connector = aiohttp.TCPConnector()
            # 发送请求(会话关闭时会一并关闭连接器)
            timeout = aiohttp.ClientTimeout(total=self.timeout)
            async with aiohttp.ClientSession(connector=connector) as session:
                async with session.get(
                    test_url,
                    proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                    timeout=timeout,
                ) as resp:
                    result['status_code'] = resp.status
                    result['success'] = resp.status == 200
                    result['response_time'] = time.time() - start_time
        except asyncio.TimeoutError:
            result['error'] = "Timeout"
            result['response_time'] = self.timeout
        except Exception as e:
            result['error'] = str(e)
            result['response_time'] = time.time() - start_time
        return result

    async def check_proxies_batch(self, proxies: List[Proxy]) -> List[Dict]:
        """批量健康检查"""
        tasks = [self.check_proxy_comprehensive(proxy) for proxy in proxies]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        processed_results = []
        for result in results:
            if isinstance(result, Exception):
                processed_results.append({
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        return processed_results

# 使用示例
async def main():
    checker = ComprehensiveHealthChecker()
    proxy = Proxy("proxy.example.com", 8080, ProxyType.HTTP)
    result = await checker.check_proxy_comprehensive(proxy)
    print(f"Health check result: {result}")

# asyncio.run(main())
```
6.9.4 代理轮换中间件的实现
代理轮换中间件:
```python
import asyncio
import time
import aiohttp
from typing import Optional

class ProxyRotateMiddleware:
    """代理轮换中间件"""
    def __init__(self, proxy_pool_manager: ProxyPoolManager):
        self.proxy_pool_manager = proxy_pool_manager
        self.current_proxy: Optional[Proxy] = None
        self.proxy_usage_count = {}  # {proxy_id: count}
        self.max_usage_per_proxy = 100  # 每个代理最多使用次数

    def get_proxy_for_request(self) -> Optional[Proxy]:
        """为请求获取代理"""
        # 检查当前代理是否还能使用
        if self.current_proxy:
            proxy_id = f"{self.current_proxy.host}:{self.current_proxy.port}"
            usage_count = self.proxy_usage_count.get(proxy_id, 0)
            if usage_count < self.max_usage_per_proxy and self.current_proxy.is_active:
                self.proxy_usage_count[proxy_id] = usage_count + 1
                return self.current_proxy
        # 获取新代理
        self.current_proxy = self.proxy_pool_manager.get_next_proxy()
        if self.current_proxy:
            proxy_id = f"{self.current_proxy.host}:{self.current_proxy.port}"
            self.proxy_usage_count[proxy_id] = 1
        return self.current_proxy

    async def request_with_proxy(
        self,
        url: str,
        method: str = "GET",
        **kwargs
    ) -> dict:
        """使用代理发送请求,返回{'status': 状态码, 'text': 响应体}。
        注意:响应体必须在会话关闭前读取,因此这里读取后返回结果字典,
        而不是返回已脱离会话上下文的ClientResponse对象。
        """
        proxy = self.get_proxy_for_request()
        if not proxy:
            raise Exception("No available proxy")
        proxy_url = proxy.url
        start_time = time.time()
        try:
            # 创建连接器(SOCKS代理需要aiohttp-socks)
            connector = None
            if proxy.proxy_type in [ProxyType.SOCKS4, ProxyType.SOCKS5]:
                from aiohttp_socks import ProxyConnector
                connector = ProxyConnector.from_url(proxy_url)
            # 发送请求
            async with aiohttp.ClientSession(connector=connector) as session:
                async with session.request(
                    method,
                    url,
                    proxy=proxy_url if proxy.proxy_type in [ProxyType.HTTP, ProxyType.HTTPS] else None,
                    **kwargs
                ) as resp:
                    text = await resp.text()
                    response_time = time.time() - start_time
                    # 标记代理状态
                    if resp.status == 200:
                        self.proxy_pool_manager.mark_success(proxy, response_time)
                    else:
                        self.proxy_pool_manager.mark_failure(proxy)
                    return {'status': resp.status, 'text': text}
        except Exception:
            # 标记代理失败
            self.proxy_pool_manager.mark_failure(proxy)
            raise

    def rotate_proxy(self):
        """强制轮换代理"""
        self.current_proxy = None

# 使用示例
async def main():
    manager = ProxyPoolManager()
    # ... 添加代理 ...
    middleware = ProxyRotateMiddleware(manager)
    # 使用中间件发送请求
    try:
        result = await middleware.request_with_proxy("https://httpbin.org/ip")
        print(result['text'])
    except Exception as e:
        print(f"Request failed: {e}")

# asyncio.run(main())
```
6.9.5 代理池监控面板代码
监控面板实现:
```python
from flask import Flask, render_template_string, jsonify

# 监控面板HTML模板
MONITOR_TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
    <title>代理池监控面板</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 20px; }
        .stats { display: flex; gap: 20px; margin-bottom: 20px; }
        .stat-card { background: #f0f0f0; padding: 15px; border-radius: 5px; }
        .stat-card h3 { margin: 0 0 10px 0; }
        .stat-card .value { font-size: 24px; font-weight: bold; }
        table { width: 100%; border-collapse: collapse; }
        th, td { padding: 10px; text-align: left; border-bottom: 1px solid #ddd; }
        th { background-color: #4CAF50; color: white; }
        .status-active { color: green; }
        .status-inactive { color: red; }
    </style>
    <script>
        setInterval(function() {
            fetch('/api/stats')
                .then(response => response.json())
                .then(data => {
                    document.getElementById('stats').innerHTML = generateStatsHTML(data);
                    document.getElementById('proxies').innerHTML = generateProxiesHTML(data.proxies);
                });
        }, 5000);

        function generateStatsHTML(data) {
            return `
                <div class="stat-card">
                    <h3>总代理数</h3>
                    <div class="value">${data.total_proxies}</div>
                </div>
                <div class="stat-card">
                    <h3>活跃代理</h3>
                    <div class="value">${data.active_proxies}</div>
                </div>
                <div class="stat-card">
                    <h3>成功率</h3>
                    <div class="value">${(data.success_rate * 100).toFixed(2)}%</div>
                </div>
                <div class="stat-card">
                    <h3>总请求数</h3>
                    <div class="value">${data.total_requests}</div>
                </div>
            `;
        }

        function generateProxiesHTML(proxies) {
            let html = '<table><tr><th>代理</th><th>类型</th><th>状态</th><th>成功率</th><th>响应时间</th><th>使用次数</th></tr>';
            proxies.forEach(proxy => {
                html += `
                    <tr>
                        <td>${proxy.host}:${proxy.port}</td>
                        <td>${proxy.type}</td>
                        <td class="${proxy.is_active ? 'status-active' : 'status-inactive'}">
                            ${proxy.is_active ? '活跃' : '非活跃'}
                        </td>
                        <td>${(proxy.success_rate * 100).toFixed(2)}%</td>
                        <td>${proxy.response_time.toFixed(3)}s</td>
                        <td>${proxy.success_count + proxy.failure_count}</td>
                    </tr>
                `;
            });
            html += '</table>';
            return html;
        }
    </script>
</head>
<body>
    <h1>代理池监控面板</h1>
    <div class="stats" id="stats"></div>
    <h2>代理列表</h2>
    <div id="proxies"></div>
</body>
</html>
"""

class ProxyPoolMonitor:
    """代理池监控器"""
    def __init__(self, proxy_pool_manager: ProxyPoolManager, port: int = 5000):
        self.proxy_pool_manager = proxy_pool_manager
        self.port = port
        self.app = Flask(__name__)
        self._setup_routes()

    def _setup_routes(self):
        """设置路由"""
        @self.app.route('/')
        def index():
            return render_template_string(MONITOR_TEMPLATE)

        @self.app.route('/api/stats')
        def api_stats():
            stats = self.proxy_pool_manager.get_stats()
            proxies = self.proxy_pool_manager.pool.get_active_proxies()
            proxy_data = []
            for proxy in proxies:
                proxy_data.append({
                    'host': proxy.host,
                    'port': proxy.port,
                    'type': proxy.proxy_type.value,
                    'is_active': proxy.is_active,
                    'success_rate': proxy.success_rate,
                    'response_time': proxy.response_time,
                    'success_count': proxy.success_count,
                    'failure_count': proxy.failure_count,
                })
            return jsonify({
                **stats,
                'proxies': proxy_data,
            })

    def run(self, debug: bool = False):
        """运行监控面板"""
        self.app.run(host='0.0.0.0', port=self.port, debug=debug)

# 使用示例
# monitor = ProxyPoolMonitor(proxy_pool_manager, port=5000)
# monitor.run()
```
6.10 实战演练:构建高可用代理池系统
本节将一步步演示如何构建一个完整的高可用代理池系统。
6.10.1 步骤1:设计代理池的数据结构和接口
数据结构设计:
```python
# 使用之前定义的Proxy和ProxyPool类
# 设计要点:
# 1. 使用字典存储代理(快速查找)
# 2. 使用集合存储活跃代理(快速过滤)
# 3. 使用队列实现轮询(公平分配)
# 4. 线程安全(使用锁保护)
```
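其中"线程安全"这一设计要点可以用一个小实验验证:多个线程并发向同一个池添加代理,最终条目数应与添加次数一致。以下代码为示意,假设Proxy、ProxyType、ProxyPool均按前文定义:

```python
import threading

pool = ProxyPool()

def add_many(start: int):
    # 每个线程添加100个ID互不重复的代理
    for i in range(start, start + 100):
        pool.add_proxy(Proxy(f"10.0.0.{i % 250}", 8000 + i, ProxyType.HTTP), proxy_id=f"p{i}")

threads = [threading.Thread(target=add_many, args=(i * 100,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(pool.proxies))  # 在锁的保护下,期望输出400
```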
接口设计:
```python
from abc import ABC, abstractmethod
from typing import Optional

class IProxyPoolManager(ABC):
    """代理池管理器接口"""

    @abstractmethod
    def add_proxy(self, proxy: Proxy) -> str:
        """添加代理"""
        pass

    @abstractmethod
    def remove_proxy(self, proxy_id: str):
        """删除代理"""
        pass

    @abstractmethod
    def get_next_proxy(self) -> Optional[Proxy]:
        """获取下一个代理"""
        pass

    @abstractmethod
    def mark_success(self, proxy: Proxy, response_time: float):
        """标记成功"""
        pass

    @abstractmethod
    def mark_failure(self, proxy: Proxy):
        """标记失败"""
        pass
```
6.10.2 步骤2:实现代理添加、删除、查询功能
完整实现:
```python
from urllib.parse import urlparse
from typing import List

class ProxyPoolManagerV2(ProxyPoolManager):
    """代理池管理器V2(增强版)"""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxy_metadata = {}  # {proxy_id: metadata}

    def add_proxy_from_string(self, proxy_string: str, **metadata) -> str:
        """从字符串添加代理
        格式: http://user:pass@host:port 或 socks5://host:port
        """
        try:
            parsed = urlparse(proxy_string)
            proxy_type_map = {
                'http': ProxyType.HTTP,
                'https': ProxyType.HTTPS,
                'socks4': ProxyType.SOCKS4,
                'socks5': ProxyType.SOCKS5,
            }
            proxy_type = proxy_type_map.get(parsed.scheme)
            if not proxy_type:
                raise ValueError(f"Unsupported proxy type: {parsed.scheme}")
            proxy = Proxy(
                host=parsed.hostname,
                port=parsed.port or (1080 if 'socks' in parsed.scheme else 8080),
                proxy_type=proxy_type,
                username=parsed.username,
                password=parsed.password,
                **metadata
            )
            proxy_id = self.add_proxy(proxy)
            self.proxy_metadata[proxy_id] = metadata
            return proxy_id
        except Exception as e:
            print(f"Failed to add proxy from string: {e}")
            return None

    def add_proxies_from_file(self, file_path: str):
        """从文件批量添加代理(每行一个代理URL,#开头为注释)"""
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#'):
                    self.add_proxy_from_string(line)

    def export_proxies(self, file_path: str):
        """导出代理到文件"""
        with open(file_path, 'w') as f:
            for proxy_id, proxy in self.pool.proxies.items():
                f.write(f"{proxy.url}\n")

    def search_proxies(self, **filters) -> List[Proxy]:
        """搜索代理"""
        results = []
        for proxy in self.pool.proxies.values():
            match = True
            if 'type' in filters and proxy.proxy_type != filters['type']:
                match = False
            if 'location' in filters and proxy.location != filters['location']:
                match = False
            if 'min_success_rate' in filters and proxy.success_rate < filters['min_success_rate']:
                match = False
            if 'max_response_time' in filters and proxy.response_time > filters['max_response_time']:
                match = False
            if match:
                results.append(proxy)
        return results

# 使用示例
manager = ProxyPoolManagerV2()
# 从字符串添加
manager.add_proxy_from_string("http://user:pass@proxy.com:8080", location="US")
# 从文件批量添加
manager.add_proxies_from_file("proxies.txt")
# 搜索代理
us_proxies = manager.search_proxies(location="US", min_success_rate=0.8)
```
6.10.3 步骤3:实现定时健康检查机制
定时健康检查:
```python
import asyncio

class ScheduledHealthChecker:
    """定时健康检查器"""
    def __init__(
        self,
        proxy_pool_manager: ProxyPoolManager,
        interval: float = 300.0,
        batch_size: int = 10,
    ):
        self.proxy_pool_manager = proxy_pool_manager
        self.interval = interval
        self.batch_size = batch_size
        self.health_checker = ComprehensiveHealthChecker()
        self.running = False
        self.task = None

    async def health_check_loop(self):
        """健康检查循环"""
        while self.running:
            try:
                # 获取需要检查的代理
                active_proxies = self.proxy_pool_manager.pool.get_active_proxies()
                # 分批检查
                for i in range(0, len(active_proxies), self.batch_size):
                    batch = active_proxies[i:i+self.batch_size]
                    results = await self.health_checker.check_proxies_batch(batch)
                    # 更新代理状态
                    for proxy, result in zip(batch, results):
                        if result['success']:
                            self.proxy_pool_manager.mark_success(
                                proxy,
                                result.get('avg_response_time', 0.0)
                            )
                        else:
                            self.proxy_pool_manager.mark_failure(proxy)
                await asyncio.sleep(self.interval)
            except Exception as e:
                print(f"Health check error: {e}")
                await asyncio.sleep(60)

    def start(self):
        """启动健康检查(内部使用asyncio.create_task,需在事件循环中调用)"""
        self.running = True
        self.task = asyncio.create_task(self.health_check_loop())

    def stop(self):
        """停止健康检查"""
        self.running = False
        if self.task:
            self.task.cancel()

# 使用示例(在async环境中启动)
# health_checker = ScheduledHealthChecker(proxy_pool_manager, interval=300.0)
# health_checker.start()
```
6.10.4 步骤4:实现负载均衡算法(多种算法对比)
算法对比测试:
```python
import time

def test_load_balancing_algorithms(proxy_pool_manager: ProxyPoolManager):
    """测试负载均衡算法"""
    algorithms = {
        'round_robin': RoundRobinBalancer(proxy_pool_manager.pool),
        'random': RandomBalancer(proxy_pool_manager.pool),
        'weighted': WeightedBalancer(proxy_pool_manager.pool),
        'least_connections': LeastConnectionsBalancer(proxy_pool_manager.pool),
    }
    iterations = 1000
    results = {}
    for algo_name, balancer in algorithms.items():
        selection_distribution = {}
        start_time = time.time()
        for _ in range(iterations):
            proxy = balancer.get_next_proxy()
            if proxy:
                proxy_id = f"{proxy.host}:{proxy.port}"
                selection_distribution[proxy_id] = selection_distribution.get(proxy_id, 0) + 1
        elapsed = time.time() - start_time
        # 计算分布均匀度
        counts = list(selection_distribution.values())
        if counts:
            mean = sum(counts) / len(counts)
            variance = sum((x - mean) ** 2 for x in counts) / len(counts)
            std_dev = variance ** 0.5
            cv = std_dev / mean if mean > 0 else 0  # 变异系数
        else:
            cv = 0
        results[algo_name] = {
            'time': elapsed * 1000,
            'distribution': selection_distribution,
            'coefficient_of_variation': cv,
        }
    return results

# 使用示例
# results = test_load_balancing_algorithms(proxy_pool_manager)
# for algo, result in results.items():
#     print(f"{algo}: {result['time']:.2f}ms, CV: {result['coefficient_of_variation']:.3f}")
```
6.10.5 Step 5: Implement the Proxy Rotation Middleware
Integrating with the HTTP client:
python
class ProxyAwareHTTPClient:
    """HTTP client that routes every request through the proxy pool."""
    def __init__(self, proxy_pool_manager: ProxyPoolManager):
        self.proxy_pool_manager = proxy_pool_manager
        self.middleware = ProxyRotateMiddleware(proxy_pool_manager)

    async def get(self, url: str, **kwargs) -> aiohttp.ClientResponse:
        """Issue a GET request through the rotation middleware."""
        return await self.middleware.request_with_proxy(url, "GET", **kwargs)

    async def post(self, url: str, **kwargs) -> aiohttp.ClientResponse:
        """Issue a POST request through the rotation middleware."""
        return await self.middleware.request_with_proxy(url, "POST", **kwargs)

    async def request(self, method: str, url: str, **kwargs) -> aiohttp.ClientResponse:
        """Generic request method."""
        return await self.middleware.request_with_proxy(url, method, **kwargs)

# Usage example
async def main():
    client = ProxyAwareHTTPClient(proxy_pool_manager)
    try:
        async with await client.get("https://httpbin.org/ip") as resp:
            print(await resp.text())
    except Exception as e:
        print(f"Request failed: {e}")

# asyncio.run(main())
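The `ProxyRotateMiddleware` used here was implemented back in section 6.9.4. For readers working through this step in isolation, the sketch below compresses the rotation-and-retry idea into a minimal stand-in; the class name and retry policy are illustrative, not the 6.9.4 implementation:
python
# Illustrative sketch only -- the real implementation lives in section 6.9.4.
class MinimalRotateMiddleware:
    def __init__(self, proxy_pool_manager, max_retries: int = 3):
        self.manager = proxy_pool_manager
        self.max_retries = max_retries
        self._session = None  # created lazily so a running event loop exists

    async def request_with_proxy(self, url: str, method: str, **kwargs):
        if self._session is None:
            self._session = aiohttp.ClientSession()
        last_error = None
        for _ in range(self.max_retries):
            # Rotate: pick a fresh proxy on every attempt
            proxy = self.manager.get_next_proxy()
            proxy_url = proxy.url if proxy else None  # None = direct connection
            try:
                # aiohttp's proxy= only speaks HTTP proxies; SOCKS needs
                # aiohttp-socks (see section 6.11.3)
                resp = await self._session.request(method, url, proxy=proxy_url, **kwargs)
                if proxy:
                    self.manager.mark_success(proxy, 0.0)
                return resp
            except Exception as e:
                last_error = e
                if proxy:
                    self.manager.mark_failure(proxy)
        raise last_error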
6.10.6 Step 6: Integrate into the Async Crawler Framework
Crawler framework integration:
python
import asyncio
from typing import List

class AsyncCrawlerWithProxy:
    """Async crawler that fetches every URL through the proxy pool."""
    def __init__(
        self,
        proxy_pool_manager: ProxyPoolManager,
        max_concurrent: int = 10,
    ):
        self.proxy_pool_manager = proxy_pool_manager
        self.client = ProxyAwareHTTPClient(proxy_pool_manager)
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def crawl_url(self, url: str) -> dict:
        """Crawl a single URL."""
        async with self.semaphore:
            try:
                async with await self.client.get(url) as resp:
                    text = await resp.text()
                    return {
                        'url': url,
                        'status': resp.status,
                        'content': text,
                        'success': True,
                    }
            except Exception as e:
                return {
                    'url': url,
                    'status': None,
                    'content': None,
                    'success': False,
                    'error': str(e),
                }

    async def crawl_urls(self, urls: List[str]) -> List[dict]:
        """Crawl URLs in bulk."""
        tasks = [self.crawl_url(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        processed_results = []
        for url, result in zip(urls, results):
            if isinstance(result, Exception):
                # crawl_url already catches errors, so this branch only fires
                # on unexpected failures such as task cancellation
                processed_results.append({
                    'url': url,
                    'success': False,
                    'error': str(result),
                })
            else:
                processed_results.append(result)
        return processed_results

# Usage example
async def main():
    crawler = AsyncCrawlerWithProxy(proxy_pool_manager, max_concurrent=10)
    urls = [
        "https://httpbin.org/ip",
        "https://httpbin.org/get",
        # ... more URLs
    ]
    results = await crawler.crawl_urls(urls)
    success_count = sum(1 for r in results if r.get('success'))
    print(f"Success: {success_count}/{len(results)}")

# asyncio.run(main())
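One caveat: `crawl_url` blocks for as long as the underlying client allows, so a single slow proxy can stall a batch. A minimal sketch that bounds each fetch with a timeout; `crawl_with_timeout` is our assumed helper, not part of the framework above:
python
# A minimal sketch (assumed helper): bound each crawl with asyncio.wait_for
# so a dead proxy can't stall the whole batch.
async def crawl_with_timeout(crawler: "AsyncCrawlerWithProxy",
                             url: str, timeout: float = 30.0) -> dict:
    try:
        return await asyncio.wait_for(crawler.crawl_url(url), timeout=timeout)
    except asyncio.TimeoutError:
        return {'url': url, 'status': None, 'content': None,
                'success': False, 'error': f'timed out after {timeout}s'}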
6.10.7 Step 7: Complete Working Code
The complete proxy pool system:
python
import asyncio
import aiohttp
import logging
from typing import List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CompleteProxyPoolSystem:
    """The complete proxy pool system."""
    def __init__(
        self,
        proxy_sources: List[str] = None,
        health_check_interval: float = 300.0,
        balancer_type: str = "weighted",
    ):
        # Initialize components
        self.proxy_pool_manager = ProxyPoolManagerV2(
            balancer_type=balancer_type,
            health_check_interval=health_check_interval,
        )
        self.health_checker = ScheduledHealthChecker(
            self.proxy_pool_manager,
            interval=health_check_interval,
        )
        self.crawler = AsyncCrawlerWithProxy(
            self.proxy_pool_manager,
            max_concurrent=10,
        )
        # Load proxies
        if proxy_sources:
            for source in proxy_sources:
                if source.startswith('http'):
                    # Load from URL; create_task needs a running event loop,
                    # so construct the system inside a coroutine (as in main below)
                    asyncio.create_task(self.load_proxies_from_url(source))
                else:
                    # Load from file
                    self.proxy_pool_manager.add_proxies_from_file(source)

    async def load_proxies_from_url(self, url: str):
        """Load a proxy list from a URL."""
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    if resp.status == 200:
                        text = await resp.text()
                        for line in text.split('\n'):
                            line = line.strip()
                            if line:
                                self.proxy_pool_manager.add_proxy_from_string(line)
        except Exception as e:
            logger.error(f"Failed to load proxies from URL: {e}")

    def start(self):
        """Start the system."""
        logger.info("Starting proxy pool system...")
        self.proxy_pool_manager.start()
        self.health_checker.start()
        logger.info("Proxy pool system started")

    def stop(self):
        """Stop the system."""
        logger.info("Stopping proxy pool system...")
        self.proxy_pool_manager.stop()
        self.health_checker.stop()
        logger.info("Proxy pool system stopped")

    async def crawl(self, urls: List[str]) -> List[dict]:
        """Crawl URLs through the proxy pool."""
        return await self.crawler.crawl_urls(urls)

    def get_stats(self) -> dict:
        """Get system statistics."""
        return self.proxy_pool_manager.get_stats()

# Usage example
async def main():
    # Create the system
    system = CompleteProxyPoolSystem(
        proxy_sources=["proxies.txt"],
        health_check_interval=300.0,
        balancer_type="weighted",
    )
    # Start the system
    system.start()
    # Wait for proxies to load and the first health check to run
    await asyncio.sleep(10)
    # Crawl URLs
    urls = [
        "https://httpbin.org/ip",
        "https://httpbin.org/get",
    ]
    results = await system.crawl(urls)
    # Inspect statistics
    stats = system.get_stats()
    logger.info(f"System stats: {stats}")
    # Stop the system
    system.stop()

if __name__ == "__main__":
    asyncio.run(main())
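Note that if `crawl()` raises, `main()` above never reaches `system.stop()`. A minimal sketch of a lifecycle wrapper that guarantees shutdown; `running_system` is our assumed helper, not part of the class above:
python
# A minimal sketch (assumed wrapper): tie start/stop to an async with block
# so stop() runs even when crawling raises.
from contextlib import asynccontextmanager

@asynccontextmanager
async def running_system(system: "CompleteProxyPoolSystem", warmup: float = 10.0):
    system.start()
    try:
        await asyncio.sleep(warmup)  # let proxies load and health checks run
        yield system
    finally:
        system.stop()

# async with running_system(system) as s:
#     results = await s.crawl(urls)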
6.11 Common Pitfalls and Troubleshooting
6.11.1 DNS Cache TTLs That Are Too Long Hide IP Changes
Problem:
python
# Wrong: TTL set far too long
cache = DNSCache(default_ttl=86400)  # 24 hours (too long!)
# If the server's IP changes, the crawler won't notice for up to 24 hours
Solution:
python
import dns.resolver

# Right: a reasonable TTL
cache = DNSCache(default_ttl=300)  # 5 minutes

# Or set the TTL dynamically from the DNS record itself
def get_dns_ttl(hostname: str) -> int:
    """Read the TTL from the host's DNS A record."""
    try:
        answers = dns.resolver.resolve(hostname, 'A')
        return answers.rrset.ttl
    except Exception:
        return 300  # fall back to 5 minutes

# Use the dynamic TTL
ttl = get_dns_ttl("www.example.com")
cache.set("www.example.com", ips, ttl=ttl)
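Authoritative TTLs can be extreme in both directions (some CDNs publish 30-second TTLs, some zones publish multi-day ones), so it is worth clamping whatever `get_dns_ttl` returns. A minimal sketch; the bounds are our suggestion, not fixed rules:
python
# A minimal sketch: clamp upstream TTLs to a sane range (bounds are our choice).
def clamp_ttl(ttl: int, floor: int = 60, ceiling: int = 3600) -> int:
    return max(floor, min(ttl, ceiling))

cache.set("www.example.com", ips, ttl=clamp_ttl(get_dns_ttl("www.example.com")))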
6.11.2 Overly Frequent Health Checks Get You Banned by the Proxy Provider
Problem:
python
# Wrong: checking far too often
health_checker = ScheduledHealthChecker(
    proxy_pool_manager,
    interval=10.0,  # every 10 seconds (too frequent!)
)
# The proxy provider may flag this as abnormal traffic and ban your account
Solution:
python
# Right: a reasonable check frequency
health_checker = ScheduledHealthChecker(
    proxy_pool_manager,
    interval=300.0,  # every 5 minutes
    batch_size=5,    # only 5 proxies per batch
)

# Or use adaptive frequency control
smart_checker = SmartHealthChecker(
    base_interval=300.0,
    min_interval=60.0,
    max_interval=3600.0,
)
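The idea behind `SmartHealthChecker` is to stretch the interval while the pool looks healthy and shrink it when failures spike. A minimal sketch of that adjustment rule, assuming a failure-rate signal is available; the thresholds and multipliers are illustrative:
python
# Illustrative sketch of adaptive interval control (thresholds are our choice).
class AdaptiveInterval:
    def __init__(self, base: float = 300.0, lo: float = 60.0, hi: float = 3600.0):
        self.interval, self.lo, self.hi = base, lo, hi

    def update(self, failure_rate: float) -> float:
        if failure_rate > 0.2:      # pool degrading: check sooner
            self.interval = max(self.lo, self.interval * 0.5)
        elif failure_rate < 0.05:   # pool healthy: back off
            self.interval = min(self.hi, self.interval * 1.5)
        return self.interval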
6.11.3 SOCKS5 Proxies Require Special Handling (UDP Included)
Problem:
python
# Wrong: passing a SOCKS5 URL to a plain HTTP client.
# aiohttp's built-in proxy= parameter only speaks the HTTP proxy protocol,
# so "socks5://..." URLs fail outright; and SOCKS5's UDP relay (UDP ASSOCIATE)
# is a separate command that HTTP-oriented clients do not implement at all.
Solution:
python
# Right: use a dedicated SOCKS connector
from aiohttp_socks import ProxyConnector

proxy_url = "socks5://proxy.example.com:1080"  # credentials embed in the URL: socks5://user:pass@host:1080
connector = ProxyConnector.from_url(proxy_url)
async with aiohttp.ClientSession(connector=connector) as session:
    # TCP traffic now flows through the SOCKS5 proxy correctly
    async with session.get("https://httpbin.org/ip") as resp:
        print(await resp.text())
6.11.4 Proxy Pool Exhaustion Causes Request Failures
Problem:
python
# Wrong: never checking whether a proxy was actually returned
proxy = proxy_pool_manager.get_next_proxy()
# If every proxy is down, proxy is None and the request blows up
Solution:
python
# Right: check availability and degrade gracefully
proxy = proxy_pool_manager.get_next_proxy()
if not proxy:
    # Fallback: go direct, without a proxy
    logger.warning("No available proxy, using direct connection")
    # Or wait for proxies to recover
    await asyncio.sleep(10)
    proxy = proxy_pool_manager.get_next_proxy()
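Scattering ad-hoc sleeps through request code gets messy quickly. A minimal sketch that centralizes the wait-and-retry; `acquire_proxy` is our assumed helper:
python
# A minimal sketch (assumed helper): wait for a proxy with bounded retries.
async def acquire_proxy(manager, retries: int = 5, delay: float = 10.0):
    for _ in range(retries):
        proxy = manager.get_next_proxy()
        if proxy:
            return proxy
        await asyncio.sleep(delay)  # give health checks time to revive proxies
    return None  # caller decides: direct connection or abort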
6.11.5 Choosing the Wrong Load Balancing Algorithm Degrades Performance
Problem:
python
# Wrong: round robin over proxies of widely varying quality.
# Bad proxies get picked just as often as good ones, dragging down throughput.
balancer = RoundRobinBalancer(proxy_pool)
Solution:
python
# Right: pick the algorithm for the scenario
# - quality varies widely: weighted
# - quality is similar: round robin or random
# - long-lived connections: least connections
if proxy_quality_varies:
    balancer = WeightedBalancer(proxy_pool)
else:
    balancer = RoundRobinBalancer(proxy_pool)
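`proxy_quality_varies` above is left undefined. One hedged way to derive it is the coefficient-of-variation idea from section 6.10.4, applied to success rates; the 0.1 threshold is our guess, tune it for your pool:
python
# A minimal sketch: decide "quality varies" from the spread of success rates.
def quality_varies(proxies, threshold: float = 0.1) -> bool:
    rates = [p.success_rate for p in proxies]
    if not rates:
        return False
    mean = sum(rates) / len(rates)
    if mean == 0:
        return True
    variance = sum((r - mean) ** 2 for r in rates) / len(rates)
    return (variance ** 0.5) / mean > threshold  # coefficient of variation

proxy_quality_varies = quality_varies(list(proxy_pool.proxies.values()))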
6.12 Summary
This chapter walked through the complete implementation of DNS resolution optimization and proxy pool architecture. After working through it, you should be able to:

Core concepts reviewed
- DNS resolution:
  - The full DNS query flow (recursive vs. iterative)
  - DNS caching and TTL management
  - Implementing and using DoH/DoT
- Proxy pool architecture:
  - Data structure design for the pool
  - Choosing and using the right proxy types
  - Designing the health check mechanism
- Load balancing algorithms:
  - Round robin, random, weighted, least connections
  - Which scenarios each algorithm fits
  - Comparing algorithm performance and choosing between them
- Practical skills:
  - Building a complete proxy pool system
  - Integrating it into a crawler framework
  - Monitoring and statistics

Best practice recommendations
- DNS optimization:
  - Use reasonable TTLs (5-10 minutes)
  - Implement multi-level caching
  - Use DoH for better security
- Proxy pool management:
  - Run periodic health checks (every 5-10 minutes)
  - Use adaptive frequency control
  - Pick the load balancing algorithm for the scenario
- Performance:
  - Run health checks asynchronously
  - Process proxies in batches
  - Implement proxy warm-up
- Monitoring and operations:
  - Build a monitoring dashboard
  - Keep detailed logs
  - Set up alerting

Where to go next
- Deeper study:
  - DNS protocol internals
  - Proxy protocol implementations
  - Distributed proxy pools
- Hands-on projects:
  - Build a large-scale proxy pool
  - Automate proxy acquisition
  - Develop a proxy quality scoring system

With this chapter behind you, you have the core techniques of DNS resolution optimization and proxy pool architecture, and can build high-performance, highly available crawler systems.
End of chapter