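Residential proxy IPs tend to run into two problems in practice: some nodes stop working, and requests get identified as automated. This article builds a proxy pool with a health-check mechanism and integrates curl_cffi for TLS fingerprint impersonation to make collection more stable.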
1. Proxy Pool Health Checks
1.1 Checking a single proxy
python
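import time

import requests


def check_proxy(proxy, test_url="https://httpbin.org/ip", timeout=5):
    """Check whether a proxy is usable; return its latency in seconds, or None."""
    proxies = {"http": proxy, "https": proxy}
    try:
        start = time.perf_counter()
        r = requests.get(test_url, proxies=proxies, timeout=timeout)
        if r.status_code == 200:
            return time.perf_counter() - start
    except requests.RequestException:
        # Timeouts, connection errors and proxy errors all mean "unusable".
        pass
    return None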
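A single request is a noisy signal, since one slow response or one lucky success can misclassify a proxy. The sketch below probes a proxy several times and summarises the result. The helper name `probe_stats` and its `attempts` parameter are hypothetical, not part of the original code; `check` is any callable with the same contract as `check_proxy` (latency in seconds, or None on failure).

```python
import statistics
from typing import Callable, Optional


def probe_stats(check: Callable[[str], Optional[float]],
                proxy: str,
                attempts: int = 3) -> dict:
    """Call `check(proxy)` several times; report success rate and median latency."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    latencies = []
    for _ in range(attempts):
        latency = check(proxy)
        if latency is not None:  # None means this attempt failed
            latencies.append(latency)
    return {
        "success_rate": len(latencies) / attempts,
        # The median is less sensitive to one slow outlier than the mean.
        "median_latency": statistics.median(latencies) if latencies else None,
    }
```

With the function above, a call such as `probe_stats(check_proxy, "http://user:pass@host:port")` would issue three real requests, so the number of attempts trades accuracy against check time.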
1.2 Concurrent checks and the rotation pool
python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


class ProxyPool:
    def __init__(self, proxy_list, max_workers=10):
        self.proxy_list = proxy_list
        self.available = []  # (proxy, latency) pairs, fastest first
        self._update_available(max_workers)

    def _update_available(self, workers):
        """Check every proxy concurrently and keep the ones that respond."""
        results = {}
        with ThreadPoolExecutor(max_workers=workers) as ex:
            futures = {ex.submit(check_proxy, p): p for p in self.proxy_list}
            for f in as_completed(futures):
                lat = f.result()
                if lat is not None:
                    results[futures[f]] = lat
        self.available = sorted(results.items(), key=lambda x: x[1])

    def get(self):
        """Return a random healthy proxy URL, or None if the pool is empty."""
        if not self.available:
            return None
        return random.choice(self.available)[0]
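The pool above checks its proxies once, at construction, and picks uniformly at random even though it has latency data. As one possible extension, the sketch below adds a re-runnable `refresh()`, eviction after repeated failures, and latency-weighted selection. The class name `RefreshingProxyPool`, the `checker` and `max_failures` parameters, and `report_failure` are all hypothetical. Checks run sequentially here to keep the example short.

```python
import random
import threading
from typing import Callable, Dict, Iterable, Optional


class RefreshingProxyPool:
    """Hypothetical ProxyPool variant: re-checkable, evicts failing proxies."""

    def __init__(self, proxy_list: Iterable[str],
                 checker: Callable[[str], Optional[float]],
                 max_failures: int = 2):
        self._all = list(proxy_list)
        self._checker = checker          # e.g. check_proxy
        self._max_failures = max_failures
        self._lock = threading.Lock()
        self._latency: Dict[str, float] = {}
        self._failures: Dict[str, int] = {}
        self.refresh()

    def refresh(self) -> None:
        """Re-check every known proxy and rebuild the healthy set."""
        fresh = {}
        for proxy in self._all:
            latency = self._checker(proxy)
            if latency is not None:
                fresh[proxy] = latency
        with self._lock:
            self._latency = fresh
            self._failures = {}          # a refresh resets failure counters

    def get(self) -> Optional[str]:
        """Pick a healthy proxy, favouring lower latency; None if pool is empty."""
        with self._lock:
            if not self._latency:
                return None
            proxies = list(self._latency)
            # Inverse-latency weights; the floor avoids division by zero.
            weights = [1.0 / max(self._latency[p], 1e-3) for p in proxies]
        return random.choices(proxies, weights=weights, k=1)[0]

    def report_failure(self, proxy: str) -> None:
        """Record a failed request; evict the proxy once it fails too often."""
        with self._lock:
            self._failures[proxy] = self._failures.get(proxy, 0) + 1
            if self._failures[proxy] >= self._max_failures:
                self._latency.pop(proxy, None)
```

A caller would invoke `report_failure(proxy)` from its exception handler and call `refresh()` on a timer or when `get()` returns None. The right refresh interval is an assumption that depends on how quickly the provider rotates its IPs.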
2. TLS Fingerprint Impersonation
Use curl_cffi in place of requests to mimic the TLS fingerprint of a real browser.
python
from curl_cffi import requests as curl_requests


def fetch(url, proxy=None):
    # Impersonate Chrome 120's TLS handshake
    kwargs = {"impersonate": "chrome120"}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    resp = curl_requests.get(url, **kwargs)
    return resp.text
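Comparison test:
python
import requests
from curl_cffi import requests as curl_requests

# Plain requests
r = requests.get("https://tls.browserleaks.com/json")
print(r.json().get("ja3_hash"))  # fixed hash, characteristic of the Python TLS stack

# curl_cffi impersonation
r2 = curl_requests.get("https://tls.browserleaks.com/json", impersonate="chrome120")
print(r2.json().get("ja3_hash"))  # intended to match a real Chrome handshake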
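Comparing the two responses by eye gets tedious once more fields are involved. As a small sketch, the helper below reports which fingerprint fields differ between two JSON payloads. The function name `diff_fingerprints` is hypothetical. Only `ja3_hash` comes from the test above, and any other key passed in is an assumption about the shape of the response.

```python
from typing import Iterable, Mapping


def diff_fingerprints(baseline: Mapping, candidate: Mapping,
                      keys: Iterable[str] = ("ja3_hash",)) -> dict:
    """Return {key: (baseline_value, candidate_value)} for each differing key."""
    diffs = {}
    for key in keys:
        a, b = baseline.get(key), candidate.get(key)
        if a != b:  # a missing key shows up as None on that side
            diffs[key] = (a, b)
    return diffs
```

Called as `diff_fingerprints(r.json(), r2.json())`, a non-empty result indicates that the impersonated handshake is distinguishable from the plain one on the listed fields.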
3. Complete Collector Example
python
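import random
import time


def worker(url, proxy_pool, max_retry=3):
    """Fetch url through the pool, retrying with another proxy on failure."""
    for _ in range(max_retry):
        proxy = proxy_pool.get()
        if not proxy:
            time.sleep(2)  # no healthy proxy available; wait before retrying
            continue
        try:
            return fetch(url, proxy=proxy)
        except Exception as e:
            print(f"failed: {proxy} -> {e}")
            time.sleep(random.uniform(1, 3))  # randomized back-off
    return None


# Example: using a residential proxy pool (e.g. a proxy list supplied by 辣椒HTTP)
proxies = [
    "http://user:pass@host:port",  # placeholder; replace with a real proxy URL
    # add more as needed
]
pool = ProxyPool(proxies)
data = worker("https://example.com/api", pool)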
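The example above fetches a single URL. For a batch, one option is to fan the same worker out over a thread pool, as sketched below. The function name `collect` and its parameters are hypothetical; `fetch_one` is any callable that takes a URL and returns the page text or None.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def collect(urls: Iterable[str],
            fetch_one: Callable[[str], Optional[str]],
            max_workers: int = 5) -> Dict[str, Optional[str]]:
    """Fetch many URLs concurrently; map each URL to its result (None on error)."""
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = {ex.submit(fetch_one, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                # One failing URL should not abort the whole batch.
                print(f"failed: {url} -> {exc}")
                results[url] = None
    return results
```

With the pieces defined earlier, this could be called as `collect(urls, lambda u: worker(u, pool))`. Keeping `max_workers` small is a deliberate choice here, since more parallelism also means a higher request rate against the target.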
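4. Results
- The health check filters out dead IPs, so only responsive proxies stay in the pool
- curl_cffi's fingerprint impersonation makes the TLS handshake look like real Chrome's, lowering the chance of being identified
- Combined, the two help collection jobs run stably for longer
The approach described here works with any standard proxy provider and does not involve specific commercial recommendations. In practice, refresh the proxy list regularly and adjust the request rate to suit the target site.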
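One way to adjust the request rate per target site is a small per-host throttle, sketched below. The class name `HostThrottle` and its `min_interval` and `jitter` parameters are hypothetical, and the default values are placeholders rather than recommendations. The clock and sleep functions are injectable so the behaviour can be checked without real waiting. The sketch is not thread-safe.

```python
import random
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum (jittered) gap between requests to the same host."""

    def __init__(self, min_interval: float = 1.0, jitter: float = 0.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.min_interval = min_interval
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Dict[str, float] = {}  # host -> earliest next request

    def wait(self, url: str) -> float:
        """Block until the URL's host may be hit again; return seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, 0.0) - now)
        if delay > 0:
            self._sleep(delay)
        # Random jitter keeps the request pattern from being perfectly regular.
        gap = self.min_interval + random.uniform(0, self.jitter)
        self._next_allowed[host] = now + delay + gap
        return delay
```

Calling `throttle.wait(url)` immediately before each fetch would space out requests per host while leaving different hosts independent of each other.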