IP Masquerading, Proxy Pools, and Distributed Crawlers

I. Dynamic Proxy IPs: Obtaining, Selecting, and Using a Proxy Pool

The core of the proxy-pool technique is to rotate IP addresses dynamically so that the crawler looks like many different users visiting the site, which helps it avoid bans.
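A minimal sketch of the idea (the proxy addresses below are placeholders, not working proxies): each request goes out through a randomly chosen proxy from the pool, so successive requests appear to come from different IPs.

python
import random
import requests

# Placeholder proxies; in practice these come from your proxy pool.
proxy_pool = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch(url):
    proxy = random.choice(proxy_pool)  # a (potentially) different IP on every call
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

# print(fetch("https://httpbin.org/ip").json())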

(1) Where to Get Proxies

1. Free proxies: low cost, high risk

Free proxies can be gathered from public listing sites (such as Xici Proxy or Kuaidaili) or pulled from API endpoints. They cost nothing, but come with real drawbacks:

  • Short lifespan: on average only 4-6 hours

  • Low availability: typically under 15%

  • Security risk: your traffic may be intercepted

Python example (scraping a free proxy list; validation is covered in the next subsection)

python
import requests
from bs4 import BeautifulSoup

def get_free_proxies():
    url = "https://www.example-proxy-list.net/"  # replace with a working proxy-list URL
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            # Assume each row of the table body holds the IP and port in its first two <td> cells
            proxy_rows = soup.select('tbody tr')
            proxies = []
            for row in proxy_rows:
                cells = row.select('td')
                if len(cells) < 2:
                    continue
                ip = cells[0].text.strip()
                port = cells[1].text.strip()
                proxies.append(f"{ip}:{port}")
            return proxies
        else:
            print(f"Failed to fetch free proxies, status code: {response.status_code}")
            return []
    except Exception as e:
        print(f"Error while fetching free proxies: {e}")
        return []

Note: the example URL above is a placeholder and will not resolve; replace it with the URL of a working proxy-list site.

2. Paid proxies: stability and customization

Paid proxy providers (such as Shenlong HTTP or ipipgo) offer high-anonymity residential IPs, let you switch geographic location on demand, and respond quickly (within 0.3 seconds). Their key advantages:

  • High stability: request success rates are about 78% higher than with free proxies

  • Customization: you can specify city-level IPs, rotation frequency, and more

Example: using a paid proxy

python
import requests

def use_paid_proxy(url, proxy_ip, proxy_port, proxy_user, proxy_pass):
    proxies = {
        "http": f"http://{proxy_user}:{proxy_pass}@{proxy_ip}:{proxy_port}",
        "https": f"http://{proxy_user}:{proxy_pass}@{proxy_ip}:{proxy_port}"
    }
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        return response.text
    except Exception as e:
        print(f"Request via the paid proxy failed: {e}")
        return None

(2) Criteria for Selecting Proxies

1. Anonymity check: make sure your real IP is not exposed

Request httpbin.org/ip through the proxy and confirm that it does not reveal your real IP:

python
def check_anonymity(proxy):
    try:
        # Fetch our real public IP without a proxy, then compare it with what the proxy reports
        real_ip = requests.get("https://httpbin.org/ip", timeout=10).json()["origin"]
        response = requests.get("https://httpbin.org/ip", proxies={"http": proxy, "https": proxy}, timeout=10)
        return response.json()["origin"] != real_ip
    except Exception:
        return False
2. Stability monitoring: keep fluctuation within 15%
python
import time

def monitor_stability(proxy, test_url, duration=600):  # monitor for 10 minutes by default
    response_times = []
    end_time = time.time() + duration
    while time.time() < end_time:
        try:
            start = time.time()
            requests.get(test_url, proxies={"http": proxy, "https": proxy}, timeout=10)
            response_times.append(time.time() - start)
        except Exception:
            response_times.append(None)
        time.sleep(30)  # probe every 30 seconds
    
    valid_times = [t for t in response_times if t is not None]
    if len(valid_times) < len(response_times) * 0.8:
        return False  # unstable if more than 20% of probes failed
    
    avg_time = sum(valid_times) / len(valid_times)
    variance = sum((t - avg_time) ** 2 for t in valid_times) / len(valid_times)
    cv = variance ** 0.5 / avg_time  # coefficient of variation of the response time
    return cv < 0.15  # fluctuation below 15%
3. Protocol compatibility check
python
def check_protocol_compatibility(proxy):
    http_test_url = "http://httpbin.org/get"
    https_test_url = "https://httpbin.org/get"
    
    try:
        requests.get(http_test_url, proxies={"http": proxy}, timeout=10)
        http_supported = True
    except:
        http_supported = False
    
    try:
        requests.get(https_test_url, proxies={"https": proxy}, timeout=10)
        https_supported = True
    except:
        https_supported = False
    
    return http_supported and https_supported

(3) Dynamic IP Rotation Strategies

1. Random selection
python
import random

class SimpleProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.blacklist = set()
    
    def get_proxy(self):
        available_proxies = [p for p in self.proxies if p not in self.blacklist]
        if not available_proxies:
            self.blacklist.clear()  # clear the blacklist and retry all proxies
            available_proxies = self.proxies.copy()
        return random.choice(available_proxies)
    
    def block_proxy(self, proxy):
        self.blacklist.add(proxy)
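A quick usage sketch for the rotator above (the proxy addresses are placeholders): proxies that raise an error get blacklisted and are skipped on later calls.

python
import requests

rotator = SimpleProxyRotator(["203.0.113.10:8080", "203.0.113.11:8080"])

for url in ["https://httpbin.org/ip", "https://httpbin.org/get"]:
    proxy = rotator.get_proxy()
    try:
        resp = requests.get(url, proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"}, timeout=10)
        print(url, resp.status_code)
    except Exception:
        rotator.block_proxy(proxy)  # skip this proxy next time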
2. Round-robin selection
python
class RoundRobinProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.current_index = 0
        self.blacklist = set()
    
    def get_proxy(self):
        available_proxies = [p for p in self.proxies if p not in self.blacklist]
        if not available_proxies:
            self.blacklist.clear()
            available_proxies = self.proxies.copy()
        
        proxy = available_proxies[self.current_index % len(available_proxies)]
        self.current_index += 1
        return proxy
    
    def block_proxy(self, proxy):
        self.blacklist.add(proxy)
3. Weighted selection
python
class WeightedProxyRotator:
    def __init__(self, proxies_with_weights):
        # proxies_with_weights has the form [{'proxy': 'ip:port', 'weight': 10}, ...]
        self.proxies = proxies_with_weights
        self.total_weight = sum(p['weight'] for p in proxies_with_weights)
        self.blacklist = set()
    
    def get_proxy(self):
        available_proxies = [p for p in self.proxies if p['proxy'] not in self.blacklist]
        if not available_proxies:
            self.blacklist.clear()
            available_proxies = self.proxies.copy()
        
        # Build the cumulative weight table
        cumulative_weights = []
        current_sum = 0
        for proxy in available_proxies:
            current_sum += proxy['weight']
            cumulative_weights.append(current_sum)
        
        # Pick a proxy at random, proportionally to its weight
        rand = random.uniform(0, cumulative_weights[-1])
        selected_index = None
        for i, weight in enumerate(cumulative_weights):
            if rand <= weight:
                selected_index = i
                break
        
        return available_proxies[selected_index]['proxy']
    
    def block_proxy(self, proxy):
        self.blacklist.add(proxy)
    
    def update_weight(self, proxy, new_weight):
        for p in self.proxies:
            if p['proxy'] == proxy:
                p['weight'] = new_weight
                self.total_weight = sum(p['weight'] for p in self.proxies)
                break

Usage example

python
proxies_with_weights = [
    {'proxy': '192.168.1.1:8080', 'weight': 10},
    {'proxy': '192.168.1.2:8080', 'weight': 5},
    {'proxy': '192.168.1.3:8080', 'weight': 8}
]

rotator = WeightedProxyRotator(proxies_with_weights)

for _ in range(5):
    proxy = rotator.get_proxy()
    print(f"Using proxy: {proxy}")
    # Simulate a request
    # ...
    # If the proxy fails, blacklist it
    # rotator.block_proxy(proxy)
    # Or lower its weight instead
    # rotator.update_weight(proxy, new_weight=3)

II. Distributed Crawling in Practice: Multi-Node Cooperation with Scrapy-Redis

A distributed crawler spreads the request load across multiple nodes; combined with a proxy pool, it greatly improves resistance to bans.

(1) How Scrapy-Redis Works

1. Task queue

Redis holds the URLs waiting to be crawled; every node consumes the same shared queue, which is how tasks get distributed.
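A minimal redis-py sketch of such a shared queue (the key name task_queue and the URL are arbitrary examples):

python
import redis

r = redis.Redis(host="localhost", port=6379)

# Any node can enqueue work...
r.rpush("task_queue", "https://example.com/page/1")

# ...and any node can pull the next URL (blocks for up to 30 s if the queue is empty).
item = r.blpop("task_queue", timeout=30)
if item:
    url = item[1].decode("utf-8")
    print("Got task:", url)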

2. Deduplication

URL fingerprints are stored in a Redis set so the same page is never crawled twice:

python
import hashlib

def url_to_fingerprint(url):
    return hashlib.sha256(url.encode('utf-8')).hexdigest()

# Store the fingerprint in a Redis set (redis_conn is an existing redis.Redis connection)
redis_conn.sadd('seen_urls', url_to_fingerprint(url))
3. Dynamic scheduling

The master node hands out tasks and the worker nodes execute them in parallel, which enables load balancing:

python
import redis

class MasterScheduler:
    def __init__(self, redis_url):
        self.redis = redis.from_url(redis_url)
    
    def add_task(self, url):
        fingerprint = url_to_fingerprint(url)
        if self.redis.sismember('seen_urls', fingerprint):
            return False  # already queued before
        self.redis.rpush('task_queue', url)
        self.redis.sadd('seen_urls', fingerprint)
        return True
    
    def get_task(self):
        return self.redis.blpop('task_queue', timeout=30)

class WorkerNode:
    def __init__(self, redis_url):
        self.redis = redis.from_url(redis_url)
    
    def fetch_task(self):
        task = self.redis.blpop('task_queue', timeout=30)
        if task:
            return task[1].decode('utf-8')
        return None
    
    def mark_task_done(self, url):
        self.redis.sadd('completed_tasks', url)
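A hedged sketch of how these two classes might be driven together (the Redis address and URL are placeholders; error handling is omitted):

python
import requests

scheduler = MasterScheduler("redis://localhost:6379")
scheduler.add_task("https://example.com/page/1")

worker = WorkerNode("redis://localhost:6379")
while True:
    url = worker.fetch_task()
    if url is None:
        break  # queue drained (or the 30 s timeout expired)
    response = requests.get(url, timeout=10)
    # ...parse the response here...
    worker.mark_task_done(url)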

(2) Deployment Steps

1. Configuration

Add the following to settings.py:

python
# settings.py
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://user:pass@master_ip:6379'
SCHEDULER_PERSIST = True  # keep the task queue so crawls can resume after a restart
2. Spider example
python
from scrapy_redis.spiders import RedisSpider

class MyDistributedSpider(RedisSpider):
    name = 'my_distributed_crawler'
    redis_key = 'crawler:start_urls'  # Redis key holding the start URLs
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # available_proxies is assumed to be a pre-built list of "ip:port" strings
        self.proxy_rotator = SimpleProxyRotator(available_proxies)
    
    def parse(self, response):
        # Parsing logic
        yield {'data': response.text}
        
        # Follow the next-page link
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse, meta={'proxy': self.proxy_rotator.get_proxy()})

(3) Case Study: E-Commerce Data Collection

1. Architecture
  • Master node: manages the Redis queues

  • Worker nodes: 50+ spider instances, each bound to its own proxy pool (see the sketch below)
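One possible way to give each instance its own proxy pool is to key the pool off a node ID supplied at startup; the NODE_ID environment variable and the proxies/node-<id>.json file layout below are illustrative assumptions, not part of the original setup.

python
import json
import os

# Hypothetical layout: proxies/node-<id>.json holds the proxy list for one spider instance.
node_id = os.environ.get("NODE_ID", "1")
with open(f"proxies/node-{node_id}.json", encoding="utf-8") as f:
    node_proxies = json.load(f)  # e.g. ["203.0.113.10:8080", ...]

rotator = SimpleProxyRotator(node_proxies)  # each worker rotates only over its own proxies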

2. Anti-crawling countermeasures
  • A dynamic User-Agent pool (200+ real browser identifiers) lowers the ban rate by roughly 30%:

    python
    import random

    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
        # more User-Agent strings ...
    ]
    
    class UserAgentMiddleware:
        def process_request(self, request, spider):
            request.headers['User-Agent'] = random.choice(user_agents)
  • Randomized request intervals (0.5-5 seconds) to mimic human behavior:

    python
    import random
    import time

    class RandomDelayMiddleware:
        def __init__(self, min_delay, max_delay):
            self.min_delay = min_delay
            self.max_delay = max_delay
        
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                min_delay=crawler.settings.get('RANDOM_DELAY_MIN', 0.5),
                max_delay=crawler.settings.get('RANDOM_DELAY_MAX', 5)
            )
        
        def process_request(self, request, spider):
            delay = random.uniform(self.min_delay, self.max_delay)
            time.sleep(delay)
3. Results
  • Roughly 100,000 pages collected per day

  • Ban rate below 2%

III. End-to-End Example: Building an Efficient Crawler System

(1) System Architecture Diagram

text
                      +----------------------+
                      |  Master node (Redis) |
                      +----------+-----------+
                                 |
          +----------------------+----------------------+
          |                      |                      |
  +-------+-------+      +-------+-------+      +-------+-------+
  | Worker node 1 |      | Worker node 2 |  ... | Worker node N |
  |   (spider)    |      |   (spider)    |      |   (spider)    |
  +---------------+      +---------------+      +---------------+

(2) Core Code, Integrated

1. Proxy management module
python
import random
import time

import requests
from bs4 import BeautifulSoup

class ProxyManager:
    def __init__(self, free_proxy_urls=None, paid_proxy_urls=None, refresh_interval=30 * 60):
        self.free_proxy_urls = free_proxy_urls or ["https://www.example-proxy-list.net/"]
        self.paid_proxy_urls = paid_proxy_urls or []
        self.proxies = []
        self.blacklist = set()
        self.real_ip = None  # cached public IP, used by the anonymity check
        self.last_refresh_time = 0
        self.refresh_interval = refresh_interval  # refresh every 30 minutes by default
    
    def _fetch_free_proxies(self):
        all_proxies = []
        for url in self.free_proxy_urls:
            try:
                response = requests.get(url, timeout=10)
                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, 'html.parser')
                    proxy_rows = soup.select('tbody tr')
                    for row in proxy_rows:
                        ip = row.select_one('td').text
                        port = row.select_one('td + td').text
                        all_proxies.append(f"{ip}:{port}")
            except:
                continue
        return all_proxies
    
    def _fetch_paid_proxies(self):
        paid_proxies = []
        for proxy_info in self.paid_proxy_urls:
            # paid_proxy_urls is assumed to be a list of dicts with 'ip', 'port', 'user', and 'pass' keys
            proxy_str = f"{proxy_info['user']}:{proxy_info['pass']}@{proxy_info['ip']}:{proxy_info['port']}"
            paid_proxies.append(f"http://{proxy_str}")
        return paid_proxies
    
    def refresh_proxies(self):
        if time.time() - self.last_refresh_time < self.refresh_interval and self.proxies:
            return
        
        self.proxies = []
        self.proxies.extend(self._fetch_free_proxies())
        self.proxies.extend(self._fetch_paid_proxies())
        
        # Validate the proxies and keep only the working ones
        validated_proxies = []
        for proxy in self.proxies:
            if self._validate_proxy(proxy):
                validated_proxies.append(proxy)
        
        self.proxies = validated_proxies
        self.last_refresh_time = time.time()
    
    def _validate_proxy(self, proxy):
        try:
            # Check HTTP support
            response = requests.get("http://httpbin.org/ip", proxies={"http": proxy}, timeout=10)
            if response.status_code != 200:
                return False
            
            # Check HTTPS support
            response = requests.get("https://httpbin.org/ip", proxies={"https": proxy}, timeout=10)
            if response.status_code != 200:
                return False
            
            # Check anonymity: the origin reported through the proxy must differ from our real IP
            if self.real_ip is None:
                self.real_ip = requests.get("https://httpbin.org/ip", timeout=10).json().get("origin")
            if response.json().get("origin") == self.real_ip:
                return False
            
            return True
        except Exception:
            return False
    
    def get_proxy(self):
        if not self.proxies:
            self.refresh_proxies()
        
        available_proxies = [p for p in self.proxies if p not in self.blacklist]
        if not available_proxies:
            self.blacklist.clear()
            available_proxies = self.proxies.copy()
        
        return random.choice(available_proxies) if available_proxies else None
    
    def block_proxy(self, proxy):
        self.blacklist.add(proxy)
2. Spider module (built on Scrapy-Redis)
python
from scrapy_redis.spiders import RedisSpider
from myproject.items import MyItem
from myproject.middlewares import ProxyManager

class MyDistributedSpider(RedisSpider):
    name = 'my_distributed_crawler'
    redis_key = 'crawler:start_urls'
    
    custom_settings = {
        'REDIS_URL': 'redis://user:pass@master_ip:6379',
        'SCHEDULER_PERSIST': True,
        'DOWNLOAD_DELAY': 0,  # delay is handled by RandomDelayMiddleware
        'DOWNLOADER_MIDDLEWARES': {
            'myproject.middlewares.RandomDelayMiddleware': 543,
            'myproject.middlewares.ProxyMiddleware': 750,
            'myproject.middlewares.UserAgentMiddleware': 500,
        }
    }
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxy_manager = ProxyManager()
    
    def parse(self, response):
        # Extract data
        item = MyItem()
        item['title'] = response.css('h1::text').get()
        item['content'] = response.css('div.content::text').get()
        yield item
        
        # Follow the next-page link
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
3. Middleware module
python
import random
import time
import requests
from scrapy import signals
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.exceptions import NotConfigured

class RandomDelayMiddleware:
    def __init__(self, min_delay, max_delay):
        self.min_delay = min_delay
        self.max_delay = max_delay
    
    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        min_delay = settings.getfloat('RANDOM_DELAY_MIN', 0.5)
        max_delay = settings.getfloat('RANDOM_DELAY_MAX', 5)
        if min_delay >= max_delay:
            raise NotConfigured("RANDOM_DELAY_MIN should be less than RANDOM_DELAY_MAX")
        return cls(min_delay, max_delay)
    
    def process_request(self, request, spider):
        delay = random.uniform(self.min_delay, self.max_delay)
        time.sleep(delay)

class ProxyMiddleware:
    def __init__(self):
        self.proxy_manager = ProxyManager()
    
    def process_request(self, request, spider):
        proxy = self.proxy_manager.get_proxy()
        if proxy:
            request.meta['proxy'] = proxy
        else:
            spider.logger.warning("No valid proxies available")
    
    def process_exception(self, request, exception, spider):
        if 'proxy' in request.meta:
            self.proxy_manager.block_proxy(request.meta['proxy'])
            spider.logger.info(f"Blocked proxy: {request.meta['proxy']}")
            # Reschedule the request; dont_filter so the dupefilter does not drop the retry
            retry_request = request.copy()
            retry_request.dont_filter = True
            return retry_request

class UserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
            # more User-Agent strings ...
        ]
    
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

class RetryProxyMiddleware(RetryMiddleware):
    def __init__(self, settings):
        super().__init__(settings)
        self.proxy_manager = ProxyManager()
    
    def process_exception(self, request, exception, spider):
        if 'proxy' in request.meta:
            proxy = request.meta['proxy']
            spider.logger.info(f"Retrying via proxy {proxy}: caught exception {exception}")
            self.proxy_manager.block_proxy(proxy)
        
        retryreq = request.copy()
        retryreq.dont_filter = True
        retryreq.priority = request.priority + self.priority_adjust
        
        return retryreq

(3) Deployment and Operation

1. Prerequisites

Make sure every node has the following packages installed:

bash
pip install scrapy scrapy-redis redis pymongo requests beautifulsoup4
2. Starting the master node
bash
# Start the Redis server
redis-server

# Seed Redis with the initial tasks
python add_initial_tasks.py
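A minimal sketch of what add_initial_tasks.py could look like, assuming the spider shown earlier (its redis_key is crawler:start_urls); the Redis URL and start URLs are placeholders.

python
# add_initial_tasks.py (illustrative)
import redis

r = redis.from_url("redis://user:pass@master_ip:6379")

start_urls = [
    "https://example.com/category/1",
    "https://example.com/category/2",
]
for url in start_urls:
    # RedisSpider pops its start URLs from the list stored under redis_key
    r.lpush("crawler:start_urls", url)

print(f"Seeded {len(start_urls)} start URLs")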
3. Starting the worker nodes
bash
# Run this on every worker node
scrapy crawl my_distributed_crawler
4. Monitoring and maintenance
  • Check the health of the proxy pool regularly (see the sketch below)

  • Watch the crawler logs and analyze why bans occur

  • Adjust the request rate and User-Agent pool to match the target site's anti-crawling behavior
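A hedged sketch of a periodic proxy-pool health report that reuses the ProxyManager above (the 10-minute interval is an arbitrary choice):

python
import time

def report_proxy_pool(manager, interval=600):
    # Periodically refresh the pool and log how many proxies are still usable.
    while True:
        manager.refresh_proxies()
        total = len(manager.proxies)
        blocked = len(manager.blacklist & set(manager.proxies))
        print(f"[proxy pool] total: {total}, blacklisted: {blocked}, usable: {total - blocked}")
        time.sleep(interval)

# report_proxy_pool(ProxyManager())  # run in a separate thread or process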

IV. Common Problems and Solutions

(1) Proxies Fail Frequently

  • Cause: free proxies are unreliable, or the target site has strengthened its anti-crawling measures

  • Solutions

    1. Shorten the proxy refresh interval
    python
    proxy_manager = ProxyManager(refresh_interval=15 * 60)  # refresh every 15 minutes
    2. Increase the share of paid proxies

    3. Implement a proxy health check

    python
    def check_proxy_health(proxy):
        try:
            response = requests.get("http://httpbin.org/ip", proxies={"http": proxy}, timeout=5)
            return response.status_code == 200
        except:
            return False

(2) Duplicate Data Is Collected

  • Cause: the Redis deduplication is not taking effect, or there is a gap in the spider logic

  • Solutions

    1. Make sure URL fingerprints are computed correctly
    python
    import hashlib

    def url_to_fingerprint(url):
        return hashlib.sha256(url.encode('utf-8')).hexdigest()
    2. Add local deduplication inside the spider
    python
    import scrapy

    class MySpider(scrapy.Spider):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.seen_urls = set()
        
        def parse(self, response):
            if response.url in self.seen_urls:
                return
            self.seen_urls.add(response.url)
            # normal parsing logic...

(3) The Crawler Is Too Slow

  • Cause: slow proxies, overly long request intervals, or insufficient node resources

  • Solutions

    1. Improve the proxy selection strategy so that fast proxies are preferred
    python
    class WeightedProxyRotator:
        # Extension of the earlier rotator: prefer higher-weight (i.e. faster) proxies
        def get_fast_proxy(self):
            weights = [p['weight'] for p in self.proxies]
            total = sum(weights)
            rand = random.uniform(0, total)
            current = 0
            for proxy in self.proxies:
                current += proxy['weight']
                if current >= rand:
                    return proxy['proxy']
            return random.choice(self.proxies)['proxy']
    2. Adjust the request interval dynamically
    python
    import time

    class AdaptiveDelayMiddleware:
        def __init__(self):
            self.delay = 1.0  # initial delay in seconds
            self.min_delay = 0.5
            self.max_delay = 5.0
            self.success_count = 0
        
        def process_response(self, request, response, spider):
            if response.status == 200:
                self.success_count += 1
                if self.success_count >= 5:  # after 5 consecutive successes, shorten the delay
                    self.delay = max(self.min_delay, self.delay * 0.8)
                    self.success_count = 0
            else:
                self.success_count = 0
                self.delay = min(self.max_delay, self.delay * 1.5)
            return response
        
        def process_request(self, request, spider):
            time.sleep(self.delay)
    3. Add more worker nodes to increase concurrency (a settings sketch follows below)
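    Besides adding nodes, each node's own concurrency can also be raised through Scrapy settings; a sketch with illustrative values (tune them against the observed ban rate):

    python
    # settings.py (illustrative values)
    CONCURRENT_REQUESTS = 32            # total concurrent requests per node
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # keep per-domain pressure moderate
    REACTOR_THREADPOOL_MAXSIZE = 20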

V. Advanced Optimization Techniques

(1) Automatic Proxy Switching and Failure Recovery

python
class SmartProxyMiddleware:
    def __init__(self):
        self.proxy_manager = ProxyManager()
        self.retry_times = {}
    
    def process_request(self, request, spider):
        if 'proxy' in request.meta:
            current_proxy = request.meta['proxy']
            # is_proxy_available() is an assumed helper on ProxyManager,
            # e.g. "proxy in self.proxies and proxy not in self.blacklist"
            if self.proxy_manager.is_proxy_available(current_proxy):
                return None  # keep using the current proxy
            else:
                # Switch to a different proxy
                new_proxy = self.proxy_manager.get_proxy()
                request.meta['proxy'] = new_proxy
                spider.logger.info(f"Switched proxy from {current_proxy} to {new_proxy}")
        else:
            # Assign a proxy to the initial request
            new_proxy = self.proxy_manager.get_proxy()
            request.meta['proxy'] = new_proxy
        
        return None
    
    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            self.proxy_manager.block_proxy(proxy)
            spider.logger.warning(f"Proxy {proxy} failed, exception: {exception}")
            
            # Retry the request
            retry_request = request.copy()
            retry_request.dont_filter = True
            retry_request.priority = request.priority + 1
            
            # Update the retry count
            self.retry_times[retry_request.url] = self.retry_times.get(retry_request.url, 0) + 1
            
            # Retry at most 3 times
            if self.retry_times[retry_request.url] <= 3:
                return retry_request
            else:
                spider.logger.warning(f"Abandoning {retry_request.url} after 3 failed attempts")
        return None

(2) Behavior-Based Anti-Crawling Evasion

python
import json
import random
import time

class BehaviorAntiCrawlMiddleware:
    def __init__(self):
        self.mouse_movements = []
        self.keyboard_events = []
        self.last_action_time = time.time()
    
    def simulate_mouse_movement(self):
        # Simulate a random mouse movement
        x = random.randint(0, 1920)
        y = random.randint(0, 1080)
        self.mouse_movements.append((x, y, time.time()))
        
        # Cap the trail length so it stays plausible
        if len(self.mouse_movements) > 10:
            self.mouse_movements.pop(0)
    
    def simulate_keyboard_event(self):
        # Simulate a random keyboard event
        keys = ['a', 'b', 'c', ' ', '\n', '\t']
        key = random.choice(keys)
        self.keyboard_events.append((key, time.time()))
        
        # Keep the event history at a reasonable size
        if len(self.keyboard_events) > 20:
            self.keyboard_events.pop(0)
    
    def process_request(self, request, spider):
        # Simulate human-like behavior
        if random.random() < 0.3:  # 30% chance of a mouse movement
            self.simulate_mouse_movement()
        
        if random.random() < 0.1:  # 10% chance of a keyboard event
            self.simulate_keyboard_event()
        
        # Attach the behavior data to a request header (only useful if the site expects it)
        behavior_data = {
            "mouse_movements": self.mouse_movements[-5:],  # last 5 mouse movements
            "keyboard_events": self.keyboard_events[-10:],  # last 10 keyboard events
            "session_duration": time.time() - self.last_action_time
        }
        request.headers['X-Behavior-Data'] = json.dumps(behavior_data)
        
        self.last_action_time = time.time()
        return None

(3) Optimizing Distributed Task Scheduling

python
import time

import redis

class IntelligentScheduler:
    def __init__(self, redis_url):
        self.redis = redis.from_url(redis_url)
        self.node_heartbeats = {}
        self.task_priorities = {}
    
    def register_node(self, node_id):
        self.node_heartbeats[node_id] = time.time()
        self.redis.sadd('active_nodes', node_id)
    
    def update_node_heartbeat(self, node_id):
        self.node_heartbeats[node_id] = time.time()
    
    def distribute_task(self, task):
        # Assign the task based on node liveness and load
        active_nodes = self.redis.smembers('active_nodes')
        if not active_nodes:
            return None
        
        # Compute a weight per node from heartbeat freshness and historical completion rate
        node_weights = {}
        current_time = time.time()
        for node_id in active_nodes:
            node_id = node_id.decode('utf-8')
            heartbeat_age = current_time - self.node_heartbeats.get(node_id, 0)
            task_completion_rate = float(self.redis.get(f'node:{node_id}:completion_rate') or 0.8)
            node_weights[node_id] = max(0.1, min(1.0, (60 - heartbeat_age) / 60 * task_completion_rate))
        
        # Pick the node with the highest weight
        selected_node = max(node_weights.items(), key=lambda x: x[1])[0]
        
        # Push the task onto that node's queue
        self.redis.rpush(f'node:{selected_node}:task_queue', task)
        return selected_node
    
    def monitor_tasks(self):
        # Periodically check task status
        pending_tasks = self.redis.llen('pending_tasks')
        completed_tasks = self.redis.llen('completed_tasks')
        error_tasks = self.redis.llen('error_tasks')
        
        # Compute the overall task success rate
        total_tasks = pending_tasks + completed_tasks + error_tasks
        if total_tasks > 0:
            success_rate = completed_tasks / total_tasks
            self.redis.set('scheduler:success_rate', success_rate)
        
        # Detect stuck tasks and requeue them
        stuck_tasks = self.redis.zrange('stuck_tasks', 0, -1)
        for task_id in stuck_tasks:
            task_id = task_id.decode('utf-8')
            task_info = self.redis.hgetall(f'task:{task_id}')
            if task_info:
                self.redis.rpush('pending_tasks', task_id)
                self.redis.zrem('stuck_tasks', task_id)

With the guide and sample code above, you can build an efficient, stable distributed crawler system that stands up to a wide range of anti-crawling measures.
