[Crawler Framework - 7] Log Tracing Implementation

Design and Implementation of Full-Link Tracing for a Distributed Crawler System

1. Background and Problems

In a distributed crawler system, a single initial request can spawn dozens of child requests (list page → detail pages → comment pages → next page ...), and these requests are executed across different consumer processes. Tracking the execution state, performance bottlenecks, and call relationships of the whole task chain becomes the central observability challenge.

1.1 Core Requirements

  1. Full-link tracing: from the initial request down to every descendant request, forming a complete call chain
  2. Task status monitoring: real-time visibility into each request's execution state (pending/success/failed)
  3. Performance analysis: record each request's duration, error details, and so on
  4. Task completion callback: automatically trigger a notification once the entire task tree has finished
  5. Log correlation: quickly locate all log lines belonging to a given task chain within a huge log volume

2. Technical Approach: An OpenTelemetry-Style Trace/Span Model

2.1 Core Concepts

Following the OpenTelemetry distributed-tracing standard, three key IDs are defined:

python
# Tracing fields on the Request class
trace_id: str        # Unique ID of the whole crawl task, inherited by every child request
span_id: str         # Unique ID of the current request node
parent_span_id: str  # Span ID of the parent request node

Call chain example:

plain
List page:
  trace_id='abc12345'
  span_id='span001'
  parent_span_id=None

Product A detail:
  trace_id='abc12345'     # inherited from the parent request
  span_id='span002'       # newly generated
  parent_span_id='span001' # points to the list page

Product A comments, page 1:
  trace_id='abc12345'
  span_id='span003'
  parent_span_id='span002' # points to the detail page

Product A comments, page 2:
  trace_id='abc12345'
  span_id='span004'
  parent_span_id='span002' # same parent request

2.2 Integration with the funboost task_id

In funboost, the task_id uniquely identifies a consumed task; we combine it with the span information:

python
# Format: parent_span_id:span_id
task_id = f"{result.parent_span_id or ''}:{result.span_id or ''}"

# Example
task_id = "span001:span002"  # parent node is span001, current node is span002

Benefits of this design:

  • task_id encodes the call relationship: no extra field is needed to see the parent-child link
  • Compatible with funboost: it is still a unique string, so message deduplication is unaffected
  • Log readability: the parent_span_id:span_id pair is obvious at a glance (a parsing sketch follows below)
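
Because the format is fixed, a task_id seen in the logs can be split back into its two components. A minimal sketch (the helper name is illustrative, not part of the framework):

python
def split_task_id(task_id: str):
    """Split a 'parent_span_id:span_id' task_id back into its two parts."""
    parent_span_id, _, span_id = task_id.partition(':')
    return parent_span_id or None, span_id

print(split_task_id("span001:span002"))  # ('span001', 'span002')
print(split_task_id(":span001"))         # (None, 'span001') -> root request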

3. System Architecture: Layered Design

3.1 Architecture Diagram

plain
┌─────────────────────────────────────────────────────┐
│                Logging layer (Logger)               │
│  - TaskIdLogger injects the task_id automatically   │
│  - Log format: %(task_id)s = parent_span_id:span_id │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│           Request object (network layer)            │
│  - trace_id / span_id / parent_span_id              │
│  - Inherits the parent request's trace_id           │
│  - Generates a new span_id automatically            │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│            Worker processing layer (core)           │
│  - Deserializes the Request object                  │
│  - Builds task_id = parent_span_id:span_id          │
│  - Passes the task_id to funboost when publishing   │
│    child requests                                   │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│           TaskTraceManager (trace manager)          │
│  - Records request chains (Redis + InfluxDB)        │
│  - Tracks the task tree's status                    │
│  - Fires a callback when the task tree completes    │
└─────────────────────────────────────────────────────┘

4. Core Implementation

4.1 Layer 1: Extending the funboost Logging System

4.1.1 The Default funboost Logging Configuration

funboost's default log format (funboost_config.py:125-128):

python
NB_LOG_FORMATER_INDEX_FOR_CONSUMER_AND_PUBLISHER = logging.Formatter(
    f'%(asctime)s-({nb_log_config_default.computer_ip},{nb_log_config_default.computer_name})'
    f'-[p%(process)d_t%(thread)d] - %(name)s - "%(filename)s:%(lineno)d" - '
    f'%(funcName)s - %(levelname)s - %(task_id)s - %(message)s',
    "%Y-%m-%d %H:%M:%S",
)

Key fields:

  • %(task_id)s: the task ID that funboost injects automatically (implemented via TaskIdLogger)
  • This task_id comes from funboost.core.current_task.get_current_taskid()
  • It picks the task ID up from the current context and hands it to the logger, which is extremely convenient

4.1.2 Our Extension

Goal: display task_id in the parent_span_id:span_id format.

Implementation path:

  1. When we publish a task, we explicitly pass a task_id argument to funboost
  2. funboost's TaskIdLogger automatically reads this task_id from the context
  3. At log-formatting time it is filled into the %(task_id)s placeholder

Code location: funspider/utils/fun_logger.py

python
from nb_log import LogManager  # LogManager is provided by the nb_log package that funboost builds on
from funboost.core.task_id_logger import TaskIdLogger

# Use funboost's native TaskIdLogger (no customization needed)
log_manager = LogManager('funspider', logger_cls=TaskIdLogger)

logger_config = {
    'log_level_int': 10,
    'is_add_stream_handler': True,
    'formatter_template': FunboostCommonConfig.NB_LOG_FORMATER_INDEX_FOR_CONSUMER_AND_PUBLISHER,
    'log_path': None,
    'log_filename': None,
}

logger = log_manager.get_logger_and_add_handlers(**logger_config)

Log output example:

plain
2025-12-14 13:13:08-(192.168.1.5,koohai)-[p28568_t30180] - funspider - "worker.py:315" 
- _handle_parse_results - INFO - span001:span002 - Publishing child request: http://example.com/page1

4.2 Layer 2: Trace/Span Initialization on the Request Object

4.2.1 Automatic Tracing When a Request Is Initialized

Code location: funspider/network/request.py:143-207

python
class Request:
    """Enhanced Request class: supports direct modification by handlers and middleware selection"""

    def __init__(self, url: str, callback: Optional[Callable] = None,
                 headers: Optional[Dict[str, str]] = None,
                 meta: Optional[Dict[str, Any]] = None,
                 method: str = 'GET',
                 params: Optional[str] = None,
                 middleware_tags: List[str] = None,
                 middleware_conditions: List[str] = None,
                 pipeline_tags: List[str] = None,
                 pipeline_conditions: List[str] = None,
                 download: Optional[Callable] = None,
                 request_id: Optional[str] = None,
                 depth: Optional[int] = None,
                 parent_request_id: Optional[str] = None,
  
                 priority_params: Optional[Dict[str, Any]] = None,
                 trace_id: Optional[str] = None,
                 span_id: Optional[str] = None,
                 parent_span_id: Optional[str] = None,
                 queue_prefix: Optional[str] = None,
                 **kwargs):
    
        # Basic attributes
        self.url = url
        self.callback = callback
        self.method = method
        self.request_id = request_id or uuid.uuid4().hex
        self.queue_prefix = queue_prefix

        # Important: make sure headers and meta are dicts so handlers can modify them in place
        self.headers = headers or {}
        self.meta = meta or {}

        # Store the depth and the parent request ID in meta
        self.meta['depth'] = depth or 0
        self.meta['parent_request_id'] = parent_request_id
  
        
        # ====== OpenTelemetry-style trace context ======
        # Trace ID: unique ID of the whole crawl task, inherited by every child request
        # Span ID: unique ID of the current request node
        # Parent Span ID: span ID of the parent request node

        # If trace_id/span_id are already provided, they are not regenerated
        # (this avoids overwriting user-supplied values)

        self.trace_id, self.span_id, self.parent_span_id = self._init_trace_span(
            trace_id, span_id, parent_span_id, parent_request_id
        )
        
        # Save into meta for easy serialization and propagation
        self.meta['trace_id'] = self.trace_id
        self.meta['span_id'] = self.span_id
        self.meta['parent_span_id'] = self.parent_span_id

        # Middleware selection
        self.middleware_tags = middleware_tags or []
        self.middleware_conditions = middleware_conditions or []

        # Pipeline selection
        self.pipeline_tags = pipeline_tags or []
        self.pipeline_conditions = pipeline_conditions or []

        # Custom downloader
        self.download = download

        # PriorityConsumingControlConfig parameters (task-level control)
        self.priority_params = priority_params or {}

        # Set the callback name
        if callable(self.callback):
            self.callback_name = self.callback.__name__
        elif self.callback is None:
            self.callback_name = 'parse'  # default name
        else:
            self.callback_name = str(self.callback)

        # HTTP parameter handling - extract common HTTP params as direct attributes
        self._extract_http_params(kwargs)

        # Any remaining parameters are stored in kwargs
        self.kwargs = kwargs
    
    def _init_trace_span(self, trace_id: Optional[str], span_id: Optional[str], 
                        parent_span_id: Optional[str], parent_request_id: Optional[str]) -> tuple:
        """
        初始化 Trace ID 和 Span ID
        
        逻辑:
        1. trace_id:如果传入则继承,否则生成新的(16位短ID)
        2. span_id:总是生成新的(12位短ID)
        3. parent_span_id:如果传入则使用,否则为 None(根节点)
        """
        # 1. Generate or inherit the Trace ID
        if trace_id:
            final_trace_id = trace_id  # child request inherits it
        elif parent_request_id and 'trace_id' in self.meta:
            final_trace_id = self.meta.get('trace_id')
        else:
            final_trace_id = uuid.uuid4().hex[:16]  # generated for the initial request
        
        # 2. Generate the Span ID (new for every request)
        if span_id:
            final_span_id = span_id
        else:
            final_span_id = uuid.uuid4().hex[:12]
        
        # 3. Set the Parent Span ID
        if parent_span_id:
            final_parent_span_id = parent_span_id
        elif parent_request_id and 'span_id' in self.meta:
            final_parent_span_id = self.meta.get('span_id')
        else:
            final_parent_span_id = None  # root node
        
        return final_trace_id, final_span_id, final_parent_span_id
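
How the initializer behaves in practice — a minimal usage sketch (the concrete ID values shown are illustrative):

python
# A minimal usage sketch; the printed ID values are illustrative.
parent = Request(url='http://example.com/list')
print(parent.trace_id, parent.span_id, parent.parent_span_id)
# e.g. 'abc12345deadbeef' '1a2b3c4d5e6f' None   <- root node

child = Request(
    url='http://example.com/item/1',
    trace_id=parent.trace_id,        # inherit the trace
    parent_span_id=parent.span_id,   # point at the parent node
)
print(child.trace_id, child.span_id, child.parent_span_id)
# same trace_id, a freshly generated span_id, parent_span_id == parent.span_id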

4.3 Layer 3: Automatic Propagation When the Worker Handles Child Requests

4.3.1 Injecting Trace Info While Handling Parse Results

Code location: funspider/core/worker.py:260-319

python
def _handle_parse_results(self, results, spider_instance, parent_request: Optional[Request] = None):
    """
    Handle parse results and automatically inject trace info into child requests
    """
    # 1. Get the parent request's trace info
    parent_trace_id = None
    parent_span_id = None
    
    if parent_request:
        parent_trace_id = parent_request.trace_id      # to be inherited
        parent_span_id = parent_request.span_id        # becomes the child's parent_span_id
        # i.e. the span id is taken from the parent request itself
    for result in results:
        if isinstance(result, Request):  
            # 2. Set trace_id on the child request (inherited from the parent)
            # When the core tells child requests apart from items, it injects the parent trace_id
            # automatically, so the spider only needs to assign a span_id to the request.
            # (task_id and batch_id are the more familiar habit, but this follows the standard.)
            if parent_trace_id and not result.trace_id:
                result.meta['trace_id'] = parent_trace_id
                result.trace_id = parent_trace_id
            
            # 3. Set parent_span_id on the child request (the parent's span_id)
            if parent_span_id and not result.parent_span_id:
                result.meta['parent_span_id'] = parent_span_id
                result.parent_span_id = parent_span_id
            
            # 4. If the child request has no span_id, generate one
            if not result.span_id:
                new_span_id = uuid.uuid4().hex[:12]
                result.span_id = new_span_id
                result.meta['span_id'] = new_span_id
            
            # 5. Build the task_id (format: parent_span_id:span_id)
            task_id = f"{result.parent_span_id or ''}:{result.span_id or ''}"
            
            # 6. Log output (shows the full trace info)
            logger.debug(
                f"[trace={result.trace_id}][span={result.span_id}][parent={result.parent_span_id}] "
                f"发布子请求: {result.url}"
            )
            
            # 7. Pass the task_id when publishing to the funboost queue
            # ... (see the next subsection)

4.3.2 Passing task_id When Publishing a Task

Code location: funspider/core/engine.py (publish method)

python
def publish_request(self, request: Request, task_id: str = None):
    """
    Publish a request to the queue

    Args:
        request: the request object
        task_id: the task ID (format: parent_span_id:span_id)
    """
    callback_name = request.callback_name
    queue_prefix = request.queue_prefix or self.default_queue_prefix
    
    # Get or create the Booster
    booster = self._get_or_create_booster(queue_prefix, callback_name)
    
    # Key point: pass the task_id to funboost
    booster.publish(
        request.to_dict(),
        task_id=task_id,  # funboost injects it into the task context
        **request.priority_params
    )

What funboost does internally:

  1. The task_id is stored in the funboost.core.current_task context
  2. TaskIdLogger's makeRecord() method reads this task_id automatically
  3. At log-formatting time it is filled into the %(task_id)s placeholder (see the sketch below)
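
Business code can read the current task_id from that same context; a minimal sketch, using the accessor referenced in section 4.1.1:

python
# A minimal sketch; get_current_taskid is the accessor referenced in section 4.1.1.
from funboost.core.current_task import get_current_taskid

def log_current_task(url: str):
    # Inside a consumer this returns the task_id passed at publish time,
    # e.g. 'span001:span002', without threading it through function arguments.
    current_task_id = get_current_taskid()
    print(f"handling {url}, task_id={current_task_id}")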

4.4 Layer 4: Unified Tracking with TaskTraceManager

4.4.1 Core Features

Code location: funspider/core/trace_stats.py

python
class TaskTraceManager:
    """
    Unified task trace manager

    Features:
    1. [Monitoring/statistics] record request chains and performance data
    2. [Task coordination] track task-tree completion status
    3. [Completion callback] trigger a custom callback when a task tree completes

    Data storage:
    - Redis: real-time state (1-hour TTL)
    - InfluxDB: historical data (long-term analysis, optional)
    """
    
    def __init__(self, spider_name: str, redis_url: str = None, 
                 influx_config: Optional[Dict] = None,
                 on_tree_completed: Optional[Callable] = None):
        self.spider_name = spider_name
        self.redis_client = get_redis_client(redis_url)
        self.on_tree_completed = on_tree_completed
        self._write_api = None  # set when the InfluxDB client is configured (setup elided here)

        # Asynchronous InfluxDB writes (optional)
        if influx_config:
            self._write_queue = Queue(maxsize=10000)
            self._writer_thread = Thread(target=self._background_writer, daemon=True)
            self._writer_thread.start()
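
How the manager might be wired up — a minimal sketch; the callback signature shown is an assumption for illustration:

python
# A minimal usage sketch; the callback signature is an assumption, not part of the framework API.
def notify_done(trace_id: str):
    print(f"task tree {trace_id} finished, sending a notification...")

trace_manager = TaskTraceManager(
    spider_name='my_spider',
    redis_url='redis://127.0.0.1:6379/0',
    influx_config=None,              # InfluxDB is optional
    on_tree_completed=notify_done,   # fired when the whole task tree completes
)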

4.4.2 Recording Requests in the Chain

python
def record_request(self, trace_id: str, span_id: str, parent_span_id: Optional[str],
                   callback_name: str, url: str, status: str = "pending",
                   metadata: Optional[Dict] = None):
    """
    Record a request (written to both Redis and InfluxDB)

    Redis data structures:
    - trace:{trace_id}:total -> {callback_name: count}
    - trace:{trace_id}:status:{status} -> {callback_name: count}
    - trace:{trace_id}:children:{parent_span_id} -> {callback_name: count}
    - trace:{trace_id}:request:{span_id} -> {url, callback, parent_span_id, status, timestamp}
    """
    timestamp = time.time()
    
    # 1. Redis: real-time state (synchronous write)
    pipe = self.redis_client.pipeline()
    pipe.hincrby(f"trace:{trace_id}:total", callback_name, 1)
    pipe.hincrby(f"trace:{trace_id}:status:{status}", callback_name, 1)
    
    if parent_span_id:
        pipe.hincrby(f"trace:{trace_id}:children:{parent_span_id}", callback_name, 1)
    
    request_key = f"trace:{trace_id}:request:{span_id}"
    pipe.hset(request_key, mapping={
        'url': url,
        'callback': callback_name,
        'parent_span_id': parent_span_id or '',
        'status': status,
        'timestamp': timestamp
    })
    pipe.expire(request_key, 3600)  # expires after 1 hour
    pipe.execute()
    
    # 2. InfluxDB: historical data (asynchronous write)
    if self._write_api:
        self._write_queue.put({
            'type': 'request',
            'trace_id': trace_id,
            'span_id': span_id,
            'parent_span_id': parent_span_id or 'root',
            'callback_name': callback_name,
            'status': status,
            'url': url,
            'timestamp': timestamp,
            'metadata': metadata or {}
        })
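
The counters written above can be inspected directly in Redis; a minimal sketch (key names follow the docstring, the trace/span values are illustrative):

python
# A minimal sketch; key names follow the docstring above, the IDs are illustrative.
r = trace_manager.redis_client
print(r.hgetall("trace:abc12345:total"))            # e.g. {b'parse': b'1', b'parse_detail': b'2'}
print(r.hgetall("trace:abc12345:status:pending"))   # per-callback pending counts
print(r.hgetall("trace:abc12345:request:span002"))  # url, callback, parent_span_id, status, timestamp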

4.4.3 Updating the Request Status

python
def update_status(self, trace_id: str, span_id: str, callback_name: str,
                  old_status: str, new_status: str,
                  duration_ms: Optional[float] = None,
                  error_msg: Optional[str] = None):
    """
    Update a request's status (pending -> success/failed)

    Also records:
    - status-count changes
    - request duration
    - error message
    """
    pipe = self.redis_client.pipeline()
    pipe.hincrby(f"trace:{trace_id}:status:{old_status}", callback_name, -1)
    pipe.hincrby(f"trace:{trace_id}:status:{new_status}", callback_name, 1)
    
    request_key = f"trace:{trace_id}:request:{span_id}"
    pipe.hset(request_key, 'status', new_status)
    if duration_ms is not None:
        pipe.hset(request_key, 'duration_ms', duration_ms)
    if error_msg:
        pipe.hset(request_key, 'error', error_msg)
    
    pipe.execute()

4.4.4 Getting the Trace Tree Structure

python
def get_trace_tree(self, trace_id: str) -> Dict:
    """
    Get the complete request tree structure (for visualization)
    
    Returns:
        {
            "trace_id": "abc12345",
            "tree": {
                "span_id": "span001",
                "url": "http://example.com",
                "status": "success",
                "children": [
                    {
                        "span_id": "span002",
                        "url": "http://example.com/page1",
                        "status": "success",
                        "children": [...]
                    }
                ]
            }
        }
    """
    # 1. Fetch the details of every request
    request_keys = self.redis_client.keys(f"trace:{trace_id}:request:*")
    requests = {}
    
    for key in request_keys:
        span_id = key.decode().split(':')[-1]
        data = self.redis_client.hgetall(key)
        requests[span_id] = {k.decode(): v.decode() for k, v in data.items()}
    
    # 2. Build the tree structure
    root_nodes = []
    children_map = {}
    
    for span_id, req_data in requests.items():
        parent_span_id = req_data.get('parent_span_id', '')
        node = {'span_id': span_id, **req_data, 'children': []}
        
        if not parent_span_id:
            root_nodes.append(node)
        else:
            if parent_span_id not in children_map:
                children_map[parent_span_id] = []
            children_map[parent_span_id].append(node)
    
    # 3. Recursively attach child nodes
    def fill_children(node):
        span_id = node['span_id']
        if span_id in children_map:
            node['children'] = children_map[span_id]
            for child in node['children']:
                fill_children(child)
    
    for root in root_nodes:
        fill_children(root)
    
    return {'trace_id': trace_id, 'tree': root_nodes[0] if root_nodes else None}
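
For quick inspection from a terminal, the returned tree can be printed recursively — a minimal helper sketch (not part of the framework):

python
# A minimal helper sketch (not part of the framework) that prints the tree returned above.
def print_trace_tree(node, indent=0):
    if node is None:
        return
    print(f"{'  ' * indent}{node.get('span_id')} [{node.get('status')}] {node.get('url')}")
    for child in node.get('children', []):
        print_trace_tree(child, indent + 1)

result = trace_manager.get_trace_tree('abc12345')
print_trace_tree(result['tree'])
# span001 [success] http://example.com/list
#   span002 [success] http://example.com/item/1
#   span003 [success] http://example.com/item/2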

4.4.5 Asynchronous Batched Writes to InfluxDB

python
def _background_writer(self):
    """
    Background thread that writes to InfluxDB in batches

    Strategy:
    - batch size: 100 points
    - flush interval: 2 seconds
    """
    points_buffer = []
    last_flush = time.time()
    
    while True:
        try:
            # Fetch data, waiting at most 0.1 s
            data = self._write_queue.get(timeout=0.1)
            point = self._build_point(data)
            points_buffer.append(point)
        except Exception:  # typically queue.Empty: no new data arrived within the timeout
            pass
        
        # Flush once either condition is met
        should_flush = (
            len(points_buffer) >= 100 or
            time.time() - last_flush >= 2.0
        )
        
        if should_flush and points_buffer:
            self._write_api.write(bucket=self._bucket, record=points_buffer)
            logger.debug(f"InfluxDB 批量写入: {len(points_buffer)} 条")
            points_buffer.clear()
            last_flush = time.time()

def _build_point(self, data: Dict):
    """构建 InfluxDB Point"""
    point = (
        Point("spider_request")
        .tag("spider_name", self.spider_name)
        .tag("trace_id", data['trace_id'])
        .tag("span_id", data['span_id'])
        .tag("parent_span_id", data['parent_span_id'])
        .tag("callback", data['callback_name'])
        .tag("status", data['status'])
        .field("url", data['url'])
        .field("count", 1)
        .time(int(data['timestamp'] * 1e9))
    )
    
    # Attach metadata (e.g. duration_ms, error_msg)
    if data.get('metadata'):
        for k, v in data['metadata'].items():
            if v is not None:
                point.field(k, v)
    
    return point

5. End-to-End Call Flow

5.1 Publishing the Initial Request

python
# User code
spider = MySpider()
spider.start_requests()

def start_requests(self):
    request = Request(
        url='http://example.com/list',
        callback=self.parse
    )
    # At this point the following are generated automatically:
    # trace_id='abc12345'      (newly generated)
    # span_id='span001'        (newly generated)
    # parent_span_id=None      (root node)
    
    yield request

5.2 The Worker Consumes the Initial Request

python
# funspider/core/worker.py
def _process_request_task(self, spider_instance, payload: Dict):
    # 1. Deserialize
    request = Request.from_dict(payload)
    
    # 2. Record that the request has started
    self.task_manager.record_request(
        trace_id=request.trace_id,        # 'abc12345'
        span_id=request.span_id,          # 'span001'
        parent_span_id=request.parent_span_id,  # None
        callback_name='parse',
        url=request.url,
        status="pending"
    )
    
    # 3. Download + parse
    response = spider_instance.download(request)
    results = spider_instance.parse(request, response)
    
    # 4. Handle the parse results (publish child requests)
    self._handle_parse_results(results, spider_instance, parent_request=request)
    
    # 5. Update the status
    self.task_manager.update_status(
        trace_id=request.trace_id,
        span_id=request.span_id,
        callback_name='parse',
        old_status="pending",
        new_status="success",
        duration_ms=100
    )

5.3 Publishing Child Requests

python
# funspider/core/worker.py
def _handle_parse_results(self, results, spider_instance, parent_request):
    for result in results:
        if isinstance(result, Request):
            # Automatically inject the trace info
            result.trace_id = parent_request.trace_id        # 'abc12345' (inherited)
            result.parent_span_id = parent_request.span_id   # 'span001' (parent node)
            result.span_id = uuid.uuid4().hex[:12]           # 'span002' (newly generated)
            
            # Build the task_id
            task_id = f"{result.parent_span_id}:{result.span_id}"  # 'span001:span002'
            
            # Publish to the queue
            self.engine.publish_request(result, task_id=task_id)
            
            # Log output:
            # 2025-12-14 13:13:08 - funspider - worker.py:315 - INFO - span001:span002 - Publishing child request: ...

5.4 Consuming a Child Request

python
# The worker consumes the span002 task
def _process_request_task(self, spider_instance, payload: Dict):
    request = Request.from_dict(payload)
    # At this point:
    # request.trace_id = 'abc12345'
    # request.span_id = 'span002'
    # request.parent_span_id = 'span001'
    
    # task_id in the funboost context = 'span001:span002'
    # The log automatically shows:
    # 2025-12-14 13:13:10 - funspider - worker.py:154 - DEBUG - span001:span002 - Processing request: ...
    
    # Record the request
    self.task_manager.record_request(
        trace_id='abc12345',
        span_id='span002',
        parent_span_id='span001',  # establishes the parent-child link
        callback_name='parse_detail',
        url=request.url,
        status="pending"
    )
    
    # ... continue processing

6. Usage Examples

6.1 Spider Code

python
from funspider import BaseSpider, Request, Item

class MySpider(BaseSpider):
    name = 'my_spider'
    
    def start_requests(self):
        # Initial request (trace_id and span_id are generated automatically)
        yield Request(
            url='http://example.com/list',
            callback=self.parse
        )
    
    def parse(self, request, response):
        """解析列表页"""
        for item in response.xpath('//div[@class="item"]'):
            # Publish a detail-page request (trace_id is inherited, a new span_id is generated)
            yield Request(
                url=item.xpath('./a/@href').get(),
                callback=self.parse_detail
            )
    
    def parse_detail(self, request, response):
        """解析详情页"""
        yield Item(
            title=response.xpath('//h1/text()').get(),
            content=response.xpath('//div[@class="content"]/text()').get()
        )

6.2 Querying the Trace Tree

python
# Get the complete call tree for a given trace
tree = spider.task_manager.get_trace_tree('abc12345')

# Result:
{
    "trace_id": "abc12345",
    "tree": {
        "span_id": "span001",
        "url": "http://example.com/list",
        "callback": "parse",
        "status": "success",
        "duration_ms": 100,
        "children": [
            {
                "span_id": "span002",
                "url": "http://example.com/item/1",
                "callback": "parse_detail",
                "status": "success",
                "duration_ms": 80,
                "children": []
            },
            {
                "span_id": "span003",
                "url": "http://example.com/item/2",
                "callback": "parse_detail",
                "status": "success",
                "duration_ms": 75,
                "children": []
            }
        ]
    }
}

6.3 Querying Statistics

python
# Get the statistics for a given trace
stats = spider.task_manager.get_trace_stats('abc12345')

# Result:
{
    'total': {'parse': 1, 'parse_detail': 2},
    'pending': {'parse': 0, 'parse_detail': 0},
    'success': {'parse': 1, 'parse_detail': 2},
    'failed': {'parse': 0, 'parse_detail': 0}
}
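
These counters are also what a completion check can be built on: a task tree is finished once nothing is pending. A minimal sketch, assuming get_trace_stats returns the dict shown above:

python
# A minimal sketch, assuming get_trace_stats returns the dict shown above.
def is_trace_finished(stats: dict) -> bool:
    total = sum(stats.get('total', {}).values())
    pending = sum(stats.get('pending', {}).values())
    return total > 0 and pending == 0

stats = spider.task_manager.get_trace_stats('abc12345')
if is_trace_finished(stats):
    print("task tree abc12345 completed")  # e.g. the point where on_tree_completed would fire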

6.4 Log Output Example

plain
2025-12-14 13:13:08-(192.168.1.5,koohai)-[p28568_t30180] - funspider - "worker.py:154" - _process_request_task - INFO - :span001 - Processing request: http://example.com/list

2025-12-14 13:13:08-(192.168.1.5,koohai)-[p28568_t30180] - funspider - "worker.py:315" - _handle_parse_results - DEBUG - span001:span002 - Publishing child request: http://example.com/item/1

2025-12-14 13:13:09-(192.168.1.5,koohai)-[p28568_t30181] - funspider - "worker.py:154" - _process_request_task - INFO - span001:span002 - Processing request: http://example.com/item/1

2025-12-14 13:13:09-(192.168.1.5,koohai)-[p28568_t30181] - funspider - "worker.py:205" - _process_request_task - INFO - span001:span002 - Request succeeded, took 80ms

Interpretation:

  • :span001: the root request, with no parent node
  • span001:span002: a child request whose parent node is span001
  • Different threads of the same worker process (t30180 vs t30181) handle different requests, yet they stay linked through the task_id (see the filtering sketch below)
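
To pull every log line that mentions one request out of a mixed log file, filtering on its span_id is enough: the span shows up as the child part of the task_id when the request is published and processed, and as the parent part in the logs of the requests it spawns. A minimal sketch (the log file path is illustrative):

python
# A minimal sketch; the log file path is illustrative.
def grep_span(log_path: str, span_id: str):
    with open(log_path, encoding='utf-8') as f:
        return [line.rstrip() for line in f if span_id in line]

# 'span002' appears as ':span002' / 'span001:span002' when it is published and processed,
# and as 'span002:<child>' in the logs of the child requests it spawns.
for line in grep_span('/var/log/funspider/worker.log', 'span002'):
    print(line)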

Of course, the Redis statistics and the InfluxDB part above are for reference only; they are still being polished.

More articles: follow the WeChat official account 零基础爬虫第一天.
