Design and Implementation of a Full-Link Tracing System for Distributed Crawlers
1. Background and Problem
In a distributed crawler system, one initial request can spawn dozens of child requests (list page → detail page → comments page → next page ...), and those requests are executed across different consumer processes. Tracking the execution status, performance bottlenecks, and call relationships of the whole task chain becomes the core observability challenge.
1.1 Core Requirements
- Full-link tracing: from the initial request down to every descendant request, forming a complete call chain
- Task status monitoring: know the execution status of each request in real time (pending/success/failed)
- Performance analysis: record per-request latency, error messages, and so on
- Task completion callback: automatically trigger a notification once the entire task tree has finished
- Log correlation: quickly locate all log lines belonging to one task chain among massive log volumes
2. Technical Approach: an OpenTelemetry-Style Trace/Span Model
2.1 Core Concepts
Following the OpenTelemetry distributed-tracing standard, three key IDs are defined:
python
# Tracing fields on the Request class
trace_id: str         # Unique ID of the whole crawl task; inherited by all child requests
span_id: str          # Unique ID of the current request node
parent_span_id: str   # Span ID of the parent request node
Call-chain example:
plain
List page:
trace_id='abc12345'
span_id='span001'
parent_span_id=None
Product A detail:
trace_id='abc12345'        # inherited from the parent request
span_id='span002'          # newly generated
parent_span_id='span001'   # points to the list page
Product A comments, page 1:
trace_id='abc12345'
span_id='span003'
parent_span_id='span002'   # points to the detail page
Product A comments, page 2:
trace_id='abc12345'
span_id='span004'
parent_span_id='span002'   # same parent request
2.2 Integration with the funboost task_id
funboost's task_id is the unique identifier of a consumed task; we combine it with the span information:
python
# Format: parent_span_id:span_id
task_id = f"{result.parent_span_id or ''}:{result.span_id or ''}"
# Example
task_id = "span001:span002"  # the parent node is span001, the current node is span002
Benefits of this design:
- The task_id itself encodes the call relationship: no extra field is needed to know the parent-child link
- Compatible with funboost: it is still a unique string, so message deduplication is unaffected
- Log readability: the parent_span_id:span_id pair is visible at a glance
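To make the convention concrete, here is a minimal sketch of building and splitting such a task_id (the helper names build_task_id / split_task_id are hypothetical, not part of funspider):
python
from typing import Optional, Tuple

def build_task_id(parent_span_id: Optional[str], span_id: Optional[str]) -> str:
    """Join parent and current span IDs; a root request yields ':span001'."""
    return f"{parent_span_id or ''}:{span_id or ''}"

def split_task_id(task_id: str) -> Tuple[Optional[str], Optional[str]]:
    """Recover (parent_span_id, span_id) from a task_id string."""
    parent, _, span = task_id.partition(':')
    return (parent or None, span or None)

assert build_task_id(None, 'span001') == ':span001'
assert split_task_id('span001:span002') == ('span001', 'span002')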
3. System Architecture: Layered Design
3.1 Architecture Diagram
plain
┌─────────────────────────────────────────────────────┐
│ Logging layer (Logger) │
│ - TaskIdLogger injects task_id automatically │
│ - Log format: %(task_id)s = parent_span_id:span_id │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ Request object (network layer) │
│ - trace_id / span_id / parent_span_id │
│ - Automatically inherits the parent request's trace_id │
│ - Automatically generates a new span_id │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ Worker processing layer (core) │
│ - Deserializes the Request object │
│ - Builds task_id = parent_span_id:span_id │
│ - Passes task_id to funboost when publishing child requests │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ TaskTraceManager (trace manager) │
│ - Records the request chain (Redis + InfluxDB) │
│ - Tracks task-tree completion status │
│ - Fires a callback when the task tree completes │
└─────────────────────────────────────────────────────┘
4. Core Implementation
4.1 Layer 1: Extending the funboost Logging System
4.1.1 The Original funboost Log Configuration
funboost's default log format (funboost_config.py:125-128):
python
NB_LOG_FORMATER_INDEX_FOR_CONSUMER_AND_PUBLISHER = logging.Formatter(
f'%(asctime)s-({nb_log_config_default.computer_ip},{nb_log_config_default.computer_name})'
f'-[p%(process)d_t%(thread)d] - %(name)s - "%(filename)s:%(lineno)d" - '
f'%(funcName)s - %(levelname)s - %(task_id)s - %(message)s',
"%Y-%m-%d %H:%M:%S",
)
Key field:
- %(task_id)s: the task ID injected automatically by funboost (implemented via TaskIdLogger). This task_id comes from funboost.core.current_task.get_current_taskid(), which reads the task ID from the current execution context and hands it to the logger, which is extremely handy.
4.1.2 Our Extension
Goal: make the task_id appear in the parent_span_id:span_id format.
Implementation path:
- When publishing a task, we explicitly pass in the task_id parameter
- funboost's TaskIdLogger automatically reads this task_id from the context
- When a log line is formatted, it is filled into the %(task_id)s placeholder
Code location: funspider/utils/fun_logger.py
python
from nb_log import LogManager  # LogManager comes from nb_log, the logging library funboost builds on
from funboost.core.task_id_logger import TaskIdLogger
# Assumed import: FunboostCommonConfig is defined in the project's funboost_config.py (see 4.1.1)
from funboost_config import FunboostCommonConfig

# Use funboost's native TaskIdLogger (no custom subclass needed)
log_manager = LogManager('funspider', logger_cls=TaskIdLogger)
logger_config = {
    'log_level_int': 10,
    'is_add_stream_handler': True,
    'formatter_template': FunboostCommonConfig.NB_LOG_FORMATER_INDEX_FOR_CONSUMER_AND_PUBLISHER,
    'log_path': None,
    'log_filename': None,
}
logger = log_manager.get_logger_and_add_handlers(**logger_config)
Sample log output:
plain
2025-12-14 13:13:08-(192.168.1.5,koohai)-[p28568_t30180] - funspider - "worker.py:315"
- _handle_parse_results - INFO - span001:span002 - publishing child request: http://example.com/page1
4.2 Layer 2: Trace/Span Initialization on the Request Object
4.2.1 Automatic Tracing When a Request Is Initialized
Code location: funspider/network/request.py:143-207
python
import uuid
from typing import Any, Callable, Dict, List, Optional


class Request:
    """Enhanced Request class that supports direct modification by handlers and middleware selection"""
    def __init__(self, url: str, callback: Optional[Callable] = None,
                 headers: Optional[Dict[str, str]] = None,
                 meta: Optional[Dict[str, Any]] = None,
                 method: str = 'GET',
                 params: Optional[str] = None,
                 middleware_tags: List[str] = None,
                 middleware_conditions: List[str] = None,
                 pipeline_tags: List[str] = None,
                 pipeline_conditions: List[str] = None,
                 download: Optional[Callable] = None,
                 request_id: Optional[str] = None,
                 depth: Optional[int] = None,
                 parent_request_id: Optional[str] = None,
                 priority_params: Optional[Dict[str, Any]] = None,
                 trace_id: Optional[str] = None,
                 span_id: Optional[str] = None,
                 parent_span_id: Optional[str] = None,
                 queue_prefix: Optional[str] = None,
                 **kwargs):
        # Basic attributes
        self.url = url
        self.callback = callback
        self.method = method
        self.request_id = request_id or uuid.uuid4().hex
        self.queue_prefix = queue_prefix
        # Important: make sure headers and meta are dicts so handlers can modify them in place
        self.headers = headers or {}
        self.meta = meta or {}
        # Store the depth and the parent request ID in meta
        self.meta['depth'] = depth or 0
        self.meta['parent_request_id'] = parent_request_id
        # ====== OpenTelemetry-style trace fields ======
        # Trace ID: unique ID of the whole crawl task, inherited by all child requests
        # Span ID: unique ID of the current request node
        # Parent Span ID: Span ID of the parent request node
        # If trace_id/span_id were already provided, do not regenerate them (avoid overwriting user-supplied values)
        self.trace_id, self.span_id, self.parent_span_id = self._init_trace_span(
            trace_id, span_id, parent_span_id, parent_request_id
        )
        # Also keep them in meta for easy serialization and propagation
        self.meta['trace_id'] = self.trace_id
        self.meta['span_id'] = self.span_id
        self.meta['parent_span_id'] = self.parent_span_id
        # Middleware selection
        self.middleware_tags = middleware_tags or []
        self.middleware_conditions = middleware_conditions or []
        # Pipeline selection
        self.pipeline_tags = pipeline_tags or []
        self.pipeline_conditions = pipeline_conditions or []
        # Custom downloader
        self.download = download
        # PriorityConsumingControlConfig parameters (task-level control)
        self.priority_params = priority_params or {}
        # Resolve the callback name
        if callable(self.callback):
            self.callback_name = self.callback.__name__
        elif self.callback is None:
            self.callback_name = 'parse'  # default name
        else:
            self.callback_name = str(self.callback)
        # HTTP parameter handling: extract common HTTP parameters as direct attributes
        self._extract_http_params(kwargs)
        # All remaining parameters are kept in kwargs
        self.kwargs = kwargs
    def _init_trace_span(self, trace_id: Optional[str], span_id: Optional[str],
                         parent_span_id: Optional[str], parent_request_id: Optional[str]) -> tuple:
        """
        Initialize the Trace ID and Span ID.
        Logic:
        1. trace_id: inherit it if provided, otherwise generate a new one (16-char short ID)
        2. span_id: use it if provided, otherwise generate a new one (12-char short ID)
        3. parent_span_id: use it if provided, otherwise None (root node)
        """
        # 1. Generate or inherit the Trace ID
        if trace_id:
            final_trace_id = trace_id  # child request inherits it
        elif parent_request_id and 'trace_id' in self.meta:
            final_trace_id = self.meta.get('trace_id')
        else:
            final_trace_id = uuid.uuid4().hex[:16]  # initial request generates a new one
        # 2. Span ID (a new one for every request unless explicitly provided)
        if span_id:
            final_span_id = span_id
        else:
            final_span_id = uuid.uuid4().hex[:12]
        # 3. Parent Span ID
        if parent_span_id:
            final_parent_span_id = parent_span_id
        elif parent_request_id and 'span_id' in self.meta:
            final_parent_span_id = self.meta.get('span_id')
        else:
            final_parent_span_id = None  # root node
        return final_trace_id, final_span_id, final_parent_span_id
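To see these inheritance rules in action, here is a minimal usage sketch (the import path follows the code location above; the ID values in the comments are illustrative):
python
from funspider.network.request import Request  # assumed importable as shown in the section above

parent = Request(url='http://example.com/list')
# parent.trace_id        -> newly generated 16-char hex ID
# parent.span_id         -> newly generated 12-char hex ID
# parent.parent_span_id  -> None (root node)

child = Request(
    url='http://example.com/item/1',
    trace_id=parent.trace_id,       # inherit the trace
    parent_span_id=parent.span_id,  # point back to the parent node
)
# child.span_id is generated automatically, so:
assert child.trace_id == parent.trace_id
assert child.parent_span_id == parent.span_id
assert child.span_id != parent.span_id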
4.3 Layer 3: Automatic Propagation When the Worker Handles Child Requests
4.3.1 Injecting Trace Information While Processing Parse Results
Code location: funspider/core/worker.py:260-319
python
def _handle_parse_results(self, results, spider_instance, parent_request: Optional[Request] = None):
    """
    Process parse results and automatically inject trace information into child requests
    """
    # 1. Get the parent request's trace information
    parent_trace_id = None
    parent_span_id = None
    if parent_request:
        parent_trace_id = parent_request.trace_id  # inherited by children
        parent_span_id = parent_request.span_id    # becomes the child's parent_span_id
    # the span id is taken from the parent request
    for result in results:
        if isinstance(result, Request):
            # 2. Set the child's trace_id (inherited from the parent)
            # The core decides here whether a result is a child Request or an Item and injects the
            # parent trace_id automatically, so spider code only needs to assign a span_id to a
            # request if it wants to. (Old habits favour task_id and batch_id, but trace/span is
            # used here for the sake of standardization.)
            if parent_trace_id and not result.trace_id:
                result.meta['trace_id'] = parent_trace_id
                result.trace_id = parent_trace_id
            # 3. Set the child's parent_span_id (the parent request's span_id)
            if parent_span_id and not result.parent_span_id:
                result.meta['parent_span_id'] = parent_span_id
                result.parent_span_id = parent_span_id
            # 4. If the child has no span_id yet, generate one
            if not result.span_id:
                new_span_id = uuid.uuid4().hex[:12]
                result.span_id = new_span_id
                result.meta['span_id'] = new_span_id
            # 5. Build the task_id (format: parent_span_id:span_id)
            task_id = f"{result.parent_span_id or ''}:{result.span_id or ''}"
            # 6. Log it (with the full trace information)
            logger.debug(
                f"[trace={result.trace_id}][span={result.span_id}][parent={result.parent_span_id}] "
                f"publishing child request: {result.url}"
            )
            # 7. Pass the task_id to funboost when publishing to the queue
            # ... (see the next section)
4.3.2 Passing task_id When Publishing a Task
Code location: funspider/core/engine.py (publish method)
python
def publish_request(self, request: Request, task_id: str = None):
    """
    Publish a request to the queue
    Args:
        request: the request object
        task_id: task ID (format: parent_span_id:span_id)
    """
    callback_name = request.callback_name
    queue_prefix = request.queue_prefix or self.default_queue_prefix
    # Get or create the Booster
    booster = self._get_or_create_booster(queue_prefix, callback_name)
    # Key point: pass the task_id through to funboost
    booster.publish(
        request.to_dict(),
        task_id=task_id,  # funboost injects it into the task context
        **request.priority_params
    )
How funboost handles it internally:
- the task_id is stored in the funboost.core.current_task context
- TaskIdLogger's makeRecord() method reads this task_id automatically
- when a log line is formatted, it is filled into the %(task_id)s placeholder
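For completeness, the same task_id can also be read back explicitly inside a consumer via the funboost helper mentioned in 4.1.1; a minimal sketch (the consumer function is illustrative, and the import assumes get_current_taskid lives at the dotted path quoted above):
python
from funboost.core.current_task import get_current_taskid  # path as referenced in 4.1.1

def parse_detail_consumer(payload: dict):
    # Inside a consuming function, funboost has already bound the task_id
    # that was passed at publish time into the current context.
    task_id = get_current_taskid()               # e.g. 'span001:span002'
    parent_span_id, _, span_id = task_id.partition(':')
    # logger is the funspider logger configured in 4.1.2
    logger.info(f"handling span {span_id} (parent {parent_span_id})")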
4.4 Layer 4: Unified Trace Management with TaskTraceManager
4.4.1 Core Functions
Code location: funspider/core/trace_stats.py
python
import time
from queue import Queue
from threading import Thread
from typing import Callable, Dict, Optional


class TaskTraceManager:
    """
    Unified task trace manager
    Functions:
    1. [Monitoring/statistics] record the request chain and performance data
    2. [Task coordination] track the completion status of the task tree
    3. [Completion callback] fire a custom callback when the task tree completes
    Storage:
    - Redis: real-time state (1-hour TTL)
    - InfluxDB: historical data (long-term analysis, optional)
    """
    def __init__(self, spider_name: str, redis_url: str = None,
                 influx_config: Optional[Dict] = None,
                 on_tree_completed: Optional[Callable] = None):
        self.spider_name = spider_name
        self.redis_client = get_redis_client(redis_url)
        self.on_tree_completed = on_tree_completed
        # Asynchronous InfluxDB writer (optional)
        # (the InfluxDB client setup that assigns self._write_api / self._bucket is omitted in this excerpt)
        if influx_config:
            self._write_queue = Queue(maxsize=10000)
            self._writer_thread = Thread(target=self._background_writer, daemon=True)
            self._writer_thread.start()
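A wiring-up sketch for reference (the connection string is a placeholder, and the single-argument callback signature is an assumption; the real signature is not shown in this excerpt):
python
def on_tree_done(trace_id: str):
    # Hypothetical callback: fired once every request in the trace tree has finished.
    print(f"trace {trace_id} finished")

task_manager = TaskTraceManager(
    spider_name='my_spider',
    redis_url='redis://localhost:6379/0',  # placeholder Redis connection string
    influx_config=None,                    # skip InfluxDB, keep Redis-only tracking
    on_tree_completed=on_tree_done,
)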
4.4.2 Recording the Request Chain
python
def record_request(self, trace_id: str, span_id: str, parent_span_id: Optional[str],
                   callback_name: str, url: str, status: str = "pending",
                   metadata: Optional[Dict] = None):
    """
    Record a request (written to both Redis and InfluxDB)
    Redis data structures:
    - trace:{trace_id}:total -> {callback_name: count}
    - trace:{trace_id}:status:{status} -> {callback_name: count}
    - trace:{trace_id}:children:{parent_span_id} -> {callback_name: count}
    - trace:{trace_id}:request:{span_id} -> {url, callback, parent_span_id, status, timestamp}
    """
    timestamp = time.time()
    # 1. Redis: real-time state (synchronous write)
    pipe = self.redis_client.pipeline()
    pipe.hincrby(f"trace:{trace_id}:total", callback_name, 1)
    pipe.hincrby(f"trace:{trace_id}:status:{status}", callback_name, 1)
    if parent_span_id:
        pipe.hincrby(f"trace:{trace_id}:children:{parent_span_id}", callback_name, 1)
    request_key = f"trace:{trace_id}:request:{span_id}"
    pipe.hset(request_key, mapping={
        'url': url,
        'callback': callback_name,
        'parent_span_id': parent_span_id or '',
        'status': status,
        'timestamp': timestamp
    })
    pipe.expire(request_key, 3600)  # expire after 1 hour
    pipe.execute()
    # 2. InfluxDB: historical data (asynchronous write)
    if self._write_api:
        self._write_queue.put({
            'type': 'request',
            'trace_id': trace_id,
            'span_id': span_id,
            'parent_span_id': parent_span_id or 'root',
            'callback_name': callback_name,
            'status': status,
            'url': url,
            'timestamp': timestamp,
            'metadata': metadata or {}
        })
4.4.3 Updating Request Status
python
def update_status(self, trace_id: str, span_id: str, callback_name: str,
                  old_status: str, new_status: str,
                  duration_ms: Optional[float] = None,
                  error_msg: Optional[str] = None):
    """
    Update a request's status (pending -> success/failed)
    Also records:
    - the status counter changes
    - the request duration
    - the error message, if any
    """
    pipe = self.redis_client.pipeline()
    pipe.hincrby(f"trace:{trace_id}:status:{old_status}", callback_name, -1)
    pipe.hincrby(f"trace:{trace_id}:status:{new_status}", callback_name, 1)
    request_key = f"trace:{trace_id}:request:{span_id}"
    pipe.hset(request_key, 'status', new_status)
    if duration_ms is not None:
        pipe.hset(request_key, 'duration_ms', duration_ms)
    if error_msg:
        pipe.hset(request_key, 'error', error_msg)
    pipe.execute()
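The excerpt above does not show how on_tree_completed gets triggered; one possible check, sketched purely from the Redis counters defined in record_request (an illustration under those assumptions, not the actual funspider implementation; the method name check_tree_completed is hypothetical):
python
def check_tree_completed(self, trace_id: str):
    """Fire on_tree_completed once no callback has a pending count above zero."""
    total = self.redis_client.hgetall(f"trace:{trace_id}:total")
    pending = self.redis_client.hgetall(f"trace:{trace_id}:status:pending")
    all_done = bool(total) and all(int(v) <= 0 for v in pending.values())
    if all_done and self.on_tree_completed:
        self.on_tree_completed(trace_id)
A natural call site would be at the end of update_status, right after the pending counter of the finished request has been decremented.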
4.4.4 Retrieving the Trace Tree Structure
python
def get_trace_tree(self, trace_id: str) -> Dict:
    """
    Return the complete request tree (for visualization)
    Returns:
        {
            "trace_id": "abc12345",
            "tree": {
                "span_id": "span001",
                "url": "http://example.com",
                "status": "success",
                "children": [
                    {
                        "span_id": "span002",
                        "url": "http://example.com/page1",
                        "status": "success",
                        "children": [...]
                    }
                ]
            }
        }
    """
    # 1. Fetch the details of every request
    request_keys = self.redis_client.keys(f"trace:{trace_id}:request:*")
    requests = {}
    for key in request_keys:
        span_id = key.decode().split(':')[-1]
        data = self.redis_client.hgetall(key)
        requests[span_id] = {k.decode(): v.decode() for k, v in data.items()}
    # 2. Build the tree structure
    root_nodes = []
    children_map = {}
    for span_id, req_data in requests.items():
        parent_span_id = req_data.get('parent_span_id', '')
        node = {'span_id': span_id, **req_data, 'children': []}
        if not parent_span_id:
            root_nodes.append(node)
        else:
            if parent_span_id not in children_map:
                children_map[parent_span_id] = []
            children_map[parent_span_id].append(node)
    # 3. Recursively attach child nodes
    def fill_children(node):
        span_id = node['span_id']
        if span_id in children_map:
            node['children'] = children_map[span_id]
            for child in node['children']:
                fill_children(child)
    for root in root_nodes:
        fill_children(root)
    return {'trace_id': trace_id, 'tree': root_nodes[0] if root_nodes else None}
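For quick inspection in a terminal, a small sketch that walks the returned structure (a plain helper, not part of TaskTraceManager):
python
def print_trace_tree(node: dict, indent: int = 0):
    """Recursively pretty-print the nested dict returned by get_trace_tree()."""
    if not node:
        return
    pad = '  ' * indent
    print(f"{pad}{node.get('span_id')} [{node.get('status')}] {node.get('url')}")
    for child in node.get('children', []):
        print_trace_tree(child, indent + 1)

result = task_manager.get_trace_tree('abc12345')
print_trace_tree(result['tree'])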
4.4.5 Asynchronous Batched Writes to InfluxDB
python
from queue import Empty               # Queue.get() raises Empty on timeout
from influxdb_client import Point     # Point type from the InfluxDB v2 Python client


def _background_writer(self):
    """
    Background thread that writes to InfluxDB in batches
    Strategy:
    - batch size: 100 points
    - flush interval: 2 seconds
    """
    points_buffer = []
    last_flush = time.time()
    while True:
        try:
            # Fetch data with a short timeout so flushing is not starved
            data = self._write_queue.get(timeout=0.1)
            point = self._build_point(data)
            points_buffer.append(point)
        except Empty:
            pass
        # Flush once either condition is met
        should_flush = (
            len(points_buffer) >= 100 or
            time.time() - last_flush >= 2.0
        )
        if should_flush and points_buffer:
            self._write_api.write(bucket=self._bucket, record=points_buffer)
            logger.debug(f"InfluxDB batch write: {len(points_buffer)} points")
            points_buffer.clear()
            last_flush = time.time()

def _build_point(self, data: Dict):
    """Build an InfluxDB Point"""
    point = (
        Point("spider_request")
        .tag("spider_name", self.spider_name)
        .tag("trace_id", data['trace_id'])
        .tag("span_id", data['span_id'])
        .tag("parent_span_id", data['parent_span_id'])
        .tag("callback", data['callback_name'])
        .tag("status", data['status'])
        .field("url", data['url'])
        .field("count", 1)
        .time(int(data['timestamp'] * 1e9))
    )
    # Attach metadata fields (e.g. duration_ms, error_msg)
    if data.get('metadata'):
        for k, v in data['metadata'].items():
            if v is not None:
                point.field(k, v)
    return point
5. The Complete Call Flow
5.1 Publishing the Initial Request
python
# User code
spider = MySpider()
spider.start_requests()

# Inside MySpider:
def start_requests(self):
    request = Request(
        url='http://example.com/list',
        callback=self.parse
    )
    # At this point the following are generated automatically:
    # trace_id='abc12345'      (newly generated)
    # span_id='span001'        (newly generated)
    # parent_span_id=None      (root node)
    yield request
5.2 The Worker Consumes the Initial Request
python
# funspider/core/worker.py
def _process_request_task(self, spider_instance, payload: Dict):
    # 1. Deserialize
    request = Request.from_dict(payload)
    # 2. Record the request start
    self.task_manager.record_request(
        trace_id=request.trace_id,              # 'abc12345'
        span_id=request.span_id,                # 'span001'
        parent_span_id=request.parent_span_id,  # None
        callback_name='parse',
        url=request.url,
        status="pending"
    )
    # 3. Download + parse
    response = spider_instance.download(request)
    results = spider_instance.parse(request, response)
    # 4. Handle the parse results (publish child requests)
    self._handle_parse_results(results, spider_instance, parent_request=request)
    # 5. Update the status
    self.task_manager.update_status(
        trace_id=request.trace_id,
        span_id=request.span_id,
        callback_name='parse',
        old_status="pending",
        new_status="success",
        duration_ms=100
    )
5.3 Publishing Child Requests
python
# funspider/core/worker.py
def _handle_parse_results(self, results, spider_instance, parent_request):
    for result in results:
        if isinstance(result, Request):
            # Inject the trace information automatically
            result.trace_id = parent_request.trace_id        # 'abc12345' (inherited)
            result.parent_span_id = parent_request.span_id   # 'span001' (parent node)
            result.span_id = uuid.uuid4().hex[:12]           # 'span002' (newly generated)
            # Build the task_id
            task_id = f"{result.parent_span_id}:{result.span_id}"  # 'span001:span002'
            # Publish to the queue
            self.engine.publish_request(result, task_id=task_id)

# Log output:
# 2025-12-14 13:13:08 - funspider - worker.py:315 - INFO - span001:span002 - publishing child request: ...
5.4 Consuming a Child Request
python
# The Worker consumes the span002 task
def _process_request_task(self, spider_instance, payload: Dict):
    request = Request.from_dict(payload)
    # At this point:
    # request.trace_id = 'abc12345'
    # request.span_id = 'span002'
    # request.parent_span_id = 'span001'
    # task_id in the funboost context = 'span001:span002'
    # The logs automatically show:
    # 2025-12-14 13:13:10 - funspider - worker.py:154 - DEBUG - span001:span002 - Processing request: ...
    # Record the request
    self.task_manager.record_request(
        trace_id='abc12345',
        span_id='span002',
        parent_span_id='span001',  # establishes the parent-child link
        callback_name='parse_detail',
        url=request.url,
        status="pending"
    )
    # ... continue processing
6. Usage Examples
6.1 Spider Code
python
from funspider import BaseSpider, Request, Item

class MySpider(BaseSpider):
    name = 'my_spider'

    def start_requests(self):
        # Initial request (trace_id and span_id are generated automatically)
        yield Request(
            url='http://example.com/list',
            callback=self.parse
        )

    def parse(self, request, response):
        """Parse the list page"""
        for item in response.xpath('//div[@class="item"]'):
            # Publish detail-page requests (trace_id is inherited, a new span_id is generated)
            yield Request(
                url=item.xpath('./a/@href').get(),
                callback=self.parse_detail
            )

    def parse_detail(self, request, response):
        """Parse the detail page"""
        yield Item(
            title=response.xpath('//h1/text()').get(),
            content=response.xpath('//div[@class="content"]/text()').get()
        )
6.2 Querying the Trace Tree
python
# Get the full call tree for a trace
tree = spider.task_manager.get_trace_tree('abc12345')
# Result:
{
    "trace_id": "abc12345",
    "tree": {
        "span_id": "span001",
        "url": "http://example.com/list",
        "callback": "parse",
        "status": "success",
        "duration_ms": 100,
        "children": [
            {
                "span_id": "span002",
                "url": "http://example.com/item/1",
                "callback": "parse_detail",
                "status": "success",
                "duration_ms": 80,
                "children": []
            },
            {
                "span_id": "span003",
                "url": "http://example.com/item/2",
                "callback": "parse_detail",
                "status": "success",
                "duration_ms": 75,
                "children": []
            }
        ]
    }
}
6.3 Querying Statistics
python
# Get the statistics for a trace
stats = spider.task_manager.get_trace_stats('abc12345')
# Result:
{
    'total': {'parse': 1, 'parse_detail': 2},
    'pending': {'parse': 0, 'parse_detail': 0},
    'success': {'parse': 1, 'parse_detail': 2},
    'failed': {'parse': 0, 'parse_detail': 0}
}
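get_trace_stats itself is not listed in Section 4.4; a minimal sketch of what it might look like, derived from the Redis key layout documented in record_request (an illustration under those assumptions, not the actual funspider implementation):
python
def get_trace_stats(self, trace_id: str) -> Dict:
    """Aggregate the per-callback counters from the trace:{trace_id}:total / :status:* hashes."""
    def read_hash(key: str) -> Dict:
        raw = self.redis_client.hgetall(key)  # bytes keys/values from redis-py
        return {k.decode(): int(v) for k, v in raw.items()}

    return {
        'total': read_hash(f"trace:{trace_id}:total"),
        'pending': read_hash(f"trace:{trace_id}:status:pending"),
        'success': read_hash(f"trace:{trace_id}:status:success"),
        'failed': read_hash(f"trace:{trace_id}:status:failed"),
    }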
6.4 Sample Log Output
plain
2025-12-14 13:13:08-(192.168.1.5,koohai)-[p28568_t30180] - funspider - "worker.py:154" - _process_request_task - INFO - :span001 - Processing request: http://example.com/list
2025-12-14 13:13:08-(192.168.1.5,koohai)-[p28568_t30180] - funspider - "worker.py:315" - _handle_parse_results - DEBUG - span001:span002 - publishing child request: http://example.com/item/1
2025-12-14 13:13:09-(192.168.1.5,koohai)-[p28568_t30181] - funspider - "worker.py:154" - _process_request_task - INFO - span001:span002 - Processing request: http://example.com/item/1
2025-12-14 13:13:09-(192.168.1.5,koohai)-[p28568_t30181] - funspider - "worker.py:205" - _process_request_task - INFO - span001:span002 - request succeeded, took 80ms
How to read this:
- :span001: a root request with no parent node
- span001:span002: a child request whose parent node is span001
- Different threads (t30180 vs t30181) handle different requests, but the task_id ties them together
As for the Redis statistics and the InfluxDB parts above, treat them as a reference only; they are still being polished.
For more articles, follow the WeChat official account (gzh): 零基础爬虫第一天
