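Abstract: This article takes a hands-on approach to implementing end-to-end distributed tracing across microservices with the OpenTelemetry Python SDK. It covers automatic and manual instrumentation, W3C context propagation, sampling configuration, and visualizing cross-service requests to locate performance bottlenecks. Code samples accompany each step; the material is aimed at intermediate to advanced Python developers.
1. Why distributed tracing?
In a microservice architecture, a single user request may pass through API gateway → user service → order service → inventory service → database. When latency or errors appear, the logs of any one service are rarely enough to locate the problem quickly.
Distributed tracing addresses these core problems:
| Problem scenario | What tracing provides |
|---|---|
| An endpoint is slow and it is unclear which service is the bottleneck | A span timeline that shows the time spent in each service |
| Requests fail intermittently and the logs are scattered with nothing to correlate them | A trace_id that ties the whole call chain together |
| A slow database query drags down the entire chain | The execution time of the specific SQL statement |
| A third-party API call times out | The full lifecycle of the outbound HTTP request |
2. OpenTelemetry core concepts at a glance
OpenTelemetry (OTel) is a CNCF observability project that provides vendor-neutral collection of telemetry data. The core concepts are: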
┌────────────────────────────────────────────────────────┐
│ Trace                                                  │
│                                                        │
│ Span A ──→ Span B ──→ Span C ──→ Span D                │
│ (entry)    (child)    (child)    (grandchild)          │
│                                                        │
│ Each span carries:                                     │
│ - trace_id: globally unique trace identifier           │
│ - span_id: unique identifier of this span              │
│ - parent_span_id: parent span ID (none for root)       │
│ - name: operation name                                 │
│ - start_time / end_time: start and end timestamps      │
│ - attributes: key-value pairs (e.g. http.method)       │
│ - events: timestamped log events                       │
│ - status: UNSET / OK / ERROR                           │
└────────────────────────────────────────────────────────┘
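The fields listed in the diagram are enough to rebuild a call tree from a flat list of spans. The following is a minimal sketch using only the standard library; `SpanRecord`, `build_tree`, and `render` are illustrative names and the sample data is invented, so treat it as a model of the idea rather than the SDK's own `Span` class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanRecord:
    """Simplified stand-in for an exported span (not the SDK's Span class)."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int
    attributes: Dict[str, str] = field(default_factory=dict)


def build_tree(spans: List[SpanRecord]) -> Dict[Optional[str], List[SpanRecord]]:
    """Group spans by parent_span_id; the root span is stored under key None."""
    children: Dict[Optional[str], List[SpanRecord]] = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)
    return children


def render(children, parent_id=None, depth=0) -> List[str]:
    """Return one indented line per span, walking the tree depth-first."""
    lines = []
    for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
        lines.append(f"{'  ' * depth}{span.name} ({span.end_ms - span.start_ms}ms)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Invented sample data: three spans that share one trace_id
spans = [
    SpanRecord("t1", "c", "b", "SELECT orders", 20, 90),
    SpanRecord("t1", "a", None, "GET /api/orders", 0, 120),
    SpanRecord("t1", "b", "a", "GET order-service", 5, 100),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Because every span stores only its own parent's ID, spans can arrive at the backend in any order and from any service and still be assembled into one tree.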
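The four main components:
- API (opentelemetry-api): defines the Tracer, Span, and related interfaces; contains no implementation
- SDK (opentelemetry-sdk): the official implementation of the API, including sampling, processing, and export
- Instrumentation (automatic/manual instrumentation libraries): hooks for frameworks such as Flask, FastAPI, and requests
- Exporter: sends data to backends such as Jaeger, Zipkin, or an OTLP Collector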
3. Environment setup and dependencies
3.1 Core dependencies
# Core packages
pip install opentelemetry-api
pip install opentelemetry-sdk
# OTLP exporters (recommended; supported by most backends)
pip install opentelemetry-exporter-otlp-proto-grpc
pip install opentelemetry-exporter-otlp-proto-http  # HTTP alternative
# The console exporter (for debugging) ships with the SDK; nothing extra to install
3.2 Instrumentation libraries (install as needed)
# Web frameworks
pip install opentelemetry-instrumentation-flask
pip install opentelemetry-instrumentation-fastapi
pip install opentelemetry-instrumentation-django
# ⚠️ FastAPI / Starlette instrumentation is built on the ASGI instrumentation.
# pip normally pulls it in as a dependency of the FastAPI package; installing it explicitly is harmless.
pip install opentelemetry-instrumentation-asgi
# HTTP clients
pip install opentelemetry-instrumentation-requests
pip install opentelemetry-instrumentation-httpx
pip install opentelemetry-instrumentation-aiohttp-client
# Databases
pip install opentelemetry-instrumentation-sqlalchemy
pip install opentelemetry-instrumentation-psycopg2
pip install opentelemetry-instrumentation-pymysql
pip install opentelemetry-instrumentation-redis
# Message queues
pip install opentelemetry-instrumentation-celery
pip install opentelemetry-instrumentation-kafka-python
3.3 Installing all matching instrumentation in one step
pip install opentelemetry-instrumentation
opentelemetry-bootstrap -a install
The command opentelemetry-bootstrap -a install scans the libraries already installed in the current Python environment (such as Flask and requests) and installs the matching instrumentation packages.
4. Automatic instrumentation: onboarding mainstream frameworks with no business-code changes
Automatic instrumentation is one of OpenTelemetry's core strengths: without modifying business code, it generates spans for a framework's inbound requests and outbound calls.
4.1 Automatic instrumentation for a Flask service
# app_gateway.py - API gateway service
from flask import Flask, jsonify
import requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# ========== Step 1: configure the TracerProvider ==========
resource = Resource.create({
    SERVICE_NAME: "api-gateway",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
# OTLP gRPC exporter → OpenTelemetry Collector
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# ========== Step 2: automatic instrumentation ==========
FlaskInstrumentor().instrument()     # creates a span for each Flask route
RequestsInstrumentor().instrument()  # traces outbound calls made with requests

# ========== Step 3: business code (left untouched) ==========
app = Flask(__name__)

@app.route("/api/orders/<order_id>")
def get_order(order_id):
    # Call the order service; requests.get is traced automatically
    resp = requests.get(f"http://order-service:5001/orders/{order_id}")
    return jsonify(resp.json())

@app.route("/health")
def health():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
4.2 Automatic instrumentation for a FastAPI service
# app_order.py - order service
from fastapi import FastAPI
import pymysql
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.pymysql import PyMySQLInstrumentor

# Configure the TracerProvider
resource = Resource.create({SERVICE_NAME: "order-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

# Automatic database instrumentation
PyMySQLInstrumentor().instrument()

# ⚠️ instrument_app() needs the app instance, so call it after the app is created
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # correct usage: pass the app instance

@app.get("/orders/{order_id}")
def get_order(order_id: int):
    conn = pymysql.connect(host="mysql", user="root", password="secret", db="orders")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
    row = cursor.fetchone()
    conn.close()
    return {"order_id": order_id, "data": row}
⚠️ Key point: FastAPI is built on Starlette (an ASGI framework), and its instrumentation relies on opentelemetry-instrumentation-asgi, whose middleware performs the context extraction. Make sure that package is present in the environment (pip normally installs it as a dependency). Prefer FastAPIInstrumentor.instrument_app(app) over the global instrument().
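4.3 Using the opentelemetry-instrument command line (the simplest option)
# No code changes: start the app under the auto-instrumentation launcher
opentelemetry-instrument \
    --service_name api-gateway \
    --exporter_otlp_endpoint http://otel-collector:4317 \
    python app_gateway.py
With this approach you do not need to write the TracerProvider setup code at all, which makes it well suited to quick onboarding. If the script already configures its own TracerProvider and instrumentors as in section 4.1, remove that setup when launching this way so that tracing is not configured twice.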
5. Manual instrumentation: fine-grained tracing of key business logic
Automatic instrumentation covers generic framework-level operations (HTTP requests, database calls); key steps inside your business logic need spans that you create by hand.
5.1 Basic manual instrumentation
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("order-service", "1.0.0")

# fetch_order, calculate_total, and check_stock are application helpers defined elsewhere
def process_order(order_id: int):
    # Create a span by hand to trace the business logic
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.type", "normal")
        # Step 1: validate the order
        with tracer.start_as_current_span("validate_order") as validate_span:
            order = fetch_order(order_id)
            if not order:
                # ✅ Correct usage: pass a Status object
                validate_span.set_status(
                    Status(StatusCode.ERROR, "Order not found")
                )
                return {"error": "Order not found"}
        # Step 2: calculate the price
        with tracer.start_as_current_span("calculate_price") as price_span:
            total = calculate_total(order)
            price_span.set_attribute("order.total", total)
        # Step 3: check inventory
        with tracer.start_as_current_span("check_inventory") as inv_span:
            available = check_stock(order["product_id"])
            inv_span.set_attribute("inventory.available", available)
        return {"order_id": order_id, "total": total, "available": available}
5.2 Recording span events (logs and exceptions)
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# call_payment_gateway and PaymentGatewayTimeout are application code defined elsewhere
def process_payment(order_id: int, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "CNY")
        try:
            # Record key business events
            span.add_event("payment_started", {
                "payment.order_id": order_id,
                "payment.gateway": "alipay",
            })
            result = call_payment_gateway(order_id, amount)
            span.add_event("payment_completed", {
                "payment.transaction_id": result["txn_id"],
            })
            span.set_status(Status(StatusCode.OK))
            return result
        except PaymentGatewayTimeout as e:
            # Record the failure
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)  # stores the exception type, message, and stack trace
            span.add_event("payment_timeout", {
                "error.type": type(e).__name__,
                "error.retryable": True,
            })
            raise
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e, attributes={
                "error.category": "unknown",
            })
            raise
5.3 Manual instrumentation in async code
import asyncio
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def fetch_user_data(user_id: int):
    with tracer.start_as_current_span("fetch_user_data") as span:
        span.set_attribute("user.id", user_id)
        # Query in parallel; each coroutine gets its own span
        name_task = asyncio.create_task(
            _fetch_with_span("get_user_name", user_id)
        )
        avatar_task = asyncio.create_task(
            _fetch_with_span("get_user_avatar", user_id)
        )
        name, avatar = await asyncio.gather(name_task, avatar_task)
        span.set_attribute("user.name", name)
        return {"name": name, "avatar": avatar}

async def _fetch_with_span(operation: str, user_id: int):
    with tracer.start_as_current_span(operation) as span:
        span.set_attribute("user.id", user_id)
        # Simulate async I/O
        await asyncio.sleep(0.1)
        return f"result_{operation}"
Note: start_as_current_span is a context manager, and the SDK keeps the current span in contextvars. Each asyncio task runs in a copy of the context it was created from, so child tasks see the parent span without any manual passing, and a span started inside one task does not leak into another. This holds as long as you do not call context.attach() by hand across task boundaries or mix in older threading.local-style state.
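The isolation described in the note comes from `contextvars` itself and can be observed without OpenTelemetry. The sketch below uses a plain `ContextVar` named `current_span` as a hypothetical stand-in for the SDK's current-span slot: each task created with `asyncio.create_task` inherits the parent's value, and what a task sets stays inside that task.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the SDK's "current span" slot
current_span = contextvars.ContextVar("current_span", default=None)


async def child(name: str) -> tuple:
    parent = current_span.get()     # inherited from the task's copied context
    token = current_span.set(name)  # visible only inside this task
    try:
        await asyncio.sleep(0.01)
        return name, parent
    finally:
        current_span.reset(token)


async def main() -> list:
    token = current_span.set("fetch_user_data")
    try:
        pairs = await asyncio.gather(
            asyncio.create_task(child("get_user_name")),
            asyncio.create_task(child("get_user_avatar")),
        )
        # The children's set() calls did not leak into the parent task
        assert current_span.get() == "fetch_user_data"
        return list(pairs)
    finally:
        current_span.reset(token)


results = asyncio.run(main())
print(results)
```

Both children report `fetch_user_data` as their parent even though they run concurrently, which is the behaviour the tracer relies on to attach child spans to the right parent.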
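6. Context propagation: carrying the trace_id across the whole call chain
Context propagation is the core mechanism of distributed tracing: it ensures that one request shares the same trace_id across services, so that all of its spans join into a single call chain.
6.1 How propagation works
┌──────────────┐      HTTP headers         ┌──────────────────────┐
│  Service A   │ ────────────────────────→ │ Service B            │
│              │                           │                      │
│ traceparent: │  injected into the        │ extracts the context │
│ 00-abc123..  │  request headers in       │ from the headers and │
│ -def456-01   │  W3C TraceContext format  │ creates a child span │
└──────────────┘                           └──────────────────────┘
traceparent format: {version}-{trace_id}-{parent_span_id}-{flags}
Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01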
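The traceparent layout can be made concrete with a small parser. This is a minimal standard-library sketch; `parse_traceparent`, `format_traceparent`, and the `TraceParent` tuple are illustrative names rather than OpenTelemetry APIs, and the parser accepts only the four-field version-00 layout shown in the example.

```python
import re
from typing import NamedTuple, Optional

# Four lowercase-hex fields: version (2), trace_id (32), parent span id (16), flags (2)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff is invalid, and all-zero IDs are not allowed
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp)
```

A downstream service keeps the trace_id, uses the incoming span ID as the parent of its own span, and writes its own span ID into the header it sends onward.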
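6.2 Configuring the propagator
from opentelemetry.propagate import set_global_textmap
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagators.composite import CompositePropagator

# Combine several propagation formats.
# TraceContext plus Baggage is also what is used when OTEL_PROPAGATORS is unset,
# so an explicit call matters mainly when you add or remove formats.
set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),  # W3C TraceContext (the recommended standard)
    W3CBaggagePropagator(),           # W3C Baggage (carries business metadata)
]))
⚠️ Important: the OpenTelemetry Python SDK does not inject or extract HTTP headers on its own. Instrumentation libraries such as FlaskInstrumentor and RequestsInstrumentor call the global propagator for you; if you send HTTP requests through an uninstrumented client, you must call inject() and extract() explicitly.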
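6.3 Manual injection and extraction (non-HTTP scenarios)
When a transport has no instrumentation handling propagation for you, for example a message-queue payload or a custom RPC protocol, you need to propagate the context by hand:
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

# kafka_producer is a producer client created elsewhere in the application

# ===== Producer side: inject the context into the message =====
def publish_order_event(order_id: int):
    carrier = {}
    # Write the current context into the carrier dict
    inject(carrier)
    message = {
        "order_id": order_id,
        "trace_context": carrier,  # travels with the message
    }
    kafka_producer.send("order-events", message)

# ===== Consumer side: extract the context from the message =====
def consume_order_event(message):
    # Rebuild the tracing context from the message
    ctx = extract(message["trace_context"])
    tracer = trace.get_tracer(__name__)
    # Start the span with the extracted context as its parent
    with tracer.start_as_current_span(
        "process_order_event",
        context=ctx,  # the key step: use the extracted context
    ) as span:
        span.set_attribute("order.id", message["order_id"])
        # Handle the business logic...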
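The example above carries the carrier inside the message body. Kafka also has record headers, and a common alternative is to put the tracing metadata there so the payload schema stays untouched. The sketch below assumes headers are represented as a list of `(str, bytes)` pairs, as in kafka-python; `carrier_to_headers` and `headers_to_carrier` are hypothetical helper names, and the carrier content is the sample traceparent from section 6.1.

```python
from typing import Dict, List, Optional, Tuple

Headers = List[Tuple[str, bytes]]


def carrier_to_headers(carrier: Dict[str, str]) -> Headers:
    """Convert an injected carrier dict into (key, bytes) header pairs."""
    return [(key, value.encode("utf-8")) for key, value in carrier.items()]


def headers_to_carrier(headers: Optional[Headers]) -> Dict[str, str]:
    """Convert received header pairs back into a dict suitable for extract()."""
    return {key: value.decode("utf-8") for key, value in (headers or [])}


# In real code the carrier would be filled by inject(carrier)
carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
headers = carrier_to_headers(carrier)
restored = headers_to_carrier(headers)
print(headers)

# Assumed usage with a kafka-python style producer (not executed here):
# kafka_producer.send("order-events", value=payload, headers=headers)
# ctx = extract(headers_to_carrier(record.headers))
```

Either placement works as long as the producer and consumer agree on it; headers have the advantage that consumers which ignore tracing never see the extra field.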
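6.4 Baggage: carrying business metadata
Baggage lets you pass custom business data (such as a user ID or tenant ID) along the entire call chain:
from opentelemetry import baggage, context

# Set baggage (usually in the entry service). set_baggage returns a new Context;
# it takes effect only once that context is attached as the current one.
ctx = baggage.set_baggage("user.tier", "vip", context.get_current())
ctx = baggage.set_baggage("tenant.id", "company_abc", ctx)
token = context.attach(ctx)
try:
    # Read baggage from the current context. Downstream services read it the same
    # way once the propagator has extracted it from the incoming request.
    user_tier = baggage.get_baggage("user.tier")
    tenant_id = baggage.get_baggage("tenant.id")
    print(f"Processing request for {user_tier} user in tenant: {tenant_id}")
finally:
    context.detach(token)
⚠️ Note: baggage travels in the headers of every outgoing request, so keep the values small. The W3C Baggage specification only requires implementations to propagate up to 8192 bytes in total, so anything larger may be dropped along the way.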
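To see how quickly baggage grows on the wire, it helps to look at the encoded header. The following sketch builds a `baggage` header value by hand (comma-separated `key=value` pairs with percent-encoded values) and checks it against an 8192-byte budget; `encode_baggage` and `fits_limit` are illustrative helpers, not OpenTelemetry functions, and the real propagator may differ in encoding details.

```python
from urllib.parse import quote

MAX_BAGGAGE_BYTES = 8192  # total size budget assumed from the W3C Baggage specification


def encode_baggage(entries: dict) -> str:
    """Encode entries as a baggage header value: key=value pairs joined by commas."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )


def fits_limit(entries: dict, limit: int = MAX_BAGGAGE_BYTES) -> bool:
    """Return True if the encoded header stays within the byte budget."""
    return len(encode_baggage(entries).encode("utf-8")) <= limit


small = {"user.tier": "vip", "tenant.id": "company_abc"}
header_value = encode_baggage(small)
print(header_value, fits_limit(small))
```

A check like this is cheap enough to run where baggage is set, so oversized values are caught at the entry service instead of disappearing somewhere downstream.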
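7. Sampling configuration: balancing data volume against observability
In production, sampling 100% of traces produces a very large volume of data. A sensible sampling strategy substantially reduces storage and transfer costs.
7.1 Built-in samplers
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_ON,          # sample everything (development / test environments)
    ALWAYS_OFF,         # sample nothing
    TraceIdRatioBased,  # sample a fixed ratio of traces
    ParentBased,        # follow the parent span's decision (recommended for production)
)

# Option 1: sample everything (for development and debugging)
provider = TracerProvider(sampler=ALWAYS_ON)
# Option 2: sample by trace ID ratio (10%)
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
# Option 3: ParentBased with ratio sampling (recommended for production)
# Logic: if there is a parent span, follow its decision; a root span is sampled by ratio
provider = TracerProvider(
    sampler=ParentBased(root=TraceIdRatioBased(0.1))
)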
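Ratio sampling is deterministic: the decision is a pure function of the trace ID, so every service that applies the same ratio to the same trace reaches the same answer. The sketch below illustrates the general technique of comparing the low 64 bits of the trace ID against a threshold; `ratio_bound` and `is_sampled` are illustrative names, and the exact bits a given SDK version inspects are an assumption here.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID


def ratio_bound(ratio: float) -> int:
    """Threshold below which a (masked) trace ID counts as sampled."""
    return round(ratio * (TRACE_ID_LIMIT + 1))


def is_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: same trace_id and ratio always give the same result."""
    return (trace_id & TRACE_ID_LIMIT) < ratio_bound(ratio)


# Simulate 100,000 random 128-bit trace IDs with a fixed seed
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(is_sampled(t, 0.1) for t in trace_ids)
print(f"sampled {kept} of {len(trace_ids)} traces")
```

A useful side effect of the threshold approach is that a trace kept at a lower ratio is also kept at any higher ratio, so services with different rates still agree on the most restrictive subset.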
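7.2 ParentBased sampler in detail
ParentBased is the sampler most often recommended for production. Its decision logic is:
Request to create a span
│
├─ Remote parent span, sampled?      → sample ✓
├─ Remote parent span, not sampled?  → do not sample ✗
├─ Local parent span, sampled?       → sample ✓
├─ Local parent span, not sampled?   → do not sample ✗
└─ No parent span (root span)?       → defer to the root sampler
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ParentBased, TraceIdRatioBased, ALWAYS_ON, ALWAYS_OFF,
)

# Configure each ParentBased branch explicitly.
# Every branch must be a sampler object; passing None would fail when that branch is hit.
sampler = ParentBased(
    root=TraceIdRatioBased(0.05),          # root span: 5% sampling rate
    remote_parent_sampled=ALWAYS_ON,       # remote parent sampled: always follow
    remote_parent_not_sampled=ALWAYS_OFF,  # remote parent not sampled: do not sample (the default)
    local_parent_sampled=ALWAYS_ON,        # local parent sampled: always follow
    local_parent_not_sampled=ALWAYS_OFF,   # local parent not sampled: do not sample (the default)
)
provider = TracerProvider(sampler=sampler)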
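The five-branch decision can be written out as a plain function, which makes it easy to see that a parent-based sampler only selects which delegate to ask. This is a simplified model with hypothetical names (`ParentInfo`, `parent_based`, and the `always_on`/`always_off` callables), not the SDK implementation.

```python
from typing import Callable, NamedTuple, Optional

SamplerFn = Callable[[int], bool]  # takes a trace_id, returns True to sample


class ParentInfo(NamedTuple):
    sampled: bool  # did the parent carry the sampled flag?
    remote: bool   # did the parent arrive from another process?


def always_on(trace_id: int) -> bool:
    return True


def always_off(trace_id: int) -> bool:
    return False


def parent_based(
    parent: Optional[ParentInfo],
    trace_id: int,
    root: SamplerFn,
    remote_parent_sampled: SamplerFn = always_on,
    remote_parent_not_sampled: SamplerFn = always_off,
    local_parent_sampled: SamplerFn = always_on,
    local_parent_not_sampled: SamplerFn = always_off,
) -> bool:
    """Pick the delegate that matches the parent's state and ask it."""
    if parent is None:
        return root(trace_id)  # root span: the root sampler decides
    if parent.remote:
        delegate = remote_parent_sampled if parent.sampled else remote_parent_not_sampled
    else:
        delegate = local_parent_sampled if parent.sampled else local_parent_not_sampled
    return delegate(trace_id)


decisions = {
    "root": parent_based(None, 1, root=always_off),
    "remote_sampled": parent_based(ParentInfo(True, True), 1, root=always_off),
    "remote_not_sampled": parent_based(ParentInfo(False, True), 1, root=always_on),
    "local_sampled": parent_based(ParentInfo(True, False), 1, root=always_off),
    "local_not_sampled": parent_based(ParentInfo(False, False), 1, root=always_on),
}
print(decisions)
```

With the default delegates, only the root sampler's ratio affects how many traces are kept; every later span inherits the decision made at the entry point.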
7.3 Custom sampler: business rules
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    Sampler, SamplingResult, Decision, TraceIdRatioBased,
)
from opentelemetry.context import Context
from opentelemetry.trace import SpanKind
from typing import Optional, Sequence

class BusinessRuleSampler(Sampler):
    """
    Custom sampler:
    - requests from VIP users are always sampled
    - health-check endpoints are never sampled
    - everything else is sampled by ratio
    """
    def __init__(self, default_ratio: float = 0.1):
        self._default_ratio = default_ratio
        self._ratio_sampler = TraceIdRatioBased(default_ratio)

    def should_sample(
        self,
        parent_context: Optional[Context],
        trace_id: int,
        name: str,
        kind: Optional[SpanKind] = None,
        attributes: Optional[dict] = None,
        links: Optional[Sequence] = None,
        trace_state=None,
    ) -> SamplingResult:
        # Only attributes passed when the span is started are visible here
        attrs = attributes or {}
        # Never sample health checks. Span names may be "/health" or "GET /health"
        # depending on the instrumentation version, so match on the last token.
        path = name.split(" ")[-1]
        if path.startswith(("/health", "/metrics")):
            return SamplingResult(Decision.DROP)
        # Always sample VIP users
        if attrs.get("user.tier") == "vip":
            return SamplingResult(
                Decision.RECORD_AND_SAMPLE,
                attributes=attrs,
            )
        # Everything else goes through ratio sampling
        return self._ratio_sampler.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self) -> str:
        return f"BusinessRuleSampler(VIP=100%, health=0%, default={self._default_ratio:.0%})"

# Use the custom sampler
provider = TracerProvider(sampler=BusinessRuleSampler(default_ratio=0.1))
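7.4 Configuring sampling through environment variables
# Set at startup. The SDK reads these when the code does not pass an explicit
# sampler or endpoint; values passed in code take precedence over them.
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.05"  # sampling ratio
export OTEL_SERVICE_NAME="my-service"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"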
8. SpanProcessor and data export
8.1 The two SpanProcessors
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,   # asynchronous batch export (recommended for production)
    SimpleSpanProcessor,  # synchronous, one span at a time (for debugging)
    ConsoleSpanExporter,  # prints to the console
)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Production: batch export for throughput
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317",
    insecure=True,
    timeout=10,
)
provider.add_span_processor(
    BatchSpanProcessor(
        otlp_exporter,
        max_queue_size=2048,          # maximum number of queued spans
        schedule_delay_millis=5000,   # interval between batch exports (ms)
        max_export_batch_size=512,    # maximum spans per batch
        export_timeout_millis=30000,  # export timeout (ms)
    )
)
# Development: synchronous export to the console
provider.add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
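The interaction between `max_queue_size` and `max_export_batch_size` is easier to reason about with a toy model. The class below is a single-threaded sketch under stated assumptions: `TinyBatcher` is a hypothetical name, the real processor exports from a background thread on a timer (`schedule_delay_millis`), and the drop-when-full behaviour is modelled here in its simplest form.

```python
from collections import deque
from typing import Callable, List


class TinyBatcher:
    """Toy model of a bounded queue that is drained in fixed-size batches."""

    def __init__(self, export: Callable[[List[str]], None],
                 max_queue_size: int = 2048, max_export_batch_size: int = 512):
        self._export = export
        self._queue = deque()
        self._max_queue_size = max_queue_size
        self._max_export_batch_size = max_export_batch_size
        self.dropped = 0

    def on_end(self, span: str) -> None:
        """Called when a span finishes; drops the span if the queue is full."""
        if len(self._queue) >= self._max_queue_size:
            self.dropped += 1
            return
        self._queue.append(span)

    def flush(self) -> None:
        """Drain the queue, handing at most max_export_batch_size spans per call."""
        while self._queue:
            size = min(len(self._queue), self._max_export_batch_size)
            self._export([self._queue.popleft() for _ in range(size)])


batches: List[List[str]] = []
batcher = TinyBatcher(batches.append, max_queue_size=5, max_export_batch_size=2)
for i in range(7):
    batcher.on_end(f"span-{i}")  # the queue holds 5, so the last 2 are dropped
batcher.flush()
print([len(b) for b in batches], "dropped:", batcher.dropped)
```

The model shows why the queue size has to cover the number of spans produced between two exports: once the queue is full, finished spans are lost rather than blocking the request path.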
8.2 Exporting to several backends at once
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Export to Jaeger (over OTLP)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://jaeger:4317", insecure=True)
    )
)
# Export to Alibaba Cloud SLS or another backend
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://your-sls-endpoint:443",
            headers={"x-sls-token": "your-token"},
        )
    )
)
# Keep printing to the console for local debugging
provider.add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
Note: Jaeger has supported OTLP natively since v1.35 and the old JaegerExporter is deprecated, so use OTLPSpanExporter pointed at Jaeger's port 4317 throughout.
9. Visualizing cross-service traces and locating performance bottlenecks
9.1 Example microservice architecture
User request → API Gateway (Flask:5000)
    │
    ├─→ Order Service (FastAPI:5001)
    │       │
    │       ├─→ MySQL query
    │       └─→ Redis cache
    │
    └─→ Inventory Service (Flask:5002)
            │
            └─→ MySQL query
9.2 Complete code: the services working together
API Gateway (entry service):
# gateway.py
from flask import Flask, jsonify
import requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

def setup_tracing():
    resource = Resource.create({SERVICE_NAME: "api-gateway"})
    provider = TracerProvider(
        resource=resource,
        sampler=ParentBased(root=TraceIdRatioBased(1.0)),  # sample everything in development
    )
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
        )
    )
    trace.set_tracer_provider(provider)

setup_tracing()
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
app = Flask(__name__)
tracer = trace.get_tracer("gateway")

@app.route("/api/orders/<order_id>")
def get_order_detail(order_id):
    with tracer.start_as_current_span("gateway_aggregate") as span:
        span.set_attribute("request.order_id", order_id)
        # Call the downstream services (requests injects the traceparent header automatically)
        order_resp = requests.get(
            f"http://localhost:5001/orders/{order_id}",
            timeout=5,
        )
        inventory_resp = requests.get(
            f"http://localhost:5002/inventory/{order_id}",
            timeout=5,
        )
        result = {
            "order": order_resp.json(),
            "inventory": inventory_resp.json(),
        }
        span.set_attribute("response.status", "success")
        return jsonify(result)

if __name__ == "__main__":
    app.run(port=5000)
Order Service (middle service):
# order_service.py
import json

from fastapi import FastAPI
import redis
import pymysql
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.pymysql import PyMySQLInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

resource = Resource.create({SERVICE_NAME: "order-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
# Automatic instrumentation for the database and cache
PyMySQLInstrumentor().instrument()
RedisInstrumentor().instrument()
app = FastAPI()
# ✅ instrument_app must be called after the app is created
FastAPIInstrumentor.instrument_app(app)
tracer = trace.get_tracer("order-service")

@app.get("/orders/{order_id}")
def get_order(order_id: int):
    with tracer.start_as_current_span("fetch_order_logic") as span:
        span.set_attribute("order.id", order_id)
        # Check the cache first
        with tracer.start_as_current_span("redis_cache_lookup"):
            r = redis.Redis(host="localhost", port=6379, db=0)
            cached = r.get(f"order:{order_id}")
        if cached:
            span.set_attribute("cache.hit", True)
            return json.loads(cached)
        # Cache miss: query the database
        with tracer.start_as_current_span("mysql_query") as db_span:
            conn = pymysql.connect(
                host="localhost", user="root",
                password="secret", db="orders"
            )
            cursor = conn.cursor()
            cursor.execute(
                "SELECT id, product_id, quantity, total FROM orders WHERE id = %s",
                (order_id,)
            )
            row = cursor.fetchone()
            conn.close()
            # Set the attribute while db_span is still open
            if not row:
                db_span.set_attribute("db.result", "not_found")
                return {"error": "Not found"}
        result = {
            "id": row[0], "product_id": row[1],
            "quantity": row[2], "total": float(row[3]),
        }
        # Write the result back to the cache
        with tracer.start_as_current_span("redis_cache_write"):
            r.setex(f"order:{order_id}", 300, json.dumps(result))
        span.set_attribute("cache.hit", False)
        return result

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5001)
9.3 OpenTelemetry Collector configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  # Jaeger (over OTLP, for trace visualization)
  otlp/jaeger:
    endpoint: http://jaeger:4317
    tls:
      insecure: true
  # Debug output
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, debug]
9.4 Starting the whole stack with Docker Compose
# docker-compose.yml
version: "3.9"
services:
  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
  # Jaeger (trace visualization, native OTLP support)
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # Jaeger UI
  # Redis
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  # MySQL
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: secret
      MYSQL_DATABASE: orders
    ports:
      - "3306:3306"
  # Business services
  gateway:
    build: ./gateway
    ports:
      - "5000:5000"
    environment:
      - OTEL_SERVICE_NAME=api-gateway
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    depends_on:
      - otel-collector
      - order-service
  order-service:
    build: ./order-service
    ports:
      - "5001:5001"
    environment:
      - OTEL_SERVICE_NAME=order-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    depends_on:
      - otel-collector
      - redis
      - mysql
9.5 Analyzing performance bottlenecks in Jaeger
After startup, open http://localhost:16686 (the Jaeger UI). From there you can:
- Select a service: choose api-gateway in the Service dropdown
- View the full trace: click any trace to see the complete timeline from Gateway → Order Service → Redis/MySQL
- Locate the bottleneck:
Trace: GET /api/orders/42
Total duration: 320ms
│
├─ gateway_aggregate              320ms ████████████████████████████
│  ├─ HTTP GET order-service      280ms ███████████████████████
│  │  └─ fetch_order_logic        275ms ██████████████████████
│  │     ├─ redis_cache_lookup      2ms █
│  │     ├─ mysql_query           250ms ████████████████████  ← bottleneck
│  │     └─ redis_cache_write       3ms █
│  │
│  └─ HTTP GET inventory           35ms ███
│     └─ mysql_query               20ms ██
The waterfall makes it clear at a glance that the 250ms MySQL query is the bottleneck; the fix is to optimize the SQL or add an index.
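What the eye does with the waterfall can also be done programmatically by computing each span's self time, that is, its own duration minus the time spent in its direct children. The sketch below uses the durations from the waterfall; the span IDs are invented for illustration, and the subtraction assumes that sibling spans run one after another rather than in parallel.

```python
from typing import Dict, List, Optional, Tuple

# (span_id, parent_id, name, duration_ms); durations copied from the waterfall,
# span IDs invented for this example
SPANS: List[Tuple[str, Optional[str], str, int]] = [
    ("s1", None, "gateway_aggregate", 320),
    ("s2", "s1", "HTTP GET order-service", 280),
    ("s3", "s2", "fetch_order_logic", 275),
    ("s4", "s3", "redis_cache_lookup", 2),
    ("s5", "s3", "mysql_query", 250),
    ("s6", "s3", "redis_cache_write", 3),
    ("s7", "s1", "HTTP GET inventory", 35),
    ("s8", "s7", "mysql_query", 20),
]


def self_times(spans) -> Dict[str, Tuple[str, int]]:
    """Self time = own duration minus the summed duration of direct children."""
    child_total: Dict[Optional[str], int] = {}
    for _span_id, parent_id, _name, duration in spans:
        child_total[parent_id] = child_total.get(parent_id, 0) + duration
    return {
        span_id: (name, duration - child_total.get(span_id, 0))
        for span_id, _parent_id, name, duration in spans
    }


# Rank spans by self time, largest first
ranked = sorted(self_times(SPANS).items(), key=lambda item: item[1][1], reverse=True)
bottleneck_id, (bottleneck_name, bottleneck_ms) = ranked[0]
print(bottleneck_id, bottleneck_name, bottleneck_ms)
```

Ranking by self time rather than total duration avoids blaming `gateway_aggregate`, which is the longest span only because it contains everything else.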
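10. Production best practices and common pitfalls
10.1 Graceful shutdown: making sure no spans are lost
import atexit
from opentelemetry import trace

provider = trace.get_tracer_provider()

# The SDK TracerProvider registers its own atexit shutdown by default
# (shutdown_on_exit=True); an explicit hook like this one is mainly useful
# when you want a bounded force_flush or control over the ordering.
def shutdown():
    """Flush and export all pending spans before the process exits."""
    if hasattr(provider, "force_flush"):
        provider.force_flush(timeout_millis=5000)
    if hasattr(provider, "shutdown"):
        provider.shutdown()

atexit.register(shutdown)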
10.2 Correlating traces with logs
import logging
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# Configure the log provider
logger_provider = LoggerProvider()
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(
        OTLPLogExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
# Attach the OTel handler to Python logging
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
logging.getLogger().addHandler(handler)
# Log records exported through this handler now carry the trace_id and span_id of the active span
logger = logging.getLogger(__name__)
logger.info("Processing order", extra={"order_id": 42})
Log format configuration (to show the trace_id in the log text). The otelTraceID and otelSpanID fields used below are added to log records by LoggingInstrumentor from the opentelemetry-instrumentation-logging package, not by LoggingHandler, so enable that instrumentation before using this formatter:
formatter = logging.Formatter(
    "%(asctime)s [%(levelname)s] "
    "trace_id=%(otelTraceID)s span_id=%(otelSpanID)s "
    "%(name)s - %(message)s"
)
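A format string that references `otelTraceID` fails on any record that lacks the field, for example records emitted before instrumentation is enabled. One defensive option is a small `logging.Filter` that fills in defaults. This is a standard-library sketch; `TraceFieldDefaults` is a hypothetical name, the placeholder value `"0"` is an arbitrary choice, and the demo supplies the trace fields by hand through `extra` where real code would rely on the logging instrumentation.

```python
import io
import logging


class TraceFieldDefaults(logging.Filter):
    """Ensure otelTraceID / otelSpanID exist so the format string never fails."""

    def filter(self, record: logging.LogRecord) -> bool:
        for field_name in ("otelTraceID", "otelSpanID"):
            if not hasattr(record, field_name):
                setattr(record, field_name, "0")  # placeholder when no span is active
        return True


stream = io.StringIO()
demo_handler = logging.StreamHandler(stream)
demo_handler.setFormatter(
    logging.Formatter("trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s")
)
demo_handler.addFilter(TraceFieldDefaults())

demo_logger = logging.getLogger("trace_defaults_demo")
demo_logger.propagate = False
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(demo_handler)

demo_logger.info("no active span")
demo_logger.info("inside span", extra={"otelTraceID": "abc", "otelSpanID": "def"})
output = stream.getvalue().splitlines()
print(output)
```

Attaching the filter to the handler rather than to a single logger means it also covers records that propagate up from third-party libraries.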
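10.3 Common pitfalls
| Problem | Cause | Solution |
|---|---|---|
| trace_id breaks between services | Context is not propagated correctly | Make sure RequestsInstrumentor injects the headers automatically, or call inject()/extract() by hand |
| FastAPI tracing has no effect | ASGI instrumentation is missing or the app was not instrumented | pip install opentelemetry-instrumentation-asgi, and call FastAPIInstrumentor.instrument_app(app) |
| Span parent-child relationships are wrong in async code | The context was manipulated by hand | Avoid unpaired context.attach/detach calls in async functions; start_as_current_span is enough |
| Spans are lost and never exported | The process exited before flushing | Add an atexit hook that calls provider.shutdown() |
| gRPC export fails with UNAVAILABLE | The Collector is not running or the port is wrong | Check the Collector's status and OTEL_EXPORTER_OTLP_ENDPOINT |
| Environment variable settings have no effect | Values hard-coded in the application override the environment variables | Remove the hard-coded sampler/endpoint arguments so the SDK falls back to the environment variables |
| Broken chains downstream after sampling | Inconsistent sampling strategies across services | Use the ParentBased sampler everywhere so a trace is kept or dropped as a whole |
| trace_id does not appear in Flask/FastAPI log output | Log records lack the otelTraceID/otelSpanID fields | Enable LoggingInstrumentor (opentelemetry-instrumentation-logging); use opentelemetry.sdk._logs.LoggingHandler to export logs over OTLP |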
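10.4 Assessing the performance impact
# Suggested micro-benchmark.
# Configure an SDK TracerProvider first (see section 4.1); without one, get_tracer
# returns a no-op tracer and the loop measures almost nothing.
import time
from opentelemetry import trace

tracer = trace.get_tracer("benchmark")
start = time.perf_counter()
for _ in range(10000):
    with tracer.start_as_current_span("benchmark"):
        pass
elapsed = time.perf_counter() - start
print(f"10000 spans: {elapsed:.3f}s, avg: {elapsed/10000*1000:.3f}ms/span")
# Typical result reported here: under 0.01ms per span, negligible next to business latency.
# Treat the figure as indicative and measure with your own hardware and exporter setup.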
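11. Summary
This article covered the core topics of end-to-end tracing with OpenTelemetry in Python:
| Module | Key points |
|---|---|
| Automatic instrumentation | Instrumentation libraries trace Flask/FastAPI/requests/databases with no business-code changes; FastAPI uses instrument_app(app) |
| Manual instrumentation | Use start_as_current_span for fine-grained tracing of business logic and Status(StatusCode.ERROR, msg) to record failures |
| Context propagation | W3C TraceContext is injected and extracted automatically; Baggage carries business metadata |
| Sampling | ParentBased is recommended for production; custom samplers filter by business rules |
| Data export | BatchSpanProcessor with OTLP, supporting Jaeger, Zipkin, SLS, and other backends |
| Visualization and analysis | The Jaeger UI waterfall locates performance bottlenecks; LoggingHandler links logs to the trace_id |
Recommended stack:
OpenTelemetry SDK → OpenTelemetry Collector → Jaeger (traces) + Prometheus (metrics) + Loki (logs)
💡 Tip: in development, use ALWAYS_ON sampling with ConsoleSpanExporter for debugging; in production, use ParentBased(TraceIdRatioBased(0.1)) with BatchSpanProcessor and OTLP export.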