
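Preface
In a microservice architecture, a single user request can fan out into calls across a dozen or more services.
When a user reports that "the request failed", the operations team has to answer:
- Which service caused the problem?
- Is it a network problem or a code problem?
- Why does it work in the test environment but fail in production?
- Can the problem be reproduced?
Traditional monitoring alone can no longer answer these questions.
Observability has become a core capability of modern operations.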
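1. What Is Observability?
1.1 Definition of Observability
Observability: the ability to infer a system's internal state from its external outputs alone.
An unobservable system:
- Something fails internally, but nothing outside reveals it
- Problems can only be found by guessing or by taking the system apart
An observable system:
- Internal problems can be inferred from externally available data
- Logs, metrics, and traces act as the system's "black box" flight recorder
1.2 The Three Pillars of Observability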
[Diagram: "Observability" linked to its three pillars: Logs (discrete events, "What happened?"), Metrics (aggregated data, "Where is the problem?"), and Traces (request path, "Why did it go wrong?"). Edge labels: "provides context", "provides context", "quantitative analysis".]
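How the three pillars relate:
| Pillar | Question answered | Data characteristics | Typical tools |
|---|---|---|---|
| Logs | What happened? | Discrete events carrying the most detailed context | ELK, Loki, Filebeat |
| Metrics | Where is the problem? | Aggregated, timely data used for quantitative analysis and alerting | Prometheus, Grafana |
| Traces | Why did it go wrong? | End-to-end, per-request correlation that reveals the distributed call topology | Jaeger, OpenTelemetry |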
2. Logs: Records of Discrete Events
2.1 What a Log Is
Log = a timestamped, structured record of an event
json
// Structured log example
{
  "timestamp": "2026-06-04T14:30:00.123Z",
  "level": "ERROR",
  "service": "order-service",
  "instance": "10.0.1.5",
  "trace_id": "abc123def456",
  "message": "Failed to process order",
  "error": {
    "code": "PAYMENT_TIMEOUT",
    "detail": "Payment gateway response timeout after 30s"
  },
  "context": {
    "order_id": "ORD20260604001",
    "user_id": "USR12345",
    "amount": 299.00
  }
}
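To make the structure above concrete, here is a minimal sketch that emits one JSON object per log line using only Python's standard `logging` and `json` modules. The names `JsonFormatter`, `build_logger`, and `CONTEXT_KEYS` are illustrative assumptions, not part of any library; a real service would more likely rely on an existing JSON logging package.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical list of context keys this sketch copies from `extra=`.
    CONTEXT_KEYS = ("trace_id", "order_id", "user_id", "amount")

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        context = {
            key: getattr(record, key)
            for key in self.CONTEXT_KEYS
            if hasattr(record, key)
        }
        if context:
            event["context"] = context
        return json.dumps(event, ensure_ascii=False)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("order-service")
log.error(
    "Failed to process order",
    extra={"order_id": "ORD20260604001", "trace_id": "abc123def456"},
)
```

Keeping every event on one line as valid JSON lets collectors such as Filebeat or Fluentd forward it without custom parsing rules.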
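2.2 Log Levels
| Level | Meaning | Use in production |
|---|---|---|
| TRACE | The most fine-grained tracing of system behavior | Keep disabled in production; reserve for development-time investigation of extreme cases such as deadlocks |
| DEBUG | Intermediate variables and debugging details of key flows | Offline test environments; enable briefly in production only when needed |
| INFO | Key milestones on the normal business path | Default production level (e.g. order created, state transition succeeded) |
| WARN | Potential anomalies that the system tolerates but that deserve attention | Keep in production (e.g. a business retry was triggered, a response time is approaching its threshold) |
| ERROR | A function is impaired and the current request fails, but the service as a whole keeps running | Primary alert source in production (e.g. database connection lost, third-party API error) |
| FATAL | A catastrophic error that forces the process to exit | Fatal errors such as a service failing to start or unrecoverable memory failure |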
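python
# Python logging example
import logging

logger = logging.getLogger(__name__)

# `order`, `process_payment`, `PaymentError`, `a`, and `b` are assumed to be
# defined by the surrounding application code.

# ✅ Good practice: log key business milestones with context
logger.info("Order created", extra={
    "order_id": "ORD001",
    "user_id": "USR123",
    "amount": 299.00,
})

# ✅ Good practice: log exceptions with logger.exception
try:
    process_payment(order)
except PaymentError as e:
    # logger.exception records the full stack trace automatically,
    # so the exception does not need to be converted to a string by hand
    logger.exception("Payment processing failed", extra={
        "order_id": order.id,
        "error_code": e.code,
    })

# ❌ Bad practice: too much logging
logger.debug("Entering function x")  # not needed in production
logger.debug("Variable a = %s", a)
logger.debug("Variable b = %s", b)

# ❌ Bad practice: too little logging
logger.error("Error occurred")  # no context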
2.3 Log Collection Architecture
[Diagram: log collection pipeline. Layers: application services (Service A, Service B, Service C), log collection layer (Filebeat, Fluentd), message queue (Kafka, Redis), storage layer (Elasticsearch), visualization layer (Kibana).]
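2.4 Log Query Example (Elasticsearch Query DSL in Kibana)
json
// Query error logs (term queries on keyword fields match the exact value without analysis)
GET /logs-2026.06.04/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "level": "ERROR" }},
        { "range": { "@timestamp": { "gte": "now-1h" }}}
      ],
      "should": [
        { "term": { "service": "order-service" }},
        { "term": { "service": "payment-service" }}
      ],
      "minimum_should_match": 1
    }
  },
  "sort": [
    { "@timestamp": "desc" }
  ],
  "size": 100
}
Optimization note: for keyword fields such as level and service, use term queries rather than match queries. A match query analyzes its input before searching, which is meant for full-text (text) fields, whereas a term query looks up the exact value. Because this bool query already contains must clauses, "minimum_should_match": 1 is required for the should clauses to restrict results to the two services; without it they only influence scoring.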
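When the same search is issued from code, the request body can be assembled as a plain dictionary. The sketch below is one possible variant: it places every condition in the bool query's filter clause (clauses must match but do not contribute to scoring) and expresses "either service" with a single terms query. The helper name `error_log_query` is hypothetical, and the field names simply follow the example above.

```python
import json


def error_log_query(services, since="now-1h", size=100):
    """Build an Elasticsearch request body for recent ERROR logs."""
    return {
        "query": {
            "bool": {
                # Filter context: exact matching, no relevance scoring.
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"terms": {"service": list(services)}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
        "size": size,
    }


body = error_log_query(["order-service", "payment-service"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into Kibana Dev Tools after a `GET <index>/_search` line, or passed to whichever Elasticsearch client the project already uses.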
3. Metrics: Aggregated Data
3.1 What a Metric Is
Metric = a labeled time-series data point
promql
# Prometheus metric example
# Metric name: http_request_duration_seconds
# Labels: service, method, status (plus le on the histogram's _bucket series)
# Example observed value: 0.523 (seconds)
# P99 latency (bucket data must be aggregated with sum by (le, ...))
histogram_quantile(0.99, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
# Requests per second
sum by (job) (rate(http_requests_total[5m]))
# Error ratio (aggregate with sum by (job) to drop the status label before dividing)
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
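The arithmetic behind the error-ratio query can be traced by hand. The sketch below uses made-up counter samples (all values are hypothetical) and a deliberately simplified `rate` that ignores counter resets and the extrapolation Prometheus performs. It shows why both sides are summed by job first: if the status label were kept, each 5xx series would only be matched against the series with the identical label set, that is, itself.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # corresponds to the [5m] range in the queries above

# Hypothetical counter samples:
# (job, instance, status) -> (value at window start, value at window end)
samples = {
    ("api", "10.0.1.5", "200"): (10_000, 12_850),
    ("api", "10.0.1.5", "500"): (40, 190),
    ("api", "10.0.1.6", "200"): (8_000, 10_970),
    ("api", "10.0.1.6", "503"): (10, 40),
}


def rate(start, end, window=WINDOW_SECONDS):
    """Per-second increase of a counter (no reset handling in this sketch)."""
    return (end - start) / window


def sum_rate_by_job(series, status_prefix=""):
    """Rough equivalent of: sum by (job) (rate(metric{status=~"<prefix>.*"}[5m]))."""
    totals = defaultdict(float)
    for (job, _instance, status), (start, end) in series.items():
        if status.startswith(status_prefix):
            totals[job] += rate(start, end)
    return dict(totals)


errors = sum_rate_by_job(samples, status_prefix="5")
total = sum_rate_by_job(samples)
error_ratio = {job: errors.get(job, 0.0) / total[job] for job in total}

for job, ratio in error_ratio.items():
    print(f"{job}: {total[job]:.1f} req/s, error ratio {ratio:.2%}")
```

With these samples the job handles about 20 requests per second, of which about 0.6 per second fail, giving an error ratio of roughly 3%.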
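3.2 The Four Metric Types
1. Counter: only ever increases
python
from prometheus_client import Counter

# Use cases: total requests, total orders, total errors
requests_total = Counter('http_requests_total',
                         'Total HTTP requests',
                         ['method', 'endpoint'])
requests_total.labels(method='GET', endpoint='/api/orders').inc()
2. Gauge: can go up and down
python
from prometheus_client import Gauge

# Use cases: CPU utilization, memory usage, number of online users
cpu_usage = Gauge('cpu_usage_percent', 'CPU usage percent')
cpu_usage.set(45.6)  # set to 45.6%
cpu_usage.dec(10)    # decrease by 10
cpu_usage.inc(5)     # increase by 5
3. Histogram: distribution statistics
python
import time

from prometheus_client import Histogram

# Use cases: request latency, response size
request_duration = Histogram('http_request_duration_seconds',
                             'HTTP request duration',
                             buckets=[0.01, 0.05, 0.1, 0.5, 1, 5])
# Time the request handling; process_request() is assumed to be defined by the application
start = time.time()
process_request()
request_duration.observe(time.time() - start)
4. Summary: quantiles (percentiles)
python
from prometheus_client import Summary

# Use case: per-instance latency statistics
request_summary = Summary('http_request_summary_seconds',
                          'HTTP request summary',
                          ['method', 'endpoint'])
# Note: the Python client exposes only _count and _sum for a Summary and does not
# compute quantiles; clients such as Go and Java can export preconfigured
# quantiles (e.g. 0.5, 0.9, 0.99)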
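⚠️ Summary vs. Histogram: key differences
| Dimension | Histogram | Summary |
|---|---|---|
| Aggregation | Exposes bucket counts, which the Prometheus server can aggregate further | Quantiles are computed locally in the client and cannot be aggregated across instances |
| Multi-instance use | ✅ sum by (le) aggregates across the whole cluster | ❌ Only per-instance quantiles are available |
| Flexibility | Bucket boundaries are fixed at instrumentation time, but any quantile can be estimated at query time | Quantiles are chosen ahead of time in the client configuration |
| Recommended for | First choice in production (needed for multi-instance microservices) | Single-instance services scraped directly by Prometheus |
This is the fundamental reason Histogram is usually the first choice in production.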
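The aggregation row in the table is easier to see with numbers. The following sketch re-implements the linear-interpolation idea behind `histogram_quantile` in plain Python (edge cases are simplified, and the bucket counts for the two instances are invented for illustration). It compares the quantile computed from merged buckets with the average of two per-instance quantiles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets [(upper_bound, count), ...]."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def merge_buckets(*instances):
    """Equivalent of sum by (le): add up counts that share an upper bound."""
    merged = {}
    for buckets in instances:
        for bound, count in buckets:
            merged[bound] = merged.get(bound, 0) + count
    return sorted(merged.items())


BOUNDS = [0.1, 0.5, 1.0, 5.0, math.inf]
# Hypothetical data: instance A is busy and fast, instance B is quiet and slow.
instance_a = list(zip(BOUNDS, [9900, 10000, 10000, 10000, 10000]))
instance_b = list(zip(BOUNDS, [10, 50, 80, 100, 100]))

cluster_p99 = histogram_quantile(0.99, merge_buckets(instance_a, instance_b))
average_of_p99 = (
    histogram_quantile(0.99, instance_a) + histogram_quantile(0.99, instance_b)
) / 2

print(f"P99 from merged buckets:      {cluster_p99:.3f} s")
print(f"Average of per-instance P99s: {average_of_p99:.3f} s")
```

With this data the merged buckets give a P99 of about 0.35 s, while averaging the two per-instance values gives about 2.45 s. Precomputed quantiles carry no information about how much traffic each instance served, which is why they cannot be combined after the fact.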
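3.3 Golden Metrics (USE Method + RED Method)
USE method (resource-oriented: servers, databases, and so on):
| Metric | Description | Typical Prometheus expression |
|---|---|---|
| Utilization | Percentage of time the resource is busy | 100 - (avg(rate(node_cpu...mode="idle"))*100) |
| Saturation | How overloaded the resource is (e.g. queue length) | node_load1 (1-minute system load average) |
| Errors | Number of error events at the resource level | Counters for device driver errors and storage read/write checksum failures |
RED method (service-oriented: microservice endpoints and so on):
| Metric | Description | Prometheus expression |
|---|---|---|
| Rate | Requests received per second | sum(rate(http_requests_total[5m])) |
| Errors | Failed requests per second | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Duration | Time spent handling requests | histogram_quantile(0.99, sum by (le) (rate(..._bucket[5m]))) |
⚠️ Note: sum by (le) is essential here. Histogram bucket data must first be aggregated by the le label before a quantile can be computed correctly. To get one quantile per service across multiple instances, also keep labels such as job in the by clause.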
3.4 Prometheus Query Examples
promql
# Service availability (based on the up metric)
sum(up{job="api-gateway"}) / count(up{job="api-gateway"})
# P50 latency (aggregate with sum by (le, job))
histogram_quantile(0.50,
  sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
# P99 latency (aggregate with sum by (le, job))
histogram_quantile(0.99,
  sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate over the past 5 minutes, in percent
# (aggregate with sum by (job) to drop the status label before dividing)
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
* 100
# Number of requests over the past hour
sum by (job) (increase(http_requests_total[1h]))
4. Traces: The End-to-End Path of a Request
4.1 What a Trace Is
Trace = a record of the complete journey of one request through a distributed system
[Diagram: Trace ID abc123. Nodes: user request, gateway, auth service, order service, payment service, inventory service; one edge is labeled "timeout 30s". Spans: gateway (start time, end time, status); order service (start time, end time, status); payment service (start time, end time, error information).]
4.2 Trace Structure
json
// Simplified Jaeger-style trace structure (illustrative values; Jaeger's own JSON reports startTime and duration in microseconds)
{
  "traceId": "abc123def456",
  "spans": [
    {
      "spanId": "span1",
      "operationName": "HTTP GET /api/orders",
      "serviceName": "api-gateway",
      "startTime": 1717500000000,
      "duration": 120, // milliseconds in this simplified example
      "tags": {
        "http.status_code": 200
      }
    },
    {
      "spanId": "span2",
      "operationName": "CreateOrder",
      "serviceName": "order-service",
      "startTime": 1717500001000,
      "duration": 85,
      "references": [
        {
          "refType": "CHILD_OF",
          "traceId": "abc123def456",
          "spanId": "span1"
        }
      ]
    },
    {
      "spanId": "span3",
      "operationName": "ProcessPayment",
      "serviceName": "payment-service",
      "startTime": 1717500001500,
      "duration": 320,
      "tags": {
        "error": true,
        "error.kind": "Timeout"
      }
    }
  ]
}
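A trace backend turns such a flat span list back into a call tree by following the parent references. The sketch below shows the idea with a simplified `Span` dataclass (a hypothetical type, not Jaeger's data model). It assumes that the payment span is a child of the order span, a link the example above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    service: str
    operation: str
    duration_ms: int
    parent_id: Optional[str] = None
    error: bool = False


def render_tree(spans):
    """Return the call tree as indented text lines, roots first."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            flag = " [ERROR]" if span.error else ""
            lines.append(
                f"{'  ' * depth}{span.service}: {span.operation} "
                f"({span.duration_ms} ms){flag}"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def failing_path(spans):
    """Service names from the root down to the first span flagged as an error."""
    by_id = {span.span_id: span for span in spans}
    current = next((span for span in spans if span.error), None)
    path = []
    while current is not None:
        path.append(current.service)
        current = by_id.get(current.parent_id)
    return list(reversed(path))


spans = [
    Span("span1", "api-gateway", "HTTP GET /api/orders", 120),
    Span("span2", "order-service", "CreateOrder", 85, parent_id="span1"),
    # Assumed parent link: payment is called from the order service.
    Span("span3", "payment-service", "ProcessPayment", 320,
         parent_id="span2", error=True),
]

print("\n".join(render_tree(spans)))
print(" -> ".join(failing_path(spans)))
```

The second helper walks from the failing span back to the root, which is essentially what an engineer does visually in a trace UI when looking for the call chain that led to an error.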
4.3 OpenTelemetry Architecture
[Diagram: OpenTelemetry architecture. Layers: application services with automatic instrumentation (Java Service, Go Service, Python Service); OTel Collector (receives traces/metrics/logs; preprocesses, filters, and aggregates; forwards to backends); backend storage (Jaeger for traces, Prometheus for metrics); visualization layer (Jaeger UI, Grafana).]
4.4 Tracing Code Example
Go + OpenTelemetry (exporting over OTLP):
go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// 1. Initialize the OTLP gRPC exporter (sends directly to an OTel Collector or a recent Jaeger)
	exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		panic(err)
	}

	// 2. Create the TracerProvider and register the W3C Trace Context propagator;
	//    without a registered propagator, otelhttp has nothing to inject into headers
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
	)
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.TraceContext{})

	// 3. Create the entry span
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "CreateOrder")
	defer span.End()
	span.SetAttributes(
		attribute.String("order.id", "ORD001"),
		attribute.Float64("order.amount", 299.00),
	)

	// 4. Cross-service call: otelhttp injects the trace context into the HTTP headers sent downstream
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://payment-service/pay", nil)
	if err != nil {
		span.RecordError(err)
		return
	}
	resp, err := client.Do(req)
	if err != nil {
		span.RecordError(err)
		return
	}
	defer resp.Body.Close()
}
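⚠️ Key technical notes:
- OTLP replaces the Jaeger exporter: the cloud-native ecosystem has standardized on the OTLP protocol, and the OpenTelemetry SDKs have deprecated their Jaeger-specific exporters because Jaeger accepts OTLP natively.
- otelhttp injects the trace context automatically: an in-process function call cannot carry a trace across the network. Wrapping the HTTP client with otelhttp.NewTransport() injects the trace context into the HTTP headers sent to downstream services, provided a TextMapPropagator such as propagation.TraceContext{} has been registered with otel.SetTextMapPropagator.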
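What actually travels in those HTTP headers is, under the W3C Trace Context format, a `traceparent` value of the form `version-traceid-parentid-flags`. The sketch below generates and parses that header with the Python standard library to show what a propagator does on each hop. It handles only the version `00` layout, and the function names are illustrative rather than taken from any SDK.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_headers(incoming_headers):
    """Headers for an outgoing call: same trace ID, fresh span ID."""
    context = parse_traceparent(incoming_headers.get("traceparent", ""))
    if context is None:
        return {"traceparent": new_traceparent()}
    flags = "01" if context["sampled"] else "00"
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{context['trace_id']}-{span_id}-{flags}"}


incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
print(child_headers(incoming)["traceparent"])
```

Every service on the path keeps the trace ID and replaces the span ID with its own, which is what lets the backend stitch the spans of all services into one trace.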
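5. How the Three Pillars Relate and Complement Each Other
5.1 A Typical Troubleshooting Scenario
Scenario: a user reports that an order payment failed
1. Metrics reveal the anomaly
A Prometheus alert fires on the RED-method error-ratio expression:
sum by(job)(rate(http_requests_total{status=~"5.."}[5m]))
/ sum by(job)(rate(http_requests_total[5m])) > 0.05
→ The error ratio of payment-service has jumped above 5%
2. Logs narrow down the problem
Kibana: query the error logs of payment-service
→ Found "Payment gateway timeout after 30s"
3. Traces pinpoint the root cause
Jaeger: inspect the traces of the payment service
→ The payment-service → gateway call timed out
→ The trace shows a gateway response time of 30000 ms
Conclusion: the payment gateway timed out, which caused the payment to fail
Tying it together: the RED-method PromQL expression used here is exactly the "aggregate the labels first, then divide" form emphasized in Section 3, now applied to a real incident.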
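Moving from step 2 to step 3 is only quick if each log line carries the trace ID of the request that produced it. One common way to arrange this in Python is a logging filter that reads the current trace ID from a context variable. The sketch below uses the standard library only; `current_trace_id`, `TraceIdFilter`, `configure`, and `handle_request` are illustrative names, and in a real service the ID would come from the tracing SDK rather than being set by hand.

```python
import contextvars
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def configure(stream=None):
    """Create a logger whose lines include trace_id=<id>."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "%(levelname)s trace_id=%(trace_id)s %(name)s: %(message)s"
        )
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("payment-service")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, trace_id):
    """Simulate one request: set the trace ID, log, then restore the previous value."""
    token = current_trace_id.set(trace_id)
    try:
        logger.error("Payment gateway timeout after 30s")
    finally:
        current_trace_id.reset(token)


handle_request(configure(), "abc123def456")
```

With the trace ID in the log line, the ID found in Kibana can be pasted straight into the Jaeger search box, and vice versa.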
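5.2 In-Depth Comparison of the Three Pillars
| Dimension | Logs | Metrics | Traces |
|---|---|---|---|
| Data volume | Very high (grows linearly with the number of events) | Very low (independent of request volume; depends on the number of time series) | High (grows with the depth of distributed calls) |
| Query speed | Slower (depends on building and searching full-text indexes) | Very fast (time-series database, responses within seconds) | Moderate (depends on the trace ID index) |
| Best suited for | Unknown problems (brute-force keyword search) | Known problems (dashboards and real-time alerting) | Performance bottlenecks (locating slow or blocked calls across a distributed system) |
| Retention | Short (costly; typically 7 to 30 days) | Long (cheap; months or even years) | Short (typically 7 days, combined with sampling) |
| System overhead | Moderate (I/O-intensive) | Very low (in-memory counters) | High (requires end-to-end instrumentation and context propagation) |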
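5.3 Recommended Modern Observability Stacks
| Type | Stack | Use case and characteristics |
|---|---|---|
| Classic all-in-one | ELK + Prometheus + Jaeger | The most powerful option, with the most mature ecosystem in each area, but operating three separate systems is very costly. |
| Lightweight cloud-native | Loki + Prometheus + Tempo + Grafana | Close to Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir), with Prometheus in place of Mimir; a shared label scheme in Grafana makes it easy to jump between metrics, logs, and traces. |
| Enterprise all-in-one | Datadog / Dynatrace | Commercial SaaS platforms: ready to use out of the box with a polished experience, but expensive. |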
6. A Unified Observability Dashboard in Grafana
6.1 Panel Configuration Example
json
{
  "panels": [
    {
      "title": "Service health",
      "type": "stat",
      "targets": [
        {
          "expr": "up{job=~\"api-gateway|order-service|payment-service\"}",
          "legendFormat": "{{job}}"
        }
      ]
    },
    {
      "title": "Request latency P50/P95/P99",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "P50 {{job}}"
        },
        {
          "expr": "histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "P95 {{job}}"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "P99 {{job}}"
        }
      ]
    },
    {
      "title": "Error rate trend",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum by (job) (rate(http_requests_total{status=~\"5..\"}[5m])) / sum by (job) (rate(http_requests_total[5m])) * 100",
          "legendFormat": "Error rate % {{job}}"
        }
      ]
    }
  ]
}
6.2 Alert Rule Configuration
yaml
# Alert rules (Prometheus rule-file syntax)
groups:
  - name: service-health
    interval: 30s
    rules:
      # Service-down alert
      - alert: ServiceDown
        expr: up{job="api-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API gateway is down"
          dashboard: "/d/service-health"
      # High-latency alert (aggregate with sum by (le, job))
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service latency is too high"
          runbook: "https://wiki.example.com/runbook/high-latency"
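The `for:` field in these rules means the condition has to hold continuously for that long before the alert fires; until then the alert is only pending. The sketch below models that behavior as a small state machine over evaluation samples. It is a simplified illustration of the semantics rather than Prometheus code, and the latency samples are invented.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each evaluation.

    samples: iterable of (timestamp_seconds, value) in time order.
    state is "inactive", "pending", or "firing".
    """
    active_since = None
    states = []
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held_for = timestamp - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            active_since = None  # any healthy evaluation resets the timer
            state = "inactive"
        states.append((timestamp, state))
    return states


# Hypothetical P99 latency (seconds), evaluated every 30 s:
# a short spike at t=30, then a sustained breach from t=90 onward.
samples = [(0, 1.2), (30, 2.5), (60, 1.1)]
samples += [(t, 2.6) for t in range(90, 450, 30)]

for timestamp, state in alert_states(samples, threshold=2, for_seconds=300):
    print(f"t={timestamp:>3}s  {state}")
```

The short spike at t=30 never gets past pending, while the sustained breach starts firing once it has lasted the full five minutes. This is the trade-off behind choosing `for: 1m` for an outage and `for: 5m` for latency: a longer hold time filters out brief blips at the cost of a later notification.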
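Summary
The three pillars of observability compared:
| Pillar | Answers | Tools | Key points |
|---|---|---|---|
| Logs | What happened? | ELK, Loki | Structured, rich in context |
| Metrics | Where is the problem? | Prometheus, InfluxDB | Aggregated, alertable |
| Traces | Why did it go wrong? | Jaeger, Zipkin | Request path, root-cause analysis |
Recommendations for building it out:
1. Getting started: get metrics monitoring right first (Prometheus + Grafana)
- Low cost, quick results
- Covers the bulk of monitoring needs (roughly 80% as a rule of thumb)
2. Next step: introduce a logging system (ELK/Loki)
- Investigate unknown problems
- Meet audit and compliance requirements
3. Advanced stage: deploy distributed tracing (Jaeger)
- Essential for microservice architectures
- Locates problems in distributed systems
4. End state: the three pillars integrated (Grafana LGTM / Datadog)
- A unified view
- Correlated analysis
- Fast problem localization
How does your company build its observability stack, and what have you learned in practice? Feel free to share in the comments.
If you found this helpful, likes and bookmarks are welcome!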