A First Look at End-to-End Tracing with OpenTelemetry: Instrumentation and Jaeger

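Preface

One day a developer from a business team came over for help.

  • Developer: My service is returning 504s, but I can't tell which step is failing. Each request touches 4 microservices, 2 databases, 1 Redis and 1 message queue...
  • Long-suffering ops engineer: Stop, stop, say no more. We don't have distributed tracing yet, so all I can do is check the services with you by hand, one at a time.
  • First I asked him to walk me through the business logic and the access pattern, and 10 minutes went by. Then we went through each service and the resources it depends on, layer by layer, and about half an hour later we finally located the fault...

That is far too slow. So a tracing project went onto the agenda with a single goal: locate problems quickly and improve stability. OpenTelemetry is the current industry focus for tracing, so this ops engineer decided to dig into it.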

Environment

Component            Version
Operating system     Ubuntu 22.04.4 LTS
opentelemetry-sdk    1.35.0

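Installation

First, a brief look at how OpenTelemetry collects data; then we get something running before going into the details.

  • OpenTelemetry collects data through instrumentation points embedded in the code; this is the job of opentelemetry-sdk
  • The data is then sent over a standard protocol to a backend that displays it; here that is Jaeger UI

Install the OpenTelemetry SDK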

pip3 install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-api

Install Jaeger UI to display the data

docker pull docker.m.daocloud.io/jaegertracing/all-in-one:latest

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.m.daocloud.io/jaegertracing/all-in-one:latest

Once the container is running, open http://127.0.0.1:16686
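
Before sending any data, it can help to confirm that the published ports are actually reachable. The following is a minimal sketch using only the standard library; `port_is_open` and `check_jaeger` are hypothetical helper names, and the host and ports are simply the ones from the docker run command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, and so on
        return False


def check_jaeger(host: str = "127.0.0.1") -> dict:
    """Probe the three ports published by the all-in-one container."""
    # 16686 = Jaeger UI, 4317 = OTLP over gRPC, 4318 = OTLP over HTTP
    return {port: port_is_open(host, port) for port in (16686, 4317, 4318)}
```

Calling `check_jaeger()` returns a dict that maps each port to True or False. An open port only shows that something is listening; it does not prove that the listener is Jaeger.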

First example

Web service

Start with a web service. Here we implement it with tornado, installed with: pip3 install tornado

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)

def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Check that the service responds normally, e.g. with curl http://localhost:10000

Add instrumentation

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter


trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s1"}))
)
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        views()
        self.finish('hello world')

def views():
    span = tracer.start_span("s1-span")
    span.end()

def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)

def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Run curl http://localhost:10000 again, then open Jaeger UI to check

The data is there: the span we just added has been reported to Jaeger UI

Span attributes

Enrich the span with some attributes

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    span.end()
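
OpenTelemetry attribute values are limited to strings, booleans, integers, floats, and homogeneous sequences of those types, so values such as dicts or None need converting before they are passed to set_attribute. A minimal sketch of one way to do that; `to_attribute_value` is a hypothetical helper and not part of the SDK, and falling back to str() is just one possible policy.

```python
from typing import Any

# Primitive types that are valid as span attribute values
_PRIMITIVES = (str, bool, int, float)


def to_attribute_value(value: Any):
    """Coerce an arbitrary Python value into an attribute-friendly one."""
    if isinstance(value, _PRIMITIVES):
        return value
    if isinstance(value, (list, tuple)):
        kinds = {type(item) for item in value}
        # Keep the sequence as-is only if every item has the same primitive type
        if len(kinds) == 1 and kinds <= set(_PRIMITIVES):
            return list(value)
        # Mixed or non-primitive items: fall back to a list of strings
        return [str(item) for item in value]
    # Anything else (dict, None, custom objects) becomes its string form
    return str(value)
```

With this helper, a call in views() could look like span.set_attribute("tags", to_attribute_value(some_value)), where some_value stands for whatever business data is being recorded.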

Add database access tracing

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    span.end()

def get_db(parent_ctx):
    span = tracer.start_span("s1-span-db", context=parent_ctx)
    span.end()

Add cross-service tracing

Add a second web service: s2.py

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator



trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s2"}))
)
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        ctx = TraceContextTextMapPropagator().extract(self.request.headers)
        span = tracer.start_span("s2-span", context=ctx)
        span.end()
        self.finish('hello world')

def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)

def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(20000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Modify s1.py

from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import requests

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get("http://localhost:20000", headers=headers)
    span.end()
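
For reference, what TraceContextTextMapPropagator writes into headers is the W3C Trace Context `traceparent` header, of the form version-trace_id-parent_id-flags, for example 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The sketch below formats and parses that header with the standard library only, to show what travels between s1 and s2. It is not the SDK's implementation; `format_traceparent`, `parse_traceparent` and `TraceParent` are hypothetical names, and only the four-field version 00 layout is handled.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def format_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render integer IDs as a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent value; return None if it is malformed or invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version ff and all-zero IDs are invalid per the W3C specification
    if parsed.version == "ff":
        return None
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed
```

On the s2 side, the trace_id parsed from the incoming header is the same one s1 generated, which is why both services' spans end up in a single trace in Jaeger UI.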

Moving to k8s

jaeger

Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jaeger
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - image: docker.m.daocloud.io/jaegertracing/all-in-one:latest
        imagePullPolicy: Always
        name: jaeger
      dnsPolicy: ClusterFirst
      restartPolicy: Always

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: jaeger-service
  name: jaeger-service
  namespace: default
spec:
  ports:
  - name: port-4317
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: port-4318
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: port-16686
    port: 16686
    protocol: TCP
    targetPort: 16686
  selector:
    app: jaeger
  type: NodePort

s2

1) Build the image

Because Jaeger is reached through a Service (svc) inside the k8s cluster, s2.py needs a small change

s2.py

...
import os

JAEGER_ADDR=os.environ.get('JAEGER_ADDR')
...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...
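
If JAEGER_ADDR is not set, os.environ.get returns None. One possible way to make that case explicit is to fall back to a local default. This is a minimal sketch: `resolve_endpoint` and `DEFAULT_ENDPOINT` are hypothetical names, and the default is simply the local address used earlier in this article.

```python
import os
from typing import Mapping, Optional

# Assumed default: the local OTLP/HTTP endpoint used in the first example
DEFAULT_ENDPOINT = "http://localhost:4318/v1/traces"


def resolve_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the trace endpoint from JAEGER_ADDR, or the local default."""
    env = os.environ if env is None else env
    # Treat a missing, empty or whitespace-only value as "not configured"
    value = (env.get("JAEGER_ADDR") or "").strip()
    return value or DEFAULT_ENDPOINT
```

The result could then be passed as OTLPSpanExporter(endpoint=resolve_endpoint()), so the same s2.py runs both locally and inside the cluster.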

Dockerfile

FROM python:3.8

WORKDIR /opt
RUN pip3 install tornado opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s2.py /opt
CMD python3 s2.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s2
  name: s2
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s2
  template:
    metadata:
      labels:
        app: s2
    spec:
      containers:
      - env:
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s2:v1
        imagePullPolicy: Always
        name: s2
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: s2-service
  name: s2-service
  namespace: default
spec:
  ports:
  - name: s2-port
    port: 20000
    protocol: TCP
    targetPort: 20000
  selector:
    app: s2
  type: NodePort

s1

1) Build the image

Because s2 and Jaeger are reached through Services (svc) inside the k8s cluster, s1.py needs a small change

s1.py

...
import os

S2_ADDR=os.environ.get('S2_ADDR')
JAEGER_ADDR=os.environ.get('JAEGER_ADDR')

...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get(S2_ADDR, headers=headers)
    span.end()

...

Dockerfile:

FROM python:3.8

WORKDIR /opt
RUN pip3 install tornado requests opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s1.py /opt
CMD python3 s1.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s1
  name: s1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s1
  template:
    metadata:
      labels:
        app: s1
    spec:
      containers:
      - env:
        - name: S2_ADDR
          value: http://s2-service:20000
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s1:v1
        imagePullPolicy: Always
        name: s1
      dnsPolicy: ClusterFirst
      restartPolicy: Always

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: s1-service
  name: s1-service
  namespace: default
spec:
  ports:
  - name: s1-port
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    app: s1
  type: NodePort

Check the result

▶ kubectl get pod -owide
NAME                            READY   STATUS    RESTARTS         AGE     IP             NODE       NOMINATED NODE   READINESS GATES
jaeger-6669cd7c4-4pl5j          1/1     Running   0                7m31s   10.244.0.236   minikube   <none>           <none>
s1-5c569c5b4b-lctzq             1/1     Running   0                73s     10.244.0.237   minikube   <none>           <none>
s2-5bb648dcdf-mlnbj             1/1     Running   0                61s     10.244.0.238   minikube   <none>           <none>

▶ kubectl get svc
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
jaeger-service   NodePort    10.106.13.217    <none>        4317:31891/TCP,4318:31997/TCP,16686:31002/TCP   5m49s
s1-service       NodePort    10.102.25.195    <none>        10000:32376/TCP                                 4m23s
s2-service       NodePort    10.103.114.198   <none>        20000:30032/TCP                                 3m40s

Send a test request:

  • Access the s1 service

    ▶ curl http://192.168.49.2:32376
    hello world%
  • View the trace data in Jaeger UI at: http://192.168.49.2:31002/

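Summary

In this first example we mainly collected trace records for the business services, that is, the path a complete request travels, including database reads, cross-service requests and so on

Throughout the tracing process, trace_id and span_id play the decisive role: the former uniquely identifies the request chain and ties all of its steps together, while the latter identifies each individual operation on that chain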
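
To make the roles of the two IDs concrete, here is a small sketch that rebuilds the call tree of one trace from flat span records, roughly the way a trace viewer groups spans. The record layout, the short IDs and `render_trace` are illustrative assumptions; the span names are the ones used in this article.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

# (trace_id, span_id, parent_span_id, name); parent_span_id is None for a root
SpanRecord = Tuple[str, str, Optional[str], str]

SPANS: List[SpanRecord] = [
    ("t1", "a1", None, "s1-span"),
    ("t1", "b1", "a1", "s1-span-db"),
    ("t1", "c1", "a1", "s2-span"),      # emitted by another service, same trace
    ("t2", "d1", None, "other-request"),
]


def render_trace(spans: List[SpanRecord], trace_id: str) -> List[str]:
    """Return the spans of one trace as indented lines, parents before children."""
    children = defaultdict(list)
    for t_id, span_id, parent_id, name in spans:
        if t_id == trace_id:            # trace_id selects the request
            children[parent_id].append((span_id, name))

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # span_id / parent_span_id define the position inside the request
        for span_id, name in children.get(parent_id, []):
            lines.append("  " * depth + name)
            walk(span_id, depth + 1)

    walk(None, 0)
    return lines
```

For the sample data, render_trace(SPANS, "t1") yields s1-span at the top with s1-span-db and s2-span indented beneath it, while the unrelated t2 request is left out.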

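  • Collect: instrument the code to capture the flows that matter most, such as database read/write latency and downstream service latency
  • Process: opentelemetry-sdk processes the data: filtering, buffering, merging into batches
  • Export: the processed data is sent to a backend such as Jaeger over a standard protocol (OTLP, carried over gRPC or HTTP)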
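
The three stages above can be sketched as a toy pipeline. This is a deliberately simplified illustration and not how BatchSpanProcessor is implemented (the real processor also involves a background worker and timed flushes); `ToyBatchProcessor`, the dict-based span records and the "sampled" key are assumptions made for this example.

```python
from typing import Callable, Dict, List

SpanRecord = Dict[str, object]


class ToyBatchProcessor:
    """Buffer finished spans and hand them to an exporter in batches."""

    def __init__(self, export: Callable[[List[SpanRecord]], None],
                 max_batch_size: int = 3) -> None:
        self._export = export
        self._max_batch_size = max_batch_size
        self._buffer: List[SpanRecord] = []

    def on_end(self, span: SpanRecord) -> None:
        # Filter: drop spans that were not sampled
        if not span.get("sampled", True):
            return
        # Buffer: keep the span until the batch is full
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        # Export: send everything buffered so far as one batch
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)
```

Here the export callable stands in for the exporter; in the article's setup that role is played by OTLPSpanExporter sending to Jaeger.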

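Contact

  • Get in touch for a deeper discussion

That concludes this article.

My knowledge is limited; if I have been sloppy or left gaps anywhere, please do not hesitate to correct me...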