Logs missing the traceId when using ScheduledExecutorService: the problem and a fix
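1. Problem discovery
The problem surfaced during a task run in production. The project uses delayed tasks on a ScheduledExecutorService as its asynchronous execution scheme. One task did not return the correct result while running on a pool thread, yet searching the logs by traceId turned up no exception. Only after printing the full logs locally could I reproduce the failure and find both the cause and the exception. It turned out that the exception logged in afterExecute carried no traceId, which is why a search by traceId in production found nothing.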
2. Basic configuration
The project's log configuration is:
xml
<!-- Log format -->
<encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
    <pattern>%d{yyyy-MM-dd HH:mm:ss:SSS} [%X{X-B3-TraceId:-}] [%thread] %-5level [%class:%line] - %m %n</pattern>
    <charset>utf-8</charset>
</encoder>
The logs it prints normally look like this:
text
2025-07-23 14:51:16.742 INFO [rsx,4df8688b6e53a74b,4df8688b6e53a74b,false] 77320 --- [nio-8086-exec-2] c.c.d.c.core.interceptor.LogInterceptor : Fetching current logged-in user: null
2025-07-23 14:51:16.775 INFO [rsx,4df8688b6e53a74b,4df8688b6e53a74b,false] 77320 1947912380988571648 0:0:0:0:0:0:0:1 --- [nio-8086-exec-2] c.c.d.rsx.util.asyncTask.AsyncManager : Thread: http-nio-8086-exec-2, task execution started
2025-07-23 14:51:16.784 INFO [rsx,4df8688b6e53a74b,4df8688b6e53a74b,false] 77320 1947912380988571648 0:0:0:0:0:0:0:1 --- [nio-8086-exec-2] c.c.d.rsx.util.asyncTask.AsyncManager : Thread: http-nio-8086-exec-2, task execution finished
2025-07-23 14:51:16.796 INFO [rsx,4df8688b6e53a74b,e1333a200a3f1c12,false] 77320 --- [schedule-pool-1] cn.com.dzpjpt.rsx.util.ThreadLocalUtil : Current thread: schedule-pool-1, ID card set to: 弋路阳
2025-07-23 14:51:16.797 INFO [rsx,4df8688b6e53a74b,e1333a200a3f1c12,false] 77320 --- [schedule-pool-1] c.c.d.rsx.controller.TestController : Thread: schedule-pool-1, executing task
When the problem occurs, however, the logs without a traceId look like this:
text
2025-07-23 14:51:19.799 INFO [rsx,,,] 77320 --- [schedule-pool-1] c.c.d.rsx.controller.TestController : Thread: schedule-pool-1, task threw an exception
2025-07-23 14:51:19.799 INFO [rsx,,,] 77320 --- [schedule-pool-1] c.c.d.rsx.controller.TestController : Thread: schedule-pool-1, MDC not cleared...
2025-07-23 14:51:19.801 INFO [rsx,,,] 77320 --- [schedule-pool-1] cn.com.dzpjpt.rsx.util.Threads : Thread: schedule-pool-1, printing thread exception info: thread encountered an error
java.lang.RuntimeException: null
at cn.com.dzpjpt.rsx.controller.TestController$2.run(TestController.java:1555)
at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:67)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
at java.util.concurrent.FutureTask.run(FutureTask.java)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
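Tracking down a problem in the sea of production logs without a traceId is hard going. This post explains why the traceId disappears and how to get it back. If anything here is stated incorrectly, corrections are welcome!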
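3. Root cause
As is well known, when a thread pool executes tasks you can override the beforeExecute and afterExecute methods, which run immediately before and after the task's run method. beforeExecute can record the task start time or set up preconditions; afterExecute can record the completion time, capture exceptions, or process the task result. The relevant source, the runWorker method of ThreadPoolExecutor, is as follows:
java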
final void runWorker(Worker w) {
    Thread wt = Thread.currentThread();
    Runnable task = w.firstTask;
    w.firstTask = null;
    w.unlock(); // allow interrupts
    boolean completedAbruptly = true;
    try {
        while (task != null || (task = getTask()) != null) {
            w.lock();
            // If pool is stopping, ensure thread is interrupted;
            // if not, ensure thread is not interrupted. This
            // requires a recheck in second case to deal with
            // shutdownNow race while clearing interrupt
            if ((runStateAtLeast(ctl.get(), STOP) ||
                 (Thread.interrupted() &&
                  runStateAtLeast(ctl.get(), STOP))) &&
                !wt.isInterrupted())
                wt.interrupt();
            try {
                beforeExecute(wt, task);
                Throwable thrown = null;
                try {
                    task.run();
                } catch (RuntimeException x) {
                    thrown = x; throw x;
                } catch (Error x) {
                    thrown = x; throw x;
                } catch (Throwable x) {
                    thrown = x; throw new Error(x);
                } finally {
                    afterExecute(task, thrown);
                }
            } finally {
                task = null;
                w.completedTasks++;
                w.unlock();
            }
        }
        completedAbruptly = false;
    } finally {
        processWorkerExit(w, completedAbruptly);
    }
}
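One detail worth keeping in mind when relying on afterExecute to capture exceptions: with a scheduled pool the task is wrapped in a FutureTask, which catches the exception itself, so the thrown argument arrives as null and the real cause has to be read back from the Future. The sketch below is only an illustration under my own assumptions (the class name AfterExecuteDemo and the capture helper are made up for this post, and the project's own Threads utility is not shown here).

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class AfterExecuteDemo {

    // Returns { what afterExecute received as "thrown", what Future.get() revealed }.
    static String[] capture() throws InterruptedException {
        AtomicReference<Throwable> thrownArg = new AtomicReference<>();
        AtomicReference<Throwable> unwrapped = new AtomicReference<>();

        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                super.afterExecute(r, t);
                thrownArg.set(t);
                // Scheduled tasks are Futures: the failure is stored inside them.
                if (t == null && r instanceof Future<?> && ((Future<?>) r).isDone()) {
                    try {
                        ((Future<?>) r).get();
                    } catch (ExecutionException e) {
                        unwrapped.set(e.getCause());
                    } catch (CancellationException e) {
                        unwrapped.set(e);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        };

        Runnable failing = () -> { throw new IllegalStateException("boom"); };
        pool.schedule(failing, 10, TimeUnit.MILLISECONDS);
        pool.shutdown();                              // delayed tasks still run by default
        pool.awaitTermination(5, TimeUnit.SECONDS);

        return new String[] {
            String.valueOf(thrownArg.get()),
            String.valueOf(unwrapped.get())
        };
    }

    public static void main(String[] args) throws InterruptedException {
        String[] result = capture();
        System.out.println("thrown argument: " + result[0]);
        System.out.println("from Future.get(): " + result[1]);
    }
}
```

Running main prints null for the thrown argument and java.lang.IllegalStateException: boom for the unwrapped cause, so any error logging in afterExecute for scheduled tasks has to go through the Future.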
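In runWorker, beforeExecute runs before task.run(), and afterExecute sits in a finally block, so it runs whether or not the task fails. Yet none of the logs printed in beforeExecute or afterExecute carried a traceId. Everything happens on the same thread, so the ThreadLocal behind MDC is the same one. Why, then, does the traceId that was set in MDC vanish? After stepping through the code locally again and again I finally found the reason: the run method I wrote is not what the pool calls directly. It is wrapped in a TraceRunnable, and it is TraceRunnable.run that is reached through task.run() in runWorker (the stack trace above shows the call chain). That method looks like this:
java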
@Override
public void run() {
    ScopedSpan span = this.tracer.startScopedSpanWithParent(this.spanName,
            this.parent);
    try {
        this.delegate.run();
    }
    catch (Exception | Error e) {
        span.error(e);
        throw e;
    }
    finally {
        span.finish();
    }
}
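To make the effect of this wrapping concrete, here is a minimal sketch that reproduces it with nothing but the JDK. It is not Sleuth's code: the TRACE_ID ThreadLocal is a stand-in I made up for MDC, and the scoped helper only imitates the shape of TraceRunnable.run (set the context on entry, put the previous value back in finally). The class and method names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ScopeDemo {

    // Stand-in for MDC: one value per thread (assumption for this sketch).
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Imitates TraceRunnable: the context exists only while delegate.run() executes.
    static Runnable scoped(String traceId, Runnable delegate) {
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(traceId);
            try {
                delegate.run();
            } finally {
                // Put back whatever was there before; on a pool thread that is nothing.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    // Records what each hook can see, in execution order.
    static List<String> observe() throws InterruptedException {
        List<String> seen = new CopyOnWriteArrayList<>();

        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1) {
            @Override
            protected void beforeExecute(Thread t, Runnable r) {
                seen.add("before=" + TRACE_ID.get());
                super.beforeExecute(t, r);
            }

            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                super.afterExecute(r, t);
                seen.add("after=" + TRACE_ID.get());
            }
        };

        Runnable task = () -> seen.add("run=" + TRACE_ID.get());
        pool.schedule(scoped("4df8688b6e53a74b", task), 10, TimeUnit.MILLISECONDS);
        pool.shutdown();                              // delayed tasks still run by default
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observe());
        // prints [before=null, run=4df8688b6e53a74b, after=null]
    }
}
```

The context is visible only between the two hooks, which is the same pattern as the empty [rsx,,,] fields in the logs above.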
The traceId is initialized inside startScopedSpanWithParent. Following that call leads to the method below, which is the one that writes the trace fields into MDC:
java
@Override
public CurrentTraceContext.Scope decorateScope(TraceContext currentSpan,
        CurrentTraceContext.Scope scope) {
    final String previousTraceId = MDC.get("traceId");
    final String previousParentId = MDC.get("parentId");
    final String previousSpanId = MDC.get("spanId");
    final String spanExportable = MDC.get("spanExportable");
    final String legacyPreviousTraceId = MDC.get(LEGACY_TRACE_ID_NAME);
    final String legacyPreviousParentId = MDC.get(LEGACY_PARENT_ID_NAME);
    final String legacyPreviousSpanId = MDC.get(LEGACY_SPAN_ID_NAME);
    final String legacySpanExportable = MDC.get(LEGACY_EXPORTABLE_NAME);
    final List<AbstractMap.SimpleEntry<String, String>> previousMdc = previousMdc();
    if (currentSpan != null) {
        String traceIdString = currentSpan.traceIdString();
        MDC.put("traceId", traceIdString);
        MDC.put(LEGACY_TRACE_ID_NAME, traceIdString);
        String parentId = currentSpan.parentId() != null
                ? HexCodec.toLowerHex(currentSpan.parentId()) : null;
        replace("parentId", parentId);
        replace(LEGACY_PARENT_ID_NAME, parentId);
        String spanId = HexCodec.toLowerHex(currentSpan.spanId());
        MDC.put("spanId", spanId);
        MDC.put(LEGACY_SPAN_ID_NAME, spanId);
        String sampled = String.valueOf(currentSpan.sampled());
        MDC.put("spanExportable", sampled);
        MDC.put(LEGACY_EXPORTABLE_NAME, sampled);
        log("Starting scope for span: {}", currentSpan);
        if (currentSpan.parentId() != null) {
            if (log.isTraceEnabled()) {
                log.trace("With parent: {}", currentSpan.parentId());
            }
        }
        for (String key : whitelistedBaggageKeysWithValue(currentSpan)) {
            MDC.put(key, ExtraFieldPropagation.get(currentSpan, key));
        }
        for (String key : whitelistedPropagationKeysWithValue(currentSpan)) {
            MDC.put(key, ExtraFieldPropagation.get(currentSpan, key));
        }
        for (String key : whitelistedLocalKeysWithValue(currentSpan)) {
            MDC.put(key, ExtraFieldPropagation.get(currentSpan, key));
        }
    }
    else {
        MDC.remove("traceId");
        MDC.remove("parentId");
        MDC.remove("spanId");
        MDC.remove("spanExportable");
        MDC.remove(LEGACY_TRACE_ID_NAME);
        MDC.remove(LEGACY_PARENT_ID_NAME);
        MDC.remove(LEGACY_SPAN_ID_NAME);
        MDC.remove(LEGACY_EXPORTABLE_NAME);
        for (String s : whitelistedBaggageKeys()) {
            MDC.remove(s);
        }
        for (String s : whitelistedPropagationKeys()) {
            MDC.remove(s);
        }
        for (String s : whitelistedLocalKeys()) {
            MDC.remove(s);
        }
        previousMdc.clear();
    }
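The cleanup happens in the finally block of that same run method. Following span.finish() leads to the close() method below, which puts every trace field in MDC back to the value it had before the scope was opened. On a pool thread those earlier values were empty, so in effect everything the scope put into MDC is wiped:
java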
@Override
public void close() {
    log("Closing scope for span: {}", currentSpan);
    scope.close();
    replace("traceId", previousTraceId);
    replace("parentId", previousParentId);
    replace("spanId", previousSpanId);
    replace("spanExportable", spanExportable);
    replace(LEGACY_TRACE_ID_NAME, legacyPreviousTraceId);
    replace(LEGACY_PARENT_ID_NAME, legacyPreviousParentId);
    replace(LEGACY_SPAN_ID_NAME, legacyPreviousSpanId);
    replace(LEGACY_EXPORTABLE_NAME, legacySpanExportable);
    for (AbstractMap.SimpleEntry<String, String> entry : previousMdc) {
        replace(entry.getKey(), entry.getValue());
    }
}
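The culprit is found. The good news: the traceId is set automatically, so we cannot forget to do it, and the automatic cleanup clears the trace data held in the ThreadLocal, which prevents memory leaks. The bad news: in beforeExecute the traceId has not been set yet, and by the time afterExecute runs it has already been removed, so neither hook can see it. Other data we keep in our own ThreadLocals is untouched, though. So if we want the traceId before it is initialized, or want to keep using the MDC fields afterwards, we can do the following.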
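4. Solutions
① Getting the traceId in beforeExecute:
Given the execution order that the JDK has already fixed, reading the business MDC in beforeExecute goes against the way the thread pool is designed, and I do not recommend it. The trace context is only created inside task.run(), and runWorker is declared final in ThreadPoolExecutor, so a subclass of the stock scheduled pool cannot change that order. If you really need this, you can write your own thread pool and define all of the methods yourself, including runWorker, beforeExecute and afterExecute. I won't go into that here.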
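② Using the traceId in afterExecute
We can keep using the stock thread pool and TraceRunnable. All we need to do is, inside our own run method, copy the traceId and the related fields into a ThreadLocal of our own ahead of time, like this:
java
// Declared once as a field of the class
private static final ThreadLocal<Map<String, String>> CHECK_INPUT_LOG = new ThreadLocal<>();

// Inside our own run(), while the trace fields are still in MDC
Map<String, String> copyOfContextMap = MDC.getCopyOfContextMap();
CHECK_INPUT_LOG.set(copyOfContextMap);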
When it is needed, take the saved map out at the very start of afterExecute and put it back into MDC:
java
// Same CHECK_INPUT_LOG field as declared above
Map<String, String> stringStringMap = CHECK_INPUT_LOG.get();
if (stringStringMap != null) {
    MDC.setContextMap(stringStringMap);
}
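With this in place, every log line written in afterExecute carries the traceId, which makes log searches far easier! But be sure to remove what you set once you are done, like this 👇. Otherwise the pool thread keeps the data around, which leaks memory and lets a stale traceId show up in the logs of later tasks!
java
// At the end of afterExecute, ideally in a finally block
MDC.clear();
CHECK_INPUT_LOG.remove();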
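Putting the three snippets together, the whole flow can be sketched roughly as below. To keep it runnable on the JDK alone I made a few assumptions: FAKE_MDC is a made-up stand-in for org.slf4j.MDC, the traced helper only imitates what TraceRunnable does, and the class name RestoreDemo is hypothetical. In the real project the FAKE_MDC calls would be the MDC calls shown above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class RestoreDemo {

    // Stand-in for org.slf4j.MDC (assumption for this sketch).
    static final ThreadLocal<Map<String, String>> FAKE_MDC = new ThreadLocal<>();
    // Plays the same role as CHECK_INPUT_LOG in the snippets above.
    static final ThreadLocal<Map<String, String>> CHECK_INPUT_LOG = new ThreadLocal<>();

    // Imitates TraceRunnable: the trace fields exist only while the delegate runs.
    static Runnable traced(String traceId, Runnable delegate) {
        return () -> {
            FAKE_MDC.set(new HashMap<>(Collections.singletonMap("traceId", traceId)));
            try {
                delegate.run();
            } finally {
                FAKE_MDC.remove();
            }
        };
    }

    static List<String> observe() throws InterruptedException {
        List<String> seen = new CopyOnWriteArrayList<>();

        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                super.afterExecute(r, t);
                try {
                    // Step 2: restore the saved context before logging anything.
                    Map<String, String> saved = CHECK_INPUT_LOG.get();
                    if (saved != null) {
                        FAKE_MDC.set(saved);
                    }
                    Map<String, String> ctx = FAKE_MDC.get();
                    seen.add("after=" + (ctx == null ? null : ctx.get("traceId")));
                } finally {
                    // Step 3: always clean up so nothing leaks into the next task.
                    FAKE_MDC.remove();
                    CHECK_INPUT_LOG.remove();
                }
            }
        };

        // Step 1: inside our own run(), save a copy of the context, then fail.
        Runnable saving = () -> {
            CHECK_INPUT_LOG.set(new HashMap<>(FAKE_MDC.get()));
            throw new RuntimeException("task failed");
        };
        // A second task that saves nothing, to check that no stale context is left.
        Runnable plain = () -> { };

        pool.schedule(traced("4df8688b6e53a74b", saving), 10, TimeUnit.MILLISECONDS);
        pool.schedule(traced("e1333a200a3f1c12", plain), 50, TimeUnit.MILLISECONDS);
        pool.shutdown();                              // delayed tasks still run by default
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observe());
        // prints [after=4df8688b6e53a74b, after=null]
    }
}
```

The first afterExecute sees the traceId even though the task threw, and the second sees nothing, which shows that the cleanup in the finally block did its job on the reused pool thread.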