Listening for the Android ANR Signal and Capturing All Thread Stacks
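In an earlier article I covered how ANR works; if you are interested, see [Framework] 深入理解 Android ANR. After AMS sends the ANR signal to the app process, the Signal Catcher thread catches it and dumps the stacks of all threads into the /data/anr directory. Reading that directory requires root. On an emulator this is easy, because adb root gives you root directly. On an ordinary phone it is much harder, but you can export the files with the adb bugreport command.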
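So there are ways to get the ANR dump file offline, but they are cumbersome. Android also provides no dedicated API for listening to ANR callbacks, and there is no way to retrieve the dump file from users in production. This article therefore shows how to listen for the ANR signal and how to capture the dump produced at ANR time.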
Listening for the ANR signal
On Android the ANR signal is SIGQUIT. It is blocked by default, so a handler we install for it would never be called. We first have to unblock it:
C
sigset_t sig_sets;
sigemptyset(&sig_sets);
sigaddset(&sig_sets, SIGQUIT);
pthread_sigmask(SIG_UNBLOCK, &sig_sets, nullptr);
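The signal mask is a per-thread property, so the call above only changes the thread that makes it (threads created afterwards inherit that mask). As a minimal sketch for checking the effect, the helpers below query and change the calling thread's mask. The names isSignalBlocked and setSignalBlocked are hypothetical and are not part of the original code.

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical helper: returns true if `sig` is currently blocked
// in the calling thread's signal mask.
bool isSignalBlocked(int sig) {
    sigset_t current;
    sigemptyset(&current);
    // With a null `set` argument, pthread_sigmask only reports the current mask.
    pthread_sigmask(SIG_SETMASK, nullptr, &current);
    return sigismember(&current, sig) == 1;
}

// Hypothetical helper: blocks or unblocks `sig` for the calling thread.
// Returns 0 on success, an error number otherwise.
int setSignalBlocked(int sig, bool blocked) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, sig);
    return pthread_sigmask(blocked ? SIG_BLOCK : SIG_UNBLOCK, &set, nullptr);
}
```

Logging isSignalBlocked(SIGQUIT) before and after the unblock call is a cheap way to confirm that the mask really changed on the thread you expect.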
Once it is unblocked, we can install our own signal handler:
C
struct sigaction sigAction{};
sigfillset(&sigAction.sa_mask);
sigAction.sa_flags = SA_RESTART | SA_ONSTACK | SA_SIGINFO;
sigAction.sa_sigaction = anrSignalHandler;
// sigaction() returns 0 on success and -1 on failure, with the cause in errno (<cerrno>).
int ret = sigaction(SIGQUIT, &sigAction, nullptr);
if (ret == 0) {
    LOGD("Monitor anr signal success.");
} else {
    LOGE("Monitor anr signal fail: %d", errno);
}
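anrSignalHandler in the code above is a pointer to our signal handler, which we register through sigaction(). The third parameter of that function receives the previous action: pass a pointer to a struct sigaction and the old action is written to that address. With the old handler in hand, we could forward the signal to it after handling it ourselves.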
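I don't fetch the old handler here, though. I tried, but calling back into the old handler after receiving the signal crashed, and I still don't know why. So I switched to another way of passing the signal on to its original consumer, which is described below.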
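One possible cause of that crash, stated as an assumption rather than a confirmed diagnosis: ART's Signal Catcher receives SIGQUIT synchronously through sigwait() instead of through a handler, so the previous action returned by sigaction() would be the default one (SIG_DFL), which is not a callable function. The sketch below shows how the old action could be saved and checked before chaining to it. gOldAction, demoHandler, installAndKeepOld and canChainToOld are hypothetical names used only for illustration.

```cpp
#include <signal.h>

// Hypothetical global that holds the action that was installed before ours.
static struct sigaction gOldAction;

// Placeholder handler; a real one would do the ANR bookkeeping.
static void demoHandler(int, siginfo_t *, void *) {}

// Installs demoHandler for `sig` and stores the previous action in gOldAction.
int installAndKeepOld(int sig) {
    struct sigaction action{};
    sigfillset(&action.sa_mask);
    action.sa_flags = SA_RESTART | SA_SIGINFO;
    action.sa_sigaction = demoHandler;
    return sigaction(sig, &action, &gOldAction);
}

// Returns true only when the previous action is a real function.
// SIG_DFL and SIG_IGN are special values, not function pointers, and must
// never be called. On Linux and Android sa_handler and sa_sigaction share
// storage, so checking sa_handler covers both forms.
bool canChainToOld() {
    return gOldAction.sa_handler != SIG_DFL && gOldAction.sa_handler != SIG_IGN;
}
```

If canChainToOld() returns false, there is nothing to call, and re-sending the signal to the thread that waits for it is the remaining option.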
Now let's look at my signal handler:
C
static void anrSignalHandler(int sig, siginfo_t *sig_info, void *uc) {
    LOGD("Receive anr signal.");
    // Sender pid, read from the raw siginfo padding (two candidate slots are checked).
    int fromPid1 = sig_info->_si_pad[3];
    int fromPid2 = sig_info->_si_pad[4];
    int myPid = getpid();
    if (fromPid1 != myPid && fromPid2 != myPid) {
        // The signal came from another process, so treat it as an ANR.
        pthread_mutex_lock(lock);
        if (dumpState == NO_DUMP) {
            dumpState = WAITING_ANR_DUMP;
        } else {
            LOGE("Skip dump anr, because state: %d", dumpState);
        }
        pthread_mutex_unlock(lock);
    }
    // Forward SIGQUIT to Signal Catcher so that it still performs the dump.
    syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);
}
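As mentioned earlier, the ANR signal is sent to the app process by AMS, so its sender is never our own process. Our process can also send SIGQUIT to itself (a plain kill() is enough), so we only run the ANR logic when the sender is some other process. After receiving the signal we must forward it to the Signal Catcher thread, which is done with syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);.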
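The sender check can also be written against the documented si_pid field of siginfo_t, which is filled in for signals sent with kill() or tgkill(). The following is a minimal sketch of that idea, not the handler used in this article. gLastSenderPid, gLastSigCode, recordSenderHandler, installRecordSenderHandler and lastSignalCameFromOtherProcess are hypothetical names.

```cpp
#include <signal.h>
#include <unistd.h>

// Hypothetical state written by the handler. sig_atomic_t keeps the stores
// async-signal-safe.
static volatile sig_atomic_t gLastSenderPid = -1;
static volatile sig_atomic_t gLastSigCode = 0;

static void recordSenderHandler(int, siginfo_t *info, void *) {
    // si_pid is the pid of the sending process; si_code tells how the
    // signal was sent (for example SI_USER for kill, SI_TKILL for tgkill).
    gLastSenderPid = info->si_pid;
    gLastSigCode = info->si_code;
}

// Installs the recording handler for `sig`; returns 0 on success, -1 on failure.
int installRecordSenderHandler(int sig) {
    struct sigaction action{};
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_SIGINFO | SA_RESTART;
    action.sa_sigaction = recordSenderHandler;
    return sigaction(sig, &action, nullptr);
}

// True when the most recent signal was sent by a different process.
bool lastSignalCameFromOtherProcess() {
    return gLastSenderPid != getpid();
}
```

The handler only stores two integers, which keeps it within what is safe to do inside a signal handler; the decision logic runs later on a normal thread.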
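Forwarding the signal raises another question: how do we get the tid of Signal Catcher? On Linux, /proc/[pid] holds a lot of information about a process, and /proc/[pid]/task contains one subdirectory per thread of that process. Each subdirectory is named after the thread's tid, and its comm file contains the thread's name.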
text
OPD2A0:/proc/26483/task $ ls
16343 16346 16348 16350 16354 16357 16374 16377 16379 16381 16392 16394 16396 16398 16400 16402 16405 16412 16577 22976 22978
16344 16347 16349 16351 16355 16365 16376 16378 16380 16390 16393 16395 16397 16399 16401 16404 16407 16576 16814 22977 26483
So by reading these files we can find the tid for a given thread name, and the other way around.
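The reverse lookup, from a tid to its thread name, can be sketched as follows. This is an illustration only: readThreadName and currentTid are hypothetical helpers, and the sketch assumes a Linux-style /proc file system.

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: reads /proc/self/task/<tid>/comm and returns the
// thread name, or an empty string if the file cannot be read.
std::string readThreadName(int tid) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/self/task/%d/comm", tid);
    FILE *file = fopen(path, "r");
    if (file == nullptr) {
        return "";
    }
    char name[64] = {0};
    if (fgets(name, sizeof(name), file) == nullptr) {
        name[0] = '\0';
    }
    fclose(file);
    // comm ends with a newline; cut the string there.
    name[strcspn(name, "\n")] = '\0';
    return name;
}

// Hypothetical helper: kernel thread id of the calling thread.
int currentTid() {
    return static_cast<int>(syscall(SYS_gettid));
}
```

For example, readThreadName(currentTid()) returns the name of the calling thread, and a tid that does not exist yields an empty string.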
Here is my reference code:
C
int getSignalCatcherTid() {
    // Stack buffers avoid the leaks of new[]; this assumes MAX_BUFFER_SIZE is small.
    char taskPath[MAX_BUFFER_SIZE];
    int size = snprintf(taskPath, sizeof(taskPath), "/proc/%d/task", getpid());
    if (size < 0 || size >= (int) sizeof(taskPath)) {
        LOGE("Read proc path fail, read buffer size: %d", size);
        return -1;
    }
    DIR *taskDir = opendir(taskPath);
    if (taskDir == nullptr) {
        LOGE("Read process dir fail.");
        return -1;
    }
    int tid = -1;
    dirent *child;
    // Calling readdir() in the loop condition keeps `continue` from looping forever.
    while ((child = readdir(taskDir)) != nullptr) {
        // Skip "." and ".."; every other entry is a directory named after a tid.
        if (!isNumberStr(child->d_name, 256)) {
            continue;
        }
        char commPath[MAX_BUFFER_SIZE];
        size = snprintf(commPath, sizeof(commPath), "%s/%s/comm", taskPath, child->d_name);
        if (size < 0 || size >= (int) sizeof(commPath)) {
            continue;
        }
        int fd = open(commPath, O_RDONLY);
        if (fd < 0) {
            continue; // The thread may have exited in the meantime.
        }
        char threadName[MAX_BUFFER_SIZE];
        ssize_t len = read(fd, threadName, sizeof(threadName) - 1);
        close(fd);
        if (len <= 0) {
            continue;
        }
        // comm ends with '\n'; replace it with the string terminator.
        threadName[len - 1] = '\0';
        if (strcmp(threadName, "Signal Catcher") == 0) {
            tid = atoi(child->d_name);
            break;
        }
    }
    closedir(taskDir);
    return tid;
}
Capturing the dump file written by the Signal Catcher thread
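We can now observe the ANR signal, but how do we get the dump that Signal Catcher writes? The first thing to know is that Signal Catcher is a thread inside our own app process, created when the process starts. To capture what it writes, we can use a PLT/GOT hook on the write() function it calls, which gives us the content being written. I have covered PLT/GOT hooks before; if you are interested, see 手把手教你如何 Hook Native 方法.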
I use xHook to do the hooking.
C
int hookSignalCatcherWrite() {
    int apiLevel = android_get_device_api_level();
    int signalCatcherTid = gSignalCatcherTid;
    if (signalCatcherTid <= 0) {
        signalCatcherTid = getSignalCatcherTid();
        gSignalCatcherTid = signalCatcherTid;
    }
    LOGD("ApiLevel: %d, SignalCatcherTid: %d", apiLevel, signalCatcherTid);
    if (signalCatcherTid <= 0) {
        LOGE("Get Signal Catcher tid fail.");
        return -1;
    }
    // The library whose write() call gets hooked depends on the API level.
    // "\\." escapes the dot for the regex; a single backslash is not a valid C escape.
    const char *writeLibName;
    if (apiLevel >= 30 || apiLevel == 25 || apiLevel == 24) {
        writeLibName = ".*/libc\\.so$";
    } else if (apiLevel == 29) {
        writeLibName = ".*/libbase\\.so$";
    } else {
        writeLibName = ".*/libart\\.so$";
    }
    int ret = xhook_register(writeLibName,
                             "write",
                             (void *) my_write,
                             nullptr);
    LOGD("xhook hook write register result: %d", ret);
    if (ret != 0) {
        return ret;
    }
    ret = xhook_refresh(1);
    LOGD("xhook hook write refresh result: %d", ret);
    return ret;
}
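The so library to hook differs across Android versions. I followed what others have done here; the most reliable approach is to read the Android source and check which so the Signal Catcher code is built into.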
Let's take a quick look at the implementation of our hook function my_write:
C
ssize_t my_write(int fd, const void *const buf, size_t count) {
    // Only writes issued by Signal Catcher are of interest.
    if (gSignalCatcherTid == gettid()) {
        pthread_mutex_lock(lock);
        if (dumpState != NO_DUMP) {
            LOGD("SignalCatcher write count: %zu", count);
            long time = get_time_millis();
            const char *dir;
            if (dumpState == WAITING_STACK_DUMP) {
                dir = gStackTraceDir;
                LOGD("Start stack dump.");
            } else {
                dir = gAnrTraceDir;
                LOGD("Start anr dump.");
            }
            // A stack buffer replaces new[], which was never freed.
            char stackFileName[MAX_BUFFER_SIZE];
            snprintf(stackFileName, sizeof(stackFileName), "%s/%ld.text", dir, time);
            LOGD("Create stack file: %s", stackFileName);
            int fileFd = open(stackFileName, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
            if (fileFd < 0) {
                LOGE("Create file fail: %d", errno);
            } else {
                // Save a copy of the dump, then notify the listener with its timestamp.
                write(fileFd, buf, count);
                close(fileFd);
                write(gStackNotifyFd, &time, sizeof(time));
            }
        }
        pthread_mutex_unlock(lock);
    }
    // Always pass the data on to the real write().
    return origin_write(fd, buf, count);
}
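We first check that the current thread is Signal Catcher and then check our own state flag. If both match, we treat the data as the ANR dump we want and write it to our own file. Finally we call the real write() implementation.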
Actively capturing all thread stacks
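Capturing the stack dump through the system's ANR signal is rather passive. Sometimes we want to know the state of every thread in the app right now. In that case we can send SIGQUIT to the Signal Catcher thread ourselves and pick up the resulting dump through the same hook. The signal is sent the same way as in our custom signal action, with syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);.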
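To make the mechanism concrete, here is a self-contained sketch of a thread-directed SIGQUIT. It does not touch the real Signal Catcher. Instead, catcherThread is a hypothetical stand-in that waits for SIGQUIT with sigwait(), and requestDumpDemo sends the signal to exactly that thread with tgkill. All names in the block are assumptions made for illustration.

```cpp
#include <atomic>
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

static std::atomic<int> gCatcherTid{0};      // stand-in for gSignalCatcherTid
static std::atomic<int> gReceivedSignal{0};  // signal number seen by the catcher

// Stand-in for Signal Catcher: waits synchronously for SIGQUIT.
static void *catcherThread(void *) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGQUIT);
    gCatcherTid.store(static_cast<int>(syscall(SYS_gettid)));
    int sig = 0;
    if (sigwait(&set, &sig) == 0) {
        // A real catcher would dump all thread stacks at this point.
        gReceivedSignal.store(sig);
    }
    return nullptr;
}

// Starts the catcher, sends it SIGQUIT with tgkill, and returns the signal
// number the catcher received (or -1 if the thread could not be created).
int requestDumpDemo() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGQUIT);
    // Block SIGQUIT before creating the thread so the new thread inherits
    // the mask and the signal stays pending until sigwait() picks it up.
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    pthread_t thread;
    if (pthread_create(&thread, nullptr, catcherThread, nullptr) != 0) {
        return -1;
    }
    while (gCatcherTid.load() == 0) {
        usleep(1000);  // wait until the catcher has published its tid
    }
    syscall(SYS_tgkill, getpid(), gCatcherTid.load(), SIGQUIT);
    pthread_join(thread, nullptr);
    return gReceivedSignal.load();
}
```

In the article's setup the application would also set dumpState to WAITING_STACK_DUMP before sending the signal, so that my_write stores the result under gStackTraceDir instead of treating it as an ANR.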
Sample ANR dump file
text
// ...
suspend all histogram: Sum: 165us 99% C.I. 1us-21us Avg: 7.173us Max: 21us
DALVIK THREADS (23):
"Signal Catcher" daemon prio=10 tid=2 Runnable
| group="system" sCount=0 ucsCount=0 flags=0 obj=0x13600338 self=0xb400007bf3a26000
| sysTid=5041 nice=-20 cgrp=default sched=0/0 handle=0x7bf4ffbcb0
| state=R schedstat=( 28127001 5785385 10 ) utm=2 stm=0 core=5 HZ=100
| stack=0x7bf4f04000-0x7bf4f06000 stackSize=991KB
| held mutexes= "mutator lock"(shared held)
native: #00 pc 0000000000570ec4 /apex/com.android.art/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+148) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #01 pc 0000000000675a24 /apex/com.android.art/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool, BacktraceMap*, bool) const+340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #02 pc 000000000069310c /apex/com.android.art/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+908) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #03 pc 000000000068ccac /apex/com.android.art/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+508) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #04 pc 000000000068bf54 /apex/com.android.art/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool)+1796) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #05 pc 000000000068b70c /apex/com.android.art/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+1340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #06 pc 000000000063d300 /apex/com.android.art/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+208) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #07 pc 0000000000651dc0 /apex/com.android.art/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1376) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #08 pc 0000000000650e54 /apex/com.android.art/lib64/libart.so (art::SignalCatcher::Run(void*)+340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #09 pc 00000000000eb720 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #10 pc 000000000007e2d0 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
(no managed stack frames)
"main" prio=5 tid=1 Native
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x73869160 self=0xb400007c11e10800
| sysTid=15609 nice=-10 cgrp=default sched=1073741824/0 handle=0x7cbd635500
| state=S schedstat=( 1086854706 330699698 4068 ) utm=63 stm=45 core=6 HZ=100
| stack=0x7fd3027000-0x7fd3029000 stackSize=8188KB
| held mutexes=
native: #00 pc 0000000000078dec /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #01 pc 00000000002833dc /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+140) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #02 pc 000000000043bf3c /apex/com.android.art/lib64/libart.so (art::(anonymous namespace)::CheckJNI::FindClass(_JNIEnv*, char const*) (.llvm.11132044689082360456)+460) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #03 pc 0000000000128ebc /system/lib64/libandroid_runtime.so (android::NativeDisplayEventReceiver::dispatchVsync(long, android::PhysicalDisplayId, unsigned int, android::gui::VsyncEventData)+92) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
native: #04 pc 00000000000c1820 /system/lib64/libgui.so (android::DisplayEventDispatcher::handleEvent(int, int, void*)+272) (BuildId: 1d69b7a57862392ad7b7712ed6197e18)
native: #05 pc 000000000001836c /system/lib64/libutils.so (android::Looper::pollInner(int)+1068) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
native: #06 pc 0000000000017ee0 /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+112) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
native: #07 pc 000000000016410c /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:339)
at android.os.Looper.loopOnce(Looper.java:186)
at android.os.Looper.loop(Looper.java:351)
at android.app.ActivityThread.main(ActivityThread.java:8377)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:584)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1013)
"Jit thread pool worker thread 0" daemon prio=5 tid=4 Native
| group="system" sCount=1 ucsCount=0 flags=1 obj=0x135c0720 self=0xb400007bf3a47800
| sysTid=5046 nice=9 cgrp=default sched=0/0 handle=0x7bf4d01cb0
| state=S schedstat=( 12650002 4618461 48 ) utm=0 stm=0 core=1 HZ=100
| stack=0x7bf4c02000-0x7bf4c04000 stackSize=1023KB
| held mutexes=
native: #00 pc 0000000000078dec /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #01 pc 00000000002833dc /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+140) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #02 pc 0000000000694b78 /apex/com.android.art/lib64/libart.so (art::ThreadPool::GetTask(art::Thread*)+120) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #03 pc 0000000000693f50 /apex/com.android.art/lib64/libart.so (art::ThreadPoolWorker::Run()+144) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #04 pc 00000000006939cc /apex/com.android.art/lib64/libart.so (art::ThreadPoolWorker::Callback(void*)+172) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
native: #05 pc 00000000000eb720 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #06 pc 000000000007e2d0 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
(no managed stack frames)
"perfetto_hprof_listener" prio=10 tid=8 Native (still starting up)
| group="" sCount=1 ucsCount=0 flags=1 obj=0x0 self=0xb400007bf3a6f800
| sysTid=5044 nice=-20 cgrp=default sched=0/0 handle=0x7bf4efdcb0
| state=S schedstat=( 119385 21461461 4 ) utm=0 stm=0 core=6 HZ=100
| stack=0x7bf4e06000-0x7bf4e08000 stackSize=991KB
| held mutexes=
native: #00 pc 00000000000d5774 /apex/com.android.runtime/lib64/bionic/libc.so (read+4) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #01 pc 000000000001dee4 /apex/com.android.art/lib64/libperfetto_hprof.so (void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ArtPlugin_Initialize::$_34> >(void*)+260) (BuildId: 13ee3b989b35c4e1d3ac372e558e2961)
native: #02 pc 00000000000eb720 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #03 pc 000000000007e2d0 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
(no managed stack frames)
"binder:15609_1" prio=5 tid=9 Native
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x13640020 self=0xb400007bf4867400
| sysTid=5054 nice=-20 cgrp=default sched=0/0 handle=0x7bf42dfcb0
| state=S schedstat=( 333385 370462 3 ) utm=0 stm=0 core=4 HZ=100
| stack=0x7bf41e8000-0x7bf41ea000 stackSize=991KB
| held mutexes=
native: #00 pc 00000000000d5a54 /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #01 pc 00000000000873bc /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #02 pc 000000000005f48c /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+284) (BuildId: 821d5191ea842f908c210c9c338b12f6)
native: #03 pc 000000000005f788 /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+24) (BuildId: 821d5191ea842f908c210c9c338b12f6)
native: #04 pc 00000000000600a4 /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+68) (BuildId: 821d5191ea842f908c210c9c338b12f6)
native: #05 pc 0000000000090048 /system/lib64/libbinder.so (android::PoolThread::threadLoop()+24) (BuildId: 821d5191ea842f908c210c9c338b12f6)
native: #06 pc 0000000000013550 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+416) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
native: #07 pc 00000000000cc59c /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+140) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
native: #08 pc 00000000000eb720 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
native: #09 pc 000000000007e2d0 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
(no managed stack frames)
// ...
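The file contains every Java thread stack and native thread stack, along with thread states, lock information, stack sizes and other useful details, all of which help a great deal when analyzing problems.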
Wrapping up
I have open-sourced all of the code above and published it as a standalone aar library. If you are interested, take a look: dumpstack