Analyzing an ANR caused by a crash
Log analysis:
The ANR occurs because the MotionEvent was never reported back to InputDispatcher as finished.
bash
01-12 12:01:15.435 522 2179 I am_anr : [0,2142,com.example.linjw.dagger2demo,818462534,Input dispatching timed out (3c73418 com.example.linjw.dagger2demo/com.example.linjw.dagger2demo.SearchActivity (server) is not responding. Waited 5005ms for MotionEvent(deviceId=4, eventTime=162728390000, source=TOUCHSCREEN | STYLUS | BLUETOOTH_STYLUS, displayId=0, action=DOWN, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=22.8, yPrecision=11.1, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (1011.9, 359.0)]), policyFlags=0x62000000)]
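The log shows that a MotionEvent delivered to process 2142 went 5 s without being finished.
It is natural to suspect that the main thread was blocked, so let's check the log to see what the main thread of process 2142 was actually doing:
It shows that 5 seconds earlier the main thread of 2142 had already hit a NullPointerException and logged "Shutting down VM".
So what does this NullPointerException crash have to do with the ANR?
To answer that, we need to look at the dispatch flow:
Whenever InputDispatcher sends an event and does not receive a finish from the app within 5 s, an ANR is raised.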
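The timeout rule above can be sketched with a tiny model. This is not the real InputDispatcher code: the class name PendingEventTracker and its methods are hypothetical, and the only thing taken from the article is the rule "sent, but no finish within the timeout means ANR".

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical model, not the real InputDispatcher: it only remembers when each
// event was sent and forgets the event once a finish signal comes back.
class PendingEventTracker {
public:
    explicit PendingEventTracker(int64_t timeoutMs) : mTimeoutMs(timeoutMs) {}

    // Record that the event with this sequence number was sent to the app.
    void onEventSent(uint32_t seq, int64_t nowMs) { mSentAtMs[seq] = nowMs; }

    // Record that the app acknowledged the event with a finish signal.
    void onFinishReceived(uint32_t seq) { mSentAtMs.erase(seq); }

    // True if any sent event has waited at least the timeout without a finish.
    bool hasTimedOutEvent(int64_t nowMs) const {
        for (const auto& entry : mSentAtMs) {
            if (nowMs - entry.second >= mTimeoutMs) return true;
        }
        return false;
    }

private:
    int64_t mTimeoutMs;
    std::unordered_map<uint32_t, int64_t> mSentAtMs;
};
```

With a 5000 ms timeout, an event sent at time 0 that is still unfinished at 5005 ms is reported as timed out, which matches the "Waited 5005ms" in the am_anr line.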
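Next, let's walk through how the app receives an event in the normal case.
Normal dispatch flow, as seen in the log:
This is only the Java-side stack; the call actually originates on the native side, in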
frameworks/base/core/jni/android_view_InputEventReceiver.cpp

frameworks/base/core/java/android/view/InputEventReceiver.java
java
// Called from native code.
@SuppressWarnings("unused")
@UnsupportedAppUsage(maxTargetSdk = Build.VERSION_CODES.R, trackingBug = 170729553)
private void dispatchInputEvent(int seq, InputEvent event) {
    mSeqMap.put(event.getSequenceNumber(), seq);
    onInputEvent(event);
}
onInputEvent has a corresponding implementation:
The default implementation calls finishInputEvent directly, and finishInputEvent eventually reaches InputConsumer::sendFinishedSignal, which delivers the finish to InputDispatcher.

So where does the subclass actually handle the event?
It ends up in WindowInputEventReceiver inside ViewRootImpl:
frameworks/base/core/java/android/view/ViewRootImpl.java
Once handling completes, finishInputEvent is called.

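But what if, as in the demo, an exception is thrown during dispatch and finishInputEvent is never reached? Does that mean the finish can never be sent? Answering this requires digging into the native code.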
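To make the question concrete, here is a minimal sketch of the "handle, then finish" shape. ModelReceiver and dispatchCatching are hypothetical names used only for illustration; they stand in for the Java-side receiver and are not framework APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <stdexcept>

// Hypothetical stand-in for the receiver side; names are illustrative only.
class ModelReceiver {
public:
    // Mirrors the shape of "handle the event, then finish it".
    void dispatch(uint32_t seq, const std::function<void()>& handler) {
        mUnfinished.insert(seq);
        handler();    // if the handler throws, the next line never runs
        finish(seq);
    }

    bool isFinished(uint32_t seq) const { return mUnfinished.count(seq) == 0; }

private:
    void finish(uint32_t seq) { mUnfinished.erase(seq); }
    std::set<uint32_t> mUnfinished;
};

// Runs one dispatch and returns true if the handler's exception escaped.
bool dispatchCatching(ModelReceiver& receiver, uint32_t seq,
                      const std::function<void()>& handler) {
    try {
        receiver.dispatch(seq, handler);
        return false;
    } catch (const std::exception&) {
        return true;
    }
}
```

In this model, an event whose handler throws stays unfinished forever, which is exactly the state that lets the 5 s timer on the other side expire.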
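Source analysis after the exception:
Does that mean no finish is sent once an exception occurs? In newer versions, yes. When an exception is detected, skipCallbacks is set to true, but nothing in the newer code calls sendFinishedSignal when skipCallbacks is true. Older versions did: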
bash
commit 3bdcdd8531781569d501e7023c22e25e2bae0dd1
Author: Jeff Brown <jeffbrown@google.com>
Date: Tue Apr 10 20:36:07 2012 -0700
Be more careful about exceptions in input callbacks.
consumeEvents() may be called reentrantly so we need to be
careful when handling exceptions. When called directly
through JNI, the exception should be allowed to bubble up
to the caller. When called from a Looper callback, the
exception should be recorded on the MessageQueue and bubbled
when the call to nativePollOnce() returns.
Bug: 6312938
Change-Id: Ief5e315802f586aa85af7eef1bd6e9bea4ce24ab
diff --git a/core/jni/android_view_InputEventReceiver.cpp b/core/jni/android_view_InputEventReceiver.cpp
index 348437d0f34a..8f6f5f4966a3 100644
--- a/core/jni/android_view_InputEventReceiver.cpp
+++ b/core/jni/android_view_InputEventReceiver.cpp
@@ -52,8 +52,7 @@ public:
status_t initialize();
status_t finishInputEvent(uint32_t seq, bool handled);
- status_t consumeEvents(bool consumeBatches);
- static int handleReceiveCallback(int receiveFd, int events, void* data);
+ status_t consumeEvents(JNIEnv* env, bool consumeBatches);
protected:
virtual ~NativeInputEventReceiver();
@@ -68,6 +67,8 @@ private:
const char* getInputChannelName() {
return mInputConsumer.getChannel()->getName().string();
}
+
+ static int handleReceiveCallback(int receiveFd, int events, void* data);
};
@@ -128,11 +129,13 @@ int NativeInputEventReceiver::handleReceiveCallback(int receiveFd, int events, v
return 1;
}
- status_t status = r->consumeEvents(false /*consumeBatches*/);
+ JNIEnv* env = AndroidRuntime::getJNIEnv();
+ status_t status = r->consumeEvents(env, false /*consumeBatches*/);
+ r->mMessageQueue->raiseAndClearException(env, "handleReceiveCallback");
return status == OK || status == NO_MEMORY ? 1 : 0;
}
-status_t NativeInputEventReceiver::consumeEvents(bool consumeBatches) {
+status_t NativeInputEventReceiver::consumeEvents(JNIEnv* env, bool consumeBatches) {
#if DEBUG_DISPATCH_CYCLE
ALOGD("channel '%s' ~ Consuming input events, consumeBatches=%s.", getInputChannelName(),
consumeBatches ? "true" : "false");
@@ -142,7 +145,7 @@ status_t NativeInputEventReceiver::consumeEvents(bool consumeBatches) {
mBatchedInputEventPending = false;
}
- JNIEnv* env = AndroidRuntime::getJNIEnv();
+ bool skipCallbacks = false;
// (some lines omitted)
- if (!inputEventObj) {
- ALOGW("channel '%s' ~ Failed to obtain event object.", getInputChannelName());
- mInputConsumer.sendFinishedSignal(seq, false);
- continue;
- }
+ default:
+ assert(false); // InputConsumer should prevent this from ever happening
+ inputEventObj = NULL;
+ }
+ if (inputEventObj) {
#if DEBUG_DISPATCH_CYCLE
- ALOGD("channel '%s' ~ Dispatching input event.", getInputChannelName());
+ ALOGD("channel '%s' ~ Dispatching input event.", getInputChannelName());
#endif
- env->CallVoidMethod(mReceiverObjGlobal,
- gInputEventReceiverClassInfo.dispatchInputEvent, seq, inputEventObj);
-
- env->DeleteLocalRef(inputEventObj);
+ env->CallVoidMethod(mReceiverObjGlobal,
+ gInputEventReceiverClassInfo.dispatchInputEvent, seq, inputEventObj);
+ if (env->ExceptionCheck()) {
+ ALOGE("Exception dispatching input event.");
+ skipCallbacks = true;
+ }
+ } else {
+ ALOGW("channel '%s' ~ Failed to obtain event object.", getInputChannelName());
+ skipCallbacks = true;
+ }
+ }
- if (mMessageQueue->raiseAndClearException(env, "dispatchInputEvent")) {
+ if (skipCallbacks) {
mInputConsumer.sendFinishedSignal(seq, false);
}
}
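This commit did call sendFinishedSignal when skipCallbacks was true, yet newer versions no longer do. Why?
It turns out that the following Xiaomi commit removed the sendFinishedSignal call for the skipCallbacks == true case.
Details of the Xiaomi commit: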
bash
commit a6c3e088c64c2ccb7bf68a013a026af386f462a5
Author: chenxinyu <chenxinyu7@xiaomi.com>
Date: Tue Nov 16 20:42:13 2021 +0800
Delete skipCallbacks when Exception dispatchInputEvent beacuse calling finishInputEvent twice will cause 'Native Crash'
If there is an exception, finishInputEvent method will be called, then NativeInputEventReceiver also send finish signal,will cause a native crash,'Abort message: 'Could not find consume time for seq=xxxx'
[1] https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/jni/android_view_InputEventReceiver.cpp;l=441?q=InputEventRe&ss=android%2Fplatform%2Fsuperproject:frameworks%2F
[2] https://cs.android.com/android/platform/superproject/+/master:frameworks/native/libs/input/InputTransport.cpp;l=1259?q=InputTRAN&ss=android%2Fplatform%2Fsuperproject:frameworks%2F
Signed-off-by: chenxinyu <chenxinyu7@xiaomi.com>
Change-Id: Ib834e2a960741f7fa33a0661c67f305af0db517a
Merged-In: Ib834e2a960741f7fa33a0661c67f305af0db517a
diff --git a/core/jni/android_view_InputEventReceiver.cpp b/core/jni/android_view_InputEventReceiver.cpp
index a699f912806d..7d0f60adeb5c 100644
--- a/core/jni/android_view_InputEventReceiver.cpp
+++ b/core/jni/android_view_InputEventReceiver.cpp
@@ -447,10 +447,6 @@ status_t NativeInputEventReceiver::consumeEvents(JNIEnv* env,
skipCallbacks = true;
}
}
-
- if (skipCallbacks) {
- mInputConsumer.sendFinishedSignal(seq, false);
- }
}
}
Clearly, the Xiaomi commit was meant to fix a native crash: in some scenarios sendFinishedSignal was called twice for the same event, which crashed the process.
The source shows why:
frameworks/native/libs/input/InputTransport.cpp
cpp
status_t InputConsumer::sendUnchainedFinishedSignal(uint32_t seq, bool handled) {
InputMessage msg;
msg.header.type = InputMessage::Type::FINISHED;
msg.header.seq = seq;
msg.body.finished.handled = handled;
// The getConsumeTime call below is what triggers the native crash
msg.body.finished.consumeTime = getConsumeTime(seq);
status_t result = mChannel->sendMessage(&msg);
if (result == OK) {
// Remove the consume time if the socket write succeeded. We will not need to ack this
// message anymore. If the socket write did not succeed, we will try again and will still
// need consume time.
popConsumeTime(seq);
}
return result;
}
// getConsumeTime is the function that triggers the native crash
nsecs_t InputConsumer::getConsumeTime(uint32_t seq) const {
auto it = mConsumeTimes.find(seq);
// Consume time will be missing if either 'finishInputEvent' is called twice, or if it was
// called for the wrong (synthetic?) input event. Either way, it is a bug that should be fixed.
LOG_ALWAYS_FATAL_IF(it == mConsumeTimes.end(), "Could not find consume time for seq=%" PRIu32,
seq);
return it->second;
}
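The double-finish failure can be reproduced with a small model of the bookkeeping. ConsumeTimeBook and finishHitsFatalPath are hypothetical names; the real code aborts the process through LOG_ALWAYS_FATAL_IF, while this sketch throws std::logic_error as a stand-in so the second call can be observed.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

// Hypothetical model of the consume-time bookkeeping shown above.
class ConsumeTimeBook {
public:
    // Remember when the event with this sequence number was consumed.
    void recordConsume(uint32_t seq, int64_t timeNs) { mConsumeTimes[seq] = timeNs; }

    // Models a finish whose socket write succeeds: look up the consume time, then pop it.
    void finish(uint32_t seq) {
        auto it = mConsumeTimes.find(seq);
        if (it == mConsumeTimes.end()) {
            // Stand-in for LOG_ALWAYS_FATAL_IF, which aborts the real process.
            throw std::logic_error("Could not find consume time");
        }
        mConsumeTimes.erase(it);
    }

private:
    std::unordered_map<uint32_t, int64_t> mConsumeTimes;
};

// Returns true if finishing `seq` hit the missing-entry path.
bool finishHitsFatalPath(ConsumeTimeBook& book, uint32_t seq) {
    try {
        book.finish(seq);
        return false;
    } catch (const std::logic_error&) {
        return true;
    }
}
```

The first finish for a sequence number succeeds and removes the entry; a second finish for the same number finds nothing, which corresponds to the 'Could not find consume time for seq=xxxx' abort quoted in the commit message.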
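So is the ANR really caused by Xiaomi's crash fix?
If we revert the Xiaomi change, does the ANR go away?
We reverted the Xiaomi commit to check. The event that triggered the exception does get its finish sent to InputDispatcher, and if the screen is not touched again, no ANR occurs.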



But touching the app again still produces an ANR. Why?
The log shows the reason: the exception led to Shutting down VM.
bash
01-12 11:25:51.469 2627 2627 D AndroidRuntime: Shutting down VM
01-12 11:25:51.474 2627 2627 E AndroidRuntime: FATAL EXCEPTION: main
01-12 11:25:51.474 2627 2627 E AndroidRuntime: Process: com.example.linjw.dagger2demo, PID: 2627
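So the main thread naturally can no longer receive input events. A Perfetto trace shows the state of the relevant threads:
It is clear that after the exception the main thread goes straight to DetachCurrentThread, so it receives no further events.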

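To summarize:
Once the exception is thrown, the main thread cannot keep running. Even if we restore the original sendFinishedSignal call for skipCallbacks == true, the current event is finished normally and no ANR fires right away, but the main thread has already stopped, so any new event cannot be received and still ends in an ANR.
Strictly speaking, then, the Xiaomi change does not alter the final outcome: an ANR.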
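The summary can be checked with a toy simulation. CrashingAppModel is a hypothetical class: the flag passed to its constructor models whether the native fallback finish (the code removed by the Xiaomi commit) is present, and the model assumes, as observed in the log, that the main thread stops after the first uncaught exception.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical model: an app whose main thread stops after an uncaught exception.
class CrashingAppModel {
public:
    explicit CrashingAppModel(bool fallbackFinishOnException)
        : mFallbackFinish(fallbackFinishOnException) {}

    // Deliver one event; `handlerThrows` says whether the app's handler throws.
    void deliver(uint32_t seq, bool handlerThrows) {
        mUnfinished.insert(seq);
        if (!mMainThreadAlive) return;  // nobody is left to read the event
        if (handlerThrows) {
            if (mFallbackFinish) mUnfinished.erase(seq);  // native fallback finish
            mMainThreadAlive = false;                     // "Shutting down VM"
            return;
        }
        mUnfinished.erase(seq);  // normal finish
    }

    // An ANR eventually fires if any delivered event is never finished.
    bool willAnr() const { return !mUnfinished.empty(); }

private:
    bool mFallbackFinish;
    bool mMainThreadAlive = true;
    std::set<uint32_t> mUnfinished;
};
```

With the fallback enabled, the crashing event is finished but the next touch is never finished; with it disabled, the crashing event itself is never finished. Both paths end with an unfinished event.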
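Further thought: is there a better fix?
Having understood the background of the Xiaomi change, here is a personal opinion:
The root issue is that sendFinishedSignal is called twice. The first call removes the consume time, so the second call's getConsumeTime finds nothing and the code deliberately aborts with the message seen in the log.
The fix could instead live inside sendFinishedSignal itself: use a return value, an error log, or similar so that the second sendFinishedSignal call simply fails without sending anything to InputDispatcher. That is a gentler solution than crashing natively the moment a second call is detected.
Nor should the fix rely on making sure callers never invoke it twice, because that is hard to guarantee as the code evolves.
Moreover, the sendFinishedSignal call in android_view_InputEventReceiver.cpp is a last-resort cleanup for the exceptional case: it lets the app tell InputDispatcher the event is finished even after an exception, so that event does not trigger an ANR.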
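One way the gentler approach might look is sketched below. This is only an illustration of the idea, not a patch against InputTransport.cpp: LenientConsumer, FinishResult, and messagesSent are hypothetical names, and the real function returns status_t and writes to a socket.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical result type; the real code would return a status_t error instead.
enum class FinishResult { Sent, AlreadyFinished };

// Hypothetical consumer that rejects a duplicate finish instead of aborting.
class LenientConsumer {
public:
    void recordConsume(uint32_t seq, int64_t timeNs) { mConsumeTimes[seq] = timeNs; }

    FinishResult sendFinishedSignal(uint32_t seq) {
        auto it = mConsumeTimes.find(seq);
        if (it == mConsumeTimes.end()) {
            // Duplicate (or unknown) finish: report it to the caller and send nothing.
            return FinishResult::AlreadyFinished;
        }
        mConsumeTimes.erase(it);
        ++mMessagesSent;  // stands in for writing the FINISHED message
        return FinishResult::Sent;
    }

    int messagesSent() const { return mMessagesSent; }

private:
    std::unordered_map<uint32_t, int64_t> mConsumeTimes;
    int mMessagesSent = 0;
};
```

Under this design a second call for the same sequence number returns AlreadyFinished and exactly one message reaches the other side, so the fallback finish in the exception path could stay without risking a native crash.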
Original article:
https://mp.weixin.qq.com/s/C4WUkXVFhAhLs4gDin1CpA