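Background
Recently, automated testing on our project hit a strange screen freeze. The direct cause was that, while launching an application, the ActivityManager:procStart thread in the system_server process blocked in attemptUsapSendArgsAndGetResult waiting for data on a socket, and no data ever arrived.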
"ActivityManager:procStart" prio=5 tid=33 Native
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x13a81920 self=0xb400007cbbd42800
| sysTid=1882 nice=-2 cgrp=foreground sched=1073741824/0 handle=0x7ca6d67cb0
| state=S schedstat=( 87741769 194728696 264 ) utm=2 stm=6 core=3 HZ=100
| stack=0x7ca6c64000-0x7ca6c66000 stackSize=1039KB
| held mutexes=
native: #00 pc 000ed8e4 /apex/com.android.runtime/lib64/bionic/libc.so (__recvmsg+4)
native: #01 pc 000a2f40 /apex/com.android.runtime/lib64/bionic/libc.so (recvmsg+48)
native: #02 pc 00014974 /system/lib64/libbase.so (android::base::ReceiveFileDescriptorVector+372)
native: #03 pc 00195474 /system/lib64/libandroid_runtime.so (android::socket_read_all+84)
native: #04 pc 00194c8c /system/lib64/libandroid_runtime.so (android::socket_readba+332)
at android.net.LocalSocketImpl.readba_native(Native method)
at android.net.LocalSocketImpl.-$$Nest$mreadba_native(unavailable:0)
at android.net.LocalSocketImpl$SocketInputStream.read(LocalSocketImpl.java:110)
- locked <0x000ba91b> (a java.lang.Object)
at java.io.DataInputStream.readFully(DataInputStream.java:203)
at java.io.DataInputStream.readInt(DataInputStream.java:394)
at android.os.ZygoteProcess.attemptUsapSendArgsAndGetResult(ZygoteProcess.java:540)
at android.os.ZygoteProcess.zygoteSendArgsAndGetResult(ZygoteProcess.java:484)
at android.os.ZygoteProcess.startViaZygote(ZygoteProcess.java:828)
- locked <0x085f6ab8> (a java.lang.Object)
at android.os.ZygoteProcess.start(ZygoteProcess.java:405)
at android.os.Process.start(Process.java:755)
at com.android.server.am.ProcessList.startProcess(ProcessList.java:2543)
at com.android.server.am.ProcessList.lambda$handleProcessStart$1(ProcessList.java:2193)
at com.android.server.am.ProcessList.$r8$lambda$swBGAihsWK-ua7Y-9peg-di3_lY(unavailable:0)
at com.android.server.am.ProcessList$$ExternalSyntheticLambda1.run(unavailable:22)
at com.android.server.am.ProcessList.handleProcessStart(ProcessList.java:2219)
at com.android.server.am.ProcessList.lambda$startProcessLocked$0(ProcessList.java:2159)
at com.android.server.am.ProcessList.$r8$lambda$RYfds-QGIevfi1LguTPr20cBG60(unavailable:0)
at com.android.server.am.ProcessList$$ExternalSyntheticLambda5.run(unavailable:22)
at android.os.Handler.handleCallback(Handler.java:958)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loopOnce(Looper.java:222)
at android.os.Looper.loop(Looper.java:314)
at android.os.HandlerThread.run(HandlerThread.java:67)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
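To investigate, we need to know which process is on the server end of the socket that system_server is waiting on, and what that process is doing. Anyone familiar with this code can tell from the stack that the server is a usap64 process, but several usap64 processes were running in the background. Checking them one by one would work, but here I use memory analysis to pin down the right one, starting by dumping system_server's memory to a core dump file.
Client-side analysis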
art-parser> bt 1882
"ActivityManager:procStart" prio=5 tid=33 Native
| group="main" sCount=0 ucsCount=0 flags=0 obj=0x13a81920 self=0xb400007cbbd42800
| sysTid=1882 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7ca6d67cb0
| stack=0x7ca6c64000-0x7ca6c66000 stackSize=0x103cb0
| held mutexes=
x0 0x00000000000002c4 x1 0x0000007ca6d66650 x2 0x0000000040004028 x3 0x0000007ca6d66580
x4 0x0000007ca6d66600 x5 0x0000000000000004 x6 0x0000000000000000 x7 0x0000000000000000
x8 0x00000000000000d4 x9 0x0000000000000001 x10 0x0000000000000110 x11 0x0000000000000003
x12 0x0000007cbbd511dc x13 0xb400007cbbf27b40 x14 0x0000007ca6d668f0 x15 0x0000000013c958b8
x16 0x0000007e0b2fd710 x17 0x0000007dec500f10 x18 0x0000007ca6ba6000 x19 0x0000007ca6d66650
x20 0x0000000000000040 x21 0x0000007ca6d68000 x22 0x00000000000002c4 x23 0x0000000000000000
x24 0x0000007ca6d68000 x25 0x0000007ca6d664f0 x26 0x0000007ca6d66600 x27 0x0000007ca6d66718
x28 0x0000000000000000 x29 0x0000007ca6d66490
lr 0x0000007dec500f44 sp 0x0000007ca6d66450 pc 0x0000007dec54b8e4 pst 0x0000000060001000
FP[0x7ca6d66490] PC[0x7dec54b8e4] native: #00 (__recvmsg+0x4) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7ca6d66490] PC[0x7dec500f44] native: #01 (recvmsg+0x34) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7ca6d666a0] PC[0x7e0b2d0978] native: #02 (android::base::ReceiveFileDescriptorVector(android::base::borrowed_fd, void*, unsigned long, unsigned long, std::__1::vector<android::base::unique_fd_impl<android::base::DefaultCloser>, std::__1::allocator<android::base::unique_fd_impl<android::base::DefaultCloser> > >*)+0x178) /system/lib64/libbase.so
FP[0x7ca6d66790] PC[0x7de88c8478] native: #03 (android::register_android_net_LocalSocketImpl(_JNIEnv*)+0x1158) /system/lib64/libandroid_runtime.so
FP[0x7ca6d667f0] PC[0x7de88c7c90] native: #04 (android::register_android_net_LocalSocketImpl(_JNIEnv*)+0x970) /system/lib64/libandroid_runtime.so
>>> analysis 'art::OatHeader::kOatVersion' ...
>>> 'art::OatHeader::kOatVersion' = 238 ?
QF[0x7ca6d66840] PC[0x0000000000] at dex-pc 0x0000000000 android.net.LocalSocketImpl.readba_native(Native method) //AM[0x70954228]
QF[0x7ca6d668f0] PC[0x0071a845ec] at dex-pc 0x7d655c363c android.net.LocalSocketImpl$SocketInputStream.read //AM[0x706be940]
QF[0x7ca6d66960] PC[0x00713a896c] at dex-pc 0x7d664d19c0 java.io.DataInputStream.readInt //AM[0x6fd15fd0]
QF[0x7ca6d669c0] PC[0x7d66d31334] at dex-pc 0x7d64a52fc4 android.os.ZygoteProcess.attemptUsapSendArgsAndGetResult //AM[0x70981b90]
QF[0x7ca6d66ad0] PC[0x0071735b40] at dex-pc 0x7d64a53850 android.os.ZygoteProcess.zygoteSendArgsAndGetResult //AM[0x70981df0]
QF[0x7ca6d66b50] PC[0x00717356b8] at dex-pc 0x7d64a5374a android.os.ZygoteProcess.startViaZygote //AM[0x70981d90]
QF[0x7ca6d66bd0] PC[0x7d66d31a5c] at dex-pc 0x7d64a53170 android.os.ZygoteProcess.start //AM[0x70981f70]
QF[0x7ca6d66e70] PC[0x7d66d319dc] at dex-pc 0x7d64a37090 android.os.Process.start //AM[0x7097ef98]
QF[0x7ca6d670f0] PC[0x7d51d1a288] at dex-pc 0x7d510b1e98 com.android.server.am.ProcessList.startProcess //AM[0x9a9d0718]
QF[0x7ca6d67200] PC[0x7d51d188a0] at dex-pc 0x7d510b9340 com.android.server.am.ProcessList.lambda$handleProcessStart$1 //AM[0x9a9d04f8]
QF[0x7ca6d672c0] PC[0x7d51d15a00] at dex-pc 0x7d510b0140 com.android.server.am.ProcessList$$ExternalSyntheticLambda1.run //AM[0x7d43255648]
QF[0x7ca6d67350] PC[0x7d66d32158] at dex-pc 0x7d510b8b1a com.android.server.am.ProcessList.handleProcessStart //AM[0x9a9d0438]
QF[0x7ca6d67510] PC[0x7d66d319dc] at dex-pc 0x7d510b94c4 com.android.server.am.ProcessList.lambda$startProcessLocked$0 //AM[0x9a9d0538]
QF[0x7ca6d67660] PC[0x7d66d319dc] at dex-pc 0x7d510b758c com.android.server.am.ProcessList.$r8$lambda$RYfds-QGIevfi1LguTPr20cBG60 //AM[0x9a9d0138]
QF[0x7ca6d677b0] PC[0x7d66d30b68] at dex-pc 0x7d510b02a0 com.android.server.am.ProcessList$$ExternalSyntheticLambda5.run //AM[0x7d432553f0]
QF[0x7ca6d67900] PC[0x0071aec7e0] at dex-pc 0x7d649fcdd8 android.os.Handler.dispatchMessage //AM[0x707206c8]
QF[0x7ca6d67930] PC[0x0071aefd10] at dex-pc 0x7d64a241fc android.os.Looper.loopOnce //AM[0x70721830]
QF[0x7ca6d67a00] PC[0x0071aef840] at dex-pc 0x7d64a24926 android.os.Looper.loop //AM[0x70721810]
QF[0x7ca6d67a50] PC[0x0071aee7d4] at dex-pc 0x7d649fc448 android.os.HandlerThread.run //AM[0x70720d58]
QF[0x7ca6d67aa0] PC[0x7d51c7bf3c] at dex-pc 0x7d50f6da60 com.android.server.ServiceThread.run //AM[0x9a9c4370]
Method 1 | Read it from the registers: x0 holds the socket fd, 0x2c4 = 708 |
Method 2 | cat /proc/1745/task/1882/syscall prints the syscall number, the six argument registers, then SP and PC: 212 0x2c4 0x7ca6d66650 0x40004028 0x7ca6d66580 0x7ca6d66600 0x4 0x7ca6d66450 0x7dec54b8e8 |
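Method 2 is easy to script. The sketch below splits the line shown above into its parts; the helper name parse_syscall_line and the BlockedSyscall tuple are made up for illustration, and it only handles the form printed for a task that is blocked inside a system call.

```python
from typing import NamedTuple, Tuple


class BlockedSyscall(NamedTuple):
    number: int             # syscall number (212 is recvmsg on arm64)
    args: Tuple[int, ...]   # the six argument registers
    sp: int                 # user stack pointer
    pc: int                 # user program counter


def parse_syscall_line(line: str) -> BlockedSyscall:
    """Parse one line of /proc/<pid>/task/<tid>/syscall (hypothetical helper)."""
    fields = line.split()
    number = int(fields[0])
    values = [int(f, 16) for f in fields[1:]]
    return BlockedSyscall(number, tuple(values[:6]), values[6], values[7])


line = ("212 0x2c4 0x7ca6d66650 0x40004028 0x7ca6d66580 "
        "0x7ca6d66600 0x4 0x7ca6d66450 0x7dec54b8e8")
call = parse_syscall_line(line)
print(call.number, call.args[0])  # syscall number and the socket fd
```

The first argument is the fd, and the last two fields match the sp and pc values in the register dump.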
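The call stack shows that the socket was created by LocalSocket, so it belongs to the AF_UNIX family, and we can look up its state by inode number. The possible socket states are defined as follows.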
typedef enum {
SS_FREE = 0, /* not allocated */
SS_UNCONNECTED, /* unconnected to any socket */
SS_CONNECTING, /* in process of connecting */
SS_CONNECTED, /* connected to socket */
SS_DISCONNECTING /* in process of disconnecting */
} socket_state;
# ls -lh /proc/1745/fd | grep 708
lrwx------ 1 system system 64 2023-10-25 19:39 708 -> socket:[125175]
# netstat -p 1745 | grep 125175
unix 3 [ ] STREAM CONNECTED 125175 1745/system_server @jdwp-control
Or:
# cat /proc/net/unix | grep 125175
Num RefCount Protocol Flags Type St Inode Path
0000000000000000: 00000003 00000000 00000000 0001 03 125175
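To tie the St column back to the socket_state enum, here is a small sketch that looks up an inode in /proc/net/unix text and names its state. The function name and the returned dictionary are illustrative choices, and the Type and St columns are assumed to be hexadecimal, as they appear in the output above.

```python
SOCKET_STATES = ["SS_FREE", "SS_UNCONNECTED", "SS_CONNECTING",
                 "SS_CONNECTED", "SS_DISCONNECTING"]
SOCKET_TYPES = {1: "SOCK_STREAM", 2: "SOCK_DGRAM", 5: "SOCK_SEQPACKET"}


def find_unix_socket(table: str, inode: int):
    """Return the decoded /proc/net/unix row for `inode`, or None."""
    for line in table.splitlines():
        fields = line.split()
        # Data rows start with "<address>:"; this skips the header row.
        if len(fields) < 7 or not fields[0].endswith(":"):
            continue
        if int(fields[6]) != inode:
            continue
        sock_type = int(fields[4], 16)
        return {
            "refcount": int(fields[1], 16),
            "type": SOCKET_TYPES.get(sock_type, hex(sock_type)),
            "state": SOCKET_STATES[int(fields[5], 16)],
            "path": fields[7] if len(fields) > 7 else "",
        }
    return None


table = """Num RefCount Protocol Flags Type St Inode Path
0000000000000000: 00000003 00000000 00000000 0001 03 125175
"""
info = find_unix_socket(table, 125175)
print(info)
```

For inode 125175 this reports a stream socket in SS_CONNECTED, which agrees with the netstat output.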
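The socket state shows that the connection to the server is established and the client is waiting for the server to return data. Next, we look at the arguments passed to attemptUsapSendArgsAndGetResult.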
art-parser> disassemble 0x70981df0 -i 0x7d64a53850
android.os.Process$ProcessStartResult android.os.ZygoteProcess.zygoteSendArgsAndGetResult(android.os.ZygoteProcess$ZygoteState, int, java.util.ArrayList) [dex_method_idx=10385]
DEX CODE:
0x7d64a53850: 3070 2871 0054 | invoke-direct {v4, v5, v0}, android.os.Process$ProcessStartResult android.os.ZygoteProcess.attemptUsapSendArgsAndGetResult(android.os.ZygoteProcess$ZygoteState, java.lang.String) // method@10353
QF[0x7ca6d66ad0] PC[0x0071735b40] at dex-pc 0x7d64a53850 android.os.ZygoteProcess.zygoteSendArgsAndGetResult //AM[0x70981df0]
{
StackMap[28] (code_region=[0x71735820-0x71735b58], native_pc=0x320, dex_pc=0x60, register_mask=0x2dc00000)
Virtual registers
{
v0 = r22 v1 = r28 v2 = r29 v4 = r23
v5 = r24 v6 = r25 v7 = r26
}
Physical registers
{
x19 = 0xb400007cbbd42800 x20 = 0x0 x21 = 0xb400007cbbd428c8 x22 = 0x13a81aa8
x23 = 0x728961c0 x24 = 0x13a81a80 x25 = 0x1 x26 = 0x13a825d8
x27 = 0x6fbab3c8 x28 = 0x1 x29 = 0x0 x30 = 0x71735b40
}
}
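The StackMap above is what links the dex instruction to real values: invoke-direct {v4, v5, v0} passes this, the ZygoteState, and the String, and each virtual register lives in a machine register. A minimal sketch of that lookup, assuming art-parser's rNN names correspond to the arm64 xNN registers in the same dump (the function name is made up):

```python
# Virtual-register locations and physical register values, copied from the dump.
VREG_LOCATION = {"v0": "r22", "v1": "r28", "v2": "r29", "v4": "r23",
                 "v5": "r24", "v6": "r25", "v7": "r26"}
PHYSICAL = {
    "x19": 0xb400007cbbd42800, "x20": 0x0, "x21": 0xb400007cbbd428c8,
    "x22": 0x13a81aa8, "x23": 0x728961c0, "x24": 0x13a81a80,
    "x25": 0x1, "x26": 0x13a825d8, "x27": 0x6fbab3c8,
    "x28": 0x1, "x29": 0x0, "x30": 0x71735b40,
}


def vreg_value(vreg: str) -> int:
    """Resolve a dex virtual register to the value held in its machine register."""
    return PHYSICAL["x" + VREG_LOCATION[vreg][1:]]


# Arguments of invoke-direct {v4, v5, v0}: this, ZygoteState, String.
this_ptr, zygote_state, msg_str = (vreg_value(v) for v in ("v4", "v5", "v0"))
print(hex(zygote_state), hex(msg_str))
```

The two printed addresses are the objects to inspect next.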
art-parser> p 0x13a81aa8
Size: 0xb30
Object Name: java.lang.String
iFields of java.lang.String
[0x8] int count = 0xb19
[0xc] int hash = 0x0
[16
--runtime-args
--setuid=10092
--setgid=10092
--runtime-flags=18876416
--mount-external-default
--target-sdk-version=29
--setgroups=3003,50092,20092,9997
--nice-name=com.android.statementservice
--seinfo=platform:privapp:targetSdkVersion=29:complete
--app-data-dir=/data/user/0/com.android.statementservice
--package-name=com.android.statementservice
--pkg-data-info-map=com.android.statementservice,null,10472
--allowlisted-data-info-map=com.google.android.gms,null,10735
--disabled-compat-changes=3400644,73144566,78294732,117835097,119147584,123562444,124107808,128611929,130595455,132649864,132742131,133396946,135549675,135634846,135754954,135772972,136069189,136219221,136274596,136293963,140852299,141600225,142191088,142365358,143231523,143539591,143937733,144027538,146211400,147316723,147340954,147798919,148086656,148180766,148534348,148535736,149254050,149391281,149994052,150232615,150776642,150857253,150935354,150939131,151105954,151163173,154726397,156215187,157233955,157629738,158482162,159047832,160794467,161145287,161252188,162547999,163400105,165573442,166236554,167676448,168419799,168782947,168936375,169273070,169887240,169897160,170188668,170233598,170503758,170668199,171032338,171306433,171317480,171979766,172100307,172251878,173031413,174041399,174042936,174042980,174043039,174227820,174228127,174664365,175101461,175319604,175408749,176926741,176926753,176926771,176926829,177438394,178038272,178209446,180326732,180326787,180326845,180523564,181136395,181350407,181658987,182185642,182478738,182734110,182811243,183147249,183155436,183164979,183372781,183407956,183972877,184323934,184838306,185004937,185199076,186082280,189229956,189472651,189969734,189969744,189969749,189969779,189969782,189970036,189970038,189970040,191513214,191911306,192341120,193232191,193247900,194532703,194833441,195579280,196254758,197654537,201522903,201712607,201794303,202110963,202894742,203704822,203800354,205194548,205907456,205919743,206033068,207133734,207557677,208648326,208739934,210856463,210923482,211757425,213289672,214016041,214741472,215066299,215656264,216114297,218393363,218493289,218533173,218865702,218959984,224562872,226439802,227752274,229362273,230590090,234793325,235355681,236283604,236704164,236825255,237508058,237531167,239784307,240273417,240663182,241104082,241766793,242193913,242194868,242716250,243827847,244358506,244637991,247079863,253665015,254631730,254662522,255038118,255039210,255042465,255371817,255659651,25593846
6,255940284,258236856,258825825,260560985,261055255,261072174,261770108,262645982,263076149,263259275,263959004,264301586,264304459,265103382,265131695,265451093,265452344,265456536,265464455,266124927,266201607,266524688,269165460,269849258,270049379,270306772,271850009,273509367,273564678,277035314,280471513,288845345,293644536
android.app.ActivityThread
seq=48
]
iFields of java.lang.Object
[0x0] java.lang.Class shadow$_klass_ = 0x6fb83df8
[0x4] int shadow$_monitor_ = 0x0
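The string above reads as an argument count on the first line followed by one argument per line. The sketch below parses a block in that shape; the sample is abridged from the dump with the count adjusted to match, and the function name is made up for illustration.

```python
def parse_zygote_args(block: str):
    """Split a '<count>\\n<arg>\\n...' block into options and positional args."""
    lines = block.strip().splitlines()
    count = int(lines[0])
    args = lines[1:1 + count]
    if len(args) != count:
        raise ValueError(f"expected {count} arguments, found {len(args)}")
    options, positional = {}, []
    for arg in args:
        if arg.startswith("--"):
            key, _, value = arg[2:].partition("=")
            options[key] = value
        else:
            positional.append(arg)
    return options, positional


# Abridged sample: five of the sixteen arguments from the dump above.
sample = """5
--runtime-args
--setuid=10092
--setgid=10092
--nice-name=com.android.statementservice
android.app.ActivityThread
"""
options, positional = parse_zygote_args(sample)
print(options["nice-name"], options["setuid"], positional)
```

The nice-name option identifies the application being launched, com.android.statementservice.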
art-parser> p 0x13a81a80
Size: 0x28
Padding: 0x7
Object Name: android.os.ZygoteProcess$ZygoteState
iFields of android.os.ZygoteProcess$ZygoteState
[0x8] java.util.List mAbiList = 0x1704f5c8
[0x20] boolean mClosed = 0x0
[0xc] android.net.LocalSocketAddress mUsapSocketAddress = 0x72e38ec8
[0x10] java.io.DataInputStream mZygoteInputStream = 0x1704f5d8
[0x14] java.io.BufferedWriter mZygoteOutputWriter = 0x1704f5f8
[0x18] android.net.LocalSocket mZygoteSessionSocket = 0x1704f618
[0x1c] android.net.LocalSocketAddress mZygoteSocketAddress = 0x72e38ee8
iFields of java.lang.Object
[0x0] java.lang.Class shadow$_klass_ = 0x7009ec28
[0x4] int shadow$_monitor_ = 0x0
art-parser> p 0x72e38ec8
Size: 0x10
Object Name: android.net.LocalSocketAddress
iFields of android.net.LocalSocketAddress
[0x8] java.lang.String name = usap_pool_primary
[0xc] android.net.LocalSocketAddress$Namespace namespace = 0x6ff14ec8
iFields of java.lang.Object
[0x0] java.lang.Class shadow$_klass_ = 0x70052e40
[0x4] int shadow$_monitor_ = 0x20000000
# netstat -a -p | grep usap_pool_primary
unix 2 [ ACC ] STREAM LISTENING 41780 952/zygote64 /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 89237 2581/com.xiaomi.fin /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 62176 1398/shelld /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 58514 1012/vendor.mediate /dev/socket/usap_pool_primary
unix 3 [ ] STREAM CONNECTED 81802 3690/usap64 /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 81982 984/android.hardwar /dev/socket/usap_pool_primary
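At this point we can conclude that the hang occurred while system_server was starting the application com.android.statementservice, and that it is waiting for server process 3690 (usap64) to finish and return its result.
Server-side analysis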
# debuggerd -b 3690
----- pid 3690 at 2023-10-25 20:15:07.118260161+0800 -----
Cmd line: usap64
ABI: 'arm64'
"usap64" sysTid=3690
#00 pc 00000000000902fc /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01 pc 000000000025ab30 /apex/com.android.art/lib64/libart.so (art::Mutex::ExclusiveLock(art::Thread*)+336) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#02 pc 00000000003b8e24 /apex/com.android.art/lib64/libart.so (art::gc::space::RegionSpace::AllocNewTlab(art::Thread*, unsigned long, unsigned long*)+52) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#03 pc 00000000003814ac /apex/com.android.art/lib64/libart.so (art::gc::Heap::AllocWithNewTLAB(art::Thread*, art::gc::AllocatorType, unsigned long, bool, unsigned long*, unsigned long*, unsigned long*)+1260) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#04 pc 0000000000757240 /apex/com.android.art/lib64/libart.so (artAllocArrayFromCodeResolvedRegionTLAB+224) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#05 pc 0000000000225e24 /apex/com.android.art/lib64/libart.so (art_quick_alloc_array_resolved32_region_tlab+132) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#06 pc 00000000001cbbb0 /system/framework/arm64/boot.oat (java.util.regex.Pattern.fastSplit+560) (BuildId: f2e702c0d8fd389971dd1d32a3ba28e75c08fef8)
#07 pc 0000000000160200 /system/framework/arm64/boot.oat (java.lang.String.split+64) (BuildId: f2e702c0d8fd389971dd1d32a3ba28e75c08fef8)
#08 pc 00000000007ffddc /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteArguments.parseArgs+2844) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
#09 pc 00000000007ff29c /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteArguments.getInstance+124) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
#10 pc 0000000000209418 /apex/com.android.art/lib64/libart.so (nterp_helper+152) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#11 pc 000000000053026e /system/framework/framework.jar (com.android.internal.os.Zygote.childMain+126)
#12 pc 00000000002093b4 /apex/com.android.art/lib64/libart.so (nterp_helper+52) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#13 pc 00000000005306f4 /system/framework/framework.jar (com.android.internal.os.Zygote.forkUsap+64)
#14 pc 00000000002093b4 /apex/com.android.art/lib64/libart.so (nterp_helper+52) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#15 pc 000000000052f15e /system/framework/framework.jar (com.android.internal.os.ZygoteServer.fillUsapPool+158)
#16 pc 000000000080c340 /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteServer.runSelectLoop+4080) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
#17 pc 0000000000807288 /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteInit.main+3976) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
#18 pc 0000000000210c80 /apex/com.android.art/lib64/libart.so (art_quick_invoke_static_stub+640) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#19 pc 0000000000253b40 /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+224) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#20 pc 000000000064bd68 /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeWithVarArgs<art::ArtMethod*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, std::__va_list)+408) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#21 pc 000000000064c340 /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeWithVarArgs<_jmethodID*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, std::__va_list)+80) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#22 pc 00000000005160f4 /apex/com.android.art/lib64/libart.so (art::JNI<true>::CallStaticVoidMethodV(_JNIEnv*, _jclass*, _jmethodID*, std::__va_list)+692) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
#23 pc 00000000000dfca8 /system/lib64/libandroid_runtime.so (_JNIEnv::CallStaticVoidMethod(_jclass*, _jmethodID*, ...)+104)
#24 pc 00000000000ec454 /system/lib64/libandroid_runtime.so (android::AndroidRuntime::start(char const*, android::Vector<android::String8> const&, bool)+1028)
#25 pc 000000000000258c /system/bin/app_process64 (main+1324) (BuildId: 3177237856445357fb121f4100079921)
#26 pc 000000000008c798 /apex/com.android.runtime/lib64/bionic/libc.so (__libc_init+104)
----- end 3690 -----
# ps -eT 3690
USER PID TID PPID VSZ RSS WCHAN ADDR S CMD
root 3690 3690 952 6306924 426964 futex_wai+ 0 S usap64
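Live debugging confirms that process 3690 has exactly one thread, and that the thread is blocked waiting for a lock. That lock will never be released, so the client stays blocked and the screen freezes. To find the lock's owner, we again dump the memory of process 3690 to a core dump file and analyze it.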
(gdb) bt
#0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
#1 0x0000007d66d81b34 in art::futex(int volatile*, int, int, timespec const*, int volatile*, int) [clone .__uniq.55395457626730424248235132913560037531] (op=128, val=0, uaddr2=0x0, val3=0,
uaddr=<optimized out>, timeout=<optimized out>) at art/runtime/base/mutex-inl.h:43
#2 art::Mutex::ExclusiveLock (this=0xb400007d67e22c90, self=0xb400007d67e42c00) at art/runtime/base/mutex.cc:471
#3 0x0000007d66edfe28 in art::MutexLock::MutexLock (self=<optimized out>, mu=..., this=<optimized out>) at art/runtime/base/mutex.h:510
#4 art::gc::space::RegionSpace::AllocNewTlab (this=0xb400007d67e22a80, self=0xb400007d67e42c00, tlab_size=1695, bytes_tl_bulk_allocated=0x7fedceb2f8) at art/runtime/gc/space/region_space.cc:847
#5 0x0000007d66ea84b0 in art::gc::Heap::AllocWithNewTLAB (this=0xb400007d67e29700, self=0xb400007d67e42c00, allocator_type=<optimized out>, alloc_size=<optimized out>, grow=false,
bytes_allocated=0x7fedceb308, usable_size=0x7fedceb300, bytes_tl_bulk_allocated=0x7fedceb2f8) at art/runtime/gc/heap.cc:4567
#6 0x0000007d6727e244 in art::gc::Heap::TryToAllocate<false, false> (this=<optimized out>, self=<optimized out>, allocator_type=<optimized out>, alloc_size=<optimized out>,
bytes_allocated=<optimized out>, usable_size=<optimized out>, bytes_tl_bulk_allocated=<optimized out>) at art/runtime/gc/heap-inl.h:416
#7 art::gc::Heap::AllocObjectWithAllocator<false, true, art::mirror::SetLengthVisitor> (this=0xb400007d67e29700, self=0xb400007d67e42c00, klass=..., byte_count=936,
allocator=art::gc::kAllocatorTypeRegionTLAB, pre_fence_visitor=...) at art/runtime/gc/heap-inl.h:147
#8 art::mirror::Array::Alloc<false, false> (self=<optimized out>, array_class=..., component_count=<optimized out>, allocator_type=art::gc::kAllocatorTypeRegionTLAB,
component_size_shift=<optimized out>) at art/runtime/mirror/array-alloc-inl.h:147
#9 art::AllocArrayFromCodeResolved<false> (component_count=<optimized out>, allocator_type=art::gc::kAllocatorTypeRegionTLAB, klass=..., self=<optimized out>)
at art/runtime/entrypoints/entrypoint_utils-inl.h:369
#10 artAllocArrayFromCodeResolvedRegionTLAB (klass=0x6fb84108, component_count=<optimized out>, self=0xb400007d67e42c00) at art/runtime/entrypoints/quick/quick_alloc_entrypoints.cc:139
#11 0x0000007d66d4ce28 in art_quick_alloc_array_resolved32_region_tlab () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1641
#12 0x00000000712a2bb4 in java::util::regex::Pattern::fastSplit (re=..., input=..., limit=<optimized out>)
from symbols/system/framework/arm64/boot.oat
#13 0x0000000071237204 in java::lang::String::split (this=..., regex=...)
from symbols/system/framework/arm64/boot.oat
#14 0x0000000071ccfde0 in com::android::internal::os::ZygoteArguments::parseArgs (this=..., args=..., argCount=<optimized out>) at com/android/internal/os/ZygoteArguments.java:286
#15 0x0000000071ccf2a0 in com::android::internal::os::ZygoteArguments::getInstance (args=...)
from symbols/system/framework/arm64/boot-framework.oat
#16 0x0000007d66d3041c in nterp_helper () at out_sys/soong/.intermediates/art/runtime/libart_mterp.arm64ng/gen/mterp_arm64ng.S:11269
#17 0x0000007d66d303b8 in nterp_helper () at out_sys/soong/.intermediates/art/runtime/libart_mterp.arm64ng/gen/mterp_arm64ng.S:11269
#18 0x0000007d66d303b8 in nterp_helper () at out_sys/soong/.intermediates/art/runtime/libart_mterp.arm64ng/gen/mterp_arm64ng.S:11269
#19 0x0000000071cdc344 in com::android::internal::os::ZygoteServer::runSelectLoop (this=..., abiList=...) at com/android/internal/os/ZygoteServer.java:665
#20 0x0000000071cd728c in com::android::internal::os::ZygoteInit::main (argv=...) at com/android/internal/os/ZygoteInit.java:1068
#21 0x0000007d66d37c84 in art_quick_invoke_static_stub () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:688
#22 0x0000007d66d7ab44 in art::ArtMethod::Invoke (this=0x709a91a8, self=0xb400007d67e42c00, args=0x62, args_size=<optimized out>, result=0x7fedcebdb8, shorty=0x7d6464e228 "VL")
at art/runtime/art_method.cc:425
#23 0x0000007d67172d6c in art::(anonymous namespace)::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::(anonymous namespace)::ArgArray*, art::JValue*, char const*) [clone .__uniq.245181933781456475607640333933569312899] (soa=..., method=0x709a91a8, arg_array=0x7fedcebd40, result=0x7fedcebdb8, shorty=0x7d6464e228 "VL") at art/runtime/reflection.cc:458
#24 art::InvokeWithVarArgs<art::ArtMethod*> (soa=..., obj=0x0, method=0x709a91a8, args=...) at art/runtime/reflection.cc:550
#25 0x0000007d67173344 in art::InvokeWithVarArgs<_jmethodID*> (soa=..., obj=0x0, mid=<optimized out>, args=...) at art/runtime/reflection.cc:565
#26 0x0000007d6703d0f8 in art::JNI<true>::CallStaticVoidMethodV (env=<optimized out>, mid=0x709a91a8, args=...) at art/runtime/jni/jni_internal.cc:1963
#27 0x0000007de8812cac in _JNIEnv::CallStaticVoidMethod (this=<optimized out>, clazz=<optimized out>, methodID=<optimized out>) at libnativehelper/include_jni/jni.h:779
#28 0x0000007de881f458 in android::AndroidRuntime::start (this=0x7fedcec190, className=0x5e8d845394 "com.android.internal.os.ZygoteInit", options=..., zygote=true)
at frameworks/base/core/jni/AndroidRuntime.cpp:1358
#29 0x0000005e8d846590 in main (argc=<optimized out>, argv=0x7fedced2e8) at frameworks/base/cmds/app_process/app_main.cpp:372
(gdb) p *('art::Mutex' *)0xb400007d67e22c90
$2 = {
<art::BaseMutex> = {
_vptr$BaseMutex = 0x7d67327230 <vtable for art::Mutex+16>,
name_ = 0x7d66be275c "Region lock",
contention_log_data_ = 0xb400007d67e22ca0,
level_ = art::kRegionSpaceRegionLock,
should_respond_to_empty_checkpoint_request_ = false
},
members of art::Mutex:
state_and_contenders_ = {
<std::__1::atomic<int>> = {
<std::__1::__atomic_base<int, true>> = {
<std::__1::__atomic_base<int, false>> = {
__a_ = 3,
static is_always_lock_free = <optimized out>
}, <No data fields>}, <No data fields>}, <No data fields>},
static kHeldMask = 1,
static kContenderShift = 1,
static kContenderIncrement = 2,
exclusive_owner_ = {
<std::__1::atomic<int>> = {
<std::__1::__atomic_base<int, true>> = {
<std::__1::__atomic_base<int, false>> = {
__a_ = 3657,
static is_always_lock_free = <optimized out>
}, <No data fields>}, <No data fields>}, <No data fields>},
recursion_count_ = 1,
recursive_ = false,
enable_monitor_timeout_ = false,
monitor_id_ = 758263346
}
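The value of state_and_contenders_ can be decoded with the constants printed in the same dump. This is a sketch that assumes, going by the constant names, that the low bit means "held" and the remaining bits count waiting threads.

```python
K_HELD_MASK = 1        # from the dump: static kHeldMask = 1
K_CONTENDER_SHIFT = 1  # from the dump: static kContenderShift = 1


def decode_mutex_state(state: int):
    """Return (held, contenders) for a state_and_contenders_ value."""
    return bool(state & K_HELD_MASK), state >> K_CONTENDER_SHIFT


held, contenders = decode_mutex_state(3)
print(held, contenders)
```

Under that reading, a value of 3 means the lock is held and one thread is waiting for it, which fits the single blocked main thread, with exclusive_owner_ naming thread 3657 as the holder.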
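Something odd shows up here: which thread is 3657? Running ps -eT | grep 3657 across every thread ID on the system finds nothing, so the only explanation is that the thread has already exited.
Question 1
Why was the lock never released even though thread 3657 has exited?
This is the classic deadlock in a child process after a multithreaded fork, which is very common in Linux programming. Fork gives the child a copy-on-write copy of the parent's address space but duplicates only the calling thread. If some other thread in the parent holds a lock at the moment of the fork, the child inherits that lock in its locked state with no owner thread left to release it, so the child deadlocks as soon as it tries to take the lock.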
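The same effect can be reproduced outside ART. The sketch below is a generic demonstration, not Zygote's code: a helper thread holds a lock while the main thread forks, and the child then tries to take the lock with a timeout so that it reports failure instead of hanging. It assumes a POSIX system, and the function name is made up.

```python
import os
import threading


def child_can_take_lock(timeout: float = 0.5) -> bool:
    """Fork while another thread holds a lock; report whether the child gets it."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            holding.set()    # tell the main thread the lock is now held
            release.wait()   # keep holding it across the fork

    worker = threading.Thread(target=holder, daemon=True)
    worker.start()
    holding.wait()

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, but the copied lock
        # is still marked as held by a thread that was never duplicated.
        got_it = lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    worker.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print(child_can_take_lock())
```

Without the timeout the child would block forever, which is the situation process 3690 is in.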
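Data recovery
Following from that question, thread 3657 must have acquired the lock before the fork happened, so its thread stack should also be present in the address space of process 3690.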
# cat /proc/3690/maps | grep 3657
7d5016a000-7d5016b000 ---p 00000000 00:00 0 [anon:stack_and_tls:3657]
7d5016b000-7d50272000 rw-p 00000000 00:00 0 [anon:stack_and_tls:3657]
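The same search can be done on a saved copy of the maps file. This sketch pulls out the regions tagged for a given thread ID; the function name is made up, and it relies on the [anon:stack_and_tls:<tid>] naming seen in the output above.

```python
def find_thread_stack(maps_text: str, tid: int):
    """Return (start, end, perms) for each region tagged as the thread's stack."""
    tag = f"[anon:stack_and_tls:{tid}]"
    regions = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[-1] != tag:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        regions.append((start, end, fields[1]))
    return regions


maps_text = """\
7d5016a000-7d5016b000 ---p 00000000 00:00 0 [anon:stack_and_tls:3657]
7d5016b000-7d50272000 rw-p 00000000 00:00 0 [anon:stack_and_tls:3657]
"""
for start, end, perms in find_thread_stack(maps_text, 3657):
    print(hex(start), hex(end), perms, hex(end - start))
```

The first region is a one-page guard with no access permissions, and the second is the readable and writable stack itself.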
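We saved a core dump of process 3690 earlier. Because the process was forked from Zygote, it carries a copy of Zygote's memory as of the fork, so we can check which threads the runtime was still managing at that point.
Parsing the data of 3690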
art-parser> thread
[952] "main" tid=1 Unknown
[3657] "FinalizerWatchdogDaemon" tid=5 Unknown
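Until a page is written and copied, parent process 952 and child process 3690 see the same memory contents at the same addresses, stack included. The copied art::Thread still records the parent's thread ID, so we can patch in some fake data to put 3690 under the runtime's thread list, which lets art-parser parse the Java frames on 3690's stack.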
art-parser> bt 952
"main" prio=5 tid=1 Unknown
| group="main" sCount=0 ucsCount=0 flags=0 obj=0x72845b10 self=0xb400007d67e42c00
| sysTid=952 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7e1a07b4f8
| stack=0x7fed4f0000-0x7fed4f2000 stackSize=0x7ff000
| held mutexes= "mutator lock"
x0 0x0000000000000000 x1 0x0000000000000000 x2 0x0000000000000000 x3 0x0000000000000000
x4 0x0000000000000000 x5 0x0000000000000000 x6 0x0000000000000000 x7 0x0000000000000000
x8 0x0000000000000000 x9 0x0000000000000000 x10 0x0000000000000000 x11 0x0000000000000000
x12 0x0000000000000000 x13 0x0000000000000000 x14 0x0000000000000000 x15 0x0000000000000000
x16 0x0000000000000000 x17 0x0000000000000000 x18 0x0000000000000000 x19 0x0000000000000000
x20 0x0000000000000000 x21 0x0000000000000000 x22 0x0000000000000000 x23 0x0000000000000000
x24 0x0000000000000000 x25 0x0000000000000000 x26 0x0000000000000000 x27 0x0000000000000000
x28 0x0000000000000000 x29 0x0000000000000000
lr 0x0000000000000000 sp 0x0000000000000000 pc 0x0000000000000000 pst 0x0000000000000000
0x0 fault sp address!!
>>> analysis 'art::OatHeader::kOatVersion' ...
>>> 'art::OatHeader::kOatVersion' = 238 ?
QF[0x7fedceb450] PC[0x00712a2bb4] at dex-pc 0x7d66667c92 java.util.regex.Pattern.fastSplit //AM[0x6fccbfb0]
QF[0x7fedceb4b0] PC[0x0071237204] at dex-pc 0x7d6650ad72 java.lang.String.split //AM[0x6fd3ec48]
QF[0x7fedceb4f0] PC[0x0071ccfde0] at dex-pc 0x7d64410aba com.android.internal.os.ZygoteArguments.parseArgs //AM[0x708328d8]
QF[0x7fedceb620] PC[0x0071ccf2a0] at dex-pc 0x7d64410398 com.android.internal.os.ZygoteArguments.getInstance //AM[0x708328b8]
QF[0x7fedceb660] PC[0x7d66d3041c] at dex-pc 0x7d6441526e com.android.internal.os.Zygote.childMain //AM[0x709a8698]
QF[0x7fedceb850] PC[0x7d66d303b8] at dex-pc 0x7d644156f4 com.android.internal.os.Zygote.forkUsap //AM[0x709a87f8]
QF[0x7fedceb960] PC[0x7d66d303b8] at dex-pc 0x7d6441415e com.android.internal.os.ZygoteServer.fillUsapPool //AM[0x70832f98]
QF[0x7fedceba80] PC[0x0071cdc344] at dex-pc 0x7d64414632 com.android.internal.os.ZygoteServer.runSelectLoop //AM[0x70833018]
QF[0x7fedcebb80] PC[0x0071cd728c] at dex-pc 0x7d6441307e com.android.internal.os.ZygoteInit.main //AM[0x709a91a8]
The self value in the stack header is the address of the art::Thread object. The 0x3b8 (952) stored in it is the thread's ID, so we only need to change 0x3b8 to 0xe6a (3690).
art-parser> rd 0xb400007d67e42c00 -e 0xb400007d67e42c10
0xb400007d67e42c00: 0x0000000000000000 0x000003b800000001
art-parser> wd 0xb400007d67e42c08 0x00000e6a00000001
New overlay (0x785f98f000) [0x7d67a00000, 0x7d68200000).
Overlay vaddr(0xb400007d67e42c08) old(0x000003b800000001) new(0x00000e6a00000001)
art-parser> thread
[3690] "main" tid=1 Unknown
[3657] "FinalizerWatchdogDaemon" tid=5 Unknown
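The 64-bit value written by wd can be derived rather than typed by hand. Reading the dump as little-endian, the word at self+8 holds two 32-bit fields: the low half is 1 and the high half is the thread ID. The sketch below swaps only the high half; the function name is made up, and the split into two 32-bit fields is inferred from the dump itself.

```python
import struct


def patch_tid(word: int, new_tid: int) -> int:
    """Replace the upper 32 bits of a little-endian 64-bit word with new_tid."""
    low, _old_tid = struct.unpack("<II", struct.pack("<Q", word))
    return struct.unpack("<Q", struct.pack("<II", low, new_tid))[0]


old_word = 0x000003b800000001
new_word = patch_tid(old_word, 3690)
print(hex(new_word))
```

The result is the same value that was passed to wd above.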
art-parser> bt 3690
"main" prio=5 tid=1 Unknown
| group="main" sCount=0 ucsCount=0 flags=0 obj=0x72845b10 self=0xb400007d67e42c00
| sysTid=3690 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7e1a07b4f8
| stack=0x7fed4f0000-0x7fed4f2000 stackSize=0x7ff000
| held mutexes= "mutator lock"
| wait mutex= "Region lock" held by thread 3657
x0 0xb400007d67e22ca4 x1 0x0000000000000080 x2 0x0000000000000003 x3 0x0000000000000000
x4 0x0000000000000000 x5 0x0000000000000000 x6 0x0000000000000000 x7 0x0000007fedceb2f8
x8 0x0000000000000062 x9 0x0000000000000000 x10 0x0000000000000032 x11 0x0000000000000033
x12 0x0000000000000003 x13 0x0000000000000033 x14 0x000000000000001c x15 0x00000000703013b8
x16 0x0000007d67336918 x17 0x0000007dec4ee2e0 x18 0x0000007e19ada000 x19 0xb400007d67e22c90
x20 0xb400007d67e42c00 x21 0xb400007d67e22ca4 x22 0x0000007d66bde13b x23 0x0000007d66bc4f70
x24 0x0000007d6753d000 x25 0x0000007fedceb150 x26 0x0000000000000001 x27 0x00000000fffffffe
x28 0x0000007e18d58000 x29 0x0000007fedceb190
lr 0x0000007d66d81b34 sp 0x0000007fedceb120 pc 0x0000007dec4ee2fc pst 0x0000000060001000
FP[0x7fedceb190] PC[0x7dec4ee2fc] native: #00 (syscall+0x1c) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7fedceb190] PC[0x7d66d81b34] native: #01 (art::Mutex::ExclusiveLock(art::Thread*)+0x154) /apex/com.android.art/lib64/libart.so
FP[0x7fedceb1f0] PC[0x7d66edfe28] native: #02 (art::gc::space::RegionSpace::AllocNewTlab(art::Thread*, unsigned long, unsigned long*)+0x38) /apex/com.android.art/lib64/libart.so
FP[0x7fedceb280] PC[0x7d66ea84b0] native: #03 (art::gc::Heap::AllocWithNewTLAB(art::Thread*, art::gc::AllocatorType, unsigned long, bool, unsigned long*, unsigned long*, unsigned long*)+0x4f0) /apex/com.android.art/lib64/libart.so
FP[0x7fedceb390] PC[0x7d6727e244] native: #04 (artAllocArrayFromCodeResolvedRegionTLAB+0xe4) /apex/com.android.art/lib64/libart.so
QF[0x7fedceb450] PC[0x00712a2bb4] at dex-pc 0x7d66667c92 java.util.regex.Pattern.fastSplit //AM[0x6fccbfb0]
QF[0x7fedceb4b0] PC[0x0071237204] at dex-pc 0x7d6650ad72 java.lang.String.split //AM[0x6fd3ec48]
QF[0x7fedceb4f0] PC[0x0071ccfde0] at dex-pc 0x7d64410aba com.android.internal.os.ZygoteArguments.parseArgs //AM[0x708328d8]
QF[0x7fedceb620] PC[0x0071ccf2a0] at dex-pc 0x7d64410398 com.android.internal.os.ZygoteArguments.getInstance //AM[0x708328b8]
QF[0x7fedceb660] PC[0x7d66d3041c] at dex-pc 0x7d6441526e com.android.internal.os.Zygote.childMain //AM[0x709a8698]
QF[0x7fedceb850] PC[0x7d66d303b8] at dex-pc 0x7d644156f4 com.android.internal.os.Zygote.forkUsap //AM[0x709a87f8]
QF[0x7fedceb960] PC[0x7d66d303b8] at dex-pc 0x7d6441415e com.android.internal.os.ZygoteServer.fillUsapPool //AM[0x70832f98]
QF[0x7fedceba80] PC[0x0071cdc344] at dex-pc 0x7d64414632 com.android.internal.os.ZygoteServer.runSelectLoop //AM[0x70833018]
QF[0x7fedcebb80] PC[0x0071cd728c] at dex-pc 0x7d6441307e com.android.internal.os.ZygoteInit.main //AM[0x709a91a8]
We can now also parse the arguments that the client passed in and check that they are consistent with the earlier analysis.
art-parser> disassemble 0x708328b8 -i 0x7d64410398
com.android.internal.os.ZygoteArguments com.android.internal.os.ZygoteArguments.getInstance(com.android.internal.os.ZygoteCommandBuffer) [dex_method_idx=61801]
DEX CODE:
0x7d64410398: 3070 f166 0021 | invoke-direct {v1, v2, v0}, void com.android.internal.os.ZygoteArguments.<init>(com.android.internal.os.ZygoteCommandBuffer, int) // method@61798
art-parser> bt 3690 -v
QF[0x7fedceb620] PC[0x0071ccf2a0] at dex-pc 0x7d64410398 com.android.internal.os.ZygoteArguments.getInstance //AM[0x708328b8]
{
StackMap[5] (code_region=[0x71ccf220-0x71ccf2b8], native_pc=0x80, dex_pc=0xa, register_mask=0xc00000)
Physical registers
{
x22 = 0x12d7e5f8 x23 = 0x12d7e5e0 x24 = 0xffffffff x25 = 0x10
x26 = 0x7fedceb734 x27 = 0x7fedceb6b8 x28 = 0x7fedceb7b0 x29 = 0x7fedceb734
x30 = 0x71ccf2a0
}
}
csharp
0x71ccf280 <+96>: mov x2, x23
0x71ccf284 <+100>: mov x3, x25
0x71ccf288 <+104>: mov x1, x0
0x71ccf28c <+108>: mov x22, x1
0x71ccf290 <+112>: adrp x0, 0x70832000
0x71ccf294 <+116>: add x0, x0, #0x8d8
0x71ccf298 <+120>: ldr x30, [x0, #24]
0x71ccf29c <+124>: blr x30
ini
art-parser> p 0x12d7e5f8
Size: 0x90
Padding: 0x5
Object Name: com.android.internal.os.ZygoteArguments
iFields of com.android.internal.os.ZygoteArguments
[0x7c] boolean mAbiListQuery = 0x0
[0x8] java.lang.String[] mAllowlistedDataInfoList = 0x12d7ec30
[0xc] java.lang.String[] mApiDenylistExemptions = 0x0
[0x10] java.lang.String mAppDataDir = /data/user/0/com.android.statementservice
[0x7d] boolean mBindMountAppDataDirs = 0x0
[0x7e] boolean mBindMountAppStorageDirs = 0x0
[0x7f] boolean mBootCompleted = 0x0
[0x80] boolean mCapabilitiesSpecified = 0x0
[0x14] long[] mDisabledCompatChanges = 0x0
[0x50] long mEffectiveCapabilities = 0x0
[0x60] int mGid = 0x276c
[0x81] boolean mGidSpecified = 0x1
[0x18] int[] mGids = 0x12d7e8a0
[0x64] int mHiddenApiAccessLogSampleRate = 0xffffffff
[0x68] int mHiddenApiAccessStatslogSampleRate = 0xffffffff
[0x1c] java.lang.String mInstructionSet = 0x0
[0x20] java.lang.String mInvokeWith = 0x0
[0x82] boolean mIsTopApp = 0x0
[0x6c] int mMountExternal = 0x1
[0x24] java.lang.String mNiceName = com.android.statementservice
[0x28] java.lang.String mPackageName = com.android.statementservice
[0x58] long mPermittedCapabilities = 0x0
[0x83] boolean mPidQuery = 0x0
[0x2c] java.lang.String[] mPkgDataInfoList = 0x12d7eb30
[0x30] java.lang.String mPreloadApp = 0x0
[0x84] boolean mPreloadDefault = 0x0
[0x34] java.lang.String mPreloadPackage = 0x0
[0x38] java.lang.String mPreloadPackageCacheKey = 0x0
[0x3c] java.lang.String mPreloadPackageLibFileName = 0x0
[0x40] java.lang.String mPreloadPackageLibs = 0x0
[0x44] java.util.ArrayList mRLimits = 0x0
[0x48] java.lang.String[] mRemainingArgs = 0x0
[0x70] int mRuntimeFlags = 0x1200800
[0x4c] java.lang.String mSeInfo = platform:privapp:targetSdkVersion=29:complete
[0x85] boolean mSeInfoSpecified = 0x1
[0x86] boolean mStartChildZygote = 0x0
[0x74] int mTargetSdkVersion = 0x1d
[0x87] boolean mTargetSdkVersionSpecified = 0x1
[0x78] int mUid = 0x276c
[0x88] boolean mUidSpecified = 0x1
[0x89] boolean mUsapPoolEnabled = 0x0
[0x8a] boolean mUsapPoolStatusSpecified = 0x0
iFields of java.lang.Object
[0x0] java.lang.Class shadow$_klass_ = 0x70177360
[0x4] int shadow$_monitor_ = 0x0
The arguments confirm that this usap64 was indeed being used to start the application com.android.statementservice, which matches what we decoded on the client side.
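As a small cross-check of the dump above, the hexadecimal field values can be converted and compared with the seinfo string. This is only a sketch: the dictionary and the helper `target_sdk_from_seinfo` are hypothetical names introduced here, and the values are copied by hand from the art-parser output.

```python
# Field values copied from the ZygoteArguments dump (hexadecimal in the dump).
fields = {"mUid": 0x276c, "mGid": 0x276c, "mTargetSdkVersion": 0x1d}
se_info = "platform:privapp:targetSdkVersion=29:complete"

AID_APP_START = 10000  # first UID that Android assigns to applications


def target_sdk_from_seinfo(value):
    """Return the targetSdkVersion embedded in an seinfo string, or None."""
    for part in value.split(":"):
        if part.startswith("targetSdkVersion="):
            return int(part.split("=", 1)[1])
    return None


# Offset of the UID inside the application UID range.
app_index = fields["mUid"] - AID_APP_START
sdk_matches = target_sdk_from_seinfo(se_info) == fields["mTargetSdkVersion"]
```

The UID falls in the application range and the two target SDK values agree, which is consistent with a request to start an ordinary app process.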
Recovering the stack of thread 3657
css
7d5016b000-7d50272000 rw-p 00000000 00:00 0 [anon:stack_and_tls:3657]
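We already know the stack address range of thread 3657, so we can reconstruct its call chain from the stack memory that is still available. This time, however, the reconstruction can only proceed bottom-up.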
bash
(gdb) x /i 0x0000007dec56126c-0x4
0x7dec561268 <_ZN5NonPIL20MutexLockWithTimeoutEP24pthread_mutex_internal_tbPK8timespec+328>: bl 0x7dec4f3040 <_Z15__futex_wait_exPVvbibPK8timespec>
bash
Dump of assembler code for function _ZN5NonPIL20MutexLockWithTimeoutEP24pthread_mutex_internal_tbPK8timespec:
0x0000007dec561120 <+0>: sub sp, sp, #0x80
0x0000007dec561124 <+4>: stp x29, x30, [sp, #32]
0x0000007dec561128 <+8>: str x27, [sp, #48]
0x0000007dec56112c <+12>: stp x26, x25, [sp, #64]
0x0000007dec561130 <+16>: stp x24, x23, [sp, #80]
0x0000007dec561134 <+20>: stp x22, x21, [sp, #96]
0x0000007dec561138 <+24>: stp x20, x19, [sp, #112]
0x0000007dec56113c <+28>: add x29, sp, #0x20
This gives us the set of register values that gdb, lldb and art-parser need in order to unwind the stack:
Register | Value |
---|---|
PC | 0x7dec4f3040 |
LR | 0x7dec56126c |
SP | 0x7d5026e550 |
FP | 0x7d5026e570 |
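The arithmetic behind the table can be written out as a short sketch. It assumes the thread is treated as stopped at the first instruction of `__futex_wait_ex`, so the callee has not yet touched SP or x29, and it takes the frame pointer of `MutexLockWithTimeout` as a known input. The function name `derive_registers` is hypothetical.

```python
def derive_registers(bl_addr, bl_target, caller_fp, fp_offset):
    """Registers as seen at the first instruction of a callee reached via `bl`."""
    return {
        "pc": bl_target,               # callee entry point, nothing executed yet
        "lr": bl_addr + 4,             # `bl` saves the address of the next instruction
        "fp": caller_fp,               # x29 still belongs to the caller
        "sp": caller_fp - fp_offset,   # caller prologue did `add x29, sp, #fp_offset`
    }


regs = derive_registers(
    bl_addr=0x7dec561268,    # the `bl __futex_wait_ex` instruction
    bl_target=0x7dec4f3040,  # entry of __futex_wait_ex
    caller_fp=0x7d5026e570,  # frame pointer of MutexLockWithTimeout
    fp_offset=0x20,          # from `add x29, sp, #0x20` in the prologue
)
```

With these inputs the result reproduces the four values in the table.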
Next we modify the Coredump file of the original process 3690. Before the change, the thread's register note looks like this:
erlang
0001B710 05 00 00 00 88 01 00 00 01 00 00 00 43 4F 52 45 ............CORE
0001B720 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B730 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B740 00 00 00 00 6A 0E 00 00 00 00 00 00 00 00 00 00 ....j...........
0001B750 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B770 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B780 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B790 00 00 00 00 A4 2C E2 67 7D 00 00 B4 80 00 00 00 .....,.g}.......
0001B7A0 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7C0 00 00 00 00 00 00 00 00 00 00 00 00 F8 B2 CE ED ................
0001B7D0 7F 00 00 00 62 00 00 00 00 00 00 00 00 00 00 00 ....b...........
0001B7E0 00 00 00 00 32 00 00 00 00 00 00 00 33 00 00 00 ....2.......3...
0001B7F0 00 00 00 00 03 00 00 00 00 00 00 00 33 00 00 00 ............3...
0001B800 00 00 00 00 1C 00 00 00 00 00 00 00 B8 13 30 70 ..............0p
0001B810 00 00 00 00 18 69 33 67 7D 00 00 00 E0 E2 4E EC .....i3g}.....N.
0001B820 7D 00 00 00 00 A0 AD 19 7E 00 00 00 90 2C E2 67 }.......~....,.g
0001B830 7D 00 00 B4 00 2C E4 67 7D 00 00 B4 A4 2C E2 67 }....,.g}....,.g
0001B840 7D 00 00 B4 3B E1 BD 66 7D 00 00 00 70 4F BC 66 }...;..f}...pO.f
0001B850 7D 00 00 00 00 D0 53 67 7D 00 00 00 50 B1 CE ED }.....Sg}...P...
0001B860 7F 00 00 00 01 00 00 00 00 00 00 00 FE FF FF FF ................
0001B870 00 00 00 00 00 80 D5 18 7E 00 00 00 90 B1 CE ED ........~.......
0001B880 7F 00 00 00 34 1B D8 66 7D 00 00 00 20 B1 CE ED ....4..f}... ...
0001B890 7F 00 00 00 FC E2 4E EC 7D 00 00 00 00 10 00 60 ......N.}......`
After modifying the registers and the thread ID in the Coredump:
erlang
0001B710 05 00 00 00 88 01 00 00 01 00 00 00 43 4F 52 45 ............CORE
0001B720 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B730 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B740 00 00 00 00 49 0E 00 00 00 00 00 00 00 00 00 00 ....I...........
0001B750 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B770 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B780 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B790 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B810 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B820 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B830 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B850 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B860 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B870 00 00 00 00 00 00 00 00 00 00 00 00 70 E5 26 50 ............p.&P
0001B880 7D 00 00 00 6C 12 56 EC 7D 00 00 00 50 E5 26 50 }...l.V.}...P.&P
0001B890 7D 00 00 00 40 30 4F EC 7D 00 00 00 00 10 00 60 }...@0O.}......`
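The patched bytes are simply the new thread ID and register values written in little-endian order. The sketch below shows how those byte sequences can be produced; the helper names (`encode_u32`, `encode_u64`, `hex_bytes`, `patch`) are hypothetical, and the register order x29, x30, SP, PC is read off the dump above rather than taken from any tool.

```python
import struct


def encode_u32(value):
    """32-bit little-endian encoding, as used for the thread ID field."""
    return struct.pack("<I", value)


def encode_u64(value):
    """64-bit little-endian encoding, as used for each AArch64 register."""
    return struct.pack("<Q", value)


def hex_bytes(data):
    """Format bytes the way the hex dump prints them."""
    return " ".join(f"{b:02X}" for b in data)


def patch(buf, offset, data):
    """Overwrite len(data) bytes of a bytearray in place, starting at offset."""
    buf[offset:offset + len(data)] = data


# Tail of the register block in dump order: x29 (FP), x30 (LR), SP, PC.
registers = [0x7d5026e570, 0x7dec56126c, 0x7d5026e550, 0x7dec4f3040]
register_bytes = b"".join(encode_u64(r) for r in registers)
```

Comparing `hex_bytes(encode_u32(3657))` and `hex_bytes(register_bytes)` with the modified dump is a quick way to confirm that the edit was typed in correctly.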
less
art-parser> bt 3657
"FinalizerWatchdogDaemon" prio=<unknown> tid=5 Unknown
| group="<unknown>" sCount=0 ucsCount=0 flags=0 obj=0x0 self=0xb400007d67f87c00
| sysTid=3657 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7d5026ecb0
| stack=0x7d5016b000-0x7d5016d000 stackSize=0x103cb0
| held mutexes= "Region lock" "mutator lock"
x0 0x0000000000000000 x1 0x0000000000000000 x2 0x0000000000000000 x3 0x0000000000000000
x4 0x0000000000000000 x5 0x0000000000000000 x6 0x0000000000000000 x7 0x0000000000000000
x8 0x0000000000000000 x9 0x0000000000000000 x10 0x0000000000000000 x11 0x0000000000000000
x12 0x0000000000000000 x13 0x0000000000000000 x14 0x0000000000000000 x15 0x0000000000000000
x16 0x0000000000000000 x17 0x0000000000000000 x18 0x0000000000000000 x19 0x0000000000000000
x20 0x0000000000000000 x21 0x0000000000000000 x22 0x0000000000000000 x23 0x0000000000000000
x24 0x0000000000000000 x25 0x0000000000000000 x26 0x0000000000000000 x27 0x0000000000000000
x28 0x0000000000000000 x29 0x0000007d5026e570
lr 0x0000007dec56126c sp 0x0000007d5026e550 pc 0x0000007dec4f3040 pst 0x0000000060001000
FP[0x7d5026e570] PC[0x7dec4f3040] native: #00 (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+0x0) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026e5e0] PC[0x7dec56126c] native: #02 (NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+0x14c) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026e670] PC[0x7dec4da500] native: #03 (je_malloc_mutex_lock_slow+0xc0) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026e6c0] PC[0x7dec4ba294] native: #04 (je_arena_tcache_fill_small+0x64) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026e720] PC[0x7dec4e543c] native: #05 (je_tcache_alloc_small_hard+0x1c) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026e780] PC[0x7dec4b1c38] native: #06 (je_malloc+0x2e8) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026e7d0] PC[0x7dec4ad5c8] native: #07 (malloc+0x28) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026e7f0] PC[0x7e12e5401c] native: #08 (operator new(unsigned long)+0x1c) /apex/com.android.art/lib64/libc++.so
FP[0x7d5026e810] PC[0x7d66ee0024] native: #09 (art::gc::space::RegionSpace::RevokeThreadLocalBuffersLocked(art::Thread*, bool)+0x74) /apex/com.android.art/lib64/libart.so
FP[0x7d5026e840] PC[0x7d66ee01f4] native: #10 (art::gc::space::RegionSpace::RevokeThreadLocalBuffers(art::Thread*)+0x54) /apex/com.android.art/lib64/libart.so
FP[0x7d5026e880] PC[0x7d66ea646c] native: #11 (art::gc::Heap::RevokeThreadLocalBuffers(art::Thread*)+0x8c) /apex/com.android.art/lib64/libart.so
FP[0x7d5026e960] PC[0x7d671da368] native: #12 (art::Thread::Destroy(bool)+0xf18) /apex/com.android.art/lib64/libart.so
FP[0x7d5026eb30] PC[0x7d671effe4] native: #13 (art::ThreadList::Unregister(art::Thread*, bool)+0xa4) /apex/com.android.art/lib64/libart.so
FP[0x7d5026ebf0] PC[0x7d671cb16c] native: #14 (art::Thread::CreateCallback(void*)+0x8ac) /apex/com.android.art/lib64/libart.so
FP[0x7d5026ec50] PC[0x7dec55fd60] native: #15 (__pthread_start(void*)+0xd0) /apex/com.android.runtime/lib64/bionic/libc.so
FP[0x7d5026ec80] PC[0x7dec4f3bc4] native: #16 (__start_thread+0x44) /apex/com.android.runtime/lib64/bionic/libc.so
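Loading the modified coredump file back into art-parser, we can clearly see that thread 3657 had acquired the Region lock inside RevokeThreadLocalBuffers when the Fork happened. The child process contains only the forking thread, so nothing is left to release that lock, and the child deadlocks.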
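Question 2
Google is in fact aware that forking a multithreaded Linux process easily leads to deadlock, so Zygote stops all Daemon threads before it forks a child. Why, then, did the Fork happen before thread 3657 had finished exiting?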
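Detail 1
Some readers may have noticed that the threads had all been joined. Why was a Daemon thread still alive even though Daemons.stop() had already returned?
This comes down to how ART implements threads. When a thread is joined in ART, the lock the joiner waits on is released inside art::Thread::Destroy, so the join returns at that point, before the native thread has actually exited. Thread 3657 had already run as far as RevokeThreadLocalBuffers, so at the Java level the last Daemon.stop() had already completed.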
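Detail 2
Even after Daemons.stop() has returned, the ART runtime could still add one more safeguard: it only needs to make sure that the main thread is the only one left in the ThreadList. So let us look at what nativePreFork goes on to do inside ART, and whether it waits for the ThreadList to be cleaned up.
On our device, however, nativePreFork had already returned although unregistering_count_ had not dropped to 0.
Comparing this with the PreZygoteFork code in our project, it turns out that this piece of code is simply not there.
What our project currently does instead is count the entries under /proc/self/task through a Java API to decide whether only one thread is left in the process.
From the memory of process 3690 we can look up leftover thread-local data inherited from the parent process, and in it find the java.lang.String[] objects most recently produced by tasks.list():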
less
art-parser> p 0x12d7e370
Size: 0x18
Array Name: java.lang.String[]
[0] 952
[1] 3655
[2] 3657
art-parser> p 0x12d7e3f0
Size: 0x10
Array Name: java.lang.String[]
[0] 952
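The two arrays can be replayed against a loop of the form `while (tasks.list().length > 1)` to see why the wait ended early. The sketch below is a simplified model, not the project's code: `wait_until_single_thread` and `list_tasks` are hypothetical names, and the listing is injected as a callable instead of reading /proc/self/task.

```python
def wait_until_single_thread(list_tasks):
    """Poll a task listing until it reports at most one entry.

    Returns the number of polls that were needed. The loop trusts the
    listing completely; it has no other way to know whether a thread
    is still alive.
    """
    polls = 0
    while True:
        polls += 1
        if len(list_tasks()) <= 1:
            return polls


# The two listings recovered from the memory of process 3690, in order.
samples = iter([["952", "3655", "3657"], ["952"]])
polls = wait_until_single_thread(lambda: next(samples))
```

The loop stops on the second sample. That sample no longer contains 3657, although the recovered stack shows that 3657 was still running inside art::Thread::Destroy at that moment.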
Conclusion
Why counting the entries under /proc/self/task through a Java API turns out to be unreliable still needs further investigation.
Revert^2 "Wait for thread termination in PreZygoteFork()" | android-review.googlesource.com/c/platform/... |
Revert^2 "Remove waitUntilAllThreadsStopped()" | android-review.googlesource.com/c/platform/... |
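The new waitUntilAllThreadsStopped handling that Google provides is still flawed. Its loop is bounded at 1000 samples. Consider the case where the last thread to exit has finished Unregister, so the check that the ThreadList size is 1 passes, but the thread is then held up in do_exit in the kernel. waitUntilAllThreadsStopped may then fail to see the expected state within its 1000 samples and raise a FATAL error.
Addendum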
ini
/*
 * Find the next thread in the thread list.
 * Return NULL if there is an error or no next thread.
 *
 * The reference to the input task_struct is released.
 */
static struct task_struct *next_tid(struct task_struct *start)
{
    struct task_struct *pos = NULL;
    rcu_read_lock();
    if (pid_alive(start)) {
        pos = next_thread(start);
        if (thread_group_leader(pos))
            pos = NULL;
        else
            get_task_struct(pos);
    }
    rcu_read_unlock();
    put_task_struct(start);
    return pos;
}

/* for the /proc/TGID/task/ directories */
static int proc_task_readdir(struct file *file, struct dir_context *ctx)
{
    struct inode *inode = file_inode(file);
    struct task_struct *task;
    struct pid_namespace *ns;
    int tid;

    if (proc_inode_is_dead(inode))
        return -ENOENT;

    if (!dir_emit_dots(file, ctx))
        return 0;

    /* f_version caches the tgid value that the last readdir call couldn't
     * return. lseek aka telldir automagically resets f_version to 0.
     */
    ns = proc_pid_ns(inode->i_sb);
    tid = (int)file->f_version;
    file->f_version = 0;
    for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
         task;
         task = next_tid(task), ctx->pos++) {
        char name[10 + 1];
        unsigned int len;

        tid = task_pid_nr_ns(task, ns);
        if (!tid)
            continue; /* The task has just exited. */
        len = snprintf(name, sizeof(name), "%u", tid);
        if (!proc_fill_cache(file, ctx, name, len,
                             proc_task_instantiate, task, NULL)) {
            /* returning this tgid failed, save it as the first
             * pid for the next readir call */
            file->f_version = (u64)tid;
            put_task_struct(task);
            break;
        }
    }

    return 0;
}
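Verification experiment
Experiment design: |
---|
Before creating the Daemon threads, save every thread ID under /proc/self/task as sample one (alltask) |
Start 5 Daemon threads, named Daemon-0, Daemon-1, ..., Daemon-4 |
Stop 4 of the Daemon threads, keeping Daemon-4 |
Save every thread ID under /proc/self/task as sample two (savetask), and keep looping while savetask is longer than alltask; after 1000 iterations, stop Daemon-4 and start the test over. |
When savetask is no longer than alltask, capture a Coredump of the current state. |
Parse the memory with art-parser, check whether Daemon-4 appears in savetask, and compare the two samples. If Daemon-4 is absent, proc_task_readdir truncated the listing. |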
ini
int length = 0;
String[] savetask;
String[] alltask;

void doTest(View view) {
    // Five daemon threads, started below as Daemon-0 ... Daemon-4.
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());
    new Thread(new Runnable() {
        @Override
        public void run() {
            while (true) {
                // Sample one: thread IDs before any Daemon thread is started.
                File tasks = new File("/proc/self/task");
                alltask = tasks.list();
                length = alltask.length;
                Log.v("Opencore", "s " + length);
                for (int i = 0; i < mThreads.size(); i++) {
                    mThreads.get(i).start(i);
                }
                // Stop every Daemon thread except the last one (Daemon-4).
                for (int i = 0; i < mThreads.size() - 1; i++) {
                    mThreads.get(i).stop();
                }
                int loop = 0;
                boolean dodump = true;
                // Sample two: Daemon-4 is still alive, so a complete listing
                // must stay longer than sample one.
                File tasks2 = new File("/proc/self/task");
                savetask = tasks2.list();
                while (savetask.length > length) {
                    Log.v("Opencore", "" + savetask.length);
                    Thread.yield();
                    if (loop == 1000) {
                        // No truncation seen in this round: stop Daemon-4 and start over.
                        mThreads.get(mThreads.size() - 1).stop();
                        dodump = false;
                    }
                    loop++;
                    savetask = tasks2.list();
                }
                Log.v("Opencore", "e " + length);
                if (dodump) {
                    // The listing shrank while Daemon-4 was still running: capture the state.
                    Coredump.getInstance().doCoredump();
                    mThreads.get(4).stop();
                }
            }
        }
    }).start();
}
ini
art-parser> p 0x16d40e20
Size: 0x180
Padding: 0x4
Object Name: penguin.opencore.coretester.MainActivity
iFields of penguin.opencore.coretester.MainActivity
[0x168] java.lang.String[] alltask = 0x13cc3d30
[0x178] int length = 0x1a
[0x16c] java.util.ArrayList mThreads = 0x16d68448
[0x170] penguin.opencore.coretester.NterpTester nterpTester = 0x16d68460
[0x174] java.lang.String[] savetask = 0x13cc5590
art-parser> p 0x13cc3d30
Size: 0x78
Padding: 0x4
Array Name: java.lang.String[]
[0] 26654
[1] 26661
[2] 26662
[3] 26663
[4] 26664
[5] 26665
[6] 26666
[7] 26667
[8] 26668
[9] 26669
[10] 26670
[11] 26673
[12] 26674
[13] 26676
[14] 26677
[15] 26679
[16] 26682
[17] 26683
[18] 26684
[19] 26689
[20] 26691
[21] 26693
[22] 26694
[23] 26695
[24] 26730
[25] 27063
art-parser> p 0x13cc5590
Size: 0x78
Padding: 0x4
Array Name: java.lang.String[]
[0] 26654
[1] 26661
[2] 26662
[3] 26663
[4] 26664
[5] 26665
[6] 26666
[7] 26667
[8] 26668
[9] 26669
[10] 26670
[11] 26673
[12] 26674
[13] 26676
[14] 26677
[15] 26679
[16] 26682
[17] 26683
[18] 26684
[19] 26689
[20] 26691
[21] 26693
[22] 26694
[23] 26695
[24] 26730
[25] 27063
art-parser> thread
[26654] "main" tid=1 Native
[26661] "Signal Catcher" tid=4 WaitingInMainSignalCatcherLoop
[26662] "perfetto_hprof_listener" tid=7 Native
[26663] "ADB-JDWP Connection Control Thread" tid=8 WaitingInMainDebuggerLoop
[26664] "Jit thread pool worker thread 0" tid=9 Native
[26665] "HeapTaskDaemon" tid=10 WaitingForTaskProcessor
[26666] "ReferenceQueueDaemon" tid=11 Waiting
[26667] "FinalizerDaemon" tid=12 Waiting
[26668] "FinalizerWatchdogDaemon" tid=13 Sleeping
[26669] "binder:26654_1" tid=14 Native
[26670] "binder:26654_2" tid=15 Native
[26673] "binder:26654_3" tid=16 Native
[26674] "26654-ScoutStateMachine" tid=17 Native
[26676] "Profile Saver" tid=18 Native
[26677] "opencore-bg" tid=19 Native
[26679] "RenderThread" tid=21 Native
[26682] "Binder:interceptor" tid=22 Native
[26683] "Timer-0" tid=23 Waiting
[26684] "Timer-1" tid=24 Waiting
[26691] "SurfaceSyncGroupTimer" tid=20 Native
[26693] "hwuiTask0" tid=25 Native
[26694] "hwuiTask1" tid=26 Native
[26730] "Thread-6" tid=27 Waiting
[27063] "binder:26654_4" tid=2 Native
[27789] "Daemon-4" tid=3 Native
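The comparison between savetask and the thread list can be done mechanically with a set difference. The sketch below copies the IDs by hand from the two art-parser outputs above; the variable names are hypothetical.

```python
# Sample two (savetask), as dumped from 0x13cc5590.
savetask = [
    26654, 26661, 26662, 26663, 26664, 26665, 26666, 26667, 26668, 26669,
    26670, 26673, 26674, 26676, 26677, 26679, 26682, 26683, 26684, 26689,
    26691, 26693, 26694, 26695, 26730, 27063,
]

# Thread IDs reported by the art-parser `thread` command.
art_threads = [
    26654, 26661, 26662, 26663, 26664, 26665, 26666, 26667, 26668, 26669,
    26670, 26673, 26674, 26676, 26677, 26679, 26682, 26683, 26684, 26691,
    26693, 26694, 26730, 27063, 27789,
]

# Threads known to ART but absent from the /proc/self/task listing.
missing_from_listing = sorted(set(art_threads) - set(savetask))

# Listed IDs that ART does not report as attached threads.
not_in_art = sorted(set(savetask) - set(art_threads))
```

The only thread that ART knows about but the listing omits is 27789, which is Daemon-4. The two IDs that appear only in the listing are presumably native threads that are not attached to ART; that reading is an assumption, since the dump does not identify them.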
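Root cause
The Java API 'ZygoteHooks.waitUntilAllThreadsStopped' has been in use for many years, so why did this problem only show up recently, and why did the API suddenly become unreliable? The reason is a small fix in recent kernels: commit 0658a0961b0ac ("procfs: do not list TID 0 in /proc/<pid>/task").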
diff
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 533d5836eb9a..5541de99809c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3799,7 +3799,10 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
task = next_tid(task), ctx->pos++) {
char name[10 + 1];
unsigned int len;
+
tid = task_pid_nr_ns(task, ns);
+ if (!tid)
+ continue; /* The task has just exited. */
len = snprintf(name, sizeof(name), "%u", tid);
if (!proc_fill_cache(file, ctx, name, len,
proc_task_instantiate, task, NULL)) {
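Taking the case in this article, suppose proc_task_readdir truncates the listing:
Kernel version | Entries returned |
---|---|
Before 0658a0961b0ac | 952, 3655 or 952, 0 |
After 0658a0961b0ac | 952 |
On earlier Android versions, therefore, the loop while (tasks.list().length > 1) still saw at least 2 directory entries even when the kernel truncated the listing, so it kept waiting until the threads had really exited. We also adjusted the same experiment and ran it on a 5.15 kernel to check whether a truncated listing contains a directory named 0.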
ini
art-parser> p 0x13d81178
Size: 0x178
Object Name: penguin.opencore.coretester.MainActivity
iFields of penguin.opencore.coretester.MainActivity
[0x164] java.lang.String[] alltask = 0x13ba88e0
[0x174] int length = 0x1b
[0x168] java.util.ArrayList mThreads = 0x13da7778
[0x16c] penguin.opencore.coretester.NterpTester nterpTester = 0x13da7790
[0x170] java.lang.String[] savetask = 0x13ba9438
art-parser> p 0x13ba9438
Size: 0x80
Padding: 0x4
Array Name: java.lang.String[]
[0] 30799
[1] 30806
[2] 30807
[3] 30808
[4] 30809
[5] 30810
[6] 30811
[7] 30812
[8] 30813
[9] 30814
[10] 30815
[11] 30817
[12] 30823
[13] 30825
[14] 30827
[15] 30828
[16] 30831
[17] 30832
[18] 30833
[19] 30841
[20] 30842
[21] 30852
[22] 30853
[23] 30859
[24] 30860
[25] 30898
[26] 31335
[27] 0
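The difference between the two kernel behaviours can be captured in a deliberately simplified model of the readdir loop. This is a sketch under stated assumptions, not kernel code: each task is reduced to a `(tid, alive)` pair, an exited task reports TID 0, and the walk ends after an exited task because `next_tid()` returns NULL when `pid_alive(start)` is false. The names `list_task_dir` and `keeps_waiting` are hypothetical. The model does not cover the "952, 3655" variant, where the thread exits after its TID has already been read.

```python
def list_task_dir(tasks, skip_tid_zero):
    """Model one pass over /proc/<pid>/task.

    tasks: list of (tid, alive) pairs in thread-list order.
    skip_tid_zero: True models a kernel with commit 0658a0961b0ac.
    """
    names = []
    for tid, alive in tasks:
        reported = tid if alive else 0  # an exited task has no TID left
        if not (reported == 0 and skip_tid_zero):
            names.append(str(reported))
        if not alive:
            break  # next_tid() gives up on a dead task, so the walk is cut short
    return names


def keeps_waiting(names):
    """The wait condition from the article: more than one entry is listed."""
    return len(names) > 1


# Thread 3655 exits during the walk while 3657 is still alive.
tasks = [(952, True), (3655, False), (3657, True)]
old_kernel = list_task_dir(tasks, skip_tid_zero=False)
new_kernel = list_task_dir(tasks, skip_tid_zero=True)
```

In this model the older kernel returns two entries, one of them named 0, so the caller keeps waiting. The newer kernel returns a single entry, so the caller concludes that only one thread is left even though 3657 has not exited.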
As the old saying goes: this feature was built on top of a Bug, and once you fix that Bug, the feature itself becomes the Bug.