Analysis of a Mysterious Deadlock in a usap64 Process

Background

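Recently an odd screen freeze showed up in our project's automated tests. The direct cause: while an application was being launched, the ActivityManager:procStart thread of system_server blocked in attemptUsapSendArgsAndGetResult, waiting for a reply on a socket, and no data ever arrived.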

"ActivityManager:procStart" prio=5 tid=33 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x13a81920 self=0xb400007cbbd42800
  | sysTid=1882 nice=-2 cgrp=foreground sched=1073741824/0 handle=0x7ca6d67cb0
  | state=S schedstat=( 87741769 194728696 264 ) utm=2 stm=6 core=3 HZ=100
  | stack=0x7ca6c64000-0x7ca6c66000 stackSize=1039KB
  | held mutexes=
  native: #00 pc 000ed8e4  /apex/com.android.runtime/lib64/bionic/libc.so (__recvmsg+4) 
  native: #01 pc 000a2f40  /apex/com.android.runtime/lib64/bionic/libc.so (recvmsg+48) 
  native: #02 pc 00014974  /system/lib64/libbase.so (android::base::ReceiveFileDescriptorVector+372)
  native: #03 pc 00195474  /system/lib64/libandroid_runtime.so (android::socket_read_all+84) 
  native: #04 pc 00194c8c  /system/lib64/libandroid_runtime.so (android::socket_readba+332) 
  at android.net.LocalSocketImpl.readba_native(Native method)
  at android.net.LocalSocketImpl.-$$Nest$mreadba_native(unavailable:0)
  at android.net.LocalSocketImpl$SocketInputStream.read(LocalSocketImpl.java:110)
  - locked <0x000ba91b> (a java.lang.Object)
  at java.io.DataInputStream.readFully(DataInputStream.java:203)
  at java.io.DataInputStream.readInt(DataInputStream.java:394)
  at android.os.ZygoteProcess.attemptUsapSendArgsAndGetResult(ZygoteProcess.java:540)
  at android.os.ZygoteProcess.zygoteSendArgsAndGetResult(ZygoteProcess.java:484)
  at android.os.ZygoteProcess.startViaZygote(ZygoteProcess.java:828)
  - locked <0x085f6ab8> (a java.lang.Object)
  at android.os.ZygoteProcess.start(ZygoteProcess.java:405)
  at android.os.Process.start(Process.java:755)
  at com.android.server.am.ProcessList.startProcess(ProcessList.java:2543)
  at com.android.server.am.ProcessList.lambda$handleProcessStart$1(ProcessList.java:2193)
  at com.android.server.am.ProcessList.$r8$lambda$swBGAihsWK-ua7Y-9peg-di3_lY(unavailable:0)
  at com.android.server.am.ProcessList$$ExternalSyntheticLambda1.run(unavailable:22)
  at com.android.server.am.ProcessList.handleProcessStart(ProcessList.java:2219)
  at com.android.server.am.ProcessList.lambda$startProcessLocked$0(ProcessList.java:2159)
  at com.android.server.am.ProcessList.$r8$lambda$RYfds-QGIevfi1LguTPr20cBG60(unavailable:0)
  at com.android.server.am.ProcessList$$ExternalSyntheticLambda5.run(unavailable:22)
  at android.os.Handler.handleCallback(Handler.java:958)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loopOnce(Looper.java:222)
  at android.os.Looper.loop(Looper.java:314)
  at android.os.HandlerThread.run(HandlerThread.java:67)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)

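To investigate, we need to know which server process sits at the other end of the socket that system_server is waiting on, and what that process is doing. Anyone familiar with this code can tell from the stack that the server is a usap64 process, but several usap64 processes are running in the background. Checking them one by one would work, but here I use memory analysis to pinpoint the right one, starting by dumping the memory of system_server into a coredump file.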

Client-side analysis

art-parser> bt 1882
"ActivityManager:procStart" prio=5 tid=33 Native
  | group="main" sCount=0 ucsCount=0 flags=0 obj=0x13a81920 self=0xb400007cbbd42800
  | sysTid=1882 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7ca6d67cb0
  | stack=0x7ca6c64000-0x7ca6c66000 stackSize=0x103cb0
  | held mutexes=
  x0  0x00000000000002c4  x1  0x0000007ca6d66650  x2  0x0000000040004028  x3  0x0000007ca6d66580  
  x4  0x0000007ca6d66600  x5  0x0000000000000004  x6  0x0000000000000000  x7  0x0000000000000000  
  x8  0x00000000000000d4  x9  0x0000000000000001  x10 0x0000000000000110  x11 0x0000000000000003  
  x12 0x0000007cbbd511dc  x13 0xb400007cbbf27b40  x14 0x0000007ca6d668f0  x15 0x0000000013c958b8  
  x16 0x0000007e0b2fd710  x17 0x0000007dec500f10  x18 0x0000007ca6ba6000  x19 0x0000007ca6d66650  
  x20 0x0000000000000040  x21 0x0000007ca6d68000  x22 0x00000000000002c4  x23 0x0000000000000000  
  x24 0x0000007ca6d68000  x25 0x0000007ca6d664f0  x26 0x0000007ca6d66600  x27 0x0000007ca6d66718  
  x28 0x0000000000000000  x29 0x0000007ca6d66490  
  lr  0x0000007dec500f44  sp  0x0000007ca6d66450  pc  0x0000007dec54b8e4  pst 0x0000000060001000  
  FP[0x7ca6d66490] PC[0x7dec54b8e4] native: #00 (__recvmsg+0x4) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7ca6d66490] PC[0x7dec500f44] native: #01 (recvmsg+0x34) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7ca6d666a0] PC[0x7e0b2d0978] native: #02 (android::base::ReceiveFileDescriptorVector(android::base::borrowed_fd, void*, unsigned long, unsigned long, std::__1::vector<android::base::unique_fd_impl<android::base::DefaultCloser>, std::__1::allocator<android::base::unique_fd_impl<android::base::DefaultCloser> > >*)+0x178) /system/lib64/libbase.so
  FP[0x7ca6d66790] PC[0x7de88c8478] native: #03 (android::register_android_net_LocalSocketImpl(_JNIEnv*)+0x1158) /system/lib64/libandroid_runtime.so
  FP[0x7ca6d667f0] PC[0x7de88c7c90] native: #04 (android::register_android_net_LocalSocketImpl(_JNIEnv*)+0x970) /system/lib64/libandroid_runtime.so
>>> analysis 'art::OatHeader::kOatVersion' ...
>>> 'art::OatHeader::kOatVersion' = 238 ?
  QF[0x7ca6d66840] PC[0x0000000000] at dex-pc 0x0000000000 android.net.LocalSocketImpl.readba_native(Native method)  //AM[0x70954228]
  QF[0x7ca6d668f0] PC[0x0071a845ec] at dex-pc 0x7d655c363c android.net.LocalSocketImpl$SocketInputStream.read  //AM[0x706be940]
  QF[0x7ca6d66960] PC[0x00713a896c] at dex-pc 0x7d664d19c0 java.io.DataInputStream.readInt  //AM[0x6fd15fd0]
  QF[0x7ca6d669c0] PC[0x7d66d31334] at dex-pc 0x7d64a52fc4 android.os.ZygoteProcess.attemptUsapSendArgsAndGetResult  //AM[0x70981b90]
  QF[0x7ca6d66ad0] PC[0x0071735b40] at dex-pc 0x7d64a53850 android.os.ZygoteProcess.zygoteSendArgsAndGetResult  //AM[0x70981df0]
  QF[0x7ca6d66b50] PC[0x00717356b8] at dex-pc 0x7d64a5374a android.os.ZygoteProcess.startViaZygote  //AM[0x70981d90]
  QF[0x7ca6d66bd0] PC[0x7d66d31a5c] at dex-pc 0x7d64a53170 android.os.ZygoteProcess.start  //AM[0x70981f70]
  QF[0x7ca6d66e70] PC[0x7d66d319dc] at dex-pc 0x7d64a37090 android.os.Process.start  //AM[0x7097ef98]
  QF[0x7ca6d670f0] PC[0x7d51d1a288] at dex-pc 0x7d510b1e98 com.android.server.am.ProcessList.startProcess  //AM[0x9a9d0718]
  QF[0x7ca6d67200] PC[0x7d51d188a0] at dex-pc 0x7d510b9340 com.android.server.am.ProcessList.lambda$handleProcessStart$1  //AM[0x9a9d04f8]
  QF[0x7ca6d672c0] PC[0x7d51d15a00] at dex-pc 0x7d510b0140 com.android.server.am.ProcessList$$ExternalSyntheticLambda1.run  //AM[0x7d43255648]
  QF[0x7ca6d67350] PC[0x7d66d32158] at dex-pc 0x7d510b8b1a com.android.server.am.ProcessList.handleProcessStart  //AM[0x9a9d0438]
  QF[0x7ca6d67510] PC[0x7d66d319dc] at dex-pc 0x7d510b94c4 com.android.server.am.ProcessList.lambda$startProcessLocked$0  //AM[0x9a9d0538]
  QF[0x7ca6d67660] PC[0x7d66d319dc] at dex-pc 0x7d510b758c com.android.server.am.ProcessList.$r8$lambda$RYfds-QGIevfi1LguTPr20cBG60  //AM[0x9a9d0138]
  QF[0x7ca6d677b0] PC[0x7d66d30b68] at dex-pc 0x7d510b02a0 com.android.server.am.ProcessList$$ExternalSyntheticLambda5.run  //AM[0x7d432553f0]
  QF[0x7ca6d67900] PC[0x0071aec7e0] at dex-pc 0x7d649fcdd8 android.os.Handler.dispatchMessage  //AM[0x707206c8]
  QF[0x7ca6d67930] PC[0x0071aefd10] at dex-pc 0x7d64a241fc android.os.Looper.loopOnce  //AM[0x70721830]
  QF[0x7ca6d67a00] PC[0x0071aef840] at dex-pc 0x7d64a24926 android.os.Looper.loop  //AM[0x70721810]
  QF[0x7ca6d67a50] PC[0x0071aee7d4] at dex-pc 0x7d649fc448 android.os.HandlerThread.run  //AM[0x70720d58]
  QF[0x7ca6d67aa0] PC[0x7d51c7bf3c] at dex-pc 0x7d50f6da60 com.android.server.ServiceThread.run  //AM[0x9a9c4370]
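Method 1: the registers show directly that the socket fd is x0 = 0x2c4 = 708.
Method 2: read /proc/1745/task/1882/syscall. Its fields are the syscall number, six arguments, then the stack pointer and the program counter:
# cat /proc/1745/task/1882/syscall
212 0x2c4 0x7ca6d66650 0x40004028 0x7ca6d66580 0x7ca6d66600 0x4 0x7ca6d66450 0x7dec54b8e8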
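The syscall line from method 2 can also be decoded mechanically. The following is a minimal sketch, assuming the layout documented for /proc/<pid>/syscall (decimal syscall number, six hexadecimal arguments, then SP and PC); `SyscallState` and `parse_proc_syscall` are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class SyscallState(NamedTuple):
    number: int            # syscall number (decimal in the file)
    args: Tuple[int, ...]  # the six syscall arguments
    sp: int                # user-space stack pointer
    pc: int                # user-space program counter


def parse_proc_syscall(line: str) -> Optional[SyscallState]:
    """Parse one line of /proc/<pid>/task/<tid>/syscall.

    Returns None when the thread is not blocked inside a syscall
    ("running", or "-1 <sp> <pc>").
    """
    fields = line.split()
    if not fields or fields[0] in ("running", "-1"):
        return None
    number = int(fields[0])
    values = [int(f, 16) for f in fields[1:]]
    return SyscallState(number, tuple(values[:6]), values[6], values[7])


# The line captured from thread 1882 in this article.
line = ("212 0x2c4 0x7ca6d66650 0x40004028 0x7ca6d66580 "
        "0x7ca6d66600 0x4 0x7ca6d66450 0x7dec54b8e8")
state = parse_proc_syscall(line)
print(state.number, state.args[0], hex(state.sp), hex(state.pc))
```

The first argument (708) is the same fd that x0 holds in the register dump, and the SP value matches the sp register, which is a quick cross-check between the two methods.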

The call stack shows that the socket was created by LocalSocket, which uses the AF_UNIX address family, so we can look up the state of the socket by its inode number.

typedef enum {
    SS_FREE = 0,            /* not allocated        */
    SS_UNCONNECTED,         /* unconnected to any socket    */
    SS_CONNECTING,          /* in process of connecting */
    SS_CONNECTED,           /* connected to socket      */
    SS_DISCONNECTING        /* in process of disconnecting  */
} socket_state;
# ls -lh /proc/1745/fd | grep 708
lrwx------ 1 system system 64 2023-10-25 19:39 708 -> socket:[125175]

# netstat -p 1745 | grep 125175
unix 3 [ ] STREAM CONNECTED 125175 1745/system_server @jdwp-control
or
# cat /proc/net/unix | grep 125175
Num RefCount Protocol Flags Type St Inode Path
0000000000000000: 00000003 00000000 00000000 0001 03 125175
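The St column of /proc/net/unix is the numeric socket_state value from the enum shown earlier, so 03 means SS_CONNECTED. A minimal sketch of decoding such a row follows; `parse_unix_row` is a hypothetical helper, and the type table covers only the common SOCK_* values.

```python
# Names in the order of the kernel's socket_state enum quoted in this article.
SOCKET_STATES = [
    "SS_FREE", "SS_UNCONNECTED", "SS_CONNECTING",
    "SS_CONNECTED", "SS_DISCONNECTING",
]
# Common socket types; anything else is reported as "unknown".
SOCKET_TYPES = {1: "SOCK_STREAM", 2: "SOCK_DGRAM", 5: "SOCK_SEQPACKET"}


def parse_unix_row(row: str) -> dict:
    """Decode one data row of /proc/net/unix.

    Columns: Num RefCount Protocol Flags Type St Inode [Path]
    RefCount, Type and St are hexadecimal; Inode is decimal.
    """
    fields = row.split()
    return {
        "refcount": int(fields[1], 16),
        "type": SOCKET_TYPES.get(int(fields[4], 16), "unknown"),
        "state": SOCKET_STATES[int(fields[5], 16)],
        "inode": int(fields[6]),
        "path": fields[7] if len(fields) > 7 else "",
    }


row = "0000000000000000: 00000003 00000000 00000000 0001 03 125175"
info = parse_unix_row(row)
print(info)
```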

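The socket state shows that the client is already connected to the server and is waiting for it to return data. Next, let's look at the arguments passed to attemptUsapSendArgsAndGetResult.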

art-parser> disassemble 0x70981df0 -i 0x7d64a53850
android.os.Process$ProcessStartResult android.os.ZygoteProcess.zygoteSendArgsAndGetResult(android.os.ZygoteProcess$ZygoteState, int, java.util.ArrayList) [dex_method_idx=10385]
DEX CODE:
  0x7d64a53850: 3070 2871 0054           | invoke-direct {v4, v5, v0}, android.os.Process$ProcessStartResult android.os.ZygoteProcess.attemptUsapSendArgsAndGetResult(android.os.ZygoteProcess$ZygoteState, java.lang.String) // method@10353

QF[0x7ca6d66ad0] PC[0x0071735b40] at dex-pc 0x7d64a53850 android.os.ZygoteProcess.zygoteSendArgsAndGetResult  //AM[0x70981df0]
  {
    StackMap[28] (code_region=[0x71735820-0x71735b58], native_pc=0x320, dex_pc=0x60, register_mask=0x2dc00000)
      Virtual registers
      {
        v0 = r22    v1 = r28    v2 = r29    v4 = r23    
        v5 = r24    v6 = r25    v7 = r26
      }
      Physical registers
      {
        x19 = 0xb400007cbbd42800    x20 = 0x0    x21 = 0xb400007cbbd428c8    x22 = 0x13a81aa8    
        x23 = 0x728961c0    x24 = 0x13a81a80    x25 = 0x1    x26 = 0x13a825d8    
        x27 = 0x6fbab3c8    x28 = 0x1    x29 = 0x0    x30 = 0x71735b40
      }
  }
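The arguments of the invoke-direct instruction are virtual registers, and the stack map says which physical register holds each one. A minimal sketch of that lookup, assuming the tool's "rNN" location names the ARM64 register "xNN" (an assumption, not something the tool output states); `resolve_vregs` is a hypothetical helper.

```python
def resolve_vregs(vreg_map: dict, phys_regs: dict, wanted: list) -> dict:
    """Map virtual registers to values through the stack map.

    vreg_map:  virtual register -> location such as "r22"
    phys_regs: physical register name such as "x22" -> value
    """
    values = {}
    for vreg in wanted:
        location = vreg_map[vreg]            # e.g. "r22"
        values[vreg] = phys_regs["x" + location[1:]]
    return values


# Values copied from the StackMap[28] dump of zygoteSendArgsAndGetResult.
vreg_map = {"v0": "r22", "v1": "r28", "v2": "r29", "v4": "r23",
            "v5": "r24", "v6": "r25", "v7": "r26"}
phys_regs = {"x22": 0x13a81aa8, "x23": 0x728961c0, "x24": 0x13a81a80,
             "x25": 0x1, "x26": 0x13a825d8, "x28": 0x1, "x29": 0x0}

# invoke-direct {v4, v5, v0}: this, ZygoteState, String
args = resolve_vregs(vreg_map, phys_regs, ["v4", "v5", "v0"])
for name, value in args.items():
    print(name, hex(value))
```

Under that assumption, v5 (0x13a81a80) is the ZygoteState argument and v0 (0x13a81aa8) is the String argument, which are the two objects inspected next.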
art-parser> p 0x13a81aa8
Size: 0xb30
Object Name: java.lang.String
  iFields of java.lang.String
    [0x8] int count = 0xb19
    [0xc] int hash = 0x0
[16
--runtime-args
--setuid=10092
--setgid=10092
--runtime-flags=18876416
--mount-external-default
--target-sdk-version=29
--setgroups=3003,50092,20092,9997
--nice-name=com.android.statementservice
--seinfo=platform:privapp:targetSdkVersion=29:complete
--app-data-dir=/data/user/0/com.android.statementservice
--package-name=com.android.statementservice
--pkg-data-info-map=com.android.statementservice,null,10472
--allowlisted-data-info-map=com.google.android.gms,null,10735
--disabled-compat-changes=3400644,73144566,78294732,117835097,119147584,123562444,124107808,128611929,130595455,132649864,132742131,133396946,135549675,135634846,135754954,135772972,136069189,136219221,136274596,136293963,140852299,141600225,142191088,142365358,143231523,143539591,143937733,144027538,146211400,147316723,147340954,147798919,148086656,148180766,148534348,148535736,149254050,149391281,149994052,150232615,150776642,150857253,150935354,150939131,151105954,151163173,154726397,156215187,157233955,157629738,158482162,159047832,160794467,161145287,161252188,162547999,163400105,165573442,166236554,167676448,168419799,168782947,168936375,169273070,169887240,169897160,170188668,170233598,170503758,170668199,171032338,171306433,171317480,171979766,172100307,172251878,173031413,174041399,174042936,174042980,174043039,174227820,174228127,174664365,175101461,175319604,175408749,176926741,176926753,176926771,176926829,177438394,178038272,178209446,180326732,180326787,180326845,180523564,181136395,181350407,181658987,182185642,182478738,182734110,182811243,183147249,183155436,183164979,183372781,183407956,183972877,184323934,184838306,185004937,185199076,186082280,189229956,189472651,189969734,189969744,189969749,189969779,189969782,189970036,189970038,189970040,191513214,191911306,192341120,193232191,193247900,194532703,194833441,195579280,196254758,197654537,201522903,201712607,201794303,202110963,202894742,203704822,203800354,205194548,205907456,205919743,206033068,207133734,207557677,208648326,208739934,210856463,210923482,211757425,213289672,214016041,214741472,215066299,215656264,216114297,218393363,218493289,218533173,218865702,218959984,224562872,226439802,227752274,229362273,230590090,234793325,235355681,236283604,236704164,236825255,237508058,237531167,239784307,240273417,240663182,241104082,241766793,242193913,242194868,242716250,243827847,244358506,244637991,247079863,253665015,254631730,254662522,255038118,255039210,255042465,255371817,255659651,25593846
6,255940284,258236856,258825825,260560985,261055255,261072174,261770108,262645982,263076149,263259275,263959004,264301586,264304459,265103382,265131695,265451093,265452344,265456536,265464455,266124927,266201607,266524688,269165460,269849258,270049379,270306772,271850009,273509367,273564678,277035314,280471513,288845345,293644536
android.app.ActivityThread
seq=48
]
  iFields of java.lang.Object
    [0x0] java.lang.Class shadow$_klass_ = 0x6fb83df8
    [0x4] int shadow$_monitor_ = 0x0
art-parser> p 0x13a81a80
Size: 0x28
Padding: 0x7
Object Name: android.os.ZygoteProcess$ZygoteState
  iFields of android.os.ZygoteProcess$ZygoteState
    [0x8] java.util.List mAbiList = 0x1704f5c8
    [0x20] boolean mClosed = 0x0
    [0xc] android.net.LocalSocketAddress mUsapSocketAddress = 0x72e38ec8
    [0x10] java.io.DataInputStream mZygoteInputStream = 0x1704f5d8
    [0x14] java.io.BufferedWriter mZygoteOutputWriter = 0x1704f5f8
    [0x18] android.net.LocalSocket mZygoteSessionSocket = 0x1704f618
    [0x1c] android.net.LocalSocketAddress mZygoteSocketAddress = 0x72e38ee8
  iFields of java.lang.Object
    [0x0] java.lang.Class shadow$_klass_ = 0x7009ec28
    [0x4] int shadow$_monitor_ = 0x0
    
art-parser> p 0x72e38ec8
Size: 0x10
Object Name: android.net.LocalSocketAddress
  iFields of android.net.LocalSocketAddress
    [0x8] java.lang.String name = usap_pool_primary
    [0xc] android.net.LocalSocketAddress$Namespace namespace = 0x6ff14ec8
  iFields of java.lang.Object
    [0x0] java.lang.Class shadow$_klass_ = 0x70052e40
    [0x4] int shadow$_monitor_ = 0x20000000
# netstat -a -p | grep usap_pool_primary
unix 2 [ ACC ] STREAM LISTENING 41780 952/zygote64 /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 89237 2581/com.xiaomi.fin /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 62176 1398/shelld /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 58514 1012/vendor.mediate /dev/socket/usap_pool_primary
unix 3 [ ] STREAM CONNECTED 81802 3690/usap64 /dev/socket/usap_pool_primary
unix 2 [ ] DGRAM CONNECTED 81982 984/android.hardwar /dev/socket/usap_pool_primary

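At this point we can conclude that system_server got stuck while starting the application com.android.statementservice, and that it is waiting for the server process 3690/usap64 to finish and return its result.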
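The String dumped above follows a simple wire format: the first line is the argument count and each following line is one argument. A minimal sketch of parsing such a message to recover the target package; `parse_zygote_message` is a hypothetical helper and the sample is a shortened version of the dump, not the full message.

```python
def parse_zygote_message(message: str):
    """Split a Zygote argument message into (args, options).

    The first line holds the argument count; each later line is one
    argument. "--key=value" arguments are also collected into a dict.
    """
    lines = message.split("\n")
    count = int(lines[0])
    args = lines[1:1 + count]
    if len(args) != count:
        raise ValueError("message is shorter than its declared count")
    options = {}
    for arg in args:
        if arg.startswith("--"):
            key, _, value = arg[2:].partition("=")
            options[key] = value
    return args, options


# Shortened sample in the same format as the dumped String.
sample = (
    "5\n"
    "--runtime-args\n"
    "--setuid=10092\n"
    "--nice-name=com.android.statementservice\n"
    "android.app.ActivityThread\n"
    "seq=48\n"
)
args_list, options = parse_zygote_message(sample)
print(options["nice-name"])
```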

Server-side analysis

# debuggerd -b 3690                          
----- pid 3690 at 2023-10-25 20:15:07.118260161+0800 -----
Cmd line: usap64
ABI: 'arm64'

"usap64" sysTid=3690
    #00 pc 00000000000902fc  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) 
    #01 pc 000000000025ab30  /apex/com.android.art/lib64/libart.so (art::Mutex::ExclusiveLock(art::Thread*)+336) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #02 pc 00000000003b8e24  /apex/com.android.art/lib64/libart.so (art::gc::space::RegionSpace::AllocNewTlab(art::Thread*, unsigned long, unsigned long*)+52) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #03 pc 00000000003814ac  /apex/com.android.art/lib64/libart.so (art::gc::Heap::AllocWithNewTLAB(art::Thread*, art::gc::AllocatorType, unsigned long, bool, unsigned long*, unsigned long*, unsigned long*)+1260) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #04 pc 0000000000757240  /apex/com.android.art/lib64/libart.so (artAllocArrayFromCodeResolvedRegionTLAB+224) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #05 pc 0000000000225e24  /apex/com.android.art/lib64/libart.so (art_quick_alloc_array_resolved32_region_tlab+132) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #06 pc 00000000001cbbb0  /system/framework/arm64/boot.oat (java.util.regex.Pattern.fastSplit+560) (BuildId: f2e702c0d8fd389971dd1d32a3ba28e75c08fef8)
    #07 pc 0000000000160200  /system/framework/arm64/boot.oat (java.lang.String.split+64) (BuildId: f2e702c0d8fd389971dd1d32a3ba28e75c08fef8)
    #08 pc 00000000007ffddc  /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteArguments.parseArgs+2844) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
    #09 pc 00000000007ff29c  /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteArguments.getInstance+124) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
    #10 pc 0000000000209418  /apex/com.android.art/lib64/libart.so (nterp_helper+152) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #11 pc 000000000053026e  /system/framework/framework.jar (com.android.internal.os.Zygote.childMain+126)
    #12 pc 00000000002093b4  /apex/com.android.art/lib64/libart.so (nterp_helper+52) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #13 pc 00000000005306f4  /system/framework/framework.jar (com.android.internal.os.Zygote.forkUsap+64)
    #14 pc 00000000002093b4  /apex/com.android.art/lib64/libart.so (nterp_helper+52) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #15 pc 000000000052f15e  /system/framework/framework.jar (com.android.internal.os.ZygoteServer.fillUsapPool+158)
    #16 pc 000000000080c340  /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteServer.runSelectLoop+4080) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
    #17 pc 0000000000807288  /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteInit.main+3976) (BuildId: 1b854c6594c6966f2579f1e4b017c1ab8c92236f)
    #18 pc 0000000000210c80  /apex/com.android.art/lib64/libart.so (art_quick_invoke_static_stub+640) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #19 pc 0000000000253b40  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+224) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #20 pc 000000000064bd68  /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeWithVarArgs<art::ArtMethod*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, std::__va_list)+408) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #21 pc 000000000064c340  /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeWithVarArgs<_jmethodID*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, std::__va_list)+80) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #22 pc 00000000005160f4  /apex/com.android.art/lib64/libart.so (art::JNI<true>::CallStaticVoidMethodV(_JNIEnv*, _jclass*, _jmethodID*, std::__va_list)+692) (BuildId: bec5ac60247dc778a314f5b24a093aa5)
    #23 pc 00000000000dfca8  /system/lib64/libandroid_runtime.so (_JNIEnv::CallStaticVoidMethod(_jclass*, _jmethodID*, ...)+104) 
    #24 pc 00000000000ec454  /system/lib64/libandroid_runtime.so (android::AndroidRuntime::start(char const*, android::Vector<android::String8> const&, bool)+1028) 
    #25 pc 000000000000258c  /system/bin/app_process64 (main+1324) (BuildId: 3177237856445357fb121f4100079921)
    #26 pc 000000000008c798  /apex/com.android.runtime/lib64/bionic/libc.so (__libc_init+104) 

----- end 3690 -----
# ps -eT 3690
USER PID TID PPID VSZ RSS WCHAN ADDR S CMD
root 3690 3690 952 6306924 426964 futex_wai+ 0 S usap64

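Live debugging confirms that process 3690 has exactly one thread and that this thread is waiting for a lock. That lock will never be released, so the client stays blocked and the screen freezes. To find out who owns the lock, we again dump the memory of process 3690 into a coredump file and analyze it.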

(gdb) bt
#0  syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
#1  0x0000007d66d81b34 in art::futex(int volatile*, int, int, timespec const*, int volatile*, int) [clone .__uniq.55395457626730424248235132913560037531] (op=128, val=0, uaddr2=0x0, val3=0, 
    uaddr=<optimized out>, timeout=<optimized out>) at art/runtime/base/mutex-inl.h:43
#2  art::Mutex::ExclusiveLock (this=0xb400007d67e22c90, self=0xb400007d67e42c00) at art/runtime/base/mutex.cc:471
#3  0x0000007d66edfe28 in art::MutexLock::MutexLock (self=<optimized out>, mu=..., this=<optimized out>) at art/runtime/base/mutex.h:510
#4  art::gc::space::RegionSpace::AllocNewTlab (this=0xb400007d67e22a80, self=0xb400007d67e42c00, tlab_size=1695, bytes_tl_bulk_allocated=0x7fedceb2f8) at art/runtime/gc/space/region_space.cc:847
#5  0x0000007d66ea84b0 in art::gc::Heap::AllocWithNewTLAB (this=0xb400007d67e29700, self=0xb400007d67e42c00, allocator_type=<optimized out>, alloc_size=<optimized out>, grow=false, 
    bytes_allocated=0x7fedceb308, usable_size=0x7fedceb300, bytes_tl_bulk_allocated=0x7fedceb2f8) at art/runtime/gc/heap.cc:4567
#6  0x0000007d6727e244 in art::gc::Heap::TryToAllocate<false, false> (this=<optimized out>, self=<optimized out>, allocator_type=<optimized out>, alloc_size=<optimized out>, 
    bytes_allocated=<optimized out>, usable_size=<optimized out>, bytes_tl_bulk_allocated=<optimized out>) at art/runtime/gc/heap-inl.h:416
#7  art::gc::Heap::AllocObjectWithAllocator<false, true, art::mirror::SetLengthVisitor> (this=0xb400007d67e29700, self=0xb400007d67e42c00, klass=..., byte_count=936, 
    allocator=art::gc::kAllocatorTypeRegionTLAB, pre_fence_visitor=...) at art/runtime/gc/heap-inl.h:147
#8  art::mirror::Array::Alloc<false, false> (self=<optimized out>, array_class=..., component_count=<optimized out>, allocator_type=art::gc::kAllocatorTypeRegionTLAB, 
    component_size_shift=<optimized out>) at art/runtime/mirror/array-alloc-inl.h:147
#9  art::AllocArrayFromCodeResolved<false> (component_count=<optimized out>, allocator_type=art::gc::kAllocatorTypeRegionTLAB, klass=..., self=<optimized out>)
    at art/runtime/entrypoints/entrypoint_utils-inl.h:369
#10 artAllocArrayFromCodeResolvedRegionTLAB (klass=0x6fb84108, component_count=<optimized out>, self=0xb400007d67e42c00) at art/runtime/entrypoints/quick/quick_alloc_entrypoints.cc:139
#11 0x0000007d66d4ce28 in art_quick_alloc_array_resolved32_region_tlab () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1641
#12 0x00000000712a2bb4 in java::util::regex::Pattern::fastSplit (re=..., input=..., limit=<optimized out>)
   from symbols/system/framework/arm64/boot.oat
#13 0x0000000071237204 in java::lang::String::split (this=..., regex=...)
   from symbols/system/framework/arm64/boot.oat
#14 0x0000000071ccfde0 in com::android::internal::os::ZygoteArguments::parseArgs (this=..., args=..., argCount=<optimized out>) at com/android/internal/os/ZygoteArguments.java:286
#15 0x0000000071ccf2a0 in com::android::internal::os::ZygoteArguments::getInstance (args=...)
   from symbols/system/framework/arm64/boot-framework.oat
#16 0x0000007d66d3041c in nterp_helper () at out_sys/soong/.intermediates/art/runtime/libart_mterp.arm64ng/gen/mterp_arm64ng.S:11269
#17 0x0000007d66d303b8 in nterp_helper () at out_sys/soong/.intermediates/art/runtime/libart_mterp.arm64ng/gen/mterp_arm64ng.S:11269
#18 0x0000007d66d303b8 in nterp_helper () at out_sys/soong/.intermediates/art/runtime/libart_mterp.arm64ng/gen/mterp_arm64ng.S:11269
#19 0x0000000071cdc344 in com::android::internal::os::ZygoteServer::runSelectLoop (this=..., abiList=...) at com/android/internal/os/ZygoteServer.java:665
#20 0x0000000071cd728c in com::android::internal::os::ZygoteInit::main (argv=...) at com/android/internal/os/ZygoteInit.java:1068
#21 0x0000007d66d37c84 in art_quick_invoke_static_stub () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:688
#22 0x0000007d66d7ab44 in art::ArtMethod::Invoke (this=0x709a91a8, self=0xb400007d67e42c00, args=0x62, args_size=<optimized out>, result=0x7fedcebdb8, shorty=0x7d6464e228 "VL")
    at art/runtime/art_method.cc:425
#23 0x0000007d67172d6c in art::(anonymous namespace)::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::(anonymous namespace)::ArgArray*, art::JValue*, char const*) [clone .__uniq.245181933781456475607640333933569312899] (soa=..., method=0x709a91a8, arg_array=0x7fedcebd40, result=0x7fedcebdb8, shorty=0x7d6464e228 "VL") at art/runtime/reflection.cc:458
#24 art::InvokeWithVarArgs<art::ArtMethod*> (soa=..., obj=0x0, method=0x709a91a8, args=...) at art/runtime/reflection.cc:550
#25 0x0000007d67173344 in art::InvokeWithVarArgs<_jmethodID*> (soa=..., obj=0x0, mid=<optimized out>, args=...) at art/runtime/reflection.cc:565
#26 0x0000007d6703d0f8 in art::JNI<true>::CallStaticVoidMethodV (env=<optimized out>, mid=0x709a91a8, args=...) at art/runtime/jni/jni_internal.cc:1963
#27 0x0000007de8812cac in _JNIEnv::CallStaticVoidMethod (this=<optimized out>, clazz=<optimized out>, methodID=<optimized out>) at libnativehelper/include_jni/jni.h:779
#28 0x0000007de881f458 in android::AndroidRuntime::start (this=0x7fedcec190, className=0x5e8d845394 "com.android.internal.os.ZygoteInit", options=..., zygote=true)
    at frameworks/base/core/jni/AndroidRuntime.cpp:1358
#29 0x0000005e8d846590 in main (argc=<optimized out>, argv=0x7fedced2e8) at frameworks/base/cmds/app_process/app_main.cpp:372
(gdb) p *('art::Mutex' *)0xb400007d67e22c90
$2 = {
  <art::BaseMutex> = {
    _vptr$BaseMutex = 0x7d67327230 <vtable for art::Mutex+16>,
    name_ = 0x7d66be275c "Region lock",
    contention_log_data_ = 0xb400007d67e22ca0,
    level_ = art::kRegionSpaceRegionLock,
    should_respond_to_empty_checkpoint_request_ = false
  }, 
  members of art::Mutex:
  state_and_contenders_ = {
    <std::__1::atomic<int>> = {
      <std::__1::__atomic_base<int, true>> = {
        <std::__1::__atomic_base<int, false>> = {
          __a_ = 3,
          static is_always_lock_free = <optimized out>
        }, <No data fields>}, <No data fields>}, <No data fields>},
  static kHeldMask = 1,
  static kContenderShift = 1,
  static kContenderIncrement = 2,
  exclusive_owner_ = {
    <std::__1::atomic<int>> = {
      <std::__1::__atomic_base<int, true>> = {
        <std::__1::__atomic_base<int, false>> = {
          __a_ = 3657,
          static is_always_lock_free = <optimized out>
        }, <No data fields>}, <No data fields>}, <No data fields>},
  recursion_count_ = 1,
  recursive_ = false,
  enable_monitor_timeout_ = false,
  monitor_id_ = 758263346
}
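The state_and_contenders_ word packs two facts, using the kHeldMask and kContenderShift constants printed in the dump. A minimal sketch of decoding it (`decode_mutex_state` is a hypothetical helper, not an ART API):

```python
K_HELD_MASK = 1        # kHeldMask from the gdb dump
K_CONTENDER_SHIFT = 1  # kContenderShift from the gdb dump


def decode_mutex_state(state_and_contenders: int):
    """Return (held, contenders) for an art::Mutex state word."""
    held = bool(state_and_contenders & K_HELD_MASK)
    contenders = state_and_contenders >> K_CONTENDER_SHIFT
    return held, contenders


held, contenders = decode_mutex_state(3)
print(held, contenders)
```

For the value 3 in the dump this gives a held lock with one contender, which fits the picture: the single thread of process 3690 is the waiter, and exclusive_owner_ names 3657 as the holder.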

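Here things get strange: who is thread 3657? Searching every thread on the system with ps -eT | grep 3657 finds nothing, so the only explanation is that this thread has already exited.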

Question 1

Why was the lock not released even though thread 3657 had exited?

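This is the classic deadlock of a child process forked from a multithreaded parent, a very common pitfall in Linux programming. fork() duplicates only the calling thread, and the child receives a copy-on-write snapshot of the parent's address space. If another thread in the parent holds a lock at the moment of the fork, the child inherits the lock variable in its locked state, but the owning thread does not exist in the child, so nothing can ever unlock it. As soon as the child tries to take that lock, it deadlocks.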
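The effect is easy to reproduce outside ART. The following is a minimal sketch for POSIX systems only, with a Python threading.Lock standing in for art::Mutex; `child_can_take_lock_after_fork` is a hypothetical name, and the child uses a timeout so that it reports the problem instead of hanging forever.

```python
import os
import threading


def child_can_take_lock_after_fork(timeout: float = 0.5) -> bool:
    """Fork while another thread holds a lock; report whether the
    child manages to acquire its copy of that lock."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder():
        # Plays the role of thread 3657: takes the lock and keeps it.
        with lock:
            holding.set()
            release.wait()

    thread = threading.Thread(target=holder, daemon=True)
    thread.start()
    holding.wait()  # fork only once the lock is definitely held

    # Newer Python versions may warn here, because forking a
    # multithreaded process is exactly the hazard being shown.
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists. The lock was copied
        # in its locked state and its owner was not copied.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    release.set()   # let the parent's holder thread finish
    thread.join()
    return os.WEXITSTATUS(status) == 0


print(child_can_take_lock_after_fork())
```

On Linux the child is expected to time out, so the function returns False. Without the timeout the child would block forever, which is the situation process 3690 is in.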

Data recovery

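From this it follows that thread 3657 acquired the lock and the fork happened while it still held it. The stack of thread 3657 should therefore still be mapped in the address space of process 3690.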

# cat /proc/3690/maps | grep 3657
7d5016a000-7d5016b000 ---p 00000000 00:00 0 [anon:stack_and_tls:3657]
7d5016b000-7d50272000 rw-p 00000000 00:00 0 [anon:stack_and_tls:3657]
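The two mappings are the guard page and the usable stack of the vanished thread. A minimal sketch that pulls them out of maps text by the stack_and_tls label; `thread_stack_regions` is a hypothetical helper.

```python
def thread_stack_regions(maps_text: str, tid: int):
    """Return (start, end, perms) for every mapping labelled as the
    stack_and_tls region of the given thread id."""
    label = f"[anon:stack_and_tls:{tid}]"
    regions = []
    for line in maps_text.splitlines():
        # Fields: range, perms, offset, dev, inode, optional name.
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].strip() == label:
            start, end = (int(x, 16) for x in fields[0].split("-"))
            regions.append((start, end, fields[1]))
    return regions


maps_text = """\
7d5016a000-7d5016b000 ---p 00000000 00:00 0 [anon:stack_and_tls:3657]
7d5016b000-7d50272000 rw-p 00000000 00:00 0 [anon:stack_and_tls:3657]
"""
regions = thread_stack_regions(maps_text, 3657)
for start, end, perms in regions:
    print(hex(start), hex(end), perms, hex(end - start))
```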

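We saved a coredump of process 3690 earlier. Since the process was forked from Zygote, it carries Zygote's memory as it was at the moment of the fork, so we can check which threads the virtual machine still has registered.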

Parsing the data of process 3690

art-parser> thread
[952] "main" tid=1 Unknown
[3657] "FinalizerWatchdogDaemon" tid=5 Unknown

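The thread list in the coredump still shows the parent's main thread, 952, rather than 3690. The child inherited the runtime's thread bookkeeping from the fork snapshot, and its main thread runs on the same stack addresses as the parent's did. We can therefore patch in some fake data so that 3690 appears under the virtual machine's management, which lets art-parser unwind the Java frames of 3690.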

art-parser> bt 952
"main" prio=5 tid=1 Unknown
  | group="main" sCount=0 ucsCount=0 flags=0 obj=0x72845b10 self=0xb400007d67e42c00
  | sysTid=952 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7e1a07b4f8
  | stack=0x7fed4f0000-0x7fed4f2000 stackSize=0x7ff000
  | held mutexes= "mutator lock"
  x0  0x0000000000000000  x1  0x0000000000000000  x2  0x0000000000000000  x3  0x0000000000000000  
  x4  0x0000000000000000  x5  0x0000000000000000  x6  0x0000000000000000  x7  0x0000000000000000  
  x8  0x0000000000000000  x9  0x0000000000000000  x10 0x0000000000000000  x11 0x0000000000000000  
  x12 0x0000000000000000  x13 0x0000000000000000  x14 0x0000000000000000  x15 0x0000000000000000  
  x16 0x0000000000000000  x17 0x0000000000000000  x18 0x0000000000000000  x19 0x0000000000000000  
  x20 0x0000000000000000  x21 0x0000000000000000  x22 0x0000000000000000  x23 0x0000000000000000  
  x24 0x0000000000000000  x25 0x0000000000000000  x26 0x0000000000000000  x27 0x0000000000000000  
  x28 0x0000000000000000  x29 0x0000000000000000  
  lr  0x0000000000000000  sp  0x0000000000000000  pc  0x0000000000000000  pst 0x0000000000000000  
  0x0 fault sp address!!
>>> analysis 'art::OatHeader::kOatVersion' ...
>>> 'art::OatHeader::kOatVersion' = 238 ?
  QF[0x7fedceb450] PC[0x00712a2bb4] at dex-pc 0x7d66667c92 java.util.regex.Pattern.fastSplit  //AM[0x6fccbfb0]
  QF[0x7fedceb4b0] PC[0x0071237204] at dex-pc 0x7d6650ad72 java.lang.String.split  //AM[0x6fd3ec48]
  QF[0x7fedceb4f0] PC[0x0071ccfde0] at dex-pc 0x7d64410aba com.android.internal.os.ZygoteArguments.parseArgs  //AM[0x708328d8]
  QF[0x7fedceb620] PC[0x0071ccf2a0] at dex-pc 0x7d64410398 com.android.internal.os.ZygoteArguments.getInstance  //AM[0x708328b8]
  QF[0x7fedceb660] PC[0x7d66d3041c] at dex-pc 0x7d6441526e com.android.internal.os.Zygote.childMain  //AM[0x709a8698]
  QF[0x7fedceb850] PC[0x7d66d303b8] at dex-pc 0x7d644156f4 com.android.internal.os.Zygote.forkUsap  //AM[0x709a87f8]
  QF[0x7fedceb960] PC[0x7d66d303b8] at dex-pc 0x7d6441415e com.android.internal.os.ZygoteServer.fillUsapPool  //AM[0x70832f98]
  QF[0x7fedceba80] PC[0x0071cdc344] at dex-pc 0x7d64414632 com.android.internal.os.ZygoteServer.runSelectLoop  //AM[0x70833018]
  QF[0x7fedcebb80] PC[0x0071cd728c] at dex-pc 0x7d6441307e com.android.internal.os.ZygoteInit.main  //AM[0x709a91a8]

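The self value in the backtrace is the address of the art::Thread object. The value 0x3b8 (952) stored in it is the thread ID, so we only need to change 0x3b8 to 0xe6a (3690).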
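The thread ID occupies the upper 32 bits of the 8-byte little-endian word at self+0x8 (0x000003b800000001), so the patched word can be computed rather than typed by hand. A minimal sketch, with `tid_of` and `with_tid` as hypothetical helpers and the field layout taken only from this dump:

```python
LOW_32_MASK = 0xFFFFFFFF
TID_SHIFT = 32  # tid sits in the upper half of the 8-byte word


def tid_of(word: int) -> int:
    """Extract the thread ID from the 64-bit word at self+0x8."""
    return word >> TID_SHIFT


def with_tid(word: int, new_tid: int) -> int:
    """Return the word with its thread ID replaced; the lower
    32 bits are preserved unchanged."""
    return (new_tid << TID_SHIFT) | (word & LOW_32_MASK)


old_word = 0x000003B800000001
new_word = with_tid(old_word, 3690)
print(tid_of(old_word), hex(new_word))
```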

art-parser> rd 0xb400007d67e42c00 -e 0xb400007d67e42c10
0xb400007d67e42c00: 0x0000000000000000  0x000003b800000001
art-parser> wd 0xb400007d67e42c08 0x00000e6a00000001
New overlay (0x785f98f000) [0x7d67a00000, 0x7d68200000).
Overlay vaddr(0xb400007d67e42c08) old(0x000003b800000001) new(0x00000e6a00000001)

art-parser> thread
[3690] "main" tid=1 Unknown
[3657] "FinalizerWatchdogDaemon" tid=5 Unknown
art-parser> bt 3690
"main" prio=5 tid=1 Unknown
  | group="main" sCount=0 ucsCount=0 flags=0 obj=0x72845b10 self=0xb400007d67e42c00
  | sysTid=3690 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7e1a07b4f8
  | stack=0x7fed4f0000-0x7fed4f2000 stackSize=0x7ff000
  | held mutexes= "mutator lock"
  | wait mutex= "Region lock" held by thread 3657
  x0  0xb400007d67e22ca4  x1  0x0000000000000080  x2  0x0000000000000003  x3  0x0000000000000000  
  x4  0x0000000000000000  x5  0x0000000000000000  x6  0x0000000000000000  x7  0x0000007fedceb2f8  
  x8  0x0000000000000062  x9  0x0000000000000000  x10 0x0000000000000032  x11 0x0000000000000033  
  x12 0x0000000000000003  x13 0x0000000000000033  x14 0x000000000000001c  x15 0x00000000703013b8  
  x16 0x0000007d67336918  x17 0x0000007dec4ee2e0  x18 0x0000007e19ada000  x19 0xb400007d67e22c90  
  x20 0xb400007d67e42c00  x21 0xb400007d67e22ca4  x22 0x0000007d66bde13b  x23 0x0000007d66bc4f70  
  x24 0x0000007d6753d000  x25 0x0000007fedceb150  x26 0x0000000000000001  x27 0x00000000fffffffe  
  x28 0x0000007e18d58000  x29 0x0000007fedceb190  
  lr  0x0000007d66d81b34  sp  0x0000007fedceb120  pc  0x0000007dec4ee2fc  pst 0x0000000060001000  
  FP[0x7fedceb190] PC[0x7dec4ee2fc] native: #00 (syscall+0x1c) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7fedceb190] PC[0x7d66d81b34] native: #01 (art::Mutex::ExclusiveLock(art::Thread*)+0x154) /apex/com.android.art/lib64/libart.so
  FP[0x7fedceb1f0] PC[0x7d66edfe28] native: #02 (art::gc::space::RegionSpace::AllocNewTlab(art::Thread*, unsigned long, unsigned long*)+0x38) /apex/com.android.art/lib64/libart.so
  FP[0x7fedceb280] PC[0x7d66ea84b0] native: #03 (art::gc::Heap::AllocWithNewTLAB(art::Thread*, art::gc::AllocatorType, unsigned long, bool, unsigned long*, unsigned long*, unsigned long*)+0x4f0) /apex/com.android.art/lib64/libart.so
  FP[0x7fedceb390] PC[0x7d6727e244] native: #04 (artAllocArrayFromCodeResolvedRegionTLAB+0xe4) /apex/com.android.art/lib64/libart.so
  QF[0x7fedceb450] PC[0x00712a2bb4] at dex-pc 0x7d66667c92 java.util.regex.Pattern.fastSplit  //AM[0x6fccbfb0]
  QF[0x7fedceb4b0] PC[0x0071237204] at dex-pc 0x7d6650ad72 java.lang.String.split  //AM[0x6fd3ec48]
  QF[0x7fedceb4f0] PC[0x0071ccfde0] at dex-pc 0x7d64410aba com.android.internal.os.ZygoteArguments.parseArgs  //AM[0x708328d8]
  QF[0x7fedceb620] PC[0x0071ccf2a0] at dex-pc 0x7d64410398 com.android.internal.os.ZygoteArguments.getInstance  //AM[0x708328b8]
  QF[0x7fedceb660] PC[0x7d66d3041c] at dex-pc 0x7d6441526e com.android.internal.os.Zygote.childMain  //AM[0x709a8698]
  QF[0x7fedceb850] PC[0x7d66d303b8] at dex-pc 0x7d644156f4 com.android.internal.os.Zygote.forkUsap  //AM[0x709a87f8]
  QF[0x7fedceb960] PC[0x7d66d303b8] at dex-pc 0x7d6441415e com.android.internal.os.ZygoteServer.fillUsapPool  //AM[0x70832f98]
  QF[0x7fedceba80] PC[0x0071cdc344] at dex-pc 0x7d64414632 com.android.internal.os.ZygoteServer.runSelectLoop  //AM[0x70833018]
  QF[0x7fedcebb80] PC[0x0071cd728c] at dex-pc 0x7d6441307e com.android.internal.os.ZygoteInit.main  //AM[0x709a91a8]

At this point we can also parse the arguments passed in by the client and check whether they agree with the earlier analysis.

art-parser> disassemble 0x708328b8 -i 0x7d64410398
com.android.internal.os.ZygoteArguments com.android.internal.os.ZygoteArguments.getInstance(com.android.internal.os.ZygoteCommandBuffer) [dex_method_idx=61801]
DEX CODE:
  0x7d64410398: 3070 f166 0021           | invoke-direct {v1, v2, v0}, void com.android.internal.os.ZygoteArguments.<init>(com.android.internal.os.ZygoteCommandBuffer, int) // method@61798
  
art-parser> bt 3690 -v
  QF[0x7fedceb620] PC[0x0071ccf2a0] at dex-pc 0x7d64410398 com.android.internal.os.ZygoteArguments.getInstance  //AM[0x708328b8]
  {
    StackMap[5] (code_region=[0x71ccf220-0x71ccf2b8], native_pc=0x80, dex_pc=0xa, register_mask=0xc00000)
      Physical registers
      {
        x22 = 0x12d7e5f8    x23 = 0x12d7e5e0    x24 = 0xffffffff    x25 = 0x10    
        x26 = 0x7fedceb734    x27 = 0x7fedceb6b8    x28 = 0x7fedceb7b0    x29 = 0x7fedceb734    
        x30 = 0x71ccf2a0
      }
  }
   0x71ccf280 <+96>:         mov        x2, x23
   0x71ccf284 <+100>:        mov        x3, x25
   0x71ccf288 <+104>:        mov        x1, x0
   0x71ccf28c <+108>:        mov        x22, x1
   0x71ccf290 <+112>:        adrp       x0, 0x70832000
   0x71ccf294 <+116>:        add        x0, x0, #0x8d8
   0x71ccf298 <+120>:        ldr        x30, [x0, #24]
   0x71ccf29c <+124>:        blr        x30
art-parser> p 0x12d7e5f8
Size: 0x90
Padding: 0x5
Object Name: com.android.internal.os.ZygoteArguments
  iFields of com.android.internal.os.ZygoteArguments
    [0x7c] boolean mAbiListQuery = 0x0
    [0x8] java.lang.String[] mAllowlistedDataInfoList = 0x12d7ec30
    [0xc] java.lang.String[] mApiDenylistExemptions = 0x0
    [0x10] java.lang.String mAppDataDir = /data/user/0/com.android.statementservice
    [0x7d] boolean mBindMountAppDataDirs = 0x0
    [0x7e] boolean mBindMountAppStorageDirs = 0x0
    [0x7f] boolean mBootCompleted = 0x0
    [0x80] boolean mCapabilitiesSpecified = 0x0
    [0x14] long[] mDisabledCompatChanges = 0x0
    [0x50] long mEffectiveCapabilities = 0x0
    [0x60] int mGid = 0x276c
    [0x81] boolean mGidSpecified = 0x1
    [0x18] int[] mGids = 0x12d7e8a0
    [0x64] int mHiddenApiAccessLogSampleRate = 0xffffffff
    [0x68] int mHiddenApiAccessStatslogSampleRate = 0xffffffff
    [0x1c] java.lang.String mInstructionSet = 0x0
    [0x20] java.lang.String mInvokeWith = 0x0
    [0x82] boolean mIsTopApp = 0x0
    [0x6c] int mMountExternal = 0x1
    [0x24] java.lang.String mNiceName = com.android.statementservice
    [0x28] java.lang.String mPackageName = com.android.statementservice
    [0x58] long mPermittedCapabilities = 0x0
    [0x83] boolean mPidQuery = 0x0
    [0x2c] java.lang.String[] mPkgDataInfoList = 0x12d7eb30
    [0x30] java.lang.String mPreloadApp = 0x0
    [0x84] boolean mPreloadDefault = 0x0
    [0x34] java.lang.String mPreloadPackage = 0x0
    [0x38] java.lang.String mPreloadPackageCacheKey = 0x0
    [0x3c] java.lang.String mPreloadPackageLibFileName = 0x0
    [0x40] java.lang.String mPreloadPackageLibs = 0x0
    [0x44] java.util.ArrayList mRLimits = 0x0
    [0x48] java.lang.String[] mRemainingArgs = 0x0
    [0x70] int mRuntimeFlags = 0x1200800
    [0x4c] java.lang.String mSeInfo = platform:privapp:targetSdkVersion=29:complete
    [0x85] boolean mSeInfoSpecified = 0x1
    [0x86] boolean mStartChildZygote = 0x0
    [0x74] int mTargetSdkVersion = 0x1d
    [0x87] boolean mTargetSdkVersionSpecified = 0x1
    [0x78] int mUid = 0x276c
    [0x88] boolean mUidSpecified = 0x1
    [0x89] boolean mUsapPoolEnabled = 0x0
    [0x8a] boolean mUsapPoolStatusSpecified = 0x0
  iFields of java.lang.Object
    [0x0] java.lang.Class shadow$_klass_ = 0x70177360
    [0x4] int shadow$_monitor_ = 0x0

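The arguments confirm that this usap64 was indeed being used to launch the application com.android.statementservice, which matches what was parsed on the client side.

Recovering the stack of thread 3657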

7d5016b000-7d50272000 rw-p 00000000 00:00 0 [anon:stack_and_tls:3657]
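We already know the stack address range of thread 3657, so we can reconstruct its call chain from the stack memory that is still available. This time, however, the reconstruction can only proceed bottom-up.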
(gdb) x /i 0x0000007dec56126c-0x4
0x7dec561268 <_ZN5NonPIL20MutexLockWithTimeoutEP24pthread_mutex_internal_tbPK8timespec+328>: bl 0x7dec4f3040 <_Z15__futex_wait_exPVvbibPK8timespec>
Dump of assembler code for function _ZN5NonPIL20MutexLockWithTimeoutEP24pthread_mutex_internal_tbPK8timespec:
0x0000007dec561120 <+0>: sub sp, sp, #0x80
0x0000007dec561124 <+4>: stp x29, x30, [sp, #32]
0x0000007dec561128 <+8>: str x27, [sp, #48]
0x0000007dec56112c <+12>: stp x26, x25, [sp, #64]
0x0000007dec561130 <+16>: stp x24, x23, [sp, #80]
0x0000007dec561134 <+20>: stp x22, x21, [sp, #96]
0x0000007dec561138 <+24>: stp x20, x19, [sp, #112]
0x0000007dec56113c <+28>: add x29, sp, #0x20

This gives us the set of register values that gdb, lldb and art-parser need in order to unwind the stack.

Namely:

Register Value
PC 0x7dec4f3040
LR 0x7dec56126c
SP 0x7d5026e550
FP 0x7d5026e570
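
The arithmetic behind these values can be checked with a short sketch. It assumes, as the prologue above shows, that MutexLockWithTimeout sets x29 to sp + 0x20, and that the thread is treated as if it were stopped on the first instruction of __futex_wait_ex, so that sp and x29 still belong to the caller. The class and method names are illustrative only.

```java
final class UnwindSeed {
    // From the prologue "add x29, sp, #0x20": x29 = sp + 0x20.
    static final long FP_OFFSET_FROM_SP = 0x20;
    // Every AArch64 instruction is 4 bytes long.
    static final long INSN_SIZE = 4;

    /** SP of the frame whose frame pointer is fp. */
    static long spFromFp(long fp) {
        return fp - FP_OFFSET_FROM_SP;
    }

    /** LR after a "bl" located at blAddress: the next instruction. */
    static long returnAddressOf(long blAddress) {
        return blAddress + INSN_SIZE;
    }

    public static void main(String[] args) {
        long fp = 0x7d5026e570L;   // FP from the table above
        long bl = 0x7dec561268L;   // address of the "bl __futex_wait_ex"
        System.out.printf("SP = 0x%x%n", spFromFp(fp));
        System.out.printf("LR = 0x%x%n", returnAddressOf(bl));
    }
}
```

PC is simply the target of the bl instruction, 0x7dec4f3040.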

Next we modify the coredump file of the original process 3690.

0001B710 05 00 00 00 88 01 00 00 01 00 00 00 43 4F 52 45 ............CORE
0001B720 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B730 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B740 00 00 00 00 6A 0E 00 00 00 00 00 00 00 00 00 00 ....j...........
0001B750 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B770 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B780 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
0001B790 00 00 00 00 A4 2C E2 67 7D 00 00 B4 80 00 00 00 .....,.g}.......
0001B7A0 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7C0 00 00 00 00 00 00 00 00 00 00 00 00 F8 B2 CE ED ................
0001B7D0 7F 00 00 00 62 00 00 00 00 00 00 00 00 00 00 00 ....b...........
0001B7E0 00 00 00 00 32 00 00 00 00 00 00 00 33 00 00 00 ....2.......3...
0001B7F0 00 00 00 00 03 00 00 00 00 00 00 00 33 00 00 00 ............3...
0001B800 00 00 00 00 1C 00 00 00 00 00 00 00 B8 13 30 70 ..............0p
0001B810 00 00 00 00 18 69 33 67 7D 00 00 00 E0 E2 4E EC .....i3g}.....N.
0001B820 7D 00 00 00 00 A0 AD 19 7E 00 00 00 90 2C E2 67 }.......~....,.g
0001B830 7D 00 00 B4 00 2C E4 67 7D 00 00 B4 A4 2C E2 67 }....,.g}....,.g
0001B840 7D 00 00 B4 3B E1 BD 66 7D 00 00 00 70 4F BC 66 }...;..f}...pO.f
0001B850 7D 00 00 00 00 D0 53 67 7D 00 00 00 50 B1 CE ED }.....Sg}...P...
0001B860 7F 00 00 00 01 00 00 00 00 00 00 00 FE FF FF FF ................
0001B870 00 00 00 00 00 80 D5 18 7E 00 00 00 90 B1 CE ED ........~.......
0001B880 7F 00 00 00 34 1B D8 66 7D 00 00 00 20 B1 CE ED ....4..f}... ...
0001B890 7F 00 00 00 FC E2 4E EC 7D 00 00 00 00 10 00 60 ......N.}......`

After modifying the registers and the thread ID in the coredump:

0001B710 05 00 00 00 88 01 00 00 01 00 00 00 43 4F 52 45 ............CORE
0001B720 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B730 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B740 00 00 00 00 49 0E 00 00 00 00 00 00 00 00 00 00 ....I...........
0001B750 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B770 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B780 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B790 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B7F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B810 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B820 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B830 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B850 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B860 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0001B870 00 00 00 00 00 00 00 00 00 00 00 00 70 E5 26 50 ............p.&P
0001B880 7D 00 00 00 6C 12 56 EC 7D 00 00 00 50 E5 26 50 }...l.V.}...P.&P
0001B890 7D 00 00 00 40 30 4F EC 7D 00 00 00 00 10 00 60 }...@0O.}......`
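
The same patch can be scripted instead of typed into a hex editor. The following is a minimal sketch, not the tool used in this analysis. It assumes that the NT_PRSTATUS descriptor starts at 0x1B724 (note header at 0x1B710, plus a 12-byte header and the 8-byte padded name "CORE"), that the thread ID sits at descriptor offset 0x20 and that x0 starts at descriptor offset 0x70. These offsets are read off the two dumps above and should be re-checked for any other core file. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PrstatusPatcher {
    // Offsets inside the NT_PRSTATUS descriptor, inferred from the dumps above.
    static final int PR_PID_OFFSET = 0x20;
    static final int PR_REG_OFFSET = 0x70;

    /** Rewrites the thread ID and seeds x29, lr, sp and pc; x0..x28 are zeroed. */
    static void patch(ByteBuffer core, int descOffset, int tid,
                      long fp, long lr, long sp, long pc) {
        core.order(ByteOrder.LITTLE_ENDIAN);
        core.putInt(descOffset + PR_PID_OFFSET, tid);
        int regs = descOffset + PR_REG_OFFSET;
        for (int i = 0; i < 29; i++) {
            core.putLong(regs + i * 8, 0L);   // x0..x28 are unknown
        }
        core.putLong(regs + 29 * 8, fp);      // x29
        core.putLong(regs + 30 * 8, lr);      // x30
        core.putLong(regs + 31 * 8, sp);
        core.putLong(regs + 32 * 8, pc);
        // pstate at regs + 33 * 8 is left untouched.
    }

    public static void main(String[] args) {
        // Stand-in for the real file contents; a real tool would map the core file.
        ByteBuffer core = ByteBuffer.allocate(0x1B900);
        patch(core, 0x1B724, 3657,
              0x7d5026e570L, 0x7dec56126cL, 0x7d5026e550L, 0x7dec4f3040L);
        System.out.printf("byte at 0x1B744 = %02X%n", core.get(0x1B744) & 0xFF);
    }
}
```

With these inputs the thread ID bytes at 0x1B744 become 49 0E (3657) and x29 lands at 0x1B87C, matching the modified dump.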
art-parser> bt 3657
"FinalizerWatchdogDaemon" prio=<unknown> tid=5 Unknown
  | group="<unknown>" sCount=0 ucsCount=0 flags=0 obj=0x0 self=0xb400007d67f87c00
  | sysTid=3657 nice=<unknown> cgrp=<unknown> sched=<unknown> handle=0x7d5026ecb0
  | stack=0x7d5016b000-0x7d5016d000 stackSize=0x103cb0
  | held mutexes= "Region lock" "mutator lock"
  x0  0x0000000000000000  x1  0x0000000000000000  x2  0x0000000000000000  x3  0x0000000000000000  
  x4  0x0000000000000000  x5  0x0000000000000000  x6  0x0000000000000000  x7  0x0000000000000000  
  x8  0x0000000000000000  x9  0x0000000000000000  x10 0x0000000000000000  x11 0x0000000000000000  
  x12 0x0000000000000000  x13 0x0000000000000000  x14 0x0000000000000000  x15 0x0000000000000000  
  x16 0x0000000000000000  x17 0x0000000000000000  x18 0x0000000000000000  x19 0x0000000000000000  
  x20 0x0000000000000000  x21 0x0000000000000000  x22 0x0000000000000000  x23 0x0000000000000000  
  x24 0x0000000000000000  x25 0x0000000000000000  x26 0x0000000000000000  x27 0x0000000000000000  
  x28 0x0000000000000000  x29 0x0000007d5026e570  
  lr  0x0000007dec56126c  sp  0x0000007d5026e550  pc  0x0000007dec4f3040  pst 0x0000000060001000  
  FP[0x7d5026e570] PC[0x7dec4f3040] native: #00 (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+0x0) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026e5e0] PC[0x7dec56126c] native: #02 (NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+0x14c) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026e670] PC[0x7dec4da500] native: #03 (je_malloc_mutex_lock_slow+0xc0) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026e6c0] PC[0x7dec4ba294] native: #04 (je_arena_tcache_fill_small+0x64) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026e720] PC[0x7dec4e543c] native: #05 (je_tcache_alloc_small_hard+0x1c) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026e780] PC[0x7dec4b1c38] native: #06 (je_malloc+0x2e8) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026e7d0] PC[0x7dec4ad5c8] native: #07 (malloc+0x28) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026e7f0] PC[0x7e12e5401c] native: #08 (operator new(unsigned long)+0x1c) /apex/com.android.art/lib64/libc++.so
  FP[0x7d5026e810] PC[0x7d66ee0024] native: #09 (art::gc::space::RegionSpace::RevokeThreadLocalBuffersLocked(art::Thread*, bool)+0x74) /apex/com.android.art/lib64/libart.so
  FP[0x7d5026e840] PC[0x7d66ee01f4] native: #10 (art::gc::space::RegionSpace::RevokeThreadLocalBuffers(art::Thread*)+0x54) /apex/com.android.art/lib64/libart.so
  FP[0x7d5026e880] PC[0x7d66ea646c] native: #11 (art::gc::Heap::RevokeThreadLocalBuffers(art::Thread*)+0x8c) /apex/com.android.art/lib64/libart.so
  FP[0x7d5026e960] PC[0x7d671da368] native: #12 (art::Thread::Destroy(bool)+0xf18) /apex/com.android.art/lib64/libart.so
  FP[0x7d5026eb30] PC[0x7d671effe4] native: #13 (art::ThreadList::Unregister(art::Thread*, bool)+0xa4) /apex/com.android.art/lib64/libart.so
  FP[0x7d5026ebf0] PC[0x7d671cb16c] native: #14 (art::Thread::CreateCallback(void*)+0x8ac) /apex/com.android.art/lib64/libart.so
  FP[0x7d5026ec50] PC[0x7dec55fd60] native: #15 (__pthread_start(void*)+0xd0) /apex/com.android.runtime/lib64/bionic/libc.so
  FP[0x7d5026ec80] PC[0x7dec4f3bc4] native: #16 (__start_thread+0x44) /apex/com.android.runtime/lib64/bionic/libc.so

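Loading the modified coredump back into art-parser, we can clearly see that thread 3657 had acquired the Region lock inside RevokeThreadLocalBuffers when the fork happened. The child inherited the lock in its held state, and no thread exists in the child to release it, so the child process deadlocked.

Question 2

Google did in fact take into account that forking a multithreaded process on Linux easily leads to deadlock, so Zygote stops all daemon threads before forking a child. Why, then, did the fork happen before thread 3657 had finished exiting?

Detail 1

Some readers may have noticed that the threads had all been joined. Why was there still a daemon thread that had not exited even though Daemons.stop() had already returned?

This comes down to how ART designs its threads. When a thread is joined in ART, the lock the joiner is waiting on is released inside art::Thread::Destroy, so the joiner stops waiting at that point, before the native thread has actually finished exiting. Thread 3657 had already run as far as RevokeThreadLocalBuffers, so the last Daemon.stop() in the Java layer had already returned.

Detail 2

Even after Daemons.stop() has returned, the ART runtime can still add one more safeguard: it only needs to make sure that the ThreadList contains nothing but the main thread. We can look at how ART goes on to handle nativePreFork, and whether it waits for the ThreadList to be cleaned up.

On our device, however, unregistering_count_ had not dropped to 0 when nativePreFork finished.

Comparing this with the PreZygoteFork code in our project, it turns out that this piece of code is simply not there.

Our project currently handles this by using a Java API to count the files under /proc/self/task and decide whether only one thread is left in the process.

From the memory of process 3690 we can look up the thread-local data left over from the parent process and find the last java.lang.String[] objects produced by tasks.list():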

art-parser> p 0x12d7e370
Size: 0x18
Array Name: java.lang.String[]
    [0] 952
    [1] 3655
    [2] 3657
    
art-parser> p 0x12d7e3f0
Size: 0x10
Array Name: java.lang.String[]
    [0] 952
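
The polling approach described above, counting the entries under /proc/self/task until only one is left, can be sketched roughly as follows. This is not the project's or ART's actual code. The maxPolls bound and the Supplier parameter are additions made here so that the sketch terminates and can be exercised without a real /proc; the class and method names are hypothetical.

```java
import java.io.File;
import java.util.function.Supplier;

final class TaskCountPoller {
    /**
     * Polls a directory listing until it has at most one entry.
     * Returns false if that never happens within maxPolls attempts.
     */
    static boolean waitUntilSingleEntry(Supplier<String[]> lister, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            String[] names = lister.get();
            if (names != null && names.length <= 1) {
                return true;   // the listing claims only one thread is left
            }
            Thread.yield();
        }
        return false;
    }

    public static void main(String[] args) {
        File tasks = new File("/proc/self/task");
        // A JVM normally runs several threads, so this usually prints false.
        System.out.println(waitUntilSingleEntry(tasks::list, 10));
    }
}
```

The check trusts the length of the listing. In the two arrays above, the listing went from three entries to one while thread 3657 was still running, so a check of this kind reports success too early.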

Conclusion

Why counting the files under /proc/self/task through the Java API can be unreliable still needs further investigation.

Revert^2 "Wait for thread termination in PreZygoteFork()" android-review.googlesource.com/c/platform/...
Revert^2 "Remove waitUntilAllThreadsStopped()" android-review.googlesource.com/c/platform/...

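The new waitUntilAllThreadsStopped handling from Google still has a flaw. The loop is limited to 1000 samples. We cannot rule out the case where the last thread has already returned from Unregister, so the check that the ThreadList size is 1 passes, yet the thread is still waiting in do_exit in the kernel. In that case waitUntilAllThreadsStopped may not reach its condition within 1000 checks and raises a FATAL error.

Supplement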

/*
 * Find the next thread in the thread list.
 * Return NULL if there is an error or no next thread.
 *
 * The reference to the input task_struct is released.
 */
static struct task_struct *next_tid(struct task_struct *start)
{
    struct task_struct *pos = NULL;
    rcu_read_lock();
    if (pid_alive(start)) {
        pos = next_thread(start);
        if (thread_group_leader(pos))
            pos = NULL;
        else
            get_task_struct(pos);
    }
    rcu_read_unlock();
    put_task_struct(start);
    return pos;
}

/* for the /proc/TGID/task/ directories */
static int proc_task_readdir(struct file *file, struct dir_context *ctx)
{
    struct inode *inode = file_inode(file);
    struct task_struct *task;
    struct pid_namespace *ns;
    int tid;

    if (proc_inode_is_dead(inode))
        return -ENOENT;

    if (!dir_emit_dots(file, ctx))
        return 0;

    /* f_version caches the tgid value that the last readdir call couldn't
     * return. lseek aka telldir automagically resets f_version to 0.
     */
    ns = proc_pid_ns(inode->i_sb);
    tid = (int)file->f_version;
    file->f_version = 0;
    for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
         task;
         task = next_tid(task), ctx->pos++) {
        char name[10 + 1];
        unsigned int len;

        tid = task_pid_nr_ns(task, ns);
        if (!tid)
            continue;   /* The task has just exited. */
        len = snprintf(name, sizeof(name), "%u", tid);
        if (!proc_fill_cache(file, ctx, name, len,
                proc_task_instantiate, task, NULL)) {
            /* returning this tgid failed, save it as the first
             * pid for the next readir call */
            file->f_version = (u64)tid;
            put_task_struct(task);
            break;
        }
    }

    return 0;
}

Verification experiment

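How the experiment works:
Before creating the Daemon threads, save all thread IDs under /proc/self/task as sample one (alltask).
Start 5 Daemon threads, named Daemon-0, Daemon-1, ..., Daemon-4.
Stop 4 of the Daemon threads and keep Daemon-4 running.
Save all thread IDs under /proc/self/task as sample two (savetask), and keep looping while savetask is longer than alltask. After 1000 iterations, stop Daemon-4 and start the test over.
When the length of savetask is <= the length of alltask, capture a coredump of the scene.
Parse the memory with art-parser, check whether Daemon-4 is present in savetask, and compare the data before and after. If Daemon-4 is absent although the thread is still alive, proc_task_readdir returned a truncated listing.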
int length = 0;
String[] savetask;
String[] alltask;

void doTest(View view) {
    // Five Daemon workers; they are (re)started on every round below.
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());
    mThreads.add(new Daemon());

    new Thread(new Runnable() {
        @Override
        public void run() {
            while (true) {
                // Sample one: thread IDs before any Daemon thread is started.
                File tasks = new File("/proc/self/task");
                alltask = tasks.list();
                length = alltask.length;
                Log.v("Opencore", "s " + length);

                for (int i = 0; i < mThreads.size(); i++) {
                    mThreads.get(i).start(i);
                }

                // Stop every Daemon except the last one (Daemon-4).
                for (int i = 0; i < mThreads.size() - 1; i++) {
                    mThreads.get(i).stop();
                }

                int loop = 0;
                boolean dodump = true;

                // Sample two: while Daemon-4 is alive, the listing should stay
                // longer than sample one.
                File tasks2 = new File("/proc/self/task");
                savetask = tasks2.list();
                while (savetask.length > length) {
                    Log.v("Opencore", "" + savetask.length);
                    Thread.yield();
                    if (loop == 1000) {
                        // No truncation seen in 1000 samples: stop Daemon-4 so
                        // the loop can end, and skip the dump for this round.
                        mThreads.get(mThreads.size() - 1).stop();
                        dodump = false;
                    }
                    loop++;
                    savetask = tasks2.list();
                }

                Log.v("Opencore", "e " + length);
                if (dodump) {
                    // The listing shrank while Daemon-4 was still running:
                    // capture the scene, then stop Daemon-4.
                    Coredump.getInstance().doCoredump();
                    mThreads.get(4).stop();
                }
            }
        }
    }).start();
}
art-parser> p 0x16d40e20
Size: 0x180
Padding: 0x4
Object Name: penguin.opencore.coretester.MainActivity
  iFields of penguin.opencore.coretester.MainActivity
    [0x168] java.lang.String[] alltask = 0x13cc3d30
    [0x178] int length = 0x1a
    [0x16c] java.util.ArrayList mThreads = 0x16d68448
    [0x170] penguin.opencore.coretester.NterpTester nterpTester = 0x16d68460
    [0x174] java.lang.String[] savetask = 0x13cc5590
    
art-parser> p 0x13cc3d30
Size: 0x78
Padding: 0x4
Array Name: java.lang.String[]
    [0] 26654
    [1] 26661
    [2] 26662
    [3] 26663
    [4] 26664
    [5] 26665
    [6] 26666
    [7] 26667
    [8] 26668
    [9] 26669
    [10] 26670
    [11] 26673
    [12] 26674
    [13] 26676
    [14] 26677
    [15] 26679
    [16] 26682
    [17] 26683
    [18] 26684
    [19] 26689
    [20] 26691
    [21] 26693
    [22] 26694
    [23] 26695
    [24] 26730
    [25] 27063
    
art-parser> p 0x13cc5590
Size: 0x78
Padding: 0x4
Array Name: java.lang.String[]
    [0] 26654
    [1] 26661
    [2] 26662
    [3] 26663
    [4] 26664
    [5] 26665
    [6] 26666
    [7] 26667
    [8] 26668
    [9] 26669
    [10] 26670
    [11] 26673
    [12] 26674
    [13] 26676
    [14] 26677
    [15] 26679
    [16] 26682
    [17] 26683
    [18] 26684
    [19] 26689
    [20] 26691
    [21] 26693
    [22] 26694
    [23] 26695
    [24] 26730
    [25] 27063
    
art-parser> thread
[26654] "main" tid=1 Native
[26661] "Signal Catcher" tid=4 WaitingInMainSignalCatcherLoop
[26662] "perfetto_hprof_listener" tid=7 Native
[26663] "ADB-JDWP Connection Control Thread" tid=8 WaitingInMainDebuggerLoop
[26664] "Jit thread pool worker thread 0" tid=9 Native
[26665] "HeapTaskDaemon" tid=10 WaitingForTaskProcessor
[26666] "ReferenceQueueDaemon" tid=11 Waiting
[26667] "FinalizerDaemon" tid=12 Waiting
[26668] "FinalizerWatchdogDaemon" tid=13 Sleeping
[26669] "binder:26654_1" tid=14 Native
[26670] "binder:26654_2" tid=15 Native
[26673] "binder:26654_3" tid=16 Native
[26674] "26654-ScoutStateMachine" tid=17 Native
[26676] "Profile Saver" tid=18 Native
[26677] "opencore-bg" tid=19 Native
[26679] "RenderThread" tid=21 Native
[26682] "Binder:interceptor" tid=22 Native
[26683] "Timer-0" tid=23 Waiting
[26684] "Timer-1" tid=24 Waiting
[26691] "SurfaceSyncGroupTimer" tid=20 Native
[26693] "hwuiTask0" tid=25 Native
[26694] "hwuiTask1" tid=26 Native
[26730] "Thread-6" tid=27 Waiting
[27063] "binder:26654_4" tid=2 Native
[27789] "Daemon-4" tid=3 Native
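
Checking by eye whether Daemon-4 made it into savetask is error-prone, so the comparison can be automated. The sketch below hard-codes the savetask array and the thread IDs from the art-parser output above and reports which live threads are absent from the listing. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class TaskListDiff {
    // Thread IDs reported by "art-parser> thread" above.
    static final int[] LIVE_TIDS = {
        26654, 26661, 26662, 26663, 26664, 26665, 26666, 26667, 26668,
        26669, 26670, 26673, 26674, 26676, 26677, 26679, 26682, 26683,
        26684, 26691, 26693, 26694, 26730, 27063, 27789
    };
    // Contents of savetask (0x13cc5590) above.
    static final String[] SAVETASK = {
        "26654", "26661", "26662", "26663", "26664", "26665", "26666",
        "26667", "26668", "26669", "26670", "26673", "26674", "26676",
        "26677", "26679", "26682", "26683", "26684", "26689", "26691",
        "26693", "26694", "26695", "26730", "27063"
    };

    /** Returns the live thread IDs that do not appear in the directory listing. */
    static SortedSet<Integer> missingFromListing(int[] liveTids, String[] listing) {
        Set<Integer> listed = new HashSet<>();
        for (String name : listing) {
            listed.add(Integer.parseInt(name));
        }
        SortedSet<Integer> missing = new TreeSet<>();
        for (int tid : liveTids) {
            if (!listed.contains(tid)) {
                missing.add(tid);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(missingFromListing(LIVE_TIDS, SAVETASK));
    }
}
```

For this data the only missing ID is 27789, which is Daemon-4: the thread was alive when the coredump was taken, yet the listing did not contain it.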

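Root cause

The Java API ZygoteHooks.waitUntilAllThreadsStopped has been in use for many years, so why did this kind of problem only show up recently, and why did it suddenly become unreliable? The cause is a small fix in recent kernels: commit 0658a0961b0ac ("procfs: do not list TID 0 in /proc/<pid>/task").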

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 533d5836eb9a..5541de99809c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3799,7 +3799,10 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
             task = next_tid(task), ctx->pos++) {
                char name[10 + 1];
                unsigned int len;
+
                tid = task_pid_nr_ns(task, ns);
+               if (!tid)
+                       continue;       /* The task has just exited. */
                len = snprintf(name, sizeof(name), "%u", tid);
                if (!proc_fill_cache(file, ctx, name, len,
                                proc_task_instantiate, task, NULL)) {

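Take the case in this article: when proc_task_readdir truncates the listing, what user space sees depends on the kernel version.

Version Listing returned
Before 0658a0961b0ac 952, 3655 or 952, 0
After 0658a0961b0ac 952

So on earlier Android versions (with older kernels), even when the kernel truncated the listing, the loop while (tasks.list().length > 1) in this function still saw at least 2 directory entries and therefore kept waiting until the threads had really exited. Running an adjusted version of the same experiment on kernel 5.15 shows that a truncated listing there does contain a directory named 0.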

art-parser> p 0x13d81178
Size: 0x178
Object Name: penguin.opencore.coretester.MainActivity
  iFields of penguin.opencore.coretester.MainActivity
    [0x164] java.lang.String[] alltask = 0x13ba88e0
    [0x174] int length = 0x1b
    [0x168] java.util.ArrayList mThreads = 0x13da7778
    [0x16c] penguin.opencore.coretester.NterpTester nterpTester = 0x13da7790
    [0x170] java.lang.String[] savetask = 0x13ba9438

art-parser> p 0x13ba9438
Size: 0x80
Padding: 0x4
Array Name: java.lang.String[]
    [0] 30799
    [1] 30806
    [2] 30807
    [3] 30808
    [4] 30809
    [5] 30810
    [6] 30811
    [7] 30812
    [8] 30813
    [9] 30814
    [10] 30815
    [11] 30817
    [12] 30823
    [13] 30825
    [14] 30827
    [15] 30828
    [16] 30831
    [17] 30832
    [18] 30833
    [19] 30841
    [20] 30842
    [21] 30852
    [22] 30853
    [23] 30859
    [24] 30860
    [25] 30898
    [26] 31335
    [27] 0
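
The difference between the two kernel behaviours can be illustrated with a small model. This is a deliberate simplification and not kernel code: it assumes that a thread which exits during the walk reports TID 0 (as task_pid_nr_ns does) and that next_tid cannot continue past it because pid_alive fails. The thread IDs are the ones from this article; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TaskReaddirModel {
    /** One thread in the group; exited means it died before the walk reached it. */
    static final class Task {
        final int tid;
        final boolean exited;

        Task(int tid, boolean exited) {
            this.tid = tid;
            this.exited = exited;
        }
    }

    /** Thread group from the article: 3655 exits during the walk, 3657 is still alive. */
    static List<Task> sampleGroup() {
        List<Task> group = new ArrayList<>();
        group.add(new Task(952, false));
        group.add(new Task(3655, true));
        group.add(new Task(3657, false));
        return group;
    }

    /** Models proc_task_readdir; skipTidZero selects the behaviour after the fix. */
    static List<String> readdir(List<Task> group, boolean skipTidZero) {
        List<String> names = new ArrayList<>();
        for (Task task : group) {
            int tid = task.exited ? 0 : task.tid;   // models task_pid_nr_ns()
            if (tid != 0 || !skipTidZero) {
                names.add(String.valueOf(tid));
            }
            if (task.exited) {
                break;   // models next_tid() returning NULL for a dead task
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("before fix: " + readdir(sampleGroup(), false));
        System.out.println("after fix:  " + readdir(sampleGroup(), true));
    }
}
```

In this model the old behaviour yields 952 and 0, which keeps a "length > 1" loop waiting, while the new behaviour yields only 952 and lets the loop exit although 3657 is still running. The "952, 3655" variant, where the thread exits after its entry has been emitted, is not modelled.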

As the old saying goes: this feature was built on top of a bug, and if you fix that bug, then I become the bug.
