一:背景
1. 讲故事
客户端程序卡死是什么原因导致的?dump也拿到了手,既然有了dump就开始正式分析吧。
二:WinDbg 分析
1. 什么导致的卡死
客户端的程序卡死比较好找原因,入手点就是主线程,看下它此时正在做什么,可以用 k
命令。
0:000> k
# ChildEBP RetAddr
00 003cdf7c 74c115ce ntdll!NtWaitForSingleObject+0x15
01 003cdf7c 756e1194 KERNELBASE!WaitForSingleObjectEx+0x98
02 003cdf94 6f573bea kernel32!WaitForSingleObjectExImplementation+0x75
03 003cdfc4 6f573c31 clr!CLREventWaitHelper2+0x33
04 003ce014 6f573bb6 clr!CLREventWaitHelper+0x2a
05 003ce04c 6f57c8be clr!CLREventBase::WaitEx+0x152
06 003ce060 6f5764a9 clr!WKS::GCHeap::WaitUntilGCComplete+0x34
07 003ce0b0 6f583cf4 clr!Thread::RareDisablePreemptiveGC+0x231
08 003ce134 6a87a767 clr!JIT_RareDisableHelper+0x24
09 003ce16c 6a87472b System_Drawing_ni+0x4a767
0a 003ce17c 0846b372 System_Drawing_ni!System.Drawing.Graphics.Clear+0x1b
...
从卦中信息看,代码正在托管层做Graphics,突然程序触发了GC,因为STW的原因,clr需要使用SuspendRuntime把主线程导入到 WaitUntilGCComplete
进行等待,有了这些信息之后,接下来就是寻找为什么会触发GC。
2. 为什么会触发 GC
要找到GC触发原因,首先要找哪一个线程触发了GC,这个可以用 !t
看下托管线程列表中的 GC 字样,输出如下:
0:000> !t
ThreadCount: 48
UnstartedThread: 0
BackgroundThread: 35
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
48 42 ee8 0ee39f60 1029220 Cooperative 00000000:00000000 0076c700 3 MTA (GC) (Threadpool Worker)
...
0:048> k 10
# ChildEBP RetAddr
00 4775c9e8 6f57d24e clr!WKS::gc_heap::mark_object_simple1+0x8a
01 4775ca14 6f57bf72 clr!WKS::gc_heap::mark_object_simple+0x22b
02 4775ca34 6f5774b2 clr!WKS::GCHeap::Promote+0xaa
03 4775ca4c 6f57809c clr!GcEnumObject+0x37
04 4775cdbc 6f5777cb clr!EECodeManager::EnumGcRefs+0x854
05 4775ce10 6f5723b9 clr!GcStackCrawlCallBack+0x167
06 4775d0dc 6f5724bf clr!Thread::StackWalkFramesEx+0x92
07 4775d410 6f57743b clr!Thread::StackWalkFrames+0x9d
08 4775d448 6f57ba0e clr!GCToEEInterface::GcScanRoots+0x108
09 4775d4a8 6f5792db clr!WKS::gc_heap::mark_phase+0x18a
0a 4775d4d0 6f57966f clr!WKS::gc_heap::gc1+0xda
0b 4775d508 6f57978c clr!WKS::gc_heap::garbage_collect+0x447
0c 4775d530 6f70b767 clr!WKS::GCHeap::GarbageCollectGeneration+0x1f6
0d 4775d590 6f70b7a3 clr!WKS::gc_heap::trigger_ephemeral_gc+0x1e
0e 4775d590 6f575f6f clr!WKS::gc_heap::allocate_small+0x270
0f 4775d5bc 6f575fca clr!WKS::gc_heap::try_allocate_more_space+0x17c
...
从卦中可以看到当前 48 号线程触发了GC,并且是处于三阶段中的标记阶段,接下来需要观察下到底触发的是哪一代GC,可以用 dt
观察下 setting 全局变量即可。
0:048> x clr!*settings*
...
6fbd4bd8 clr!WKS::gc_heap::settings = <no type information>
6fbd7538 clr!SVR::gc_heap::settings = <no type information>
...
0:048> dp clr!WKS::gc_heap::settings
6fbd4bd8 00002ce4 00000002 00000001 00000001
6fbd4be8 00000000 00000000 00000000 00000000
6fbd4bf8 00000001 00000000 00000000 00000000
6fbd4c08 00000000 00000000 00000005 00000001
6fbd4c18 00000000 00000000 00000000 00000001
...
从卦中的 +0x4
偏移可以看到当前触发的是 FullGC,从 +0x38
可以看到GC的触发原因是 reason_oos_soh = 5
,有经验的朋友看到这里应该就知道是什么原因了。
3. 为什么会触发FullGC
相信大家都知道FullGC 有一个 STW 的概念,既然有STW自然就会让程序卡死,回过头来说一下经验在哪里,对,就是这个指针的长度,很显然这个程序是 32bit 的,所以很大概率程序是 32bit 部署,会受到 2G 虚拟地址的限制,因为可用内存不足导致高频的触发 FullGC,可以用 !address -summary
去验证一下。
0:048> !address -summary
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
<unknown> 955 6029f000 ( 1.503 GB) 81.21% 75.13%
Image 1251 10105000 ( 257.020 MB) 13.57% 12.55%
Free 326 995d000 ( 153.363 MB) 7.49%
Stack 165 34c0000 ( 52.750 MB) 2.78% 2.58%
Heap 137 2db0000 ( 45.688 MB) 2.41% 2.23%
Other 12 47000 ( 284.000 kB) 0.01% 0.01%
TEB 55 37000 ( 220.000 kB) 0.01% 0.01%
PEB 1 1000 ( 4.000 kB) 0.00% 0.00%
...
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_COMMIT 2152 66fc7000 ( 1.609 GB) 86.97% 80.46%
MEM_RESERVE 424 f6cc000 ( 246.797 MB) 13.03% 12.05%
MEM_FREE 326 995d000 ( 153.363 MB) 7.49%
...
从卦中可以看到,当前程序吃了1.6G的虚拟地址,占全量的 80% ,这样情况按理说程序会抛 OutOfMemoryException
异常,在 !t
中也得到了验证。
0:048> !t
ThreadCount: 48
UnstartedThread: 0
BackgroundThread: 35
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
...
33 39 10c8 0ee3dec0 1029220 Preemptive 00000000:00000000 0076c700 0 MTA (Threadpool Worker) System.OutOfMemoryException 32614444 (nested exceptions)
...
46 44 89c 0ee3c458 1029220 Preemptive 00000000:00000000 0076c700 1 MTA (Threadpool Worker) System.OutOfMemoryException 32605d34 (nested exceptions)
...
接下来的问题是谁吃掉了 1.6G 的内存,总有地方会吃,可以使用 !eeheap -gc
观察下托管堆。
0:048> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x32632d50
generation 1 starts at 0x3262534c
generation 2 starts at 0x03291000
ephemeral segment allocation context: (0x3265ffb0, 0x3265ffbc)
segment begin allocated size
03290000 03291000 0428fee4 0xffeee4(16772836)
06c60000 06c61000 07c5ffc4 0xffefc4(16773060)
...
7d210000 7d211000 7e20ffac 0xffefac(16773036)
31660000 31661000 3265ffb0 0xffefb0(16773040)
Large object heap starts at 0x04291000
segment begin allocated size
04290000 04291000 0450fa78 0x27ea78(2615928)
53390000 53391000 54391020 0x1000020(16777248)
Total Size: Size: 0x4622fd30 (1176698160) bytes.
------------------------------
GC Heap Size: Size: 0x4622fd30 (1176698160) bytes.
从卦中看应该就是托管堆吃掉了,接下来就是看下托管堆中哪一类对象最多,最终找到了一个大集合,命令输出如下:
0:048> !gcroot 4c0507c0
Thread 89c:
471bd450 07f76405 IBatisNet.DataMapper.MappedStatements.MappedStatement.RunQueryForList[[System.__Canon, mscorlib]](IBatisNet.DataMapper.Scope.RequestScope, IBatisNet.DataMapper.ISqlMapSession, System.Object, System.Collections.Generic.IList`1<System.__Canon>, IBatisNet.DataMapper.RowDelegate`1<System.__Canon>)
ebp+90: 471be6c4
-> 32c2ea50 System.Collections.Generic.List`1[[xxx.Model]]
-> 53391010 xxxRMT[]
-> 4c0507c0 xxxMT
0:048> !do 32c2ea50
Name: System.Collections.Generic.List`1[[xxx, xxx]]
MethodTable: 095f1b58
EEClass: 6e246b4c
Size: 24(0x18) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
6e67ca34 4001886 4 System.__Canon[] 0 instance 53391010 _items
6e66f2d8 4001887 c System.Int32 1 instance 3011824 _size
6e66f2d8 4001888 10 System.Int32 1 instance 3011824 _version
6e66d824 4001889 8 System.Object 0 instance 00000000 _syncRoot
6e67ca34 400188a 4 System.__Canon[] 0 static <no information>
从卦中看当前的List有length=3011824
,并且还被 89c
线程持有,最终通过代码找到了是某种查询下导致的大SQL引发。
三:总结
这次程序卡死还是挺有意思的,表象是主线程被GC卡住,实则是大SQL导致虚拟地址不足,分享出来让大家少踩坑吧!