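I. Background
1. The story
A few days ago a friend reached out to me on WeChat. Their system was hitting CPU spikes, they could not find the cause, and they asked me to take a look. They had already captured a dump, so let's open it in WinDbg.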
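II. Analyzing the CPU spike
1. Is the CPU really maxed out?
The first rule of dump analysis is to trust the data, so start with !tp to check CPU utilization.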
0:031> !tp
Using the Portable thread pool.
CPU utilization: 100%
Workers Total: 5
Workers Running: 4
Workers Idle: 1
Worker Min Limit: 2
Worker Max Limit: 32767
The output shows CPU utilization = 100%, so the CPU really is saturated. Next, let's see how much CPU the machine actually has, using !cpuid:
0:031> !cpuid
CP F/M/S Manufacturer MHz
0 6,85,7 <unavailable> 2993
1 6,85,7 <unavailable> 2993
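You don't know until you look, and this one is a shock: a server-side program running on just 2 cores. The slightest hiccup in the program is bound to peg the CPU. How is that fair? It is asking the horse to run without letting it graze...
Next, let's find out what is actually driving the CPU up. By convention the first suspect is the GC, and the most reliable way to check is to dump every thread's stack with ~* k.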
# 31 Id: 19e4.23e4 Suspend: 0 Teb: 00000063`6f12e000 Unfrozen ".NET TP Worker"
# Child-SP RetAddr Call Site
00 00000063`041bef18 00007ffe`03a5d75e ntdll!NtWaitForSingleObject+0x14
01 00000063`041bef20 00007ffd`b87330fd KERNELBASE!WaitForSingleObjectEx+0x8e
02 (Inline Function) --------`-------- coreclr!GCEvent::Impl::Wait+0x13 [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1372]
03 (Inline Function) --------`-------- coreclr!GCEvent::Wait+0x13 [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1422]
04 00000063`041befc0 00007ffd`b8734bb2 coreclr!SVR::gc_heap::wait_for_gc_done+0x5d [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 14979]
05 00000063`041beff0 00007ffd`b873311d coreclr!SVR::GCHeap::GarbageCollectGeneration+0xee [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 50760]
06 00000063`041bf040 00007ffd`b882d6d8 coreclr!SVR::gc_heap::trigger_gc_for_alloc+0x15 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 19134]
07 00000063`041bf070 00007ffd`b87e0b70 coreclr!SVR::gc_heap::try_allocate_more_space+0x4ca20 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 19295]
08 00000063`041bf0e0 00007ffd`b87e0ae4 coreclr!SVR::gc_heap::allocate_more_space+0x68 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 19779]
09 00000063`041bf140 00007ffd`b8724e65 coreclr!SVR::gc_heap::allocate_uoh_object+0x58 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 45154]
0a 00000063`041bf1b0 00007ffd`b86eb239 coreclr!SVR::GCHeap::Alloc+0x1c5 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 49673]
0b 00000063`041bf200 00007ffd`b86edd6a coreclr!Alloc+0x99 [D:\a\_work\1\s\src\coreclr\vm\gchelpers.cpp @ 238]
0c (Inline Function) --------`-------- coreclr!AllocateSzArray+0x29b [D:\a\_work\1\s\src\coreclr\vm\gchelpers.cpp @ 428]
0d 00000063`041bf250 00007ffd`5be4394a coreclr!JIT_NewArr1+0x3ba [D:\a\_work\1\s\src\coreclr\vm\jithelpers.cpp @ 2800]
0e 00000063`041bf470 00007ffd`5bdbd7de Microsoft_Data_SqlClient!Microsoft.Data.SqlClient.TdsParser.TryReadSqlStringValue(Microsoft.Data.SqlClient.SqlBuffer, Byte, Int32, System.Text.Encoding, Boolean, Microsoft.Data.SqlClient.TdsParserStateObject)+0x17aa9ca [/_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs @ 5703]
...
17 00000063`041bfab0 00007ffd`b86ca02a coreclr!CallDescrWorkerInternal+0x83 [D:\a\_work\1\s\src\coreclr\vm\amd64\CallDescrWorkerAMD64.asm @ 100]
18 00000063`041bfaf0 00007ffd`b87eb116 coreclr!DispatchCallSimple+0x66 [D:\a\_work\1\s\src\coreclr\vm\callhelpers.cpp @ 221]
19 00000063`041bfb80 00007ffd`b86e2602 coreclr!ThreadNative::KickOffThread_Worker+0x66 [D:\a\_work\1\s\src\coreclr\vm\comsynchronizable.cpp @ 158]
...
21 00000063`041bfd90 00000000`00000000 ntdll!RtlUserThreadStart+0x2b
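Looking across the call stacks, none of them shows a GC thread actually doing the collection. There is only a thread sitting in coreclr!SVR::gc_heap::wait_for_gc_done, which from what we know about the GC means it is waiting for a GC to finish. So the program has indeed triggered a GC.
2. Why was a GC triggered?
To answer that we need to know which generation the GC is collecting; that is the key lead for going any further. Use x coreclr!*settings* to locate the settings structure, then dump it: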
0:031> x coreclr!*settings*
00007ffd`b8b221b0 coreclr!SVR::gc_heap::settings = class SVR::gc_mechanisms
0:031> dx -r1 (*((coreclr!SVR::gc_mechanisms *)0x7ffdb8b221b0))
(*((coreclr!SVR::gc_mechanisms *)0x7ffdb8b221b0)) [Type: SVR::gc_mechanisms]
[+0x000] gc_index : 0x2d3e [Type: unsigned __int64]
[+0x008] condemned_generation : 2 [Type: int]
[+0x00c] promotion : 1 [Type: int]
[+0x010] compaction : 0 [Type: int]
[+0x014] loh_compaction : 0 [Type: int]
[+0x018] heap_expansion : 0 [Type: int]
[+0x01c] concurrent : 0x0 [Type: unsigned int]
[+0x020] demotion : 0 [Type: int]
[+0x024] card_bundles : 1 [Type: int]
[+0x028] gen0_reduction_count : 0 [Type: int]
[+0x02c] should_lock_elevation : 0 [Type: int]
[+0x030] elevation_locked_count : 0 [Type: int]
[+0x034] elevation_reduced : 0 [Type: int]
[+0x038] minimal_gc : 0 [Type: int]
[+0x03c] reason : reason_alloc_loh (4) [Type: gc_reason]
[+0x040] pause_mode : pause_interactive (1) [Type: SVR::gc_pause_mode]
[+0x044] found_finalizers : 1 [Type: int]
[+0x048] background_p : 0 [Type: int]
[+0x04c] b_state : bgc_not_in_process (0) [Type: bgc_state]
[+0x050] stress_induced : 0 [Type: int]
[+0x054] entry_memory_load : 0x24 [Type: unsigned int]
[+0x058] entry_available_physical_mem : 0x3d3d73000 [Type: unsigned __int64]
[+0x060] exit_memory_load : 0x24 [Type: unsigned int]
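Several fields in this dump are printed in hex, which makes them easy to skim past. The following is a small sketch that converts three of them to decimal. Reading gc_index as a running count of GCs and entry_memory_load as a percentage is my interpretation of the field names, not something the output states.

```python
# Hex values copied from the dx output above. The meanings in the comments
# are assumptions based on the field names, not official definitions.
settings = {
    "gc_index": 0x2D3E,                           # assumed: GCs counted so far
    "entry_memory_load": 0x24,                    # assumed: percent of physical memory in use
    "entry_available_physical_mem": 0x3D3D73000,  # bytes
}

GIB = 1024 ** 3


def describe(s):
    """Return the three fields as decimal, human-readable values."""
    return {
        "gc_count": s["gc_index"],
        "memory_load_percent": s["entry_memory_load"],
        "available_gib": round(s["entry_available_physical_mem"] / GIB, 2),
    }


info = describe(settings)
print(info)
```

If that reading is right, the machine had plenty of free physical memory when this GC started, which fits a GC driven by LOH allocation rather than by memory pressure.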
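With condemned_generation : 2 and reason : reason_alloc_loh, everything falls into place: this is a full (gen2) GC, and it was triggered because the LOH allocation budget was used up. The subtext is that something is very likely hammering the LOH. Next, use !eeheap -gc to take a look at the large object heap.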
0:031> !eeheap -gc
DATAS =
========================================
Number of GC Heaps: 2
----------------------------------------
Large object heap
segment begin allocated committed allocated size committed size
...
022f3f4fe4d0 01efdb000028 01efdf9ded80 01efdf9df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f500750 01efe7000028 01efea8c86b8 01efea8e9000 0x38c8690 (59541136) 0x38e9000 (59674624)
022f3f501890 01efed000028 01eff22a05c0 01eff22a1000 0x52a0598 (86640024) 0x52a1000 (86642688)
022f3f504c50 01efff000028 01f004e6c1b8 01f004e8d000 0x5e6c190 (99008912) 0x5e8d000 (99143680)
022f3f508010 01f011000028 01f016620f68 01f016641000 0x5620f40 (90312512) 0x5641000 (90443776)
022f3f50a290 01f01d000028 01f0219ded80 01f0219df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f50b3d0 01f023000028 01f023f1be28 01f023f3c000 0xf1be00 (15842816) 0xf3c000 (15974400)
022f3f513250 01f04f000028 01f0539ded80 01f0539df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f5154d0 01f05b000028 01f05b000028 01f05f9df000 0x49df000 (77459456)
022f3f518890 01f06d000028 01f0719ded80 01f0719df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f524650 01f0af000028 01f0b39ded80 01f0b39df000 0x49ded58 (77458776) 0x49df000 (77459456)
...
Pinned object heap
segment begin allocated committed allocated size committed size
022f3f4df200 01ef2d800028 01ef2d816d08 01ef2d821000 0x16ce0 (93408) 0x21000 (135168)
------------------------------
GC Allocated Heap Size: Size: 0xce96da90 (3465992848) bytes.
GC Committed Heap Size: Size: 0x119465000 (4719005696) bytes.
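The LOH rows above can also be tallied mechanically instead of by eye. This is a rough sketch that uses only the rows visible in the output (the rows elided with "..." are not included) and derives each segment's allocated size as `allocated - begin`, so it also works for the row whose allocated-size column is empty. The names `ROWS` and `allocated_bytes` are made up for this illustration.

```python
from collections import Counter

# LOH segment rows exactly as shown in the !eeheap -gc output above.
ROWS = """\
022f3f4fe4d0 01efdb000028 01efdf9ded80 01efdf9df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f500750 01efe7000028 01efea8c86b8 01efea8e9000 0x38c8690 (59541136) 0x38e9000 (59674624)
022f3f501890 01efed000028 01eff22a05c0 01eff22a1000 0x52a0598 (86640024) 0x52a1000 (86642688)
022f3f504c50 01efff000028 01f004e6c1b8 01f004e8d000 0x5e6c190 (99008912) 0x5e8d000 (99143680)
022f3f508010 01f011000028 01f016620f68 01f016641000 0x5620f40 (90312512) 0x5641000 (90443776)
022f3f50a290 01f01d000028 01f0219ded80 01f0219df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f50b3d0 01f023000028 01f023f1be28 01f023f3c000 0xf1be00 (15842816) 0xf3c000 (15974400)
022f3f513250 01f04f000028 01f0539ded80 01f0539df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f5154d0 01f05b000028 01f05b000028 01f05f9df000 0x49df000 (77459456)
022f3f518890 01f06d000028 01f0719ded80 01f0719df000 0x49ded58 (77458776) 0x49df000 (77459456)
022f3f524650 01f0af000028 01f0b39ded80 01f0b39df000 0x49ded58 (77458776) 0x49df000 (77459456)
""".splitlines()


def allocated_bytes(row):
    """Allocated size of one segment: 'allocated' minus 'begin' (both hex)."""
    _segment, begin, allocated = row.split()[:3]
    return int(allocated, 16) - int(begin, 16)


sizes = [allocated_bytes(r) for r in ROWS]
total = sum(sizes)

# Ignore empty segments, then find the allocated size that repeats most often.
by_size = Counter(s for s in sizes if s > 0)
most_common_size, occurrences = by_size.most_common(1)[0]

print(f"{len(sizes)} segments shown, {total:,} bytes allocated")
print(f"{occurrences} segments hold exactly {most_common_size:,} bytes")
```

A byte-identical allocated size repeating across segments is a hint that the same payload is being loaded over and over, which is worth checking with !dumpheap.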
The large object heap is indeed hefty. Let's pull out a few segments and look inside.
0:031> !dumpheap 01efdb000028 01efdf9ded80
Address MT Size
01efdb000028 01ef2af58580 32 Free
01efdb000048 7ffd58de2a10 77,458,744
Statistics:
MT Count TotalSize Class Name
01ef2af58580 1 32 Free
7ffd58de2a10 1 77,458,744 System.Char[]
Total 2 objects, 77,458,776 bytes
0:031> !dumpheap 01f04f000028 01f0539ded80
Address MT Size
01f04f000028 01ef2af58580 32 Free
01f04f000048 7ffd58de2a10 77,458,744
Statistics:
MT Count TotalSize Class Name
01ef2af58580 1 32 Free
7ffd58de2a10 1 77,458,744 System.Char[]
Total 2 objects, 77,458,776 bytes
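To put the object size in perspective, here is a back-of-the-envelope sketch. The 85,000-byte figure is the default .NET large-object threshold. The 24-byte array overhead is an assumption about 64-bit array layout, and since object sizes are rounded up for alignment, the character count is only approximate.

```python
LOH_THRESHOLD = 85_000      # bytes; default large-object threshold in .NET
CHAR_ARRAY_BASE_SIZE = 24   # bytes; assumed overhead of an array on 64-bit
BYTES_PER_CHAR = 2          # System.Char is one UTF-16 code unit


def char_count(object_size):
    """Approximate element count of a char[] given its !dumpheap size."""
    return (object_size - CHAR_ARRAY_BASE_SIZE) // BYTES_PER_CHAR


def lands_on_loh(object_size):
    """True if an object of this size is allocated on the large object heap."""
    return object_size >= LOH_THRESHOLD


size = 77_458_744  # size of the System.Char[] in the !dumpheap output above
chars = char_count(size)
multiple = size // LOH_THRESHOLD

print(f"~{chars:,} chars, on LOH: {lands_on_loh(size)}, {multiple}x the threshold")
```

Each of these arrays holds tens of millions of characters and is several hundred times larger than the threshold, so every allocation of one counts heavily against the LOH budget.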
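This confirms the hunch: the LOH contains quite a few char[] instances of about 77 MB each. If you are bold but careful, you can also guess that the thread waiting for the GC to finish was itself in the middle of allocating a large object. !dso will tell.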
0:031> !dso
OS Thread Id: 0x23e4 (31)
SP/REG Object Name
0063041bf198 01ef3607c610 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot
0063041bf200 01ef8fd2d128 System.Threading.ExecutionContext
0063041bf220 01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
0063041bf228 01ef8fd2d150 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot+PacketData
0063041bf268 01ef3607c610 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot
0063041bf278 01f06d000048 System.Char[]
0063041bf290 022fc03f2228 System.String
0063041bf298 01ef8fd2d150 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot+PacketData
0063041bf340 01ef8fd2cfe0 System.Threading.Tasks.TaskCompletionSource<System.Object>
0063041bf3c0 01ef3607b8c8 Microsoft.Data.SqlClient._SqlMetaData
0063041bf400 01ef2ff38e38 Microsoft.Data.SqlClient.SNIHandle
0063041bf438 01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
0063041bf450 01ef3607b8c8 Microsoft.Data.SqlClient._SqlMetaData
0063041bf460 01ef3607c290 Microsoft.Data.SqlClient.SqlBuffer
0063041bf470 01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
0063041bf480 01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
0063041bf4a8 01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
0063041bf4b0 01ef8fd21370 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot+PacketData
0063041bf528 01ef3607b8c8 Microsoft.Data.SqlClient._SqlMetaData

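At this point the picture is basically clear: someone is pulling a 77 MB object out of the database. Next, let's see what this mysterious 77 MB actually is by dumping the memory to a txt file. Since it contains customer data, I can only show a redacted screenshot: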

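It turns out the 77 MB is an after-sales order: an HTML document containing a number of images, saved directly into the database.
III. Summary
This production incident came down to developers storing HTML in the database. Reading it back into memory triggered full GCs, and with an underpowered CPU on top of that, the factors combined to peg the CPU. The fix is simple: change the logic, upgrade the CPU, or change the GC mode.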
