记一次 .NET 某注塑模具系统 CPU爆高分析

一:背景

1. 讲故事

前些天有位朋友在微信上找到我,说他们的系统出现了CPU爆高,找不到原因,让我帮忙看一下,dump也拿出来了,接下来上windbg分析。

二:CPU爆高分析

1. 真的爆高吗

dump的分析第一原则就是相信数据,先使用 !tp 观察cpu使用率。

C# 复制代码
0:031> !tp
Using the Portable thread pool.

CPU utilization:  100%
Workers Total:    5
Workers Running:  4
Workers Idle:     1
Worker Min Limit: 2
Worker Max Limit: 32767

从卦中可以看到 CPU utilization=100%,说明确实被打满了,接下来我们看下 cpu 的 strength,使用 !cpuid 即可,参考如下:

C# 复制代码
0:031> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,85,7  <unavailable>   2993
 1  6,85,7  <unavailable>   2993

这卦不看不知道,一看吓一跳,堂堂server端的程序居然只有2core, 这程序稍微痉挛一下必定cpu爆高,这上哪说理去,让马儿跑又不让马儿吃草。。。

接下来我们探究下到底是什么引发的cpu爆高,按照行业惯例,首先怀疑是不是gc导致的,最steady的方式就是使用命令 ~* k 观察各个线程栈。

C# 复制代码
# 31  Id: 19e4.23e4 Suspend: 0 Teb: 00000063`6f12e000 Unfrozen ".NET TP Worker"
 # Child-SP          RetAddr               Call Site
00 00000063`041bef18 00007ffe`03a5d75e     ntdll!NtWaitForSingleObject+0x14
01 00000063`041bef20 00007ffd`b87330fd     KERNELBASE!WaitForSingleObjectEx+0x8e
02 (Inline Function) --------`--------     coreclr!GCEvent::Impl::Wait+0x13 [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1372] 
03 (Inline Function) --------`--------     coreclr!GCEvent::Wait+0x13 [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1422] 
04 00000063`041befc0 00007ffd`b8734bb2     coreclr!SVR::gc_heap::wait_for_gc_done+0x5d [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 14979] 
05 00000063`041beff0 00007ffd`b873311d     coreclr!SVR::GCHeap::GarbageCollectGeneration+0xee [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 50760] 
06 00000063`041bf040 00007ffd`b882d6d8     coreclr!SVR::gc_heap::trigger_gc_for_alloc+0x15 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 19134] 
07 00000063`041bf070 00007ffd`b87e0b70     coreclr!SVR::gc_heap::try_allocate_more_space+0x4ca20 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 19295] 
08 00000063`041bf0e0 00007ffd`b87e0ae4     coreclr!SVR::gc_heap::allocate_more_space+0x68 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 19779] 
09 00000063`041bf140 00007ffd`b8724e65     coreclr!SVR::gc_heap::allocate_uoh_object+0x58 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 45154] 
0a 00000063`041bf1b0 00007ffd`b86eb239     coreclr!SVR::GCHeap::Alloc+0x1c5 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 49673] 
0b 00000063`041bf200 00007ffd`b86edd6a     coreclr!Alloc+0x99 [D:\a\_work\1\s\src\coreclr\vm\gchelpers.cpp @ 238] 
0c (Inline Function) --------`--------     coreclr!AllocateSzArray+0x29b [D:\a\_work\1\s\src\coreclr\vm\gchelpers.cpp @ 428] 
0d 00000063`041bf250 00007ffd`5be4394a     coreclr!JIT_NewArr1+0x3ba [D:\a\_work\1\s\src\coreclr\vm\jithelpers.cpp @ 2800] 
0e 00000063`041bf470 00007ffd`5bdbd7de     Microsoft_Data_SqlClient!Microsoft.Data.SqlClient.TdsParser.TryReadSqlStringValue(Microsoft.Data.SqlClient.SqlBuffer, Byte, Int32, System.Text.Encoding, Boolean, Microsoft.Data.SqlClient.TdsParserStateObject)+0x17aa9ca [/_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs @ 5703] 
...
17 00000063`041bfab0 00007ffd`b86ca02a     coreclr!CallDescrWorkerInternal+0x83 [D:\a\_work\1\s\src\coreclr\vm\amd64\CallDescrWorkerAMD64.asm @ 100] 
18 00000063`041bfaf0 00007ffd`b87eb116     coreclr!DispatchCallSimple+0x66 [D:\a\_work\1\s\src\coreclr\vm\callhelpers.cpp @ 221] 
19 00000063`041bfb80 00007ffd`b86e2602     coreclr!ThreadNative::KickOffThread_Worker+0x66 [D:\a\_work\1\s\src\coreclr\vm\comsynchronizable.cpp @ 158] 
...
21 00000063`041bfd90 00000000`00000000     ntdll!RtlUserThreadStart+0x2b

纵观卦中的各个调用栈,并没有看到触发的GC线程,只有一个 coreclr!SVR::gc_heap::wait_for_gc_done ,根据gc的知识,这是等待gc finish的操作,所以这个程序果然触发了GC。

2. 为什么会触发 GC

要想回答这个问题,需要知道GC触发的是哪一代,这是我们继续下去的重要突破口,接下来使用 x coreclr!*settings* 提取 settings 结构体,参考如下:

C# 复制代码
0:031> x coreclr!*settings*
00007ffd`b8b221b0 coreclr!SVR::gc_heap::settings = class SVR::gc_mechanisms

0:031> dx -r1 (*((coreclr!SVR::gc_mechanisms *)0x7ffdb8b221b0))
(*((coreclr!SVR::gc_mechanisms *)0x7ffdb8b221b0))                 [Type: SVR::gc_mechanisms]
    [+0x000] gc_index         : 0x2d3e [Type: unsigned __int64]
    [+0x008] condemned_generation : 2 [Type: int]
    [+0x00c] promotion        : 1 [Type: int]
    [+0x010] compaction       : 0 [Type: int]
    [+0x014] loh_compaction   : 0 [Type: int]
    [+0x018] heap_expansion   : 0 [Type: int]
    [+0x01c] concurrent       : 0x0 [Type: unsigned int]
    [+0x020] demotion         : 0 [Type: int]
    [+0x024] card_bundles     : 1 [Type: int]
    [+0x028] gen0_reduction_count : 0 [Type: int]
    [+0x02c] should_lock_elevation : 0 [Type: int]
    [+0x030] elevation_locked_count : 0 [Type: int]
    [+0x034] elevation_reduced : 0 [Type: int]
    [+0x038] minimal_gc       : 0 [Type: int]
    [+0x03c] reason           : reason_alloc_loh (4) [Type: gc_reason]
    [+0x040] pause_mode       : pause_interactive (1) [Type: SVR::gc_pause_mode]
    [+0x044] found_finalizers : 1 [Type: int]
    [+0x048] background_p     : 0 [Type: int]
    [+0x04c] b_state          : bgc_not_in_process (0) [Type: bgc_state]
    [+0x050] stress_induced   : 0 [Type: int]
    [+0x054] entry_memory_load : 0x24 [Type: unsigned int]
    [+0x058] entry_available_physical_mem : 0x3d3d73000 [Type: unsigned __int64]
    [+0x060] exit_memory_load : 0x24 [Type: unsigned int]

从卦中的 condemned_generation : 2reason : reason_alloc_loh 来看,一切都明白了,然来这个 gc正在触发 fullgc,并且是因为 loh 的代边界满了,这个信息的潜台词就是告诉我们,有人在大概率折腾 LOH,接下来用 !eeheap -gc 切到大对象堆中去看看。

C# 复制代码
0:031> !eeheap -gc

DATAS = 
========================================
Number of GC Heaps: 2
----------------------------------------
Large object heap
         segment            begin        allocated        committed allocated size       committed size      
    ...
    022f3f4fe4d0     01efdb000028     01efdf9ded80     01efdf9df000 0x49ded58 (77458776) 0x49df000 (77459456)
    022f3f500750     01efe7000028     01efea8c86b8     01efea8e9000 0x38c8690 (59541136) 0x38e9000 (59674624)
    022f3f501890     01efed000028     01eff22a05c0     01eff22a1000 0x52a0598 (86640024) 0x52a1000 (86642688)
    022f3f504c50     01efff000028     01f004e6c1b8     01f004e8d000 0x5e6c190 (99008912) 0x5e8d000 (99143680)
    022f3f508010     01f011000028     01f016620f68     01f016641000 0x5620f40 (90312512) 0x5641000 (90443776)
    022f3f50a290     01f01d000028     01f0219ded80     01f0219df000 0x49ded58 (77458776) 0x49df000 (77459456)
    022f3f50b3d0     01f023000028     01f023f1be28     01f023f3c000 0xf1be00 (15842816)  0xf3c000 (15974400) 
    022f3f513250     01f04f000028     01f0539ded80     01f0539df000 0x49ded58 (77458776) 0x49df000 (77459456)
    022f3f5154d0     01f05b000028     01f05b000028     01f05f9df000                      0x49df000 (77459456)
    022f3f518890     01f06d000028     01f0719ded80     01f0719df000 0x49ded58 (77458776) 0x49df000 (77459456)
    022f3f524650     01f0af000028     01f0b39ded80     01f0b39df000 0x49ded58 (77458776) 0x49df000 (77459456)
    ...
Pinned object heap
         segment            begin        allocated        committed allocated size       committed size      
    022f3f4df200     01ef2d800028     01ef2d816d08     01ef2d821000 0x16ce0 (93408)      0x21000 (135168)    
------------------------------
GC Allocated Heap Size:    Size: 0xce96da90 (3465992848) bytes.
GC Committed Heap Size:    Size: 0x119465000 (4719005696) bytes.

从卦中看,大对象堆确实块头不小,接下来抽几个段出来看看。

C# 复制代码
0:031> !dumpheap 01efdb000028     01efdf9ded80
        Address               MT           Size
   01efdb000028     01ef2af58580             32 Free
   01efdb000048     7ffd58de2a10     77,458,744 

Statistics:
         MT Count  TotalSize Class Name
01ef2af58580     1         32 Free
7ffd58de2a10     1 77,458,744 System.Char[]
Total 2 objects, 77,458,776 bytes


0:031> !dumpheap 01f04f000028     01f0539ded80
        Address               MT           Size
   01f04f000028     01ef2af58580             32 Free
   01f04f000048     7ffd58de2a10     77,458,744 

Statistics:
         MT Count  TotalSize Class Name
01ef2af58580     1         32 Free
7ffd58de2a10     1 77,458,744 System.Char[]
Total 2 objects, 77,458,776 bytes

从卦中看确实也验证了我们的思路,里面有不少 77M 的 char\[\],如果胆大心细,你也能大概率猜到这个等待gc完成的线程应该也是在分配大对象,使用 !dso 观察便知。

C# 复制代码
0:031> !dso
OS Thread Id: 0x23e4 (31)
          SP/REG           Object Name
    0063041bf198     01ef3607c610 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot
    0063041bf200     01ef8fd2d128 System.Threading.ExecutionContext
    0063041bf220     01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
    0063041bf228     01ef8fd2d150 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot+PacketData
    0063041bf268     01ef3607c610 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot
    0063041bf278     01f06d000048 System.Char[]
    0063041bf290     022fc03f2228 System.String
    0063041bf298     01ef8fd2d150 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot+PacketData
    0063041bf340     01ef8fd2cfe0 System.Threading.Tasks.TaskCompletionSource<System.Object>
    0063041bf3c0     01ef3607b8c8 Microsoft.Data.SqlClient._SqlMetaData
    0063041bf400     01ef2ff38e38 Microsoft.Data.SqlClient.SNIHandle
    0063041bf438     01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
    0063041bf450     01ef3607b8c8 Microsoft.Data.SqlClient._SqlMetaData
    0063041bf460     01ef3607c290 Microsoft.Data.SqlClient.SqlBuffer
    0063041bf470     01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
    0063041bf480     01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
    0063041bf4a8     01ef2ff33390 Microsoft.Data.SqlClient.TdsParserStateObjectNative
    0063041bf4b0     01ef8fd21370 Microsoft.Data.SqlClient.TdsParserStateObject+StateSnapshot+PacketData
    0063041bf528     01ef3607b8c8 Microsoft.Data.SqlClient._SqlMetaData

到这里基本上就知道了,是有人从数据库中捞取 77M 的大对象所致,接下来我们看下神秘的77M是什么,直接将内存导出到txt中,这里涉及到客户隐私,我们截张去敏的图,参考如下:

原来这个 77M 是一个售后单,将这个涉及到若干图片的html直接保存到数据库所致。

三:总结

这次生产事故是开发朋友将html存到数据库,当读取到内存的时候触发fullgc,外加弱鸡的cpu,多个因素叠加引发cpu被打爆,解决的办法也很简单,要么改逻辑,要么升级cpu,要么改gc模式。