A .NET Crash Analysis: The Vision Program at a Smart Factory

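I: Background

1. The story

It has been a while since my last post. Besides looking after the baby at home, I have been grinding away at English listening and speaking. Updates have been sparse for a stretch, but this .NET advanced debugging journey is definitely not going to stop. Thanks for waiting, everyone. Let's get started.

A few days ago a friend came to me and said their vision program had crashed, and asked me to help figure out why. I don't analyze dumps for fellow developers as often as I used to, but when it's time to step in, I step in. I asked him to capture a dump so I could give it a reading.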

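II: Crash analysis

1. Why did it crash

There are many approaches to finding the cause of a crash. Besides the usual !analyze -v, another is simply to double-click the dump and open it in WinDbg, which lands directly on the faulting location. The output is as follows: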

...................................................
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr
(930.1684): CLR exception - code e0434352 (first/second chance not available)
CLR exception type: System.Reflection.TargetInvocationException
    "调用的目标发生了异常。"
For analysis of this file, run !analyze -v
KERNELBASE!RaiseException+0x69:
00007ff9`4a83cf19 0f1f440000      nop     dword ptr [rax+rax]
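The exception code e0434352 in the output above is worth a second look. Its low three bytes are the ASCII letters "CCR" and the top byte is 0xE0, which is how this code identifies an exception raised by the CLR. Below is a minimal sketch of that decoding; the helper name `clr_exception_marker` is made up for illustration.

```python
def clr_exception_marker(code: int) -> str:
    """Return the ASCII text carried in the low three bytes of an exception code."""
    low_three = code & 0x00FFFFFF
    return low_three.to_bytes(3, "big").decode("ascii")


code = 0xE0434352  # value printed by WinDbg in the output above
print(hex(code >> 24), clr_exception_marker(code))
```

Seeing this code in a dump only says that a managed exception was in flight; the actual exception type still has to come from commands such as !pe.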

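Judging from CLR exception - code e0434352 (first/second chance not available) in the output, this is a CLR exception, that is, a managed exception. Specifically it is a TargetInvocationException, whose message reads "Exception has been thrown by the target of an invocation." Next, use !t to look at the managed exception information on the threads.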

0:000> !t
ThreadCount:      36
UnstartedThread:  3
BackgroundThread: 31
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 1684 0000027727ebfc30    26020 Preemptive  000002772D3ACD50:000002772D3AEC80 0000027727e600b0 0     STA System.Reflection.TargetInvocationException 000002772d33e5d8
   5    2  978 0000027727eea240    2b220 Preemptive  000002772D334D10:000002772D336C80 0000027727e600b0 0     MTA (Finalizer) 
   9    3  acc 00000277420e2900  202b220 Preemptive  000002772D398278:000002772D398C80 0000027727e600b0 0     MTA 
  10    4  aac 000002774211a3e0  8029220 Preemptive  0000000000000000:0000000000000000 0000027727e600b0 0     MTA (Threadpool Completion Port) 

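The output confirms that this is a managed exception, sitting on thread 0, the STA thread. Next, use !pe -nested to look at the exception details.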

0:000> !pe -nested
Exception object: 000002772d33e5d8
Exception type:   System.Reflection.TargetInvocationException
Message:          调用的目标发生了异常。
InnerException:   System.AccessViolationException, Use !PrintException 000002772d33e438 to see more.
StackTrace (generated):
    SP               IP               Function
    000000D8F17FE160 0000000000000000 mscorlib_ni!System.RuntimeMethodHandle.InvokeMethod(System.Object, System.Object[], System.Signature, Boolean)+0xffff8006c9419390
    000000D8F17FE160 00007FF933CDFAAD mscorlib_ni!System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(System.Object, System.Object[], System.Object[])+0x10d
    000000D8F17FE1D0 00007FF933CFFA40 mscorlib_ni!System.Delegate.DynamicInvokeImpl(System.Object[])+0xa0
    000000D8F17FE220 00007FF91039945D System_Windows_Forms_ni!System.Windows.Forms.Control.InvokeMarshaledCallbackDo(ThreadMethodEntry)+0x9d
    000000D8F17FE260 00007FF910399379 System_Windows_Forms_ni!System.Windows.Forms.Control.InvokeMarshaledCallbackHelper(System.Object)+0x69
    000000D8F17FE2B0 00007FF933CD00B2 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x172
    000000D8F17FE380 00007FF933CCFF35 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x15
    000000D8F17FE3B0 00007FF933CCFF05 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x55
    000000D8F17FE400 00007FF9103992FC System_Windows_Forms_ni!System.Windows.Forms.Control.InvokeMarshaledCallback(ThreadMethodEntry)+0xbc
    000000D8F17FE450 00007FF910399066 System_Windows_Forms_ni!System.Windows.Forms.Control.InvokeMarshaledCallbacks()+0xe6
    000000D8F17FE4C0 00007FF910382D69 System_Windows_Forms_ni!System.Windows.Forms.Control.WndProc(System.Windows.Forms.Message ByRef)+0x509
    000000D8F17FE580 00007FF91038EEA7 System_Windows_Forms_ni!System.Windows.Forms.Form.WndProc(System.Windows.Forms.Message ByRef)+0x67
    000000D8F17FE5E0 00007FF910382092 System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.Callback(IntPtr, Int32, IntPtr, IntPtr)+0xc2
    000000D8F17FE990 0000000000000000 System_Windows_Forms_ni!System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG ByRef)+0x1
    000000D8F17FEA50 00007FF910398671 System_Windows_Forms_ni!System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)+0x341
    000000D8F17FEB40 00007FF910397FD7 System_Windows_Forms_ni!System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)+0x1c7
    000000D8F17FEBE0 00007FF910397DD2 System_Windows_Forms_ni!System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)+0x52
    000000D8F17FEC40 00007FF8D76B0D60 TWVision!TWVision.Program.Main()+0x260

StackTraceString: <none>
HResult: 80131604
0:000> !PrintException /d 000002772d33e438
Exception object: 000002772d33e438
Exception type:   System.AccessViolationException
Message:          尝试读取或写入受保护的内存。这通常指示其他内存已损坏。
InnerException:   <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 80004003
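The two HResult values in the output above can be split into their standard fields: a severity bit, an 11-bit facility, and a 16-bit code. The sketch below does that split. The `KNOWN` table holds only the two values seen in this dump, with the names 0x80131604 (COR_E_TARGETINVOCATION) and 0x80004003 (E_POINTER, the HRESULT that AccessViolationException uses); the helper names are illustrative.

```python
from typing import NamedTuple


class HResult(NamedTuple):
    failed: bool   # bit 31: severity
    facility: int  # bits 16..26
    code: int      # bits 0..15


def decode_hresult(value: int) -> HResult:
    """Split a 32-bit HRESULT into severity, facility and code."""
    return HResult(
        failed=bool((value >> 31) & 1),
        facility=(value >> 16) & 0x7FF,
        code=value & 0xFFFF,
    )


# Only the two values that appear in this dump.
KNOWN = {0x80131604: "COR_E_TARGETINVOCATION", 0x80004003: "E_POINTER"}

for raw in (0x80131604, 0x80004003):
    print(f"{raw:#010x}", decode_hresult(raw), KNOWN[raw])
```

Facility 0x13 marks the outer exception as coming from the .NET runtime, while the inner one carries a plain facility-0 pointer error, which fits an access violation in native code.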

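From the output, it looks like the exception was thrown while the UI thread was executing a callback; note the InvokeMarshaledCallbackDo frame above. The inner AccessViolationException ("Attempted to read or write protected memory. This is often an indication that other memory is corrupt.") carries no managed stack trace of its own.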
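To make the shape of that call stack easier to picture, here is a deliberately simplified model of a marshaled-callback queue. It is not the WinForms implementation: `UiThreadModel` and `TargetInvocationError` are hypothetical names, and only `ThreadMethodEntry` mirrors a type seen in the dump. The idea is that other threads enqueue an entry, the UI thread drains the queue while handling a window message, and a failure inside the callback comes back wrapped in an outer exception.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Optional


class TargetInvocationError(Exception):
    """Stand-in for TargetInvocationException: wraps whatever the callback raised."""


@dataclass
class ThreadMethodEntry:
    method: Callable[..., object]
    args: tuple = ()
    exception: Optional[BaseException] = None


class UiThreadModel:
    def __init__(self) -> None:
        self._pending: "queue.Queue[ThreadMethodEntry]" = queue.Queue()

    def begin_invoke(self, method: Callable[..., object], *args) -> ThreadMethodEntry:
        """Called from any thread: queue the callback for the UI thread."""
        entry = ThreadMethodEntry(method, args)
        self._pending.put(entry)
        return entry

    def invoke_marshaled_callbacks(self) -> None:
        """Called on the UI thread: run every queued callback in order."""
        while True:
            try:
                entry = self._pending.get_nowait()
            except queue.Empty:
                return
            try:
                entry.method(*entry.args)
            except Exception as inner:
                wrapped = TargetInvocationError("callback failed")
                wrapped.__cause__ = inner  # plays the role of InnerException
                entry.exception = wrapped


def show_msg() -> None:
    raise MemoryError("simulated access violation")


ui = UiThreadModel()
entry = ui.begin_invoke(show_msg)
ui.invoke_marshaled_callbacks()
print(type(entry.exception).__name__, "<-", type(entry.exception.__cause__).__name__)
```

In the real dump the wrapping comes from Delegate.DynamicInvokeImpl, visible in the stack trace, which is why the interesting exception is the inner one.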

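2. Which function was actually executing

Where do we go from here? We need to find out which function the UI thread was executing. This takes a little knowledge of the message loop: locate the ThreadMethodEntry queue item that the UI thread is currently processing. !dso is enough for that.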

0:000> !dso
OS Thread Id: 0x1684 (0)
RSP/REG          Object           Name
rax              000002772d33e788 System.String    调用的目标发生了异常。
000000D8F17FD828 000002772d33e5d8 System.Reflection.TargetInvocationException
...
000000D8F17FE058 000002772c6c56a8 System.Windows.Forms.Control+ThreadMethodEntry

0:000> !do 000002772c6c56a8
Name:        System.Windows.Forms.Control+ThreadMethodEntry
MethodTable: 00007ff9100df170
EEClass:     00007ff910191fd8
Size:        104(0x68) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Windows.Forms\v4.0_4.0.0.0__b77a5c561934e089\System.Windows.Forms.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9100d87e8  400385b        8 ...ows.Forms.Control  0 instance 0000027729b28e58 caller
00007ff9100d87e8  400385c       10 ...ows.Forms.Control  0 instance 0000027729b28e58 marshaler
00007ff9337580f0  400385d       18      System.Delegate  0 instance 000002772c6c5610 method
00007ff933755e70  400385e       20      System.Object[]  0 instance 0000000000000000 args
00007ff933755dd8  400385f       28        System.Object  0 instance 0000000000000000 retVal
00007ff933755b70  4003860       30     System.Exception  0 instance 0000000000000000 exception
00007ff93375b698  4003861       58       System.Boolean  1 instance                0 synchronous
00007ff93375b698  4003862       59       System.Boolean  1 instance                0 isCompleted
00007ff9337d4398  4003863       38 ....ManualResetEvent  0 instance 0000000000000000 resetEvent
00007ff933755dd8  4003864       40        System.Object  0 instance 000002772c6c5710 invokeSyncObject
00007ff9337d24b0  4003865       48 ....ExecutionContext  0 instance 000002772c6c5650 executionContext
00007ff9337d70e8  4003866       50 ...ronizationContext  0 instance 00000277298bd630 syncContext

In the data above, the method field, 000002772c6c5610, is the delegate we are looking for. Dumping it gives:

0:000> !DumpObj /d 000002772c6c5610
Name:        System.Action
MethodTable: 00007ff9337daff0
EEClass:     00007ff9338e8f50
Size:        64(0x40) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff933755dd8  40002f3        8        System.Object  0 instance 000002772c6c55f0 _target
00007ff933755dd8  40002f4       10        System.Object  0 instance 0000000000000000 _methodBase
00007ff9337d31f8  40002f5       18        System.IntPtr  1 instance     7ff8d7c3dd90 _methodPtr
00007ff9337d31f8  40002f6       20        System.IntPtr  1 instance                0 _methodPtrAux
00007ff933755dd8  4000300       28        System.Object  0 instance 0000000000000000 _invocationList
00007ff9337d31f8  4000301       30        System.IntPtr  1 instance                0 _invocationCount
0:000> !U 7ff8d7c3dd90
Unmanaged code
00007ff8`d7c3dd90 e9fb2fa300      jmp     xxx!xxx.FuncLib+<>c__DisplayClass10_0.<ShowMsg>b__0() (00007ff8`d8670d90)
00007ff8`d7c3dd95 5f              pop     rdi
00007ff8`d7c3dd96 0100            add     dword ptr [rax],eax
00007ff8`d7c3dd98 a075d5d7f87f0000e8 mov   al,byte ptr [E800007FF8D7D575h]
00007ff8`d7c3dda1 eb66            jmp     xxx.ConfigSystem.get_MESLines()+0x1 (00007ff8`d7c3de09)
00007ff8`d7c3dda3 f65e5e          neg     byte ptr [rsi+5Eh]
00007ff8`d7c3dda6 0006            add     byte ptr [rsi],al
00007ff8`d7c3dda8 e8e366f65e      call    clr!PrecodeFixupThunk (00007ff9`36ba4490)
00007ff8`d7c3ddad 5e              pop     rsi
00007ff8`d7c3ddae 0105e8db66f6    add     dword ptr [00007ff8`ce2ab99c],eax
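In the !U output above, only the first instruction matters. The delegate's _methodPtr appears to point at a small stub rather than at the compiled method body, which would explain the "Unmanaged code" banner and the nonsense disassembly after the jump. The stub is a 5-byte `e9` relative jump, and its target is the address of the next instruction plus the signed 32-bit displacement. A small sketch of that arithmetic, with an illustrative helper name:

```python
def jmp_rel32_target(instr_addr: int, encoding: bytes) -> int:
    """Compute the destination of an x86-64 `E9 rel32` near jump."""
    if len(encoding) != 5 or encoding[0] != 0xE9:
        raise ValueError("expected a 5-byte E9 rel32 jump")
    rel32 = int.from_bytes(encoding[1:], "little", signed=True)
    # Displacement is relative to the end of the 5-byte instruction.
    return (instr_addr + 5 + rel32) & 0xFFFFFFFFFFFFFFFF


# Address and bytes copied from the first line of the !U output.
target = jmp_rel32_target(0x00007FF8D7C3DD90, bytes.fromhex("e9fb2fa300"))
print(f"{target:#018x}")
```

The result matches the address WinDbg prints in parentheses, 00007ff8`d8670d90, which it resolves to the <ShowMsg>b__0 closure method.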

xxx!xxx.FuncLib+<>c__DisplayClass10_0.<ShowMsg>b__0() in the output is the method that was executing. Next, bring out ILSpy to look at the code of this method. I found it without trouble; screenshot below:

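3. Where is the crash point

We have found the crashing method, but not yet the crash point inside it; after all, !pe gave us nothing. Still, we are getting closer to the truth. If you know the internals well, you will know that Exception has an _ip field, which records the instruction pointer (IP) at the time of the crash. Screenshot below:

Finally, use uf 7ff92854124a to look at the assembly around the crash. The output is as follows: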

0:000> uf 7ff92854124a
msftedit!CRchTxtPtr::ReplaceRange+0x4fc:
00007ff9`2854122c 488b03          mov     rax,qword ptr [rbx]
00007ff9`2854122f 488d4de0        lea     rcx,[rbp-20h]
00007ff9`28541233 48894c2430      mov     qword ptr [rsp+30h],rcx
00007ff9`28541238 448bce          mov     r9d,esi
00007ff9`2854123b 897c2428        mov     dword ptr [rsp+28h],edi
00007ff9`2854123f 458bc6          mov     r8d,r14d
00007ff9`28541242 418bd4          mov     edx,r12d
00007ff9`28541245 44897c2420      mov     dword ptr [rsp+20h],r15d
00007ff9`2854124a 488b00          mov     rax,qword ptr [rax]
00007ff9`2854124d 488bcb          mov     rcx,rbx
00007ff9`28541250 ff15ca012500    call    qword ptr [msftedit!_guard_dispatch_icall_fptr (00007ff9`28791420)]
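The uf listing above labels the block as ReplaceRange+0x4fc, but the faulting instruction sits a few lines further down. The sketch below works out the exact symbol-relative offset of the faulting IP, assuming the label offset is measured from the start of msftedit!CRchTxtPtr::ReplaceRange. All three inputs are copied from the listing and the uf command.

```python
block_start = 0x00007FF92854122C  # address of the first line uf printed
block_offset = 0x4FC              # from the label ReplaceRange+0x4fc
fault_ip = 0x00007FF92854124A     # value passed to uf, taken from the exception's _ip

function_base = block_start - block_offset
fault_offset = fault_ip - function_base

print(f"function base: {function_base:#x}")
print(f"faulting ip:   msftedit!CRchTxtPtr::ReplaceRange+{fault_offset:#x}")
```

The instruction at that address, mov rax,qword ptr [rax], dereferences the value loaded from [rbx] four instructions earlier and is followed by an indirect call with rcx set to rbx. That pattern looks like a virtual call through a bad pointer, though this is an inference from the instruction sequence, not something verified against the registers.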

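From the output, the fault is most likely caused by rax being 0 in mov rax,qword ptr [rax]. I won't go to the stack to recover the register value here. The important piece of information is the function msftedit!CRchTxtPtr::ReplaceRange, which tells us the crash happened inside RichTextBox. Going back to ShowMsg() to look for operations that assign to a RichTextBox, there are indeed quite a few; screenshot below:

One question remains: what content was being written at that moment? If you want to know, that is easy too: just dig into the 000002772c6c55f0 _target from above. Screenshot below:

Now for the conclusion. In my analysis work I have seen several crashes caused by RichTextBox. I don't know whether the control is unstable or has a bug, so my personal advice was to work by elimination: stop using it and observe the result. The result, naturally, was satisfactory.

III: Summary

I have seen crashes caused by RichTextBox, and I have also seen RichTextBox cause unmanaged memory to balloon. Watch out when you use this control in the future.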