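A crash analysis of a .NET optical amplifier test system

I. Background

1. The story

A WeChat contact reached out to me. His program, deployed on Windows, runs fine in Debug mode but crashes in Release mode, and everything works again if he switches the program to .NET 6. He was puzzled and asked me to take a look. For this kind of problem, the direct route is to capture a dump and analyze it.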

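II. Crash analysis

1. Why did it crash

Anyone who has analyzed crashes knows that, whether the crash is managed or unmanaged, the first step is the !analyze -v command. The simplified output is: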

0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

CONTEXT:  (.ecxr)
rax=0000000000000004 rbx=000001e34b283ec0 rcx=0000000000000228
rdx=0000000000000000 rsi=000001e34ac2f4e0 rdi=000001e34ab58e70
rip=00007ff95ac53659 rsp=0000007735d7e1c0 rbp=0000007735d7e1e0
 r8=0000000000000000  r9=000001e3464ba1c0 r10=0000000000000228
r11=0000000000000228 r12=0000000000000000 r13=000001e34880eae8
r14=000001e34ab58e70 r15=0000000000000008
iopl=0         nv up di pl nz na pe nc
cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000
System_Private_CoreLib!System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw+0x39:
00007ff9`5ac53659 cc              int     3
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ff95ac53659 (System_Private_CoreLib!System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw+0x0000000000000039)
   ExceptionCode: e0434f4d (CLR exception)
  ExceptionFlags: 00000000
NumberParameters: 0
...

The ExceptionCode: e0434f4d (CLR exception) in the output shows that this is a classic managed exception, which makes the problem simpler. Use !t to find out which managed thread threw it. The output is:

0:000> !t
ThreadCount:      15
UnstartedThread:  0
BackgroundThread: 11
PendingThread:    0
DeadThread:       3
Hosted Runtime:   no
                                                                                                            Lock  
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1     81d8 000001E3464BA1C0    a6028 Preemptive  000001E34ABA2340:000001E34ABA3D30 000001e347fc40b0 -00001 STA Prism.Ioc.ContainerResolutionException 000001e34a986608
   6    2     448c 000001E34803A440    2b228 Preemptive  000001E34A876980:000001E34A8784B0 000001e347fc40b0 -00001 MTA (Finalizer) 
   ...

The Prism.Ioc.ContainerResolutionException on thread 0 suggests that Prism is involved. Next, use the !pe command to inspect the exception and its call stack.

0:000> !pe
Exception object: 000001e34a986608
Exception type:   Prism.Ioc.ContainerResolutionException
Message:          An unexpected error occurred while resolving 'xxx.Views.LoginWindow'
InnerException:   Unity.ResolutionFailedException, Use !PrintException 000001E34A986228 to see more.
StackTrace (generated):
    SP               IP               Function
    0000007735D668E0 00007FF95A64DEC8 Prism_Unity_Wpf!Prism.Unity.UnityContainerExtension.Resolve(System.Type, System.ValueTuple`2<System.Type,System.Object>[])+0x2a8
    0000007735D7DC60 00007FF95A64DBFD Prism_Unity_Wpf!Prism.Unity.UnityContainerExtension.Resolve(System.Type)+0x3d
    0000007735D7DCA0 00007FF95A64DB88 Prism!Prism.Ioc.IContainerProviderExtensions.Resolve[[System.__Canon, System.Private.CoreLib]](Prism.Ioc.IContainerProvider)+0x48
    0000007735D7DCF0 00007FF95A956742 xxx!xxx.App.InitializeShell(System.Windows.Window)+0x42
    0000007735D7DD40 00007FF959B21148 Prism_Wpf!Prism.PrismApplicationBase.Initialize()+0x208
    0000007735D7DDA0 00007FF959B20F17 xxx!xxx.App.<>n__0()+0x17
    ....

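The InnerException line shows that this is not the original exception: there are nested exceptions underneath, and only the innermost one reveals the root cause. After drilling down layer by layer, I reached the original exception: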

0:000> !PrintException /d 000001E34A97E940
Exception object: 000001e34a97e940
Exception type:   System.PlatformNotSupportedException
Message:          System.IO.Ports is currently only supported on Windows.
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    0000007735D7B580 00007FF95A9588E7 System_IO_Ports!System.IO.Ports.SerialPort.GetPortNames()+0x47
    0000007735D7B5C0 00007FF95A958859 xxx!xxx.ViewModels.LoginWindowViewModel.RefreshComs()+0x19
    0000007735D7B600 00007FF95A957FBC xxx!xxx.ViewModels.LoginWindowViewModel..ctor()+0x14c
    0000007735D7B9D0 0000000000000000 System_Private_CoreLib!System.RuntimeMethodHandle.InvokeMethod(System.Object, Void**, System.Signature, Boolean)+0x46a770b0
    0000007735D7B9D0 00007FF9B8C03106 System_Private_CoreLib!System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(System.Object, System.Reflection.BindingFlags)+0x36

StackTraceString: <none>
HResult: 80131539

The output shows that GetPortNames() threw a PlatformNotSupportedException, which is puzzling, because the program is deployed on Windows.
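The layer-by-layer drill-down done in the debugger with !PrintException has a direct counterpart in code, which can help when logging a startup failure like this one. The following is a minimal sketch, not code from the crashed program; the `ExceptionChain` class and its `Innermost` method are hypothetical names introduced only for illustration.

```csharp
using System;

// Hypothetical helper (not part of the analyzed program).
public static class ExceptionChain
{
    // Follows the InnerException links until the innermost exception is reached,
    // mirroring the repeated !PrintException calls in the debugger session.
    public static Exception Innermost(Exception ex)
    {
        if (ex is null) throw new ArgumentNullException(nameof(ex));

        while (ex.InnerException is not null)
        {
            ex = ex.InnerException;
        }

        return ex;
    }
}
```

Called on the outer Prism.Ioc.ContainerResolutionException, a helper like this would be expected to return the System.PlatformNotSupportedException shown in the dump, since that one reports InnerException: <none>.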

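2. Why is the platform not supported

To understand the PlatformNotSupportedException, the only option is to look at the relevant source code. A dnSpy screenshot:

The decompiled code shows that this is an empty method. Next, I searched online for the exception message, and it looks like my friend needs to upgrade or downgrade the System.IO.Ports package version, as in the screenshot below:

Full link: https://learn.microsoft.com/en-us/answers/questions/1621393/system-io-ports-only-availble-on-windows-but-im-us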
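To make the "empty method" observation concrete, here is a minimal sketch of how a stub that only throws behaves, plus a probe that tells the stub apart from a working implementation. It assumes the loaded assembly was such a stub build; `StubSerialPort` and `PortProbe` are hypothetical names and are not the real System.IO.Ports source. Only the exception message is taken from the dump above.

```csharp
using System;

// Hypothetical stand-in for a stub build of a library.
// This is NOT the real System.IO.Ports source; it only mimics the behavior seen in the dump.
public static class StubSerialPort
{
    public static string[] GetPortNames() =>
        throw new PlatformNotSupportedException(
            "System.IO.Ports is currently only supported on Windows.");
}

// Hypothetical probe, introduced for illustration only.
public static class PortProbe
{
    // Calls the supplied delegate and reports whether a working implementation
    // answered (true) or a throwing stub was hit (false).
    public static bool TryGetPortNames(Func<string[]> getPortNames, out string[] ports)
    {
        if (getPortNames is null) throw new ArgumentNullException(nameof(getPortNames));

        try
        {
            ports = getPortNames();
            return true;
        }
        catch (PlatformNotSupportedException)
        {
            // The stub was loaded; report an empty list instead of crashing the caller.
            ports = Array.Empty<string>();
            return false;
        }
    }
}
```

In a real project the probe could be called as `PortProbe.TryGetPortNames(System.IO.Ports.SerialPort.GetPortNames, out var ports)`. A false result at startup would point to the wrong build of the package being loaded, without the exception escaping from the view model constructor and failing the container resolution.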

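I was quite excited at first, expecting something like multiple threads touching a non-volatile variable and causing Debug and Release to behave differently. It turned out to be just this. Oh well!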

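III. Summary

This failure was relatively simple; for veterans like us it is as easy as 1+1. But we all started out as beginners, so this case makes good practice material for newcomers.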