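Table of Contents
- [1. Preface](#1. Preface)
- [2. The Crash Scenes](#2. The Crash Scenes)
- [3. Analysis](#3. Analysis)
- [4. References](#4. References)
1. Preface
The author's knowledge is limited and this article may contain mistakes. The author makes no commitment regarding any loss readers may incur as a result.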
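2. The Crash Scenes
The problem showed up on an ARM64 embedded device. It is intermittent and does not reproduce on every run. After extensive testing, several crashes were captured; the logs of two of them are shown here.
- Crash scene 1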
[ 25.932551][ T1149] Unable to handle kernel paging request at virtual address fff0f080f8200050
[......]
[ 25.942352][ T1149] Mem abort info:
[ 25.956422][ T1149] ESR = 0x96000004
[......]
[ 25.967823][ T553] Unable to handle kernel paging request at virtual address fffff080f8872000
[ 25.967826][ T553] Mem abort info:
[ 25.967828][ T553] ESR = 0x96000004
[ 25.967831][ T553] EC = 0x25: DABT (current EL), IL = 32 bits
[ 25.967833][ T553] SET = 0, FnV = 0
[ 25.967835][ T553] EA = 0, S1PTW = 0
[ 25.967837][ T553] Data abort info:
[ 25.967838][ T553] ISV = 0, ISS = 0x00000004
[ 25.967840][ T553] CM = 0, WnR = 0
[ 25.967843][ T553] [fffff080f8872000] address between user and kernel address ranges
[ 25.967847][ T553] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 25.967851][ T553] Modules linked in: ohci_sunxi(+) ehci_sunxi sunxi_hci aic8800_fdrv aic8800_btlpm aic8800_bsp mali_kbase(O) sunxi_rfkill
[ 25.967871][ T553] CPU: 1 PID: 553 Comm: Binder:463_4 Tainted: G O 5.4.125+ #52
[ 25.967874][ T553] Hardware name: sun50iw9 (DT)
[ 25.967878][ T553] pstate: 20400005 (nzCv daif +PAN -UAO)
[ 25.967893][ T553] pc : constraint_expr_eval+0x90/0x480
[ 25.967897][ T553] lr : constraint_expr_eval+0x33c/0x480
[ 25.967899][ T553] sp : ffffffc0139cb5e0
[ 25.967901][ T553] x29: ffffffc0139cb630 x28: 0000000000000000
[ 25.967905][ T553] x27: ffffff80ece11b78 x26: ffffffc0139cb614
[ 25.967908][ T553] x25: ffffff80f77e4758 x24: ffffff80ece11b78
[ 25.967912][ T553] x23: ffffffc0112c82e8 x22: ffffff80ece11b68
[ 25.967916][ T553] x21: ffffff80f77e4748 x20: 0000000000000000
[ 25.967919][ T553] x19: fffff080f8872000 x18: ffffffc0136d5078
[ 25.967923][ T553] x17: 000000004a25b790 x16: 00000000c2b2ae35
[ 25.967927][ T553] x15: 00000000e3cb8d62 x14: 0000000085ebca6b
[ 25.967930][ T553] x13: ffffff80f9700000 x12: 000000008f3f9fb3
[ 25.967934][ T553] x11: 0000000000000001 x10: 0000000000000000
[ 25.967937][ T553] x9 : 0000000000000000 x8 : 0000000000000001
[ 25.967941][ T553] x7 : 6c65732f636f7270 x6 : ffffffc0139cba70
[ 25.967944][ T553] x5 : ffffffc0139cb9e0 x4 : ffffff80f8872b00
[ 25.967948][ T553] x3 : 0000000000000000 x2 : 0000000000000000
[ 25.967951][ T553] x1 : ffffff80f77e4760 x0 : 0000000000000001
[ 25.967956][ T553] Call trace:
[ 25.967960][ T553] constraint_expr_eval+0x90/0x480
[ 25.967963][ T553] context_struct_compute_av+0x39c/0x65c
[ 25.967967][ T553] security_compute_av+0x108/0x378
[ 25.967973][ T553] avc_compute_av+0x60/0x294
[ 25.967976][ T553] avc_has_perm_noaudit+0xf0/0x128
[ 25.967983][ T553] selinux_inode_permission+0x144/0x1f4
[ 25.967990][ T553] security_inode_permission+0x60/0xb0
[ 25.967996][ T553] inode_permission+0x4c/0x198
[ 25.968001][ T553] link_path_walk+0x10c/0x634
[ 25.968004][ T553] path_openat+0x80/0xf78
[ 25.968007][ T553] do_filp_open+0x78/0x124
[ 25.968011][ T553] do_sys_open+0x13c/0x290
[ 25.968015][ T553] __arm64_compat_sys_openat+0x24/0x30
[ 25.968024][ T553] el0_svc_common+0xb8/0x1cc
[ 25.968027][ T553] el0_svc_compat_handler+0x1c/0x28
[ 25.968032][ T553] el0_svc_compat+0x8/0x24
[ 25.968041][ T553] Code: b82a6b48 2a0903fc f9401673 b4001bf3 (b9400268)
[ 25.968044][ T553] ---[ end trace 6ee3778270c97e6f ]---
[ 25.968048][ T553] Kernel panic - not syncing: Fatal exception
[ 25.968055][ T553] SMP: stopping secondary CPUs
[ 25.971695][ T553] Kernel Offset: disabled
[ 25.971707][ T553] CPU features: 0x00010002,20002004
[ 25.971709][ T553] Memory Limit: none
[ 26.341937][ T553] ---[ end Kernel panic - not syncing: Fatal exception ]---
[ 26.350001][ T553] crashdump enter
- Crash scene 2
[ 20.778014][ T1012] Unable to handle kernel paging request at virtual address fffff080f86e3140
[ 20.778174][ C0] Unable to handle kernel paging request at virtual address fffff0c011793050
[ 20.778698][ T751] Unable to handle kernel paging request at virtual address fffff080f87c7008
[ 20.778702][ T751] Mem abort info:
[ 20.778704][ T751] ESR = 0x96000004
[ 20.778707][ T751] EC = 0x25: DABT (current EL), IL = 32 bits
[ 20.778709][ T751] SET = 0, FnV = 0
[ 20.778711][ T751] EA = 0, S1PTW = 0
[ 20.778712][ T751] Data abort info:
[ 20.778714][ T751] ISV = 0, ISS = 0x00000004
[ 20.778716][ T751] CM = 0, WnR = 0
[ 20.778720][ T751] [fffff080f87c7008] address between user and kernel address ranges
[ 20.778723][ T751] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 20.778727][ T751] Modules linked in: sunxi_usbc ohci_sunxi ehci_sunxi sunxi_hci aic8800_fdrv aic8800_btlpm aic8800_bsp mali_kbase(O) sunxi_rfkill
[ 20.778748][ T751] CPU: 3 PID: 751 Comm: Binder:461_6 Tainted: G O 5.4.125+ #49
[ 20.778751][ T751] Hardware name: sun50iw9 (DT)
[ 20.778755][ T751] pstate: 80400005 (Nzcv daif +PAN -UAO)
[ 20.778773][ T751] pc : avtab_search_node+0xe8/0x128
[ 20.778778][ T751] lr : context_struct_compute_av+0x1c0/0x65c
[ 20.778780][ T751] sp : ffffffc01542b820
[ 20.778782][ T751] x29: ffffffc01542ba50 x28: ffffff80f80f7a40
[ 20.778786][ T751] x27: 0000000000000000 x26: ffffffc011706ee0
[ 20.778790][ T751] x25: ffffffc011706ea0 x24: ffffffc01542bb70
[ 20.778793][ T751] x23: ffffffc01542bc00 x22: ffffff80f92b2718
[ 20.778797][ T751] x21: 000000000000009c x20: ffffff80f80e1440
[ 20.778801][ T751] x19: 0000000000000037 x18: ffffffc0136db068
[ 20.778804][ T751] x17: 000000004e048830 x16: 00000000c2b2ae35
[ 20.778808][ T751] x15: 00000000f1f165bc x14: 0000000085ebca6b
[ 20.778811][ T751] x13: ffffff80f9170000 x12: 00000000c6e6d541
[ 20.778815][ T751] x11: 0000000000000707 x10: 000000000000009c
[ 20.778818][ T751] x9 : 0000000000000038 x8 : 000000000000000d
[ 20.778822][ T751] x7 : 676e007764676f6c x6 : ffffffc01542bc00
[ 20.778826][ T751] x5 : ffffffc01542bb70 x4 : ffffffc01542bc00
[ 20.778829][ T751] x3 : ffffffc01542bb70 x2 : 0000000000000030
[ 20.778833][ T751] x1 : ffffffc01542b870 x0 : fffff080f87c7008
[ 20.778837][ T751] Call trace:
[ 20.778841][ T751] avtab_search_node+0xe8/0x128
[ 20.778844][ T751] security_compute_av+0x108/0x378
[ 20.778850][ T751] avc_compute_av+0x60/0x294
[ 20.778854][ T751] avc_has_perm_noaudit+0xf0/0x128
[ 20.778861][ T751] selinux_inode_permission+0x144/0x1f4
[ 20.778868][ T751] security_inode_permission+0x60/0xb0
[ 20.778874][ T751] inode_permission+0x4c/0x198
[ 20.778881][ T751] unix_find_other+0x68/0x24c
[ 20.778885][ T751] unix_dgram_connect+0x254/0x3cc
[ 20.778894][ T751] __sys_connect+0xd4/0x140
[ 20.778900][ T751] __arm64_sys_connect+0x20/0x30
[ 20.778908][ T751] el0_svc_common+0xb8/0x1cc
[ 20.778912][ T751] el0_svc_compat_handler+0x1c/0x28
[ 20.778917][ T751] el0_svc_compat+0x8/0x24
[ 20.778925][ T751] Code: 14000004 54000223 f9400800 b4000200 (7940000c)
[ 20.778929][ T751] ---[ end trace 11e0954576e51b12 ]---
[ 20.778933][ T751] Kernel panic - not syncing: Fatal exception
[ 20.778940][ T751] SMP: stopping secondary CPUs
[ 20.787718][ T1012] Mem abort info:
[ 20.797379][ C0] Mem abort info:
[ 20.807042][ T1012] ESR = 0x96000004
[ 20.810948][ C0] ESR = 0x96000004
[ 20.815146][ T1012] EC = 0x25: DABT (current EL), IL = 32 bits
[ 20.821875][ C0] EC = 0x25: DABT (current EL), IL = 32 bits
[ 20.826075][ T1012] SET = 0, FnV = 0
[ 20.830373][ C0] SET = 0, FnV = 0
[ 20.834376][ T1012] EA = 0, S1PTW = 0
[ 20.839449][ C0] EA = 0, S1PTW = 0
[ 20.843554][ T1012] Data abort info:
[ 20.852336][ C0] Data abort info:
[ 20.859364][ T1012] ISV = 0, ISS = 0x00000004
[ 20.874191][ C0] ISV = 0, ISS = 0x00000004
[ 20.883951][ T1012] CM = 0, WnR = 0
[ 20.889124][ C0] CM = 0, WnR = 0
[ 20.895278][ T1012] [fffff080f86e3140] address between user and kernel address ranges
[ 20.900945][ C0] [fffff0c011793050] address between user and kernel address ranges
[ 21.862283][ T751] SMP: failed to stop secondary CPUs 0,2-3
[ 21.868630][ T751] Kernel Offset: disabled
[ 21.873322][ T751] CPU features: 0x00010002,20002004
[ 21.878990][ T751] Memory Limit: none
[ 21.883198][ T751] ---[ end Kernel panic - not syncing: Fatal exception ]---
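3. Analysis
A problem on an embedded device can be approached from two sides: hardware and software.
The faulting addresses 0xfffff080f8872000, 0xfffff080f87c7008, 0xfffff080f86e3140 and 0xfffff0c011793050 lie in neither kernel space nor user space: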
[ 25.967843][ T553] [fffff080f8872000] address between user and kernel address ranges
......
[ 20.778720][ T751] [fffff080f87c7008] address between user and kernel address ranges
......
[ 20.895278][ T1012] [fffff080f86e3140] address between user and kernel address ranges
......
[ 20.900945][ C0] [fffff0c011793050] address between user and kernel address ranges
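The "between user and kernel address ranges" message can be reproduced with a small range check. The following is a minimal sketch, not the kernel's actual code. It assumes `VA_BITS = 39`, which is inferred from the `0xffffff80...` and `0xffffffc0...` pointers in the register dumps and is not stated anywhere in the logs. The names `classify`, `USER_END` and `KERNEL_START` are made up for illustration.

```python
# Sketch: classify a 64-bit virtual address the way an arm64 kernel roughly does.
# ASSUMPTION: VA_BITS = 39 (inferred from the 0xffffff80... / 0xffffffc0...
# pointers in the register dumps; the logs do not state it).
VA_BITS = 39

USER_END = 1 << VA_BITS                      # first address above the user range
KERNEL_START = (1 << 64) - (1 << VA_BITS)    # 0xffffff8000000000 when VA_BITS = 39


def classify(addr: int) -> str:
    """Return 'user', 'kernel' or 'between' for a 64-bit virtual address."""
    if addr < USER_END:
        return "user"
    if addr >= KERNEL_START:
        return "kernel"
    return "between"


if __name__ == "__main__":
    samples = (
        0xfffff080f8872000,  # faulting address, crash scene 1
        0xfffff080f87c7008,  # faulting address, crash scene 2
        0xfffff080f86e3140,  # faulting address, crash scene 2
        0xfffff0c011793050,  # faulting address, crash scene 2
        0xffffff80f8872b00,  # x4 in crash scene 1, a normal-looking pointer
    )
    for addr in samples:
        print(f"{addr:#018x}: {classify(addr)}")
```

Under that assumption the four faulting addresses are classified as `between`, while the `x4` value from crash scene 1 is classified as `kernel`.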
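The software side was investigated first. Based on the logs above, memory corruption (one piece of code overwriting memory that belongs to another) was the initial suspect. Source code analysis, KASAN and SLUB debug all turned up nothing.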
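One way to examine the bad pointers is to ask which bits would have to change to turn each one back into a valid kernel address. The sketch below does that. It assumes `VA_BITS = 39`, so that bits 63..39 of every kernel pointer must be 1, and it assumes each faulting address was meant to be a kernel pointer. Neither assumption is confirmed by the logs, and `cleared_upper_bits` and `bit_positions` are hypothetical helper names.

```python
# Sketch: find which upper bits of a faulting address are 0 but would be 1
# in a valid kernel pointer.
# ASSUMPTION: VA_BITS = 39, so bits 63..39 of any kernel address are all 1.
# ASSUMPTION: each faulting address was intended to be a kernel pointer.
VA_BITS = 39
UPPER_MASK = ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)   # bits 63..39 set


def cleared_upper_bits(addr: int) -> int:
    """Mask of the bits in 63..VA_BITS that are 0 in addr."""
    return UPPER_MASK & ~addr


def bit_positions(mask: int) -> list:
    """List the positions of the set bits in mask, lowest first."""
    return [i for i in range(64) if (mask >> i) & 1]


if __name__ == "__main__":
    faulting = (
        0xfffff080f8872000,
        0xfffff080f87c7008,
        0xfffff080f86e3140,
        0xfffff0c011793050,
        0xfff0f080f8200050,  # first oops line of crash scene 1
    )
    for addr in faulting:
        diff = cleared_upper_bits(addr)
        print(f"{addr:#018x}: cleared {diff:#018x} bits {bit_positions(diff)}")
```

For the four addresses listed in this section, the cleared bits are 43..40 in every case; the first address logged in crash scene 1 additionally has bits 51..48 cleared. This is only an observation about the logged values. It does not by itself distinguish a software overwrite from a hardware fault.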
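Because the symptom is random and the location of the crashing stack is random too, suspicion shifted to faulty DDR accesses at the hardware level. Is there any evidence for that? The EC field of the ESR register in the logs, 0x25, looks like a clue. In the Arm architecture manual, EC 0x25 is a Data Abort taken without a change in Exception level, which matches the kernel's own decode in the log (`DABT (current EL)`). On its own this only describes how the fault was reported, so it is a hint rather than proof.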
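The ESR value from the logs, `0x96000004`, can be split into its fields by hand. The following is a minimal sketch that uses the field layout from the Arm architecture manual for data aborts (EC in bits 31..26, IL in bit 25, ISS in bits 24..0, with WnR in ISS bit 6 and DFSC in ISS bits 5..0). The function name `decode_esr` and the two lookup tables are illustrative and cover only the values relevant here.

```python
# Sketch: decode the data-abort fields of an arm64 ESR value.
# The lookup tables are deliberately incomplete; they list only the
# encodings relevant to the logs in this article.
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

DFSC_NAMES = {
    0x04: "Translation fault, level 0",
    0x05: "Translation fault, level 1",
    0x06: "Translation fault, level 2",
    0x07: "Translation fault, level 3",
}


def decode_esr(esr: int) -> dict:
    """Split an ESR value into EC, IL, ISS and a few data-abort ISS fields."""
    iss = esr & 0x1FFFFFF            # bits 24..0
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class, bits 31..26
        "il": (esr >> 25) & 0x1,     # instruction length bit
        "iss": iss,
        "isv": (iss >> 24) & 0x1,    # instruction syndrome valid
        "wnr": (iss >> 6) & 0x1,     # 1 = write, 0 = read
        "dfsc": iss & 0x3F,          # data fault status code
    }


if __name__ == "__main__":
    fields = decode_esr(0x96000004)
    print(f"EC   = {fields['ec']:#x}: {EC_NAMES.get(fields['ec'], 'other')}")
    print(f"IL   = {fields['il']}")
    print(f"ISS  = {fields['iss']:#010x}")
    print(f"WnR  = {fields['wnr']}")
    print(f"DFSC = {fields['dfsc']:#x}: {DFSC_NAMES.get(fields['dfsc'], 'other')}")
```

For `0x96000004` this gives EC = 0x25, IL = 1, ISS = 0x4, WnR = 0 and DFSC = 0x04, which agrees with the kernel's own decode in the logs: a read access that took a level 0 translation fault.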


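In follow-up testing, the WiFi antenna that sat very close to the DDR was removed, and the problem has not reproduced since. The WiFi antenna may have been interfering with DDR reads and writes. That said, this is not necessarily the final conclusion.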
4. References
\[1\] [DRAM中的ECC类型、编码规则和纠错原理](https://zhuanlan.zhihu.com/p/780330153)
\[2\] [搞DDR必懂的关键技术笔记:DDR RAS---内存中的纠错码 (ECC)](https://aijishu.com/a/1060000000478785)
\[3\] [DDR 内存中的纠错码 (ECC)](https://www.synopsys.com/zh-cn/designware-ip/technical-bulletin/error-correction-code-ddr.html)
\[4\] [ARMv8架构的CPU在linux-5.4.18系统下对物理内存ECC错误的处理](https://blog.csdn.net/u010936265/article/details/134222732)