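Linux memory case study: DDR access errors?

Contents

  • [1. Preface](#1. Preface)
  • [2. The crash scene](#2. The crash scene)
  • [3. Analysis](#3. Analysis)
  • [4. References](#4. References)

1. Preface

Given the limits of the author's ability, this article may contain errors; the author makes no commitment regarding any loss this may cause readers.

2. The crash scene

This problem occurred on an ARM64 embedded device. It is intermittent and does not reproduce on every run. After extensive testing we captured several crashes; the logs from two of them are shown here.

  • Crash scene 1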
[   25.932551][ T1149] Unable to handle kernel paging request at virtual address fff0f080f8200050
[......]
[   25.942352][ T1149] Mem abort info:
[   25.956422][ T1149]   ESR = 0x96000004
[......]
[   25.967823][  T553] Unable to handle kernel paging request at virtual address fffff080f8872000
[   25.967826][  T553] Mem abort info:
[   25.967828][  T553]   ESR = 0x96000004
[   25.967831][  T553]   EC = 0x25: DABT (current EL), IL = 32 bits
[   25.967833][  T553]   SET = 0, FnV = 0
[   25.967835][  T553]   EA = 0, S1PTW = 0
[   25.967837][  T553] Data abort info:
[   25.967838][  T553]   ISV = 0, ISS = 0x00000004
[   25.967840][  T553]   CM = 0, WnR = 0
[   25.967843][  T553] [fffff080f8872000] address between user and kernel address ranges
[   25.967847][  T553] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   25.967851][  T553] Modules linked in: ohci_sunxi(+) ehci_sunxi sunxi_hci aic8800_fdrv aic8800_btlpm aic8800_bsp mali_kbase(O) sunxi_rfkill
[   25.967871][  T553] CPU: 1 PID: 553 Comm: Binder:463_4 Tainted: G           O      5.4.125+ #52
[   25.967874][  T553] Hardware name: sun50iw9 (DT)
[   25.967878][  T553] pstate: 20400005 (nzCv daif +PAN -UAO)
[   25.967893][  T553] pc : constraint_expr_eval+0x90/0x480
[   25.967897][  T553] lr : constraint_expr_eval+0x33c/0x480
[   25.967899][  T553] sp : ffffffc0139cb5e0
[   25.967901][  T553] x29: ffffffc0139cb630 x28: 0000000000000000 
[   25.967905][  T553] x27: ffffff80ece11b78 x26: ffffffc0139cb614 
[   25.967908][  T553] x25: ffffff80f77e4758 x24: ffffff80ece11b78 
[   25.967912][  T553] x23: ffffffc0112c82e8 x22: ffffff80ece11b68 
[   25.967916][  T553] x21: ffffff80f77e4748 x20: 0000000000000000 
[   25.967919][  T553] x19: fffff080f8872000 x18: ffffffc0136d5078 
[   25.967923][  T553] x17: 000000004a25b790 x16: 00000000c2b2ae35 
[   25.967927][  T553] x15: 00000000e3cb8d62 x14: 0000000085ebca6b 
[   25.967930][  T553] x13: ffffff80f9700000 x12: 000000008f3f9fb3 
[   25.967934][  T553] x11: 0000000000000001 x10: 0000000000000000 
[   25.967937][  T553] x9 : 0000000000000000 x8 : 0000000000000001 
[   25.967941][  T553] x7 : 6c65732f636f7270 x6 : ffffffc0139cba70 
[   25.967944][  T553] x5 : ffffffc0139cb9e0 x4 : ffffff80f8872b00 
[   25.967948][  T553] x3 : 0000000000000000 x2 : 0000000000000000 
[   25.967951][  T553] x1 : ffffff80f77e4760 x0 : 0000000000000001 
[   25.967956][  T553] Call trace:
[   25.967960][  T553]  constraint_expr_eval+0x90/0x480
[   25.967963][  T553]  context_struct_compute_av+0x39c/0x65c
[   25.967967][  T553]  security_compute_av+0x108/0x378
[   25.967973][  T553]  avc_compute_av+0x60/0x294
[   25.967976][  T553]  avc_has_perm_noaudit+0xf0/0x128
[   25.967983][  T553]  selinux_inode_permission+0x144/0x1f4
[   25.967990][  T553]  security_inode_permission+0x60/0xb0
[   25.967996][  T553]  inode_permission+0x4c/0x198
[   25.968001][  T553]  link_path_walk+0x10c/0x634
[   25.968004][  T553]  path_openat+0x80/0xf78
[   25.968007][  T553]  do_filp_open+0x78/0x124
[   25.968011][  T553]  do_sys_open+0x13c/0x290
[   25.968015][  T553]  __arm64_compat_sys_openat+0x24/0x30
[   25.968024][  T553]  el0_svc_common+0xb8/0x1cc
[   25.968027][  T553]  el0_svc_compat_handler+0x1c/0x28
[   25.968032][  T553]  el0_svc_compat+0x8/0x24
[   25.968041][  T553] Code: b82a6b48 2a0903fc f9401673 b4001bf3 (b9400268) 
[   25.968044][  T553] ---[ end trace 6ee3778270c97e6f ]---
[   25.968048][  T553] Kernel panic - not syncing: Fatal exception
[   25.968055][  T553] SMP: stopping secondary CPUs
[   25.971695][  T553] Kernel Offset: disabled
[   25.971707][  T553] CPU features: 0x00010002,20002004
[   25.971709][  T553] Memory Limit: none
[   26.341937][  T553] ---[ end Kernel panic - not syncing: Fatal exception ]---
[   26.350001][  T553] crashdump enter
  • Crash scene 2
[   20.778014][ T1012] Unable to handle kernel paging request at virtual address fffff080f86e3140
[   20.778174][    C0] Unable to handle kernel paging request at virtual address fffff0c011793050
[   20.778698][  T751] Unable to handle kernel paging request at virtual address fffff080f87c7008
[   20.778702][  T751] Mem abort info:
[   20.778704][  T751]   ESR = 0x96000004
[   20.778707][  T751]   EC = 0x25: DABT (current EL), IL = 32 bits
[   20.778709][  T751]   SET = 0, FnV = 0
[   20.778711][  T751]   EA = 0, S1PTW = 0
[   20.778712][  T751] Data abort info:
[   20.778714][  T751]   ISV = 0, ISS = 0x00000004
[   20.778716][  T751]   CM = 0, WnR = 0
[   20.778720][  T751] [fffff080f87c7008] address between user and kernel address ranges
[   20.778723][  T751] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   20.778727][  T751] Modules linked in: sunxi_usbc ohci_sunxi ehci_sunxi sunxi_hci aic8800_fdrv aic8800_btlpm aic8800_bsp mali_kbase(O) sunxi_rfkill
[   20.778748][  T751] CPU: 3 PID: 751 Comm: Binder:461_6 Tainted: G           O      5.4.125+ #49
[   20.778751][  T751] Hardware name: sun50iw9 (DT)
[   20.778755][  T751] pstate: 80400005 (Nzcv daif +PAN -UAO)
[   20.778773][  T751] pc : avtab_search_node+0xe8/0x128
[   20.778778][  T751] lr : context_struct_compute_av+0x1c0/0x65c
[   20.778780][  T751] sp : ffffffc01542b820
[   20.778782][  T751] x29: ffffffc01542ba50 x28: ffffff80f80f7a40 
[   20.778786][  T751] x27: 0000000000000000 x26: ffffffc011706ee0 
[   20.778790][  T751] x25: ffffffc011706ea0 x24: ffffffc01542bb70 
[   20.778793][  T751] x23: ffffffc01542bc00 x22: ffffff80f92b2718 
[   20.778797][  T751] x21: 000000000000009c x20: ffffff80f80e1440 
[   20.778801][  T751] x19: 0000000000000037 x18: ffffffc0136db068 
[   20.778804][  T751] x17: 000000004e048830 x16: 00000000c2b2ae35 
[   20.778808][  T751] x15: 00000000f1f165bc x14: 0000000085ebca6b 
[   20.778811][  T751] x13: ffffff80f9170000 x12: 00000000c6e6d541 
[   20.778815][  T751] x11: 0000000000000707 x10: 000000000000009c 
[   20.778818][  T751] x9 : 0000000000000038 x8 : 000000000000000d 
[   20.778822][  T751] x7 : 676e007764676f6c x6 : ffffffc01542bc00 
[   20.778826][  T751] x5 : ffffffc01542bb70 x4 : ffffffc01542bc00 
[   20.778829][  T751] x3 : ffffffc01542bb70 x2 : 0000000000000030 
[   20.778833][  T751] x1 : ffffffc01542b870 x0 : fffff080f87c7008 
[   20.778837][  T751] Call trace:
[   20.778841][  T751]  avtab_search_node+0xe8/0x128
[   20.778844][  T751]  security_compute_av+0x108/0x378
[   20.778850][  T751]  avc_compute_av+0x60/0x294
[   20.778854][  T751]  avc_has_perm_noaudit+0xf0/0x128
[   20.778861][  T751]  selinux_inode_permission+0x144/0x1f4
[   20.778868][  T751]  security_inode_permission+0x60/0xb0
[   20.778874][  T751]  inode_permission+0x4c/0x198
[   20.778881][  T751]  unix_find_other+0x68/0x24c
[   20.778885][  T751]  unix_dgram_connect+0x254/0x3cc
[   20.778894][  T751]  __sys_connect+0xd4/0x140
[   20.778900][  T751]  __arm64_sys_connect+0x20/0x30
[   20.778908][  T751]  el0_svc_common+0xb8/0x1cc
[   20.778912][  T751]  el0_svc_compat_handler+0x1c/0x28
[   20.778917][  T751]  el0_svc_compat+0x8/0x24
[   20.778925][  T751] Code: 14000004 54000223 f9400800 b4000200 (7940000c) 
[   20.778929][  T751] ---[ end trace 11e0954576e51b12 ]---
[   20.778933][  T751] Kernel panic - not syncing: Fatal exception
[   20.778940][  T751] SMP: stopping secondary CPUs
[   20.787718][ T1012] Mem abort info:
[   20.797379][    C0] Mem abort info:
[   20.807042][ T1012]   ESR = 0x96000004
[   20.810948][    C0]   ESR = 0x96000004
[   20.815146][ T1012]   EC = 0x25: DABT (current EL), IL = 32 bits
[   20.821875][    C0]   EC = 0x25: DABT (current EL), IL = 32 bits
[   20.826075][ T1012]   SET = 0, FnV = 0
[   20.830373][    C0]   SET = 0, FnV = 0
[   20.834376][ T1012]   EA = 0, S1PTW = 0
[   20.839449][    C0]   EA = 0, S1PTW = 0
[   20.843554][ T1012] Data abort info:
[   20.852336][    C0] Data abort info:
[   20.859364][ T1012]   ISV = 0, ISS = 0x00000004
[   20.874191][    C0]   ISV = 0, ISS = 0x00000004
[   20.883951][ T1012]   CM = 0, WnR = 0
[   20.889124][    C0]   CM = 0, WnR = 0
[   20.895278][ T1012] [fffff080f86e3140] address between user and kernel address ranges
[   20.900945][    C0] [fffff0c011793050] address between user and kernel address ranges
[   21.862283][  T751] SMP: failed to stop secondary CPUs 0,2-3
[   21.868630][  T751] Kernel Offset: disabled
[   21.873322][  T751] CPU features: 0x00010002,20002004
[   21.878990][  T751] Memory Limit: none
[   21.883198][  T751] ---[ end Kernel panic - not syncing: Fatal exception ]---

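3. Analysis

Problems on embedded devices can be examined from both the hardware side and the software side.

The faulting addresses 0xfffff080f8872000, 0xfffff080f87c7008, 0xfffff080f86e3140, and 0xfffff0c011793050 lie in neither kernel space nor user space, as the kernel log itself reports: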

[   25.967843][  T553] [fffff080f8872000] address between user and kernel address ranges
......
[   20.778720][  T751] [fffff080f87c7008] address between user and kernel address ranges
......
[   20.895278][ T1012] [fffff080f86e3140] address between user and kernel address ranges
......
[   20.900945][    C0] [fffff0c011793050] address between user and kernel address ranges
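
The "between user and kernel" classification can be checked with a little arithmetic. The following is a minimal sketch, not the kernel's actual code: it assumes a 39-bit virtual address split (VA_BITS = 39, inferred from pointers such as ffffff80... and ffffffc0... in the register dumps), and the names `classify`, `cleared_high_bits`, and `FAULT_ADDRS` are hypothetical helpers introduced only for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

# Assumption: VA_BITS = 39, inferred from the valid kernel pointers in the logs.
VA_BITS = 39
USER_END = (1 << VA_BITS) - 1                    # 0x0000007fffffffff
KERNEL_START = ~((1 << VA_BITS) - 1) & MASK64    # 0xffffff8000000000

# The four faulting addresses listed in the analysis above.
FAULT_ADDRS = [
    0xfffff080f8872000,
    0xfffff080f87c7008,
    0xfffff080f86e3140,
    0xfffff0c011793050,
]


def classify(addr: int) -> str:
    """Return 'user', 'kernel', or 'between' for a 64-bit virtual address."""
    if addr <= USER_END:
        return "user"
    if addr >= KERNEL_START:
        return "kernel"
    return "between"


def cleared_high_bits(addr: int) -> int:
    """Mask of the upper bits that are 0 in addr but 1 in every kernel address."""
    return KERNEL_START & ~addr & MASK64


if __name__ == "__main__":
    for a in FAULT_ADDRS:
        print(f"{a:#018x}: {classify(a):8s} "
              f"cleared high bits = {cleared_high_bits(a):#018x}")
```

Under this assumption, all four addresses classify as "between", and `cleared_high_bits` returns the same mask, 0x00000f0000000000, for each of them. This is only an arithmetic observation about the logged values; it does not by itself establish a cause.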

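We first investigated the software side. Based on the logs above, we suspected memory corruption (a stray write into someone else's memory). Source code analysis, KASAN, and slub debug all turned up nothing.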

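Because the failure is intermittent and the location of the kernel crash stack is also random, we began to suspect a DDR hardware access error. Is there any evidence for that? The value 0x25 in the EC field of the ESR register in the logs looks like a clue. According to the Arm architecture reference manual, EC = 0x25 (0b100101) denotes a Data Abort taken without a change in Exception level. The manual lists MMU faults generated by data accesses, alignment faults, and synchronous External aborts, including synchronous parity or ECC errors, under this class, so the EC value alone does not tell these causes apart.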
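
To read the ESR value in the logs field by field, it can be split the same way the kernel prints it. This is a minimal sketch, assuming the ESR_ELx data abort layout from the Arm manual; `decode_esr` is a hypothetical helper name, not a kernel function.

```python
def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into the fields printed for a data abort."""
    iss = esr & 0x1FFFFFF                 # ISS: bits [24:0]
    return {
        "EC":    (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "IL":    (esr >> 25) & 0x1,       # instruction length flag, bit [25]
        "ISS":   iss,
        "ISV":   (iss >> 24) & 0x1,       # syndrome valid, bit [24]
        "SET":   (iss >> 11) & 0x3,       # synchronous error type, bits [12:11]
        "FnV":   (iss >> 10) & 0x1,       # FAR not valid, bit [10]
        "EA":    (iss >> 9) & 0x1,        # external abort type, bit [9]
        "CM":    (iss >> 8) & 0x1,        # cache maintenance, bit [8]
        "S1PTW": (iss >> 7) & 0x1,        # stage 2 fault on stage 1 walk, bit [7]
        "WnR":   (iss >> 6) & 0x1,        # write not read, bit [6]
        "DFSC":  iss & 0x3F,              # data fault status code, bits [5:0]
    }


if __name__ == "__main__":
    for name, value in decode_esr(0x96000004).items():
        print(f"{name:6s}= {value:#x}")
```

For 0x96000004 this gives EC = 0x25, IL = 1, EA = 0, WnR = 0, and DFSC = 0x04, matching the "Mem abort info" and "Data abort info" lines in the logs. In the Arm encoding, DFSC 0b000100 is a level 0 translation fault, so the abort itself is reported as a failed address translation on a read.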

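In later testing, we removed the WiFi antenna, which sat very close to the DDR, and the problem has not reproduced since. The WiFi antenna may have been interfering with DDR reads and writes. Of course, this is not necessarily the final conclusion.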

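4. References

[1] ECC types, encoding rules, and error-correction principles in DRAM

[2] Key DDR technology notes: DDR RAS, error-correcting code (ECC) in memory

[3] Error-correcting code (ECC) in DDR memory

[4] How ARMv8 CPUs handle physical memory ECC errors under linux-5.4.18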
