Linux Memory Case Study: DDR Access Errors?

Contents

  • [1. Preface](#1. 前言)
  • [2. Crash Logs](#2. 事故现场)
  • [3. Analysis](#3. 分析)
  • [4. References](#4. 参考资料)

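1. Preface

Given the limits of the author's knowledge, this article may contain errors. The author accepts no responsibility for any loss readers incur as a result.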

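2. Crash Logs

The problem occurred on an ARM64 embedded device. It is intermittent and does not reproduce on every run. After extensive testing, several crashes were captured; the logs from two of them are shown here.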

  • Incident 1
[   25.932551][ T1149] Unable to handle kernel paging request at virtual address fff0f080f8200050
[......]
[   25.942352][ T1149] Mem abort info:
[   25.956422][ T1149]   ESR = 0x96000004
[......]
[   25.967823][  T553] Unable to handle kernel paging request at virtual address fffff080f8872000
[   25.967826][  T553] Mem abort info:
[   25.967828][  T553]   ESR = 0x96000004
[   25.967831][  T553]   EC = 0x25: DABT (current EL), IL = 32 bits
[   25.967833][  T553]   SET = 0, FnV = 0
[   25.967835][  T553]   EA = 0, S1PTW = 0
[   25.967837][  T553] Data abort info:
[   25.967838][  T553]   ISV = 0, ISS = 0x00000004
[   25.967840][  T553]   CM = 0, WnR = 0
[   25.967843][  T553] [fffff080f8872000] address between user and kernel address ranges
[   25.967847][  T553] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   25.967851][  T553] Modules linked in: ohci_sunxi(+) ehci_sunxi sunxi_hci aic8800_fdrv aic8800_btlpm aic8800_bsp mali_kbase(O) sunxi_rfkill
[   25.967871][  T553] CPU: 1 PID: 553 Comm: Binder:463_4 Tainted: G           O      5.4.125+ #52
[   25.967874][  T553] Hardware name: sun50iw9 (DT)
[   25.967878][  T553] pstate: 20400005 (nzCv daif +PAN -UAO)
[   25.967893][  T553] pc : constraint_expr_eval+0x90/0x480
[   25.967897][  T553] lr : constraint_expr_eval+0x33c/0x480
[   25.967899][  T553] sp : ffffffc0139cb5e0
[   25.967901][  T553] x29: ffffffc0139cb630 x28: 0000000000000000 
[   25.967905][  T553] x27: ffffff80ece11b78 x26: ffffffc0139cb614 
[   25.967908][  T553] x25: ffffff80f77e4758 x24: ffffff80ece11b78 
[   25.967912][  T553] x23: ffffffc0112c82e8 x22: ffffff80ece11b68 
[   25.967916][  T553] x21: ffffff80f77e4748 x20: 0000000000000000 
[   25.967919][  T553] x19: fffff080f8872000 x18: ffffffc0136d5078 
[   25.967923][  T553] x17: 000000004a25b790 x16: 00000000c2b2ae35 
[   25.967927][  T553] x15: 00000000e3cb8d62 x14: 0000000085ebca6b 
[   25.967930][  T553] x13: ffffff80f9700000 x12: 000000008f3f9fb3 
[   25.967934][  T553] x11: 0000000000000001 x10: 0000000000000000 
[   25.967937][  T553] x9 : 0000000000000000 x8 : 0000000000000001 
[   25.967941][  T553] x7 : 6c65732f636f7270 x6 : ffffffc0139cba70 
[   25.967944][  T553] x5 : ffffffc0139cb9e0 x4 : ffffff80f8872b00 
[   25.967948][  T553] x3 : 0000000000000000 x2 : 0000000000000000 
[   25.967951][  T553] x1 : ffffff80f77e4760 x0 : 0000000000000001 
[   25.967956][  T553] Call trace:
[   25.967960][  T553]  constraint_expr_eval+0x90/0x480
[   25.967963][  T553]  context_struct_compute_av+0x39c/0x65c
[   25.967967][  T553]  security_compute_av+0x108/0x378
[   25.967973][  T553]  avc_compute_av+0x60/0x294
[   25.967976][  T553]  avc_has_perm_noaudit+0xf0/0x128
[   25.967983][  T553]  selinux_inode_permission+0x144/0x1f4
[   25.967990][  T553]  security_inode_permission+0x60/0xb0
[   25.967996][  T553]  inode_permission+0x4c/0x198
[   25.968001][  T553]  link_path_walk+0x10c/0x634
[   25.968004][  T553]  path_openat+0x80/0xf78
[   25.968007][  T553]  do_filp_open+0x78/0x124
[   25.968011][  T553]  do_sys_open+0x13c/0x290
[   25.968015][  T553]  __arm64_compat_sys_openat+0x24/0x30
[   25.968024][  T553]  el0_svc_common+0xb8/0x1cc
[   25.968027][  T553]  el0_svc_compat_handler+0x1c/0x28
[   25.968032][  T553]  el0_svc_compat+0x8/0x24
[   25.968041][  T553] Code: b82a6b48 2a0903fc f9401673 b4001bf3 (b9400268) 
[   25.968044][  T553] ---[ end trace 6ee3778270c97e6f ]---
[   25.968048][  T553] Kernel panic - not syncing: Fatal exception
[   25.968055][  T553] SMP: stopping secondary CPUs
[   25.971695][  T553] Kernel Offset: disabled
[   25.971707][  T553] CPU features: 0x00010002,20002004
[   25.971709][  T553] Memory Limit: none
[   26.341937][  T553] ---[ end Kernel panic - not syncing: Fatal exception ]---
[   26.350001][  T553] crashdump enter
  • Incident 2
[   20.778014][ T1012] Unable to handle kernel paging request at virtual address fffff080f86e3140
[   20.778174][    C0] Unable to handle kernel paging request at virtual address fffff0c011793050
[   20.778698][  T751] Unable to handle kernel paging request at virtual address fffff080f87c7008
[   20.778702][  T751] Mem abort info:
[   20.778704][  T751]   ESR = 0x96000004
[   20.778707][  T751]   EC = 0x25: DABT (current EL), IL = 32 bits
[   20.778709][  T751]   SET = 0, FnV = 0
[   20.778711][  T751]   EA = 0, S1PTW = 0
[   20.778712][  T751] Data abort info:
[   20.778714][  T751]   ISV = 0, ISS = 0x00000004
[   20.778716][  T751]   CM = 0, WnR = 0
[   20.778720][  T751] [fffff080f87c7008] address between user and kernel address ranges
[   20.778723][  T751] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   20.778727][  T751] Modules linked in: sunxi_usbc ohci_sunxi ehci_sunxi sunxi_hci aic8800_fdrv aic8800_btlpm aic8800_bsp mali_kbase(O) sunxi_rfkill
[   20.778748][  T751] CPU: 3 PID: 751 Comm: Binder:461_6 Tainted: G           O      5.4.125+ #49
[   20.778751][  T751] Hardware name: sun50iw9 (DT)
[   20.778755][  T751] pstate: 80400005 (Nzcv daif +PAN -UAO)
[   20.778773][  T751] pc : avtab_search_node+0xe8/0x128
[   20.778778][  T751] lr : context_struct_compute_av+0x1c0/0x65c
[   20.778780][  T751] sp : ffffffc01542b820
[   20.778782][  T751] x29: ffffffc01542ba50 x28: ffffff80f80f7a40 
[   20.778786][  T751] x27: 0000000000000000 x26: ffffffc011706ee0 
[   20.778790][  T751] x25: ffffffc011706ea0 x24: ffffffc01542bb70 
[   20.778793][  T751] x23: ffffffc01542bc00 x22: ffffff80f92b2718 
[   20.778797][  T751] x21: 000000000000009c x20: ffffff80f80e1440 
[   20.778801][  T751] x19: 0000000000000037 x18: ffffffc0136db068 
[   20.778804][  T751] x17: 000000004e048830 x16: 00000000c2b2ae35 
[   20.778808][  T751] x15: 00000000f1f165bc x14: 0000000085ebca6b 
[   20.778811][  T751] x13: ffffff80f9170000 x12: 00000000c6e6d541 
[   20.778815][  T751] x11: 0000000000000707 x10: 000000000000009c 
[   20.778818][  T751] x9 : 0000000000000038 x8 : 000000000000000d 
[   20.778822][  T751] x7 : 676e007764676f6c x6 : ffffffc01542bc00 
[   20.778826][  T751] x5 : ffffffc01542bb70 x4 : ffffffc01542bc00 
[   20.778829][  T751] x3 : ffffffc01542bb70 x2 : 0000000000000030 
[   20.778833][  T751] x1 : ffffffc01542b870 x0 : fffff080f87c7008 
[   20.778837][  T751] Call trace:
[   20.778841][  T751]  avtab_search_node+0xe8/0x128
[   20.778844][  T751]  security_compute_av+0x108/0x378
[   20.778850][  T751]  avc_compute_av+0x60/0x294
[   20.778854][  T751]  avc_has_perm_noaudit+0xf0/0x128
[   20.778861][  T751]  selinux_inode_permission+0x144/0x1f4
[   20.778868][  T751]  security_inode_permission+0x60/0xb0
[   20.778874][  T751]  inode_permission+0x4c/0x198
[   20.778881][  T751]  unix_find_other+0x68/0x24c
[   20.778885][  T751]  unix_dgram_connect+0x254/0x3cc
[   20.778894][  T751]  __sys_connect+0xd4/0x140
[   20.778900][  T751]  __arm64_sys_connect+0x20/0x30
[   20.778908][  T751]  el0_svc_common+0xb8/0x1cc
[   20.778912][  T751]  el0_svc_compat_handler+0x1c/0x28
[   20.778917][  T751]  el0_svc_compat+0x8/0x24
[   20.778925][  T751] Code: 14000004 54000223 f9400800 b4000200 (7940000c) 
[   20.778929][  T751] ---[ end trace 11e0954576e51b12 ]---
[   20.778933][  T751] Kernel panic - not syncing: Fatal exception
[   20.778940][  T751] SMP: stopping secondary CPUs
[   20.787718][ T1012] Mem abort info:
[   20.797379][    C0] Mem abort info:
[   20.807042][ T1012]   ESR = 0x96000004
[   20.810948][    C0]   ESR = 0x96000004
[   20.815146][ T1012]   EC = 0x25: DABT (current EL), IL = 32 bits
[   20.821875][    C0]   EC = 0x25: DABT (current EL), IL = 32 bits
[   20.826075][ T1012]   SET = 0, FnV = 0
[   20.830373][    C0]   SET = 0, FnV = 0
[   20.834376][ T1012]   EA = 0, S1PTW = 0
[   20.839449][    C0]   EA = 0, S1PTW = 0
[   20.843554][ T1012] Data abort info:
[   20.852336][    C0] Data abort info:
[   20.859364][ T1012]   ISV = 0, ISS = 0x00000004
[   20.874191][    C0]   ISV = 0, ISS = 0x00000004
[   20.883951][ T1012]   CM = 0, WnR = 0
[   20.889124][    C0]   CM = 0, WnR = 0
[   20.895278][ T1012] [fffff080f86e3140] address between user and kernel address ranges
[   20.900945][    C0] [fffff0c011793050] address between user and kernel address ranges
[   21.862283][  T751] SMP: failed to stop secondary CPUs 0,2-3
[   21.868630][  T751] Kernel Offset: disabled
[   21.873322][  T751] CPU features: 0x00010002,20002004
[   21.878990][  T751] Memory Limit: none
[   21.883198][  T751] ---[ end Kernel panic - not syncing: Fatal exception ]---

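3. Analysis

Problems on embedded devices can be examined from two sides: hardware and software.

The faulting addresses 0xfffff080f8872000, 0xfffff080f87c7008, 0xfffff080f86e3140, and 0xfffff0c011793050 lie in neither kernel space nor user space: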

[   25.967843][  T553] [fffff080f8872000] address between user and kernel address ranges
......
[   20.778720][  T751] [fffff080f87c7008] address between user and kernel address ranges
......
[   20.895278][ T1012] [fffff080f86e3140] address between user and kernel address ranges
......
[   20.900945][    C0] [fffff0c011793050] address between user and kernel address ranges
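
To make the "between user and kernel address ranges" message concrete, here is a minimal Python sketch that classifies the faulting addresses and lists which upper bits would have to be set for each one to be a kernel address. It assumes a 39-bit virtual address configuration, inferred only from valid-looking pointers such as ffffff80... and ffffffc0... in the register dumps; `VA_BITS`, `classify`, and `cleared_upper_bits` are illustrative names, not kernel APIs.

```python
# Assumption: 39-bit virtual addresses (VA_BITS = 39), inferred from valid
# kernel pointers like 0xffffff80f8872b00 (x4) in the incident 1 register dump.
VA_BITS = 39

USER_MAX = (1 << VA_BITS) - 1             # highest user-space address
KERNEL_MIN = (1 << 64) - (1 << VA_BITS)   # 0xffffff8000000000


def classify(addr: int) -> str:
    """Return which region a 64-bit virtual address falls into."""
    if addr <= USER_MAX:
        return "user"
    if addr >= KERNEL_MIN:
        return "kernel"
    return "between user and kernel"


def cleared_upper_bits(addr: int) -> list:
    """Bits in [VA_BITS, 63] that are 0 in addr but 1 in every kernel address."""
    missing = ~addr & KERNEL_MIN
    return [bit for bit in range(VA_BITS, 64) if (missing >> bit) & 1]


# Faulting addresses copied from the logs above.
FAULT_ADDRS = [
    0xfffff080f8872000,
    0xfffff080f87c7008,
    0xfffff080f86e3140,
    0xfffff0c011793050,
    0xfff0f080f8200050,  # first oops line of incident 1
]

if __name__ == "__main__":
    for addr in FAULT_ADDRS:
        print(f"{addr:#018x}: {classify(addr)}, "
              f"cleared bits {cleared_upper_bits(addr)}")
```

Under this assumption, the four addresses quoted in the analysis all have bits 40 to 43 cleared, and the first oops address of incident 1 additionally has bits 48 to 51 cleared. With those bits set, the values look like ordinary kernel pointers (for example 0xffffff80f8872000, in the same page range as x4 in incident 1). This is only an observation about the bit pattern; it does not by itself identify what corrupted the pointers.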

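The software side was investigated first. Based on the logs above, memory corruption (a stray write) seemed a likely cause, but source code analysis, KASAN, and SLUB debug turned up nothing.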

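Because the failure is random and the crash sites in the kernel stack traces are random as well, suspicion turned to the DDR hardware: could DDR accesses be going wrong? Is there any evidence? The EC field of the ESR register in the logs, 0x25, looks like a clue. The ARM manual defines EC 0x25 as a Data Abort taken without a change in Exception level, which the kernel prints as "DABT (current EL)".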
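
As a cross-check on the fields the kernel prints under "Mem abort info" and "Data abort info", the following Python sketch decodes an ESR value by hand. The bit positions follow the ESR_EL1 layout for Data Abort exceptions in the Arm Architecture Reference Manual; the function name `decode_esr` and the small `DFSC_NAMES` table are illustrative and cover only a few fault status codes.

```python
# A few Data Fault Status Code (DFSC) values from the Arm ARM; not exhaustive.
DFSC_NAMES = {
    0b000100: "translation fault, level 0",
    0b000101: "translation fault, level 1",
    0b000110: "translation fault, level 2",
    0b000111: "translation fault, level 3",
    0b010000: "synchronous external abort, not on translation table walk",
}


def decode_esr(esr: int) -> dict:
    """Split a 32-bit ESR value into the Data Abort fields the kernel logs."""
    iss = esr & 0x1FFFFFF            # ISS: bits [24:0]
    dfsc = iss & 0x3F                # DFSC: ISS bits [5:0]
    return {
        "EC": (esr >> 26) & 0x3F,    # exception class: bits [31:26]
        "IL": (esr >> 25) & 0x1,     # 1 means a 32-bit instruction
        "ISS": iss,
        "ISV": (iss >> 24) & 0x1,
        "SET": (iss >> 11) & 0x3,
        "FnV": (iss >> 10) & 0x1,
        "EA": (iss >> 9) & 0x1,      # external abort type bit
        "CM": (iss >> 8) & 0x1,
        "S1PTW": (iss >> 7) & 0x1,
        "WnR": (iss >> 6) & 0x1,     # 0 = read, 1 = write
        "DFSC": dfsc,
        "DFSC_name": DFSC_NAMES.get(dfsc, "other (see the Arm ARM)"),
    }


if __name__ == "__main__":
    fields = decode_esr(0x96000004)  # ESR value from both incidents
    for name, value in fields.items():
        print(f"{name:10s} = {value}")
```

For 0x96000004 this gives EC = 0x25, WnR = 0 (a read), EA = 0, and a DFSC of translation fault, level 0. In other words, the CPU reports that it was asked to read an unmapped address rather than reporting an external abort from the memory system. That fits a pointer that was already corrupted when it was loaded, whatever the cause, so the ESR on its own neither confirms nor rules out a DDR problem.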

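In later testing, the WiFi antenna, which sat very close to the DDR, was removed, and the problem has not reproduced since. The WiFi antenna may have been interfering with DDR reads and writes. This is not necessarily the final conclusion, of course.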

4. References

\[1\] [ECC types, encoding rules, and error-correction principles in DRAM](https://zhuanlan.zhihu.com/p/780330153 "DRAM中的ECC类型、编码规则和纠错原理")
\[2\] [Key DDR technology notes: DDR RAS, error-correcting code (ECC) in memory](https://aijishu.com/a/1060000000478785 "搞DDR必懂的关键技术笔记:DDR RAS—内存中的纠错码 (ECC)")
\[3\] [Error-correcting code (ECC) in DDR memory](https://www.synopsys.com/zh-cn/designware-ip/technical-bulletin/error-correction-code-ddr.html "DDR 内存中的纠错码 (ECC)")
\[4\] [How ARMv8 CPUs handle physical memory ECC errors under linux-5.4.18](https://blog.csdn.net/u010936265/article/details/134222732 "ARMv8架构的CPU在linux-5.4.18系统下对物理内存ECC错误的处理")
