Preface
Starting from a real kernel panic, this article uses a hands-on case to show how to analyze and debug a kernel dump (kdump) in depth with the crash utility. By walking through the crash log, the function call stack, the key addresses, and the code logic, it lays out a systematic approach and practical techniques for kernel problem analysis. The guide is based on InLinux2312-LTS-SP1 and is intended to help readers quickly master kdump troubleshooting and improve their efficiency at handling kernel failures.
Inspur InLinux operating system version
All steps below were performed, and the problem analyzed, on InLinux2312-LTS-SP1.
Problem analysis
Symptom:
The test environment has three servers, each configured with 2 x 6.4T NVMe drives plus 10 x 12T SATA disks, using bcache for cache acceleration: each NVMe drive is split into 5 partitions, and each NVMe partition serves as the cache device for one 12T SATA disk.
To increase per-server storage density, the 12T SATA disks were to be replaced with 16T SATA disks.
The steps performed on site were as follows:
1. Create the bcache devices.
make-bcache -C /dev/nvme2n1p1 -B /dev/sda --writeback --force --wipe-bcache
/dev/sda is a 12T SATA disk.
/dev/nvme2n1p1 is the first partition of an NVMe drive; the partition size is 1024G.
The partitioning command was: parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1024GiB
There are 10 disks in total and 2 NVMe drives; each NVMe drive is split into 5 partitions, for a total of 10 bcache devices.
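The per-partition cache pairing above can be scripted. The following is a hypothetical dry-run generator that only prints the ten make-bcache commands; the device names are assumptions and must be matched to the real topology before anything is executed for real:

```shell
#!/bin/bash
# Hypothetical dry-run: print the make-bcache command for each of the
# 2 NVMe drives x 5 partitions, pairing each partition with one SATA
# backing disk (sda..sdj). Device names are illustrative only.
nvme_devs=(nvme1n1 nvme2n1)
sata_devs=(sda sdb sdc sdd sde sdf sdg sdh sdi sdj)
i=0
for nvme in "${nvme_devs[@]}"; do
    for part in 1 2 3 4 5; do
        echo "make-bcache -C /dev/${nvme}p${part} -B /dev/${sata_devs[$i]} --writeback --force --wipe-bcache"
        i=$((i + 1))
    done
done
```

Piping the output through a review step before running it is a cheap way to sanity-check the pairing.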
2. Run an fio test on bcache0.
cat /home/script/run-fio-randrw.sh
bcache_name=$1
if [ -z "${bcache_name}" ];then
echo bcache_name is empty
exit 1
fi
fio --filename=/dev/${bcache_name} --ioengine=libaio --rw=randrw --bs=4k --size=100% --iodepth=128 --numjobs=4 --direct=1 --name=randrw --group_reporting --runtime=30 --ramp_time=5 --lockmem=1G | tee -a ./randrw-iops_k1.log
Run bash run-fio-randrw.sh bcache0 several times.
3. Power off.
poweroff
No bcache data-wipe operation was performed beforehand.
4. Replace the 12T SATA disks with 16T SATA disks.
After powering off, pull out the 12T disks and insert 16T disks in their place.
5. Resize the nvme2n1 partition to 1536G.
A kernel panic was triggered as soon as the partitioning command completed:
parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB
6. Reboot. The system could not come up and kept rebooting in a loop.
7. Boot into rescue mode from the installation media and wipe the superblock on nvme2n1p1; after another reboot the system comes up normally.
wipefs -af /dev/nvme2n1p1
8. Repartition, which triggers the kernel panic again.
parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB
Performing the same steps on the other two servers did not trigger the panic.
On the affected server, after adding a NULL check on the root member of the cache_set structure, the system boots normally.
Log analysis
Error log
[root@storage-aqkp-002 127.0.0.1-2024-11-10-11:47:37]# cat vmcore-dmesg.txt |grep bcache
[ 21.365228] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 987460
[ 21.382581] bcache: register_cache() registered cache device nvme3n1p4
[ 21.524130] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 1019863
[ 21.535174] bcache: register_cache() registered cache device nvme3n1p2
[ 21.698388] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 1109121
[ 21.708619] bcache: register_cache() registered cache device nvme3n1p3
[ 21.868881] bcache: bch_journal_replay() journal replay done, 0 keys in 1 entries, seq 1127759
[ 21.879083] bcache: register_cache() registered cache device nvme3n1p5
[ 22.054332] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 1102627
[ 22.064518] bcache: register_cache() registered cache device nvme3n1p1
[ 249.369289] bcache: register_bcache() error : device already registered
[ 249.369415] bcache: register_bcache() error : device already registered
[ 249.370308] bcache: register_bcache() error : device already registered
[ 249.370517] bcache: register_bcache() error : device already registered
[ 249.371315] bcache: register_bcache() error : device already registered
[ 359.459929] nvme2n1:
[ 359.473124] nvme2n1: p1
[ 359.618056] bcache: prio_read() bad csum reading priorities
[ 359.624878] bcache: bch_cache_set_error() error on f774c122-6c02-469b-b798-ca53c10efa76: IO error reading priorities, disabling caching
[ 359.638311] bcache: register_cache() error nvme2n1p1: failed to run cache set
[ 359.646709] bcache: register_bcache() error : failed to register device
[ 359.658968] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000200
[ 359.669077] Mem abort info:
[ 359.672871] ESR = 0x96000044
[ 359.676929] EC = 0x25: DABT (current EL), IL = 32 bits
[ 359.683221] SET = 0, FnV = 0
[ 359.687253] EA = 0, S1PTW = 0
[ 359.691368] Data abort info:
[ 359.695212] ISV = 0, ISS = 0x00000044
[ 359.700003] CM = 0, WnR = 1
[ 359.703909] user pgtable: 4k pages, 48-bit VAs, pgdp=00002040022e2000
[ 359.711284] [0000000000000200] pgd=0000000000000000, p4d=0000000000000000
[ 359.719262] Internal error: Oops: 0000000096000044 [#1] SMP
[ 359.725760] Modules linked in: xt_set ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth xt_statistic xt_nat xt_addrtype ip6table_nat ip6_tables ipt
able_mangle xt_physdev xt_conntrack xt_comment xt_mark iptable_filter nf_conntrack_netlink nfnetlink sch_ingress iptable_nat xt_MASQUERADE ip_tables rbd ceph libceph dns_resolver overlay openvswitch nsh n
f_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c 8021q garp mrp bonding vfat fat dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser rdma_
cm iw_cm ib_cm libiscsi scsi_transport_iscsi hns_roce_hw_v2 ib_uverbs ib_core bcache dm_mod crc64 ipmi_ssif ses enclosure aes_ce_blk aes_ce_cipher realtek acpi_ipmi hisi_sas_v3_hw hibmc_drm ghash_ce hclge
sha1_ce hisi_sas_main nvme drm_vram_helper hns3 ipmi_si drm_ttm_helper nvme_core libsas hnae3 ipmi_devintf ttm host_edma_drv sg scsi_transport_sas i2c_designware_platform
[ 359.725845] nfit
[ 359.730936] bcache: register_bcache() error : device already registered
[ 359.815384] ipmi_msghandler i2c_designware_core hisi_uncore_ddrc_pmu hisi_uncore_hha_pmu hisi_uncore_l3c_pmu libnvdimm hisi_uncore_pmu sch_fq_codel br_netfilter bridge stp llc fuse ext4 mbcache jbd2 s
d_mod t10_pi ahci libahci sha2_ce sha256_arm64 sbsa_gwdt libata megaraid_sas(OE) aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 359.833119] bcache: register_bcache() error : device already registered
[ 359.856792] CPU: 57 PID: 7773 Comm: kworker/57:2 Kdump: loaded Tainted: G OE 5.10.0-202.0.0.115.ile2312sp1.aarch64 #1
[ 359.856793] Hardware name: Enginetech EG920A-G20/BC82AMDDRA, BIOS 6.67 11/15/2023
[ 359.856819] Workqueue: events cache_set_flush [bcache]
[ 359.894922] pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 359.901919] pc : cache_set_flush+0x94/0x190 [bcache]
[ 359.907876] lr : cache_set_flush+0x88/0x190 [bcache]
[ 359.913815] sp : ffff800046373d50
[ 359.918104] x29: ffff800046373d50 x28: 0000000000000000
[ 359.924380] x27: ffff800012213c48 x26: ffffbe503baba218
[ 359.930651] x25: ffff49cc48ca0808 x24: ffff49cc06674000
[ 359.936916] x23: ffff49cc48ca0808 x22: ffff49cc48ca0000
[ 359.943172] x21: ffff49cc48ca04a8 x20: 0000000000000000
[ 359.949419] x19: 0000000000000200 x18: 0000000000000000
[ 359.955662] x17: 0000000000000000 x16: ffffbe503a531760
[ 359.961896] x15: 0000000000000004 x14: ffff49cc00004990
[ 359.968123] x13: 0000000000000000 x12: ffff49cc3dd02a40
[ 359.974342] x11: ffff49cc3dd02910 x10: ffff2a0c0040b6c2
[ 359.980556] x9 : ffffbe503a591d88 x8 : ffff49cc3dd02938
[ 359.986770] x7 : ffff49cc07f03a18 x6 : 0000000000000000
[ 359.992977] x5 : ffff29cc59c16218 x4 : ffff49cc48ca0808
[ 359.999182] x3 : 0000000000000000 x2 : ffff49cc48ca0808
[ 360.004565] bcache: bch_journal_replay() journal replay done, 11 keys in 6 entries, seq 1096092
[ 360.005380] x1 : ffff49cc48ca0808 x0 : 0000000000000001
[ 360.016207] bcache: register_cache() registered cache device nvme2n1p3
[ 360.022922] Call trace:
[ 360.022934] cache_set_flush+0x94/0x190 [bcache]
[ 360.022946] process_one_work+0x1d8/0x4e0
[ 360.045082] bcache: register_bcache() error : device already registered
[ 360.045966] worker_thread+0x154/0x420
[ 360.045970] kthread+0x108/0x150
[ 360.046495] bcache: register_bcache() error : device already registered
[ 360.066044] bcache: register_bcache() error : device already registered
[ 360.066162] bcache: register_bcache() error : device already registered
[ 360.070249] ret_from_fork+0x10/0x18
[ 360.070254] Code: 940043e2 72001c1f 54000700 f90006f3 (f9010297)
[ 360.090288] bcache: register_bcache() error : device already registered
[ 360.091355] bcache: register_bcache() error : device already registered
[ 360.097327] SMP: stopping secondary CPUs
[ 360.119238] Starting crashdump kernel...
Log analysis findings
Forward code analysis
From the log, the failing call chain can be reconstructed as:
register_cache_set -> run_cache_set -> prio_read (fails) -> register_cache_set error path -> bch_cache_set_unregister -> bch_cache_set_stop -> __cache_set_unregister -> cache_set_flush -> list_add
When userspace runs make-bcache to register a bcache device, the kernel calls register_cache_set.
register_cache_set first checks the uuid to guarantee uniqueness, then calls bch_cache_set_alloc, which initializes the structure members and registers the closure callbacks; there, the caching closure's callback is set to __cache_set_unregister. run_cache_set is then invoked, which reads the journal from the cache disk to initialize the btree root. Given the log error "IO error reading priorities", the root member of the cache_set structure was never initialized, so the subsequent cache_set_flush was bound to panic the kernel.
static const char *register_cache_set(struct cache *ca)
{
char buf[12];
const char *err = "cannot allocate memory";
struct cache_set *c;
//check uuid uniqueness
list_for_each_entry(c, &bch_cache_sets, list)
if (!memcmp(c->set_uuid, ca->sb.set_uuid, 16)) {
if (c->cache)
return "duplicate cache set member";
goto found;
}
//initialize the in-memory cache_set structure from the on-disk sb
c = bch_cache_set_alloc(&ca->sb);
if (!c)
return err;
err = "error creating kobject";
if (kobject_add(&c->kobj, bcache_kobj, "%pU", c->set_uuid) ||
kobject_add(&c->internal, &c->kobj, "internal"))
goto err;
//add accounting stats under /sys/block/bcache0/bcache/
//stats_{total,five_minute,hour,day}
if (bch_cache_accounting_add_kobjs(&c->accounting, &c->kobj))
goto err;
//initialize the bcache entries under debugfs
bch_debug_init_cache_set(c);
//add the new cache set to the global bch_cache_sets list
list_add(&c->list, &bch_cache_sets);
found:
//create the /sys/block/bcache0/bcache/cache/cache0/set symlink and the
//cache0 symlink under the cache set directory
sprintf(buf, "cache%i", ca->sb.nr_this_dev);
if (sysfs_create_link(&ca->kobj, &c->kobj, "set") ||
sysfs_create_link(&c->kobj, &ca->kobj, buf))
goto err;
//record the mapping between the cache and the cache set
kobject_get(&ca->kobj);
ca->set = c;
ca->set->cache = ca;
err = "failed to run cache set";
if (run_cache_set(c) < 0)
goto err;
return NULL;
err:
//on error, unregister the bcache device
bch_cache_set_unregister(c);
return err;
}
struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
{
int iter_size;
struct cache *ca = container_of(sb, struct cache, sb);
struct cache_set *c = kzalloc(sizeof(struct cache_set), GFP_KERNEL);
if (!c)
return NULL;
__module_get(THIS_MODULE);
//initialize the asynchronous closure structures
closure_init(&c->cl, NULL);
set_closure_fn(&c->cl, cache_set_free, system_wq);
closure_init(&c->caching, &c->cl);
set_closure_fn(&c->caching, __cache_set_unregister, system_wq);
In bch_cache_set_alloc, the closure callback is set to __cache_set_unregister.
void bch_cache_set_unregister(struct cache_set *c)
{
set_bit(CACHE_SET_UNREGISTERING, &c->flags);
//stop the bcache cache and backing devices
bch_cache_set_stop(c);
}
In this problem, run_cache_set returns an error.
static int run_cache_set(struct cache_set *c)
{
const char *err = "cannot allocate memory";
struct cached_dev *dc, *t;
struct cache *ca = c->cache;
struct closure cl;
LIST_HEAD(journal);
struct journal_replay *l;
closure_init_stack(&cl);
c->nbuckets = ca->sb.nbuckets;
set_gc_sectors(c);
if (CACHE_SYNC(&c->cache->sb)) {
struct bkey *k;
struct jset *j;
err = "cannot allocate memory for journal";
if (bch_journal_read(c, &journal))
goto err;
pr_debug("btree_journal_read() done\n");
err = "no journal entries found";
if (list_empty(&journal))
goto err;
j = &list_entry(journal.prev, struct journal_replay, list)->j;
err = "IO error reading priorities";
if (prio_read(ca, j->prio_bucket[ca->sb.nr_this_dev]))
goto err;
/*
* If prio_read() fails it'll call cache_set_error and we'll
* tear everything down right away, but if we perhaps checked
* sooner we could avoid journal replay.
*/
k = &j->btree_root;
err = "bad btree root";
if (__bch_btree_ptr_invalid(c, k))
goto err;
err = "error reading btree root";
//c->root is initialized here; if anything failed earlier, this point is
//never reached and the root pointer stays NULL
c->root = bch_btree_node_get(c, NULL, k,
j->btree_level,
true, NULL);
if (IS_ERR_OR_NULL(c->root))
goto err;
list_del_init(&c->root->list);
rw_unlock(true, c->root);
err:
while (!list_empty(&journal)) {
l = list_first_entry(&journal, struct journal_replay, list);
list_del(&l->list);
kfree(l);
}
closure_sync(&cl);
bch_cache_set_error(c, "%s", err);
return -EIO;
}
After run_cache_set fails, bcache unregisters the device via bch_cache_set_unregister, which calls bch_cache_set_stop. bch_cache_set_stop queues the caching closure, whose callback was registered earlier as __cache_set_unregister, and the teardown then completes asynchronously.
void bch_cache_set_stop(struct cache_set *c)
{
if (!test_and_set_bit(CACHE_SET_STOPPING, &c->flags))
/* closure_fn set to __cache_set_unregister() */
closure_queue(&c->caching);//queue the closure; the callback registered earlier runs asynchronously
}
static inline void closure_queue(struct closure *cl)
{
struct workqueue_struct *wq = cl->wq;
/**
* Changes made to closure, work_struct, or a couple of other structs
* may cause work.func not pointing to the right location.
*/
BUILD_BUG_ON(offsetof(struct closure, fn)
!= offsetof(struct work_struct, func));
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
BUG_ON(!queue_work(wq, &cl->work));
} else
cl->fn(cl);//no workqueue set: call the registered callback (__cache_set_unregister here) directly
}
Reverse analysis with crash
ARM register overview
X0-X7 pass arguments and return values; X19-X28 are callee-saved registers that a function must preserve across calls.
FP (X29) is the frame-pointer register and LR (X30) is the link register.
In the ARM64 architecture, FP (Frame Pointer) and LR (Link Register) are two key registers for function calls and stack-frame management:
- FP (Frame Pointer): points to the base of the current function's stack frame. On each call, the old FP is pushed onto the stack and FP is set to the base of the new frame, so FP can be used to walk the chain of frames and reconstruct the call history while debugging.
- LR (Link Register): holds the return address. When a function is called, LR receives the address of the instruction following the call; on return, LR is copied into the program counter (PC), resuming the caller.
- PC (Program Counter): the ARM64 program counter, holding the address of the instruction currently being executed. In a crash, the address in PC is the faulting instruction.
Assembly instructions
(1) The stp instruction
In ARM64 assembly, stp (Store Pair) stores the values of two registers to memory. It is commonly used on function entry to save registers, especially when several must be saved at once; it keeps the code compact by replacing pairs of str instructions.
Syntax:
stp <reg1>, <reg2>, [<base>, <offset>]
<reg1> and <reg2>: the two registers whose contents are stored.
[<base>, <offset>]: the target memory address, given as a base register plus an offset.
Example: stp x19, x20, [sp, #16]
Explanation: x19 and x20 are the registers to store; [sp, #16] is the memory address, with base sp (the stack pointer) and offset #16.
Concretely, stp x19, x20, [sp, #16] does the following:
- stores x19 on the stack at offset 16 bytes;
- stores x20 on the stack immediately after x19 (i.e. in the memory slot right after x19's).
Memory layout: assuming sp is currently 0xffffbe50121fa000, after this instruction:
- x19 is stored at 0xffffbe50121fa010 (sp + 16);
- x20 is stored at 0xffffbe50121fa018 (sp + 24).
Usage: stp is typically used to save a pair of registers, particularly on function entry (e.g. the return address and callee-saved registers), so their values can be restored when the function returns.
A function prologue often contains code like:
stp x19, x20, [sp, #-16]!
This instruction:
- stores x19 and x20 at the address sp - 16; and
- because of the writeback suffix ! (as in [sp, #-16]!), also updates sp (sp = sp - 16).
In stack-frame save and restore sequences, stp handles multiple registers efficiently, reducing the number of individual str instructions needed.
(2) The ldp instruction
The load counterpart of stp is ldp (Load Pair), which loads a pair of values from memory into two registers; the usage is analogous, only reading instead of writing.
Example:
ldp x19, x20, [sp, #16]
This loads the data at sp + 16 into x19 and x20.
Steps for analyzing the call stack with crash
Debugging a kdump vmcore requires the crash utility and the kernel debuginfo packages:
yum install crash kernel-debuginfo kernel-debugsource -y
[root@storage-aqkp-002 127.0.0.1-2024-11-10-11:47:37]# crash /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/vmlinux /var/crash/127.0.0.1-2024-11-10-11\:47\:37/vmcore
crash 8.0.2-1.ile2312sp1
Copyright (C) 2002-2022 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
Copyright (C) 2015, 2021 VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-unknown-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
WARNING: kernel version inconsistency between vmlinux and dumpfile
KERNEL: /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/vmlinux [TAINTED]
DUMPFILE: /var/crash/127.0.0.1-2024-11-10-11:47:37/vmcore [PARTIAL DUMP]
CPUS: 96
DATE: Sun Nov 10 11:46:56 CST 2024
UPTIME: 00:06:00
LOAD AVERAGE: 0.15, 0.28, 0.17
TASKS: 1763
NODENAME: storage-aqkp-002
RELEASE: 5.10.0-202.0.0.115.ile2312sp1.aarch64
VERSION: #1 SMP Mon Jun 17 01:51:52 UTC 2024
MACHINE: aarch64 (unknown Mhz)
MEMORY: 704 GB
PANIC: "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000200"
PID: 7773
COMMAND: "kworker/57:2"
TASK: ffff49cc44d69340 [THREAD_INFO: ffff49cc44d69340]
CPU: 57
STATE: TASK_RUNNING (PANIC)
crash> mod -s bcache /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/kernel/drivers/md/bcache/bcache.ko-5.10.0-202.0.0.115.ile2312sp1.aarch64.debug
MODULE NAME BASE SIZE OBJECT FILE
ffffbe501221b040 bcache ffffbe50121e2000 319488 /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/kernel/drivers/md/bcache/bcache.ko-5.10.0-202.0.0.115.ile2312sp1.aarch64.debug
crash>
crash> bt
PID: 7773 TASK: ffff49cc44d69340 CPU: 57 COMMAND: "kworker/57:2"
#0 [ffff800046373800] machine_kexec at ffffbe5039eb54a8
#1 [ffff8000463739b0] __crash_kexec at ffffbe503a052824
#2 [ffff8000463739e0] crash_kexec at ffffbe503a0529cc
#3 [ffff800046373a60] die at ffffbe5039e9445c
#4 [ffff800046373ac0] die_kernel_fault at ffffbe5039ec698c
#5 [ffff800046373af0] __do_kernel_fault at ffffbe5039ec6a38
#6 [ffff800046373b20] do_page_fault at ffffbe503ac76ba4
#7 [ffff800046373b70] do_translation_fault at ffffbe503ac76ebc
#8 [ffff800046373b90] do_mem_abort at ffffbe5039ec68ac
#9 [ffff800046373bc0] el1_abort at ffffbe503ac669bc
#10 [ffff800046373bf0] el1_sync_handler at ffffbe503ac671d4
#11 [ffff800046373d30] el1_sync at ffffbe5039e82230
#12 [ffff800046373d50] cache_set_flush at ffffbe50121fa4c4 [bcache]
#13 [ffff800046373da0] process_one_work at ffffbe5039f5af68
#14 [ffff800046373e00] worker_thread at ffffbe5039f5b3c4
#15 [ffff800046373e50] kthread at ffffbe5039f634b8
crash> dis cache_set_flush+0x94
0xffffbe50121fa4c8 <cache_set_flush+148>: str x23, [x20, #512]
crash> dis -s cache_set_flush+0x94
FILE: ./include/linux/list.h
LINE: 71
66 {
67 if (!__list_add_valid(new, prev, next))
68 return;
69
70 next->prev = new;
* 71 new->next = next;
72 new->prev = prev;
73 WRITE_ONCE(prev->next, new);
74 }
crash>
Besides installing the kernel-debuginfo package, crash also needs the module debug info loaded:
#load the bcache debug info
mod -s bcache /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/kernel/drivers/md/bcache/bcache.ko-5.10.0-202.0.0.115.ile2312sp1.aarch64.debug
Take the faulting function address from vmcore-dmesg.txt — the PC register on ARM, RIP on x86. Here the crash site is pc : cache_set_flush+0x94/0x190, and dis -s cache_set_flush+0x94 shows the source code at the faulting location.
The problem can then be analyzed by combining the disassembly with the register contents recorded in vmcore-dmesg.txt.
crash> dis cache_set_flush
0xffffbe50121fa434 <cache_set_flush>: mov x9, x30
0xffffbe50121fa438 <cache_set_flush+4>: nop
0xffffbe50121fa43c <cache_set_flush+8>: paciasp
0xffffbe50121fa440 <cache_set_flush+12>: stp x29, x30, [sp, #-80]!
0xffffbe50121fa444 <cache_set_flush+16>: mov x29, sp
0xffffbe50121fa448 <cache_set_flush+20>: stp x21, x22, [sp, #32]
0xffffbe50121fa44c <cache_set_flush+24>: mov x21, x0
0xffffbe50121fa450 <cache_set_flush+28>: sub x22, x0, #0x4a8
0xffffbe50121fa454 <cache_set_flush+32>: stp x19, x20, [sp, #16]
0xffffbe50121fa458 <cache_set_flush+36>: add x0, x22, #0x128
0xffffbe50121fa45c <cache_set_flush+40>: stp x23, x24, [sp, #48]
0xffffbe50121fa460 <cache_set_flush+44>: ldur x24, [x21, #-56]
0xffffbe50121fa464 <cache_set_flush+48>: bl 0xffffbe50121f8e88 <bch_cache_accounting_destroy>
0xffffbe50121fa468 <cache_set_flush+52>: add x0, x22, #0xc0
0xffffbe50121fa46c <cache_set_flush+56>: bl 0xffffbe501220b890 <bcache_device_free+2504>
0xffffbe50121fa470 <cache_set_flush+60>: add x0, x22, #0x60
0xffffbe50121fa474 <cache_set_flush+64>: bl 0xffffbe501220b6e0 <bcache_device_free+2072>
0xffffbe50121fa478 <cache_set_flush+68>: ldr x0, [x22, #2256]
0xffffbe50121fa47c <cache_set_flush+72>: cbz x0, 0xffffbe50121fa48c <cache_set_flush+88>
0xffffbe50121fa480 <cache_set_flush+76>: cmn x0, #0x1, lsl #12
0xffffbe50121fa484 <cache_set_flush+80>: b.hi 0xffffbe50121fa48c <cache_set_flush+88> // b.pmore
0xffffbe50121fa488 <cache_set_flush+84>: bl 0xffffbe501220b56c <bcache_device_free+1700>
0xffffbe50121fa48c <cache_set_flush+88>: add x0, x22, #0x8, lsl #12
0xffffbe50121fa490 <cache_set_flush+92>: ldr x20, [x0, #17656]
0xffffbe50121fa494 <cache_set_flush+96>: cmn x20, #0x1, lsl #12
0xffffbe50121fa498 <cache_set_flush+100>: b.hi 0xffffbe50121fa4d8 <cache_set_flush+164> // b.pmore
0xffffbe50121fa49c <cache_set_flush+104>: str x25, [sp, #64]
0xffffbe50121fa4a0 <cache_set_flush+108>: add x19, x20, #0x200
0xffffbe50121fa4a4 <cache_set_flush+112>: add x25, x21, #0x360
0xffffbe50121fa4a8 <cache_set_flush+116>: mov x0, x19
0xffffbe50121fa4ac <cache_set_flush+120>: ldr x23, [x21, #864]
0xffffbe50121fa4b0 <cache_set_flush+124>: mov x1, x25
0xffffbe50121fa4b4 <cache_set_flush+128>: mov x2, x23
0xffffbe50121fa4b8 <cache_set_flush+132>: bl 0xffffbe501220b440 <bcache_device_free+1400>
0xffffbe50121fa4bc <cache_set_flush+136>: tst w0, #0xff
0xffffbe50121fa4c0 <cache_set_flush+140>: b.eq 0xffffbe50121fa5a0 <cache_set_flush+364> // b.none
0xffffbe50121fa4c4 <cache_set_flush+144>: str x19, [x23, #8]
0xffffbe50121fa4c8 <cache_set_flush+148>: str x23, [x20, #512]
0xffffbe50121fa4cc <cache_set_flush+152>: str x25, [x20, #520]
0xffffbe50121fa4d0 <cache_set_flush+156>: str x19, [x21, #864]
0xffffbe50121fa4d4 <cache_set_flush+160>: ldr x25, [sp, #64]
0xffffbe50121fa4d8 <cache_set_flush+164>: ldur x0, [x21, #-72]
0xffffbe50121fa4dc <cache_set_flush+168>: tst w0, #0x8
0xffffbe50121fa4e0 <cache_set_flush+172>: b.ne 0xffffbe50121fa4f8 <cache_set_flush+196> // b.any
0xffffbe50121fa4e4 <cache_set_flush+176>: ldr x0, [x22, #2056]
0xffffbe50121fa4e8 <cache_set_flush+180>: add x23, x21, #0x360
0xffffbe50121fa4ec <cache_set_flush+184>: sub x19, x0, #0x200
0xffffbe50121fa4f0 <cache_set_flush+188>: cmp x23, x0
0xffffbe50121fa4f4 <cache_set_flush+192>: b.ne 0xffffbe50121fa584 <cache_set_flush+336> // b.any
0xffffbe50121fa4f8 <cache_set_flush+196>: ldr x0, [x24, #2504]
0xffffbe50121fa4fc <cache_set_flush+200>: cbz x0, 0xffffbe50121fa504 <cache_set_flush+208>
0xffffbe50121fa500 <cache_set_flush+204>: bl 0xffffbe501220b56c <bcache_device_free+1700>
0xffffbe50121fa504 <cache_set_flush+208>: add x19, x22, #0x8, lsl #12
0xffffbe50121fa508 <cache_set_flush+212>: ldr x0, [x19, #18568]
0xffffbe50121fa50c <cache_set_flush+216>: cbz x0, 0xffffbe50121fa52c <cache_set_flush+248>
0xffffbe50121fa510 <cache_set_flush+220>: mov x0, #0xc710 // #50960
0xffffbe50121fa514 <cache_set_flush+224>: add x22, x22, x0
0xffffbe50121fa518 <cache_set_flush+228>: mov x0, x22
0xffffbe50121fa51c <cache_set_flush+232>: bl 0xffffbe501220b710 <bcache_device_free+2120>
0xffffbe50121fa520 <cache_set_flush+236>: ldr x1, [x19, #18216]
0xffffbe50121fa524 <cache_set_flush+240>: mov x0, x22
0xffffbe50121fa528 <cache_set_flush+244>: blr x1
0xffffbe50121fa52c <cache_set_flush+248>: str xzr, [x21]
0xffffbe50121fa530 <cache_set_flush+252>: str xzr, [x21, #24]
0xffffbe50121fa534 <cache_set_flush+256>: dmb ish
0xffffbe50121fa538 <cache_set_flush+260>: mov w1, #0x1 // #1
0xffffbe50121fa53c <cache_set_flush+264>: mov x0, x21
0xffffbe50121fa540 <cache_set_flush+268>: movk w1, #0x4000, lsl #16
0xffffbe50121fa544 <cache_set_flush+272>: bl 0xffffbe50121ef064 <closure_sub>
0xffffbe50121fa548 <cache_set_flush+276>: ldp x19, x20, [sp, #16]
0xffffbe50121fa54c <cache_set_flush+280>: ldp x21, x22, [sp, #32]
0xffffbe50121fa550 <cache_set_flush+284>: ldp x23, x24, [sp, #48]
0xffffbe50121fa554 <cache_set_flush+288>: ldp x29, x30, [sp], #80
0xffffbe50121fa558 <cache_set_flush+292>: autiasp
0xffffbe50121fa55c <cache_set_flush+296>: ret
0xffffbe50121fa560 <cache_set_flush+300>: mov x0, x19
0xffffbe50121fa564 <cache_set_flush+304>: mov x1, #0x0 // #0
0xffffbe50121fa568 <cache_set_flush+308>: bl 0xffffbe50121e9714 <__bch_btree_node_write>
0xffffbe50121fa56c <cache_set_flush+312>: mov x0, x20
0xffffbe50121fa570 <cache_set_flush+316>: bl 0xffffbe501220b704 <bcache_device_free+2108>
0xffffbe50121fa574 <cache_set_flush+320>: ldr x1, [x19, #512]
0xffffbe50121fa578 <cache_set_flush+324>: sub x19, x1, #0x200
0xffffbe50121fa57c <cache_set_flush+328>: cmp x23, x1
0xffffbe50121fa580 <cache_set_flush+332>: b.eq 0xffffbe50121fa4f8 <cache_set_flush+196> // b.none
0xffffbe50121fa584 <cache_set_flush+336>: add x20, x19, #0x90
0xffffbe50121fa588 <cache_set_flush+340>: mov x0, x20
0xffffbe50121fa58c <cache_set_flush+344>: bl 0xffffbe501220b4e8 <bcache_device_free+1568>
0xffffbe50121fa590 <cache_set_flush+348>: ldr x0, [x19, #176]
0xffffbe50121fa594 <cache_set_flush+352>: tst w0, #0x2
0xffffbe50121fa598 <cache_set_flush+356>: b.eq 0xffffbe50121fa56c <cache_set_flush+312> // b.none
0xffffbe50121fa59c <cache_set_flush+360>: b 0xffffbe50121fa560 <cache_set_flush+300>
0xffffbe50121fa5a0 <cache_set_flush+364>: ldr x25, [sp, #64]
0xffffbe50121fa5a4 <cache_set_flush+368>: b 0xffffbe50121fa4d8 <cache_set_flush+164>
0xffffbe50121fa5a8 <cache_set_flush+372>: nop
0xffffbe50121fa5ac <cache_set_flush+376>: nop
0xffffbe50121fa5b0 <cache_set_flush+380>: ldrsb w4, [x5, #2724]
0xffffbe50121fa5b4 <cache_set_flush+384>: .inst 0xffffbe50 ; undefined
0xffffbe50121fa5b8 <cache_set_flush+388>: nop
0xffffbe50121fa5bc <cache_set_flush+392>: ldr x16, 0xffffbe50121fa5b0 <cache_set_flush+380>
0xffffbe50121fa5c0 <cache_set_flush+396>: br x16
crash>
pc : cache_set_flush+0x94/0x190
means the crash occurred while executing the instruction at offset 0x94 into cache_set_flush. The disassembly shows that instruction is str x23, [x20, #512], which performed the NULL-pointer access.
pc : cache_set_flush+0x94/0x190 in the log is the program counter (PC) value at crash time, i.e. the position of the current instruction within cache_set_flush. Specifically:
- cache_set_flush is the function name: execution was inside cache_set_flush.
- +0x94 means the PC was at offset 0x94 (148 in decimal) into the function.
- /0x190 means the whole function is 0x190 (400 in decimal) bytes long; this gives a relative sense of how far into the function the crash occurred.
crash> dis -s cache_set_flush+0x94
FILE: ./include/linux/list.h
LINE: 71
   66 {
   67         if (!__list_add_valid(new, prev, next))
   68                 return;
   69
   70         next->prev = new;
 * 71         new->next = next;
   72         new->prev = prev;
   73         WRITE_ONCE(prev->next, new);
   74 }
The dis -s cache_set_flush+0x94 output shows that the fault occurred inside a linked-list operation.
From the output of crash> dis cache_set_flush we get the full disassembly of the function. A few key parts are enough to find the cause of the crash and confirm how the NULL pointer was accessed.
Key parts of the assembly:
Register usage:
x0 carries the first argument; it appears throughout the function as various structure-member addresses are computed from it.
x20 holds a loaded value used repeatedly; in particular, ldr x20, [x0, #17656] loads the value at offset 17656 from x0 into x20.
x19 holds computed addresses that are updated several times later in the code.
Execution flow:
Setting up the stack frame:
stp x29, x30, [sp, #-80]!
mov x29, sp
This saves the caller's FP and LR: the stack is extended downward by 80 bytes (the pre-index writeback on sp), the caller's FP and LR are stored at the new stack top, and FP is set to the new sp.
- Argument passing and pointer arithmetic:
mov x21, x0
sub x22, x0, #0x4a8
On entry, x0 holds the first argument: a pointer to the caching closure embedded in the cache_set. Subtracting 0x4a8 (1192, the offset of the caching member) recovers the base address of the cache_set structure, which is kept in x22, while x21 keeps the first argument itself. (By crash time the register dump shows x0 : 0000000000000001, because x0 has since been clobbered: it holds the return value of the last function call.)
In the register dump, x22 is ffff49cc48ca0000.
crash> struct -o cache_set
struct cache_set {
[0] struct closure cl;
[80] struct list_head list;
[96] struct kobject kobj;
[192] struct kobject internal;
[288] struct dentry *debug;
[296] struct cache_accounting accounting;
[1120] unsigned long flags;
[1128] atomic_t idle_counter;
[1132] atomic_t at_max_writeback_rate;
[1136] struct cache *cache;
[1144] struct bcache_device **devices;
[1152] unsigned int devices_max_used;
[1156] atomic_t attached_dev_nr;
[1160] struct list_head cached_devs;
[1176] uint64_t cached_dev_sectors;
[1184] atomic_long_t flash_dev_dirty_sectors;
[1192] struct closure caching;
[1272] struct closure sb_write;
[1352] struct semaphore sb_write_mutex;
[1376] mempool_t search;
[1448] mempool_t bio_meta;
[1520] struct bio_set bio_split;
[1952] struct shrinker shrink;
[2016] struct mutex bucket_lock;
[2048] unsigned short bucket_bits;
[2050] unsigned short block_bits;
[2052] unsigned int btree_pages;
[2056] struct list_head btree_cache;
[2072] struct list_head btree_cache_freeable;
[2088] struct list_head btree_cache_freed;
Offset 296 is the accounting member; add x0, x22, #0x128 (0x128 = 296) therefore sets x0 to the address of c->accounting.
Memory accesses and function calls:
add x0, x22, #0x128
stp x23, x24, [sp, #48]
ldur x24, [x21, #-56]
bl 0xffffbe50121f8e88 <bch_cache_accounting_destroy>
crash> struct -o cache_set.cache
struct cache_set {
[1136] struct cache *cache;
}
crash> struct -o cache_set.caching
struct cache_set {
[1192] struct closure caching;
}
Corresponding C code:
bch_cache_accounting_destroy(&c->accounting);
ldur x24, [x21, #-56] reads from x21 at offset -56 (1136 - 1192): stepping back from the caching member to the cache member, it loads the cache pointer into x24.
bch_cache_accounting_destroy is then called with its argument in x0, i.e. the address of c->accounting.
0xffffbe50121fa468 <cache_set_flush+52>: add x0, x22, #0xc0
0xffffbe50121fa46c <cache_set_flush+56>: bl 0xffffbe501220b890 <bcache_device_free+2504>
Offset 0xc0 corresponds to the internal member of struct cache_set:
crash> struct -o cache_set.internal
struct cache_set {
[192] struct kobject internal;
}
The assembly above corresponds to the C code:
kobject_put(&c->internal);
0xffffbe50121fa470 <cache_set_flush+60>: add x0, x22, #0x60
0xffffbe50121fa474 <cache_set_flush+64>: bl 0xffffbe501220b6e0 <bcache_device_free+2072>
Offset 0x60 corresponds to the kobj member of struct cache_set:
crash> struct -o cache_set.kobj
struct cache_set {
[96] struct kobject kobj;
}
The assembly above corresponds to the C code:
kobject_del(&c->kobj);
This can be confirmed from the structure member offsets:
crash> struct -o cache_set.kobj -x
struct cache_set {
[0x60] struct kobject kobj;
}
crash> struct -o cache_set.internal -x
struct cache_set {
[0xc0] struct kobject internal;
}
crash>
0xc0 and 0x60 match the offsets used in the assembly exactly.
NULL and error-pointer check on gc_thread:
0xffffbe50121fa478 <cache_set_flush+68>: ldr x0, [x22, #2256]
0xffffbe50121fa47c <cache_set_flush+72>: cbz x0, 0xffffbe50121fa48c <cache_set_flush+88>
0xffffbe50121fa480 <cache_set_flush+76>: cmn x0, #0x1, lsl #12
0xffffbe50121fa484 <cache_set_flush+80>: b.hi 0xffffbe50121fa48c <cache_set_flush+88> // b.pmore
0xffffbe50121fa488 <cache_set_flush+84>: bl 0xffffbe501220b56c <bcache_device_free+1700>
crash> struct -o cache_set
struct cache_set {
[0] struct closure cl;
[80] struct list_head list;
...
[296] struct cache_accounting accounting;
...
[2192] struct gc_stat gc_stats;
[2240] size_t nbuckets;
[2248] size_t avail_nbuckets;
[2256] struct task_struct *gc_thread;
Here the value at offset 2256 from x22 — the gc_thread member — is loaded into x0. cbz then checks whether it is NULL: if so, execution jumps to cache_set_flush+88; otherwise the cmn/b.hi pair additionally filters out error-pointer values before the stop call is made.
Corresponding C code:
if (!IS_ERR_OR_NULL(c->gc_thread))
kthread_stop(c->gc_thread);
Loading and checking x20 (c->root):
0xffffbe50121fa48c <cache_set_flush+88>: add x0, x22, #0x8, lsl #12
0xffffbe50121fa490 <cache_set_flush+92>: ldr x20, [x0, #17656]
0xffffbe50121fa494 <cache_set_flush+96>: cmn x20, #0x1, lsl #12
0xffffbe50121fa498 <cache_set_flush+100>: b.hi 0xffffbe50121fa4d8 <cache_set_flush+164> // b.pmore
0xffffbe50121fa49c <cache_set_flush+104>: str x25, [sp, #64]
0xffffbe50121fa4a0 <cache_set_flush+108>: add x19, x20, #0x200
0xffffbe50121fa4a4 <cache_set_flush+112>: add x25, x21, #0x360
0xffffbe50121fa4a8 <cache_set_flush+116>: mov x0, x19
0xffffbe50121fa4ac <cache_set_flush+120>: ldr x23, [x21, #864]
0xffffbe50121fa4b0 <cache_set_flush+124>: mov x1, x25
0xffffbe50121fa4b4 <cache_set_flush+128>: mov x2, x23
0xffffbe50121fa4b8 <cache_set_flush+132>: bl 0xffffbe501220b440 <bcache_device_free+1400>
add x0, x22, #0x8, lsl #12 adds 0x8 shifted left by 12 bits (i.e. 0x8000) to x22 and stores the result in x0. x22 is the pointer to struct cache_set, so x0 = x22 + 0x8000.
ldr x20, [x0, #17656] then loads from offset 17656 beyond that: 0x8000 + 17656 = 0xc4f8, which is exactly the offset of the root member of struct cache_set. So x20 can be confirmed to hold the value of c->root.
crash> struct -o cache_set.root -x
struct cache_set {
[0xc4f8] struct btree *root;
}
Next x20 is checked; if the check passes, execution jumps to cache_set_flush+164.
cmn x20, #0x1, lsl #12 (at 0xffffbe50121fa494)
Effect: computes x20 + (0x1 << 12), i.e. x20 + 0x1000, updating the condition flags without storing the result (cmn is compare-negative).
b.hi 0xffffbe50121fa4d8 (at 0xffffbe50121fa498)
Effect: branches to cache_set_flush+164 if the unsigned result shows x20 lies in the error-pointer range (the top 0x1000 of the address space, i.e. IS_ERR values). Note that, unlike the gc_thread check above, there is no cbz here: a NULL x20 does not take this branch and falls straight through to the list_add path — which is exactly the bug.
Corresponding C code (an error-pointer check only, with no NULL check):
if (!IS_ERR(c->root))
list_add(&c->root->list, &c->btree_cache);
static inline void list_add(struct list_head *new, struct list_head *head)
{
__list_add(new, head, head->next);
}
static inline void __list_add(struct list_head *new,
struct list_head *prev,
struct list_head *next)
{
if (!__list_add_valid(new, prev, next))
return;
next->prev = new;
new->next = next;
new->prev = prev;
WRITE_ONCE(prev->next, new);
}
This matches exactly the faulting source location shown by dis -s cache_set_flush+0x94.
In addition, the assembly shows that x20 holds the value of the cache_set's root member, and in vmcore-dmesg.txt the register dump reads x20: 0000000000000000 — the root pointer is NULL.
Memory writes and the faulting instruction:
str x19, [x23, #8]
str x23, [x20, #512]
With new = &c->root->list in x19, next (head->next) in x23, and c->root in x20, the first store is next->prev = new and the second, str x23, [x20, #512], is new->next = next — the instruction that faulted.
crash> struct -o btree.list
struct btree {
[512] struct list_head list;
}
The list member sits at offset 512 in struct btree; dereferencing the NULL root pointer at offset 512 (virtual address 0x200) is what crashed the kernel.
Analysis conclusion
The disassembly of cache_set_flush shows that ldr x20, [x0, #17656] loads, relative to the cache_set pointer, the value at structure offset 0x8000 + 17656 = 0xc4f8 — the root member. x20 is NULL, and the subsequent list operation on it panics the kernel.
The crash is caused by cache_set_flush dereferencing the uninitialized NULL pointer held in x20. The initialization path of the cache_set structure needs to be examined to confirm the pointer is set correctly, and the teardown path must tolerate a NULL or invalid root to avoid this kind of error.
Summary
When the kernel produces a kdump, the analysis generally proceeds as follows:
1. Read the crash-time kernel log and locate the faulting function address from the PC (ARM) or RIP (x86) register.
2. Confirm, from the operations performed on site, which steps triggered the crash.
3. Walk the code and reconstruct the function call chain.
4. Analyze the dump file with the crash utility to pinpoint the offending code.