Inspur InLinux (浪潮云启操作系统): Analysis of a bcache Kernel Crash

Preface

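Starting from a real kernel crash, this article walks through a hands-on case to show how the crash utility can be used to analyze and debug a kernel dump captured by kdump. It covers reading the crash log, reconstructing the function call chain, locating the key addresses, and checking the code logic, and it offers a systematic approach and practical techniques for kernel problem analysis. The guide is based on InLinux2312-LTS-SP1 and aims to help readers pick up kdump-based troubleshooting quickly and handle failures more efficiently.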

InLinux version

All steps below were carried out, and the problem was analyzed, on InLinux2312-LTS-SP1.

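Problem analysis

Symptoms:

The test environment has three servers. Each server's storage consists of 2 x 6.4T NVMe drives plus 10 x 12T SATA disks, with bcache used for cache acceleration. Each NVMe drive is split into 5 partitions, and each NVMe partition serves as the cache device for one 12T SATA disk.

To increase the storage density of a single server, the 12T SATA disks were replaced with 16T SATA disks.

The on-site steps were as follows:

1. Create the bcache device.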

make-bcache -C /dev/nvme2n1p1 -B /dev/sda --writeback --force --wipe-bcache

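/dev/sda is a 12T SATA disk.

/dev/nvme2n1p1 is the first partition of the NVMe drive. The partition size is 1024G.

The partitioning command is parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1024GiB

With 10 disks and 2 NVMe drives, each NVMe drive is split into 5 partitions and 10 bcache devices are created in total.

2. Run an fio test on bcache0.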

cat /home/script/run-fio-randrw.sh 
bcache_name=$1
if [ -z "${bcache_name}" ];then
    echo bcache_name is empty
    exit -1
fi

fio --filename=/dev/${bcache_name} --ioengine=libaio --rw=randrw --bs=4k --size=100% --iodepth=128 --numjobs=4 --direct=1 --name=randrw --group_reporting --runtime=30 --ramp_time=5 --lockmem=1G | tee -a ./randrw-iops_k1.log

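Run bash run-fio-randrw.sh bcache0 several times.

3. Power off.

poweroff

The bcache data was not cleared before shutdown.

4. Replace the 12T SATA disk with a 16T SATA disk.

After shutdown, pull the 12T disk and install a 16T disk in its place.

5. Resize the nvme2n1 partition to 1536G.

The kernel panics as soon as the partitioning command completes.

parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB

6. Reboot. The system cannot boot normally and stays in a reboot loop.

7. Boot into rescue mode from the installation disc and wipe the superblock on nvme2n1p1. After the next reboot the system boots normally.

wipefs -af /dev/nvme2n1p1

8. Repartition. The kernel panics again.

parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB

The same steps on the other two servers did not trigger the panic.

On the affected server, the system boots normally once a NULL check on the root member of struct cache_set is added.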

Log analysis

Error log
[root@storage-aqkp-002 127.0.0.1-2024-11-10-11:47:37]# cat   vmcore-dmesg.txt  |grep bcache
[   21.365228] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 987460
[   21.382581] bcache: register_cache() registered cache device nvme3n1p4
[   21.524130] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 1019863
[   21.535174] bcache: register_cache() registered cache device nvme3n1p2
[   21.698388] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 1109121
[   21.708619] bcache: register_cache() registered cache device nvme3n1p3
[   21.868881] bcache: bch_journal_replay() journal replay done, 0 keys in 1 entries, seq 1127759
[   21.879083] bcache: register_cache() registered cache device nvme3n1p5
[   22.054332] bcache: bch_journal_replay() journal replay done, 9 keys in 5 entries, seq 1102627
[   22.064518] bcache: register_cache() registered cache device nvme3n1p1
[  249.369289] bcache: register_bcache() error : device already registered
[  249.369415] bcache: register_bcache() error : device already registered
[  249.370308] bcache: register_bcache() error : device already registered
[  249.370517] bcache: register_bcache() error : device already registered
[  249.371315] bcache: register_bcache() error : device already registered
[  359.459929]  nvme2n1:
[  359.473124]  nvme2n1: p1
[  359.618056] bcache: prio_read() bad csum reading priorities
[  359.624878] bcache: bch_cache_set_error() error on f774c122-6c02-469b-b798-ca53c10efa76: IO error reading priorities, disabling caching
[  359.638311] bcache: register_cache() error nvme2n1p1: failed to run cache set
[  359.646709] bcache: register_bcache() error : failed to register device
[  359.658968] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000200
[  359.669077] Mem abort info:
[  359.672871]   ESR = 0x96000044
[  359.676929]   EC = 0x25: DABT (current EL), IL = 32 bits
[  359.683221]   SET = 0, FnV = 0
[  359.687253]   EA = 0, S1PTW = 0
[  359.691368] Data abort info:
[  359.695212]   ISV = 0, ISS = 0x00000044
[  359.700003]   CM = 0, WnR = 1
[  359.703909] user pgtable: 4k pages, 48-bit VAs, pgdp=00002040022e2000
[  359.711284] [0000000000000200] pgd=0000000000000000, p4d=0000000000000000
[  359.719262] Internal error: Oops: 0000000096000044 [#1] SMP
[  359.725760] Modules linked in: xt_set ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth xt_statistic xt_nat xt_addrtype ip6table_nat ip6_tables ipt
able_mangle xt_physdev xt_conntrack xt_comment xt_mark iptable_filter nf_conntrack_netlink nfnetlink sch_ingress iptable_nat xt_MASQUERADE ip_tables rbd ceph libceph dns_resolver overlay openvswitch nsh n
f_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c 8021q garp mrp bonding vfat fat dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser rdma_
cm iw_cm ib_cm libiscsi scsi_transport_iscsi hns_roce_hw_v2 ib_uverbs ib_core bcache dm_mod crc64 ipmi_ssif ses enclosure aes_ce_blk aes_ce_cipher realtek acpi_ipmi hisi_sas_v3_hw hibmc_drm ghash_ce hclge
 sha1_ce hisi_sas_main nvme drm_vram_helper hns3 ipmi_si drm_ttm_helper nvme_core libsas hnae3 ipmi_devintf ttm host_edma_drv sg scsi_transport_sas i2c_designware_platform
[  359.725845]  nfit
[  359.730936] bcache: register_bcache() error : device already registered
[  359.815384]  ipmi_msghandler i2c_designware_core hisi_uncore_ddrc_pmu hisi_uncore_hha_pmu hisi_uncore_l3c_pmu libnvdimm hisi_uncore_pmu sch_fq_codel br_netfilter bridge stp llc fuse ext4 mbcache jbd2 s
d_mod t10_pi ahci libahci sha2_ce sha256_arm64 sbsa_gwdt libata megaraid_sas(OE) aes_neon_bs aes_neon_blk crypto_simd cryptd
[  359.833119] bcache: register_bcache() error : device already registered
[  359.856792] CPU: 57 PID: 7773 Comm: kworker/57:2 Kdump: loaded Tainted: G           OE     5.10.0-202.0.0.115.ile2312sp1.aarch64 #1
[  359.856793] Hardware name: Enginetech EG920A-G20/BC82AMDDRA, BIOS 6.67 11/15/2023
[  359.856819] Workqueue: events cache_set_flush [bcache]
[  359.894922] pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
[  359.901919] pc : cache_set_flush+0x94/0x190 [bcache]
[  359.907876] lr : cache_set_flush+0x88/0x190 [bcache]
[  359.913815] sp : ffff800046373d50
[  359.918104] x29: ffff800046373d50 x28: 0000000000000000
[  359.924380] x27: ffff800012213c48 x26: ffffbe503baba218
[  359.930651] x25: ffff49cc48ca0808 x24: ffff49cc06674000
[  359.936916] x23: ffff49cc48ca0808 x22: ffff49cc48ca0000
[  359.943172] x21: ffff49cc48ca04a8 x20: 0000000000000000
[  359.949419] x19: 0000000000000200 x18: 0000000000000000
[  359.955662] x17: 0000000000000000 x16: ffffbe503a531760
[  359.961896] x15: 0000000000000004 x14: ffff49cc00004990
[  359.968123] x13: 0000000000000000 x12: ffff49cc3dd02a40
[  359.974342] x11: ffff49cc3dd02910 x10: ffff2a0c0040b6c2
[  359.980556] x9 : ffffbe503a591d88 x8 : ffff49cc3dd02938
[  359.986770] x7 : ffff49cc07f03a18 x6 : 0000000000000000
[  359.992977] x5 : ffff29cc59c16218 x4 : ffff49cc48ca0808
[  359.999182] x3 : 0000000000000000 x2 : ffff49cc48ca0808
[  360.004565] bcache: bch_journal_replay() journal replay done, 11 keys in 6 entries, seq 1096092
[  360.005380] x1 : ffff49cc48ca0808 x0 : 0000000000000001
[  360.016207] bcache: register_cache() registered cache device nvme2n1p3
[  360.022922] Call trace:
[  360.022934]  cache_set_flush+0x94/0x190 [bcache]
[  360.022946]  process_one_work+0x1d8/0x4e0
[  360.045082] bcache: register_bcache() error : device already registered
[  360.045966]  worker_thread+0x154/0x420
[  360.045970]  kthread+0x108/0x150
[  360.046495] bcache: register_bcache() error : device already registered
[  360.066044] bcache: register_bcache() error : device already registered
[  360.066162] bcache: register_bcache() error : device already registered
[  360.070249]  ret_from_fork+0x10/0x18
[  360.070254] Code: 940043e2 72001c1f 54000700 f90006f3 (f9010297)
[  360.090288] bcache: register_bcache() error : device already registered
[  360.091355] bcache: register_bcache() error : device already registered
[  360.097327] SMP: stopping secondary CPUs
[  360.119238] Starting crashdump kernel...
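Conclusions from the log
Forward analysis of the code

From the log, the call chain that leads to the fault can be reconstructed:
register_cache_set -> run_cache_set -> prio_read (fails) -> bch_cache_set_unregister -> bch_cache_set_stop -> __cache_set_unregister -> cache_set_flush -> list_add

When user space runs make-bcache to register a bcache device, register_cache_set is called.

register_cache_set first checks the UUID to make sure it is unique. It then calls bch_cache_set_alloc, which initializes the structure members and registers the closure callbacks; here the callback of the caching closure is set to __cache_set_unregister. After that it calls run_cache_set. run_cache_set first reads the journal from the bcache cache device in order to initialize the btree root. According to the error in the log, "IO error reading priorities", run_cache_set bailed out before the root member of struct cache_set was initialized, so root is still NULL and the later cache_set_flush is bound to panic the kernel.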

static const char *register_cache_set(struct cache *ca)
{
    char buf[12];
    const char *err = "cannot allocate memory";
    struct cache_set *c;
    // Check for a duplicate set UUID
    list_for_each_entry(c, &bch_cache_sets, list)
        if (!memcmp(c->set_uuid, ca->sb.set_uuid, 16)) {
            if (c->cache)
                return "duplicate cache set member";

            goto found;
        }
    // Allocate the in-memory cache set and initialize it from the superblock
    c = bch_cache_set_alloc(&ca->sb);
    if (!c)
        return err;

    err = "error creating kobject";
    if (kobject_add(&c->kobj, bcache_kobj, "%pU", c->set_uuid) ||
        kobject_add(&c->internal, &c->kobj, "internal"))
        goto err;
    // Add the accounting statistics kobjects, for example
    // /sys/block/bcache0/bcache/stats_{total,five_minute,hour,day}
    if (bch_cache_accounting_add_kobjs(&c->accounting, &c->kobj))
        goto err;
    // Set up the bcache entries under debugfs
    bch_debug_init_cache_set(c);
    // Add the new cache set to the global bch_cache_sets list
    list_add(&c->list, &bch_cache_sets);
found:
    // Create symlinks such as /sys/block/bcache0/bcache/cache/cache0/set,
    // and the cache0 link under the cache set directory
    sprintf(buf, "cache%i", ca->sb.nr_this_dev);
    if (sysfs_create_link(&ca->kobj, &c->kobj, "set") ||
        sysfs_create_link(&c->kobj, &ca->kobj, buf))
        goto err;
    // Link the cache and the cache set to each other
    kobject_get(&ca->kobj);
    ca->set = c;
    ca->set->cache = ca;

    err = "failed to run cache set";
    if (run_cache_set(c) < 0)
        goto err;

    return NULL;
err:
    // On failure, unregister the cache set
    bch_cache_set_unregister(c);
    return err;
}

struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
{
    int iter_size;
    struct cache *ca = container_of(sb, struct cache, sb);
    struct cache_set *c = kzalloc(sizeof(struct cache_set), GFP_KERNEL);

    if (!c)
        return NULL;

    __module_get(THIS_MODULE);
    // Initialize the closures used for asynchronous execution
    closure_init(&c->cl, NULL);
    set_closure_fn(&c->cl, cache_set_free, system_wq);

    closure_init(&c->caching, &c->cl);
    set_closure_fn(&c->caching, __cache_set_unregister, system_wq);


bch_cache_set_alloc sets the callback of the caching closure to __cache_set_unregister.

void bch_cache_set_unregister(struct cache_set *c)
{
    set_bit(CACHE_SET_UNREGISTERING, &c->flags);
    // Stop the bcache cache devices and backing devices
    bch_cache_set_stop(c);
}

In this case run_cache_set returns an error.

static int run_cache_set(struct cache_set *c)
{
    const char *err = "cannot allocate memory";
    struct cached_dev *dc, *t;
    struct cache *ca = c->cache;
    struct closure cl;
    LIST_HEAD(journal);
    struct journal_replay *l;

    closure_init_stack(&cl);

    c->nbuckets = ca->sb.nbuckets;
    set_gc_sectors(c);

    if (CACHE_SYNC(&c->cache->sb)) {
        struct bkey *k;
        struct jset *j;

        err = "cannot allocate memory for journal";
        if (bch_journal_read(c, &journal))
            goto err;

        pr_debug("btree_journal_read() done\n");

        err = "no journal entries found";
        if (list_empty(&journal))
            goto err;

        j = &list_entry(journal.prev, struct journal_replay, list)->j;

        err = "IO error reading priorities";
        if (prio_read(ca, j->prio_bucket[ca->sb.nr_this_dev]))
            goto err;

        /*
         * If prio_read() fails it'll call cache_set_error and we'll
         * tear everything down right away, but if we perhaps checked
         * sooner we could avoid journal replay.
         */

        k = &j->btree_root;

        err = "bad btree root";
        if (__bch_btree_ptr_invalid(c, k))
            goto err;

        err = "error reading btree root";
        // c->root is initialized here; if an earlier step fails, this line is never reached and root stays NULL.
        c->root = bch_btree_node_get(c, NULL, k,
                         j->btree_level,
                         true, NULL);
        if (IS_ERR_OR_NULL(c->root))
            goto err;

        list_del_init(&c->root->list);
        rw_unlock(true, c->root);

err:
    while (!list_empty(&journal)) {
        l = list_first_entry(&journal, struct journal_replay, list);
        list_del(&l->list);
        kfree(l);
    }

    closure_sync(&cl);

    bch_cache_set_error(c, "%s", err);

    return -EIO;
}

After run_cache_set fails, bcache calls bch_cache_set_unregister to unregister the cache set. bch_cache_set_unregister calls bch_cache_set_stop, which queues the caching closure so that the previously registered asynchronous callback __cache_set_unregister carries out the teardown.

void bch_cache_set_stop(struct cache_set *c)
{
	if (!test_and_set_bit(CACHE_SET_STOPPING, &c->flags))
		/* closure_fn set to __cache_set_unregister() */
		closure_queue(&c->caching); /* queue the closure; the callback registered earlier runs asynchronously */
}

static inline void closure_queue(struct closure *cl)
{
	struct workqueue_struct *wq = cl->wq;
	/**
	 * Changes made to closure, work_struct, or a couple of other structs
	 * may cause work.func not pointing to the right location.
	 */
	BUILD_BUG_ON(offsetof(struct closure, fn)
		     != offsetof(struct work_struct, func));
	if (wq) {
		INIT_WORK(&cl->work, cl->work.func);
		BUG_ON(!queue_work(wq, &cl->work));
	} else
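		cl->fn(cl); /* no workqueue: call the registered function directly; the caching closure was given system_wq, so it takes the queue_work() branch above */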
}
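
The forward analysis above comes down to an ordering problem: the cache set is allocated zero-filled, the btree root is assigned only late in run_cache_set, and a failure before that point starts the teardown with root still NULL. The following user-space sketch models just that ordering. Every mock_ name is hypothetical and none of this is kernel code; calloc() stands in for kzalloc() and a plain flag stands in for CACHE_SET_UNREGISTERING.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mock_btree {
    int level;
};

struct mock_cache_set {
    struct mock_btree *root;  /* set only near the end of a successful run */
    int unregistering;        /* stands in for CACHE_SET_UNREGISTERING */
};

/* Like bch_cache_set_alloc(): kzalloc() becomes calloc(), so every
 * member, including root, starts out as zero/NULL. */
struct mock_cache_set *mock_cache_set_alloc(void)
{
    return calloc(1, sizeof(struct mock_cache_set));
}

/* Like run_cache_set(): the priorities are read before the btree root,
 * so a prio_read() failure returns with root still NULL. */
int mock_run_cache_set(struct mock_cache_set *c, int prio_read_ok)
{
    static struct mock_btree root_node;

    if (!prio_read_ok)
        return -1;
    c->root = &root_node;
    return 0;
}

/* Like the tail of register_cache_set(): on failure, start teardown. */
int mock_register_cache_set(struct mock_cache_set *c, int prio_read_ok)
{
    if (mock_run_cache_set(c, prio_read_ok) < 0) {
        c->unregistering = 1;   /* bch_cache_set_unregister(c) */
        return -1;
    }
    return 0;
}
```

Calling mock_register_cache_set(c, 0) on a freshly allocated set leaves c->root equal to NULL while c->unregistering is set, which is the state the teardown code has to cope with.
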
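Reverse analysis with crash
ARM registers

X0 through X7 pass arguments and return results. X19 through X28 are callee-saved registers: a function that uses them must save and restore them, so their values survive function calls.

FP (X29) is the frame pointer and LR (X30) is the link register.

On ARM, FP (Frame Pointer) and LR (Link Register) are the two registers central to function calls and stack frame management:

  • FP (Frame Pointer): points to the frame record of the current function. On entry, a function pushes the caller's FP (together with LR) onto the stack and then sets FP to its own frame. Following the chain of saved FP values reconstructs the sequence of function calls, which is what a debugger does to show the call history.

  • LR (Link Register): holds the return address. When a function is called, LR receives the address of the instruction that follows the call. On return, the value in LR is copied into the program counter (PC), so execution continues in the caller.

  • PC (Program Counter): the ARM64 program counter, which holds the address of the instruction being executed. At the time of a crash, PC points at the instruction that caused the fault.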

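Assembly instructions

(1) The stp instruction

In ARM assembly, stp stores the values of two registers to memory. The name stands for Store Pair. It is commonly used to save registers around function calls, especially when several registers must be saved at once, and it replaces what would otherwise be multiple str instructions.

  1. Syntax:
stp <reg1>, <reg2>, [<base>, #<offset>]

<reg1>, <reg2>: the two registers whose contents are stored.

[<base>, #<offset>]: the target memory address, given as a base register plus an offset.

Example: stp x19, x20, [sp, #16]

Explanation:

  • x19 and x20 are the two registers to be stored.
  • [sp, #16] is the memory address: the base is sp (the stack pointer) and the offset is #16.

Concretely, stp x19, x20, [sp, #16] does the following:

  1. The value of x19 is stored on the stack at an offset of 16 bytes from sp.
  2. The value of x20 is stored immediately after x19, 8 bytes higher.

Memory layout:

  • Suppose sp currently holds 0xffffbe50121fa000. After the instruction executes:
    • the value of x19 is stored at 0xffffbe50121fa010 (sp + 16).
    • the value of x20 is stored at 0xffffbe50121fa018 (sp + 24).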
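
The address arithmetic in the memory layout above can be written down as a small C helper. This is only an illustrative sketch: stp_offset_slots and struct stp_slots are made-up names, and the helper models the plain offset form of stp (no write-back) for 64-bit X registers.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses written by "stp xA, xB, [base, #offset]" (no write-back). */
struct stp_slots {
    uint64_t first;   /* where the first register (xA) is stored */
    uint64_t second;  /* where the second register (xB) is stored */
};

/* Hypothetical helper: each X register is 8 bytes wide, so the
 * second register always lands 8 bytes above the first. */
struct stp_slots stp_offset_slots(uint64_t base, int64_t offset)
{
    struct stp_slots s;

    s.first = base + (uint64_t)offset;
    s.second = s.first + 8;
    return s;
}

/* Reproduces the example above: sp = 0xffffbe50121fa000, offset 16. */
void check_stp_example(void)
{
    struct stp_slots s = stp_offset_slots(0xffffbe50121fa000ULL, 16);

    assert(s.first == 0xffffbe50121fa010ULL);   /* sp + 16 */
    assert(s.second == 0xffffbe50121fa018ULL);  /* sp + 24 */
}
```

The same helper applies to every stp in a function prologue: only the base and the immediate change.
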

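Typical use:

stp is typically used to save a pair of registers, in particular in a function prologue (for example the frame pointer and return address, or callee-saved registers), so that they can be restored before the function returns.

Example:

A function prologue usually contains an instruction like the following to save register state:

stp x19, x20, [sp, #-16]!

This instruction means:

  • x19 and x20 are stored at the address sp minus 16 bytes, and sp is updated at the same time (sp = sp - 16).
  • The trailing ! (as in [sp, #-16]!) selects pre-indexed addressing with write-back: the address sp - 16 is computed first, used for the store, and written back to sp.

When saving and restoring a stack frame, stp handles several registers efficiently and avoids a separate str instruction per register.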

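(2) The ldp instruction

The load counterpart of stp is ldp (Load Pair), which loads a pair of values from memory into two registers. It is used in the same way, except that data is read from memory.

Example:

ldp x19, x20, [sp, #16]

This instruction loads the 16 bytes starting at sp + 16 into the two registers: x19 from sp + 16 and x20 from sp + 24.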
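
How stp and ldp cooperate to save and restore registers can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not an emulator: struct toy_cpu and both helpers are hypothetical, the stack is a small array, and the restore uses the post-indexed form ldp x19, x20, [sp], #16, which is the form function epilogues typically pair with a pre-indexed stp.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy machine state: a tiny stack of 8-byte words plus three registers.
 * sp is a byte offset into stack[]; the stack grows downwards. */
struct toy_cpu {
    uint64_t stack[STACK_WORDS];
    uint64_t sp;        /* byte offset into stack[], always 8-byte aligned */
    uint64_t x19, x20;
};

/* Models "stp x19, x20, [sp, #-16]!": compute sp - 16, store the pair
 * there, and write the new address back to sp (pre-index). */
void stp_preindex_push(struct toy_cpu *cpu)
{
    cpu->sp -= 16;
    cpu->stack[cpu->sp / 8] = cpu->x19;
    cpu->stack[cpu->sp / 8 + 1] = cpu->x20;
}

/* Models "ldp x19, x20, [sp], #16": load the pair from sp, then add 16
 * to sp (post-index). */
void ldp_postindex_pop(struct toy_cpu *cpu)
{
    cpu->x19 = cpu->stack[cpu->sp / 8];
    cpu->x20 = cpu->stack[cpu->sp / 8 + 1];
    cpu->sp += 16;
}

/* Save two registers, clobber them, restore them. Returns 1 on success. */
int stp_ldp_round_trip(void)
{
    struct toy_cpu cpu = { .sp = STACK_WORDS * 8, .x19 = 0x1111, .x20 = 0x2222 };

    stp_preindex_push(&cpu);
    assert(cpu.sp == STACK_WORDS * 8 - 16);
    cpu.x19 = 0;            /* the callee is now free to reuse x19/x20 */
    cpu.x20 = 0;
    ldp_postindex_pop(&cpu);

    return cpu.x19 == 0x1111 && cpu.x20 == 0x2222 && cpu.sp == STACK_WORDS * 8;
}
```

After the round trip both registers and sp are back to their initial values, which is the guarantee a caller relies on for callee-saved registers.
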
Analyzing the call stack with crash

Debugging a kdump vmcore file requires the crash utility and the kernel debuginfo RPM packages.

yum install crash kernel-debuginfo kernel-debugsource -y
[root@storage-aqkp-002 127.0.0.1-2024-11-10-11:47:37]# crash /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/vmlinux  /var/crash/127.0.0.1-2024-11-10-11\:47\:37/vmcore

crash 8.0.2-1.ile2312sp1
Copyright (C) 2002-2022  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
Copyright (C) 2015, 2021  VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-unknown-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/vmlinux  [TAINTED]
    DUMPFILE: /var/crash/127.0.0.1-2024-11-10-11:47:37/vmcore  [PARTIAL DUMP]
        CPUS: 96
        DATE: Sun Nov 10 11:46:56 CST 2024
      UPTIME: 00:06:00
LOAD AVERAGE: 0.15, 0.28, 0.17
       TASKS: 1763
    NODENAME: storage-aqkp-002
     RELEASE: 5.10.0-202.0.0.115.ile2312sp1.aarch64
     VERSION: #1 SMP Mon Jun 17 01:51:52 UTC 2024
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 704 GB
       PANIC: "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000200"
         PID: 7773
     COMMAND: "kworker/57:2"
        TASK: ffff49cc44d69340  [THREAD_INFO: ffff49cc44d69340]
         CPU: 57
       STATE: TASK_RUNNING (PANIC)

crash> mod -s bcache /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/kernel/drivers/md/bcache/bcache.ko-5.10.0-202.0.0.115.ile2312sp1.aarch64.debug
     MODULE       NAME                           BASE          SIZE  OBJECT FILE
ffffbe501221b040  bcache                   ffffbe50121e2000  319488  /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/kernel/drivers/md/bcache/bcache.ko-5.10.0-202.0.0.115.ile2312sp1.aarch64.debug 
crash> 
crash> bt
PID: 7773     TASK: ffff49cc44d69340  CPU: 57   COMMAND: "kworker/57:2"
 #0 [ffff800046373800] machine_kexec at ffffbe5039eb54a8
 #1 [ffff8000463739b0] __crash_kexec at ffffbe503a052824
 #2 [ffff8000463739e0] crash_kexec at ffffbe503a0529cc
 #3 [ffff800046373a60] die at ffffbe5039e9445c
 #4 [ffff800046373ac0] die_kernel_fault at ffffbe5039ec698c
 #5 [ffff800046373af0] __do_kernel_fault at ffffbe5039ec6a38
 #6 [ffff800046373b20] do_page_fault at ffffbe503ac76ba4
 #7 [ffff800046373b70] do_translation_fault at ffffbe503ac76ebc
 #8 [ffff800046373b90] do_mem_abort at ffffbe5039ec68ac
 #9 [ffff800046373bc0] el1_abort at ffffbe503ac669bc
#10 [ffff800046373bf0] el1_sync_handler at ffffbe503ac671d4
#11 [ffff800046373d30] el1_sync at ffffbe5039e82230
#12 [ffff800046373d50] cache_set_flush at ffffbe50121fa4c4 [bcache]
#13 [ffff800046373da0] process_one_work at ffffbe5039f5af68
#14 [ffff800046373e00] worker_thread at ffffbe5039f5b3c4
#15 [ffff800046373e50] kthread at ffffbe5039f634b8
crash> dis cache_set_flush+0x94
0xffffbe50121fa4c8 <cache_set_flush+148>:       str     x23, [x20, #512]
crash> dis -s cache_set_flush+0x94
FILE: ./include/linux/list.h
LINE: 71

  66    {
  67            if (!__list_add_valid(new, prev, next))
  68                    return;
  69    
  70            next->prev = new;
* 71            new->next = next;
  72            new->prev = prev;
  73            WRITE_ONCE(prev->next, new);
  74    }

crash> 

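When analyzing with crash, installing the kernel-debuginfo package is not enough; the debug information of the module must be loaded as well.

# Load the bcache debug information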
mod -s bcache /usr/lib/debug/lib/modules/5.10.0-202.0.0.115.ile2312sp1.aarch64/kernel/drivers/md/bcache/bcache.ko-5.10.0-202.0.0.115.ile2312sp1.aarch64.debug

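Start from the address of the faulting instruction recorded in vmcore-dmesg.txt: on ARM this is the content of the PC register, on x86 the RIP register. In this crash it is pc : cache_set_flush+0x94/0x190, and dis -s cache_set_flush+0x94 shows the source line that corresponds to the faulting instruction.

Then analyze the problem by combining the assembly code with the register contents in vmcore-dmesg.txt.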

crash> dis  cache_set_flush
0xffffbe50121fa434 <cache_set_flush>:   mov     x9, x30
0xffffbe50121fa438 <cache_set_flush+4>: nop
0xffffbe50121fa43c <cache_set_flush+8>: paciasp
0xffffbe50121fa440 <cache_set_flush+12>:        stp     x29, x30, [sp, #-80]!
0xffffbe50121fa444 <cache_set_flush+16>:        mov     x29, sp
0xffffbe50121fa448 <cache_set_flush+20>:        stp     x21, x22, [sp, #32]
0xffffbe50121fa44c <cache_set_flush+24>:        mov     x21, x0
0xffffbe50121fa450 <cache_set_flush+28>:        sub     x22, x0, #0x4a8
0xffffbe50121fa454 <cache_set_flush+32>:        stp     x19, x20, [sp, #16]
0xffffbe50121fa458 <cache_set_flush+36>:        add     x0, x22, #0x128
0xffffbe50121fa45c <cache_set_flush+40>:        stp     x23, x24, [sp, #48]
0xffffbe50121fa460 <cache_set_flush+44>:        ldur    x24, [x21, #-56]
0xffffbe50121fa464 <cache_set_flush+48>:        bl      0xffffbe50121f8e88 <bch_cache_accounting_destroy>
0xffffbe50121fa468 <cache_set_flush+52>:        add     x0, x22, #0xc0
0xffffbe50121fa46c <cache_set_flush+56>:        bl      0xffffbe501220b890 <bcache_device_free+2504>
0xffffbe50121fa470 <cache_set_flush+60>:        add     x0, x22, #0x60
0xffffbe50121fa474 <cache_set_flush+64>:        bl      0xffffbe501220b6e0 <bcache_device_free+2072>
0xffffbe50121fa478 <cache_set_flush+68>:        ldr     x0, [x22, #2256]
0xffffbe50121fa47c <cache_set_flush+72>:        cbz     x0, 0xffffbe50121fa48c <cache_set_flush+88>
0xffffbe50121fa480 <cache_set_flush+76>:        cmn     x0, #0x1, lsl #12
0xffffbe50121fa484 <cache_set_flush+80>:        b.hi    0xffffbe50121fa48c <cache_set_flush+88>  // b.pmore
0xffffbe50121fa488 <cache_set_flush+84>:        bl      0xffffbe501220b56c <bcache_device_free+1700>
0xffffbe50121fa48c <cache_set_flush+88>:        add     x0, x22, #0x8, lsl #12
0xffffbe50121fa490 <cache_set_flush+92>:        ldr     x20, [x0, #17656]
0xffffbe50121fa494 <cache_set_flush+96>:        cmn     x20, #0x1, lsl #12
0xffffbe50121fa498 <cache_set_flush+100>:       b.hi    0xffffbe50121fa4d8 <cache_set_flush+164>  // b.pmore
0xffffbe50121fa49c <cache_set_flush+104>:       str     x25, [sp, #64]
0xffffbe50121fa4a0 <cache_set_flush+108>:       add     x19, x20, #0x200
0xffffbe50121fa4a4 <cache_set_flush+112>:       add     x25, x21, #0x360
0xffffbe50121fa4a8 <cache_set_flush+116>:       mov     x0, x19
0xffffbe50121fa4ac <cache_set_flush+120>:       ldr     x23, [x21, #864]
0xffffbe50121fa4b0 <cache_set_flush+124>:       mov     x1, x25
0xffffbe50121fa4b4 <cache_set_flush+128>:       mov     x2, x23
0xffffbe50121fa4b8 <cache_set_flush+132>:       bl      0xffffbe501220b440 <bcache_device_free+1400>
0xffffbe50121fa4bc <cache_set_flush+136>:       tst     w0, #0xff
0xffffbe50121fa4c0 <cache_set_flush+140>:       b.eq    0xffffbe50121fa5a0 <cache_set_flush+364>  // b.none
0xffffbe50121fa4c4 <cache_set_flush+144>:       str     x19, [x23, #8]
0xffffbe50121fa4c8 <cache_set_flush+148>:       str     x23, [x20, #512]
0xffffbe50121fa4cc <cache_set_flush+152>:       str     x25, [x20, #520]
0xffffbe50121fa4d0 <cache_set_flush+156>:       str     x19, [x21, #864]
0xffffbe50121fa4d4 <cache_set_flush+160>:       ldr     x25, [sp, #64]
0xffffbe50121fa4d8 <cache_set_flush+164>:       ldur    x0, [x21, #-72]
0xffffbe50121fa4dc <cache_set_flush+168>:       tst     w0, #0x8
0xffffbe50121fa4e0 <cache_set_flush+172>:       b.ne    0xffffbe50121fa4f8 <cache_set_flush+196>  // b.any
0xffffbe50121fa4e4 <cache_set_flush+176>:       ldr     x0, [x22, #2056]
0xffffbe50121fa4e8 <cache_set_flush+180>:       add     x23, x21, #0x360
0xffffbe50121fa4ec <cache_set_flush+184>:       sub     x19, x0, #0x200
0xffffbe50121fa4f0 <cache_set_flush+188>:       cmp     x23, x0
0xffffbe50121fa4f4 <cache_set_flush+192>:       b.ne    0xffffbe50121fa584 <cache_set_flush+336>  // b.any
0xffffbe50121fa4f8 <cache_set_flush+196>:       ldr     x0, [x24, #2504]
0xffffbe50121fa4fc <cache_set_flush+200>:       cbz     x0, 0xffffbe50121fa504 <cache_set_flush+208>
0xffffbe50121fa500 <cache_set_flush+204>:       bl      0xffffbe501220b56c <bcache_device_free+1700>
0xffffbe50121fa504 <cache_set_flush+208>:       add     x19, x22, #0x8, lsl #12
0xffffbe50121fa508 <cache_set_flush+212>:       ldr     x0, [x19, #18568]
0xffffbe50121fa50c <cache_set_flush+216>:       cbz     x0, 0xffffbe50121fa52c <cache_set_flush+248>
0xffffbe50121fa510 <cache_set_flush+220>:       mov     x0, #0xc710                     // #50960
0xffffbe50121fa514 <cache_set_flush+224>:       add     x22, x22, x0
0xffffbe50121fa518 <cache_set_flush+228>:       mov     x0, x22
0xffffbe50121fa51c <cache_set_flush+232>:       bl      0xffffbe501220b710 <bcache_device_free+2120>
0xffffbe50121fa520 <cache_set_flush+236>:       ldr     x1, [x19, #18216]
0xffffbe50121fa524 <cache_set_flush+240>:       mov     x0, x22
0xffffbe50121fa528 <cache_set_flush+244>:       blr     x1
0xffffbe50121fa52c <cache_set_flush+248>:       str     xzr, [x21]
0xffffbe50121fa530 <cache_set_flush+252>:       str     xzr, [x21, #24]
0xffffbe50121fa534 <cache_set_flush+256>:       dmb     ish
0xffffbe50121fa538 <cache_set_flush+260>:       mov     w1, #0x1                        // #1
0xffffbe50121fa53c <cache_set_flush+264>:       mov     x0, x21
0xffffbe50121fa540 <cache_set_flush+268>:       movk    w1, #0x4000, lsl #16
0xffffbe50121fa544 <cache_set_flush+272>:       bl      0xffffbe50121ef064 <closure_sub>
0xffffbe50121fa548 <cache_set_flush+276>:       ldp     x19, x20, [sp, #16]
0xffffbe50121fa54c <cache_set_flush+280>:       ldp     x21, x22, [sp, #32]
0xffffbe50121fa550 <cache_set_flush+284>:       ldp     x23, x24, [sp, #48]
0xffffbe50121fa554 <cache_set_flush+288>:       ldp     x29, x30, [sp], #80
0xffffbe50121fa558 <cache_set_flush+292>:       autiasp
0xffffbe50121fa55c <cache_set_flush+296>:       ret
0xffffbe50121fa560 <cache_set_flush+300>:       mov     x0, x19
0xffffbe50121fa564 <cache_set_flush+304>:       mov     x1, #0x0                        // #0
0xffffbe50121fa568 <cache_set_flush+308>:       bl      0xffffbe50121e9714 <__bch_btree_node_write>
0xffffbe50121fa56c <cache_set_flush+312>:       mov     x0, x20
0xffffbe50121fa570 <cache_set_flush+316>:       bl      0xffffbe501220b704 <bcache_device_free+2108>
0xffffbe50121fa574 <cache_set_flush+320>:       ldr     x1, [x19, #512]
0xffffbe50121fa578 <cache_set_flush+324>:       sub     x19, x1, #0x200
0xffffbe50121fa57c <cache_set_flush+328>:       cmp     x23, x1
0xffffbe50121fa580 <cache_set_flush+332>:       b.eq    0xffffbe50121fa4f8 <cache_set_flush+196>  // b.none
0xffffbe50121fa584 <cache_set_flush+336>:       add     x20, x19, #0x90
0xffffbe50121fa588 <cache_set_flush+340>:       mov     x0, x20
0xffffbe50121fa58c <cache_set_flush+344>:       bl      0xffffbe501220b4e8 <bcache_device_free+1568>
0xffffbe50121fa590 <cache_set_flush+348>:       ldr     x0, [x19, #176]
0xffffbe50121fa594 <cache_set_flush+352>:       tst     w0, #0x2
0xffffbe50121fa598 <cache_set_flush+356>:       b.eq    0xffffbe50121fa56c <cache_set_flush+312>  // b.none
0xffffbe50121fa59c <cache_set_flush+360>:       b       0xffffbe50121fa560 <cache_set_flush+300>
0xffffbe50121fa5a0 <cache_set_flush+364>:       ldr     x25, [sp, #64]
0xffffbe50121fa5a4 <cache_set_flush+368>:       b       0xffffbe50121fa4d8 <cache_set_flush+164>
0xffffbe50121fa5a8 <cache_set_flush+372>:       nop
0xffffbe50121fa5ac <cache_set_flush+376>:       nop
0xffffbe50121fa5b0 <cache_set_flush+380>:       ldrsb   w4, [x5, #2724]
0xffffbe50121fa5b4 <cache_set_flush+384>:       .inst   0xffffbe50 ; undefined
0xffffbe50121fa5b8 <cache_set_flush+388>:       nop
0xffffbe50121fa5bc <cache_set_flush+392>:       ldr     x16, 0xffffbe50121fa5b0 <cache_set_flush+380>
0xffffbe50121fa5c0 <cache_set_flush+396>:       br      x16
crash> 

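pc : cache_set_flush+0x94/0x190 means that the crash happened while executing the instruction at offset 0x94 inside cache_set_flush. The disassembly shows that this instruction is str x23, [x20, #512]. With x20 equal to 0 in the register dump, it writes to address 0x200, which is the NULL pointer access reported in the log.

The pc : cache_set_flush+0x94/0x190 entry in the log is the value of the program counter (pc) at the time of the crash, that is, the position of the current instruction within cache_set_flush. In detail:

cache_set_flush+0x94/0x190

  • cache_set_flush is the function name: the program was executing inside cache_set_flush.

  • +0x94 means the program counter is at offset 0x94 (148 in decimal) from the start of cache_set_flush.

  • /0x190 means the whole cache_set_flush function is 0x190 bytes long (400 in decimal). The total length serves as a reference for how far into the function the crash occurred.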
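
The symbol+offset/size notation described above can be decoded mechanically. The sketch below is a hypothetical helper, not part of crash or the kernel: parse_oops_loc splits the text with sscanf, and oops_loc_address adds the offset to the function's start address, which in this dump is 0xffffbe50121fa434 according to the dis output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One decoded "symbol+0xOFF/0xSIZE" location from an oops line. */
struct oops_loc {
    char symbol[64];
    unsigned long offset;  /* bytes from the start of the function */
    unsigned long size;    /* total size of the function in bytes */
};

/* Hypothetical helper: returns 0 on success, -1 if the text does not
 * look like "name+0x.../0x...". */
int parse_oops_loc(const char *text, struct oops_loc *out)
{
    if (sscanf(text, "%63[^+]+%lx/%lx", out->symbol, &out->offset, &out->size) != 3)
        return -1;
    return 0;
}

/* Absolute address of the location, given the function's start address. */
uint64_t oops_loc_address(uint64_t func_start, const struct oops_loc *loc)
{
    return func_start + loc->offset;
}
```

For cache_set_flush+0x94/0x190 and the start address 0xffffbe50121fa434, the computed address is 0xffffbe50121fa4c8, the address crash prints for cache_set_flush+148.
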

crash> dis -s cache_set_flush+0x94
FILE: ./include/linux/list.h
LINE: 71

  66    {
  67            if (!__list_add_valid(new, prev, next))
  68                    return;
  69    
  70            next->prev = new;
* 71            new->next = next;
  72            new->prev = prev;
  73            WRITE_ONCE(prev->next, new);
  74    }

The dis -s cache_set_flush+0x94 output above shows that the fault occurs in a linked-list operation.
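
Why the faulting address is exactly 0x200 can be checked with offsetof arithmetic. The sketch below assumes, based only on the add x19, x20, #0x200 instruction in the disassembly, that the list member sits at offset 0x200 inside struct btree; struct mock_btree is a stand-in that models nothing else, and the helper works on integers so nothing is dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Stand-in for struct btree. ASSUMPTION: only the position of the list
 * member (0x200) is modelled, as read from "add x19, x20, #0x200". */
struct mock_btree {
    unsigned char pad[0x200];
    struct list_head list;
};

/* Address written by "new->next = next" when new == &root->list.
 * Pure integer arithmetic, so nothing is actually dereferenced. */
uint64_t new_next_store_address(uint64_t root)
{
    return root + offsetof(struct mock_btree, list)
                + offsetof(struct list_head, next);
}
```

With root equal to 0 the result is 0x200, the virtual address in the "Unable to handle kernel NULL pointer dereference" message. A NULL root therefore does not fault at address 0 but at the offset of the member being written.
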

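The output of crash> dis cache_set_flush gives the disassembly of cache_set_flush. The sections below go through the relevant parts to find the cause of the crash and pin down the NULL pointer access.

Key parts of the assembly code:

Register contents:

x0 carries the first argument on entry. For cache_set_flush this is the struct closure pointer, which points at the caching member embedded in struct cache_set. x0 is then reused as the argument register for each subsequent call.

x20 is loaded by add x0, x22, #0x8, lsl #12 followed by ldr x20, [x0, #17656]. Together they read the 8 bytes at offset 0x8000 + 17656 = 50424 (0xc4f8) from the start of struct cache_set, and the loaded value is used several times afterwards.

x19 holds derived addresses: first x20 + 0x200, and later the entries visited while walking the btree_cache list. It is modified several times in the code that follows.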
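
The two-instruction pattern used to load x20 exists because a single ldr immediate cannot reach an offset this large, so the compiler first adds a shifted immediate. A small helper makes the arithmetic explicit. The function name is made up for this sketch; the inputs are taken from the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Effective offset, relative to xBase, of the pair
 *     add  xT, xBase, #imm12, lsl #12
 *     ldr  xD, [xT, #ldr_imm]
 */
uint64_t add_shifted_ldr_offset(uint64_t imm12, uint64_t ldr_imm)
{
    return (imm12 << 12) + ldr_imm;
}
```

add_shifted_ldr_offset(0x8, 17656) evaluates to 50424 (0xc4f8). In crash, the member at that offset can be cross-checked with struct -o cache_set, and its value should be readable with struct cache_set.root ffff49cc48ca0000, assuming the member at 0xc4f8 is root as the rest of the analysis indicates.
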

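Execution flow:

Setting up the stack frame:

stp x29, x30, [sp, #-80]!
mov x29, sp

These two instructions save the caller's frame pointer (x29) and the return address (x30). The stp extends the stack downward by 80 bytes and stores the caller's FP and LR at the new top of the stack; mov x29, sp then makes x29 point at this frame record.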
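
The 80-byte frame can be laid out slot by slot from the stp and str instructions in the prologue. The sketch below only restates those immediates as constants; the enum and both helpers are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_SIZE 80   /* from "stp x29, x30, [sp, #-80]!" */

/* Byte offsets of the saved registers inside the frame, read from the
 * stp/str instructions of cache_set_flush. */
enum frame_slot {
    SLOT_X29_X30 = 0,   /* stp x29, x30, [sp, #-80]! */
    SLOT_X19_X20 = 16,  /* stp x19, x20, [sp, #16]   */
    SLOT_X21_X22 = 32,  /* stp x21, x22, [sp, #32]   */
    SLOT_X23_X24 = 48,  /* stp x23, x24, [sp, #48]   */
    SLOT_X25     = 64   /* str x25, [sp, #64]        */
};

/* sp of the caller, given sp inside the function after the prologue. */
uint64_t caller_sp(uint64_t sp_in_function)
{
    return sp_in_function + FRAME_SIZE;
}

/* Address of a saved-register slot inside the current frame. */
uint64_t slot_address(uint64_t sp_in_function, enum frame_slot slot)
{
    return sp_in_function + (uint64_t)slot;
}
```

With sp = ffff800046373d50 from the register dump, caller_sp returns ffff800046373da0, which is the frame address bt prints for process_one_work, the caller of cache_set_flush.
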

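  • Argument passing and pointer arithmetic:

In the log, x0 : 0000000000000001 is the value of x0 at the moment of the crash. By then x0 no longer holds the first argument: it was last written by the call at cache_set_flush+132, whose return value is tested by tst w0, #0xff and which matches the __list_add_valid(new, prev, next) check in the C source. The original first argument was saved in x21 at function entry.

mov x21, x0
sub x22, x0, #0x4a8

Subtracting 0x4a8 (1192, the offset of the caching member) from the first argument yields the address of the enclosing struct cache_set, which is what container_of does in the C source. x22 holds the address of struct cache_set.

x21 holds the first argument, the address of the caching member.

In the log x21 is ffff49cc48ca04a8 and x22 is ffff49cc48ca0000.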
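
The sub x22, x0, #0x4a8 step is the compiled form of container_of. The following user-space sketch shows the idea with mock types: the 80-byte closure size and the 1192-byte offset are taken from the struct -o cache_set output, while struct mock_closure, struct mock_cache_set, and both helper functions are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space version of the kernel's container_of() idea. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Mock types. ASSUMPTION: struct closure is 80 bytes and caching sits
 * at offset 1192, as printed by "struct -o cache_set". */
struct mock_closure {
    unsigned char opaque[80];
};

struct mock_cache_set {
    unsigned char before_caching[1192];
    struct mock_closure caching;
};

/* What cache_set_flush() does with its closure argument. */
struct mock_cache_set *cache_set_from_closure(struct mock_closure *cl)
{
    return container_of(cl, struct mock_cache_set, caching);
}

/* The same step on raw addresses: "sub x22, x0, #0x4a8". */
uint64_t cache_set_address(uint64_t closure_address)
{
    return closure_address - offsetof(struct mock_cache_set, caching);
}
```

cache_set_address(0xffff49cc48ca04a8) yields 0xffff49cc48ca0000, which reproduces the x21 and x22 values in the register dump.
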

crash> struct -o  cache_set 
struct cache_set {
      [0] struct closure cl;
     [80] struct list_head list;
     [96] struct kobject kobj;
    [192] struct kobject internal;
    [288] struct dentry *debug;
    [296] struct cache_accounting accounting;
   [1120] unsigned long flags;
   [1128] atomic_t idle_counter;
   [1132] atomic_t at_max_writeback_rate;
   [1136] struct cache *cache;
   [1144] struct bcache_device **devices;
   [1152] unsigned int devices_max_used;
   [1156] atomic_t attached_dev_nr;
   [1160] struct list_head cached_devs;
   [1176] uint64_t cached_dev_sectors;
   [1184] atomic_long_t flash_dev_dirty_sectors;
   [1192] struct closure caching;
   [1272] struct closure sb_write;
   [1352] struct semaphore sb_write_mutex;
   [1376] mempool_t search;
   [1448] mempool_t bio_meta;
   [1520] struct bio_set bio_split;
   [1952] struct shrinker shrink;
   [2016] struct mutex bucket_lock;
   [2048] unsigned short bucket_bits;
   [2050] unsigned short block_bits;
   [2052] unsigned int btree_pages;
   [2056] struct list_head btree_cache;
   [2072] struct list_head btree_cache_freeable;
   [2088] struct list_head btree_cache_freed;

Offset 296 (0x128) is the accounting member. x0 is set to the address of struct cache_set plus 296, that is, the address of c->accounting.

Memory access and function call:

add     x0, x22, #0x128
stp     x23, x24, [sp, #48]
ldur    x24, [x21, #-56]
bl      0xffffbe50121f8e88 <bch_cache_accounting_destroy>

crash> struct -o cache_set.cache
struct cache_set {
   [1136] struct cache *cache;
}
crash> struct -o cache_set.caching
struct cache_set {
   [1192] struct closure caching;
}

Corresponding C code

	bch_cache_accounting_destroy(&c->accounting);

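The ldur reads from x21 at offset -56 (1136 - 1192), that is, it steps back from the caching member to the cache member and loads the cache pointer into x24.

bch_cache_accounting_destroy is then called with x0 as its argument, the address of c->accounting.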

0xffffbe50121fa468 <cache_set_flush+52>: add x0, x22, #0xc0
0xffffbe50121fa46c <cache_set_flush+56>: bl 0xffffbe501220b890 <bcache_device_free+2504>

Offset 0xc0 is the internal member of struct cache_set.

crash> struct -o cache_set.internal
struct cache_set {
    [192] struct kobject internal;
}

The assembly above corresponds to this C code

	kobject_put(&c->internal);

0xffffbe50121fa470 <cache_set_flush+60>:        add     x0, x22, #0x60
0xffffbe50121fa474 <cache_set_flush+64>:        bl      0xffffbe501220b6e0 <bcache_device_free+2072>

Offset 0x60 corresponds to the kobj member of the cache_set structure.

crash> struct -o cache_set.kobj
struct cache_set {
   [96] struct kobject kobj;
}

The assembly above corresponds to this C code:

kobject_del(&c->kobj);

The member offsets of the structure confirm this.

crash> struct -o cache_set.kobj -x
struct cache_set {
     [0x60] struct kobject kobj;
}
crash> struct -o cache_set.internal -x
struct cache_set {
     [0xc0] struct kobject internal;
}
crash> 

0xc0 and 0x60 match the immediates used in the two add instructions exactly.

NULL check on gc_thread:

0xffffbe50121fa478 <cache_set_flush+68>:        ldr     x0, [x22, #2256]
0xffffbe50121fa47c <cache_set_flush+72>:        cbz     x0, 0xffffbe50121fa48c <cache_set_flush+88>
0xffffbe50121fa480 <cache_set_flush+76>:        cmn     x0, #0x1, lsl #12
0xffffbe50121fa484 <cache_set_flush+80>:        b.hi    0xffffbe50121fa48c <cache_set_flush+88>  // b.pmore

ldr x0, [x22, #2256]
cbz x0, 0xffffbe50121fa48c <cache_set_flush+88>
cmn x0, #0x1, lsl #12
b.hi 0xffffbe50121fa48c <cache_set_flush+88>
bl      0xffffbe501220b56c <bcache_device_free+1700>

crash> struct -o  cache_set 
struct cache_set {
      [0] struct closure cl;
     [80] struct list_head list;
  ...
    [296] struct cache_accounting accounting;
  ...
   [2192] struct gc_stat gc_stats;
   [2240] size_t nbuckets;
   [2248] size_t avail_nbuckets;
   [2256] struct task_struct *gc_thread;

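The value at x22 + 2256, which is the gc_thread pointer, is loaded into x0. cbz then tests whether it is NULL, and cmn/b.hi test whether it is an error pointer. In either case execution jumps to cache_set_flush+88; otherwise the bl that follows is executed.

Corresponding C code: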

    if (!IS_ERR_OR_NULL(c->gc_thread))
        kthread_stop(c->gc_thread);

Handling x20:

0xffffbe50121fa48c <cache_set_flush+88>:        add     x0, x22, #0x8, lsl #12
0xffffbe50121fa490 <cache_set_flush+92>:        ldr     x20, [x0, #17656]
0xffffbe50121fa494 <cache_set_flush+96>:        cmn     x20, #0x1, lsl #12
0xffffbe50121fa498 <cache_set_flush+100>:       b.hi    0xffffbe50121fa4d8 <cache_set_flush+156>  // b.pmore
0xffffbe50121fa49c <cache_set_flush+104>:       str     x25, [sp, #64]
0xffffbe50121fa4a0 <cache_set_flush+108>:       add     x19, x20, #0x200
0xffffbe50121fa4a4 <cache_set_flush+112>:       add     x25, x21, #0x360
0xffffbe50121fa4a8 <cache_set_flush+116>:       mov     x0, x19
0xffffbe50121fa4ac <cache_set_flush+120>:       ldr     x23, [x21, #864]
0xffffbe50121fa4b0 <cache_set_flush+124>:       mov     x1, x25
0xffffbe50121fa4b4 <cache_set_flush+128>:       mov     x2, x23
0xffffbe50121fa4b8 <cache_set_flush+132>:       bl      0xffffbe501220b440 <bcache_device_free+1400>

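The first instruction adds 0x8 shifted left by 12 bits to x22 and stores the result in x0. x22 is the pointer to struct cache_set.

#0x8, lsl #12 means 0x8 << 12, which is 0x8000, so x0 = x22 + 0x8000. The following ldr adds a further 17656 (0x44f8), and 0x8000 + 17656 = 0xc4f8 is exactly the offset of the root member in struct cache_set. x20 therefore holds the value of c->root, a struct btree pointer.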
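The same calculation can be written down as compile-time checks. This is a small sketch that uses only the three numbers quoted above; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Immediate of "add x0, x22, #0x8, lsl #12". */
#define ADD_IMM   ((uint64_t)0x8 << 12)
/* Displacement of "ldr x20, [x0, #17656]". */
#define LDR_DISP  ((uint64_t)17656)
/* Offset reported by "struct -o cache_set.root -x". */
#define OFF_ROOT  ((uint64_t)0xc4f8)

static_assert(ADD_IMM == 0x8000, "0x8 << 12 is 0x8000");
static_assert(ADD_IMM + LDR_DISP == OFF_ROOT, "add + ldr reach the root member");

/* Address of the root member for a given cache_set base (x22). */
static inline uint64_t root_member_addr(uint64_t x22)
{
    return x22 + ADD_IMM + LDR_DISP;
}
```

The compiler needs two instructions here because the scaled unsigned immediate of a 64-bit ldr tops out at 32760, which is smaller than 0xc4f8 (50424).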

crash> struct -o cache_set.root -x
struct cache_set {
 [0xc4f8] struct btree *root;
}

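Next, x20 is tested. If the condition holds, execution jumps to cache_set_flush+156.

cmn x20, #0x1, lsl #12 (address 0xffffbe50121fa494)

Purpose: adds 0x1 << 12 (0x1000) to x20, discards the sum, and updates the condition flags.

  • Explanation: cmn only sets the flags used by the branch that follows. The addition carries out (C set) when x20 >= 0xfffffffffffff000, and Z is set only when x20 is exactly 0xfffffffffffff000.

b.hi 0xffffbe50121fa4d8 (address 0xffffbe50121fa498)

Purpose: branches to 0xffffbe50121fa4d8 when C is set and Z is clear, that is, when x20 > 0xfffffffffffff000 as an unsigned value.

Explanation: that range (-4095 to -1) is the error-pointer range tested by IS_ERR(), so an error pointer in x20 jumps to cache_set_flush+156 and skips the list insertion. A NULL x20 does not satisfy the condition and falls through.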
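To make the flag logic concrete, here is a minimal user-space model of the cmn/b.hi pair next to the range test that the kernel's IS_ERR() performs. The function names are hypothetical, and the model covers only the C and Z flags that the HI condition reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Lowest value that counts as an error pointer on a 64-bit kernel. */
static_assert((uint64_t)-MAX_ERRNO == 0xfffffffffffff001ULL, "error range start");

/* Models "cmn xN, #0x1, lsl #12" followed by "b.hi": true if the branch is taken. */
static inline bool cmn_bhi_taken(uint64_t x)
{
    uint64_t sum = x + 0x1000;  /* cmn is an adds whose result is discarded */
    bool carry = sum < x;       /* C flag: unsigned carry out */
    bool zero = (sum == 0);     /* Z flag */
    return carry && !zero;      /* HI: C set and Z clear */
}

/* Same range test as IS_ERR(): the top 4095 values of the address space. */
static inline bool is_err_value(uint64_t x)
{
    return x >= (uint64_t)-MAX_ERRNO;
}
```

Both functions return false for 0, so a NULL pointer is never filtered out by this instruction pair alone.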

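Corresponding C code. Note that the listing shows only the cmn/b.hi error-pointer test on x20 and no cbz like the one used for gc_thread, which suggests that the NULL half of this check is not present in the code compiled for this kernel: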

    if (!IS_ERR_OR_NULL(c->root))
        list_add(&c->root->list, &c->btree_cache);

static inline void list_add(struct list_head *new, struct list_head *head)
{
    __list_add(new, head, head->next);
}
static inline void __list_add(struct list_head *new,
                  struct list_head *prev,
                  struct list_head *next)
{
    if (!__list_add_valid(new, prev, next))
        return;

    next->prev = new;
    new->next = next;
    new->prev = prev;
    WRITE_ONCE(prev->next, new);
}

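This matches the code at the faulting location shown by dis -s cache_set_flush+0x94.

In addition, the assembly shows that x20 holds the value of the root member of cache_set, and vmcore-dmesg.txt reports x20: 0000000000000000.

Memory writes and further function calls: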

str x19, [sp, #64]
str x23, [x20, #512]

These two instructions store x19 to the stack and x23 to memory at x20 + 512.

crash> struct -o  btree.list           
struct btree {
  [512] struct list_head list;
}

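The list member sits at offset 512 in struct btree, so the kernel crashes when it writes to offset 512 of the NULL root pointer, that is, to address 0x200.

Analysis conclusion

The disassembly of cache_set_flush shows that when ldr x20, [x0, #17656] executes, x0 is the cache_set pointer (x22) plus 0x8000, and x20 receives the value stored at offset 0xc4f8 of that structure, which is the root member. x20 is NULL, so the subsequent list operation on root->list causes the kernel panic.

The crash is caused by cache_set_flush dereferencing x20 (c->root) while it is NULL or uninitialized. The initialization path of the cache_set structure needs to be checked to confirm that the pointer is set correctly. If the pointer can be NULL or invalid at this point, either the initialization must be fixed or the pointer must be checked before use.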
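The effect of guarding the root pointer can be illustrated in user space. The sketch below is not the kernel patch: struct btree_mock, is_err_or_null, list_add_head and park_root are hypothetical stand-ins, and only the offset of list (512) is taken from the crash output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Mock of struct btree: only the position of list (offset 512) is real. */
struct btree_mock {
    char pad[512];
    struct list_head list;
};
static_assert(offsetof(struct btree_mock, list) == 512, "list at offset 512");

/* User-space equivalent of IS_ERR_OR_NULL(). */
static inline bool is_err_or_null(const void *p)
{
    return p == NULL || (uintptr_t)p >= (uintptr_t)-4095;
}

/* Insert new right after head, as list_add() does. */
static inline void list_add_head(struct list_head *new, struct list_head *head)
{
    head->next->prev = new;
    new->next = head->next;
    new->prev = head;
    head->next = new;
}

/* Returns true if root was linked into btree_cache, false if it was skipped. */
static inline bool park_root(struct btree_mock *root, struct list_head *btree_cache)
{
    if (is_err_or_null(root))
        return false;   /* without this guard, &root->list would be 0x200 */
    list_add_head(&root->list, btree_cache);
    return true;
}
```

With a NULL root the guarded function leaves the list untouched instead of writing to address 0x200, which is the behaviour observed after the NULL check on root was added on the affected server.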

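Summary

When the kernel crashes and kdump produces a dump, the analysis generally follows these steps:

1. Read the log written at crash time and use the PC (arm64) or RIP (x86_64) register to locate the address of the faulting function.

2. Use the on-site operation record to confirm which steps triggered the crash.

3. Walk through the code and the function call stack.

4. Analyze the dump file with crash commands to confirm the code that caused the problem.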
