Manually fixing the RabbitMQ error "Crash dump is being written to"

RabbitMQ fails to boot with the following error:

2023-11-07 16:38:52.682 [error] emulator Error in process <0.368.0> on node 'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local' with exit value:
{shutdown,[{mnesia_loader,handle_exit,2,[{file,"mnesia_loader.erl"},{line,963}]},{mnesia_loader,tab_receiver,5,[{file,"mnesia_loader.erl"},{line,440}]},{mnesia_loader,spawned_receiver,8,[{file,"mnesia_loader.erl"},{line,343}]}]}
2023-11-07 16:38:52.683 [error] emulator Error in process <0.367.0> on node 'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local' with exit value:
{badarg,[{ets,insert,[mnesia_gvar,{last_error,{{shutdown,[{mnesia_loader,handle_exit,2,[{file,"mnesia_loader.erl"},{line,963}]},{mnesia_loader,tab_receiver,5,[{file,"mnesia_loader.erl"},{line,440}]},{mnesia_loader,spawned_receiver,8,[{file,"mnesia_loader.erl"},{line,343}]}]},[{mnesia_loader,wait_on_load_complete,1,[{file,"mnesia_loader.erl"},{line,359}]},{mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,840}]},{mnesia_tm,execute_transaction,5,[{file,"mnesia_tm.erl"},{line,816}]},{mnesia_loader,init_receiver,5,[{file,"mnesia_loader.erl"},{line,285}]},{mnesia_loader,do_get_network_copy,5,[{file,"mnesia_loader.erl"},{line,221}]},{mnesia_controller,'-load_table_fun/1-fun-4-',5,[{file,"mnesia_controller.erl"},{line,2186}]},{mnesia_controller,'-load_and_reply/2-fun-0-',2,[{file,"mnesia_controller.erl"},{line,2133}]}]}}],[]},{mnesia_lib,set,2,[{file,"mnesia_lib.erl"},{line,443}]},{mnesia_lib,fix_error,1,[{file,"mnesia_lib.erl"},{line,906}]},{mnesia_tm,return_abort,3,[{file,"mnesia_tm.erl"},{line,962}]},{mnesia_loader,init_receiver,5,[{file,"mnesia_loader.erl"},{line,285}]},{mnesia_loader,do_get_network_copy,5,[{file,"mnesia_loader.erl"},{line,221}]},{mnesia_controller,'-load_table_fun/1-fun-4-',5,[{file,"mnesia_controller.erl"},{line,2186}]},{mnesia_controller,'-load_and_reply/2-fun-0-',2,[{file,"mnesia_controller.erl"},{line,2133}]}]}
2023-11-07 16:38:52.685 [info] <0.43.0> Application mnesia exited with reason: stopped
2023-11-07 16:38:52.685 [info] <0.43.0> Application tools exited with reason: stopped
2023-11-07 16:38:52.685 [error] <0.8.0> 
Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 465
    rabbit:broker_start/1 line 341
    rabbit:start_loaded_apps/2 line 586
    app_utils:manage_applications/6 line 126
    lists:foldl/3 line 1263
    rabbit:'-handle_app_error/1-fun-0-'/3 line 709
throw:{could_not_start,ra,
       {ra,
        {{shutdown,
          {failed_to_start_child,ra_system_sup,
           {shutdown,
            {failed_to_start_child,ra_log_sup,
             {shutdown,
              {failed_to_start_child,ra_log_wal_sup,
               {shutdown,
                {failed_to_start_child,ra_log_wal,
                 {{case_clause,{ok,<<>>}},
                  [{ra_log_wal,open_existing,1,
                    [{file,"src/ra_log_wal.erl"},{line,556}]},
                   {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,
                    [{file,"src/ra_log_wal.erl"},{line,240}]},
                   {ra_log_wal,recover_wal,2,
                    [{file,"src/ra_log_wal.erl"},{line,243}]},
                   {ra_log_wal,init,1,
                    [{file,"src/ra_log_wal.erl"},{line,186}]},
                   {gen_batch_server,init_it,6,
                    [{file,"src/gen_batch_server.erl"},{line,125}]},
                   {proc_lib,init_p_do_apply,3,
                    [{file,"proc_lib.erl"},{line,249}]}]}}}}}}}}},
         {ra_app,start,[normal,[]]}}}}
Log file(s) (may contain more information):
   <stdout>

BOOT FAILED
===========

Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 465
    rabbit:broker_start/1 line 341
    rabbit:start_loaded_apps/2 line 586
    app_utils:manage_applications/6 line 126
    lists:foldl/3 line 1263
    rabbit:'-handle_app_error/1-fun-0-'/3 line 709
throw:{could_not_start,ra,
       {ra,
        {{shutdown,
          {failed_to_start_child,ra_system_sup,
           {shutdown,
            {failed_to_start_child,ra_log_sup,
             {shutdown,
              {failed_to_start_child,ra_log_wal_sup,
               {shutdown,
                {failed_to_start_child,ra_log_wal,
                 {{case_clause,{ok,<<>>}},
                  [{ra_log_wal,open_existing,1,
                    [{file,"src/ra_log_wal.erl"},{line,556}]},
                   {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,
                    [{file,"src/ra_log_wal.erl"},{line,240}]},
                   {ra_log_wal,recover_wal,2,
                    [{file,"src/ra_log_wal.erl"},{line,243}]},
                   {ra_log_wal,init,1,
                    [{file,"src/ra_log_wal.erl"},{line,186}]},
                   {gen_batch_server,init_it,6,
                    [{file,"src/gen_batch_server.erl"},{line,125}]},
                   {proc_lib,init_p_do_apply,3,
                    [{file,"proc_lib.erl"},{line,249}]}]}}}}}}}}},
         {ra_app,start,[normal,[]]}}}}
Log file(s) (may contain more information):
   <stdout>

{"init terminating in do_boot",{could_not_start,ra,{ra,{{shutdown,{failed_to_start_child,ra_system_sup,{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{case_clause,{ok,<<>>}},[{ra_log_wal,open_existing,1,[{file,"src/ra_log_wal.erl"},{line,556}]},{ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,[{file,"src/ra_log_wal.erl"},{line,240}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,243}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,186}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_server.erl"},{line,125}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}}}}}}}}},{ra_app,start,[normal,[]]}}}}}
init terminating in do_boot ({could_not_start,ra,{ra,{{shutdown,{_}},{ra_app,start,[_]}}}})

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
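Before touching any data, it can help to confirm that a node is failing with this specific signature rather than some other boot error. The following is a minimal sketch: has_wal_recovery_failure is a hypothetical helper name, and the two markers it searches for are copied from the log above.

```shell
#!/bin/sh
# Hypothetical helper: reads RabbitMQ log text on stdin and succeeds only if
# it contains both markers of the WAL recovery failure shown above.
has_wal_recovery_failure() {
    log=$(cat)
    # grep -F treats the patterns as fixed strings, so the braces and angle
    # brackets need no escaping.
    printf '%s\n' "$log" | grep -qF 'ra_log_wal,open_existing' &&
        printf '%s\n' "$log" | grep -qF '{case_clause,{ok,<<>>}}'
}

# Example (not executed here): check the log of the previous, crashed container.
# kubectl logs rabbitmq-0 -n openstack --previous | has_wal_recovery_failure \
#     && echo "WAL recovery failure detected"
```

If the helper does not match, the steps below probably do not apply to that node.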

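Fix:

The boot fails in ra_log_wal:open_existing/1 with {case_clause,{ok,<<>>}}: reading the existing quorum queue WAL file returned an empty binary, which suggests that the file is empty or truncated. The steps below locate that file on the pod's persistent volume and remove it so that ra can start again.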

(1) Find the PV used by RabbitMQ, for example the one that belongs to the rabbitmq-0 pod:

# kubectl get pv | grep rabbitmq-0
pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937   200Gi      RWO            Delete           Bound    openstack/rabbitmq-data-rabbitmq-0                                    ceph-ssd                6d17h
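A plain grep matches substrings, so it can return more than one PV when claim names share a prefix. The sketch below matches the claim column exactly instead; pv_for_claim is a hypothetical helper name, and it assumes the default column layout of kubectl get pv shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pv` output on stdin and prints the
# name (first column) of the PV whose CLAIM column equals namespace/name.
pv_for_claim() {
    awk -v claim="$1" '{
        for (i = 2; i <= NF; i++)
            if ($i == claim) { print $1; exit }
    }'
}

# Example (not executed here):
# kubectl get pv --no-headers | pv_for_claim openstack/rabbitmq-data-rabbitmq-0
```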

(2) Inspect the PV to see which storage backs it:

# kubectl get pv pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: rbd-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/rbd
  creationTimestamp: "2023-10-31T15:40:59Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937
  resourceVersion: "7552"
  uid: 6848417a-dd4f-430c-85e5-f3234a1ac6bf
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 200Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: rabbitmq-data-rabbitmq-0
    namespace: openstack
    resourceVersion: "4704"
    uid: 70ed48bf-bef8-4658-b530-1fd3a6ef5937
  persistentVolumeReclaimPolicy: Delete
  rbd:
    image: kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f
    keyring: /etc/ceph/keyring
    monitors:
    - ceph-mon.ceph.svc.cluster.local:6789
    pool: ssdpool
    secretRef:
      name: pvc-ceph-client-key
    user: admin
  storageClassName: ceph-ssd
  volumeMode: Filesystem
status:
  phase: Bound

The field we need is the RBD image name:

    image: kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f
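Rather than reading the YAML by eye, the image and pool can be pulled out mechanically. This is a sketch under the assumption that the PV is an in-tree rbd volume laid out as in the output above; pv_rbd_field is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads PV YAML on stdin and prints the value of one
# key that sits directly under the spec.rbd block (for example image or pool).
pv_rbd_field() {
    awk -v key="$1:" '
        /^  rbd:/           { in_rbd = 1; next }  # enter the rbd block
        in_rbd && /^  [^ ]/ { in_rbd = 0 }        # a sibling key at the same indent ends it
        in_rbd && $1 == key { print $2; exit }
    '
}

# Example (not executed here):
# kubectl get pv pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 -o yaml | pv_rbd_field image
# kubectl can also do this directly with a JSONPath expression:
# kubectl get pv pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 -o jsonpath='{.spec.rbd.image}'
```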

(3) On the node where the pod runs, find the block device the image is mapped to

# ssh node-2 rbd showmapped | grep kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f
0  ssdpool           kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f -    /dev/rbd0  
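The device path is the last column of the matching row. The sketch below extracts it by exact image name; rbd_device_for_image is a hypothetical helper name, and it assumes only that the device is the final column, which holds for the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `rbd showmapped` output on stdin and prints the
# device (last column) of the row that contains the given image name.
rbd_device_for_image() {
    awk -v img="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == img) { print $NF; exit }
    }'
}

# Example (not executed here):
# ssh node-2 rbd showmapped | rbd_device_for_image kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f
```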

(4) Find where the device is mounted

# ssh node-2 mount | grep rbd0
/dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/ssdpool-image-kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f type ext4 (rw,relatime,stripe=1024)
/dev/rbd0 on /var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 type ext4 (rw,relatime,stripe=1024)
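Of the two mount points, the one under /var/lib/kubelet/pods is the path used in the next step. As a sketch, the hypothetical helper pod_mount_for_device below selects it by exact device name, assuming the "device on path type ..." layout of the mount output above and a path without spaces.

```shell
#!/bin/sh
# Hypothetical helper: reads `mount` output on stdin and prints the pod
# volume mount point(s) of the given device.
pod_mount_for_device() {
    awk -v dev="$1" '$1 == dev && $2 == "on" && $3 ~ /\/pods\// { print $3 }'
}

# Example (not executed here):
# ssh node-2 mount | pod_mount_for_device /dev/rbd0
```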

(5) Locate the WAL file; the search path is the pod volume mount point from step (4)

# ssh node-2 find /var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 -name "*.wal"
/var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937/mnesia/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/quorum/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/00000025.wal

(6) Delete the WAL file

Be careful with this step: the deletion cannot be undone, so back up the file before removing it.

# ssh node-2 rm -rf /var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937/mnesia/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/quorum/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/00000025.wal
Warning: Permanently added 'node-2' (ED25519) to the list of known hosts.
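The backup advice can be folded into the deletion itself. The following is a minimal sketch, meant to run on the node where the volume is mounted; backup_and_remove is a hypothetical helper name and the backup directory is whatever location you choose outside the volume.

```shell
#!/bin/sh
# Hypothetical helper: copies a file into a backup directory and removes the
# original only if the copy succeeded.
backup_and_remove() {
    src=$1
    backup_dir=$2
    [ -f "$src" ] || { echo "not a regular file: $src" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1
    # cp -p preserves mode and timestamps; rm runs only when cp returns 0.
    cp -p "$src" "$backup_dir/" && rm -f "$src"
}

# Example (not executed here; /root/wal-backup is an assumed location):
# backup_and_remove "<path found in step (5)>" /root/wal-backup
```

Keeping the copy means the file can be put back if removing it does not resolve the boot failure.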

(7) Delete the pod so that it is recreated and restarted

# kubectl delete pods rabbitmq-0 -n openstack 
pod "rabbitmq-0" deleted

Wait for the pod to start again; after a while the node resynchronizes its data and recovers.
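Waiting can be scripted instead of checked by hand. The sketch below is a generic retry loop (retry is a hypothetical helper name, and the attempt count and delay are arbitrary) wrapped around kubectl wait, since the pod may not exist for a moment right after it is deleted.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most MAX_ATTEMPTS times, sleeping
# DELAY_SECONDS between attempts. Returns 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (not executed here): poll for up to roughly ten minutes, then
# look at the cluster state from inside the pod.
# retry 30 10 kubectl wait --for=condition=Ready pod/rabbitmq-0 -n openstack --timeout=10s
# kubectl exec rabbitmq-0 -n openstack -- rabbitmqctl cluster_status
```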
