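Node 1 of the ORACLE EXADATA system rebooted after an anomaly at 09:27 on December 6. The Exadata environment uses a cluster architecture and runs the hospital's most critical business system, CIS; after node 1 rebooted, the workload was able to move to node 2.
After receiving the support call, the engineer responded promptly, checked the system state, and confirmed that business service had returned to normal. The engineer then continued with a deeper analysis to find the root cause of the node 1 reboot. That analysis, carried out in coordination with ORACLE second-line technical support, confirmed the cause: the firmware was momentarily unable to register the main cores of CPU P1 with the MCA controller, and the system returned to normal after the reboot. This document summarizes the investigation.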
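- Analysis of the node 1 reboot
- Database cluster and DB log information
Analysis of the operating system and database logs shows that the system was running normally before the failure, with no abnormal messages. The relevant logs follow:
Node 1 cluster and database logs:
1. Cluster log, showing the sudden restart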
2021-10-28 16:38:11.373 [SRVM(383018)]CRS-10051: CVU found following errors with Clusterware setup : PRVF-5410 : Check of common NTP Time Server failed
PRVF-5416 : Query of NTP daemon failed on all nodes
PRVF-5410 : Check of common NTP Time Server failed
PRVF-5416 : Query of NTP daemon failed on all nodes
PRCW-1015 : Wallet orcl does not exist.
CLSW-9: The cluster wallet to be operated on does not exist. :[1015]
PRCW-1015 : Wallet orcl does not exist.
CLSW-9: The cluster wallet to be operated on does not exist. :[1015]
2021-12-06 09:46:19.860 [OHASD(19894)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 19894 ==== Clusterware restart messages
2021-12-06 09:46:19.877 [OHASD(19894)]CRS-0714: Oracle Clusterware Release 12.1.0.2.0.
2021-12-06 09:46:19.915 [OHASD(19894)]CRS-2112: The OLR service started on node dyyy-dbadm01.
2021-12-06 09:46:19.987 [OHASD(19894)]CRS-1301: Oracle High Availability Service started on node
2. Database log, showing the sudden restart
Mon Dec 06 09:38:22 2021
LNS: Standby redo logfile selected for thread 1 sequence 61537 for destination LOG_ARCHIVE_DEST_2
Mon Dec 06 09:38:22 2021
LNS: Standby redo logfile selected for thread 1 sequence 61537 for destination LOG_ARCHIVE_DEST_3
Mon Dec 06 09:38:23 2021
Archived Log entry 246138 added for thread 1 sequence 61536 ID 0x5f59d953 dest 1:
Mon Dec 06 09:47:40 2021
Starting ORACLE instance (normal) === log entries before the restart are normal
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
Total Shared Global Region in Large Pages = 248 GB (100%)
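The silent window in the alert log can be measured mechanically. The following is a minimal sketch, not part of the original investigation: it assumes the alert log uses the timestamp-line format shown above and that the system locale yields English day and month names. The helper names `parse_ts` and `find_restart_gaps` are hypothetical.

```python
from datetime import datetime

# Excerpt of the alert log quoted above; each message follows its timestamp line.
ALERT_LOG = """\
Mon Dec 06 09:38:22 2021
LNS: Standby redo logfile selected for thread 1 sequence 61537 for destination LOG_ARCHIVE_DEST_3
Mon Dec 06 09:38:23 2021
Archived Log entry 246138 added for thread 1 sequence 61536 ID 0x5f59d953 dest 1:
Mon Dec 06 09:47:40 2021
Starting ORACLE instance (normal)
"""

# Assumption: English day/month names (C or en_US locale).
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_ts(line):
    """Return a datetime if the line is an alert-log timestamp line, else None."""
    try:
        return datetime.strptime(line.strip(), TS_FORMAT)
    except ValueError:
        return None


def find_restart_gaps(text):
    """Return (last_timestamp_before_startup, startup_timestamp) pairs."""
    previous_ts = None  # timestamp line seen before the current one
    current_ts = None   # most recent timestamp line
    gaps = []
    for line in text.splitlines():
        ts = parse_ts(line)
        if ts is not None:
            previous_ts, current_ts = current_ts, ts
        elif line.startswith("Starting ORACLE instance") and previous_ts and current_ts:
            gaps.append((previous_ts, current_ts))
    return gaps


if __name__ == "__main__":
    for last_entry, startup in find_restart_gaps(ALERT_LOG):
        silent = (startup - last_entry).total_seconds()
        print(f"last entry {last_entry}, startup {startup}, silent for {silent:.0f} s")
```

The gap only bounds the outage as seen by the database instance; it does not by itself say when the host went down.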
3. OS log:
Dec 5 03:06:13 dyyy-dbadm01 LSI MegaRAID SNMP Agent: Agent Ver 3.18.0.2 (Oct 30th, 2012) Started
Dec 5 03:06:13 dyyy-dbadm01 kernel: [19249636.999436] megaraid_sas 0000:23:00.0: Application firmware crash dump mode set success
Dec 5 03:06:13 dyyy-dbadm01 kernel: [19249637.370922] megaraid_sas 0000:23:00.0: Application firmware crash dump mode set success
Dec 5 16:02:25 dyyy-dbadm01 auditd[12165]: Audit daemon rotating log files
Dec 6 03:06:12 dyyy-dbadm01 LSI MegaRAID SNMP Agent: Agent Ver 3.18.0.2 (Oct 30th, 2012) Stopped
Dec 6 03:06:12 dyyy-dbadm01 LSI MegaRAID SNMP Agent: Agent Ver 3.18.0.2 (Oct 30th, 2012) Stopped
Dec 6 03:06:12 dyyy-dbadm01 LSI MegaRAID SNMP Agent: Agent Ver 3.18.0.2 (Oct 30th, 2012) Started
Dec 6 03:06:12 dyyy-dbadm01 LSI MegaRAID SNMP Agent: Agent Ver 3.18.0.2 (Oct 30th, 2012) Started
Dec 6 03:06:13 dyyy-dbadm01 kernel: [19336122.769755] megaraid_sas 0000:23:00.0: Application firmware crash dump mode set success
Dec 6 03:06:13 dyyy-dbadm01 kernel: [19336122.771358] megaraid_sas 0000:23:00.0: Application firmware crash dump mode set success
Dec 6 03:07:05 dyyy-dbadm01 kernel: [19336174.787057] megaraid_sas 0000:23:00.0: Application firmware crash dump mode set success
Dec 6 09:43:09 dyyy-dbadm01 kernel: imklog 5.8.10, log source = /proc/kmsg started. ========>>>> boot log
Dec 6 09:43:09 dyyy-dbadm01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="12564" x-info="http://www.rsyslog.com"] start
Dec 6 09:43:09 dyyy-dbadm01 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Dec 6 09:43:09 dyyy-dbadm01 kernel: [ 0.000000] Initializing cgroup subsys cpu
Dec 6 09:43:09 dyyy-dbadm01 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Dec 6 09:43:09 dyyy-dbadm01 kernel: [ 0.000000] Linux version 4.1.12-61.33.1.el6uek.x86_64 (mockbuild@x86-ol6-builder-04) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #2 SMP Tue Mar 14 13:16:51 PDT 2017
Dec 6 09:43:09 dyyy-dbadm01 kernel: [ 0.000000] Command line: root=LABEL=DBSYS bootarea=dbsys bootfrom=BOOT ro loglevel=7 panic=60 debug pci=noaer log_buf_len=1m nmi_watchdog=0 transparent_hugepage=never rd_NO_PLYMOUTH audit=1 console=tty1 console=ttyS0,115200n8 crashkernel=448M@128M numa=on
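One way to locate the reboot in the OS log is to look for the point where the kernel's bracketed uptime counter drops back toward zero. The sketch below illustrates that idea under stated assumptions: it expects the syslog line layout shown above, and the function name `find_boot_points` and the regular expression are illustrative rather than taken from any Oracle tooling.

```python
import re

# Excerpt of the OS log quoted above.
SYSLOG = """\
Dec 6 03:07:05 dyyy-dbadm01 kernel: [19336174.787057] megaraid_sas 0000:23:00.0: Application firmware crash dump mode set success
Dec 6 09:43:09 dyyy-dbadm01 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Dec 6 09:43:09 dyyy-dbadm01 kernel: [ 0.000000] Initializing cgroup subsys cpuset
"""

# Captures the syslog timestamp and the kernel uptime (seconds since boot).
KERNEL_LINE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:\s+\[\s*(\d+\.\d+)\]"
)


def find_boot_points(text):
    """Return (last_stamp, last_uptime_seconds, boot_stamp) for each uptime reset."""
    boots = []
    last_stamp, last_uptime = None, None
    for line in text.splitlines():
        match = KERNEL_LINE.match(line)
        if not match:
            continue  # not a kernel line carrying an uptime counter
        stamp, uptime = match.group(1), float(match.group(2))
        # A smaller uptime than the previous kernel line means a new boot.
        if last_uptime is not None and uptime < last_uptime:
            boots.append((last_stamp, last_uptime, stamp))
        last_stamp, last_uptime = stamp, uptime
    return boots


if __name__ == "__main__":
    for last_stamp, last_uptime, boot_stamp in find_boot_points(SYSLOG):
        days = last_uptime / 86400
        print(f"last kernel message at {last_stamp} (uptime about {days:.1f} days); "
              f"new boot logged at {boot_stamp}")
```

In this excerpt the last kernel message before the reset is hours older than the boot entry, so the check shows that no kernel warning preceded the reboot rather than pinpointing the moment the host went down.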
- System load before the failure
A check of the system load before the failure shows CPU utilization below 30%, with no high-load condition.
- Analysis of system information in ILOM
The ILOM hardware management platform reports errors related to CPU P1.

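- Hardware status normal after the reboot
After the reboot, the ILOM hardware management platform shows all hardware components in a normal state.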

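- Summary and follow-up recommendations
- Summary of the node 1 reboot analysis
Analysis of the relevant logs leads to the following conclusions:
- The host did not experience high CPU load; the logs show a sudden system reboot with no related errors beforehand.
- According to the hardware logs, the firmware was momentarily unable to register the main cores of CPU P1 with the MCA controller, which triggered the hardware reboot mechanism. All hardware returned to a normal state after the host rebooted.
- Follow-up recommendations
The current analysis attributes this abnormal reboot to the transient failure to register the main cores of CPU P1 with the MCA controller, which triggered the hardware reboot mechanism; all hardware has been normal since the reboot. We therefore recommend continuing to monitor hardware operation and alerts in the near term. If a similar event recurs within a short period, the affected CPU should be replaced.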