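- In early April, two compute nodes of an EXADATA X6 database machine rebooted unexpectedly. Afterwards, the root cause was analyzed and the system's runtime metrics were monitored continuously. A deep health check covering firmware compatibility, hardware faults, and software configuration was also carried out, and the appliance's software configuration (for example, the split-brain parameters) was adjusted. In addition, high-consumption SQL was optimized to reduce traffic on the appliance's internal network and lower the risk of further reboots.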
- **Node 1 server reboot analysis**
  - **Reboot time**

The reboot occurred at 17:52 on April 4:

    [root@testdbadm01 ~]# last | grep boot
    reboot   system boot  2.6.39-400.277.1 Sat Apr  4 17:52 - 11:32 (12+17:40)

  - **Operating system log analysis at the time of the reboot**

The operating system messages log contains no errors or anomalies before the unexpected reboot. Judging from the log, the reboot happened abruptly.

  - **Cluster logs before the reboot**

Neither the cluster software alert log nor the ASM instance log shows any abnormal errors before the server rebooted. The log entries are as follows:

1. Cluster software alert log

    2026-04-03 14:34:36.636: [ctssd(49976)]CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
    2026-04-03 22:54:56.634: [ctssd(49976)]CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
    2026-04-04 02:39:14.518: [ctssd(49976)]CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
    2026-04-04 18:08:38.697: [ohasd(29369)]CRS-2112:The OLR service started on node testdbadm01.
    2026-04-04 18:08:38.721: [ohasd(29369)]CRS-1301:Oracle High Availability Service started on node testdbadm01.
    2026-04-04 18:08:38.735: [ohasd(29369)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred

2. ASM instance log

    [grid@testdbadm01 ~]$ tail -n 500 alert_+ASM1.log |more
    Tue Jun 24 01:21:08 2025
    NOTE: ASM client cawzbf1:cawzbfstd disconnected unexpectedly.
    NOTE: check client alert log.
    NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_267772.trc
    Tue Jun 24 08:46:12 2025
    NOTE: client cawzbf1:cawzbfstd registered, osid 221372, mbr 0x10
    Thu Jan 22 16:47:39 2026
    NOTE: client jgswj1:jgswjstd deregistered
    Thu Jan 22 16:50:05 2026
    NOTE: client zwhzyq1:zwhzyqstd deregistered
    Sat Apr 04 18:09:42 2026
    NOTE: No asm libraries found in the system
    NOTE: No asm libraries found in the system
    * instance_number obtained from CSS = 1, checking for the existence of node 0...
    * node 0 does not exist. instance_number = 1
    Starting ORACLE instance (normal)
    ************************ Large Pages Information *******************

- **Node 2 server reboot analysis**
  - **Reboot time**

The reboot occurred at 04:49 on April 11:

    [root@testdbatest ~]# last | grep boot
    reboot   system boot  2.6.39-400.277.1 Sat Apr 11 04:49 - 11:32 (6+06:43)

  - **Operating system log analysis at the time of the reboot**

The operating system messages log contains no errors or anomalies before the unexpected reboot. Judging from the log, the reboot happened abruptly.

  - **Cluster logs before the reboot**

Neither the cluster software alert log nor the ASM instance log shows any abnormal errors before the server rebooted. The log entries are as follows:

1. Cluster software alert log

    2026-04-04 18:10:09.398: [crsd(74843)]CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.hzldrk'.
    2026-04-04 18:10:09.398: [crsd(74843)]CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.dbm01'.
    2026-04-04 18:10:09.399: [crsd(74843)]CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.czqcglxt'.
    2026-04-11 04:29:37.621: [ohasd(29387)]CRS-2112:The OLR service started on node testdbatest.
    2026-04-11 04:29:37.647: [ohasd(29387)]CRS-1301:Oracle High Availability Service started on node testdbatest.
    2026-04-11 04:29:37.675:

2. ASM instance log

    Sat Apr 04 18:09:45 2026
    LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
    Set master node info
    Submitted all remote-enqueue requests
    Dwn-cvts replayed, VALBLKs dubious
    All grantable enqueues granted
    Submitted all GCS remote-cache requests
    Fix write in gcs resources
    Reconfiguration complete
    Sat Apr 11 04:30:38 2026
    NOTE: No asm libraries found in the system
    NOTE: No asm libraries found in the system
    * instance_number obtained from CSS = 2, checking for the existence of node 0...
    * node 0 does not exist. instance_number = 2
    Starting ORACLE instance (normal)
    ************************ Large Pages Information *******************
    Per process system memlock (soft) limit = UNLIMITED

- **Server hardware status and alarm analysis**

After the incident we promptly analyzed the status and logs of the server hardware and found no anomalies. The relevant information is as follows:

Node 1: status output from the ILOM hardware management platform, showing no alarms:

Node 2: status output from the ILOM hardware management platform, showing no alarms:

- **Summary and follow-up recommendations**
  - **Summary of the analysis**

In-depth analysis of the logs from the two compute nodes of the EXADATA X6 database machine shows that the fault presented as a sudden server reboot with no abnormal error messages. After a rapid response restored business service, a deep health check of the appliance was carried out, covering firmware compatibility, hardware faults, and software configuration, and the appliance's software configuration (for example, the split-brain parameters) was adjusted. In addition, resource-intensive business SQL was optimized to reduce traffic on the appliance's internal network and lower the risk of further reboots.

The adjusted appliance software parameters are as follows:

Node 1 system parameters:

    kernel.shmmax = 460513787494
    kernel.shmall = 112430122
    kernel.randomize_va_space = 2
    kernel.sysrq = 1
    kernel.panic = 60
    kernel.softlockup_panic = 1
    kernel.unknown_nmi_panic = 1
    kernel.nmi_watchdog = 0
    kernel.core_uses_pid = 1
    kernel.watchdog_thresh = 30
    kernel.printk = 4 4 1 7
    vm.max_map_count = 250000
    vm.nr_hugepages=23330
    kernel.msgmni = 2878
    kernel.msgmax = 8192
    kernel.msgmnb = 65536
    kernel.shmmni = 4096
    fs.file-max = 13631488
    fs.aio-max-nr = 3145728
    net.core.rmem_default = 4194304
    net.core.wmem_default = 262144
    net.core.rmem_max = 4194304
    net.core.wmem_max = 2097152
    kernel.pid_max = 400000
    kernel.sem = 1024 60000 1024 256
    vm.min_free_kbytes = 524288
    net.core.somaxconn = 1024

Node 2 system parameters:

    kernel.shmmax = 460513787494
    kernel.shmall = 112430122
    kernel.randomize_va_space = 2
    kernel.sysrq = 1
    kernel.panic = 60
    kernel.softlockup_panic = 1
    kernel.unknown_nmi_panic = 1
    kernel.nmi_watchdog = 0
    kernel.core_uses_pid = 1
    kernel.watchdog_thresh = 30
    kernel.printk = 4 4 1 7
    vm.max_map_count = 250000
    vm.nr_hugepages=23330
    kernel.msgmni = 2878
    kernel.msgmax = 8192
    kernel.msgmnb = 65536
    kernel.shmmni = 4096
    fs.file-max = 13631488
    fs.aio-max-nr = 3145728
    net.core.rmem_default = 4194304
    net.core.wmem_default = 262144
    net.core.rmem_max = 4194304
    net.core.wmem_max = 2097152
    kernel.pid_max = 400000
    kernel.sem = 10000 10240000 10000 1024
    vm.min_free_kbytes = 524288
    net.core.somaxconn = 1024

The resource-intensive business SQL was optimized,