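- Two compute nodes of an EXADATA X6 database machine rebooted unexpectedly in early April. We then analyzed the root cause, continuously monitored system runtime metrics, and ran a deep health check covering firmware compatibility, hardware faults, and software configuration. We adjusted the machine's software configuration (for example, the split-brain parameters) and optimized high-consumption SQL to reduce traffic on the machine's internal network and lower the risk of further reboots.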
- Node 1 server reboot analysis
- Reboot time
The reboot occurred at 17:52 on April 4:
root@testdbadm01 ~# last | grep boot
reboot system boot 2.6.39-400.277.1 Sat Apr 4 17:52 - 11:32 (12+17:40)
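To cross-check the time read off by eye, the `last` record can be parsed programmatically. This is a minimal Python sketch and was not part of the original investigation. The helper name `parse_last_reboot` is hypothetical. Because `last` does not print the year, 2026 is assumed from the cluster log timestamps.

```python
from datetime import datetime


def parse_last_reboot(line: str, year: int) -> datetime:
    """Extract the boot time from one `last` reboot record.

    Expected field layout:
    reboot system boot <kernel> <weekday> <month> <day> <HH:MM> ...
    """
    fields = line.split()
    if fields[:3] != ["reboot", "system", "boot"]:
        raise ValueError("not a reboot record")
    month, day, hhmm = fields[5], fields[6], fields[7]
    # The year is supplied by the caller because `last` omits it.
    return datetime.strptime(f"{year} {month} {day} {hhmm}", "%Y %b %d %H:%M")


record = "reboot system boot 2.6.39-400.277.1 Sat Apr 4 17:52 - 11:32 (12+17:40)"
boot_time = parse_last_reboot(record, year=2026)
print(boot_time.isoformat())
```

The parsed date falls on a Saturday, which agrees with the weekday printed in the record and supports the assumed year.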
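- Operating system log analysis around the reboot
The operating system messages log contains no errors or anomalies before the unexpected reboot. Judging by the log contents, the reboot happened abruptly.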
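- Database cluster logs before the reboot
Neither the clusterware ALERT log nor the ASM instance log shows any abnormal errors before the server rebooted.
The log excerpts are as follows:
- Clusterware ALERT log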
2026-04-03 14:34:36.636:
ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2026-04-03 22:54:56.634:
ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2026-04-04 02:39:14.518:
ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2026-04-04 18:08:38.697:
ohasd(29369)CRS-2112:The OLR service started on node testdbadm01.
2026-04-04 18:08:38.721:
ohasd(29369)CRS-1301:Oracle High Availability Service started on node testdbadm01.
2026-04-04 18:08:38.735:
ohasd(29369)CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
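In the excerpt above, the last entry written before the reboot and the first OHASD startup entry after it bracket a silent window. One way to locate such a window automatically is to find the largest gap between consecutive timestamps. This is a minimal Python sketch under stated assumptions: the helper names `parse_stamps` and `largest_gap` are hypothetical, the timestamp format is inferred from the lines shown here, and the message texts in the sample are shortened.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_stamps(log_text):
    """Return every timestamp line found in the text, in file order."""
    stamps = []
    for raw in log_text.splitlines():
        candidate = raw.strip().rstrip(":")
        try:
            stamps.append(datetime.strptime(candidate, STAMP_FORMAT))
        except ValueError:
            continue  # a message line, not a timestamp line
    return stamps


def largest_gap(stamps):
    """Return the (before, after) pair of consecutive stamps furthest apart."""
    if len(stamps) < 2:
        return None
    return max(zip(stamps, stamps[1:]), key=lambda pair: pair[1] - pair[0])


# Timestamps copied from the node 1 excerpt; message texts shortened.
ALERT_EXCERPT = """\
2026-04-03 14:34:36.636:
ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous ...
2026-04-03 22:54:56.634:
ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous ...
2026-04-04 02:39:14.518:
ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous ...
2026-04-04 18:08:38.697:
ohasd(29369)CRS-2112:The OLR service started on node testdbadm01.
"""

before, after = largest_gap(parse_stamps(ALERT_EXCERPT))
print(f"silent window: {before} -> {after} (length {after - before})")
```

For node 1, the window found this way contains the 17:52 boot time reported by `last`. This matches the observation that nothing was logged immediately before the reboot.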
- ASM instance log
grid@testdbadm01 ~$ tail -n 500 alert_+ASM1.log | more
Tue Jun 24 01:21:08 2025
NOTE: ASM client cawzbf1:cawzbfstd disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_267772.trc
Tue Jun 24 08:46:12 2025
NOTE: client cawzbf1:cawzbfstd registered, osid 221372, mbr 0x10
Thu Jan 22 16:47:39 2026
NOTE: client jgswj1:jgswjstd deregistered
Thu Jan 22 16:50:05 2026
NOTE: client zwhzyq1:zwhzyqstd deregistered
Sat Apr 04 18:09:42 2026
NOTE: No asm libraries found in the system
NOTE: No asm libraries found in the system
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
- Node 2 server reboot analysis
- Reboot time
The reboot occurred at 04:49 on April 11:
root@testdbatest ~# last | grep boot
reboot system boot 2.6.39-400.277.1 Sat Apr 11 04:49 - 11:32 (6+06:43)
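- Operating system log analysis around the reboot
The operating system messages log contains no errors or anomalies before the unexpected reboot. Judging by the log contents, the reboot happened abruptly.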
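- Database cluster logs before the reboot
Neither the clusterware ALERT log nor the ASM instance log shows any abnormal errors before the server rebooted.
The log excerpts are as follows:
- Clusterware ALERT log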
2026-04-04 18:10:09.398:
crsd(74843)CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.hzldrk'.
2026-04-04 18:10:09.398:
crsd(74843)CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.dbm01'.
2026-04-04 18:10:09.399:
crsd(74843)CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.czqcglxt'.
2026-04-11 04:29:37.621:
ohasd(29387)CRS-2112:The OLR service started on node testdbatest.
2026-04-11 04:29:37.647:
ohasd(29387)CRS-1301:Oracle High Availability Service started on node testdbatest.
2026-04-11 04:29:37.675:
- ASM instance log
Sat Apr 04 18:09:45 2026
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Sat Apr 11 04:30:38 2026
NOTE: No asm libraries found in the system
NOTE: No asm libraries found in the system
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
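- Server hardware status and alert analysis
After the problem occurred, we promptly analyzed the server hardware status and logs and found no anomalies. The relevant information is as follows:
Node 1: status output from the ILOM hardware management platform, showing no alerts:
Node 2: status output from the ILOM hardware management platform, showing no alerts: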
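- Summary and follow-up recommendations
- Summary of the analysis
In-depth analysis of the logs from the two EXADATA X6 compute nodes shows that both servers rebooted abruptly without logging any error. After quickly restoring service, we ran a deep health check of the machine covering firmware compatibility, hardware faults, and software configuration, and adjusted its software configuration, such as the split-brain parameters. We also optimized high-resource-consumption business SQL to reduce traffic on the machine's internal network and lower the risk of further reboots.
The machine's software parameters were adjusted as follows:
Node 1 system parameters: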
kernel.shmmax = 460513787494
kernel.shmall = 112430122
kernel.randomize_va_space = 2
kernel.sysrq = 1
kernel.panic = 60
kernel.softlockup_panic = 1
kernel.unknown_nmi_panic = 1
kernel.nmi_watchdog = 0
kernel.core_uses_pid = 1
kernel.watchdog_thresh = 30
kernel.printk = 4 4 1 7
vm.max_map_count = 250000
vm.nr_hugepages=23330
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.shmmni = 4096
fs.file-max = 13631488
fs.aio-max-nr = 3145728
net.core.rmem_default = 4194304
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 2097152
kernel.pid_max = 400000
kernel.sem = 1024 60000 1024 256
vm.min_free_kbytes = 524288
net.core.somaxconn = 1024
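The memory-related values in this listing are easier to judge once converted to common units. `kernel.shmmax` is expressed in bytes and `kernel.shmall` in pages. The worked arithmetic below is a sketch under two assumptions not stated in the report: a 4 KiB base page size and a 2 MiB huge page size. Both should be confirmed on the host, for example with `getconf PAGE_SIZE` and the `Hugepagesize` line in `/proc/meminfo`.

```python
PAGE_SIZE = 4096               # assumption: 4 KiB base pages
HUGEPAGE_SIZE = 2 * 1024 ** 2  # assumption: 2 MiB huge pages
GIB = 1024 ** 3

# Values copied from the node 1 listing.
shmmax_bytes = 460513787494
shmall_pages = 112430122
nr_hugepages = 23330

shmall_bytes = shmall_pages * PAGE_SIZE
hugepage_bytes = nr_hugepages * HUGEPAGE_SIZE

print(f"kernel.shmmax   : {shmmax_bytes / GIB:.2f} GiB per segment")
print(f"kernel.shmall   : {shmall_bytes / GIB:.2f} GiB system-wide")
print(f"vm.nr_hugepages : {hugepage_bytes / GIB:.2f} GiB reserved as huge pages")
```

Under these assumptions, the system-wide shared memory limit is almost the same as the per-segment limit, and the huge page pool is much smaller than both.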
Node 2 system parameters:
kernel.shmmax = 460513787494
kernel.shmall = 112430122
kernel.randomize_va_space = 2
kernel.sysrq = 1
kernel.panic = 60
kernel.softlockup_panic = 1
kernel.unknown_nmi_panic = 1
kernel.nmi_watchdog = 0
kernel.core_uses_pid = 1
kernel.watchdog_thresh = 30
kernel.printk = 4 4 1 7
vm.max_map_count = 250000
vm.nr_hugepages=23330
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.shmmni = 4096
fs.file-max = 13631488
fs.aio-max-nr = 3145728
net.core.rmem_default = 4194304
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 2097152
kernel.pid_max = 400000
kernel.sem = 10000 10240000 10000 1024
vm.min_free_kbytes = 524288
net.core.somaxconn = 1024
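Two long listings are hard to compare by eye, so a small script can compare them instead. This is a minimal Python sketch: the helper names `parse_sysctl` and `diff_params` are hypothetical, and the sample strings are short excerpts of the node 1 and node 2 listings above, not the full files.

```python
def parse_sysctl(text):
    """Parse `key = value` lines into a dict, normalizing whitespace."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = " ".join(value.split())
    return params


def diff_params(left, right):
    """Return {key: (left_value, right_value)} for every differing key."""
    keys = sorted(set(left) | set(right))
    return {
        key: (left.get(key), right.get(key))
        for key in keys
        if left.get(key) != right.get(key)
    }


NODE1_EXCERPT = """\
kernel.panic = 60
vm.nr_hugepages=23330
kernel.sem = 1024 60000 1024 256
vm.min_free_kbytes = 524288
"""

NODE2_EXCERPT = """\
kernel.panic = 60
vm.nr_hugepages=23330
kernel.sem = 10000 10240000 10000 1024
vm.min_free_kbytes = 524288
"""

differences = diff_params(parse_sysctl(NODE1_EXCERPT), parse_sysctl(NODE2_EXCERPT))
for key, (node1_value, node2_value) in differences.items():
    print(f"{key}: node1={node1_value!r} node2={node2_value!r}")
```

In the full listings above, `kernel.sem` is the only parameter that differs between the two nodes. Its four fields are SEMMSL, SEMMNS, SEMOPM, and SEMMNI, in that order. The report does not say whether the difference is intentional.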
The high-resource-consumption business SQL was optimized,