Analysis of Alternating Reboots on the Two Compute Nodes of an Exadata X6 Database Machine

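  • We recently encountered an issue on an Exadata X6 Database Machine in which both compute nodes rebooted unexpectedly in early April. We then analyzed the root cause, continuously monitored the system's runtime metrics, and ran a deep health check covering firmware compatibility, hardware faults, and software configuration. We also adjusted the machine's software configuration, such as the split-brain-related parameters, and tuned high-consumption SQL to reduce traffic on the machine's internal network and lower the risk of further reboots.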
  • Analysis of the node 1 server reboot
    1. Time of the reboot

The reboot occurred at 17:52 on April 4:

root@testdbadm01 ~# last|grep boot

reboot system boot 2.6.39-400.277.1 Sat Apr 4 17:52 - 11:32 (12+17:40)
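
The uptime field in the `last` record can be used to cross-check the boot time. The following is a minimal sketch, not part of the original analysis: `parse_reboot` and `REBOOT_RE` are hypothetical names, the year is passed in explicitly because `last` does not print it (2026 is taken from the cluster logs), and month parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Matches a `last` reboot record such as:
#   reboot system boot 2.6.39-400.277.1 Sat Apr 4 17:52 - 11:32 (12+17:40)
REBOOT_RE = re.compile(
    r"^reboot\s+system\s+boot\s+(?P<kernel>\S+)\s+"
    r"\w{3}\s+(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<start>\d{2}:\d{2})"
    r"\s+-\s+(?P<end>\d{2}:\d{2})\s+"
    r"\((?:(?P<days>\d+)\+)?(?P<hhmm>\d{2}:\d{2})\)"
)


def parse_reboot(line, year):
    """Return (boot_time, uptime, boot_time + uptime) for one reboot record."""
    m = REBOOT_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a reboot record: {line!r}")
    # `last` omits the year, so the caller supplies it (an assumption).
    boot = datetime.strptime(
        f"{year} {m['month']} {m['day']} {m['start']}", "%Y %b %d %H:%M"
    )
    hours, minutes = map(int, m["hhmm"].split(":"))
    uptime = timedelta(days=int(m["days"] or 0), hours=hours, minutes=minutes)
    return boot, uptime, boot + uptime


if __name__ == "__main__":
    record = "reboot system boot 2.6.39-400.277.1 Sat Apr 4 17:52 - 11:32 (12+17:40)"
    boot, uptime, end = parse_reboot(record, 2026)
    print("booted:", boot, "uptime:", uptime, "record ends:", end)
```

For the node 1 record above, the boot time plus the reported uptime of 12 days 17:40 gives April 17 at 11:32, which matches the end time printed by `last`.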

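    2. Operating system logs around the reboot

The operating system messages log contains no errors or anomalies before the unexpected reboot. Judging from the log, the reboot happened abruptly.

    3. Cluster logs before the reboot

Neither the clusterware alert log nor the ASM instance log shows any abnormal errors before the server rebooted:

The log entries are as follows:

  1. Clusterware alert log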

2026-04-03 14:34:36.636:

ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

2026-04-03 22:54:56.634:

ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

2026-04-04 02:39:14.518:

ctssd(49976)CRS-2409:The clock on host testdbadm01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

2026-04-04 18:08:38.697:

ohasd(29369)CRS-2112:The OLR service started on node testdbadm01.

2026-04-04 18:08:38.721:

ohasd(29369)CRS-1301:Oracle High Availability Service started on node testdbadm01.

2026-04-04 18:08:38.735:

ohasd(29369)CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
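
To show how long the clusterware alert log was silent before the restart, the excerpt can be parsed into timestamped entries and scanned for the largest gap. This is a rough sketch under stated assumptions: each entry is a `YYYY-MM-DD HH:MM:SS.mmm:` line followed by one or more message lines, as in the excerpt above, and `parse_alert_log` and `largest_gap` are hypothetical helper names rather than Oracle tools.

```python
import re
from datetime import datetime

# A timestamp line in the excerpt looks like "2026-04-04 18:08:38.697:".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):\s*$")


def parse_alert_log(text):
    """Return a list of (timestamp, message) pairs from alert log text."""
    entries = []
    current_ts, message_lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = TS_RE.match(line)
        if m:
            # A new timestamp closes the previous entry, if any.
            if current_ts is not None:
                entries.append((current_ts, " ".join(message_lines)))
            current_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            message_lines = []
        elif current_ts is not None:
            # Continuation lines are joined with a single space.
            message_lines.append(line)
    if current_ts is not None:
        entries.append((current_ts, " ".join(message_lines)))
    return entries


def largest_gap(entries):
    """Return (previous_entry, next_entry, gap) for the widest time gap."""
    if len(entries) < 2:
        raise ValueError("need at least two entries")
    pairs = zip(entries, entries[1:])
    prev, nxt = max(pairs, key=lambda p: p[1][0] - p[0][0])
    return prev, nxt, nxt[0] - prev[0]
```

Applied to the node 1 excerpt, the widest gap lies between the CRS-2409 message at 02:39:14.518 and the CRS-2112 message at 18:08:38.697 on April 4, which is the quiet period that surrounds the reboot.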

  2. ASM instance log

grid@testdbadm01 ~$ tail -n 500 alert_+ASM1.log |more

Tue Jun 24 01:21:08 2025

NOTE: ASM client cawzbf1:cawzbfstd disconnected unexpectedly.

NOTE: check client alert log.

NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_267772.trc

Tue Jun 24 08:46:12 2025

NOTE: client cawzbf1:cawzbfstd registered, osid 221372, mbr 0x10

Thu Jan 22 16:47:39 2026

NOTE: client jgswj1:jgswjstd deregistered

Thu Jan 22 16:50:05 2026

NOTE: client zwhzyq1:zwhzyqstd deregistered

Sat Apr 04 18:09:42 2026

NOTE: No asm libraries found in the system

NOTE: No asm libraries found in the system

* instance_number obtained from CSS = 1, checking for the existence of node 0...

* node 0 does not exist. instance_number = 1

Starting ORACLE instance (normal)

************************ Large Pages Information *******************

  • Analysis of the node 2 server reboot
    1. Time of the reboot

The reboot occurred at 04:49 on April 11:

root@testdbatest ~# last|grep boot

reboot system boot 2.6.39-400.277.1 Sat Apr 11 04:49 - 11:32 (6+06:43)

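    2. Operating system logs around the reboot

The operating system messages log contains no errors or anomalies before the unexpected reboot. Judging from the log, the reboot happened abruptly.

    3. Cluster logs before the reboot

Neither the clusterware alert log nor the ASM instance log shows any abnormal errors before the server rebooted:

The log entries are as follows:

  1. Clusterware alert log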

2026-04-04 18:10:09.398:

crsd(74843)CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.hzldrk'.

2026-04-04 18:10:09.398:

crsd(74843)CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.dbm01'.

2026-04-04 18:10:09.399:

crsd(74843)CRS-2772:Server 'testdbadm01' has been assigned to pool 'ora.czqcglxt'.

2026-04-11 04:29:37.621:

ohasd(29387)CRS-2112:The OLR service started on node testdbatest.

2026-04-11 04:29:37.647:

ohasd(29387)CRS-1301:Oracle High Availability Service started on node testdbatest.

2026-04-11 04:29:37.675:

  2. ASM instance log

Sat Apr 04 18:09:45 2026

LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Submitted all GCS remote-cache requests

Fix write in gcs resources

Reconfiguration complete

Sat Apr 11 04:30:38 2026

NOTE: No asm libraries found in the system

NOTE: No asm libraries found in the system

* instance_number obtained from CSS = 2, checking for the existence of node 0...

* node 0 does not exist. instance_number = 2

Starting ORACLE instance (normal)

************************ Large Pages Information *******************

Per process system memlock (soft) limit = UNLIMITED

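  • Server hardware status and alert analysis

After the incident we promptly analyzed the hardware status and logs of both servers and found no anomalies. The relevant information is as follows:

Status output from the ILOM hardware management platform for node 1, showing no alerts:

Status output from the ILOM hardware management platform for node 2, showing no alerts:

  • Summary and follow-up recommendations
    1. Summary of the analysis

A thorough analysis of the logs from both compute nodes of the Exadata X6 Database Machine shows the same symptom on each: the server rebooted abruptly, with no abnormal error messages beforehand. After responding quickly and restoring service, we ran a deep health check of the machine covering firmware compatibility, hardware faults, and software configuration, and adjusted the machine's software configuration, such as the split-brain-related parameters. We also tuned the business SQL with high resource consumption to reduce traffic on the machine's internal network and lower the risk of further reboots.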

The machine's software parameters after adjustment are as follows:

System parameters on node 1:

kernel.shmmax = 460513787494

kernel.shmall = 112430122

kernel.randomize_va_space = 2

kernel.sysrq = 1

kernel.panic = 60

kernel.softlockup_panic = 1

kernel.unknown_nmi_panic = 1

kernel.nmi_watchdog = 0

kernel.core_uses_pid = 1

kernel.watchdog_thresh = 30

kernel.printk = 4 4 1 7

vm.max_map_count = 250000

vm.nr_hugepages=23330

kernel.msgmni = 2878

kernel.msgmax = 8192

kernel.msgmnb = 65536

kernel.shmmni = 4096

fs.file-max = 13631488

fs.aio-max-nr = 3145728

net.core.rmem_default = 4194304

net.core.wmem_default = 262144

net.core.rmem_max = 4194304

net.core.wmem_max = 2097152

kernel.pid_max = 400000

kernel.sem = 1024 60000 1024 256

vm.min_free_kbytes = 524288

net.core.somaxconn = 1024

System parameters on node 2:

kernel.shmmax = 460513787494

kernel.shmall = 112430122

kernel.randomize_va_space = 2

kernel.sysrq = 1

kernel.panic = 60

kernel.softlockup_panic = 1

kernel.unknown_nmi_panic = 1

kernel.nmi_watchdog = 0

kernel.core_uses_pid = 1

kernel.watchdog_thresh = 30

kernel.printk = 4 4 1 7

vm.max_map_count = 250000

vm.nr_hugepages=23330

kernel.msgmni = 2878

kernel.msgmax = 8192

kernel.msgmnb = 65536

kernel.shmmni = 4096

fs.file-max = 13631488

fs.aio-max-nr = 3145728

net.core.rmem_default = 4194304

net.core.wmem_default = 262144

net.core.rmem_max = 4194304

net.core.wmem_max = 2097152

kernel.pid_max = 400000

kernel.sem = 10000 10240000 10000 1024

vm.min_free_kbytes = 524288

net.core.somaxconn = 1024
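
Because the two listings are long, a small script makes it easier to confirm whether both nodes carry the same kernel settings. The sketch below is illustrative only: `parse_sysctl` and `diff_sysctl` are hypothetical helper names, and the input is assumed to be `key = value` text like the listings above (with or without spaces around `=`).

```python
def parse_sysctl(text):
    """Parse 'key = value' lines into a dict, normalizing whitespace."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = " ".join(value.split())
    return params


def diff_sysctl(node1, node2):
    """Return {key: (value_on_node1, value_on_node2)} for every mismatch."""
    diffs = {}
    for key in sorted(set(node1) | set(node2)):
        a, b = node1.get(key), node2.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


if __name__ == "__main__":
    # Short excerpts of the two listings; the full text can be pasted instead.
    node1_text = """
    kernel.panic = 60
    vm.nr_hugepages=23330
    kernel.sem = 1024 60000 1024 256
    """
    node2_text = """
    kernel.panic = 60
    vm.nr_hugepages=23330
    kernel.sem = 10000 10240000 10000 1024
    """
    for key, (a, b) in diff_sysctl(
        parse_sysctl(node1_text), parse_sysctl(node2_text)
    ).items():
        print(f"{key}: node1={a!r} node2={b!r}")
```

In the two listings above, `kernel.sem` is the only parameter whose value differs between node 1 and node 2; whether that difference is intentional is not stated in the source.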

The business SQL with high resource consumption was optimized.
