The 6-node cluster environment has recently experienced several ACFS file system anomalies. For example, around 12:00 noon on June 24 ACFS access became abnormal; the cluster alert log, the CSSD process log and the OSWatcher data (reviewed below) show that OSWatcher collected no data from the virtual Linux host between 11:55 and 12:40, and that CSSD lost network heartbeats with the other nodes over the private interconnect during the same window, so it can be inferred that the host was hung at the time.
The current environment is a 6-node Oracle Grid Infrastructure cluster built on VMware virtual machines. An ACFS file system provides the shared data directory /DATA for the application, which is deployed on the hosts of the same 6-node cluster; no Oracle database runs in this cluster environment.
The two recent representative incidents are analyzed as follows.

Around 12:00 noon on June 24, ACFS file system access became abnormal. A review of the cluster alert log, the CSSD process log and the OSWatcher logs shows that OSWatcher collected no host monitoring data between 11:55 and 12:40, and that during the same period the CSSD process also reported lost network heartbeats with the other nodes over the private interconnect; it can therefore be inferred that the host was hung at the time.

Around 9:00 a.m. on July 2, the ACFS file system became inaccessible; OSWatcher was not running at the time. The cluster alert log shows that application processes were still using the /DATA directory, so it could not be unmounted, and the operating system log contains the message "INFO: task java:12227 blocked for more than 120 seconds." With no other useful information available, the cause of that ACFS access failure cannot yet be determined.
Seen from these incidents, Oracle Clusterware sits above the operating system and is affected both by the OS underneath it and by the VMware virtualization layer below that. Because monitoring and log granularity differ across these layers, problem analysis becomes considerably more complex, and much of the information cannot be traced further down the stack to identify a root cause.
The analysis process is as follows, starting with the June 24 incident:
#### **1. Cluster alert log**
2019-06-24 11:32:43.138:
\[ctssd(3268)\]CRS-2408:The clock on host node5 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2019-06-24 12:17:40.220:
\[cssd(3148)\]CRS-1612:Network communication with node node2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.560 seconds
2019-06-24 12:17:48.222:
\[cssd(3148)\]CRS-1611:Network communication with node node2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.560 seconds
2019-06-24 12:17:52.223:
\[cssd(3148)\]CRS-1610:Network communication with node node2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.560 seconds
2019-06-24 12:17:54.790:
\[cssd(3148)\]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node3 node4 node5 node6 .
2019-06-24 12:18:38.016:
\[cssd(3148)\]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node2 node3 node4 node5 node6 .
2019-06-24 12:33:48.943:
\[cssd(3148)\]CRS-1662:Member kill requested by node node6 for member number 5, group ocr_oanew-cluster
2019-06-24 12:33:48.959:
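The CRS-1610/1611/1612 countdown and the CRS-1601/1662 reconfiguration messages above can be pulled into a single eviction timeline. A minimal sketch, assuming a typical 11gR2 Grid home layout (GRID_HOME and the alert log path below are placeholders, not values taken from this environment):

```bash
#!/bin/bash
# Sketch: extract heartbeat-countdown and reconfiguration messages from the
# CRS alert log of the local node. Adjust GRID_HOME for the actual install.
GRID_HOME=/u01/app/11.2.0/grid
HOST=$(hostname -s)

# -B1 keeps the timestamp line that precedes each CRS-16xx message.
grep -B1 -E "CRS-16(01|10|11|12|62)" "$GRID_HOME/log/$HOST/alert$HOST.log" | less
```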
#### **2. OSWatcher monitoring data**
Partial output is shown below:
zzz \*\*\*Mon Jun 24 11:55:04 CST 2019
Tasks: 520 total, 1 running, 519 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.9%us, 1.4%sy, 0.1%ni, 96.5%id, 0.0%wa, 0.1%hi, 0.1%si, 0.0%st
Mem: 24608192k total, 24400720k used, 207472k free, 450168k buffers
Swap: 16383992k total, 149316k used, 16234676k free, 3719180k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21449 root 20 0 14.9g 6.5g 16m S 34.6 27.6 311:40.89 java
25134 root 20 0 109m 1212 892 D 5.9 0.0 0:50.58 find
25763 root 10 -10 0 0 0 S 4.0 0.0 1:33.21 oks_comm
2522 root 20 0 157m 19m 6044 S 2.0 0.1 296:44.37 Xorg
25569 root 30 10 238m 12m 5388 S 2.0 0.1 0:07.87 floaters
32100 oracle 20 0 4636 1268 660 S 2.0 0.0 0:00.03 pidstat
32110 oracle 20 0 4648 1284 660 S 2.0 0.0 0:00.03 pidstat
32125 oracle 20 0 15300 1556 932 R 2.0 0.0 0:00.02 top
32152 root 20 0 7407m 11m 7084 S 2.0 0.0 0:00.02 jstat
106 root 20 0 0 0 0 S 1.0 0.0 6:46.23 kblockd/0
24801 oracle 20 0 1835m 38m 16m S 1.0 0.2 5:47.71 oraagent.bin
25759 root 10 -10 0 0 0 S 1.0 0.0 0:04.91 oks_comm
25760 root 10 -10 0 0 0 S 1.0 0.0 0:05.08 oks_comm
25761 root 10 -10 0 0 0 S 1.0 0.0 0:04.99 oks_comm
25762 root 10 -10 0 0 0 S 1.0 0.0 0:17.44 oks_comm
27667 root 20 0 815m 19m 10m S 1.0 0.1 110:42.34 octssd.bin
27731 root RT 0 756m 90m 57m S 1.0 0.4 823:45.84 osysmond.bin
1 root 20 0 19364 1152 920 S 0.0 0.0 0:01.55 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.50 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 1:37.36 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:29.71 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:03.78 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 2:56.86 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:40.01 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:03.14 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 1:45.06 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:30.71 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:04.09 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 1:39.74 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:15.30 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:05.59 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 1:21.81 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:24.62 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 0:02.89 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 2:59.13 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
25 root 20 0 0 0 0 S 0.0 0.0 0:29.33 ksoftirqd/5
26 root RT 0 0 0 0 S 0.0 0.0 0:03.09 watchdog/5
27 root RT 0 0 0 0 S 0.0 0.0 2:05.44 migration/6
zzz \*\*\*Mon Jun 24 12:40:18 CST 2019
top - 12:40:19 up 38 days, 21:55, 7 users, load average: 389.66, 349.33, 237.0
Tasks: 479 total, 2 running, 476 sleeping, 0 stopped, 1 zombie
Cpu(s): 12.3%us, 7.6%sy, 0.6%ni, 79.2%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 24608192k total, 13600176k used, 11008016k free, 450344k buffers
Swap: 16383992k total, 121744k used, 16262248k free, 3679644k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2079 root 20 0 1925m 60m 8280 R 195.2 0.3 0:01.97 java
25787 root 20 0 0 0 0 S 29.7 0.0 0:13.62 acfsvol1
1785 root 30 10 233m 8240 5356 S 10.9 0.0 0:00.18 floaters
1780 root 20 0 1434m 33m 15m S 2.0 0.1 0:00.11 orarootagent.bi
1848 oracle 20 0 4660 1292 660 S 2.0 0.0 0:00.03 pidstat
1784 oracle 20 0 4708 1344 660 S 1.0 0.0 0:00.02 pidstat
2522 root 20 0 156m 17m 6044 S 1.0 0.1 296:44.88 Xorg
23104 root 20 0 1914m 34m 16m S 1.0 0.1 168:20.21 ohasd.bin
27384 oracle RT 0 1346m 115m 54m S 1.0 0.5 390:41.65 ocssd.bin
27731 root RT 0 756m 90m 57m S 1.0 0.4 823:47.37 osysmond.bin
1 root 20 0 19364 1152 920 S 0.0 0.0 0:01.59 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.50 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 1:37.36 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:29.89 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:03.79 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 2:56.86 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:40.02 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:03.14 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 1:45.06 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:30.72 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:04.10 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 1:39.82 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:15.30 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:05.60 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 1:21.81 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:24.64 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 0:02.89 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 2:59.15 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
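The notable point in the data above is the hole between the 11:55:04 and 12:40:18 snapshots, plus the load average of 389 once collection resumed. A minimal sketch for locating such gaps automatically, assuming a standard OSWatcher Black Box archive with "zzz ***" snapshot markers (the archive path is a placeholder, and GNU date/awk are assumed):

```bash
#!/bin/bash
# Sketch: report gaps between OSWatcher "top" snapshots larger than GAP_SECS.
# ARCHIVE is a placeholder for the node's oswtop archive directory.
ARCHIVE=/opt/oswbb/archive/oswtop
GAP_SECS=300   # OSW normally samples every 30 s; 5 min means collection stalled

grep -h '^zzz' "$ARCHIVE"/*top*.dat \
  | sed 's/^zzz \*\*\*//' \
  | while read -r stamp; do date -d "$stamp" +%s; done \
  | sort -n \
  | awk -v gap="$GAP_SECS" 'NR > 1 && $1 - prev > gap {
        print "gap of " ($1 - prev) " s ending at epoch " $1
      } { prev = $1 }'
# Note: timezone abbreviations such as CST in the zzz lines may need explicit
# TZ handling for date -d to interpret them as intended.
```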
#### **3. Node 1 CSSD process log**
2019-06-24 12:17:46.238: \[ CSSD\]\[2716677888\]clssnmSendingThread: sent 5 status msgs to all nodes
2019-06-24 12:17:46.631: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349866/3359336954
2019-06-24 12:17:47.132: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349867/3359337454
2019-06-24 12:17:47.631: \[ CSSD\]\[2718254848\]clssnmPollingThread: node node2 (2) at 75% heartbeat fatal, removal in 7.150 seconds
2019-06-24 12:17:47.631: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349867/3359337954
2019-06-24 12:17:48.132: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349868/3359338454
2019-06-24 12:17:48.631: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349868/3359338954
2019-06-24 12:17:49.132: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349869/3359339454
2019-06-24 12:17:49.631: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349869/3359339954
2019-06-24 12:17:50.132: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349870/3359340454
2019-06-24 12:17:50.631: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349870/3359340954
2019-06-24 12:17:51.132: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349871/3359341454
2019-06-24 12:17:51.240: \[ CSSD\]\[2716677888\]clssnmSendingThread: sending status msg to all nodes
2019-06-24 12:17:51.240: \[ CSSD\]\[2716677888\]clssnmSendingThread: sent 5 status msgs to all nodes
2019-06-24 12:17:51.632: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349871/3359341954
2019-06-24 12:17:52.133: \[ CSSD\]\[2727913216\]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349872/3359342454
2019-06-24 12:17:52.632: \[ CSSD\]\[2718254848\]clssnmPollingThread: node node2 (2) at 90% heartbeat fatal, removal in 2.150 seconds, seedhbimpd 1
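The clssnmPollingThread messages above count down toward the CSS network heartbeat timeout (misscount, which defaults to 30 seconds on Linux). A quick check of the timeouts actually in effect, run from any node as root or the Grid owner (crsctl lives under the Grid home's bin directory):

```bash
# Sketch: show the CSS timeouts that drive the 50%/75%/90% countdown above.
crsctl get css misscount      # network heartbeat timeout (seconds)
crsctl get css disktimeout    # voting-disk I/O timeout (seconds)
```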
### **Analysis of the July 2 incident**
Around 9:00 a.m. on July 2, the ACFS file system became inaccessible; OSWatcher was not running at the time. The cluster alert log shows that application processes were still using the /DATA directory, so it could not be unmounted, and the operating system log contains the message "INFO: task java:12227 blocked for more than 120 seconds." With no other useful information available, the cause of this ACFS access failure cannot yet be determined.
#### **1. Cluster alert log**
2019-07-02 08:49:03.484:
\[ctssd(3257)\]CRS-2408:The clock on host node1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
\[client(17179)\]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo
WARNING:Alert message too long
\[client(17188)\]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo
WARNING:Alert message too long
\[client(17190)\]CRS-10001:02-Jul-19 09:04 ACFS-9252: The following process IDs have open references on mount point '/data':
\[client(17192)\]CRS-10001:5822
\[client(17194)\]CRS-10001:02-Jul-19 09:04 ACFS-9253: Failed to unmount mount point '/data'. Mount point likely in use.
\[client(17196)\]CRS-10001:02-Jul-19 09:04 ACFS-9254: Manual intervention is required.
\[client(17219)\]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo
WARNING:Alert message too long
\[client(17225)\]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo
WARNING:Alert message too long
\[client(17227)\]CRS-10001:02-Jul-19 09:04 ACFS-9252: The following process IDs have open references on mount point '/data':
\[client(17229)\]CRS-10001:5822
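ACFS-9252/9253 above show that the unmount failed because processes (for example PID 5822) still held open references on /data. Before retrying a manual unmount, the holders can be identified and stopped; a minimal sketch, run as root on the affected node:

```bash
# Sketch: list processes keeping the ACFS mount point busy, then retry the unmount.
fuser -vm /data     # PIDs with open files or a working directory under /data
lsof /data          # per-process detail: command, PID, user, open file

# After the offending processes have been stopped:
umount /data
```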
#### **2. Operating system log**
Jul 2 08:59:49 node1 kernel: INFO: task java:12227 blocked for more than 120 seconds.
Jul 2 09:01:49 node1 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Jul 2 09:01:49 node1 kernel: [<ffffffff8152784d>] ? bictcp_cong_avoid+0x2d/0x390
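The call-trace fragments above belong to the kernel hung-task detector report quoted earlier ("INFO: task java:12227 blocked for more than 120 seconds."). A sketch for pulling the full reports and their stack traces out of the system log, assuming a RHEL/OEL 6 style /var/log/messages:

```bash
# Sketch: locate the hung-task reports around 09:00 on Jul 2 with their call traces.
grep -n "blocked for more than 120 seconds" /var/log/messages
grep -A 30 "blocked for more than 120 seconds" /var/log/messages | less

# The 120 s threshold comes from the kernel hung-task detector:
sysctl kernel.hung_task_timeout_secs
```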
#### **3. Cluster Health Monitor (CHM) log**
[oracle@node6 node6]$ cat 02-JUL-2019-09:20:20.txt | grep "spent too much time"
dm-1 ior: 0.000 iow: 1117.912 ios: 279 qlen: 304 wait: 7914;';3:Time=07-02-19 09.15.20, Disk dm-1 spent too much time (7914 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdb ior: 0.000 iow: 1654.062 ios: 152 qlen: 23 wait: 573;';3:Time=07-02-19 09.15.20, Disk sdb spent too much time (573 msecs) waiting for I/O (> 100 msecs)' type: SYS
sda ior: 0.000 iow: 11.182 ios: 1 qlen: 0 wait: 119;';3:Time=07-02-19 09.15.40, Disk sda spent too much time (119 msecs) waiting for I/O (> 100 msecs)' type: SWAP
sda3 ior: 0.000 iow: 11.182 ios: 1 qlen: 0 wait: 119;';3:Time=07-02-19 09.15.40, Disk sda3 spent too much time (119 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-0 ior: 0.000 iow: 11.182 ios: 2 qlen: 1 wait: 412;';3:Time=07-02-19 09.15.40, Disk dm-0 spent too much time (412 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdc ior: 192.196 iow: 1.996 ios: 7 qlen: 2 wait: 377;';3:Time=07-02-19 09.15.40, Disk sdc spent too much time (377 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdc ior: 106.347 iow: 2.101 ios: 14 qlen: 0 wait: 148;';3:Time=07-02-19 09.16.20, Disk sdc spent too much time (148 msecs) waiting for I/O (> 100 msecs)' type: SYS
sda ior: 0.000 iow: 13.605 ios: 3 qlen: 3 wait: 937;';3:Time=07-02-19 09.16.40, Disk sda spent too much time (937 msecs) waiting for I/O (> 100 msecs)' type: SWAP
sda3 ior: 0.000 iow: 13.605 ios: 3 qlen: 3 wait: 937;';3:Time=07-02-19 09.16.40, Disk sda3 spent too much time (937 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-1 ior: 0.000 iow: 24.811 ios: 6 qlen: 14 wait: 1565;';3:Time=07-02-19 09.16.40, Disk dm-1 spent too much time (1565 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-0 ior: 0.000 iow: 15.206 ios: 3 qlen: 4 wait: 838;';3:Time=07-02-19 09.16.40, Disk dm-0 spent too much time (838 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdc ior: 0.899 iow: 2.000 ios: 3 qlen: 1 wait: 382;';3:Time=07-02-19 09.16.40, Disk sdc spent too much time (382 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdb ior: 0.000 iow: 18.407 ios: 1 qlen: 3 wait: 770;';3:Time=07-02-19 09.16.40, Disk sdb spent too much time (770 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-1 ior: 0.000 iow: 737.072 ios: 184 qlen: 10 wait: 1060;';3:Time=07-02-19 09.16.55, Disk dm-1 spent too much time (1060 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdb ior: 0.000 iow: 1011.573 ios: 15 qlen: 0 wait: 1210;';3:Time=07-02-19 09.16.55, Disk sdb spent too much time (1210 msecs) waiting for I/O (> 100 msecs)' type: SYS
sda ior: 0.000 iow: 8.803 ios: 1 qlen: 0 wait: 3992;';3:Time=07-02-19 09.17.00, Disk sda spent too much time (3992 msecs) waiting for I/O (> 100 msecs)' type: SWAP
sda3 ior: 0.000 iow: 8.803 ios: 1 qlen: 0 wait: 3992;';3:Time=07-02-19 09.17.00, Disk sda3 spent too much time (3992 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-0 ior: 0.000 iow: 7.202 ios: 1 qlen: 0 wait: 4436;';3:Time=07-02-19 09.17.00, Disk dm-0 spent too much time (4436 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdc ior: 2.596 iow: 1.896 ios: 3 qlen: 1 wait: 370;';3:Time=07-02-19 09.17.40, Disk sdc spent too much time (370 msecs) waiting for I/O (> 100 msecs)' type: SYS
sda ior: 0.000 iow: 21.602 ios: 3 qlen: 1 wait: 1943;';3:Time=07-02-19 09.18.45, Disk sda spent too much time (1943 msecs) waiting for I/O (> 100 msecs)' type: SWAP
sda3 ior: 0.000 iow: 21.602 ios: 3 qlen: 1 wait: 1943;';3:Time=07-02-19 09.18.45, Disk sda3 spent too much time (1943 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-1 ior: 0.000 iow: 1968.174 ios: 492 qlen: 77 wait: 202;';3:Time=07-02-19 09.18.45, Disk dm-1 spent too much time (202 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-0 ior: 0.000 iow: 8.801 ios: 2 qlen: 2 wait: 4660;';3:Time=07-02-19 09.18.45, Disk dm-0 spent too much time (4660 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdc ior: 5.700 iow: 2.899 ios: 6 qlen: 0 wait: 1033;';3:Time=07-02-19 09.18.45, Disk sdc spent too much time (1033 msecs) waiting for I/O (> 100 msecs)' type: SYS
dm-1 ior: 0.000 iow: 274.506 ios: 68 qlen: 208 wait: 12512;';3:Time=07-02-19 09.20.05, Disk dm-1 spent too much time (12512 msecs) waiting for I/O (> 100 msecs)' type: SYS
sdb ior: 0.000 iow: 579.425 ios: 47 qlen: 39 wait: 2515;';3:Time=07-02-19 09.20.05, Disk sdb spent too much time (2515 msecs) waiting for I/O (> 100 msecs)' type: SYS

## **III. Summary and follow-up recommendations**

#### **3.1 Problem summary**

The current environment is a 6-node Oracle Grid Infrastructure cluster built on VMware virtual machines. An ACFS file system provides the shared data directory /DATA for the application, which is deployed on the hosts of the same 6-node cluster; no Oracle database runs in this cluster environment.

The two recent representative incidents are analyzed as follows. Around 12:00 noon on June 24, ACFS file system access became abnormal. A review of the cluster alert log, the CSSD process log and the OSWatcher logs shows that OSWatcher collected no host monitoring data between 11:55 and 12:40, and that during the same period the CSSD process also reported lost network heartbeats with the other nodes over the private interconnect; it can therefore be inferred that the host was hung at the time. Around 9:00 a.m. on July 2, the ACFS file system became inaccessible; OSWatcher was not running at the time. The cluster alert log shows that application processes were still using the /DATA directory, so it could not be unmounted, and the operating system log contains the message "INFO: task java:12227 blocked for more than 120 seconds." With no other useful information available, the cause of that ACFS access failure cannot yet be determined.

Seen from these incidents, Oracle Clusterware sits above the operating system and is affected both by the OS underneath it and by the VMware virtualization layer below that. Because monitoring and log granularity differ across these layers, problem analysis becomes considerably more complex, and much of the information cannot be traced further down the stack to identify a root cause.

#### **3.2 Follow-up recommendations**

Taking the past incidents and the overall architecture into account, the recommendations are as follows (a sketch for items 1 and 3 follows the list):

1. Strengthen monitoring of the Linux virtual hosts, for example by enabling OSWatcher and Zabbix monitoring.
2. Contact the VMware administrators to discuss whether the Linux hosts can also be monitored from the VMware layer, and whether finer-grained monitoring of the virtual machines themselves and of the underlying physical servers can be obtained.
3. The memory_max_target parameter of the ASM instances is currently at its default of 1076M; it is recommended to raise it to 2048M to improve ASM instance performance.
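A minimal sketch for recommendations 1 and 3. The OSWatcher install path, the startOSWbb.sh arguments and the ASM SID are assumptions to adapt; memory_max_target is a static parameter, so the change only takes effect after a rolling restart of the ASM instances.

```bash
#!/bin/bash
# Sketch only -- adjust paths and SIDs to the environment.

# Recommendation 1: start OSWatcher Black Box (30 s snapshots, keep 48 h, gzip).
cd /opt/oswbb && ./startOSWbb.sh 30 48 gzip

# Recommendation 3: raise memory_max_target on the local ASM instance.
# Run as the Grid Infrastructure owner; scope=spfile because the parameter is
# static, then restart ASM on each node in turn.
export ORACLE_SID=+ASM1
sqlplus / as sysasm <<'EOF'
show parameter memory_max_target
alter system set memory_max_target=2048M scope=spfile sid='*';
exit
EOF
```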