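On several Oracle ODA X9-2 systems that I maintain, a compute node raised the same alert: "Description: The Service Processor power-on self test has detected a problem. Probability: 100, UUID: cd1ebbdf-f099-61de-ca44-ef646defe034, Resource: /SYS/MB". The description sounds serious, but the hosts were in fact running normally, and after analysis I concluded that the alert was a false positive. ILOM, the hardware management platform on Oracle ODA, provides an interface for clearing faults. After I cleared the alert with the steps below, it disappeared, and the systems have kept running normally under continued observation.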
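The steps are as follows:
1. View the alert information:
Click the alert to view its details:
2. View the alert information from the command-line interface (serial numbers have been redacted, so do not try to match them)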
[root@aaadb1 ~]# ipmitool sunoem cli
Connected. Use ^D to exit.
-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp> fmadm faulty
Time UUID msgid Severity
2023-07-31/08:55:39 cd1ebbdf-f099-61de-ca44-ef646defe034 ILOM-8000-4T Critical
Problem Status : open
Diag Engine : fdd 1.0
System
Manufacturer : Oracle Corporation
Name : ORACLE SERVER X9-2L
Part_Number : 7603
Serial_Number : 23000
System Component
Firmware_Manufacturer : Oracle Corporation
Firmware_Version : (ILOM)5.1.0.23 r147470,(BIOS)62070300
Firmware_Release : (ILOM)2022.09.03,(BIOS)2022.08.17
Suspect 1 of 1
Problem class : fault.chassis.device.sppost
Certainty : 100%
Affects : /SYS/MB
Status : faulted
FRU
Status : faulty
Location : /SYS/MB
Manufacturer : Oracle Corporation
Name : ASM,MTHRBD,2U
Part_Number : 820000
Revision : 12
Serial_Number : 4650000
Chassis
Manufacturer : Oracle Corporation
Name : ORACLE SERVER X9-2L
Part_Number : 76000000
Serial_Number : 2300000
Description : The Service Processor power-on self test has detected a
problem.
Response : The service-required LED may be illuminated on the affected
FRU and chassis.
Impact : The Service Processor may not be able to perform necessary
functions to power on, monitor, or manage the system.
Action : Please refer to the associated reference document at
http://support.oracle.com/msg/ILOM-8000-4T for the latest
service procedures and policies regarding this diagnosis.
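When several nodes report the same alert, it can help to pull the UUID, message ID and severity out of saved `fmadm faulty` output rather than copying them by hand. The following is a minimal sketch, assuming the summary row has the same four whitespace-separated columns as in the session above. The names `FaultSummary` and `parse_fault_summaries` are hypothetical helpers, not part of ILOM or ipmitool.

```python
import re
from typing import List, NamedTuple


class FaultSummary(NamedTuple):
    time: str
    uuid: str
    msgid: str
    severity: str


# Matches a summary row such as:
# 2023-07-31/08:55:39 cd1ebbdf-f099-61de-ca44-ef646defe034 ILOM-8000-4T Critical
# Assumption: the column layout is the one shown in the session above.
_ROW = re.compile(
    r"^(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+"
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(\S+)\s+(\S+)$"
)


def parse_fault_summaries(output: str) -> List[FaultSummary]:
    """Return one FaultSummary per summary row found in `fmadm faulty` output."""
    faults = []
    for line in output.splitlines():
        match = _ROW.match(line.strip())
        if match:
            faults.append(FaultSummary(*match.groups()))
    return faults


sample = """Time UUID msgid Severity
2023-07-31/08:55:39 cd1ebbdf-f099-61de-ca44-ef646defe034 ILOM-8000-4T Critical
Problem Status : open
"""
faults = parse_fault_summaries(sample)
```

With the sample text, `faults[0].uuid` is the value that is later passed to `fmadm repair`.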
3. Clear the alert
The clearing procedure follows "PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1)"; the 'fmadm repair' command is all that is needed.
- Enter the fault management shell.
-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n) ? y
faultmgmtsp>
- Use 'fmadm repair' to clear the fault.
Rather than the UUID, the FRU path (/SYS/FANBD/FM0) could also be used.
Example 3
Example 3 shows the 'fmadm repaired' command required after the suspect FRU has been replaced. Using the UUID from the 'fmadm faulty' output in Example 1 above, the command would be:
faultmgmtsp> fmadm repair 9df39f93-f356-6d26-e081-e4f3a9872c2f
Example 4
Example 4 shows the 'fmadm repaired' command required after the FRU has been replaced. This example shows the FRU Path from Example 2 above being used. The command would be:
fmadm repair /SYS/MB
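The quoted article accepts either a fault UUID or a FRU path as the argument to `fmadm repair`. Below is a small sketch of how a wrapper might check the argument before the command is typed into the fault management shell. The function name `build_repair_command` and the rule that a FRU path starts with `/SYS/` are assumptions made for illustration, based only on the two paths that appear in this article.

```python
import re

# UUID in the 8-4-4-4-12 hexadecimal form shown by `fmadm faulty`.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def build_repair_command(target: str) -> str:
    """Build an `fmadm repair` command line from a UUID or a FRU path.

    Assumption: FRU paths start with "/SYS/", as in /SYS/MB and
    /SYS/FANBD/FM0 from the examples above.
    """
    target = target.strip()
    if _UUID.match(target) or target.startswith("/SYS/"):
        return f"fmadm repair {target}"
    raise ValueError(f"not a fault UUID or FRU path: {target!r}")


by_uuid = build_repair_command("cd1ebbdf-f099-61de-ca44-ef646defe034")
by_path = build_repair_command("/SYS/MB")
```

The sketch only builds the command string; it does not contact ILOM, so the result still has to be run at the `faultmgmtsp>` prompt.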
The actual session log follows (the repair uses the UUID of the alert event):
faultmgmtsp> fmadm repair cd1ebbdf-f099-61de-ca44-ef646defe034
faultmgmtsp> fmadm faulty
No faults found
faultmgmtsp> exit
-> exit
Disconnected
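To confirm the result across several nodes, the follow-up `fmadm faulty` output can be checked mechanically. This is a rough sketch, assuming the output has been saved as text and that a clean system prints "No faults found" as in the log above. The function name `is_fault_cleared` is hypothetical.

```python
def is_fault_cleared(fmadm_faulty_output: str, uuid: str) -> bool:
    """Return True if the output shows that the given fault is gone.

    Blank output is treated as "cannot confirm" and returns False,
    so a failed capture is not mistaken for a cleared fault.
    """
    text = fmadm_faulty_output.strip().lower()
    if not text:
        return False
    if "no faults found" in text:
        return True
    # Other faults may still be listed; only this UUID matters here.
    return uuid.lower() not in text


after_repair = "No faults found\n"
before_repair = (
    "2023-07-31/08:55:39 cd1ebbdf-f099-61de-ca44-ef646defe034 "
    "ILOM-8000-4T Critical\n"
)
fault_id = "cd1ebbdf-f099-61de-ca44-ef646defe034"
cleared = is_fault_cleared(after_repair, fault_id)
still_open = not is_fault_cleared(before_repair, fault_id)
```

Matching on the UUID instead of only on "No faults found" keeps the check usable when an unrelated fault is still open on the same node.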