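I recently handled several cases in which every SSD in a NetApp FAS storage system failed after a restart. This post summarizes those cases for anyone who runs into the same problem.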
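Case 1: A customer's FAS8020 has one DS2246 shelf holding 24 X447A 800GB drives. After a power loss in the machine room the system would not boot. Logging in over the serial console and dropping to the LOADER prompt showed the following error:
Every disk reported "failed initialization due to error 5".
Because the system disks are among these drives, the system cannot start and keeps rebooting.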
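Case 2: A customer's FAS8040 has a DS2246 expansion shelf with 24 X447A 800GB SSDs. Again the machine room lost power, and all controllers and expansion shelves restarted. Afterwards all 24 drives in the shelf showed as failed with zero capacity, as shown in the figure below:
With that many drives failed, the customer's aggregate is certainly offline, all services are down, and data loss is a real possibility.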
Cause:
All of the SSDs failed because of a bug in the SSD firmware. NetApp's official description of the problem follows:
PHM2* (AFF/FAS) and PX02* (E-Series) SSD drives have a known defect in the internal logging behavior that might cause a drive failure when the following conditions are met:
- The drive has been powered on for more than 70,000 hours (power-on hours value)
- The drive is power cycled (turned off, then on again)
After the drive exceeds 70,000 power-on hours and the drive is powered off, when the drive is next powered on it might return a check condition 4/4C/A8. This check condition might cause the drive to fail.
For scale, 70,000 power-on hours is roughly eight years of cumulative operation. The defect only surfaces at the next power cycle, which is why a machine-room power loss can fail every drive in a shelf at the same time.
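The check condition 4/4C/A8 quoted in the advisory is a SCSI sense triplet written as sense key / ASC / ASCQ in hexadecimal. The sketch below only splits the triplet and names the sense key; the function and class names are illustrative, the key table is a partial list from the SCSI standard, and treating ASCQ values of 80h and above as vendor-specific is stated here as my understanding of that standard rather than anything taken from NetApp documentation.

```python
from typing import NamedTuple

# A few sense-key names from the SCSI Primary Commands standard (partial list).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}


class SenseCode(NamedTuple):
    key: int   # sense key
    asc: int   # additional sense code
    ascq: int  # additional sense code qualifier


def parse_sense(text: str) -> SenseCode:
    """Parse 'key/ASC/ASCQ' written in hexadecimal, e.g. '4/4C/A8'."""
    key, asc, ascq = (int(part, 16) for part in text.strip().split("/"))
    return SenseCode(key, asc, ascq)


def describe(code: SenseCode) -> str:
    key_name = SENSE_KEYS.get(code.key, "UNKNOWN")
    # Assumption: qualifier values 0x80-0xFF are vendor-specific.
    scope = "vendor-specific" if code.ascq >= 0x80 else "standard"
    return (
        f"sense key {code.key:X}h ({key_name}), "
        f"ASC {code.asc:02X}h, ASCQ {code.ascq:02X}h ({scope})"
    )


code = parse_sense("4/4C/A8")
print(describe(code))
```

Read this way, the drive is reporting a hardware-level error about itself at power-on, which is consistent with the controller then marking it failed.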
The following drives are affected by this bug:
Drive Identifier    Capacity   Firmware
----------------    --------   --------
X438_PHM2400MCTO    400GB      NA05
X439_PHM21T6MCTO    1.6TB      NA05
X440_PHM2800MCTO    800GB      NA05
X446_PHM2200MCTO    200GB      NA05
X447_PHM2800MCTO    800GB      NA05
X448_PHM2200MCTO    200GB      NA05
X449_PHM2800MCTO    800GB      NA05
X575_PHM2400MCTO    400GB      NA05
X576_PHM21T6MCTO    1.6TB      NA05
X577_PHM2800MCTO    800GB      NA05
PX02SMU080          800GB      MS03
PX02SMF080          800GB      MS03
PX02SMF040          400GB      MS03
PX02SMB160          1.6TB      MS03
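The drive list and the 70,000-hour threshold can be combined into a quick risk check before any planned power-down. The following is a minimal sketch, not a NetApp tool: the `Drive` record, the bay labels, and the hour values are made up for illustration, and how you collect power-on hours from your system is left open. The firmware column is deliberately ignored, because this post does not say whether it lists the affected or the corrected firmware level.

```python
from dataclasses import dataclass

# Model families named in the advisory (PHM2* and PX02*).
AFFECTED_PREFIXES = ("PHM2", "PX02")
POWER_ON_HOURS_LIMIT = 70_000


@dataclass
class Drive:
    name: str            # hypothetical bay label
    model: str           # e.g. "X447_PHM2800MCTO" or "PX02SMU080"
    power_on_hours: int  # cumulative power-on hours reported by the drive


def vendor_part(model: str) -> str:
    """Strip a leading 'X447_' style prefix, if one is present."""
    return model.split("_", 1)[-1]


def at_risk(drive: Drive) -> bool:
    """True if the drive meets both the model and the hours condition."""
    affected = vendor_part(drive.model).startswith(AFFECTED_PREFIXES)
    return affected and drive.power_on_hours > POWER_ON_HOURS_LIMIT


# Illustrative inventory; every value here is invented.
inventory = [
    Drive("0a.00.0", "X447_PHM2800MCTO", 71_250),
    Drive("0a.00.1", "X447_PHM2800MCTO", 52_000),
    Drive("0a.00.2", "X999_EXAMPLE", 80_000),
]

for drive in inventory:
    if at_risk(drive):
        print(f"{drive.name}: {drive.model} at {drive.power_on_hours} h, do not power cycle")
```

Any drive flagged this way should stay powered until the fix is in place, since the failure is triggered by the power cycle itself.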
The following ONTAP versions are affected:
8.3RC2, 9.7P3, 9.3P18, 9.1P3, 9.8, 9.1P8, 8.2.5P2
NetApp's official fix is to upgrade ONTAP to one of the following versions, which avoids the problem:
9.10.0, 9.10.0P1, 9.10.1, 9.10.1P1, 9.10.1P10, 9.10.1P11, 9.10.1P12,
9.10.1P2, 9.10.1P3, 9.10.1P4, 9.10.1P5, 9.10.1P6, 9.10.1P7, 9.10.1P8,
9.10.1P9, 9.10.1RC1, 9.10.1RC1P1, 9.10.1RC2, 9.11.0, 9.11.0P1,
9.11.0P2, 9.11.1, 9.11.1P1, 9.11.1P2, 9.11.1P3, 9.11.1P4, 9.11.1P5,
9.11.1P6, 9.11.1P7, 9.11.1P8, 9.11.1P9, 9.11.1RC1, 9.11.1RC1P1, 9.12.0,
9.12.0P1, 9.12.0P2, 9.12.1, 9.12.1P1, 9.12.1P2, 9.12.1P3, 9.12.1RC1,
9.12.1RC1P1, 9.13.0, 9.13.0P1, 9.13.0P2, 9.13.1RC1, 9.5P17, 9.5P18,
9.5P19, 9.6P15, 9.6P16, 9.6P17, 9.6P18, 9.7P13, 9.7P14, 9.7P15, 9.7P16,
9.7P17, 9.7P18, 9.7P19, 9.7P20, 9.7P21, 9.7P22, 9.8P10, 9.8P11, 9.8P12,
9.8P13, 9.8P14, 9.8P15, 9.8P16, 9.8P17, 9.8P18, 9.8P4, 9.8P5, 9.8P6,
9.8P7, 9.8P8, 9.8P9, 9.9.1, 9.9.1P1, 9.9.1P10, 9.9.1P11, 9.9.1P12,
9.9.1P13, 9.9.1P14, 9.9.1P15, 9.9.1P2, 9.9.1P3, 9.9.1P4, 9.9.1P5, 9.9.1P6,
9.9.1P7, 9.9.1P8, 9.9.1P9, 9.9.1RC1
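Scanning a list this long by eye is error-prone, so a small lookup can help. The sketch below copies the release list above into a string and tests membership; the function names are illustrative. It makes no attempt to order versions, so a release newer than anything listed here returns False even though it may well contain the fix.

```python
import re

# Fixed releases, copied from the list above.
FIXED_RELEASES = """
9.10.0, 9.10.0P1, 9.10.1, 9.10.1P1, 9.10.1P10, 9.10.1P11, 9.10.1P12,
9.10.1P2, 9.10.1P3, 9.10.1P4, 9.10.1P5, 9.10.1P6, 9.10.1P7, 9.10.1P8,
9.10.1P9, 9.10.1RC1, 9.10.1RC1P1, 9.10.1RC2, 9.11.0, 9.11.0P1,
9.11.0P2, 9.11.1, 9.11.1P1, 9.11.1P2, 9.11.1P3, 9.11.1P4, 9.11.1P5,
9.11.1P6, 9.11.1P7, 9.11.1P8, 9.11.1P9, 9.11.1RC1, 9.11.1RC1P1, 9.12.0,
9.12.0P1, 9.12.0P2, 9.12.1, 9.12.1P1, 9.12.1P2, 9.12.1P3, 9.12.1RC1,
9.12.1RC1P1, 9.13.0, 9.13.0P1, 9.13.0P2, 9.13.1RC1, 9.5P17, 9.5P18,
9.5P19, 9.6P15, 9.6P16, 9.6P17, 9.6P18, 9.7P13, 9.7P14, 9.7P15, 9.7P16,
9.7P17, 9.7P18, 9.7P19, 9.7P20, 9.7P21, 9.7P22, 9.8P10, 9.8P11, 9.8P12,
9.8P13, 9.8P14, 9.8P15, 9.8P16, 9.8P17, 9.8P18, 9.8P4, 9.8P5, 9.8P6,
9.8P7, 9.8P8, 9.8P9, 9.9.1, 9.9.1P1, 9.9.1P10, 9.9.1P11, 9.9.1P12,
9.9.1P13, 9.9.1P14, 9.9.1P15, 9.9.1P2, 9.9.1P3, 9.9.1P4, 9.9.1P5, 9.9.1P6,
9.9.1P7, 9.9.1P8, 9.9.1P9, 9.9.1RC1
"""

FIXED = {
    item.strip().upper()
    for item in FIXED_RELEASES.replace("\n", " ").split(",")
    if item.strip()
}

# major.minor[.patch][RCn][Pn], e.g. 9.7P13 or 9.10.1RC1P1
VERSION_RE = re.compile(r"\d+\.\d+(?:\.\d+)?(?:RC\d+)?(?:P\d+)?", re.IGNORECASE)


def normalize(version: str) -> str:
    """Pull the first version token out of a string and upper-case it."""
    match = VERSION_RE.search(version)
    if match is None:
        raise ValueError(f"no ONTAP version found in {version!r}")
    return match.group(0).upper()


def is_listed_as_fixed(version: str) -> bool:
    """True only if the version appears in the list copied above."""
    return normalize(version) in FIXED


for candidate in ("9.7P3", "9.7P13", "9.9.1"):
    status = "listed as fixed" if is_listed_as_fixed(candidate) else "not in the fixed list"
    print(f"{candidate}: {status}")
```

A False result means "check with NetApp support", not "definitely affected".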
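That raises the question of what to do if you have already hit the problem. The official answer is:
There is no workaround. If you encounter this issue, contact the support center.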
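We have built up extensive experience with this kind of outage and data-loss case. Add StorageExpert on WeChat to discuss an on-site fix; in the cases we have handled, data recovery has succeeded 100% of the time.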