Preface
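Anyone who works with servers regularly knows how important RAID is; most modern servers are deployed on top of it. Take RAID 1 as an example: the disks in the array mirror each other, so as long as both disks do not fail at the same time, no data is lost. You only need to replace the failed disk and nothing else, because the RAID rebuilds itself automatically. RAID 0, RAID 5, RAID 10 and the other levels are not covered here, since they are not the focus of this article. Today we only discuss how to inspect a server's RAID configuration.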
This article covers MegaRAID only and does not deal with other RAID implementations.
I. MegaRAID
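MegaRAID is a RAID (Redundant Array of Independent Disks) controller technology commonly used in servers and high-end storage devices. With MegaRAID, users can configure and manage RAID arrays of hard disk drives (HDD) or solid-state drives (SSD) to provide data redundancy, better performance, and more storage capacity.
MegaRAID is a product of LSI, which was later acquired by Avago. Avago has since taken the Broadcom name, which is why the tools are now found on Broadcom's website.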
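The problem we face is that Linux cannot directly show hardware RAID information (arrays created through the BIOS); it only shows software RAID information (arrays created in software). Vendors therefore provide dedicated tools that help developers and operators inspect and manage RAID. MegaRAID is one such case, and the RAID I use was set up in the BIOS.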
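Before installing anything, it can help to confirm that the machine actually has a hardware RAID controller. The sketch below is my own illustration and not part of MegaRAID's tooling: it filters lspci-style text for lines that mention RAID. The function name find_raid_controllers and the sample text are made up for the example, and a plain keyword match like this may miss some controllers or match unrelated devices.

```shell
#!/usr/bin/env bash
# Filter lspci-style text (read from stdin) for lines that mention RAID.
# The single keyword "raid" is an assumption, not an exhaustive detector.
find_raid_controllers() {
    grep -i 'raid' || true   # "|| true": finding nothing is not an error
}

# Typical use on a real machine (lspci comes from the pciutils package):
#   lspci | find_raid_controllers
# Software RAID (md) arrays, by contrast, are listed in /proc/mdstat.

# Self-contained demo on made-up sample text:
sample='00:1f.2 SATA controller: Example AHCI Controller
03:00.0 RAID bus controller: Example MegaRAID SAS Controller'
printf '%s\n' "$sample" | find_raid_controllers
```

If the filter prints nothing on your machine, the disks are probably attached to a plain SATA/SAS controller and MegaCLI will have nothing to report.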
II. Installing MegaCLI
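MegaRAID comes with a management tool called MegaCLI. It can be downloaded from the Broadcom website and supports several Linux distributions (Ubuntu, CentOS, RedHat...). This article focuses on installing MegaCLI on Ubuntu.
Note: Ubuntu 18.04 is installed differently from later releases; the details follow!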
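1. Ubuntu 18.04
This release is the easy one: MegaCLI can be installed with APT. Run the steps below in order. I have verified them several times without problems; if you hit a conflict, the cause is most likely your environment.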
bash
# Append the repository to the end of /etc/apt/sources.list
sudo sed -i '$a\deb http://hwraid.le-vert.net/ubuntu precise main' /etc/apt/sources.list
# Add the repository signing key
wget -O - http://hwraid.le-vert.net/debian/hwraid.le-vert.net.gpg.key | sudo apt-key add -
# Refresh the package index
sudo apt update
# Install
sudo apt-get install megacli megactl megaraid-status
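2. Ubuntu 22.04
The method above does not work on this release, because megacli depends on Python 2 and 22.04 has removed Python 2. Forcing the installation might work, but it is not the best approach.
Instead, I uploaded a MegaRAID.tar containing the executables and libraries I extracted from CentOS 7. The vendor only provides .rpm packages and no .deb packages, so this is how I ported the tool to Ubuntu.
Resource link
After downloading MegaRAID.tar, run the following commands: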
bash
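# Extract the archive
sudo tar -xmf MegaRAID.tar
# Move it into place
sudo mv MegaRAID /opt
# Create a symlink
sudo ln -s /opt/MegaRAID/MegaCli/MegaCli64 /usr/sbin/megacli
# Install the dependency if it is missing; skip this step otherwise
sudo apt install libncurses5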
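MegaCLI is now ready to use.
3. Ubuntu 20.04
This comes after 22.04 because I do not have a 20.04 machine at hand. The megacli package installed via APT depends on Python 2, and I do not remember whether 20.04 still ships it. Run which python2 to check whether Python 2.7 is preinstalled: if it is, install as on 18.04; if not, install as on 22.04.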
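The decision just described can be sketched as a small shell helper. This is only an illustration, assuming that finding a python2 executable on PATH is a good enough signal; the function name choose_install_method and the two labels it prints are hypothetical.

```shell
#!/usr/bin/env bash
# Print which install route from this article applies, given the name of a
# Python 2 interpreter to look for (defaults to "python2").
choose_install_method() {
    local interpreter="${1:-python2}"
    if command -v "$interpreter" >/dev/null 2>&1; then
        echo "apt"        # Python 2 present: follow the Ubuntu 18.04 steps
    else
        echo "tarball"    # Python 2 absent: follow the Ubuntu 22.04 steps
    fi
}

choose_install_method
```

The interpreter name is a parameter only so the helper can be tried against any command; on a real machine you would call it without arguments.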
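4. CentOS 7
Tested: the 22.04 method works here as well.
5. Other distributions
These are the distributions I use regularly. For anything else, try the multi-platform download below.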
MegaCLI
III. Using MegaCLI
Make sure the installation has finished.
Note: these commands need sudo or a root shell!
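A minimal preflight sketch for the two requirements above (tool installed, running as root). The function name preflight and its return codes are my own choices for illustration and not part of MegaCLI.

```shell
#!/usr/bin/env bash
# Return 0 when the given command exists on PATH and the given UID is 0 (root).
# Returns 1 if the tool is missing, 2 if the UID is not root.
preflight() {
    local tool="$1" uid="$2"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool not found on PATH" >&2
        return 1
    fi
    if [ "$uid" -ne 0 ]; then
        echo "run as root or via sudo" >&2
        return 2
    fi
    return 0
}

# Typical use before any of the commands in this section:
#   preflight megacli "$(id -u)" && megacli -adpCount
preflight megacli "$(id -u)" || echo "preflight failed"
```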
bash
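# Show the machine model
dmidecode | grep "Product"
# Show the manufacturer
dmidecode | grep "Manufacturer"
# Show the serial number
dmidecode | grep "Serial Number"
# Show CPU information
dmidecode | grep "CPU"
# Count the CPUs
dmidecode | grep "Socket Designation: CPU" | wc -l
# Show the manufacturing date
dmidecode | grep "Date"
# Show the BBU charger status
megacli -AdpBbuCmd -GetBbuStatus -aALL | grep "Charger Status"
# Show BBU status information
megacli -AdpBbuCmd -GetBbuStatus -aALL
# Show BBU capacity information
megacli -AdpBbuCmd -GetBbuCapacityInfo -aALL
# Show BBU design parameters
megacli -AdpBbuCmd -GetBbuDesignInfo -aALL
# Show the current BBU properties
megacli -AdpBbuCmd -GetBbuProperties -aALL
# Show the charge level as a percentage
megacli -AdpBbuCmd -GetBbuStatus -aALL | grep "Relative State of Charge"
# Show the number of RAID disk groups
megacli -cfgdsply -aALL | grep "Number of DISK GROUPS:"
# Show the RAID card model, RAID settings and disk information
megacli -cfgdsply -aALL
# List all physical drives
megacli -PDList -aALL
# Show all logical drive information
megacli -LDInfo -LALL -aAll
# Show the rebuild progress of a physical drive (important); [1:5] is [enclosure:slot]
megacli -PDRbld -ShowProg -PhysDrv [1:5] -a0
# Show the number of adapters
megacli -adpCount
# Show the adapter time
megacli -AdpGetTime -aALL
# Show all adapter information
megacli -AdpAllInfo -aAll
# Show the cache policy settings
megacli -cfgdsply -aALL | grep Polic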
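A quick word on the two most commonly used commands: the first shows the RAID configuration, the second shows the physical disks.
1. Viewing the RAID configuration on all adapters
In MegaRAID terms, an adapter is a RAID controller card, and each adapter has a unique ID (0 in the output below). The RAID arrays configured on an adapter are reported as virtual drives.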
bash
sudo megacli -LDInfo -LALL -aAll
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 893.75 GB
Sector Size : 512
Is VD emulated : Yes
Mirror Data : 893.75 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
RAID Level: the configured RAID level; mine is RAID 1.
RAID Level mapping:
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 #RAID 1
RAID Level : Primary-0, Secondary-0, RAID Level Qualifier-0 #RAID 0
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3 #RAID 5
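Size: the logical disk capacity, that is, the usable capacity you see in the operating system once the RAID is built.
Sector Size: the sector size in bytes (512 here). I will not go into detail; feel free to look into it yourself.
Number Of Drives: how many physical disks take part in this RAID; 2 in this case.
Note: I spent a long time looking into how many disks RAID 1 supports. What I can confirm is that Intel's RAID is limited to exactly 2 disks, while Wikipedia says at least 2. Every setup I have seen used 2 disks, and I have never seen one with more. Two disks are already quite safe, since both failing at the same time is unlikely, so there seems to be little need for 3 or more. Cost matters too: server disks are not cheap!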
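The RAID Level mapping listed above can be turned into a small lookup. This is a sketch that covers only the three combinations given in this article and reports anything else as unknown; the function name raid_level_name is hypothetical.

```shell
#!/usr/bin/env bash
# Translate a "RAID Level" line from "megacli -LDInfo" output into a short name.
# Only the three combinations listed in this article are recognised.
raid_level_name() {
    case "$1" in
        *"Primary-1, Secondary-0, RAID Level Qualifier-0"*) echo "RAID 1" ;;
        *"Primary-0, Secondary-0, RAID Level Qualifier-0"*) echo "RAID 0" ;;
        *"Primary-5, Secondary-0, RAID Level Qualifier-3"*) echo "RAID 5" ;;
        *) echo "unknown" ;;
    esac
}

# Demo on the line from the output shown earlier:
raid_level_name "RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0"

# Typical use on a real machine:
#   sudo megacli -LDInfo -LALL -aALL | grep 'RAID Level' |
#       while IFS= read -r line; do raid_level_name "$line"; done
```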
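2. Viewing all physical disks
This is what I really wanted to see. From within Linux you cannot tell how many disks are installed, which slot each disk sits in, or which RAID each disk belongs to, and all of that is important information. The BIOS shows it, but I should not have to reboot a server just to look; some servers run 24 hours a day!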
bash
sudo megacli -PDList -aALL
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: **************
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size: 512
Logical Sector Size: 512
Physical Sector Size: 4096
Firmware state: Online, Spun Up
Device Firmware Level: 0100
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x4433221104000000
Connected Port Number: 1(path0)
Inquiry Data: PHYF8454006A960CGN INTEL SSDSC2KB960G8 XCV10100
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive: Not Certified
Drive Temperature :18C (64.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 0, Span: 0, Arm: 1
Enclosure position: 1
Device Id: 1
WWN: *************
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size: 512
Logical Sector Size: 512
Physical Sector Size: 4096
Firmware state: Online, Spun Up
Device Firmware Level: 0100
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x4433221100000000
Connected Port Number: 0(path0)
Inquiry Data: PHYF845406QH960CGN INTEL SSDSC2KB960G8 XCV10100
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive: Not Certified
Drive Temperature :19C (66.20 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
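Slot Number: the slot the drive sits in.
PD Type: the physical drive interface; SATA means the drive is attached over SATA.
Device Speed: the speed the device is rated for.
Link Speed: the speed of the link. Presumably the two can differ when, for example, a SATA 3 device is plugged into a SATA 2 port.
Media Type: the physical medium of the disk. I have only seen solid state and hard disk; consult the official documentation for others.
Drive Temperature: the drive temperature in Celsius/Fahrenheit. Around 40°C is normal and 50°C is on the high side; if it runs hotter, check the machine's cooling.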
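Since the PDList output is long, the fields explained above can be condensed into one line per drive. The following is a sketch that assumes the field layout shown in this article's output; the function name summarize_drives, the HOT marker, and the default threshold of 50 (taken from the rule of thumb above) are my own choices. Drives that report no numeric temperature are not handled.

```shell
#!/usr/bin/env bash
# Condense "megacli -PDList" text (read from stdin) into one line per drive:
# slot, firmware state and temperature, flagging drives at or above a limit.
summarize_drives() {
    awk -v limit="${1:-50}" '
        /Slot Number:/     { slot = $3 }
        /Firmware state:/  { state = $0; sub(/^ *Firmware state: */, "", state) }
        /Drive Temperature/ {
            temp = $0
            sub(/^ *Drive Temperature *: */, "", temp)  # e.g. "18C (64.40 F)"
            sub(/C.*$/, "", temp)                       # e.g. "18"
            flag = (temp + 0 >= limit + 0) ? " HOT" : ""
            printf "slot %s: %s, %sC%s\n", slot, state, temp, flag
        }
    '
}

# Demo on an excerpt of the output shown earlier:
summarize_drives 50 <<'EOF'
Slot Number: 0
Firmware state: Online, Spun Up
Drive Temperature :18C (64.40 F)
Slot Number: 1
Firmware state: Online, Spun Up
Drive Temperature :19C (66.20 F)
EOF

# Typical use on a real machine:
#   sudo megacli -PDList -aALL | summarize_drives 50
```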
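IV. A handy script
I found a handy script online that saves typing all those commands every time. Using it is simple: copy it, save it as lsi.sh, and run ./lsi.sh to see the usage instructions. Before using the slot-based commands, set the ENCLOSURE variable to your own Enclosure Device ID (32 in the PDList output above).
Note: this script is better suited to operations work and quick troubleshooting, since these commands are a pain to remember; for advanced usage, consult the official documentation.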
bash
#!/bin/bash
#
# Calomel.org
# https://calomel.org/megacli_lsi_commands.html
# LSI MegaRaid CLI
# lsi.sh @ Version 0.05
#
# description: MegaCLI script to configure and monitor LSI raid cards.
# Full path to the MegaRaid CLI binary
#MegaCli="/usr/local/sbin/MegaCli64"
MegaCli=`which megacli`
# The identifying number of the enclosure. Default for our systems is "8". Use
# "MegaCli64 -PDlist -a0 | grep "Enclosure Device"" to see what your number
# is and set this variable.
ENCLOSURE="8"
if [ $# -eq 0 ]
then
echo ""
echo " OBPG .:. lsi.sh $arg1 $arg2"
echo "-----------------------------------------------------"
echo "status = Status of Virtual drives (volumes)"
echo "drives = Status of hard drives"
echo "ident \$slot = Blink light on drive (need slot number)"
echo "good \$slot = Simply makes the slot \"Unconfigured(good)\" (need slot number)"
echo "replace \$slot = Replace \"Unconfigured(bad)\" drive (need slot number)"
echo "progress = Status of drive rebuild"
echo "errors = Show drive errors which are non-zero"
echo "bat = Battery health and capacity"
echo "batrelearn = Force BBU re-learn cycle"
echo "logs = Print card logs"
echo "checkNemail = Check volume(s) and send email on raid errors"
echo "allinfo = Print out all settings and information about the card"
echo "settime = Set the raid card's time to the current system time"
echo "setdefaults = Set preferred default settings for new raid setup"
echo ""
exit
fi
# General status of all RAID virtual disks or volumes and if PATROL disk check
# is running.
if [ $1 = "status" ]
then
$MegaCli -LDInfo -Lall -aALL -NoLog
echo "###############################################"
$MegaCli -AdpPR -Info -aALL -NoLog
echo "###############################################"
$MegaCli -LDCC -ShowProg -LALL -aALL -NoLog
exit
fi
# Shows the state of all drives and if they are online, unconfigured or missing.
if [ $1 = "drives" ]
then
$MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g'
exit
fi
# Use to blink the light on the slot in question. Hit enter again to turn the blinking light off.
if [ $1 = "ident" ]
then
$MegaCli -PdLocate -start -physdrv[$ENCLOSURE:$2] -a0 -NoLog
logger "`hostname` - identifying enclosure $ENCLOSURE, drive $2 "
read -p "Press [Enter] key to turn off light..."
$MegaCli -PdLocate -stop -physdrv[$ENCLOSURE:$2] -a0 -NoLog
exit
fi
# When a new drive is inserted it might have old RAID headers on it. This
# method simply removes old RAID configs from the drive in the slot and make
# the drive "good." Basically, Unconfigured(bad) to Unconfigured(good). We use
# this method on our FreeBSD ZFS machines before the drive is added back into
# the zfs pool.
if [ $1 = "good" ]
then
# set Unconfigured(bad) to Unconfigured(good)
$MegaCli -PDMakeGood -PhysDrv[$ENCLOSURE:$2] -a0 -NoLog
# clear 'Foreign' flag or invalid raid header on replacement drive
$MegaCli -CfgForeign -Clear -aALL -NoLog
exit
fi
# Use to diagnose bad drives. When no errors are shown only the slot numbers
# will print out. If a drive(s) has an error you will see the number of errors
# under the slot number. At this point you can decide to replace the flaky
# drive. Bad drives might not fail right away and will slow down your raid with
# read/write retries or corrupt data.
if [ $1 = "errors" ]
then
echo "Slot Number: 0"; $MegaCli -PDlist -aALL -NoLog | egrep -i 'error|fail|slot' | egrep -v ' 0'
exit
fi
# status of the battery and the amount of charge. Without a working Battery
# Backup Unit (BBU) most of the LSI read/write caching will be disabled
# automatically. You want caching for speed so make sure the battery is ok.
if [ $1 = "bat" ]
then
$MegaCli -AdpBbuCmd -aAll -NoLog
exit
fi
# Force a Battery Backup Unit (BBU) re-learn cycle. This will discharge the
# lithium BBU unit and recharge it. This check might take a few hours and you
# will want to always run this in off hours. LSI suggests a battery relearn
# monthly or so. We actually run it every three(3) months by way of a cron job.
# Understand if your "Current Cache Policy" is set to "No Write Cache if Bad
# BBU" then write-cache will be disabled during this check. This means writes
# to the raid will be VERY slow at about 1/10th normal speed. NOTE: if the
# battery is new (new bats should charge for a few hours before they register)
# or if the BBU comes up and says it has no charge try powering off the machine
# and restart it. This will force the LSI card to re-evaluate the BBU. Silly
# but it works.
if [ $1 = "batrelearn" ]
then
$MegaCli -AdpBbuCmd -BbuLearn -aALL -NoLog
exit
fi
# Use to replace a drive. You need the slot number and may want to use the
# "drives" method to show which drive in a slot is "Unconfigured(bad)". Once
# the new drive is in the slot and spun up this method will bring the drive
# online, clear any foreign raid headers from the replacement drive and set the
# drive as a hot spare. We will also tell the card to start rebuilding if it
# does not start automatically. The raid should start rebuilding right away
# either way. NOTE: if you pass a slot number which is already part of the raid
# by mistake the LSI raid card is smart enough to just error out and _NOT_
# destroy the raid drive, thankfully.
if [ $1 = "replace" ]
then
logger "`hostname` - REPLACE enclosure $ENCLOSURE, drive $2 "
# set Unconfigured(bad) to Unconfigured(good)
$MegaCli -PDMakeGood -PhysDrv[$ENCLOSURE:$2] -a0 -NoLog
# clear 'Foreign' flag or invalid raid header on replacement drive
$MegaCli -CfgForeign -Clear -aALL -NoLog
# set drive as hot spare
$MegaCli -PDHSP -Set -PhysDrv [$ENCLOSURE:$2] -a0 -NoLog
# show rebuild progress on replacement drive just to make sure it starts
$MegaCli -PDRbld -ShowProg -PhysDrv [$ENCLOSURE:$2] -a0 -NoLog
exit
fi
# Print all the logs from the LSI raid card. You can grep on the output.
if [ $1 = "logs" ]
then
$MegaCli -FwTermLog -Dsply -aALL -NoLog
exit
fi
# Use to query the RAID card and find the drive which is rebuilding. The script
# will then query the rebuilding drive to see what percentage it is rebuilt and
# how much time it has taken so far. You can then guess-ti-mate the
# completion time.
if [ $1 = "progress" ]
then
DRIVE=`$MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g' | egrep build | awk '{print $3}'`
$MegaCli -PDRbld -ShowProg -PhysDrv [$ENCLOSURE:$DRIVE] -a0 -NoLog
exit
fi
# Use to check the status of the raid. If the raid is degraded or faulty the
# script will send email to the address in the $EMAIL variable. We normally add
# this method to a cron job to be run every few hours so we are notified of any
# issues.
if [ $1 = "checkNemail" ]
then
EMAIL="raidadmin@localhost"
# Check if raid is in good condition
STATUS=`$MegaCli -LDInfo -Lall -aALL -NoLog | egrep -i 'fail|degrad|error'`
# On bad raid status send email with basic drive information
if [ "$STATUS" ]; then
$MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g' | mail -s `hostname`' - RAID Notification' $EMAIL
fi
fi
# Use to print all information about the LSI raid card. Check default options,
# firmware version (FW Package Build), battery back-up unit presence, installed
# cache memory and the capabilities of the adapter. Pipe to grep to find the
# term you need.
if [ $1 = "allinfo" ]
then
$MegaCli -AdpAllInfo -aAll -NoLog
exit
fi
# Update the LSI card's time with the current operating system time. You may
# want to setup a cron job to call this method once a day or whenever you
# think the raid card's time might drift too much.
if [ $1 = "settime" ]
then
$MegaCli -AdpGetTime -aALL -NoLog
$MegaCli -AdpSetTime `date +%Y%m%d` `date +%H:%M:%S` -aALL -NoLog
$MegaCli -AdpGetTime -aALL -NoLog
exit
fi
# These are the defaults we like to use on the hundreds of raids we manage. You
# will want to go through each option here and make sure you want to use them
# too. These options are for speed optimization, build rate tweaks and PATROL
# options. When setting up a new machine we simply execute the "setdefaults"
# method and the raid is configured. You can use this on live raids too.
if [ $1 = "setdefaults" ]
then
# Read Cache enabled specifies that all reads are buffered in cache memory.
$MegaCli -LDSetProp -Cached -LAll -aAll -NoLog
# Adaptive Read-Ahead if the controller receives several requests to sequential sectors
$MegaCli -LDSetProp ADRA -LALL -aALL -NoLog
# Hard Disk cache policy enabled allowing the drive to use internal caching too
$MegaCli -LDSetProp EnDskCache -LAll -aAll -NoLog
# Write-Back cache enabled
$MegaCli -LDSetProp WB -LALL -aALL -NoLog
# Continue booting with data stuck in cache. Set Boot with Pinned Cache Enabled.
$MegaCli -AdpSetProp -BootWithPinnedCache -1 -aALL -NoLog
# PATROL run every 672 hours or monthly (RAID6 77TB @60% rebuild takes 21 hours)
$MegaCli -AdpPR -SetDelay 672 -aALL -NoLog
# Check Consistency every 672 hours or monthly
$MegaCli -AdpCcSched -SetDelay 672 -aALL -NoLog
# Enable autobuild when a new Unconfigured(good) drive is inserted or set to hot spare
$MegaCli -AdpAutoRbld -Enbl -a0 -NoLog
# RAID rebuild rate to 60% (build quick before another failure)
$MegaCli -AdpSetProp \{RebuildRate -60\} -aALL -NoLog
# RAID check consistency rate to 60% (fast parity checks)
$MegaCli -AdpSetProp \{CCRate -60\} -aALL -NoLog
# Enable Native Command Queue (NCQ) on all drives
$MegaCli -AdpSetProp NCQEnbl -aAll -NoLog
# Sound alarm disabled (server room is too loud anyways)
$MegaCli -AdpSetProp AlarmDsbl -aALL -NoLog
# Use write-back cache mode even if BBU is bad. Make sure your machine is on UPS too.
$MegaCli -LDSetProp CachedBadBBU -LAll -aAll -NoLog
# Disable auto learn BBU check which can severely affect raid speeds
OUTBBU=$(mktemp /tmp/output.XXXXXXXXXX)
echo "autoLearnMode=1" > $OUTBBU
$MegaCli -AdpBbuCmd -SetBbuProperties -f $OUTBBU -a0 -NoLog
rm -rf $OUTBBU
exit
fi
### EOF ###
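V. Configuration
Left empty for now; it is best to consult the official documentation. I will not write down methods I have not verified, because a wrong operation would rightly earn me a roasting. Working on disks is no joking matter. Remember, data is priceless!!!
Summary
1. Not all MegaRAID hardware is supported; install the tool and try it to find out.
2. No RAID configuration steps are given because they carry risk. My own hardware is limited and there is much I cannot actually test, so I would rather not mislead anyone into losing data. If you need the commands, look them up in the official documentation.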