Correcting Hard Disk Failures with Adaptec RAID Controllers using ARCCONF
This article describes the recommended procedure for handling a hard disk failure on an Adaptec RAID controller when ARCCONF (the command line interface for Adaptec RAID controllers) has been installed.
The starting point is a server with an Adaptec RAID controller on which a hard disk has failed and ARCCONF has been installed. As a consequence of the failure, the status of the corresponding logical device has been set to Degraded. Typical causes of a hard disk failure include:
- multiple defective sectors (media errors)
- no response to a controller command within the timeout period (timeouts)
Note: RAID controllers and hard disks have mechanisms that can exclude individual defective sectors and replace them with sectors from a reserve area. If the number of defective sectors exceeds a certain threshold, the RAID controller will no longer accept the hard disk and it must be replaced.
Step 1: Rescan
In rare cases, the hard disk is actually fine and merely failed to respond to controller commands in time (timeouts), so it does not need to be replaced with a new hard disk. For this reason, the controller should first perform a rescan.
ARCCONF RESCAN <Controller#>
ARCCONF RESCAN 1
If the hard disk is still usable and does not have any electrical or mechanical errors, the controller will rediscover it and at least list it among the physical devices.
ARCCONF GETCONFIG <Controller#> PD
ARCCONF GETCONFIG 1 PD
Because the command above produces a very long report for most hard disks, under Linux the report can be reduced to the most important information:
arcconf getconfig 1 pd|egrep "Device #|State\>|Reported Location|Reported Channel|S.M.A.R.T. warnings"
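On systems without egrep, the same reduction can be done with a short script. The following is a minimal sketch in Python that applies the same patterns as the egrep call above to report text that has already been captured. The function name filter_pd_report and the sample text are illustrative assumptions and not part of ARCCONF; real reports contain many more fields.

```python
import re

# Same patterns as the egrep call; "State\b" approximates egrep's "State\>"
# (end of word), so "State" matches but a longer word starting with it does not.
PD_PATTERN = re.compile(
    r"Device #|State\b|Reported Location|Reported Channel|S.M.A.R.T. warnings"
)

def filter_pd_report(report):
    """Return the report lines that match the patterns, without indentation."""
    return [line.strip() for line in report.splitlines() if PD_PATTERN.search(line)]

# Illustrative input only, not captured from a real controller.
SAMPLE = """\
Device #0
   Device is a Hard drive
   State                              : Online
   Reported Channel,Device(T:L)       : 0,0(0:0)
   Reported Location                  : Enclosure 0, Slot 0
   Vendor                             : EXAMPLE
   S.M.A.R.T. warnings                : 0
"""

if __name__ == "__main__":
    for line in filter_pd_report(SAMPLE):
        print(line)
```

In practice the report text could come from running arcconf getconfig 1 pd through subprocess and passing its standard output to the function.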
If the metadata area on the hard disk is still intact, the disk will in most cases be listed as a member of the associated logical device again. However, the status of the logical device will remain Degraded, because the hard disk failure has probably left the disk in an inconsistent state. For this reason, the logical device has to be rebuilt manually.
If the hard disk is no longer detected after the rescan, the possible causes include:
- The hard disk may be defective.
- The cable from the controller to hard disk or backplane may be defective.
- The backplane may be defective.
- The controller may be defective.
Step 2: Clear and Verify
If the hard disk has been recognized again after the rescan from Step 1, a manual rebuild still needs to be performed. For this, the metadata area on the hard disk must first be cleared.
ARCCONF TASK START <Controller#> DEVICE <Channel#> <ID#> CLEAR
ARCCONF TASK START 1 DEVICE 0 0 CLEAR
Once this clearing task has been performed, verifying the hard disk is recommended in order to test for defective sectors.
ARCCONF TASK START <Controller#> DEVICE <Channel#> <ID#> VERIFY
ARCCONF TASK START 1 DEVICE 0 0 VERIFY
To repair potentially defective sectors at this time, the VERIFY_FIX option can be used instead of the VERIFY option.
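When the clear and verify tasks are scripted, it helps to build the command lines in one place so that the controller, channel and ID cannot get mixed up. The following is a minimal sketch in Python, assuming the arcconf binary is on the PATH; the helper names task_command and run are hypothetical. The sketch does not wait for a task to finish and does not handle any confirmation prompt that ARCCONF may show.

```python
import subprocess

ALLOWED_TASKS = {"CLEAR", "VERIFY", "VERIFY_FIX"}

def task_command(controller, channel, device_id, task):
    """Build the argument list for ARCCONF TASK START ... DEVICE ... <task>."""
    task = task.upper()
    if task not in ALLOWED_TASKS:
        raise ValueError("unsupported task: " + task)
    return ["arcconf", "TASK", "START", str(controller),
            "DEVICE", str(channel), str(device_id), task]

def run(command):
    """Run one command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(command, check=True)

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them,
    # because CLEAR destroys the data on the selected drive.
    for task in ("CLEAR", "VERIFY"):
        print(" ".join(task_command(1, 0, 0, task)))
```

Printing the commands first and calling run only after checking them is a deliberate choice here, since a wrong channel or ID would clear the wrong drive.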
Step 3: Rescan Again
After another rescan, the hard disk should be listed as an available drive and the rebuild will start automatically, provided that the automatic failover feature has been enabled. You can check whether this feature has been enabled using the following command.
ARCCONF GETCONFIG <Controller#> AD
ARCCONF GETCONFIG 1 AD
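The adapter report is long, so a script can pick out the failover setting instead of reading the report by eye. The following is a minimal sketch in Python, assuming the report contains a line of the form "Automatic Failover : Enabled" or "Automatic Failover : Disabled"; that field label is an assumption and should be checked against the output of the installed ARCCONF version. The function name automatic_failover_enabled is hypothetical.

```python
import re

def automatic_failover_enabled(report):
    """Return True or False for the failover line, or None if no such line exists."""
    match = re.search(r"^\s*Automatic Failover\s*:\s*(\w+)", report,
                      re.IGNORECASE | re.MULTILINE)
    if match is None:
        return None
    return match.group(1).lower() == "enabled"

# Illustrative input only, not captured from a real controller.
SAMPLE = """\
   Controller Status                        : Optimal
   Automatic Failover                       : Enabled
"""

if __name__ == "__main__":
    print(automatic_failover_enabled(SAMPLE))
```

Returning None when the line is missing keeps "feature disabled" distinct from "report format not recognized", which matters when the result decides whether a hot spare has to be assigned manually.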
Enabling or Disabling the Automatic Failover Feature
ARCCONF FAILOVER <Controller#> <on|off>
ARCCONF FAILOVER 1 on
Step 4: Designate a Hot Spare
If the automatic failover feature has not been enabled and you do not want to enable it, the available hard disk can instead be designated as a hot spare. This assigns the available drive to the associated logical device, and the rebuild then starts automatically.
ARCCONF SETSTATE <Controller#> DEVICE <Channel#> <ID#> HSP LOGICALDRIVE <LD#>
ARCCONF SETSTATE 1 DEVICE 0 0 HSP LOGICALDRIVE 1
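For scripted use, the hot spare assignment can be assembled from its four numbers with a basic sanity check. The following is a minimal sketch in Python, assuming the arcconf binary is on the PATH; the helper name hot_spare_command is hypothetical, and the check only rejects negative or non-integer values, it does not verify that the drive or logical device exists.

```python
def hot_spare_command(controller, channel, device_id, logical_drive):
    """Build the argument list for ARCCONF SETSTATE ... HSP LOGICALDRIVE <LD#>."""
    values = {
        "controller": controller,
        "channel": channel,
        "device_id": device_id,
        "logical_drive": logical_drive,
    }
    for name, value in values.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(name + " must be a non-negative integer")
    return ["arcconf", "SETSTATE", str(controller),
            "DEVICE", str(channel), str(device_id),
            "HSP", "LOGICALDRIVE", str(logical_drive)]

if __name__ == "__main__":
    # Dry run: print the command for review instead of executing it.
    print(" ".join(hot_spare_command(1, 0, 0, 1)))
```

The resulting list can be passed to subprocess.run once the printed command has been reviewed.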