Introduction
With hot swapping, you can replace drives (HDDs/SSDs) while the system is running to minimize server downtime if a drive fails. Please read this article to help you prepare and perform a hot-swap exchange.
Compatibility
The majority of our new server models are hot-swap capable.
You can check whether your server is hot-swap capable on Robot. Go to the server and click on the "Support" tab. Then, in the new window, click on "Technical" at the bottom. Under "What kind of technical problem are you facing?", click on "Drive is broken". Now scroll down until you see "Replacement options". If you see the option "Swap while the system is running", your server is hot-swap capable.
Important notes
Generally, you should remove the drive that you want replaced from the RAID before you start the rest of the hot-swap process. This will help prevent any further damage to the drive during the exchange. Please also make sure that you enter the correct serial number for the defective drive. If you can no longer read the serial number of the defective drive, tell us this clearly and give us the serial numbers of all the functional drives instead.
Procedure
Hardware RAID
If you are using a RAID controller with the server, you can exchange the drives via hot-swap; this is true for all operating systems. Currently at Hetzner, we have Adaptec and LSI RAID controllers.
You can find information about the controllers here:
To request a drive exchange, write a support request as normal via your Robot account.
Below are some examples:
Important: These are examples only. You need to adapt the steps and especially the command parameters to YOUR specific system!
LSI controller
Example configuration: Debian installation on a RAID 1 array with two SSDs
The command line tools MegaCli64 and StorCLI are available.
-
StorCLI:
- You can find "StorCLI", for example, at http://mirror.hetzner.com/tools/LSI/tools/StorCLI/MR_SAS_StorCLI_1.17.08.zip. (You can convert the RPM package to a deb package using alien and then install it.)
- Create an alias to make it easier to use:
alias storcli='/opt/MegaRAID/storcli/storcli64'
In this example, let's imagine that there is a defective SSD at slot 0.
-
You can find the status and serial numbers (Inquiry Data) with the following command, for example:
storcli /c0/eALL/sALL show all | egrep 'Device attributes|SN = | Intf | SATA'
-
If the defective drive does not yet have the status 'offline', set it to 'offline' with storcli:
storcli /c0/e252/s0 set offline
-
Now the SSD is marked as missing ...
storcli /c0/e252/s0 set missing
-
Now write a support request via Robot and ask for the drive exchange.
-
After our team has exchanged the drive, check the new drive's status:
storcli /c0/eall/sall show
-
If the rebuild does not start on its own, start the rebuild manually.
storcli /c0/e252/s0 start rebuild
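Since the controller, enclosure, and slot numbers differ from system to system, it can help to print the commands for your own values and review them before running anything. The following is only a minimal sketch: storcli_swap_plan is a hypothetical helper name, and the function only prints the commands from the steps above without executing them.

```shell
#!/bin/sh
# Sketch: print (not run) the StorCLI commands from the steps above.
# storcli_swap_plan is a hypothetical helper; the arguments are the
# controller, enclosure, and slot numbers of the defective drive.
storcli_swap_plan() {
    ctrl=$1
    encl=$2
    slot=$3
    dev="/c${ctrl}/e${encl}/s${slot}"
    printf 'storcli %s set offline\n' "$dev"
    printf 'storcli %s set missing\n' "$dev"
    printf '# request the drive exchange via Robot, then:\n'
    printf 'storcli /c%s/eall/sall show\n' "$ctrl"
    printf 'storcli %s start rebuild\n' "$dev"
}

# Values from the example configuration (controller 0, enclosure 252, slot 0).
storcli_swap_plan 0 252 0
```

Check the printed enclosure and slot against the output of the status command before you copy any line into your shell.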
-
MegaCli64:
- You can find MegaCli64 at http://download.hetzner.com/tools/LSI/tools/MegaCLI/8.07.10_MegaCLI_Linux.zip. (You can convert the RPM package to a deb package using alien and then install it.)
- The tool is quite tolerant regarding the notation of parameters. You can enter parameters with or without a hyphen, and they are case-insensitive.
- Create an alias to make it easier to use:
alias megacli='/opt/MegaRAID/MegaCli/MegaCli64'
In this example, let's imagine that there is a defective SSD at slot 0.
-
You can find the status and serial numbers (Inquiry Data) with the following command, for example:
megacli pdlist a0 | grep -Ei 'enclosure|slot|firmware state|inquiry'
-
If the defective drive does not yet have the status (firmware state) 'offline', set it to 'offline' with MegaCli:
megacli pdoffline physdrv[252:0] a0
-
Now the SSD is marked as missing ...
megacli pdmarkmissing physdrv[252:0] a0
-
...and prepared for the exchange
megacli pdprprmv physdrv[252:0] a0
-
Now write a support request via Robot and ask for the drive exchange.
-
After our team has exchanged the drive, check the new drive's status:
megacli pdlist a0 | grep -Ei 'enclosure|slot|firmware state|inquiry'
-
If the rebuild does not start on its own, start it manually.
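To follow the firmware state of each drive through the steps above, you can condense the pdlist output to one line per drive. The following is a minimal sketch, assuming the output contains the fields "Enclosure Device ID", "Slot Number", and "Firmware state", which the grep filter above also relies on. pd_states is a hypothetical helper name, and the sample input is illustrative, not real output from a server.

```shell
#!/bin/sh
# Sketch: summarize pdlist-style text as "<enclosure>:<slot> <firmware state>".
# pd_states is a hypothetical helper; it only reads text from stdin.
pd_states() {
    awk -F': *' '
        /^Enclosure Device ID/ { encl = $2 }
        /^Slot Number/         { slot = $2 }
        /^Firmware state/      { print encl ":" slot " " $2 }
    '
}

# Illustrative sample input (not taken from a real system).
pd_states <<'EOF'
Enclosure Device ID: 252
Slot Number: 0
Firmware state: Offline
Enclosure Device ID: 252
Slot Number: 1
Firmware state: Online, Spun Up
EOF
```

On a real system, you would pipe the tool's output into the helper, for example with /opt/MegaRAID/MegaCli/MegaCli64 pdlist a0 | pd_states. Inside a script, it is safer to call the full path, because shell aliases are usually not expanded in non-interactive shells.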
Adaptec controller
Example configuration: Debian installation on a RAID 1 array with two drives.
- You need the command line tool arcconf. You can find this tool and the required C++ library at http://download.hetzner.com/tools/Adaptec/tools/.
- The defective drive is connected to slot 0.
-
You can find the status and serial numbers with the following command, for example:
arcconf getconfig 1 pd | egrep "Device #|State\>|Reported Location|Reported Channel|Serial|S.M.A.R.T. warnings"
-
If the defective drive does not yet have the status 'failed', set this status:
arcconf setstate 1 device 0 0 ddd
-
Now write a support request via Robot and ask for the drive exchange.
-
After our team has exchanged the drive, check the new drive's status:
arcconf getconfig 1 pd | egrep "Device #|State\>|Reported Location|Reported Channel|Serial|S.M.A.R.T. warnings"
-
If the rebuild does not start on its own, start it manually.
Software RAID
In principle, hot swapping is also possible for drives on the SATA controller. The operating system recognizes the change of the connection status at the respective controller and recognizes the new drive as soon as it is connected. The steps you need to take differ depending on the operating system and configuration.
Below are some examples:
Important: These are just examples. You need to adjust the steps and especially the command parameters to YOUR specific system!
Linux
You can find information and a detailed example scenario for replacing drives in Linux software RAID at: Hard disk replacement in software RAID
Windows
Important: With Windows, it is not possible to hot-swap the boot plex. Therefore, you need to boot the system from the intact plex before starting the hot-swap process. (Microsoft also refers to mirroring as plexing, so a "plex" is one part of a mirrored volume.)
In the following example, let's imagine that the server has a Hetzner standard installation of Windows Server in UEFI mode with two drives and mirroring. The defective drive is disk 1 (secondary plex). The system was started from the primary plex.
- Remove HDD/SSD from the RAID.
In Disk Management (diskmgmt.msc), open the context menu of Volume C: and select "Remove Mirroring".
-
Read the serial number of the defective or intact HDD/SSD with diskid32.exe.
-
Make a support request and ask our team to replace the drive (hot swapping).
-
After our team has exchanged the drive, start diskpart.
-
Prepare drive / create partitions based on the intact HDD/SSD.
-
If replacement HDD/SSD is not detected:
DISKPART> rescan
-
Display drive:
DISKPART> list disk
-
If the defective drive is displayed as M1 (missing):
DISKPART> select disk M1
DISKPART> delete disk
-
Convert the replacement drive to a dynamic disk with GPT.
-
Create and format the EFI partition and assign drive letter E to it.
-
Add HDD/SSD to mirror C and wait until synchronization is complete.
DISKPART> select disk 1
DISKPART> convert gpt
DISKPART> create partition efi size=200
DISKPART> format fs=fat32 quick
DISKPART> assign letter=e
DISKPART> convert dynamic
DISKPART> select volume c
DISKPART> add disk 1 wait
-
Assign the letter x to the EFI partition of the intact HDD/SSD.
DISKPART> select disk 0
DISKPART> select part 1
DISKPART> assign letter=x
DISKPART> exit
-
EFI partition and boot manager:
In the example, the EFI partitions have been assigned the following drive letters:
x: existing EFI partition
e: newly created EFI partition on the replacement drive
-
First of all, you should save the system BCD store (here in the file BCD_backup in the current directory), so that you can undo any changes you make later using bcdedit /import:
bcdedit /export BCD_backup
-
Recursively copy the EFI partition, but skip the BCD store and the System Volume Information folder:
robocopy x:\ e:\ * /e /copyall /dcopy:t /xf BCD.* /xd "System Volume Information"
-
Now export the system BCD store to the replacement drive with bcdedit:
bcdedit /export e:\EFI\Microsoft\Boot\BCD
Now you can start both boot managers from either of the two boot plexes.
Under certain circumstances, you may need to make further adjustments to the BCD store (e.g. if there is still an orphaned boot entry). You can find more information at: http://download.microsoft.com/download/6/E/E/6EE26977-FAA0-41CC-8BDA-7A0C5E6EB9CC/Configuring%20Disk%20Mirroring%20for%20Windows%20Server%202012.docx.
FreeBSD
-
gmirror + UFS:
Example configuration: FreeBSD installation with UFS and gmirror with the following arrays:
/dev/mirror/boot (ada0p1 + ada1p1)
/dev/mirror/swap (ada0p2 + ada1p2)
/dev/mirror/root (ada0p3 + ada1p3)
The defective HDD/SSD is ada1.
- Remove the defective HDD/SSD from the RAID.
-
Check the status:
gmirror status
-
Disable partitions of the defective HDD/SSD if necessary:
gmirror deactivate boot ada1p1
gmirror deactivate swap ada1p2
gmirror deactivate root ada1p3
-
"Forget" partitions of the defective HDD/SSD:
gmirror forget boot
gmirror forget swap
gmirror forget root
- Find the serial number of the defective HDD/SSD:
-
For example, with smartctl from the smartmontools package:
smartctl -a /dev/ada1 | grep -i serial
-
Or using camcontrol:
camcontrol identify /dev/ada1 | grep -i serial
-
Now write a support request via Robot and ask for the drive exchange.
-
After the exchange is complete, copy the partition table from ada0 to ada1:
gpart backup ada0 | gpart restore ada1
NOTE: Currently, there appears to be a bug in FreeBSD 11 that prevents FreeBSD from restoring the partition table, which may prevent booting from the replaced drive. If you encounter this problem, please see the FreeBSD Forum post.
-
Add the partitions of the replacement HDD/SSD to gmirror:
gmirror insert boot ada1p1
gmirror insert swap ada1p2
gmirror insert root ada1p3
-
Install boot code on the replacement HDD/SSD:
gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 ada1
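The deactivate and forget commands from the removal steps in this list follow a regular pattern, so you can generate them for your own disk and mirror names and review them first. The following is only a minimal sketch: gmirror_remove_plan is a hypothetical helper name, and it assumes that partition N of the disk belongs to the N-th mirror you pass in, as in the example layout above (boot on p1, swap on p2, root on p3). The function prints the commands and does not run them.

```shell
#!/bin/sh
# Sketch: print (not run) the gmirror commands for removing a failed disk.
# Assumption: partition N of the disk belongs to the N-th mirror named on
# the command line, as in the example layout (boot=p1, swap=p2, root=p3).
gmirror_remove_plan() {
    disk=$1
    shift
    idx=1
    for mirror in "$@"; do
        printf 'gmirror deactivate %s %sp%d\n' "$mirror" "$disk" "$idx"
        idx=$((idx + 1))
    done
    for mirror in "$@"; do
        printf 'gmirror forget %s\n' "$mirror"
    done
}

# Values from the example configuration: failed disk ada1, three mirrors.
gmirror_remove_plan ada1 boot swap root
```

Compare the printed partition names with the output of gmirror status before you run any of the lines.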
-
ZFS
Sample configuration: FreeBSD installation using ZFS with the following arrays:
/dev/mirror/boot (ada0p1 + ada1p1)
/dev/mirror/swap (ada0p2 + ada1p2)
ZFS pool zroot with mirroring via gpt/root0 (GPT label for ada0p3) and gpt/root1 (GPT label for ada1p3)
The defective HDD/SSD is ada0.
(The two gmirror mirrors boot and swap are handled according to the above procedure.)
-
If you use ZFS for mirroring, you also have to check the state of the mirror before the drive is replaced and, if necessary, set the corresponding partition (gpt/root0 in the following example) to offline:
zpool status
  pool: zroot
 state: ONLINE
  scan: none requested
config:
        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            gpt/root0  ONLINE       0     0     0
            gpt/root1  ONLINE       0     0     0

zpool offline zroot gpt/root0

zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:
        NAME                     STATE     READ WRITE CKSUM
        zroot                    DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            8894732708877724737  OFFLINE      0     0     0  was /dev/gpt/root0
            gpt/root1            ONLINE       0     0     0

gmirror deactivate boot ada0p1
gmirror deactivate swap ada0p2
gmirror forget boot
gmirror forget swap
-
If you use GPT labels like in the example, you can find the assignment to the drive using gpart:
gpart list | grep -Ei 'geom|label'
Geom name: ada0
label: boot0
label: swap0
label: root0
Geom name: ada1
label: boot1
label: swap1
label: root1
-
Find the serial number of the defective HDD/SSD:
-
For example, with smartctl from the smartmontools package:
smartctl -a /dev/ada0 | grep -i serial
-
Or via camcontrol:
camcontrol identify /dev/ada0 | grep -i serial
-
Write a support ticket via Robot and ask our team to replace the drive. Make sure to include the correct serial number of the drive. After the exchange, transfer the partition table via gpart, repair the gmirror mirrors, and install the boot code:
gpart backup ada1 | gpart restore ada0
gmirror insert boot ada0p1
gmirror insert swap ada0p2
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
-
Then adjust the GPT label of the ZFS partition (in this case the third, i.e. ada0p3) of the replacement drive (gpt/root0):
gpart modify -i 3 -l root0 ada0
-
The new device can now replace the failed part of the mirror:
zpool replace zroot gpt/root0
zpool status -x
all pools are healthy
For detailed information on configuring and managing the ZFS file system, see the Oracle documentation: Oracle ZFS Documentation (English)
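When checking the pool before and after the exchange, it can be handy to list only the devices that are not ONLINE. The following is a minimal sketch, assuming a simple mirror layout like the one in the example and the column layout of the zpool status output shown above. zpool_not_online is a hypothetical helper name, and the sample input is the degraded pool from the example.

```shell
#!/bin/sh
# Sketch: list devices in `zpool status` text whose state is not ONLINE.
# zpool_not_online is a hypothetical helper; it only reads text from stdin.
zpool_not_online() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }  # device table starts
        /^errors:/ || NF == 0         { in_cfg = 0 }        # device table ends
        in_cfg && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Sample input: the degraded pool from the example above.
zpool_not_online <<'EOF'
config:
        NAME                     STATE     READ WRITE CKSUM
        zroot                    DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            8894732708877724737  OFFLINE      0     0     0  was /dev/gpt/root0
            gpt/root1            ONLINE       0     0     0
EOF
```

On a real system, you would feed it live output, for example with zpool status zroot | zpool_not_online. No output means that every listed device is ONLINE.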