Replacing A Failed Drive In An SVM Configuration

Apr 08, 2011 | 6 minutes read

Tags: SVM, RAID, System

Here is a rough step-by-step procedure for dealing with a failed drive that is part of a Solaris Volume Manager (SVM) mirror configuration (RAID-1), and for replacing it while the system is up and running.

Here are the kinds of messages reported by the operating system:

# grep md_mirror /var/adm/messages
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d17: /dev/dsk/c1t0d0s7 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d11: /dev/dsk/c1t0d0s1 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d5: open error on /dev/dsk/c1t0d0s5
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d1: open error on /dev/dsk/c1t0d0s1
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d10: /dev/dsk/c1t0d0s0 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d13: /dev/dsk/c1t0d0s3 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d100: open error on /dev/dsk/c1t0d0s0
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d14: /dev/dsk/c1t0d0s4 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d103: open error on /dev/dsk/c1t0d0s3
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d104: open error on /dev/dsk/c1t0d0s4

Figure out the SVM configuration layout:

# metastat -c
d78              p  300MB d7
d72              p 1018MB d7
d76              p  100MB d7
d75              p  500MB d7
d74              p   50GB d7
d73              p  200MB d7
d77              p  256MB d7
d71              p  250MB d7
    d7           m  119GB d17 (maint) d27
        d17      s  119GB c1t0d0s7 (maint)
        d27      s  119GB c1t1d0s7
d104             m  2.4GB d24 d14 (maint)
    d24          s  2.4GB c1t1d0s4
    d14          s  2.4GB c1t0d0s4 (maint)
d103             m  2.0GB d23 d13 (maint)
    d23          s  2.0GB c1t1d0s3
    d13          s  2.0GB c1t0d0s3 (maint)
d100             m  4.9GB d20 d10 (maint)
    d20          s  4.9GB c1t1d0s0
    d10          s  4.9GB c1t0d0s0 (maint)
d5               m  4.0GB d15 (unavail) d25
    d15          s  4.0GB c1t0d0s5 (-)
    d25          s  4.0GB c1t1d0s5
d1               m  3.9GB d11 (maint) d21
    d11          s  3.9GB c1t0d0s1 (maint)
    d21          s  3.9GB c1t1d0s1
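
If the concise listing is not detailed enough, a specific metadevice can be inspected more closely; a quick sketch (d5 is taken from the layout above, and on recent releases metastat usually prints an Invoke: line with the suggested recovery command for components needing maintenance):

# metastat -i     # re-probe the underlying devices and refresh the metadevice state
# metastat d5     # full state of the d5 mirror and its submirrors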

As we can see, the failed disk is no longer reported as an available drive:

# echo | format
[...]
AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <drive not available>
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0
       1. c1t1d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]

Find more information about the failed drive, such as type of errors, serial number, WWNN of the drive, etc.:

# cfgadm -alv c1
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
c1                             connected    configured   unknown
unavailable  fc-private   n        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc
c1::500000e012a66aa1           connected    configured   unknown    FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012a66aa1
c1::500000e012ab94d1           connected    configured   failed     FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012ab94d1
# iostat -En
[...]
c1t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU  Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G021R2
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
c1t0d0           Soft Errors: 0 Hard Errors: 1 Transport Errors: 73
Vendor: FUJITSU  Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G023LB
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
[...]
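
On Solaris 10 and later, the fault management framework may also have recorded the failure; a hedged sketch to cross-check (depending on the release and the kind of failure, it may or may not show anything for a disk):

# fmadm faulty    # resources currently flagged as faulty by FMA
# fmdump -e       # summary of the error events that were logged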

Well, first clear the metadb configuration by removing the state database replicas located on the bad disk drive:

# metadb
    flags           first blk       block count
  Wm  p  l          16            8192         /dev/dsk/c1t0d0s6
  W   p  l          8208          8192         /dev/dsk/c1t0d0s6
  W   p  l          16400         8192         /dev/dsk/c1t0d0s6
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6
# metadb -d c1t0d0s6
# metadb
    flags           first blk       block count
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6
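
Before going further, it is worth checking that enough healthy state database replicas remain, since SVM needs at least half of them available to stay up and a majority of them to boot unattended; a minimal sketch:

# metadb | grep -c W             # replicas still flagged with write errors, should now be 0
# metadb | grep -c '/dev/dsk'    # total replicas left, at least 3 is a comfortable minimum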

Since the disk is completely gone, the proper way to remove an FC drive didn't work as expected:

# luxadm remove_device /dev/rdsk/c1t0d0s2

 WARNING!!! Please ensure that no filesystems are mounted on these device(s).
 All data on these devices should have been backed up.

 Error: SCSI failure. - /dev/rdsk/c1t0d0s2.
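
Before pulling the drive, it may also be worth trying to unconfigure the failed path cleanly through cfgadm(1M); a hedged sketch using the Ap_Id reported above (with a drive this dead, it may fail just like luxadm did):

# cfgadm -c unconfigure c1::500000e012ab94d1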

So, let's simply replace the failed drive physically. Here are the hardware events reported on the system console:

# dmesg
[...]
Mar 31 18:06:17 beastie picld[152]: [ID 222282 daemon.error] Fault detected: DISK0
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop OFFLINE
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop ONLINE
Mar 31 18:06:18 beastie fctl: [ID 517869 kern.warning] WARNING: fp(3)::fp_plogi_intr: fp 1 pd ef
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0 (ssd0):
Mar 31 18:06:19 beastie        Error for Command: write(10)               Error Level: Retryable
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Requested Block: 37369856                  Error Block: 37369856
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Vendor: FUJITSU                            Serial Number: 0634G021R2
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Sense Key: Unit Attention
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Mar 31 18:06:37 beastie scsi: [ID 799468 kern.info] ssd144 at fp3: name w500000e0125c4531,0, bus address ef
Mar 31 18:06:37 beastie genunix: [ID 936769 kern.info] ssd144 is /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
Mar 31 18:06:37 beastie genunix: [ID 408114 kern.info] /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0 (ssd144) online
Mar 31 18:06:52 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0 (ssd3):
Mar 31 18:06:52 beastie        drive offline
[...]
Mar 31 18:07:31 beastie picld[152]: [ID 691918 daemon.error] FSP_GEN_FAULT_LED has turned ON
Mar 31 18:07:43 beastie picld[152]: [ID 861866 daemon.error] Notice: DISK0 okay
Mar 31 18:07:44 beastie picld[152]: [ID 114988 daemon.error] FSP_GEN_FAULT_LED has turned OFF
[...]

If necessary (i.e. if not done automatically), recreate the device links and clean up the stale entries in the /dev subtree, then verify that the new drive is properly seen by the operating system:

# devfsadm -Cv
[...]
# echo | format
AVAILABLE DISK SELECTIONS:
       0. c1t0d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
       1. c1t1d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]
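
To double-check that the operating system really picked up the new drive, and not a stale path, its WWN can be compared with the one announced in the console messages above; a quick sketch:

# cfgadm -al c1                         # the old Ap_Id should be gone
# luxadm display /dev/rdsk/c1t0d0s2     # should report the new WWN (500000e0125c4531)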

So, now put a proper VTOC on the new disk by copying it from the surviving mirror, and recreate the state database replicas on it as before:

# prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
fmthard:  New volume table of contents now in place.
# metadb -a -c 3 c1t0d0s6
# metadb
    flags           first blk       block count
 a        u         16            8192         /dev/dsk/c1t0d0s6
 a        u         8208          8192         /dev/dsk/c1t0d0s6
 a        u         16400         8192         /dev/dsk/c1t0d0s6
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6
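
Depending on the Solaris release, SVM also records a device ID for each disk; if metastat complains about a device ID mismatch on the replaced disk, it can be refreshed before resynchronizing (a hedged sketch, harmless when nothing needs updating):

# metadevadm -u c1t0d0    # refresh the device ID recorded for the replaced disk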

Then, just re-enable each failed component in place in the SVM configuration, as if the new drive were the old one, and let the mirrors resynchronize automatically:

# metareplace -e d104 c1t0d0s4
d104: device c1t0d0s4 is replaced with c1t0d0s4
[...]
# metareplace -e d5 c1t0d0s5
d5: device c1t0d0s5 is replaced with c1t0d0s5
# metastat -c
d78              p  300MB d7
d72              p 1018MB d7
d76              p  100MB d7
d75              p  500MB d7
d74              p   50GB d7
d73              p  200MB d7
d77              p  256MB d7
d71              p  250MB d7
    d7           m  119GB d17 (resync-0%) d27
        d17      s  119GB c1t0d0s7 (resyncing)
        d27      s  119GB c1t1d0s7
d104             m  2.4GB d24 d14 (resync-41%)
    d24          s  2.4GB c1t1d0s4
    d14          s  2.4GB c1t0d0s4 (resyncing)
d103             m  2.0GB d23 d13 (resync-28%)
    d23          s  2.0GB c1t1d0s3
    d13          s  2.0GB c1t0d0s3 (resyncing)
d100             m  4.9GB d20 d10 (resync-10%)
    d20          s  4.9GB c1t1d0s0
    d10          s  4.9GB c1t0d0s0 (resyncing)
d5               m  4.0GB d15 (resync-6%) d25
    d15          s  4.0GB c1t0d0s5 (resyncing)
    d25          s  4.0GB c1t1d0s5
d1               m  3.9GB d11 (resync-13%) d21
    d11          s  3.9GB c1t0d0s1 (resyncing)
    d21          s  3.9GB c1t1d0s1
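
The resynchronization happens in the background; to keep an eye on it, something as simple as the following loop will do (a rough sketch, adjust the interval to taste):

# while metastat -c | grep -i resync; do sleep 300; done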

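One last point: if the replaced disk is part of the boot mirror, as DISK0 very likely is here, the new drive has no boot block yet; on SPARC it can be reinstalled once the VTOC is in place (a hedged sketch, adapt the slice to your root submirror):

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t0d0s0
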
You are done.