This is a step-by-step procedure for replacing a failed drive that is part of a Solaris Volume Manager (SVM) mirror configuration (RAID-1) while the system is up and running.
Here is the kind of message reported by the operating system:
# grep md_mirror /var/adm/messages
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d17: /dev/dsk/c1t0d0s7 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d11: /dev/dsk/c1t0d0s1 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d5: open error on /dev/dsk/c1t0d0s5
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d1: open error on /dev/dsk/c1t0d0s1
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d10: /dev/dsk/c1t0d0s0 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d13: /dev/dsk/c1t0d0s3 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d100: open error on /dev/dsk/c1t0d0s0
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d14: /dev/dsk/c1t0d0s4 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d103: open error on /dev/dsk/c1t0d0s3
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d104: open error on /dev/dsk/c1t0d0s4
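With this many warnings it can help to reduce the log to the list of affected slices. The following is only a sketch: the function names failed_slices and failed_disks are made up for this example, and it assumes the messages keep the two shapes shown above ("needs maintenance" and "open error on") with a /dev/dsk/cXtYdZsN path in them.

```shell
# Hypothetical helpers, not part of SVM: both read syslog lines on stdin.

# Print every slice named in an md_mirror message, once, sorted.
failed_slices() {
    sed -n '/md_mirror/s|.*\(/dev/dsk/c[0-9]*t[0-9]*d[0-9]*s[0-9]*\).*|\1|p' | sort -u
}

# Same input, but strip the slice suffix to get the underlying disk(s).
failed_disks() {
    failed_slices | sed 's|s[0-9]*$||' | sort -u
}

# Typical use on the live system:
#   failed_disks < /var/adm/messages
```

If more than one disk comes out of failed_disks, the problem is probably not a single drive and deserves a closer look before replacing anything.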
Figure out the SVM configuration layout:
# metastat -c
d78 p 300MB d7
d72 p 1018MB d7
d76 p 100MB d7
d75 p 500MB d7
d74 p 50GB d7
d73 p 200MB d7
d77 p 256MB d7
d71 p 250MB d7
d7 m 119GB d17 (maint) d27
d17 s 119GB c1t0d0s7 (maint)
d27 s 119GB c1t1d0s7
d104 m 2.4GB d24 d14 (maint)
d24 s 2.4GB c1t1d0s4
d14 s 2.4GB c1t0d0s4 (maint)
d103 m 2.0GB d23 d13 (maint)
d23 s 2.0GB c1t1d0s3
d13 s 2.0GB c1t0d0s3 (maint)
d100 m 4.9GB d20 d10 (maint)
d20 s 4.9GB c1t1d0s0
d10 s 4.9GB c1t0d0s0 (maint)
d5 m 4.0GB d15 (unavail) d25
d15 s 4.0GB c1t0d0s5 (-)
d25 s 4.0GB c1t1d0s5
d1 m 3.9GB d11 (maint) d21
d11 s 3.9GB c1t0d0s1 (maint)
d21 s 3.9GB c1t1d0s1
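To pull only the troubled components out of a long listing, a small filter can help. This is a sketch under assumptions: flagged_components is a name invented here, it expects the compact "metastat -c" layout shown above with one slice per stripe, and it needs a POSIX awk (on Solaris, nawk or /usr/xpg4/bin/awk rather than the legacy /usr/bin/awk).

```shell
# Hypothetical helper: read "metastat -c" output on stdin and print
# "<submirror> <slice> <state>" for every stripe line whose last field
# is a parenthesised state such as (maint) or (-).
flagged_components() {
    awk '$2 == "s" && $NF ~ /^\(/ { print $1, $4, $NF }'
}

# Typical use on the live system:
#   metastat -c | flagged_components
```

Healthy stripes carry no state annotation in this listing, so they are simply skipped.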
As we can see, the failed disk is reported as a drive that is no longer available:
# echo | format
[...]
AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0
1. c1t1d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]
Find more information about the failed drive, such as type of errors, serial number, WWNN of the drive, etc.:
# cfgadm -alv c1
Ap_Id Receptacle Occupant Condition Information
When Type Busy Phys_Id
c1 connected configured unknown
unavailable fc-private n /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc
c1::500000e012a66aa1 connected configured unknown FUJITSU MAX3147FCSUN146G
unavailable disk y /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012a66aa1
c1::500000e012ab94d1 connected configured failed FUJITSU MAX3147FCSUN146G
unavailable disk y /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012ab94d1
# iostat -En
[...]
c1t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G021R2
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 73
Vendor: FUJITSU Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G023LB
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
[...]
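On a host with many disks, the per-device error counters are easier to scan once filtered. A minimal sketch, assuming the "iostat -En" summary line keeps the layout shown above (device name, then Soft, Hard and Transport error counts); disks_with_errors is a name made up for this example, and a POSIX awk is assumed.

```shell
# Hypothetical helper: read "iostat -En" output on stdin and print the
# devices whose hard or transport error counter is above zero.
disks_with_errors() {
    awk '$2 == "Soft" && $3 == "Errors:" && ($7 + 0 > 0 || $10 + 0 > 0) {
        print $1, "hard=" $7, "transport=" $10
    }'
}

# Typical use on the live system:
#   iostat -En | disks_with_errors
```

Soft errors are ignored here on purpose; that threshold is a choice of this sketch, not a rule.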
First, clean up the metadb configuration by deleting the state database replicas that reference the bad disk drive:
# metadb
flags first blk block count
Wm p l 16 8192 /dev/dsk/c1t0d0s6
W p l 8208 8192 /dev/dsk/c1t0d0s6
W p l 16400 8192 /dev/dsk/c1t0d0s6
a p luo 16 8192 /dev/dsk/c1t1d0s6
a p luo 8208 8192 /dev/dsk/c1t1d0s6
a p luo 16400 8192 /dev/dsk/c1t1d0s6
# metadb -d c1t0d0s6
# metadb
flags first blk block count
a p luo 16 8192 /dev/dsk/c1t1d0s6
a p luo 8208 8192 /dev/dsk/c1t1d0s6
a p luo 16400 8192 /dev/dsk/c1t1d0s6
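Before and after deleting replicas, it is worth counting how many good ones remain on each slice. The sketch below relies on the convention from the "metadb -i" legend that an uppercase status flag (such as the W above) marks a problem while lowercase flags are informational; treat that reading as an assumption and check the legend on your release. replica_health is a name made up for this example, and a POSIX awk is assumed.

```shell
# Hypothetical helper: read "metadb" output on stdin and print, per slice,
# how many replicas look healthy and how many carry an uppercase flag.
replica_health() {
    awk '
        $NF ~ /^\/dev\// {
            # Everything before the last three fields is the flag area.
            flags = ""
            for (i = 1; i <= NF - 3; i++) flags = flags $i
            if (!($NF in seen)) { seen[$NF] = 1; order[++n] = $NF }
            if (flags ~ /[A-Z]/) bad[$NF]++; else ok[$NF]++
        }
        END {
            for (k = 1; k <= n; k++) {
                d = order[k]
                print d, "ok=" (ok[d] + 0), "bad=" (bad[d] + 0)
            }
        }
    '
}

# Typical use on the live system:
#   metadb | replica_health
```

A slice that shows only bad replicas, like c1t0d0s6 in the first listing above, is the one to hand to "metadb -d".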
Since the disk is completely gone, the proper way to remove an FC drive does not work as expected:
# luxadm remove_device /dev/rdsk/c1t0d0s2
WARNING!!! Please ensure that no filesystems are mounted on these device(s).
All data on these devices should have been backed up.
Error: SCSI failure. - /dev/rdsk/c1t0d0s2.
So, go ahead and physically replace the failed drive. Here are the hardware events logged on the system's console during the swap:
# dmesg
[...]
Mar 31 18:06:17 beastie picld[152]: [ID 222282 daemon.error] Fault detected: DISK0
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop OFFLINE
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop ONLINE
Mar 31 18:06:18 beastie fctl: [ID 517869 kern.warning] WARNING: fp(3)::fp_plogi_intr: fp 1 pd ef
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0 (ssd0):
Mar 31 18:06:19 beastie Error for Command: write(10) Error Level: Retryable
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] Requested Block: 37369856 Error Block: 37369856
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] Vendor: FUJITSU Serial Number: 0634G021R2
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Mar 31 18:06:37 beastie scsi: [ID 799468 kern.info] ssd144 at fp3: name w500000e0125c4531,0, bus address ef
Mar 31 18:06:37 beastie genunix: [ID 936769 kern.info] ssd144 is /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
Mar 31 18:06:37 beastie genunix: [ID 408114 kern.info] /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0 (ssd144) online
Mar 31 18:06:52 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0 (ssd3):
Mar 31 18:06:52 beastie drive offline
[...]
Mar 31 18:07:31 beastie picld[152]: [ID 691918 daemon.error] FSP_GEN_FAULT_LED has turned ON
Mar 31 18:07:43 beastie picld[152]: [ID 861866 daemon.error] Notice: DISK0 okay
Mar 31 18:07:44 beastie picld[152]: [ID 114988 daemon.error] FSP_GEN_FAULT_LED has turned OFF
[...]
If necessary (if not done automatically), rebuild the device links in the /dev subtree, cleaning up any stale ones, and verify that the new drive is properly managed by the operating system:
# devfsadm -Cv
[...]
# echo | format
AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
1. c1t1d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]
Now create a proper VTOC on the new disk by copying it from the surviving mirror disk, and recreate the metadb replicas on it as before:
# prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
fmthard: New volume table of contents now in place.
# metadb -a -c 3 c1t0d0s6
# metadb
flags first blk block count
a u 16 8192 /dev/dsk/c1t0d0s6
a u 8208 8192 /dev/dsk/c1t0d0s6
a u 16400 8192 /dev/dsk/c1t0d0s6
a p luo 16 8192 /dev/dsk/c1t1d0s6
a p luo 8208 8192 /dev/dsk/c1t1d0s6
a p luo 16400 8192 /dev/dsk/c1t1d0s6
Then enable the new drive in place of the old one in the SVM configuration, one mirror at a time, and let each mirror resynchronize automatically:
# metareplace -e d104 c1t0d0s4
d104: device c1t0d0s4 is replaced with c1t0d0s4
[...]
# metareplace -e d5 c1t0d0s5
d5: device c1t0d0s5 is replaced with c1t0d0s5
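Typing one metareplace per mirror gets tedious, so the command list can be derived from a "metastat -c" listing captured before the replacement. This is a dry-run sketch only: it prints the commands and runs nothing. The name metareplace_commands is invented here, and the sketch assumes the compact layout shown earlier, where a mirror line marks a bad submirror with (maint) or (unavail) and each submirror is a one-slice stripe. A POSIX awk is assumed.

```shell
# Hypothetical helper: read "metastat -c" output on stdin and print one
# "metareplace -e <mirror> <slice>" line per submirror needing attention.
metareplace_commands() {
    awk '
        $2 == "m" {
            # A state in parentheses applies to the submirror just before it.
            for (i = 5; i <= NF; i++)
                if ($i == "(maint)" || $i == "(unavail)") {
                    n++; mirror[n] = $1; comp[n] = $(i - 1)
                }
        }
        $2 == "s" { slice[$1] = $4 }   # submirror -> underlying slice
        END {
            for (k = 1; k <= n; k++)
                if (comp[k] in slice)
                    print "metareplace -e", mirror[k], slice[comp[k]]
        }
    '
}

# Typical use: review the output first, then run the lines by hand.
#   metastat -c | metareplace_commands
```

Printing rather than executing is deliberate, so the list can be compared with the actual layout before anything is changed.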
# metastat -c
d78 p 300MB d7
d72 p 1018MB d7
d76 p 100MB d7
d75 p 500MB d7
d74 p 50GB d7
d73 p 200MB d7
d77 p 256MB d7
d71 p 250MB d7
d7 m 119GB d17 (resync-0%) d27
d17 s 119GB c1t0d0s7 (resyncing)
d27 s 119GB c1t1d0s7
d104 m 2.4GB d24 d14 (resync-41%)
d24 s 2.4GB c1t1d0s4
d14 s 2.4GB c1t0d0s4 (resyncing)
d103 m 2.0GB d23 d13 (resync-28%)
d23 s 2.0GB c1t1d0s3
d13 s 2.0GB c1t0d0s3 (resyncing)
d100 m 4.9GB d20 d10 (resync-10%)
d20 s 4.9GB c1t1d0s0
d10 s 4.9GB c1t0d0s0 (resyncing)
d5 m 4.0GB d15 (resync-6%) d25
d15 s 4.0GB c1t0d0s5 (resyncing)
d25 s 4.0GB c1t1d0s5
d1 m 3.9GB d11 (resync-13%) d21
d11 s 3.9GB c1t0d0s1 (resyncing)
d21 s 3.9GB c1t1d0s1
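The mirrors are not redundant again until every resync has finished, so it can be handy to poll for completion. A minimal sketch, assuming mirror lines keep the "(resync-NN%)" annotation shown above; resync_status is a name made up for this example, and a POSIX awk is assumed.

```shell
# Hypothetical helper: read "metastat -c" output on stdin, print each
# mirror still resyncing with its progress, and return non-zero while
# at least one resync is in flight.
resync_status() {
    awk '
        $2 == "m" {
            for (i = 5; i <= NF; i++)
                if ($i ~ /^\(resync-/) {
                    pct = $i
                    sub(/^\(resync-/, "", pct)
                    sub(/\)$/, "", pct)
                    print $1, pct
                    busy = 1
                }
        }
        END { exit busy ? 1 : 0 }
    '
}

# Typical use: wait until nothing is resyncing any more.
#   until metastat -c | resync_status; do sleep 60; done
```

The 60-second interval is arbitrary; a resync of a large mirror can take hours.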
Once every resync has completed, you are done.