Here is a step-by-step procedure for replacing a failed drive that is part of a Solaris Volume Manager (SVM) mirror configuration (RAID-1) while the system stays up and running.
Here is the kind of message reported by the operating system:
# grep md_mirror /var/adm/messages
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d17: /dev/dsk/c1t0d0s7 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d11: /dev/dsk/c1t0d0s1 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d5: open error on /dev/dsk/c1t0d0s5
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d1: open error on /dev/dsk/c1t0d0s1
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d10: /dev/dsk/c1t0d0s0 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d13: /dev/dsk/c1t0d0s3 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d100: open error on /dev/dsk/c1t0d0s0
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d14: /dev/dsk/c1t0d0s4 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d103: open error on /dev/dsk/c1t0d0s3
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d104: open error on /dev/dsk/c1t0d0s4
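When many warnings pile up, it can help to reduce them to the list of distinct slices involved. The following is a minimal sketch, not part of the original procedure: the function name failing_slices is hypothetical, and it assumes the warnings name the slice as a /dev/dsk/... word, as in the messages above.

```shell
#!/bin/sh
# Print each distinct /dev/dsk/... slice named in md_mirror warnings.
# Reads syslog lines on stdin. failing_slices is a hypothetical helper name.
failing_slices() {
    awk '/md_mirror/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^\/dev\/dsk\//) print $i
    }' | sort -u
}

# Typical use (log path as in the listing above):
#   failing_slices < /var/adm/messages
```

If every slice printed sits on the same disk, as is the case here with c1t0d0, the whole drive is the likely culprit rather than a single slice.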
Figure out the SVM configuration layout:
# metastat -c
d78 p 300MB d7
d72 p 1018MB d7
d76 p 100MB d7
d75 p 500MB d7
d74 p 50GB d7
d73 p 200MB d7
d77 p 256MB d7
d71 p 250MB d7
d7 m 119GB d17 (maint) d27
d17 s 119GB c1t0d0s7 (maint)
d27 s 119GB c1t1d0s7
d104 m 2.4GB d24 d14 (maint)
d24 s 2.4GB c1t1d0s4
d14 s 2.4GB c1t0d0s4 (maint)
d103 m 2.0GB d23 d13 (maint)
d23 s 2.0GB c1t1d0s3
d13 s 2.0GB c1t0d0s3 (maint)
d100 m 4.9GB d20 d10 (maint)
d20 s 4.9GB c1t1d0s0
d10 s 4.9GB c1t0d0s0 (maint)
d5 m 4.0GB d15 (unavail) d25
d15 s 4.0GB c1t0d0s5 (-)
d25 s 4.0GB c1t1d0s5
d1 m 3.9GB d11 (maint) d21
d11 s 3.9GB c1t0d0s1 (maint)
d21 s 3.9GB c1t1d0s1
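On a larger configuration, the components needing attention can be picked out of this listing mechanically. Here is a rough sketch under stated assumptions: degraded_components is a hypothetical name, and the filter assumes the column layout of the metastat -c output above, where a stripe line ("s") carries a fifth column only when it has a state annotation such as (maint) or (-).

```shell
#!/bin/sh
# List stripe ("s") lines of `metastat -c` output that carry a state
# annotation. Prints: submirror, slice, state.
# degraded_components is a hypothetical helper name.
degraded_components() {
    awk '$2 == "s" && NF >= 5 { print $1, $4, $5 }'
}

# Typical use:
#   metastat -c | degraded_components
```

Note that this also lists components in the (resyncing) state, so an empty result only appears once all mirrors are back in sync.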
As we can see, format reports the failed disk as a drive that is no longer available:
# echo | format
[...]
AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0
1. c1t1d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]
Find more information about the failed drive, such as type of errors, serial number, WWNN of the drive, etc.:
# cfgadm -alv c1
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
c1                             connected    configured   unknown
unavailable  fc-private   n        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc
c1::500000e012a66aa1           connected    configured   unknown    FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012a66aa1
c1::500000e012ab94d1           connected    configured   failed     FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012ab94d1
# iostat -En
[...]
c1t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G021R2
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 73
Vendor: FUJITSU Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G023LB
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
[...]
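With many disks attached, the iostat -En report gets long. As a small sketch, the per-device summary lines can be filtered for non-zero hard or transport error counters. The name disks_with_errors is hypothetical, and the field positions assume the summary line reads "<device> Soft Errors: N Hard Errors: N Transport Errors: N", as it does for c1t0d0 above.

```shell
#!/bin/sh
# Report devices from `iostat -En` output whose hard or transport error
# counters are non-zero. Assumes the summary line layout:
#   <dev> Soft Errors: N Hard Errors: N Transport Errors: N
# disks_with_errors is a hypothetical helper name.
disks_with_errors() {
    awk '$2 == "Soft" && $3 == "Errors:" && ($7 + 0 > 0 || $10 + 0 > 0) {
        print $1, "hard=" $7, "transport=" $10
    }'
}

# Typical use:
#   iostat -En | disks_with_errors
```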
First, clean up the metadb configuration by deleting the state database replicas
that reside on the bad disk drive (the upper-case W flag marks replicas with write errors):
# metadb
flags first blk block count
Wm p l 16 8192 /dev/dsk/c1t0d0s6
W p l 8208 8192 /dev/dsk/c1t0d0s6
W p l 16400 8192 /dev/dsk/c1t0d0s6
a p luo 16 8192 /dev/dsk/c1t1d0s6
a p luo 8208 8192 /dev/dsk/c1t1d0s6
a p luo 16400 8192 /dev/dsk/c1t1d0s6
# metadb -d c1t0d0s6
# metadb
flags first blk block count
a p luo 16 8192 /dev/dsk/c1t1d0s6
a p luo 8208 8192 /dev/dsk/c1t1d0s6
a p luo 16400 8192 /dev/dsk/c1t1d0s6
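The devices to pass to metadb -d can also be derived from the flags column. The sketch below rests on two assumptions: that metadb reports error conditions with upper-case flag letters (such as the W seen above), and that the last three columns are always first block, block count and device. The name replicas_with_errors is hypothetical.

```shell
#!/bin/sh
# Print the devices holding replicas whose flag columns contain an
# upper-case letter (assumed here to denote an error flag such as W).
# LC_ALL=C keeps [A-Z] from matching lower-case letters in some locales.
# replicas_with_errors is a hypothetical helper name.
replicas_with_errors() {
    LC_ALL=C awk '$NF ~ /^\/dev\// {
        bad = 0
        # Everything before the last three columns is a flag column.
        for (i = 1; i <= NF - 3; i++)
            if ($i ~ /[A-Z]/) bad = 1
        if (bad) print $NF
    }' | sort -u
}

# Typical use:
#   metadb | replicas_with_errors
```

Running it again after the deletion should print nothing, which matches the second metadb listing above.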
Since the disk is completely gone, the proper way to remove an FC drive does not work as expected:
# luxadm remove_device /dev/rdsk/c1t0d0s2
WARNING!!! Please ensure that no filesystems are mounted on these device(s).
All data on these devices should have been backed up.
Error: SCSI failure. - /dev/rdsk/c1t0d0s2.
So, go ahead and physically replace the failed drive. Here are the hardware events logged on the system's console during the swap:
# dmesg
[...]
Mar 31 18:06:17 beastie picld[152]: [ID 222282 daemon.error] Fault detected: DISK0
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop OFFLINE
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop ONLINE
Mar 31 18:06:18 beastie fctl: [ID 517869 kern.warning] WARNING: fp(3)::fp_plogi_intr: fp 1 pd ef
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0 (ssd0):
Mar 31 18:06:19 beastie Error for Command: write(10) Error Level: Retryable
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] Requested Block: 37369856 Error Block: 37369856
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] Vendor: FUJITSU Serial Number: 0634G021R2
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Mar 31 18:06:37 beastie scsi: [ID 799468 kern.info] ssd144 at fp3: name w500000e0125c4531,0, bus address ef
Mar 31 18:06:37 beastie genunix: [ID 936769 kern.info] ssd144 is /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
Mar 31 18:06:37 beastie genunix: [ID 408114 kern.info] /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0 (ssd144) online
Mar 31 18:06:52 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0 (ssd3):
Mar 31 18:06:52 beastie drive offline
[...]
Mar 31 18:07:31 beastie picld[152]: [ID 691918 daemon.error] FSP_GEN_FAULT_LED has turned ON
Mar 31 18:07:43 beastie picld[152]: [ID 861866 daemon.error] Notice: DISK0 okay
Mar 31 18:07:44 beastie picld[152]: [ID 114988 daemon.error] FSP_GEN_FAULT_LED has turned OFF
[...]
If it was not done automatically, rebuild the device links under the /dev
subtree, cleaning up any dangling ones, and verify that the new drive
is properly managed by the operating system:
# devfsadm -Cv
[...]
# echo | format
AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
1. c1t1d0
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]
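The device path of c1t0d0 now ends in a different WWN (w500000e0125c4531 instead of w500000e012ab94d1), which confirms the slot holds the new drive. To make that comparison easier before and after a swap, the format listing can be condensed to one line per disk. This is only a sketch: disk_paths is a hypothetical name, and it assumes format prints the device path on the line right after each numbered disk entry, as above.

```shell
#!/bin/sh
# Pair each disk name from `format` output with its device path, which
# is assumed to be printed on the following line.
# Prints: <disk> <device path>. disk_paths is a hypothetical helper name.
disk_paths() {
    awk '$1 ~ /^[0-9]+\.$/ { name = $2; next }
         name != ""        { print name, $1; name = "" }'
}

# Typical use:
#   echo | format | disk_paths
```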
Now copy the VTOC from the surviving disk to the new one, and recreate the
state database replicas on it as before:
# prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
fmthard: New volume table of contents now in place.
# metadb -a -c 3 c1t0d0s6
# metadb
flags first blk block count
a u 16 8192 /dev/dsk/c1t0d0s6
a u 8208 8192 /dev/dsk/c1t0d0s6
a u 16400 8192 /dev/dsk/c1t0d0s6
a p luo 16 8192 /dev/dsk/c1t1d0s6
a p luo 8208 8192 /dev/dsk/c1t1d0s6
a p luo 16400 8192 /dev/dsk/c1t1d0s6
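To double-check that the label copied with fmthard matches the one on the surviving disk, the two prtvtoc outputs can be compared once their comment lines are stripped. A minimal sketch, assuming prtvtoc comment lines start with "*"; vtoc_body is a hypothetical name.

```shell
#!/bin/sh
# Strip the comment lines (assumed to start with "*") and blank lines
# from `prtvtoc` output, and squeeze whitespace so that two labels can
# be compared as text. vtoc_body is a hypothetical helper name.
vtoc_body() {
    awk '$0 !~ /^\*/ && NF > 0 { $1 = $1; print }'
}

# Typical use (device names as in the listing above):
#   a=$(prtvtoc /dev/rdsk/c1t1d0s2 | vtoc_body)
#   b=$(prtvtoc /dev/rdsk/c1t0d0s2 | vtoc_body)
#   [ "$a" = "$b" ] && echo "labels match"
```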
Then re-enable each failed component in place with metareplace -e, as if the new drive were the old one in the SVM configuration, and let the mirrors resynchronize automatically:
# metareplace -e d104 c1t0d0s4
d104: device c1t0d0s4 is replaced with c1t0d0s4
[...]
# metareplace -e d5 c1t0d0s5
d5: device c1t0d0s5 is replaced with c1t0d0s5
# metastat -c
d78 p 300MB d7
d72 p 1018MB d7
d76 p 100MB d7
d75 p 500MB d7
d74 p 50GB d7
d73 p 200MB d7
d77 p 256MB d7
d71 p 250MB d7
d7 m 119GB d17 (resync-0%) d27
d17 s 119GB c1t0d0s7 (resyncing)
d27 s 119GB c1t1d0s7
d104 m 2.4GB d24 d14 (resync-41%)
d24 s 2.4GB c1t1d0s4
d14 s 2.4GB c1t0d0s4 (resyncing)
d103 m 2.0GB d23 d13 (resync-28%)
d23 s 2.0GB c1t1d0s3
d13 s 2.0GB c1t0d0s3 (resyncing)
d100 m 4.9GB d20 d10 (resync-10%)
d20 s 4.9GB c1t1d0s0
d10 s 4.9GB c1t0d0s0 (resyncing)
d5 m 4.0GB d15 (resync-6%) d25
d15 s 4.0GB c1t0d0s5 (resyncing)
d25 s 4.0GB c1t1d0s5
d1 m 3.9GB d11 (resync-13%) d21
d11 s 3.9GB c1t0d0s1 (resyncing)
d21 s 3.9GB c1t1d0s1
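Rather than typing one metareplace per mirror, the commands can be generated from the metastat -c output taken before the repair. The following is a sketch under stated assumptions, not a general tool: metareplace_cmds is a hypothetical name, it assumes every stripe line is a submirror of the most recent mirror ("m") line, as in this layout, and it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Read `metastat -c` output and print one `metareplace -e` command for
# each stripe flagged (maint) or (-), using the most recent mirror ("m")
# line as the parent. Commands are printed, not executed.
# metareplace_cmds is a hypothetical helper name.
metareplace_cmds() {
    awk '$2 == "m" { mirror = $1; next }
         $2 == "s" && mirror != "" && ($5 == "(maint)" || $5 == "(-)") {
             print "metareplace -e", mirror, $4
         }'
}

# Typical use:
#   metastat -c | metareplace_cmds
```

Once the printed commands look right, they can be run by hand or piped to sh.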
You are done.
