blog'o thnet

Sunday 17 April 2011

Replacing A Failed Drive In A SVM Configuration

Here is a rough step-by-step procedure for replacing a failed drive that is part of a Solaris Volume Manager (SVM) mirror configuration (RAID-1) while the system is up and running.

Here is the kind of message reported by the operating system:

# grep md_mirror /var/adm/messages
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d17: /dev/dsk/c1t0d0s7 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d11: /dev/dsk/c1t0d0s1 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d5: open error on /dev/dsk/c1t0d0s5
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d1: open error on /dev/dsk/c1t0d0s1
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d10: /dev/dsk/c1t0d0s0 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d13: /dev/dsk/c1t0d0s3 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d100: open error on /dev/dsk/c1t0d0s0
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d14: /dev/dsk/c1t0d0s4 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d103: open error on /dev/dsk/c1t0d0s3
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d104: open error on /dev/dsk/c1t0d0s4
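
When many slices fail at once, it helps to boil the warnings down to the disks involved. The following is only a sketch: failed_disks is a hypothetical helper name, and it assumes the md_mirror message layout shown above.

```shell
# failed_disks (hypothetical helper): read md_mirror warnings on stdin and
# print each affected disk once, with the slice suffix removed.
failed_disks() {
    grep 'md_mirror' |
    sed -n 's|.*/dev/dsk/\(c[0-9t]*d[0-9]*\)s[0-9]*.*|\1|p' |
    sort -u
}

# Possible usage on the affected host:
#   failed_disks < /var/adm/messages
```

With the messages above, this should print the single disk c1t0d0.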

Figure out the SVM configuration layout:

# metastat -c
d78              p  300MB d7
d72              p 1018MB d7
d76              p  100MB d7
d75              p  500MB d7
d74              p   50GB d7
d73              p  200MB d7
d77              p  256MB d7
d71              p  250MB d7
    d7           m  119GB d17 (maint) d27
        d17      s  119GB c1t0d0s7 (maint)
        d27      s  119GB c1t1d0s7
d104             m  2.4GB d24 d14 (maint)
    d24          s  2.4GB c1t1d0s4
    d14          s  2.4GB c1t0d0s4 (maint)
d103             m  2.0GB d23 d13 (maint)
    d23          s  2.0GB c1t1d0s3
    d13          s  2.0GB c1t0d0s3 (maint)
d100             m  4.9GB d20 d10 (maint)
    d20          s  4.9GB c1t1d0s0
    d10          s  4.9GB c1t0d0s0 (maint)
d5               m  4.0GB d15 (unavail) d25
    d15          s  4.0GB c1t0d0s5 (-)
    d25          s  4.0GB c1t1d0s5
d1               m  3.9GB d11 (maint) d21
    d11          s  3.9GB c1t0d0s1 (maint)
    d21          s  3.9GB c1t1d0s1
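
To get a compact list of what needs attention, the layout above can be filtered down to "mirror slice" pairs. This is a sketch under assumptions: broken_components is a hypothetical name, it relies on the "metastat -c" column layout shown above, and it treats any stripe line ending with a parenthesized state as flagged. It needs a POSIX awk (nawk on older Solaris releases).

```shell
# broken_components (hypothetical helper): read "metastat -c" output on stdin
# and print one "mirror slice" pair per stripe flagged with a state.
broken_components() {
    awk '
        $2 == "m" { mirror = $1 }            # remember the enclosing mirror
        $2 == "s" && $NF ~ /^\(.*\)$/ {      # stripe line ending in "(state)"
            print mirror, $4
        }'
}

# Possible usage:
#   metastat -c | broken_components
```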

The format utility also reports the failed disk as a drive that is no longer available:

# echo | format
[...]
AVAILABLE DISK SELECTIONS:
       0. c1t0d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0
       1. c1t1d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]

Find more information about the failed drive, such as type of errors, serial number, WWNN of the drive, etc.:

# cfgadm -alv c1
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
c1                             connected    configured   unknown
unavailable  fc-private   n        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc
c1::500000e012a66aa1           connected    configured   unknown    FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012a66aa1
c1::500000e012ab94d1           connected    configured   failed     FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012ab94d1
# iostat -En
[...]
c1t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU  Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G021R2
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
c1t0d0           Soft Errors: 0 Hard Errors: 1 Transport Errors: 73
Vendor: FUJITSU  Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G023LB
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
[...]
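
On a host with many drives, the interesting part of the "iostat -En" output is the handful of devices with non-zero hard or transport error counters. A minimal sketch, assuming the summary line layout shown above (disks_with_errors is a hypothetical name):

```shell
# disks_with_errors (hypothetical helper): read "iostat -En" output on stdin
# and print "device hard_errors transport_errors" for devices where either
# counter is non-zero.
disks_with_errors() {
    awk '/Soft Errors:/ && ($7 > 0 || $10 > 0) { print $1, $7, $10 }'
}

# Possible usage:
#   iostat -En | disks_with_errors
```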

First, clean up the metadb configuration by removing the state database replicas that reference the bad disk drive:

# metadb
    flags           first blk       block count
  Wm  p  l          16            8192         /dev/dsk/c1t0d0s6
  W   p  l          8208          8192         /dev/dsk/c1t0d0s6
  W   p  l          16400         8192         /dev/dsk/c1t0d0s6
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6
# metadb -d c1t0d0s6
# metadb
    flags           first blk       block count
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6
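
The replicas to delete can also be spotted mechanically. As far as I understand the legend printed by "metadb -i", uppercase flags (W, M, D, F, S, R) denote replicas with errors, while lowercase flags describe a healthy state. The sketch below relies on that assumption; replicas_with_errors is a hypothetical name.

```shell
# replicas_with_errors (hypothetical helper): read "metadb" output on stdin
# and print each device holding a replica with an uppercase error flag.
replicas_with_errors() {
    awk '
        $NF ~ /^\/dev\// {
            bad = 0
            # The last three fields are first blk, block count and device;
            # everything before them is the flags column.
            for (i = 1; i <= NF - 3; i++)
                if ($i ~ /[WMDFSR]/) bad = 1
            if (bad && !seen[$NF]++) print $NF
        }'
}

# Possible usage:
#   metadb | replicas_with_errors
```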

Since the disk is completely gone, the proper way to remove an FC drive does not work as expected:

# luxadm remove_device /dev/rdsk/c1t0d0s2

 WARNING!!! Please ensure that no filesystems are mounted on these device(s).
 All data on these devices should have been backed up.

 Error: SCSI failure. - /dev/rdsk/c1t0d0s2.

So, go ahead and physically replace the failed drive. Here are the hardware events reported on the system's console:

# dmesg
[...]
Mar 31 18:06:17 beastie picld[152]: [ID 222282 daemon.error] Fault detected: DISK0
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop OFFLINE
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop ONLINE
Mar 31 18:06:18 beastie fctl: [ID 517869 kern.warning] WARNING: fp(3)::fp_plogi_intr: fp 1 pd ef
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0 (ssd0):
Mar 31 18:06:19 beastie        Error for Command: write(10)               Error Level: Retryable
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Requested Block: 37369856                  Error Block: 37369856
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Vendor: FUJITSU                            Serial Number: 0634G021R2
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Sense Key: Unit Attention
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Mar 31 18:06:37 beastie scsi: [ID 799468 kern.info] ssd144 at fp3: name w500000e0125c4531,0, bus address ef
Mar 31 18:06:37 beastie genunix: [ID 936769 kern.info] ssd144 is /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
Mar 31 18:06:37 beastie genunix: [ID 408114 kern.info] /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0 (ssd144) online
Mar 31 18:06:52 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0 (ssd3):
Mar 31 18:06:52 beastie        drive offline
[...]
Mar 31 18:07:31 beastie picld[152]: [ID 691918 daemon.error] FSP_GEN_FAULT_LED has turned ON
Mar 31 18:07:43 beastie picld[152]: [ID 861866 daemon.error] Notice: DISK0 okay
Mar 31 18:07:44 beastie picld[152]: [ID 114988 daemon.error] FSP_GEN_FAULT_LED has turned OFF
[...]

If necessary (i.e. if it is not done automatically), recreate and, where needed, clean up the device links under the /dev subtree, then verify that the new drive is properly seen by the operating system:

# devfsadm -Cv
[...]
# echo | format
AVAILABLE DISK SELECTIONS:
       0. c1t0d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
       1. c1t1d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]

Now create a proper VTOC on the new disk by copying it from the surviving disk, and recreate the metadb replicas on it as before:

# prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
fmthard:  New volume table of contents now in place.
# metadb -a -c 3 c1t0d0s6
# metadb
    flags           first blk       block count
 a        u         16            8192         /dev/dsk/c1t0d0s6
 a        u         8208          8192         /dev/dsk/c1t0d0s6
 a        u         16400         8192         /dev/dsk/c1t0d0s6
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6

Then re-enable each failed component in place, so that the new drive stands in for the old one in the SVM configuration, and let the mirrors resynchronize automatically:

# metareplace -e d104 c1t0d0s4
d104: device c1t0d0s4 is replaced with c1t0d0s4
[...]
# metareplace -e d5 c1t0d0s5
d5: device c1t0d0s5 is replaced with c1t0d0s5
# metastat -c
d78              p  300MB d7
d72              p 1018MB d7
d76              p  100MB d7
d75              p  500MB d7
d74              p   50GB d7
d73              p  200MB d7
d77              p  256MB d7
d71              p  250MB d7
    d7           m  119GB d17 (resync-0%) d27
        d17      s  119GB c1t0d0s7 (resyncing)
        d27      s  119GB c1t1d0s7
d104             m  2.4GB d24 d14 (resync-41%)
    d24          s  2.4GB c1t1d0s4
    d14          s  2.4GB c1t0d0s4 (resyncing)
d103             m  2.0GB d23 d13 (resync-28%)
    d23          s  2.0GB c1t1d0s3
    d13          s  2.0GB c1t0d0s3 (resyncing)
d100             m  4.9GB d20 d10 (resync-10%)
    d20          s  4.9GB c1t1d0s0
    d10          s  4.9GB c1t0d0s0 (resyncing)
d5               m  4.0GB d15 (resync-6%) d25
    d15          s  4.0GB c1t0d0s5 (resyncing)
    d25          s  4.0GB c1t1d0s5
d1               m  3.9GB d11 (resync-13%) d21
    d11          s  3.9GB c1t0d0s1 (resyncing)
    d21          s  3.9GB c1t1d0s1
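
To follow the reconstruction without reading the whole layout each time, the resync percentages can be extracted from the same output. A minimal sketch, assuming the "(resync-N%)" marker shown above (resync_progress is a hypothetical name):

```shell
# resync_progress (hypothetical helper): read "metastat -c" output on stdin
# and print "mirror percent" for every mirror still resynchronizing.
resync_progress() {
    awk '$2 == "m" {
        for (i = 4; i <= NF; i++)
            if ($i ~ /^\(resync-[0-9]+%\)$/) {
                pct = $i
                gsub(/[^0-9]/, "", pct)    # keep only the digits
                print $1, pct
            }
    }'
}

# Possible polling loop, which ends once no mirror reports a resync:
#   while [ -n "$(metastat -c | resync_progress)" ]; do sleep 60; done
```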

You are done.

Monday 22 September 2008

About GNU/Linux Software Mirroring And LVM

Here, the final aim was to provide data access redundancy through SAN storage hosted on remote sites across Wide Area Network (WAN) links. After some relatively long and painful attempts to mimic the software mirroring found on the HP-UX platform using Logical Volume Management (LVM), i.e. at the logical volume level, I finally gave up, having decided that this functionality would definitely not fit my needs. Why? Here are my comments.

  1. It is not possible to provide clear and manageable storage multipathing when it is important to distinguish between the multiple sites, à la the mirror-across-controllers feature found in Veritas VxVM on Sun Solaris systems, for example. So, managing many physical volumes along with lots of logical volumes is very complicated.
  2. There is no way to map a logical volume's storage exactly onto a given physical volume.
  3. A disk-based log, i.e. a persistent log, is needed. Yes, one can always pass the --corelog option when creating the logical volume and get an in-memory log, i.e. a non-persistent log, but this requires all the copies (mirrors) to be resynchronized upon reboot. Not really viable in multi-TB environments.
  4. A write-intensive workload on a file system living on a mirrored logical volume suffers from high latency: the overhead is significant, and the time needed for mostly-write jobs grows dramatically. It is also really hard to get high-level statistics; only low-level metrics seem consistent: the sd SCSI devices and the dm- device mapper components for each path entry. Nothing is available from the multipath devices' standpoint, which is the most interesting one from the end user and SA point of view.
  5. You can't extend a mirrored logical volume, which is really a no-go per se. On that point, Red Hat support answered that this functionality may be added in a future release, and that the current state might become a Request For Enhancement (RFE) if a proper business justification is provided. Until then, one must break the logical volume mirror copy, then rebuild it completely. Not realistic when the logical volume is made of a lot of physical extents across multiple physical volumes.
  6. An LVM configuration can completely lock itself up and become unusable. The fact is, LVM uses persistent storage blocks to keep track of its own metadata. The metadata size is set at physical volume creation time only, and can't be changed afterward. This size defaults to 255 physical volume blocks, and can be adjusted from the LVM configuration file. The problem is, when this circular buffer space (stored in ASCII) fills up, such as when there are a lot of logical volumes in a mirrored environment, it is not possible to do anything more with LVM. So you can't add more logical volumes, can't add more logical volume copies... and can't delete them to try to restore a proper LVM configuration. Well, here are the answers given by Red Hat support to two key questions in this situation:
    • How to size the metadata, i.e. if we need to change it from the default value, how can we determine the new proper and appropriate size, and from which information?
      I am afraid but Metadata size can only be defined at the time of PV creation and there is no real formula for calculating the size in advance. By changing the default value of 255 you can get a higher space value. For general LVM setup (with less LV's and VG's) default size works fine however in cases where high number of LV's are required a custom value will be required.
    • We just want to delete all LV copies, which means to return to the initial situation and have 0 copy for all LV, i.e. only one LV per-se, in order to be able to change LVM configuration again (we can't do anything on our production server right now)?
      I discussed this situation with peer engineers and also referenced a similar previous case. From the notes of the same the workaround is to use the backup file (/etc/lvm/backup) and restore the PV's. I agree that this really not a production environment method however seems the only workaround.
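
Since the metadata area cannot be resized after the fact, the best one can do is watch how full it gets. The sketch below assumes an LVM2 release whose pvs command exposes the pv_mda_size and pv_mda_free report fields, which may well not be the case on the release discussed here, so check pvs(8) first. mda_nearly_full is a hypothetical helper name, and the 20 percent default is arbitrary.

```shell
# mda_nearly_full (hypothetical helper): stdin holds one
# "pv_name mda_size mda_free" line per PV, sizes in bytes; print the PVs
# whose free metadata space is below the given percentage (default: 20).
mda_nearly_full() {
    awk -v limit="${1:-20}" '
        NF == 3 && $2 > 0 {                  # skip PVs without metadata area
            free_pct = int($3 * 100 / $2)
            if (free_pct < limit) print $1, free_pct "%"
        }'
}

# Assumed invocation (the field names are an assumption, see above):
#   pvs --noheadings --units b --nosuffix \
#       -o pv_name,pv_mda_size,pv_mda_free | mda_nearly_full 20
```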

So, the production Oracle RDBMS server is now finally being evacuated to another machine. Hum... I hope to get a better enterprise experience using the mdadm package to handle software RAID, instead of LVM mirroring (RAID-1). Maybe more about that in another blog entry?

Friday 15 February 2008

LVM2 Simple Mirroring On RHEL4

When we needed to evacuate all persistent SAN storage from EMC DMX1K to HP XP12K (HDS), three main solutions were considered. The first one was a brute-force data copy (tar, cpio, etc.), but it was not very practical given the size of the data (multiple terabytes) and the time needed to copy it. The other two were based on LVM technologies: mirroring or moving.

Although we chose the online and transparent data-moving technology (see pvmove for more information), it is interesting to note that Red Hat has backported support for the creation and manipulation of simple mirrors to its RHEL4 distribution. These features were introduced with the RHBA-2006:0504-15 advisory issued on 2006-08-10, i.e. between RHEL4 Update 4 and RHEL4 Update 5 (and so available via RHN at that time). It is just too bad that the online help of the LVM commands is neither kept in sync with, nor fully documented by, the corresponding manual pages: clearly, this doesn't help in using them under the best conditions (no, Google isn't always the best option when using this kind of functionality in big companies).

Wednesday 30 May 2007

RAID-1 Volume From the root File System Using SVM on x86 Platform

Here is a little step-by-step guide to creating a soft mirror of the root file system, known as encapsulating the system disk. This provides full protection against the failure of one disk, and complete redundancy. At the same time, it speeds up read requests (since multiple backing devices host the same data), but write performance is generally degraded. First, get to know your running system, particularly which disk it is currently installed on and which other device is available for the second side of the mirror.

# df -hF ufs
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c1d0s0        7.9G   5.2G   2.6G    67%    /
# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/dsk/c1d0s1     102,65       4K     4.0G     4.0G
#
# echo | format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1d0 
          /pci@0,0/pci-ide@8/ide@0/cmdk@0,0
       1. c2d0 
          /pci@0,0/pci-ide@8/ide@1/cmdk@0,0
[...]

We will use c2d0 as the second submirror. So, we need to set up a single default Solaris partition that uses the whole disk and make it bootable (we are using GRUB in this case). The slice for the second submirror must have a slice tag of root, and the root slice must be slice 0 (so, we will duplicate the label's content from the boot disk to the mirror disk).

# fdisk -B /dev/rdsk/c2d0p0
# fdisk /dev/rdsk/c2d0p0
             Total disk size is 36483 cylinders
             Cylinder size is 16065 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1       Active    Solaris2          1  36482    36482    100

SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)
Enter Selection:
#
# prtvtoc /dev/rdsk/c1d0s2 | fmthard -s - /dev/rdsk/c2d0s2
fmthard:  New volume table of contents now in place.
#
# /sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2d0s0
stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 260 sectors starting at 50 (abs 16115)

Create replicas of the metadevice state database:

# metadb -a -c 3 -f c1d0s4 c2d0s4
# metadb
        flags           first blk       block count
     a        u         16              8192            /dev/dsk/c1d0s4
     a        u         8208            8192            /dev/dsk/c1d0s4
     a        u         16400           8192            /dev/dsk/c1d0s4
     a        u         16              8192            /dev/dsk/c2d0s4
     a        u         8208            8192            /dev/dsk/c2d0s4
     a        u         16400           8192            /dev/dsk/c2d0s4

The -f flag is needed because this is the first invocation of metadb(1m), i.e. the initial creation of the state database.
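
Since SVM relies on a majority of valid replicas, it is worth checking that they are spread evenly over the two disks. A small sketch, assuming the "metadb" output layout shown above (replicas_per_disk is a hypothetical name):

```shell
# replicas_per_disk (hypothetical helper): read "metadb" output on stdin and
# print "disk count", one line per disk holding state database replicas.
replicas_per_disk() {
    awk '$NF ~ /^\/dev\/dsk\// {
            disk = $NF
            sub(/^\/dev\/dsk\//, "", disk)   # drop the directory part
            sub(/s[0-9]+$/, "", disk)        # drop the slice suffix
            count[disk]++
         }
         END { for (d in count) print d, count[d] }' | sort
}

# Possible usage:
#   metadb | replicas_per_disk
```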

Set up the RAID-0 metadevices (stripe or concatenation volumes) corresponding to the / file system and the swap space, and automatically configure system files (/etc/vfstab and /etc/system) for the root metadevice.

# metainit -f d10 1 1 c1d0s0
d10: Concat/Stripe is setup
# metainit -f d11 1 1 c1d0s1
d11: Concat/Stripe is setup
# metainit d20 1 1 c2d0s0
d20: Concat/Stripe is setup
# metainit d21 1 1 c2d0s1
d21: Concat/Stripe is setup
# metainit d0 -m d10
d0: Mirror is setup
# metainit d1 -m d11
d1: Mirror is setup
#
# cp /etc/vfstab /etc/vfstab.beforesvm
# sed -e 's@/dev/dsk/c1d0s1@/dev/md/dsk/d1@' /etc/vfstab.beforesvm > /etc/vfstab
# metaroot d0
# diff /etc/vfstab /etc/vfstab.beforesvm
6,7c6,7
< /dev/md/dsk/d1   -                 -   swap   -   no   -
< /dev/md/dsk/d0   /dev/md/rdsk/d0   /   ufs    1   no   -
---
> /dev/dsk/c1d0s1  -                 -   swap   -   no   -
> /dev/dsk/c1d0s0  /dev/rdsk/c1d0s0  /   ufs    1   no   -

The -f flag is needed because the slices on which we want to initialize the new metadevices hold file systems that are currently mounted (in use).
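
Before rebooting, a quick check that no entry of /etc/vfstab still points at a plain slice can save a surprise. This is only a sketch: unmirrored_entries is a hypothetical name, and an entry it reports may be perfectly legitimate if that slice is not meant to be mirrored.

```shell
# unmirrored_entries (hypothetical helper): read a vfstab on stdin and print
# "device mountpoint" for every entry still using a plain /dev/dsk slice.
unmirrored_entries() {
    awk '!/^#/ && $1 ~ /^\/dev\/dsk\// { print $1, $3 }'
}

# Possible usage:
#   unmirrored_entries < /etc/vfstab
```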

Reboot on the metadevices: the operating system will now boot encapsulated, on a one-side mirror. Last, attach the second part of the mirror and adapt the system dump configuration.

# lockfs -af && shutdown -y -g 0 -i 6
[...]
# metattach d0 d20
d0: submirror d20 is attached
# metattach d1 d21
d1: submirror d21 is attached
#
# metastat -p
d1 -m /dev/md/rdsk/d11 /dev/md/rdsk/d21 1
d11 1 1 /dev/rdsk/c1d0s1
d21 1 1 /dev/rdsk/c2d0s1
d0 -m /dev/md/rdsk/d10 /dev/md/rdsk/d20 1
d10 1 1 /dev/rdsk/c1d0s0
d20 1 1 /dev/rdsk/c2d0s0
# metastat | grep %
    Resync in progress: 41 % done
    Resync in progress: 46 % done
#
# rmdir /var/crash/*
# mkdir /var/crash/`hostname`
# chmod 700 /var/crash/`hostname`
# dumpadm -s /var/crash/`hostname` -d /dev/md/dsk/d1
      Dump content: kernel pages
       Dump device: /dev/md/dsk/d1 (swap)
Savecore directory: /var/crash/bento
  Savecore enabled: yes

Last, define the alternative boot path in the menu.lst GRUB configuration file: the Solaris/BSD slice 0 on the first fdisk partition on the second BIOS disk.

# cat << 'EOF' >> /boot/grub/menu.lst
title Solaris Nevada snv_65 X86 (Alternate Boot Path)
root (hd1,0,a)
kernel$ /platform/i86pc/kernel/$ISADIR/unix
module$ /platform/i86pc/$ISADIR/boot_archive
EOF
#
# bootadm list-menu 
The location for the active GRUB menu is: /boot/grub/menu.lst
default 0
timeout 10
0 Solaris Nevada snv_65 X86
1 Solaris failsafe
2 Solaris Nevada snv_65 X86 (Alternate Boot Path)

For further (and deeper) information on this subject, please refer to the excellent Sun Microsystems Documentation on Solaris Volume Manager, and particularly x86: Creating a RAID-1 Volume From the root (/) File System.

Monday 1 May 2006

How to Patch a Live System Mirrored with SVM

Aim of this memo

The main purpose of this technical note is to demonstrate how to patch a running (live) system currently mirrored using SVM, while minimizing the downtime as far as possible.

The idea is simple: detach one side of the mirror, apply the cluster patch to it and reboot on it. If all seems OK, re-encapsulate the system. This achieves a goal similar to that of the Live Upgrade feature of the Solaris OS (see live_upgrade(5)), with less complexity and different requirements (LVM RAID-1 vs. spare disk or free slice).

Using this solution, the service is unavailable for 10 to 30 minutes (depending on the hardware POST) and a maximum of two reboots is required, whatever the number of patches to apply.

Here it is

Here is a system encapsulated using SDS 4.x or SVM 1.x, and the associated SVM encapsulation configuration:

# metastat -p
d3 -m d13 d23 1
d13 1 1 c0t0d0s3
d23 1 1 c0t1d0s3
d1 -m d11 d21 1
d11 1 1 c0t0d0s1
d21 1 1 c0t1d0s1
d0 -m d10 d20 1
d10 1 1 c0t0d0s0
d20 1 1 c0t1d0s0
#
# cat /etc/vfstab
#device         device          mount   FS      fsck    mount   mount
#to mount       to fsck         point   type    pass    at boot options
#
fd      -       /dev/fd fd      -       no      -
/proc   -       /proc   proc    -       no      -
/dev/md/dsk/d3  -       -       swap    -       no      -
/dev/md/dsk/d0  /dev/md/rdsk/d0 /       ufs     1       no      -
/dev/md/dsk/d1  /dev/md/rdsk/d1 /var    ufs     1       no      -
swap    -       /tmp    tmpfs   -       yes     -

Run an explorer and generate a cluster patch, based on tools provided by the OSE for example, if you are lucky enough to have one included with your support plan (or just pick one provided on SunSolve).

Then, be sure to be able to boot on the two disks, just in case:

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0

The next step is to voluntarily detach one side of the mirror: take the first one for the sake of simplicity (i.e. c0t0d0). Indeed, in this case we are pretty sure that its alias name at the OBP is disk.

Note: You can always create it at the OBP (using the usual set of commands, such as show-disks, devalias, etc.) if you want. That is just a matter of personal preferences.

# lockfs -af    # Just to minimize the fs inconsistencies at next fsck(1m).
#
# metadetach d0 d10
# metadetach d1 d11
# metadetach d3 d13
#
# metaclear d10
# metaclear d11
# metaclear d13
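
On a system with more mirrors than this one, the list of commands can be derived from the "metastat -p" output instead of being typed by hand. The sketch below only prints the commands, it does not run them. plan_detach is a hypothetical name, it only understands simple one-slice submirrors ("name 1 1 slice") as shown above, and it needs a POSIX awk (nawk on older Solaris releases).

```shell
# plan_detach (hypothetical helper): read "metastat -p" output on stdin and
# print the metadetach/metaclear commands for the submirrors on disk $1.
plan_detach() {
    awk -v disk="$1" '
        function base(p) { sub(/^.*\//, "", p); return p }
        $2 == "-m" {                         # mirror: name -m sub1 sub2 ... pass
            for (i = 3; i < NF; i++) parent[base($i)] = base($1)
            next
        }
        NF == 4 && $2 == 1 && $3 == 1 {      # simple concat: name 1 1 slice
            d = base($4)
            sub(/s[0-9]+$/, "", d)           # slice name -> disk name
            if (d == disk) picked[++n] = base($1)
        }
        END {
            for (i = 1; i <= n; i++)
                if (picked[i] in parent)
                    print "metadetach", parent[picked[i]], picked[i]
            for (i = 1; i <= n; i++)
                if (picked[i] in parent)
                    print "metaclear", picked[i]
        }'
}

# Possible usage (review the output before running anything):
#   metastat -p | plan_detach c0t0d0
```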

Check and repair the file systems if necessary, since we will boot on them the next time:

# fsck /dev/dsk/c0t0d0s0
# fsck /dev/dsk/c0t0d0s1

The next steps are to mount the recently detached file systems and to prepare the first disk to boot without SVM encapsulation:

# mkdir /mirror
# mount /dev/dsk/c0t0d0s0 /mirror
# mount /dev/dsk/c0t0d0s1 /mirror/var
#
# cat << EOF > /mirror/etc/vfstab
#device         device          mount   FS      fsck    mount   mount
#to mount       to fsck         point   type    pass    at boot options
#
fd      -       /dev/fd fd      -       no      -
/proc   -       /proc   proc    -       no      -
/dev/dsk/c0t0d0s3       -       -       swap    -       no      -
/dev/dsk/c0t0d0s0       /dev/rdsk/c0t0d0s0      /       ufs     1       no      -
/dev/dsk/c0t0d0s1       /dev/rdsk/c0t0d0s1      /var    ufs     1       no      -
swap    -       /tmp    tmpfs   -       yes     -
EOF
#
# cp /mirror/etc/system /mirror/etc/system.orig
# sed -e 's;rootdev:/pseudo/md@0:0,0,blk;*rootdev:/pseudo/md@0:0,0,blk;' \
   /mirror/etc/system.orig > /mirror/etc/system

Last, install the patches on the first disk, clean things up a little and reboot if the install procedure went smoothly:

# ./install_all_patches -R /mirror
#
# umount /mirror/var
# umount /mirror
# rmdir /mirror
#
# shutdown -y -g 0 -i 6

After rebooting, carefully review the behavior of the freshly patched system. If all seems well, don't forget to re-encapsulate the second disk. Here is a quick and easy way to do this:

/* Recreate the metadb. */
# metadb -d c0t0d0s4 c0t1d0s4
# metadb -a -c3 -f c0t0d0s4 c0t1d0s4
#
/* Clean the system metadevices always present. */
# metaclear d0
# metaclear d1
# metaclear d3
# metaclear d20
# metaclear d21
# metaclear d23
#
/* Re-create them as part of a mirror. */
# metainit -f d10 1 1 c0t0d0s0
# metainit d0 -m d10
# metainit -f d11 1 1 c0t0d0s1
# metainit d1 -m d11
# metainit -f d13 1 1 c0t0d0s3
# metainit d3 -m d13
#
/* Be able to boot on the new metadevices. */
# metaroot d0
#
/* Reboot, and create the second side of the mirror. */
# shutdown -y -g 0 -i 6
[...]
# metainit d20 1 1 c0t1d0s0
# metattach d0 d20
# metainit d21 1 1 c0t1d0s1
# metattach d1 d21
# metainit d23 1 1 c0t1d0s3
# metattach d3 d23

For a little more detailed explanation about encapsulating the system using SVM on Sun Solaris, please refer to the dedicated entry in this blog.

Last, it must be mentioned that this documentation was written by our OSE, and that this procedure was officially marked as supported by Sun Microsystems.

Saturday 22 April 2006

Setting Up a Soft Mirroring System Using gmirror(8)

Because of the nature of the services provided by the ThNET Project, I have always wanted to keep the I/O very reliable, and used some old hardware RAID technology to do the job. Since there was no full hardware support due to some legal restrictions on the provider's binaries, the solution wasn't perfect, and I always spent a lot of time rebuilding the RAID array because of some obscure issue whenever disk problems occurred.

I didn't want to do the same thing with the new infrastructure server, and decided to build the RAID-1 solution on top of the gmirror(8) software, a GEOM framework based tool provided under recent FreeBSD releases.

So, based on the work of others (see the end of this entry for references), here are the steps I followed to switch the main server of the project from a classical setup to a mirrored one.

First, make sure that the second disk is treated as a new, fresh one:

# dd if=/dev/zero of=/dev/ad10 bs=512 count=79

Put a GEOM label onto it and force load the gmirror.ko kernel module:

# gmirror label -v -n -b round-robin gm0 /dev/ad10
# gmirror load

Then write a PC (BIOS) MBR, place a new BSD label, initialize it and create custom partitions:

# fdisk -v -B -I /dev/mirror/gm0
# bsdlabel -w -B /dev/mirror/gm0s1
# bsdlabel -e /dev/mirror/gm0s1
# cat << 'EOF' > /etc/bsdlabel.gm0s1
# /dev/mirror/gm0s1:
8 partitions:
#         size   offset   fstype   [fsize bsize bps/cpg]
# a: Will be mounted as `/'.
# d: Will be mounted as `/var'.
# e: Will be mounted as `/tmp'.
# f: Will be mounted as `/usr'.
# g: Will be mounted as `/home'.
a:      512M       16   4.2BSD     2048 16384       8
b:     2048M        *     swap
c: 586099332        0   unused        0     0         # "raw" part, don't edit
d:     2048M        *   4.2BSD     2048 16384   28528
e:     4096M        *   4.2BSD     2048 16384   28528
f:    12288M        *   4.2BSD     2048 16384   28528
g:         *        *   4.2BSD     2048 16384   28528
EOF
# bsdlabel -R /dev/mirror/gm0s1 /etc/bsdlabel.gm0s1

Make new file systems on the corresponding partitions (note: generally speaking, it seems better not to put soft-updates on the root partition):

# newfs /dev/mirror/gm0s1a
# newfs -U /dev/mirror/gm0s1d
# newfs -U /dev/mirror/gm0s1e
# newfs -U /dev/mirror/gm0s1f
# newfs -U /dev/mirror/gm0s1g

Populate the second disk using dump(8) and restore(8) for example, or use a backup if that is applicable to you:

# mkdir /tmp/gm0s1 && mount /mnt/da0
# 
# mount /dev/mirror/gm0s1a /tmp/gm0s1
# gzip -dc /mnt/da0/dump/2006-04-10.*/bento.thilelli.net.root.dump.gz | \
(cd /tmp/gm0s1 && restore -rf -)
# 
# mount /dev/mirror/gm0s1d /tmp/gm0s1/var
# gzip -dc /mnt/da0/dump/2006-04-10.*/bento.thilelli.net.var.dump.gz | \
(cd /tmp/gm0s1/var && restore -rf -)
# 
# mount /dev/mirror/gm0s1f /tmp/gm0s1/usr
# gzip -dc /mnt/da0/dump/2006-04-10.*/bento.thilelli.net.usr.dump.gz | \
(cd /tmp/gm0s1/usr && restore -rf -)
# 
# mount /dev/mirror/gm0s1g /tmp/gm0s1/home
# gzip -dc /mnt/da0/dump/2006-04-10.*/bento.thilelli.net.home.dump.gz | \
(cd /tmp/gm0s1/home && restore -rf -)
# 
# mount /dev/mirror/gm0s1e /tmp/gm0s1/tmp
# chmod 1777 /tmp/gm0s1/tmp

Prepare the new file system table, force the load of the GEOM mirror at boot time (necessary for the root mount) and be sure to boot on the second disk at the next reboot:

# cp -p /tmp/gm0s1/etc/fstab /tmp/gm0s1/etc/fstab.orig
/*
 * sed -e 's/dev\/ad8/dev\/mirror\/gm0/g' < /tmp/gm0s1/etc/fstab.orig \
 *  > /tmp/gm0s1/etc/fstab
 */
# cat << EOF > /tmp/gm0s1/etc/fstab
# Device                Mountpoint      FStype  Options                         Dump    Pass#
/dev/mirror/gm0s1b      none            swap    sw                              0       0
/dev/mirror/gm0s1a      /               ufs     rw                              1       1
/dev/mirror/gm0s1e      /tmp            ufs     rw,noatime,nosuid,nodev         2       2
/dev/mirror/gm0s1f      /usr            ufs     rw                              2       2
/dev/mirror/gm0s1d      /var            ufs     rw,noexec                       2       2
/dev/mirror/gm0s1g      /home           ufs     rw,userquota,nosuid,nodev       2       2
/dev/acd0               /cdrom          cd9660  ro,noauto                       0       0
/dev/da0s1              /mnt/da0        ufs     rw,noauto,nosuid,nodev          0       0
/dev/da1s1              /mnt/da1        msdosfs rw,noauto                       0       0
EOF
# echo geom_mirror_load=\"YES\" >> /tmp/gm0s1/boot/loader.conf
# echo "1:ad(1,a)/boot/loader" > /boot.config

Unmount the second side of the mirror and reboot:

# umount /tmp/gm0s1/tmp
# umount /tmp/gm0s1/home
# umount /tmp/gm0s1/usr
# umount /tmp/gm0s1/var
# umount /tmp/gm0s1
# 
# sync && shutdown -r now

After rebooting on the second disk (the GEOMified one), switch the mirror to auto-synchronization and add the first disk, which immediately starts synchronizing with the second disk's content:

# gmirror configure -a gm0
# gmirror insert gm0 /dev/ad8
# 
# gmirror list
Geom name: gm0
State: DEGRADED
Components: 2
Balance: round-robin
Slice: 4096
Flags: NONE
GenID: 0
SyncID: 1
ID: 770137303
Providers:
1. Name: mirror/gm0
Mediasize: 300090727936 (279G)
Sectorsize: 512
Mode: r7w6e7
Consumers:
1. Name: ad10
Mediasize: 300090728448 (279G)
Sectorsize: 512
Mode: r1w1e1
State: ACTIVE
Priority: 0
Flags: NONE
GenID: 0
SyncID: 1
ID: 2706535066
2. Name: ad8
Mediasize: 300090728448 (279G)
Sectorsize: 512
Mode: r1w1e1
State: SYNCHRONIZING
Priority: 0
Flags: DIRTY, SYNCHRONIZING
GenID: 0
SyncID: 1
Synchronized: 71%
ID: 2682952005
# 
# gmirror status
Name    Status  Components
mirror/gm0  DEGRADED  ad10
              ad8 (71%)
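
To wait for the end of the synchronization without watching the console, the percentage can be polled from the gmirror status output. This is only a minimal sketch, assuming the output format shown above. The sync_percent and wait_for_sync names and the 60-second delay are illustrative choices, not part of gmirror(8).

```shell
#!/bin/sh
# Print the synchronization percentage found in `gmirror status` output
# read from stdin; print nothing when no component reports a percentage.
sync_percent() {
    sed -n 's/.*[( ]\([0-9][0-9]*\)%).*/\1/p' | head -n 1
}

# Hypothetical helper: poll the given mirror until no percentage is reported.
wait_for_sync() {
    while pct=$(gmirror status "$1" | sync_percent) && [ -n "$pct" ]; do
        echo "$1: ${pct}% synchronized"
        sleep 60
    done
}
```

It would be called as wait_for_sync gm0 once the first disk has been inserted.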

During all of these steps, some kernel messages may be seen on the console or in the /var/log/messages system log file, as shown below:

# tail -f /var/log/messages
GEOM_MIRROR: Device gm0: provider ad8 detected.
GEOM_MIRROR: Device gm0: rebuilding provider ad8.
GEOM_MIRROR: Device gm0: rebuilding provider ad8 finished.
GEOM_MIRROR: Device gm0: provider ad8 activated.

Last, here is some invaluable documentation on the subject, with a special mention for the BSD DevCenter one since, although less safe for the data than the others, it bypasses the need to duplicate the data from one disk to the other, much as is done when using SVM on Solaris from Sun Microsystems.

Wednesday 12 April 2006

Stability Problem and New RAID-1 System

Just after the big update and new infrastructure installation during the past month, the server encountered stability problems on a regular basis (sometimes crashing more than once a day). At first, we thought it might be caused by the new processor architecture, a less mature FreeBSD distribution than the well-known i386.

But the problem appears to be specifically related to I/O, especially when input/output is very intensive, e.g. file system dump(8), file system snapshot mksnap_ffs(8) or big tar(1) or cpio(1L) archive transfers.

After several days, we had not managed to clearly isolate the source of the problem, but decided to quickly put the system under RAID-1 management (disk mirroring) and configure the excellent gmirror(8) to do the job. So, I switched from an old hardware RAID mechanism to a pure software one. Hopefully we will be able to find something new without disturbing the ThNET services anymore.

More on the manipulations involved in putting the system under gmirror(8) control in a later post. Stay tuned.

Thursday 21 July 2005

Broken RAID Array... Found the First Guilty Component

Last week, the mirror array for the system broke one more time. After some investigation, it turned out to be a faulty Serillel adapter. Luckily I had one more of these in stock... just in case (what a good idea, isn't it?).

On the other hand, the mirror array for the data (home directories) broke during the overall restore procedure. It is still in degraded mode for now. I need more time to work on this... quickly :(

# grep DEGRADED /var/log/messages | tail -1
Jul 17 17:54:24 bento kernel: ar1: 117246MB <ATA RAID1 array> [14946/255/63] status: DEGRADED subdisks:

# atacontrol status ar1
ar1: ATA RAID1 subdisks: ad8 status: DEGRADED

Don't forget to make some good backups! There is no good reason not to do so. It is just a matter of time before you need them.

Friday 8 July 2005

Rebuilding ATA RAID-1 Array Using FreeBSD 5.X

Ouch! This morning I discovered that one of the disks of the system mirror array was broken on one of the servers, as shown in the /var/log/messages log file:

Jul  1 08:22:56 bento kernel: ad4: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=14<NID_NOT_FOUND,ABORTED> LBA=16082271
Jul  1 08:22:56 bento kernel: ar0: WARNING - mirror lost
Jul  1 08:22:57 bento kernel: ad4: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=14<NID_NOT_FOUND,ABORTED> LBA=16099295

So, assuming that the hardware error is not so bad and maybe recoverable (I really don't want to restore from the last backups if I can avoid it), I followed these steps to rebuild the faulty RAID1 array...

Check for more information on ATA devices and the impacted array:

# atacontrol list
ATA channel 0:
    Master: acd0 <SAMSUNG CD-ROM SC-152L/C100> ATA/ATAPI revision 0
    Slave:       no device present
ATA channel 1:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:  ad4 <Maxtor 6Y080P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <Maxtor 6Y080P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <Maxtor 6Y120P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
ATA channel 5:
    Master: ad10 <Maxtor 6Y120P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
#
# atacontrol status ar0
ar0: ATA RAID1 subdisks: ad6 status: DEGRADED

Detach the ATA channel of the failed disk from the configuration (the disk can then be safely removed if necessary):

# atacontrol detach 2
#
# grep ad4 /var/log/messages | tail -1
Jul  2 11:13:40 bento kernel: ad4: WARNING - removed from configuration

Reattach the disk to the configuration:

# atacontrol attach 2
Master:  ad4 <Maxtor 6Y080P0/YAR41VW0> ATA/ATAPI revision 7
Slave:       no device present
#
# grep ad4 /var/log/messages | tail -1
Jul  2 11:13:47 bento kernel: ad4: 78167MB <Maxtor 6Y080P0/YAR41VW0> [158816/16/63] at ata2-master UDMA133

Add a spare disk (the same as before in fact, in our case) to the existing system RAID:

# atacontrol addspare ar0 ad4
#
# grep ad4 /var/log/messages | tail -1
Jul  2 11:14:03 bento kernel: ad4: inserted into ar0 disk0 as spare

Rebuild the RAID1 dynamically:

# atacontrol rebuild ar0

Check the progress of the rebuild:

# atacontrol status ar0
ar0: ATA RAID1 subdisks: ad4 ad6 status: REBUILDING 7% completed

When all is done, this can be shown using atacontrol(8) as follows:

# atacontrol status ar0
ar0: ATA RAID1 subdisks: ad4 ad6 status: READY
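
Since a degraded array is easy to miss, the state word can be extracted from the atacontrol status line and checked from a periodic job. This is a minimal sketch, assuming the one-line output format shown above. The array_state and check_array names are hypothetical helpers, not part of atacontrol(8).

```shell
#!/bin/sh
# Extract the state word (READY, DEGRADED, REBUILDING, ...) from a line of
# `atacontrol status` output read from stdin.
array_state() {
    sed -n 's/.*status: \([A-Z][A-Z]*\).*/\1/p'
}

# Hypothetical cron-style check: complain unless the given array is READY.
check_array() {
    state=$(atacontrol status "$1" | array_state)
    if [ "$state" != "READY" ]; then
        echo "WARNING: $1 is ${state:-in an unknown state}" >&2
        return 1
    fi
}
```

Running check_array ar0 would then return a non-zero status, with a warning on standard error, while the array is degraded or rebuilding.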

Monday 6 June 2005

Encapsulation of the System's Disk Using SVM

  1. c0t0d0s2 represents the first system disk (boot)
  2. c0t1d0s2 represents the second disk (mirror)

Duplicate the label's content from the boot disk to the mirror disk:

# prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

Create replicas of the metadevice state database:

# metadb -a -c3 -f c0t0d0s4 c0t1d0s4
# metadb

Option -f is needed because this is the very first creation of state database replicas with metadb(1M).

Creation of metadevices:

# metainit -f d10 1 1 c0t0d0s0
# metainit -f d11 1 1 c0t0d0s1
# metainit -f d13 1 1 c0t0d0s3
# metainit -f d16 1 1 c0t0d0s6
#
# metainit d20 1 1 c0t1d0s0
# metainit d21 1 1 c0t1d0s1
# metainit d23 1 1 c0t1d0s3
# metainit d26 1 1 c0t1d0s6

Option -f is needed because the file systems on the slices we want to turn into metadevices are already mounted.
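
With more slices, typing each command by hand becomes error-prone. The sketch below only prints the commands, following the naming convention used above (d1N on the boot disk, d2N on the mirror disk, N being the slice number). The print_metainit name is a hypothetical helper, and the disk names are those of this example.

```shell
#!/bin/sh
# Print (rather than run) the metainit commands for the given slice numbers.
# -f is only used on the boot disk, whose file systems are already mounted.
print_metainit() {
    for s in "$@"; do
        echo "metainit -f d1$s 1 1 c0t0d0s$s"
    done
    for s in "$@"; do
        echo "metainit d2$s 1 1 c0t1d0s$s"
    done
}

print_metainit 0 1 3 6
```

Once the printed list has been reviewed, it can be piped to sh to actually create the metadevices.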

Create the first part of the mirror:

# metainit d0 -m d10
# metainit d1 -m d11
# metainit d3 -m d13
# metainit d6 -m d16
#
# cp /etc/vfstab /etc/vfstab.beforesvm
# metaroot d0

Don't forget to edit /etc/vfstab in order to reflect the other metadevices:

  • s@/dev/dsk/cXtYdZsN@/dev/md/dsk/dN@
  • s@/dev/rdsk/cXtYdZsN@/dev/md/rdsk/dN@
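
As a concrete instance of these two substitutions, here is a minimal sketch for the disk used in this example, where slice N of c0t0d0 becomes mirror dN. The to_metadevices name and the sample mount point are illustrative assumptions.

```shell
#!/bin/sh
# Rewrite c0t0d0sN block and raw device paths to their dN metadevice
# equivalents (slice N maps to mirror dN, as in the commands above).
to_metadevices() {
    sed -e 's@/dev/dsk/c0t0d0s\([0-9]\)@/dev/md/dsk/d\1@' \
        -e 's@/dev/rdsk/c0t0d0s\([0-9]\)@/dev/md/rdsk/d\1@'
}

# Sample vfstab entry (the mount point is illustrative).
printf '/dev/dsk/c0t0d0s6\t/dev/rdsk/c0t0d0s6\t/export/home\tufs\t2\tyes\t-\n' |
    to_metadevices
```

A cautious way to use it is to write the result to a temporary file first (to_metadevices < /etc/vfstab > /tmp/vfstab.new) and compare it with diff(1) before copying it into place. The root entry already rewritten by metaroot is left untouched, since it no longer matches.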

Install the boot block code on the alternate boot disk and add it to the boot-device list in the OpenBoot PROM (OBP):

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0
# eeprom boot-device="disk disk1 net"   /* Or just "disk disk1". */

Reboot on the new metadevices (the operating system will now boot encapsulated):

# shutdown -y -g 0 -i 6

Attach the second part of the mirror:

# metattach d0 d20
# metattach d1 d21
# metattach d3 d23
# metattach d6 d26

Verify all:

# metastat -p
# metastat | grep \%
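
The same percentage test can be used to wait until the submirrors have finished resynchronizing, for instance before rebooting again. This is a minimal sketch relying only on the grep shown above. The resync_in_progress and wait_for_resync names and the 60-second delay are illustrative.

```shell
#!/bin/sh
# Succeed when metastat output read from stdin still reports a resync
# percentage (same test as the `grep \%` above).
resync_in_progress() {
    grep -q '%'
}

# Hypothetical helper: poll every 60 seconds until no resync is reported.
wait_for_resync() {
    while metastat | resync_in_progress; do
        sleep 60
    done
    echo "no resync in progress"
}
```

Note that the absence of a percentage only means that no resync is running. The state of each submirror should still be checked in the metastat output.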

Modify the system dump configuration:

# mkdir /var/crash/`hostname`
# chmod 700 /var/crash/`hostname`
# dumpadm -s /var/crash/`hostname`
# dumpadm -d /dev/md/dsk/d1