blog'o thnet

To content | To menu | To search

Tag - mirror

Entries feed - Comments feed

Monday 22 September 2008

About GNU/Linux Software Mirroring And LVM

Here, the final aim was to provide data access redundancy through SAN storage hosted on remote sites across Wide Area Network (WAN) links. After some relatively long and painful tries to mimic software mirroring as found on HP-UX platform using Logical Volume Management (LVM), i.e. at the logical volume level, I finally give up deciding this functionality will definitely not fi my need. Why? Here are my comments.

  1. It is not possible to provide clear and manageable storage multipath when the need to distinguish between the multiple sites is important, ala mirror across controllers found on Veritas VxVM on Sun Solaris system, for example. So, managing many physical volumes along with lots of logical volumes is very complicated.
  2. There is no exact mapping capability between logical volume storage on a given physical volume.
  3. The need to have a disk-based log, i.e. a persistent log. Yes, one can always provide the option --corelog at the creation time to the logical volume initial build and have an in-memory log , i.e. a non-persistent log, but this requires the entire copies (mirrors) be resynchronized upon reboot. Not really viable on multi-TB environments.
  4. A write-intensive workload on a file system living on a logical volume mirror will suffer high latency: the overhead is important, and the time to do mostly-write jobs grow dramatically. It is really hard to get high level statistics, only low level metrics seems consistent: sd SCSI devices and dm- device mapper components for each paths entries. Not from the multipath devices standpoint, which is the more interesting from the end user and SA point of view.
  5. You can't extend a logical volume, which is really a no-go per-se. On that point, the Red Hat support answered that this functionality may be added in a future release, the current state may eventually be a Request For Enhancement (RFE), if a proper business justification is provided. One must break the logical volume mirror copy, then rebuild it completely. Not realistic when the logical volume is made of a lot of physical extents across multiple physical volumes.
  6. A LVM configuration can be totally blocked by itself, and not usable at all. The fact is, LVM use persistent storage blocks to keep track of its own metadata. The metadata size is set at physical volume creation time only, and can't be change afterward. This size is statically defined as 255 physical volume blocks, and can be adjust from the LVM configuration file. The problem is, when this circular buffer space (stored in ASCII) fills up--such as when there are a lot of logical volumes in a mirrored environment--it is not possible to do anything more with LVM. So you can't add more logical volume, can't add more logical volume copies,... and can't delete them trying to reestablish a proper LVM configuration. Well, here are the answers given by the Red Hat support to two keys questions in this situation:
    • How to size the metadata, i.e. if we need to change it from the default value, how can we determine the new proper and appropriate size, and from which information?
      I am afraid but Metadata size can only be defined at the time of PV creation and there is no real formula for calculating the size in advance. By changing the default value of 255 you can get a higher space value. For general LVM setup (with less LV's and VG's) default size works fine however in cases where high number of LV's are required a custom value will be required.
    • We just want to delete all LV copies, which means to return to the initial situation and have 0 copy for all LV, i.e. only one LV per-se, in order to be able to change LVM configuration again (we can't do anything on our production server right now)?
      I discussed this situation with peer engineers and also referenced a similar previous case. From the notes of the same the workaround is to use the backup file (/etc/lvm/backup) and restore the PV's. I agree that this really not a production environment method however seems the only workaround.

So, the production RDBMS Oracle server is finally now being evacuate to an other machine. Hum... Hope to see better enterprise experience using the mdadm package to handle RAID software, instead of mirror (RAID-1) LVM. Maybe more about that in an other blog entry?

Friday 15 February 2008

LVM2 Simple Mirroring On RHEL4

When the need to evacuate all persistent SAN storage from EMC DMX1K to HP XP12K (HDS), three main solutions were envisaged. The first one was brute data copy (tar, cpio, etc.) but was not very practical with the size of the data (multi-terabytes) and the time involved in copying them. The two others were based on LVM technologies: mirroring, or moving.

Although the choice has been to use the online and transparent moving data technology (see pvmove for more information), it was interesting to note that Red Hat has backported support for the creation and manipulation of simple mirrors to their RHEL4 distribution. These functionalities were introduced with the RHBA-2006:0504-15 advisory issued on 2006-08-10, i.e. between RHEL4 Update 4 and RHEL4 Update 5 (and so available via RHN at this time). It is just too bad that the online help for LVM commands are not properly synchronized nor fully documented by the corresponding manual page: clearly, this doesn't help to use them in the best conditions (no, Google isn't always the better option when using these kinds of functionalities in big companies).

Monday 1 May 2006

How to Patch a Live System Mirrored with SVM

Aim of this memo

The main purpose of this technical note is to demonstrate how to patch a running (live) system currently mirrored using SVM, minimizing the downtime as far as possible.

The idea is simple: detach one side of the mirror, apply the cluster patch against it and reboot on it. If all seems ok, re-encapsulate the system. This can achieve similar goal currently found in the Live Upgrade feature of the Solaris OS (see live_upgrade(5)), with less complexity and different requirement (LVM RAID-1 vs. spare disk, or free slice).

Using this solution, the downtime can go between 10 to 30 minutes of service unavailability (depending on the hardware POST) and a maximum of two reboots are required, whatever is the number of patches to apply.

Here it is

Here is a system encapsulated using SDS 4.x or SVM 1.x, and the associated SVM encapsulation configuration:

# metastat -p
d3 -m d13 d23 1
d13 1 1 c0t0d0s3
d23 1 1 c0t1d0s3
d1 -m d11 d21 1
d11 1 1 c0t0d0s1
d21 1 1 c0t1d0s1
d0 -m d10 d20 1
d10 1 1 c0t0d0s0
d20 1 1 c0t1d0s0
#
# cat /etc/vfstab
#device         device          mount   FS      fsck    mount   mount
#to mount       to fsck         point   type    pass    at boot options
#
fd      -       /dev/fd fd      -       no      -
/proc   -       /proc   proc    -       no      -
/dev/md/dsk/d3  -       -       swap    -       no      -
/dev/md/dsk/d0  /dev/md/rdsk/d0 /       ufs     1       no      -
/dev/md/dsk/d1  /dev/md/rdsk/d1 /var    ufs     1       no      -
swap    -       /tmp    tmpfs   -       yes     -

Run an explorer and generate a cluster patch, based on tools provided by the OSE for example, if you are luckily enough to have one included with your support plan (or just pick one provided at SunSolve).

Then, be sure to be able to boot on the two disks, just in case:

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0

The next step is to voluntarily detach one side of the mirror: take the first one for the sake of simplicity (i.e. c0t0d0). Indeed, in this case we are pretty sure that its alias name at the OBP is disk.

Note: You can always create it at the OBP (using the usual set of commands, such as show-disks, devalias, etc.) if you want. That is just a matter of personal preferences.

# lockfs -af /* Just to minimize the fs inconsistencies at next fsck(1m). */
#
# metadetach d0 d10
# metadetach d1 d11
# metadetach d3 d13
#
# metaclear d10
# metaclear d11
# metaclear d13

Check and repair the file systems if necessary, since we will boot on them the next time:

# fsck /dev/dsk/c0t0d0s0
# fsck /dev/dsk/c0t0d0s1

Next steps include mounting the recently detached file systems and prepare the first disk to boot without SVM encapsulation:

# mkdir /mirror
# mount /dev/dsk/c0t0d0s0 /mirror
# mount /dev/dsk/c0t0d0s1 /mirror/var
#
# cat << EOF > /mirror/etc/vfstab
#device         device          mount   FS      fsck    mount   mount
#to mount       to fsck         point   type    pass    at boot options
#
fd      -       /dev/fd fd      -       no      -
/proc   -       /proc   proc    -       no      -
/dev/dsk/c0t0d0s3       -       -       swap    -       no      -
/dev/dsk/c0t0d0s0       /dev/rdsk/c0t0d0s0      /       ufs     1       no      -
/dev/dsk/c0t0d0s1       /dev/rdsk/c0t0d0s1      /var    ufs     1       no      -
swap    -       /tmp    tmpfs   -       yes     -
EOF
#
# cp /mirror/etc/system /mirror/etc/system.orig
# sed -e 's;rootdev:/pseudo/md@0:0,0,blk;*rootdev:/pseudo/md@0:0,0,blk;' \
   /mirror/etc/system.orig > /mirror/etc/system

Last, install patches against the first disk, clean things up a little and reboot if the install procedure went all smooth:

# ./install_all_patches -R /mirror
#
# umount /mirror/var
# umount /mirror
# rmdir /mirror
#
# shutdown -y -g 0 -i 6

After rebooting, carefully review the behavior of the very freshly patched system. If all seems well, don't forget to re-encapsulate the second disk. Here is a quick and easy way to this:

/* Recreate the metadb. */
# metadb -d c0t0d0s4 c0t1d0s4
# metadb -a -c3 -f c0t0d0s4 c0t1d0s4
#
/* Clean the system metadevices always present. */
# metaclear d0
# metaclear d1
# metaclear d3
# metaclear d20
# metaclear d21
# metaclear d23
#
/* Re-create them as part of a mirror. */
# metainit -f d10 1 1 c0t0d0s0
# metainit d0 -m d10
# metainit -f d11 1 1 c0t0d0s1
# metainit d1 -m d11
# metainit -f d13 1 1 c0t0d0s3
# metainit d3 -m d13
#
/* Be able to boot on the new metadevices. */
# metaroot d0
#
/* Reboot, and create the second side of the mirror. */
# shutdown -y -g 0 -i 6
[...]
# metainit d20 1 1 c0t1d0s0
# metattach d0 d20
# metainit d21 1 1 c0t1d0s1
# metattach d1 d21
# metainit d23 1 1 c0t1d0s3
# metattach d3 d23

For a little more detailed explanation about encapsulating the system using SVM on Sun Solaris, please refer to the dedicated entry in this blog.

Last, it must be mentioned that this documentation was written by our OSE, and that this procedure was officially marked as supported by Sun Microsystems.

Friday 8 July 2005

Rebuilding ATA RAID-1 Array Using FreeBSD 5.X

Ouch! This morning i discovered that one of the mirror for the system array disks was broken on one of the servers, as can be show in the /var/log/messages log file:

Jul  1 08:22:56 bento kernel: ad4: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=14<NID_NOT_FOUND,ABORTED> LBA=16082271
Jul  1 08:22:56 bento kernel: ar0: WARNING - mirror lost
Jul  1 08:22:57 bento kernel: ad4: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=14<NID_NOT_FOUND,ABORTED> LBA=16099295

So, assuming that the hardware error is not so bad and maybe recoverable (i really don't want to replay the last backups if i can), i follow these steps to rebuild the faulting RAID1 array...

Check for more information on ATA devices and the impacted array:

# atacontrol list
ATA channel 0:
    Master: acd0 <SAMSUNG CD-ROM SC-152L/C100> ATA/ATAPI revision 0
    Slave:       no device present
ATA channel 1:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:  ad4 <Maxtor 6Y080P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <Maxtor 6Y080P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <Maxtor 6Y120P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
ATA channel 5:
    Master: ad10 <Maxtor 6Y120P0/YAR41VW0> ATA/ATAPI revision 7
    Slave:       no device present
#
# atacontrol status ar0
ar0: ATA RAID1 subdisks: ad6 status: DEGRADED

Detach the disk from the array (then it will be safely removable if necessary):

# atacontrol detach 2
#
# grep ad4 /var/log/messages | tail -1
Jul  2 11:13:40 bento kernel: ad4: WARNING - removed from configuration

Reattach the disk to the configuration:

# atacontrol attach 2
Master:  ad4 <Maxtor 6Y080P0/YAR41VW0> ATA/ATAPI revision 7
Slave:       no device present
#
# grep ad4 /var/log/messages | tail -1
Jul  2 11:13:47 bento kernel: ad4: 78167MB <Maxtor 6Y080P0/YAR41VW0> [158816/16/63] at ata2-master UDMA133

Add a spare disk (the same as before in fact, in our case) to the existing system RAID:

# atacontrol addspare ar0 ad4
#
# grep ad4 /var/log/messages | tail -1
Jul  2 11:14:03 bento kernel: ad4: inserted into ar0 disk0 as spare

Rebuild the RAID1 dynamically:

# atacontrol rebuild ar0

Check the progression of the rebuild:

# atacontrol status ar0
ar0: ATA RAID1 subdisks: ad4 ad6 status: REBUILDING 7% completed

When all is done, this can be shown using atacontrol(8) as follow:

# atacontrol status ar0
ar0: ATA RAID1 subdisks: ad4 ad6 status: READY