blog'o thnet

To content | To menu | To search

Monday 22 September 2008

About GNU/Linux Software Mirroring And LVM

Here, the final aim was to provide data access redundancy through SAN storage hosted on remote sites across Wide Area Network (WAN) links. After some relatively long and painful tries to mimic software mirroring as found on HP-UX platform using Logical Volume Management (LVM), i.e. at the logical volume level, I finally give up deciding this functionality will definitely not fit my need. Why? Here are my comments.

  1. It is not possible to provide clear and manageable storage multipath when the need to distinguish between the multiple sites is important, ala mirror across controllers found on Veritas VxVM on Sun Solaris system, for example. So, managing many physical volumes along with lots of logical volumes is very complicated.
  2. There is no exact mapping capability between logical volume storage on a given physical volume.
  3. The need to have a disk-based log, i.e. a persistent log. Yes, one can always provide the option --corelog at the creation time to the logical volume initial build and have an in-memory log , i.e. a non-persistent log, but this requires the entire copies (mirrors) be resynchronized upon reboot. Not really viable on multi-TB environments.
  4. A write-intensive workload on a file system living on a logical volume mirror will suffer high latency: the overhead is important, and the time to do mostly-write jobs grow dramatically. It is really hard to get high level statistics, only low level metrics seems consistent: sd SCSI devices and dm- device mapper components for each paths entries. Not from the multipath devices standpoint, which is the more interesting from the end user and SA point of view.
  5. You can't extend a logical volume, which is really a no-go per-se. On that point, the Red Hat support answered that this functionality may be added in a future release, the current state may eventually be a Request For Enhancement (RFE), if a proper business justification is provided. One must break the logical volume mirror copy, then rebuild it completely. Not realistic when the logical volume is made of a lot of physical extents across multiple physical volumes.
  6. A LVM configuration can be totally blocked by itself, and not usable at all. The fact is, LVM use persistent storage blocks to keep track of its own metadata. The metadata size is set at physical volume creation time only, and can't be change afterward. This size is statically defined as 255 physical volume blocks, and can be adjust from the LVM configuration file. The problem is, when this circular buffer space (stored in ASCII) fills up--such as when there are a lot of logical volumes in a mirrored environment--it is not possible to do anything more with LVM. So you can't add more logical volume, can't add more logical volume copies,... and can't delete them trying to reestablish a proper LVM configuration. Well, here are the answers given by the Red Hat support to two keys questions in this situation:
    • How to size the metadata, i.e. if we need to change it from the default value, how can we determine the new proper and appropriate size, and from which information?
      I am afraid but Metadata size can only be defined at the time of PV creation and there is no real formula for calculating the size in advance. By changing the default value of 255 you can get a higher space value. For general LVM setup (with less LV's and VG's) default size works fine however in cases where high number of LV's are required a custom value will be required.
    • We just want to delete all LV copies, which means to return to the initial situation and have 0 copy for all LV, i.e. only one LV per-se, in order to be able to change LVM configuration again (we can't do anything on our production server right now)?
      I discussed this situation with peer engineers and also referenced a similar previous case. From the notes of the same the workaround is to use the backup file (/etc/lvm/backup) and restore the PV's. I agree that this really not a production environment method however seems the only workaround.

So, the production RDBMS Oracle server is finally now being evacuate to an other machine. Hum... Hope to see better enterprise experience using the mdadm package to handle RAID software, instead of mirror (RAID-1) LVM. Maybe more about that in an other blog entry?

Tuesday 16 September 2008

Announcing Solaris 10 10/08

Although this seems a little bit confident, the long-awaited Update 6 to the Solaris 10 operating system release is just behind the door. This release (scheduled to be available in mid-October) will includes virtualization enhancements including the ability for a Solaris Container to automatically update its environment when moved from one system to another, Logical Domains support for dynamically reconfigurable disk and network I/O, and paravirtualization support when Solaris 10 is used as a guest OS in Xen-based environments such as Sun xVM Server. Solaris 10 10/08 also includes support for the latest systems from Sun and other vendors, such as those based on the Intel Xeon Processor 7400 Series.

This release will be the very first Solaris release to be able to boot from ZFS and use it as their root file system, such as what can be found on OpenSolaris or Solaris Express Community Release today.

Check the What’s New web page for Solaris, and consult the Solaris Media Gallery videos for more information.

Update #1 (2008-10-14): Don't forget to consult the must read What's New in Solaris 10 10/08? from the San Antonio OpenSolaris User Group.

Update #2 (2008-10-31): Get yours, and go reading the official What's New in the Solaris 10 10/08 Release page.

Friday 16 May 2008

Comparison: EMC PowerPath vs. GNU/Linux dm-multipath

I will present some notes about the use of multipath solutions on Red Hat systems: EMC PowerPath and GNU/Linux dm-multipath. Along those notes, keep in mind that they were based on tests done when pressure was very high to put new systems in production, so lack of time resulted in less complete tests than expected. These tests were done more than a year ago, and so before the release of RHEL4 Update 5 and some of RHBA related to both LVM and dm-multipath technologies.

Keep in mind that without purchasing an appropriate EMC license, PowerPath can only be used in failover mode (active-passive mode). Multiple paths accesses are not supported in this case: no round-robin, and no I/O load balancer for example.

EMC PowerPath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage provider.
  3. Support for most UNIX and Unix-like platforms.
  4. Without a valid license, can only work in degraded mode (failover).
  5. Is not sensible to a change in the SCSI LUN renumbering. Adapt accordingly the corresponding multiple sd devices (different paths to a given device) with its multipath definition of the emcpower device.
  6. Provide easily the ID of the SAN storage.

Drawbacks

  1. Not integrated with the operating system (which generally has its own solution).
  2. The need to force a RPM re-installation in case of a kernel upgrade on RHEL systems (due to the fact that kernel modules are stored in a path containing the exact major and minor versions of the installed (booted) kernel.
  3. Non-automatic update procedure.

GNU/Linux device-mapper-multipath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage provider.
  3. Well integrated with the operating system.
  4. Automatic update using RHN (you must be a licensed and registered user in this case).
  5. No additional license cost.

Drawbacks

  1. Only available on GNU/Linux systems.
  2. Configuration (files and keywords) very tedious and difficult.
  3. Without the use of LVM (Logical Volume Management), it has not the ability to follow SCSI LUN renumbering! Even in this case, be sure not to have blacklisted the newly discovered SCSI devices (sd).

Last, please find some interesting documentation on the subject below:

Friday 2 May 2008

PHP APC Extension Bug With Optimized Open Source Software Stack

To easily manage LDAP accounts (and general LDAP entries in fact), we have created a Solaris Zone and installed the excellent Cool Stack bundle to host the LAM (LDAP Account Manager) management web tool. But after upgrading the Cool Stack to version 1.2 we encountered a very annoying problem mostly with freezing web pages, and generally ending up in restarting the Apache web server provided by the Cool Stack. After some troubleshooting, we discover that this behavior was introduced by a bug in the APC-3.0.14 module bundled with the updated php-5.2.4 scripting software in this version of the Cool Stack.

Luckily, the bug was already fixed and a new version of the APC extension of PHP is available for download (in fact, just replace to original apc.so module by the new one). All the Cool Stack related problems, associated fixes and instructions are listed on the Cool Stack 1.2 Patches page: be sure to keep in sync' if you are a Cool Stack consumer.

Monday 28 April 2008

memconf And AMD Athlon 64 X2 Dual Core Processor

The last update to the excellent memconf utility (V2.5 22-Feb-2008) support properly recent Solaris Express releases, and my recent change from the stock AMD Opteron Processor 148 to an AMD Athlon 64 X2 Dual Core Processor 3800+. (I mostly did that change just to be able to access two run queues separately, not to gain more power per se.)

So, here is the new and appropriate memconf report:

# memconf -d
memconf:  V2.5 22-Feb-2008 http://www.4schmidts.com/unix.html
hostname: unic
manufacturer: Sun Microsystems, Inc.
model:    Sun Ultra 20 Workstation (AMD Athlon(tm) 64 X2 Dual Core \
 Processor 3800+ Socket 939 2010MHz)
Sun Family Part Number: A63
Solaris Express Community Edition snv_87 X86, 64-bit kernel, SunOS 5.11
1 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939 2010MHz cpu
diagbanner = Sun Ultra 20 Workstation
cpubanner = AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939 2010MHz
model = Sun Ultra 20 Workstation
machine = i86pc
platform = i86pc
perl version: 5.008004
CPU Units:
==== Processor Sockets ====================================
Version                          Location Tag
-------------------------------- --------------------------
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939
Memory Units:
Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown in use 0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown in use 0   A3                  Bank6/7
total memory = 2048MB (2GB)

You can check and compare with the previous report on my blog.

Sunday 6 April 2008

Change The DST Time Zone Definition For Test Purpose

Because we continue to encounter some specialized software vendor which can't tell if there is a problem with the Daylight Saving Time change for their application, the need to test the time adjustment beforehand arise sometimes. In this case, the first thing which comes in mind is to change the system clock, without modifying the timezone. Although this can do the job, it doesn't test the DST adjustment properly, and affect the overall operating environment, system-wide.

To do things cleaner, we can try to modify directly the timezone in use. This will test the real DST automatic adjustment, while not changing the system clock (and not impacting other software, or services). Say we want to change DST for Europe/Paris time zone one day before the official date for summer 2008. First, obtain the current setting:

# zdump -v `rtc` | \
   awk '$6 ~ /'"`date '+%Y'`"'/ && $12 !~ /'"`date '+%H:%M:%S'`"'/ {print $0}'
Europe/Paris  Sun Mar 30 00:59:59 2008 UTC = Sun Mar 30 01:59:59 2008 CET isdst=0
Europe/Paris  Sun Mar 30 01:00:00 2008 UTC = Sun Mar 30 03:00:00 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 00:59:59 2008 UTC = Sun Oct 26 02:59:59 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 01:00:00 2008 UTC = Sun Oct 26 02:00:00 2008 CET isdst=0

Then, get the time zone source file for the Europe/Paris geographic definition and adapt it writing an exception rule for the current year. Because the current definition for Europe/Paris is based on the EU rule, our modification will be based on this rule name.

# mkdir -p /tmp/zoneinfo/src
# cp /usr/share/lib/zoneinfo/src/europe /tmp/zoneinfo/src
# diff /usr/share/lib/zoneinfo/src/europe /tmp/zoneinfo/src/europe
1073a1074,1076
> # Test the DST adjustment one day in advance, for March 2008 only.
> # Rule        NAME    FROM    TO      TYPE    IN      ON      AT      SAVE    LETTER/S
> Rule          EU      2008    only    -       Mar     29       2:00   1:00    S

Last, compile the updated time zone definition, put it in place, and verify the new DST date.

# zic -d /tmp/zoneinfo /tmp/zoneinfo/src/europe
# cp /tmp/zoneinfo/Europe/Paris /usr/share/lib/zoneinfo/Europe
# zdump -v `rtc` | \
   awk '$6 ~ /'"`date '+%Y'`"'/ && $12 !~ /'"`date '+%H:%M:%S'`"'/ {print $0}'
Europe/Paris  Sat Mar 29 00:59:59 2008 UTC = Sat Mar 29 01:59:59 2008 CET isdst=0
Europe/Paris  Sat Mar 29 01:00:00 2008 UTC = Sat Mar 29 03:00:00 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 00:59:59 2008 UTC = Sun Oct 26 02:59:59 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 01:00:00 2008 UTC = Sun Oct 26 02:00:00 2008 CET isdst=0

Once the fake DST adjustment is validated at the software level, clean things up a little.

# zic /usr/share/lib/zoneinfo/src/europe
# rm -r /tmp/zoneinfo

Thursday 13 March 2008

Update A Corrupted GRUB Boot Archive, With SVM

In a previous discussion about the GRUB boot archive and how it can be regenerated, I mentioned that it will not be as easy as it can be when the root file system use the md driver. I will now show two different methods to do the same thing when the root file system is build upon a SVM mirror (RAID-1):

  1. Unmirror the root file system only.
  2. Unmirror the entire system, i.e. all devices.

Note: Although this test case was done using Solaris 10 8/07 under a virtual machine build upon VirtualBox on latest Solaris Express Community Edition, the instructions must be valid for Solaris 10 1/06 and later.

Initial setup

As we can see, the system use only a root file system, and a swap device. Both are encapsulated with SVM.

# df -k -F ufs
Filesystem     kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0 6147798 3455578 2630743      57%  /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metastat -c d0 d1
d0               m  6.0GB d10 d20
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
d1               m  2.0GB d11 d21
    d11          s  2.0GB c0d0s1
    d21          s  2.0GB c1d1s1

Unmirror the root file system only

The idea is to boot on the GRUB Failsafe mode, select the first side of the mirror, and modify the system and vfstab configuration files to use the correct device path. For the system file, this means to actually remove the rootdev:/pseudo/md@0:0,0,bl entry, not just comment it. For the vfstab file, this means replacing the root file system metadevice path /dev/md/[r]dsk/d0 by the first underlying device path, i.e. /dev/[r]dsk/c0d0s0. Last, regenerate the boot archive on the alternate root path.

[...]
Booting to milestone "milestone/single-user:default".
Configuring devices.
Searching for installed OS instances...
/dev/dsk/c0d0s0 is under md control, skipping.
/dev/dsk/c1d1s0 is under md control, skipping.
No installed OS instance found.

Starting shell.
# fsck /dev/rdsk/c0d0s0
# mount -F ufs /dev/dsk/c0d0s0 /a
# cp /a/etc/system /a/etc/system.bckp
# cp /a/etc/vfstab /a/etc/vfstab.bckp
# TERM=vt100 vi /a/etc/system
# TERM=vt100 vi /a/etc/vfstab
# bootadm update-archive -R /a
# umount /a
# fsck /dev/rdsk/c0d0s0
# reboot

Then, boot into milestone/multi-user:default level and detach the second half of the mirror, since the first half correspond to the valid and updated underlying device. Next, restore the original configuration files which refers to the encapsulated metadevices, and reboot.

# df -k -F ufs
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0d0s0      6147798 3458810 2627511    57%    /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metastat -c d0
d0               m  6.0GB d10 d20
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
# metadetach d0 d20
d0: submirror d20 is detached
# metastat -c d0
d0               m  6.0GB d10
    d10          s  6.0GB c0d0s0
# cp /etc/system.orig /etc/system
# cp /etc/vfstab.orig /etc/vfstab
# shutdown -y -i 6 -g 0

After the reboot, just reattach the second half of the mirror, and wait for complete synchronization to be fully redundant again.

# df -k -F ufs
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0       6147798 3458714 2627607    57%    /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metattach d0 d20
d0: submirror d20 is attached
# metastat -c d0
d0               m  6.0GB d10 d20 (resync-29%)
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0

Unmirror the entire system, i.e. all devices

The idea is exactly the same as for unmirroring the root file system only, but adapting the vfstab file to change the swap entry, too. (So, I didn't reproduce the code listing here.)

Then, boot into milestone/single-user:default level modifying the corresponding GRUB entry as follow: kernel /platform/i86pc/multiboot -s. Completely delete all the metadevices and metadb configurations to clear SVM settings. Last, continue into milestone/multi-user:default level to boot unmirrored.

# metaclear -f -r d0 d1
# metadb -f -d  c1d0s4 c1d0s4
# ^D

Now, the system must be fully encapsulate by SVM again. Please refer to online Sun Documentation, or some past entries on this subject, depending on the system's architecture: SPARC systems, or x86 platforms.

Sunday 9 March 2008

Update A Corrupted GRUB Boot Archive, Without SVM

Solaris 10 systems on x86 architecture use the GNU GRand Unified Bootloader (GRUB) which is the boot loader responsible for loading a boot archive into a system's memory. The boot archive is a collection of critical files (kernel modules and configuration files) that are required to boot the Solaris OS. As stated in the Sun documentation:

These files are needed during system startup before the root file system is mounted. Two boot archives are maintained on a system:

  • The boot archive that is used to boot the Solaris OS on a system. This boot archive is sometimes called the primary boot archive.
  • The boot archive that is used for recovery when the primary boot archive is damaged. This boot archive starts the system without mounting the root file system. On the GRUB menu, this boot archive is called failsafe. The archive's essential purpose is to regenerate the primary boot archive, which is usually used to boot the system.

The Solaris OS generally keeps the boot archive properly synchronized on its own. Sometimes, the boot archive gets corrupted--for example when (bad) patches are applied, or the the operating system crashed. In these cases, the boot archive must be regenerated. This is easily accomplished following the Sun documentations x86: How to Boot the Failsafe Archive for Recovery Purposes, and x86: How to Boot the Failsafe Archive to Forcibly Update a Corrupt Boot Archive. The main drawback is when the system is encapsulated under a SVM mirror (RAID-1) since the md driver is not managed under the failsafe mode. Please refer to this blog entry on this subject, if needed.

Wednesday 5 March 2008

Create And Remove A Remote Printer Queue (CLI)

You can easily create and remove a remote printer queue using the BSD type spooler. You just have to fill the configuration file /tmp/lp.list properly, i.e. provide the local printer name, the remote LPD server, and the remote printer queue:

# cat << EOF > /tmp/lp.list
locname1 lpdserv1 remname1
locname2 lpdserv2 remname2
EOF

Then, just run the appropriate script depending of the desired behavior. Follow, an example when removing the two queues:

# cat << EOF > /tmp/lp.remove
#!/usr/bin/env sh

for lplocal in `awk '{print $1}' /tmp/lp.list`; do
  /usr/sbin/lpshut
  /usr/bin/cancel ${lplocal} -e 2> /dev/null
  /usr/sbin/lpadmin -x${lplocal}
  /usr/sbin/lpsched -v
  sleep 1
done

exit 0
EOF
# sh /tmp/lp.remove
scheduler stopped
scheduler is running
scheduler stopped
scheduler is running
# lpstat -olocname1
no system default destination
lpstat: "locname1" not a request id or a destination

And now, the creation:

# cat << EOF > /tmp/lp.create
#!/usr/bin/env sh

while read lp; do
  eval set -- `IFS=" "; printf '"%s" ' ${lp}`
  lplocal="$1"
  lpserver="$2"
  lpremote="$3"

  /usr/sbin/lpshut
  /usr/sbin/lpadmin -p${lplocal} -orm${lpserver} -orp${lpremote} \
   -mrmodel -v/dev/null -orc -ob3 -ocmrcmodel -osmrsmodel
  /usr/sbin/accept ${lplocal}
  /usr/bin/enable ${lplocal}
  /usr/sbin/lpsched -v
  sleep 1
done < /tmp/lp.list

exit 0
EOF
# sh /tmp/lp.create
scheduler stopped
destination "locname1" now accepting requests
printer "locname1" now enabled
scheduler is running
scheduler stopped
destination "locname2" now accepting requests
printer "locname2" now enabled
scheduler is running
# lpstat -olocname1
no system default destination
printer queue for locname1
                         Windows LPD Server
                   Printer \\lpdserv1
emname1
Owner       Status         Jobname          Job-Id    Size   Pages  Priority
----------------------------------------------------------------------------
hostname: locname1: ready and waiting
no entries

That's it!

Friday 15 February 2008

LVM2 Simple Mirroring On RHEL4

When the need to evacuate all persistent SAN storage from EMC DMX1K to HP XP12K (HDS), three main solutions were envisaged. The first one was brute data copy (tar, cpio, etc.) but was not very practical with the size of the data (multi-terabytes) and the time involved in copying them. The two others were based on LVM technologies: mirroring, or moving.

Although the choice has been to use the online and transparent moving data technology (see pvmove for more information), it was interesting to note that Red Hat has backported support for the creation and manipulation of simple mirrors to their RHEL4 distribution. These functionalities were introduced with the RHBA-2006:0504-15 advisory issued on 2006-08-10, i.e. between RHEL4 Update 4 and RHEL4 Update 5 (and so available via RHN at this time). It is just too bad that the online help for LVM commands are not properly synchronized nor fully documented by the corresponding manual page: clearly, this doesn't help to use them in the best conditions (no, Google isn't always the better option when using these kinds of functionalities in big companies).

- page 3 of 16 -