blog'o thnet

Friday 16 May 2008

Comparison: EMC PowerPath vs. GNU/Linux dm-multipath

I will present some notes about the use of multipath solutions on Red Hat systems: EMC PowerPath and GNU/Linux dm-multipath. While reading these notes, keep in mind that they are based on tests done when pressure was very high to put new systems in production, so lack of time resulted in less complete tests than expected. These tests were done more than a year ago, before the release of RHEL4 Update 5 and of some RHBA advisories related to both LVM and dm-multipath technologies.

Keep in mind that without purchasing an appropriate EMC license, PowerPath can only be used in failover mode (active-passive mode). Access through multiple paths at the same time is not supported in this case: no round-robin and no I/O load balancing, for example.

EMC PowerPath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage providers.
  3. Support for most UNIX and Unix-like platforms.
  4. Without a valid license, can still work, but only in degraded mode (failover).
  5. Is not sensitive to SCSI LUN renumbering: the multipath definition of the emcpower device adapts accordingly to the corresponding multiple sd devices (the different paths to a given device).
  6. Easily provides the ID of the SAN storage.

Drawbacks

  1. Not integrated with the operating system (which generally has its own solution).
  2. The need to force an RPM re-installation after a kernel upgrade on RHEL systems (because the kernel modules are stored in a path containing the exact major and minor versions of the installed (booted) kernel).
  3. Non-automatic update procedure.
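
To illustrate the second drawback, here is a minimal sketch of a check that could be run after a kernel upgrade. It assumes (this is not stated in the notes above) that the PowerPath modules live somewhere under /lib/modules/<release>/ and that their file names start with "emcp"; adjust both to your installation.

```shell
# Report whether PowerPath kernel modules exist for a given kernel release.
# Assumptions: modules are stored under <root>/lib/modules/<release>/ and
# their file names start with "emcp" (hypothetical pattern, please verify).
has_powerpath_modules() {
  release="$1"              # kernel release, e.g. the output of `uname -r`
  root="${2:-}"             # optional alternate root, handy for testing
  moddir="${root}/lib/modules/${release}"
  [ -d "${moddir}" ] || return 1
  # Succeed if at least one matching module file is found.
  [ -n "$(find "${moddir}" -name 'emcp*' -type f -print 2>/dev/null)" ]
}
```

For example, `has_powerpath_modules "$(uname -r)" || echo "PowerPath RPM must be re-installed"` could be added to a post-upgrade checklist.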

GNU/Linux device-mapper-multipath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage providers.
  3. Well integrated with the operating system.
  4. Automatic update using RHN (you must be a licensed and registered user in this case).
  5. No additional license cost.

Drawbacks

  1. Only available on GNU/Linux systems.
  2. Configuration (files and keywords) is very tedious and difficult.
  3. Without the use of LVM (Logical Volume Management), it is not able to follow SCSI LUN renumbering! Even with LVM, be sure not to have blacklisted the newly discovered SCSI devices (sd).

Last, please find some interesting documentation on the subject below:

Friday 2 May 2008

PHP APC Extension Bug With Optimized Open Source Software Stack

To easily manage LDAP accounts (and LDAP entries in general, in fact), we created a Solaris Zone and installed the excellent Cool Stack bundle to host the LAM (LDAP Account Manager) web management tool. But after upgrading Cool Stack to version 1.2, we encountered a very annoying problem, mostly web pages freezing, which generally ended with restarting the Apache web server provided by Cool Stack. After some troubleshooting, we discovered that this behavior was introduced by a bug in the APC-3.0.14 module bundled with the updated php-5.2.4 scripting software in this version of Cool Stack.

Luckily, the bug was already fixed and a new version of the APC extension for PHP is available for download (in fact, just replace the original apc.so module with the new one). All the Cool Stack related problems, associated fixes and instructions are listed on the Cool Stack 1.2 Patches page: be sure to keep in sync if you are a Cool Stack consumer.

Monday 28 April 2008

memconf And AMD Athlon 64 X2 Dual Core Processor

The latest update to the excellent memconf utility (V2.5 22-Feb-2008) properly supports recent Solaris Express releases, as well as my recent change from the stock AMD Opteron Processor 148 to an AMD Athlon 64 X2 Dual Core Processor 3800+. (I mostly made that change to be able to use two separate run queues, not to gain more power per se.)

So, here is the new and appropriate memconf report:

# memconf -d
memconf:  V2.5 22-Feb-2008 http://www.4schmidts.com/unix.html
hostname: unic
manufacturer: Sun Microsystems, Inc.
model:    Sun Ultra 20 Workstation (AMD Athlon(tm) 64 X2 Dual Core \
 Processor 3800+ Socket 939 2010MHz)
Sun Family Part Number: A63
Solaris Express Community Edition snv_87 X86, 64-bit kernel, SunOS 5.11
1 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939 2010MHz cpu
diagbanner = Sun Ultra 20 Workstation
cpubanner = AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939 2010MHz
model = Sun Ultra 20 Workstation
machine = i86pc
platform = i86pc
perl version: 5.008004
CPU Units:
==== Processor Sockets ====================================
Version                          Location Tag
-------------------------------- --------------------------
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939
Memory Units:
Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown in use 0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown in use 0   A3                  Bank6/7
total memory = 2048MB (2GB)

You can check and compare with the previous report on my blog.

Sunday 6 April 2008

Change The DST Time Zone Definition For Test Purpose

Because we keep encountering specialized software vendors who can't tell whether the Daylight Saving Time change is a problem for their application, the need to test the time adjustment beforehand sometimes arises. In this case, the first thing that comes to mind is to change the system clock, without modifying the time zone. Although this can do the job, it doesn't test the DST adjustment properly, and it affects the overall operating environment, system-wide.

To do things more cleanly, we can modify the time zone in use directly. This tests the real automatic DST adjustment without changing the system clock (and without impacting other software or services). Say we want the DST change for the Europe/Paris time zone to occur one day before the official date for summer 2008. First, obtain the current setting:

# zdump -v `rtc` | \
   awk '$6 ~ /'"`date '+%Y'`"'/ && $12 !~ /'"`date '+%H:%M:%S'`"'/ {print $0}'
Europe/Paris  Sun Mar 30 00:59:59 2008 UTC = Sun Mar 30 01:59:59 2008 CET isdst=0
Europe/Paris  Sun Mar 30 01:00:00 2008 UTC = Sun Mar 30 03:00:00 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 00:59:59 2008 UTC = Sun Oct 26 02:59:59 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 01:00:00 2008 UTC = Sun Oct 26 02:00:00 2008 CET isdst=0
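
The quoting in the one-liner above is a little hard to read. As a simplified sketch (not the exact same filter: it keeps only the year test and does not exclude the current-time line), the selection could be wrapped in a small function. It assumes an awk that supports the -v option (nawk on Solaris); the function name is mine.

```shell
# Keep only the lines of one year from `zdump -v` output read on stdin.
# Field 6 is the UTC year in the line format shown above.
# Assumption: awk supports -v (use nawk on Solaris).
transitions_for_year() {
  awk -v year="$1" '$6 == year && /isdst=/'
}
```

It would be used as `zdump -v Europe/Paris | transitions_for_year 2008`.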

Then, get the time zone source file for the Europe/Paris geographic definition and adapt it by writing an exception rule for the current year. Because the current definition for Europe/Paris is based on the EU rule, our modification is based on this rule name.

# mkdir -p /tmp/zoneinfo/src
# cp /usr/share/lib/zoneinfo/src/europe /tmp/zoneinfo/src
# diff /usr/share/lib/zoneinfo/src/europe /tmp/zoneinfo/src/europe
1073a1074,1076
> # Test the DST adjustment one day in advance, for March 2008 only.
> # Rule        NAME    FROM    TO      TYPE    IN      ON      AT      SAVE    LETTER/S
> Rule          EU      2008    only    -       Mar     29       2:00   1:00    S

Last, compile the updated time zone definition, put it in place, and verify the new DST date.

# zic -d /tmp/zoneinfo /tmp/zoneinfo/src/europe
# cp /tmp/zoneinfo/Europe/Paris /usr/share/lib/zoneinfo/Europe
# zdump -v `rtc` | \
   awk '$6 ~ /'"`date '+%Y'`"'/ && $12 !~ /'"`date '+%H:%M:%S'`"'/ {print $0}'
Europe/Paris  Sat Mar 29 00:59:59 2008 UTC = Sat Mar 29 01:59:59 2008 CET isdst=0
Europe/Paris  Sat Mar 29 01:00:00 2008 UTC = Sat Mar 29 03:00:00 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 00:59:59 2008 UTC = Sun Oct 26 02:59:59 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 01:00:00 2008 UTC = Sun Oct 26 02:00:00 2008 CET isdst=0

Once the fake DST adjustment is validated at the software level, clean things up a little.

# zic /usr/share/lib/zoneinfo/src/europe
# rm -r /tmp/zoneinfo

Thursday 13 March 2008

Update A Corrupted GRUB Boot Archive, With SVM

In a previous discussion about the GRUB boot archive and how it can be regenerated, I mentioned that it is not as easy when the root file system uses the md driver. I will now show two different methods to do the same thing when the root file system is built upon an SVM mirror (RAID-1):

  1. Unmirror the root file system only.
  2. Unmirror the entire system, i.e. all devices.

Note: Although this test case was done using Solaris 10 8/07 in a virtual machine built upon VirtualBox on the latest Solaris Express Community Edition, the instructions should be valid for Solaris 10 1/06 and later.

Initial setup

As we can see, the system uses only a root file system and a swap device. Both are encapsulated with SVM.

# df -k -F ufs
Filesystem     kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0 6147798 3455578 2630743      57%  /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metastat -c d0 d1
d0               m  6.0GB d10 d20
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
d1               m  2.0GB d11 d21
    d11          s  2.0GB c0d0s1
    d21          s  2.0GB c1d1s1

Unmirror the root file system only

The idea is to boot in GRUB failsafe mode, select the first side of the mirror, and modify the system and vfstab configuration files to use the correct device path. For the system file, this means actually removing the rootdev:/pseudo/md@0:0,0,bl entry, not just commenting it out. For the vfstab file, this means replacing the root file system metadevice path /dev/md/[r]dsk/d0 with the first underlying device path, i.e. /dev/[r]dsk/c0d0s0. Last, regenerate the boot archive on the alternate root path.

[...]
Booting to milestone "milestone/single-user:default".
Configuring devices.
Searching for installed OS instances...
/dev/dsk/c0d0s0 is under md control, skipping.
/dev/dsk/c1d1s0 is under md control, skipping.
No installed OS instance found.

Starting shell.
# fsck /dev/rdsk/c0d0s0
# mount -F ufs /dev/dsk/c0d0s0 /a
# cp /a/etc/system /a/etc/system.bckp
# cp /a/etc/vfstab /a/etc/vfstab.bckp
# TERM=vt100 vi /a/etc/system
# TERM=vt100 vi /a/etc/vfstab
# bootadm update-archive -R /a
# umount /a
# fsck /dev/rdsk/c0d0s0
# reboot
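
The two vi sessions above are manual edits. For reference, here is a sketch of how the same changes could be scripted; it is not what was run during the test. It assumes a sed that understands POSIX character classes (on Solaris, /usr/xpg4/bin/sed), and the function name and arguments are mine.

```shell
# Remove the SVM root device entry from a system file and point the root
# file system in a vfstab file back to the plain slice.
# $1: system file   $2: vfstab file   $3: metadevice (d0)   $4: slice (c0d0s0)
# Assumption: sed supports [[:space:]] (POSIX sed).
unmirror_root_config() {
  system="$1" vfstab="$2" md="$3" slice="$4"
  # Delete (not comment out) the rootdev line.
  sed '/^rootdev:/d' "${system}" > "${system}.new" &&
    mv "${system}.new" "${system}" || return 1
  # Replace the block and raw metadevice paths when followed by white space,
  # so that other metadevices (d1, d10, ...) are left untouched.
  sed -e "s#/dev/md/dsk/${md}\([[:space:]]\)#/dev/dsk/${slice}\1#" \
      -e "s#/dev/md/rdsk/${md}\([[:space:]]\)#/dev/rdsk/${slice}\1#" \
      "${vfstab}" > "${vfstab}.new" &&
    mv "${vfstab}.new" "${vfstab}"
}
```

In the failsafe shell, and after taking the backups, the call would look like `unmirror_root_config /a/etc/system /a/etc/vfstab d0 c0d0s0`.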

Then, boot to the milestone/multi-user:default level and detach the second half of the mirror, since the first half corresponds to the valid and updated underlying device. Next, restore the original configuration files, which refer to the encapsulated metadevices, and reboot.

# df -k -F ufs
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0d0s0      6147798 3458810 2627511    57%    /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metastat -c d0
d0               m  6.0GB d10 d20
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
# metadetach d0 d20
d0: submirror d20 is detached
# metastat -c d0
d0               m  6.0GB d10
    d10          s  6.0GB c0d0s0
# cp /etc/system.bckp /etc/system
# cp /etc/vfstab.bckp /etc/vfstab
# shutdown -y -i 6 -g 0

After the reboot, just reattach the second half of the mirror, and wait for complete synchronization to be fully redundant again.

# df -k -F ufs
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0       6147798 3458714 2627607    57%    /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metattach d0 d20
d0: submirror d20 is attached
# metastat -c d0
d0               m  6.0GB d10 d20 (resync-29%)
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
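
To wait for the end of the synchronization from a script, one option is to poll metastat. The helper below is a sketch that only relies on the "(resync-NN%)" marker visible in the listing above; its name is hypothetical.

```shell
# Succeed while the `metastat -c` output read on stdin still reports a
# resync in progress, i.e. contains a "resync-NN%" marker.
mirror_is_resyncing() {
  grep 'resync-[0-9]*%' > /dev/null
}
```

A waiting loop could then be written as `while metastat -c d0 | mirror_is_resyncing; do sleep 60; done`.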

Unmirror the entire system, i.e. all devices

The idea is exactly the same as for unmirroring the root file system only, but the vfstab file is adapted to change the swap entry, too. (So, I didn't reproduce the code listing here.)

Then, boot to the milestone/single-user:default level by modifying the corresponding GRUB entry as follows: kernel /platform/i86pc/multiboot -s. Completely delete all the metadevice and metadb configurations to clear the SVM settings. Last, continue to the milestone/multi-user:default level to boot unmirrored.

# metaclear -f -r d0 d1
# metadb -f -d  c1d0s4 c1d0s4
# ^D

Now, the system must be fully encapsulated by SVM again. Please refer to the online Sun documentation, or to some past entries on this subject, depending on the system's architecture: SPARC systems, or x86 platforms.

Sunday 9 March 2008

Update A Corrupted GRUB Boot Archive, Without SVM

Solaris 10 systems on x86 architecture use the GNU GRand Unified Bootloader (GRUB) which is the boot loader responsible for loading a boot archive into a system's memory. The boot archive is a collection of critical files (kernel modules and configuration files) that are required to boot the Solaris OS. As stated in the Sun documentation:

These files are needed during system startup before the root file system is mounted. Two boot archives are maintained on a system:

  • The boot archive that is used to boot the Solaris OS on a system. This boot archive is sometimes called the primary boot archive.
  • The boot archive that is used for recovery when the primary boot archive is damaged. This boot archive starts the system without mounting the root file system. On the GRUB menu, this boot archive is called failsafe. The archive's essential purpose is to regenerate the primary boot archive, which is usually used to boot the system.

The Solaris OS generally keeps the boot archive properly synchronized on its own. Sometimes, the boot archive gets corrupted, for example when (bad) patches are applied, or when the operating system has crashed. In these cases, the boot archive must be regenerated. This is easily accomplished by following the Sun documents x86: How to Boot the Failsafe Archive for Recovery Purposes, and x86: How to Boot the Failsafe Archive to Forcibly Update a Corrupt Boot Archive. The main drawback appears when the system is encapsulated under an SVM mirror (RAID-1), since the md driver is not managed in failsafe mode. Please refer to this blog entry on this subject, if needed.

Wednesday 5 March 2008

Create And Remove A Remote Printer Queue (CLI)

You can easily create and remove a remote printer queue using the BSD type spooler. You just have to fill the configuration file /tmp/lp.list properly, i.e. provide the local printer name, the remote LPD server, and the remote printer queue:

# cat << EOF > /tmp/lp.list
locname1 lpdserv1 remname1
locname2 lpdserv2 remname2
EOF
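
Since both scripts below trust the content of this file, a quick sanity check can help. Here is a minimal sketch (the function name is mine) which verifies that every non-empty line has exactly three fields:

```shell
# Succeed only if every non-empty line of the list given as $1 has exactly
# three fields: local printer name, remote LPD server, remote printer queue.
check_lp_list() {
  awk 'NF > 0 && NF != 3 {
         printf "line %d: expected 3 fields, got %d\n", NR, NF
         bad = 1
       }
       END { exit bad + 0 }' "$1"
}
```

For example: `check_lp_list /tmp/lp.list || echo "fix /tmp/lp.list first"`.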

Then, just run the appropriate script depending on the desired behavior. Here is an example that removes the two queues:

# cat << 'EOF' > /tmp/lp.remove
#!/usr/bin/env sh

for lplocal in `awk '{print $1}' /tmp/lp.list`; do
  /usr/sbin/lpshut
  /usr/bin/cancel ${lplocal} -e 2> /dev/null
  /usr/sbin/lpadmin -x${lplocal}
  /usr/sbin/lpsched -v
  sleep 1
done

exit 0
EOF
# sh /tmp/lp.remove
scheduler stopped
scheduler is running
scheduler stopped
scheduler is running
# lpstat -olocname1
no system default destination
lpstat: "locname1" not a request id or a destination

And now, the creation:

# cat << 'EOF' > /tmp/lp.create
#!/usr/bin/env sh

while read lp; do
  eval set -- `IFS=" "; printf '"%s" ' ${lp}`
  lplocal="$1"
  lpserver="$2"
  lpremote="$3"

  /usr/sbin/lpshut
  /usr/sbin/lpadmin -p${lplocal} -orm${lpserver} -orp${lpremote} \
   -mrmodel -v/dev/null -orc -ob3 -ocmrcmodel -osmrsmodel
  /usr/sbin/accept ${lplocal}
  /usr/bin/enable ${lplocal}
  /usr/sbin/lpsched -v
  sleep 1
done < /tmp/lp.list

exit 0
EOF
# sh /tmp/lp.create
scheduler stopped
destination "locname1" now accepting requests
printer "locname1" now enabled
scheduler is running
scheduler stopped
destination "locname2" now accepting requests
printer "locname2" now enabled
scheduler is running
# lpstat -olocname1
no system default destination
printer queue for locname1
                         Windows LPD Server
                   Printer \\lpdserv1\remname1
Owner       Status         Jobname          Job-Id    Size   Pages  Priority
----------------------------------------------------------------------------
hostname: locname1: ready and waiting
no entries

That's it!

Friday 15 February 2008

LVM2 Simple Mirroring On RHEL4

When the need arose to evacuate all persistent SAN storage from EMC DMX1K to HP XP12K (HDS), three main solutions were considered. The first one was a brute-force data copy (tar, cpio, etc.), but it was not very practical given the size of the data (multiple terabytes) and the time involved in copying it. The two others were based on LVM technologies: mirroring, or moving.

Although the choice was to use the online and transparent data moving technology (see pvmove for more information), it is interesting to note that Red Hat has backported support for the creation and manipulation of simple mirrors to their RHEL4 distribution. These functionalities were introduced with the RHBA-2006:0504-15 advisory issued on 2006-08-10, i.e. between RHEL4 Update 4 and RHEL4 Update 5 (and so available via RHN at this time). It is just too bad that the online help for the LVM commands is not properly synchronized with, nor fully documented by, the corresponding manual pages: clearly, this doesn't help to use them in the best conditions (no, Google isn't always the best option when using these kinds of functionalities in big companies).
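
For readers who have never used the moving approach, here is a sketch of the usual command sequence, printed rather than executed so it can be reviewed first. The volume group and device names are placeholders, not the ones used during the migration.

```shell
# Print (not run) the LVM commands to move a volume group from one physical
# volume to another while it stays online.
# All three arguments are placeholders chosen for this example.
print_pvmove_plan() {
  vg="$1" old_pv="$2" new_pv="$3"
  echo "pvcreate ${new_pv}"            # initialize the new LUN for LVM
  echo "vgextend ${vg} ${new_pv}"      # add it to the volume group
  echo "pvmove ${old_pv} ${new_pv}"    # move the extents online
  echo "vgreduce ${vg} ${old_pv}"      # drop the old LUN from the group
  echo "pvremove ${old_pv}"            # wipe the LVM label on the old LUN
}
```

For example, `print_pvmove_plan vgdata /dev/emcpowera1 /dev/sdq1` (hypothetical names) prints the five steps in order.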

Saturday 9 February 2008

Deleting SCSI Device Paths For A Multipath SAN LUN

When releasing a multipath device under RHEL4, the different SCSI devices corresponding to the different paths must be cleared properly before effectively removing the SAN LUN. If the LUN was deleted before the paths were cleaned up at the OS level, it is still possible to remove them afterwards. In the following example, it is assumed that the LVM manipulations to free the device were already done, and that the LUN is managed by EMC PowerPath.

  1. First, get and verify the SCSI devices corresponding to the multipath LUN:
    # grep "I/O error on device" /var/log/messages | tail -2
    Feb  4 00:20:47 beastie kernel: Buffer I/O error on device sdo, \
     logical block 12960479
    Feb  4 00:20:47 beastie kernel: Buffer I/O error on device sdp, \
     logical block 12960479
    # powermt display dev=sdo
    Bad dev value sdo, or not under Powerpath control.
    # powermt display dev=sdp
    Bad dev value sdp, or not under Powerpath control.
    
  2. Then, get the appropriate scsi#:channel#:id#:lun# information:
    # find /sys/devices -name "*block" -print | \
     xargs \ls -l | awk -F\/ '$NF ~ /sdo$/ || $NF ~ /sdp$/ \
     {print "HBA: "$7"\tscsi#:channel#:id#:lun#: "$9}'
    HBA: host0      scsi#:channel#:id#:lun#: 0:0:0:9
    HBA: host0      scsi#:channel#:id#:lun#: 0:0:1:9
    
  3. When the individual SCSI paths are known, remove them from the system:
    # echo 1 > /sys/bus/scsi/devices/0\:0\:0\:9/delete
    # echo 1 > /sys/bus/scsi/devices/0\:0\:1\:9/delete
    # dmesg | grep "Synchronizing SCSI cache"
    Synchronizing SCSI cache for disk sdp:
    Synchronizing SCSI cache for disk sdo:
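
When many paths must be removed, the third step can be wrapped in a small function. This is only a sketch: the function name is mine, and the optional second argument (an alternate sysfs root) exists only to allow a dry run outside of /sys.

```shell
# Ask the kernel to delete one SCSI path, given its host:channel:id:lun
# address. $2 optionally overrides the sysfs mount point (for a dry run).
delete_scsi_path() {
  hcil="$1"
  sysfs="${2:-/sys}"
  node="${sysfs}/bus/scsi/devices/${hcil}/delete"
  if [ ! -e "${node}" ]; then
    echo "no such SCSI device: ${hcil}" >&2
    return 1
  fi
  echo 1 > "${node}"
}
```

With the addresses found above, this gives `delete_scsi_path 0:0:0:9` and `delete_scsi_path 0:0:1:9`.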
    

Sunday 23 December 2007

Tuning Is Evil

Each month, I hear many coworkers or application management teams asking to put some system tunings in place, even on very recent operating system releases. All the time. Most of these settings come from the Internet: forum posts, articles related to a subsystem, or technical publications. Some of them come from third-party software providers or publishers. Very, very few settings are proposed or recommended by system administrators, or by people knowledgeable in the tuning area.

The problem is that, most of the time, these tunings relate to another release of the operating system, are not kept current with the best practices for a given OS release, or simply are not well understood and cannot be applied without (badly) affecting the running environments. Moreover, tunings already in place are carried over as-is to upgraded and freshly installed systems without further thought, without checking whether they are still applicable (or obsolete), and without knowing what the new defaults are (if not dynamic). One of the most representative examples of this today is the new System V IPC facilities found in Solaris 10 GA and later, where some Oracle DBAs still ask the SA team for shared memory settings as found on Solaris 8 systems.
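
As an illustration of such carried-over settings, here is a sketch which lists the System V IPC tunables present in an /etc/system style file, so that each one can be reviewed. It only flags candidates: whether a given line is still honored depends on the Solaris release, and in Solaris 10 most of these settings are replaced by resource controls. The function name is mine.

```shell
# List the System V IPC tunables set in the /etc/system style file given
# as $1. Comment lines (starting with "*") are not matched.
# The output is a list of candidates for review, nothing more.
list_legacy_ipc_tunables() {
  awk '/^[ \t]*set[ \t]+(shmsys|semsys|msgsys):/' "$1"
}
```

For example: `list_legacy_ipc_tunables /etc/system`.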

Although extracted from the Solaris Internals and Performance FAQ for ZFS, here is a great excerpt we all must read carefully and try to keep in mind when modifying the default behavior of a system:

Tuning is evil and should not be done...in general.

First, consider that the default values are set by the people who know most things about the effects of the tuning. If a better value exists, it would be the default. While alternative values might help a given workload, it could quite possibly degrade some other aspects of performance. Maybe, catastrophically so.

Over time, tuning recommendations might become stale at best or might lead to performance degradations. Customers are leery of changing a tuning that is in place and the net effect is a worse product than what it could be. Moreover, tuning enabled on a given system might spread to other systems, where it might not be warranted at all.
