blog'o thnet


Tag - MPxIO


Monday 22 September 2008

About GNU/Linux Software Mirroring And LVM

Here, the final aim was to provide data access redundancy through SAN storage hosted on remote sites across Wide Area Network (WAN) links. After some relatively long and painful attempts to mimic the software mirroring found on the HP-UX platform using Logical Volume Management (LVM), i.e. mirroring at the logical volume level, I finally gave up, deciding that this functionality would definitely not fit my needs. Why? Here are my comments.

  1. It is not possible to provide clear and manageable storage multipathing when it is important to distinguish between the multiple sites, à la mirror across controllers as found in Veritas VxVM on Sun Solaris systems, for example. So, managing many physical volumes along with lots of logical volumes is very complicated.
  2. There is no way to map logical volume storage exactly onto a given physical volume.
  3. A disk-based log, i.e. a persistent log, is needed. Yes, one can always pass the --corelog option when initially creating the logical volume to get an in-memory log, i.e. a non-persistent log, but this requires all copies (mirrors) to be resynchronized upon reboot. Not really viable in multi-TB environments.
  4. A write-intensive workload on a file system living on a mirrored logical volume will suffer high latency: the overhead is significant, and the time to run mostly-write jobs grows dramatically. It is also really hard to get high-level statistics; only low-level metrics seem consistent: sd SCSI devices and dm- device mapper components for each path entry. Nothing is available from the multipath device standpoint, which is the most interesting one from the end user and SA point of view.
  5. You can't extend a mirrored logical volume, which is really a no-go per se. On that point, Red Hat support answered that this functionality may be added in a future release; for now it could at best become a Request For Enhancement (RFE), if a proper business justification is provided. Until then, one must break the logical volume mirror copy, extend the volume, then rebuild the mirror completely. Not realistic when the logical volume is made of a lot of physical extents across multiple physical volumes.
  6. An LVM configuration can block itself completely and become unusable. The fact is, LVM uses persistent storage blocks to keep track of its own metadata. The metadata size is set at physical volume creation time only, and can't be changed afterward. This size defaults to 255 physical volume blocks, and the default can be adjusted in the LVM configuration file. The problem is, when this circular buffer space (stored in ASCII) fills up, such as when there are a lot of logical volumes in a mirrored environment, it is not possible to do anything more with LVM. So you can't add more logical volumes, can't add more logical volume copies... and can't delete them to try to reestablish a proper LVM configuration. Well, here are the answers given by Red Hat support to two key questions in this situation:
    • How to size the metadata, i.e. if we need to change it from the default value, how can we determine the new proper and appropriate size, and from which information?
      I am afraid but Metadata size can only be defined at the time of PV creation and there is no real formula for calculating the size in advance. By changing the default value of 255 you can get a higher space value. For general LVM setup (with less LV's and VG's) default size works fine however in cases where high number of LV's are required a custom value will be required.
    • We just want to delete all LV copies, which means to return to the initial situation and have 0 copy for all LV, i.e. only one LV per-se, in order to be able to change LVM configuration again (we can't do anything on our production server right now)?
      I discussed this situation with peer engineers and also referenced a similar previous case. From the notes of the same the workaround is to use the backup file (/etc/lvm/backup) and restore the PV's. I agree that this really not a production environment method however seems the only workaround.
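
Since the metadata area cannot be resized once it is full, it helps to notice early that it is filling up. The following is only a sketch, not something that was run on the server discussed here. It assumes an LVM2 release whose vgs command exposes the vg_mda_size and vg_mda_free report fields (to be verified on your release; RHEL4-era tools may not have them). The helper name mda_usage, the 80% threshold and the sample figures are made up for illustration.

```shell
#!/bin/sh
# mda_usage [LIMIT_PERCENT]
# Hypothetical helper: read "<vg_name> <mda_size_bytes> <mda_free_bytes>"
# lines on stdin and print the percentage of metadata area in use,
# flagging every volume group at or above LIMIT_PERCENT (default 80).
mda_usage() {
    awk -v limit="${1:-80}" '
        NF == 3 && $2 > 0 {
            used = int(($2 - $3) * 100 / $2)
            status = (used >= limit) ? "WARN" : "ok"
            printf "%s %d%% %s\n", $1, used, status
        }'
}

# Sample figures, invented for the example (not real vgs output).
printf '%s\n' 'vg_oracle 1044480 52224' 'vg_sys 1044480 940032' | mda_usage 80
```

On a system where those report fields exist, the function could be fed with something like `vgs --noheadings --units b --nosuffix -o vg_name,vg_mda_size,vg_mda_free | mda_usage 80`. For new physical volumes, a larger area can be requested at creation time with the --metadatasize option of pvcreate.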

So, the production Oracle RDBMS server is now finally being evacuated to another machine. Hum... I hope to see a better enterprise experience using the mdadm package to handle software RAID, instead of LVM mirroring (RAID-1). Maybe more about that in another blog entry?

Friday 16 May 2008

Comparison: EMC PowerPath vs. GNU/Linux dm-multipath

I will present some notes about the use of multipath solutions on Red Hat systems: EMC PowerPath and GNU/Linux dm-multipath. While reading these notes, keep in mind that they are based on tests done when pressure was very high to put new systems in production, so lack of time resulted in less complete tests than expected. These tests were done more than a year ago, and so before the release of RHEL4 Update 5 and of some RHBAs related to both LVM and dm-multipath technologies.

Keep in mind that without purchasing an appropriate EMC license, PowerPath can only be used in failover mode (active-passive mode). Concurrent access through multiple paths is not supported in this case: no round-robin, and no I/O load balancing, for example.

EMC PowerPath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage providers.
  3. Support for most UNIX and Unix-like platforms.
  4. Even without a valid license, still works in degraded mode (failover).
  5. Is not sensitive to SCSI LUN renumbering: it adapts the multiple sd devices (the different paths to a given device) in the multipath definition of the corresponding emcpower device accordingly.
  6. Easily provides the ID of the SAN storage.

Drawbacks

  1. Not integrated with the operating system (which generally has its own solution).
  2. An RPM re-installation must be forced after a kernel upgrade on RHEL systems, because the kernel modules are stored in a path containing the exact major and minor versions of the installed (booted) kernel.
  3. Non-automatic update procedure.

GNU/Linux device-mapper-multipath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage providers.
  3. Well integrated with the operating system.
  4. Automatic update using RHN (you must be a licensed and registered user in this case).
  5. No additional license cost.

Drawbacks

  1. Only available on GNU/Linux systems.
  2. Configuration (files and keywords) is very tedious and difficult.
  3. Without the use of LVM (Logical Volume Management), it is not able to follow SCSI LUN renumbering! Even with LVM, be sure not to have blacklisted the newly discovered SCSI devices (sd).
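
To illustrate the last drawback, here is a small sketch of how one could check whether a newly discovered sd device would be caught by a devnode blacklist. It does not read the real multipath configuration: the function name is_blacklisted and the three patterns are assumptions chosen for the example, written as the kind of extended regular expressions used in devnode blacklist entries.

```shell
#!/bin/sh
# is_blacklisted DEVICE PATTERN...
# Succeeds (returns 0) when DEVICE matches at least one of the given
# extended regular expressions, fails (returns 1) otherwise.
is_blacklisted() {
    candidate=$1
    shift
    for pattern in "$@"; do
        if printf '%s\n' "$candidate" | grep -Eq -- "$pattern"; then
            return 0
        fi
    done
    return 1
}

# Hypothetical devnode patterns; replace them with the ones actually
# present in the blacklist section of your own configuration.
set -- '^(ram|loop|fd|md|dm-|sr|scd|st)[0-9]*' '^hd[a-z]' '^sd[a-c]$'

for dev in sda sdo sdp; do
    if is_blacklisted "$dev" "$@"; then
        echo "$dev: blacklisted"
    else
        echo "$dev: eligible for multipath"
    fi
done
```

With these example patterns, only sda is reported as blacklisted, so the two paths sdo and sdp would remain visible to the multipath layer.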

Last, please find some interesting documentation on the subject below:

Saturday 9 February 2008

Deleting SCSI Device Paths For A Multipath SAN LUN

When releasing a multipath device under RHEL4, the different SCSI devices corresponding to the different paths must be cleared properly before the SAN LUN is actually removed. If the LUN was deleted before the paths were cleaned up at the OS level, it is still possible to remove them afterwards. In the following example, it is assumed that the LVM manipulations freeing the device were already done, and that the LUN was managed by EMC PowerPath.

  1. First, get and verify the SCSI devices corresponding to the multipath LUN:
    # grep "I/O error on device" /var/log/messages | tail -2
    Feb  4 00:20:47 beastie kernel: Buffer I/O error on device sdo, \
     logical block 12960479
    Feb  4 00:20:47 beastie kernel: Buffer I/O error on device sdp, \
     logical block 12960479
    # powermt display dev=sdo
    Bad dev value sdo, or not under Powerpath control.
    # powermt display dev=sdp
    Bad dev value sdp, or not under Powerpath control.
    
  2. Then, get the appropriate scsi#:channel#:id#:lun# information:
    # find /sys/devices -name "*block" -print | \
     xargs \ls -l | awk -F\/ '$NF ~ /sdo$/ || $NF ~ /sdp$/ \
     {print "HBA: "$7"\tscsi#:channel#:id#:lun#: "$9}'
    HBA: host0      scsi#:channel#:id#:lun#: 0:0:0:9
    HBA: host0      scsi#:channel#:id#:lun#: 0:0:1:9
    
  3. When the individual SCSI paths are known, remove them from the system:
    # echo 1 > /sys/bus/scsi/devices/0\:0\:0\:9/delete
    # echo 1 > /sys/bus/scsi/devices/0\:0\:1\:9/delete
    # dmesg | grep "Synchronizing SCSI cache"
    Synchronizing SCSI cache for disk sdp:
    Synchronizing SCSI cache for disk sdo:
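
Steps 2 and 3 can be condensed into a small helper. This is a sketch under one assumption: that block/DISK/device under the sysfs root is a symbolic link whose last path component is the host:channel:id:lun address. The function names scsi_address and delete_cmd are made up for this example. The helper only prints the removal command, and the demonstration runs against a throwaway directory rather than the real /sys.

```shell
#!/bin/sh
# scsi_address DISK [SYSFS_ROOT]
# Print the host:channel:id:lun address behind a disk such as "sdo".
# Assumption: SYSFS_ROOT/block/DISK/device is a symlink whose last
# path component is that address. SYSFS_ROOT defaults to /sys.
scsi_address() {
    target=$(readlink "${2:-/sys}/block/$1/device") || return 1
    basename "$target"
}

# delete_cmd DISK [SYSFS_ROOT]
# Print, without running it, the command that would remove the path.
delete_cmd() {
    addr=$(scsi_address "$1" "$2") || {
        echo "no SCSI device found for $1" >&2
        return 1
    }
    echo "echo 1 > /sys/bus/scsi/devices/$addr/delete"
}

# Demonstration on a temporary directory that mimics the sysfs layout.
root=$(mktemp -d)
mkdir -p "$root/block/sdo" "$root/devices/host0/target0:0:0/0:0:0:9"
ln -s ../../devices/host0/target0:0:0/0:0:0:9 "$root/block/sdo/device"
delete_cmd sdo "$root"
rm -rf "$root"
```

Printing the command instead of executing it leaves a chance to compare the address with the one reported in the logs before the path is really removed.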
    

Monday 9 July 2007

Installing a VIOS from the HMC Using a backupios Archive File

Once the corresponding partition has been defined on the managed system, log on to the HMC using an account having hmcsuperadmin authority. hscroot is such an account. Then, to install the VIOS partition using a previously generated backupios tar file, issue a command similar to the following:

$ installios \
   -s Server-9113-550-SN65E3R4F \
   -S uu.xx.yy.zz \
   -p vios01 \
   -r installation \
   -i vv.xx.yy.zz \
   -d nfssrv:/path/to/backupios/archive \
   -m 00:11:22:aa:bb:cc \
   -g ww.xx.yy.zz \
   -P 100 \
   -D full

Where:

  • -s: Managed system
  • -p: Partition name
  • -r: Partition profile
  • -d: Path to installation image(s) (/dev/cdrom or srv:/path/to/backup)
  • -i: Client IP address
  • -S: Client IP subnet mask
  • -g: Client gateway
  • -m: Client MAC address
  • -P: Port speed (optional, 100 is the default (10, 100, or 1000))
  • -D: Port duplex (optional, full is the default (full, or half))
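
Since -i, -S and -g all take dotted addresses, they are easy to mix up. One possible safeguard, sketched here, is to assemble the command line from named positions and to check the MAC address format before anything is sent to the HMC. The function names valid_mac and installios_cmd are hypothetical, only the flags listed above are used, and the addresses in the demonstration are documentation-range placeholders. Nothing is executed; the command is only printed.

```shell
#!/bin/sh
# valid_mac ADDRESS
# Succeeds when ADDRESS looks like six colon-separated hex pairs.
valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# installios_cmd SYSTEM PARTITION PROFILE IMAGE IP NETMASK GATEWAY MAC
# Print the installios command line built from the arguments,
# with the port speed and duplex left at the values used above.
installios_cmd() {
    valid_mac "$8" || { echo "bad MAC address: $8" >&2; return 1; }
    echo "installios -s $1 -p $2 -r $3 -d $4 -i $5 -S $6 -g $7 -m $8 -P 100 -D full"
}

# Placeholder values only.
installios_cmd Server-9113-550-SN65E3R4F vios01 installation \
    nfssrv:/path/to/backupios/archive \
    192.0.2.10 255.255.255.0 192.0.2.1 00:11:22:aa:bb:cc
```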

Note that the profile named installation is very similar to the profile named normal: it just doesn't include all the extra stuff necessary for our final pSeries configuration, i.e. SAN HBA, virtual LAN, etc. This is necessary to avoid installing on SAN disks, or trying to use a virtual Ethernet adapter, during the VIOS installation process. After rebooting on the freshly installed VIOS, connect to the console and go through the following steps:

  1. Clean up the content of the /etc/hosts file; in particular, be sure that the FQDN and short name of the NIM server are mentioned properly.
  2. Configure the IP address(es) on the physical interface(s), and the corresponding hostname, and don't forget that they will be modified later in order to create the SEA device!
  3. Recreate the mirror in order to use the first two disks (with exact mapping), and be sure to have two copies of the lg_dumplv logical volume (not really sure about this one, but it doesn't hurt anyway...).
  4. Update the content of the /etc/resolv.conf file.
  5. Be able to resolve hostnames using other network centralized mechanisms:
    # cat << EOF >> /etc/netsvc.conf
    hosts = local, nis, bind
    EOF
    
  6. Don't forget to erase the installation NIM configuration found under /etc/niminfo and set it as a new NIM client for the current NIM server (pif_name may be en5 instead of en0 if the SEA was already configured):
    # mv /etc/niminfo /etc/niminfo.orig
    # niminit -a name=vios01 \
     -a master=nim.example.com \
     -a pif_name=en0 \
     -a connect=nimsh
    
  7. Change the padmin account password.

Last, here are some welcome tuning configuration steps:

  • Update the VIOS installation software with the external bundle pack, if available.
  • Reboot the VIOS using the profile named normal (which includes all the targeted hardware definitions).
  • There are a few parameters to change on the fibre channel adapter and on the fscsi interface on top of it. The first one is dyntrk, which allows fabric reconfiguration without having to reboot the Virtual I/O Server. The second one is fc_err_recov, which prevents the Virtual I/O Server from retrying an operation on a disk if the disk becomes unavailable. We change it because the Virtual I/O Client will take care of accessing the disk using MPxIO and will thus redirect the I/O operations to the second Virtual I/O Server. The last parameter we change is num_cmd_elems, which controls the number of commands to queue to the physical adapter. A reboot is necessary for these changes to take effect:
    $ chdev -dev fscsi0 -attr dyntrk=yes -perm
    fscsi0 changed
    $ chdev -dev fscsi0 -attr fc_err_recov=fast_fail -perm
    fscsi0 changed
    $ chdev -dev fcs0 -attr num_cmd_elems=2048 -perm
    fcs0 changed
    
  • We can safely change the software transmit queue size and descriptor queue size with the following commands. Since the adapter is in use, we change the settings in the ODM only, and the new configuration will be used at the next reboot:
    $ chdev -dev ent0 -attr tx_que_sz=16384 -perm
    ent0 changed
    $ chdev -dev ent1 -attr tx_que_sz=16384 -perm
    ent1 changed
    $ chdev -dev ent0 -attr txdesc_que_sz=1024 -perm
    ent0 changed
    $ chdev -dev ent1 -attr txdesc_que_sz=1024 -perm
    ent1 changed
    
  • And be sure to force the speed and mode of the desired Ethernet interfaces:
    $ chdev -dev ent0 -attr media_speed=100_Full_Duplex -perm
    ent0 changed
    $ chdev -dev ent1 -attr media_speed=100_Full_Duplex -perm
    ent1 changed
    
  • Now, we need to create the Shared Ethernet Adapter to be able to access the external network and bind the virtual adapter to the real one:
    $ chdev -dev en0 -attr state=detach
    en0 changed
    $ chdev -dev en1 -attr state=detach
    en1 changed
    $ mkvdev -sea ent0 -vadapter ent3 -default ent3 -defaultid 1
    ent5 Available
    en5
    et5
    $ mkvdev -sea ent1 -vadapter ent4 -default ent4 -defaultid 3
    ent6 Available
    en6
    et6
    $ mktcpip -hostname vios01 \
       -inetaddr vv.xx.yy.zz \
       -interface en5 \
       -netmask uu.xx.yy.zz \
       -gateway ww.xx.yy.zz \
       -nsrvaddr tt.xx.yy.zz \
       -nsrvdomain example.com \
       -start
    
  • Don't forget to install the MPxIO driver provided by EMC on their FTP web site:
    # cd /mnt/EMC.Symmetrix
    # TERM=vt220 smitty installp
    # lslpp -al | grep 'EMC.Symmetrix' | sort -u
                                 5.2.0.3  COMMITTED  EMC Symmetrix Fibre Channel
      EMC.Symmetrix.aix.rte      5.2.0.3  COMMITTED  EMC Symmetrix AIX Support
      EMC.Symmetrix.fcp.MPIO.rte
    
  • Assuming that the clock is given by the default gateway network device, we can set and configure the NTP client this way:
    # ntpdate ww.xx.yy.zz
    # cp /etc/ntp.conf /etc/ntp.conf.orig
    # diff -c /etc/ntp.conf.orig /etc/ntp.conf
    *** /etc/ntp.conf.orig  Fri Sep 30 18:05:17 2005
    --- /etc/ntp.conf       Fri Sep 30 18:05:43 2005
    ***************
    *** 36,41 ****
      #
      #   Broadcast client, no authentication.
      #
    ! broadcastclient
      driftfile /etc/ntp.drift
      tracefile /etc/ntp.trace
    --- 36,42 ----
      #
      #   Broadcast client, no authentication.
      #
    ! #broadcastclient
    ! server ww.xx.yy.zz
      driftfile /etc/ntp.drift
      tracefile /etc/ntp.trace
    #
    # chrctcp -S -a xntpd
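
When a Virtual I/O Server has several fibre channel adapters, the three chdev commands shown for fscsi0 and fcs0 have to be repeated for each of them. The sketch below only prints the commands so they can be reviewed before being pasted into the padmin shell. The function name fc_tuning_cmds is made up, and the pairing of fscsiN with fcsN is an assumption to check against the actual device tree.

```shell
#!/bin/sh
# fc_tuning_cmds INDEX...
# Print the chdev commands for each fibre channel adapter index given,
# with the attribute values used in the steps above.
# Assumption: interface fscsiN sits on top of adapter fcsN.
fc_tuning_cmds() {
    for n in "$@"; do
        echo "chdev -dev fscsi$n -attr dyntrk=yes -perm"
        echo "chdev -dev fscsi$n -attr fc_err_recov=fast_fail -perm"
        echo "chdev -dev fcs$n -attr num_cmd_elems=2048 -perm"
    done
}

# Two adapters, as on a server where disks have fscsi0 and fscsi1 as parents.
fc_tuning_cmds 0 1
```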
    

Side note: This entry was originally contributed by Patrice Lachance, who first wrote about this subject.

Saturday 6 August 2005

Details About SAN Disks and MPxIO Capabilities on a VIOS

Obtaining this sort of specific information (such as MultiPath I/O status) from a Virtual I/O Server can very easily be achieved using the following (long) one-line shell script, with the help of the lsdev(1), lscfg(1) and lspath commands:

# for disk in `lsdev | grep hdisk | egrep  -v "SCSI Disk Drive|Raid1" | awk '{print $1}'`
> do
> lscfg -v -l ${disk} | egrep "${disk}|Manufacturer|Machine Type|ROS Level and ID|Serial Number|Part Number"
> echo "`lspath -H -l ${disk} | grep ${disk} | awk '{print\"\tMultiPath I/O (MPIO) status: \"$1\" on parent \"$3}'`"
> echo ""
> done

  hdisk3           U787B.001.DNW3897-P1-C3-T1-W5006048448930A41-L9000000000000  EMC Symmetrix FCP MPIO RaidS
        Manufacturer................EMC     
        Machine Type and Model......SYMMETRIX       
        ROS Level and ID............5670
        Serial Number...............9312A020
        Part Number.................000000000000510001000287
        MultiPath I/O (MPIO) status: Enabled on parent fscsi0
        MultiPath I/O (MPIO) status: Enabled on parent fscsi1

  hdisk4           U787B.001.DNW3897-P1-C3-T1-W5006048448930A41-LA000000000000  EMC Symmetrix FCP MPIO RaidS
        Manufacturer................EMC     
        Machine Type and Model......SYMMETRIX       
        ROS Level and ID............5670
        Serial Number...............9312E020
        Part Number.................000000000000510001000287
        MultiPath I/O (MPIO) status: Enabled on parent fscsi0
        MultiPath I/O (MPIO) status: Enabled on parent fscsi1
[...]

The pattern SCSI Disk Drive is excluded since it represents local SCSI disks, as is the pattern Raid1 because it is a view corresponding to parity disks (which are logical disks only used by SAN administrators).
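
The same path information can also be used to spot disks that have lost a path. The following sketch assumes input in the three-column "status name parent" layout that the script above relies on (status in the first field, parent in the third), given without the header line. The function name degraded_disks and the sample lines are made up for illustration.

```shell
#!/bin/sh
# degraded_disks [MIN_PATHS]
# Read "status name parent" lines on stdin and print every disk that
# has fewer than MIN_PATHS (default 2) paths in the Enabled state.
degraded_disks() {
    awk -v min="${1:-2}" '
        NF >= 3 { seen[$2] = 1; if ($1 == "Enabled") up[$2]++ }
        END {
            for (d in seen)
                if (up[d] + 0 < min)
                    printf "%s: %d enabled path(s)\n", d, up[d]
        }' | sort
}

# Sample input, invented for the example.
printf '%s\n' \
    'Enabled hdisk3 fscsi0' \
    'Enabled hdisk3 fscsi1' \
    'Enabled hdisk4 fscsi0' \
    'Failed  hdisk4 fscsi1' | degraded_disks 2
```

With this sample, only hdisk4 is reported, with a single enabled path. On a real server the function would read the output of lspath run without the -H flag, so that the header line is not counted as a disk.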