blog'o thnet

Monday 22 September 2008

About GNU/Linux Software Mirroring And LVM

Here, the final aim was to provide data access redundancy through SAN storage hosted on remote sites across Wide Area Network (WAN) links. After some relatively long and painful attempts to mimic, with Logical Volume Management (LVM), the software mirroring found on the HP-UX platform, i.e. mirroring at the logical volume level, I finally gave up, deciding that this functionality will definitely not fit my needs. Why? Here are my comments.

  1. It is not possible to provide clear and manageable storage multipathing when distinguishing between the multiple sites is important, à la the mirroring across controllers found in Veritas VxVM on Sun Solaris systems, for example. So, managing many physical volumes along with lots of logical volumes is very complicated.
  2. There is no capability to map logical volume storage exactly onto a given physical volume.
  3. A disk-based log, i.e. a persistent log, is needed. Yes, one can always pass the --corelog option when the logical volume is created and have an in-memory log, i.e. a non-persistent log, but this requires all the copies (mirrors) to be resynchronized upon reboot. Not really viable in multi-TB environments.
  4. A write-intensive workload on a file system living on a mirrored logical volume will suffer high latency: the overhead is significant, and the time needed for mostly-write jobs grows dramatically. It is also really hard to get high-level statistics; only low-level metrics seem consistent: the sd SCSI devices and the dm- device-mapper components for each path entry. Nothing is available from the multipath device standpoint, which is the most interesting one from the end user and SA points of view.
  5. You can't extend a mirrored logical volume, which is really a no-go per se. On that point, Red Hat support answered that this functionality may be added in a future release; for now, it could become a Request For Enhancement (RFE) if a proper business justification is provided. One must break the logical volume mirror copy, then rebuild it completely. Not realistic when the logical volume is made of a lot of physical extents across multiple physical volumes.
  6. An LVM configuration can block itself completely and become unusable. The fact is, LVM uses persistent storage blocks to keep track of its own metadata. The metadata size is set at physical volume creation time only, and can't be changed afterward. This size is statically defined as 255 physical volume blocks by default, and can be adjusted in the LVM configuration file. The problem is, when this circular buffer space (stored in ASCII) fills up, such as when there are a lot of logical volumes in a mirrored environment, it is not possible to do anything more with LVM. So you can't add more logical volumes, can't add more logical volume copies... and can't delete them to try to reestablish a proper LVM configuration. Well, here are the answers given by Red Hat support to two key questions in this situation:
    • How to size the metadata, i.e. if we need to change it from the default value, how can we determine the new proper and appropriate size, and from which information?
      I am afraid but Metadata size can only be defined at the time of PV creation and there is no real formula for calculating the size in advance. By changing the default value of 255 you can get a higher space value. For general LVM setup (with less LV's and VG's) default size works fine however in cases where high number of LV's are required a custom value will be required.
    • We just want to delete all LV copies, which means to return to the initial situation and have 0 copy for all LV, i.e. only one LV per-se, in order to be able to change LVM configuration again (we can't do anything on our production server right now)?
      I discussed this situation with peer engineers and also referenced a similar previous case. From the notes of the same the workaround is to use the backup file (/etc/lvm/backup) and restore the PV's. I agree that this really not a production environment method however seems the only workaround.
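
To make points 3 and 6 more concrete, here is a minimal sketch of the commands involved. It assumes an LVM2 release that still accepts --corelog. The device names (/dev/mapper/mpath0, /dev/mapper/mpath1), the volume group vg_data, the logical volume lv_ora and the sizes are all made up for illustration. The run helper is hypothetical too: it only prints each command unless DRY_RUN=0 is set.

```shell
# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical names and sizes throughout; adapt before any real use.
mirror_sketch() {
    # The metadata area can only be sized when the PV is created.
    run pvcreate --metadatasize 1m /dev/mapper/mpath0 /dev/mapper/mpath1
    run vgcreate vg_data /dev/mapper/mpath0 /dev/mapper/mpath1
    # Mirrored LV with an in-memory log: no log device is needed,
    # but a full resynchronization happens after each reboot.
    run lvcreate -m 1 --corelog -L 10G -n lv_ora vg_data
    # Drop the mirror copy again, back to a single-copy LV.
    run lvconvert -m 0 vg_data/lv_ora
}

mirror_sketch
```

With the default DRY_RUN setting this only lists the four commands, which is a cheap way to review them before touching a production volume group.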

So, the production Oracle RDBMS server is finally being evacuated to another machine. Hum... I hope to see a better enterprise experience using the mdadm package to handle software RAID, instead of LVM mirroring (RAID-1). Maybe more about that in another blog entry?

Friday 15 February 2008

LVM2 Simple Mirroring On RHEL4

When the need arose to evacuate all persistent SAN storage from an EMC DMX1K to an HP XP12K (HDS), three main solutions were envisaged. The first one was a brute-force data copy (tar, cpio, etc.), but it was not very practical given the size of the data (multiple terabytes) and the time involved in copying it. The two others were based on LVM technologies: mirroring, or moving.

Although the choice was made to use the online and transparent data-moving technology (see pvmove for more information), it is interesting to note that Red Hat has backported support for the creation and manipulation of simple mirrors to its RHEL4 distribution. These functionalities were introduced with the RHBA-2006:0504-15 advisory issued on 2006-08-10, i.e. between RHEL4 Update 4 and RHEL4 Update 5 (and so available via RHN at this time). It is just too bad that the online help of the LVM commands is neither properly synchronized with nor fully documented by the corresponding manual pages: clearly, this doesn't help to use them in the best conditions (no, Google isn't always the best option when using these kinds of functionalities in big companies).
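
For reference, the moving approach usually boils down to four LVM commands. The sketch below is not the exact procedure used for this migration. The helper name pvmove_plan and the volume group and device names are hypothetical, and the function only prints the plan instead of running it.

```shell
# Hypothetical helper: pvmove_plan <vg> <old_pv> <new_pv>
# Prints, in order: initialise the new LUN, add it to the volume group,
# move the allocated extents online, then take the old LUN out of the group.
pvmove_plan() {
    vg=$1
    old=$2
    new=$3
    cat <<EOF
pvcreate $new
vgextend $vg $new
pvmove $old $new
vgreduce $vg $old
EOF
}

# Example with made-up names.
pvmove_plan vg_data /dev/mapper/old_lun /dev/mapper/new_lun
```

Since pvmove works on a mounted, active volume group, the file systems stay available during the copy, which is what made this option attractive compared to tar or cpio.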

Thursday 21 June 2007

Altering LVM Configuration When a Disk is Not in ODM Anymore

If you remove a disk from the system using rmdev -dl hdiskX without having previously reduced the volume group to remove the disk from LVM, and thus without having properly updated the on-disk format information (called VGDA), you get a discrepancy between the ODM and the LVM configurations. Here is how to solve the issue (without any warranty though!).

Here is the volume group information:

# lsvg -p rootvg                
rootvg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk0            active            2157        1019        174..00..00..413..432
0516-304 : Unable to find device id 00ce4b6a01292201 in the Device
       Configuration Database.
00ce4b6a01292201  missing           2157        1019        174..71..00..342..432
# lspv
hdisk0          00ce4b6ade6da849                    rootvg          active
hdisk2          00ce4b6a01b09b83                    drakevg         active
hdisk3          00ce4b6afd175206                    drakevg         active
# lsdev -Cc disk
hdisk0 Available  Virtual SCSI Disk Drive
hdisk2 Available  Virtual SCSI Disk Drive
hdisk3 Available  Virtual SCSI Disk Drive
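
The PVID of the missing disk is needed later on, so it can be handy to extract it from the lsvg output rather than copy it by hand. This is a minimal sketch that assumes the column layout shown above. The function name missing_pvs is made up, and the function only filters text.

```shell
# Hypothetical helper: reads `lsvg -p <vg>` output on standard input and
# prints the first column of every line whose second column is "missing".
missing_pvs() {
    awk '$2 == "missing" { print $1 }'
}

# Sample input copied from the listing above. On a real system one would run:
#   lsvg -p rootvg 2>&1 | missing_pvs
sample='rootvg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk0            active            2157        1019        174..00..00..413..432
0516-304 : Unable to find device id 00ce4b6a01292201 in the Device
       Configuration Database.
00ce4b6a01292201  missing           2157        1019        174..71..00..342..432'

printf '%s\n' "$sample" | missing_pvs
```
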

As we can see, the disk is still in the LVM configuration but doesn't show up in the devices. To solve this issue, we need to cheat the ODM in order to be able to use LVM commands to change the LVM configuration, which is stored on the volume group disks. The idea is to reinsert a disk in the ODM configuration, remove the disk from LVM and then remove it from the ODM. Here is how we do it. First, let's make a copy of the ODM files that we will change:

# cd /etc/objrepos/
# cp CuAt CuAt.before_cheat
# cp CuDv CuDv.before_cheat
# cp CuPath CuPath.before_cheat
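
Should the manipulation go wrong, those copies could be put back in place. The following is only a sketch of such a rollback, assuming nothing else has modified the ODM in the meantime. The function name restore_odm_copies is hypothetical.

```shell
# Hypothetical helper: copy every "<class>.before_cheat" file found in the
# directory $1 (default: /etc/objrepos) back over "<class>".
restore_odm_copies() {
    dir=${1:-/etc/objrepos}
    for saved in "$dir"/*.before_cheat; do
        [ -f "$saved" ] || continue        # no backup present: nothing to do
        cp -p "$saved" "${saved%.before_cheat}" || return 1
    done
}

# Usage, as root and only if the cheat has to be undone:
#   restore_odm_copies /etc/objrepos
```
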

Now, we will extract hdisk0's definition from the ODM and add it back as hdisk1's definition:

# odmget -q "name=hdisk0" CuAt
CuAt:
       name = "hdisk0"
       attribute = "unique_id"
       value = "3520200946033223609SYMMETRIX03EMCfcp05VDASD03AIXvscsi"
       type = "R"
       generic = ""
       rep = "n"
       nls_index = 0
CuAt:
       name = "hdisk0"
       attribute = "pvid"
       value = "00ce4b6ade6da8490000000000000000"
       type = "R"
       generic = "D"
       rep = "s"
       nls_index = 11
# odmget -q "name=hdisk0" CuDv
CuDv:
       name = "hdisk0"
       status = 1
       chgstatus = 2
       ddins = "scsidisk"
       location = ""
       parent = "vscsi0"
       connwhere = "810000000000"
       PdDvLn = "disk/vscsi/vdisk"
# odmget -q "name=hdisk0" CuPath
CuPath:
       name = "hdisk0"
       parent = "vscsi0"
       connection = "810000000000"
       alias = ""
       path_status = 1
       path_id = 0

Basically, we need to insert new entries in the three classes CuAt, CuDv and CuPath with hdisk0 changed to hdisk1. A few other attributes need to be changed. The most important one is the PVID, located in CuAt. We will use the value reported as missing by lsvg -p rootvg. The unique_id attribute also needs to be changed. You can just change a few characters in the existing string; it just needs to be unique on the system. The other attributes to change are connwhere in CuDv and connection in CuPath. Their value represents the LUN ID of the disk. Again, the exact value is not relevant; it just has to be unique. We can check the LUNs currently defined by running lscfg on all the defined disks:

# lscfg -vl hdisk*
 hdisk0           U9117.570.65E4B6A-V6-C2-T1-L810000000000  Virtual SCSI Disk Drive
 hdisk2           U9117.570.65E4B6A-V6-C3-T1-L810000000000  Virtual SCSI Disk Drive
 hdisk3           U9117.570.65E4B6A-V6-C3-T1-L820000000000  Virtual SCSI Disk Drive
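
The LUN IDs already in use can also be pulled out of the location codes with a little text filtering. This sketch assumes the location code format shown above (…-Cn-Tn-Lxxxxxxxxxxxx). The helper names used_luns and lun_is_free are made up.

```shell
# Hypothetical helper: reads `lscfg -vl hdisk*` style lines on standard input
# and prints unique "controller LUN" pairs taken from the location codes.
used_luns() {
    sed -n 's/.*-\(C[0-9][0-9]*\)-T[0-9][0-9]*-L\([0-9a-fA-F][0-9a-fA-F]*\).*/\1 \2/p' | sort -u
}

# Succeeds only if the candidate LUN $1 is used on no controller at all.
lun_is_free() {
    ! used_luns | awk '{ print $2 }' | grep -qx "$1"
}

# Sample input copied from the listing above.
sample=' hdisk0           U9117.570.65E4B6A-V6-C2-T1-L810000000000  Virtual SCSI Disk Drive
 hdisk2           U9117.570.65E4B6A-V6-C3-T1-L810000000000  Virtual SCSI Disk Drive
 hdisk3           U9117.570.65E4B6A-V6-C3-T1-L820000000000  Virtual SCSI Disk Drive'

printf '%s\n' "$sample" | used_luns
printf '%s\n' "$sample" | lun_is_free 850000000000 && echo "850000000000 is free"
```
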

LUN 81 is used on controller C2 and LUNs 81 and 82 on C3. Let's choose 85, which for sure will not collide with other devices. The following commands will generate the text files that will be used to cheat the ODM, according to what was just explained:

# mkdir /tmp/cheat
# cd /tmp/cheat
# odmget -q "name=hdisk0" CuAt | sed -e 's/hdisk0/hdisk1/g' \
   -e 's/00ce4b6ade6da849/00ce4b6a01292201/' \
   -e 's/609SYMMETRIX/719SYMMETRIX/' > hdisk1.CuAt
# odmget -q "name=hdisk0" CuDv | sed -e 's/hdisk0/hdisk1/' \
   -e 's/810000000000/850000000000/' > hdisk1.CuDv
# odmget -q "name=hdisk0" CuPath | sed -e 's/hdisk0/hdisk1/' \
   -e 's/810000000000/850000000000/' > hdisk1.CuPath

Let's look at the generated files:

# cat hdisk1.CuAt
CuAt:
       name = "hdisk1"
       attribute = "unique_id"
       value = "3520200946033223719SYMMETRIX03EMCfcp05VDASD03AIXvscsi"
       type = "R"
       generic = ""
       rep = "n"
       nls_index = 0
CuAt:
       name = "hdisk1"
       attribute = "pvid"
       value = "00ce4b6a012922010000000000000000"
       type = "R"
       generic = "D"
       rep = "s"
       nls_index = 11
# cat hdisk1.CuDv
CuDv:
       name = "hdisk1"
       status = 1
       chgstatus = 2
       ddins = "scsidisk"
       location = ""
       parent = "vscsi0"
       connwhere = "850000000000"
       PdDvLn = "disk/vscsi/vdisk"
# cat hdisk1.CuPath
CuPath:
       name = "hdisk1"
       parent = "vscsi0"
       connection = "850000000000"
       alias = ""
       path_status = 1
       path_id = 0
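
Before feeding these files to odmadd, a quick sanity check does not hurt. The sketch below is a hypothetical check, not an official ODM tool. It assumes the stanza layout shown above and verifies that the file exists and is not empty, that every name line mentions the expected device, and that any pvid value is made of 32 hexadecimal digits.

```shell
# Hypothetical helper: check_stanza <file> <device>
check_stanza() {
    file=$1
    dev=$2
    [ -s "$file" ] || return 1             # missing or empty file
    # Any "name =" line that does not mention the expected device is an error.
    if grep 'name = ' "$file" | grep -v "\"$dev\"" >/dev/null; then
        return 1
    fi
    # Look at the value line that follows an attribute = "pvid" line, if any.
    awk '
        /attribute = "pvid"/ { want = 1; next }
        want && /value = / {
            want = 0
            split($0, q, "\"")
            if (length(q[2]) != 32 || q[2] !~ /^[0-9a-f]+$/) bad = 1
        }
        END { exit bad + 0 }
    ' "$file"
}

# Usage from /tmp/cheat:
#   for c in CuAt CuDv CuPath; do
#       check_stanza "hdisk1.$c" hdisk1 || echo "hdisk1.$c looks wrong"
#   done
```
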

So, we are ready to insert the data in the ODM:

# odmadd hdisk1.CuAt
# odmadd hdisk1.CuDv
# odmadd hdisk1.CuPath
# lsvg -p rootvg
rootvg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk0            active            2157        1019        174..00..00..413..432
hdisk1            missing           2157        1019        174..71..00..342..432

The disk is now back in ODM! Now, to remove the disk from the VGDA, we use the reducevg command:

# reducevg rootvg hdisk1
0516-016 ldeletepv: Cannot delete physical volume with allocated
       partitions. Use either migratepv to move the partitions or
       reducevg with the -d option to delete the partitions.
0516-884 reducevg: Unable to remove physical volume hdisk1.

We will use the -d flag to remove the physical partitions associated with each logical volume and located on hdisk1. A few lines have been removed to simplify the listing...

# reducevg -d rootvg hdisk1
0516-914 rmlv: Warning, all data belonging to logical volume
       lv01 on physical volume hdisk1 will be destroyed.
rmlv: Do you wish to continue? y(es) n(o)?
y
0516-304 putlvodm: Unable to find device id 00ce4b6a012922010000000000000000 in the
       Device Configuration Database.
0516-896 reducevg: Warning, cannot remove physical volume hdisk1 from
       Device Configuration Database.
# lsvg -l rootvg
rootvg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
hd5                 boot       2     2     1    closed/syncd  N/A
hd6                 paging     256   256   1    open/syncd    N/A
hd8                 jfs2log    1     1     1    open/syncd    N/A
hd4                 jfs2       7     7     1    open/syncd    /
hd2                 jfs2       384   384   1    open/syncd    /usr
hd9var              jfs2       64    64    1    open/syncd    /var
hd3                 jfs2       128   128   1    open/syncd    /tmp
hd1                 jfs2       2     2     1    open/syncd    /home
hd10opt             jfs2       32    32    1    open/syncd    /opt
fslv04              jfs2       256   256   1    open/syncd    /usr/sys/inst.images
loglv01             jfslog     1     1     1    closed/syncd  N/A
lv01                jfs        5     5     1    closed/syncd  /mkcd/cd_images
# lsvg -p rootvg
rootvg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk0            active            2157        1019        174..00..00..413..432

The disk has been deleted from the VGDA. What about ODM?

# lsdev -Cc disk
hdisk0 Available  Virtual SCSI Disk Drive
hdisk1 Available  Virtual SCSI Disk Drive
hdisk2 Available  Virtual SCSI Disk Drive
hdisk3 Available  Virtual SCSI Disk Drive
# rmdev -dl hdisk1
Method error (/etc/methods/ucfgdevice):
       0514-043 Error getting or assigning a minor number.

We probably forgot to cheat one ODM class... Never mind: let's remove the cheat we added to the ODM and see what happens:

# odmdelete -o CuAt -q "name=hdisk1"
2 objects deleted
# lspv
hdisk0          00ce4b6ade6da849                    rootvg          active
hdisk2          00ce4b6a01b09b83                    drakevg         active
hdisk1          none                                None            
hdisk3          00ce4b6afd175206                    drakevg         active
# rmdev -dl hdisk1
Method error (/etc/methods/ucfgdevice):
       0514-043 Error getting or assigning a minor number.
# odmdelete -o CuDv -q "name=hdisk1"
1 objects deleted
# lspv
hdisk0          00ce4b6ade6da849                    rootvg          active
hdisk2          00ce4b6a01b09b83                    drakevg         active
hdisk3          00ce4b6afd175206                    drakevg         active
# lspath
Enabled hdisk0 vscsi0
Enabled hdisk2 vscsi0
Enabled hdisk2 vscsi1
Enabled hdisk3 vscsi1
Enabled hdisk3 vscsi0
Unknown hdisk1 vscsi0
# odmdelete -o CuPath -q "name=hdisk1"
1 objects deleted
# lspath
Enabled hdisk0 vscsi0
Enabled hdisk2 vscsi0
Enabled hdisk2 vscsi1
Enabled hdisk3 vscsi1
Enabled hdisk3 vscsi0

That's it! Use with care.

Side note: This entry was originally contributed by Patrice Lachance, who first wrote about this subject.

Wednesday 12 April 2006

Stability Problem and New RAID-1 System

Just after the big update and new infrastructure installation during the past month, the server encountered stability problems on a regular basis (sometimes crashing more than once a day). At first, we thought it might be caused by the new processor architecture, whose FreeBSD port is less mature than the well-known i386 one.

But the problem appears to be specifically related to I/O, especially when the input/output is very intensive, e.g. file system dump(8), file system snapshot mksnap_ffs(8), or big tar(1) or cpio(1L) archive transfers.

After days, we had not succeeded in clearly isolating the source of the problem, but decided to quickly put the system under RAID-1 management (disk mirroring) and configure the excellent gmirror(8) to do the job. So, I switched from an old hardware RAID mechanism to a purely software-based one. Hope we will be able to find something new without disturbing the ThNET services anymore.

More on the manipulations involved in putting the system under gmirror(8) control in a later post. Stay tuned.