blog'o thnet

Tag - system

Monday 10 October 2011

Encrypted SWAP Device Just Disappeared In Solaris 11 EA

For some months, I have been encrypting the swap device (which is a ZFS volume) and thus have had an encrypted /tmp. This worked fine with Solaris 11 Express, but I encountered a strange behavior in Solaris 11 EA which causes the swap device to... well, just disappear.

Here is what I found after two boots, on several machines:

# swap -l
No swap devices configured

# zfs list -t volume
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool/dump  32.8G   240G  31.8G  -

# grep swap /etc/vfstab
swap            -               /tmp            tmpfs   -       yes     -
/dev/zvol/dsk/rpool/swap        -               -               swap    -       no      encrypted
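
The vfstab check above can be turned into a small filter, which is handy when the same question has to be asked on several machines. This is only a sketch: the function name encrypted_swap_entries is made up for the illustration, and it assumes the usual seven-column vfstab layout with the mount options in the last column.

```shell
# Hypothetical helper (not shipped with Solaris): print the device of every
# vfstab swap entry whose mount options include "encrypted".
# Reads vfstab-formatted text on standard input.
encrypted_swap_entries() {
    awk '
        $1 ~ /^#/ { next }    # skip comment lines
        $4 == "swap" && $7 ~ /(^|,)encrypted(,|$)/ { print $1 }
    '
}

# Usage on a live system:
#   encrypted_swap_entries < /etc/vfstab
```

Any device it prints which is missing from the output of zfs list -t volume is a candidate for the problem described here.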

So, the rpool/swap dataset disappeared. I am sure I did not destroy it myself, in particular since this happened on multiple servers. Nevertheless, I found this in the history of the zpool command:

# zpool history | grep destroy
[...]
2011-10-05.10:22:49 zfs destroy rpool/swap

# last reboot | head -2
reboot    system boot                   Wed Oct  5 10:23
reboot    system down                   Wed Oct  5 10:20

So, this problem seems to be related to some action taken at boot time. What do the logs of the SMF services have to say about that?

# find /var/svc/log -print | xargs grep -i swap
/var/svc/log/system-filesystem-usr:default.log:cannot create 'rpool/swap': pool must be upgraded to set this property or value
/var/svc/log/system-filesystem-usr:default.log:cannot open 'rpool/swap': dataset does not exist
/var/svc/log/system-filesystem-usr:default.log:cannot create 'rpool/swap': pool must be upgraded to set this property or value
/var/svc/log/system-filesystem-usr:default.log:cannot open 'rpool/swap': dataset does not exist

# tail -3 /var/svc/log/system-filesystem-usr:default.log
[ Oct  5 12:00:05 Executing start method ("/lib/svc/method/fs-usr"). ]
cannot create 'rpool/swap': pool must be upgraded to set this property or value
[ Oct  5 12:00:13 Method "start" exited with status 0. ]

Ouch, what happened here? The message is interesting, but a little misleading: these are fresh Solaris 11 EA installations, so the pools and datasets are all up to date:

# zpool upgrade && zfs upgrade
This system is currently running ZFS pool version 33.
All pools are formatted using this version.
This system is currently running ZFS filesystem version 5.
All filesystems are formatted with the current version.

So, it seems that the rpool/swap device is re-created at boot time, and for some reason this doesn't work as expected. Here is an attempt to discover where the device is re-created and why it fails.

# find /lib/svc/method -print | xargs grep -i sbin/swapadd
/lib/svc/method/fs-usr:/usr/sbin/swapadd -1
/lib/svc/method/nfs-client:     /usr/sbin/swapadd
/lib/svc/method/fs-local:/usr/sbin/swapadd >/dev/null 2>&1

# grep "zfs destroy" /usr/sbin/swapadd
                zfs destroy $zvol > /dev/null 2>&1

# sed -n '/zfs create/,/\$zvol/p' /usr/sbin/swapadd
        zfs create -V $volsize -o volblocksize=`/usr/bin/pagesize` \
            -o primarycache=$primarycache -o secondarycache=$secondarycache \
            -o encryption=$encryption -o keysource=raw,file:///dev/random $zvol

So, the re-creation of rpool/swap at boot time happens only when using an encrypted volume. After a bit of digging, here is what I found. At the first boot, here is the command used to create the encrypted volume:

zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=on -o keysource=raw,file:///dev/random rpool/swap

But on a second boot, here is the slightly different command used this time:

zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=aes-128-ctr -o keysource=raw,file:///dev/random rpool/swap

This is because the arguments passed to the command are saved from the volume's settings just before it is destroyed, and then restored when it is re-created. As mentioned in the zfs(1m) manual page, only the following encryption algorithms are supported, so the one which gets set is not valid (the error message saying that the pool must be upgraded to set this property or value makes a little more sense now).

encryption=off | on | aes-128-ccm | aes-192-ccm | aes-256-ccm | aes-128-gcm | aes-192-gcm | aes-256-gcm
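
To make the mismatch explicit, the list above can be encoded as a tiny check. This is just a sketch: is_supported_encryption is a hypothetical name, and the accepted values are simply the ones quoted above from the manual page.

```shell
# Hypothetical helper: succeed only if the value is one of the encryption
# settings listed above from zfs(1m).
is_supported_encryption() {
    case "$1" in
        off|on|aes-128-ccm|aes-192-ccm|aes-256-ccm|aes-128-gcm|aes-192-gcm|aes-256-gcm)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Usage on a live system:
#   is_supported_encryption "$(zfs get -H -o value encryption rpool/swap)" \
#       || echo "unsupported encryption value on rpool/swap"
```

With this, aes-128-ccm is accepted while the aes-128-ctr value seen in the second zfs create command is rejected.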

The question is, how can this happen? Where does this algorithm come from? The answer is simple: it seems that it is the swap(1m) command which alters some properties of the rpool/swap volume:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs destroy rpool/swap
# zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=on -o keysource=raw,file:///dev/random rpool/swap
# zfs list -H -o type,volsize,volblocksize,encryption rpool/swap
volume  4G      8K      on
# swap -1 -a /dev/zvol/dsk/rpool/swap
# zfs list -H -o type,volsize,volblocksize,encryption rpool/swap
volume  4G      1M      aes-128-ctr

Not only is the algorithm changed to something not supported (yet?), but the volblocksize property is touched as well. This was not the case on Solaris 11 Express 2010.11.

I hope someone can help me with this, and that this is a known bug which is already (or will quickly be) addressed, in particular for Solaris 11 GA. I already posted a comment on Darren Moffat's blog, just in case it can help a bit.

Wednesday 5 October 2011

Quick Notes About Oracle Solaris 11 Early Adopter

The Oracle Solaris 11 Early Adopter release has been available for a few days now. This EA release provides access to the final (complete) functionality which will be delivered in Oracle Solaris 11 GA. Although I have only played with it for a few days, here are my very, very first notes about things I found interesting to mention, in no particular order.

  • I noticed that the Oracle Solaris 11/11 release (and not EA) was mentioned in one of the subsections of the provided draft for the documentation. Was this inadvertently forgotten... on purpose? ;)
  • The support for Flash Archives seems to have finally disappeared. I know about the Distribution Constructor argument, but a flar(1M) (like an mksysb(1) on AIX) definitely has a special place in the Solaris ecosystem (particularly for crash recovery scenarios).
  • The -x option has been removed from the vi(1) command (among others), and is now replaced by the use of the encrypt(1) command. I know a place where it will be missed: you know who you are :)
  • It seems that the network-boot-arguments command is now supported, so that the IP configuration can be set directly from the OBP in case a DHCP server is not an option for getting this information at installation time (as we can do on IBM AIX using the IPL configuration from the SMS menu).
  • Automated Installer is now able to install Zones along with the main system.
  • New utilities are provided to help migrate JumpStart configuration files to AI manifests (I have not used them yet, though).
  • RBAC things have changed a little. For example, the provided profiles are now defined in several files under the /etc/security/prof_attr.d directory instead of the single file (/etc/security/prof_attr) used before (even in Solaris 11 Express). Moreover, there is no Primary Administrator profile anymore, but a new System Administrator profile which lacks some security privileges the old profile had (it cannot read the /etc/shadow file, for example).
  • The useradd(1m) command has evolved nicely. This utility is now able to automatically create a dedicated ZFS dataset as the home directory (which is not a directory anymore :)) if the -d flag is given, to populate the /etc/auto_home file, and to enable the autofs service to serve the /home content automatically as needed.
  • Although the default shell is now bash(1) (why not the newly integrated ksh93(1)?), the default PATH seen in OpenSolaris releases and Solaris 11 Express, which used to put GNU tools in front of SYSV commands, has been reverted to a more classical and fully functional path: /usr/bin:/usr/sbin. At least ls -v is OK again by default. Nonetheless, the path /usr/gnu/bin is there for whoever is interested.
  • An interesting change is the move to retire some old and well-known configuration files. For better or for worse, /etc/nodename is dead in Solaris 11. It is replaced by a property of a new SMF service. So in order to change the nodename of a host, you must now do:
    # svccfg -s node setprop config/nodename = "mynewnodename"
    # svcadm refresh node
    
  • In the same vein, /etc/default/init is replaced by an SMF service too. The service is named system/environment:init, and the corresponding environment properties are environment/LANG, environment/LC_*, and environment/TZ.
  • If you want to configure the network manually, you have to disable NWAM by changing the active Network Configuration Profile (NCP), which enables the traditional configuration:
    # netadm enable -p ncp DefaultFixed
    
  • The old sys-unconfig(1m) command is now replaced by a more powerful sysconfig(1m) utility which can unconfigure or reconfigure a Solaris instance, and generate a configuration profile which can be used to configure a system or a Zone (so long, sysidcfg file).
  • Shares (NFS, SMB) are now supported inside a non-global zone.
  • The default networking mode is switched to exclusive-IP.
  • Similarly to what exists for SRM and privileges configuration settings with automatic Resource Pools, a VNIC can now be automatically instantiated when the Zone boots, and automatically removed when it shuts down.
  • A new mode for Zones, known as Read-Only, makes it possible to create instances which are more or less writable, i.e. some parts may not be changed (configuration, file systems, etc.).
  • The IPS packages are now automagically updated in each Zone using Boot Environments.
  • Last point in this quick entry: the default locale is now en_US.UTF-8, and not just the old C. Well, not a big deal, but I found some tools which issue warnings with this locale, such as expect for example.
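
For the system/environment:init item in the list above, here is how a change could look, following the same svccfg/svcadm pattern as for the nodename. This is a cautious sketch: print_init_env_cmds is a made-up helper, the astring property type is my assumption, and the function only prints the commands so that they can be reviewed before being run as root.

```shell
# Hypothetical helper: print (do not run) the commands that would set one
# property of the system/environment:init service, e.g. environment/TZ.
# Assumption: the environment/* properties are of type astring.
print_init_env_cmds() {
    printf 'svccfg -s system/environment:init setprop environment/%s = astring: "%s"\n' "$1" "$2"
    printf 'svcadm refresh system/environment:init\n'
}

# Usage: review the output first, then run the printed commands as root.
#   print_init_env_cmds TZ Europe/Paris
```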

So, I think that Solaris 11 is getting better, even from a Solaris 11 Express experience standpoint. Some choices are surprising, but the whole seems coherent and works as expected. Longer experience with real use cases will be necessary to judge this (very big) release, but I am mostly pleased with the direction taken by Solaris, and I am excited to put all of this new stuff in production!

Wednesday 3 August 2011

Interesting Use Case Of Solaris Swap Space

As you probably know, the Solaris operating system uses the (badly worded) term swap space to designate the virtual swap space of a UNIX process, as distinct from the physical swap space, which is the disk or file swap device.

The swap space allocation goes through three different stages. The first stage, reserved, represents the virtual swap space corresponding to the virtual size of all segments of a process, which are reserved at creation time. The second stage, allocated, represents the physical (real) pages which are allocated (touched) in the virtual swap space. The last stage, swapped-out, represents the memory pages which are swapped out to the disk or file swap device.

Some operating systems, such as IBM AIX or the Linux distros, do lazy memory allocation. This radically differs from Solaris, which tries to reserve virtual swap space, in order to assign memory, at request time rather than at the time it is needed. This means that the program can be informed synchronously of an out of swap space error. This is far safer for the data than lying to the running program (supposing it will not use all the memory pages it initially reserved), which can then fail during normal execution.

Although this has several implications for Solaris, I will concentrate on one particular point in this post: the implementation of the disk swap device on a system which boots from ZFS. In this case, the disk swap device is a ZFS dataset of type volume, i.e. a logical volume exported as a raw or block device. ZFS datasets are generally thin provisioned, in that they do not have a hard limit set (they all compete for the available pool space), and they do not have space reserved for them by default. For a volume, things are a little different since a refreservation is set to the size of the volume (a little bit more for ZFS metadata, in fact). This behavior is mandatory because of the different consumers of a volume, be it used as a raw device, as a block device layered under another file system, or as a special device such as a dump or swap device. In all these cases, the refreservation is there to prevent unexpected behavior of these different consumers.

Back to our ZFS volume as a swap device, here is a typical configuration:

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

As expected, in order to have real backing storage for the physical swap device and to honor the fact that Solaris does not do lazy memory allocation, a refreservation is set on the swap volume to ensure that swapping out remains possible when paging occurs.

The problem arises when processes reserve a lot of memory, but only allocate a small portion of those memory pages. Why? Simply because the system needs to have lots of virtual swap space, which will not even be used, but which must be available for the system to operate properly. On large systems hosting large databases or Java workloads this can be problematic, as the swap volume will consume lots of space in the ZFS Root Pool. The growing size of the Root Pool may have side effects such as: less space available for snapshots or other Boot Environments, a larger backup of the operating system (recursive snapshots of the pool), or a high consumption which can be a concern with small internal disks.

As stated in the manual page for zfs(1M):

Though not recommended, a "sparse volume" (also known as "thin provisioning") can be created by specifying the -s option to the zfs create -V command, or by changing the reservation after the volume has been created. A “sparse volume” is a volume where the reservation is less then the volume size. Consequently, writes to a sparse volume can fail with ENOSPC when the pool is low on space. For a sparse volume, changes to volsize are not reflected in the reservation.
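
The definition quoted above translates directly into a comparison of two properties. Here is a minimal sketch of it; is_sparse_volume is a hypothetical name, it compares the refreservation (the property manipulated in this post) with the volsize, and it expects plain byte counts such as those printed by zfs get -Hp, treating "none" as 0.

```shell
# Hypothetical helper: succeed if a volume is sparse, i.e. if its
# refreservation is smaller than its volsize.
# $1 = volsize in bytes, $2 = refreservation in bytes (or "none").
is_sparse_volume() {
    volsize=$1
    refres=$2
    if [ "$refres" = "none" ]; then
        refres=0
    fi
    [ "$refres" -lt "$volsize" ]
}

# Usage on a live system:
#   is_sparse_volume "$(zfs get -Hp -o value volsize rpool/swap)" \
#                    "$(zfs get -Hp -o value refreservation rpool/swap)" \
#       && echo "rpool/swap is a sparse volume"
```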

So, as a test case only, and on a non-production system, I will totally remove the refreservation on the ZFS volume which backs the swap device, and see how the freed space returns to its parent dataset:

$ pfexec zfs set refreservation=none rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       5.94G  27.3G    76K  /rpool
rpool/swap    16K  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        none           local
rpool/swap  usedbyrefreservation  0              -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

Clearly, the ZFS volume corresponding to the swap device does not consume space anymore (since no memory page had been paged out to the swap device beforehand) and its used size is no longer artificially set to the volume size: the usedbyrefreservation property now shows that there is no refreservation anymore. Note that the available space in the parent dataset increased from 23.3GB to 27.3GB, while the used space decreased from 9.94GB to 5.94GB.

So, assuming there is plenty of free space in the parent dataset, the swap device will be able to grow up to its size, 4GB. But if the pool runs low on space for some reason, writes to the swap device (now a sparse volume) will fail with a not enough space error, which will surely be badly handled by the system or by the processes which believed they had the space they initially reserved. Because of that, be sure to revert the configuration to the original settings:

$ pfexec zfs set refreservation=4G rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -

Please consult the official Oracle documentation on Managing Your ZFS Swap and Dump Devices for more information.

Sunday 17 April 2011

Replacing A Failed Drive In A SVM Configuration

Here is a rough step-by-step procedure for replacing a failed drive which is part of a Solaris Volume Manager mirror configuration (RAID-1), while the system is up and running.

Here is the kind of message reported by the operating system:

# grep md_mirror /var/adm/messages
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d17: /dev/dsk/c1t0d0s7 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d11: /dev/dsk/c1t0d0s1 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d5: open error on /dev/dsk/c1t0d0s5
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d1: open error on /dev/dsk/c1t0d0s1
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d10: /dev/dsk/c1t0d0s0 needs maintenance
/var/adm/messages:Mar 30 15:41:36 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d13: /dev/dsk/c1t0d0s3 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d100: open error on /dev/dsk/c1t0d0s0
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 104909 kern.warning] WARNING: md: d14: /dev/dsk/c1t0d0s4 needs maintenance
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d103: open error on /dev/dsk/c1t0d0s3
/var/adm/messages:Mar 30 15:41:37 beastie md_mirror: [ID 976326 kern.warning] WARNING: md d104: open error on /dev/dsk/c1t0d0s4

Figure out the SVM configuration layout:

# metastat -c
d78              p  300MB d7
d72              p 1018MB d7
d76              p  100MB d7
d75              p  500MB d7
d74              p   50GB d7
d73              p  200MB d7
d77              p  256MB d7
d71              p  250MB d7
    d7           m  119GB d17 (maint) d27
        d17      s  119GB c1t0d0s7 (maint)
        d27      s  119GB c1t1d0s7
d104             m  2.4GB d24 d14 (maint)
    d24          s  2.4GB c1t1d0s4
    d14          s  2.4GB c1t0d0s4 (maint)
d103             m  2.0GB d23 d13 (maint)
    d23          s  2.0GB c1t1d0s3
    d13          s  2.0GB c1t0d0s3 (maint)
d100             m  4.9GB d20 d10 (maint)
    d20          s  4.9GB c1t1d0s0
    d10          s  4.9GB c1t0d0s0 (maint)
d5               m  4.0GB d15 (unavail) d25
    d15          s  4.0GB c1t0d0s5 (-)
    d25          s  4.0GB c1t1d0s5
d1               m  3.9GB d11 (maint) d21
    d11          s  3.9GB c1t0d0s1 (maint)
    d21          s  3.9GB c1t1d0s1
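
The components needing attention can also be pulled out of that listing mechanically. The sketch below is only an illustration: suggest_metareplace is a made-up name, it assumes the metastat -c layout shown above (a mirror line of type m followed by its submirror lines of type s), and it treats the (maint) and (-) states as needing to be re-enabled. It only prints the metareplace -e commands that would apply once the disk has been swapped; nothing is executed.

```shell
# Hypothetical helper: read `metastat -c` output on standard input and print
# one candidate "metareplace -e <mirror> <slice>" line per failed component.
suggest_metareplace() {
    awk '
        # Remember the most recent mirror (type "m").
        $2 == "m" { mirror = $1; next }
        # A stripe/concat (type "s") under it, flagged as failed.
        $2 == "s" && ($5 == "(maint)" || $5 == "(-)") {
            print "metareplace -e " mirror " " $4
        }
    '
}

# Usage on a live system:
#   metastat -c | suggest_metareplace
```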

As we can see, the failed disk is reported as a drive which is not available anymore:

# echo | format
[...]
AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <drive not available>
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0
       1. c1t1d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]

Find more information about the failed drive, such as type of errors, serial number, WWNN of the drive, etc.:

# cfgadm -alv c1
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
c1                             connected    configured   unknown
unavailable  fc-private   n        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc
c1::500000e012a66aa1           connected    configured   unknown    FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012a66aa1
c1::500000e012ab94d1           connected    configured   failed     FUJITSU MAX3147FCSUN146G
unavailable  disk         y        /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:fc::500000e012ab94d1
# iostat -En
[...]
c1t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU  Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G021R2
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
c1t0d0           Soft Errors: 0 Hard Errors: 1 Transport Errors: 73
Vendor: FUJITSU  Product: MAX3147FCSUN146G Revision: 1103 Serial No: 0634G023LB
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
[...]
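
When many disks are attached, the faulty one can be spotted by filtering the error counters. A minimal sketch, assuming the iostat -En summary line layout shown above; disks_with_errors is a hypothetical name, and soft errors are deliberately ignored here.

```shell
# Hypothetical helper: read `iostat -En` output on standard input and print
# the devices whose Hard Errors or Transport Errors counter is non-zero.
disks_with_errors() {
    awk '
        # Summary line: <dev> Soft Errors: N Hard Errors: N Transport Errors: N
        $2 == "Soft" && $3 == "Errors:" {
            if ($7 > 0 || $10 > 0) print $1
        }
    '
}

# Usage on a live system:
#   iostat -En | disks_with_errors
```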

Well, first clear the metadb configuration by removing references to the bad disk drive:

# metadb
    flags           first blk       block count
  Wm  p  l          16            8192         /dev/dsk/c1t0d0s6
  W   p  l          8208          8192         /dev/dsk/c1t0d0s6
  W   p  l          16400         8192         /dev/dsk/c1t0d0s6
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6
# metadb -d c1t0d0s6
# metadb
    flags           first blk       block count
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6
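
The slice passed to metadb -d above can also be found mechanically. The following is a sketch under one assumption: as I understand the flag legend of metadb(1M), uppercase flags (such as W) denote an error status while lowercase ones are normal. The name replicas_with_errors is made up for the example.

```shell
# Hypothetical helper: read `metadb` output on standard input and print,
# once each, the slices holding at least one replica with an uppercase
# (error) flag.
replicas_with_errors() {
    awk '
        $NF ~ /^\/dev\// {
            # Everything before the last three fields is the flags column.
            flags = ""
            for (i = 1; i <= NF - 3; i++) flags = flags $i
            if (flags != tolower(flags) && !seen[$NF]++) print $NF
        }
    '
}

# Usage on a live system:
#   metadb | replicas_with_errors
```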

Since the disk is completely gone, the proper way to remove an FC drive didn't work as expected:

# luxadm remove_device /dev/rdsk/c1t0d0s2

 WARNING!!! Please ensure that no filesystems are mounted on these device(s).
 All data on these devices should have been backed up.

 Error: SCSI failure. - /dev/rdsk/c1t0d0s2.

So, let's go ahead and physically replace the failed drive. Here is the output of the hardware events on the system's console:

# dmesg
[...]
Mar 31 18:06:17 beastie picld[152]: [ID 222282 daemon.error] Fault detected: DISK0
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop OFFLINE
Mar 31 18:06:18 beastie qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop ONLINE
Mar 31 18:06:18 beastie fctl: [ID 517869 kern.warning] WARNING: fp(3)::fp_plogi_intr: fp 1 pd ef
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0 (ssd0):
Mar 31 18:06:19 beastie        Error for Command: write(10)               Error Level: Retryable
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Requested Block: 37369856                  Error Block: 37369856
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Vendor: FUJITSU                            Serial Number: 0634G021R2
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  Sense Key: Unit Attention
Mar 31 18:06:19 beastie scsi: [ID 107833 kern.notice]  ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Mar 31 18:06:37 beastie scsi: [ID 799468 kern.info] ssd144 at fp3: name w500000e0125c4531,0, bus address ef
Mar 31 18:06:37 beastie genunix: [ID 936769 kern.info] ssd144 is /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
Mar 31 18:06:37 beastie genunix: [ID 408114 kern.info] /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0 (ssd144) online
Mar 31 18:06:52 beastie scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012ab94d1,0 (ssd3):
Mar 31 18:06:52 beastie        drive offline
[...]
Mar 31 18:07:31 beastie picld[152]: [ID 691918 daemon.error] FSP_GEN_FAULT_LED has turned ON
Mar 31 18:07:43 beastie picld[152]: [ID 861866 daemon.error] Notice: DISK0 okay
Mar 31 18:07:44 beastie picld[152]: [ID 114988 daemon.error] FSP_GEN_FAULT_LED has turned OFF
[...]

If necessary (if not done automatically), recreate and, if needed, clean up the public interface in the /dev subtree, and verify that the new drive is properly managed by the operating system:

# devfsadm -Cv
[...]
# echo | format
AVAILABLE DISK SELECTIONS:
       0. c1t0d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0125c4531,0
       1. c1t1d0 
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e012a66aa1,0
[...]

So, now create a proper VTOC on the new disk, and put the metadb replicas back on it as before:

# prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
fmthard:  New volume table of contents now in place.
# metadb -a -c 3 c1t0d0s6
# metadb
    flags           first blk       block count
 a        u         16            8192         /dev/dsk/c1t0d0s6
 a        u         8208          8192         /dev/dsk/c1t0d0s6
 a        u         16400         8192         /dev/dsk/c1t0d0s6
 a    p  luo        16            8192         /dev/dsk/c1t1d0s6
 a    p  luo        8208          8192         /dev/dsk/c1t1d0s6
 a    p  luo        16400         8192         /dev/dsk/c1t1d0s6

Then, just put the new drive in place of the old one in the SVM configuration, and let the mirrors resynchronize automatically:

# metareplace -e d104 c1t0d0s4
d104: device c1t0d0s4 is replaced with c1t0d0s4
[...]
# metareplace -e d5 c1t0d0s5
d5: device c1t0d0s5 is replaced with c1t0d0s5
# metastat -c
d78              p  300MB d7
d72              p 1018MB d7
d76              p  100MB d7
d75              p  500MB d7
d74              p   50GB d7
d73              p  200MB d7
d77              p  256MB d7
d71              p  250MB d7
    d7           m  119GB d17 (resync-0%) d27
        d17      s  119GB c1t0d0s7 (resyncing)
        d27      s  119GB c1t1d0s7
d104             m  2.4GB d24 d14 (resync-41%)
    d24          s  2.4GB c1t1d0s4
    d14          s  2.4GB c1t0d0s4 (resyncing)
d103             m  2.0GB d23 d13 (resync-28%)
    d23          s  2.0GB c1t1d0s3
    d13          s  2.0GB c1t0d0s3 (resyncing)
d100             m  4.9GB d20 d10 (resync-10%)
    d20          s  4.9GB c1t1d0s0
    d10          s  4.9GB c1t0d0s0 (resyncing)
d5               m  4.0GB d15 (resync-6%) d25
    d15          s  4.0GB c1t0d0s5 (resyncing)
    d25          s  4.0GB c1t1d0s5
d1               m  3.9GB d11 (resync-13%) d21
    d11          s  3.9GB c1t0d0s1 (resyncing)
    d21          s  3.9GB c1t1d0s1

You are done.

Sunday 23 December 2007

Tuning Is Evil

Each month, I hear many coworkers or application management teams asking to put some system tunings in place, even on very recent operating system releases. All the time. Most of these settings come from the Internet: forum posts, articles related to a subsystem, or technical publications. Some of them come from third-party software providers or vendors. Very, very few settings are proposed or recommended by system administrators, or by people knowledgeable in the tuning area.

The problem is that, most of the time, these tunings relate to another release of the operating system, have not been updated to stay current with the best practices for a given OS release, or are simply not well understood and not applicable without (badly) affecting currently running environments. Moreover, existing tunings are carried over as-is to upgraded and freshly installed systems without further thought, and without checking whether they are still applicable (or obsolete) and what the new defaults are (if not dynamic). One of the most representative examples of this today is the new System V IPC facilities found in Solaris 10 GA and later, where some Oracle DBAs still ask the SA team for shared memory settings as found on Solaris 8 systems.

Although it is extracted from the Solaris Internals and Performance FAQ for ZFS, here is a great excerpt which we should all read carefully and try to keep in mind when modifying the default behavior of a system:

Tuning is evil and should not be done...in general.

First, consider that the default values are set by the people who know most things about the effects of the tuning. If a better value exists, it would be the default. While alternative values might help a given workload, it could quite possibly degrade some other aspects of performance. Maybe, catastrophically so.

Over time, tuning recommendations might become stale at best or might lead to performance degradations. Customers are leery of changing a tuning that is in place and the net effect is a worse product than what it could be. Moreover, tuning enabled on a given system might spread to other systems, where it might not be warranted at all.

Thursday 10 May 2007

Managing System and Process Core Dumps Generation

Core Dump Management on the Solaris OS and Using the Solaris coreadm utility to control core file generation are clearly two useful recent write-ups on how to enable and configure process and system core dumps on a Solaris system.

You will learn what the SIGSEGV and SIGBUS signals are, and the role they play in core generation. You will see how to easily alter the current process and system configuration files (coreadm.conf and dumpadm.conf, respectively) using the appropriate system commands (coreadm(1M) and dumpadm(1M), respectively). At the same time, you will learn some basics about how to extract and interpret the content of core files. Then, you will find some tips on how a system dump can be deliberately generated on both UltraSPARC and x86 platforms. Last, Matty will show us that you can even set the process core dump configuration to log to the syslog facility. Very nice!

Tuesday 8 May 2007

Best Practices in Installation Locations and Filesystem Hierarchies

As it is repeatedly discussed at work these days, I think it may be worth mentioning that there are official locations for installing bundled and unbundled packages, and well-defined paths with a recommended purpose, for most major players among UNIX and UNIX-like platforms, such as Solaris, GNU/Linux and FreeBSD.

For Solaris (and OpenSolaris derived distributions), you can consult the Recommended Installation Locations for Solaris-compatible Software Components document from the Architecture Process and Tools community, along with the filesystem(5) manual page online, or on an installed system.

For a GNU/Linux environment, you can read both the Linux Standard Base and the Filesystem Hierarchy Standard recommendations.

Although the BSDs are not as well known as GNU/Linux, there are some very good and lively projects. Moreover, the documentation available for each distribution is well written and frequently updated. For these systems, and most notably FreeBSD and HP-UX systems, you can read the hier(7) manual page online, or on a live system.

As a side note, it is interesting that Ian Murdock (founder of the famous Debian GNU/Linux distribution, who just joined Sun to head up operating system platform strategy) is also the chair of the Linux Standard Base, while Sun is a member of the Linux Foundation (which hosts the LSB project). Maybe we can expect more standardization between these two major players in the future?

Tuesday 20 February 2007

Bad Online Reports on Windows Vista

Generally speaking, I am not really a big supporter of the Windows operating system. And most of the time, I wait to see and try something before forming an opinion on it. But after seeing these three reports against recent Microsoft products (two on Windows Vista, one about Office 2007), I will certainly wait a little before even trying them, at least until the already planned (scheduled?) Service Pack 1 is out (at least for Windows Vista).

I will not say much about it right now. Go read the reports to get an idea of what frightened me in the first place, if you care:

Wednesday 10 January 2007

What is about... Virtualization

As we hear more and more about server or data center virtualization nowadays, I found these two articles to be very clear and interesting in explaining what is really involved behind this general term (virtualization).

The first one comes from developerWorks, whereas the second one is hosted on InformationWeek. Worth reading, really.

Monday 6 June 2005

Encapsulation of the System's Disk Using SVM

  1. c0t0d0s2 represents the first system disk (boot)
  2. c0t1d0s2 represents the second disk (mirror)

Duplicate the label's content from the boot disk to the mirror disk:

# prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

Create replicas of the metadevice state database:

# metadb -a -c3 -f c0t0d0s4 c0t1d0s4
# metadb

Option -f is needed because it is the first invocation/creation of metadb(1m).

Creation of metadevices:

# metainit -f d10 1 1 c0t0d0s0
# metainit -f d11 1 1 c0t0d0s1
# metainit -f d13 1 1 c0t0d0s3
# metainit -f d16 1 1 c0t0d0s6
#
# metainit d20 1 1 c0t1d0s0
# metainit d21 1 1 c0t1d0s1
# metainit d23 1 1 c0t1d0s3
# metainit d26 1 1 c0t1d0s6

Option -f is needed because the file systems on the slices from which we want to initialize the new metadevices are already mounted.

Create the first part of the mirror:

# metainit d0 -m d10
# metainit d1 -m d11
# metainit d3 -m d13
# metainit d6 -m d16
#
# cp /etc/vfstab /etc/vfstab.beforesvm
# metaroot d0

Don't forget to edit /etc/vfstab in order to reflect the other metadevices:

  • s@/dev/dsk/cXtYdZsN@/dev/md/dsk/dN@
  • s@/dev/rdsk/cXtYdZsN@/dev/md/rdsk/dN@

Install the boot block code on the alternate boot disk and set it in the OpenBoot Prom (OBP):

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0
# eeprom boot-device="disk disk1 net"   # Or just "disk disk1".

Reboot on the new metadevices (the operating system will now boot encapsulated):

# shutdown -y -g 0 -i 6

Attach the second part of the mirror:

# metattach d0 d20
# metattach d1 d21
# metattach d3 d23
# metattach d6 d26

Verify all:

# metastat -p
# metastat | grep \%

Modify the system dump configuration:

# mkdir /var/crash/`hostname`
# chmod 700 /var/crash/`hostname`
# dumpadm -s /var/crash/`hostname`
# dumpadm -d /dev/md/dsk/d1