blog'o thnet


Monday 10 October 2011

Encrypted SWAP Device Just Disappeared In Solaris 11 EA

For some months, I have been encrypting the swap device (which is a ZFS volume), and thus have an encrypted /tmp. This worked fine with Solaris 11 Express, but I encountered a strange behavior in Solaris 11 EA which causes the swap device to... well, just disappear.

Here is what I found after two boots, on several machines:

# swap -l
No swap devices configured

# zfs list -t volume
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool/dump  32.8G   240G  31.8G  -

# grep swap /etc/vfstab
swap            -               /tmp            tmpfs   -       yes     -
/dev/zvol/dsk/rpool/swap        -               -               swap    -       no      encrypted

So, the rpool/swap dataset disappeared. I am sure I did not destroy it myself, particularly since this happened on multiple servers. Nevertheless, I found this in the pool history:

# zpool history | grep destroy
[...]
2011-10-05.10:22:49 zfs destroy rpool/swap

# last reboot | head -2
reboot    system boot                   Wed Oct  5 10:23
reboot    system down                   Wed Oct  5 10:20

So, this problem seems to be related to some action at boot time. What do the SMF service logs have to say about that?

# find /var/svc/log -print | xargs grep -i swap
/var/svc/log/system-filesystem-usr:default.log:cannot create 'rpool/swap': pool must be upgraded to set this property or value
/var/svc/log/system-filesystem-usr:default.log:cannot open 'rpool/swap': dataset does not exist
/var/svc/log/system-filesystem-usr:default.log:cannot create 'rpool/swap': pool must be upgraded to set this property or value
/var/svc/log/system-filesystem-usr:default.log:cannot open 'rpool/swap': dataset does not exist

# tail -3 /var/svc/log/system-filesystem-usr:default.log
[ Oct  5 12:00:05 Executing start method ("/lib/svc/method/fs-usr"). ]
cannot create 'rpool/swap': pool must be upgraded to set this property or value
[ Oct  5 12:00:13 Method "start" exited with status 0. ]

Ouch, what happened here? The message is interesting, but a little misleading: these are fresh Solaris 11 EA installations, so the pools and datasets are all up to date:

# zpool upgrade && zfs upgrade
This system is currently running ZFS pool version 33.
All pools are formatted using this version.
This system is currently running ZFS filesystem version 5.
All filesystems are formatted with the current version.

So, it seems that the rpool/swap device is re-created at boot time, and for some reason this doesn't work as expected. Here is an attempt to discover where the device is re-created and why it fails.

# find /lib/svc/method -print | xargs grep -i sbin/swapadd
/lib/svc/method/fs-usr:/usr/sbin/swapadd -1
/lib/svc/method/nfs-client:     /usr/sbin/swapadd
/lib/svc/method/fs-local:/usr/sbin/swapadd >/dev/null 2>&1

# grep "zfs destroy" /usr/sbin/swapadd
                zfs destroy $zvol > /dev/null 2>&1

# sed -n '/zfs create/,/\$zvol/p' /usr/sbin/swapadd
        zfs create -V $volsize -o volblocksize=`/usr/bin/pagesize` \
            -o primarycache=$primarycache -o secondarycache=$secondarycache \
            -o encryption=$encryption -o keysource=raw,file:///dev/random $zvol

So, the re-creation of rpool/swap at boot time happens only when using an encrypted volume. After a bit of digging, here is what I found. At the first boot, this is the command used to create the encrypted volume:

zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=on -o keysource=raw,file:///dev/random rpool/swap

But on the second boot, a slightly different command is used:

zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=aes-128-ctr -o keysource=raw,file:///dev/random rpool/swap

This is because the arguments passed to the command are saved from the volume's settings just before it is destroyed, then restored. As mentioned in the zfs(1M) manual page, only the following encryption algorithms are supported, so the one being set is not valid (which makes the error message, saying that the pool must be upgraded to set this property or value, a little clearer):

encryption=off | on | aes-128-ccm | aes-192-ccm | aes-256-ccm | aes-128-gcm | aes-192-gcm | aes-256-gcm
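
As a minimal sketch (this is not what swapadd does), a value can be checked against the list above before it is handed to zfs create. The function name is_supported_encryption is hypothetical, and the accepted values are simply copied from the manual page excerpt above.

```shell
# Hypothetical helper: succeed only if $1 is one of the encryption values
# listed in the zfs(1M) excerpt above.
is_supported_encryption() {
    case "$1" in
        off|on|aes-128-ccm|aes-192-ccm|aes-256-ccm|aes-128-gcm|aes-192-gcm|aes-256-gcm)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Check the two values seen in this post.
for value in on aes-128-ctr; do
    if is_supported_encryption "$value"; then
        echo "$value: accepted"
    else
        echo "$value: rejected"
    fi
done
```

With the two values from this post, the first is accepted and the second is rejected, which matches the failure seen on the second boot.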

The question is, how can this happen? Where does this algorithm come from? The answer is simple: it seems that the swap(1M) command itself alters some properties of the rpool/swap volume:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs destroy rpool/swap
# zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=on -o keysource=raw,file:///dev/random rpool/swap
# zfs list -H -o type,volsize,volblocksize,encryption rpool/swap
volume  4G      8K      on
# swap -1 -a /dev/zvol/dsk/rpool/swap
# zfs list -H -o type,volsize,volblocksize,encryption rpool/swap
volume  4G      1M      aes-128-ctr

Not only is the algorithm changed to something not supported (yet?), but the volblocksize property is touched as well. This was not the case on Solaris 11 Express 2010.11.

I hope someone can help me with this, and that this is a known bug which is already (or will quickly be) addressed, in particular for Solaris 11 GA. I have already posted a comment on Darren Moffat's blog, just in case this can help a bit.

Wednesday 3 August 2011

Interesting Use Case Of Solaris Swap Space

As you probably know, the Solaris operating system uses the (badly worded) term swap space to designate the virtual swap space of a UNIX process, which is to be distinguished from the physical swap space, i.e. the disk or file swap device.

The swap space allocation goes through three different stages. The first stage, reserved, represents the virtual swap space corresponding to the virtual size of all segments of a process which are reserved at creation time. The second stage, allocated, represents the physical (real) pages which are allocated (touched) in the virtual swap space. The last stage, swapped-out, represents the memory pages which are swapped out to the disk or file swap device.

Some operating systems, such as IBM AIX or the Linux distributions, do lazy memory allocation. This differs radically from Solaris, which tries to reserve virtual swap space, in order to back the memory, at request time rather than at the time the memory is actually needed. This means that the program can be informed synchronously of an out of swap space error. This is far safer for the data than lying to the running program (by assuming it will not use all the memory pages it initially reserved), which can then fail during normal execution.

Although this has several implications for Solaris, I will concentrate on one particular point in this post: the implementation of the disk swap device on a system which boots on ZFS. In this case, the disk swap device is a ZFS dataset whose type is volume, i.e. a logical volume exported as a raw or block device. ZFS datasets are generally thin provisioned, in that they do not have a hard limit set (they can all compete for the available pool space), and they do not have space reserved for them by default. For a volume, things are a little different, since a refreservation is set to the size of the volume (a little bit more for ZFS metadata, in fact). This behavior is mandatory because of the different consumers of a volume, be it used as a raw device, as a block device layered under another file system, or as a special device such as a dump or swap device. In all these cases, the refreservation is there to prevent unexpected behavior of these different consumers.

Back to our ZFS volume as a swap device, here is a typical configuration:

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

As expected, in order to have real backing storage for the physical swap device, and to honor the fact that Solaris does not do lazy memory allocation, a refreservation is set on the swap volume to guarantee that pages can be swapped out if paging occurs.

The problem arises when processes reserve a lot of memory, but only allocate a small portion of those memory pages. Why? Simply because the system needs to have lots of virtual swap space, which will never even be used, but which must be available for the system to operate properly. On large systems hosting large databases or Java workloads, this can be problematic, as the swap volume will consume lots of space in the ZFS root pool. The growing size of the root pool may have side effects such as: less space available for snapshots or other Boot Environments, larger backups of the operating system (recursive snapshots of the pool), or high consumption which can be a concern with small internal disks.

As stated in the manual page for zfs(1M):

Though not recommended, a "sparse volume" (also known as "thin provisioning") can be created by specifying the -s option to the zfs create -V command, or by changing the reservation after the volume has been created. A “sparse volume” is a volume where the reservation is less then the volume size. Consequently, writes to a sparse volume can fail with ENOSPC when the pool is low on space. For a sparse volume, changes to volsize are not reflected in the reservation.

So, as a test case only, and on a non-production system, I will completely remove the refreservation on the ZFS volume which represents the swap device, and see how the freed space returns to its parent dataset:

$ pfexec zfs set refreservation=none rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       5.94G  27.3G    76K  /rpool
rpool/swap    16K  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        none           local
rpool/swap  usedbyrefreservation  0              -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

Clearly, the ZFS volume corresponding to the swap device does not consume space anymore (since no memory page had been paged out to the swap device beforehand), and its used size is no longer artificially set to the volume size: the usedbyrefreservation property now shows that there is no refreservation anymore. Note that the available space of the parent dataset increased from 23.3GB to 27.3GB, while the used space decreased from 9.94GB to 5.94GB.

So, assuming there is plenty of free space in the parent dataset, the swap device will be able to grow up to its full size, 4GB. But if the pool runs low on space for some reason, writes to the swap device (now a sparse volume) will fail with a not enough space error, which will surely be badly handled by the system or by the processes which believed they had the space reserved from the start. Because of that, be sure to revert the configuration to the original settings:

$ pfexec zfs set refreservation=4G rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -
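
To double-check that a volume is no longer sparse after such a change, the definition quoted from the manual page (reservation less than the volume size) can be turned into a small test. This is only a sketch: the name is_sparse_volume is hypothetical, and I assume the two values are obtained separately as byte counts, for instance with zfs get -Hp, where a refreservation of none may appear as 0.

```shell
# Hypothetical helper: succeed if the volume is sparse, i.e. if its
# refreservation is smaller than its volsize.
#   $1 = volsize in bytes
#   $2 = refreservation in bytes, or the literal string "none"
is_sparse_volume() {
    volsize=$1
    refreservation=$2
    if [ "$refreservation" = "none" ]; then
        refreservation=0
    fi
    [ "$refreservation" -lt "$volsize" ]
}

# 4G volume (4294967296 bytes) without any refreservation.
if is_sparse_volume 4294967296 none; then
    echo "sparse volume"
else
    echo "fully reserved volume"
fi
```

Called with 4294967296 for both arguments, the same function reports a fully reserved volume, which is the state to return to before leaving the system alone.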

Please consult the official Oracle documentation on Managing Your ZFS Swap and Dump Devices for more information.

Monday 3 January 2011

Solaris 11 Express: Problem #2

In this series, I will report the bugs or problems I find when running the Oracle Solaris 11 Express distribution. I hope this will give these problem reports more visibility at Oracle, so that they are corrected before the release of Solaris 11 next year.

For some customers, I used to clone non-global zones from a template zone. In order to save some space, I generally used the ability to pass an existing ZFS snapshot as the source for the clone, which avoids creating a new snapshot each time a new clone is created.

It seems that this capability is not usable anymore on Solaris 11 Express at this time:

# zoneadm -z zone3 clone -s dpool/store/zone/zone1/ROOT/zbe@zone2_snap zone1
/usr/lib/brand/ipkg/clone: -s: unknown option

Nevertheless, this functionality is still described in the manual page:

brand-specific usage: clone {sourcezone}
usage:  clone [-m method] [-s ] [brand-specific args] zonename
        Clone the installation of another zone. The -m option can be used to specify 'copy' which forces a copy of the source zone. The -s option can be used to specify the name of a ZFS snapshot that was taken from a previous clone command. The snapshot will be used as the source instead of creating a new ZFS snapshot. All other arguments are passed to the brand clone function; see brands(5) for more information.

No luck here. Even if the space consideration may be minimized by the deduplication feature of ZFS in Solaris 11 Express, deduplication is not always appropriate or usable, on small servers for example.

FYI, this problem is covered by Bug ID 6383119. Note that you can add yourself to the interest list at the bottom of the bug report page.

Sunday 29 August 2010

Apropos Solaris

John Fowler (Oracle Executive Vice President for Server and Storage Systems) held an on-line webcast on August 10 on the strategy for hardware servers based on SPARC and x86, and the formalization of the upcoming release of Solaris 11 in 2011.

This post only aims to summarize the main points; the complete slides of the presentation are available on the Oracle web site.

  1. Message #1: SPARC is alive and will continue. Solaris is alive and will continue. Both actively.
  2. Message #2: What is interesting here is that these are not only intentions, but a real roadmap covering up to five years for the well-known ex-Sun products. Oracle clearly has some strong plans for Solaris and for the SPARC and x86 platforms, and has just begun to speak publicly about them. We will probably see more about all of them at Oracle OpenWorld in a few weeks.

The points are:

  • A roadmap for SPARC and Solaris up to 2015.
  • SPARC performance will double every two years:
    • Cores: 128 (32 in 2010).
    • Threads: 16384 (512 in 2010).
    • Memory capacity: 64TB (4TB in 2010).
    • Logical Domains: 256 (128 in 2010).
    • Java Ops per second: 50000 (5000 in 2010).
  • Very SPARC oriented: it seems that there will only be one SPARC brand at the end of 2015.
  • Two big families of SPARC servers: lots of threads known as the T-Series, lots of sockets known as M-Series.
  • At least one update to Solaris 10 around 2010Q3, a beta program of Solaris 11 known as Solaris 11 Express due in late 2010, then Solaris 11 due in 2011 and up to 2015.

Solaris 11 will be based on the now closed OpenSolaris distribution, and will include:

  • Image Packaging System (IPS): totally new packaging system fully integrated with ZFS and Boot Environment Administration (aimed at replacing Live Upgrade).
  • Crossbow network virtualization stack.
  • ZFS de-duplication, and lots of recent optimizations and functionalities.
  • CIFS file services: in-kernel implementation of CIFS.
  • Enhanced Gnome user environment.
  • Updated installer and auto network installer ("AI", aimed at replacing JumpStart)
  • Network Automagic configuration.
  • And many more (I heard Solaris 10 BrandZ...).

Thursday 25 December 2008

More News To Come About Shrinking A zpool

As a little update to an older post on this subject: although this post from Matthew Ahrens is about the new scrub code recently introduced in OpenSolaris build 94 (which was in fact a priority before the launch of the Sun Storage 7000 Unified Storage Systems, a.k.a. Amber Road), it is interesting to note that some of the new code will be usable to remove a disk from a ZFS pool.

As Matthew wrote:

This work lays a bunch of infrastructure that will be used by the upcoming device removal feature.

Saturday 18 October 2008

Discrepancies Between df And du Outputs

As an SA, it is not uncommon to regularly receive requests about big differences between the du and df outputs on a UFS file system. (For ZFS specific considerations, please see the ZFS FAQ.)

The du utility reports the sum of space allocated to all files in the file hierarchy rooted in the directory plus the space allocated to the directory itself. The df utility reports the amount of disk space occupied by a mounted file system.

When a file is removed from the file system, i.e. is unlinked (the hard link count goes to zero), the space belonging to this file is no longer counted by du, but it is still reported as used by df until all references to it (open file descriptors) are closed. In order to find the guilty process, one can follow the information found in the SunManagers Frequently Asked Questions. Here is an example of such a search, using a slightly different method to get the process currently holding the open descriptor to the deleted file.

Find the file which has been unlinked through the procfs interface:

# find /proc/*/fd \( -type f -a ! -size 0 -a -links 0 \) -print | xargs \ls -li
 415975 --w-------   0 user  group  2125803025 Oct 15 23:59 /proc/1252/fd/3

Then, get more details about the process holding it:

# pargs -c 1252
1252:   rvd.basic -reliability 5 -listen tcp:9876 -logfile /path/to/log/rvd_9876.l
argv[0]: rvd.basic
argv[1]: -reliability
argv[2]: 5
argv[3]: -listen
argv[4]: tcp:9876
argv[5]: -logfile
argv[6]: /path/to/log/rvd_9876.log

Check whether you can make sense of the content of the unlinked file:

# tail /proc/1252/fd/3
-------------------------------------------------------------------------------
2008-10-15 23:59:32.002116 - [MSG] BBG_Transmitter_class.cc, line 792 (thread 25087:4)
[4060] Sent a heartbeat
-------------------------------------------------------------------------------
BBG_Transmitter_class.cc: [4111] No activity detected. Send a Heartbeat message
-------------------------------------------------------------------------------
2008-10-15 23:59:32.134829 - [MSG] BBG_Transmitter_class.cc, line 1138 (thread 25087:4)
[4065] Heartbeat acknowledged by Bloomberg

You can correlate the size of the removed, but still referenced, file with the space reported by the du and df tools:

# df -k /path/to
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d5       6017990 5874592   83219    99%    /path/to
# du -sk /path/to
3791632 /data
# echo "(5874616-3791632)*1024" | bc
2132975616
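
The same arithmetic can be wrapped in a small shell function, as a sketch only. The name unaccounted_bytes is hypothetical; both arguments are kilobyte counts, as printed by df -k (used column) and du -sk.

```shell
# Hypothetical helper: bytes reported as used by df which cannot be reached
# through the directory tree (du).
#   $1 = used kilobytes from df -k
#   $2 = kilobytes from du -sk
unaccounted_bytes() {
    df_used_kb=$1
    du_kb=$2
    echo $(( (df_used_kb - du_kb) * 1024 ))
}

# Same figures as in the bc example above; prints 2132975616.
unaccounted_bytes 5874616 3791632
```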

So, we have now found the ~2GB log file which was still open (in use) by a process. There are two ways to get the space back:

  1. Truncate the unlinked file (quick workaround).
  2. Simply restart the corresponding program properly (better option).

Use the solution which best fits your needs in your environment.
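
For the first option, here is a minimal sketch, assuming the process can tolerate having its log emptied and that procfs lets you reopen the descriptor path found earlier for writing. The helper name truncate_open_file is hypothetical; it relies only on the shell's > redirection, which opens its target with truncation.

```shell
# Hypothetical helper: empty a file in place without removing it, so that its
# blocks are released while the owning process keeps its descriptor open.
#   $1 = path to the file (for an unlinked file, its /proc/<pid>/fd/<n> path)
truncate_open_file() {
    : > "$1"
}

# Example with the descriptor found above (as root, adapt the PID and fd):
#   truncate_open_file /proc/1252/fd/3
```

Note that if the process did not open the file in append mode, its next write will land at the old offset and the file will become sparse, so the reported size may come back even though the blocks stay free.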

Tuesday 16 September 2008

Announcing Solaris 10 10/08

Although this seems a little bit confident, the long-awaited Update 6 to the Solaris 10 operating system is just around the corner. This release (scheduled to be available in mid-October) will include virtualization enhancements, including the ability for a Solaris Container to automatically update its environment when moved from one system to another, Logical Domains support for dynamically reconfigurable disk and network I/O, and paravirtualization support when Solaris 10 is used as a guest OS in Xen-based environments such as Sun xVM Server. Solaris 10 10/08 also includes support for the latest systems from Sun and other vendors, such as those based on the Intel Xeon Processor 7400 Series.

This release will be the very first Solaris release able to boot from ZFS and use it as its root file system, as is already possible on OpenSolaris or Solaris Express Community Release today.

Check the What’s New web page for Solaris, and consult the Solaris Media Gallery videos for more information.

Update #1 (2008-10-14): Don't forget to consult the must read What's New in Solaris 10 10/08? from the San Antonio OpenSolaris User Group.

Update #2 (2008-10-31): Get yours, and go read the official What's New in the Solaris 10 10/08 Release page.

Friday 30 March 2007

ZFS Recent News

Well. Rather than a real blog entry, this post is about keeping up with some recent additions in the ZFS area. First, you can read the ZFS Overview and Guide just published on BigAdmin. Second, you must watch the excellent Thumper do it yourself, which is a very nice showcase of ZFS use. Third, a great list of recent additions to the latest SXCE builds is available on Robert Milkowski's blog.

Last, be sure to check Tim Foster's explanations about the recently announced ZFS Boot support in build 62, for the x86 platform. All the interesting links are included, and so is his script to set up a ZFS root automatically! (Since not all the bits are well integrated yet...)

Thursday 1 March 2007

Want to Shrink a zpool?

If so, be patient. In fact, it is a highly wanted feature, and the ZFS team is working hard on it right now. You can learn more about this feature by following the Shrinking a zpool? thread on the zfs-discuss forum on opensolaris.org. Here are some chosen excerpts.

From Matthew A. Ahrens #1:

Regardless of where you want or don't want to use shrink, we are actively working on this, targeting delivery in s10u5.

From Matthew A. Ahrens #2:

Yeah, the implementation is nontrivial. Of course, this won't have any impact on snapshots, clones, etc. and will happen on-line. Any other solution would be unacceptable.

Howdy... I really like this kind of short and concise answer!

Thursday 11 January 2007

NFS and ZFS plus ZIL Interesting Notes

I recently learned about an NFS on ZFS interaction problem while reading the great blog of Ben Rockwood. Although not directly related to what he encountered, this recent great post about how NFS behaves with a ZFS backend, particularly on the performance comparison front, says a lot about why you might see poor performance when using these two technologies together.

To go deeper on this front, you can read more about the ZIL purpose on Eric Kustarz's weblog, and follow closely this ZFS thread on the OpenSolaris website.
