As you probably know, the Solaris operating system uses the (somewhat confusingly named) term swap space to designate the virtual swap space of a UNIX process, as opposed to the physical swap space, which is the disk or file swap device.

The swap space allocation goes through three different stages. The first stage, reserved, represents the virtual swap space corresponding to the virtual size of all segments of a process, reserved at creation time. The second stage, allocated, represents the physical (real) pages which are allocated (touched) in the virtual swap space. The last stage, swapped-out, represents the memory pages which have been swapped out to the disk or file swap device.
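The distinction between the first two stages can be observed with the summary output of swap(1M), which reports reserved and allocated virtual swap separately (a sketch; the figures below are purely illustrative and will differ on your system):

```shell
# "allocated" counts virtual swap backed by touched pages;
# "reserved" counts virtual swap that is reserved but not yet touched.
$ swap -s
total: 1774912k bytes allocated + 240928k reserved = 2015840k used, 30548744k available
```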

Some operating systems, such as IBM AIX or the Linux distributions, do lazy memory allocation. This radically differs from Solaris, which reserves virtual swap space at request time rather than when the memory is actually touched. This means that a program can be informed synchronously of an out-of-swap-space error. This is far safer for the data than lying to the running program (assuming it will not use all the memory pages it initially reserved), which could then fail during normal execution.

Although this has several implications for Solaris, I will concentrate on one particular point in this post: the implementation of the disk swap device on a system which boots on ZFS. In this case, the disk swap device is a ZFS dataset whose type is volume, a logical volume exported as a raw or block device. ZFS datasets are generally thin provisioned: they do not have a hard capacity limit set (they all compete for the available pool space), and they do not have space reserved for them by default. For a volume, things are a little different, since a refreservation is set at the size of the volume (a little more, in fact, to account for ZFS metadata). This behavior is mandatory because of the different consumers of a volume, be it used as a raw device, as a block device layered under another file system, or as a special device such as a dump or swap device. In all these cases, the refreservation is there to prevent unexpected behavior of these different consumers.
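For reference, here is how that automatic refreservation shows up at volume creation time (a sketch on a hypothetical dataset name; the output is illustrative):

```shell
# Creating a 4G volume: ZFS automatically sets a refreservation
# matching the volume size (plus a little metadata overhead).
$ pfexec zfs create -V 4G rpool/testvol
$ zfs get -o property,value volsize,refreservation rpool/testvol
PROPERTY        VALUE
volsize         4G
refreservation  4G
# Remove the throwaway test volume afterwards.
$ pfexec zfs destroy rpool/testvol
```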

Back to our ZFS volume as a swap device, here is a typical configuration:

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

As expected, in order to provide real backing storage for the physical swap device and to honor the fact that Solaris does not do lazy memory allocation, a refreservation is set on the swap volume to guarantee that pages can actually be swapped out if paging occurs.

The problem arises when processes reserve lots of memory but only allocate a small portion of those memory pages. Why? Simply because the system then needs lots of virtual swap space which will never even be used, but which must be available for the system to operate properly. On large systems hosting large databases or Java workloads this can be problematic, as the swap volume will consume lots of space in the ZFS Root Pool. The growing size of the Root Pool may have some side effects: less space available for snapshots or other Boot Environments, a larger backup of the operating system (recursive snapshots of the pool), or a high consumption which can be a concern with small internal disks.
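To see how much of the pool is tied up this way, the space columns of zfs list break the usage down per category; the USEDREFRESERV column isolates what the refreservations consume (illustrative output, consistent with the configuration shown above):

```shell
# USEDREFRESERV shows space held by refreservations -- here, almost
# entirely the 4G swap volume, none of which is actually written.
$ zfs list -o space rpool rpool/swap
NAME        AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool       23.3G  9.94G         0     76K              0      9.94G
rpool/swap  27.3G  4.00G         0     16K          4.00G          0
```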

As stated in the manual page for zfs(1M):

Though not recommended, a "sparse volume" (also known as "thin provisioning") can be created by specifying the -s option to the zfs create -V command, or by changing the reservation after the volume has been created. A "sparse volume" is a volume where the reservation is less than the volume size. Consequently, writes to a sparse volume can fail with ENOSPC when the pool is low on space. For a sparse volume, changes to volsize are not reflected in the reservation.
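Following the man page, a sparse volume could also be created that way from the start (hypothetical dataset name; as the man page itself warns, this is not recommended, especially for a production swap device):

```shell
# -s creates a sparse (thin-provisioned) volume: no refreservation
# is set, so writes may fail with ENOSPC if the pool fills up.
$ pfexec zfs create -s -V 4G rpool/sparsevol
$ zfs get -o property,value refreservation,usedbyrefreservation rpool/sparsevol
PROPERTY              VALUE
refreservation        none
usedbyrefreservation  0
```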

So, as a test case only and on a non-production system, I will completely remove the refreservation from the ZFS volume backing the swap device, and see how the freed space returns to its parent dataset:

$ pfexec zfs set refreservation=none rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       5.94G  27.3G    76K  /rpool
rpool/swap    16K  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        none           local
rpool/swap  usedbyrefreservation  0              -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

Clearly, the ZFS volume corresponding to the swap device does not consume space anymore (since no memory page had been paged out to the swap device beforehand), and its size is no longer artificially pinned to the volume size: the usedbyrefreservation property now shows that there is no refreservation anymore. Note that the available space in the parent dataset increased from 23.3GB to 27.3GB, while its used space decreased from 9.94GB to 5.94GB.

So, assuming there is plenty of free space in the parent dataset, the swap device will be able to grow up to its full size, 4GB. But if the pool runs low on space for some reason, writes to the swap device (now a sparse volume) will fail with a not-enough-space error (ENOSPC), which will surely be badly handled by the system or by the processes that believed they had the space they initially reserved. Because of that, be sure to revert the configuration back to the original settings:

$ pfexec zfs set refreservation=4G rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -

Please consult the official Oracle documentation on Managing Your ZFS Swap and Dump Devices for more information.