blog'o thnet

Tag - memory

Monday 10 October 2011

Encrypted SWAP Device Just Disappeared In Solaris 11 EA

For some months, I have encrypted the swap device (which is a ZFS volume) and thus had an encrypted /tmp. This worked fine with Solaris 11 Express, but I encountered a strange behavior in Solaris 11 EA which causes the swap device to... well, just disappear.

Here is what I found after two boots, on several machines:

# swap -l
No swap devices configured

# zfs list -t volume
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool/dump  32.8G   240G  31.8G  -

# grep swap /etc/vfstab
swap            -               /tmp            tmpfs   -       yes     -
/dev/zvol/dsk/rpool/swap        -               -               swap    -       no      encrypted

So, the rpool/swap dataset disappeared. I am sure I did not destroy it myself, in particular since this happened on multiple servers. Nevertheless, I found this in the history of the zpool command:

# zpool history | grep destroy
[...]
2011-10-05.10:22:49 zfs destroy rpool/swap

# last reboot | head -2
reboot    system boot                   Wed Oct  5 10:23
reboot    system down                   Wed Oct  5 10:20

So, this problem seems to be related to some action at boot time. What do the logs of the SMF services have to say about that?

# find /var/svc/log -print | xargs grep -i swap
/var/svc/log/system-filesystem-usr:default.log:cannot create 'rpool/swap': pool must be upgraded to set this property or value
/var/svc/log/system-filesystem-usr:default.log:cannot open 'rpool/swap': dataset does not exist
/var/svc/log/system-filesystem-usr:default.log:cannot create 'rpool/swap': pool must be upgraded to set this property or value
/var/svc/log/system-filesystem-usr:default.log:cannot open 'rpool/swap': dataset does not exist

# tail -3 /var/svc/log/system-filesystem-usr:default.log
[ Oct  5 12:00:05 Executing start method ("/lib/svc/method/fs-usr"). ]
cannot create 'rpool/swap': pool must be upgraded to set this property or value
[ Oct  5 12:00:13 Method "start" exited with status 0. ]

Ouch, what happened here? The message is interesting, but a little misleading: these are fresh Solaris 11 EA installations, so the pools and datasets are all up to date:

# zpool upgrade && zfs upgrade
This system is currently running ZFS pool version 33.
All pools are formatted using this version.
This system is currently running ZFS filesystem version 5.
All filesystems are formatted with the current version.

So, it seems that the rpool/swap device is re-created at boot time, and for some reason this does not work as expected. Here is an attempt to discover where the device is re-created and why it fails.

# find /lib/svc/method -print | xargs grep -i sbin/swapadd
/lib/svc/method/fs-usr:/usr/sbin/swapadd -1
/lib/svc/method/nfs-client:     /usr/sbin/swapadd
/lib/svc/method/fs-local:/usr/sbin/swapadd >/dev/null 2>&1

# grep "zfs destroy" /usr/sbin/swapadd
                zfs destroy $zvol > /dev/null 2>&1

# sed -n '/zfs create/,/\$zvol/p' /usr/sbin/swapadd
        zfs create -V $volsize -o volblocksize=`/usr/bin/pagesize` \
            -o primarycache=$primarycache -o secondarycache=$secondarycache \
            -o encryption=$encryption -o keysource=raw,file:///dev/random $zvol

So, the re-creation of rpool/swap at boot time happens only when using an encrypted volume. After a bit of digging, here is what I found. At the first boot, this is the command used to create the encrypted volume:

zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=on -o keysource=raw,file:///dev/random rpool/swap

But on a second boot, here is the slightly different command used this time:

zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=aes-128-ctr -o keysource=raw,file:///dev/random rpool/swap

This is because the arguments passed to the command are saved from the volume's settings just before it is destroyed, then restored when it is re-created. As mentioned in the zfs(1m) manual page, only the following encryption algorithms are supported, so the one which gets set (aes-128-ctr) is not valid. The error message saying that the pool must be upgraded to set this property or value is a little clearer now.

encryption=off | on | aes-128-ccm | aes-192-ccm | aes-256-ccm | aes-128-gcm | aes-192-gcm | aes-256-gcm
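
The validity check that the second boot fails can be sketched as a small shell test. This is only an illustration: the function name is hypothetical, and the list of accepted values is copied from the zfs(1m) excerpt above rather than queried from the system.

```shell
# Hypothetical helper: succeed only when the value is one of the
# encryption settings listed in the zfs(1m) excerpt above.
is_supported_encryption() {
    case "$1" in
        off|on|aes-128-ccm|aes-192-ccm|aes-256-ccm|aes-128-gcm|aes-192-gcm|aes-256-gcm)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

is_supported_encryption on && echo "on: accepted"
is_supported_encryption aes-128-ctr || echo "aes-128-ctr: rejected"
```

The value used at the first boot (on) passes, while the one restored at the second boot (aes-128-ctr) does not.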

The question is, how can this happen? Where does this algorithm come from? The answer is simple: it seems that the swap(1m) command itself alters some properties of the rpool/swap volume:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs destroy rpool/swap
# zfs create -V 4G -o volblocksize=8192 -o primarycache=metadata -o secondarycache=all -o encryption=on -o keysource=raw,file:///dev/random rpool/swap
# zfs list -H -o type,volsize,volblocksize,encryption rpool/swap
volume  4G      8K      on
# swap -1 -a /dev/zvol/dsk/rpool/swap
# zfs list -H -o type,volsize,volblocksize,encryption rpool/swap
volume  4G      1M      aes-128-ctr

Not only is the algorithm changed to something not supported (yet?), but the volblocksize property is changed as well (from 8K to 1M). This was not the case on Solaris 11 Express 2010.11.

I hope someone can help me with this, and that this is a known bug which is already (or will quickly be) addressed, in particular for Solaris 11 GA. I already posted a comment on Darren Moffat's blog, just in case this can help a bit.

Wednesday 3 August 2011

Interesting Use Case Of Solaris Swap Space

As you probably know, the Solaris operating system uses the (badly worded) term swap space to designate the virtual swap space of a UNIX process, which is to be distinguished from the physical swap space, that is, the disk or file swap device.

The swap space allocation goes through three different stages. The first stage, reserved, represents the virtual swap space corresponding to the virtual size of all segments of a process, which are reserved at creation time. The second stage, allocated, represents the physical (real) pages which are allocated (touched) in the virtual swap space. The last stage, swapped-out, represents the memory pages which are swapped out to the disk or file swap device.

Some operating systems, such as IBM AIX or the Linux distributions, do lazy memory allocation. This radically differs from Solaris, which tries to reserve virtual swap space at request time rather than at the time the memory is actually needed. This means that the program can be informed synchronously of an out-of-swap-space error. This is far safer for the data than lying to the running program (and assuming it will not use all the memory pages it initially reserved), which can then fail during normal execution.

Although this has several consequences for Solaris, I will concentrate on one particular point in this post: the implementation of the disk swap device on a system which boots from ZFS. In this case, the disk swap device is a ZFS dataset whose type is volume, that is, a logical volume exported as a raw or block device. ZFS datasets are generally thin provisioned, in that they have no hard cap set (they all compete for the available pool space) and no space reserved for them by default. For a volume, things are a little different, since a refreservation is set to the size of the volume (a little more, in fact, for ZFS metadata). This behavior is mandatory because of the different consumers of a volume, be it used as a raw device, as a block device layered under another file system, or as a special device such as a dump or swap device. In all these cases, the refreservation is there to prevent unexpected behavior of these different consumers.

Back to our ZFS volume as a swap device, here is a typical configuration:

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

As expected, in order to have real backing storage for the physical swap device, and to honor the fact that Solaris does not do lazy memory allocation, a refreservation is set on the swap volume to ensure that pages can always be swapped out when paging occurs.

The problem arises when processes reserve a lot of memory, but only allocate a small portion of those memory pages. Why? Simply because the system needs to have lots of virtual swap space, which will not even be used, but which must be available for the system to operate properly. On large systems hosting large databases or Java workloads this can be problematic, as the swap volume will consume lots of space in the ZFS root pool. The growing size of the root pool may have some side effects, such as: less space available for snapshots or other Boot Environments, larger backups of the operating system (recursive snapshots of the pool), or a high consumption which can cause some concern on small internal disks.

As stated in the manual page for zfs(1M):

Though not recommended, a "sparse volume" (also known as "thin provisioning") can be created by specifying the -s option to the zfs create -V command, or by changing the reservation after the volume has been created. A “sparse volume” is a volume where the reservation is less then the volume size. Consequently, writes to a sparse volume can fail with ENOSPC when the pool is low on space. For a sparse volume, changes to volsize are not reflected in the reservation.

So, as a test case only and on a non-production system, I will totally wipe out the refreservation on the ZFS volume which represents the swap device, and see how the freed space returns to its parent dataset:

$ pfexec zfs set refreservation=none rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       5.94G  27.3G    76K  /rpool
rpool/swap    16K  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        none           local
rpool/swap  usedbyrefreservation  0              -

$ swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 161,2        8K     4.0G     4.0G

Clearly, the ZFS volume corresponding to the swap device does not consume space anymore (since no memory page had been paged out to the swap device beforehand) and its used size is no longer artificially set to the volume size: the usedbyrefreservation property now shows that there is no refreservation anymore. Note that the available space of the parent dataset increased from 23.3GB to 27.3GB, while the used space decreased from 9.94GB to 5.94GB.

So, assuming there is plenty of free space in the parent dataset, the swap device will be able to grow up to its size, 4GB. But if the pool runs low on space for some reason, writes to the swap device (now a sparse volume) will fail with a not enough space error, which will surely be badly handled by the system or by the processes which believed they had the space they initially reserved. Because of that, be sure to revert the configuration to the original settings:

$ pfexec zfs set refreservation=4G rpool/swap

$ zfs list rpool rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool       9.94G  23.3G    76K  /rpool
rpool/swap     4G  27.3G    16K  -

$ zfs get referenced,volsize,refreservation,usedbyrefreservation rpool/swap
NAME        PROPERTY              VALUE          SOURCE
rpool/swap  referenced            16K            -
rpool/swap  volsize               4G             local
rpool/swap  refreservation        4G             local
rpool/swap  usedbyrefreservation  4.00G          -
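
To catch a swap volume left sparse by mistake, the comparison between the two properties can be scripted. The sketch below is an assumption of mine, not something taken from the Oracle documentation: the helper name is hypothetical, and it expects both values in bytes (or the literal none for the reservation).

```shell
# Hypothetical helper: succeed when a volume is sparse, i.e. when its
# refreservation (in bytes, or "none") is smaller than its volsize (in bytes).
is_sparse_volume() {
    volsize=$1
    refreservation=$2
    [ "$refreservation" = "none" ] && refreservation=0
    [ "$refreservation" -lt "$volsize" ]
}

# Values as used in this post: a 4G volume, without and with its reservation.
is_sparse_volume 4294967296 none       && echo "sparse: writes may fail with ENOSPC"
is_sparse_volume 4294967296 4294967296 || echo "fully reserved"
```

On a live system, the two arguments could presumably be fed from zfs get -Hp -o value volsize rpool/swap and zfs get -Hp -o value refreservation rpool/swap.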

Please consult the official Oracle documentation on Managing Your ZFS Swap and Dump Devices for more information.

Tuesday 16 March 2010

Debugging After A Solaris System Was P2V'ed

I recently faced a problem where an application stopped working after transferring a system from an old Sun E450 running Solaris 8 to a more recent Sun Fire V490 running Solaris 10 10/09 with the Solaris 8 Container software stack. Although everything went smoothly during the P2V, and most applications run pretty well in the non-global zone, we encountered a problem where the Courier SMTP product no longer answered connections, such as:

# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

The fact is, the couriertcpd program is running, and listening on TCP port 25.

# pfiles `pgrep couriertcpd | head -1`
5691:   /usr/lib/courier/sbin/couriertcpd
  Current rlimit: 1024 file descriptors
[...]
   5: S_IFSOCK mode:0666 dev:366,0 ino:62871 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        sockname: AF_INET 0.0.0.0  port: 25
[...]

Interestingly, here is what is logged by the Courier application:

# grep mail.info /path/to/Courier.log
XXX courieresmtpd: [ID 702911 mail.info] gdbm fatal: couldn't init cache

The problem seems to be located around the gdbm library, which is a dependency of Courier. So, what happens when we try to initiate a direct connection using telnet on port 25?

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 245
245/1:          fork()          (returning as child ...)        = 5691
[...]
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004336D8)                                 Err#12 ENOMEM
245/1:          write(2, " g d b m   f a t a l :  ", 12)        = 12
245/1:          write(2, 0xFF2D4D28, 19)                        = 19
245/1:             c o u l d n ' t   i n i t   c a c h e
245/1:          write(2, "\n", 1)                               = 1
245/1:          llseek(0, 0, SEEK_CUR)                          = 0
245/1:          _exit(-1)
5691/1:             Received signal #18, SIGCLD, in poll() [caught]
[...]
# echo "ibase=16; `echo 004136D8 | tr '[:lower:]' '[:upper:]'`" | bc
4273880
# echo "ibase=16; `echo 004336D8 | tr '[:lower:]' '[:upper:]'`" | bc
4404952

OK. The brk(2) function is used to dynamically change the amount of space allocated for the calling process's data segment. So, the process with PID 245, forked from couriertcpd, cannot grow its data segment beyond the break at 4273880 (0x004136D8), since the attempt to move it to 4404952 (0x004336D8) returns an error. When the brk(2) function fails, it is generally due to one of these two major cases:

  1. Insufficient space exists in the swap area to support the expansion.
  2. The data segment size limit as set by setrlimit(2) would be exceeded.
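
As a side note, the hexadecimal break addresses from the truss output can also be converted directly by the shell, without bc. The helper name below is hypothetical; the addresses are the ones shown above.

```shell
# Convert a hexadecimal address (given without the 0x prefix) to decimal.
hex2dec() {
    printf '%d\n' "0x$1"
}

hex2dec 004136D8    # last break address that was accepted
hex2dec 004336D8    # break address that failed with ENOMEM

# Size of the step by which the child tried to grow its heap, in bytes.
echo $(( 0x004336D8 - 0x004136D8 ))
```

This prints 4273880, 4404952 and 131072: the failing call was trying to extend the data segment by 128KB.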

At first, we decided to grow the size of the swap space in the global zone from 4GB to 8GB (since there is no resource control on memory for this non-global zone). As it is a full-ZFS Solaris 10 system, here are the steps to do so:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=8G rpool/swap
# /sbin/swapadd
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 256,2      16 16777200 16777200
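
Since swap -l reports swaplo and blocks in 512-byte units, the figures above can be checked against the 8G volume size with a little shell arithmetic (the function name is only illustrative):

```shell
# swap -l reports sizes in 512-byte blocks; convert a block count to bytes.
blocks_to_bytes() {
    echo $(( $1 * 512 ))
}

blocks_to_bytes 16777200               # usable swap, in bytes
blocks_to_bytes $(( 16777200 + 16 ))   # adding back the swaplo offset
```

The first call gives 8589926400 bytes; with the 16 blocks of the swaplo offset added back, the second gives 8589934592 bytes, which is exactly 8GB.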

But nothing changed from the couriertcpd point of view, as we might have guessed, since we had been able to remove the entire swap device from the global zone without trouble. So, we shrank it back using the same steps as for growing it. Then, is it possible that we have reached an upper resource limit?

The couriertcpd program is executed under the daemon identity: what is the maximum size of the data segment (or heap) for the daemon account?

# su - daemon -c "ulimit -Sd; ulimit -Hd"
unlimited
unlimited

This seems more than sufficient, to say the least. But what are the actual limits currently applied to the running process?

# plimit -k 5691
5691:   /usr/lib/courier/sbin/couriertcpd
   resource              current         maximum
  time(seconds)         unlimited       unlimited
  file(kbytes)          unlimited       unlimited
  data(kbytes)          4096            4096
  stack(kbytes)         8192            unlimited
  coredump(kbytes)      unlimited       unlimited
  nofiles(descriptors)  1024            1024
  vmemory(kbytes)       unlimited       unlimited

OK. One can argue that 4404952 bytes is very close to the 4096KB (4194304 bytes) data size limit... and yes, it is. The two points here are that the limit is indeed close to the size of the failing allocation, and that some setting somewhere must impose this resource limit, since we verified it is not set at the account level (the current hard limit is 4096KB, thus not very unlimited). So, we tried to raise the resource limit for the running process to a much higher value, and voilà:

# plimit -d 65536,65536 5691
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 S0003010 ESMTP
quit
221 Bye.
Connection closed by foreign host.

Well, all seems far better this time:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 4861
4861/1:          fork()          (returning as child ...)        = 5691
[...]
4861/1:         brk(0x00CF36D8)                                 = 0
4861/1:         lseek(6, 262144, SEEK_SET)                      = 262144
4861/1:         read(6, 0x00071730, 131072)                     = 131072
[...]
# echo "ibase=16; `echo 00CF36D8 | tr '[:lower:]' '[:upper:]'`" | bc
13579992

So, the memory allocation now succeeds in moving the break up to 13579992 bytes (13262KB). We decided to set the data segment size to 16MB, which seems to be sufficient in our case.

We now need to find the configuration responsible for setting the data segment size to 4096KB. Looking at the Courier documentation, and after a bit of digging on the system itself, we found the responsible configuration file and adapted it as follows:

# grep ULIMIT= /usr/lib/courier/etc/esmtpd
#ULIMIT=4096
ULIMIT=16384
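
Assuming the ULIMIT value ends up as the argument of the shell's ulimit -d (in kilobytes) before the daemon is started, which matches the data(kbytes) figure reported by plimit but was not checked against the Courier start script here, the effect can be reproduced in a throwaway subshell. The function name is hypothetical.

```shell
# Apply a soft data segment limit (in KB) inside a subshell and read it back.
# The limits of the calling shell are left untouched.
show_data_limit() {
    ( ulimit -S -d "$1" && ulimit -S -d )
}

show_data_limit 16384
```

Any process started from within that subshell would inherit the 16384KB limit, just as the processes forked by couriertcpd inherit the limit it was started with.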

As a last validation, we completely rebooted the non-global zone to verify the overall behavior, which is now OK. As a last word, it was a very interesting problem to fight with, but I still do not understand why 4MB was sufficient on an old Sun E450 and is not in a branded Solaris 8 non-global zone... comments welcome.

Monday 28 April 2008

memconf And AMD Athlon 64 X2 Dual Core Processor

The latest update to the excellent memconf utility (V2.5 22-Feb-2008) properly supports recent Solaris Express releases, as well as my recent change from the stock AMD Opteron Processor 148 to an AMD Athlon 64 X2 Dual Core Processor 3800+. (I mostly made that change just to be able to access two run queues separately, not to gain more power per se.)

So, here is the new and appropriate memconf report:

# memconf -d
memconf:  V2.5 22-Feb-2008 http://www.4schmidts.com/unix.html
hostname: unic
manufacturer: Sun Microsystems, Inc.
model:    Sun Ultra 20 Workstation (AMD Athlon(tm) 64 X2 Dual Core \
 Processor 3800+ Socket 939 2010MHz)
Sun Family Part Number: A63
Solaris Express Community Edition snv_87 X86, 64-bit kernel, SunOS 5.11
1 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939 2010MHz cpu
diagbanner = Sun Ultra 20 Workstation
cpubanner = AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939 2010MHz
model = Sun Ultra 20 Workstation
machine = i86pc
platform = i86pc
perl version: 5.008004
CPU Units:
==== Processor Sockets ====================================
Version                          Location Tag
-------------------------------- --------------------------
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Socket 939
Memory Units:
Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown in use 0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown in use 0   A3                  Bank6/7
total memory = 2048MB (2GB)

You can check and compare with the previous report on my blog.

Wednesday 2 May 2007

What is the Size of Memory DIMMs on AIX5L

Well, here is a quick one-liner to get the available memory on an AIX system, with a detailed listing of the size of each memory DIMM in the currently installed banks (given in MB):

# lscfg -vp | egrep 'Memory DIMM|Size'
     Memory DIMM:
       Size........................2048
     Memory DIMM:
       Size........................2048
     Memory DIMM:
       Size........................2048
     Memory DIMM:
       Size........................2048
     Memory DIMM:
       Size........................2048
     Memory DIMM:
       Size........................2048
     Memory DIMM:
       Size........................2048
     Memory DIMM:
       Size........................2048
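
To get the total rather than the per-DIMM listing, the Size lines can be summed with awk. This is a minimal sketch which assumes the output layout shown above (dots between the label and the value); the function name is hypothetical.

```shell
# Sum the "Size" lines (in MB) of a filtered lscfg -vp listing read from stdin.
sum_dimm_mb() {
    awk -F. '/Size/ { total += $NF } END { print total + 0 }'
}

# Two-DIMM sample in the same layout as the output above.
printf '%s\n' \
    '     Memory DIMM:' \
    '       Size........................2048' \
    '     Memory DIMM:' \
    '       Size........................2048' | sum_dimm_mb
```

The sample prints 4096. On the system above, lscfg -vp | egrep 'Memory DIMM|Size' | sum_dimm_mb would be expected to report 16384, that is, eight DIMMs of 2048MB.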

Saturday 9 December 2006

Memory Behaviour: Tuning Linux's Kernel Overcommit

After encountering some problems at work running large Oracle databases on a RHEL4 system, we needed to prevent kernel overcommit from exceeding a certain threshold. In fact, the problem was that the system began to kill processes under heavy load: in our case, ssh connections... Ouch!

The solution was to alter the default behavior of the Linux kernel in this area, as mentioned in Documentation/vm/overcommit-accounting in the corresponding source code tarball.
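
The exact settings we used are not recorded here. As an illustration only: in strict mode (vm.overcommit_memory=2), that document describes the commit limit as the swap size plus vm.overcommit_ratio percent of physical RAM. The sketch below computes it for a hypothetical machine; the function name and the figures are assumptions.

```shell
# Commit limit in KB under vm.overcommit_memory=2:
#   swap + RAM * overcommit_ratio / 100
# (huge pages are ignored in this simplified form).
commit_limit_kb() {
    ram_kb=$1
    swap_kb=$2
    ratio=$3
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# Hypothetical machine: 8GB of RAM, 4GB of swap, and a ratio of 50.
commit_limit_kb 8388608 4194304 50
```

This prints 8388608, so such a machine would refuse allocations once 8GB of address space is committed. On a running system, the value actually in effect is shown as CommitLimit in /proc/meminfo.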

Interestingly, a great overall explanation of this mechanism is available today on the O'Reilly LinuxDevCenter.com. It is worth reading, I think.

Monday 4 December 2006

memconf Update!

As you may have noticed, there is a new version of the memconf utility. I am proud that the improved x86 support was done using my Sun Ultra 20 as the reference Opteron (amd64) system ;-)

Here are the outputs before and after Tom Schmidt's great work (compare in particular the CPU and total memory lines):

$ memconf -d
memconf:  V2.0 17-Oct-2006 http://www.4schmidts.com/unix.html
hostname: unic.thilelli.net
model:    Sun Ultra 20 Workstation (Solaris x86 machine)
Solaris Nevada snv_51 X86, 32-bit kernel, SunOS 5.11
2 x86 2211MHz cpus
diagbanner = Sun Ultra 20 Workstation
model = Sun Ultra 20 Workstation
modelmore = (Solaris x86 machine)
machine = i86pc
platform = i86pc
perl version: 5.008004
CPU Units:
==== Processor Sockets ====================================
Version                          Location Tag
-------------------------------- --------------------------
AMD Opteron(tm) Processor 148    Socket 939
Memory Units:
Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown in use 0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown in use 0   A3                  Bank6/7
total memory = 2047MB (1.9990234375GB)
$
$ memconf -d
memconf:  V2.1 29-Nov-2006 http://www.4schmidts.com/unix.html
hostname: unic
model:    Sun Ultra 20 Workstation (Solaris x86 machine)
Solaris Nevada snv_52 X86, 64-bit kernel, SunOS 5.11
1 AMD Opteron(tm) Processor 148 2211MHz cpu
diagbanner = Sun Ultra 20 Workstation
model = Sun Ultra 20 Workstation
modelmore = (Solaris x86 machine)
machine = i86pc
platform = i86pc
perl version: 5.008004
CPU Units:
==== Processor Sockets ====================================
Version                          Location Tag
-------------------------------- --------------------------
AMD Opteron(tm) Processor 148    Socket 939
Memory Units:
Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown in use 0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown in use 0   A3                  Bank6/7
total memory = 2048MB (2GB)

Friday 3 November 2006

memconf Now Supports x86 Systems

After months of waiting, the well-known memconf utility now supports x86 systems. Try it yourself!

Here is the output on my personal Sun Ultra 20 workstation (an AMD Opteron machine) running build 51 of OpenSolaris:

# memconf
hostname: unic.thilelli.net
Sun Ultra 20 Workstation (Solaris x86 machine)
Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown in use 0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown in use 0   A3                  Bank6/7
total memory = 2047MB (1.9990234375GB)