blog'o thnet

Saturday 1 November 2008

System V IPC Now Managed By Resource Controls

With Solaris 10, all IPC facilities are either automatically configured or can be controlled by resource controls. At the same time, they get new default values, where applicable.

As an example, assume that we need to raise the limit on the number of shared memory segments that can be created, because the new default (128) is still not enough. Before Solaris 10, you had to set the shmsys:shminfo_shmmni tunable parameter in the /etc/system kernel configuration file, which was a system-wide limit and required a reboot. This parameter is now listed among the tunables that are obsolete or have been removed, and its use is clearly deprecated.

To raise the corresponding limit to 256 shared memory segments, we now use the project.max-shm-ids resource control, which is enforced at the project level. The idea is to set the resource control on a project, then run the program in the context of that project. One way to do this is to create the project (in the project(4) database), then associate it with a user account through the extended user attributes (the user_attr(4) database) so that it becomes the default project for that user. Alternatively, it is possible not to create an extended user attribute for this project at all, and to select the project explicitly through the newtask(1) command (and the login(1), cron(1M), and su(1M) programs, or the setproject(3PROJECT) function). But the simplest and least intrusive method is certainly to make the project the default one for a user account directly. Here is how to do so.

By default, resource control violations are not logged through the syslog daemon. To see an appropriate message in the messages log file, you must first globally enable the syslog action for the resource control in question (the default level is notice).

# rctladm -e syslog project.max-shm-ids
# rctladm -l project.max-shm-ids
project.max-shm-ids   syslog=notice   [ no-basic deny count ]

When the limit on the number of shared memory segments is reached, a message similar to the following is written to the log file:

# grep rctl /var/adm/messages
/var/adm/messages:Oct 21 16:47:29 hostname genunix: [ID 883052 kern.notice] privileged rctl project.max-shm-ids (value 128) exceeded by project 3
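If you want to watch for these events, the message is regular enough to be picked apart with sed. The following is only a sketch: the function name parse_rctl_msg is made up for this example, and it assumes the message layout shown above (other releases may format it differently).

```shell
# Sample line copied from the log excerpt above (file name prefix dropped).
msg='Oct 21 16:47:29 hostname genunix: [ID 883052 kern.notice] privileged rctl project.max-shm-ids (value 128) exceeded by project 3'

# Hypothetical helper: print "<rctl name> <limit> <project id>" for each
# matching line read on standard input. Non-matching lines are ignored.
parse_rctl_msg() {
    sed -n 's/.* rctl \([^ ]*\) (value \([0-9]*\)) exceeded by project \([0-9]*\).*/\1 \2 \3/p'
}

summary=$(printf '%s\n' "$msg" | parse_rctl_msg)
echo "$summary"
```

The last field is a numeric project ID, so it still has to be mapped to a project name, for instance with the projects -l output.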

Here is the definition of the new project, and its configuration.

# getent project user.username
user.username:1000:Project To Increase The Limit Of SHM Segments:::project.max-shm-ids=(priv,256,deny)
#
# projects -l user.username
user.username
      projid : 1000
      comment: "Project To Increase The Limit Of SHM Segments"
      users  : (none)
      groups : (none)
      attribs: project.max-shm-ids=(priv,256,deny)
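For reference, a project(4) entry is a single line of six colon-separated fields: name, project ID, comment, user list, group list, and attributes. Here is a small sketch that splits the entry shown above with plain shell; the variable names are mine, and it assumes the attribute field holds a single resource control, as in this example.

```shell
# Entry copied from the getent output above.
entry='user.username:1000:Project To Increase The Limit Of SHM Segments:::project.max-shm-ids=(priv,256,deny)'

# project(4) layout: name:projid:comment:user-list:group-list:attributes
IFS=: read -r name projid comment userlist grouplist attribs <<EOF
$entry
EOF

# Pull the threshold out of "(privilege,value,action)".
limit=${attribs#*,}     # drop everything up to the first comma
limit=${limit%%,*}      # drop everything from the next comma onward

echo "$name $projid $limit"
```

The empty user and group lists are what projects -l reports as (none).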

When a project name begins with the pattern user., the project is automatically set as the default one for the corresponding user, without the need to populate the extended user attributes database. Check that the project is set as the default project for the account username.

# id -p username
uid=100(username) gid=100(groupname) projid=1000(user.username)
#
# projects -d username
user.username

After logging in as username, the program progname is launched. We can confirm that its shared memory segments are created in the context of the project user.username, and we can consult the per-project statistics report.

# ipcs -mJ
IPC status from <running system> as of Wed Oct 29 11:39:59 CET 2008
T         ID KEY        MODE    OWNER     GROUP       PROJECT
Shared Memory:
m 1409286255   0 --rw-rw-rw- username groupname user.username
m  469762152   0 --rw-rw-rw- username groupname user.username
m         56   0 --rw-rw-rw- username groupname user.username
#
# prstat -n5 -cJ
   PID USERNAME  SIZE   RSS STATE PRI NICE    TIME  CPU PROCESS/NLWP
  3704 username  373M  284M cpu24   2   10 0:07:37 2.1% progname/26
  6785 username  285M  196M sleep  29   10 0:04:13 1.1% progname/26
  4480 username  785M  697M sleep  29   10 0:11:40 1.1% progname/26
  5836 username  293M  204M sleep  29   10 0:06:31 1.0% progname/26
  7635 username  277M  188M sleep  29   10 0:01:00 0.9% progname/26
PROJID    NPROC  SWAP   RSS MEMORY      TIME  CPU PROJECT
  1000       26 6472M 6333M    26%   3:57:24  23% user.username
     1       17   41M   87M   0.4%   2:39:58 0.0% user.root
     0       43  184M  267M   1.1%   4:07:25 0.0% system
     3        4 5856K   11M   0.0%   0:00:00 0.0% default
Total: 90 processes, 916 lwps, load averages: 4.41, 2.36, 1.04

Last, we can verify the new setting for one progname instance. For example for PID 3704:

# prctl -n project.max-shm-ids 3704
process: 3704: bin/progname 54 80 -Xmx192m
NAME    PRIVILEGE       VALUE    FLAG   ACTION      RECIPIENT
project.max-shm-ids
        privileged        256       -   deny                -
        system          16.8M     max   deny                -

The resource management facility can do much more than tune IPC settings; it can also manage CPU usage and control physical memory, for example. It is more fine-grained than what was in place before Solaris 10, and it no longer requires a reboot.

As a last word, note that there are command-line tools to help create and manage projects and extended user attributes in locally stored databases: projadd(1M) and projmod(1M), and useradd(1M) and usermod(1M), respectively. But since the information sources were hosted in NIS and LDAP network directories, we did not use them for this test case.
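For a locally stored project database, the equivalent of the entry used in this test case could probably be created with projadd(1M). This was not done here, so take the following as an untested sketch: it only builds and prints the command line (a dry run) from the values shown earlier, using the -p, -c and -K options of projadd.

```shell
# Values taken from the project definition used in this post.
projname='user.username'
projid=1000
comment='Project To Increase The Limit Of SHM Segments'
attrib='project.max-shm-ids=(priv,256,deny)'

# Dry run: print the command instead of executing it, so it can be reviewed
# first. The single quotes protect the spaces and parentheses from the shell.
cmd=$(printf "projadd -p %s -c '%s' -K '%s' %s" \
    "$projid" "$comment" "$attrib" "$projname")
echo "$cmd"
```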

Sunday 26 October 2008

Solaris vs. RHEL Costs And Features Comparisons

Clearly, the costs involved in running Solaris and RHEL platforms are not well understood, and the common perception generally favors GNU/Linux environments. This perception is (most of the time) wrong, since it tends to be based on personal user experience, which is in fact very different from running lots of systems in high-demand production datacenters.

Here are some interesting readings on these subjects (cost and feature analyses) from:

YMMV for sure, but I personally think that Solaris costs are overestimated, and its features are mostly unknown, or at least underused... but this is a very large and hot topic nowadays, I know.

Saturday 18 October 2008

Discrepancies Between df And du Outputs

As an SA, it is not uncommon to regularly receive questions about big differences between the du and df outputs on a UFS file system. (For ZFS specific considerations, please see the ZFS FAQ.)

The du utility reports the sum of space allocated to all files in the file hierarchy rooted in the directory plus the space allocated to the directory itself. The df utility reports the amount of disk space occupied by a mounted file system.

When a file is removed from the file system, i.e. is unlinked (the hard link count goes to zero), du no longer counts the space belonging to this file, but df keeps reporting it as used until all references to it (open file descriptors) are closed. In order to find the guilty process, one can follow the information found in the SunManagers Frequently Asked Questions. Here is an example of such a search, using a slightly different method to find the process currently holding the open descriptor to the deleted file.

Find the file which has been unlinked through the procfs interface:

# find /proc/*/fd \( -type f -a ! -size 0 -a -links 0 \) -print | xargs \ls -li
 415975 --w-------   0 user  group  2125803025 Oct 15 23:59 /proc/1252/fd/3

Optionally, get more details about the process holding it:

# pargs -c 1252
1252:   rvd.basic -reliability 5 -listen tcp:9876 -logfile /path/to/log/rvd_9876.l
argv[0]: rvd.basic
argv[1]: -reliability
argv[2]: 5
argv[3]: -listen
argv[4]: tcp:9876
argv[5]: -logfile
argv[6]: /path/to/log/rvd_9876.log

Check whether you can make sense of the content of the unlinked file:

# tail /proc/1252/fd/3
-------------------------------------------------------------------------------
2008-10-15 23:59:32.002116 - [MSG] BBG_Transmitter_class.cc, line 792 (thread 25087:4)
[4060] Sent a heartbeat
-------------------------------------------------------------------------------
BBG_Transmitter_class.cc: [4111] No activity detected. Send a Heartbeat message
-------------------------------------------------------------------------------
2008-10-15 23:59:32.134829 - [MSG] BBG_Transmitter_class.cc, line 1138 (thread 25087:4)
[4065] Heartbeat acknowledged by Bloomberg

You can correlate the size of the removed, but still referenced, file with the difference between the space reported by the du and df tools:

# df -k /path/to
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d5       6017990 5874592   83219    99%    /path/to
# du -sk /path/to
3791632 /path/to
# echo "(5874616-3791632)*1024" | bc
2132975616
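The same check can be done with shell arithmetic, and compared against the size that ls -li reported for /proc/1252/fd/3. This is only a sketch using the figures above (the df value is the one passed to bc, which differs slightly from the df listing); the variable names are mine, and it needs a shell with 64-bit arithmetic.

```shell
df_used_kb=5874616          # used space as passed to bc above
du_used_kb=3791632          # du -sk result
unlinked_bytes=2125803025   # size of the unlinked file, from ls -li

# Space seen by df but not by du, in bytes.
gap_bytes=$(( (df_used_kb - du_used_kb) * 1024 ))

# What is left once the unlinked log file is accounted for.
residual_bytes=$(( gap_bytes - unlinked_bytes ))

echo "gap=$gap_bytes residual=$residual_bytes"
```

The few megabytes left over are small next to the 2 GB file; they may simply come from the two commands being run at different moments and from file system overhead that du does not see, but that part is a guess.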

So, we have now found the ~2GB log file which is still held open (in use) by a process. There are two ways to get the space back:

  1. Truncate the unlinked file (quick workaround).
  2. Simply restart the corresponding program properly (better option).

Use the solution which best fits your needs in your environment.
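To make the first workaround more concrete, here is a small self-contained demonstration on a scratch file. It is a sketch, not what was run on the system above: it assumes a procfs that exposes open descriptors as /proc/PID/fd/N, and it uses the current shell as the process holding the file.

```shell
# Demonstration on a scratch file; nothing here touches a real log.
tmpfile=$(mktemp)
exec 3>>"$tmpfile"          # hold the file open in append mode, as a logger would
echo "some log data" >&3
rm "$tmpfile"               # unlink: the name disappears, the blocks stay allocated

fd_path="/proc/$$/fd/3"     # assumption: open descriptors are visible in procfs
if [ -e "$fd_path" ]; then
    size_before=$(wc -c < "$fd_path" | tr -d ' ')
    : > "$fd_path"          # solution 1: truncate the unlinked file in place
    size_after=$(wc -c < "$fd_path" | tr -d ' ')
    echo "before=$size_before after=$size_after"
else
    echo "no $fd_path on this system, skipping"
fi
exec 3>&-                   # closing the last descriptor releases the space for good
```

The size should be non-zero before the truncation and zero after it. Note that if the program did not open its log in append mode, it keeps writing at its old offset after the truncation, so the file becomes sparse rather than small; restarting the program avoids that surprise.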

Saturday 11 October 2008

Anatomy Of An Attack

Well, although I don't generally lend credibility to public FUD by talking about it, I will just point you to some really great, point-by-point explanations of the recent InfoWorld (and New York Times) publication from Paul Krill, Is Sun Solaris on its deathbed?

As Jim Grisanzio stated recently on the advocacy-discuss mailing list:

[...] More importantly, though, is that the original article not only fell flat but it was actually aggressively rejected by many in the open source community. That's an interesting shift out there. And a good one, too.

Well, I couldn't agree more with Jim on his points.

Update #1 (2008-10-28): Don't forget to read the interesting inputs from Joerg Moellenkamp.

Monday 6 October 2008

Fake The hostid Of A Solaris Zone, Updated

As a little follow-up to Fake The hostid Of A Solaris Zone, and regarding the discussion on the ability to change the hostid of a Solaris non-global zone, it is interesting to mention this (updated) information:

  1. The LD_PRELOAD trick proposed before is not a proper option, and is really ugly (and intrusive if you don't unset it before continuing the execution of a program).
  2. When using Solaris 8 or Solaris 9 Containers, there is a feature called Host ID Emulation in the zonecfg utility which can do exactly that.
  3. Before the introduction of configurable privileges in non-global zones with Solaris 10 11/06 (a.k.a. Solaris 10 Update 3), you had to run the DTrace zhostid script (daemon) within the global zone. It is not mandatory to run it from the global zone anymore. Using only the appropriate dtrace_user privilege, you can run it directly from the non-global zone:
    # zonecfg -z ngzone set limitpriv=default,dtrace_user
    # zoneadm -z ngzone boot
    # zlogin ngzone
    [Connected to zone 'ngzone' pts/5]
    Last login: Sat Oct  4 18:57:17 on pts/5
    Sun Microsystems Inc.   SunOS 5.11      snv_99  November 2008
    # /sbin/zonename 
    ngzone
    # /usr/bin/hostid
    837d47dd
    # ./zhostid &
    [1] 21506
    # /usr/bin/hostid
    20a82f32
    # ^D
    [Connection to zone 'ngzone' pts/5 closed]
    

Monday 29 September 2008

Sun's Free Proficiency Assessment System

Free. Online. Right now. As Peter Tribble mentioned in his blog, it is clearly not perfect, and some questions (and answers) seem a little obscure sometimes. Nonetheless, as I have never had any (yes, any) formal training, I thought it would be interesting to try these tests, and here are the results of the pre-assessments for:

  1. UNIX Essentials Featuring the Solaris 10 OS, I scored 37 out of 42.
  2. Sun Certified System Administrator for the Solaris 10 OS (Part 1), I scored 45 out of 48.
  3. Certified System Administrator for the Solaris 10 OS (Part 2), I scored 45 out of 48.

Maybe it is time to take the official Solaris Operating System certifications? ;)

  • Sun Certified Solaris Associate (SCSAS)
  • Sun Certified System Administrator (SCSA)
  • Sun Certified Network Administrator (SCNA)
  • Sun Certified Security Administrator (SCSECA)

Well... if time permits!

Tuesday 16 September 2008

Announcing Solaris 10 10/08

Although this may sound a little bit confident, the long-awaited Update 6 to the Solaris 10 operating system is just around the corner. This release (scheduled to be available in mid-October) will include virtualization enhancements, including the ability for a Solaris Container to automatically update its environment when moved from one system to another, Logical Domains support for dynamically reconfigurable disk and network I/O, and paravirtualization support when Solaris 10 is used as a guest OS in Xen-based environments such as Sun xVM Server. Solaris 10 10/08 also includes support for the latest systems from Sun and other vendors, such as those based on the Intel Xeon Processor 7400 Series.

This release will be the very first Solaris release able to boot from ZFS and use it as its root file system, as can already be done on OpenSolaris or Solaris Express Community Edition today.

Check the What’s New web page for Solaris, and consult the Solaris Media Gallery videos for more information.

Update #1 (2008-10-14): Don't forget to consult the must read What's New in Solaris 10 10/08? from the San Antonio OpenSolaris User Group.

Update #2 (2008-10-31): Get yours, and go reading the official What's New in the Solaris 10 10/08 Release page.

Friday 2 May 2008

PHP APC Extension Bug With Optimized Open Source Software Stack

To easily manage LDAP accounts (and LDAP entries in general, in fact), we created a Solaris Zone and installed the excellent Cool Stack bundle to host the LAM (LDAP Account Manager) web management tool. But after upgrading the Cool Stack to version 1.2, we encountered a very annoying problem: web pages froze, and we generally ended up restarting the Apache web server provided by the Cool Stack. After some troubleshooting, we discovered that this behavior was introduced by a bug in the APC-3.0.14 module bundled with the updated php-5.2.4 scripting software in this version of the Cool Stack.

Luckily, the bug was already fixed and a new version of the APC extension for PHP is available for download (in fact, just replace the original apc.so module with the new one). All the Cool Stack related problems, associated fixes and instructions are listed on the Cool Stack 1.2 Patches page: be sure to keep in sync if you are a Cool Stack consumer.

Sunday 6 April 2008

Change The DST Time Zone Definition For Test Purpose

Because we still encounter specialized software vendors who can't tell whether their application has a problem with the Daylight Saving Time change, the need to test the time adjustment beforehand sometimes arises. In this case, the first thing which comes to mind is to change the system clock without modifying the time zone. Although this can do the job, it doesn't test the DST adjustment properly, and it affects the overall operating environment, system-wide.

To do things more cleanly, we can modify the time zone definition in use directly. This tests the real automatic DST adjustment, while not changing the system clock (and not impacting other software or services). Say we want to move the DST change for the Europe/Paris time zone to one day before the official date for summer 2008. First, obtain the current setting:

# zdump -v `rtc` | \
   awk '$6 ~ /'"`date '+%Y'`"'/ && $12 !~ /'"`date '+%H:%M:%S'`"'/ {print $0}'
Europe/Paris  Sun Mar 30 00:59:59 2008 UTC = Sun Mar 30 01:59:59 2008 CET isdst=0
Europe/Paris  Sun Mar 30 01:00:00 2008 UTC = Sun Mar 30 03:00:00 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 00:59:59 2008 UTC = Sun Oct 26 02:59:59 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 01:00:00 2008 UTC = Sun Oct 26 02:00:00 2008 CET isdst=0

Then, get the time zone source file for the Europe/Paris geographic definition and adapt it by writing an exception rule for the current year. Because the current definition for Europe/Paris is based on the EU rule, our modification uses this rule name.

# mkdir -p /tmp/zoneinfo/src
# cp /usr/share/lib/zoneinfo/src/europe /tmp/zoneinfo/src
# diff /usr/share/lib/zoneinfo/src/europe /tmp/zoneinfo/src/europe
1073a1074,1076
> # Test the DST adjustment one day in advance, for March 2008 only.
> # Rule        NAME    FROM    TO      TYPE    IN      ON      AT      SAVE    LETTER/S
> Rule          EU      2008    only    -       Mar     29       2:00   1:00    S

Last, compile the updated time zone definition, put it in place, and verify the new DST date.

# zic -d /tmp/zoneinfo /tmp/zoneinfo/src/europe
# cp /tmp/zoneinfo/Europe/Paris /usr/share/lib/zoneinfo/Europe
# zdump -v `rtc` | \
   awk '$6 ~ /'"`date '+%Y'`"'/ && $12 !~ /'"`date '+%H:%M:%S'`"'/ {print $0}'
Europe/Paris  Sat Mar 29 00:59:59 2008 UTC = Sat Mar 29 01:59:59 2008 CET isdst=0
Europe/Paris  Sat Mar 29 01:00:00 2008 UTC = Sat Mar 29 03:00:00 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 00:59:59 2008 UTC = Sun Oct 26 02:59:59 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 01:00:00 2008 UTC = Sun Oct 26 02:00:00 2008 CET isdst=0
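When comparing the before and after listings, it can help to reduce the zdump output to the moment DST actually starts. The helper below is a sketch of my own (the name dst_starts is hypothetical): it assumes the zdump -v line layout shown above, with isdst= as the last field, and an awk that accepts -v (nawk or /usr/xpg4/bin/awk on Solaris).

```shell
# Hypothetical helper: read `zdump -v` output on standard input and print the
# first local instant of each DST period for the given year, i.e. every line
# with isdst=1 that directly follows a line with isdst=0.
dst_starts() {
    awk -v year="$1" '
        $6 == year && $NF == "isdst=1" && prev == "isdst=0" {
            print $9, $10, $11, $12, $13, $14
        }
        { prev = $NF }
    '
}

# Fed with the modified listing captured above.
result=$(dst_starts 2008 <<'EOF'
Europe/Paris  Sat Mar 29 00:59:59 2008 UTC = Sat Mar 29 01:59:59 2008 CET isdst=0
Europe/Paris  Sat Mar 29 01:00:00 2008 UTC = Sat Mar 29 03:00:00 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 00:59:59 2008 UTC = Sun Oct 26 02:59:59 2008 CEST isdst=1
Europe/Paris  Sun Oct 26 01:00:00 2008 UTC = Sun Oct 26 02:00:00 2008 CET isdst=0
EOF
)
echo "$result"
```

On a live system the same function would be fed directly from the zdump -v command.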

Once the fake DST adjustment is validated at the software level, clean things up a little.

# zic /usr/share/lib/zoneinfo/src/europe
# rm -r /tmp/zoneinfo

Thursday 13 March 2008

Update A Corrupted GRUB Boot Archive, With SVM

In a previous discussion about the GRUB boot archive and how it can be regenerated, I mentioned that it would not be as easy when the root file system uses the md driver. I will now show two different methods to do the same thing when the root file system is built upon an SVM mirror (RAID-1):

  1. Unmirror the root file system only.
  2. Unmirror the entire system, i.e. all devices.

Note: Although this test case was done using Solaris 10 8/07 in a virtual machine built with VirtualBox on the latest Solaris Express Community Edition, the instructions should be valid for Solaris 10 1/06 and later.

Initial setup

As we can see, the system uses only a root file system and a swap device. Both are encapsulated with SVM.

# df -k -F ufs
Filesystem     kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0 6147798 3455578 2630743      57%  /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metastat -c d0 d1
d0               m  6.0GB d10 d20
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
d1               m  2.0GB d11 d21
    d11          s  2.0GB c0d0s1
    d21          s  2.0GB c1d1s1

Unmirror the root file system only

The idea is to boot into the GRUB failsafe mode, select the first side of the mirror, and modify the system and vfstab configuration files to use the correct device path. For the system file, this means actually removing the rootdev:/pseudo/md@0:0,0,bl entry, not just commenting it out. For the vfstab file, this means replacing the root file system metadevice path /dev/md/[r]dsk/d0 with the first underlying device path, i.e. /dev/[r]dsk/c0d0s0. Last, regenerate the boot archive on the alternate root path.

[...]
Booting to milestone "milestone/single-user:default".
Configuring devices.
Searching for installed OS instances...
/dev/dsk/c0d0s0 is under md control, skipping.
/dev/dsk/c1d1s0 is under md control, skipping.
No installed OS instance found.

Starting shell.
# fsck /dev/rdsk/c0d0s0
# mount -F ufs /dev/dsk/c0d0s0 /a
# cp /a/etc/system /a/etc/system.bckp
# cp /a/etc/vfstab /a/etc/vfstab.bckp
# TERM=vt100 vi /a/etc/system
# TERM=vt100 vi /a/etc/vfstab
# bootadm update-archive -R /a
# umount /a
# fsck /dev/rdsk/c0d0s0
# reboot
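The two vi sessions in the listing above could also be scripted. The following sketch is not what was run during this test case: it works on scratch copies with made-up sample content, assumes that sed and an awk are available in the environment at hand, and the device names are the ones from this setup.

```shell
workdir=$(mktemp -d)

# Illustrative stand-ins for /a/etc/system and /a/etc/vfstab.
printf '%s\n' '* sample fragment' 'rootdev:/pseudo/md@0:0,0,bl' > "$workdir/system"
printf '%s\n' \
    '/dev/md/dsk/d1 - - swap - no -' \
    '/dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no -' > "$workdir/vfstab"

# system: delete the rootdev entry entirely (commenting it out is not enough).
new_system=$(sed '/^rootdev:/d' "$workdir/system")

# vfstab: point the root entry at the first submirror's slice, block and raw.
# Only the line whose first field is exactly the root metadevice is changed.
new_vfstab=$(awk '$1 == "/dev/md/dsk/d0" {
                      $1 = "/dev/dsk/c0d0s0"; $2 = "/dev/rdsk/c0d0s0"
                  }
                  { print }' "$workdir/vfstab")

printf '%s\n' "$new_system" "$new_vfstab"
rm -rf "$workdir"
```

Matching on the whole first field, rather than on a substring, avoids touching other metadevices such as the swap entry.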

Then, boot into the milestone/multi-user:default level and detach the second half of the mirror, since the first half corresponds to the valid and updated underlying device. Next, restore the original configuration files, which refer to the encapsulated metadevices, and reboot.

# df -k -F ufs
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0d0s0      6147798 3458810 2627511    57%    /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metastat -c d0
d0               m  6.0GB d10 d20
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
# metadetach d0 d20
d0: submirror d20 is detached
# metastat -c d0
d0               m  6.0GB d10
    d10          s  6.0GB c0d0s0
# cp /etc/system.bckp /etc/system
# cp /etc/vfstab.bckp /etc/vfstab
# shutdown -y -i 6 -g 0

After the reboot, just reattach the second half of the mirror, and wait for complete synchronization to be fully redundant again.

# df -k -F ufs
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0       6147798 3458714 2627607    57%    /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metattach d0 d20
d0: submirror d20 is attached
# metastat -c d0
d0               m  6.0GB d10 d20 (resync-29%)
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0

Unmirror the entire system, i.e. all devices

The idea is exactly the same as for unmirroring the root file system only, but adapting the vfstab file to change the swap entry, too. (So, I didn't reproduce the code listing here.)

Then, boot into the milestone/single-user:default level by modifying the corresponding GRUB entry as follows: kernel /platform/i86pc/multiboot -s. Completely delete all the metadevice and metadb configurations to clear the SVM settings. Last, continue into the milestone/multi-user:default level to boot unmirrored.

# metaclear -f -r d0 d1
# metadb -f -d  c1d0s4 c1d0s4
# ^D

Now, the system must be fully encapsulated with SVM again. Please refer to the online Sun Documentation, or to some past entries on this subject, depending on the system's architecture: SPARC systems, or x86 platforms.
