blog'o thnet

Saturday 1 November 2008

System V IPC Now Managed By Resource Controls

As of Solaris 10, all IPC facilities are either automatically configured or can be controlled by resource controls. At the same time, they received new default values, where applicable.

As an example, we will assume that we need to raise the limit on the number of shared memory segments that can be created, and that the new default (128) is not enough either. Before Solaris 10, you had to set the shmsys:shminfo_shmmni tunable parameter in the /etc/system kernel configuration file, which imposed a system-wide limit and required a reboot. This parameter is now marked as Obsolete or Have Been Removed, and its use is clearly deprecated.

To raise the corresponding limit to 256 shared memory segments, we now have to use the project.max-shm-ids resource control, which is enforced at the project level. The idea is to attach the appropriate resource control to a project, then execute the program in the context of this project. One method is to create a project (in the project(4) database) and to populate the extended user attributes (in the user_attr(4) database) so that the new project becomes the default project for the user. Alternatively, it is possible to skip the extended user attribute entirely and to select the project explicitly through the newtask(1) command (or the login(1), cron(1M), and su(1M) programs, or the setproject(3PROJECT) function). But the simplest and least intrusive method is certainly to make the project the default one for a user account directly. Here is how to do so.

By default, no message is sent to the syslog daemon when a resource control is exceeded. To see an appropriate message in the messages log file, you must first enable the syslog action globally for the resource control in question (the default level is notice).

# rctladm -e syslog project.max-shm-ids
# rctladm -l project.max-shm-ids
project.max-shm-ids   syslog=notice   [ no-basic deny count ]

When the limit on the number of shared memory segments is reached, a message similar to the following is written to the log file:

# grep rctl /var/adm/messages
/var/adm/messages:Oct 21 16:47:29 hostname genunix: [ID 883052 kern.notice] privileged rctl project.max-shm-ids (value 128) exceeded by project 3
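
To get a quick overview of which projects hit which limits, the relevant fields can be pulled out of such messages. The following is a minimal sketch, assuming the message layout shown above; the function name rctl_exceeded is made up for this illustration.

```shell
# Print "<rctl name> <limit value> <project ID>" for every matching log line.
# Reads the files given as arguments, or standard input when there are none.
rctl_exceeded() {
    sed -n 's/.*privileged rctl \([^ ]*\) (value \([0-9]*\)) exceeded by project \([0-9]*\).*/\1 \2 \3/p' "$@"
}

# Demonstration with the message shown above.
printf '%s\n' 'Oct 21 16:47:29 hostname genunix: [ID 883052 kern.notice] privileged rctl project.max-shm-ids (value 128) exceeded by project 3' | rctl_exceeded
# prints: project.max-shm-ids 128 3
```

On a live system, the same function could be pointed at the log file directly, for example rctl_exceeded /var/adm/messages.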

Here is the definition of the new project, and its configuration.

# getent project user.username
user.username:1000:Project To Increase The Limit Of SHM Segments:::project.max-shm-ids=(priv,256,deny)
#
# projects -l user.username
user.username
      projid : 1000
      comment: "Project To Increase The Limit Of SHM Segments"
      users  : (none)
      groups : (none)
      attribs: project.max-shm-ids=(priv,256,deny)

When a project is named user.username, it is automatically set as the default project for the user username, without the need to populate the extended user attributes database. Check that the project is set as the default project for the account username.

# id -p username
uid=100(username) gid=100(groupname) projid=1000(user.username)
#
# projects -d username
user.username

After logging in as username, the program progname is launched. We can confirm that its shared memory segments are created in the context of the project user.username, and we can consult the per-project statistics report.

# ipcs -mJ
IPC status from  as of Wed Oct 29 11:39:59 CET 2008
T         ID KEY        MODE    OWNER     GROUP       PROJECT
Shared Memory:
m 1409286255   0 --rw-rw-rw- username groupname user.username
m  469762152   0 --rw-rw-rw- username groupname user.username
m         56   0 --rw-rw-rw- username groupname user.username
#
# prstat -n5 -cJ
   PID USERNAME  SIZE   RSS STATE PRI NICE    TIME  CPU PROCESS/NLWP
  3704 username  373M  284M cpu24   2   10 0:07:37 2.1% progname/26
  6785 username  285M  196M sleep  29   10 0:04:13 1.1% progname/26
  4480 username  785M  697M sleep  29   10 0:11:40 1.1% progname/26
  5836 username  293M  204M sleep  29   10 0:06:31 1.0% progname/26
  7635 username  277M  188M sleep  29   10 0:01:00 0.9% progname/26
PROJID    NPROC  SWAP   RSS MEMORY      TIME  CPU PROJECT
  1000       26 6472M 6333M    26%   3:57:24  23% user.username
     1       17   41M   87M   0.4%   2:39:58 0.0% user.root
     0       43  184M  267M   1.1%   4:07:25 0.0% system
     3        4 5856K   11M   0.0%   0:00:00 0.0% default
Total: 90 processes, 916 lwps, load averages: 4.41, 2.36, 1.04

Last, we can verify the new setting for one progname instance. For example for PID 3704:

# prctl -n project.max-shm-ids 3704
process: 3704: bin/progname 54 80 -Xmx192m
NAME    PRIVILEGE       VALUE    FLAG   ACTION      RECIPIENT
project.max-shm-ids
        privileged        256       -   deny                -
        system          16.8M     max   deny                -

The resource management facility can do much more than just tune IPC settings, such as managing CPU usage and controlling physical memory. It is more fine-grained than what was in place before Solaris 10, and it no longer requires a reboot.

As a last word, we can note that there are command-line tools to help create and manage projects and extended user attributes for locally stored databases: projadd(1M) and projmod(1M), and useradd(1M) and usermod(1M), respectively. But since the information sources were hosted in NIS and LDAP network directories, we did not use them for this test case.
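
For a locally stored project database, the project entry shown earlier could be created with projadd(1M) along the following lines. This is only a sketch under that assumption (it was not used in the test case above, where NIS and LDAP held the data). The helper names run and create_shm_project are hypothetical, and by default the script only prints the command it would run.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Create a project named user.<login> carrying a project.max-shm-ids limit.
# Arguments: login name, project ID, maximum number of shared memory IDs.
create_shm_project() {
    run projadd -p "$2" \
        -c "Project To Increase The Limit Of SHM Segments" \
        -K "project.max-shm-ids=(priv,$3,deny)" \
        "user.$1"
}

create_shm_project username 1000 256
```

Setting DRY_RUN=0 (as root, on a Solaris 10 system) would actually run projadd. A program can then be started in that project with newtask -p user.username progname, without relying on the user's default project.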

Sunday 26 October 2008

Solaris vs. RHEL Costs And Features Comparisons

Clearly, the costs involved in running Solaris and RHEL platforms are not well understood, and comparisons generally favor GNU/Linux environments. This is (most of the time) unfounded, since such comparisons tend to be based on personal user experience, which is in fact far different from running lots of systems in high-demand production data centers.

Here are some interesting readings on these subjects--costs and features analysis--from:

YMMV for sure, but I personally think that Solaris costs are overestimated, and its features are mostly unknown, or at least underused... but this is a very large and hot topic nowadays, I know.

Update #1 (2008-11-14): Go read the interesting comment update from Jim Laurent.

Update #2 (2008-12-02): Go read Joerg Moellenkamp's entry about similar points.

Update #3 (2008-12-03): Go read this article appearing in Computerworld.com.

Saturday 18 October 2008

Discrepancies Between df And du Outputs

As an SA, it is not uncommon to regularly receive questions about big differences between the du and df outputs on a UFS file system. (For ZFS-specific considerations, please see the ZFS FAQ.)

The du utility reports the sum of space allocated to all files in the file hierarchy rooted in the directory plus the space allocated to the directory itself. The df utility reports the amount of disk space occupied by a mounted file system.

When a file is removed from the file system, i.e. is unlinked (the hard link count goes to zero), the space belonging to this file is no longer counted by the du tool, but it is still reported as used by the df utility until all references to it (open file descriptors) are closed. In order to find the guilty process, one can follow the information found in the SunManagers Frequently Asked Questions. Here is an example of such an investigation, using a slightly different method to find the process currently holding an open descriptor to the deleted file.

Find the file which has been unlinked through the procfs interface:

# find /proc/*/fd \( -type f -a ! -size 0 -a -links 0 \) -print | xargs \ls -li
 415975 --w-------   0 user  group  2125803025 Oct 15 23:59 /proc/1252/fd/3

Optionally, get more details about the process holding it:

# pargs -c 1252
1252:   rvd.basic -reliability 5 -listen tcp:9876 -logfile /path/to/log/rvd_9876.l
argv[0]: rvd.basic
argv[1]: -reliability
argv[2]: 5
argv[3]: -listen
argv[4]: tcp:9876
argv[5]: -logfile
argv[6]: /path/to/log/rvd_9876.log

Check whether you can make sense of the content of the unlinked file:

# tail /proc/1252/fd/3
-------------------------------------------------------------------------------
2008-10-15 23:59:32.002116 - [MSG] BBG_Transmitter_class.cc, line 792 (thread 25087:4)
[4060] Sent a heartbeat
-------------------------------------------------------------------------------
BBG_Transmitter_class.cc: [4111] No activity detected. Send a Heartbeat message
-------------------------------------------------------------------------------
2008-10-15 23:59:32.134829 - [MSG] BBG_Transmitter_class.cc, line 1138 (thread 25087:4)
[4065] Heartbeat acknowledged by Bloomberg

You can correlate the size of the removed, but still referenced, file with the gap between the space reported by the du and df tools:

# df -k /path/to
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d5       6017990 5874592   83219    99%    /path/to
# du -sk /path/to
3791632 /path/to
# echo "(5874616-3791632)*1024" | bc
2132975616
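
The same arithmetic can be wrapped in a small helper. This is a minimal sketch: the function names space_gap_bytes and fs_gap_bytes are made up for this illustration, and fs_gap_bytes assumes that df -k prints one header line followed by a single data line for the file system.

```shell
# Print the gap, in bytes, between what df reports as used and what du can
# account for. Both arguments are in kilobytes, as printed by df -k and du -sk.
space_gap_bytes() {
    echo $(( ($1 - $2) * 1024 ))
}

# Compute the gap for a mounted file system given its mount point.
# Assumption: the "used" figure is the third field of the second df line.
fs_gap_bytes() {
    used_kb=$(df -k "$1" | awk 'NR == 2 { print $3 }')
    du_kb=$(du -sk "$1" | awk '{ print $1 }')
    space_gap_bytes "$used_kb" "$du_kb"
}

# Figures used in the bc calculation above.
space_gap_bytes 5874616 3791632
# prints: 2132975616
```

A gap close to the size of a file found with a link count of zero is a strong hint that this file explains the difference.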

So, we have now found the ~2GB log file which was still held open by a process. There are two ways to get the space back:

  1. Truncate the unlinked file (quick workaround).
  2. Simply restart the corresponding program properly (better option).

Use the solution which best fits your needs in your environment.
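
The truncation workaround can be sketched as follows. The function name truncate_in_place is hypothetical, and the usage line reuses the PID and descriptor number from the example above.

```shell
# Truncate a file in place, keeping the inode and any open descriptors valid.
# The path may be a procfs entry such as /proc/<pid>/fd/<n> for an unlinked file.
truncate_in_place() {
    : > "$1"
}

# Hypothetical usage with the PID and descriptor found above (run as root):
#   truncate_in_place /proc/1252/fd/3
```

Keep in mind that if the process did not open its log file in append mode, it keeps writing at its previous offset: the blocks are freed, but the file becomes sparse and its apparent size stays large. That is one more reason why a proper restart is the better option.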

Saturday 11 October 2008

Anatomy Of An Attack

Well, although I generally don't lend credibility to public FUD by talking about it, I will just point you to some really great, point-by-point explanations on the recent InfoWorld (and New York Times) publication from Paul Krill, Is Sun Solaris on its deathbed?

As Jim Grisanzio stated recently on the advocacy-discuss mailing list:

[...] More importantly, though, is that the original article not only fell flat but it was actually aggressively rejected by many in the open source community. That's an interesting shift out there. And a good one, too.

Well, I can't agree more with Jim on his points.

Update #1 (2008-10-28): Don't forget to read the interesting inputs from Joerg Moellenkamp.

Monday 6 October 2008

Fake The hostid Of A Solaris Zone, Updated

As a little follow-up to Fake The hostid Of A Solaris Zone, and regarding the discussion on the ability to change the hostid of a Solaris non-global zone, it is interesting to mention this (updated) information:

  1. The LD_PRELOAD trick proposed before is not a proper option, and is really ugly (and intrusive if you don't unset it before continuing the execution of a program).
  2. When using Solaris 8 or Solaris 9 Containers, there is a feature called Host ID Emulation, configured through the zonecfg utility, which does exactly that.
  3. Before the introduction of configurable privileges for non-global zones in Solaris 10 11/06 (a.k.a. Solaris 10 Update 3), you had to run the DTrace zhostid script (daemon) from the global zone. This is not mandatory anymore. With only the dtrace_user privilege added to the zone, you can run it directly from the non-global zone:
    # zonecfg -z ngzone set limitpriv=default,dtrace_user
    # zoneadm -z ngzone boot
    # zlogin ngzone
    [Connected to zone 'ngzone' pts/5]
    Last login: Sat Oct  4 18:57:17 on pts/5
    Sun Microsystems Inc.   SunOS 5.11      snv_99  November 2008
    # /sbin/zonename 
    ngzone
    # /usr/bin/hostid
    837d47dd
    # ./zhostid &
    [1] 21506
    # /usr/bin/hostid
    20a82f32
    # ^D
    [Connection to zone 'ngzone' pts/5 closed]
    

Monday 29 September 2008

Sun's Free Proficiency Assessment System

Free. Online. Right now. As Peter Tribble mentioned in his blog, it is clearly not perfect, and some questions (and answers) seem a little obscure at times. Nonetheless, as I have never had any (yes, any) formal training, I thought it would be interesting to try these tests, and here are my pre-assessment results:

  1. UNIX Essentials Featuring the Solaris 10 OS: I scored 37 out of 42.
  2. Sun Certified System Administrator for the Solaris 10 OS (Part 1): I scored 45 out of 48.
  3. Sun Certified System Administrator for the Solaris 10 OS (Part 2): I scored 45 out of 48.

Maybe it is time to take the official Solaris Operating System certifications? ;)

  • Sun Certified Solaris Associate (SCSAS)
  • Sun Certified System Administrator (SCSA)
  • Sun Certified Network Administrator (SCNA)
  • Sun Certified Security Administrator (SCSECA)

Well... if time permits!

Monday 22 September 2008

About GNU/Linux Software Mirroring And LVM

Here, the final aim was to provide data access redundancy through SAN storage hosted on remote sites across Wide Area Network (WAN) links. After some relatively long and painful attempts to mimic the software mirroring found on the HP-UX platform using Logical Volume Management (LVM), i.e. at the logical volume level, I finally gave up, deciding this functionality would definitely not fit my needs. Why? Here are my comments.

  1. It is not possible to provide clear and manageable storage multipathing when distinguishing between the multiple sites is important, à la the mirror across controllers feature found in Veritas VxVM on Sun Solaris systems, for example. So, managing many physical volumes along with lots of logical volumes is very complicated.
  2. There is no capability to map logical volume storage exactly onto a given physical volume.
  3. The need for a disk-based log, i.e. a persistent log. Yes, one can always pass the --corelog option when the logical volume is first created to get an in-memory log, i.e. a non-persistent log, but this requires all the copies (mirrors) to be resynchronized upon reboot. Not really viable in multi-TB environments.
  4. A write-intensive workload on a file system living on a mirrored logical volume will suffer high latency: the overhead is significant, and the time to run mostly-write jobs grows dramatically. It is also really hard to get high-level statistics; only low-level metrics seem consistent: those of the sd SCSI devices and dm- device mapper components for each path entry, not those of the multipath devices, which are the more interesting ones from the end user and SA point of view.
  5. You can't extend a mirrored logical volume, which is really a no-go per se. On that point, Red Hat support answered that this functionality may be added in a future release, and that the current state may eventually become a Request For Enhancement (RFE) if a proper business justification is provided. In the meantime, one must break the logical volume mirror copy, then rebuild it completely. Not realistic when the logical volume is made of a lot of physical extents across multiple physical volumes.
  6. An LVM configuration can end up totally blocked by itself, and not usable at all. The fact is, LVM uses persistent storage blocks to keep track of its own metadata. The metadata size is set at physical volume creation time only, and can't be changed afterward. This size defaults to 255 physical volume blocks, and the default can be adjusted from the LVM configuration file. The problem is, when this circular buffer space (stored in ASCII) fills up--such as when there are a lot of logical volumes in a mirrored environment--it is not possible to do anything more with LVM. So you can't add more logical volumes, can't add more logical volume copies... and can't delete them to try to reestablish a proper LVM configuration. Well, here are the answers given by Red Hat support to two key questions in this situation:
    • How to size the metadata, i.e. if we need to change it from the default value, how can we determine the new proper and appropriate size, and from which information?
      I am afraid but Metadata size can only be defined at the time of PV creation and there is no real formula for calculating the size in advance. By changing the default value of 255 you can get a higher space value. For general LVM setup (with less LV's and VG's) default size works fine however in cases where high number of LV's are required a custom value will be required.
    • We just want to delete all LV copies, which means to return to the initial situation and have 0 copy for all LV, i.e. only one LV per-se, in order to be able to change LVM configuration again (we can't do anything on our production server right now)?
      I discussed this situation with peer engineers and also referenced a similar previous case. From the notes of the same the workaround is to use the backup file (/etc/lvm/backup) and restore the PV's. I agree that this really not a production environment method however seems the only workaround.
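
The LVM operations discussed in the points above can be sketched as follows. This is only an illustration: the device /dev/sdX, the names vg0 and lv0, the sizes, and the helper functions are all hypothetical. The options used (pvcreate --metadatasize, lvcreate -m 1 --corelog, vgcfgrestore -f) exist in LVM2, but whether their exact syntax matches the release used here is an assumption, as is reading the support answer as a vgcfgrestore from the backup file. By default the script only prints the commands.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Initialize a physical volume with a larger metadata area than the default.
# This can only be chosen at creation time. Arguments: device, metadata size.
init_pv() {
    run pvcreate --metadatasize "$2" "$1"
}

# Create a two-way mirrored logical volume with an in-memory (non-persistent)
# log, which forces a full resynchronization after a reboot.
# Arguments: volume group, logical volume name, size.
create_core_mirror() {
    run lvcreate -m 1 --corelog -L "$3" -n "$2" "$1"
}

# Restore the metadata of a volume group from its automatic backup file.
restore_vg() {
    run vgcfgrestore -f "/etc/lvm/backup/$1" "$1"
}

init_pv /dev/sdX 16m
create_core_mirror vg0 lv0 10G
restore_vg vg0
```

Since there is no formula for the metadata size, picking a generous value when the physical volumes are created is the only real safeguard against the blocked situation described above.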

So, the production Oracle RDBMS server is finally being moved to another machine. Hum... I hope to see a better enterprise experience using the mdadm package to handle software RAID, instead of LVM mirroring (RAID-1). Maybe more about that in another blog entry?

Tuesday 16 September 2008

Announcing Solaris 10 10/08

Although this may sound a little bit confident, the long-awaited Update 6 to the Solaris 10 operating system is just around the corner. This release (scheduled to be available in mid-October) will include virtualization enhancements, including the ability for a Solaris Container to automatically update its environment when moved from one system to another, Logical Domains support for dynamically reconfigurable disk and network I/O, and paravirtualization support when Solaris 10 is used as a guest OS in Xen-based environments such as Sun xVM Server. Solaris 10 10/08 also includes support for the latest systems from Sun and other vendors, such as those based on the Intel Xeon Processor 7400 Series.

This release will be the very first Solaris release able to boot from ZFS and use it as its root file system, as can already be found in OpenSolaris or Solaris Express Community Release today.

Check the What’s New web page for Solaris, and consult the Solaris Media Gallery videos for more information.

Update #1 (2008-10-14): Don't forget to consult the must read What's New in Solaris 10 10/08? from the San Antonio OpenSolaris User Group.

Update #2 (2008-10-31): Get yours, and go read the official What's New in the Solaris 10 10/08 Release page.

Friday 16 May 2008

Comparison: EMC PowerPath vs. GNU/Linux dm-multipath

I will present some notes about the use of multipath solutions on Red Hat systems: EMC PowerPath and GNU/Linux dm-multipath. While reading these notes, keep in mind that they are based on tests done when pressure was very high to put new systems in production, so lack of time resulted in less complete tests than expected. These tests were done more than a year ago, and so before the release of RHEL4 Update 5 and some of the RHBAs related to both LVM and dm-multipath technologies.

Keep in mind that without purchasing an appropriate EMC license, PowerPath can only be used in failover mode (active-passive mode). Access through multiple paths at once is not supported in this case: no round-robin and no I/O load balancing, for example.

EMC PowerPath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage providers.
  3. Support for most UNIX and Unix-like platforms.
  4. Even without a valid license, it still works in degraded mode (failover).
  5. Is not sensitive to SCSI LUN renumbering: it adapts the multiple sd devices (different paths to a given device) behind the multipath definition of the emcpower device accordingly.
  6. Easily provides the ID of the SAN storage.

Drawbacks

  1. Not integrated with the operating system (which generally has its own solution).
  2. The need to force an RPM re-installation in case of a kernel upgrade on RHEL systems (due to the fact that kernel modules are stored in a path containing the exact major and minor versions of the installed (booted) kernel).
  3. Non-automatic update procedure.

GNU/Linux device-mapper-multipath

Advantages

  1. Not specific to the SAN Host Bus Adapter (HBA).
  2. Support for multiple and heterogeneous SAN storage providers.
  3. Well integrated with the operating system.
  4. Automatic update using RHN (you must be a licensed and registered user in this case).
  5. No additional license cost.

Drawbacks

  1. Only available on GNU/Linux systems.
  2. Configuration (files and keywords) is very tedious and difficult.
  3. Without the use of LVM (Logical Volume Management), it is not able to follow SCSI LUN renumbering! Even with LVM, be sure not to have blacklisted the newly discovered SCSI devices (sd).

Last, please find some interesting documentation on the subject below:

Friday 2 May 2008

PHP APC Extension Bug With Optimized Open Source Software Stack

To easily manage LDAP accounts (and general LDAP entries in fact), we have created a Solaris Zone and installed the excellent Cool Stack bundle to host the LAM (LDAP Account Manager) management web tool. But after upgrading the Cool Stack to version 1.2, we encountered a very annoying problem, mostly freezing web pages, which generally ended with restarting the Apache web server provided by the Cool Stack. After some troubleshooting, we discovered that this behavior was introduced by a bug in the APC-3.0.14 module bundled with the updated php-5.2.4 scripting software in this version of the Cool Stack.

Luckily, the bug was already fixed and a new version of the APC extension for PHP is available for download (in fact, just replace the original apc.so module with the new one). All the Cool Stack related problems, associated fixes and instructions are listed on the Cool Stack 1.2 Patches page: be sure to keep in sync if you are a Cool Stack consumer.
