blog'o thnet

Sunday 13 March 2011

Customized Solaris installation and patching experience

I recently faced a curious problem when trying to patch an Alternate Boot Environment (ABE) created with Live Upgrade (LU) on Solaris 10. Although I initially thought it was an LU problem, the cause turned out to be related to the patches to be applied and the way Solaris was installed.

Assuming the ABE is named s10u9, I first tried to apply the Critical Patch Updates the new way, i.e. through the -B switch of the installcluster script, which quickly failed like this:

# cd /net/jumpstart/export/media/patch/cpu/10_Recommended_CPU_2010-10
# ./installcluster --apply-prereq --s10cluster
[...]
# ./installcluster -B s10u9 --s10cluster
ERROR: Patch set cannot be installed from a live boot environment without zones
       support, to a target boot environment that has zones support.

So, I tried to apply the patches using the luupgrade command, but it failed with a very similar message:

# ptime luupgrade -t -n s10u9 -s /net/jumpstart/export/media/patch/cpu/10_Recommended_CPU_2010-10/patches `cat patch_order`
Validating the contents of the media .
The media contains 198 software patches that can be added.
All 198 patches will be added because you did not specify any specific patches to add.
Mounting the BE .
ERROR: The boot environment  supports non-global zones. The current boot
environment does not support non-global zones. Releases prior to Solaris 10 cannot be
used to maintain Solaris 10 and later releases that include support for non-global zones.
You may only execute the specified operation on a system with Solaris 10 (or later)
installed.

The fact is, the Primary Boot Environment (PBE) is a Solaris 10 installation. So why complain that the PBE is an older release? According to the OTN discussion forums and the README file which came with the Critical Patch Updates release, there is a known bug which can end this way: it occurs when /etc/zones/index in the inactive boot environment has an incorrect state recorded for the global zone. The correct state is installed. So, let's check this one:

# lumount -n s10u9
/.alt.s10u9
# grep "^global:configured:" /.alt.s10u9/etc/zones/index
# luumount -n s10u9

So no luck here. But wait: if the PBE is a customized Solaris 10 installation, it may be that the installed packages lack the Zones feature, which installcluster and luupgrade -t seem to require in order to decide whether the PBE is a proper (usable) Solaris 10 installation. So, I just installed the missing packages from the install media...

# mount -r -F hsfs `lofiadm -a /net/jumpstart/export/media/iso/sol-10-u9-ga-sparc-dvd.iso` /mnt
# pkginfo -d /mnt/Solaris_10/Product | nawk '$2 ~ /zone/ || $2 ~ /pool$/ {print $0}'
application SUNWluzone                       Live Upgrade (zones support)
system      SUNWpool                         Resource Pools
system      SUNWzoner                        Solaris Zones (Root)
system      SUNWzoneu                        Solaris Zones (Usr)
# yes | pkgadd -d /mnt/Solaris_10/Product SUNWluzone SUNWzoner SUNWzoneu SUNWpool
[...]
# umount /mnt
# lofiadm -d /dev/lofi/1

... and this should work now:

# ./installcluster -B s10u9 --s10cluster
Setup ...
CPU OS Cluster 2010/10 Solaris 10 SPARC (2010.10.06)
Application of patches started : 2011.02.07 11:17:08

Applying 120900-04 (  1 of 198) ... skipped
[...]
Installation of patch set to alternate boot environment complete.

Please remember to activate boot environment s10u9 with luactivate(1M)
before rebooting.
Install log files written :
  /.alt.s10u9/var/sadm/install_data/s10s_rec_cluster_short_2011.02.07_11.17.08.log
  /.alt.s10u9/var/sadm/install_data/s10s_rec_cluster_verbose_2011.02.07_11.17.08.log

And it is... The question is, why is the Zone feature necessary and mandatory in this case?
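
To catch this on other customized installations before patching, the package check can be scripted. The following is only a minimal sketch, assuming a POSIX shell (ksh or bash rather than the legacy /bin/sh): the helper name missing_pkgs and the PKG_CHECK override are hypothetical, and the package list is simply the one installed above.

```shell
# Packages the post had to add before patching the ABE worked.
ZONE_PKGS="SUNWluzone SUNWzoner SUNWzoneu SUNWpool"

# Hypothetical helper: print every package from the argument list that the
# checker reports as not installed. PKG_CHECK defaults to "pkginfo -q",
# which exits non-zero when a package is absent; it can be overridden
# (an assumption made here for testing on a non-Solaris host).
missing_pkgs() {
    for pkg in "$@"; do
        ${PKG_CHECK:-pkginfo -q} "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Usage on the PBE, before running installcluster or luupgrade -t:
#   missing_pkgs $ZONE_PKGS
```

An empty output means all four packages are present on the PBE; anything printed is a candidate for pkgadd from the install media.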

Thursday 7 October 2010

Live Upgrading To Solaris 10 9/10

If you try to update to the latest Solaris 10 Update (U9), one new step is now required to successfully luupgrade to the desired Update. As mentioned in the Oracle Solaris 10 9/10 Release Notes, a new Auto Registration mechanism has been added to this release to facilitate registering the system using your Oracle support credentials.

So, if you try the following classic luupgrade incantation, it will fail with the reported message:

# luupgrade -u -n s10u9 -s /mnt -j /var/tmp/profile
System has findroot enabled GRUB
No entry for BE  in GRUB menu
Copying failsafe kernel from media.
61364 blocks
miniroot filesystem is 
Mounting miniroot at 
ERROR: The auto registration file <> does not exist or incomplete.
       The auto registration file is mandatory for this upgrade.
       Use -k  argument along with luupgrade command.

So, you now need to set the Auto Registration choice as a mandatory parameter. Here is what it looks like now:

# echo "auto_reg=disable" > /var/tmp/sysidcfg
# luupgrade -u -n s10u9 -s /mnt -j /var/tmp/profile -k /var/tmp/sysidcfg
System has findroot enabled GRUB
No entry for BE  in GRUB menu
Copying failsafe kernel from media.
61364 blocks
miniroot filesystem is 
Mounting miniroot at 
#######################################################################
 NOTE: To improve products and services, Oracle Solaris communicates
 configuration data to Oracle after rebooting.

 You can register your version of Oracle Solaris to capture this data
 for your use, or the data is sent anonymously.

 For information about what configuration data is communicated and how
 to control this facility, see the Release Notes or
 www.oracle.com/goto/solarisautoreg.

 INFORMATION: After activated and booted into new BE ,
 Auto Registration happens automatically with the following Information

autoreg=disable
#######################################################################
Validating the contents of the media .
The media is a standard Solaris media.
The media contains an operating system upgrade image.
The media contains  version <10>.
Constructing upgrade profile to use.
Locating the operating system upgrade program.
Checking for existence of previously scheduled Live Upgrade requests.
Creating upgrade profile for BE .
Checking for GRUB menu on ABE .
Saving GRUB menu on ABE .
Checking for x86 boot partition on ABE.
Determining packages to install or upgrade for BE .
Performing the operating system upgrade of the BE .
CAUTION: Interrupting this process may leave the boot environment unstable
or unbootable.
Upgrading Solaris: 100% completed
Installation of the packages from this media is complete.
Restoring GRUB menu on ABE .
Updating package information on boot environment .
Package information successfully updated on boot environment .
Adding operating system patches to the BE .
The operating system patch installation is complete.
ABE boot partition backing deleted.
PBE GRUB has no capability information.
PBE GRUB has no versioning information.
ABE GRUB is newer than PBE GRUB. Updating GRUB.
GRUB update was successfull.
INFORMATION: The file  on boot
environment  contains a log of the upgrade operation.
INFORMATION: The file  on boot
environment  contains a log of cleanup operations required.
INFORMATION: Review the files listed above. Remember that all of the files
are located on boot environment . Before you activate boot
environment , determine if any additional system maintenance is
required or if additional media of the software distribution must be
installed.
The Solaris upgrade of the boot environment  is complete.
Creating miniroot device
Configuring failsafe for system.
Failsafe configuration is complete.
Installing failsafe
Failsafe install is complete.

I am not sure this will ease the upgrade path to this Update, even if there is nothing really wrong with it. I just think it could have been less intrusive.
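
If the file passed with -k is reused between upgrades, the keyword can be added only when it is missing. This is a small sketch under assumptions: the helper name ensure_auto_reg is hypothetical, a POSIX shell is assumed, and the only keyword handled is the auto_reg=disable line used above.

```shell
# Hypothetical helper: make sure the file handed to "luupgrade -k" carries an
# auto_reg keyword; append "auto_reg=disable" only when none is present.
ensure_auto_reg() {
    file=$1
    if [ -f "$file" ] && grep '^auto_reg=' "$file" >/dev/null; then
        return 0            # keyword already set, leave the file untouched
    fi
    echo "auto_reg=disable" >> "$file"
}

# Usage, with the path from the example above:
#   ensure_auto_reg /var/tmp/sysidcfg
```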

Sunday 26 September 2010

Deactivate The VMware HGFS File System

I recently faced a problem where our backup administrator was unable to remotely browse the root (/) file system on Solaris when the system was installed as a guest on a VMware ESX hypervisor. After digging around the system, I found that the Host-Guest File System (HGFS) left the HP DataProtector agent unable to stat the /hgfs pseudo-file system, as can be seen in the /var/opt/omni/log/debug.log debug log file:

[...]
09/23/10 17:06:36  FSBRDA.11618.0 ["da/bda/solaris.c /main/blr_dp61/10":1324] A.06.11 b243
SolStatObj: /hgfs lstat failed! errno 5

Although it is not a bug per se, installing the VMware Tools simply enables the HGFS module regardless of the virtualization stack: VMware ESX doesn't provide access to the Shared Folders facility, although VMware Workstation does. So in my case, I can just disable it without losing any useful functionality.

Although it would be advisable to have a configuration option at the VMware Tools level, I didn't find one. Some may argue that writing a little script to unmount the /hgfs file system at boot is good enough, but I find it painful and not very elegant. In fact, I prefer to disable it at the kernel module level using the module's configuration file:

# cp -p /kernel/drv/vmhgfs.conf /kernel/drv/vmhgfs.conf.orig
/* Edit, and comment the vmhgfs line. */
# cat /kernel/drv/vmhgfs.conf
# name= parent="pseudo" instance=0;
#name="vmhgfs" parent="pseudo" instance=0;

Last, be sure to recreate the GRUB boot archive before rebooting the system. After that, all the backup operations went well again.

# bootadm update-archive
# shutdown -y -i 6 -g 0
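
The manual edit of vmhgfs.conf can also be done non-interactively. The filter below is a minimal sketch: the function name comment_out_vmhgfs is hypothetical, and it assumes the active line starts exactly with name="vmhgfs" as in the listing above.

```shell
# Hypothetical filter: comment out the active vmhgfs driver line and pass
# every other line (including already commented ones) through unchanged.
comment_out_vmhgfs() {
    sed 's/^name="vmhgfs"/#&/'
}

# Usage, keeping the backup copy made above as the input:
#   comment_out_vmhgfs < /kernel/drv/vmhgfs.conf.orig > /kernel/drv/vmhgfs.conf
```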

Sunday 29 August 2010

Apropos Solaris

John Fowler (Oracle Executive Vice President for Server and Storage Systems) held an on-line webcast on August 10 on the strategy for hardware servers based on SPARC and x86, and the formalization of the upcoming release of Solaris 11 in 2011.

This post only aims to summarize the main points; the complete slides of the presentation are available at the Oracle web site.

  1. Message #1: SPARC is alive and will continue. Solaris is alive and will continue. Both actively.
  2. Message #2: What is interesting here is that these are not only intentions: it is a real roadmap covering up to five years for the well-known ex-Sun products. Oracle clearly has some strong plans for Solaris and the SPARC and x86 platforms, and has just begun to speak publicly about them. We will probably see more about them all at Oracle OpenWorld in a few weeks.

The points are:

  • A roadmap for SPARC and Solaris up to 2015.
  • SPARC performance will double every two years:
    • Cores: 128 (32 in 2010).
    • Threads: 16384 (512 in 2010).
    • Memory capacity: 64TB (4TB in 2010).
    • Logical Domains: 256 (128 in 2010).
    • Java Ops per second: 50000 (5000 in 2010).
  • Very SPARC oriented: it seems that there will only be one SPARC brand at the end of 2015.
  • Two big families of SPARC servers: lots of threads known as the T-Series, lots of sockets known as M-Series.
  • At least one Update to Solaris 10 around 2010Q3, a beta program of Solaris 11 known as Solaris 11 Express due in late 2010, then Solaris 11 due in 2011 and up to 2015.

Solaris 11 will be based on the now closed OpenSolaris distribution, and will include:

  • Image Packaging System (IPS): totally new packaging system fully integrated with ZFS and Boot Environment Administration (aimed at replacing Live Upgrade).
  • Crossbow network virtualization stack.
  • ZFS de-duplication, and lots of recent optimizations and functionalities.
  • CIFS file services : in-kernel implementation of CIFS.
  • Enhanced Gnome user environment.
  • Updated installer and auto network installer ("AI", aimed at replacing JumpStart)
  • Network Automagic configuration.
  • And many more (I heard Solaris 10 BrandZ...).

Sunday 2 May 2010

Live Upgrading When Diagnostics Mode Is Enabled

Recently, we faced an interesting problem when using Live Upgrade on some of our SPARC servers (with lots of non-global zones hosted on SAN devices). Here are the basic steps we generally follow when using LU:

  1. Update the Live Upgrade functionality according to the Article ID #1004881.1, Solaris Live Upgrade Software: Patch Requirements.
  2. Create the ABE.
  3. Upgrade the ABE with an operating system image (and test the upgrade according to a JumpStart profile).
  4. Apply a determined Recommended Patch Cluster to the ABE.
  5. Activate the ABE to be the next booted BE.
  6. Reboot on the new BE, and run the post-configuration steps, if any.

In some circumstances, even though all the steps went pretty well and the activation of the new BE was OK (we traced its activities), we ended up rebooting on the old BE:

# lustatus
Boot Environment     Is       Active Active    Can    Copy
Name                 Complete Now    On Reboot Delete Status
-------------------- -------- ------ --------- ------ -------
s10u4                yes      yes    yes       no     -
s10u8                yes      no     no        yes    -
# lucurr
s10u4
# luactivate -n s10u8
[...]
# lustatus
Boot Environment     Is       Active Active    Can    Copy
Name                 Complete Now    On Reboot Delete Status
-------------------- -------- ------ --------- ------ -------
s10u4                yes      yes    no        no     -
s10u8                yes      no     yes       no     -
# shutdown -y -g 0 -i 6
[...]
# lucurr
s10u4

Ouch. After a bit of digging, and seeing nothing wrong on the console via the Service Processor, we found the following message in the log of the legacy init script run by LU when rebooting (at shutdown time, more precisely):

# cat /var/svc/log/rc6.log
[...]
Executing legacy init script "/etc/rc0.d/K62lu".
Live Upgrade: Deactivating current boot environment <s10u4>.
zlogin: login allowed only to running zones (zonename1 is 'installed').
zlogin: login allowed only to running zones (zonename2 is 'installed').
Live Upgrade: Executing Stop procedures for boot environment <s10u4>.
Live Upgrade: Current boot environment is <s10u4>.
Live Upgrade: New boot environment will be <s10u8>.
Live Upgrade: Activating boot environment <s10u8>.
Creating boot_archive for /.alt.tmp.b-9Tb.mnt
updating /.alt.tmp.b-9Tb.mnt/platform/sun4v/boot_archive
Live Upgrade: The boot device for boot environment <s10u8> is
</dev/dsk/c1t0d0s4>.
/etc/lib/lu/lubootdev: ERROR: Unable to get current boot devices.
/etc/lib/lu/lubootdev: INFORMATION: The system is running with the system
boot PROM diagnostics mode enabled. When diagnostics mode is
enabled, Live Upgrade is unable to access the system boot
device list, causing certain features of Live Upgrade (such
as changing the system boot device after activating a boot
environment) to fail. To correct this problem, please run
the system in normal, non-diagnostic mode. The system might
have a key switch or other external means of booting the
system in normal mode. If you do not have such a means, you
can set one or both of the EEPROM parameters 'diag-switch?'
or 'diagnostic-mode?' to 'false'.  After making a change,
either through external means or by changing an EEPROM
parameter, retry the Live Upgrade operation or command.
ERROR: Live Upgrade: Unable to change primary boot device to boot
environment <s10u8>.
ERROR: You must manually change the system boot prom to boot the system
from device </pci@0/pci@0/pci@2/scsi@0/sd@0,0:e>.
Live Upgrade: Activation of boot environment <s10u8> completed.
Legacy init script "/etc/rc0.d/K62lu" exited with return code 0.
[...]

Well, pretty explicit in fact, but very unexpected when the activation went so well beforehand. So, check the EEPROM setting, and change it back if necessary:

# eeprom diag-switch?
diag-switch?=true
# eeprom diag-switch?=false

And everything returned to normal when activating again and rebooting. Although this case is self-explanatory in the corresponding log file, and is described in Bug ID #6949588, I think it could be made more visible to the system administrator, for example by checking the EEPROM configuration during the BE activation (in the luactivate command).
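
Until then, such a check can be run by hand before luactivate. This is a minimal sketch and not part of Live Upgrade: the helper name diag_mode_enabled is hypothetical, it assumes a POSIX awk (nawk on Solaris 10), and it assumes eeprom prints one name=value pair per line as in the output above. The two parameter names are the ones quoted in the LU message.

```shell
# Hypothetical helper: read "eeprom" output on stdin and succeed (exit 0)
# when either diagnostics parameter named in the LU message is set to true.
diag_mode_enabled() {
    awk -F= '
        ($1 == "diag-switch?" || $1 == "diagnostic-mode?") && $2 == "true" {
            found = 1
        }
        END { exit !found }
    '
}

# Usage before activating the new BE:
#   eeprom | diag_mode_enabled && echo "diagnostics mode is on: fix it before luactivate"
```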

Monday 26 April 2010

Problem Starting OCSSD In A Non-Global Zone

If you are not able to start the Oracle Cluster Synchronization Services Daemon (OCSSD) in a non-global zone on Solaris 10, I bet you are running Oracle 10.2.0.3 or higher. In this case, you will see something similar to this in the /var/adm/messages file, but nothing comes up:

Apr 26 10:39:51 zonename oracle: [ID 702911 user.error] Oracle Cluster Synchronization Service starting by user request.
Apr 26 10:39:52 zonename root: [ID 702911 user.error] Cluster Ready Services completed waiting on dependencies.

Tracing the ocssd.bin process during start-up gives you something similar to:

[...]
12564:   0.0803 setrlimit(RLIMIT_CORE, 0xFFFFFFFF7FFFF900)      = 0
12564:   0.0804 priocntlsys(1, 0xFFFFFFFF7FFFF694, 6, 0xFFFFFFFF7FFFF768, 0) Err#1 EPERM [proc_priocntl]
12564:   0.0810 fstat(2, 0xFFFFFFFF7FFFE870)                    = 0
12564:   0.0811 brk(0x100229E80)                                = 0
12564:   0.0813 brk(0x10022DE80)                                = 0
12564:   0.0815 fstat(2, 0xFFFFFFFF7FFFE740)                    = 0
12564:   0.0816 ioctl(2, TCGETA, 0xFFFFFFFF7FFFE7AC)            Err#25 ENOTTY
12564:   0.0818 write(2, " s e t p r i o r i t y :".., 52)      = 52
12564:   0.0821 _exit(100)

So, in this case you just hit a privilege restriction, which did not apply with older releases of Oracle. As clearly shown in the truss output, the proc_priocntl privilege is not available to Oracle in the non-global zone. A clean solution, available only with Solaris 10 11/06 (U3) and later, is to use the limitpriv configuration property to extend the basic privileges provided by the zone framework.

As stated in the privileges(5) man page:

PRIV_PROC_PRIOCNTL Allow a process to elevate its priority above its current level. Allow a process to change its scheduling class to any scheduling class, including the RT class.

Interestingly, this seems to be exactly the case for the Oracle Cluster Synchronization Services Daemon:

# zonecfg -z zonename set limitpriv=default,proc_priocntl
# zoneadm -z zonename reboot
# zlogin zonename "ps -o class,args -p `pgrep ocssd.bin`"
 CLS COMMAND
  RT /soft/oracle/10.2.0/asm_1/bin/ocssd.bin

Ok, that's fine right now.
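
When several zones host Oracle, it can be handy to test whether a given limitpriv value already contains the privilege. A minimal sketch, with a hypothetical helper name limitpriv_has; how the value is obtained from zonecfg is left to the caller, since the exact output format of "zonecfg info" is an assumption not shown in this post.

```shell
# Hypothetical helper: succeed when privilege $2 appears in the
# comma-separated limitpriv value $1 (for example "default,proc_priocntl").
limitpriv_has() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage with the value set above:
#   limitpriv_has "default,proc_priocntl" proc_priocntl && echo "already set"
```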

Tuesday 6 April 2010

Performance Problem Using The HP DP Agent

I recently faced an interesting problem where the backup agent for HP DataProtector regularly encountered network performance problems. The problem arises on a global zone which hosts many non-global zones, each running its own instance of the HP DP agent. The network configuration is rather unusual: there are two network ports, one which connects to an administration VLAN, the other which carries the data and external customer network streams (using VLAN tagging). Since the global zone is only used as a hypervisor of the system's resources for the non-global zones, it just plumbs the data network port, but has no IP address assigned to it.

The fact is, the configuration sets the nodename as the hostname. But the usable interface of the global zone, which is in the administration network, uses another name than the hostname: beastie-adm in the following case, assuming the hostname was set to beastie.

So, now that I have briefly described the configuration of the server, let's see what happens at the network level, from the global zone point of view, when firing a remote telnet on the DataProtector agent TCP port (5555):

# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Syn Seq=1117789660 Len=0 Win=65535 Options=
  0.00005 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Syn Ack=1117789661 Seq=808681264 Len=0 Win=49480 Options=
  0.00500 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681265 Seq=1117789661 Len=0 Win=65535
 39.36365 beastie-adm.domain.tld ->172.1.2.3 TCP D=37228 S=5555 Push Ack=1117789661 Seq=808681265 Len=87 Win=49480
  0.00014 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Fin Ack=1117789661 Seq=808681352 Len=0 Win=49480
  0.00587 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00092 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Fin Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00002 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Ack=1117789662 Seq=808681353 Len=0 Win=49480

Ouch. Something is clearly wrong here: as can be seen in this trace, the TCP handshake completes immediately, but almost 40 seconds are lost before the agent sends its first reply. Snooping more broadly on the interface shows these requests:

[...]
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
[...]

OK. Let's see what the process corresponding to the HP DP agent is doing during the connection attempt:

[...]
9431/1:         sysinfo(SI_HOSTNAME, "beastie", 64)         = 12
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getpid()                                        = 9431 [473]
9431/1:         lstat64("/var/opt/omni/log/debug.log", 0x080476F8) = 0
9431/1:         open64("/var/opt/omni/log/debug.log", O_RDWR|O_APPEND|O_CREAT, 0666) = 4
9431/1:         time()                                          = 1269079436
9431/1:         fstat64(4, 0x08046F30)                          = 0
9431/1:         fstat64(4, 0x08046E70)                          = 0
9431/1:         ioctl(4, TCGETA, 0x08046F04)                    Err#25 ENOTTY
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         write(4, 0x08130DC4, 162)                       = 162
9431/1:           \n 0 3 / 2 0 / 1 0   1 1 : 0 3 : 5 6     I N E T . 9 4 3 1 . 0
9431/1:            [ " l i b / c m n / c o m m o n . c   / m a i n / h s l _ d p 6
9431/1:            1 / h s l _ h p i t 2 _ 2 / 9 " : 1 2 7 4 ]   A . 0 6 . 1 1   b
9431/1:            2 4 3\n g e t h o s t b y n a m e ( )   f a i l e d ,   h _ e r
9431/1:            r n o = 2   [ h o s t = h o s t n a m e ,   r e t r y = 5
9431/1:            ]\n
9431/1:         llseek(4, 0, SEEK_CUR)                          = 939002
9431/1:         close(4)
[...]

By now it is clear that most of the time spent by the agent was used trying to resolve the hostname to an IP address (as logged in the /var/opt/omni/log/debug.log debug log file). Since there is no IP address declared for beastie, as there is no network directly attached to the global zone, there is no mapping defined for the hostname. Bingo.

So, in this case we decided to set the hostname locally (in the /etc/hosts file) as an alias for the loopback entry. This way, the name is resolved, and the agent will not time out anymore when performing the gethostbyname(3NSL) library call:

# getent hosts beastie
127.0.0.1       localhost loghost beastie
# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Syn Seq=978964828 Len=0 Win=65535 Options=
  0.00004 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Syn Ack=978964829 Seq=1184861168 Len=0 Win=49480 Options=
  0.00128 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861169 Seq=978964829 Len=0 Win=65535
 10.00279 beastie-adm.occ.lan ->172.1.2.3 TCP D=37237 S=5555 Push Ack=978964829 Seq=1184861169 Len=87 Win=49480
  0.00012 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Fin Ack=978964829 Seq=1184861256 Len=0 Win=49480
  0.00137 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00220 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Fin Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00002 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Ack=978964830 Seq=1184861257 Len=0 Win=49480

So, the main remaining question is: "Why does the HP DP agent specifically need to resolve the hostname, even if this name is not used apart from the backup transaction?" I do not have any answer for this at this time...
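
In the meantime, the missing alias can be detected on other global zones configured the same way. The following is a minimal sketch: the helper name hosts_file_has is hypothetical, a POSIX awk is assumed (nawk on Solaris 10), and it only inspects a hosts file, not the full name service lookup that getent performs.

```shell
# Hypothetical helper: succeed when name $2 appears as a hostname or alias
# (any field after the address) on a non-comment line of hosts file $1.
hosts_file_has() {
    awk -v name="$2" '
        { sub(/#.*/, "") }      # strip comments before looking at fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$1"
}

# Usage on the global zone:
#   hosts_file_has /etc/hosts "`uname -n`" || echo "nodename not in /etc/hosts"
```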

Thursday 25 March 2010

Prevent A Non-Global Zone Reaching Others

When using non-global zones, network traffic between zones on the same system doesn't leave the global zone. Although this is very interesting for the performance of multi-tier applications hosted on non-global zones of the same system, it can be a problem when it comes to segregating the different networks used by the different non-global zones.

To my knowledge, IP Filter can be used from the global zone to help in this case. But a cleaner approach would be to block (reject) the route between those non-global zones. For example, if one non-global zone has an IP address of addrX, and the second non-global zone has an address of addrY, then the following commands will prevent network traffic from passing between the two zones.

# route add addrX addrY -interface -reject
# route add addrY addrX -interface -reject

The problem is, when there are a lot of non-global zones to segregate, you need to add one route per ordered pair of zones, i.e. n(n-1) routes, which represents 20 routes for 5 non-global zones... Not very scalable, and not manageable. If someone knows a better solution, please feel free to comment on this post.
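
At least the route commands can be generated rather than typed. This is a minimal sketch, assuming a POSIX shell: the function name reject_routes is hypothetical, and it only prints the same route add form shown above, one command per ordered pair of the addresses given as arguments.

```shell
# Hypothetical generator: print one reject route per ordered pair of the
# addresses passed as arguments (a zone is never paired with itself).
reject_routes() {
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] && continue
            echo "route add $src $dst -interface -reject"
        done
    done
}

# Usage: review the output first, then pipe it to a shell in the global zone.
#   reject_routes addrX addrY addrZ
#   reject_routes addrX addrY addrZ | sh
```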

Tuesday 16 March 2010

Debugging After A Solaris System Was P2V'ed

I recently faced a problem where an application stopped working after transferring a system from an old Sun E450 running Solaris 8 to a more recent Sun Fire V490 running Solaris 10 10/09 with the Solaris 8 Container software stack. Although everything went smoothly during the P2V, and most applications run pretty well in the non-global zone, we encountered a problem where the Courier SMTP product didn't answer connections anymore, such as:

# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

The fact is, the couriertcpd program is running, and listening on TCP port 25.

# pfiles `pgrep couriertcpd | head -1`
5691:   /usr/lib/courier/sbin/couriertcpd
  Current rlimit: 1024 file descriptors
[...]
   5: S_IFSOCK mode:0666 dev:366,0 ino:62871 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        sockname: AF_INET 0.0.0.0  port: 25
[...]

Interestingly, here is what is logged by the Courier application:

# grep mail.info /path/to/Courier.log
XXX courieresmtpd: [ID 702911 mail.info] gdbm fatal: couldn't init cache

The problem seems to be located around the gdbm library, which is a dependency of Courier. So, what happens when we try to initiate a direct connection using telnet on port 25?

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 245
245/1:          fork()          (returning as child ...)        = 5691
[...]
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004336D8)                                 Err#12 ENOMEM
245/1:          write(2, " g d b m   f a t a l :  ", 12)        = 12
245/1:          write(2, 0xFF2D4D28, 19)                        = 19
245/1:             c o u l d n ' t   i n i t   c a c h e
245/1:          write(2, "\n", 1)                               = 1
245/1:          llseek(0, 0, SEEK_CUR)                          = 0
245/1:          _exit(-1)
5691/1:             Received signal #18, SIGCLD, in poll() [caught]
[...]
# echo "ibase=16; `echo 004136D8 | tr [:lower:] [:upper:]`" | bc
4273880
# echo "ibase=16; `echo 004336D8 | tr [:lower:] [:upper:]`" | bc
4404952

OK. The brk(2) function is used to dynamically change the amount of space allocated for the calling process's data segment. So, the process with PID 245, forked from couriertcpd, cannot allocate more than 4273880 bytes, since an attempt to allocate 4404952 bytes returns an error. When the brk(2) function fails, it is generally due to one of these two major cases:

  1. Insufficient space exists in the swap area to support the expansion.
  2. The data segment size limit as set by setrlimit(2) would be exceeded.

At first, we decided to grow the size of the swap space in the global zone from 4GB to 8GB (since there is no resource control on memory for this non-global zone). As it is a full-ZFS Solaris 10 system, here are the steps to do so:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=8G rpool/swap
# /sbin/swapadd
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 256,2      16 16777200 16777200

But nothing changed from the couriertcpd point of view, as we might have guessed since we were able to deallocate the entire swap device from the global zone. So, we shrank it back using the same steps as for growing it. Then, is it possible we have reached an upper resource limit?

The couriertcpd program is executed under the daemon identity: what is the maximum size of the data segment (or heap) for the daemon account?

# su - daemon -c "ulimit -Sd; ulimit -Hd"
unlimited
unlimited

This seems more than sufficient. But what are the actual current limits for the running process?

# plimit -k 5691
5691:   /usr/lib/courier/sbin/couriertcpd
   resource              current         maximum
  time(seconds)         unlimited       unlimited
  file(kbytes)          unlimited       unlimited
  data(kbytes)          4096            4096
  stack(kbytes)         8192            unlimited
  coredump(kbytes)      unlimited       unlimited
  nofiles(descriptors)  1024            1024
  vmemory(kbytes)       unlimited       unlimited

OK. One can argue that the 4404952 bytes size is very close to the 4096KB (4194304 bytes) data size limit... and yes, it is. The two points here are that the limit is certainly close to the memory allocation size, and that there must be a setting somewhere which sets this resource limit, since we verified it is not at the account level (the current hard limit is set to 4096KB, thus not very unlimited). So, we tried to set the resource limit for the running process to a much higher value, and voilà:

# plimit -d 65536,65536 5691
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 S0003010 ESMTP
quit
221 Bye.
Connection closed by foreign host.

Well, all seems far better this time:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 4861
4861/1:          fork()          (returning as child ...)        = 5691
[...]
4861/1:         brk(0x00CF36D8)                                 = 0
4861/1:         lseek(6, 262144, SEEK_SET)                      = 262144
4861/1:         read(6, 0x00071730, 131072)                     = 131072
[...]
# echo "ibase=16; `echo 00CF36D8 | tr [:lower:] [:upper:]`" | bc
13579992

So, now the memory allocation succeeds at changing the space allocated up to 13579992 bytes (13262KB). We decided to set the data segment size up to 16MB, which seems to be sufficient in our case.
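
The hexadecimal conversions done with bc in the transcripts can also be written with printf, which makes it easy to see how much each brk(2) call tried to grow the break. A minimal sketch, assuming a POSIX shell whose printf accepts 0x constants; the helper names hex_to_dec and brk_growth are hypothetical, and the values are break addresses taken from the truss output, used here as a rough measure of the data segment size as in the rest of this post.

```shell
# Hypothetical helper: convert a hexadecimal address (without the 0x prefix,
# as printed by truss) to decimal.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# Hypothetical helper: bytes between two successive break addresses,
# i.e. how much the second brk(2) call tried to add.
brk_growth() {
    echo $(( $(hex_to_dec "$2") - $(hex_to_dec "$1") ))
}

# Usage with the addresses from the truss traces:
#   hex_to_dec 004336D8            # the failed request
#   brk_growth 004136D8 004336D8   # size of the step that failed
```

With the first trace, the failing call asked for one more 128KB step (131072 bytes) beyond the last successful break.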

We now need to find the configuration responsible for setting the size of the data segment to 4096KB. Looking at the Courier documentation, and after a bit of digging on the system itself, we found the responsible configuration file, and adapted it as follows:

# grep ULIMIT= /usr/lib/courier/etc/esmtpd
#ULIMIT=4096
ULIMIT=16384

As a last validation, we did a complete reboot of the non-global zone to verify the overall behavior, which is OK now. As a last word, it was a very interesting problem to fight with, but I still don't understand why 4MB was sufficient on an old Sun E450, and why it is not in a branded Solaris 8 non-global zone... comments welcome.

Tuesday 1 December 2009

Oracle Commitment To Sun Technologies

Here is more information about Oracle's commitment to Sun's business, and the Solaris operating system in particular:

Lastly, the Oracle and Sun Overview and FAQ for customers and partners is a must-read for anyone interested in the future of, and Oracle's investment in, all the current technologies from Sun Microsystems.
