blog'o thnet

Sunday 29 August 2010

Apropos Solaris

John Fowler (Oracle Executive Vice President for Server and Storage Systems) held an on-line webcast on August 10 on the strategy for hardware servers based on SPARC and x86, and the formalization of the upcoming release of Solaris 11 in 2011.

This post only aims to summarize the main points; the complete slides of the presentation are available on the Oracle web site.

  1. Message #1: SPARC is alive and will continue. Solaris is alive and will continue. Both actively.
  2. Message #2: What is interesting here is that these are not only intentions: there is a real five-year roadmap for the well-known ex-Sun products. Oracle clearly has strong plans for Solaris and for the SPARC and x86 platforms, and has just begun to speak publicly about them. We will probably see more about all of this at Oracle OpenWorld in a few weeks.

The points are:

  • A roadmap for SPARC and Solaris up to 2015.
  • SPARC performance will double every two years, with these targets for 2015:
    • Cores: 128 (32 in 2010).
    • Threads: 16384 (512 in 2010).
    • Memory capacity: 64TB (4TB in 2010).
    • Logical Domains: 256 (128 in 2010).
    • Java Ops per second: 50000 (5000 in 2010).
  • Very SPARC oriented: it seems that there will only be one SPARC brand at the end of 2015.
  • Two big families of SPARC servers: lots of threads known as the T-Series, lots of sockets known as M-Series.
  • At least one update to Solaris 10 around 2010Q3, a beta program for Solaris 11 known as Solaris 11 Express due in late 2010, then Solaris 11 due in 2011, with a roadmap up to 2015.

Solaris 11 will be based on the now closed OpenSolaris distribution, and will include:

  • Image Packaging System (IPS): totally new packaging system fully integrated with ZFS and Boot Environment Administration (aimed at replacing Live Upgrade).
  • Crossbow network virtualization stack.
  • ZFS de-duplication, and lots of recent optimizations and functionalities.
  • CIFS file services: in-kernel implementation of CIFS.
  • Enhanced Gnome user environment.
  • Updated installer and auto network installer ("AI", aimed at replacing JumpStart).
  • Network Automagic configuration.
  • And many more (I heard Solaris 10 BrandZ...).

Sunday 2 May 2010

Live Upgrading When Diagnostics Mode Is Enabled

Recently, we faced an interesting problem when using Live Upgrade on some of our SPARC servers (with lots of non-global zones hosted on SAN devices). Here are the basic steps we generally follow when using LU:

  1. Update the Live Upgrade functionality according to the Article ID #1004881.1, Solaris Live Upgrade Software: Patch Requirements.
  2. Create the ABE.
  3. Upgrade the ABE with an operating system image (and test the upgrade according to a JumpStart profile).
  4. Apply a determined Recommended Patch Cluster to the ABE.
  5. Activate the ABE to be the next booted BE.
  6. Reboot on the new BE, and perform the post-configuration steps, if any.

In some circumstances, even though all the steps went well and the activation of the new BE reported no problem (we traced its activity), the system rebooted on the old BE:

# lustatus
Boot Environment     Is       Active Active    Can    Copy
Name                 Complete Now    On Reboot Delete Status
-------------------- -------- ------ --------- ------ -------
s10u4                yes      yes    yes       no     -
s10u8                yes      no     no        yes    -
# lucurr
s10u4
# luactivate -n s10u8
[...]
# lustatus
Boot Environment     Is       Active Active    Can    Copy
Name                 Complete Now    On Reboot Delete Status
-------------------- -------- ------ --------- ------ -------
s10u4                yes      yes    no        no     -
s10u8                yes      no     yes       no     -
# shutdown -y -g 0 -i 6
[...]
# lucurr
s10u4

Ouch. After a bit of digging, and seeing nothing wrong on the console via the Service Processor, we found the following messages in the log of the SMF legacy script run by LU at reboot (at shutdown time, more precisely):

# cat /var/svc/log/rc6.log
[...]
Executing legacy init script "/etc/rc0.d/K62lu".
Live Upgrade: Deactivating current boot environment <s10u4>.
zlogin: login allowed only to running zones (zonename1 is 'installed').
zlogin: login allowed only to running zones (zonename2 is 'installed').
Live Upgrade: Executing Stop procedures for boot environment <s10u4>.
Live Upgrade: Current boot environment is <s10u4>.
Live Upgrade: New boot environment will be <s10u8>.
Live Upgrade: Activating boot environment <s10u8>.
Creating boot_archive for /.alt.tmp.b-9Tb.mnt
updating /.alt.tmp.b-9Tb.mnt/platform/sun4v/boot_archive
Live Upgrade: The boot device for boot environment <s10u8> is
</dev/dsk/c1t0d0s4>.
/etc/lib/lu/lubootdev: ERROR: Unable to get current boot devices.
/etc/lib/lu/lubootdev: INFORMATION: The system is running with the system
boot PROM diagnostics mode enabled. When diagnostics mode is
enabled, Live Upgrade is unable to access the system boot
device list, causing certain features of Live Upgrade (such
as changing the system boot device after activating a boot
environment) to fail. To correct this problem, please run
the system in normal, non-diagnostic mode. The system might
have a key switch or other external means of booting the
system in normal mode. If you do not have such a means, you
can set one or both of the EEPROM parameters 'diag-switch?'
or 'diagnostic-mode?' to 'false'.  After making a change,
either through external means or by changing an EEPROM
parameter, retry the Live Upgrade operation or command.
ERROR: Live Upgrade: Unable to change primary boot device to boot
environment <s10u8>.
ERROR: You must manually change the system boot prom to boot the system
from device </pci@0/pci@0/pci@2/scsi@0/sd@0,0:e>.
Live Upgrade: Activation of boot environment <s10u8> completed.
Legacy init script "/etc/rc0.d/K62lu" exited with return code 0.
[...]

Well, pretty explicit in fact, but very unexpected when the activation went so well beforehand. So, check the EEPROM setting, and change it back if necessary:

# eeprom diag-switch?
diag-switch?=true
# eeprom diag-switch?=false

Everything returned to normal after activating the BE again and rebooting. Although this case is self-explanatory in the corresponding log file, and is described in Bug ID #6949588, I think it could be made more visible to the system administrator, for example by checking the EEPROM configuration at BE activation time (in the luactivate command).
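Such a check can be sketched as a small pre-flight script to run before luactivate. It only uses the eeprom command and the two parameters named in the log above; the function names diag_enabled and check_diag_mode are hypothetical, and the "name=value" output format is assumed from the transcript above.

```shell
#!/bin/sh
# Sketch: warn when the boot PROM diagnostics mode is enabled.
# diag_enabled and check_diag_mode are illustrative names, not part of LU.

# Return 0 when given an eeprom output line of the form "name=true".
diag_enabled() {
  case "$1" in
    *=true) return 0 ;;
    *)      return 1 ;;
  esac
}

# Check the two parameters named in the lubootdev message.
check_diag_mode() {
  status=0
  for param in 'diag-switch?' 'diagnostic-mode?'; do
    value=`eeprom "$param" 2>/dev/null`
    if diag_enabled "$value"; then
      echo "WARNING: $value (LU will not be able to change the boot device)"
      status=1
    fi
  done
  return $status
}

# Only run the check where the eeprom command exists.
if type eeprom >/dev/null 2>&1; then
  check_diag_mode || echo "Set the reported parameter(s) to false before activating the BE."
fi
```

On a system in diagnostics mode, the warning shows up before the reboot instead of in /var/svc/log/rc6.log after it.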

Monday 26 April 2010

Problem Starting OCSSD In A Non-Global Zone

If you are not able to start the Oracle Cluster Synchronization Services Daemon (OCSSD) in a non-global zone on Solaris 10, I bet you are running Oracle 10.2.0.3 or higher. In this case, you will see something similar to the following in the /var/adm/messages file, but the daemon never comes up:

Apr 26 10:39:51 zonename oracle: [ID 702911 user.error] Oracle Cluster Synchronization Service starting by user request.
Apr 26 10:39:52 zonename root: [ID 702911 user.error] Cluster Ready Services completed waiting on dependencies.

Tracing the ocssd.bin process during start-up gives you something similar to:

[...]
12564:   0.0803 setrlimit(RLIMIT_CORE, 0xFFFFFFFF7FFFF900)      = 0
12564:   0.0804 priocntlsys(1, 0xFFFFFFFF7FFFF694, 6, 0xFFFFFFFF7FFFF768, 0) Err#1 EPERM [proc_priocntl]
12564:   0.0810 fstat(2, 0xFFFFFFFF7FFFE870)                    = 0
12564:   0.0811 brk(0x100229E80)                                = 0
12564:   0.0813 brk(0x10022DE80)                                = 0
12564:   0.0815 fstat(2, 0xFFFFFFFF7FFFE740)                    = 0
12564:   0.0816 ioctl(2, TCGETA, 0xFFFFFFFF7FFFE7AC)            Err#25 ENOTTY
12564:   0.0818 write(2, " s e t p r i o r i t y :".., 52)      = 52
12564:   0.0821 _exit(100)

So, in this case you have just hit a privilege restriction, which did not apply with older releases of Oracle. As clearly shown in the truss output, the proc_priocntl privilege is not available to Oracle in the non-global zone. A clean solution, available only with Solaris 10 11/06 (U3) and later, is to use the limitpriv configuration property to extend the default set of privileges provided by the zone framework.

As stated in the privileges(5) man page:

PRIV_PROC_PRIOCNTL Allow a process to elevate its priority above its current level. Allow a process to change its scheduling class to any scheduling class, including the RT class.

Interestingly, this seems to be exactly the case for the Oracle Cluster Synchronization Services Daemon:

# zonecfg -z zonename set limitpriv=default,proc_priocntl
# zoneadm -z zonename reboot
# zlogin zonename "ps -o class,args -p `pgrep ocssd.bin`"
 CLS COMMAND
  RT /soft/oracle/10.2.0/asm_1/bin/ocssd.bin

Ok, that's fine now: the daemon is up, and runs in the RT class.
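Since truss prints the missing privilege between brackets after the EPERM error, a small filter can list the privileges to add when reading a saved trace. This is only a sketch: missing_privs is a hypothetical name, and the line format is assumed to be the one shown in the trace above.

```shell
#!/bin/sh
# Sketch: list the privileges reported as missing in truss output.
# missing_privs is an illustrative name; it reads truss lines on stdin.
missing_privs() {
  # Keep only the name found in "Err#1 EPERM [name]", once per privilege.
  sed -n 's/.*Err#1 EPERM \[\([a-z_]*\)].*/\1/p' | sort -u
}
```

Fed with the trace above, it prints proc_priocntl, which is the value appended to "default" in the limitpriv property.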

Tuesday 6 April 2010

Performance Problem Using The HP DP Agent

I recently faced an interesting problem where the backup agent for HP DataProtector regularly encountered network performance problems. The problem arose on a global zone which hosts many non-global zones, each running its own instance of the HP DP agent. The network configuration is rather unusual: there are two network ports, one connected to an administration VLAN, the other carrying the data and external customer network streams (using VLAN tagging). Since the global zone is only used as a hypervisor of the system's resources for the non-global zones, it just plumbs the data network port, but has no IP address assigned to it.

The fact is, the configuration sets the nodename as the hostname. But the usable interface of the global zone, which sits in the administration network, uses another name than the hostname: beastie-adm in the following case, assuming the hostname is set to beastie.

So, now that I have briefly described the configuration of the server, let's see what happens at the network level, from the global zone's point of view, when opening a remote telnet connection to the DataProtector agent TCP port (5555):

# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Syn Seq=1117789660 Len=0 Win=65535 Options=
  0.00005 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Syn Ack=1117789661 Seq=808681264 Len=0 Win=49480 Options=
  0.00500 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681265 Seq=1117789661 Len=0 Win=65535
 39.36365 beastie-adm.domain.tld ->172.1.2.3 TCP D=37228 S=5555 Push Ack=1117789661 Seq=808681265 Len=87 Win=49480
  0.00014 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Fin Ack=1117789661 Seq=808681352 Len=0 Win=49480
  0.00587 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00092 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Fin Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00002 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Ack=1117789662 Seq=808681353 Len=0 Win=49480

Ouch. Something is clearly wrong here: as can be seen in this trace, the TCP handshake completes within milliseconds, but almost 40 seconds are lost before the agent sends its first reply. Snooping more broadly on the interface shows these requests:

[...]
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
[...]

Ok. Let's see what the process corresponding to the HP DP agent is doing during the connection attempt:

[...]
9431/1:         sysinfo(SI_HOSTNAME, "beastie", 64)         = 12
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getpid()                                        = 9431 [473]
9431/1:         lstat64("/var/opt/omni/log/debug.log", 0x080476F8) = 0
9431/1:         open64("/var/opt/omni/log/debug.log", O_RDWR|O_APPEND|O_CREAT, 0666) = 4
9431/1:         time()                                          = 1269079436
9431/1:         fstat64(4, 0x08046F30)                          = 0
9431/1:         fstat64(4, 0x08046E70)                          = 0
9431/1:         ioctl(4, TCGETA, 0x08046F04)                    Err#25 ENOTTY
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         write(4, 0x08130DC4, 162)                       = 162
9431/1:           \n 0 3 / 2 0 / 1 0   1 1 : 0 3 : 5 6     I N E T . 9 4 3 1 . 0
9431/1:            [ " l i b / c m n / c o m m o n . c   / m a i n / h s l _ d p 6
9431/1:            1 / h s l _ h p i t 2 _ 2 / 9 " : 1 2 7 4 ]   A . 0 6 . 1 1   b
9431/1:            2 4 3\n g e t h o s t b y n a m e ( )   f a i l e d ,   h _ e r
9431/1:            r n o = 2   [ h o s t = h o s t n a m e ,   r e t r y = 5
9431/1:            ]\n
9431/1:         llseek(4, 0, SEEK_CUR)                          = 939002
9431/1:         close(4)
[...]

By now it is clear that most of the time spent by the agent was used trying to resolve the hostname to an IP address (as logged in the /var/opt/omni/log/debug.log file). Since there is no IP address declared for beastie, as there is no network directly attached to the global zone under that name, there is no mapping defined for the hostname. Bingo.

So, in this case we decided to define the hostname locally (in the /etc/hosts file) as an alias of the loopback entry. This way, the name is resolved, and the agent will not time out on the gethostbyname(3NSL) library call anymore:

# getent hosts beastie
127.0.0.1       localhost loghost beastie
# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Syn Seq=978964828 Len=0 Win=65535 Options=
  0.00004 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Syn Ack=978964829 Seq=1184861168 Len=0 Win=49480 Options=
  0.00128 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861169 Seq=978964829 Len=0 Win=65535
 10.00279 beastie-adm.occ.lan ->172.1.2.3 TCP D=37237 S=5555 Push Ack=978964829 Seq=1184861169 Len=87 Win=49480
  0.00012 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Fin Ack=978964829 Seq=1184861256 Len=0 Win=49480
  0.00137 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00220 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Fin Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00002 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Ack=978964830 Seq=1184861257 Len=0 Win=49480
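The same situation can be detected beforehand by checking that the node name is defined in the local hosts file. The following is a minimal sketch, assuming a POSIX shell; hosts_file_has is a hypothetical helper name, and only the standard hosts file layout (address, then names and aliases) is assumed.

```shell
#!/bin/sh
# Sketch: check that a name is defined in a hosts-format file.
# hosts_file_has is an illustrative name.
hosts_file_has() {
  # $1: hosts-format file, $2: host name to look for
  while read -r addr names; do
    case "$addr" in
      ''|'#'*) continue ;;        # skip blank and comment lines
    esac
    for n in $names; do
      case "$n" in
        '#'*) break ;;            # ignore a trailing comment
      esac
      if [ "$n" = "$2" ]; then
        return 0
      fi
    done
  done < "$1"
  return 1
}

# Report whether the node name of this system is defined locally.
node=`uname -n`
if [ -r /etc/hosts ]; then
  if hosts_file_has /etc/hosts "$node"; then
    echo "$node is defined in /etc/hosts"
  else
    echo "$node is missing from /etc/hosts"
  fi
fi
```

Reading the file directly, rather than calling getent, avoids triggering the very DNS lookups that caused the delay.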

So, the main remaining question is: "Why does the HP DP agent need to resolve the hostname specifically, even though this name is not used anywhere in the backup transaction?" I do not have an answer to this at this time...

Thursday 25 March 2010

Prevent A Non-Global Zone Reaching Others

When using non-global zones, the network traffic between zones of the same system does not leave the global zone. Although this is very interesting performance-wise for multi-tier applications hosted on non-global zones of the same system, it can be a problem when it comes to segregating the different networks used by the different non-global zones.

To my knowledge, IP Filter can be used from the global zone to help in this case. But a cleaner approach would be to block (reject) the route between those non-global zones. For example, if one non-global zone has an IP address of addrX, and the second non-global zone has an address of addrY, then the following commands will prevent network traffic from passing between the two zones.

# route add addrX addrY -interface -reject
# route add addrY addrX -interface -reject

The problem is, when there are a lot of non-global zones to segregate, you need to add one route per ordered pair of zones, that is n * (n - 1) routes, which represents 20 routes for 5 non-global zones... Not very scalable, and not manageable. If someone knows a better solution, please feel free to comment on this post.
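To get a feel for the number of routes involved, the commands can be generated rather than typed. This sketch only prints the route commands in the form shown above, one per ordered pair of addresses; gen_reject_routes is a hypothetical name and the addresses are placeholders.

```shell
#!/bin/sh
# Sketch: print the reject routes needed to isolate every zone address
# from every other one. gen_reject_routes is an illustrative name.
gen_reject_routes() {
  for src in "$@"; do
    for dst in "$@"; do
      # An address does not need a route to itself.
      if [ "$src" != "$dst" ]; then
        echo "route add $src $dst -interface -reject"
      fi
    done
  done
}

# Five placeholder addresses give 5 x 4 = 20 commands.
gen_reject_routes addrA addrB addrC addrD addrE
```

The output can be reviewed first, then piped to a shell in the global zone.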

Tuesday 16 March 2010

Debugging After A Solaris System Was P2V'ed

I recently faced a problem where an application stopped working after transferring a system from an old Sun E450 running Solaris 8 to a more recent Sun Fire V490 running Solaris 10 10/09 with the Solaris 8 Container software stack. Although the P2V went smoothly, and most of the applications run pretty well in the non-global zone, we encountered a problem where the Courier SMTP product did not answer incoming connections anymore, such as:

# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

The fact is, the couriertcpd program is running, and listening on TCP port 25.

# pfiles `pgrep couriertcpd | head -1`
5691:   /usr/lib/courier/sbin/couriertcpd
  Current rlimit: 1024 file descriptors
[...]
   5: S_IFSOCK mode:0666 dev:366,0 ino:62871 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        sockname: AF_INET 0.0.0.0  port: 25
[...]

Interestingly, here is what is logged by the Courier application:

# grep mail.info /path/to/Courier.log
XXX courieresmtpd: [ID 702911 mail.info] gdbm fatal: couldn't init cache

The problem seems to be located around the gdbm library, which is a dependency of Courier. So, here is what happens when we initiate a direct connection using telnet on port 25:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 245
245/1:          fork()          (returning as child ...)        = 5691
[...]
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004336D8)                                 Err#12 ENOMEM
245/1:          write(2, " g d b m   f a t a l :  ", 12)        = 12
245/1:          write(2, 0xFF2D4D28, 19)                        = 19
245/1:             c o u l d n ' t   i n i t   c a c h e
245/1:          write(2, "\n", 1)                               = 1
245/1:          llseek(0, 0, SEEK_CUR)                          = 0
245/1:          _exit(-1)
5691/1:             Received signal #18, SIGCLD, in poll() [caught]
[...]
# echo "ibase=16; `echo 004136D8 | tr '[:lower:]' '[:upper:]'`" | bc
4273880
# echo "ibase=16; `echo 004336D8 | tr '[:lower:]' '[:upper:]'`" | bc
4404952
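The same conversions can be done without bc. The sketch below assumes a printf that accepts C-style hexadecimal constants (as POSIX specifies); hex2dec is a hypothetical helper name. It also shows how much memory the failing call was asking for.

```shell
#!/bin/sh
# Sketch: convert the hexadecimal brk() arguments seen in truss to decimal.
# hex2dec is an illustrative name.
hex2dec() {
  # printf understands C-style constants, so a 0x prefix is enough.
  printf '%d\n' "0x$1"
}

old=`hex2dec 004136D8`   # last accepted break
new=`hex2dec 004336D8`   # break refused with ENOMEM
echo "requested growth: `expr $new - $old` bytes"
# prints "requested growth: 131072 bytes"
```

So the refused call was asking for only 128KB more.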

Ok. The brk(2) function is used to dynamically change the amount of space allocated for the calling process's data segment, by setting the address of its end (the break). So, the process with PID 245, forked from couriertcpd, cannot move its break beyond 4273880 (0x004136D8), since the attempt to raise it to 4404952 (0x004336D8) returns an error. When the brk(2) function fails, it is generally due to one of these two major cases:

  1. Insufficient space exists in the swap area to support the expansion.
  2. The data segment size limit as set by setrlimit(2) would be exceeded.

At first, we decided to grow the size of the swap space in the global zone from 4GB to 8GB (since there is no resource control on memory for this non-global zone). As it is a full-ZFS Solaris 10 system, here are the steps to do so:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=8G rpool/swap
# /sbin/swapadd
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 256,2      16 16777200 16777200

But nothing changed from the couriertcpd point of view, as we could have guessed since we were able to deallocate the entire swap device from the global zone. So, we shrank it back using the same steps as for growing it. Then, is it possible that we reached an upper resource limit?

The couriertcpd program is executed under the daemon identity: what is the maximum size of the data segment (or heap) for the daemon account?

# su - daemon -c "ulimit -Sd; ulimit -Hd"
unlimited
unlimited

This seems more than sufficient, to say the least. But what are the actual limits of the running process?

# plimit -k 5691
5691:   /usr/lib/courier/sbin/couriertcpd
   resource              current         maximum
  time(seconds)         unlimited       unlimited
  file(kbytes)          unlimited       unlimited
  data(kbytes)          4096            4096
  stack(kbytes)         8192            unlimited
  coredump(kbytes)      unlimited       unlimited
  nofiles(descriptors)  1024            1024
  vmemory(kbytes)       unlimited       unlimited
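When several processes have to be compared, the data segment limits can be extracted from this output with a small filter. This is a sketch only: data_limit is a hypothetical name, and the column layout is assumed to be the one shown above.

```shell
#!/bin/sh
# Sketch: extract the current and maximum data segment limits (in KB)
# from "plimit -k" output read on stdin. data_limit is an illustrative name.
data_limit() {
  sed -n 's/^ *data(kbytes) *\([^ ]*\) *\([^ ]*\) *$/\1 \2/p'
}
```

For example, "plimit -k 5691 | data_limit" would print "4096 4096" for the process above.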

Ok. One can argue that the 4404952 value is very close to the 4096KB (4194304 bytes) data size limit... and yes, it is. The two points here are that the failing allocation sits right around this limit, and that some setting somewhere must impose this resource limit, since we verified it is not set at the account level (the current hard limit is 4096KB, thus not really unlimited). So, we tried to raise the resource limit of the running process to a much higher value, and voilà:

# plimit -d 65536,65536 5691
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 S0003010 ESMTP
quit
221 Bye.
Connection closed by foreign host.

Well, all seems far better this time:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 4861
4861/1:          fork()          (returning as child ...)        = 5691
[...]
4861/1:         brk(0x00CF36D8)                                 = 0
4861/1:         lseek(6, 262144, SEEK_SET)                      = 262144
4861/1:         read(6, 0x00071730, 131072)                     = 131072
[...]
# echo "ibase=16; `echo 00CF36D8 | tr '[:lower:]' '[:upper:]'`" | bc
13579992

So, the memory allocation now succeeds, moving the break up to 13579992 (about 13262KB). We decided to set the data segment size limit to 16MB, which seems to be sufficient in our case.

We now need to find the configuration responsible for setting the data segment size limit to 4096KB. Looking at the Courier documentation, and after a bit of digging on the system itself, we found the responsible configuration file and adapted it as follows:

# grep ULIMIT= /usr/lib/courier/etc/esmtpd
#ULIMIT=4096
ULIMIT=16384

As a last validation, we rebooted the non-global zone completely to verify the overall behavior, which is now OK. As a last word, it was a very interesting problem to fight with, but I still do not understand why 4MB was sufficient on an old Sun E450, and why it is not in a branded Solaris 8 non-global zone... comments welcome.

Tuesday 1 December 2009

Oracle Commitment To Sun Technologies

Here is more information about Oracle's commitment to Sun's business, and to the Solaris operating system in particular:

Lastly, Oracle and Sun Overview and FAQ for customers and partners is a must read for everyone interested in the future of, and Oracle's investment in, all the current technologies from Sun Microsystems.

Thursday 26 November 2009

Oracle Database 11g Release 2 For Solaris

As a follow-up to the preceding entry on the upcoming availability of Oracle Database 11g Release 2 on both Solaris SPARC and x86, we can see that this is a reality as of today. Both architectures are readily available for download.

As a system administrator I find this very interesting and encouraging, not only because one of the most robust RDBMS is available on Solaris platforms, but because Oracle's actions now follow its words. And so, the interest in Solaris as an OS of choice is reinforced.

Wednesday 7 October 2009

Upcoming Oracle RDBMS And Solaris News

Following the recent news about the future of Sun from Larry Ellison himself, we can now hope for more things to come on Solaris, on both SPARC and x86 platforms. This quote is particularly encouraging:

We're a big supporter of Linux, but the fact is that Solaris just a much more mature OS, its just a fact. We became a big supporter of Linux years ago because it ran on smaller and cheaper X86 processors and Solaris did not, we had no choice. [...] So we are a supporter of Linux, but Solaris is a more mature operating system designed for bigger systems. We support both.

In the very same vein, I just heard from two different sources these upcoming changes from Oracle:

  1. The just-released Oracle Database 11g Release 2, currently available only for Linux (released on 1 September 2009), will soon be available (within one to two months) for both Solaris SPARC and x86, at the same time.
  2. Solaris x86 will be raised from its current Tier 3 to a Tier 2 platform.

Well, pretty good news in fact! Seems that Solaris will be a serious and growing competitor in the (near) future!

Sunday 22 March 2009

Finding The Process Responsible For Crashing A System

Recently, we encountered a wave of suicides on most of the nodes of several Oracle RAC clusters running on a number of Sun M5000 domains. Although the logs found on Oracle RAC were interesting, they did not help us determine precisely the origin of the crashes. Since the domains panicked, we were able to briefly analyze the cores generated at crash time to find the process which initiated the panics. Here is how to do so.

First, be sure to have a proper and usable core on persistent storage:

# cd /var/crash/nodename
# file *.0
unix.0:         ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, statically linked, not stripped, no debugging information available
vmcore.0:       SunOS 5.10 Generic_127111-11 64-bit SPARC crash dump from 'nodename'

Then, extract useful information using MDB dcmds such as ::status, ::showrev and ::panicinfo, which give us the exact panic message and the thread responsible for the system crash:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd mpt px ssd fcp fctl md ip qlc hook neti sctp arp usba nca zfs random logindmux ptm cpc sppp crypto wrsmd fcip nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from nodename
operating system: 5.10 Generic_127111-11 (sun4u)
panic message: forced crash dump initiated at user request
dump content: kernel pages only
> ::showrev
Hostname: nodename
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_127111-11
Platform: SUNW,SPARC-Enterprise
> ::panicinfo
             cpu                0
          thread      300171c7300
         message forced crash dump initiated at user request
          tstate       4400001606
              g1                b
              g2                0
              g3          11c13e0
              g4              6e0
              g5         88000000
              g6                0
              g7      300171c7300
              o0          1208020
              o1      2a10176b9e8
              o2                1
              o3                0
              o4 fffffffffffffff5
              o5             1000
              o6      2a10176b0b1
              o7          10626a4
              pc          1044d8c
             npc          1044d90
               y                0

Well. Now that we have the exact thread (identified by its address, 300171c7300), we can find the corresponding UNIX process with the help of the following script:

# cat /var/tmp/findstack.vmcore.sh
#!/usr/bin/env sh

# List every process of the dump as "<proc address> <name>",
# skipping the header line of ::ps.
echo "::ps" | mdb -k unix.0 vmcore.0 | \
 nawk '$8 !~ /ADDR/ {print $8" "$NF}' > /tmp/.core.$$

# Start from an empty report file.
cat /dev/null > /tmp/core.$$

# For each process, dump the stack of all of its threads.
while read addr name; do
  echo "process name: ${name}" >> /tmp/core.$$
  echo "${addr}::walk thread | ::findstack" | \
   mdb -k unix.0 vmcore.0 >> /tmp/core.$$
  echo >> /tmp/core.$$
done < /tmp/.core.$$

# Remove the temporary process list.
rm -f /tmp/.core.$$

exit 0

Now, just find the lines for the guilty process in the output file. In our case, it is the oprocd.bin process:

# vi /tmp/core.*
[...]
process name: oprocd.bin
stack pointer for thread 300171c7300: 2a10176b0b1
  000002a10176b161 kadmin+0x4a4()
  000002a10176b221 uadmin+0x11c()
  000002a10176b2e1 syscall_trap+0xac()
[...]
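Instead of browsing the report by hand, the owner of the panic thread can be searched for directly. This is a minimal sketch, assuming a POSIX shell and the report layout shown above ("process name:" lines followed by "stack pointer for thread" lines); find_thread_owner is a hypothetical name.

```shell
#!/bin/sh
# Sketch: print the name of the process owning a given thread,
# based on the report file produced by the script above.
# find_thread_owner is an illustrative name.
find_thread_owner() {
  # $1: thread address from ::panicinfo, $2: report file
  name=
  while IFS= read -r line; do
    case "$line" in
      "process name: "*)
        # Remember the process whose threads are listed next.
        name=${line#"process name: "} ;;
      "stack pointer for thread $1:"*)
        echo "$name"
        return 0 ;;
    esac
  done < "$2"
  return 1
}
```

Called with the thread value reported by ::panicinfo (300171c7300 here) and the report file, it would print oprocd.bin.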

This process is locked in memory to monitor the cluster and provide I/O fencing. oprocd.bin performs its check, then sleeps; if it wakes up later than the expected time, it resets the processor and reboots the node. An oprocd.bin failure results in Oracle Clusterware restarting the node. Please read the Oracle Clusterware and Oracle Real Application Clusters documentation for more information.

Although the incident is still under investigation, it seems the nodes were impacted by the leap second that was added at the end of 2008...
