blog'o thnet

To content | To menu | To search

Sunday 22 March 2009

Finding The Process Responsible For Crashing A System

Recently, we encounter a wave of suicide on most of the nodes which formed some Oracle RAC cluster on lots of Sun M5000 domains platforms. Although the logs found on Oracle RAC were interesting, they didn't help us to determine precisely the origin of the crashes. Since the domains panic'ed, we were able to briefly analyze the cores generated at crash time to get the process which initiated the panics. Here is how to do so.

First, be sure to have proper and usable core on persistent storage:

# cd /var/crash/nodename
# file *.0
unix.0:         ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, statically linked, not stripped, no debugging information available
vmcore.0:       SunOS 5.10 Generic_127111-11 64-bit SPARC crash dump from 'nodename'

Then, extract useful information using MDB dcmds such as ::status, ::showrev and ::panicinfo which give us the exact panic message and provide us the message and thread responsible for the system crash:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd mpt px ssd fcp fctl md ip qlc hook neti sctp arp usba nca zfs random logindmux ptm cpc sppp crypto wrsmd fcip nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from nodename
operating system: 5.10 Generic_127111-11 (sun4u)
panic message: forced crash dump initiated at user request
dump content: kernel pages only
> ::showrev
Hostname: nodename
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_127111-11
Platform: SUNW,SPARC-Enterprise
> ::panicinfo
             cpu                0
          thread      300171c7300
         message forced crash dump initiated at user request
          tstate       4400001606
              g1                b
              g2                0
              g3          11c13e0
              g4              6e0
              g5         88000000
              g6                0
              g7      300171c7300
              o0          1208020
              o1      2a10176b9e8
              o2                1
              o3                0
              o4 fffffffffffffff5
              o5             1000
              o6      2a10176b0b1
              o7          10626a4
              pc          1044d8c
             npc          1044d90
               y                0

Well. Now what we have the exact thread number (thread ID), we can find the corresponding UNIX process helped by the following script:

# cat /var/tmp/findstack.vmcore.sh
#!/usr/bin/env sh

echo "::ps" | mdb -k unix.0 vmcore.0 | \
 nawk '$8 !~ /ADDR/ {print $8" "$NF}' > /tmp/.core.$$

cat /dev/null > /tmp/core.$$

while read ps; do
  echo "process name: `echo ${ps} | nawk '{print $2}'`" >> /tmp/core.$$
  echo ${ps} | nawk '{print $1"::walk thread | ::findstack"}' | \
   mdb unix.0 vmcore.0 >> /tmp/core.$$
  echo >> /tmp/core.$$
done < /tmp/.core.$$

\rm /tmp/.core.$$

exit 0

Now, just find the lines for the guilty process in the output file. In our case, it is the oprocd.bin process:

# vi /tmp/core.*
[...]
process name: oprocd.bin
stack pointer for thread 300171c7300: 2a10176b0b1
  000002a10176b161 kadmin+0x4a4()
  000002a10176b221 uadmin+0x11c()
  000002a10176b2e1 syscall_trap+0xac()
[...]

This process is locked in memory to monitor the cluster and provide I/O fencing. oprocd.bin performs its check, stops running, and if the wake up is beyond the expected time, then it resets the processor and reboots the node. An oprocd.bin failure results in Oracle Clusterware restarting the node. Please read the Oracle Clusterware and Oracle Real Application Clusters documentation for more information.

Although the incident is always under investigation, it seems the nodes were impacted by the additional second that was added at the end of 2008...

Friday 2 May 2008

PHP APC Extension Bug With Optimized Open Source Software Stack

To easily manage LDAP accounts (and general LDAP entries in fact), we have created a Solaris Zone and installed the excellent Cool Stack bundle to host the LAM (LDAP Account Manager) management web tool. But after upgrading the Cool Stack to version 1.2 we encountered a very annoying problem mostly with freezing web pages, and generally ending up in restarting the Apache web server provided by the Cool Stack. After some troubleshooting, we discover that this behavior was introduced by a bug in the APC-3.0.14 module bundled with the updated php-5.2.4 scripting software in this version of the Cool Stack.

Luckily, the bug was already fixed and a new version of the APC extension of PHP is available for download (in fact, just replace to original apc.so module by the new one). All the Cool Stack related problems, associated fixes and instructions are listed on the Cool Stack 1.2 Patches page: be sure to keep in sync' if you are a Cool Stack consumer.

Monday 17 December 2007

A Process Can Not Be Killed Upon Hanging In sendfilev()

Here is a little but annoying bug I faced recently on a production system running Solaris 10 11/06: a process can not be killed.

The KILL signal can't be ignored by a process. The associated handler must be the default--taking the signal into account, and honor it--and it is the case (the incriminated process id is 26632):

# psig 26632 | grep KILL
KILL    default

Nevertheless, the process seems not to stop as expected:

# echo $$
2913
#
# kill -s KILL 26632
#
# dtrace -n 'fbt::sigtoproc:entry /pid == 2913/ {trace(arg2); trace(execname);}'
dtrace: description 'fbt::sigtoproc:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 18  16387                  sigtoproc:entry                 9  tcsh
#
# ps -ef | awk '$2 ~ /26632/ {print $0}'
nonpriv 26632     1   0   Nov 13 ?         0:01 /usr/sbin/in.ftpd -a

This not the correct behavior since a pending signal is not honored:

# pflags 26632
26632:  /usr/sbin/in.ftpd -a
       data model = _ILP32  flags = ORPHAN|MSACCT|MSFORK
       sigpend = 0x00006100,0x00000000
/1:    flags = 0

More, the process seems in a non-coherent system state: it is now nearly impossible to trace it using, truss, the proctools,...

# truss -alef -p 26632
truss: unanticipated system error: 26632
#
# pstack 26632
pstack: cannot examine 26632: unanticipated system error
#
# pfiles 26632
pfiles: unanticipated system error: 26632
#
# pldd 26632
pldd: cannot examine 26632: unanticipated system error

... or helped by DTace:

# dtrace -n 'profile-1000hz /pid ==26632/ { @[stack()] = count() }'
dtrace: description 'profile-1000hz ' matched 1 probe
^C
#
# dtrace -n 'profile-1000hz /pid ==26632/ { @[ustack()] = count() }'
dtrace: description 'profile-1000hz ' matched 1 probe
^C

Follow some information gathered through mdb in kernel debugging mode:

> ::status
debugging live kernel (64-bit) on socrate
operating system: 5.10 Generic_118833-36 (sun4u)
>
> ::showrev
Hostname: socrate
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_118833-36
Platform: SUNW,Sun-Fire-V490
>
> ::pgrep in.ftpd
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  26632      1    261    261  30501 0x5a024b00 00000600144b0c28 in.ftpd
>
> 00000600144b0c28::kill
mdb: command is not supported by current target
>
> 00000600144b0c28::thread
            ADDR    STATE    FLG PFLG SFLG   PRI  EPRI PIL             INTR
00000600144b0c28 inval/2000 157e 5778    0     0     0   6                2
>
> 00000600144b0c28::walk thread | ::findstack
stack pointer for thread 30002aba360: 2a1022e7ec1
[ 000002a1022e7ec1 cv_wait+0x38() ]
  000002a1022e7f71 page_lock_es+0x204()
  000002a1022e8021 pvn_vplist_dirty+0x2a4()
  000002a1022e8101 nfs_putpages+0x124()
  000002a1022e81c1 nfs3_putpage+0xcc()
  000002a1022e8271 fop_putpage+0x1c()
  000002a1022e8321 nfs_purge_caches+0xe4()
  000002a1022e83d1 nfs_attr_cache+0x20c()
  000002a1022e8481 nfs3_getattr_otw+0x1b8()
  000002a1022e85f1 nfs3_validate_caches+0x4c()
  000002a1022e8731 nfs3_getpage+0xa4()
  000002a1022e8861 fop_getpage+0x44()
  000002a1022e8931 segmap_getmapflt+0x588()
  000002a1022e8a41 snf_segmap+0x13c()
  000002a1022e8bc1 sosendfile64+0x298()
  000002a1022e8d21 sendvec64+0xf8()
  000002a1022e8f61 sendfilev+0x178()
  000002a1022e92e1 syscall_trap32+0xcc()

We can see the use of the sendfilev() syscall: it appears to be a known bug around this (you must have a registered Sun support customer account to be able to view the second document).

The good news is the root cause is already fix in the development branch of the operating system. The bad news is, after opening a call to the Sun support team, that no patch will be released in a near future to correct this: the incorporation of the fix is currently planned for the next major Solaris 10 Update (Update 5) which is scheduled for summer 2008. However, the fix is already available via the Solaris Express program, if that is applicable to your environment. As a last note, and if the non-killable process owns a resource necessary to another program, there seems no other option than to plan a reboot of the system.

Thursday 27 September 2007

European Type6 USB Keyboard Circumflex Key Issue... Solved

I recently worked with Dermot Malone, the Responsible Engineer for this bug report.

The fact is the described problem persist as explained in the Description field of the bug report, running snv_70. But I must say I found something recently: as of now, I used the default C locale. But after reading this documentation (go to Unicode Locale: en_US.UTF-8 Support), I decided to try the en_US.UTF-8 locale, directly chosen from the dtlogin screen... and found that I can now have a similar English environment as before (using C locale), while supporting accent and circumflex characters from applications which support this locale (for example the GNOME Terminal, or Mozilla Firefox). Hum, I just find a little big curious that most of accent characters (say 'é') works properly using the C locale, but that I need to set an other locale to be able to use the circumflex characters (say 'ô').

As a matter of interest, Dermot ask me why I didn't use the fr_FR.UTF-8 locale. Here is my answer:

Well, I am a French guy, but systematically install English operating systems, and use the default C locale. In IT, English is _the_ standard, and all things (manual pages, messages, format strings, etc.) are all homogeneous this way. As a side note, I am sure not to encounter the problems found on RedHat Linux systems when the default locale is not properly supported by the OS sub-systems themselves (the RC scripts generally sets the LANG=C (or something like that) for 'grep', 'sed' and 'awk' for example), or some third party products (such as the IBM TSM Backup Archive client).

Now, using the en_US.UTF-8 locale on Solaris and Solaris Express, I can have best of both worlds: a fully functional (and supported) English environment, and be able to use extended characters specific to my language.

Sunday 18 February 2007

DHCP Bug Reintroduced in SXCE Build 57

After live upgrade'ing to snv_57, i was unable to get network informations, normally obtained using internal DHCP server. In the system log file, i can see something like:

Sat Feb 17 11:06:18 unic /sbin/dhcpagent[1494]: [ID 91213 daemon.warning]
 configure_v4_lease: incorrect broadcast address 255.255.255.255 specified
 for nge0; ignoring
Sat Feb 17 11:06:18 unic /sbin/dhcpagent[1494]: [ID 94624 daemon.warning]
 checkaddr: expected peer address 192.168.1.255 on nge0, have 255.255.255.255

In fact, it seems an old bug was reintroduced while working on an other (related) CR. See bug report 6516629 for a detail description of the problem. Since it was already fixed in snv_58, and because there are other known issues with this build being unable to boot some systems i will stay with build 56, waiting for build 58.

Monday 22 January 2007

European Type6 USB Keyboard Circumflex Key Issue

Although my first bug report against snv_38 on a issue related to the use of a European Type6 USB keyboard with circumflex key seems to be fixed somewhere between snv_39 and snv_54 (at least the status mentioned "Closed: Not Reproducible"), it is not. Filled a new bug report linking to the first one, against snv_55b this time.

Well, i think i will continue to copy and paste the appropriate word using Google searches for a while...

Sunday 4 June 2006

Precompiled PHP RPMs and IBM IHS Web Server on AIX

Some developers wanted to use the IBM provided PHP packages on our AIX pSeries servers to use with the IHS web server. So, we installed the corresponding RPMs from the AIX Toolbox for Linux Applications CD.

But quickly, the following error messages show up when loading the PHP module:

Cannot load /opt/freeware/apache/libexec/libphp4.so into server:
rtld: 0712-001 Symbol ap_child_terminate was referenced from module
/opt/freeware/apache/libexec/libphp4.so(), but a runtime definition of the symbol
was not found.
rtld: 0712-001 Symbol ap_signal was referenced from module
/opt/freeware/apache/libexec/libphp4.so
(), but a runtime definition of the symbol was not found.
rtld: 0712-001 Symbol ap_hard_timeout was referenced from module
/opt/freeware/apache/libexec/libphp4.so(), but a runtime definition of the symbol
was not found.
rtld: 0712-001 Symbol ap_reset_timeout was referenced from module
/opt/freeware/apache/libexec/libphp4.so(), but a runtime definition of the symbol
was not found.
rtld: 0712-001 Symbol ap_table_get was referenced from module
/opt/freeware/apache/libexec/libphp4.so(), but a runtime definition of the symbol
was not found.
rtld: 0712-001 Symbol ap_pstrdup was referenced from module
/opt/freeware/apache/libexec/libphp4.so
(), but a runtime definition of the symbol was not found.
rtld: 0712-001 Symbol ap_table_add was referenced from module
/opt/freeware/apache/libexec/libphp4.so(), but a runtime definition of the symbol
was not found.
Additional errors occurred but are not reported.

However, we can see that:

  1. The dependencies (dynamic linked libraries) are all resolved:
    # ldd /opt/freeware/apache/libexec/libphp4.so
    /opt/freeware/apache/libexec/libphp4.so needs:
             /usr/lib/libc.a(shr.o)
             /usr/lib/libdl.a(shr.o)
             /usr/lib/libbind.a(shr.o)
             /usr/lib/libcrypt.a(shr.o)
             /usr/lib/libz.a(libz.so.1)
             /unix
       
    
  2. The needed symbols are not found anyway:
    # nm /opt/freeware/apache/libexec/libphp4.so | grep ap_hard_timeout
    .ap_hard_timeout     T      843164
    .ap_hard_timeout     t      843164          40
    ap_hard_timeout      U           -
    ap_hard_timeout      d      247100           4
       
    
    The U means Undefined symbol in this case.

In fact, the system calls which begin with ap_ are generally Apache specific. In fact, here is what we can found on a FreeBSD system with the 2.2 release of the Apache web server installed:

# find /usr/local -print | xargs grep ap_allow_options
/usr/local/include/apache22/http_core.h:AP_DECLARE(int)
ap_allow_options(request_rec *r);
Binary file /usr/local/libexec/apache22/mod_include.so matches
Binary file /usr/local/libexec/apache22/mod_autoindex.so matches
Binary file /usr/local/libexec/apache22/mod_cgi.so matches
Binary file /usr/local/libexec/apache22/mod_negotiation.so matches Binary file
/usr/local/libexec/apache22/mod_rewrite.so matches
/usr/local/libexec/apache22/httpd.exp:ap_allow_options
Binary file /usr/local/sbin/httpd matches
#
# nm /usr/local/sbin/httpd | grep -i ap_allow_options
0000000000429c50 T ap_allow_options

So, it seems that the RPMs provided for the PHP support is compiled for Apache exclusively (since the function prototypes are declared in a header file provided by the core Apache distribution), which seems very reasonable at a second though.

Ended to compile it by hand; which isn't finally too bad for the last PHP5 version.

Sunday 30 April 2006

List of Bugs Fixed per Operating System/Networking (ON) Build

Already want to be able to easily find what is new in each build of the ON consolidation from OpenSolaris; including Solaris Express and Solaris Express: Community Release?

Speaking about specifics putback logs (HTML and text based reports are available):

  1. ON consolidation: change logs and downloads
  2. X consolidation: change logs

For more general overview of what happen at the SX release level, the two best places are certainly:

  1. docs.sun.com: What's New in Solaris Express
  2. Dan Price's Weblog: Solaris entries

Friday 28 October 2005

FreeBSD LORs (Lock Order Reversal)

I encountered two LORs in the past, especially during the testing phase of the just behind the corner FreeBSD 6.0 release.

I already reported about two of them, but wanted to point out that the second seems to be fixed, as shown in this patch. Regrettably, the MFC seem not to have taken place in the RELENG_6 branch... yet. On the other side, this one is not known to generate panic or some other bad behavior, so i think it is ok.

Wednesday 26 October 2005

Chasing FIFO Bug

When testing RELENG_6 (for weeks now), i encountered a strange bug running an UP and a SMP kernel under heavy parallel load, as when building the world using the -j flag. The symptom was simply a panic, no more no less.

After posting about it on current@, i received good help from Robert N M Watson himself asking for crash dump and testing. After some work on his side, he came with a lot of work on the FreeBSD's FIFO implementation, now committed to the source tree.

As a side note, it be mentioned that these modifications seems to solve the problem even on the UP machine, but Robert said that this one may had another origin and may not be totally solved. Here is an excerpt of one of his reply:

This is actually interesting -- Kris Kenneway reported the same thing -- that is, that with the most recent spate of FIFO bug fixes, the problem has gone away. The only problem is that I don't think I fixed a bug with the symptoms that you have experienced, which suggests instead that I've changed the timing of the bug so that it occurs less rarely. I sent Kris a set of assertion patches to run with, and he generates some nice parallel make load, and I have a regression test I've been running. I think we're set for now and I'll continue to work on reproducing it after my recent fifo cleanup. It is possible I fixed a race condition without meaning to, of course... Please let me know if you have any further FIFO panics! Thanks again, Robert N M Watson

I just can say that panic doesn't occur since then. Thanks to him.

- page 1 of 2