blog'o thnet

Tag - crash

Sunday 22 March 2009

Finding The Process Responsible For Crashing A System

Recently, we encountered a wave of suicides on most of the nodes of several Oracle RAC clusters running on Sun M5000 domains. Although the Oracle RAC logs were interesting, they did not help us determine the precise origin of the crashes. Since the domains panicked, we were able to briefly analyze the crash dumps generated at panic time to find the process that initiated the panics. Here is how to do so.

First, make sure a complete and usable crash dump is available on persistent storage:

# cd /var/crash/nodename
# file *.0
unix.0:         ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, statically linked, not stripped, no debugging information available
vmcore.0:       SunOS 5.10 Generic_127111-11 64-bit SPARC crash dump from 'nodename'

Then, extract useful information using MDB dcmds such as ::status, ::showrev and ::panicinfo, which give the exact panic message and the thread responsible for the system crash:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd mpt px ssd fcp fctl md ip qlc hook neti sctp arp usba nca zfs random logindmux ptm cpc sppp crypto wrsmd fcip nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from nodename
operating system: 5.10 Generic_127111-11 (sun4u)
panic message: forced crash dump initiated at user request
dump content: kernel pages only
> ::showrev
Hostname: nodename
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_127111-11
Platform: SUNW,SPARC-Enterprise
> ::panicinfo
             cpu                0
          thread      300171c7300
         message forced crash dump initiated at user request
          tstate       4400001606
              g1                b
              g2                0
              g3          11c13e0
              g4              6e0
              g5         88000000
              g6                0
              g7      300171c7300
              o0          1208020
              o1      2a10176b9e8
              o2                1
              o3                0
              o4 fffffffffffffff5
              o5             1000
              o6      2a10176b0b1
              o7          10626a4
              pc          1044d8c
             npc          1044d90
               y                0
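
When several dumps have to be checked, the panic thread can also be pulled out of the ::panicinfo output non-interactively. The following is a minimal sketch, assuming the two-column layout shown above; the helper name panic_thread is ours, not part of MDB, and plain awk is used here (nawk works just as well).

```shell
#!/bin/sh
# panic_thread: read ::panicinfo output on stdin and print the value of
# the "thread" row. Hypothetical helper name; assumes the label/value
# layout shown in the session above.
panic_thread() {
  awk '$1 == "thread" { print $2; exit }'
}

# Typical use against a dump (run from /var/crash/nodename):
#   echo "::panicinfo" | mdb -k unix.0 vmcore.0 | panic_thread
```

The function only parses text, so it can be tried on a saved copy of the output before being pointed at a real dump.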

Now that we have the exact thread address (the thread value reported by ::panicinfo), we can find the corresponding UNIX process with the help of the following script:

# cat /var/tmp/findstack.vmcore.sh
#!/usr/bin/env sh

# List every process in the dump, keeping the address (column 8) and the
# command name (last column) and skipping the header line.
echo "::ps" | mdb -k unix.0 vmcore.0 | \
 nawk '$8 !~ /ADDR/ {print $8" "$NF}' > /tmp/.core.$$

cat /dev/null > /tmp/core.$$

# For each process, dump the kernel stack of all of its threads.
while read addr name; do
  echo "process name: ${name}" >> /tmp/core.$$
  echo "${addr}::walk thread | ::findstack" | \
   mdb -k unix.0 vmcore.0 >> /tmp/core.$$
  echo >> /tmp/core.$$
done < /tmp/.core.$$

\rm /tmp/.core.$$

exit 0

Now, just search the output file for the thread reported by ::panicinfo to find the guilty process. In our case, it is the oprocd.bin process:

# vi /tmp/core.*
[...]
process name: oprocd.bin
stack pointer for thread 300171c7300: 2a10176b0b1
  000002a10176b161 kadmin+0x4a4()
  000002a10176b221 uadmin+0x11c()
  000002a10176b2e1 syscall_trap+0xac()
[...]
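
Rather than paging through the file in vi, the lookup can be scripted. This is a small sketch, assuming the /tmp/core.* layout shown above (a "process name:" header followed by the ::findstack entries); the helper name owner_of_thread is ours, and on Solaris nawk should be substituted for awk, as in the script above.

```shell
#!/bin/sh
# owner_of_thread THREAD FILE: print the "process name:" header that
# precedes the ::findstack entry for THREAD in FILE.
# Hypothetical helper; assumes the file layout shown above.
owner_of_thread() {
  awk -v t="$1" '
    # Remember the most recent process header (text after "process name: ").
    /^process name: / { name = substr($0, 15) }
    # "stack pointer for thread <addr>: <sp>" marks the start of a stack.
    $1 == "stack" && $2 == "pointer" && $5 == (t ":") { print name; exit }
  ' "$2"
}

# Typical use, with the thread taken from ::panicinfo:
#   owner_of_thread 300171c7300 /tmp/core.12345
```

The file name in the usage comment is only an example; the real suffix is the PID of the shell that ran the script.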

This process is locked in memory to monitor the cluster and provide I/O fencing. oprocd.bin performs its check, stops running, and if the wake up is beyond the expected time, then it resets the processor and reboots the node. An oprocd.bin failure results in Oracle Clusterware restarting the node. Please read the Oracle Clusterware and Oracle Real Application Clusters documentation for more information.
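
The decision described above can be reduced to a single comparison. The sketch below is illustrative only and is not oprocd.bin's actual code: the function name overslept, its arguments, and the numbers in the example are all assumptions made for the sake of the illustration.

```shell
#!/bin/sh
# overslept EXPECTED_MS ACTUAL_MS MARGIN_MS
# Succeeds (exit 0) when the observed sleep exceeded the expected sleep
# by more than the allowed margin, i.e. when a watchdog of this kind
# would conclude that the node had stalled. Illustrative only.
overslept() {
  # expr and backticks keep this usable with an old Bourne shell.
  [ `expr "$2" - "$1"` -gt "$3" ]
}

# Example with made-up numbers: asked to sleep 1000 ms, woke up after
# 1800 ms, with a 500 ms margin.
#   overslept 1000 1800 500 && echo "would reset the node"
```

Anything that makes the measured interval look longer than it should be trips the same comparison.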

Although the incident is still under investigation, it seems the nodes were affected by the leap second that was added at the end of 2008...

Sunday 9 March 2008

Update A Corrupted GRUB Boot Archive, Without SVM

Solaris 10 systems on x86 architecture use the GNU GRand Unified Bootloader (GRUB) which is the boot loader responsible for loading a boot archive into a system's memory. The boot archive is a collection of critical files (kernel modules and configuration files) that are required to boot the Solaris OS. As stated in the Sun documentation:

These files are needed during system startup before the root file system is mounted. Two boot archives are maintained on a system:

  • The boot archive that is used to boot the Solaris OS on a system. This boot archive is sometimes called the primary boot archive.
  • The boot archive that is used for recovery when the primary boot archive is damaged. This boot archive starts the system without mounting the root file system. On the GRUB menu, this boot archive is called failsafe. The archive's essential purpose is to regenerate the primary boot archive, which is usually used to boot the system.

The Solaris OS generally keeps the boot archive properly synchronized on its own. Sometimes, the boot archive gets corrupted, for example when (bad) patches are applied or when the operating system crashes. In these cases, the boot archive must be regenerated. This is easily accomplished by following the Sun documents x86: How to Boot the Failsafe Archive for Recovery Purposes, and x86: How to Boot the Failsafe Archive to Forcibly Update a Corrupt Boot Archive. The main drawback arises when the system is encapsulated under an SVM mirror (RAID-1), since the md driver is not handled in failsafe mode. Please refer to this blog entry on this subject, if needed.
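
For the simple case without SVM, the core of the procedure is a single bootadm(1M) call against the root file system mounted from the failsafe shell. The sketch below only prints the commands so they can be reviewed first; the helper name rebuild_archive is ours, and the /a mount point in the usage comment is an assumption based on where the failsafe session usually offers to mount the root slice.

```shell
#!/bin/sh
# rebuild_archive ALTROOT: print the commands that forcibly rebuild the
# primary boot archive of the root file system mounted at ALTROOT.
# Prints only; nothing is executed by this function.
rebuild_archive() {
  echo "bootadm update-archive -f -R $1"
  echo "umount $1"
}

# Typical use from the failsafe shell, once the root slice is mounted:
#   rebuild_archive /a          # review the commands
#   rebuild_archive /a | sh     # then run them, and reboot afterwards
```

Printing first is a deliberate choice here: in a recovery shell it is worth checking the alternate root before touching the archive.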

Wednesday 12 April 2006

Stability Problem and New RAID-1 System

Just after the big update and the installation of the new infrastructure last month, the server ran into stability problems on a regular basis (sometimes crashing more than once a day). At first, we thought it might be caused by the new processor architecture, whose FreeBSD port is less mature than the well-known i386 one.

But the problem appears to be specifically related to I/O, especially under very intensive input/output, e.g. a file system dump(8), a file system snapshot with mksnap_ffs(8), or big tar(1) or cpio(1L) archive transfers.

After several days, we still had not managed to clearly isolate the source of the problem, but decided to quickly put the system under RAID-1 management (disk mirroring) and configured the excellent gmirror(8) to do the job. So, I switched from an old hardware RAID mechanism to a purely software-based one. I hope we will be able to find something new without disturbing the ThNET services anymore.

More on the steps involved in putting the system under gmirror(8) control in a later post. Stay tuned.

Wednesday 26 October 2005

Chasing FIFO Bug

While testing RELENG_6 (for weeks now), I encountered a strange bug running both a UP and an SMP kernel under heavy parallel load, such as when building the world using the -j flag. The symptom was simply a panic, no more, no less.

After posting about it on current@, I received good help from Robert N M Watson himself, who asked for a crash dump and some testing. After some investigation on his side, he came back with a lot of work on FreeBSD's FIFO implementation, now committed to the source tree.

As a side note, it should be mentioned that these modifications seem to solve the problem even on the UP machine, but Robert said that this one may have had another origin and may not be totally solved. Here is an excerpt from one of his replies:

This is actually interesting -- Kris Kenneway reported the same thing -- that is, that with the most recent spate of FIFO bug fixes, the problem has gone away. The only problem is that I don't think I fixed a bug with the symptoms that you have experienced, which suggests instead that I've changed the timing of the bug so that it occurs less rarely. I sent Kris a set of assertion patches to run with, and he generates some nice parallel make load, and I have a regression test I've been running. I think we're set for now and I'll continue to work on reproducing it after my recent fifo cleanup. It is possible I fixed a race condition without meaning to, of course... Please let me know if you have any further FIFO panics! Thanks again, Robert N M Watson

I can only say that the panic has not occurred since then. Thanks to him.