Recently, we encountered a wave of node suicides across several Oracle RAC clusters running on a number of Sun M5000 domains. Although the Oracle RAC logs were interesting, they didn't let us determine precisely where the crashes originated. Since the domains panicked, we were able to quickly analyze the crash dumps generated at panic time and identify the process that initiated the panics. Here is how to do so.
First, make sure you have a proper, usable crash dump on persistent storage:
# cd /var/crash/nodename
# file *.0
unix.0: ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, statically linked, not stripped, no debugging information available
vmcore.0: SunOS 5.10 Generic_127111-11 64-bit SPARC crash dump from 'nodename'
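If nothing shows up under /var/crash, the dump configuration itself is worth a look. As a minimal sketch, assuming the standard dumpadm and savecore tools and an example directory:
# dumpadm                              # show the current dump device and savecore directory
# dumpadm -y -s /var/crash/nodename    # run savecore at boot and write the dump files there
# savecore -v /var/crash/nodename      # manually extract a dump still sitting on the dump device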
Then, extract useful information using MDB dcmds such as ::status, ::showrev and ::panicinfo, which give us the exact panic message and the thread responsible for the system crash:
# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd mpt px ssd fcp fctl md ip qlc hook neti sctp arp usba nca zfs random logindmux ptm cpc sppp crypto wrsmd fcip nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from nodename
operating system: 5.10 Generic_127111-11 (sun4u)
panic message: forced crash dump initiated at user request
dump content: kernel pages only
> ::showrev
Hostname: nodename
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_127111-11
Platform: SUNW,SPARC-Enterprise
> ::panicinfo
cpu 0
thread 300171c7300
message forced crash dump initiated at user request
tstate 4400001606
g1 b
g2 0
g3 11c13e0
g4 6e0
g5 88000000
g6 0
g7 300171c7300
o0 1208020
o1 2a10176b9e8
o2 1
o3 0
o4 fffffffffffffff5
o5 1000
o6 2a10176b0b1
o7 10626a4
pc 1044d8c
npc 1044d90
y 0
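As a shortcut, the panic thread address reported by ::panicinfo can be fed straight into ::findstack to print its kernel stack, without walking every process; assuming the same core files as above:
# echo "300171c7300::findstack" | mdb -k unix.0 vmcore.0
This only prints the stack, though, not the owning process name.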
Well. Now that we have the address of the panic thread (thread ID), we can find the corresponding UNIX process with the help of the following script:
# cat /var/tmp/findstack.vmcore.sh
#!/usr/bin/env sh
# Print the kernel stack of every thread of every process found in the
# crash dump, grouped by process name.
#
# Grab "ADDR NAME" pairs from ::ps, skipping the header line.
echo "::ps" | mdb -k unix.0 vmcore.0 | \
    nawk '$8 !~ /ADDR/ {print $8" "$NF}' > /tmp/.core.$$
cat /dev/null > /tmp/core.$$
# For each process, walk its threads and dump their stacks.
while read ps; do
    echo "process name: `echo ${ps} | nawk '{print $2}'`" >> /tmp/core.$$
    echo ${ps} | nawk '{print $1"::walk thread | ::findstack"}' | \
        mdb -k unix.0 vmcore.0 >> /tmp/core.$$
    echo >> /tmp/core.$$
done < /tmp/.core.$$
\rm /tmp/.core.$$
exit 0
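To use it, run the script from the crash directory (it expects unix.0 and vmcore.0 in the current directory), then search the resulting file for the panic thread address reported by ::panicinfo; the grep below is just one way to locate it:
# sh /var/tmp/findstack.vmcore.sh
# grep -n 300171c7300 /tmp/core.*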
Now, just find the lines for the guilty process in the output file. In our case, it is the oprocd.bin process:
# vi /tmp/core.*
[...]
process name: oprocd.bin
stack pointer for thread 300171c7300: 2a10176b0b1
000002a10176b161 kadmin+0x4a4()
000002a10176b221 uadmin+0x11c()
000002a10176b2e1 syscall_trap+0xac()
[...]
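The result can be double-checked the other way around, going from the process name back to its thread stacks with the ::pgrep dcmd (the pattern is simply whatever matches the suspect process):
# echo "::pgrep oprocd | ::walk thread | ::findstack" | mdb -k unix.0 vmcore.0
One of the stacks printed should show the same kadmin/uadmin frames and the stack pointer of the panic thread.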
This process is locked in memory to monitor the cluster and provide I/O fencing. oprocd.bin performs its check, stops running, and if the wake up is beyond the expected time, it resets the processor and reboots the node. An oprocd.bin failure results in Oracle Clusterware restarting the node. Please read the Oracle Clusterware and Oracle Real Application Clusters documentation for more information.
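For illustration only, the behaviour described above boils down to a simple watchdog loop. Here is a rough shell sketch of that pattern (a hypothetical simplification with made-up interval and margin values, not the actual oprocd.bin implementation), whose last step is the uadmin(2) call visible in the stack trace above:
#!/usr/bin/env sh
# Hypothetical watchdog sketch: sleep for a fixed interval and force a
# crash dump if the wakeup comes back far later than expected.
INTERVAL=1     # seconds between checks (made-up value)
MARGIN=10      # tolerated extra delay in seconds (made-up value)
while :; do
    BEFORE=`perl -e 'print time'`
    sleep ${INTERVAL}
    AFTER=`perl -e 'print time'`
    if [ `expr ${AFTER} - ${BEFORE}` -gt `expr ${INTERVAL} + ${MARGIN}` ]; then
        # Woke up far too late: panic the node with a forced crash dump
        # and reboot (A_DUMP is uadmin command 5, AD_BOOT is function 1).
        uadmin 5 1
    fi
done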
Although the incident is still under investigation, it seems the nodes were impacted by the leap second added at the end of 2008...
Update #1 (2010-10-09): Here is a very interesting French blog entry about the same problem.