blog'o thnet

To content | To menu | To search

Tag - processus

Entries feed - Comments feed

Tuesday 16 March 2010

Debugging After A Solaris System Was P2V'ed

I recently faced a problem where an application stop working after transferring a system from an old Sun E450 running Solaris 8 to a more recent Sun Fire V490 running Solaris 10 10/09 with the Solaris 8 Container software stack. Although all went smooth during the P2V, and most of all applications runs pretty well in the non-global zone, we encounter a problem where the Courier SMTP product didn't answer anymore to remote connections, such as:

# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

The fact is, the couriertcpd program is running, and listening to the TCP port number 25.

# pfiles `pgrep couriertcpd | head -1`
5691:   /usr/lib/courier/sbin/couriertcpd
  Current rlimit: 1024 file descriptors
[...]
   5: S_IFSOCK mode:0666 dev:366,0 ino:62871 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        sockname: AF_INET 0.0.0.0  port: 25
[...]

Interestingly, here is what is logged by the Courier application:

# grep mail.info /path/to/Courier.log
XXX courieresmtpd: [ID 702911 mail.info] gdbm fatal: couldn't init cache

The problem seems located around the gdbm library which is a dependency of Courier. So, what happened when we try to initiate a direct connection using telnet on port 25:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 245
245/1:          fork()          (returning as child ...)        = 5691
[...]
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004336D8)                                 Err#12 ENOMEM
245/1:          write(2, " g d b m   f a t a l :  ", 12)        = 12
245/1:          write(2, 0xFF2D4D28, 19)                        = 19
245/1:             c o u l d n ' t   i n i t   c a c h e
245/1:          write(2, "\n", 1)                               = 1
245/1:          llseek(0, 0, SEEK_CUR)                          = 0
245/1:          _exit(-1)
5691/1:             Received signal #18, SIGCLD, in poll() [caught]
[...]
# echo "ibase=16; `echo 004136D8 | tr [:lower:] [:upper:]`" | bc
4273880
# echo "ibase=16; `echo 004336D8 | tr [:lower:] [:upper:]`" | bc
4404952

Ok. The brk(2) function is used to change dynamically the amount of space allocated for the calling process's data segment. So, the process which PID is 245, forked from couriertcpd, can not allocate more than 4273880 bytes since a try to allocate 4404952 bytes return an error. When the brk(2) function fails, it is generally due to one of these two major cases:

  1. Insufficient space exists in the swap area to support the expansion.
  2. The data segment size limit as set by setrlimit(2) would be exceeded.

At first, we decided to grow the size of the swap space in the global zone from 4GB to 8GB (since there is no resource control on memory for this non-global zone). As it is a full-ZFS Solaris 10 system, here are the steps to do so:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=8G rpool/swap
# /sbin/swapadd
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 256,2      16 16777200 16777200

But nothing change from the couriertcpd point of view, as we may have thought since we were able to deallocate the entire swap device from the global zone. So, we shrink it back using the same steps as for growing it. Then, is it possible we have reached an upper resource limitation?

The couriertcpd is executed under the daemon identity: what are the maximum size of data segment (or heap) for the daemon account:

# su - daemon -c "ulimit -Sd; ulimit -Hd"
unlimited
unlimited

This seems more than sufficient, not to say more. But what are the real current limitation for the running process?

# plimit -k 5691
5691:   /usr/lib/courier/sbin/couriertcpd
   resource              current         maximum
  time(seconds)         unlimited       unlimited
  file(kbytes)          unlimited       unlimited
  data(kbytes)          4096            4096
  stack(kbytes)         8192            unlimited
  coredump(kbytes)      unlimited       unlimited
  nofiles(descriptors)  1024            1024
  vmemory(kbytes)       unlimited       unlimited

Ok. One can argue that the 4404952 bytes size is very close to the 4096KB (4194304 bytes) data size limitation... and yes, it is. The two points here are the fact that the limitation is close to memory allocation size for sure, and that there must be some settings somewhere which sets this resource limitation since we verified it is not a the account level (the current hard limit is set to 4096KB, thus not very unlimited). So, we tried to set the resource limit for the running process to much higher value, and voila:

# plimit -d 65536,65536 5691
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 S0003010 ESMTP
quit
221 Bye.
Connection closed by foreign host.

Well, all seems far better this time:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 4861
4861/1:          fork()          (returning as child ...)        = 5691
[...]
4861/1:         brk(0x00CF36D8)                                 = 0
4861/1:         lseek(6, 262144, SEEK_SET)                      = 262144
4861/1:         read(6, 0x00071730, 131072)                     = 131072
[...]
# echo "ibase=16; `echo 00CF36D8 | tr [:lower:] [:upper:]`" | bc
13579992

So, now the memory allocation succeed at changing the space allocated up to 13579992 bytes (13262KB). We decided to set the data segment size up to 16MB, which seems to be sufficient in our case.

We now need to find the configuration responsible for setting the size of the data segment size to 4096KB. Looking at the courier documentation, and and after a bit of digging on the system itself we found the responsible configuration file, and adapt it as follow:

# grep ULIMIT= /usr/lib/courier/etc/esmtpd
#ULIMIT=4096
ULIMIT=16384

As a last validation, we made a complete reboot of the non-global zone to verify the overall behavior, which is OK right now. As a last word, it was very interesting problem to fight with, but I didn't understand until now why 4MB was sufficient on an old Sun E450, and why it is not in a branded Solaris 8 non-global zone... comments welcome.

Sunday 22 March 2009

Finding The Process Responsible For Crashing A System

Recently, we encounter a wave of suicide on most of the nodes which formed some Oracle RAC cluster on lots of Sun M5000 domains platforms. Although the logs found on Oracle RAC were interesting, they didn't help us to determine precisely the origin of the crashes. Since the domains panic'ed, we were able to briefly analyze the cores generated at crash time to get the process which initiated the panics. Here is how to do so.

First, be sure to have proper and usable core on persistent storage:

# cd /var/crash/nodename
# file *.0
unix.0:         ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, statically linked, not stripped, no debugging information available
vmcore.0:       SunOS 5.10 Generic_127111-11 64-bit SPARC crash dump from 'nodename'

Then, extract useful information using MDB dcmds such as ::status, ::showrev and ::panicinfo which give us the exact panic message and provide us the message and thread responsible for the system crash:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd mpt px ssd fcp fctl md ip qlc hook neti sctp arp usba nca zfs random logindmux ptm cpc sppp crypto wrsmd fcip nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from nodename
operating system: 5.10 Generic_127111-11 (sun4u)
panic message: forced crash dump initiated at user request
dump content: kernel pages only
> ::showrev
Hostname: nodename
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_127111-11
Platform: SUNW,SPARC-Enterprise
> ::panicinfo
             cpu                0
          thread      300171c7300
         message forced crash dump initiated at user request
          tstate       4400001606
              g1                b
              g2                0
              g3          11c13e0
              g4              6e0
              g5         88000000
              g6                0
              g7      300171c7300
              o0          1208020
              o1      2a10176b9e8
              o2                1
              o3                0
              o4 fffffffffffffff5
              o5             1000
              o6      2a10176b0b1
              o7          10626a4
              pc          1044d8c
             npc          1044d90
               y                0

Well. Now what we have the exact thread number (thread ID), we can find the corresponding UNIX process helped by the following script:

# cat /var/tmp/findstack.vmcore.sh
#!/usr/bin/env sh

echo "::ps" | mdb -k unix.0 vmcore.0 | \
 nawk '$8 !~ /ADDR/ {print $8" "$NF}' > /tmp/.core.$$

cat /dev/null > /tmp/core.$$

while read ps; do
  echo "process name: `echo ${ps} | nawk '{print $2}'`" >> /tmp/core.$$
  echo ${ps} | nawk '{print $1"::walk thread | ::findstack"}' | \
   mdb unix.0 vmcore.0 >> /tmp/core.$$
  echo >> /tmp/core.$$
done < /tmp/.core.$$

\rm /tmp/.core.$$

exit 0

Now, just find the lines for the guilty process in the output file. In our case, it is the oprocd.bin process:

# vi /tmp/core.*
[...]
process name: oprocd.bin
stack pointer for thread 300171c7300: 2a10176b0b1
  000002a10176b161 kadmin+0x4a4()
  000002a10176b221 uadmin+0x11c()
  000002a10176b2e1 syscall_trap+0xac()
[...]

This process is locked in memory to monitor the cluster and provide I/O fencing. oprocd.bin performs its check, stops running, and if the wake up is beyond the expected time, then it resets the processor and reboots the node. An oprocd.bin failure results in Oracle Clusterware restarting the node. Please read the Oracle Clusterware and Oracle Real Application Clusters documentation for more information.

Although the incident is always under investigation, it seems the nodes were impacted by the additional second that was added at the end of 2008...

Update #1 (2010-10-09): Here is a very interesting french blog entry about the same problem.

Monday 17 December 2007

A Process Can Not Be Killed Upon Hanging In sendfilev()

Here is a little but annoying bug I faced recently on a production system running Solaris 10 11/06: a process can not be killed.

The KILL signal can't be ignored by a process. The associated handler must be the default--taking the signal into account, and honor it--and it is the case (the incriminated process id is 26632):

# psig 26632 | grep KILL
KILL    default

Nevertheless, the process seems not to stop as expected:

# echo $$
2913
#
# kill -s KILL 26632
#
# dtrace -n 'fbt::sigtoproc:entry /pid == 2913/ {trace(arg2); trace(execname);}'
dtrace: description 'fbt::sigtoproc:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 18  16387                  sigtoproc:entry                 9  tcsh
#
# ps -ef | awk '$2 ~ /26632/ {print $0}'
nonpriv 26632     1   0   Nov 13 ?         0:01 /usr/sbin/in.ftpd -a

This not the correct behavior since a pending signal is not honored:

# pflags 26632
26632:  /usr/sbin/in.ftpd -a
       data model = _ILP32  flags = ORPHAN|MSACCT|MSFORK
       sigpend = 0x00006100,0x00000000
/1:    flags = 0

More, the process seems in a non-coherent system state: it is now nearly impossible to trace it using, truss, the proctools,...

# truss -alef -p 26632
truss: unanticipated system error: 26632
#
# pstack 26632
pstack: cannot examine 26632: unanticipated system error
#
# pfiles 26632
pfiles: unanticipated system error: 26632
#
# pldd 26632
pldd: cannot examine 26632: unanticipated system error

... or helped by DTace:

# dtrace -n 'profile-1000hz /pid ==26632/ { @[stack()] = count() }'
dtrace: description 'profile-1000hz ' matched 1 probe
^C
#
# dtrace -n 'profile-1000hz /pid ==26632/ { @[ustack()] = count() }'
dtrace: description 'profile-1000hz ' matched 1 probe
^C

Follow some information gathered through mdb in kernel debugging mode:

> ::status
debugging live kernel (64-bit) on socrate
operating system: 5.10 Generic_118833-36 (sun4u)
>
> ::showrev
Hostname: socrate
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_118833-36
Platform: SUNW,Sun-Fire-V490
>
> ::pgrep in.ftpd
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  26632      1    261    261  30501 0x5a024b00 00000600144b0c28 in.ftpd
>
> 00000600144b0c28::kill
mdb: command is not supported by current target
>
> 00000600144b0c28::thread
            ADDR    STATE    FLG PFLG SFLG   PRI  EPRI PIL             INTR
00000600144b0c28 inval/2000 157e 5778    0     0     0   6                2
>
> 00000600144b0c28::walk thread | ::findstack
stack pointer for thread 30002aba360: 2a1022e7ec1
[ 000002a1022e7ec1 cv_wait+0x38() ]
  000002a1022e7f71 page_lock_es+0x204()
  000002a1022e8021 pvn_vplist_dirty+0x2a4()
  000002a1022e8101 nfs_putpages+0x124()
  000002a1022e81c1 nfs3_putpage+0xcc()
  000002a1022e8271 fop_putpage+0x1c()
  000002a1022e8321 nfs_purge_caches+0xe4()
  000002a1022e83d1 nfs_attr_cache+0x20c()
  000002a1022e8481 nfs3_getattr_otw+0x1b8()
  000002a1022e85f1 nfs3_validate_caches+0x4c()
  000002a1022e8731 nfs3_getpage+0xa4()
  000002a1022e8861 fop_getpage+0x44()
  000002a1022e8931 segmap_getmapflt+0x588()
  000002a1022e8a41 snf_segmap+0x13c()
  000002a1022e8bc1 sosendfile64+0x298()
  000002a1022e8d21 sendvec64+0xf8()
  000002a1022e8f61 sendfilev+0x178()
  000002a1022e92e1 syscall_trap32+0xcc()

We can see the use of the sendfilev() syscall: it appears to be a known bug around this (you must have a registered Sun support customer account to be able to view the second document).

The good news is the root cause is already fix in the development branch of the operating system. The bad news is, after opening a call to the Sun support team, that no patch will be released in a near future to correct this: the incorporation of the fix is currently planned for the next major Solaris 10 Update (Update 5) which is scheduled for summer 2008. However, the fix is already available via the Solaris Express program, if that is applicable to your environment. As a last note, and if the non-killable process owns a resource necessary to another program, there seems no other option than to plan a reboot of the system.

Thursday 10 May 2007

Managing System and Process Core Dumps Generation

Core Dump Management on the Solaris OS and Using the Solaris coreadm utility to control core file generation are clearly two useful recent writings about knowing how to enable and configure process and system core dumps on a Solaris system.

You will learn what are SIGSEGV and SIGBUS signals, and the role they are playing in core generation. You will know how to easily alter the current process and system configuration files (respectively coreadm.conf, and dumpadm.conf) using appropriate system commands (respectively coreadm(1M), and dumpadm(1M)). In the same time, you will learn some basics about how to extract and interpret core files content. Then, you will find some tips on how a system dump can be voluntarily generated on both UltraSPARC and x86 platforms. Last, Matty will show us that you can even set process core dump configuration to log to the syslog facility. Very nice!

Tuesday 3 April 2007

Notes About Interesting ps(1b) Behaviour

I encounter a strange behavior using /usr/ucb/ps command recently. Normally, the COMMAND header of the output of the ps(1b) is truncated to the size of a default terminal. Adding one w flag can grow this output to 132 characters. Adding a second w flag doesn't truncate the output at all, reading the complete arguments list in the /proc/${PID}/as special file; in fact, a representation of the address space for the corresponding process. The restriction that apply here is the ability to read this file, generally the owner of the running process only.

Interestingly, at least on Solaris 8 and 9, you can print untruncated arguments list--even if you didn't have the the right to read the content of the address space special file--if the SUNWscpux package is installed. This package seems not delivered with Solaris 10.

As I remember, this package is a prerequisite for 64-bit architecture platform or you will see error messages such as Data Type Too Large if executed without it installed, but it seems to be able to behave in a very different manner as I expected.

Don't known what to think about this right now. Any though?