blog'o thnet

To content | To menu | To search

Tag - debug

Entries feed - Comments feed

Tuesday 6 April 2010

Performance Problem Using The HP DP Agent

I recently faced an interesting problem when the backup agent for HP DataProtector regularly encounter network performance problem. The problem arise on a global zone which hosts many non-global zones, each running its own instance of the HP DP agent. The network configuration is rather unusual: there are two network ports, one which connects to an administration VLAN, the other which carries the data and external customer network streams (using VLAN-tagging). Since the global zone is only use as an hypervisor of system's resources for the non-global zones, it just plumbs the data network port, but has no IP address assigned to it.

The fact is, the configuration sets the nodename as the hostname. But the usable interface of the global zone which is in the administration network use another name than the hostname: beastie-adm in the following case, assuming the hostname was set to beastie.

So, now that I briefly describe the configuration of the server, lets see what happen at the network level from the global zone point of view when firing a remote telnet on the DataProtector agent TCP port (5555):

# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Syn Seq=1117789660 Len=0 Win=65535 Options=
  0.00005 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Syn Ack=1117789661 Seq=808681264 Len=0 Win=49480 Options=
  0.00500 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681265 Seq=1117789661 Len=0 Win=65535
 39.36365 beastie-adm.domain.tld ->172.1.2.3 TCP D=37228 S=5555 Push Ack=1117789661 Seq=808681265 Len=87 Win=49480
  0.00014 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Fin Ack=1117789661 Seq=808681352 Len=0 Win=49480
  0.00587 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00092 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Fin Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00002 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Ack=1117789662 Seq=808681353 Len=0 Win=49480

Ouch. Something is clearly wrong here: as can be seen in this trace, 40 seconds have been lost before establishing the connection. Snooping more largely on the interface show these requests:

[...]
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
[...]

Ok. Lets see what is the process corresponding to the HP DP agent doing during the connection tentative:

[...]
9431/1:         sysinfo(SI_HOSTNAME, "beastie", 64)         = 12
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getpid()                                        = 9431 [473]
9431/1:         lstat64("/var/opt/omni/log/debug.log", 0x080476F8) = 0
9431/1:         open64("/var/opt/omni/log/debug.log", O_RDWR|O_APPEND|O_CREAT, 0666) = 4
9431/1:         time()                                          = 1269079436
9431/1:         fstat64(4, 0x08046F30)                          = 0
9431/1:         fstat64(4, 0x08046E70)                          = 0
9431/1:         ioctl(4, TCGETA, 0x08046F04)                    Err#25 ENOTTY
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         write(4, 0x08130DC4, 162)                       = 162
9431/1:           \n 0 3 / 2 0 / 1 0   1 1 : 0 3 : 5 6     I N E T . 9 4 3 1 . 0
9431/1:            [ " l i b / c m n / c o m m o n . c   / m a i n / h s l _ d p 6
9431/1:            1 / h s l _ h p i t 2 _ 2 / 9 " : 1 2 7 4 ]   A . 0 6 . 1 1   b
9431/1:            2 4 3\n g e t h o s t b y n a m e ( )   f a i l e d ,   h _ e r
9431/1:            r n o = 2   [ h o s t = h o s t n a m e ,   r e t r y = 5
9431/1:            ]\n
9431/1:         llseek(4, 0, SEEK_CUR)                          = 939002
9431/1:         close(4)
[...]

By now it is clear that most of time spent by the agent was used trying to resolve the hostname to an IP address (as logged to the /var/opt/omni/log/debug.log debug log file). Since there is no IP address declared for beastie as there is no network directly attached to the global zone, there is no mapping defined for the hostname. Bingo.

So, in this case we decided to set the hostname locally (in the /etc/hosts file) as an alias for the loopback entry. This way, the information is resolved, and the agent will not timed-out performing the gethostbyname(3NSL) syscall anyomore:

# getent hosts `beastie`
127.0.0.1       localhost loghost beastie
# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Syn Seq=978964828 Len=0 Win=65535 Options=
  0.00004 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Syn Ack=978964829 Seq=1184861168 Len=0 Win=49480 Options=
  0.00128 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861169 Seq=978964829 Len=0 Win=65535
 10.00279 beastie-adm.occ.lan ->172.1.2.3 TCP D=37237 S=5555 Push Ack=978964829 Seq=1184861169 Len=87 Win=49480
  0.00012 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Fin Ack=978964829 Seq=1184861256 Len=0 Win=49480
  0.00137 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00220 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Fin Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00002 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Ack=978964830 Seq=1184861257 Len=0 Win=49480

So, the main question which remain is "Why need the HP DP agent to resolve specifically on the hostname, even if this name is not use apart the backup transaction?" I did not have any answer for this at this time...

Tuesday 16 March 2010

Debugging After A Solaris System Was P2V'ed

I recently faced a problem where an application stop working after transferring a system from an old Sun E450 running Solaris 8 to a more recent Sun Fire V490 running Solaris 10 10/09 with the Solaris 8 Container software stack. Although all went smooth during the P2V, and most of all applications runs pretty well in the non-global zone, we encounter a problem where the Courier SMTP product didn't answer anymore to remote connections, such as:

# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

The fact is, the couriertcpd program is running, and listening to the TCP port number 25.

# pfiles `pgrep couriertcpd | head -1`
5691:   /usr/lib/courier/sbin/couriertcpd
  Current rlimit: 1024 file descriptors
[...]
   5: S_IFSOCK mode:0666 dev:366,0 ino:62871 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        sockname: AF_INET 0.0.0.0  port: 25
[...]

Interestingly, here is what is logged by the Courier application:

# grep mail.info /path/to/Courier.log
XXX courieresmtpd: [ID 702911 mail.info] gdbm fatal: couldn't init cache

The problem seems located around the gdbm library which is a dependency of Courier. So, what happened when we try to initiate a direct connection using telnet on port 25:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 245
245/1:          fork()          (returning as child ...)        = 5691
[...]
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004336D8)                                 Err#12 ENOMEM
245/1:          write(2, " g d b m   f a t a l :  ", 12)        = 12
245/1:          write(2, 0xFF2D4D28, 19)                        = 19
245/1:             c o u l d n ' t   i n i t   c a c h e
245/1:          write(2, "\n", 1)                               = 1
245/1:          llseek(0, 0, SEEK_CUR)                          = 0
245/1:          _exit(-1)
5691/1:             Received signal #18, SIGCLD, in poll() [caught]
[...]
# echo "ibase=16; `echo 004136D8 | tr [:lower:] [:upper:]`" | bc
4273880
# echo "ibase=16; `echo 004336D8 | tr [:lower:] [:upper:]`" | bc
4404952

Ok. The brk(2) function is used to change dynamically the amount of space allocated for the calling process's data segment. So, the process which PID is 245, forked from couriertcpd, can not allocate more than 4273880 bytes since a try to allocate 4404952 bytes return an error. When the brk(2) function fails, it is generally due to one of these two major cases:

  1. Insufficient space exists in the swap area to support the expansion.
  2. The data segment size limit as set by setrlimit(2) would be exceeded.

At first, we decided to grow the size of the swap space in the global zone from 4GB to 8GB (since there is no resource control on memory for this non-global zone). As it is a full-ZFS Solaris 10 system, here are the steps to do so:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=8G rpool/swap
# /sbin/swapadd
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 256,2      16 16777200 16777200

But nothing change from the couriertcpd point of view, as we may have thought since we were able to deallocate the entire swap device from the global zone. So, we shrink it back using the same steps as for growing it. Then, is it possible we have reached an upper resource limitation?

The couriertcpd is executed under the daemon identity: what are the maximum size of data segment (or heap) for the daemon account:

# su - daemon -c "ulimit -Sd; ulimit -Hd"
unlimited
unlimited

This seems more than sufficient, not to say more. But what are the real current limitation for the running process?

# plimit -k 5691
5691:   /usr/lib/courier/sbin/couriertcpd
   resource              current         maximum
  time(seconds)         unlimited       unlimited
  file(kbytes)          unlimited       unlimited
  data(kbytes)          4096            4096
  stack(kbytes)         8192            unlimited
  coredump(kbytes)      unlimited       unlimited
  nofiles(descriptors)  1024            1024
  vmemory(kbytes)       unlimited       unlimited

Ok. One can argue that the 4404952 bytes size is very close to the 4096KB (4194304 bytes) data size limitation... and yes, it is. The two points here are the fact that the limitation is close to memory allocation size for sure, and that there must be some settings somewhere which sets this resource limitation since we verified it is not a the account level (the current hard limit is set to 4096KB, thus not very unlimited). So, we tried to set the resource limit for the running process to much higher value, and voila:

# plimit -d 65536,65536 5691
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 S0003010 ESMTP
quit
221 Bye.
Connection closed by foreign host.

Well, all seems far better this time:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 4861
4861/1:          fork()          (returning as child ...)        = 5691
[...]
4861/1:         brk(0x00CF36D8)                                 = 0
4861/1:         lseek(6, 262144, SEEK_SET)                      = 262144
4861/1:         read(6, 0x00071730, 131072)                     = 131072
[...]
# echo "ibase=16; `echo 00CF36D8 | tr [:lower:] [:upper:]`" | bc
13579992

So, now the memory allocation succeed at changing the space allocated up to 13579992 bytes (13262KB). We decided to set the data segment size up to 16MB, which seems to be sufficient in our case.

We now need to find the configuration responsible for setting the size of the data segment size to 4096KB. Looking at the courier documentation, and and after a bit of digging on the system itself we found the responsible configuration file, and adapt it as follow:

# grep ULIMIT= /usr/lib/courier/etc/esmtpd
#ULIMIT=4096
ULIMIT=16384

As a last validation, we made a complete reboot of the non-global zone to verify the overall behavior, which is OK right now. As a last word, it was very interesting problem to fight with, but I didn't understand until now why 4MB was sufficient on an old Sun E450, and why it is not in a branded Solaris 8 non-global zone... comments welcome.

Sunday 22 March 2009

Finding The Process Responsible For Crashing A System

Recently, we encounter a wave of suicide on most of the nodes which formed some Oracle RAC cluster on lots of Sun M5000 domains platforms. Although the logs found on Oracle RAC were interesting, they didn't help us to determine precisely the origin of the crashes. Since the domains panic'ed, we were able to briefly analyze the cores generated at crash time to get the process which initiated the panics. Here is how to do so.

First, be sure to have proper and usable core on persistent storage:

# cd /var/crash/nodename
# file *.0
unix.0:         ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, statically linked, not stripped, no debugging information available
vmcore.0:       SunOS 5.10 Generic_127111-11 64-bit SPARC crash dump from 'nodename'

Then, extract useful information using MDB dcmds such as ::status, ::showrev and ::panicinfo which give us the exact panic message and provide us the message and thread responsible for the system crash:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd mpt px ssd fcp fctl md ip qlc hook neti sctp arp usba nca zfs random logindmux ptm cpc sppp crypto wrsmd fcip nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from nodename
operating system: 5.10 Generic_127111-11 (sun4u)
panic message: forced crash dump initiated at user request
dump content: kernel pages only
> ::showrev
Hostname: nodename
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_127111-11
Platform: SUNW,SPARC-Enterprise
> ::panicinfo
             cpu                0
          thread      300171c7300
         message forced crash dump initiated at user request
          tstate       4400001606
              g1                b
              g2                0
              g3          11c13e0
              g4              6e0
              g5         88000000
              g6                0
              g7      300171c7300
              o0          1208020
              o1      2a10176b9e8
              o2                1
              o3                0
              o4 fffffffffffffff5
              o5             1000
              o6      2a10176b0b1
              o7          10626a4
              pc          1044d8c
             npc          1044d90
               y                0

Well. Now what we have the exact thread number (thread ID), we can find the corresponding UNIX process helped by the following script:

# cat /var/tmp/findstack.vmcore.sh
#!/usr/bin/env sh

echo "::ps" | mdb -k unix.0 vmcore.0 | \
 nawk '$8 !~ /ADDR/ {print $8" "$NF}' > /tmp/.core.$$

cat /dev/null > /tmp/core.$$

while read ps; do
  echo "process name: `echo ${ps} | nawk '{print $2}'`" >> /tmp/core.$$
  echo ${ps} | nawk '{print $1"::walk thread | ::findstack"}' | \
   mdb unix.0 vmcore.0 >> /tmp/core.$$
  echo >> /tmp/core.$$
done < /tmp/.core.$$

\rm /tmp/.core.$$

exit 0

Now, just find the lines for the guilty process in the output file. In our case, it is the oprocd.bin process:

# vi /tmp/core.*
[...]
process name: oprocd.bin
stack pointer for thread 300171c7300: 2a10176b0b1
  000002a10176b161 kadmin+0x4a4()
  000002a10176b221 uadmin+0x11c()
  000002a10176b2e1 syscall_trap+0xac()
[...]

This process is locked in memory to monitor the cluster and provide I/O fencing. oprocd.bin performs its check, stops running, and if the wake up is beyond the expected time, then it resets the processor and reboots the node. An oprocd.bin failure results in Oracle Clusterware restarting the node. Please read the Oracle Clusterware and Oracle Real Application Clusters documentation for more information.

Although the incident is always under investigation, it seems the nodes were impacted by the additional second that was added at the end of 2008...

Update #1 (2010-10-09): Here is a very interesting french blog entry about the same problem.

Sunday 28 January 2007

About Kernel Debugging on FreeBSD

After posting about some very interesting entries on how to debug kernel problems in Solaris 10, here are some keywords which may be needed while looking for help on FreeBSD mailing lists. This is just a quick memo, so please read the excellent Kernel Debugging chapter of the FreeBSD Developers's Handbook.

At the debugger prompt, using DDB at the system console:

db> where
db> show pcpu
db> show alllocks
db> ps
db> show lockedvnods
db> show lockedbufs

Good luck!

Saturday 6 January 2007

Kernel Debugger in Solaris 10

Here is a little note to point out those two blog entries on how to debug kernel problem at boot time, from Dan Mick and Eric Lowe, respectively.

Too bad i didn't have a serial console on my Sun Ultra 20 workstation. I will try these steps when U20 will be near my hands, though.