blog'o thnet

To content | To menu | To search

Tuesday 16 March 2010

Debugging After A Solaris System Was P2V'ed

I recently faced a problem where an application stop working after transferring a system from an old Sun E450 running Solaris 8 to a more recent Sun Fire V490 running Solaris 10 10/09 with the Solaris 8 Container software stack. Although all went smooth during the P2V, and most of all applications runs pretty well in the non-global zone, we encounter a problem where the Courier SMTP product didn't answer anymore to remote connections, such as:

# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

The fact is, the couriertcpd program is running, and listening to the TCP port number 25.

# pfiles `pgrep couriertcpd | head -1`
5691:   /usr/lib/courier/sbin/couriertcpd
  Current rlimit: 1024 file descriptors
[...]
   5: S_IFSOCK mode:0666 dev:366,0 ino:62871 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        sockname: AF_INET 0.0.0.0  port: 25
[...]

Interestingly, here is what is logged by the Courier application:

# grep mail.info /path/to/Courier.log
XXX courieresmtpd: [ID 702911 mail.info] gdbm fatal: couldn't init cache

The problem seems located around the gdbm library which is a dependency of Courier. So, what happened when we try to initiate a direct connection using telnet on port 25:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 245
245/1:          fork()          (returning as child ...)        = 5691
[...]
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004136D8)                                 = 0
245/1:          brk(0x004336D8)                                 Err#12 ENOMEM
245/1:          write(2, " g d b m   f a t a l :  ", 12)        = 12
245/1:          write(2, 0xFF2D4D28, 19)                        = 19
245/1:             c o u l d n ' t   i n i t   c a c h e
245/1:          write(2, "\n", 1)                               = 1
245/1:          llseek(0, 0, SEEK_CUR)                          = 0
245/1:          _exit(-1)
5691/1:             Received signal #18, SIGCLD, in poll() [caught]
[...]
# echo "ibase=16; `echo 004136D8 | tr [:lower:] [:upper:]`" | bc
4273880
# echo "ibase=16; `echo 004336D8 | tr [:lower:] [:upper:]`" | bc
4404952

Ok. The brk(2) function is used to change dynamically the amount of space allocated for the calling process's data segment. So, the process which PID is 245, forked from couriertcpd, can not allocate more than 4273880 bytes since a try to allocate 4404952 bytes return an error. When the brk(2) function fails, it is generally due to one of these two major cases:

  1. Insufficient space exists in the swap area to support the expansion.
  2. The data segment size limit as set by setrlimit(2) would be exceeded.

At first, we decided to grow the size of the swap space in the global zone from 4GB to 8GB (since there is no resource control on memory for this non-global zone). As it is a full-ZFS Solaris 10 system, here are the steps to do so:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=8G rpool/swap
# /sbin/swapadd
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 256,2      16 16777200 16777200

But nothing change from the couriertcpd point of view, as we may have thought since we were able to deallocate the entire swap device from the global zone. So, we shrink it back using the same steps as for growing it. Then, is it possible we have reached an upper resource limitation?

The couriertcpd is executed under the daemon identity: what are the maximum size of data segment (or heap) for the daemon account:

# su - daemon -c "ulimit -Sd; ulimit -Hd"
unlimited
unlimited

This seems more than sufficient, not to say more. But what are the real current limitation for the running process?

# plimit -k 5691
5691:   /usr/lib/courier/sbin/couriertcpd
   resource              current         maximum
  time(seconds)         unlimited       unlimited
  file(kbytes)          unlimited       unlimited
  data(kbytes)          4096            4096
  stack(kbytes)         8192            unlimited
  coredump(kbytes)      unlimited       unlimited
  nofiles(descriptors)  1024            1024
  vmemory(kbytes)       unlimited       unlimited

Ok. One can argue that the 4404952 bytes size is very close to the 4096KB (4194304 bytes) data size limitation... and yes, it is. The two points here are the fact that the limitation is close to memory allocation size for sure, and that there must be some settings somewhere which sets this resource limitation since we verified it is not a the account level (the current hard limit is set to 4096KB, thus not very unlimited). So, we tried to set the resource limit for the running process to much higher value, and voila:

# plimit -d 65536,65536 5691
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 S0003010 ESMTP
quit
221 Bye.
Connection closed by foreign host.

Well, all seems far better this time:

# truss -alef -rall -wall -p 5691
[...]
5691/1:         fork()                                          = 4861
4861/1:          fork()          (returning as child ...)        = 5691
[...]
4861/1:         brk(0x00CF36D8)                                 = 0
4861/1:         lseek(6, 262144, SEEK_SET)                      = 262144
4861/1:         read(6, 0x00071730, 131072)                     = 131072
[...]
# echo "ibase=16; `echo 00CF36D8 | tr [:lower:] [:upper:]`" | bc
13579992

So, now the memory allocation succeed at changing the space allocated up to 13579992 bytes (13262KB). We decided to set the data segment size up to 16MB, which seems to be sufficient in our case.

We now need to find the configuration responsible for setting the size of the data segment size to 4096KB. Looking at the courier documentation, and and after a bit of digging on the system itself we found the responsible configuration file, and adapt it as follow:

# grep ULIMIT= /usr/lib/courier/etc/esmtpd
#ULIMIT=4096
ULIMIT=16384

As a last validation, we made a complete reboot of the non-global zone to verify the overall behavior, which is OK right now. As a last word, it was very interesting problem to fight with, but I didn't understand until now why 4MB was sufficient on an old Sun E450, and why it is not in a branded Solaris 8 non-global zone... comments welcome.

Sunday 28 February 2010

Oracle On The Future Of OpenSolaris, Finally

After the official and long-awaited public information from Oracle on the merge with Sun Microsystems, some great news came on both hardware and software portfolios, in particular the x86 ans SPARC ecosystems, and around the Solaris operating system.

The main unknown was about the OpenSolaris community, and distribution model. And until very recently, some important voices around the community stayed without answer, in particular from Ben Rockwood, or Peter Tribble.

Well, until recently. In fact, the OpenSolaris Annual Meeting (held on IRC through the #opensolaris-meeting canal last 26 February) brought some answers very shortly, which currently begin to spread through the community. I hope this will quiet some recent misunderstanding on the support model of the OpenSolaris distribution.

Tuesday 1 December 2009

Oracle Commitment To Sun Technologies

Here are more information about Oracle commitment to Sun business, and the Solaris operating system in particular:

Lastly, Oracle and Sun Overview and FAQ for customers and partners is a must read for all persons interested in the future and Oracle's investment in all the current technologies from Sun Microsystems.

Thursday 26 November 2009

Oracle Database 11g Release 2 For Solaris

As a follow-up to the preceding entry on the upcoming availability on the Oracle Database 11g Release 2 on both the Solaris SPARC and x86 releases, we can see that this is a reality as of today. Both architectures are readily available for download.

As a system administrator I think this very interesting and encouraging, not only because of the availability of one of the more robust RDBMS system on Solaris platforms, but because this is some actions taken after words from Oracle which seems to fit together. And so, the interest in Solaris as an OS of choice is more reinforced now.

Wednesday 7 October 2009

Upcoming Oracle RDBMS And Solaris News

Following the recent news about the future of Sun from Larry Ellison itself, we now can hope about more things to come on Solaris, both on SPARC and x86 platforms. In particular, this quote is particularly encouraging:

We're a big supporter of Linux, but the fact is that Solaris just a much more mature OS, its just a fact. We became a big supporter of Linux years ago because it ran on smaller and cheaper X86 processors and Solaris did not, we had no choice. [...] So we are a supporter of Linux, but Solaris is a more mature operating system designed for bigger systems. We support both.

In the very same vein, I just heard from two different sources these upcoming changes from Oracle:

  1. The just release Oracle Database 11g Release 2, currently available only for Linux (and released on 1 September 2009), will be available soon--one to two months--for both Solaris SPARC and x86, at the same time.
  2. Secondly, Solaris x86 will be raised to Tier 2 platform from Tier 3 currently.

Well, pretty good news in fact! Seems that Solaris will be a serious and growing competitor in the (near) future!

Sunday 22 March 2009

Finding The Process Responsible For Crashing A System

Recently, we encounter a wave of suicide on most of the nodes which formed some Oracle RAC cluster on lots of Sun M5000 domains platforms. Although the logs found on Oracle RAC were interesting, they didn't help us to determine precisely the origin of the crashes. Since the domains panic'ed, we were able to briefly analyze the cores generated at crash time to get the process which initiated the panics. Here is how to do so.

First, be sure to have proper and usable core on persistent storage:

# cd /var/crash/nodename
# file *.0
unix.0:         ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, statically linked, not stripped, no debugging information available
vmcore.0:       SunOS 5.10 Generic_127111-11 64-bit SPARC crash dump from 'nodename'

Then, extract useful information using MDB dcmds such as ::status, ::showrev and ::panicinfo which give us the exact panic message and provide us the message and thread responsible for the system crash:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd mpt px ssd fcp fctl md ip qlc hook neti sctp arp usba nca zfs random logindmux ptm cpc sppp crypto wrsmd fcip nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from nodename
operating system: 5.10 Generic_127111-11 (sun4u)
panic message: forced crash dump initiated at user request
dump content: kernel pages only
> ::showrev
Hostname: nodename
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_127111-11
Platform: SUNW,SPARC-Enterprise
> ::panicinfo
             cpu                0
          thread      300171c7300
         message forced crash dump initiated at user request
          tstate       4400001606
              g1                b
              g2                0
              g3          11c13e0
              g4              6e0
              g5         88000000
              g6                0
              g7      300171c7300
              o0          1208020
              o1      2a10176b9e8
              o2                1
              o3                0
              o4 fffffffffffffff5
              o5             1000
              o6      2a10176b0b1
              o7          10626a4
              pc          1044d8c
             npc          1044d90
               y                0

Well. Now what we have the exact thread number (thread ID), we can find the corresponding UNIX process helped by the following script:

# cat /var/tmp/findstack.vmcore.sh
#!/usr/bin/env sh

echo "::ps" | mdb -k unix.0 vmcore.0 | \
 nawk '$8 !~ /ADDR/ {print $8" "$NF}' > /tmp/.core.$$

cat /dev/null > /tmp/core.$$

while read ps; do
  echo "process name: `echo ${ps} | nawk '{print $2}'`" >> /tmp/core.$$
  echo ${ps} | nawk '{print $1"::walk thread | ::findstack"}' | \
   mdb unix.0 vmcore.0 >> /tmp/core.$$
  echo >> /tmp/core.$$
done < /tmp/.core.$$

\rm /tmp/.core.$$

exit 0

Now, just find the lines for the guilty process in the output file. In our case, it is the oprocd.bin process:

# vi /tmp/core.*
[...]
process name: oprocd.bin
stack pointer for thread 300171c7300: 2a10176b0b1
  000002a10176b161 kadmin+0x4a4()
  000002a10176b221 uadmin+0x11c()
  000002a10176b2e1 syscall_trap+0xac()
[...]

This process is locked in memory to monitor the cluster and provide I/O fencing. oprocd.bin performs its check, stops running, and if the wake up is beyond the expected time, then it resets the processor and reboots the node. An oprocd.bin failure results in Oracle Clusterware restarting the node. Please read the Oracle Clusterware and Oracle Real Application Clusters documentation for more information.

Although the incident is always under investigation, it seems the nodes were impacted by the additional second that was added at the end of 2008...

Wednesday 25 February 2009

Problem Querying ru. Name Servers

In a previous position, I encounter a rather strange problem I would like to share here. It has to do with DNS resolution. From the internal network of a company it was not possible to get the IP address of the www.banks2ifrs.ru. host name.

We determined that the name servers which manage the banks2ifrs.ru. domain are ns3.nic.ru. and ns4.nic.ru.. In the same time, we saw that for the fbk.ru. domain name, the managing name servers are ns3.nic.ru. and gw.fbk.ru.. Interestingly, it was possible to resolve www.fbk.ru. using gw.fbk.ru., but not from ns3.nic.ru.. More, we noted that the reverse resolution for the name servers ns3.nic.ru. and ns4.nic.ru. are not correct, translated to ns3.ripn.net. and ns4.ripn.net., respectively. So, it is worth to mention that nothing was wrong when querying the name server gw.fbk.ru..

# host -t a ns3.nic.ru.
ns3.nic.ru has address 194.85.61.20
# host -t a ns4.nic.ru.
ns4.nic.ru has address 194.226.96.8
# host -t ptr 194.85.61.20
20.61.85.194.in-addr.arpa domain name pointer ns3.ripn.net.
# host -t ptr 194.226.96.8
8.96.226.194.in-addr.arpa domain name pointer ns4.ripn.net.

The error from the DNS query seems to be related to an incomplete answer from the server (the truncated flag was set to 1 in the network trace) when the query is made over UDP. In this case, an automatic fallback over TCP must be used, certainly prohibited from the company's network security policy. This may say that the answer is larger than 512 bytes long, too. So, we tried to advertise different sizes of the UDP message buffer, but without being confident that this message went through network devices properly. Nonetheless it would seem curious to get an answer larger that 120 bytes long.

Last, we can note that the complexity of the network layout (DMZ, firewalls, NAT, etc.) may badly interact and hamper DNS queries, at least in certain circumstances.

After more investigation from the network team, they decided to permit TCP DNS queries. And it worked. It worked letting the internal DNS servers doing their job themselves...

# dig +trace -t a www.banks2ifrs.ru.
; <<>> DiG 9.3.4-P1 <<>> +trace www.banks2ifrs.ru.
;; global options:  printcmd
.                       449798  IN   NS   L.ROOT-SERVERS.NET.
.                       449798  IN   NS   M.ROOT-SERVERS.NET.
.                       449798  IN   NS   A.ROOT-SERVERS.NET.
.                       449798  IN   NS   B.ROOT-SERVERS.NET.
.                       449798  IN   NS   C.ROOT-SERVERS.NET.
.                       449798  IN   NS   D.ROOT-SERVERS.NET.
.                       449798  IN   NS   E.ROOT-SERVERS.NET.
.                       449798  IN   NS   F.ROOT-SERVERS.NET.
.                       449798  IN   NS   G.ROOT-SERVERS.NET.
.                       449798  IN   NS   H.ROOT-SERVERS.NET.
.                       449798  IN   NS   I.ROOT-SERVERS.NET.
.                       449798  IN   NS   J.ROOT-SERVERS.NET.
.                       449798  IN   NS   K.ROOT-SERVERS.NET.
;; Received 512 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms

ru.                   172800  IN   NS   ns.ripn.net.
ru.                   172800  IN   NS   ns2.nic.fr.
ru.                   172800  IN   NS   ns2.ripn.net.
ru.                   172800  IN   NS   ns5.msk-ix.net.
ru.                   172800  IN   NS   ns9.ripn.net.
ru.                   172800  IN   NS   sunic.sunet.se.
;; Received 297 bytes from 199.7.83.42#53(L.ROOT-SERVERS.NET) in 125 ms

banks2ifrs.ru.   345600  IN   NS   ns4.nic.ru.
banks2ifrs.ru.   345600  IN   NS   ns3.nic.ru.
;; Received 107 bytes from 194.85.105.17#53(ns.ripn.net) in 66 ms

www.banks2ifrs.ru.   86400   IN   A     83.222.6.194
banks2ifrs.ru.            86400   IN   NS   ns4.nic.ru.
banks2ifrs.ru.            86400   IN   NS   ns3.nic.ru.
;; Received 91 bytes from 194.226.96.8#53(ns4.nic.ru) in 65 ms

... and it worked when querying directly the name servers responsible for the wanted domain:

# dig @ns4.nic.ru. -t a www.banks2ifrs.ru.
; <<>> DiG 9.3.4-P1 <<>> @ns4.nic.ru. -t a www.banks2ifrs.ru.
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1530
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;www.banks2ifrs.ru.             IN   A

;; ANSWER SECTION:
www.banks2ifrs.ru.      86400   IN   A   83.222.6.194

;; AUTHORITY SECTION:
banks2ifrs.ru.          86400   IN   NS   ns4.nic.ru.
banks2ifrs.ru.          86400   IN   NS   ns3.nic.ru.

;; Query time: 65 msec
;; SERVER: 194.226.96.8#53(194.226.96.8)
;; WHEN: Tue Apr 15 21:05:12 2008
;; MSG SIZE  rcvd: 91

Note: The size of the answer is 91 bytes long, so nothing wrong from this side.

I think we will never know what was going wrong here, even if the heart of the problem seems related specifically only to the two same name servers.

Monday 5 January 2009

Quick Reference Guides For Network Technologies

I am always looking for some good IT cheat sheets, and for a lots of thing. Here is one of my favorite web site around network technologies. Interestingly, this site propose some regularly updated quick reference guides for the network, and are classified under the following categories (at the time of this writing):

  • Protocols
  • Applications
  • Reference
  • Syntax
  • Technologies
  • Miscellaneous

You can find them all at the Cheat Sheets Library section.

As the author wrote on his blog, if you notice an error or would like to see a new cheat sheet on a specific topic, don't hesitate and drop him a line.

Thursday 25 December 2008

More News To Come About Shrinking A zpool

As a little update to an older post on this subject, and although this post from Matthew Ahrens is about the new scrub code recently introduced in OpenSolaris build 94--and was in fact a priority before the launch of the Sun Storage 7000 Unified Storage Systems (a.k.a. Amber Road)--it is interesting to note that some of the new code will be usable to remove a disk from a ZFS pool.

As Matthew wrote:

This work lays a bunch of infrastructure that will be used by the upcoming device removal feature.

Wednesday 3 December 2008

GRUB Boot Archive With SVM, A Better Approach

In a previous discussion about the GRUB boot archive and how it can be regenerated in Failsafe mode, I mentioned that it will not be as easy as it can be when the root file system use the md driver. I previously show a method to do this which necessitate to unmirror one or more file systems when the root file system is build upon a SVM mirror. This was not very optimal since a lot of of manipulations are involved, which may lead to human error(s), and may seems to be a little complicated.

This method was build on Performing System Recovery from the Solaris Volume Manager official documentation, which show up last month on the Sun-Managers mailing list.

Note: Although this test case was done using Solaris 10 10/08 under a virtual machine build upon VirtualBox on latest OpenSolaris release, the instructions must be valid for Solaris 10 1/06 and later.

Initial setup

As we saw before, the system use only a root file system, and a swap device. Both are encapsulated with SVM:

# df -k -F ufs
Filesystem     kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0 6147798 3455578 2630743      57%  /
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d1      85,1       8 4194288 4194288
# metastat -c d0 d1
d0               m  6.0GB d10 d20
    d10          s  6.0GB c0d0s0
    d20          s  6.0GB c1d1s0
d1               m  2.0GB d11 d21
    d11          s  2.0GB c0d0s1
    d21          s  2.0GB c1d1s1

Regenerate the GRUB boot archive

The idea is to boot on the GRUB Failsafe mode, get the md configuration from local root file system, and load manually the md module, hence properly configured. The main advantage is to be fully self hosted from the Failsafe mode, and not have to manipulate SVM more than necessary, especially when breaking the mirror, loosing redundancy for a time.

[...]
Booting to milestone "milestone/single-user:default".
Configuring devices.
Searching for installed OS instances...
/dev/dsk/c0d0s0 is under md control, skipping.
/dev/dsk/c1d1s0 is under md control, skipping.
No installed OS instance found.

Starting shell.
# mount -F ufs -o ro /dev/dsk/c0d0s0 /a
# cp -p /a/kernel/drv/md.conf /kernel/drv
# umount /a
# update_drv -f md
devfsadm: mkdir failed for /dev 0x1ed: Read-only file system
# metainit -r
# metasync d0
# fsck /dev/md/rdsk/d0
# mount -F ufs /dev/md/dsk/d0 /a
# bootadm update-archive -R /a
# umount /a
# reboot

Really interesting!

- page 1 of 16