I recently faced a problem where an application stop working after transferring a system from an old Sun E450 running Solaris 8 to a more recent Sun Fire V490 running Solaris 10 10/09 with the Solaris 8 Container software stack. Although all went smooth during the P2V, and most of all applications runs pretty well in the non-global zone, we encounter a problem where the Courier SMTP product didn't answer anymore to remote connections, such as:
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.
The fact is, the couriertcpd
program is running, and listening to the
TCP port number 25.
# pfiles `pgrep couriertcpd | head -1`
5691: /usr/lib/courier/sbin/couriertcpd
Current rlimit: 1024 file descriptors
[...]
5: S_IFSOCK mode:0666 dev:366,0 ino:62871 uid:0 gid:0 size:0
O_RDWR|O_NONBLOCK FD_CLOEXEC
sockname: AF_INET 0.0.0.0 port: 25
[...]
Interestingly, here is what is logged by the Courier application:
# grep mail.info /path/to/Courier.log
XXX courieresmtpd: [ID 702911 mail.info] gdbm fatal: couldn't init cache
The problem seems located around the gdbm library which is a dependency
of Courier. So, what happened when we try to initiate a direct
connection using telnet
on port 25:
# truss -alef -rall -wall -p 5691
[...]
5691/1: fork() = 245
245/1: fork() (returning as child ...) = 5691
[...]
245/1: brk(0x004136D8) = 0
245/1: brk(0x004136D8) = 0
245/1: brk(0x004336D8) Err#12 ENOMEM
245/1: write(2, " g d b m f a t a l : ", 12) = 12
245/1: write(2, 0xFF2D4D28, 19) = 19
245/1: c o u l d n ' t i n i t c a c h e
245/1: write(2, "
", 1) = 1
245/1: llseek(0, 0, SEEK_CUR) = 0
245/1: _exit(-1)
5691/1: Received signal #18, SIGCLD, in poll() [caught]
[...]
# echo "ibase=16; `echo 004136D8 | tr [:lower:] [:upper:]`" | bc
4273880
# echo "ibase=16; `echo 004336D8 | tr [:lower:] [:upper:]`" | bc
4404952
Ok. The brk(2)
function is used to change dynamically the amount of
space allocated for the calling process's data segment. So, the process
which PID is 245, forked from couriertcpd
, can not allocate more than
4273880 bytes since a try to allocate 4404952 bytes return an error.
When the brk(2)
function fails, it is generally due to one of these
two major cases:
setrlimit(2)
would be
exceeded.At first, we decided to grow the size of the swap space in the global zone from 4GB to 8GB (since there is no resource control on memory for this non-global zone). As it is a full-ZFS Solaris 10 system, here are the steps to do so:
# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=8G rpool/swap
# /sbin/swapadd
# swap -l
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 256,2 16 16777200 16777200
But nothing change from the couriertcpd
point of view, as we may have
thought since we were able to deallocate the entire swap device from the
global zone. So, we shrink it back using the same steps as for growing
it. Then, is it possible we have reached an upper resource limitation?
The couriertcpd
is executed under the daemon
identity: what are the
maximum size of data segment (or heap) for the daemon
account:
# su - daemon -c "ulimit -Sd; ulimit -Hd"
unlimited
unlimited
This seems more than sufficient, not to say more. But what are the real current limitation for the running process?
# plimit -k 5691
5691: /usr/lib/courier/sbin/couriertcpd
resource current maximum
time(seconds) unlimited unlimited
file(kbytes) unlimited unlimited
data(kbytes) 4096 4096
stack(kbytes) 8192 unlimited
coredump(kbytes) unlimited unlimited
nofiles(descriptors) 1024 1024
vmemory(kbytes) unlimited unlimited
Ok. One can argue that the 4404952 bytes size is very close to the 4096KB (4194304 bytes) data size limitation... and yes, it is. The two points here are the fact that the limitation is close to memory allocation size for sure, and that there must be some settings somewhere which sets this resource limitation since we verified it is not a the account level (the current hard limit is set to 4096KB, thus not very unlimited). So, we tried to set the resource limit for the running process to much higher value, and voila:
# plimit -d 65536,65536 5691
# telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 S0003010 ESMTP
quit
221 Bye.
Connection closed by foreign host.
Well, all seems far better this time:
# truss -alef -rall -wall -p 5691
[...]
5691/1: fork() = 4861
4861/1: fork() (returning as child ...) = 5691
[...]
4861/1: brk(0x00CF36D8) = 0
4861/1: lseek(6, 262144, SEEK_SET) = 262144
4861/1: read(6, 0x00071730, 131072) = 131072
[...]
# echo "ibase=16; `echo 00CF36D8 | tr [:lower:] [:upper:]`" | bc
13579992
So, now the memory allocation succeed at changing the space allocated up to 13579992 bytes (13262KB). We decided to set the data segment size up to 16MB, which seems to be sufficient in our case.
We now need to find the configuration responsible for setting the size of the data segment size to 4096KB. Looking at the courier documentation, and and after a bit of digging on the system itself we found the responsible configuration file, and adapt it as follow:
# grep ULIMIT= /usr/lib/courier/etc/esmtpd
#ULIMIT=4096
ULIMIT=16384
As a last validation, we made a complete reboot of the non-global zone to verify the overall behavior, which is OK right now. As a last word, it was very interesting problem to fight with, but I didn't understand until now why 4MB was sufficient on an old Sun E450, and why it is not in a branded Solaris 8 non-global zone... comments welcome.