blog'o thnet

To content | To menu | To search

Friday 2 May 2008

PHP APC Extension Bug With Optimized Open Source Software Stack

To easily manage LDAP accounts (and general LDAP entries in fact), we have created a Solaris Zone and installed the excellent Cool Stack bundle to host the LAM (LDAP Account Manager) management web tool. But after upgrading the Cool Stack to version 1.2 we encountered a very annoying problem mostly with freezing web pages, and generally ending up in restarting the Apache web server provided by the Cool Stack. After some troubleshooting, we discover that this behavior was introduced by a bug in the APC-3.0.14 module bundled with the updated php-5.2.4 scripting software in this version of the Cool Stack.

Luckily, the bug was already fixed and a new version of the APC extension of PHP is available for download (in fact, just replace to original apc.so module by the new one). All the Cool Stack related problems, associated fixes and instructions are listed on the Cool Stack 1.2 Patches page: be sure to keep in sync' if you are a Cool Stack consumer.

Monday 17 December 2007

A Process Can Not Be Killed Upon Hanging In sendfilev()

Here is a little but annoying bug I faced recently on a production system running Solaris 10 11/06: a process can not be killed.

The KILL signal can't be ignored by a process. The associated handler must be the default--taking the signal into account, and honor it--and it is the case (the incriminated process id is 26632):

# psig 26632 | grep KILL
KILL    default

Nevertheless, the process seems not to stop as expected:

# echo $$
2913
#
# kill -s KILL 26632
#
# dtrace -n 'fbt::sigtoproc:entry /pid == 2913/ {trace(arg2); trace(execname);}'
dtrace: description 'fbt::sigtoproc:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 18  16387                  sigtoproc:entry                 9  tcsh
#
# ps -ef | awk '$2 ~ /26632/ {print $0}'
nonpriv 26632     1   0   Nov 13 ?         0:01 /usr/sbin/in.ftpd -a

This not the correct behavior since a pending signal is not honored:

# pflags 26632
26632:  /usr/sbin/in.ftpd -a
       data model = _ILP32  flags = ORPHAN|MSACCT|MSFORK
       sigpend = 0x00006100,0x00000000
/1:    flags = 0

More, the process seems in a non-coherent system state: it is now nearly impossible to trace it using, truss, the proctools,...

# truss -alef -p 26632
truss: unanticipated system error: 26632
#
# pstack 26632
pstack: cannot examine 26632: unanticipated system error
#
# pfiles 26632
pfiles: unanticipated system error: 26632
#
# pldd 26632
pldd: cannot examine 26632: unanticipated system error

... or helped by DTace:

# dtrace -n 'profile-1000hz /pid ==26632/ { @[stack()] = count() }'
dtrace: description 'profile-1000hz ' matched 1 probe
^C
#
# dtrace -n 'profile-1000hz /pid ==26632/ { @[ustack()] = count() }'
dtrace: description 'profile-1000hz ' matched 1 probe
^C

Follow some information gathered through mdb in kernel debugging mode:

> ::status
debugging live kernel (64-bit) on socrate
operating system: 5.10 Generic_118833-36 (sun4u)
>
> ::showrev
Hostname: socrate
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparcv9
Kernel version: SunOS 5.10 sun4u Generic_118833-36
Platform: SUNW,Sun-Fire-V490
>
> ::pgrep in.ftpd
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  26632      1    261    261  30501 0x5a024b00 00000600144b0c28 in.ftpd
>
> 00000600144b0c28::kill
mdb: command is not supported by current target
>
> 00000600144b0c28::thread
            ADDR    STATE    FLG PFLG SFLG   PRI  EPRI PIL             INTR
00000600144b0c28 inval/2000 157e 5778    0     0     0   6                2
>
> 00000600144b0c28::walk thread | ::findstack
stack pointer for thread 30002aba360: 2a1022e7ec1
[ 000002a1022e7ec1 cv_wait+0x38() ]
  000002a1022e7f71 page_lock_es+0x204()
  000002a1022e8021 pvn_vplist_dirty+0x2a4()
  000002a1022e8101 nfs_putpages+0x124()
  000002a1022e81c1 nfs3_putpage+0xcc()
  000002a1022e8271 fop_putpage+0x1c()
  000002a1022e8321 nfs_purge_caches+0xe4()
  000002a1022e83d1 nfs_attr_cache+0x20c()
  000002a1022e8481 nfs3_getattr_otw+0x1b8()
  000002a1022e85f1 nfs3_validate_caches+0x4c()
  000002a1022e8731 nfs3_getpage+0xa4()
  000002a1022e8861 fop_getpage+0x44()
  000002a1022e8931 segmap_getmapflt+0x588()
  000002a1022e8a41 snf_segmap+0x13c()
  000002a1022e8bc1 sosendfile64+0x298()
  000002a1022e8d21 sendvec64+0xf8()
  000002a1022e8f61 sendfilev+0x178()
  000002a1022e92e1 syscall_trap32+0xcc()

We can see the use of the sendfilev() syscall: it appears to be a known bug around this (you must have a registered Sun support customer account to be able to view the second document).

The good news is the root cause is already fix in the development branch of the operating system. The bad news is, after opening a call to the Sun support team, that no patch will be released in a near future to correct this: the incorporation of the fix is currently planned for the next major Solaris 10 Update (Update 5) which is scheduled for summer 2008. However, the fix is already available via the Solaris Express program, if that is applicable to your environment. As a last note, and if the non-killable process owns a resource necessary to another program, there seems no other option than to plan a reboot of the system.

Thursday 27 September 2007

European Type6 USB Keyboard Circumflex Key Issue... Solved

I recently worked with Dermot Malone, the Responsible Engineer for this bug report.

The fact is the described problem persist as explained in the Description field of the bug report, running snv_70. But I must say I found something recently: as of now, I used the default C locale. But after reading this documentation (go to Unicode Locale: en_US.UTF-8 Support), I decided to try the en_US.UTF-8 locale, directly chosen from the dtlogin screen... and found that I can now have a similar English environment as before (using C locale), while supporting accent and circumflex characters from applications which support this locale (for example the GNOME Terminal, or Mozilla Firefox). Hum, I just find a little big curious that most of accent characters (say 'é') works properly using the C locale, but that I need to set an other locale to be able to use the circumflex characters (say 'ô').

As a matter of interest, Dermot ask me why I didn't use the fr_FR.UTF-8 locale. Here is my answer:

Well, I am a French guy, but systematically install English operating systems, and use the default C locale. In IT, English is _the_ standard, and all things (manual pages, messages, format strings, etc.) are all homogeneous this way. As a side note, I am sure not to encounter the problems found on RedHat Linux systems when the default locale is not properly supported by the OS sub-systems themselves (the RC scripts generally sets the LANG=C (or something like that) for 'grep', 'sed' and 'awk' for example), or some third party products (such as the IBM TSM Backup Archive client).

Now, using the en_US.UTF-8 locale on Solaris and Solaris Express, I can have best of both worlds: a fully functional (and supported) English environment, and be able to use extended characters specific to my language.

Sunday 18 February 2007

DHCP Bug Reintroduced in SXCE Build 57

After live upgrade'ing to snv_57, i was unable to get network informations, normally obtained using internal DHCP server. In the system log file, i can see something like:

Sat Feb 17 11:06:18 unic /sbin/dhcpagent[1494]: [ID 91213 daemon.warning]
 configure_v4_lease: incorrect broadcast address 255.255.255.255 specified
 for nge0; ignoring
Sat Feb 17 11:06:18 unic /sbin/dhcpagent[1494]: [ID 94624 daemon.warning]
 checkaddr: expected peer address 192.168.1.255 on nge0, have 255.255.255.255

In fact, it seems an old bug was reintroduced while working on an other (related) CR. See bug report 6516629 for a detail description of the problem. Since it was already fixed in snv_58, and because there are other known issues with this build being unable to boot some systems i will stay with build 56, waiting for build 58.

Monday 22 January 2007

European Type6 USB Keyboard Circumflex Key Issue

Although my first bug report against snv_38 on a issue related to the use of a European Type6 USB keyboard with circumflex key seems to be fixed somewhere between snv_39 and snv_54 (at least the status mentioned "Closed: Not Reproducible"), it is not. Filled a new bug report linking to the first one, against snv_55b this time.

Well, i think i will continue to copy and paste the appropriate word using Google searches for a while...

Sunday 30 April 2006

List of Bugs Fixed per Operating System/Networking (ON) Build

Already want to be able to easily find what is new in each build of the ON consolidation from OpenSolaris; including Solaris Express and Solaris Express: Community Release?

Speaking about specifics putback logs (HTML and text based reports are available):

  1. ON consolidation: change logs and downloads
  2. X consolidation: change logs

For more general overview of what happen at the SX release level, the two best places are certainly:

  1. docs.sun.com: What's New in Solaris Express
  2. Dan Price's Weblog: Solaris entries

Wednesday 26 October 2005

Chasing fifo Bug

When testing RELENG_6 (for weeks now), i encountered a strange bug running an UP and a SMP kernel under heavy parallel load, as when building the world using the -j flag. The symptom was simply a panic, no more no less.

After posting about it on current@, i received good help from Robert N M Watson himself asking for crash dump and testing. After some work on his side, he came with a lot of work on the FreeBSD's fifo implementation, now committed to the source tree.

As a side note, it be mentioned that these modifications seems to solve the problem even on the UP machine, but Robert said that this one may had another origin and may not be totally solved. Here is an excerpt of one of his reply:

This is actually interesting -- Kris Kenneway reported the same thing -- that is, that with the most recent spate of fifo bug fixes, the problem has gone away. The only problem is that I don't think I fixed a bug with the symptoms that you have experienced, which suggests instead that I've changed the timing of the bug so that it occurs less rarely. I sent Kris a set of assertion patches to run with, and he generates some nice parallel make load, and I have a regression test I've been running. I think we're set for now and I'll continue to work on reproducing it after my recent fifo cleanup. It is possible I fixed a race condition without meaning to, of course... Please let me know if you have any further fifo panics! Thanks again, Robert N M Watson

I just can say that panic doesn't occur since then. Thanks to him.

Sunday 23 October 2005

FreeBSD's Upcoming 6.0 Release

After playing with RELENG_6 for a while, i can now safely says that the new upcoming FreeBSD release will be a very good release(tm)! Altogether, the operating system is very stable and have a good responsiveness.

But... my main problem encountered with my laptop still remain: a long standing problem with the GigE ethernet driver re(4). See Problem Report kern/80005 for more information on that subject. Just note that the problem was not solved (nor changed) trying the new polling(4) operational mode, nor trying the development branch of FreeBSD: 7-CURRENT, as somebody had advised it to me in the past... Too bad.

Friday 10 June 2005

How to Patch an AIX 5L System

On a fresh AIX 5.3ML1 (the corresponding output of oslevel -r is 5300-01), a bug was encountered when trying to change the NIS password of a remote user with yppasswd(1): the command systematically dumped core.

Openning a soft call at the IBM support center leads to patch the system with the APAR IY66747 (memory fault in chpass_netgroup). The fix package contains 19 filesets for a total package size of 54,464,512 bytes. After downloading it from IBM Fix Central (which can be found at http://www-912.ibm.com/eserver/support/fixes/), all the .bff were stored in a temporary location /files/tmpcdinst/IY66747 which was NFS mounted in order to reduce needed space on each concerned host.

The System Management Interface Tool (SMIT) recognizes /usr/sys/inst.images as the default directory for fixes downloaded onto hard disk. This was not the case here!

Here are the steps to install the fix package:

# cd /files/tmpcdinst/IY66747
# inutoc
# TERM=vt220 smitty update_all
/*
 * INPUT device / directory for software               [.]
 *  SOFTWARE to update                                  _update_all
 *  PREVIEW only? (update operation will NOT occur)     no
 *  COMMIT software updates?                            no
 *  SAVE replaced files?                                yes
 *  AUTOMATICALLY install requisite software?           yes
 *  EXTEND file systems if space needed?                yes
 *  VERIFY install and check file sizes?                no
 *  DETAILED output?                                    no
 *  Process multiple volumes?                           yes
 *  ACCEPT new license agreements?                      yes
 *  PREVIEW new LICENSE agreements?                     no
 */

Verify that the fix package applied correctly:

# instfix -ik IY66747
    All filesets for IY66747 were found.

Then, here is the correct behavior of the freshly patched system:

# yppasswd jpeg
Old NIS password:
jpeg's New password:
Re-enter jpeg's new password:
NIS passwd changed on neptune

All seems ok!

Side note: Since smitty update_all need to issue the bosboot command at the end of the update, a reboot is mostly recommended!