Tags: DTrace, Performance
Recently, I faced a performance problem showing an important load average on a server running Nagios, and hosted on a Solaris 10 platform. So, in order to identify the origin of the problem, I decided to use DTrace --for the very first time. DTrace is a comprehensive dynamic tracing framework and provides a powerful infrastructure to concisely answer arbitrary questions about the behavior of the operating system and user programs.
In the following case, I will use some one-liners and DTrace scripts taken, or derived, from the excellent DTraceToolkit by Brendan Gregg--coauthor of “Solaris Performance and Tools”, from the second edition of the Solaris Internals books. Many, many thanks to him for his great and impressive work.
Back to my server running Nagios, we can see a constant load average around 4 or 5, for a uni-processor system:
# uptime
11:48am up 85 day(s), 23:27, 2 users, load average: 4.53, 4.44, 4.48
#
# vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m3 m4 in sy cs us sy id
3 0 0 9286536 1484928 1029 8643 42 156 161 0 203 1 2 0 0 676 78530 949 47 44 8
1 0 0 9263880 1467936 1123 10536 0 0 0 0 0 0 0 0 0 673 37455 1241 52 48 0
1 0 0 9262232 1455472 1097 10656 0 0 0 0 0 0 0 0 0 771 35824 1496 52 48 0
0 0 0 9302168 1484760 1236 11513 0 0 0 0 0 0 0 0 0 592 32530 864 50 50 0
3 0 0 9260872 1458072 967 9788 0 1616 1616 0 0 0 0 0 0 1141 30488 1280 51 48 1
3 0 0 9224896 1434208 1229 11008 0 0 0 0 0 0 0 0 0 638 36756 1198 53 47 0
So, we can see there is a lot a system CPU time spend on the processor. Can we know what is(are) the current program(s) which seems to take most of this time?
# dtrace -n 'syscall:::entry { @num[execname] = count(); }'
dtrace: description 'syscall:::entry ' matched 226 probes
^C
fmd 1
inetd 1
svc.configd 1
sshd 8
svc.startd 12
sendmail 20
dced 30
picld 34
xntpd 50
httpd 68
nscd 236
dtrace 454
init 2265
nagios_xng_1.2 5170
sh 9645
mysqld 21901
php 250039
Well, php
is a clear winner here, and that is not abnormal since there
are a lot of servers to monitor, including large Sybase data servers.
Right now, try to get the most used system calls during php
execution:
# dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 226 probes
^C
[...]
pollsys 951
llseek 1263
getcwd 1371
memcntl 1667
write 2679
lseek 3552
resolvepath 3745
munmap 3807
sigaction 3917
gtime 4228
fcntl 4758
close 5185
lwp_sigmask 5279
open 5311
mmap 7933
brk 34003
stat 43151
read 86493
Ok, the read
syscall is the very most called one from php
processes.
We will now try to see if this is always the same php
process which
call the read
syscall using the
readid.d DTrace script. Then, we will get
the exact file names accessed (read) by the php
processes helped by
the readfile.d script:
# ./readid.d php 5s
Sampling for 5s ... Please wait.
PROGRAM PID COUNT
[...]
php 26333 1351
php 26345 1353
php 26408 1353
php 26448 1353
php 26340 1353
php 26440 1353
php 26432 1353
php 26424 1353
php 26436 2376
php 26372 2376
php 26357 2376
php 26353 2376
php 26379 2376
php 26369 2378
php 26328 3136
#
# ./readfile.d php 5s
Sampling for 5s ... Please wait.
FILE NAME COUNT
[...]
/data/p2/sysdoc/soft/php_5.0.3_3/php.ini 231
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/iso_1/charset.loc 330
/data/sybase_client/V12.5.1_32bits_SunOS/config/objectid.dat 374
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/cslib.loc
429
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/utf8/iso_1.ctb 660
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/tcllib.loc
858
/data/sybase_client/V12.5.1_32bits_SunOS/locales/locales.dat 1056
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/ctlib.loc
2244
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/utf8/charset.loc 5473
/tools/list/sybase/interfaces 33116
So, we saw that multiple php
processes uses the read
syscall, and
that the number of reads is relatively similar. The second output is
more interesting as it shows that the file which is read so often is the
interfaces
file, a Sybase RDBMS configuration file used for
client-server connection definitions.
Last, the opensnoop.d
DTrace script shows that some of our Nagios
configuration must be done again since we try to access to nonexistent
paths:
# sh ./opensnoop.d -x -n php
UID PID COMM FD PATH
620 8483 php -1 /var/ld/ld.config
620 8483 php -1 /bin//php-cli.ini
620 8483 php -1 /soft/sun_free/php/php-cli.ini
620 8483 php -1 /bin//php.ini
620 8483 php -1 /usr/nagios/site/my_syb_client/OCS-12_5/config/ocs.cfg
620 8483 php -1 //getinfo.php
620 8483 php -1 /soft/sun_free/php/lib/php/getinfo.php
620 8483 php -1 /soft/sun_free/adodb/getinfo.php
620 8483 php -1 //recuperation_seuil.php
620 8483 php -1 /soft/sun_free/php/lib/php/recuperation_seuil.php
620 8483 php -1 /soft/sun_free/adodb/recuperation_seuil.php
620 8483 php -1 //mysql_connect.php
620 8483 php -1 /soft/sun_free/php/lib/php/mysql_connect.php
620 8483 php -1 /soft/sun_free/adodb/mysql_connect.php
620 8483 php -1 //sybase_check_appli.php
620 8483 php -1 /soft/sun_free/php/lib/php/sybase_check_appli.php
620 8483 php -1 /soft/sun_free/adodb/sybase_check_appli.php
620 8483 php -1 //sybase_check_system.php
620 8483 php -1 /soft/sun_free/php/lib/php/sybase_check_system.php
620 8483 php -1 /soft/sun_free/adodb/sybase_check_system.php
620 8483 php -1 //mysql_connect.php
620 8483 php -1 /soft/sun_free/php/lib/php/mysql_connect.php
620 8483 php -1 /soft/sun_free/adodb/mysql_connect.php
620 8487 php -1 /var/ld/ld.config
620 8487 php -1 /bin//php-cli.ini
620 8487 php -1 /soft/sun_free/php/php-cli.ini
620 8487 php -1 /bin//php.ini
[...]
Well, DTrace helped us to see where the excessive CPU consumption came from, along with telling us what are the roots for this behavior: connections to Sybase RDBMS, and bad search paths. So, I now can go and see the developers to provide them with these useful information to debug their Nagios agents.