Blog

Tracking Performance Problem with DTrace

Apr 13, 2007 | 5 minutes read
Share this:

Tags: DTrace, Performance

Recently, I faced a performance problem showing an important load average on a server running Nagios, and hosted on a Solaris 10 platform. So, in order to identify the origin of the problem, I decided to use DTrace --for the very first time. DTrace is a comprehensive dynamic tracing framework and provides a powerful infrastructure to concisely answer arbitrary questions about the behavior of the operating system and user programs.

In the following case, I will use some one-liners and DTrace scripts taken, or derived, from the excellent DTraceToolkit by Brendan Gregg--coauthor of “Solaris Performance and Tools”, from the second edition of the Solaris Internals books. Many, many thanks to him for his great and impressive work.

Back to my server running Nagios, we can see a constant load average around 4 or 5, for a uni-processor system:

# uptime
 11:48am  up 85 day(s), 23:27,  2 users,  load average: 4.53, 4.44, 4.48
# 
# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m3 m4   in   sy   cs us sy id
 3 0 0 9286536 1484928 1029 8643 42 156 161 0 203 1 2 0 0 676 78530 949 47 44 8
 1 0 0 9263880 1467936 1123 10536 0 0 0 0 0 0  0  0  0  673 37455 1241 52 48 0
 1 0 0 9262232 1455472 1097 10656 0 0 0 0 0 0  0  0  0  771 35824 1496 52 48 0
 0 0 0 9302168 1484760 1236 11513 0 0 0 0 0 0  0  0  0  592 32530 864 50 50  0
 3 0 0 9260872 1458072 967 9788 0 1616 1616 0 0 0 0 0 0 1141 30488 1280 51 48 1
 3 0 0 9224896 1434208 1229 11008 0 0 0 0 0 0  0  0  0  638 36756 1198 53 47 0

So, we can see there is a lot a system CPU time spend on the processor. Can we know what is(are) the current program(s) which seems to take most of this time?

# dtrace -n 'syscall:::entry { @num[execname] = count(); }'
dtrace: description 'syscall:::entry ' matched 226 probes
^C

  fmd                                                               1
  inetd                                                             1
  svc.configd                                                       1
  sshd                                                              8
  svc.startd                                                       12
  sendmail                                                         20
  dced                                                             30
  picld                                                            34
  xntpd                                                            50
  httpd                                                            68
  nscd                                                            236
  dtrace                                                          454
  init                                                           2265
  nagios_xng_1.2                                                 5170
  sh                                                             9645
  mysqld                                                        21901
  php                                                          250039

Well, php is a clear winner here, and that is not abnormal since there are a lot of servers to monitor, including large Sybase data servers. Right now, try to get the most used system calls during php execution:

# dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 226 probes
^C

[...]
  pollsys                                                         951
  llseek                                                         1263
  getcwd                                                         1371
  memcntl                                                        1667
  write                                                          2679
  lseek                                                          3552
  resolvepath                                                    3745
  munmap                                                         3807
  sigaction                                                      3917
  gtime                                                          4228
  fcntl                                                          4758
  close                                                          5185
  lwp_sigmask                                                    5279
  open                                                           5311
  mmap                                                           7933
  brk                                                           34003
  stat                                                          43151
  read                                                          86493

Ok, the read syscall is the very most called one from php processes. We will now try to see if this is always the same php process which call the read syscall using the readid.d DTrace script. Then, we will get the exact file names accessed (read) by the php processes helped by the readfile.d script:

# ./readid.d php 5s
Sampling for 5s ... Please wait.

PROGRAM                PID      COUNT
[...]
php                  26333       1351
php                  26345       1353
php                  26408       1353
php                  26448       1353
php                  26340       1353
php                  26440       1353
php                  26432       1353
php                  26424       1353
php                  26436       2376
php                  26372       2376
php                  26357       2376
php                  26353       2376
php                  26379       2376
php                  26369       2378
php                  26328       3136
#
# ./readfile.d php 5s
Sampling for 5s ... Please wait.

FILE NAME                                                                 COUNT
[...]
/data/p2/sysdoc/soft/php_5.0.3_3/php.ini                                    231
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/iso_1/charset.loc         330
/data/sybase_client/V12.5.1_32bits_SunOS/config/objectid.dat                374
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/cslib.loc
                                                                            429
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/utf8/iso_1.ctb            660
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/tcllib.loc
                                                                            858
/data/sybase_client/V12.5.1_32bits_SunOS/locales/locales.dat               1056
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/ctlib.loc
                                                                           2244
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/utf8/charset.loc         5473
/tools/list/sybase/interfaces                                             33116

So, we saw that multiple php processes uses the read syscall, and that the number of reads is relatively similar. The second output is more interesting as it shows that the file which is read so often is the interfaces file, a Sybase RDBMS configuration file used for client-server connection definitions.

Last, the opensnoop.d DTrace script shows that some of our Nagios configuration must be done again since we try to access to nonexistent paths:

# sh ./opensnoop.d -x -n php
  UID    PID COMM          FD PATH
  620   8483 php           -1 /var/ld/ld.config
  620   8483 php           -1 /bin//php-cli.ini
  620   8483 php           -1 /soft/sun_free/php/php-cli.ini

  620   8483 php           -1 /bin//php.ini
  620   8483 php           -1 /usr/nagios/site/my_syb_client/OCS-12_5/config/ocs.cfg
  620   8483 php           -1 //getinfo.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/getinfo.php
  620   8483 php           -1 /soft/sun_free/adodb/getinfo.php
  620   8483 php           -1 //recuperation_seuil.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/recuperation_seuil.php
  620   8483 php           -1 /soft/sun_free/adodb/recuperation_seuil.php
  620   8483 php           -1 //mysql_connect.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/mysql_connect.php
  620   8483 php           -1 /soft/sun_free/adodb/mysql_connect.php
  620   8483 php           -1 //sybase_check_appli.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/sybase_check_appli.php
  620   8483 php           -1 /soft/sun_free/adodb/sybase_check_appli.php
  620   8483 php           -1 //sybase_check_system.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/sybase_check_system.php
  620   8483 php           -1 /soft/sun_free/adodb/sybase_check_system.php
  620   8483 php           -1 //mysql_connect.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/mysql_connect.php
  620   8483 php           -1 /soft/sun_free/adodb/mysql_connect.php
  620   8487 php           -1 /var/ld/ld.config
  620   8487 php           -1 /bin//php-cli.ini
  620   8487 php           -1 /soft/sun_free/php/php-cli.ini
  620   8487 php           -1 /bin//php.ini
[...]

Well, DTrace helped us to see where the excessive CPU consumption came from, along with telling us what are the roots for this behavior: connections to Sybase RDBMS, and bad search paths. So, I now can go and see the developers to provide them with these useful information to debug their Nagios agents.