blog'o thnet

To content | To menu | To search

Tag - performance

Entries feed - Comments feed

Tuesday 6 April 2010

Performance Problem Using The HP DP Agent

I recently faced an interesting problem when the backup agent for HP DataProtector regularly encounter network performance problem. The problem arise on a global zone which hosts many non-global zones, each running its own instance of the HP DP agent. The network configuration is rather unusual: there are two network ports, one which connects to an administration VLAN, the other which carries the data and external customer network streams (using VLAN-tagging). Since the global zone is only use as an hypervisor of system's resources for the non-global zones, it just plumbs the data network port, but has no IP address assigned to it.

The fact is, the configuration sets the nodename as the hostname. But the usable interface of the global zone which is in the administration network use another name than the hostname: beastie-adm in the following case, assuming the hostname was set to beastie.

So, now that I briefly describe the configuration of the server, lets see what happen at the network level from the global zone point of view when firing a remote telnet on the DataProtector agent TCP port (5555):

# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Syn Seq=1117789660 Len=0 Win=65535 Options=
  0.00005 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Syn Ack=1117789661 Seq=808681264 Len=0 Win=49480 Options=
  0.00500 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681265 Seq=1117789661 Len=0 Win=65535
 39.36365 beastie-adm.domain.tld ->172.1.2.3 TCP D=37228 S=5555 Push Ack=1117789661 Seq=808681265 Len=87 Win=49480
  0.00014 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Fin Ack=1117789661 Seq=808681352 Len=0 Win=49480
  0.00587 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00092 172.1.2.3 -> beastie-adm.domain.tld TCP D=5555 S=37228 Fin Ack=808681353 Seq=1117789661 Len=0 Win=65448
  0.00002 beastie-adm.domain.tld -> 172.1.2.3 TCP D=37228 S=5555 Ack=1117789662 Seq=808681353 Len=0 Win=49480

Ouch. Something is clearly wrong here: as can be seen in this trace, 40 seconds have been lost before establishing the connection. Snooping more largely on the interface show these requests:

[...]
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie.domain.tld. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 3(Name Error)
beastie-adm.domain.tld -> nameserver.domain.tld DNS C beastie. Internet Addr ?
nameserver.domain.tld -> beastie-adm.domain.tld DNS R  Error: 2(Server Fail)
[...]

Ok. Lets see what is the process corresponding to the HP DP agent doing during the connection tentative:

[...]
9431/1:         sysinfo(SI_HOSTNAME, "beastie", 64)         = 12
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getuid()                                        = 0 [0]
9431/1:         getuid()                                        = 0 [0]
9431/1:         door_info(3, 0x08047900)                        = 0
9431/1:         door_call(3, 0x08047958)                        = 0
9431/1:         getpid()                                        = 9431 [473]
9431/1:         lstat64("/var/opt/omni/log/debug.log", 0x080476F8) = 0
9431/1:         open64("/var/opt/omni/log/debug.log", O_RDWR|O_APPEND|O_CREAT, 0666) = 4
9431/1:         time()                                          = 1269079436
9431/1:         fstat64(4, 0x08046F30)                          = 0
9431/1:         fstat64(4, 0x08046E70)                          = 0
9431/1:         ioctl(4, TCGETA, 0x08046F04)                    Err#25 ENOTTY
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         sigaction(SIGSEGV, 0x08047AF0, 0x08047B70)      = 0
9431/1:         write(4, 0x08130DC4, 162)                       = 162
9431/1:           \n 0 3 / 2 0 / 1 0   1 1 : 0 3 : 5 6     I N E T . 9 4 3 1 . 0
9431/1:            [ " l i b / c m n / c o m m o n . c   / m a i n / h s l _ d p 6
9431/1:            1 / h s l _ h p i t 2 _ 2 / 9 " : 1 2 7 4 ]   A . 0 6 . 1 1   b
9431/1:            2 4 3\n g e t h o s t b y n a m e ( )   f a i l e d ,   h _ e r
9431/1:            r n o = 2   [ h o s t = h o s t n a m e ,   r e t r y = 5
9431/1:            ]\n
9431/1:         llseek(4, 0, SEEK_CUR)                          = 939002
9431/1:         close(4)
[...]

By now it is clear that most of time spent by the agent was used trying to resolve the hostname to an IP address (as logged to the /var/opt/omni/log/debug.log debug log file). Since there is no IP address declared for beastie as there is no network directly attached to the global zone, there is no mapping defined for the hostname. Bingo.

So, in this case we decided to set the hostname locally (in the /etc/hosts file) as an alias for the loopback entry. This way, the information is resolved, and the agent will not timed-out performing the gethostbyname(3NSL) syscall anyomore:

# getent hosts `beastie`
127.0.0.1       localhost loghost beastie
# snoop -d aggr2 -t d port 5555
Using device /dev/aggr2 (promiscuous mode)
  0.00000 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Syn Seq=978964828 Len=0 Win=65535 Options=
  0.00004 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Syn Ack=978964829 Seq=1184861168 Len=0 Win=49480 Options=
  0.00128 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861169 Seq=978964829 Len=0 Win=65535
 10.00279 beastie-adm.occ.lan ->172.1.2.3 TCP D=37237 S=5555 Push Ack=978964829 Seq=1184861169 Len=87 Win=49480
  0.00012 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Fin Ack=978964829 Seq=1184861256 Len=0 Win=49480
  0.00137 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00220 172.1.2.3 -> beastie-adm.occ.lan TCP D=5555 S=37237 Fin Ack=1184861257 Seq=978964829 Len=0 Win=65448
  0.00002 beastie-adm.occ.lan -> 172.1.2.3 TCP D=37237 S=5555 Ack=978964830 Seq=1184861257 Len=0 Win=49480

So, the main question which remain is "Why need the HP DP agent to resolve specifically on the hostname, even if this name is not use apart the backup transaction?" I did not have any answer for this at this time...

Monday 23 July 2007

Tracking Performance Problem with DTrace

Recently, I faced a performance problem showing an important load average on a server running Nagios, and hosted on a Solaris 10 platform. So, in order to identify the origin of the problem, I decided to use DTrace --for the very first time. DTrace is a comprehensive dynamic tracing framework and provides a powerful infrastructure to concisely answer arbitrary questions about the behavior of the operating system and user programs.

In the following case, I will use some one-liners and Dtrace scripts taken, or derived, from the excellent DTraceToolkit by Brendan Gregg--co-author of Solaris Performance and Tools, from the second edition of the Solaris Internals books. Many, many thanks to him for his great and impressive work.

Back to my server running Nagios, we can see a constant load average around 4 or 5, for a uni-processor system:

# uptime
 11:48am  up 85 day(s), 23:27,  2 users,  load average: 4.53, 4.44, 4.48
# 
# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m3 m4   in   sy   cs us sy id
 3 0 0 9286536 1484928 1029 8643 42 156 161 0 203 1 2 0 0 676 78530 949 47 44 8
 1 0 0 9263880 1467936 1123 10536 0 0 0 0 0 0  0  0  0  673 37455 1241 52 48 0
 1 0 0 9262232 1455472 1097 10656 0 0 0 0 0 0  0  0  0  771 35824 1496 52 48 0
 0 0 0 9302168 1484760 1236 11513 0 0 0 0 0 0  0  0  0  592 32530 864 50 50  0
 3 0 0 9260872 1458072 967 9788 0 1616 1616 0 0 0 0 0 0 1141 30488 1280 51 48 1
 3 0 0 9224896 1434208 1229 11008 0 0 0 0 0 0  0  0  0  638 36756 1198 53 47 0

So, we can see there is a lot a system CPU time spend on the processor. Can we know what is(are) the current program(s) which seems to take most of this time?

# dtrace -n 'syscall:::entry { @num[execname] = count(); }'
dtrace: description 'syscall:::entry ' matched 226 probes
^C

  fmd                                                               1
  inetd                                                             1
  svc.configd                                                       1
  sshd                                                              8
  svc.startd                                                       12
  sendmail                                                         20
  dced                                                             30
  picld                                                            34
  xntpd                                                            50
  httpd                                                            68
  nscd                                                            236
  dtrace                                                          454
  init                                                           2265
  nagios_xng_1.2                                                 5170
  sh                                                             9645
  mysqld                                                        21901
  php                                                          250039

Well, php is a clear winner here, and that is not abnormal since there are a lot of servers to monitor, including large Sybase data servers. Right now, try to get the most used system calls during php execution:

# dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 226 probes
^C

[...]
  pollsys                                                         951
  llseek                                                         1263
  getcwd                                                         1371
  memcntl                                                        1667
  write                                                          2679
  lseek                                                          3552
  resolvepath                                                    3745
  munmap                                                         3807
  sigaction                                                      3917
  gtime                                                          4228
  fcntl                                                          4758
  close                                                          5185
  lwp_sigmask                                                    5279
  open                                                           5311
  mmap                                                           7933
  brk                                                           34003
  stat                                                          43151
  read                                                          86493

Ok, the read syscall is the very most called one from php processes. We will now try to see if this is always the same php processus which call the read syscall using the readid.d DTrace script. Then, we will get the exact file names accessed (read) by the php processes helped by the readfile.d script:

# ./readid.d php 5s
Sampling for 5s ... Please wait.

PROGRAM                PID      COUNT
[...]
php                  26333       1351
php                  26345       1353
php                  26408       1353
php                  26448       1353
php                  26340       1353
php                  26440       1353
php                  26432       1353
php                  26424       1353
php                  26436       2376
php                  26372       2376
php                  26357       2376
php                  26353       2376
php                  26379       2376
php                  26369       2378
php                  26328       3136
#
# ./readfile.d php 5s
Sampling for 5s ... Please wait.

FILE NAME                                                                 COUNT
[...]
/data/p2/sysdoc/soft/php_5.0.3_3/php.ini                                    231
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/iso_1/charset.loc         330
/data/sybase_client/V12.5.1_32bits_SunOS/config/objectid.dat                374
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/cslib.loc
                                                                            429
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/utf8/iso_1.ctb            660
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/tcllib.loc
                                                                            858
/data/sybase_client/V12.5.1_32bits_SunOS/locales/locales.dat               1056
/data/sybase_client/V12.5.1_32bits_SunOS/locales/message/us_english/ctlib.loc
                                                                           2244
/data/sybase_client/V12.5.1_32bits_SunOS/charsets/utf8/charset.loc         5473
/tools/list/sybase/interfaces                                             33116

So, we saw that multiple php processes uses the read syscall, and that the number of reads is relatively similar. The second output is more interesting as it shows that the file which is read so often is the interfaces file, a Sybase RDBMS configuration file used for client-server connection definitions.

Last, the opensnoop.d DTrace script shows that some of our Nagios configuration must be done again since we try to access to non-existent paths:

# sh ./opensnoop.d -x -n php
  UID    PID COMM          FD PATH
  620   8483 php           -1 /var/ld/ld.config
  620   8483 php           -1 /bin//php-cli.ini
  620   8483 php           -1 /soft/sun_free/php/php-cli.ini

  620   8483 php           -1 /bin//php.ini
  620   8483 php           -1 /usr/nagios/site/my_syb_client/OCS-12_5/config/ocs.cfg
  620   8483 php           -1 //getinfo.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/getinfo.php
  620   8483 php           -1 /soft/sun_free/adodb/getinfo.php
  620   8483 php           -1 //recuperation_seuil.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/recuperation_seuil.php
  620   8483 php           -1 /soft/sun_free/adodb/recuperation_seuil.php
  620   8483 php           -1 //mysql_connect.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/mysql_connect.php
  620   8483 php           -1 /soft/sun_free/adodb/mysql_connect.php
  620   8483 php           -1 //sybase_check_appli.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/sybase_check_appli.php
  620   8483 php           -1 /soft/sun_free/adodb/sybase_check_appli.php
  620   8483 php           -1 //sybase_check_system.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/sybase_check_system.php
  620   8483 php           -1 /soft/sun_free/adodb/sybase_check_system.php
  620   8483 php           -1 //mysql_connect.php
  620   8483 php           -1 /soft/sun_free/php/lib/php/mysql_connect.php
  620   8483 php           -1 /soft/sun_free/adodb/mysql_connect.php
  620   8487 php           -1 /var/ld/ld.config
  620   8487 php           -1 /bin//php-cli.ini
  620   8487 php           -1 /soft/sun_free/php/php-cli.ini
  620   8487 php           -1 /bin//php.ini
[...]

Well, DTrace helped us to see where the excessive CPU consumption came from, along with telling us what are the roots for this behavior: connections to Sybase RDBMS, and bad search paths. So, I now can go and see the developers to provide them with these useful informations to debug their Nagios agents.

Thursday 11 January 2007

NFS and ZFS plus ZIL Interesting Notes

I recently learn about NFS on ZFS interaction problem reading the great blog of Ben Rockwood. Although not directly related to what he encountered, this recent great post about how NFS behaves with ZFS backend, particularly on the performance comparison front, says a lot of things about why you might see poor performance using this two technologies together.

To go deeper on this front, you can read more about the ZIL purpose on Eric Kustarz's weblog, and follow closely this ZFS thread on the OpenSolaris website.