Pythian Blog: Technical Track

Using strace to debug application errors in linux

strace is a very useful tool which traces system calls and signals for a running process. This helps a lot while debugging application level performance issues and bugs. Aim of this post is to demonstrate the power of strace in pinning down an application bug.

I came across an issue in which nagios was sending the following alerts for a RHEL6 system.

***** Nagios ***** Notification Type: PROBLEM Service: NTP Host: xxxxx Address: xx.xx.xx.xx State: UNKNOWN Date/Time: Tue Feb 17 10:08:36 EST 2015 Additional Info: cant create socket connection

On manually executing the nagios plugin on the affected system, we can see that the command is not running correctly.

# /usr/lib64/nagios/plugins/check_ntp_time -H localhost -w 1 -c 2
can’t create socket connection

I ran strace on the command. This would create a file /tmp/strace.out with strace output.

# strace -xvtto /tmp/strace.out /usr/lib64/nagios/plugins/check_ntp_time -H localhost -w 1 -c 2

Following are the options which I passed.

-x Print all non-ASCII strings in hexadecimal string format.
-v Print unabbreviated versions of environment, stat, termios, etc. calls. These structures
are very common in calls and so the default behavior displays a reasonable subset of struc?
ture members. Use this option to get all of the gory details.
-tt If given twice, the time printed will include the microseconds.
-o filename Write the trace output to the file filename rather than to stderr. Use filename.pid if -ff
is used. If the argument begins with `|’ or with `!’ then the rest of the argument is
treated as a command and all output is piped to it. This is convenient for piping the
debugging output to a program without affecting the redirections of executed programs.

Time stamps displayed with -tt option is not very useful in this example, but it is very useful while debugging application performance issues. -T which shows the time spend in each system call is also useful for those issues.

From strace output,

10:26:11.901173 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
10:26:11.901279 connect(3, {sa_family=AF_INET, sin_port=htons(123), sin_addr=inet_addr(“127.0.0.1”)}, 16) = 0
10:26:11.901413 getsockname(3, {sa_family=AF_INET, sin_port=htons(38673), sin_addr=inet_addr(“127.0.0.1”)}, [16]) = 0
10:26:11.901513 close(3) = 0
10:26:11.901621 socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3
10:26:11.901722 connect(3, {sa_family=AF_INET6, sin6_port=htons(123), inet_pton(AF_INET6, “::1”, &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 ENETUNREACH (Network is unreachable) <—————-
10:26:11.901830 close(3) = 0
10:26:11.901933 socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3
10:26:11.902033 connect(3, {sa_family=AF_INET, sin_port=htons(123), sin_addr=inet_addr(“127.0.0.1”)}, 16) = 0
10:26:11.902130 socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDP) = 4
10:26:11.902238 connect(4, {sa_family=AF_INET6, sin6_port=htons(123), inet_pton(AF_INET6, “::1”, &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 ENETUNREACH (Network is unreachable) <—————-
10:26:11.902355 fstat(1, {st_dev=makedev(0, 11), st_ino=3, st_mode=S_IFCHR|0620, st_nlink=1, st_uid=528, st_gid=5, st_blksize=1024, st_blocks=0, st_rdev=makedev(136, 0), st_atime=2015/02/17-10:26:11, st_mtime=2015/02/17-10:26:11, st_ctime=2015/02/17-10:16:32}) = 0
10:26:11.902490 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc5a8752000
10:26:11.902608 write(1, “can’t create socket connection”, 30) = 30

Let us have a deeper look,

You can see that socket() is opening a socket with PF_INET (IP v4) domain and IPPROTO_IP (tcp) protocol. This returns file descriptor 3. Then connect() is connecting to the socket using the same file descriptor and connects to ntp port (123) in localhost. Then it calls getsockname and closes the file descriptor for the socket.

10:26:11.901173 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
10:26:11.901279 connect(3, {sa_family=AF_INET, sin_port=htons(123), sin_addr=inet_addr(“127.0.0.1”)}, 16) = 0
10:26:11.901413 getsockname(3, {sa_family=AF_INET, sin_port=htons(38673), sin_addr=inet_addr(“127.0.0.1”)}, [16]) = 0
10:26:11.901513 close(3) = 0

Next it does the same but with PF_INET6 (IP v6) domain. But you can see that connect() fails with ENETUNREACH.

10:26:11.901621 socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3
10:26:11.901722 connect(3, {sa_family=AF_INET6, sin6_port=htons(123), inet_pton(AF_INET6, “::1”, &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 ENETUNREACH (Network is unreachable) <—————-
10:26:11.901830 close(3)

From connect man page,

ENETUNREACH
Network is unreachable.

This process is repeated with IPPROTO_UDP (udp) protocol as well.

On checking the system, I see that that only IPv4 is enabled. ‘inet6 addr’ line is missing.

[root@pbsftp ~]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:50:56:90:2E:31
inet addr:xx.xx.xx.xx Bcast:xx.xx.xx.xx Mask:xx.xx.xx.xx <——————–
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5494691 errors:0 dropped:0 overruns:0 frame:0
TX packets:4014672 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5877759230 (5.4 GiB) TX bytes:5608605924 (5.2 GiB)

IPv6 is disabled in the system using following /etc/sysctl.conf entries.

net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.all.disable_ipv6 = 1

This behavior of nagios plugin is wrong as it should not die when one of the connect fails.

Issue is fixed in upstream patch.

Enabling IPv6 by removing following entries from /etc/sysctl.conf and running ‘sysctl -p’ would act as a workaround.

net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.all.disable_ipv6 = 1

To fix the issue, the upstream patch need to be either backported manually to create an rpm or a support ticket need to be opened with the operating system vendor to backport the patch in their product release.

No Comments Yet

Let us know what you think

Subscribe by email