Pythian Blog: Technical Track

Two Cases of Troubleshooting with psn

No, it’s not the PlayStation Network! Linux Process Snapper — pSnapper, or psn — is an open source tool written by Tanel Poder. Tanel provides a number of great examples on how to use psn and even a few videos explaining the ideas behind it. Take a good look.

Here are two cases of how psn helped us find the root cause of performance issues in production systems.

Case #1

In an eBS 12.1 application, a serially run custom concurrent program completed in an acceptable time of about half a minute. It is a relatively simple custom shell script dealing with a few files and calling a few standard Unix programs.

When this program was run concurrently at the same time in 20 threads, each program execution took between 5 and 15 minutes. Top output showed a high load average over 20, CPU usage was close to zero and one program instance was in D state:

top - 16:51:21 up 130 days, 18:38,  2 users,  load average: 21.93, 14.84, 7.99
Tasks: 490 total,   1 running, 489 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  2.2 sy,  0.0 ni, 96.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.7 st
KiB Mem : 49293268 total,  3327024 free,  8357988 used, 37608256 buff/cache
KiB Swap:  2097148 total,  1879036 free,   218112 used. 39050452 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21252 appusr    20   0  113336    836    396 D   6.2  0.0   0:01.90 XX_CUSTOM_PROG
22148 root      20   0  162304   2464   1540 R   6.2  0.0   0:00.01 top
    1 root      20   0  194208   5268   3268 S   0.0  0.0  77:59.26 systemd
...

When things like this happen, you need to check what the process is doing with standard Linux tools: strace, pstack, etc. I wanted to try psn since I haven’t used it in any real systems before. Initially, psn output looked like this:

[root@hostname]# ./psn -d 60 -G syscall,wchan,filename

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/syscall, stat, wchan for 60 seconds... finished.


=== Active Threads ===================================================================================================

 samples | avg_threads | comm           | state                  | syscall         | wchan                 | filename
----------------------------------------------------------------------------------------------------------------------
    3183 |       18.09 | (XX_********)  | Disk (Uninterruptible) | getdents        | iterate_dir           |
     145 |        0.82 | (XX_********)  | Disk (Uninterruptible) | getdents        | rpc_wait_bit_killable |
     106 |        0.60 | (FNDLIBR)      | Disk (Uninterruptible) | fstat           | do_last               |
      40 |        0.23 | (ksoftirqd/*)  | Running (ON CPU)       | [running]       | 0                     |
...

As you can see the issue here is getdents — Get Directory Entries — syscalls that are iterating over the entries of some directories. Unfortunately, it doesn’t provide the filename/dirname; fortunately, psn is an open-source tool. After checking its source code, I realized it would take a very small and simple change to report directory names for getdents and a few other syscalls. This is the psn output with my fix in place:

[root@hostname]# ./psn -d 60 -G syscall,wchan,filename

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, syscall, wchan for 60 seconds... finished.


=== Active Threads =======================================================================================================================

 samples | avg_threads | comm              | state                  | syscall         | wchan                 | filename
------------------------------------------------------------------------------------------------------------------------------------------
    3073 |       17.56 | (XX_********)     | Disk (Uninterruptible) | getdents        | iterate_dir           | /dir_path/applcsf/log
     188 |        1.07 | (XX_********)     | Disk (Uninterruptible) | stat            | lookup_slow           |
     149 |        0.85 | (XX_********)     | Disk (Uninterruptible) | getdents        | rpc_wait_bit_killable | /dir_path/applcsf/log
      64 |        0.37 | (INVLIBR)         | Disk (Uninterruptible) | llistxattr      | lookup_slow           |
      53 |        0.30 | (FNDLIBR)         | Disk (Uninterruptible) | llistxattr      | lookup_slow           |
...

And here it is! For some reason, all programs were trying to access the $APPLCSF/log directory, which holds hundreds of thousands of logs, plus it’s located on an NFSv3 mount. It’s not clear why getdents calls on log dir are slow when programs run concurrently; I suspect it is a combination of concurrency limitations in NFSv3 (NFSv4 delivered some fixes to such issues) and the NFS server responding slower.

OK we know that something touches log directory with too many files, but what is something? I spent quite a bit of time trying to nail it down and, with the help of bash tracing, I finally found that the shell script itself is not an issue. It is a standard eBS environment script that is using an unquoted JDBC connection string with @(...) characters. That URL string forced bash to use some form of pattern matching on files in the current directory. I have an SR open with the eBS product team; hopefully one day the bug will be fixed.

Case #2

This one is also related to an eBS apps node. It is a VM running in a VMware ESXi cluster. For an unknown reason, some VMs hang regularly every 3-4h. When a hang happens, the load average is high, but nothing is running on the system in this case. Even a simple uptime command hangs. The hang lasts for about 20 minutes and then goes away like nothing happened.

I initially thought it might be related to VMware — I’ve had some issues with it before — but there was nothing unusual on the VMware side. While this was otherwise a really good case study for psn, the node is a production system where it is hard to install even a python script. To work around this, I wrote a command to mimic psn logic and group wchan of all the processes that I could see:

for x in /proc/[0-9]*/wchan ; do cat $x 2>/dev/null && echo ; done |
 grep -v "poll_schedule_timeout\|ep_poll\|futex_wait_queue_me\|do_wait\|rescuer_thread\|worker_thread\|smpboot_thread_fn" |
 sort | uniq -c | sort -n

To explain, I got each process its wchan in /proc filesystem (ignoring those processes I have no access to), then filter and throw away some idle events, and finally group and sort by the number of occurrences. This is practically the same work that psn is doing, except psn collects and aggregates multiple samples (and a bunch of other useful things) while I’m collecting just one sample and aggregate wchan only. This was the output of running the commands during normal operation:

...
      2 rcu_gp_kthread
      3 scsi_error_handler
      4 sigsuspend
      7 hrtimer_nanosleep
      7 n_tty_read
      7 xfsaild
      8 do_lock_file_wait.part.31
      8 pipe_wait
      8 sk_wait_data

And here is output during the start of a hang:

...
      1 mfe_aac_reporter_thread_fn
      1 mfe_fileaccess_netlink_scan_request_thread_fn
      1 nfs41_callback_svc
      1 watchdog
      2 rcu_gp_kthread
      3 scsi_error_handler
      4 sigsuspend
      6 hrtimer_nanosleep
      7 n_tty_read
      7 xfsaild
      9 do_lock_file_wait.part.31
      9 pipe_wait
      9 sk_wait_data
     24 mfe_fileaccess_set_new_entry_scan_state

Looks interesting: 24 similar wchan entries with an odd name mfe_fileaccess_set_new_entry_scan_state. Google Search gave exactly zero hits. Searching just by mfe_fileaccess gave back 7 results, one of wchich was a link to MOS Doc ID 2770509.1 Oracle Linux: System Hang in: “mfe_fileaccess_wake_up_all_events_and_reset_scanningQueue”. The MOS doc explains that the reason of Linux hang (related to a slightly different function) is … McAfee anti-virus! Well yes indeed McAfee was running on this system. And when it was disabled, the hangs were gone too, proving McAfee anti-virus was the culprit of the hangs.

Summary

  • psn is a really great tool for understanding what’s going on in almost any case where you want to use top.
  • As an open-source tool, psn can be customized relatively easily.
  • I particularly like how simple the idea of psn is: sampling critical data such as syscall/wchan and kernel stacks.
  • Decoding file descriptors for syscalls having fd as the first parameter is simply brilliant.
  • With a few lines of shell code, you can simulate some basic psn features if necessary.

No Comments Yet

Let us know what you think

Subscribe by email