Pythian Blog: Technical Track

Troubleshooting ORA-27090 Async I/O Errors with Systemtap

Last week, I ran into an issue on a high-volume Oracle database where sessions were periodically failing with ORA-27090 errors. Job queue processes were also seeing this error, and showing messages like this in the database alert log:

Tue Sep 11 20:56:15 2012
Errors in file /orahome/oracle/base/diag/rdbms/dbm/dbm1/trace/dbm1_j001_126103.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 11: Resource temporarily unavailable
Additional information: 3
Additional information: 128
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 11: Resource temporarily unavailable
Additional information: 3
Additional information: 128


The tracefile just showed the same ORA-27090 messages, so nothing particularly useful there. Oerr is of no help:

$ oerr ora 27090
27090, 00000, "Unable to reserve kernel resources for asynchronous disk I/O"
// *Cause: The system call to reserve kernel resources for asynchronous I/O
// has failed.
// *Action: Check errno

There’s a known bug, 7306820 “ORA-7445 [krhahw] / ORA-27090 during file header read. Instances may crash, but this bug is fixed in 11.2.0.1, and this database is running 11.2.0.3.

And on top of that, it’s an Exadata system, so I/O to storage servers go over the InfiniBand network rather than use async I/O (AIO) calls.

A web search turned up a blog entry from DBHK’s blog, pointing to a value of aio-max-nr being set too low. However aio-max-nr is actually set to the same level as the recommended value and matches Exadata defaults as well:

# cat /proc/sys/fs/aio-max-nr
3145728

The Linux kernel documentation has a brief but meaty description of this parameter:

aio-nr & aio-max-nr:

aio-nr is the running total of the number of events specified on the io_setup system call for all currently active aio contexts. If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN. Note that raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures.

Taking a peek at aio-nr:

# cat /proc/sys/fs/aio-nr
3145726

We’re within 2 of the absolute limit, so it’s likely that this limit is indeed the problem. However, the question is: Who is using these AIO events? This DB is a huge session hog (8000+ concurrent sessions) but even there, 3M is a pretty high limit. And at this point, we can’t even be sure that its database processes use up the AIO events.

The only AIO-related information in /proc (or /sys for that matter) is the two files in /proc/sys/fs. To go into more detail requires some more tools.

Solaris admins will no doubt be familiar with DTrace, a kernel tracing framework that can expose all kinds of information in the OS kernel, among other things. Oracle has ported DTrace to Linux, but it requires the latest-and-greatest UEK2 kernel and is not yet supported on Exadata.

I came across another tool that also allows kernel inspection and is available in Oracle Linux 5: systemtap. Systemtap hooks into the call stack, allowing function calls to be traced, arguments captured, and if you’re really brave, actually modified.

With dependencies, I ended up needing to add four packages. As this machine doesn’t (yet) have a working yum repository, I used public-yum.oracle.com to obtain the following:

avahi-0.6.16-10.el5_6.x86_64.rpm
dbus-python-0.70-9.el5_4.x86_64.rpm
systemtap-1.6-7.el5_8.x86_64.rpm
systemtap-runtime-1.6-7.el5_8.x86_64.rpm

The avahi package is a tool for plug-and-play networking that I don’t exactly want running on a server, but the systemtap binary is linked to it for remote compilation capability. Avahi configures itself to auto-start itself on the next boot, so I disabled that:

# chkconfig avahi-daemon off
# chkconfig avahi-dnsconfd off

The systemtap packages complained about missing kernel package depedencies, since this system is running’s UEK kernel, naming the kernel package kernel-u2k instead. I ended up doing the installation with the –nodeps option to skip dependency checking.

I couldn’t find any pre-made scripts to monitor AIO, but in a 2008 presentation from Oracle, Linux engineering does have a bullet point on it:

• Tracking resources tuned via aio_nr and aio_max_nr

So based on some of the many example scripts I set out to build a script to monitor AIO calls. Here is the end result:

stap -ve '
global allocated, allocatedctx, freed
probe syscall.io_setup {
  allocatedctx[pid()] += maxevents; allocated[pid()]++;
  printf("%d AIO events requested by PID %d (%s)\n",
  	maxevents, pid(), cmdline_str());
}
probe syscall.io_destroy {freed[pid()]++}
probe kprocess.exit {
  if (allocated[pid()]) {
     printf("PID %d exited\n", pid());
     delete allocated[pid()];
     delete allocatedctx[pid()];
     delete freed[pid()];
  }
}
probe end {
foreach (pid in allocated) {
   printf("PID %d allocated=%d allocated events=%d freed=%d\n",
      pid, allocated[pid], allocatedctx[pid], freed[pid]);
}
}
'

Sample output (using sytemtap’s -v verbose option to see compilation details):

Pass 1: parsed user script and 76 library script(s) using 147908virt/22876res/2992shr kb, in 130usr/10sys/146real ms.
Pass 2: analyzed script: 4 probe(s), 10 function(s), 3 embed(s), 4 global(s) using 283072virt/49864res/4052shr kb, in 450usr/140sys/586real ms.
Pass 3: using cached /root/.systemtap/cache/11/stap_111c870f2747cede20e6a0e2f0a1b1ae_6256.c
Pass 4: using cached /root/.systemtap/cache/11/stap_111c870f2747cede20e6a0e2f0a1b1ae_6256.ko
Pass 5: starting run.
128 AIO events requested by PID 32885 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 32885 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 32885 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
...
(and when control-C is pressed):
PID 99043 allocated=6 allocatedevents=12672 freed=3
PID 37074 allocated=12 allocatedevents=25344 freed=6
PID 99039 allocated=18 allocatedevents=38016 freed=9
PID 69142 allocated=24 allocatedevents=50688 freed=12
PID 32885 allocated=36 allocatedevents=76032 freed=18
PID 69099 allocated=6 allocatedevents=12672 freed=3
Pass 5: run completed in 0usr/50sys/9139real ms.

It’s quite obvious here that the AIO allocations are all happening from Oracle database processes.

From the summary output we can see that each process seems to run io_setup twice as much as io_destroy. Kernel gurus may have an answer to this, but I suspect it has more to do with the data gathering than a massive leak in AIO events.

But the more interesting result is the frequent allocation of 4096 AIO events at a time. On a database with 8000 connections, that would be over 10 times the current limit.

The only major downside of increasing this limit seems to be avoiding to exhaust kernel memory. From a 2009 post to the linux-kernel mailing list:

Each time io_setup is called, a ring buffer is allocated that can hold nr_events I/O completions. That ring buffer is then mapped into the process’ address space, and the pages are pinned in memory. So, the reason for this upper limit (I believe) is to keep a malicious user from pinning all of kernel memory.

And with consultation with Oracle support, we set aio-max-nr to 50 million, enough to accommodate three databases with 16k connections, all allocating 4096 AIO events. Or in other words, way more than we ever expect to use.

# sysctl -w fs.aio-max-nr=50000000

And since this change, the ORA-27090 errors have gone away.

No Comments Yet

Let us know what you think

Subscribe by email