Disclaimer: Much of what follows is pure speculation on my part. It could be completely wrong, and I'm putting it out there in the hopes that it'll eventually be proven one way or the other.
The Summary
- Underscore parameter _use_adaptive_log_file_sync
- Default value changed in 11.2.0.3 from FALSE to TRUE
- Dynamic parameter
- Enables a new method of communication for LGWR to notify foreground processes of commit
- Old method used semaphores, LGWR had to explicitly "post" every waiting process
- New method has the FG processes sleep and "poll" to see if commit is complete
- Advantage is to free LGWR from CPU work required to inform lots of processes about commits
- LGWR dynamically switches between old and new method based on load and responsiveness
- Method can switch frequently at runtime, max frequency is 3 switches per minute (configurable)
- Switch is logged in LGWR tracefile, we have seen several switches per day
- Few problems in general, possible issues seem to be in RAC and/or the switching process itself
The Story
We're working with a customer who had a very successful Exadata go-live last weekend. They moved a large application mostly unchanged from 9i to Exadata - and we've generally had very positive feedback. (!!) It might be that just getting on modern hardware and software accounts for this, but it still gives everyone a good feeling. Nevertheless, there are always a few little adventures. One of these was a mysterious orange glob on the Cloud Control radar Monday at 1am. Orange is the "commit" class - so this was a little surprising! (And reminiscent of
Karl Arao's famous halloween monster!) A few drill-down clicks and we're looking at "log file sync" events. To make a long story short, it wasn't any of the usual suspects, and the necessary diagnostics info wasn't available after the fact to say what happened with certainty. We've done a fairly comprehensive incident report and it hasn't recurred... So I'm happy for now. Anyway, I haven't gotten to the really fun part yet. The really fun part is the
#everydaylearning as Yury would say. In the process of analyzing the orange glob, I ended up investigating whether it might be related to a new feature called Adaptive Log File Sync. I'd never heard of this before so I had quite a bit of fun learning, and it seemed worthwhile to share a few things I found. Before I dive in, one other important thing: I have to mention
Christo Kutrovsky. We were both on-site and digging into this feature at the same time. There was a constant exchange of competing theories and different ideas about how things might work. Together we assembled a strong, in-depth, and working theory very, very quickly - I love working on a
great team.
Log File Sync
First off, there are a few basic starting points for anyone troubleshooting time spent in the log file sync event.
In my specific case, I discovered quickly from the ASH data that this was a very unusual situation. This led to some more creative searching in the Oracle Support KnowledgeBase - at which point I discovered something very interesting:
Adaptive Log File Sync
Note 1462942.1 describes a feature whereby LGWR can switch between log write methods. This feature is enabled through an underscore parameter called
_use_adaptive_log_file_sync and the description of this parameter is:
adaptively switch between post/wait and polling. Now I'll be honest: searching on the internet and searching in Oracle Support's KnowledgeBase for "adaptive log file sync" yields almost nothing. I do believe in the principle of BAAG - but this is a case where some guesswork might be useful, especially to guide experimentation that could nail down more concrete answers. Hence the disclaimer at the top of this article. The words "post" and "wait" indicate that we're talking about semaphores. For some general background on semaphores, check out
the Wikipedia article and for more detail about the unix post() and wait() calls, a good resource is
chapter 30 from Andrea Arpaci-Dusseau's (UW Madison) textbook on Operating Systems. Oracle uses semaphores extensively. In fact, LGWR and commits are specifically mentioned as
an example in the Performance Tuning Guide - and Oracle even has an API to replace semaphore use with a third-party driver for lightweight post-wait implementations. (Like an ODM for your OS kernel.) If you look for references to "commit" in the Oracle docs, you'll find the word "post" everywhere when they talk about communication between the foreground processes and LGWR. Now, remember that a COMMIT has two options: first, IMMEDIATE or BATCH and second, WAIT or NOWAIT. It looks like this to me:
- Immediate: FG process will post to LGWR, triggering I/O (default)
- Batch: FG process will not post LGWR
- Wait: LGWR will post FG process when I/O is complete (default)
- NoWait: LGWR will not post FG process when I/O is complete
Why change a good thing? My theory is that it ties back to a rare problem
Riyaj mentioned in that blog post over 4 years ago. I'll quote him:
"LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that there is no starvation for CPU or memory, and that I/O performance is decent enough. Still, if there are excessive commits, LGWR has to perform many writes/semctl calls, and this can increase ‘log file sync’ waits. This can also result in sharp increase in redo wastage statistics."
Maybe Oracle is reading Riyaj's blog? It appears that they came up with a new algorithm where LGWR
doesn't post. My guess: foreground processes can probably still post LGWR - but LGWR never posts back. Instead, foreground processes in WAIT mode "poll" either a memory structure or LGWR itself. It could be an in-house implementation by Oracle, it could still use semaphores, or it could use message queues somehow [seems like a long shot but
the unix poll() call is found there]. There's one interesting challenge that I can think of in implementing this. With the semaphore approach, all the active commits are *sleeping* (off the CPU) while LGWR flushes the log buffer. There could be dozens or hundreds of foreground processes simultaneously commiting on a very busy system. If we switch to a polling method, how do we ensure that these hundreds of processes don't start spinning and steal CPU from LGWR, making the whole system even worse than it was in the beginning? The answer might lie in a quick search for "adaptive_log_file_sync" from the underscore parameters. There are five more hidden parameters with that string:
select a.ksppinm name, b.ksppstvl value, a.ksppdesc description
from sys.x$ksppi a, sys.x$ksppcv b
where a.indx = b.indx and a.ksppinm like '\_%adaptive\_log%' escape '\'
order by name
Name |
Default |
Description |
_adaptive_log_file_sync_high_switch_freq_threshold |
3 |
Threshold for frequent log file sync mode switches (per minute) |
_adaptive_log_file_sync_poll_aggressiveness |
0 |
Polling interval selection bias (conservative=0, aggressive=100) |
_adaptive_log_file_sync_sched_delay_window |
60 |
Window (in seconds) for measuring average scheduling delay |
_adaptive_log_file_sync_use_polling_threshold |
200 |
Ratio of redo synch time to expected poll time as a percentage |
_adaptive_log_file_sync_use_postwait_threshold |
50 |
Percentage of foreground load from when post/wait was last used |
It appears that there are a number of knobs to turn with this new algorithm - and it looks like Oracle is somehow dynamically calculating a "polling interval". Furthermore, it seems to be taking "scheduling delay" into account. It also seems that by default, we're limited to switching modes every 20 seconds (3 per minute). This switching seems to be controlled by "thresholds" - and the threshold for enabling the new polling mode seems to be based on time. From the default percentage, it looks to me like Oracle won't switch until it thinks the poll time will be less than half of the current post time. For switching back it also seems to be another "half" percentage (50), though I'm not sure what "foreground load" might mean. Now there's one other underscore parameter which I noticed while looking for things related to log file sync:
Name |
Default |
Description |
_fg_sync_sleep_usecs |
0 |
Log file sync via usleep |
This parameter is interesting for one reason: The adaptive sync parameters were introduced in 11g, but this parameter was introduced in 10g. And if you think about the name, it actually sounds very similar to the "polling" strategy and not like a semaphore strategy at all! Foreground processes call usleep() during a log sync - if you're calling wait() then you don't need to sleep. But if you're polling, you definitely need to sleep. Maybe Oracle has been working on this idea since 10g? I wouldn't put it past them. :) So how do you know if you're using this feature? The most obvious sign will be in the LGWR trace file. There will be messages each time that a switch happens, looking something like this:
*** 2012-10-16 01:47:50.289
kcrfw_update_adaptive_sync_mode: poll->post current_sched_delay=0 switch_sched_delay=1 current_sync_count_delta=8 switch_sync_count_delta=59
*** 2012-10-16 01:47:50.289
Log file sync switching to post/wait
Current approximate redo synch write rate is 2 per sec
*** 2012-10-16 02:51:19.285
kcrfw_update_adaptive_sync_mode: post->poll long#=51 sync#=352 sync=4600 poll=1061 rw=500 rw+=500 ack=7 min_sleep=1061
*** 2012-10-16 02:51:19.285
Log file sync switching to polling
Current scheduling delay is 1 usec
Current approximate redo synch write rate is 117 per sec
I mentioned that the underscore parameters were introduced in 11g; however, my current understanding is that
only in 11.2.0.3 did the default value of _use_adaptive_log_file_sync change to true!! Is it comforting that they waited so long, or is it scary that they'd make this change with no available documentation or troubleshooting information? I guess it depends on your perspective. Here's mine:
Explanation Please?
- This is a change of critical, core code that impacted every single database installation and upgrade.
- I think it is a strange departure from Oracle’s normal practices to change such important code with absolutely no announcement, documentation, or technical troubleshooting information even in their customer-only support database.
- In my opinion this requires explanation; they can do (and have done) much better.
Regardless, it's now been 12 months since 11.2.0.3 was released. At present, it's only mentioned in a single support article. This offers at least some evidence that adaptive log file sync hasn't caused widespread panic. For the past year, many of you have been using this feature without even knowing - and there weren't enough problems to merit even two Oracle Support notes. I sent a few emails out to friends who also tend to work on large, loaded systems. I wasn't the first person to stumble across this -
James Morle mentioned it in
a tweet a few months ago, and (not surprisingly)
Riyaj has stumbled across the feature as well. It seems to me that the issues that have been encountered intersect with RAC and/or the switching process itself. But my informal survey was hardly scientific. Overall, I think Oracle dodged the bullet this time. But they still made an important change without supporting documentation. To my friends inside the big red mothership: Please don't do that again! We like it better the other way. Have you ever heard of this new 11.2.0.3 feature? Have you heard of the underscore parameter? I'd love to hear your stories!