
Tuning Latch Contention: Cache Buffers Chains Latches

Recently, I had an opportunity to tune latch contention for cache buffers chains (CBC) latches. The problem was high CPU usage combined with poor application performance. A quick review of a 15-minute statspack report showed latch free waits as the top event, consuming approximately 3,600 seconds on an 8-CPU server. CPU usage was quite high, which is a typical symptom of latch contention due to the spinning involved. v$session_wait showed that hundreds of sessions were waiting on the latch free event.

SQL> @waits10g

   SID PID     EVENT         P1_P2_P3_TEXT
------ ------- ------------  --------------------------------------
   294  17189  latch free    address 15873156640-number 127-tries 0
   628  17187  latch free    address 15873156640-number 127-tries 0
....
   343  17191  latch free    address 15873156640-number 127-tries 0
   599  17199  latch: cache  address 17748373096-number 122-tries 0
               buffers chains
   337  17214  latch: cache  address 17748373096-number 122-tries 0
               buffers chains
.....
   695  17228  latch: cache  address 17748373096-number 122-tries 0
               buffers chains
....
   276  15153  latch: cache  address 19878655176-number 122-tries 1
               buffers chains

I will use a two-pronged approach to find the root cause scientifically. First, I’ll find the SQL suffering from latch contention and the objects referenced in its access plan. Next, I’ll find the buffers involved in latch contention and map them back to objects. Finally, I’ll match the results of these two techniques to pinpoint the root cause.

Before I go any further, let’s do a quick summary of the internals of latch operations.

A Brief Introduction to CBC Latches (and The Not-So-Brief Reason Why This is a Complicated Topic to Discuss Briefly)

Latches are internal memory structures that coordinate access to shared resources. Locks (also known as enqueues) are different from latches: the key difference is that enqueues, as the name suggests, provide a FIFO queueing mechanism, while latches do not. Latches are also held very briefly, whereas locks are usually held longer.

In the Oracle SGA, the buffer cache is the memory area into which data blocks are read. (If Automatic Shared Memory Management [ASMM] is in use, part of the shared pool can be tagged as KGH:NO ALLOC and remapped to the buffer cache area too.)

Each buffer in the buffer cache has an associated element in the buffer header array, externalized as x$bh. Buffer headers keep track of various attributes and the state of buffers in the buffer cache. This buffer header array is allocated in the shared pool. The buffer headers are chained together in a doubly-linked list and linked to a hash bucket. There are many hash buckets, and their number is derived and governed by the _db_block_hash_buckets parameter. Access to these hash chains (both to inspect and to change them) is protected by cache buffers chains latches.

Furthermore, buffer headers can be linked and delinked from hash buckets dynamically.
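Each row of x$bh exposes the address of its protecting latch in the hladdr column, so the number of buffer headers hanging off each CBC latch child can be checked with a query along these lines (my illustration, not from the original post; x$bh is visible only when connected as SYS):

-- Count buffer headers protected by each cache buffers chains latch child.
-- hladdr joins to v$latch_children.addr.
select hladdr, count(*) buffer_headers
  from x$bh
 group by hladdr
 order by buffer_headers desc;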

Here is a simplified algorithm for accessing a buffer (I have deliberately cut some detail so as not to deviate too much from our primary discussion):

  1. Hash the data block address (DBA: a combination of tablespace, file_id, and block_id) to find the hash bucket.
  2. Get the latch protecting the hash bucket.
  3. If successful, walk the hash chain, reading buffer headers to see whether a specific version of the block is already in the chain. If found, access the buffer in the buffer cache, with the protection of buffer pin/unpin actions.

    If not found, then find a free buffer in the buffer cache, unlink that buffer's header from its current chain, link that header to this hash chain, release the latch, and read the block into the free buffer with the buffer header pinned.

  4. If the latch get is not successful, spin for up to spin_count times and go to step 2. If the latch still could not be acquired by spinning, then sleep (with exponentially increasing sleep times, up to an upper bound), wake up, and go to step 2.

Obviously, latches play a crucial role in controlling access to critical resources such as the hash chain. My point is that repeated access to a few buffers can increase latch activity.

There are many CBC latch children. The parameter _db_block_hash_latches controls the number of these latches, and its default is derived from the buffer cache size. Furthermore, in Oracle 10g, shareable latches are used, and inspecting a hash chain requires acquiring the latch in shared mode, which is compatible with other shared-mode operations. Note that the defaults for these undocumented parameters are usually sufficient, and any change to them must be approved by Oracle Support.
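For reference, the current values of these hidden parameters can be inspected, as SYS, with the usual x$ksppi / x$ksppcv join; a quick sketch (for inspection only, no changes):

-- Show the values of the two hidden parameters mentioned above.
select p.ksppinm parameter, v.ksppstvl value
  from x$ksppi p, x$ksppcv v
 where p.indx = v.indx
   and p.ksppinm in ('_db_block_hash_buckets', '_db_block_hash_latches');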

Back to Our Problem . . .

Let’s revisit the problem at hand. The wait output printed above shows that this latch contention involves two types of latches: latch #127 is the simulator lru latch, and latch #122 is the cache buffers chains latch. The latch numbers can be mapped to names as follows.

select latch#, name from v$latch where latch# in (127, 122);

The problem with the simulator lru latch is simple: there is a bug involving db_cache_advice (bug 5918642). If db_cache_advice is set to ON, latch contention on simulator lru latches can be observed with large buffer caches. This issue was fixed quickly by setting db_cache_advice to OFF.
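The fix itself was a one-line change along these lines:

alter system set db_cache_advice = off;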

After resolving the “simulator lru” latch issue, I saw some improvement in performance — but not much.

Next, I queried v$session to see which SQL statement(s) were causing latch contention. The state column below indicates that these processes are not currently waiting for latches, but have waited in the past. Twenty-four sessions are executing the same SQL statement, the last wait for these sessions was a latch free event, and yes, these are active sessions. If latch contention is prevalent, querying v$session as below will show the SQL statements to focus on.

select event, sql_hash_value, state, count(*) from v$session w
where event = 'latch free' and status = 'ACTIVE'
group by sql_hash_value, state, event
SQL> /

EVENT         SQL_HASH_VALUE STATE                 COUNT(*)
------------- -------------- ------------------- ----------
latch free        3629331128 WAITED KNOWN TIME           24
latch free         673277007 WAITED KNOWN TIME            1
latch free        1378683334 WAITED SHORT TIME            1
latch free        3629331128 WAITED SHORT TIME            5
latch free        2920275581 WAITED SHORT TIME            3

We can find the SQL statement by querying v$sqltext with the hash value 3629331128 from the output above.
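A lookup along these lines pulls back the full statement text (a sketch; the exact query used is not shown in the original post):

select sql_text
  from v$sqltext
 where hash_value = 3629331128
 order by piece;

The SQL suffering from latch contention is printed below. Of course, for security reasons, I have changed the actual object names.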

select * from v1 WHERE   (
 col1  IN (
  3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58, 59, 67, 68,
  69, 76, 78, 79, 80, 81, 82, 84,85, 106, 450, 451, 452, 453,
 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465,
 466, 467, 468, 469, 470, 471, 472, 473, 474, 476, 478, 500,
  501, 502)
 OR col2  IN (3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58,
 59, 67, 68, 69, 76, 78, 79, 80, 81, 82, 84, 85, 106, 450, 451,
 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463,
  464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 476,
 478, 500, 501, 502))
 AND  UPPER(col3) LIKE :1  and rownum < 200

The explain plan for the above SQL shows that multiple tables are accessed in this view. But at this point, I don’t know which step in this plan is causing the latch contention. If you have to guess which of the following tables is causing the issue, check your guess against the correct answer later (and become a Battle Against Any Guess member immediately). (I removed a few columns from the plan output to improve readability.)
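For completeness, the plan below can be reproduced with explain plan and dbms_xplan; a sketch of the commands, with the IN-lists abbreviated:

explain plan for
select * from v1
 where (col1 in (3, 20, 21, 44, 45)   -- full IN-lists as in the statement above
    or  col2 in (3, 20, 21, 44, 45))
   and upper(col3) like :1
   and rownum < 200;

select * from table(dbms_xplan.display);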

------------------------------------------------------------------
| Id  | Operation                        | Name    |Rows  | Cost |
------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |         |17778 |   223|
|*  1 |  COUNT STOPKEY                   |         |      |      |
|   2 |   CONCATENATION                  |         |      |      |
|*  3 |    FILTER                        |         |      |      |
|*  4 |     HASH JOIN                    |         |17777 |   216|
|*  5 |      TABLE ACCESS FULL           | ORG     |    4 |     2|
|   6 |      NESTED LOOPS                |         |  197 |   213|
|   7 |       NESTED LOOPS               |         |  195 |    18|
|   8 |        TABLE ACCESS FULL         | ORG     |    1 |     2|
|*  9 |        TABLE ACCESS FULL         | ORDER   |  195 |    16|
|* 10 |       TABLE ACCESS BY INDEX ROWID| TRADE   |    1 |     1|
|* 11 |        INDEX UNIQUE SCAN         | TRADE_PK|    1 |     0|
|* 12 |    FILTER                        |         |      |      |
|  13 |     NESTED LOOPS                 |         |    1 |     7|
|  14 |      NESTED LOOPS                |         |    1 |     6|
|  15 |       NESTED LOOPS               |         |    1 |     4|
|* 16 |        TABLE ACCESS FULL         | ORG     |    1 |     2|
|* 17 |        TABLE ACCESS FULL         | ORG     |    1 |     2|
|* 18 |       TABLE ACCESS FULL          | ORDER   |    3 |     2|
|* 19 |      TABLE ACCESS BY INDEX ROWID | TRADE   |    1 |     1|
|* 20 |       INDEX UNIQUE SCAN          | TRADE_PK|    1 |     0|
------------------------------------------------------------------

Researching Further

Re-querying v$session_wait, I see that a couple of latches are hot. I will take one latch child, with latch address 19875043200, as an example and drill down further.

SID PID        EVENT       P1_P2_P3_TEXT
------ ---------- ----------- ----------------------------------------
578  17220     latch:CBC   address 19875043200-number 122-tries 0
664  17226     latch:CBC   address 19875043200-number 122-tries 0
695  17228     latch:CBC   address 19875043200-number 122-tries 0
701  23987     latch:CBC   address 19875043200-number 122-tries 0
...

Converting this latch address, 19875043200, from decimal to hexadecimal yields 4A0A51780. The addr column in v$latch_children is 16 hex characters wide, so prefix the value with zeros before querying v$latch_children to see activity against that latch child.
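The conversion and zero-padding can be done in SQL itself, for example:

-- Convert the decimal latch address to hex and left-pad to 16 characters,
-- matching the format of v$latch_children.addr (returns 00000004A0A51780).
select lpad(trim(to_char(19875043200, 'XXXXXXXXXXXXXXX')), 16, '0') addr
  from dual;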

select addr, latch#, child#, level#, gets
from v$latch_children where addr='00000004A0A51780'
SQL> /
ADDR                 LATCH#     CHILD#     LEVEL#       GETS
---------------- ---------- ---------- ---------- ----------
00000004A0A51780        122      10437          1   23672075 

SQL> /
ADDR                 LATCH#     CHILD#     LEVEL#       GETS
---------------- ---------- ---------- ---------- ----------
00000004A0A51780        122      10437          1   23672209

I repeated the above SQL almost immediately, and it showed an increase of 134 gets within a fraction of a second. This step also helps to validate the latch address. Checking the latch name confirms that this address does indeed belong to a cache buffers chains latch child.
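The name check itself is a straightforward query against v$latch_children, for example:

select addr, name, gets, misses, sleeps
  from v$latch_children
 where addr = '00000004A0A51780';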

Hang Those Buffers!

Next, I needed to find the buffers protected by this latch child, and then identify the buffers causing latch contention. Many hash buckets (and so, numerous buffers) are protected by each latch child. Fortunately, the tch (touch count) column can be used effectively to identify hot blocks: almost every access to a buffer increments the tch value in that buffer's header. The idea here is to find the buffers protected by that latch and identify those with higher touch counts. Those buffers are probable candidates for further analysis.

The x$bh table and v$latch_children can be joined to find those buffer attributes. (By the way, the following SQL can easily be rewritten to print the buffers protected by the top latch, ordered by sleeps, for example.)

select hladdr, file#, dbablk, decode(state, 1, 'cur', 3, 'CR', state) st, tch
  from x$bh
 where hladdr in
       (select addr from
          (select addr from v$latch_children
            where addr = '00000004A0A51780'
            order by sleeps desc, misses desc, immediate_misses desc)
        where rownum < 2);

HLADDR                FILE#     DBABLK ST   TCH
---------------- ---------- ---------- ---- ------
00000004A0A51780          1      52351 cur      3
00000004A0A51780         16     701009 cur     24
00000004A0A51780         16      23959 cur    182
00000004A0A51780         15      16215 cur   2855 <--
00000004A0A51780         26        693 cur      9
00000004A0A51780          9      52872 cur   2935 <--

00000004A0A51780          8      45128 cur   1831 <--
00000004A0A51780         16     635473 cur    560
00000004A0A51780         25     233403 cur     51
00000004A0A51780         25      97993 cur    110
00000004A0A51780          4      97273 cur     43
00000004A0A51780         25     268340 cur      0

12 rows selected.

There are at least three blocks with higher activity, since their tch values are much higher than the rest. A note of caution: buffer cache activity is quite dynamic, and this analysis needs to be performed during the period of latch contention. Performing this analysis a few hours after the latch contention has subsided will lead to an incorrect diagnosis.

We need to associate these three hot blocks with object names. With the following script, the file number and block number (dbablk) can be used to find the object.

accept  h_file_id  prompt  ' Enter file_id ==>'
accept  h_block_id  prompt ' Enter block_id==>'
set verify off
column owner format A10
column segment_name  format A20
column segment_type  format A10
column hdrfile    format 9999
column curfile    format 9999
column curblk     format 99999999
column hdrblock   format 99999999
select owner, segment_name, partition_name, segment_type, file_id, block_id
  from dba_extents
 where file_id = &&h_file_id
   and &&h_block_id between block_id and block_id + blocks - 1;
set verify on

For example, to find the object name for the buffer with a tch of 2855, supply 15 for the file_id and 16215 for the block_id.

I don’t favour the above script, since querying dba_extents is not exactly cheap. It is much easier to dump the blocks and read the object ids from the resulting trace files. Let’s dump these three blocks.

alter system dump datafile 15 block min 16215 block max 16215;
alter system dump datafile 9 block min 52872 block max 52872;
alter system dump datafile 8 block min 45128 block max 45128;

Reading the trace files, I see three different segments; one line per block from the trace files is printed below.

seg/obj: 0xd756  csc: 0x00.17b5fe4f  itc: 2  flg: E  typ: 1 - DATA
seg/obj: 0x1801a  csc: 0x00.1cb9f0ab  itc: 2  flg: E  typ: 1 - DATA
seg/obj: 0x181c5  csc: 0x00.1bef7a59  itc: 169  flg: E  typ: 2 - INDEX

The seg/obj field is the object_id, printed in hex. Converting the hex numbers d756, 1801a, and 181c5 to their decimal equivalents gives 55126, 98330, and 98757.
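The hex-to-decimal conversion can also be done in SQL, for example:

select to_number('d756',  'xxxxx') obj1,
       to_number('1801a', 'xxxxx') obj2,
       to_number('181c5', 'xxxxx') obj3
  from dual;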

Now I can query dba_objects to find the object_names.

select owner, object_id, object_name, data_object_id from dba_objects
where object_id in (55126, 98330,98757) or
      data_object_id in (55126, 98330,98757)
SQL> /

OWNER        OBJECT_ID OBJECT_NAME        DATA_OBJECT_ID
------------ ---------- ----------------- --------------
SOME_USER         55126 ORDER                      55126
SOME_USER         98330 GBLOCK                     98330
SOME_USER         98757 ALLOCATION_OID             98757

Comparing the explain plan with the object names printed above, we can see that the ORDER table is a common object between these two techniques.

|   7 |       NESTED LOOPS        |       |   195 | 14040 |    18 |
|   8 |        TABLE ACCESS FULL  | ORG   |     1 |    34 |     2 |
|*  9 |        TABLE ACCESS FULL  | ORDER |   195 |  7410 |    16 |

...
|  14 |      NESTED LOOPS         |       |     1 |   106 |     6 |
|  15 |       NESTED LOOPS        |       |     1 |    68 |     4 |
|* 16 |        TABLE ACCESS FULL  | ORG   |     1 |    34 |     2 |
|* 17 |        TABLE ACCESS FULL  | ORG   |     1 |    34 |     2 |
|* 18 |       TABLE ACCESS FULL   | ORDER |     3 |   114 |     2 |

 9 - filter(UPPER(UPPER("GO"."ORD_ID")) LIKE :1 AND "GO"."IS_MOD"='F')
 18 - filter(UPPER(UPPER("GO"."ORD_ID")) LIKE :1 AND "GO"."IS_MOD"='F')

A Quick Summary

Let me summarize what I’ve done so far.

  1. I found one latch child's address, located all the buffers protected by that latch, identified the buffers with high tch values, and queried to find the object names for those buffers.
  2. Through v$session, I found the SQL hash value, located the SQL suffering from latch contention, and generated its explain plan.

From these two different techniques, I can find the objects common to both steps 1 and 2; those objects are the probable candidates to focus on. I see that the ORDER table is common to both techniques. From the plan above, the ORDER table is accessed in a tight nested-loops join. This increases buffer access on the ORDER table, which in turn results in higher latching activity.

SQL Tuning: That Was Easy

From here on, the solution is straightforward: I need to avoid the tight nested-loops join. Specifically, if the inner table of a nested-loops join is accessed with a full table scan, latching activity can increase sharply, because the inner row source is scanned once for every row from the outer row source. A hash join might be a better access method: in a hash join, each table is scanned once and hashed, reducing latching activity.

In this specific case, ORDER is a small table. Further analysis revealed that the CBO chose nested-loops joins because the rownum predicate triggers the first_rows optimizer mode. As a test, let’s remove the rownum clause and see what plan we get.

explain plan for
select * from v1 WHERE   (
 col1  IN (
  3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58, 59, 67,
  68, 69, 76, 78, 79, 80, 81, 82, 84, 85, 106, 450, 451,
  452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462,
  463, 464, 465,  466, 467, 468, 469, 470, 471, 472, 473,
  474, 476, 478, 500, 501, 502)  OR
 col2  IN (3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58,
  59, 67, 68, 69, 76, 78, 79, 80,  81, 82, 84, 85, 106, 450,
 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463,
  464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 476, 478, 500, 501, 502))
 AND  UPPER(col3) LIKE :1  -- and rownum < 200
/

select * from table(dbms_xplan.display);

---------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |  4902 |   818K|  4014|
|*  1 |  HASH JOIN           |          |  4902 |   818K|  4014|
|   2 |   TABLE ACCESS FULL  | ORG      |   370 | 12950 |     5|
|*  3 |   HASH JOIN          |          | 17513 |  2325K|  4008|
|   4 |    TABLE ACCESS FULL | ORG      |   370 | 12950 |     5|
|*  5 |    HASH JOIN         |          | 17479 |  1724K|  4002|
|*  6 |     TABLE ACCESS FULL| ORDER    | 17331 |   643K|  1320|
|   7 |     TABLE ACCESS FULL| TRADE    |   353K|    21M|  2680|
---------------------------------------------------------------

After commenting out the rownum clause, the CBO chose hash joins. Comparing execution times between these two versions, running both SQL statements concurrently, the original query consumed almost 70 CPU seconds per execution, while the query with the rownum clause commented out consumed just 0.5 CPU seconds. Essentially, the all_rows optimizer mode should be used for this SQL, even though the rownum predicate is present. Fortunately, this query accesses a view (and only similar queries access that view), so adding an all_rows hint to that view resolved the latch contention.
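In practice, that means recreating the view with the hint embedded in its defining query. A minimal sketch with hypothetical column and table names, since the actual view text is not reproduced in this post:

-- Hypothetical illustration: only the hint placement matters here.
create or replace view v1 as
select /*+ all_rows */ t.col1, t.col2, t.col3
  from some_user.some_base_table t;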

To recap, we used a two-pronged approach: find the objects causing latch contention, and match those objects against the execution plans of the SQL statements suffering from latch contention; then, resolve the issue with a minor change to the view. (The environment: Oracle version 10.2.0.3 on a Linux server.)

Of course, this post can also be read as a PDF file: resolving-latch-contention-cbc.
