Pythian Blog: Technical Track

Oracle RAC Cache Fusion Efficiency: A Buffer Cache Analysis

My apologies — there will be no broccoli in this post.

Many people would like to know how well their application will run in RAC. Would it be faster or slower? Would it run at all?

Well, I have a query that can answer that question. There’s a caveat however. You have to first put your application in RAC, then the query can tell you how well it runs.

Here is a bit of theory: The whole idea behind RAC is centuries old, and comes from big empires in the past: divide and conquer.

Read more after the jump, a buffer cache status query and the logic behind it.

Split your workload in pieces and have multiple machines do it. Sounds easy, but when you have multiple workers on the same project, some workers will run into each other. Add to this the consistency requirements and running into each other becomes an even greater problem.

For example, when you build a skyscraper, you have many people working on different floors at the same time. This is efficient. But when they need to look at the master plan, they may interfere with each other as there’s only 1 copy of the master plan, and only so many people may be around it at one time to read it.

Enough theory. Basically, the more evenly your work is split, the more efficient RAC is. To rephrase this: the fewer blocks you share on your nodes, the better RAC will run.

I recently upgraded my buffer cache status query to be RAC-aware. It used to be a select and join from v$bh and dba_objects, and a few “group by’s” to make it more readable. It basically showed which objects had cached data and how much, with some extra details of how much of that data was dirty (needs to be flushed to disk).

To upgrade the query to RAC, I joined gv$bh so that I could see the other nodes caches. Then I decided to extend it a bit more, and include CR (consistent read, read at a point of time) and ASSM space management block details. Then my numbers stopped adding up as some blocks were cached on both nodes, others were not.

Then the real RAC-awareness came. I added the possibility of differentiating between a block being cached on one node, or two nodes. I came up with the concept of a shared and non-shared cached block.

This is the output of the Oracle buffer cache status with cache fusion details:

OBJECT_NAME TYPE SN D% DIRTY MBYTES SHA sha% % PI CR CR_SHA ASSM cfe2%
4800 7417.4 2311 31.2 100 3.6 188.6 138.4 17.6 37.7
SALEFACT TABLE 20 1150.9 329.6 28.6 15.5 0 1.8 1.4 0 42.7
SALE TABLE 45 567.2 231.6 40.8 7.6 0.1 12.5 9.7 0 18.3
SALFT_SALE_IDX INDEX 23 550.5 150.7 27.4 7.4 0 9.6 6.4 0.1 45.2
DELIVERYINFO TABLE 25 478.5 138.3 28.9 6.5 0 1.9 1.4 0 42.2
PAYMENTINFO TABLE 22 442 136 30.8 6 1.5 1 0 38.4
PK_DELIVERYINFO INDEX 3 239.4 107.3 44.8 3.2 0 0.3 0.3 0 10.3
PK_PAYMENTINFO INDEX 21 238.4 106.4 44.6 3.2 0 0.2 0.2 0 10.7
PK_SALE INDEX 21 197.5 91.9 46.5 2.7 0 0.2 0.2 0 7
SALE_IDCUSTSTATUS INDEX 20 177.5 0.5 0.3 2.4 0 1.5 1 0 99.5
SPECIAL_PARTIC_IDX INDEX 12 151.8 69.4 45.7 2 0 3.3 2.7 0 8.6
CASEACTION_ID_TIME INDEX 5 146.8 72.7 49.5 2 0 1.2 1 0.9
<<<ROLLBACK>>> 7.03 1312 145.8 0.5 0.4 2 0.1 3.3 2.6 99.3
SPECIAL_PART TABLE 27 138.5 66.7 48.2 1.9 0 4.8 4.4 0 3.7
SL_DATESIDS_IDX INDEX 15 128.1 2.1 1.7 1.7 0.1 1.3 1 0.1 96.6
CASE TABLE 7 122.5 15.8 12.9 1.7 0 2.1 1.6 0.1 74.2
CUSTOMERINFO TABLE 3 111 55.1 49.6 1.5 0.8 0.8 0.1 0.8
ADDRESS TABLE 4 91.8 12.1 13.2 1.2 1.1 1 0.1 73.7

Here’s a legend:

Note that the very first line has no owner/object at all. It’s a summary of your ENTIRE cache on ALL nodes. I have also edited the table content so that it fits our column.

Note too that the RAC awareness supports only 2 nodes at this point.

SUBOBJECT_NAME – in case of partitioned objects, this will be the partition name. If empty, either non-partitioned object or summarized data for a partitioned object. (N.B.: to make the table narrower, I renamed this to SN.)
D% – percent of the cache (for this object) that needs to be flushed to disk. Empty if under 1% – I added this for a clearer report.
DIRTY – number of dirty buffers (if the object is checkpointed, that many writes will be needed).
MBYTES – megabytes of cache this object is occupying (everything).
SHA – number of blocks that are cached (shared) on both nodes.
SHA% – percent shared for the object.
% – percent of cache this object is using (everything).
PI – number of blocks that are representing a “past image” – see RAC concepts – blocks that were dirty on the current node, and were requested in exclusive mode (for modifications) on the another node.
CR – number of blocks that are a CR (consistent read). A CR is usually a read in the past. When you run a query and an update is issued after your query started, you will need to create CR blocks, by replicating CUR (current) blocks and applying UNDO records to them.
CR_SHA – same as above, but shared on both nodes.
ASSM – number of blocks cached (for the object) that are used to manage intra-segment object space. Basically, space management “overhead”.
And finally cfe% – cache fusion Efficiency (assuming 2 nodes). What does this number show you? It shows you how much of your object(s) you are accessing uniquely from one node. The higher the number, the more unique your access is.

Here are some examples to help you better understand it.

  • Your lookup tables (detail tables) will have a low cfe. The details are needed on both nodes for display purposes.
  • Your transaction tables should have a high cfe. You should be sharing less of that data.
  • If a table has blocks that are constantly jumping from node to node due to concurrent updates (worst case for RAC), you would have a low cfe and a high number of PI blocks.

And here is the query. It’s a beast, but it runs relatively fast, and provides a lot of data. I usually try to fit as few lines as possible, as I want to concentrate on the results. This gives you a top 100, which I find more the sufficient in most case:

/* RAC Buffer cache (db cache + cache fusion) contents status Christo Kutrovsky - The Pythian Group */
select *
from   (select *
   from   (select distinct o.owner, o.object_name, o.object_type as type, SUBOBJECT_NAME,
          round(case
             when sum(d_cnt) / sum(tot) * 100 >= 1 then
            sum(d_cnt) / sum(tot) * 100
           end,
           2) as "D%", sum(d_cnt) as dirty, round(sum(tot) * p.bs / 1024, 1) as mbytes,
          round(sum(cur_sha) * p.bs / 1024, 1) as sha,
          round(sum(cur_sha) / sum(tot) * 100, 1) as "sha%", round(sum(r) * 100, 1) as "%",
          round(sum(pi) * p.bs / 1024, 1) as pi, round(sum(cr) * p.bs / 1024, 1) as cr,
          round((sum(cr_sha)) * p.bs / 1024, 1) as cr_sha,
          round((sum(assm)) * p.bs / 1024, 1) as assm,
          round((1 - sum(cur_sha) / sum(tot) * 2) * 100, 1) as "cfe2%",
          sum(cur_x) as x
      from   (select to_number(decode(temp, 'Y', 9, decode(status, 'free', 0, objd))) as objd, temp,
          count(nullif(dirty, 'N')) as d_cnt, sum(pi) as pi, sum(cr) as cr, round(avg(cr), 1) as cr_i,
          sum(cr_min_inst / nullif(i, 1)) as cr_sha, sum(assm / i) as assm,
          sum(assm) - sum(assm / i) as assm_sha, sum(xcur) as cur_x,
          sum(scur / nullif(i, 1)) as cur_sha, sum(tot) as tot, sum(r) as r
         from   (select inst_id, file#, block#, temp, dirty, status, objd, class#, count(*) as tot,
             decode(status, 'cr', count(*)) as cr,
             case
              when status in ('scur') then
               count(*)
            end as sha1, decode(status, 'pi', count(*)) as pi,
             count(distinct inst_id) over(partition by class#, file#, block#) as i,
             sum(decode(status, 'cr', count(*))) over(partition by inst_id, file#, block#) as cr_min_inst,
             decode(status, 'xcur', count(*)) as xcur, decode(status, 'scur', count(*)) as scur,
             case
              when class# in (8, 9, 10) then
               count(*)
            end as assm, ratio_to_report(count(*)) over() as r
          from   gv$bh
          group  by inst_id, file#, block#, status, temp, dirty, objd, class#)
         group  by decode(status, 'free', 0, objd), temp) h,
       (select owner, object_name, subobject_name, object_id, data_object_id, object_type,
          row_number() over(partition by data_object_id order by object_type) rn, 'N' as temp
         from   dba_objects
         where  data_object_id > 0
         union all
         select ' ', '<<<FREE BLOCKS>>>', null, null, 0, null, 1, 'N'
         from   dual
         union all
         select ' ', '<<<ROLLBACK>>>', null, /*to_char(rownum)*/ null, 4294967296 - rownum, '', 1, 'N'
         from   dual
         connect by dummy = dummy
           and  rownum < 100
         union all
         select ' ', '<<<TEMP SEGMENT>>>', null, null, 9, null, 1, 'Y' as temp
         from   dual) o,
       (select value / 1024 as bs
         from   v$parameter
         where  name = 'db_block_size') p
      where  o.data_object_id = h.objd
      and  o.rn = 1
      and  o.temp = h.temp
      --and o.owner not in ('SYS','SYSTEM')
      group  by p.bs, rollup((o.owner, o.object_name, o.object_type), (SUBOBJECT_NAME)))
   order  by mbytes desc)
where  rownum <= 100

No Comments Yet

Let us know what you think

Subscribe by email