Select Statement Generating Redo and Other Mysteries of Exadata

Like many good stories, this one also started with an innocent question from a customer:

“I often check “SELECT COUNT(*) FROM TABLE;” on our old 10.2.0.4 cluster and the new Exadata.
The Exadata is slower than our old cluster by few minutes.
Exadata is not good at count(*) ? or other problem?”

To which I replied:
“Exadata should be AMAZING at count(*)…
so, we definitely need to check why it took so long…
which query?”

Exadata should do count(*) either from memory or by using smart scans, and since my friend is counting fairly large tables that won’t all fit into memory, Exadata should be significantly faster than their old system.

So, I started by trying to reproduce the issue. The customer sent me his slow query, and I ran it. The same query that took him 3 minutes to run took me 2 seconds. Clearly we were not doing the same thing, and disk access became our main suspect.

To see what his executions indicated, I asked the customer to run the same script with “SET AUTOTRACE ON” added at the top, and to send me both the script and the results.
“SET AUTOTRACE ON” is far from perfect, but it is an easy way to get an overview of the issue without annoying the customer into non-compliance.
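The kind of script I had in mind is nothing fancy; something like the sketch below (the table name is made up for illustration):

  set timing on
  set autotrace on

  select count(*) from big_partitioned_table;

  exit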

Here’s the statistics section for one of the queries:

         0  recursive calls
         0  db block gets
     54903  consistent gets
     54899  physical reads
   2855188  redo size
       552  bytes sent via SQL*Net to client
       520  bytes received via SQL*Net from client
         2  SQL*Net roundtrips to/from client
         0  sorts (memory)
         0  sorts (disk)
         1  rows processed

There was a lot of disk access, but I was distracted by the huge redo size. Why was a select statement generating so much redo?

My best guess was delayed block cleanout. A quick Twitter survey added another vote for cleanouts and a suggestion to check for audits. Checking for audits was easy (see the quick check below): we had no auditing on selects. But how do I confirm whether or not we were doing block cleanouts?
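For the record, ruling out auditing is just a couple of queries against the audit configuration views; roughly like this, assuming classic auditing rather than fine-grained auditing:

  -- statement-level audit options, e.g. AUDIT SELECT TABLE
  select * from dba_stmt_audit_opts;

  -- object-level audit options: is any table audited on SELECT?
  select owner, object_name, sel
  from   dba_obj_audit_opts
  where  sel <> '-/-';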

The tweeps suggested: “First trace it. Then you have concrete evidence.” Trace is often the performance tuning tool of choice, since it shows very detailed information about the queries that run and their wait events. But trace is not the perfect tool for everything: “delayed block cleanout” is neither a wait event nor a query, so it will not show up in a trace file.
Oracle does keep track of cleanouts, in the session statistics (the various cleanout counters exposed through V$SESSTAT), but V$SESSTAT is cumulative, so I would have to take a snapshot before and after each query and compare. If only there were a tool that made this easier…
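The manual approach would look roughly like the following sketch: capture the relevant counters for the session before and after the query, then compare the two sets of values by hand (illustrative only, not the exact script I used):

  -- run once before and once after the query, then diff the values
  select sn.name, ss.value
  from   v$sesstat  ss
  join   v$statname sn on sn.statistic# = ss.statistic#
  where  ss.sid = sys_context('userenv', 'sid')
  and    (sn.name like '%cleanout%' or sn.name like 'redo size%');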

Snapper to the rescue!

I ran Snapper for snapshots every 5 seconds during the time the scripts ran. Why not use Snapper’s multi-snapshot ability? Because it only displays the information after the run, and my ADHD mind wanted constant feedback like I got from top and vmstat.

 sqlplus / as sysdba @snapper ash,stats 5 1 2021
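For those who have not used Snapper before: the arguments ask it to sample both ASH-style activity and session statistics (ash,stats), take a single 5-second snapshot (5 1), and watch SID 2021, the session running the script. I simply kept re-running it while the script executed.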

And I soon caught the following evidence:

-------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,     HDELTA, HDELTA/SEC,    %TIME, GRAPH
-------------------------------------------------------------------------------------------------------------------------------------
   2021, SYS       , STAT, physical read requests optimized                          ,      8.53k,      1.71k,
   2021, SYS       , STAT, physical read total bytes optimized                       ,     69.91M,     13.98M,
.....
   2021, SYS       , STAT, redo entries                                              ,      8.53k,      1.71k,
   2021, SYS       , STAT, redo size                                                 ,     443.8k,     88.76k,
   2021, SYS       , STAT, redo entries for lost write detection                     ,      8.53k,      1.71k,
   2021, SYS       , STAT, redo size for lost write detection                        ,    443.61k,     88.72k,
   2021, SYS       , TIME, DB CPU                                                    ,    658.9ms,   131.78ms,    13.2%, |@@        |
   2021, SYS       , TIME, sql execute elapsed time                                  ,      4.01s,   801.59ms,    80.2%, |@@@@@@@@  |
   2021, SYS       , TIME, DB time                                                   ,      4.01s,   801.59ms,    80.2%, |@@@@@@@@  |
   2021, SYS       , WAIT, cell single block physical read                           ,      4.22s,   843.51ms,    84.4%, |@@@@@@@@@ |

Leaving out a lot of boring stuff, we can immediately see that we do not have any delayed block cleanout. Instead we have redo…for lost write detection.
What is this lost write detection? A quick Google query led me to a parameter that was added in version 10g: DB_LOST_WRITE_PROTECT. The documentation clearly said:
“When the parameter is set to TYPICAL on the primary database, the instance logs buffer cache reads for read-write tablespaces in the redo log, which is necessary for detection of lost writes.”
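Checking and changing the setting is a one-liner each way; something along these lines (the parameter is dynamic, and I am assuming an spfile, but as always, test before touching a production system):

  -- current setting
  show parameter db_lost_write_protect

  -- stop logging buffer cache reads into the redo stream
  alter system set db_lost_write_protect = none scope = both;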

In the old cluster, the parameter was unset and defaulted to NONE. In Exadata X2-2, the default is TYPICAL. Maybe Oracle knows something that we don’t about write safety on Exadata? Regardless, when we set the parameter to NONE, the results were very different:

         0  recursive calls
         0  db block gets
     12153  consistent gets
     10808  physical reads
         0  redo size
       552  bytes sent via SQL*Net to client
       520  bytes received via SQL*Net from client
         2  SQL*Net roundtrips to/from client
         0  sorts (memory)
         0  sorts (disk)
         1  rows processed

We won! Wait…did we?

Understanding and getting rid of the mysterious redo is great. But in follow-up tests, we saw that every time the query had to touch disk, Exadata was still significantly slower than the cranky old SAN we used in the old cluster.

Let’s go back to Snapper.

Snapper output showed that 84.4% of the time was spent on “cell single block physical read”, and the statistics showed that these reads were all “optimized”, meaning served from Exadata’s flash cache. It’s nice that the reads are fast, but why were we using “cell single block physical read” in the first place?

I turned to the execution plans, incidentally the one piece of information needed for this kind of troubleshooting that Snapper doesn’t show you.
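There are several ways to get at the plan; I usually pull it straight from the cursor cache with DBMS_XPLAN (the &sql_id below is a placeholder for the SQL_ID of the count query):

  select * from table(dbms_xplan.display_cursor('&sql_id', null, 'TYPICAL'));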

Here’s the plan in the old cluster:

----------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name                    | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                         |     1 |     5   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |                         |     1 |            |          |       |       |
|   2 |   PARTITION LIST ALL   |                         |  1562 |     5   (0)| 00:00:01 |     1 |    12 |
|   3 |    INDEX FAST FULL SCAN| IDX_XXXX                |  1562 |     5   (0)| 00:00:01 |     1 |    12 |
----------------------------------------------------------------------------------------------------------

And in the new:

--------------------------------------------------------------------------------------------------
| Id  | Operation           | Name               | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                    |     1 |     0   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE     |                    |     1 |            |          |       |       |
|   2 |   PARTITION LIST ALL|                    |  1562 |     0   (0)| 00:00:01 |     1 |    12 |
|   3 |    INDEX FULL SCAN  | IDX_XXX            |  1562 |     0   (0)| 00:00:01 |     1 |    12 |
--------------------------------------------------------------------------------------------------

Do you see the difference? Are you surprised to learn that Exadata will apply smart scans on INDEX FAST FULL SCAN, but not on INDEX FULL SCAN?

Why did Oracle choose INDEX FULL SCAN instead of FAST FULL SCAN?

There could be many reasons: Maybe the index is not unique or allows nulls. Or maybe our optimizer parameters make single block access too attractive. Or maybe we have bad statistics.

Since it’s a short list, I eliminated the options one by one. A primary key exists, and the optimizer parameters were all at their defaults. Table statistics were fine, so I was almost fooled. But the index statistics showed zero rows!
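The giveaway was in DBA_INDEXES, and the fix was a plain statistics gather; roughly like this (the owner and table names are placeholders):

  -- NUM_ROWS = 0 and a stale LAST_ANALYZED gave it away
  select index_name, num_rows, last_analyzed
  from   dba_indexes
  where  owner = 'APP_OWNER' and table_name = 'BIG_PARTITIONED_TABLE';

  -- gather table statistics, cascading to the indexes
  exec dbms_stats.gather_table_stats(ownname => 'APP_OWNER', tabname => 'BIG_PARTITIONED_TABLE', cascade => true)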

After gathering statistics, things looked better:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                    | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                         |     1 |     5   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                |                         |     1 |            |          |       |       |
|   2 |   PARTITION LIST ALL           |                         |  1562 |     5   (0)| 00:00:01 |     1 |    12 |
|   3 |    INDEX STORAGE FAST FULL SCAN| IDX_XXX                 |  1562 |     5   (0)| 00:00:01 |     1 |    12 |
------------------------------------------------------------------------------------------------------------------

The customer confirmed: “Shocking speed lol”

Lessons Learned:

  1. Avoid premature guesses. You learn much more by gathering information and analysing the facts.
  2. Snapper shows both wait events and session statistics, allowing deeper troubleshooting than just tracing.
  3. Snapper is also easy to use, making it my preferred performance troubleshooting tool.
  4. Snapper would be perfect if it also retrieved query plans.
  5. 90% of performance problems come down to bad statistics, even if the symptoms look more exotic at first.
  6. Oracle keeps adding new behaviors, and no one can know when something new will show up. Thank you for keeping my job exciting.
