Pythian Blog: Technical Track

Statspack vs. AWR: Wrong number of SQL Executions

This post is about a Statspack anomaly I discovered while analyzing execution trends of a particular SQL on one of our clients’ databases. I decided to share it, since its appearance can be misleading. The database in question was an Oracle EE version 11.2.0.3. However, as Statspack is no longer maintained, the same problem can be reproduced on 19c. In this post, I’ll also verify whether the same problem shows up in AWR (spoiler: it doesn’t).

Problem description

Mining the Statspack repository for a certain SQL_ID using this simple script revealed a sharp increase in its number of executions a SQL which is usually executed once or twice per snapshot and has apparently been executed 409 times:

 

 SNAP_ID EXECUTIONS ELAPSED_TIME_SEC_TOTAL DISK_READS_PER_EXEC BUFFER_GETS_PER_EXEC ROWS_PROCESSED_PER_EXEC
-------- ---------- ---------------------- ------------------- -------------------- -----------------------
   31451          2                1126.48             1767896              1768738                10904478
   31452          2                1579.11           1760511.5            1761343.5                10904478
   31453          1                 498.76             1767896              1758346                10904478
   31455        409               481113.8          1752410.67            2256430.3              10886122.5

A proper Statspack report reported the same numbers. Cross-checking with the application side, there were indications that this couldn’t be true.

Looking closer into the suspiciously high execution count

As Statspack’s code is available in $ORACLE_HOME/rdbms/admin, we can easily verify how snapshots are created (spcpkg.sql) and also how reports are generated (sprepins.sql).

Snapshot creation

The PL/SQL procedure “SNAP_SQL” located in STATSPACK’s package body contains the INSERT statement which captures SQL execution statistics. It keeps only the relevant parts for this investigation:

 

 insert into stats$sql_summary
 select 
        l_snap_id
      , p_dbid
      , p_instance_number
      , max(substrb(sql_text,1,31)) text_subset
      , max(sql_id)                 sql_id
        .
        .
      , sum(executions)             executions
      , sum(px_servers_executions)  px_servers_executions
        .
        .
 from v$sql sql
where is_obsolete = 'N'
  and sql_id in (select sql_id 
                   from stats$v$sqlstats_summary sqlstats
                  where (   buffer_gets   > l_buffer_gets_th 
                         or disk_reads    > l_disk_reads_th
                         or parse_calls   > l_parse_calls_th
                         or executions    > l_executions_th
                         or sharable_mem  > l_sharable_mem_th
                         or version_count > l_version_count_th
                        )
                )
group by old_hash_value, address;

Notice that the SQL aggregates and stores execution statistics for a particular by “old_hash_value, address”:

  • “old_hash_value” is the same for all child cursors of a particular SQL (similar as SQL_ID) – Tanel Poder explains it in details in this post https://tanelpoder.com/2009/02/22/sql_id-is-just-a-fancy-representation-of-hash-value/ .
  • Same goes for “address”, as it represents the address of the handle to the parent for this cursor.

Since the “group by” doesn’t also include the CHILD_ADDRESS column, the  SQL execution statistics are aggregated across all of the child cursors. It’s also worth noting that the values in stats$sql_summary are cumulative. This means we have to calculate the difference between two snapshots to obtain the actual delta for a particular execution statistic.

Report generation

This brings us to the code responsible for calculating the number of executions between two snapshots in a Statspack report, available in $ORACLE_HOME/rdbms/admin/sprepins.sql. Here again, is only the relevant part:

 

.
.
select 
                old_hash_value
              , text_subset
              , sum(case
                    when snap_id = &&begin_snap and prev_snap_id = -1 
                    then 0
                    else
                         case when (address != prev_address)
                                or (executions < prev_executions)
                              then executions
                              else executions - prev_executions
                         end
                    end)                   delta_executions
                .
                .
              )   delta_rows_processed
          from (select snap_id
                     , old_hash_value
                     , text_subset
                       .
                       .
                     , executions
                     ,(lag(executions, 1, 0) 
                       over (partition by old_hash_value 
                                        , dbid
                                        , instance_number
                             order by snap_id))   prev_executions
                )   prev_rows_processed
                from stats$sql_summary s
               where s.snap_id between &&begin_snap and &&end_snap
                 and s.dbid            = &&dbid
                 and s.instance_number = &&inst_num
               )
        group by old_hash_value
               , text_subset
               , module

 

The relevant part for us is the logic in the ELSE part of the CASE statement. It returns the “executions” value stored in stats$sql_summary when:

  1. address != prev_address“: this means that a new parent cursor was created for a specific SQL between the analyzed snapshots and there’s no delta to be computed. This could be due to the previous parent being flushed out of the library cache for various reasons (e.g. shared pool pressure, explicit purge of the cursor, cursor obsoleting) or an instance restart. The instance restart case is actually not relevant here, as the Statspack report generation script checks whether both snapshots have the same instance startup time.
  2. (executions < prev_executions)“: if the number of executions in the final snapshot is lower than in the previous one, a similar event as described above could have happened to the cursora child cursor could be flushed out and reloaded. This isn’t a problem with one child cursor. However, we’ll see what can, and has, happened when there are more.

In all other cases the “executions – prev_executions” figure is reported, which means we simply get a delta between two snapshots. This is the most common scenario.

The above logic works until there’s a scenario like this:

  • There’s more than one child cursor for a SQL_ID.
  • One (or more), but not all of the child cursors get flushed out the library cache. If there are no new executions of the observed SQL, this effectively reduces the number of executions summed over all of the children.

And this is exactly what happened on the database I was analyzing. Checking the contents of stats$sql_summary for this particular SQL, we can see that SNAP_ID=31454 reports 410 executions, however the next snapshot reports 409 executions:

SQL> select snap_id, executions from stats$sql_summary where sql_id = '1v5wcft51pmy1' order by snap_id;

   SNAP_ID EXECUTIONS
---------- ----------
     31451        407
     31452        409
     31453        410
     31454        410 <--
     31455        409 <-- 
     31456        409
     31457        410

 

Knowing how Statspack collects and reports data from stats$sql_summary, I checked the status in v$sql and there were various child cursors for this particular SQL:

SQL> SELECT child_number, old_hash_value, address, child_address, first_load_time, last_load_time, last_active_time, executions, invalidations, parse_calls
       FROM v$sql
      WHERE sql_id = '1v5wcft51pmy1'
   ORDER BY 1;


CHILD_NUMBER OLD_HASH_VALUE ADDRESS          CHILD_ADDRESS    FIRST_LOAD_TIME     LAST_LOAD_TIME      LAST_ACTIVE_TIME     EXECUTIONS INVALIDATIONS PARSE_CALLS
------------ -------------- ---------------- ---------------- ------------------- ------------------- -------------------- ---------- ------------- -----------
           0     1140771677 000000D13F779218 000000D139974CF8 2021-05-20/23:11:33 2021-09-01/07:21:49 01.sep.2021 21:13:29         29             0         339
           1     1140771677 000000D13F779218 000000D139E94B40 2021-05-20/23:11:33 2021-09-01/21:41:37 02.sep.2021 08:03:50          7             0          85
           4     1140771677 000000D13F779218 000000D139E0EC88 2021-05-20/23:11:33 2021-08-26/22:11:38 27.aug.2021 09:43:43         10             0         123
           7     1140771677 000000D13F779218 000000D139C1B398 2021-05-20/23:11:33 2021-08-28/07:26:53 28.aug.2021 19:43:24         26             0         304
          10     1140771677 000000D13F779218 000000D13FAA3E40 2021-05-20/23:11:33 2021-08-29/21:41:34 30.aug.2021 07:25:36          6             0          69
          11     1140771677 000000D13F779218 000000D13838C840 2021-05-20/23:11:33 2021-08-30/08:38:21 30.aug.2021 21:44:38         29             0         317
          12     1140771677 000000D13F779218 000000D137BDC5E0 2021-05-20/23:11:33 2021-08-31/08:28:23 31.aug.2021 21:13:29         29             0         338

7 rows selected.

 

My theory about why the execution count decreased from 410 to 409 in stats$sql_summary is that one or more of the child cursors was purged from the library cache between snapshots 31454 and 31455. When taking snapshot 31455, the “sum(executions)” over all child cursors returned 409 and that was stored in stats$sql_summary. A further indication is that the values in the CHILD_NUMBER column have some missing values, meaning child cursors were purged from the library cache. When generating the Statspack report, the “(executions < prev_executions)” condition became true, so the 409 executions were reported.

Note: The values from v$sql displayed above were not taken immediately after SNAP_ID=31455 was taken, so the “sum(executions)” can’t be directly compared. The important takeaway here is the number of child cursors and the missing values in the child_number column.

Reproducing the problem

The main motivation for reproducing the problem is to verify the above theory and perhaps even more importantly, check how AWR behaves in such situations.

The plan to reproduce the issue was relatively simple; create a parent cursor with many child cursors, flush only some of them from the library cache and check what Statspack and AWR report.

Purging only some of the cursors turned out as the challenging part. At first I tried those techniques:

  1. dbms_shared_pool.purge: As explained by Jonathan Lewis in this post https://jonathanlewis.wordpress.com/2019/12/20/purge_cursor/, dbms_shared_pool.purge purges all of the child cursors.
  2. Setting “_cursor_obsolete_threshold” to say, 100. When the threshold is reached, a new parent cursor with the same SQL_ID is generated, then when no session is using the child cursors under the “old” parent, it gets flushed out the library cache along with its child cursors.  I tried to keep one child open in a session by pausing the process before completing all of the FETCH calls, but that didn’t help.

Eventually, I opted for an unscientific approachgenerating space pressure on the shared pool by heavily hard parsing and waiting for some of my observed child cursors to be purged out automatically by Oracle.

Test execution and observation of results

The script I used to execute the whole test is available here and the full log of the test is here. (Note that if you intend to re-run it, it might need a bit of tweaking depending on the sizes of your SGA and shared pool. The test was executed on a Oracle 19.5 EE running on Linux x64. Also, because of the heavy hard parsing, you should run it in an isolated database.) Finally, the observed SQL was selected from a real table because I wanted to check how other statistics behave. Here’s what we could use to speed up the test:

 

qtext="select /*"$(date)"*/  1 from dual"

Description of the most relevant steps in the test script:

  1. Generate a distinct SQL text used to generate many child cursors by including a comment with the current date
  2. “Color” the SQL using dbms_workload_repository.add_colored_sql in order to guarantee it’s captured by AWR
  3. Flush the shared pool and create an initial Statspack (SNAP_ID=571) and AWR (SNAP_ID=311) snapshot
  4. Generate 200 child cursors by changing the optimizer environment parameter “optimizer_index_cost_adj” to a different value for each new child. After the code runs, this is the status from v$sql:
CHILD_NUMBER EXECUTIONS
------------ ----------
           0        100
           1        100
           2        100
           3        100
           .
           .
         197        100
         198        100
         199        100

200 rows selected.

       CNT SUM(EXECUTIONS)
---------- ---------------
       200           20000

At this point there are 200 child cursors, each with 100 executions. Next, a Statspack (SNAP_ID=572) and AWR (SNAP_ID=312) snapshot is created.

5. Flood the shared pool by hard parsing in a loop. This is achieved by generating and executing unique SQLs in a loop. Exit the loop as soon as the number of child cursors for our observed SQL decreases. After that, check the number of child cursors in v$sql:

 

** Child cursor status from v$sql after the number of child cursors decreased: **

  COUNT(*) SUM(EXECUTIONS)
---------- ---------------
        41            4100

At this point, 159 child cursors were purged and there are only 41 left in the library cache.

6. Set optimizer_index_cost_adj=1000 (a value which wasn’t used yet) in order to force a new child cursor creation, and execute the observed SQL only once. The number of child cursors increases from 41 to 42 and the total number of executions increases by one to 4101:

 

** Child cursor status from v$sql: **

  COUNT(*) SUM(EXECUTIONS)
---------- ---------------
        42            4101

7. Generate another set of Statspack (SNAP_ID = 573) and AWR (SNAP_ID=313) snapshots.

Note: On some occasions, the act of creating Statspack and AWR snapshots purges some of the observed child cursors out of the library cache. We need to check the figures reported in the Statspack and AWR reports so they match with the above output, otherwise the test needs to be repeated.

Result analysis

Finally, we can verify the data captured in AWR and Statspack repositories.

Statspack
   SNAP_ID EXECUTIONS
---------- ----------
       572      20000
       573       4101

Excerpt from the Statspack report for snapshots 572-573:

    CPU                  CPU per             Elapsd                     Old
  Time (s)   Executions  Exec (s)  %Total   Time (s)    Buffer Gets  Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
     16.28        4,101       0.00  366.4      16.32       6,393,462 1915833037
Module: SQL*Plus
select /*Wed Sep 22 17:33:31 CEST 2021*/  max(t_1.object_id) from t1 t_1

Observation: Even though there was only one execution of our observed SQL between snapshots 572 and 573, Statspack reports 4101 executions. This is what most likely happened also on the production database, and we just reproduced it.

AWR

AWR doesn’t have the same problem as Statspack, as it correctly reports only one execution for SNAP_ID=313:

 

   SNAP_ID   NODE BEGIN_INTERVAL_TIME            SQL_ID        PLAN_HASH_VALUE        EXECS ROWS_PROCESSED ROWS_PROCESSED_PER_EXEC    AVG_ETIME        AVG_LIO    AVG_PIO
---------- ------ ------------------------------ ------------- --------------- ------------ -------------- ----------------------- ------------ -------------- ----------
       312      1 22-SEP-21 05.33.32.745 PM      g96vqh5wyw7q9      3724264953       20,000          20000                       1         .004        1,559.0          0
       313      1 22-SEP-21 05.34.51.241 PM      g96vqh5wyw7q9                            1              1                       1         .005        1,562.0          0

The code to capture AWR snapshots isn’t as easily accessible as Statspack’s. Suffice to say that when researching how AWR captures the snapshots while inserting into the WRH$_SQLSTAT table (which is the table underlying the DBA_HIST_SQLSTAT view), it does so by running an INSERT INTO SELECT from X$KGLCURSOR_CHILD_SQLIDPH, which stores data for each child cursor. Therefore, it’s likely that runtime statistics at the child cursor level are taken into account when generating the AWR snapshots.

All of the files generated during this test are uploaded to the GitHub repository.

Summary

In cases where only some of the child cursors are purged from library cache, Statspack can report an erroneously high number of executions for a particular SQL. AWR doesn’t seem to suffer from the same problems.

Final notes

Be aware that in case you try to re-run the test with a higher number of child cursors, you might get “strange” results in AWR. Below is what AWR reports when running the test with 500 child cursors (which is still below the _cursor_obsolete_threshold=1024 limit in 19c). Despite there were 500 child cursors with 50000 executions in v$sql, AWR didn’t capture their runtime statistics:

 

SQL> SELECT ss.snap_id, begin_interval_time, plan_hash_value, executions_delta, rows_processed_delta, buffer_gets_delta, disk_reads_delta, elapsed_time_delta         
       FROM dba_hist_sqlstat s, dba_hist_snapshot ss
      WHERE sql_id = 'fs76gd8ck6gft'
            AND ss.snap_id = S.snap_id
            AND ss.instance_number = S.instance_number
     ORDER BY 1, 2, 3; 

   SNAP_ID BEGIN_INTERVAL_TIME                      PLAN_HASH_VALUE EXECUTIONS_DELTA ROWS_PROCESSED_DELTA BUFFER_GETS_DELTA DISK_READS_DELTA ELAPSED_TIME_DELTA
---------- ---------------------------------------- --------------- ---------------- -------------------- ----------------- ---------------- ------------------
       341 24-SEP-21 01.53.41.341 PM                              0                                                                                   199321839
       342 24-SEP-21 01.57.03.885 PM                              0                                                                                        6858

This behavior is described in My Oracle Support note: “Why “Executions” and “Elap per Exec(s)” are ‘Blank’ in AWR for Some SQL Statements (Doc ID 1522547.1)”:

“This is expected behavior. Since 10.2, “Executions” and “Elap per Exec(s)” statistics of high version count cursor(> 200 by default) will no longer be collected.”


Don’t forget to sign up for more updates here.

No Comments Yet

Let us know what you think

Subscribe by email