Pythian Blog: Technical Track

There's Always Another Bug Hiding Just Around the Corner

We were using a 10.2.0.3 database, and it had been running without any issues for several years. What could possibly go wrong? Anything! Suddenly, we started getting " ORA-07445: exception encountered: core dump [qercoStart()+156] [SIGSEGV] [Address not mapped to object] " a few times a minute in the alert log. A closer investigation revealed that one of the popular SQLs in the application couldn't complete anymore. It looked like a bug, since only the SQL was failing. We found a few references for various releases with the same conditions: ORA-07445 + qercoStart(). This list summarizes the possible causes for the error I found on My Oracle Support:
  • Using ROWNUM < x condition in the where clause
  • Using ROWNUM condition and FULL OUTER joins
  • Using ROWNUM condition with UNION ALL set operation
The strange thing was that this started suddenly; no changes to the code were made. Moreover, the SQL didn't contain FULL OUTER join operations or UNION ALL set operations: [sourcecode language="sql"] SELECT CBMD.CBMD_BASE_MDL_NUMBER, MFG_GROUP MFG_ID, MFG_NAME, CBMD_CATALOG_MODEL_NUMBER, CBMD_CATALOG_MODEL_SHORT_DESC, CBMD_IMAGE_PATH, SUM (CSMD_AVAILABLE_QTY) QTY FROM CATALOG_BASE_MODEL_DATA CBMD, CATALOG_BASE_MODEL_CATEGORY CBMC, CATALOG_SUB_MODEL_DATA CSMD WHERE CBMD.CBMD_BASE_MDL_NUMBER = CSMD.CBMD_BASE_MDL_NUMBER AND CBMD.CBMD_BASE_MDL_NUMBER = CBMC.CBMD_BASE_MDL_NUMBER AND ( (CSMD.CSMD_AVAILABLE_FOR_SALE_FLAG = 'N' AND CBMC.DC_DIVISION_CODE = 1) OR (CBMC.DC_DIVISION_CODE = 2) ) AND CBMD_PUT_ON_WEB_FLAG = 'Y' AND CSMD_AVAIL_FOR_WEB_DISP_FLAG = 'Y' AND CBMC.DC_DIVISION_CODE = :B1 AND ROWNUM < 5 GROUP BY CBMD.CBMD_BASE_MDL_NUMBER, MFG_GROUP, MFG_NAME, CBMD_CATALOG_MODEL_NUMBER, CBMD_CATALOG_MODEL_SHORT_DESC, CBMD_IMAGE_PATH, CBMD_IMAGE_NAME ORDER BY QTY DESC [/sourcecode] We also tried all the possible workarounds listed in the bug descriptions, but nothing helped:
  • Flushed the shared pool
  • setting "_complex_view_merging"=false
  • bouncing the database
As raising a SR for our 10.2.0.3 DB was unlikely to help, I decided to dig deeper. I knew something had changed, and that change was what triggered the bug. I didn't know where to start, so I decided to look more closely in the bug descriptions in My Oracle Support. All the bugs listed examples of SQL statements containing "ROWNUM < X" condition. The second similarity was harder to notice. Here are some examples - I've highlighted the interesting lines:
  1. from bug 7704557 on 10.2.0.4 [sourcecode language="sql" highlight="3"] select jsp1.name name , jsp1.value value from "SYSJCS".jcs_scheduler_parameters jsp1 where jsp1.name in ('database_name', 'global_names', 'scheduler_hostname', 'remote_start_port', 'scheduler_connect_string', 'oracle_sid', 'listener_port', 'remote_http_output','remote_http_port') and rownum <= 9 and scheduler_name = nvl (:scheduler, scheduler_name); [/sourcecode]
  2. from bug 7528596 on 10.2.0.3 [sourcecode language="sql" highlight="10,19"] SELECT /*+ FIRST_ROWS(200) */ rv.STATUS_NAME H_STATUS_ID ,rv.DESCRIPTION H_DESCRIPTION ,rv.PRIORITY_MEANING H_PRIORITY_CODE ,rv.CREATED_BY_NAME H_CREATED_BY , rv.CREATED_BY_EMAIL H_CREATED_BY_E , H_CREATED_BY_N , rv.CREATION_DATE H_CREATION_DATE ,rv.ASSIGNED_TO_NAME H_ASSIGNED_TO_USER_ID , rv.ASSIGNED_TO_EMAIL H_ASSIGNED_TO_USER_ID_E , rv.ASSIGNED_TO_USERNAME H_ASSIGNED_TO_USER_ID_N ,rv.REQUEST_ID H_REQUEST_ID FROM itgadm.kcrt_requests_v rv, itgadm.kcrt_req_header_details rh WHERE (1=1 AND (rv.batch_number =1 OR rv.batch_number is null) AND rv.REQUEST_TYPE_ID in (30593) AND rv.REQUEST_TYPE_ID in (30593) AND ( rv.STATUS_CODE NOT LIKE 'CLOSED%' AND rv.STATUS_CODE NOT LIKE 'CANCEL%' ) AND exists(SELECT /*+ NO_UNNEST */ pcv.REQUEST_ID FROM itgadm.KCRT_PARTICIPANT_CHECK_V pcv WHERE pcv.request_id = rv.request_id and pcv.user_id = 30481) AND rh.request_id = rv.request_id ) AND ( rh.PARAMETER22 = 30481 OR rv.ASSIGNED_TO_USER_ID = 30481 ) AND ROWNUM <= 200 ORDER BY rv.REQUEST_ID DESC;[/sourcecode]
  3. from bug 7416171 on 10.2.0.3 [sourcecode language="sql" highlight="7"] SELECT COUNT(*) FROM ( SELECT ( SELECT DECODE(COUNT(*),0,0,1) isStemData FROM ccd2.vw_policy_admins q WHERE q.system_id = a.system_id AND q.policy_no = a.policy_no AND ((q.policy_admin_id in (440502405,440502499))) AND rownum < 2)isStemData FROM ccd2.vw_ordf_partners_cnt a WHERE a.PARTNER_ID = 36977489 /*AND a.PARTNER_ID2 = a.PARTNER_ID */ AND a.ORDF_ID = 2 AND 1=1) b WHERE isStemData=1 AND rownum < 300[/sourcecode]
  4. from bug 3211315 on 9.2.0.4 [sourcecode language="sql" highlight="2"] select dummy from (SELECT dummy from dual where rownum < 2) FULL OUTER JOIN (SELECT dummy from dual where rownum < 2) using (dummy)[/sourcecode]
The first 3 SQLs contain "IN" or "OR" operators, and the last one contains a FULL OUTER JOIN set operation that was said to have issues. Knowing a bit of theory helped me identify some similarities:
  • Oracle introduced native FULL OUTER JOIN operation in 10.2.0.5. Before that, it was implemented using the UNION ALL operation. (Cristian Antognini explains it here and gives some examples.)
  • "OR" and "IN" predicates can sometimes be optimized by applying the "OR Expansion" transformation, which acquires the result set of each disjunction condition separately and then combines them using the set operations, i.e. UNION ALL. (Maria Colgan explains it here better than anyone else could.)
At that moment, I started suspecting this could be our case too because the SQL had an "OR" predicate. It was easy to check and confirm by looking at the execution plan. The highlighted line contained the CONCATENATION operation, which is the same as UNION ALL: [sourcecode language="sql" highlight="12"] PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------- Plan hash value: 132832423 -------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 434 | 69874 | 61 (4)| 00:00:01 | | 1 | SORT ORDER BY | | 434 | 69874 | 61 (4)| 00:00:01 | | 2 | HASH GROUP BY | | 434 | 69874 | | | |* 3 | COUNT STOPKEY | | | | | | | 4 | CONCATENATION | | | | | | |* 5 | FILTER | | | | | | | 6 | NESTED LOOPS | | 1 | 161 | 30 (0)| 00:00:01 | | 7 | NESTED LOOPS | | 7 | 1008 | 23 (0)| 00:00:01 | |* 8 | TABLE ACCESS FULL | CATALOG_SUB_MODEL_DATA | 21 | 420 | 2 (0)| 00:00:01 | |* 9 | TABLE ACCESS BY INDEX ROWID| CATALOG_BASE_MODEL_DATA | 1 | 124 | 1 (0)| 00:00:01 | |* 10 | INDEX UNIQUE SCAN | CBMD_C1_1_PK | 1 | | 0 (0)| 00:00:01 | |* 11 | INDEX RANGE SCAN | CBMC_C1_1_PK | 1 | 17 | 1 (0)| 00:00:01 | |* 12 | FILTER | | | | | | | 13 | NESTED LOOPS | | 2 | 322 | 29 (0)| 00:00:01 | | 14 | NESTED LOOPS | | 7 | 1008 | 22 (0)| 00:00:01 | |* 15 | TABLE ACCESS FULL | CATALOG_SUB_MODEL_DATA | 20 | 400 | 2 (0)| 00:00:01 | |* 16 | TABLE ACCESS BY INDEX ROWID| CATALOG_BASE_MODEL_DATA | 1 | 124 | 1 (0)| 00:00:01 | |* 17 | INDEX UNIQUE SCAN | CBMD_C1_1_PK | 1 | | 0 (0)| 00:00:01 | |* 18 | INDEX RANGE SCAN | CBMC_C1_1_PK | 1 | 17 | 1 (0)| 00:00:01 | -------------------------------------------------------------------------------------------------------------- [/sourcecode] A quick Google search gave me a NO_EXPAND hint, which disables the OR expansion. However, I couldn't use it since it required a code change. I knew that the behavior of the optimizer was controlled by a large number of hidden parameters that are also listed in the 10053 trace: [sourcecode language="sql" highlight="38"] SQL> ALTER SESSION SET EVENTS '10053 trace name context forever,level 1'; Session altered. SQL> alter session set tracefile_identifier=CR758708_2; Session altered. SQL> alter session set max_dump_file_size=unlimited; Session altered. SQL> explain plan for SELECT CBMD.CBMD_BASE_MDL_NUMBER, /*removed some lines for readability*/ Explained. SQL> exit Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production With the Partitioning, OLAP and Data Mining options $ more test_ora_17805_CR758708_2.trc /*removed some lines for readability*/ ... *************************************** PARAMETERS USED BY THE OPTIMIZER ******************************** ... ************************************* PARAMETERS WITH DEFAULT VALUES ****************************** ... _fast_full_scan_enabled = true _optim_enhance_nnull_detection = true _parallel_broadcast_enabled = true _px_broadcast_fudge_factor = 100 _ordered_nested_loop = true _no_or_expansion = false optimizer_index_cost_adj = 100 optimizer_index_caching = 0 _system_index_caching = 0 _disable_datalayer_sampling = false ... [/sourcecode] I disabled the OR expansion by setting the parameter _no_or_expansion = true, checked the execution plan, and confirmed that the query transformation didn't happen: [sourcecode language="sql"] PLAN_TABLE_OUTPUT ----------------------------------------------------------------------------------------------------- Plan hash value: 1045847658 ----------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 4 | 644 | 324 (3)| 00:00:04 | | 1 | SORT ORDER BY | | 4 | 644 | 324 (3)| 00:00:04 | | 2 | HASH GROUP BY | | 4 | 644 | 324 (3)| 00:00:04 | |* 3 | COUNT STOPKEY | | | | | | |* 4 | HASH JOIN | | 434 | 69874 | 322 (2)| 00:00:04 | |* 5 | TABLE ACCESS FULL | CATALOG_SUB_MODEL_DATA | 3777 | 75540 | 62 (4)| 00:00:01 | |* 6 | HASH JOIN | | 3087 | 425K| 260 (2)| 00:00:04 | |* 7 | INDEX FAST FULL SCAN| CBMC_C1_1_PK | 1759 | 29903 | 13 (8)| 00:00:01 | |* 8 | TABLE ACCESS FULL | CATALOG_BASE_MODEL_DATA | 2685 | 325K| 247 (2)| 00:00:03 | ----------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - filter(ROWNUM<5) 4 - access("CBMD"."CBMD_BASE_MDL_NUMBER"="CSMD"."CBMD_BASE_MDL_NUMBER") filter("CSMD"."CSMD_AVAILABLE_FOR_SALE_FLAG"='N' AND "CBMC"."DC_DIVISION_CODE"=1 OR "CBMC"."DC_DIVISION_CODE"=2) 5 - filter("CSMD_AVAIL_FOR_WEB_DISP_FLAG"='Y') 6 - access("CBMD"."CBMD_BASE_MDL_NUMBER"="CBMC"."CBMD_BASE_MDL_NUMBER") 7 - filter(("CBMC"."DC_DIVISION_CODE"=1 OR "CBMC"."DC_DIVISION_CODE"=2) AND "CBMC"."DC_DIVISION_CODE"=TO_NUMBER(:B1)) 8 - filter("CBMD_PUT_ON_WEB_FLAG"='Y') 28 rows selected. [/sourcecode] In our case, the optimizer had changed the execution plan after fresh statistics were collected - this was the change that triggered the bug. We set the parameter to disable the OR expansion until we upgrade to 11.2. I wanted to share this story with you because it's interesting how different things (IN and OR predicates, UNION ALL and FULL OUTER JOIN set operations, etc.) transform behind the scenes into the same conditions to trigger the same bug. I think this incident has also changed the way I'll read bug descriptions on My Oracle Support in the future - there is information hidden between the lines.

No Comments Yet

Let us know what you think

Subscribe by email