
Locating most current record using ROW_NUMBER() vs. Index Full Scan (Min/Max)

Very often when reading data, we need to locate the most current record for a set of conditions. A typical example is finding the most recent order for a particular customer. In this simple scenario, either of the approaches below works pretty much as you would expect. Things change quickly, however, when you introduce extra conditions, such as trying to locate the most recent OPEN order.

The most elegant, and usually most efficient, way to do this is to use an analytic ranking function like ROW_NUMBER() together with an index on the order key. In this case Oracle uses a clever optimization that shows up as WINDOW NOSORT STOPKEY in the execution plan. But there is an unpleasant surprise: the optimization appears to be written for the "general" case, and lacks some very important case-specific optimizations.

Here's a demonstration. First, let's build a test data set (requires about 4 GB of space):

[sql]
create table my_orders nologging as
select rownum id,
       sysdate-100+rownum/24/60 dt_inserted,
       cast('CLOSED' as varchar2(20)) as order_status,
       mod(rownum,5) customer_id,
       mod(rownum,5000) mod5000_id,
       rpad('x',3500,'x') filler
  from (select rownum r from dual connect by level <= 10000) r1,
       (select rownum r from dual connect by level <= 100);

create /*+ parallel(32) */ index my_orders$pk on my_orders(id);

-- Index referenced by the plans below; its exact definition is assumed here
-- to be (customer_id, id), which matches the access and ordering paths shown.
create index my_orders$customer on my_orders(customer_id, id);

alter table my_orders modify (id not null, customer_id not null);

/* make a few orders open */
update my_orders set order_status='OPEN' where id in (999999,999998);
[/sql]

Here's the ROW_NUMBER() approach - very efficient and elegant. Note that we are NOT looking for open orders yet, just the most recent one.

[sql]
select *
  from (select t.*, row_number() over (order by id desc) r
          from my_orders t
         where customer_id=4)
 where r=1;

/*
------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name               | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                    |      1 |  1825 |   200K  (1)| 00:00:08 |
|*  1 |  VIEW                           |                    |      1 |  1825 |   200K  (1)| 00:00:08 |
|*  2 |   WINDOW NOSORT STOPKEY         |                    |    200K|   672M|   200K  (1)| 00:00:08 |
|   3 |    TABLE ACCESS BY INDEX ROWID  | MY_ORDERS          |    200K|   672M|   200K  (1)| 00:00:08 |
|*  4 |     INDEX RANGE SCAN DESCENDING | MY_ORDERS$CUSTOMER |    200K|       |    529  (1)| 00:00:01 |
------------------------------------------------------------------------------------------------------
*/
[/sql]

The old school approach would have been:

[sql]
select *
  from my_orders
 where id = (select max(id) from my_orders where customer_id=4)
   and customer_id=4;

/*
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name               | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                    |      1 |  3528 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED | MY_ORDERS          |      1 |  3528 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                   | MY_ORDERS$CUSTOMER |      1 |       |     3   (0)| 00:00:01 |
|   3 |   SORT AGGREGATE                     |                    |      1 |     8 |            |          |
|   4 |    FIRST ROW                         |                    |      1 |     8 |     3   (0)| 00:00:01 |
|*  5 |     INDEX RANGE SCAN (MIN/MAX)       | MY_ORDERS$CUSTOMER |      1 |     8 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------
*/
[/sql]

This uses the rather classical INDEX RANGE SCAN (MIN/MAX) approach. Notice the double use of CUSTOMER_ID=4: it ensures that the same index is used for both the MIN/MAX lookup and the table access. If a different index (such as the primary key) were used, the lookup could incur extra disk I/O, because that index's leaf block may not be in the buffer cache.

So far everything works as expected. Both approaches perform identically, with the analytic SQL being slightly clearer. The cost difference only illustrates that Oracle does not know the ROW_NUMBER() plan is just as efficient: costing is based on the statistics, and the statistics do not carry enough information to tell whether the query will actually be resource intensive.
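If you want to see the actual work each query does, rather than the optimizer's estimates, one way (a sketch, assuming SQL*Plus against the test table above) is to run both statements with the GATHER_PLAN_STATISTICS hint and compare the Buffers column reported by DBMS_XPLAN.DISPLAY_CURSOR:

[sql]
-- Sketch: compare the actual logical reads of the two approaches.
-- (In SQL*Plus, SET SERVEROUTPUT OFF first, so that the "last" cursor
--  really is the query and not a DBMS_OUTPUT call.)
select /*+ GATHER_PLAN_STATISTICS */ *
  from (select t.*, row_number() over (order by id desc) r
          from my_orders t
         where customer_id=4)
 where r=1;

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

select /*+ GATHER_PLAN_STATISTICS */ *
  from my_orders
 where id = (select max(id) from my_orders where customer_id=4)
   and customer_id=4;

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
[/sql]

With this data set both plans should report a comparably small number of buffer gets, matching the "perform identically" observation above despite the very different cost estimates.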
Here's where things become interesting. What if you needed to locate the most recent OPEN order for a customer? The MAX() approach would look like this:

[sql]
select *
  from my_orders
 where id = (select /*+ INDEX_DESC(T) */ max(id)
               from my_orders t
              where customer_id=4
                and order_status='OPEN')
   and customer_id=4;

/*
--------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name               | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                    |      1 |  3528 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED   | MY_ORDERS          |      1 |  3528 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                     | MY_ORDERS$CUSTOMER |      1 |       |     3   (0)| 00:00:01 |
|   3 |   SORT AGGREGATE                       |                    |      1 |    15 |            |          |
|*  4 |    TABLE ACCESS BY INDEX ROWID BATCHED | MY_ORDERS          |      1 |    15 |   200K  (1)| 00:00:08 |
|*  5 |     INDEX RANGE SCAN DESCENDING        | MY_ORDERS$CUSTOMER |    200K|       |    529  (1)| 00:00:01 |
--------------------------------------------------------------------------------------------------------------
*/
[/sql]

In this plan the FIRST ROW and MIN/MAX steps are gone. Oracle now has to look up ALL orders for that customer via the index, and then read ALL of the relevant table blocks just to apply the ORDER_STATUS='OPEN' filter - a very costly way to find MAX(ID). Such a plan gets slower as orders build up, and it will very likely require disk I/O for the table access: orders typically arrive over time from many customers, so one customer's orders are not clustered in the same blocks or table area. With hundreds of orders accumulated over, say, 2-3 years, a significant slowdown is guaranteed.
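The plan above does not include the predicate section, but the predicate placement is what explains the cost: CUSTOMER_ID can be checked inside the index, while ORDER_STATUS can only be checked after visiting the table block for each of the customer's orders. A quick way to confirm this (a sketch, reusing the statement above) is:

[sql]
-- Sketch: show where each predicate is applied in the MAX() + ORDER_STATUS plan.
explain plan for
select *
  from my_orders
 where id = (select /*+ INDEX_DESC(T) */ max(id)
               from my_orders t
              where customer_id=4
                and order_status='OPEN')
   and customer_id=4;

-- The default (TYPICAL) format includes the predicate section; expect
-- access("CUSTOMER_ID"=4) on the index scan and filter("ORDER_STATUS"='OPEN')
-- on the table access step feeding the SORT AGGREGATE.
select * from table(dbms_xplan.display);
[/sql]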
Here's the ROW_NUMBER() approach. I had to hint it to force this plan: in my data set very few orders are in status OPEN, so Oracle cannot tell that this is the best way to locate open orders.

[sql]
select *
  from (select /*+ INDEX_DESC(T) */ t.*, row_number() over (order by id desc) r
          from my_orders t
         where customer_id=4
           and order_status='OPEN')
 where r=1;

/*
------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name               | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                    |      1 |  1825 |   200K  (1)| 00:00:08 |
|*  1 |  VIEW                           |                    |      1 |  1825 |   200K  (1)| 00:00:08 |
|*  2 |   WINDOW NOSORT STOPKEY         |                    |      1 |  3528 |   200K  (1)| 00:00:08 |
|*  3 |    TABLE ACCESS BY INDEX ROWID  | MY_ORDERS          |      1 |  3528 |   200K  (1)| 00:00:08 |
|*  4 |     INDEX RANGE SCAN DESCENDING | MY_ORDERS$CUSTOMER |    200K|       |    529  (1)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("R"=1)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY INTERNAL_FUNCTION("ID") DESC )<=1)
   3 - filter("ORDER_STATUS"='OPEN')
   4 - access("CUSTOMER_ID"=4)
*/
[/sql]

As you can see, the execution plan is the SAME as for the most recent (any) order. What Oracle does here is locate the most recent entry for CUSTOMER_ID=4 in the INDEX, and then read table blocks one by one until it has identified enough records that satisfy your conditions. I worded that sentence deliberately, because it is exactly the logic Oracle follows - and this is where things stop working quite the way you would expect.

WINDOW NOSORT STOPKEY means that Oracle reads rows from the row source (here, table rows in the order supplied by the descending index scan) one at a time until the rows no longer satisfy the STOPKEY condition. I've included the predicate information to demonstrate this. The stop condition is ROW_NUMBER() <= 1, and it can only fail once a row that would receive row number 2 arrives; since the ORDER_STATUS='OPEN' filter is applied on the table access step below the window operator, only open orders ever reach it. The effect is that Oracle keeps reading from the row source until it either finds a second matching row (a second open order, in our case) or exhausts the row source entirely.

My guess as to why it works this way is that ROW_NUMBER() shares its code with RANK() and DENSE_RANK(). Since RANK() allows multiple records to share the same rank when their ordering values are identical, it cannot stop processing until it has confirmed that there are no more rows with rank "X".

So what does this mean? If we have two recent open orders, our query will be very fast. If we have only one recent open order, our query will be very slow, because it will plough through every closed order just to make sure there is no other open one.

Is there a Solution?

Unfortunately, I have not found a solution to this problem. You could add the STATUS column to your index (see the sketch at the end of this post), but that makes the index larger and is not necessarily a solution - there could be ten other predicates you want to test for. I did not explore nested loops joins (with all "BATCH" optimizations turned off), as I deemed that too convoluted. I hope Oracle introduces an optimization to ROW_NUMBER() so that it stops on the first row. I last tested this on 12.1.
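For reference, here is a sketch of the "add the STATUS column to your index" option mentioned above. The index name is mine, and the resulting plan should be verified on your own system; the caveats still stand, since it only helps for this one extra predicate and enlarges the index.

[sql]
-- Sketch: carry ORDER_STATUS (and ID) in the index so the extra predicate can
-- be resolved in the index itself. With equality predicates on the two leading
-- columns, MAX(ID) on the trailing column is again a candidate for
-- INDEX RANGE SCAN (MIN/MAX), avoiding table visits for CLOSED orders.
create index my_orders$cust_status on my_orders(customer_id, order_status, id);

select *
  from my_orders
 where id = (select max(id)
               from my_orders
              where customer_id=4
                and order_status='OPEN')
   and customer_id=4;
[/sql]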
