
SQL for Pattern Matching in Oracle 12c

Oracle 12c is out!
And with it, a myriad of new features that we’ll be learning and playing with in the months and years to come. Paraphrasing Iggy Fernandez’s blog: “So many Oracle manuals, so little time…” The new features abound, and we need to cherry pick some interesting ones to delve into.

There are the basic ones, like installing the new database software, which Yury Velikanov promptly explained just one day after the 12c release. And there are others, buried in the manuals, that may go unnoticed by the majority for some time. They may not be essential for a DBA or developer, but sometimes they can save us a great deal of work and time.

One of these “buried” features of Oracle 12c that caught my interest was SQL for Pattern Matching. It’s an extension to the syntax of the SELECT statement, the new MATCH_RECOGNIZE clause, that allows us to identify patterns across sequences of rows.

It’s just another way of doing the same things

In SQL, there are always many different ways to solve a single problem. MATCH_RECOGNIZE just adds to the set of possible solutions. However, when the problem at hand is to detect patterns in sequences of rows, it may simplify the job immensely and save us a lot of time and lines of code. That doesn’t mean, though, that it wasn’t possible to do pattern matching before; it probably just required more work and convoluted queries.

So, let’s look at an example. I’ll look for patterns in the archived log history of a database, rather than venturing into stock market tickers or other domains I’m less familiar with. The idea, though, can be applied to a variety of scenarios.

Let’s say that we want to find periods of high archived log generation for a database. For example, we want to know in which 24-hour periods the database generated more than 30 GB. An easy way to do this is the query below, which all DBAs have certainly written before:

select trunc(completion_time) day, sum(blocks*block_size/1024/1024) mbytes
from v$archived_log
group by trunc(completion_time)
having sum(blocks*block_size/1024/1024) > 30000;

Which returns:

DAY         MBYTES
--------- --------
03-JUL-13   31,666
04-JUL-13   30,078

The query above shows all the days when the database generated more than 30 GB of archived logs. There’s a problem, though. What if, on some occasion, the database generated 40 GB of archived logs between noon of one day and noon of the next, and was quiet the rest of the time?

The query above would miss that period entirely, since only 20 GB would fall on each calendar day.
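
To make that blind spot concrete, here’s a minimal sketch with invented data: 20 GB of archived logs late on one day and 20 GB early on the next (the logs subquery and its numbers are hypothetical, standing in for v$archived_log):

with logs (completion_time, mbytes) as (
  select to_date('2013-07-10 18:00','yyyy-mm-dd hh24:mi'), 20480 from dual
  union all
  select to_date('2013-07-11 10:00','yyyy-mm-dd hh24:mi'), 20480 from dual
)
select trunc(completion_time) day, sum(mbytes) mbytes
from logs
group by trunc(completion_time)
having sum(mbytes) > 30000;

This returns no rows: each calendar day holds only 20 GB, even though 40 GB were generated within a span of just 16 hours.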

Using MATCH_RECOGNIZE

Writing a query that considers any period of 24 hours, rather than calendar days, is more complicated. Using MATCH_RECOGNIZE, we can solve it with the query below:

SELECT *
FROM v$archived_log MATCH_RECOGNIZE (
     ORDER BY completion_time
     MEASURES to_char(FIRST (A.completion_time),'yyyy-mm-dd hh24:mi:ss') AS start_time,
              to_char(LAST (A.completion_time),'yyyy-mm-dd hh24:mi:ss') AS end_time,
              sum(A.blocks*A.block_size/1024/1024) as mbytes
     ONE ROW PER MATCH
     AFTER MATCH SKIP PAST LAST ROW
     PATTERN (Y+ Z)
       SUBSET A = (Y, Z)
     DEFINE
        Y AS (Y.completion_time - FIRST(Y.completion_time)) <= 1,
        Z AS (Z.completion_time - FIRST(Y.completion_time)) <= 1 and sum(A.blocks*A.block_size)/1024/1024 >= 30000
);

With the query above, we can now see that there were more periods of high activity than we previously thought:

START_TIME          END_TIME              MBYTES
------------------- ------------------- --------
2013-06-19 05:00:06 2013-06-20 05:00:01   30,424
2013-06-20 05:12:25 2013-06-21 05:08:07   30,163
2013-06-25 05:04:12 2013-06-26 05:04:04   30,529
2013-07-02 18:51:42 2013-07-03 18:45:55   30,221
2013-07-03 18:52:42 2013-07-04 18:43:14   31,352

No way! It’s too complicated

It does look scary at first sight! But if we break it down into smaller pieces, it gets easier to understand.

MATCH_RECOGNIZE allows us to search for a pattern in a sequence of rows. It’s important that the sequence of rows be ordered, otherwise the results wouldn’t be deterministic. The order of the rows is specified by the ORDER BY clause within MATCH_RECOGNIZE.

In this example, the pattern we want to find is “a sequence of rows spanning a period of no more than 24 hours for which the sum of the archived log sizes is at least 30 GB”. It is specified as a regular expression in the PATTERN clause, “Y+ Z”, which means “one or more rows matching the Y condition, followed by one row matching the Z condition”, where Y and Z are the conditions specified in the DEFINE clause.

The Y condition specifies that the completion time of any row labelled Y must be within 24 hours of the first Y row. There’s no restriction on the maximum number of Y rows, just that they all fall within that 24-hour window.

The single row matching the Z condition must also be within 24 hours of the first occurrence of Y, and must be such that the sum of the archived log volumes of all the Y’s and Z’s is at least 30 GB. The SUBSET clause defines the set A as the union of all Y’s and Z’s, to simplify the calculation of the total volume.
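
Because the pattern is a regular expression, tightening the match is a one-line change. These variations are hypothetical (not part of the original example), shown here only to illustrate the quantifiers:

-- hypothetical variations of the PATTERN line in the query above:
PATTERN (Y{10,} Z)   -- require at least ten Y rows before the closing Z row
PATTERN (Y+? Z)      -- reluctant quantifier: match as few Y rows as possible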

For each match found in the sequence of rows, MATCH_RECOGNIZE can return either all the rows within the match, i.e. every row that matched the Y and Z conditions, or a single row summarizing it. Since we just want the total archived log volume here, I specified ONE ROW PER MATCH. I also told MATCH_RECOGNIZE to resume the search after each match from the first row following it (the AFTER MATCH SKIP clause with the PAST LAST ROW option). This clause can also resume from the row after the start of the match, or from a specific occurrence of a pattern variable.
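
If we wanted to inspect every row of each match instead, a small variation (a sketch, assuming the same v$archived_log data) swaps in ALL ROWS PER MATCH; the CLASSIFIER() function tags each row with the condition it matched, and the SUM measure becomes a running total:

SELECT completion_time, var, mbytes
FROM v$archived_log MATCH_RECOGNIZE (
     ORDER BY completion_time
     MEASURES classifier() AS var,
              sum(A.blocks*A.block_size/1024/1024) AS mbytes
     ALL ROWS PER MATCH
     AFTER MATCH SKIP PAST LAST ROW
     PATTERN (Y+ Z)
       SUBSET A = (Y, Z)
     DEFINE
        Y AS (Y.completion_time - FIRST(Y.completion_time)) <= 1,
        Z AS (Z.completion_time - FIRST(Y.completion_time)) <= 1
             and sum(A.blocks*A.block_size)/1024/1024 >= 30000
);

Each match now comes back as a run of rows labelled Y, Y, …, Z, with mbytes accumulating toward the 30 GB threshold.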

Finally, in the MEASURES clause we define the value that will be calculated for each match found. Here we use the FIRST and LAST navigation functions to retrieve the start and end dates for each match and calculate the sum of the archived logs volume.

The uses for MATCH_RECOGNIZE are many, from finding patterns in audit trails and logs to analyzing stock market variations. It’s another nice tool for the utility belt of DBAs and developers, and the fact that it uses the power of regular expressions to specify the patterns is a great touch and gives it great potential.

Do you see potential interesting uses of this new feature? I’d love to hear them!

Read more…

If you’re interested, I’d suggest reading the manual to learn all the available options. Learning and mastering regular expressions is also paramount; it’s a skill with many other applications:

SQL for Pattern Matching (Oracle Database Data Warehousing Guide, 12c Release 1)
Regular Expressions
