Pythian Blog: Technical Track

Debugging IN vs OR Performance in MySQL

I was recently puzzled by the question: “Which query will be faster?”

SELECT * FROM table WHERE pk = x OR pk = x1 OR pk = x2 ...

Or

SELECT * FROM table WHERE pk IN (x1, x2,...);

There are 50k values in both IN and OR clauses, and lookup is done via primary key. Sounds simple. My first reaction was: “No, there is no difference between these two queries.” I checked the execution plan, and it showed a range scan of primary key for both IN and OR queries. The execution plans were identical. I also did a quick search on the web but couldn’t find any clear description of whether there are any differences in how MySQL handles these types of queries. So my final answer was: “No, there is no difference”.


But then I was presented with test results that showed that IN query was about 100 times faster than OR query. Where OR query took minutes to run, IN query took seconds! Okay, I said to myself, it is time to start digging.
First of all, I checked if MySQL was executing queries the way it showed in EXLAIN. I ran both queries and checked Handler status variables. In both cases, I saw 50k Handler_read_key values:

mysql [localhost] {msandbox} ((none)) > SHOW GLOBAL STATUS LIKE '%Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 3     |
| Handler_read_key           | 50003 |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 31    |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 16    |
+----------------------------+-------+
16 rows in set (0.00 sec)

I did a lot of test runs with different query modification, but still I couldn’t figure out why IN was so fast while OR was so slow. Since MySQL didn’t provide me with enough details on what was going on, I had to use gdb to get some insight. And it indeed provided me very valuable information and got me as close to understanding the root cause of the problem as I could.

Here is how the functions call path look like for IN query:

mysql_select -> JOIN::exec -> do_select -> sub_select -> rr_quick
  -> QUICK_RANGE_SELECT::get_next -> call to InnoDB

And this is how it looks for OR query:

mysql_select -> JOIN::exec -> do_select -> sub_select -> evaluate_join_record

Then I checked what evaluate_join_record did in the source code:

/**
  Process one record of the nested loop join.
    This function will evaluate parts of WHERE/ON clauses that are
    applicable to the partial record on hand and in case of success
    submit this record to the next level of the nested loop.
*/
static enum_nested_loop_state evaluate_join_record(JOIN *join,
 JOIN_TAB *join_tab, int error)

It looks like MySQL is taking one predicate, gets the row from InnoDB, and then compares this row to all other predicates in WHERE. If we have 50000 predicates matching 50000 rows, we will get 2500000000 comparisons! This can also explain why Arg_comparator::compare_binary_string is on top of the pt-pmp output. (I should have mentioned that the primary key is VARBINARY.)

My colleague Singer Wang suggested that if we were dealing with this type of issue, the execution time for OR-query would grow non-linearly when we add more predicates. Let’s check this. Here is the execution time for 50k predicates:

# time mysql -u root -p --socket=/tmp/mysql_sandbox5518.sock test \
 < 50k.sql > /dev/null
real	2m37.679s == 157 seconds

If what I wrote above is correct, MySQL is doing 2500000000 comparisons, which gives us 157/2500000000 = 0.0000000628 second per comparison.
If we add another 10000 predicates, we will have 60k * 60k = 3600000000 comparisons * 0.0000000628 = 226 seconds. Let’s try:

# time mysql -u root -p --socket=/tmp/mysql_sandbox5518.sock test  \
 < 60k.sql > /dev/null
real	3m42.090s = 222 seconds

That is close enough :)
So in this particular case, OR query was so slow because it had to compare a row with every other predicate in the list. However, during my initial research, I managed to find only one mention of MySQL possibly doing some other optimizations for IN query. Here Arjen Lentz said:

That is, if you have a value, the server does a quick binary search in the list to see if that value is part of it. So, a large IN() would be much faster than an equivalent OR construct.

But it is still unclear to me which value and what kind of binary search MySQL could use in my case. It would be nice if someone could clarify this in more detail.

No Comments Yet

Let us know what you think

Subscribe by email