
Why are your indexes larger than your actual data?

I sometimes encounter this phenomenon when a database has tables in which the indexes are larger, sometimes much larger, than the actual data stored in the table. Often, such a table becomes unwieldy, which generates a number of questions: Should I partition? Should I shard? Should I use a different database entirely?

The problem is especially common with InnoDB tables because of:

* Clustered indexes - in InnoDB, the primary key is included in every secondary index. If the primary key is large, it can have a detrimental effect on the entire table.
* Slight index overhead - "each index record contains a five-byte header that may be preceded by a variable-length header. The header is used to link together consecutive records, and also in row-level locking."
* Index fragmentation (here's what you can do about this problem; a minimal rebuild sketch also follows the example output below).

Compared to MyISAM, which compresses its string indexes, InnoDB looks quite big.

Here is an actual example:

mysql> SELECT engine,
    ->        count(*) tables,
    ->        concat(round(sum(table_rows)/1000000,2),'M') rows,
    ->        concat(round(sum(data_length)/(1024*1024*1024),2),'G') data,
    ->        concat(round(sum(index_length)/(1024*1024*1024),2),'G') idx,
    ->        concat(round(sum(data_length+index_length)/(1024*1024*1024),2),'G') total_size,
    ->        round(sum(index_length)/sum(data_length),2) idxfrac
    -> FROM information_schema.TABLES
    -> GROUP BY engine
    -> ORDER BY sum(data_length+index_length) DESC
    -> LIMIT 10;
+------------+--------+---------+--------+---------+------------+---------+
| engine     | tables | rows    | data   | idx     | total_size | idxfrac |
+------------+--------+---------+--------+---------+------------+---------+
| InnoDB     |    960 | 278.31M | 75.76G | 119.27G | 195.03G    |    1.57 |
| MyISAM     |   1054 | 140.75M | 16.43G | 8.21G   | 24.64G     |    0.50 |
| MRG_MYISAM |     18 | 4.46M   | 0.48G  | 0.00G   | 0.48G      |    0.00 |
| MEMORY     |     44 | NULL    | 0.00G  | 0.00G   | 0.00G      |    NULL |
| NULL       |      7 | NULL    | NULL   | NULL    | NULL       |    NULL |
+------------+--------+---------+--------+---------+------------+---------+

mysql> SELECT CONCAT(table_schema, '.', table_name),
    ->        CONCAT(ROUND(table_rows / 1000000, 2), 'M') rows,
    ->        CONCAT(ROUND(data_length / ( 1024 * 1024 * 1024 ), 2), 'G') DATA,
    ->        CONCAT(ROUND(index_length / ( 1024 * 1024 * 1024 ), 2), 'G') idx,
    ->        CONCAT(ROUND(( data_length + index_length ) / ( 1024 * 1024 * 1024 ), 2), 'G') total_size,
    ->        ROUND(index_length / data_length, 2) idxfrac
    -> FROM information_schema.TABLES
    -> ORDER BY data_length + index_length DESC
    -> LIMIT 3;
+---------------------------------------+--------+-------+--------+------------+---------+
| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx    | total_size | idxfrac |
+---------------------------------------+--------+-------+--------+------------+---------+
| db1.big_old_table                     | 48.48M | 8.38G | 27.25G | 35.63G     |    3.25 |
| db1.big_new_table                     | 38.63M | 6.32G | 18.13G | 24.44G     |    2.87 |
| db1.another_big_table                 | 17.83M | 2.14G | 9.92G  | 12.06G     |    4.64 |
+---------------------------------------+--------+-------+--------+------------+---------+

Warning: be careful when you run the above queries. Whenever possible, run them on a replica or other non-production server, as they can seriously slow the server down.

As you can see, on the biggest tables the indexes are roughly 3 to 4.5 times the size of the data.
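On the fragmentation bullet above, here is a minimal rebuild sketch. The table name is taken from the example output; substitute your own. Depending on the MySQL version, the rebuild can lock the table or add significant load, so schedule it carefully (or use a tool such as pt-online-schema-change).

-- Option 1: OPTIMIZE TABLE (for InnoDB this is mapped to recreate + analyze).
OPTIMIZE TABLE db1.big_old_table;

-- Option 2: a "null" ALTER, which also rebuilds the table and all of its indexes.
ALTER TABLE db1.big_old_table ENGINE=InnoDB;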
My experience suggests that the root cause of indexes being bigger than the data is usually one of the following:

1) Too many indexes
2) Indexes on big columns - like VARCHAR(255)
3) Redundant or duplicate indexes
4) A combination of all of the above

And what are the root causes of these problems? Here are some potential answers, garnered, once again, from personal experience:

1) A happy trigger finger on the phpMyAdmin "add index" button
2) When the table was small, some queries were slow and indexes were added indiscriminately until those queries were fast (which explains the previous point)
3) Someone tried numerous combinations of multi-column indexes to speed up different queries using the same table(s)
4) Additional indexes were added to paper over (after-the-fact) schema design issues

Please note that there could be more causes. This is only a small sample of possibilities.

How do you solve the problem? This is something I am still learning, but so far I have discovered the following:

1) Find out which indexes are no longer being used. To do that, you can use pt-index-usage. You need the general log (or the slow log captured with long_query_time=0) and you run it through pt-index-usage. This generates a list of ALTER TABLE statements. The tool can also save its findings into a database of your choice (meaning you need to give it a database server if you want that extra data). The report records which indexes were used by the queries in the general/slow log and suggests that you DROP the ones that were not. What's important to remember is that queries which did not run during the time you captured the log - such as weekly or monthly reports - will not show up in the results. I find this tool useful because I can feed it the same log I used for pt-query-digest. However, I do not blindly run all of the suggested ALTER TABLE ... DROP KEY statements; I just use them to help me understand the indexes better. I sometimes use pt-duplicate-key-checker, but usually pt-index-usage finds all, or almost all, of what pt-duplicate-key-checker finds.

2) Pick a table from the top-tables query above and investigate its indexes. I look for duplicate indexes or indexes with overlapping columns (such as index1(a,b), index2(a), index3(b,a,c)) and try to determine why they are there. Next, I go over the queries that access the table in the general/slow log and see how they use the indexes. There are no real shortcuts to this process. Usually, I focus on low-hanging fruit, and I am very careful not to negatively impact other queries that I may not know about. What I try to do is gather the uniqueness/cardinality of the indexes and determine how they help queries (see the sketch after this example).

Here is an example to explain it better:

CREATE TABLE users (
  user_id  INT(4) UNSIGNED NOT NULL AUTO_INCREMENT,
  fullname VARCHAR(255) DEFAULT '',
  active   ENUM('Y','N') DEFAULT 'Y',
  PRIMARY KEY (user_id),
  KEY index1 (fullname)
);

Let's pretend that the users table has 100k rows. Your query searches on the `fullname` column, which is indexed. For most queries, the names may be unique enough to return only a few rows, but names like "John Smith" may return an unwieldy number of results. If you are only interested in active users, then you can make a change like this:

ALTER TABLE users DROP INDEX index1, ADD INDEX index1 (fullname, active);

By doing so, you increase the cardinality (and therefore the selectivity) of the index, because the additional `active` column narrows down the rows that match. In the case of the example, there may be over 100 John Smiths, but only 30 that are active.
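To check whether such a change actually pays off, a minimal sketch (assuming the example users table above) is to compare index cardinality and the optimizer's plan:

-- SHOW INDEX lists one row per indexed column; for index1, the row for the
-- `active` column shows the combined cardinality of (fullname, active),
-- which should be higher than the cardinality of fullname alone.
SHOW INDEX FROM users;

-- EXPLAIN should show index1 in the `key` column and a lower `rows` estimate
-- once both columns of the index can be used.
EXPLAIN SELECT user_id, fullname
FROM users
WHERE fullname = 'John Smith'
  AND active = 'Y';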
3) Try to reduce the size of the index. In the earlier example, `fullname` is a VARCHAR(255). If the table uses UTF8, an index entry on that column can take up to 767 bytes. We need to determine the smallest number of characters we can get away with, which can be difficult without data to compare against. If I were doing this, I would use the following process to find that number:

mysql> set @unique_total=0;

mysql> select count(distinct fullname) into @unique_total from users;
0 row in set (6 min 28.29 sec)

mysql> select @unique_total;
+---------------+
| @unique_total |
+---------------+
|        100448 |
+---------------+
1 row in set (0.00 sec)

mysql> select round(count(distinct left(fullname,200))/@unique_total*100,2) as c200,
    ->        round(count(distinct left(fullname,100))/@unique_total*100,2) as c100,
    ->        round(count(distinct left(fullname,75))/@unique_total*100,2) as c75,
    ->        round(count(distinct left(fullname,50))/@unique_total*100,2) as c50,
    ->        round(count(distinct left(fullname,30))/@unique_total*100,2) as c30,
    ->        round(count(distinct left(fullname,20))/@unique_total*100,2) as c20
    -> from users;
+--------+--------+--------+-------+-------+-------+
| c200   | c100   | c75    | c50   | c30   | c20   |
+--------+--------+--------+-------+-------+-------+
| 100.00 | 100.00 | 100.00 | 99.97 | 99.81 | 92.86 |
+--------+--------+--------+-------+-------+-------+
1 row in set (4 min 13.49 sec)

You can see that if I limit the index to 30 characters, it is still very selective; the shorter the prefix, the less selective it becomes. As a rule of thumb, I aim for a prefix that preserves more than 95% of the column's uniqueness.

If you set the prefix too short, it can backfire. To explain what I mean, take the full name of the actor Daniel Michael Blake Day-Lewis, and let's say our index is limited to 20 characters. A search for "Daniel Michael Blake Day-Lewis" may fetch rows from the storage engine for:

Daniel Michael Blake Day-Lewis
Daniel Michael Blake
Daniel Michael Blake Smith
Daniel Michael Blake Willis

The server then has to filter these rows again after they come back from the storage engine, so an over-short prefix forces the database to do extra work.

4) If at all possible, change a VARCHAR column/index that holds a limited set of distinct values to an ENUM (1 or 2 bytes), or normalize it into another table and keep only the id (usually an INT) in the original table (a minimal sketch of the normalization approach closes this post). Doing so can save a ton of space (remember the VARCHAR and UTF8 comment I made above?). Check the cardinality of your VARCHAR column using SHOW INDEX: if it falls in the 0-255 or 0-65535 range, the column is a good candidate for this conversion.

There are many more techniques for, and things you should know about, reviewing indexes. I hope this post encourages you to learn more of them. If you are interested in doing so, I will be speaking at Percona Live London, where I will cover this subject (and others) in more depth. Hope to see you there!
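As promised in point 4, here is a minimal sketch of normalizing a low-cardinality VARCHAR column out of a big table. The `country` column and the `countries` lookup table are hypothetical (imagine the users table also had a `country VARCHAR(255)` column); any repetitive string column works the same way.

-- 1) Create a lookup table and load the distinct values into it.
CREATE TABLE countries (
  country_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  name       VARCHAR(255) NOT NULL,
  PRIMARY KEY (country_id),
  UNIQUE KEY uniq_name (name)
);

INSERT INTO countries (name)
SELECT DISTINCT country FROM users;

-- 2) Add the small numeric column and backfill it from the lookup table.
ALTER TABLE users ADD COLUMN country_id SMALLINT UNSIGNED;

UPDATE users u
JOIN countries c ON c.name = u.country
SET u.country_id = c.country_id;

-- 3) Drop the wide VARCHAR column (and any index on it) and index the INT instead.
ALTER TABLE users
  DROP COLUMN country,
  ADD INDEX idx_country_id (country_id);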
