
How Different Datatypes Affect Performance

When I was in high school, my practical arts teacher asked us to create a miniature of a structure. This meant looking at the blueprints of an existing structure and using them to create a small replica. I thought this would be a great opportunity to learn more about reading blueprints: what they are and how they shape structures. So I asked my mom if I could use our single-story house’s blueprints as the basis for my project, and told her I wanted to learn more about reading blueprints. The beauty of it was that I had the luxury of walking around the structure, blueprint in hand, without asking for permission. After all, it was our house.

As I was going around the house, I noticed a small crack in the wall near the living room. If you’re a homeowner, your first instinct would be to get it fixed. In my case, I had the blueprint to see why there was a crack in the wall. I found out that there was no foundation built near that wall when there should have been one. The tension caused by the characteristic behavior of concrete produced the crack (after taking a materials science course in university, I finally understood what this meant). Unfortunately, this wasn’t an easy fix. To fix the crack, we would have to take the entire wall down, put in a foundation, and rebuild the wall. That would prevent future cracks and would also allow us to add a second level should we decide to do so in the future.

Databases are no different from structures in that they require careful thought in planning and design – both logical and physical. Like structures, we need to design databases for capacity and growth. Imagine a young couple with no plans to have kids buying a single-room unit in a high-rise condominium. Should they later decide to have kids, the condo unit would no longer fit their needs. They would either have to stay in the single-room unit even after having kids, or move out and buy a larger property. Either way, they go through some inconvenience.

With the massive amount of data that businesses need to collect nowadays, database growth should be expected, and that growth will affect performance as more and more data needs to be accessed by applications. A simple case in point is the selection of appropriate data types, something that is decided during the design process, even before a table is created. For most database developers, defining the appropriate data type for a column is simply about data integrity: the data type defines the valid values that can be stored in a specific column. However, it’s more than that. Careful selection of data types affects overall performance as the database grows over time.

Disk Performance

Databases are stored on disks. Records are stored in data pages (in SQL Server) or data blocks (in Oracle). Since every data type has a corresponding storage requirement, proper selection not only saves space but also improves disk performance.

Take, for example, an integer data type; I chose it because it has the same storage requirement for both Oracle and SQL Server. Imagine building an application that stores the number of devices assigned to an individual. Since this is a number between 0 (meaning no device) and some small positive value, it wouldn’t make sense to use bigint or int even though both could hold the values. First, negative values make no sense here; we only need zero and positive values. Second, think about the maximum number of devices any individual can have at any given point in time. I personally have 7 devices – 2 laptops, a tablet and 4 phones – and I don’t use all of my phones, so I consider 3 of them inactive. I doubt that any individual has more than 120 devices; you can argue some edge cases, but those are deviations from the norm. Given this, a tinyint data type would suffice because it stores values from 0 to 255. Imagine the space savings of tinyint (1 byte) versus int (4 bytes) or even bigint (8 bytes): that’s a 75% reduction in space if you would have used int, or about 87.5% if you would have used bigint.
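
To make that concrete, here is a minimal T-SQL sketch. The table and column names are made up for illustration; note that SQL Server’s tinyint is unsigned (0 to 255), so the "no negative values" requirement comes for free. In Oracle, a constrained type such as NUMBER(3,0) would play a similar role.

-- Hypothetical table tracking how many devices are assigned to each employee.
-- Using tinyint costs 1 byte per row; int would cost 4 bytes and bigint 8.
CREATE TABLE dbo.EmployeeDevices
(
    employee_id  INT     NOT NULL PRIMARY KEY,
    device_count TINYINT NOT NULL DEFAULT (0)  -- 1 byte, valid range 0 to 255
);

-- The per-value storage cost can be checked directly:
SELECT DATALENGTH(CAST(7 AS TINYINT)) AS tinyint_bytes,  -- 1
       DATALENGTH(CAST(7 AS INT))     AS int_bytes,      -- 4
       DATALENGTH(CAST(7 AS BIGINT))  AS bigint_bytes;   -- 8
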
Now you might be thinking: “We’re talking about bytes here, not gigabytes.” Well, you’re right. But we do expect to store more and more data over time. At 3 bytes saved per row, tinyint instead of int works out to roughly 300GB of disk space saved by the time you reach 100 billion rows. That translates to faster disk operations: reading and writing data pages/blocks, running backups, and running consistency checks all have less data to move.

I/O Performance

Records don’t just stay on disk; they get modified as the business process requires. In SQL Server, before a record is modified, the data page containing it is read from disk into memory. The modification is made in memory, the change is written to the transaction log for durability, and a periodic checkpoint later flushes the modified pages from memory back to the data file.

Think about what this does to your data modification process. The number of records that fit in memory depends on the size of the row, and the size of the row depends on the size of each column that makes it up. In SQL Server, a data page is 8KB, so the smaller the data types, the more rows fit on each page, and the more rows come into the buffer cache with every page read. The more records we can fit in memory, the faster the data modification process will be, since we are not wasting space in the SQL Server buffer cache.

The goal of every performance tuning exercise is to minimize I/O overhead. By selecting the smallest appropriate data type and reducing the storage requirement, we improve I/O performance not only when modifying (and retrieving) data, but also when writing to the transaction log file and back to the data file.

Index Performance

Indexes are a way to speed up access to database records. An index contains keys built from a column or a combination of columns in the underlying table, and creating one consumes disk storage roughly proportional to the size of those columns multiplied by the number of rows. The same reasoning applies here: if you use the smallest yet appropriate data type for your columns, the storage requirement for your indexes will also be smaller.

This matters even more for the column you choose as the clustered index key. In SQL Server, the clustered index key is carried into every non-clustered index as the row locator, so a wide clustered key adds overhead to every non-clustered index on the table. While indexes are a great way to tune queries and return records faster, they are still affected by the data types you select for your columns.
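
Here is a rough sketch of that overhead; the table and index names are made up for illustration. Because every non-clustered index row silently includes the clustered key columns, the same non-clustered index ends up noticeably heavier when the clustered key is wide.

-- Narrow clustered key: each non-clustered index row carries 4 extra bytes.
CREATE TABLE dbo.Orders_NarrowKey
(
    order_id     INT          NOT NULL,
    order_number NVARCHAR(40) NOT NULL,
    order_date   DATE         NOT NULL,
    CONSTRAINT PK_Orders_NarrowKey PRIMARY KEY CLUSTERED (order_id)
);
CREATE NONCLUSTERED INDEX IX_NarrowKey_order_date ON dbo.Orders_NarrowKey (order_date);

-- Wide clustered key: each non-clustered index row carries up to 80 extra bytes.
CREATE TABLE dbo.Orders_WideKey
(
    order_id     INT          NOT NULL,
    order_number NVARCHAR(40) NOT NULL,
    order_date   DATE         NOT NULL,
    CONSTRAINT PK_Orders_WideKey PRIMARY KEY CLUSTERED (order_number)
);
CREATE NONCLUSTERED INDEX IX_WideKey_order_date ON dbo.Orders_WideKey (order_date);

-- sp_spaceused (or sys.dm_db_partition_stats) will show the index size gap
-- widening as the row count grows, even though both indexes were defined identically.
EXEC sp_spaceused 'dbo.Orders_NarrowKey';
EXEC sp_spaceused 'dbo.Orders_WideKey';
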
In summary

I could go on about other side effects of something as simple as data type selection on performance. Choosing the smallest yet appropriate data type is not something to do after the database has been deployed and in operation for some time; it has to happen before you create your first table. If you expect your business to grow, your database will grow along with it, and you want your database to be able to handle that growth. The last thing you want is to start seeing cracks in your wall and have to tear the whole thing down just to build a better foundation.
