AWS Redshift cluster sizing

Sizing a database is one of the primary DBA functions, no matter the technology. Though Redshift is highly managed for us, we must still address this task. The first thing to note is that in sizing a cluster, we start from an estimate of required storage capacity, since the amount of storage available per node is a fixed amount. While you get the disk space you pay for, AWS guidelines and user experience show that performance can suffer when space becomes tight (above roughly 80% used). So when sizing the cluster for a specific capacity, tack an extra 20% onto your calculations.

AWS currently offers two types of instances for Redshift clusters: dense compute (dc1.*) and dense storage (ds2.*). The dc1 series offers about 6x the CPU and 6x the memory per terabyte of storage, making it a good fit for a regular load of heavy analytic queries involving multiple joins. The ds2 series is more cost effective when the tables are highly denormalized and the analytics can be offloaded to a BI tool such as MicroStrategy. It is possible to migrate to a new node type through snapshots, but the process involves some downtime.

A few points on the Redshift architecture: only the compute nodes hold storage, so the leader node is not part of the capacity calculation (nor do you pay for it). Each storage disk is replicated to two others for redundancy, but these additional disks are not part of the sizing calculations or specifications (though they do affect the disk monitoring calculations we will see later). Here we examine example clusters on a budget of around $10k per year using one-year reserved instances (saving 30+% over on-demand pricing):

(7) dc1.large with 160 GB SSD @ $1,380/year = 1,120 GB @ $9,660/year.
(2) ds2.xlarge with 2 TB HDD @ $4,295/year = 4,000 GB @ $8,590/year.
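To make the headroom arithmetic concrete, here is a minimal Python sketch of the calculation described above. The per-node capacities and yearly prices are the example figures quoted in this post; the 4 TB data estimate, the function name, and the 20% headroom default are illustrative assumptions, not a prescribed tool.

import math

# Per-node storage and yearly reserved pricing, taken from the examples above.
NODE_TYPES = {
    "dc1.large":  {"storage_gb": 160,  "price_per_year": 1380},
    "ds2.xlarge": {"storage_gb": 2000, "price_per_year": 4295},
}

def size_cluster(data_gb, node_type, headroom=0.20):
    """Return (nodes, yearly_cost) for a data estimate plus 20% free-space headroom."""
    spec = NODE_TYPES[node_type]
    required_gb = data_gb * (1 + headroom)   # tack an extra 20% onto the estimate
    nodes = math.ceil(required_gb / spec["storage_gb"])
    return nodes, nodes * spec["price_per_year"]

# Hypothetical 4 TB of data to load:
for node_type in NODE_TYPES:
    nodes, cost = size_cluster(4000, node_type)
    print(f"{node_type}: {nodes} nodes, ~${cost:,}/year")

For that hypothetical 4 TB load, this works out to roughly three ds2.xlarge nodes versus thirty dc1.large nodes, which is where the cost gap discussed next comes from.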

The dc1 instances in this case are around 4x as expensive per terabyte (though still quite the bargain compared to hosting the cluster in-house) and give a 30-80% performance gain, depending on the benchmark. And while you can always add nodes to accommodate data growth, you can only add nodes of the same instance type, which could become quite expensive if you are on the small-disk dc1 instances.

Once your cluster is up, it is vital to monitor disk and CPU utilization so you know when to add new nodes. It is highly advisable to watch the graphs under the Performance tab in the Redshift Console as you add new load to the cluster. There are built-in CloudWatch alarms for disk usage, and these should be configured to alert above 70%. I like to know well in advance when usage is getting there, so I regularly use Period = 5 minutes, Statistic = Average, over 1 consecutive period; but since loads and vacuums can create usage spikes, you might want to configure the alert over more or longer periods (a boto3 sketch of these alarms appears at the end of this post).

While CloudWatch is great for this monitoring, it is convenient to also be able to compute capacity yourself. There are several ways to query disk usage that render subtly different results, and unfortunately none of them will match the stats given by CloudWatch. Here's an example for a 6-node 12 TB cluster that currently shows 32% disk used on each node in the Console, yet the query reports about 23%:

select host
     , sum(capacity)/3 as total
     , sum(used)/3 as used
     , sum(capacity)/3 - sum(used)/3 as free
     , (((sum(used)/3)/(sum(capacity)/3.00)) * 100.00) as pct_used
from STV_PARTITIONS
group by host
order by host;

host  total (MB)  used (MB)  free (MB)  pct_used
0     1904780     450450     1454330    23.65
1     1904780     449895     1454885    23.62
2     1904780     449776     1455004    23.61
3     1904780     450673     1454107    23.66
4     1904780     451483     1453297    23.70
5     1904780     447840     1456940    23.51
The point here is to be wary of querying disk space directly and to rely on CloudWatch and the Console instead.

Compression encoding on your tables can save substantial space (50-75% on average, depending on the encoding) and can improve performance by 100% or more. Encoding can also increase CPU usage, so we want to monitor CPU as we implement it. A cluster can be brought to a standstill by just one node hitting 100% CPU utilization, so we also need to set up CloudWatch alarms to make sure we average < 70% and don't hit spikes > 90% for more than 10 minutes; both alarms are sketched below. Once we bump up against those metrics, it's time to add another node.
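As a rough illustration of the alarm settings described above, here is a minimal boto3 sketch that creates the disk and CPU alarms against the standard AWS/Redshift CloudWatch metrics. The region, cluster identifier, alarm names, and SNS topic ARN are placeholders, and the evaluation periods shown are only one reasonable way to smooth out load and vacuum spikes.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")    # placeholder region

CLUSTER_ID = "my-redshift-cluster"                                  # placeholder cluster identifier
SNS_TOPIC = "arn:aws:sns:us-east-1:123456789012:redshift-alerts"    # placeholder alert topic

# Disk usage: alert when the 5-minute average exceeds 70%.
cloudwatch.put_metric_alarm(
    AlarmName=f"{CLUSTER_ID}-disk-above-70pct",
    Namespace="AWS/Redshift",
    MetricName="PercentageDiskSpaceUsed",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": CLUSTER_ID}],
    Statistic="Average",
    Period=300,                # 5 minutes
    EvaluationPeriods=1,       # lengthen this if loads and vacuums trigger false alarms
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC],
)

# CPU: alert on a sustained average above 70%...
cloudwatch.put_metric_alarm(
    AlarmName=f"{CLUSTER_ID}-cpu-avg-above-70pct",
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": CLUSTER_ID}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,       # roughly 15 minutes of sustained high average
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC],
)

# ...and on spikes above 90% lasting more than 10 minutes (two 5-minute periods).
cloudwatch.put_metric_alarm(
    AlarmName=f"{CLUSTER_ID}-cpu-spike-above-90pct",
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": CLUSTER_ID}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC],
)

Since a single node at 100% CPU can stall the whole cluster, you may also want per-node variants of the CPU alarms using the NodeID dimension (Compute-0, Compute-1, and so on) rather than the cluster-level metric alone.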
