Deploying Cloudera Impala on EC2 with Example Live Demo

A little while ago I blogged about (and open-sourced) an Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we've decided to host it ourselves on an EC2 instance. You can try the visualization; we've also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both 'test'), where you can run your own queries on the dataset. Note: the demo is now offline; contact us if you're interested in seeing it.

Deploying Impala on EC2

While there are many tools to deploy a Hadoop cluster on EC2, like Apache Whirr or even Cloudera Manager, I only wanted to use a single instance for the entire cluster. Starting from the base Ubuntu (Precise) image, I added Cloudera's apt repos and installed the single-node (pseudo-distributed) configuration. Impala doesn't support using Derby for the Hive metastore, so I installed MySQL and configured Hive to use it instead. Then I installed Impala using Cloudera's instructions. Impala and all of the Hadoop daemons run comfortably on one m3.2xlarge EC2 instance. Given our modest demands, this may actually be overkill; I over-specced the server while trying to track down a (now-obvious) performance problem involving short-circuit reads.
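For reference, here's a rough sketch of those steps. The repo URLs and package names below are my recollection of the CDH4-era Ubuntu Precise packages, so treat them as assumptions and check Cloudera's installation docs for the current equivalents:

# Cloudera's CDH and Impala apt repositories (CDH4-era URLs; verify before use)
echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" | sudo tee /etc/apt/sources.list.d/cloudera-cdh4.list
echo "deb [arch=amd64] http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala contrib" | sudo tee /etc/apt/sources.list.d/cloudera-impala.list
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update

# Single-node (pseudo-distributed) Hadoop, plus Hive, MySQL, and the JDBC
# connector for the metastore
sudo apt-get install -y hadoop-conf-pseudo hive mysql-server libmysql-java

# Point hive-site.xml's javax.jdo.option.Connection* properties at MySQL
# before starting Hive (see Cloudera's Hive metastore docs), then:
sudo apt-get install -y impala impala-server impala-state-store impala-shell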

Short-Circuit Reads

On the Pythian cluster, we could consistently return a query in around half a second. On EC2, queries took closer to 5 seconds. A bit of investigation showed that in getting the server up and running, I had disabled short-circuit reads, which slows Impala down considerably. While Impala isn't supposed to start without short-circuit reads, it only throws an error if you have them enabled but misconfigured. If short-circuit reads are disabled in the hdfs-site configuration, it will happily start and run very slowly. With the default DEB install on Ubuntu, the libhadoop native library isn't installed anywhere on the loader's library path, which prevents short-circuit reads from working. The easiest solution was to symlink libhadoop into /usr/lib/ and run ldconfig:

ln -s /usr/lib/hadoop/lib/native/libhadoop.so /usr/lib/
ln -s /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0 /usr/lib/
ldconfig
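
The other half is the config side: make sure short-circuit reads are actually turned on in hdfs-site.xml. The snippet below is a sketch only; the socket path is a common CDH default, and the exact set of required properties varies by CDH version:

<!-- In /etc/hadoop/conf/hdfs-site.xml, inside the <configuration> block -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<!-- Newer CDH releases also need a datanode domain socket;
     /var/run/hadoop-hdfs/dn._PORT is the usual default -->
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>

Restart the HDFS and Impala daemons after changing this.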

To confirm whether your cluster has short-circuit reads enabled, you can visit the Impala web interface (by default, port 25000 on any system running impalad) and click on the '/varz' tab. Search for 'dfs.client.read.shortcircuit'; it should be set to 'true'.
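
If you'd rather check from a shell on the box (assuming impalad is running locally on the default port), the same information is one command away:

# Dump Impala's view of the Hadoop configuration and filter for the
# short-circuit setting
curl -s http://localhost:25000/varz | grep -i shortcircuit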

Partitioning

With libhadoop installed and short-circuit reads enabled, the next biggest performance improvement came from partitioning the table on the sensor id. Since all of our web interface queries filter by sensor id, Impala can perform some serious partition elimination: looking at the query profiles, partitioning the table reduced the amount of data read from HDFS from 4GB to 50MB, and the query time from 2.6s to 130ms. The README on GitHub has instructions on how to use dynamic partitioning in Hive to quickly partition the soccer data (a sketch follows below); these steps generalize to any dataset.
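
As a rough sketch of what those instructions boil down to (the table and column names here are illustrative, not the actual schema from the repo):

# Run from the shell; the embedded statements are plain HiveQL
hive -e "
-- Dynamic partitioning is off by default in Hive
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- A partitioned copy of the raw table
CREATE TABLE events_partitioned (ts BIGINT, x INT, y INT)
PARTITIONED BY (sensor_id INT);

-- Hive routes each row to a partition based on the final column of the
-- SELECT, creating partitions as needed
INSERT OVERWRITE TABLE events_partitioned PARTITION (sensor_id)
SELECT ts, x, y, sensor_id FROM events_raw;
"

Once the insert finishes, run a REFRESH from impala-shell so Impala picks up the new table and its partitions.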
