
Comparing Hadoop Appliances

Today Oracle announced that its Big Data Appliance is available. You can see the press release here.
The appliance was initially announced at Oracle OpenWorld in September. The appliance announced today is pretty similar to what Oracle presented at OpenWorld; the glaring difference is that the Hadoop shipped with the appliance will not be vanilla Apache Hadoop, but rather Cloudera’s Hadoop distribution, and will include Cloudera’s administration software. You can read Oracle’s press release about the collaboration here. Alex Popescu blogged about the implications for Oracle and Cloudera.

At this time there are three Hadoop appliances on the market: Oracle’s Big Data Appliance, Netapp’s Hadooplers and EMC’s Greenplum DCA. It looks like a lot of companies that did not already adopt Hadoop in 2011 are looking to do so in 2012, and some of them may be considering going with an appliance. I want to take a look at some of the reasons a company would be interested in a Hadoop appliance, and at how the different appliances compare.

First, let’s recall why a company would be interested in Hadoop at all. The number one reason is that the company is interested in taking advantage of unstructured or semi-structured data. This data will not fit well into a relational database, but Hadoop offers a scalable and relatively easy-to-program way to work with it. This category includes emails, web server logs, instrumentation of online stores, images, video and external data sets (such as a list of small businesses organized by geographical area). All this data can contain information that is critical to the business and should reside in your data warehouse, but it needs a lot of pre-processing, and this pre-processing will not happen in an Oracle RDBMS.
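
To make that concrete, here’s a minimal sketch (mine, not any vendor’s) of the kind of pre-processing job I have in mind: a MapReduce program that boils raw web server access logs down to hit counts per URL, ready to load into the warehouse. The log format and field positions are assumptions for the sake of illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class AccessLogHits {

    public static class LogMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed combined log format: host ident user [date] "METHOD /path HTTP/1.x" status bytes ...
            String[] fields = line.toString().split(" ");
            if (fields.length > 6) {
                url.set(fields[6]);          // the requested path
                context.write(url, ONE);     // emit (path, 1); the reducer sums them
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "access log hits");
        job.setJarByClass(AccessLogHits.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // raw logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // aggregated output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Nothing fancy, and that’s exactly the point: the heavy lifting is done on cheap commodity nodes before the clean, aggregated result ever touches the warehouse.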

The other reason to look into Hadoop is for information that exists in the database but can’t be efficiently processed within the database. This is a wide use-case, and it is usually labelled “ETL” because the data is going out of an OLTP system and into a data warehouse. I find the label of ETL a poor fit. You use Hadoop when 99% of the work is in the “T” of ETL – processing the data into useful information. When I say “ETL”, most people imagine de-normalization and some light aggregation that turns the OLTP data into the “Excel spreadsheet” format favored by the BI crowd, with bitmap indexes created on top of it. If this is the case, you probably don’t need Hadoop.

Some companies apply fairly advanced statistical analysis to their OLTP data and upload the results to the data warehouse; in some cases machine learning algorithms are used to predict future actions of customers (“people you might know”, “other customers who looked at this product eventually bought”). This kind of deep processing of data can be and is being done on top of the data warehouse, using BI tools. But it is sometimes more efficient, in terms of disk space and processing resources, to do the same analysis on raw data in Hadoop and use the results as part of the DWH or even the OLTP system. It is worth mentioning that Oracle cores are a much more expensive resource than Hadoop cores, and even when Oracle is capable of processing the data more efficiently, IT departments with budget constraints will look at the alternatives.

Now that I’ve convinced all of you to add Hadoop to your 2012 shopping list, let’s look at why you might consider a Hadoop appliance vs. rolling out your own Hadoop cluster.

I wanted to say that the number one reason to roll your own cluster is cost, but Brian Proffitt from ITWorld disagrees and thinks that Oracle’s Big Data Appliance costs about one third of what it would cost to build a comparable cluster on your own: “At $500,000, this may not seem like a bargain, but in reality it is. Typically, commoditized Hadoop systems run at about $4,000 a node. To get this much data storage capacity and power, you would need about 385 nodes…”

I’m looking over the specs and can’t quite see where Mr. Proffitt got his numbers. Oracle’s appliance has 18 servers, each with 12 cores, 48G RAM and 36T of storage. I can get a nice server with 6 cores, 16G RAM and 12T of disk from Dell for $6,000. 54 of those will cost $324,000 and give me more cores (324 vs. 216) and the same amounts of memory (864G) and storage (648T) as Oracle’s offering. Oracle’s offer is competitive, but a roll-your-own cluster is still the more frugal choice.

Another good reason to roll your own is the flexibility: appliances are called that because they have a very specific configuration. You get a certain number of nodes, CPUs, RAM and storage. Oracle’s offering is an 18-node rack. What if you want 12 nodes? Or 23? Tough luck. What if you want less RAM and more CPU? You are still stuck. One of the nicer things about Hadoop is the fact that it does not require fancy hardware, at least not at first. This allows you to run a small-scale proof-of-concept cluster on some of the “test” servers in the data center and maybe a few of the workstations from the development team, and once you demonstrate business value, you can ask for budget. Don’t laugh, some excellent large-scale enterprise clusters started that way. The trick is to make sure the business starts relying on your data, then create a large outage, and when they ask questions you explain: “It was just a proof-of-concept system running on some spare hardware, and we needed the hardware for another project. If you want production Hadoop, it will cost $384,000”.

If roll-your-own is so attractive, why go with an appliance?

An appliance gets you a standard configuration that a large vendor is willing to support.
If you already know how to size a Hadoop system, which hardware to get and how to configure it for maximum performance and reliability, then you are good and can go roll your own system. However, I’ve been at many meetings where customers begged their software vendors for hardware advice that these vendors could not give. Many customers paid Pythian just so we’d tell them what hardware to buy for their next upgrade. Many customers bought very expensive systems that became very expensive bottlenecks because they didn’t match the workloads the customers wanted to run on them. Getting business value out of Hadoop is a difficult problem without adding the extra difficulty of sizing the hardware. In many cases, it’s an excellent idea to let the vendor take care of the sizing exercise and concentrate internal IT resources on the other problems. This becomes even more critical since there’s still a significant shortage of employees with solid Hadoop experience. Your sysadmins may not know what hardware to get, how to size the system, how to best configure Hadoop or how to rescue it when things go wrong, so vendor support becomes extra critical.

Let’s take a look at these appliances and see what you get with each and how they differ:

Oracle’s Big Data Appliance:
You get an 18-node rack with 216 cores, 864G RAM and 648T of disk storage. You also get a 40Gb/s InfiniBand network between the appliance nodes and from the appliance to other Oracle appliances (i.e. Exadata and Exalytics).
Software-wise, you get Cloudera’s Hadoop distribution with Cloudera’s management tools, and you get Oracle’s NoSQL database – a distributed key-value store. You also get a large number of integration tools:
* Oracle Loader for Hadoop, to get data from Hadoop into the Oracle RDBMS
* Oracle Data Integrator Application Adapter for Hadoop, which enables Oracle Data Integrator to generate Hadoop MapReduce programs through an easy-to-use graphical interface
* Oracle Connector R, which gives R users native, high-performance access to the Hadoop Distributed File System (HDFS) and the MapReduce programming framework
* Oracle Direct Connector for Hadoop Distributed File System (ODCH), which enables the Oracle Database SQL engine to access data seamlessly from the Hadoop Distributed File System.
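
To give you a feel for what these connectors are for, here’s a hedged sketch of the pattern they automate, written against plain Hadoop and JDBC APIs rather than the Oracle tools themselves (the real connectors do this in parallel and far more efficiently). The table, paths and connection details are made up, and you’d need the Oracle JDBC (ojdbc) driver on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsToOracle {
    public static void main(String[] args) throws Exception {
        // Read one reducer output file from HDFS (TextOutputFormat: key <TAB> value).
        FileSystem fs = FileSystem.get(new Configuration());
        Path part = new Path("/user/etl/page_counts/part-r-00000");  // hypothetical path

        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(part)));
             // Hypothetical connection details for the warehouse database.
             Connection db = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dwh-host:1521/DWH", "etl", "secret");
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO page_counts (url, hits) VALUES (?, ?)")) {

            String line;
            while ((line = reader.readLine()) != null) {
                String[] kv = line.split("\t");
                insert.setString(1, kv[0]);
                insert.setLong(2, Long.parseLong(kv[1]));
                insert.addBatch();            // batch rows instead of row-by-row round trips
            }
            insert.executeBatch();
        }
    }
}
```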

Why do you want Oracle’s Big Data Appliance?

Because you already have an Oracle database and you need Hadoop to integrate with it. No one else will be able to get data from Hadoop into Oracle faster than Oracle can. They have an unfair advantage – they control the Oracle source code and they know all the internals. Not to mention the InfiniBand link to your Exadata. If the integration of Oracle and Hadoop is a priority for you, this is the right appliance to get.

Or because you want Cloudera’s enterprise-grade cluster management tools. Oracle wants this cluster to be enterprise-ready Hadoop, and they got the right management tools for the job. On the other hand, spending large sums on a super-fast network for a system that was designed to maximize the locality of data processing and minimize network overhead is a strange decision indeed.

Note that Oracle says the connector tools will be available as stand-alone utilities, so you can get good integration even if you decide to go the roll-your-own route or with another appliance.

EMC’s Greenplum HD and Greenplum MR:

EMC has two Hadoop distributions and two Hadoop hardware offerings. They are definitely going into Hadoop in a big way:

  • Greenplum MR – Based on MapR’s improvements to Hadoop, this distribution should be faster and offer better high availability than Apache Hadoop. It also supports NFS access to Hadoop’s file system (which is not HDFS in this distribution), and includes MapR’s control system.
  • Greenplum HD – This is straight-up Apache Hadoop, with HDFS and its usual bucket of tools – Pig, Hive, Zookeeper and HBase. EMC integrated Greenplum HD with two hardware offerings:
    • Isilon NAS – EMC acquired Isilon around the end of 2010, and at the end of January 2012 the company announced that Isilon’s OneFS now natively supports the HDFS protocol, so MapReduce jobs on a Greenplum HD cluster can work on data stored in Isilon’s NAS (see the sketch after this list for what that protocol compatibility means in practice). This combination seems like a direct competitor for Netapp’s Hadoopler offering.
    • Greenplum DCA – A module running Greenplum HD that integrates into EMC’s Greenplum DB appliance, EMC’s relational shared-nothing MPP architecture. Greenplum DCA is available as a quarter-rack add-on.
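
A quick sketch of why “natively supports the HDFS protocol” matters: Hadoop clients and MapReduce jobs only see the filesystem URI in their configuration, so code like the following doesn’t care whether the endpoint is a regular HDFS namenode or a OneFS cluster. The URI and path below are placeholders, and older Hadoop releases call the property fs.default.name rather than fs.defaultFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder endpoint: a namenode or any HDFS-protocol-compatible store.
        conf.set("fs.defaultFS", "hdfs://storage-endpoint:8020");

        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/data/incoming"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}
```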

Similar to Oracle’s Big Data machine, the Greenplum appliance also has 18 nodes: 2 master servers, which process the queries and come up with execution plans, and 16 segment servers, which do the processing and store the data. You get 12 cores per node, 768 GB of RAM and 124 TB of usable uncompressed capacity. This just gives you Greenplum DB; on top of it you’ll need to add the Greenplum HD module. You get to decide how many servers you’ll have there, but each will have 12 cores, 48 GB RAM (wow!) and 28 TB of usable capacity.

Why do you want EMC Greenplum Hadoops?

Get the Greenplum DCA if you use the Greenplum database or plan to. The Greenplum DCA appliance is an add-on for the Greenplum DB appliance; if you don’t want Greenplum DB, it makes no sense to get one. If you do, it can be a powerful big-data analysis machine, with Greenplum’s relational MPP integrated with Hadoop for unstructured data.
If you are not into Greenplum DB, there is plenty to love about Greenplum MR, which solves some nagging availability and performance issues with the Apache and Cloudera Hadoops. Greenplum HD on Isilon will improve data density on the storage, as it uses Isilon’s striping and redundancy algorithms instead of replicating each block 3 times as you would in HDFS, and it can offer higher availability by removing the name-node single point of failure. It may even improve scalability; I’m waiting to see the benchmarks on this one.
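
For context on that last point, HDFS’s three copies of every block are just a replication setting (dfs.replication, default 3), which is exactly what triples the raw capacity you need. A small sketch, with a hypothetical file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);  // default for files this client creates

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/raw/events.log");   // hypothetical, must already exist
        fs.setReplication(file, (short) 2);             // or override it per file
        short replicas = fs.getFileStatus(file).getReplication();
        System.out.println(file + " is stored with " + replicas + " replicas");
    }
}
```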

Note: This section was edited based on feedback from EMC in the comments. The previous version mixed up Greenplum HD and MR, and didn’t include anything about Isilon.

Netapp Hadooplers:
Officially, the name is Netapp Hadoop Open Storage System. It is indeed “open” in the sense that Netapp gives the most flexibility in sizing and configuring the system. Unfortunately, this does not make for a very clear offering.

Netapp’s Hadoop solution is made of “building blocks”. Each block has four data nodes, which Netapp doesn’t specify (it seems it’s up to the customer to choose and buy the servers they want), and a Netapp E2660 that runs the Engenio OS and has 60 disks of 2T or 3T each, running in RAID 5 or RAID 6.
In addition, the customer will need a job tracker node and two name nodes, connected to a FAS2040.

Software-wise, you’ll need to run Cloudera’s Hadoop and Red Hat Linux 5.6. They also bundle (or recommend? I’m not sure) Ganglia as a monitoring system. Netapp has a lot of excellent advice on how to configure all of this in its sizing guide.

Why do you want Netapp Hadoopler?

Because you like working with Netapp as a vendor, because you like the flexibility of almost rolling your own cluster but with the support and guidance of an experienced vendor, and because you think the E2660 can give improved performance for HDFS. They have pretty convincing benchmarks for the latter.

You will definitely end up with a Hadoop like no other. Hadoop was built to use the local storage on each processing node, so moving the storage to the E2660 is a serious departure from the standard architecture. And why put up with RAID 5 or RAID 6, when Hadoop should really replicate each block to different nodes? The Hadoopler is a strange hybrid beast that I have a hard time swallowing.

Discover more about our expertise in Hadoop.
