
Overview
A large, private technology company was using an on-premises Hadoop cluster to collect and store tens of thousands of online audio interactions generated by the company’s proprietary software, along with the associated metadata, for training purposes. These audio files and their related data were stored in a massive, 215-plus terabyte Apache HBase database, processed with the Apache Impala SQL query engine, and then fed into a proprietary neural network (advanced machine learning) application with the goal of continually refining and improving the system’s understanding of these interactions and its ability to respond to them.
Their HBase database was so large that the cluster did not have enough capacity to retain the necessary snapshots long enough to perform backups. This meant their disaster recovery capabilities were almost nonexistent, with the company facing a recovery process of several weeks if not months (along with possible significant data loss) should a disaster occur. The cluster was also running on a version of Cloudera CDH that was no longer supported, which meant the company was facing a bill of up to $233,000 per year in license fees and other costs to upgrade to a supported release. With the organization also needing to dramatically strengthen its backup and disaster recovery capabilities, they were looking at costs of closer to half a million dollars per year to continue using their on-prem Hadoop cluster effectively.
The company turned to Pythian for expert advice on the state of their current system along with suggestions on the best on-premises or public cloud alternatives for their needs.
What we did
- Pythian’s expert consultants presented three main cloud options: Hadoop in the cloud as a service, cloud-native options that most closely fit the client’s current HBase column store, and cloud-native options centered around an object store and a column store.
Technologies used
- Cloudera: Apache Hadoop, Impala, HBase
- AWS: Redshift, DynamoDB, Simple Storage Service (S3)
- GCP: Bigtable, BigQuery, Google Cloud Storage
- Ceph
- ClickHouse
Key Outcomes
The client acquired cutting-edge insights and hard data on all their on-premises and cloud options from multiple vendors, allowing them to make the most informed decision possible about upgrading their system.
