Pythian provides expert data collection and ingestion consulting for a machine learning company facing end-of-life Cloudera Express


A large software-as-a-service (SaaS) company depended on its on-premises, proprietary neural network (advanced machine learning) application to continually refine its product. Because the product is AI-based, properly training its underlying algorithms was essential to a well-functioning service that could quickly and accurately adapt to customers’ needs. The client’s on-prem Hadoop cluster (running on Cloudera Express) collected tens of thousands of online audio files and metadata per day, which were stored in HBase and processed using Apache Spark and the Apache Impala SQL query engine. This data was then fed into the neural network application and DataFox, with the goal of continually refining and improving the system’s understanding of, and ability to respond to, these interactions. However, because its version of Cloudera Express was no longer supported, the company needed to re-evaluate its data preparation and ingestion processes to find the most cost-effective and scalable alternative with the least possible disruption.


Pythian’s experience in data science, machine learning and neural networks, including our Machine Learning Partner Specialization from Google and a wealth of expertise with other machine learning tools such as AWS SageMaker, Apache Spark MLlib, TensorFlow, and Apache MXNet, meant we were well-positioned to advise the client on its best options. Pythian advised that the best way to maintain continuity in the machine learning program while improving scalability was to replace the Hadoop cluster with a cloud-native solution, such as AWS with Athena or Google Cloud Platform with Google BigQuery.


Pythian’s recommendation confirmed the client’s hunch that moving its machine learning data collection and ingestion processes to the cloud was the best way to continue its machine learning operations with the least disruption, so that the company’s software could continue improving in near real time. The move would also improve scalability and cost-effectiveness by using ephemeral, cloud-native tools.

Explore Pythian’s popular services:


  • Cloudera
  • Apache Hadoop and Impala
  • DataFox
  • Google Bigtable
  • BigQuery
  • Machine Learning Engine
  • Dataproc
  • AWS EMR
  • Simple Storage Service (S3)

Looking to learn more about Data Science?