Pythian provides expert data collection and ingestion consulting

A machine learning company needed to retire its on-premises Cloudera Express deployment without disrupting its machine learning processes. It turned to Pythian for expert advice.


A large software-as-a-service (SaaS) company depended on its on-premises, proprietary neural network (advanced machine learning) application to continually refine its product. Because the product is AI-based, properly training its underlying algorithms was essential to a well-functioning service that could adapt quickly and accurately to customers’ needs.

The client’s on-premises Hadoop cluster, running on Cloudera Express, collected and stored tens of thousands of online audio files and their metadata per day. The data was stored in HBase and processed using Apache Spark and the Apache Impala SQL query engine, then fed into the neural network application and DataFox to continually refine the system’s understanding of, and ability to respond to, these interactions. However, because the company’s version of Cloudera Express was no longer supported, it needed to re-evaluate its data preparation and ingestion processes and find the most cost-effective, scalable alternative with the least possible disruption.

Pythian used its experience in data science, machine learning, and neural networks to provide recommendations on the best possible options.  

What we did

  • Advised the client on the two best options for maintaining continuity in its machine learning program while improving scalability: moving to Amazon S3 with Athena, or to Google Bigtable combined with BigQuery

Technologies used

  • Cloudera 
  • Apache Hadoop and Impala 
  • DataFox 
  • Google Bigtable
  • BigQuery 
  • Machine Learning Engine and Dataproc
  • AWS EMR and Simple Storage Service (S3)

Key Outcomes

Pythian’s recommendation confirmed the client’s hunch that moving its machine learning data collection and ingestion processes to the cloud was the best way to continue its machine learning operations with the least disruption. 

Improved scalability and cost-effectiveness with cloud-native ephemeral tools 
