Pythian Blog: Technical Track

Big Data on Microsoft Azure – HDInsight

Introduction

A popular way to describe data today is as the new oil. Starting from that, we can define a new horizon and a new way of looking at how we treat and work with data. This work has become extremely challenging and compelling as our data has shifted from structured to unstructured forms, and new features and products have come along to help us handle these humongous data sets. Companies that want a competitive advantage need to address one of the hardest tasks in IT: understanding customer behavior. This is now the hottest and most challenging job for data scientists, because they must know how to wrangle, massage and conform vast chunks of data before applying any AI or ML algorithm.

However, that is not the whole story. Companies miss a big point when designing and implementing their Big Data solutions. We usually describe Big Data as the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to, NoSQL, MapReduce and machine learning. But focusing only on those can blind your decisions, since the results miss the qualitative insights behind your company's vision. That is where "Thick Data" comes into play. The key is to add value to the quantitative data stored in your Big Data solution with qualitative sources: research, surveys, questionnaires, focus groups, interviews, journals, videos, social media analyses and so on. This helps your company thrive by supporting more assertive decisions and by helping you understand not only your key audience but also your customers' behavior.

HDInsight

Since 2013, Microsoft has been helping its customers get the best out of the Big Data ecosystem. Through its partnership with Hortonworks, a Hadoop distributor, it expanded its capabilities and enriched its solutions across the Big Data spectrum. HDInsight is a fully managed, open-source analytics service for enterprises that want to use the Hadoop technology stack to tackle Big Data problems. The platform offers a unique set of products that are entirely managed by Microsoft Azure. In a nutshell, Azure HDInsight is a cloud distribution of Hadoop components from the Hortonworks Data Platform (HDP), which makes it easy, fast and cost-effective to process massive amounts of data in a hyper-scale environment. There are several reasons why companies are looking for managed Big Data solutions nowadays: low cost and scalability, security and compliance, monitoring, productivity, extensibility and, most importantly, the global availability of the selected products.

Cluster types

HDInsight offers different cluster types to address the different problems you may face in your business. Clusters are billed by the hour and follow a decoupled architecture in which storage is separate from compute. That means you can process the data you want and then destroy the cluster, while the data stays in Azure Blob Storage or Azure Data Lake Store. The data remains there, unchanged, once the processing is over. Most companies that use HDInsight adopt this approach to achieve blazing fast performance while reducing infrastructure costs. In an on-premises environment, you cannot turn off the compute layer, since HDFS and the processing layer are coupled. A PaaS (Platform-as-a-Service) solution makes it easy to work around this, and also gives you a broad set of tools to help you manage, orchestrate and monitor the entire data workflow.

HDInsight offers the following cluster types:
  • Apache Hadoop
  • Apache Spark
  • Apache HBase
  • R Server
  • Apache Storm
  • Interactive Query (Hive 2.0)
  • Apache Kafka

* At the time of writing, HDInsight is the only PaaS platform that offers this many fully managed cluster types in a cloud environment.
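To make the cost argument behind the decoupled architecture concrete, here is a minimal sketch that compares an always-on cluster with one that is created, used for a few hours a day and then destroyed, while storage is billed separately. All figures (node count, hourly rate, storage price, hours of use) are hypothetical values chosen for illustration, not actual Azure prices.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass
class ClusterPlan:
    """Hypothetical sizing and pricing, for illustration only."""
    nodes: int
    node_rate_per_hour: float          # assumed price per node-hour
    storage_tb: float
    storage_rate_per_tb_month: float   # assumed price per TB per month


def monthly_cost(plan: ClusterPlan, cluster_hours: float) -> float:
    """Compute is billed only while the cluster exists; storage is billed all month."""
    compute = plan.nodes * plan.node_rate_per_hour * cluster_hours
    storage = plan.storage_tb * plan.storage_rate_per_tb_month
    return compute + storage


plan = ClusterPlan(nodes=6, node_rate_per_hour=0.50,
                   storage_tb=10, storage_rate_per_tb_month=20.0)

# Cluster left running for the whole month.
always_on = monthly_cost(plan, cluster_hours=HOURS_PER_MONTH)

# Cluster created for a 4-hour batch window each day, then destroyed.
ephemeral = monthly_cost(plan, cluster_hours=4 * 30)

print(f"always-on: {always_on:.2f}")
print(f"ephemeral: {ephemeral:.2f}")
```

With these assumed numbers, the storage charge is identical in both cases; the saving comes entirely from not paying for compute while the cluster is idle.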

Common scenarios by cluster type

In this section, we walk through the cluster types and review where each one fits best, along with its everyday use cases.
  • Apache Hadoop
A framework that uses HDFS, YARN resource management and a simple MapReduce programming model to process and analyze batch data in parallel. Common Use Cases/Scenarios = Batch-Processing, Low-Cost Storage, Cost-Effective, Parallel Processing.  
  • Apache Spark
An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. Common Use Cases/Scenarios = Data Streaming, Machine Learning, Interactive Analysis and Fog Computing.  
  • Apache HBase
A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially billions of rows by millions of columns. Common Use Cases/Scenarios = Huge Volumes of Messages, Horizontal NoSQL Scaling, Automatic Sharding, Failover, Billions of Rows and Millions of Columns, Columnar Storage.
  • R Server
A server for hosting and managing parallel distributed R processes. It provides data scientists, statisticians and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. Common Use Cases/Scenarios = Scalable, Distributed R Services, R-Based Analytics Process, Distributed Set of Algorithms – RevoScaleR and MicrosoftML, Operationalization of R Models.  
  • Apache Storm
A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. Common Use Cases/Scenarios = Real-Time Data Normalization, Twitter Analysis, Event Log Monitoring.  
  • Interactive Query (Hive 2.0)
In-memory caching for interactive and faster Hive queries. Common Use Cases/Scenarios = Data Analysis in HiveQL (a SQL-like language), Data Warehouse/Data Mart Scenarios.
  • Apache Kafka
An open-source platform that's used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. Common Use Cases/Scenarios = Messaging Exchange, Website Activity Tracking, Metrics Data Monitoring, Log Aggregation, Event Sourcing and Stream Processing.  
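
The MapReduce programming model mentioned under Apache Hadoop can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are hypothetical, and on a real cluster the map and reduce steps would run in parallel across nodes, with the framework handling the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is what allows parallelism.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data on azure", "big data on hadoop"]
counts = word_count(lines)
print(counts)
```

Because each line is mapped on its own and each key is reduced on its own, both phases can be spread over many machines, which is the property batch processing on Hadoop relies on.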
