Automate HDInsight Cluster


A .NET application manages Microsoft Azure resources without user interaction. It automatically creates an Azure HDInsight cluster, processes data using Hive queries, and exports the processed data to the required SQL Database using Sqoop.


When there is a large amount of structured or unstructured data to process, Hadoop is well suited to the job. Microsoft Azure provides the HDInsight service, which is based on Hadoop. Because HDInsight is billed by the hour, we can reduce cost by starting an HDInsight cluster, processing the data, and deleting the cluster once processing completes. All of this can be done through the Microsoft Azure portal, but it is not practical for a user to open the portal every day, create a cluster, and process the data by hand. To minimize the manual work, we created an application that performs all of the above automatically. It can be registered with a scheduler to run daily or as required.

How It Works:


You need the following:

· A Microsoft Azure subscription

· A SQL Database

· An Azure Blob Storage account

Follow the steps below:

1. Upload the raw data to Azure Blob Storage. To manage Azure blobs, refer to this link: Manage Azure Blob

2. Create the Azure HDInsight cluster. To create and delete an HDInsight cluster, refer to this link: Create And Delete HDInsight Cluster.

3. After the HDInsight cluster is created, submit Hive jobs to process the data and store the results in HDFS, then transfer the processed data from HDFS to SQL Database using Sqoop. Both can be done using the Azure SDK. To manage Hive and Sqoop jobs automatically from the application, refer to this link: Submit Hive And SQOOP jobs.

4. After the data has been transferred to SQL Database, delete the HDInsight cluster to save cost.
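
The order of steps 2 to 4 can be sketched as a small orchestration routine. This is only an illustrative outline, written in Python for brevity rather than the .NET of the actual application. The function name `run_pipeline` and its four callables are hypothetical placeholders for the real cluster, Hive, and Sqoop operations; they are not Azure SDK calls. The sketch assumes the raw data from step 1 is already uploaded, and it assumes the cluster should be deleted even when a job fails, so that a failed run does not leave a cluster billing by the hour.

```python
from typing import Callable


def run_pipeline(
    create_cluster: Callable[[], str],
    run_hive_job: Callable[[str], None],
    run_sqoop_export: Callable[[str], None],
    delete_cluster: Callable[[str], None],
) -> None:
    """Create a cluster, process and export the data, then delete the cluster.

    All four callables are hypothetical stand-ins for the real operations.
    """
    # Step 2: create the cluster and keep its name for the later steps.
    cluster_name = create_cluster()
    try:
        # Step 3: process the data with Hive, then export it with Sqoop.
        run_hive_job(cluster_name)
        run_sqoop_export(cluster_name)
    finally:
        # Step 4: always delete the cluster, even if a job raised an error.
        delete_cluster(cluster_name)


if __name__ == "__main__":
    # Stub implementations that only print, to show the call order.
    def create() -> str:
        print("creating cluster")
        return "demo-cluster"

    def hive(name: str) -> None:
        print(f"running Hive job on {name}")

    def sqoop(name: str) -> None:
        print(f"running Sqoop export on {name}")

    def delete(name: str) -> None:
        print(f"deleting {name}")

    run_pipeline(create, hive, sqoop, delete)
```

Placing the delete step in a `finally` block is a design choice of this sketch: any exception from the Hive or Sqoop step still propagates to the caller (and to the scheduler), but only after the cluster has been removed.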

