Official Documentation

Service Description

In addition to pure data storage, Microsoft Azure also provides capabilities for evaluating and processing the data stored in Azure. The ability to flexibly and easily provision even large amounts of compute and storage resources, and to deallocate them once the data has been evaluated, makes Azure an ideal environment for this kind of additional data processing.

HDInsight enables the analysis of large data sets with the aid of Hadoop, an open-source framework widely used for "Big Data" problems, i.e., evaluations of data sets that cannot be stored in a relational database (for example, because they are too large or lack a relational structure). For these scenarios, Microsoft Azure can automatically provision the Hadoop VMs, distribute the data, deploy the computation algorithms, and so on. In addition to running MapReduce algorithms efficiently, HDInsight also supports related technologies such as Hive and Pig.
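To illustrate the MapReduce model that Hadoop distributes across an HDInsight cluster, here is a minimal single-process sketch of the classic word-count job. The map phase emits a (word, 1) pair per word, and the reduce phase sums the counts per word after a sort step that stands in for Hadoop's shuffle; the sample input lines are invented for illustration.

```python
# Minimal in-process sketch of the MapReduce word-count pattern.
# On a real HDInsight/Hadoop cluster, the map and reduce phases run
# distributed across nodes; here they are simulated with generators.
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reduce_phase(pairs):
    """Reduce: after sorting by key (the 'shuffle'), sum counts per word."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


if __name__ == "__main__":
    sample = ["big data big insight", "data at scale"]
    print(dict(reduce_phase(map_phase(sample))))
    # -> {'at': 1, 'big': 2, 'data': 2, 'insight': 1, 'scale': 1}
```

The same logic, expressed as Hive SQL or a Pig script, is what the "Use Hive with Hadoop" and "Use Pig with Hadoop" articles in the documentation below walk through on an actual cluster.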

Getting Started

  1. HDInsight (Linux) Learning Path
    10/3/2016, Webpage
  2. Dat202.1x - Processing Big Data with Hadoop in Azure HDInsight
    5/24/2017, MVA
  3. Dat202.2x - Implementing Real-Time Analytics with Hadoop in Azure HDInsight
    5/24/2017, MVA
  4. Dat202.3x - Implementing Predictive Analytics with Spark in Azure HDInsight
    5/24/2017, MVA
  5. Cortana Intelligence Suite End-to-End
    1/16/2017, MVA
  6. HDInsight: Streaming Petabytes of IoT Data in Real-Time
    2/10/2017, Video, 1:08:27



Azure Documentation

1. Overview
     1.1. About HDInsight and Hadoop
     1.2. Hadoop components on HDInsight
     1.3. R Server
     1.4. Apache Spark
     1.5. HBase
     1.6. Apache Storm
     1.7. Kafka (Preview)
     1.8. Domain-joined HDInsight clusters (Preview)
     1.9. Release notes
          1.9.1. Recent
          1.9.2. Archive
2. Get Started
     2.1. Plan for HDInsight cluster capacity
     2.2. Start with Hadoop
     2.3. Start with Spark
          2.3.1. Create a Spark cluster
          2.3.2. Run queries on a Spark cluster
          2.3.3. Analyze data using BI tools
          2.3.4. Manage cluster resources
          2.3.5. Debug Spark jobs
     2.4. Start with R Server
     2.5. Start with HBase & NoSQL
     2.6. Start with Storm
     2.7. Start with Interactive Query
     2.8. Start with Kafka (Preview)
     2.9. Hadoop sandbox
     2.10. Data Lake Tools with Hortonworks Sandbox
     2.11. Tools for Visual Studio
     2.12. HDInsight using Azure Storage
     2.13. HDInsight using Azure Data Lake Store
3. How To
     3.1. Use Hadoop for batch queries
          3.1.1. Hive with Hadoop
      Use the Hive View
      Use Beeline
      Use cURL
      Use Azure PowerShell
      Use .NET SDK
      Use the HDInsight tools for Visual Studio
      Use Remote Desktop
      Use the Query Console
          3.1.2. Use a Java UDF with Hive
          3.1.3. Use MapReduce with Hadoop
      Use SSH
      Use cURL
      Use Azure PowerShell
      Use Remote Desktop
      Use .NET SDK
          3.1.4. Run the MapReduce samples
          3.1.5. Use Pig with Hadoop
      Use SSH and Pig
      Use Azure PowerShell
      Use the .NET SDK
      Use cURL
      Use Remote Desktop
          3.1.6. Use DataFu with Pig
          3.1.7. On-demand clusters
          3.1.8. Submit Hadoop jobs
     3.2. Use Spark for in-memory processing
          3.2.1. Get started - Spark developer
      Create standalone app
      Use an interactive Spark Shell
      Remote jobs with Livy
          3.2.2. With Data Lake Store
          3.2.3. Create apps using Eclipse
          3.2.4. Create apps using IntelliJ
          3.2.5. What is Spark Streaming?
          3.2.6. Process streaming events
          3.2.7. Predict HVAC performance
          3.2.8. Predict food inspection results
          3.2.9. Analyze website logs
          3.2.10. Use Caffe for deep learning
          3.2.11. Use with Microsoft Cognitive Toolkit
          3.2.12. Use Zeppelin notebooks
          3.2.13. Jupyter notebook kernels
          3.2.14. Use external packages with Jupyter using cell magic
          3.2.15. Use external packages with Jupyter using script action
          3.2.16. Use a local Jupyter notebook
          3.2.17. Debug jobs remotely with IntelliJ through VPN
          3.2.18. Known issues
     3.3. Use R Server
          3.3.1. Storage options
          3.3.2. Install RStudio
          3.3.3. Compute contexts
          3.3.4. ScaleR and SparkR
     3.4. Use HBase
          3.4.1. Use Phoenix and SQLLine
          3.4.2. Create clusters on a virtual network
          3.4.3. Configure HBase replication
          3.4.4. Develop an app with Java
     3.5. Use Storm
          3.5.1. Deploy and manage topologies
          3.5.2. Develop data processing apps in SCP
          3.5.3. Storm examples
      Write to Data Lake Store
      Develop Java-based topologies with Maven
      Develop C# topologies with Hadoop tools
      Process events with C# topologies
      Process events with Java topologies
      Analyze real-time sensor data
      Correlate events over time
      Develop topologies using Python
     3.6. Use domain-joined HDInsight (Preview)
          3.6.1. Configure
          3.6.2. Manage
          3.6.3. Configure Hive policies
     3.7. Use Kafka (Preview)
          3.7.1. Replicate Kafka data
          3.7.2. Configure storage and scalability of Kafka
          3.7.3. Configure high availability of data
          3.7.4. Analyze Kafka logs
          3.7.5. Use with Virtual Networks
          3.7.6. Use with Spark (Structured Streaming)
          3.7.7. Use with Spark (DStream)
          3.7.8. Use with Storm
     3.8. Use Interactive Query
          3.8.1. Use Zeppelin to run Hive queries
     3.9. Develop
          3.9.1. Develop C# streaming MapReduce programs
          3.9.2. Develop Java MapReduce programs
          3.9.3. Develop Scalding MapReduce jobs
          3.9.4. Use HDInsight Tools to create Spark apps
          3.9.5. Use HDInsight Tools to debug Spark apps remotely through SSH
          3.9.6. Use empty edge nodes
          3.9.7. Develop Python streaming programs
          3.9.8. Process and analyze JSON documents
          3.9.9. Serialize data with Avro Library
          3.9.10. Use C# user-defined functions
          3.9.11. Use Python with Hive and Pig
          3.9.12. Create non-interactive authentication .NET HDInsight applications
          3.9.13. Use HDInsight VSCode tool
     3.10. Analyze big data
          3.10.1. Analyze using Power Query
          3.10.2. Connect Power BI to Hadoop
          3.10.3. Connect Excel to Hadoop
          3.10.4. Connect using the Hive JDBC driver
          3.10.5. Analyze stored sensor data
          3.10.6. Analyze stored tweets
          3.10.7. Analyze flight delay data
          3.10.8. Generate recommendations with Mahout
          3.10.9. Analyze website logs with Hive
          3.10.10. Analyze Application Insights telemetry logs
     3.11. Extend clusters
          3.11.1. Use secure enabled storage account
          3.11.2. Customize clusters using Bootstrap
          3.11.3. Customize clusters using Script Action
          3.11.4. Connect HDInsight to your on-premises network
          3.11.5. Develop script actions
          3.11.6. Install and use Presto
          3.11.7. Install or update Mono
          3.11.8. Add Hive libraries
          3.11.9. Use Giraph
          3.11.10. Use Hue
          3.11.11. Use R
          3.11.12. Use Solr
          3.11.13. Use Virtual Network
          3.11.14. Use Zeppelin
          3.11.15. Build HDInsight applications
      Install HDInsight apps
      Install custom apps
      Use REST to install apps
      Publish HDInsight apps to Azure Marketplace
     3.12. Secure
          3.12.1. Use SSH with HDInsight
          3.12.2. Use SSH tunneling
          3.12.3. Restrict access to data
          3.12.4. Authorize users for Ambari Views
          3.12.5. Manage user permissions at the file and folder levels
     3.13. Manage
          3.13.1. Create Linux clusters
      Use Azure PowerShell
      Use cURL and the Azure REST API
      Use the .NET SDK
      Use the Azure CLI
      Use the Azure portal
      Use Azure Resource Manager templates
          3.13.2. Manage Hadoop clusters
      Use .NET SDK
      Use Azure PowerShell
      Use the Azure CLI
          3.13.3. Manage clusters using the Ambari web UI
      Use Ambari REST API
          3.13.4. Add storage accounts
          3.13.5. Upload data for Hadoop jobs
          3.13.6. Multiple HDInsight clusters with Data Lake Store
          3.13.7. Import and export data with Sqoop
      Connect with SSH
      Run using cURL
      Run using .NET SDK
      Run using Azure PowerShell
          3.13.8. Use Oozie for workflows
          3.13.9. Use time-based Oozie coordinators
          3.13.10. Cluster and service ports and URIs
          3.13.11. Migrate to Resource Manager development tools
          3.13.12. Availability and reliability
          3.13.13. Upgrade HDInsight cluster to newer version
          3.13.14. OS patching for HDInsight cluster
     3.14. Monitor
          3.14.1. Use Azure Log Analytics
          3.14.2. Cluster-specific dashboards
          3.14.3. Use queries with Log Analytics
          3.14.4. Monitor cluster performance
     3.15. Troubleshoot
          3.15.1. HBase troubleshooting
          3.15.2. HDFS troubleshooting
          3.15.3. Hive troubleshooting
          3.15.4. Spark troubleshooting
          3.15.5. Storm troubleshooting
          3.15.6. YARN troubleshooting
          3.15.7. Resources
      Information about using HDInsight on Linux
      Hadoop memory and performance
      Access Hadoop YARN application logs on Linux
      Enable heap dumps for Hadoop services
      Understand and resolve WebHCat errors
      Hive settings fix Out of Memory error
      Use Ambari Views to debug Tez Jobs
      Optimize Hive queries
4. Reference
     4.1. Code samples
     4.2. Azure PowerShell
     4.3. .NET (Hadoop)
     4.4. .NET (HBase)
     4.5. .NET (Avro)
     4.6. REST
     4.7. REST (Spark)
5. Related
     5.1. Windows clusters
          5.1.1. Migrate Windows clusters to Linux clusters
          5.1.2. Migrate .NET solutions to Linux clusters
          5.1.3. Run Hadoop MapReduce samples
          5.1.4. Use Solr on clusters
          5.1.5. Use Giraph to process large-scale graphs
          5.1.6. Use Oozie for workflows
          5.1.7. Deploy and manage Storm topologies
          5.1.8. Use Maven to build Java applications
          5.1.9. Use the Tez UI to debug Tez Jobs
          5.1.10. Customize using Script Action
          5.1.11. Access YARN application logs
          5.1.12. Use Apache Phoenix and SQuirreL
          5.1.13. Generate movie recommendations using Mahout
          5.1.14. Analyze flight delay data
          5.1.15. Develop script actions
          5.1.16. Analyze Twitter data
          5.1.17. Manage clusters with Azure portal
          5.1.18. Monitor clusters using the Ambari API
6. Resources
     6.1. Azure Roadmap
     6.2. Get help on the forum
     6.3. Learning path
     6.4. Microsoft Professional Program for Big Data
     6.5. Pricing calculator
     6.6. Windows tools for HDInsight

Online Training Content

Date Title
5/24/2017 Dat202.1x - Processing Big Data with Hadoop in Azure HDInsight
5/24/2017 Dat202.2x - Implementing Real-Time Analytics with Hadoop in Azure HDInsight
5/24/2017 Dat202.3x - Implementing Predictive Analytics with Spark in Azure HDInsight
5/24/2017 Orchestrating Big Data with Azure Data Factory
1/16/2017 Cortana Intelligence Suite End-to-End
3/21/2016 Building Blocks: Big Data and Machine Learning
6/3/2015 Introduction to Microsoft Azure - Advanced Services
4/29/2014 Implementing Big Data Analysis


Tool Description
HDInsight: Scale Horizontally
    Scale-HDInsightClusterNodes is a simple PowerShell workflow runbook that helps you automate scaling your HDInsight clusters in or out, depending on your needs. The script receives 4 parameters: ResourceGroupName: the name of the resource group where the cluster re…
Azure Feature Pack for Integration Services (SSIS)
    SQL Server Integration Services (SSIS) Feature Pack for Azure for SQL Server 2016 is an extension that provides components for SSIS to connect to Azure, transfer data between Azure and on-premises data sources, and process data stored in Azure.


Date Title Length
10/10/2017 Understanding big data on Azure - structured, unstructured and streaming | BRK2293 1:15:05
9/30/2017 Internet of things with Azure Cosmos DB 0:52:56
9/30/2017 Microsoft makes artificial intelligence real 0:19:37
9/29/2017 Architect your big data solutions with SQL Data Warehouse and Azure Analysis Services 1:08:16
9/29/2017 Enterprise security and monitoring for big data solutions on Azure HDInsight 0:50:57
9/29/2017 Streaming Big Data on Azure with HDInsight Kafka, Storm and Spark 1:14:11
9/28/2017 Building Petabyte scale Interactive Data warehouse in Azure HDInsight 1:09:47
9/28/2017 Operationalizing Microsoft Cognitive Toolkit and TensorFlow models with HDInsight Spark 0:46:35
9/28/2017 Data on Azure: The big picture 1:12:13
9/28/2017 Delivering enterprise BI with Azure Analysis Services 1:15:30
