Official Documentation

Service Description

Beyond pure data storage, Microsoft Azure also provides capabilities for evaluating and processing data stored in Azure. Compute and storage resources, even large ones, can be provisioned flexibly and released again once the data has been evaluated, which makes Azure well suited to this kind of additional data processing.

HDInsight enables the analysis of large data sets with Hadoop, a popular open source framework for "Big Data" problems, i.e., evaluations of data sets that cannot be stored in a relational database (e.g., because they are too large or do not have a relational structure). For these situations, Microsoft Azure can automatically provision Hadoop VMs, distribute data and computation across them, and so on. In addition to running MapReduce algorithms efficiently, HDInsight also supports other concepts such as Hive and Pig.
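
To make the MapReduce idea mentioned above more concrete, the following is a minimal sketch of a word count expressed as map, shuffle, and reduce steps. It runs locally in plain Python and does not use Hadoop or any HDInsight API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions chosen purely for illustration. On a real cluster the framework would distribute the map and reduce steps across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory data set."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read from cluster storage.
    sample = ["big data on Azure", "big clusters on demand"]
    print(word_count(sample))
```

Hive and Pig offer higher-level ways to express this kind of aggregation, so a count like this would typically be written there as a short query or script rather than as hand-written map and reduce functions.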

Getting Started

  1. HDInsight (Windows) Learning Path
    10/3/2016, Webpage
  2. HDInsight (Linux) Learning Path
    10/3/2016, Webpage
  3. Introduction to Azure HDInsight – using Hadoop with your applications
    11/18/2016, Video, 0:12:56
  4. Cortana Intelligence Suite End-to-End
    1/16/2017, MVA

Latest Content

Hive Metastore in HDInsight – Tips, Tricks & Best Practices (Blog)
How to use BigDL on Apache Spark for Azure HDInsight (Blog)
Announcing new capabilities of HDInsight and DocumentDB at Strata (Blog)
Nodes in HDInsight (Blog)
Big Data Partner Program (Video)
Public preview: Azure HDInsight 3.6 with Apache Spark 2.1 (Blog)
Announcing preview of Azure HDInsight 3.6 with Apache Spark 2.1 (Blog)
Wiring your older Hadoop clusters to access Azure Data Lake Store (Blog)
Using Oozie SLA on HDInsight clusters (Blog)
Making Azure Data Lake Store the default file system for Hadoop (Blog)
Building advanced analytical solutions faster using Dataiku DSS on HDInsight (Blog)
HDInsight – How to perform Bulk Load with Phoenix? (Blog)

Azure Documentation

1. Overview
     1.1. Hadoop
     1.2. Hadoop components on HDInsight
     1.3. R Server
     1.4. Apache Spark
     1.5. HBase
     1.6. Apache Storm
     1.7. Domain-joined HDInsight preview
     1.8. Kafka preview
2. Get Started
     2.1. Start with Hadoop
     2.2. Start with R Server
     2.3. Start with Spark
     2.4. Start with HBase & NoSQL
     2.5. Start with Storm
     2.6. Start with Interactive Hive preview
     2.7. Start with Kafka preview
     2.8. Hadoop sandbox
     2.9. Data Lake Tools with Hortonworks Sandbox
     2.10. Tools for Visual Studio
     2.11. Use Blob storage
3. How To
     3.1. Use Hadoop for batch queries
          3.1.1. Hive with Hadoop
               Use the Hive View
               Use SSH
               Use Beeline
               Use cURL
               Use PowerShell
               Use .NET SDK
               Use the HDInsight tools for Visual Studio
               Use Remote Desktop
               Use the Query Console
          3.1.2. Use a Java UDF with Hive
          3.1.3. Use MapReduce with Hadoop
               Use Remote Desktop
               Use SSH
               Use cURL
               Use PowerShell
          3.1.4. Run the MapReduce samples
          3.1.5. Use Pig with Hadoop
               Use Remote Desktop
               Use SSH and Pig
               Use PowerShell
               Use the .NET SDK
               Use cURL
          3.1.6. Use DataFu with Pig
          3.1.7. On-demand clusters
          3.1.8. Submit Hadoop jobs
     3.2. Use R Server
          3.2.1. Storage options
          3.2.2. Install RStudio
          3.2.3. Compute contexts
          3.2.4. ScaleR and SparkR
     3.3. Use Spark for in-memory processing
          3.3.1. With Data Lake Store
          3.3.2. With BI tools
          3.3.3. Create standalone app
          3.3.4. Develop apps using Eclipse
          3.3.5. Develop apps using IntelliJ
          3.3.6. Process streaming events
          3.3.7. Predict HVAC performance
          3.3.8. Predict food inspection results
          3.3.9. Analyze website logs
          3.3.10. Use Caffe for deep learning
          3.3.11. Use Zeppelin notebooks
          3.3.12. Jupyter notebook kernels
          3.3.13. Use external packages with Jupyter using cell magic
          3.3.14. Use external packages with Jupyter using script action
          3.3.15. Use a local Jupyter notebook
          3.3.16. Remote jobs with Livy
          3.3.17. Debug jobs remotely with IntelliJ
          3.3.18. Manage resources
          3.3.19. Track and debug jobs
          3.3.20. Known issues
     3.4. Use HBase
          3.4.1. Use Phoenix and SQLLine
          3.4.2. Analyze real-time tweets
          3.4.3. Create clusters on a virtual network
          3.4.4. Configure HBase replication
          3.4.5. Develop an app with Java
     3.5. Use Storm
          3.5.1. Deploy and manage topologies
          3.5.2. Develop data processing apps in SCP
          3.5.3. Storm examples
               Write to Data Lake Store
               Develop Java-based topologies with Maven
               Develop C# topologies with Hadoop tools
               Determine Twitter trending topics
               Process events with C# topologies
               Process events with Java topologies
               Use Power BI with a topology
               Analyze real-time sensor data
               Process vehicle sensor data
               Correlate events over time
               Develop topologies using Python
     3.6. Use domain-joined HDInsight preview
          3.6.1. Configure
          3.6.2. Manage
          3.6.3. Configure Hive policies
     3.7. Use Kafka preview
          3.7.1. Replicate Kafka data
          3.7.2. Use with Spark
          3.7.3. Use with Storm
     3.8. Develop
          3.8.1. Develop Java MapReduce programs
          3.8.2. Develop Scalding MapReduce jobs
          3.8.3. Use HDInsight Tools to create Spark apps
          3.8.4. Use empty edge nodes
          3.8.5. Develop Python streaming programs
          3.8.6. Process and analyze JSON documents
          3.8.7. Serialize data with Avro Library
          3.8.8. Use C# user-defined functions
          3.8.9. Use Python with Hive and Pig
     3.9. Analyze big data
          3.9.1. Analyze using Power Query
          3.9.2. Connect Excel to Hadoop
          3.9.3. Connect using the Hive JDBC driver
          3.9.4. Analyze stored sensor data
          3.9.5. Analyze stored tweets
          3.9.6. Analyze flight delay data
          3.9.7. Generate recommendations with Mahout
          3.9.8. Analyze website logs with Hive
          3.9.9. Analyze Application Insights telemetry logs
     3.10. Extend clusters
          3.10.1. Customize clusters using Bootstrap
          3.10.2. Customize clusters using Script Action
          3.10.3. Add Hive libraries
          3.10.4. Develop script actions
          3.10.5. Use Giraph
          3.10.6. Use Hue
          3.10.7. Use R
          3.10.8. Use Solr
          3.10.9. Use Virtual Network
          3.10.10. Use Zeppelin
          3.10.11. Build HDInsight applications
               Install HDInsight apps
               Install custom apps
               Use REST to install apps
               Publish HDInsight apps to Azure Marketplace
     3.11. Secure
          3.11.1. Use SSH tunneling
          3.11.2. Use SSH from Linux, Unix, OS X
          3.11.3. Use SSH from Windows OS
          3.11.4. Restrict access to data
     3.12. Manage
          3.12.1. Create Linux clusters
               Use Azure PowerShell
               Use cURL and the Azure REST API
               Use the .NET SDK
               Use the Azure CLI
               Use the Azure portal
               Use Azure Resource Manager templates
          3.12.2. Manage Hadoop clusters
               Use .NET SDK
               Use Azure PowerShell
               Use the Azure CLI
          3.12.3. Manage clusters using the Ambari web UI
               Use Ambari REST API
          3.12.4. Add storage accounts
          3.12.5. Upload data for Hadoop jobs
          3.12.6. Import and export data with Sqoop
               Connect with SSH
               Run using cURL
               Run using .NET SDK
               Run using PowerShell
          3.12.7. Use Oozie for workflows
          3.12.8. Use time-based Oozie coordinators
          3.12.9. Cluster and service ports and URIs
          3.12.10. Migrate to Resource Manager development tools
          3.12.11. Availability and reliability
     3.13. Troubleshoot
          3.13.1. Tips for Linux
          3.13.2. Release notes
          3.13.3. Analyze HDInsight logs
          3.13.4. Debug apps with YARN logs
          3.13.5. Enable heap dumps
          3.13.6. Fix errors from WebHCat
          3.13.7. Use Ambari Views to debug Tez Jobs
          3.13.8. More troubleshooting
               Hive settings fix Out of Memory error
               Optimize Hive queries
               Hive query performance
4. Reference
     4.1. PowerShell
     4.2. .NET (Hadoop)
     4.3. .NET (HBase)
     4.4. .NET (Avro)
     4.5. REST
     4.6. REST (Spark)
5. Related
     5.1. Windows clusters
          5.1.1. Migrate Windows clusters to Linux clusters
          5.1.2. Start with Hadoop
          5.1.3. Start with Storm
          5.1.4. Start with HBase
          5.1.5. Run Hadoop MapReduce samples
          5.1.6. Create Hadoop clusters
               Use the Azure portal
               Use .NET SDK
               Use Azure CLI
               Use Azure PowerShell
               Use Resource Manager templates
          5.1.7. Use Solr on clusters
          5.1.8. Use Giraph to process large-scale graphs
          5.1.9. Use Oozie for workflows
          5.1.10. Deploy and manage Storm topologies
          5.1.11. Use Maven to build Java applications
          5.1.12. Use the Tez UI to debug Tez Jobs
          5.1.13. Customize using Script Action
          5.1.14. Availability and reliability
          5.1.15. Access YARN application logs
          5.1.16. Use Apache Phoenix and SQuirreL
          5.1.17. Generate movie recommendations using Mahout
          5.1.18. Analyze flight delay data
          5.1.19. Develop script actions
          5.1.20. Analyze Twitter data
          5.1.21. Manage clusters with Azure portal
          5.1.22. Monitor clusters using the Ambari API
6. Resources
     6.1. Get help on the forum
     6.2. Learning path


Tool Description
Azure Feature Pack for Integration Services (SSIS)
    The SQL Server Integration Services (SSIS) Feature Pack for Azure for SQL Server 2016 is an extension that provides components for SSIS to connect to Azure, transfer data between Azure and on-premises data sources, and process data stored in Azure.


Date        Length   Title
3/9/2017    0:13:20  Big Data Partner Program
2/10/2017   0:16:32  Introducing Apache Kafka on Azure HDInsight
2/6/2017    0:15:40  Cognitive Services, HDInsight, and Power BI on Azure Government
1/10/2017   0:15:31  HDInsight Compliance
12/30/2016  0:12:57  Introduction to Azure HDInsight – using Hadoop with your applications
12/14/2016  0:23:01  Interactive Spark on Azure
12/9/2016   0:16:31  Introducing Apache Kafka on Azure HDInsight
11/18/2016  0:12:56  Introduction to Azure HDInsight – using Hadoop with your applications
11/16/2016  0:11:29  Scalable machine learning with R and Spark
11/8/2016   0:17:23  Securing Azure HDInsight
