HDInsight

Official Documentation

Service Description

In addition to pure data storage, Microsoft Azure also provides capabilities for evaluating and processing the data stored there. The ability to quickly provision even very large compute and storage resources, and to deactivate them again once the data has been evaluated, makes Azure an ideal environment for this kind of data processing.

HDInsight enables the analysis of large data sets with the aid of Hadoop, an open-source framework widely used for "Big Data" problems, i.e., evaluations of data sets that cannot be stored in a relational database (for example, because they are too large or lack a relational structure). In these situations, Microsoft Azure automatically provisions the Hadoop VMs, the distributed data, the computation algorithms, and so on. In addition to efficiently computing MapReduce jobs, HDInsight also supports related technologies such as Hive and Pig.
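
To make the MapReduce model above concrete, here is a minimal word-count sketch using Hadoop Streaming with Python, which Linux-based HDInsight clusters support (see "Develop Python streaming programs" in the documentation outline below). The file names mapper.py and reducer.py, the sample input path, and the streaming-jar location are illustrative assumptions, not taken from the official documentation.

    #!/usr/bin/env python
    # mapper.py - read text from stdin and emit one "word<TAB>1" pair per word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py - sum the counts per word; Hadoop delivers the mapper output sorted by key
    import sys

    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word = word
            current_count = int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

On a Linux-based HDInsight cluster, such scripts would typically be submitted over SSH with Hadoop Streaming, for example: yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /example/data/gutenberg/davinci.txt -output /example/wordcountout (the jar path and sample input reflect common HDInsight defaults and should be verified on the cluster).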

Getting Started

  1. HDInsight (Windows) Learning Path
    10/3/2016, Webpage
  2. HDInsight (Linux) Learning Path
    10/3/2016, Webpage
  3. DAT202.1x - Processing Big Data with Hadoop in Azure HDInsight
    5/24/2017, MVA
  4. DAT202.2x - Implementing Real-Time Analytics with Hadoop in Azure HDInsight
    5/24/2017, MVA
  5. DAT202.3x - Implementing Predictive Analytics with Spark in Azure HDInsight
    5/24/2017, MVA
  6. Cortana Intelligence Suite End-to-End
    1/16/2017, MVA
  7. HDInsight: Streaming Petabytes of IoT Data in Real-Time
    2/10/2017, Video, 1:08:27

Latest Content

Subscribe to News about HDInsight


Azure Documentation

1. Overview
     1.1. About HDInsight and Hadoop
     1.2. Hadoop components on HDInsight
     1.3. R Server
     1.4. Apache Spark
     1.5. HBase
     1.6. Apache Storm
     1.7. Kafka (Preview)
     1.8. Domain-joined HDInsight clusters (Preview)
     1.9. Release notes
          1.9.1. Recent
          1.9.2. Archive
2. Get Started
     2.1. Start with Hadoop
     2.2. Start with Spark
          2.2.1. Create a Spark cluster
          2.2.2. Run queries on a Spark cluster
          2.2.3. Analyze data using BI tools
          2.2.4. Manage cluster resources
          2.2.5. Debug Spark jobs
     2.3. Start with R Server
     2.4. Start with HBase & NoSQL
     2.5. Start with Storm
     2.6. Start with Interactive Hive (Preview)
     2.7. Start with Kafka (Preview)
     2.8. Hadoop sandbox
     2.9. Data Lake Tools with Hortonworks Sandbox
     2.10. Tools for Visual Studio
     2.11. HDInsight using Azure Storage
     2.12. HDInsight using Azure Data Lake Store
3. How To
     3.1. Use Hadoop for batch queries
          3.1.1. Hive with Hadoop
               3.1.1.1. Use the Hive View
               3.1.1.2. Use Beeline
               3.1.1.3. Use cURL
               3.1.1.4. Use PowerShell
               3.1.1.5. Use .NET SDK
               3.1.1.6. Use the HDInsight tools for Visual Studio
               3.1.1.7. Use Remote Desktop
               3.1.1.8. Use the Query Console
          3.1.2. Use a Java UDF with Hive
          3.1.3. Use MapReduce with Hadoop
               3.1.3.1. Use SSH
               3.1.3.2. Use cURL
               3.1.3.3. Use PowerShell
               3.1.3.4. Use Remote Desktop
          3.1.4. Run the MapReduce samples
          3.1.5. Use Pig with Hadoop
               3.1.5.1. Use SSH and Pig
               3.1.5.2. Use PowerShell
               3.1.5.3. Use the .NET SDK
               3.1.5.4. Use cURL
               3.1.5.5. Use Remote Desktop
          3.1.6. Use DataFu with Pig
          3.1.7. On-demand clusters
          3.1.8. Submit Hadoop jobs
     3.2. Use R Server
          3.2.1. Storage options
          3.2.2. Install RStudio
          3.2.3. Compute contexts
          3.2.4. ScaleR and SparkR
     3.3. Use Spark for in-memory processing
          3.3.1. With Data Lake Store
          3.3.2. Create standalone app
          3.3.3. Create apps using Eclipse
          3.3.4. Create apps using IntelliJ
          3.3.5. Process streaming events
          3.3.6. Predict HVAC performance
          3.3.7. Predict food inspection results
          3.3.8. Analyze website logs
          3.3.9. Use Caffe for deep learning
          3.3.10. Use with Microsoft Cognitive Toolkit
          3.3.11. Use Zeppelin notebooks
          3.3.12. Jupyter notebook kernels
          3.3.13. Use external packages with Jupyter using cell magic
          3.3.14. Use external packages with Jupyter using script action
          3.3.15. Use a local Jupyter notebook
          3.3.16. Remote jobs with Livy
          3.3.17. Debug jobs remotely with IntelliJ through VPN
          3.3.18. Known issues
     3.4. Use HBase
          3.4.1. Use Phoenix and SQLLine
          3.4.2. Analyze real-time tweets
          3.4.3. Create clusters on a virtual network
          3.4.4. Configure HBase replication
          3.4.5. Develop an app with Java
     3.5. Use Storm
          3.5.1. Deploy and manage topologies
          3.5.2. Develop data processing apps in SCP
          3.5.3. Storm examples
               3.5.3.1. Write to Data Lake Store
               3.5.3.2. Develop Java-based topologies with Maven
               3.5.3.3. Develop C# topologies with Hadoop tools
               3.5.3.4. Process events with C# topologies
               3.5.3.5. Process events with Java topologies
               3.5.3.6. Use Power BI with a topology
               3.5.3.7. Analyze real-time sensor data
               3.5.3.8. Correlate events over time
               3.5.3.9. Develop topologies using Python
     3.6. Use domain-joined HDInsight (Preview)
          3.6.1. Configure
          3.6.2. Manage
          3.6.3. Configure Hive policies
     3.7. Use Kafka (Preview)
          3.7.1. Replicate Kafka data
          3.7.2. Configure storage and scalability of Kafka
          3.7.3. Configure high availability of data
          3.7.4. Use with Virtual Networks
          3.7.5. Use with Spark (Structured Streaming)
          3.7.6. Use with Spark (DStream)
          3.7.7. Use with Storm
     3.8. Develop
          3.8.1. Develop C# streaming MapReduce programs
          3.8.2. Develop Java MapReduce programs
          3.8.3. Develop Scalding MapReduce jobs
          3.8.4. Use HDInsight Tools to create Spark apps
          3.8.5. Use HDInsight Tools to debug Spark apps remotely through SSH
          3.8.6. Use empty edge nodes
          3.8.7. Develop Python streaming programs
          3.8.8. Process and analyze JSON documents
          3.8.9. Serialize data with Avro Library
          3.8.10. Use C# user-defined functions
          3.8.11. Use Python with Hive and Pig
     3.9. Analyze big data
          3.9.1. Analyze using Power Query
          3.9.2. Connect Excel to Hadoop
          3.9.3. Connect using the Hive JDBC driver
          3.9.4. Analyze stored sensor data
          3.9.5. Analyze stored tweets
          3.9.6. Analyze flight delay data
          3.9.7. Generate recommendations with Mahout
          3.9.8. Analyze website logs with Hive
          3.9.9. Analyze Application Insights telemetry logs
     3.10. Extend clusters
          3.10.1. Use secure enabled storage account
          3.10.2. Customize clusters using Bootstrap
          3.10.3. Customize clusters using Script Action
          3.10.4. Connect HDInsight to your on-premises network
          3.10.5. Develop script actions
          3.10.6. Install and use Presto
          3.10.7. Install or update Mono
          3.10.8. Add Hive libraries
          3.10.9. Use Giraph
          3.10.10. Use Hue
          3.10.11. Use R
          3.10.12. Use Solr
          3.10.13. Use Virtual Network
          3.10.14. Use Zeppelin
          3.10.15. Build HDInsight applications
               3.10.15.1. Install HDInsight apps
               3.10.15.2. Install custom apps
               3.10.15.3. Use REST to install apps
               3.10.15.4. Publish HDInsight apps to Azure Marketplace
     3.11. Secure
          3.11.1. Use SSH with HDInsight
          3.11.2. Use SSH tunneling
          3.11.3. Restrict access to data
     3.12. Manage
          3.12.1. Create Linux clusters
               3.12.1.1. Use Azure PowerShell
               3.12.1.2. Use cURL and the Azure REST API
               3.12.1.3. Use the .NET SDK
               3.12.1.4. Use the Azure CLI
               3.12.1.5. Use the Azure portal
               3.12.1.6. Use Azure Resource Manager templates
          3.12.2. Manage Hadoop clusters
               3.12.2.1. Use .NET SDK
               3.12.2.2. Use Azure PowerShell
               3.12.2.3. Use the Azure CLI
          3.12.3. Manage clusters using the Ambari web UI
               3.12.3.1. Use Ambari REST API
          3.12.4. Add storage accounts
          3.12.5. Upload data for Hadoop jobs
          3.12.6. Multiple HDInsight clusters with Data Lake Store
          3.12.7. Import and export data with Sqoop
               3.12.7.1. Connect with SSH
               3.12.7.2. Run using cURL
               3.12.7.3. Run using .NET SDK
               3.12.7.4. Run using PowerShell
          3.12.8. Use Oozie for workflows
          3.12.9. Use time-based Oozie coordinators
          3.12.10. Cluster and service ports and URIs
          3.12.11. Migrate to Resource Manager development tools
          3.12.12. Availability and reliability
          3.12.13. Upgrade HDInsight cluster to newer version
          3.12.14. OS patching for HDInsight cluster
     3.13. Troubleshoot
          3.13.1. HBASE troubleshooting
          3.13.2. HDFS troubleshooting
          3.13.3. HIVE troubleshooting
          3.13.4. Spark troubleshooting
          3.13.5. STORM troubleshooting
          3.13.6. YARN troubleshooting
          3.13.7. Resources
               3.13.7.1. Information about using HDInsight on Linux
               3.13.7.2. Hadoop memory and performance
               3.13.7.3. Access Hadoop YARN application logs on Linux
               3.13.7.4. Enable heap dumps for Hadoop services
               3.13.7.5. Understand and resolve WebHCat errors
               3.13.7.6. Hive settings fix Out of Memory error
               3.13.7.7. Use Ambari Views to debug Tez Jobs
               3.13.7.8. Optimize Hive queries
4. Reference
     4.1. Code samples
     4.2. PowerShell
     4.3. .NET (Hadoop)
     4.4. .NET (HBase)
     4.5. .NET (Avro)
     4.6. REST
     4.7. REST (Spark)
5. Related
     5.1. Windows clusters
          5.1.1. Migrate Windows clusters to Linux clusters
          5.1.2. Migrate .NET solutions to Linux clusters
          5.1.3. Run Hadoop MapReduce samples
          5.1.4. Use Solr on clusters
          5.1.5. Use Giraph to process large-scale graphs
          5.1.6. Use Oozie for workflows
          5.1.7. Deploy and manage Storm topologies
          5.1.8. Use Maven to build Java applications
          5.1.9. Use the Tez UI to debug Tez Jobs
          5.1.10. Customize using Script Action
          5.1.11. Access YARN application logs
          5.1.12. Use Apache Phoenix and SQuirreL
          5.1.13. Generate movie recommendations using Mahout
          5.1.14. Analyze flight delay data
          5.1.15. Develop script actions
          5.1.16. Analyze Twitter data
          5.1.17. Manage clusters with Azure portal
          5.1.18. Monitor clusters using the Ambari API
6. Resources
     6.1. Azure Roadmap
     6.2. Get help on the forum
     6.3. Learning path
     6.4. Microsoft Professional Program for Big Data
     6.5. Pricing calculator
     6.6. Windows tools for HDInsight

Online Training Content

Date Title
5/24/2017 DAT202.1x - Processing Big Data with Hadoop in Azure HDInsight
5/24/2017 DAT202.2x - Implementing Real-Time Analytics with Hadoop in Azure HDInsight
5/24/2017 DAT202.3x - Implementing Predictive Analytics with Spark in Azure HDInsight
5/24/2017 Orchestrating Big Data with Azure Data Factory
1/16/2017 Cortana Intelligence Suite End-to-End
3/21/2016 Building Blocks: Big Data and Machine Learning
6/3/2015 Einführung in Microsoft Azure – Advanced Services (Introduction to Microsoft Azure – Advanced Services)
4/29/2014 Implementing Big Data Analysis

Tools

Tool Description
Azure Feature Pack for Integration Services (SSIS)  The SQL Server Integration Services (SSIS) Feature Pack for Azure for SQL Server 2016 is an extension that provides components that let SSIS connect to Azure, transfer data between Azure and on-premises data sources, and process data stored in Azure.

Videos

Date Title Length
7/18/2017 Spark Performance Tuning - Part 4 0:26:38
7/7/2017 Spark Performance Tuning - Part 3 0:33:08
6/29/2017 Create Spark Applications with the Azure Toolkit for IntelliJ 0:06:01
6/29/2017 Debug HDInsight Spark Applications with Azure Toolkit for IntelliJ 0:06:02
6/27/2017 Spark Performance Tuning - Part 2 0:35:03
6/2/2017 Spark Performance Series - Part 1 0:26:44
5/10/2017 Using StorSimple data with services in Azure (Media Services, HDInsights, AzureML, etc.) 0:25:31
5/10/2017 Lambda Architecture for Connected Car Fleet Management 0:00:00
5/10/2017 A lap around Azure HDInsight and Cosmos DB Open Source Analytics + NoSQL 0:29:39
5/10/2017 Speedup Interactive Analytics on Petabytes of Data on Azure 0:32:10
