HDInsight

Official Documentation

Service Description

In addition to pure data storage, Microsoft Azure also provides capabilities for evaluating and processing data stored in Azure. Because even large compute and storage resources can be provisioned quickly and deactivated once the evaluation is complete, Azure is an ideal environment for this kind of additional data processing.

HDInsight enables the analysis of large data sets with the aid of Hadoop, a popular open-source framework for solving "Big Data" problems, i.e., evaluations of data sets that cannot be handled in a relational database (for example, because they are too large or lack a relational structure). In these situations, Microsoft Azure can automatically provision Hadoop VMs, distribute data and computation algorithms, and so on. In addition to efficiently executing MapReduce algorithms, HDInsight also supports related frameworks such as Hive and Pig.
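To illustrate the MapReduce model mentioned above, here is a minimal local sketch of the canonical word-count example. The function names and the local simulation are illustrative only; on an HDInsight cluster the same map and reduce logic would typically run via Hadoop Streaming across many nodes, with Hadoop handling the shuffle/sort between the two phases.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    Hadoop delivers pairs grouped by key; locally we sort to simulate that.
    """
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    # Local simulation of map -> shuffle/sort -> reduce over stdin.
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")
```

The same split into a stateless map step and a key-grouped reduce step is what lets Hadoop parallelize the computation across a cluster.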

Getting Started

  1. HDInsight (Windows) Learning Path
    10/3/2016, Webpage
  2. HDInsight (Linux) Learning Path
    10/3/2016, Webpage
  3. DAT202.1x - Processing Big Data with Hadoop in Azure HDInsight
    5/24/2017, MVA
  4. DAT202.2x - Implementing Real-Time Analytics with Hadoop in Azure HDInsight
    5/24/2017, MVA
  5. DAT202.3x - Implementing Predictive Analytics with Spark in Azure HDInsight
    5/24/2017, MVA
  6. Cortana Intelligence Suite End-to-End
    1/16/2017, MVA
  7. HDInsight: Streaming Petabytes of IoT Data in Real-Time
    2/10/2017, Video, 1:08:27

Latest Content

Title (Type)
Spark Performance Tuning - Part 2 (Video)
Run H2O.ai in R on Azure HDInsight (Blog)
Microsoft R Server 9.1 on HDInsight is available! (Blog)
Announcing public preview of Apache Kafka on HDInsight with Azure Managed disks (Blog)
HDInsight tools for IntelliJ May updates (Blog)
Spark Performance Series - Part 1 (Video)
HDInsight: BUILD Hive Lab is available now (Blog)
Announcing Public Preview of HDInsight HBase on Azure Data Lake Store (Blog)
Azure HDInsight: How to run Presto in one simple step and query across data sources such as Cosmos DB, SQL DB & Hive (Blog)
Allowing multiple users to access R Server on HDInsight (Blog)
Introducing the Azure app (preview) (Blog)

Azure Documentation

1. Overview
     1.1. About HDInsight and Hadoop
     1.2. Hadoop components on HDInsight
     1.3. R Server
     1.4. Apache Spark
     1.5. HBase
     1.6. Apache Storm
     1.7. Kafka (Preview)
     1.8. Domain-joined HDInsight clusters (Preview)
     1.9. Release notes
          1.9.1. Recent
          1.9.2. Archive
2. Get Started
     2.1. Start with Hadoop
     2.2. Start with Spark
     2.3. Start with R Server
     2.4. Start with HBase & NoSQL
     2.5. Start with Storm
     2.6. Start with Interactive Hive (Preview)
     2.7. Start with Kafka (Preview)
     2.8. Hadoop sandbox
     2.9. Data Lake Tools with Hortonworks Sandbox
     2.10. Tools for Visual Studio
     2.11. HDInsight using Azure Storage
     2.12. HDInsight using Azure Data Lake Store
3. How To
     3.1. Use Hadoop for batch queries
          3.1.1. Hive with Hadoop
               3.1.1.1. Use the Hive View
               3.1.1.2. Use Beeline
               3.1.1.3. Use cURL
               3.1.1.4. Use PowerShell
               3.1.1.5. Use .NET SDK
               3.1.1.6. Use the HDInsight tools for Visual Studio
               3.1.1.7. Use Remote Desktop
               3.1.1.8. Use the Query Console
          3.1.2. Use a Java UDF with Hive
          3.1.3. Use MapReduce with Hadoop
               3.1.3.1. Use SSH
               3.1.3.2. Use cURL
               3.1.3.3. Use PowerShell
               3.1.3.4. Use Remote Desktop
          3.1.4. Run the MapReduce samples
          3.1.5. Use Pig with Hadoop
               3.1.5.1. Use SSH and Pig
               3.1.5.2. Use PowerShell
               3.1.5.3. Use the .NET SDK
               3.1.5.4. Use cURL
               3.1.5.5. Use Remote Desktop
          3.1.6. Use DataFu with Pig
          3.1.7. On-demand clusters
          3.1.8. Submit Hadoop jobs
     3.2. Use R Server
          3.2.1. Storage options
          3.2.2. Install RStudio
          3.2.3. Compute contexts
          3.2.4. ScaleR and SparkR
     3.3. Use Spark for in-memory processing
          3.3.1. With Data Lake Store
          3.3.2. With BI tools
          3.3.3. Create standalone app
          3.3.4. Create apps using Eclipse
          3.3.5. Create apps using IntelliJ
          3.3.6. Process streaming events
          3.3.7. Predict HVAC performance
          3.3.8. Predict food inspection results
          3.3.9. Analyze website logs
          3.3.10. Use Caffe for deep learning
          3.3.11. Use with Microsoft Cognitive Toolkit
          3.3.12. Use Zeppelin notebooks
          3.3.13. Jupyter notebook kernels
          3.3.14. Use external packages with Jupyter using cell magic
          3.3.15. Use external packages with Jupyter using script action
          3.3.16. Use a local Jupyter notebook
          3.3.17. Remote jobs with Livy
          3.3.18. Debug jobs remotely with IntelliJ through VPN
          3.3.19. Manage resources
          3.3.20. Track and debug jobs
          3.3.21. Known issues
     3.4. Use HBase
          3.4.1. Use Phoenix and SQLLine
          3.4.2. Analyze real-time tweets
          3.4.3. Create clusters on a virtual network
          3.4.4. Configure HBase replication
          3.4.5. Develop an app with Java
     3.5. Use Storm
          3.5.1. Deploy and manage topologies
          3.5.2. Develop data processing apps in SCP
          3.5.3. Storm examples
               3.5.3.1. Write to Data Lake Store
               3.5.3.2. Develop Java-based topologies with Maven
               3.5.3.3. Develop C# topologies with Hadoop tools
               3.5.3.4. Determine Twitter trending topics
               3.5.3.5. Process events with C# topologies
               3.5.3.6. Process events with Java topologies
               3.5.3.7. Use Power BI with a topology
               3.5.3.8. Analyze real-time sensor data
               3.5.3.9. Process vehicle sensor data
               3.5.3.10. Correlate events over time
               3.5.3.11. Develop topologies using Python
     3.6. Use domain-joined HDInsight (Preview)
          3.6.1. Configure
          3.6.2. Manage
          3.6.3. Configure Hive policies
     3.7. Use Kafka (Preview)
          3.7.1. Replicate Kafka data
          3.7.2. Configure storage and scalability of Kafka
          3.7.3. Use with Virtual Networks
          3.7.4. Use with Spark (Structured Streaming)
          3.7.5. Use with Spark (DStream)
          3.7.6. Use with Storm
     3.8. Develop
          3.8.1. Develop C# streaming MapReduce programs
          3.8.2. Develop Java MapReduce programs
          3.8.3. Develop Scalding MapReduce jobs
          3.8.4. Use HDInsight Tools to create Spark apps
          3.8.5. Use HDInsight Tools to debug Spark apps remotely through SSH
          3.8.6. Use empty edge nodes
          3.8.7. Develop Python streaming programs
          3.8.8. Process and analyze JSON documents
          3.8.9. Serialize data with Avro Library
          3.8.10. Use C# user-defined functions
          3.8.11. Use Python with Hive and Pig
     3.9. Analyze big data
          3.9.1. Analyze using Power Query
          3.9.2. Connect Excel to Hadoop
          3.9.3. Connect using the Hive JDBC driver
          3.9.4. Analyze stored sensor data
          3.9.5. Analyze stored tweets
          3.9.6. Analyze flight delay data
          3.9.7. Generate recommendations with Mahout
          3.9.8. Analyze website logs with Hive
          3.9.9. Analyze Application Insights telemetry logs
     3.10. Extend clusters
          3.10.1. Customize clusters using Bootstrap
          3.10.2. Customize clusters using Script Action
          3.10.3. Develop script actions
          3.10.4. Install and use Presto
          3.10.5. Install or update Mono
          3.10.6. Add Hive libraries
          3.10.7. Use Giraph
          3.10.8. Use Hue
          3.10.9. Use R
          3.10.10. Use Solr
          3.10.11. Use Virtual Network
          3.10.12. Use Zeppelin
          3.10.13. Build HDInsight applications
               3.10.13.1. Install HDInsight apps
               3.10.13.2. Install custom apps
               3.10.13.3. Use REST to install apps
               3.10.13.4. Publish HDInsight apps to Azure Marketplace
     3.11. Secure
          3.11.1. Use SSH with HDInsight
          3.11.2. Use SSH tunneling
          3.11.3. Restrict access to data
     3.12. Manage
          3.12.1. Create Linux clusters
               3.12.1.1. Use Azure PowerShell
               3.12.1.2. Use cURL and the Azure REST API
               3.12.1.3. Use the .NET SDK
               3.12.1.4. Use the Azure CLI
               3.12.1.5. Use the Azure portal
               3.12.1.6. Use Azure Resource Manager templates
          3.12.2. Manage Hadoop clusters
               3.12.2.1. Use .NET SDK
               3.12.2.2. Use Azure PowerShell
               3.12.2.3. Use the Azure CLI
          3.12.3. Manage clusters using the Ambari web UI
               3.12.3.1. Use Ambari REST API
          3.12.4. Add storage accounts
          3.12.5. Upload data for Hadoop jobs
          3.12.6. Multiple HDInsight clusters with Data Lake Store
          3.12.7. Import and export data with Sqoop
               3.12.7.1. Connect with SSH
               3.12.7.2. Run using cURL
               3.12.7.3. Run using .NET SDK
               3.12.7.4. Run using PowerShell
          3.12.8. Use Oozie for workflows
          3.12.9. Use time-based Oozie coordinators
          3.12.10. Cluster and service ports and URIs
          3.12.11. Migrate to Resource Manager development tools
          3.12.12. Availability and reliability
          3.12.13. Upgrade HDInsight cluster to newer version
          3.12.14. OS patching for HDInsight cluster
     3.13. Troubleshoot
          3.13.1. Tips for Linux
          3.13.2. Analyze HDInsight logs
          3.13.3. Debug apps with YARN logs
          3.13.4. Enable heap dumps
          3.13.5. Fix errors from WebHCat
          3.13.6. Use Ambari Views to debug Tez Jobs
          3.13.7. More troubleshooting
               3.13.7.1. Hive settings fix Out of Memory error
               3.13.7.2. Optimize Hive queries
               3.13.7.3. Hive query performance
4. Reference
     4.1. PowerShell
     4.2. .NET (Hadoop)
     4.3. .NET (HBase)
     4.4. .NET (Avro)
     4.5. REST
     4.6. REST (Spark)
5. Related
     5.1. Windows clusters
          5.1.1. Migrate Windows clusters to Linux clusters
          5.1.2. Migrate .NET solutions to Linux clusters
          5.1.3. Run Hadoop MapReduce samples
          5.1.4. Use Solr on clusters
          5.1.5. Use Giraph to process large-scale graphs
          5.1.6. Use Oozie for workflows
          5.1.7. Deploy and manage Storm topologies
          5.1.8. Use Maven to build Java applications
          5.1.9. Use the Tez UI to debug Tez Jobs
          5.1.10. Customize using Script Action
          5.1.11. Access YARN application logs
          5.1.12. Use Apache Phoenix and SQuirreL
          5.1.13. Generate movie recommendations using Mahout
          5.1.14. Analyze flight delay data
          5.1.15. Develop script actions
          5.1.16. Analyze Twitter data
          5.1.17. Manage clusters with Azure portal
          5.1.18. Monitor clusters using the Ambari API
6. Resources
     6.1. Azure Roadmap
     6.2. Get help on the forum
     6.3. Learning path
     6.4. Microsoft Professional Program for Big Data
     6.5. Windows tools for HDInsight

Tools

Tool Description
Azure Feature Pack for Integration Services (SSIS): The SQL Server Integration Services (SSIS) Feature Pack for Azure for SQL Server 2016 is an extension that provides components for SSIS to connect to Azure, transfer data between Azure and on-premises data sources, and process data stored in Azure.