HDInsight

Official Documentation

Service Description

In addition to pure data storage, Microsoft Azure also provides capabilities for evaluating and processing data stored in Azure. Even large compute and storage resources can be provisioned flexibly and easily, then released again once the data has been evaluated, which makes Azure well suited to this kind of additional data processing.

HDInsight enables the analysis of large data sets with Hadoop, a popular open source framework for "Big Data" problems, i.e., evaluations of data sets that cannot reasonably be stored in a relational database (e.g., because they are too large or do not have a relational structure). In these situations, Microsoft Azure can automatically provision Hadoop VMs and distribute the data and the computation across them. In addition to running MapReduce jobs efficiently, HDInsight also supports other Hadoop technologies such as Hive and Pig.
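
For readers new to the MapReduce model mentioned above, the following is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is only an illustration of the idea and does not use Hadoop or any HDInsight API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical and chosen for this example. On a real cluster the framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key (done by the framework on a cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Because each map call looks at only one line and each reduce call at only one key, both steps can be spread over many machines, which is what makes the model suitable for data sets too large for a single server.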

Getting Started

  1. HDInsight (Windows) Learning Path
    10/3/2016, Webpage
  2. HDInsight (Linux) Learning Path
    10/3/2016, Webpage
  3. Introduction to Azure HDInsight – using Hadoop with your applications
    11/18/2016, Video, 0:12:56
  4. Cortana Intelligence Suite End-to-End
    1/16/2017, MVA

Latest Content

Title Type
Use H2O.ai on Azure HDInsight Blog
Azure HDInsight 3.6 – Five things that will make a data developer happy Blog
Announcing general availability of Azure HDInsight 3.6 Blog
Use BigDL on HDInsight Spark for Distributed Deep Learning Blog
Hive Metastore in HDInsight – Tips, Tricks & Best Practices Blog
End-to-End Data Science Walkthrough with Spark 2.0 on Azure HDInsight Hadoop Clusters Blog
How to use BigDL on Apache Spark for Azure HDInsight Blog
Announcing new capabilities of HDInsight and DocumentDB at Strata Blog
Nodes in HDInsight Blog
Big Data Partner Program Video
Public preview: Azure HDInsight 3.6 with Apache Spark 2.1 Blog
Announcing preview of Azure HDInsight 3.6 with Apache Spark 2.1 Blog

Azure Documentation

1. Overview
     1.1. Hadoop
     1.2. Hadoop components on HDInsight
     1.3. R Server
     1.4. Apache Spark
     1.5. HBase
     1.6. Apache Storm
     1.7. Kafka (Preview)
     1.8. Domain-joined HDInsight clusters (Preview)
     1.9. Release notes
          1.9.1. Recent
          1.9.2. Archive
2. Get Started
     2.1. Start with Hadoop
     2.2. Start with Spark
     2.3. Start with R Server
     2.4. Start with HBase & NoSQL
     2.5. Start with Storm
     2.6. Start with Interactive Hive (Preview)
     2.7. Start with Kafka (Preview)
     2.8. Hadoop sandbox
     2.9. Data Lake Tools with Hortonworks Sandbox
     2.10. Tools for Visual Studio
     2.11. HDInsight storage options
3. How To
     3.1. Use Hadoop for batch queries
          3.1.1. Hive with Hadoop
               3.1.1.1. Use the Hive View
               3.1.1.2. Use Beeline
               3.1.1.3. Use cURL
               3.1.1.4. Use PowerShell
               3.1.1.5. Use .NET SDK
               3.1.1.6. Use the HDInsight tools for Visual Studio
               3.1.1.7. Use Remote Desktop
               3.1.1.8. Use the Query Console
          3.1.2. Use a Java UDF with Hive
          3.1.3. Use MapReduce with Hadoop
               3.1.3.1. Use SSH
               3.1.3.2. Use cURL
               3.1.3.3. Use PowerShell
               3.1.3.4. Use Remote Desktop
          3.1.4. Run the MapReduce samples
          3.1.5. Use Pig with Hadoop
               3.1.5.1. Use SSH and Pig
               3.1.5.2. Use PowerShell
               3.1.5.3. Use the .NET SDK
               3.1.5.4. Use cURL
               3.1.5.5. Use Remote Desktop
          3.1.6. Use DataFu with Pig
          3.1.7. On-demand clusters
          3.1.8. Submit Hadoop jobs
     3.2. Use R Server
          3.2.1. Storage options
          3.2.2. Install RStudio
          3.2.3. Compute contexts
          3.2.4. ScaleR and SparkR
     3.3. Use Spark for in-memory processing
          3.3.1. With Data Lake Store
          3.3.2. With BI tools
          3.3.3. Create standalone app
          3.3.4. Create apps using Eclipse
          3.3.5. Create apps using IntelliJ
          3.3.6. Process streaming events
          3.3.7. Predict HVAC performance
          3.3.8. Predict food inspection results
          3.3.9. Analyze website logs
          3.3.10. Use Caffe for deep learning
          3.3.11. Use with Microsoft Cognitive Toolkit
          3.3.12. Use Zeppelin notebooks
          3.3.13. Jupyter notebook kernels
          3.3.14. Use external packages with Jupyter using cell magic
          3.3.15. Use external packages with Jupyter using script action
          3.3.16. Use a local Jupyter notebook
          3.3.17. Remote jobs with Livy
          3.3.18. Debug jobs remotely with IntelliJ
          3.3.19. Manage resources
          3.3.20. Track and debug jobs
          3.3.21. Known issues
     3.4. Use HBase
          3.4.1. Use Phoenix and SQLLine
          3.4.2. Analyze real-time tweets
          3.4.3. Create clusters on a virtual network
          3.4.4. Configure HBase replication
          3.4.5. Develop an app with Java
     3.5. Use Storm
          3.5.1. Deploy and manage topologies
          3.5.2. Develop data processing apps in SCP
          3.5.3. Storm examples
               3.5.3.1. Write to Data Lake Store
               3.5.3.2. Develop Java-based topologies with Maven
               3.5.3.3. Develop C# topologies with Hadoop tools
               3.5.3.4. Determine Twitter trending topics
               3.5.3.5. Process events with C# topologies
               3.5.3.6. Process events with Java topologies
               3.5.3.7. Use Power BI with a topology
               3.5.3.8. Analyze real-time sensor data
               3.5.3.9. Process vehicle sensor data
               3.5.3.10. Correlate events over time
               3.5.3.11. Develop topologies using Python
     3.6. Use domain-joined HDInsight (Preview)
          3.6.1. Configure
          3.6.2. Manage
          3.6.3. Configure Hive policies
     3.7. Use Kafka (Preview)
          3.7.1. Replicate Kafka data
          3.7.2. Use with Virtual Networks
          3.7.3. Use with Spark
          3.7.4. Use with Storm
     3.8. Develop
          3.8.1. Develop C# streaming MapReduce programs
          3.8.2. Develop Java MapReduce programs
          3.8.3. Develop Scalding MapReduce jobs
          3.8.4. Use HDInsight Tools to create Spark apps
          3.8.5. Use empty edge nodes
          3.8.6. Develop Python streaming programs
          3.8.7. Process and analyze JSON documents
          3.8.8. Serialize data with Avro Library
          3.8.9. Use C# user-defined functions
          3.8.10. Use Python with Hive and Pig
     3.9. Analyze big data
          3.9.1. Analyze using Power Query
          3.9.2. Connect Excel to Hadoop
          3.9.3. Connect using the Hive JDBC driver
          3.9.4. Analyze stored sensor data
          3.9.5. Analyze stored tweets
          3.9.6. Analyze flight delay data
          3.9.7. Generate recommendations with Mahout
          3.9.8. Analyze website logs with Hive
          3.9.9. Analyze Application Insights telemetry logs
     3.10. Extend clusters
          3.10.1. Customize clusters using Bootstrap
          3.10.2. Customize clusters using Script Action
          3.10.3. Add Hive libraries
          3.10.4. Develop script actions
          3.10.5. Use Giraph
          3.10.6. Use Hue
          3.10.7. Use R
          3.10.8. Use Solr
          3.10.9. Use Virtual Network
          3.10.10. Use Zeppelin
          3.10.11. Build HDInsight applications
               3.10.11.1. Install HDInsight apps
               3.10.11.2. Install custom apps
               3.10.11.3. Use REST to install apps
               3.10.11.4. Publish HDInsight apps to Azure Marketplace
     3.11. Secure
          3.11.1. Use SSH with HDInsight
          3.11.2. Use SSH tunneling
          3.11.3. Restrict access to data
     3.12. Manage
          3.12.1. Create Linux clusters
               3.12.1.1. Use Azure PowerShell
               3.12.1.2. Use cURL and the Azure REST API
               3.12.1.3. Use the .NET SDK
               3.12.1.4. Use the Azure CLI
               3.12.1.5. Use the Azure portal
               3.12.1.6. Use Azure Resource Manager templates
          3.12.2. Manage Hadoop clusters
               3.12.2.1. Use .NET SDK
               3.12.2.2. Use Azure PowerShell
               3.12.2.3. Use the Azure CLI
          3.12.3. Manage clusters using the Ambari web UI
               3.12.3.1. Use Ambari REST API
          3.12.4. Add storage accounts
          3.12.5. Upload data for Hadoop jobs
          3.12.6. Import and export data with Sqoop
               3.12.6.1. Connect with SSH
               3.12.6.2. Run using cURL
               3.12.6.3. Run using .NET SDK
               3.12.6.4. Run using PowerShell
          3.12.7. Use Oozie for workflows
          3.12.8. Use time-based Oozie coordinators
          3.12.9. Cluster and service ports and URIs
          3.12.10. Migrate to Resource Manager development tools
          3.12.11. Availability and reliability
          3.12.12. Upgrade HDInsight cluster to newer version
          3.12.13. OS patching for HDInsight cluster
     3.13. Troubleshoot
          3.13.1. Tips for Linux
          3.13.2. Analyze HDInsight logs
          3.13.3. Debug apps with YARN logs
          3.13.4. Enable heap dumps
          3.13.5. Fix errors from WebHCat
          3.13.6. Use Ambari Views to debug Tez Jobs
          3.13.7. More troubleshooting
               3.13.7.1. Hive settings fix Out of Memory error
               3.13.7.2. Optimize Hive queries
               3.13.7.3. Hive query performance
4. Reference
     4.1. PowerShell
     4.2. .NET (Hadoop)
     4.3. .NET (HBase)
     4.4. .NET (Avro)
     4.5. REST
     4.6. REST (Spark)
5. Related
     5.1. Windows clusters
          5.1.1. Migrate Windows clusters to Linux clusters
          5.1.2. Migrate .NET solutions to Linux clusters
          5.1.3. Run Hadoop MapReduce samples
          5.1.4. Use Solr on clusters
          5.1.5. Use Giraph to process large-scale graphs
          5.1.6. Use Oozie for workflows
          5.1.7. Deploy and manage Storm topologies
          5.1.8. Use Maven to build Java applications
          5.1.9. Use the Tez UI to debug Tez Jobs
          5.1.10. Customize using Script Action
          5.1.11. Access YARN application logs
          5.1.12. Use Apache Phoenix and SQuirreL
          5.1.13. Generate movie recommendations using Mahout
          5.1.14. Analyze flight delay data
          5.1.15. Develop script actions
          5.1.16. Analyze Twitter data
          5.1.17. Manage clusters with Azure portal
          5.1.18. Monitor clusters using the Ambari API
6. Resources
     6.1. Get help on the forum
     6.2. Learning path

Tools

Tool Description
Azure Feature Pack for Integration Services (SSIS) The SQL Server Integration Services (SSIS) Feature Pack for Azure for SQL Server 2016 is an extension that provides SSIS components to connect to Azure, transfer data between Azure and on-premises data sources, and process data stored in Azure.

Videos

Date Title Length
3/9/2017 Big Data Partner Program 0:13:20
2/10/2017 Introducing Apache Kafka on Azure HDInsight 0:16:32
2/6/2017 Cognitive Services, HDInsight, and Power BI on Azure Government 0:15:40
1/10/2017 HDInsight Compliance 0:15:31
12/30/2016 Introduction to Azure HDInsight – using Hadoop with your applications 0:12:57
12/14/2016 Interactive Spark on Azure 0:23:01
12/9/2016 Introducing Apache Kafka on Azure HDInsight 0:16:31
11/18/2016 Introduction to Azure HDInsight – using Hadoop with your applications 0:12:56
11/16/2016 Scalable machine learning with R and Spark 0:11:29
11/8/2016 Securing Azure HDInsight 0:17:23