Today, data is being collected in ever-increasing amounts, at ever-increasing velocities, and in an ever-expanding variety of formats. This explosion of data is colloquially known as the Big Data phenomenon.

Gaining actionable insights from big-data sources requires tools that can clean, analyze, and visualize data quickly and efficiently. Azure HDInsight addresses this need by making it simple to create high-performance computing clusters provisioned with Apache Spark and members of the Spark ecosystem. Rather than spending time deploying hardware and installing, configuring, and maintaining software, you can focus on your research and apply your expertise to the data itself instead of the resources required to analyze it.

Apache Spark is an open-source parallel-processing platform that excels at running large-scale data analytics jobs. Spark’s combined use of in-memory and disk data storage allows it to process some tasks up to 100 times faster than Hadoop MapReduce. Microsoft Azure makes deploying Apache Spark clusters significantly simpler, so you can start analyzing your data that much sooner.

In this lab, you will experience Apache Spark for Azure HDInsight first-hand. After provisioning a Spark cluster, you will use the Microsoft Azure Storage Explorer to upload several Jupyter notebooks to the cluster. You will then use these notebooks to explore, visualize, and build a machine-learning model from more than 100,000 rows of food-inspection data collected by the city of Chicago. The goal is to learn how to create and use your own Spark clusters, experience the ease with which they are provisioned in Azure, and, if you're new to Spark, get a working introduction to Spark data analytics.


In this hands-on lab, you will learn how to:

  • Deploy an HDInsight Spark cluster
  • Work with content stored in Azure Blob Storage and accessed by the Spark cluster as an HDFS volume
  • Use a Jupyter notebook to interactively explore a large dataset
  • Use a Jupyter notebook to develop and train a machine-learning model
  • Delete a Spark cluster to avoid incurring unnecessary charges
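
As a rough illustration of the second objective: HDInsight clusters typically address blobs through an HDFS-compatible URI of the form `wasb://<container>@<account>.blob.core.windows.net/<path>` (or `wasbs://` for encrypted access). The sketch below only builds such a URI as a string. The helper name `wasb_uri` and the container, account, and file names are hypothetical placeholders, not values taken from this lab.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build an HDFS-compatible URI for a blob in Azure Blob Storage.

    Illustrative helper only (not part of any Azure SDK). It assumes the
    standard public-cloud endpoint suffix "blob.core.windows.net".
    """
    scheme = "wasbs" if secure else "wasb"   # wasbs = access over TLS
    clean_path = path.lstrip("/")            # avoid a double slash after the host
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


if __name__ == "__main__":
    # Hypothetical names; substitute your own container, account, and blob path.
    uri = wasb_uri("labcontainer", "labstorage", "data/food_inspections.csv")
    print(uri)
    # In a notebook on the cluster, a URI like this could be passed to a Spark
    # reader, e.g. spark.read.csv(uri, header=True), assuming the cluster has
    # access to that storage account.
```

Blobs in the cluster's default container can usually also be referenced with a shorter form that omits the container and account (for example `wasb:///path/to/file`), which is why notebook code often looks like ordinary HDFS access.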


The following are required to complete this hands-on lab:


Click here to download a zip file containing the resources used in this lab. Copy the contents of the zip file into a folder on your hard disk.