Train a deep learning model
In this exercise, you’ll use the PyTorch library to train a deep learning model in Azure Databricks. Then you’ll use the Horovod library to distribute deep learning training across multiple worker nodes in a cluster.
This exercise should take approximately 45 minutes to complete.
Before you start
You’ll need an Azure subscription in which you have administrative-level access.
Provision an Azure Databricks workspace
An Azure Databricks workspace provides a central cloud-based environment for all your data science activities in Azure Databricks. You can create a workspace interactively in the Azure portal, use one you already have, or follow the steps below to use a script to provision a Standard workspace for this exercise.
Tip: If you already have an Azure Databricks workspace, you can skip this procedure and use your existing workspace.
- In a web browser, sign into the Azure portal at
https://portal.azure.com
. -
Use the [>_] button to the right of the search bar at the top of the page to create a new Cloud Shell in the Azure portal, selecting a PowerShell environment and creating storage if prompted. The cloud shell provides a command line interface in a pane at the bottom of the Azure portal, as shown here:
Note: If you have previously created a cloud shell that uses a Bash environment, use the the drop-down menu at the top left of the cloud shell pane to change it to PowerShell.
-
Note that you can resize the cloud shell by dragging the separator bar at the top of the pane, or by using the —, ◻, and X icons at the top right of the pane to minimize, maximize, and close the pane. For more information about using the Azure Cloud Shell, see the Azure Cloud Shell documentation.
-
In the PowerShell pane, enter the following commands to clone this repo:
rm -r mslearn-databricks-ml -f git clone https://github.com/MicrosoftLearning/mslearn-databricks-ml
-
After the repo has been cloned, enter the following commands to change to the folder for this lab and run the setup.ps1 script it contains:
cd mslearn-databricks-ml/Allfiles/labs/05 ./setup.ps1
-
If prompted, choose which subscription you want to use (this will only happen if you have access to multiple Azure subscriptions).
- Wait for the script to complete - this typically takes around 5 minutes, but in some cases may take longer. While you are waiting, review the Distributed training article in the Azure Databricks documentation.
Create a cluster
Azure Databricks is a distributed processing platform that uses Apache Spark clusters to process data in parallel on multiple nodes. Each cluster consists of a driver node to coordinate the work, and worker nodes to perform processing tasks. In this exercise, you’ll create a single-node cluster to minimize the compute resources used in the lab environment (in which resources may be constrained). In a production environment, you’d typically create a cluster with multiple worker nodes.
Tip: If you already have a cluster with a 13.3 LTS ML or higher runtime version in your Azure Databricks workspace, you can use it to complete this exercise and skip this procedure.
- In the Azure portal, browse to the msl-xxxxxxx resource group that was created by the script (or the resource group containing your existing Azure Databricks workspace)
- Select your Azure Databricks Service resource (named databricksxxxxxxx if you used the setup script to create it).
-
In the Overview page for your workspace, use the Launch Workspace button to open your Azure Databricks workspace in a new browser tab; signing in if prompted.
Tip: As you use the Databricks Workspace portal, various tips and notifications may be displayed. Dismiss these and follow the instructions provided to complete the tasks in this exercise.
- In the sidebar on the left, select the (+) New task, and then select Cluster.
- In the New Cluster page, create a new cluster with the following settings:
- Cluster name: User Name’s cluster (the default cluster name)
- Cluster mode: Single Node
- Access mode: Single user (with your user account selected)
- Databricks runtime version: Select the ML edition of the latest non-beta version of the runtime (Not a Standard runtime version) that:
- Does not use a GPU
- Includes Scala > 2.11
- Includes Spark > 3.0
- Use Photon Acceleration: unselected
- Node type: Standard_DS3_v2
- Terminate after 30 minutes of inactivity
- Wait for the cluster to be created. It may take a minute or two.
Note: If your cluster fails to start, your subscription may have insufficient quota in the region where your Azure Databricks workspace is provisioned. See CPU core limit prevents cluster creation for details. If this happens, you can try deleting your workspace and creating a new one in a different region. You can specify a region as a parameter for the setup script like this:
./setup.ps1 eastus
Import and explore a notebook
The tasks and code used in this exercise are in a notebook that you will run in your cluster.
- In the Azure Databricks portal for your workspace, in the sidebar on the left, select Workspace. Then select the ⌂ Home folder.
- At the top of the page, in the ⋮ menu next to your user name, select Import. Then in the Import dialog box, select URL and import the notebook from
https://github.com/MicrosoftLearning/mslearn-databricks-ml/raw/master/notebooks/Deep-learning.ipynb
- Connect the notebook to your cluster, and follow the instructions it contains; running its code cells to use deep learning techniques to train a model.
Clean-up
If you’re finished working with Azure Databricks for now, in Azure Databricks workspace, on the Compute page, select your cluster and select ■ Terminate to shut it down.
You can then close the Azure Databricks web portal, return to the Azure portal, and delete the msl-xxxxxxx resource group containing your Azure Databricks workspace.