Make data available in Azure Machine Learning
Although it’s fairly common for data scientists to work with data on their local file system, in an enterprise environment it can be more effective to store the data in a central location where multiple data scientists and machine learning engineers can access it.
In this exercise, you’ll explore datastores and data assets, which are the primary objects used to abstract data access in Azure Machine Learning.
Before you start
You’ll need an Azure subscription in which you have administrative-level access.
Provision an Azure Machine Learning workspace
An Azure Machine Learning workspace provides a central place for managing all resources and assets you need to train and manage your models. You can interact with the Azure Machine Learning workspace through the studio, Python SDK, and Azure CLI.
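For the Python SDK route, a minimal sketch of connecting to a workspace once it has been provisioned might look like the following. The `WorkspaceRef` class and `connect` function are hypothetical helpers introduced only for illustration, and the three identifying values are placeholders you would supply yourself. The sketch assumes the `azure-ai-ml` and `azure-identity` packages are installed.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkspaceRef:
    """Hypothetical holder for the three values that identify a workspace."""

    subscription_id: str
    resource_group_name: str
    workspace_name: str


def connect(ref: WorkspaceRef):
    """Return an MLClient for the workspace described by `ref`."""
    # Imported lazily so this module loads even where the SDK is not installed.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    # The field names of WorkspaceRef match MLClient's keyword arguments.
    return MLClient(credential=DefaultAzureCredential(), **asdict(ref))


# Example (placeholders, not real values):
# ml_client = connect(WorkspaceRef("<subscription-id>", "<resource-group>", "<workspace>"))
```

On a compute instance inside the workspace, `MLClient.from_config(credential=...)` is an alternative that reads the workspace details from a configuration file instead of explicit arguments.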
You’ll use a shell script that uses the Azure CLI to provision the workspace and necessary resources. Next, you’ll use the Python SDK in a notebook to create a datastore and data assets.
Create the workspace and compute resources
To create the Azure Machine Learning workspace and compute resources, you’ll use the Azure CLI. All necessary commands are grouped in a Shell script for you to execute.
- In a browser, open the Azure portal at https://portal.azure.com/, signing in with your Microsoft account.
- Select the [>_] (Cloud Shell) button at the top of the page to the right of the search box. This opens a Cloud Shell pane at the bottom of the portal.
- Select Bash if asked. The first time you open the cloud shell, you will be asked to choose the type of shell you want to use (Bash or PowerShell).
- Check that the correct subscription is specified and that No storage account required is selected. Select Apply.
- Enter the following commands in the terminal to clone this repo:
    rm -r azure-ml-labs -f
    git clone https://github.com/MicrosoftLearning/mslearn-azure-ml.git azure-ml-labs
  Use SHIFT + INSERT to paste your copied code into the Cloud Shell.
- Enter the following commands after the repo has been cloned, to change to the folder for this lab and run the setup.sh script it contains:
    cd azure-ml-labs/Labs/03
    ./setup.sh
  Ignore any (error) messages that say that the extensions were not installed.
- Wait for the script to complete. This typically takes around 5-10 minutes.
Troubleshooting tip: Workspace creation error
If you receive an error when running the setup script through the CLI, you need to provision the resources manually:
- In the Azure portal home page, select + Create a resource.
- Search for machine learning and then select Azure Machine Learning. Select Create.
- Create a new Azure Machine Learning resource with the following settings:
- Subscription: Your Azure subscription
- Resource group: rg-dp100-labs
- Workspace name: mlw-dp100-labs
- Region: Select the geographical region closest to you
- Storage account: Note the default new storage account that will be created for your workspace
- Key vault: Note the default new key vault that will be created for your workspace
- Application insights: Note the default new application insights resource that will be created for your workspace
- Container registry: None (one will be created automatically the first time you deploy a model to a container)
- Select Review + create and wait for the workspace and its associated resources to be created - this typically takes around 5 minutes.
- Select Go to resource and in its Overview page, select Launch studio. Another tab will open in your browser to open the Azure Machine Learning studio.
- Close any pop-ups that appear in the studio.
- Within the Azure Machine Learning studio, navigate to the Compute page and select + New under the Compute instances tab.
- Give the compute instance a unique name and then select Standard_DS11_v2 as the virtual machine size.
- Select Review + create and then select Create.
- Next, select the Compute clusters tab and select + New.
- Choose the same region as the one where you created your workspace and then select Standard_DS11_v2 as the virtual machine size. Select Next.
- Give the cluster a unique name and then select Create.
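If you prefer to script the two compute resources rather than create them in the studio, the steps above can be sketched with the Python SDK roughly as follows. This is an illustration under assumptions, not the setup script's actual logic: `compute_specs` and `create_compute` are hypothetical helper names, and the cluster's `max_instances` value of 2 is an assumed choice that the instructions above do not specify. A minimum of 0 nodes lets the cluster scale down to nothing when idle.

```python
def compute_specs(instance_name, cluster_name, size="Standard_DS11_v2"):
    """Describe one compute instance and one compute cluster as plain dicts."""
    return {
        "instance": {"name": instance_name, "size": size},
        "cluster": {
            "name": cluster_name,
            "size": size,
            "min_instances": 0,  # scale to zero nodes when idle
            "max_instances": 2,  # assumed value, adjust to your quota
        },
    }


def create_compute(ml_client, specs):
    """Start creation of both resources and return the long-running pollers."""
    # Imported lazily so the spec helper works without the SDK installed.
    from azure.ai.ml.entities import AmlCompute, ComputeInstance

    instance = ComputeInstance(**specs["instance"])
    cluster = AmlCompute(**specs["cluster"])
    # begin_create_or_update returns immediately; call .result() on a poller to wait.
    return [ml_client.compute.begin_create_or_update(c) for c in (instance, cluster)]
```

Here `ml_client` is assumed to be an authenticated `MLClient` for the workspace, and both names must be unique, as in the manual steps.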
Explore the default datastores
When you create an Azure Machine Learning workspace, a Storage Account is automatically created and connected to your workspace. You’ll explore how the Storage Account is connected.
- In the Azure portal, navigate to the new resource group named rg-dp100-….
- Select the Storage Account in the resource group. The name often starts with the name you provided for the workspace (without hyphens).
- Review the Overview page of the Storage Account. Note that the Storage Account has several options for Data storage as shown in the Overview pane, and in the left menu.
- Select Containers to explore the Blob storage part of the Storage Account.
- Note the azureml-blobstore-… container. The default datastore for data assets uses this container to store data.
- Using the + Container button at the top of the screen, create a new container and name it training-data.
- Select File shares from the left menu to explore the File share part of the Storage Account.
- Note the code-… file share. Any notebooks in the workspace are stored here. After cloning the lab materials, you can find the files in this file share, in the code-…/Users/your-user-name/azure-ml-labs folder.
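The same connections can also be inspected from code. As a rough sketch, assuming `ml_client` is an authenticated `MLClient` for the workspace, the hypothetical helper below lists each datastore's name and type so you can match them to the containers and file shares seen in the portal.

```python
def summarize_datastores(ml_client):
    """Return sorted (name, type) pairs for every datastore in the workspace."""
    return sorted((ds.name, str(ds.type)) for ds in ml_client.datastores.list())


# Example usage with a real client:
# for name, kind in summarize_datastores(ml_client):
#     print(f"{name}: {kind}")
```

The default blob datastore is typically named `workspaceblobstore`; other entries depend on how the workspace was created.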
Copy the access key
To create a datastore in the Azure Machine Learning workspace, you need to provide some credentials. An easy way to give the workspace access to Blob storage is to use the account key.
- In the Storage Account, select the Access keys tab from the left menu.
- Note that two keys are provided: key1 and key2. Each key has the same functionality.
- Select Show for the Key field under key1.
- Copy the value of the Key field to a notepad. You’ll need to paste this value into the notebook later.
- Copy the name of your storage account from the top of the page. The name should start with mlwdp100storage… You’ll need to paste this value into the notebook later too.
Note: Copy the account key and name to a notepad to avoid automatic capitalization (which happens in Word). The key is case-sensitive.
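To illustrate how the copied values are used, here is a minimal sketch of defining a blob datastore over the training-data container with account key credentials. It is not necessarily the notebook's exact code: the function names and the datastore name `blob_training_data` are hypothetical, and the name check only reflects the general rule that storage account names are 3 to 24 lowercase letters and digits. It cannot detect a key whose letter case was altered.

```python
import re

# Storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def clean_account_name(raw):
    """Strip stray whitespace and reject names that cannot be valid."""
    name = raw.strip()
    if not _ACCOUNT_NAME.fullmatch(name):
        raise ValueError(f"not a valid storage account name: {name!r}")
    return name


def build_blob_datastore(account_name, account_key,
                         container_name="training-data",
                         datastore_name="blob_training_data"):
    """Build (but do not register) a datastore that uses an account key."""
    # Imported lazily so the validation helper works without the SDK installed.
    from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore

    return AzureBlobDatastore(
        name=datastore_name,  # hypothetical name chosen for this sketch
        description="Blob storage for training data",
        account_name=clean_account_name(account_name),
        container_name=container_name,
        credentials=AccountKeyConfiguration(account_key=account_key.strip()),
    )


# Registering it in the workspace, given an authenticated MLClient:
# ml_client.create_or_update(build_blob_datastore("<account-name>", "<account-key>"))
```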
Clone the lab materials
To create a datastore and data assets with the Python SDK, you’ll need to clone the lab materials into the workspace.
- In the Azure portal, navigate to the Azure Machine Learning workspace named mlw-dp100-labs.
- Select the Azure Machine Learning workspace, and in its Overview page, select Launch studio. Another tab will open in your browser to open the Azure Machine Learning studio.
- Close any pop-ups that appear in the studio.
- Within the Azure Machine Learning studio, navigate to the Compute page and verify that the compute instance and cluster you created in the previous section exist. The compute instance should be running, and the cluster should be idle with 0 nodes running.
- In the Compute instances tab, find your compute instance, and select the Terminal application.
- Install the Python SDK on the compute instance by running the following commands in the terminal:
    pip uninstall azure-ai-ml
    pip install azure-ai-ml
    pip install mltable
  Ignore any (error) messages that say that the packages were not installed.
- Run the following command to clone a Git repository containing notebooks, data, and other files to your workspace:
    git clone https://github.com/MicrosoftLearning/mslearn-azure-ml.git azure-ml-labs
- When the command has completed, in the Files pane, click ↻ to refresh the view and verify that a new Users/your-user-name/azure-ml-labs folder has been created.
Optionally, in another browser tab, navigate back to the Azure portal. Explore the file share code-… in the Storage account again to find the cloned lab materials in the newly created azure-ml-labs folder.
Create a datastore and data assets
The code to create a datastore and data assets with the Python SDK is provided in a notebook.
- Open the Labs/03/Work with data.ipynb notebook.
  Select Authenticate and follow the necessary steps if a notification appears asking you to authenticate.
- Verify that the notebook uses the Python 3.8 - AzureML kernel.
- Run all cells in the notebook.
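As a rough idea of what creating a data asset with the Python SDK involves, the sketch below builds a datastore URI and a `Data` entity. It is an illustration under assumptions rather than the notebook's code: `datastore_uri` and `build_data_asset` are hypothetical helpers, and the asset name and path in the usage comment are placeholders.

```python
def datastore_uri(datastore_name, relative_path):
    """Build the short-form URI for a path inside a registered datastore."""
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


def build_data_asset(name, path, asset_type="uri_file", description=""):
    """Build (but do not register) a data asset pointing at `path`."""
    # Imported lazily so the URI helper works without the SDK installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Data

    allowed = {AssetTypes.URI_FILE, AssetTypes.URI_FOLDER, AssetTypes.MLTABLE}
    if asset_type not in allowed:
        raise ValueError(f"unsupported asset type: {asset_type!r}")
    return Data(name=name, path=path, type=asset_type, description=description)


# Registering an asset, given an authenticated MLClient (placeholder values):
# asset = build_data_asset("<asset-name>", datastore_uri("<datastore>", "<file.csv>"))
# ml_client.data.create_or_update(asset)
```

A local file path can be passed as `path` instead of a datastore URI, in which case the file is uploaded to the default datastore when the asset is registered.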
Optional: Explore the data assets
Optionally, you can explore how the data assets are stored in the associated Storage Account.
- Navigate to the Data tab in the Azure Machine Learning studio to explore the data assets.
- Select the diabetes-local data asset name to explore its details.
  Under Data sources for the diabetes-local data asset, you’ll find where the file has been uploaded to. The path starting with LocalUpload/… shows the path within the Storage Account container azureml-blobstore-…. You can verify the file exists by navigating to that path in the Azure portal.
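The same lookup can be sketched in code. The hypothetical helpers below fetch a registered data asset and split its stored path into the datastore name and the path inside that datastore's container. This assumes the path contains a `datastores/<name>/paths/<relative path>` segment, and the sample URI in the usage comment is made up for illustration.

```python
import re

# Matches both the short form (azureml://datastores/...) and longer forms
# that carry subscription and workspace segments before "datastores/".
_DATASTORE_PATH = re.compile(r"datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$")


def split_datastore_path(uri):
    """Return (datastore_name, relative_path) extracted from a datastore URI."""
    match = _DATASTORE_PATH.search(uri)
    if match is None:
        raise ValueError(f"not a datastore URI: {uri!r}")
    return match["datastore"], match["path"]


def asset_location(ml_client, name, version):
    """Look up a data asset and report where its data is stored."""
    asset = ml_client.data.get(name=name, version=version)
    return split_datastore_path(asset.path)


# Example with a made-up URI:
# split_datastore_path("azureml://datastores/my_store/paths/folder/data.csv")
```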
Delete Azure resources
When you finish exploring Azure Machine Learning, you should delete the resources you’ve created to avoid unnecessary Azure costs.
- Close the Azure Machine Learning studio tab and return to the Azure portal.
- In the Azure portal, on the Home page, select Resource groups.
- Select the rg-dp100-… resource group.
- At the top of the Overview page for your resource group, select Delete resource group.
- Enter the resource group name to confirm you want to delete it, and select Delete.