Monitor Data Drift

Changing trends in data over time can reduce the accuracy of the predictions made by a model. Monitoring for this data drift and retraining as necessary is an important way to ensure your machine learning solution continues to predict accurately.
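
To make the idea concrete, here is a minimal sketch of one common way to quantify drift for a single numeric feature: the Population Stability Index (PSI), which compares how a baseline sample and a later sample are distributed across the same bins. This is an illustration only, not the metric Azure Machine Learning computes. The function name `psi`, the bin count, and the synthetic data are all assumptions made for the example.

```python
import math
import random


def psi(baseline, target, bins=10):
    """Population Stability Index between two numeric samples (illustrative)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p = proportions(baseline)
    q = proportions(target)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Synthetic data (an assumption for this sketch): the same feature sampled
# at training time, later with no change, and later with a shifted mean.
rng = random.Random(0)
baseline = [rng.gauss(50, 10) for _ in range(5000)]
same = [rng.gauss(50, 10) for _ in range(5000)]
drifted = [rng.gauss(58, 10) for _ in range(5000)]

psi_same = psi(baseline, same)
psi_drifted = psi(baseline, drifted)
print(f"no drift: {psi_same:.3f}, drifted: {psi_drifted:.3f}")
```

A frequently quoted rule of thumb treats a PSI below about 0.1 as stable and larger values as a sign that the feature deserves a closer look. Treat any such cutoff as a starting point to tune for your own data.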

Before you start

You’ll need an Azure subscription in which you have administrative-level access.

Provision an Azure Machine Learning workspace

An Azure Machine Learning workspace provides a central place for managing all resources and assets you need to train and manage your models. You can interact with the Azure Machine Learning workspace through the studio, the Python SDK, and the Azure CLI.

Create the workspace

To create the Azure Machine Learning workspace and a compute instance, you’ll use the Azure CLI. All necessary commands are grouped in a shell script for you to run.

  1. In a browser, open the Azure portal at portal.azure.com, signing in with your Microsoft account.
  2. Select the [>_] (Cloud Shell) button at the top of the page to the right of the search box. This opens a Cloud Shell pane at the bottom of the portal.
  3. The first time you open Cloud Shell, you are asked to choose the type of shell you want to use (Bash or PowerShell). Select Bash.
  4. If you are asked to create storage for Cloud Shell, check that the correct subscription is specified and select Create storage. Wait for the storage to be created.
  5. In the terminal, enter the following commands to clone this repo:

     rm -rf mslearn-dp100
     git clone https://github.com/MicrosoftLearning/mslearn-dp100 mslearn-dp100
    
  6. After the repo has been cloned, enter the following commands to change to the folder for this lab and run the setup.sh script it contains:

     cd mslearn-dp100
     ./setup.sh
    
  7. Wait for the script to complete; this typically takes 5 to 10 minutes.

Clone the lab materials

You run the lab notebooks from the Notebooks page in Azure Machine Learning studio, so first clone the lab materials into your workspace.

  1. In Azure Machine Learning studio, open the Compute page for your workspace. On the Compute Instances tab, start your compute instance if it is not already running.
  2. Select Terminal under Applications to open a terminal, and ensure that its Compute is set to your compute instance and that the current path is the /users/your-user-name folder.
  3. Enter the following command to clone a Git repository containing notebooks, data, and other files to your workspace:

     git clone https://github.com/MicrosoftLearning/mslearn-dp100 mslearn-dp100
    
  4. When the command has completed, refresh the view in the Notebooks pane and verify that a new /users/your-user-name/mslearn-dp100 folder has been created. This folder contains multiple .ipynb notebook files.

Tip: New to Python? Use the Python cheat sheet to understand the code.

Monitor data drift for a dataset

In this exercise, the code to monitor data drift is provided in a notebook.

  1. In the Notebooks page, browse to the /users/your-user-name/mslearn-dp100 folder where you cloned the notebook repository, and open the Monitor Data Drift notebook.
  2. Read the notes in the notebook and run each code cell in turn.
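
As background for the notebook, the sketch below shows the general shape of scheduled drift monitoring: group incoming records by week, score each week against a fixed baseline, and raise an alert when the score crosses a threshold. It is not the notebook's code and does not use the Azure Machine Learning SDK. The function `weekly_drift`, the 0.3 threshold, the scoring rule (shift of the weekly mean in baseline standard deviations), and the data are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, stdev


def weekly_drift(baseline, records, threshold=0.3):
    """Return (week_start, score, alert) for each week of records.

    score is the absolute shift of the weekly mean from the baseline mean,
    measured in baseline standard deviations (an illustrative choice).
    """
    base_mean, base_std = mean(baseline), stdev(baseline)

    # Bucket each (date, value) record under the Monday of its week.
    weeks = defaultdict(list)
    for day, value in records:
        week_start = day - timedelta(days=day.weekday())
        weeks[week_start].append(value)

    results = []
    for week_start in sorted(weeks):
        score = abs(mean(weeks[week_start]) - base_mean) / base_std
        results.append((week_start, round(score, 2), score > threshold))
    return results


# Hypothetical data: a feature whose daily value creeps upward over six weeks.
baseline = [20, 22, 19, 21, 23, 20, 18, 22, 21, 20]
start = date(2024, 1, 1)  # a Monday
records = [(start + timedelta(days=d), 20 + 0.1 * d) for d in range(42)]

for week_start, score, alert in weekly_drift(baseline, records):
    print(week_start, score, "DRIFT" if alert else "ok")
```

In this sketch the early weeks stay under the threshold and the later weeks are flagged, which is the point at which you would investigate the data and consider retraining.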

Delete Azure resources

When you finish exploring Azure Machine Learning, you should delete the resources you’ve created to avoid unnecessary Azure costs.

  1. Close the Azure Machine Learning studio tab and return to the Azure portal.
  2. In the Azure portal, on the Home page, select Resource groups.
  3. Select the rg-dp100-labs resource group.
  4. At the top of the Overview page for your resource group, select Delete resource group.
  5. Enter the resource group name to confirm you want to delete it, and select Delete.