Generate batch predictions using a deployed model in Microsoft Fabric

In this lab, you’ll use a machine learning model to predict a quantitative measure of diabetes disease progression.

By completing this lab, you’ll gain hands-on experience in generating predictions and visualizing the results.

This lab will take approximately 20 minutes to complete.

Note: You need a Microsoft Fabric trial to complete this exercise.

Create a workspace

Before working with data in Fabric, create a workspace with the Fabric trial enabled.

  1. Navigate to the Microsoft Fabric home page at https://app.fabric.microsoft.com/home?experience=fabric in a browser.
  2. In the Microsoft Fabric home page, select Synapse Data Science.
  3. In the menu bar on the left, select Workspaces (the icon looks similar to 🗇).
  4. Create a new workspace with a name of your choice, selecting a licensing mode that includes Fabric capacity (Trial, Premium, or Fabric).
  5. When your new workspace opens, it should be empty.

    Screenshot of an empty workspace in Fabric.

Create a notebook

You’ll use a notebook to train and use a model in this exercise.

  1. In the Synapse Data Science home page, create a new Notebook.

    After a few seconds, a new notebook containing a single cell will open. Notebooks are made up of one or more cells that can contain code or markdown (formatted text).

  2. Select the first cell (which is currently a code cell), and then in the dynamic toolbar at its top-right, use the M↓ button to convert the cell to a markdown cell.

    When the cell changes to a markdown cell, the text it contains is rendered.

  3. If necessary, use the 🖉 (Edit) button to switch the cell to editing mode, then delete the content and enter the following text:

    # Train and use a machine learning model
    

Train a machine learning model

First, let’s train a machine learning model that uses a regression algorithm to predict the response of interest for diabetes patients (a quantitative measure of disease progression one year after baseline).

  1. In your notebook, use the + Code icon below the latest cell to add a new code cell to the notebook.

    Tip: To see the + Code icon, move the mouse to just below and to the left of the output from the current cell. Alternatively, in the menu bar, on the Edit tab, select + Add code cell.

  2. Enter the following code to load and prepare data and use it to train a model.

    import pandas as pd
    import mlflow
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from mlflow.models.signature import ModelSignature
    from mlflow.types.schema import Schema, ColSpec
    
    # Get the data
    blob_account_name = "azureopendatastorage"
    blob_container_name = "mlsamples"
    blob_relative_path = "diabetes"
    blob_sas_token = r""
    wasbs_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (blob_container_name, blob_account_name, blob_relative_path)
    spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name), blob_sas_token)
    df = spark.read.parquet(wasbs_path).toPandas()
    
    # Split the features and label for training
    X, y = df[['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']].values, df['Y'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    
    # Train the model in an MLflow experiment
    experiment_name = "experiment-diabetes"
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        mlflow.autolog(log_models=False)
        model = DecisionTreeRegressor(max_depth=5)
        model.fit(X_train, y_train)
           
        # Define the model signature
        input_schema = Schema([
            ColSpec("integer", "AGE"),
            ColSpec("integer", "SEX"),
            ColSpec("double", "BMI"),
            ColSpec("double", "BP"),
            ColSpec("integer", "S1"),
            ColSpec("double", "S2"),
            ColSpec("double", "S3"),
            ColSpec("double", "S4"),
            ColSpec("double", "S5"),
            ColSpec("integer", "S6"),
        ])
        output_schema = Schema([ColSpec("integer")])
        signature = ModelSignature(inputs=input_schema, outputs=output_schema)
       
        # Log the model
        mlflow.sklearn.log_model(model, "model", signature=signature)
    
  3. Use the ▷ Run cell button on the left of the cell to run it. Alternatively, you can press SHIFT + ENTER on your keyboard to run a cell.

    Note: Since this is the first time you’ve run any Spark code in this session, the Spark pool must be started. This means that the first run in the session can take a minute or so to complete. Subsequent runs will be quicker.
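
    Optionally, if you’d like a quick sanity check of the model’s accuracy before registering it, you can add and run a cell like the following. This is a minimal sketch using scikit-learn’s metrics, and it assumes the model, X_test, and y_test variables from the training cell are still in the session:

    from sklearn.metrics import mean_absolute_error, r2_score
    
    # Score the trained model on the held-out test split
    y_pred = model.predict(X_test)
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("R2:", r2_score(y_test, y_pred))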

  4. Use the + Code icon below the cell output to add a new code cell to the notebook, and enter the following code to register the model that was trained by the experiment in the previous cell:

    # Get the most recent experiment run
    exp = mlflow.get_experiment_by_name(experiment_name)
    last_run = mlflow.search_runs(exp.experiment_id, order_by=["start_time DESC"], max_results=1)
    last_run_id = last_run.iloc[0]["run_id"]
    
    # Register the model that was trained in that run
    print("Registering the model from run :", last_run_id)
    model_uri = "runs:/{}/model".format(last_run_id)
    mv = mlflow.register_model(model_uri, "diabetes-model")
    print("Name: {}".format(mv.name))
    print("Version: {}".format(mv.version))
    

    Your model is now saved in your workspace as diabetes-model. Optionally, you can use the browse feature in your workspace to find the model in the workspace and explore it using the UI.
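
    If you’d rather confirm the registration in code than in the UI, a quick check with the MLflow client like the following should work (a sketch, assuming the model name used above):

    from mlflow.tracking import MlflowClient
    
    # List the registered versions of the diabetes model
    client = MlflowClient()
    for version in client.search_model_versions("name='diabetes-model'"):
        print(f"Name: {version.name}, version: {version.version}, run: {version.run_id}")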

Create a test dataset in a lakehouse

To use the model, you’ll need a dataset of patient details for whom you want to predict diabetes progression. You’ll create this dataset as a table in a Microsoft Fabric Lakehouse.

  1. In the Notebook editor, in the Explorer pane on the left, select + Data sources to add a lakehouse.
  2. Select New lakehouse, select Add, and then create a new lakehouse with a valid name of your choice.
  3. When asked to stop the current session, select Stop now to restart the notebook.
  4. When the lakehouse is created and attached to your notebook, add a new code cell and run the following code to create a dataset and save it in a table in the lakehouse:

    from pyspark.sql.types import IntegerType, DoubleType
    
    # Create a new dataframe with patient data
    data = [
        (62, 2, 33.7, 101.0, 157, 93.2, 38.0, 4.0, 4.8598, 87),
        (50, 1, 22.7, 87.0, 183, 103.2, 70.0, 3.0, 3.8918, 69),
        (76, 2, 32.0, 93.0, 156, 93.6, 41.0, 4.0, 4.6728, 85),
        (25, 1, 26.6, 84.0, 198, 131.4, 40.0, 5.0, 4.8903, 89),
        (53, 1, 23.0, 101.0, 192, 125.4, 52.0, 4.0, 4.2905, 80),
        (24, 1, 23.7, 89.0, 139, 64.8, 61.0, 2.0, 4.1897, 68),
        (38, 2, 22.0, 90.0, 160, 99.6, 50.0, 3.0, 3.9512, 82),
        (69, 2, 27.5, 114.0, 255, 185.0, 56.0, 5.0, 4.2485, 92),
        (63, 2, 33.7, 83.0, 179, 119.4, 42.0, 4.0, 4.4773, 94),
        (30, 1, 30.0, 85.0, 180, 93.4, 43.0, 4.0, 5.3845, 88)
    ]
    columns = ['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']
    df = spark.createDataFrame(data, schema=columns)
    
    # Convert data types to match the model input schema
    df = df.withColumn("AGE", df["AGE"].cast(IntegerType()))
    df = df.withColumn("SEX", df["SEX"].cast(IntegerType()))
    df = df.withColumn("BMI", df["BMI"].cast(DoubleType()))
    df = df.withColumn("BP", df["BP"].cast(DoubleType()))
    df = df.withColumn("S1", df["S1"].cast(IntegerType()))
    df = df.withColumn("S2", df["S2"].cast(DoubleType()))
    df = df.withColumn("S3", df["S3"].cast(DoubleType()))
    df = df.withColumn("S4", df["S4"].cast(DoubleType()))
    df = df.withColumn("S5", df["S5"].cast(DoubleType()))
    df = df.withColumn("S6", df["S6"].cast(IntegerType()))
    
    # Save the data in a delta table
    table_name = "diabetes_test"
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    print(f"Spark dataframe saved to delta table: {table_name}")
    
  5. When the code has completed, select the … menu next to the Tables node in the Lakehouse explorer pane, and select Refresh. The diabetes_test table should appear.
  6. Expand the diabetes_test table in the left pane to view all fields it includes.
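
    If you’d like to check the table contents in code as well, a cell like the following (a sketch that reuses the table_name variable from the previous cell) previews the first few rows:

    # Preview the first few rows of the test data table
    display(spark.sql(f"SELECT * FROM {table_name} LIMIT 5"))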

Apply the model to generate predictions

Now you can use the model you trained previously to generate diabetes progression predictions for the rows of patient data in your table.

  1. Add a new code cell and run the following code:

    import mlflow
    from synapse.ml.predict import MLFlowTransformer
    
    # Read the patient features data
    df_test = spark.read.format("delta").load(f"Tables/{table_name}")
    
    # Use the model to generate diabetes predictions for each row
    model = MLFlowTransformer(
        inputCols=["AGE","SEX","BMI","BP","S1","S2","S3","S4","S5","S6"],
        outputCol="predictions",
        modelName="diabetes-model",
        modelVersion=1)
    df_test = model.transform(df_test)
    
    # Save the results (the original features PLUS the prediction)
    df_test.write.format('delta').mode("overwrite").option("mergeSchema", "true").saveAsTable(table_name)
    
  2. After the code has finished, select the … menu next to the diabetes_test table in the Lakehouse explorer pane, and select Refresh. A new field, predictions, has been added.
  3. Add a new code cell to the notebook and drag the diabetes_test table to it. The necessary code to view the table’s contents will appear. Run the cell to display the data.
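
    If you prefer to write the query yourself rather than drag the table, a cell like this (a sketch, again reusing the table_name variable) shows each patient’s features alongside the model’s prediction:

    # Display the patient features together with the generated predictions
    df_results = spark.read.format("delta").load(f"Tables/{table_name}")
    display(df_results.select("AGE", "SEX", "BMI", "BP", "predictions"))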

Clean up resources

In this exercise, you have used a model to generate batch predictions.

If you’ve finished exploring the notebook, you can delete the workspace that you created for this exercise.

  1. In the bar on the left, select the icon for your workspace to view all of the items it contains.
  2. In the menu on the toolbar, select Workspace settings.
  3. In the General section, select Remove this workspace.