Create a Lakeflow Declarative Pipeline

Lakeflow Declarative Pipelines is a framework within the Databricks Lakehouse Platform for building and running data pipelines in a declarative manner. This means you specify what data transformations you want to achieve, and the system automatically figures out how to execute them efficiently, handling many of the complexities of traditional data engineering.

Lakeflow Declarative Pipelines simplifies the development of ETL (Extract, Transform, Load) pipelines by abstracting away the complex, low-level details. Instead of writing procedural code that dictates every step, you use a simpler, declarative syntax in SQL or Python.

This lab will take approximately 40 minutes to complete.

Note: The Azure Databricks user interface is subject to continual improvement. The user interface may have changed since the instructions in this exercise were written. Lakeflow Declarative Pipelines is the evolution of Databricks’ Delta Live Tables (DLT), offering a unified approach for both batch and streaming workloads.

Provision an Azure Databricks workspace

Tip: If you already have an Azure Databricks workspace, you can skip this procedure and use your existing workspace.

This exercise includes a script to provision a new Azure Databricks workspace. The script attempts to create a Premium tier Azure Databricks workspace resource in a region in which your Azure subscription has sufficient quota for the compute cores required in this exercise; and assumes your user account has sufficient permissions in the subscription to create an Azure Databricks workspace resource. If the script fails due to insufficient quota or permissions, you can try to create an Azure Databricks workspace interactively in the Azure portal.

In a web browser, sign into the Azure portal at https://portal.azure.com.
Use the [>_] button to the right of the search bar at the top of the page to create a new Cloud Shell in the Azure portal, selecting a PowerShell environment. The cloud shell provides a command line interface in a pane at the bottom of the Azure portal, as shown here:

Note: If you have previously created a cloud shell that uses a Bash environment, switch it to PowerShell.
Note that you can resize the cloud shell by dragging the separator bar at the top of the pane, or by using the —, ⤢, and X icons at the top right of the pane to minimize, maximize, and close the pane. For more information about using the Azure Cloud Shell, see the Azure Cloud Shell documentation.

In the PowerShell pane, enter the following commands to clone this repo:

 rm -r mslearn-databricks -f
 git clone https://github.com/MicrosoftLearning/mslearn-databricks

After the repo has been cloned, enter the following command to run the setup.ps1 script, which provisions an Azure Databricks workspace in an available region:
```
 ./mslearn-databricks/setup.ps1
```
If prompted, choose which subscription you want to use (this will only happen if you have access to multiple Azure subscriptions).
Wait for the script to complete - this typically takes around 5 minutes, but in some cases may take longer. While you are waiting, review the Lakeflow Declarative Pipelines article in the Azure Databricks documentation.

Create a cluster

Azure Databricks is a distributed processing platform that uses Apache Spark clusters to process data in parallel on multiple nodes. Each cluster consists of a driver node to coordinate the work, and worker nodes to perform processing tasks. In this exercise, you’ll create a single-node cluster to minimize the compute resources used in the lab environment (in which resources may be constrained). In a production environment, you’d typically create a cluster with multiple worker nodes.

Tip: If you already have a cluster with a 13.3 LTS or higher runtime version in your Azure Databricks workspace, you can use it to complete this exercise and skip this procedure.

In the Azure portal, browse to the msl-xxxxxxx resource group that was created by the script (or the resource group containing your existing Azure Databricks workspace)
Select your Azure Databricks Service resource (named databricks-xxxxxxx if you used the setup script to create it).
In the Overview page for your workspace, use the Launch Workspace button to open your Azure Databricks workspace in a new browser tab; signing in if prompted.

Tip: As you use the Databricks Workspace portal, various tips and notifications may be displayed. Dismiss these and follow the instructions provided to complete the tasks in this exercise.
In the sidebar on the left, select the (+) New task, and then select Cluster (You may need to look in the More submenu).
In the New Cluster page, create a new cluster with the following settings:
- Cluster name: User Name’s cluster (the default cluster name)
- Policy: Unrestricted
- Cluster mode: Single Node
- Access mode: Single user (with your user account selected)
- Databricks runtime version: 13.3 LTS (Spark 3.4.1, Scala 2.12) or later
- Use Photon Acceleration: Selected
- Node type: Standard_D4ds_v5
- Terminate after 20 minutes of inactivity
Wait for the cluster to be created. It may take a minute or two.

Note: If your cluster fails to start, your subscription may have insufficient quota in the region where your Azure Databricks workspace is provisioned. See CPU core limit prevents cluster creation for details. If this happens, you can try deleting your workspace and creating a new one in a different region. You can specify a region as a parameter for the setup script like this: ./mslearn-databricks/setup.ps1 eastus

Create a notebook and ingest data

In the sidebar, use the (+) New link to create a Notebook.
Change the default notebook name (Untitled Notebook [date]) to Data Ingestion and Exploration and in the Connect drop-down list, select your cluster if it is not already selected. If the cluster is not running, it may take a minute or so to start.

In the first cell of the notebook, enter the following code, which uses shell commands to download data files from GitHub into the file system used by your cluster.

 %sh
 rm -r /dbfs/delta_lab
 mkdir /dbfs/delta_lab
 wget -O /dbfs/delta_lab/covid_data.csv https://github.com/MicrosoftLearning/mslearn-databricks/raw/main/data/covid_data.csv

Use the ▸ Run Cell menu option at the left of the cell to run it. Then wait for the Spark job run by the code to complete.

Create Lakeflow Declarative Pipeline using SQL

Create a new notebook and rename it to Covid Pipeline Notebook.
Next to the notebook’s name, select Python and change the default language to SQL.

Enter the following code in the first cell without running it. All cells will be executed after the pipeline is created. This code defines a materialized view that will be populated by the raw data previously downloaded:

 CREATE OR REFRESH MATERIALIZED VIEW raw_covid_data
 COMMENT "COVID sample dataset. This data was ingested from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University."
 AS
 SELECT
   Last_Update,
   Country_Region,
   Confirmed,
   Deaths,
   Recovered
 FROM read_files('dbfs:/delta_lab/covid_data.csv', format => 'csv', header => true)

Under the first cell, use the + Code icon to add a new cell, and enter the following code to query, filter, and format the data from the previous table before analysis.

 CREATE OR REFRESH MATERIALIZED VIEW processed_covid_data(
   CONSTRAINT valid_country_region EXPECT (Country_Region IS NOT NULL) ON VIOLATION FAIL UPDATE
 )
 COMMENT "Formatted and filtered data for analysis."
 AS
 SELECT
     TO_DATE(Last_Update, 'MM/dd/yyyy') as Report_Date,
     Country_Region,
     Confirmed,
     Deaths,
     Recovered
 FROM live.raw_covid_data;

In a third new code cell, enter the following code that will create an enriched data view for further analysis once the pipeline is successfully executed.

 CREATE OR REFRESH MATERIALIZED VIEW aggregated_covid_data
 COMMENT "Aggregated daily data for the US with total counts."
 AS
 SELECT
     Report_Date,
     sum(Confirmed) as Total_Confirmed,
     sum(Deaths) as Total_Deaths,
     sum(Recovered) as Total_Recovered
 FROM live.processed_covid_data
 GROUP BY Report_Date;

Select Jobs & Pipelines in the left sidebar and then select ETL pipeline.
In the Create pipeline page, create a new pipeline with the following settings:
- Pipeline name: Covid Pipeline
- Product edition: Advanced
- Pipeline mode: Triggered
- Source code: Browse to your Covid Pipeline Notebook notebook in the Users/user@name folder.
- Storage options: Hive Metastore
- Storage location: dbfs:/pipelines/delta_lab
- Target schema: Enter default
Select Create and then Start. Then wait for the pipeline to run (which may take some time).
After the pipeline has successfully run, go back to the recent Data Ingestion and Exploration notebook you created first, and run the following code in a new cell to verify that the files for all 3 new tables have been created in the specified storage location:
```
 display(dbutils.fs.ls("dbfs:/pipelines/delta_lab/schemas/default/tables"))
```
Add another code cell and run the following code to verify that the tables have been created in the default database:
```
 %sql

 SHOW TABLES
```

View results as a visualization

After creating the tables, it is possible to load them into dataframes and visualize the data.

In the Data Ingestion and Exploration notebook, add a new code cell and run the following code to load the aggregated_covid_data into a dataframe:
```
 %sql
    
 SELECT * FROM aggregated_covid_data
```
Above the table of results, select + and then select Visualization to view the visualization editor, and then apply the following options:
- Visualization type: Line
- X Column: Report_Date
- Y Column: Add a new column and select Total_Confirmed. Apply the Sum aggregation.
Save the visualization and view the resulting chart in the notebook.

Clean up

In Azure Databricks portal, on the Compute page, select your cluster and select ■ Terminate to shut it down.

If you’ve finished exploring Azure Databricks, you can delete the resources you’ve created to avoid unnecessary Azure costs and free up capacity in your subscription.