Lab 10: Design and implement data pipelines with Azure Databricks

Introduction

You are a data engineer at GlobStay, a hospitality group managing bookings across five hotel properties spanning beach resorts, mountain lodges, city-center inns, and airport transit hotels. GlobStay runs a nightly batch pipeline that ingests raw booking data from property management systems, cleans and validates it, and produces revenue analytics for the leadership team.

In this lab you design and implement that pipeline end-to-end. You will:

Build a medallion architecture (Bronze → Silver → Gold) for hotel booking data
Write parameterized, reusable notebook tasks that can be orchestrated by a Lakeflow Job
Add error handling so individual tasks can fail gracefully and signal their status to downstream tasks
Configure a Lakeflow Job with task dependencies, retry policies, and notifications
Explore conditional task flows (If/else branching) for routing based on data quality results

🤖 Use Genie Code throughout this lab

You are expected and encouraged to use the Genie Code for every exercise. Use it to get suggestions, explain errors, generate boilerplate, and explore APIs.

To open Genie Code, select the assistant-icon on the right side of any notebook cell, or use the keyboard shortcut.

Example prompt: “Help me write a PySpark statement that deduplicates a DataFrame on the booking_id column and filters out rows where rate_per_night is less than or equal to zero.”

Prerequisites

An Azure Databricks Premium workspace provisioned using Lab 00: Set up your Azure Databricks environment.
You have permission to create catalogs and schemas in Unity Catalog.
Basic familiarity with PySpark and Spark SQL.

Part 1: Notebook exercises (medallion architecture + error handling + parameters)

Import the notebook

In your Databricks workspace, click Workspace in the left sidebar.
Navigate to or create a folder where you want to store the lab.
Click the ⋮ (kebab menu) or right-click the folder, then select Import.
Choose URL, enter the following URL, and click Import: https://raw.githubusercontent.com/MicrosoftLearning/DP-750T00-Implement-Data-Engineering-Solutions-using-Azure-Databricks/refs/heads/main/Allfiles/10-design-implement-data-pipelines.ipynb
Open the imported notebook and, in the compute selector at the top, choose Serverless compute.

Work through the notebook exercises

The notebook contains five exercises. Run the provided setup cells and then complete each challenge cell.

Tip: Challenge cells contain instructions in comments. Cells marked with ✅ are already provided — simply run them.

Part 2: Orchestrate with Lakeflow Jobs (UI exercises)

These exercises use the Azure Databricks UI. You have already built the notebook logic — now you will stitch those notebooks together into a production-grade multi-task pipeline.

Note: For this exercise, you will reference the notebook you imported in Part 1. In a real project you would create separate notebooks for each layer. For demonstration purposes, you’ll configure the same notebook three times with different parameters.

Exercise: Create a multi-task pipeline job

In the left sidebar, click Jobs & Pipelines.
Click Create job.
At the top, replace the default job name with GlobStay Booking Pipeline.

Task 1 — Ingest bronze data

In the task editor, enter the task name ingest_bronze.
In the Type dropdown, select Notebook.
In the Source dropdown, select Workspace, then choose the path to the notebook you imported.
In the Compute field, select Serverless.
Click Save task.

Task 2 — Clean silver data

Click + Add task below Task 1 (this automatically sets a dependency on Task 1).
Name the task clean_silver, type Notebook, same notebook path, Serverless compute.
Under Parameters, add a key layer with value silver.
Under Retries, set:
- Retry count: 2
- Retry interval: 60 seconds
Click Save task.

Task 3 — Aggregate gold data

Click + Add task below Task 2.
Name the task aggregate_gold, type Notebook, same notebook path, Serverless compute.
Under Parameters, add a key layer with value gold.
Click Save task.

Verify the DAG

Review the DAG view. You should see three nodes arranged sequentially:
```
ingest_bronze → clean_silver → aggregate_gold
```
Notice that Task 2 and Task 3 each show “Depends on” the previous task.

Add failure notifications

In the right-hand job details panel, scroll to the Job notifications section.
Click Edit notifications.
Click Add notification, choose On failure, and enter your email address.
Optionally enable Mute notifications until the last retry to avoid alert fatigue during retries.
Click Save.

Run the job

Click Run now (top right).
In the Runs tab, observe each task executing in sequence.
Click a task in the run timeline to see its output, logs, and duration.

Exercise: Add a conditional task flow for data quality routing

In a real pipeline, you might want to route execution differently based on whether data quality issues were detected. This exercise adds an If/else condition task that branches based on a hypothetical data quality threshold.

Return to the GlobStay Booking Pipeline job editor.
Click + Add task alongside Task 3 (deselect Task 3 first to avoid a dependency).
Name the task check_quality, type Notebook, same notebook path.
Set this task to depend on clean_silver.
Click + Add task.
In the Type dropdown, select If/else condition.
Name it quality_gate.
Set this task to depend on check_quality.

In the Condition section, fill in the three fields separately:

Field	Value
Condition	``
Operator	`>`
Value	`5`

For the If true path, add a task named alert_data_issues (Notebook, same path).
For the If false path, add a task named proceed_gold (Notebook, same path).

Discussion: How does this compare to implementing error handling inside the notebook itself with try/except? When would you choose each approach?

Summary

In this lab you:

Built a medallion architecture (Bronze → Silver → Gold) for hotel booking data in Unity Catalog
Applied data cleaning patterns: deduplication, null filtering, date validation, and value constraints
Created Gold-layer aggregations covering property revenue and booking channel performance
Implemented error handling with try/except and dbutils.notebook.exit() for job-level signaling
Parameterized notebooks using dbutils.widgets and passed values between tasks using dbutils.jobs.taskValues
Configured a Lakeflow Job with sequential task dependencies, retry policies, notifications, and an If/else condition task