Lab 13: Monitor, Troubleshoot, and Optimize Workloads in Azure Databricks

Introduction

A query that used to complete in seconds now takes minutes. Cluster resources appear busy, yet most tasks finish almost immediately while a handful drag on. These are the hallmarks of data skew — and they are often accompanied by excessive shuffle, which forces Spark to move large volumes of data across the network between stages.

In this lab you deliberately reproduce both problems so you can observe exactly what they look like in the Spark UI, then apply targeted fixes and confirm the improvement.

Part	Topic	Where
Part 1	Set up the environment and generate data	Notebook
Part 2	Expose data skew and inspect the Spark UI	Notebook + Lab instructions
Part 3	Fix the data skew	Notebook
Part 4	Expose excessive shuffle and inspect the Spark UI	Notebook + Lab instructions
Part 5	Reduce shuffle overhead	Notebook

🤖 Use Genie Code throughout this lab

You are expected and encouraged to use the Genie Code for every task in the notebook.

To open Genie Code, select the assistant-icon on the right side of any notebook cell, or use the keyboard shortcut.

Use it to:

Understand Spark configuration options (AQE, shuffle partitions, broadcast threshold)
Generate optimised join and aggregation code
Interpret Spark UI metrics and stage details
Explore alternative skew-handling techniques (salting, bucketing)

Prerequisites

An Azure Databricks Premium workspace provisioned using Lab 00: Set up your Azure Databricks environment.
You have permission to create catalogs and schemas in Unity Catalog.
Basic familiarity with PySpark DataFrames is helpful.

Create a compute cluster

This lab requires a classic compute cluster (not Serverless). Some Spark configuration options used to disable and re-enable Adaptive Query Execution — such as spark.databricks.optimizer.adaptive.enabled — are only available on classic clusters.

⚠️ Photon must be disabled. Photon is Databricks’ vectorized query engine. When enabled, it optimises shuffle and skew handling automatically, which masks the performance problems this lab is designed to demonstrate. You will not see skewed task distributions or high shuffle metrics in the Spark UI if Photon is active.

In your Databricks workspace, click Compute in the left sidebar.
Click Create compute.

Configure the cluster with the following minimal settings:

Setting	Value
Cluster name	`perf-lab`
Policy	Unrestricted
Single node	Enabled
Access mode	Single user
Databricks Runtime	Latest LTS (non-ML) — make sure to pick a non-Photon runtime
Node type	Choose the smallest available (e.g. `Standard_DS3_v2` or equivalent)
Auto-terminate	30 minutes

Expand Advanced options, then click the Spark tab. Add the following line to the Spark config box to explicitly disable Photon:
```
spark.databricks.photon.enabled false
```
Click Create compute and wait for the cluster to reach the Running state before continuing.

Import the notebook

In your Databricks workspace, click Workspace in the left sidebar.
Navigate to or create a folder where you want to store the lab.
Click the ⋮ (kebab) menu or right-click the folder, then select Import.
Choose URL, enter the following URL, and click Import: https://raw.githubusercontent.com/MicrosoftLearning/DP-750T00-Implement-Data-Engineering-Solutions-using-Azure-Databricks/refs/heads/main/Allfiles/13-monitor-troubleshoot-optimize-workloads-azure-databricks.ipynb
Open the imported notebook and, in the compute selector at the top, select the perf-lab cluster you created above.

Part 2: Investigate Data Skew in the Spark UI

After running the cells in Exercise 2 of the notebook, return here and follow the steps below to examine the skewed execution in the Spark UI.

Open the Spark UI

In your Databricks workspace, click Compute in the left sidebar, then click the perf-lab cluster. On the cluster detail page, click the Spark UI tab. The Spark UI loads inline on the page — if you prefer a separate tab, use the Open in new tab link at the top of the Spark UI panel.
You land on the Jobs page, which lists all Spark jobs triggered during this session.

Identify the skewed aggregation job

Find the most recently completed jobs. The groupBy aggregation from Task 2.1 and the sort-merge join from Task 2.2 will appear near the top (newest first).
Click on the aggregation job (it will reference stages related to aggregate or hashAggregate).
On the job detail page, look at the Stages list. Click on the stage that performed the shuffle — it is typically the one with the most tasks and the longest duration.
Scroll down to the Summary Metrics table for the tasks in that stage. Focus on the Duration row:
- Note the Min, 25th percentile, Median, 75th percentile, and Max values.
- If Max is more than 50% higher than the 75th percentile, you are seeing data skew — a small number of tasks are processing most of the data.
Scroll further down to the Tasks section. Sort the list by Duration (descending). Observe how one or two tasks have a duration far greater than the rest.

Identify the skewed join job

Click Jobs in the Spark UI left navigation to return to the jobs list.
Click on the sort-merge join job from Task 2.2 and repeat the analysis. Look for the same pattern: a few tasks with much longer durations than the majority.
When you have finished your investigation, return to the notebook and continue with Exercise 3.

Part 4: Investigate Excessive Shuffle in the Spark UI

After running the cells in Exercise 4 of the notebook, return here to investigate the shuffle metrics.

Locate the shuffle-heavy stages

Return to the Spark UI tab (or reopen it from the notebook toolbar).
Click the Stages tab in the left-hand navigation to see all stages across all jobs.
Look at the Shuffle Read and Shuffle Write columns. The stages triggered by Exercise 4 will show significant data movement in both columns — far more than the actual input data size in many cases.

Count Exchange nodes in the DAG

Click on any stage with high shuffle values to open its detail page.
At the top of the stage detail page, expand DAG Visualization. Look for nodes labelled Exchange — each one represents a full shuffle of the data across the cluster network.
Count the total number of Exchange nodes visible across the stages of the Exercise 4 job. Compare this to the minimum number of shuffles that are actually needed for a groupBy → join → sort pipeline.

Compare Shuffle Read vs Input size

Still in the stage detail page, scroll to the Summary Metrics section. Compare the Input Size / Records with the Shuffle Read Size / Records. When shuffle read is significantly larger than the original input, the pipeline is doing more network I/O than necessary.
When you have finished your investigation, return to the notebook and continue with Exercise 5.