Lab 07: Ingest Data into Unity Catalog

Scenario

You are a data engineer at Solaris Energy, a renewable energy company operating solar farms in Spain and Germany, and wind turbine installations in the Netherlands and Denmark. Your team is building a centralised data platform on Azure Databricks to consolidate energy production readings, turbine maintenance events, and grid telemetry from all sites.

In this lab, you will practise the core data ingestion techniques available in Azure Databricks: batch ingestion with PySpark DataFrames, SQL-based file loading with COPY INTO, aggregated table creation with CTAS, and continuous file detection with Auto Loader.

By the end of this lab, you will be able to:

Create a Unity Catalog hierarchy (catalog, schema, volume) to house ingested data
Load CSV data from a managed volume into a Delta table using PySpark DataFrames
Use COPY INTO to incrementally load files with built-in deduplication
Create summary tables using CREATE TABLE AS SELECT
Configure Auto Loader to automatically detect and process new files from cloud storage

🤖 Genie Code — use it throughout this lab!

You are expected and encouraged to use the Genie Code for all exercises. Genie Code is available directly in the notebook cell toolbar. Use it to:

Look up syntax for SQL and PySpark operations
Explain error messages
Generate boilerplate code that you then adapt
Ask follow-up questions about how things work

To open Genie Code, select the assistant-icon on the right side of any notebook cell, or use the keyboard shortcut.

💡 Each exercise cell includes a suggested prompt you can copy directly into Genie Code to get started.

Prerequisites

An Azure Databricks Premium workspace provisioned using Lab 00: Set up your Azure Databricks environment.
You have the Data Engineer or equivalent role in the workspace, with permission to create catalogs and volumes
Basic familiarity with SQL and Python

Non-notebook exploration: Lakeflow Spark Declarative Pipelines (optional)

Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables) is the recommended way to build production-grade ingestion pipelines with automatic orchestration, schema management, and exactly-once guarantees.

To explore the pipeline editor:

In the sidebar, select Jobs & Pipelines.
Select Create and then ETL pipeline.
Review the options for specifying a source notebook or SQL file, naming the pipeline catalog and schema, and choosing a cluster type.
Click Cancel — you do not need to create or run a pipeline.

Importing the notebook

Follow these steps to import the lab notebook into your Databricks workspace:

In the Databricks workspace, click Workspace in the left sidebar.
Navigate to or create a folder where you want to store the lab (for example, /Users//Labs).
Click the ⋮ (kebab) menu or right-click the folder, then select Import.
Choose URL, enter the following URL, and click Import: https://raw.githubusercontent.com/MicrosoftLearning/DP-750T00-Implement-Data-Engineering-Solutions-using-Azure-Databricks/refs/heads/main/Allfiles/07-ingest-data-into-unity-catalog.ipynb
Open the imported notebook and, in the compute selector at the top, choose Serverless compute.

Lab overview

The notebook is structured into four exercises:

Exercise	Topic	Technique
1	Set up the catalog hierarchy	SQL DDL — CREATE CATALOG, CREATE SCHEMA, CREATE VOLUME
2	Batch ingestion with DataFrames	PySpark spark.read / df.write
3	SQL-based file ingestion	COPY INTO, CREATE TABLE AS SELECT
4	Auto Loader	cloudFiles format with spark.readStream

Work through the exercises in order. Each exercise builds on the catalog and data created in the previous one.