Lab 07: Ingest Data into Unity Catalog

Scenario

You are a data engineer at Solaris Energy, a renewable energy company operating solar farms in Spain and Germany, and wind turbine installations in the Netherlands and Denmark. Your team is building a centralised data platform on Azure Databricks to consolidate energy production readings, turbine maintenance events, and grid telemetry from all sites.

In this lab, you will practise the core data ingestion techniques available in Azure Databricks: batch ingestion with PySpark DataFrames, SQL-based file loading with COPY INTO, aggregated table creation with CTAS, and continuous file detection with Auto Loader.

By the end of this lab, you will be able to:

  • Create a Unity Catalog hierarchy (catalog, schema, volume) to house ingested data
  • Load CSV data from a managed volume into a Delta table using PySpark DataFrames
  • Use COPY INTO to incrementally load files with built-in deduplication
  • Create summary tables using CREATE TABLE AS SELECT
  • Configure Auto Loader to automatically detect and process new files from cloud storage

🤖 Genie Code — use it throughout this lab!

You are expected and encouraged to use the Genie Code for all exercises. Genie Code is available directly in the notebook cell toolbar. Use it to:

  • Look up syntax for SQL and PySpark operations
  • Explain error messages
  • Generate boilerplate code that you then adapt
  • Ask follow-up questions about how things work

To open Genie Code, select the assistant-icon on the right side of any notebook cell, or use the keyboard shortcut.

💡 Each exercise cell includes a suggested prompt you can copy directly into Genie Code to get started.


Prerequisites

  • An Azure Databricks Premium workspace provisioned using Lab 00: Set up your Azure Databricks environment.
  • You have the Data Engineer or equivalent role in the workspace, with permission to create catalogs and volumes
  • Basic familiarity with SQL and Python

Non-notebook exploration: Lakeflow Spark Declarative Pipelines (optional)

Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables) is the recommended way to build production-grade ingestion pipelines with automatic orchestration, schema management, and exactly-once guarantees.

To explore the pipeline editor:

  1. In the sidebar, select Jobs & Pipelines.
  2. Select Create and then ETL pipeline.
  3. Review the options for specifying a source notebook or SQL file, naming the pipeline catalog and schema, and choosing a cluster type.
  4. Click Cancel — you do not need to create or run a pipeline.

Importing the notebook

Follow these steps to import the lab notebook into your Databricks workspace:

  1. In the Databricks workspace, click Workspace in the left sidebar.
  2. Navigate to or create a folder where you want to store the lab (for example, /Users//Labs).
  3. Click the â‹® (kebab) menu or right-click the folder, then select Import.
  4. Choose URL, enter the following URL, and click Import: https://raw.githubusercontent.com/MicrosoftLearning/DP-750T00-Implement-Data-Engineering-Solutions-using-Azure-Databricks/refs/heads/main/Allfiles/07-ingest-data-into-unity-catalog.ipynb
  5. Open the imported notebook and, in the compute selector at the top, choose Serverless compute.

Lab overview

The notebook is structured into four exercises:

Exercise Topic Technique
1 Set up the catalog hierarchy SQL DDL — CREATE CATALOG, CREATE SCHEMA, CREATE VOLUME
2 Batch ingestion with DataFrames PySpark spark.read / df.write
3 SQL-based file ingestion COPY INTO, CREATE TABLE AS SELECT
4 Auto Loader cloudFiles format with spark.readStream

Work through the exercises in order. Each exercise builds on the catalog and data created in the previous one.