Lab: Ingest Data into Unity Catalog — Solaris Energy
Scenario
You are a data engineer at Solaris Energy, a renewable energy company operating solar farms in Spain and Germany, and wind turbine installations in the Netherlands and Denmark. Your team is building a centralised data platform on Azure Databricks to consolidate energy production readings, turbine maintenance events, and grid telemetry from all sites.
In this lab, you will practise the core data ingestion techniques available in Azure Databricks: batch ingestion with PySpark DataFrames, SQL-based file loading with COPY INTO, aggregated table creation with CTAS, and continuous file detection with Auto Loader.
By the end of this lab, you will be able to:
- Create a Unity Catalog hierarchy (catalog, schema, volume) to house ingested data
- Load CSV data from a managed volume into a Delta table using PySpark DataFrames
- Use `COPY INTO` to incrementally load files with built-in deduplication
- Create summary tables using `CREATE TABLE AS SELECT`
- Configure Auto Loader to automatically detect and process new files from cloud storage
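As a preview of the first three techniques, here is a minimal sketch of the kind of code you will write. The catalog, schema, table, and file names below are illustrative placeholders, not the names the lab notebook uses; the snippet assumes a Databricks notebook where `spark` is already defined.

```python
# Illustrative sketch only — the lab notebook supplies the real names and paths.
# Assumes a Databricks notebook, where `spark` is available automatically.

# Exercise 1: build the Unity Catalog hierarchy with SQL DDL
spark.sql("CREATE CATALOG IF NOT EXISTS solaris")
spark.sql("CREATE SCHEMA IF NOT EXISTS solaris.energy")
spark.sql("CREATE VOLUME IF NOT EXISTS solaris.energy.landing")

# Exercise 2: batch-read a CSV from the managed volume, write a Delta table
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/Volumes/solaris/energy/landing/production_readings.csv"))
df.write.mode("overwrite").saveAsTable("solaris.energy.production_readings")

# Exercise 3: incrementally load new files with COPY INTO (files already
# loaded are skipped automatically), then build a summary table with CTAS
spark.sql("""
    COPY INTO solaris.energy.production_readings
    FROM '/Volumes/solaris/energy/landing/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
spark.sql("""
    CREATE OR REPLACE TABLE solaris.energy.daily_production AS
    SELECT site_id, DATE(reading_ts) AS reading_date, SUM(kwh) AS total_kwh
    FROM solaris.energy.production_readings
    GROUP BY site_id, DATE(reading_ts)
""")
```

Don't worry about memorising this now — each exercise walks through its technique step by step.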
🤖 Databricks Assistant — use it throughout this lab!
You are expected and encouraged to use the Databricks Assistant for all exercises. The Assistant is available directly in the notebook cell toolbar. Use it to:
- Look up syntax for SQL and PySpark operations
- Explain error messages
- Generate boilerplate code that you then adapt
- Ask follow-up questions about how things work
To open the Databricks Assistant, select the Assistant icon on the right side of any notebook cell, or use the keyboard shortcut.
💡 Each exercise cell includes a suggested prompt you can copy directly into the Databricks Assistant to get started.
Prerequisites
- An Azure Databricks workspace (Premium tier) has been provisioned and you have access to it
- You have the Data Engineer or equivalent role in the workspace, with permission to create catalogs and volumes
- Basic familiarity with SQL and Python
Non-notebook exploration: Lakeflow Connect (optional)
Lakeflow Connect provides a graphical, low-code approach for ingesting data from external sources such as SQL Server, Salesforce, and SharePoint directly into Unity Catalog tables — without writing custom code.
To explore Lakeflow Connect in your workspace:
- In the Databricks workspace sidebar, click Data Engineering → Data Ingestion.
- Browse the available connectors. Note the categories: Database connectors, SaaS connectors, and File-based ingestion.
- Click on SQL Server to see how you would configure a connection, choose tables to ingest, and set a destination catalog and schema.
- Observe the options for SCD Type 1 (overwrite) vs SCD Type 2 (history tracking) and full refresh vs incremental extraction.
💡 You do not need to complete the Lakeflow Connect setup as part of this lab — a configured SQL Server source is not provided. The exploration above is for familiarisation only.
Non-notebook exploration: Lakeflow Spark Declarative Pipelines (optional)
Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables) is the recommended way to build production-grade ingestion pipelines with automatic orchestration, schema management, and exactly-once guarantees.
To explore the pipeline editor:
- In the sidebar, click Data Engineering → Pipelines.
- Click Create Pipeline and select ETL pipeline.
- Review the options for specifying a source notebook or SQL file, naming the pipeline catalog and schema, and choosing a cluster type.
- Click Cancel — you do not need to create or run a pipeline.
💡 The Auto CDC API (`create_auto_cdc_flow` in Python, `AUTO CDC INTO` in SQL) is the recommended way to process change data capture (CDC) feeds inside a Lakeflow Declarative Pipeline. It handles deduplication, out-of-order events, and SCD Type 1 or Type 2 patterns automatically. You would define it in a pipeline notebook or SQL file, then run it from the Pipelines UI.
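For orientation only, a hedged sketch of what the Python flavour might look like in a pipeline notebook. The parameter names follow the older `apply_changes` pattern and may differ slightly in your runtime; the table and column names are invented for illustration.

```python
# Illustrative pipeline-notebook sketch — this runs only inside a Lakeflow
# Declarative Pipeline, not in an ordinary notebook. Names are assumptions.
import dlt  # the pipelines Python module in Databricks runtimes

# Target streaming table that the CDC flow keeps up to date
dlt.create_streaming_table("turbine_events_silver")

dlt.create_auto_cdc_flow(
    target="turbine_events_silver",      # table to apply changes into
    source="turbine_events_cdc_feed",    # streaming source of CDC rows
    keys=["turbine_id"],                 # primary key column(s)
    sequence_by="event_ts",              # orders late / out-of-order events
    stored_as_scd_type=2,                # 2 = keep history, 1 = overwrite
)
```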
Importing the notebook
Follow these steps to import the lab notebook into your Databricks workspace:
- In the Databricks workspace, click Workspace in the left sidebar.
- Navigate to or create a folder where you want to store the lab (for example, `/Users/<your-email>/Labs`).
- Click the ⋮ (kebab) menu or right-click the folder, then select Import.
- Choose URL, enter the following URL, and click Import:
  `https://raw.githubusercontent.com/MicrosoftLearning/DP-750T00-Implement-Data-Engineering-Solutions-using-Azure-Databricks/refs/heads/main/Allfiles/07-ingest-data-into-unity-catalog.ipynb`
- Open the imported notebook and, in the compute selector at the top, choose Serverless compute.
Lab overview
The notebook is structured into four exercises:
| Exercise | Topic | Technique |
|---|---|---|
| 1 | Set up the catalog hierarchy | SQL DDL — CREATE CATALOG, CREATE SCHEMA, CREATE VOLUME |
| 2 | Batch ingestion with DataFrames | PySpark spark.read / df.write |
| 3 | SQL-based file ingestion | COPY INTO, CREATE TABLE AS SELECT |
| 4 | Auto Loader | cloudFiles format with spark.readStream |
Work through the exercises in order. Each exercise builds on the catalog and data created in the previous one.
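To set expectations for Exercise 4, this is the general shape of an Auto Loader stream: read with the `cloudFiles` format, track the inferred schema and stream progress in dedicated locations, and write to a Unity Catalog table. Paths and the table name below are placeholders; the notebook supplies the real ones.

```python
# Illustrative Auto Loader sketch — assumes a Databricks notebook with `spark`.
(spark.readStream
    .format("cloudFiles")                        # Auto Loader source format
    .option("cloudFiles.format", "csv")          # format of the incoming files
    .option("cloudFiles.schemaLocation",         # where inferred schema is stored
            "/Volumes/solaris/energy/landing/_schemas/telemetry")
    .option("header", "true")
    .load("/Volumes/solaris/energy/landing/telemetry/")
 .writeStream
    .option("checkpointLocation",                # tracks which files were processed
            "/Volumes/solaris/energy/landing/_checkpoints/telemetry")
    .trigger(availableNow=True)                  # process available files, then stop
    .toTable("solaris.energy.grid_telemetry"))
```

New files dropped into the source path are detected and processed exactly once, which is what distinguishes Auto Loader from a plain batch read.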