Lab 12: Implement Development Lifecycle Processes in Azure Databricks
Introduction
You are a data engineer responsible for maintaining an order processing pipeline used by your warehouse operations team. The pipeline reads raw order data, removes invalid records, normalizes status codes, and computes tax-inclusive totals.
As the pipeline moves toward production, your team has decided to adopt proper software development lifecycle (SDLC) practices. That means:
- Implementing a testing strategy so bugs are caught before deployment
- Packaging the pipeline as a Declarative Automation Bundle (DAB) so it can be deployed consistently across environments
- Using the Databricks CLI to validate, preview, and deploy the bundle
This lab is structured in three parts:
| Part | Topic | Where you work |
|---|---|---|
| Part 1 | Implement unit tests with pytest | In a notebook running on Serverless compute in your Azure Databricks workspace |
| Part 2 | Configure a Declarative Automation Bundle | In a PowerShell terminal on your local machine (the Databricks CLI runs locally) |
| Part 3 | Deploy and verify the bundle with the Databricks CLI | In the same local PowerShell terminal, then verify in the Databricks workspace UI |
Important: Parts 2 and 3 use the Databricks CLI installed on your computer/lab virtual machine, not a terminal inside the Databricks workspace. Open a local PowerShell window for those parts and keep it open throughout.
🤖 Use Genie Code throughout this lab
You are expected and encouraged to use the Genie Code for every exercise.
To open Genie Code, select the on the right side of any notebook cell, or use the keyboard shortcut.
Use it to:
- Generate pytest fixtures and test functions
- Understand error messages and fix failing tests
- Draft YAML bundle configuration
- Look up CLI command syntax
Prerequisites
- An Azure Databricks Premium workspace provisioned using Lab 00: Set up your Azure Databricks environment.
- You have permission to create jobs in the workspace (required for Part 3).
- Basic familiarity with Python and pytest.
Part 1: Implement a Testing Strategy (Notebook)
Import the notebook
- In your Databricks workspace, click Workspace in the left sidebar.
- Navigate to or create a folder where you want to store the lab.
- Click the ⋮ (kebab menu) or right-click the folder, then select Import.
- Choose URL, enter the following URL, and click Import:
https://raw.githubusercontent.com/MicrosoftLearning/DP-750T00-Implement-Data-Engineering-Solutions-using-Azure-Databricks/refs/heads/main/Allfiles/12-implement-development-lifecycle-processes-in-azure-databricks.ipynb - Open the imported notebook and, in the compute selector at the top, choose Serverless compute.
Work through the notebook
The notebook contains three exercises:
- Exercise 1 — Install pytest and review the provided transforms.py module that you will test.
- Exercise 2 — Write unit tests for each transformation function using pytest fixtures.
- Exercise 3 — Write an integration test that runs the full pipeline against data created with a Spark session.
Complete all exercises before continuing to Part 2.
Part 2: Configure a Declarative Automation Bundle
Declarative Automation Bundles (DABs) let you define your Databricks resources — jobs, pipelines, notebooks — as infrastructure-as-code in YAML configuration files. This makes deployments repeatable and auditable.
In this part, you create a bundle for the order processing job using the Databricks CLI on your local machine.
Where you work: All commands in Parts 2 and 3 run in a PowerShell terminal on your own computer or Lab Virtual Machine. Open Windows Terminal or PowerShell now and keep it open for the rest of the lab. The Databricks CLI talks to your workspace over the network — you do not run these commands inside the Databricks web UI.
Install and configure the Databricks CLI
-
In your local PowerShell window, install the Databricks CLI:
winget install Databricks.DatabricksCLIThen close and reopen PowerShell so the new
databrickscommand is on your PATH, and verify the installation:databricks --versionExpected outcome: the command prints a version number (for example,
Databricks CLI v0.x.x). If it reports that the command is not recognized, reopen PowerShell and try again. -
Authenticate the CLI against your Azure Databricks workspace:
databricks auth login --host https://<your-workspace-url>Replace **
** with the URL of your workspace (for example, https://adb-1234567890123456.7.azuredatabricks.net). You can copy this from the address bar of your browser when the workspace is open. When prompted for a **profile name**, press Enter to accept the default. A browser window opens — sign in and approve the request. Expected outcome: PowerShell displays
Profile DEFAULT was successfully saved, confirming the CLI is now authenticated to your workspace. -
Create a new project directory and the two subfolders the bundle needs, then move into the project directory:
mkdir ~/order-pipeline-bundle; cd ~/order-pipeline-bundle mkdir notebooks, resourcesExpected outcome: your PowerShell prompt now ends in
order-pipeline-bundle, and the folder contains emptynotebooksandresourcessubfolders. Runlsto confirm.
Create the bundle configuration file
The databricks.yml file is the heart of a Declarative Automation Bundle — it describes the resources to deploy. In this step you create that file in the root of your order-pipeline-bundle folder. Your task is to produce a databricks.yml file that meets the following requirements:
- Bundle name:
order-pipeline-bundle - A variables section with:
- An
environmentvariable (default:development) - A
cluster_policy_idvariable with a description and a default value of an empty string (""). A cluster policy ID enforces configuration rules on job clusters, and in a production bundle you’d reference this value on those clusters. To keep this lab simple — the job uses Serverless compute — the empty-string default makes the variable optional, so you don’t have to supply it at deploy time. If you’d rather use a real value, copy a policy ID from your workspace under Compute > Policies: open a policy and use the ID shown in its details or URL.
- An
- A resources section defining a job named order-pipeline-job that:
- Has a display name using the environment variable: ${var.environment}-order-pipeline
- Has two notebook tasks:
- validate-data — runs ./notebooks/validate.py
- transform-data — depends on validate-data and runs ./notebooks/transform.py
- A targets section with:
- A dev target (default, development mode, environment = development)
- A prod target (production mode, with its own workspace host and environment = production)
You create this file with a plain-text editor rather than from the command line, so it’s easier to read and edit the YAML.
-
Open the
order-pipeline-bundlefolder in a text editor:- VS Code (recommended): in your PowerShell window, run
code .to open the current folder. If thecodecommand isn’t available, open VS Code manually and select File > Open Folder, then choose yourorder-pipeline-bundlefolder. - Notepad: run
notepad databricks.yml. Notepad asks whether to create the file because it doesn’t exist yet — select Yes.
- VS Code (recommended): in your PowerShell window, run
-
Create a new file named databricks.yml in the root of the
order-pipeline-bundlefolder (in VS Code, select New File and name itdatabricks.yml). -
Copy the starter configuration below into the file. It already includes the bundle name, the
environmentvariable, the first task, and thedevtarget. Complete the three sections marked with# TODO:bundle: name: order-pipeline-bundle variables: environment: description: The deployment environment name default: development # TODO: Add a variable named 'cluster_policy_id'. # Give it a 'description' and a 'default' value of "" (an empty string), # which makes the variable optional so you don't have to supply it at deploy time. # (A real cluster policy ID comes from Compute > Policies in your workspace, # but you don't need one for this lab.) resources: jobs: order-pipeline-job: name: ${var.environment}-order-pipeline tasks: - task_key: validate-data notebook_task: notebook_path: ./notebooks/validate.py # TODO: Add a second task named 'transform-data' # It should depend on 'validate-data' and run ./notebooks/transform.py # Refer to the 'depends_on' key in the Declarative Automation Bundle schema. targets: dev: default: true mode: development variables: environment: development # TODO: Add a 'prod' target that: # - sets mode to production # - sets a workspace host (use a placeholder URL for now) # - overrides the environment variable to 'production'Tip: YAML is indentation-sensitive — use two spaces per level and never use tabs. Keep each
# TODOblock aligned with the example lines around it. -
Save the file (in VS Code, press Ctrl+S; in Notepad, select File > Save). Make sure it’s saved as
databricks.ymldirectly inside the order-pipeline-bundle folder, not inside notebooks or resources.
🤖 Genie Code tip: Ask “Show me a complete Declarative Automation Bundle databricks.yml example with two targets, job tasks, and custom variables” to get a full reference configuration you can adapt.
Need the solution? A completed reference file with all
# TODOsections filled in is available at Allfiles/answers/12-databricks-bundle-with-answers.yml. Try to complete the# TODOsections yourself first, then compare your work against the solution.
Expected outcome: a databricks.yml file exists in the root of order-pipeline-bundle, containing the cluster_policy_id variable, the second transform-data task with its depends_on, and the prod target you added. You can confirm it from PowerShell with Get-Content databricks.yml.
Create placeholder notebooks (required for validation)
Bundle validation checks that every notebook referenced in databricks.yml actually exists on disk and is a Databricks notebook. A plain .py file isn’t enough — a Databricks source-format notebook must begin with the special first line # Databricks notebook source, and the file must be saved as UTF-8 without a byte-order mark (BOM). If the file has a BOM or uses a different encoding, the Databricks CLI can’t detect the header and validation fails with “… is not a notebook” — even though the text looks correct in an editor.
Run these commands in your local PowerShell window (still inside the order-pipeline-bundle folder) to create both placeholder notebooks as BOM-free UTF-8:
[System.IO.File]::WriteAllLines("$PWD\notebooks\validate.py", @("# Databricks notebook source", "# validate"))
[System.IO.File]::WriteAllLines("$PWD\notebooks\transform.py", @("# Databricks notebook source", "# transform"))
Why not
Set-Content?Set-Content(andOut-File) can prepend a BOM or use UTF-16 depending on your PowerShell version, which is exactly what triggers the “is not a notebook” error.[System.IO.File]::WriteAllLinesalways writes UTF-8 without a BOM, so the magic header is the very first thing in the file.
Expected outcome: the notebooks folder contains validate.py and transform.py, each starting with # Databricks notebook source. Run Get-Content notebooks/validate.py to confirm the header is on line 1. When you rerun databricks bundle validate, the “is not a notebook” error is gone.
Part 3: Deploy and Verify the Bundle with the Databricks CLI
With your bundle configured, you use the Databricks CLI to validate, preview, and deploy it to your workspace.
Where you work: Run every command in this part in your local PowerShell window, from inside the
order-pipeline-bundlefolder. If you closed PowerShell, reopen it and runcd ~/order-pipeline-bundlefirst.
Step 1 — Validate the bundle
Run the following command from inside the order-pipeline-bundle directory. This checks that your databricks.yml is syntactically correct and references valid resources.
databricks bundle validate
Expected outcome: the command prints a summary showing the bundle Name (order-pipeline-bundle), the Target (dev), and the Workspace path, followed by Validation OK!. If you see errors instead, read the message (it usually names the line or key at fault), fix the YAML, and run the command again.
🤖 Tip: Copy any validation error messages and paste them into Genie Code to get an explanation and suggested fix.
Step 2 — Preview the deployment plan
Before making any changes to your workspace, preview what the deployment would create or update:
databricks bundle plan
Review the output. Expected outcome: the plan reports that the order-pipeline-job would be created (since it doesn’t exist yet) — no resources are changed at this stage. To run the plan against a specific target, name it explicitly:
databricks bundle plan -t dev
Step 3 — Deploy the bundle
Deploy the bundle to the dev target:
databricks bundle deploy -t dev
During deployment, the CLI:
- Uploads your notebook files to the workspace
- Creates the order-pipeline-job in the Jobs & Pipelines section of your workspace (prefixed with
[dev <username>]because development mode is active)
Expected outcome: the command finishes with a message such as Deployment complete! and no errors.
Step 4 — Verify the deployed resources
Confirm the deployment succeeded:
databricks bundle summary
Expected outcome: the output lists the deployed order-pipeline-job along with a direct URL to it. Open that URL in your browser (or go to Jobs & Pipelines in the workspace) and confirm that:
- The job appears with a name prefixed by
[dev <your-username>]. - Opening the job shows both tasks — validate-data and transform-data — with transform-data depending on validate-data in the task graph.
🤖 Tip: Ask “What does Declarative Automation Bundle development mode do to job names and schedules?” to understand why the job is prefixed with your username.
Step 5 — Clean up (optional)
To remove the deployed resources from your workspace, run the following from the order-pipeline-bundle folder:
databricks bundle destroy -t dev
When prompted, confirm that you want to delete the job that was created. Expected outcome: the CLI reports the resources were destroyed, and the job no longer appears in Jobs & Pipelines.
Summary
In this lab you:
- Implemented unit tests using pytest fixtures to verify individual transformation functions
- Configured a Declarative Automation Bundle with variables, job resources, and multi-environment targets
- Used the Databricks CLI to validate, plan, deploy, and verify a bundle deployment