Create a Microsoft Fabric Lakehouse
Large-scale data analytics solutions have traditionally been built around a data warehouse, in which data is stored in relational tables and queried using SQL. The growth in "big data" (characterized by high volumes, variety, and velocity of new data assets), together with the availability of low-cost storage and cloud-scale distributed compute technologies, has led to an alternative approach to analytical data storage: the data lake. In a data lake, data is stored as files without imposing a fixed schema for storage. Increasingly, data engineers and analysts seek to benefit from the best features of both of these approaches by combining them in a data lakehouse, in which data is stored in files in a data lake and a relational schema is applied to them as a metadata layer so that they can be queried using traditional SQL semantics.
In Microsoft Fabric, a lakehouse provides highly scalable file storage in a OneLake store (built on Azure Data Lake Storage Gen2) with a metastore for relational objects such as tables and views based on the open source Delta Lake table format. Delta Lake enables you to define a schema of tables in your lakehouse that you can query using SQL.
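For example, once a table exists in a lakehouse (such as the sales table you'll create later in this exercise), the Delta metadata layer lets you treat its underlying files as an ordinary relational table. A minimal sketch of that query style, assuming the sales table has already been loaded:

```sql
-- The table's data is physically Parquet files in OneLake; the Delta Lake
-- metadata layer lets the SQL analytics endpoint expose it as a relational table.
SELECT TOP 10 * FROM sales;
```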
This lab takes approximately 30 minutes to complete.
Note: You need a Microsoft Fabric trial to complete this exercise.
Create a workspace
Before working with data in Fabric, create a workspace with the Fabric trial enabled.
- On the Microsoft Fabric home page at https://app.fabric.microsoft.com/home?experience=fabric, select Synapse Data Engineering.
- In the menu bar on the left, select Workspaces (the icon looks similar to 🗇).
- Create a new workspace with a name of your choice, selecting a licensing mode in the Advanced section that includes Fabric capacity (Trial, Premium, or Fabric).
- When your new workspace opens, it should be empty.
Create a lakehouse
Now that you have a workspace, it’s time to create a data lakehouse for your data files.
- On the Synapse Data Engineering home page, create a new Lakehouse with a name of your choice. After a minute or so, a new lakehouse will be created.
- View the new lakehouse, and note that the Lakehouse explorer pane on the left enables you to browse tables and files in the lakehouse:
  - The Tables folder contains tables that you can query using SQL semantics. Tables in a Microsoft Fabric lakehouse are based on the open source Delta Lake file format, commonly used in Apache Spark.
  - The Files folder contains data files in the OneLake storage for the lakehouse that aren’t associated with managed delta tables. You can also create shortcuts in this folder to reference data that is stored externally.
Currently, there are no tables or files in the lakehouse.
Upload a file
Fabric provides multiple ways to load data into the lakehouse, including built-in support for pipelines that copy data from external sources and data flows (Gen 2) that you can define using visual tools based on Power Query. However, one of the simplest ways to ingest small amounts of data is to upload files or folders from your local computer (or lab VM if applicable).
- Download the sales.csv file from https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/sales.csv, saving it as sales.csv on your local computer (or lab VM if applicable).

  Note: To download the file, open a new tab in the browser and paste in the URL. Right-click anywhere on the page containing the data and select Save as to save the page as a CSV file.
- Return to the web browser tab containing your lakehouse, and in the … menu for the Files folder in the Lakehouse explorer pane, select New subfolder, and create a subfolder named data.
- In the … menu for the new data folder, select Upload and Upload files, and then upload the sales.csv file from your local computer (or lab VM if applicable).
- After the file has been uploaded, select the Files/data folder and verify that the sales.csv file has been uploaded.
- Select the sales.csv file to see a preview of its contents.
Explore shortcuts
In many scenarios, the data you need to work with in your lakehouse may be stored in some other location. While there are many ways to ingest data into the OneLake storage for your lakehouse, an alternative is to create a shortcut. Shortcuts enable you to include externally sourced data in your analytics solution without the overhead and risk of data inconsistency associated with copying it.
- In the … menu for the Files folder, select New shortcut.
- View the available data source types for shortcuts. Then close the New shortcut dialog box without creating a shortcut.
Load file data into a table
The sales data you uploaded is in a file, which data analysts and engineers can work with directly by using Apache Spark code. However, in many scenarios you may want to load the data from the file into a table so that you can query it using SQL.
- On the Home page, select the Files/data folder so you can see the sales.csv file it contains.
- In the … menu for the sales.csv file, select Load to Tables.
- In the Load to table dialog box, set the table name to sales and confirm the load operation. Then wait for the table to be created and loaded.
Tip: If the sales table does not automatically appear, in the … menu for the Tables folder, select Refresh.
- In the Lakehouse explorer pane, select the sales table that has been created to view the data.
- In the … menu for the sales table, select View files to see the underlying files for this table.
Files for a delta table are stored in Parquet format, and include a subfolder named _delta_log in which details of transactions applied to the table are logged.
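If you later open a Fabric notebook attached to the lakehouse, you can inspect this metadata with Delta Lake's Spark SQL commands. A minimal sketch, assuming the sales table created above (the notebook itself isn't part of this exercise):

```sql
-- Run in a Spark notebook attached to the lakehouse, not the SQL analytics endpoint.
-- DESCRIBE HISTORY reads the _delta_log folder to list the transactions
-- applied to the table; the initial load appears as the first entry.
DESCRIBE HISTORY sales;

-- DESCRIBE DETAIL reports table metadata, including its format (delta)
-- and the storage location of the underlying Parquet files.
DESCRIBE DETAIL sales;
```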
Use SQL to query tables
When you create a lakehouse and define tables in it, a SQL analytics endpoint is automatically created through which the tables can be queried using SQL SELECT statements.
- At the top-right of the Lakehouse page, switch from Lakehouse to SQL analytics endpoint. Then wait a short time until the SQL analytics endpoint for your lakehouse opens in a visual interface from which you can query its tables.
- Use the New SQL query button to open a new query editor, and enter the following SQL query:

  ```sql
  SELECT Item, SUM(Quantity * UnitPrice) AS Revenue
  FROM sales
  GROUP BY Item
  ORDER BY Revenue DESC;
  ```
  Note: If you are in a lab VM and have any problems entering the SQL query, you can download the 01-Snippets.txt file from https://github.com/MicrosoftLearning/mslearn-fabric/raw/main/Allfiles/Labs/01/Assets/01-Snippets.txt, saving it on the VM. You can then copy the query from the text file.

- Use the ▷ Run button to run the query and view the results, which should show the total revenue for each product.
Create a visual query
While many data professionals are familiar with SQL, data analysts with Power BI experience can apply their Power Query skills to create visual queries.
- On the toolbar, select New visual query.
- Drag the sales table to the new visual query editor pane that opens to create a Power Query.
- In the Manage columns menu, select Choose columns. Then select only the SalesOrderNumber and SalesOrderLineNumber columns.
- In the Transform menu, select Group by. Then group the data by using the following Basic settings:
  - Group by: SalesOrderNumber
  - New column name: LineItems
  - Operation: Count distinct values
  - Column: SalesOrderLineNumber
When you’re done, the results pane under the visual query shows the number of line items for each sales order.
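For comparison, this visual query is equivalent to a SQL query along the following lines, which you could run against the SQL analytics endpoint instead (a hand-written equivalent, not output generated by the visual query editor):

```sql
-- Count the distinct line items in each sales order.
SELECT SalesOrderNumber,
       COUNT(DISTINCT SalesOrderLineNumber) AS LineItems
FROM sales
GROUP BY SalesOrderNumber;
```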
Create a report
The tables in your lakehouse are automatically added to a default semantic model for reporting with Power BI.
- At the bottom of the SQL analytics endpoint page, select the Model tab. The data model schema for the semantic model is shown.
Note 1: In this exercise, the semantic model consists of a single table. In a real-world scenario, you would likely create multiple tables in your lakehouse, each of which would be included in the model. You could then define relationships between these tables in the model.
Note 2: The views frequently_run_queries, long_running_queries, exec_sessions_history, and exec_requests_history are part of the queryinsights schema automatically created by Fabric, a feature that provides a holistic view of historical query activity on the SQL analytics endpoint. Because this feature is out of scope for this exercise, ignore those views for now.
- In the menu ribbon, select the Reporting tab. Then select New report. Your current page will change to a report designer view.
- In the Data pane on the right, expand the sales table. Then select the following fields:
  - Item
  - Quantity

  A table visualization is added to the report.
- Hide the Data and Filters panes to create more space. Then ensure the table visualization is selected, and in the Visualizations pane, change the visualization to a Clustered bar chart and resize it.
- On the File menu, select Save. Then save the report as Item Sales Report in the workspace you created previously.
- Now, in the hub menu bar on the left, select your workspace to verify that it contains the following items:
  - Your lakehouse.
  - The SQL analytics endpoint for your lakehouse.
  - A default semantic model for the tables in your lakehouse.
  - The Item Sales Report report.
Clean up resources
In this exercise, you have created a lakehouse and imported data into it. You’ve seen how a lakehouse consists of files and tables stored in a OneLake data store. The managed tables can be queried using SQL, and are included in a default semantic model to support data visualizations.
If you’ve finished exploring your lakehouse, you can delete the workspace you created for this exercise.
- In the bar on the left, select the icon for your workspace to view all of the items it contains.
- In the … menu on the toolbar, select Workspace settings.
- In the General section, select Remove this workspace.