Design a machine learning solution - Case study

Estimated Time: 15 minutes

[!NOTE] To complete this exercise, read the case study carefully. Apply the design principles you’ve learned to make informed decisions. At the end, you’ll test your understanding by answering knowledge check questions.

Welcome to Contoso Retail! You’ve been hired as the lead data scientist to help us design a machine learning training solution.

Learning objectives

After completing this exercise, you’ll be able to:

Choose a data ingestion strategy for consolidating diverse data sources.
Select the right Azure service for a machine learning workload based on team skills and scale.
Provision cost-appropriate compute for model training.
Design a deployment approach that serves both real-time and batch prediction needs.

Understand the problem

At Contoso Retail, we operate both physical stores and an e-commerce platform. We want to build a product recommendation system that suggests items to customers based on their browsing and purchase history.

Mock-up of the Contoso Retail mobile app showing personalized product recommendations to a customer.

Our goal is to increase customer engagement and sales by showing personalized product recommendations:

In our mobile app, customers should see recommendations immediately when they view a product.
For our weekly email campaign, we want to include the top 5 recommended products for each of our 2 million customers.

Our data engineering team has been collecting customer interaction data for the past two years, including:

Browsing history (products viewed, time spent)
Purchase history (items bought, purchase dates, amounts)
Customer demographics (age, location, preferences)
Product catalog (categories, prices, descriptions, images)

The data is currently stored in multiple systems:

Data source	Service	Data type	Format	Update frequency
Transactional data	Azure SQL Database	Structured	Relational tables	Real time
Clickstream data	Azure Blob Storage	Semi-structured	JSON files	Hourly
Product images	Azure Blob Storage	Unstructured	Image files	As cataloged
Customer profiles	Dynamics 365 (CRM)	Structured	CRM records	Ongoing

We need your help deciding how to design the machine learning training solution to build this recommendation system.

Consider the requirements

As you design the solution, think about these key areas.

Data ingestion and preparation

Consider the data sources: We have data in Azure SQL Database, Blob Storage (JSON files), Blob Storage (images), and Dynamics 365. How should we consolidate this data?
Consider the data format: The data is in different formats (structured, semi-structured, and unstructured). What format should we use for training?
Consider the data pipeline: Should we build a data ingestion pipeline? If so, how often should it run?

Machine learning task and service

Consider the machine learning task: What type of machine learning task is this? Classification, regression, recommendation, or something else?
Consider the service: Should we use Azure Machine Learning, Azure Databricks, Microsoft Fabric, or Microsoft Foundry? What factors influence this choice?
Consider existing skills: Our team has strong Python experience but limited Spark knowledge. How does this affect our choice?
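
To make the recommendation task concrete, the following is a minimal sketch of item co-occurrence, one of the simplest forms of item-based collaborative filtering. It uses only the Python standard library. The customer IDs, product names, and function names are hypothetical and chosen for illustration; this is not Contoso's actual data or a production algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical purchase history: customer ID -> set of product IDs.
purchases = {
    "c1": {"laptop", "mouse", "keyboard"},
    "c2": {"laptop", "mouse"},
    "c3": {"mouse", "keyboard"},
    "c4": {"laptop", "monitor"},
}


def build_cooccurrence(history):
    """Count how often each pair of products appears in the same history."""
    counts = defaultdict(Counter)
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(customer_id, history, counts, k=5):
    """Rank products the customer has not bought by co-occurrence with owned ones."""
    owned = history[customer_id]
    scores = Counter()
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [product for product, _ in ranked[:k]]


cooccurrence = build_cooccurrence(purchases)
print(recommend("c2", purchases, cooccurrence))
```

A model like this needs only counts and a sort, which is one reason a simple recommender can be a reasonable starting point for a Python-focused team before it considers deep learning approaches.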

Compute resources

Consider the data size: We have 2 million customers and millions of product interactions. What compute type is appropriate?
Consider the model complexity: Recommendation systems can be simple (collaborative filtering) or complex (deep learning). How does this affect compute needs?
Consider cost: We have a limited budget for this initial phase. Should we start with CPU or GPU? General purpose or memory optimized?
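
One way to reason about memory needs is a back-of-the-envelope estimate of the customer-product interaction matrix. The sketch below compares a dense matrix with a sparse coordinate layout. Only the 2 million customers figure comes from the case study; the catalog size and interaction count are assumptions made for illustration, so substitute real numbers before drawing conclusions.

```python
CUSTOMERS = 2_000_000       # from the case study
PRODUCTS = 50_000           # hypothetical catalog size
INTERACTIONS = 50_000_000   # hypothetical number of customer-product pairs

BYTES_PER_FLOAT32 = 4
BYTES_PER_INT32 = 4
GB = 1e9


def dense_bytes(rows, cols):
    """Memory for a dense float32 matrix with one cell per customer-product pair."""
    return rows * cols * BYTES_PER_FLOAT32


def sparse_coo_bytes(nonzeros):
    """Memory for a sparse layout storing row index, column index, and value."""
    return nonzeros * (2 * BYTES_PER_INT32 + BYTES_PER_FLOAT32)


print(f"Dense:  {dense_bytes(CUSTOMERS, PRODUCTS) / GB:,.1f} GB")
print(f"Sparse: {sparse_coo_bytes(INTERACTIONS) / GB:,.1f} GB")
```

Under these assumed numbers, the dense matrix needs about 400 GB while the sparse layout needs under 1 GB. The data representation can therefore influence the compute choice as much as the raw customer count does.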

Deployment requirements

Consider the deployment types: We need both real-time recommendations (mobile app) and batch predictions (email campaign). How should we handle these different needs?
Consider the frequency: Mobile app recommendations need to be instant. Email campaigns are sent weekly. Should we use different endpoints?
Consider the scale: Our app has 100,000 daily active users. Our email campaign targets 2 million customers. How does scale affect our deployment decisions?
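
A rough calculation can help compare the two workloads. The sketch below turns the user and customer counts from the case study into request rates. The views per user, the peak-to-average ratio, and the batch window are hypothetical values picked for illustration, not Contoso measurements.

```python
DAILY_ACTIVE_USERS = 100_000   # from the case study
EMAIL_CUSTOMERS = 2_000_000    # from the case study

VIEWS_PER_USER_PER_DAY = 20    # hypothetical
PEAK_TO_AVERAGE = 5            # hypothetical traffic spike factor
BATCH_WINDOW_HOURS = 2         # hypothetical time allowed for the weekly job

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def average_rps(users, requests_per_user):
    """Average real-time requests per second, spread evenly over a day."""
    return users * requests_per_user / SECONDS_PER_DAY


def batch_rate(customers, window_hours):
    """Customers that must be scored per second to finish within the window."""
    return customers / (window_hours * SECONDS_PER_HOUR)


avg = average_rps(DAILY_ACTIVE_USERS, VIEWS_PER_USER_PER_DAY)
peak = avg * PEAK_TO_AVERAGE
batch = batch_rate(EMAIL_CUSTOMERS, BATCH_WINDOW_HOURS)

print(f"Real-time average: {avg:.1f} requests/s")
print(f"Real-time peak:    {peak:.1f} requests/s")
print(f"Batch throughput:  {batch:.1f} customers/s")
```

The real-time path must answer each request quickly at all hours, while the batch path only needs enough throughput to finish before the email is sent. The two workloads are sized by different constraints.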

Your task

Based on these requirements, you need to make design decisions about:

Data strategy: How will you ingest, transform, and store the data for training?
Service selection: Which Azure service(s) will you use for training, and why?
Compute strategy: What compute resources will you provision for training?
Deployment approach: How will you handle both real-time and batch prediction requirements?

Think through each decision carefully, considering trade-offs between cost, performance, complexity, and team capabilities. The knowledge check questions test your ability to make informed design choices based on this scenario.

[!TIP] There’s rarely a single “right” answer in solution design. Focus on the trade-offs between cost, performance, complexity, and your team’s skills, and be ready to justify your choices.

Compare your decisions

The following diagram shows one solution architecture that satisfies all of the requirements. Sketch your own design first, then expand the reference solution to compare.

Solution architecture diagram showing the four data sources flowing through an Azure Synapse Analytics ETL pipeline into Azure Data Lake Storage, then into Azure Machine Learning for training, which serves a real-time endpoint for the mobile app and a batch endpoint for the weekly email campaign.

Show reference solution

Data strategy: Use an ETL pipeline (for example, Azure Synapse Analytics or Azure Data Factory) to extract data from Azure SQL Database, Blob Storage, and Dynamics 365 on a schedule, then transform and land it in a unified Azure Data Lake Storage layer in a training-ready format such as Parquet.
Service selection: Use Azure Machine Learning. Its Python SDK matches the team’s skills, it scales to large datasets, and it provides end-to-end tooling for custom model training without requiring Spark expertise.
Compute strategy: Start with CPU general-purpose compute to control cost during the initial phase. Monitor training time, then scale to memory-optimized or GPU compute only if model complexity or data volume requires it.
Deployment approach: Deploy two endpoints from the same trained model. Use a real-time (online) endpoint for instant mobile app recommendations and a batch endpoint to score all 2 million customers efficiently for the weekly email campaign.

Put it into practice

Answer the following questions based on the Contoso Retail case study. Select an answer for each question, then expand Show answer to check your reasoning.

1. Based on the Contoso Retail case study, what data ingestion strategy would be most appropriate for consolidating data from Azure SQL Database, Blob Storage (JSON), and Dynamics 365?

A. Manually export data from each source and combine in Excel before training.
B. Create an ETL pipeline using Azure Synapse Analytics to extract, transform, and load data into a unified storage layer like Azure Data Lake Storage.
C. Keep data in separate sources and connect directly to each during model training.

Show answer

✅ Correct answer: B. Create an ETL pipeline using Azure Synapse Analytics to extract, transform, and load data into a unified storage layer like Azure Data Lake Storage.

The data is spread across structured, semi-structured, and unstructured sources that update on different schedules. An automated ETL pipeline consolidates these sources into a single, training-ready layer. Manual export doesn’t scale to millions of interactions, and connecting directly to each source during training adds latency and complexity.
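
The transform step of such a pipeline can be sketched in a few lines. The example below flattens nested clickstream JSON and joins it with customer profile rows to produce one flat record per interaction. It assumes a JSON Lines layout and invented field names (`customer_id`, `product_id`, `seconds_viewed`, `age`, `location`), none of which come from the case study. It keeps rows in memory rather than writing Parquet, which would need a third-party library.

```python
import csv
import io
import json

# Hypothetical hourly clickstream export in JSON Lines form.
CLICKSTREAM_JSONL = """\
{"customer_id": "c1", "event": {"product_id": "p10", "seconds_viewed": 42}}
{"customer_id": "c2", "event": {"product_id": "p11", "seconds_viewed": 7}}
{"customer_id": "c1", "event": {"product_id": "p11", "seconds_viewed": 15}}
"""

# Hypothetical customer profile extract in CSV form.
PROFILES_CSV = """\
customer_id,age,location
c1,34,Seattle
c2,51,Austin
"""


def load_profiles(text):
    """Index profile rows by customer ID for fast lookup during the join."""
    return {row["customer_id"]: row for row in csv.DictReader(io.StringIO(text))}


def flatten_events(text):
    """Turn each nested JSON event into a flat dictionary."""
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        yield {
            "customer_id": record["customer_id"],
            "product_id": record["event"]["product_id"],
            "seconds_viewed": record["event"]["seconds_viewed"],
        }


def consolidate(events_text, profiles_text):
    """Join flattened events with profiles into training-ready rows."""
    profiles = load_profiles(profiles_text)
    rows = []
    for event in flatten_events(events_text):
        profile = profiles.get(event["customer_id"], {})
        # CSV values stay as strings here; a real pipeline would enforce types.
        rows.append({**event, "age": profile.get("age"), "location": profile.get("location")})
    return rows


training_rows = consolidate(CLICKSTREAM_JSONL, PROFILES_CSV)
for row in training_rows:
    print(row)
```

In a managed pipeline, the same flatten-and-join logic would run on a schedule against the source systems, and the output would land in the unified storage layer instead of a Python list.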

2. For the Contoso Retail recommendation system, which Azure service would be most suitable given the team’s Python experience and the need to train on large-scale customer interaction data?

A. Microsoft Foundry, because it provides pre-built recommendation models.
B. Azure Machine Learning, because it supports the Python SDK, handles large datasets, and provides comprehensive tools for custom model training.
C. Azure Databricks, because it’s required for any large-scale machine learning.

Show answer

✅ Correct answer: B. Azure Machine Learning, because it supports the Python SDK, handles large datasets, and provides comprehensive tools for custom model training.

The team has strong Python skills but limited Spark knowledge, which makes Azure Machine Learning a better fit than Azure Databricks. Azure Databricks is Spark-based and isn’t required for all large-scale machine learning. Microsoft Foundry focuses on building generative AI applications rather than on training custom recommendation models.

3. Considering Contoso Retail needs both real-time recommendations (mobile app) and batch predictions (weekly email campaign), what deployment strategy should they implement?

A. Deploy the trained model to two separate endpoints: a real-time endpoint for the mobile app and a batch endpoint for the email campaign.
B. Deploy only a real-time endpoint and call it 2 million times for the email campaign.
C. Deploy only a batch endpoint and accept 5-10 minute delays for mobile app recommendations.

Show answer

✅ Correct answer: A. Deploy the trained model to two separate endpoints: a real-time endpoint for the mobile app and a batch endpoint for the email campaign.

The two scenarios have different latency and throughput needs. A real-time (online) endpoint delivers instant recommendations in the mobile app, while a batch endpoint efficiently scores 2 million customers for the weekly email. Forcing one endpoint type to handle both leads to either excessive cost or unacceptable latency.
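
The split between the two paths can be illustrated with one scoring function and two thin wrappers. In the sketch below, a dictionary of precomputed scores stands in for a trained model. `MODEL_SCORES`, `score`, `handle_online_request`, and `run_batch_job` are hypothetical names, not part of any Azure SDK.

```python
# Hypothetical stand-in for a trained model:
# customer ID -> {product ID: relevance score}.
MODEL_SCORES = {
    "c1": {"p1": 0.9, "p2": 0.4, "p3": 0.7},
    "c2": {"p1": 0.2, "p2": 0.8, "p3": 0.5},
}


def score(customer_id, k):
    """Shared scoring logic: return the top-k product IDs for one customer."""
    scores = MODEL_SCORES.get(customer_id, {})
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [product for product, _ in ranked[:k]]


def handle_online_request(payload):
    """Real-time path: one customer per request, answered immediately."""
    customer_id = payload["customer_id"]
    return {
        "customer_id": customer_id,
        "recommendations": score(customer_id, payload.get("k", 5)),
    }


def run_batch_job(customer_ids, k=5):
    """Batch path: score every customer in one pass for the weekly email."""
    return {customer_id: score(customer_id, k) for customer_id in customer_ids}


print(handle_online_request({"customer_id": "c1", "k": 2}))
print(run_batch_job(["c1", "c2"]))
```

Both wrappers call the same `score` function, so the mobile app and the email campaign stay consistent while each path is provisioned for its own latency or throughput need.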

4. What compute resource would be most appropriate for training the initial Contoso Retail recommendation model, given a dataset that covers 2 million customers and a limited budget?

A. Start with CPU general-purpose compute, monitor performance, and scale to memory-optimized or GPU if needed.
B. Immediately provision the largest GPU memory-optimized compute to ensure fast training.
C. Use only local development machines to minimize Azure costs.

Show answer

✅ Correct answer: A. Start with CPU general-purpose compute, monitor performance, and scale to memory-optimized or GPU if needed.

With a limited initial budget, start with cost-effective CPU general-purpose compute and scale up only when the model complexity or training time justifies it. Provisioning the largest GPU upfront risks spending budget on capacity the first model may not need, and local development machines are unlikely to handle the scale of 2 million customers and millions of interactions.

Key takeaways

Consolidate data before training. Diverse sources and formats should flow through an automated pipeline into a single, training-ready storage layer.
Match the service to your team and scale. Choose the platform that fits your team’s existing skills and your data volume rather than defaulting to the most powerful option.
Start small on compute and scale up. Begin with cost-effective compute and add memory-optimized or GPU resources only when the workload demands it.
Design deployment around how predictions are consumed. Use real-time endpoints for instant responses and batch endpoints for high-volume scheduled scoring.