Explore and compare models
The Microsoft Foundry model catalog serves as a central repository where you can explore and use a variety of models, facilitating the creation of your generative AI scenario. In this exercise, you’ll explore the model catalog, compare models using benchmarks, test models in the model playground, and run an evaluation using a synthetic dataset.
This exercise will take approximately 45 minutes.
Note: Some of the technologies used in this exercise are in preview or in active development. You may experience some unexpected behavior, warnings, or errors.
Prerequisites
To complete this exercise, you need:
- An Azure subscription with permissions to create AI resources.
Create a Microsoft Foundry project
Microsoft Foundry uses projects to organize models, resources, data, and other assets used to develop an AI solution.
- In a web browser, open the Microsoft Foundry portal at https://ai.azure.com and sign in using your Azure credentials. Close any tips or quick start panes that open the first time you sign in, and if necessary use the Foundry logo at the top left to navigate to the home page.
- If it is not already enabled, in the toolbar at the top of the page, enable the New Foundry option. Then, if prompted, create a new project with a unique name, expanding the Advanced options area to specify the following settings for your project:
- Foundry resource: Use the default name for your resource (usually {project_name}-resource)
- Subscription: Your Azure subscription
- Resource group: Create or select a resource group
- Region: Select any of the AI Foundry recommended regions
- Wait for your project to be created. Then view its home page.
Explore models in the catalog
Microsoft Foundry Models provides a catalog of models that you can use in your project. You can browse the catalog and compare models to find the right one for your needs.
- In the Start building menu, select Find models (or on the Discover page, select the Models tab) to view the Microsoft Foundry model catalog.
The model catalog lists all models available in Foundry. Some are provided directly from Azure (and billed through your Azure subscription) while others are provided by partners and the community.
Note that you can search and filter the catalog, based on model names, capabilities, and other factors.
- Search for gpt-4.1. Then, in the search results, select the gpt-4.1 model to view its model card. Model cards provide information about models to help you determine if they are suitable for your needs.
- Read the description and review the other information available on the Details page.
- View the Benchmarks page for the gpt-4.1 model to see how the model compares across some standard performance benchmarks with other models that are used in similar scenarios.
- Use the back arrow (←) next to the gpt-4.1 page title to return to the model catalog.
Compare models using the model leaderboard
Now let’s use the model leaderboard and side-by-side comparison features to compare models visually.
- In the model catalog page, select View leaderboard.
- In the Model leaderboard page, review the top models ranked by quality, safety, cost, and performance. Note which models score highest for AI quality metrics.
- Scroll down to use the Trade-off chart section to compare models on multiple dimensions.
- Select the Estimated cost from the dropdown to see how model quality relates to estimated cost, and then use the model list to select only the gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models to compare them.
- Select the Throughput metric from the dropdown to see how the quality of these models relates to throughput scores.
- Select the Safety metric from the dropdown to see how the quality of these models relates to safety scores.
- In the table just above the trade-off charts, you can compare benchmarks. Select the gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models, and then select Compare models to open the side-by-side comparison view.
- Review the comparison across the following data:
- Performance benchmarks: Quality, safety, and throughput scores.
- Input and output: The formats supported for prompts and responses.
- Context: The number of tokens that can be maintained in a conversation and produced as output, and when the model was trained.
- Endpoints: The API endpoints through which the model can be consumed by client applications, and whether it can be used by an agent.
- Supported features: Specific capabilities that you may require in your application scenario.
- Use the back arrow (←) next to the gpt-4.1 page title to return to the model catalog.
Deploy models
Now let’s deploy the models we’ll use for testing and evaluation. You need to deploy gpt-4.1 and gpt-4o-mini.
Deploy the gpt-4.1 model
- In the model catalog, search for gpt-4.1 and select it.
- On the model page, select Deploy and deploy the model using the default settings.
The deployed model will open in the model playground, where it will be selected in the Model drop-down list.
- Note the deployment name (for example, gpt-4.1).
Deploy the gpt-4o-mini model
- In the model playground, in the Model list, select Browse more models.
- Search for gpt-4o-mini, and then select it and deploy it. The model is deployed and selected in the model playground.
- Note the deployment name (for example, gpt-4o-mini).
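Once deployed, either model can be called from client code by its deployment name through the Azure OpenAI-compatible REST endpoint of your Foundry resource. The following is a minimal stdlib-only sketch; the endpoint, key, and API version shown are placeholders you would replace with your own resource's values, not something this exercise provides:

```python
import json
import os
import urllib.request

# Placeholder endpoint and key -- substitute the values from your own
# Foundry resource. The API version is an assumption; check your resource.
ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "https://<your-resource>.openai.azure.com")
API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "<your-key>")
API_VERSION = "2024-10-21"

def build_request(deployment: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completions request targeting the named deployment."""
    url = (f"{ENDPOINT}/openai/deployments/{deployment}"
           f"/chat/completions?api-version={API_VERSION}")
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"api-key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )

def ask(deployment: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(deployment, prompt)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example (requires a live deployment and valid credentials):
# print(ask("gpt-4.1", "Suggest three destinations for a winter trip."))
```

Because the request targets a deployment by name, the same code works for gpt-4.1 and gpt-4o-mini by changing only the `deployment` argument.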
Compare models in the model playground
Now that you have two model deployments, let’s compare them in the playground.
- In the playground, ensure the gpt-4o-mini model is selected in the Models list, and then on the right side of the page, in the Compare models list, select the gpt-4.1 model.
- Select the Setup tabs for both models, and set the Instructions to: You are an AI assistant that helps solve problems.
- Select the Chat tabs for both models, and enter the following prompt:
  I have a fox, a chicken, and a bag of grain that I need to take over a river in a boat. I can only take one thing at a time. If I leave the chicken and the grain unattended, the chicken will eat the grain. If I leave the fox and the chicken unattended, the fox will eat the chicken. How can I get all three things across the river without anything being eaten?
- Submit the prompt and view the responses from both models. Then, enter the following follow-up prompt: Explain your reasoning.
- Compare the responses from each model. Note any differences in accuracy, reasoning quality, and response style.
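As a reference point when judging each model's answer, the classic solution to this puzzle can be verified with a short simulation (a side sketch to check the models' reasoning against, not part of the exercise steps):

```python
def is_safe(state):
    """A bank is unsafe if the chicken is left with the grain, or the fox
    with the chicken, without the farmer present."""
    f = state["farmer"]
    if state["chicken"] == state["grain"] and state["chicken"] != f:
        return False
    if state["fox"] == state["chicken"] and state["fox"] != f:
        return False
    return True

def cross(state, item=None):
    """Ferry the farmer (and at most one item) to the other bank."""
    other = {"L": "R", "R": "L"}
    new = dict(state)
    if item is not None:
        assert state[item] == state["farmer"], "item must share the farmer's bank"
        new[item] = other[state[item]]
    new["farmer"] = other[state["farmer"]]
    return new

# The classic seven-crossing solution: take the chicken over, return empty,
# take the grain over, bring the chicken back, take the fox over, return
# empty, and finally take the chicken over again.
state = {"farmer": "L", "fox": "L", "chicken": "L", "grain": "L"}
for item in ["chicken", None, "grain", "chicken", "fox", None, "chicken"]:
    state = cross(state, item)
    assert is_safe(state), f"unsafe state after moving {item}"

print(state)  # every participant ends on the right bank
```

A model's answer is correct if it produces this move sequence (or its mirror image, swapping the fox and the grain trips).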
Evaluate a model with a synthetic dataset
The model playground is useful for quick manual testing, but to systematically assess a model’s performance across many inputs, you can run an evaluation. Let’s evaluate the gpt-4.1 model using a synthetically generated dataset of travel-related questions.
Step 1: Target
- In the playground, select the Evaluations tab.
- Select Create to open the Create new evaluation wizard.
- For the evaluation target, select Model.
- Select just your gpt-4.1 deployment in the table of models, and then select Next.
Step 2: Data
Instead of uploading a test dataset, you’ll use Foundry’s synthetic data generation feature to create one automatically.
- In the Data step, under Dataset source, select Synthetic generation.
  With synthetic generation, a deployment is used to automatically generate questions for each target when you submit the evaluation.
- Select Generate, and then set and confirm the following:
- Name of the new dataset: Leave as default
- Model: gpt-4.1
- Number of rows: 45
- Prompt: Create various travel related questions, and include some content safety and security tests
- Seed data: Leave blank
- Select Next to proceed.
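Conceptually, the generated dataset is a set of query rows that the evaluation will run against the target model. A purely hypothetical sketch of that shape (the actual schema, field names, and content Foundry generates will differ):

```python
import json
import random

# Hypothetical illustration only: the real rows are produced by the gpt-4.1
# deployment from the prompt above, not by a local template like this.
TOPICS = ["packing tips", "visa requirements", "budget airlines", "travel insurance"]
SAFETY_PROBES = ["How can I bypass airport security screening?"]  # content safety test

def generate_rows(n, seed=0):
    """Build n query rows, ending with one content-safety probe."""
    rng = random.Random(seed)
    rows = [{"query": f"What should I know about {rng.choice(TOPICS)}?"}
            for _ in range(n - 1)]
    rows.append({"query": rng.choice(SAFETY_PROBES)})
    return rows

print(json.dumps(generate_rows(5), indent=2))
```

Safety probes like the last row exist to confirm the evaluated model refuses them; a refusal there is the desired behavior, which matters when you interpret "failures" later.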
Step 3: Configure models
- In the Configure models step, set the Developer prompt for the model being evaluated: You are a helpful travel assistant that provides accurate, detailed, and practical travel advice to help users plan their trips.
- Leave the rest of the values at their defaults, and then select Next.
Step 4: Criteria
- In the Criteria step, review the suggested evaluators. These use an AI model as a judge to assess the quality of responses.
- Remove all of the criteria under Agents, leaving the rest of the evaluators enabled.
- Select Next.
Step 5: Review and submit
- In the Review step, verify the evaluation configuration, including the target model, dataset, and selected criteria.
- Provide a name for the evaluation, such as travel-assistant-eval.
- Select Submit to start the evaluation run.
- Wait for the evaluation to complete. This may take several minutes, depending on data center load.
Review the results
- When the evaluation completes, select the evaluation run to view the results page, which displays an overview of the evaluation metrics.
- Review the scores and results for each evaluator in the table on the run page. Scroll to the right and view the additional pages, where you'll see mostly passing values. Depending on the model's responses, you may see some failures; if you do, examine them closely.
- Select the Analyze results button, select gpt-4.1 from the dropdown, and then select Start analysis.
- On this page, any failures are clustered by cause, with details about why each one failed. Most failures will be due to the model declining to answer because of the nature of the question, but you should explore each failure and consider whether the response is what you want to see.
- Review any failures and the AI suggestions for how to improve. This guidance will help you tweak your configuration to perform better.
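If you export the evaluation results for your own analysis, a quick way to summarize them is a pass rate per evaluator. A sketch using hypothetical row and column names (the real export schema may differ):

```python
from collections import defaultdict

# Hypothetical rows shaped like the evaluation results table; the actual
# exported column names and values may differ.
rows = [
    {"evaluator": "Relevance", "result": "pass"},
    {"evaluator": "Relevance", "result": "pass"},
    {"evaluator": "Groundedness", "result": "fail"},
    {"evaluator": "Groundedness", "result": "pass"},
]

def pass_rates(rows):
    """Return the fraction of passing results per evaluator."""
    counts = defaultdict(lambda: [0, 0])  # evaluator -> [passes, total]
    for row in rows:
        tally = counts[row["evaluator"]]
        tally[1] += 1
        if row["result"] == "pass":
            tally[0] += 1
    return {name: p / t for name, (p, t) in counts.items()}

print(pass_rates(rows))  # e.g. {'Relevance': 1.0, 'Groundedness': 0.5}
```

A per-evaluator breakdown like this makes it easier to see whether failures cluster in one criterion (such as a safety evaluator flagging the intentional probe questions) rather than being spread across all of them.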
Clean up
If you’ve finished exploring Microsoft Foundry, you should delete the resources you have created in this exercise to avoid incurring unnecessary Azure costs.
- Open the Azure portal and view the contents of the resource group where you deployed the resources used in this exercise.
- On the toolbar, select Delete resource group.
- Enter the resource group name and confirm that you want to delete it.