# Monitor and trace your generative AI agent
This exercise takes approximately 40 minutes.
Note: This lab assumes a pre-configured lab environment with Visual Studio Code, Azure CLI, and Python already installed.
## Introduction
In this exercise, you’ll use Application Insights and distributed tracing to observe the Trail Guide Agent’s runtime behavior. After iterating on prompts in earlier labs, you now have a concrete question: what did those prompt changes actually cost in the real world?
Scenario: You’re operating the Adventure Works Trail Guide Agent and have deployed three prompt versions over time. Before deciding which version to keep in production, you want to observe how each one performs at runtime — measuring token usage, latency, and response behavior across a consistent set of test inputs.
You’ll run the same five test prompts against prompt versions v1, v2, and v3, then analyze the results from two angles:
- Azure Monitor: Aggregated token usage and latency metrics across versions
- Trace tree (local): A nested span tree per version from `python src/tests/check_traces.py`, with per-prompt timing and token attributes
## Set up the environment
To complete the tasks in this exercise, you need:
- Visual Studio Code
- Azure subscription with Microsoft Foundry access
- Git and GitHub account
- Python 3.9 or later
- Azure CLI and Azure Developer CLI (azd) installed
Tip: If you haven’t installed these prerequisites yet, see Lab 00: Prerequisites for installation instructions and links.
All steps in this lab will be performed using Visual Studio Code and its integrated terminal.
### Create repository from template
You’ll start by creating your own repository from the template to practice realistic workflows.
- In a web browser, navigate to the template repository on GitHub at `https://github.com/MicrosoftLearning/mslearn-genaiops`.
- Select Use this template > Create a new repository.
- Enter a name for your repository (e.g., `mslearn-genaiops`).
- Set the repository to Public or Private based on your preference.
- Select Create repository.
### Clone the repository in Visual Studio Code
After creating your repository, clone it to your local machine.
- In Visual Studio Code, open the Command Palette by pressing Ctrl+Shift+P.
- Type Git: Clone and select it.
- Enter your repository URL: `https://github.com/[your-username]/mslearn-genaiops.git`
- Select a location on your local machine to clone the repository.
- When prompted, select Open to open the cloned repository in VS Code.
### Deploy Microsoft Foundry resources
Now you’ll use the Azure Developer CLI to deploy all required Azure resources.
- In Visual Studio Code, open a terminal by selecting Terminal > New Terminal from the menu.
- Authenticate with Azure Developer CLI:

  ```
  azd auth login
  ```

  This opens a browser window for Azure authentication. Sign in with your Azure credentials.
- Authenticate with Azure CLI:

  ```
  az login
  ```

  Sign in with your Azure credentials when prompted.

  > ⚠️ Important: In some environments, the VS Code integrated terminal may crash or close during the interactive login flow. If this happens, authenticate using explicit credentials instead:
  >
  > ```
  > az login --username <your-username> --password <your-password>
  > ```
- Provision resources:

  ```
  azd up
  ```

  When prompted, provide:
  - Environment name (e.g., dev, test) - used to name all resources
  - Azure subscription - where resources will be created
  - Location - Azure region (recommended: Sweden Central)

  The command deploys the infrastructure from the `infra\` folder, creating:
  - Resource Group - container for all resources
  - Foundry (AI Services) - the hub with access to models like GPT-4.1
  - Foundry Project - your workspace for creating and managing agents
  - Log Analytics Workspace - collects logs and telemetry data
  - Application Insights - monitors performance and usage
- Create a `.env` file with the environment variables:

  ```
  azd env get-values > .env
  ```

  This creates a `.env` file in your project root with all the provisioned resource information.

  > ⚠️ Important - File encoding: after generating the `.env` file, make sure it is saved using UTF-8 encoding. In editors like VS Code, check the encoding indicator in the bottom-right corner. If it shows UTF-16 LE (or any encoding other than UTF-8), click it, choose Save with Encoding, and select UTF-8. Using the wrong encoding may cause environment variables to be read incorrectly.
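If you want to verify the `.env` encoding programmatically rather than through the editor, the following stdlib-only sketch inspects the file's byte-order mark. The helper name `detect_env_encoding` is illustrative and not part of the lab's code:

```python
# Stdlib-only check that a .env file is UTF-8 rather than UTF-16.
# PowerShell's ">" redirection often writes UTF-16 LE with a byte-order mark
# (BOM), which breaks naive .env parsers. Function name is illustrative.
import codecs

def detect_env_encoding(path: str) -> str:
    """Best-effort encoding guess based on the file's BOM."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "utf-8"  # no BOM: assume plain UTF-8/ASCII
```

Run it against your generated `.env`; if it reports anything other than `utf-8`, re-save the file as UTF-8 in VS Code.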
### Install Python dependencies
With your Azure resources deployed, install the required Python packages.
- In the VS Code terminal, create and activate a virtual environment:

  ```
  python -m venv .venv
  .venv\Scripts\Activate.ps1
  ```

- Install the required dependencies:

  ```
  python -m pip install -r requirements.txt
  ```

  This installs all required packages, including:
  - `azure-monitor-opentelemetry` - exports telemetry to Application Insights
  - `opentelemetry-instrumentation-openai-v2` - auto-instruments chat completion calls
- Add the model configuration to your `.env` file. Open the `.env` file in your repository root and add:

  ```
  MODEL_NAME="gpt-4.1"
  ```
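For context on how scripts consume this file, here is a minimal stdlib sketch of a `.env` reader. It is shown only to illustrate the `KEY=value` format; the lab's actual scripts likely use a library such as python-dotenv, and the `load_env` helper is hypothetical:

```python
# Minimal .env reader sketch (illustrative only; real code likely uses
# python-dotenv). Skips blank lines and comments, strips surrounding quotes.
def load_env(path: str) -> dict:
    values = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    return values
```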
## Understand the monitoring script

Before running anything, take a moment to review what `src/tests/run_monitoring.py` does.
- Open `src/tests/run_monitoring.py` in Visual Studio Code.
- Review the telemetry setup at the top of the script:

  ```python
  connection_string = project_client.telemetry.get_application_insights_connection_string()
  configure_azure_monitor(connection_string=connection_string)
  OpenAIInstrumentor().instrument()
  ```

  This retrieves the Application Insights connection string that was provisioned alongside your Foundry project. `OpenAIInstrumentor` automatically creates child spans for every `chat.completions.create()` call without any extra code.
- Review the `run_version` function. It wraps each prompt version in a top-level span named `trail_guide_v1`, `trail_guide_v2`, or `trail_guide_v3`, and wraps each individual test inside a child span. Custom attributes like `response.total_tokens` and `response.duration_s` are attached to every child span.
- Note the `VERSIONS` list at the top:

  ```python
  VERSIONS = ["v1", "v2", "v3"]
  ```

  The script loads each version's system prompt from `src/agents/trail_guide_agent/prompts/` and runs the same five test prompts from `src/tests/test-prompts/` against it. This gives you a consistent, reproducible comparison.
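The nesting that `run_version` produces can be illustrated with a small stdlib-only sketch. This is not the lab's actual implementation (which uses OpenTelemetry); the `Tracer` and `Span` classes below are hypothetical stand-ins that reproduce the same root-span / child-span / attribute pattern:

```python
# Stdlib-only sketch of the span nesting run_monitoring.py builds with
# OpenTelemetry: one root span per prompt version, one child span per test
# prompt, with duration and token counts attached as attributes.
# All names here are illustrative, not the lab's code.
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

class Tracer:
    def __init__(self):
        self.roots, self._stack = [], []

    @contextmanager
    def start_span(self, name):
        span = Span(name)
        # Attach to the current parent if one is open, else record as a root.
        (self._stack[-1].children if self._stack else self.roots).append(span)
        self._stack.append(span)
        start = time.perf_counter()
        try:
            yield span
        finally:
            span.attributes["response.duration_s"] = time.perf_counter() - start
            self._stack.pop()

def run_version(tracer, version, tests):
    """Root span per version; child span per test with token attributes."""
    with tracer.start_span(f"trail_guide_{version}"):
        for test_name, fake_tokens in tests:
            with tracer.start_span(f"{version}_{test_name}") as child:
                child.attributes["response.total_tokens"] = fake_tokens

tracer = Tracer()
run_version(tracer, "v1", [("day-hike-gear", 180), ("trail-difficulty", 210)])
root = tracer.roots[0]
print(root.name)                        # trail_guide_v1
print([c.name for c in root.children])  # ['v1_day-hike-gear', 'v1_trail-difficulty']
```

In the real script, OpenTelemetry adds a third level automatically: the instrumented `chat gpt-4.1` span nested under each test span.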
## Run the monitoring script
- In the VS Code terminal, run the script:

  ```
  python src/tests/run_monitoring.py
  ```

- Watch the terminal output as the script progresses through each version. For each test prompt you'll see:
  - Duration: How long the model took to respond
  - Tokens: Total tokens used, broken down into prompt and completion tokens
  - Response: A preview of the model's output
- Note any visible differences between versions as they run. For example:
  - Does v3 produce noticeably longer responses than v1?
  - Does token count grow consistently from v1 to v3?
- View the trace tree for the latest run (recommended).

  This lab uses a local trace viewer script that queries Log Analytics and prints a nested tree. This makes the parent/child chain very explicit without relying on a portal UI. Run:

  ```
  python src/tests/check_traces.py
  ```

  You should see one trace per prompt version, with three levels:

  ```
  trail_guide_v1             ← root: entire version run
  ├── v1_day-hike-gear       ← child: one test prompt (duration + token attributes)
  │   └── chat gpt-4.1       ← auto: the actual LLM call
  └── ...
  trail_guide_v2  (same structure)
  trail_guide_v3  (same structure)
  ```

  If the output says "No traces found yet", wait 2–3 minutes and run the script again.
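To make the parent/child chain concrete, here is a sketch of how a nested tree like this can be rebuilt from flat span rows. The row shape (`id`, `parent_id`, `name`) is an assumption for illustration; the real Log Analytics records the script queries contain more fields:

```python
# Sketch: reconstruct a nested trace tree from flat span rows, the way a
# trace viewer does. Row shape (id / parent_id / name) is an assumption.
def build_tree(rows):
    """Link spans to their parents and return the list of root spans."""
    by_id = {r["id"]: {**r, "children": []} for r in rows}
    roots = []
    for span in by_id.values():
        parent = by_id.get(span["parent_id"])
        (parent["children"] if parent else roots).append(span)
    return roots

def render(span, depth=0):
    """Return indented lines for a span and its descendants."""
    lines = ["    " * depth + span["name"]]
    for child in span["children"]:
        lines.extend(render(child, depth + 1))
    return lines

rows = [
    {"id": "1", "parent_id": None, "name": "trail_guide_v1"},
    {"id": "2", "parent_id": "1", "name": "v1_day-hike-gear"},
    {"id": "3", "parent_id": "2", "name": "chat gpt-4.1"},
]
print("\n".join(render(build_tree(rows)[0])))
```

This prints the same three-level shape as the viewer: version root, test child, LLM-call grandchild.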
## View monitoring data in Azure Monitor
Now you’ll examine the aggregated performance metrics for all three versions.
### Navigate to Azure Monitor from Microsoft Foundry
- In a web browser, open the Microsoft Foundry portal at `https://ai.azure.com` and sign in using your Azure credentials.
- Open your Foundry project.
- In the navigation pane on the left, select Monitoring.
- Select Resource usage to see a summary of model interactions.

  Note: If no data appears yet, wait a minute and refresh the page.
### Review token usage across versions
Focus on the token usage metrics and compare the three prompt versions:
- Total token count: The combined total of prompt and completion tokens across all calls. This is the primary cost driver; a jump between versions here signals that the prompt change has a direct financial impact.
- Prompt token count: Tokens used in the input (the system prompt plus the user message). Longer, more detailed system prompts like v3 will show higher prompt token counts even for identical user questions.
- Completion token count: Tokens the model generated in its response. If v3 produces significantly more completion tokens than v1, the model is generating more verbose responses, which increases both cost and latency.
- Total requests: The total number of model calls. You should see 15 requests: five test prompts across three versions.
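To connect these token metrics to spend, you can sketch the arithmetic yourself. The per-token prices below are placeholders, not actual GPT-4.1 pricing; substitute the current rates for your region and model:

```python
# Rough cost model for comparing prompt versions. The per-1K-token prices
# are PLACEHOLDERS, not real GPT-4.1 pricing - look up your actual rates.
PRICE_PER_1K_PROMPT = 0.002      # assumed USD per 1,000 prompt tokens
PRICE_PER_1K_COMPLETION = 0.008  # assumed USD per 1,000 completion tokens

def run_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost of one run from its prompt/completion token totals."""
    return (prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
            + completion_tokens / 1000 * PRICE_PER_1K_COMPLETION)

# Made-up totals for a 5-prompt run of two versions, for illustration only.
versions = {"v1": (900, 1500), "v3": (2400, 3100)}
for name, (p, c) in versions.items():
    print(f"{name}: {p + c} total tokens, ~${run_cost(p, c):.4f} per run")
```

Even with placeholder prices, the ratio between versions is meaningful: if v3 uses roughly twice the tokens of v1, it costs roughly twice as much per run.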
### Compare the individual interactions

Use the output of `python src/tests/check_traces.py` to compare prompt versions.
- For the same test name (for example, `v1_trail-difficulty` vs `v3_trail-difficulty`), compare:
  - Duration (shown in milliseconds)
  - Total tokens (and the prompt vs completion breakdown)
- Use the nested structure to understand the chain:
  - The `v{n}_{test-name}` span is where per-test token and duration attributes are attached.
  - The `chat gpt-4.1` span is the instrumented model call.
- Which version produces the most consistent response lengths?
- Which version shows the highest duration?
- Are there any calls that stand out as unusually slow or verbose?
## View trace data

The `check_traces.py` output gives you the full span tree for each version run: one root span per version, with nested child spans for each test prompt and auto-instrumented LLM calls beneath those.
### Review the version spans
- Run the trace viewer script:

  ```
  python src/tests/check_traces.py
  ```

- Locate the root spans: `trail_guide_v1`, `trail_guide_v2`, and `trail_guide_v3`.
- Under each root span, locate a test child span like `v1_trail-difficulty`.
- Review the inline attributes printed on each test child span:

  | Attribute | What it tells you |
  | --- | --- |
  | Span name (for example, `v3_trail-difficulty`) | Which prompt version and which test prompt |
  | `[...]ms` | Latency for that span (milliseconds) |
  | `tokens: total (↑prompt ↓completion)` | Token usage for that specific test call |
### Compare span trees across versions
- Compare `trail_guide_v3` to `trail_guide_v1` for the same test (for example, `v3_trail-difficulty` vs `v1_trail-difficulty`).
  - Are completion tokens higher in v3?
  - Is the duration longer?
  - Compare the `chat gpt-4.1` child span durations under each test span.
- Repeat the comparison for at least two other test prompts. Document your observations:
  - Which version shows the best balance of response quality and token efficiency?
  - Does the increase in prompt complexity (v1 → v3) produce proportionally better responses, or is the extra cost hard to justify?
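To put numbers behind the "best balance" question, you could aggregate the per-test values you read off the span tree. The records below are made-up illustrative values, and the `summarize` helper is not part of the lab's code:

```python
# Sketch: aggregate per-test span data into per-version summaries so the
# quality-vs-cost comparison has numbers behind it. Records are made up;
# in practice you'd collect them from the check_traces.py output.
from statistics import mean, pstdev

records = [  # (version, test, duration_ms, completion_tokens) - illustrative
    ("v1", "trail-difficulty", 850, 120), ("v1", "day-hike-gear", 910, 135),
    ("v3", "trail-difficulty", 1480, 310), ("v3", "day-hike-gear", 1530, 290),
]

def summarize(records):
    by_version = {}
    for version, _test, dur, tokens in records:
        by_version.setdefault(version, []).append((dur, tokens))
    return {
        v: {
            "mean_duration_ms": mean(d for d, _ in rows),
            "mean_completion_tokens": mean(t for _, t in rows),
            "token_spread": pstdev(t for _, t in rows),  # consistency signal
        }
        for v, rows in by_version.items()
    }

for version, stats in summarize(records).items():
    print(version, stats)
```

A low `token_spread` indicates consistent response lengths; comparing `mean_completion_tokens` across versions quantifies how much more verbose v3 really is.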
## (OPTIONAL) Create an alert
If you have extra time, set up an alert to notify you when token usage exceeds a threshold. This is a challenge exercise, so the instructions are intentionally less detailed.
- In Azure Monitor, create a new alert rule for your Azure AI Foundry project.
- Choose a metric such as Total token count and define a threshold that would be exceeded if you ran the script again (based on what you observed above).
- Create a new action group to define how you’ll be notified.
Alerts help you proactively catch unexpected spikes in usage before they translate into budget overruns.
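Conceptually, a metric alert rule aggregates a metric over an evaluation window and fires when the result crosses your threshold. The sketch below mirrors that decision in plain Python; the numbers and window are illustrative, not a real alert-rule configuration:

```python
# Illustrative model of the decision an Azure Monitor metric alert makes:
# sum a metric over the evaluation window, compare against a threshold.
# The threshold and token counts are made up for demonstration.
def evaluate_alert(token_counts, threshold):
    """Return (fired, total): fire when summed tokens exceed the threshold."""
    total = sum(token_counts)
    return total > threshold, total

fired, total = evaluate_alert([420, 510, 610, 480, 530], threshold=2000)
print(fired, total)  # True 2550
```

Pick a threshold just below the totals you observed earlier so a repeat run of the script would trip the alert.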
## (OPTIONAL) Add a fourth prompt version
If you completed Lab 03: Design and optimize prompts, you already evaluated v4_optimized_concise and found it reduced token usage significantly. Now verify that at runtime:
- Open `src/tests/run_monitoring.py` and update the `VERSIONS` list:

  ```python
  VERSIONS = ["v1", "v2", "v3", "v4_optimized_concise"]
  ```

- Run the script again:

  ```
  python src/tests/run_monitoring.py
  ```

- Run `python src/tests/check_traces.py` again and compare the new `trail_guide_v4_optimized_concise` span tree against `trail_guide_v3`.
  - Does the token reduction you measured in evaluation hold up at runtime?
  - Is there a latency improvement as well?
  - Does the quality of responses in the spans appear comparable?
## Where to find other labs
You can explore additional labs and exercises in the Azure AI Foundry Learning Portal or refer to the course’s lab section for other available activities.