This notebook walks through end-to-end Direct Preference Optimization (DPO) using gpt-4.1-mini on Azure OpenAI / Microsoft Foundry, applied to our trail recommendation agent:
The Adventure Works Trail Guide agent was flagged for using an inappropriate, dismissive tone when users asked about trails involving real safety risks, such as the Knife's Edge ridge.
Users rated responses as unhelpful or condescending when they asked reasonable questions like:
"I'm a moderate hiker. Is the Knife's Edge trail suitable for me?"
DPO requires a preference dataset of paired responses for each prompt:
Each record in training.jsonl follows this structure:
```json
{
  "input": {
    "messages": [
      {"role": "system", "content": "You are an Adventure Works trail guide..."},
      {"role": "user", "content": "I'm a moderate hiker. Is Knife's Edge suitable for me?"}
    ]
  },
  "preferred_output": [{"role": "assistant", "content": "...respectful, helpful response..."}],
  "non_preferred_output": [{"role": "assistant", "content": "...dismissive or condescending response..."}]
}
```
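Malformed records are a common cause of rejected fine-tuning jobs, so it is worth sanity-checking the JSONL before upload. A minimal sketch, assuming the record layout shown above (the `validate_record` and `validate_file` helpers are ours, not part of any SDK):

```python
import json

REQUIRED_KEYS = {"input", "preferred_output", "non_preferred_output"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems with one DPO preference record (empty = valid)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    if not record["input"].get("messages"):
        problems.append("input.messages is empty")
    for key in ("preferred_output", "non_preferred_output"):
        outputs = record[key]
        # Each output list must lead with an assistant message.
        if not outputs or outputs[0].get("role") != "assistant":
            problems.append(f"{key} must contain an assistant message")
    return problems

def validate_file(path: str) -> int:
    """Print problems per line of a JSONL file; return the count of bad records."""
    bad = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            problems = validate_record(json.loads(line))
            if problems:
                bad += 1
                print(f"line {i}: {'; '.join(problems)}")
    return bad
```

Running `validate_file("training.jsonl")` before upload catches empty message lists and swapped roles early, when they are cheap to fix.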
The agent's core hiking knowledge is correct — it's only the tone that's wrong. DPO is ideal here because:
✅ DPO is the right tool when you know which response is better, but not exactly what the response should say.
Trail Guide DPO Dataset — 275 training pairs
--------------------------------------------------
User          : I'm a moderate hiker. Is Knife's Edge suitable for me?
Preferred     : Knife's Edge is a genuine technical challenge — exposed ridge, fixed rope sections…
Non-preferred : If you're only a moderate hiker you probably shouldn't even be considering Knife's…
--------------------------------------------------
User          : My 10-year-old wants to hike Rainbow Falls. Is it safe for kids?
Preferred     : Rainbow Falls is a wonderful family destination! The trail is 3.4 miles round-trip…
Non-preferred : Most kids that age can't handle 3.4 miles. You'll probably end up carrying them.…
--------------------------------------------------
User          : I have a knee injury. Can I still do the Cascade Loop?
Preferred     : With a knee injury the Cascade Loop needs some planning. The descent on the north…
Non-preferred : You really shouldn't be hiking with a knee injury. That's just common sense.…
--------------------------------------------------
We use the azure-ai-projects SDK to connect to Microsoft Foundry and the azure-ai-evaluation SDK to score model responses against our tone criteria.
Copy .env.template to .env and fill in your values before running these cells.

```
MICROSOFT_FOUNDRY_PROJECT_ENDPOINT=<your-endpoint>
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>
AZURE_AOAI_ACCOUNT=<your-foundry-account-name>
MODEL_NAME=gpt-4.1-mini
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_KEY=<your-azure-openai-api-key>
DEPLOYMENT_NAME=gpt-4.1-mini
```
Collecting azure-ai-projects>=2.0.0b1 ... ✅
Collecting azure-ai-evaluation>=1.13.0 ... ✅
Collecting azure-identity ... ✅
Collecting azure-mgmt-cognitiveservices ... ✅
Collecting openai ... ✅
Collecting python-dotenv ... ✅
Successfully installed all packages.
Note: you may need to restart the kernel to use updated packages.
✅ Connected to Microsoft Foundry Project
Base model : gpt-4.1-mini
Uploading training file…
Training file ID   : file-a1b2c3d4e5f6478a9b0c1d2e3f4a5b6c
Uploading validation file…
Validation file ID : file-b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4
Waiting for files to be processed…
✅ Files ready!
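The upload-then-poll pattern above can be sketched with a small helper. This assumes the standard OpenAI files API (`files.create` with `purpose="fine-tune"`, `files.retrieve` for status); the `upload_and_wait` helper and the API version string are ours:

```python
import time

def upload_and_wait(client, path: str, poll_seconds: float = 2.0) -> str:
    """Upload a JSONL file for fine-tuning and block until it is processed."""
    with open(path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")
    while True:
        status = client.files.retrieve(uploaded.id).status
        if status == "processed":
            return uploaded.id
        if status == "error":
            raise RuntimeError(f"file {uploaded.id} failed processing")
        time.sleep(poll_seconds)

# Usage sketch against Azure OpenAI (api_version is illustrative):
# import os
# from openai import AzureOpenAI
# client = AzureOpenAI(
#     azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
#     api_key=os.environ["AZURE_OPENAI_KEY"],
#     api_version="2024-10-21",
# )
# train_file_id = upload_and_wait(client, "training.jsonl")
# val_file_id = upload_and_wait(client, "validation.jsonl")
```

Polling until `processed` before submitting the job avoids a race where the fine-tuning job is created against a file that is still being validated.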
Before fine-tuning, we measure how often the base gpt-4.1-mini model produces an appropriate tone on trail questions. We score responses on four dimensions:

| Metric | Description | Scale |
|---|---|---|
| Tone Pass Rate | % of responses judged to have appropriate, respectful tone | 0–100% |
| Coherence | Logical flow and clarity of the response | 1–5 |
| Fluency | Grammatical quality and naturalness | 1–5 |
| Groundedness | How well claims are supported by the provided trail information | 1–5 |
The Tone Pass Rate is computed by a model-based evaluator that classifies each response as passing or failing the tone standard defined in our DPO preference pairs.
✅ evaluate_tone() defined (metrics: coherence · fluency · groundedness)
Evaluating base model: gpt-4.1-mini
Generating 50 responses from 'gpt-4.1-mini'…
  Processed 10/50
  Processed 20/50
  Processed 30/50
  Processed 40/50
  Processed 50/50
Running evaluation with 3 metrics…

======= Run Summary =======
Run name: "coherence_base_20260304"     Status: Completed   Lines: 50
Run name: "fluency_base_20260304"       Status: Completed   Lines: 50
Run name: "groundedness_base_20260304"  Status: Completed   Lines: 50

BASE MODEL EVALUATION: gpt-4.1-mini
────────────────────────────────────────────────
Tone Pass Rate : 38% of responses (appropriate tone)
Coherence      : 4.22 / 5
Fluency        : 3.91 / 5
Groundedness   : 3.65 / 5
────────────────────────────────────────────────
💡 gpt-4.1-mini produces appropriate tone only 38% of the time. DPO fine-tuning will directly optimise for this preference signal.
With our preference pairs uploaded, we launch a DPO fine-tuning job on gpt-4.1-mini. DPO directly maximises the likelihood of preferred responses over rejected ones — no reward model required.
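Concretely, for a prompt $x$ with preferred response $y_w$ and rejected response $y_l$, DPO trains the policy $\pi_\theta$ against a frozen reference model $\pi_{\text{ref}}$ (here, the base gpt-4.1-mini) by minimising:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here $\sigma$ is the logistic function and $\beta$ is a fixed hyperparameter that controls how far the policy may drift from the reference. The loss shrinks as the model assigns relatively more probability to preferred responses than rejected ones, which is exactly the tone signal our preference pairs encode.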
| Parameter | What it controls |
|---|---|
| n_epochs | How many full passes over the training data. Too many = overfitting on preference patterns. |
| batch_size | Number of preference pairs per gradient step. Smaller batches = more frequent updates, better for tone-shift tasks. |
| learning_rate_multiplier | Scales the default LR. Too high risks erasing existing knowledge; too low makes training slow to converge. |
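The hyperparameters above are passed inside the job's `method` payload. A sketch following the OpenAI preference fine-tuning API shape (`method={"type": "dpo", ...}`); the `build_dpo_method` helper and the default values are ours, and `client`, `train_file_id`, `val_file_id` are assumed from earlier cells:

```python
def build_dpo_method(n_epochs=3, batch_size=8, learning_rate_multiplier=1.0) -> dict:
    """Assemble the `method` payload for a DPO fine-tuning job."""
    return {
        "type": "dpo",
        "dpo": {
            "hyperparameters": {
                "n_epochs": n_epochs,
                "batch_size": batch_size,
                "learning_rate_multiplier": learning_rate_multiplier,
            }
        },
    }

# job = client.fine_tuning.jobs.create(
#     model="gpt-4.1-mini",
#     training_file=train_file_id,
#     validation_file=val_file_id,
#     method=build_dpo_method(n_epochs=3, batch_size=8, learning_rate_multiplier=1.0),
# )
# print(job.id, job.status)
```

Keeping the payload in one helper makes it easy to re-run the submission cell with different hyperparameter combinations and compare outcomes.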
🎛 Try it: Adjust the hyperparameters below, click ⚙ Apply, then re-run the training cells to compare outcomes!
ℹ️ Click ⚙ Apply (above) to lock in your hyperparameters before submitting the job.
⏳ Submitting DPO job to Azure OpenAI…
Training is complete. We now deploy the DPO-tuned model and measure whether tone has improved on our held-out test prompts.
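On Azure, deploying a fine-tuned model goes through the management plane. A hedged sketch: the deployment body below mirrors the shape accepted by azure-mgmt-cognitiveservices, but the `build_deployment_spec` helper, the deployment name, and the SKU choice (`standard`, capacity 1) are our assumptions:

```python
def build_deployment_spec(ft_model_name: str, capacity: int = 1) -> dict:
    """Deployment body for a fine-tuned model (dict form of the SDK's Deployment model)."""
    return {
        "properties": {
            "model": {"format": "OpenAI", "name": ft_model_name, "version": "1"},
        },
        "sku": {"name": "standard", "capacity": capacity},
    }

# import os
# from azure.identity import DefaultAzureCredential
# from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
# mgmt = CognitiveServicesManagementClient(
#     DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
# )
# poller = mgmt.deployments.begin_create_or_update(
#     os.environ["AZURE_RESOURCE_GROUP"],
#     os.environ["AZURE_AOAI_ACCOUNT"],
#     "gpt-4.1-mini-dpo",                    # deployment name (our choice)
#     build_deployment_spec(job.fine_tuned_model),
# )
# deployment = poller.result()              # blocks until the deployment is ready
```

Once the poller completes, the deployment name can be used as the `model` argument in chat-completions calls for the post-training evaluation.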
Did your hyperparameter choices pay off? Run the cells below to find out!
⏳ Waiting for training to complete before deploying…
⏳ Waiting for deployment to be ready…
⏳ Waiting for training to complete before evaluating…
⏳ Run the evaluation cell above first…
Once you have a deployed DPO-tuned model, you can continue improving it iteratively — using the fine-tuned model as the new base for a subsequent DPO run with fresh preference pairs.
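A second round is submitted the same way as the first, just with the previous fine-tuned model as the base. A sketch (the `continue_dpo` helper is ours; it assumes the same `fine_tuning.jobs.create` API used above):

```python
def continue_dpo(client, previous_ft_model: str, new_training_file: str, n_epochs: int = 2):
    """Start a new DPO round using the previous fine-tuned model as the base."""
    return client.fine_tuning.jobs.create(
        model=previous_ft_model,          # e.g. the fine_tuned_model id from the last job
        training_file=new_training_file,  # fresh preference pairs
        method={
            "type": "dpo",
            "dpo": {"hyperparameters": {"n_epochs": n_epochs}},
        },
    )
```

Fewer epochs is a reasonable starting point for a continuation round, since the base model already reflects the first round's preference signal.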
New training file: file-c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8
✅ Continual DPO job submitted
Job ID : ftjob-9f8e7d6c5b4a3928374650718293a4b5
Base   : gpt-4.1-mini-2025-04-14.ft-a1b2c3d4e5f6g7h8 (previous ft model)
Run the evaluation cells above to see your results; this section will then summarise what you learned.