This notebook walks through end-to-end Reinforcement Fine-Tuning (RFT) using o4-mini on Azure OpenAI, applied to our trail recommendation agent.
We use a dataset of complex trail-planning problems from Adventure Works. Each example includes multiple constraints that must all be satisfied in the model's recommendation.
Each example contains:
Example:
Plan a 4-day moderate loop. Group: 8 hikers (beginner + intermediate). Child age 7. Max 12 miles/day.
This task is not ideal for SFT because:
✅ RFT lets the model explore solutions and learn from a grader — without requiring exact output labels.
🏔 Trail Planning Dataset Samples
----------------------------------------
Scenario   : 4-day moderate loop, 8 hikers, child age 7, max 12 mi/day
Constraints: ['child_friendly', 'varied_fitness', 'gear_list_required']
----------------------------------------
Scenario   : Solo 14er weekend summit, intermediate fitness, altitude acclimatisation
Constraints: ['altitude_safety', 'solo_emergency_plan', 'weather_contingency']
----------------------------------------
Scenario   : 3-day group backpacking, 2 beginners, 4 intermediate, 1 with knee injury
Constraints: ['medical_accommodation', 'low_elevation_gain', 'camp_near_water']
----------------------------------------
Scenario   : Winter snowshoe day hike, family of 5, children ages 6 and 9
Constraints: ['child_friendly', 'cold_weather_gear', 'short_distance', 'bailout_route']
----------------------------------------
Scenario   : Via ferrata first-timer, moderate fitness, guide required, half-day
Constraints: ['guided_activity', 'equipment_rental', 'difficulty_progression']
----------------------------------------
To enable RFT, we configure three key components: a system prompt, a response schema, and a grader. The grader scores each plan on a 0 to 5 scale:
| Score | Description |
|---|---|
| 5 | All constraints addressed, safe, logical sequencing, appropriate difficulty |
| 4 | Minor gap — one constraint partially addressed or vague safety note |
| 3 | Two constraints partially addressed, or itinerary has a logical gap |
| 2 | Multiple constraints missed but basic structure is present |
| 1 | Major issues — missing safety plan or ignores key constraints |
| 0 | Invalid output, ignores scenario, or unsafe recommendation |
✅ System prompt defined (length: 512 chars)
✅ Response schema defined — 4 required fields: itinerary, gear_list, safety_notes, difficulty_rating
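A response schema with these four required fields might look like the sketch below. Only the field names come from the output above; the field types, the `TRAIL_PLAN_SCHEMA` name, and the `missing_fields` helper are assumptions made for illustration.

```python
from typing import List

# Hypothetical JSON schema. The four required field names come from the
# notebook output; the types chosen for each field are assumptions.
TRAIL_PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "itinerary": {"type": "array", "items": {"type": "string"}},
        "gear_list": {"type": "array", "items": {"type": "string"}},
        "safety_notes": {"type": "string"},
        "difficulty_rating": {"type": "string"},
    },
    "required": ["itinerary", "gear_list", "safety_notes", "difficulty_rating"],
    "additionalProperties": False,
}


def missing_fields(plan: dict) -> List[str]:
    """Return the required fields that are absent from a candidate plan."""
    return [name for name in TRAIL_PLAN_SCHEMA["required"] if name not in plan]


if __name__ == "__main__":
    partial_plan = {"itinerary": ["Day 1: trailhead to lake"], "gear_list": []}
    print(missing_fields(partial_plan))
```

A quick check like `missing_fields` is handy for spotting malformed completions locally before they reach the grader.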
{'name': 'trail_plan_grader',
'type': 'score_model',
'model': 'o3-mini',
'pass_threshold': 5,
'range': [0.0, 5.0]}
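The grader summary above could come from a configuration along the following lines. This is a sketch, not the notebook's actual cell: the `input` messages, the rubric wording, and the `{{item.*}}` / `{{sample.output_text}}` template placeholders reflect my understanding of the `score_model` grader format and should be checked against the current Azure OpenAI documentation.

```python
# Rubric text paraphrased from the scoring table; the exact prompt wording
# used in the notebook is not shown, so this string is an assumption.
RUBRIC = """Score the trail plan from 0 to 5:
5: All constraints addressed, safe, logical sequencing, appropriate difficulty
4: Minor gap: one constraint partially addressed or vague safety note
3: Two constraints partially addressed, or itinerary has a logical gap
2: Multiple constraints missed but basic structure is present
1: Major issues: missing safety plan or ignores key constraints
0: Invalid output, ignores scenario, or unsafe recommendation
Return only the numeric score."""


def build_grader(rubric: str = RUBRIC) -> dict:
    """Build a score_model grader config matching the summary printed above."""
    return {
        "name": "trail_plan_grader",
        "type": "score_model",
        "model": "o3-mini",
        "pass_threshold": 5,
        "range": [0.0, 5.0],
        # Assumed message layout: rubric first, then the item under review.
        "input": [
            {"role": "system", "content": rubric},
            {
                "role": "user",
                "content": (
                    "Scenario: {{item.scenario}}\n"
                    "Constraints: {{item.constraints}}\n"
                    "Plan: {{sample.output_text}}"
                ),
            },
        ],
    }


if __name__ == "__main__":
    grader = build_grader()
    print({k: grader[k] for k in ("name", "type", "model", "pass_threshold", "range")})
```

Setting `pass_threshold` to the top of the range means only a perfect rubric score counts as a pass, which is a strict bar that explains the low pass rates of weaker models.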
Before fine-tuning, we benchmark how existing models handle multi-constraint trail planning using our custom grader.
We compare: o3, o4-mini, gpt-4.1, gpt-4.1-mini
| Metric | Description |
|---|---|
| Pass % | % of completions scoring at or above the pass threshold (5/5) |
| Error % | % of completions that failed structurally (malformed output) |
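The two metrics can be computed from raw grader scores roughly as follows. This is a minimal sketch: it assumes a structurally failed completion is recorded as `None`, and that both percentages use the total number of completions as the denominator. The `summarize` helper is hypothetical.

```python
from typing import Dict, List, Optional


def summarize(scores: List[Optional[float]], pass_threshold: float = 5.0) -> Dict[str, float]:
    """Compute Pass % and Error % from a list of grader scores.

    A score of None stands for a completion that failed structurally
    (an assumption about how errors are recorded).
    """
    total = len(scores)
    if total == 0:
        return {"pass_pct": 0.0, "error_pct": 0.0}
    errors = sum(1 for s in scores if s is None)
    passed = sum(1 for s in scores if s is not None and s >= pass_threshold)
    return {
        "pass_pct": 100.0 * passed / total,
        "error_pct": 100.0 * errors / total,
    }


if __name__ == "__main__":
    # Toy scores for illustration only, not results from the notebook.
    print(summarize([5.0, 5.0, 4.0, None]))
```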
✅ Saved 100 records to data/trail_eval_100.jsonl
Using Azure API for file upload...
✅ Eval file ID: file-a3f9b821cc2e4d5e8b1f3a7c09d4e612
Evaluation created successfully with ID: eval_trail_c9a284f5e13b
📊 Evaluation created: eval_trail_c9a284f5e13b
Create Eval Run Status for o3: 201
Create Eval Run Status for o4-mini: 201
Create Eval Run Status for gpt-4.1: 201
Create Eval Run Status for gpt-4.1-mini: 201
🚀 Evaluation runs launched for: o3, o4-mini, gpt-4.1, gpt-4.1-mini
================================================
Base Model Evaluation Summary
================================================
Model          Status       Pass %    Error %
----------------------------------------------
o3             completed    94.0%     0.0%
o4-mini        completed    71.0%     0.0%
gpt-4.1        completed     8.0%     0.0%
gpt-4.1-mini   completed     3.0%     0.0%
================================================
💡 o4-mini at 71% is the best candidate for RFT: it has a solid baseline but clear room for improvement on constraint satisfaction.
With our system prompt, schema, and grader defined, we fine-tune o4-mini using RFT on just 100 training examples.
Each training example includes a prompt with the scenario and constraints embedded. The grader scores the model's completions and provides the reward signal, so no ground-truth labels are needed.
🎛 Try it: Adjust the hyperparameters in the cell below, then re-run the training cells to see how they affect the reward curve and final pass rate!
✅ Converted 100 records to RFT format → data/trail_train_100.jsonl
✅ Converted 50 records to RFT format → data/trail_valid_50.jsonl
✅ Train set saved to: data/trail_train_100.jsonl
✅ Validation set saved to: data/trail_valid_50.jsonl
📝 Sample RFT Training Record:
{
  "messages": [
    {
      "role": "user",
      "content": "You are an Adventure Works trail guide...\nPlan a 4-day moderate loop.\nScenario: \"4-day moderate loop, group 8, child age 7\"\nConstraints: \"['child_friendly', 'varied_fitness', 'gear_list_required']\""
    }
  ],
  "scenario": "4-day moderate loop, group 8, child age 7",
  "constraints": "['child_friendly', 'varied_fitness', 'gear_list_required']"
}
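A record shaped like the sample above could be produced by a converter such as the one below. This is a sketch under assumptions: the `to_rft_record` and `to_jsonl` helpers are hypothetical, and the prompt prefix is kept elided exactly as it appears in the sample because the full system prompt is not shown.

```python
import json
from typing import List

# The prompt prefix is elided in the sample record; it is kept as-is here.
PROMPT_PREFIX = "You are an Adventure Works trail guide..."


def to_rft_record(task: str, scenario: str, constraints: List[str]) -> dict:
    """Build one RFT training record: a user message plus grader-visible fields."""
    constraints_str = str(constraints)  # stored as a string, as in the sample
    content = (
        f"{PROMPT_PREFIX}\n{task}\n"
        f'Scenario: "{scenario}"\n'
        f'Constraints: "{constraints_str}"'
    )
    return {
        "messages": [{"role": "user", "content": content}],
        "scenario": scenario,
        "constraints": constraints_str,
    }


def to_jsonl(records: List[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    record = to_rft_record(
        "Plan a 4-day moderate loop.",
        "4-day moderate loop, group 8, child age 7",
        ["child_friendly", "varied_fitness", "gear_list_required"],
    )
    print(to_jsonl([record]))
```

Keeping `scenario` and `constraints` as top-level fields, in addition to embedding them in the prompt, lets the grader reference them directly when scoring a completion.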
Using Azure API for file upload...
✅ File 'trail_train_100.jsonl' uploaded successfully.
Using Azure API for file upload...
✅ File 'trail_valid_50.jsonl' uploaded successfully.
📤 Train File ID : file-d7e3f09ab14c4b8da3f250e18c6a9b1
📤 Valid File ID : file-94b1c23de50a4f9cb7e60d17fa5b2c8
ℹ️ Click ⚙ Apply (above) to lock in your hyperparameters before submitting the job.
⏳ Submitting RFT job to Azure OpenAI...
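The submission step might assemble its arguments as sketched below. Several things here are assumptions rather than the notebook's actual code: the base model name and suffix are inferred from the deployed model name used later in the notebook, the hyperparameter names and values are illustrative, and the payload layout follows my understanding of the reinforcement method of the fine-tuning API, so verify it against the current Azure OpenAI documentation before use.

```python
def build_rft_job_args(
    train_file_id: str,
    valid_file_id: str,
    grader: dict,
    response_schema: dict,
    hyperparameters: dict,
) -> dict:
    """Assemble keyword arguments for a reinforcement fine-tuning job."""
    return {
        "model": "o4-mini-2025-04-16",  # assumed base model name
        "suffix": "trail-guide-rft",    # assumed suffix
        "training_file": train_file_id,
        "validation_file": valid_file_id,
        "method": {
            "type": "reinforcement",
            "reinforcement": {
                "grader": grader,
                "response_format": {
                    "type": "json_schema",
                    "json_schema": {
                        "name": "trail_plan",  # hypothetical schema name
                        "schema": response_schema,
                        "strict": True,
                    },
                },
                "hyperparameters": hyperparameters,
            },
        },
    }


if __name__ == "__main__":
    args = build_rft_job_args(
        train_file_id="<train-file-id>",
        valid_file_id="<valid-file-id>",
        grader={"type": "score_model", "name": "trail_plan_grader"},
        response_schema={"type": "object"},
        # Illustrative values only; the notebook's settings are not shown.
        hyperparameters={"n_epochs": 1, "reasoning_effort": "medium"},
    )
    # With a configured client, the job would then be submitted roughly as:
    #   client.fine_tuning.jobs.create(**args)
    print(args["method"]["type"])
```

Building the arguments in a plain function keeps the hyperparameters easy to inspect and change before any job is actually submitted.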
The fine-tuned model is now deployed. We evaluate it on the same 100-example test set with the same grader and compare pass rates.
Did our hyperparameter choices pay off? Run the cells below to find out!
Evaluation created successfully with ID: eval_trail_ft_b8c491d2
📊 Evaluation created: eval_trail_ft_b8c491d2
Create Eval Run Status for o4-mini-2025-04-16-trail-guide-rft: 201
🚀 Evaluation launched for: o4-mini-2025-04-16-trail-guide-rft
⏳ Loading evaluation results…
Run the evaluation cells above to see your results, then this section will summarise what you learned.