ORPO Training
ORPO combines SFT and preference optimization in a single training phase.

What is ORPO?
ORPO (Odds Ratio Preference Optimization) is a simpler alternative to DPO that doesn't require a reference model. It optimizes preferences using odds ratios directly, reducing memory usage and training complexity.

Quick Start
Python API
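The original quick-start snippet is not shown on this page; as a hypothetical sketch, the parameters documented below can be collected into a training config like this (the surrounding launch API is an assumption, only the parameter names come from this page):

```python
# Hypothetical sketch: assembles the ORPO parameters documented on this
# page into a config dict. The actual Python entry point is not shown
# here, so treat anything beyond the parameter names as an assumption.
orpo_config = {
    "trainer": "orpo",             # select the ORPO trainer
    "dpo_beta": 0.1,               # odds-ratio weight (default 0.1)
    "max_completion_length": 512,  # cap on response tokens (default None)
}
```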
Data Format
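A minimal preference-pair record might look like the following sketch (the `prompt`/`chosen`/`rejected` field names follow common DPO conventions and are an assumption here):

```python
import json

# One preference pair: the model should learn to favor "chosen"
# over "rejected" for the same prompt.
record = {
    "prompt": "How do I reset my password?",
    "chosen": "Go to Settings > Security and click 'Reset password'.",
    "rejected": "I don't know.",
}
line = json.dumps(record)  # one JSONL line of training data
```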
The format is the same as DPO's: preference pairs, each with a prompt, a chosen response, and a rejected response.

ORPO vs DPO
| Aspect | ORPO | DPO |
|---|---|---|
| Reference model | Not needed | Not needed with PEFT, required for full fine-tuning |
| Memory usage | Lower | Higher (if using reference model) |
| Training speed | Faster | Slower |
| SFT phase | Combined | Separate |
| Complexity | Simpler | More options |
Parameters
| Parameter | Description | Default |
|---|---|---|
| trainer | Set to "orpo" | Required |
| dpo_beta | Odds ratio weight | 0.1 |
| max_completion_length | Max response tokens | None |
| image_column | Image column for VLM preference training | None |
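The dpo_beta parameter corresponds to the weight on the odds-ratio term. A sketch of ORPO's per-example loss, following the standard formulation (SFT negative log-likelihood plus a weighted odds-ratio loss), taking average per-token log-probabilities as inputs:

```python
import math

def orpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """Sketch of the ORPO objective for one preference pair.

    logp_chosen / logp_rejected: average per-token log-probabilities
    (must be < 0) of the chosen and rejected responses.
    """
    # odds(y|x) = p / (1 - p), computed in log space
    def log_odds(logp):
        return logp - math.log1p(-math.exp(logp))

    # odds-ratio term: -log sigmoid(log_odds_chosen - log_odds_rejected)
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = math.log1p(math.exp(-ratio))
    # SFT term: NLL of the chosen response
    l_sft = -logp_chosen
    return l_sft + beta * l_or
```

Both terms pull in the same direction: the SFT term raises the likelihood of the chosen response, while the odds-ratio term pushes its odds above the rejected response's, which is why no separate reference model is needed.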
VLM (Vision-Language) ORPO
ORPO supports vision-language models such as Qwen2.5-VL for image+text preference alignment. Set image_column to enable VLM mode. The dataset needs chosen/rejected columns containing messages lists, plus an image column holding the images. The image column is automatically renamed to images for TRL compatibility.
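A sketch of one VLM preference record under these conventions (the message contents and image value are illustrative assumptions):

```python
# One VLM preference pair: chosen/rejected are chat "messages" lists;
# "image" stands for the column named by image_column (renamed to
# "images" for TRL compatibility, per the note above).
vlm_record = {
    "chosen": [
        {"role": "user", "content": "What is in this picture?"},
        {"role": "assistant", "content": "A tabby cat sleeping on a windowsill."},
    ],
    "rejected": [
        {"role": "user", "content": "What is in this picture?"},
        {"role": "assistant", "content": "Some kind of animal, maybe."},
    ],
    "image": "cat_0001.png",  # illustrative file reference
}
```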
When to Use ORPO
Choose ORPO when:

- Memory is limited (no reference model needed)
- You want combined SFT + alignment
- A simpler training pipeline is preferred
- You are starting from a base model (not instruction-tuned)

Choose DPO when:

- You need fine-grained control
- You are working with already instruction-tuned models
- Reference model behavior is important
Example: Customer Support
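The original example is not reproduced on this page; as an illustrative sketch, customer-support preference data could be written out as JSONL like this (all prompts and replies below are made up for illustration):

```python
import json

# Hypothetical customer-support preference pairs: "chosen" replies are
# helpful and specific, "rejected" replies are unhelpful.
pairs = [
    {
        "prompt": "My order arrived damaged. What can I do?",
        "chosen": "Sorry about that! Please reply with your order number "
                  "and a photo, and we'll send a replacement or a refund.",
        "rejected": "Damaged items are not our problem.",
    },
    {
        "prompt": "How do I cancel my subscription?",
        "chosen": "You can cancel anytime under Account > Billing > "
                  "Cancel subscription; access lasts until the period ends.",
        "rejected": "Figure it out yourself.",
    },
]

with open("support_prefs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```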
Next Steps
- DPO Training: Alternative alignment method
- Reward Modeling: Train reward models