
ORPO Training

ORPO combines SFT and preference optimization in a single training phase.

What is ORPO?

ORPO (Odds Ratio Preference Optimization) is a simpler alternative to DPO that doesn’t require a reference model. It optimizes preferences using odds ratios directly, reducing memory usage and training complexity.
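The odds-ratio idea can be sketched in a few lines. This is a minimal illustration of the ORPO objective as described in the ORPO paper, not AutoTrain's internal implementation; `logp_chosen` and `logp_rejected` here stand for (average) sequence log-probabilities under the model being trained:

```python
import math

def log_odds(logp: float) -> float:
    """Log odds of a sequence, log(p / (1 - p)), from its log-probability."""
    p = math.exp(logp)
    return math.log(p) - math.log(1.0 - p)

def orpo_loss(logp_chosen: float, logp_rejected: float, beta: float = 0.1) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio penalty:
    L = -logp_chosen - beta * log(sigmoid(log_odds(chosen) - log_odds(rejected))).
    No reference model appears anywhere in this expression."""
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    log_sigmoid = -math.log(1.0 + math.exp(-ratio))
    return -logp_chosen - beta * log_sigmoid

# The penalty shrinks as the model prefers the chosen response more strongly.
print(orpo_loss(math.log(0.6), math.log(0.2)))
```

Because the penalty is computed from the policy's own probabilities, ORPO trains with a single model in memory, which is where the savings over reference-model DPO come from.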

Quick Start

aitraining llm --train \
  --model google/gemma-2-2b \
  --data-path ./preferences.jsonl \
  --project-name gemma-orpo \
  --trainer orpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --peft
ORPO requires --prompt-text-column and --rejected-text-column. The --text-column defaults to "text", so only specify it if your chosen column has a different name.

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./preferences.jsonl",
    project_name="gemma-orpo",

    trainer="orpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.1,  # weight of the odds-ratio term (default: 0.1)
    max_completion_length=None,  # cap on response tokens (default: None)

    epochs=3,
    batch_size=2,
    lr=5e-5,

    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Data Format

Same as DPO: preference pairs with prompt, chosen, and rejected fields:
{
  "prompt": "What is AI?",
  "chosen": "AI is artificial intelligence, a field of computer science focused on creating systems that can perform tasks requiring human intelligence.",
  "rejected": "AI is just robots."
}
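A file in this format can be produced with nothing but the standard library. A minimal sketch, assuming the column names match the CLI flags above (prompt, chosen, rejected); the sample pair is illustrative:

```python
import json

# One JSON object per line; keys must match --prompt-text-column,
# --text-column (chosen), and --rejected-text-column.
pairs = [
    {
        "prompt": "What is AI?",
        "chosen": "AI is artificial intelligence, a field of computer science "
                  "focused on creating systems that can perform tasks "
                  "requiring human intelligence.",
        "rejected": "AI is just robots.",
    },
]

with open("preferences.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```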

ORPO vs DPO

| Aspect | ORPO | DPO |
| --- | --- | --- |
| Reference model | Not needed | Not needed with PEFT; required for full fine-tuning |
| Memory usage | Lower | Higher (if using a reference model) |
| Training speed | Faster | Slower |
| SFT phase | Combined | Separate |
| Complexity | Simpler | More options |

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| trainer | Set to "orpo" | Required |
| dpo_beta | Weight of the odds-ratio term | 0.1 |
| max_completion_length | Maximum response tokens | None |
| image_column | Image column for VLM preference training | None |

VLM (Vision-Language) ORPO

ORPO supports vision-language models such as Qwen3.5-VL for image+text preference alignment. Set image_column to enable VLM mode:
params = LLMTrainingParams(
    model="Qwen/Qwen3.5-VL-9B",
    trainer="orpo",
    image_column="images",
    text_column="chosen",
    rejected_text_column="rejected",
    prompt_text_column="prompt",
)
The dataset should have chosen/rejected columns with messages lists, and an image column containing the images. The image column is automatically renamed to images for TRL compatibility.
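One record in that shape might look like the sketch below. This is an assumption based on TRL's conversational preference format; the exact message schema (roles, content parts, image placeholders) can vary by model and processor, so treat it as illustrative:

```python
# Hypothetical single VLM preference record: prompt/chosen/rejected are
# messages lists, and "images" holds the actual images (e.g. PIL objects
# or paths), one per image placeholder in the prompt.
record = {
    "prompt": [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this picture."},
        ]},
    ],
    "chosen": [
        {"role": "assistant", "content": [
            {"type": "text", "text": "A golden retriever playing in a park."},
        ]},
    ],
    "rejected": [
        {"role": "assistant", "content": [
            {"type": "text", "text": "A dog."},
        ]},
    ],
    "images": [...],  # placeholder: supply the real image objects here
}
```

If your dataset stores images under a different column name, point image_column at it; as noted above, it is renamed to images internally for TRL compatibility.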

When to Use ORPO

Choose ORPO when:
  • Memory is limited (no reference model needed)
  • You want combined SFT + alignment
  • Simpler training pipeline preferred
  • Starting from a base model (not instruction-tuned)
Choose DPO when:
  • You need fine-grained control
  • Working with already instruction-tuned models
  • Reference model behavior is important

Example: Customer Support

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./support_preferences.jsonl",
    project_name="support-bot",

    trainer="orpo",
    dpo_beta=0.15,

    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    lr=2e-5,

    peft=True,
    lora_r=32,
    lora_alpha=64,

    log="wandb",
)

Next Steps

  • DPO Training: alternative alignment method
  • Reward Modeling: train reward models