ORPO Training

ORPO combines SFT and preference optimization in a single training phase.

What is ORPO?

ORPO (Odds Ratio Preference Optimization) is a simpler alternative to DPO that doesn’t require a reference model. It optimizes preferences using odds ratios directly, reducing memory usage and training complexity.

Quick Start

aitraining llm --train \
  --model google/gemma-2-2b \
  --data-path ./preferences.jsonl \
  --project-name gemma-orpo \
  --trainer orpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --peft

ORPO requires --prompt-text-column and --rejected-text-column. The --text-column defaults to "text", so only specify it if your chosen column has a different name.

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./preferences.jsonl",
    project_name="gemma-orpo",

    trainer="orpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.1,  # Default: 0.1
    max_completion_length=None,  # Default: None

    epochs=3,
    batch_size=2,
    lr=5e-5,

    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Data Format

Same as DPO - preference pairs:

{
  "prompt": "What is AI?",
  "chosen": "AI is artificial intelligence, a field of computer science focused on creating systems that can perform tasks requiring human intelligence.",
  "rejected": "AI is just robots."
}

ORPO vs DPO

Aspect	ORPO	DPO
Reference model	Not needed	Not needed with PEFT, required for full fine-tuning
Memory usage	Lower	Higher (if using reference model)
Training speed	Faster	Slower
SFT phase	Combined	Separate
Complexity	Simpler	More options

Parameters

Parameter	Description	Default
`trainer`	Set to `"orpo"`	Required
`dpo_beta`	Odds ratio weight	`0.1`
`max_completion_length`	Max response tokens	`None`
`image_column`	Image column for VLM preference training	`None`

VLM (Vision-Language) ORPO

ORPO supports vision-language models like Qwen 3.5-VL for image+text preference alignment. Set image_column to enable VLM mode:

params = LLMTrainingParams(
    model="Qwen/Qwen3.5-VL-9B",
    trainer="orpo",
    image_column="images",
    text_column="chosen",
    rejected_text_column="rejected",
    prompt_text_column="prompt",
)

The dataset should have chosen/rejected columns with messages lists, and an image column containing the images. The image column is automatically renamed to images for TRL compatibility.

When to Use ORPO

Choose ORPO when:

Memory is limited (no reference model needed)
You want combined SFT + alignment
Simpler training pipeline preferred
Starting from a base model (not instruction-tuned)

Choose DPO when:

You need fine-grained control
Working with already instruction-tuned models
Reference model behavior is important

Example: Customer Support

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./support_preferences.jsonl",
    project_name="support-bot",

    trainer="orpo",
    dpo_beta=0.15,

    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    lr=2e-5,

    peft=True,
    lora_r=32,
    lora_alpha=64,

    log="wandb",
)

ORPO Training

ORPO Training

What is ORPO?

Quick Start

Python API

Data Format

ORPO vs DPO

Parameters

VLM (Vision-Language) ORPO

When to Use ORPO

Example: Customer Support

Next Steps

DPO Training

Reward Modeling

​ORPO Training

​What is ORPO?

​Quick Start

​Python API

​Data Format

​ORPO vs DPO

​Parameters

​VLM (Vision-Language) ORPO

​When to Use ORPO

​Example: Customer Support

​Next Steps

DPO Training

Reward Modeling

ORPO Training

What is ORPO?

Quick Start

Python API

Data Format

ORPO vs DPO

Parameters

VLM (Vision-Language) ORPO

When to Use ORPO

Example: Customer Support

Next Steps