LLM Training

The aitraining llm command trains large language models with support for multiple trainers and techniques.

Quick Start

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --trainer sft

Available Trainers

Trainer	Description
`default` / `sft` / `generic`	Supervised fine-tuning
`dpo`	Direct Preference Optimization
`orpo`	Odds Ratio Preference Optimization
`ppo`	Proximal Policy Optimization
`grpo`	Group Relative Policy Optimization (custom environments)
`reward`	Reward model training
`distillation`	Knowledge distillation

generic is an alias for default. All three (default, sft, generic) produce the same behavior.

PPO Trainer Requirements: PPO requires either --rl-reward-model-path (path to a trained reward model) or --model-ref (reference model for KL divergence). See PPO Training for full documentation.

GRPO Trainer Requirements: GRPO requires --rl-env-module (Python module path) and --rl-env-class (class name) for the custom environment. See GRPO Training for full documentation.

Parameter Groups

Parameters are organized into logical groups:

Basic Parameters

Parameter	Description	Default
`--model`	Base model to fine-tune	`google/gemma-3-270m`
`--data-path`	Path to training data	`data`
`--project-name`	Output directory name	`project-name`
`--train-split`	Training data split	`train`
`--valid-split`	Validation data split	`None`

Always specify these parameters: While --model, --data-path, and --project-name have defaults, you should always explicitly set them for your use case. The --project-name parameter sets the output folder - use a path like --project-name ./models/my-experiment to control where the trained model is saved.

Training Configuration

Parameter	Description	Default
`--trainer`	Training method	`default`
`--epochs`	Number of training epochs	`1`
`--batch-size`	Training batch size	`2`
`--lr`	Learning rate	`3e-5`
`--mixed-precision`	fp16/bf16/None	`None`
`--gradient-accumulation`	Accumulation steps	`4`
`--warmup-ratio`	Warmup ratio	`0.1`
`--optimizer`	Optimizer	`adamw_torch`
`--scheduler`	LR scheduler	`linear`
`--weight-decay`	Weight decay	`0.0`
`--max-grad-norm`	Max gradient norm	`1.0`
`--seed`	Random seed	`42`

Checkpointing & Evaluation

Parameter	Description	Default
`--eval-strategy`	When to evaluate (`epoch`, `steps`, `no`)	`epoch`
`--save-strategy`	When to save (`epoch`, `steps`, `no`)	`epoch`
`--save-steps`	Save every N steps (if save-strategy=steps)	`500`
`--save-total-limit`	Max checkpoints to keep	`1`
`--logging-steps`	Log every N steps (-1 for auto)	`-1`
`--resume-from-checkpoint`	Resume from checkpoint path, or `auto` to detect latest	`None`

Performance & Memory

Parameter	Description	Default
`--auto-find-batch-size`	Automatically find optimal batch size	`False`
`--disable-gradient-checkpointing`	Disable memory optimization	`False`
`--unsloth`	Use Unsloth for faster training (SFT only, llama/mistral/gemma/qwen2)	`False`
`--use-sharegpt-mapping`	Use Unsloth’s ShareGPT mapping	`False`
`--use-flash-attention-2`	Use Flash Attention 2 for faster training	`False`
`--attn-implementation`	Attention implementation (`eager`, `sdpa`, `flash_attention_2`)	`None`

Unsloth Requirements: Unsloth only works with sft/default trainers and specific model architectures (llama, mistral, gemma, qwen2). See Unsloth Integration for details.

Backend & Distribution

Parameter	Description	Default
`--backend`	Where to run (`local`, `spaces`)	`local`
`--distributed-backend`	Distribution backend (`ddp`, `deepspeed`)	`None`
`--ddp-timeout`	DDP/NCCL timeout in seconds	`7200`

Multi-GPU Behavior: With multiple GPUs and --distributed-backend not set, DDP is used automatically. Set --distributed-backend deepspeed for DeepSpeed Zero-3 optimization. Training is launched via Accelerate.

DeepSpeed Checkpointing: When using DeepSpeed, model saving uses accelerator.get_state_dict() and unwraps the model. PEFT adapter saving is handled differently under DeepSpeed.

PEFT/LoRA Parameters

Parameter	Description	Default
`--peft`	Enable LoRA training	`False`
`--lora-r`	LoRA rank	`16`
`--lora-alpha`	LoRA alpha	`32`
`--lora-dropout`	LoRA dropout	`0.05`
`--target-modules`	Modules to target	`all-linear`
`--quantization`	int4/int8 quantization	`None`
`--merge-adapter`	Merge LoRA after training	`True`

Data Processing

Parameter	Description	Default
`--text-column`	Text column name	`text`
`--block-size`	Max sequence length	`-1` (model default)
`--model-max-length`	Maximum model input length	Auto-detect from model
`--padding`	Padding side (`left` or `right`)	`right`
`--add-eos-token`	Append EOS token	`True`
`--chat-template`	Chat template to use	Auto by trainer
`--packing`	Enable sequence packing (requires flash attention)	`None`
`--auto-convert-dataset`	Auto-detect and convert dataset format	`False`
`--max-samples`	Limit dataset size for testing	`None`
`--save-processed-data`	Save processed data: `auto`, `local`, `hub`, `both`, `none`	`auto`

Chat Template Auto-Selection: SFT/DPO/ORPO/Reward trainers default to tokenizer (model’s built-in template). Use --chat-template none for plain text training.

Processed Data Saving: By default (auto), processed data is saved locally to {project}/data_processed/. If the source dataset was from the Hub, it’s also pushed as a private dataset. Original columns are renamed to _original_* to prevent conflicts.

Training Examples

SFT with LoRA

aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./conversations.jsonl \
  --project-name llama-sft \
  --trainer sft \
  --peft \
  --lora-r 16 \
  --lora-alpha 32 \
  --epochs 3 \
  --batch-size 4

DPO Training

For DPO, you must specify the column names for prompt, chosen, and rejected responses:

aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./preferences.jsonl \
  --project-name llama-dpo \
  --trainer dpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --dpo-beta 0.1 \
  --peft \
  --lora-r 16

DPO and ORPO require --prompt-text-column and --rejected-text-column to be specified.

ORPO Training

ORPO combines SFT and preference optimization:

aitraining llm --train \
  --model google/gemma-2-2b \
  --data-path ./preferences.jsonl \
  --project-name gemma-orpo \
  --trainer orpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --peft

GRPO Training

Train with Group Relative Policy Optimization using your own reward environment:

aitraining llm --train \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --trainer grpo \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --rl-num-generations 4 \
  --rl-max-new-tokens 256 \
  --peft \
  --lr 1e-5

GRPO generates multiple completions per prompt, scores them via your environment (0-1), and optimizes the policy. See GRPO Training for environment interface details.

Knowledge Distillation

Train a smaller model to mimic a larger one:

aitraining llm --train \
  --model google/gemma-3-270m \
  --teacher-model google/gemma-2-2b \
  --data-path ./prompts.jsonl \
  --project-name distilled-model \
  --use-distillation \
  --distill-temperature 3.0

Distillation defaults: --distill-temperature 3.0, --distill-alpha 0.7, --distill-max-teacher-length 512

Logging & Monitoring

Weights & Biases (Default)

W&B logging with LEET visualizer is enabled by default. The LEET visualizer shows real-time training metrics directly in your terminal.

# W&B is on by default - just run training
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model

To disable W&B or the visualizer:

# Disable W&B logging entirely
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --log none

# Keep W&B but disable terminal visualizer
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --no-wandb-visualizer

TensorBoard

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --log tensorboard

Push to Hugging Face Hub

Upload your trained model:

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --push-to-hub \
  --username your-username \
  --token $HF_TOKEN

The repository is created as private by default. By default, the repo will be named {username}/{project-name}.

Custom Repository Name or Organization

Use --repo-id to push to a specific repository, useful for:

Pushing to an organization instead of your personal account
Using a different repo name than your local project-name

# Push to an organization
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name ./local-training-output \
  --push-to-hub \
  --repo-id my-organization/my-custom-model-name \
  --token $HF_TOKEN

# Push to personal account with different name
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name ./experiment-v3 \
  --push-to-hub \
  --repo-id your-username/production-model \
  --token $HF_TOKEN

Parameter	Description	Default
`--push-to-hub`	Enable pushing to Hub	`False`
`--hub-private` / `--no-hub-private`	Create repo as private or public	`True` (private)
`--username`	HF username (for default repo naming)	`None`
`--token`	HF API token	`None`
`--repo-id`	Full repo ID (e.g., `org/model-name`)	`{username}/{project-name}`

When using --repo-id, you don’t need --username since the repo ID already specifies the destination. However, you still need --token for authentication.

Advanced Options

Hyperparameter Sweeps

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name sweep-experiment \
  --use-sweep \
  --sweep-backend optuna \
  --sweep-n-trials 10

Enhanced Evaluation

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --use-enhanced-eval \
  --eval-metrics "perplexity,bleu"

View All Parameters

See all parameters for a specific trainer:

aitraining llm --trainer sft --help
aitraining llm --trainer dpo --help

Next Steps

YAML Configs

Use configuration files

DPO Training

Deep dive into DPO

LoRA/PEFT

Efficient fine-tuning

Distillation

Knowledge distillation

GRPO Training

RL with custom environments

​LLM Training

​Quick Start

​Available Trainers

​Parameter Groups

​Basic Parameters

​Training Configuration

​Checkpointing & Evaluation

​Performance & Memory

​Backend & Distribution

​PEFT/LoRA Parameters

​Data Processing

​Training Examples

​SFT with LoRA

​DPO Training

​ORPO Training

​GRPO Training

​Knowledge Distillation

​Logging & Monitoring

​Weights & Biases (Default)

​TensorBoard

​Push to Hugging Face Hub

​Custom Repository Name or Organization

​Advanced Options

​Hyperparameter Sweeps

​Enhanced Evaluation

​View All Parameters

​Next Steps

YAML Configs

DPO Training

LoRA/PEFT

Distillation

GRPO Training

LLM Training

Quick Start

Available Trainers

Parameter Groups

Basic Parameters

Training Configuration

Checkpointing & Evaluation

Performance & Memory

Backend & Distribution

PEFT/LoRA Parameters

Data Processing

Training Examples

SFT with LoRA

DPO Training

ORPO Training

GRPO Training

Knowledge Distillation

Logging & Monitoring

Weights & Biases (Default)

TensorBoard

Push to Hugging Face Hub

Custom Repository Name or Organization

Advanced Options

Hyperparameter Sweeps

Enhanced Evaluation

View All Parameters

Next Steps