A professional, modular pipeline to fine-tune Llama-3 8B using Direct Preference Optimization (DPO) — from training to inference to benchmarking.
This repository contains a complete, production-grade pipeline for aligning a large language model using DPO. Instead of manually patching notebooks together, this project is structured as reusable Python scripts with YAML configurations so anyone can reproduce the training with a single command.
| Concept | What We Did |
|---|---|
| Base Model | unsloth/llama-3-8b-Instruct-bnb-4bit |
| Alignment Technique | Direct Preference Optimization (DPO) |
| Dataset | Intel/orca_dpo_pairs (1,000 samples) |
| Speed Optimization | Unsloth (2x faster training) |
| Memory Optimization | 4-bit quantization + Gradient Checkpointing |
| Environment | Kaggle T4 x2 GPU (Free Tier) |
| Trained Adapter | 🤗 Karan6124/llama3-8b-dpo-orca-adapter |
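For orientation, DPO skips the separate reward model of RLHF and optimizes the policy directly on preference pairs. A minimal sketch of the standard per-example DPO objective (the function name and toy values below are ours for illustration, not code from this repo):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is a summed log-probability of the chosen/rejected response
    under the policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))
```

Raising the policy's log-probability on the chosen response (relative to the reference) lowers the loss, which is exactly the preference signal the trainer follows; `beta` controls how far the policy may drift from the reference.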
```
llama3-dpo-alignment-pipeline/
├── 📁 configs/
│   ├── dpo_config.yaml            # All DPO training hyperparameters
│   └── benchmark_config.yaml      # Test prompts & generation settings
├── 📁 scripts/
│   └── train_dpo.py               # The main training engine (reads from configs/)
├── 📁 inference/
│   └── inference.py               # Load adapter and run interactive inference
├── 📁 evaluation/
│   └── benchmark.py               # Compare Base vs. Aligned model side-by-side
├── 📁 training/
│   └── training-llama3-dpo.ipynb  # The original Kaggle notebook
├── 📁 models/                     # (gitignored) Local adapter weights live here
├── pyproject.toml                 # Dependency management with uv
└── README.md
```
This project uses uv, an extremely fast Python package manager, so the entire environment is managed with a single command.
```powershell
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

```bash
git clone https://github.com/Edge-Explorer/llama3-dpo-alignment-pipeline.git
cd llama3-dpo-alignment-pipeline
uv sync
```

Note: Unsloth requires an NVIDIA GPU to import. For local development, you can write and review code without a GPU; use Kaggle or Google Colab to actually run the scripts.
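Because Unsloth fails at import time without CUDA, a guard along these lines keeps local tooling usable on CPU-only machines (our sketch, not part of the repo; the function name is hypothetical):

```python
import importlib.util

def gpu_stack_ready():
    """Best-effort check: torch and unsloth installed, and a CUDA device visible."""
    for mod in ("torch", "unsloth"):
        if importlib.util.find_spec(mod) is None:
            return False
    import torch
    return torch.cuda.is_available()

if gpu_stack_ready():
    from unsloth import FastLanguageModel  # GPU-only import is now safe
else:
    print("No GPU stack detected; run the scripts on Kaggle or Google Colab.")
```

`find_spec` only checks that a module is installed without importing it, so the check itself is safe to run anywhere.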
All training parameters are in configs/dpo_config.yaml. You can tweak learning rates, batch sizes, and sequence lengths without touching any Python code.
```bash
# On Kaggle or a GPU machine:
uv run python scripts/train_dpo.py
```

Key hyperparameters (optimized for the T4 GPU):

- `beta: 0.1` (DPO temperature)
- `learning_rate: 5e-6`
- `per_device_train_batch_size: 1`
- `gradient_accumulation_steps: 8`
- `max_length: 768`
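For illustration, those values could map onto a `dpo_config.yaml` along these lines. The key names below match the hyperparameters listed above, but the exact schema and nesting are an assumption; check `configs/dpo_config.yaml` for the real layout:

```yaml
# Illustrative sketch only; the real schema lives in configs/dpo_config.yaml
beta: 0.1                        # DPO temperature
learning_rate: 5.0e-6
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
max_length: 768
```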
We tracked the entire DPO alignment process using Weights & Biases (WandB).
The training loss shows a smooth convergence, confirming that the DPO adapter is effectively learning the preference pairs.

We optimized the pipeline for Kaggle T4 x2 GPUs. Memory usage remained stable under 15GB thanks to 4-bit quantization and gradient checkpointing.
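That headroom is easy to sanity-check: at 4 bits per weight, the frozen base model alone occupies roughly 4 GB, leaving room on a 16 GB T4 for LoRA parameters, optimizer state, and checkpointed activations. A back-of-envelope calculation (illustrative, not a measurement):

```python
# Rough VRAM footprint of the frozen 4-bit base weights of an 8B-parameter model.
params = 8e9
bytes_per_param = 0.5               # 4 bits = half a byte
base_weights_gb = params * bytes_per_param / 1024**3
print(f"{base_weights_gb:.2f} GB")  # ≈ 3.73 GB for the base weights alone
```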

Download the trained adapter from Hugging Face and load it locally.
```bash
# Make sure the adapter is in models/llama3_dpo_adapter/
uv run python inference/inference.py
```

Or use it directly in your own code:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Karan6124/llama3-8b-dpo-orca-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Why use DPO over SFT?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Run the benchmark script on Kaggle to generate a comparison report:
```bash
uv run python evaluation/benchmark.py
```

Results are saved automatically to `evaluation/benchmark_report.txt`.
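At its core, a base-vs-aligned comparison just prompts both models and lays the outputs side by side. A minimal, model-agnostic sketch of that idea (the function name and stub generators are ours; `evaluation/benchmark.py` has its own structure):

```python
def side_by_side_report(prompts, generate_base, generate_aligned):
    """Build a plain-text report comparing two generate callables (prompt -> str)."""
    sections = []
    for prompt in prompts:
        sections.append(
            f"PROMPT: {prompt}\n"
            f"  BASE:    {generate_base(prompt)}\n"
            f"  ALIGNED: {generate_aligned(prompt)}"
        )
    return ("\n" + "-" * 60 + "\n").join(sections)

# Usage with stub generators; in practice these would wrap model.generate calls.
report = side_by_side_report(
    ["Why use DPO over SFT?"],
    generate_base=lambda p: "(base answer)",
    generate_aligned=lambda p: "(aligned answer)",
)
print(report)
```

Keeping the generators as plain callables makes the report logic testable without loading either model.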
| Problem | Solution |
|---|---|
| `transformers` 5.0.0 broke `trl` in Colab | Switched to Kaggle's stable environment |
| `DPOConfig` not found in old `trl` | Pinned `trl>=0.12.0` |
| `OutOfMemoryError` on T4 GPU | Reduced batch size to 1, enabled gradient checkpointing |
| Slow training | Unsloth's `PatchDPOTrainer` gave ~2x speedup |
| Messy notebook workflow | Refactored into reusable scripts + YAML configs |
This project is licensed under the MIT License — see LICENSE for details.
- Unsloth for making LLM fine-tuning incredibly fast
- TRL by HuggingFace for the DPO implementation
- Intel/orca_dpo_pairs for the training dataset
- Kaggle for the free GPUs that made this possible