🦙 Llama-3 DPO Alignment Pipeline

A professional, modular pipeline to fine-tune Llama-3 8B using Direct Preference Optimization (DPO) — from training to inference to benchmarking.



🚀 What is This Project?

This repository contains a complete pipeline for aligning a large language model using DPO. Rather than a set of manually patched notebooks, the project is structured as reusable Python scripts with YAML configurations, so anyone can reproduce the training with a single command.

⚡ The Core Idea

Concept             | What We Did
Base Model          | unsloth/llama-3-8b-Instruct-bnb-4bit
Alignment Technique | Direct Preference Optimization (DPO)
Dataset             | Intel/orca_dpo_pairs (1,000 samples)
Speed Optimization  | Unsloth (2x faster training)
Memory Optimization | 4-bit quantization + Gradient Checkpointing
Environment         | Kaggle T4 x2 GPU (Free Tier)
Trained Adapter     | 🤗 Karan6124/llama3-8b-dpo-orca-adapter
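
To make the idea of a preference pair concrete, here is a minimal sketch that maps one row with the dataset's system, question, chosen, and rejected fields into the prompt/chosen/rejected message layout that TRL's DPO trainer accepts. The helper name to_preference_example and the sample row are hypothetical, and the actual preprocessing in scripts/train_dpo.py may differ.

```python
def to_preference_example(row: dict) -> dict:
    """Map one orca_dpo_pairs-style row to prompt/chosen/rejected messages."""
    prompt = []
    # The system prompt is optional; skip it when it is empty.
    if row.get("system"):
        prompt.append({"role": "system", "content": row["system"]})
    prompt.append({"role": "user", "content": row["question"]})
    return {
        "prompt": prompt,
        "chosen": [{"role": "assistant", "content": row["chosen"]}],
        "rejected": [{"role": "assistant", "content": row["rejected"]}],
    }


# Made-up row for illustration only; it is not taken from the dataset.
sample_row = {
    "system": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "chosen": "2 + 2 equals 4.",
    "rejected": "It is probably 5.",
}

example = to_preference_example(sample_row)
print(example["prompt"])
print(example["chosen"])
print(example["rejected"])
```

During training, DPO pushes the model to assign relatively more probability to the chosen reply than to the rejected one for the same prompt.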

🗂️ Project Structure

llama3-dpo-alignment-pipeline/
├── 📁 configs/
│   ├── dpo_config.yaml          # All DPO training hyperparameters
│   └── benchmark_config.yaml   # Test prompts & generation settings
├── 📁 scripts/
│   └── train_dpo.py            # The main training engine (reads from configs/)
├── 📁 inference/
│   └── inference.py            # Load adapter and run interactive inference
├── 📁 evaluation/
│   └── benchmark.py            # Compare Base vs. Aligned model side-by-side
├── 📁 training/
│   └── training-llama3-dpo.ipynb  # The original Kaggle notebook
├── 📁 models/                  # (gitignored) Local adapter weights live here
├── pyproject.toml              # Dependency management with uv
└── README.md

🛠️ Setup & Installation

This project uses uv, a fast Python package manager. After installing it, a single uv sync command sets up the environment and all dependencies.

1. Install uv

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone the Repo & Sync Dependencies

git clone https://github.com/Edge-Explorer/llama3-dpo-alignment-pipeline.git
cd llama3-dpo-alignment-pipeline
uv sync

Note: Unsloth requires an NVIDIA GPU just to import. You can write and review the code locally without one, but running the scripts requires a GPU environment such as Kaggle or Google Colab.


🏋️ Training

All training parameters are in configs/dpo_config.yaml. You can tweak learning rates, batch sizes, and sequence lengths without touching any Python code.

# On Kaggle or a GPU machine:
uv run python scripts/train_dpo.py

Key Hyperparameters (optimized for T4 GPU):

  • beta: 0.1 (DPO temperature)
  • learning_rate: 5e-6
  • per_device_train_batch_size: 1
  • gradient_accumulation_steps: 8
  • max_length: 768
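
As a rough illustration of how these values interact, the sketch below computes the effective batch size and the approximate number of optimizer steps per epoch for the 1,000-sample subset. The DPOHyperparams class is a hypothetical container, not the repository's config loader, and the default of one device is an assumption, since the number of GPUs the run actually used is not stated here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DPOHyperparams:
    """Hypothetical container mirroring the values listed above."""
    beta: float = 0.1
    learning_rate: float = 5e-6
    per_device_train_batch_size: int = 1
    gradient_accumulation_steps: int = 8
    max_length: int = 768

    def effective_batch_size(self, num_devices: int = 1) -> int:
        # Samples that contribute to each optimizer update.
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )

    def steps_per_epoch(self, num_samples: int, num_devices: int = 1) -> int:
        # Approximate: trainers may round a partial final batch differently.
        return math.ceil(num_samples / self.effective_batch_size(num_devices))


hp = DPOHyperparams()
print(hp.effective_batch_size())   # 8
print(hp.steps_per_epoch(1000))    # 125
```

A per-device batch size of 1 keeps peak memory low, while gradient accumulation restores a larger effective batch for more stable updates.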

📊 Monitoring & Metrics

We tracked the entire DPO alignment process using Weights & Biases (WandB).

📈 Training Progress

The training loss converges smoothly, which suggests that the DPO adapter is learning from the preference pairs.

[Figure: DPO Training Metrics]
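
For readers unfamiliar with the quantity being plotted, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It takes summed log-probabilities of the chosen and rejected replies under the policy and under the frozen reference model; the function name and the example numbers are illustrative, and the actual run relies on TRL's implementation rather than this code.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair."""
    # How far the policy has moved from the reference on each reply.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before any training the policy equals the reference, so the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Illustrative numbers: the policy now favors the chosen reply more strongly.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(round(start, 4))  # 0.6931
print(later < start)    # True
```

A loss curve that starts near 0.693 and declines is therefore the expected shape when the adapter is learning to prefer the chosen responses.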

🖥️ System Performance

We optimized the pipeline for Kaggle T4 x2 GPUs. Memory usage remained stable under 15 GB thanks to 4-bit quantization and gradient checkpointing.

[Figure: System Utilization]
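
A back-of-the-envelope calculation shows why 4-bit quantization matters on a 16 GB T4. The sketch below estimates weight storage only, assuming a round 8 billion parameters; it ignores quantization metadata, adapter weights, activations, and optimizer state, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # Assumption: round figure for an "8B" model.

print(weight_memory_gb(NUM_PARAMS, 16))  # 16.0 (16-bit weights alone fill a T4)
print(weight_memory_gb(NUM_PARAMS, 4))   # 4.0  (4-bit leaves room for training)
```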


🤖 Inference

Download the trained adapter from Hugging Face and load it locally.

# Make sure the adapter is in models/llama3_dpo_adapter/
uv run python inference/inference.py

Or use it directly in your own code:

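from unsloth import FastLanguageModel

# Load the adapter from the Hugging Face Hub in 4-bit.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Karan6124/llama3-8b-dpo-orca-adapter",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # Switch Unsloth to inference mode

messages = [{"role": "user", "content": "Why use DPO over SFT?"}]
# return_dict=True returns input_ids and attention_mask as a mapping,
# which is what the **inputs unpacking below requires.
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256)
# The decoded text contains the prompt followed by the model's reply.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])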

📊 Benchmarking

Run the benchmark script on Kaggle to generate a comparison report:

uv run python evaluation/benchmark.py

Results are saved automatically to evaluation/benchmark_report.txt.
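
The shape of such a side-by-side report can be sketched as follows. This is a hypothetical illustration, not the code in evaluation/benchmark.py: build_report and the two stub generators are assumptions, standing in for calls to the base and aligned models.

```python
from typing import Callable, Iterable


def build_report(prompts: Iterable[str],
                 base_generate: Callable[[str], str],
                 aligned_generate: Callable[[str], str]) -> str:
    """Format base vs. aligned responses for each prompt as plain text."""
    sections = []
    for i, prompt in enumerate(prompts, start=1):
        sections.append(
            f"=== Prompt {i} ===\n{prompt}\n\n"
            f"--- Base model ---\n{base_generate(prompt)}\n\n"
            f"--- Aligned model ---\n{aligned_generate(prompt)}\n"
        )
    return "\n".join(sections)


# Stub generators so the sketch runs without a GPU or model weights.
report = build_report(
    ["Why use DPO over SFT?"],
    base_generate=lambda p: "stub base answer",
    aligned_generate=lambda p: "stub aligned answer",
)
print(report)
```

Passing the generators in as callables keeps the formatting logic testable on a machine without a GPU; the resulting string could then be written to a report file with pathlib.Path.write_text.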


🏆 Key Lessons Learned

Problem                              | Solution
transformers 5.0.0 broke trl in Colab | Switched to Kaggle's stable environment
DPOConfig not found in old trl       | Required trl>=0.12.0
OutOfMemoryError on T4 GPU           | Reduced batch size to 1, enabled gradient checkpointing
Slow training                        | Unsloth's PatchDPOTrainer gave ~2x speedup
Messy notebook workflow              | Refactored into reusable scripts + YAML configs

📄 License

This project is licensed under the MIT License — see LICENSE for details.


🙏 Acknowledgements
