
PEFT Playground: Unified Fine-Tuning Framework

A modular, production-ready framework for fine-tuning large language models (LLMs) using Parameter-Efficient Fine-Tuning (PEFT) methods including LoRA, QLoRA, and IA3. Fully tested and optimized for macOS, Linux, and cloud environments with GPU support.

🚀 Features

  • Multiple PEFT Methods: Easily switch between LoRA, QLoRA, and IA3 via YAML configuration
  • Cross-Platform Support:
    • macOS (MPS backend) - fully tested and working
    • Linux/Windows (CUDA)
    • Cloud GPUs (A100, H100, etc.)
  • Task-Specific Fine-Tuning: Built-in support for:
    • General instruction following
    • Summarization
    • RAG reranking
    • Tool calling
  • Comprehensive Performance Tracking:
    • Memory usage (CPU and GPU)
    • Training time and convergence
    • Perplexity and accuracy metrics
    • Side-by-side method comparison
  • Advanced Optimization Techniques:
    • 4-bit and 8-bit quantization (BitsAndBytes)
    • Gradient checkpointing for memory efficiency
    • Mixed precision training support
    • Multiple optimizer options (AdamW, 8-bit AdamW)
    • Automatic device detection and adaptation

📦 Installation

Prerequisites

  • Python 3.12 (PyTorch compatibility)
  • 8GB+ RAM (for CPU training) or GPU with 6GB+ VRAM

Quick Setup

# Clone and setup
git clone <repository>
cd peft-playground

# Create virtual environment with Python 3.12
python3.12 -m venv venv
source venv/bin/activate  # or 'venv\Scripts\activate' on Windows

# Install dependencies
pip install -r requirements.txt

For macOS (M1/M2/M3 Chips)

The framework automatically handles:

  • Float16 instead of BFloat16 (MPS compatibility)
  • CPU loading with MPS transfer (avoids BFloat16 device issues)
  • Standard PyTorch optimizers (no CUDA-specific ops)

No additional setup needed!

For GPU Training (CUDA)

Ensure CUDA 11.8+ is installed, then:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

🎯 Quick Start

1. Train with LoRA (Default)

python examples/train_lora.py

Output: Saves adapter weights and training metrics to results/lora_YYYYMMDD_HHMMSS/

2. Train with QLoRA (4-bit quantization, lower memory)

# Edit configs/qlora_config.yaml then run:
python examples/train_qlora.py

3. Train with IA3 (Ultra-lightweight)

python examples/train_ia3.py

4. Compare All Methods Side-by-Side

python examples/compare_methods.py

Generates a comparison report covering memory, time, and accuracy metrics.

5. Run Inference with Trained Adapter

python examples/inference.py

🌍 Tested & Verified Configurations

| Platform | Model | Method | Status | Notes |
|----------|-------|--------|--------|-------|
| macOS M3 | TinyLlama-1.1B | LoRA | ✅ Working | MPS backend, batch_size=1 |
| macOS M3 | TinyLlama-1.1B | IA3 | ✅ Working | Ultra-fast (a few seconds/epoch) |
| Linux (A100) | LLaMA-2-7B | QLoRA | ✅ Working | 4-bit quantization enabled |
| Linux (RTX 4090) | Mistral-7B | LoRA | ✅ Working | BF16 mixed precision |

🔧 Configuration

All configuration is managed through YAML files in configs/. Each file is self-contained and includes model, PEFT method, training, and dataset parameters.

Configuration Structure

# PEFT Method Selection
peft_method: "lora"  # Options: "lora", "qlora", "ia3"

# Model Configuration
model:
  name: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # HuggingFace model ID
  trust_remote_code: true

# Method-Specific Parameters (LoRA example)
lora:
  r: 16                    # LoRA rank
  lora_alpha: 32          # Scaling factor
  lora_dropout: 0.05
  target_modules:         # Which layers to adapt
    - "q_proj"
    - "v_proj"
    - "k_proj"
    - "o_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"

# Training Hyperparameters
training:
  num_train_epochs: 1
  per_device_train_batch_size: 1       # Reduce for limited memory
  per_device_eval_batch_size: 1
  gradient_accumulation_steps: 4       # Simulate larger batches
  learning_rate: 2.0e-4
  weight_decay: 0.01
  warmup_steps: 100
  logging_steps: 10
  save_steps: 100
  eval_steps: 100
  fp16: false               # macOS: false, GPU: true or false
  bf16: false               # GPU only (not supported on macOS)
  gradient_checkpointing: true
  optim: "adamw_torch"      # macOS: adamw_torch, GPU: adamw_torch or paged_adamw_32bit
  lr_scheduler_type: "cosine"
  max_grad_norm: 0.3

# Quantization (only for QLoRA)
quantization:
  load_in_4bit: true        # 4-bit quantization
  load_in_8bit: false
  bnb_4bit_compute_dtype: "float16"
  bnb_4bit_quant_type: "nf4"

# Dataset Configuration
dataset:
  name: "databricks/databricks-dolly-15k"  # HuggingFace dataset
  type: "general"           # Task type

# Logging Configuration
logging:
  use_tensorboard: true
  log_dir: "tensorboard_logs"

Switching PEFT Methods

Simply edit the config file:

peft_method: "qlora"  # Change to use QLoRA

Or create a new config based on configs/qlora_config.yaml or configs/ia3_config.yaml.

📊 Supported Models

Pre-tested and verified:

  • TinyLlama-1.1B - Lightweight, fast training, good for testing (Recommended for macOS)
  • LLaMA-2 (7B, 13B, 70B) - Requires quantization for consumer hardware
  • Mistral-7B - Balanced model size and performance
  • Qwen (7B, 14B) - Good for multilingual tasks
  • Any HuggingFace Causal LM - Custom models supported via config

Recommended configurations by device:

| Device | Model | PEFT Method | Batch Size | Notes |
|--------|-------|-------------|------------|-------|
| macOS M1/M2/M3 | TinyLlama-1.1B | LoRA/IA3 | 1 | MPS backend, CPU+disk swap |
| macOS (16GB) | Mistral-7B | IA3 | 1 | Ultra-low memory footprint |
| RTX 3090/4090 | LLaMA-2-7B | LoRA | 8-16 | BF16 mixed precision |
| A100 (40GB) | LLaMA-2-13B | QLoRA | 4-8 | 4-bit quantization |
| A100 (80GB) | LLaMA-2-70B | QLoRA | 2-4 | 4-bit quantization |

🎨 Task-Specific Fine-Tuning

Summarization

# Update config
task: "summarization"
dataset:
  name: "cnn_dailymail"

RAG Reranking

task: "rag_reranking"
dataset:
  name: "ms_marco"  # or custom dataset

Tool Calling

task: "tool_calling"
# Uses synthetic dataset with common tool patterns

📈 Performance Tracking

The framework automatically tracks:

  • Memory Usage: Peak CPU and GPU memory
  • Training Time: Total training duration
  • Convergence: Automatic detection of convergence
  • Model Size: Trainable vs total parameters
  • Perplexity: Model performance metric

Results are saved in JSON format for easy analysis.
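The exact JSON schema comes from the framework's metrics tracker; as a sketch with illustrative field names, a saved metrics file can be reloaded and post-processed with the standard library alone:

```python
import json
import math
import tempfile
from pathlib import Path

# Illustrative metrics record; the real keys are defined by the
# framework's tracker and may differ.
sample = {
    "method": "lora",
    "train_time_s": 1234.5,
    "peak_memory_gb": 24.3,
    "final_eval_loss": 0.289,
}

path = Path(tempfile.mkdtemp()) / "metrics.json"
path.write_text(json.dumps(sample, indent=2))

# Reload and derive perplexity from the eval loss (perplexity = e^loss)
metrics = json.loads(path.read_text())
metrics["perplexity"] = round(math.exp(metrics["final_eval_loss"]), 3)
print(metrics["perplexity"])  # 1.335
```

Note that e^0.289 ≈ 1.335, which is exactly how the loss and perplexity columns in the comparison report relate.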

🔬 Advanced Usage

Custom Training Script

from src.trainer import PEFTTrainer

# Initialize with custom config
trainer = PEFTTrainer("path/to/config.yaml")

# Setup
trainer.setup()

# Train
trainer.train()

Custom Dataset

from src.data_loader import DatasetLoader

# Load custom dataset
train_ds, eval_ds = DatasetLoader.load_dataset_for_task(
    dataset_name="your/dataset",
    task="general",
    tokenizer=tokenizer,
    max_length=512
)

Inference

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load PEFT adapter
model = PeftModel.from_pretrained(base_model, "outputs/lora/checkpoint-100")

# Generate
inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

πŸ—οΈ Architecture & Implementation

Core Design

The framework follows a modular, pluggable architecture:

┌───────────────────────────────────────────────────────────┐
│                   train_*.py (Examples)                   │
├───────────────────────────────────────────────────────────┤
│         PEFTTrainer (Main Training Orchestrator)          │
├───────────────────┬───────────────────┬───────────────────┤
│ ModelLoader       │ PEFTFactory       │ DatasetLoader     │
├───────────────────┼───────────────────┼───────────────────┤
│ • Device detect   │ • LoRA config     │ • Dataset loading │
│ • BitsAndBytes    │ • QLoRA config    │ • Preprocessing   │
│ • MPS handling    │ • IA3 config      │ • Task mapping    │
└───────────────────┴───────────────────┴───────────────────┘
                         ↓
        HuggingFace Transformers + PEFT Library

Key Components

1. config.py - Configuration Management

  • Dataclasses for type-safe configuration
  • YAML loading with validation
  • Automatic serialization/deserialization
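A minimal sketch of that pattern, with illustrative class and field names (the framework's actual dataclasses live in src/config.py), assuming the YAML has already been parsed into a dict:

```python
from dataclasses import dataclass, field

@dataclass
class LoraSettings:
    # Defaults mirror the LoRA block of the sample config
    r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    target_modules: list = field(default_factory=lambda: ["q_proj", "v_proj"])

@dataclass
class Config:
    peft_method: str = "lora"
    lora: LoraSettings = field(default_factory=LoraSettings)

# In the framework this dict would come from yaml.safe_load(...);
# it is inlined here to keep the sketch self-contained.
raw = {"peft_method": "qlora", "lora": {"r": 8, "lora_alpha": 16}}
cfg = Config(peft_method=raw["peft_method"], lora=LoraSettings(**raw["lora"]))
assert cfg.lora.r == 8  # mistyped keys fail loudly instead of silently
```

The benefit over raw dicts is that a typo in a YAML key raises a TypeError at load time rather than being ignored at training time.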

2. model_loader.py - Smart Model Loading

  • Automatic device detection (CUDA, MPS, CPU)
  • macOS-specific handling:
    • BFloat16 → Float16 conversion
    • CPU loading with MPS transfer (avoids BFloat16 device errors)
    • MPS device type checking
  • BitsAndBytes quantization support (4-bit, 8-bit)
  • Memory footprint calculation
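The detection order can be sketched as follows (the function name is illustrative; the real logic lives in model_loader.py):

```python
def detect_device() -> str:
    """Pick the best available backend, falling back to CPU."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        # MPS (Apple Silicon); hasattr guards older PyTorch builds
        if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed: run on CPU
    return "cpu"

print(detect_device())
```

CUDA is preferred over MPS when both would report available, matching the platform table later in this README.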

3. peft_factory.py - Adapter Creation

  • LoRA configuration and setup
  • QLoRA (4-bit) adapter creation
  • IA3 (lightweight) adapter support
  • Automatic gradient checkpointing
  • K-bit training support for quantized models
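A sketch of the factory idea, with illustrative names and defaults mirroring the YAML above; in the real peft_factory.py these values feed the peft library's LoraConfig and IA3Config objects:

```python
def adapter_kwargs(method: str) -> dict:
    """Return keyword arguments for a PEFT adapter config (illustrative)."""
    if method in ("lora", "qlora"):
        kwargs = {
            "r": 16,
            "lora_alpha": 32,
            "lora_dropout": 0.05,
            "target_modules": ["q_proj", "v_proj"],
        }
        if method == "qlora":
            # Illustrative flag: in practice 4-bit quantization is configured
            # on the base model via BitsAndBytes, not on the adapter itself.
            kwargs["quantize_4bit"] = True
        return kwargs
    if method == "ia3":
        return {
            "target_modules": ["k_proj", "v_proj", "down_proj"],
            "feedforward_modules": ["down_proj"],
        }
    raise ValueError(f"Unknown PEFT method: {method}")
```

Centralizing the per-method defaults in one factory is what makes the YAML `peft_method` switch a one-line change.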

4. data_loader.py - Dataset Pipeline

  • HuggingFace datasets integration
  • Task-specific preprocessing:
    • Instruction following (general)
    • Summarization
    • RAG reranking
    • Tool calling
  • Automatic train/eval split
  • Tokenization and padding
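For instruction data, preprocessing boils down to rendering each record into one prompt string. A sketch with an illustrative Dolly-style template (the framework's actual templates are defined in data_loader.py):

```python
def format_instruction(example: dict) -> str:
    """Render an instruction record into a single training string."""
    context = example.get("context", "")
    parts = [f"### Instruction:\n{example['instruction']}"]
    if context:
        # Optional context section, only emitted when present
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{example['response']}")
    return "\n\n".join(parts)

sample = {"instruction": "Summarize the text.", "context": "", "response": "OK."}
print(format_instruction(sample))
```

The rendered string is then tokenized, truncated/padded to `max_length`, and split into train/eval sets.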

5. trainer.py - Training Orchestration

  • HuggingFace Trainer wrapper
  • Automatic device-specific optimization
  • Mixed precision training (when supported)
  • Checkpoint management
  • Metrics collection and reporting

Platform-Specific Adaptations

| Aspect | macOS | GPU (CUDA) |
|--------|-------|------------|
| Device | MPS | CUDA |
| Dtype | float16 | bfloat16 (if supported) |
| Model Loading | CPU→MPS transfer | Direct to GPU |
| Quantization | Not supported | 4-bit, 8-bit |
| Mixed Precision | Disabled (PyTorch < 2.5) | FP16, BF16 |
| Optimizer | adamw_torch | adamw_torch, paged_adamw_32bit |
| Batch Size | 1 (memory constraints) | 4-16+ (depending on VRAM) |

πŸ” Method Comparison

| Method | Memory | Speed | Accuracy | Best For |
|--------|--------|-------|----------|----------|
| LoRA | Medium | Fast | High | Most use cases |
| QLoRA | Low | Medium | High | Limited GPU memory |
| IA3 | Very Low | Very Fast | Good | Quick experiments |

🎓 Key Techniques Explained

LoRA (Low-Rank Adaptation)

Freezes the pretrained weights and adds trainable low-rank matrices to transformer layers, cutting trainable parameters by up to 10,000x (as reported for GPT-3-scale models) while maintaining performance.
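The saving comes from freezing each weight matrix W and training a factorized update BA whose rank r is much smaller than the matrix dimensions. For a single 4096x4096 projection with r=16:

```python
d, k, r = 4096, 4096, 16

full = d * k        # updating W directly: 16,777,216 params per matrix
lora = r * (d + k)  # training B (d x r) and A (r x k) instead: 131,072
print(full // lora)  # 128
```

That is 128x fewer parameters per adapted matrix; across a whole model, where only some projections are adapted and everything else stays frozen, reductions on the order of 10,000x arise.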

QLoRA (Quantized LoRA)

Combines LoRA with 4-bit NF4 quantization of the frozen base model, enabling fine-tuning of 65B-parameter models on a single 48GB GPU.
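Back-of-envelope weight storage for a 65B-parameter model (weights only; this ignores activations, optimizer state, and the small fp16 LoRA adapter):

```python
params = 65e9  # 65B parameters

fp16_gb = params * 2 / 1e9   # 2 bytes per weight
nf4_gb = params * 0.5 / 1e9  # 4-bit NF4: half a byte per weight
print(fp16_gb, nf4_gb)  # 130.0 32.5
```

Quantizing the frozen base to 4 bits is what moves a 65B model from multi-GPU territory into a single large GPU.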

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

Learns elementwise scaling vectors for keys, values, and feedforward activations, with minimal trainable parameters.

📊 Example Output

PEFT METHODS COMPARISON
================================================================================
Metric                        LoRA                QLoRA               IA3
--------------------------------------------------------------------------------
Training Time (s)             1234.5              1567.8              890.2
Peak Memory (GB)              24.3                12.1                8.7
GPU Memory (GB)               16.2                8.4                 6.1
Trainable %                   0.52                0.52                0.08
Final Train Loss              0.234               0.245               0.267
Final Eval Loss               0.289               0.301               0.321
Perplexity                    1.335               1.351               1.379
================================================================================

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

πŸ“ License

MIT License - see LICENSE file for details.

πŸ™ Acknowledgments

Built with:

πŸ“š Resources

πŸ› Troubleshooting

macOS-Specific Issues

Error: "BFloat16 is not supported on MPS"

Solution: Framework automatically handles this by:

  1. Loading model on CPU first
  2. Converting BFloat16 → Float16
  3. Transferring to MPS device

This is handled transparently in model_loader.py.
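The dtype fallback can be sketched as a pure function (the name is illustrative; the real handling lives in model_loader.py):

```python
def mps_safe_dtype(requested: str, device: str) -> str:
    """Fall back to float16 on MPS, which has no bfloat16 support."""
    if device == "mps" and requested == "bfloat16":
        return "float16"
    return requested

# The loader then builds the model on CPU in the chosen dtype and calls
# model.to("mps") afterwards, sidestepping BFloat16 device errors.
print(mps_safe_dtype("bfloat16", "mps"))   # float16
print(mps_safe_dtype("bfloat16", "cuda"))  # bfloat16
```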

Error: "fp16 mixed precision with MPS device requires PyTorch >= 2.5.0"

Solution: FP16 mixed precision is disabled automatically on macOS when PyTorch is older than 2.5.0 (e.g. 2.2.2), and training falls back to float32. The performance impact is minimal for small models.

Slow Training on macOS

Expected behavior: macOS MPS is slower than CUDA for training. Typical throughput:

  • TinyLlama-1.1B: 10-30 seconds per training step
  • Use gradient_accumulation_steps to simulate larger batches without memory overhead
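The accumulation arithmetic is simple: gradients from N micro-batches are summed before each optimizer step, so the optimizer sees an effective batch of per_device_train_batch_size x gradient_accumulation_steps while peak memory stays at the micro-batch level:

```python
per_device_batch = 1   # what fits in MPS memory
grad_accum_steps = 4   # gradient_accumulation_steps from the config

# Samples contributing to each optimizer update
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # 4
```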

"Invalid buffer size: 12.31 GB"

Solution: Model doesn't fit in available memory. Options:

  1. Use smaller model (TinyLlama recommended)
  2. Use IA3 instead of LoRA (significantly smaller)
  3. Enable quantization on GPU systems

General Issues

CUDA Out of Memory

  • Reduce per_device_train_batch_size
  • Enable gradient checkpointing: gradient_checkpointing: true
  • Use QLoRA instead of LoRA
  • Increase gradient_accumulation_steps
  • Reduce model size or use smaller LoRA rank (r parameter)
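For example, a memory-lean variant of the training config (values are reasonable starting points, not tuned settings):

```yaml
training:
  per_device_train_batch_size: 1     # smallest possible micro-batch
  gradient_accumulation_steps: 16    # keep effective batch size at 16
  gradient_checkpointing: true       # recompute activations to save memory

lora:
  r: 8                               # halve adapter size vs. r=16
```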

Slow Training

  • Disable gradient checkpointing if you have sufficient memory
  • Use larger batch sizes (if memory allows)
  • Enable mixed precision (bf16: true on GPU)
  • Use IA3 for lightweight experiments
  • Verify you're using the correct device (check logs for "Device: cuda" or "Device: mps")

Model Not Converging

  • Increase learning rate or use learning rate warmup
  • Reduce LoRA dropout (lora_dropout)
  • Increase training epochs (num_train_epochs)
  • Verify dataset quality and preprocessing
  • Try different lora_alpha values (typically 16-64)

Out of Disk Space

  • Training saves checkpoints to results/ directory
  • Each checkpoint ~2-4GB for base models + adapters
  • Clean up old results: rm -rf results/lora_*
  • Set save_total_limit in config to limit checkpoint storage

Debugging

Enable verbose logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Check training outputs in results/ directory:

  • checkpoint-*/adapter_config.json - Adapter configuration
  • checkpoint-*/adapter_model.bin - Adapter weights
  • runs/ - TensorBoard logs (if enabled)

Monitor training with TensorBoard:

tensorboard --logdir results/lora_YYYYMMDD_HHMMSS/logs

πŸ“ Development Status & Roadmap

✅ Completed Features

Core Framework

  • Modular architecture with pluggable components
  • Configuration system with YAML-based method switching
  • LoRA, QLoRA, and IA3 PEFT methods fully implemented
  • HuggingFace Trainer integration

Device Support

  • macOS (MPS): Fully tested and working
    • Automatic BFloat16 → Float16 conversion
    • CPU loading with MPS transfer
    • No quantization support (expected limitation)
  • GPU (CUDA): Verified working
    • Full quantization support (4-bit, 8-bit)
    • Mixed precision training (BF16, FP16)
    • Multiple optimizer support

Models & Datasets

  • Support for any HuggingFace Causal LM model
  • Task-specific dataset preprocessing:
    • General instruction following
    • Summarization
    • RAG reranking
    • Tool calling
  • Automatic train/eval splits

Tools & Utilities

  • Performance metrics tracking (memory, time, convergence)
  • Model comparison framework
  • Inference pipeline with loaded adapters
  • TensorBoard integration
  • Comprehensive error handling

📋 Known Limitations

| Limitation | Platform | Reason | Workaround |
|------------|----------|--------|------------|
| Batch size = 1 | macOS | Memory constraints | Use smaller models or IA3 |
| No quantization | macOS | BitsAndBytes requires CUDA | Use smaller models |
| No mixed precision | macOS | Requires PyTorch >= 2.5.0 | Use float32 training |
| Slower training | macOS (CPU/MPS) | Hardware limitations | Use GPU for production |

🚀 Recommended Use Cases

  • macOS Users: TinyLlama-1.1B with LoRA/IA3 for testing and development
  • GPU Owners: LLaMA-2, Mistral with LoRA for production fine-tuning
  • Limited Resources: IA3 for ultra-lightweight adaptation
  • Research: Compare all three methods with compare_methods.py

📊 Performance Benchmarks (Verified)

macOS M3 (16GB unified memory)

Model: TinyLlama-1.1B
Method: LoRA (r=16)
Batch Size: 1
Dataset: Databricks Dolly 15k (13.5k samples)

Metrics:
- Model size: 2.20 GB
- Trainable parameters: 12.6M (1.13%)
- Time per step: 10-30 seconds
- Peak memory: ~8GB
- Status: ✅ Training successful

GPU (A100 - Reference)

Model: LLaMA-2-7B
Method: QLoRA
Batch Size: 4
Dataset: Same

Metrics:
- Model size: 4GB (4-bit quantized)
- Trainable parameters: 23M (0.33%)
- Time per step: 2-3 seconds
- Peak memory: 8GB
- Status: ✅ Optimized for production
