A modular, production-ready framework for fine-tuning large language models (LLMs) using Parameter-Efficient Fine-Tuning (PEFT) methods including LoRA, QLoRA, and IA3. Fully tested and optimized for macOS, Linux, and cloud environments with GPU support.
- Multiple PEFT Methods: Easily switch between LoRA, QLoRA, and IA3 via YAML configuration
- Cross-Platform Support:
- macOS (MPS backend) - fully tested and working
- Linux/Windows (CUDA)
- Cloud GPUs (A100, H100, etc.)
- Task-Specific Fine-Tuning: Built-in support for:
- General instruction following
- Summarization
- RAG reranking
- Tool calling
- Comprehensive Performance Tracking:
- Memory usage (CPU and GPU)
- Training time and convergence
- Perplexity and accuracy metrics
- Side-by-side method comparison
- Advanced Optimization Techniques:
- 4-bit and 8-bit quantization (BitsAndBytes)
- Gradient checkpointing for memory efficiency
- Mixed precision training support
- Multiple optimizer options (AdamW, 8-bit AdamW)
- Automatic device detection and adaptation
- Python 3.12 (PyTorch compatibility)
- 8GB+ RAM (for CPU training) or GPU with 6GB+ VRAM
# Clone and setup
git clone <repository>
cd peft-playground
# Create virtual environment with Python 3.12
python3.12 -m venv venv
source venv/bin/activate # or 'venv\Scripts\activate' on Windows
# Install dependencies
pip install -r requirements.txt

The framework automatically handles:
- Float16 instead of BFloat16 (MPS compatibility)
- CPU loading with MPS transfer (avoids BFloat16 device issues)
- Standard PyTorch optimizers (no CUDA-specific ops)
No additional setup needed!
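A minimal sketch of this detection-and-fallback logic (illustrative only, not the framework's actual `model_loader.py`):

```python
import torch

def detect_device() -> str:
    """Pick the best available backend in priority order: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

def preferred_dtype(device: str) -> torch.dtype:
    """MPS lacks solid BFloat16 support, so fall back to Float16 there."""
    if device == "cuda" and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    if device in ("cuda", "mps"):
        return torch.float16
    return torch.float32
```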
Ensure CUDA 11.8+ is installed, then:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

python examples/train_lora.py

Output: Saves adapter weights and training metrics to results/lora_YYYYMMDD_HHMMSS/
# Edit configs/qlora_config.yaml then run:
python examples/train_qlora.py

python examples/train_ia3.py

python examples/compare_methods.py

Generates a comparison report with memory, time, and accuracy metrics.
python examples/inference.py

| Platform | Model | Method | Status | Notes |
|---|---|---|---|---|
| macOS M3 | TinyLlama-1.1B | LoRA | ✅ Working | MPS backend, batch_size=1 |
| macOS M3 | TinyLlama-1.1B | IA3 | ✅ Working | Ultra-fast (few seconds/epoch) |
| Linux (A100) | LLaMA-2-7B | QLoRA | ✅ Working | 4-bit quantization enabled |
| Linux (RTX 4090) | Mistral-7B | LoRA | ✅ Working | BF16 mixed precision |
All configuration is managed through YAML files in configs/. Each file is self-contained and includes model, PEFT method, training, and dataset parameters.
# PEFT Method Selection
peft_method: "lora"  # Options: "lora", "qlora", "ia3"

# Model Configuration
model:
  name: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # HuggingFace model ID
  trust_remote_code: true

# Method-Specific Parameters (LoRA example)
lora:
  r: 16  # LoRA rank
  lora_alpha: 32  # Scaling factor
  lora_dropout: 0.05
  target_modules:  # Which layers to adapt
    - "q_proj"
    - "v_proj"
    - "k_proj"
    - "o_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"

# Training Hyperparameters
training:
  num_train_epochs: 1
  per_device_train_batch_size: 1  # Reduce for limited memory
  per_device_eval_batch_size: 1
  gradient_accumulation_steps: 4  # Simulate larger batches
  learning_rate: 2.0e-4
  weight_decay: 0.01
  warmup_steps: 100
  logging_steps: 10
  save_steps: 100
  eval_steps: 100
  fp16: false  # macOS: false, GPU: true or false
  bf16: false  # GPU only (not supported on macOS)
  gradient_checkpointing: true
  optim: "adamw_torch"  # macOS: adamw_torch, GPU: adamw_torch or paged_adamw_32bit
  lr_scheduler_type: "cosine"
  max_grad_norm: 0.3

# Quantization (only for QLoRA)
quantization:
  load_in_4bit: true  # 4-bit quantization
  load_in_8bit: false
  bnb_4bit_compute_dtype: "float16"
  bnb_4bit_quant_type: "nf4"

# Dataset Configuration
dataset:
  name: "databricks/databricks-dolly-15k"  # HuggingFace dataset
  type: "general"  # Task type

# Logging Configuration
logging:
  use_tensorboard: true
  log_dir: "tensorboard_logs"

Simply edit the config file:

peft_method: "qlora"  # Change to use QLoRA

Or create a new config based on configs/qlora_config.yaml or configs/ia3_config.yaml.
Pre-tested and verified:
- TinyLlama-1.1B - Lightweight, fast training, good for testing (Recommended for macOS)
- LLaMA-2 (7B, 13B, 70B) - Requires quantization for consumer hardware
- Mistral-7B - Balanced model size and performance
- Qwen (7B, 14B) - Good for multilingual tasks
- Any HuggingFace Causal LM - Custom models supported via config
Recommended configurations by device:
| Device | Model | PEFT Method | Batch Size | Notes |
|---|---|---|---|---|
| macOS M1/M2/M3 | TinyLlama-1.1B | LoRA/IA3 | 1 | MPS backend, CPU+disk swap |
| macOS (16GB) | Mistral-7B | IA3 | 1 | Ultra-low memory footprint |
| RTX 3090/4090 | LLaMA-2-7B | LoRA | 8-16 | BF16 mixed precision |
| A100 (40GB) | LLaMA-2-13B | QLoRA | 4-8 | 4-bit quantization |
| A100 (80GB) | LLaMA-2-70B | QLoRA | 2-4 | 4-bit quantization |
# Update config
task: "summarization"
dataset:
  name: "cnn_dailymail"

task: "rag_reranking"
dataset:
  name: "ms_marco"  # or custom dataset

task: "tool_calling"
# Uses a synthetic dataset with common tool patterns

The framework automatically tracks:
- Memory Usage: Peak CPU and GPU memory
- Training Time: Total training duration
- Convergence: Automatic detection of convergence
- Model Size: Trainable vs total parameters
- Perplexity: Model performance metric
Results are saved in JSON format for easy analysis.
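As a hypothetical example (the file name, field names, and values below are assumptions for illustration, not the framework's documented schema), saved JSON metrics can be compared across runs like this:

```python
import json

# Pretend these strings were read from results/<run>/metrics.json files.
runs = {
    "lora": '{"perplexity": 1.335, "peak_memory_gb": 24.3}',
    "ia3":  '{"perplexity": 1.379, "peak_memory_gb": 8.7}',
}

# Parse each run's metrics and pick the run with the lowest perplexity.
parsed = {name: json.loads(blob) for name, blob in runs.items()}
best = min(parsed, key=lambda name: parsed[name]["perplexity"])
print(best)  # lora
```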
from src.trainer import PEFTTrainer
# Initialize with custom config
trainer = PEFTTrainer("path/to/config.yaml")
# Setup
trainer.setup()
# Train
trainer.train()

from src.data_loader import DatasetLoader
# Load custom dataset
train_ds, eval_ds = DatasetLoader.load_dataset_for_task(
    dataset_name="your/dataset",
    task="general",
    tokenizer=tokenizer,
    max_length=512,
)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load PEFT adapter
model = PeftModel.from_pretrained(base_model, "outputs/lora/checkpoint-100")

# Generate
inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(**inputs)

The framework follows a modular, pluggable architecture:
┌──────────────────────────────────────────────────────────────┐
│                    train_*.py (Examples)                     │
├──────────────────────────────────────────────────────────────┤
│           PEFTTrainer (Main Training Orchestrator)           │
├───────────────────┬───────────────────┬──────────────────────┤
│    ModelLoader    │    PEFTFactory    │    DatasetLoader     │
├───────────────────┼───────────────────┼──────────────────────┤
│ • Device detect   │ • LoRA config     │ • Dataset loading    │
│ • BitsAndBytes    │ • QLoRA config    │ • Preprocessing      │
│ • MPS handling    │ • IA3 config      │ • Task mapping       │
└───────────────────┴───────────────────┴──────────────────────┘
                               │
               HuggingFace Transformers + PEFT Library
- Dataclasses for type-safe configuration
- YAML loading with validation
- Automatic serialization/deserialization
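A hedged sketch of the YAML-to-dataclass pattern described above; the actual class names and field set in the framework's config module may differ:

```python
from dataclasses import dataclass, field

import yaml  # PyYAML

@dataclass
class LoraConfig:
    # Defaults mirror the sample config earlier in this README.
    r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05

@dataclass
class Config:
    peft_method: str = "lora"
    lora: LoraConfig = field(default_factory=LoraConfig)

# Parse a YAML snippet and build typed config objects from it.
raw = yaml.safe_load('peft_method: "qlora"\nlora: {r: 8}')
cfg = Config(peft_method=raw["peft_method"], lora=LoraConfig(**raw["lora"]))
print(cfg.peft_method, cfg.lora.r)  # qlora 8
```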
- Automatic device detection (CUDA, MPS, CPU)
- macOS-specific handling:
- BFloat16 → Float16 conversion
- CPU loading with MPS transfer (avoids BFloat16 device errors)
- MPS device type checking
- BitsAndBytes quantization support (4-bit, 8-bit)
- Memory footprint calculation
- LoRA configuration and setup
- QLoRA (4-bit) adapter creation
- IA3 (lightweight) adapter support
- Automatic gradient checkpointing
- K-bit training support for quantized models
- HuggingFace datasets integration
- Task-specific preprocessing:
- Instruction following (general)
- Summarization
- RAG reranking
- Tool calling
- Automatic train/eval split
- Tokenization and padding
- HuggingFace Trainer wrapper
- Automatic device-specific optimization
- Mixed precision training (when supported)
- Checkpoint management
- Metrics collection and reporting
| Aspect | macOS | GPU (CUDA) |
|---|---|---|
| Device | MPS | CUDA |
| Dtype | float16 | bfloat16 (if supported) |
| Model Loading | CPU→MPS transfer | Direct to GPU |
| Quantization | Not supported | 4-bit, 8-bit |
| Mixed Precision | Disabled (PyTorch <2.5) | FP16, BF16 |
| Optimizer | adamw_torch | adamw_torch, paged_adamw_32bit |
| Batch Size | 1 (memory constraints) | 4-16+ (depending on VRAM) |
| Method | Memory | Speed | Accuracy | Best For |
|---|---|---|---|---|
| LoRA | Medium | Fast | High | Most use cases |
| QLoRA | Low | Medium | High | Limited GPU memory |
| IA3 | Very Low | Very Fast | Good | Quick experiments |
Adds trainable low-rank matrices to transformer layers, reducing trainable parameters by up to 10,000x while maintaining performance.
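A quick back-of-the-envelope check of that reduction, using a hypothetical 4096x4096 projection layer:

```python
# A frozen d_out x d_in weight is adapted by two low-rank factors,
# B (d_out x r) and A (r x d_in), so only r * (d_in + d_out) params train.
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

full = 4096 * 4096                                  # 16,777,216 frozen weights
lora = lora_trainable_params(4096, 4096, r=16)      # 131,072 trainable weights
print(full // lora)                                 # 128x reduction for this layer
```

The 10,000x figure applies across a whole multi-billion-parameter model, where most weights receive no adapter at all.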
Combines LoRA with 4-bit quantization, enabling fine-tuning of 65B+ models on consumer GPUs.
Learns scaling vectors for keys, values, and feedforward activations with minimal parameters.
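The core IA3 idea fits in a few lines of plain Python: rather than adding weight matrices, a single learned vector rescales each adapted activation elementwise (toy illustration, not the PEFT library's implementation):

```python
# Multiply each hidden dimension by its learned scale; the scales are the
# only trainable parameters, hence the tiny footprint.
def ia3_scale(activations, scale_vector):
    return [[a * s for a, s in zip(row, scale_vector)] for row in activations]

hidden = [[1.0, 2.0, 3.0]]
scales = [1.0, 0.5, 2.0]          # initialized to all-ones before training
print(ia3_scale(hidden, scales))  # [[1.0, 1.0, 6.0]]
```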
PEFT METHODS COMPARISON
================================================================================
Metric LoRA QLoRA IA3
--------------------------------------------------------------------------------
Training Time (s) 1234.5 1567.8 890.2
Peak Memory (GB) 24.3 12.1 8.7
GPU Memory (GB) 16.2 8.4 6.1
Trainable % 0.52 0.52 0.08
Final Train Loss 0.234 0.245 0.267
Final Eval Loss 0.289 0.301 0.321
Perplexity 1.335 1.351 1.379
================================================================================
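The perplexity row follows directly from the eval losses, since perplexity is exp(cross-entropy loss):

```python
import math

def perplexity(eval_loss: float) -> float:
    # Perplexity = e^(cross-entropy loss), both measured in nats.
    return math.exp(eval_loss)

print(round(perplexity(0.289), 3))  # ~1.335, matching the LoRA row above
```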
Contributions are welcome! Please feel free to submit issues or pull requests.
MIT License - see LICENSE file for details.
Built with:
Solution: The framework handles this automatically by:
- Loading model on CPU first
- Converting BFloat16 → Float16
- Transferring to MPS device
This is handled transparently in model_loader.py.
Solution: We disable FP16 mixed precision on macOS with PyTorch 2.2.2. Use float32 training instead. Performance impact is minimal for small models.
Expected behavior: macOS MPS is slower than CUDA for training. Typical throughput:
- TinyLlama-1.1B: 10-30 seconds per training step
- Use `gradient_accumulation_steps` to simulate larger batches without memory overhead
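Gradient accumulation works because the optimizer steps once per `accum` micro-batches, so the effective batch size is the product (small illustration; `effective_batch_size` is not a framework function):

```python
# Effective batch = micro-batch size x accumulation steps x device count.
def effective_batch_size(per_device: int, accum: int, n_devices: int = 1) -> int:
    return per_device * accum * n_devices

print(effective_batch_size(1, 4))  # 4, matching the sample config above
```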
Solution: Model doesn't fit in available memory. Options:
- Use smaller model (TinyLlama recommended)
- Use IA3 instead of LoRA (significantly smaller)
- Enable quantization on GPU systems
- Reduce `per_device_train_batch_size`
- Enable gradient checkpointing: `gradient_checkpointing: true`
- Use QLoRA instead of LoRA
- Increase `gradient_accumulation_steps`
- Reduce model size or use a smaller LoRA rank (`r` parameter)
- Disable gradient checkpointing if you have sufficient memory
- Use larger batch sizes (if memory allows)
- Enable mixed precision (`bf16: true` on GPU)
- Use IA3 for lightweight experiments
- Verify you're using the correct device (check logs for "Device: cuda" or "Device: mps")
- Increase learning rate or use learning rate warmup
- Reduce LoRA dropout (`lora_dropout`)
- Increase training epochs (`num_train_epochs`)
- Verify dataset quality and preprocessing
- Try different `lora_alpha` values (typically 16-64)
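For context on `lora_alpha`: the LoRA update is scaled by `alpha / r`, so raising `lora_alpha` (or lowering `r`) strengthens the adapter relative to the frozen weights (the helper below is illustrative, not a framework API):

```python
# Effective LoRA scaling factor applied to the low-rank update B @ A.
def lora_scaling(lora_alpha: int, r: int) -> float:
    return lora_alpha / r

print(lora_scaling(32, 16))  # 2.0, the sample config's alpha=32, r=16
```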
- Training saves checkpoints to the `results/` directory
- Each checkpoint is ~2-4GB for base models + adapters
- Clean up old results: `rm -rf results/lora_*`
- Set `save_total_limit` in config to limit checkpoint storage
Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)

Check training outputs in the results/ directory:

- `checkpoint-*/adapter_config.json` - Adapter configuration
- `checkpoint-*/adapter_model.bin` - Adapter weights
- `runs/` - TensorBoard logs (if enabled)

Monitor training with TensorBoard:

tensorboard --logdir results/lora_20260118_HHMMSS/logs

Core Framework
- Modular architecture with pluggable components
- Configuration system with YAML-based method switching
- LoRA, QLoRA, and IA3 PEFT methods fully implemented
- HuggingFace Trainer integration
Device Support
- macOS (MPS): Fully tested and working
- Automatic BFloat16 → Float16 conversion
- CPU loading with MPS transfer
- No quantization support (expected limitation)
- GPU (CUDA): Verified working
- Full quantization support (4-bit, 8-bit)
- Mixed precision training (BF16, FP16)
- Multiple optimizer support
Models & Datasets
- Support for any HuggingFace Causal LM model
- Task-specific dataset preprocessing:
- General instruction following
- Summarization
- RAG reranking
- Tool calling
- Automatic train/eval splits
Tools & Utilities
- Performance metrics tracking (memory, time, convergence)
- Model comparison framework
- Inference pipeline with loaded adapters
- TensorBoard integration
- Comprehensive error handling
| Limitation | Platform | Reason | Workaround |
|---|---|---|---|
| Batch size = 1 | macOS | Memory constraints | Use smaller models or IA3 |
| No quantization | macOS | BitsAndBytes requires CUDA | Use smaller models |
| No mixed precision | macOS | Requires PyTorch >= 2.5.0 | Use float32 training |
| Slower training | macOS (CPU/MPS) | Hardware limitations | Use GPU for production |
- macOS Users: TinyLlama-1.1B with LoRA/IA3 for testing and development
- GPU Owners: LLaMA-2, Mistral with LoRA for production fine-tuning
- Limited Resources: IA3 for ultra-lightweight adaptation
- Research: Compare all three methods with `compare_methods.py`
macOS M3 (16GB unified memory)
Model: TinyLlama-1.1B
Method: LoRA (r=16)
Batch Size: 1
Dataset: Databricks Dolly 15k (13.5k samples)
Metrics:
- Model size: 2.20 GB
- Trainable parameters: 12.6M (1.13%)
- Time per step: 10-30 seconds
- Peak memory: ~8GB
- Status: ✅ Training successful
GPU (A100 - Reference)
Model: LLaMA-2-7B
Method: QLoRA
Batch Size: 4
Dataset: Same
Metrics:
- Model size: 4GB (4-bit quantized)
- Trainable parameters: 23M (0.33%)
- Time per step: 2-3 seconds
- Peak memory: 8GB
- Status: ✅ Optimized for production