A modular, production-ready framework for fine-tuning large language models (LLMs) using Parameter-Efficient Fine-Tuning (PEFT) methods including LoRA, QLoRA, and IA3. Fully tested and optimized for macOS, Linux, and cloud environments with GPU support.
- Multiple PEFT Methods: Easily switch between LoRA, QLoRA, and IA3 via YAML configuration
- Cross-Platform Support:
- macOS (MPS backend) - fully tested and working
- Linux/Windows (CUDA)
- Cloud GPUs (A100, H100, etc.)
- Task-Specific Fine-Tuning: Built-in support for:
- General instruction following
- Summarization
- RAG reranking
- Tool calling
- Comprehensive Performance Tracking:
- Memory usage (CPU and GPU)
- Training time and convergence
- Perplexity and accuracy metrics
- Side-by-side method comparison
- Advanced Optimization Techniques:
- 4-bit and 8-bit quantization (BitsAndBytes)
- Gradient checkpointing for memory efficiency
- Mixed precision training support
- Multiple optimizer options (AdamW, 8-bit AdamW)
- Automatic device detection and adaptation
- Python 3.12 (PyTorch compatibility)
- 8GB+ RAM (for CPU training) or GPU with 6GB+ VRAM
# Clone and setup
git clone <repository>
cd peft-playground
# Create virtual environment with Python 3.12
python3.12 -m venv venv
source venv/bin/activate # or 'venv\Scripts\activate' on Windows
# Install dependencies
pip install -r requirements.txt

The framework automatically handles:
- Float16 instead of BFloat16 (MPS compatibility)
- CPU loading with MPS transfer (avoids BFloat16 device issues)
- Standard PyTorch optimizers (no CUDA-specific ops)
No additional setup needed!
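A minimal sketch of this detection-and-fallback logic (illustrative only, not the framework's actual `model_loader.py`):

```python
import torch

def detect_device() -> str:
    """Pick the best available backend in priority order: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

def preferred_dtype(device: str) -> torch.dtype:
    """MPS lacks solid BFloat16 support, so fall back to Float16 there."""
    if device == "cuda" and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    if device in ("cuda", "mps"):
        return torch.float16
    return torch.float32
```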
Ensure CUDA 11.8+ is installed, then:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

python examples/train_lora.py

Output: Saves adapter weights and training metrics to results/lora_YYYYMMDD_HHMMSS/
# Edit configs/qlora_config.yaml then run:
python examples/train_qlora.py

python examples/train_ia3.py

python examples/compare_methods.py

Generates a comparison report with memory, time, and accuracy metrics.
python examples/inference.py

| Platform | Model | Method | Status | Notes |
|---|---|---|---|---|
| macOS M3 | TinyLlama-1.1B | LoRA | ✅ Working | MPS backend, batch_size=1 |
| macOS M3 | TinyLlama-1.1B | IA3 | ✅ Working | Ultra-fast (few seconds/epoch) |
| Linux (A100) | LLaMA-2-7B | QLoRA | ✅ Working | 4-bit quantization enabled |
| Linux (RTX 4090) | Mistral-7B | LoRA | ✅ Working | BF16 mixed precision |
All configuration is managed through YAML files in configs/. Each file is self-contained and includes model, PEFT method, training, and dataset parameters.
# PEFT Method Selection
peft_method: "lora"  # Options: "lora", "qlora", "ia3"

# Model Configuration
model:
  name: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # HuggingFace model ID
  trust_remote_code: true

# Method-Specific Parameters (LoRA example)
lora:
  r: 16  # LoRA rank
  lora_alpha: 32  # Scaling factor
  lora_dropout: 0.05
  target_modules:  # Which layers to adapt
    - "q_proj"
    - "v_proj"
    - "k_proj"
    - "o_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"

# Training Hyperparameters
training:
  num_train_epochs: 1
  per_device_train_batch_size: 1  # Reduce for limited memory
  per_device_eval_batch_size: 1
  gradient_accumulation_steps: 4  # Simulate larger batches
  learning_rate: 2.0e-4
  weight_decay: 0.01
  warmup_steps: 100
  logging_steps: 10
  save_steps: 100
  eval_steps: 100
  fp16: false  # macOS: false, GPU: true or false
  bf16: false  # GPU only (not supported on macOS)
  gradient_checkpointing: true
  optim: "adamw_torch"  # macOS: adamw_torch, GPU: adamw_torch or paged_adamw_32bit
  lr_scheduler_type: "cosine"
  max_grad_norm: 0.3

# Quantization (only for QLoRA)
quantization:
  load_in_4bit: true  # 4-bit quantization
  load_in_8bit: false
  bnb_4bit_compute_dtype: "float16"
  bnb_4bit_quant_type: "nf4"

# Dataset Configuration
dataset:
  name: "databricks/databricks-dolly-15k"  # HuggingFace dataset
  type: "general"  # Task type

# Logging Configuration
logging:
  use_tensorboard: true
  log_dir: "tensorboard_logs"

Simply edit the config file:

peft_method: "qlora"  # Change to use QLoRA

Or create a new config based on configs/qlora_config.yaml or configs/ia3_config.yaml.
Pre-tested and verified:
- TinyLlama-1.1B - Lightweight, fast training, good for testing (Recommended for macOS)
- LLaMA-2 (7B, 13B, 70B) - Requires quantization for consumer hardware
- Mistral-7B - Balanced model size and performance
- Qwen (7B, 14B) - Good for multilingual tasks
- Any HuggingFace Causal LM - Custom models supported via config
Recommended configurations by device:
| Device | Model | PEFT Method | Batch Size | Notes |
|---|---|---|---|---|
| macOS M1/M2/M3 | TinyLlama-1.1B | LoRA/IA3 | 1 | MPS backend, CPU+disk swap |
| macOS (16GB) | Mistral-7B | IA3 | 1 | Ultra-low memory footprint |
| RTX 3090/4090 | LLaMA-2-7B | LoRA | 8-16 | BF16 mixed precision |
| A100 (40GB) | LLaMA-2-13B | QLoRA | 4-8 | 4-bit quantization |
| A100 (80GB) | LLaMA-2-70B | QLoRA | 2-4 | 4-bit quantization |
# Update config
task: "summarization"
dataset:
  name: "cnn_dailymail"

task: "rag_reranking"
dataset:
  name: "ms_marco"  # or custom dataset

task: "tool_calling"
# Uses a synthetic dataset with common tool patterns

The framework automatically tracks:
- Memory Usage: Peak CPU and GPU memory
- Training Time: Total training duration
- Convergence: Automatic detection of convergence
- Model Size: Trainable vs total parameters
- Perplexity: Model performance metric
Results are saved in JSON format for easy analysis.
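As a hypothetical example (the file name, field names, and values below are assumptions for illustration, not the framework's documented schema), saved JSON metrics can be compared across runs like this:

```python
import json

# Pretend these strings were read from results/<run>/metrics.json files.
runs = {
    "lora": '{"perplexity": 1.335, "peak_memory_gb": 24.3}',
    "ia3":  '{"perplexity": 1.379, "peak_memory_gb": 8.7}',
}

# Parse each run's metrics and pick the run with the lowest perplexity.
parsed = {name: json.loads(blob) for name, blob in runs.items()}
best = min(parsed, key=lambda name: parsed[name]["perplexity"])
print(best)  # lora
```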
from src.trainer import PEFTTrainer
# Initialize with custom config
trainer = PEFTTrainer("path/to/config.yaml")
# Setup
trainer.setup()
# Train
trainer.train()

from src.data_loader import DatasetLoader
# Load custom dataset
train_ds, eval_ds = DatasetLoader.load_dataset_for_task(
    dataset_name="your/dataset",
    task="general",
    tokenizer=tokenizer,
    max_length=512,
)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load PEFT adapter
model = PeftModel.from_pretrained(base_model, "outputs/lora/checkpoint-100")

# Generate
inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(**inputs)

The framework follows a modular, pluggable architecture:
┌──────────────────────────────────────────────────────────────┐
│                    train_*.py (Examples)                     │
├──────────────────────────────────────────────────────────────┤
│           PEFTTrainer (Main Training Orchestrator)           │
├───────────────────┬───────────────────┬──────────────────────┤
│    ModelLoader    │    PEFTFactory    │    DatasetLoader     │
├───────────────────┼───────────────────┼──────────────────────┤
│ • Device detect   │ • LoRA config     │ • Dataset loading    │
│ • BitsAndBytes    │ • QLoRA config    │ • Preprocessing      │
│ • MPS handling    │ • IA3 config      │ • Task mapping       │
└───────────────────┴───────────────────┴──────────────────────┘
                               │
               HuggingFace Transformers + PEFT Library
- Dataclasses for type-safe configuration
- YAML loading with validation
- Automatic serialization/deserialization
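A hedged sketch of the YAML-to-dataclass pattern described above; the actual class names and field set in the framework's config module may differ:

```python
from dataclasses import dataclass, field

import yaml  # PyYAML

@dataclass
class LoraConfig:
    # Defaults mirror the sample config earlier in this README.
    r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05

@dataclass
class Config:
    peft_method: str = "lora"
    lora: LoraConfig = field(default_factory=LoraConfig)

# Parse a YAML snippet and build typed config objects from it.
raw = yaml.safe_load('peft_method: "qlora"\nlora: {r: 8}')
cfg = Config(peft_method=raw["peft_method"], lora=LoraConfig(**raw["lora"]))
print(cfg.peft_method, cfg.lora.r)  # qlora 8
```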
- Automatic device detection (CUDA, MPS, CPU)
- macOS-specific handling:
- BFloat16 → Float16 conversion
- CPU loading with MPS transfer (avoids BFloat16 device errors)
- MPS device type checking
- BitsAndBytes quantization support (4-bit, 8-bit)
- Memory footprint calculation
- LoRA configuration and setup
- QLoRA (4-bit) adapter creation
- IA3 (lightweight) adapter support
- Automatic gradient checkpointing
- K-bit training support for quantized models
- HuggingFace datasets integration
- Task-specific preprocessing:
- Instruction following (general)
- Summarization
- RAG reranking
- Tool calling
- Automatic train/eval split
- Tokenization and padding
- HuggingFace Trainer wrapper
- Automatic device-specific optimization
- Mixed precision training (when supported)
- Checkpoint management
- Metrics collection and reporting
| Aspect | macOS | GPU (CUDA) |
|---|---|---|
| Device | MPS | CUDA |
| Dtype | float16 | bfloat16 (if supported) |
| Model Loading | CPU→MPS transfer | Direct to GPU |
| Quantization | Not supported | 4-bit, 8-bit |
| Mixed Precision | Disabled (PyTorch <2.5) | FP16, BF16 |
| Optimizer | adamw_torch | adamw_torch, paged_adamw_32bit |
| Batch Size | 1 (memory constraints) | 4-16+ (depending on VRAM) |
| Method | Memory | Speed | Accuracy | Best For |
|---|---|---|---|---|
| LoRA | Medium | Fast | High | Most use cases |
| QLoRA | Low | Medium | High | Limited GPU memory |
| IA3 | Very Low | Very Fast | Good | Quick experiments |
Adds trainable low-rank matrices to transformer layers, reducing trainable parameters by up to 10,000x while maintaining performance.
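A quick back-of-the-envelope check of that reduction, using a hypothetical 4096x4096 projection layer:

```python
# A frozen d_out x d_in weight is adapted by two low-rank factors,
# B (d_out x r) and A (r x d_in), so only r * (d_in + d_out) params train.
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

full = 4096 * 4096                                  # 16,777,216 frozen weights
lora = lora_trainable_params(4096, 4096, r=16)      # 131,072 trainable weights
print(full // lora)                                 # 128x reduction for this layer
```

The 10,000x figure applies across a whole multi-billion-parameter model, where most weights receive no adapter at all.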
Combines LoRA with 4-bit quantization, enabling fine-tuning of 65B+ models on consumer GPUs.
Learns scaling vectors for keys, values, and feedforward activations with minimal parameters.
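The core IA3 idea fits in a few lines of plain Python: rather than adding weight matrices, a single learned vector rescales each adapted activation elementwise (toy illustration, not the PEFT library's implementation):

```python
# Multiply each hidden dimension by its learned scale; the scales are the
# only trainable parameters, hence the tiny footprint.
def ia3_scale(activations, scale_vector):
    return [[a * s for a, s in zip(row, scale_vector)] for row in activations]

hidden = [[1.0, 2.0, 3.0]]
scales = [1.0, 0.5, 2.0]          # initialized to all-ones before training
print(ia3_scale(hidden, scales))  # [[1.0, 1.0, 6.0]]
```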
PEFT METHODS COMPARISON
================================================================================
Metric LoRA QLoRA IA3
--------------------------------------------------------------------------------
Training Time (s) 1234.5 1567.8 890.2
Peak Memory (GB) 24.3 12.1 8.7
GPU Memory (GB) 16.2 8.4 6.1
Trainable % 0.52 0.52 0.08
Final Train Loss 0.234 0.245 0.267
Final Eval Loss 0.289 0.301 0.321
Perplexity 1.335 1.351 1.379
================================================================================
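The perplexity row follows directly from the eval losses, since perplexity is exp(cross-entropy loss):

```python
import math

def perplexity(eval_loss: float) -> float:
    # Perplexity = e^(cross-entropy loss), both measured in nats.
    return math.exp(eval_loss)

print(round(perplexity(0.289), 3))  # ~1.335, matching the LoRA row above
```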
Contributions are welcome! Please feel free to submit issues or pull requests.
MIT License - see LICENSE file for details.
Built with:
Solution: The framework handles this automatically by:
- Loading model on CPU first
- Converting BFloat16 → Float16
- Transferring to MPS device
This is handled transparently in model_loader.py.
Solution: We disable FP16 mixed precision on macOS with PyTorch 2.2.2. Use float32 training instead. Performance impact is minimal for small models.
Expected behavior: macOS MPS is slower than CUDA for training. Typical throughput:
- TinyLlama-1.1B: 10-30 seconds per training step
- Use `gradient_accumulation_steps` to simulate larger batches without memory overhead
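Gradient accumulation works because the optimizer steps once per `accum` micro-batches, so the effective batch size is the product (small illustration; `effective_batch_size` is not a framework function):

```python
# Effective batch = micro-batch size x accumulation steps x device count.
def effective_batch_size(per_device: int, accum: int, n_devices: int = 1) -> int:
    return per_device * accum * n_devices

print(effective_batch_size(1, 4))  # 4, matching the sample config above
```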
Solution: Model doesn't fit in available memory. Options:
- Use smaller model (TinyLlama recommended)
- Use IA3 instead of LoRA (significantly smaller)
- Enable quantization on GPU systems
- Reduce `per_device_train_batch_size`
- Enable gradient checkpointing: `gradient_checkpointing: true`
- Use QLoRA instead of LoRA
- Increase `gradient_accumulation_steps`
- Reduce model size or use a smaller LoRA rank (`r` parameter)
- Disable gradient checkpointing if you have sufficient memory
- Use larger batch sizes (if memory allows)
- Enable mixed precision (`bf16: true` on GPU)
- Use IA3 for lightweight experiments
- Verify you're using the correct device (check logs for "Device: cuda" or "Device: mps")
- Increase learning rate or use learning rate warmup
- Reduce LoRA dropout (`lora_dropout`)
- Increase training epochs (`num_train_epochs`)
- Verify dataset quality and preprocessing
- Try different `lora_alpha` values (typically 16-64)
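For context on `lora_alpha`: the LoRA update is scaled by `alpha / r`, so raising `lora_alpha` (or lowering `r`) strengthens the adapter relative to the frozen weights (the helper below is illustrative, not a framework API):

```python
# Effective LoRA scaling factor applied to the low-rank update B @ A.
def lora_scaling(lora_alpha: int, r: int) -> float:
    return lora_alpha / r

print(lora_scaling(32, 16))  # 2.0, the sample config's alpha=32, r=16
```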
- Training saves checkpoints to the `results/` directory
- Each checkpoint is ~2-4GB for base models + adapters
- Clean up old results: `rm -rf results/lora_*`
- Set `save_total_limit` in config to limit checkpoint storage
Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)

Check training outputs in the results/ directory:

- `checkpoint-*/adapter_config.json` - Adapter configuration
- `checkpoint-*/adapter_model.bin` - Adapter weights
- `runs/` - TensorBoard logs (if enabled)

Monitor training with TensorBoard:

tensorboard --logdir results/lora_20260118_HHMMSS/logs

Core Framework
- Modular architecture with pluggable components
- Configuration system with YAML-based method switching
- LoRA, QLoRA, and IA3 PEFT methods fully implemented
- HuggingFace Trainer integration
Device Support
- macOS (MPS): Fully tested and working
- Automatic BFloat16 → Float16 conversion
- CPU loading with MPS transfer
- No quantization support (expected limitation)
- GPU (CUDA): Verified working
- Full quantization support (4-bit, 8-bit)
- Mixed precision training (BF16, FP16)
- Multiple optimizer support
Models & Datasets
- Support for any HuggingFace Causal LM model
- Task-specific dataset preprocessing:
- General instruction following
- Summarization
- RAG reranking
- Tool calling
- Automatic train/eval splits
Tools & Utilities
- Performance metrics tracking (memory, time, convergence)
- Model comparison framework
- Inference pipeline with loaded adapters
- TensorBoard integration
- Comprehensive error handling
| Limitation | Platform | Reason | Workaround |
|---|---|---|---|
| Batch size = 1 | macOS | Memory constraints | Use smaller models or IA3 |
| No quantization | macOS | BitsAndBytes requires CUDA | Use smaller models |
| No mixed precision | macOS | Requires PyTorch >= 2.5.0 | Use float32 training |
| Slower training | macOS (CPU/MPS) | Hardware limitations | Use GPU for production |
- macOS Users: TinyLlama-1.1B with LoRA/IA3 for testing and development
- GPU Owners: LLaMA-2, Mistral with LoRA for production fine-tuning
- Limited Resources: IA3 for ultra-lightweight adaptation
- Research: Compare all three methods with `compare_methods.py`
macOS M3 (16GB unified memory)
Model: TinyLlama-1.1B
Method: LoRA (r=16)
Batch Size: 1
Dataset: Databricks Dolly 15k (13.5k samples)
Metrics:
- Model size: 2.20 GB
- Trainable parameters: 12.6M (1.13%)
- Time per step: 10-30 seconds
- Peak memory: ~8GB
- Status: ✅ Training successful
GPU (A100 - Reference)
Model: LLaMA-2-7B
Method: QLoRA
Batch Size: 4
Dataset: Same
Metrics:
- Model size: 4GB (4-bit quantized)
- Trainable parameters: 23M (0.33%)
- Time per step: 2-3 seconds
- Peak memory: 8GB
- Status: ✅ Optimized for production