Rohan Jain — MS Machine Learning, University of Maryland
MS ML student at UMD with a background in data science and analytics, focused on applied ML systems and NLP. This project explores a complete model compression pipeline — knowledge distillation, INT8 quantization, and structured pruning — applied to a domain-specific financial NLP model. The goal: get a production-viable model that actually fits on a CPU, not just one that scores well in a notebook.
| 🐙 GitHub | github.com/Rohanjain2312 |
| 🤗 HuggingFace | huggingface.co/rohanjain2312 |
| 💼 LinkedIn | linkedin.com/in/jaroh23 |
| 📧 Email | jaroh23@umd.edu |
A systematic compression study on FinBERT (ProsusAI/finbert — BERT-base pre-trained on 4.9B tokens of financial text) for 3-class financial sentiment classification. Five compression techniques are implemented from scratch, each trained end-to-end on Google Colab and benchmarked on identical CPU hardware.
| Try it | What you get |
|---|---|
| 🤗 HF Spaces — no setup required | Teacher vs. student side-by-side, live benchmark table |
| 📓 Colab notebook — fully executed | Full pipeline with all cell outputs — no GPU needed to read results |
Training hardware: Google Colab T4 GPU. Benchmarking: CPU (median latency over 500 runs, 50 warmup).
| Model | Params | Size | Val Macro F1 | vs Teacher |
|---|---|---|---|---|
| Teacher (FinBERT fine-tuned) | 109M | 437.9 MB | 0.8876 | baseline |
| Student — Vanilla KD | 19M | 76.1 MB | 0.8017 | 5.8× smaller · −8.6 F1 pts |
| Student — Intermediate KD | 19M | 76.1 MB | 0.7712 | 5.8× smaller · −11.6 F1 pts |
| Student — PTQ (INT8) | 12M | 47.7 MB | 0.7712 | 9.1× smaller · same F1 as FP32 |
| Student — QAT (INT8) | 12M | 47.7 MB | 0.7601 | 9.1× smaller |
| Pruned Teacher 30% | 109M | 437.9 MB | 0.8966 | ↑ beats teacher by +0.9 pts |
| Pruned Teacher 50% | 109M | 437.9 MB | 0.8936 | ↑ beats teacher by +0.6 pts |
Surprising finding: Removing 30–50% of attention heads improved accuracy. High-entropy heads (near-uniform attention distributions) add noise rather than signal — pruning them acts as structured regularisation. Full latency and throughput numbers are in the executed notebook.
```
                  FinBERT Teacher
                 (12 layers, 768d)
                 ProsusAI/finbert
                 Fine-tune on GPU
                         │
        ┌────────────────┼────────────────┐
        │                │                │
    Knowledge      Quantization      Structured
  Distillation                         Pruning
        │                │                │
  ┌─────┴─────┐     ┌────┴────┐      ┌────┴────┐
  │           │     │         │      │         │
Vanilla  Intermed. PTQ       QAT    30%       50%
  KD         KD  (INT8)    (INT8) pruned    pruned
(4L/384d) (4L/384d)
        │                │                │
        └────────────────┴────────────────┘
                         │
              CPU Benchmark (7 variants)
      Median latency · Macro F1 · Throughput
```
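The 19M-parameter students in the results above are 4-layer / 384-dim BERT-style encoders written in plain torch.nn (see distillation/student_architecture.py). Below is a minimal sketch of that shape — nn.MultiheadAttention stands in for the attention math for brevity, and the 6 heads, 1536-d FFN, and BERT-base vocabulary size are assumptions rather than the repository's exact configuration; they land close to the reported 19M parameters.

```python
import torch
import torch.nn as nn

class StudentEncoderLayer(nn.Module):
    """One transformer block: multi-head self-attention + GELU FFN, LayerNorm residuals."""
    def __init__(self, d_model=384, n_heads=6, d_ff=1536, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        a, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.ln1(x + self.drop(a))
        x = self.ln2(x + self.drop(self.ffn(x)))
        return x

class StudentModel(nn.Module):
    """4-layer / 384-dim BERT-style student for 3-class financial sentiment."""
    def __init__(self, vocab_size=30522, d_model=384, n_layers=4, max_len=512, n_classes=3):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)      # token embeddings
        self.pos = nn.Embedding(max_len, d_model)          # positional embeddings
        self.seg = nn.Embedding(2, d_model)                 # segment embeddings
        self.emb_ln = nn.LayerNorm(d_model)
        self.layers = nn.ModuleList([StudentEncoderLayer(d_model) for _ in range(n_layers)])
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        pos_ids = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
        seg_ids = token_type_ids if token_type_ids is not None else torch.zeros_like(input_ids)
        x = self.emb_ln(self.tok(input_ids) + self.pos(pos_ids) + self.seg(seg_ids))
        pad_mask = attention_mask == 0                      # True where padded
        for layer in self.layers:
            x = layer(x, pad_mask)
        return self.classifier(x[:, 0])                     # [CLS]-position pooling
```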
| Concept | Implementation |
|---|---|
| Knowledge distillation | Soft-label KL divergence with temperature scaling (T=4) and T² gradient correction; layer-mapped hidden-state MSE + attention-pattern MSE for intermediate KD |
| INT8 dynamic quantization | torch.quantization.quantize_dynamic (fbgemm backend); QAT with fake-quant ops and straight-through estimator (STE) for gradient-through-discrete-ops |
| Structured attention pruning | Per-head entropy scoring over validation set; iterative prune + fine-tune recovery (5 rounds × 3 epochs); importance metric derived from attention distribution uniformity |
| Custom transformer from scratch | 4-layer BERT-style encoder in pure torch.nn — multi-head self-attention, positional + segment + token embeddings, GELU FFN, LayerNorm residuals |
| Benchmarking protocol | Median-over-500 CPU latency (not mean — right-skewed distribution from GC pauses); 50 warmup runs; throughput measured at batch=32 |
| End-to-end reproducibility | Single Colab notebook trains all 7 variants in sequence; checkpoint_info.json per model captures hyperparameters + metrics + timestamp |
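The latency half of the benchmarking protocol reduces to a few lines. This is a sketch of the approach rather than the exact code in evaluation/benchmark.py, and the (input_ids, attention_mask) call signature is an assumption:

```python
import time
import statistics
import torch

@torch.no_grad()
def cpu_latency_ms(model, input_ids, attention_mask, runs=500, warmup=50):
    """Median single-batch CPU latency in ms (median, not mean: GC pauses skew right)."""
    model.eval()
    for _ in range(warmup):                              # warm caches/allocator before timing
        model(input_ids, attention_mask)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model(input_ids, attention_mask)
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(times)
```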
| Technique | Method | Reference |
|---|---|---|
| Vanilla KD | Soft-label KL loss + CE loss, temperature scaling | Hinton et al., 2015 |
| Intermediate KD | Hidden-state MSE + attention-pattern MSE at mapped layers | TinyBERT, Jiao et al., 2020 |
| INT8 PTQ | Post-training dynamic quantization (fbgemm) | PyTorch Quantization |
| INT8 QAT | Fake-quant + straight-through estimator fine-tuning | PyTorch QAT Guide |
| Structured Pruning | Entropy-based head importance, iterative prune + recover | Michel et al., 2019 |
Open the notebook above — it already contains all cell outputs so you can read every result, plot, and benchmark table without running anything. To re-run from scratch, connect to a T4 GPU runtime. The notebook runs the complete pipeline top-to-bottom: dataset prep → teacher training → distillation → quantization → pruning → benchmarking → plots. Runtime: ~3–4 hours total.
```bash
git clone https://github.com/Rohanjain2312/FinCompress.git
cd FinCompress
pip install -r requirements.txt

# Download checkpoints from Google Drive into fincompress/checkpoints/
python -m fincompress.evaluation.benchmark
```

```bash
python -m fincompress.data.prepare_dataset
# → fincompress/data/train.csv, val.csv, test.csv
```

- Soft labels encode class uncertainty that hard one-hot labels discard — the teacher's [0.05, 0.72, 0.23] output teaches far more than the label "neutral"
- T² scaling is non-optional: high temperature flattens the soft distribution, reducing gradient magnitude of the KL term; multiplying by T² restores it — without this, CE loss dominates and you lose most of the distillation signal
- Layer mapping strategy matters: the evenly-spaced pairing `{0→2, 1→5, 2→8, 3→11}` forces the student to mimic the full representational hierarchy (syntactic early layers, semantic middle, task-specific final), not just the output layers — see the loss sketch after this list
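A condensed sketch of the two distillation losses described above. The mixing weight `alpha` and the `proj` layer (a learned Linear that lifts 384-d student states to the teacher's 768-d width) are illustrative assumptions, not the repository's exact hyperparameters:

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-label KD (Hinton et al.): T^2-scaled KL on softened logits + hard-label CE."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # T^2 restores the gradient magnitude shrunk by softening
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def intermediate_kd_loss(student_hidden, teacher_hidden, proj, layer_map=None):
    """Hidden-state MSE at evenly spaced teacher layers (TinyBERT-style)."""
    layer_map = layer_map or {0: 2, 1: 5, 2: 8, 3: 11}   # student layer -> teacher layer
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        # proj: learned nn.Linear(384, 768) so student states match the teacher width
        loss = loss + F.mse_loss(proj(student_hidden[s_idx]), teacher_hidden[t_idx])
    return loss / len(layer_map)
```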
- PTQ is nearly free — apply it as the first compression step before committing to QAT's training cost
- The first and last layers are sensitivity cliffs: quantizing the embedding layer and final classifier causes disproportionate F1 loss; `quantize_dynamic` wisely excludes them by default (see the sketch after this list)
- STE is elegant: the straight-through estimator propagates gradients through the rounding operation as if it were identity during backprop — a simple trick that makes an otherwise non-differentiable step trainable
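The PTQ step really is only a handful of lines. A minimal sketch, assuming `student` is the trained FP32 student already on CPU (the output path is made up):

```python
import torch
from torch.quantization import quantize_dynamic

torch.backends.quantized.engine = "fbgemm"   # x86 server backend

# Only nn.Linear weight matrices are converted to INT8; embedding tables stay FP32
student_int8 = quantize_dynamic(student.eval().cpu(), {torch.nn.Linear}, dtype=torch.qint8)

torch.save(student_int8.state_dict(), "checkpoints/student_ptq/model_int8.pt")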
- 30–50% of FinBERT's attention heads are redundant for 3-class financial sentiment — entropy scoring identifies them cheaply, without computing saliency gradients (see the scoring sketch after this list)
- The regularisation effect is real: removing redundant heads reduced overfitting enough to improve val F1 by +0.9 pts — the original teacher was slightly over-parameterised for this task
- The accuracy cliff is sharp: there is a head-count threshold beyond which each additional pruning round causes rapid F1 collapse; iterative pruning with recovery rounds defers this cliff significantly
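A sketch of the entropy scoring, assuming a Hugging Face-style model that returns per-layer attention maps; function and argument names are illustrative (the real loop lives in pruning/structured_pruning.py), and padding tokens are ignored here for brevity:

```python
import torch

@torch.no_grad()
def head_entropy_scores(model, val_loader, n_layers=12, n_heads=12, device="cpu"):
    """Mean attention entropy per head over the validation set.
    Near-uniform (high-entropy) heads carry little signal and are pruned first."""
    totals = torch.zeros(n_layers, n_heads)
    n_batches = 0
    model.eval()
    for batch in val_loader:
        out = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            output_attentions=True,
        )
        # out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer
        for l, attn in enumerate(out.attentions):
            ent = -(attn * (attn + 1e-12).log()).sum(-1)   # entropy of each attention row
            totals[l] += ent.mean(dim=(0, 2)).cpu()        # average over batch and query positions
        n_batches += 1
    return totals / n_batches   # higher = closer to uniform = better pruning candidate
```

Heads with the highest mean entropy are masked first; each pruning pass is followed by a short fine-tuning recovery round before scoring again.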
```
fincompress/
├── data/
│   └── prepare_dataset.py            Download + merge + split (CPU)
├── teacher/
│   └── train_teacher.py              Fine-tune FinBERT teacher (Colab/GPU)
├── distillation/
│   ├── student_architecture.py       4-layer custom transformer from scratch
│   ├── soft_label_distillation.py    Vanilla KD training loop (Colab/GPU)
│   └── intermediate_distillation.py  Hidden + attention KD (Colab/GPU)
├── quantization/
│   ├── ptq.py                        Post-training INT8 quantization (CPU)
│   └── qat.py                        Quantization-aware training (Colab/GPU)
├── pruning/
│   ├── structured_pruning.py         Entropy-based head scoring + pruning
│   └── prune_finetune.py             Iterative prune + recover loop (Colab/GPU)
├── evaluation/
│   └── benchmark.py                  Master benchmark — 7 variants, CPU
├── checkpoints/                      Gitignored (large binaries on Drive)
│   └── */checkpoint_info.json        ✅ Committed — hyperparams + metrics
├── results/                          benchmark_results.csv/json committed here
└── logs/                             Training CSVs (gitignored)

notebooks/
└── fincompress_complete.ipynb        Single notebook — full pipeline on Colab

hf_space/
└── app.py                            Gradio 6 demo → HF Spaces
```
| Task | Hardware | Est. Time |
|---|---|---|
| Dataset prep | Any CPU | ~2 min |
| Teacher fine-tuning | GPU ≥ 8 GB VRAM | ~30 min on T4 |
| Vanilla KD | GPU ≥ 8 GB VRAM | ~30 min on T4 |
| Intermediate KD | GPU ≥ 8 GB VRAM | ~45 min on T4 |
| QAT | GPU ≥ 8 GB VRAM | ~15 min on T4 |
| Pruning (both variants) | GPU ≥ 8 GB VRAM | ~60 min on T4 |
| PTQ + Benchmarking | CPU (x86) | ~15 min |
Google Colab free tier (T4) is sufficient for all GPU tasks.
- Hinton, G., Vinyals, O., Dean, J. (2015). Distilling the Knowledge in a Neural Network
- Jiao, X. et al. (2020). TinyBERT: Distilling BERT for Natural Language Understanding
- Michel, P. et al. (2019). Are Sixteen Heads Really Better than One? NeurIPS 2019
- Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models
- PyTorch Team. Quantization — PyTorch Docs
The compression pipeline architecture, distillation loss formulations, student architecture design, benchmarking protocol, and all key engineering decisions were designed and authored by Rohan Jain. Claude Code was used as an implementation accelerator for boilerplate, file scaffolding, and debugging — similar to how a senior engineer uses Copilot while retaining full design ownership.
MIT License — see LICENSE for details.
Built by Rohan Jain — UMD MSML, Spring 2026