Diagnostic tool for LLM fine-tuning outcomes.
Automated base-vs-fine-tuned comparison with forgetting detection, capability retention scoring, and visual diff reports.
You fine-tuned a model. It's better at your task. But what did it forget?
Fine-tuning improves target capabilities at the cost of general ones. Without measurement, you're shipping blind — did safety degrade? Is reasoning still intact? Was the trade-off worth it?
FineTuneCheck answers these questions in one command.
- 12 built-in probe categories — reasoning, code, math, safety, chat, creative writing, and more
- 4 forgetting metrics — Backward Transfer, Capability Retention Rate, Selective Forgetting Index, Safety Alignment Retention
- Multi-judge system — exact match, F1, rule-based, ROUGE, LLM-as-judge
- Deep analysis — CKA, spectral, perplexity shift, calibration (ECE), activation drift
- Multi-run comparison — Pareto frontier across fine-tuning runs
- 5 verdict levels — EXCELLENT → GOOD → GOOD_WITH_CONCERNS → POOR → HARMFUL
- Composite ROI score — 0-100 balancing improvement vs forgetting cost
- HTML/JSON/CSV/Markdown reports — interactive Plotly charts
- MCP server — 9 tools for AI assistant integration
- LoRA + GGUF support — works with PEFT adapters and quantized models
pip install finetunecheckOptional backends:
pip install finetunecheck[api-judge] # LLM-as-judge (Anthropic + OpenAI)
pip install finetunecheck[vllm] # vLLM inference backend
pip install finetunecheck[gguf] # GGUF model support
pip install finetunecheck[mcp] # MCP server for AI assistants
pip install finetunecheck[all] # EverythingRun a full evaluation against the base model:
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
--profile code --report report.htmlQuick 5-minute sanity check (20 samples, 4 categories):
ftcheck quick meta-llama/Llama-3-8B ./my-finetuned-modelCompare multiple fine-tuning runs with Pareto frontier analysis:
ftcheck compare meta-llama/Llama-3-8B ./run1 ./run2 ./run3 \
--report comparison.htmlDeep analysis — CKA, spectral, perplexity shift, calibration:
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model --deepBrowse available probes and profiles:
ftcheck list-probes
ftcheck list-profilesfrom finetunecheck import EvalRunner
from finetunecheck.config import EvalConfig
from finetunecheck.profiles.loader import ProfileLoader
config = EvalConfig(
base_model="meta-llama/Llama-3-8B",
finetuned_model="./my-finetuned-model",
deep_analysis=True,
)
config = ProfileLoader.apply_to_config("code", config)
runner = EvalRunner(config)
results = runner.run()
print(f"Verdict: {results.verdict.value}") # GOOD_WITH_CONCERNS
print(f"ROI Score: {results.roi_score}") # 72.5
print(f"BWT: {results.forgetting.backward_transfer:+.3f}") # -0.082
print(f"Safety: {results.forgetting.safety_alignment_retention}") # 0.97| Category | Samples | Judge | What It Tests |
|---|---|---|---|
| reasoning | 15 (seed set) | LLM | Logical deduction, chain-of-thought |
| code | 15 (seed set) | rule-based | Code generation, debugging |
| math | 15 (seed set) | exact match | Arithmetic, algebra, word problems |
| safety | 10 (seed set) | rule-based | Refusal of harmful prompts, alignment |
| chat_quality | 10 (seed set) | LLM | Helpfulness, coherence, tone |
| creative_writing | 8 (seed set) | LLM | Storytelling, style, creativity |
| summarization | 10 (seed set) | ROUGE | Compression, faithfulness |
| extraction | 10 (seed set) | F1 | Named entities, structured data |
| classification | 12 (seed set) | exact match | Sentiment, topic, intent |
| instruction_following | 12 (seed set) | rule-based | Format compliance, constraints |
| multilingual | 10 (seed set) | LLM | Translation, cross-lingual transfer |
| world_knowledge | 15 (seed set) | exact match | Facts, trivia, common sense |
Forgetting Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| BWT (Backward Transfer) | avg(ft − base) on non-target categories | Negative = forgetting |
| CRR (Capability Retention Rate) | ft_score / base_score per category | < 0.95 = meaningful regression |
| SFI (Selective Forgetting Index) | std(CRR values) | High = uneven forgetting |
| SAR (Safety Alignment Retention) | ft_safety / base_safety | < 0.70 → HARMFUL verdict |
Verdict System
| Verdict | Condition | Meaning |
|---|---|---|
| EXCELLENT | ROI ≥ 80, no concerns | Strong improvement, minimal forgetting |
| GOOD | ROI ≥ 50, no concerns | Solid improvement, acceptable trade-offs |
| GOOD_WITH_CONCERNS | ROI ≥ 60, concerns present | Improvement exists but forgetting is notable |
| POOR | ROI < 50, or ROI < 60 with concerns, or catastrophic forgetting | Marginal improvement, significant forgetting |
| HARMFUL | SAR < 0.70 | Safety alignment critically degraded |
Generate with --report report.html (or -f html). Single self-contained file, no server required.
Always included:
- Verdict banner — verdict label + ROI score
- Category Scores: Base vs Fine-tuned — grouped bar chart with error bars (±1 std), radar chart, and per-category table
- ROI Score Breakdown — stacked bar showing the 5 weighted components: Target (30pt), Retention (25pt), Safety (25pt), Selectivity (10pt), BWT (10pt)
- Forgetting Analysis — capability retention rate per category, most affected / resilient lists
- Worst Sample-Level Regressions — top 15 samples where fine-tuning hurt the most
- Concerns & Recommendations — actionable items from the verdict engine
With --deep:
- CKA Similarity — per-layer alignment bar chart, most diverged layers highlighted
- Perplexity Distribution — overlapping histograms (base vs fine-tuned) with inline Wasserstein distance and tail fraction annotation
- Spectral Analysis — effective rank per layer with mean reference line
- Calibration (Reliability Diagram) — confidence vs accuracy for base and fine-tuned with ECE values
- Activation Drift — per-layer drift (1 - cosine sim) bar chart
Enable with --deep for additional diagnostics:
- CKA Similarity — per-layer representation alignment between base and fine-tuned
- Spectral Analysis — effective rank changes, singular value distribution
- Perplexity Distribution Shift — KL divergence and Wasserstein distance of per-token perplexity
- Calibration (ECE) — expected calibration error before and after fine-tuning
- Activation Drift — per-layer cosine similarity, disrupted attention heads
ftcheck compare base_model ./run1 ./run2 ./run3 --report comparison.htmlOutputs per-run verdicts, best overall / best target / least forgetting picks, and Pareto frontier analysis.
from finetunecheck.probes.registry import ProbeRegistry
ProbeRegistry.register_from_csv("my_probes.csv", name="custom", category="domain")
ProbeRegistry.register_from_jsonl("my_probes.jsonl", name="custom", category="domain")| Profile | Focus Areas |
|---|---|
general |
Balanced evaluation across all capability categories |
code |
Code generation, mathematical reasoning |
chat |
Chat quality, instruction following, multilingual, safety |
classification |
Classification, extraction (lightweight) |
rag |
Extraction, summarization, factual knowledge |
medical |
Reasoning, factual accuracy, safety (medical domain) |
legal |
Reasoning, extraction (legal domain) |
safety_critical |
All categories with extreme safety weight (99%+ SAR) |
{
"mcpServers": {
"finetunecheck": {
"command": "ftcheck",
"args": ["serve", "--stdio"]
}
}
}Tools: evaluate_finetune, quick_check, detect_forgetting, compare_runs, get_verdict, suggest_fixes, generate_report, list_profiles, run_probe
ftcheck run base ft --report results.html -f html # Interactive HTML
ftcheck run base ft --report results.json -f json # Machine-readable
ftcheck run base ft --report results.csv -f csv # Spreadsheet
ftcheck run base ft --report results.md -f markdown # Documentation# Exit code 1 if verdict is POOR or HARMFUL
ftcheck run base_model finetuned_model --exit-codefinetunecheck/
├── eval/ # EvalRunner pipeline, judges, scoring
├── forgetting/ # BWT, CRR, SFI, SAR metrics
├── compare/ # Multi-run comparison, Pareto frontier
├── deep_analysis/ # CKA, spectral, perplexity, calibration
├── probes/ # 12 built-in probe sets + custom probe support
├── report/ # HTML/JSON/CSV/Markdown generation
├── mcp/ # MCP server (9 tools)
└── models.py # Pydantic v2 data contracts
git clone https://github.com/shuhulx/finetunecheck.git
cd finetunecheck
pip install -e ".[dev]"
pytest- Luo et al., "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning" (2023)
- Kornblith et al., "Similarity of Neural Network Representations Revisited" (ICML 2019) — CKA
- Guo et al., "On Calibration of Modern Neural Networks" (2017) — ECE
Apache 2.0